Regular Expressions in JavaScript.

...


Contents :

...


What is a "Regular Expression"? ...

A Regular Expression is a pattern that describes the text you are looking for. It lets you find every word or phrase in a sentence that follows the same shape, such as all the words that start with the same letter or have the same ending.

For example, if you want to find all the words that rhyme with "cat" within a sentence, you can use a Regular Expression to do that :

const sentence                  = "The cat in the hat sat on the mat." ; 

const regular_expression        = /.at/g ; 

const matches                   = sentence.match ( regular_expression ) ; 

window.alert ( JSON.stringify ( matches ) ) ; 

This code will output :

["cat","hat","sat","mat"]

In the example above,

/.at/g

... is called a 'literal' Regular Expression.

...


Components ...

Below is a list of some of the most common "building-blocks" that make up a Regular Expression ...

...


Delimiters ...

A delimiter is like a special character that tells the computer when a group of words or code starts and when it ends.

In the example above, we used a 'forward-slash' character to mark the beginning and end of a 'literal' Regular Expression.

/.at/g
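For comparison, here is a minimal sketch (not part of the original example) of the same pattern written without delimiters, using the standard RegExp constructor. The variable names are illustrative, and console.log is used in place of window.alert so the sketch also runs outside a browser.

```javascript
const sentence = "The cat in the hat sat on the mat.";

// Literal form: the pattern sits between two forward-slash delimiters.
const literalPattern = /.at/g;

// Constructor form: the pattern is an ordinary string, so no delimiters are needed.
// The flags are passed as a second string argument.
const constructedPattern = new RegExp(".at", "g");

const literalMatches = sentence.match(literalPattern);
const constructedMatches = sentence.match(constructedPattern);

console.log(literalMatches);     // [ 'cat', 'hat', 'sat', 'mat' ]
console.log(constructedMatches); // [ 'cat', 'hat', 'sat', 'mat' ]
```

Both forms produce the same Regular Expression; the literal form is usually easier to read when the pattern is known in advance.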

...


Wildcards ...

The wildcard symbol, indicated by a 'dot', stands for "any character, except a 'new-line character'" (unless flags are set to 'dotAll') :

/.at/g
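As a small illustration, the sketch below runs the same wildcard pattern over a two-line string, with and without the 'dotAll' flag. The sample string and variable names are made up for this example.

```javascript
// "cat", then a newline, then "at" on the second line.
const twoLines = "cat\nat";

// Without 'dotAll', the dot refuses to match the newline in front of "at".
const withoutDotAll = twoLines.match(/.at/g);

// With 'dotAll' (s), the newline counts as "any character".
const withDotAll = twoLines.match(/.at/gs);

console.log(withoutDotAll.length); // 1
console.log(withDotAll.length);    // 2
```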

...


Literal characters ...

Literal characters represent themselves, and are 'case-sensitive' unless flags are set to 'ignoreCase' :

/.at/g
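The sketch below shows case-sensitivity in practice, using a made-up sample string: the same 'literal characters' are matched first with the 'global' flag alone, and then with 'ignoreCase' added.

```javascript
const text = "Cat cat CAT";

// Case-sensitive by default: only the all-lowercase word matches.
const caseSensitive = text.match(/cat/g);

// With 'ignoreCase' (i), all three spellings match.
const caseInsensitive = text.match(/cat/gi);

console.log(caseSensitive);   // [ 'cat' ]
console.log(caseInsensitive); // [ 'Cat', 'cat', 'CAT' ]
```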

...


Flags ...

Flags can be combined to change the way the computer reads and processes Regular Expressions. In the first example above, a flag was attached to a 'literal' Regular Expression by placing a single letter immediately after the last delimiter :

/.at/g

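This article covers 7 flags, listed below with descriptions of what they do (recent JavaScript engines also support a 'v' flag, 'unicodeSets', which is not covered here) :

g ... 'global' ... Searches for all matches, rather than stopping after the first.
i ... 'ignoreCase' ... Uppercase and lowercase letters are treated as the same.
s ... 'dotAll' ... Wildcards match every character, including 'newline characters'.
m ... 'multiline' ... Start and end 'boundary matchers' apply to each line, not only to the whole string.
u ... 'unicode' ... The pattern and the text are processed as Unicode code points, which enables \u{hhh} escapes and 'unicode property escapes'.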

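y ... 'sticky' ... regular_expression.exec(...) only tries to match at the exact position stored in regular_expression.lastIndex, and updates regular_expression.lastIndex after each successful match :
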
String.prototype.matches            = function ( _regular_expression ) { 
  const   string_of_text_characters =  this                  ; 
  const   regular_expression        = _regular_expression    ;
  const   matches                   =  [ ]                   ; 
  let     match                     =  null                  ; 
  let     current_index             =  0                     ; 
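  // start from the beginning, in case the Regular Expression was used before ... 
  regular_expression.lastIndex      = ( current_index        ) ; 
  // 'sticky' : exec(...) only tests the position stored in lastIndex ... 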
  while ( current_index < string_of_text_characters.length ) { 
    if  ( match                     = regular_expression.exec ( string_of_text_characters ) ) { 
      matches.push ( { 
        found                       : ( match [ 0 ]                      ) , 
        index      : { 
          start                     : (                    current_index ) , 
          end                       : ( regular_expression.lastIndex     ) , 
          } 
        } ) ; 
      current_index                 = ( regular_expression.lastIndex     ) ; 
      } 
    else { 
      regular_expression.lastIndex  = ( ++current_index                  ) ; 
      } 
    } 
  // ... 
  if ( matches.length>0 ) return matches ; 
  else                    return null    ; 
  // returns null if no matches were found ... 
  } ;

// ... 

const sentence                      = 'The cat in the hat sat on the mat.'    ; 
const regular_expression            = /.at/y                                  ; 
const output                        = sentence.matches ( regular_expression ) ; 

window.alert ( JSON.stringify ( output ) ) ; 

/* 
output = [ 
  { "found":"cat" , "index":{"start": 4,"end": 7} } , 
  { "found":"hat" , "index":{"start":15,"end":18} } , 
  { "found":"sat" , "index":{"start":19,"end":22} } , 
  { "found":"mat" , "index":{"start":30,"end":33} } , 
  ] ; 
*/  
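
d ... 'hasIndices' ... each match returned by regular_expression.exec(...) also carries an "indices" array, holding the [start, end] index pair of the whole match and of every capturing group :
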
String.prototype.matches            = function ( _regular_expression ) { 
  const   string_of_text_characters =  this                  ; 
  const   regular_expression        = _regular_expression    ; 
  const   matches                   =  [ ]                   ; 
  let     match                     =  null                  ; 
  let     current_index             =  0                     ; 
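  // start from the beginning, in case the Regular Expression was used before ... 
  regular_expression.lastIndex      = ( current_index        ) ; 
  // 'sticky' : exec(...) only tests the position stored in lastIndex ... 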
  while ( current_index < string_of_text_characters.length ) { 
    if  ( match                     = regular_expression.exec ( string_of_text_characters ) ) { 
      matches.push ( match ) ; 
      current_index                 = ( regular_expression.lastIndex     ) ; 
      } 
    else { 
      regular_expression.lastIndex  = ( ++current_index                  ) ; 
      } 
    } 
  // ... 
  if ( matches.length>0 ) return matches ; 
  else                    return null    ; 
  // returns null if no matches were found ... 
  } ;

// ... 

const sentence                      = 'The cat in the hat sat on the mat.'    ; 
const regular_expression            = /.at/yd                                 ; 
const output                        = sentence.matches ( regular_expression ) ; 

console.clear ( ) ; 
console.log   ( output ) ; 

/* 
output = [ 
  [ 0:"cat" , "indices":[ [ 4, 7] ] , ... ] , // output[0] ... 
  [ 0:"hat" , "indices":[ [15,18] ] , ... ] , // output[1] ... 
  [ 0:"sat" , "indices":[ [19,22] ] , ... ] , // output[2] ... 
  [ 0:"mat" , "indices":[ [30,33] ] , ... ] , // output[3] ... 
  ] 
*/ 
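As an alternative sketch, the standard string.matchAll(...) method can collect similar results without adding a method to String.prototype. It requires the 'global' flag rather than 'sticky'; the results array and its property names below are illustrative, not part of the original examples.

```javascript
const sentence = "The cat in the hat sat on the mat.";

// matchAll(...) requires the 'global' flag; 'hasIndices' adds match.indices.
const pattern = /.at/gd;

const results = [];
for (const match of sentence.matchAll(pattern)) {
  // indices[0] is the [start, end] pair of the whole match.
  const [start, end] = match.indices[0];
  results.push({ found: match[0], start, end });
}

console.log(results);
// [ { found: 'cat', start: 4, end: 7 }, { found: 'hat', start: 15, end: 18 },
//   { found: 'sat', start: 19, end: 22 }, { found: 'mat', start: 30, end: 33 } ]
```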

...


Disjunctions ...

The Regular Expression in the very first example above contained just one "target pattern" :

/.at/

The disjunction symbol (also called the 'alternation operator'), indicated by a vertical-line or 'pipe character', is "used to separate two target patterns", either of which may match :

/target_pattern_1|target_pattern_2/
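A short sketch of a disjunction in use, reusing the sentence from the first example. It also shows that the 'pipe character' splits the whole pattern unless round-brackets limit its reach; the variable names are illustrative.

```javascript
const sentence = "The cat in the hat sat on the mat.";

// Either "cat" or "mat" is accepted.
const catOrMat = sentence.match(/cat|mat/g);

// The pipe splits the WHOLE pattern: this means "c" or "mat", not "cat or mat".
const cOrMat = sentence.match(/c|mat/g);

// Round-brackets limit the alternatives to the first letter only.
const grouped = sentence.match(/(c|m)at/g);

console.log(catOrMat); // [ 'cat', 'mat' ]
console.log(cOrMat);   // [ 'c', 'mat' ]
console.log(grouped);  // [ 'cat', 'mat' ]
```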

...


Quantifiers ...

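A quantifier indicates how many times the item immediately before it should be repeated. That item is a single character, character class, or character set, unless a longer "target pattern" is wrapped in 'round-brackets'.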

There are several types of quantifier in JavaScript :

/target_pattern*/
/* ... * matches "zero or more" occurrences of the target_pattern ... */

/target_pattern?/
/* ... ? matches "zero or one" occurrence of the target_pattern ... */

/target_pattern+/
/* ... + matches "one or more" occurrences of the target_pattern ... */

/target_pattern{X}/
/* ... {X} matches exactly X-number of occurrences of the target_pattern ... */

/target_pattern{X,}/
/* ... {X,} matches X-number or more occurrences of the target_pattern ... */

/target_pattern{X,Y}/
/* ... {X,Y} matches from X-number to Y-number of occurrences of the target_pattern ... */
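The sketch below tries each quantifier on a made-up string of "c...t" words with a varying number of "a" characters. In every pattern the quantifier applies only to the "a" directly before it; the last line uses round-brackets to repeat a longer pattern.

```javascript
const text = "ct cat caat caaat";

const zeroOrMore = text.match(/ca*t/g);     // [ 'ct', 'cat', 'caat', 'caaat' ]
const zeroOrOne  = text.match(/ca?t/g);     // [ 'ct', 'cat' ]
const oneOrMore  = text.match(/ca+t/g);     // [ 'cat', 'caat', 'caaat' ]
const exactlyTwo = text.match(/ca{2}t/g);   // [ 'caat' ]
const twoOrMore  = text.match(/ca{2,}t/g);  // [ 'caat', 'caaat' ]
const oneToTwo   = text.match(/ca{1,2}t/g); // [ 'cat', 'caat' ]

// Round-brackets make the quantifier repeat the whole group "ca".
const groupTwice = "cacat".match(/(ca){2}t/)[0]; // 'cacat'

console.log(zeroOrMore, zeroOrOne, oneOrMore, exactlyTwo, twoOrMore, oneToTwo, groupTwice);
```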

Lazy quantifiers are used to match the shortest "slice of text" that satisfies a "target pattern" :

/target_pattern*?/
/* ... *? matches "zero or more" occurrences of the target_pattern, but as few times as possible ... */

/target_pattern??/
/* ... ?? matches "zero or one" occurrence of the target_pattern, but as few times as possible ... */

/target_pattern+?/
/* ... +? matches "one or more" occurrences of the target_pattern, but as few times as possible ... */

/target_pattern{X,Y}?/
/* ... {X,Y}? matches from X-number to Y-number of occurrences of the target_pattern, but as few times as possible ... */
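To make the difference concrete, the sketch below compares a greedy quantifier with its lazy form on a made-up string containing several angle-bracket tags.

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ grabs as much as possible, so the match runs from the
// first "<" all the way to the LAST ">".
const greedy = html.match(/<.+>/)[0];

// Lazy: .+? stops at the FIRST ">" it can reach, so each tag is matched separately.
const lazy = html.match(/<.+?>/g);

console.log(greedy); // '<b>bold</b> and <i>italic</i>'
console.log(lazy);   // [ '<b>', '</b>', '<i>', '</i>' ]
```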

...


Character escapes ...

Character escapes are used to represent specific text characters, like 'newline' and 'tab'.

There are several character escapes used in JavaScript :

\f ... 'formfeed'.
\n ... 'newline'.
\r ... 'carriage return'.
\t ... 'horizontal tab'.
\v ... 'vertical tab'.
\0 ... 'null character'.
\xhh ... 2-digit hexadecimal ASCII character.
\uhhhh ... 4-digit hexadecimal Unicode character.
\u{hhh} ... (1-6)-digit hexadecimal Unicode character (requires the 'unicode' flag).
\cA-Z ... control character *** ...

Characters that have a special meaning in a Regular Expression must be escaped with a 'backslash' to be matched as 'literal characters' :
\^  ,  \$  ,  \\  ,  \.  ,  \*  ,  \+  ,  \?  ,  \(  ,  \)  ,  \[  ,  \]  ,  \{  ,  \}  ,  \|  ,  \/
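A brief sketch of a few of these escapes in use. The sample strings and variable names are made up for illustration.

```javascript
// \t splits a tab-separated row into cells.
const cells = "name\tprice".split(/\t/);            // [ 'name', 'price' ]

// An unescaped dot is a wildcard, so it also accepts "x" ...
const dotAsWildcard = /3.50/.test("3x50");          // true

// ... while an escaped dot only accepts a real "." character.
const dotAsLiteral = /3\.50/.test("3x50");          // false
const realPrice    = /3\.50/.test("Total: 3.50");   // true

// Hexadecimal and Unicode escapes for the letter "A".
const hexEscape     = /\x41/.test("A");             // true
const unicodeEscape = /\u0041/.test("A");           // true

// \u{...} needs the 'unicode' flag.
const codePointEscape = /\u{1F600}/u.test("\u{1F600}"); // true

console.log(cells, dotAsWildcard, dotAsLiteral, realPrice, hexEscape, unicodeEscape, codePointEscape);
```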

...


Character classes ...

Character classes are used to represent general types of text characters, like "all digits", or "all white-spaces".

There are several character classes used in JavaScript :

\d ... digits.
\D ... non-digits.
\s ... white-space characters (spaces, tabs, newlines).
\S ... all but \s.
\w ... letters, digits, underscore.
\W ... all but \w.
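The sketch below applies some of these character classes to a made-up sample string, combined with the "one or more" quantifier where whole words or numbers are wanted.

```javascript
const text = "Room 42, floor_3";

const digits      = text.match(/\d/g);  // [ '4', '2', '3' ]
const numbers     = text.match(/\d+/g); // [ '42', '3' ]
const words       = text.match(/\w+/g); // [ 'Room', '42', 'floor_3' ]
const spaces      = text.match(/\s/g);  // [ ' ', ' ' ]
const nonWordRuns = text.match(/\W+/g); // [ ' ', ', ' ]

console.log(digits, numbers, words, spaces, nonWordRuns);
```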

...


Character sets ...

Character sets are used to define a list or range of possible characters to match, such as "a-z", or "0-9". A character set matches exactly one character from the list.

/[a-z]/
/* ... matches lowercase letters "a-z" ... */

/[A-Za-z]/
/* ... matches uppercase and lowercase letters "A-Z" ... */

/[0-9]/
/* ... matches digits "0-9" ... */

/[123xyz]/
/* ... matches "1" , "2" , "3" , "x" , "y" , or "z" ... */
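A small sketch running the four character sets above over a made-up string. The "one or more" quantifier is added to the first three so that whole runs of characters are returned.

```javascript
const text = "abc XYZ 123 xyz";

const lowercase = text.match(/[a-z]+/g);    // [ 'abc', 'xyz' ]
const letters   = text.match(/[A-Za-z]+/g); // [ 'abc', 'XYZ', 'xyz' ]
const digits    = text.match(/[0-9]+/g);    // [ '123' ]
const listed    = text.match(/[123xyz]/g);  // [ '1', '2', '3', 'x', 'y', 'z' ]

console.log(lowercase, letters, digits, listed);
```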

...


Negated character sets ...

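Negated character sets are used to define a list or range of possible characters to avoid. They are indicated by a 'caret character' placed immediately after the opening 'square-bracket'.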

/[^123xyz]/
/* ... matches all excluding "1" , "2" , "3" , "x" , "y" , or "z" ... */

/[^0-9]/
/* ... matches all excluding digits "0-9" ... */

/[^A-Za-z]/
/* ... matches all excluding uppercase and lowercase letters "A-Z" ... */

/[^a-z]/
/* ... matches all excluding lowercase letters "a-z" ... */
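The same idea in a short sketch, using a made-up string: each negated set returns every character that is NOT in its list.

```javascript
const text = "x1-y4!";

const notListed  = text.match(/[^123xyz]/g); // [ '-', '4', '!' ]
const notDigits  = text.match(/[^0-9]/g);    // [ 'x', '-', 'y', '!' ]
const notLetters = text.match(/[^A-Za-z]/g); // [ '1', '-', '4', '!' ]
const notLower   = text.match(/[^a-z]/g);    // [ '1', '-', '4', '!' ]

console.log(notListed, notDigits, notLetters, notLower);
```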

...


Boundary matchers ...

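Boundary matchers are used to represent "word boundaries", "line boundaries" (when flags are set to 'multiline'), or entire "string boundaries". They are written without a 'backslash' in front of the caret or dollar-sign; \^ and \$ match those characters literally.

^ ... start of the string (or of each line, when flags are set to 'multiline').
$ ... end of the string (or of each line, when flags are set to 'multiline').
\b ... matches a word boundary.
\B ... matches a non-word boundary.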
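The sketch below tests the boundary matchers on a made-up three-line string, with and without the 'multiline' flag.

```javascript
const lines = "cat nap\nbobcat\ncatalog";

// Without 'multiline', ^ and $ only see the start and end of the whole string.
const stringStart = lines.match(/^cat/g);  // [ 'cat' ]  (line 1 only)
const stringEnd   = lines.match(/cat$/g);  // null       (the string ends in "catalog")

// With 'multiline' (m), they apply to every line.
const lineStarts = lines.match(/^cat/gm);  // [ 'cat', 'cat' ]  (lines 1 and 3)
const lineEnds   = lines.match(/cat$/gm);  // [ 'cat' ]         (line 2, "bobcat")

// \b needs a word edge on that side; \B needs the opposite.
const wholeWords  = lines.match(/\bcat\b/g); // [ 'cat' ]  (the word "cat" in line 1)
const insideWords = lines.match(/\Bcat/g);   // [ 'cat' ]  (the "cat" inside "bobcat")

console.log(stringStart, stringEnd, lineStarts, lineEnds, wholeWords, insideWords);
```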

...


Unicode property escapes ...

Unicode property escapes are used to match characters based on their Unicode properties. They only work when flags are set to 'unicode'.

There are several types of unicode property escapes in JavaScript :

\p{Script=Greek} ... matches 'any character used in the Greek script'.
\p{Letter} ... matches 'any letter character'.
\p{Number} ... matches 'any numeric character'.
\p{Punctuation} ... matches 'any punctuation character'.
\p{Symbol} ... matches 'any symbol character'.
\p{Emoji_Presentation} ... matches 'any character that is displayed as an emoji by default'.
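A minimal sketch of several property escapes applied to a made-up string. Every pattern carries the 'unicode' flag, because \p{...} is not recognised as a property escape without it.

```javascript
const text = "π ≈ 3.14, ok!";

const greek       = text.match(/\p{Script=Greek}/gu); // [ 'π' ]
const letterRuns  = text.match(/\p{Letter}+/gu);      // [ 'π', 'ok' ]
const numbers     = text.match(/\p{Number}/gu);       // [ '3', '1', '4' ]
const punctuation = text.match(/\p{Punctuation}/gu);  // [ '.', ',', '!' ]
const symbols     = text.match(/\p{Symbol}/gu);       // [ '≈' ]

console.log(greek, letterRuns, numbers, punctuation, symbols);
```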

...


Grouping constructs ...

"Capturing groups", "backreferences", "named groups", "non-capturing groups", and "lookahead and lookbehind assertions", are all types of "grouping constructs", discussed below ...

...


Capturing groups ...

Capturing groups are target patterns surrounded by 'round-brackets'. Capturing groups are "matched, remembered, and can be referred back to" in a Regular Expression.

/(target_pattern)/
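As a short sketch of what "remembered" means in practice: after a match, the text found by each capturing group is available by its position in the match array. The sample string and variable names are made up for this example.

```javascript
// Two capturing groups: hours and minutes.
const timePattern = /(\d{2}):(\d{2})/;

const match = timePattern.exec("Meet at 09:30 today");

const whole   = match[0]; // '09:30'  (the whole match)
const hours   = match[1]; // '09'     (first capturing group)
const minutes = match[2]; // '30'     (second capturing group)

console.log(whole, hours, minutes, match.index); // 09:30 09 30 8
```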

...


Backreferences ...

Backreferences are used to "refer back to capturing groups".

/(target_pattern)\1/

Backreferences are signalled with a 'backslash', followed by a 'numerical address' that matches the order in which the target capturing group appears.

const string             = "The cat in the hat, with the huge grin, sat on the mat." ; 

const regular_expression = /.(at).*?\1/g ; 

const output             = string.match ( regular_expression ) ; 

window.alert ( JSON.stringify ( output ) ) ; 

/* 
output = [ "cat in the hat" , "sat on the mat" ] 
*/ 

...


Named groups ...

Named groups are capturing groups that have been given a name, so they can be referred back to by that name instead of by a 'numerical address'.

/(?<name>target_pattern)\k<name>/

A named group is signalled with a 'question-mark', followed by the "name" of the group surrounded by 'angle brackets'. Named groups can be back-referenced inside Regular Expressions using 'backslash-k', followed by the "name" of a group within 'angle brackets'.

const string             = "The cat in the hat, with the huge grin, sat on the mat." ; 

const regular_expression = /.(?<rhyming_word>at).*?\k<rhyming_word>/g ; 

const output             = string.match ( regular_expression ) ; 

window.alert ( JSON.stringify ( output ) ) ; 

/* 
output = [ "cat in the hat" , "sat on the mat" ] 
*/ 

A named group can also be referenced in JavaScript using string.replace(...), with a replacement string containing a 'dollar-sign' followed by the "name" of the group within 'angle brackets':

const string             = "John Smith" ;

const regular_expression = /(?<first>\w+)\s+(?<last>\w+)/ ;

const output             = string.replace ( regular_expression , "$<last>, $<first>" ) ; 

window.alert ( JSON.stringify ( output ) ) ; 

/* 
output = "Smith, John" 
*/ 

A named group can also be referenced in JavaScript using regular_expression.exec(...) and match.groups["name"] :

const regular_expression = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/ ; 

const match              = regular_expression.exec ( "2023-05-29" ) ; 

const output             = { 
 "year"                  : match.groups [ "year"  ] , 
 "month"                 : match.groups [ "month" ] , 
 "day"                   : match.groups [ "day"   ] , 
 } ; 

window.alert ( JSON.stringify ( output ) ) ; 

/* 
output = { "year":"2023" , "month":"05" , "day":"29" } 
*/ 

...


Non-capturing groups ...

Unlike capturing groups, non-capturing groups are not remembered, and "cannot be referred back to" in a Regular Expression.

/(?:target_pattern)/

Non-capturing groups are indicated with a 'question-mark', followed by a 'colon character', and then a 'target pattern', all contained within 'round-brackets'.
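The difference shows up in the match array. In the sketch below (sample string and names made up for illustration), both patterns match the same text, but only the capturing version stores the repeated "ha".

```javascript
// Capturing: the repeated group takes slot 1, the "!" takes slot 2.
const capturing = /(ha)+(!)/.exec("hahaha!");

// Non-capturing: the repeated group takes no slot, so the "!" moves to slot 1.
const nonCapturing = /(?:ha)+(!)/.exec("hahaha!");

console.log(capturing[0], capturing[1], capturing[2]); // hahaha! ha !
console.log(nonCapturing[0], nonCapturing[1]);         // hahaha! !
console.log(capturing.length, nonCapturing.length);    // 3 2
```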

...


Lookahead and lookbehind assertions ...

A lookahead assertion is used to match a target_pattern only if it is followed by a lookahead_pattern :

/target_pattern(?=positive_lookahead_pattern)/

Negative lookahead assertions are used to match a target_pattern only if it is not followed by a lookahead_pattern :

/target_pattern(?!negative_lookahead_pattern)/

A lookbehind assertion is used to match a target_pattern only if it is preceded by a lookbehind_pattern :

/(?<=positive_lookbehind_pattern)target_pattern/

Negative lookbehind assertions are used to match a target_pattern only if it is not preceded by a lookbehind_pattern :

/(?<!negative_lookbehind_pattern)target_pattern/

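Lookahead and lookbehind assertions only test the surrounding text: the text they check is not included in the match, and the assertion itself is not remembered, so it "cannot be referred back to" in a Regular Expression.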
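The four assertions can be sketched on a made-up list of prices. In each case only the digits are returned; the comma or currency sign that was tested is left out of the match.

```javascript
const prices = "apple $5, pear €7, plum $12";

// Positive lookahead: digits that are followed by a comma.
const beforeComma = prices.match(/\d+(?=,)/g);         // [ '5', '7' ]

// Negative lookahead: digits that are NOT followed by a comma.
const notBeforeComma = prices.match(/\d+(?!,)/g);      // [ '12' ]

// Positive lookbehind: digits that are preceded by a dollar-sign.
const afterDollar = prices.match(/(?<=\$)\d+/g);       // [ '5', '12' ]

// Negative lookbehind: digits NOT preceded by a dollar-sign (or by another
// digit, which stops the match from starting in the middle of "12").
const notAfterDollar = prices.match(/(?<![$\d])\d+/g); // [ '7' ]

console.log(beforeComma, notBeforeComma, afterDollar, notAfterDollar);
```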

...


Afrolex ...

Afrolex is a tool that allows you to use what you have learnt from "Regular Expressions in JavaScript" to search a "string" containing over 350,000 English words, separated by 'newline characters', for new words or for words that rhyme:


"Afrolex"

...


Credits ...
"Title" : ( "Regular Expressions in JavaScript." ) ,
"Created" : ( "Tue May 23 2023 09:10:59 GMT+0100 (BST)" ) ,
"Published" : ( "Thu Jun 01 2023 06:57:55 GMT+0100 (BST)" ) ,
"Updated" : ( "Thu Jun 01 2023 06:57:55 GMT+0100 (BST)" ) ,
"Author" : [ "Dave Auguste" ] ,
"Assistance" : [ "Bing Chat" ] ,
"Commissions" : [ "@dave_on_fiverr" ] ,
"Support" : [ "buymeacoffee.com/daveauguste" ] ,