Regular Expressions - Syntax
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract substrings that meet certain conditions.
When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there is a shell wildcard that stands for any string, whereas the * in a regular expression repeats the preceding item zero or more times.
Constructing regular expressions is the same as creating mathematical expressions. That is, small expressions can be combined together to create larger expressions using a variety of metacharacters and operators. The components of a regular expression can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all of these components.
Regular expressions are text patterns composed of ordinary characters (such as the characters a through z) and special characters (called "metacharacters"). A pattern describes one or more strings to match when searching for text. A regular expression acts as a template that matches a character pattern with a searched string.
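As a minimal sketch of the three uses mentioned above (checking, replacing, extracting), the following JavaScript relies only on the built-in RegExp and String methods. The sample string and variable names are illustrative assumptions, not part of the original text.

```javascript
// Sample input; the text and names here are illustrative assumptions.
const sampleText = "Order 66 shipped on 2024-01-15";

// 1. Check whether the string contains a run of digits.
const hasDigits = /[0-9]+/.test(sampleText);

// 2. Replace every run of digits with "#".
const masked = sampleText.replace(/[0-9]+/g, "#");

// 3. Extract the first substring shaped like a date (YYYY-MM-DD).
const dateMatch = sampleText.match(/[0-9]{4}-[0-9]{2}-[0-9]{2}/);
const extractedDate = dateMatch ? dateMatch[0] : null;

console.log(hasDigits, masked, extractedDate);
```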
Normal characters
Normal characters include all printable and nonprintable characters that are not explicitly specified as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation, and some other symbols.
Non-printing characters
Non-printing characters can also be part of regular expressions. The following table lists the escape sequences that represent non-printing characters:
Character | Description |
---|---|
\cx | Matches the control character specified by x. For example, \cM matches a Control-M or carriage return character. The value of x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character. |
\f | Matches a form feed character. Equivalent to \x0c and \cL. |
\n | Matches a newline character. Equivalent to \x0a and \cJ. |
\r | Matches a carriage return character. Equivalent to \x0d and \cM. |
\s | Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]. |
\t | Matches a tab character. Equivalent to \x09 and \cI. |
\v | Matches a vertical tab character. Equivalent to \x0b and \cK. |
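The equivalences in the table can be checked directly. This is a small sketch in JavaScript; the sample strings and variable names are assumptions chosen for illustration.

```javascript
// \cM and \x0d both denote the carriage return character that \r matches.
const ctrlMMatchesCR = /\cM/.test("\r");
const hexMatchesCR = /\x0d/.test("\r");

// \s matches a tab, a newline, or a space, so splitting on it
// separates the four letters below.
const pieces = "a\tb\nc d".split(/\s/);

// \S matches every character that \s does not.
const nonWhitespace = "a\tb\nc d".match(/\S/g);

console.log(ctrlMMatchesCR, hexMatchesCR, pieces, nonWhitespace);
```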
Special characters
Special characters are characters with a special meaning, such as the * in *.txt mentioned above, which in that context stands for any string. To find files with a literal * in the file name, you need to escape the *, that is, add a \ before it: ls \*.txt.
Many metacharacters require special treatment when trying to match them. To match these special characters, you must first "escape" the characters, that is, precede them with a backslash character (\). The following table lists the special characters in regular expressions:
Special Characters | Description |
---|---|
$ | Matches the end position of the input string. If the RegExp object's Multiline property is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$. |
( ) | Marks the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \). |
* | Matches the preceding subexpression zero or more times. To match the * character, use \*. |
+ | Matches the preceding subexpression one or more times. To match the + character, use \+. |
. | Matches any single character except the newline character \n. To match ., use \. |
[ | Marks the beginning of a square bracket expression. To match [, use \[. |
? | Matches the preceding subexpression zero or one time, or specifies a non-greedy qualifier. To match the ? character, use \?. |
\ | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline character. The sequence '\\' matches "\", while '\(' matches "(". |
^ | Matches the beginning of the input string, unless it is used inside a square bracket expression, where it negates the character set. To match the ^ character itself, use \^. |
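The effect of escaping can be sketched in JavaScript as follows. The file names are made-up examples, and escapeRegExp is a hypothetical helper written for this illustration (it is not a built-in function); it puts a backslash in front of each special character so that arbitrary text can be matched literally.

```javascript
// Unescaped, "." matches any character, so "notesXtxt" matches.
const looseDot = /s.txt/.test("notesXtxt");

// Escaped, "\." matches only a literal dot.
const strictDot = /s\.txt/.test("notesXtxt");

// "\*" matches a literal asterisk in a file name.
const literalStar = /\*\.txt$/.test("notes*.txt");

// Hypothetical helper: escape every special character in `text`.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// A pattern built from escaped text matches only that exact text.
const exact = new RegExp(escapeRegExp("a.b*"));
console.log(looseDot, strictDot, literalStar, exact.test("a.b*"), exact.test("axbb"));
```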
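Qualifiers
Qualifiers specify how many times a given component of a regular expression must occur for a match to succeed. The following table lists them: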
Character | Description |
---|---|
* | Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}. |
+ | Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}. |
? | Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}. |
{n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but it does match both o's in "food". |
{n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but it matches all o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
{n,m} | m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and either number. |
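The examples in the table can be tried out with a short JavaScript sketch. firstMatch is a hypothetical helper introduced only for this illustration; it returns the text of the first match or null when nothing matches.

```javascript
// Hypothetical helper: text of the first match, or null if there is none.
function firstMatch(pattern, text) {
  const found = text.match(pattern);
  return found ? found[0] : null;
}

const qualifierDemo = {
  star: firstMatch(/zo*/, "z"),            // "z": zero o's are allowed
  plus: firstMatch(/zo+/, "z"),            // null: at least one o is required
  optional: firstMatch(/do(es)?/, "does"), // "does"
  exactMiss: firstMatch(/o{2}/, "Bob"),    // null: only one o
  exactHit: firstMatch(/o{2}/, "food"),    // "oo"
  atLeast: firstMatch(/o{2,}/, "foooood"), // the whole run of o's
  range: firstMatch(/o{1,3}/, "fooooood"), // "ooo": at most three o's
};

console.log(qualifierDemo);
```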
Since chapter numbers will likely exceed nine in large input documents, you need a way to handle two- or three-digit chapter numbers. Qualifiers give you this ability. The following regular expression matches chapter titles numbered with any number of digits:
/Chapter [1-9][0-9]*/
Note that the qualifier appears after the range expression and therefore applies to the entire range expression, which in this case specifies only the digits 0 through 9 (inclusive).
The + qualifier is not used here because a digit is not necessarily required in the second or subsequent position. The ? qualifier is not used either, because it would limit chapter numbers to two digits. You need to match at least one digit after Chapter and a space character.
If you know that chapter numbers are limited to only 99 chapters, you can use the following expression to specify at least one but at most two digits.
/Chapter [0-9]{1,2}/
The disadvantage of the above expression is that, for chapter numbers greater than 99, it still matches only the first two digits. Another drawback is that Chapter 0 also matches. A better expression that matches only one- or two-digit chapter numbers would be:
/Chapter [1-9][0-9]?/
or
/Chapter [1-9][0-9]{0,1}/
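The differences between these chapter-number expressions can be seen in a brief JavaScript sketch. The test strings and variable names are assumptions chosen for illustration.

```javascript
const anyDigits = /Chapter [1-9][0-9]*/;     // any number of digits
const looseTwo = /Chapter [0-9]{1,2}/;       // one or two digits, 0 allowed
const strictTwo = /Chapter [1-9][0-9]?/;     // one or two digits, no leading 0

// With three digits, only the first expression matches the whole number.
const allDigits = "Chapter 123".match(anyDigits)[0];   // "Chapter 123"
const truncated = "Chapter 123".match(looseTwo)[0];    // "Chapter 12"

// Chapter 0 is accepted by the loose form but rejected by the strict one.
const looseZero = looseTwo.test("Chapter 0");    // true
const strictZero = strictTwo.test("Chapter 0");  // false

console.log(allDigits, truncated, looseZero, strictZero);
```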
The *, + and ? qualifiers are greedy: they match as much text as possible. Non-greedy, or minimal, matching can be achieved by adding a ? after them.
For example, you might search an HTML document for section titles enclosed in H1 tags. The text would look like this in your document:
<H1>Chapter 1 – Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) of the closing H1 tag.
/<.*>/
If you only need to match the opening H1 tag, the "non-greedy" expression below will only match <H1>.
/<.*?>/
By placing ? after the *, +, or ? qualifier, the expression is converted from a "greedy" expression to a "non-greedy" expression, or a minimum match.
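The difference between the two expressions can be sketched in JavaScript using the H1 line from the text. The variable names are illustrative.

```javascript
const heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>";

// Greedy: .* runs to the last ">" on the line, so the whole line matches.
const greedyMatch = heading.match(/<.*>/)[0];

// Non-greedy: .*? stops at the first ">", so only the opening tag matches.
const lazyMatch = heading.match(/<.*?>/)[0];

console.log(greedyMatch);
console.log(lazyMatch);
```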
Locators
Locators enable you to pin a regular expression to the beginning or end of a line. They also enable you to create regular expressions that match within a word, at the beginning of a word, or at the end of a word.
The locator is used to describe the boundary of a string or a word. ^ and $ refer to the beginning and end of the string respectively. \b describes the front or back boundary of a word. \B represents a non-word boundary.
The locators of regular expressions are:
Character | Description |
---|---|
^ | Matches the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position after \n or \r. |
$ | Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position preceding \n or \r. |
\b | Matches a word boundary, that is, the position between a word character and a non-word character such as a space. |
\B | Matches a non-word boundary. |
Note: You cannot use qualifiers with anchor points. Since there cannot be more than one position immediately before or after a newline or word boundary, expressions such as ^* are not allowed.
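In JavaScript this restriction shows up as a syntax error when the pattern is compiled. The sketch below builds the pattern from a string so that the error can be caught at run time; isValidPattern is a hypothetical helper written for this illustration.

```javascript
// Hypothetical helper: true if `source` compiles as a regular expression.
function isValidPattern(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    // The RegExp constructor throws a SyntaxError for an invalid pattern.
    return false;
  }
}

const quantifiedAnchor = isValidPattern("^*");    // a qualifier applied to ^
const plainAnchor = isValidPattern("^Chapter");   // an ordinary anchored pattern

console.log(quantifiedAnchor, plainAnchor);
```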
To match text at the beginning of a line of text, use the ^ character at the beginning of the regular expression. Do not confuse this use of ^ with the use inside bracket expressions.
To match text at the end of a line of text, use the $ character at the end of the regular expression.
To use anchor points when searching for chapter titles, the following regular expression matches a chapter title that has at most two trailing digits and occurs at the beginning of the line:
/^Chapter [1-9][0-9]{0,1}/
A true chapter title not only appears at the beginning of a line but is also the only text on that line, so it occurs at both the beginning and the end of the same line. The following expression ensures that the match finds only chapter titles and not cross-references. It does so by anchoring the pattern to both the beginning and the end of a line of text.
/^Chapter [1-9][0-9]{0,1}$/
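A short JavaScript sketch shows how the two anchored expressions differ. The sample lines are invented for illustration.

```javascript
// Sample lines (assumed for illustration): a title, a sentence that
// starts with a chapter reference, and a cross-reference.
const sampleLines = ["Chapter 12", "Chapter 3 begins here", "See Chapter 4"];

const startsWithChapter = /^Chapter [1-9][0-9]{0,1}/;
const onlyChapter = /^Chapter [1-9][0-9]{0,1}$/;

// ^ alone accepts any line that begins with a chapter reference.
const atLineStart = sampleLines.filter((line) => startsWithChapter.test(line));

// ^ and $ together accept only lines that consist of the title alone.
const wholeLine = sampleLines.filter((line) => onlyChapter.test(line));

console.log(atLineStart);
console.log(wholeLine);
```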
Matching word boundaries is slightly different, but adds important capabilities to regular expressions. A word boundary is the position between a word and a space or other non-word character. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter because these three characters appear after a word boundary:
/\bCha/
The position of the \b operator is very important. If it is at the beginning of the expression, it looks for a match at the beginning of a word. If it is at the end of the expression, it looks for a match at the end of a word. For example, the following expression matches the string ter in the word Chapter because it occurs before a word boundary:
/ter\b/
The following expression matches the string apt in Chapter but not the string apt in aptitude:
/\Bapt/
The string apt occurs at non-word boundaries in the word Chapter, but at word boundaries in the word aptitude. For the \B non-word boundary operator, position does not matter because the match does not care whether it is the beginning or the end of a word.
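The three word-boundary expressions can be checked with a small JavaScript sketch. The variable names are illustrative.

```javascript
// \b at the start: "Cha" must begin a word.
const startsWord = /\bCha/.test("Chapter");

// \b at the end: "ter" must end a word.
const endsWord = /ter\b/.test("Chapter");

// \B: "apt" must not start at a word boundary.
const insideChapter = /\Bapt/.test("Chapter");    // "apt" follows "Ch"
const insideAptitude = /\Bapt/.test("aptitude");  // "apt" starts the word

console.log(startsWord, endsWord, insideChapter, insideAptitude);
```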
Selection
Enclose all selections in parentheses, and separate adjacent selections with |. Parentheses have a side effect, however: the related matches are cached. To eliminate this side effect, put ?: before the first selection.
?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. ?= is a positive lookahead: it matches at any position where the text that follows matches the pattern inside the parentheses. ?! is a negative lookahead: it matches at any position where the text that follows does not match the pattern inside the parentheses.
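The following JavaScript sketch illustrates selection, the non-capturing ?: element, and the two lookahead elements. The patterns and sample strings (gray, Windows 98, and so on) are assumptions chosen for illustration.

```javascript
// Capturing parentheses: the chosen selection is cached as submatch 1.
const captured = "gray".match(/gr(a|e)y/);

// ?: keeps the grouping but caches nothing.
const uncaptured = "gray".match(/gr(?:a|e)y/);

// ?= matches "Windows " only when 95 or 98 follows; the digits
// themselves are not part of the match.
const positive = "Windows 98".match(/Windows (?=95|98)/);

// ?! matches "Windows " only when neither 95 nor 98 follows.
const negativeOn98 = /Windows (?!95|98)/.test("Windows 98");
const negativeOnNT = /Windows (?!95|98)/.test("Windows NT");

console.log(captured[1], uncaptured.length, positive[0], negativeOn98, negativeOnNT);
```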
Backreference
Adding parentheses around a regular expression pattern or part of a pattern causes the associated match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it appears from left to right in the regular expression pattern. Buffer numbering starts at 1 and can store up to 99 captured subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies the specific buffer.
Capturing can be suppressed using the non-capturing metacharacters '?:', '?=' or '?!', so that the related matches are not saved.
One of the simplest and most useful applications of backreferences is the ability to find matches of two identical adjacent words in text. Take the following sentence as an example:
Is is the cost of of gasoline going up up?
The above sentence obviously has multiple repeated words. It would be nice to devise a way to locate this sentence without having to look for repetitions of each word. The following regular expression uses a single subexpression to achieve this:
/\b([a-z]+) \1\b/gi
The captured expression, as specified by [a-z]+, includes one or more letters. The second part of the regular expression, \1, is a reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthetical expression. \1 specifies the first submatch. The word boundary metacharacters ensure that only whole words are detected. Otherwise, phrases such as "is issued" or "this is" would be incorrectly identified by this expression.
The global tag (g) after the regular expression indicates that the expression is applied to as many matches as can be found in the input string. The case-insensitive (i) tag at the end of the expression specifies case-insensitivity. The multiline tag (m), when specified, allows potential matches to occur on either side of newline characters.
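Applied to the sample sentence, a backreference pattern of this kind (with \1 referring to the first captured submatch) can both list the doubled words and remove them. This is a minimal JavaScript sketch; the variable names are illustrative.

```javascript
const sentence = "Is is the cost of of gasoline going up up?";

// \1 must repeat whatever ([a-z]+) captured; g finds every occurrence
// and i lets "Is" pair with "is".
const repeatedWord = /\b([a-z]+) \1\b/gi;

// List each doubled pair.
const doubledPairs = sentence.match(repeatedWord);

// Replace each pair with its first word ($1 is the captured submatch).
const cleaned = sentence.replace(repeatedWord, "$1");

console.log(doubledPairs);
console.log(cleaned);
```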
Backreferences can also break down a Uniform Resource Identifier (URI) into its components. Suppose you want to decompose the following URI into protocol (ftp, http, etc.), domain address, and page/path:
http://www.w3cschool.cc:80/html/html-tutorial.html
The following regular expression provides this functionality:
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/
The first parenthetical subexpression captures the protocol portion of the Web address. This subexpression matches any word followed by a colon and two forward slashes. The second parenthetical subexpression captures the domain address portion of the address. This subexpression matches one or more characters other than / and :. The third parenthetical subexpression captures the port number (if one is specified). This subexpression matches a colon followed by zero or more digits, and it can occur at most once. Finally, the fourth parenthetical subexpression captures the path and/or page information specified by the Web address. This subexpression matches any sequence of characters that does not include the # or space character.
Applying the regular expression to the URI above, each submatch contains the following:
- The first parenthetical subexpression contains "http"
- The second parenthetical subexpression contains "www.w3cschool.cc"
- The third parenthetical subexpression contains ":80"
- The fourth parenthetical subexpression contains "/html/html-tutorial.html"
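The decomposition can be reproduced with a few lines of JavaScript. This is a sketch using the URI and expression from the text; the variable names are illustrative assumptions.

```javascript
const uri = "http://www.w3cschool.cc:80/html/html-tutorial.html";
const uriPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;

// Index 0 of the result is the whole match; indexes 1 to 4 are the
// four captured submatches in left-to-right order.
const [, protocol, domain, port, path] = uri.match(uriPattern);

console.log(protocol); // "http"
console.log(domain);   // "www.w3cschool.cc"
console.log(port);     // ":80"
console.log(path);     // "/html/html-tutorial.html"
```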