
Detailed explanation of what regular expressions are and their usage

阿神 (original) | 2017-03-28 14:54:11 | 8858 views

1. What is a regular expression?

A regular expression describes a string-matching pattern. It can be used to:

(1) Check whether a string contains a substring that matches a given rule, and retrieve that substring;

(2) Flexibly replace parts of a string based on matching rules.

Regular expressions are actually simple to learn, and the few more abstract concepts are also easy to understand. Many people find them complicated for two reasons. First, most documents do not proceed from simple to advanced and pay little attention to the order in which concepts are introduced, which makes them hard to follow. Second, the documentation that ships with each engine usually concentrates on that engine's unique features, which are not the first thing a learner needs to understand.



2. How to use regular expressions

2.1 Ordinary characters

Letters, digits, Chinese characters, underscores, and punctuation marks that are not given a special meaning in the following sections are all ordinary characters. An ordinary character in an expression matches the identical character in the string.

Example 1: When the expression c matches the string abcdef, the matching result is: success; the matched content is: c; the matched position is: starting at 2 and ending at 3. (Note: whether indexes start from 0 or 1 may differ depending on the programming language.)

Example 2: Expression bcd, when matching the string abcde, the matching result is: success; the matched content is: bcd; the matched position is: starting at 1 and ending at 4.
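The two examples above can be checked with a short Python sketch. Python's `re` module is only one possible engine (the choice is an assumption, not something the article prescribes); it reports positions as a 0-based start and an exclusive end, which agrees with the positions quoted above.

```python
import re

# re.search returns the first match; span() gives (start, end),
# with a 0-based start and an exclusive end.
m1 = re.search(r"c", "abcdef")
m2 = re.search(r"bcd", "abcde")

print(m1.group(), m1.span())  # c (2, 3)
print(m2.group(), m2.span())  # bcd (1, 4)
```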

2.2 Simple escape characters

Characters that are inconvenient to write directly are written with a \ in front (for example, \n for a newline and \t for a tab). Most of these are already familiar.

Other punctuation marks have special uses that are described in later sections. Adding \ in front of them represents the symbol itself. For example, ^ and $ have special meanings, so to match the literal ^ and $ characters in a string, the regular expression must be written as \^ and \$.

These escaped characters match in the same way as ordinary characters: each one matches the identical character.

Example: Expression \$d, when matching the string abc$de, the matching result is: success; the matched content is: $d; the matched position is: starting at 3 and ending at 5.
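As a quick check of the escaping rule, the sketch below (Python `re`, chosen here only for illustration) compares the escaped and unescaped forms of the same expression.

```python
import re

# Escaped: \$ matches a literal dollar sign.
escaped = re.search(r"\$d", "abc$de")
print(escaped.group(), escaped.span())  # $d (3, 5)

# Unescaped: $ is an end-of-string anchor, so nothing can follow it
# and the expression cannot match this string.
unescaped = re.search(r"$d", "abc$de")
print(unescaped)  # None
```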

2.3 Expressions that can match 'multiple characters'

Some expressions can match any one character out of a set of characters. For example, the expression \d can match any digit. Although it can match any character in the set, it matches only one character, not several. This is like a joker in a card game: it can stand in for any card, but only for one card at a time.

Example 1: When the expression \d\d matches abc123, the matching result is: success; the matched content is: 12; the matched position is: starting at 3 and ending at 5.

Example 2: When the expression a.\d matches aaa100, the matching result is: success; the matched content is: aa1; the matched position is: starting at 1 and ending at 4.
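A minimal sketch of these single-character classes, again assuming Python's `re` module as the engine:

```python
import re

# \d matches exactly one digit, so \d\d matches two consecutive digits.
two_digits = re.search(r"\d\d", "abc123")
print(two_digits.group(), two_digits.span())  # 12 (3, 5)

# . matches any single character except a newline (by default).
a_any_digit = re.search(r"a.\d", "aaa100")
print(a_any_digit.group(), a_any_digit.span())  # aa1 (1, 4)
```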

2.4 Custom expressions that can match 'multiple characters'

Square brackets [ ] enclosing a series of characters match any one of those characters. [^ ] enclosing a series of characters matches any one character except those listed. As before, although any one of them can be matched, only one character is matched, not several.

Example 1: When the expression [bcd][bcd] matches abc123, the matching result is: success; the matched content is: bc; the matched position is: starting at 1 and ending at 3.

Example 2: When the expression [^abc] matches abc123, the matching result is: success; the matched content is: 1; the matched position is: starting at 3 and ending at 4.
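The bracket forms can be tried out as follows (a sketch using Python's `re` module, which is an assumption about the engine):

```python
import re

# [bcd] matches one character that is b, c, or d.
pair = re.search(r"[bcd][bcd]", "abc123")
print(pair.group(), pair.span())  # bc (1, 3)

# [^abc] matches one character that is NOT a, b, or c.
other = re.search(r"[^abc]", "abc123")
print(other.group(), other.span())  # 1 (3, 4)
```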

2.5 Special symbols that modify the number of matches

The expressions described so far, whether they match one specific character or any one of several characters, match only once. Adding a special symbol that modifies the number of matches lets an expression match repeatedly without being written out again.

The quantifier is placed directly after the expression it modifies. For example, [bcd][bcd] can be written as [bcd]{2}.

Example 1: When the expression \d+\.?\d* matches it costs $12.5, the matching result is: success; the matched content is: 12.5; the matched position is: starting at 10 and ending at 14.

Example 2: When the expression go{2,8}gle matches Ads by goooooogle, the matching result is: success; the matched content is: goooooogle; the matched position is: starting at 7 and ending at 17.
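The quantifier examples can be verified with the sketch below (Python `re` is assumed as the engine). Note that the range quantifier is written without a space, `{2,8}`; several engines treat `{2, 8}` as literal text.

```python
import re

# + means one or more, ? means zero or one, * means zero or more.
price = re.search(r"\d+\.?\d*", "it costs $12.5")
print(price.group(), price.span())  # 12.5 (10, 14)

# {2,8} means between 2 and 8 repetitions of the preceding "o".
logo = re.search(r"go{2,8}gle", "Ads by goooooogle")
print(logo.group(), logo.span())  # goooooogle (7, 17)

# [bcd]{2} is shorthand for [bcd][bcd].
short = re.search(r"[bcd]{2}", "abc123")
long_form = re.search(r"[bcd][bcd]", "abc123")
print(short.span() == long_form.span())  # True
```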

2.6 Some other symbols representing abstract meanings

Some symbols, such as ^, $ and \b, represent abstract special meanings in expressions.

Explaining them in words alone is fairly abstract, so the following examples are given to help.

Example 1: When the expression ^aaa matches xxx aaa xxx, the matching result is: failure. Because ^ is required to match the beginning of the string, ^aaa can only match when aaa is at the beginning of the string, such as: aaa xxx xxx.

Example 2: When the expression aaa$ matches xxx aaa xxx, the matching result is: failure. Because $ is required to match the end of the string, aaa$ can only match when aaa is at the end of the string, such as: xxx xxx aaa.

Example 3: When the expression .\b. matches @@@abc, the matching result is: success; the matched content is: @a; the matched position is: starting at 2 and ending at 4.

Further explanation: \b is similar to ^ and $. It matches no characters itself; instead, it requires that, at its position in the match, one side be in the \w range and the other side be outside the \w range.

Example 4: When the expression \bend\b matches weekend,endfor,end, the matching result is: success; the matched content is: end; the matched position is: starting at 15 and ending at 18.
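The anchors ^ and $ and the word boundary \b can be exercised with the following sketch (Python `re` assumed). The word-boundary test string is written without spaces after the commas, which is what makes the match start at index 15.

```python
import re

# ^ anchors the match at the start of the string, $ at the end.
start_fail = re.search(r"^aaa", "xxx aaa xxx")
start_ok = re.search(r"^aaa", "aaa xxx xxx")
end_fail = re.search(r"aaa$", "xxx aaa xxx")
end_ok = re.search(r"aaa$", "xxx xxx aaa")
print(start_fail, start_ok.span(), end_fail, end_ok.span())

# \b matches the gap between a \w character and a non-\w character.
boundary = re.search(r".\b.", "@@@abc")
print(boundary.group(), boundary.span())  # @a (2, 4)

# Only the standalone word "end" has a boundary on both sides.
word = re.search(r"\bend\b", "weekend,endfor,end")
print(word.group(), word.span())  # end (15, 18)
```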

Some symbols, such as | and ( ), affect the relationship between subexpressions within an expression.

Example 5: When the expression Tom|Jack matches the string I'm Tom, he is Jack, the matching result is: success; the matched content is: Tom; the matched position is: starting at 4 and ending at 7. When matching the next one, the matching result is: success; the matched content is: Jack; the matched position is: starting at 15 and ending at 19.

Example 6: When the expression (go\s*)+ matches Let's go go go!, the matching result is: success; the matched content is: go go go; the matched position is: starting at 6 and ending at 14.

Example 7: When the expression ¥(\d+\.?\d*) matches $10.9,¥20.5, the matching result is: success; the matched content is: ¥20.5; the matched position is: starting at 6 and ending at 11. The content matched by the bracketed part alone is: 20.5.
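Alternation and grouping can be checked with the sketch below. It assumes Python's `re` module, where `finditer` plays the role of "matching the next one" and `group(1)` returns the content of the first pair of brackets.

```python
import re

# | matches either the left or the right alternative.
names = [(m.group(), m.span())
         for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]
print(names)  # [('Tom', (4, 7)), ('Jack', (15, 19))]

# ( ) groups a subexpression so that + repeats the whole group.
repeated = re.search(r"(go\s*)+", "Let's go go go!")
print(repeated.group(), repeated.span())  # go go go (6, 14)

# ( ) also captures: group(1) is the text matched inside the brackets.
amount = re.search(r"¥(\d+\.?\d*)", "$10.9,¥20.5")
print(amount.group(), amount.span(), amount.group(1))
```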


3. Some advanced usage of regular expressions

3.1 Greedy and non-greedy in the number of matches

Greedy mode:

Several of the special symbols that modify the number of matches allow the same expression to match a varying number of times, such as {m,n}, {m,}, ?, *, and +. The actual number of matches depends on the string being matched. Such an expression always matches as many times as possible during matching. Consider the text dxxxdxxxd as an example.

When matching, \w+ always matches as many characters as possible that meet its rule. In an expression such as (d)(\w+)(d), \w+ leaves the last d unmatched, but only so that the entire expression can match successfully. In the same way, expressions with * and {m,n} match as much as possible, and an expression with ?, which may or may not match, matches whenever it can. This matching principle is called greedy mode.
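Greedy matching on the text dxxxdxxxd can be illustrated with the sketch below. The two patterns are assumed for illustration (the article's own example table is not available here), and Python's `re` module is assumed as the engine.

```python
import re

text = "dxxxdxxxd"

# Greedy \w+ takes everything after the first d.
greedy_open = re.search(r"(d)(\w+)", text)
print(greedy_open.group(2))  # xxxdxxxd

# With a trailing (d), \w+ gives back just enough (the final d)
# for the whole expression to succeed.
greedy_closed = re.search(r"(d)(\w+)(d)", text)
print(greedy_closed.group(2))  # xxxdxxx
```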

Non-greedy mode:

Adding a ? after a special symbol that modifies the number of matches makes an expression with a variable number of matches match as few times as possible, and makes an expression that may or may not match prefer not to match. This matching principle is called non-greedy mode, also called reluctant mode. If matching fewer times would cause the entire expression to fail, then, as in greedy mode, non-greedy mode matches just enough more for the entire expression to succeed. The text dxxxdxxxd serves as an example again.

More examples follow:
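The non-greedy counterpart on the same text can be sketched as follows. As before, the patterns are assumed for illustration and Python's `re` module is assumed as the engine.

```python
import re

text = "dxxxdxxxd"

# Non-greedy \w+? matches as little as possible: a single x.
lazy_open = re.search(r"(d)(\w+?)", text)
print(lazy_open.group(2))  # x

# With a trailing (d), \w+? must extend until the next d is reachable.
lazy_closed = re.search(r"(d)(\w+?)(d)", text)
print(lazy_closed.group(2))  # xxx
```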

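Example 1: Suppose an expression places (.*) between an opening tag and its closing tag, and the string contains two such tagged elements, one holding aa and the other holding bb. The result is: success; the matched content is the entire string, because the closing tag in the expression matches the last closing tag in the string.

Example 2: In contrast, if the expression uses (.*?) and matches the same string as in example 1, only the first element, holding aa, is matched. When matching the next one, the second element, holding bb, is obtained.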
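To make the greedy versus non-greedy difference on tagged text concrete, the sketch below assumes `<td>` as the delimiting tag and an HTML-like test string. Both are assumptions chosen for illustration, and Python's `re` module is assumed as the engine.

```python
import re

# Hypothetical test string with two tagged elements.
text = "<td><p>aa</p></td> <td><p>bb</p></td>"

# Greedy: </td> in the expression pairs with the LAST </td> in the string.
greedy = re.search(r"<td>(.*)</td>", text)
print(greedy.group() == text)  # True

# Non-greedy: each match stops at the nearest </td>.
lazy = re.findall(r"<td>(.*?)</td>", text)
print(lazy)  # ['<p>aa</p>', '<p>bb</p>']
```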

3.2 Backreferences \1, \2, ...

When an expression is matched, the engine records the string matched by each subexpression enclosed in parentheses ( ). When the matching result is retrieved, the string matched by each parenthesized subexpression can be obtained separately. This has been demonstrated several times in the previous examples. In practice, when a boundary is used for searching and the content to be obtained does not include the boundary, parentheses must be used to specify the desired range, for example the (.*?) in the earlier example.

In fact, the string matched by a parenthesized subexpression can be used not only after matching is complete but also during matching. A later part of the expression can refer to an earlier parenthesized submatch that has already matched. The reference is written as \ plus a number: \1 refers to the string matched by the first pair of brackets, \2 to the string matched by the second pair, and so on. If a pair of brackets contains another pair, the outer pair is numbered first. In other words, groups are numbered in the order in which their left parentheses ( appear.

Example 1: When the expression ('|")(.*?)(\1) matches 'Hello', "World", the matching result is: success; the matched content is: 'Hello'. When matching the next one, it matches "World".

Example 2: When the expression (\w)\1{4,} matches aa bbbb abcdefg ccccc 111121111 999999999, the matching result is: success; the matched content is: ccccc. When matching the next one, the result is 999999999. This expression requires a character in the \w range to be repeated at least 5 times. Note the difference from \w{5,}, which accepts any 5 or more \w characters, not necessarily the same one.

Example 3: A backreference can also pair HTML-style tags. An expression that captures the tag name in the opening tag and ends with .*?</\1> matches successfully only when the closing tag has the same name as the opening tag. If the opening and closing tags are not paired, the match fails; if both are changed to another matching pair, the match still succeeds.
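The backreference examples can be sketched as follows, assuming Python's `re` module. The tag-pairing pattern `TAG_PAIR` is a simplified, hypothetical expression written for this sketch, not the article's own expression.

```python
import re

# \1 must repeat whichever quote character group 1 matched.
quoted = [m.group() for m in
          re.finditer(r"""('|")(.*?)(\1)""", "'Hello', \"World\"")]
print(quoted)  # ["'Hello'", '"World"']

# (\w)\1{4,}: the SAME character repeated at least 5 times in total.
runs = [m.group() for m in re.finditer(
    r"(\w)\1{4,}", "aa bbbb abcdefg ccccc 111121111 999999999")]
print(runs)  # ['ccccc', '999999999']

# Hypothetical tag-pairing pattern: </\1> must repeat the opening tag name.
TAG_PAIR = r"<(\w+)\b[^>]*>.*?</\1>"
paired = re.search(TAG_PAIR, "<td id='td1'></td>")
unpaired = re.search(TAG_PAIR, "<td id='td1'></tr>")
print(paired is not None, unpaired is None)  # True True
```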

3.3 Forward pre-search and reverse pre-search (no characters matched)

The previous chapter described several special symbols that represent abstract meanings: ^, $, and \b. What they have in common is that they match no characters themselves; they only attach a condition to the "two ends of the string" or to a "gap between characters". With that concept in place, this section introduces a more flexible way to attach conditions to "both ends" or "gaps", commonly known as lookahead and lookbehind.

Forward pre-search: (?=xxxxx), (?!xxxxx)

Format: (?=xxxxx). In the string being matched, it attaches a condition to the "gap" or "end" where it is located: the right side of the gap must be able to match the xxxxx expression. Because it serves only as a condition on the gap, it consumes no characters, so the expressions that follow still match the characters after the gap. This is similar to \b, which also matches no characters itself: \b only inspects the characters before and after the gap and makes a judgment, without affecting what the following expressions match.

Example 1: When the expression Windows (?=NT|XP) (note the space after Windows) matches Windows 98, Windows NT, Windows 2000, it matches only the "Windows " in Windows NT; the other occurrences of Windows are not matched.

Example 2: When the expression (\w)((?=\1\1\1)(\1))+ matches the string aaa ffffff 999999999, it matches the first 4 of the 6 f characters and the first 7 of the 9 9 characters. The expression can be read as: when a letter or digit is repeated 4 or more times, match the part before the last 2 characters. Of course, the expression does not have to be written this way; it is used here only for demonstration.

Format: (?!xxxxx). The right side of the gap must not match the xxxxx expression.

Example 3: When the expression ((?!\bstop\b).)+ matches fdjka ljfdl stop fjdsla fdj, it matches from the beginning up to the position just before stop. If the string contains no stop, it matches the entire string.

Example 4: When the expression do(?!\w) matches the string done, do, dog, it can only match do. In this example, using (?!\w) after do has the same effect as using \b.
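The forward pre-search (lookahead) examples can be tried with the sketch below, assuming Python's `re` module. In the first pattern the space after Windows is part of the expression, and the repeated-character test string contains exactly nine 9 characters.

```python
import re

# (?=...) requires the text to the right to match, without consuming it.
win = [m.span() for m in re.finditer(
    r"Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000")]
print(win)  # [(12, 20)]

# Each repetition needs three more copies ahead, so the last two are left out.
runs = [m.group() for m in re.finditer(
    r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(runs)  # ['ffff', '9999999']

# (?!...) requires the text to the right NOT to match.
before_stop = re.search(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj")
print(repr(before_stop.group()))  # 'fdjka ljfdl '

# do(?!\w) matches only the standalone "do".
do_only = [m.span() for m in re.finditer(r"do(?!\w)", "done, do, dog")]
print(do_only)  # [(6, 8)]
```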

Reverse pre-search: (?<=xxxxx), (?<!xxxxx)

These two formats are conceptually similar to forward pre-search. The condition for reverse pre-search applies to the "left side" of the gap: the two formats require, respectively, that the left side must match and must not match the specified expression, instead of judging the right side. As with forward pre-search, both are conditions attached to the gap and match no characters themselves.
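A minimal sketch of reverse pre-search (lookbehind), assuming Python's `re` module. Both patterns and test strings are illustrative assumptions. Python additionally requires the lookbehind expression to have a fixed width, a restriction that varies between engines.

```python
import re

# (?<=\d{4}) requires four digits to the left; (?=\d{4}) four to the right.
middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456")
print(middle.group())  # 56789012

# (?<!un) requires that the text to the left is NOT "un".
happy = [m.span() for m in re.finditer(r"(?<!un)happy", "unhappy happy")]
print(happy)  # [(8, 13)]
```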


4. Other general rules

4.1 Rule 1

In expressions, you can use \xXX and \uXXXX to represent a character (X represents a hexadecimal number)
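For instance, in Python's `re` module (assumed here as the engine), both escape forms are understood by the pattern parser itself. Support for \uXXXX differs between engines, so this is a sketch rather than a universal rule.

```python
import re

# \x41 is the character with hexadecimal code 41, i.e. "A".
hex_match = re.search(r"\x41", "ABC")
print(hex_match.group())  # A

# \u4e2d is the character with code point U+4E2D.
uni_match = re.search(r"\u4e2d", "中文")
print(uni_match.group())  # 中
```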


4.2 Rule 2

The expressions \s, \d, \w, and \b have special meanings, and the corresponding uppercase letters (\S, \D, \W, \B) have the opposite meanings.
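The uppercase forms can be checked with the sketch below (Python `re` assumed; the test strings are illustrative).

```python
import re

non_digits = re.findall(r"\D", "a1b2")        # everything except digits
non_spaces = re.findall(r"\S+", "a b\tc")     # runs of non-whitespace
non_word = re.findall(r"\W", "a_b-c!")        # not letter, digit, or _
inside_word = re.search(r"\Bend", "weekend end")  # "end" NOT at a word start

print(non_digits)          # ['a', 'b']
print(non_spaces)          # ['a', 'b', 'c']
print(non_word)            # ['-', '!']
print(inside_word.span())  # (4, 7)
```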

4.3 Rule 3

Some characters have a special meaning in expressions and need a \ in front of them to match the character itself.
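As a sketch of this rule in Python (an assumed engine), a special character can be escaped by hand, or a whole literal string can be escaped with the standard `re.escape` function. The sample strings are hypothetical.

```python
import re

# Unescaped . matches any character; escaped \. matches only a dot.
loose = re.fullmatch(r"a.c", "abc")
strict = re.fullmatch(r"a\.c", "abc")
strict_dot = re.fullmatch(r"a\.c", "a.c")
print(loose is not None, strict is None, strict_dot is not None)

# re.escape adds the backslashes needed to match a string literally.
literal = "$5.00 (approx)"
found = re.search(re.escape(literal), "price: $5.00 (approx)")
print(found.group())  # $5.00 (approx)
```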

4.4 Rule 4

If you do not want the content matched by a pair of brackets ( ) to be recorded for later use, you can use the (?:xxxxx) format.

Example 1: When the expression (?:(\w)\1)+ matches "a bbccdd efg", the result is "bbccdd". The match of the (?:) bracket is not recorded, so \1 refers to (\w).
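The non-capturing group example can be verified as follows (Python `re` assumed). Because the outer (?:...) is not recorded, the inner (\w) is group 1.

```python
import re

m = re.search(r"(?:(\w)\1)+", "a bbccdd efg")
print(m.group(), m.span())  # bbccdd (2, 8)

# Only one group is recorded; it holds the value from the last repetition.
print(m.groups())  # ('d',)
```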

4.5 Rule 5

Introduction to commonly used expression attribute settings: Ignorecase, Singleline, Multiline, Global
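How these settings are spelled depends on the engine. The sketch below shows one possible mapping onto Python's `re` module and rests on assumptions: Ignorecase is taken as `re.IGNORECASE`, Singleline as `re.DOTALL`, Multiline as `re.MULTILINE`, and Global as using `findall` instead of `search`.

```python
import re

# Ignorecase: letters match regardless of case.
ci = re.search(r"tom", "I'm Tom", re.IGNORECASE)

# Singleline: . also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# Multiline: ^ and $ also match at the start and end of each line.
lines_default = re.findall(r"^\w+", "one\ntwo")
lines_multi = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Global: return every match instead of only the first one.
first_only = re.search(r"\d", "a1b2c3").group()
all_matches = re.findall(r"\d", "a1b2c3")

print(ci.group(), dot_default, dot_all is not None)
print(lines_default, lines_multi, first_only, all_matches)
```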


