What exactly are regular expressions?
Characters are the most basic unit that software handles when processing text; a character may be a letter, a digit, a punctuation mark, a space, a newline, a Chinese character, and so on. A string is a sequence of 0 or more characters. Text is simply written characters, that is, a string. To say that a string matches a regular expression usually means that part (or several parts) of the string satisfies the conditions given by the expression.
When writing programs or web pages that process strings, there is often a need to find strings that match certain complex rules. Regular expressions are tools used to describe these rules. In other words, regular expressions are codes that record text rules.
You have very likely used the wildcards for file searches under Windows/DOS, that is, * and ?. If you wanted to find all Word documents in a directory, you would search for *.doc. Here, * is interpreted as an arbitrary string. Like wildcards, regular expressions are tools for text matching, but they can describe your needs more precisely than wildcards, at the cost, of course, of being more complicated. For example, you can write a regular expression to find all strings that start with 0, followed by 2-3 digits, then a hyphen "-", and finally 7 or 8 digits (like 010-12345678 or 0376-7654321).
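To make the comparison concrete, here is a minimal sketch in Python (this tutorial itself targets .Net, so the language is a choice made for illustration). The file names are made up, and the pattern 0\d{2,3}-\d{7,8} is only one possible rendering of the phone rule described above, using syntax that is introduced later.

```python
import fnmatch
import re

# Wildcard matching: * stands for an arbitrary string.
files = ["report.doc", "notes.txt", "thesis.doc"]  # hypothetical file names
word_docs = [name for name in files if fnmatch.fnmatchcase(name, "*.doc")]

# Regular expression: 0, then 2-3 digits, a hyphen, then 7 or 8 digits.
phone = re.compile(r"0\d{2,3}-\d{7,8}")
numbers = ["010-12345678", "0376-7654321", "10-12345678"]
matched = [n for n in numbers if phone.fullmatch(n)]

print(word_docs)  # ['report.doc', 'thesis.doc']
print(matched)    # ['010-12345678', '0376-7654321']
```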
Getting Started
The best way to learn regular expressions is to start with examples. After understanding the examples, you can modify and experiment with them yourself. A number of simple examples are given below, and they are explained in detail.
Suppose you are searching for hi in an English novel. You can use the regular expression hi.
This is almost the simplest regular expression. It can accurately match a string like this: it consists of two characters, the first character is h, and the last character is i. Usually, tools that process regular expressions will provide an option to ignore case. If this option is selected, it can match any of the four cases hi, HI, Hi, and hI.
Unfortunately, many words contain the two consecutive characters hi, such as him, history, high, etc. If you use hi to search, the hi here will also be found. If we want to find the word hi accurately, we should use \bhi\b.
\b is a special code specified by regular expressions (well, some people call it a metacharacter), which represents the beginning or end of a word, which is the boundary of a word. Although English words are usually separated by spaces, punctuation marks, or newlines, \b does not match any of these word-separating characters, it only matches one position.
To be more precise, \b matches a position where the preceding and following characters are not both \w: one of them is matched by \w, and the other either is not or does not exist.
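The difference between hi and \bhi\b can be checked with a short Python sketch. The sample sentence is invented, and Python's re module stands in for the .Net engine this tutorial describes; for this simple case the two are assumed to behave the same way.

```python
import re

text = "hi, this is his history; say hi!"  # made-up sample text

# Plain "hi" is found inside other words as well.
plain = re.findall(r"hi", text)

# \b on both sides restricts the match to the standalone word.
words = re.findall(r"\bhi\b", text)

# The ignore-case option makes hi, HI, Hi and hI all match.
any_case = re.findall(r"\bhi\b", "hi HI Hi hI", re.IGNORECASE)

print(len(plain))  # 5
print(words)       # ['hi', 'hi']
print(any_case)    # ['hi', 'HI', 'Hi', 'hI']
```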
If what you are looking for is hi followed by a Lucy not far away, you should use \bhi\b.*\bLucy\b.
Here, . is another metacharacter, matching any character except newline characters. * is also a metacharacter, but it represents not a character, nor a position, but a quantity - it specifies that the content before * can be repeatedly used any number of times to make the entire expression match. Therefore, .* together means any number of characters that do not include a newline. Now the meaning of \bhi\b.*\bLucy\b is obvious: first the word hi, then any number of characters (but not newlines), and finally the word Lucy.
The newline character is '\n', the character whose ASCII encoding is 10 (hexadecimal 0x0A).
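A small Python sketch of the hi ... Lucy expression, with invented sample sentences. It also shows that . stops at a newline, so the two words must be on the same line for this expression to match.

```python
import re

pattern = re.compile(r"\bhi\b.*\bLucy\b")

# Both words on one line: .* bridges the text between them.
same_line = pattern.search("hi there, have you seen Lucy today?")

# A newline between them: . does not match '\n', so there is no match.
split_lines = pattern.search("hi there\nLucy is here")

print(same_line.group())  # hi there, have you seen Lucy
print(split_lines)        # None
```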
If other metacharacters are used at the same time, we can construct more powerful regular expressions. For example, the following example:
0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, followed by two digits, then a hyphen "-", and finally 8 digits (that is, a Chinese phone number. Of course, this example only matches numbers with a 3-digit area code).
\d here is a new metacharacter that matches a single digit (0, or 1, or 2, or...). - is not a metacharacter; it only matches itself: the hyphen (or minus sign, or dash, or whatever you want to call it).
In order to avoid so many annoying repetitions, we can also write this expression like this: 0\d{2}-\d{8}. The {2}({8}) after \d here means that the previous \d must be repeated and matched 2 times (8 times) in a row.
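The following Python sketch checks that the long and the short form accept the same strings. The sample numbers are invented, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

samples = ["010-12345678", "0376-7654321", "010-1234567", "110-12345678"]

# For each sample: does the long form match, does the short form match?
results = [
    (s, bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
]

for row in results:
    print(row)
# Only '010-12345678' matches, and the two forms always agree.
```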
Testing regular expressions
Other available testing tools:
RegexBuddy
Javascript Regular Expression Online Testing Tool
If you don't find regular expressions difficult to read and write, either you are a genius or you are not from Earth. The syntax of regular expressions can be confusing even for people who use it regularly. Because they are hard to read and write and easy to get wrong, you need a tool for testing regular expressions.
Some details of regular expressions differ between environments. This tutorial describes the behavior of regular expressions under Microsoft .Net Framework 2.0, so I will introduce a .Net tool, Regex Tester. First make sure you have .Net Framework 2.0 installed, and then download Regex Tester. It is portable software that needs no installation: after downloading, open the compressed package and run RegexTester.exe directly.
The following is a screenshot of Regex Tester running:
Metacharacters
Now you already know several useful metacharacters, such as \b, ., *, and \d. Regular expressions have more metacharacters: for example, \s matches any whitespace character, including spaces, tabs, newlines, Chinese full-width spaces, etc., and \w matches letters, digits, underscores, Chinese characters, etc.
Special processing of Chinese/Chinese characters is supported by the regular expression engine provided by .Net. For details in other environments, please check the relevant documents.
Let’s look at more examples:
\ba\w*\b matches words starting with the letter a: first the beginning of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).
Okay, now let's talk about what a word means in regular expressions: one or more consecutive \w characters. Yes, this really has little to do with the thousands of same-named things you have to memorize when learning English :)
\d+ matches 1 or more consecutive numbers. The + here is a metacharacter similar to *, the difference is that * matches repeated any number of times (possibly 0 times), while + matches repeated 1 or more times.
\b\w{6}\b matches words of exactly 6 characters.
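The three examples above can be tried out with a short Python sketch; the sample sentence is made up for the purpose.

```python
import re

text = "an apple a day keeps 3 doctors and 12 nurses away"

a_words = re.findall(r"\ba\w*\b", text)      # words starting with a
numbers = re.findall(r"\d+", text)           # runs of 1 or more digits
six_letter = re.findall(r"\b\w{6}\b", text)  # words of exactly 6 characters

print(a_words)     # ['an', 'apple', 'a', 'and', 'away']
print(numbers)     # ['3', '12']
print(six_letter)  # ['nurses']
```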
Table 1. Commonly used metacharacters
Regular expression engines usually provide a method that tests whether a given string matches a regular expression, such as RegExp.test() in JavaScript or Regex.IsMatch() in .NET. Matching here means that some part of the string conforms to the expression. Without ^ and $, testing against \d{5,12} with such a method only guarantees that the string contains a run of 5 to 12 consecutive digits somewhere (so a longer run of digits passes too), not that the entire string consists of 5 to 12 digits.
Metacharacters ^ (the symbol on the same key as the number 6) and $ both match a position, which is somewhat similar to \b. ^ matches the beginning of the string you are looking for, and $ matches the end. These two codes are very useful when verifying the input content. For example, if a website requires that the QQ number you fill in must be 5 to 12 digits, you can use: ^\d{5,12}$.
{5,12} here is similar to the {2} introduced earlier, except that {2} means exactly 2 repetitions, no more and no less, while {5,12} means the number of repetitions must be no fewer than 5 and no more than 12; otherwise there is no match.
Because ^ and $ are used, the entire input string must match \d{5,12}; that is, the entire input must be 5 to 12 digits. So if the QQ number you enter matches this regular expression, it meets the requirement.
Similar to the option to ignore case, some regular expression processing tools also have an option to process multiple lines. If this option is selected, the meaning of ^ and $ becomes the start and end of the matched line.
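A minimal Python sketch of the effect of ^ and $, and of the multiline option. Python's re.search plays the role of the "does any part match" methods mentioned above, and re.MULTILINE is Python's name for the multiline option; the input strings are invented.

```python
import re

unanchored = re.compile(r"\d{5,12}")
anchored = re.compile(r"^\d{5,12}$")

# Without anchors, a digit run anywhere in the string is enough.
contains_digits = bool(unanchored.search("id: 1234567 (old)"))  # True
# With anchors, the whole input must be 5 to 12 digits.
whole_input = bool(anchored.search("id: 1234567 (old)"))        # False
valid_qq = bool(anchored.search("1234567"))                     # True
too_long = bool(anchored.search("1234567890123"))               # False, 13 digits

# With the multiline option, ^ and $ match at the start and end of each line.
lines = "12345\nabc\n678901"
default_mode = re.findall(r"^\d{5,12}$", lines)
multiline_mode = re.findall(r"^\d{5,12}$", lines, re.MULTILINE)

print(default_mode)    # []
print(multiline_mode)  # ['12345', '678901']
```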
Character escape
If you want to search for the metacharacters themselves, for example, if you search for ., or *, there is a problem: you cannot specify them, because they will be interpreted as something else. At this time you have to use \ to cancel the special meaning of these characters. Therefore, you should use \. and \*. Of course, to find \ itself, you also have to use \\.
For example: deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
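The two escaping examples can be sketched in Python as follows. Raw strings (r"...") are used so that Python itself does not interpret the backslashes; re.escape is a Python helper, not something from the .Net tooling this tutorial describes, and the sample strings are invented.

```python
import re

# \. matches a literal period only.
domain = re.search(r"deerchao\.net", "visit deerchao.net today")
# An unescaped . matches any character, which is usually not what is wanted.
unescaped = re.search(r"deerchao.net", "deerchaoXnet")

# \\ in the pattern matches one literal backslash in the text.
path = re.search(r"C:\\Windows", r"C:\Windows\System32")

# re.escape adds the backslashes for you when the text is not known in advance.
escaped = re.escape("1.5*2")
literal_hit = re.fullmatch(escaped, "1.5*2")
literal_miss = re.fullmatch(escaped, "1x5*2")

print(domain.group())     # deerchao.net
print(unescaped.group())  # deerchaoXnet
print(path.group())       # C:\Windows
```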
Repeat
You have already seen the repetition forms *, +, {2}, and {5,12}. Here are all the qualifiers in regular expressions (codes that specify a count, such as * and {5,12}; they are more commonly called quantifiers):
Table 2. Commonly used qualifiers
Here are some examples of using repetition:
Windows\d+ matches Windows followed by 1 or more digits
^\w+ matches the first word of a line (or the first word of the entire string; which one depends on the option settings)
Character class
Finding digits, letters or digits, or whitespace is easy, because metacharacters already exist for those character sets. But what if you want to match a character set that has no predefined metacharacter (such as the vowels a, e, i, o, u)?
It's very simple: just list them in square brackets. For example, [aeiou] matches any English vowel, and [.?!] matches a punctuation mark (. or ? or !).
We can also easily specify a character range. [0-9] means exactly the same as \d: one digit. Similarly, [a-z0-9A-Z_] is completely equivalent to \w (if only English is considered).
The following is a more complex expression: \(?0\d{2}[) -]?\d{8}.
"(" and ")" are also metacharacters, which will be mentioned in the grouping section later, so they need to be escaped here.
This expression can match phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it: first comes an escaped character \(, which may appear 0 or 1 times (?); then a 0, followed by 2 digits (\d{2}); then one of ), -, or a space, which appears once or not at all (?); and finally 8 digits (\d{8}).
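A short Python sketch of this expression against the formats listed above. Two mismatched inputs are included as an extra check; they show that the optional pieces are independent of each other, so the pattern is more permissive than it looks.

```python
import re

phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")

intended = ["(010)88886666", "022-22334455", "02912345678"]
mismatched = ["010)12345678", "(022-87654321"]

intended_ok = [bool(phone.fullmatch(s)) for s in intended]
mismatched_ok = [bool(phone.fullmatch(s)) for s in mismatched]

print(intended_ok)    # [True, True, True]
print(mismatched_ok)  # [True, True]
```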
Branch conditions
Unfortunately, the expression just now can also match "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branch conditions. Branch conditions in regular expressions are several rules of which any one being satisfied counts as a match; the rules are separated with |. Don't understand? It doesn't matter. Look at the examples:
0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: one with a three-digit area code and an 8-digit local number (such as 010-12345678), and the other with a 4-digit area code and a 7-digit local number (such as 0376-2233445).
\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and the local number may be separated by a hyphen or a space, or not separated at all. You can try using branch conditions to extend this expression to also support 4-digit area codes.
\d{5}-\d{4}|\d{5} This expression is used to match zip codes in the United States. The rule for US zip codes is 5 digits, or 9 digits separated by hyphens. The reason why this example is given is because it can illustrate a problem: when using branch conditions, pay attention to the order of each condition. If you change it to \d{5}|\d{5}-\d{4}, then only 5-digit zip codes (and the first 5 digits of 9-digit zip codes) will be matched. The reason is that when matching branch conditions, each condition will be tested from left to right. If a certain branch is met, other conditions will not be considered.
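The effect of branch order can be seen in a small Python sketch. The sample text is invented, and Python's re module is assumed to try the branches from left to right in the same way as the engine described here.

```python
import re

text = "ZIP 12345-6789"

# Longer branch first: the 9-digit form is found.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", text).group()
# Shorter branch first: it succeeds immediately, so the rest is never tried.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", text).group()

# Branches also fix the mismatched-parenthesis problem from earlier.
strict_phone = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")
accepted = bool(strict_phone.fullmatch("(010)88886666"))
rejected = [bool(strict_phone.fullmatch(s)) for s in ["010)12345678", "(022-87654321"]]

print(long_first)   # 12345-6789
print(short_first)  # 12345
print(rejected)     # [False, False]
```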
Group
We have already seen how to repeat a single character (just add the qualifier directly after the character), but what if you want to repeat multiple characters? You can use parentheses to specify a subexpression (also called a group), and then specify the number of repetitions for that subexpression. You can also perform other operations on the subexpression (introduced later).
(\d{1,3}\.){3}\d{1,3} is a simple IP address matching expression. To understand it, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a 1-to-3-digit number followed by a period (together they form the group), repeated three times; and finally comes one more 1-to-3-digit number (\d{1,3}).
No number in the IP address can be greater than 255. Don’t be fooled by the writers of the third season of "24"...
Unfortunately, it will also match an impossible IP address like 256.300.888.999. If arithmetic comparison were available, this problem could be solved simply, but regular expressions provide no mathematical functions, so a correct IP address can only be described with lengthy grouping, alternation, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
The key to understanding this expression is to understand 2[0-4]\d|25[0-5]|[01]?\d\d?. I won't go into details here; you should be able to analyze its meaning yourself.
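Here is a Python sketch comparing the simple and the strict IP expression. The variable octet is just a name chosen here for the repeated 0-255 part, and fullmatch is used so that the whole string must be an address.

```python
import re

loose = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One number from 0 to 255: 200-249, 250-255, or 0-199.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict = re.compile(r"(" + octet + r"\.){3}" + octet)

samples = ["192.168.0.1", "255.255.255.255", "256.300.888.999", "256.1.1.1"]
loose_ok = [bool(loose.fullmatch(s)) for s in samples]
strict_ok = [bool(strict.fullmatch(s)) for s in samples]

print(loose_ok)   # [True, True, True, True]
print(strict_ok)  # [True, True, False, False]
```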
Antonym
Sometimes you need to find characters that do not belong to an easily defined character class. For example, to find any character other than a digit, you need to use an antonym:
Table 3. Commonly used antonym codes
Backreference
After parentheses are used to specify a subexpression, the text matching this subexpression (that is, the content captured by this group) can be processed further in the expression or in other programs. By default, each group automatically gets a group number. The rule is: counting from left to right by each group's left parenthesis, the first group is number 1, the second is number 2, and so on.
Uh... Actually, group number allocation is not as simple as I just said:
Group 0 corresponds to the entire regular expression
In fact, group numbers are assigned in two left-to-right scans: the first pass assigns numbers only to unnamed groups, and the second pass only to named groups, so all named groups have larger group numbers than the unnamed groups
You can use syntax such as (?:exp) to deprive a group of the right to participate in group number allocation.
Backreferences are used to search again for text matched by an earlier group. For example, \1 represents the text matched by group 1. Hard to understand? See the example:
\b(\w+)\b\s+\1\b can be used to match repeated words, like go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between the beginning and end of a word (\b(\w+)\b). This word is captured in group number 1. Then come one or more whitespace characters (\s+), and finally the content captured in group 1, that is, the previously matched word (\1).
You can also specify the group name of a subexpression yourself. To do so, use the syntax (?<Word>\w+) (or replace the angle brackets with single quotes: (?'Word'\w+)), which gives the \w+ group the name Word. To backreference the content captured by this group, use \k<Word>, so the previous example could also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
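A Python sketch of both forms follows. Note the assumption that matters here: Python's re module spells named groups (?P<Word>...) and named backreferences (?P=Word), and it does not accept the .Net spelling \k<Word>, so the second pattern is a translation rather than the expression given above. The sample sentence is invented.

```python
import re

text = "we go go to see the kitty kitty now"

# Numbered backreference: \1 repeats whatever group 1 captured.
numbered = [m.group() for m in re.finditer(r"\b(\w+)\b\s+\1\b", text)]

# Named group and named backreference, in Python's own syntax.
named_pattern = re.compile(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b")
named = [m.group("Word") for m in named_pattern.finditer(text)]

print(numbered)  # ['go go', 'kitty kitty']
print(named)     # ['go', 'kitty']
```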
When using parentheses, there are many special-purpose syntaxes. Some of the most commonly used ones are listed below:
Table 4. Common grouping syntax
Zero-width assertion
Earthlings, do you think these terms are too complicated and too hard to remember? I feel the same way. Just know that such a thing exists; as for what it is called, let it go! A person without a name can concentrate on sword practice; a thing without a name can be picked freely...
The next four constructs are used to find things before or after certain content (but not including that content). In other words, like \b, ^, and $, they specify a position that must satisfy certain conditions (assertions), so they are also called zero-width assertions. Examples illustrate this best:
Assertion is used to declare a fact that should be true. Regular expression matching will only continue when the assertion is true.
(?=exp) is also called a zero-width positive lookahead assertion. It asserts that the expression exp can be matched immediately after its own position. For example, \b\w+(?=ing\b) matches the front part of a word ending in ing (everything except the ing). When searching I'm singing while you're dancing., it matches sing and danc.
(?<=exp) is also called a zero-width positive lookbehind assertion. It asserts that the expression exp can be matched immediately before its own position. For example, (?<=\bre)\w+\b matches the second half of a word starting with re (everything except the re). When searching reading a book, it matches ading.
If you want to add a comma between every three digits of a very long number (counting from the right, of course), you can find the parts that need commas in front of them and inside them like this: ((?<=\d)\d{3})+\b. When used to search 1234567890, the result is 234567890.
The following example uses both assertions: (?<=\s)\d+(?=\s) matches numbers delimited by whitespace characters (again, not including the whitespace).
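These assertions can be tried in Python, whose re module supports lookahead and fixed-width lookbehind. All four expressions in this section fit that restriction. The last sample string is invented.

```python
import re

# Lookahead: the word part in front of a trailing "ing".
ing_stems = re.findall(r"\b\w+(?=ing\b)", "I'm singing while you're dancing.")

# Lookbehind: the part of the word after a leading "re".
after_re = re.findall(r"(?<=\bre)\w+\b", "reading a book")

# Digit groups of three, each preceded by another digit.
grouped = re.search(r"((?<=\d)\d{3})+\b", "1234567890").group()

# Numbers with whitespace on both sides; the whitespace is not part of the match.
spaced = re.findall(r"(?<=\s)\d+(?=\s)", "a 12 b 345 c6 7")

print(ing_stems)  # ['sing', 'danc']
print(after_re)   # ['ading']
print(grouped)    # 234567890
print(spaced)     # ['12', '345']
```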
Negative zero-width assertion
Earlier we mentioned how to find characters that are not a certain character or not in a certain character class (antonyms). But what if we only want to make sure a certain character does not appear, without matching it? For example, if we want to find words in which the letter q appears but is not followed by the letter u, we can try this:
\b\w*q[^u]\w*\b matches words containing a letter q that is not followed by the letter u. But if you test more (or if you are sharp enough to see it directly), you will find that this expression goes wrong when q appears at the end of a word, as in Iraq or Benq. This is because [^u] always matches one character, so if q is the last character of the word, the following [^u] matches the word separator after q (a space, a period, or whatever), and the following \w*\b then matches the next word. As a result, \b\w*q[^u]\w*\b can match the whole of Iraq fighting. A negative zero-width assertion solves this problem because it only matches a position and consumes no characters. Now we can write: \b\w*q(?!u)\w*\b.
The zero-width negative lookahead assertion (?!exp) asserts that the expression exp cannot be matched after this position. For example, \d{3}(?!\d) matches three digits that are not followed by another digit, and \b((?!abc)\w)+\b matches words that do not contain the consecutive string abc.
Similarly, we can use (?<!exp), the zero-width negative lookbehind assertion, to assert that the expression exp cannot be matched before this position: (?<![a-z])\d{7} matches a seven-digit number that is not preceded by a lowercase letter.
Please analyze the expression (?<=<(\w+)>).*(?=<\/\1>) in detail; it best shows what zero-width assertions are really for.
A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag that has no attributes. (?<=<(\w+)>) specifies a prefix: a word enclosed in angle brackets (for example, <b>); then comes .* (an arbitrary string); and finally a suffix, (?=<\/\1>). Note the \/ in the suffix, which uses the character escape mentioned earlier. \1 is a backreference to the first captured group, the content matched by the earlier (\w+), so if the prefix is actually <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (again, not including the prefix and suffix themselves).
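The negative assertions from this section can be sketched in Python as follows; the sample strings are invented. The HTML expression is left out of this sketch because its lookbehind contains \w+, and Python's re module only accepts lookbehinds of fixed width.

```python
import re

# [^u] consumes a character, so it swallows the space and the next word.
greedy_bug = re.search(r"\b\w*q[^u]\w*\b", "Iraq fighting").group()

# (?!u) only checks the position and consumes nothing.
q_words = re.findall(r"\b\w*q(?!u)\w*\b", "Iraq fighting quit Benq")

# Three digits not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "1234 ab567")

# Words that do not contain "abc" anywhere.
no_abc = [m.group() for m in re.finditer(r"\b((?!abc)\w)+\b", "xabcx hello")]

# Seven digits not preceded by a lowercase letter.
seven = re.findall(r"(?<![a-z])\d{7}", "a1234567 B7654321")

print(greedy_bug)    # Iraq fighting
print(q_words)       # ['Iraq', 'Benq']
print(three_digits)  # ['234', '567']
print(no_abc)        # ['hello']
print(seven)         # ['7654321']
```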
Comments
Another use of parentheses is to include comments via the syntax (?#comment). For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199).
If you want to include comments, it is best to enable the "Ignore whitespace characters in pattern" option, so that you can freely add spaces, tabs, and newlines when writing the expression; they are ignored when the expression is actually used. With this option enabled, all text from # to the end of the line is also ignored as a comment. For example, we can write the previous expression like this:
(?<=      # assert the prefix of the text to be matched
<(\w+)>   # find letters or digits enclosed in angle brackets (that is, an HTML/XML tag)
)         # end of the prefix
.*        # match any text
(?=       # assert the suffix of the text to be matched
<\/\1>    # find content enclosed in angle brackets: a "/" followed by the previously captured tag
)         # end of the suffix
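In Python the corresponding option is re.VERBOSE, and the (?#comment) syntax works as well. The sketch below is an adaptation, not the expression above: Python's re module rejects the variable-width lookbehind (?<=<(\w+)>), so ordinary capture groups are swapped in and the content is read from group 2 instead.

```python
import re

tag_content = re.compile(r"""
    <(\w+)>   # opening tag: a word in angle brackets, captured as group 1
    (.*)      # the content we are after, captured as group 2
    </\1>     # closing tag: a slash followed by the tag name captured earlier
""", re.VERBOSE)

match = tag_content.search("say <b>hello</b> now")
print(match.group(1))  # b
print(match.group(2))  # hello

# Inline (?#...) comments need no special option.
octet = re.compile(r"2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)")
in_range = [bool(octet.fullmatch(s)) for s in ["249", "255", "256"]]
print(in_range)  # [True, True, False]
```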
Greedy and Lazy
When a regular expression contains a qualifier that allows repetition, the usual behavior is to match as many characters as possible (while still allowing the whole expression to match). Take a.*b as an example: it matches the longest string that starts with a and ends with b. If you use it to search aabab, it matches the entire string aabab. This is called greedy matching.
Sometimes, we need lazy matching, that is, matching as few characters as possible. The qualifiers given above can be converted into lazy matching patterns by appending a question mark ? after them. In this way, .*? means matching any number of repetitions, but using the fewest repetitions that make the entire match successful. Now look at the lazy version of the example:
a.*?b matches the shortest string starting with a and ending with b. If you apply it to aabab, it will match aab (characters 1 to 3) and ab (characters 4 to 5).
Why is the first match aab (characters 1 to 3) rather than ab (characters 2 to 3)? Simply put, regular expressions have another rule that takes priority over the lazy/greedy rule: the match that begins earliest wins.
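The aabab example can be reproduced with a few lines of Python. Note that the spans printed below are 0-based and end-exclusive, unlike the 1-based character positions used in the text.

```python
import re

greedy = re.findall(r"a.*b", "aabab")
lazy = re.findall(r"a.*?b", "aabab")
lazy_spans = [m.span() for m in re.finditer(r"a.*?b", "aabab")]

print(greedy)      # ['aabab']
print(lazy)        # ['aab', 'ab']
print(lazy_spans)  # [(0, 3), (3, 5)]
```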
Table 5. Lazy Qualifiers
Processing Options
In C#, you can use the Regex(String, RegexOptions) constructor to set regular expression processing options. For example: Regex regex = new Regex(@"\ba\w{6}\b", RegexOptions.IgnoreCase);
Several options, such as ignoring case and processing multiple lines, were mentioned above. These options change the way regular expressions are processed. The following are the commonly used regular expression options in .Net:
Table 6. Commonly used processing options
A frequently asked question is: is it true that only one of multiline mode and single-line mode can be used at a time? The answer is: no. The two options are unrelated, except that their names are confusingly similar.
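For comparison, here is a Python sketch of the same ideas. It assumes the usual correspondence between the option names: re.IGNORECASE for IgnoreCase, re.MULTILINE for multiline mode, and re.DOTALL for what .Net calls single-line mode. The last pattern is contrived so that it only matches when both options are switched on together.

```python
import re

# Same pattern as the C# example above, with the ignore-case option.
pattern = re.compile(r"\ba\w{6}\b", re.IGNORECASE)
found = pattern.findall("Another address, ANYBODY?")

text = "first line\nsecond line"
needs_both = r"line.^second"  # . must match '\n', and ^ must match at a line start

only_multiline = re.search(needs_both, text, re.MULTILINE)
only_dotall = re.search(needs_both, text, re.DOTALL)
both = re.search(needs_both, text, re.MULTILINE | re.DOTALL)

print(found)           # ['Another', 'address', 'ANYBODY']
print(only_multiline)  # None
print(only_dotall)     # None
print(both.group())    # 'line', a newline, then 'second'
```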
Balanced group/recursive matching
The balanced group syntax introduced here is supported by the .Net Framework; other languages and libraries may not support this feature, or may support it with a different syntax.
Sometimes we need to match a nestable hierarchical structure like (100 * (50 + 15)). In this case, simply using \(.+\) only matches the content between the leftmost left parenthesis and the rightmost right parenthesis (this describes greedy mode; lazy mode has a similar problem). If the left and right parentheses occur an unequal number of times in the original string, as in (5 / (3 + 2))), their counts in our match result will not be equal either. Is there a way to match the longest properly paired content between parentheses in such a string?
To keep ( and \( from completely scrambling your brain, let's use angle brackets instead of parentheses. Our question now becomes: in a string like xx <aa <bbb> <bbb> aa> yy, how do we capture the content inside the longest properly paired angle brackets?
The following syntax structure needs to be used here:
(?'group') Name the captured content group and push it onto the stack (Stack)
(?'-group') Pop the most recently pushed capture named group off the stack; if the stack is already empty, this group fails to match
(?(group)yes|no) If there is a captured content named group on the stack, continue to match the expression of the yes part, otherwise continue to match the no part
(?!) Zero-width negative lookahead assertion, since there is no suffix expression, trying to match always fails
If you are not a programmer (or you call yourself one but don't know what a stack is), you can understand the first three constructs above like this: the first writes a "group" on the blackboard, the second erases a "group" from the blackboard, and the third checks whether "group" is still written on the blackboard; if so, matching continues with the yes part, otherwise with the no part.
What we need to do is push an "Open" every time we meet a left bracket and pop one every time we meet a right bracket. At the end, we check whether the stack is empty. If it is not empty, there are more left brackets than right brackets, so the match should fail. The regular expression engine will then backtrack (giving up some of the leading or trailing characters) and try to match the whole expression again.
<                         # the outermost left bracket
    [^<>]*                # non-bracket content after the outermost left bracket
    (
        (
            (?'Open'<)    # met a left bracket: write an "Open" on the blackboard
            [^<>]*        # match non-bracket content after the left bracket
        )+
        (
            (?'-Open'>)   # met a right bracket: erase one "Open"
            [^<>]*        # match non-bracket content after the right bracket
        )+
    )*
    (?(Open)(?!))         # before the outermost right bracket, check whether any "Open" is still on the blackboard; if so, the match fails
>                         # the outermost right bracket
One of the most common applications of balanced groups is matching HTML. The following example matches nested <div> tags: <div[^>]*>[^<>]*(((?'Open'<div[^>]*>)[^<>]*)+((?'-Open'</div>)[^<>]*)+)*(?(Open)(?!))</div>.
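Python's re module has no balanced groups, so the expression above cannot be run there. As a rough illustration of the push and pop idea only, here is a sketch that replaces the regular expression with a hand-written scan and a depth counter. The function name first_balanced_span is made up for this example, and it returns the first properly paired span it finds.

```python
def first_balanced_span(text):
    """Return the first span of properly paired angle brackets, or None."""
    for start, ch in enumerate(text):
        if ch != "<":
            continue
        depth = 0  # number of "Open" marks currently on the blackboard
        for end in range(start, len(text)):
            if text[end] == "<":
                depth += 1  # push an "Open"
            elif text[end] == ">":
                depth -= 1  # pop an "Open"
                if depth == 0:
                    return text[start:end + 1]
        # Ran out of text with brackets still open: retry from a later "<",
        # which loosely mirrors the engine backtracking.
    return None


print(first_balanced_span("xx <aa <bbb> <bbb> aa> yy"))  # <aa <bbb> <bbb> aa>
print(first_balanced_span("<<a>"))                       # <a>
print(first_balanced_span("no brackets"))                # None
```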
Other things not mentioned
A large number of elements for constructing regular expressions have been described above, but there are still many things that have not been mentioned. Below is a list of some elements not mentioned, with syntax and simple explanations. You can find more detailed references online to learn about them when you need to use them. If you have installed the MSDN Library, you can also find detailed documentation on regular expressions under .net.
The introduction here is very brief. If you need more detailed information and do not have the MSDN Library installed on your computer, you can view the MSDN online documentation on regular expression language elements.
Table 7. Syntax not yet discussed in detail