Regular expression tutorial: detailed explanation of matching a group of characters
This article explains how to match one character out of a group of characters in a regular expression. The details are as follows:
Note: In all examples, the text matched by the regular expression is enclosed between 【 and 】 in the result. Some examples are implemented in Java, and Java-specific regular expression usage is explained where it comes up. All Java examples were tested under JDK 1.6.0_13.
1. Match one of multiple characters
The previous article, "Regular Expression Tutorial: Detailed Explanation of Matching a Single Character", showed an example of matching text files whose names start with na or sa, using the regular expression .a.\.txt. If there is another file called cal.txt, it will also be matched. What if we only want to match files starting with na or sa?
Since we only want n or s, using ., which matches any character, will obviously not work. In regular expressions, [ and ] define a character set: all characters between these two metacharacters are members of the set, and the set matches any single character that is one of its members.
Let’s look at an example similar to the previous one:
Text:
sales.txt
na1.txt
na2.txt
sa1.txt
sanatxt.txt
cal.txt
Regular expression: [ns]a.\.txt
Result:
sales.txt
【na1.txt】
【na2.txt】
【sa1.txt】
sanatxt.txt
cal.txt
Analysis: The regular expression used here starts with [ns]. This set matches the character n or the character s and nothing else. [ and ] themselves do not match any characters; they only delimit the set. Next, a matches the character a, . matches any single character, \. matches a literal . character, and txt matches the literal text txt. The matching results are consistent with our expectations.
However, if one of the files were named usa1.txt, the sa1.txt part of it would also be matched. This is a position-matching problem, which will be discussed later.
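The example above can be reproduced with a minimal Java sketch. The class name CharSetDemo and the helper firstMatch are hypothetical names chosen for illustration; the sketch assumes each file name is searched with Matcher.find(), which looks for the pattern anywhere in the input rather than requiring the whole input to match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSetDemo {
    // [ns] matches exactly one character, either n or s.
    private static final Pattern PATTERN = Pattern.compile("[ns]a.\\.txt");

    // Returns the first substring matched by the pattern, or null if there is none.
    static String firstMatch(String fileName) {
        Matcher matcher = PATTERN.matcher(fileName);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "sales.txt", "na1.txt", "na2.txt", "sa1.txt",
                "sanatxt.txt", "cal.txt", "usa1.txt");
        for (String name : names) {
            // Prints the matched part of each name, or "null" when nothing matches.
            System.out.println(name + " -> " + firstMatch(name));
        }
    }
}
```

Because find() searches inside the string, usa1.txt yields the match sa1.txt, which is the position-matching problem mentioned above.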
2. Use the character set interval
In the above example, what if we only want to match files that start with na or sa and are followed by a digit? In the regular expression [ns]a.\.txt, . matches any character, not only digits. This problem can be solved with another character set:
Text:
sales.txt
na1.txt
na2.txt
sa1.txt
san.txt
sanatxt.txt
cal.txt
Regular expression: [ns]a[0123456789]\.txt
Result:
sales.txt
【na1.txt】
【na2.txt】
【sa1.txt】
san.txt
sanatxt.txt
cal.txt
Analysis: As the results show, only files that start with na or sa followed by a digit are matched. san.txt is not matched because the character set [0123456789] restricts the third character to a digit.
Some character ranges, such as 0-9 and a-z, are used very frequently. To simplify their definition, regular expressions provide a special metacharacter, -, which defines a character range inside a set. The example above can therefore be written as [ns]a[0-9]\.txt, and the result is exactly the same.
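As a quick check that the enumerated set and the range behave the same on the sample files, here is a small Java sketch. RangeDemo and found are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RangeDemo {
    // True if the regular expression matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String enumerated = "[ns]a[0123456789]\\.txt";
        String ranged = "[ns]a[0-9]\\.txt";
        List<String> names = Arrays.asList(
                "sales.txt", "na1.txt", "na2.txt", "sa1.txt",
                "san.txt", "sanatxt.txt", "cal.txt");
        for (String name : names) {
            // Both columns are expected to agree for every file name.
            System.out.println(name + ": " + found(enumerated, name)
                    + " / " + found(ranged, name));
        }
    }
}
```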
The character range is not limited to numbers. The following are legal character ranges:
[A-F]: Matches all uppercase letters from A to F.
[A-Z]: Matches all uppercase letters from A to Z.
[A-z]: Matches all characters from ASCII character A to ASCII character z. This range is shown only as an example and is generally not used, because it also contains characters such as [ and ^, which sit between Z and a in the ASCII table.
The first and last characters of a character range can be any characters in the ASCII table, but in practice the most commonly used ranges are digits and letters.
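The pitfall of [A-z] can be seen in a short Java sketch. AsciiRangeDemo and matchesWhole are hypothetical names; the sketch uses Pattern.matches, which requires the entire input to match the expression.

```java
import java.util.regex.Pattern;

class AsciiRangeDemo {
    // True only if the whole input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] samples = {"Q", "q", "[", "^", "_", "5"};
        for (String s : samples) {
            // [A-z] also accepts the punctuation that lies between Z and a in ASCII,
            // while [A-Za-z] accepts letters only.
            System.out.println(s + ": [A-z]=" + matchesWhole("[A-z]", s)
                    + ", [A-Za-z]=" + matchesWhole("[A-Za-z]", s));
        }
    }
}
```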
Note: When defining a character range, the last character must not be smaller than the first; a range such as [9-0] is not allowed. Also, - acts as a metacharacter only between [ and ]. Anywhere outside [ and ] it is an ordinary character and matches only - itself.
Multiple character ranges can be given in the same character set. For example, [0-9a-zA-Z] matches any single digit or uppercase or lowercase letter.
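Both rules in this note can be checked in Java with a small sketch. RangeRulesDemo, compiles, and matchesWhole are hypothetical helper names. In Java, a reversed range is rejected when the pattern is compiled, by throwing PatternSyntaxException; other regular expression engines may report the error differently.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RangeRulesDemo {
    // True if the regular expression is accepted by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True only if the whole input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println("[0-9] compiles: " + compiles("[0-9]"));
        System.out.println("[9-0] compiles: " + compiles("[9-0]"));
        // Outside [ and ], the hyphen is a literal character.
        System.out.println("a-z matches \"a-z\": " + matchesWhole("a-z", "a-z"));
        System.out.println("a-z matches \"m\": " + matchesWhole("a-z", "m"));
        System.out.println("[a-z] matches \"m\": " + matchesWhole("[a-z]", "m"));
    }
}
```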
Let’s take a look at an example of matching colors on a web page:
Text:
<span style="background-color:#3636FF;height:30px; width:60px;">Test</span>
Regular expression: #[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
Result:
<span style="background-color:【#3636FF】;height:30px; width:60px;">Test</span>
Analysis: In web pages, a color is generally written as an RGB value starting with #, where R is red, G is green, and B is blue. Any color can be produced by mixing different amounts of red, green, and blue. RGB values are written in hexadecimal: #000000 is black, #FFFFFF is white, and #FF0000 is red. The regular expression for matching colors therefore starts with # followed by six repetitions of the set [0-9A-Fa-f]. This can be abbreviated as #[0-9A-Fa-f]{6}, which will be discussed later under repeat matching.
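The color example can be sketched in Java as follows. HexColorDemo and findColors are hypothetical names; the sketch collects every match with Matcher.find() and runs both the long form and the {6} abbreviation so their results can be compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexColorDemo {
    static final String LONG_FORM =
            "#[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]";
    static final String SHORT_FORM = "#[0-9A-Fa-f]{6}";

    // Returns every substring of text matched by the given regular expression.
    static List<String> findColors(String regex, String text) {
        List<String> colors = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            colors.add(matcher.group());
        }
        return colors;
    }

    public static void main(String[] args) {
        String html = "<span style=\"background-color:#3636FF;height:30px; "
                + "width:60px;\">Test</span>";
        System.out.println(findColors(LONG_FORM, html));
        System.out.println(findColors(SHORT_FORM, html));
    }
}
```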
3. Negating a character set
A character set usually specifies the characters of which one must be matched. In some cases we need the opposite: a set of characters that must not be matched. In other words, any character except those in the set can be matched.
For example, to match files that begin with na or sa followed by a character that is not a digit:
Text:
sales.txt
na1.txt
na2.txt
sa1.txt
sanatxt.txt
san.txt
Regular expression: [ns]a[^0-9]\.txt
Result:
sales.txt
na1.txt
na2.txt
sa1.txt
sanatxt.txt
【san.txt】
Analysis: The pattern used in this example is exactly the opposite of the previous one. The earlier [0-9] matches only digits, whereas [^0-9] here matches anything that is not a digit.
Note: ^ immediately after the opening [ negates the set. If ^ appears at the beginning of the regular expression instead, it is a position anchor, which will be discussed later. The effect of ^ applies to all characters and ranges in the set, not just the one immediately following it. For example, [^0-9a-z] matches any character that is neither a digit nor a lowercase letter.
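A short Java sketch of negation follows. NegationDemo, found, and matchesWhole are hypothetical names; the first helper searches inside the input and the second requires the whole input to match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class NegationDemo {
    // True if the regular expression matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the whole input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "sales.txt", "na1.txt", "na2.txt", "sa1.txt",
                "sanatxt.txt", "san.txt");
        for (String name : names) {
            System.out.println(name + ": " + found("[ns]a[^0-9]\\.txt", name));
        }
        // ^ negates every member of the set: digits and lowercase letters alike.
        for (String s : new String[] {"5", "q", "Q", "-"}) {
            System.out.println(s + ": " + matchesWhole("[^0-9a-z]", s));
        }
    }
}
```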
4. Summary
The metacharacters [ and ] define a character set, which matches exactly one character that is a member of the set. There are two ways to define a set: list all its characters, or use the metacharacter - to give character ranges. A set can be negated with the metacharacter ^, which excludes the given characters from matching so that any character not in the set can be matched.
In the next article, we will discuss the use of some metacharacters in regular expressions.
I hope this article will be helpful for everyone to learn regular expressions.