Regular expression - matching rules


Basic Pattern Matching

Everything starts from the basics. Patterns are the most basic elements of regular expressions: a sequence of characters that describes the form of a string. A pattern can be simple, consisting only of ordinary characters, or very complex, using special characters to represent ranges of characters, repetition, or context. For example:

^once

This pattern contains the special character ^, which means that the pattern matches only strings that start with once. For example, it matches the string "once upon a time" but does not match "There once was a man from New York". Just as the ^ symbol marks the beginning, the $ symbol matches strings that end with a given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but does not match "buckets". When ^ and $ are used together, they require an exact match (the string must be identical to the pattern). For example:

^bucket$

matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

once

matches the string

There once was a man from New York
Who kept all of his cash in a bucket.
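The anchoring rules above can be checked with a short sketch. It uses Python's re module as a stand-in engine (an assumption; the text itself does not name one at this point), and the helper found is a hypothetical name introduced only for illustration.

```python
import re

def found(pattern, text):
    """Return True if `pattern` matches somewhere in `text`."""
    return re.search(pattern, text) is not None

# ^once: the string must start with "once".
print(found(r"^once", "once upon a time"))                        # True
print(found(r"^once", "There once was a man from New York"))      # False

# bucket$: the string must end with "bucket".
print(found(r"bucket$", "Who kept all of his cash in a bucket"))  # True
print(found(r"bucket$", "buckets"))                               # False

# ^bucket$: the whole string must be exactly "bucket".
print(found(r"^bucket$", "bucket"))                               # True
print(found(r"^bucket$", "a bucket"))                             # False

# once: no anchors, so "once" may appear anywhere in the string.
print(found(r"once", "There once was a man from New York"))       # True
```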

The letters (o-n-c-e) in the pattern once are literal characters, that is, they represent the letters themselves, as do digits. Some other characters, such as certain punctuation marks and whitespace characters (tabs, newlines, etc.), are written as escape sequences. All escape sequences begin with a backslash (\). The escape sequence for the tab character is \t. So if we want to detect whether a string starts with a tab character, we can use this pattern:

^\t

Similarly, \n represents a newline and \r represents a carriage return. Other special symbols can be matched literally by putting a backslash in front of them: the backslash itself is written \\, the period is written \., and so on.
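As a minimal sketch of these escape sequences, again assuming Python's re module as the engine: the patterns are written as raw strings (r"...") so that the backslashes reach the regex engine unchanged. The variable names are illustrative only.

```python
import re

starts_with_tab = re.compile(r"^\t")     # a tab at the start of the string
literal_period = re.compile(r"\.")       # a real period, not a wildcard
literal_backslash = re.compile(r"\\")    # a single backslash character

print(bool(starts_with_tab.search("\tindented line")))  # True
print(bool(starts_with_tab.search("not indented")))     # False
print(bool(literal_period.search("file.txt")))          # True
print(bool(literal_period.search("no dots here")))      # False
print(bool(literal_backslash.search("C:\\temp")))       # True
```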

Character cluster

In web applications, regular expressions are commonly used to validate user input. When a user submits a form, ordinary literal characters are not enough to determine whether the entered phone number, address, email address, credit card number, and so on are valid.

So we need a more flexible way to describe the pattern we want: character clusters. To create a cluster that represents all vowels, place the vowel characters in square brackets:

[AaEeIiOoUu]

This pattern matches any vowel, but it represents only a single character. Use a hyphen to represent a range of characters, such as:

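[a-z] // matches any lowercase letter
[A-Z] // matches any uppercase letter
[a-zA-Z] // matches any letter
[0-9] // matches any digit
[0-9\.\-] // matches any digit, period, or minus sign
[ \f\r\t\n] // matches any whitespace character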

Likewise, these only represent one character, which is very important. If you want to match a string consisting of a lowercase letter and a digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] covers a range of 26 letters, it matches only a single character; here it requires the first character of the string to be a lowercase letter.
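The point that a cluster stands for exactly one character can be seen in a small sketch (Python's re module is assumed as the engine; the variable names are hypothetical). It runs the strings listed above through ^[a-z][0-9]$ and collects the vowels of a sample phrase one character at a time.

```python
import re

letter_then_digit = re.compile(r"^[a-z][0-9]$")

# Strings taken from the text: the first three should match, the rest should not.
candidates = ["z2", "t6", "g7", "ab2", "r2d3", "b52"]
results = {s: bool(letter_then_digit.search(s)) for s in candidates}
for text, ok in results.items():
    print(f"{text!r}: {ok}")

# A cluster matches one character per match, so findall returns single letters.
vowel = re.compile(r"[AaEeIiOoUu]")
vowels_found = vowel.findall("Regular Expression")
print(vowels_found)  # ['e', 'u', 'a', 'E', 'e', 'i', 'o']
```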

It was mentioned earlier that ^ represents the beginning of a string, but it also has another meaning. When ^ is used within a set of square brackets, it means "not" or "exclude" and is often used to eliminate a certain character. Using the previous example, we require that the first character cannot be a number:

^[^0-9][0-9]$

This pattern matches "&5", "g7" and "-2", but does not match "12" or "66". Here are a few examples of excluding specific characters:

[^a-z] // any character except lowercase letters
[^\\/\^] // any character except (\), (/), and (^)
[^\"\'] // any character except double quotes (") and single quotes (')
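A brief sketch of negated clusters, assuming Python's re module as the engine. The names non_digit_then_digit and not_lowercase are illustrative, not part of the original text.

```python
import re

# ^ inside the brackets negates the cluster: first character must NOT be a digit.
non_digit_then_digit = re.compile(r"^[^0-9][0-9]$")

negation_results = {
    s: bool(non_digit_then_digit.search(s))
    for s in ["&5", "g7", "-2", "12", "66"]
}
for text, ok in negation_results.items():
    print(f"{text!r}: {ok}")

# [^a-z] picks out every character that is not a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", "abc-DEF 1")
print(not_lowercase)  # ['-', 'D', 'E', 'F', ' ', '1']
```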

The special character "." (dot, period) is used in regular expressions to represent any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and starts with any character other than a newline. The pattern "." on its own matches any string except the empty string and strings that consist only of newlines.
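The behaviour of the dot can be sketched as follows, assuming Python's re module with its default flags (where "." likewise excludes the newline). The helper found is a hypothetical name used only here.

```python
import re

def found(pattern, text):
    """Return True if `pattern` matches somewhere in `text`."""
    return re.search(pattern, text) is not None

# ^.5$: one non-newline character followed by the digit 5.
print(found(r"^.5$", "a5"))    # True
print(found(r"^.5$", "55"))    # True
print(found(r"^.5$", "\n5"))   # False: the dot does not match a newline
print(found(r"^.5$", "abc5"))  # False: more than two characters

# A bare dot needs at least one character that is not a newline.
print(found(r".", "x"))        # True
print(found(r".", ""))         # False
print(found(r".", "\n\n"))     # False
```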

PHP's regular expressions provide some built-in character clusters for common cases, listed below:

Character cluster    Description
[[:alpha:]]          Any letter
[[:digit:]]          Any digit
[[:alnum:]]          Any letter or digit
[[:space:]]          Any whitespace character
[[:upper:]]          Any uppercase letter
[[:lower:]]          Any lowercase letter
[[:punct:]]          Any punctuation character
[[:xdigit:]]         Any hexadecimal digit, equivalent to [0-9a-fA-F]
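Python's re module does not support the [[:name:]] syntax, so the sketch below spells out hand-written ASCII equivalents for the clusters in the table. The mapping ASCII_EQUIVALENTS and the helper is_one are assumptions made for illustration; they ignore locale settings and non-ASCII characters, which a real POSIX implementation may take into account.

```python
import re
import string

# Hand-written ASCII stand-ins for the named clusters (an assumption of
# this sketch, not a built-in feature of Python's re module).
ASCII_EQUIVALENTS = {
    "alpha": r"[a-zA-Z]",
    "digit": r"[0-9]",
    "alnum": r"[a-zA-Z0-9]",
    "space": r"[ \t\n\r\f\v]",
    "upper": r"[A-Z]",
    "lower": r"[a-z]",
    # re.escape protects characters such as ], \ and ^ inside the brackets.
    "punct": "[" + re.escape(string.punctuation) + "]",
    "xdigit": r"[0-9a-fA-F]",
}

def is_one(name, char):
    """Return True if the single character `char` belongs to cluster `name`."""
    return re.fullmatch(ASCII_EQUIVALENTS[name], char) is not None

print(is_one("xdigit", "F"))  # True
print(is_one("xdigit", "g"))  # False
print(is_one("punct", "!"))   # True
print(is_one("space", "\t"))  # True
```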

Determine recurring occurrences

Up to now, you already know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several single digits. Curly braces ({}) following a character or character cluster specify how many times the preceding item is repeated.

Character cluster    Description
^[a-zA-Z_]$          Any single letter or underscore
^[[:alpha:]]{3}$     Any three-letter word
^a$                  The letter a
^a{4}$               aaaa
^a{2,4}$             aa, aaa, or aaaa
^a{1,3}$             a, aa, or aaa
^a{2,}$              A string of two or more a's
^a{2,}               For example: aardvark and aaab, but not apple
a{2,}                For example: baad and aaa, but not Nantucket
\t{2}                Two tab characters
.{2}                 Any two characters
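Several rows of this table can be verified with a short sketch. Python's re module is assumed as the engine, and the helper found is a hypothetical name; the sample strings are the ones given in the table.

```python
import re

def found(pattern, text):
    """Return True if `pattern` matches somewhere in `text`."""
    return re.search(pattern, text) is not None

# {4}: exactly four repetitions.
print(found(r"^a{4}$", "aaaa"))        # True
print(found(r"^a{4}$", "aaa"))         # False

# {2,4}: between two and four repetitions.
print(found(r"^a{2,4}$", "aaa"))       # True
print(found(r"^a{2,4}$", "aaaaa"))     # False

# {2,}: two or more repetitions, anchored at the start only.
print(found(r"^a{2,}", "aardvark"))    # True
print(found(r"^a{2,}", "apple"))       # False

# {2,} with no anchors: two or more a's anywhere in the string.
print(found(r"a{2,}", "baad"))         # True
print(found(r"a{2,}", "Nantucket"))    # False

# \t{2}: two consecutive tab characters.
print(found(r"\t{2}", "a\t\tb"))       # True
```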

The examples above show three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number followed by a comma, {x,}, means "the preceding item appears x or more times"; two comma-separated numbers, {x,y}, mean "the preceding item appears at least x times, but no more than y times". We can extend these patterns to more words or numbers:

^[a-zA-Z0-9_]{1,}$ // any string of one or more letters, digits, or underscores
^[0-9]{1,}$ // any non-negative integer
^\-{0,1}[0-9]{1,}$ // any integer
^[-]?[0-9]+\.?[0-9]+$ // any floating-point number

The last example is not easy to understand, is it? Read it this way: the string starts (^) with an optional minus sign ([-]?), followed by one or more digits ([0-9]+), then an optional decimal point (\.?), then one or more digits ([0-9]+), with nothing else after them ($). The symbols ? and + used here are explained below.
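To see what the pattern ^[-]?[0-9]+\.?[0-9]+$ accepts in practice, here is a small sketch that probes it with a handful of strings. Python's re module is assumed as the engine, and the sample strings are chosen for illustration. Because both digit groups require at least one digit, a lone digit such as "5" is rejected, while "12" is accepted with one digit going to each group.

```python
import re

float_like = re.compile(r"^[-]?[0-9]+\.?[0-9]+$")

samples = ["3.14", "-0.5", "12", "5", "1.", ".5", "abc"]
float_results = {s: bool(float_like.search(s)) for s in samples}

for text, ok in float_results.items():
    print(f"{text!r}: {ok}")
```

A stricter or looser definition of "number" would need a different pattern; this sketch only reports how the pattern from the text behaves.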

The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding item", that is, "the preceding item is optional". So the example just now can also be written as:

^\-?[0-9]{1,}\.?[0-9]{1,}$

The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding item". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding item", so the four examples above can be written as:

^[a-zA-Z0-9_]+$ // any string of one or more letters, digits, or underscores
^[0-9]+$ // any non-negative integer
^\-?[0-9]+$ // any integer
^\-?[0-9]+\.?[0-9]+$ // any floating-point number

Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
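The equivalence between the curly-brace forms and the shorthand symbols can be spot-checked with a sketch like the one below. Python's re module is assumed as the engine; the sample strings and the ^a{0,}$ / ^a*$ pair are hypothetical additions for illustration. Agreement on a sample list is a sanity check, not a proof.

```python
import re

# Each pair holds a curly-brace form and its shorthand counterpart.
pairs = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),  # {1,} versus +
    (r"^\-{0,1}[0-9]{1,}$", r"^\-?[0-9]+$"),      # {0,1} versus ?
    (r"^a{0,}$", r"^a*$"),                        # {0,} versus * (illustrative)
]

samples = ["", "a", "aaa", "abc_123", "42", "-42", "--42", "4-2", "hello world"]

def agree(braced, shorthand):
    """Return True if both patterns give the same verdict on every sample."""
    return all(
        bool(re.search(braced, s)) == bool(re.search(shorthand, s))
        for s in samples
    )

all_agree = all(agree(braced, shorthand) for braced, shorthand in pairs)
print(all_agree)  # True
```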
