Summary of regular expressions in PHP (1)
1. Concept
The pattern syntax is similar to Perl's.
The delimiter can be any non-alphanumeric, non-backslash, non-whitespace character.
If the delimiter is used inside the expression, it must be escaped with a backslash.
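As a minimal sketch of the delimiter rule, the snippet below writes the same pattern with two different delimiters. The subject string and variable names are made up for illustration.

```php
<?php
// The same pattern written with two different delimiters.
$subject = 'see http://example.com/ for details';

// With "/" as the delimiter, every literal "/" in the pattern must be escaped.
$withSlash = preg_match('/http:\/\//', $subject);

// With "#" as the delimiter, "/" needs no escaping.
$withHash = preg_match('#http://#', $subject);

var_dump($withSlash, $withHash);
```

Choosing a delimiter that does not occur in the pattern usually keeps the pattern easier to read.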
Metacharacters
Basic components of a regular expression: /atoms and metacharacters/pattern modifiers
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by metacharacters, which do not stand for themselves but are interpreted in a special way.
Metacharacters fall into two sets, depending on whether they appear outside or inside square brackets.
1. Metacharacters outside square brackets
Symbol | Description
\ | General escape character with several uses
^ | Asserts the start of the subject (or the start of a line, in multiline mode)
$ | Asserts the end of the subject (or the end of a line, in multiline mode)
. | Matches any character except newline (by default)
[ , ] | Start and end of a character class definition
| (vertical bar) | Start of an alternative branch
( , ) | Start and end of a subgroup
? | As a quantifier, 0 or 1 matches; placed after another quantifier, it changes that quantifier's greediness
* | Quantifier, 0 or more matches
+ | Quantifier, 1 or more matches
{ , } | Start and end of a custom (min/max) quantifier
2. The part of a pattern inside square brackets is called a "character class". Inside it, only the following are metacharacters:
Symbol | Description
\ | General escape character
^ | Negates the character class, but only when it is the first character
- | Indicates a character range
Examples of metacharacter usage
1. Escape (backslash)
If a backslash is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This applies both inside and outside character classes.
For a non-alphanumeric character, it is always safe to put a backslash in front of it when matching literal text, to indicate that it stands for itself.
To match "*", which has a special meaning, use "\*" to cancel that meaning.
Match "." with "\."
Match "\" with "\\"
But be careful:
Backslash also has a special meaning in single-quoted and double-quoted PHP strings, so to match a backslash with the regular expression \\, the pattern must be written as "\\\\" or '\\\\' in PHP code.
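The escaping rules above can be sketched as follows. The subject strings are invented for illustration; preg_quote() is the standard PHP function that adds the backslashes for you.

```php
<?php
// Escaping metacharacters so that they match themselves.
$star  = preg_match('/2\*3/', '2*3');   // \* matches a literal "*"
$dot   = preg_match('/a\.b/', 'a.b');   // \. matches a literal "."
$noDot = preg_match('/a\.b/', 'axb');   // the escaped dot no longer matches "x"

// A literal backslash: the regex is \\, which PHP source must spell as '\\\\'.
$path      = 'C:\\temp';                // the string C:\temp
$backslash = preg_match('/\\\\/', $path);

// preg_quote() escapes every regex metacharacter in a string.
$quoted = preg_quote('1.5*2');

var_dump($star, $dot, $noDot, $backslash, $quoted);
```

When the text to match comes from user input, passing it through preg_quote() is usually safer than escaping by hand.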
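2. The second use of backslash provides a way of encoding non-printing characters in a pattern in a visible form. Apart from the binary zero that terminates a pattern, non-printing characters may appear literally, but it is usually easier to use one of the following escape sequences: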
Symbol | Description
\a | Alarm, that is, the BEL character (hex 07)
\cx | "control-x", where x is any character
\e | Escape (hex 1B)
\f | Form feed (hex 0C)
\n | Newline (hex 0A)
\p{xx} | A character with the xx property
\P{xx} | A character without the xx property
\r | Carriage return (hex 0D)
\t | Horizontal tab (hex 09)
\xhh | Character with hexadecimal code hh
\ddd | Character with octal code ddd, or a back reference
Examples:
Symbol | Description
\040 | Another way of writing a space
\40 | Also a space, provided there are fewer than 40 capturing subgroups
\7 | Always a back reference
\11 | May be a back reference or a tab
\011 | Always a tab
\0113 | A tab followed by the character "3" (because at most three octal digits are read)
\113 | The character with octal code 113
\377 | Octal 377 is decimal 255, so this is a byte consisting entirely of 1 bits
\81 | A back reference, or a binary zero followed by the two characters "8" and "1" (because 8 is not a valid octal digit)
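A short sketch of a few of these escape sequences in use. The patterns are single-quoted so that PCRE, not PHP's string parser, interprets the escapes; the subject string is an assumption made for the example.

```php
<?php
// Subject: "A", tab, "B", newline (double quotes so PHP creates the control characters).
$subject = "A\tB\n";

$tab     = preg_match('/A\tB/', $subject);      // \t   : tab
$hex     = preg_match('/\x41\x09B/', $subject); // \x41 : "A", \x09 : tab
$octal   = preg_match('/A\011B/', $subject);    // \011 : always a tab
$newline = preg_match('/B\n\z/', $subject);     // \n   : newline at the very end

var_dump($tab, $hex, $octal, $newline);
```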
3. The third use of backslash, describing a specific character class
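Symbol | Description
\d | Any decimal digit
\D | Any character that is not a decimal digit
\h | Any horizontal whitespace character
\H | Any character that is not horizontal whitespace
\s | Any whitespace character
\S | Any character that is not whitespace
\v | Any vertical whitespace character (since PHP 5.2.4)
\V | Any character that is not vertical whitespace (since PHP 5.2.4)
\w | Any "word" character
\W | Any "non-word" character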
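The generic character types can be tried out with a sketch like the following. The subject string is invented for illustration.

```php
<?php
$subject = "order 66:\tfoo_bar 9";

preg_match_all('/\d+/', $subject, $digits); // runs of decimal digits
preg_match_all('/\w+/', $subject, $words);  // runs of "word" characters (letters, digits, underscore)
$parts = preg_split('/\s+/', $subject);     // split on any run of whitespace

print_r($digits[0]);
print_r($words[0]);
print_r($parts);
```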
4. The fourth use of backslash: simple assertions
Symbol | Description
\b | Word boundary (note that inside a character class it means backspace)
\B | Not a word boundary
\A | Start of the subject (independent of multiline mode)
\Z | End of the subject, or the newline at the end (independent of multiline mode)
\z | End of the subject (independent of multiline mode)
\G | First matching position in the subject
\A, \Z and \z
These differ from ^ and $ in that they always match only at the very start and end of the subject string and are not affected by pattern modifiers.
The difference between \Z and \z is that when the last character of the string is a newline, \Z also matches just before that newline, while \z matches only at the very end of the string.
Code 1:
$p = '#\A[a-z]{3}#m';
$str = "abc\ndefg\nhijkl"; // three lines: abc, defg, hijkl
preg_match_all($p, $str, $all);
print_r($all);
The result is the same with or without the pattern modifier m: only "abc" is matched.
Code 2, which uses ^ instead of \A:
$p = '#^[a-z]{3}#m';
$str = "abc\ndefg\nhijkl";
preg_match_all($p, $str, $all);
print_r($all);
Without m, only "abc" is matched. After adding m, "abc", "def" and "hij" are matched, because ^ then also matches at the start of every line. Similarly, \Z and \z can be compared with $.
Code 3:
$p = '#[a-z]\Z#';
$str = "a\n";
preg_match_all($p, $str, $all);
print_r($all);
With \Z the pattern matches "a", because \Z also matches just before a newline at the very end of the subject. When the pattern is changed to '#[a-z]\z#' there is no match, because \z matches only at the very end, after the newline.
\G is true only when the current matching position is at the start point of the match, as given by the $offset parameter of preg_match().
When the value of $offset is not 0, it differs from \A.
See the PHP manual.
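A small sketch of the difference between \G and \A when an offset is passed to preg_match(). The subject string and offsets are chosen purely for illustration.

```php
<?php
$subject = 'xxabc';

// The fifth argument of preg_match() is the byte offset at which matching starts.
$withG = preg_match('/\Gabc/', $subject, $m1, 0, 2); // "abc" begins exactly at offset 2
$withA = preg_match('/\Aabc/', $subject, $m2, 0, 2); // \A only matches at the real start of the subject
$lateG = preg_match('/\Gabc/', $subject, $m3, 0, 1); // a match would have to begin at offset 1

var_dump($withG, $withA, $lateG);
```

Here only the first call finds a match: \G anchors the match to the offset, while \A stays tied to the beginning of the subject.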
\Q and \E switch off the special meaning of metacharacters: put the characters that would otherwise be special between \Q and \E.
For example, Code 4:
$p = '#\w+\Q.$.\E$#';
$str = 'a.$.';
preg_match_all($p, $str, $all);
print_r($all);
This matches "a.$.".
As of PHP 5.2.4, \K can be used to reset the start of the match. For example, foo\Kbar matches "foobar", but the reported match is "bar". The use of \K does not interfere with the contents of subgroups: when (foo)\Kbar matches "foobar", the first subgroup is still "foo". Translator's note: placing \K inside or outside the subgroup has the same effect.
\p{Lu} matches uppercase letters.
Period: outside a character class, a period matches any single character except newline (by default). \C can be used to match a single byte, which is relevant in UTF-8 mode, where a period can match a multi-byte character.
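The behaviour of \K can be checked with a sketch like this one. The replacement text "BAZ" and the second subject string are made up for the example.

```php
<?php
// \K resets the start of the reported match but leaves subgroups untouched.
preg_match('/(foo)\Kbar/', 'foobar', $m);
print_r($m); // $m[0] holds the reported match, $m[1] the first subgroup

// Because only the text after \K is reported, only that part is replaced.
$result = preg_replace('/foo\Kbar/', 'BAZ', 'foobar foobaz');
echo $result, "\n";
```

This makes \K a convenient way to require a prefix without including it in the replaced text.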
Character classes can also use POSIX class names. For example, '#[[:upper:]]#' matches uppercase letters and '#[[:alpha:]]#' matches letters.
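A minimal sketch of POSIX class names inside character classes. The subject strings are assumptions made for illustration.

```php
<?php
// Collect every uppercase letter.
preg_match_all('#[[:upper:]]#', 'PhP RegEx', $upper);

// A POSIX class can be negated together with the rest of the character class.
$lettersOnly = preg_replace('#[^[:alpha:]]#', '', 'a1b2-c3');

// [:xdigit:] covers hexadecimal digits in either case.
$isHex = preg_match('#^[[:xdigit:]]+$#', 'DEADbeef');

print_r($upper[0]);
var_dump($lettersOnly, $isHex);
```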
The vertical bar separates alternative branches in a pattern. For example, the pattern gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (it matches the empty string). Matching tries each alternative from left to right and uses the first one that succeeds. If the alternatives are inside a subgroup (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subgroup.
Code 5:
$p = '#p(hp|ython|erl)#';
$str = "php python perl";
preg_match_all($p, $str, $all);
print_r($all);
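Subgroups are delimited by parentheses and can be nested. They serve two main purposes:
1. Localizing a set of alternatives. For example, the pattern p(hp|ython|erl) matches "php", "python" or "perl".
2. Setting the subgroup up as a capturing subgroup. After the whole pattern has matched, the order in which the opening parentheses appear from left to right gives the number of each subgroup (starting from 1), and these numbers can be used to retrieve what each capturing subgroup matched.
Code 6:
$p = '#(\d)#';
$str = "abc123";
$r = preg_replace($p, '<font color=red>\1</font>', $str);
echo $r;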
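If the opening parenthesis of a subgroup is immediately followed by "?:", the subgroup does not capture anything and is not counted when the numbers of the following subgroups are worked out.
Code 7 (match digits and turn them red):
$p = '#.*(?:\d).*([a-z])#U';
$str = "3df5g";
$r = preg_replace($p, '<font color=red>\1</font>', $str);
echo $r;
Because the digit subgroup here is non-capturing, \1 refers to the letter subgroup ([a-z]). If the digit subgroup is written without ?:, \1 refers to the digit instead.
As a convenient shorthand, if options need to be set at the start of a non-capturing subgroup, the option letters may be placed between the "?" and the ":", for example:
(?i:saturday|sunday)
(?:(?i)saturday|sunday)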
The two forms above are in fact the same pattern. Because alternative branches are tried from left to right, and the option is not reset before the end of the subgroup is reached, an option setting in one branch also affects the branches that follow it. Both patterns therefore match "SUNDAY" as well as "Saturday".
In PHP 4.3.3 and later, a subgroup can also be given a name. A named subgroup appears in the match array under both its name and its number.
Code 8:
$p = "#.*(?<alpha>[a-z]{3})(?'digit'\d{3}).*#";
$str = "abc123111def111g";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: the whole string is matched once. Because the leading .* is greedy, alpha (index 1) captures "def" and digit (index 2) captures "111".
Sometimes several alternative subgroups are needed in one regular expression, although only one of them can ever match. Normally each gets its own back reference number; the (?| syntax allows the alternatives to share the same numbers instead.
Consider the following pattern for matching "Sunday" or "Saturday": (?:(Sat)ur|(Sun))day. Here "Sun" is stored in back reference 2 while back reference 1 is empty, and "Sat" is stored in back reference 1 while back reference 2 is not set. Changing the pattern to use (?| fixes this problem.
Code 9:
$p = '#(?:(sat)ur|(sun))day#';
$str = "sunday saturday";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: "sunday" and "saturday" are both matched, but "sun" is captured in subgroup 2 and "sat" in subgroup 1.
(?|(Sat)ur|(Sun))day
With this pattern, both "Sun" and "Sat" are stored in back reference 1. Before looking at this pattern, first look at the following pieces of code.
Code 10-1:
$p = '#(a|b)\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array ( [0] => Array ( [0] => b2 [1] => a1 ) [1] => Array ( [0] => b [1] => a ) )
Code 10-2:
$p = '#((a)|b)\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array ( [0] => Array ( [0] => b2 [1] => a1 ) [1] => Array ( [0] => b [1] => a ) [2] => Array ( [0] => [1] => a ) )
About Code 10-2: the first complete match is "b2", so the first subgroup, the outer parentheses that contain the matched b, is "b"; the second subgroup is empty because (a) did not take part in the match. The second complete match is "a1"; its first subgroup is "a", and its second subgroup is also "a", because (a) is nested inside the outer parentheses of ((a)|b).
Code 10-3:
$p = '#((a)|(b))\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array ( [0] => Array ( [0] => b2 [1] => a1 ) [1] => Array ( [0] => b [1] => a ) [2] => Array ( [0] => [1] => a ) [3] => Array ( [0] => b [1] => ) )
Code 10-4:
$p = '#(?:(a)|(b))\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array ( [0] => Array ( [0] => b2 [1] => a1 ) [1] => Array ( [0] => [1] => a ) [2] => Array ( [0] => b [1] => ) )
Code 10:
$p = '#(?|(sat)ur|(sun))day#';
$str = "sunday saturday";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: both "sunday" and "saturday" are matched, and subgroup 1 holds "sun" for the first match and "sat" for the second.
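If the number following the backslash is less than 10, it is always a back reference. The pattern must contain at least that many capturing subgroups; otherwise it is an error.
A back reference matches whatever the referenced subgroup actually captured in the subject string, not whatever the subgroup's pattern could match.
(sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility".
$p = '#(sens|respons)e and \1ibility#';
$str = "sense and sensibility response and responsibility sense and responsibility";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: "sense and sensibility" and "response and responsibility" are matched; "sense and responsibility" is not.
ab(?i)c matches "abC".
If case-sensitive matching is in force at the point of the back reference, the case of the letters matters. ((?i)abc)\s+\1 matches "ABC ABC" and "AbC AbC", as long as both halves are identical, but it does not match "abc Abc". The point is that a back reference expects exactly the same text that the referenced capturing subgroup captured.
Code 12:
$p = '#((?i)abc)\s+\1#';
$str = "abc abc |ABC ABC |AbC AbC |abc Abc ";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: "abc abc", "ABC ABC" and "AbC AbC" are matched; "abc Abc" is not.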
First look at Code 13:
$p = '#(a|(bc))#';
$str = "abc ";
preg_match_all($p, $str, $arr);
print_r($arr);
There are two complete matches. [0][0] is the first complete match, [1][0] is the first subgroup of the first match and [2][0] is its second subgroup; [0][1] is the second complete match, [1][1] is its first subgroup and [2][1] is its second subgroup. This shows that in the pattern (a|(bc)) the outermost parentheses are the first subgroup and the inner parentheses are the second.
Now consider Code 14:
$p = '#(a|(bc))\2#';
$str = "aabcbc";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: when the first subgroup matches through the branch a, the second subgroup has captured nothing, so \2 has nothing to refer to and the match fails. A complete match must therefore give the second subgroup a chance to take part, that is, the inner parentheses must match, so the only match is "bcbc".
There can be up to 99 back references, so all digits following the backslash are treated as part of a potential back reference number. If the pattern continues with a digit immediately after the back reference, some delimiter must be used to terminate the back reference. Take Code 15 as an example:
$p = '#([a-z]{3})\1 5#x';
$str = "aaaaaa5";
preg_match_all($p, $str, $arr);
print_r($arr);
A space is left after the back reference \1 in the pattern, and the x pattern modifier makes whitespace in the pattern ignored, so the match succeeds.
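A back reference that occurs inside the parentheses it refers to fails when the subgroup is first used, so (a\1) never produces a match. Such a reference can, however, be useful inside a repeated subgroup. (a|b\1) matches "a" but not "b": the subgroup contains alternatives, one of which can match without the back reference, and once it has matched, the back reference has something to refer to.
Code 16:
$p = '#(a|b\1)+#';
$str = "abba";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: "a" is matched twice (at the start and at the end of the subject); neither "b" takes part in a match.
At each iteration of the subgroup, the back reference matches the string that the subgroup matched in the previous iteration. For this to work, the pattern must be such that the first iteration does not need to match the back reference. This can be done with alternation, as in the example above, or by giving the back reference a quantifier with a minimum of zero.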
Since PHP 5.2.2, the \g escape sequence can be used for absolute and relative references to subgroups. It must be followed by an unsigned number or a negative number, optionally enclosed in braces. The sequences \1, \g1 and \g{1} are synonymous. This form removes the ambiguity of a backslash followed by digits: it distinguishes back references from octal character codes, and it makes it clear where the reference ends when a literal digit follows, as in \g{2}1.
Code 17:
$p = '#([a-z]{2})\g{1}5#';
$str = "abab5";
preg_match_all($p, $str, $arr);
print_r($arr);
Compare this with Code 15.
A \g escape followed by a negative number is a relative back reference. For example, (foo)(bar)\g{-1} matches "foobarbar" and (foo)(bar)\g{-2} matches "foobarfoo". In long patterns this is an alternative to keeping track of the numbers of the subgroups being referenced.
Code 18:
$p = '#(foo)(bar)\g{-1}#';
$p1 = '#(foo)(bar)\g{-2}#';
$str = "foobarbar";
$str1 = "foobarfoo";
preg_match_all($p, $str, $arr);
preg_match_all($p1, $str1, $arr1);
print_r($arr);
print_r($arr1);
Result: each pattern matches its whole subject string.
Back references can also be written with subgroup names, using (?P=name) or, since PHP 5.2.2, \k<name> or \k'name'. PHP 5.2.4 added support for \k{name} and \g{name}.
Code 19:
$p = "#(?'alpha'[a-z]{2})(?<digt>[0-9]{3})\k<digt>(?P=alpha)#";
$str = "aa123123aa";
preg_match_all($p, $str, $arr);
print_r($arr);
Result: the whole string "aa123123aa" is matched. Compare this with Code 8, and note the subgroup name alpha and the P in (?P=alpha).