Detailed introduction to the use of PHP regular expressions

WBOY
2016-07-21 15:11:13

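Preface

Regular expressions are fiddly but powerful. Once you have learned them, they not only make you more efficient but also give you a real sense of accomplishment. If you read this material carefully and refer back to it while applying it, mastering regular expressions is not a problem.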


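 1. Introduction

  Regular expressions are now widely used in software: in operating systems such as *nix (Linux, Unix, etc.) and HP, in development environments such as PHP, C#, and Java, and in many application programs.

  Regular expressions let you implement powerful functionality with very little code. The price of that compactness is that the notation is dense and not easy to learn, so some effort is required. Once you are past the basics and have a reference at hand, they are simple and effective to use.

    Example: ^.+@.+\..+$

  Patterns like this scared me off more than once, and have probably scared off many other people too. After reading this article you should be able to use such patterns freely.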
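
As a minimal sketch, the example pattern can be tried in PHP with preg_match(). The sample strings are made up for illustration, and the pattern is deliberately loose: it only checks for "something, an @, something, a dot, something", so it should not be taken as a real email validator.

```php
<?php
// The example pattern, wrapped in / delimiters as PCRE requires.
$pattern = '/^.+@.+\..+$/';

// Hypothetical sample inputs, not taken from the article.
$samples = ['bob@example.com', 'bob@example', 'no-at-sign.com'];

foreach ($samples as $s) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    printf("%-16s %s\n", $s, preg_match($pattern, $s) ? 'match' : 'no match');
}
```

Only the first sample contains both an @ and a later dot, so only it matches.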

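  Note: Part 7 may look like a repeat of the earlier material. It restates what the earlier tables cover, with the aim of making that content easier to understand.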

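2. History of regular expressions
  The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way of describing these neural networks.

  In 1956, the mathematician Stephen Kleene built on the earlier work of McCulloch and Pitts and published a paper titled "Representation of Events in Nerve Nets", which introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

  Later, this work was found to be applicable to early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor on Unix.

  The rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.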


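3. Definition of regular expressions

  A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract the substrings that satisfy some condition.

        When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there (a shell wildcard) means something different from the * in a regular expression.

  A regular expression is a text pattern made up of ordinary characters (for example the letters a to z) and special characters (called metacharacters). It acts as a template that is matched against the string being searched.
3.1 Characters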

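  1. Ordinary characters:

           Ordinary characters are all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.

  2. Non-printing characters:
Character  Meaning
\cx  Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z; otherwise c is treated as a literal 'c' character.
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.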


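3. Metacharacters (special characters):

   Metacharacters (special characters) are characters that have a special meaning, such as the * in "*.txt" mentioned above, which there simply stands for any string. To find files with a literal * in their name, the * has to be escaped by putting a \ in front of it: ls \*.txt. Regular expressions have the special characters listed below.

          To include a metacharacter in a regular expression pattern without its special meaning, you must escape it with a backslash (\). For example, the following regular expression matches the letter A, the letter B, an asterisk, and the letter C, in that order:

         /AB\*C/;
Metacharacter  Description
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )  Mark the start and end of a subexpression. The subexpression can be captured for later use. To match these characters, use \( and \).
*  Matches the preceding subexpression zero or more times. To match the * character, use \*.
+  Matches the preceding subexpression one or more times. To match the + character, use \+.
.  Matches any single character except the newline \n. To match a literal ., use \. (backslash followed by a dot).
[  Marks the start of a bracket expression. To match [, use \[.
?  Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
\  Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{  Marks the start of a quantifier expression. To match {, use \{.
|  Separates two alternatives. To match |, use \|.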
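
The escaping rule can be sketched in PHP as follows. The first two calls escape the asterisk by hand; the last one uses preg_quote(), which escapes every metacharacter in a literal string for you. The subject strings are hypothetical.

```php
<?php
// Escaping by hand: \* stands for a literal asterisk.
var_dump(preg_match('/AB\*C/', 'xxAB*Cxx'));   // int(1)
var_dump(preg_match('/AB\*C/', 'ABBBC'));      // int(0)

// preg_quote() escapes all metacharacters in a literal string.
// The second argument also escapes the chosen delimiter.
$literal = 'price: $5.00 (approx.)';           // hypothetical input
$pattern = '/' . preg_quote($literal, '/') . '/';
var_dump(preg_match($pattern, 'Total price: $5.00 (approx.) today'));  // int(1)
```

Using preg_quote() is usually safer than hand-escaping whenever the literal text comes from a variable.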

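          Regular expressions are constructed in the same way as arithmetic expressions: smaller expressions are combined with metacharacters and operators to build larger ones. The components of a regular expression can be single characters, sets of characters, ranges of characters, choices between characters, or any combination of these.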

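4. Quantifiers:

        Quantifiers specify how many times a given component of a regular expression must occur for the match to succeed. There are six: *, +, ?, {n}, {n,}, and {n,m}.
The *, + and ? quantifiers are greedy, because they match as much text as possible. Appending a ? to any of them makes the match non-greedy, or minimal.
   The regular expression quantifiers are:

Character  Description
*  Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.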
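
The quantifier rows above can be checked with a short PHP sketch. It reuses the sample words from the table; the "zooo" input is an extra illustration.

```php
<?php
// {2} needs two consecutive o's.
var_dump(preg_match('/o{2}/', 'Bob'));    // int(0): only one o
var_dump(preg_match('/o{2}/', 'food'));   // int(1)

// {1,3} is greedy: it takes up to three o's.
preg_match('/o{1,3}/', 'fooooood', $m);
var_dump($m[0]);                          // string(3) "ooo"

// Appending ? makes the quantifier lazy: it takes as few as allowed.
preg_match('/o{1,3}?/', 'fooooood', $m);
var_dump($m[0]);                          // string(1) "o"

// + versus +? on the same input.
preg_match('/zo+/', 'zooo', $greedy);
preg_match('/zo+?/', 'zooo', $lazy);
echo $greedy[0], ' ', $lazy[0], "\n";     // zooo zo
```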

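5. Anchors (boundaries):

Anchors describe the boundary of a string or a word. ^ and $ refer to the start and end of the string respectively, \b describes the boundary at the start or end of a word, and \B denotes a non-word boundary. Quantifiers cannot be applied to anchors.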
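
A minimal PHP sketch of the anchors described above. The test words are illustrative only.

```php
<?php
// \b: "er" must sit at the end of a word.
var_dump(preg_match('/er\b/', 'never'));  // int(1)
var_dump(preg_match('/er\b/', 'verb'));   // int(0)

// \B: "er" must NOT be at a word boundary.
var_dump(preg_match('/er\B/', 'verb'));   // int(1)
var_dump(preg_match('/er\B/', 'never'));  // int(0)

// ^ and $ together force the whole string to match.
var_dump(preg_match('/^cat$/', 'cat'));      // int(1)
var_dump(preg_match('/^cat$/', 'concat'));   // int(0)
```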
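3.2 Character classes [ ]

A character class specifies a list of characters that can match at one position in a regular expression. Character classes are defined with square brackets ([ and ]). For example, the following regular expression defines a character class that matches bag, beg, big, bog, or bug:
/b[aeiou]g/
1. Escape sequences in character classes:
Most metacharacters and metasequences that normally have a special meaning in a regular expression do not have that meaning inside a character class. For example, the asterisk denotes repetition in a regular expression, but not when it appears inside a character class. The following character class matches a literal asterisk as well as any of the other characters listed:
/[abc*123]/
However, the three characters listed below do act as metacharacters and have a special meaning inside character classes:

] : Defines the end of the character class.
- : Defines a range of characters.
\ : Defines metasequences and removes the special meaning of metacharacters.

For any of these characters to be treated as a literal (without its special meaning), it must be preceded by the backslash escape character. For example, the following regular expression contains a character class that matches any of four symbols: $, \, ], or -.
/[$\\\]\-]/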

2. Ranges of characters in a character class:
Use a hyphen to specify a range of characters, such as A-Z, a-z, or 0-9. The characters must form a valid range. For example, the following character class matches any single character in the range a-z or any digit:
/[a-z0-9]/
You can also use the \xnn ASCII character code to specify a range by ASCII value. For example, the following character class matches any single character in a range of the extended ASCII character set (which characters those are, such as é and ê, depends on the code page):
/[\x80-\x9A]/

3. Negated character classes:
If you use the caret (^) character at the start of a character class, the meaning of the set is reversed: any character not listed is considered a match. The following character class matches any character except lowercase letters (a-z) and digits:
/[^a-z0-9]/
The caret (^) must be the first character of the character class to indicate negation. Otherwise you are simply adding the caret to the characters in the class. For example, the following character class matches any one of a number of symbol characters, including the caret:
/[!.,#+*%$&^]/
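
The character class rules above can be sketched in PHP like this. The subject strings are hypothetical, and the escaped-class example leaves out the backslash itself to keep the PHP string quoting easy to read.

```php
<?php
// A class matches exactly one character from its list.
var_dump(preg_match('/b[aeiou]g/', 'big'));   // int(1)
var_dump(preg_match('/b[aeiou]g/', 'bxg'));   // int(0)

// Inside a class * is literal, while ] and - are escaped with a backslash.
preg_match_all('/[$\]\-]/', 'a$b]c-d*', $m);
echo implode('', $m[0]), "\n";                // $]-

// Negation: every character that is not a-z or 0-9 is replaced.
$clean = preg_replace('/[^a-z0-9]/', '_', 'php 8.3!');
echo $clean, "\n";                            // php_8_3_

// A caret that is not first in the class is just a literal caret.
var_dump(preg_match('/[!.,#+*%$&^]/', '2^8')); // int(1)
```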
3.3 Grouping and alternation

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, though: the text they match is captured (stored). Placing ?: before the first alternative turns this side effect off.
?: is one of the non-capturing elements; the other two are ?= and ?!, which do more than just group. ?= is a positive lookahead: it matches at any position where the pattern inside the parentheses matches from that point on. ?! is a negative lookahead: it matches at any position where the pattern inside the parentheses does not match.

For example, /(very )+/ matches "very " in "very good" and "very very " in "very very good".

1. Backreferences:

If you define a standard parenthesized group in a pattern, you can refer to it later in the regular expression. This is called a "backreference", and this kind of group is called a "capturing group".

Adding parentheses around a regular expression pattern, or part of one, causes the associated match to be stored in a temporary buffer. The buffers that hold submatches are numbered from 1, consecutively, up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies the buffer.
For example, in the following regular expression the sequence \1 matches whatever substring the capturing group matched:
/(\d+)-by-\1/; // Matches string: 48-by-48
You can specify up to 99 such backreferences in a regular expression by typing \1, \2, ..., \99.

You can use the non-capturing elements '?:', '?=', or '?!' to avoid storing the associated match.
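
As a small PHP sketch of backreferences, the first two calls test the 48-by-48 example, and the third finds a doubled word. The sentence is a made-up sample.

```php
<?php
// \1 must repeat exactly what group 1 captured.
var_dump(preg_match('/(\d+)-by-\1/', '48-by-48'));   // int(1)
var_dump(preg_match('/(\d+)-by-\1/', '48-by-64'));   // int(0)

// Finding a doubled word: group 1 is a word, \1 requires the same word again.
preg_match('/\b(\w+) \1\b/', 'Paris in the the spring', $m);
print_r($m);   // [0] => the the, [1] => the
```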

2. Non-capturing groups and lookahead groups:
A non-capturing group is used only for grouping. It is not "collected" and is not assigned a numbered backreference. Use (?: and ) to define a non-capturing group, as follows:
/(?:com|org|net)/
For example, compare the results with and without capturing (com|org) (demonstrated in PHP):

Capturing group:

$pattern = '/(\w+)@(\w+)\.(com|org)/';
$str = "bob@example.com";
preg_match($pattern, $str, $match);
print_r($match);

Array
(
[0] => bob@example.com
[1] => bob
[2] => example
[3] => com
)

Non-capturing group:

$pattern = '/(\w+)@(\w+)\.(?:com|org)/';
$str = "bob@example.com";
preg_match($pattern, $str, $match);
print_r($match);

Array
(
[0] => bob@example.com
[1] => bob
[2] => example
)
A special kind of non-capturing group is the lookahead group, of which there are two types: the positive lookahead group and the negative lookahead group. Use (?= and ) to define a positive lookahead group, which specifies that the subpattern in the group must match at that position. However, the part of the string that matches the lookahead group is not consumed, so it can still be matched by the rest of the pattern. For example, because (?=e) is a positive lookahead group in the following code, the character e that it matches can be matched again by a later part of the regular expression, in this case the capturing group (\w*):

$pattern = '/sh(?=e)(\w*)/i';
$str = "Shelly sells seashells by the seashore";
preg_match($pattern, $str, $match);
print_r($match);

Array
(
[0] => Shelly
[1] => elly
)

Use (?! and ) to define a negative lookahead group, which specifies that the subpattern in the group must not match at that position. For example, with the same string and this pattern:

$pattern = '/sh(?!e)(\w*)/i';

Array
(
[0] => shore
[1] => ore
)
3.4 Pattern modifiers

PHP regular expression patterns are often followed by modifiers such as /i, /is, /s, or /isU. What do they mean? For example, U stands for PCRE_UNGREEDY (non-greedy), which makes .* behave like .*? in Perl or Python: the match stops at the first point where the rest of the pattern can succeed, instead of .* first consuming all the characters and then backtracking one at a time.

Pattern modifiers: the modifiers that can be used in regular expression patterns

Listed below are the modifiers currently available in PCRE. The names in parentheses are the internal PCRE names of these modifiers. Spaces and newlines in modifiers are ignored; any other character causes an error.

i (PCRE_CASELESS)
If this modifier is set, the characters in the pattern will match both uppercase and lowercase letters.

m (PCRE_MULTILINE)
By default, PCRE treats the subject string as a single "line" of characters (even if it contains newlines). The "start of line" metacharacter (^) matches only at the start of the string, and the "end of line" metacharacter ($) matches only at the end of the string, or before a final newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, "start of line" and "end of line" match not only at the start and end of the whole string but also immediately after and immediately before each newline within it. This is equivalent to Perl's /m modifier. If there are no "\n" characters in the subject string, or no ^ or $ in the pattern, setting this modifier has no effect.
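
A short PHP sketch of the m modifier. The three-line subject string is hypothetical.

```php
<?php
$text = "first line\nsecond line\nthird line";   // hypothetical subject

// Without m, ^ matches only at the very start of the string.
$plain = preg_match_all('/^\w+/', $text, $m1);
echo $plain, "\n";                               // 1

// With m, ^ also matches right after each newline.
$multi = preg_match_all('/^\w+/m', $text, $m2);
echo $multi, "\n";                               // 3
print_r($m2[0]);                                 // first, second, third
```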

s (PCRE_DOTALL)
If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded. This is equivalent to Perl's /s modifier. A negated character class such as [^a] always matches a newline, regardless of whether this modifier is set.

x (PCRE_EXTENDED)
If this modifier is set, whitespace characters in the pattern are completely ignored, except when they are escaped or inside a character class. Characters between an unescaped # outside a character class and the next newline, inclusive, are also ignored. This is equivalent to Perl's /x modifier and makes it possible to add comments to complex patterns. Note, however, that this applies only to data characters. Whitespace may never appear inside a special character sequence in a pattern, for example inside the sequence (?( that introduces a conditional subpattern.
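
The s and x modifiers can be sketched in PHP as follows. The HTML fragment and the date pattern are illustrative assumptions, not taken from the PCRE documentation.

```php
<?php
// s (DOTALL): lets . match newlines too.
$html = "<b>bold\ntext</b>";                       // hypothetical subject
var_dump(preg_match('/<b>(.*)<\/b>/', $html));     // int(0)
var_dump(preg_match('/<b>(.*)<\/b>/s', $html));    // int(1)

// x (EXTENDED): whitespace and # comments inside the pattern are ignored.
$date = '/
    ^(\d{4})   # year
    -(\d{2})   # month
    -(\d{2})$  # day
/x';
preg_match($date, '2016-07-21', $m);
print_r($m);   // [0] => 2016-07-21, [1] => 2016, [2] => 07, [3] => 21
```

One caveat with x: a comment must not contain the delimiter character, because the delimiter still ends the pattern.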

e
If this modifier is set, preg_replace() performs the normal backreference substitution in the replacement string, evaluates the result as PHP code, and uses the outcome to replace the matched text.

Only preg_replace() uses this modifier; other PCRE functions ignore it.

Note: This modifier is not available in PHP3. It was deprecated in PHP 5.5.0 and removed in PHP 7.0.0; preg_replace_callback() is the replacement.
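
As a minimal sketch of what the e modifier was used for, the same effect can be obtained with preg_replace_callback(), which runs a PHP function on every match. The input string and the upper-casing rule are hypothetical.

```php
<?php
// Upper-case every word that follows "name: " (hypothetical input).
$input = 'name: bob, name: alice';

$output = preg_replace_callback(
    '/name: (\w+)/',
    function (array $m): string {
        // $m[0] is the whole match, $m[1] the first capturing group.
        return 'name: ' . strtoupper($m[1]);
    },
    $input
);

echo $output, "\n";   // name: BOB, name: ALICE
```

Because the callback is ordinary PHP code rather than a string that gets evaluated, matched text can never be executed by accident.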

A (PCRE_ANCHORED)
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string. The same effect can be achieved by writing the pattern appropriately (which is the only way to do it in Perl).

D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the very end of the subject string. Without this option, if the last character is a newline, the dollar sign also matches immediately before it (but not before any other newline). This option is ignored if the m modifier is set. There is no equivalent modifier in Perl.

S
When a pattern is going to be used several times, it is worth analyzing it first to speed up matching. If this modifier is set, this additional analysis is performed. Currently, analyzing a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY)
This modifier inverts the greediness of quantifiers, so that they are not greedy by default but become greedy when followed by "?". This is not compatible with Perl. The same behavior can be enabled with a (?U) setting inside the pattern, or for an individual quantifier by following it with a question mark (e.g. .*?).

For example:

    $str = 'src="http://www.test.cn/1.mp3" type="application/x-mplayer2"test,3333';
    echo preg_replace('/src="(.*)"/', '--', $str);
    echo "\n";
    echo preg_replace('/src="(.*)"/U', '--', $str);
    echo "\n";
    echo preg_replace('/src="(.*?)"/', '--', $str); // equivalent to preg_replace('|src="(.*)"|U', '--', $str);

Result:

--test,3333

-- type="application/x-mplayer2"test,3333

-- type="application/x-mplayer2"test,3333

This shows that the first call matches all the way to the last character that satisfies the pattern, which is known as greedy matching.

The second call stops at the first character that satisfies the pattern, which is known as non-greedy matching.

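X (PCRE_EXTRA)
  This modifier turns on additional PCRE functionality that is incompatible with Perl. Any backslash in the pattern that is followed by a letter with no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as that letter. No other features are currently controlled by this modifier.

u (PCRE_UTF8)
  This modifier turns on additional PCRE functionality that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern has been checked since PHP 4.3.5.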
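
A small PHP sketch of the u modifier. The strings are written with explicit byte escapes so the example does not depend on the encoding of the source file; the sample word is hypothetical.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$e = "\xC3\xA9";

// Without u, . matches one byte, so a one-character pattern fails.
var_dump(preg_match('/^.$/', $e));    // int(0)
// With u, . matches one whole UTF-8 character.
var_dump(preg_match('/^.$/u', $e));   // int(1)

// Splitting a UTF-8 string ("naïve") into characters.
$chars = preg_split('//u', "na\xC3\xAFve", -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n";             // 5
```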


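4. Operator precedence

   Operators of equal precedence are evaluated from left to right; operators of different precedence are evaluated from highest to lowest. The operators, from highest to lowest precedence, are:

Operator  Description
\  Escape
(), (?:), (?=), []  Parentheses and brackets
*, +, ?, {n}, {n,}, {n,m}  Quantifiers
^, $, \anymetacharacter  Anchors and sequences
|  Alternation ("or")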
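
The practical effect of the precedence table can be sketched in PHP. The words used are arbitrary examples.

```php
<?php
// | binds loosest: this pattern means (^cat) OR (dog$).
var_dump(preg_match('/^cat|dog$/', 'catalog'));     // int(1): starts with "cat"
// Parentheses restrict the alternation to the two words.
var_dump(preg_match('/^(cat|dog)$/', 'catalog'));   // int(0)

// A quantifier binds tighter than concatenation: only "b" is repeated.
var_dump(preg_match('/^ab+$/', 'abab'));     // int(0)
// Grouping makes the quantifier apply to "ab" as a unit.
var_dump(preg_match('/^(ab)+$/', 'abab'));   // int(1)
```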

5. Explanation of all symbols
Character  Description
\  Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^  Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
(?:pattern)  Matches pattern but does not capture the match, that is, it is a non-capturing match and is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a shorter expression than 'industry|industries'.
(?=pattern)  Positive lookahead: matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead: matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  Negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]  Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter from 'a' through 'z'.
[^a-z]  Negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' through 'z'.
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx  Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z; otherwise c is treated as a literal 'c' character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml  Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©). Note that PCRE, and therefore PHP's preg_* functions, does not support \u; use \x{00A9} together with the u modifier instead.

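6. Some examples
Regular expression  Description
/\b([a-z]+) \1\b/gi  Finds places where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/  Parses a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/  Locates chapter and section headings
/[-a-z]/  The 26 letters a to z plus the hyphen (-)
/ter\b/  Matches the ter in chapter but not in terminal
/\Bapt/  Matches the apt in chapter but not in aptitude
/Windows(?=95 |98 |NT )/  Matches the Windows in "Windows95 ", "Windows98 ", or "WindowsNT "; after a match is found, the search for the next match resumes right after Windows.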
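
Two of these examples are sketched below in PHP. The table uses JavaScript-style notation, so two adjustments are assumed here: PHP has no g modifier (preg_match_all() finds every match instead), and ~ is used as the delimiter so that the / and # characters inside the URL pattern need no escaping. The URL and the sentence are hypothetical.

```php
<?php
// Parsing a URL into protocol, domain, port, and path.
$url = 'http://www.example.com:80/docs/index.html';   // hypothetical URL
preg_match('~(\w+)://([^/:]+)(:\d*)?([^# ]*)~', $url, $parts);
print_r($parts);
// [1] => http, [2] => www.example.com, [3] => :80, [4] => /docs/index.html

// Counting doubled words with preg_match_all() in place of the g flag.
$count = preg_match_all(
    '/\b([a-z]+) \1\b/i',
    'the cost of of gasoline is going up up',
    $doubles
);
echo $count, "\n";   // 2
```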

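7. Regular expression matching rules

7.1 Basic pattern matching

   Let's start with the basics. The pattern is the most basic element of a regular expression: a set of characters that describes the features of a string. A pattern can be very simple, consisting of an ordinary string, or very complex, often using special characters to denote a range of characters, repetition, or context. For example:

    ^once

  This pattern contains the special character ^, which means the pattern only matches strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ denotes the start, the $ symbol matches strings that end with the given pattern.

    bucket$

  This pattern matches "Who kept all of this cash in a bucket" but not "buckets". When ^ and $ are used together, they denote an exact match (the string is identical to the pattern). For example:

    ^bucket$

  matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

    once

matches the string

    There once was a man from NewYork
    Who kept all of his cash in a bucket.

   The letters in that pattern (o-n-c-e) are literal characters, that is, they stand for themselves; the same goes for digits. Some slightly more complicated characters, such as punctuation and whitespace (spaces, tabs, etc.), need escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t, so to check whether a string begins with a tab, use this pattern:

    ^\t

Similarly, \n denotes a newline and \r a carriage return. Other special symbols can be written with a preceding backslash: the backslash itself is written \\, the period is written \., and so on.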
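
The patterns in this section can be tried in PHP as follows; this is a minimal sketch using the article's own sample sentences plus one made-up tab-indented string.

```php
<?php
$s1 = 'once upon a time';
$s2 = 'There once was a man from NewYork';

var_dump(preg_match('/^once/', $s1));   // int(1)
var_dump(preg_match('/^once/', $s2));   // int(0)
var_dump(preg_match('/once/',  $s2));   // int(1): unanchored, matches anywhere

var_dump(preg_match('/bucket$/', 'Who kept all of this cash in a bucket'));  // int(1)
var_dump(preg_match('/bucket$/', 'buckets'));    // int(0)
var_dump(preg_match('/^bucket$/', 'bucket'));    // int(1)

// Escape sequences: a string that starts with a tab.
var_dump(preg_match('/^\t/', "\tindented"));     // int(1)
```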
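7.2 Character classes
In Internet programs, regular expressions are commonly used to validate user input. When a user submits a form, you need to decide whether the phone number, address, email address, credit card number, and so on are valid, and plain literal characters are not enough for that.
So we need a freer way of describing the pattern we want: the character class. To build a character class that represents all vowels, put all the vowel characters in square brackets:

    [AaEeIiOoUu]

This pattern matches any vowel, but it stands for only one character. A hyphen denotes a range of characters, for example:

    [a-z] // matches any lowercase letter
    [A-Z] // matches any uppercase letter
    [a-zA-Z] // matches any letter
    [0-9] // matches any digit
    [0-9\.\-] // matches any digit, period, or minus sign
    [ \f\r\t\n] // matches any whitespace character

Again, each of these stands for only one character, which is an important point. To match a string made up of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:

    ^[a-z][0-9]$

Although [a-z] covers a range of 26 letters, here it can only match the first character of the string, which must be a lowercase letter.

We said earlier that ^ denotes the start of a string, but it has another meaning. When ^ is used as the first character inside square brackets, it means "not" or "exclude", and is often used to rule out certain characters. Continuing the previous example, suppose the first character must not be a digit:

    ^[^0-9][0-9]$

This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are a few more examples that exclude specific characters:

    [^a-z] // any character except a lowercase letter
    [^\\\/\^] // any character except (\), (/), and (^)
    [^\"\'] // any character except the double quote (") and single quote (')

The special character "." (dot, period) in a regular expression stands for any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any non-newline character. The pattern "." matches any string except the empty string and a string that contains only a newline.

PHP regular expressions have some built-in general character classes, listed below:

    Character class Meaning
    [[:alpha:]] any letter
    [[:digit:]] any digit
    [[:alnum:]] any letter or digit
    [[:space:]] any whitespace character
    [[:upper:]] any uppercase letter
    [[:lower:]] any lowercase letter
    [[:punct:]] any punctuation mark
    [[:xdigit:]] any hexadecimal digit, equivalent to [0-9a-fA-F]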
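
A short PHP sketch of these character classes, using the sample strings from the text. The built-in classes are shown with preg_match(), where they are also accepted inside a bracket expression; the inputs for those calls are made up.

```php
<?php
// One lowercase letter followed by one digit, and nothing else.
$results = [];
foreach (['z2', 't6', 'ab2', 'r2d3', 'b52'] as $s) {
    $results[$s] = preg_match('/^[a-z][0-9]$/', $s);
    printf("%-5s %d\n", $s, $results[$s]);
}

// First character must not be a digit, second must be one.
var_dump(preg_match('/^[^0-9][0-9]$/', '&5'));   // int(1)
var_dump(preg_match('/^[^0-9][0-9]$/', '12'));   // int(0)

// Built-in classes inside a bracket expression.
var_dump(preg_match('/^[[:alpha:]]+$/', 'Regex'));          // int(1)
var_dump(preg_match('/^[[:xdigit:]]+$/', '7fA0'));          // int(1)
var_dump(preg_match('/^[[:digit:][:space:]]+$/', '4 2'));   // int(1)
```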

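7.3 Specifying repetition
By now you know how to match a single letter or digit, but more often you will want to match a word or a number. A word consists of several letters, and a number consists of several digits. Curly braces ({}) after a character or character class specify how many times the preceding item is repeated.

    Pattern Meaning
    ^[a-zA-Z_]$ any single letter or underscore
    ^[[:alpha:]]{3}$ any three-letter word
    ^a$ the letter a
    ^a{4}$ aaaa
    ^a{2,4}$ aa, aaa, or aaaa
    ^a{1,3}$ a, aa, or aaa
    ^a{2,}$ a string consisting of two or more a's
    ^a{2,} e.g. aardvark and aaab, but not apple
    a{2,} e.g. baad and aaa, but not Nantucket
    \t{2} two tabs
    .{2} any two characters

These examples show the three different uses of curly braces. A single number, {x}, means "the preceding character or character class occurs exactly x times"; a number followed by a comma, {x,}, means "the preceding item occurs x or more times"; two numbers separated by a comma, {x,y}, means "the preceding item occurs at least x times but no more than y times". We can extend the pattern to more words or numbers:

    ^[a-zA-Z0-9_]{1,}$ // any string of one or more letters, digits, or underscores
    ^[0-9]{1,}$ // any non-negative integer (digits only)
    ^\-{0,1}[0-9]{1,}$ // any integer
    ^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // any decimal number

The last example is not easy to read, is it? Look at it this way: it matches everything that starts (^) with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more further digits ([0-9]{0,}), and nothing else ($). A simpler way to write this follows.

The special character "?" is the same as {0,1}; both mean "zero or one of the preceding item", or "the preceding item is optional". So the example above can be simplified to:

    ^\-?[0-9]{0,}\.?[0-9]{0,}$

The special character "*" is the same as {0,}; both mean "zero or more of the preceding item". Finally, the character "+" is the same as {1,}, meaning "one or more of the preceding item". So the four examples above can be written as:

    ^[a-zA-Z0-9_]+$ // any string of one or more letters, digits, or underscores
    ^[0-9]+$ // any non-negative integer (digits only)
    ^\-?[0-9]+$ // any integer
    ^\-?[0-9]*\.?[0-9]*$ // any decimal number

This does not make the regular expressions technically any less complex, but it does make them easier to read.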
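
The integer and decimal patterns can be checked with a short PHP sketch. It also shows a limitation worth knowing: because every part of the decimal pattern is optional, it accepts the empty string and a bare "-.". The $strict pattern is an assumed alternative for illustration, not something from the article.

```php
<?php
$integer = '/^\-?[0-9]+$/';
$decimal = '/^\-?[0-9]*\.?[0-9]*$/';

var_dump(preg_match($integer, '-42'));    // int(1)
var_dump(preg_match($integer, '4.2'));    // int(0)
var_dump(preg_match($decimal, '-3.14'));  // int(1)

// Every part of $decimal is optional, so it also accepts these:
var_dump(preg_match($decimal, ''));       // int(1)
var_dump(preg_match($decimal, '-.'));     // int(1)

// A stricter variant (an assumption, not from the article): digits are required.
$strict = '/^-?[0-9]+(\.[0-9]+)?$/';
var_dump(preg_match($strict, '-3.14'));   // int(1)
var_dump(preg_match($strict, '-.'));      // int(0)
```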


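8. Differences between POSIX and Perl regular expressions

PHP uses two sets of regular expression rules. One is the POSIX Extended 1003.2 compatible regex defined by the Institute of Electrical and Electronics Engineers (IEEE) (in fact, PHP's support for this standard is incomplete). The other is the Perl-compatible regex provided by the PCRE (Perl Compatible Regular Expressions) library, open-source software written by Philip Hazel. Note that the POSIX functions of the ereg extension (ereg(), eregi(), ereg_replace(), eregi_replace(), split(), spliti(), and sql_regcase()) were deprecated in PHP 5.3.0 and removed in PHP 7.0.0.

Functions that use the POSIX-compatible rules: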
ereg_replace()
ereg()
eregi()
eregi_replace()
split()
spliti()
sql_regcase()
mb_ereg_match()
mb_ereg_replace()
mb_ereg_search_getpos()
mb_ereg_search_getregs()
mb_ereg_search_init()
mb_ereg_search_pos()
mb_ereg_search_regs()
mb_ereg_search_setpos()
mb_ereg_search()
mb_ereg()
mb_eregi_replace()
mb_eregi()
mb_regex_encoding()
mb_regex_set_options()
mb_split()

Functions that use the Perl-compatible rules:
preg_grep()
preg_replace_callback()
preg_match_all()
preg_match()
preg_quote()
preg_split()
preg_replace()

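Delimiters:

POSIX-compatible regex has no delimiters; the corresponding function argument is taken to be the regular expression.

Perl-compatible regex can use any character that is not a letter, digit, whitespace, or backslash (\) as the delimiter. If the delimiter character has to be used inside the expression itself, it must be escaped with a backslash. The bracket pairs (), {}, [] and <> can also be used as delimiters.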
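
The delimiter rules can be sketched in PHP as follows. The URL is a hypothetical subject; all three patterns express the same match with different delimiters.

```php
<?php
$url = 'http://www.example.com/index.html';   // hypothetical subject

// With / as the delimiter, every / inside the pattern must be escaped.
var_dump(preg_match('/^http:\/\/[^\/]+\//', $url));   // int(1)

// A different delimiter avoids the escaping.
var_dump(preg_match('#^http://[^/]+/#', $url));       // int(1)

// Bracket-style delimiters come in pairs; modifiers follow the closing one.
var_dump(preg_match('{^HTTP://[^/]+/}i', $url));      // int(1)
```

Picking a delimiter that does not occur in the pattern keeps the expression much easier to read.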

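Modifiers:

POSIX-compatible regex has no modifiers.

Modifiers that can be used in Perl-compatible regex (spaces and newlines in modifiers are ignored; other characters cause an error):

i (PCRE_CASELESS):
Ignores case when matching.

m (PCRE_MULTILINE):
When this modifier is set, start of line (^) and end of line ($) match not only at the start and end of the whole string but also immediately after and before each newline (\n) within it.

s (PCRE_DOTALL):
If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded.

x (PCRE_EXTENDED):
If this modifier is set, whitespace characters in the pattern are completely ignored, except when they are escaped or inside a character class.

e:
If this modifier is set, preg_replace() performs the normal backreference substitution in the replacement string, evaluates the result as PHP code, and uses the outcome to replace the matched text. Only preg_replace() uses this modifier; other PCRE functions ignore it.

A (PCRE_ANCHORED):
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string.

D (PCRE_DOLLAR_ENDONLY):
If this modifier is set, end of line ($) in the pattern matches only at the very end of the subject string. Without this option, if the last character is a newline, $ also matches just before it. This option is ignored if the m modifier is set.

S:
When a pattern is going to be used several times, it is worth analyzing it first to speed up matching. If this modifier is set, this additional analysis is performed. Currently, analyzing a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY):
Inverts the greediness of quantifiers: they are non-greedy by default and become greedy when followed by "?".

X (PCRE_EXTRA):
Any backslash in the pattern that is followed by a letter with no special meaning causes an error, thus reserving these combinations for future expansion. By default, a backslash followed by a letter with no special meaning is treated as that letter.

u (PCRE_UTF8):
Pattern strings are treated as UTF-8.

Grouping symbols:

The grouping symbols work and are used in exactly the same way in POSIX-compatible and Perl-compatible regex:
[]: contains the characters of a "choose any one" operation.
{}: contains the repetition count.
(): contains a logical group, which can be used for references.
|: means "or"; [ab] and a|b are equivalent.

Metacharacters and "[]":

There are two different sets of metacharacters: those recognized anywhere in the pattern except inside square brackets, and those recognized inside square brackets "[]".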

Metacharacters outside "[]" that behave the same in POSIX-compatible and Perl-compatible regex:
\ General escape character with several uses
^ Matches the start of the string
$ Matches the end of the string
? Matches 0 or 1 occurrence of the preceding item
* Matches 0 or more occurrences of the preceding item
+ Matches 1 or more occurrences of the preceding item

Metacharacters outside "[]" that behave differently in POSIX-compatible and Perl-compatible regex:
. In Perl-compatible regex, matches any character except the newline character
. In POSIX-compatible regex, matches any character

Metacharacters inside "[]" that behave the same in POSIX-compatible and Perl-compatible regex:
\ General escape character with several uses
^ Negates the class, but only when it is the first character
- Specifies a character range by ASCII value. If you study the ASCII table you will find that [W-c] is equivalent to [WXYZ[\\\]^_`abc]

Metacharacters inside "[]" that behave differently in POSIX-compatible and Perl-compatible regex:
- In POSIX-compatible regex, [a-c-e] raises an error.
- In Perl-compatible regex, [a-c-e] is accepted: it matches a letter in the range a to c, a literal hyphen, or e.

Repetition counts with "{}":

POSIX-compatible and Perl-compatible regular expressions handle repetition counts in exactly the same way:
{2}: matches the preceding item exactly 2 times
{2,}: matches the preceding item 2 or more times; matching is greedy (as many as possible) by default
{2,4}: matches the preceding item 2 to 4 times

Grouping with "()":

The region enclosed by () is a group. Its main purpose is to express the logical order in which characters appear; its other use is that it can be referenced (the value matched by the group can be referred to later). The second use is shown below:
<?php
$str = "http://www.163.com/";
// POSIX-compatible regex (ereg_replace() was removed in PHP 7.0.0):
echo ereg_replace("(.+)", "\\1", $str);
// Perl-compatible regex:
echo preg_replace("/(.+)/", "$1", $str);
// Both calls output the text captured by group 1
?>

When groups are referenced, parentheses can be nested; the groups are numbered in the order in which their opening "(" appears.
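
The numbering of nested groups can be sketched in PHP like this. The date string and the reordering into day/month/year are illustrative assumptions.

```php
<?php
// Groups are numbered by the position of their opening parenthesis:
// 1 = whole date, 2 = year, 3 = month-day, 4 = day.
preg_match('/((\d{4})-(\d{2}-(\d{2})))/', '2016-07-21', $m);
print_r($m);
// [0] => 2016-07-21, [1] => 2016-07-21, [2] => 2016, [3] => 07-21, [4] => 21

// $1, $2, ... refer to the captured groups in a replacement string.
$swapped = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', '2016-07-21');
echo $swapped, "\n";   // 21/07/2016
```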

Character type matching:

POSIX-compatible regex:
[:upper:]: matches all uppercase letters
[:lower:]: matches all lowercase letters
[:alpha:]: matches all letters
[:alnum:]: matches all letters and digits
[:digit:]: matches all digits
[:xdigit:]: matches all hexadecimal digits, equivalent to [0-9A-Fa-f]
[:punct:]: matches all punctuation marks, such as [.,"'?!;:]
[:blank:]: matches space and TAB, equivalent to [ \t]
[:space:]: matches all whitespace characters, equivalent to [ \t\n\r\f\v]
[:cntrl:]: matches all control characters between ASCII 0 and 31
[:graph:]: matches all printable characters, roughly equivalent to [^ \t\n\r\f\v]
[:print:]: matches all printable characters and the space, roughly equivalent to [^\t\n\r\f\v]
[.c.]: collating symbol
[=c=]: equivalence class
[:<:]: matches the start of a word
[:>:]: matches the end of a word

Perl-compatible regex (here you can see how much more Perl-compatible regex offers):
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh character with hexadecimal code hh
\ddd character with octal code ddd, or backreference
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any character that is not a whitespace character
\w any "word" character
\W any "non-word" character
\b word boundary
\B non-word boundary
\A start of subject (independent of multiline mode)
\Z end of subject or before a trailing newline (independent of multiline mode)
\z end of subject (independent of multiline mode)
\G first matching position in subject