Knowledge summary of regular expressions in PHP interviews (super detailed)

不言
2019-01-09 10:25:45

This article summarizes the regular expression knowledge that commonly comes up in PHP interviews: basic concepts, delimiters, metacharacters, pattern modifiers, backreferences, and the main PCRE functions.

1. Introduction

1. What is a regular expression

A regular expression is a pattern that describes a class of strings.
A single pattern string is used to describe and match every string that conforms to a given syntax rule.
Regular expressions are terse and can be hard to read, but they are powerful; once learned, they make many string-handling tasks much quicker to write.
Many programming languages support string operations based on regular expressions.

2. The role of regular expressions

Split, search, match, and replace strings

3. Regular expressions in PHP

PHP has historically had two families of regular expression functions. Their capabilities are similar, but their execution efficiency differs slightly:

One family is provided by the PCRE (Perl Compatible Regular Expressions) library. Its functions are named with the prefix "preg_".
The other is provided by the POSIX (Portable Operating System Interface of Unix) extension. Its functions have names beginning with "ereg", such as ereg() and ereg_replace(). This extension was deprecated in PHP 5.3 and removed in PHP 7.0.

PCRE syntax is derived from the Perl language, one of the most capable languages for string processing, and early PHP was itself strongly influenced by Perl.
PCRE syntax supports more features and is more powerful than POSIX syntax, so this article focuses on PCRE regular expressions.

4. The composition of regular expressions

In PHP, a regular expression has three parts: delimiters, the expression itself, and pattern modifiers.

Delimiter

A delimiter can be any ASCII character except letters, digits, the backslash (\), and whitespace.
The most commonly used delimiters are the forward slash (/), the hash sign (#), and the tilde (~).

Expression

The expression consists of special characters and ordinary literal text. It is the main part of the pattern and determines the matching rules.

Pattern modifier

Pattern modifiers turn specific matching features or modes on and off.

2. Delimiter

1. Selection of delimiter

When using the PCRE functions, the regular expression must be enclosed by delimiters.
A delimiter can be any ASCII character except letters, digits, the backslash (\), and whitespace.
The most commonly used delimiters are the forward slash (/), the hash sign (#), and the tilde (~).

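/foo bar/ (valid)
#^[^0-9]$# (valid)
+php+    (valid)
%[a-zA-Z0-9_-]%    (valid)
#[a-zA-Z0-9_-]/    (invalid: the opening and closing delimiters differ)
a[a-zA-Z0-9_-]a    (invalid: a delimiter cannot be a letter)
\[a-zA-Z0-9_-]\    (invalid: a delimiter cannot be a backslash (`\`))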

In addition to the delimiters mentioned above, you can also use bracket-style delimiters. The left bracket and the right bracket serve as the start and end delimiters respectively.

{this is a pattern}
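
As a small illustration of the delimiter rules above, the following sketch runs the same search with three delimiter styles. The sample string and variable names are made up for this example and are not part of the original article.

```php
<?php
// Illustrative subject string (an assumption for this sketch).
$subject = 'see http://example.com/docs';

// The same expression written with three different delimiters.
$patterns = [
    '/http:\/\/[a-z.]+/', // slash delimiter: inner slashes must be escaped
    '#http://[a-z.]+#',   // hash delimiter: no escaping needed
    '{http://[a-z.]+}',   // bracket-style delimiter
];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, $subject, $m);
    $results[] = $m[0];
}

print_r($results); // every entry is "http://example.com"
```

All three patterns describe the same expression; only the amount of escaping differs.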

2. Use of delimiter

If the delimiter character appears inside the expression, it must be escaped with a backslash (\).
If the delimiter appears often inside the expression, it is better to choose a different delimiter to improve readability.

/http:\/\//
#http://#

When you need to put a string into a regular expression, you can use the preg_quote() function to escape it. Its second parameter (optional) can be used to specify the delimiter that needs to be escaped.

// In this example, preg_quote($word, '/') escapes the asterisks and forward slashes (/) so that they are matched literally instead of being treated as regex syntax or delimiters.
$textBody = "This book is */very/* difficult to find.";
$word = "*/very/*";
$reg = "/" . preg_quote($word, '/') . "/";

echo $reg; // outputs '/\*\/very\/\*/'

echo preg_replace($reg, "<i>" . $word . "</i>", $textBody); // outputs 'This book is <i>*/very/*</i> difficult to find.'

Pattern modifiers can be added after the closing delimiter to change how the pattern matches.

The following example performs a case-insensitive match:

#[a-z]#i

3. Metacharacters

1. Escape character

Character Description
\ Marks the next character as a special character, a literal character, or a backreference.
For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\", and '\(' matches "(".

2. Anchors (locators)

Character Description
^ Matches the beginning of the input string (or, in multi-line mode, the beginning of a line).
$ Matches the end of the input string (or, in multi-line mode, the end of a line).
\b Matches a word boundary, that is, the position between a word character and a non-word character such as a space.
\B Matches a non-word boundary.

3. Quantifiers

Character Description
* Matches the preceding subexpression zero or more times.
For example, 'zo*' can match "z" and "zoo". Equivalent to {0,}.
+ Matches the preceding subexpression one or more times.
For example, 'zo+' can match "zo" and "zoo", but not "z". Equivalent to {1,}.
? When used as a quantifier, matches the preceding subexpression zero or one time.
For example, "do(es)?" can match "do" or "does". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times.
For example, 'o{2}' cannot match the 'o' in "Bob", but it can match the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times.
For example, 'o{2,}' cannot match the 'o' in "Bob", but it can match all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times.
For example, "o{1,3}" will match the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there cannot be a space between the comma and the two numbers.
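
The quantifier examples in this table can be checked directly with preg_match(). The sketch below is only an illustration; the `$cases` array and variable names are assumptions introduced here, while the sample strings follow the table.

```php
<?php
// Each entry maps a pattern to a subject string taken from the table above.
$cases = [
    '/zo*/'     => 'z',        // * allows zero "o"
    '/zo+/'     => 'zoo',      // + needs at least one "o"
    '/do(es)?/' => 'does',     // ? makes "es" optional
    '/o{2}/'    => 'food',     // exactly two
    '/o{2,}/'   => 'foooood',  // two or more
    '/o{1,3}/'  => 'fooooood', // between one and three
];

$matched = [];
foreach ($cases as $pattern => $subject) {
    preg_match($pattern, $subject, $m);
    $matched[$pattern] = $m[0]; // text matched by the whole pattern
}
print_r($matched);

// {2} needs two consecutive "o" characters, so "Bob" does not match.
$bob = preg_match('/o{2}/', 'Bob');
var_dump($bob); // int(0)
```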

4. Common characters

Character Description
\d Matches a digit. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\w Matches a letter, digit, or underscore. Equivalent to [A-Za-z0-9_].
\W Matches any character that is not a letter, digit, or underscore. Equivalent to [^A-Za-z0-9_].
\s Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
. Matches any single character except a newline (\n under PCRE's default settings).
To match any character including '\n', use a pattern like "(.|\n)".
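
A short sketch of these shorthand classes in use follows. The sample text and variable names are hypothetical and chosen only for this illustration.

```php
<?php
// Hypothetical sample text for this sketch.
$text = "Order 42 shipped\nto Bob_7";

preg_match_all('/\d+/', $text, $digits); // runs of digits
preg_match_all('/\w+/', $text, $words);  // runs of letters, digits, underscores
$parts = preg_split('/\s+/', $text);     // split on any whitespace

// The dot stops at the newline unless alternation with \n is used.
preg_match('/Order.*/', $text, $line);
preg_match('/Order(.|\n)*/', $text, $all);

print_r($digits[0]); // 42, 7
print_r($words[0]);  // Order, 42, shipped, to, Bob_7
print_r($parts);     // Order, 42, shipped, to, Bob_7
echo $line[0], "\n"; // Order 42 shipped
echo $all[0], "\n";  // the whole text, including the second line
```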

5. Non-printing characters

Character Description
\n Matches a newline character. Equivalent to \x0a and \cJ.
\r Matches a carriage return character. Equivalent to \x0d and \cM.
\t Matches a tab character. Equivalent to \x09 and \cI.

6. Multiple selection branch characters

Character Description
| The vertical bar separates alternatives; the pattern matches any one of them.
For example, 'z|food' can match "z" or "food". '(z|f|g)ood' matches "zood", "food" or "good".

7. Character group

Character Description
x|y Matches x or y. Note that inside square brackets the vertical bar is a literal character, so [x|y] matches 'x', '|', or 'y'.
For example, 'z|food' can match "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz] Character set. Matches any one of the enclosed characters.
For example, [abc] matches the 'a' in "plain".
[^xyz] Negated character set. Matches any character that is not enclosed.
For example, [^abc] can match 'p', 'l', 'i', 'n' in "plain".
[a-z] Character range. Matches any character within the specified range.
For example, [a-z] matches any lowercase alphabetic character in the range 'a' to 'z'.
[^a-z] Negated character range. Matches any character not within the specified range.
For example, [^a-z] matches any character that is not in the range 'a' to 'z'.
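
The character set examples above can be tried out as follows. This is a minimal sketch; the variable names and the "php7" sample string are assumptions added for illustration.

```php
<?php
$word = 'plain';

preg_match('/[abc]/', $word, $first);          // first character that is a, b, or c
preg_match_all('/[^abc]/', $word, $others);    // every character that is not a, b, or c
$onlyLower   = preg_match('/^[a-z]+$/', $word);  // whole string is lowercase letters
$hasNonLower = preg_match('/[^a-z]/', 'php7');   // contains something outside a-z

// Inside brackets the vertical bar is a literal character.
$barInClass = preg_match('/[x|y]/', '|');

echo $first[0], "\n";  // a
print_r($others[0]);   // p, l, i, n
```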

8. Non-greedy matching character

Character Description
? When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy.
Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much of the searched string as possible.
For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all of the 'o's.

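9. ( ) Group

Character Description
(?<=pattern) Positive lookbehind assertion. Similar to positive lookahead, but in the opposite direction: the text immediately before the current position must match pattern.
(?<!pattern) Negative lookbehind assertion. Similar to negative lookahead, but in the opposite direction: the text immediately before the current position must not match pattern.
For example, "(?<!95|98|NT|2000)Windows" can match "Windows" in "3.1Windows", but cannot match "Windows" in "2000Windows".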
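
To make the lookbehind behaviour concrete, here is a small sketch using the "2000Windows" and "3.1Windows" strings from this article's lookaround examples. The array and variable names are illustrative assumptions.

```php
<?php
$subjects = ['2000Windows', '3.1Windows'];

$positive = []; // "Windows" preceded by 95, 98, NT, or 2000
$negative = []; // "Windows" not preceded by any of them
foreach ($subjects as $s) {
    $positive[$s] = preg_match('/(?<=95|98|NT|2000)Windows/', $s);
    $negative[$s] = preg_match('/(?<!95|98|NT|2000)Windows/', $s);
}

print_r($positive); // 2000Windows => 1, 3.1Windows => 0
print_r($negative); // 2000Windows => 0, 3.1Windows => 1
```

Each alternative inside a PCRE lookbehind must have a fixed length, which is the case here (two or four characters).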

4. Pattern modifiers

1. i (case-insensitive)

If this modifier is set, letters in the pattern match both uppercase and lowercase letters.

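2. m (multi-line mode)

By default, PCRE treats the subject string as a single line of characters (even if it actually contains several lines).
The "start of line" metacharacter (^) matches only at the start of the string, and the "end of line" metacharacter ($) matches only at the end of the string or before a final newline (unless the D modifier is set).

When this modifier is set, ^ and $ also match immediately after and immediately before any newline in the subject string, respectively, in addition to the very start and very end of the string.

If the subject string contains no "\n" characters, or the pattern contains no ^ or $, setting this modifier has no effect.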
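
A minimal sketch of the difference the m modifier makes; the three-line sample string is an assumption for illustration.

```php
<?php
$text = "first line\nsecond line\nthird line";

// Without m: ^ only matches at the very start of the string.
preg_match_all('/^\w+/', $text, $single);

// With m: ^ also matches right after each newline.
preg_match_all('/^\w+/m', $text, $multi);

print_r($single[0]); // first
print_r($multi[0]);  // first, second, third
```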

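3. s (dot-all mode)

By default, the dot (.) does not match newlines.
If this modifier is set, the dot metacharacter in the pattern matches every character, including newlines.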
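
The effect of the s modifier can be seen in a sketch like the following, where the two-line `$html` string is a made-up example.

```php
<?php
$html = "<p>line one\nline two</p>";

$withoutS = preg_match('/<p>(.*)<\/p>/', $html, $m1);  // dot stops at "\n"
$withS    = preg_match('/<p>(.*)<\/p>/s', $html, $m2); // dot crosses "\n"

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
echo $m2[1], "\n";   // prints both lines
```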

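4. U (ungreedy mode)

This modifier inverts the greediness of quantifiers: they become non-greedy by default, and following a quantifier with ? makes it greedy again. The effect is the same as adding the ? described earlier to every quantifier.

In non-greedy mode it is usually not possible to match more than pcre.backtrack_limit characters.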

Greedy mode

$str = '<b>abc</b><b>def</b>';
$pattern = '/<b>(.*)<\/b>/';
echo preg_replace($pattern, '\\1', $str); // outputs 'abc</b><b>def'

.* matches abc</b><b>def

Non-greedy mode

Method 1: use ? to switch to non-greedy matching

$str = '<b>abc</b><b>def</b>';
$pattern = '/<b>(.*?)<\/b>/';
echo preg_replace($pattern, '\\1', $str); // outputs 'abcdef'

.*? matches abc and def separately

Method 2: use the U modifier to switch to non-greedy matching

$str = '<b>abc</b><b>def</b>';
$pattern = '/<b>(.*)<\/b>/U';
echo preg_replace($pattern, '\\1', $str); // outputs 'abcdef'

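5. u (UTF-8 support)

This modifier causes both the pattern and the subject string to be treated as UTF-8.
An invalid subject string causes the preg_* functions to match nothing; an invalid pattern string triggers an E_WARNING level error.

$str = '中文';

$pattern = '/^[\x{4e00}-\x{9fa5}]+$/u';

if (preg_match($pattern, $str)) {
    echo 'The string consists entirely of Chinese characters';
} else {
    echo 'The string does not consist entirely of Chinese characters';
}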

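6. D (dollar end only)

By default, if the subject string ends with a newline, a $ in the pattern also matches immediately before that final newline (but not before any other newline).
If this modifier is set, a $ in the pattern matches only at the very end of the subject string.
This modifier is ignored if the m modifier is set.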
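
This matters when validating input that may carry a trailing newline. A minimal sketch, with a made-up `$input` value:

```php
<?php
$input = "abc\n"; // e.g. a line read from a file, with its trailing newline

$default    = preg_match('/^abc$/', $input);  // $ may match before the final "\n"
$dollarOnly = preg_match('/^abc$/D', $input); // $ must be at the very end

var_dump($default);    // int(1)
var_dump($dollarOnly); // int(0)
```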

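7. x (extended mode)

If this modifier is set, whitespace in the pattern is ignored unless it is escaped or inside a character class, and everything between an unescaped # outside a character class and the next newline is ignored as well.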
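
The x modifier makes it possible to lay a pattern out over several lines with comments. The date pattern below is a hypothetical example, not taken from the original article.

```php
<?php
// A hypothetical date pattern, laid out with whitespace and # comments.
$pattern = '/
    ^(\d{4})   # year
    -(\d{2})   # month
    -(\d{2})$  # day
/x';

preg_match($pattern, '2019-01-09', $m);
print_r($m); // 2019-01-09, 2019, 01, 09
```

Because the delimiter here is /, the comments themselves must not contain an unescaped slash.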

8. A (anchored)

If this modifier is set, the pattern is forced to be "anchored": it is constrained to match only at the start of the subject string.
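
A brief sketch of anchored matching; the sample strings are assumptions for illustration.

```php
<?php
$unanchored = preg_match('/\d+/', 'abc123');  // may start anywhere in the string
$anchored   = preg_match('/\d+/A', 'abc123'); // must match at offset 0
$atStart    = preg_match('/\d+/A', '123abc'); // digits are at offset 0

var_dump($unanchored, $anchored, $atStart); // int(1), int(0), int(1)
```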

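9. S (extra analysis)

When a pattern is going to be used many times, it is worth spending some extra time analyzing it in order to speed up matching.
If this modifier is set, that extra analysis is performed.
At present, this analysis only helps non-anchored patterns, that is, patterns that do not have a single fixed starting character.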

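5. Backreferences

A group of atoms enclosed in ( ) is not only a single unit but also a subexpression.
Outside the ( ) subexpression, a backslash followed by a number greater than 0 is a backreference to a subexpression that appeared earlier in the pattern.
A backreference matches the same text that the referenced ( ) subexpression matched.

1. Using backreferences in a regular expression

(sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility".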
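
The example pattern above can be verified with a few preg_match() calls. The variable names are illustrative only.

```php
<?php
$pattern = '/(sens|respons)e and \1ibility/';

// \1 must repeat whatever group 1 actually matched.
$a = preg_match($pattern, 'sense and sensibility');
$b = preg_match($pattern, 'response and responsibility');
$c = preg_match($pattern, 'sense and responsibility');

var_dump($a, $b, $c); // int(1), int(1), int(0)
```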

2. Using backreferences in PCRE functions

<?php
$str = '<b>abc</b><b>def</b>';
$pattern = '/<b>(.*)<\/b><b>(.*)<\/b>/';
$replace = preg_replace($pattern, '\\1', $str);
echo $replace . "\n";

$replace = preg_replace($pattern, '\\2', $str);
echo $replace . "\n";

Output:

abc
def

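6. Common PCRE functions

The official PHP documentation covers these functions in detail, so they are only listed here.

preg_match(): perform a regular expression match

preg_match_all(): perform a global regular expression match

preg_replace(): perform a regular expression search and replace

preg_replace_callback(): perform a regular expression search and replace using a callback

preg_replace_callback_array(): perform several regular expression searches and replace using the corresponding callbacks

preg_split(): split a string by a regular expression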
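
For orientation, here is a rough sketch that exercises most of the functions listed above on one string. The `$line` value and all variable names are hypothetical; preg_replace_callback_array() is left out to keep the example short.

```php
<?php
// Hypothetical log line used only for this sketch.
$line = 'user=alice; id=42; tags=php,regex,pcre';

// preg_match(): first match only, with capture groups.
preg_match('/id=(\d+)/', $line, $m);
$id = $m[1];

// preg_match_all(): every key=value pair, one sub-array per match.
preg_match_all('/(\w+)=([^;]+)/', $line, $pairs, PREG_SET_ORDER);
$keys = array_column($pairs, 1);

// preg_replace(): mask the numeric id.
$masked = preg_replace('/id=\d+/', 'id=***', $line);

// preg_replace_callback(): compute each replacement in PHP code.
$upper = preg_replace_callback('/user=(\w+)/', function (array $match) {
    return 'user=' . strtoupper($match[1]);
}, $line);

// preg_split(): break the tag list on commas.
$tags = preg_split('/,/', 'php,regex,pcre');

print_r([$id, $keys, $masked, $upper, $tags]);
```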

7. Application practice

1. Matching Chinese characters

In UTF-8 mode, common Chinese characters occupy the Unicode code point range 0x4e00-0x9fa5.
In an ANSI (GB2312) environment, the first byte is in the range 0xb0-0xf7 and the second byte is in the range 0xa1-0xfe.

With UTF-8, use the u pattern modifier so that the pattern string is treated as UTF-8.
In an ANSI (GB2312) environment, use chr() to convert the byte values into characters.

UTF-8

<?php

$str = '中文';

$pattern = '/[\x{4e00}-\x{9fa5}]/u';

preg_match($pattern, $str, $match);

var_dump($match);

ANSI(GB2312)

<?php

$str = '中文';

$pattern = '/['.chr(0xb0).'-'.chr(0xf7).']['.chr(0xa1).'-'.chr(0xfe).']/';

preg_match($pattern, $str, $match);

var_dump($match);

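2. Matching the src value of all img tags in a page

<?php

$str = '<img alt="High-resolution image" id="color" src="color.jpg" />';

$pattern = '/<img.*?src="(.*?)".*?\/?>/i';

// preg_match() returns only the first img tag; use preg_match_all() to collect every one on a page.
preg_match($pattern, $str, $match);

var_dump($match);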
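
When a page contains several img tags, preg_match_all() can collect every src value. The following is a sketch under simple assumptions (double-quoted attributes, a made-up `$html` fragment, and a slightly stricter pattern than the one above); for arbitrary real-world HTML, a DOM parser is usually more robust than a regular expression.

```php
<?php
// Hypothetical page fragment with two img tags.
$html = '<img src="a.png" alt="A"><p>text</p><img alt="B" src="b.jpg" />';

// [^>]* keeps each match inside a single tag; group 1 holds the src value.
preg_match_all('/<img[^>]*?\ssrc="([^"]*)"[^>]*>/i', $html, $matches);
$sources = $matches[1];

print_r($sources); // a.png, b.jpg
```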

Group constructs ( ):

Character Description
(pattern) Matches pattern and captures the match. To match literal parentheses, use \( or \).
(?:pattern) Matches pattern without capturing the result; this is a non-capturing group and nothing is stored for later use. This is useful when combining parts of a regular expression with the "or" character (|).
For example, 'industr(?:y|ies)' is a more compact expression than 'industry|industries'.
(?=pattern) Positive lookahead assertion. Matches at any position where the text that follows matches pattern. This is a non-capturing match: the lookahead text is not stored for later use.
For example, "Windows(?=95|98|NT|2000)" can match "Windows" in "Windows2000", but cannot match "Windows" in "Windows3.1". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters examined by the lookahead.
(?!pattern) Negative lookahead assertion. Matches at any position where the text that follows does not match pattern. This is a non-capturing match: the lookahead text is not stored for later use.
For example, "Windows(?!95|98|NT|2000)" can match "Windows" in "Windows3.1", but cannot match "Windows" in "Windows2000". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters examined by the lookahead.
(?<=pattern) Positive lookbehind assertion. For example, "(?<=95|98|NT|2000)Windows" can match "Windows" in "2000Windows", but cannot match "Windows" in "3.1Windows".
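
The lookahead and non-capturing examples from this table can be checked with a short sketch like this one. The variable names are assumptions added for illustration; the sample strings come from the table.

```php
<?php
$subjects = ['Windows2000', 'Windows3.1'];

$followed = [];    // "Windows" followed by 95, 98, NT, or 2000
$notFollowed = []; // "Windows" not followed by any of them
foreach ($subjects as $s) {
    $followed[$s]    = preg_match('/Windows(?=95|98|NT|2000)/', $s);
    $notFollowed[$s] = preg_match('/Windows(?!95|98|NT|2000)/', $s);
}

// The lookahead text is checked but not consumed, so it is not part of the match.
preg_match('/Windows(?=2000)/', 'Windows2000', $m);
$matchedText = $m[0]; // "Windows"

// A non-capturing group adds no entry to the result array.
preg_match('/industr(?:y|ies)/', 'industries', $g);
$groupCount = count($g); // 1: only the full match

print_r($followed);    // Windows2000 => 1, Windows3.1 => 0
print_r($notFollowed); // Windows2000 => 0, Windows3.1 => 1
```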

Statement:
This article is reproduced from segmentfault.com. If there is any infringement, please contact admin@php.cn to have it removed.