
What does php regular expression mean?

青灯夜游 (Original)
2023-02-08 13:43:25, 4305 views

In PHP, a regular expression is a pattern, written in its own small grammar, that describes how characters are arranged in a string. The grammar is rich enough to express complex patterns and gives you a flexible, intuitive way to process strings. A regular expression can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract substrings that meet a given condition.


The operating environment of this tutorial: Windows 7, PHP 8, Dell G3 computer.

You may have heard of regular expressions before and come away with the impression that they are hard to learn, complicated, and unfathomable. In fact, they are not that mysterious: a regular expression is simply a pattern, written in its own grammar, that describes how characters are arranged.

What is a regular expression?

Regular expressions are also called pattern expressions. They have a complete syntax for writing patterns and provide a flexible, intuitive way to process strings. A regular expression builds a pattern from specific rules; the pattern is then compared against an input string inside a specific function to perform operations such as matching, searching, replacing, and splitting.

To take an everyday example: to find all txt files in a directory on your computer, you can enter *.txt in that directory and press Enter, and all txt files in the directory are listed. The *.txt used here is a wildcard pattern, which can be thought of as a very simple relative of a regular expression.

The following two examples are constructed using the syntax of regular expressions, as shown below:

/http(s)?:\/\/[\w.]+[\w\/]*[\w.]*\??[\w=&\+\%]*/is      // regular expression that matches a URL
/^\w{3,}@([a-z]{2,7}|[0-9]{3})\.(com|cn)$/                    // regular expression that matches an email address

Don't be put off by the seemingly garbled strings above. Each is a string built, according to the rules of regular expression syntax, from ordinary characters and characters with special meanings. These strings only take effect when they are passed to specific regular expression functions.
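As a minimal sketch of that last point, the two patterns above can be passed to preg_match(), one of PHP's PCRE functions. The helper names looksLikeUrl and looksLikeEmail and the sample inputs are illustrative assumptions, not part of the original tutorial.

```php
<?php
// Patterns copied from the two examples above.
const URL_PATTERN   = '/http(s)?:\/\/[\w.]+[\w\/]*[\w.]*\??[\w=&\+\%]*/is';
const EMAIL_PATTERN = '/^\w{3,}@([a-z]{2,7}|[0-9]{3})\.(com|cn)$/';

// Hypothetical helper: true if the text contains something shaped like a URL.
function looksLikeUrl(string $text): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match(URL_PATTERN, $text) === 1;
}

// Hypothetical helper: true if the whole text fits the email pattern.
function looksLikeEmail(string $text): bool
{
    return preg_match(EMAIL_PATTERN, $text) === 1;
}

var_dump(looksLikeUrl('Visit https://www.php.net/manual/en/'));  // bool(true)
var_dump(looksLikeEmail('alice@example.com'));                   // bool(true)
var_dump(looksLikeEmail('not an address'));                      // bool(false)
```

Note that the email pattern is deliberately narrow (for example, it only accepts .com and .cn addresses), so it is an illustration of syntax rather than a general-purpose validator.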

The purpose of regular expressions

A regular expression describes a string matching pattern. It can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract substrings that meet a certain condition. For example, when a user submits a form, checking whether the phone number or email address entered is valid requires more than a simple comparison against literal characters.

Regular expressions are literal patterns composed of ordinary characters (such as the characters a through z) and special characters (called "metacharacters"). A regular expression acts as a template that matches a character pattern with a searched string. A regular expression pattern can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all these components.

The aim of regular expressions is to provide powerful functionality in a compact form. That compactness comes at a price: the rules are intricate, and constructing a correct, efficient expression is harder still, so some effort is needed. Once you get started, with a good reference and plenty of practice, regular expressions become effective and even enjoyable to use in development.

Commonly used terms in regular expressions

Before learning regular expressions, it helps to understand a few easily confused terms.

1) grep

grep was originally a command in the ed editor, used to display specific content in a file. It later became the standalone tool grep.

2) egrep

Although grep was continually updated, it could not keep up with changing needs. Bell Labs therefore wrote egrep, meaning "extended grep", which greatly increased the power of regular expressions.

3) POSIX (Portable Operating System Interface of UNIX)

As grep evolved, other developers created their own variants according to their own preferences. This caused problems: some programs supported a given metacharacter while others did not. POSIX was the response. POSIX is a set of standards intended to ensure portability between operating systems. However, like SQL, POSIX did not become the final word on the matter and serves mainly as a reference.

4) Perl (Practical Extraction and Reporting Language)

In 1987, Larry Wall released Perl. Over the following 7 years, from Perl 1 to the current Perl 5, its regular expressions became another standard after POSIX.

5) PCRE

Perl's success led other developers to make their regular expressions compatible with Perl's to some degree, including those in C/C++, Java, and Python. In 1997, Philip Hazel developed the PCRE library, a regular expression engine compatible with Perl regular expressions. Developers can integrate PCRE into their own languages to give users rich regular expression functionality. PCRE is used by many software projects, including PHP.

Regular expression syntax rules

Before using regular expressions, we must first learn their syntax. The elements of a regular expression generally include ordinary characters, metacharacters, quantifiers, anchors, non-printing characters, and alternation.

1) Ordinary characters

Ordinary characters include all printable and non-printable characters that are not explicitly designated as metacharacters, including all uppercase and lowercase letters, digits, punctuation marks, and some other symbols. The simplest regular expression is a single ordinary character that is compared against the search string. For example, the single-character regular expression /A/ always matches the letter A.

You can also combine several single characters into a longer expression. For example, the regular expression /the/ matches the "the" in the search strings "the", "there", "other", and "over the lazy dog". No concatenation operator is needed; just type the characters one after another.
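The /the/ example can be checked with a short sketch. The helper name countThe is a hypothetical name chosen for illustration; preg_match_all() is a standard PCRE function that returns the number of matches.

```php
<?php
// Count how many times the literal pattern /the/ occurs in a string.
function countThe(string $text): int
{
    // preg_match_all() returns the number of full-pattern matches.
    return (int) preg_match_all('/the/', $text);
}

echo countThe('the'), "\n";                // 1
echo countThe('there'), "\n";              // 1
echo countThe('other'), "\n";              // 1
echo countThe('over the lazy dog'), "\n";  // 1
echo countThe('The End'), "\n";            // 0, matching is case-sensitive by default
```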

2) Metacharacters

In addition to ordinary characters, regular expressions can contain metacharacters, which are divided into single-character and multi-character metacharacters. For example, the multi-character metacharacter \d matches a digit.

The single-character metacharacters are listed below, starting with the pattern delimiter and the escape character.

Metacharacter    Behavior    Example
/    Commonly used in PHP as the delimiter that marks the beginning and end of the pattern. Single-letter modifiers after the closing delimiter change the matching behavior.    /abc/i matches "abc" case-insensitively. PHP has no g (global) modifier; use preg_match_all() to find every occurrence.
\    Marks the next character as a special character, a literal, a backreference, or an octal escape.    \n matches a newline character. \( matches "(". \\ matches "\".

Most of these special characters lose their special meaning inside a bracket expression and are treated as ordinary characters. To match one of them literally elsewhere, escape it by preceding it with a backslash \. For example, to search for the literal + character, use the expression \+.
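A small sketch of escaping in practice, using a literal + as the example. The helper containsLiteral is a hypothetical name; preg_quote() is the standard PHP function that adds the backslashes for you.

```php
<?php
// Hypothetical helper: does $text contain $needle as literal text?
function containsLiteral(string $text, string $needle): bool
{
    // preg_quote() puts a backslash before every regex metacharacter;
    // the second argument also escapes the delimiter used here.
    $pattern = '/' . preg_quote($needle, '/') . '/';
    return preg_match($pattern, $text) === 1;
}

// Escaped: \+ matches a literal plus sign.
var_dump(preg_match('/1\+1/', '1+1=2') === 1);   // bool(true)

// Unescaped: + is a quantifier, so the pattern looks for "11" or longer.
var_dump(preg_match('/1+1/', '1+1=2') === 1);    // bool(false)

var_dump(containsLiteral('Price: $5.00 (approx.)', '$5.00'));  // bool(true)
```

Using preg_quote() is usually safer than escaping by hand when the text to match comes from user input.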

The remaining single-character metacharacters are listed in the first table below, and the multi-character metacharacters in the table that follows it.

Metacharacter    Behavior    Example
*    Matches the preceding character or subexpression zero or more times; equivalent to {0,}.    zo* matches "z" and "zoo".
+    Matches the preceding character or subexpression one or more times; equivalent to {1,}.    zo+ matches "zo" and "zoo", but not "z".
?    Matches the preceding character or subexpression zero or one time; equivalent to {0,1}. When ? immediately follows another quantifier (*, +, ?, {n}, {n,}, or {n,m}), that quantifier becomes non-greedy: it matches as little of the string as possible, whereas the default greedy behavior matches as much as possible.    zo? matches "z" and "zo", but not "zoo". o+? matches only a single "o" in "oooo", while o+ matches all the "o"s. do(es)? matches "do" or "does".
^    Matches the beginning of the search string. With the m (multiline) modifier, ^ also matches the position after a line break. As the first character inside a bracket expression, ^ negates the character set.    ^\d{3} matches 3 digits at the beginning of the search string. [^abc] matches any character except a, b, and c.
$    Matches the end of the search string. With the m (multiline) modifier, $ also matches the position before a line break.    \d{3}$ matches 3 digits at the end of the search string.
.    Matches any single character except the newline character \n. To match any character including \n, use a pattern such as [\s\S].    a.c matches "abc", "a1c", and "a-c".
[]    Mark the beginning and end of a bracket expression.    [1-4] matches "1", "2", "3", or "4". [^aAeEiIoOuU] matches any character that is not a vowel.
{}    Mark the beginning and end of a quantifier expression.    a{2,3} matches "aa" and "aaa".
()    Mark the beginning and end of a subexpression. The text matched by a subexpression can be saved for later use.    A(\d) matches "A0" through "A9" and saves the digit for later use.
|    Indicates a choice between two or more alternatives.    z|food matches "z" or "food". (z|f)ood matches "zood" or "food".
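Several rows of the table above can be tried out directly. The following sketch assumes a small hypothetical helper, firstMatchOf, that returns the first piece of text a pattern matches; it is here only to make the greedy and non-greedy behavior visible.

```php
<?php
// Return the first match of $pattern in $subject, or null if there is none.
function firstMatchOf(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

echo firstMatchOf('/o+/', 'oooo'), "\n";         // oooo (greedy: as many as possible)
echo firstMatchOf('/o+?/', 'oooo'), "\n";        // o    (non-greedy: as few as possible)
echo firstMatchOf('/zo*/', 'zoo'), "\n";         // zoo
echo firstMatchOf('/zo?/', 'zoo'), "\n";         // zo
echo firstMatchOf('/^\d{3}/', '123abc'), "\n";   // 123
echo firstMatchOf('/\d{3}$/', 'abc456'), "\n";   // 456
var_dump(firstMatchOf('/(z|f)ood/', 'good'));    // NULL, neither alternative fits
```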
Metacharacter    Behavior    Example
\b    Matches a word boundary, that is, the position between a word character and a non-word character such as a space.    er\b matches the "er" in "never", but not the "er" in "verb".
\B    Matches a position that is not a word boundary.    er\B matches the "er" in "verb", but not the "er" in "never".
\d    Matches a digit; equivalent to [0-9].    In the search string "12 345", \d{2} matches "12" and "34", and \d matches "1", "2", "3", "4", and "5".
\D    Matches a non-digit character; equivalent to [^0-9].    \D+ matches "abc" and " def" in "abc123 def".
\w    Matches any of the characters A-Z, a-z, 0-9, and underscore; equivalent to [A-Za-z0-9_].    In the search string "The quick brown fox...", \w+ matches "The", "quick", "brown", and "fox".
\W    Matches any character other than A-Z, a-z, 0-9, and underscore; equivalent to [^A-Za-z0-9_].    In the search string "The quick brown fox...", \W+ matches "..." and each of the spaces.
[xyz]    Character set. Matches any one of the listed characters.    [abc] matches the "a" in "plain".
[^xyz]    Negated character set. Matches any character that is not listed.    [^abc] matches the "p", "l", "i", and "n" in "plain".
[a-z]    Character range. Matches any character in the specified range.    [a-z] matches any lowercase letter in the range "a" to "z".
[^a-z]    Negated character range. Matches any character not in the specified range.    [^a-z] matches any character not in the range "a" to "z".
{n}    Matches exactly n times, where n is a non-negative integer.    o{2} does not match the "o" in "Bob", but matches each pair of "o"s in "fooood".
{n,}    Matches at least n times, where n is a non-negative integer. * is equivalent to {0,} and + is equivalent to {1,}.    o{2,} does not match the "o" in "Bob", but matches all the "o"s in "fooood".
{n,m}    Matches at least n and at most m times, where n and m are non-negative integers and n <= m. There must be no space between the comma and the numbers. ? is equivalent to {0,1}.    In the search string "1234567", \d{1,3} matches "123", "456", and "7".
(pattern)    Matches pattern and saves the match. In PHP, saved matches can be read from the $matches array filled in by preg_match(). To match a literal parenthesis, use \( or \).    (Chapter|Section) [1-9] matches "Chapter 5" and saves "Chapter" for later use.
(?:pattern)    Matches pattern but does not save the match for later use. This is useful for combining parts of a pattern with the "or" character (|).    industr(?:y|ies) is equivalent to industry|industries.
(?=pattern)    Positive lookahead. After the lookahead succeeds, matching continues from before the text it examined, not after it. The examined text is not saved for later use.    ^(?=.*\d).{4,8}$ requires a password to be 4 to 8 characters long and to contain at least one digit. In this pattern, .*\d looks for any number of characters followed by a digit; for the search string "abc3qr", it matches "abc3". Starting before that match rather than after it, .{4,8} matches a string of 4 to 8 characters, here "abc3qr". ^ and $ anchor the match to the start and end of the search string, so the match fails if the string contains any other characters.
(?!pattern)    Negative lookahead. Succeeds only where pattern does not match at the current position. As with positive lookahead, matching continues from before the examined text, and nothing is saved for later use.    \b(?!th)\w+\b matches words that do not begin with "th". In this pattern, \b matches a word boundary; for the search string " quick ", that is the position right after the first space. (?!th) checks that the next two characters are not "th", which is true for "qu". Starting from that same position, \w+ matches a word, here "quick".
\cx    Matches the control character indicated by x, where x is a letter in the range A-Z or a-z. For other values of x, the behavior depends on the regular expression engine.    \cM matches Ctrl+M, the carriage return character.
\xn    Matches n, where n is a hexadecimal escape code exactly two digits long. This allows ASCII codes to be used in regular expressions.    \x41 matches "A". \x041 is equivalent to \x04 followed by "1", because n must be exactly two digits.
\num    Matches num, where num is a positive integer. This is a reference back to a saved match.    (.)\1 matches two consecutive identical characters.
\n    Identifies either an octal escape code or a backreference. If \n is preceded by at least n capturing subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape code.    (\d)\1 matches two consecutive identical digits.
\nm    Identifies either an octal escape code or a backreference. If \nm is preceded by at least nm capturing subexpressions, nm is a backreference. If \nm is preceded by at least n capturing subexpressions, n is a backreference followed by the literal m. If neither condition holds and n and m are octal digits (0-7), \nm matches the octal escape code nm.    \11 matches the tab character.
\nml    When n is an octal digit (0-3) and m and l are octal digits (0-7), matches the octal escape code nml.    \011 matches the tab character.
\un    Matches n, where n is a Unicode character written as four hexadecimal digits.    \u00A9 matches the copyright symbol (©). PHP's PCRE does not support the \u form; write \x{00A9} with the u modifier instead.
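The lookahead and backreference rows are easier to follow with a runnable sketch. The password pattern is written here as ^(?=.*\d).{4,8}$, which is our reading of the table's example, and the function names isValidPassword and wordsNotStartingWithTh are hypothetical.

```php
<?php
// Password rule from the table: 4 to 8 characters, at least one digit.
function isValidPassword(string $password): bool
{
    return preg_match('/^(?=.*\d).{4,8}$/', $password) === 1;
}

// Collect the words that do not begin with "th", using a negative lookahead.
function wordsNotStartingWithTh(string $text): array
{
    preg_match_all('/\b(?!th)\w+\b/', $text, $m);
    return $m[0];
}

var_dump(isValidPassword('abc3qr'));     // bool(true): 6 characters, one digit
var_dump(isValidPassword('abcdef'));     // bool(false): no digit
var_dump(isValidPassword('abc3qrxyz'));  // bool(false): 9 characters

print_r(wordsNotStartingWithTh('the quick brown thing'));  // quick, brown

// Backreference: \1 must repeat whatever (\d) captured.
var_dump(preg_match('/(\d)\1/', 'a1b22c'));  // int(1), matches "22"
var_dump(preg_match('/(\d)\1/', 'a1b2c3'));  // int(0)
```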

3) Non-printing characters

Non-printing characters are written as an escape character followed by an ordinary character. They match characters such as line feeds, form feeds, and whitespace. The following table lists them.

Character    Matches    Equivalent to
\f    Form feed    \x0c and \cL
\n    Line feed    \x0a and \cJ
\r    Carriage return    \x0d and \cM
\s    Any whitespace character, including spaces, tabs, and form feeds    [ \f\n\r\t\x0b]
\S    Any non-whitespace character    [^ \f\n\r\t\x0b]
\t    Tab character    \x09 and \cI
\v    Vertical tab (in PCRE, \v also matches other vertical whitespace characters)    \x0b and \cK

4) Priority order

When using regular expressions, pay attention to the order of evaluation. Operators of equal precedence are evaluated from left to right, and operators of different precedence from highest to lowest. The following table lists the operators from highest to lowest precedence.

Order    Metacharacters    Description
1    \    Escape character
2    ( ), (?:), (?=), [ ]    Parentheses and square brackets
3    *, +, ?, {n}, {n,}, {n,m}    Quantifiers
4    ^, $, \anymetacharacter    Anchors and sequences
5    |    Alternation

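In addition, characters have higher precedence than the alternation operator, so, for example, "m|food" matches "m" or "food".

Alternation

Alternation in a regular expression lets you group a choice between two or more alternatives. In effect, it specifies an "or" relationship between two patterns. Use the pipe character | to specify the choice between two or more alternatives. The largest possible expression on either side of the pipe character is matched.

For example:

/Chapter|Section [1-9][0-9]{0,1}/

This regular expression matches either the string "Chapter" or the string "Section " followed by one or two digits.

If the search string is "Section 22", the expression matches "Section 22". However, if the search string is "Chapter 22", the expression matches only the word "Chapter", not "Chapter 22".

To avoid the confusion this form of expression can cause, use parentheses to limit the scope of the alternation, so that it applies only to the two words "Chapter" and "Section". With the parentheses added, the regular expression matches "Chapter 1" or "Section 3". The expression above becomes:

/(Chapter|Section) [1-9][0-9]{0,1}/

After this change, if the search string is "Section 22", the expression matches "Section 22", and if the search string is "Chapter 22", the expression matches "Chapter 22".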
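The difference between the two Chapter/Section patterns can be confirmed with a short sketch. The helper matchedText is a hypothetical name used only for this illustration.

```php
<?php
// Return the text matched by $pattern in $subject, or null if nothing matches.
function matchedText(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$ungrouped = '/Chapter|Section [1-9][0-9]{0,1}/';
$grouped   = '/(Chapter|Section) [1-9][0-9]{0,1}/';

var_dump(matchedText($ungrouped, 'Section 22'));  // string(10) "Section 22"
var_dump(matchedText($ungrouped, 'Chapter 22'));  // string(7) "Chapter"
var_dump(matchedText($grouped, 'Section 22'));    // string(10) "Section 22"
var_dump(matchedText($grouped, 'Chapter 22'));    // string(10) "Chapter 22"
```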

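Subexpressions

Placing parentheses in a regular expression creates a subexpression. Subexpressions let you match a pattern in the search text and divide the match into separate submatches, which the program can then retrieve.

For example, a regular expression that matches an email account:

/(\w+)@(\w+)\.(\w+)/

This regular expression contains 3 subexpressions. Each one is matched and its result saved separately, while the combined result is available as the overall match.

The following example breaks a Uniform Resource Identifier (URI) into its components:

/(\w+):\/\/([^\/:]+)(:\d*)?([^# ]*)/
  • The first parenthesized subexpression captures the protocol part of the web address. It matches any word that precedes a colon and two forward slashes.

  • The second parenthesized subexpression captures the domain part of the address. It matches any sequence of characters that does not include a forward slash / or a colon :.

  • The third parenthesized subexpression captures the port number, if one is specified. It matches a colon followed by zero or more digits.

  • The fourth parenthesized subexpression captures the path and/or page information of the web address. It matches zero or more characters other than # and the space character.

If we match this regular expression against the string "http://msdn.microsoft.com:80/scripting/default.htm", the 4 subexpressions capture http, msdn.microsoft.com, :80, and /scripting/default.htm, respectively.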
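In PHP, the submatches are returned through the third argument of preg_match(). The sketch below applies the URI pattern above; the function name splitUri and the array keys are assumptions made for readability.

```php
<?php
// Split a URI into the four parts captured by the pattern above.
function splitUri(string $uri): ?array
{
    $pattern = '/(\w+):\/\/([^\/:]+)(:\d*)?([^# ]*)/';
    if (preg_match($pattern, $uri, $m) !== 1) {
        return null;
    }
    // $m[0] holds the whole match; $m[1] to $m[4] hold the subexpressions.
    return [
        'scheme' => $m[1],
        'host'   => $m[2],
        'port'   => $m[3],   // empty string when no port is given
        'path'   => $m[4],
    ];
}

print_r(splitUri('http://msdn.microsoft.com:80/scripting/default.htm'));
// scheme: http, host: msdn.microsoft.com, port: :80, path: /scripting/default.htm
```

For production code, PHP's built-in parse_url() is usually a better fit than a hand-written pattern; the regular expression is shown here only to illustrate subexpressions.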

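Backreferences

Backreferences are used to find repeated groups of characters. They can also be used to rearrange the order and position of the elements of an input string in order to reformat it.

Subexpressions can be referenced from within the regular expression itself and from the replacement string. Each subexpression is identified by a number and is referred to as a backreference.

In a regular expression, each saved submatch is stored in the order in which it appears from left to right. The buffers that hold the submatches are numbered starting at 1, and up to 99 subexpressions can be stored. Within the regular expression, each buffer can be accessed with \n, where n is a one- or two-digit decimal number identifying the buffer.

One application of backreferences is finding two identical words in a row. Take the following sentence:

Is is the cost of of gasoline going up up?

The sentence contains several duplicated words. It would be useful to have a way to locate them without having to check each word for repetition individually.

The following regular expression does this with a single subexpression:

/\b([a-z]+) \1\b/

Here, the subexpression is everything enclosed in the parentheses. It consists of one or more alphabetic characters, as specified by [a-z]+. The second part of the regular expression, \1, is a reference to the previously saved submatch, so the second occurrence of the word must be exactly what the parenthesized subexpression matched. \b, the word boundary metacharacter, ensures that only whole words are detected. Without it, phrases such as "is issued" or "this is" would be wrongly reported as repeated words. Because [a-z] matches only lowercase letters, the i (case-insensitive) modifier is needed for "Is is" to match. Applied as /\b([a-z]+) \1\b/i to "Is is the cost of of gasoline going up up?", the expression finds the repeated words Is, of, and up.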
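A sketch of the repeated-word search in PHP follows. It adds the i modifier so that "Is is" is caught despite the capital letter, which is a choice made for this example; the function names repeatedWords and collapseRepeats are hypothetical. The second function also shows a backreference in the replacement string, written as $1.

```php
<?php
// List each word that is immediately repeated in $text.
function repeatedWords(string $text): array
{
    preg_match_all('/\b([a-z]+) \1\b/i', $text, $m);
    return $m[1];   // contents of the first subexpression for every match
}

// Replace each repeated pair with a single copy of the word.
function collapseRepeats(string $text): string
{
    // $1 in the replacement refers to the first saved submatch.
    return preg_replace('/\b([a-z]+) \1\b/i', '$1', $text) ?? $text;
}

$sentence = 'Is is the cost of of gasoline going up up?';

print_r(repeatedWords($sentence));       // Is, of, up
echo collapseRepeats($sentence), "\n";   // Is the cost of gasoline going up?
```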

Using regular expressions in PHP

PHP has provided two function libraries for regular expression processing:

  • One is provided by the PCRE (Perl Compatible Regular Expressions) library. These functions are compatible with Perl regular expressions and use preg_ as their name prefix.

  • The other is the set of POSIX (Portable Operating System Interface) extended regular expression functions, whose names begin with ereg.

The two libraries offer similar functionality, but PCRE executes more efficiently than POSIX. In addition, the POSIX functions were deprecated in PHP 5.3 and removed in PHP 7.0, so only the PCRE function library is covered here.
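As a brief sketch of the PCRE function library, the following example runs one call for each of the operations mentioned earlier: matching, searching, replacing, and splitting. The wrapper function demoPregFunctions and the sample string are assumptions made for illustration; preg_match(), preg_match_all(), preg_replace(), and preg_split() are standard PHP functions.

```php
<?php
// Hypothetical demo: run one call from each preg_ family on a sample list.
function demoPregFunctions(string $list): array
{
    return [
        // Matching: 1 if the pattern is found, 0 if not.
        'has_banana' => preg_match('/banana/', $list),
        // Searching: collect every run of word characters.
        'words'      => preg_match_all('/\w+/', $list, $m) ? $m[0] : [],
        // Replacing: normalize "," and ";" separators to ", ".
        'normalized' => preg_replace('/\s*[,;]\s*/', ', ', $list),
        // Splitting: break the string apart at the separators.
        'parts'      => preg_split('/\s*[,;]\s*/', $list),
    ];
}

print_r(demoPregFunctions('apple, banana;cherry'));
```

For the sample input, the normalized string is "apple, banana, cherry", and both words and parts contain apple, banana, and cherry.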
