Complete Manual of Regular Expressions in PHP

WBOY
2016-07-13 10:22:02

Foreword

Regular expressions look cumbersome, but they are powerful. Once you have learned them, using them will not only make you more efficient but also give you a real sense of accomplishment. If you read this material carefully and refer back to it when you apply it, mastering regular expressions should not be a problem.

Index

1. Introduction
2. History of regular expressions
3. Regular expression definition
  3.1 Common characters
  3.2 Non-printing characters
  3.3 Special characters
  3.4 Qualifier
  3.5 Locator
  3.6 Select
  3.7 Backreferences
4. Operational precedence of various operators
5. All symbol explanations
6. Some examples
7. Regular expression matching rules
  7.1 Basic pattern matching
  7.2 Character clusters
  7.3 Determine recurrence

1. Introduction
Regular expressions are now widely used in software of all kinds: in *nix operating systems (Linux, Unix, HP, and so on), in development environments such as PHP, C#, and Java, and in many applications.

Regular expressions achieve powerful results with very little code. The price of being so compact is that the notation is hard to read and not easy to learn, so it takes some effort. Once you have made a start, though, using regular expressions with a reference at hand is fairly simple and effective.

Example: ^.+@.+\..+$

Code like this has scared me away many times, and it probably scares many other people away too. Keep reading, and you will be able to use code like this freely as well.
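As a rough illustration, the example pattern can be tried out in PHP with preg_match(). The / delimiters and the helper name looksLikeEmail are assumptions made for this sketch. The pattern only checks for the shape "something@something.something", so it should not be taken as a complete email validator.

```php
<?php
// Hypothetical helper: applies the example pattern with preg_match().
function looksLikeEmail(string $text): bool
{
    // ^.+@.+\..+$ : one or more characters, "@", one or more characters,
    // a literal dot, then one or more characters up to the end of the string.
    return preg_match('/^.+@.+\..+$/', $text) === 1;
}

var_dump(looksLikeEmail('user@example.com')); // bool(true)
var_dump(looksLikeEmail('user@example'));     // bool(false)
```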

Note: Part 7 repeats some of the earlier material. This is deliberate: it describes the entries of the earlier tables again in more detail to make them easier to understand.

2. History of regular expressions

The “ancestors” of regular expressions can be traced all the way back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way to describe these neural networks.

In 1956, a mathematician named Stephen Kleene, building on the early work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets", which introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

Subsequently, this work found its way into some early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.

The rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.

3. Regular expression definition

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract from a string the substring that meets a certain condition.

When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there means something different from the * in regular expressions.

Regular expressions are composed of ordinary characters (such as the characters a to z) and special characters (called metacharacters). A regular expression acts as a template that matches a character pattern with a searched string.
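The three uses mentioned above (checking, replacing, and extracting) might look like this with PHP's preg functions. The sample text and variable names are made up for the sketch.

```php
<?php
$text = 'Order 66 shipped on 2016-07-13.';

// 1. Check whether the string contains a match (here: any run of digits).
$hasDigits = preg_match('/[0-9]+/', $text) === 1;

// 2. Replace every match (here: mask each digit with "#").
$masked = preg_replace('/[0-9]/', '#', $text);

// 3. Extract a substring that meets a condition (here: a yyyy-mm-dd date).
$date = null;
if (preg_match('/[0-9]{4}-[0-9]{2}-[0-9]{2}/', $text, $matches) === 1) {
    $date = $matches[0];
}

var_dump($hasDigits); // bool(true)
echo $masked, "\n";   // Order ## shipped on ####-##-##.
echo $date, "\n";     // 2016-07-13
```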

  3.1 Common characters

Common characters are all the printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.

  3.2 Non-printing characters

Character    Meaning
\cx    Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. x must be a letter in A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\f    Matches a form feed character. Equivalent to \x0c and \cL.
\n    Matches a newline character. Equivalent to \x0a and \cJ.
\r    Matches a carriage return character. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab character. Equivalent to \x09 and \cI.
\v    Matches a vertical tab character. Equivalent to \x0b and \cK. (In PHP's PCRE functions, \v matches any vertical whitespace character, including newline and carriage return; use \x0b to match only a vertical tab.)
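A small sketch of these escapes in PHP follows. The subject string is written in double quotes so that PHP turns "\t" and "\r\n" into real control characters, while the patterns are in single quotes so the backslashes reach the regular expression engine unchanged. The variable names are illustrative only.

```php
<?php
// Double-quoted subject: contains a real tab, carriage return and newline.
$line = "name\tvalue\r\n";

// \t matches the tab; \r\n matches the CR LF pair at the end.
$hasTab  = preg_match('/\t/', $line) === 1;
$hasCrLf = preg_match('/\r\n$/', $line) === 1;

// \s matches any whitespace character, so it can split on the tab.
$fields = preg_split('/\s+/', trim($line));

// \S matches any non-whitespace character; preg_match_all() returns the count.
$nonSpaceCount = preg_match_all('/\S/', $line);

var_dump($hasTab, $hasCrLf); // bool(true) bool(true)
print_r($fields);            // name, value
echo $nonSpaceCount, "\n";   // 9
```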


3.3 Special characters

Special characters are characters with a special meaning, such as the * in "*.txt" mentioned above, which stands for any string. To find files whose names contain a literal *, you must escape the *, that is, put a backslash in front of it: ls \*.txt. Regular expressions have the following special characters.

Special character    Description
$    Matches the end of the input string. In multiline mode (the Multiline property of a RegExp object, or the m modifier in PHP), $ also matches the position just before a line break. To match the $ character itself, use \$.
( )    Mark the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*    Matches the preceding subexpression zero or more times. To match the * character, use \*.
+    Matches the preceding subexpression one or more times. To match the + character, use \+.
.    Matches any single character except the newline character \n. To match ., use \.
[    Marks the beginning of a bracket expression. To match [, use \[.
?    Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline character. The sequence '\\' matches "\", and '\(' matches "(".
^    Matches the beginning of the input string, unless it is used inside a bracket expression, in which case it negates the character set. To match the ^ character itself, use \^.
{    Marks the beginning of a qualifier expression. To match {, use \{.
|    Indicates a choice between two items. To match |, use \|.

You construct regular expressions the same way you create mathematical expressions. That is, using a variety of metacharacters and operators to combine small expressions to create larger expressions. The components of a regular expression can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all of these components.
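The effect of escaping can be seen in a short PHP sketch. The strings are invented for the example; preg_quote() is the standard PHP function that escapes every special character in a piece of text for you.

```php
<?php
// Unescaped, "+" is a qualifier: "1+" means one or more 1s.
$loose = preg_match('/1+1=2/', '11=2') === 1;

// Escaped, "\+" matches only a literal plus sign.
$strict  = preg_match('/1\+1=2/', '11=2') === 1;
$literal = preg_match('/1\+1=2/', '1+1=2') === 1;

// preg_quote() escapes the special characters in arbitrary text.
$needle  = 'cost: $5.00 (approx.)';
$pattern = '/' . preg_quote($needle, '/') . '/';
$found   = preg_match($pattern, 'Total cost: $5.00 (approx.) today') === 1;

var_dump($loose, $strict, $literal, $found);
// bool(true) bool(false) bool(true) bool(true)
```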

  3.4 Qualifier

Qualifiers (also called quantifiers) specify how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.

The *, + and ? qualifiers are all greedy: they match as much text as possible. Adding a ? after any of them makes the match non-greedy, or minimal.

The qualifiers of regular expressions are:

Character    Description
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
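The difference between greedy and non-greedy matching is easiest to see side by side. The sketch below uses an invented HTML fragment together with the "fooooood" string from the table.

```php
<?php
$html = '<b>bold</b> and <i>italic</i>';

// Greedy: .* grabs as much as possible, so the match runs to the last ">".
preg_match('/<.*>/', $html, $greedy);

// Non-greedy: .*? stops at the first ">" that lets the match succeed.
preg_match('/<.*?>/', $html, $lazy);

// {n,m}: o{1,3} takes at most three o's in one match.
preg_match('/o{1,3}/', 'fooooood', $bounded);

echo $greedy[0], "\n";  // <b>bold</b> and <i>italic</i>
echo $lazy[0], "\n";    // <b>
echo $bounded[0], "\n"; // ooo
```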

  3.5 Locator

Locators (also called anchors) describe the boundary of a string or a word. ^ and $ refer to the beginning and end of the string respectively, \b describes the boundary at the front or back of a word, and \B represents a non-word boundary. Qualifiers cannot be applied to locators.
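As a minimal sketch of word boundaries in PHP, the following counts only whole-word occurrences of "cat". The sample sentence and variable names are assumptions for the illustration.

```php
<?php
$text = 'cat concat cat, category';

// Without locators, "cat" is found inside other words too.
$anywhere = preg_match_all('/cat/', $text);

// \bcat\b requires a word boundary on both sides.
$wholeWords = preg_match_all('/\bcat\b/', $text);

// \Bcat requires that "cat" does NOT start at a word boundary.
$insideWord = preg_match_all('/\Bcat/', $text);

echo $anywhere, ' ', $wholeWords, ' ', $insideWord, "\n"; // 4 2 1
```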

  3.6 Select

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the text they match is captured and stored. To avoid this, put ?: before the first alternative.

?: is one of the non-capturing elements; the other two are ?= and ?!, which mean more than just "do not capture". The former is a positive lookahead: it matches the search string at any position where the pattern in parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where the pattern in parentheses does not match.
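The three non-capturing elements can be compared in PHP as follows. The patterns are the 'industr' and 'Windows' examples used in the symbol table of this manual; the variable names are arbitrary.

```php
<?php
// Capturing group: the chosen alternative is stored in $m[1].
preg_match('/industr(y|ies)/', 'heavy industries', $m);

// Non-capturing group: same match, but nothing is stored for the group.
preg_match('/industr(?:y|ies)/', 'heavy industries', $n);

// Positive lookahead: "Windows " only when a listed version follows.
// The version itself is not part of the match.
$pos = preg_match('/Windows (?=95|98|NT|2000)/', 'Windows 2000', $p);

// Negative lookahead: "Windows " only when none of the versions follows.
$neg = preg_match('/Windows (?!95|98|NT|2000)/', 'Windows 2000');

print_r($m); // [0] => industries, [1] => ies
print_r($n); // [0] => industries
var_dump($pos, $p[0], $neg); // int(1) string(8) "Windows " int(0)
```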

  3.7 Backreferences

Putting parentheses around a regular expression pattern, or part of one, causes the text it matches to be stored in a temporary buffer. Each captured submatch is stored in the order it is encountered, from left to right, in the pattern. The buffers are numbered starting from 1, up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies a specific buffer.

You can use the non-capturing metacharacters '?:', '?=', or '?!' to prevent the related matches from being stored.
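Here is a short PHP sketch of backreferences, assuming some made-up sample strings. Inside the pattern the first buffer is written \1; in the replacement string of preg_replace() the same buffer is written $1.

```php
<?php
// (.)\1 : any character followed by the same character again.
preg_match('/(.)\1/', 'bookkeeper', $double);

// \b([a-z]+) \1\b : a word, a space, then the same word again.
preg_match('/\b([a-z]+) \1\b/i', 'Paris in the the spring', $repeat);

// Replace each doubled word with a single copy of it.
$fixed = preg_replace('/\b([a-z]+) \1\b/i', '$1', 'the cost of of gasoline');

print_r($double);  // [0] => oo, [1] => o
print_r($repeat);  // [0] => the the, [1] => the
echo $fixed, "\n"; // the cost of gasoline
```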

4. Operational precedence of various operators

  Operations with the same priority are performed from left to right, and operations with different priorities are performed from high to low. The precedence of various operators from high to low is as follows:

Operator    Description
\    Escape character
(), (?:), (?=), []    Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}    Qualifiers
^, $, \anymetacharacter    Position and sequence
|    "OR" operation
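Two consequences of this ordering are sketched below in PHP: a qualifier applies only to the item immediately before it, and | splits the whole expression unless parentheses limit it. The test strings are invented for the example.

```php
<?php
// Qualifiers bind tighter than sequences: ab+ means "a" followed by b+.
preg_match('/ab+/', 'abbb abab', $m1);

// Parentheses come first, so (?:ab)+ repeats the whole pair "ab".
preg_match('/(?:ab)+/', 'abbb abab', $m2);

// "|" has the lowest precedence: ^cat|dog$ means (^cat) or (dog$).
$loose = preg_match('/^cat|dog$/', 'catalog') === 1;

// Grouping the alternatives makes ^ and $ apply to both of them.
$strict = preg_match('/^(?:cat|dog)$/', 'catalog') === 1;

echo $m1[0], ' ', $m2[0], "\n"; // abbb ab
var_dump($loose, $strict);      // bool(true) bool(false)
```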

5. All symbol explanations

Character    Description
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^    Matches the beginning of the input string. In multiline mode (the Multiline property of a RegExp object, or the m modifier in PHP), ^ also matches the position just after a line break.
$    Matches the end of the input string. In multiline mode (the Multiline property of a RegExp object, or the m modifier in PHP), $ also matches the position just before a line break.
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
?    When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.    Matches any single character except "\n". To match any character including '\n', use a pattern like '[\s\S]' (or, in PHP, add the s modifier).
(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript; in PHP it is returned in the matches array filled by preg_match(). To match parenthesis characters, use '\(' or '\)'.
(?:pattern)    Matches pattern but does not capture the match; that is, the match is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a shorter expression than 'industry|industries'.
(?=pattern)    Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)    Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y    Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]    Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]    Negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]    Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter from 'a' through 'z'.
[^a-z]    Negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' through 'z'.
\b    Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B    Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx    Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. x must be a letter in A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\f    Matches a form feed character. Equivalent to \x0c and \cL.
\n    Matches a newline character. Equivalent to \x0a and \cJ.
\r    Matches a carriage return character. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab character. Equivalent to \x09 and \cI.
\v    Matches a vertical tab character. Equivalent to \x0b and \cK. (In PHP's PCRE functions, \v matches any vertical whitespace character; use \x0b to match only a vertical tab.)
\w    Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W    Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn    Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num    Matches num, where num is a positive integer. This is a reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n    Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm    Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml    Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un    Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©). (PHP's PCRE functions do not support \u; write \x{00A9} with the u modifier instead.)
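A few rows of this table can be checked quickly in PHP. The sketch below is only an illustration with invented sample strings; the "\u{00A9}" string escape in the last line assumes PHP 7 or later.

```php
<?php
$text = "line one\nline two";

// "." does not match "\n" by default, so .+ stops at the end of line one.
preg_match('/.+/', $text, $firstLine);

// With the s modifier, or with [\s\S], the match can cross the newline.
preg_match('/.+/s', $text, $all);
preg_match('/[\s\S]+/', $text, $allAlt);

// \d, \w and \x41 from the table.
$digits = preg_match_all('/\d/', 'a1b22c333');          // number of digits
$isWord = preg_match('/^\w+$/', 'snake_case_42') === 1;
$isA    = preg_match('/\x41/', 'A') === 1;

// PCRE has no \uXXXX; \x{XXXX} with the u modifier does the same job.
$hasCopyright = preg_match('/\x{00A9}/u', "\u{00A9} 2016") === 1;

echo $firstLine[0], "\n"; // line one
echo $digits, "\n";       // 6
```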

6. Some examples

Regular expression    Description
/\b([a-z]+) \1\b/gi    Finds places where a word occurs twice in a row. (g is the JScript global flag; in PHP, use preg_match_all() with only the i modifier.)
/(\w+):\/\/([^\/:]+)(:\d*)?([^# ]*)/    Parses a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/    Locates chapter headings
/[-a-z]/    The 26 letters a to z plus the - sign
/ter\b/    Matches "chapter", but not "terminal"
/\Bapt/    Matches "chapter", but not "aptitude"
/Windows(?=95 |98 |NT )/    Matches "Windows95", "Windows98", or "WindowsNT". After a match is found, the next search starts right after "Windows"
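Some of these examples can be run in PHP as shown below. This is a sketch only: the URL is a made-up sample, and ~ is chosen as the delimiter for the URL pattern so that the slashes in it need no escaping.

```php
<?php
// URL example, with ~ as the delimiter.
$url = 'http://www.example.com:80/docs/index.php#top';
preg_match('~(\w+)://([^/:]+)(:\d*)?([^# ]*)~', $url, $parts);
// $parts[1] = protocol, [2] = domain, [3] = port, [4] = relative path

// Chapter example: a one- or two-digit number that does not start with 0.
$chapterPattern = '/^(?:Chapter|Section) [1-9][0-9]{0,1}$/';
$chapterOk  = preg_match($chapterPattern, 'Chapter 12') === 1;
$chapterBad = preg_match($chapterPattern, 'Chapter 100') === 1;

// Word-boundary examples.
$terEnd   = preg_match('/ter\b/', 'chapter') === 1;
$terStart = preg_match('/ter\b/', 'terminal') === 1;
$aptIn    = preg_match('/\Bapt/', 'chapter') === 1;
$aptStart = preg_match('/\Bapt/', 'aptitude') === 1;

print_r(array_slice($parts, 1)); // http, www.example.com, :80, /docs/index.php
```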

7. Regular expression matching rules

 7.1 Basic pattern matching

Everything starts from the basics. Patterns are the most basic elements of regular expressions. They are a set of characters that describe the characteristics of a string. Patterns can be simple, consisting of ordinary strings, or very complex, often using special characters to represent a range of characters, recurrences, or to represent context. For example:

  ^once

This pattern contains the special character ^, which means that the pattern matches only strings that start with once. For example, it matches the string "once upon a time" but does not match "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol matches strings that end with the given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but does not match "buckets". When ^ and $ are used together, the match must be exact (the whole string must equal the pattern). For example:

  ^bucket$

This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

once

and the string

There once was a man from NewYork
Who kept all of his cash in a bucket.

match.

The letters (o-n-c-e) in this pattern are literal characters, that is, they stand for the letters themselves, and the same goes for digits. Some slightly more complex characters, such as punctuation and whitespace characters (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for the tab character is \t. So to detect whether a string starts with a tab character, use this pattern:

  ^\t

Similarly, \n stands for "new line" and \r for carriage return. Other special symbols can be matched by putting a backslash in front of them: the backslash itself is written \\, the period is written \., and so on.
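The patterns in this section can be tried against the two-line verse in PHP. The variable names below are illustrative.

```php
<?php
$poem = "There once was a man from NewYork\nWho kept all of his cash in a bucket.";

$containsOnce = preg_match('/once/', $poem) === 1;   // anywhere in the string
$startsOnce   = preg_match('/^once/', $poem) === 1;  // only at the very start
$exactBucket  = preg_match('/^bucket$/', 'bucket') === 1;
$notExact     = preg_match('/^bucket$/', 'buckets') === 1;

// Escape sequences: ^\t tests for a leading tab, \. for a literal period.
$leadingTab = preg_match('/^\t/', "\tindented") === 1;
$endsInDot  = preg_match('/bucket\.$/', $poem) === 1;

var_dump($containsOnce, $startsOnce); // bool(true) bool(false)
var_dump($exactBucket, $notExact);    // bool(true) bool(false)
var_dump($leadingTab, $endsInDot);    // bool(true) bool(true)
```

One PHP-specific detail worth knowing: by default $ also matches just before a newline at the very end of the subject, so '/^bucket$/' accepts "bucket\n" as well. Adding the D modifier restricts $ to the true end of the string.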

 7.2 Character Clusters

In Internet applications, regular expressions are usually used to validate user input. When a user submits a form, ordinary literal characters are not enough to decide whether the phone number, address, email address, credit card number, and so on that were entered are valid.

We therefore need a freer way to describe the pattern we want, and that is the character cluster. To create a cluster representing all vowel characters, place all the vowels in square brackets:

 [AaEeIiOoUu]

This pattern matches any vowel character, but can only represent one character. Use hyphens to represent a range of characters, such as:

 [a-z] // Match all lowercase letters
 [A-Z] // Match all uppercase letters
 [a-zA-Z] // Match all letters
 [0-9] // Match all digits
 [0-9\.\-] // Match all digits, periods, and minus signs
 [ \f\r\t\n] // Match all whitespace characters

Again, each of these represents only one character, which is an important point. To match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:

  ^[a-z][0-9]$

Although [a-z] covers a range of 26 letters, here it matches only one character: the first character of the string, which must be a lowercase letter.

It was mentioned earlier that ^ represents the beginning of a string, but it has another meaning. When ^ is used within a set of square brackets, it means "not" or "exclude" and is often used to eliminate a certain character. Using the previous example, we require that the first character cannot be a number:

  ^[^0-9][0-9]$

This pattern matches "&5", "g7" and "-2", but does not match "12" and "66". Here are a few examples of excluding specific characters:

 [^a-z] // All characters except lowercase letters
 [^\\\/\^] // All characters except (\), (/), and (^)
 [^\"\'] // All characters except double quotes (") and single quotes (')

The special character "." (dot, period) represents any character except "new line". So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than "new line". The pattern "." matches any string except the empty string and a string that contains only a "new line".
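The cluster examples above can be verified with preg_grep(), which returns the array entries that match a pattern. The candidate lists come from the text; the variable names are assumptions for this sketch.

```php
<?php
// One lowercase letter followed by one digit.
$candidates  = ['z2', 't6', 'g7', 'ab2', 'r2d3', 'b52'];
$letterDigit = array_values(preg_grep('/^[a-z][0-9]$/', $candidates));

// First character must NOT be a digit, second must be one.
$inputs        = ['&5', 'g7', '-2', '12', '66'];
$nonDigitFirst = array_values(preg_grep('/^[^0-9][0-9]$/', $inputs));

// A cluster stands for a single character; count the vowels one by one.
$vowelCount = preg_match_all('/[AaEeIiOoUu]/', 'Regular Expressions');

print_r($letterDigit);   // z2, t6, g7
print_r($nonDigitFirst); // &5, g7, -2
echo $vowelCount, "\n";  // 7
```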

PHP's regular expressions have some common character clusters built in. They are listed below:

Character cluster    Meaning

 [[:alpha:]]    Any letter
 [[:digit:]]    Any digit
 [[:alnum:]]    Any letter or digit
 [[:space:]]    Any whitespace character
 [[:upper:]]    Any uppercase letter
 [[:lower:]]    Any lowercase letter
 [[:punct:]]    Any punctuation mark
 [[:xdigit:]]    Any hexadecimal digit, equivalent to [0-9a-fA-F]
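These built-in clusters work with the preg functions as well. A brief sketch with invented sample strings:

```php
<?php
// [[:xdigit:]] : the whole string must be hexadecimal digits.
$isHex = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef42') === 1;
$notHex = preg_match('/^[[:xdigit:]]+$/', 'xyz') === 1;

// [[:punct:]] : strip punctuation marks.
$noPunct = preg_replace('/[[:punct:]]/', '', 'Hello, world!');

// [[:upper:]] : count uppercase letters.
$upperCount = preg_match_all('/[[:upper:]]/', 'PHP Regular Expressions');

var_dump($isHex, $notHex); // bool(true) bool(false)
echo $noPunct, "\n";       // Hello world
echo $upperCount, "\n";    // 5
```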

 7.3 Determine recurrence

By now you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a number consists of several single digits. Curly braces ({}) following a character or character cluster specify how many times the preceding content is repeated.

Character cluster    Meaning
^[a-zA-Z_]$    A single letter or underscore
^[[:alpha:]]{3}$    Any three-letter word
^a$    The letter a
^a{4}$    aaaa
^a{2,4}$    aa, aaa, or aaaa
^a{1,3}$    a, aa, or aaa
^a{2,}$    A string of two or more a's
^a{2,}    Strings such as aardvark and aaab, but not apple
a{2,}    Strings such as baad and aaa, but not Nantucket
\t{2}    Two tab characters
.{2}    Any two characters

These examples show three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number plus a comma, {x,}, means "the preceding content appears x or more times"; two numbers separated by a comma, {x,y}, means "the preceding content appears at least x times, but no more than y times". We can extend the pattern to more words or numbers:
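The rows of the table can be checked with preg_grep() in PHP. The word lists are taken from the table itself; the variable names are made up for this sketch.

```php
<?php
// {2,4} with both ^ and $: the whole string is two to four a's.
$twoToFour = array_values(
    preg_grep('/^a{2,4}$/', ['a', 'aa', 'aaa', 'aaaa', 'aaaaa'])
);

// {2,} anchored at the start only: the string begins with two or more a's.
$startsWithAa = array_values(
    preg_grep('/^a{2,}/', ['aardvark', 'aaab', 'apple'])
);

// {2,} without anchors: two or more a's anywhere in the string.
$containsAa = array_values(
    preg_grep('/a{2,}/', ['baad', 'aaa', 'Nantucket'])
);

print_r($twoToFour);    // aa, aaa, aaaa
print_r($startsWithAa); // aardvark, aaab
print_r($containsAa);   // baad, aaa
```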

  ^[a-zA-Z0-9_]{1,}$ // All strings made up of one or more letters, digits, or underscores
  ^[0-9]{1,}$ // All non-negative integers
  ^-{0,1}[0-9]{1,}$ // All integers
  ^-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // All decimals

The last example is not easy to read, is it? Look at it this way: the string starts (^) with an optional minus sign (-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more further digits ([0-9]{0,}), and nothing else ($). Note that because every part is optional, this pattern also matches the empty string and a lone "-" or ".". Simpler ways of writing it are shown next.

The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding content", or "the preceding content is optional". So the example just given can be simplified to:

  ^-?[0-9]{0,}\.?[0-9]{0,}$

The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding content". Finally, the character "+" is equivalent to {1,} and means "one or more of the preceding content", so the four examples above can be written as:

  ^[a-zA-Z0-9_]+$ // All strings made up of one or more letters, digits, or underscores
  ^[0-9]+$ // All non-negative integers
  ^-?[0-9]+$ // All integers
  ^-?[0-9]*\.?[0-9]*$ // All decimals

Of course this doesn’t technically reduce the complexity of regular expressions, but it makes them easier to read.
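The integer and decimal patterns can be wrapped in small PHP helpers, sketched below. The function names are hypothetical. isDecimalStrict() is not from the original text: it is one possible variant, added here as an assumption, that insists on at least one digit so that the empty string, "-" and "." are rejected.

```php
<?php
// Hypothetical helpers built on the patterns above.
function isInteger(string $s): bool
{
    return preg_match('/^-?[0-9]+$/', $s) === 1;
}

function isDecimal(string $s): bool
{
    // Every part is optional, so "" and "." are accepted too.
    return preg_match('/^-?[0-9]*\.?[0-9]*$/', $s) === 1;
}

// Assumed stricter variant: digits with an optional fraction, or a
// fraction that starts with the decimal point.
function isDecimalStrict(string $s): bool
{
    return preg_match('/^-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)$/', $s) === 1;
}

var_dump(isInteger('-42'), isInteger('4.2'));     // bool(true) bool(false)
var_dump(isDecimal('-3.14'), isDecimal('1.2.3')); // bool(true) bool(false)
var_dump(isDecimal(''), isDecimalStrict(''));     // bool(true) bool(false)
```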


Detailed source reference: http://www.jb51.net/article/26215.htm

PHP regular expressions

Single quotes and / can both be used as the delimiters around a pattern, and so can #. The choice exists for convenience: if the content to be matched contains single quotes, do not use single quotes as the delimiter; use / or # instead. This keeps the pattern simple, because otherwise every single quote inside the pattern must be preceded by the escape character \.

The i at the end of a pattern means case-insensitive matching. The s modifier can be understood as matching across the whole text: it lets . match line breaks, so it is best to add it when the content to be matched may contain them.

In a pattern, \s matches a whitespace character.
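A minimal PHP sketch of delimiters and the i and s modifiers, with invented paths and markup:

```php
<?php
// The same pattern written with three different delimiters.
$a = preg_match('/\/docs\//', '/docs/index.php'); // "/" must be escaped
$b = preg_match('#/docs/#', '/docs/index.php');   // no escaping needed
$c = preg_match('~/docs/~', '/docs/index.php');

// i: case-insensitive matching.
$caseInsensitive = preg_match('/php/i', 'I like PHP') === 1;

// s: lets "." match line breaks, so the match can span several lines.
$html     = "<p>first\nsecond</p>";
$withoutS = preg_match('/<p>(.*)<\/p>/', $html);
$withS    = preg_match('/<p>(.*)<\/p>/s', $html, $m);

var_dump($a, $b, $c);         // int(1) int(1) int(1)
var_dump($withoutS, $withS);  // int(0) int(1)
```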
