Regular Expressions (39)

WBOY
2016-08-08 09:23:23

Introduction to regular expressions:

A regular expression is a grammatical rule that describes how characters are arranged and matched. It is mainly used to split, match, search and replace strings. The exact (literal text) matching used so far is itself a regular expression.
In PHP, a regular expression is generally a formal description of a text pattern, built from ordinary characters combined with some special characters (similar to wildcards).

In PHP, regular expressions serve three purposes:

Matching, which is often used to extract information from strings.
Replacing the matched text with new text.
Splitting a string into a set of smaller chunks of information.

A regular expression contains at least one atom.
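
The three purposes listed above can be sketched with the "preg_" functions. This is a minimal illustration, not part of the original article; the sample text and variable names are assumptions chosen for the example.

```php
<?php
// Hypothetical sample text, used only for illustration.
$text = "Order 66 shipped on 2016-08-08";

// 1. Matching: extract the date from the string.
preg_match('/\d{4}-\d{2}-\d{2}/', $text, $m);
$date = $m[0];

// 2. Replacing: swap every run of digits for "#".
$masked = preg_replace('/\d+/', '#', $text);

// 3. Splitting: break the string on runs of whitespace.
$words = preg_split('/\s+/', $text);

echo $date, "\n";          // 2016-08-08
echo $masked, "\n";        // Order # shipped on #-#-#
echo count($words), "\n";  // 5
```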

There are two sets of regular expression functions in PHP. Their features are similar, but their execution efficiency differs slightly:

One set is provided by the PCRE (Perl Compatible Regular Expressions) library; its functions are named with the prefix "preg_".
The other set is provided by the POSIX (Portable Operating System Interface of Unix) extension; its functions are named with the prefix "ereg" (ereg(), eregi(), ereg_replace() and so on). These functions were deprecated in PHP 5.3.0 and removed in PHP 7.0.0, so new code should use the "preg_" functions.

One reason for using regular expressions is that ordinary search and replace operations can only match exact text, so searching for dynamic text is difficult or even impossible.

Grammar rules for regular expressions

PCRE regular expressions:
PCRE stands for Perl Compatible Regular Expressions.
PCRE syntax comes from the Perl language, one of the most powerful languages for string processing, and a language that strongly influenced PHP.
PCRE syntax supports more features and is more powerful than POSIX syntax. For the same task, the PCRE functions also have a slight performance advantage. The two syntaxes still have a lot in common.
In PCRE, the pattern (that is, the regular expression) is usually enclosed between two forward slashes "/", such as "/apple/". The content to be matched goes between the delimiters. The delimiter is not limited to "/": any character other than letters, digits, the backslash "\" and whitespace can be used, such as "#", "|" or "!".
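
As a small sketch of the delimiter rule, the same pattern can be written with several delimiters. The subject string and the URL are hypothetical values chosen for this example.

```php
<?php
// The same pattern written with three different delimiters.
$subject = "an apple a day";
$a = preg_match('/apple/', $subject);   // 1
$b = preg_match('#apple#', $subject);   // 1
$c = preg_match('!apple!', $subject);   // 1

// A non-slash delimiter avoids escaping "/" inside the pattern.
$url = "http://example.com/docs";       // hypothetical URL
$d = preg_match('#^http://[^/]+/docs$#', $url);   // 1

var_dump($a, $b, $c, $d);
```

Choosing a delimiter that does not occur in the pattern keeps the pattern free of extra backslashes.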

Atom

The atom is the basic unit of a regular expression. When analyzing a regular expression, each atom should be treated as a whole.
Atoms include all English letters, digits, punctuation marks and other symbols, as well as the following:
Single characters and digits, such as a-z, A-Z, 0-9.
Pattern units, such as (ABC), which can be understood as a large atom composed of several atoms.
Atom tables, such as [ABC].
Reused pattern units (back-references), such as \1.
Common escape characters, such as \d, \D, \w.
Escaped metacharacters, such as \*, \.

Common escape characters

Atom    Description
--------------------------------------------------------------------------------
\d      Matches a digit; equivalent to [0-9]
\D      Matches any character except a digit; equivalent to [^0-9]
\w      Matches an English letter, digit or underscore; equivalent to [0-9a-zA-Z_]
\W      Matches any character except English letters, digits and underscores; equivalent to [^0-9a-zA-Z_]
\s      Matches a whitespace character; equivalent to [ \f\n\r\t\x0b]
\S      Matches any character except whitespace; equivalent to [^ \f\n\r\t\x0b]
\f      Matches a form feed; equivalent to \x0c or \cL
\n      Matches a newline; equivalent to \x0a or \cJ
\r      Matches a carriage return; equivalent to \x0d or \cM
\t      Matches a tab; equivalent to \x09 or \cI
\v      Matches a vertical tab (\x0b or \cK); in current PCRE versions it matches any vertical whitespace character
\0NN    Matches the character with octal code NN
\xNN    Matches the character with hexadecimal code NN
\cC     Matches the control character Ctrl-C, where C is any character (for example \cJ)
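
A short sketch of a few entries from the table, assuming a made-up sample string that contains a space and a tab:

```php
<?php
$s = "Room 42,\tfloor_3";   // hypothetical sample; "\t" is a real tab here

preg_match_all('/\d/', $s, $digits);   // every single digit
preg_match_all('/\w+/', $s, $words);   // runs of letters, digits, underscores
$hasTab  = preg_match('/\x09/', $s);   // \x09 is the tab character
$noSpace = preg_replace('/\s/', '', $s);   // drop all whitespace

print_r($digits[0]);   // 4, 2, 3
print_r($words[0]);    // Room, 42, floor_3
echo $noSpace, "\n";   // Room42,floor_3
```

The patterns are written in single-quoted strings so that PHP passes the backslashes through to PCRE unchanged.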

Metacharacter

Metacharacters are characters with special meaning used to construct regular expressions. To include a metacharacter itself in a regular expression, put a backslash "\" before it to escape it.

Metacharacter   Description
--------------------------------------------------------------------------------
*               Matches the preceding atom 0, 1 or more times
+               Matches the preceding atom 1 or more times
?               Matches the preceding atom 0 or 1 times
|               Matches one of two or more alternatives
^ or \A         Matches at the beginning of the string
$ or \Z         Matches at the end of the string
\b              Matches a word boundary
\B              Matches a position that is not a word boundary
[]              Matches any one atom in the square brackets
[^]             Matches any character except the atoms in the square brackets
{m}             The preceding atom appears exactly m times
{m,n}           The preceding atom appears at least m times and at most n times (n >= m)
{m,}            The preceding atom appears at least m times
()              Groups the enclosed expression into a single atom
.               Matches any character except a newline
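
The interval quantifiers and the escaping rule from the table can be checked with a minimal sketch; the test strings are invented for illustration.

```php
<?php
// {m}, {m,n} and {m,} bound how often the preceding atom may repeat.
$exact   = preg_match('/^a{3}$/', 'aaa');       // 1: exactly three
$range   = preg_match('/^a{2,4}$/', 'aaaaa');   // 0: five is above the maximum
$atLeast = preg_match('/^a{2,}$/', 'aaaaa');    // 1: two or more

// A metacharacter preceded by "\" is matched literally.
$literal = preg_match('/^3\.14$/', '3.14');     // 1
$notDot  = preg_match('/^3\.14$/', '3x14');     // 0: "\." no longer means "any character"

var_dump($exact, $range, $atLeast, $literal, $notDot);
```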

String boundary restrictions

In some cases, the matching range needs to be limited to obtain more accurate results. "^" and "$" specify the start and the end of the string respectively.
For example, take the string "Tom and Jerry chased each other in the house until tom's uncle come in".
The metacharacter "^" or "\A" is placed at the beginning of the pattern to ensure that the match occurs at the beginning of the string:
/^Tom/
The metacharacter "$" or "\Z" is placed at the end of the pattern to ensure that the match occurs at the end of the string:
/in$/
Without boundary metacharacters, the pattern can match anywhere in the string, so more results are returned.
/^Tom$/ is an exact match: the whole string must be "Tom".
/Tom/ is a fuzzy match: "Tom" may appear anywhere in the string.
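
A minimal sketch of the anchors, assuming the sample sentence with its spelling normalized:

```php
<?php
$s = "Tom and Jerry chased each other in the house until tom's uncle come in";

$start = preg_match('/^Tom/', $s);     // 1: the string starts with "Tom"
$end   = preg_match('/in$/', $s);      // 1: the string ends with "in"
$exact = preg_match('/^Tom$/', $s);    // 0: the whole string is not just "Tom"

// Without anchors the pattern may match anywhere, so more results come back.
$count = preg_match_all('/in/', $s);   // 2

var_dump($start, $end, $exact, $count);
```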

Word boundary restrictions

When using the search function of an editor, selecting "Find whole words only" gives more accurate results. Regular expressions offer similar functionality.
For example, in the string "This island is a beautiful land":
The metacharacter "\b" matches a word boundary.
/\bis\b/ matches the word "is", but does not match "This" or "island".
/\bis/ matches the word "is" and the "is" in "island", but does not match "This".
The metacharacter "\B" matches a position that is not a word boundary.
/\Bis\B/ requires that neither side of "is" is a word boundary, so it only matches inside a word. In this example there is no result.
/\Bis/ matches the "is" in the word "This".
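
The four patterns can be compared by looking at where each one matches. The helper function offsets() below is a hypothetical name introduced only for this sketch.

```php
<?php
// Hypothetical helper: return the byte offset of every match.
function offsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);   // element 1 of each match is its offset
}

$s = "This island is a beautiful land";

$word   = offsets('/\bis\b/', $s);   // [12]    the standalone word "is"
$start  = offsets('/\bis/', $s);     // [5, 12] "island" and "is"
$inside = offsets('/\Bis\B/', $s);   // []      no "is" strictly inside a word
$tail   = offsets('/\Bis/', $s);     // [2]     the "is" in "This"

print_r([$word, $start, $inside, $tail]);
```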

Repeated matching

Some metacharacters in regular expressions are used to match an atom repeatedly: "?", "*" and "+". The main difference between them is the number of repetitions.
Metacharacter "?": matches the atom immediately preceding it 0 or 1 times.
For example: /colou?r/ matches "colour" or "color".
Metacharacter "*": matches the atom immediately preceding it 0, 1 or more times.
For example: /zo*/ matches "z", "zo" and "zoo".
Metacharacter "+": matches the atom immediately preceding it 1 or more times.
For example: /go+gle/ matches "gogle", "google", "gooogle" and other strings with one or more o's in the middle.
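
The three quantifiers can be tried side by side. In this sketch the patterns are anchored with "^" and "$" so that the whole test word must match; that anchoring is an addition for the example.

```php
<?php
// "?" : the "u" may appear 0 or 1 times.
$color  = preg_match('/^colou?r$/', 'color');    // 1
$colour = preg_match('/^colou?r$/', 'colour');   // 1

// "*" : the "o" may appear 0, 1 or more times.
$z   = preg_match('/^zo*$/', 'z');     // 1
$zoo = preg_match('/^zo*$/', 'zoo');   // 1

// "+" : the "o" must appear at least once.
$ggle    = preg_match('/^go+gle$/', 'ggle');      // 0
$gooogle = preg_match('/^go+gle$/', 'gooogle');   // 1

var_dump($color, $colour, $z, $zoo, $ggle, $gooogle);
```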

Any character

The metacharacter "." matches any character except a newline.
With the default newline setting this is equivalent to [^\n]; to exclude Windows line endings as well, [^\r\n] can be written explicitly.
For example: /pr.y/ matches the strings "prey", "pray", "pr%y" and so on.
The combination ".*" matches any run of characters, including an empty one, that contains no newline. Some books call this a "full match" or "single-inclusive match".
For example:
/^a.*z$/ matches any string that starts with the letter "a", ends with the letter "z" and contains no newline.
/.+/ works in a similar way, except that it must match at least one character.
/^a.+z$/ matches "a%z" but does not match the string "az".
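
A minimal sketch of the difference between ".*" and ".+", and of the newline exclusion (the test strings are chosen for illustration):

```php
<?php
$any   = preg_match('/^pr.y$/', 'pr%y');     // 1: "." matches "%"
$star  = preg_match('/^a.*z$/', 'az');       // 1: ".*" may match nothing
$plus0 = preg_match('/^a.+z$/', 'az');       // 0: ".+" needs at least one character
$plus1 = preg_match('/^a.+z$/', 'a%z');      // 1
$nl    = preg_match('/^a.*z$/', "a\nz");     // 0: "." does not match the newline

var_dump($any, $star, $plus0, $plus1, $nl);
```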

Atom table - square bracket expression

The atom table "[]" holds a group of atoms of equal standing and matches exactly one of them. To match either an "a" or an "e", use [ae].
For example: /Pr[ae]y/ matches "Pray" or "Prey".
The atom table "[^]" is also called the excluded atom table; it matches any character except the atoms in the table.
For example: /p[^u]/ matches "pa" in "part", but cannot match "pu" in "computer" because "u" is excluded from the match.
Inside an atom table, "-" connects a range of atoms that are consecutive in ASCII order, to simplify writing.
For example: /x[0123456789]/ can be written as /x[0-9]/, which matches the letter "x" followed by one digit.
For example:
/[a-zA-Z]/ matches any single uppercase or lowercase letter.
/^[a-z][0-9]$/ matches strings such as "z2", "t6" and "g7".
/0[xX][0-9a-fA-F]/ matches a simple hexadecimal number, such as "0x9".
/[^0-9a-zA-Z_]/ matches any character except English letters, digits and underscores, which is equivalent to \W.
/0?[xX][0-9a-fA-F]+/ matches hexadecimal numbers such as "0x9B3C" or "X800".
/<[A-Za-z][A-Za-z0-9]*>/ matches attribute-free HTML tags in any letter case, such as "<P>", "<h1>" or "<Body>".

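The atom table examples can be checked with a short sketch. The test strings are chosen for illustration, and some patterns are anchored with "^" and "$" here so that the whole string must match.

```php
<?php
$prey = preg_match('/Pr[ae]y/', 'Prey');          // 1

// Excluded atom table: "p" followed by anything but "u".
preg_match('/p[^u]/', 'part', $m);                // $m[0] is "pa"
$pu = preg_match('/p[^u]/', 'computer');          // 0

// Ranges inside an atom table.
$z2   = preg_match('/^[a-z][0-9]$/', 'z2');                  // 1
$hex  = preg_match('/^0?[xX][0-9a-fA-F]+$/', '0x9B3C');      // 1
$hex2 = preg_match('/^0?[xX][0-9a-fA-F]+$/', 'X800');        // 1
$tag  = preg_match('/^<[A-Za-z][A-Za-z0-9]*>$/', '<h1>');    // 1

var_dump($prey, $m[0], $pu, $z2, $hex, $hex2, $tag);
```
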
Pattern selector

The metacharacter "|" is also called the pattern selector. It matches one of two or more alternatives in a regular expression.
For example:
In the string "There are many apples and pears.", /apple|pear/ matches "apple" the first time it is run and "pear" when the search continues. More alternatives can be added, such as /apple|pear|banana|lemon/.
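
In PHP, continuing the search past the first match is usually done with preg_match_all(). A minimal sketch using the sentence from the example:

```php
<?php
$s = "There are many apples and pears.";

// preg_match() stops at the first alternative that matches.
preg_match('/apple|pear/', $s, $first);       // $first[0] is "apple"

// preg_match_all() keeps scanning and collects every match.
preg_match_all('/apple|pear/', $s, $all);     // $all[0] is ["apple", "pear"]

echo $first[0], "\n";
print_r($all[0]);
```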

Pattern unit

The metacharacter "()" turns the enclosed regular expression into a single atom (or pattern unit). Like parentheses in mathematical expressions, "()" lets its contents be used as one unit.
For example:
/(Dog)+/ matches "Dog", "DogDog", "DogDogDog", because the atom immediately before "+" is the string "Dog" enclosed by the metacharacter "()".
/You (very )+old/ matches "You very old" and "You very very old". There is no space before "old" in the pattern, because each "very " already ends with one.
/Hello (world|earth)/ matches "Hello world" and "Hello earth".
Expressions inside a pattern unit are matched or evaluated first.
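
A small sketch of pattern units. The second pattern is written without a space before "old", since the group "(very )" already consumes the space; the variant with the extra space is included to show that it fails on "You very old".

```php
<?php
// The whole group "Dog" is repeated by "+".
$dogs = preg_match('/^(Dog)+$/', 'DogDogDog');                  // 1

// "(very )" already ends with a space, so none is needed before "old".
$very  = preg_match('/^You (very )+old$/', 'You very very old');   // 1
$extra = preg_match('/^You (very )+ old$/', 'You very old');       // 0

// The text matched by the group is stored in $m[1].
preg_match('/Hello (world|earth)/', 'Hello earth', $m);         // $m[1] is "earth"

var_dump($dogs, $very, $extra, $m[1]);
```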

Reused pattern unit

The system automatically stores the text matched by each pattern unit "()" in order, and it can be referenced as "\1", "\2", "\3" when needed. This is very convenient when a regular expression contains the same pattern unit several times. Note that inside a double-quoted PHP string the backslash must itself be escaped, so write "\\1" and "\\2".
For example:
/^\d{2}([\W])\d{2}\1\d{4}$/ matches "12-31-2006", "09/27/1996", "86 01 4321" and similar strings, but it does not match "12/34-5678". This is because the character "/" matched by "[\W]" has been stored, so when "\1" is referenced at the later position, it must also match the character "/".
Use the non-storing pattern unit "(?:)" when the match does not need to be stored.
For example, /(?:a|b|c)(D|E|F)\1g/ matches "aEEg". Some regular expressions need non-storing pattern units; otherwise, the numbering of later references changes. The example above can also be written as /(a|b|c)(D|E|F)\2g/.
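
The back-reference examples can be sketched as follows. The variable names are assumptions; the double-quoted variant is included only to show the extra escaping that such strings need.

```php
<?php
// \1 must repeat exactly what the first group (\W) matched.
$date = '/^\d{2}(\W)\d{2}\1\d{4}$/';
$dash  = preg_match($date, '12-31-2006');   // 1
$slash = preg_match($date, '09/27/1996');   // 1
$mixed = preg_match($date, '12/34-5678');   // 0: "/" was stored, "-" does not repeat it

// The same pattern in a double-quoted string needs doubled backslashes.
$dateDq = "/^\\d{2}(\\W)\\d{2}\\1\\d{4}\$/";

// (?:...) groups without storing, so (D|E|F) is group 1.
$nonCapture = preg_match('/(?:a|b|c)(D|E|F)\1g/', 'aEEg');   // 1
// With a storing first group, the same reference becomes \2.
$capture    = preg_match('/(a|b|c)(D|E|F)\2g/', 'aEEg');     // 1

var_dump($dash, $slash, $mixed, $dateDq === $date, $nonCapture, $capture);
```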

