Introduction to regular expressions for matching Chinese (GB2312/UTF-8)

WBOY
2016-07-25 08:59:12
This article introduces the regular expression rules used to match Chinese text (in GB2312 and UTF-8 encodings). Readers who need them can use it as a reference.

The following lists the modifiers currently available in PCRE. The names in parentheses are the internal PCRE names of these modifiers. Spaces and newlines in the modifier string are ignored; any other character causes an error.

I hope this article can help everyone understand and master the concepts related to regular expressions more deeply.

i (PCRE_CASELESS) If this modifier is set, characters in the pattern will match both uppercase and lowercase letters.
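As a minimal sketch of the i modifier (the subject string and variable names are illustrative assumptions, not part of the original article), the same pattern can be run with and without it:

```php
<?php
// Illustrative subject string (an assumption for this sketch).
$subject = 'Hello PHP';

// Without i, the lowercase pattern does not match the uppercase text.
$caseSensitive = preg_match('/php/', $subject);

// With i, letters in the pattern match either case.
$caseInsensitive = preg_match('/php/i', $subject);

var_dump($caseSensitive, $caseInsensitive); // int(0) int(1)
```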

m (PCRE_MULTILINE) By default, PCRE treats the target string as consisting of a single "line" of characters (even if it contains newlines). The "start of line" metacharacter (^) matches only at the beginning of the string, and the "end of line" metacharacter ($) matches only at the end of the string, or before the final character if that character is a newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, "start of line" and "end of line" match not only at the beginning and end of the entire string, but also immediately after and immediately before each newline inside it. This is equivalent to Perl's /m modifier. If the target string contains no "\n" characters, or the pattern contains no ^ or $, setting this modifier has no effect.
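The difference between single-line and multi-line anchoring can be sketched as follows; the three-line subject string and the variable names are assumptions chosen for illustration:

```php
<?php
// Illustrative three-line subject (an assumption for this sketch).
$subject = "first\nsecond\nthird";

// Without m: ^ matches only at the very start and $ only at the very end,
// and \w+ cannot cross a newline, so nothing matches.
preg_match_all('/^\w+$/', $subject, $single);

// With m: ^ and $ also match after and before each embedded newline.
preg_match_all('/^\w+$/m', $subject, $multi);

print_r($single[0]); // empty array
print_r($multi[0]);  // first, second, third
```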

s (PCRE_DOTALL) If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded. This is equivalent to Perl's /s modifier. Negated character classes such as [^a] always match newlines, regardless of whether this modifier is set.
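A short sketch of the dot behaviour described above, including the point about negated character classes (the two-line subject and variable names are illustrative assumptions):

```php
<?php
// Illustrative subject: "a", a newline, then "b" (an assumption for this sketch).
$subject = "a\nb";

// Without s, the dot does not match the newline.
$withoutS = preg_match('/a.b/', $subject);

// With s, the dot matches any character, including the newline.
$withS = preg_match('/a.b/s', $subject);

// A negated class matches the newline even without s.
$negated = preg_match('/a[^x]b/', $subject);

var_dump($withoutS, $withS, $negated); // int(0) int(1) int(1)
```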

x (PCRE_EXTENDED) If this modifier is set, whitespace characters in the pattern are completely ignored unless they are escaped or inside a character class. In addition, everything from an unescaped # outside a character class up to and including the next newline is ignored. This is equivalent to Perl's /x modifier and allows comments to be added to complex patterns. Note, however, that this applies only to data characters. Whitespace may never appear inside a special character sequence in a pattern, for example in the middle of the sequence (?( that introduces a conditional subpattern.
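The following sketch shows how the x modifier lets a pattern be spread over several lines with comments. The date pattern, the subject string, and the variable names are assumptions made for illustration only:

```php
<?php
// A commented, multi-line pattern; whitespace and # comments are ignored under x.
$pattern = '/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x';

// Illustrative subject string (an assumption for this sketch).
preg_match($pattern, 'Posted on 2016-07-25', $m);

print_r($m); // full match followed by the year, month and day groups
```

The comments deliberately avoid the / character, because an unescaped delimiter inside the pattern body would end the pattern early.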

e If this modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code, and uses the value of that evaluation to replace the matched text.

Only preg_replace() uses this modifier; other PCRE functions ignore it.

Note: This modifier is not available in PHP3.
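Note that the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so it cannot be demonstrated on current PHP versions. As a hedged sketch of the usual substitute (not something the original article covers), preg_replace_callback() can compute the replacement in a callback; the sample text and the choice of strtoupper() are assumptions for illustration:

```php
<?php
// Uppercase every lowercase word using a callback instead of evaluated code.
$result = preg_replace_callback(
    '/\b[a-z]+\b/',
    function (array $m): string {
        // $m[0] holds the text matched by the whole pattern.
        return strtoupper($m[0]);
    },
    'php 7 removed the e modifier' // illustrative subject (an assumption)
);

echo $result, "\n"; // PHP 7 REMOVED THE E MODIFIER
```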

A (PCRE_ANCHORED) If this modifier is set, the pattern is forced to be "anchored", which means it can match only at the beginning of the target string. The same effect can be achieved with appropriate constructs in the pattern itself (which is the only way to achieve it in Perl).
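A brief sketch of anchoring with the A modifier compared with an explicit ^ in the pattern (the subject strings and variable names are illustrative assumptions):

```php
<?php
// With A, the match must begin at the start of the subject.
$anchoredMiss = preg_match('/abc/A', 'xabc');
$anchoredHit  = preg_match('/abc/A', 'abcx');

// The same restriction expressed inside the pattern itself.
$explicit = preg_match('/^abc/', 'xabc');

// Without anchoring, the pattern may match anywhere in the subject.
$unanchored = preg_match('/abc/', 'xabc');

var_dump($anchoredMiss, $anchoredHit, $explicit, $unanchored); // 0, 1, 0, 1
```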

D (PCRE_DOLLAR_ENDONLY) If this modifier is set, a dollar metacharacter in the pattern matches only at the very end of the target string. Without this option, if the last character is a newline, the dollar sign also matches immediately before it (but not before any other newline). This option is ignored if the m modifier is set. There is no equivalent modifier in Perl.
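This matters mostly when validating input that may carry a trailing newline. A minimal sketch, with an illustrative subject string and variable names that are assumptions:

```php
<?php
// Illustrative subject with a trailing newline (an assumption for this sketch).
$subject = "abc\n";

// By default, $ also matches just before a final newline.
$default = preg_match('/abc$/', $subject);

// With D, $ matches only at the very end, so the trailing newline blocks the match.
$dollarEndOnly = preg_match('/abc$/D', $subject);

// With m also set, D is ignored and $ matches before the newline again.
$withM = preg_match('/abc$/Dm', $subject);

var_dump($default, $dollarEndOnly, $withM); // int(1) int(0) int(1)
```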

S When a pattern is going to be used several times, it is worth analyzing it first to speed up matching. If this modifier is set, this additional analysis is performed. Currently, analyzing a pattern is only useful for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY) This modifier inverts the greediness of quantifiers, so that they are not greedy by default but become greedy when followed by a "?". This is not compatible with Perl. The same option can also be enabled with a (?U) setting inside the pattern, and an individual quantifier can be made non-greedy without this modifier by following it with a question mark (e.g. .*?).
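The inversion can be sketched with four variants of one pattern; the tag-like subject string and the variable names are assumptions chosen for illustration:

```php
<?php
// Illustrative subject (an assumption for this sketch).
$subject = '<a><b>';

preg_match('/<.+>/', $subject, $greedy);    // default: greedy
preg_match('/<.+?>/', $subject, $lazy);     // ? makes this quantifier non-greedy
preg_match('/<.+>/U', $subject, $ungreedy); // U: non-greedy by default
preg_match('/<.+?>/U', $subject, $flipped); // U plus ?: greedy again

echo $greedy[0], "\n";   // <a><b>
echo $lazy[0], "\n";     // <a>
echo $ungreedy[0], "\n"; // <a>
echo $flipped[0], "\n";  // <a><b>
```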

X (PCRE_EXTRA) This modifier enables an extra feature in PCRE that is incompatible with Perl. Any backslash in the pattern followed by a letter with no special meaning results in an error, which reserves these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as the letter itself. No other features are currently controlled by this modifier.

u (PCRE_UTF8) This modifier enables an additional feature in PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available since PHP 4.1.0 under Unix and since PHP 4.2.3 under win32. Patterns are checked for UTF-8 validity since PHP 4.3.5.

That's it. I hope this helps you with matching Chinese content using regular expressions in PHP, and I wish you all the best in your studies.
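To tie the u modifier back to the topic of matching Chinese text, here is a minimal sketch. It assumes the source file and the subject string are UTF-8 encoded and that PCRE was built with Unicode property support; the range \x{4e00}-\x{9fa5} is a commonly used approximation of the basic CJK ideograph block, and the sample string and variable names are illustrative assumptions:

```php
<?php
// Illustrative UTF-8 subject mixing Latin and Chinese text (an assumption).
$subject = 'PHP 正则表达式 test 中文';

// With u, \x{...} can name code points above 0xFF, so a code point range works.
preg_match_all('/[\x{4e00}-\x{9fa5}]+/u', $subject, $m);

// Alternative: the Unicode script property for Han characters.
preg_match_all('/\p{Han}+/u', $subject, $han);

print_r($m[0]);   // the two runs of Chinese characters
print_r($han[0]); // same result for this subject
```

The script property is usually the more readable choice, while the explicit range makes the covered code points visible; which one fits depends on how broad a definition of "Chinese character" is needed.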
