Detailed introduction to Java regular expressions

PHP中文网 (original), 2017-06-22 14:52:38

Each entry gives a construct followed by its meaning.

1. Characters
x The character x. For example, a matches the character a
\\ The backslash character. In a Java string literal write \\\\: the Java compiler first turns \\\\ into the regular expression \\, and the regex engine then reads \\ as one literal backslash. For the same reason, every escape that starts with a backslash, in this section and in the ones below, needs its backslash doubled in Java source.
\0n The character with octal value 0n (0 <= n <= 7)
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\uhhhh The character with hexadecimal value 0xhhhh
\t The tab character ('\u0009')
\n New line (Line feed) character ('\u000A')
\r Carriage return character ('\u000D')
\f Form feed character ('\u000C')
\a Alert (bell) character ('\u0007')
\e Escape character ('\u001B')
\cx Control character corresponding to x
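The double-escaping rule can be checked with a minimal sketch like the one below. The class name EscapeDemo and its helper methods are made up for illustration; only java.util.regex.Pattern comes from the JDK.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True when the whole input is exactly one backslash character.
    static boolean isSingleBackslash(String input) {
        // Four backslashes in source become the two-character regex that
        // stands for one literal backslash.
        return Pattern.matches("\\\\", input);
    }

    // True when the whole input is one tab, written with four different escapes.
    static boolean isTab(String input) {
        return Pattern.matches("\\t", input)       // regex tab escape
            && Pattern.matches("\\x09", input)     // hexadecimal escape
            && Pattern.matches("\\011", input)     // octal escape (011 is 9)
            && Pattern.matches("\\u0009", input);  // four-digit hex escape
    }

    public static void main(String[] args) {
        System.out.println(isSingleBackslash("\\")); // prints true
        System.out.println(isTab("\t"));             // prints true
    }
}
```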
2. Character classes
[abc] a, b or c (simple class). For example, [egd] matches one character that is e, g or d.
[^abc] Any character except a, b or c (negation). For example, [^egd] matches one character that is not e, g or d.
[a-zA-Z] a to z or A to Z, inclusive (range)
[a-d[m-p]] a to d, or m to p: [a-dm-p] (union)
[a-z&&[def]] d, e or f (intersection)
[a-z&&[^bc]] a to z, except b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a to z, but not m to p: [a-lq-z] (subtraction)
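Union, intersection and subtraction are easy to get wrong, so here is a small sketch that tries each form against single characters. CharClassDemo and matchesOneChar are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the whole input is one character accepted by the class.
    static boolean matchesOneChar(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesOneChar("[a-d[m-p]]", "n"));   // union: true
        System.out.println(matchesOneChar("[a-d[m-p]]", "e"));   // union: false
        System.out.println(matchesOneChar("[a-z&&[def]]", "e")); // intersection: true
        System.out.println(matchesOneChar("[a-z&&[def]]", "a")); // intersection: false
        System.out.println(matchesOneChar("[a-z&&[^bc]]", "b")); // subtraction: false
        System.out.println(matchesOneChar("[a-z&&[^bc]]", "d")); // subtraction: true
    }
}
```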
3. Predefined character classes (the backslash must be written twice in Java source, for example \d is written as \\d)
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
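As a rough illustration of the predefined classes in everyday use, the sketch below splits on whitespace, strips non-digits, and shows that the dot skips line terminators unless DOTALL is set. The class and method names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Splits text on runs of whitespace (\s+ in regex, doubled in Java source).
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // Removes every non-digit character (\D).
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    // True when the dot matches the given single character.
    static boolean dotMatches(String oneChar, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(".", flags).matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("a  b\tc"))); // prints [a, b, c]
        System.out.println(digitsOnly("2017-06-22"));          // prints 20170622
        System.out.println(dotMatches("\n", false));           // prints false
        System.out.println(dotMatches("\n", true));            // prints true
    }
}
```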
4. POSIX character classes (US-ASCII only) (the backslash must be written twice in Java source, for example \p{Lower} is written as \\p{Lower})
\p{Lower} A lowercase alphabetic character: [a-z]
\p{Upper} An uppercase alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
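A brief sketch of two POSIX classes in use follows; PosixClassDemo and its methods are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // Removes every ASCII punctuation character.
    static String stripPunctuation(String text) {
        return text.replaceAll("\\p{Punct}", "");
    }

    // True when the whole input consists of hexadecimal digits.
    static boolean isHex(String text) {
        return Pattern.matches("\\p{XDigit}+", text);
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Hello, world!")); // prints Hello world
        System.out.println(isHex("CAFE01"));                   // prints true
        System.out.println(isHex("xyz"));                      // prints false
    }
}
```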
5. java.lang.Character classes (simple Java character type)
\p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
6. Classes for Unicode blocks and categories
\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
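The java.lang.Character classes and the Unicode block and category classes, unlike the POSIX ones, are not limited to ASCII. The sketch below is one way to try them out; the class name and the sample strings are assumptions chosen for the example, and the non-ASCII inputs are written as Java Unicode escapes so the source stays ASCII.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // True when every character belongs to the Greek block.
    static boolean isGreek(String s) {
        return Pattern.matches("\\p{InGreek}+", s);
    }

    // True when the input is exactly one currency symbol.
    static boolean isCurrencySymbol(String s) {
        return Pattern.matches("\\p{Sc}", s);
    }

    // True when every character is a letter that is not uppercase.
    static boolean isNonUpperLetters(String s) {
        return Pattern.matches("[\\p{L}&&[^\\p{Lu}]]+", s);
    }

    // True when every character passes Character.isLowerCase.
    static boolean isJavaLowerCase(String s) {
        return Pattern.matches("\\p{javaLowerCase}+", s);
    }

    public static void main(String[] args) {
        System.out.println(isGreek("\u03b1\u03b2\u03b3"));  // alpha beta gamma: true
        System.out.println(isGreek("abc"));                 // false
        System.out.println(isCurrencySymbol("$"));          // true
        System.out.println(isCurrencySymbol("\u20ac"));     // euro sign: true
        System.out.println(isNonUpperLetters("abc"));       // true
        System.out.println(isNonUpperLetters("aBc"));       // false
        System.out.println(isJavaLowerCase("\u00e9"));      // e-acute: true
    }
}
```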
7. Boundary matchers
^ The beginning of a line. Put ^ at the start of the regular expression, for example ^(abc) matches input that starts with abc. By default ^ matches only at the beginning of the whole input; to match at the beginning of every line, compile with MULTILINE, for example Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
$ The end of a line. Put $ at the end of the regular expression, for example (^bca).*(abc$) matches a line that starts with bca and ends with abc.
\b A word boundary. For example, \b(abc) finds abc only at the start of a word (it is found in abcjj but not in jjabc), and (abc)\b finds abc only at the end of a word (it is found in jjabc).
\B A non-word boundary. For example, \B(abc) finds abc only when it does not start a word (it is found in jjabcjj and jjabc, but not in abcjj).
\A The beginning of the input
\G The end of the previous match (I personally find it is rarely needed). For example, \\Gdog finds dog only where the previous match ended. On the first attempt that position is the beginning of the input, so if the input does not begin with dog, nothing matches.
\Z The end of the input but for the final line terminator, if any
A line terminator is a sequence of one or two characters that marks the end of a line of the input character sequence. The following are recognized as line terminators:
‐a newline (line feed) character ('\n'),
‐a carriage return character followed immediately by a newline character ("\r\n"),
‐a standalone carriage return character ('\r'),
‐a next-line character ('\u0085'),
‐a line separator character ('\u2028'), or
‐a paragraph separator character ('\u2029').
\z The end of the input
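The boundary matchers are easiest to understand by running them. The following sketch, with the invented helpers found and count, checks \b, \B, ^ with and without MULTILINE, \G, and the difference between \Z and \z.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // True when the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Number of successive matches of the regex in the input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(found("\\babc", "abcjj"));  // true: abc starts a word
        System.out.println(found("\\babc", "jjabc"));  // false: abc is inside a word
        System.out.println(found("\\Babc", "jjabc"));  // true: no boundary before abc
        System.out.println(found("\\Babc", "abcjj"));  // false: boundary before abc

        System.out.println(count("^abc", 0, "abc\nabc"));                 // 1
        System.out.println(count("^abc", Pattern.MULTILINE, "abc\nabc")); // 2

        // \G chains matches: the third dog is not adjacent, so it is not counted.
        System.out.println(count("\\Gdog", 0, "dogdog cat dog"));         // 2

        System.out.println(found("abc\\Z", "abc\n"));  // true: final terminator ignored
        System.out.println(found("abc\\z", "abc\n"));  // false: input ends after the newline
    }
}
```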
When compiling a pattern, one or more flags can be set by combining them with the bitwise OR operator, for example
Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Six commonly used flags are:
‐CASE_INSENSITIVE: Characters are matched regardless of case. By default this flag only considers US-ASCII characters.
‐UNICODE_CASE: When combined with CASE_INSENSITIVE, case-insensitive matching follows Unicode letter case
‐MULTILINE: ^ and $ match the beginning and end of each line, rather than of the entire input
‐UNIX_LINES: Only '\n' is treated as a line terminator for ., ^ and $
‐DOTALL: When this flag is used, the . symbol matches all characters, including line terminators
‐CANON_EQ: Takes the canonical equivalence of Unicode characters into account
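To make the effect of the flags concrete, here is a hedged sketch that toggles UNICODE_CASE and DOTALL. FlagDemo and its methods are hypothetical; the accented letters are written as Java Unicode escapes (lowercase and uppercase e-acute).

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Case-insensitive full match, optionally Unicode-aware.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicode) {
        int flags = Pattern.CASE_INSENSITIVE | (unicode ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Full match with or without DOTALL.
    static boolean matchesWithDotAll(String regex, String input, boolean dotAll) {
        return Pattern.compile(regex, dotAll ? Pattern.DOTALL : 0).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringCase("abc", "ABC", false));       // true
        // Without UNICODE_CASE only ASCII letters are folded.
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9", false)); // false
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9", true));  // true

        System.out.println(matchesWithDotAll("a.b", "a\nb", false));        // false
        System.out.println(matchesWithDotAll("a.b", "a\nb", true));         // true
    }
}
```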
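8. Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n times, but not more than m times
9. Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n times, but not more than m times
10. Possessive quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n times, but not more than m times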
The difference between greedy, reluctant and possessive quantifiers is as follows. (It only shows when the quantified part could match stretches of different lengths.)
Greedy quantifiers are called "greedy" because they make the matcher read the entire input string before attempting the first match. If that first attempt (the entire input string) fails, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no characters left to back off from. Depending on the quantifier used in the expression, the last thing it tries to match against is 1 or 0 characters.
Reluctant quantifiers take the opposite approach: they start at the beginning of the input string and then read one character at a time while looking for a match. The last thing they try is the entire input string.
Finally, possessive quantifiers always read the entire input string and try for a match once (and only once). Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
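The three behaviours can be compared side by side on one input. In this sketch the input xfooxxxxxxfoo and the helper findAll are chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every successive match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: eats everything, then backs off until the last foo fits.
        System.out.println(findAll(".*foo", input));  // prints [xfooxxxxxxfoo]
        // Reluctant: grows one character at a time, stops at the first foo.
        System.out.println(findAll(".*?foo", input)); // prints [xfoo, xxxxxxfoo]
        // Possessive: eats everything and never backs off, so foo cannot match.
        System.out.println(findAll(".*+foo", input)); // prints []
    }
}
```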
11. Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group. For example, (abc) captures abc as a whole. Groups are numbered by counting their opening parentheses from left to right, so the expression ((A)(B(C))) contains four groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
12. Back references
\n Whatever the nth capturing group matched. For example, (ab)34(cd)\1\2 matches ab34cdabcd.
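Group numbering and back references can be inspected through Matcher.group. The sketch below uses the ((A)(B(C))) expression from the text; the back-reference pattern and the class name GroupDemo are illustrative choices.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group 0 (the whole match) followed by each capturing group,
    // or an empty array when the regex does not match the whole input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // \1 and \2 repeat whatever groups 1 and 2 matched.
        System.out.println(Pattern.matches("(ab)34(cd)\\1\\2", "ab34cdabcd")); // true
        System.out.println(Pattern.matches("(ab)34(cd)\\1\\2", "ab34cdcdab")); // false
    }
}
```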
13. Quotation
\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E. The text between \Q and \E is matched literally (Java string-literal escapes are still processed by the compiler first). For example, the Java string "ab\\Q{|}\\\\E" matches the text ab{|}\\
\E Nothing, but ends the quoting started by \Q
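Quoting matters most when the text to search for comes from user input. A minimal sketch, assuming a helper named containsLiteral, uses Pattern.quote (which quotes its argument as a literal) alongside a hand-written \Q...\E pattern.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True when the text contains the literal string, metacharacters and all.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unquoted, + is a quantifier, so the regex 1+1 looks for 11, 111, ...
        System.out.println(Pattern.compile("1+1").matcher("1+1=2").find()); // false
        System.out.println(containsLiteral("1+1=2", "1+1"));                // true

        // Hand-written quoting: the braces, the bar and both backslashes are literal.
        System.out.println(Pattern.matches("ab\\Q{|}\\\\E", "ab{|}\\\\"));  // true
    }
}
```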
14. Special constructs (non-capturing)
(?:X) X, as a non-capturing group
(?idmsux-idmsux) Nothing, but turns match flags on - off. For example, in the expression (?i)abc(?-i)def, (?i) turns case-insensitive matching on, so abc is matched regardless of case; (?-i) turns it off again, so def must match exactly.
The idmsux flags are:
‐i CASE_INSENSITIVE: case-insensitive matching for the US-ASCII character set. (?i)
‐d UNIX_LINES: only \n is treated as a line terminator. (?d)
‐m MULTILINE: multiline mode. (?m)
‐s DOTALL: . also matches line terminators (\n on UNIX, \r\n on Windows). (?s)
‐u UNICODE_CASE: Unicode-aware case-insensitive matching, used together with i. (?u)
‐x COMMENTS: allows comments in the pattern; whitespace in the pattern is ignored, and so is everything from # to the end of the line. (?x) For example, (?x)abc#asfsdadsa matches the string abc
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags on - off. The expression above can therefore be rewritten as (?i:abc)def, or as (?i)abc(?-i:def)
(?=X) X, via zero-width positive lookahead. Matching continues only if subexpression X matches to the right of this position. For example, \w+(?=\d) matches a run of word characters that is followed by a digit; that digit is checked but is not part of the match (it is not consumed).
(?!X) X, via zero-width negative lookahead. Matching continues only if subexpression X does not match to the right of this position. For example, \w+(?!\d) matches a run of word characters that is not followed by a digit.
(?<=X) X, via zero-width positive lookbehind. Matching continues only if subexpression X matches to the left of this position. For example, (?<=19)99 matches 99 only when it is preceded by 19; the 19 is checked but is not part of the match.
(?<!X) X, via zero-width negative lookbehind. Matching continues only if subexpression X does not match to the left of this position.
(?>X) X, as an independent, non-capturing group (no backtracking into the group)
The difference between (?:X) and (?>X) is that (?>X) does not backtrack. For example, against the string abcm both a(?:b|bc) and a(?>b|bc) find ab. With a trailing m the two differ: a(?:b|bc)m matches abcm, because once b turns out not to be followed by m the group backtracks and tries bc, whereas a(?>b|bc)m does not match, because the group keeps its first choice b.
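The special constructs can be exercised with a short sketch such as the following. The class name, the firstMatch helper and the sample inputs are assumptions made for the example; the patterns themselves are the ones discussed in this section, plus a trailing m to make the atomic group's behaviour visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialConstructDemo {
    // Returns the first match of the regex in the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Inline flags: abc ignores case, def does not.
        System.out.println(Pattern.matches("(?i)abc(?-i)def", "ABCdef")); // true
        System.out.println(Pattern.matches("(?i:abc)def", "ABCDEF"));     // false
        // Comments mode: whitespace and everything after # are ignored.
        System.out.println(Pattern.matches("(?x)a b c # note", "abc"));   // true

        // Lookahead: the final digit is required but not consumed.
        System.out.println(firstMatch("\\w+(?=\\d)", "abc123"));          // abc12
        // Lookbehind: 19 must precede the match but is not part of it.
        System.out.println(firstMatch("(?<=19)99", "1999"));              // 99

        // Ordinary group backtracks from b to bc; the atomic group does not.
        System.out.println(Pattern.matches("a(?:b|bc)m", "abcm"));        // true
        System.out.println(Pattern.matches("a(?>b|bc)m", "abcm"));        // false
    }
}
```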

