Java regular expressions
Regular expressions define string patterns.
Regular expressions can be used to search, edit or process text.
Regular expressions are not limited to a certain language, but there are subtle differences in each language.
Java regular expressions are most similar to Perl's.
The java.util.regex package mainly includes the following three classes:
Pattern class:
A Pattern object is a compiled representation of a regular expression. The Pattern class has no public constructor. To create a Pattern object, call its public static compile method, which returns a Pattern object. This method accepts a regular expression as its first parameter.
Matcher class:
The Matcher object is an engine that interprets and matches input strings. Like the Pattern class, Matcher has no public constructor. You need to call the matcher method of the Pattern object to obtain a Matcher object.
PatternSyntaxException:
PatternSyntaxException is an unchecked exception class that represents a syntax error in a regular expression pattern.
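As a minimal sketch of how the three classes fit together, the following compiles a pattern, obtains a matcher from it, and treats a malformed pattern as "no match". The class name RegexClassesSketch and the helper containsMatch are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexClassesSketch {
    // Hypothetical helper: true if regex is valid and occurs somewhere in input.
    static boolean containsMatch(String regex, String input) {
        try {
            Pattern p = Pattern.compile(regex); // static factory; there is no public constructor
            Matcher m = p.matcher(input);       // a Matcher is obtained from the Pattern
            return m.find();
        } catch (PatternSyntaxException e) {
            // Unchecked exception thrown by compile for a malformed pattern
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("\\d+", "QT3000")); // valid pattern, digits present
        System.out.println(containsMatch("(", "QT3000"));    // unclosed group, fails to compile
    }
}
```

Returning false for an invalid pattern is only one possible design; real code may prefer to let the exception propagate.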
Capturing group
A capturing group is a way of treating multiple characters as a single unit. It is created by placing the characters inside parentheses.
For example, the regular expression (dog) creates a single group containing "d", "o", and "g".
Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))), there are four such groups:
((A)(B(C)))
(A)
(B(C))
(C)
You can check how many groups an expression has by calling the groupCount method of the matcher object. The groupCount method returns an int value indicating the number of capturing groups in the matcher's pattern.
There is also a special group (group 0), which always represents the entire expression. The group is not included in the return value of groupCount.
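The numbering rule can be checked with a short sketch that lists every group of ((A)(B(C))) for the input "ABC". The class name GroupNumberingSketch and the helper groupsOf are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingSketch {
    // Hypothetical helper: returns group 0 through group groupCount() for a full match,
    // or an empty array if the input does not match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1]; // +1 because group 0 is not counted
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Prints [ABC, ABC, A, BC, C]: group 0, then groups 1 to 4
        System.out.println(Arrays.toString(groupsOf("((A)(B(C)))", "ABC")));
    }
}
```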
Example
The following example illustrates how to find a numeric string from a given string:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        public static void main(String[] args) {
            // Search the string for the given pattern
            String line = "This order was placed for QT3000! OK?";
            // The backslash must be doubled inside a Java string literal
            String pattern = "(.*)(\\d+)(.*)";

            // Create a Pattern object
            Pattern r = Pattern.compile(pattern);

            // Now create a Matcher object
            Matcher m = r.matcher(line);
            if (m.find()) {
                System.out.println("Found value: " + m.group(0));
                System.out.println("Found value: " + m.group(1));
                System.out.println("Found value: " + m.group(2));
            } else {
                System.out.println("NO MATCH");
            }
        }
    }
The compilation and running results of the above example are as follows:
    Found value: This order was placed for QT3000! OK?
    Found value: This order was placed for QT300
    Found value: 0
Regular expression syntax
Characters | Description |
---|---|
\ | Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n". "\n" matches a newline character. The sequence "\\" matches "\", and "\(" matches "(". |
^ | Matches the beginning of the input string. If the Pattern.MULTILINE flag is set, ^ also matches the position just after a line terminator such as "\n" or "\r". |
$ | Matches the end of the input string. If the Pattern.MULTILINE flag is set, $ also matches the position just before a line terminator such as "\n" or "\r". |
* | Matches the preceding character or subexpression zero or more times. For example, zo* matches "z" and "zoo". * Equivalent to {0,}. |
+ | Matches the preceding character or subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but not "z". + Equivalent to {1,}. |
? | Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}. |
{n} | n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but does match both "o"s in "food". |
{n,} | n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*". |
{n,m} | m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note: you cannot insert spaces between the comma and the numbers. |
? | When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching is "non-greedy". A non-greedy pattern matches the shortest possible string, while the default greedy pattern matches the longest possible string. For example, in the string "oooo", "o+?" matches only a single "o", while "o+" matches all the o's. |
. | Matches any single character except line terminators such as "\n" and "\r". To match any character including line terminators, use a pattern such as "[\s\S]" or set the Pattern.DOTALL flag. |
(pattern) | Matches pattern and captures the matching subexpression. Captured matches can be retrieved with the group method of the Matcher object. To match the parenthesis characters ( ), use "\(" or "\)". |
(?:pattern) | Matches pattern but does not capture the matching subexpression, i.e. it is a non-capturing match that is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'. |
(?=pattern) | Positive lookahead. Matches at any position where the text that follows matches pattern. It is a non-capturing match, i.e. the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the previous match, not after the characters that make up the lookahead. |
(?!pattern) | Negative lookahead. Matches at any position where the text that follows does not match pattern. It is a non-capturing match, i.e. the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the previous match, not after the characters that make up the lookahead. |
x|y | Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food". |
[xyz] | Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain". |
[^xyz] | Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches "p", "l", "i", and "n" in "plain". |
[a-z] | Character range. Matches any character within the specified range. For example, "[a-z]" matches any lowercase letter in the range "a" through "z". |
[^a-z] | Negated character range. Matches any character not within the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z". |
\b | Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never", but not the "er" in "verb". |
\B | Non-word boundary matching. "er\B" matches the "er" in "verb", but not the "er" in "never". |
\cx | Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x should be a letter in the range A-Z or a-z. |
\d | Matches a digit character. Equivalent to [0-9]. |
\D | Matches non-numeric characters. Equivalent to [^0-9]. |
\f | Form feed matching. Equivalent to \x0c and \cL. |
\n | Newline matching. Equivalent to \x0a and \cJ. |
\r | Matches a carriage return character. Equivalent to \x0d and \cM. |
\s | Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \t\n\x0B\f\r]. |
\S | Matches any non-whitespace character. Equivalent to [^ \t\n\x0B\f\r]. |
\t | Tab matching. Equivalent to \x09 and \cI. |
\v | Matches a vertical tab (\x0b, \cK). Since Java 8, \v matches any vertical whitespace character, which includes the vertical tab as well as line terminators such as \n and \r. |
\w | Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]". |
\W | Matches any non-word character. Equivalent to "[^A-Za-z0-9_]". |
\xn | Matches n, where n is a hexadecimal escape code. The hexadecimal escape code must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions. |
\num | Matches num, where num is a positive integer. It is a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters. |
\n | Backreference, where n is a digit from 1 to 9. In Java, \1 through \9 are always interpreted as backreferences; an octal escape must begin with a zero (for example, \07). |
\nm | Backreference. If at least nm capturing groups exist at that point in the expression, \nm is a backreference to group nm. Otherwise digits are dropped until the number refers to an existing group or a single digit remains, so \nm is read as a backreference to group n followed by the literal character m. |
\0nml | Octal escape. When n is an octal digit (0-3) and m and l are octal digits (0-7), matches the character with octal value nml. The leading zero is required in Java. |
\un | Matches n, where n is a Unicode character represented as a four-digit hexadecimal number. For example, \u00A9 matches the copyright symbol (©). |
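A few rows of the table can be tried out directly. The sketch below is illustrative only: the class name SyntaxSketch and the helper firstMatch are hypothetical, and the inputs are taken from the examples in the table.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyntaxSketch {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy versus non-greedy quantifier
        System.out.println(firstMatch("o+", "oooo"));   // oooo
        System.out.println(firstMatch("o+?", "oooo"));  // o

        // Non-capturing group with alternation
        System.out.println(firstMatch("industr(?:y|ies)", "industries")); // industries

        // Positive lookahead: the match ends before "2000" (the pattern includes the space)
        System.out.println("[" + firstMatch("Windows (?=95|98|NT|2000)", "Windows 2000") + "]"); // [Windows ]

        // Negative lookahead: rejected because "2000" follows
        System.out.println(firstMatch("Windows (?!95|98|NT|2000)", "Windows 2000")); // null

        // Backreference: two consecutive identical characters
        System.out.println(firstMatch("(.)\\1", "food")); // oo
    }
}
```

Note that every backslash in a pattern is doubled when it is written as a Java string literal.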
Methods of Matcher class
Index methods
The index methods provide index values that show exactly where in the input string a match was found:
No. | Method and description |
---|---|
1 | public int start() Returns the start index of the previous match. |
2 | public int start(int group) Returns the start index of the subsequence captured by the given group during the previous match operation. |
3 | public int end() Returns the offset after the last character matched. |
4 | public int end(int group) Returns the offset after the last character of the subsequence captured by the given group during the previous match operation. |
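The group variants of start and end can be illustrated with a small sketch. The class name GroupIndexSketch, the helper groupSpan, and the sample input are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexSketch {
    // Hypothetical helper: returns {start, end} of the given group in the first match,
    // or null if the pattern is not found.
    static int[] groupSpan(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String input = "mail: user@example";
        String regex = "(\\w+)@(\\w+)";
        System.out.println(Arrays.toString(groupSpan(regex, input, 0))); // [6, 18] whole match
        System.out.println(Arrays.toString(groupSpan(regex, input, 1))); // [6, 10] "user"
        System.out.println(Arrays.toString(groupSpan(regex, input, 2))); // [11, 18] "example"
    }
}
```

Because end is the offset after the last matched character, input.substring(start, end) yields exactly the captured text.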
Search methods
The search methods examine the input string and return a boolean value indicating whether the pattern was found:
No. | Method and description |
---|---|
1 | public boolean lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern. |
2 | public boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern. |
3 | public boolean find(int start) Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. |
4 | public boolean matches() Attempts to match the entire region against the pattern. |
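As a small illustration of find(int start) from the table above, the following sketch reports where the first match at or after a given index begins. The class name FindFromIndexSketch and the helper findFrom are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindFromIndexSketch {
    // Hypothetical helper: start index of the first match at or after 'start', or -1 if none.
    // 'start' must lie between 0 and input.length(), otherwise find(int) throws
    // IndexOutOfBoundsException.
    static int findFrom(String regex, String input, int start) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(start) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "cat cat cat";
        System.out.println(findFrom("cat", input, 0)); // 0
        System.out.println(findFrom("cat", input, 1)); // 4
        System.out.println(findFrom("cat", input, 9)); // -1
    }
}
```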
Replacement methods
The replacement methods replace text in the input string:
No. | Method and description |
---|---|
1 | public Matcher appendReplacement(StringBuffer sb, String replacement) Implements a non-terminal append-and-replace step. |
2 | public StringBuffer appendTail(StringBuffer sb) Implements a terminal append-and-replace step. |
3 | public String replaceAll(String replacement) Replaces every subsequence of the input sequence that matches the pattern with the given replacement string. |
4 | public String replaceFirst(String replacement) Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string. |
5 | public static String quoteReplacement(String s) Returns a literal replacement string for the specified string. The returned string works as a literal replacement when passed to the appendReplacement method of the Matcher class. |
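quoteReplacement matters when the replacement text contains "$" or "\", which are otherwise interpreted as group references or escapes. A minimal sketch follows; the class name QuoteReplacementSketch and the helper replaceLiterally are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteReplacementSketch {
    // Hypothetical helper: replaces every match of regex with the replacement taken literally.
    static String replaceLiterally(String input, String regex, String replacement) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Without quoting, "$5" would be read as a reference to capturing group 5
        System.out.println(replaceLiterally("price: X", "X", "$5")); // price: $5
    }
}
```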
start and end methods
The following is an example of counting the number of times the word "cat" appears in the input string:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        // \b must be written as \\b inside a Java string literal
        private static final String REGEX = "\\bcat\\b";
        private static final String INPUT = "cat cat cat cattie cat";

        public static void main(String[] args) {
            Pattern p = Pattern.compile(REGEX);
            Matcher m = p.matcher(INPUT); // get a matcher object
            int count = 0;

            while (m.find()) {
                count++;
                System.out.println("Match number " + count);
                System.out.println("start(): " + m.start());
                System.out.println("end(): " + m.end());
            }
        }
    }
The compilation and running results of the above example are as follows:
    Match number 1
    start(): 0
    end(): 3
    Match number 2
    start(): 4
    end(): 7
    Match number 3
    start(): 8
    end(): 11
    Match number 4
    start(): 19
    end(): 22
You can see that this example uses word boundaries to ensure that the letters "c" "a" "t" are not just a substring of a longer word. It also provides some useful information about where in the input string the match occurred.
The start method returns the start index of the previous match, and the end method returns the index of the last matched character plus one.
matches and lookingAt methods
The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference between them is that matches requires the entire sequence to match, while lookingAt does not.
Both methods always start matching at the beginning of the input string.
We use the following example to explain this function:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        private static final String REGEX = "foo";
        private static final String INPUT = "fooooooooooooooooo";
        private static Pattern pattern;
        private static Matcher matcher;

        public static void main(String[] args) {
            pattern = Pattern.compile(REGEX);
            matcher = pattern.matcher(INPUT);

            System.out.println("Current REGEX is: " + REGEX);
            System.out.println("Current INPUT is: " + INPUT);

            System.out.println("lookingAt(): " + matcher.lookingAt());
            System.out.println("matches(): " + matcher.matches());
        }
    }
The compilation and running results of the above example are as follows:
    Current REGEX is: foo
    Current INPUT is: fooooooooooooooooo
    lookingAt(): true
    matches(): false
replaceFirst and replaceAll methods
The replaceFirst and replaceAll methods replace text that matches a regular expression. The difference is that replaceFirst replaces only the first match, while replaceAll replaces all matches.
The following example explains this function:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        private static String REGEX = "dog";
        private static String INPUT = "The dog says meow. " +
                                      "All dogs say meow.";
        private static String REPLACE = "cat";

        public static void main(String[] args) {
            Pattern p = Pattern.compile(REGEX);
            // get a matcher object
            Matcher m = p.matcher(INPUT);
            INPUT = m.replaceAll(REPLACE);
            System.out.println(INPUT);
        }
    }
The compilation and running results of the above example are as follows:
The cat says meow. All cats say meow.
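For comparison, a sketch of the same replacement with replaceFirst is given here, reusing the input string from the example above. The class name ReplaceFirstSketch and its two helper methods are hypothetical.

```java
import java.util.regex.Pattern;

public class ReplaceFirstSketch {
    private static final Pattern DOG = Pattern.compile("dog");

    // Replaces only the first occurrence of "dog"
    static String replaceFirstDog(String input) {
        return DOG.matcher(input).replaceFirst("cat");
    }

    // Replaces every occurrence of "dog"
    static String replaceAllDogs(String input) {
        return DOG.matcher(input).replaceAll("cat");
    }

    public static void main(String[] args) {
        String input = "The dog says meow. All dogs say meow.";
        System.out.println(replaceFirstDog(input)); // The cat says meow. All dogs say meow.
        System.out.println(replaceAllDogs(input));  // The cat says meow. All cats say meow.
    }
}
```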
appendReplacement and appendTail methods
The Matcher class also provides the appendReplacement and appendTail methods for text replacement.
The following example explains this function:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        private static String REGEX = "a*b";
        private static String INPUT = "aabfooaabfooabfoob";
        private static String REPLACE = "-";

        public static void main(String[] args) {
            Pattern p = Pattern.compile(REGEX);
            // get a matcher object
            Matcher m = p.matcher(INPUT);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                // append the text before the match, then the replacement
                m.appendReplacement(sb, REPLACE);
            }
            // append whatever remains after the last match
            m.appendTail(sb);
            System.out.println(sb.toString());
        }
    }
The compilation and running results of the above example are as follows:
-foo-foo-foo-
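One reason to use appendReplacement and appendTail rather than replaceAll is that the replacement can be computed separately for each match. The following is a sketch under that assumption; the class name AppendReplacementSketch and the helper doubleNumbers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendReplacementSketch {
    // Hypothetical helper: doubles every run of digits found in the input.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before this match, then the computed replacement
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Copies the text after the last match
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("2 apples and 10 pears")); // 4 apples and 20 pears
    }
}
```

If a computed replacement could contain "$" or "\", it should be passed through Matcher.quoteReplacement first.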
Methods of the PatternSyntaxException class
PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.
The PatternSyntaxException class provides the following methods to help us see what errors occurred.
No. | Method and description |
---|---|
1 | public String getDescription() Returns a description of the error. |
2 | public int getIndex() Returns the index of the error. |
3 | public String getPattern() Returns the erroneous regular expression pattern. |
4 | public String getMessage() Returns a multiline string containing a description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern. |
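The methods above can be exercised by compiling a deliberately malformed pattern. This is only a sketch: the class name SyntaxErrorSketch and the helper compileError are hypothetical, and the exact description and index text depend on the JDK, so no output is shown for them.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxErrorSketch {
    // Hypothetical helper: returns the exception raised by compiling regex,
    // or null if the pattern is valid.
    static PatternSyntaxException compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        PatternSyntaxException e = compileError("(abc"); // unclosed group
        if (e != null) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index: " + e.getIndex());
            System.out.println("Pattern: " + e.getPattern());
            System.out.println("Message: " + e.getMessage());
        }
    }
}
```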