
Knowledge summary and sharing of regular expressions in Python

黄舟
2017-09-23 11:34:26

This article introduces the basics of regular expressions in Python. It does not cover how to write efficient regular expressions or how to optimize them; see other tutorials for those topics.

1. Regular expression syntax

1.1 Characters and character classes
1 Special characters: \.^$?+*{}[]()|
   To use any of these special characters as a literal, escape it with \.
2 Character classes
   1. One or more characters enclosed in [] form a character class. If no quantifier follows, a character class matches exactly one of its characters.
   2. A range can be specified inside a character class. For example, [a-zA-Z0-9] matches any one character from a to z, A to Z, or 0 to 9.
   3. A ^ immediately after the left square bracket negates the character class. For example, [^0-9] matches any non-digit character.
   4. Inside a character class, special characters other than \ lose their special meaning and stand for themselves, apart from ^, - and ]. A ^ in the first position means negation and is a literal ^ anywhere else. A - between two characters denotes a range and is a literal - when it is the first character of the class. A ] must be escaped or placed first to be literal.
   5. Shorthands such as \d, \s and \w can be used inside a character class.
3 Shorthands
   .  matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
   \d matches a Unicode digit; with re.ASCII it matches 0-9
   \D matches a Unicode non-digit
   \s matches Unicode whitespace; with re.ASCII it matches one of space, \t, \n, \r, \f, \v
   \S matches Unicode non-whitespace
   \w matches a Unicode word character; with re.ASCII it matches one of [a-zA-Z0-9_]
   \W matches a Unicode non-word character
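
The rules for character classes and shorthands can be checked with a few small calls. This is only an illustrative sketch: the sample strings and variable names are made up for the example, and \u0663 (an Arabic-Indic digit) is used to show the Unicode behaviour of \d.

```python
import re

# Sample text (an assumption for illustration); \u0663 is a non-ASCII digit.
text = "Room 42, code: a-b^c \u0663"

# A character class without a quantifier matches exactly one character.
hex_chars = re.findall(r"[0-9a-f]", "3f-Z")

# "-" placed first and "^" placed anywhere but first are literals.
dash_or_caret = re.findall(r"[-^]", text)

# "^" in first position negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# \d is Unicode-aware for str patterns; re.ASCII restricts it to 0-9.
unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, re.ASCII)

# "." does not match a newline unless re.DOTALL is set.
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"a.b", "a\nb", re.DOTALL)

print(hex_chars, dash_or_caret, non_digits)
print(unicode_digits, ascii_digits)
print(dot_default, dot_all)
```

Passing re.ASCII is the simplest way to get the narrower behaviour when the input is known to contain only ASCII data.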

1.2 Quantifiers
   1. ? matches the preceding expression 0 or 1 times
   2. * matches the preceding expression 0 or more times
   3. + matches the preceding expression 1 or more times
   4. {m} matches the preceding expression exactly m times
   5. {m,} matches the preceding expression at least m times
   6. {,n} matches the preceding expression at most n times
   7. {m,n} matches the preceding expression at least m times and at most n times
Notes:
   All of the quantifiers above are greedy and match as much as possible. To make a quantifier non-greedy, follow it with a ?.
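
The difference between greedy and non-greedy quantifiers is easiest to see on a string with several possible end points. The following is a minimal sketch; the sample strings and variable names are assumptions chosen for the example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string, so there is one long match.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first ">" it can, so each tag matches separately.
lazy = re.findall(r"<.+?>", html)

# {m,n} takes as many repetitions as allowed; {m,n}? takes as few as allowed.
bounded = re.search(r"\d{2,4}", "123456").group()
bounded_lazy = re.search(r"\d{2,4}?", "123456").group()

print(greedy)
print(lazy)
print(bounded, bounded_lazy)
```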

1.3 Grouping and capturing
1 The role of ():
   1. Capture the content matched by the regular expression inside () for further processing. Capturing can be turned off for a pair of parentheses by placing ?: directly after the left parenthesis.
   2. Group part of a regular expression so that a quantifier or | can be applied to it.
2 Backreferences refer to content captured by an earlier ():
   1. Backreference by group number
      Every pair of parentheses that does not use ?: is assigned a group number, starting from 1 and increasing from left to right. Use \i to refer to the content captured by the i-th ().
   2. Backreference by group name
      Placing ?P<name> directly after the left parenthesis (the group name goes inside the angle brackets) gives the group a name, and (?P=name) then refers to the content it captured. For example, (?P<word>\w+)\s+(?P=word) matches repeated words.
3 Notes:
   Backreferences cannot be used inside a character class [].
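
The repeated-word pattern mentioned above can be tried out as follows. This is a sketch, not the author's code: the sample sentence is invented, and \b anchors are added (an assumption beyond the pattern in the text) so that a match cannot start in the middle of a word such as "this is".

```python
import re

text = "this is is a test test of the the parser"

# Numbered backreference: \1 must repeat exactly what group 1 captured.
numbered = re.findall(r"\b(\w+)\s+\1\b", text)

# Named group plus named backreference, as in (?P<word>...) and (?P=word).
named_rx = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
named = [m.group("word") for m in named_rx.finditer(text)]

# (?:...) groups without capturing, so only "(c)" gets a group number.
m = re.search(r"(?:ab)+(c)", "xxababc")
whole_match = m.group(0)
captured = m.groups()

print(numbered, named)
print(whole_match, captured)
```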

1.4 Assertions and markers
Assertions do not match any text themselves; they only place a constraint on the position in the text where the assertion appears.
1 Commonly used assertions:
   1. \b matches at a word boundary; inside a character class [] it means backspace
   2. \B matches at a non-word boundary; like \b, it is affected by the ASCII flag
   3. \A matches at the start of the string
   4. ^ matches at the start of the string; with the MULTILINE flag it also matches after each newline
   5. \Z matches at the end of the string
   6. $ matches at the end of the string (or just before a trailing newline); with the MULTILINE flag it also matches before each newline
   7. (?=e) positive look-ahead
   8. (?!e) negative look-ahead
   9. (?<=e) positive look-behind
   10. (?<!e) negative look-behind
2 Explanation of look-ahead and look-behind
   Positive look-ahead: exp1(?=exp2) The content after exp1 must match exp2
   Negative look-ahead: exp1(?!exp2) The content after exp1 must not match exp2
   Positive look-behind: (?<=exp2)exp1 The content before exp1 must match exp2
   Negative look-behind: (?<!exp2)exp1 The content before exp1 must not match exp2
   For example, to find hello only when it is followed by world, the regular expression can be written as "(hello)\s+(?=world)". Applied to "hello wangxing" and "hello world", it matches only the hello in the latter.
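
The hello/world example and the four look-around forms can be exercised like this. It is a minimal sketch: the price strings and variable names are assumptions added for illustration, and only the first pattern comes from the text.

```python
import re

# Pattern from the text: "hello" counts only when "world" follows.
rx = re.compile(r"(hello)\s+(?=world)")
no_match = rx.search("hello wangxing")
m = rx.search("hello world")
hello_only = m.group(1)      # the captured word
consumed = m.group(0)        # includes the whitespace, but not "world"

# Negative look-ahead: "hello" that is NOT followed by "world".
neg_ahead = [x.start() for x in
             re.finditer(r"hello(?!\s+world)", "hello wangxing, hello world")]

# Positive look-behind: digits directly preceded by "$".
prices = re.findall(r"(?<=\$)\d+", "cost $30 or 40 EUR")

# Negative look-behind: digit runs preceded by neither "$" nor another digit.
not_prices = re.findall(r"(?<![\$\d])\d+", "cost $30 or 40 EUR")

print(no_match, hello_only, repr(consumed))
print(neg_ahead, prices, not_prices)
```

Because assertions consume no text, "world" is still available to whatever part of the pattern, or whatever later search, comes next.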

1.5 Conditional matching
   (?(id)yes_exp|no_exp): if the group identified by id (a group number or name) has matched something, the expression goes on to match yes_exp; otherwise it matches no_exp.
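
A common illustration of conditional matching is an e-mail address that may be wrapped in angle brackets: the closing > is required only if the opening < was matched. The sketch below uses a deliberately simplified address pattern and made-up sample strings; it is not a real e-mail validator.

```python
import re

# Group 1 is the optional "<". (?(1)>|$) means:
# if group 1 matched, require ">", otherwise require the end of the string.
rx = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",   # both brackets
    "user@host.com",     # no brackets
    "<user@host.com",    # opening bracket only
    "user@host.com>",    # closing bracket only
]

# fullmatch is used so that the whole candidate has to fit the pattern.
results = {c: bool(rx.fullmatch(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```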

1.6 Flags of regular expressions
1 There are two ways to set flags for a regular expression
   1. Pass the flags as an argument to the compile method. Multiple flags are combined with |, for example re.compile(r"#[\da-f]{6}\b", re.IGNORECASE|re.MULTILINE)
   2. Put (?flags) at the very front of the regular expression, for example (?ms)#[\da-z]{6}\b
2 Commonly used flags
   re.A or re.ASCII makes \b \B \s \S \w \W \d \D assume that the string is ASCII
   re.I or re.IGNORECASE makes the regular expression ignore case
   re.M or re.MULTILINE enables multi-line matching, so that ^ also matches after each newline and $ also matches before each newline
   re.S or re.DOTALL makes . match any character, including a newline
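   re.X or re.VERBOSE allows whitespace and comments inside the regular expression to make it more readable. To match whitespace in this mode, use \s or [ ], since whitespace is no longer interpreted literally by default. For example (parts of this pattern were lost from the text; the lines marked "assumed" are a reconstruction):
     re.compile(r"""
         <img\s+               # start of the tag (assumed)
         [^>]*?                # any attributes that are not src
         src=                  # start of the src attribute
         (?P<quote>["'])?      # optional opening quote (assumed)
         (?P<image>[^"'>]+)    # image file name (assumed)
         (?(quote)(?P=quote))  # closing quote, required only if one was opened
         """, re.VERBOSE|re.IGNORECASE)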

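The two ways of setting flags can be compared directly. This is a small sketch with an invented sample string; it uses the colour pattern from the text and adds a separate ^\w+ pattern (an assumption for the example) to show what MULTILINE changes.

```python
import re

text = "color: #FFaa00\nCOLOR: #12345g"

# Way 1: flags passed to compile(), combined with |.
by_argument = re.compile(r"#[\da-f]{6}\b", re.IGNORECASE | re.MULTILINE)

# Way 2: the same flags written inline at the front of the pattern.
inline = re.compile(r"(?im)#[\da-f]{6}\b")

colors_a = by_argument.findall(text)
colors_b = inline.findall(text)

# MULTILINE lets ^ match at the start of every line, not just the string.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, re.MULTILINE)

print(colors_a, colors_b)
print(first_words_default, first_words_multiline)
```

Inline flags are handy when the pattern is stored as plain text, for example in a configuration file, where no flags argument can be passed.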

2. Python regular expression module

2.1 Regular expressions have four main functions for processing strings

1. Match: check whether a string conforms to the grammar of the regular expression, usually returning true or false
2. Extract: use a regular expression to pull the text that meets the requirements out of a string
3. Replace: find the text in a string that matches the regular expression and replace it with another string
4. Split: split a string using a regular expression


2.2 Two ways to use regular expressions in Python's re module

1. Call re.compile(r, f) to create a regular expression object, then call the corresponding methods of that object. The advantage of this approach is that the object can be reused many times once it has been created.
2. For each method of the regular expression object, the re module has a corresponding module-level function. The difference is that its first argument is the regular expression string. This style suits regular expressions that are used only once.

2.3 Common methods of regular expression objects
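
As a quick orientation before the individual methods, the two calling styles can be put side by side. This is a sketch only: the log lines and the id=... pattern are invented for the example.

```python
import re

log_lines = ["id=17 ok", "no id here", "id=204 failed"]

# Style 1: compile once, then reuse the regular expression object.
id_rx = re.compile(r"id=(\d+)")
compiled_ids = []
for line in log_lines:
    m = id_rx.search(line)
    if m is not None:          # search returns None when nothing matches
        compiled_ids.append(m.group(1))

# Style 2: module-level function; the pattern string is the first argument.
one_off = re.search(r"id=(\d+)", log_lines[0]).group(1)

print(compiled_ids, one_off)
```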

1. rx.findall(s, start, end):
   Returns a list. If the regular expression has no groups, the list contains every match as a string.
   If the regular expression has groups, each element of the list is a tuple holding the content matched by each group; the text matched by the whole regular expression is not returned. (With exactly one group, each element is that group's string rather than a tuple.)
2. rx.finditer(s, start, end):
   Returns an iterator.
   Each iteration yields a match object. Call the match object's group() method to see the content matched by a given group; 0 stands for the content matched by the whole regular expression.
3. rx.search(s, start, end):
   Returns a match object, or None if there is no match.
   search stops at the first match and does not continue.
4. rx.match(s, start, end):
   Returns a match object if the regular expression matches at the beginning of the string, otherwise None.
5. rx.sub(x, s, m):
   Returns a string in which each match is replaced with x. If m is specified, at most m replacements are made. In x, \i or \g<id> (where id is a group name or number) refers to captured content.
   x may also be a function, both here and in the module function re.sub(r, x, s, m). The function is called with each match object, and its return value replaces the matched text.
6. rx.subn(x, s, m):
   Same as sub(), except that it returns a tuple of the result string and the number of replacements made.
7. rx.split(s, m):
   Returns a list obtained by splitting the string at each match of the regular expression. If m is specified, at most m splits are made.
   If the regular expression has groups, the content matched by the groups is placed in the list between the adjacent pieces, for example:
     rx = re.compile(r"(\d)[a-z]+(\d)")
     s = "ab12dk3klj8jk9jks5"
     result = rx.split(s)
   returns ['ab1', '2', '3', 'klj', '8', '9', 'jks5']
8. rx.flags: the flags set when the regular expression was compiled (an attribute, not a method)
9. rx.pattern: the string the regular expression was compiled from (an attribute, not a method)
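
The methods listed above can be tried on one small string. The following is a sketch under stated assumptions: the key=value pattern and the sample string are invented for illustration, and count is passed by keyword for clarity.

```python
import re

rx = re.compile(r"(\w+)=(\d+)")
s = "a=1, b=22, c=333"

# findall: with groups, each element is a tuple of the group contents.
pairs = rx.findall(s)
# Without groups, each element is the whole match.
whole = re.findall(r"\w+=\d+", s)

# finditer: yields match objects; group(0) is the whole match.
found = [(m.group(0), m.start()) for m in rx.finditer(s)]

# search stops at the first match; match succeeds only at the start position.
first = rx.search(s).group(0)
at_start = rx.match(" " + s)          # None: the string starts with a space
from_pos = rx.match(s, 5).group(0)    # matching begins at index 5

# sub: the replacement may use backreferences or be a function.
swapped = rx.sub(r"\2=\1", s)
keys_only_once = rx.sub(r"\g<1>", s, count=1)
doubled = rx.sub(lambda m: f"{m.group(1)}={int(m.group(2)) * 2}", s)

# subn: returns (new_string, number_of_replacements).
replaced, n = rx.subn("X", s, count=2)

# split: captured groups are kept in the resulting list.
parts = re.compile(r"(\d)[a-z]+(\d)").split("ab12dk3klj8jk9jks5")

# pattern and flags are plain attributes.
print(rx.pattern, rx.flags)
print(pairs, whole, found)
print(first, at_start, from_pos)
print(swapped, keys_only_once, doubled, replaced, n)
print(parts)
```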

2.4 Attributes and methods of matching objects

01. m.group(g, ...)
    Returns the content matched by the group with the given number or name. The default, or 0, means the content matched by the whole expression. If several groups are specified, a tuple is returned.
02. m.groupdict(default)
    Returns a dictionary. The keys are the names of all named groups and the values are the contents captured by those groups. If default is given, it is used as the value for groups that did not take part in the match.
03. m.groups(default)
    Returns a tuple containing the content of every capturing group, starting from group 1. If default is given, it is used as the value for groups that did not capture anything.
04. m.lastgroup
    The name of the last capturing group that matched. None if that group has no name or no group matched (rarely used). This is an attribute, not a method.
05. m.lastindex
    The number of the last capturing group that matched, or None if no group matched. This is an attribute, not a method.
06. m.start(g)
    The position in the string where the content matched by group g starts. Returns -1 if the group did not take part in the match.
07. m.end(g)
    The position in the string where the content matched by group g ends. Returns -1 if the group did not take part in the match.
08. m.span(g)
    Returns a two-element tuple containing the return values of m.start(g) and m.end(g)
09. m.re
    The regular expression object that produced this match object (attribute)
10. m.string
    The string that was passed to match or search (attribute)
11. m.pos
    The start position of the search: the beginning of the string, or the position given by start (rarely used; attribute)
12. m.endpos
    The end position of the search: the end of the string, or the position given by end (rarely used; attribute)
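
The match object members above can be inspected on a single match. This is an illustrative sketch: the date-like pattern, the group names and the sample string are assumptions made for the example, chosen so that one optional group (day) does not take part in the match.

```python
import re

rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(?P<day>\d{2}))?")
s = "released 2017-09 (draft)"
m = rx.search(s)

whole = m.group()                         # same as m.group(0)
year_month = m.group("year", "month")     # several groups give a tuple
all_groups = m.groups()                   # unmatched groups are None
with_default = m.groups("01")             # default fills unmatched groups
as_dict = m.groupdict(default="01")

day_start = m.start("day")                # -1: "day" did not take part
whole_span = m.span()                     # (start, end) of the whole match

# lastindex / lastgroup describe the last capturing group that matched.
last = (m.lastindex, m.lastgroup)

# re, string, pos and endpos are attributes, not methods.
same_rx = m.re is rx
same_string = m.string is s
bounds = (m.pos, m.endpos)

print(whole, year_month, all_groups, with_default)
print(as_dict, day_start, whole_span, last, same_rx, same_string, bounds)
```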

2.5 Summary


1. Matching: Python has no method that returns true or false directly, but you can test whether the return value of match or search is None.
2. Searching: for a single search, use the match object returned by search or match. For repeated searches, iterate over the iterator returned by finditer.
3. Replacing: use the sub or subn method of the regular expression object, or the module functions re.sub or re.subn. In both cases the replacement can be a string or a function that generates the replacement text.
4. Splitting: use the split method of the regular expression object. Note that if the regular expression contains groups, the content captured by the groups is also placed in the returned list.
