Python regular expressions
Regular expressions are a powerful, standard way to search, replace, and parse complex strings. In Python, all regular expression functionality lives in the re module.
1 Commonly used matches
^ matches the beginning of the string
$ matches the end of the string
\b matches a word boundary
\d matches any digit
\D matches any non-digit character
x? matches an optional x (x zero or one time)
x* matches x zero or more times
x+ matches x one or more times
x{n,m} matches x at least n times and at most m times
(a|b|c) matches exactly one of a, b, or c
(x) generally defines a remembered (capturing) group; you can retrieve its value with the groups() method of the object returned by the re.search function
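The patterns listed above can be tried out with a short script. This is only a minimal sketch: the helper name matches and the sample strings are made up for illustration and are not part of the re module.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches(r"^100", "100 BROAD ROAD"))    # True: the string starts with 100
print(matches(r"ROAD$", "100 BROAD ROAD"))   # True: the string ends with ROAD
print(matches(r"\bROAD\b", "100 BROAD"))     # False: ROAD is only part of BROAD
print(matches(r"\d", "APT"))                 # False: no digit in the text
print(matches(r"\D", "123"))                 # False: every character is a digit
print(matches(r"^colou?r$", "color"))        # True: the u is optional
print(matches(r"^ab*$", "a"))                # True: b may appear zero times
print(matches(r"^ab+$", "a"))                # False: at least one b is required
print(matches(r"^a{2,3}$", "aaaa"))          # False: more than 3 a's
print(matches(r"^(a|b|c)$", "b"))            # True: b is one of the alternatives

# Parentheses remember what they matched; groups() returns the values.
area, trunk = re.search(r"(\d{3})-(\d{3})", "call 800-555").groups()
print(area, trunk)                           # 800 555
```

The patterns are written as raw strings (with the r prefix) so that the backslashes in \b, \d and \D reach the re module unchanged.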
2 General purpose
    #-------------------------------------------------------------------------------
    # coding: utf-8
    # Purpose:   regular expressions
    #
    # Author:    zdk
    #
    # Created:   26/02/2013
    # Copyright: (c) zdk 2013
    #-------------------------------------------------------------------------------
    import re

    if __name__ == '__main__':
        addr = "100 BROAD ROAD APT.3"
        print(re.sub("ROAD", "RD", addr))        # 100 BRD RD APT.3
        print(re.sub(r"\bROAD\b", "RD", addr))   # 100 BROAD RD APT.3
        pattern = ".*B.*(ROAD)?"
        print(re.search(pattern, "ROAD"))        # None
        print(re.search(pattern, "B"))           # a match object, e.g. <_sre.SRE_Match object at 0x0230F020>
(1) re.sub("ROAD","RD",addr) uses the re.sub function to search the string addr and replace every substring that matches the expression "ROAD" with "RD". Because the pattern is not anchored to a word, the ROAD inside BROAD is replaced as well.
(2) In re.sub(r"\bROAD\b","RD",addr), "\b" means "word boundary". In an ordinary Python string the character "\" must itself be escaped, which quickly becomes cumbersome, so Python lets you prefix the string with r to mark it as a raw string in which backslashes are not treated as escape characters.
(3) re.search(pattern, "ROAD"): the re module has a search function that takes two parameters, a regular expression and a string. It returns a match object that provides several methods for describing the match, or None if no match is found.
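Points (2) and (3) can be checked directly. The following is a minimal sketch that reuses the addr string from the example above; the variable names plain, raw, match and missing are chosen here for illustration.

```python
import re

addr = "100 BROAD ROAD APT.3"

# Without the r prefix, "\b" is a single backspace character, not a word boundary.
print(len("\b"), len(r"\b"))             # 1 2

plain = re.sub("\bROAD\b", "RD", addr)   # looks for backspace + ROAD + backspace
raw = re.sub(r"\bROAD\b", "RD", addr)    # looks for ROAD as a whole word
print(plain)                             # 100 BROAD ROAD APT.3 (nothing replaced)
print(raw)                               # 100 BROAD RD APT.3

# re.search returns a match object on success and None on failure.
match = re.search(r"\bROAD\b", addr)
if match is not None:
    print(match.group())                 # ROAD
    print(match.span())                  # (10, 14)

missing = re.search(r"\bLANE\b", addr)
print(missing)                           # None
```

Testing the result against None before calling its methods avoids an AttributeError when the pattern does not match.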
3 Loose regular expressions
The expressions above are all "compact" regular expressions, which are hard to read. Even if the meaning of an expression is clear now, there is no guarantee that you will remember it a few months later. Python therefore supports so-called loose (verbose) regular expressions, which let you document a pattern inline. They differ from ordinary expressions in two ways:
Whitespace is ignored. Spaces, tabs, and carriage returns do not match themselves (to match a space in a loose regular expression, you must escape it by putting a backslash before it).
Comments are ignored. As in normal Python code, a comment starts with the # symbol and runs to the end of the line.
    import re

    # A loose regular expression with inline comments
    pattern = """
        ^                   # beginning of string
        M{0,3}              # 0 to 3 M
        (CM|CD|D?C{0,3})    # CM, or CD, or an optional D followed by 0 to 3 C
        $                   # end of string
        """
    print(re.search(pattern, "MCM", re.VERBOSE))   # <_sre.SRE_Match object at 0x021BAF60>
    print(re.search(pattern, "M99", re.VERBOSE))   # None
(1) When using a loose regular expression, the most important point is that you must pass an extra argument, re.VERBOSE. This constant of the re module indicates that the pattern is a loose regular expression: whitespace and comments inside the pattern are ignored, which makes the pattern much easier to read.
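As a rough check of the two rules, the sketch below compares a compact pattern with its loose equivalent and shows what happens to an unescaped space. The names compact, verbose, escaped and unescaped, and the sample strings, are chosen here for illustration.

```python
import re

compact = r"^M{0,3}(CM|CD|D?C{0,3})$"
verbose = r"""
    ^                   # beginning of string
    M{0,3}              # 0 to 3 M
    (CM|CD|D?C{0,3})    # CM, or CD, or an optional D followed by 0 to 3 C
    $                   # end of string
"""

# Both forms should accept and reject exactly the same strings.
for text in ["MCM", "MMDCC", "M99"]:
    in_compact = re.search(compact, text) is not None
    in_verbose = re.search(verbose, text, re.VERBOSE) is not None
    print(text, in_compact, in_verbose)
# MCM True True
# MMDCC True True
# M99 False False

# A space only matches itself in a loose expression when it is escaped.
escaped = re.compile(r"""
    APT     # literal text
    \ +     # one or more spaces (escaped, so not ignored)
    \d+     # one or more digits
""", re.VERBOSE)
unescaped = re.compile(r"APT +\d+", re.VERBOSE)   # the space is dropped: APT+\d+

print(escaped.search("APT 3") is not None)        # True
print(unescaped.search("APT 3") is not None)      # False
print(unescaped.search("APT3") is not None)       # True
```

Writing the loose pattern as a raw string is not required here, but it keeps backslashes such as \d and the escaped space intact.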
4 Case study: parsing phone numbers
The pattern must match all of the following phone numbers:
800-555-1212
800 555 1212
800.555.1212
(800)555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext.1234
work 1-(800) 555,1212 #1234
There are many formats, but in every case we need to know that the area code is 800, the trunk number is 555, and the remaining digits of the phone number are 1212. For numbers with an extension, we also need to know that the extension is 1234.
    import re

    phonePattern = re.compile(r'''
                    # don't match the beginning of the string
        (\d{3})     # 3 digits
        \D*         # any number of non-digits
        (\d{3})     # 3 digits
        \D*         # any number of non-digits
        (\d{4})     # 4 digits
        \D*         # any number of non-digits
        (\d*)       # any number of digits
        ''', re.VERBOSE)
    print(phonePattern.search('work 1-(800)555.1212 #1234').groups())    # ('800', '555', '1212', '1234')
    print(phonePattern.search('work 1-(800)555.1212 # 1234').groups())   # ('800', '555', '1212', '1234')
(1) The loose regular expression above first matches the 3-digit area code (not necessarily at the start of the string, which is why ^ is not used), then any number of non-digit characters, then the 3-digit trunk number, then any number of non-digit characters, then the 4 remaining digits, then any number of non-digit characters, and finally an extension of any number of digits. The groups method then returns the captured parts of the phone number as a tuple.
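To see how this expression copes with each format, the sketch below wraps it in a small function and runs it over sample numbers in the formats listed at the start of this section. The names phone_pattern, parse_phone and samples are hypothetical and chosen only for this illustration.

```python
import re

phone_pattern = re.compile(r"""
    (\d{3})     # area code: 3 digits
    \D*         # any number of non-digits
    (\d{3})     # trunk number: 3 digits
    \D*         # any number of non-digits
    (\d{4})     # rest of the number: 4 digits
    \D*         # any number of non-digits
    (\d*)       # extension: any number of digits, possibly none
""", re.VERBOSE)

def parse_phone(text):
    """Return (area, trunk, rest, extension), or None if nothing matches."""
    match = phone_pattern.search(text)
    return match.groups() if match else None

samples = [
    "800-555-1212",
    "800 555 1212",
    "800.555.1212",
    "(800)555-1212",
    "1-800-555-1212",
    "800-555-1212-1234",
    "800-555-1212x1234",
    "800-555-1212 ext.1234",
    "work 1-(800) 555,1212 #1234",
]

for sample in samples:
    print(sample, "->", parse_phone(sample))

print(parse_phone("800-555-1212"))        # ('800', '555', '1212', '')
print(parse_phone("800-555-1212x1234"))   # ('800', '555', '1212', '1234')
print(parse_phone("no number here"))      # None
```

When a number has no extension, the last group is an empty string rather than None, because \d* is allowed to match zero digits. Returning None for unmatched input keeps the caller from calling groups() on a failed search.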