Strings are the most commonly used data structure in programming, and the need to operate on them comes up almost everywhere. For example, to determine whether a string is a legal email address, you could extract the substrings before and after @ and check separately that one is a word and the other is a domain name, but this is tedious and the code is hard to reuse. Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered to "match"; otherwise, the string is illegal.
So the way we judge whether a string is a legal Email is:
Create a regular expression that matches Email;
Use this regular expression to match the user's input to determine whether it is legal.
Because regular expressions are also represented by strings, we must first understand how to use characters to describe characters.
In regular expressions, a character given directly is matched exactly. Use \d to match a digit and \w to match a letter, digit or underscore, so:
'00\d' can match '007', but cannot match '00A';
'\d\d\d' can match '010';
'\w\w\d' can match 'py3';
. can match any single character (by default, any character except a newline), so:
'py.' can match 'pyc', 'pyo', 'py!', etc.
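The matches listed above can be checked with a short script. This is only a minimal sketch: it assumes Python's built-in re module (introduced later in this article) and uses re.fullmatch so that the whole string has to fit the pattern.

```python
import re

# (pattern, candidate string, expected result), taken from the examples above
cases = [
    (r'00\d', '007', True),
    (r'00\d', '00A', False),
    (r'\d\d\d', '010', True),
    (r'\w\w\d', 'py3', True),
    (r'py.', 'pyc', True),
    (r'py.', 'py!', True),
]

for pattern, text, expected in cases:
    # fullmatch returns a Match object on success and None otherwise
    matched = re.fullmatch(pattern, text) is not None
    print(pattern, repr(text), matched, 'as expected' if matched == expected else 'UNEXPECTED')
```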
To match a variable number of characters, use * for any number of characters (including 0), + for at least one character, ? for 0 or 1 characters, {n} for exactly n characters, and {n,m} for n to m characters.
Let's look at a more complex example: \d{3}\s+\d{3,8}.
Let’s interpret it from left to right:
\d{3} matches 3 digits, such as '010';
\s matches a space (as well as a tab or other whitespace character), so \s+ means at least one whitespace character, such as ' ', '  ', etc.;
\d{3,8} matches 3 to 8 digits, such as '1234567'.
Taken together, the regular expression above matches a phone number whose area code is separated from the rest by one or more spaces.
What if you want to match a number like '010-12345'? Outside of [], '-' has no special meaning, so it can be written as is; escaping it with '\' is also allowed and harmless, and this article writes it that way: \d{3}\-\d{3,8}.
However, '010 - 12345' still cannot be matched because of spaces. So we need more complex matching methods.
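The phone number patterns discussed above can be tried out as follows. The first two patterns come from the text; the third, which tolerates whitespace around the hyphen, is only one possible way to handle '010 - 12345' and is an assumption made for illustration, not the article's own solution.

```python
import re

spaces = re.compile(r'\d{3}\s+\d{3,8}')
hyphen = re.compile(r'\d{3}\-\d{3,8}')
# Assumption: one possible pattern that allows optional whitespace around '-'
loose = re.compile(r'\d{3}\s*\-\s*\d{3,8}')

print(bool(spaces.fullmatch('010 12345')))    # True
print(bool(spaces.fullmatch('010   12345')))  # True
print(bool(hyphen.fullmatch('010-12345')))    # True
print(bool(hyphen.fullmatch('010 - 12345')))  # False
print(bool(loose.fullmatch('010 - 12345')))   # True
```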
Advanced
To match more precisely, you can use [] to represent a range, for example:
[0-9a-zA-Z\_] can match a digit, letter or underscore;
[0-9a-zA-Z\_]+ can match a string consisting of at least one digit, letter or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
[a-zA-Z\_][0-9a-zA-Z\_]* can match a string that starts with a letter or underscore, followed by any number of digits, letters or underscores, which is a legal variable name in Python;
[a-zA-Z\_][0-9a-zA-Z\_]{0,19} more precisely limits the length of the variable name to 1-20 characters (1 leading character followed by up to 19 more).
A|B can match A or B, so (P|p)ython can match 'Python' or 'python'.
^ means the beginning of the line, ^\d means it must start with a number.
$ indicates the end of the line, \d$ indicates that it must end with a number.
You may have noticed that py can also match 'python', but writing ^py$ turns it into a whole-line match, so it can only match 'py'.
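The character classes, alternation and anchors above can be combined into a small check. This is a sketch under stated assumptions: the helper name looks_like_name is hypothetical, and it uses Python's re module, which the next section introduces.

```python
import re

# Pattern from the text: a letter or underscore first, then up to 19 more
# letters, digits or underscores; ^ and $ force the whole string to match.
NAME = re.compile(r'^[a-zA-Z_][0-9a-zA-Z_]{0,19}$')

def looks_like_name(text):
    """Hypothetical helper: True if text fits the variable-name pattern."""
    return NAME.match(text) is not None

for candidate in ['a100', '_tmp', 'Py3000', '3abc', 'a' * 21, 'a-b']:
    print(candidate, looks_like_name(candidate))

# A|B alternation inside a group, anchored at both ends
print(bool(re.match(r'^(P|p)ython$', 'python')))   # True
print(bool(re.match(r'^(P|p)ython$', 'Jython')))   # False
```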
re module
With the preparatory knowledge, we can use regular expressions in Python. Python provides the re module, which contains all regular expression functions. Since Python's string itself is also escaped with \, special attention should be paid to:
s = 'ABC\\-001' # a Python string
# The corresponding regular expression string becomes:
# 'ABC\-001'
Therefore we strongly recommend using Python's r prefix, so you don't have to worry about escaping:
s = r'ABC\-001' # a Python string
# The corresponding regular expression string is unchanged:
# 'ABC\-001'
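To see that the two spellings really describe the same regular expression, they can be compared directly. The following is a small illustrative sketch; the variable names plain and raw are made up for this example.

```python
import re

plain = 'ABC\\-001'   # ordinary string: the backslash itself must be escaped
raw = r'ABC\-001'     # raw string: backslashes are kept as written

print(plain == raw)   # True
print(len(raw))       # 8
print(raw)            # ABC\-001

# Both spellings give the same pattern, which matches 'ABC-001'
print(bool(re.match(raw, 'ABC-001')))  # True
```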
Let's first see how to determine whether the regular expression matches:
>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>
The match() method checks whether the pattern matches at the start of the string. On success it returns a Match object; otherwise it returns None. The usual way to test this is:
test = 'The string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
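The same test can be wrapped in a small function that turns the Match-or-None result into a plain boolean. This is a minimal sketch; is_phone_number is a hypothetical helper name, and the pattern is the phone number pattern used earlier in this article.

```python
import re

def is_phone_number(text):
    """Hypothetical helper: True if text looks like '010-12345'."""
    # re.match returns a Match object on success and None on failure
    return re.match(r'^\d{3}\-\d{3,8}$', text) is not None

for text in ['010-12345', '010 12345', '01-12345']:
    print(text, 'ok' if is_phone_number(text) else 'failed')
```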
Splitting strings
Using regular expressions to split strings is more flexible than using fixed characters. Please see the normal splitting code:
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
Consecutive spaces are not handled: they produce empty strings. Try a regular expression instead:
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']
The string is split correctly no matter how many spaces there are. Add , to the separators and try:
>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']
Add ; as well and try:
>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']
If users enter a set of tags, remember next time to use a regular expression to turn the irregular input into a proper list.
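The tag scenario just mentioned could look roughly like this. It is a sketch rather than a definitive implementation: parse_tags is a hypothetical helper, the sample input is made up, and the filtering step is added because re.split yields empty strings when the input starts or ends with a separator.

```python
import re

def parse_tags(raw):
    """Hypothetical helper: split user input into a clean list of tags."""
    # Split on runs of whitespace, commas and semicolons, then drop the
    # empty strings produced by leading or trailing separators.
    return [tag for tag in re.split(r'[\s,;]+', raw) if tag]

print(parse_tags(' python, regex;;  tutorial ,re '))
# ['python', 'regex', 'tutorial', 're']
```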
Group
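Besides simply checking whether a string matches, regular expressions can also extract substrings. The parts enclosed in () are the groups to extract. For example:
^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:
>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'
If the regular expression defines groups, you can extract the substrings with the group() method of the Match object.
Note that group(0) is always the entire matched string, while group(1), group(2), ... are the 1st, 2nd, ... groups.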
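In a program, the None case has to be handled before calling group(). The following sketch shows one way to do that; split_phone is a hypothetical helper name introduced only for this example.

```python
import re

def split_phone(text):
    """Hypothetical helper: return (area_code, local_number) or None."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        # No match: there is no Match object to call group() on
        return None
    return m.group(1), m.group(2)

print(split_phone('010-12345'))  # ('010', '12345')
print(split_phone('010 12345'))  # None
```

Calling m.groups() instead returns all numbered groups as one tuple, which gives the same result as the explicit group(1), group(2) pair here.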
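Extracting substrings is very useful. Let's look at a more brutal example:
>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')
This regular expression directly recognizes a legal time. Sometimes, however, a regular expression cannot perform complete validation, for example when recognizing dates:
'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'
Illegal dates such as '2-30' and '4-31' are still not rejected by the regular expression, or rather, writing one that rejects them is very hard. In such cases the program has to help with the check.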
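One way the program can help is to let the regular expression check the shape and then verify the day against the month in code. This is a sketch under stated assumptions: is_valid_date is a hypothetical helper, it relies on calendar.monthrange from the standard library, and because the article's dates carry no year, the year parameter (needed to know the length of February) is an assumption with an arbitrary default.

```python
import calendar
import re

# Date pattern from the text: month-day, each with an optional leading zero
DATE = re.compile(r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$')

def is_valid_date(text, year=2023):
    """Hypothetical helper: regex for the shape, code for the calendar."""
    m = DATE.match(text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    if month == 0 or day == 0:
        # The lone [0-9] alternatives in the pattern also accept '0'
        return False
    # monthrange returns (weekday of the 1st, number of days in the month)
    return day <= calendar.monthrange(year, month)[1]

for text in ['2-30', '4-31', '4-30', '12-31', '13-01']:
    print(text, is_valid_date(text))
```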
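Greedy matching
Finally, it is worth pointing out that regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, to match the zeros at the end of a number:
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
Because \d+ is greedy, it consumes all the trailing zeros, so 0* can only match the empty string.
To capture the trailing zeros, \d+ must match non-greedily (as few characters as possible). Adding a ? makes \d+ non-greedy:
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')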
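The difference is easy to see when the greedy and non-greedy forms are run side by side. The first part of this sketch uses a made-up sample string that is not from the article; the second part repeats the article's own example.

```python
import re

text = '<b>bold</b>'  # made-up sample string for illustration

greedy = re.match(r'<.+>', text).group(0)   # .+ takes as much as it can
lazy = re.match(r'<.+?>', text).group(0)    # .+? stops at the first '>'
print(greedy)  # <b>bold</b>
print(lazy)    # <b>

# The article's example, greedy and non-greedy
print(re.match(r'^(\d+)(0*)$', '102300').groups())   # ('102300', '')
print(re.match(r'^(\d+?)(0*)$', '102300').groups())  # ('1023', '00')
```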
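Compilation
When we use a regular expression in Python, the re module does two things internally:
Compile the regular expression, raising an error if the regular expression string itself is invalid;
Use the compiled regular expression to match the string.
If a regular expression is going to be used thousands of times, we can precompile it for efficiency, so that later uses skip the compilation step and match directly:
>>> import re
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')
Compiling produces a Regular Expression object. Because the object already contains the regular expression, you do not pass the pattern string when calling its methods.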
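A typical use is to compile once and then reuse the object in a loop. The sketch below assumes a made-up list of input strings; it also shows the compile-time error for an invalid pattern, which the re module raises as re.error.

```python
import re

# Compile once, outside the loop
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

numbers = ['010-12345', '010-8086', '010 12345', '0755-123']  # made-up input
parsed = []
for number in numbers:
    m = re_telephone.match(number)  # no pattern string needed here
    if m:
        parsed.append(m.groups())
print(parsed)  # [('010', '12345'), ('010', '8086')]

# An invalid pattern fails at compile time, before any matching happens
try:
    re.compile(r'^(\d{3}-\d{3,8}$')  # unbalanced parenthesis
except re.error as exc:
    print('invalid pattern:', exc)
```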