Strings are among the most commonly used data types in programming, and the need to operate on them comes up almost everywhere. For example, to decide whether a string is a valid email address, you could extract the substrings before and after the @ and check that each is a valid name and domain name, but that approach is tedious and the code is hard to reuse. Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is said to "match", and any other string is rejected.
So to decide whether a string is a valid email address, we:
Create a regular expression that matches email addresses;
Use that regular expression to match the user's input and decide whether it is valid.
Because regular expressions are also represented by strings, we must first understand how to use characters to describe characters.
In a regular expression, a character given literally matches exactly that character. \d matches a digit, and \w matches a letter, digit, or underscore, so:
'00\d' can match '007', but cannot match '00A';
'\d\d\d' can match '010';
'\w\w\d' can match 'py3';
. can match any single character (except a newline, by default), so:
'py.' can match 'pyc', 'pyo', 'py!', etc.
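As a quick check of the rules above, here is a minimal sketch using Python's re module (introduced later in this article). The helper name matches is an assumption made for illustration; it relies on re.fullmatch, which succeeds only when the whole string fits the pattern.

```python
import re

def matches(pattern, text):
    """Return True when the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'00\d', '007'))     # True
print(matches(r'00\d', '00A'))     # False: 'A' is not a digit
print(matches(r'\d\d\d', '010'))   # True
print(matches(r'\w\w\d', 'py3'))   # True
print(matches(r'py.', 'py!'))      # True
print(matches(r'py.', 'py'))       # False: . needs exactly one character
```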
To match a variable number of characters, use * for any number of characters (including 0), + for at least one character, ? for 0 or 1 characters, {n} for exactly n characters, and {n,m} for n to m characters.
Let's look at a more complex example: \d{3}\s+\d{3,8}.
Let’s interpret it from left to right:
\d{3} matches exactly 3 digits, such as '010';
\s matches a single whitespace character (a space, a tab, or other whitespace), so \s+ means at least one whitespace character, such as ' ' or '  ';
\d{3,8} matches 3 to 8 digits, such as '1234567'.
Taken together, this regular expression matches phone numbers whose area code is separated from the number by one or more whitespace characters.
What if you want to match a number like '010-12345'? Outside a character class '-' has no special meaning, so it can be written as is, but escaping it with '\' is also accepted. This article uses the escaped form, so the regular expression becomes \d{3}\-\d{3,8}.
However, '010 - 12345' still cannot be matched because of the spaces, so we need more flexible ways of matching.
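The behaviour of the two patterns above can be sketched as follows. The third pattern, relaxed, is not from the original text: it is one possible way (an assumption) to tolerate optional whitespace around the hyphen.

```python
import re

# Pattern from the text: 3 digits, whitespace, then 3 to 8 digits.
spaced = re.compile(r'\d{3}\s+\d{3,8}')
# Hyphenated variant from the text.
hyphenated = re.compile(r'\d{3}\-\d{3,8}')
# Assumption: a relaxed variant allowing optional whitespace
# on either side of the hyphen.
relaxed = re.compile(r'\d{3}\s*\-\s*\d{3,8}')

for text in ['010 12345', '010-12345', '010 - 12345']:
    print(repr(text),
          'spaced:', bool(spaced.fullmatch(text)),
          'hyphenated:', bool(hyphenated.fullmatch(text)),
          'relaxed:', bool(relaxed.fullmatch(text)))
```

Only the relaxed pattern accepts '010 - 12345'; in exchange it no longer accepts the space-only form '010 12345', because it still requires a hyphen.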
Advanced
To match more precisely, you can use [] to denote a range, for example:
[0-9a-zA-Z\_] can match a single digit, letter, or underscore;
[0-9a-zA-Z\_]+ can match a string consisting of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
[a-zA-Z\_][0-9a-zA-Z\_]* can match a string that starts with a letter or an underscore, followed by any number of digits, letters, or underscores, which is a legal variable name in Python;
[a-zA-Z\_][0-9a-zA-Z\_]{0,19} more precisely limits the length of the variable name to 1-20 characters (the first character plus up to 19 more).
A|B can match A or B, so (P|p)ython can match 'Python' or 'python'.
^ means the beginning of the line, so ^\d means the line must start with a digit.
$ means the end of the line, so \d$ means the line must end with a digit.
You may have noticed that the pattern py also matches the start of 'python', but ^py$ has to match the whole line, so it matches only 'py'.
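The character-class, alternation, and anchor rules above can be tried out with a short sketch. The names IDENTIFIER and is_identifier are assumptions for illustration, and the rule covers ASCII names only, as in the text.

```python
import re

# ASCII-only variable-name rule from the text, limited to 1-20 characters.
IDENTIFIER = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_identifier(name):
    """Return True when the whole of `name` fits the rule."""
    return IDENTIFIER.fullmatch(name) is not None

print(is_identifier('Py3000'))    # True
print(is_identifier('_tmp'))      # True
print(is_identifier('3py'))       # False: starts with a digit
print(is_identifier('a' * 21))    # False: longer than 20 characters

# Alternation and anchors, using re.search to look anywhere in the string.
print(bool(re.search(r'(P|p)ython', 'I like Python')))  # True
print(bool(re.search(r'py', 'python')))                 # True
print(bool(re.search(r'^py$', 'python')))               # False
print(bool(re.search(r'^py$', 'py')))                   # True
```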
re module
With this background, we can use regular expressions in Python. Python provides the re module, which contains all of the regular expression functions. Because Python string literals also use \ as the escape character, take special care:
s = 'ABC\\-001' # a Python string
# the corresponding regular expression string becomes:
# 'ABC\-001'
Therefore we strongly recommend using Python's r prefix, so that you do not have to think about escaping:
s = r'ABC\-001' # a Python string
# the corresponding regular expression string is unchanged:
# 'ABC\-001'
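A small sketch confirming that the two spellings above produce the same string; the variable names escaped and raw are chosen here for illustration.

```python
import re

escaped = 'ABC\\-001'   # ordinary literal: the backslash must be doubled
raw = r'ABC\-001'       # raw literal: the backslash is kept as typed

print(escaped == raw)                    # True
print(len(raw))                          # 8
print(bool(re.match(raw, 'ABC-001')))    # True
```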
Let's first see how to determine whether a regular expression matches:
>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>
The match() method tries to match from the start of the string. If the match succeeds, it returns a Match object; otherwise it returns None. The usual way to test is:
test = 'The string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
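The test above can be wrapped in a reusable function. This is a minimal sketch; the names PHONE and is_valid_phone are assumptions, and the pattern is the phone-number pattern used earlier.

```python
import re

# Hypothetical names; the pattern is the one from the example above.
PHONE = re.compile(r'^\d{3}\-\d{3,8}$')

def is_valid_phone(text):
    # match() returns a Match object or None, so compare against None.
    return PHONE.match(text) is not None

print(is_valid_phone('010-12345'))    # True
print(is_valid_phone('010 12345'))    # False
print(is_valid_phone('010-12345\n'))  # True: $ also matches before a final newline
print(re.fullmatch(r'\d{3}\-\d{3,8}', '010-12345\n'))  # None
```

If a trailing newline should be rejected, re.fullmatch without the ^ and $ anchors is one alternative, as the last line shows.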
String splitting
Splitting strings with regular expressions is more flexible than splitting on fixed characters. Here is the ordinary splitting code:
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
Consecutive spaces are not handled. Try a regular expression instead:
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']
The string is split correctly no matter how many spaces there are. Add , and try:
>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']
Add ; as well and try:
>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']
If users enter a set of tags, remember next time to use a regular expression to turn the irregular input into a clean list.
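The tag-cleaning idea can be sketched as a small function. The name parse_tags is hypothetical, and dropping empty strings is an added step for input that begins or ends with a separator.

```python
import re

def parse_tags(raw):
    """Split user-entered tags on whitespace, commas, and semicolons."""
    parts = re.split(r'[\s,;]+', raw)
    # A leading or trailing separator leaves an empty string; drop those.
    return [p for p in parts if p]

print(parse_tags('a,b;; c  d'))                 # ['a', 'b', 'c', 'd']
print(parse_tags(' python, regex;;  re  '))     # ['python', 'regex', 're']
```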
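Groups
Besides simply determining whether there is a match, regular expressions can also extract substrings. Parentheses () mark a group to be extracted. For example:
^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:
>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'
If the regular expression defines groups, you can extract the substrings with the group() method of the Match object.
Note that group(0) is always the entire matched text (here the whole original string, because the pattern is anchored at both ends), while group(1), group(2), ... are the 1st, 2nd, ... substrings.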
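Because match() returns None on failure, calling group() directly on the result can raise an AttributeError. A minimal sketch of a guarded extraction follows; the function name split_phone is an assumption.

```python
import re

def split_phone(text):
    """Return (area_code, local_number), or None when text does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None          # no Match object, so there are no groups to read
    return m.group(1), m.group(2)

print(split_phone('010-12345'))   # ('010', '12345')
print(split_phone('010 12345'))   # None
```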
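Extracting substrings is very useful. Here is a more brutal example:
>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')
This regular expression recognizes valid times directly. Sometimes, however, a regular expression cannot do the whole validation, for example when recognizing dates:
'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'
Invalid dates such as '2-30' and '4-31' are still accepted by this regular expression, and writing one that rejects them would be very difficult. In such cases the program has to help with the check.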
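One way to let the program finish the check is sketched below: the regular expression extracts the month and day, and the standard calendar module decides whether the day exists. The function name is_valid_month_day and the default year (2024, a leap year) are assumptions, since the pattern itself carries no year.

```python
import calendar
import re

# Date pattern from the text: month-day, no year.
DATE = re.compile(r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$')

def is_valid_month_day(text, year=2024):
    """Check the format with the regex, then the day count in code.

    `year` is an assumed parameter; it only affects February.
    """
    m = DATE.match(text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    # The single-digit alternatives also accept 0, so reject it here.
    if month == 0 or day == 0:
        return False
    # monthrange returns (weekday of the 1st, number of days in the month).
    return day <= calendar.monthrange(year, month)[1]

print(is_valid_month_day('4-30'))   # True
print(is_valid_month_day('4-31'))   # False
print(is_valid_month_day('2-30'))   # False
```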
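Greedy matching
Finally, note that regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, to capture the zeros at the end of a number:
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
Because \d+ is greedy, it consumes all the trailing zeros, leaving 0* to match only the empty string.
To capture the trailing zeros, \d+ must match non-greedily (that is, as few characters as possible). Appending ? makes \d+ non-greedy:
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')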
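The same contrast shows up with other quantifiers. The sketch below wraps the pattern above in a function (the name split_trailing_zeros is an assumption) and adds a second, illustrative case with .+ versus .+? on a tag-like string that is not from the original text.

```python
import re

def split_trailing_zeros(digits):
    """Return (prefix, trailing_zeros) using the non-greedy pattern."""
    return re.match(r'^(\d+?)(0*)$', digits).groups()

print(split_trailing_zeros('102300'))   # ('1023', '00')
print(split_trailing_zeros('1000'))     # ('1', '000')
print(split_trailing_zeros('123'))      # ('123', '')

# Illustrative text: greedy .+ runs to the last '>', lazy .+? stops at the first.
text = '<b>bold</b> and <i>italic</i>'
print(re.findall(r'<.+>', text))    # ['<b>bold</b> and <i>italic</i>']
print(re.findall(r'<.+?>', text))   # ['<b>', '</b>', '<i>', '</i>']
```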
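Compilation
When we use a regular expression in Python, the re module does two things internally:
Compiles the regular expression, raising an error if the pattern string itself is invalid;
Uses the compiled regular expression to match the string.
If a regular expression will be reused thousands of times, we can precompile it for efficiency, so that later uses skip the compilation step and match directly:
>>> import re
# compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')
Compiling produces a regular expression object. Because the object already holds the regular expression, you do not pass the pattern string when calling its methods.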
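A short sketch of both points above: reusing one compiled object across many inputs, and the error raised for an invalid pattern. The function name parse_all and the sample inputs are assumptions for illustration.

```python
import re

# Compile once, reuse for every input.
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

def parse_all(numbers):
    """Return the groups for each input, or None where it does not match."""
    results = []
    for number in numbers:
        m = re_telephone.match(number)
        results.append(m.groups() if m else None)
    return results

print(re_telephone.pattern)   # the original pattern string is kept on the object
print(parse_all(['010-12345', '010-8086', 'bad']))

# An invalid pattern (unclosed parenthesis) fails at compile time.
try:
    re.compile(r'(\d{3}')
except re.error as exc:
    print('invalid pattern:', exc)
```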