Python Chinese regular expression notes
From a string-processing point of view, Chinese text is not as tidy and uniform as English; that is an unavoidable reality. This article combines information found online with personal experience and uses Python as the example language to give a brief summary. Additions and corrections are welcome.
A little experience
You can use the repr() function to view the raw representation of a string. This is helpful when writing regular expressions.
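As a minimal sketch of this tip, assuming Python 3 rather than the Python 2 used in this article: repr() on UTF-8 encoded bytes shows the raw bytes, and the built-in ascii() shows the code points of a str as \uXXXX escapes. The variable names are only illustrative.

```python
# Assumes Python 3; the article's own program is Python 2.
text = "正则"                  # a str holding two Chinese characters
raw = text.encode("utf-8")     # the same text as UTF-8 bytes

byte_view = repr(raw)          # each byte shown as a \xNN escape
codepoint_view = ascii(text)   # each character shown as a \uXXXX escape

print(byte_view)        # b'\xe6\xad\xa3\xe5\x88\x99'
print(codepoint_view)   # '\u6b63\u5219'
```

The two views correspond to the "utf8" and "unicode" forms discussed in the tips that follow.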
Python's re module has two similar functions, re.match() and re.search(). Both match in exactly the same way; only the starting point differs. match() tries to match only at the beginning of the string and gives up if that fails, whereas search() keeps going and tries every possible position in the string until it finds a match or reaches the end without one. If you understand how match() behaves (it is faster in some cases), feel free to use it; otherwise search() is usually the function you need.
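The difference can be sketched as follows, assuming Python 3 (where an ordinary string literal is already Unicode) and an illustrative sample string:

```python
import re

text = "zh: 正则表达式"
pattern = "[\u4e00-\u9fa5]+"   # the Chinese range used in this article

# match() only tries position 0, where the text starts with "zh", so it fails.
at_start = re.match(pattern, text)

# search() scans forward and stops at the first run of Chinese characters.
anywhere = re.search(pattern, text)

print(at_start)            # None
print(anywhere.group())    # 正则表达式
```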
To find every match in a body of text and get the results back as a list, use the findall() function. See the code below for examples.
Under utf8, each common Chinese character occupies 3 bytes, so the regular expression for one character is [\x80-\xff]{3}. You probably know this already.
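A small sketch of the byte-level approach, assuming Python 3, where the utf8 form is a bytes object and the pattern must be a bytes pattern too. It only holds for characters that UTF-8 encodes in 3 bytes (U+0800 to U+FFFF); characters beyond that range take 4 bytes.

```python
import re

# An ASCII prefix followed by two 3-byte characters.
data = "zh: 正则".encode("utf-8")

# Each group of three high bytes corresponds to one character here.
chunks = re.findall(rb"[\x80-\xff]{3}", data)
chars = [chunk.decode("utf-8") for chunk in chunks]

print(len(chunks))   # 2
print(chars)         # ['正', '则']
```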
Under unicode, Chinese characters are written in the form \uXXXX. Once you know the range of the relevant character set, you can match the corresponding strings, which makes it easy to pick out the text in a particular language from multilingual text. However, for an agglutinative language like Japanese, which mixes Chinese characters with hiragana and katakana, the results may be skewed.
Several character ranges can be used side by side in one character class. For example, putting hiragana, katakana, and Chinese characters together gives u"[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+", which lets you define for yourself the text to be matched.
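For illustration, here is the combined class applied to a Japanese sentence, assuming Python 3 (so the u prefix is optional) and a shortened sample string:

```python
import re

text = "jp: 正規表現は非常に役に立つツールです。"

# Chinese characters, hiragana, and katakana in a single class.
cjk_and_kana = "[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+"

runs = re.findall(cjk_and_kana, text)
print(runs)   # ['正規表現は非常に役に立つツールです']
```

With the three ranges merged, the sentence comes back as one run instead of being split at every change of script; the final 。 is punctuation and falls outside all three ranges.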
When matching Chinese, the regular expression and the target string must be in the same format. This is crucial. Either use the default utf8 for both, in which case nothing extra is needed, or, if the target string is unicode, write the regular expression as a u"" string as well.
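The same rule can be sketched in Python 3 (an assumption, since this article targets Python 2), where the two formats are the distinct types bytes and str and the re module raises TypeError when they are mixed:

```python
import re

text = "zh: 正则"              # str (Unicode)
data = text.encode("utf-8")    # bytes (utf8)

# str pattern on str text, bytes pattern on bytes text: both work.
str_hits = re.findall("[\u4e00-\u9fa5]+", text)
byte_hits = re.findall(rb"[\x80-\xff]+", data)

# A str pattern applied to bytes is rejected by Python 3's re module.
try:
    re.findall("[\u4e00-\u9fa5]+", data)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(str_hits)    # ['正则']
print(mixed_ok)    # False
```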
You can define a unicode string like this: string=u"I love regular expressions". If a string is not unicode, you can convert it with the unicode() function. If you know the encoding of the source string, use newstr=unicode(oldstring, original_coding_name). For example, unicode(string, "utf8") is commonly used under Linux, while cp936 may be needed under Windows; I have not tested the latter.
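In Python 3 the unicode() function no longer exists; the closest equivalent is bytes.decode(). A minimal sketch under that assumption, with an illustrative Chinese sample string and the two encodings mentioned above:

```python
# Assumes Python 3, where bytes.decode() replaces Python 2's unicode(s, encoding).
utf8_bytes = "我爱正则表达式".encode("utf-8")   # e.g. bytes read from a UTF-8 source
gbk_bytes = "我爱正则表达式".encode("cp936")    # e.g. bytes read from a cp936 source

from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("cp936")

print(from_utf8 == from_gbk)             # True
print(len(utf8_bytes), len(gbk_bytes))   # 21 14
```

Both byte strings decode to the same text, but only if the encoding passed to decode() is the one the bytes were actually written in.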
Example program
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
#author: rex
#blog: http://iregex.org
#filename py_utf8_unicode.py
#created: 2010-06-27 09:11

import re

def findPart(regex, text, name):
    res=re.findall(regex, text)
    if res:
        print "There are %d %s parts:\n"% (len(res), name)
        for r in res:
            print "\t",r
        print

#sample is utf8 by default.
sample='''en: Regular expression is a powerful tool for manipulating text.
zh: 正则表达式是一种很有用的处理文本的工具。
jp: 正規表現は非常に役に立つツールテキストを操作することです。
jp-char: あアいイうウえエおオ
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.
puc: 。?!、,;:“ ”‘ ’——……·-·《》〈〉!¥%&*#
'''

#let's look its raw representation under the hood:
print "the raw utf8 string is:\n", repr(sample)
print

#find the non-ascii chars:
findPart(r"[\x80-\xff]+",sample,"non-ascii")

#convert the utf8 to unicode
usample=unicode(sample,'utf8')

#let's look its raw representation under the hood:
print "the raw unicode string is:\n", repr(usample)
print

#get each language parts:
findPart(u"[\u4e00-\u9fa5]+", usample, "unicode chinese")
findPart(u"[\uac00-\ud7ff]+", usample, "unicode korean")
findPart(u"[\u30a0-\u30ff]+", usample, "unicode japanese katakana")
findPart(u"[\u3040-\u309f]+", usample, "unicode japanese hiragana")
findPart(u"[\u3000-\u303f\ufb00-\ufffd]+", usample, "unicode cjk Punctuation")
The output is:
the raw utf8 string is:
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'

There are 14 non-ascii parts:

    正则表达式是一种很有用的处理文本的工具。
    正規表現は非常に役に立つツールテキストを操作することです。
    あアいイうウえエおオ
    정규
    표현식은
    매우
    유용한
    도구
    텍스트를
    조작하는
    것입니다
    。?!、,;:“
    ”‘
    ’——……·-·《》〈〉!¥%&*#

the raw unicode string is:
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'

There are 6 unicode chinese parts:

    正则表达式是一种很有用的处理文本的工具
    正規表現
    非常
    役
    立
    操作

There are 8 unicode korean parts:

    정규
    표현식은
    매우
    유용한
    도구
    텍스트를
    조작하는
    것입니다

There are 6 unicode japanese katakana parts:

    ツールテキスト
    ア
    イ
    ウ
    エ
    オ

There are 11 unicode japanese hiragana parts:

    は
    に
    に
    つ
    を
    することです
    あ
    い
    う
    え
    お

There are 5 unicode cjk Punctuation parts:

    。
    。
    。?!、,;:
    -
    《》〈〉!¥%&*#