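Summary
A parser is general-purpose: it handles well-formed XML, and once parsing is done you can reach information anywhere in the document. Prefer it.
A regex is targeted: it handles malformed XML. When you know in advance where the information you need sits, try a regex.
Update 3 gives a concrete case.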
I have a string like this:
import re
str="<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>"
The following two patterns match the content between <a> and </a>, and between <b> and </b>, respectively:
p1=re.compile(r'(?<=<a>)(.*?)(?=</a>)')
#p1.findall(str)=['1','2']
p2=re.compile(r'(?<=<b>)(.*?)(?=</b>)')
#p2.findall(str)=['A','B']
Question 1: Can the '|' operator be used so that a single pattern produces the following match?
p3=re.compile(r'(?<=<a>)|(?<=<b>)(.*?)(?=</a>)|(?=</b>)')
#p3.findall(str)=['1','2','A','B']
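A note on why p3 cannot return that list as written: `|` has the lowest precedence in a regular expression, so the pattern splits into three separate alternatives rather than "(after a or b) ... (before /a or /b)". Below is a minimal sketch, assuming the goal is simply the text between either tag pair. The names s, p3_original and p3_grouped are illustrative only, and the result comes back in document order rather than grouped by tag.

```python
import re

s = "<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>"

# Original attempt: '|' splits the whole pattern into three alternatives.
p3_original = re.compile(r'(?<=<a>)|(?<=<b>)(.*?)(?=</a>)|(?=</b>)')

# Wrap each alternation in a non-capturing group so '|' only applies inside it.
p3_grouped = re.compile(r'(?:(?<=<a>)|(?<=<b>))(.*?)(?:(?=</a>)|(?=</b>))')

print(p3_grouped.findall(s))  # ['1', 'A', '2', 'B']
```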
Question 2: Can groups be used to produce the following match?
p4=re.compile(r'(?<=<a>)(.*?)(?=</a>)(?<=<b>)(.*?)(?=</b>)')
#p4.findall(str)=[('1','A'),('2','B')]
Update 1: Question 2 is solved
Question 2 came from this passage in the manual:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
p=re.compile(r'(?<=<a>)(.*?)(?=</a>).+?(?<=<b>)(.*?)(?=</b>)')
#p.findall(str)=[('1','A'),('2','B')]
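Since findall returns only the groups once a pattern contains them, the look-around assertions are arguably no longer needed for this case. A minimal sketch under that assumption (the names s and pair_re are illustrative). Note that `.` does not cross line breaks unless re.DOTALL is passed.

```python
import re

s = "<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>"

# Two capturing groups: findall returns one (a_text, b_text) tuple per match.
pair_re = re.compile(r'<a>(.*?)</a>.+?<b>(.*?)</b>')

print(pair_re.findall(s))  # [('1', 'A'), ('2', 'B')]
```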
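One more recommendation: regex101, an interactive website for testing Python regular expressions.
The matches update instantly as you change the pattern, which makes it easy to check whether your regex is correct.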
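Update 2
Matching the content of an XML document: Regex or Parser?
A parser suits parsing and is more robust; once parsing is done you can reach information anywhere in the XML document. A regex suits targeted matching: when dealing with malformed XML and you know in advance where the information you need sits, try a regex. Update 3 gives a concrete case.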
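For the well-formed case, here is a minimal sketch using the standard-library xml.etree.ElementTree. The sample string from the question has no single root element, so a hypothetical <r> wrapper is added to make it well-formed. That wrapper is an assumption for illustration and is not part of the original data.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the sample data, wrapped in one root.
doc = "<r><a>1</a>...<b>A</b>...<a>2</a>...<b>B</b></r>"
root = ET.fromstring(doc)

# After parsing, every element is reachable by name, in document order.
a_texts = [el.text for el in root.iter("a")]
b_texts = [el.text for el in root.iter("b")]

print(list(zip(a_texts, b_texts)))  # [('1', 'A'), ('2', 'B')]
```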
Update 3
The answers have gradually converged on a single voice telling you: "never use a regex to parse XML".
My humble take on this:
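1. My question is about "matching XML content with a regex", not "parsing an XML document with a regex". It comes from a text classifier built on the Reuters-21578 news corpus. The data source is the information inside the <PLACES> and <TEXT> tags of the corpus files, which has to be extracted and paired one to one. This article describes the experience of processing the corpus: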
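These data files seem to follow a certain format, and at first I tried to treat them as standard XML documents (the download package even includes a plausible-looking SGML DTD file), but I kept getting errors. I eventually found that many records are wrongly formatted, and in all sorts of bizarre ways. So I simply gave up and treated them all as plain text files.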
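I tried to parse them with Python's lxml library, and the parse failed, of course; the error_log reported many mismatches. lxml does offer a way to handle broken XML, namely recover ("try hard to parse through broken XML"). The cost of recover is that the correspondence between PLACES and TEXT becomes hard to maintain. With a regex, by contrast, the matching rule is analogous to the question above: a pair is kept only when both groups match content. If either one is empty, the pair is discarded.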
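The "keep the pair only when both groups match" rule can be sketched roughly as follows. The sample records are hypothetical and heavily simplified (real Reuters-21578 files carry many more fields), and the helper name extract_pairs, the <D> sub-tag handling, and the record boundaries are assumptions for illustration, not a tested loader for the corpus.

```python
import re

# Hypothetical, simplified records; not taken from the real corpus files.
sample = (
    "<REUTERS><PLACES><D>usa</D></PLACES><TEXT>Grain exports rose.</TEXT></REUTERS>\n"
    "<REUTERS><PLACES></PLACES><TEXT>No place given.</TEXT></REUTERS>\n"
    "<REUTERS><PLACES><D>japan</D></PLACES><TEXT>Yen steady.\nSecond line.</TEXT></REUTERS>\n"
)

# re.DOTALL lets '.' cross line breaks inside a record.
record_re = re.compile(r"<REUTERS\b.*?</REUTERS>", re.DOTALL)
places_re = re.compile(r"<PLACES>(.*?)</PLACES>", re.DOTALL)
text_re = re.compile(r"<TEXT\b[^>]*>(.*?)</TEXT>", re.DOTALL)


def extract_pairs(raw):
    """Return (places, text) pairs, dropping records where either part is empty."""
    pairs = []
    for record in record_re.findall(raw):
        places = places_re.search(record)
        text = text_re.search(record)
        if not (places and text):
            continue  # one of the two tags is missing: discard the record
        place_names = re.findall(r"<D>(.*?)</D>", places.group(1))
        body = text.group(1).strip()
        if place_names and body:  # both sides must carry content
            pairs.append((place_names, body))
    return pairs


pairs = extract_pairs(sample)
print(pairs)
```

Matching record by record first keeps a PLACES from one record from being paired with the TEXT of another, which is the correspondence problem described above.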
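2. Judge each problem on its own terms and avoid absolutes
The same question also came up on Stack Overflow, with all kinds of answers.
The top-voted one is "don't use a regex to parse XML". There are also other thought-provoking answers; here is an excerpt from one: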
While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.
If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.
Regexes worked just fine for me, and were very fast to set up.
天蓬老师2017-04-17 11:56:16
>>> import re
>>> str = "<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>"
>>> p3 = re.compile(r'(?<=<(?P<tag>a|b)>)(.*?)(?=</(?P=tag)>)')
>>> [m.group() for m in p3.finditer(str)]
['1', 'A', '2', 'B']
>>> p3.findall(str)
[('a', '1'), ('b', 'A'), ('a', '2'), ('b', 'B')]
高洛峰2017-04-17 11:56:16
How many times have I said this... I'm tired of it...
XML has its own libraries: lxml, BS4.
Let regular expressions do the job they are meant for, instead of racking your brain to handle XML with them.
ringa_lee2017-04-17 11:56:16
str = '<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>'
p5 = re.compile(r'(?<=<[ab]>)(.*?)(?=</[ab]>)')
p5.findall(str) # ['1', 'A', '2', 'B']
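One caveat with the character class: [ab] is checked independently at the opening and closing positions, so a mismatched pair such as <a>x</b> also matches. A hedged sketch of the difference, using a backreference to require the same tag name on both sides (the input and the name strict are made up for illustration):

```python
import re

mismatched = "<a>x</b>"

p5 = re.compile(r'(?<=<[ab]>)(.*?)(?=</[ab]>)')
# \1 must repeat whatever tag name group 1 captured.
strict = re.compile(r'<([ab])>(.*?)</\1>')

print(p5.findall(mismatched))              # ['x']
print(strict.findall(mismatched))          # []
print(strict.findall("<a>1</a><b>A</b>"))  # [('a', '1'), ('b', 'A')]
```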
天蓬老师2017-04-17 11:56:16
Supplement 3:
Here is the direct, constructive answer to the question, kept separate from Supplement 2.
As for the matching problem itself, my suggestions are:
1. If A and B are paired, check whether line breaks, parent tags, or similar markers can be used to tell each <a><b> group apart. Ideally the data source would look like this: <r><a></a><b></b></r>
2. Consider whether an unpaired <a> or <b> can occur. For example, with ABABAAABAB, the first two of the three A in the middle are best discarded.
3. To match a pair directly: <a>(?P<texta>.*)</a>.*<b>(?P<textb>.*)</b>.
4. Alternatively, use (?:<(?P<tagname>a)>(?P<text>.*)</a>)|(?:<(?P<tagname>b)>(?P<text>.*)</b>) to get all the tags at once, whether they are A or B, then treat each adjacent A and B as a set of valid data.
Note that the patterns above were all written by hand. They have not been tested or even looked at closely, and are for reference only.
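The "collect every tag, then pair adjacent A and B" idea can be sketched as below. This is an adaptation, not the pattern quoted above: Python's re module rejects a pattern that defines the same group name twice, and a greedy .* would run across several tags, so the sketch uses a single named group with a backreference and a lazy .*? instead. The names TAG_RE, pair_adjacent and demo are hypothetical.

```python
import re

# One pass over all <a>/<b> elements; (?P=tagname) requires a matching close tag.
TAG_RE = re.compile(r"<(?P<tagname>[ab])>(?P<text>.*?)</(?P=tagname)>", re.DOTALL)


def pair_adjacent(source):
    """Pair each <b> with the <a> immediately before it; drop unpaired tags."""
    pairs = []
    pending_a = None
    for m in TAG_RE.finditer(source):
        if m.group("tagname") == "a":
            # A newer A replaces an earlier A that never found its B.
            pending_a = m.group("text")
        elif pending_a is not None:
            pairs.append((pending_a, m.group("text")))
            pending_a = None
        # A <b> with no pending <a> is silently discarded.
    return pairs


# Tag order here is ABABAAABAB, as in the example above.
demo = ("<a>1</a><b>A</b><a>2</a><b>B</b>"
        "<a>3</a><a>4</a><a>5</a><b>C</b><a>6</a><b>D</b>")
print(pair_adjacent(demo))  # [('1', 'A'), ('2', 'B'), ('5', 'C'), ('6', 'D')]
```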
Supplement 2:
There can be legitimate reasons why the XML is non-standard. For that real-world situation, my suggestion is:
In addition, I must criticize the poster very seriously: you are yet another textbook example of the XY problem.
At first you showed only a very simple, well-formed XML fragment. Only after two updates did you finally reveal the crucial detail that "the XML may be non-standard".
Were you deliberately holding back a trump card to protect your fragile self-esteem when criticized?!
Could you be any more fragile!!!
Supplement 1:
I cannot agree with Update 2 of the question.
Matching even well-formed XML with regular expressions means that a few small pitfalls, all perfectly legal XML, are enough for a lazy programmer to fall into.
I think parsing XML with a regex is "definitely unsuitable for real applications", and this should not be in doubt. If you force it anyway, the program can only handle some specific cases. And if the data source changes even slightly (for example, someone temporarily comments out a few tags), a human may have to hotfix it. The result is a skyscraper built on loose sand: the program you worked hard to write soon becomes unusable. It is a never-ending cycle.
"Nothing is absolute": isn't that judgment itself absolute? I think principles are principles. Some questions have a clear right and wrong, and some waters should not be muddied. If you give a little here and let go a little there on matters of principle, the resulting program can only end up somewhere elusive and unpredictable.
It's normal to see other opinions on Stack Overflow. Do you have to agree with them just because you saw them?!
The only thing that is certain is that using XML as practice material for learning regular expressions does no harm.
I would rather turn the question around.
Why do some people always like to parse XML/HTML with regular expressions?!
Since when did "parse XML with a parser or with a regex, each has its strengths" become a topic open to discussion?!
Is this even something that needs discussing?!
Iron principle!
Never step back!
No matter how simple XML is, it won’t work!
Because a simple regular expression cannot cover all the complex structures of XML. XML has so many cases: some odd but valid, some merely tolerated, some that should simply raise an error. A regular expression cannot cover them all.
For example, ask yourself whether a regex-based approach would account for every one of the following cases:
<!-- this <a></a> should be ignored -->
<![CDATA[ this <a></a> should be ignored too ]]>
<a>A &lt; B</a> is A < B instead of A &lt; B
<a /> is equivalent to <a></a> and shouldn't be ignored
So regular expressions and XML parsers are two things of completely different complexity. The cost of mixing them up will come back to you with interest one day. Don't give up on writing solid code just because "it serves the purpose for now". That is using physical "diligence" to cover up sheer mental laziness.
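The four cases above can be tried side by side with a small sketch. It assumes the standard-library xml.etree.ElementTree as the parser and a deliberately naive regex, and the document is a made-up example wrapped in a hypothetical <r> root. The words "comment" and "cdata" are placed inside the tags only to make the output easier to read.

```python
import re
import xml.etree.ElementTree as ET

doc = """<r>
<!-- this <a>comment</a> should be ignored -->
<![CDATA[ this <a>cdata</a> should be ignored too ]]>
<a>A &lt; B</a>
<a />
</r>"""

# Parser: skips the comment, treats CDATA as plain text, decodes the entity,
# and still reports the self-closing element (its text is None).
root = ET.fromstring(doc)
parsed = [el.text for el in root.iter("a")]

# Naive regex: picks up the tags inside the comment and the CDATA section,
# leaves the entity undecoded, and misses <a /> entirely.
naive = re.findall(r"<a>(.*?)</a>", doc)

print(parsed)  # ['A < B', None]
print(naive)   # ['comment', 'cdata', 'A &lt; B']
```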
Anyone who has competed in the Informatics Olympiad in secondary school or ACM/ICPC at university understands a simple truth:
Passing the sample data and getting the whole problem Accepted are two completely different things.
The same goes for real programming. Since XML is a standard, code that deals with XML must "guarantee" that it works for any XML conforming to the standard, rather than being tinkered with until it "looks" like it works on the one-sided "sample data" you set up yourself.
Read the article "An Interlude in Linux 2.6.39-rc3" and remember the words of Linus Torvalds:
This kind of “I broke things, so now I will jiggle things randomly until they unbreak” is not acceptable.
This "I messed up, I just tinkered with it until it worked again" approach is unacceptable.