php editor Baicao introduces you how to parse invalid XML files. When processing XML files, you sometimes encounter invalid XML, perhaps because it is not well-formed or contains errors. Parsing invalid XML files is an important task to ensure that we get the required data correctly. To solve this problem, we can use PHP’s built-in functions and libraries to check and fix invalid XML. Below we will introduce in detail several commonly used methods to parse invalid XML files.
Currently, I'm working on a feature that involves parsing xml that we receive from other products. I decided to run some tests against some actual customer data and it looks like other products allow users to enter input that should be considered invalid. Anyway, I still have to try and figure out a way to parse it. We are using javax.xml.parsers.documentbuilder
and I am getting the following error while typing.
<xml> ... <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description> ... </xml>
As you may know, the description appears to contain an invalid tag (d30036ddc824403d51b03f80ff62bdc4
). Now, this description tag is considered a leaf tag and should not have any nested tags inside. Regardless, this is still a problem and produces an exception on documentbuilder.parse(...)
I know this is invalid xml, but it is predictably invalid. Any ideas on ways to parse such input?
"xml" is worse than invalid - it is not well-formed; see Well-formed and valid xml.
Informal assessments of the predictability of violations are not helpful. The text data is not xml. There is no consistent xml tool or library that can help you deal with it.
Let the provider resolve the issue themselves. Requires well-formed xml. (Technically, the term well-formed xml is redundant, but may help with emphasis.)
Use tolerant tag parserFix issues before parsing to xml:
Standalone: xmlstarlet Features powerful recovery and repair capabilities Credit: romanperekhrest
xmlstarlet fo -o -r -h -d bad.xml 2>/dev/null
Standalone and c/c: html tidy Valid also works with xml. taggle is a port tagsoup to c .
python: Beautiful Soup Based on python. See the comments in the Differences between Parsers section. See also Answers to this question for more information
Advice on handling malformed tags in python,
Specifically includes lxml's recover=true
option.
See also this answer to learn how to use codecs.encodedfile()
to clean up illegal characters.
java: tagsoup and jsoup focus on html. filterinputstream
Can be used for preprocessing cleanup.
.net:
xmlreadersettings。 conformancelevel
可以设置为
conformancelevel.fragment
这样 xmlreader
可以读取缺少根元素的 xml 格式良好的解析实体 .xmlreader.readtofollowing()
有时可以
用于解决 xml 语法问题,但请注意
下面#3 中的违规警告。microsoft.language.xml.xmlparser
据说是“容错”的。转到:设置decoder.strict
到 false
,如示例所示,作者:@chuckx。
php:请参阅domdocument::$recover 和 libxml_use_internal_errors(true)。请参阅此处的好示例。
ruby:nokogiri 支持“温和的 well-形式性”。
r:请参阅htmltreeparse() 用于 r 中的容错标记解析。
perl:请参阅xml::liberal ,一个“超级自由的 xml 解析器,可以解析损坏的 xml。”
使用文本编辑器手动将数据处理为文本或 以编程方式使用字符/字符串函数。这样做 以编程方式可以从棘手到不可能作为 看起来是什么 可预测的往往不是——打破规则很少受到规则的约束。
对于无效字符错误,请使用正则表达式删除/替换无效字符:
preg_replace('/[^\x{0009}\x{000a}\x{000d} \x{0020}-\x{d7ff}\x{e000}-\x{fffd}]+/u', ' ', $s);
string.tr ("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{d7ff}\u{e000}-\u{fffd}", ' ')
inputstr.replace (/[^\x09\x0a\x0d\x20-\xff\x85\xa0-\ud7ff\ue000-\ufdcf\ufde0-\ufffd]/gm, '')
对于与号,使用正则表达式将匹配项替换为 &
: 信用:blhsin,演示 p>
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
请注意,上述正则表达式不会接受注释或 cdata
按照设计,标准 xml 解析器永远不会接受无效的 xml。
您唯一的选择是在解析输入之前预处理输入以删除“可预见的无效”内容,或将其包装在 cdata 中。
The above is the detailed content of How to parse invalid (error/malformed) XML?. For more information, please follow other related articles on the PHP Chinese website!