How to parse invalid (error/malformed) XML?

PHPz

Feb 09, 2024, 11:20 PM

XML received from other systems is sometimes not well-formed or contains illegal characters, and a standard XML parser will reject it outright. The question and answers below cover how to deal with such input, from having the provider fix the data, to tolerant markup parsers, to cleaning up the text before it reaches the XML parser.

Question content

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against actual customer data, and it looks like the other product allows users to enter input that should be considered invalid. Regardless, I still have to find a way to parse it. We are using javax.xml.parsers.DocumentBuilder, and I am getting an error on input like the following:

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can see, the description contains what looks like an invalid tag (<THIS-IS-PART-OF-DESCRIPTION>). The description tag is supposed to be a leaf tag and should not have any nested tags inside it. Regardless, this is still a problem, and it produces an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it is predictably invalid. Any ideas on ways to parse such input?
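For reference, the failure is easy to reproduce. The following is a minimal sketch, not the asker's actual code: the class name `WellFormedCheck` and the helper `isWellFormed` are hypothetical, and the sample string is the snippet from the question with the elided parts left out.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class WellFormedCheck {

    // Returns true if the JDK's DocumentBuilder parses the text, false if it rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for well-formedness violations such as the unclosed tag below.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(isWellFormed(bad)); // prints false
    }
}
```

The parser treats `<THIS-IS-PART-OF-DESCRIPTION>` as a start tag, then finds `</description>` where the matching end tag should be, and gives up.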

Workaround

That "XML" is worse than invalid: it is not well-formed; see Well-formed vs. valid XML.

An informal assessment of how predictable the violations are does not help. That text data is not XML, and no conformant XML tool or library can help you process it.

Options, ideal first:

1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase "well-formed XML" is redundant, but it can be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem before parsing as XML:
- Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
```
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
```
- Standalone and C/C++: HTML Tidy also works with XML. Taggle is a port of TagSoup to C++.
- Python: Beautiful Soup is Python-based. See the notes in its "Differences between parsers" section. See also the answers on handling not-well-formed markup in Python, which include in particular lxml's recover=True option, and the answer on using codecs.EncodedFile() to clean up illegal characters.
- Java: TagSoup and jsoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
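- .NET:
  - XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
  - @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML well-formed parsed entities that lack a root element.
  - @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntax problems, but note the rule-breaking warning under option 3 below.
  - Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".
- Go: Set Decoder.Strict to false, as shown in the example by @chuckx.
- PHP: See DOMDocument::$recover and libxml_use_internal_errors(true).
- Ruby: Nokogiri supports "Gentle Well-Formedness".
- R: See htmlTreeParse() for fault-tolerant markup parsing in R.
- Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text, either manually in a text editor or programmatically with character/string functions. Doing this programmatically can range from tricky to impossible, because what appears to be predictable often is not: rule breaking is rarely bound by rules.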
- For invalid character errors, use a regex to remove or replace the invalid characters:
  - PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
  - Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
  - JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
- For ampersands, use a regex to replace matches with &amp; (credit: blhsin, demo):
```
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
```

Note that the regular expressions above do not take comments or CDATA sections into account.
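Since the question is about Java, here is a rough sketch of the same two text-level cleanups in that language. The class and method names are hypothetical. Two deliberate deviations from the patterns above are assumptions on my part: the character filter follows the XML 1.0 `Char` production, so it also keeps supplementary-plane characters (U+10000 to U+10FFFF), and the ampersand pattern accepts upper-case hex digits. The filter deletes invalid characters rather than replacing them with a space.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class XmlTextCleanup {

    // An ampersand that does not start a numeric or named reference such as &#38; or &amp;.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    // Escapes bare ampersands and leaves existing references untouched.
    static String escapeBareAmpersands(String s) {
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    // True for code points allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point that XML 1.0 does not allow, including unpaired surrogates.
    static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().filter(XmlTextCleanup::isXmlChar).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("AT&T &amp; friends")); // prints AT&amp;T &amp; friends
    }
}
```

As with the regexes above, neither method knows about comments or CDATA sections, so an ampersand inside a CDATA section would be escaped as well.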

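By design, a standard XML parser will never accept invalid XML.

Your only option is to preprocess the input before parsing it, either removing the "predictably invalid" content or wrapping it in CDATA.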
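A minimal sketch of the CDATA approach for the input in the question might look like the following. It rests on assumptions that the question does not confirm: that `<description>` never carries attributes, is never nested, and never contains the text `</description>` itself. The class and method names are hypothetical. Also note that wrapping in CDATA makes everything inside literal, so a description that already contains a properly escaped `&lt;` would come back as the five characters `&lt;` rather than `<`.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class LeafCdataWrapper {

    // Assumes <description> has no attributes and is never nested.
    private static final Pattern DESCRIPTION =
            Pattern.compile("(<description>)(.*?)(</description>)", Pattern.DOTALL);

    // Wraps the raw text of every <description> element in a CDATA section.
    static String wrapDescriptions(String raw) {
        Matcher m = DESCRIPTION.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A literal "]]>" would end the CDATA section early, so split it across two sections.
            String body = m.group(2).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = m.group(1) + "<![CDATA[" + body + "]]>" + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Preprocesses the text, parses it, and returns the text of the first <description>.
    static String firstDescription(String raw) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapDescriptions(raw))));
            return doc.getElementsByTagName("description").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the preprocessed input", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(firstDescription(raw));
    }
}
```

This only works because the leaf element is known in advance. It is preprocessing of text that happens to look like XML, with all the fragility that the first answer warns about.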

Statement

This article is reproduced from Stack Overflow. If there is any infringement, please contact admin@php.cn to have it removed.