
How to parse invalid (error/malformed) XML?

PHPz
2024-02-09 23:20:40

When processing XML you sometimes receive input that is not well-formed or otherwise contains errors. The question and answers below discuss what can be done with such input so that the required data can still be extracted.

Question content

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against actual customer data, and it looks like the other product allows users to enter input that should be considered invalid. Regardless, I still have to find a way to parse it. We are using javax.xml.parsers.DocumentBuilder, and I am getting an error when parsing input like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can see, the description contains what appears to be an invalid tag (<THIS-IS-PART-OF-DESCRIPTION>). The description element is considered a leaf and should not have any tags nested inside it. Regardless, this is still a problem and produces an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it is predictably invalid. Any ideas on ways to parse such input?

Workaround

That "XML" is worse than invalid; it is not well-formed. See the distinction between well-formed and valid XML.
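To make the distinction concrete, here is a minimal sketch of a well-formedness check built on DocumentBuilder. The class and method names (WellFormedCheck, isWellFormed) are made up for this illustration and are not part of the question's code base. It relies only on the fact that a conformant parser reports a fatal error, surfaced as a SAXException, when the input is not well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed: the parser is required to stop here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(isWellFormed(bad));
        System.out.println(isWellFormed("<xml><description>ok</description></xml>"));
    }
}
```

The sample from the question fails this check because the parser treats <THIS-IS-PART-OF-DESCRIPTION> as an open element that is never closed before </description>.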

An informal assessment of how predictable the violations are does not help. That textual data is not XML, and no conformant XML tool or library can help you process it.

Options, most desirable first:

  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase "well-formed XML" is redundant, but it may be useful for emphasis.)

  2. Use a tolerant markup parser to clean up the problems before parsing the result as XML.

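  3. Process the data as text, either manually in a text editor or programmatically with character/string functions. Doing this programmatically can range from tricky to impossible, because what appears to be predictable often is not; rule breaking is rarely bound by rules.

    • For invalid character errors, use a regular expression to remove or replace the invalid characters:

      • php: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{d7ff}\x{e000}-\x{fffd}]+/u', ' ', $s);
      • ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{d7ff}\u{e000}-\u{fffd}", ' ')
      • javascript: inputstr.replace(/[^\x09\x0a\x0d\x20-\xff\x85\xa0-\ud7ff\ue000-\ufdcf\ufde0-\ufffd]/gm, '')
    • For ampersands, use a regular expression to replace the matches with &amp; (credit: blhsin):

      &(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Note that the regular expressions above do not take comments or CDATA sections into account.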
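Since the question is about Java, the two regular-expression clean-ups above might be sketched as follows. This is an illustration under assumptions, not a drop-in fix: the class and method names are hypothetical, the character class follows the XML 1.0 Char production (including the supplementary planes, which the PHP and Ruby one-liners leave out), and the hex digits in the ampersand pattern are matched case-insensitively.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class XmlTextCleaner {

    // Runs of characters outside the XML 1.0 Char production:
    // tab, LF, CR, 0x20 to 0xD7FF, 0xE000 to 0xFFFD, 0x10000 to 0x10FFFF.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");

    // An ampersand that does not start a numeric or named entity reference.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    /** Replaces each run of characters that XML 1.0 forbids with one space. */
    static String stripInvalidChars(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll(" ");
    }

    /** Escapes ampersands that are not already part of an entity reference. */
    static String escapeBareAmpersands(String s) {
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Prints: Tom &amp; Jerry &amp; &#38;
        System.out.println(escapeBareAmpersands("Tom & Jerry &amp; &#38;"));
    }
}
```

As with the original patterns, this sketch treats the input as flat text, so an ampersand inside a comment or CDATA section would be escaped as well.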

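By design, a standard XML parser will never accept invalid XML.

Your only option is to pre-process the input before parsing it, either removing the "predictably invalid" content or wrapping it in CDATA.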
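A minimal sketch of the CDATA variant of that pre-processing step is shown below. It assumes, as the question states, that description is a leaf element, and additionally that its start tag carries no attributes and that its content contains no entity references that must stay escaped; inside CDATA, a sequence such as &amp; would be read literally. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class PreprocessThenParse {

    // Assumption: <description> is a leaf element without attributes, so
    // everything between its start and end tags is meant as plain text.
    private static final Pattern DESCRIPTION =
            Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    /** Wraps the body of every description element in a CDATA section. */
    static String wrapDescriptionInCdata(String raw) {
        Matcher m = DESCRIPTION.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A literal "]]>" would end the section early, so split it in two.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = "<description><![CDATA[" + body + "]]></description>";
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Pre-processes the raw text, parses it, and returns the first description. */
    static String firstDescription(String raw) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(wrapDescriptionInCdata(raw))));
            return doc.getElementsByTagName("description").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("input is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(firstDescription(raw));
    }
}
```

This only repairs the one pattern it was written for. Any other violation in the input still makes DocumentBuilder.parse(...) fail, which is consistent with the warning above that such violations are rarely as predictable as they look.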


Statement:
This article is reproduced from stackoverflow.com. If there is any infringement, please contact admin@php.cn to have it removed.