Introduction to emerging XML processing method VTD-XML-XML/RSS Tutorial-php.cn

Home

Backend Development

XML/RSS Tutorial

Introduction to emerging XML processing method VTD-XML

黄舟

Feb 28, 2017 pm 05:20 PM

Problem

Usually when we mention the use of XML, the most troublesome part is the verbosity of XML and the parsing speed of XML. This problem becomes particularly serious when large XML files need to be processed. What I mention here is the topic of how to optimize XML processing speed.

When we choose to process XML files, we generally have two options:

DOM, which is the W3C standard model, builds XML structural information in a tree shape , which provides interfaces and methods for traversing this tree.
SAX, a low-level parser, forward read-only processing element by element, does not contain structural information.
Both of the above options have their own pros and cons, but neither is a particularly good solution. Their pros and cons are as follows:

DOM

Advantages: Ease of use, because all XML structure information exists in memory, and traversal is simple, supporting XPath.
Disadvantages: Parsing speed is too slow, memory usage is too high (5x~10x of the original file), and it is almost impossible to use for large files.
SAX

Advantages: Parsing is fast, and the memory usage is not related to the size of XML (it can be done without increasing memory as XML grows).
Disadvantages: Poor usability, because there is no structural information, it cannot be traversed, and it does not support XPath. If you need a structure, you can only read a little bit and construct a little bit, which makes the maintainability very poor.
We can see that DOM and SAX are basically two opposite extremes, but neither one can meet most of our requirements well. We need to find another processing method. Note that the efficiency problem with XML is not a problem with XML itself, but a problem with the Parser that processes XML, just like the two methods we saw above have different efficiency trade-offs.

Thinking

We like the use of DOM-like methods because we can traverse, which means that XPath can be supported, which greatly enhances the ease of use, but the efficiency of DOM is very low. As we already know, the efficiency problem lies in the processing mechanism. So, what aspects of DOM affect its efficiency? Let us do a comprehensive dissection below:

In most platforms today based on virtual machine (hosted, or any similar mechanism) technology, the creation and destruction of objects is a time-consuming job (it is worth the main is the time consuming of Garbage Collection), the large number of object creation and destruction used in the DOM mechanism is undoubtedly one of the reasons that affects its efficiency (it will cause too many Garbage Collections).
Each object will have an additional 32 bits to store its memory address. When there are a large number of objects like DOM, this additional cost is not small.
The main efficiency issue causing the above two problems is that DOM and SAX are both extractive parsing modes. This parsing mode is destined to require a large number of creation (destroy) objects for both DOM and SAX, causing efficiency problems. The so-called extractive parsing means that when parsing XML, DOM or SAX will extract a part of the original file (generally a string), and then parse and construct it in memory (the output is naturally one or some objects). Take DOM as an example. DOM will parse each element, attribute, PRocessing-instruction, comment, etc. into an object and give it a structure. This is the so-called extractive parsing.
Another problem caused by the issue of extractive is update efficiency. In DOM (SAX does not support update, so we won’t mention it at all), every time we need to make changes, all we have to do is to update the object’s information. Then parse back the XML string. Note that this parsing is a complete parsing, that is to say, the original file is not used, but the DOM model is directly re-parsed completely into an XML string. In other words, DOM does not support Incremental Update (incremental update).
Another "small" problem that is likely not to be noticed is the encoding of XML. No matter what kind of parsing method is used, it needs to be able to handle the encoding of XML, that is, decoding when reading and writing when writing. when coding. Another efficiency problem with DOM is that when I only want to make a small modification to a large XML, it must first decode the entire file and then build the structure. Invisibly, it is another expense.
Let us summarize the problem. Simply put, the efficiency problem of DOM mainly lies in its extractive parsing mode (the same is true for SAX, which has the same problem). This has triggered a series of related problems. If these can be overcome If there is an efficiency bottleneck, then it is conceivable that the XML processing efficiency will be further improved. If the ease of use and processing efficiency of XML are greatly improved, then the application scope and application model of XML will be further sublimated, and perhaps many wonderful things that have never been thought of before will be produced. XML based products come.

The way out

VTD-XML is the answer given after thinking about the above problems. It is a non-extractive XML parser. Because of its excellent mechanism, it is a good solution (avoiding ) solves the various problems raised above, and also "incidentally" brings other non-extractive benefits, such as fast parsing and traversal, XPath support, Incremental Update, etc. I have a set of data here, taken from the official website of VTD-XML:

The parsing speed of VTD-XML is 1.5x~2.0x that of SAX (with NULL content handler). With NULL content handler means that no additional processing logic is inserted into SAX parsing, which is the maximum speed of SAX.
The memory usage of VTD-XML is 1.3x~1.5x that of the original XML (the 1.0x part is the original XML, and the 0.3x~0.5x part is the part occupied by VTD-XML), while the memory usage of the DOM is 1.3x~1.5x that of the original XML. 5x~10x of XML. For example, if the size of an XML is 50MB, then the memory occupied by VTD-XML will be between 65MB~75MB, while the memory occupied by DOM will be between 250M~500MB. Using DOM to process large XML files based on this data is almost an impossible option.
You may find it incredible, is it really possible to make an XML parser that is easier to use than DOM and faster than SAX? Don’t rush to a conclusion, let’s take a look at the principles of VTD-XML!

Basic Principle

Like most good products, the principle of VTD-XML is not complicated, but very clever. In order to achieve the purpose of non-extractive, it reads the original XML file into the memory unchanged in binary mode, without even decoding it, and then parses the position of each element on this byte array and records some information. Subsequent traversal operations are performed on these saved records. If the XML content needs to be extracted, the position and other information in the record are used to decode the original byte array and return a string. This all seems simple, but this simple process does have multiple performance details and hides several potential capabilities. Let's first describe each performance detail:

In order to avoid excessive object creation, VTD-XML decided to use the original numerical type as the record type, so that heap is not needed. The record mechanism of VTD-XML is called VTD (Virtual Token Descriptor). VTD solves the performance bottleneck in the tokenization stage, which is really a clever and thoughtful approach. VTD is a 64-bits length numerical type that records information such as the starting position (offset), length (length), depth (depth) and token type (type) of each element.
Note that VTD is of fixed length (officially decided to use 64bits). The purpose of this is to improve performance. Because the length is fixed, it is extremely efficient (O(1)) when reading, querying and other operations, that is, Arrays, an efficient structure that can be used to organize VTDs, greatly reduce performance problems caused by the large use of objects.
The superpower of VTD (no exaggeration at all) is that it can simply transform a tree-shaped data structure such as XML into an operation on a byte array, any operation you can imagine on a byte array All can be applied to XML. This is because the XML read in is binary (byte array), and VTD records the location of each element and other access information. When we find the VTD to be operated, we only need to use information such as offset and length. Perform any operation on the original byte array, or you can operate directly on the VTD. For example, if I want to find an element in a large XML and delete it, then I only need to find the VTD of this element (the traversal method will be discussed later), delete this VTD from the VTD array, and then use all Just write the VTD to another byte array. Because the deleted VTD marks the location of the element to be deleted, this element will not appear in the newly written byte array. Use VTD to write the new The byte array is actually a copy of the byte array, and its efficiency is quite high. This is the so-called incremental update.
Regarding the traversal method of VTD-XML, it uses LC (Location Cache), which is simply a tree-shaped table structure built using VTD with its depth as the standard. The entry of LC is also a 64-bit long numerical type. The first 32 bits represent the index of a VTD, and the last 32 bits represent the index of the first child of this VTD. You can use this information to calculate any position you want to reach. For specific traversal methods, please refer to the article on the official website. VTD-XML based on this traversal method has a different operating interface from DOM, which is understandable. Moreover, this traversal method of VTD-XML can take you to the place you need in the fewest steps. , the traversal performance is very outstanding.

Summary

As you can see above, VTD-XML has fascinating features, and now version 1.5 has added support for XPath (as long as it can be traversed, it can support XPath , this is a matter of time :-)), its practicality has exceeded the scope of what we imagine today. Another superpower of VTD-XML is that based on its current processing method, it can fully support the future Binary XML standard and push the application of XML to a higher level through Binaryization! This is what I'm looking forward to now! :-)

However, VTD-XML still has many areas that need improvement and perfection, and this aspect is worthy of our efforts and discussion.

The above is the introduction of the emerging XML processing method VTD-XML. For more related content, please pay attention to the PHP Chinese website (www.php.cn)!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Building Feeds with XML: A Hands-On Guide to RSSApr 14, 2025 am 12:17 AM

The steps to build an RSSfeed using XML are as follows: 1. Create the root element and set the version; 2. Add the channel element and its basic information; 3. Add the entry element, including the title, link and description; 4. Convert the XML structure to a string and output it. With these steps, you can create a valid RSSfeed from scratch and enhance its functionality by adding additional elements such as release date and author information.

Creating RSS Documents: A Step-by-Step TutorialApr 13, 2025 am 12:10 AM

The steps to create an RSS document are as follows: 1. Write in XML format, with the root element, including the elements. 2. Add, etc. elements to describe channel information. 3. Add elements, each representing a content entry, including,,,,,,,,,,,. 4. Optionally add and elements to enrich the content. 5. Ensure the XML format is correct, use online tools to verify, optimize performance and keep content updated.

XML's Role in RSS: The Foundation of Syndicated ContentApr 12, 2025 am 12:17 AM

The core role of XML in RSS is to provide a standardized and flexible data format. 1. The structure and markup language characteristics of XML make it suitable for data exchange and storage. 2. RSS uses XML to create a standardized format to facilitate content sharing. 3. The application of XML in RSS includes elements that define feed content, such as title and release date. 4. Advantages include standardization and scalability, and challenges include document verbose and strict syntax requirements. 5. Best practices include validating XML validity, keeping it simple, using CDATA, and regularly updating.

From XML to Readable Content: Demystifying RSS FeedsApr 11, 2025 am 12:03 AM

RSSfeedsareXMLdocumentsusedforcontentaggregationanddistribution.Totransformthemintoreadablecontent:1)ParsetheXMLusinglibrarieslikefeedparserinPython.2)HandledifferentRSSversionsandpotentialparsingerrors.3)Transformthedataintouser-friendlyformatsliket

Is There an RSS Alternative Based on JSON?Apr 10, 2025 am 09:31 AM

JSONFeed is a JSON-based RSS alternative that has its advantages simplicity and ease of use. 1) JSONFeed uses JSON format, which is easy to generate and parse. 2) It supports dynamic generation and is suitable for modern web development. 3) Using JSONFeed can improve content management efficiency and user experience.

RSS Document Tools: Building, Validating, and Publishing FeedsApr 09, 2025 am 12:10 AM

How to build, validate and publish RSSfeeds? 1. Build: Use Python scripts to generate RSSfeed, including title, link, description and release date. 2. Verification: Use FeedValidator.org or Python script to check whether RSSfeed complies with RSS2.0 standards. 3. Publish: Upload RSS files to the server, or use Flask to generate and publish RSSfeed dynamically. Through these steps, you can effectively manage and share content.

Securing Your XML/RSS Feeds: A Comprehensive Security ChecklistApr 08, 2025 am 12:06 AM

Methods to ensure the security of XML/RSSfeeds include: 1. Data verification, 2. Encrypted transmission, 3. Access control, 4. Logs and monitoring. These measures protect the integrity and confidentiality of data through network security protocols, data encryption algorithms and access control mechanisms.

XML/RSS Interview Questions & Answers: Level Up Your ExpertiseApr 07, 2025 am 12:19 AM

XML is a markup language used to store and transfer data, and RSS is an XML-based format used to publish frequently updated content. 1) XML describes data structures through tags and attributes, 2) RSS defines specific tag publishing and subscribed content, 3) XML can be created and parsed using Python's xml.etree.ElementTree module, 4) XML nodes can be queried for XPath expressions, 5) Feedparser library can parse RSSfeed, 6) Common errors include tag mismatch and encoding issues, which can be validated by XMLlint, 7) Processing large XML files with SAX parser can optimize performance.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

2 weeks agoByDDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

WWE 2K25: How To Unlock Everything In MyRise

1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

Atom editor mac version download

The most popular open source editor

SublimeText3 Linux new version

SublimeText3 Linux latest version

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

Hot Topics

Where is the login entrance for gmail email?

7500

CakePHP Tutorial

1377

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers