Detailed introduction to encoding XML documents using UTF-8

Google's Sitemap service requires that all published sitemaps use Unicode's UTF-8 encoding. Google doesn't even allow other Unicode encodings such as UTF-16, let alone non-Unicode encodings like ISO-8859-1. Technically this means Google is using a non-standard XML parser, since the XML Recommendation specifically requires that "all XML handlers must accept the UTF-8 and UTF-16 encodings of Unicode 3.1" — but is this really a big problem?

Everyone can use UTF-8

Universality is the first and most compelling reason to choose UTF-8. It can handle every script currently in use in the world. A few gaps remain, but they are becoming less and less noticeable and are gradually being filled in. The scripts that are not yet covered usually aren't implemented in any other character set either, and even when they are, those character sets can't be used in XML. At best, such scripts are handled by font hacks layered on top of a single-byte character set like Latin-1. Real support for these rare scripts will come first in Unicode, and quite possibly only in Unicode.

But that is only a reason to use Unicode. Why choose UTF-8 rather than UTF-16 or some other Unicode encoding? One of the most immediate reasons is broad tool support. Practically every major editor you might use for XML can handle UTF-8, including jEdit, BBEdit, Eclipse, emacs, and even Notepad. No other Unicode encoding enjoys such wide support among both XML and non-XML tools.

For some of these editors, such as BBEdit and Eclipse, UTF-8 is not the default character set, so for now you have to change the default yourself. Ideally, every tool would ship with UTF-8 selected as its default encoding; until that happens, we remain stuck in a quagmire of non-interoperability whenever files cross borders, platforms, and languages. In the meantime, it is easy to change the default yourself. In Eclipse, for example, the General/Editors preference panel shown in Figure 1 lets you specify that all files use UTF-8. Note that Eclipse defaults to MacRoman; if you leave it that way, your files may not compile when handed to a programmer using Microsoft® Windows® or to anyone outside the United States and Western Europe.

Figure 1. Changing the default character set in Eclipse

Of course, for UTF-8 to work, all the files developers exchange must also be UTF-8; but that's not a problem. Unlike MacRoman, UTF-8 is not limited to a few scripts or a single platform. Anyone can use UTF-8. MacRoman, Latin-1, SJIS, and various other legacy national character sets cannot make that claim.

UTF-8 also works well with tools that don't expect multibyte data. Other Unicode formats such as UTF-16 tend to contain many zero bytes, and many tools interpret those bytes as end-of-file or some other special delimiter, with undesirable, unexpected, and often unpleasant results. For example, UTF-16 data loaded unchanged into a C string can truncate the string at the second byte of the first ASCII character. UTF-8 files contain zero bytes only where they genuinely represent the null character.
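The zero-byte problem is easy to see from Java. The following sketch (purely illustrative) encodes the ASCII letter "A" as UTF-16BE and as UTF-8 and prints the raw bytes; only the UTF-16 form contains the zero byte that a naive C-style tool would treat as a string terminator.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ZeroByteDemo {
        public static void main(String[] args) {
            // "A" in UTF-16BE is 0x00 0x41 -- it starts with a zero byte.
            byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16BE);
            // "A" in UTF-8 is the single byte 0x41 -- no zero bytes at all.
            byte[] utf8 = "A".getBytes(StandardCharsets.UTF_8);
            System.out.println("UTF-16BE: " + Arrays.toString(utf16)); // [0, 65]
            System.out.println("UTF-8:    " + Arrays.toString(utf8));  // [65]
        }
    }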

Of course, such naive tools probably shouldn't be chosen to process XML documents in the first place. However, documents from legacy systems often end up in strange places, and nobody really recognizes or cares that those byte sequences are just old wine in new bottles. For systems that don't expect Unicode and XML, UTF-8 is less likely to cause problems than UTF-16 or other Unicode encodings.

What the experts say

XML is the first major standard to fully support UTF-8, but it is only the beginning. Standards bodies are gradually recommending UTF-8 more and more. For example, URLs containing non-ASCII characters have been a long-standing problem on the Web: a URL containing non-ASCII characters that works on a PC won't work on a Mac, and vice versa. The World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) recently resolved this problem by agreeing that all URLs must be encoded in UTF-8 and nothing else.

The W3C and the IETF have recently become less equivocal about choosing UTF-8 first, last, and always. The W3C's Character Model for the World Wide Web 1.0: Fundamentals states, "If a character encoding must be chosen, it must be UTF-8, UTF-16, or UTF-32. US-ASCII is upwardly compatible with UTF-8 (a US-ASCII string is also a UTF-8 string; see [RFC 3629]), so UTF-8 is appropriate if compatibility with US-ASCII is desired." In fact, compatibility with US-ASCII matters so much that it is practically a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

I accept the internal-processing-efficiency argument. For example, the internal representation of strings in the Java™ language is UTF-16, so indexing into strings is fast. However, Java never exposes this internal representation to the programs it exchanges data with. Instead, external data exchange goes through a java.io.Writer that specifies the character set explicitly; when making that choice, UTF-8 is strongly preferred.
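As a concrete (and deliberately minimal) sketch of that advice, the following Java fragment writes a small XML document through a java.io.Writer whose character set is named explicitly; the file name and element are invented for illustration.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteUtf8Xml {
        public static void main(String[] args) throws IOException {
            // Name the charset explicitly instead of relying on the platform default.
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream("greeting.xml"), StandardCharsets.UTF_8)) {
                out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
                out.write("<greeting>hello, world</greeting>\n");
            }
        }
    }

The encoding declared in the XML prolog and the encoding the Writer actually uses must match; naming both explicitly keeps them from drifting apart.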

The IETF is even more explicit. The IETF charset policy [RFC 2277] states in no uncertain terms:

Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.

For existing protocols, or protocols that move data from existing datastores, support of other charsets, or even using a default other than UTF-8, may be a requirement. This is acceptable, but UTF-8 support MUST be possible.

In other words: support for legacy protocols and files may require accepting character sets and encodings other than UTF-8 for some time to come, but I would be very reluctant to accept that requirement. Every new protocol, application, and document should use UTF-8.

Chinese, Japanese and Korean

A common misconception is that UTF-8 is a compression format. It is not. In UTF-8, ASCII characters take up only half the space they occupy in other Unicode encodings, particularly UTF-16. However, some characters need 50% more space in UTF-8, particularly the ideographic characters of Chinese, Japanese, and Korean (CJK).

But even when CJK XML is encoded in UTF-8, the actual size can still be smaller than in UTF-16. For example, Chinese XML documents contain many ASCII characters such as <, >, &, =, quotation marks, and spaces, and the UTF-8 encoding of these characters is smaller than the UTF-16 encoding. The exact expansion or compression factor varies from document to document, but either way the difference is unlikely to be significant.

Finally, it is worth noting that ideographic scripts such as Chinese and Japanese tend to use fewer characters than alphabetic scripts such as Latin or Cyrillic. Because there are so many of these characters, each one needs three or more bytes to be represented fully; yet the same words and sentences can be written with fewer characters than in English or Russian. For example, the Japanese for "tree" is 木 (which even looks rather like a tree) and requires three bytes in UTF-8, while the English word "tree" has four letters and requires four bytes. The Japanese for "grove" is 林 (two trees side by side); it requires three bytes in UTF-8, while the English word "grove" has five letters and requires five bytes. The Japanese 森 ("forest", three trees) still requires three bytes, while the corresponding English word "forest" requires six.
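These byte counts are easy to verify with a short Java sketch (assuming the source file itself is saved as UTF-8 so the literals survive compilation):

    import java.nio.charset.StandardCharsets;

    public class CjkByteCount {
        public static void main(String[] args) {
            String[] words = { "木", "tree", "林", "grove", "森", "forest" };
            for (String word : words) {
                // getBytes(UTF_8) returns the encoded form, so its length is the byte count.
                int bytes = word.getBytes(StandardCharsets.UTF_8).length;
                System.out.println(word + " -> " + bytes + " bytes in UTF-8");
            }
        }
    }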

If you really need compression, use zip or gzip. After compression, UTF-8 and UTF-16 come out at roughly the same size no matter which encoding you started with: whichever produces the larger original simply has more redundancy for the compression algorithm to remove.

Robustness

The real advantage lies in the design: UTF-8 is a more robust and more easily interpreted format than any other text encoding devised before or since. First, unlike UTF-16, UTF-8 has no endianness problem. Big-endian and little-endian UTF-8 are identical, because UTF-8 is defined in terms of 8-bit bytes rather than 16-bit words. UTF-8 has none of the byte-order ambiguity that must otherwise be resolved with byte-order marks or other heuristics.
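A quick Java illustration of the point (the sample character is arbitrary): the UTF-16 bytes for a character depend on which byte order you pick, while the UTF-8 bytes are the same everywhere.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EndiannessDemo {
        public static void main(String[] args) {
            String zhe = "\u0416"; // CYRILLIC CAPITAL LETTER ZHE
            System.out.println(Arrays.toString(zhe.getBytes(StandardCharsets.UTF_16BE))); // [4, 22]
            System.out.println(Arrays.toString(zhe.getBytes(StandardCharsets.UTF_16LE))); // [22, 4]
            System.out.println(Arrays.toString(zhe.getBytes(StandardCharsets.UTF_8)));    // [-48, -106], i.e. 0xD0 0x96
        }
    }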

An even more important feature of UTF-8 is statelessness. Every byte in a UTF-8 stream or sequence is unambiguous: in UTF-8 you always know where you are. Given a byte, you can tell immediately whether it is a single-byte character, the first byte of a two-byte character, the second byte of a two-byte character, or the second, third, or fourth byte of a three- or four-byte character (those aren't the only possibilities, but you get the idea). In UTF-16, you cannot tell whether the byte 0x41 is the letter "A"; sometimes it is, sometimes it isn't, and you have to track enough state to know where you are in the stream. If a single byte is lost, all the data that follows becomes unusable. In UTF-8, a missing or mangled byte is easy to detect and does not affect the rest of the data.
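That property follows directly from the bit patterns UTF-8 assigns to each kind of byte, and the classification fits in a few lines. The following sketch is one possible illustration, not a full decoder:

    public class Utf8ByteKind {
        // Classify one byte from a UTF-8 stream by its high-order bits.
        static String classify(int b) {
            b &= 0xFF;
            if (b <= 0x7F) return "single-byte (ASCII) character";
            if (b <= 0xBF) return "continuation byte (2nd, 3rd, or 4th byte of a sequence)";
            if (b <= 0xDF) return "first byte of a two-byte character";
            if (b <= 0xEF) return "first byte of a three-byte character";
            if (b <= 0xF4) return "first byte of a four-byte character";
            return "never valid in UTF-8";
        }

        public static void main(String[] args) {
            for (int b : new int[] { 0x41, 0xD0, 0x96, 0xE6, 0xBC }) {
                System.out.println(String.format("0x%02X: %s", b, classify(b)));
            }
        }
    }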

UTF-8 is not a panacea. Applications that need random access to specific positions within a document may run faster with a fixed-width encoding such as UCS-2 or UTF-32. (UTF-16 is a variable-width encoding once you take surrogate pairs into account.) XML processing, however, is not that kind of application. The XML specification effectively requires parsers to start at the first byte of a document and parse through to the last byte, and all existing parsers do exactly that. Faster random access doesn't help XML processing; it may be a good reason to use a different encoding in a database or other system, but it doesn't apply to XML.

Conclusion

In an increasingly international world, where linguistic and political boundaries blur, region-dependent character sets no longer fit. Unicode is the only character set that interoperates across all geographies, and UTF-8 is the best available Unicode encoding:

Broad tool support, including the best compatibility with legacy ASCII systems.

Simple and efficient to process.

Resistant to corruption.

Platform neutral.

It’s time to stop arguing about character sets and encodings, choose UTF-8 and end the dispute.
