Semantics of XML tags
[Abstract] Although XML document type definitions provide a mechanism for describing the syntax of an XML language in machine-readable form, there is currently no comparable mechanism for specifying the semantics of an XML vocabulary. As a result, there is no way to state formally what XML tags mean, and the facts and relationships represented by XML markup cannot be defined clearly, comprehensively, and normatively. This has serious practical and theoretical consequences. On the positive side, XML structures can be given arbitrary semantics and used in areas their original designers never foresaw. On the less positive side, content developers and software engineers must rely on prose documentation or, worse still, on guessing the intent of the markup language designer. This process is time-consuming, labor-intensive, error-prone, and unverifiable. Even when the designer's documentation is done perfectly, unsatisfactory situations still arise. In addition, the lack of research on the semantic nature of markup means that digital document processing, as an engineering discipline, has no underlying theory. Although several ongoing efforts (XML Schema, RDF, the Semantic Web) have achieved some results, none of them directly and comprehensively addresses the core problem of XML markup semantics. This article reviews the history of the concept of markup meaning, explains the motivation for a formal semantics of XML, and introduces a research project on markup semantics: the BECHAMEL Markup Semantics Project.
[Keywords] SGML
Text markup systems such as the Standard Generalized Markup Language (SGML) and the Extensible Markup Language (XML) are now applied in every area of society, business, culture, and daily life. SGML/XML is a technology for defining descriptive markup languages in machine-readable form. Apart from parts that require special treatment, such a language makes the structure of a document and its underlying meaning explicit. SGML/XML is developing rapidly, and widespread use of the technology can support high-performance, interoperable document processing and publishing.
This promise has been partly realized, and the advantages of SGML/XML have exceeded expectations. However, the functionality, interoperability, diversity, and accessibility of SGML/XML document systems still need to improve. If this opportunity is not seized, the consequences will be serious: industry will bear high financial costs and lose many opportunities; safety-critical applications may suffer disasters; and people with disabilities will be hindered from equal access to the cultural and commercial benefits of contemporary society. In addition, long-standing problems continue to remind us that even the best current digital document models are flawed, or at least incomplete.
The root of these problems is that although SGML/XML can give a document a meaningful structure, it cannot represent the basic semantic relationships between document components and topics in a systematic, machine-processable way. SGML/XML supports the description of a machine-readable "grammar", but it provides no mechanism for explaining what that grammar means. There is therefore no way to express formally the intended meaning of an SGML/XML vocabulary. Current SGML/XML cannot express even very simple semantic facts about a markup system. These facts are usually intended by the markup language designer, but acting on them is left to the users of the markup language and to software.
This lack of expressive power forces SGML/XML users to guess at the semantic relationships that the markup language designer had in mind but never expressed formally. Content developers must guess the designer's intent and encode content on the basis of those inferences, without being able to state their inferences and intentions clearly to other people or to the applications that process the encoded content. Software designers must likewise guess the designer's likely intentions and build those guesses into software tools and applications. Sometimes second-order guesswork is needed: the software designer guesses what the content developer inferred about the markup language designer's intent.
Obviously, such guesses are incomplete, fallible, and unverified. Producing and implementing them is time-consuming and labor-intensive, and the resulting functionality and interoperability are poor. Accompanying an SGML/XML specification with ordinary natural language documentation does not fully solve the problem. Such documentation does give content providers and software engineers some guidance, but there are currently no general rules for how SGML/XML vocabularies should be documented. In any case, natural language documentation is not machine-readable, and that is precisely the problem with SGML/XML markup systems discussed here.
The idea of machine-processable semantic description for SGML and XML has not yet taken shape. This is the source of current engineering problems and an obstacle to future development. There has been little research on markup semantics so far, although many scholars have begun to pay attention to the issue. The W3C's work on XML Schema is relevant, but it covers only a small part of the problem (such as data types). The W3C's "Semantic Web" activity is also relevant, but it aims at general-purpose, XML-based knowledge representation technology. Our research focuses on the semantics of document markup, which at present is hidden inside actual document processing systems. One might say that the essence of the Semantic Web is to design semantic tags. In this article, however, we argue that solving the problems above also requires a deeper account of what tags themselves mean.
The rest of this article first explains the meaning of markup against its historical background (markup played an interesting role in the development of text processing methods). It then describes in detail the factors that create the need for a formal markup semantics and the factors that determine its requirements. Finally, it briefly introduces a multi-institution research project that is working on the problem of markup semantics: the BECHAMEL Markup Semantics Project.
2 Historical Background
Document markup can probably be counted as part of any communication system, including early writing, copying, publishing, and printing. With the development of digital text processing and typesetting, however, the use of markup became deliberate and widespread, and it became an important area of innovation in system development. The period from the 1960s to the 1980s saw comprehensive and systematic development of document markup systems, with the focus on improving the effectiveness and functionality of digital typesetting and text processing. In the early 1980s, people were still working toward a theoretical framework for markup that could support the development of high-performance systems. Some results in this area were published, but most are recorded only in working documents and in the products of various standardization efforts.
A view that emerged in this period is that a document, as an intellectual product, is better modeled as an ordered hierarchy of objects (such as chapters, paragraphs, and formulas) than as a one-dimensional character stream. In different document processing and storage systems, a character stream is often interleaved with formatting codes, structures describing the layout (such as page numbers, columns, and printed lines), matrices of pixel values, and other representations. The ordered hierarchy model distinguishes two essentially different kinds of markup: markup that identifies editorial text objects (titles, chapters, and so on) and markup that specifies layout requirements. The former has proved successful. Document elements such as titles, chapters, paragraphs, equations, and citations can be marked explicitly with delimiting tags and then processed indirectly through rules associated with the element type. This separation of content and form brings the usual engineering economies of indirection and abstraction. It has enormous and varied practical value in every aspect of document processing and, more importantly, it seems to shed light on the question of what a document actually is. The descriptive markup used for this purpose not only marks the extent of an element but also carries the meaning that the document model is meant to reveal (for example, that this text is a chapter).
In the early 1980s, standardization work under ANSI and ISO produced the influential SGML document markup metagrammar, which consolidated earlier theoretical and analytical work on markup and document structure. SGML provides a machine-readable form for defining descriptive markup languages. As a metagrammar, SGML does not define a markup language itself; it specifies the techniques for defining markup languages in machine-readable form. The core of such a definition is a formal notation similar to Backus-Naur Form (BNF). It is supplemented by rules for defining typed attributes and their values, and by other devices for further abstraction and indirection (see the commentary on document type definitions (DTDs) and the summary of how closely they resemble BNF). Structurally, an SGML document is a tree with ordered branches and labeled nodes, and it is a formal product of its corresponding DTD.
After years of analysis and practice, the basic ideas behind SGML became well known. By combining an industry-wide standard at the metagrammar level with local innovation at the vocabulary level, SGML's characteristic mechanisms (a BNF-like metagrammar, typed attribute/value pairs, entity references, and so on) could be implemented efficiently in applications and tools. SGML markup languages could evolve while still supporting and streamlining workflows for the design, implementation, and use of document systems. From the mid-1980s to the early 1990s, a large number of SGML-based markup systems were developed.
Although the development of SGML received a great deal of attention, and its ideas were sound and were implemented successfully in several fields, it saw almost no widespread use during its first ten years. Many factors contributed to this, but the most important was that SGML itself is too complex. In particular, SGML contains many complex optional features that software need not implement at all, which made the development of SGML software very slow. Worse, a document cannot be analyzed further unless it is processed together with its DTD. Markup minimization means that element boundaries cannot be determined without reference to the document grammar. SGML also has other features that prevent existing parsing tools from being applied to its formal grammar and from parsing it efficiently.
In online publishing and communication, SGML found an application in HTML (Hypertext Markup Language). The original version of HTML was loosely defined and lacked a formal syntax specification. Interest in an SGML DTD for HTML came later, and it proved difficult to design a DTD for what had already become accepted as "correct" practice. More importantly, because vendors arbitrarily added procedural tags to the original HTML specification (such as
Consider the following fragment of an XML document.
Readers familiar with the structure of the document's elements will naturally recognize from the tags that P stands for paragraph. The paragraph has a title, and the content after the title element forms the body text, which starts after the title element and ends before the paragraph end tag. Where the meaning and usage of a tag are not immediately obvious, authors and readers can consult the tag set documentation.
This obvious meaning serves human readers only: it cannot be extracted from the data structure with a document parser. As Figure 1 shows, the parse tree (the view used by stylesheet programmers) presents the head, the quotation, and the text before and after the quotation as separate child nodes of the paragraph. The parse tree cannot show the following facts: the head is a property of the entire paragraph, the two stretches of text are parts of a single content structure, and the quotation is embedded in that text.
In fact, the data structure itself draws no distinction between paragraphs and quotations or anything related to them. It is simply a graph of nodes carrying information, such as a generic identifier with the value "paragraph". The programmer must infer the correspondence between the meaning of the document and the tags used, and exploit this knowledge when converting the tree from one form to another. Such a transformation (for example, via XSLT, DSSSL, or a programming language like C++) therefore relies on the programmer's semantic reasoning rather than on explicitly encoded semantics.
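The flat view a parser exposes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment, its element names (`p`, `head`, `q`), and its text are invented for the sketch and are not the article's own example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names and text are invented for illustration.
FRAGMENT = "<p><head>On Markup</head>She wrote <q>tags carry meaning</q> and left.</p>"

def child_nodes(element):
    """List the element's child nodes in document order, as a parser exposes them."""
    nodes = []
    if element.text and element.text.strip():
        nodes.append(("text", element.text.strip()))
    for child in element:
        nodes.append((child.tag, "".join(child.itertext()).strip()))
        # Text that follows a child element is stored as that child's tail.
        if child.tail and child.tail.strip():
            nodes.append(("text", child.tail.strip()))
    return nodes

paragraph = ET.fromstring(FRAGMENT)
nodes = child_nodes(paragraph)
for kind, content in nodes:
    print(kind, repr(content))
```

The head, both stretches of text, and the quotation come back as four siblings. Nothing in this structure records that the head applies to the whole paragraph or that the quotation sits inside a single run of body text.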
Figure 2 shows how the syntax tree can be enriched with semantic knowledge. Knowledge representation techniques can encode whole-part relationships at a higher level that is better suited to computer processing. The figure uses a traditional semantic network representation. Other approaches are also being explored, including frame-based representation, rule-based representation, formal grammars, and logic-based representation. The development of the Semantic Web (Section 8 of this article) may even yield suitable representations expressed in markup languages themselves. The crux of the matter is to establish a hierarchy of abstractions, relationships, and constraints that traditional XML/SGML parsers cannot model or enforce.
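One way to picture such an enriched representation is as a set of subject-relation-object triples derived from the parsed tree. The following is a minimal sketch, not the project's method: the fragment, the node labels, and the relation names (`is_a`, `title_of`, `part_of`, `embedded_in`) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names and text are invented for illustration.
FRAGMENT = "<p><head>On Markup</head>She wrote <q>tags carry meaning</q> and left.</p>"

def semantic_triples(paragraph):
    """Derive (subject, relation, object) triples from a parsed paragraph.

    The relation names are assumptions made for this sketch.
    """
    triples = [("p", "is_a", "paragraph")]
    if paragraph.find("head") is not None:
        # The head describes the paragraph as a whole.
        triples.append(("head", "title_of", "p"))
    # Everything after the head is treated as one body, not as separate siblings.
    triples.append(("body", "part_of", "p"))
    if paragraph.find("q") is not None:
        # The quotation is embedded in the body text.
        triples.append(("q", "embedded_in", "body"))
    return triples

triples = semantic_triples(ET.fromstring(FRAGMENT))
for triple in triples:
    print(triple)
```

The rules that produce the triples are exactly the knowledge a human reader brings to the markup. Here they are hard-coded in a function, whereas a formal markup semantics would state them declaratively.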
Knowledge encoded in machine-readable files (as DTDs and grammars already are) can be used to verify a document's semantic constraints, giving applications a more powerful document model. These more expressive representations provide strong support for the design and implementation of better document processing systems.
6 Application
In recent years, many new technologies have made conventional structured markup increasingly popular. These technologies raise the following concerns in information management.
Transformation and integration. For SGML/XML developers, the most common job is to write transformations that convert one application's syntax into another's. This is done to create new renditions of a document or to make it easier to store in a database. Sometimes developers need to integrate or adapt large collections of digital documents, each encoded in a markup language that does not interoperate with the others. Whatever the scale of the conversion, the conventional solution is to use a transformation language that acts directly on the parse tree. The tree produced by parsing the source document is converted into an instance of a tree in the target language. The converted tree is then serialized as a new document instance, as graphics, or as audio.
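A conversion that acts directly on the parse tree can be sketched as follows. This is a minimal illustration, assuming two invented vocabularies and a hypothetical tag mapping; real transformations are usually written in a dedicated language such as XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping between two invented vocabularies.
TAG_MAP = {"p": "para", "head": "title", "q": "quote"}

def convert(element):
    """Rebuild the source tree in the target vocabulary, node by node."""
    target = ET.Element(TAG_MAP.get(element.tag, element.tag), dict(element.attrib))
    target.text = element.text
    target.tail = element.tail
    for child in element:
        target.append(convert(child))
    return target

source = ET.fromstring(
    "<p><head>On Markup</head>She wrote <q>tags carry meaning</q>.</p>"
)
# Serialize the converted tree as a new document instance.
result = ET.tostring(convert(source), encoding="unicode")
print(result)
```

The mapping table encodes the programmer's belief that the two vocabularies mean the same thing. Nothing in either document states that equivalence, so it cannot be checked mechanically.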
Information silos. This problem closely resembles the transformation problem above, but the goal is not to convert documents from one form to another. Instead, the goal is to give system users a common, transparent interface to documents or document fragments that are stored in a distributed fashion. Although documents need not be converted wholesale from one markup language to another, the system must make their content appear to blend seamlessly, even though their encodings may differ widely.
Accessibility. Authoring tools increasingly embrace structured markup, which has been a boon for visually impaired users of digital documents. Declarative markup lets people read with a screen reader or braille display and draw inferences from mnemonic tag names rather than from visual cues. However, such applications currently depend on the capabilities of the user or the interface software, and on structural inferences drawn from tag names or the grammar alone. Whether the markup respects the syntactic constraints, meanings, and uses described in the tag set documentation depends entirely on the reliability of the document author. Unfortunately, authors often misuse tags. The worst example is using heading tags purely to achieve a certain layout on web pages.
Safe processing. Part of the impetus for developing more expressive schema languages (such as the W3C's XML Schema language) is the realization that the consequences of markup errors, misuse, and abuse can be far more serious than badly formatted output. Declarative markup is used not only in e-commerce but also in safety-critical fields such as medical records and the aviation industry. Developers in these fields must ensure not only that digital documents are syntactically well structured, but also that they comply with safety protocols governing the processing, storage, transmission, and presentation of documents.
7 Advantages of markup semantics
Findings so far from the BECHAMEL project suggest that markup semantics can address the problems above in the following ways.
Declarative, machine-readable semantic description. At present, designers of structured markup languages use natural language prose to express the meaning of tags and to explain their proper use. A formal markup semantics would allow ontological relationships to be stated explicitly in a form that computer programs can read and process automatically.
Hypothesis verification. Where a tag set has no formal documentation, a system that can interpret tag semantics provides an environment for testing guesses and validating hypotheses. In such an environment, the user of an undocumented markup language proposes properties and rules that he or she believes are applied consistently throughout the document collection. The document processing software then retrieves the document elements that do, or do not, conform to the hypothesized rules.
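The retrieval step can be sketched as a function that partitions elements according to a hypothesized rule. This is a minimal illustration under stated assumptions: the corpus, the `id` attributes, and the rule itself ("every paragraph begins with a head") are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: element names, ids, and text are invented for illustration.
CORPUS = """<doc>
  <p id="a"><head>First</head>Some text.</p>
  <p id="b">No heading here.</p>
  <p id="c"><head>Third</head>More text.</p>
</doc>"""

def starts_with_head(paragraph):
    """Hypothesized rule: the first child is a head, with no text before it."""
    children = list(paragraph)
    no_leading_text = not (paragraph.text or "").strip()
    return bool(children) and children[0].tag == "head" and no_leading_text

def check_hypothesis(root, tag, rule):
    """Split all elements named `tag` into those that satisfy the rule and those that do not."""
    conforming, violating = [], []
    for element in root.iter(tag):
        (conforming if rule(element) else violating).append(element.get("id"))
    return conforming, violating

conforming, violating = check_hypothesis(ET.fromstring(CORPUS), "p", starts_with_head)
print("conforming:", conforming)
print("violating:", violating)
```

A user who sees that paragraph `b` violates the rule can either revise the hypothesis or treat `b` as an encoding error. Either outcome is more than a syntax check alone would reveal.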
Enforcement of semantic constraints. A validating parser with semantic support could not only check syntax as a conventional parser does, but also test hypotheses while the semantics are being discovered or written. Such a parser could also enforce semantic constraints. This operation resembles hypothesis verification, except that here the semantic constraints are known and normative.
Better and more expressive APIs. Markup semantics are already at work whenever SGML and XML applications transform or render digital documents, but the higher-level properties and relationships surface only in the behavior of the running program. Formal, machine-readable semantics would enrich application interfaces and speed up software design. As markup languages evolve and change, such software would also be easier and safer to maintain.
8 Related work
Many other document processing technologies, standards, and research programs respond to the challenges and problems above. This section reviews existing approaches that attempt to address them.
Semantic Web. The Semantic Web refers to a number of interconnected research and standardization efforts that, like the present work, bring together markup and knowledge representation technologies. At its core is the W3C's Resource Description Framework (RDF), and it also includes other technologies such as ISO Topic Maps. The Semantic Web has a wide scope and ambitious goals: it aims to use general-purpose knowledge representation technology to improve markup languages, thereby "promoting the comprehensive development of human knowledge." Semantic Web research and standardization differs from the present work. Rather than describing the semantics of one specific domain, it aims at the semantic annotation of knowledge in all domains. The present research focuses specifically on "document markup semantics" rather than "general semantic markup". Advances in Semantic Web technology may make it possible to use Semantic Web markup languages to encode the semantics of tags.
W3C's Document Object Model. The Document Object Model (DOM) is an application programming interface to the hierarchical data structure produced by parsing an XML document. The aim is to design a system that provides comparable interfaces to markup semantics, much as the DOM provides interfaces to markup syntax, and ultimately to form a "semantic DOM" that complements the W3C's syntactic DOM.
W3C XML Schema. XML Schema is an XML-based language that can replace traditional DTDs as a means of constraining XML documents. Its development was driven by limitations of DTDs that resemble the problems addressed by the BECHAMEL project. XML Schema allows document designers to define complex data types, much as in high-level programming languages. However, encoding all the relationships and constraints found in tag set documentation requires a more powerful form of expression than the current XML Schema.
Architectural forms of the Hypermedia/Time-based Structuring Language (HyTime). The architectural forms technique arose from the recognition that different markup language applications often use structures that vary in style but are semantically equivalent. Architectural forms allow document class designers to map their own specific element types to more general architectural types that are easier to map between applications. These mappings represent a constrained form of semantic knowledge and help with the transformation and integration challenges described above. The BECHAMEL project aims, in part, to build a model that expresses more semantic relationships than architectural forms can.
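The idea of mapping application-specific element types to more general ones can be sketched as follows. This is only an illustration: both vocabularies and the general form names are invented, and a plain dictionary stands in for however a real architecture declares its mappings.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from two invented vocabularies to shared general forms.
FORMS_A = {"p": "block", "head": "title", "q": "quotation"}
FORMS_B = {"para": "block", "ttl": "title", "quote": "quotation"}

def architectural_view(element, forms):
    """Return the element's nested structure expressed in the general forms."""
    return (
        forms.get(element.tag, element.tag),
        [architectural_view(child, forms) for child in element],
    )

doc_a = ET.fromstring("<p><head>T</head>x <q>y</q></p>")
doc_b = ET.fromstring("<para><ttl>T</ttl>x <quote>y</quote></para>")

view_a = architectural_view(doc_a, FORMS_A)
view_b = architectural_view(doc_b, FORMS_B)
print(view_a)
print(view_a == view_b)
```

Seen through the shared forms, the two differently tagged documents have the same structure. The mapping captures equivalence of element types only, which is why it counts as a constrained form of semantic knowledge.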