Home  >  Article  >  类库下载  >  Java Basics Series 17: Detailed explanation of using DOM, SAX, JDOM, and DOM4J to parse XML files

Java Basics Series 17: Detailed explanation of using DOM, SAX, JDOM, and DOM4J to parse XML files

高洛峰
高洛峰Original
2016-10-13 15:57:121685browse

1 Introduction

In Java, there are many ways to parse XML files, among which the four most common methods are probably DOM, SAX, JDOM, and DOM4J. Among them, DOM and SAX, two ways of parsing XML files, have jdk's own API, so there is no need to introduce additional third-party jar packages. On the contrary, both JDOM and DOM4J parsing methods are third-party open source projects, so when using these two methods to parse XML files, you need to introduce additional related jar packages

(1) DOM

DOM is used with The official W3C standard for representing XML documents in a platform- and language-independent way. The DOM is a collection of nodes or pieces of information organized in a hierarchical structure that allows developers to find specific information in the tree. Analyzing this structure usually requires loading the entire document and constructing the hierarchy before any work can be done. Therefore, when using DOM to parse XML files, the parser needs to read the entire XML file into memory to form a tree structure. Convenient for subsequent operations

Advantages: The entire document tree is in memory, easy to operate; supports deletion, modification, rearrangement and other operations

Disadvantages: Transferring the entire document into memory (including useless nodes) is a waste of time and memory , if the XML is too large, it is easy to have a memory overflow problem

(2) SAX

Since the DOM needs to read the entire file at once when parsing the XML file, there are many shortcomings when the file is too large, so in order to solve this problem, there is With SAX, an event-driven parsing method,

SAX parses XML files by continuously loading content into memory from top to bottom. When the parser finds the start mark, end mark, text, document start mark, document When the end flag and other related flags are used, some corresponding events will be triggered. All we need to do is to write custom code in the methods of these events to save the obtained data

Advantages: There is no need to load the entire document in advance. Occupies less resources (memory); the code written using SAX parsing is less than the code written using DOM parsing

Disadvantages: not persistent; after the event, if the data is not saved, the data is lost; stateless; from Only text can be obtained in the event, but I don’t know which element the text belongs to

(3) JDOM

Using JDOM to parse XML files is similar to using DOM to parse. From the code point of view, the parsing ideas are similar. JDOM differs from DOM in two main ways: First, JDOM only uses concrete classes instead of interfaces, which simplifies the API in some aspects, but also limits flexibility. Secondly, JDOM’s API makes extensive use of Collections classes, simplifying the use of Java developers who are already familiar with these classes. Advantages: open source projects; easier to understand than DOM. Disadvantages: JDOM itself does not contain a parser. It usually uses a SAX2 parser to parse and validate input XML documents

(4) DOM4J

DOM4J is a very, very excellent Java XML API with excellent performance, powerful functions and extreme ease of use. It is also a Open source software. Nowadays you can see that more and more Java software is using DOM4J to read and write XML

Because DOM4J is very powerful in terms of performance and code writing, especially when the XML file is large, use DOM4J to parse it There will also be higher efficiency. Therefore, it is recommended that if you need to parse XML files, you can consider using DOM4J to parse them as much as possible. Of course, if the file is very small, it is also possible to use DOM to parse it

Advantages:

Open source project

DOM4J is an intelligent branch of JDOM, which incorporates functions that require more than basic XML documents

Excellent performance and flexibility Good, easy to use and other features

Second DOM parsing XML files

(1) Before writing and testing the code, you need to prepare an XML file. The file I prepared here is: demo1.xml

demo1.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<employees>
    <user id="1">
        <name>zifangsky</name>
        <age>10</age>
        <sex>male</sex>
        <contact>https://www.zifangsky.cn</contact>
    </user>
    <user id="2">
        <name>admin</name>
        <age>20</age>
        <sex>male</sex>
        <contact>https://www.tar.pub</contact>
    </user>
</employees>

(2) Code example:

package cn.zifangsky.xml;
 
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
 
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
 
public class DomParseTest {
 
    public static void main(String[] args) {
        DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
            // 加载一个xml文件
            Document document = dBuilder
                    .parse("src/cn/zifangsky/xml/demo1.xml");
            // 获取user节点集合
            NodeList userList = document.getElementsByTagName("user");
            int userListLength = userList.getLength();
            System.out.println("此xml文件一共有" + userListLength + "个&#39;user&#39;节点\n");
 
            // 遍历
            for (int i = 0; i < userListLength; i++) {
                // 通过item方法获取指定的节点
                Node userNode = userList.item(i);
 
                // *********************解析属性***********************
 
                // 获取该节点的所有属性值,如:id="1"
                NamedNodeMap userAttributes = userNode.getAttributes();
                System.out.println("&#39;user&#39;节点" + i + "有"
                        + userAttributes.getLength() + "个属性:");
                /**
                 * 1 在不清楚有哪些属性的情况下可以遍历所有属性,
                 * 并获取每个属性对应的属性名和属性值
                 * */
                for (int j = 0; j < userAttributes.getLength(); j++) {
                    // &#39;user&#39;节点的每个属性组成的节点
                    Node attrnNode = userAttributes.item(j);
                    System.out.println("属性" + j + ": 属性名: "
                            + attrnNode.getNodeName() + " ,属性值: "
                            + attrnNode.getNodeValue());
                }
                /**
                 * 2 在知道有哪些属性值的情况下,可以获取指定属性名的属性值
                 * */
                Element userElement = (Element) userList.item(i);
                System.out.println("属性为&#39;id&#39;的对应值是: "
                        + userElement.getAttribute("id"));
 
                // *********************解析子节点************************
                NodeList childNodes = userNode.getChildNodes();
                System.out.println("\n该节点一共有" + childNodes.getLength()
                        + "个子节点,分别是:");
 
                // 遍历子节点
                for (int k = 0; k < childNodes.getLength(); k++) {
                    Node childNode = childNodes.item(k);
                    // 从输出结果可以看出,每行后面的换行符也被当做了一个节点,因此是:4+5=9个子节点
                    // System.out.println("节点名: " + childNode.getNodeName() +
                    // ",节点值: " + childNode.getTextContent());
                    // 仅取出子节点中的&#39;ELEMENT_NODE&#39;,换行符组成的Node是&#39;TEXT_NODE&#39;
                    if (childNode.getNodeType() == Node.ELEMENT_NODE) {
                        // System.out.println("节点名: " + childNode.getNodeName()
                        // + ",节点值: " + childNode.getTextContent());
                        // 最低一层是文本节点,节点名是&#39;#text&#39;
                        System.out.println("节点名: " + childNode.getNodeName()
                                + ",节点值: "
                                + childNode.getFirstChild().getNodeValue());
                    }
                }
 
                System.out.println("***************************");
            }
 
        } catch (Exception e) {
            e.printStackTrace();
        }
 
    }
 
}

As can be seen from the above code, when using DOM to parse XML files, you generally need to do the following steps:

Create a Document Builder Factory (DocumentBuilderFactory) instance

Through the above The DocumentBuilderFactory generates a new document builder (DocumentBuilder)

Use the above DocumentBuilder to parse (parse) an XML file and generate a document tree (Document)

Get the node with the specified id through the Document or get all the qualified nodes based on the node name The node set

traverses each node, and you can obtain the attributes, attribute values ​​and other related parameters of the node

If the node still has child nodes, you can continue to traverse all its child nodes according to the above method

(3) above The code output is as follows:

此xml文件一共有2个&#39;user&#39;节点
 
&#39;user&#39;节点0有1个属性:
属性0: 属性名: id ,属性值: 1
属性为&#39;id&#39;的对应值是: 1
 
该节点一共有9个子节点,分别是:
节点名: name,节点值: zifangsky
节点名: age,节点值: 10
节点名: sex,节点值: male
节点名: contact,节点值: https://www.zifangsky.cn
***************************
&#39;user&#39;节点1有1个属性:
属性0: 属性名: id ,属性值: 2
属性为&#39;id&#39;的对应值是: 2
 
该节点一共有9个子节点,分别是:
节点名: name,节点值: admin
节点名: age,节点值: 20
节点名: sex,节点值: male
节点名: contact,节点值: https://www.tar.pub
***************************

Three SAX parsed XML files

在进行本次测试时,并不引入其他XML文件,仍然使用上面的demo1.xml文件

由于SAX解析XML文件跟DOM不同,它并不是将整个文档都载入到内存中。解析器在解析XML文件时,通过逐步载入文档,从上往下一行行的解析XML文件,在碰到文档开始标志、节点开始标志、文本文档、节点结束标志、文档结束标志时进行对应的事件处理。因此,我们首先需要构造一个这样的解析处理器来申明:当解析到这些标志时,我们需要进行怎样的自定义处理

(1)解析处理器SAXParseHandler.java:

package cn.zifangsky.xml;
 
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
 
public class SAXParseHandler extends DefaultHandler {
 
    /**
     * 用来遍历XML文件的开始标签
     * */
    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        super.startElement(uri, localName, qName, attributes);
         
        //解析&#39;user&#39;元素的属性值
//      if(qName.equals("user"))
//          System.out.println("&#39;user&#39;元素的id属性值是:" + attributes.getValue("id"));
         
        //遍历并打印元素的属性
        int length = attributes.getLength();
        if(length > 0){
            System.out.println("元素&#39;" + qName + "&#39;的属性是:");
             
            for(int i=0;i<length;i++){
                System.out.println("    属性名:" + attributes.getQName(i) + ",属性值: " + attributes.getValue(i));
            }
            System.out.println();
        }
         
        System.out.print("<" + qName + ">");
 
    }
 
    /**
     * 用来遍历XML文件的结束标签
     * */
    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
         
        System.out.println("<" + qName + "/>");
    }
     
    /**
     * 文本内容
     * */
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        super.characters(ch, start, length);
        String value = new String(ch, start, length).trim();
        if(!value.equals(""))
            System.out.print(value);
    }
 
    /**
     * 用来标识解析开始
     * */
    @Override
    public void startDocument() throws SAXException {
        System.out.println("SAX解析开始");
        super.startDocument();
    }
     
    /**
     * 用来标识解析结束
     * */
    @Override
    public void endDocument() throws SAXException {
        System.out.println("SAX解析结束");
        super.endDocument();
    }
 
}

关于上面代码的一些含义我这里就不再做解释了,可以自行参考注释内容

(2)测试:

SAXParseTest.java文件:

package cn.zifangsky.xml;
 
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
 
public class SAXParseTest {
 
    public static void main(String[] args) {
        SAXParserFactory sFactory = SAXParserFactory.newInstance();
        try {
            SAXParser saxParser = sFactory.newSAXParser();
            //创建自定义的SAXParseHandler解析类
            SAXParseHandler saxParseHandler = new SAXParseHandler();
            saxParser.parse("src/cn/zifangsky/xml/demo1.xml", saxParseHandler);
 
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
 
}

从上面的代码可以看出,使用SAX解析XML文件时,一共传递进去了两个参数,分别是:XML文件路径和前面定义的解析处理器。有了具体的XML文件以及对应的处理器来处理对应的标志事情,因此SAX这种解析方式就可以顺利地进行解析工作了

(3)上面测试的输出如下:

SAX解析开始
<employees>元素&#39;user&#39;的属性是:
    属性名:id,属性值: 1
 
<user><name>zifangsky<name/>
<age>10<age/>
<sex>male<sex/>
<contact>https://www.zifangsky.cn<contact/>
<user/>
元素&#39;user&#39;的属性是:
    属性名:id,属性值: 2
 
<user><name>admin<name/>
<age>20<age/>
<sex>male<sex/>
<contact>https://www.tar.pub<contact/>
<user/>
<employees/>
SAX解析结束

四 JDOM解析XML文件

跟前面两种解析方式不同的是,使用JDOM来解析XML文件需要下载额外的jar包

(1)下载jar包并导入到项目中:

下载地址:http://www.jdom.org/downloads/index.html

目前最新版本是:JDOM 2.0.6

然后将下载得到的“jdom-2.0.6.jar”文件导入到测试项目中

注:关于如何在一个Java项目中导入额外的jar,这里将不多做解释,不太会的童鞋可以自行百度

(2)测试代码:

JDOMTest.java:

package cn.zifangsky.xml;
 
import java.util.List;
 
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;
 
public class JDOMTest {
 
    /**
     * @param args
     */
    public static void main(String[] args) {
        SAXBuilder saxBuilder = new SAXBuilder();
        try {
            Document document = saxBuilder.build("src/cn/zifangsky/xml/demo1.xml");
             
            //获取XML文件的根节点
            Element rootElement = document.getRootElement();
//          System.out.println(rootElement.getName());
            List<Element> usersList = rootElement.getChildren();  //获取子节点
            for(Element u : usersList){
//              List<Attribute> attributes = u.getAttributes();
//              for(Attribute attribute : attributes){
//                  System.out.println("属性名:" + attribute.getName() + ",属性值:" + attribute.getValue());
//              }
                System.out.println("&#39;id&#39;的值是: " + u.getAttributeValue("id"));
            }
             
             
        }catch (Exception e) {
            e.printStackTrace();
        }
 
    }
 
}

从上面的代码可以看出,使用JDOM来解析XML文件,主要需要做以下几个步骤:

新建一个SAXBuilder

通过SAXBuilder的build方法传入一个XML文件的路径得到Document

通过Document的getRootElement方法获取根节点

通过getChildren方法获取根节点的所有子节点

然后是遍历每个子节点,获取属性、属性值、节点名、节点值等内容

如果该节点也有子节点,然后同样可以通过getChildren方法获取该节点的子节点

后面的步骤跟上面一样,不断递归到文本节点截止

(3)上面测试的输出如下:

&#39;id&#39;的值是: 1
&#39;id&#39;的值是: 2

五 DOM4J解析XML文件

jar包下载地址:https://sourceforge.net/projects/dom4j/files/

同样,在使用DOM4J解析XML文件时需要往项目中引入“dom4j-1.6.1.jar”文件

(1)一个简单实例:

i)DOM4JTest.java:

package cn.zifangsky.xml;
 
import java.io.File;
import java.util.Iterator;
import java.util.List;
 
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
 
public class DOM4JTest {
 
    public static void main(String[] args) {
        SAXReader reader = new SAXReader();
        try {
            Document document = reader.read(new File("src/cn/zifangsky/xml/demo1.xml"));
            //获取XML文件的根节点
            Element rootElement = document.getRootElement();
            System.out.println(rootElement.getName());
             
            //通过elementIterator方法获取迭代器
            Iterator<Element> iterator = rootElement.elementIterator();
            //遍历
            while(iterator.hasNext()){
                Element user = iterator.next();
                //获取属性并遍历
                List<Attribute> aList = user.attributes();
             
                for(Attribute attribute : aList){
                    System.out.println("属性名:" + attribute.getName() + ",属性值:" + attribute.getValue());
                }
                 
                //子节点
                Iterator<Element> childList = user.elementIterator();
                while(childList.hasNext()){
                    Element child = childList.next();
//                  System.out.println(child.getName() + " : " + child.getTextTrim());
                    System.out.println(child.getName() + " : " + child.getStringValue());
                }
            }
         
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
 
}

从上面的代码可以看出,跟前面的JDOM解析方式流程是差不多的,并且关键地方也有注释,因此这里就不多做解释了

ii)上面的代码输出如下:

employees
属性名:id,属性值:1
name : zifangsky
age : 10
sex : male
contact : https://www.zifangsky.cn
属性名:id,属性值:2
name : admin
age : 20
sex : male
contact : https://www.tar.pub

(2)将XML文件解析成Java对象:

i)为了方便测试,这里准备一个新的XML文件:

demo2.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<user id="2">
    <name>zifangsky</name>
    <age>100</age>
    <sex>男</sex>
    <contact>https://www.zifangsky.cn</contact>
    <ownPet id="1">旺财</ownPet>
    <ownPet id="2">九头猫妖</ownPet>
</user>

ii)同时准备一个Java实体类,恰好跟上面的XML文件中的属性相对应:

User.java:

package cn.zifangsky.xml;
 
import java.util.List;
 
public class User {
    private String name;
    private String sex;
    private int age;
    private String contact;
    private List<String> ownPet;
 
    public String getName() {
        return name;
    }
 
    public void setName(String name) {
        this.name = name;
    }
 
    public String getSex() {
        return sex;
    }
 
    public void setSex(String sex) {
        this.sex = sex;
    }
 
    public int getAge() {
        return age;
    }
 
    public void setAge(int age) {
        this.age = age;
    }
 
    public String getContact() {
        return contact;
    }
 
    public void setContact(String contact) {
        this.contact = contact;
    }
 
    protected List<String> getOwnPet() {
        return ownPet;
    }
 
    protected void setOwnPet(List<String> ownPet) {
        this.ownPet = ownPet;
    }
 
    @Override
    public String toString() {
        return "User [name=" + name + ", sex=" + sex + ", age=" + age
                + ", contact=" + contact + ", ownPet=" + ownPet + "]";
    }
}

iii)测试代码:

XMLtoJava.java:

package cn.zifangsky.xml;
 
import java.io.File;
import java.util.ArrayList;
import java.util.List;
 
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
 
public class XMLtoJava {
 
    public User parseXMLtoJava(String xmlPath){
        User user = new User();
        List<String> ownPet = new ArrayList<String>();
         
        SAXReader saxReader = new SAXReader();
        try {
            Document document = saxReader.read(new File(xmlPath));
            Element rootElement = document.getRootElement();  //获取根节点
             
            List<Element> children = rootElement.elements();  //获取根节点的子节点
            //遍历
            for(Element child : children){
                String elementName = child.getName();  //节点名
                String elementValue = child.getStringValue();  //节点值
                switch (elementName) {
                case "name":
                    user.setName(elementValue);
                    break;
                case "sex":
                    user.setSex(elementValue);
                    break; 
                case "age":
                    user.setAge(Integer.valueOf(elementValue));
                    break;
                case "contact":
                    user.setContact(elementValue);
                    break; 
                case "ownPet":
                    ownPet.add(elementValue);
                    break; 
                default:
                    break;
                }
     
            }
            user.setOwnPet(ownPet);
 
        } catch (Exception e) {
            e.printStackTrace();
        }
        return user;
    }
     
    public static void main(String[] args) {
        XMLtoJava demo = new XMLtoJava();
        User user = demo.parseXMLtoJava("src/cn/zifangsky/xml/demo2.xml");
        System.out.println(user);
    }
 
}

经过前面的分析之后,上面这个代码也是很容易理解的:通过遍历节点,如果节点名跟Java类中的某个属性名相对应,那么就将节点值赋值给该属性

iv)上面的代码输出如下:

User [name=zifangsky, sex=男, age=100, contact=https://www.zifangsky.cn, ownPet=[旺财, 九头猫妖]]

(3)解析一个XML文件并尽可能原样输出:

DOM4JTest2:

package cn.zifangsky.xml;
 
import java.io.File;
import java.util.List;
 
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
 
public class DOM4JTest2 {
 
    /**
     * 解析XML文件并尽可能原样输出
     * 
     * @param xmlPath
     *            待解析的XML文件路径
     * @return null
     * */
    public void parse(String xmlPath) {
        SAXReader saxReader = new SAXReader();
        try {
            Document document = saxReader.read(new File(xmlPath));
            Element rootElement = document.getRootElement();
 
            print(rootElement, 0);
 
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
 
    /**
     * 打印一个XML节点的详情
     * 
     * @param element
     *            一个XML节点
     * @param level
     *            用于判断xml节点前缩进多少的标识,每深入一层则多输出4个空格
     * @return null
     * */
    public void print(Element element, int level) {
        List<Element> elementList = element.elements(); // 当前节点的子节点List
 
        // 空格
        StringBuffer spacebBuffer = new StringBuffer("");
        for (int i = 0; i < level; i++)
            spacebBuffer.append("    ");
        String space = spacebBuffer.toString();
 
        // 输出开始节点及其属性值
        System.out.print(space + "<" + element.getName());
        List<Attribute> attributes = element.attributes();
        for (Attribute attribute : attributes)
            System.out.print(" " + attribute.getName() + "=\""
                    + attribute.getText() + "\"");
 
        // 有子节点
        if (elementList.size() > 0) {
            System.out.println(">");
            // 遍历并递归
            for (Element child : elementList) {
                print(child, level + 1);
            }
            // 输出结束节点
            System.out.println(space + "</" + element.getName() + ">");
 
        } else {
            // 如果节点没有文本则简化输出
            if (element.getStringValue().trim().equals(""))
                System.out.println(" />");
            else
                System.out.println(">" + element.getStringValue() + "</"
                        + element.getName() + ">");
        }
 
    }
 
    public static void main(String[] args) {
        DOM4JTest2 test2 = new DOM4JTest2();
        test2.parse("src/cn/zifangsky/xml/demo3.xml");
 
    }
 
}

这段代码同样没有什么新的东西,原理就是利用递归来不断进行解析输出,注意一下不同层次的节点的缩进即可。刚开始测试时建议用一些结构比较简单的代码,如上面的demo1.xml和demo2.xml文件。在测试没问题时可以选择一些复杂的XML文件来测试是否能够正常输出,比如:

demo3.xml:

<?xml version="1.0" encoding="UTF-8"?>
<application xmlns="http://wadl.dev.java.net/2009/02"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <grammars />
    <resources base="http://localhost:9080/Demo/services/json/checkCode">
        <resource path="/">
            <resource path="addCheckCode">
                <method name="POST">
                    <request>
                        <representation mediaType="application/octet-stream" />
                    </request>
                    <response>
                        <representation mediaType="application/xml">
                            <param name="result" style="plain" type="xs:int" />
                        </representation>
                        <representation mediaType="application/json">
                            <param name="result" style="plain" type="xs:int" />
                        </representation>
                    </response>
                </method>
            </resource>
            <resource path="findCheckCodeByProfileId">
                <method name="POST">
                    <request>
                        <representation mediaType="application/octet-stream">
                            <param name="request" style="plain" type="xs:long" />
                        </representation>
                    </request>
                    <response>
                        <representation mediaType="application/xml" />
                        <representation mediaType="application/json" />
                    </response>
                </method>
            </resource>
        </resource>
    </resources>
</application>

为什么我在标题上说的是尽可能原样输出,其原因就是上面那段解析代码在碰到下面这种XML节点时,输出就不一样了:

<dc:creator><![CDATA[admin]]></dc:creator>
<category><![CDATA[运维]]></category>
<category><![CDATA[zabbix]]></category>
<category><![CDATA[端口]]></category>

这段XML文档节点最后输出如下:

<creator>admin</creator>
<category>运维</category>
<category>zabbix</category>
<category>端口</category>


Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn