Ruby XML, XSLT and XPath tutorials

What is XML?

XML stands for eXtensible Markup Language.

XML is a subset of the Standard Generalized Markup Language (SGML), a markup language used to give electronic documents a structure.

It can be used to mark up data and define data types, and it is a meta-language that lets users define their own markup languages. It is well suited to transport over the World Wide Web, providing a unified way to describe and exchange structured data independently of applications or vendors.

For more information, see our XML tutorial.

XML parser structure and API

XML parsers come in two main kinds: DOM and SAX.

A SAX parser is event-based and scans the XML document from beginning to end. Each time it encounters a syntactic construct (such as a start tag, an end tag, or text), it calls the event handler registered for that construct, which passes the event to the application.
A DOM parser builds the hierarchical structure of the document as a DOM tree in memory, with each node of the tree represented as an object. Once parsing is complete, the entire DOM tree of the document is held in memory.

Parsing and creating XML in Ruby

The REXML library can be used to parse XML documents in Ruby.

REXML is an XML toolkit for Ruby. It is written in pure Ruby and conforms to the XML 1.0 specification.

REXML has been included in the Ruby standard library since Ruby 1.8. (Since Ruby 3.0 it ships as a bundled gem.)

The library is loaded by requiring rexml/document.

All of its classes and methods are encapsulated in the REXML module.
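
Because everything lives in the REXML module, the classes can be referenced by their fully qualified names instead of being mixed into the top-level namespace. The following is a minimal sketch; the inline XML string is a made-up illustration and is not part of the movies.xml example:

```ruby
require 'rexml/document'

# Hypothetical inline document used only for this illustration.
xml = '<greeting lang="en">Hello</greeting>'

# Refer to the class through the REXML module instead of `include REXML`.
doc = REXML::Document.new(xml)

root_name = doc.root.name                 # element name of the root
root_lang = doc.root.attributes["lang"]   # value of the lang attribute
root_text = doc.root.text                 # text content of the root

puts "#{root_name} (#{root_lang}): #{root_text}"
```

Qualifying the names this way avoids adding REXML's constants to the top-level namespace, which can matter in larger programs.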

The REXML parser has the following advantages over other parsers:

It is written 100% in Ruby.
It supports both SAX-style and DOM-style parsing.
It is lightweight, at fewer than 2000 lines of code.
Its methods and classes are easy to understand.
It offers a SAX2-based API and full XPath support.
It ships with Ruby, so no separate installation is needed.

The examples below use the following XML document, saved as movies.xml:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

DOM parser

First, let us parse the XML data with the DOM-style (tree) parser. We start by requiring the rexml/document library; it is often convenient to include REXML in the top-level namespace as well:

#!/usr/bin/ruby -w

require 'rexml/document'
include REXML

xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)

# Get the root element
root = xmldoc.root
puts "Root element : " + root.attributes["shelf"]

# Print the title of every movie
xmldoc.elements.each("collection/movie"){ 
   |e| puts "Movie Title : " + e.attributes["title"] 
}

# Print the type of every movie
xmldoc.elements.each("collection/movie/type") {
   |e| puts "Movie Type : " + e.text 
}

# Print the description of every movie
xmldoc.elements.each("collection/movie/description") {
   |e| puts "Movie Description : " + e.text 
}

The example above produces the following output:

Root element : New Arrivals
Movie Title : Enemy Behind
Movie Title : Transformers
Movie Title : Trigun
Movie Title : Ishtar
Movie Type : War, Thriller
Movie Type : Anime, Science Fiction
Movie Type : Anime, Action
Movie Type : Comedy
Movie Description : Talk about a US-Japan war
Movie Description : A schientific fiction
Movie Description : Vash the Stampede!
Movie Description : Viewable boredom
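
The same tree API can also build a document rather than read one. The following is a minimal sketch of creating XML with REXML; the element and attribute values are made up for illustration and simply mirror the structure of movies.xml:

```ruby
require 'rexml/document'

# Start with an empty document and add a root element with one attribute.
doc = REXML::Document.new
collection = doc.add_element("collection", { "shelf" => "New Arrivals" })

# add_element returns the new element, so children can be added to it.
movie = collection.add_element("movie", { "title" => "Example Movie" })
movie.add_element("format").text = "DVD"
movie.add_element("stars").text = "7"

# Serialize the tree to a string (compact form, no added indentation).
xml_string = doc.to_s
puts xml_string
```

The resulting string can be written to a file or parsed again with Document.new.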

SAX parser

We process the same data file, movies.xml. SAX-style parsing is not recommended for small files such as this one; the following is only a simple example:

#!/usr/bin/ruby -w

require 'rexml/document'
require 'rexml/streamlistener'
include REXML


class MyListener
  include REXML::StreamListener
  def tag_start(*args)
    puts "tag_start: #{args.map {|x| x.inspect}.join(', ')}"
  end

  def text(data)
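    # Note: \w matches word characters, not whitespace. This skips the
    # indentation between tags but also single-word text such as "DVD"
    # or "PG"; use /\A\s*\z/ to skip whitespace-only text instead.
    return if data =~ /^\w*$/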
    abbrev = data[0..40] + (data.length > 40 ? "..." : "")
    puts "  text   :   #{abbrev.inspect}"
  end
end

list = MyListener.new
xmlfile = File.new("movies.xml")
Document.parse_stream(xmlfile, list)

The example above produces the following output:

tag_start: "collection", {"shelf"=>"New Arrivals"}
tag_start: "movie", {"title"=>"Enemy Behind"}
tag_start: "type", {}
  text   :   "War, Thriller"
tag_start: "format", {}
tag_start: "year", {}
tag_start: "rating", {}
tag_start: "stars", {}
tag_start: "description", {}
  text   :   "Talk about a US-Japan war"
tag_start: "movie", {"title"=>"Transformers"}
tag_start: "type", {}
  text   :   "Anime, Science Fiction"
tag_start: "format", {}
tag_start: "year", {}
tag_start: "rating", {}
tag_start: "stars", {}
tag_start: "description", {}
  text   :   "A schientific fiction"
tag_start: "movie", {"title"=>"Trigun"}
tag_start: "type", {}
  text   :   "Anime, Action"
tag_start: "format", {}
tag_start: "episodes", {}
tag_start: "rating", {}
tag_start: "stars", {}
tag_start: "description", {}
  text   :   "Vash the Stampede!"
tag_start: "movie", {"title"=>"Ishtar"}
tag_start: "type", {}
tag_start: "format", {}
tag_start: "rating", {}
tag_start: "stars", {}
tag_start: "description", {}
  text   :   "Viewable boredom"
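
A stream listener usually keeps a little state between events, because each callback sees only one tag or one piece of text. The following is a sketch of that idea; the TitleCollector class and the shortened inline document are hypothetical and are not part of the example above:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that records each movie title and its format.
class TitleCollector
  include REXML::StreamListener

  attr_reader :movies

  def initialize
    @movies = []        # one hash per <movie> element
    @current_tag = nil  # name of the most recently opened element
  end

  def tag_start(name, attrs)
    @current_tag = name
    @movies << { "title" => attrs["title"] } if name == "movie"
  end

  def tag_end(_name)
    @current_tag = nil
  end

  def text(data)
    # Attach the text to the latest movie only while inside <format>.
    @movies.last["format"] = data.strip if @current_tag == "format"
  end
end

# Shortened stand-in for movies.xml so the sketch is self-contained.
xml = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind"><format>DVD</format></movie>
    <movie title="Ishtar"><format>VHS</format></movie>
  </collection>
XML

collector = TitleCollector.new
REXML::Document.parse_stream(xml, collector)
collector.movies.each { |m| puts "#{m['title']}: #{m['format']}" }
```

Only the collected titles and formats are retained here; no tree of the whole document is built.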

XPath and Ruby

We can use XPath to query XML. XPath is a language for finding information in XML documents (see: XPath tutorial).

XPath, the XML Path Language, is used to address parts of an XML document. It is based on the tree structure of XML and provides the ability to locate nodes in that tree.

Ruby supports XPath through REXML's XPath class, which works on the tree-based (Document Object Model) representation of the document.

#!/usr/bin/ruby -w

require 'rexml/document'
include REXML

xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)

# Information about the first movie
movie = XPath.first(xmldoc, "//movie")
p movie

# Print all movie types
XPath.each(xmldoc, "//type") { |e| puts e.text }

# Get all movie formats, returned as an array
names = XPath.match(xmldoc, "//format").map {|x| x.text }
p names

The example above produces the following output:

<movie title='Enemy Behind'> ... </>
War, Thriller
Anime, Science Fiction
Anime, Action
Comedy
["DVD", "DVD", "DVD", "VHS"]
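
XPath expressions can also carry predicates that narrow a match, for example by attribute value. The following is a minimal sketch using a shortened inline stand-in for movies.xml; the variable names are chosen only for illustration:

```ruby
require 'rexml/document'

# Shortened stand-in for movies.xml so the sketch is self-contained.
xml = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Trigun"><format>DVD</format><stars>10</stars></movie>
    <movie title="Ishtar"><format>VHS</format><stars>2</stars></movie>
  </collection>
XML

doc = REXML::Document.new(xml)

# An attribute predicate selects the one movie whose title matches.
trigun_stars = REXML::XPath.first(doc, "//movie[@title='Trigun']/stars").text

# Elements#[] accepts the same kind of XPath expression.
ishtar_format = doc.elements["//movie[@title='Ishtar']/format"].text

puts "Trigun stars: #{trigun_stars}"
puts "Ishtar format: #{ishtar_format}"
```

XPath.first returns nil when nothing matches, so real code should check the result before calling text on it.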

XSLT and Ruby

Two XSLT processors are available for Ruby. Each is described briefly below:

Ruby-Sablotron

This processor is written and maintained by Masayoshi Takahashi. It is written primarily for the Linux operating system and requires the following libraries:

Sablot
Iconv
Expat

You can find these libraries in Ruby-Sablotron.

XSLT4R

XSLT4R was written by Michael Neumann. It has a simple command-line interface and can also be used within third-party applications to transform XML documents.

XSLT4R needs XMLScan to operate. XMLScan is included in the XSLT4R archive and is also a 100% Ruby module. These modules can be installed using the standard Ruby installation method (i.e. ruby install.rb).

The command-line syntax of XSLT4R is as follows:

ruby xslt.rb stylesheet.xsl document.xml [arguments]

To use XSLT4R from within your application, require xslt and pass in the parameters you need. For example:

require "xslt"

# File.read returns the whole file as a single string
# (Array#to_s no longer joins the lines in Ruby 1.9 and later)
stylesheet = File.read("stylesheet.xsl")
xml_doc = File.read("document.xml")
arguments = { 'image_dir' => '/....' }

sheet = XSLT::Stylesheet.new( stylesheet, arguments )

# output to StdOut
sheet.apply( xml_doc )

# output to 'str'
str = ""
sheet.output = [ str ]
sheet.apply( xml_doc )

More information

For complete details on the REXML parser, see the REXML parser documentation.
You can download XSLT4R from the RAA repository.