SAX, DOM & JDOM parsers for beginners

Parsing XML with SAX, DOM & JDOM

Hicham Qaissi


Contents

0. What is an XML parser?
1. Describing the example to develop
2. SAX
3. DOM
4. JDOM
5. Conclusion


0. What is an XML parser?

XML parsers let us analyze and compose XML documents. By analyzing the data and structure of an XML document, we can build objects in a programming language (Java in our case). We can also do the inverse, in other words, build an XML document from some data objects (see Fig. 1). In this manual, I examine three kinds of parser with examples: SAX, DOM and JDOM.

1. Describing the example to develop

The example I develop is an entertaining one, and it is the same for all three APIs (SAX, DOM and JDOM). It consists of analyzing an XML document that contains information about some books: ISBN code (isbn is an attribute), name, author name, price and editorial. The program expects a book code (ISBN) and searches for this book in the XML. If the book exists, all its information is printed to standard output; otherwise, we print a message saying that the book doesn't exist in the XML. Are you finding it as amusing as I do? Let's go!


The XML example (books.xml) is the following:

<books>
    <book isbn="0000000001">
        <name>Book 1</name>
        <author>Author name 1</author>
        <price>12.54</price>
        <editorial>Editorial 1</editorial>
    </book>
    <book isbn="0000000002">
        <name>Book 2</name>
        <author>Author name 2</author>
        <price>58.25</price>
        <editorial>Editorial 2</editorial>
    </book>
    <book isbn="0000000003">
        <name>Book 3</name>
        <author>Author name 3</author>
        <price>29.45</price>
        <editorial>Editorial 3</editorial>
    </book>
    <book isbn="0000000004">
        <name>Book 4</name>
        <author>Author name 4</author>
        <price>78.95</price>
        <editorial>Editorial 4</editorial>
    </book>
    <book isbn="0000000005">
        <name>Book 5</name>
        <author>Author name 5</author>
        <price>61.25</price>
        <editorial>Editorial 5</editorial>
    </book>
</books>


For all parsers (SAX, DOM & JDOM), I use this DTO (Data Transfer Object):

public class MyBook {

    private String isbn;
    private String name;
    private String author;
    private String price;
    private String editorial;

    public String getIsbn() {
        return isbn;
    }

    public void setIsbn(String isbn) {
        this.isbn = isbn;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getEditorial() {
        return editorial;
    }

    public void setEditorial(String editorial) {
        this.editorial = editorial;
    }
}


2. SAX

SAX (Simple API for XML) works with events and associated methods. As the parser reads the XML document and finds its components (the events), such as elements, attributes and values, or detects errors, it invokes the methods that the programmer has associated with them. You can find more information about SAX at www.saxproject.org.

First, make sure that the SAX jar is on the classpath (the jar file can be downloaded from http://sourceforge.net/projects/sax/files/). We must instantiate the reader. This reader implements the XMLReader interface; we can obtain it from the abstract class SAXParser, which I obtain in turn from the SAXParserFactory. The parse method of XMLReader analyzes the XML document:

import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class MySAXSearcher {

    public static void main(String[] args) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            SAXParser saxParser = factory.newSAXParser();
            XMLReader xr = saxParser.getXMLReader();
            // args[0] is the XML document to parse
            xr.parse(args[0]);
        } catch (IOException ioe) {
            System.out.println("Error: " + ioe.getMessage());
        } catch (SAXException saxe) {
            System.out.println("Error: " + saxe.getMessage());
        } catch (ParserConfigurationException pce) {
            System.out.println("Error: " + pce.getMessage());
        }
    }
}

If the program compiles, Java and the jar file are set up correctly. Nevertheless, the program doesn't do anything yet, because we haven't registered interest in any event so far. It's important to catch the exceptions java.io.IOException, org.xml.sax.SAXException and javax.xml.parsers.ParserConfigurationException.


To handle the events, our main class can extend org.xml.sax.helpers.DefaultHandler. DefaultHandler implements the following interfaces:

• org.xml.sax.ContentHandler: events about data (the most widely used).
• org.xml.sax.ErrorHandler: events about errors.
• org.xml.sax.DTDHandler: DTD handling.
• org.xml.sax.EntityResolver: external entities.

We can also write our own classes that implement ContentHandler and ErrorHandler to handle the events we are interested in:

• Data: implement ContentHandler and associate it with the reader (parser) through the method setContentHandler().
• Errors: implement ErrorHandler and associate it with the reader (parser) through the method setErrorHandler().
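As a minimal sketch of how an ErrorHandler is wired to the reader, the class below collects the problems reported while parsing an in-memory string instead of a file. The names CollectingErrorHandler and check are hypothetical, chosen only for this illustration, and the exact wording of the messages depends on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical helper: an ErrorHandler that remembers what the parser reported.
class CollectingErrorHandler implements ErrorHandler {

    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException ex) {
        problems.add("warning: " + ex.getMessage());
    }

    @Override
    public void error(SAXParseException ex) {
        problems.add("error: " + ex.getMessage());
    }

    @Override
    public void fatalError(SAXParseException ex) throws SAXException {
        problems.add("fatal: " + ex.getMessage());
        throw ex; // a fatal error means the document is not well-formed, so stop here
    }

    /** Parses the given XML text and returns every problem that was reported. */
    static List<String> check(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(handler); // the association described above
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Parse problems were already recorded by the handler.
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("The parser could not be set up or read its input", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        System.out.println(check("<books><book isbn=\"1\"/></books>")); // well-formed
        System.out.println(check("<books><book></books>"));             // mismatched tag
    }
}
```

The well-formed document yields an empty list, while the one with the mismatched tag yields a single entry starting with "fatal: ".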

The most important methods of the ContentHandler interface (implemented by DefaultHandler, which our class MySAXSearcher extends) are:

• startDocument(): receives notification of the beginning of a document.
• endDocument(): receives notification of the end of a document.
• startElement(): receives notification of the beginning of an element.
• endElement(): receives notification of the end of an element.
• characters(): receives notification of character data.

See more about ContentHandler at http://download.oracle.com/javase/1.4.2/docs/api/org/xml/sax/ContentHandler.html.
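To see the order in which these methods are invoked, here is a small sketch that records every event instead of printing it. SaxEventTrace and trace are hypothetical names used only for this illustration; the sketch extends DefaultHandler so that only the five methods listed above need to be overridden, and it parses an in-memory string rather than a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the SAX events in the order they arrive.
class SaxEventTrace extends DefaultHandler {

    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String local, String raw, Attributes attrs) {
        events.add("startElement " + raw);
    }

    @Override
    public void endElement(String uri, String local, String raw) {
        events.add("endElement " + raw);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Whitespace-only text between elements is ignored in this trace.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("characters " + text);
        }
    }

    /** Parses the XML text and returns the list of recorded events. */
    static List<String> trace(String xml) {
        SaxEventTrace handler = new SaxEventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("Could not parse the XML text", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        for (String event : trace("<book isbn=\"1\"><name>Book 1</name></book>")) {
            System.out.println(event);
        }
    }
}
```

The trace always begins with startDocument and ends with endDocument, and every startElement is matched by an endElement for the same name once the element's content has been reported.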

Now, MySAXSearcher is the following. (I've written my own ContentHandler and ErrorHandler; it's much cleaner than overriding the relevant ContentHandler and ErrorHandler methods in our class that extends DefaultHandler.)


MySAXSearcher.java:

import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MySAXSearcher extends DefaultHandler {

    public static void main(String[] args) {
        MySAXSearcher searcher = new MySAXSearcher();
        searcher.searchBook(args[0], args[1]);
    }

    private void searchBook(String xml, String isbn) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            SAXParser saxParser = factory.newSAXParser();
            XMLReader xr = saxParser.getXMLReader();

            // Assigning my own ContentHandler to my XMLReader.
            MyContentHandler ch = new MyContentHandler();
            ch.isbnSearched = isbn;
            xr.setContentHandler(ch);

            // Assigning my own ErrorHandler to my XMLReader.
            xr.setErrorHandler(new MyErrorHandler());

            xr.setFeature("http://xml.org/sax/features/validation", false);
            xr.setFeature("http://xml.org/sax/features/namespaces", true);

            long before = System.currentTimeMillis();
            xr.parse(xml);
            long after = System.currentTimeMillis();
            printResult(xml, ch, after - before);
        } catch (IOException ioe) {
            System.out.println("Error: " + ioe.getMessage());
        } catch (SAXException saxe) {
            System.out.println("Error: " + saxe.getMessage());
        } catch (ParserConfigurationException pce) {
            System.out.println("Error: " + pce.getMessage());
        }
    }

    public void printResult(String xml, MyContentHandler ch, long time) {
        System.out.println("Document " + xml + ". Parsed in : " + time + " ms");
        if (ch.book != null) {
            System.out.println("Book found:");
            System.out.println(" Isbn: " + ch.book.getIsbn());
            System.out.println(" Name: " + ch.book.getName());
            System.out.println(" Author: " + ch.book.getAuthor());
            System.out.println(" Price: " + ch.book.getPrice());
            System.out.println(" Editorial: " + ch.book.getEditorial());
        } else {
            System.out.println("Book not found");
        }
    }
}

MyContentHandler.java:

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

public class MyContentHandler implements ContentHandler {

    boolean isBookFound = false;
    String isbnSearched = "";
    String currentNode = "";
    MyBook book = null;

    @Override
    public void startDocument() throws SAXException {
        System.out.println("***Start document***");
    }

    @Override
    public void endDocument() throws SAXException {
        System.out.println("***End document***");
    }

    @Override
    public void startElement(String uri, String local, String raw, Attributes attrs) {
        currentNode = local;
        if ("book".equals(local) && !isBookFound) {
            // The book node has only one attribute (isbn)
            if ("isbn".equals(attrs.getLocalName(0))
                    && isbnSearched.equals(attrs.getValue(0))) {
                isBookFound = true;
                book = new MyBook();
                book.setIsbn(isbnSearched);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // I get the text value, without surrounding whitespace
        String value = new String(ch, start, length).trim();
        if (!"".equals(value) && isBookFound) {
            if ("name".equals(currentNode)) {
                book.setName(value);
            } else if ("author".equals(currentNode)) {
                book.setAuthor(value);
            } else if ("price".equals(currentNode)) {
                book.setPrice(value);
            } else if ("editorial".equals(currentNode)) {
                book.setEditorial(value);
                // editorial is the last child: the book is complete
                isBookFound = false;
            }
        }
    }

    // The remaining ContentHandler methods are not needed for this example.

    @Override
    public void endElement(String arg0, String arg1, String arg2) throws SAXException {
    }

    @Override
    public void endPrefixMapping(String arg0) throws SAXException {
    }

    @Override
    public void ignorableWhitespace(char[] arg0, int arg1, int arg2) throws SAXException {
    }

    @Override
    public void processingInstruction(String arg0, String arg1) throws SAXException {
    }

    @Override
    public void setDocumentLocator(Locator arg0) {
    }

    @Override
    public void skippedEntity(String arg0) throws SAXException {
    }

    @Override
    public void startPrefixMapping(String arg0, String arg1) throws SAXException {
    }
}

MyErrorHandler.java:

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class MyErrorHandler implements ErrorHandler {

    @Override
    public void warning(SAXParseException ex) {
        System.err.println("[Warning] : " + ex.getMessage());
    }

    @Override
    public void error(SAXParseException ex) {
        System.err.println("[Error] : " + ex.getMessage());
    }

    @Override
    public void fatalError(SAXParseException ex) throws SAXException {
        System.err.println("[Error!] : " + ex.getMessage());
    }
}

With our XML (books.xml) and the book code to search for (0000000003), we can execute our program with:

java MySAXSearcher "books.xml" "0000000003"


The result should look like the following:

***Start document***
***End document***
Document books.xml. Parsed in : 141 ms
Book found:
 Isbn: 0000000003
 Name: Book 3
 Author: Author name 3
 Price: 29.45
 Editorial: Editorial 3
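One detail is worth a hedged note: SAX does not guarantee that characters() delivers the whole text of an element in a single call, so a parser may split a long value into several chunks. The sketch below shows one common way to cope with that, accumulating the text in a StringBuilder and storing it in endElement(). BufferedBookHandler and find are hypothetical names, and a Map stands in for the MyBook DTO so that the example is self-contained; it is a variation on the idea, not a replacement for the handler in this manual.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the search: text is buffered until the element ends.
class BufferedBookHandler extends DefaultHandler {

    private final String isbnSearched;
    private final StringBuilder text = new StringBuilder();
    private boolean insideWantedBook = false;

    // Field name -> value for the matching book; stays null if it is never found.
    Map<String, String> book = null;

    BufferedBookHandler(String isbnSearched) {
        this.isbnSearched = isbnSearched;
    }

    @Override
    public void startElement(String uri, String local, String raw, Attributes attrs) {
        text.setLength(0); // start collecting the text of this element
        if ("book".equals(raw) && book == null
                && isbnSearched.equals(attrs.getValue("isbn"))) {
            insideWantedBook = true;
            book = new LinkedHashMap<>();
            book.put("isbn", isbnSearched);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String local, String raw) {
        if (!insideWantedBook) {
            return;
        }
        if ("book".equals(raw)) {
            insideWantedBook = false; // the wanted book is complete
        } else {
            book.put(raw, text.toString().trim());
        }
        text.setLength(0);
    }

    /** Returns the fields of the book with the given isbn, or null if absent. */
    static Map<String, String> find(String xml, String isbn) {
        BufferedBookHandler handler = new BufferedBookHandler(isbn);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("Could not parse the XML text", e);
        }
        return handler.book;
    }

    public static void main(String[] args) {
        String xml = "<books><book isbn=\"0000000003\"><name>Book 3</name>"
                + "<price>29.45</price></book></books>";
        System.out.println(find(xml, "0000000003"));
    }
}
```

Because the value is stored only when the element closes, it does not matter how many characters() calls the parser used to deliver it.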

3. DOM

DOM (Document Object Model): while SAX gives sequential access to the elements of a document as it reads them, DOM delivers the parsed document as a tree that can be traversed and transformed. Compared with SAX, DOM has some disadvantages and advantages:

Disadvantages:

• The data can be accessed only once the entire document has been parsed.
• The tree is an object loaded in memory; this is problematic for big and complex documents.

Advantages:

• With DOM we can manipulate the XML document (update, delete and add elements). We can also create a new XML document.
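As a small sketch of the manipulation mentioned above, the following code parses an XML string into a DOM tree, appends a new book element to the root, and writes the tree back out as text with the standard javax.xml.transform classes. DomEditSketch and addBook are hypothetical names chosen for this illustration, and the in-memory string is an assumption made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: adds a <book> to an existing document held in a string.
class DomEditSketch {

    static String addBook(String xml, String isbn, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Build <book isbn="..."><name>...</name></book> and attach it to the root.
            Element book = doc.createElement("book");
            book.setAttribute("isbn", isbn);
            Element nameElement = doc.createElement("name");
            nameElement.setTextContent(name);
            book.appendChild(nameElement);
            doc.getDocumentElement().appendChild(book);

            // Serialize the modified tree back to text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException
                | TransformerException e) {
            throw new IllegalArgumentException("Could not edit the XML text", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addBook("<books/>", "0000000006", "Book 6"));
    }
}
```

Updating or deleting follows the same pattern: locate the node in the tree, call setTextContent() or removeChild(), and serialize the result.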

To manipulate an XML document, we must obtain an object that implements the Document interface (which extends the Node interface). We use the classes javax.xml.parsers.DocumentBuilder and javax.xml.parsers.DocumentBuilderFactory, and invoke the method parse() to obtain a Document object.

To manipulate XML with DOM, some interfaces are important: org.w3c.dom.Document (the entire XML document), org.w3c.dom.Element (the elements of the XML document), org.w3c.dom.Node (any node of the tree; elements, attributes and text are all nodes) and org.w3c.dom.Attr (the attributes of an element).
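To make the roles of these interfaces concrete, here is a short sketch that touches each of them once. DomTypesSketch and describeBooks are hypothetical names, and the XML is read from a string only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: one use of Document, Element, Attr and Node each.
class DomTypesSketch {

    /** Returns "isbn=name" for every <book> element in the XML text. */
    static List<String> describeBooks(String xml) {
        List<String> result = new ArrayList<>();
        try {
            // Document: the whole parsed tree.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList books = doc.getElementsByTagName("book");
            for (int i = 0; i < books.getLength(); i++) {
                // Element: a node that has a tag name and attributes.
                Element book = (Element) books.item(i);
                // Attr: the isbn attribute, or null when it is missing.
                Attr isbn = book.getAttributeNode("isbn");
                // Node: the generic view of the first <name> descendant, if any.
                Node name = book.getElementsByTagName("name").item(0);
                result.add((isbn == null ? "?" : isbn.getValue())
                        + "=" + (name == null ? "?" : name.getTextContent()));
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the XML text", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describeBooks(
                "<books><book isbn=\"1\"><name>A</name></book></books>"));
    }
}
```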

OK, now let's talk in Java code. As DTO (Data Transfer Object), I use the same MyBook object.


MyDOMSearcher.java:

import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class MyDOMSearcher {

    public static void main(String[] args) {
        MyDOMSearcher searcher = new MyDOMSearcher();
        searcher.searchBook(args[0], args[1]);
    }

    private void searchBook(String xml, String isbn) {
        long before = System.currentTimeMillis();
        MyBook book = null;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            DocumentBuilder parser = factory.newDocumentBuilder();
            // I assign my own ErrorHandler to my parser
            parser.setErrorHandler(new MyErrorHandler());
            File file = new File(xml);
            Document doc = parser.parse(file);

            // I obtain all the <book> elements.
            // NodeList is an interface that has 2 methods:
            // 1. item(int): returns the Node at the given position.
            // 2. getLength(): returns the length of the list.
            NodeList booksNodes = doc.getElementsByTagName("book");
            NodeList bookChildsNodes = null;
            String isbnAttribute = "";
            for (int i = 0; i < booksNodes.getLength(); i++) {
                Node node = booksNodes.item(i);
                if (node != null && node.hasAttributes()) {
                    isbnAttribute = node.getAttributes().getNamedItem("isbn").getNodeValue();
                    if (isbnAttribute.equals(isbn)) {
                        // I've found the isbn searched
                        if (book == null) {
                            book = new MyBook();
                            book.setIsbn(isbn);
                        }
                        if (node.hasChildNodes()) {
                            bookChildsNodes = node.getChildNodes();
                            for (int j = 0; j < bookChildsNodes.getLength(); j++) {
                                Node child = bookChildsNodes.item(j);
                                if ("name".equals(child.getNodeName())) {
                                    book.setName(child.getTextContent());
                                } else if ("author".equals(child.getNodeName())) {
                                    book.setAuthor(child.getTextContent());
                                } else if ("price".equals(child.getNodeName())) {
                                    book.setPrice(child.getTextContent());
                                } else if ("editorial".equals(child.getNodeName())) {
                                    book.setEditorial(child.getTextContent());
                                    // I've found my book. Ending the inner for iteration
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        } catch (IOException ioe) {
            System.err.println("[Error] : " + ioe.getMessage());
        } catch (ParserConfigurationException pce) {
            System.err.println("[Error] : " + pce.getMessage());
        } catch (SAXException se) {
            System.err.println("[Error] : " + se.getMessage());
        }
        long after = System.currentTimeMillis();
        printResults(xml, book, after - before);
    }

    public void printResults(String xml, MyBook book, long time) {
        System.out.println("Document " + xml + ". Parsed in : " + time + " ms");
        if (book != null) {
            System.out.println("Book found:");
            System.out.println(" Isbn: " + book.getIsbn());
            System.out.println(" Name: " + book.getName());
            System.out.println(" Author: " + book.getAuthor());
            System.out.println(" Price: " + book.getPrice());
            System.out.println(" Editorial: " + book.getEditorial());
        } else {
            System.out.println("Book not found");
        }
    }
}


4. JDOM

All the preceding APIs are available for many programming languages, but using them in Java is laborious. JDOM is an API made specifically for Java; it uses Java's own capabilities and features, and therefore makes XML parsing easier. We can find related information at www.jdom.org.

Now, let's build the same example (searching for a book in our XML) with JDOM. Make sure that the jar is on your classpath; you can download it from http://www.jdom.org/dist/binary/.

MyJDOMSearcher.java:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;

public class MyJDOMSearcher {

    private String isbn;
    private MyBook book;
    private boolean noSearchMore = false;

    public static void main(String[] args) {
        try {
            long before = System.currentTimeMillis();
            MyJDOMSearcher searcher = new MyJDOMSearcher();
            // The second parameter is the isbn to search
            searcher.isbn = args[1];
            SAXBuilder saxBuilder = new SAXBuilder();
            Document document = saxBuilder.build(args[0]);
            searcher.searchBook(document.getRootElement());
            long after = System.currentTimeMillis();
            searcher.printResults(args[0], after - before);
        } catch (JDOMException jde) {
            System.err.println("[Error] JDOMException: " + jde.getMessage());
        } catch (IOException ioe) {
            System.err.println("[Error] IOException: " + ioe.getMessage());
        }
    }

    // Recursively descend the tree, starting at the "books" root node
    private void searchBook(Element element) {
        inspect(element);
        List content = element.getContent();
        Iterator iterator = content.iterator();
        Element child = null;
        Object object = null;
        while (iterator.hasNext()) {
            object = iterator.next();
            // Only Element children are visited; other content is skipped
            if (object instanceof Element) {
                child = (Element) object; // Casting from Object to Element
                searchBook(child);
            }
        }
    }

    // Fill the book from the current element if it belongs to the searched book
    public void inspect(Element element) {
        if (!noSearchMore) { // If I already have the book, I do nothing more
            if ("book".equals(element.getQualifiedName()) && book == null) {
                if (isbn.equals(element.getAttribute("isbn").getValue())) {
                    book = new MyBook();
                    book.setIsbn(isbn);
                }
            }
            if (book != null) {
                if ("name".equals(element.getQualifiedName())) {
                    book.setName(element.getValue());
                }
                if ("author".equals(element.getQualifiedName())) {
                    book.setAuthor(element.getValue());
                }
                if ("price".equals(element.getQualifiedName())) {
                    book.setPrice(element.getValue());
                }
                if ("editorial".equals(element.getQualifiedName())) {
                    book.setEditorial(element.getValue());
                    noSearchMore = true;
                }
            }
        }
    }

    private void printResults(String xml, long time) {
        System.out.println("Document " + xml + ". Parsed in : " + time + " ms");
        if (book != null) {
            System.out.println("Book found:");
            System.out.println(" Isbn: " + book.getIsbn());
            System.out.println(" Name: " + book.getName());
            System.out.println(" Author: " + book.getAuthor());
            System.out.println(" Price: " + book.getPrice());
            System.out.println(" Editorial: " + book.getEditorial());
        } else {
            System.out.println("Book not found");
        }
    }
}


5. Conclusion

Executing the same example with the three APIs (MySAXSearcher, MyDOMSearcher and MyJDOMSearcher), passing as parameters the same XML file and the ISBN to search for ("0000000003"), gives the following times:

MySAXSearcher    MyDOMSearcher    MyJDOMSearcher
93 ms            750 ms           609 ms

The SAX API is faster than DOM and JDOM (but it is more laborious to use).
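The figures above come from a single run of each program, so they also include one-off costs such as class loading. For readers who want to repeat the comparison, here is a rough sketch that averages several parses after a short warm-up, using only the SAX and DOM parsers bundled with the JDK. ParseTimingSketch, Parse, averageNanos and the generated sample document are all hypothetical and chosen for this illustration; JDOM is left out because it needs an external jar. The numbers it prints depend on the machine and JVM, and it is not a rigorous benchmark.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: averages parse times instead of timing a single run.
class ParseTimingSketch {

    /** A parsing strategy that may throw, so both APIs fit one interface. */
    interface Parse {
        void run(String xml) throws Exception;
    }

    // SAX: read the document and discard every event.
    static final Parse SAX = xml -> SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), new DefaultHandler());

    // DOM: build the whole tree in memory.
    static final Parse DOM = xml -> DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

    /** Average time of one parse in nanoseconds, after a few untimed warm-up runs. */
    static long averageNanos(Parse parse, String xml, int runs) {
        try {
            for (int i = 0; i < 5; i++) {
                parse.run(xml); // warm-up: class loading and JIT compilation
            }
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                parse.run(xml);
            }
            return (System.nanoTime() - start) / runs;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Generated sample data, shaped like books.xml.
        StringBuilder xml = new StringBuilder("<books>");
        for (int i = 1; i <= 1000; i++) {
            xml.append("<book isbn=\"").append(i).append("\"><name>Book ")
               .append(i).append("</name></book>");
        }
        xml.append("</books>");
        System.out.println("SAX: " + averageNanos(SAX, xml.toString(), 50) + " ns per parse");
        System.out.println("DOM: " + averageNanos(DOM, xml.toString(), 50) + " ns per parse");
    }
}
```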
