Extensible Markup Language: XML


• XML developed by World Wide Web Consortium’s (W3C’s) XML Working Group (1996)

• XML portable, widely supported technology for describing data

• XML quickly becoming standard for data exchange between applications

15.2 XML Documents

• XML marks up data using tags, which are names enclosed in angle brackets < >
• All tags appear in pairs: <myTag> .. </myTag>
• Elements: units of data (i.e., everything included between a start tag and its corresponding end tag)
• Root element contains all other document elements
• Tag pairs cannot appear interleaved: <a><b></a></b>
  Must be: <a><b></b></a>
• Nested elements form hierarchies (trees)

Thus: What defines an XML document is not its tag names but that it has tags that are formatted in this way.
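The nesting rule can be checked with the XML parser that ships with Python. The following is a minimal sketch in Python 3 (the listings later in these slides use Python 2); the helper name is_well_formed is made up for this illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if the Python parser accepts xml_text, else False."""
    try:
        parseString(xml_text)
    except ExpatError:
        return False
    return True


print(is_well_formed("<a><b></b></a>"))   # properly nested tags
print(is_well_formed("<a><b></a></b>"))   # interleaved tags
```

The parser rejects the interleaved version because the end tag </a> arrives while <b> is still open.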

article.xml

<?xml version = "1.0"?>

<article>

   <title>Simple XML</title>

   <date>December 21, 2001</date>

   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>

   <summary>XML is pretty easy.</summary>

   <content>In this chapter, we present a wide variety of examples
      that use XML.
   </content>

</article>

End tag has format </start tag name>

Root element contains all other document elements

Optional XML declaration includes version information parameter

XML comments delimited by <!-- and -->

Because of the nice <tag> .. </tag> structure, the data can be viewed as organized in a tree:

article
    title
    date
    author
        firstName
        lastName
    summary
    content

<?xml version = "1.0"?>
<!-- I-sequence structured with XML. -->
<SEQUENCEDATA>
   <TYPE>dna</TYPE>
   <SEQ>
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
         acctcccatccgtgtctattgtaccctgttgcttcgg
         cgggcccgccgcttgtcggccgccgggggggcgcctctg
         ccccccgggcccgtgcccgccggagaccccaacacgaac
         actgtctgaaagcgtgcagtctgagttgattgaatgcaat
         cagttaaaactttcaacaatggatctcttggttccggc
      </DATA>
   </SEQ>
</SEQUENCEDATA>

An I-sequence structured as XML

SEQUENCEDATA
    TYPE
    SEQ
        NAME
        ID
        DATA

Parsing and displaying XML

• XML is just another data format
• We need to write yet another parser
• No more filters, please!

?

• No! XML is becoming standard
• Many different systems can read XML – not many systems can read our I-sequence format..
• Thus, parsers exist already

XML document opened in Internet Explorer

[Screenshot: XML document displayed as a tree. Each parent element/node can be expanded (plus sign) and collapsed (minus sign).]

XML document opened in Mozilla

[Screenshot.] Again: Each parent element/node can be expanded and collapsed (here by pressing the minus, not the element)

Attributes

Data can also be placed in attributes: name/value pairs

letter.xml

<?xml version = "1.0"?>

<letter>
   <contact type = "from">
      <name>Jane Doe</name>
      <address1>Box 12345</address1>
      <address2>15 Any Ave.</address2>
      <city>Othertown</city>
      <state>Otherstate</state>
      <zip>67890</zip>
      <phone>555-4321</phone>
      <flag gender = "F" />
   </contact>

   <contact type = "to">
      <name>John Doe</name>
      <address1>123 Main St.</address1>
      <address2></address2>
      <city>Anytown</city>
      <state>Anystate</state>
      <zip>12345</zip>
      <phone>555-1234</phone>
      <flag gender = "M" />
   </contact>

   <salutation>Dear Sir:</salutation>

• Attribute (name-value pair, value in quotes): element contact has the attribute type which has the value "from"
• Empty elements do not contain character data. The two tags of an empty element may be written as a single tag, like this: <myTag />
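As a rough illustration of how attribute values and empty elements look from a program, here is a small Python 3 sketch using the standard xml.dom.minidom module. The shortened letter and the function name contact_summary are assumptions made for this example only.

```python
from xml.dom.minidom import parseString

# A shortened version of letter.xml, kept inline so the sketch is self-contained.
LETTER = """<letter>
   <contact type = "from">
      <name>Jane Doe</name>
      <flag gender = "F" />
   </contact>
   <contact type = "to">
      <name>John Doe</name>
      <flag gender = "M" />
   </contact>
</letter>"""


def contact_summary(xml_text):
    """Return a list of (type, name, gender) tuples, one per contact element."""
    document = parseString(xml_text)
    summary = []
    for contact in document.getElementsByTagName("contact"):
        name = contact.getElementsByTagName("name")[0].firstChild.nodeValue
        flag = contact.getElementsByTagName("flag")[0]
        # Attribute values are read by name; the empty element flag has no text child.
        summary.append((contact.getAttribute("type"), name, flag.getAttribute("gender")))
    return summary


for entry in contact_summary(LETTER):
    print(entry)
```

The data in name is character data between tags, while type and gender live inside the start tag itself.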

letter.xml (continued)

   <paragraph>It is our privilege to inform you about our new
      database managed with <technology>XML</technology>. This
      new system allows you to reduce the load on
      your inventory list server by having the client machine
      perform the work of sorting and filtering the data.
   </paragraph>

   <paragraph>Please visit our Web site for availability
      and pricing.
   </paragraph>

   <closing>Sincerely</closing>

   <signature>Ms. Doe</signature>
</letter>

Intermezzo 1

1. Finish this i2xml.py filter so it translates a list of Isequence objects into XML (following the above structure) and saves it in a file. Assume the list contains only one Isequence object. Use your module with this driver program and translate this Fasta file into XML. Load the resulting XML file into a browser.

2. Change the XML structure defined by your filter so that TYPE is no longer a tag by itself but an attribute of the SEQ tag (see page 496).

3. Modify your i2xml filter so that it can now translate a list of several Isequence objects into one XML file, using the structure from part 2. Test your program with the same driver on this Fasta file.

http://www.daimi.au.dk/~chili/CSS/Intermezzi/30.10.1.html

All files found from the Example Programs page:

http://www.daimi.au.dk/~chili/CSS/ExamplePrograms/i2xml.py
http://www.daimi.au.dk/~chili/CSS/ExamplePrograms/fasta2xml.py
http://www.daimi.au.dk/~chili/CSS/ExamplePrograms/sample.fasta
http://www.daimi.au.dk/~chili/CSS/ExamplePrograms/u03518.fasta

solution

from Isequence import Isequence
import sys

# Save a list of Isequences in XML

class SaveToFiles:
    """Stores a list of ISequences in XML format"""

    def save_to_files(self, iseqlist, savefilename):

        try:
            savefile = open(savefilename, "w")
            print >> savefile, "<?xml version = \"1.0\"?>"
            print >> savefile, "<SEQUENCEDATA>"
            for seq in iseqlist:
                print >> savefile, ' <SEQ type="%s">'%seq.get_type()
                print >> savefile, " <NAME>%s</NAME>"%seq.get_name()
                print >> savefile, " <ID>%s</ID>"%seq.get_id()
                print >> savefile, " <DATA>%s</DATA>"%seq.get_sequence()
                print >> savefile, " </SEQ>"

            print >> savefile, "</SEQUENCEDATA>"

            savefile.close()
        except IOError, message:
            sys.exit(message)
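One detail the solution does not handle: if a name contained a character such as & or <, the output would no longer be well-formed XML. The sketch below, in Python 3, shows one possible way to guard against that with the standard xml.sax.saxutils helpers. The function sequences_to_xml and the use of plain tuples instead of Isequence objects are assumptions for illustration, not part of the course code.

```python
from xml.sax.saxutils import escape, quoteattr


def sequences_to_xml(records):
    """Build the SEQUENCEDATA document as a string.

    records is a list of (type, name, id, data) tuples; the tuples are a
    hypothetical stand-in for Isequence objects.
    """
    lines = ['<?xml version = "1.0"?>', "<SEQUENCEDATA>"]
    for seq_type, name, seq_id, data in records:
        # quoteattr adds the surrounding quotes and escapes the attribute value
        lines.append("  <SEQ type=%s>" % quoteattr(seq_type))
        # escape replaces &, < and > so that the text cannot break the markup
        lines.append("    <NAME>%s</NAME>" % escape(name))
        lines.append("    <ID>%s</ID>" % escape(seq_id))
        lines.append("    <DATA>%s</DATA>" % escape(data))
        lines.append("  </SEQ>")
    lines.append("</SEQUENCEDATA>")
    return "\n".join(lines) + "\n"


print(sequences_to_xml([("dna", "Aspergillus awamori", "U03518", "aacctgcgg")]))
```

Returning a string keeps the sketch easy to test; writing it to a file is then a single write call.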

solution XML file loaded in Internet Explorer

Parsers and trees

• We’ve already seen that XML markup can be displayed as a tree
• Some XML parsers exploit this. They
  – parse the file
  – extract the data
  – return it organized in a tree data structure called a Document Object Model

article
    title
    date
    author
        firstName
        lastName
    summary
    content

15.4 Document Object Model (DOM)

• DOM parser retrieves data from XML document
• Hierarchical tree structure called a DOM tree
• Each component of an XML document represented as a tree node
• Parent nodes contain child nodes
• Sibling nodes have same parent
• Single root (or document) node contains all other document nodes

DOM tree of previous example

article
    title
    date
    author
        firstName
        lastName
    summary
    content

Fig. 15.6 Tree structure for article.xml.

Labels in the figure: one single document root node, parent node, child nodes, sibling nodes.

Python provides a DOM parser!

• all nodes have name (of tag) and value
• text (incl. whitespace) represented in nodes with tag name #text

article
    #text
    title
        #text: Simple XML
    #text
    date
        #text: Dec..2001
    #text
    author
        #text
        firstName
            #text: John
        #text
        lastName
            #text: Doe
        #text
    #text
    summary
        #text: XML..easy.
    #text
    content
        #text: In this..XML.
    #text
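The whitespace #text nodes are easy to overlook, so here is a small Python 3 sketch that makes them visible. It parses a shortened article from a string; the variable names are chosen for this example only.

```python
from xml.dom.minidom import parseString

ARTICLE = """<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <date>December 21, 2001</date>
</article>"""

document = parseString(ARTICLE)
root = document.documentElement

# The line breaks and indentation between the tags show up as #text nodes.
child_names = [node.nodeName for node in root.childNodes]
print(child_names)

# The text inside an element is itself a #text child of that element.
title = root.getElementsByTagName("title")[0]
print(title.firstChild.nodeName, repr(title.firstChild.nodeValue))
```

The list of child names should alternate between #text and the element names, because every line break between two tags becomes a text node of its own.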

revisedfig16_04.py

import sys
from xml.dom.minidom import parse           # stuff we have to import
from xml.parsers.expat import ExpatError    # the book uses an old version

.. << open xml file>>

try:
    document = parse( file )
    file.close()
except ExpatError:
    sys.exit( "Error processing XML file" )

rootElement = document.documentElement
print "Here is the root element of the document: %s" % rootElement.nodeName

# traverse all child nodes of root element
for node in rootElement.childNodes:
    print node.nodeName

# get first child node of root element
child = rootElement.firstChild
print "\nThe first child of root element is:", child.nodeName
print "whose next sibling is:",

# get next sibling of first child
sibling = child.nextSibling
print sibling.nodeName

print "Text inside " + sibling.nodeName + " tag is",
textnode = sibling.firstChild

print textnode.nodeValue
print "Parent node of %s is: %s" % ( sibling.nodeName, sibling.parentNode.nodeName )

Notes on the program:

• parse: parse XML document and load data into variable document
• documentElement: get root element of the DOM tree; the documentElement attribute refers to the root node
• nodeName refers to element’s tag name
• childNodes: list of a node’s children
• Other node attributes: firstChild, nextSibling, nodeValue, parentNode

Program output

Here is the root element of the document: article
The following are its child elements:
#text
title
#text
date
#text
author
#text
summary
#text
content
#text

The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article

..
print "Text inside " + sibling.nodeName + " tag is",
textnode = sibling.firstChild

# print text value of sibling
print textnode.nodeValue
..


Parsing XML sequence?

• We have i2xml filter – we want xml2i also
• Don’t have to write XML parser, Python provides one
• Thus, algorithm:
  – Open file
  – Use Python parser to obtain the DOM tree
  – Traverse tree to extract sequence information, build Isequence objects

Ignoring whitespace nodes, we have to search a tree like this:

SEQUENCEDATA
    SEQ (type)
        NAME
        ID
        DATA
    SEQ (type)
        NAME
        ID
        DATA
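One way to ignore the whitespace nodes while searching is to filter the children by node type. The following Python 3 sketch shows the idea; the helper element_children and the inline sample are assumptions for this illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def element_children(node):
    """Return only those children of node that are elements (no #text nodes)."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


SAMPLE = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcgg</DATA>
  </SEQ>
</SEQUENCEDATA>"""

root = parseString(SAMPLE).documentElement
for seq in element_children(root):
    print(seq.nodeName, [child.nodeName for child in element_children(seq)])
```

The parser shown on the following slides reaches the same result differently: it compares nodeName against the tag names it knows and simply does nothing for #text nodes.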

from Isequence import Isequence
import sys
from xml.dom.minidom import parse
from xml.parsers.expat import ExpatError

class Parser:
    """Parses xml file, stores sequences in Isequence list"""

    def __init__( self ):
        self.iseqlist = [] # make empty list

    def parse_file( self, loadfilename ):
        try:
            loadfile = open( loadfilename, "r" )
        except IOError, message:
            sys.exit( message )

        # Use Python's own xml parser to parse xml file:
        try:
            dom = parse( loadfilename )
            loadfile.close()
        except ExpatError:
            sys.exit( "Couldn't parse xml file" )

        # now dom is our dom tree structure. Was the xml file a sequence file?
        if dom.documentElement.nodeName == "SEQUENCEDATA":

            # recursively search the parse tree:
            for child in dom.documentElement.childNodes:
                self.traverse_dom_tree( child )
        else:
            sys.exit( "This is not a sequence file" )
        return self.iseqlist

part 1:2

    def traverse_dom_tree( self, node ):
        """Recursive method that traverses the DOM tree"""
        if node.nodeName == "SEQ":
            # marks the beginning of a new sequence
            self.iseq = Isequence() # make new Isequence object
            self.iseqlist.append( self.iseq ) # add to list
            newformat = 0
            # the type should be an attribute of the SEQ tag.
            # go through all attributes of this node:
            for i in range( node.attributes.length ):
                if node.attributes.item(i).name == "type":

                    # good, found a 'type' attribute
                    newformat = 1
                    # get the value of the attribute, put it in the Isequence:
                    self.iseq.set_type( node.getAttribute( "type" ) )
                    break

            if not newformat:
                # we didn't find any 'type' attribute, this is old format
                print "No 'type' attribute in element SEQ"

            # next recursively traverse the child nodes of this SEQ node:
            for child in node.childNodes:
                self.traverse_dom_tree( child )

        elif node.nodeName == "NAME":
            self.iseq.set_name( node.firstChild.nodeValue )
        elif node.nodeName == "ID":
            self.iseq.set_id( node.firstChild.nodeValue )
        elif node.nodeName == "DATA":
            self.iseq.set_sequence( node.firstChild.nodeValue )

part 2:2

SEQ (type)
    NAME
    ID
    DATA
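For readers on Python 3, the same algorithm (parse, check the root, walk the SEQ nodes) could look roughly as follows. This is a sketch under assumptions, not the course solution: it uses getElementsByTagName instead of the recursive traversal, and plain dictionaries stand in for Isequence objects. The names text_of and read_sequences are made up.

```python
from xml.dom.minidom import parseString


def text_of(element):
    """Concatenate the text nodes directly below element."""
    return "".join(child.nodeValue for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)


def read_sequences(xml_text):
    """Return one dict per SEQ element; a dict stands in for an Isequence object."""
    root = parseString(xml_text).documentElement
    if root.nodeName != "SEQUENCEDATA":
        raise ValueError("This is not a sequence file")
    sequences = []
    for seq in root.getElementsByTagName("SEQ"):
        record = {"type": seq.getAttribute("type")}
        for tag in ("NAME", "ID", "DATA"):
            found = seq.getElementsByTagName(tag)
            # strip() drops line breaks around the text
            record[tag.lower()] = text_of(found[0]).strip() if found else ""
        sequences.append(record)
    return sequences


SAMPLE = """<?xml version = "1.0"?>
<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcgg</DATA>
  </SEQ>
</SEQUENCEDATA>"""

print(read_sequences(SAMPLE))
```

Note that getElementsByTagName searches all descendants, so this variant would need more care if a NAME tag could also occur deeper inside a SEQ element.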


What if the XML sequence format changes?

• Now the name of the finder of the sequence is also stored as a new tag:

SEQUENCEDATA
    SEQ (type)
        NAME
        FOUNDBY
        ID
        DATA
    SEQ (type)
        NAME
        FOUNDBY
        ID
        DATA


Robustness of XML format

• Our xml2i filter still works:
  – Can’t extract the finder information: ignores the foundby node
  – But: doesn’t crash! Still extracts other information
  – Easy to incorporate new info

    def traverse_dom_tree( self, node ):
        """Recursive method that traverses the DOM tree"""
        if node.nodeName == "SEQ":
            ..
            # next recursively traverse the child nodes of this SEQ node:
            for child in node.childNodes:
                self.traverse_dom_tree( child )

        elif node.nodeName == "NAME":
            self.iseq.set_name( node.firstChild.nodeValue )
        elif node.nodeName == "ID":
            self.iseq.set_id( node.firstChild.nodeValue )
        elif node.nodeName == "DATA":
            self.iseq.set_sequence( node.firstChild.nodeValue )

SEQ (type)
    NAME
    FOUNDBY
    ID
    DATA
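The robustness argument can be tried out with a small Python 3 sketch. The extended record below is hypothetical: the slides do not show the exact layout of the FOUNDBY element, so a simple text-only element and the finder name BiRC are assumed here. The function extract is also made up for this example.

```python
from xml.dom.minidom import parseString

# Hypothetical extended record; the exact layout of FOUNDBY is an assumption.
EXTENDED = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <FOUNDBY>BiRC</FOUNDBY>
    <ID>U03518</ID>
    <DATA>aacctgcgg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def extract(xml_text, wanted):
    """Collect the text of the wanted tags below each SEQ; skip all other nodes."""
    records = []
    for seq in parseString(xml_text).documentElement.childNodes:
        if seq.nodeName != "SEQ":
            continue                      # whitespace #text nodes
        record = {}
        for child in seq.childNodes:
            if child.nodeName in wanted and child.firstChild is not None:
                record[child.nodeName] = child.firstChild.nodeValue
        records.append(record)
    return records


old_filter = extract(EXTENDED, ("NAME", "ID", "DATA"))
new_filter = extract(EXTENDED, ("NAME", "ID", "DATA", "FOUNDBY"))
print(old_filter)
print(new_filter)
```

The old tag list still yields name, id and data from the new file, and picking up the finder only takes one more tag name.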


Compare with extending Fasta format

Say that the Fasta format is modified so the finder appears in the second line after a >:

>HSBGPG Human gene for bone gla protein (BGP)
>BiRC
CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC..

Our Fasta parser would go wrong:

for line in lines:
    if line[0] == '>':
        # new sequence starts
        items = line.split()

        # put new Isequence obj. in list
        ..
    elif self.iseq:
        # we are currently building an iseq object, extend its sequence
        self.iseq.extend_sequence( line.strip() ) # skip trailing newline
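To see what "go wrong" means in practice, here is a simplified Python 3 stand-in for the Fasta parser. It is not the course parser: dictionaries replace Isequence objects, the function parse_fasta is made up, and the finder line is assumed to read >BiRC.

```python
def parse_fasta(lines):
    """Naive Fasta reader: every line starting with '>' begins a new record."""
    records = []
    for line in lines:
        if line.startswith(">"):
            items = line[1:].split()
            records.append({"id": items[0], "sequence": ""})
        elif records:
            # we are currently building a record, extend its sequence
            records[-1]["sequence"] += line.strip()
    return records


modified = [
    ">HSBGPG Human gene for bone gla protein (BGP)\n",
    ">BiRC\n",
    "CGAGACGGCGCGCGTCCCCTTCGGAGG\n",
]
for record in parse_fasta(modified):
    print(record)
```

The parser does not crash, which is arguably worse: it silently reports two sequences, an empty one called HSBGPG and one called BiRC that carries the data.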


XML robust

• So, the good thing about XML is that it is robust because of its well-defined structure

• Widely used, i.e. this overall tag structure won’t change

• Parsers available in Python already:
  – Read XML into a DOM tree
  – DOM tree can be traversed but also manipulated (see next slide)
  – Read XML using so-called SAX method
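The SAX method is only mentioned here, so a brief Python 3 sketch may help to show how it differs from DOM: instead of building a tree, the parser calls handler methods while it reads the document. The class name NameCollector and the sample data (including the second, made-up sequence) are assumptions for this illustration.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every NAME element while the document is being read."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.inside_name = False
        self.buffer = []

    def startElement(self, tag, attributes):
        if tag == "NAME":
            self.inside_name = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is buffered until the end tag.
        if self.inside_name:
            self.buffer.append(content)

    def endElement(self, tag):
        if tag == "NAME":
            self.names.append("".join(self.buffer))
            self.inside_name = False


SAMPLE = b"""<SEQUENCEDATA>
  <SEQ type="dna"><NAME>Aspergillus awamori</NAME><ID>U03518</ID></SEQ>
  <SEQ type="dna"><NAME>Second sequence</NAME><ID>X00001</ID></SEQ>
</SEQUENCEDATA>"""

handler = NameCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.names)
```

Because no tree is kept in memory, this style suits very large files, at the price of having to track the current position in the document by hand.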


See all the methods and attributes of a DOM tree on pages 537ff

Attribute/Method                      Description
appendChild( newChild )               Appends newChild to the list of child nodes. Returns the appended child node.
attributes                            NamedNodeMap that contains the attribute nodes for the current node.
childNodes                            NodeList that contains the node’s current children.
firstChild                            First child node in the NodeList or None, if the node has no children.
insertBefore( newChild, refChild )    Inserts the newChild node before the refChild node. refChild must be a child node of the current node; otherwise, insertBefore raises a ValueError exception.
isSameNode( other )                   Returns true if other is the current node.
lastChild                             Last child node in the NodeList or None, if the current node has no children.
nextSibling                           The next node in the NodeList, or None, if the node has no next sibling.
nodeName                              Name of the node, or None, if the node does not have a name.

Possible to manipulate the DOM tree using these methods: add new nodes, remove nodes, set attributes etc.
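A short Python 3 sketch of such manipulation with xml.dom.minidom follows. The starting document is a made-up fragment; the calls used (setAttribute, createElement, createTextNode, appendChild, insertBefore, removeChild, toxml) are standard minidom methods.

```python
from xml.dom.minidom import parseString

document = parseString("<SEQ><NAME>Aspergillus awamori</NAME><DATA>aacctg</DATA></SEQ>")
seq = document.documentElement

# set an attribute on the existing SEQ element
seq.setAttribute("type", "dna")

# create <ID>U03518</ID> and insert it before the DATA element
id_element = document.createElement("ID")
id_element.appendChild(document.createTextNode("U03518"))
data_element = seq.getElementsByTagName("DATA")[0]
seq.insertBefore(id_element, data_element)

# remove the NAME element again (it is the first child of SEQ here)
seq.removeChild(seq.firstChild)

print(seq.toxml())
```

New nodes are always created through the document object and only become part of the tree once they are appended or inserted.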


Remark: book uses old version of DOM parser

• XML examples in book won’t work (except the revised fig16.04)

• Look in the presented example programs to see what you have to import

• All the methods and attributes of a DOM tree on pages 537ff are the same


Intermezzo 2

1. Copy this file and take a look at it in your editor: /users/chili/CSS.E03/Intermezzi/data.xml
   Any idea what this data is?

2. Open the file in a browser. Expand and collapse nodes by clicking the - and + symbols. Do you see the structure of the tree? Any idea what the data represents now?

3. Copy this program to the same directory. Run it and find the name of Jakob's mother's father's mother. See how the program works?

4. Modify the program so it reports the birth year of the current person as well as the name.

5. Enhance the program so the user can also go back to the son or daughter of the current person. See table on page 537.

6. If you have time: Enhance the program so it prints the current person's mother-in-law, if she exists.


http://www.daimi.au.dk/~chili/CSS/Intermezzi/xml_reader.py


solution

    name = person.getAttribute( "n" )
    print( "%s" %name )
    if name != 'Jakob':
        print "%s's mother in law is" %name ,
        parentNode = person.parentNode

        # parentNode is either an 'm' or an 'f' node. If it is a mother
        # node, we need the father node, and vice versa:
        if parentNode.nextSibling:
            spouse = parentNode.nextSibling.firstChild
        else:
            spouse = parentNode.previousSibling.firstChild

        # Now we need the mother of the spouse:
        for childNode in spouse.childNodes:
            if childNode.nodeName == 'm':
                print childNode.firstChild.getAttribute( 'n' )
                break

    input = raw_input( "Report (m)other or (f)ather or (o)ffspring of %s? " %name )
    if input != 'm' and input != 'f' and input != 'o':
        break

    if input == 'o':
        print "\n" + name + "'s offspring is",
        person = person.parentNode.parentNode
    else:
        for child in person.childNodes:
            if child.nodeName == input:
                if input == 'm':
                    print "\nMother of " + name + " is",
                elif input == 'f':
                    print "\nFather of " + name + " is",
                person = child.firstChild
                break