54
Jun 19, 2 022 XML eXtensible Markup Language THETOPPERSWAY.COM

18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Embed Size (px)

DESCRIPTION

3 HTML and XML, II HTML and XML look similar, because they are both SGML languages (SGML = Standard Generalized Markup Language) Both HTML and XML use elements enclosed in tags (e.g. This is an element ) Both use tag attributes (e.g., ) Both use entities (, &, ", ' ) More precisely, HTML is defined in SGML XML is a (very small) subset of SGML

Citation preview

Page 1: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

May 4, 2023

XML

eXtensible Markup Language

THETOPPERSWAY.COM

Page 2: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

2

HTML and XML

XML stands for eXtensible Markup Language

HTML is used to mark up text so it can be displayed to users

XML is used to mark up data so it can be processed by computers

HTML describes both structure (e.g. <p>, <h2>, <em>) and appearance (e.g. <br>, <font>, <i>)

XML describes only content, or “meaning”

HTML uses a fixed, unchangeable set of tags

In XML, we make up our own tags

Page 3: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

3

HTML and XML, II

HTML and XML look similar, because they are both SGML languages (SGML = Standard Generalized Markup Language) Both HTML and XML use elements enclosed in tags (e.g.

<body>This is an element</body>) Both use tag attributes (e.g.,

<font face="Verdana" size="1" color="red">) Both use entities (&lt;, &gt;, &amp;, &quot;, &apos;)

More precisely, HTML is defined in SGML XML is a (very small) subset of SGML

Page 4: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

4

HTML and XML, III

HTML is for humans HTML describes web pages You don’t want to see error messages about web pages you visit Browsers ignore and/or correct as many HTML errors as they can,

so HTML is often sloppy XML is for computers

XML describes data The rules are strict and errors are not allowed

In this way, XML is like a programming language Current versions of most browsers can display XML

However, browser support of XML tends to be inconsistent

Page 5: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

5

XML-related technologies DTD (Document Type Definition) and XML Schemas are used to

define legal XML tags and their attributes for particular purposes

CSS (Cascading Style Sheets) describe how to display HTML or XML in a browser

XSLT (eXtensible Stylesheet Language Transformations) and XPath are used to translate from one form of XML to another

DOM (Document Object Model), SAX (Simple API for XML, and JAXP (Java API for XML Processing) are all APIs for XML parsing.

Page 6: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

6

Example XML document

<?xml version="1.0"?><weatherReport> <date>7/1/2013</date> <city>Ghaziabad</city> <state>UP</state> <country>India</country> <high scale="F">103</high> <low scale="F">70</low> <morning>cloudy</morning> <afternoon>Sunny &amp; hot</afternoon> <evening>Clear and Cooler</evening></weatherReport>

From: XML: A Primer, by Simon St. Laurent

Page 7: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

7

Overall structure

An XML document may start with one or more processing instructions (PIs) or directives:

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="ss.xsl"?>

Following the directives, there must be exactly one tag, called the root element, containing all the rest of the XML:

<weatherReport> ...</weatherReport>

Page 8: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

8

XML building blocks

Aside from the directives, an XML document is built from: elements: high in <high scale="F">103</high> tags, in pairs: <high scale="F">103</high> attributes: <high scale="F">103</high> entities: <afternoon>Sunny &amp; hot</afternoon> character data, which may be:

parsed (processed as XML)--this is the default -PCDATA unparsed (all characters stand for themselves) - CDATA

Page 9: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

9

Elements and attributes Attributes and elements are somewhat interchangeable Example using just elements:

<name> <first>David</first> <last>Matuszek</last></name>

Example using attributes: <name first="David" last="Matuszek"></name>

You will find that elements are easier to use in your programs--this is a good reason to prefer them

Attributes often contain metadata, such as unique IDs Generally speaking, browsers display only elements (values

enclosed by tags), not tags and attributes

Page 10: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

10

Well-formed XML Every element must have both a start tag and an end tag, e.g.

<name> ... </name> But empty elements can be abbreviated: <break />. XML tags are case sensitive XML tags may not begin with the letters xml, in any

combination of cases Elements must be properly nested, e.g. not <b><i>bold and

italic</b></i> Every XML document must have one and only one root element The values of attributes must be enclosed in single or double

quotes, e.g. <time unit="days"> Character data cannot contain < or &

Page 11: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

11

Entities

Five special characters must be written as entities: &amp; for & (almost always necessary) &lt; for < (almost always necessary) &gt; for > (not usually necessary) &quot; for " (necessary inside double quotes) &apos; for ' (necessary inside single quotes)

These entities can be used even in places where they are not absolutely required

These are the only predefined entities in XML

Page 12: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

12

XML declaration

The XML declaration looks like this:<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

The XML declaration is not required by browsers, but is required by most XML processors (so include it!)

If present, the XML declaration must be first--not even whitespace should precede it

Note that the brackets are <? and ?> version="1.0" is required (this is the only version so far) encoding can be "UTF-8" (ASCII) or "UTF-16" (Unicode), or

something else, or it can be omitted standalone tells whether there is a separate DTD

Page 13: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

13

Processing instructions PIs (Processing Instructions) may occur anywhere in the XML

document (but usually first) A PI is a command to the program processing the XML

document to handle it in a certain way XML documents are typically processed by more than one

program Programs that do not recognize a given PI should just ignore it General format of a PI: <?target instructions?> Example: <?xml-stylesheet type="text/css"

href="mySheet.css"?>

Page 14: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

14

Comments <!-- This is a comment in both HTML and XML --> Comments can be put anywhere in an XML document Comments are useful for:

Explaining the structure of an XML document Commenting out parts of the XML during development and testing

Comments are not elements and do not have an end tag The blanks after <!-- and before --> are optional The character sequence -- cannot occur in the comment The closing bracket must be --> Comments are not displayed by browsers, but can be seen by

anyone who looks at the source code

Page 15: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

15

CDATA By default, all text inside an XML document is parsed You can force text to be treated as unparsed character data by

enclosing it in <![CDATA[ ... ]]> Any characters, even & and <, can occur inside a CDATA Whitespace inside a CDATA is (usually) preserved The only real restriction is that the character sequence ]]>

cannot occur inside a CDATA CDATA is useful when your text has a lot of illegal characters

(for example, if your XML document contains some HTML text)

Page 16: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

16

Names in XML Names (as used for tags and attributes) must begin with

a letter or underscore, and can consist of: Letters Digits & Special symbols . (dot) - (hyphen) _ (underscore) : (colon) should be used only for namespaces Combining characters and extenders (not used in English)

Page 17: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

17

Namespaces Recall that DTDs are used to define the tags that can be used

in an XML document An XML document may reference more than one DTD Namespaces are a way to specify which DTD defines a given

tag XML, like Java, uses qualified names

This helps to avoid collisions between names Java: myObject.myVariable XML: myDTD:myTag Note that XML uses a colon (:) rather than a dot (.) <book xmlns:dave="http://www.abc.org/ns">

Page 18: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

18

Review of XML rules Start with <?xml version="1"?> XML is case sensitive You must have exactly one root element that encloses all

the rest of the XML Every element must have a closing tag Elements must be properly nested Attribute values must be enclosed in double or single

quotation marks There are only five predeclared entities (&amp; &lt; &gt;

&quot; &apos;)

Page 19: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

19

Another well-structured example <novel>

<foreword> <paragraph>This is the great American novel. </paragraph></foreword> <chapter number="1"> <paragraph>It was a dark and stormy night. </paragraph> <paragraph>Suddenly, a shot rang out! </paragraph> </chapter></novel>

Page 20: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

20

XML as a tree

An XML document represents a hierarchy; a hierarchy is a tree

novel

foreword chapternumber="1"

paragraph paragraph paragraph

This is the greatAmerican novel.

It was a darkand stormy night.

Suddenly, a shotrang out!

Page 21: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

21

Valid XML You can make up your own XML tags and attributes, but...

...any program that uses the XML must know what to expect! A DTD (Document Type Definition) defines what tags are legal

and where they can occur in the XML XML is well-structured if it follows the rules given earlier In addition, XML is valid if it declares a DTD and conforms to

that DTD A DTD can be included in the XML, but is typically a separate

document Errors in XML documents will stop XML programs Some alternatives to DTDs are XML Schemas.

Page 22: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

22

Viewing XML XML is designed to be processed by computer

programs, not to be displayed to humans Nevertheless, almost all current browsers can display

XML documents For best results, update your browsers to the newest available

versions Remember:

HTML is designed to be viewed, XML is designed to be used

Page 23: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Parsers

What is a parser?

A program that analyses the grammatical structure of an input, with respect to a given formal grammar

The parser determines how a sentence can be constructed from the grammar of the language by describing the atomic elements of the input and the relationship among them

Page 24: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

XML-Parsing Standards

We will consider two parsing methods that implement W3C standards for accessing XML

SAX (Simple API for XML) event-driven parsing “serial access” protocol Read only API

DOM (Document Object Model) convert XML into a tree of objects “random access” protocol Can update XML document (insert/delete nodes)

Page 25: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

SAX Parser SAX = Simple API for XML

XML is read sequentially

When a parsing event happens, the parser invokes the corresponding method of the corresponding handler

The handlers are programmer’s implementation of standard Java API (i.e., interfaces and classes)

Similar to an I/O-Stream, goes in one direction

Page 26: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Sample Data

<orders><order> <onum>1020</onum> <takenBy>1000</takenBy> <customer>1111</customer> <recDate>10-DEC 94</recDate> <items> <item> <pnum>10506</pnum> <quantity>1</quantity> </item>

<item> <pnum>10507</pnum> <quantity>1</quantity> </item> <item> <pnum>10508</pnum> <quantity>2</quantity> </item> <item> <pnum>10509</pnum> <quantity>3</quantity> </item> </items></order>...</orders>

Orders Data in XML: several orders, each with several items each item has a part number and a quantity

Page 27: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Sample Data

<orders><order> <onum>1020</onum> <takenBy>1000</takenBy> <customer>1111</customer> <recDate>10-DEC 94</recDate> <items> <item> <pnum>10506</pnum> <quantity>1</quantity> </item>

<item> <pnum>10507</pnum> <quantity>1</quantity> </item> <item> <pnum>10508</pnum> <quantity>2</quantity> </item> <item> <pnum>10509</pnum> <quantity>3</quantity> </item> </items></order>...</orders>

Orders Data in XML: several orders, each with several items each item has a part number and a quantity

startDocument

endDocument

Parsing Event

Page 28: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Sample Data

<orders><order> <onum>1020</onum> <takenBy>1000</takenBy> <customer>1111</customer> <recDate>10-DEC-94</recDate> <items> <item> <pnum>10506</pnum> <quantity>1</quantity> </item>

<item> <pnum>10507</pnum> <quantity>1</quantity> </item> <item> <pnum>10508</pnum> <quantity>2</quantity> </item> <item> <pnum>10509</pnum> <quantity>3</quantity> </item> </items></order>...</orders>

Orders Data in XML: several orders, each with several items each item has a part number and a quantity

startElement

endElement

Page 29: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Sample Data

<orders><order> <onum>1020</onum> <takenBy>1000</takenBy> <customer>1111</customer> <recDate>10-DEC-94</recDate> <items> <item> <pnum>10506</pnum> <quantity>1</quantity> </item>

<item> <pnum>10507</pnum> <quantity>1</quantity> </item> <item> <pnum>10508</pnum> <quantity>2</quantity> </item> <item> <pnum>10509</pnum> <quantity>3</quantity> </item> </items></order>...</orders>

Orders Data in XML: several orders, each with several items each item has a part number and a quantity

characters

Page 30: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

SAX Parsers

SAX Parser

When you see the start of the document do …When you see the start of an element do … When you see

the end of an element do …

<?xml version="1.0"?>...

Page 31: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Used to create a SAX Parser Handles

document events: start tag, end tag, etc.

Handles Parser Errors

Handles DTD

Handles Entities

XMLReaderXML

XML-ReaderFactory

ContentHandler

EntityResolver

ErrorHandler

DTDHandler

Page 32: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

SAX API

Two important classes in the SAX API: SAXParser and HandlerBase.

A new SAXParser object is created by:

public SAXParser()

Register a SAX handler with the parser object to receive notifications of various parser events by:

public void setDocumentHandler(DocumentHandler h)

A similar method is available to register an error handler:

public void setErrorHandler(ErrorHandler h)

Page 33: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

SAX API - Continued

The HandlerBase class is a default base class for all handlers. It implements the default behavior for the various handlers. Application writers extend this class by rewriting the following

event handler methods:

public void startDocument() throws SAXExceptionpublic void endDocument() throws SAXExceptionpublic void startElement() throws SAXExceptionpublic void endElement() throws SAXExceptionpublic void characters() throws SAXExceptionpublic void warning() throws SAXExceptionpublic void error() throws SAXException

Page 34: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Creating a SAX Parserimport org.xml.sax.*;import oracle.xml.parser.v2.SAXParser;public class SampleApp extends HandlerBase { // Global variables declared here static public void main(String [] args){ Parser parser = new SAXParser(); SampleApp app = new SampleApp(); parser.setDocumentHandler(app); parser.setErrorHandler(app); try { parser.parse(createURL(args[0]).toString()); } catch (SAXParseException e) { System.out.println(e.getMessage()); } }}

Page 35: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

DOM Parser DOM = Document Object Model

Parser creates a tree object out of the document

User accesses data by traversing the tree The tree and its traversal conform to a W3C standard

The API allows for constructing, accessing and manipulating the structure and content of XML documents

Page 36: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

<?xml version="1.0"?><!DOCTYPE countries SYSTEM "world.dtd"><countries> <country continent="&as;"> <name>Israel</name> <population year="2001">6,199,008</population> <city capital="yes"><name>Jerusalem</name></city> <city><name>Ashdod</name></city> </country> <country continent="&eu;"> <name>France</name> <population year="2004">60,424,213</population> </country></countries>

Page 37: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

The DOM TreeDocument

countries

country

Asia

continent

Israel

name

year

2001

6,199,008

population

city

capital

yes

name

Jerusalem

country

Europe

continent

France

nameyear

2004

60,424,213

population

city

capital

no

name

Ashdod

Page 38: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Using a DOM Tree

DOM Parser DOM TreeXML File

API

Application

Page 39: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

The Node Interface

The nodes of the DOM tree include a special root (denoted document) element nodes text nodes and CDATA sections attributes comments and more ...

Every node in the DOM tree implements the Node interface

Page 40: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Node Navigation

Every node has a specific location in tree Node interface specifies methods for tree navigation

Node getFirstChild(); Node getLastChild(); Node getNextSibling(); Node getPreviousSibling(); Node getParentNode(); NodeList getChildNodes(); NamedNodeMap getAttributes()

Page 41: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Node Navigation (cont)

getFirstChild()getPreviousSibling()

getChildNodes()

getNextSibling()getLastChild()

getParentNode()

Page 42: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

<?xml version="1.0" standalone="no"?><!DOCTYPE geography SYSTEM "geo.dtd"> <geography> <state id="georgia"> <scode>GA</scode> <sname>Georgia</sname> <capital idref="atlanta"/> <citiesin idref="atlanta"/> <citiesin idref="columbus"/> <citiesin idref="savannah"/> <citiesin idref="macon"/> <nickname>Peach State</nickname> <population>6478216</population> </state> ... <city id="atlanta"> <ccode>ATL</ccode> <cname>Atlanta</cname> <stateof idref="georgia"/> </city> ...</geography>

DOM Parsing Example

Consider the following XML data file describinggeographical information about states in U.S.

Page 43: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

<!ELEMENT geography (state|city)*><!ELEMENT state (scode, sname, capital, citiesin*, nickname, population)> <!ATTLIST state id ID #REQUIRED><!ELEMENT scode (#PCDATA)><!ELEMENT sname (#PCDATA)><!ELEMENT capital EMPTY> <!ATTLIST capital idref IDREF #REQUIRED><!ELEMENT citiesin EMPTY> <!ATTLIST citiesin idref IDREF #REQUIRED><!ELEMENT nickname (#PCDATA)><!ELEMENT population (#PCDATA)><!ELEMENT city (ccode, cname, stateof)> <!ATTLIST city id ID #IMPLIED><!ELEMENT ccode (#PCDATA)><!ELEMENT cname (#PCDATA)><!ELEMENT stateof EMPTY> <!ATTLIST stateof idref IDREF #REQUIRED>

Geography XML Data DTD

Page 44: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

create type city_type as object ( ccode varchar2(15), cname varchar2(50));

create type cities_in_table as table of city_type;

create table state ( scode varchar2(15), sname varchar2(50), nickname varchar2(100), population number(30), capital city_type, cities_in cities_in_table, primary key (scode))nested table cities_in store as cities_tab;

Geography Data: SQL Schema (Oracle Objects)

Page 45: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

import org.w3c.dom.*;import org.w3c.dom.Node;import oracle.xml.parser.v2.*;

public class CreateGeoData { static public void main(String[] argv) throws SQLException { // Get an instance of the parser DOMParser parser = new DOMParser(); // Set various parser options: validation on, // warnings shown, error stream set to stderr. parser.setErrorStream(System.err); parser.setValidationMode(true); parser.showWarnings(true);

// Parse the document. parser.parse(url);

// Obtain the document. XMLDocument doc = parser.getDocument();

Create DOM Parser Object

Page 46: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

NodeList sl = doc.getElementsByTagName("state");NodeList cl = doc.getElementsByTagName("city");

XMLNode e = (XMLNode) sl.item(j);scode = e.valueOf("scode"); sname = e.valueOf("sname"); nickname = e.valueOf("nickname"); population = Long.parseLong(e.valueOf("population"));

XMLNode child = (XMLNode) e.getFirstChild();while (child != null) { if (child.getNodeName().equals("capital")) break; child = (XMLNode) child.getNextSibling();}

NamedNodeMap nnm = child.getAttributes();XMLNode n = (XMLNode) nnm.item(0);

Traverse DOM Tree

Page 47: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Node Manipulation Children of a node in a DOM tree can be manipulated - added,

edited, deleted, moved, copied, etc.

To constructs new nodes, use the methods of Document createElement, createAttribute, createTextNode,

createCDATASection etc.

To manipulate a node, use the methods of Node appendChild, insertBefore, removeChild, replaceChild,

setNodeValue, cloneNode(boolean deep) etc.

Page 48: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

SAX vs DOM Parsing: Efficiency

The DOM object built by DOM parsers is usually complicated and requires more memory storage than the XML file itself

A lot of time is spent on construction before use For some very large documents, this may be impractical

SAX parsers store only local information that is encountered during the serial traversal

Hence, programming with SAX parsers is, in general, more efficient

Page 49: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Programming using SAX is Difficult In some cases, programming with SAX is difficult:

How can we find, using a SAX parser, elements e1 with ancestor e2?

How can we find, using a SAX parser, elements e1 that have a descendant element e2?

How can we find the element e1 referenced by the IDREF attribute of e2?

Page 50: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Node Navigation SAX parsers do not provide access to elements other than the

one currently visited in the serial (DFS) traversal of the document

In particular, They do not read backwards They do not enable access to elements by ID or name

DOM parsers enable any traversal method

Hence, using DOM parsers is usually more comfortable

Page 51: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

More DOM Advantages

DOM object compiled XML

You can save time and effort if you send and receive DOM objects instead of XML files

But, DOM object are generally larger than the source

DOM parsers provide a natural integration of XML reading and manipulating

e.g., “cut and paste” of XML fragments

Page 52: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

Which should we use? DOM vs. SAX If your document is very large and you only need to extract

only a few elements – use SAX

If you need to manipulate (i.e., change) the XML – use DOM

If you need to access the XML many times – use DOM (assuming the file is not too large)

Page 53: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

53

Vocabulary

SGML: Standard Generalized Markup Language XML : Extensible Markup Language DTD: Document Type Definition element: a start and end tag, along with their contents attribute: a value given in the start tag of an element entity: a representation of a particular character or string PI: a Processing Instruction, to possibly be used by a

program that processes this XML namespace: a unique string that references a DTD well-formed XML: XML that follows the basic syntax rules valid XML: well-formed XML that conforms to a DTD

Page 54: 18-Feb-16 XML eXtensible Markup Language THETOPPERSWAY.COM

54

THANKS

THETOPPERSWAY.COM