Structured-Document Processing Languages
Spring 2005
Course Review
Repetitio mater studiorum est! (Repetition is the mother of learning!)
SDPL 2005 Course Review 2
Goals of the Course
Learn about central models and languages for
– manipulating
– representing
– transforming and
– querying
structured documents (or XML)
"Generic XML processing technology"

Methodological Goals
Some central professional skills
– consulting technical specifications
– experimenting with SW implementations
Ability to think…?
– to find out relationships
– to apply knowledge in new situations
("Pidgin English" for scientific communication)

XML?
Extensible Markup Language is not a markup language!
– does not fix a tag set nor its semantics (as markup languages such as HTML do)
XML is
– a way to use markup to represent information
– a metalanguage
» supports definition of specific markup languages through XML DTDs or Schemas
» e.g. XHTML, a reformulation of HTML using XML

XML Encoding of Structure: Example
<S> <W>Hello</W> <E A='1'> <W>world!</W> </E> </S>
[Figure: the corresponding tree. The root S has the children W (text "Hello") and E (attribute A=1); E has the child W (text "world!").]

Basics of XML DTDs
A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents
Syntax (in the prolog of a document instance):
<!DOCTYPE rootElemType SYSTEM "ex.dtd"
    <!-- "external subset" in file ex.dtd -->
    [ <!-- "internal subset" may come here --> ]>
DTD is the union of the external and internal subset

What Do Declarations Look Like?
<!ELEMENT invoice (client, item+)>
<!ATTLIST invoice num NMTOKEN #REQUIRED>
<!ELEMENT client (name, email?)>
<!ATTLIST client num NMTOKEN #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item
    price NMTOKEN #REQUIRED
    unit (FIM | EUR) "EUR" >

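The invoice declarations can be tried out with a validating parser. The following is a minimal sketch, assuming the declarations are placed in the internal subset of a test document; the class name DtdValidationSketch and its helper methods are made up for this illustration. It uses the JAXP DocumentBuilderFactory with validation switched on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdValidationSketch {

    // The invoice declarations, written as an internal DTD subset.
    static final String DOCTYPE =
        "<!DOCTYPE invoice [\n"
      + "<!ELEMENT invoice (client, item+)>\n"
      + "<!ATTLIST invoice num NMTOKEN #REQUIRED>\n"
      + "<!ELEMENT client (name, email?)>\n"
      + "<!ATTLIST client num NMTOKEN #REQUIRED>\n"
      + "<!ELEMENT name (#PCDATA)>\n"
      + "<!ELEMENT email (#PCDATA)>\n"
      + "<!ELEMENT item (#PCDATA)>\n"
      + "<!ATTLIST item price NMTOKEN #REQUIRED unit (FIM | EUR) \"EUR\">\n"
      + "]>\n";

    // Parses DOCTYPE + body with DTD validation; any validity error is rethrown.
    static Document parseValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    static boolean isValid(String body) {
        try {
            parseValid(body);
            return true;
        } catch (IllegalArgumentException ex) {
            return false;
        }
    }

    // Returns the unit attribute of the first item, including a defaulted value.
    static String unitOfFirstItem(String body) {
        Element item = (Element) parseValid(body).getElementsByTagName("item").item(0);
        return item.getAttribute("unit");
    }

    public static void main(String[] args) {
        String ok = "<invoice num='42'><client num='7'><name>Ann</name></client>"
                  + "<item price='10'>Pen</item></invoice>";
        System.out.println(isValid(ok));
        System.out.println(unitOfFirstItem(ok));
        // No client element, so the content model (client, item+) is violated.
        System.out.println(isValid("<invoice num='42'><item price='10'>Pen</item></invoice>"));
    }
}
```

The second call shows that the parser supplies the declared default value of unit when the attribute is left out.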
Element type declarations
The general form is
<!ELEMENT elementTypeName (E)>
where E is a content model
– regular expression of element names
Content model operators:
E | F : alternation
E, F : concatenation
E? : optional
E* : zero or more
E+ : one or more
(E) : grouping

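Because a content model is a regular expression over element names, a sequence of child elements can be checked with an ordinary regex engine. The sketch below is only an illustration of that idea, not how a real validator is implemented; the class ContentModelSketch and the encoding (each child name followed by one space) are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ContentModelSketch {

    // (client, item+) rewritten over space-terminated names, e.g. "client item item ".
    static final Pattern INVOICE = Pattern.compile("client (item )+");

    // (name, email?) rewritten the same way.
    static final Pattern CLIENT = Pattern.compile("name (email )?");

    // True if the sequence of child element names matches the content model.
    static boolean matches(Pattern model, List<String> childNames) {
        StringBuilder encoded = new StringBuilder();
        for (String name : childNames) {
            encoded.append(name).append(' ');
        }
        return model.matcher(encoded).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(INVOICE, Arrays.asList("client", "item", "item")));
        System.out.println(matches(INVOICE, Arrays.asList("client")));
        System.out.println(matches(CLIENT, Arrays.asList("name")));
    }
}
```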
XML Schema Definition Language
XML syntax
– schema documents easier to manipulate by programs (than the special DTD syntax)
Compatibility with namespaces
– can validate documents using declarations from multiple sources
Content datatypes
– 44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute types)
– mechanisms to derive user-defined datatypes

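As an illustration of content datatypes, the sketch below validates an item element against a small schema that uses the built-in type xs:decimal and a user-defined enumeration derived from xs:string. The schema itself, the type name currency and the class SchemaValidationSketch are assumptions made for this example; the validation calls come from the javax.xml.validation package of the JDK.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class SchemaValidationSketch {

    // A hypothetical schema for a single item element.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:simpleType name='currency'>"
      + "<xs:restriction base='xs:string'>"
      + "<xs:enumeration value='FIM'/><xs:enumeration value='EUR'/>"
      + "</xs:restriction></xs:simpleType>"
      + "<xs:element name='item'><xs:complexType><xs:simpleContent>"
      + "<xs:extension base='xs:string'>"
      + "<xs:attribute name='price' type='xs:decimal' use='required'/>"
      + "<xs:attribute name='unit' type='currency' default='EUR'/>"
      + "</xs:extension></xs:simpleContent></xs:complexType></xs:element>"
      + "</xs:schema>";

    // True if the document is valid with respect to XSD.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // Without a custom error handler, validity errors are thrown as exceptions.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<item price='9.50'>Pen</item>"));
        System.out.println(isValid("<item price='cheap'>Pen</item>"));
        System.out.println(isValid("<item price='1' unit='USD'>Pen</item>"));
    }
}
```

Unlike the NMTOKEN type available in a DTD, xs:decimal rejects a price such as "cheap".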
XML Namespaces
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <!-- XHTML is the 'default namespace' -->
  <xsl:template match="doc/title">
    <H1>
      <xsl:apply-templates />
    </H1>
  </xsl:template>
</xsl:stylesheet>

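How a namespace-aware parser resolves the prefixed and unprefixed names of the stylesheet can be checked with a few lines of DOM code. This is a sketch under the assumption that the stylesheet is available as a string; the class NamespaceSketch and the method expandedName are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespaceSketch {

    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns='http://www.w3.org/TR/xhtml1/strict'>"
      + "<xsl:template match='doc/title'>"
      + "<H1><xsl:apply-templates/></H1>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Returns "{namespace URI}local name" for the first element with the given qualified name.
    static String expandedName(String qualifiedName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(STYLESHEET)));
            Element element = (Element) doc.getElementsByTagName(qualifiedName).item(0);
            return "{" + element.getNamespaceURI() + "}" + element.getLocalName();
        } catch (Exception ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedName("xsl:template"));
        System.out.println(expandedName("H1")); // unprefixed, so in the default namespace
    }
}
```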
3. XML Processor APIs
How can applications manipulate structured documents?
– An overview of document parser interfaces
3.1 SAX: an event-based interface
3.2 DOM: an object-based interface
3.3 JAXP: Java API for XML Processing

A SAX-based application
[Figure: the application main routine calls Parse(); while reading the document, the parser invokes the application's callback routines.]
Input document:
<?xml version='1.0'?>
<A i="1">Hi!</A>
Callbacks and their arguments:
startDocument()
startElement(): "A", [i="1"]
characters(): "Hi!"
endElement(): "A"

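The callback sequence for the small document <A i="1">Hi!</A> can be reproduced with a handler that records each event. This is a minimal sketch, assuming the JAXP SAX parser shipped with the JDK; the class SaxEventsSketch and the text format of the log entries are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {

    // Parses the document and returns the callbacks in the order they were invoked.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                log.add("startDocument()");
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                StringBuilder entry = new StringBuilder("startElement(" + qName);
                for (int k = 0; k < atts.getLength(); k++) {
                    entry.append(' ').append(atts.getQName(k)).append('=').append(atts.getValue(k));
                }
                log.add(entry.append(')').toString());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver longer text in several chunks.
                log.add("characters(" + new String(ch, start, length) + ")");
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                log.add("endElement(" + qName + ")");
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalArgumentException(ex);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<?xml version='1.0'?><A i='1'>Hi!</A>")) {
            System.out.println(event);
        }
    }
}
```

The application never sees a tree: it only reacts to events, which keeps memory use independent of document size.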
DOM: What is it?
An object-based, language-neutral API for XML and HTML documents
– Allows programs and scripts to build, navigate, and modify documents
In contrast to "Serial Access XML", DOM could be thought of as "Directly Obtainable in Memory"

DOM structure model
<invoice form="00" type="estimated">
  <addressdata>
    <name>John Doe</name>
    <address>
      <streetaddress>Pyynpolku 1</streetaddress>
      <postoffice>70460 KUOPIO</postoffice>
    </address>
  </addressdata>
  ...
[Figure: the DOM tree of the fragment. A Document node contains the Element invoice, whose attributes form="00" and type="estimated" are held in a NamedNodeMap. Below it are the Elements addressdata, name, address, streetaddress and postoffice, with the Text nodes "John Doe", "Pyynpolku 1" and "70460 KUOPIO" as leaves.]

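The node types of the structure model map directly to interfaces in org.w3c.dom. The following sketch navigates the invoice fragment; it assumes the fragment is closed with </invoice> and written without whitespace between elements, so that getFirstChild() returns elements rather than whitespace text nodes. The class DomNavigationSketch and its summary method are names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class DomNavigationSketch {

    static final String INVOICE =
        "<invoice form='00' type='estimated'>"
      + "<addressdata><name>John Doe</name>"
      + "<address><streetaddress>Pyynpolku 1</streetaddress>"
      + "<postoffice>70460 KUOPIO</postoffice></address>"
      + "</addressdata></invoice>";

    // Walks Document -> Element -> NamedNodeMap / Text and reports what it finds.
    static String summary() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(INVOICE)));
            Element invoice = doc.getDocumentElement();
            NamedNodeMap attributes = invoice.getAttributes();
            Node addressdata = invoice.getFirstChild();
            Node name = addressdata.getFirstChild();
            Text nameText = (Text) name.getFirstChild();
            return invoice.getTagName()
                + " " + attributes.getLength()
                + " " + attributes.getNamedItem("type").getNodeValue()
                + " " + name.getNodeName()
                + " " + nameText.getData();
        } catch (Exception ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```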
Overview of XSLT Transformation
[Figure: the source document is read into a source tree; the transformation process applies the style sheet to the source tree and produces a result tree; the output process writes the result tree as XML, HTML or text.]

JAXP 1.1
An interface for "plugging in" and using XML processors in Java applications
– includes packages
» org.xml.sax: SAX 2.0 interface
» org.w3c.dom: DOM Level 2 interface
» javax.xml.parsers: initialization and use of parsers
» javax.xml.transform: initialization and use of transformers (XSLT processors)
Included in JDK starting from vers. 1.4

JAXP: Using a SAX parser (1)
[Figure: call chain for reading the XML file f.xml: .newSAXParser(), then .getXMLReader(), then .parse("f.xml").]

JAXP: Using a DOM parser (1)
[Figure: call chain: .newDocumentBuilder(), then either .parse("f.xml") to read the file f.xml or .newDocument() to create a new document.]

JAXP: Using Transformers (1)
[Figure: call chain: .newTransformer(…) creates a transformer from an XSLT script, then .transform(.,.) applies it to a source and writes a result.]

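The two transformer calls can be tried with the XSLT processor built into the JDK. The sketch below is an illustration under a few assumptions: the stylesheet is the doc/title template from the namespaces example with the default XHTML namespace left out (so that the output carries no xmlns attribute), input and output are strings, and the class TransformSketch is a made-up name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformSketch {

    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='doc/title'>"
      + "<H1><xsl:apply-templates/></H1>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies XSLT to the given document and returns the serialized result tree.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<doc><title>Course Review</title></doc>"));
    }
}
```

The doc element itself has no template, so it is handled by the built-in rules of XSLT, which simply continue with its children.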
CSS - Cascading Style Sheets
A stylesheet language
– mainly to specify the representation of web pages by attaching style (fonts, colours, margins, …) to HTML/XML documents
Example style rule:
H1 {color: blue; font-weight: bold;}

CSS Processing Model (simplified)
0. Parse the document into a tree
1. Match style rules to elements of the tree
– annotate each element with a value assigned for each relevant property
» inheritance and, in case of competing rules, elaborate "cascade" rules applied to select which value is assigned
2. Generate a formatting structure of the annotated document tree
– consists of nested rectangular boxes
3. Render the formatting structure
– display, print, audio-synthesize, ...

Transformation & Formatting
[Figure: a transformation step controlled by an XSLT script, followed by a formatting step.]

Page regions
A simple page can contain 1-5 regions, specified by child elements of the simple-page-master

Top-level formatting objects
Slightly simplified:
fo:root
  fo:layout-master-set
    (fo:simple-page-master | fo:page-sequence-master)+
      fo:simple-page-master: fo:region-body, fo:region-before?, fo:region-after?, fo:region-start?, fo:region-end?
      fo:page-sequence-master: specify masters for page sequences, by referring to simple-page-masters
  fo:page-sequence+ (contents of pages)
    fo:flow

XQuery in a Nutshell
Functional expression language
Strongly-typed: (XML Schema) types may be assigned to expressions statically
Extends XPath 2.0 (but not all axes required)
– common for XQuery 1.0 and XPath 2.0:
» Functions and Operators, W3C WD 4/4/2005
Roughly: XQuery ≈ XPath' + XSLT' + SQL'

FLWOR ("flower") Expressions
Constructed from for, let, where, order by and return clauses (~SQL select-from-where)
Form: (ForClause | LetClause)+ WhereClause? OrderByClause? "return" Expr
FLWOR binds variables to values, and uses these bindings to construct a result (an ordered sequence of nodes)

XQuery Example
for $pn in distinct-values(doc("sp.xml")//pno)
let $sp := doc("sp.xml")//sp_tuple[pno=$pn]
where count($sp) >= 3
order by $pn
return
  <well_supplied_item> {
    <pno>{$pn}</pno>,
    <avgprice> {avg($sp/price)} </avgprice>
  } </well_supplied_item>

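To see what the clauses of the query compute, the same result can be built by hand in Java. The JDK has no XQuery processor, so this sketch is a rewrite of the query logic with DOM and collections, not an XQuery evaluation. It assumes that each sp_tuple has one pno and one numeric price child; the class FlworSketch and the "pno average" output format are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FlworSketch {

    // Returns "pno average-price" for every part number with at least three tuples.
    static List<String> wellSupplied(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // for/let: group the prices of all sp_tuple elements by pno.
            // The TreeMap keeps the keys sorted, which covers "order by $pn".
            Map<String, List<Double>> pricesByPno = new TreeMap<>();
            NodeList tuples = doc.getElementsByTagName("sp_tuple");
            for (int k = 0; k < tuples.getLength(); k++) {
                Element tuple = (Element) tuples.item(k);
                String pno = tuple.getElementsByTagName("pno").item(0)
                    .getTextContent().trim();
                double price = Double.parseDouble(
                    tuple.getElementsByTagName("price").item(0).getTextContent().trim());
                pricesByPno.computeIfAbsent(pno, key -> new ArrayList<>()).add(price);
            }

            List<String> result = new ArrayList<>();
            for (Map.Entry<String, List<Double>> group : pricesByPno.entrySet()) {
                List<Double> prices = group.getValue();
                if (prices.size() >= 3) {            // where count($sp) >= 3
                    double sum = 0;
                    for (double p : prices) {
                        sum += p;
                    }
                    // return: part number and avg($sp/price)
                    result.add(group.getKey() + " " + (sum / prices.size()));
                }
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        String sp = "<sp>"
            + "<sp_tuple><pno>P1</pno><price>10</price></sp_tuple>"
            + "<sp_tuple><pno>P2</pno><price>5</price></sp_tuple>"
            + "<sp_tuple><pno>P1</pno><price>20</price></sp_tuple>"
            + "<sp_tuple><pno>P1</pno><price>30</price></sp_tuple>"
            + "</sp>";
        System.out.println(wellSupplied(sp));
    }
}
```

The comparison shows how much bookkeeping the declarative FLWOR form hides.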
Course Main Message
XML is a universal way to represent info as tree-like data structures
There are specialized and powerful technologies for processing it
Development is still going on