Sebastian Bitzer
Seminar Semistructured Data
University of Osnabrueck
May 2, 2003

XML

An introduction in relation to semistructured data

02.05.2003 XML 2

Overview

• Background / History

• Basic syntax

• XML and semistructured data

• Document type definitions

• Extensions for XML

• Paraphernalia

Overview

• Background / History
  – SGML
  – SGML, HTML and XML
  – World Wide Web Consortium
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia

Standard Generalized Markup Language (SGML)

• models information exclusively on the basis of its internal structure and its function

→ platform-independent storage of structured information

• standard: ISO 8879 from 1986

SGML, HTML and XML

• HTML is an SGML application for the web (one special instance of SGML)

• XML ⊂ SGML (XML is a subset of SGML)

Why XML from SGML?

SGML:
– is exceedingly complex and difficult to understand
– is formally so complex that online applications have difficulty processing it in reasonable time
– has many properties which were not designed for use in network environments (remember that it is a standard from 1986)

World Wide Web Consortium

• Nov 1996: initial XML draft

• Dec 1997: XML 1.0 Proposed Recommendation

• Feb 1998: W3C Recommendation: Extensible Markup Language (XML) 1.0

• Oct 2000: XML 1.0, 2nd edition

Overview

• Background / History
• Basic syntax
  – Elements
  – Attributes
  – Well-formed XML documents
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia

Elements

• element = <tag> content </tag>

• <tag>, </tag> = markup (start tag and end tag)

• content = everything between the start tag and the end tag

• no predefined tags

• basic content (no markups) is treated as text: PCDATA (Parsed Character Data)

• abbreviation for empty elements: <tag />

Example

<personnel>
  <person>
    <name> John Cage </name>
    <function> Bearer </function>
  </person>
  <person>
    <name> Elaine Vassal </name>
    <function> chief secretary </function>
  </person>
  …
</personnel>

Attributes

• sometimes called “property” in data models

• (name=“value”) pairs

• value is always a string (attribute type CDATA in the general case; NMTOKEN is a more restricted token type)

• allows building of groups of elements

• ambiguity: information as attribute or element?

Example

<personnel>
  <person sex="m">
    <name> John Cage </name>
    <function department="civil rights"> Bearer </function>
  </person>
  <person sex="f">
    <name> Elaine Vassal </name>
    <function department="admin"> chief secretary </function>
  </person>
  …
</personnel>
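
As a rough illustration of how elements and attributes are read by a program, the following sketch parses the two persons shown above with Python's standard xml.etree.ElementTree module. The function name summarize and the constant DOC are made up for this example, and the elided "…" entries of the slide are simply left out.

```python
import xml.etree.ElementTree as ET

# The two <person> entries from the slide; the elided "…" entries are omitted.
DOC = """
<personnel>
  <person sex="m">
    <name> John Cage </name>
    <function department="civil rights"> Bearer </function>
  </person>
  <person sex="f">
    <name> Elaine Vassal </name>
    <function department="admin"> chief secretary </function>
  </person>
</personnel>
"""


def summarize(xml_text):
    """Return one (name, sex, function, department) tuple per <person>."""
    root = ET.fromstring(xml_text)
    rows = []
    for person in root.findall("person"):
        function = person.find("function")
        rows.append((
            person.findtext("name").strip(),  # element content (PCDATA)
            person.get("sex"),                # attribute value, always a string
            function.text.strip(),
            function.get("department"),
        ))
    return rows


if __name__ == "__main__":
    for row in summarize(DOC):
        print(row)
```

The same information could have been encoded with a <sex> or <department> subelement instead of an attribute; the program would then use find() where it now uses get().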

Well-formed XML documents

• an XML document is well-formed if:
  – tags nest properly (not <t1><t2></t1></t2>)
  – attributes are unique within one element (not <tag att="a" att="b">)
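
A minimal sketch of a well-formedness test, assuming Python's standard xml.etree.ElementTree parser is acceptable as the checker: a non-validating parser rejects documents that break either rule with a ParseError. The helper name is_well_formed is an assumption of this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on any parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<t1><t2></t2></t1>"))        # proper nesting
    print(is_well_formed("<t1><t2></t1></t2>"))        # tags overlap
    print(is_well_formed('<tag att="a" att="b"/>'))    # duplicate attribute
```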

Overview

• Background / History
• Basic syntax
• XML and semistructured data
  – Simple transformations
  – Differences that make transformation more difficult
  – Additional constructs
• Document type definitions
• Extensions for XML
• Paraphernalia

Simple transformations

with basic XML syntax (no attributes, tree as data structure):

• from XML to ssd:

<person>
  <name> John Cage </name>
  <function> Bearer </function>
</person>

corresponds to:

{person : {name : “John Cage”, function : “Bearer”}}

Simple transformations II

• from ssd to XML (transformation function T):

T(atomic value) = atomic value

T({l1 : v1, …, ln : vn}) =
  <l1> T(v1) </l1>
  …
  <ln> T(vn) </ln>
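
Both directions of the simple transformation can be sketched in a few lines of Python. This is an illustration only, under the assumption that an ssd object {l1 : v1, …, ln : vn} is represented as a list of (label, value) pairs (so that labels may repeat) and an atomic value as a string. The names T, to_ssd and xml_to_ssd are chosen for this example, and attributes are ignored, as on the slides.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def T(value):
    """ssd -> XML text, following the recursive definition of T."""
    if isinstance(value, list):                  # {l1 : v1, ..., ln : vn}
        return "".join(f"<{label}>{T(v)}</{label}>" for label, v in value)
    return escape(str(value))                    # atomic value


def to_ssd(element):
    """XML element -> ssd value; elements without children become atomic."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return [(child.tag, to_ssd(child)) for child in children]


def xml_to_ssd(xml_text):
    """Parse XML text and wrap the root as a one-entry ssd object."""
    root = ET.fromstring(xml_text)
    return [(root.tag, to_ssd(root))]


if __name__ == "__main__":
    person = [("person", [("name", "John Cage"), ("function", "Bearer")])]
    print(T(person))
    print(xml_to_ssd(T(person)) == person)
```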

Differences that make transformation more difficult

• different semantics of labels

• element or attribute

• order

• mixing elements and text

Semantics of labels

XML
• graphs with labels on nodes

ssd
• graphs with labels on edges

[Figure: the same person record (name “Alan”, age 42, email “ab@com”) drawn twice, once with labels on the nodes (XML) and once with labels on the edges (ssd)]

<person>
  <name>Alan</name>
  <age>42</age>
  <email>ab@com</email>
</person>

{person : {name : “Alan”, age : 42, email : “ab@com”}}

Element or attribute

• ambiguity between representing information as an element or as an attribute → different possibilities of encoding

• in particular in combination with references

<a> <b id="&o123"> some string </b> </a>
<a c="&o123" />

or:

<a b="&o123" />
<a> <c id="&o123"> some string </c> </a>

[Figure: graph with the labels a, a, b, c and the shared value “some string”]

Order

• ssd model based on unordered collections

• XML elements are ordered

• but: XML attributes are not

• unordered data can be processed more efficiently

→ for data exchange applications, the order of XML elements can be ignored

Mixing elements and text

• XML allows mixing of PCDATA and subelements:

<talk>XML - An introduction in relation to semistructured data
  <speaker> Sebastian Bitzer </speaker>
</talk>
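
Mixed content is one reason the simple tree-to-ssd mapping breaks down: the text is neither a label nor a leaf of its own. As a sketch of how a parser exposes it, Python's xml.etree.ElementTree stores text before the first subelement in the parent's text field and text after a subelement in that subelement's tail field. The helper mixed_content and the constant TALK are illustrative names.

```python
import xml.etree.ElementTree as ET

TALK = """<talk>XML - An introduction in relation to semistructured data
  <speaker> Sebastian Bitzer </speaker>
</talk>"""


def mixed_content(element):
    """List direct content in document order: text chunks as str,
    subelements as (tag, text) tuples. Whitespace-only text is skipped."""
    parts = []
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        parts.append((child.tag, (child.text or "").strip()))
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


if __name__ == "__main__":
    print(mixed_content(ET.fromstring(TALK)))
```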

Additional constructs in XML

• comments: <!-- comment -->

• processing instructions: <?application-name instruction-text?>

• CDATA (for escaping): <![CDATA[ markups won’t be processed here ]]>

• entities, e.g. “&auml;”; external files can also be declared as entities, e.g. a gif-file as “&pic-1;”
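
The effect of a CDATA section can be checked with a small sketch, assuming Python's standard xml.etree.ElementTree parser: characters inside the section arrive as plain text, and nothing in it is interpreted as markup. The element name note and the helper cdata_text are made up for this example.

```python
import xml.etree.ElementTree as ET

# "<b>" and "&" inside the CDATA section are ordinary characters.
NOTE = "<note><![CDATA[<b> is not a tag here & needs no escaping]]></note>"


def cdata_text(xml_text):
    """Return the root element's text and its number of child elements."""
    root = ET.fromstring(xml_text)
    return root.text, len(root)


if __name__ == "__main__":
    print(cdata_text(NOTE))
```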

Overview

• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
  – DTDs as grammars
  – DTDs as schemas
  – Attributes
  – Valid XML documents
  – Limitations
• Extensions for XML
• Paraphernalia

DTDs as grammar

• document type definition (DTD) serves as grammar for underlying XML document

• is precisely a context-free grammar (non-terminal → ordered list of one or more terminals and non-terminals)

• can be recursive

Definitions

DTD:

<!DOCTYPE root-name [ element-def.s ]>

element-def.s:

<!ELEMENT name ( content model )>

content model:

ordered list of names of elements which can occur in the outer element

Variations of content model

<!ELEMENT r1 (a?, b*, (c | d+))>

means that elements of type “r1” contain:
– 0 or 1 “a” (“a” is optional) and
– arbitrarily many “b” (0 - ∞) and
– either: exactly 1 “c” (“c” is obligatory)
  or: at least 1 “d” (“d” is required)

groups can be built, too:

<!ELEMENT r2 ((a, b)+, c?)>

means: at least one sequence of “a” followed by “b” comes in front of the optional “c”
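
Content models are regular expressions over element names, so the two declarations can be mimicked with Python's re module. The sketch below is a simplification, not a DTD validator: it assumes single-letter element names so that the sequence of child tags can be written as a string, and it reads the first model as "optional a, any number of b, then either one c or at least one d". R1, R2 and children_match are names invented for the example.

```python
import re
import xml.etree.ElementTree as ET

R1 = re.compile(r"a?b*(c|d+)")   # a?, b*, then either c or d+
R2 = re.compile(r"(ab)+c?")      # ((a, b)+, c?)


def children_match(xml_text, pattern):
    """Check the sequence of child tags of the root against a content model."""
    root = ET.fromstring(xml_text)
    sequence = "".join(child.tag for child in root)
    return pattern.fullmatch(sequence) is not None


if __name__ == "__main__":
    print(children_match("<r1><a/><b/><b/><c/></r1>", R1))   # allowed
    print(children_match("<r1><c/><d/></r1>", R1))           # c and d together
    print(children_match("<r2><a/><b/><a/><b/><c/></r2>", R2))
```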

DTDs as Schemas

• DTD:

<!DOCTYPE db [
  <!ELEMENT db ((r1,r2)*)>
  <!ELEMENT r1 ((a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a))>
  <!ELEMENT r2 ((c,d) | (d,c))>
  <!ELEMENT a (#PCDATA)>
  <!ELEMENT b (#PCDATA)>
  <!ELEMENT c (#PCDATA)>
  <!ELEMENT d (#PCDATA)>
]>

can be seen as a representation of the relational schema r1(a,b,c), r2(c,d)
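
Because a DTD content model is ordered, an unordered relation with n attributes needs all n! orderings spelled out, which is what the declarations for r1 and r2 do. The following sketch generates such declarations with itertools.permutations; the function names are assumptions of this example, and the output format merely imitates the declarations above.

```python
from itertools import permutations


def unordered_content_model(names):
    """Build a content model that accepts the names in any order."""
    alternatives = ["(" + ",".join(p) + ")" for p in permutations(names)]
    return "(" + " | ".join(alternatives) + ")"


def element_decl(relation, attributes):
    """Return the <!ELEMENT ...> declaration for one relation."""
    return f"<!ELEMENT {relation} {unordered_content_model(attributes)}>"


if __name__ == "__main__":
    print(element_decl("r1", ["a", "b", "c"]))   # 3! = 6 alternatives
    print(element_decl("r2", ["c", "d"]))        # 2! = 2 alternatives
```

The factorial growth is one concrete form of the "order" limitation of DTDs as schemas.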

Declaring attributes

<!ATTLIST el.name att.name1 type1 spec1 att.name2 type2 spec2 … >

el.name: element which is modified by att.s

type: often “CDATA”, but also more restricted e.g.: “(m|f)” for male or female in att. “sex”

spec: #REQUIRED, #IMPLIED, #FIXED or default value

Unique Identifiers

e.g.:

<!ATTLIST person
  id       ID     #REQUIRED
  mom      IDREF  #IMPLIED
  dad      IDREF  #IMPLIED
  children IDREFS #IMPLIED>

instance:

<person id="john" mom="jane" dad="james" children="jack jim">

Valid XML documents

• an XML document is valid if:
  – the document is well-formed
  – it additionally has a DTD
  – it conforms to that DTD:
    • elements are only nested as described in the DTD
    • only attributes allowed by the DTD are used
    • all attributes of type ID have distinct values
    • all IDREF and IDREFS values refer to existing identifiers
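
The two identifier rules can be illustrated with a short Python sketch. It is not a validating parser: xml.etree.ElementTree does not read the DTD, so the attribute types of the person example (id as ID, mom and dad as IDREF, children as IDREFS) are hard-coded here as an assumption, and the root element family is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed attribute types, mirroring the ATTLIST for "person".
ID_ATTR = "id"
IDREF_ATTRS = ("mom", "dad")
IDREFS_ATTRS = ("children",)


def check_ids(xml_text):
    """Return a list of error strings; an empty list means both rules hold."""
    root = ET.fromstring(xml_text)
    errors, seen = [], set()
    # Rule 1: all ID values must be distinct.
    for el in root.iter():
        value = el.get(ID_ATTR)
        if value is None:
            continue
        if value in seen:
            errors.append(f"duplicate ID: {value}")
        seen.add(value)
    # Rule 2: every IDREF / IDREFS value must name an existing ID.
    for el in root.iter():
        refs = [el.get(a) for a in IDREF_ATTRS if el.get(a) is not None]
        for a in IDREFS_ATTRS:
            refs.extend((el.get(a) or "").split())   # IDREFS: space-separated
        for ref in refs:
            if ref not in seen:
                errors.append(f"dangling reference: {ref}")
    return errors


GOOD = (
    '<family>'
    '<person id="john" mom="jane" dad="james" children="jack jim"/>'
    '<person id="jane"/><person id="james"/>'
    '<person id="jack" dad="john"/><person id="jim" dad="john"/>'
    '</family>'
)
BAD = '<family><person id="john" mom="jane"/><person id="john"/></family>'

if __name__ == "__main__":
    print(check_ids(GOOD))
    print(check_ids(BAD))
```

Note that the check cannot express that mom should point to a person rather than to any element with an ID, which matches the limitation of untyped IDREFs.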

Limitations of DTDs as schemas (summarized)

• order (content models impose an order on subelements)

• only one atomic type (PCDATA, but no INT etc.)

• names are global (partial solution: namespaces)

• IDREFs are not constrained to a certain type (“mother”-reference should point to a “person”)

Overview

• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
  – DCD
  – Document navigation
• Paraphernalia

Document Content Definitions

• making typing more precise
• DCD seems to have been abandoned
• recent approach: XML Schema, which must e.g.:
  – provide for primitive data typing, including byte, date, integer, sequence, SQL & Java primitive data types, etc.
  – allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of its properties
  – provide a mechanism for URI reference to standard semantic understanding of a construct
  – …

(http://www.w3.org/TR/NOTE-xml-schema-req)

XLink & XPointer

• pointing to arbitrary positions in documents

• using IDs or relative position

• links can be defined externally to both source and target (files)

Overview

• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia
  – RDF
  – Stylesheets
  – SAX and DOM

Resource Description Framework

• for representing metadata

• consists of data model and syntax

• simple form: edge-labelled graph

• additionally:
  – containers (bag, sequence or alternative)
  – higher-order statements (“John says that …”)

Stylesheets

• to specify presentation of data
• Cascading Style Sheets (CSS): associate a presentation with each element type
• Extensible Stylesheet Language (XSL): specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary

http://www.w3.org/Style/XSL/

SAX and DOM

• Application Programming Interfaces
• Simple API for XML (SAX):
  – event-based standard for parsing
• Document Object Model (DOM): interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  – parses the whole document and builds a tree representation of it

http://www.w3.org/DOM/
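
The difference between the two APIs can be sketched with Python's standard xml.sax and xml.dom.minidom modules, used here as stand-ins for SAX and DOM implementations in general. With SAX the program reacts to parsing events and keeps only what it needs; with DOM the whole document is first turned into a tree that is then queried. The class NameCollector, the two helper functions and the sample document are made up for this example.

```python
import xml.sax
from xml.dom import minidom

DOC = (
    "<personnel>"
    "<person><name>John Cage</name></person>"
    "<person><name>Elaine Vassal</name></person>"
    "</personnel>"
)


class NameCollector(xml.sax.ContentHandler):
    """SAX handler: collects the text of every <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        if self._in_name:
            self._buffer.append(content)   # text may arrive in several chunks

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer).strip())
            self._in_name = False


def names_with_sax(xml_text):
    """Event-based: nothing but the collected names is kept in memory."""
    handler = NameCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


def names_with_dom(xml_text):
    """Tree-based: the whole document is parsed into a tree first."""
    document = minidom.parseString(xml_text)
    return [node.firstChild.data.strip()
            for node in document.getElementsByTagName("name")]


if __name__ == "__main__":
    print(names_with_sax(DOC))
    print(names_with_dom(DOC))
```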

Outlook

• Database issues:
  – How are we going to model XML? (graphs)
  – How are we going to query XML? (XML-QL)
  – How are we going to store XML? (in a relational database? object-oriented?)
  – How are we going to process XML efficiently? (uh… well..., um..., ah..., get some good grad students!)

Raghu Ramakrishnan, http://www.cs.wisc.edu/~cs784-1/handouts/intro-ssxml.ppt

References

• S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web. From relations to Semistructured Data and XML, Morgan Kaufmann Publishers, San Francisco 2000

• H. Lobin, Informationsmodellierung in XML und SGML, Berlin, Heidelberg, 2000

• World Wide Web Consortium, Extensible Markup Language (XML), http://www.w3.org/XML/