36
1 XML & XPath CS 445 Fall 2010 Some slide content courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives.

XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

1

XML & XPath

CS 445Fall 2010

Some slide content courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives.

Page 2: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

Structure in data representation• Relational data is highly structured– structure is defined by the schema– good for system design– good for precise query semantics / answers

• Structure can be limiting– authoring is constrained: schema-first– changes to structure not easy– querying constrained: must know schema– data exchange hard: integration of diff schema

2

Some reasons why more

data is not in databases

Page 3: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

WWW

Structured data - Databases

Unstructured Text - Documents

Semistructured Data

Page 4: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

XML data

4

<data><person id=“o555” >

<name> Mary </name><address>

<street> Maple </street> <no> 345 </no> <city> Seattle </city>

</address></person><person>

<name> John </name><address> Thailand </address><phone> 23456 </phone>

</person></data>

Page 5: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

Need for loose structure

• Evolving, unknown, or irregular structure• Integration of structured, but

heterogeneous data sources• Textual data with tags and links• Combination of data models

5

Page 6: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

XML: Syntax

Page 7: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

XML Syntax• tags: book, title, author, …• start tag: <book>, end tag: </book>• elements: <book>…<book>,<author>…</author>• elements are nested• empty element: <red></red> abbrv. <red/>• an XML document: single root element

An XML document is well formed if it has matching tags

Page 8: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

8

XML Syntax<book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year></book>

attributes are alternative ways to represent data

Page 9: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

XML Syntax

<person id=“o555”> <name> Jane </name> </person><person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/></person><person id=“o123” mother=“o456”><name>John</name></person>

oids and references in XML are just syntax

Page 10: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

10

XML Semantics: a Tree !<data>

<person id=“o555” ><name> Mary </name><address>

<street> Maple </street> <no> 345 </no> <city> Seattle </city>

</address></person><person>

<name> John </name><address> Thailand </address><phone> 23456 </phone>

</person></data>

data

Mary

person

person

name address

name address

street no city

Maple 345 Seattle

JohnThai

phone

23456

id

o555

Elementnode

Textnode

Attributenode

Order matters !!!

Page 11: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

XML Data• XML is self-describing• Schema elements become part of the data

– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part

of the data, and are repeated many times• Consequence: XML is much more flexible

Some real data:http://www.cs.washington.edu/research/xmldatasets/

Page 12: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

12

Relational Data as XML

<person><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row>

</person>

row row row

name name namephone phone phone“John” 3634 “Sue” “Dick”6343 6363

personXML: person

name phoneJohn 3634Sue 6343Dick 6363

Page 13: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

13

XML is Semi-structured Data• Missing attributes:

• Could represent ina table with nulls

<person> <name> John</name> <phone>1234</phone> </person>

<person> <name>Joe</name></person> ← no phone !

name phoneJohn 1234Joe -

Page 14: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

14

XML is Semi-structured Data• Repeated attributes

• Impossible in tables:

<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>

← two phones !

name phoneMary 2345 3456 ??

?

Page 15: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

15

XML is Semi-structured Data• Attributes with different types in different objects

• Nested collections (non 1NF)• Heterogeneous collections:

– <db> contains both <book>s and <publisher>s

<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone></person>

← structured name !

Page 16: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

Represent as tree

16

<A> <A> <A> 1 </A> <B> <B> 2 </B> <B> 3 </B> </B> </A> <A> <B> 4 </B> <A> <A> 5 </A> <A> 6 </A> </A> </A></A>

Page 17: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

17

Querying XML Data• Querying XML has two components

– Selecting data• pattern matching on structural & path properties• typical selection conditions

– Construct output, or transform data• construct new elements• restructure• order

Page 18: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

18

Querying XML Data

• XPath = simple navigation through the tree

• XQuery = the SQL of XML

• XSLT = recursive traversal

Page 19: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

19

Querying XML

How do you query a directed graph? a tree?

The standard approach used by many XML, semistructured-data, and object query languages:

• Define some sort of a template describing traversals from the root of the directed graph

• In XML, the basis of this template is called an XPath

Page 20: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

20

XPath is widely used

• XML Schema uses simple XPaths in defining keys and uniqueness constraints

• XQuery

• XSLT

• XLink and XPointer, hyperlinks for XML

Page 21: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

21

XPaths

In its simplest form, an XPath is like a path in a file system:/mypath/subpath/*/morepath

• The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path

• XPaths can have node tests at the end, returning only particular node types, e.g., text(), element(), attribute()

• XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order

Page 22: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

22

Sample Data for Queries<bib>

<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>

</bib>

Page 23: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

23

Data Model for XPath

bib

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul

The root

The root element

Page 24: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

XPath

24

/bib/book/year

/bib/paper/year

//author

/bib//first-name

//author/*

/bib/book/@price

/bib/book/author[firstname]

/bib/book/author[firstname][address[.//zip][city]]/lastname

Page 25: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

25

XPath: Simple Expressions

Result: <year> 1995 </year> <year> 1998 </year>

Result: empty (there were no papers)

/bib/book/year

/bib/paper/year

Page 26: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

26

XPath: Restricted Kleene Closure

Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>

Result: <first-name> Rick </first-name>

//author

/bib//first-name

Page 27: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

27

Xpath: Text Nodes

Result: Serge Abiteboul Victor Vianu Jeffrey D. Ullman

Rick Hull doesnʼt appear because he has firstname, lastname

Functions in XPath:– text() = matches the text value– node() = matches any node (= * or @* or text())– name() = returns the name of the current tag

/bib/book/author/text()

Page 28: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

28

Xpath: Wildcard

Result: <first-name> Rick </first-name> <last-name> Hull </last-name>

* Matches any element

//author/*

Page 29: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

29

Xpath: Attribute Nodes

Result: “55”

@price means that price has to be an attribute

/bib/book/@price

Page 30: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

30

Xpath: Predicates

Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>

/bib/book/author[firstname]

Page 31: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

31

Xpath: More Predicates

Result: <lastname> … </lastname> <lastname> … </lastname>

/bib/book/author[firstname][address[.//zip][city]]/lastname

Note that two predicates next to one another are combined conjunctively. In this case author[firstname][address[..][..]] requires that author have both a firstname AND an address child (with additional conditions on the address child).

Page 32: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

32

Xpath: More Predicates

/bib/book[@price < 60]

/bib/book[author/@age < 25]

/bib/book[author/text()]

Page 33: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

33

Context Nodes and Relative Paths

XPath has a notion of a context node: itʼs analogous to a current directory– “.” represents this context node– “..” represents the parent node– We can express relative paths:

subpath/sub-subpath/../.. gets us back to the context node

By default, the document root is the context node

Page 34: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

dot in XPath qualifiers• //author• //author[first-name]• //author[./first-name]• //author[/first-name]• //author[//first-name]• //author[.//first-name]

34

equivalent

qualifier starts at root

Page 35: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

Xpath: Summarybib matches a bib element* matches any element/ matches the root element/bib matches a bib element under rootbib/paper matches a paper in bibbib//paper matches a paper in bib, at any depth//paper matches a paper at any depthpaper | book matches a paper or a book@price matches a price attributebib/book/@price matches price attribute in book, in bibbib/book[./@price<“55”]/

author/lastnamematches…

Page 36: XML & XPath - UMass Amherstavid.cs.umass.edu/courses/445/f2010/scribe-notes/Lecture-XML-XPath.pdfQuerying XML How do you query a directed graph? a tree? The standard approach used

Overview of Research issues• Data modeling and normalization• Query language design• Storage & publishing of XML

– XML → Relations– Relations → XML

• Theoretical work– expressiveness– containment, type checking

• Query execution & optimization36