Univ. of Crete V. Christophides I. Fundulaki
CS561 Spring 2010
Web Data Management: An Introduction to
Semistructured and XML Databases
V. CHRISTOPHIDES, I. FUNDULAKI
Department of Computer Science, University of Crete
ICS - FORTH, Heraklion, Crete
A bit of History
• Research:
  - 1950’s: Lisp [McCarthy]
  - 1960’s: Tree languages [Büchi]
  - 1970’s: Relational DBs [Codd]
  - 1990: Graphlog [Univ. Toronto]
  - 1994: O2 extensions [INRIA]
  - 1995: TSIMMIS & OEM [Stanford]
  - 1995: UnQL [UPenn]
  Need to handle irregular Web data. Use graph data models.
• Internet industry:
  - 1957: Sputnik launch prompts the creation of ARPA
  - 1972: First demonstration of ARPANET
  - 1989: Number of hosts breaks 100,000
  - 1991: CERN releases the World Wide Web, with HTML as the support for information
  - 1997: 20 Million Hosts, 1 Million Web sites
  - 1998: W3C releases XML to represent information on the Web
  XML provides a syntax for irregular textual Web information.
Documents vs Databases
Document world
• plenty of small documents
  - usually static
• implicit structure
  - section, paragraph, toc
• tagging
  - human friendly
• content
  - form/layout, annotation
• paradigms
  - “Save as”, WYSIWYG
• metadata
  - author name, date, subject
Database world
• a few large databases
  - usually dynamic
• explicit structure
• types
• records
  - machine friendly
• content
  - data, methods
• paradigms
  - Data Independence, Transaction Management, Query Languages
• metadata
  - schema description
What to do with them
Documents
• editing
• spell-checking
• counting words
• retrieving (IR)
• printing
Database
• updating
• cleaning
• querying
• composing/transforming
Query Languages
Document Retrieval
  Claude Monet and San Diego Museum of Art
Database Querying
  select p
  from Artists a, a.artwork p
  where a.first = "Claude"
    and a.last = "Monet"
    and p.located = "San Diego Museum of Art"
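The contrast between the two query styles can be sketched in Python. This is only an illustration under stated assumptions: the in-memory records, the sample texts and the helper names `keyword_search` and `structured_query` are hypothetical and are not part of the course material.

```python
# Hypothetical data: one artist record shaped like the query on the slide.
artists = [
    {
        "first": "Claude",
        "last": "Monet",
        "artwork": [
            {"title": "Haystacks at Chailly at Sunrise",
             "located": "San Diego Museum of Art"},
        ],
    },
]

# The same kind of information as free text, the way documents hold it.
documents = [
    "MONET, Claude. Haystacks at Chailly at Sunrise. 1865. "
    "Oil on canvas. San Diego Museum of Art",
    "A visitor guide to the San Diego Museum of Art",
]


def keyword_search(docs, query):
    """Document retrieval: keep documents containing every query keyword."""
    words = query.lower().split()
    return [d for d in docs if all(w in d.lower() for w in words)]


def structured_query(collection, first, last, located):
    """Database querying: a mirror of the select-from-where query."""
    return [
        p
        for a in collection
        for p in a["artwork"]
        if a["first"] == first and a["last"] == last and p["located"] == located
    ]


hits = keyword_search(documents, "Monet San Diego Museum")
works = structured_query(artists, "Claude", "Monet", "San Diego Museum of Art")
```

The keyword search returns whole documents and knows nothing about which word is a name or a museum, while the structured query returns exactly the artwork objects that satisfy the conditions.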
How the Web is Today?
• Information and its presentation are mixed up in the form of HTML documents
  - all intended for human consumption
  - many generated automatically by applications
• Easy to fetch any Web page, from any server, any platform
  - access through a uniform interface
The Secrets of HTML Success
• Everybody can write it:
  - HTML is simple
  - HTML is textual: it is human readable, you can use any editor, ...
• Everybody can read it:
  - HTML is portable on any platform
  - The browser is the universal application
• Everybody can search it:
  - Keyword-based Search Engines: high recall, low precision
• It connects pieces of information together
  - Through hypertext links
[Figure: Hypertext Links]
What’s Wrong with HTML
<B>MONET, Claude</B><BR>
Haystacks at Chailly at Sunrise<BR>
1865<BR>
Oil on canvas<BR>
30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>
San Diego Museum of Art<BR>
<P><IMG SRC="http://192.41.13.240/artchive/m/monet/hayricks.jpg">
Fields, in order of appearance: Artist Name, Artifact Title, Date, Material, Dimensions, Museum, Image Reference
• If written properly, normal HTML markup may reflect document presentation, but it cannot adequately represent the semantics & structure of data
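A small Python sketch of why presentation-only markup is hard for applications to consume. It assumes that a program can only recover the fields by their position between `<BR>` tags; the `LABELS` list and the `extract` helper are hypothetical names introduced for the illustration.

```python
import re

html = (
    "<B>MONET, Claude</B><BR>Haystacks at Chailly at Sunrise<BR>1865<BR>"
    "Oil on canvas<BR>30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>"
    "San Diego Museum of Art<BR>"
)

# The markup carries no field names, so the program has to guess them
# from the position of each line.
LABELS = ["artist", "title", "date", "material", "dimensions", "museum"]


def extract(page):
    """Split on <BR>, strip the remaining tags, and label fields by position."""
    parts = re.split(r"<BR>", page, flags=re.IGNORECASE)
    fields = [re.sub(r"<[^>]+>", "", p).strip() for p in parts]
    return dict(zip(LABELS, [f for f in fields if f]))


record = extract(html)

# If one line is missing (here the date), every later field is mislabeled.
shifted = extract(html.replace("1865<BR>", ""))
```

In `record` the date is "1865", but in `shifted` the "date" label ends up on "Oil on canvas": nothing in the HTML tells the program what each line means.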
HTML Document Presentation
But Modern Web Applications Need More!
• Infomediaries:
  - Community Web Portals
  - Digital Museums & Libraries
• Electronic commerce:
  - On-line Catalogs & Procurement
  - Comparison Shoppers
  - Market Places
  - Virtual Enterprises
• Scientific applications:
  - E-learning
  - Data & Knowledge Grids
• Advanced Information Management
  - finding, extracting, representing, interpreting, maintaining
• Flexible, Quick Interoperation: the ability to uniformly share, interpret and manipulate heterogeneous information
  - applications cannot consume HTML
More than HTML documents: Data on the Web
More than Web browsers: Web-enabled Applications
Paradigm Shift on the Web
[Figure labels: application (x3), XML Data, WEB (HTTP), Transform, Integrate, Warehouse, relational data, object-relational, legacy data]
• New Web standard XML:
  - XML generated by applications
  - XML consumed by applications
• Data exchange:
  - across platforms
  - across organizations
  - Web: from collection of documents to Web data published as documents
XML Data Representation: The Document View
<ARTIST>
  <NAME>
    <FIRST>Claude</FIRST> <LAST>Monet</LAST>
  </NAME>
  <ARTWORK>
    <ARTIFACT>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE>
      <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric='cm'> <HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
      <DIM Metric='in'> <HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File='http://192.41.13.240/artchive/m/monet/hayricks.jpg'/>
    </ARTIFACT>
  </ARTWORK>
</ARTIST>
Callouts: Element Name, Element Content, Empty Element, Attribute Name, Attribute Value
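As a rough illustration of how an application can consume this document, the sketch below parses it with Python's standard `xml.etree.ElementTree` module and reads element content, attribute values and the empty element. The variable names are chosen for the example only.

```python
import xml.etree.ElementTree as ET

xml_text = """
<ARTIST>
  <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>
  <ARTWORK>
    <ARTIFACT>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE>
      <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric="cm"><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
      <DIM Metric="in"><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File="http://192.41.13.240/artchive/m/monet/hayricks.jpg"/>
    </ARTIFACT>
  </ARTWORK>
</ARTIST>
"""

root = ET.fromstring(xml_text)          # root element: ARTIST
artifact = root.find("ARTWORK/ARTIFACT")

# Element content is reached by element name.
title = artifact.findtext("TITLE")

# Attribute values distinguish the two DIM elements.
dims = {
    d.get("Metric"): (d.findtext("HEIGHT"), d.findtext("WIDTH"))
    for d in artifact.findall("DIM")
}

# IMAGE is an empty element: no text and no children, only an attribute.
image = artifact.find("IMAGE")
image_is_empty = image.text is None and len(image) == 0
```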
XML Data Representation: The Database View
[Figure: the document as a labeled tree]
ARTIST
  NAME
    FIRST: Claude
    LAST: MONET
  ARTWORK
    ARTIFACT
      TITLE: Haystacks
      DATE: 1865
      MATERIAL: Oil on canvas
      DIM
        H: 30
        W: 60
      DIM
        H: 11 7/8
        W: 23 3/4
      LOCATION: San Diego Mus.
      IMAGE: ...hayricks.jpg
The Secrets of XML Popularity
• It looks like HTML...
  - Simple, familiar, easy to learn, human-readable
  - Universal and portable
  - Supported by the W3C: trusted and quickly adopted by the industry
• …but it’s more than HTML!
  - flexible: you can represent any information
  - extensible: you can represent it the way you want!
• Increasing precision in XML specifications
  - Well-Formed: already better than plain text
  - Valid: Structure conforms to a DTD or an XML Schema
Well-Formed XML
• An object is said to be a well-formed XML document if it meets all the well-formedness constraints (WFCs) of the XML syntax:
  - tags (etc.) are syntactically correct
  - every start-tag has a matching end-tag
  - tags are properly nested
  - there is a single root element
• By definition, if a document is not well-formed, it is not XML
  - This means that there is no such thing as an XML document that is not well-formed, and XML processors are not required to do anything with such documents
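The well-formedness constraints can be checked mechanically. The following is a minimal sketch, assuming that acceptance by Python's standard (non-validating) XML parser is a good enough test; the helper name `is_well_formed` and the sample strings are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "ok": "<ARTIST><NAME>Claude Monet</NAME></ARTIST>",
    "missing end-tag": "<ARTIST><NAME>Claude Monet</ARTIST>",
    "bad nesting": "<ARTIST><NAME>Claude Monet</ARTIST></NAME>",
    "no single root": "<FIRST>Claude</FIRST><LAST>Monet</LAST>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
```

Only the first sample is accepted; the parser rejects the other three outright instead of trying to repair them.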
Valid XML
• A well-formed document is valid only if it contains a proper DTD (or Schema) and if the document obeys the constraints of that DTD (or Schema), and therefore the XML Validity Constraints (VCs)
  - only declared tags (element or attribute names) are used
  - all tag occurrences conform to specified content models
• Examples:
  - The following XML document is well-formed but not valid (it carries no DTD):
    <ARTIST>Claude Monet</ARTIST>
  - The following XML document is not even well-formed (it has no single root element):
    <FIRST>Claude</FIRST><LAST>Monet</LAST>
XML Document Type Definition (DTD)
<!DOCTYPE artist [
<!ELEMENT artist (name, born, death, artwork, nationality?, influences)>
<!ATTLIST artist oid ID #REQUIRED>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
...
<!ELEMENT artwork (artifact+)>
<!ELEMENT artifact (title, date, material, dim*, location, image)>
<!ELEMENT title (#PCDATA)>
...
<!ELEMENT dim (height, width)>
<!ATTLIST dim metric (cm | in) 'cm'>
<!ELEMENT location (#PCDATA)>
<!ELEMENT image EMPTY>
<!ATTLIST image file ENTITY #REQUIRED>
<!ELEMENT influences (#PCDATA | aref)*>
<!ELEMENT aref EMPTY>
<!ATTLIST aref oref IDREF #IMPLIED>
]>
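The idea of a content model can be sketched in Python. This is not a DTD validator: the standard-library parser used here does not validate, so the sketch assumes that a few element declarations have been translated by hand into regular expressions over the sequence of child tags. The `CONTENT_MODELS` table and the `violations` helper are hypothetical, and attribute declarations are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models for some declarations of the DTD.
# Each child tag is written as "tag " so a child sequence is a plain string.
CONTENT_MODELS = {
    "name": r"first last ",
    "artwork": r"(artifact )+",
    "artifact": r"title date material (dim )*location image ",
    "dim": r"height width ",
}


def violations(element):
    """Yield the tag of every element whose children break its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            yield element.tag
    for child in element:
        yield from violations(child)


good = ET.fromstring(
    "<artwork><artifact><title>Haystacks</title><date>1865</date>"
    "<material>Oil on canvas</material>"
    "<dim><height>30</height><width>60</width></dim>"
    "<location>San Diego Museum of Art</location><image/></artifact></artwork>"
)
# Same data, but date and title are swapped: well-formed, yet not valid.
bad = ET.fromstring(
    "<artwork><artifact><date>1865</date><title>Haystacks</title>"
    "<material>Oil on canvas</material>"
    "<location>San Diego Museum of Art</location><image/></artifact></artwork>"
)

good_errors = list(violations(good))
bad_errors = list(violations(bad))
```

Both fragments are well-formed, but only the first one respects the order that the `artifact` declaration imposes on its children.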
XML Anatomy
Is XML the Solution to Interoperability?
[Figure: Application 1 and Application 2 communicate by exchanging the ARTIST document tree (NAME, ARTWORK, ARTIFACT, TITLE, DATE, MATERIAL, DIM, LOCATION, IMAGE)]
Document = medium for exchanging information
• Still need to agree on:
  - DTDs or Schemas
  - Meaning of tags
  - “Operations” on data
  - Meaning of operations
Large Scale Interoperation on the Web
[Figure: a Sender using DTD A and a Recipient using DTD A exchange XML-based communication using DTD A; communication with partners using DTD B and DTD C is marked with "?"]
Interoperability is still an Open Issue!
• Semantic discrepancies:
  - Synonymy & Polysemy & Taxonomy
    - <ARTIFACT> vs. <ARTEFACT>
    - is <ARTWORK> paintings or songs?
    - how is <… Style='Impressionism'> related to <… Style='Pointillism'>?
• Structural discrepancies:
  - Aggregation
    - <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME> vs. <NAME>Claude Monet</NAME>
  - Type
    - <ARTIFACT Kind='Painting'> ... </ARTIFACT> vs. <PAINTING>Claude Monet</PAINTING>
• Syntactic discrepancies:
  - <ARTIST Name='Claude Monet'> ... </ARTIST> vs. <ARTIST> <NAME>Claude Monet</NAME> ... </ARTIST>
More than Web Data: Semantics on the Web
More than Web Applications: Web Services
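To make the structural and syntactic discrepancies concrete, here is a hedged Python sketch of what a receiving application has to do when partners encode the artist name differently. The `artist_name` helper and its rules are assumptions written for this example; agreeing on a DTD would remove the need for such code, but not the need to agree on what the tags mean.

```python
import xml.etree.ElementTree as ET


def artist_name(artist):
    """Return the full name from three different encodings of an ARTIST."""
    # Syntactic variant: the name is an attribute of ARTIST.
    if artist.get("Name"):
        return artist.get("Name")
    name = artist.find("NAME")
    if name is None:
        return None
    # Structural variant: NAME aggregates FIRST and LAST.
    first, last = name.findtext("FIRST"), name.findtext("LAST")
    if first is not None and last is not None:
        return f"{first} {last}"
    # Flat variant: NAME holds the whole string.
    return (name.text or "").strip()


variants = [
    "<ARTIST Name='Claude Monet'/>",
    "<ARTIST><NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME></ARTIST>",
    "<ARTIST><NAME>Claude Monet</NAME></ARTIST>",
]

names = [artist_name(ET.fromstring(v)) for v in variants]
```

All three well-formed documents carry the same information, yet each needs its own access path.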
The Semantic Web Vision: A Web of Meaning
[Figure: Museums, Artists, Artifacts and Techniques linked by Semantic Relationships]
• The “Next Generation Web” aims to provide infrastructure for expressing information in a precise, human-readable, and machine-interpretable form
• Enable both syntactic and semantic/structural interoperability among independently-developed Web applications, allowing them to efficiently perform sophisticated tasks for humans
• Enable Web resources (data & applications) to be accessible by their meaning rather than by keywords and syntactic forms
  - Conceptual Navigation & Querying
  - Inference Services (Picasso is an Artist)
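A toy sketch of the kind of inference meant by "Picasso is an Artist". The class hierarchy and the facts below are hypothetical and do not come from any real ontology or from the course; the point is only that the answer is derived from subclass links instead of being stored explicitly.

```python
# Hypothetical class hierarchy and facts, invented for the illustration.
SUBCLASS_OF = {"Painter": "Artist", "Sculptor": "Artist", "Artist": "Person"}
INSTANCE_OF = {"Picasso": "Painter", "Rodin": "Sculptor"}


def classes_of(individual):
    """Return every class of the individual, following subclass links upward."""
    result = []
    cls = INSTANCE_OF.get(individual)
    while cls is not None:
        result.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return result


def is_a(individual, cls):
    """True if the individual belongs to `cls`, directly or by inheritance."""
    return cls in classes_of(individual)


picasso_is_artist = is_a("Picasso", "Artist")
```

Nothing in `INSTANCE_OF` states that Picasso is an Artist; the fact follows from Painter being a subclass of Artist.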
Web Innovation: From Web Sites to Web Services
[Figure: timeline from Web Sites to Web Services. Phases: Connect, Browse/Present, Integrate/Transact, Program/Collaborate. Technologies: Packets, TCP/IP, HTML, XML, UDDI, SOAP, RosettaNet. Time axis: 1970s, 1980s, 1990s, 1996, 2003. Marker: Today]
Middleware Evolution & Interoperability
About the W3C and the XML Activities
• Membership organization
• Different types of groups inside W3C:
  - Working groups
  - Interest groups
  - Coordination groups
• Status for W3C documents:
  - Working draft
  - Last Call
  - Candidate/proposed recommendation
  - Recommendation ~ Standard
• Core XML WG
  - eXtensible Markup Language (XML 1.0), namespaces, Infoset
• XML Linking WG
  - XML Pointer Language (XPointer), XML Linking Language (XLink)
• XML Schema WG
• XML Query WG
  - XML Data Model, Algebra and Query Language
• Document Object Model WG
• XSL WG
  - XPath (with XML Linking WG)
  - Transformation and stylesheet language (XSLT/XSL)
W3C XML Related Specifications
[Figure, credited to Ian GRAHAM: map of XML-related specifications]
Legend: ‘Open’ std, W3C rec, W3C draft, industry std
Areas: XML Core, APIs, Style, Protocols, Web Services, Application areas
Specifications shown: XML 1.0, XML namespaces, Infoset, XML base, XML schema, XML query, Canonical XML, signature, XPath, XPointer, XLink, XFragment, XSLT, XSL, CSS 1, CSS 2, CSS 3, SAX 1, SAX 2, DOM 1, DOM 2, DOM 3, JDOM, JAXP, RDF, MathML, SMIL 1 & 2, SVG, XHTML 1.0, Modularized XHTML, XHTML basic, XHTML events, XForms, SOAP, XML-RPC, UDDI, WSDL, ebXML, Biztalk, WDDX, XMI, FinXML, FpML, IFX, dirXML, and 100's more
XML is Just the Beginning...
• We now want to build advanced Web applications
  - There is an urgent need for XML tools
• Designing XML tools is a data management problem:
  - XML 1.0 to describe structured documents = Syntax for trees
  - XML data models to describe the information content = Data model for trees
  - XML schemas to describe the structure of information = Data definition language for trees
  - XML languages to describe information processing = Data manipulation language for trees
Why is Database Work Important?
• Databases are large collections of data
  - Even “basic” APIs to XML will fail on large XML documents
• Other reasons: all the good things that the DB community brought the world
  - Data models, integrity constraints, and schemas
  - Query languages, optimizers, fast joins
  - Views, Updates
  - Concurrency
  - Federated database systems
• The emergence of XML underscores the importance of semistructured data
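The remark about "basic" APIs failing on large documents can be illustrated with a sketch. An API that first builds the whole tree in memory needs space proportional to the document, while an event-based pass can discard each subtree once it has been processed. The example below uses the standard `ElementTree.iterparse`; the generated input and the `count_artifacts` helper are assumptions made for the illustration, and note that cleared elements still leave small empty shells under the root.

```python
import io
import xml.etree.ElementTree as ET


def count_artifacts(stream):
    """Count ARTIFACT elements without keeping their subtrees in memory."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ARTIFACT":
            count += 1
            elem.clear()  # drop the children and text we have just processed
    return count


# Synthetic input standing in for a large document.
data = "<ARTWORK>" + "<ARTIFACT><TITLE>t</TITLE></ARTIFACT>" * 1000 + "</ARTWORK>"
total = count_artifacts(io.BytesIO(data.encode("utf-8")))
```

Streaming alone covers only simple scans; joins, indexes and optimization are where the database techniques listed above come in.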
XML Main Characteristics: Semistructured Data
• Data Schema is not what it used to be:
  - not given in advance (self-describing, schema-less),
  - descriptive, not prescriptive (designed by document, not db experts),
  - partial (documents and data mixed together),
  - rapidly evolving (without notice),
  - may be large (compared to the size of the data)
• Data Types are not what they used to be:
  - elements and attributes are not strongly typed
  - missing or additional attributes
  - multiple attributes
  - elements in the same collection may have different types, i.e., heterogeneous collections
  - attributes with different types in different elements
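These characteristics can be seen in a small heterogeneous collection. The records below are illustrative data written for this sketch (the field names and the listed influences are assumptions): no schema is declared, attributes come and go, and the same attribute has different types in different records.

```python
# Illustrative self-describing records; no schema is given in advance.
artists = [
    {"name": "Claude Monet", "born": 1840},
    {"name": {"first": "Pablo", "last": "Picasso"}, "nationality": "Spanish"},
    {"name": "Georges Seurat", "influences": ["Monet", "Delacroix"], "born": "1859"},
]


def display_name(record):
    """Cope with a name that is either a string or a first/last structure."""
    name = record.get("name")
    if isinstance(name, dict):
        return f"{name['first']} {name['last']}"
    return name


names = [display_name(r) for r in artists]

# The same attribute carries different types in different records.
born_types = {type(r["born"]).__name__ for r in artists if "born" in r}

# Attributes may be missing or additional from one record to the next.
all_keys = sorted({key for r in artists for key in r})
```

Every consumer of such data has to carry checks like the one in `display_name`, which is the price paid for not fixing a schema up front.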
Schemas are Useful for …
• Data readers
  - What info is in a given collection?
  - Thus, what queries might make sense?
• Data writers
  - What should I call this piece of info?
  - Is it okay to put this kind of data here?
• Efficient/effective data manipulation
  - Optimize query processing
  - Facilitate integration of multiple data sources
  - Improve storage
  - Construct indexes, statistics
  - Forbid certain types of updates
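When no schema is given, a reader can still ask "what info is in this collection?" by summarizing the data itself. The sketch below counts how often each root-to-node label path occurs; the `path_summary` helper and the sample collection are hypothetical, and this is only one simple way to derive a descriptive structure summary from data.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def path_summary(root):
    """Count the occurrences of every label path in the tree."""
    counts = Counter()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


collection = ET.fromstring(
    "<ARTISTS>"
    "<ARTIST><NAME>Claude Monet</NAME><BORN>1840</BORN></ARTIST>"
    "<ARTIST><NAME><FIRST>Pablo</FIRST><LAST>Picasso</LAST></NAME></ARTIST>"
    "</ARTISTS>"
)

summary = path_summary(collection)
```

The summary tells a reader that a query on `/ARTISTS/ARTIST/BORN` can match at most one artist here, and that a query on a path absent from the summary cannot match anything.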
A DB Paradigm Shift is Also Needed
• Managing semistructured/XML data requires rethinking the design of the components of a DBMS:
  - How do we model it?
    - directed labeled trees with references (i.e. graphs)
  - How do we query it?
    - a new standard based on functional languages and regular path expressions
  - How do we store the data?
    - looking for structural patterns
  - How do we optimize queries?
    - beginning to understand (algebras, indexes, etc.)
  - What about integrity constraints, views, updates, ...?
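To give a feel for querying trees with path expressions, here is a minimal sketch using the limited XPath subset supported by Python's standard `xml.etree.ElementTree`. It stands in for, and is much weaker than, the query standard mentioned above; the sample document is a shortened version of the Monet example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<ARTIST><NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>"
    "<ARTWORK><ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE>"
    "<DIM Metric='cm'><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>"
    "<DIM Metric='in'><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>"
    "</ARTIFACT></ARTWORK></ARTIST>"
)

# ".//" descends to any depth, so the query does not depend on how deep
# the ARTIFACT elements sit in the tree.
titles = [t.text for t in doc.findall(".//ARTIFACT/TITLE")]

# A predicate on an attribute selects one of the two DIM elements.
heights_cm = [h.text for h in doc.findall(".//DIM[@Metric='cm']/HEIGHT")]
```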
Towards a Convergence
• Databases: relax rigid constraints imposed by schemas
  - Move to a dynamic type system: semistructured data
• Documents: enrich formatting instructions with structuring/semantic information
  - Add “types” to documents: XML
[Figure: Origins. Semistructured Data =~ XML, with origins in: SGML, Document Management; HTML, Web Pages; semi-structured data models for data integration; ASN.1, ACeDB, Scientific Data Formats]
What This Course is About
• What the database community has done:
  - Semistructured data model: SSD-exps, labeled graphs
  - OEM, UnQL and YATL
  - Schemas, Storage, Query Optimization
• What the Web community has done:
  - Data formats and APIs: XML 1.0, DOM
  - Transformation and Stylesheet languages (XSLT/XSL)
• Where they meet and where they differ
  - Comparison to relational and object-oriented data models
• Present emerging XML technology as a data management issue
  - XML Data models
  - XML Data Definition (Schema) Languages
  - XML Data Manipulation (Query) languages
Your Research Projects
• XML Data Semantics
  - Type Systems
  - Structural & Integrity Constraints
  - Incremental Validation
• XML Query Processing
  - XQuery Algebras
  - Tree Query Pattern Containment & Minimization
  - XPath Engines
  - Stream-based Query Processing
• XML Query Optimization
  - Storage Schemes
  - Labelling & Indexing Schemes
  - Structural Joins & Cost Models
  - Data Statistics & Compression
  - Benchmarks, Real & Synthetic Data
• XML Data Management
  - Updates, Evolution & Versioning
  - Access Control & Active Rules
  - Data Publishing & Relational Databases
  - Warehouses & View Maintenance
• XML Database Systems
  - Commercial DBMS
  - Native DBMS
Retrospective
• 1960’s: Data Centric
• 1970’s: Process Centric
• 1980’s: Object Oriented
• 1990’s: Component Based
• 2000’s: XML?
Data was our First Focus
Timeline: 60’s Data
• Record Layouts
• Printer Layouts
• System Flow Charts
• Decision Tables
Batch Jobs were a Series of small Programs
Then we Focused on Logic
Timeline: 60’s Data, 70’s Logic
• GOTO-Less Programming
• Structured Programming
• Top-Down Design
Programs Became Very Large
Object Oriented Programming Focused on Runtime Behavior
Timeline: 60’s Data, 70’s Logic, 80’s OO
• Common Terms for Analysis and Design
• Tightly Coupled Code
Code Reuse was the Holy Grail, Rarely Achieved
Component Programming Shifted the Focus to Interfaces
Timeline: 60’s Data, 70’s Logic, 80’s OO, 90’s Comp
• Code Reuse
• IDE-Based Composition
• Limited Acceptance
Serialization Tied to Code
XML Returns the Focus to Data
Timeline: 70’s Logic, 80’s OO, 90’s Comp, 00’s XML
• XML Wrappers for Incompatible Systems
• Industry-Specific Markup Languages
• XML for Persistent Data and Composition
XML Enables Middleware for Application-Specific Data