Copyright 2001, Ronald Bourret, http://www.rpbourret.com Native XML Databases Ronald Bourret [email protected] http://www.rpbourret.com


  • Slide 1
  • Native XML Databases. Ronald Bourret, [email protected], http://www.rpbourret.com
  • Slide 2
  • Overview
      • What is a native XML database?
      • Native XML database architectures
      • When should I use a native XML database?
      • Normalization, referential integrity, scalability, and performance
      • Native XML database features
  • Slide 3
  • What is a Native XML Database?
  • Slide 4
  • Blame Software AG
      • Software AG coined the term "native XML database"...
      • ...and used it to market Tamino...
      • ...without ever defining it
      • For a long time:
          • Everybody knew Tamino was a native XML database
          • Nobody knew what Tamino did or how it worked
  • Slide 5
  • What is a native XML database?
      • A database that stores XML documents as XML
      • Defines a (logical) model for an XML document
      • Fundamental unit of (logical) storage is a document
      • Can have any physical storage
  • Slide 6
  • Example: Storing a sales order
      • [Figure: three ways to store sales order 1234 (Gallagher Industries, 29.10.00; items A-10, 12, 10.95 and B-43, 600, 3.99): store data in tables (Orders, Items, Customers, Parts), store documents as text, or store documents as DOM objects (Element, Attr, and Text nodes)]
  • Slide 7
  • Logical model of XML document
      • Must include elements, attributes, PCDATA, and document order
      • Examples are XPath data model, XML Infoset, DOM, and model implied by SAX 1.0
      • Documents stored and retrieved according to the model
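
The minimum model above (elements, attributes, PCDATA, document order) can be made concrete with a short sketch. It uses Python's standard `xml.dom.minidom` as one possible DOM implementation; the sample document and its element and attribute names are assumptions, since the slides do not show the original markup.

```python
from xml.dom import minidom

# Hypothetical sales order; the tag and attribute names are assumptions.
DOC = ('<Order number="1234"><Customer>Gallagher Industries</Customer>'
       '<Date>29.10.00</Date></Order>')

def walk(node, out):
    """Append (kind, value) pairs for elements, attributes, and text in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        out.append(("element", node.tagName))
        attrs = node.attributes
        for i in range(attrs.length):
            attr = attrs.item(i)
            out.append(("attribute", f"{attr.name}={attr.value}"))
        for child in node.childNodes:
            walk(child, out)
    elif node.nodeType == node.TEXT_NODE:
        out.append(("text", node.data))
    return out

model = walk(minidom.parseString(DOC).documentElement, [])
for kind, value in model:
    print(kind, value)
```

Anything the walk does not record (comments, entity references, CDATA boundaries) falls outside this minimal model, which is why different models give different round-trip levels.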
  • Slide 8
  • Fundamental unit of storage
      • Fundamental unit of (logical) storage is a document
      • Equivalent structure in a relational database is a row
      • Document usually contains single set of data
      • In future, unit of storage could be a fragment
  • Slide 9
  • Physical storage
      • Can have any physical storage
      • For example, can be built on a relational, hierarchical, or object-oriented database...
      • ...or use a proprietary storage format such as indexed, compressed files
  • Slide 10
  • Native XML Database Architectures
  • Slide 11
  • Text-based storage
      • Stores documents as text
      • Can use file system, BLOB, proprietary storage, etc.
      • XML-aware text engine in RDBMS is a native XML database
      • Uses indexes heavily
  • Slide 12
  • Text-based storage
      • [Figure: address document stored as text: 123 Main St., Chicago, IL 60609, USA]
  • Slide 13
  • Text-based databases
      • Indexed files: TextML
      • Proprietary: GoXML DB
  • Slide 14
  • Model-based storage
      • Stores documents according to a specific model
      • For example, maps DOM to relational database
      • Underlying storage can be relational, object-oriented, hierarchical, or proprietary
  • Slide 15
  • Model-based storage
      • [Figure: address document (123 Main St., Chicago, IL 60609, USA) stored as a tree of six Element nodes and five Text nodes]
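
As a rough illustration of model-based storage on a relational engine, the sketch below shreds an address document into a single `nodes` table in SQLite. The table layout, column names, and XML tag names are all assumptions made for the example, not the schema of any product named in these slides; attributes and mixed content are ignored to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical markup for the address shown on the slide; tag names are assumptions.
DOC = ("<Address><Street>123 Main St.</Street><City>Chicago</City>"
       "<State>IL</State><PostCode>60609</PostCode><Country>USA</Country></Address>")

def store(conn, xml_text):
    """Store one document as element and text rows linked by parent ids."""
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent INTEGER, "
                 "pos INTEGER, kind TEXT, value TEXT)")

    def insert(parent, pos, kind, value):
        cur = conn.execute(
            "INSERT INTO nodes (parent, pos, kind, value) VALUES (?, ?, ?, ?)",
            (parent, pos, kind, value))
        return cur.lastrowid

    def visit(elem, parent, pos):
        node_id = insert(parent, pos, "element", elem.tag)
        if elem.text:
            insert(node_id, 0, "text", elem.text)
        for i, child in enumerate(elem, start=1):
            visit(child, node_id, i)  # pos preserves document order among siblings

    visit(ET.fromstring(xml_text), None, 0)

conn = sqlite3.connect(":memory:")
store(conn, DOC)
rows = conn.execute("SELECT kind, value FROM nodes ORDER BY id").fetchall()

# Find the text under the City element by joining the table to itself.
city = conn.execute(
    "SELECT t.value FROM nodes e JOIN nodes t ON t.parent = e.id "
    "WHERE e.kind = 'element' AND e.value = 'City' AND t.kind = 'text'"
).fetchone()[0]
```

The document ends up as six element rows and five text rows, matching the node tree in the figure; the self-join hints at why queries against such a mapping can need several lookups.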
  • Slide 16
  • Model-based databases
      • Pre-parsed DOM: Infonyte (PDOM), dbXML, XDBM
      • Proprietary: Tamino, Birdstep, Lore, Neocore(?), SIM(?), Virtuoso(?), XYZFind
      • Relational: Xfinity, DBDOM, eXist
      • Object-oriented: eXcelon, X-Hive, Ozone/Prowler, 4Suite
  • Slide 17
  • When Should I Use a Native XML Database?
  • Slide 18
  • Storing document-centric documents
      • Saves physical info (entity references, CDATA, etc.)
      • Stores document ID / name
      • Supports document-centric queries
          • Retrieve the first section containing a list in the third chapter
          • Retrieve the headings of all chapters that contain hyperlinks
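
The second query above ("headings of all chapters that contain hyperlinks") might look like the following sketch. It uses Python's standard `xml.etree.ElementTree` rather than any native XML database, and the `book`, `chapter`, `title`, and `link` element names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document-centric content; all element names are assumptions.
BOOK = """<book>
  <chapter><title>Intro</title>
    <para>See <link href="http://www.rpbourret.com">here</link>.</para></chapter>
  <chapter><title>Models</title><para>No links.</para></chapter>
  <chapter><title>Features</title>
    <section><para><link href="http://www.xmldb.org/resources.html">list</link></para></section>
  </chapter>
</book>"""

root = ET.fromstring(BOOK)

# Keep a chapter's heading only if a link element occurs anywhere below it.
headings = [chapter.findtext("title")
            for chapter in root.iter("chapter")
            if chapter.find(".//link") is not None]
print(headings)
```

The query depends on nesting and position inside the document, which is the kind of question that is awkward to ask once the text has been split across relational tables.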
  • Slide 19
  • Natural format is XML
      • XHTML, DocBook, etc.
      • Data stored temporarily as XML
          • For example, in a message queue
      • Common format of many documents is XML
          • For example, Web search engine database
  • Slide 20
  • Retrieval speed is critical
      • One hierarchical view must predominate
          • Happens today: 15 billion gigabytes of data in IMS
      • Relational queries are hierarchy-neutral
      • Speed depends on:
          • Query
          • Underlying storage engine
          • Output format (DOM, SAX, string)
  • Slide 21
  • Semi-structured data
      • Structure is present, but not regular like tabular data
          • For example, genealogical records or patient records
      • Difficult to store in a relational database
          • Choice is many tables or many nulls
      • Structure might not be known at design time
  • Slide 22
  • Well-formed documents
      • No known schema
      • Best example is documents stored by Web search engine
      • Storing data in such documents is very inefficient
      • Tables and mappings must be created at run-time
  • Slide 23
  • Normalization, Referential Integrity, Scalability, and Performance
  • Slide 24
  • Normalization
      • Means that a given piece of data appears only once
      • Reduces disk usage
      • Reduces potential update errors
      • Fundamental concept of relational databases
  • Slide 25
  • Normalization and native XML databases
      • Concept same as in relational database
      • Only difference is database model
          • Relational tables are flat, can only store single values
          • XML documents are hierarchical, can store multiple values
      • Not required
  • Slide 26
  • Example: Sales order
      • Requires two tables in RDBMS
      • Can store in a single document in native XML database
      • Both are normalized
      • [Figure: relational database with an Orders row (1234, 29.10.00, Gallagher Industries) and two Items rows (1234, 1, A-10, 12, 10.95 and 1234, 2, B-43, 600, 3.99), next to a single XML document holding the same order]
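
The two representations can be compared side by side in a small sketch: two SQLite tables holding the values from the slide, and a function that assembles the equivalent single document. Column names, tag names, and the `order_document` helper are assumptions introduced for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Column names are assumptions; the values come from the slide.
conn.executescript("""
CREATE TABLE Orders (Number INTEGER PRIMARY KEY, OrderDate TEXT, Customer TEXT);
CREATE TABLE Items (OrderNumber INTEGER, ItemNumber INTEGER, Part TEXT,
                    Quantity INTEGER, Price REAL);
INSERT INTO Orders VALUES (1234, '29.10.00', 'Gallagher Industries');
INSERT INTO Items VALUES (1234, 1, 'A-10', 12, 10.95);
INSERT INTO Items VALUES (1234, 2, 'B-43', 600, 3.99);
""")

def order_document(conn, number):
    """Build one hierarchical document from the two flat tables."""
    order_date, customer = conn.execute(
        "SELECT OrderDate, Customer FROM Orders WHERE Number = ?", (number,)
    ).fetchone()
    order = ET.Element("Order", {"Number": str(number)})
    ET.SubElement(order, "Customer").text = customer
    ET.SubElement(order, "Date").text = order_date
    items = conn.execute(
        "SELECT ItemNumber, Part, Quantity, Price FROM Items "
        "WHERE OrderNumber = ? ORDER BY ItemNumber", (number,))
    for item_no, part, quantity, price in items:
        item = ET.SubElement(order, "Item", {"Number": str(item_no)})
        ET.SubElement(item, "Part").text = part
        ET.SubElement(item, "Quantity").text = str(quantity)
        ET.SubElement(item, "Price").text = str(price)
    return order

doc = order_document(conn, 1234)
print(ET.tostring(doc, encoding="unicode"))
```

Neither form repeats a value: the tables repeat only the order number as a key, and the document expresses the same relationship through nesting.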
  • Slide 27
  • Problem: Real sales order
      • Real world not that simple
      • Sales order probably contains customer information
          • ID, name, bill-to address, ship-to address, etc.
      • [Figure: sales order 1234 with embedded customer information (020962, Gallagher Industries, ...), date 29.10.00, and items A-10, 12, 10.95 and B-43, 600, 3.99]
  • Slide 28
  • Solutions: Real sales order
      • Normal: Store customer info in separate file
          • Use XLinks or joins
          • XLinks not widely supported (will be in future?)
          • If normalized and flat, might as well use relational database
      • Non-normal: Store customer info in each sales order
          • Gains retrieval speed at the cost of query flexibility and more complex updates
          • Real-world relational databases often not normal
  • Slide 29
  • Normalization and document-centric documents
      • Often not worth doing
          • For example, in a collection of user manuals
          • Each contains copyright, company logo, company address
          • Duplicate information not worth normalizing
      • Matters only when there is significant overlap
          • Procedures common to many models of same product
          • List of worldwide customer support contacts
          • ...
  • Slide 30
  • Referential integrity
      • Refers to validity of pointers to other data
          • For example, PartNumber in Items points to valid row in Parts
      • Applies to XLinks and external entity references
          • XLinks generally not supported => not an issue
          • Probably not enforced for external entity references
      • Needs support in the future
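
For comparison, this is what enforced referential integrity looks like on the relational side, sketched with SQLite foreign keys. The `Parts` and `Items` table names and the `PartNumber` column come from the slide; the remaining column and the part number `Z-99` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE Parts (PartNumber TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE Items (OrderNumber INTEGER, "
             "PartNumber TEXT REFERENCES Parts(PartNumber))")
conn.execute("INSERT INTO Parts VALUES ('A-10')")
conn.execute("INSERT INTO Items VALUES (1234, 'A-10')")  # points to a valid row

try:
    # 'Z-99' is a hypothetical part number with no row in Parts.
    conn.execute("INSERT INTO Items VALUES (1234, 'Z-99')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("dangling reference rejected:", rejected)
```

An XLink or external entity reference plays the role of `PartNumber` here; the point of the slide is that a native XML database of this period generally performs no equivalent check.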
  • Slide 31
  • Scalability and performance
      • Outside my area of expertise
      • Native XML databases appear to scale / perform:
          • Much better than relational databases when retrieving whole documents or fragments
          • Much worse than relational databases when retrieving unindexed data
          • Slower(?) than relational databases when retrieving views of indexed data that don't follow the storage hierarchy
      • Benchmark data not yet available
  • Slide 32
  • Whole documents or fragments
      • Text-based databases are very fast
          • Data is contiguous on disk
          • Retrieval requires index lookup and single disk read
      • [Figure: retrieval steps: 1. Index lookup, 2. Position disk head, 3. Read to here]
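
The three retrieval steps can be mimicked with an in-memory byte store and an offset index. This is only a sketch of the idea: `io.BytesIO` stands in for the disk file, and the document ids and contents are assumptions.

```python
import io

# Hypothetical documents; ids and contents are assumptions.
documents = {
    "1234": b"<Order><Customer>Gallagher Industries</Customer></Order>",
    "1235": b"<Order><Customer>Example Corp</Customer></Order>",
}

store = io.BytesIO()   # stands in for one contiguous file on disk
index = {}             # document id -> (offset, length)
for doc_id, text in documents.items():
    index[doc_id] = (store.tell(), len(text))
    store.write(text)

def fetch(doc_id):
    offset, length = index[doc_id]   # 1. index lookup
    store.seek(offset)               # 2. position (the disk head, on real storage)
    return store.read(length)        # 3. one contiguous read

result = fetch("1235")
print(result.decode())
```

Because each document is stored contiguously, the cost of returning it as a string does not depend on how many elements it contains.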
  • Slide 33
  • Whole documents or fragments (cont.)
      • Model-based databases with proprietary storage are fast
          • Generally use physical pointers between nodes
      • Model-based databases built on other DBs may be fast
          • Depends on underlying database and implementation strategy
      • [Figure: retrieval steps over a node tree: 1. Index lookup, 2. Position disk head, 3. Follow pointers to here]
  • Slide 34
  • Views not following storage hierarchy
      • Slower than hierarchical views?
      • May require many index lookups or linear searches
      • Pointers to parent nodes should help in model-based databases
      • Relational databases are query neutral
      • Example: Get the dates of all sales orders for part A-10
          • 1. Index lookup for part A-10
          • 2. Follow pointers to Order?
          • 3. Search children for Date?
      • [Figure: sales order 1234 (Gallagher Industries, 29.10.00) with items A-10, 12, 10.95 and B-43, 600, 3.99]
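
The three steps of the example query can be sketched with `xml.etree.ElementTree`. ElementTree keeps no parent pointers, so the sketch builds a child-to-parent map to play that role, and a linear scan stands in for the index lookup. Tag names and the second order are assumptions.

```python
import xml.etree.ElementTree as ET

# Order 1234 uses the values from the slide; order 1235 and all tag names are hypothetical.
ORDERS = """<Orders>
  <Order number="1234">
    <Customer>Gallagher Industries</Customer>
    <Date>29.10.00</Date>
    <Item><Part>A-10</Part><Quantity>12</Quantity><Price>10.95</Price></Item>
    <Item><Part>B-43</Part><Quantity>600</Quantity><Price>3.99</Price></Item>
  </Order>
  <Order number="1235">
    <Customer>Example Corp</Customer>
    <Date>30.10.00</Date>
    <Item><Part>B-43</Part><Quantity>5</Quantity><Price>3.99</Price></Item>
  </Order>
</Orders>"""

root = ET.fromstring(ORDERS)

# Child -> parent map, standing in for the parent pointers a model-based store might keep.
parent = {child: p for p in root.iter() for child in p}

# 1. "Index lookup" for part A-10 (a linear scan here, since there is no real index).
parts = [e for e in root.iter("Part") if e.text == "A-10"]

# 2. Follow parent pointers Part -> Item -> Order, then 3. search the Order's children for Date.
dates = [parent[parent[part]].findtext("Date") for part in parts]
print(dates)
```

The query climbs against the storage hierarchy (from Part up to Order), which is why the slide expects it to cost more than a top-down retrieval.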
  • Slide 35
  • Indexed data
      • Native XML databases use indexes heavily
      • Index lookup speed same as any database, but...
      • ...more index lookups may be required than by RDBMS
      • Update times slower due to index updates
  • Slide 36
  • Unindexed data
      • Slow for model-based databases
          • Must read all elements, not just elements of a particular type
          • Comparisons slower due to converting text
      • Very slow for text-based databases
          • Must parse document as well as comparing values
      • Example: Find date 29.10.00
          • Relational database: 1. Search the date column of the Orders table
          • Model-based native XML database: 1. Search all elements for Date elements, 2. Search text for all Date elements
      • [Figure: Orders table row (1234, 29.10.00, Gallagher Industries) next to a tree of Element, Attr, and Text nodes]
  • Slide 37
  • Query return types
      • String, DOM tree, SAX events
      • Text-based databases
          • Very fast returning strings
          • Slow returning DOM trees or SAX events due to parsing
      • Model-based databases
          • Probably similar speed to relational databases for all types
  • Slide 38
  • Native XML Database Features
  • Slide 39
  • Document Collections
      • Contain related documents
      • Similar to:
          • Catalog/schema in relational database
          • Directory in file system
      • Some databases allow nested collections
  • Slide 40
  • Indexes
      • All databases use indexes
      • Some databases index everything
      • Other databases allow user to specify what to index
  • Slide 41
  • Query Languages
      • XPath and XQL are most common
      • Usually include extensions for multi-document queries
      • Many databases have proprietary languages
      • XQuery will probably be standard in the future
  • Slide 42
  • Updates
      • Many databases simply replace existing document
      • Some databases allow updates through live DOM
      • Other databases have fragment update language
      • Best way to do updates still unclear
  • Slide 43
  • Transactions, Locking, and Concurrency
      • Most databases support transactions
      • Locking often at document (not fragment) level
      • Whether this is an issue depends on:
          • What is stored in a single document
          • Number of concurrent users
      • Fragment locking probably more common in future
  • Slide 44
  • APIs
      • Most databases have proprietary APIs
      • XML:DB is database-neutral API
      • Standard API (XML:DB or other) likely in future
      • APIs similar to ODBC
          • Query language is separate from API
          • Methods to connect, execute queries, retrieve results, commit transactions
      • Results returned as single document or set of documents
      • Documents returned as string, DOM tree, or SAX events
      • Most databases support HTTP
  • Slide 45
  • Round-tripping
      • All native XML databases can round-trip documents
      • Round-trip level depends on database
          • Text-based databases usually do exact round-tripping
          • Model-based databases round-trip at level of model
          • Minimum is elements, attributes, PCDATA, and document order
          • May be less than canonical XML (comments and processing instructions discarded)
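
Round-tripping "at the level of the model" can be seen with two parsers from Python's standard library, used here as stand-ins for two different storage models rather than as native XML databases. The sample document is an assumption.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical document containing a comment.
SRC = '<a><!-- note --><b x="1">text</b></a>'

# minidom's model includes comment nodes, so the comment survives the round trip.
dom_out = minidom.parseString(SRC).documentElement.toxml()

# ElementTree's default parser drops comments, so only elements,
# attributes, text, and document order come back.
et_out = ET.tostring(ET.fromstring(SRC), encoding="unicode")

print(dom_out)
print(et_out)
```

Both results are valid round trips with respect to their own model; only the first happens to reproduce the source exactly.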
  • Slide 46
  • External data
      • Some databases can merge data from external databases, such as with ODBC, OLE DB, JDBC
      • Whether data is live depends on database
      • In the future, most databases will probably support live external data
  • Slide 47
  • External entity storage
      • Not clear whether to store entity or URI
          • Storing entity value is incorrect if URI points to live data
          • Storing URI may be incorrect if entity meant as a snapshot
      • Not sure how databases handle this problem
      • Correct answer is probably to let user decide
  • Slide 48
  • Resources
  • Slide 49
  • Resources
      • Ronald Bourret's Papers Page: http://www.rpbourret.com/xml/index.htm
      • XML:DB.org's Resources Page: http://www.xmldb.org/resources.html
      • XML:DB Mailing List: http://www.xmldb.org/projects.html
  • Slide 50
  • Questions? Ronald Bourret, [email protected], http://www.rpbourret.com