37
XML for Scientific Applications

XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science

Embed Size (px)

Citation preview

Page 1: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML for Scientific Applications

Page 2: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Outline

SQ1 returned

History Manipulating XML XML for Science

Page 3: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML History

late 1960s “Generic Coding” for elect. manuscripts

'heading' rather than 'format-17' 1969

Goldfarb, Mosher, Lorie @ IBM Generalized Markup Language

1980 SGML; Standardized by ANSI IRS, DoD were early adopters

Page 4: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

SGML

Representation of documents Device-Independent, Software-

Independent Content

Tagged Text Structure

Document Type Definition (DTD) Style

Usually proprietary

Page 5: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

SGML: Content

Tagged Text start tags and end tags indentation not important

<section> <heading>SGML Example</heading> <paragraph> <line>first line of SGML</line> <line>second line of SGML</line> </paragraph></section>

Page 6: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

SGML: Structure

A section is an optional heading, followed by one or more paragraphs.

A paragraph is one or more lines or a quote.

A heading is just text.

A line is just text.

Dcoument Type Definition (DTD)

<!ELEMENT section - - (heading?, paragraph+)><!ELEMENT paragraph - O (line+ | quote)><!ELEMENT heading - O (#PCDATA) ><!ELEMENT line - O (#PCDATA) ><!ELEMENT quote - O (#PCDATA) >

Page 7: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

SGML: Benefits and Weaknesses

Benefits Human-readable Portable Structure-oriented

Weaknesses DTD difficult to define up-front Difficult to implement

syntax abbreviations/rules inference

Page 8: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

AN SGML Application: HTML

1990 Tim Berners-Lee @ CERN

a Scientific Data Management problem… writes a DTD for “Hypertext Markup Language”

1994-1996 Web takes off Page designers need guarantees about how

their page is going to look <font> tags, etc. introduced; purists despair MS and Netscape fight

Page 9: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Data on the Web

<title>,<h1>,<font> are for humans

Back to SGML for data, but Stricter syntax (must use end tags)

good for implementors looser semantics (DTD not required)

good for users

Non-standard markup is better than no markup at all

Page 10: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Growth

Sun introduces Java New hope for interoperability

Microsoft pushes XML because…it’s not Java

Individuals can publish data unilaterally Contrast: How do you publish relations? Contrast: Documents? (MS Word? as pdf?) A non-partisan, unimpeachable format

Page 11: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML File Sample

<?xml version="1.0"?> <dining-room>

<manufacturer>The Wood Shop</manufacturer>

<table type="round" wood="maple">

<price>$199.99</price> </table> <chair wood="maple">

<quantity>6</quantity> <price>$39.99</price>

</chair> </dining-room>

Page 12: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Abstract View of XML

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” “$199.99” “$39.99” 6“maple”

price

Page 13: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Manipulating XML

Page 14: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XPath

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

1) //price

2) /*/*/wood

3) //wood/../

4) /dining-room/*[price>100]/type

5) //wood[1]/text()

Page 15: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XPath

XML [Node] language is not “closed” cannot compose xpath expressions

Answers are references to nodes works for local memory…

Page 16: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XSLTdining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

<xsl:stylesheet version="1.0" xmlns:xsl=“…"> <xsl:template match=“//wood">

<xsl:if test=“/price > 100”><material>

<xsl:value-of select=“.”/> </material>

</xsl:if></xsl:template> </xsl:stylesheet>

Page 17: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XSLT

logically, a set of rules pattern1 -> result1 pattern2 -> result2 …

physically, an XML document why is this a good idea?

XML XML, via XML

Page 18: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XQuerydining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

for $v in doc(“example.xml”)//woodwhere $v/../price > 100return <material>

$v </material>

Page 19: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XQuerydining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

for $m in doc(“example.xml”)//table[wood=“maple”]for $o in doc(“example.xml”)//table[wood=“oak”]where $m/price > $o/pricereturn <pair> <maple>$m/type </maple>

<oak>$m/type </oak> </pair>

Page 20: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XQuery

Feels more like a Query language Joins are natural to express

XML XML, via special syntax

Page 21: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XSLT and XQuery

Today’s author sums up: XSLT is for transformation

document-centric view of XML XQuery is for query

data-centric view of world

Page 22: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML for Science

Exercise: Sensor Data again

Page 23: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Sensor Data

Would Homework 1 be any easier with XML?

Page 24: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Benefits for Science

“better, more precise searching of full-content;

automatic extraction of object metadata; better support for compound document

formats and dynamic formatting generally; and,

improved processes for scholarly dissemination tasks such as peer review, versioning, and archiving.”

-- from NSF / NSDL Workshop on Scientific Markup Languages

Page 25: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML for Science

Recall features of Science Data: Read-oriented access Provenance

who, what, when, where, why Interesting Data Types

timeseries spatial arrays images

Scale

Page 26: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML for Science

Read-oriented access? perfect!

Provenance requires some flexibility; no problem

Interesting Data Types …and special file formats

Scale could get ugly

Page 27: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Interesting Data Types

Data locked in binary file formats Binary Format Description Language

[Myers, Chappell 2000] Data Format Description Language

[OpenGrid Project] Retrofitting Data Models

[Howe, Maier SSDBM 2005] PADX

[Fernandez et al, PLANX 2006] XDTM

[Foster, Voeckler et al. Global Grid Forum 2005]

Page 28: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Binary Format Description Language

Page 29: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

PADX

Page 30: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

1. Precord Pstruct summary_header_t {2. "0|";3. Punixtime tstamp;4. };5. Pstruct no_ramp_t {6. "no_ii";7. Puint64 id;8. };9. Punion dib_ramp_t {10. Pint64 ramp;11. no_ramp_t genRamp;12. };13. Pstruct order_header_t {14. Puint32 order_num;15. ’|’; Puint32 att_order_num;16. ’|’; Puint32 ord_version;17. ’|’; Popt pn_t service_tn;18. ’|’; Popt pn_t billing_tn;19. ’|’; Popt pn_t nlp_service_tn;20. ’|’; Popt pn_t nlp_billing_tn;21. ’|’; Popt Pzip zip_code;22. ’|’; dib_ramp_t ramp;23. ’|’; Pstring(:’|’:) order_type;24. ’|’; Puint32 order_details;25. ’|’; Pstring(:’|’:) unused;26. ’|’; Pstring(:’|’:) stream;27. };28. Pstruct event_t {29. Pstring(:’|’:) state;30. ’|’; Punixtime tstamp;31. };32. Parray event_seq_t {33. event_t[] : Psep(’|’) && Pterm(Peor);34. };35. Precord Pstruct order_t {36. order_header_t order_header;37. ’|’; event_seq_t events;38. };39. Parray orders_t {40. order_t[];41. };42. Psource Pstruct summary_t{43. summary_header_t summary_header;44. orders_t orders;45. };

Page 31: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML as Data

Maier’s Maxim:

Every popular data exchange format eventually becomes a data storage format.

Page 32: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML Storage: Shredding

Use RDBMS as your storage engine Two approaches:

Schema-aware Schema-oblivious

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

Page 33: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML Storage: Schema-aware

Table(SKU, Wood, Type, Price)Chair(SKU, Wood, Price)

DiningRoom(Manufacturer, Chairs, Quantity, Table)

Page 34: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

XML Storage: Schema-oblivious

Remember fancy node-labeling schemes…

Edge(NodeId, Tag, Value, ParentNodeId)

Page 35: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Left/Right Labeling

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

0

1

2 3

4 5

7

6

8

9

34

10 …

Which queries are easy and fast?

What did we say the problems were?

Page 36: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Path Labeling

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

0

0.0

0.0.0 0.1.2

0.1

0.1.0.0

0.1.0

0.1.1.0

0.1.1

What queries are fast and/or easy?

What did we say the problems were?

0.1.2.0

Page 37: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science

Next time

Geographic Information Systems Spatial Data

Study Questions 2 (I’ll post them tonight)