INEX – a broadly accepted data set for XML database processing?

Pavel Loupal, Michal Valenta

Valenta, Loupal: INEX – a broadly accepted data set for XML database processing?

Presentation Content

1. INEX initiative
2. INEX data set
3. Utilization framework
4. Example – approximate XML tree embedding

INEX Initiative 1/3

- 2001 – reference dataset for information retrieval
  - Duisburg-Essen University – Norbert Fuhr, Saadia Malik
  - Queen Mary University of London – Mounia Lalmas
- 2003 – 69 participants (mainly universities)
- 2 workshops (2002, 2003)
  - open discussion about the current state of the project

INEX Initiative 2/3

- Stage 1 – data collection (by IEEE)
- Stage 2 – evaluation of reference queries
  - 30 Content Only (CO)
  - 36 Content and Structure (CAS)
- Stage 3 – manual relevance assessment of query results
- continues…

INEX Initiative 3/3

- Stage 3 – the point at which we joined INEX:
  - Assessment of queries 83 and 84 – 1000 docs each
  - 2-dimensional scale (exhaustivity, specificity)
  - Relevance assessment on XML elements (parent-child dependencies)
  - Finished in February 2004
- Stage 4 (current)
  - Study of researchers' behaviour
  - Heterogeneous resources / distributed systems

INEX Initiative - Assessment

INEX Data Set Structure 1/3

- Current version 1.4 – 536 MB
- 6 IEEE Transactions, 12 journals (1995-2002)
- 12107 articles – XML text only (without pictures)
- Organized as a file-system directory tree
- On average, each article has
  - 1532 nodes, 45 kB
  - average depth: 6.9
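
The slides do not say how the per-article node count and depth were measured. The following is a minimal sketch of one possible reading. It assumes that "nodes" means element nodes and that depth is averaged over all elements with the root at depth 1. The class and method names are hypothetical, and the sample document in `main` is invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlStatsSketch {
    /** Returns {element count, sum of element depths} for the subtree rooted at node. */
    static long[] walk(Node node, int depth) {
        long count = 1;
        long depthSum = depth;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) { // text, comments etc. are ignored (assumption)
                long[] sub = walk(c, depth + 1);
                count += sub[0];
                depthSum += sub[1];
            }
        }
        return new long[] {count, depthSum};
    }

    /** Parses an XML string and collects the statistics; the root element has depth 1. */
    static long[] stats(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        return walk(root, 1);
    }

    public static void main(String[] args) throws Exception {
        // Invented miniature article, not taken from the INEX collection.
        long[] s = stats("<article><fm><ti>t</ti></fm><bdy><sec><p>x</p></sec></bdy></article>");
        System.out.println("elements=" + s[0] + ", mean depth=" + ((double) s[1] / s[0]));
    }
}
```

Averaging the two numbers over all files of the collection would give collection-level figures comparable to those on the slide, provided the same definitions were used.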

INEX Data Set Structure 2/3

/inex-1.4
  /dtd
    ...
    xmlarticle.dtd
  /xml
    /an
      /1995
        ...
        a1019.xml
        a1032.xml
        a1034.xml
        ...
      /...
      /2002
    /...
    /ts
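
A directory layout like the one above can be traversed with the standard Java file API. The sketch below counts the `*.xml` files in each journal/year directory. It assumes the nesting is journal, then year, then article files, which is inferred from the slide. The class name, method name and default path are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

class InexLayoutSketch {
    /** Counts *.xml files per directory (e.g. "an/1995") below the given xml root. */
    static Map<String, Integer> countArticles(Path xmlRoot) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(xmlRoot)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".xml"))
                 .forEach(p -> {
                     // Directory of the file, relative to the root, with '/' separators.
                     String dir = xmlRoot.relativize(p.getParent()).toString().replace('\\', '/');
                     counts.merge(dir, 1, Integer::sum);
                 });
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // The default path is a placeholder; pass the real location as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "inex-1.4/xml");
        countArticles(root).forEach((dir, n) -> System.out.println(dir + ": " + n));
    }
}
```

The same walk could drive a bulk import that creates one database collection per directory, though the slides do not state how the authors loaded the data.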

INEX Data Set Structure 3/3

<article>
  <fm>
    ...
    <ti>IEEE Transactions on ...</ti>
    <atl>Construction of ...</atl>
    <au>
      <fnm>John</fnm>
      <snm>Smith</snm>
      <aff>University of ...</aff>
    </au>
    <au>...</au>
    ...
  </fm>
  <bdy>
    <sec>
      <st>Introduction</st>
      <p>...</p>
      ...
    </sec>
    <sec>
      <st>...</st>
      ...
      <ss1>...</ss1>
      <ss1>...</ss1>
      ...
    </sec>
    ...
  </bdy>
  <bm>
    <bib>
      <bb>
        <au>...</au><ti>...</ti>
        ...
      </bb>
      ...
    </bib>
  </bm>
</article>
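
The element names above are enough to write simple XPath queries against a single article. The sketch below uses only the JDK (`javax.xml.xpath`), not Xindice. The sample document is invented to follow the structure on the slide, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ArticleXPathSketch {
    // Invented sample that mirrors the element names on the slide; not real INEX content.
    static final String SAMPLE =
        "<article><fm><ti>Some journal</ti><atl>Some title</atl>"
      + "<au><fnm>John</fnm><snm>Smith</snm></au>"
      + "<au><fnm>Jane</fnm><snm>Doe</snm></au></fm>"
      + "<bdy><sec><st>Introduction</st><p>Text</p></sec></bdy>"
      + "<bm><bib><bb><au>Other</au><ti>Cited work</ti></bb></bib></bm></article>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates an XPath expression and returns the text content of every matching node. */
    static List<String> select(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(select(doc, "/article/fm/atl"));    // article title
        System.out.println(select(doc, "/article/fm/au/snm")); // surnames of the article's authors
        System.out.println(select(doc, "//sec/st"));           // section titles
    }
}
```

Note that `au` occurs both in the front matter and in bibliography entries, so the path `/article/fm/au` is used to restrict the result to the article's own authors.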

Data Set Utilization – Framework 1/2

- Native XML storage (Apache Xindice)
- Key features:
  - Inner structure: collections & documents
  - Standard API (XML:DB or XML-RPC)
  - XPath expressions over collections & docs
  - Metadata

Data Set Utilization – Framework 2/2

- Web interface – JavaServer Pages (JSPs)
- Usage of XML:DB Java API:

    // Assumes the Xindice driver is already registered with org.xmldb.api.DatabaseManager
    String url = "xmldb:xindice://localhost:8080/inex/mu/2001";
    Collection col = DatabaseManager.getCollection(url);
    Resource doc = col.getResource("a1019.xml");
    System.out.println(doc.getContent());

Approximate Tree Embedding 1/4

- Aim: approximately embed one XML tree (query) into another (data)
- Algorithm history:
  - Kilpeläinen – NP-complete problem
  - Schlieder – polynomial in practical examples
  - Vana – further improvements

Approximate Tree Embedding 2/4

[Figure: two labelled XML trees, a) and b), built from the nodes article, yr, authors, author, fnm, snm and the values 2001, John, Smith, Mark, Knopfler]

Approximate Tree Embedding 3/4

Query:

<article>
  <yr>2001</yr>
  <au>
    <snm>Smith</snm>
  </au>
</article>

Data:

<articles>
  …
  <article yr="2001">
    <authors>
      <au>
        <fnm>John</fnm><snm>Smith</snm>
      </au>
      <au>
        <fnm>Mark</fnm><snm>Knopfler</snm>
      </au>
    </authors>
  </article>
  …
</articles>
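
The query does not match the data exactly: `yr` is an attribute in the data, and `au` sits below an extra `authors` element. The sketch below shows one simplified way to score such a match. It is not the algorithm of Kilpeläinen, Schlieder or Vana. It assumes that attributes and text values are modeled as child nodes, that a query child may land on any equally labeled descendant of its parent's image, that the mapping need not be injective, and that the cost is the number of data nodes skipped along the way. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TreeEmbeddingSketch {
    /** Minimal labeled tree node (hypothetical helper, not from the slides). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    static final int NO_MATCH = Integer.MAX_VALUE;

    /** Cost of embedding q with its root mapped exactly onto d, or NO_MATCH. */
    static int costAt(Node q, Node d) {
        if (!q.label.equals(d.label)) return NO_MATCH;
        int total = 0;
        for (Node qc : q.children) {
            int best = NO_MATCH;
            for (Node dc : d.children) best = Math.min(best, costBelow(qc, dc, 0));
            if (best == NO_MATCH) return NO_MATCH; // one query child cannot be placed
            total += best;
        }
        return total;
    }

    /** Cheapest embedding of q inside the subtree of d; skipped = data nodes passed over so far. */
    static int costBelow(Node q, Node d, int skipped) {
        int here = costAt(q, d);
        int best = (here == NO_MATCH) ? NO_MATCH : here + skipped;
        for (Node dc : d.children) best = Math.min(best, costBelow(q, dc, skipped + 1));
        return best;
    }

    /** Cheapest embedding with the query root placed on any data node (placement is free). */
    static int bestCost(Node q, Node d) {
        int best = costAt(q, d);
        for (Node dc : d.children) best = Math.min(best, bestCost(q, dc));
        return best;
    }

    static Node n(String label, Node... kids) { return new Node(label, kids); }

    /** The query from the slide, with the surname as a parameter. */
    static Node sampleQuery(String surname) {
        return n("article", n("yr", n("2001")), n("au", n("snm", n(surname))));
    }

    /** The data from the slide; attribute yr="2001" is modeled as a child node (assumption). */
    static Node sampleData() {
        return n("articles",
            n("article",
                n("yr", n("2001")),
                n("authors",
                    n("au", n("fnm", n("John")), n("snm", n("Smith"))),
                    n("au", n("fnm", n("Mark")), n("snm", n("Knopfler"))))));
    }

    public static void main(String[] args) {
        System.out.println(bestCost(sampleQuery("Smith"), sampleData()));             // 1: only "authors" is skipped
        System.out.println(bestCost(sampleQuery("Jones"), sampleData()) == NO_MATCH); // true: no such surname
    }
}
```

The recursion is kept naive for readability; a practical implementation would at least memoize the sub-results and would likely use label indexes over the data.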

Approximate Tree Embedding 4/4

Conclusion

- INEX initiative overview
- INEX data set + our testing framework = suitable for testing algorithms & approaches
- Further discussion