INEX – a broadly accepted data set for XML database
processing?
Pavel Loupal, Michal Valenta
Valenta, Loupal: INEX – a broadly accepted data set for XML database processing?
Presentation Content
1. INEX initiative
2. INEX data set
3. Utilization framework
4. Example – approximate XML tree embedding
INEX Initiative 1/3
2001 – reference data set for information retrieval
  Duisburg-Essen University – Norbert Fuhr, Saadia Malik
  Queen Mary University London – Mounia Lalmas
2003 – 69 participants (mainly universities)
2 workshops (2002, 2003)
  open discussion about the current state of the project
INEX Initiative 2/3
Stage 1 – data collection (by IEEE)
Stage 2 – evaluation of reference queries
  30 Content Only (CO)
  36 Content and Structure (CAS)
Stage 3 – manual relevance assessment of query results
continues…
INEX Initiative 3/3
Stage 3 – our entry point to INEX:
  Assessment of queries 83 and 84 – 1000 documents each
  2-dimensional scale (exhaustivity, specificity)
  Relevance assessment on XML elements (parent-child dependencies)
  Finished in February 2004
Stage 4 (current)
  Study of researchers' behaviour
  Heterogeneous resources / distributed systems
INEX Initiative - Assessment
INEX Data Set Structure 1/3
Current version 1.4 – 536 MB
6 IEEE Transactions, 12 journals (1995-2002)
12107 articles – XML text only (without pictures)
Organized as a file system hierarchy
On average, each article has
  1532 nodes, 45 kB
  average depth: 6.9
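Per-article figures such as node count and average depth can be measured with a plain DOM traversal. The following is a minimal sketch, not the authors' measurement code. It assumes that only element nodes are counted and that the document element has depth 1; the slides do not say how nodes or depth were counted. The class name `XmlTreeStats` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: element count and depth statistics for one XML document.
class XmlTreeStats {
    long elements = 0;  // number of element nodes visited
    long depthSum = 0;  // sum of the depths of all element nodes
    int maxDepth = 0;   // depth of the deepest element

    // Walks the subtree rooted at node. Assumption: the document element has depth 1.
    void visit(Node node, int depth) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // text, comments etc. are ignored in this sketch
        }
        elements++;
        depthSum += depth;
        maxDepth = Math.max(maxDepth, depth);
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, depth + 1);
        }
    }

    double averageDepth() {
        return elements == 0 ? 0.0 : (double) depthSum / elements;
    }

    // Parses the XML text and collects the statistics.
    static XmlTreeStats of(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XmlTreeStats stats = new XmlTreeStats();
            stats.visit(doc.getDocumentElement(), 1);
            return stats;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        XmlTreeStats s = XmlTreeStats.of(
                "<article><fm><ti>T</ti></fm><bdy><sec><p>x</p></sec></bdy></article>");
        System.out.println(s.elements + " elements, average depth " + s.averageDepth());
    }
}
```

Real INEX articles reference a DTD and entities, so parsing them would additionally require the DTD to be resolvable; the sketch only handles self-contained XML text.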
INEX Data Set Structure 2/3
/inex-1.4
  /dtd
    ...
    xmlarticle.dtd
  /xml
    /an
      /1995
        ...
        a1019.xml
        a1032.xml
        a1034.xml
        ...
      /...
      /2002
    /...
    /ts
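The directory listing suggests that each article path encodes a journal directory, a year, and an article file name. The sketch below splits such a path into those three parts. The class `InexPath`, its field names, and the journal/year/file reading of the layout are assumptions made for illustration, not part of the data set's documentation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical value class: one article location inside the INEX directory tree.
class InexPath {
    final String journal;    // journal directory, e.g. "an"
    final int year;          // year directory, e.g. 1995
    final String articleId;  // file name without ".xml", e.g. "a1019"

    InexPath(String journal, int year, String articleId) {
        this.journal = journal;
        this.year = year;
        this.articleId = articleId;
    }

    // Assumption: the last three path components are <journal>/<year>/<file>.xml.
    static InexPath parse(String path) {
        Path p = Paths.get(path);
        int n = p.getNameCount();
        if (n < 3) {
            throw new IllegalArgumentException("expected .../<journal>/<year>/<file>.xml: " + path);
        }
        String file = p.getName(n - 1).toString();
        if (!file.endsWith(".xml")) {
            throw new IllegalArgumentException("not an XML article file: " + path);
        }
        String id = file.substring(0, file.length() - ".xml".length());
        int year = Integer.parseInt(p.getName(n - 2).toString());
        return new InexPath(p.getName(n - 3).toString(), year, id);
    }

    public static void main(String[] args) {
        InexPath a = InexPath.parse("/inex-1.4/xml/an/1995/a1019.xml");
        System.out.println(a.journal + " " + a.year + " " + a.articleId);
    }
}
```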
INEX Data Set Structure 3/3
<article>
  <fm>
    ...
    <ti>IEEE Transactions on ...</ti>
    <atl>Construction of ...</atl>
    <au>
      <fnm>John</fnm>
      <snm>Smith</snm>
      <aff>University of ...</aff>
    </au>
    <au>...</au>
    ...
  </fm>
  <bdy>
    <sec>
      <st>Introduction</st>
      <p>...</p>
      ...
    </sec>
    <sec>
      <st>...</st>
      ...
      <ss1>...</ss1>
      <ss1>...</ss1>
      ...
    </sec>
    ...
  </bdy>
  <bm>
    <bib>
      <bb>
        <au>...</au><ti>...</ti>
        ...
      </bb>
      ...
    </bib>
  </bm>
</article>
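Given the article structure shown on this slide, front-matter fields can be read with XPath. The following is a minimal sketch using the XPath API bundled with the JDK. The class and method names are hypothetical, and the element names (`fm`, `atl`, `au`, `snm`) are taken from the slide rather than checked against the DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads front-matter fields of one article given as XML text.
class ArticleFrontMatter {

    // Returns the article title (<atl> inside <fm>), or "" if it is missing.
    static String title(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/article/fm/atl", new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate XPath", e);
        }
    }

    // Returns the surnames of the article's own authors, in document order.
    static List<String> authorSurnames(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/article/fm/au/snm",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> surnames = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                surnames.add(nodes.item(i).getTextContent());
            }
            return surnames;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article><fm><atl>Construction of ...</atl>"
                + "<au><fnm>John</fnm><snm>Smith</snm></au></fm></article>";
        System.out.println(title(xml) + " by " + authorSurnames(xml));
    }
}
```

The path is anchored at `/article/fm` on purpose: `<au>` also occurs inside bibliography entries (`<bb>`), so a bare `//au` would mix cited authors with the article's own.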
Data Set Utilization – Framework 1/2
Native XML storage (Apache Xindice)
Key features:
  Inner structure: collections & documents
  Standard API (XML:DB or XML-RPC)
  XPath expressions over collections & documents
  Metadata
Data Set Utilization – Framework 2/2
Web interface – JavaServer Pages (JSPs)
Usage of XML:DB Java API:
import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Resource;

// Assumes the Xindice driver is already registered with DatabaseManager.
String url = "xmldb:xindice://localhost:8080/inex/mu/2001";
Collection col = DatabaseManager.getCollection(url);
Resource doc = col.getResource("a1019.xml");
System.out.println(doc.getContent());
Approximate Tree Embedding 1/4
Aim: approximately embed one XML tree (query) into another (data)
Algorithm history:
  Kilpeläinen – NP-complete problem
  Schlieder – polynomial in practical examples
  Vana – further improvements
Approximate Tree Embedding 2/4
[Figure: two labelled trees, a) and b), built from the nodes article, yr (2001), authors, author, fnm (John, Mark), snm (Smith, Knopfler)]
Approximate Tree Embedding 3/4
Query:
<article>
  <yr>2001</yr>
  <au>
    <snm>Smith</snm>
  </au>
</article>
Data:
<articles>
  …
  <article yr="2001">
    <authors>
      <au>
        <fnm>John</fnm><snm>Smith</snm>
      </au>
      <au>
        <fnm>Mark</fnm><snm>Knopfler</snm>
      </au>
    </authors>
  </article>
  …
</articles>
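The query and data above can be used to illustrate the idea of an approximate embedding. The sketch below is a deliberately naive, exponential-time illustration and is not the algorithm of Kilpeläinen, Schlieder, or Vana. It assumes that labels must match exactly, that a query parent-child edge may map to a data ancestor-descendant path, that the cost is the number of data nodes skipped on such paths, that attributes and text values are modelled as child nodes, and that two query nodes may share a data node. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical labelled tree; elements, attributes and text values are all plain nodes.
class TreeNode {
    final String label;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String label, TreeNode... kids) {
        this.label = label;
        for (TreeNode k : kids) {
            children.add(k);
        }
    }
}

class ApproxEmbedding {
    static final int NO_MATCH = Integer.MAX_VALUE;

    // Cost of embedding the query subtree rooted at q with q mapped onto data node d.
    static int costAt(TreeNode q, TreeNode d) {
        if (!q.label.equals(d.label)) {
            return NO_MATCH;
        }
        int total = 0;
        for (TreeNode qc : q.children) {
            int best = bestBelow(qc, d, 0);
            if (best == NO_MATCH) {
                return NO_MATCH; // every query child must be embedded somewhere
            }
            total += best;
        }
        return total;
    }

    // Cheapest embedding of q at a proper descendant of d; skipped counts the
    // data nodes already passed over between the parent's image and d's children.
    static int bestBelow(TreeNode q, TreeNode d, int skipped) {
        int best = NO_MATCH;
        for (TreeNode dc : d.children) {
            int here = costAt(q, dc);
            if (here != NO_MATCH) {
                best = Math.min(best, here + skipped);
            }
            best = Math.min(best, bestBelow(q, dc, skipped + 1));
        }
        return best;
    }

    // Cheapest embedding of the query root at any node of the data tree.
    static int bestCost(TreeNode query, TreeNode data) {
        int best = costAt(query, data);
        for (TreeNode dc : data.children) {
            best = Math.min(best, bestCost(query, dc));
        }
        return best;
    }

    // Query from the slide.
    static TreeNode slideQuery() {
        return new TreeNode("article",
                new TreeNode("yr", new TreeNode("2001")),
                new TreeNode("au", new TreeNode("snm", new TreeNode("Smith"))));
    }

    // Data from the slide; the yr attribute is modelled as a child node (assumption).
    static TreeNode slideData() {
        return new TreeNode("articles",
                new TreeNode("article",
                        new TreeNode("yr", new TreeNode("2001")),
                        new TreeNode("authors",
                                new TreeNode("au",
                                        new TreeNode("fnm", new TreeNode("John")),
                                        new TreeNode("snm", new TreeNode("Smith"))),
                                new TreeNode("au",
                                        new TreeNode("fnm", new TreeNode("Mark")),
                                        new TreeNode("snm", new TreeNode("Knopfler"))))));
    }

    public static void main(String[] args) {
        System.out.println("embedding cost: " + bestCost(slideQuery(), slideData()));
    }
}
```

Under these assumptions the query embeds with cost 1: the only skipped data node is `authors`, which sits between `article` and the matching `au`. A query asking for a surname that does not occur in the data yields `NO_MATCH`.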
Approximate Tree Embedding 4/4
Conclusion
INEX initiative overview
INEX data set + our testing framework = suitable for testing algorithms & approaches
Further discussion