ESWC 2009, Research IX: Evaluation and Benchmarking
Benchmarking Fulltext Search Performance of RDF Stores
Enrico Minack, Wolf Siberski, Wolfgang Nejdl
L3S Research Center, Universität Hannover, Germany
{minack,siberski,nejdl}@L3S.de
03.06.2009
http://www.l3s.de/~minack/rdf-fulltext-benchmark/
03.06.2009 Enrico Minack
Outline
1. Motivation
2. Benchmark
• Data set and Query set
3. Evaluation
• Methodology and Results
4. Conclusion
5. References
1. Motivation
- Semantic applications provide fulltext search
- Underlying RDF stores therefore have to provide fulltext search
- Application developers have to choose among stores
- Best practice for choosing: benchmarks
- But: no existing RDF benchmark covers fulltext search
- RDF store developers resort to ad hoc benchmarks
→ Strong need for an RDF fulltext benchmark
2. Benchmark
Extended Lehigh University Benchmark [LUBM]
- Synthetic data, fixed list of queries
Familiar but non-trivial ontology
- University, Faculty, Professors, Students, Courses, …
- Realistic structural properties
- Artificial literal data
- „Professor1“, „GraduateStudent216“, „Course7“
2.1 Data set
Added:
• Person names (first name, surname) following a real-world distribution
• Publication content following topic-mixture-based word distributions trained on a real document collection [LSA]
2.1 Data set (Person Names)
Probabilities from U.S. Census 1990
(http://www.census.gov/genealogy/names/)
1,200 male first names
4,300 female first names
19,000 surnames
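Sampling from these lists only requires frequency-weighted choice. A minimal sketch (Python; the five-row table is a hypothetical excerpt in the census files' name/frequency/cumulative/rank format):

```python
import random

# Hypothetical excerpt in the format of the 1990 U.S. Census name files:
# name, frequency in percent, cumulative frequency, rank
SURNAME_FREQS = """\
SMITH 1.006 1.006 1
JOHNSON 0.810 1.816 2
WILLIAMS 0.699 2.515 3
JONES 0.621 3.136 4
BROWN 0.621 3.757 5
"""

def load_distribution(table: str) -> tuple[list[str], list[float]]:
    """Parse (name, weight) pairs from a census-style frequency table."""
    names, weights = [], []
    for line in table.strip().splitlines():
        name, freq, _cum, _rank = line.split()
        names.append(name.capitalize())
        weights.append(float(freq))
    return names, weights

def sample_surnames(n: int, seed: int = 42) -> list[str]:
    """Draw n surnames following the real-world frequency distribution."""
    names, weights = load_distribution(SURNAME_FREQS)
    return random.Random(seed).choices(names, weights=weights, k=n)

print(sample_surnames(5))
```

The same scheme applies to the male and female first-name lists; only the input table changes.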
2.1 Data set (Publication Text)
Probabilistic topic model trained on the NIPS data set (1,740 documents)
Yields:
• 100 topics (word probabilities)
• topics of documents
• topic occurrence probabilities
• topic co-occurrence probabilities
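The generation scheme can be sketched as two nested weighted choices: pick a topic from the document's topic mixture, then pick a word from that topic's word distribution. A toy illustration with two hypothetical topics standing in for the 100 trained ones:

```python
import random

# Toy topic model: two hypothetical "topics" with word probabilities,
# standing in for the 100 topics trained on the NIPS collection.
TOPICS = {
    "networking": {"network": 0.4, "protocol": 0.3, "packet": 0.3},
    "learning":   {"model": 0.4, "training": 0.35, "gradient": 0.25},
}

def generate_text(topic_mixture: dict[str, float], length: int,
                  seed: int = 0) -> list[str]:
    """Generate publication text: for each word position, sample a topic
    from the document's topic mixture, then a word from that topic."""
    rng = random.Random(seed)
    topics = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topics]
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

text = generate_text({"networking": 0.7, "learning": 0.3}, length=20)
print(" ".join(text))
```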
2.1 Data set (Publication Text)
[Slide diagram: Faculty members (Professor, GraduateStudent) are associated with topics; each Publication's text is generated from its authors' topics]
2.1 Data set (Statistics)
[Slide table: data set statistics]
2.2 Query set
Three sets of queries
• Basic IR Queries
• Semantic IR Queries
• Advanced IR Queries
2.2 Query set (Basic IR Queries)
Pure IR queries
[Slide diagrams: Q1–Q5, combining the keywords „engineer“, „network“, „smith“/„Smith“, the phrase „network engineer“, and the properties ub:publicationText and ub:surname on ub:Publication]
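As a rough illustration of how multi-keyword matching (Q2 style) differs from phrase matching (Q3 style) — toy texts, not the benchmark data:

```python
import re

# Toy publication texts (hypothetical, standing in for ub:publicationText)
PUBLICATIONS = {
    "pub1": "the network engineer configured the router",
    "pub2": "an engineer joined the neural network team",
    "pub3": "topic models for text",
}

def keyword_query(keywords: list[str]) -> set[str]:
    """All keywords must occur somewhere in the text (Q2 style)."""
    return {
        pid for pid, text in PUBLICATIONS.items()
        if all(kw in text.split() for kw in keywords)
    }

def phrase_query(phrase: str) -> set[str]:
    """Keywords must occur adjacently, as a phrase (Q3 style)."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return {pid for pid, text in PUBLICATIONS.items()
            if re.search(pattern, text)}

print(keyword_query(["network", "engineer"]))  # pub1 and pub2
print(phrase_query("network engineer"))        # only pub1
```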
2.2 Query set (Semantic IR Queries)
Graph patterns combined with fulltext conditions:
[Slide diagrams: Q6–Q9, joining ub:FullProfessor via ub:publicationAuthor to ub:Publication, with fulltext conditions „engineer“/„smith“ on ub:publicationText and ub:fullname, and variables ?title (ub:title), ?name (ub:fullname)]
2.2 Query set (Semantic IR Queries)
Multiple fulltext conditions per query:
[Slide diagrams: Q10–Q11, joining ub:FullProfessor („smith“ on ub:fullname) via ub:publicationAuthor to ub:Publication with „engineer“/„network“ on ub:publicationText]
2.2 Query set (Advanced IR Queries)
All over ub:publicationText:
Q12: „+network +engineer“ (Boolean AND)
Q13: „+network -engineer“ (Boolean NOT)
Q14: „network engineer“~10 (proximity)
Q15: „engineer*“ (multi-character wildcard)
Q16: „engineer?“ (single-character wildcard)
Q17: „engineer“~0.8 (fuzzy)
Q18: „engineer“ with relevance score
Q19: „engineer“ with result snippet
Q20: „network“, top 10 results
Q21: „network“ with score > 0.75
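The Lucene-style `+`/`-` operators of Q12/Q13 mark terms as required or prohibited. A minimal stand-in evaluator over toy texts (not any store's actual query engine):

```python
# Toy publication texts (hypothetical, standing in for ub:publicationText)
PUBLICATIONS = {
    "pub1": "the network engineer configured the router",
    "pub2": "a network administrator wrote the report",
}

def boolean_query(query: str) -> set[str]:
    """Evaluate Lucene-style '+term' (required) and '-term' (prohibited)."""
    required = [t[1:] for t in query.split() if t.startswith("+")]
    prohibited = [t[1:] for t in query.split() if t.startswith("-")]
    hits = set()
    for pid, text in PUBLICATIONS.items():
        words = set(text.split())
        if all(t in words for t in required) and \
           not any(t in words for t in prohibited):
            hits.add(pid)
    return hits

print(boolean_query("+network +engineer"))  # Q12 style: only pub1
print(boolean_query("+network -engineer"))  # Q13 style: only pub2
```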
3. Evaluation
2 GHz AMD Athlon 64-bit dual-core processor
3 GB RAM, RAID 5 array
GNU/Linux, Java™ SE RE 1.6.0_10 with 2 GB memory
Stores: Jena 2.5.6 + TDB, Sesame 2.2.1 NativeStore + LuceneSail, Virtuoso 5.0.9, YARS post beta 3
3.1 Evaluation Methodology
- Evaluated LUBMft(N) with N = {1, 5, 10, 50}
- For each store, for each query:
  - Flush the file system cache
  - Start the store
  - Repeat 6 times:
    - Evaluate the query
    - If evaluation time > 1,000 s, break
  - Stop the store
- The whole procedure was performed 5 times
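The measurement loop above can be sketched as follows (the store interface and query identifiers are hypothetical; cache flushing is stubbed):

```python
import time

QUERIES = ["Q1", "Q2"]   # hypothetical query identifiers
TIMEOUT_S = 1_000        # per-query evaluation cap from the slides
REPEATS = 6              # evaluations per query

def flush_fs_cache():
    # On Linux: sync && echo 3 > /proc/sys/vm/drop_caches (needs root).
    # Stubbed here; the real benchmark drops the page cache between queries.
    pass

def benchmark(store) -> dict[str, list[float]]:
    """One benchmark run: per query, flush caches, start the store,
    evaluate up to REPEATS times, abort when a run exceeds TIMEOUT_S."""
    times: dict[str, list[float]] = {}
    for query in QUERIES:
        flush_fs_cache()
        store.start()                      # hypothetical store interface
        samples = []
        for _ in range(REPEATS):
            start = time.perf_counter()
            store.run(query)               # hypothetical query execution
            elapsed = time.perf_counter() - start
            samples.append(elapsed)
            if elapsed > TIMEOUT_S:
                break                      # give up on overly slow queries
        store.stop()
        times[query] = samples
    return times

# Per the slides, this whole procedure is repeated 5 times per store.
```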
3.2 Evaluation Results
Basic IR Queries
[Slide chart: query times for Q1–Q5 („engineer“, „network“) per store]
3.2 Evaluation Results
Semantic IR Queries
[Slide chart: query times for Q6–Q9 per store]
3.2 Evaluation Results
Semantic IR Queries (continued)
[Slide chart: query times for Q10–Q11 per store]
3.2 Evaluation Results
Advanced IR Queries
- Same relative performance as for the other query sets
- Feature richness: Sesame (10), Jena (9), YARS (5), Virtuoso (1)
4. Conclusion
Identified a strong need for a fulltext benchmark
- For semantic application and RDF store developers
Extended LUBM towards a fulltext benchmark
- Other benchmarks can be extended similarly
RDF stores provide many IR features
- Boolean, phrase, proximity, fuzzy queries
Multiple fulltext conditions within one query remain challenging
5. References
[LSA] Landauer, T.K., McNamara, D.S., Dennis, S., Kintsch, W. (eds.): Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Mahwah, N.J. (2007).
[LUBM] Guo, Y., et al.: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2), 158-182 (2005).
[LuceneSail] Minack, E., et al.: The Sesame LuceneSail: RDF Queries with Full-text Search. Technical Report 2008-1, NEPOMUK (February 2008).
[Sesame] Broekstra, J., et al.: Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 54-68. Springer, Heidelberg (2002).
[Jena] Carroll, J.J., et al.: Jena: Implementing the Semantic Web Recommendations. In: WWW Alternate track papers & posters, pp. 74-83. ACM, New York (2004).
[YARS] Harth, A., Decker, S.: Optimized Index Structures for Querying RDF from the Web. In: Proceedings of the 3rd Latin American Web Congress. IEEE Press, Los Alamitos (2005).