
http://www.mpi-inf.mpg.de/~weikum/

Gerhard Weikum

Harvesting, Searching, and Ranking Knowledge from the Web

Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Fabian Suchanek


Vision

Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base

Approach:

1) harvest and combine

a) hand-crafted knowledge sources (Semantic Web, ontologies)

b) automatic knowledge extraction (Statistical Web, text mining)

c) social communities and human computing (Social Web, Web 2.0)

2) express knowledge queries, search, and rank

3) everything efficient and scalable


Why Google and Wikipedia Are Not Enough

Answer „knowledge queries“ such as:

• German universities with world-class computer scientists
• German Nobel prize winner who survived both world wars and all of his four children
• proteins that inhibit proteases and other human enzymes
• connection between Thomas Mann and Goethe
• politicians who are also scientists


Why Google and Wikipedia Are Not Enough

Which politicians are also scientists?

What is lacking?

• extract facts from Web pages
• capture user intention by concepts, entities, relations

Information is not Knowledge.
Knowledge is not Wisdom.
Wisdom is not Truth.
Truth is not Beauty.
Beauty is not Music.
Music is the best.

(Frank Zappa)


NAGA Example

Query:
$x isa politician
$x isa scientist

Results:
Benjamin Franklin
Paul Wolfowitz
Angela Merkel
…
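The query above is a conjunction of two isa patterns over the same variable. A minimal sketch of how such a conjunction could be answered over a set of triples follows; the fact set, `instances_of` and `answer` are illustrative assumptions, not NAGA's actual data or API.

```python
# Toy fact set (illustrative only; not actual YAGO content).
FACTS = {
    ("Benjamin_Franklin", "isa", "politician"),
    ("Benjamin_Franklin", "isa", "scientist"),
    ("Angela_Merkel", "isa", "politician"),
    ("Angela_Merkel", "isa", "scientist"),
    ("Max_Planck", "isa", "scientist"),
}

def instances_of(cls, facts):
    """Return every x with a fact (x, isa, cls)."""
    return {s for (s, p, o) in facts if p == "isa" and o == cls}

def answer(classes, facts):
    """Conjunctive query: $x isa c must hold for every c in classes."""
    result = None
    for cls in classes:
        found = instances_of(cls, facts)
        result = found if result is None else result & found
    return sorted(result or [])

print(answer(["politician", "scientist"], FACTS))
```

Each pattern yields a candidate set and the shared variable turns the conjunction into a set intersection. This sketch does no ranking.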


Related Work

[Figure: map of related systems arranged around three areas]

Areas:
• semistructured IR & graph search
• Web entity search & QA
• information extraction & ontology building

Systems shown: Banks, TextRunner, DBexplorer, Cyc, Freebase, Cimple, DBlife, UIMA, DBpedia, Yago, Naga, XQ-FT, Libra, SPARQL, Avatar, EntityRank, Powerset, START, TopX, Answers, SWSE, Hakia, Tijah


Outline

Motivation

• Information Extraction & Knowledge Harvesting (YAGO)

• Ranking for Search over Entity-Relation Graphs (NAGA)

• Efficient Query Processing (RDF-3X)

• Conclusion


Information Extraction (IE): Text to Records

Person             BirthDate     BirthPlace   ...
Max Planck         4/23, 1858    Kiel
Albert Einstein    3/14, 1879    Ulm
Mahatma Gandhi     10/2, 1869    Porbandar

Person        ScientificResult
Max Planck    Quantum Theory

Person        Collaborator
Max Planck    Albert Einstein
Max Planck    Niels Bohr

Constant             Value             Dimension
Planck's constant    6.626 × 10^-34    Js

• combine NLP, pattern matching, lexicons, statistical learning
• extracted facts often have confidence < 1 → DB with uncertainty (probabilistic DB)
• expensive and error-prone
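A minimal sketch of the pattern-matching side of IE: one surface pattern turns sentences into records that carry a confidence below 1. The regular expression, the confidence value and the sample text are assumptions for illustration; real extractors combine many patterns with NLP and statistical learning.

```python
import re

# Hypothetical surface pattern and a made-up confidence for it.
BORN_IN = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*) was born in (?P<place>[A-Z]\w+)"
)
PATTERN_CONFIDENCE = 0.9

def extract_born_in(text):
    """Return (person, relation, place, confidence) records found in text."""
    return [
        (m.group("person"), "bornIn", m.group("place"), PATTERN_CONFIDENCE)
        for m in BORN_IN.finditer(text)
    ]

sample = "Max Planck was born in Kiel. Albert Einstein was born in Ulm."
for record in extract_born_in(sample):
    print(record)
```

Storing the confidence next to each record is what makes the resulting table a database with uncertainty rather than a set of hard facts.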


High-Quality Knowledge Sources

General-purpose ontologies and thesauri: WordNet family

scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist
  ...
  => principal investigator, PI
  …
  HAS INSTANCE => Bacon, Roger Bacon
  …

200 000 concepts and relations; can be cast into
• description logics or
• graph, with weights for relation strengths (derived from co-occurrence statistics)


Exploit Hand-Crafted Knowledge

Wikipedia, WordNet, and other lexical sources

{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br>…
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
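A minimal sketch of how infobox attributes could be turned into fact candidates: each `| key = value` line becomes one fact per wiki link in the value. The parsing rules, the function name `infobox_to_facts` and the shortened infobox are assumptions for illustration, not the actual YAGO extractors.

```python
import re

INFOBOX = """{{Infobox_Scientist
| name = Max Planck
| birth_place = [[Kiel]], [[Germany]]
| nationality = [[Germany|German]]
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
}}"""

# Captures the link target of [[target]] or [[target|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def infobox_to_facts(wikitext):
    """Return (subject, attribute, target) triples from '| key = value' lines."""
    fields = {}
    for line in wikitext.splitlines():
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    subject = fields.pop("name").replace(" ", "_")
    facts = []
    for attribute, value in fields.items():
        targets = LINK.findall(value) or [value]
        facts.extend((subject, attribute, t.replace(" ", "_")) for t in targets)
    return facts

for fact in infobox_to_facts(INFOBOX):
    print(fact)
```

Using the link target rather than the anchor text ("Germany" instead of "German") gives a canonical entity name for free.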


Exploit Hand-Crafted Knowledge

Wikipedia, WordNet, and other lexical sources


YAGO: Yet Another Great Ontology
[F. Suchanek, G. Kasneci, G. Weikum: WWW'07]

• Turn Wikipedia into explicit knowledge base (semantic DB); keep source pages as witnesses

• Exploit hand-crafted categories and infobox templates

• Represent facts as explicit knowledge triples: relation (entity1, entity2)
  (in FOL, compatible with RDF, OWL-lite, XML, etc.)

• Map (and disambiguate) relations into WordNet concept DAG

Examples:

entity1       relation        entity2
Max_Planck    bornIn          Kiel
Kiel          isInstanceOf    City


YAGO Knowledge Base [F. Suchanek et al.: WWW’07]

[Figure: excerpt of the YAGO graph in three layers: concepts, individuals, words]

concepts:
  Person subclass Entity
  Location subclass Entity
  Scientist subclass Person
  Physicist subclass Scientist
  Biologist subclass Scientist
  City subclass Location
  Country subclass Location

individuals:
  Max_Planck instanceOf Physicist
  Kiel instanceOf City
  Max_Planck bornIn Kiel
  Max_Planck bornOn April 23, 1858
  Max_Planck diedOn October 4, 1947
  Max_Planck hasWon Nobel Prize
  Max_Planck FatherOf Erwin_Planck

words:
  “Max Planck” means Max_Planck
  “Dr. Planck” means Max_Planck
  “Max Karl Ernst Ludwig Planck” means Max_Planck

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

Accuracy: 95%

              Entities     Facts
KnowItAll                  30 000
SUMO          20 000       60 000
WordNet       120 000      80 000
Cyc           300 000      5 Mio.
TextRunner    n/a          8 Mio.
YAGO          1.7 Mio.     15 Mio.
DBpedia       1.9 Mio.     103 Mio.
Freebase      ???          ???


Wikipedia Harvesting: Difficulties & Solutions

• instanceOf relation: misleading and difficult category names
  („disputed articles“, „particle physics“, „American Music of the 20th Century“, „Nobel laureates in physics“, „naturalized citizens of the United States“, …)
  → noun group parser: ignore when head word in singular

• isA relation: mapping categories onto WordNet classes:
  „Nobel laureates in physics“ → Nobel_laureates, „people from Kiel“ → person
  → map to (singular of) head; exploit synsets and statistics

• Entity name ambiguities: „St. Petersburg“, „Saint Petersburg“, „M31“, „NGC224“ means ...
  → exploit Wikipedia redirects & disambiguations, WN synsets

• type checking for scrutinizing candidates:
  accept fact candidate only if arguments have proper classes
  Example: marriedTo (Max Planck, quantum physics) is rejected, because marriedTo expects arguments of type Person × Person
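A minimal sketch of the type-checking step: a fact candidate is accepted only if both arguments belong to the classes that the relation expects. The signature table, the entity types and the function name `type_check` are toy assumptions for illustration.

```python
# Hypothetical relation signatures: relation -> (domain class, range class).
SIGNATURES = {
    "marriedTo": ("person", "person"),
    "bornIn": ("person", "city"),
}

# Hypothetical entity types, as they might come from the class hierarchy.
ENTITY_TYPES = {
    "Max_Planck": {"person", "physicist"},
    "Kiel": {"city"},
    "quantum_physics": {"field"},
}

def type_check(relation, arg1, arg2):
    """Accept a candidate only if both arguments have the expected class."""
    domain, range_ = SIGNATURES[relation]
    return (domain in ENTITY_TYPES.get(arg1, set())
            and range_ in ENTITY_TYPES.get(arg2, set()))

print(type_check("marriedTo", "Max_Planck", "quantum_physics"))  # rejected
print(type_check("bornIn", "Max_Planck", "Kiel"))                # accepted
```

Unknown entities have no types in this sketch and are therefore always rejected; a real system might treat them more leniently.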


Higher-Order Facts in YAGO

[Figure: facts with temporal validity]

Berlin CapitalOf Germany, validIn 1990-2008
Bonn CapitalOf Germany, validIn 1949-1989
Arnold Schwarzenegger instanceOf Actor, validIn 1987-2008
Arnold Schwarzenegger instanceOf Politician, validIn 2003-2008

facts about facts represented by reification as first-order facts:

e314159: Berlin CapitalOf Germany
e314159 validIn 1990-2008
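A minimal sketch of reification: every fact gets an identifier, and a fact about a fact simply uses that identifier as its subject. The storage layout and the helper names `add_fact` and `capital_in` are assumptions for illustration, not YAGO's actual data model.

```python
from itertools import count

_ids = count(1)
facts = {}  # fact id -> (subject, relation, object)

def add_fact(subject, relation, obj):
    """Store a fact and return its identifier, usable as a subject later."""
    fact_id = f"e{next(_ids)}"
    facts[fact_id] = (subject, relation, obj)
    return fact_id

berlin = add_fact("Berlin", "capitalOf", "Germany")
add_fact(berlin, "validIn", "1990-2008")
bonn = add_fact("Bonn", "capitalOf", "Germany")
add_fact(bonn, "validIn", "1949-1989")

def capital_in(country, year):
    """Find a capitalOf fact whose validIn interval contains the year."""
    for fact_id, (s, r, o) in facts.items():
        if r != "capitalOf" or o != country:
            continue
        for (s2, r2, o2) in facts.values():
            if s2 == fact_id and r2 == "validIn":
                start, end = map(int, o2.split("-"))
                if start <= year <= end:
                    return s
    return None

print(capital_in("Germany", 1988))
```

Because the higher-order fact is stored like any other triple, the same query machinery can be used for both.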


Ongoing Work: YAGO for Easier IE

YAGO knows (almost) all (interesting) entities
→ leverage for discovering & extracting new facts in NL texts

• can filter out many uninteresting sentences
• can quickly identify relation arguments
• can eliminate many fact candidates by type checking
• can focus on specific properties like time

Known fact: Seine runsThrough Paris (Seine isa river, Paris isa city)

Example sentences:
• Cologne lies on the banks of the Rhine
• People in Cairo like wine from the Rhine valley
• The city of Paris was founded on an island in the Seine in 300 BC

[Figure: dependency parses of the example sentences, with link labels (Ss, MVp, Mp, Js, Jp, Os, ...) and phrase tags (NP, VP, PP); background graph with isa and locatedIn edges involving France and Europe]

IE with dependency parser is expensive!


Outline

Motivation

Information Extraction & Knowledge Harvesting (YAGO)

• Ranking for Search over Entity-Relation Graphs (NAGA)

• Efficient Query Processing (RDF-3X)

• Conclusion


NAGA: Graph Search [G. Kasneci et al.: ICDE‘08]

Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness

discovery queries:
  $x isa politician
  $x isa scientist

connectedness queries:
  Thomas Mann * Goethe, combined with an isa edge to German novelist

complex queries (with regular expressions):
  $x isa scientist
  $x wonPrize $p
  $p inField computer science
  $x worksAt | graduatedFrom $u
  $u isa university
  $u locatedIn* Germany

queries over reified facts:
  ($c capitalOf Germany) validIn 1988
  $c isa city
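A minimal sketch of one regular-expression edge, the `locatedIn*` pattern: the target must be reachable over zero or more locatedIn edges. The edge list and the function name `reaches` are assumptions for illustration; NAGA's actual query processor is not shown in the slides.

```python
from collections import deque

# Toy edges (illustrative only).
EDGES = [
    ("Saarland_University", "locatedIn", "Saarbruecken"),
    ("Saarbruecken", "locatedIn", "Saarland"),
    ("Saarland", "locatedIn", "Germany"),
    ("Sorbonne", "locatedIn", "Paris"),
    ("Paris", "locatedIn", "France"),
]

def reaches(start, relation, target, edges):
    """True if target is reachable from start via relation* (zero or more hops)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for (s, p, o) in edges:
            if s == node and p == relation and o not in seen:
                seen.add(o)
                queue.append(o)
    return False

print(reaches("Saarland_University", "locatedIn", "Germany", EDGES))
```

The breadth-first search keeps a visited set, so cycles in the graph do not cause endless loops.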


Search Results Without Ranking

q: Fisher isa scientist
   Fisher isa $x

mathematician_109635652 —subClassOf—> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge —subClassOf—> alumnus_109165182
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher —type—> 20th_century_mathematicians
"scientist" —means—> scientist_109871938

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = alumnus_109165182

$@Fisher = Irving_Fisher   $@scientist = scientist_109871938   $X = social_scientist_109927304

$@Fisher = James_Fisher    $@scientist = scientist_109871938   $X = ornithologist_109711173

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = theorist_110008610

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = colleague_109301221

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = organism_100003226
…


Ranking with Statistical Language Model

q: Fisher isa scientist
   Fisher isa $x

Score: 7.184462521168058E-13
mathematician_109635652 —subClassOf—> scientist_109871938
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> 20th_century_mathematicians
"scientist" —means—> scientist_109871938
20th_century_mathematicians —subClassOf—> mathematician_109635652

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = mathematician_109635652

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = statistician_109958989

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = president_109787431

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = geneticist_109475749

$@Fisher = Ronald_Fisher   $@scientist = scientist_109871938   $X = scientist_109871938
…

statistical language model for result graphs

Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/


Ranking Factors

Confidence: Prefer results that are likely to be correct
• Certainty of IE
• Authenticity and Authority of Sources
Example:
  bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“ (Wikipedia)
  livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars“ (Martian Bloggeria)

Informativeness: Prefer results that are likely important; may prefer results that are likely new to user
• Frequency in answer
• Frequency in corpus (e.g. Web)
• Frequency in query log
Example:
  q: isa (Einstein, $y)      isa (Einstein, scientist), isa (Einstein, vegetarian)
  q: isa ($x, vegetarian)    isa (Einstein, vegetarian), isa (Al Nobody, vegetarian)

Compactness: Prefer results that are tightly connected
• Size of answer graph
[Figure: example graph with nodes Einstein, Bohr, Nobel Prize, vegetarian, Tom Cruise, 1962 and edges labeled isa, isa, bornIn, diedIn, won, won]


NAGA Ranking Model

Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs

For query q with fact templates q1 … qn (Ex.: bornIn ($x, Frankfurt))
rank result graphs g with facts g1 … gn (Ex.: bornIn (Goethe, Frankfurt))
by decreasing likelihoods, using a generative mixture model:

$$P[q \mid g] = \prod_{i=1}^{n} \Big( (1-\alpha)\, P[q_i \mid g_i] + \alpha\, P[q_i] \Big)$$

• P[q_i | g_i]: reflects informativeness
• P[q_i]: background model; weights subqueries
  Ex.: bornIn ($x, Germany) & wonAward ($x, Nobel)
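A minimal sketch of scoring with such a mixture model, assuming the likelihood has the form P[q|g] = ∏ ((1−α)·P[q_i|g_i] + α·P[q_i]). The mixture weight, the probabilities and the function name `log_likelihood` are made-up assumptions; the sum of logs is used only to avoid numeric underflow and ranks results the same way as the product.

```python
import math

def log_likelihood(fact_probs, background_probs, alpha=0.5):
    """log P[q|g] = sum_i log((1 - alpha) * P[q_i|g_i] + alpha * P[q_i])."""
    return sum(
        math.log((1 - alpha) * p_fact + alpha * p_background)
        for p_fact, p_background in zip(fact_probs, background_probs)
    )

# Two candidate result graphs for a query with two fact templates.
# All numbers are hypothetical.
background = [0.01, 0.02]
candidates = {
    "g_informative": [0.9, 0.8],
    "g_obscure": [0.1, 0.05],
}
ranking = sorted(
    candidates,
    key=lambda g: log_likelihood(candidates[g], background),
    reverse=True,
)
print(ranking)
```

The background term keeps the score finite even when a per-fact estimate is zero, which is the usual role of smoothing in language-model ranking.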


NAGA Ranking Model: Informativeness

Estimate P[qi | gi] for qi = (x*, r, z) with var x* (analogously for other cases):

$$P(x \mid r, z) = \frac{P(x, r, z)}{P(r, z)} = \frac{P(x, r, z)}{\sum_{x'} P(x', r, z)}$$

Ex.: bornIn ($x, Frankfurt)
  bornIn (GW, Frankfurt)
  bornIn (Goethe, Frankfurt)

Ex.: isa (Einstein, $z)
  isa (Einstein, physicist)
  isa (Einstein, vegetarian)

Estimate on knowledge graph:
[Figure: Albert Einstein with isa edges to vegetarian and physicist]

Estimate on Web (exploit redundancy):
freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
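A minimal sketch of the frequency-based estimate, assuming P(x | r, z) = freq(x, r, z) / Σ_x' freq(x', r, z) and the analogous form for an unbound object. The counts are invented for illustration and the function names `p_subject` and `p_object` are hypothetical.

```python
from collections import Counter

# Hypothetical occurrence counts, e.g. from a Web corpus (not real statistics).
FREQ = Counter({
    ("Einstein", "isa", "physicist"): 900,
    ("Einstein", "isa", "vegetarian"): 100,
    ("Goethe", "bornIn", "Frankfurt"): 950,
    ("GW", "bornIn", "Frankfurt"): 50,
})

def p_subject(x, r, z, freq):
    """P(x | r, z): share of x among all subjects seen with (r, z)."""
    total = sum(c for (s, p, o), c in freq.items() if p == r and o == z)
    return freq[(x, r, z)] / total if total else 0.0

def p_object(x, r, z, freq):
    """P(z | x, r): share of z among all objects seen with (x, r)."""
    total = sum(c for (s, p, o), c in freq.items() if s == x and p == r)
    return freq[(x, r, z)] / total if total else 0.0

print(p_subject("Goethe", "bornIn", "Frankfurt", FREQ))
print(p_object("Einstein", "isa", "physicist", FREQ))
```

With these toy counts, isa (Einstein, physicist) is judged more informative than isa (Einstein, vegetarian) for the query isa (Einstein, $z).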


NAGA Example

Query:
$x isa politician
$x isa scientist

Results:
Benjamin Franklin
Paul Wolfowitz
Angela Merkel
…


User Study for Quality Assessment (1)

Benchmark:
• 55 queries from TREC QA 2005/2006
  Examples: 1) In what country is Luxor? 2) Discoveries of the 20th Century?
• 12 queries from work on SphereSearch
  Examples: 1) In which movies did a governor act? 2) Firstname of politician Rice?
• 18 regular expression queries by us
  Example: What do Albert Einstein and Niels Bohr have in common?

Competitors: NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)


User Study for Quality Assessment (2)

Benchmark      # Q   # A    Metric   Google    Yahoo! Answers   START     BANKS scoring   NAGA
TREC QA        55    1098   NDCG     75.88%    26.15%           75.38%    87.93%          92.75%
                            P@1      67.81%    17.20%           73.23%    69.54%          84.40%
SphereSearch   12    343    NDCG     38.22%    17.23%           2.87%     88.82%          91.01%
                            P@1      19.38%    6.15%            2.87%     84.28%          84.94%
Own            18    418    NDCG     54.09%    17.98%           13.35%    85.59%          91.33%
                            P@1      27.95%    6.57%            13.57%    76.54%          86.56%

Quality Measures:
• Precision@1
• NDCG: normalized discounted cumulative gain, based on ratings highly relevant (2), somewhat relevant (1), irrelevant (0)
with Wilson confidence intervals at α = 0.95
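A minimal sketch of the two quality measures on graded ratings (2, 1, 0). The slides do not state the exact DCG variant, so the log2 rank discount with linear gain, and the choice to count any rating above 0 as relevant for Precision@1, are assumptions.

```python
import math

def dcg(ratings):
    """Discounted cumulative gain with a log2 rank discount (assumed variant)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))

def ndcg(ratings):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

def precision_at_1(ratings):
    """1.0 if the top result is rated relevant (rating > 0), else 0.0."""
    return 1.0 if ratings and ratings[0] > 0 else 0.0

# Ratings of one result list in ranked order (hypothetical judgments).
ratings = [1, 2, 0, 0]
print(ndcg(ratings), precision_at_1(ratings))
```

NDCG rewards placing highly relevant results early, while Precision@1 looks only at the first result, which is why the two numbers can diverge in the table.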


Outline

Motivation

Information Extraction & Knowledge Harvesting (YAGO)

Ranking for Search over Entity-Relation Graphs (NAGA)

• Efficient Query Processing (RDF-3X)

• Conclusion


Why RDF? Why a New Engine?

[Figure: RDF graph around Marie Curie. Nodes: Marie Curie, Maria Sklodowska, Pierre Curie, Henri Becquerel, U Paris, Warsaw, Poland, Nobel Prize Physics, Nobel Prize Chemistry, 1867, 1934, 1852, 1908. Edge labels: bornOn, diedOn, bornAs, bornIn, inCountry, marriedTo, AlmaMater, advisor, wonAward]

• RDF triples (subject – property/predicate – value/object):
  (id1, Name, „Marie Curie“), (id1, bornAs, „Maria Sklodowska“), (id1, bornOn, 1867), (id1, bornIn, id2), (id2, Name, „Warsaw“), (id2, locatedIn, id3), (id3, Name, „Poland“), (id1, marriedTo, id4), (id4, Name, „Pierre Curie“), (id1, wonAward, id5), (id4, wonAward, id5), …
• pay-as-you-go: schema-agnostic or schema later
• RDF triples form fine-grained (ER) graph
• queries bound to need many star-joins and long chain-joins
• physical design critical, but hardly predictable workload


SPARQL Query Language

SPJ combinations of triple patterns

Ex.: Select ?c Where {
       ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a .
       ?t inCountry ?c . ?a Name NobelPrize }

options for filter predicates, duplicate handling, wildcard join, etc.

Ex.: Select Distinct ?c Where {
       ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> .
       ?p bornOn ?b . Filter (?b > 1945) }

support for RDFS: types
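A minimal sketch of how a conjunction of triple patterns can be evaluated: each pattern extends the current variable bindings, and shared variables act as join conditions. This is a naive nested-loop evaluator written for illustration, not how a SPARQL engine is implemented; the triples and the helper names `match` and `evaluate` are assumptions.

```python
TRIPLES = [
    ("Marie_Curie", "isa", "scientist"),
    ("Marie_Curie", "bornIn", "Warsaw"),
    ("Marie_Curie", "hasWon", "prize1"),
    ("Warsaw", "inCountry", "Poland"),
    ("prize1", "Name", "NobelPrize"),
    ("Some_Scientist", "isa", "scientist"),
    ("Some_Scientist", "bornIn", "Warsaw"),
]

def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None      # constant does not match
    return new

def evaluate(patterns, triples):
    """Return all bindings that satisfy every pattern (naive nested loops)."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new = match(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings

QUERY = [
    ("?p", "isa", "scientist"), ("?p", "bornIn", "?t"), ("?p", "hasWon", "?a"),
    ("?t", "inCountry", "?c"), ("?a", "Name", "NobelPrize"),
]
print(sorted({b["?c"] for b in evaluate(QUERY, TRIPLES)}))
```

The order in which patterns are processed does not change the result here, but it strongly affects the size of the intermediate binding lists, which is what join-order optimization addresses.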


RDF & SPARQL Engines

choice of physical design is crucial

• giant triples table
  S     P         O
  id1   Name      Marie Curie
  id1   bornOn    1867
  id1   bornIn    id2
  id2   Name      Warsaw
  id2   Country   id11
  id1   Advisor   id5
  …     …         …
  Systems: SESAME / OpenRDF, YARS2 (DERI)

• (vert. partitioned) property tables
  bornOn
  S     O
  id1   1867
  id5   1852
  …     …
  Advisor
  S     O
  id1   id5
  …     …
  Systems: column stores C-Store (MIT), MonetDB (CWI), + materialized views

• clustered property tables (+ leftover table)
  Person
  S     Name      bornOn   bornIn   …
  id1   Marie C   1867     id2
  id2   Henri B   1852     id9
  …     …         …
  Town
  S     Name     Country
  id2   Warsaw   id11
  …     …        …
  Systems: Jena (HP Labs), Oracle RDF_MATCH, + physical design wizard!


RDF-3X: a RISC-style Engine
[T. Neumann, G. Weikum: VLDB 2008]

Design rationale:
• RDF-specific engine (not an RXORDBMS)
• Simplify operations
• Reduce implementation choices
• Optimize for common case
• Eliminate tuning knobs

Key principles:
• Mapping dictionary for encoding all literals into ids
• Exhaustive indexing of id triples
• Index-only store, high compression
• QP mostly merge joins with order-preservation
• Very fast DP-based query optimizer
• Frequent-paths synopses, property-value histograms
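A minimal sketch of the first principle, a mapping dictionary that encodes literals into integer ids so that triples become compact id triples. The class name and its methods are hypothetical and say nothing about RDF-3X's actual on-disk structures.

```python
class MappingDictionary:
    """Two-way mapping between literals and dense integer ids."""

    def __init__(self):
        self._to_id = {}        # literal -> id
        self._to_literal = []   # id -> literal

    def encode(self, literal):
        """Return the id of a literal, assigning a new one if needed."""
        if literal not in self._to_id:
            self._to_id[literal] = len(self._to_literal)
            self._to_literal.append(literal)
        return self._to_id[literal]

    def decode(self, ident):
        """Return the literal that belongs to an id."""
        return self._to_literal[ident]

dictionary = MappingDictionary()
triples = [("Marie Curie", "bornIn", "Warsaw"), ("Warsaw", "inCountry", "Poland")]
id_triples = [tuple(dictionary.encode(term) for term in t) for t in triples]
print(id_triples)
```

After encoding, joins and comparisons work on small integers, and strings are only looked up again when results are presented.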


RDF-3X Indexing

index all collation orders of subject-property-object id triples: SPO, SOP, OSP, OPS, PSO, POS
• directly stored in clustered B+ trees
• high compression: indexes < original data
• can choose any order for scan & join

additionally index count-aggregated projections in all orders: SP, SO, OS, OP, PS, PO, with counter for each entry
• enables efficient bookkeeping for duplicates
• also index projections S, P, O with count-aggregation

also need two mapping indexes: literal → id, id → literal
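A minimal sketch of exhaustive indexing: one sorted list per collation order stands in for a clustered B+ tree, a prefix scan stands in for a range scan, and a counter stands in for a count-aggregated projection. Compression is left out, and the function names are assumptions for illustration.

```python
import bisect
from collections import Counter
from itertools import permutations

def build_indexes(id_triples):
    """Return {order: sorted list of reordered triples} for all six orders."""
    indexes = {}
    for order in permutations("SPO"):
        positions = ["SPO".index(c) for c in order]
        indexes["".join(order)] = sorted(
            tuple(t[i] for i in positions) for t in id_triples
        )
    return indexes

def prefix_scan(index, prefix):
    """Return all entries of a sorted index that start with prefix."""
    start = bisect.bisect_left(index, prefix)
    result = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        result.append(entry)
    return result

def aggregated(id_triples, order):
    """Count-aggregated projection, e.g. order 'PO' maps (p, o) to a count."""
    positions = ["SPO".index(c) for c in order]
    return Counter(tuple(t[i] for i in positions) for t in id_triples)

ID_TRIPLES = [(1, 10, 2), (1, 11, 3), (4, 10, 2)]
INDEXES = build_indexes(ID_TRIPLES)
print(prefix_scan(INDEXES["POS"], (10, 2)))
```

With all six orders available, any triple pattern with bound terms can be answered by a single prefix scan, and the results arrive sorted, which suits merge joins.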


RDF-3X Query Optimization

Principles:
• optimizing join orders is key (star joins, long join chains)
• should exploit exhaustive indexes and order-preservation
• support merge-joins and hash-joins

Bottom-up dynamic programming for exhaustive plan enumeration (< 100ms for 20 joins)

Cost model based on selectivity estimation from
• histograms for each of the 6 SPO orderings (approx. equi-depth)
• frequent join paths (property sequences) for stars and chains

Example query:
[Figure: chain of variables ?x1 … ?x6 connected by properties p1 … p5, with attribute-value pairs (a1, v1), (a4, v4), (a6, v6) attached]
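A minimal sketch of bottom-up dynamic programming over join orders. The cost model (sum of estimated intermediate result sizes, independent selectivities, no cross products) is a textbook simplification chosen for illustration and is not RDF-3X's cost model; the cardinalities, selectivities and the function name `best_plan` are made up.

```python
from itertools import combinations

def best_plan(cardinality, selectivity):
    """Return (cost, size, plan) for joining all inputs.

    cardinality: {name: number of rows}
    selectivity: {frozenset({a, b}): factor} for each joinable pair
    Assumes the join graph is connected.
    """
    names = sorted(cardinality)
    # best[subset] = (cost so far, estimated size, plan tree)
    best = {frozenset([n]): (0.0, float(cardinality[n]), n) for n in names}
    for k in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, k)):
            members = sorted(subset)
            for split in range(1, k):
                for left in map(frozenset, combinations(members, split)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue
                    factors = [selectivity[frozenset((a, b))]
                               for a in left for b in right
                               if frozenset((a, b)) in selectivity]
                    if not factors:
                        continue  # no join predicate: skip cross products
                    size = best[left][1] * best[right][1]
                    for factor in factors:
                        size *= factor
                    cost = best[left][0] + best[right][0] + size
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, size, (best[left][2], best[right][2]))
    return best[frozenset(names)]

CARD = {"A": 1000, "B": 10, "C": 1000}
SEL = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.001}
print(best_plan(CARD, SEL))
```

For the chain A, B, C the plan that joins B with C first is cheaper (estimated cost about 110) than joining A with B first (about 200), because its intermediate result is smaller.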


Experimental Evaluation: Setup

Setup and competitors:
2GHz dual core, 2 GB RAM, 30MB/s disk, Linux
• column-store property tables by Abadi et al., using MonetDB
• triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:
1) Barton library catalog: 51 Mio. triples (4.1 GB)
2) YAGO knowledge base: 40 Mio. triples (3.1 GB)
3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries (7 or 8 per dataset) in the spirit of:
1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.
2) scientist from Poland with French advisor who both won awards
3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {
  ?b hasTitle ?t .
  ?u romance ?b . ?u love ?b . ?u mystery ?b . ?u suspense ?b .
  ?u crimeNovel ?c . ?u hasFriend ?f . ?f ... }


Experimental Evaluation: Results

DB sizes [GB]:

              Barton     Yago       LibThing
RDF-3X        2.8        2.7        1.6
MonetDB       1.6-2.0    1.1-2.4    0.7-6.9
PostgreSQL    8.7        7.5        5.7

DB load times [min]:

              Barton     Yago       LibThing
RDF-3X        13         25         20
MonetDB       11         21         4
PostgreSQL    30         25         20

Geometric means for query run-times [sec] for warm (cold) cache:

              Barton          Yago           LibThing
RDF-3X        0.4 (5.9)       0.04 (0.7)     0.13 (0.89)
MonetDB       3.8 (26.4)      54.6 (78.2)    4.39 (8.16)
PostgreSQL    64.3 (167.8)    0.56 (10.6)    30.4 (93.9)
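A minimal sketch of the aggregate used for the run-time table: the geometric mean of per-query run times. The sample run times are invented, since the per-query numbers are not in the slides.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed via logs for numeric stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical run times in seconds for one system on one dataset.
run_times = [0.1, 0.2, 0.4, 12.8]
print(round(geometric_mean(run_times), 3))
print(round(sum(run_times) / len(run_times), 3))  # arithmetic mean for contrast
```

Compared with the arithmetic mean, the geometric mean is less dominated by a single slow query, which makes it a common choice for summarizing benchmark suites.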


Outline

Motivation

Information Extraction & Knowledge Harvesting (YAGO)

Ranking for Search over Entity-Relation Graphs (NAGA)

Efficient Query Processing (RDF-3X)

• Conclusion


Summary & Outlook

lift world's best information sources (Wikipedia, Web, Web 2.0) to the level of explicit knowledge (ER-oriented facts)

1) building knowledge graphs: combine semantic & statistical & social IE sources (for scholarly Web, digital libraries, enterprise know-how);
   challenges in consistency vs. uncertainty, long-term evolution

2) heterogeneity & uncertain IE necessitate ranking → new ranking models (e.g. statistical LM for graphs)

3) efficiency and scalability challenges for search & ranking (top-k queries) and updates


Thank You !