85
Expanding the YAGO knowledge base Rebele The YAGO knowledge base Using YAGO for the Humanities Adding Words to Regexes Answering Queries with Unix Shell Conclusion 1/50 Expanding the YAGO knowledge base Thomas Rebele Télécom ParisTech 2018-07-05

Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

1/50

Expanding the YAGO knowledge base

Thomas Rebele

Télécom ParisTech

2018-07-05

Page 2: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is a knowledge base?

2/50

Albert EinsteinMileva Maric

married

Alfred Kleinerhas advisor

Nobel Prize in Physics

won prize

Applications of knowledge basesI question answeringI semantic searchI text analysisI machine translation

Page 3: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is a knowledge base?

2/50

Albert EinsteinMileva Maric

marriedAlfred Kleinerhas advisor

Nobel Prize in Physics

won prize

Applications of knowledge basesI question answeringI semantic searchI text analysisI machine translation

Page 4: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is a knowledge base?

2/50

Albert EinsteinMileva Maric

marriedAlfred Kleinerhas advisor

Nobel Prize in Physics

won prize

Applications of knowledge basesI question answeringI semantic searchI text analysisI machine translation

Page 5: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is a knowledge base?

2/50

Albert EinsteinMileva Maric

marriedAlfred Kleinerhas advisor

Nobel Prize in Physics

won prize

Applications of knowledge basesI question answeringI semantic searchI text analysisI machine translation

Page 6: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is YAGO?

3/50

I knowledge base with 10 million entities and >210 million factsI automatically extracted from Wikipedia, Wordnet, and GeonamesI multilingual facts from 10 languagesI focus on precisionI developed by Max-Planck Institute for Informatics and Télécom

ParisTech

Page 7: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is YAGO?

4/50

I I joined the project in 2015I coordinated / contributed to the evaluationI maintenance, participating in open source releaseI development

Page 8: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

What is YAGO?

5/50

<Albert_Einstein>

Mileva

mar

ried

Alfred Kleiner

advi

sor

Nobel Prize

won

Page 9: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge baseWhat is a knowledgebase?

What is YAGO?

Accuracy

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Accuracy

6/50

Figure: Screenshot of evaluation result

I 2 months evaluation, 15 participantsI evaluated 4412 facts of 76 relations (with 60m total facts)I 98% facts of the sample were correctI Wilson center: 95%, interval width: 4.2%

Now that we have this knowledge base, what can we do with it?

Page 10: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Related Work

7/50

Similar studies using Semantic Web for Digital Humanities

I [Schich et al., 2014]: about 150,000 peopleI [de la Croix et al., 2015]: about 300,000 peopleI [Gergaud et al., 2017]: about 1,100,000 people

These studies are only about few people. Can we do better with YAGO?

YAGO has 2,200,000 people, but, e.g., locations only for 700,000 peopleHow can we make YAGO more complete?

Page 11: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Related Work

7/50

Similar studies using Semantic Web for Digital Humanities

I [Schich et al., 2014]: about 150,000 peopleI [de la Croix et al., 2015]: about 300,000 peopleI [Gergaud et al., 2017]: about 1,100,000 people

These studies are only about few people. Can we do better with YAGO?

YAGO has 2,200,000 people, but, e.g., locations only for 700,000 peopleHow can we make YAGO more complete?

Page 12: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Birth and Death Dates

8/50

Previous algorithm:

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

Extracted birth dates

428

- 427

- 42#

bornOn

- 427

- 42#

Page 13: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Birth and Death Dates

8/50

Previous algorithm:

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

Extracted birth dates

428

bornOn- 427

- 42#

Page 14: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Birth and Death Dates

9/50

New algorithm: filtering with category dates

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

428

- 427

- 42#

bornOn (infobox)

bornOn (category)

?

?

"

Page 15: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Birth and Death Dates

9/50

New algorithm: filtering with category dates

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

428

- 427

- 42#

bornOn (infobox)

bornOn (category)

?

?

"

Page 16: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Birth and Death Dates

9/50

New algorithm: filtering with category dates

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

428

- 427

- 42#

bornOn (infobox)

bornOn (category)

?

?

"

Page 17: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Birth and Death Dates

9/50

New algorithm: filtering with category dates

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

428

- 427

- 42#

bornOn (infobox)

bornOn (category)

??

"

Page 18: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Place of residence

10/50

Extract mapping from demonyms / adjectives to locations

”Austrian“ Austria

”Greek“ Greece

Take most frequent location as place of residence

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

Greece: 2Austria: 1

Greece

location withmax. occurence

Page 19: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Place of residence

11/50

Caveat: only take outermost text spans

”Holy Roman Empire“ → <Holy_Roman_Empire>”Roman Empire“ → <Roman_Empire>

Page 20: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

12/50

Previous algorithm:

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

Extracted gender

malegender

Page 21: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

12/50

Previous algorithm:

Languages

EnglishGermanFrench

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

Extracted gender

malegender

Page 22: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

13/50

New algorithm:

Languages

EnglishGermanFrench

Albert Einstein

Albert Einstein was aphysicist. Einstein de-veloped the theory ofrelativity.

Albert Einstein

Categories: Male scientist | Swissphysicists

Extracted gender

malegender

Page 23: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

13/50

New algorithm:

Languages

EnglishGermanFrench

Albert Einstein

Albert Einstein was aphysicist. Einstein de-veloped the theory ofrelativity.

Albert Einstein

Categories: Male scientist | Swissphysicists

Extracted gender

malegender

Page 24: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

14/50

Albert Schweitzer

Albert Schweitzer was aFrench-German writer,and philosopher.

Albert Schweitzer

Categories: Male writer

Albert Camus

Albert Camus was aFrench philosopher, au-thor, and journalist.

Albert Camus

Categories: Male philosopher

Albert Einstein

Albert Einstein was aphysicist. Einstein de-veloped the theory ofrelativity.

Albert Einstein

Categories: Male scientist | Swissphysicists

”Albert“ male

”Francesca“ female

”Kathleen“ female

in total:1206 first names

Page 25: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

14/50

Albert Schweitzer

Albert Schweitzer was aFrench-German writer,and philosopher.

Albert Schweitzer

Categories: Male writer

Albert Camus

Albert Camus was aFrench philosopher, au-thor, and journalist.

Albert Camus

Categories: Male philosopher

Albert Einstein

Albert Einstein was aphysicist. Einstein de-veloped the theory ofrelativity.

Albert Einstein

Categories: Male scientist | Swissphysicists

”Albert“ male

”Francesca“ female

”Kathleen“ female

in total:1206 first names

Page 26: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Gender

15/50

Prioritize extracted facts

Albert Einstein

Albert Einstein was aphysicist. Einstein de-veloped the theory ofrelativity.

Albert Einstein

Categories: Male scientist | Swissphysicists

1. extract gender by category

Albert Camus

Albert Camus was aFrench philosopher, au-thor, and journalist.

Albert Camus

Categories: Male philosopher

2. extract gender by first name

Plato

Plato was a philoso-pher. He founded theAcademy in Athens. Helaid the foundation forphilosophy.

Plato

Birth 428 or 427 BCDeath 348 BC

Categories: 420s BC births | 340s BCdeaths | Greek philosoph | Greekmale wrestler | Austrian writer

3. extract gender by pronoun

Page 27: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Using YAGO for the Humanities: Evaluation

16/50

I compare extraction process on Wikipedia dump from 2017-02-20I extracted on 11 languagesI evaluate precision based on a sample of 100 people

Extraction YAGObefore

Recall YAGOnow

Recall Precision DBpedia (en)

Birth dates 1.6m 69% 1.7m 74% (+8%) 100% 0.8m

Death dates 0.7m 33% 0.8m 36% (+10%) 100% 0.3m

Place ofresidence 0.7m 30% 2.1m 91% (+201%) 97% (*) 0.7m

Gender 1.5m 64% 2.0m 87% (+35%) 98% 4k

Table: Coverage and precision of our methods.Recall relative to total number of people in YAGO (2.2m).

m million k thousand(*) 6% of anachronistic residencies (e.g., German Empire instead of Germany)

Page 28: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Life expectancy over time

17/50

100 500 1000 1500 1900

45

50

55

60

65

70

75

80

85

Year

Med

ian

age

malefemale

Figure: Median age over time, by year of birth(with the Student’s t confidence interval at α = 95%).

Page 29: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Life expectancy over time

18/50

1100 1200 1300 1400 1500 1600 1700 1800 1900

50

55

60

65

70

75

80

Year

Med

ian

age

Great BritainItalyIndiaChina

Figure: Median age over time, by year of birth(with the Student’s t confidence interval at α = 95%).

Page 30: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Births per month

19/50

0 2 4 6 8 10 12

7.5%

8%

8.5%

9%

Month

Rel

ativ

ebi

rths

YAGO - allNational Center for Health Statistics

Figure: Births per month in the United States between 2003 and 2015(with the Student’s t confidence interval at α = 95%).

Page 31: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Births per month

20/50

Possible explanation:

Languages

EnglishEuskara

Relative age effect

The relative age effectdescribes a bias. Peopleborn early in the selec-tion period of sports oracademia are more likelyto perform well.

Relative age effect

Categories: Ageism | Epidemiology

Page 32: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Births per month

21/50

0 2 4 6 8 10 12

7.5%

8%

8.5%

9%

Month

Rel

ativ

ebi

rths

YAGO - no sportsmenNational Center for Health Statistics

Figure: Births per month in the United States between 2003 and 2015(with the Student’s t confidence interval at α = 95%).

Page 33: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Relative population size

22/50

−3000 −2500 −2000 −1500 −1000 −500 0 500 1000 1500 20000

0.1

0.2

0.3

0.4

0.5

0.6

0.70.8

1

Year

Rel

ativ

epo

pula

tion

size

EgyptBabylonian-Empire

SyriaChinaGreece

Ancient-RomeBritainFranceItaly

GermanyUnited-States

Figure: Relative population size, by century. The y-axis is scaled by a quadraticfunction.

Page 34: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe HumanitiesRelated Work

Extensions

Birth and Death Dates

Place of residence

Gender

Evaluation

Life expectancy overtime

Births per month

Relative populationsize

Summary

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Using YAGO for the Humanities: Summary

23/50

I extension of YAGOI more birth and death dates (+8%/10%, 100% precision)I more people with locations (+201%, 97% precison)I more people with genders (+35%, 98% precision)

I case studiesI life expectancyI births per monthI relative population size

Thomas Rebele Arash Nekoei Fabian Suchanek

publication: ISWC 2017 (workshop paper)

We often had to repair regular expressions (e.g., for matching dates).Can we automate this step?

100 500 1000 1500 1900

50

60

70

80

Year

Med

ian

age

malefemale

Page 35: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Introduction

24/50

Why does YAGO not knowthe ISBN numbers of my books?

I we want to find ISBN numbers in Wikipedia to include it in YAGOI we try the regex ISBN(978|979)?\d{10}

I why does the regex not find I978-2-1234-5680-3 ?I how can we modify the regex automatically to match the word?

Page 36: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Introduction

24/50

Why does YAGO not knowthe ISBN numbers of my books?

I we want to find ISBN numbers in Wikipedia to include it in YAGOI we try the regex ISBN(978|979)?\d{10}I why does the regex not find I978-2-1234-5680-3 ?I how can we modify the regex automatically to match the word?

Page 37: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Problem statement

25/50

Problem statement, first try:

GivenI a regular expression r and ISBN(978|979)?\d{10}I a set of strings S, { I978-2-1234-5680-3 }

find a regular expression r′ such thatI L(r) ⊆ L(r′)I S ⊆ L(r′)

Solution:

r′ = .∗

Page 38: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Problem statement

25/50

Problem statement, first try:

GivenI a regular expression r and ISBN(978|979)?\d{10}I a set of strings S, { I978-2-1234-5680-3 }

find a regular expression r′ such thatI L(r) ⊆ L(r′)I S ⊆ L(r′)

Solution:

r′ = .∗

Page 39: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Problem statement

26/50

Problem statement:

GivenI a regular expression r, ISBN(978|979)?\d{10}I a set of strings S, { I978-2-1234-5680-3 }I a set of negative examples E−, { 0612345678 }

find a regular expression r′ such thatI L(r) ⊆ L(r′)I S ⊆ L(r′)I L(r′) ∩ E− is small

Additional goals:I precision of r′ ≥ or ≈ precision of rI recall of r′ ≥ recall of r

(w.r.t. the intended meaning of the regex)

Page 40: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Problem statement

26/50

Problem statement:

GivenI a regular expression r, ISBN(978|979)?\d{10}I a set of strings S, { I978-2-1234-5680-3 }I a set of negative examples E−, { 0612345678 }

find a regular expression r′ such thatI L(r) ⊆ L(r′)I S ⊆ L(r′)I L(r′) ∩ E− is small

Additional goals:I precision of r′ ≥ or ≈ precision of rI recall of r′ ≥ recall of r

(w.r.t. the intended meaning of the regex)

Page 41: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: What is new in our approach

27/50

Previous approaches

(regex

)E+ E− regex+ + −→

Our approach

regex S E− regex+ + −→

Rationale: creating a large set of positive examples is difficult

Page 42: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Approximate regex matching

28/50

Step 1: match string and regex approximately [Myers et al. 1989].

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \d

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3

...

Page 43: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Finding the gaps

29/50

Step 2: find the gapsI between regex leaves

I between characters of the string

.

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \dS B N

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3

...

Page 44: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Finding the gaps

29/50

Step 2: find the gapsI between regex leavesI between characters of the string

.

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \d

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3-

...

Page 45: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

30/50

Step 3 (simple approach): adapt regex, so that it includes the missingparts

.

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \dS B N

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3

...

.

.

I ?

.

S B N

?

|

.

9 7 8

.

9 7 9

{10}

\d

Page 46: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

30/50

Step 3 (simple approach): adapt regex, so that it includes the missingparts

.

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \d

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3-

...

.

.

I ?

.

S B N

?

|

.

9 7 8

.

9 7 9

?

-

{10}

\d

Page 47: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

30/50

Step 3 (simple approach): adapt regex, so that it includes the missingparts

.

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \d

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3-

...

.

.

I ?

.

S B N

?

|

.

9 7 8

.

9 7 9

?

-

.

\d ?

-

\d ... \d \d

Page 48: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

30/50

Step 3 (simple approach): adapt regex, so that it includes the missingparts

.

.

I S B N

?

|

.

9 7 8

.

9 7 9

.

\d \d ... \d \d

I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3-

...

.

.

I ?

.

S B N

?

|

.

9 7 8

.

9 7 9

?

-

.

\d ?

-

\d ... \d ?

-

\d

Page 49: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

31/50

Step 3 (adaptive approach): adapt regex, so that it includes the missingpartsCases:

.

a b ... c d

{g1, g2, g3}

.

a

?

b ... c d

{g1, g ′2} {g ′′

2 , g3}

|

a ... b → only recursive call

{n,m}

r

{g1, g2, g3}

.

{...}

r

{...}

r

{g1, g ′2} {g ′′

2 , g3}

*

r

{g1, g2}

*

r

{g1, g ′2, g

′′2 }

Page 50: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

31/50

Step 3 (adaptive approach): adapt regex, so that it includes the missingpartsCases:

.

a b ... c d

{g1, g2, g3}

.

a

?

b ... c d

{g1, g ′2} {g ′′

2 , g3}

|

a ... b → only recursive call

{n,m}

r

{g1, g2, g3}

.

{...}

r

{...}

r

{g1, g ′2} {g ′′

2 , g3}

*

r

{g1, g2}

*

r

{g1, g ′2, g

′′2 }

Page 51: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

31/50

Step 3 (adaptive approach): adapt regex, so that it includes the missingpartsCases:

.

a b ... c d

{g1, g2, g3}

.

a

?

b ... c d

{g1, g ′2} {g ′′

2 , g3}

|

a ... b → only recursive call

{n,m}

r

{g1, g2, g3}

.

{...}

r

{...}

r

{g1, g ′2} {g ′′

2 , g3}

*

r

{g1, g2}

*

r

{g1, g ′2, g

′′2 }

Page 52: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Add missing parts

31/50

Step 3 (adaptive approach): adapt regex, so that it includes the missingpartsCases:

.

a b ... c d

{g1, g2, g3}

.

a

?

b ... c d

{g1, g ′2} {g ′′

2 , g3}

|

a ... b → only recursive call

{n,m}

r

{g1, g2, g3}

.

{...}

r

{...}

r

{g1, g ′2} {g ′′

2 , g3}

*

r

{g1, g2}

*

r

{g1, g ′2, g

′′2 }

Page 53: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Feedback function

32/50

ExampleI now we want to find URLsI we try regex r = http://[a-zA-Z\.]+I it does not find s = wikipedia.orgI repaired regex r′ = (http://)?[a-zA-Z\.]+

I problem: r′ finds all wordsI precision drops

Solution: use feedback on set of negative examples E−

I determine the parts of the regex that we can make optionalI we use the number of false positives, i.e.,

f (r′) = |E− ∩ L(r′)| ≤ α|E− ∩ L(r)|

I if f (r′) = false, add the word as disjunction instead:http://[a-zA-Z\.]+|wikipedia.org

Page 54: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Feedback function

32/50

ExampleI now we want to find URLsI we try regex r = http://[a-zA-Z\.]+I it does not find s = wikipedia.orgI repaired regex r′ = (http://)?[a-zA-Z\.]+

I problem: r′ finds all wordsI precision drops

Solution: use feedback on set of negative examples E−

I determine the parts of the regex that we can make optionalI we use the number of false positives, i.e.,

f (r′) = |E− ∩ L(r′)| ≤ α|E− ∩ L(r)|

I if f (r′) = false, add the word as disjunction instead:http://[a-zA-Z\.]+|wikipedia.org

Page 55: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Feedback function

33/50

Summary of the algorithm:

1. match strings in S approximately to r2. find gaps in the regex or in the strings

3. (adaptive:) find overlaps within the gaps

4. (simple:) add missing parts for every missing word one after theother(adaptive:) add missing parts and check intermediate steps withthe feedback

5. (adaptive:) add a generalization of non-repaired words(similar to [Babbar et al. 2010])

Page 56: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Experiments

34/50

Input dataI datasets:

ReLIE [Li et al., 2008],Enron [Babbar et al., 2010], andYAGO infobox attributes

I in total 8 tasksI in total 52 regexes

Experimental approachI 5× 2 train/test splitI missing words S are selected randomly from E+ \ L(r), |S| ≤ 10I we draw 10 different sets S

Page 57: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Adding Words to Regexes: Experiments

35/50

BaselinesI dis: r|s1| · · · |sn

I star: .*CompetitorsI B&S: [Babbar et al., 2010] (reimplementation)I simpleI adaptive

baseline adaptive

measure original dis star B&S simple α = 1.0 α = 1.1 α = 1.20

F1 55 55 21 40 56 60 60 60recall 66 67 62 35 69 75 76 77precision 64 64 14 71 64 63 63 63

length 56 270 2 3929 250 76 80 81

Table: Averaged measures for the different systems. Length is # of characters ofthe regex.

Page 58: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexesIntroduction

Problem statement

What is new in ourapproach

Approximate regexmatching

Finding the gaps

Add missing parts

Feedback function

Experiments

Summary

AnsweringQueries with UnixShell

Conclusion

Adding Words to Regexes: Summary

36/50

SummaryI algorithm for adding missing words to regexesI increases recall, while keeping precision stableI Source code available at

https://github.com/thomasrebele/regex-repair

Future workI decrease dependency on E−

I add a generalization step as postprocessing

Thomas Rebele Katerina Tzompanaki Fabian Suchanek

publications: ISWC 2017 (demo), PAKDD 2018 (full paper)

Now that we have all this data, how can we process it efficiently?

Page 59: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Motivation

37/50

How can I find all teachers?

Albert Einstein

Relativity

Cosmology

teaches

teaches

Alfred Kleiner Statistical physics

teaches

successorOf

Person

Person

type

type

Page 60: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Motivation

38/50

Observation:

database

qresult

importing querying

//

qresult

’grep’ing

/

Page 61: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Motivation

38/50

Observation:

database

qresult

importing querying

//

qresult

’grep’ing

/

Page 62: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Motivation

38/50

Observation:

database

qresult

importing querying

//

qresult

’grep’ing

/

Page 63: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: System

39/50

3transformation

QSPARQL/OWL

QDatalog

or

&Bash script

qTSV/n-triples files

qresult

QDatalog

σπAlgebra

÷Optimize

^Translate

Page 64: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: System

39/50

3transformation

QSPARQL/OWL

QDatalog

or

&Bash script

qTSV/n-triples files

qresult

QDatalog

σπAlgebra

÷Optimize

^Translate

Page 65: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Answering Queries with Unix Shell: Approach

40/50

Query "Which people teach a course?" in SPARQL

SELECT ?X WHERE {?X <type> <Person>.?X <teachesCourse> ?Y.

}

Translating the query to Datalog

Person(X) :-facts(X, "type", "Person").

teaches(X, Y) :-facts(X, "teaches", Y).

Teacher(X) :- Person(X),teaches(X,Y).

π1

on1=1

σ3=”Person”

σ2=”type”

facts

σ2=”teaches”

facts

Page 66: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Approach

41/50

Optimization:

π1

on1=1

σ3=”Person”

σ2=”type”

facts

σ2=”teaches”

facts

π1

on1=1

π1

σ3=”Person”

σ2=”type”

facts

π1

σ2=”teaches”

facts

Page 67: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Approach

42/50

Algebra plan

π1

on1=1

π1

σ3=”Person”

σ2=”type”

facts

π1

σ2=”teaches”

facts

Bash code

sort -u \<(join -1 1 -2 1 -o 1.1 \

<(sort -k 1 \<(awk ’($3 == "Person" && $2 == "type")

{ print $1 };’ facts))<(sort -k 1 \

<(awk ’($2 == "teaches"){ print $1 };’ facts))

Page 68: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Approach

42/50

Algebra plan

π1

on1=1

π1

σ3=”Person”

σ2=”type”

facts

π1

σ2=”teaches”

facts

Bash code

sort -u \<(join -1 1 -2 1 -o 1.1 \

<(sort -k 1 \<(awk ’($3 == "Person" && $2 == "type")

{ print $1 };’ facts))<(sort -k 1 \

<(awk ’($2 == "teaches"){ print $1 };’ facts))

Page 69: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Approach

42/50

Algebra plan

π1

on1=1

π1

σ3=”Person”

σ2=”type”

facts

π1

σ2=”teaches”

facts

Bash code

sort -u \<(join -1 1 -2 1 -o 1.1 \

<(sort -k 1 \<(awk ’($3 == "Person" && $2 == "type")

{ print $1 };’ facts))<(sort -k 1 \

<(awk ’($2 == "teaches"){ print $1 };’ facts))

Page 70: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Approach

42/50

Algebra plan

π1

on1=1

π1

σ3=”Person”

σ2=”type”

facts

π1

σ2=”teaches”

facts

Bash code

sort -u \<(join -1 1 -2 1 -o 1.1 \

<(sort -k 1 \<(awk ’($3 == "Person" && $2 == "type")

{ print $1 };’ facts))<(sort -k 1 \

<(awk ’($2 == "teaches"){ print $1 };’ facts))

Page 71: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Approach

42/50

Algebra plan

π1

on1=1

π1

σ3=”Person”

σ2=”type”

facts

π1

σ2=”teaches”

facts

Bash code

sort -u \<(join -1 1 -2 1 -o 1.1 \

<(sort -k 1 \<(awk ’($3 == "Person" && $2 == "type")

{ print $1 };’ facts))<(sort -k 1 \

<(awk ’($2 == "teaches"){ print $1 };’ facts))

Page 72: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

43/50

OptimizationsI algebraic, e.g., merge union / projectsI semi-naive evaluationI join reorderingI remove superfluous recursive callsI materialize repeated subplansI read files only onceI tweak Unix commands, e.g., using LANG=C and MAWK

Page 73: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

44/50

How can I find all professors?

Professor(X) :- Person(X),teachesCourse(X,Y).

Professor(X) :- successorOf(X,Y),Professor(Y).

Person(X) :- Employee(X).Person(X) :- Professor(X).

Combining the first and the last rule leads to

Professor(X) :- Professor(X),teachesCourse(X,Y).

Page 74: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{ } {?}

{ } {?}

{ }

{ , ?, ?}

{ }

{ }

⇒ superfluous

{?}

{?, ?, }

{ }

⇒ necessary

Page 75: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{ } {?}

{ } {?}

{ }

{ , ?, ?}

{ }

{ }

⇒ superfluous

{?}

{?, ?, }

{ }

⇒ necessary

Page 76: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{ }

{?}

{ }

{?}

{ }

{ , ?, ?}

{ }

{ }

⇒ superfluous

{?}

{?, ?, }

{ }

⇒ necessary

Page 77: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{ }

{?}

{ }

{?}

{ }

{ , ?, ?}

{ }

{ }

⇒ superfluous

{?}

{?, ?, }

{ }

⇒ necessary

Page 78: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{ } {?}

{ } {?}

{ }

{ , ?, ?}

{ }

{ }

⇒ superfluous

{?}

{?, ?, }

{ }

⇒ necessary

Page 79: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{ } {?}

{ } {?}

{ }

{ , ?, ?}

{ }

{ }

⇒ superfluous

{?}

{?, ?, }

{ }

⇒ necessary

Page 80: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

45/50

µx

π1

on1=1

employee x

teachesCourse

π1

on2=1

successorOf x

{c1} {?}

{c1} {?}

{c1}

{c1, ?, ?}

{c1}

{c1}

⇒ superfluous

{?}

{?, ?, c1}

{c1}

⇒ necessary

Page 81: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Optimization

46/50

Assume that employee is an expensive subplan

µx

employee ...

£ y

employee µx

y ...

Page 82: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Answering Queries with Unix Shell: Experiments

47/50

I Dataset: LUBM university benchmarkI 14 different queriesI competitors: Datalog-based (DLV, Souffle, RDFox),

Triple store (Jena, Stardog, Virtuoso),Database management system (MonetDB, Postgres)

Number of finished queries

LUBM Bash DLV Souffle RDFox Jena Stardog Virtuoso MonetDB* Postgres*

10 14 14 13 14 5 14 6 10 10500 14 11 14 14 6 101000 14 4 14 14 10

Runtime in seconds

LUBM Bash DLV Souffle RDFox Jena Stardog Virtuoso MonetDB* Postgres*

10 1.6 9.3 (21.9) 2.2 (78.7) 13.6 (11.8) (5.2) (20.6)500 83 (310) 132 676 (1581) (600)1000 258 (346) 278 2009 (1187)

* = we folded the TBox into the query

Page 83: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Experiments

48/50

Figure: Screenshot of the web interface

Page 84: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShellMotivation

System

Approach

Optimization

Experiments

Experiments

Summary

Conclusion

Answering Queries with Unix Shell: Summary

49/50

SummaryI Preprocess large datasets without installing softwareI Supports OWL RL subset and Datalog as query languageI Try it online at

https://www.thomasrebele.org/projects/bashlogI Source code available at

https://github.com/thomasrebele/bashlog

Future workI numerical comparisonsI aggregations (e.g., max, count)

Thomas Rebele Thomas P. Tanon Fabian Suchanek

publication: ISWC 2018 (full paper)

Page 85: Expanding the YAGO knowledge base · I knowledge base with 10 million entities and >210 million facts I automatically extracted from Wikipedia, Wordnet, and Geonames I multilingual

Expanding theYAGO knowledge

base

Rebele

The YAGOknowledge base

Using YAGO forthe Humanities

Adding Words toRegexes

AnsweringQueries with UnixShell

Conclusion

Conclusion

50/50

This thesis showed how to extend YAGO along several axes:

I Improve completeness w.r.t. peopleI Automatically repairing of its regular expressionsI Preprocessing queries using only a Bash shell

I Interdisciplinary projectI Source code of all contributions is available onlineI Publications in ISWC 2016, ISWC 2017, ISWC 2018, PAKDD 2018

(other publication in TPDL 2016 (demo))