Upload
dinhnhi
View
216
Download
3
Embed Size (px)
Citation preview
Fabian Suchanek & Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
http://suchanek.name/
http://www.mpi-inf.mpg.de/~weikum/
Knowledge Harvesting
in the Big Data Era
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Turn Web into Knowledge Base
KB Population
Info Extraction Semantic Authoring
Entity Linkage
Web
of D
ata
W
eb
of
Usrs
& C
on
ten
ts
Very Large Knowledge Bases
Semantic Docs
Disambiguation
2
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
Web of Data: RDF, Tables, Microdata 60 Bio. SPO triples (RDF) and growing
Cyc
TextRunner/
ReVerb WikiTaxonomy/
WikiNet
SUMO
ConceptNet 5
BabelNet
ReadTheWeb
3
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
Web of Data: RDF, Tables, Microdata 60 Bio. SPO triples (RDF) and growing
• 10M entities in
350K classes
• 120M facts for
100 relations
• 100 languages
• 95% accuracy
• 4M entities in
250 classes
• 500M facts for
6000 properties
• live updates
• 25M entities in
2000 topics
• 100M facts for
4000 properties
• powers Google
knowledge graph
Ennio_Morricone type composer Ennio_Morricone type GrammyAwardWinner composer subclassOf musician Ennio_Morricone bornIn Rome Rome locatedIn Italy Ennio_Morricone created Ecstasy_of_Gold Ennio_Morricone wroteMusicFor The_Good,_the_Bad_,and_the_Ugly Sergio_Leone directed The_Good,_the_Bad_,and_the_Ugly
4
History of Knowledge Bases
Doug Lenat: „The more you know, the more
(and faster) you can learn.“
Cyc project (1984-1994) cont‘d by Cycorp Inc.
x: human(x) male(x) female(x)
x: (male(x) female(x))
(female(x) male(x))
x: mammal(x) (hasLegs(x)
isEven(numberOfLegs(x))
x: human(x)
( y: mother(x,y) z: father(x,z))
x e : human(x) remembers(x,e)
happened(e) < now
George
Miller
Christiane
Fellbaum
WordNet project
(1985-now)
Cyc and WordNet
are hand-crafted
knowledge bases
Some Publicly Available Knowledge Bases
YAGO: yago-knowledge.org
Dbpedia: dbpedia.org
Freebase: freebase.com
Entitycube: research.microsoft.com/en-us/projects/entitycube/
NELL: rtw.ml.cmu.edu
DeepDive: research.cs.wisc.edu/hazy/demos/deepdive/index.php/Steve_Irwin
Probase: research.microsoft.com/en-us/projects/probase/
KnowItAll / ReVerb: openie.cs.washington.edu
reverb.cs.washington.edu
PATTY: www.mpi-inf.mpg.de/yago-naga/patty/
BabelNet: lcl.uniroma1.it/babelnet
WikiNet: www.h-its.org/english/research/nlp/download/wikinet.php
ConceptNet: conceptnet5.media.mit.edu
WordNet: wordnet.princeton.edu
Linked Open Data: linkeddata.org
6
Knowledge for Intelligence
Enabling technology for:
disambiguation in written & spoken natural language
deep reasoning (e.g. QA to win quiz game)
machine reading (e.g. to summarize book or corpus)
semantic search in terms of entities&relations (not keywords&pages)
entity-level linkage for the Web of Data
European composers who have won film music awards?
East coast professors who founded Internet companies?
Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure? ...
Politicians who are also scientists?
Relationships between John Lennon, Billie Holiday, Heath Ledger, King Kong?
7
Use Case: Question Answering
This town is known as "Sin City" & its
downtown is "Glitter Gulch"
This American city has two airports
named after a war hero and a WW II battle
knowledge
back-ends
question
classification &
decomposition
D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010.
IBM Journal of R&D 56(3/4), 2012: This is Watson.
Q: Sin City ?
movie, graphical novel, nickname for city, …
A: Vegas ? Strip ?
Vega (star), Suzanne Vega, Vincent Vega, Las Vegas, …
comic strip, striptease, Las Vegas Strip, …
8
Use Case: Text Analytics
K.Goh,M.Kusick,D.Valle,B.Childs,M.Vidal,A.Barabasi: The Human Disease Network, PNAS, May 2007
But try this with: diabetes mellitus, diabetis type 1, diabetes type 2, diabetes insipidus,
insulin-dependent diabetes mellitus with ophthalmic complications,
ICD-10 E23.2, OMIM 304800, MeSH C18.452.394.750, MeSH D003924, …
Use Case: Big Data+Text Analytics
Who covered which other singer?
Who influenced which other musicians?
Entertainment:
Drugs (combinations) and their side effects Health:
Politicians‘ positions on controversial topics
and their involvement with industry Politics:
Customer opinions on small-company products,
gathered from social media Business:
• Identify relevant contents sources
• Identify entities of interest & their relationships
• Position in time & space
• Group and aggregate
• Find insightful patterns & predict trends
General Design Pattern:
10
Spectrum of Machine Knowledge (1)
factual knowledge: bornIn (SteveJobs, SanFrancisco), hasFounded (SteveJobs, Pixar),
hasWon (SteveJobs, NationalMedalOfTechnology), livedIn (SteveJobs, PaloAlto)
taxonomic knowledge (ontology): instanceOf (SteveJobs, computerArchitects), instanceOf(SteveJobs, CEOs)
subclassOf (computerArchitects, engineers), subclassOf(CEOs, businesspeople)
lexical knowledge (terminology): means (“Big Apple“, NewYorkCity), means (“Apple“, AppleComputerCorp)
means (“MS“, Microsoft) , means (“MS“, MultipleSclerosis)
contextual knowledge (entity occurrences, entity-name disambiguation) maps (“Gates and Allen founded the Evil Empire“,
BillGates, PaulAllen, MicrosoftCorp)
linked knowledge (entity equivalence, entity resolution): hasFounded (SteveJobs, Apple), isFounderOf (SteveWozniak, AppleCorp)
sameAs (Apple, AppleCorp), sameAs (hasFounded, isFounderOf)
11
Spectrum of Machine Knowledge (2)
multi-lingual knowledge: meansInChinese („乔戈里峰“, K2), meansInUrdu („کے ٹو“, K2)
meansInFr („école“, school (institution)), meansInFr („banc“, school (of fish))
temporal knowledge (fluents): hasWon (SteveJobs, NationalMedalOfTechnology)@1985
marriedTo (AlbertEinstein, MilevaMaric)@[6-Jan-1903, 14-Feb-1919]
presidentOf (NicolasSarkozy, France)@[16-May-2007, 15-May-2012]
spatial knowledge: locatedIn (YumbillaFalls, Peru), instanceOf (YumbillaFalls, TieredWaterfalls)
hasCoordinates (YumbillaFalls, 5°55‘11.64‘‘S 77°54‘04.32‘‘W ),
closestTown (YumbillaFalls, Cuispes), reachedBy (YumbillaFalls, RentALama)
12
Spectrum of Machine Knowledge (3)
ephemeral knowledge (dynamic services):
wsdl:getSongs (musician ?x, song ?y), wsdl:getWeather (city?x, temp ?y)
common-sense knowledge (properties):
hasAbility (Fish, swim), hasAbility (Human, write),
hasShape (Apple, round), hasProperty (Apple, juicy),
hasMaxHeight (Human, 2.5 m)
common-sense knowledge (rules):
x: human(x) male(x) female(x)
x: (male(x) female(x)) (female(x) ) male(x))
x: human(x) ( y: mother(x,y) z: father(x,z))
x: animal(x) (hasLegs(x) isEven(numberOfLegs(x))
13
Spectrum of Machine Knowledge (4)
emerging knowledge (open IE):
hasWon (MerylStreep, AcademyAward)
occurs („Meryl Streep“, „celebrated for“, „Oscar for Best Actress“)
occurs („Quentin“, „nominated for“, „Oscar“)
multimodal knowledge (photos, videos):
JimGray
JamesBruceFalls
social knowledge (opinions):
admires (maleTeen, LadyGaga), supports (AngelaMerkel, HelpForGreece)
epistemic knowledge ((un-)trusted beliefs):
believe(Ptolemy,hasCenter(world,earth)),
believe(Copernicus,hasCenter(world,sun))
believe (peopleFromTexas, bornIn(BarackObama,Kenya))
?
14
Knowledge Bases in the Big Data Era
Scalable algorithms
Distributed platforms
Big Data Analytics
Tapping unstructured data
Connecting structured & unstructured data sources
Discovering data sources
Making sense of heterogeneous, dirty, or uncertain data
Knowledge
Bases: entities,
relations,
time, space,
…
15
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Big Data
Methods for
Knowledge
Harvesting
Knowledge
for Big Data
Analytics
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Time of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Wikipedia-centric Methods
Web-based Methods
Knowledge Bases are labeled graphs
singer
person
resource
location
city
Tupelo
subclassOf
subclassOf
type
bornIn
type
subclassOf
Classes/ Concepts/ Types
Instances/ entities
Relations/ Predicates
A knowledge base can be seen as a directed labeled multi-graph, where the nodes are entities and the edges relations.
18
An entity can have different labels
singer
person
“Elvis” “The King”
type
label label
The same
label for two
entities:
ambiguity
The same
entity
has two
labels:
synonymy type
19
Different views of a knowledge base
singer
type
type(Elvis, singer) bornIn(Elvis,Tupelo) ...
Subject Predicate Object
Elvis type singer
Elvis bornIn Tupelo
... ... ...
Graph notation:
Logical notation:
Triple notation:
Tupelo bornIn
We use "RDFS Ontology"
and "Knowledge Base (KB)"
synonymously.
20
Our Goal is finding classes and instances
singer
person
type
Which classes exist? (aka entity types, unary predicates, concepts)
subclassOf Which subsumptions hold?
Which entities belong to which classes?
Which entities exist?
21
WordNet is a lexical knowledge base
WordNet project
(1985-now)
singer
person
subclassOf
living being
subclassOf
“person”
label
“individual”
“soul”
WordNet contains 82,000 classes
WordNet contains 118,000 class labels
WordNet contains thousands of subclassOf relationships
22
WordNet example: instances
only 32 singers !?
4 guitarists
5 scientists
0 enterprises
2 entrepreneurs
WordNet classes
lack instances
25
Goal is to go beyond WordNet
WordNet is not perfect:
• it contains only few instances
• it contains only common nouns as classes
• it contains only English labels
... but it contains a wealth of information that can be
the starting point for further extraction.
26
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Basics & Goal
Wikipedia-centric Methods
Web-based Methods
Link Wikipedia categories to WordNet?
American billionaires
Technology company founders
Apple Inc.
Deaths from cancer
Internet pioneers
tycoon, magnate
entrepreneur
pioneer, innovator ?
pioneer, colonist
?
Wikipedia categories WordNet classes
30
Categories can be linked to WordNet
American people of Syrian descent
singer gr. person people descent WordNet
American people of Syrian descent
pre-modifier head post-modifier
person
Noungroup parsing
Wikipedia
Stemming person
Most frequent meaning
“person” “singer” “people” “descent”
Head has to be plural
31
YAGO = WordNet+Wikipedia
American people of Syrian descent
WordNet
person
Wikipedia
organism
subclassOf
subclassOf
Related project:
WikiTaxonomy 105,000 subclassOf links 88% accuracy [Ponzetto & Strube: AAAI‘07] 200,000 classes
460,000 subclassOf 3 Mio. instances 96% accuracy [Suchanek: WWW‘07]
Steve Jobs type
32
Link Wikipedia & WordNet by Random Walks
[Navigli 2010]
Formula One
drivers
• construct neighborhood around source and target nodes • use contextual similarity (glosses etc.) as edge weights • compute personalized PR (PPR) with source as start node • rank candidate targets by their PPR scores
{driver,
device driver}
computer
program
chauffeur
race driver
trucker
tool
causal
agent
Barney
Oldfield
{driver, operator
of vehicle}
Formula One
champions truck
drivers
motor
racing
Michael
Schumacher
Wikipedia categories WordNet classes 33
Learning More Mappings [ Wu & Weld: WWW‘08 ]
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (SVM‘s, MLN‘s)
• rich features from various sources
• category/class name similarity measures
• category instances and their infobox templates:
template names, attribute names (e.g. knownFor)
• Wikipedia edit history:
refinement of categories
• Hearst patterns:
C such as X, X and Y and other C‘s, …
• other search-engine statistics:
co-occurrence frequencies
> 3 Mio. entities > 1 Mio. w/ infoboxes > 500 000 categories 34
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Basics & Goal
Wikipedia-centric Methods
Web-based Methods
35
Hearst patterns extract instances from text [M. Hearst 1992]
Hearst defined lexico-syntactic patterns for type relationship:
X such as Y; X like Y;
X and other Y; X including Y;
X, especially Y;
companies such as Apple
Google, Microsoft and other companies
Internet companies like Amazon and Facebook
Chinese cities including Kunming and Shangri-La
computer pioneers like the late Steve Jobs
computer pioneers and other scientists
lakes in the vicinity of Brisbane
type(Apple, company), type(Google, company), ...
Find such patterns in text: //better with POS tagging
Goal: find instances of classes
Derive type(Y,X)
36
Recursively applied patterns increase recall [Kozareva/Hovy 2010]
use results from Hearst patterns as seeds
then use „parallel-instances“ patterns
X such as Y companies such as Apple
companies such as Google
Y like Z
*, Y and Z Apple like Microsoft offers
IBM, Google, and Amazon
Microsoft like SAP sells
eBay, Amazon, and Facebook
Y like Z
*, Y and Z
Y like Z
*, Y and Z Cherry, Apple, and Banana
potential problems with ambiguous words
37
Doubly-anchored patterns are more robust [Kozareva/Hovy 2010, Dalvi et al. 2012]
W, Y and Z
If two of three placeholders match seeds, harvest the third:
Google, Microsoft and Amazon
Cherry, Apple, and Banana
Goal:
find instances of classes
Start with a set of seeds:
companies = {Microsoft, Google}
type(Amazon, company)
Parse Web documents and find the pattern
38
Instances can be extracted from tables [Kozareva/Hovy 2010, Dalvi et al. 2012]
Paris France
Shanghai China
Berlin Germany
London UK
Paris Iliad
Helena Iliad
Odysseus Odysee
Rama Mahabaratha
Goal: find instances of classes
Start with a set of seeds:
cities = {Paris, Shanghai, Brisbane}
Parse Web documents and find tables
If at least two seeds appear in a column, harvest the others:
type(Berlin, city)
type(London, city) 39
Extracting instances from lists & tables [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]
Caveats:
Precision drops for classes with sparse statistics (IR profs, …)
Harvested items are names, not entities
Canonicalization (de-duplication) unsolved
State-of-the-Art Approach (e.g. SEAL):
• Start with seeds: a few class instances
• Find lists, tables, text snippets (“for example: …“), …
that contain one or more seeds
• Extract candidates: noun phrases from vicinity
• Gather co-occurrence stats (seed&cand, cand&className pairs)
• Rank candidates
• point-wise mutual information, …
• random walk (PR-style) on seed-cand graph
40
Probase builds a taxonomy from the Web
ProBase 2.7 Mio. classes from
1.7 Bio. Web pages [Wu et al.: SIGMOD 2012]
Use Hearst liberally to obtain many instance candidates:
„plants such as trees and grass“
„plants include water turbines“
„western movies such as The Good, the Bad, and the Ugly“
Problem: signal vs. noise
Assess candidate pairs statistically:
P[X|Y] >> P[X*|Y] subclassOf(Y X)
Problem: ambiguity of labels
Merge labels of same class:
X such as Y1 and Y2 same sense of X
41
Use query logs to refine taxonomy [Pasca 2011]
Input:
type(Y, X1), type(Y, X2), type(Y, X3), e.g, extracted from Web
Goal: rank candidate classes X1, X2, X3
H1: X and Y should co-occur frequently in queries
score1(X) freq(X,Y) * #distinctPatterns(X,Y)
H2: If Y is ambiguous, then users will query X Y:
score2(X) (i=1..N term-score(tiX))1/N
example query: "Michael Jordan computer scientist"
H3: If Y is ambiguous, then users will query first X, then X Y:
score3(X) (i=1..N term-session-score(tiX))1/N
Combine the following scores to rank candidate classes:
42
Take-Home Lessons
Semantic classes for entities
> 10 Mio. entities in 100,000‘s of classes
backbone for other kinds of knowledge harvesting
great mileage for semantic search e.g. politicians who are scientists,
French professors who founded Internet companies, …
Variety of methods
noun phrase analysis, random walks, extraction from tables, …
Still room for improvement
higher coverage, deeper in long tail, …
43
Open Problems and Grand Challenges
Wikipedia categories reloaded: larger coverage
Universal solution for taxonomy alignment
New name for known entity vs. new entity?
Long tail of entities
comprehensive & consistent instanceOf and subClassOf
across Wikipedia and WordNet e.g. people lost at sea, ACM Fellow,
Jewish physicists emigrating from Germany to USA, …
e.g. Lady Gaga vs. Radio Gaga vs. Stefani Joanne Angelina Germanotta
e.g. Wikipedia‘s, dmoz.org, baike.baidu.com, amazon, librarything tags, …
beyond Wikipedia: domain-specific entity catalogs e.g. music, books, book characters, electronic products, restaurants, …
44
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Regex-based Extraction
Pattern-based Harvesting
Consistency Reasoning
Probabilistic Methods
Web-Table Methods
We focus on given binary relations
...find instances of these relations
hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
hasWonPrize (JimGray, TuringAward)
bornOn (JohnLennon, 9-Oct-1940)
Given binary relations with type signature
hasAdvisor: Person Person
graduatedAt: Person University
hasWonPrize: Person Award
bornOn: Person Date
46
IE can tap into different sources
• Semi-structured data
“Low-Hanging Fruit”
• Wikipedia infoboxes & categories
• HTML lists & tables, etc.
• Free text
“Cherrypicking”
• Hearst patterns & other shallow NLP
• Iterative pattern-based harvesting
• Consistency reasoning
• Web tables
Information Extraction (IE) from:
47
Source-centric IE vs. Yield-centric IE
many sources
one source
Surajit
obtained his
PhD in CS from
Stanford ...
Document 1:
instanceOf (Surajit, scientist)
inField (Surajit, c.science)
almaMater (Surajit, Stanford U)
…
Yield-centric IE
Student University Surajit Chaudhuri Stanford U Jim Gray UC Berkeley … …
Student Advisor Surajit Chaudhuri Jeffrey Ullman Jim Gray Mike Harrison … …
1) recall ! 2) precision
1) precision ! 2) recall
Source-centric IE
worksAt
hasAdvisor
+ (optional)
targeted
relations
48
We focus on yield-centric IE
many sources
Yield-centric IE
Student University Surajit Chaudhuri Stanford U Jim Gray UC Berkeley … …
Student Advisor Surajit Chaudhuri Jeffrey Ullman Jim Gray Mike Harrison … …
1) precision ! 2) recall
worksAt
hasAdvisor
+ (optional)
targeted
relations
49
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Regex-based Extraction
Pattern-based Harvesting
Consistency Reasoning
Probabilistic Methods
Web-Table Methods
Wikipedia uses a Markup Language
{{Infobox scientist
| name = James Nicholas "Jim" Gray
| birth_date = {{birth date|1944|1|12}}
| birth_place = [[San Francisco, California]]
| death_date = ('''lost at sea''')
{{death date|2007|1|28|1944|1|12}}
| nationality = American
| field = [[Computer Science]]
| alma_mater = [[University of California,
Berkeley]]
| advisor = Michael Harrison
...
52
Infoboxes are harvested by RegEx
{{Infobox scientist
| name = James Nicholas "Jim" Gray
| birth_date = {{birth date|1944|1|12}}
Use regular expressions
• to detect dates
• to detect links
• to detect numeric expressions
\{\{birth date \|(\d+)\|(\d+)\|(\d+)\}\}
\[\[([^\|\]]+)
(\d+)(\.\d+)?(in|inches|")
53
Infoboxes are harvested by RegEx
{{Infobox scientist
| name = James Nicholas "Jim" Gray
| birth_date = {{birth date|1944|1|12}}
1944-01-12
wasBorn(Jim_Gray, "1944-01-12")
Map attribute to
canoncial,
predefined
relation
(manually or
crowd-sourced)
Extract data item by
regular expression
wasBorn
54
Learn how articles express facts
James "Jim" Gray (born January 12, 1944
XYZ (born MONTH DAY, YEAR
find
attribute
value
in full
text
learn
pattern
55
Name: R.Agrawal
Birth date: ?
Extract from articles w/o infobox
Rakesh Agrawal (born April 31, 1965) ...
XYZ (born MONTH DAY, YEAR
... and/or build fact
apply
pattern
bornOnDate(R.Agrawal,1965-04-31)
[Wu et al. 2008: "KYLIN"]
propose
attribute
value...
56
Use CRF to express patterns
James "Jim" Gray (born in January, 1944
OTH OTH OTH OTH OTH VAL VAL
𝑃 𝑌 = 𝑦 𝑋 = 𝑥 = 1
𝑍 exp 𝑤𝑘𝑓𝑘(𝑦𝑡−1, 𝑦𝑡, 𝑥 , 𝑡)
𝑘𝑡
𝑥 =
𝑦 =
Features can take into account
• token types (numeric, capitalization, etc.)
• word windows preceding and following position
• deep-parsing dependencies
• first sentence of article
• membership in relation-specific lexicons
[R. Hoffmann et al. 2010: "Learning 5000 Relational Extractors]
James "Jim" Gray (born January 12, 1944 𝑥 =
57
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Regex-based Extraction
Pattern-based Harvesting
Consistency Reasoning
Probabilistic Methods
Web-Table Methods
Facts Patterns
(JimGray, MikeHarrison)
(BarbaraLiskov, JohnMcCarthy)
& Fact Candidates
X and his advisor Y
X under the guidance of Y
X and Y in their paper
X co-authored with Y
X rarely met his advisor Y
… • good for recall
• noisy, drifting
• not robust enough
for high precision
(Surajit, Jeff)
(Sunita, Mike)
(Alon, Jeff)
(Renee, Yannis)
(Surajit, Microsoft)
(Sunita, Soumen)
(Surajit, Moshe)
(Alon, Larry)
(Soumen, Sunita)
Facts yield patterns – and vice versa
59
Confidence of pattern p:
Confidence of fact candidate (e1,e2):
Support of pattern p:
• gathering can be iterated,
• can promote best facts to additional seeds for next round
# occurrences of p with seeds (e1,e2)
# occurrences of p with seeds (e1,e2)
# occurrences of p
or: PMI (e1,e2) = log freq(e1,e2)
freq(e1) freq(e2)
# occurrences of all patterns with seeds
p freq(e1,p,e2)*conf(p) / p freq(e1,p,e2)
Statistics yield pattern assessment
60
• can promote best facts to additional seeds for next round
• can promote rejected facts to additional counter-seeds
• works more robustly with few seeds & counter-seeds
# occurrences of p with pos. seeds
# occurrences of p with pos. seeds or neg. seeds
Problem: Some patterns have high support, but poor precision:
X is the largest city of Y for isCapitalOf (X,Y)
joint work of X and Y for hasAdvisor (X,Y)
pos. seeds: (Paris, France), (Rome, Italy), (New Delhi, India), ...
neg. seeds: (Sydney, Australia), (Istanbul, Turkey), ...
Negative Seeds increase precision
Idea: Use positive and negative seeds:
Compute the confidence of a pattern as:
(Ravichandran 2002; Suchanek 2006; ...)
61
|{n-grams p} {n-grams q]|
|{n-grams p} {n-grams q]|
Generalized patterns increase recall (N. Nakashole 2011)
Problem: Some patterns are too narrow and thus have small recall:
X and his celebrated advisor Y
X carried out his doctoral research in math under the supervision of Y
X received his PhD degree in the CS dept at Y
X obtained his PhD degree in math at Y
X { his doctoral research, under the supervision of} Y
X { PRP ADJ advisor } Y
X { PRP doctoral research, IN DET supervision of} Y
Compute match quality of pattern p with sentence q by Jaccard:
Compute
n-gram-sets
by frequent
sequence
mining
Idea: generalize patterns to n-grams, allow POS tags
=> Covers more sentences, increases recall 62
(Bunescu 2005 , Suchanek 2006, …)
Cologne lies on the banks of the Rhine
Ss MVp DMc Mp Dg
Js Jp
Problem: Surface patterns fail if the text shows variations
Cologne lies on the banks of the Rhine.
Paris, the French capital, lies on the beautiful banks of the Seine.
Deep Parsing makes patterns robust
Idea: Use deep linguistic parsing to define patterns
Deep linguistic patterns work even on sentences with variations
Paris, the French capital, lies on the beautiful banks of the Seine
63
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Regex-based Extraction
Pattern-based Harvesting
Consistency Reasoning
Probabilistic Methods
Web-Table Methods
Extending a KB faces 3+ challenges
type (Reagan, president)
spouse (Reagan, Davis)
spouse (Elvis,Priscilla)
(F. Suchanek et al.: WWW‘09)
Problem: If we want to extend a KB, we face (at least) 3 challenges
1. Understand which relations are expressed by patterns
"x is married to y“ spouse(x,y)
2. Disambiguate entities
"Hermione is married to Ron": "Ron" = RonaldReagan?
3. Resolve inconsistencies
spouse(Hermione, Reagan) & spouse(Reagan,Davis) ?
"Hermione is married to Ron"
?
65
SOFIE transforms IE to logical rules (F. Suchanek et al.: WWW‘09)
Idea: Transform corpus to surface statements
"Hermione is married to Ron"
occurs("Hermione", "is married to", "Ron")
Add possible meanings for all words from the KB
means("Ron", RonaldReagan)
means("Ron", RonWeasley)
means("Hermione", HermioneGranger)
Add pattern deduction rules
means(X,Y) & means(X,Z) Y=Z
Only one of these
can be true
occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R R(X',Y')
Add semantic constraints (manually)
spouse(X,Y) & spouse(X,Z) Y=Z 66
The rules deduce meanings of patterns (F. Suchanek et al.: WWW‘09)
Add pattern deduction rules
occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R R(X',Y')
Add semantic constraints (manually)
spouse(X,Y) & spouse(X,Z) Y=Z
type(Reagan, president)
spouse(Reagan, Davis)
spouse(Elvis,Priscilla)
"Elvis is married to Priscilla"
"is married to“ ~ spouse
67
The rules deduce facts from patterns (F. Suchanek et al.: WWW‘09)
Add pattern deduction rules
occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R R(X',Y')
Add semantic constraints (manually)
spouse(X,Y) & spouse(X,Z) Y=Z
spouse(Hermione,RonaldReagan)
spouse(Hermione,RonWeasley)
"is married to“ ~ married
"Hermione is married to Ron" type(Reagan, president)
spouse(Reagan, Davis)
spouse(Elvis,Priscilla)
68
The rules remove inconsistencies (F. Suchanek et al.: WWW‘09)
Add pattern deduction rules
occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R R(X',Y')
Add semantic constraints (manually)
spouse(X,Y) & spouse(X,Z) Y=Z
spouse(Hermione,RonaldReagan)
spouse(Hermione,RonWeasley)
type(Reagan, president)
spouse(Reagan, Davis)
spouse(Elvis,Priscilla)
69
The rules pose a weighted MaxSat problem (F. Suchanek et al.: WWW‘09)
spouse(X,Y) & spouse(X,Z) => Y=Z [10]
type(Reagan, president) [10]
married(Reagan, Davis) [10]
married(Elvis,Priscilla) [10]
occurs("Hermione","loves","Harry") [3]
means("Ron",RonaldReagan) [3]
means("Ron",RonaldWeasley) [2]
...
We are given a set of
rules/facts, and wish
to find the most plausible
possible world.
Possible World 1: Possible World 2:
married married
Weight of satisfied rules: 30 Weight of satisfied rules: 39
PROSPERA parallelizes the extraction (N. Nakashole et al.: WSDM‘11)
occurs() occurs() occurs()
Mining the pattern
occurrences is
embarassingly
parallel
occurs()
spouse()
means()
loves()
means()
loves()
Reasoning is hard to
parallelize as atoms
depends on other atoms
Idea: parallelize
along min-cuts
71
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Regex-based Extraction
Pattern-based Harvesting
Consistency Reasoning
Probabilistic Methods
Web-Table Methods
Markov Logic generalizes MaxSat reasoning
occurs()
spouse()
means()
loves()
means()
loves()
In a Markov Logic Network (MLN), every atom is represented by
a Boolean random variable.
X3
X2
X4
X1
X6
X5
(M. Richardson / P. Domingos 2006)
means() X7 73
Dependencies in an MLN are limited
The value of a random variable 𝑿𝒊 depends only on its neighbors:
X3
X2
X4
X1
X6
X5
𝑷 𝑿𝒊 𝑿𝟏, … , 𝑿𝒊−𝟏, 𝑿𝒊+𝟏, … , 𝑿𝒏 = 𝑷(𝑿𝒊|𝑵 𝑿𝒊 )
𝑷 𝑿 = 𝒙 =𝟏
𝒁 𝝋𝒊(𝝅𝑪𝒊 𝒙 )
The Hammersley-Clifford Theorem tells us:
We choose 𝝋𝒊 so as to satisfy all
formulas in the the i-th clique:
𝝋𝒊 𝒛 = 𝐞𝐱𝐩 (𝒘𝒊 × 𝒇𝒐𝒓𝒎𝒖𝒍𝒂𝒔 𝒊 𝒔𝒂𝒕.𝒘𝒊𝒕𝒉 𝒛 )
X7 74
There are many methods for MLN inference
X3
X2
X4
X1
X6
X5
To compute the values that maximize the joint probability
(MAP = maximum a posteriori) we can use a variety of methods:
Gibbs sampling, other MCMC, belief propagation,
randomized MaxSat, …
X7 75
In addition, the MLN can model/compute
• marginal probabilities
• the joint distribution
Large-Scale Fact Extraction with MLNs [J. Zhu et al.: WWW‘09]
StatSnowball:
• start with seed facts and initial MLN model
• iterate:
• extract facts
• generate and select patterns
• refine and re-train MLN model (plus CRFs plus …)
BioSnowball:
• automatically creating biographical summaries
renlifang.msra.cn / entitycube.research.microsoft.com 76
NELL couples different learners
http://rtw.ml.cmu.edu/rtw/
Natural Language
Pattern Extractor
Table Extractor
Mutual exclusion
Type Check
Krzewski coaches the
Blue Devils.
Krzewski Blue Angels
Miller Red Angels
sports coach != scientist
If I coach, am I a coach?
Initial Ontology
[Carlson et al. 2010]
77
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Scope & Goal
Regex-based Extraction
Pattern-based Harvesting
Consistency Reasoning
Probabilistic Methods
Web-Table Methods
Web Tables can be annotated with YAGO [Limaye, Sarawagi, Chakrabarti: PVLDB 10]
Goal: enable semantic search over Web tables
Idea:
• Map column headers to Yago classes,
• Map cell values to Yago entities
• Using joint inference for factor-graph learning model
80
Title Author
A short history of time S Hawkins
D Adams Hitchhiker's guide
Book Person
Entity
hasAuthor
Statistics yield semantics of Web tables
[Venetis,Halevy et al: PVLDB 11]
Idea: Infer classes from co-occurrences, headers are class names
𝑃 𝑐𝑙𝑎𝑠𝑠 𝑣𝑎𝑙1, … , 𝑣𝑎𝑙𝑛 = 𝑃(𝑐𝑙𝑎𝑠𝑠|𝑣𝑎𝑙𝑖)
𝑃(𝑐𝑙𝑎𝑠𝑠)
Result from 12 Mio. Web tables:
• 1.5 Mio. labeled columns (=classes)
• 155 Mio. instances (=values)
Conference
81
City
Statistics yield semantics of Web tables
Idea: Infer facts from table rows, header identifies relation name
hasLocation(ThirdWorkshop, SanDiego)
but: classes&entities not canonicalized. Instances may include:
Google Inc., Google, NASDAQ GOOG, Google search engine, …
Jet Li, Li Lianjie, Ley Lin Git, Li Yangzhong, Nameless hero, … 82
Take-Home Lessons
For high precision, consistency reasoning is crucial:
Bootstrapping works well for recall
but details matter: seeds, counter-seeds,
pattern language, statistical confidence, etc.
Harness initial KB for distant supervision & efficiency:
seeds from KB, canonicalized entities with type contraints
Hand-crafted domain models are assets:
expressive constraints are vital, modeling is not a bottleneck,
but no out-of-model discovery
various methods incl. MaxSat, MLN/factor-graph MCMC, etc.
83
Open Problems and Grand Challenges
Real-time & incremental fact extraction
for continuous KB growth & maintenance (life-cycle management over years and decades)
Extensions to ternary & higher-arity relations
Efficiency and scalability of best methods for
(probabilistic) reasoning without losing accuracy
events in context: who did what to/with whom when where why …?
Robust fact extraction with both high precision & recall as highly automated (self-tuning) as possible
Large-scale studies for vertical domains
e.g. academia: researchers, publications, organizations,
collaborations, projects, funding, software, datasets, …
84
Big Data
Methods for
Knowledge
Harvesting
Knowledge
for Big Data
Analytics
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations Open Information Extraction
Relation Paraphrases
Big Data Algorithms
Discovering “Unknown” Knowledge
so far KB has relations with type signatures
<entity1, relation, entity2>
< CarlaBruni marriedTo NicolasSarkozy> Person R Person
< NataliePortman wonAward AcademyAward > Person R Prize
Open and Dynamic Knowledge Harvesting:
would like to discover new entities and new relation types
<name1, phrase, name2>
Madame Bruni in her happy marriage with the French president …
The first lady had a passionate affair with Stones singer Mick …
Natalie was honored by the Oscar …
Bonham Carter was disappointed that her nomination for the Oscar …
86
Open IE with ReVerb [A. Fader et al. 2011, T. Lin 2012]
Consider all verbal phrases as potential relations
and all noun phrases as arguments
Problem 1: incoherent extractions “New York City has a population of 8 Mio” <New York City, has, 8 Mio>
“Hero is a movie by Zhang Yimou” <Hero, is, Zhang Yimou>
Problem 2: uninformative extractions “Gold has an atomic weight of 196” <Gold, has, atomic weight>
“Faust made a deal with the devil” <Faust, made, a deal>
Solution:
• regular expressions over POS tags:
VB DET N PREP; VB (N | ADJ | ADV | PRN | DET)* PREP; etc.
• relation phrase must have # distinct arg pairs > threshold
Problem 3: over-specific extractions “Hero is the most colorful movie by Zhang Yimou”
<..., is the most colorful movie by, …>
http://ai.cs.washington.edu/demos 87
Open IE Example: ReVerb http://openie.cs.washington.edu/
?x „a song composed by“ ?y
88
Open IE Example: ReVerb http://openie.cs.washington.edu/
?x „a piece written by“ ?y
89
Diversity and Ambiguity of
Relational Phrases
Who covered whom?
Cave sang Hallelujah, his own song unrelated to Cohen‘s
Nina Simone‘s singing of Don‘t Explain revived Holiday‘s old song
Cat Power‘s voice is sad in her version of Don‘t Explain
Cale performed Hallelujah written by L. Cohen
16 Horsepower played Sinnerman, a Nina Simone original
Amy‘s souly interpretation of Cupid, a classic piece of Sam Cooke
Amy Winehouse‘s concert included cover songs by the Shangri-Las
Cave sang Hallelujah, his own song unrelated to Cohen‘s
Nina Simone‘s singing of Don‘t Explain revived Holiday‘s old song
Cat Power‘s voice is sad in her version of Don‘t Explain
Cale performed Hallelujah written by L. Cohen
16 Horsepower played Sinnerman, a Nina Simone original
Amy‘s souly interpretation of Cupid, a classic piece of Sam Cooke
Amy Winehouse‘s concert included cover songs by the Shangri-Las
{cover songs, interpretation of,
singing of, voice in, …} SingerCoversSong
{classic piece of, ‘s old song,
written by, composition of, …} MusicianCreatesSong 90
Scalable Mining of SOL Patterns
Syntactic-Lexical-Ontological (SOL) patterns
• Syntactic-Lexical: surface words, wildcards, POS tags
• Ontological: semantic classes as entity placeholders
<singer>, <musician>, <song>, …
• Type signature of pattern: <singer> <song>, <person> <song>
• Support set of pattern: set of entity-pairs for placeholders
support and confidence of patterns
SOL pattern: <singer> ’s ADJECTIVE voice * in <song>
Matching sentences: Amy Winehouse’s soul voice in her song ‘Rehab’
Jim Morrison’s haunting voice and charisma in ‘The End’
Joan Baez’s angel-like voice in ‘Farewell Angelina’
Support set:
(Amy Winehouse, Rehab)
(Jim Morrison, The End)
(Joan Baez, Farewell Angelina)
[N. Nakashole et al.: EMNLP-CoNLL’12, VLDB‘12]
91
Pattern Dictionary for Relations [N. Nakashole et al.: EMNLP-CoNLL’12, VLDB‘12]
WordNet-style dictionary/taxonomy for relational phrases
based on SOL patterns (syntactic-lexical-ontological)
“graduated from” “obtained degree in * from”
“and PRONOUN ADJECTIVE advisor” “under the supervision of”
Relational phrases can be synonymous
“wife of” “ spouse of”
<person> graduated from <university>
<singer> covered <song> <book> covered <event>
One relational phrase can subsume another
Relational phrases are typed
350 000 SOL patterns from Wikipedia, NYT archive, ClueWeb
http://www.mpi-inf.mpg.de/yago-naga/patty/
92
PATTY: Pattern Taxonomy for Relations [N. Nakashole et al.: EMNLP 2012, demo at VLDB 2012]
350 000 SOL patterns with 4 Mio. instances
accessible at: www.mpi-inf.mpg.de/yago-naga/patty 93
Big Data Algorithms at Work
Frequent sequence mining
with generalization hierarchy for tokens Examples: famous ADJECTIVE *
her PRONOUN *
<singer> <musician> <artist> <person>
Map-Reduce-parallelized on Hadoop:
• identify entity-phrase-entity occurrences in corpus
• compute frequent sequences
• repeat for generalizations
n-gram
mining taxonomy
construction
pattern
lifting
text pre-
processing
94
Take-Home Lessons
Scalable algorithms for extraction & mining
have been leveraged – but more work needed
Triples of the form <name, phrase, name>
can be mined at scale and
are beneficial for entity discovery
Semantic typing of relational patterns
and pattern taxonomies are vital assets
95
Open Problems and Grand Challenges
Integrate canonicalized KB with emerging knowledge
Cost-efficient crowdsourcing
for higher coverage & accuracy
Overcoming sparseness in input corpora and
coping with even larger scale inputs
Exploit relational patterns for
question answering over structured data
tap social media, query logs, web tables & lists, microdata, etc.
for richer & cleaner taxonomy of relational patterns
KB life-cycle: today‘s long tail may be tomorrow‘s mainstream
96
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
As Time Goes By: Temporal Knowledge
Which facts for given relations hold
at what time point or during which time intervals ?
marriedTo (Madonna, GuyRitchie) [ 22Dec2000, Dec2008 ]
capitalOf (Berlin, Germany) [ 1990, now ]
capitalOf (Bonn, Germany) [ 1949, 1989 ]
hasWonPrize (JimGray, TuringAward) [ 1998 ]
graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]
How can we query & reason on entity-relationship facts
in a “time-travel“ manner - with uncertain/incomplete KB ?
US president‘s wife when Steve Jobs died?
students of Hector Garcia-Molina while he was at Princeton?
98
Temporal Knowledge
for all people in Wikipedia (300 000) gather all spouses,
incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night
consistency constraints are potentially helpful: • functional dependencies: husband, time wife
• inclusion dependencies: marriedPerson adultPerson
• age/time/gender restrictions: birthdate + < marriage < divorce
1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency
PRAVDA for T-Facts from Text
1) Candidate gathering:
extract pattern & entities
of basic facts and
time expression
2) Pattern analysis:
use seeds to quantify
strength of candidates
3) Label propagation:
construct weighted graph
of hypotheses and
minimize loss function
4) Constraint reasoning:
use ILP for
temporal consistency
[Y. Wang et al. 2011]
102
Reasoning on T-Fact Hypotheses
Cast into evidence-weighted logic program
or integer linear program with 0-1 variables:
for temporal-fact hypotheses Xi
and pair-wise ordering hypotheses Pij
maximize wi Xi with constraints
• Xi + Xj 1
if Xi, Xj overlap in time & conflict
• Pij + Pji 1
• (1 Pij ) + (1 Pjk) (1 Pik)
if Xi, Xj, Xk must be totally ordered
• (1 Xi ) + (1 Xj) + 1 (1 Pij) + (1 Pji)
if Xi, Xj must be totally ordered
Temporal-fact hypotheses: m(Ca,Nic)@[2008,2012]{0.7}, m(Ca,Ben)@[2010]{0.8}, m(Ca,Mi)@[2007,2008]{0.2},
m(Cec,Nic)@[1996,2004]{0.9}, m(Cec,Nic)@[2006,2008]{0.8}, m(Nic,Ma){0.9}, …
[Y. Wang et al. 2012, P. Talukdar et al. 2012]
Efficient
ILP solvers: www.gurobi.com
IBM Cplex
…
103
TIE for T-Fact Extraction & Ordering [Ling/Weld : AAAI 2010]
TIE (Temporal IE) architectures builds on:
• TARSQI (Verhagen et al. 2005)
for event extraction, using linguistic analyses
• Markov Logic Networks
for temporal ordering of events
104
Take-Home Lessons
Temporal knowledge harvesting:
crucial for machine-reading news, social media, opinions
Combine linguistics, statistics, and logical reasoning:
harder than for „ordinary“ relations
105
Open Problems and Grand Challenges
Robust and broadly applicable methods for
temporal (and spatial) knowledge
populate time-sensitive relations comprehensively:
marriedTo, isCEOof, participatedInEvent, …
Understand temporal relationships in
biographies and narratives
machine-reading of news, bios, novels, …
106
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts
Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
NERD Problem
NED Principles
Coherence-based Methods
Rare & Emerging Entities
Three Different Problems
Harry fought with you know who. He defeats the dark lord.
1) named-entity recognition (NER): segment & label by CRF
(e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP
(trained classifier over linguistic features)
3) named-entity disambiguation (NED):
map each mention (name) to canonical entity (entry in KB)
Three NLP tasks:
Harry
Potter
Dirty
Harry
Lord
Voldemort
The Who
(band)
Prince Harry
of England
tasks 1 and 3 together: NERD 108
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
NERD Problem
NED Principles
Coherence-based Methods
Rare & Emerging Entities
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Named Entity Disambiguation
D5 Overview May 30, 2011
Sergio means Sergio_Leone Sergio means Serge_Gainsbourg Ennio means Ennio_Antonelli Ennio means Ennio_Morricone Eli means Eli_(bible) Eli means ExtremeLightInfrastructure Eli means Eli_Wallach Ecstasy means Ecstasy_(drug) Ecstasy means Ecstasy_of_Gold trilogy means Star_Wars_Trilogy trilogy means Lord_of_the_Rings trilogy means Dollars_Trilogy … … …
KB
Eli (bible)
Eli Wallach
Mentions
(surface names)
Entities
(meanings)
Dollars Trilogy
Lord of the Rings
Star Wars Trilogy
Benny Andersson
Benny Goodman
Ecstasy of Gold
Ecstasy (drug)
?
110
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Mention-Entity Graph
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
KB+Stats
weighted undirected graph with two types of nodes
Popularity (m,e): • freq(e|m)
• length(e)
• #links(e)
Similarity (m,e): • cos/Dice/KL
(context(m),
context(e))
bag-of-words or
language model:
words, bigrams,
phrases
111
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Mention-Entity Graph
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
KB+Stats
weighted undirected graph with two types of nodes
Popularity (m,e): • freq(e|m)
• length(e)
• #links(e)
Similarity (m,e): • cos/Dice/KL
(context(m),
context(e))
joint
mapping
112
Mention-Entity Graph
113 / 20
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy(drug)
Eli (bible)
Eli Wallach
KB+Stats
weighted undirected graph with two types of nodes
Popularity (m,e): • freq(m,e|m)
• length(e)
• #links(e)
Similarity (m,e): • cos/Dice/KL
(context(m),
context(e))
Coherence (e,e‘): • dist(types)
• overlap(links)
• overlap
(keyphrases)
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Mention-Entity Graph
114 / 20
KB+Stats
weighted undirected graph with two types of nodes
Popularity (m,e): • freq(m,e|m)
• length(e)
• #links(e)
Similarity (m,e): • cos/Dice/KL
(context(m),
context(e))
Coherence (e,e‘): • dist(types)
• overlap(links)
• overlap
(keyphrases)
American Jews film actors artists Academy Award winners
Metallica songs Ennio Morricone songs artifacts soundtrack music
spaghetti westerns film trilogies movies artifacts Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Mention-Entity Graph
115 / 20
KB+Stats
weighted undirected graph with two types of nodes
Popularity (m,e): • freq(m,e|m)
• length(e)
• #links(e)
Similarity (m,e): • cos/Dice/KL
(context(m),
context(e))
Coherence (e,e‘): • dist(types)
• overlap(links)
• overlap
(keyphrases)
http://.../wiki/Dollars_Trilogy http://.../wiki/The_Good,_the_Bad, _the_Ugly http://.../wiki/Clint_Eastwood http://.../wiki/Honorary_Academy_Award
http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/Metallica http://.../wiki/Bellagio_(casino) http://.../wiki/Ennio_Morricone http://.../wiki/Sergio_Leone http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/For_a_Few_Dollars_More http://.../wiki/Ennio_Morricone
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Mention-Entity Graph
116 / 20
KB+Stats Popularity (m,e): • freq(m,e|m)
• length(e)
• #links(e)
Similarity (m,e): • cos/Dice/KL
(context(m),
context(e))
Coherence (e,e‘): • dist(types)
• overlap(links)
• overlap
(keyphrases)
Metallica on Morricone tribute Bellagio water fountain show Yo-Yo Ma Ennio Morricone composition
The Magnificent Seven The Good, the Bad, and the Ugly Clint Eastwood University of Texas at Austin
For a Few Dollars More The Good, the Bad, and the Ugly Man with No Name trilogy soundtrack by Ennio Morricone
weighted undirected graph with two types of nodes
Dollars Trilogy
Lord of the Rings
Star Wars
Ecstasy of Gold
Ecstasy (drug)
Eli (bible)
Eli Wallach
Sergio talked to
Ennio about
Eli‘s role in the
Ecstasy scene.
This sequence on
the graveyard
was a highlight in
Sergio‘s trilogy
of western films.
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
NERD Problem
NED Principles
Coherence-based Methods
Rare & Emerging Entities
Joint Mapping
• Build mention-entity graph or joint-inference factor graph
from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or
dense subgraph such that:
each m is connected to exactly one e (or at most one e)
90
30
5 100
100
50
20 50
90
80
90 30
10 10
20
30
30
118
Joint Mapping: Prob. Factor Graph
90
30
5 100
100
50
20 50
90
80
90 30
10 10
20
30
30
Collective Learning with Probabilistic Factor Graphs [Chakrabarti et al.: KDD’09]:
• model P[m|e] by similarity and P[e1|e2] by coherence
• consider likelihood of P[m1 … mk | e1 … ek]
• factorize by all m-e pairs and e1-e2 pairs
• use MCMC, hill-climbing, LP etc. for solution
119
Joint Mapping: Dense Subgraph
• Compute dense subgraph such that:
each m is connected to exactly one e (or at most one e)
• NP-hard approximation algorithms
• Alt.: feature engineering for similarity-only method
[Bunescu/Pasca 2006, Cucerzan 2007, Milne/Witten 2008, …]
90
30
5 100
100
50
20 50
90
80
90 30
10 10
20
30
30
120
Coherence Graph Algorithm
• Compute dense subgraph to
maximize min weighted degree among entity nodes
such that:
each m is connected to exactly one e (or at most one e)
• Greedy approximation:
iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
90
30
5 100
100
50 50
90
80
90 30
10
20
10
20
30
30
[J. Hoffart et al.: EMNLP‘11] 140
180
50
470
145
230
121
Random Walks Algorithm
• for each mention run random walks with restart
(like personalized PageRank with jumps to start mention(s))
• rank candidate entities by stationary visiting probability
• very efficient, decent accuracy
50
90
80
90 30
10
20
10
0.83
0.7
0.4
0.75 0.15
0.17
0.2
0.1
90
30
5 100
100
50
30
30 20
0.75
0.25
0.04 0.96
0.77
0.5
0.23
0.3 0.2
122
Mention-Entity Popularity Weights
• Collect hyperlink anchor-text / link-target pairs from • Wikipedia redirects
• Wikipedia links between articles and Interwiki links
• Web links pointing to Wikipedia articles
• query-and-click logs
…
• Build statistics to estimate P[entity | name]
• Need dictionary with entities‘ names: • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corp.
• short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
• nicknames & aliases: Terminator, City of Angels, Evil Empire, …
• acronyms: LA, UCLA, MS, MSFT
• role names: the Austrian action hero, Californian governor, CEO of MS, …
…
plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her.
[Milne/Witten 2008, Spitkovsky/Chang 2012]
123
Mention-Entity Similarity Edges
Extent of partial matches Weight of matched words
Precompute characteristic keyphrases q for each entity e:
anchor texts or noun phrases in e page with high PMI:
)(
)(
),()(~)|(
mcontextin
ekeyphrasesq
mcover(q)distqscoremescore
1
)|(#
~)|(
qw
cover(q)w
e)|weight(w
ewweight
cover(q)oflength
wordsmatchingeqscore
)()(
),(log),(
efreqqfreq
eqfreqeqweight
Match keyphrase q of candidate e in context of mention m
Compute overall similarity of context(m) and candidate e
„Metallica tribute to Ennio Morricone“
The Ecstasy piece was covered by Metallica on the Morricone tribute album.
124
Entity-Entity Coherence Edges Precompute overlap of incoming links for entities e1 and e2
))2(),1(min(log||log
))2()1(log())2,1(max(log1
eineinE
eineineein~e2)coh(e1,-mw
Alternatively compute overlap of keyphrases for e1 and e2
or overlap of keyphrases, or similarity of bag-of-words, or …
)2()1(
)2()1(
engramsengrams
engramsengrams~e2)coh(e1,-ngram
Optionally combine with type distance of e1 and e2
(e.g., Jaccard index for type instances)
For special types of e1 and e2 (locations, people, etc.)
use spatial or temporal distance 125
AIDA: Accurate Online Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/aida/
http://www.mpi-inf.mpg.de/yago-naga/aida/
AIDA: Very Difficult Example
NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER),
now with NED mappings to Yago and Freebase
• difficult texts:
… Australia beats India … Australian_Cricket_Team
… White House talks to Kreml … President_of_the_USA
… EDS made a contract with … HP_Enterprise_Services
Results:
Best: AIDA method with prior+sim+coh + robustness test
82% precision @100% recall, 87% mean average precision
Comparison to other methods, see [Hoffart et al.: EMNLP‘11]
see also [P. Ferragina et al.: WWW’13] for NERD benchmarks
128
NERD Online Tools J. Hoffart et al.: EMNLP 2011, VLDB 2011
https://d5gate.ag5.mpi-sb.mpg.de/webaida/
P. Ferragina, U. Scaella: CIKM 2010
http://tagme.di.unipi.it/
R. Isele, C. Bizer: VLDB 2012
http://spotlight.dbpedia.org/demo/index.html
Reuters Open Calais: http://viewer.opencalais.com/
Alchemy API: http://www.alchemyapi.com/api/demo.html
S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
http://www.cse.iitb.ac.in/soumen/doc/CSAW/
D. Milne, I. Witten: CIKM 2008
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL 2011
http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier
some use Stanford NER tagger for detecting mentions
http://nlp.stanford.edu/software/CRF-NER.shtml
129
Coherence-aware Feature Engineering [Cucerzan: EMNLP 2007; Milne/Witten: CIKM 2008, Art.Int. 2013]
• Avoid explicit coherence computation by turning
other mentions‘ candidate entities into features
• sim(m,e) uses these features in context(m)
• special case: consider only unambiguous mentions
or high-confidence entities (in proximity of m)
m e
influence in
context(m)
weighted by
coh(e,ei) and
pop(ei)
130
TagMe: NED with Light-Weight Coherence [P. Ferragina et al.: CIKM‘10, WWW‘13]
• Reduce combinatorial complexity by using
avg. coherence of other mentions‘ candidate entities
• for score(m,e) compute
avg ei cand(mj) coherence (ei ,e) popularity (ei | mj)
then sum up over all mj m („voting“)
m e
mj
e1
e2
e3
131
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
NERD Problem
NED Principles
Coherence-based Methods
Rare & Emerging Entities
Long-Tail and Emerging Entities
last.fm/Nick_Cave/Weeping_Song
wikipedia.org/Weeping_(song)
wikipedia.org/Nick_Cave
last.fm/Nick_Cave/O_Children
last.fm/Nick_Cave/Hallelujah
wikipedia/Hallelujah_(L_Cohen)
wikipedia/Hallelujah_Chorus
wikipedia/Children_(2011 film)
wikipedia.org/Good_Luck_Cave
Cave
composed
haunting
songs like
Hallelujah,
O Children,
and the
Weeping Song.
[J. Hoffart et al.: CIKM’12] 133
Long-Tail and Emerging Entities
last.fm/Nick_Cave/Weeping_Song
wikipedia.org/Weeping_(song)
wikipedia.org/Nick_Cave
last.fm/Nick_Cave/O_Children
last.fm/Nick_Cave/Hallelujah
wikipedia/Hallelujah_(L_Cohen)
wikipedia/Hallelujah_Chorus
wikipedia/Children_(2011 film)
wikipedia.org/Good_Luck_Cave
Cave
composed
haunting
songs like
Hallelujah,
O Children,
and the
Weeping Song.
Gunung Mulu National Park Sarawak Chamber largest underground chamber
eerie violin Bad Seeds No More Shall We Part
Bad Seeds No More Shall We Part Murder Songs
Leonard Cohen Rufus Wainwright Shrek and Fiona
Nick Cave & Bad Seeds Harry Potter 7 movie haunting choir
Nick Cave Murder Songs P.J. Harvey Nick and Blixa duet
Messiah oratorio George Frideric Handel
Dan Heymann apartheid system
South Korean film
KO (p,q) = 𝒎𝒊𝒏(𝒘𝒆𝒊𝒈𝒉𝒕 𝒕 𝒊𝒏 𝒑 ,𝒘𝒆𝒊𝒈𝒉𝒕 𝒕 𝒊𝒏 𝒒 )𝒕
𝒎𝒂𝒙(𝒘𝒆𝒊𝒈𝒉𝒕 𝒕 𝒊𝒏 𝒑 ,𝒘𝒆𝒊𝒈𝒉𝒕 𝒕 𝒊𝒏 𝒒 )𝒕
KORE (e,f) ~ )𝑲𝑶(𝒑, 𝒒)𝟐 × 𝒎𝒊𝒏(𝒘𝒆𝒊𝒈𝒉𝒕 𝒑 𝒊𝒏 𝒆 , 𝒘𝒆𝒊𝒈𝒉𝒕 𝒒 𝒊𝒏 𝒇 )𝒑∈𝒆,𝒒∈𝒇
implementation uses min-hash and LSH [J. Hoffart et al.: CIKM‘12]
Long-Tail and Emerging Entities
any OTHER „Mermaids“
…/The Little Mermaid
wikipedia.org/Nick_Cave
…/Mermaid‘s Song
…/Water‘s Edge (2003 film)
…/Water‘s Edge Restaurant
any OTHER „Water‘s Edge“
wikipedia.org/Good_Luck_Cave
Cave‘s
brand-new
album
contains
masterpieces
like
Water‘s Edge
and
Mermaids.
Bad Seeds No More Shall We Part Murder Songs
excellent seafood clam chowder Maine lobster
Walt Disney Hans Chrisitan Andersen Kiss the Girl
Gunung Mulu National Park Sarawak Chamber largest underground chamber
Nathan Fillion horrible acting
all phrases minus keyphrases of known candidate entities
all phrases minus keyphrases of known candidate entities
Pirates of the Caribbean 4 My Jolly Sailor Bold Johnny Depp
Semantic Typing of Emerging Entities
Given triples (x, p, y) with new x,y
and all type triples (t1, p, t2) for known entities:
• score (x,t) ~ p:(x,p,y) P [t | p,y] + p:(y,p,x) P [t | p,y]
• corr(t1,t2) ~ Pearson coefficient [-1,+1]
Problem: what to do with newly emerging entities
Idea: infer their semantic types using PATTY patterns
For each new e and all candidate types ti:
max i score(e,ti) Xi + ij corr(ti,tj) Yij
s.t. Xi, Yij {0,1} and Yij Xi and Yij Xj and Xi + Xj – 1 Yij
Sandy threatens to hit New York
Nive Nielsen and her band performing Good for You
Nive Nielsen‘s warm voice in Good for You
[N. Nakashole et al.: ACL 2013, T. Lin et al.: EMNLP 2012]
Big Data Algorithms at Work
Web-scale keyphrase mining
Web-scale entity-entity statistics
MAP on large factor graph or
dense subgraphs in large graph
data+text queries on huge KB or LOD
Applications to large-scale input batches:
• discover all musicians in a week‘s social media postings
• identify all diseases & drugs in a month‘s publications
• track a (set of) politician(s) in a decade‘s news archive 137
Take-Home Lessons NERD is key for contextual knowledge
High-quality NERD uses joint inference over various features:
popularity + similarity + coherence
State-of-the-art tools available
Maturing now, but still room for improvement,
especially on efficiency, scalability & robustness
Good approaches, more work needed
Handling out-of-KB entities & long-tail NERD
138
Open Problems and Grand Challenges
Robust disambiguation of entities, relations and classes
Relevant for question answering & question-to-query translation
Key building block for KB building and maintenance
Entity name disambiguation in difficult situations
Short and noisy texts about long-tail entities in social media
Word sense disambiguation in natural-language dialogs
Relevant for multimodal human-computer interactions
(speech, gestures, immersive environments)
Efficient interactive & high-throughput batch NERD
a day‘s news, a month‘s publications, a decade‘s archive
139
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
There are many public knowledge bases
60 Bio. triples
500 Mio. links 143
rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301
geonames.org/5134301/city_of_rome
N 43° 12' 46'' W 75° 27' 20''
dbpedia.org/resource/Rome
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
yago/wordnet: Artist109812338
imdb.com/name/nm0910607/
Link equivalent entities across KBs
imdb.com/title/tt0361748/
dbpedia.org/resource/Ennio_Morricone
144
rdf.freebase.com/ns/en.rome_ny data.nytimes.com/51688803696189142301
geonames.org/5134301/city_of_rome
N 43° 12' 46'' W 75° 27' 20''
dbpedia.org/resource/Rome
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
yago/wordnet: Artist109812338
imdb.com/name/nm0910607/
imdb.com/title/tt0361748/
dbpedia.org/resource/Ennio_Morricone
Referential data quality?
hand-crafted sameAs links?
generated sameAs links? ?
? ?
Link equivalent entities across KBs
145
Record Linkage between Databases
Susan B. Davidson
Peter Buneman
University of Pennsylvania
Yi Chen
record 1
O.P. Buneman
S. Davison
U Penn
Y. Chen
record 2
P. Baumann
S. Davidson
Penn State
Cheng Y.
record 3 …
Halbert L. Dunn: Record Linkage. American Journal of Public Health. 1946
H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959.
I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statist. Soc., 1969.
Goal: Find equivalence classes of entities, and of records
Techniques:
• similarity of values (edit distance, n-gram overlap, etc.)
• joint agreement of linkage
• similarity joins, grouping/clustering, collective learning, etc.
• often domain-specific customization (similarity measures etc.)
146
Linking Records vs. Linking Knowledge
Susan B. Davidson
Peter Buneman
University of Pennsylvania
Yi Chen
record KB / Ontology
university
Differences between DB records and KB entities:
• Ontological links have rich semantics (e.g. subclassOf)
• Ontologies have only binary predicates
• Ontologies have no schema
• Match not just entities,
but also classes & predicates (relations)
147
Similarity of entities depends on
similarity of neighborhoods
KB 1 KB 2
sameAs ?
?
?
x1 x2
y1
y2
sameAs(x1, x2) depends on sameAs(y1, y2)
which depends on sameAs(x1, x2)
148
Similarity Flooding matches entities at scale
Build a graph:
nodes: pairs of entities, weighted with similarity
edges: weighted with degree of relatedness
similarity: 0.9 similarity: 0.7
relatedness
0.8
Iterate until convergence:
similarity := weighted sum of neighbor similarities
similarity: 0.8
many variants (belief propagation, label propagation, etc.), e.g. SigMa 152
Some neighborhoods are more indicative
1935 1935
"Elvis" "Elvis"
sameAs
sameAs ?
sameAs
Many people born in 1935
not indicative
Few people called "Elvis"
highly indicative
153
Inverse functionality as indicativeness
1935 1935
"Elvis" "Elvis"
sameAs
sameAs ?
sameAs
𝒊𝒇𝒖𝒏 𝒓, 𝒚 =𝟏
| 𝒙: 𝒓 𝒙, 𝒚 |
𝒊𝒇𝒖𝒏 𝒃𝒐𝒓𝒏, 𝟏𝟗𝟑𝟓 =𝟏
𝟓
𝒊𝒇𝒖𝒏 𝒓 = 𝑯𝑴𝒚 𝒊𝒇𝒖𝒏(𝒓, 𝒚)
𝒊𝒇𝒖𝒏 𝒃𝒐𝒓𝒏 = 𝟎. 𝟎𝟏
𝒊𝒇𝒖𝒏 𝒍𝒂𝒃𝒆𝒍 = 𝟎. 𝟗
The higher the inverse functionality of r for r(x,y), r(x',y),
the higher the likelihood that x=x'.
𝒊𝒇𝒖𝒏 𝒓 = 𝟏 ⇒ 𝒙 = 𝒙′ [Suchanek et al.: VLDB’12]
154
PARIS matches entities, classes & relations
Goal:
given 2 ontologies, match entities, relations, and classes
Define
P(x y) := probability that entities x and y are the same
P(p r) := probability that relation p subsumes r
P(c d) := probability that class c subsumes d
Initialize
P(x y) := similarity if x and y are literals, else 0
P(p r) := 0.001
Iterate until convergence
P(x y) := 𝟒𝟐𝛁𝑒−𝑖𝜔𝑡 … 𝑷(𝒑 𝒓)
P(p r) := 𝝑ℵ + 𝑌1𝑛 …𝑷(𝒙 𝒚)
Compute
P(c d) := ratio of instances of d that are in c
Recursive
dependency
[Suchanek et al.: VLDB’12]
156
PARIS matches entities, classes & relations
Goal:
given 2 ontologies, match entities, relations, and classes
Define
P(x y) := probability that entities x and y are the same
P(p r) := probability that relation p subsumes r
P(c d) := probability that class c subsumes d
Initialize
P(x y) := similarity, if x and y are literals, else 0
P(p r) := 0.001
Iterate until convergence
P(x y) := 𝟒𝟐𝛁𝑒−𝑖𝜔𝑡 … 𝑷(𝒑 > 𝒓)
P(p r) := 𝝑ℵ + 𝑌1𝑛 …𝑷(𝒙 = 𝒚)
Compute
P(c d) := ratio of instances of d that are in c
Recursive
dependency
[Suchanek et al.: VLDB’12]
PARIS matches YAGO and DBpedia
• time: 1:30 hours
• precision for instances: 90%
• precision for classes: 74%
• precision for relations: 96%
157
Many challenges remain
Entity linkage is at the heart of semantic data integration.
More than 50 years of research, still some way to go!
Benchmarks: • OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org
• TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/
• TREC Knowledge Base Acceleration: trec-kba.org
• Highly related entities with ambiguous names
George W. Bush (jun.) vs. George H.W. Bush (sen.)
• Long-tail entities with sparse context
• Enterprise data (perhaps combined with Web2.0 data)
• Entities with very noisy context (in social media)
• Records with complex DB / XML / OWL schemas
• Ontologies with non-isomorphic structures
158
Take-Home Lessons Web of Linked Data is great
100‘s of KB‘s with 30 Bio. triples and 500 Mio. links
mostly reference data, dynamic maintenance is bottleneck
connection with Web of Contents needs improvement
Entity resolution & linkage is key
for creating sameAs links in text (RDFa, microdata)
for machine reading, semantic authoring,
knowledge base acceleration, …
Integrated methods for aligning entities, classes and relations
Linking entities across KB‘s is advancing
159
Open Problems and Grand Challenges
Automatic and continuously maintained sameAs links
for Web of Linked Data with high accuracy & coverage
Combine algorithms and crowdsourcing
with active learning, minimizing human effort or cost/accuracy
Web-scale, robust ER with high quality
Handle huge amounts of linked-data sources, Web tables, …
160
Outline
Linked Knowledge: Entity Matching
Motivation and Overview
Wrap-up
Taxonomic Knowledge: Entities and Classes
Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Factual Knowledge: Relations between Entities
Emerging Knowledge: New Entities & Relations
Summary
• Knowledge Bases from Web are Real, Big & Useful:
Entities, Classes & Relations
• Key Asset for Intelligent Applications: Semantic Search, Question Answering, Machine Reading, Digital Humanities,
Text&Data Analytics, Summarization, Reasoning, Smart Recommendations, …
• Harvesting Methods for Entities & Classes Taxonomies
• Methods for extracting Relational Facts
• NERD & ER: Methods for Contextual & Linked Knowledge
• Rich Research Challenges & Opportunities: scale & robustness; temporal, multimodal, commonsense;
open & real-time knowledge discovery; …
• Models & Methods from Different Communities: DB, Web, AI, IR, NLP
162
Knowledge Bases in the Big Data Era
Tapping unstructured data
Connecting structured & unstructured data sources
Discovering data sources
Scalable algorithms
Distributed platforms
Making sense of heterogeneous, dirty, or uncertain data
Big Data Analytics
Knowledge
Bases: entities,
relations,
time, space,
…
163
see comprehensive list in
Fabian Suchanek and Gerhard Weikum:
Knowledge Harvesting in the Big-Data Era,
Proceedings of the ACM SIGMOD
International Conference on Management of Data,
New York, USA, June 22-27, 2013,
Association for Computing Machinery, 2013.
References
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/ 164
Take-Home Message:
From Web & Text to Knowledge
Web & Text Knowledge
analysis
acquisition
synthesis
interpretation
Knowledge
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/
Thank You !
http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/ 166