Fabian Suchanek & Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany http://suchanek.name/ http://www.mpi-inf.mpg.de/~weikum/ Knowledge Harvesting in the Big Data Era http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/


Turn Web into Knowledge Base

KB Population

Info Extraction Semantic Authoring

Entity Linkage

Web of Data

Web of Users & Contents

Very Large Knowledge Bases

Semantic Docs

Disambiguation

2

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Web of Data: RDF, Tables, Microdata 60 Bio. SPO triples (RDF) and growing

• 10M entities in 350K classes
• 120M facts for 100 relations
• 100 languages
• 95% accuracy

• 4M entities in 250 classes
• 500M facts for 6000 properties
• live updates

• 25M entities in 2000 topics
• 100M facts for 4000 properties
• powers Google knowledge graph

Ennio_Morricone type composer
Ennio_Morricone type GrammyAwardWinner
composer subclassOf musician
Ennio_Morricone bornIn Rome
Rome locatedIn Italy
Ennio_Morricone created Ecstasy_of_Gold
Ennio_Morricone wroteMusicFor The_Good,_the_Bad_and_the_Ugly
Sergio_Leone directed The_Good,_the_Bad_and_the_Ugly

4

History of Knowledge Bases

Doug Lenat: "The more you know, the more (and faster) you can learn."

Cyc project (1984-1994) cont‘d by Cycorp Inc.

∀x: human(x) ⇒ (male(x) ∨ female(x))

∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))

∀x: mammal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))

∀x: human(x) ⇒ (∃y: mother(x,y) ∧ ∃z: father(x,z))

∀x ∀e: human(x) ∧ remembers(x,e) ⇒ happened(e) < now

George Miller, Christiane Fellbaum

WordNet project

(1985-now)

Cyc and WordNet are hand-crafted knowledge bases

Some Publicly Available Knowledge Bases

YAGO: yago-knowledge.org

DBpedia: dbpedia.org

Freebase: freebase.com

Entitycube: research.microsoft.com/en-us/projects/entitycube/

NELL: rtw.ml.cmu.edu

DeepDive: research.cs.wisc.edu/hazy/demos/deepdive/index.php/Steve_Irwin

Probase: research.microsoft.com/en-us/projects/probase/

KnowItAll / ReVerb: openie.cs.washington.edu

reverb.cs.washington.edu

PATTY: www.mpi-inf.mpg.de/yago-naga/patty/

BabelNet: lcl.uniroma1.it/babelnet

WikiNet: www.h-its.org/english/research/nlp/download/wikinet.php

ConceptNet: conceptnet5.media.mit.edu

WordNet: wordnet.princeton.edu

Linked Open Data: linkeddata.org

6

Knowledge for Intelligence

Enabling technology for:

disambiguation in written & spoken natural language

deep reasoning (e.g. QA to win quiz game)

machine reading (e.g. to summarize book or corpus)

semantic search in terms of entities&relations (not keywords&pages)

entity-level linkage for the Web of Data

European composers who have won film music awards?

East coast professors who founded Internet companies?

Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure? ...

Politicians who are also scientists?

Relationships between John Lennon, Billie Holiday, Heath Ledger, King Kong?

7

Use Case: Question Answering

This town is known as "Sin City" & its

downtown is "Glitter Gulch"

This American city has two airports

named after a war hero and a WW II battle

knowledge

back-ends

question

classification &

decomposition

D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010.

IBM Journal of R&D 56(3/4), 2012: This is Watson.

Q: Sin City ?

movie, graphic novel, nickname for city, …

A: Vegas ? Strip ?

Vega (star), Suzanne Vega, Vincent Vega, Las Vegas, …

comic strip, striptease, Las Vegas Strip, …

8

Use Case: Text Analytics

K. Goh, M. Cusick, D. Valle, B. Childs, M. Vidal, A.-L. Barabási: The Human Disease Network, PNAS, May 2007

But try this with: diabetes mellitus, diabetes type 1, diabetes type 2, diabetes insipidus,

insulin-dependent diabetes mellitus with ophthalmic complications,

ICD-10 E23.2, OMIM 304800, MeSH C18.452.394.750, MeSH D003924, …

Use Case: Big Data+Text Analytics

Entertainment: Who covered which other singer? Who influenced which other musicians?

Health: Drugs (combinations) and their side effects

Politics: Politicians' positions on controversial topics and their involvement with industry

Business: Customer opinions on small-company products, gathered from social media

General Design Pattern:

• Identify relevant content sources

• Identify entities of interest & their relationships

• Position in time & space

• Group and aggregate

• Find insightful patterns & predict trends

10

Spectrum of Machine Knowledge (1)

factual knowledge: bornIn (SteveJobs, SanFrancisco), hasFounded (SteveJobs, Pixar),

hasWon (SteveJobs, NationalMedalOfTechnology), livedIn (SteveJobs, PaloAlto)

taxonomic knowledge (ontology): instanceOf (SteveJobs, computerArchitects), instanceOf(SteveJobs, CEOs)

subclassOf (computerArchitects, engineers), subclassOf(CEOs, businesspeople)

lexical knowledge (terminology): means (“Big Apple“, NewYorkCity), means (“Apple“, AppleComputerCorp)

means (“MS“, Microsoft) , means (“MS“, MultipleSclerosis)

contextual knowledge (entity occurrences, entity-name disambiguation) maps (“Gates and Allen founded the Evil Empire“,

BillGates, PaulAllen, MicrosoftCorp)

linked knowledge (entity equivalence, entity resolution): hasFounded (SteveJobs, Apple), isFounderOf (SteveWozniak, AppleCorp)

sameAs (Apple, AppleCorp), sameAs (hasFounded, isFounderOf)

11

Spectrum of Machine Knowledge (2)

multi-lingual knowledge: meansInChinese („乔戈里峰“, K2), meansInUrdu („کے ٹو“, K2)

meansInFr („école“, school (institution)), meansInFr („banc“, school (of fish))

temporal knowledge (fluents): hasWon (SteveJobs, NationalMedalOfTechnology)@1985

marriedTo (AlbertEinstein, MilevaMaric)@[6-Jan-1903, 14-Feb-1919]

presidentOf (NicolasSarkozy, France)@[16-May-2007, 15-May-2012]

spatial knowledge: locatedIn (YumbillaFalls, Peru), instanceOf (YumbillaFalls, TieredWaterfalls)

hasCoordinates (YumbillaFalls, 5°55‘11.64‘‘S 77°54‘04.32‘‘W ),

closestTown (YumbillaFalls, Cuispes), reachedBy (YumbillaFalls, RentALama)

12

Spectrum of Machine Knowledge (3)

ephemeral knowledge (dynamic services):

wsdl:getSongs (musician ?x, song ?y), wsdl:getWeather (city?x, temp ?y)

common-sense knowledge (properties):

hasAbility (Fish, swim), hasAbility (Human, write),

hasShape (Apple, round), hasProperty (Apple, juicy),

hasMaxHeight (Human, 2.5 m)

common-sense knowledge (rules):

∀x: human(x) ⇒ (male(x) ∨ female(x))

∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))

∀x: human(x) ⇒ (∃y: mother(x,y) ∧ ∃z: father(x,z))

∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))

13

Spectrum of Machine Knowledge (4)

emerging knowledge (open IE):

hasWon (MerylStreep, AcademyAward)

occurs („Meryl Streep“, „celebrated for“, „Oscar for Best Actress“)

occurs („Quentin“, „nominated for“, „Oscar“)

multimodal knowledge (photos, videos):

JimGray

JamesBruceFalls

social knowledge (opinions):

admires (maleTeen, LadyGaga), supports (AngelaMerkel, HelpForGreece)

epistemic knowledge ((un-)trusted beliefs):

believe(Ptolemy,hasCenter(world,earth)),

believe(Copernicus,hasCenter(world,sun))

believe (peopleFromTexas, bornIn(BarackObama,Kenya))

?

14

Knowledge Bases in the Big Data Era

Scalable algorithms

Distributed platforms

Big Data Analytics

Tapping unstructured data

Connecting structured & unstructured data sources

Discovering data sources

Making sense of heterogeneous, dirty, or uncertain data

Knowledge Bases: entities, relations, time, space, …

15

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Big Data Methods for Knowledge Harvesting

Knowledge for Big Data Analytics

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Time of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Wikipedia-centric Methods

Web-based Methods

Knowledge Bases are labeled graphs

[Figure: example knowledge graph with class nodes (singer, person, resource, location, city), an instance node Tupelo, and labeled edges type, subclassOf, bornIn. Legend: Classes/Concepts/Types; Instances/Entities; Relations/Predicates.]

A knowledge base can be seen as a directed labeled multi-graph, where the nodes are entities and the edges relations.

18

An entity can have different labels

[Figure: an entity with type edges to singer and person, and label edges to "Elvis" and "The King".]

The same label for two entities: ambiguity

The same entity has two labels: synonymy

19

Different views of a knowledge base

Graph notation: Elvis -type-> singer, Elvis -bornIn-> Tupelo

Logical notation: type(Elvis, singer), bornIn(Elvis, Tupelo), ...

Triple notation:
Subject  Predicate  Object
Elvis    type       singer
Elvis    bornIn     Tupelo
...      ...        ...

We use "RDFS Ontology" and "Knowledge Base (KB)" synonymously.

20
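To make the three views concrete, here is a minimal Python sketch (illustrative only; the facts follow the Elvis example above) that stores one set of facts as triples and prints the same content in triple, logical, and graph-like form.

```python
# Minimal sketch: one set of facts, three equivalent views (triples, logic, graph).
triples = [
    ("Elvis", "type", "singer"),
    ("Elvis", "bornIn", "Tupelo"),
]

# Triple notation: subject-predicate-object rows.
for s, p, o in triples:
    print(f"{s:10} {p:10} {o}")

# Logical notation: predicate(subject, object).
print(", ".join(f"{p}({s},{o})" for s, p, o in triples))

# Graph notation: adjacency list with labeled edges.
graph = {}
for s, p, o in triples:
    graph.setdefault(s, []).append((p, o))
print(graph)  # {'Elvis': [('type', 'singer'), ('bornIn', 'Tupelo')]}
```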

Our Goal is finding classes and instances

Which entities exist?

Which classes exist? (aka entity types, unary predicates, concepts)

Which entities belong to which classes? (type)

Which subsumptions hold? (subclassOf)

21

WordNet is a lexical knowledge base

WordNet project

(1985-now)

[Figure: singer subclassOf person subclassOf living being; person has labels "person", "individual", "soul".]

WordNet contains 82,000 classes

WordNet contains 118,000 class labels

WordNet contains thousands of subclassOf relationships

22

WordNet example: superclasses

23

WordNet example: subclasses

24

WordNet example: instances

only 32 singers !?

4 guitarists

5 scientists

0 enterprises

2 entrepreneurs

WordNet classes lack instances

25

Goal is to go beyond WordNet

WordNet is not perfect:

• it contains only few instances

• it contains only common nouns as classes

• it contains only English labels

... but it contains a wealth of information that can be

the starting point for further extraction.

26

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Basics & Goal

Wikipedia-centric Methods

Web-based Methods

Wikipedia is a rich source of instances

Larry Sanger

Jimmy Wales

28

Wikipedia's categories contain classes

But: categories do not form a taxonomic hierarchy

29

Link Wikipedia categories to WordNet?

American billionaires

Technology company founders

Apple Inc.

Deaths from cancer

Internet pioneers

tycoon, magnate

entrepreneur

pioneer, innovator ?

pioneer, colonist

?

Wikipedia categories WordNet classes

30

Categories can be linked to WordNet

Wikipedia category: American people of Syrian descent

Noun-group parsing: pre-modifier | head | post-modifier (the head has to be plural)

Stemming: people -> person

Most frequent WordNet meaning: person
(candidate WordNet words: "person", "singer", "people", "descent")

=> link the Wikipedia category to the WordNet class person

31

YAGO = WordNet+Wikipedia

[Figure: Steve Jobs -type-> American people of Syrian descent (Wikipedia category) -subclassOf-> person -subclassOf-> organism (WordNet).]

YAGO: 200,000 classes, 460,000 subclassOf links, 3 Mio. instances, 96% accuracy [Suchanek et al.: WWW'07]

Related project: WikiTaxonomy: 105,000 subclassOf links, 88% accuracy [Ponzetto & Strube: AAAI'07]

32

Link Wikipedia & WordNet by Random Walks

[Navigli 2010]

[Figure: bipartite neighborhood graph connecting Wikipedia categories (Formula One drivers, Formula One champions, truck drivers, motor racing, Michael Schumacher, Barney Oldfield) to WordNet senses ({driver, device driver}, {driver, operator of vehicle}, chauffeur, race driver, trucker, computer program, tool, causal agent).]

• construct neighborhood around source and target nodes
• use contextual similarity (glosses etc.) as edge weights
• compute personalized PageRank (PPR) with source as start node
• rank candidate targets by their PPR scores

Wikipedia categories vs. WordNet classes

33

Learning More Mappings [ Wu & Weld: WWW‘08 ]

Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using

• YAGO as training data

• advanced ML methods (SVM‘s, MLN‘s)

• rich features from various sources

• category/class name similarity measures

• category instances and their infobox templates:

template names, attribute names (e.g. knownFor)

• Wikipedia edit history:

refinement of categories

• Hearst patterns:

C such as X, X and Y and other C‘s, …

• other search-engine statistics:

co-occurrence frequencies

> 3 Mio. entities, > 1 Mio. w/ infoboxes, > 500,000 categories

34

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Basics & Goal

Wikipedia-centric Methods

Web-based Methods

35

Hearst patterns extract instances from text [M. Hearst 1992]

Hearst defined lexico-syntactic patterns for type relationship:

X such as Y; X like Y;

X and other Y; X including Y;

X, especially Y;

companies such as Apple

Google, Microsoft and other companies

Internet companies like Amazon and Facebook

Chinese cities including Kunming and Shangri-La

computer pioneers like the late Steve Jobs

computer pioneers and other scientists

lakes in the vicinity of Brisbane

type(Apple, company), type(Google, company), ...

Find such patterns in text: //better with POS tagging

Goal: find instances of classes

Derive type(Y,X)

36
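A minimal Python sketch of Hearst-style extraction (illustrative only; real systems use POS tagging and noun-phrase chunking rather than plain regexes, and the sample sentences are the ones from the slide):

```python
import re

# Very simplified Hearst-pattern matcher: "X such as Y", "X like Y", "X including Y".
# Real implementations operate on POS-tagged noun phrases, not raw token regexes.
PATTERNS = [
    re.compile(r"(\w+) such as ([A-Z]\w+)"),
    re.compile(r"(\w+) like ([A-Z]\w+)"),
    re.compile(r"(\w+) including ([A-Z]\w+)"),
]

def hearst_extract(sentence):
    facts = []
    for pat in PATTERNS:
        for cls, inst in pat.findall(sentence):
            facts.append(("type", inst, cls))   # derive type(Y, X)
    return facts

print(hearst_extract("companies such as Apple"))          # [('type', 'Apple', 'companies')]
print(hearst_extract("Internet companies like Amazon"))   # [('type', 'Amazon', 'companies')]
```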

Recursively applied patterns increase recall [Kozareva/Hovy 2010]

use results from Hearst patterns as seeds

then use „parallel-instances“ patterns

X such as Y: "companies such as Apple", "companies such as Google"

Y like Z: "Apple like Microsoft offers ...", "Microsoft like SAP sells ..."
*, Y and Z: "IBM, Google, and Amazon", "eBay, Amazon, and Facebook"

Y like Z / *, Y and Z: "Cherry, Apple, and Banana"
=> potential problems with ambiguous words

37

Doubly-anchored patterns are more robust [Kozareva/Hovy 2010, Dalvi et al. 2012]

W, Y and Z

If two of three placeholders match seeds, harvest the third:

Google, Microsoft and Amazon

Cherry, Apple, and Banana

Goal:

find instances of classes

Start with a set of seeds:

companies = {Microsoft, Google}

type(Amazon, company)

Parse Web documents and find the pattern

38
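A sketch of the doubly-anchored idea under simple assumptions (the seed set and the example phrases come from the slide; a real system would scan a large corpus):

```python
import re

seeds = {"Microsoft", "Google"}          # known companies
corpus = [
    "Google, Microsoft and Amazon",
    "Cherry, Apple, and Banana",         # no two seeds match -> nothing harvested
]

# Doubly-anchored pattern "W, Y and Z": if two of the three slots are seeds,
# harvest the remaining one as a new instance of the class.
pattern = re.compile(r"(\w+), (\w+),? and (\w+)")

new_instances = set()
for sentence in corpus:
    for triple in pattern.findall(sentence):
        matched = [t for t in triple if t in seeds]
        if len(matched) >= 2:
            new_instances.update(t for t in triple if t not in seeds)

print(new_instances)   # {'Amazon'} -> type(Amazon, company)
```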

Instances can be extracted from tables [Kozareva/Hovy 2010, Dalvi et al. 2012]

Table 1: Paris | France; Shanghai | China; Berlin | Germany; London | UK

Table 2: Paris | Iliad; Helena | Iliad; Odysseus | Odyssey; Rama | Mahabharata

Goal: find instances of classes

Start with a set of seeds:

cities = {Paris, Shanghai, Brisbane}

Parse Web documents and find tables

If at least two seeds appear in a column, harvest the others:

type(Berlin, city), type(London, city)

39
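A sketch of the table-based harvesting rule (the tables and seeds are those from the slide; column detection on real HTML is of course more involved):

```python
seeds = {"Paris", "Shanghai", "Brisbane"}   # known cities

# Two toy "Web tables", each given as a list of columns.
tables = [
    [["Paris", "Shanghai", "Berlin", "London"], ["France", "China", "Germany", "UK"]],
    [["Paris", "Helena", "Odysseus", "Rama"], ["Iliad", "Iliad", "Odyssey", "Mahabharata"]],
]

harvested = set()
for table in tables:
    for column in table:
        # If at least two seeds appear in a column, harvest the remaining cells.
        if len(seeds.intersection(column)) >= 2:
            harvested.update(c for c in column if c not in seeds)

print(harvested)   # {'Berlin', 'London'} -> type(Berlin, city), type(London, city)
```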

Extracting instances from lists & tables [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]

Caveats:

Precision drops for classes with sparse statistics (IR profs, …)

Harvested items are names, not entities

Canonicalization (de-duplication) unsolved

State-of-the-Art Approach (e.g. SEAL):

• Start with seeds: a few class instances

• Find lists, tables, text snippets (“for example: …“), …

that contain one or more seeds

• Extract candidates: noun phrases from vicinity

• Gather co-occurrence stats (seed&cand, cand&className pairs)

• Rank candidates

• point-wise mutual information, …

• random walk (PR-style) on seed-cand graph

40

Probase builds a taxonomy from the Web

Probase: 2.7 Mio. classes from 1.7 Bio. Web pages [Wu et al.: SIGMOD 2012]

Use Hearst liberally to obtain many instance candidates:

„plants such as trees and grass“

„plants include water turbines“

„western movies such as The Good, the Bad, and the Ugly“

Problem: signal vs. noise

Assess candidate pairs statistically:

P[X|Y] >> P[X*|Y]  =>  subclassOf(Y, X)

Problem: ambiguity of labels

Merge labels of same class:

"X such as Y1 and Y2"  =>  same sense of X

41
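A minimal sketch of this statistical assessment step, assuming we already have counts of (superclass candidate X, instance Y) pairs from Hearst matches (the counts below are invented for illustration):

```python
from collections import Counter

# Hypothetical counts of Hearst co-occurrences: (X, Y) means "X such as Y" was observed.
pair_counts = Counter({
    ("plants", "trees"): 70, ("plants", "grass"): 50,
    ("companies", "trees"): 1, ("plants", "water turbines"): 2,
})

def p_x_given_y(x, y):
    total = sum(c for (xx, yy), c in pair_counts.items() if yy == y)
    return pair_counts[(x, y)] / total if total else 0.0

# Accept type/subclassOf(Y, X) only if X dominates all rival superclasses X* for Y.
def accept(x, y, margin=5.0):
    rivals = [p_x_given_y(xx, y) for (xx, yy) in pair_counts if yy == y and xx != x]
    return p_x_given_y(x, y) > margin * max(rivals, default=0.0)

print(accept("plants", "trees"))           # True: P[plants|trees] >> P[companies|trees]
print(accept("plants", "water turbines"))  # True here only because no rival was observed (noise!)
```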

Use query logs to refine taxonomy [Pasca 2011]

Input:

type(Y, X1), type(Y, X2), type(Y, X3), e.g, extracted from Web

Goal: rank candidate classes X1, X2, X3

Combine the following scores to rank candidate classes:

H1: X and Y should co-occur frequently in queries:
score1(X) ∝ freq(X,Y) × #distinctPatterns(X,Y)

H2: If Y is ambiguous, then users will query "X Y":
score2(X) ∝ ( ∏_{i=1..N} term-score(t_i, X) )^{1/N}
example query: "Michael Jordan computer scientist"

H3: If Y is ambiguous, then users will query first X, then "X Y":
score3(X) ∝ ( ∏_{i=1..N} term-session-score(t_i, X) )^{1/N}

42

Take-Home Lessons

Semantic classes for entities

> 10 Mio. entities in 100,000‘s of classes

backbone for other kinds of knowledge harvesting

great mileage for semantic search e.g. politicians who are scientists,

French professors who founded Internet companies, …

Variety of methods

noun phrase analysis, random walks, extraction from tables, …

Still room for improvement

higher coverage, deeper in long tail, …

43

Open Problems and Grand Challenges

Wikipedia categories reloaded: larger coverage
comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet,
e.g. people lost at sea, ACM Fellow, Jewish physicists emigrating from Germany to USA, …

Universal solution for taxonomy alignment
e.g. Wikipedia's, dmoz.org, baike.baidu.com, amazon, librarything tags, …

New name for known entity vs. new entity?
e.g. Lady Gaga vs. Radio Gaga vs. Stefani Joanne Angelina Germanotta

Long tail of entities
beyond Wikipedia: domain-specific entity catalogs, e.g. music, books, book characters, electronic products, restaurants, …

44

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Regex-based Extraction

Pattern-based Harvesting

Consistency Reasoning

Probabilistic Methods

Web-Table Methods

We focus on given binary relations

Given binary relations with type signatures:

hasAdvisor: Person × Person
graduatedAt: Person × University
hasWonPrize: Person × Award
bornOn: Person × Date

... find instances of these relations:

hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (HectorGarcia-Molina, GioWiederhold)
hasAdvisor (SusanDavidson, HectorGarcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
hasWonPrize (JimGray, TuringAward)
bornOn (JohnLennon, 9-Oct-1940)

46

IE can tap into different sources

• Semi-structured data

“Low-Hanging Fruit”

• Wikipedia infoboxes & categories

• HTML lists & tables, etc.

• Free text

“Cherrypicking”

• Hearst patterns & other shallow NLP

• Iterative pattern-based harvesting

• Consistency reasoning

• Web tables

Information Extraction (IE) from:

47

Source-centric IE vs. Yield-centric IE

Source-centric IE (one source):
Document 1: "Surajit obtained his PhD in CS from Stanford ..."
-> instanceOf (Surajit, scientist), inField (Surajit, c.science), almaMater (Surajit, Stanford U)
priorities: 1) recall! 2) precision

Yield-centric IE (many sources), + (optional) targeted relations:
worksAt:    Student | University: Surajit Chaudhuri | Stanford U; Jim Gray | UC Berkeley; …
hasAdvisor: Student | Advisor: Surajit Chaudhuri | Jeffrey Ullman; Jim Gray | Mike Harrison; …
priorities: 1) precision! 2) recall

48

We focus on yield-centric IE

Yield-centric IE (many sources), + (optional) targeted relations:
worksAt:    Student | University: Surajit Chaudhuri | Stanford U; Jim Gray | UC Berkeley; …
hasAdvisor: Student | Advisor: Surajit Chaudhuri | Jeffrey Ullman; Jim Gray | Mike Harrison; …
priorities: 1) precision! 2) recall

49

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Regex-based Extraction

Pattern-based Harvesting

Consistency Reasoning

Probabilistic Methods

Web-Table Methods

Wikipedia provides data in infoboxes

51

Wikipedia uses a Markup Language

{{Infobox scientist

| name = James Nicholas "Jim" Gray

| birth_date = {{birth date|1944|1|12}}

| birth_place = [[San Francisco, California]]

| death_date = ('''lost at sea''')

{{death date|2007|1|28|1944|1|12}}

| nationality = American

| field = [[Computer Science]]

| alma_mater = [[University of California,

Berkeley]]

| advisor = Michael Harrison

...

52

Infoboxes are harvested by RegEx

{{Infobox scientist

| name = James Nicholas "Jim" Gray

| birth_date = {{birth date|1944|1|12}}

Use regular expressions

• to detect dates

• to detect links

• to detect numeric expressions

\{\{birth date \|(\d+)\|(\d+)\|(\d+)\}\}

\[\[([^\|\]]+)

(\d+)(\.\d+)?(in|inches|")

53

Infoboxes are harvested by RegEx

{{Infobox scientist

| name = James Nicholas "Jim" Gray

| birth_date = {{birth date|1944|1|12}}

Extract data item by regular expression: 1944-01-12

Map attribute to canonical, predefined relation (manually or crowd-sourced): birth_date -> wasBorn

=> wasBorn(Jim_Gray, "1944-01-12")

54
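A hedged sketch of this regex-based infobox harvesting, applying the regular expressions shown above to the Jim Gray snippet (the attribute-to-relation mapping is supplied manually, as on the slide):

```python
import re

infobox = '''{{Infobox scientist
| name       = James Nicholas "Jim" Gray
| birth_date = {{birth date|1944|1|12}}
| birth_place = [[San Francisco, California]]
'''

# Manual mapping from infobox attributes to canonical KB relations.
ATTRIBUTE_TO_RELATION = {"birth_date": "wasBorn", "birth_place": "bornIn"}

BIRTH_DATE = re.compile(r"\{\{birth date\|(\d+)\|(\d+)\|(\d+)\}\}")
WIKI_LINK  = re.compile(r"\[\[([^\|\]]+)")

facts = []
for line in infobox.splitlines():
    if "=" not in line:
        continue
    attr, value = (part.strip(" |") for part in line.split("=", 1))
    if attr == "birth_date" and (m := BIRTH_DATE.search(value)):
        y, mo, d = m.groups()
        facts.append(("Jim_Gray", ATTRIBUTE_TO_RELATION[attr], f"{y}-{int(mo):02d}-{int(d):02d}"))
    elif attr == "birth_place" and (m := WIKI_LINK.search(value)):
        facts.append(("Jim_Gray", ATTRIBUTE_TO_RELATION[attr], m.group(1)))

print(facts)
# [('Jim_Gray', 'wasBorn', '1944-01-12'), ('Jim_Gray', 'bornIn', 'San Francisco, California')]
```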

Learn how articles express facts

James "Jim" Gray (born January 12, 1944

XYZ (born MONTH DAY, YEAR

find attribute value in full text

learn pattern

55

Name: R.Agrawal

Birth date: ?

Extract from articles w/o infobox

Rakesh Agrawal (born April 31, 1965) ...

XYZ (born MONTH DAY, YEAR

apply pattern

propose attribute value ... and/or build fact: bornOnDate(R.Agrawal, 1965-04-31)

[Wu et al. 2008: "KYLIN"]

56

Use CRF to express patterns

James "Jim" Gray (born in January, 1944

OTH OTH OTH OTH OTH VAL VAL

𝑃 𝑌 = 𝑦 𝑋 = 𝑥 = 1

𝑍 exp 𝑤𝑘𝑓𝑘(𝑦𝑡−1, 𝑦𝑡, 𝑥 , 𝑡)

𝑘𝑡

𝑥 =

𝑦 =

Features can take into account

• token types (numeric, capitalization, etc.)

• word windows preceding and following position

• deep-parsing dependencies

• first sentence of article

• membership in relation-specific lexicons

[R. Hoffmann et al. 2010: "Learning 5000 Relational Extractors"]

James "Jim" Gray (born January 12, 1944 𝑥 =

57

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Regex-based Extraction

Pattern-based Harvesting

Consistency Reasoning

Probabilistic Methods

Web-Table Methods

Facts yield patterns – and vice versa

Facts (seeds):
(JimGray, MikeHarrison)
(BarbaraLiskov, JohnMcCarthy)

Patterns:
X and his advisor Y
X under the guidance of Y
X and Y in their paper
X co-authored with Y
X rarely met his advisor Y
…
• good for recall
• noisy, drifting
• not robust enough for high precision

Fact candidates:
(Surajit, Jeff), (Sunita, Mike), (Alon, Jeff), (Renee, Yannis), (Surajit, Microsoft), (Sunita, Soumen), (Surajit, Moshe), (Alon, Larry), (Soumen, Sunita)

59

Statistics yield pattern assessment

Support of pattern p:
# occurrences of p with seeds (e1,e2)
(or normalized: # occurrences of p with seeds (e1,e2) / # occurrences of all patterns with seeds)

Confidence of pattern p:
# occurrences of p with seeds (e1,e2) / # occurrences of p

Confidence of fact candidate (e1,e2):
Σ_p freq(e1,p,e2) · conf(p) / Σ_p freq(e1,p,e2)
or: PMI(e1,e2) = log( freq(e1,e2) / (freq(e1) · freq(e2)) )

• gathering can be iterated
• can promote best facts to additional seeds for next round

60

Negative seeds increase precision
(Ravichandran 2002; Suchanek 2006; ...)

Problem: some patterns have high support, but poor precision:
"X is the largest city of Y" for isCapitalOf(X,Y)
"joint work of X and Y" for hasAdvisor(X,Y)

Idea: use positive and negative seeds:
pos. seeds: (Paris, France), (Rome, Italy), (New Delhi, India), ...
neg. seeds: (Sydney, Australia), (Istanbul, Turkey), ...

Compute the confidence of a pattern as:
# occurrences of p with pos. seeds / # occurrences of p with pos. or neg. seeds

• can promote best facts to additional seeds for next round
• can promote rejected facts to additional counter-seeds
• works more robustly with few seeds & counter-seeds

61
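A sketch of the seed-based pattern statistics from the last two slides (toy occurrence counts; a real system aggregates these over a large corpus and iterates):

```python
from collections import Counter

pos_seeds = {("Paris", "France"), ("Rome", "Italy")}
neg_seeds = {("Sydney", "Australia"), ("Istanbul", "Turkey")}

# Hypothetical counts: how often each pattern was observed with each entity pair.
occurrences = Counter({
    ("X is the capital of Y", ("Paris", "France")): 20,
    ("X is the capital of Y", ("Rome", "Italy")): 15,
    ("X is the largest city of Y", ("Paris", "France")): 12,
    ("X is the largest city of Y", ("Sydney", "Australia")): 18,
    ("X is the largest city of Y", ("Istanbul", "Turkey")): 9,
})

def confidence(pattern):
    # Confidence = occurrences with positive seeds / occurrences with positive or negative seeds.
    pos = sum(c for (p, pair), c in occurrences.items() if p == pattern and pair in pos_seeds)
    neg = sum(c for (p, pair), c in occurrences.items() if p == pattern and pair in neg_seeds)
    return pos / (pos + neg) if pos + neg else 0.0

print(confidence("X is the capital of Y"))       # 1.0  -> kept
print(confidence("X is the largest city of Y"))  # ~0.31 -> rejected despite high support
```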

Generalized patterns increase recall (N. Nakashole 2011)

Problem: some patterns are too narrow and thus have small recall:
X and his celebrated advisor Y
X carried out his doctoral research in math under the supervision of Y
X received his PhD degree in the CS dept at Y
X obtained his PhD degree in math at Y

Idea: generalize patterns to n-grams, allow POS tags:
X { his doctoral research, under the supervision of } Y
X { PRP ADJ advisor } Y
X { PRP doctoral research, IN DET supervision of } Y

Compute n-gram sets by frequent sequence mining.

Compute match quality of pattern p with sentence q by Jaccard:
|{n-grams of p} ∩ {n-grams of q}| / |{n-grams of p} ∪ {n-grams of q}|

=> covers more sentences, increases recall

62
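A minimal sketch of the Jaccard match between a pattern and a sentence, both represented as n-gram sets (word 2-grams here for brevity; the real approach mines frequent n-gram sequences and POS-generalized tokens):

```python
def ngrams(tokens, n=2):
    # Set of word n-grams of a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_match(pattern, sentence, n=2):
    p, q = ngrams(pattern.split(), n), ngrams(sentence.split(), n)
    return len(p & q) / len(p | q) if p | q else 0.0

pattern = "under the supervision of"
print(jaccard_match(pattern, "X carried out his doctoral research under the supervision of Y"))  # 0.3
print(jaccard_match(pattern, "X received his PhD degree in the CS dept at Y"))                   # 0.0
```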

Deep Parsing makes patterns robust
(Bunescu 2005, Suchanek 2006, …)

Problem: surface patterns fail if the text shows variations:
Cologne lies on the banks of the Rhine.
Paris, the French capital, lies on the beautiful banks of the Seine.

Idea: use deep linguistic parsing to define patterns.
[Figure: dependency parse of "Cologne lies on the banks of the Rhine" with link labels Ss, MVp, DMc, Mp, Dg, Js, Jp.]

Deep linguistic patterns work even on sentences with variations:
Paris, the French capital, lies on the beautiful banks of the Seine.

63

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Regex-based Extraction

Pattern-based Harvesting

Consistency Reasoning

Probabilistic Methods

Web-Table Methods

Extending a KB faces 3+ challenges

type (Reagan, president)

spouse (Reagan, Davis)

spouse (Elvis,Priscilla)

(F. Suchanek et al.: WWW‘09)

Problem: If we want to extend a KB, we face (at least) 3 challenges

1. Understand which relations are expressed by patterns

"x is married to y“ spouse(x,y)

2. Disambiguate entities

"Hermione is married to Ron": "Ron" = RonaldReagan?

3. Resolve inconsistencies

spouse(Hermione, Reagan) & spouse(Reagan,Davis) ?

"Hermione is married to Ron"

?

65

SOFIE transforms IE to logical rules (F. Suchanek et al.: WWW‘09)

Idea: Transform corpus to surface statements

"Hermione is married to Ron"

occurs("Hermione", "is married to", "Ron")

Add possible meanings for all words from the KB

means("Ron", RonaldReagan)

means("Ron", RonWeasley)

means("Hermione", HermioneGranger)

Only one of these can be true: means(X,Y) & means(X,Z) => Y=Z

Add pattern deduction rules:

occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') => P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R => R(X',Y')

Add semantic constraints (manually):

spouse(X,Y) & spouse(X,Z) => Y=Z

66

The rules deduce meanings of patterns (F. Suchanek et al.: WWW‘09)

Add pattern deduction rules

occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') => P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R => R(X',Y')

Add semantic constraints (manually):

spouse(X,Y) & spouse(X,Z) => Y=Z

type(Reagan, president)

spouse(Reagan, Davis)

spouse(Elvis,Priscilla)

"Elvis is married to Priscilla"

"is married to“ ~ spouse

67

The rules deduce facts from patterns (F. Suchanek et al.: WWW‘09)

Add pattern deduction rules

occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') => P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R => R(X',Y')

Add semantic constraints (manually):

spouse(X,Y) & spouse(X,Z) => Y=Z

spouse(Hermione,RonaldReagan)

spouse(Hermione,RonWeasley)

"is married to“ ~ married

"Hermione is married to Ron" type(Reagan, president)

spouse(Reagan, Davis)

spouse(Elvis,Priscilla)

68

The rules remove inconsistencies (F. Suchanek et al.: WWW‘09)

Add pattern deduction rules

occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') => P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R => R(X',Y')

Add semantic constraints (manually):

spouse(X,Y) & spouse(X,Z) => Y=Z

spouse(Hermione,RonaldReagan)

spouse(Hermione,RonWeasley)

type(Reagan, president)

spouse(Reagan, Davis)

spouse(Elvis,Priscilla)

69

The rules pose a weighted MaxSat problem (F. Suchanek et al.: WWW‘09)

spouse(X,Y) & spouse(X,Z) => Y=Z [10]

type(Reagan, president) [10]

married(Reagan, Davis) [10]

married(Elvis,Priscilla) [10]

occurs("Hermione","loves","Harry") [3]

means("Ron",RonaldReagan) [3]

means("Ron",RonaldWeasley) [2]

...

We are given a set of rules/facts and wish to find the most plausible possible world.

Possible World 1: weight of satisfied rules: 30
Possible World 2: weight of satisfied rules: 39
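A brute-force sketch of this weighted MaxSat view: enumerate possible worlds over the two competing spouse facts and pick the one whose satisfied rules have the largest total weight (weights and atoms follow the slide; SOFIE itself uses an approximate MaxSat algorithm rather than enumeration):

```python
from itertools import product

# Ground atoms whose truth values we want to decide.
atoms = ["spouse(Hermione,RonaldReagan)", "spouse(Hermione,RonWeasley)"]

# Weighted rules/evidence: each entry is (weight, satisfaction test on a world).
rules = [
    # Functional constraint: Reagan is already married to Davis (weight 10),
    # so a world asserting spouse(Hermione,RonaldReagan) violates it.
    (10, lambda w: not w["spouse(Hermione,RonaldReagan)"]),
    # Evidence that "Ron" means Ronald Reagan (weight 3) supports the first atom.
    (3,  lambda w: w["spouse(Hermione,RonaldReagan)"]),
    # Evidence that "Ron" means Ron Weasley (weight 2) supports the second atom.
    (2,  lambda w: w["spouse(Hermione,RonWeasley)"]),
]

best = max(
    (dict(zip(atoms, values)) for values in product([False, True], repeat=len(atoms))),
    key=lambda world: sum(weight for weight, holds in rules if holds(world)),
)
print(best)  # keeps spouse(Hermione,RonWeasley) and drops the Reagan reading
```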

PROSPERA parallelizes the extraction (N. Nakashole et al.: WSDM‘11)

Mining the pattern occurrences is embarrassingly parallel:
occurs() occurs() occurs() occurs()

Reasoning is hard to parallelize, as atoms depend on other atoms:
spouse() means() loves() means() loves()

Idea: parallelize along min-cuts

71

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Regex-based Extraction

Pattern-based Harvesting

Consistency Reasoning

Probabilistic Methods

Web-Table Methods

Markov Logic generalizes MaxSat reasoning

In a Markov Logic Network (MLN), every atom is represented by a Boolean random variable.
(M. Richardson / P. Domingos 2006)

[Figure: graph of Boolean random variables X1 … X7, one per ground atom: occurs(), spouse(), means(), loves(), means(), loves(), means().]

73

Dependencies in an MLN are limited

The value of a random variable X_i depends only on its neighbors:

P(X_i | X_1, …, X_{i−1}, X_{i+1}, …, X_n) = P(X_i | N(X_i))

The Hammersley-Clifford Theorem tells us:

P(X = x) = (1/Z) ∏_i φ_i(π_{C_i}(x))

We choose φ_i so as to satisfy all formulas in the i-th clique:

φ_i(z) = exp( w_i × #formulas_i satisfied with z )

[Figure: the same variable graph X1 … X7.]

74

There are many methods for MLN inference

[Figure: the same variable graph X1 … X7.]

To compute the values that maximize the joint probability (MAP = maximum a posteriori) we can use a variety of methods: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …

75

In addition, the MLN can model/compute

• marginal probabilities

• the joint distribution

Large-Scale Fact Extraction with MLNs [J. Zhu et al.: WWW‘09]

StatSnowball:

• start with seed facts and initial MLN model

• iterate:

• extract facts

• generate and select patterns

• refine and re-train MLN model (plus CRFs plus …)

BioSnowball:

• automatically creating biographical summaries

renlifang.msra.cn / entitycube.research.microsoft.com 76

NELL couples different learners

http://rtw.ml.cmu.edu/rtw/

Natural Language

Pattern Extractor

Table Extractor

Mutual exclusion

Type Check

Krzewski coaches the

Blue Devils.

Krzewski Blue Angels

Miller Red Angels

sports coach != scientist

If I coach, am I a coach?

Initial Ontology

[Carlson et al. 2010]

77

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Scope & Goal

Regex-based Extraction

Pattern-based Harvesting

Consistency Reasoning

Probabilistic Methods

Web-Table Methods

Web Tables provide relational information [Cafarella et al: PVLDB 08; Sarawagi et al: PVLDB 09]

79

Web Tables can be annotated with YAGO [Limaye, Sarawagi, Chakrabarti: PVLDB 10]

Goal: enable semantic search over Web tables

Idea:

• Map column headers to Yago classes,

• Map cell values to Yago entities

• Using joint inference for factor-graph learning model

80

[Example table: Title | Author: "A short history of time" | S Hawkins; "Hitchhiker's guide" | D Adams. Columns map to the classes Book and Person (subclasses of Entity); the column pair maps to the relation hasAuthor.]

Statistics yield semantics of Web tables

[Venetis,Halevy et al: PVLDB 11]

Idea: Infer classes from co-occurrences, headers are class names

P(class | val_1, …, val_n) ∝ ∏_i P(class | val_i) / P(class)

Result from 12 Mio. Web tables:

• 1.5 Mio. labeled columns (=classes)

• 155 Mio. instances (=values)

[Figure: example columns labeled with inferred classes, e.g. Conference and City.]

81
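A sketch of this column-labeling idea under a naive independence assumption (the P(class | value) estimates below are invented; Venetis et al. estimate them from isA data mined from the Web):

```python
import math

# Hypothetical P(class | cell value) estimates, e.g. mined from Hearst-style isA data.
p_class_given_value = {
    ("city", "Paris"): 0.6, ("book character", "Paris"): 0.2,
    ("city", "Berlin"): 0.9, ("city", "London"): 0.8,
    ("book character", "Helena"): 0.7,
}

def score_column(column, cls):
    # Combine per-cell evidence in log space; unseen (class, value) pairs get a small floor.
    return sum(math.log(p_class_given_value.get((cls, v), 1e-3)) for v in column)

column = ["Paris", "Berlin", "London"]
for cls in ("city", "book character"):
    print(cls, round(score_column(column, cls), 2))
# "city" scores highest -> label the column "city" and emit type(v, city) for its cells
```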

Statistics yield semantics of Web tables

Idea: Infer facts from table rows, header identifies relation name

hasLocation(ThirdWorkshop, SanDiego)

but: classes&entities not canonicalized. Instances may include:

Google Inc., Google, NASDAQ GOOG, Google search engine, …

Jet Li, Li Lianjie, Ley Lin Git, Li Yangzhong, Nameless hero, … 82

Take-Home Lessons

Bootstrapping works well for recall,
but details matter: seeds, counter-seeds, pattern language, statistical confidence, etc.

For high precision, consistency reasoning is crucial:
various methods incl. MaxSat, MLN/factor-graph MCMC, etc.

Harness initial KB for distant supervision & efficiency:
seeds from KB, canonicalized entities with type constraints

Hand-crafted domain models are assets:
expressive constraints are vital, modeling is not a bottleneck, but no out-of-model discovery

83

Open Problems and Grand Challenges

Robust fact extraction with both high precision & recall, as highly automated (self-tuning) as possible

Real-time & incremental fact extraction for continuous KB growth & maintenance (life-cycle management over years and decades)

Extensions to ternary & higher-arity relations:
events in context: who did what to/with whom when where why …?

Efficiency and scalability of best methods for (probabilistic) reasoning without losing accuracy

Large-scale studies for vertical domains,
e.g. academia: researchers, publications, organizations, collaborations, projects, funding, software, datasets, …

84

Big Data Methods for Knowledge Harvesting

Knowledge for Big Data Analytics

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations Open Information Extraction

Relation Paraphrases

Big Data Algorithms

Discovering “Unknown” Knowledge

so far KB has relations with type signatures

<entity1, relation, entity2>

< CarlaBruni marriedTo NicolasSarkozy> Person R Person

< NataliePortman wonAward AcademyAward > Person R Prize

Open and Dynamic Knowledge Harvesting:

would like to discover new entities and new relation types

<name1, phrase, name2>

Madame Bruni in her happy marriage with the French president …

The first lady had a passionate affair with Stones singer Mick …

Natalie was honored by the Oscar …

Bonham Carter was disappointed that her nomination for the Oscar …

86

Open IE with ReVerb [A. Fader et al. 2011, T. Lin 2012]

Consider all verbal phrases as potential relations

and all noun phrases as arguments

Problem 1: incoherent extractions “New York City has a population of 8 Mio” <New York City, has, 8 Mio>

“Hero is a movie by Zhang Yimou” <Hero, is, Zhang Yimou>

Problem 2: uninformative extractions “Gold has an atomic weight of 196” <Gold, has, atomic weight>

“Faust made a deal with the devil” <Faust, made, a deal>

Solution:

• regular expressions over POS tags:

VB DET N PREP; VB (N | ADJ | ADV | PRN | DET)* PREP; etc.

• relation phrase must have # distinct arg pairs > threshold

Problem 3: over-specific extractions “Hero is the most colorful movie by Zhang Yimou”

<..., is the most colorful movie by, …>

http://ai.cs.washington.edu/demos 87

Open IE Example: ReVerb http://openie.cs.washington.edu/

?x „a song composed by“ ?y

88

Open IE Example: ReVerb http://openie.cs.washington.edu/

?x „a piece written by“ ?y

89

Diversity and Ambiguity of

Relational Phrases

Who covered whom?

Cave sang Hallelujah, his own song unrelated to Cohen‘s

Nina Simone‘s singing of Don‘t Explain revived Holiday‘s old song

Cat Power‘s voice is sad in her version of Don‘t Explain

Cale performed Hallelujah written by L. Cohen

16 Horsepower played Sinnerman, a Nina Simone original

Amy‘s souly interpretation of Cupid, a classic piece of Sam Cooke

Amy Winehouse‘s concert included cover songs by the Shangri-Las


{cover songs, interpretation of, singing of, voice in, …}  =>  SingerCoversSong

{classic piece of, 's old song, written by, composition of, …}  =>  MusicianCreatesSong

90

Scalable Mining of SOL Patterns

Syntactic-Lexical-Ontological (SOL) patterns

• Syntactic-Lexical: surface words, wildcards, POS tags

• Ontological: semantic classes as entity placeholders

<singer>, <musician>, <song>, …

• Type signature of pattern: <singer> <song>, <person> <song>

• Support set of pattern: set of entity-pairs for placeholders

support and confidence of patterns

SOL pattern: <singer> ’s ADJECTIVE voice * in <song>

Matching sentences: Amy Winehouse’s soul voice in her song ‘Rehab’

Jim Morrison’s haunting voice and charisma in ‘The End’

Joan Baez’s angel-like voice in ‘Farewell Angelina’

Support set:

(Amy Winehouse, Rehab)

(Jim Morrison, The End)

(Joan Baez, Farewell Angelina)

[N. Nakashole et al.: EMNLP-CoNLL’12, VLDB‘12]

91

Pattern Dictionary for Relations [N. Nakashole et al.: EMNLP-CoNLL’12, VLDB‘12]

WordNet-style dictionary/taxonomy for relational phrases

based on SOL patterns (syntactic-lexical-ontological)

Relational phrases can be synonymous:
"wife of" ≈ "spouse of"

One relational phrase can subsume another:
"graduated from" vs. "obtained degree in * from"
"and PRONOUN ADJECTIVE advisor" vs. "under the supervision of"

Relational phrases are typed:
<person> graduated from <university>
<singer> covered <song>  vs.  <book> covered <event>

350 000 SOL patterns from Wikipedia, NYT archive, ClueWeb

http://www.mpi-inf.mpg.de/yago-naga/patty/

92

PATTY: Pattern Taxonomy for Relations [N. Nakashole et al.: EMNLP 2012, demo at VLDB 2012]

350 000 SOL patterns with 4 Mio. instances

accessible at: www.mpi-inf.mpg.de/yago-naga/patty 93

Big Data Algorithms at Work

Frequent sequence mining

with generalization hierarchy for tokens Examples: famous ADJECTIVE *

her PRONOUN *

<singer> <musician> <artist> <person>

Map-Reduce-parallelized on Hadoop:

• identify entity-phrase-entity occurrences in corpus

• compute frequent sequences

• repeat for generalizations

Pipeline stages: text pre-processing, n-gram mining, pattern lifting, taxonomy construction

94

Take-Home Lessons

Scalable algorithms for extraction & mining

have been leveraged – but more work needed

Triples of the form <name, phrase, name>

can be mined at scale and

are beneficial for entity discovery

Semantic typing of relational patterns

and pattern taxonomies are vital assets

95

Open Problems and Grand Challenges

Integrate canonicalized KB with emerging knowledge

Cost-efficient crowdsourcing

for higher coverage & accuracy

Overcoming sparseness in input corpora and

coping with even larger scale inputs

Exploit relational patterns for

question answering over structured data

tap social media, query logs, web tables & lists, microdata, etc.

for richer & cleaner taxonomy of relational patterns

KB life-cycle: today‘s long tail may be tomorrow‘s mainstream

96

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

As Time Goes By: Temporal Knowledge

Which facts for given relations hold

at what time point or during which time intervals ?

marriedTo (Madonna, GuyRitchie) [ 22Dec2000, Dec2008 ]

capitalOf (Berlin, Germany) [ 1990, now ]

capitalOf (Bonn, Germany) [ 1949, 1989 ]

hasWonPrize (JimGray, TuringAward) [ 1998 ]

graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]

graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]

hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]

How can we query & reason on entity-relationship facts

in a “time-travel“ manner - with uncertain/incomplete KB ?

US president‘s wife when Steve Jobs died?

students of Hector Garcia-Molina while he was at Princeton?

98

Temporal Knowledge

for all people in Wikipedia (300 000) gather all spouses,

incl. divorced & widowed, and corresponding time periods!

>95% accuracy, >95% coverage, in one night

consistency constraints are potentially helpful:
• functional dependencies: husband, time -> wife
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce

1) recall: gather temporal scopes for base facts

2) precision: reason on mutual consistency

Dating Considered Harmful

explicit dates vs. implicit dates

100

vague dates

relative dates

narrative text

relative order

Machine-Reading Biographies

PRAVDA for T-Facts from Text

1) Candidate gathering:

extract pattern & entities

of basic facts and

time expression

2) Pattern analysis:

use seeds to quantify

strength of candidates

3) Label propagation:

construct weighted graph

of hypotheses and

minimize loss function

4) Constraint reasoning:

use ILP for

temporal consistency

[Y. Wang et al. 2011]

102

Reasoning on T-Fact Hypotheses

Cast into an evidence-weighted logic program or an integer linear program with 0-1 variables,
for temporal-fact hypotheses X_i and pair-wise ordering hypotheses P_ij:

maximize Σ_i w_i X_i subject to
• X_i + X_j ≤ 1   if X_i, X_j overlap in time & conflict
• P_ij + P_ji ≤ 1
• (1 − P_ij) + (1 − P_jk) ≥ (1 − P_ik)   if X_i, X_j, X_k must be totally ordered
• (1 − X_i) + (1 − X_j) + 1 ≥ (1 − P_ij) + (1 − P_ji)   if X_i, X_j must be totally ordered

Temporal-fact hypotheses: m(Ca,Nic)@[2008,2012]{0.7}, m(Ca,Ben)@[2010]{0.8}, m(Ca,Mi)@[2007,2008]{0.2}, m(Cec,Nic)@[1996,2004]{0.9}, m(Cec,Nic)@[2006,2008]{0.8}, m(Nic,Ma){0.9}, …

[Y. Wang et al. 2012, P. Talukdar et al. 2012]

Efficient ILP solvers: www.gurobi.com, IBM Cplex

103

TIE for T-Fact Extraction & Ordering [Ling/Weld : AAAI 2010]

TIE (Temporal IE) architecture builds on:

• TARSQI (Verhagen et al. 2005)

for event extraction, using linguistic analyses

• Markov Logic Networks

for temporal ordering of events

104

Take-Home Lessons

Temporal knowledge harvesting:

crucial for machine-reading news, social media, opinions

Combine linguistics, statistics, and logical reasoning:

harder than for „ordinary“ relations

105

Open Problems and Grand Challenges

Robust and broadly applicable methods for

temporal (and spatial) knowledge

populate time-sensitive relations comprehensively:

marriedTo, isCEOof, participatedInEvent, …

Understand temporal relationships in

biographies and narratives

machine-reading of news, bios, novels, …

106

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

NERD Problem

NED Principles

Coherence-based Methods

Rare & Emerging Entities

Three Different Problems

Harry fought with you know who. He defeats the dark lord.

1) named-entity recognition (NER): segment & label by CRF

(e.g. Stanford NER tagger)

2) co-reference resolution: link to preceding NP

(trained classifier over linguistic features)

3) named-entity disambiguation (NED):

map each mention (name) to canonical entity (entry in KB)

Three NLP tasks:

[Candidate entities: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)]

tasks 1 and 3 together: NERD

108

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

NERD Problem

NED Principles

Coherence-based Methods

Rare & Emerging Entities

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

Named Entity Disambiguation


Sergio means Sergio_Leone Sergio means Serge_Gainsbourg Ennio means Ennio_Antonelli Ennio means Ennio_Morricone Eli means Eli_(bible) Eli means ExtremeLightInfrastructure Eli means Eli_Wallach Ecstasy means Ecstasy_(drug) Ecstasy means Ecstasy_of_Gold trilogy means Star_Wars_Trilogy trilogy means Lord_of_the_Rings trilogy means Dollars_Trilogy … … …

KB

Eli (bible)

Eli Wallach

Mentions

(surface names)

Entities

(meanings)

Dollars Trilogy

Lord of the Rings

Star Wars Trilogy

Benny Andersson

Benny Goodman

Ecstasy of Gold

Ecstasy (drug)

?

110

Sergio talked to

Ennio about

Eli‘s role in the

Ecstasy scene.

This sequence on

the graveyard

was a highlight in

Sergio‘s trilogy

of western films.

Mention-Entity Graph

Dollars Trilogy

Lord of the Rings

Star Wars

Ecstasy of Gold

Ecstasy (drug)

Eli (bible)

Eli Wallach

KB+Stats

weighted undirected graph with two types of nodes

Popularity (m,e): freq(e|m), length(e), #links(e)

Similarity (m,e): cos/Dice/KL (context(m), context(e)), with bag-of-words or language model: words, bigrams, phrases

111

Sergio talked to

Ennio about

Eli‘s role in the

Ecstasy scene.

This sequence on

the graveyard

was a highlight in

Sergio‘s trilogy

of western films.

Mention-Entity Graph

Dollars Trilogy

Lord of the Rings

Star Wars

Ecstasy of Gold

Ecstasy (drug)

Eli (bible)

Eli Wallach

KB+Stats

weighted undirected graph with two types of nodes

Popularity (m,e): • freq(e|m)

• length(e)

• #links(e)

Similarity (m,e): • cos/Dice/KL

(context(m),

context(e))

joint

mapping

112

Mention-Entity Graph

113 / 20

Dollars Trilogy

Lord of the Rings

Star Wars

Ecstasy of Gold

Ecstasy(drug)

Eli (bible)

Eli Wallach

KB+Stats

weighted undirected graph with two types of nodes

Popularity (m,e): • freq(m,e|m)

• length(e)

• #links(e)

Similarity (m,e): • cos/Dice/KL

(context(m),

context(e))

Coherence (e,e‘): • dist(types)

• overlap(links)

• overlap

(keyphrases)

Sergio talked to

Ennio about

Eli‘s role in the

Ecstasy scene.

This sequence on

the graveyard

was a highlight in

Sergio‘s trilogy

of western films.

Mention-Entity Graph

114 / 20

KB+Stats

weighted undirected graph with two types of nodes

Popularity (m,e): • freq(m,e|m)

• length(e)

• #links(e)

Similarity (m,e): • cos/Dice/KL

(context(m),

context(e))

Coherence (e,e‘): • dist(types)

• overlap(links)

• overlap

(keyphrases)

American Jews film actors artists Academy Award winners

Metallica songs Ennio Morricone songs artifacts soundtrack music

spaghetti westerns film trilogies movies artifacts Dollars Trilogy

Lord of the Rings

Star Wars

Ecstasy of Gold

Ecstasy (drug)

Eli (bible)

Eli Wallach

Sergio talked to

Ennio about

Eli‘s role in the

Ecstasy scene.

This sequence on

the graveyard

was a highlight in

Sergio‘s trilogy

of western films.

Mention-Entity Graph

115 / 20

KB+Stats

weighted undirected graph with two types of nodes

Popularity (m,e): • freq(m,e|m)

• length(e)

• #links(e)

Similarity (m,e): • cos/Dice/KL

(context(m),

context(e))

Coherence (e,e‘): • dist(types)

• overlap(links)

• overlap

(keyphrases)

http://.../wiki/Dollars_Trilogy http://.../wiki/The_Good,_the_Bad, _the_Ugly http://.../wiki/Clint_Eastwood http://.../wiki/Honorary_Academy_Award

http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/Metallica http://.../wiki/Bellagio_(casino) http://.../wiki/Ennio_Morricone http://.../wiki/Sergio_Leone http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/For_a_Few_Dollars_More http://.../wiki/Ennio_Morricone

Dollars Trilogy

Lord of the Rings

Star Wars

Ecstasy of Gold

Ecstasy (drug)

Eli (bible)

Eli Wallach

Sergio talked to

Ennio about

Eli‘s role in the

Ecstasy scene.

This sequence on

the graveyard

was a highlight in

Sergio‘s trilogy

of western films.

Mention-Entity Graph

116 / 20

KB+Stats Popularity (m,e): • freq(m,e|m)

• length(e)

• #links(e)

Similarity (m,e): • cos/Dice/KL

(context(m),

context(e))

Coherence (e,e‘): • dist(types)

• overlap(links)

• overlap

(keyphrases)

Metallica on Morricone tribute Bellagio water fountain show Yo-Yo Ma Ennio Morricone composition

The Magnificent Seven The Good, the Bad, and the Ugly Clint Eastwood University of Texas at Austin

For a Few Dollars More The Good, the Bad, and the Ugly Man with No Name trilogy soundtrack by Ennio Morricone

weighted undirected graph with two types of nodes

Dollars Trilogy

Lord of the Rings

Star Wars

Ecstasy of Gold

Ecstasy (drug)

Eli (bible)

Eli Wallach

Sergio talked to

Ennio about

Eli‘s role in the

Ecstasy scene.

This sequence on

the graveyard

was a highlight in

Sergio‘s trilogy

of western films.

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

NERD Problem

NED Principles

Coherence-based Methods

Rare & Emerging Entities

Joint Mapping

• Build mention-entity graph or joint-inference factor graph

from knowledge and statistics in KB

• Compute high-likelihood mapping (ML or MAP) or

dense subgraph such that:

each m is connected to exactly one e (or at most one e)

[Figure: mention-entity graph with weighted mention-entity and entity-entity edges.]

118

Joint Mapping: Prob. Factor Graph

[Figure: the same weighted mention-entity graph.]

Collective Learning with Probabilistic Factor Graphs [Chakrabarti et al.: KDD’09]:

• model P[m|e] by similarity and P[e1|e2] by coherence

• consider likelihood of P[m1 … mk | e1 … ek]

• factorize by all m-e pairs and e1-e2 pairs

• use MCMC, hill-climbing, LP etc. for solution

119

Joint Mapping: Dense Subgraph

• Compute dense subgraph such that:

each m is connected to exactly one e (or at most one e)

• NP-hard approximation algorithms

• Alt.: feature engineering for similarity-only method

[Bunescu/Pasca 2006, Cucerzan 2007, Milne/Witten 2008, …]

[Figure: the same weighted mention-entity graph.]

120

Coherence Graph Algorithm

• Compute dense subgraph to

maximize min weighted degree among entity nodes

such that:

each m is connected to exactly one e (or at most one e)

• Greedy approximation:

iteratively remove weakest entity and its edges

• Keep alternative solutions, then use local/randomized search

[Figure: weighted mention-entity graph; entity nodes are annotated with their weighted degrees, and the weakest entity is removed in each iteration.]

[J. Hoffart et al.: EMNLP'11]

121
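A small sketch of the greedy dense-subgraph heuristic (toy weights; the full AIDA objective also mixes in mention-entity weights and keeps the best intermediate solution for a final local search):

```python
# Candidate entities per mention and entity-entity coherence weights (toy numbers).
candidates = {
    "Sergio": ["Sergio_Leone", "Serge_Gainsbourg"],
    "Ecstasy": ["Ecstasy_of_Gold", "Ecstasy_(drug)"],
}
coherence = {
    ("Sergio_Leone", "Ecstasy_of_Gold"): 90,
    ("Sergio_Leone", "Ecstasy_(drug)"): 5,
    ("Serge_Gainsbourg", "Ecstasy_(drug)"): 20,
}

def weighted_degree(entity, active):
    return sum(w for (a, b), w in coherence.items()
               if entity in (a, b) and a in active and b in active)

active = {e for cands in candidates.values() for e in cands}
# Greedily remove the weakest entity, but never the last candidate of a mention.
while True:
    removable = [e for m, cands in candidates.items() for e in cands
                 if e in active and sum(c in active for c in cands) > 1]
    if not removable:
        break
    active.remove(min(removable, key=lambda e: weighted_degree(e, active)))

print(active)   # {'Sergio_Leone', 'Ecstasy_of_Gold'}
```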

Random Walks Algorithm

• for each mention run random walks with restart

(like personalized PageRank with jumps to start mention(s))

• rank candidate entities by stationary visiting probability

• very efficient, decent accuracy

[Figure: weighted mention-entity graph annotated with the stationary visiting probabilities of the candidate entities.]

122
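A sketch of random walk with restart over a tiny mention-entity graph (toy edge weights; candidate entities are then ranked by their stationary visiting probability):

```python
import random

# Undirected weighted graph: mention node "m:Ecstasy" plus candidate and context entities.
edges = {
    "m:Ecstasy": [("Ecstasy_of_Gold", 50), ("Ecstasy_(drug)", 20)],
    "Ecstasy_of_Gold": [("m:Ecstasy", 50), ("Sergio_Leone", 90)],
    "Ecstasy_(drug)": [("m:Ecstasy", 20)],
    "Sergio_Leone": [("Ecstasy_of_Gold", 90)],
}

def random_walk_with_restart(start, steps=100_000, restart=0.15):
    visits, node = {}, start
    for _ in range(steps):
        visits[node] = visits.get(node, 0) + 1
        if random.random() < restart or not edges.get(node):
            node = start                      # jump back to the start mention
        else:
            neighbors, weights = zip(*edges[node])
            node = random.choices(neighbors, weights=weights)[0]
    return {n: c / steps for n, c in visits.items()}

probs = random_walk_with_restart("m:Ecstasy")
ranked = sorted(probs.items(), key=lambda kv: -kv[1])
print(ranked)   # Ecstasy_of_Gold should rank above Ecstasy_(drug)
```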

Mention-Entity Popularity Weights

• Collect hyperlink anchor-text / link-target pairs from • Wikipedia redirects

• Wikipedia links between articles and Interwiki links

• Web links pointing to Wikipedia articles

• query-and-click logs

• Build statistics to estimate P[entity | name]

• Need dictionary with entities‘ names: • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corp.

• short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …

• nicknames & aliases: Terminator, City of Angels, Evil Empire, …

• acronyms: LA, UCLA, MS, MSFT

• role names: the Austrian action hero, Californian governor, CEO of MS, …

plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her.

[Milne/Witten 2008, Spitkovsky/Chang 2012]

123
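A sketch of estimating the popularity prior P[entity | name] from anchor-text link counts (the counts are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical (anchor text, link target) counts collected from Wikipedia/Web links.
anchor_counts = Counter({
    ("Arnold", "Arnold_Schwarzenegger"): 900,
    ("Arnold", "Arnold_Palmer"): 90,
    ("LA", "Los_Angeles"): 800,
    ("LA", "Louisiana"): 200,
})

prior = defaultdict(dict)
for (name, entity), count in anchor_counts.items():
    total = sum(c for (n, _), c in anchor_counts.items() if n == name)
    prior[name][entity] = count / total     # estimate of P[entity | name]

print(prior["Arnold"])   # {'Arnold_Schwarzenegger': ~0.91, 'Arnold_Palmer': ~0.09}
```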

Mention-Entity Similarity Edges

Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in the page of e with high PMI:

weight(q, e) ≈ log( freq(q, e) / (freq(q) · freq(e)) )

Match keyphrase q of candidate e in the context of mention m, combining the extent of the partial match with the weight of the matched words:

score(q | e) ≈ (# matching words / length of cover(q)) · Σ_{w ∈ cover(q)} weight(w | e) / Σ_{w ∈ q} weight(w | e)

Example: keyphrase "Metallica tribute to Ennio Morricone" matched in
"The Ecstasy piece was covered by Metallica on the Morricone tribute album."

Compute overall similarity of context(m) and candidate e:

score(e | m) ≈ Σ_{q ∈ keyphrases(e) matched in context(m)} dist-weight(cover(q)) · score(q | e)

124

Entity-Entity Coherence Edges

Precompute overlap of incoming links for entities e1 and e2 (Milne-Witten):

mw-coh(e1, e2) ≈ 1 − ( log(max(|in(e1)|, |in(e2)|)) − log(|in(e1) ∩ in(e2)|) ) / ( log |E| − log(min(|in(e1)|, |in(e2)|)) )

Alternatively compute overlap of keyphrases (or their n-grams) for e1 and e2, or similarity of bag-of-words, or …:

ngram-coh(e1, e2) ≈ |ngrams(e1) ∩ ngrams(e2)| / |ngrams(e1) ∪ ngrams(e2)|

Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances).

For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance.

125
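A minimal Python rendering of the two coherence measures above (mw-coh over incoming-link sets, ngram-coh as keyphrase/n-gram Jaccard); the link sets below are placeholders:

```python
import math

def mw_coherence(in1, in2, num_entities):
    """Milne-Witten-style coherence from the incoming-link sets of two entities."""
    overlap = len(in1 & in2)
    if overlap == 0:
        return 0.0
    num = math.log(max(len(in1), len(in2))) - math.log(overlap)
    den = math.log(num_entities) - math.log(min(len(in1), len(in2)))
    return 1.0 - num / den

def ngram_coherence(ngrams1, ngrams2):
    """Jaccard overlap of the entities' keyphrase n-gram sets."""
    union = ngrams1 | ngrams2
    return len(ngrams1 & ngrams2) / len(union) if union else 0.0

in_leone = {"Dollars_Trilogy", "Clint_Eastwood", "Ennio_Morricone"}
in_morricone = {"Dollars_Trilogy", "Ennio_Morricone", "Metallica", "The_Good_the_Bad_the_Ugly"}
print(mw_coherence(in_leone, in_morricone, num_entities=4_000_000))  # high coherence
```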

NED: Experimental Evaluation

Benchmark:

• Extended CoNLL 2003 dataset: 1400 newswire articles

• originally annotated with mention markup (NER),

now with NED mappings to Yago and Freebase

• difficult texts:

… Australia beats India … Australian_Cricket_Team

… White House talks to Kreml … President_of_the_USA

… EDS made a contract with … HP_Enterprise_Services

Results:

Best: AIDA method with prior+sim+coh + robustness test

82% precision @100% recall, 87% mean average precision

Comparison to other methods, see [Hoffart et al.: EMNLP‘11]

see also [P. Ferragina et al.: WWW’13] for NERD benchmarks

128

NERD Online Tools J. Hoffart et al.: EMNLP 2011, VLDB 2011

https://d5gate.ag5.mpi-sb.mpg.de/webaida/

P. Ferragina, U. Scaella: CIKM 2010

http://tagme.di.unipi.it/

R. Isele, C. Bizer: VLDB 2012

http://spotlight.dbpedia.org/demo/index.html

Reuters Open Calais: http://viewer.opencalais.com/

Alchemy API: http://www.alchemyapi.com/api/demo.html

S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009

http://www.cse.iitb.ac.in/soumen/doc/CSAW/

D. Milne, I. Witten: CIKM 2008

http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/

L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL 2011

http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier

some use Stanford NER tagger for detecting mentions

http://nlp.stanford.edu/software/CRF-NER.shtml

129

Coherence-aware Feature Engineering [Cucerzan: EMNLP 2007; Milne/Witten: CIKM 2008, Art.Int. 2013]

• Avoid explicit coherence computation by turning

other mentions‘ candidate entities into features

• sim(m,e) uses these features in context(m)

• special case: consider only unambiguous mentions

or high-confidence entities (in proximity of m)

(figure: candidate entities ei of other mentions influence context(m), weighted by coh(e, ei) and pop(ei))

130
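A simplified reading of this feature-engineering idea as code; all names (candidates, coherence, prior) are illustrative placeholders, not an API of the cited systems:

def neighbour_entity_features(mention, mentions, candidates, coherence, prior):
    """Instead of joint inference, add the entities of unambiguous nearby
    mentions as extra context features of `mention`, weighted by their
    coherence with each candidate e and by their prior."""
    features = {}
    for m in mentions:
        if m == mention or len(candidates[m]) != 1:
            continue                               # keep only unambiguous neighbours
        e_ctx, p = next(iter(candidates[m].items()))
        for e in candidates[mention]:
            features[(e, e_ctx)] = coherence(e, e_ctx) * p
    return features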

TagMe: NED with Light-Weight Coherence [P. Ferragina et al.: CIKM‘10, WWW‘13]

• Reduce combinatorial complexity by using

avg. coherence of other mentions‘ candidate entities

• for score(m,e) compute

vote(mj, e) = avg_{ei ∈ cand(mj)} coherence(ei, e) · popularity(ei | mj)

then sum up over all mj ≠ m („voting“)

(figure: mention m with candidate e; another mention mj with candidates e1, e2, e3 casting votes for e)

131
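A short sketch of this voting scheme; the data structures (candidates, popularity) are illustrative assumptions, not the TagMe API:

def tagme_score(mention, entity, mentions, candidates, coherence, popularity):
    """Light-weight collective scoring in the spirit of TagMe: every other
    mention votes for (mention, entity) with the average coherence of its
    own candidates, weighted by their popularity."""
    score = 0.0
    for mj in mentions:
        if mj == mention:
            continue
        votes = [coherence(ei, entity) * popularity[mj][ei] for ei in candidates[mj]]
        if votes:
            score += sum(votes) / len(votes)
    return score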

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

NERD Problem

NED Principles

Coherence-based Methods

Rare & Emerging Entities

Long-Tail and Emerging Entities

Input text: „Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song.“

Candidate entities per mention:

• Cave: wikipedia.org/Nick_Cave, wikipedia.org/Good_Luck_Cave

• Hallelujah: last.fm/Nick_Cave/Hallelujah, wikipedia/Hallelujah_(L_Cohen), wikipedia/Hallelujah_Chorus

• O Children: last.fm/Nick_Cave/O_Children, wikipedia/Children_(2011 film)

• Weeping Song: last.fm/Nick_Cave/Weeping_Song, wikipedia.org/Weeping_(song)

[J. Hoffart et al.: CIKM’12] 133

Long-Tail and Emerging Entities (cont‘d)

Same input text and candidate entities as above; each candidate additionally carries a set of characteristic keyphrases, e.g.:

• Gunung Mulu National Park, Sarawak Chamber, largest underground chamber

• eerie violin, Bad Seeds, No More Shall We Part

• Bad Seeds, No More Shall We Part, Murder Songs

• Leonard Cohen, Rufus Wainwright, Shrek and Fiona

• Nick Cave & Bad Seeds, Harry Potter 7 movie, haunting choir

• Nick Cave, Murder Songs, P.J. Harvey, Nick and Blixa duet

• Messiah oratorio, George Frideric Handel

• Dan Heymann, apartheid system

• South Korean film

KO(p, q) = Σ_t min(weight(t in p), weight(t in q)) / Σ_t max(weight(t in p), weight(t in q))

KORE(e, f) ~ Σ_{p ∈ e, q ∈ f} KO(p, q)² · min(weight(p in e), weight(q in f))

implementation uses min-hash and LSH [J. Hoffart et al.: CIKM‘12]
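An exact (quadratic) sketch of these keyphrase-overlap measures; the cited system approximates them with min-hash and LSH. The data layout (term-weight dicts, phrase weights) is an illustrative assumption:

def ko(p, q):
    """Weighted term overlap of two keyphrases, each a dict term -> weight."""
    terms = set(p) | set(q)
    num = sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in terms)
    den = sum(max(p.get(t, 0.0), q.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0

def kore(e, f):
    """e, f: lists of (term_weights, phrase_weight) for an entity's keyphrases."""
    return sum(ko(p, q) ** 2 * min(wp, wq)
               for p, wp in e for q, wq in f)

nick_cave = [({"bad": 0.3, "seeds": 0.9}, 0.8), ({"murder": 0.7, "songs": 0.4}, 0.6)]
pj_harvey = [({"nick": 0.5, "cave": 0.6, "duet": 0.4}, 0.7), ({"bad": 0.3, "seeds": 0.9}, 0.5)]
print(kore(nick_cave, pj_harvey))   # shared "Bad Seeds" keyphrase drives the relatedness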

Long-Tail and Emerging Entities (cont‘d)

Input text: „Cave‘s brand-new album contains masterpieces like Water‘s Edge and Mermaids.“

Candidate entities now include explicit out-of-KB placeholders:

• Cave: wikipedia.org/Nick_Cave, wikipedia.org/Good_Luck_Cave

• Water‘s Edge: …/Water‘s Edge (2003 film), …/Water‘s Edge Restaurant, any OTHER „Water‘s Edge“

• Mermaids: …/The Little Mermaid, …/Mermaid‘s Song, any OTHER „Mermaids“

Example keyphrase sets of the known candidates:

• Bad Seeds, No More Shall We Part, Murder Songs

• Gunung Mulu National Park, Sarawak Chamber, largest underground chamber

• Water‘s Edge (2003 film): Nathan Fillion, horrible acting

• Water‘s Edge Restaurant: excellent seafood, clam chowder, Maine lobster

• The Little Mermaid: Walt Disney, Hans Christian Andersen, Kiss the Girl

• Mermaid‘s Song: Pirates of the Caribbean 4, My Jolly Sailor Bold, Johnny Depp

Keyphrases of each OTHER placeholder: all phrases minus keyphrases of known candidate entities

Semantic Typing of Emerging Entities

Given triples (x, p, y) with new x,y

and all type triples (t1, p, t2) for known entities:

• score(x, t) ~ Σ_{p: (x,p,y)} P[ t | p, y ] + Σ_{p: (y,p,x)} P[ t | p, y ]

• corr(t1,t2) ~ Pearson coefficient [-1,+1]

Problem: what to do with newly emerging entities

Idea: infer their semantic types using PATTY patterns

For each new e and all candidate types ti:

max Σ_i score(e, ti) · Xi + Σ_{i,j} corr(ti, tj) · Yij

s.t. Xi, Yij ∈ {0,1} and Yij ≤ Xi and Yij ≤ Xj and Xi + Xj – 1 ≤ Yij

Sandy threatens to hit New York

Nive Nielsen and her band performing Good for You

Nive Nielsen‘s warm voice in Good for You

[N. Nakashole et al.: ACL 2013, T. Lin et al.: EMNLP 2012]
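A brute-force stand-in for the ILP above, workable for small candidate type sets (the cited systems solve the actual ILP or a relaxation); all names and the toy scores are illustrative:

from itertools import combinations

def best_types(cand_types, score, corr, max_types=3):
    """Pick the type subset maximising  Σ_i score(t_i) + Σ_{i<j} corr(t_i, t_j),
    i.e. the ILP objective with Y_ij = X_i AND X_j, by enumeration."""
    best, best_val = (), float("-inf")
    for k in range(1, max_types + 1):
        for subset in combinations(cand_types, k):
            val = sum(score[t] for t in subset)
            val += sum(corr.get((a, b), corr.get((b, a), 0.0))
                       for a, b in combinations(subset, 2))
            if val > best_val:
                best, best_val = subset, val
    return best, best_val

score = {"singer": 0.8, "musician": 0.7, "hurricane": 0.2}
corr  = {("singer", "musician"): 0.9, ("singer", "hurricane"): -0.8,
         ("musician", "hurricane"): -0.7}
print(best_types(["singer", "musician", "hurricane"], score, corr))
# -> (('singer', 'musician'), 2.4): coherent types win, "hurricane" is dropped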

Big Data Algorithms at Work

Web-scale keyphrase mining

Web-scale entity-entity statistics

MAP on large factor graph or

dense subgraphs in large graph

data+text queries on huge KB or LOD

Applications to large-scale input batches:

• discover all musicians in a week‘s social media postings

• identify all diseases & drugs in a month‘s publications

• track a (set of) politician(s) in a decade‘s news archive 137

Take-Home Lessons

NERD is key for contextual knowledge

High-quality NERD uses joint inference over various features:

popularity + similarity + coherence

State-of-the-art tools available

Maturing now, but still room for improvement,

especially on efficiency, scalability & robustness

Handling out-of-KB entities & long-tail NERD

 Good approaches, more work needed

138

Open Problems and Grand Challenges

Robust disambiguation of entities, relations and classes

Relevant for question answering & question-to-query translation

Key building block for KB building and maintenance

Entity name disambiguation in difficult situations

Short and noisy texts about long-tail entities in social media

Word sense disambiguation in natural-language dialogs

Relevant for multimodal human-computer interactions

(speech, gestures, immersive environments)

Efficient interactive & high-throughput batch NERD

a day‘s news, a month‘s publications, a decade‘s archive

139

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Knowledge bases are complementary

141

No Links No Use

Who is the spouse of the guitar player?

142

rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301

geonames.org/5134301/city_of_rome

N 43° 12' 46'' W 75° 27' 20''

dbpedia.org/resource/Rome

yago/wordnet:Actor109765278

yago/wikicategory:ItalianComposer

yago/wordnet: Artist109812338

imdb.com/name/nm0910607/

Link equivalent entities across KBs

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

144

rdf.freebase.com/ns/en.rome_ny data.nytimes.com/51688803696189142301

geonames.org/5134301/city_of_rome

N 43° 12' 46'' W 75° 27' 20''

dbpedia.org/resource/Rome

yago/wordnet:Actor109765278

yago/wikicategory:ItalianComposer

yago/wordnet: Artist109812338

imdb.com/name/nm0910607/

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

Referential data quality?

hand-crafted sameAs links?

generated sameAs links?

Link equivalent entities across KBs

145

Record Linkage between Databases

record 1:  Susan B. Davidson | Peter Buneman | University of Pennsylvania | Yi Chen

record 2:  O.P. Buneman | S. Davison | U Penn | Y. Chen

record 3:  P. Baumann | S. Davidson | Penn State | Cheng Y.

…

Halbert L. Dunn: Record Linkage. American Journal of Public Health. 1946

H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959.

I.P. Fellegi, A.B. Sunter: A Theory for Record Linkage. Journal of the American Statistical Association, 1969.

Goal: Find equivalence classes of entities, and of records

Techniques:

• similarity of values (edit distance, n-gram overlap, etc.)

• joint agreement of linkage

• similarity joins, grouping/clustering, collective learning, etc.

• often domain-specific customization (similarity measures etc.)

146
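A tiny sketch of the classical similarity-of-values recipe (character n-gram overlap per field, weighted combination); field names and weights are illustrative:

def ngrams(s, n=3):
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def record_similarity(rec1, rec2, weights):
    """Field-wise character n-gram similarity, combined with per-field weights."""
    return sum(w * jaccard(ngrams(rec1[f]), ngrams(rec2[f]))
               for f, w in weights.items())

r1 = {"author": "Peter Buneman", "affiliation": "University of Pennsylvania"}
r2 = {"author": "O.P. Buneman",  "affiliation": "U Penn"}
print(record_similarity(r1, r2, {"author": 0.7, "affiliation": 0.3}))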

Linking Records vs. Linking Knowledge

(figure: the same DB record — Susan B. Davidson, Peter Buneman, Yi Chen, University of Pennsylvania — contrasted with a KB / ontology graph in which the people are linked to a class „university“)

Differences between DB records and KB entities:

• Ontological links have rich semantics (e.g. subclassOf)

• Ontologies have only binary predicates

• Ontologies have no schema

• Match not just entities,

but also classes & predicates (relations)

147

Similarity of entities depends on

similarity of neighborhoods

KB 1 KB 2

sameAs ?

?

?

x1 x2

y1

y2

sameAs(x1, x2) depends on sameAs(y1, y2)

which depends on sameAs(x1, x2)

148

Equivalence of entities is transitive

KB 1 KB 2 KB 3

ek sameAs ?

ej sameAs ?

sameAs ?

ei

… …

149

Similarity Flooding matches entities at scale

Build a graph:

nodes: pairs of entities, weighted with similarity

edges: weighted with degree of relatedness

(figure: two pair-nodes with initial similarities 0.9 and 0.7, connected by an edge with relatedness 0.8)

Iterate until convergence:

similarity := weighted sum of neighbor similarities

(in the example, flooding along the 0.8 edge pulls the 0.7 node up to about 0.8)

many variants (belief propagation, label propagation, etc.), e.g. SigMa 152
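A toy fixed-point iteration in the spirit of similarity flooding; the data layout (pair graph with relatedness weights) and the damping factor are illustrative assumptions:

def similarity_flooding(pairs, related, iterations=10, alpha=0.5):
    """pairs: (x1, x2) -> initial similarity;
    related: pair -> list of (neighbour_pair, relatedness weight)."""
    sim = dict(pairs)
    for _ in range(iterations):
        new_sim = {}
        for p, s0 in pairs.items():
            neigh = related.get(p, [])
            spread = sum(w * sim[q] for q, w in neigh)
            norm = sum(w for _, w in neigh) or 1.0
            # keep part of the initial similarity, add flooded neighbour similarity
            new_sim[p] = alpha * s0 + (1 - alpha) * spread / norm
        sim = new_sim
    return sim

pairs   = {("Rome_KB1", "Roma_KB2"): 0.9, ("Italy_KB1", "Italia_KB2"): 0.7}
related = {("Rome_KB1", "Roma_KB2"): [(("Italy_KB1", "Italia_KB2"), 0.8)],
           ("Italy_KB1", "Italia_KB2"): [(("Rome_KB1", "Roma_KB2"), 0.8)]}
print(similarity_flooding(pairs, related))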

Some neighborhoods are more indicative

(figure: two entities, both born in 1935 and both labeled "Elvis" — sameAs?)

Many people born in 1935  not indicative

Few people called "Elvis"  highly indicative

153

Inverse functionality as indicativeness

(same figure: two entities, both born in 1935 and both labeled "Elvis" — sameAs?)

ifun(r, y) = 1 / |{ x : r(x, y) }|        e.g.  ifun(born, 1935) = 1/5

ifun(r) = harmonic mean over y of ifun(r, y)        e.g.  ifun(born) = 0.01,  ifun(label) = 0.9

The higher the inverse functionality of r for r(x,y), r(x',y),

the higher the likelihood that x=x'.

𝒊𝒇𝒖𝒏 𝒓 = 𝟏 ⇒ 𝒙 = 𝒙′ [Suchanek et al.: VLDB’12]

154
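A small sketch of computing inverse functionality from a set of triples; the toy triples are illustrative:

from collections import defaultdict
from statistics import harmonic_mean

def inverse_functionality(triples):
    """ifun(r, y) = 1 / |{x : r(x, y)}|, aggregated per relation with the
    harmonic mean over the object values y, as on the slide."""
    subjects = defaultdict(set)                    # (relation, object) -> subjects
    for x, r, y in triples:
        subjects[(r, y)].add(x)
    per_object = defaultdict(list)
    for (r, y), xs in subjects.items():
        per_object[r].append(1.0 / len(xs))
    return {r: harmonic_mean(vals) for r, vals in per_object.items()}

triples = [("Elvis_Presley", "label", "Elvis"),
           ("Elvis_Presley", "born", "1935"),
           ("Luciano_Pavarotti", "born", "1935"),
           ("Alain_Delon", "born", "1935")]
print(inverse_functionality(triples))   # label ~ 1.0 (indicative), born ~ 0.33 (less so)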

Match entities, classes and relations

(figure: sameAs links between entities, subClassOf links between classes, subPropertyOf links between relations)

155

PARIS matches entities, classes & relations

Goal:

given 2 ontologies, match entities, relations, and classes

Define

P(x ≡ y) := probability that entities x and y are the same

P(p ⊇ r) := probability that relation p subsumes r

P(c ⊇ d) := probability that class c subsumes d

Initialize

P(x ≡ y) := similarity if x and y are literals, else 0

P(p ⊇ r) := 0.001

Iterate until convergence (recursive dependency):

P(x ≡ y) := expression over shared facts, built from P(p ⊇ r) and the objects‘ P(y ≡ y‘)  (see paper)

P(p ⊇ r) := expression over facts with already-matched arguments, built from P(x ≡ y)  (see paper)

Compute

P(c ⊇ d) := ratio of instances of d that are in c

[Suchanek et al.: VLDB’12]

156


PARIS matches YAGO and DBpedia

• time: 1:30 hours

• precision for instances: 90%

• precision for classes: 74%

• precision for relations: 96%

157
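A deliberately simplified sketch of one half of such an update, loosely modelled on PARIS [Suchanek et al.: VLDB’12]: given current relation-subsumption probabilities, inverse functionalities and literal matches (all assumed here, estimated in the same loop in the real system), aggregate the evidence that two entities are the same with a noisy-OR. The exact update rules are in the paper.

from itertools import product

def entity_match_prob(x1, x2, kb1, kb2, sub, ifun, eq_literals):
    """Evidence that x1 (from kb1) equals x2 (from kb2): they share facts via
    subsumed relations whose objects already match; rare (high-ifun) relations count more."""
    no_match = 1.0
    for (s1, r1, y1), (s2, r2, y2) in product(kb1, kb2):
        if s1 != x1 or s2 != x2:
            continue
        evidence = sub.get((r2, r1), 0.0) * ifun.get(r1, 0.1) * eq_literals.get((y1, y2), 0.0)
        no_match *= (1.0 - evidence)
    return 1.0 - no_match

kb1 = [("Elvis_P", "born", "1935"), ("Elvis_P", "label", "Elvis")]
kb2 = [("elvis_presley", "birthYear", "1935"), ("elvis_presley", "name", "Elvis")]
p = entity_match_prob("Elvis_P", "elvis_presley", kb1, kb2,
                      sub={("birthYear", "born"): 0.9, ("name", "label"): 0.9},
                      ifun={"born": 0.01, "label": 0.9},
                      eq_literals={("1935", "1935"): 1.0, ("Elvis", "Elvis"): 1.0})
print(round(p, 3))   # the shared rare label contributes most of the evidence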

Many challenges remain

Entity linkage is at the heart of semantic data integration.

More than 50 years of research, still some way to go!

Benchmarks: • OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org

• TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/

• TREC Knowledge Base Acceleration: trec-kba.org

Hard cases:

• Highly related entities with ambiguous names

George W. Bush (jun.) vs. George H.W. Bush (sen.)

• Long-tail entities with sparse context

• Enterprise data (perhaps combined with Web2.0 data)

• Entities with very noisy context (in social media)

• Records with complex DB / XML / OWL schemas

• Ontologies with non-isomorphic structures

158

Take-Home Lessons

Web of Linked Data is great

100‘s of KB‘s with 30 Bio. triples and 500 Mio. links

mostly reference data, dynamic maintenance is bottleneck

connection with Web of Contents needs improvement

Entity resolution & linkage is key

for creating sameAs links in text (RDFa, microdata)

for machine reading, semantic authoring,

knowledge base acceleration, …

Linking entities across KB‘s is advancing

 Integrated methods for aligning entities, classes and relations

159

Open Problems and Grand Challenges

Automatic and continuously maintained sameAs links

for Web of Linked Data with high accuracy & coverage

Combine algorithms and crowdsourcing

with active learning, minimizing human effort or cost/accuracy

Web-scale, robust ER with high quality

Handle huge amounts of linked-data sources, Web tables, …

160

Outline

Linked Knowledge: Entity Matching

Motivation and Overview

Wrap-up

Taxonomic Knowledge: Entities and Classes

Temporal Knowledge: Validity Times of Facts

Contextual Knowledge: Entity Name Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/

Factual Knowledge: Relations between Entities

Emerging Knowledge: New Entities & Relations

Summary

• Knowledge Bases from Web are Real, Big & Useful:

Entities, Classes & Relations

• Key Asset for Intelligent Applications: Semantic Search, Question Answering, Machine Reading, Digital Humanities,

Text&Data Analytics, Summarization, Reasoning, Smart Recommendations, …

• Harvesting Methods for Entities & Classes  Taxonomies

• Methods for extracting Relational Facts

• NERD & ER: Methods for Contextual & Linked Knowledge

• Rich Research Challenges & Opportunities: scale & robustness; temporal, multimodal, commonsense;

open & real-time knowledge discovery; …

• Models & Methods from Different Communities: DB, Web, AI, IR, NLP

162

Knowledge Bases in the Big Data Era

Tapping unstructured data

Connecting structured & unstructured data sources

Discovering data sources

Scalable algorithms

Distributed platforms

Making sense of heterogeneous, dirty, or uncertain data

(figure: Big Data Analytics  Knowledge Bases: entities, relations, time, space, …)

163

References

see comprehensive list in:

Fabian Suchanek and Gerhard Weikum: Knowledge Harvesting in the Big-Data Era,
Proceedings of the ACM SIGMOD International Conference on Management of Data,
New York, USA, June 22-27, 2013. Association for Computing Machinery, 2013.

http://www.mpi-inf.mpg.de/yago-naga/sigmod2013-tutorial/ 164