Towards a Universal Wordnet by Learning from Combined Evidence

Gerard de Melo and Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
2009-11-03

DESCRIPTION

Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.


Introduction

Lexical Knowledge

What meanings does a word have?
How do those meanings relate to the meanings of other words?

Many Applications
examples: NLP, AI, question answering, query expansion, human consultation

Examples
“speaker”: person who gives a talk; device that produces sounds
“board”: flat piece of wood; committee; panel for writing with chalk; to enter a transportation vehicle
“student”, “pupil”: someone who studies
faculty, professor (relation labels: member, part)
hierarchy (general to specific): entity, institution, educational institution, university, ...

Multilinguality

the world is multilingual
the Internet is also increasingly multilingual

[Figure: Top 10 Languages by Approx. No. of Speakers. Source: Ethnologue 2005]
[Figure: Internet users by Region. Source: Internet World Stats]

Vision

universal index of word meanings
large-scale semantic network with class hierarchy
look up any word in any language, get a list of its meanings
meanings should be connected via semantic relations

[Figure: the meaning "person who gives a talk" linked to eng: “speaker”, jpn: “話者”, rus: “докладчик”, ces: “řečník”, ...]
[Figure: class hierarchy with terms in several languages: entity (por: “entidade”, heb: “ישות”), institution (cmn: “制度”), educational institution (deu: “Bildungseinrichtung”), university (cym: “prifysgol”), ...]


Outline

1 Existing Lexical Knowledge Bases

2 Building a Multilingual Wordnet

3 Results and Experiments

4 Summary and Future Work


Existing Lexical Knowledge Bases

WordNet

lexical database created at Princeton
enumerates meanings of English words
meaning-to-meaning links: hypernym hierarchy, meronymy (part of), etc.
Miller, Fellbaum et al. (1990): among the most-cited papers in computer science (source: CiteSeerX)

Non-English Wordnets

EuroWordNet, BalkaNet, Global WordNet Association
problem: many are small, incomplete
problem: different identifiers, formats, etc.
problem: only ∼10 languages with freely available wordnets
not a single, coherent resource

Other Resources

PANGLOSS Ontology: Knight & Luk (1994)
  2 languages, around 70 000 entities
TransGraph system: Etzioni et al. (2007)
  large translation graph, limited structure, e.g. no semantic hierarchy
DBpedia, YAGO, OpenCyc
  class hierarchy not multilingual


Building a Multilingual Wordnet

Strategy

use existing wordnets as backbone
add new terms, link to meaning nodes

[Figure: Existing Wordnets → Desired Output. Meaning nodes in both graphs: academic course, part of a meal, route of travel, series of events. Terms in the existing wordnets: eng: “course”, eng: “class”, spa: “trayectoria”. Terms added in the desired output: deu: “Reihe”, deu: “Kurs”, ita: “piatto”, fra: “suite”.]

Input Graph

use existing wordnets as backbone
  mainly English, Spanish, Catalan
add translations to graph
  dictionaries (e.g. Wiktionary)
  thesauri and ontologies
  parallel corpora (word alignment)
also: predict new translations

[Figure: Input Graph G0. Meaning nodes: academic course, part of a meal, route of travel, series of events. Wordnet terms: eng: “course”, eng: “class”, spa: “trayectoria”. Terms added through translations: deu: “Reihe”, deu: “Kurs”, ita: “piatto”, fra: “suite”.]
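The input graph can be pictured as two kinds of weighted edges: term-meaning links taken from the existing wordnets and translation edges between terms. The following is a minimal sketch of such a structure, not the authors' implementation. The class name `InputGraph`, its methods, the `lang:word` term keys and the particular edges are all assumptions made for illustration, since the edges of the slide's figure cannot be recovered from the text.

```python
from collections import defaultdict


class InputGraph:
    """Toy stand-in for G0: term nodes, meaning nodes and weighted edges."""

    def __init__(self):
        # term -> {meaning: weight}, e.g. links imported from existing wordnets
        self.meanings = defaultdict(dict)
        # term -> {term: weight}, translation edges stored in both directions
        self.translations = defaultdict(dict)

    def add_meaning_link(self, term, meaning, weight=1.0):
        self.meanings[term][meaning] = weight

    def add_translation(self, term_a, term_b, weight=1.0):
        self.translations[term_a][term_b] = weight
        self.translations[term_b][term_a] = weight


g0 = InputGraph()

# Backbone: illustrative links as they might come from an English wordnet.
for meaning in ("academic course", "part of a meal",
                "route of travel", "series of events"):
    g0.add_meaning_link("eng:course", meaning)
g0.add_meaning_link("eng:class", "academic course")

# Illustrative translation edges as they might come from a dictionary.
g0.add_translation("ita:piatto", "eng:course")
g0.add_translation("deu:Kurs", "eng:course")
g0.add_translation("deu:Kurs", "eng:class")
```

Storing translations symmetrically keeps neighbourhood lookups cheap in both directions; whether the original system treats translation edges as directed is not stated on the slides.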

Approach: Link new words to meanings of their translations

Huge Challenge: Disambiguation!

[Figure: ita: “piatto” is connected by a translation edge to eng: “course”, which has the meanings academic course, part of a meal, route of travel, series of events. Question marks indicate that it is unknown which of these meanings “piatto” should be linked to.]
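The disambiguation problem can be made concrete by enumerating the candidates: every meaning of every translation of a new term is a possible link, and only some of them are correct. The sketch below assumes plain dictionaries for the translation and term-meaning edges; the function name `candidate_meanings` and the `lang:word` keys are hypothetical.

```python
def candidate_meanings(term, translations, meanings):
    """Collect the meanings of all translations of `term` as link candidates."""
    candidates = set()
    for neighbour in translations.get(term, ()):
        candidates.update(meanings.get(neighbour, ()))
    return candidates


# Illustrative data taken from the slide's example.
translations = {"ita:piatto": ["eng:course"]}
meanings = {
    "eng:course": ["academic course", "part of a meal",
                   "route of travel", "series of events"],
}

cands = candidate_meanings("ita:piatto", translations, meanings)
# All four meanings of "course" are candidates for "piatto";
# deciding which of them to keep is the disambiguation step.
```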

Approach

variety of features that analyse the previous graph G_{i−1} and incorporate neighbourhood information into an edge’s feature vector
supervised learning: new edge weights determined using RBF-kernel SVM with posterior probability estimation

Example Feature:

Given term t and meaning m (e.g. t = fra: “suite”, m = academic course)
Question: Should they be linked?
Look at neighbours t′ ∈ Γ(t)

[Figure: t = fra: “suite” with neighbouring terms t′ (spa: “trayectoria”, eng: “course”) and their meanings m′ (part of a meal, academic course, route of travel, series of events, ...).]

\[
\sum_{t' \in \Gamma(t)} \phi_1(t, t') \, \frac{\mathrm{sim}^*(t', m)}{\mathrm{sim}^*(t', m) + \mathrm{dissim}(t', m)}
\]

\[
\mathrm{sim}^*(t', m) = \max_{m' \in \Gamma(t')} \phi_2(t', m') \, \mathrm{sim}(m', m)
\]

\[
\mathrm{dissim}(t', m) = \sum_{m' \in \Gamma(t')} \phi_2(t', m') \, (1 - \mathrm{sim}(m', m))
\]

The earlier overlays show the same score without the weights φ1 and φ2.

weighting (φ1, φ2) based on: part-of-speech, corpus frequency, ...
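The example feature can be traced by hand on a tiny graph. The sketch below follows the formula term by term; the function name `neighbourhood_score`, the dictionary-based graph, the exact-match `sim` function and the particular edges are assumptions for illustration, and the weights φ1 and φ2 default to 1 because the slides only name the signals they are based on.

```python
def neighbourhood_score(t, m, term_nbrs, meaning_nbrs, sim, phi1=None, phi2=None):
    """Sum over neighbours t' of t: phi1 * sim* / (sim* + dissim)."""
    phi1 = phi1 or (lambda term, nbr: 1.0)         # assumed default weight
    phi2 = phi2 or (lambda nbr, meaning: 1.0)      # assumed default weight
    total = 0.0
    for t2 in term_nbrs.get(t, ()):
        linked = meaning_nbrs.get(t2, ())
        if not linked:
            continue  # neighbour without known meanings gives no evidence
        # sim*(t', m): best weighted similarity of any meaning of t' to m
        sim_star = max(phi2(t2, m2) * sim(m2, m) for m2 in linked)
        # dissim(t', m): weighted mass of meanings of t' that differ from m
        dissim = sum(phi2(t2, m2) * (1.0 - sim(m2, m)) for m2 in linked)
        denom = sim_star + dissim
        if denom > 0:
            total += phi1(t, t2) * sim_star / denom
    return total


def exact_match(m1, m2):
    """Crude stand-in for sim(m', m): 1 for identical meanings, else 0."""
    return 1.0 if m1 == m2 else 0.0


# Hypothetical neighbourhood of fra:suite.
term_nbrs = {"fra:suite": ["eng:course", "spa:trayectoria"]}
meaning_nbrs = {
    "eng:course": ["academic course", "part of a meal",
                   "route of travel", "series of events"],
    "spa:trayectoria": ["route of travel"],
}

score_route = neighbourhood_score(
    "fra:suite", "route of travel", term_nbrs, meaning_nbrs, exact_match)
score_academic = neighbourhood_score(
    "fra:suite", "academic course", term_nbrs, meaning_nbrs, exact_match)
```

In this toy graph the polysemous neighbour "course" contributes only 1/4 to any of its four meanings, while the monosemous neighbour "trayectoria" contributes its full weight, which is the intended effect of the dissim term in the denominator.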

Other Features

cosine similarity of translations with gloss
scores assessing polysemy by looking at back-translations
many more (see paper for details)

Approach

use scores as features for RBF-kernel SVM
multiple iterations: each graph G_i is based on the previous graph G_{i−1}
stop when an F1 score plateau is reached on a validation set

[Figure: graph with meaning nodes academic course, part of a meal, route of travel, series of events and terms deu: “Reihe”, spa: “trayectoria”, ita: “piatto”, fra: “suite”, eng: “course”, deu: “Kurs”, eng: “class”.]
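The iteration scheme can be sketched as a loop that rebuilds the graph from its predecessor and stops once the validation F1 no longer improves noticeably. This is a sketch under assumptions: the names `iterate_until_plateau`, `rescore`, `min_gain` and `max_iters` are hypothetical, the slides do not say how a plateau is detected or which graph is kept, and the F1 values in the demo are made up.

```python
def f1_score(predicted, gold):
    """F1 of a set of predicted links against a set of gold links."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def iterate_until_plateau(g0, rescore, f1_on_validation,
                          min_gain=0.001, max_iters=10):
    """Build G_i from G_{i-1} until validation F1 gains fall below min_gain."""
    graph = g0
    best_f1 = f1_on_validation(graph)
    for _ in range(max_iters):
        candidate = rescore(graph)           # e.g. features + classifier
        f1 = f1_on_validation(candidate)
        if f1 - best_f1 < min_gain:          # plateau reached: keep previous graph
            break
        graph, best_f1 = candidate, f1
    return graph, best_f1


# Demo with a stub: a "graph" is just its iteration index and the
# validation F1 per iteration is a made-up sequence.
f1_by_iteration = [0.60, 0.72, 0.78, 0.80, 0.8004]
final_graph, final_f1 = iterate_until_plateau(
    0, lambda i: i + 1, lambda i: f1_by_iteration[i])
```

In a real pipeline `rescore` would recompute the feature vectors on the previous graph and assign new edge weights from the classifier's probability estimates.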


Results

Setup

input graph G0:
  448,069 pre-existing term-meaning links
  10,805,400 translation edges
  1.3 million term nodes with candidates
  7.7 candidate meanings per new term
2,445 term-meaning links for training (French/German)
2,901 term-meaning links as validation set

[Figure: Excerpt from final UWN graph G3 after 3 iterations, retaining only edges with sufficiently high weights (0.5 / 0.6). Meaning nodes: school (group of fish), school (institution), school (building). Terms shown: deu: “Schulgebäude”, deu: “Schulhaus”, deu: “Fischschwarm”, ces: “hejno”, fra: “banc”, ind: “sekolah”, jpn: “学校”, kor: “학교”, lao: “ໂຮງຮຽນ”, kat: “სკოლა”.]

Evaluation

Relation                                 Precision (1)
Term-Meaning Links (French)              89.2% ± 3.4%
Term-Meaning Links (German)              85.9% ± 3.8%
Term-Meaning Links (Mandarin Chinese)    90.5% ± 3.3%
Generalization (Hypernymy)               87.1% ± 4.8%
Instance                                 89.3% ± 4.4%
Similarity                               92.0% ± 3.8%
Category                                 93.3% ± 4.5%
Part (Meronymy)                          94.4% ± 4.1%
Member (Meronymy)                        92.7% ± 4.0%
Substance (Meronymy)                     95.6% ± 3.5%
Opposite                                 94.3% ± 3.9%

(1) Wilson score intervals for random samples
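The precision figures are reported as Wilson score intervals, which can be written as a centre plus or minus a half-width. The sketch below computes both from a sample; the function name `wilson_interval` is hypothetical, z = 1.96 (a 95% level) is an assumption, and the sample of 90 correct links out of 100 is made up, since the slides do not state the sample sizes.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return (centre, half_width) of the Wilson score interval."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    # The centre is pulled from the raw proportion p towards 0.5.
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre, half_width


# Hypothetical sample: 90 of 100 randomly drawn links judged correct.
centre, half_width = wilson_interval(90, 100)
lower, upper = centre - half_width, centre + half_width
```

Unlike the simple normal approximation, the interval is not centred on the raw proportion, which keeps it inside [0, 1] for precision values close to 100%.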

Coverage

Language      Term-Meaning Links   Distinct Terms
Overall       1,595,763            822,212
German        132,523              67,087
French        75,544               33,423
Esperanto     71,247               33,664
Dutch         68,792               30,154
Spanish       68,445               32,143
Turkish       67,641               31,553
Czech         59,268               33,067
Russian       57,929               26,293
Portuguese    55,569               23,499
Italian       52,008               24,974
Hungarian     46,492               28,324
Thai          44,523               30,815

Application: Semantic Relatedness

Experimental Setup

Example: “curriculum” considered closely related to “school”, but not to “water”
compute term relatedness using UWN:

\[
\mathrm{sim}(t_1, t_2) = \max_{s_1 \in \sigma(t_1)} \max_{s_2 \in \sigma(t_2)} \mathrm{sim}(s_1, s_2)
\]

  sim(s1, s2): combined graph-/gloss-based method
compare with assessments of relatedness made by human judges

Page 52: Towards a Universal Wordnet by Learning from Combined Evidence


Application: Semantic Relatedness

Results for 3 German Datasets

Dataset                   GUR65          GUR350         ZG222
                          r     Cov.     r     Cov.     r     Cov.
Inter-Annot. Agreement    0.81  (65)     0.69  (350)    0.49  (222)
Wikipedia (ESA*)          0.56  65       0.52  333      0.32  205
GermaNet (Lin*)           0.73  60       0.50  208      0.08  88
UWN                       0.80  60       0.68  242      0.51  106

r: Pearson product-moment correlation coefficient

Cov.: absolute coverage

*: scores by Gurevych et al. (2007)
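The two quantities reported in the table, r and Cov., can be computed as in the following sketch. It assumes that the correlation is taken only over the word pairs for which the system returns a score, and that coverage is the count of those pairs; the function names and the toy scores are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def evaluate(human_scores, system_scores):
    """Return (r, coverage): correlation over covered pairs and their count.
    A pair counts as covered when the system produced a score (not None)."""
    covered = [p for p in human_scores if system_scores.get(p) is not None]
    r = pearson_r([human_scores[p] for p in covered],
                  [system_scores[p] for p in covered])
    return r, len(covered)

# Hypothetical toy data: four judged pairs, one of which the system cannot score.
human = {("a", "b"): 4.0, ("c", "d"): 1.0, ("e", "f"): 2.5, ("g", "h"): 3.0}
system = {("a", "b"): 0.8, ("c", "d"): 0.2, ("e", "f"): 0.5, ("g", "h"): None}
r, coverage = evaluate(human, system)
```

Reporting coverage next to r matters because a resource that scores only the easy pairs can reach a high correlation on a small subset of the dataset.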


Page 53: Towards a Universal Wordnet by Learning from Combined Evidence


Application: Cross-Lingual Text Classification

cross-lingual TC: train using documents in one language, classify documents in another language

used bag-of-words/meanings TF-IDF vectors

Dataset: Reuters corpora (RCV1/2)
for each language pair: 105 binary classification tasks, each using 200 training documents, 600 test documents

SVMlight
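A rough sketch of the bag-of-words/meanings representation follows. It is a simplification under stated assumptions: every meaning linked to a token is added as an unweighted feature, IDF is computed over the small toy collection itself, and the lexicon entries, meaning identifiers, and function names are hypothetical. The resulting sparse vectors would then be handed to an SVM learner.

```python
import math
from collections import Counter

def features(tokens, term_to_meanings):
    """Expand a token list into term features plus language-independent
    meaning features looked up in a term-to-meanings lexicon."""
    feats = []
    for tok in tokens:
        feats.append("term:" + tok)
        for meaning in term_to_meanings.get(tok, ()):
            feats.append("meaning:" + meaning)
    return feats

def tfidf_vectors(docs):
    """Turn a list of feature lists into sparse TF-IDF dictionaries."""
    n_docs = len(docs)
    doc_freq = Counter()
    for feats in docs:
        doc_freq.update(set(feats))  # count each feature once per document
    vectors = []
    for feats in docs:
        term_freq = Counter(feats)
        vectors.append({f: count * math.log(n_docs / doc_freq[f])
                        for f, count in term_freq.items()})
    return vectors

# Hypothetical lexicon: meaning identifiers are invented for illustration.
LEXICON = {
    "bank": ["bank.financial"], "banca": ["bank.financial"],
    "loan": ["loan.money"], "prestito": ["loan.money"],
}
english_doc = features(["bank", "loan"], LEXICON)
italian_doc = features(["banca", "prestito"], LEXICON)
other_doc = features(["weather"], LEXICON)
vecs = tfidf_vectors([english_doc, italian_doc, other_doc])
shared = set(vecs[0]) & set(vecs[1])
```

With terms alone, the English and Italian documents share no features; the meaning features are what give the two vectors a non-empty overlap, which is the intuition behind the "Terms + Meanings" setting.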


Page 57: Towards a Universal Wordnet by Learning from Combined Evidence


Application: Cross-Lingual Text Classification

Language Pair      Terms only   Terms + Meanings
English-Italian    68.3%        76.3%
English-Russian    51.7%        71.2%
Italian-English    74.4%        78.1%
Italian-Russian    58.4%        73.2%
Russian-English    67.3%        76.8%
Russian-Italian    62.2%        71.8%

(all values are F1 scores)
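To make the size of the improvement explicit, the short script below computes the absolute F1 gain per language pair from the figures in the table. The values are copied from the slide; the variable names are hypothetical, and the script adds no new results.

```python
# F1 scores in percent, copied from the table: (terms only, terms + meanings).
F1_SCORES = {
    "English-Italian": (68.3, 76.3),
    "English-Russian": (51.7, 71.2),
    "Italian-English": (74.4, 78.1),
    "Italian-Russian": (58.4, 73.2),
    "Russian-English": (67.3, 76.8),
    "Russian-Italian": (62.2, 71.8),
}

# Absolute gain in percentage points, rounded to one decimal like the inputs.
gains = {pair: round(with_meanings - terms_only, 1)
         for pair, (terms_only, with_meanings) in F1_SCORES.items()}

mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for pair, gain in gains.items():
    print(f"{pair}: +{gain:.1f} points")
print(f"mean gain: +{mean_gain:.2f} points")
```

The gains range from 3.7 points (Italian-English) to 19.5 points (English-Russian), and every language pair improves when meanings are added.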


Page 58: Towards a Universal Wordnet by Learning from Combined Evidence


Outline

1 Existing Lexical Knowledge Bases

2 Building a Multilingual Wordnet

3 Results and Experiments

4 Summary and Future Work


Page 59: Towards a Universal Wordnet by Learning from Combined Evidence


Summary

large-scale multilingual wordnet: 85% accuracy, 800,000 terms, over 1.5 million links from terms to meanings,

built by learning edge weights using graph-based evidence

useful for monolingual and cross-lingual tasks


Page 62: Towards a Universal Wordnet by Learning from Combined Evidence


Future Work

ongoing work: user interface incl. user contributions

techniques to automatically discover new word meanings

word sense disambiguation, query expansion using UWN


Page 65: Towards a Universal Wordnet by Learning from Combined Evidence


Thanks!

expression of gratitude

eng: “thank you”

yue: “唔該”

cmn: “谢谢”

jap: “ありがとう”

spa: “gracias”

ara: “شكرا.”
