Learning Bilingual Lexicons from Monolingual Corpora


Learning Bilingual Lexicons from Monolingual Corpora

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein

Computer Science Division, University of California, Berkeley

Standard MT Approach

Source Text → Target Text

- Need (lots of) parallel sentences
- May not always be available

MT from Monotext

Source Text → Target Text

- This talk: translation w/o parallel text? Koehn and Knight (2002) & Fung (1995)
- Need (lots of) sentences

Task: Lexicon Induction

Given source text and target text, find a matching m between:

- Source words s: state, world, name, nation
- Target words t: estado, política, mundo, nombre

Data Representation

state (Source Text):

- Orthographic features: #st 1.0, tat 1.0, te# 1.0
- Context features: world 20.0, politics 5.0, society 10.0

Data Representation

estado (Target Text):

- Orthographic features: #es 1.0, sta 1.0, do# 1.0
- Context features: mundo 10.0, politica 17.0, sociedad 6.0
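The feature representation above (character-substring indicators plus context co-occurrence counts) can be sketched as follows. This is an illustrative reconstruction, not the authors' exact feature extractor; the window size and helper names are assumptions.

```python
from collections import Counter

def orthographic_features(word, n=3):
    """Character n-gram indicator features over the padded word, e.g. #st, tat, te#."""
    padded = "#" + word + "#"
    return {padded[i:i + n]: 1.0 for i in range(len(padded) - n + 1)}

def context_features(word, corpus_sentences, window=2):
    """Count co-occurrences of other words within a fixed window around `word`."""
    counts = Counter()
    for sent in corpus_sentences:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(x for j, x in enumerate(sent[lo:hi], start=lo) if j != i)
    return dict(counts)

feats = orthographic_features("state")  # contains "#st", "tat", "te#" among others
ctx = context_features("state", [["world", "state", "politics"]])
```

Each word is then represented by concatenating (or separately modeling) these two sparse vectors, as in the slides.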

Canonical Correlation Analysis

[Figure: PCA projects the source space and the target space independently; CCA instead projects both spaces into a shared canonical space, aligning the matched points (1, 2, 3) across the two spaces.]

Generative Model

- Source words s, target words t, matching m
- [Figure: a matched pair (state, estado) is generated from a common point in the canonical space]
- Matching m: state, world, name, nation ↔ estado, nombre, politica, mundo

Learning: EM?

- E-Step: obtain posterior over matchings
- M-Step: maximize CCA parameters
- Problem: getting expectations over matchings is #P-hard! See John DeNero's paper "The Complexity of Phrase Alignment Problems"

Inference: Hard EM

- Hard E-Step: find the best bipartite matching
- M-Step: solve CCA
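The hard E-step is a maximum-weight bipartite matching, which SciPy's Hungarian-algorithm solver handles directly. The similarity values below are toy numbers standing in for canonical-space similarities, not model output.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: source words (state, world, name); columns: target words (estado, mundo, nombre).
# Entries play the role of similarities in the learned canonical space.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
])

# linear_sum_assignment minimizes cost, so negate to maximize total similarity.
rows, cols = linear_sum_assignment(-sim)
source_words = ["state", "world", "name"]
target_words = ["estado", "mundo", "nombre"]
matching = {source_words[r]: target_words[c] for r, c in zip(rows, cols)}
```

Hard EM alternates this step with refitting the CCA parameters on the currently matched pairs.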

Experimental Setup

- Nouns only (for now)
- Seed lexicon: 100 translation pairs
- Induce lexicon between top 2k source and target word-types
- Evaluation: precision and recall against a lexicon obtained from Wiktionary; report p0.33, precision at recall 0.33
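One way to compute precision at recall 0.33 (assuming proposals can be ranked by confidence, which the model's matching scores provide) is to walk down the ranked list until recall reaches the threshold. This helper is an illustrative reconstruction of the metric, not the authors' evaluation script.

```python
def precision_at_recall(ranked_pairs, gold, target_recall=0.33):
    """Walk down confidence-ranked (source, target) proposals and report
    precision at the first point where recall reaches target_recall.
    `gold` is the set of correct pairs (e.g. from Wiktionary)."""
    correct = 0
    for i, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            correct += 1
        if correct / len(gold) >= target_recall:
            return correct / i
    return None  # target recall never reached

gold = {("state", "estado"), ("world", "mundo"), ("name", "nombre")}
ranked = [("state", "estado"), ("world", "politica"), ("world", "mundo")]
p = precision_at_recall(ranked, gold, target_recall=0.66)
```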

Feature Experiments

4k EN-ES Wikipedia Articles; precision:

- Edit distance (baseline): 61.1
- MCCA, only orthographic features: 80.1
- MCCA, only context features: 80.2
- MCCA, orthographic and context features: 89.0
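The edit-distance baseline links each source word to its closest target word by Levenshtein distance. A minimal sketch (the greedy nearest-neighbor linking here is an assumption; the baseline could also use a global matching):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Link each source word to its nearest target word.
targets = ["estado", "mundo", "nombre"]
best = min(targets, key=lambda t: edit_distance("state", t))
```

This exploits cognates like state/estado, which is why it works at all, and why it degrades on non-cognate vocabulary.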

Feature Experiments

[Figure: precision-recall curves for the feature settings above]

Corpus Variation

- Identical corpora (100k EN-ES Europarl sentences): 93.8 precision
- Comparable corpora (4k EN-ES Wikipedia articles): 89.0 precision
- Unrelated corpora (100k English and Spanish Gigaword): 68.3 precision

Seed Lexicon Source

4k EN-ES Wikipedia Articles:

- Automatic seed (induced by edit distance, as in Koehn & Knight 2002): 91.8 precision
- Gold seed: 93.8 precision

Analysis

- Top non-cognates [table]
- Interesting mistakes [table]
- Language variation [table]
- Orthography features vs. context features [examples]

Summary

- Learned a bilingual lexicon from monotext: matching + CCA model
- Possible even from unaligned corpora
- Possible for non-related languages
- High-precision, but much left to do!

Thank you!

http://nlp.cs.berkeley.edu

Error Analysis

Top 100 errors:

- 21 correct translations not in the gold lexicon
- 30 semantically related
- 15 orthographically related (coast, costas)
- 30 seemingly random

Bleu Experiment

- English-French, only 1k parallel sentences
- Without lexicon: BLEU 13.61
- With lexicon: BLEU 15.22

More Numbers

Conclusion

Three cases of unsupervised learning in NLP

Unsupervised systems can be competitive with supervised systems

Future problems:

- Document summarization
- Building MindNet-like resources
- Discourse analysis


Generative Model

- For each matched word pair: generate matched word vectors
- For each unmatched source word: generate an unmatched word vector
- For each unmatched target word: generate an unmatched word vector
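The loops above can be sketched as a toy generative story: a matched pair shares one canonical vector, while unmatched words draw their own. The Gaussian forms, linear maps, dimensions, and noise scale here are illustrative assumptions, not the talk's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_canon, d_src, d_tgt = 2, 6, 5

# Illustrative parameters: linear maps from canonical space to each feature space.
W_src = rng.normal(size=(d_src, d_canon))
W_tgt = rng.normal(size=(d_tgt, d_canon))

def generate_matched_pair():
    """Matched pair: one shared canonical vector z generates both feature vectors."""
    z = rng.normal(size=d_canon)
    f_src = W_src @ z + 0.1 * rng.normal(size=d_src)
    f_tgt = W_tgt @ z + 0.1 * rng.normal(size=d_tgt)
    return f_src, f_tgt

def generate_unmatched(W, d_out):
    """Unmatched word: its own canonical vector, not shared with any partner."""
    return W @ rng.normal(size=d_canon) + 0.1 * rng.normal(size=d_out)

f_s, f_t = generate_matched_pair()
f_lone = generate_unmatched(W_src, d_src)
```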

Results: Accuracy

Corpus Variation

[Charts: P@0.1, P@0.33, P@0.50 for Parallel vs. Wiki vs. Disjoint sentences, and for Parallel vs. Wiki vs. Unrelated corpora]

Machine Translation

Source Word   Target Word   P(T | S)
state         estado        0.98
world         mundo         0.97
name          nombre        0.99

