Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance


Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance. Shay B. Cohen, Dipanjan Das, Noah A. Smith. Carnegie Mellon University. July 27, EMNLP 2011. Goal: learn linguistic structure for a language without any labeled data in that language.


Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Shay B. Cohen, Dipanjan Das, Noah A. Smith

Carnegie Mellon University

July 27, EMNLP 2011

Goal:

Learn linguistic structure for a language without any labeled data in that language

Part-of-Speech Tagging

The   Skibo   Castle   is     close   by    .
DET   NOUN    NOUN     VERB   ADJ     ADP   .

Dependency Parsing

EMNLP 2011, Cohen, Das and Smith (2011)

Multilingual Unsupervised Learning

This work!

using parallel data

no parallel data (hard)

supervision in source language(s)

joint learning for multiple languages

Snyder et al. (2009), Naseem et al. (2010)

supervision in source language(s)

Smith and Eisner (2009), Das and Petrov (2011), McDonald et al. (2011)

joint learning for multiple languages

Cohen and Smith (2009), Berg-Kirkpatrick and Klein (2010)

Yarowsky and Ngai (2001), Xi and Hwa (2005)

In a Nutshell

Annotated data (Spanish, Italian) + unlabeled data in Portuguese = Portuguese parameters

Annotated data: Spanish → coarse, universal parameters

Annotated data: Italian → coarse, universal parameters

→ Interpolation (unsupervised training) → coarse parameters of Portuguese

→ Coarse-to-fine expansion and initialization

→ Monolingual unsupervised training in Portuguese → Portuguese parameters

Assumptions for a given problem:

1. Underlying model is generative

HMM, Merialdo (1994): The Skibo Castle is close by

DMV, Klein and Manning (2004): ROOT, DET NOUN NOUN VERB ADJ ADP

Composed of multinomial distributions

In general, unlexicalized parameters look like θ_{k,i}:

k: kth multinomial in the model

i: ith event in the multinomial

e.g. transition from ADJ (k) to NOUN (i)

The lexicalized parameters take a similar form

(No lexicalized parameters for the DMV)

The probability of a derivation has an unlexicalized part and a lexicalized part; each parameter is raised to the number of times event i of multinomial k fires in the derivation
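A minimal sketch of that product-of-multinomials score, assuming the standard form log p = Σ_k Σ_i f_{k,i} · log θ_{k,i}. The function name, the dictionary layout and the toy numbers are illustrative and do not come from the slides.

```python
import math

def derivation_log_prob(theta, counts):
    """Sum f[k][i] * log(theta[k][i]) over every multinomial k and event i."""
    total = 0.0
    for k, event_counts in counts.items():
        for i, f in event_counts.items():
            if f:  # events that never fire contribute nothing
                total += f * math.log(theta[k][i])
    return total

# Toy parameters: two transition multinomials of an HMM-like model.
theta = {
    "trans:DET": {"NOUN": 0.8, "ADJ": 0.2},
    "trans:ADJ": {"NOUN": 1.0},
}
# Toy derivation: DET->NOUN fires once, DET->ADJ once, ADJ->NOUN once.
counts = {
    "trans:DET": {"NOUN": 1, "ADJ": 1},
    "trans:ADJ": {"NOUN": 1},
}
log_p = derivation_log_prob(theta, counts)
print(math.exp(log_p))
```

Working in log space keeps the score numerically stable when derivations are long.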


Assumptions for a given problem:

2. Coarse, universal part-of-speech tags

VERB, NOUN, PRON, ADJ, ADV, ADP, DET, CONJ, NUM, PRT, ., X

For each language, there is a mapping from the treebank tagset to the coarse tags
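A minimal sketch of such a mapping for English. The fine tags below are Penn Treebank style and the fine tagging of the example sentence is an assumption made for illustration; the slides only show the coarse tags.

```python
# Illustrative subset of a fine-to-coarse mapping (not the full table).
FINE_TO_COARSE = {
    "DT": "DET",
    "NN": "NOUN",
    "NNP": "NOUN",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "IN": "ADP",
    ".": ".",
}

def coarsen(fine_tags, mapping, unknown="X"):
    """Map each treebank tag to its coarse tag; unmapped tags fall back to X."""
    return [mapping.get(tag, unknown) for tag in fine_tags]

words = ["The", "Skibo", "Castle", "is", "close", "by", "."]
fine = ["DT", "NNP", "NNP", "VBZ", "JJ", "IN", "."]  # assumed fine tags
coarse = coarsen(fine, FINE_TO_COARSE)
print(list(zip(words, coarse)))
```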

Assumptions for a given problem:

3. Helper languages

For each: Treebank → (coarse conversion) → Coarse treebank → (MLE) → unlexicalized parameters
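The MLE step can be sketched as relative-frequency estimation over coarse tag sequences. This toy version only estimates HMM-style transition multinomials; the three sequences are made up, and `mle_transitions` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def mle_transitions(tag_sequences):
    """Relative-frequency estimate of p(next tag | previous tag)."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }

# Toy coarse treebank for one helper language (made-up sequences).
toy_treebank = [
    ["DET", "ADJ", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB", "ADJ"],
    ["ADJ", "NOUN"],
]
theta_helper = mle_transitions(toy_treebank)
print(theta_helper["DET"])
```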

Multilingual Modeling

For a target language, unlexicalized parameters:

\theta_k = \sum_{\ell=1}^{L} \beta_{k,\ell} \, \theta_k^{(\ell)}

\theta_k: kth multinomial in the model (say, the transitions from the ADJ tag in an HMM)

\beta_{k,\ell}: mixture weight for kth multinomial for the \ell th helper language

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

ADJ → · distributions over NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X:

Helper (weight 0.7):  0.27  0.12  0.03  0.04  0.04  0.03  0.25  0.01  0.10  0.05  0.01  0.05
Helper (weight 0.3):  0.25  0.11  0.04  0.04  0.06  0.04  0.26  0.0…  0.20  0.0…  0.0…  0.00
Interpolated:         0.26  0.12  0.03  0.04  0.05  0.03  0.25  0.01  0.13  0.04  0.01  0.04
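The interpolation can be sketched as a weighted sum of helper distributions. The three-event distributions below are toy values chosen for brevity (only the CONJ entries echo the slide), and `interpolate` is a hypothetical name.

```python
def interpolate(helper_dists, weights):
    """Mix helper multinomials: theta[e] = sum_l weights[l] * helper_dists[l][e]."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    events = helper_dists[0].keys()
    return {
        e: sum(w * dist[e] for w, dist in zip(weights, helper_dists))
        for e in events
    }

# Toy ADJ -> . distributions for two helper languages.
helper_a = {"NOUN": 0.6, "CONJ": 0.10, "OTHER": 0.30}
helper_b = {"NOUN": 0.5, "CONJ": 0.20, "OTHER": 0.30}
target = interpolate([helper_a, helper_b], [0.7, 0.3])
print(target)
```

Because each helper distribution sums to one and the weights sum to one, the mixture is again a valid multinomial.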

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

With the same two helper ADJ → · distributions, the mixture weights (? ?) and the interpolated ADJ → · distribution are unknown

Learning and Inference

normal learning

multilingual learning: the helper-language parameters are fixed!

Multilingual learning with EM:

Number of times each helper-language parameter is used in a derivation

M-step:
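A minimal sketch of EM for the mixture weights of a single multinomial, under a simplifying assumption that is not in the slides: the event counts are observed directly, whereas in the tagging and parsing models they would be expected counts over latent derivations. The helper distributions stay fixed and only the weights are updated. `em_mixture_weights` and the toy numbers are hypothetical.

```python
def em_mixture_weights(components, counts, iterations=200):
    """Fit mixture weights beta over fixed component multinomials."""
    n = len(components)
    beta = [1.0 / n] * n  # start from uniform weights
    total = sum(counts.values())
    for _ in range(iterations):
        expected = [0.0] * n
        # E-step: split each event's count among components by responsibility.
        for event, c in counts.items():
            joint = [beta[l] * components[l][event] for l in range(n)]
            z = sum(joint)
            for l in range(n):
                expected[l] += c * joint[l] / z
        # M-step: renormalize the expected usage counts.
        beta = [e / total for e in expected]
    return beta

# Two fixed helper distributions and toy target-language counts.
helper_a = {"a": 0.9, "b": 0.1}
helper_b = {"a": 0.1, "b": 0.9}
beta = em_mixture_weights([helper_a, helper_b], {"a": 70, "b": 30})
print(beta)
```

In this toy case the maximum-likelihood mixture must satisfy 0.9·β + 0.1·(1 − β) = 0.7, so the weights converge to 0.75 and 0.25.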


Multilingual learning

What about feature-rich generative models?

Locally normalized log-linear model, Berg-Kirkpatrick et al. (2010)


Multilingual Modeling

e.g., two helper languages: Spanish and Italian

Learned mixture weights: 0.6237, 0.3763

Learned ADJ → · distribution over NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X:

0.26  0.12  0.03  0.04  0.05  0.03  0.25  0.01  0.13  0.04  0.01  0.04

Learning and Inference: Coarse-to-fine expansion (for English)

Coarse ADJ → · distribution over NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X:

0.26  0.12  0.03  0.04  0.05  0.03  0.25  0.01  0.13  0.04  0.01  0.04

Step 1: identical copies of ADJ → · are made for JJ → ·, JJR → ·, JJS → ·

Step 2: equal division. In each copy, e.g. JJ → ·, the mass of every coarse tag is divided equally among its fine tags:

NOUN 0.26 → NN, NNS, NNP, NNPS: 0.065 each
VERB 0.12 → VB, VBD, VBG, VBN, VBP, VBZ: 0.02 each
...

The new, fine JJ → · distribution is the initializer for monolingual unsupervised training
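The two steps can be sketched as follows. The coarse row here is a toy: only the NOUN and VERB masses (0.26 and 0.12) echo the slide, the remaining mass is lumped onto ADJ for brevity, and `expand_coarse_to_fine` is a hypothetical name.

```python
def expand_coarse_to_fine(coarse_trans, fine_of):
    """Expand coarse-tag transition rows to fine tags.

    Step 1: every fine source tag gets a copy of its coarse tag's row.
    Step 2: each coarse target's mass is split equally among its fine tags.
    """
    fine_trans = {}
    for coarse_src, row in coarse_trans.items():
        fine_row = {}
        for coarse_dst, p in row.items():
            fines = fine_of[coarse_dst]
            for fine_dst in fines:
                fine_row[fine_dst] = p / len(fines)
        for fine_src in fine_of[coarse_src]:
            fine_trans[fine_src] = dict(fine_row)  # identical copies
    return fine_trans

fine_of = {
    "NOUN": ["NN", "NNS", "NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
}
coarse_trans = {"ADJ": {"NOUN": 0.26, "VERB": 0.12, "ADJ": 0.62}}  # toy row
fine_trans = expand_coarse_to_fine(coarse_trans, fine_of)
print(fine_trans["JJ"]["NN"], fine_trans["JJ"]["VB"])
```

Each expanded row still sums to one, so it can be used directly as a starting point for further training.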


Experiments

Two Problems

Unsupervised Part-of-Speech Tagging

Model: feature-based HMM (Berg-Kirkpatrick et al., 2010)

Learning: L-BFGS

Unsupervised Dependency Parsing

Model: DMV (Klein and Manning, 2004)

Learning: EM

Languages

Target Languages: Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, and Turkish

Helper Languages: English, German, Italian and Czech

(CoNLL Treebanks from 2006 and 2007)

Results: POS Tagging (without tag dictionary)

Direct Gradient (DG): monolingual baseline (Berg-Kirkpatrick et al., 2010)
Uniform + DG: uniform mixture parameters (no learning)
Mixture + DG: full model

Method         Number of Languages with Best Results   Average Accuracy
DG             2 (Portuguese, Danish)                  40.6
Uniform + DG   2 (Turkish, Bulgarian)                  41.0
Mixture + DG   6                                       43.3

Results: Dependency Parsing

EM: Monolingual EM (Klein and Manning, 2004)
PR: Posterior Regularization (Gillenwater et al., 2010)
PGI: Phylogenetic Grammar Induction (Berg-Kirkpatrick and Klein, 2010)
Uniform: 1. Uniform mixture parameters 2. No coarse-to-fine expansion (no learning)
Mixture: 1. Learned mixture parameters 2. No coarse-to-fine expansion
Uniform + EM: 1. Uniform mixture parameters 2. Coarse-to-fine expansion → monolingual learning
Mixture + EM: 1. Learned mixture parameters 2. Coarse-to-fine expansion → monolingual learning

Method         Number of Languages with Best Results   Average Accuracy
EM             0                                       41.4
PR             2 (Turkish, Slovene)                    50.2*
PGI            0                                       53.6*
Uniform        3 (Bulgarian, Swedish, Dutch)           61.6
Mixture        1 (Danish)                              62.2
Uniform + EM   1 (Greek)                               61.5
Mixture + EM   3 (Portuguese, Japanese, Spanish)       62.1


Analyzing with Principal Component Analysis

Two principal components

From Words to Dependencies

Use induced tags to induce dependencies

1. In a pipeline
2. Using the posteriors over tags in a sausage lattice (Cohen and Smith, 2007)

From Words to Dependencies

Joint Decoding: DMV parsing a lattice

Lattice nodes: 1 2 3 4

Word     DET    ADJ    NOUN
The      0.95   0.03   0.02
Skibo    0.0    0.3    0.7
Castle   0.01   0.1    0.89
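A minimal sketch of the sausage lattice as a data structure, using the posteriors shown on the slide. It contrasts a 1-best tag sequence (what a pipeline would pass on, assuming per-position argmax decoding) with the weighted arcs that a lattice parser would consume. The parser itself is not shown, and the function names are hypothetical.

```python
# One dict of tag posteriors per word; word j spans lattice nodes j and j+1.
LATTICE = [
    ("The", {"DET": 0.95, "ADJ": 0.03, "NOUN": 0.02}),
    ("Skibo", {"DET": 0.0, "ADJ": 0.3, "NOUN": 0.7}),
    ("Castle", {"DET": 0.01, "ADJ": 0.1, "NOUN": 0.89}),
]

def one_best_tags(lattice):
    """Pipeline view: keep only the most probable tag at each position."""
    return [max(post, key=post.get) for _, post in lattice]

def lattice_arcs(lattice, min_prob=0.0):
    """Joint view: keep every tag with posterior above min_prob as an arc."""
    arcs = []
    for start, (word, post) in enumerate(lattice, start=1):
        for tag, p in post.items():
            if p > min_prob:
                arcs.append((start, start + 1, word, tag, p))
    return arcs

print(one_best_tags(LATTICE))
print(len(lattice_arcs(LATTICE)))
```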

Results: Words to Dependencies

Setting                  Number of Languages with Best Results                 Average
Pipeline, DG             1 (Greek)                                             56.9
Pipeline, Mixture + DG   0                                                     54.0
Joint, DG                5 (Portuguese, Turkish, Swedish, Slovene, Danish)     57.9
Joint, Mixture + DG      4 (Bulgarian, Japanese, Spanish, Dutch)               55.6

Best average result with gold tags: 62.2

Interesting result: Auto tags perform better for Turkish and Slovene

Conclusions

• Improvements for two major tasks using non-parallel multilingual guidance
• In general, grammar induction results are better than POS tagging results
• Joint POS and dependency parsing performs surprisingly well
• For a few languages, results are better than using gold tags
• Joint decoding performs better than a pipeline

Questions?

Results: POS Tagging (without tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    34.7                   38.0           35.8
Danish       48.8                   36.2           39.9
Dutch        45.4                   43.7           50.2
Greek        35.3                   36.7           38.9
Japanese     52.3                   60.4           61.7
Portuguese   53.5                   45.7           51.5
Slovene      33.4                   35.9           36.0
Spanish      40.0                   31.8           40.5
Swedish      34.4                   37.7           39.9
Turkish      27.9                   43.6           38.6
Average      40.6                   41.0           43.3
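As a cross-check, the summary figures reported earlier for this setting (average accuracy and number of languages with best results per method) can be recomputed from the per-language numbers above. This is only an illustrative sketch; the variable and function names are hypothetical.

```python
METHODS = ("DG", "Uniform + DG", "Mixture + DG")

# Per-language accuracies without tag dictionary, in METHODS order.
RESULTS = {
    "Bulgarian": (34.7, 38.0, 35.8),
    "Danish": (48.8, 36.2, 39.9),
    "Dutch": (45.4, 43.7, 50.2),
    "Greek": (35.3, 36.7, 38.9),
    "Japanese": (52.3, 60.4, 61.7),
    "Portuguese": (53.5, 45.7, 51.5),
    "Slovene": (33.4, 35.9, 36.0),
    "Spanish": (40.0, 31.8, 40.5),
    "Swedish": (34.4, 37.7, 39.9),
    "Turkish": (27.9, 43.6, 38.6),
}

def summarize(results):
    """Return per-method averages and the languages each method wins."""
    n = len(results)
    averages = [
        round(sum(row[j] for row in results.values()) / n, 1)
        for j in range(len(METHODS))
    ]
    best = {m: [] for m in METHODS}
    for lang, row in results.items():
        best[METHODS[row.index(max(row))]].append(lang)
    return averages, best

averages, best = summarize(RESULTS)
print(averages)
print({m: len(langs) for m, langs in best.items()})
```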


Results: POS Tagging (with tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    80.7                   81.3           82.6
Danish       82.3                   82.0           82.0
Dutch        79.2                   79.3           80.0
Greek        88.0                   80.3           80.3
Japanese     83.4                   77.9           79.9
Portuguese   75.4                   83.8           84.7
Slovene      75.6                   82.8           82.8
Spanish      82.3                   82.3           83.3
Swedish      61.5                   69.0           67.0
Turkish      50.4                   50.4           50.4
Average      75.9                   76.9           77.3

Results: POS Tagging (with tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    80.7                   81.3           82.6
Danish       82.3                   82.0           82.0
Dutch        79.2                   79.3           80.0
Greek        88.0                   80.3           80.3
Japanese     83.4                   77.9           79.9
Portuguese   75.4                   83.8           84.7
Slovene      75.6                   82.8           82.8
Spanish      82.3                   82.3           83.3
Swedish      61.5                   69.0           67.0
Turkish      50.4                   50.4           50.0
Average      75.9                   76.9           77.3

Results: Dependency Parsing

Language     EM     PR     PGI    Uniform   Mixture   Uniform + EM   Mixture + EM
Bulgarian    54.3   54.0   -      75.6      75.5      74.7           72.8
Danish       41.4   44.0   41.6   59.2      59.9      51.3           55.2
Dutch        38.6   37.9   45.1   50.7      51.1      45.9           46.0
Greek        41.0   -      -      57.0      59.5      73.0           72.3
Japanese     43.0   60.2   -      56.3      58.3      59.8           63.9
Portuguese   42.5   47.8   63.1   78.6      76.8      78.7           79.8
Slovene      37.0   50.3   49.6   46.1      46.0      41.3           41.0
Spanish      38.1   62.4   63.8   73.2      75.9      75.5           76.7
Swedish      42.3   42.2   58.3   74.0      73.2      70.5           68.7
Turkish      36.3   53.4   -      45.0      45.3      43.9           44.1
Average      41.4   -      -      61.6      62.2      61.5           62.1


Results: Words to Dependencies

Language     Joint: DG   Joint: Mixture + DG   Pipeline: DG   Pipeline: Mixture + DG   Gold Tags
Bulgarian    62.4        67.0                  57.7           62.9                     75.6
Danish       50.4        50.1                  48.9           48.3                     59.9
Dutch        48.3        52.2                  49.9           51.2                     50.7
Greek        63.5        52.2                  68.2           50.0                     73.0
Japanese     61.4        69.5                  64.2           68.6                     63.9
Portuguese   68.4        62.2                  60.0           59.8                     79.8
Slovene      47.2        36.8                  45.8           36.4                     46.1
Spanish      67.7        69.3                  65.8           68.1                     76.7
Swedish      58.2        49.1                  57.9           47.6                     74.0
Turkish      52.4        47.4                  50.8           47.1                     45.3
Average      57.9        55.0                  56.9           54.0                     64.5
