
Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance


Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance. Shay B. Cohen, Dipanjan Das, Noah A. Smith, Carnegie Mellon University. July 27, EMNLP 2011. Goal: learn linguistic structure for a language without any labeled data in that language.


Page 1: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Shay B. Cohen, Dipanjan Das, Noah A. Smith

Carnegie Mellon University

July 27, EMNLP 2011

Page 2: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Goal:

Learn linguistic structure for a language without any labeled data in that language

Part-of-Speech Tagging

The   Skibo   Castle   is     close   by    .
DET   NOUN    NOUN     VERB   ADJ     ADP   .

Dependency Parsing

Cohen, Das and Smith (2011), EMNLP 2011

Page 3: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Multilingual Unsupervised Learning

Using parallel data:
  supervision in source language(s): Yarowsky and Ngai (2001), Xi and Hwa (2005), Smith and Eisner (2009), Das and Petrov (2011), McDonald et al. (2011)
  joint learning for multiple languages: Snyder et al. (2009), Naseem et al. (2010)

No parallel data (hard):
  supervision in source language(s): This work!
  joint learning for multiple languages: Cohen and Smith (2009), Berg-Kirkpatrick and Klein (2010)

Page 4: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

In a Nutshell

Annotated data (Spanish, Italian) + unlabeled data in Portuguese = Portuguese parameters

1. Annotated data in Spanish and Italian → coarse, universal parameters for each helper language
2. Interpolation (unsupervised training) → coarse parameters of Portuguese
3. Coarse-to-fine expansion and initialization
4. Monolingual unsupervised training in Portuguese → Portuguese parameters

Page 5: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

HMM, Merialdo (1994): The Skibo Castle is close by

Page 6: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

DMV, Klein and Manning (2004): ROOT DET NOUN NOUN VERB ADJ ADP

Page 7: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

Composed of multinomial distributions

HMM, Merialdo (1994): The Skibo Castle is close by

Page 8: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

Composed of multinomial distributions

DMV, Klein and Manning (2004): ROOT DET NOUN NOUN VERB ADJ ADP

Page 9: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

In general, unlexicalized parameters look like $\theta_{k,i}$, where k picks out the kth multinomial in the model and i picks out the ith event in that multinomial,

e.g. the transition from ADJ (k) to NOUN (i).

Page 10: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

The lexicalized parameters take a similar form. (No lexicalized parameters for the DMV.)

Page 11: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

1. Underlying model is generative

The probability of a derivation is a product of unlexicalized and lexicalized parameters, each raised to the number of times event i of multinomial k fires in the derivation.
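The formula itself did not survive extraction. One way to write a model of this kind is sketched below. The symbols are assumptions, not taken from the slide: $\theta$ for unlexicalized parameters, $\eta$ for lexicalized parameters, and $f$, $g$ for the counts of how often each event fires in the derivation $(\mathbf{x}, \mathbf{y})$.

```latex
p(\mathbf{x}, \mathbf{y} \mid \theta, \eta)
  = \underbrace{\prod_{k=1}^{K} \prod_{i=1}^{N_k}
      \theta_{k,i}^{\, f_{k,i}(\mathbf{x}, \mathbf{y})}}_{\text{unlexicalized}}
    \;\times\;
    \underbrace{\prod_{k=1}^{K'} \prod_{i=1}^{N'_k}
      \eta_{k,i}^{\, g_{k,i}(\mathbf{x}, \mathbf{y})}}_{\text{lexicalized}}
```

For an HMM, the unlexicalized factor would collect the tag transitions and the lexicalized factor the word emissions; for the DMV the second factor is absent.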

Page 12: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

2. Coarse, universal part-of-speech tags

VERB, NOUN, PRON, ADJ, ADV, ADP, DET, CONJ, NUM, PRT, ., X

Page 13: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

2. Coarse, universal part-of-speech tags

VERB, NOUN, PRON, ADJ, ADV, ADP, DET, CONJ, NUM, PRT, ., X

For each language, there is a mapping from its treebank tagset to the universal tags.
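As a rough illustration of such a mapping, the sketch below converts a few English Penn Treebank tags to the coarse tags. The dictionary is a hypothetical excerpt limited to tags that show up in this deck, and the names `FINE_TO_COARSE` and `to_coarse` are made up for the example; a real mapping covers the whole treebank tagset.

```python
# Hypothetical excerpt of a fine-to-coarse mapping for English (Penn Treebank tags).
# Only a handful of tags are listed; anything missing falls back to the catch-all "X".
FINE_TO_COARSE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "DT": "DET", "IN": "ADP", ".": ".",
}


def to_coarse(fine_tags, mapping=FINE_TO_COARSE, unknown="X"):
    """Map a sequence of treebank tags to coarse, universal tags."""
    return [mapping.get(tag, unknown) for tag in fine_tags]


# "The Skibo Castle is close by ."
coarse = to_coarse(["DT", "NNP", "NNP", "VBZ", "JJ", "IN", "."])
print(coarse)
```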

Page 14: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Assumptions for a given problem:

3. Helper languages

For each helper language:

Treebank → (coarse conversion) → coarse treebank → (MLE) → unlexicalized parameters
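The MLE step can be sketched for the simplest case, HMM tag transitions, as relative-frequency counting over a coarse treebank. The three toy sentences and the function name `mle_transitions` are invented for illustration; start and stop transitions and smoothing are left out.

```python
from collections import Counter, defaultdict


def mle_transitions(tagged_sentences):
    """Relative-frequency (MLE) estimate of tag-to-tag transition multinomials."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    theta = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        theta[prev] = {nxt: c / total for nxt, c in ctr.items()}
    return theta


# Toy coarse treebank (invented sentences, tags only).
toy_treebank = [
    ["DET", "NOUN", "VERB", "ADJ", "NOUN", "."],
    ["DET", "ADJ", "NOUN", "VERB", "."],
    ["ADJ", "ADJ", "NOUN", "."],
]
theta = mle_transitions(toy_treebank)
print(theta["ADJ"])  # the "ADJ -> ." multinomial of this toy helper language
```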

Page 15: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Multilingual Modeling

Page 16: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

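Multilingual Modeling

For a target language, unlexicalized parameters:

$$\theta_k = \sum_{\ell=1}^{L} \beta_{\ell,k}\, \theta_k^{(\ell)}$$

where $\theta_k$ is the kth multinomial in the model (say, the transitions from the ADJ tag in an HMM), $\theta_k^{(\ell)}$ is the same multinomial in the ℓth helper language, and $\beta_{\ell,k}$ is the mixture weight for the kth multinomial for the ℓth helper language.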
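A minimal sketch of this interpolation for one multinomial, in plain Python. The helper distributions below are toy three-event examples, not values from the talk, and `interpolate` is a name chosen for the example.

```python
def interpolate(helper_thetas, beta):
    """Mix the same multinomial from several helper languages.

    helper_thetas: one probability vector per helper language
    beta: one mixture weight per helper language (must sum to 1)
    """
    if abs(sum(beta) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    n_events = len(helper_thetas[0])
    return [
        sum(b * theta[i] for b, theta in zip(beta, helper_thetas))
        for i in range(n_events)
    ]


# Toy example: two helper languages, three events.
helpers = [
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
]
theta_target = interpolate(helpers, [0.7, 0.3])
print(theta_target)
```

Because the weights are non-negative and sum to one, the result is again a probability distribution, so the target model stays a valid generative model.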

Page 17: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

Transitions out of ADJ (ADJ → ·): the two helper distributions, with mixture weights 0.7 and 0.3, and the interpolated result. Values shown as 0.0… are truncated in the source.

Tag    Helper (weight 0.7)   Helper (weight 0.3)   Interpolated
NOUN   0.27                  0.25                  0.26
VERB   0.12                  0.11                  0.12
ADJ    0.03                  0.04                  0.03
ADV    0.04                  0.04                  0.04
PRON   0.04                  0.06                  0.05
DET    0.03                  0.04                  0.03
ADP    0.25                  0.26                  0.25
NUM    0.01                  0.0…                  0.01
CONJ   0.10                  0.20                  0.13
PRT    0.05                  0.0…                  0.04
.      0.01                  0.0…                  0.01
X      0.05                  0.00                  0.04

Page 18: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

The two helper distributions for ADJ → · are given, but their mixture weights are unknown (? ?). Values shown as 0.0… are truncated in the source.

Tag order: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X
Helper 1:  0.27, 0.12, 0.03, 0.04, 0.04, 0.03, 0.25, 0.01, 0.10, 0.05, 0.01, 0.05
Helper 2:  0.25, 0.11, 0.04, 0.04, 0.06, 0.04, 0.26, 0.0…, 0.20, 0.0…, 0.0…, 0.00

Page 19: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

Page 20: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

normal learning

Page 21: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

multilingual learning

The helper-language parameters are fixed!

Page 22: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

Multilingual learning with EM:

Number of times $\theta_{k,i}$ is used in a derivation

M-step:
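The update formula is missing from the extracted text. The sketch below shows the textbook EM update for mixture weights when the mixed distributions are held fixed, which is an assumption about what the slide showed rather than a transcription of it. It handles a single multinomial; `expected_counts` stands in for the expected event counts an E-step (forward-backward or inside-outside) would provide, and all names and numbers are illustrative.

```python
def em_update_mixture_weights(beta, helper_thetas, expected_counts):
    """One EM update of the mixture weights of a single multinomial.

    beta: current mixture weights, one per helper language
    helper_thetas: fixed helper distributions, one probability vector per language
    expected_counts: expected number of times each event fires (from the E-step)
    """
    n_langs = len(beta)
    new_mass = [0.0] * n_langs
    for i, count in enumerate(expected_counts):
        mix = sum(beta[l] * helper_thetas[l][i] for l in range(n_langs))
        if mix == 0.0 or count == 0.0:
            continue
        for l in range(n_langs):
            # Posterior probability that helper l "generated" event i.
            responsibility = beta[l] * helper_thetas[l][i] / mix
            new_mass[l] += count * responsibility
    total = sum(new_mass)
    return [m / total for m in new_mass]


# Toy check: the counts look like helper 0, so its weight should grow.
helpers = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
counts = [80.0, 10.0, 10.0]
beta = em_update_mixture_weights([0.5, 0.5], helpers, counts)
print(beta)
```

Repeating the update moves the weights further toward the helper language whose distribution best explains the expected counts.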

Page 23: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

Multilingual learning

What about feature-rich generative models? Berg-Kirkpatrick et al. (2010)

Locally normalized log-linear model
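In a locally normalized log-linear model, each multinomial is a softmax over feature scores of its events instead of a free probability table. A minimal sketch follows; the feature names, weights, and the function `local_softmax` are hypothetical and only illustrate the normalization.

```python
import math


def local_softmax(weights, features_per_event):
    """Probabilities of the events of one multinomial, normalized locally.

    weights: feature name -> weight
    features_per_event: for each event, the list of features that fire on it
    """
    scores = [sum(weights.get(f, 0.0) for f in feats) for feats in features_per_event]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical emission multinomial for one tag over three word types.
weights = {"suffix=-s": 1.0, "capitalized": 0.5}
events = ["castles", "Skibo", "close"]
features = [["suffix=-s"], ["capitalized"], []]
probs = local_softmax(weights, features)
print(dict(zip(events, probs)))
```

With such a parameterization the likelihood is optimized over the feature weights with a gradient-based method, which fits the L-BFGS learner listed for the tagging model later in the deck.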

Page 24: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

The two helper distributions for ADJ → · are given, but their mixture weights are unknown (? ?). Values shown as 0.0… are truncated in the source.

Tag order: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X
Helper 1:  0.27, 0.12, 0.03, 0.04, 0.04, 0.03, 0.25, 0.01, 0.10, 0.05, 0.01, 0.05
Helper 2:  0.25, 0.11, 0.04, 0.04, 0.06, 0.04, 0.26, 0.0…, 0.20, 0.0…, 0.0…, 0.00

Page 25: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

The mixture weights are learned: 0.6237 and 0.3763. Values shown as 0.0… are truncated in the source.

Tag order:                NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X
Helper 1 (weight 0.6237): 0.27, 0.12, 0.03, 0.04, 0.04, 0.03, 0.25, 0.01, 0.10, 0.05, 0.01, 0.05
Helper 2 (weight 0.3763): 0.25, 0.11, 0.04, 0.04, 0.06, 0.04, 0.26, 0.0…, 0.20, 0.0…, 0.0…, 0.00
Interpolated ADJ → ·:     0.26, 0.12, 0.03, 0.04, 0.05, 0.03, 0.25, 0.01, 0.13, 0.04, 0.01, 0.04

Page 26: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

Coarse-to-fine expansion (for English)

Step 1: identical copies. Each fine tag that maps to ADJ (JJ, JJR, JJS) receives an identical copy of the coarse ADJ → · distribution:

Tag order:        NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X
ADJ → · (coarse): 0.26, 0.12, 0.03, 0.04, 0.05, 0.03, 0.25, 0.01, 0.13, 0.04, 0.01, 0.04
JJ → ·:           0.26, 0.12, 0.03, 0.04, 0.05, 0.03, 0.25, 0.01, 0.13, 0.04, 0.01, 0.04
JJR → ·:          0.26, 0.12, 0.03, 0.04, 0.05, 0.03, 0.25, 0.01, 0.13, 0.04, 0.01, 0.04
JJS → ·:          0.26, 0.12, 0.03, 0.04, 0.05, 0.03, 0.25, 0.01, 0.13, 0.04, 0.01, 0.04

Page 27: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

Coarse-to-fine expansion (for English)

Tag order: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X
JJ → ·:    0.26, 0.12, 0.03, 0.04, 0.05, 0.03, 0.25, 0.01, 0.13, 0.04, 0.01, 0.04

Page 28: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Learning and Inference

Coarse-to-fine expansion (for English)

Step 2: equal division. In each copied distribution (here JJ → ·), the probability of a coarse tag is divided equally among its fine tags:

NOUN 0.26 → NN, NNS, NNP, NNPS: 0.065 each
VERB 0.12 → VB, VBD, VBG, VBN, VBP, VBZ: 0.02 each
(and likewise for the remaining coarse tags)

The new, fine JJ → · distribution is the initializer for monolingual unsupervised training.
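The two expansion steps can be sketched as follows. The coarse ADJ row is copied from the slides; the fine-tag inventory is limited to the English tags named in the deck, and every coarse tag without a listed refinement is treated as its own single fine tag, which is a simplification made for this example. The function name `expand_coarse_to_fine` is illustrative.

```python
def expand_coarse_to_fine(coarse_theta, fine_tags_of):
    """Turn coarse transition multinomials into fine-grained initializers.

    coarse_theta: coarse source tag -> {coarse target tag: probability}
    fine_tags_of: coarse tag -> list of fine tags (unlisted tags map to themselves)
    """
    fine_theta = {}
    for src, row in coarse_theta.items():
        # Step 2: split each coarse target's mass equally among its fine tags.
        fine_row = {}
        for dst, p in row.items():
            fines = fine_tags_of.get(dst, [dst])
            for fine_dst in fines:
                fine_row[fine_dst] = p / len(fines)
        # Step 1: every fine tag of the source gets an identical copy.
        for fine_src in fine_tags_of.get(src, [src]):
            fine_theta[fine_src] = dict(fine_row)
    return fine_theta


fine_tags_of = {
    "NOUN": ["NN", "NNS", "NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
}
coarse_theta = {
    "ADJ": {"NOUN": 0.26, "VERB": 0.12, "ADJ": 0.03, "ADV": 0.04,
            "PRON": 0.05, "DET": 0.03, "ADP": 0.25, "NUM": 0.01,
            "CONJ": 0.13, "PRT": 0.04, ".": 0.01, "X": 0.04},
}
fine_theta = expand_coarse_to_fine(coarse_theta, fine_tags_of)
print(fine_theta["JJ"]["NN"], fine_theta["JJ"]["VB"])
```

The expansion preserves the total probability mass of each row, so the fine-grained table can be handed to monolingual training as an initializer without renormalization beyond the rounding already present in the slide values.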

Page 29: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Experiments

Page 30: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Two Problems

Unsupervised Part-of-Speech Tagging
Model: feature-based HMM (Berg-Kirkpatrick et al., 2010)
Learning: L-BFGS

Unsupervised Dependency Parsing
Model: DMV (Klein and Manning, 2004)
Learning: EM

Page 31: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Languages

Target Languages: Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, and Turkish

Helper Languages: English, German, Italian and Czech

(CoNLL Treebanks from 2006 and 2007)

Page 32: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: POS Tagging (without tag dictionary)

Systems compared, by number of languages with best results and average accuracy:

Direct Gradient (DG): monolingual baseline (Berg-Kirkpatrick et al., 2010)
Uniform + DG: uniform mixture parameters (no learning)
Mixture + DG: full model

Page 33: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: POS Tagging (without tag dictionary)

System                 Languages with best results   Average accuracy
Direct Gradient (DG)   2 (Portuguese, Danish)        40.6
Uniform + DG           2 (Turkish, Bulgarian)        41.0
Mixture + DG           6                             43.3

Page 34: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Dependency Parsing

Baselines compared, by number of languages with best results and average accuracy:

EM: monolingual EM (Klein and Manning, 2004)
PR: Posterior Regularization (Gillenwater et al., 2010)
PGI: Phylogenetic Grammar Induction (Berg-Kirkpatrick and Klein, 2010)

Page 35: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Dependency Parsing

Compared against EM, PR and PGI, by number of languages with best results and average accuracy:

Uniform: 1. Uniform mixture parameters 2. No coarse-to-fine expansion (no learning)
Mixture: 1. Learned mixture parameters 2. No coarse-to-fine expansion
Uniform + EM: 1. Uniform mixture parameters 2. Coarse-to-fine expansion → monolingual learning
Mixture + EM: 1. Learned mixture parameters 2. Coarse-to-fine expansion → monolingual learning

Page 36: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Dependency Parsing

System   Languages with best results   Average accuracy
EM       0                             41.4
PR       2 (Turkish, Slovene)          50.2*
PGI      0                             53.6*

(Columns for Uniform, Mixture, Uniform + EM and Mixture + EM are not yet filled in on this slide.)

Page 37: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Dependency Parsing

System         Languages with best results         Average accuracy
EM             0                                   41.4
PR             2 (Turkish, Slovene)                50.2*
PGI            0                                   53.6*
Uniform        3 (Bulgarian, Swedish, Dutch)       61.6
Mixture        1 (Danish)                          62.2
Uniform + EM   1 (Greek)                           61.5
Mixture + EM   3 (Portuguese, Japanese, Spanish)   62.1

Page 38: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance


Analyzing with Principal Component Analysis

Two principal components

Page 39: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

From Words to Dependencies

Page 40: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

From Words to Dependencies

Use induced tags to induce dependencies

1. In a pipeline
2. Using the posteriors over tags in a sausage lattice (Cohen and Smith, 2007)

Page 41: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

From Words to Dependencies

Joint Decoding: the DMV parses a lattice of tag posteriors (lattice states 1 2 3 4).

Word     DET    ADJ    NOUN
The      0.95   0.03   0.02
Skibo    0.0    0.3    0.7
Castle   0.01   0.1    0.89
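The difference between the two options can be sketched with the posteriors from this slide. A pipeline commits to the single best tag per word, while the lattice keeps every tag sequence with non-zero posterior so that a parser can weigh them. The data structure and function names are illustrative, the word-to-posterior alignment is read off the slide order, and the DMV scoring that a real lattice parser would add is not implemented.

```python
from itertools import product

# Sausage lattice: one set of tag posteriors per word (values from the slide).
lattice = [
    ("The",    {"DET": 0.95, "ADJ": 0.03, "NOUN": 0.02}),
    ("Skibo",  {"DET": 0.0,  "ADJ": 0.3,  "NOUN": 0.7}),
    ("Castle", {"DET": 0.01, "ADJ": 0.1,  "NOUN": 0.89}),
]


def one_best(lattice):
    """Pipeline: keep only the most probable tag for each word."""
    return [max(posterior, key=posterior.get) for _, posterior in lattice]


def lattice_paths(lattice):
    """Joint decoding input: all tag sequences with non-zero posterior, best first."""
    options = [
        [(tag, p) for tag, p in posterior.items() if p > 0.0]
        for _, posterior in lattice
    ]
    paths = []
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        paths.append(([tag for tag, _ in combo], prob))
    return sorted(paths, key=lambda item: -item[1])


pipeline_tags = one_best(lattice)
paths = lattice_paths(lattice)
print(pipeline_tags, len(paths))
```

A lattice parser would combine each path's tag posterior with the parse score instead of enumerating paths; explicit enumeration is used here only because the example is tiny.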

Page 42: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Words to Dependencies

Systems compared, by number of languages with best results and average:

Pipeline: DG, Mixture + DG
Joint: DG, Mixture + DG

Page 43: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Words to Dependencies

System                   Languages with best results                          Average
Pipeline, DG             1 (Greek)                                            56.9
Pipeline, Mixture + DG   0                                                    54.0
Joint, DG                5 (Portuguese, Turkish, Swedish, Slovene, Danish)    57.9
Joint, Mixture + DG      4 (Bulgarian, Japanese, Spanish, Dutch)              55.6

Page 44: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Words to Dependencies

System                   Languages with best results                          Average
Pipeline, DG             1 (Greek)                                            56.9
Pipeline, Mixture + DG   0                                                    54.0
Joint, DG                5 (Portuguese, Turkish, Swedish, Slovene, Danish)    57.9
Joint, Mixture + DG      4 (Bulgarian, Japanese, Spanish, Dutch)              55.6

Best average result with gold tags: 62.2

Interesting result: auto tags perform better for Turkish and Slovene

Page 45: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Conclusions

Page 46: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Conclusions

• Improvements for two major tasks using non-parallel multilingual guidance
• In general, grammar induction results are better than POS tagging results
• Joint POS tagging and dependency parsing performs surprisingly well
  • For a few languages, results are better than using gold tags
  • Joint decoding performs better than a pipeline

Page 47: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Questions?

Page 48: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: POS Tagging (without tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    34.7                   38.0           35.8
Danish       48.8                   36.2           39.9
Dutch        45.4                   43.7           50.2
Greek        35.3                   36.7           38.9
Japanese     52.3                   60.4           61.7
Portuguese   53.5                   45.7           51.5
Slovene      33.4                   35.9           36.0
Spanish      40.0                   31.8           40.5
Swedish      34.4                   37.7           39.9
Turkish      27.9                   43.6           38.6
Average      40.6                   41.0           43.3

Page 50: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: POS Tagging (with tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    80.7                   81.3           82.6
Danish       82.3                   82.0           82.0
Dutch        79.2                   79.3           80.0
Greek        88.0                   80.3           80.3
Japanese     83.4                   77.9           79.9
Portuguese   75.4                   83.8           84.7
Slovene      75.6                   82.8           82.8
Spanish      82.3                   82.3           83.3
Swedish      61.5                   69.0           67.0
Turkish      50.4                   50.4           50.4
Average      75.9                   76.9           77.3

Page 51: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: POS Tagging (with tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    80.7                   81.3           82.6
Danish       82.3                   82.0           82.0
Dutch        79.2                   79.3           80.0
Greek        88.0                   80.3           80.3
Japanese     83.4                   77.9           79.9
Portuguese   75.4                   83.8           84.7
Slovene      75.6                   82.8           82.8
Spanish      82.3                   82.3           83.3
Swedish      61.5                   69.0           67.0
Turkish      50.4                   50.4           50.0
Average      75.9                   76.9           77.3

Page 52: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Dependency Parsing

Language     EM     PR     PGI    Uniform   Mixture   Uniform + EM   Mixture + EM
Bulgarian    54.3   54.0   -      75.6      75.5      74.7           72.8
Danish       41.4   44.0   41.6   59.2      59.9      51.3           55.2
Dutch        38.6   37.9   45.1   50.7      51.1      45.9           46.0
Greek        41.0   -      -      57.0      59.5      73.0           72.3
Japanese     43.0   60.2   -      56.3      58.3      59.8           63.9
Portuguese   42.5   47.8   63.1   78.6      76.8      78.7           79.8
Slovene      37.0   50.3   49.6   46.1      46.0      41.3           41.0
Spanish      38.1   62.4   63.8   73.2      75.9      75.5           76.7
Swedish      42.3   42.2   58.3   74.0      73.2      70.5           68.7
Turkish      36.3   53.4   -      45.0      45.3      43.9           44.1
Average      41.4   -      -      61.6      62.2      61.5           62.1

Page 54: Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

Results: Words to Dependencies

             Joint                 Pipeline              Gold Tags
Language     DG     Mixture + DG   DG     Mixture + DG
Bulgarian    62.4   67.0           57.7   62.9           75.6
Danish       50.4   50.1           48.9   48.3           59.9
Dutch        48.3   52.2           49.9   51.2           50.7
Greek        63.5   52.2           68.2   50.0           73.0
Japanese     61.4   69.5           64.2   68.6           63.9
Portuguese   68.4   62.2           60.0   59.8           79.8
Slovene      47.2   36.8           45.8   36.4           46.1
Spanish      67.7   69.3           65.8   68.1           76.7
Swedish      58.2   49.1           57.9   47.6           74.0
Turkish      52.4   47.4           50.8   47.1           45.3
Average      57.9   55.6           56.9   54.0           64.5
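The Average row can be recomputed from the per-language rows as a sanity check. The sketch below uses the values from the table above; the variable names are illustrative. The Joint, Mixture + DG column averages to about 55.6, in line with the figure given on the summary slides.

```python
# Per-language accuracies from the table above, in column order:
# Joint DG, Joint Mixture + DG, Pipeline DG, Pipeline Mixture + DG, Gold tags.
columns = ("Joint DG", "Joint Mixture + DG", "Pipeline DG",
           "Pipeline Mixture + DG", "Gold tags")
rows = {
    "Bulgarian":  (62.4, 67.0, 57.7, 62.9, 75.6),
    "Danish":     (50.4, 50.1, 48.9, 48.3, 59.9),
    "Dutch":      (48.3, 52.2, 49.9, 51.2, 50.7),
    "Greek":      (63.5, 52.2, 68.2, 50.0, 73.0),
    "Japanese":   (61.4, 69.5, 64.2, 68.6, 63.9),
    "Portuguese": (68.4, 62.2, 60.0, 59.8, 79.8),
    "Slovene":    (47.2, 36.8, 45.8, 36.4, 46.1),
    "Spanish":    (67.7, 69.3, 65.8, 68.1, 76.7),
    "Swedish":    (58.2, 49.1, 57.9, 47.6, 74.0),
    "Turkish":    (52.4, 47.4, 50.8, 47.1, 45.3),
}

# Unweighted mean over the ten target languages for each column.
means = {
    name: sum(values[j] for values in rows.values()) / len(rows)
    for j, name in enumerate(columns)
}
for name, value in means.items():
    print(f"{name}: {value:.2f}")
```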