Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

July 27, EMNLP 2011

Shay B. Cohen Dipanjan Das Noah A. Smith

Carnegie Mellon University

Goal: Learn linguistic structure for a language without any labeled data in that language

Part-of-Speech Tagging

DET NOUN NOUN VERB ADJ ADP .
The Skibo Castle is close by .

Dependency Parsing

Cohen, Das and Smith (2011), EMNLP 2011

Multilingual Unsupervised Learning

Using parallel data:
• supervision in source language(s): Yarowsky and Ngai (2001); Xi and Hwa (2005); Smith and Eisner (2009); Das and Petrov (2011); McDonald et al. (2011)
• joint learning for multiple languages: Snyder et al. (2009); Naseem et al. (2010)

No parallel data (hard):
• supervision in source language(s): this work!
• joint learning for multiple languages: Cohen and Smith (2009); Berg-Kirkpatrick and Klein (2010)

In a Nutshell

Annotated data (Spanish, Italian) → coarse, universal parameters for each helper language

Coarse, universal parameters + unlabeled data in Portuguese → interpolation (unsupervised training) = coarse parameters of Portuguese

Coarse parameters of Portuguese → coarse-to-fine expansion and initialization → monolingual unsupervised training in Portuguese → Portuguese parameters

Assumptions for a given problem:

1. Underlying model is generative

HMM (Merialdo, 1994): The Skibo Castle is close by

DMV (Klein and Manning, 2004): ROOT DET NOUN NOUN VERB ADJ ADP

Composed of multinomial distributions

In general, unlexicalized parameters look like:

$\theta_{k,i}$: the $k$th multinomial in the model, the $i$th event in the multinomial

e.g. transition from ADJ ($k$) to NOUN ($i$)

The lexicalized parameters take a similar form
(No lexicalized parameters for the DMV)

The probability of a derivation is the product of an unlexicalized part and a lexicalized part, with each parameter raised to the number of times event $i$ of multinomial $k$ fires in the derivation
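The product form described on this slide can be written out as below. This is a sketch under assumed notation: the symbols did not survive in the slide text, so $\theta$ (unlexicalized parameters), $\eta$ (lexicalized parameters), the count functions $f$ and $g$, and the index bounds $K$, $N_k$, $J$, $M_j$ are all illustrative choices.

```latex
p(\mathbf{x}, \mathbf{y} \mid \theta, \eta)
  = \underbrace{\prod_{k=1}^{K} \prod_{i=1}^{N_k}
      \theta_{k,i}^{\, f_{k,i}(\mathbf{x}, \mathbf{y})}}_{\text{unlexicalized}}
    \times
    \underbrace{\prod_{j=1}^{J} \prod_{i=1}^{M_j}
      \eta_{j,i}^{\, g_{j,i}(\mathbf{x}, \mathbf{y})}}_{\text{lexicalized}}
```

Here $f_{k,i}(\mathbf{x}, \mathbf{y})$ is the number of times event $i$ of multinomial $k$ fires in the derivation, and $g_{j,i}$ plays the same role for the lexicalized multinomials. For the DMV the second product would be empty.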




2. Coarse, universal part-of-speech tags

VERB, NOUN, PRON, ADJ, ADV, ADP, DET, CONJ, NUM, PRT, ., X

For each language, there is a mapping from the treebank tagset to the coarse tags

Treebank → coarse conversion → coarse treebank
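The coarse conversion can be sketched as a dictionary lookup. This is a minimal illustration, assuming a small excerpt of a Penn Treebank to coarse-tag mapping; the names `FINE_TO_COARSE` and `coarsen` are hypothetical, and sending unmapped tags to X is an assumption.

```python
# Hypothetical excerpt of a fine-to-coarse mapping for the Penn Treebank tagset.
FINE_TO_COARSE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "DT": "DET", "IN": "ADP", ".": ".",
}


def coarsen(tagged_sentence, mapping=FINE_TO_COARSE, unknown="X"):
    """Replace each fine tag with its coarse tag (assumption: unmapped tags become X)."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sentence]


sentence = [("The", "DT"), ("Skibo", "NNP"), ("Castle", "NNP"),
            ("is", "VBZ"), ("close", "JJ"), ("by", "IN"), (".", ".")]
coarse_sentence = coarsen(sentence)
print(" ".join(tag for _, tag in coarse_sentence))
```

Applying the mapping to every sentence of a treebank yields the coarse treebank used in the following steps.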

3. Helper languages

For each helper language: treebank → MLE → unlexicalized parameters
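For HMM transitions, the MLE step amounts to counting and normalizing. The sketch below assumes a coarse treebank given as lists of coarse tags; the function name `mle_transitions` and the toy data are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def mle_transitions(coarse_tag_sequences):
    """Relative-frequency estimate of tag-to-tag transition multinomials."""
    counts = defaultdict(Counter)
    for tags in coarse_tag_sequences:
        for prev_tag, next_tag in zip(tags, tags[1:]):
            counts[prev_tag][next_tag] += 1
    return {
        prev_tag: {t: c / sum(ctr.values()) for t, c in ctr.items()}
        for prev_tag, ctr in counts.items()
    }


# Toy coarse treebank (illustrative only, not data from the talk).
toy_treebank = [
    ["DET", "NOUN", "VERB", "ADJ", "NOUN", "."],
    ["DET", "ADJ", "NOUN", "VERB", "."],
]
theta = mle_transitions(toy_treebank)
print(theta["DET"])
```

Each entry `theta[k]` plays the role of one unlexicalized multinomial of a helper language.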

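Multilingual Modeling

For a target language, unlexicalized parameters:

$\theta_k = \sum_{\ell=1}^{L} \beta_{\ell,k}\, \theta_k^{(\ell)}$

$\theta_k$: $k$th multinomial in the model (say, the transitions from the ADJ tag in an HMM)
$\beta_{\ell,k}$: mixture weight for the $k$th multinomial for the $\ell$th helper language
$\theta_k^{(\ell)}$: $k$th multinomial of the $\ell$th helper language

[Figure: the target-language ADJ → · multinomial as a mixture of helper-language ADJ → · multinomials with weights 0.7 and 0.3, over the tags NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X]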

Multilingual Modeling

e.g., two helper languages: Spanish and Italian

ADJ → · multinomials (mixture weights: ? ?; target: unknown)

Tag    Helper 1   Helper 2
NOUN   0.27       0.25
VERB   0.12       0.11
ADJ    0.03       0.04
ADV    0.04       0.04
PRON   0.04       0.06
DET    0.03       0.04
ADP    0.25       0.26
NUM    0.01       0.0
CONJ   0.10       0.20
PRT    0.05       0.0
.      0.01       0.0
X      0.05       0.00



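Learning and Inference

Normal learning: the multinomials of the target language are learned directly

Multilingual learning: only the mixture weights are learned; the helper-language parameters are fixed!

Multilingual learning with EM:

E-step: expected number of times each helper-language parameter is used in a derivation
M-step: normalize these expected counts to obtain the new mixture weights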
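The mixture-weight update can be sketched for a single multinomial as follows. This is a minimal illustration, not the authors' implementation: it assumes the expected event counts have already been computed by an E-step over derivations (for example with forward-backward) and, for simplicity, holds them fixed across iterations. The name `em_update_mixture` and the toy numbers are hypothetical.

```python
def em_update_mixture(beta, helper_thetas, expected_counts):
    """One EM update of the mixture weights of a single multinomial.

    beta            -- current mixture weights, one per helper language
    helper_thetas   -- fixed helper multinomials, helper_thetas[l][i]
    expected_counts -- expected number of times event i fires (assumed given)
    """
    mass = [0.0] * len(beta)
    for i, count in enumerate(expected_counts):
        mixed = sum(b * th[i] for b, th in zip(beta, helper_thetas))
        if mixed == 0.0 or count == 0.0:
            continue
        # Credit each helper with its share of the expected count of event i.
        for l, (b, th) in enumerate(zip(beta, helper_thetas)):
            mass[l] += count * b * th[i] / mixed
    total = sum(mass)
    return [m / total for m in mass] if total > 0 else list(beta)


# Toy setup: two helpers, two events; the helper multinomials stay fixed.
helper_thetas = [[0.9, 0.1], [0.1, 0.9]]
expected_counts = [8.0, 2.0]
beta = [0.5, 0.5]
for _ in range(100):
    beta = em_update_mixture(beta, helper_thetas, expected_counts)
print([round(b, 3) for b in beta])
```

In a full system the expected counts depend on the current mixture, so they would be recomputed before every update.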

Learning and Inference

Multilingual learning

What about feature-rich generative models? (Berg-Kirkpatrick et al., 2010)

Locally normalized log-linear model
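A locally normalized log-linear model defines each multinomial as a softmax over feature scores of its events. The sketch below shows only that general idea; the feature names, weights and the function `locally_normalized` are hypothetical and not taken from the talk.

```python
import math


def locally_normalized(weights, features_per_event):
    """Turn per-event feature scores into one multinomial via a softmax."""
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for feats in features_per_event
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical emission events for one tag, each described by sparse features.
events = [
    {"word=castle": 1.0},
    {"word=Castle": 1.0, "capitalized": 1.0},
    {"word=is": 1.0},
]
weights = {"capitalized": 0.5, "word=is": -1.0}
probs = locally_normalized(weights, events)
print([round(p, 3) for p in probs])
```

Because normalization is local to each multinomial, the result can be used wherever an ordinary multinomial parameter vector is expected.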


ADJ → · multinomials with learned mixture weights 0.6237 and 0.3763

Tag    Helper 1 (0.6237)   Helper 2 (0.3763)   Target (learned)
NOUN   0.27                0.25                0.26
VERB   0.12                0.11                0.12
ADJ    0.03                0.04                0.03
ADV    0.04                0.04                0.04
PRON   0.04                0.06                0.05
DET    0.03                0.04                0.03
ADP    0.25                0.26                0.25
NUM    0.01                0.0                 0.01
CONJ   0.10                0.20                0.13
PRT    0.05                0.0                 0.04
.      0.01                0.0                 0.01
X      0.05                0.00                0.04
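The interpolation itself is a weighted sum of the helper multinomials. The sketch below reuses the numbers shown on the slide under two assumptions: the weight 0.6237 belongs to the first helper row, and the truncated entries of the second helper row are exactly zero. The function name `interpolate` is hypothetical.

```python
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
        "ADP", "NUM", "CONJ", "PRT", ".", "X"]

# ADJ -> . multinomials of the two helper languages (values from the slide).
helper_1 = [0.27, 0.12, 0.03, 0.04, 0.04, 0.03, 0.25, 0.01, 0.10, 0.05, 0.01, 0.05]
# Assumption: entries that appear truncated on the slide are taken as 0.00.
helper_2 = [0.25, 0.11, 0.04, 0.04, 0.06, 0.04, 0.26, 0.00, 0.20, 0.00, 0.00, 0.00]
# Assumption: the first learned weight pairs with the first helper row.
mixture_weights = [0.6237, 0.3763]


def interpolate(weights, multinomials):
    """Convex combination of several multinomials over the same events."""
    n_events = len(multinomials[0])
    return [sum(w * m[i] for w, m in zip(weights, multinomials))
            for i in range(n_events)]


target = interpolate(mixture_weights, [helper_1, helper_2])
for tag, p in zip(TAGS, target):
    print(f"ADJ -> {tag}: {p:.2f}")
```

Since the weights sum to one and each helper row is a distribution, the interpolated row is again a distribution over the twelve coarse tags.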


Learning and Inference: Coarse-to-fine expansion (for English)

Step 1: make identical copies of the learned coarse multinomial ADJ → · for each fine tag: JJ → ·, JJR → ·, JJS → ·

ADJ → · : NOUN 0.26, VERB 0.12, ADJ 0.03, ADV 0.04, PRON 0.05, DET 0.03, ADP 0.25, NUM 0.01, CONJ 0.13, PRT 0.04, . 0.01, X 0.04

Step 2: in each copy (e.g. JJ → ·), divide the probability of each coarse tag equally among its fine tags to obtain the new, fine multinomial

NOUN 0.26 → NN, NNS, NNP, NNPS: 0.065 each
VERB 0.12 → VB, VBD, VBG, VBN, VBP, VBZ: 0.02 each
...

The new, fine multinomials are the initializer for monolingual unsupervised training
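The two expansion steps can be sketched as below. This is a minimal illustration assuming transition multinomials stored as nested dictionaries and a partial coarse-to-fine tag map; the names `COARSE_TO_FINE` and `expand` are hypothetical, and only the NOUN and VERB entries of the ADJ row from the slide are used.

```python
# Hypothetical excerpt of the coarse-to-fine tag map for English.
COARSE_TO_FINE = {
    "NOUN": ["NN", "NNS", "NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
}


def expand(coarse_transitions, coarse_to_fine):
    """Step 1: copy each coarse row to its fine tags.
    Step 2: split each coarse probability equally among the fine target tags."""
    fine = {}
    for prev_coarse, row in coarse_transitions.items():
        fine_row = {}
        for next_coarse, prob in row.items():
            next_fines = coarse_to_fine.get(next_coarse, [next_coarse])
            for tag in next_fines:
                fine_row[tag] = prob / len(next_fines)  # equal division
        for prev_fine in coarse_to_fine.get(prev_coarse, [prev_coarse]):
            fine[prev_fine] = dict(fine_row)  # identical copies
    return fine


# Excerpt of the learned coarse ADJ row (only two of its twelve entries).
coarse_row = {"ADJ": {"NOUN": 0.26, "VERB": 0.12}}
fine_rows = expand(coarse_row, COARSE_TO_FINE)
print(fine_rows["JJ"])
```

The expanded rows would then serve as the starting point for monolingual unsupervised training over the fine tagset.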

Experiments

Two Problems

Unsupervised Part-of-Speech Tagging
Model: feature-based HMM (Berg-Kirkpatrick et al., 2010)
Learning: L-BFGS

Unsupervised Dependency Parsing
Model: DMV (Klein and Manning, 2004)
Learning: EM

Languages

Target Languages: Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, and Turkish

Helper Languages: English, German, Italian and Czech

(CoNLL Treebanks from 2006 and 2007)

Results: POS Tagging (without tag dictionary)

Direct Gradient (DG): monolingual baseline (Berg-Kirkpatrick et al., 2010)
Uniform + DG: uniform mixture parameters (no learning)
Mixture + DG: full model

Method         Number of Languages with Best Results   Average Accuracy
DG             2 (Portuguese, Danish)                  40.6
Uniform + DG   2 (Turkish, Bulgarian)                  41.0
Mixture + DG   6                                       43.3

Results: Dependency Parsing

Baselines:
EM: Monolingual EM (Klein and Manning, 2004)
PR: Posterior Regularization (Gillenwater et al., 2010)
PGI: Phylogenetic Grammar Induction (Berg-Kirkpatrick and Klein, 2010)

Multilingual guidance:
Uniform: 1. Uniform mixture parameters 2. No coarse-to-fine expansion (no learning)
Mixture: 1. Learned mixture parameters 2. No coarse-to-fine expansion
Uniform + EM: 1. Uniform mixture parameters 2. Coarse-to-fine expansion → monolingual learning
Mixture + EM: 1. Learned mixture parameters 2. Coarse-to-fine expansion → monolingual learning

Method         Number of Languages with Best Results   Average Accuracy
EM             0                                       41.4
PR             2 (Turkish, Slovene)                    50.2*
PGI            0                                       53.6*
Uniform        3 (Bulgarian, Swedish, Dutch)           61.6
Mixture        1 (Danish)                              62.2
Uniform + EM   1 (Greek)                               61.5
Mixture + EM   3 (Portuguese, Japanese, Spanish)       62.1



Analyzing with Principal Component Analysis

Two principal components

From Words to Dependencies

Use induced tags to induce dependencies:

1. In a pipeline
2. Using the posteriors over tags in a sausage lattice (Cohen and Smith, 2007)

From Words to Dependencies

Joint Decoding:

Lattice states: 1 2 3 4

Word     DET    ADJ    NOUN
The      0.95   0.03   0.02
Skibo    0.0    0.3    0.7
Castle   0.01   0.1    0.89

DMV: parsing a lattice
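The difference between the pipeline and the lattice input can be sketched with the tag posteriors from the slide. This is only a data-structure illustration under assumptions: the names `pipeline_tags` and `lattice_arcs` are hypothetical, and the DMV parser that would consume the lattice is not shown.

```python
# Tag posteriors per word, as shown on the slide.
posteriors = [
    ("The", {"DET": 0.95, "ADJ": 0.03, "NOUN": 0.02}),
    ("Skibo", {"DET": 0.0, "ADJ": 0.3, "NOUN": 0.7}),
    ("Castle", {"DET": 0.01, "ADJ": 0.1, "NOUN": 0.89}),
]


def pipeline_tags(posteriors):
    """Pipeline: commit to the single most probable tag for each word."""
    return [max(dist, key=dist.get) for _, dist in posteriors]


def lattice_arcs(posteriors, threshold=0.0):
    """Sausage lattice: one arc per (word, tag) between consecutive states.

    Returns tuples (start_state, end_state, word, tag, posterior), with states
    numbered from 1. Arcs with posterior <= threshold are dropped (assumption).
    """
    arcs = []
    for position, (word, dist) in enumerate(posteriors, start=1):
        for tag, prob in dist.items():
            if prob > threshold:
                arcs.append((position, position + 1, word, tag, prob))
    return arcs


print(pipeline_tags(posteriors))
for arc in lattice_arcs(posteriors):
    print(arc)
```

A joint decoder would score parses over all arcs of the lattice, whereas the pipeline parser only sees the single tag sequence.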

Results: Words to Dependencies

Setting                  Number of Languages with Best Results                  Average
Pipeline, DG             1 (Greek)                                              56.9
Pipeline, Mixture + DG   0                                                      54.0
Joint, DG                5 (Portuguese, Turkish, Swedish, Slovene, Danish)      57.9
Joint, Mixture + DG      4 (Bulgarian, Japanese, Spanish, Dutch)                55.6

Best average result with gold tags: 62.2

Interesting result: Auto tags perform better for Turkish and Slovene

Conclusions

• Improvements for two major tasks using non-parallel multilingual guidance
• In general, grammar induction results are better than POS tagging results
• Joint POS and dependency parsing performs surprisingly well
  • For a few languages, results are better than using gold tags
  • Joint decoding performs better than a pipeline

Questions?


Results: POS Tagging (without tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
Bulgarian    34.7                   38.0           35.8
Danish       48.8                   36.2           39.9
Dutch        45.4                   43.7           50.2
Greek        35.3                   36.7           38.9
Japanese     52.3                   60.4           61.7
Portuguese   53.5                   45.7           51.5
Slovene      33.4                   35.9           36.0
Spanish      40.0                   31.8           40.5
Swedish      34.4                   37.7           39.9
Turkish      27.9                   43.6           38.6
Average      40.6                   41.0           43.3

Results: POS Tagging (with tag dictionary)

Language     Direct Gradient (DG)   Uniform + DG   Mixture + DG
...
Portuguese   75.4                   83.8           84.7
Slovene      75.6                   82.8           82.8
Spanish      82.3                   82.3           83.3
Swedish      61.5                   69.0           67.0
Turkish      50.4                   50.4           50.4
Average      75.9                   76.9           77.3

Results: Dependency Parsing

Language     EM     PR     PGI    Uniform   Mixture   Uniform + EM   Mixture + EM
Bulgarian    54.3   54.0   -      75.6      75.5      74.7           72.8
Danish       41.4   44.0   41.6   59.2      59.9      51.3           55.2
Dutch        38.6   37.9   45.1   50.7      51.1      45.9           46.0
Greek        41.0   -      -      57.0      59.5      73.0           72.3
Japanese     43.0   60.2   -      56.3      58.3      59.8           63.9
Portuguese   42.5   47.8   63.1   78.6      76.8      78.7           79.8
Slovene      37.0   50.3   49.6   46.1      46.0      41.3           41.0
Spanish      38.1   62.4   63.8   73.2      75.9      75.5           76.7
Swedish      42.3   42.2   58.3   74.0      73.2      70.5           68.7
Turkish      36.3   53.4   -      45.0      45.3      43.9           44.1
Average      41.4   -      -      61.6      62.2      61.5           62.1

Results: Words to Dependencies

             Joint                  Pipeline               Gold Tags
Language     DG     Mixture + DG    DG     Mixture + DG
Bulgarian    62.4   67.0            57.7   62.9            75.6
Danish       50.4   50.1            48.9   48.3            59.9
Dutch        48.3   52.2            49.9   51.2            50.7
Greek        63.5   52.2            68.2   50.0            73.0
Japanese     61.4   69.5            64.2   68.6            63.9
Portuguese   68.4   62.2            60.0   59.8            79.8
Slovene      47.2   36.8            45.8   36.4            46.1
Spanish      67.7   69.3            65.8   68.1            76.7
Swedish      58.2   49.1            57.9   47.6            74.0
Turkish      52.4   47.4            50.8   47.1            45.3
Average      57.9   55.6            56.9   54.0            64.5

