Unified Expectation Maximization
Rajhans Samdani
Joint work with
Ming-Wei Chang (Microsoft Research) and Dan Roth
University of Illinois at Urbana-Champaign
NAACL 2012, Montreal
Weakly Supervised Learning in NLP
Labeled data is scarce and difficult to obtain
A lot of work on learning with a small amount of labeled data
Expectation Maximization (EM) algorithm is the de facto standard
More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM:
- Constraint-driven Learning (CoDL; Chang et al., 07)
- Posterior Regularization (PR; Ganchev et al., 10)
Weakly Supervised Learning: EM and …?
Several variants of EM exist in the literature:
- Hard EM
- Variants of constrained EM: CoDL and PR
Which version should we use: EM (PR) or hard EM (CoDL)? Or is there something better out there?
OUR CONTRIBUTION: Unified EM (UEM), a unified framework for EM algorithms:
- Includes existing EM algorithms
- Picks the most suitable EM algorithm in a simple, adaptive, and principled way, adapting to data, initialization, and constraints
Outline
Background: Expectation Maximization (EM); EM with constraints
Unified Expectation Maximization (UEM)
Optimization Algorithm for the E-step
Experiments
Predicting Structures in NLP
Predict the output (dependent) variable y from the space of allowed outputs Y, given an input variable x, using a parameter (weight) vector w
E.g., predict POS tags given a sentence, word alignments given sentences in two different languages, or the entity-relation structure of a document
Prediction is expressed as y* = argmax_{y ∈ Y} P(y | x; w)
Learning Using EM: a Quick Primer
Given unlabeled data x, estimate parameters w; the output y is hidden.
for t = 1 … T do
E-step: estimate a posterior distribution q over y (the conditional distribution of y given w):
q_t(y) = P(y | x; w_t), equivalently q_t = argmin_q KL(q(y), P(y | x; w_t)) (Neal and Hinton, 99)
M-step: estimate the parameters w w.r.t. q:
w_{t+1} = argmax_w E_q log P(x, y; w)
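To make the loop concrete, here is a minimal sketch (our illustration, not from the talk) of this E/M alternation on a toy mixture of two biased coins: the coin identity of each trial plays the role of the hidden y, and the two coin biases play the role of w.

```python
# Toy EM: in each trial, one of two coins (hidden y) is flipped n times;
# we estimate the coin biases w = (theta_A, theta_B) from heads counts.
import numpy as np

heads = np.array([5, 9, 8, 4, 7])    # observed heads out of n flips per trial
n = 10
theta = np.array([0.6, 0.5])         # initial parameters w_0

for t in range(50):
    # E-step: q_t(y) = P(y | x; w_t), the posterior over which coin was used
    like = theta**heads[:, None] * (1 - theta)**(n - heads[:, None])
    q = like / like.sum(axis=1, keepdims=True)
    # M-step: w_{t+1} = argmax_w E_q log P(x, y; w); the closed form here
    # is the q-weighted fraction of heads assigned to each coin
    theta = (q * heads[:, None]).sum(axis=0) / (q * n).sum(axis=0)

print(theta)   # recovered coin biases
```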
Another Version of EM: Hard EM
Standard EM
E-step: argmin_q KL(q(y), P(y | x; w_t))
M-step: argmax_w E_q log P(x, y; w)
Hard EM
E-step: q(y) = δ(y = y*), where y* = argmax_y P(y | x; w)
M-step: argmax_w E_q log P(x, y; w)
It is not clear which version to use!
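In code, hard EM changes only the E-step line of the toy sketch above: the soft posterior is replaced by a point mass on the most likely assignment (our illustration):

```python
# Hard E-step: q(y) = delta(y = y*), with y* = argmax_y P(y | x; w);
# each row of q becomes a one-hot vector instead of a soft posterior.
q = np.eye(2)[np.argmax(like, axis=1)]
```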
Constrained EM
Domain knowledge-based constraints can help a lot by guiding unsupervised learning:
- Constraint-driven Learning (Chang et al., 07), Posterior Regularization (Ganchev et al., 10), Generalized Expectation Criterion (Mann & McCallum, 08), Learning from Measurements (Liang et al., 09)
Constraints are imposed on y (a structured object, {y1, y2, …, yn}) to specify/restrict the set of allowed structures Y
Entity-Relation Prediction: Type Constraints
Predict entity types: Per, Loc, Org, etc.
Predict relation types: lives-in, org-based-in, works-for, etc.
Entity-relation type constraints
[Figure: example sentence "Dole's wife, Elizabeth, is a resident of N.C." with entities E1 (Dole), E2 (Elizabeth), E3 (N.C.) and relations R12, R23; the type constraint links R23 = lives-in with E2 = Per and E3 = Loc]
Bilingual Word Alignment: Agreement Constraints
Align words from sentences in EN with sentences in FR
Agreement constraints: alignment from EN-FR should agree with the alignment from FR-EN (Ganchev et al, 10)
[Figure: bilingual word alignment example; picture courtesy Lacoste-Julien et al.]
Structured Prediction Constraints Representation
Assume a set of linear constraints: Y = {y : Uy ≤ b}
A universal representation (Roth and Yih, 07)
Can be relaxed into expectation constraints on posterior probabilities:
E_q[Uy] ≤ b
Focus on introducing constraints during the E-step
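As a concrete (hypothetical) example of this representation: if y is a vector of 0/1 indicators, a constraint such as "at most two indicators are active" is one row of U, and the relaxation simply moves the constraint onto the posterior marginals:

```python
# Hypothetical toy encoding of Y = {y : Uy <= b}: y is a 0/1 indicator
# vector of length 5, and "at most 2 indicators on" is one linear row.
import numpy as np

U = np.ones((1, 5))
b = np.array([2.0])

# Relaxed expectation constraint E_q[Uy] <= b: for indicator variables,
# E_q[Uy] = U @ mu, where mu_i = q(y_i = 1) are posterior marginals.
mu = np.array([0.9, 0.8, 0.1, 0.2, 0.3])
print(U @ mu, U @ mu <= b)   # [2.3] -> this posterior violates the constraint
```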
Two Versions of Constrained EM
Posterior Regularization (Ganchev et al., 10)
E-step: argmin_q KL(q(y), P(y | x; w_t)) s.t. E_q[Uy] ≤ b
M-step: argmax_w E_q log P(x, y; w)
Constraint-driven Learning (Chang et al., 07)
E-step: y* = argmax_y P(y | x; w) s.t. Uy ≤ b
M-step: argmax_w E_q log P(x, y; w)
It is not clear which version to use!
So how do we learn…?
EM (PR) vs. hard EM (CODL): it is unclear which version of EM to use (Spitkovsky et al., 10)
This is the initial point of our research
We present a family of EM algorithms which includes these EM algorithms (and infinitely many new EM algorithms): Unified Expectation Maximization (UEM)
UEM lets us pick the best EM algorithm in a principled way
Outline
Notation and Expectation Maximization (EM)
Unified Expectation Maximization: motivation; formulation and mathematical intuition
Optimization Algorithm for the E-step
Experiments
Motivation: Unified Expectation Maximization (UEM)
EM (PR) and hard EM (CODL) differ mostly in the entropy of the posterior distribution
UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ
Unified EM (UEM)
EM (PR) minimizes the KL divergence KL(q, P(y | x; w)), where KL(q, p) = Σ_y q(y) log q(y) − q(y) log p(y)
UEM changes the E-step of standard EM and minimizes a modified KL divergence KL(q, P(y | x; w); γ), where
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
The γ term changes the entropy of the posterior
Different γ values → different EM algorithms (spanning EM and hard EM)
Effect of Changing γ
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
[Figure: the original distribution p alongside the minimizing q for several γ values: γ = ∞ (uniform), γ = 1 (q = p), and γ = 0 and γ = −1 (all mass on the argmax)]
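With no constraints, the minimizer of KL(q, p; γ) over the probability simplex has a closed form for γ > 0, namely q(y) ∝ p(y)^{1/γ}; the limits in the figure follow from the exponent. A small sketch (our illustration):

```python
# Minimizer of KL(q, p; gamma) over the simplex (no constraints):
# q(y) proportional to p(y)^(1/gamma), for gamma > 0.
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # original distribution p

def uem_posterior(p, gamma):
    q = p ** (1.0 / gamma)
    return q / q.sum()

print(uem_posterior(p, 1.0))    # gamma = 1: q = p (standard EM)
print(uem_posterior(p, 1e-2))   # gamma -> 0: all mass on the argmax (hard EM)
print(uem_posterior(p, 1e2))    # gamma -> infinity: nearly uniform
```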
Unifying Existing EM Algorithms
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

γ:                  −1 ……… 0 ……… 1 ……… ∞
No constraints:     Hard EM (γ = 0)     EM (γ = 1)
With constraints:   CODL (γ = −1)       PR (γ = 1)
Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99) corresponds to sweeping γ over a range of values

Changing γ results in different existing EM algorithms
Range of γ
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

γ:                  0 ………………… 1
No constraints:     Hard EM ………………… EM
With constraints:   LP approx. to CODL (new) ………………… PR

We focus on tuning γ in the range [0, 1]: infinitely many new EM algorithms
Tuning γ in practice
γ essentially tunes the entropy of the posterior to better adapt to the data, initialization, constraints, etc.
We tune γ using a small amount of development data, over the grid 0, 0.1, 0.2, …, 1
UEM for an arbitrary γ in this range is very easy to implement: existing EM/PR/hard EM/CODL code can easily be extended to implement UEM
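The tuning loop itself is just a grid search on development data; a hedged sketch, where train_uem and dev_score are hypothetical stand-ins for training with a fixed γ and scoring the result:

```python
# Grid search for gamma on a development set. train_uem and dev_score
# are hypothetical stand-ins, not functions from the paper's code.
def train_uem(gamma):
    return {"gamma": gamma}             # placeholder "model"

def dev_score(model):
    return -(model["gamma"] - 0.4)**2   # placeholder dev-set metric

best_gamma = max((g / 10 for g in range(11)),
                 key=lambda g: dev_score(train_uem(g)))
print(best_gamma)   # the gamma used for final training
```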
Outline
Setting up the problem
Unified Expectation Maximization
Solving the constrained E-step: Lagrange dual-based algorithm; unification of existing algorithms
Experiments
The Constrained E-step
min_q KL(q(y), P(y | x; w_t); γ)  s.t.  E_q[Uy] ≤ b,  with q on the standard probability simplex
This combines the γ-parameterized KL divergence, the domain knowledge-based linear constraints, and the standard probability simplex constraints. For γ ≥ 0, the problem is convex.

Solving the Constrained E-step for q(y)
1. Introduce dual variables λ for each constraint
2. Take sub-gradient ascent steps on the dual variables, with ∇λ ∝ E_q[Uy] − b
3. Compute q for the given λ:
   for γ > 0, compute q(y) ∝ P(y | x; w_t)^{1/γ} exp(−λᵀUy / γ)
   as γ → 0, this becomes unconstrained MAP inference: y* = argmax_y (log P(y | x; w_t) − λᵀUy)
Iterate steps 2 and 3 until convergence
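A sketch of this E-step for a small, enumerable output space (our illustration; a real structured model would replace the explicit enumeration with inference in the underlying model):

```python
# Dual projected subgradient ascent for the constrained E-step, gamma > 0:
# maximize the dual over lambda >= 0, where each dual iterate induces
# q(y) proportional to P(y|x;w)^(1/gamma) * exp(-lambda^T U y / gamma).
import numpy as np

p = np.array([0.05, 0.15, 0.50, 0.30])   # P(y | x; w) over 4 outputs
U = np.array([[0.0, 0.0, 1.0, 1.0]])     # one constraint row: values of Uy
b = np.array([0.4])                      # we require E_q[Uy] <= 0.4
gamma, eta = 0.5, 0.2                    # UEM parameter, dual step size

lam = np.zeros(1)                        # dual variables, kept >= 0
for _ in range(500):
    # step 3: compute q for the current lambda (gamma > 0 case)
    logq = (np.log(p) - lam @ U) / gamma
    q = np.exp(logq - logq.max())
    q /= q.sum()
    # step 2: subgradient ascent on lambda with gradient E_q[Uy] - b,
    # projected back onto the nonnegative orthant
    lam = np.maximum(0.0, lam + eta * (U @ q - b))

print(q, U @ q)   # constrained posterior; E_q[Uy] is driven down to b
```

With γ = 1 this is the Posterior Regularization dual update; as γ → 0 the q computation is replaced by penalized MAP inference, as in step 3 above.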
Some Properties of our E-step Optimization
We use a dual projected sub-gradient ascent algorithm (Bertsekas, 99), which handles inequality constraints
For special instances where two (or more) "easy" problems are connected via constraints, it reduces to dual decomposition:
- For γ > 0: convex dual decomposition over individual models (e.g., HMMs) connected via dual variables
- For γ = 1: the dual decomposition used in Posterior Regularization (Ganchev et al., 08)
- For γ = 0: Lagrangian relaxation/dual decomposition for hard ILP inference (Koo et al., 10; Rush et al., 11)
Outline
Setting up the problem
Introduction to Unified Expectation Maximization
Lagrange dual-based optimization algorithm for the E-step
Experiments: POS tagging, entity-relation extraction, word alignment
Experiments: exploring the role of γ
Test whether tuning γ improves performance over the baselines
Study the relation between the quality of the initialization and γ (the "hardness" of inference)
Compare against:
- Posterior Regularization (PR), corresponding to γ = 1.0
- Constraint-driven Learning (CODL), corresponding to γ = −1
Unsupervised POS Tagging
Model as a first-order HMM
Try initializations of varying quality:
- Uniform initialization: initialize with equal probability for all states
- Supervised initialization: initialize with parameters trained on varying amounts of labeled data
Test the “conventional wisdom” that hard EM does well with good initialization and EM does better with a weak initialization
Unsupervised POS tagging: Different EM instantiations
[Chart: tagging performance relative to EM as a function of γ, under uniform initialization and under initialization with 5, 10, 20, and 40-80 labeled examples; hard EM (γ = 0) and EM (γ = 1) mark the endpoints]
Experiments: Entity-Relation Extraction
Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities
Add constraints:
- Type constraints between entities and relations
- Expected count constraints to regularize the counts of the 'None' relation
Semi-supervised learning with a small amount of labeled data
[Figure: example sentence "Dole's wife, Elizabeth, is a resident of N.C." with entities E1, E2, E3 and relations R12, R23]
Results on Relations
[Chart: macro-F1 scores on relations vs. the percentage of labeled data (5%, 10%, 20%) for no semi-supervision, CODL, PR, and UEM]
UEM is statistically significantly better than PR
Experiments: Word Alignment
Word alignment from a language S to a language T; we try EN-FR and EN-ES pairs
We use an HMM-based model with agreement constraints for word alignment
PR with agreement constraints is known to give HUGE improvements over the HMM (Ganchev et al., 08; Graca et al., 08)
We use our efficient algorithm to decompose the E-step into individual HMMs
Word Alignment: EN-FR with 10k Unlabeled Data
[Chart: alignment error rate of EM, PR, CODL, and UEM in the EN-FR and FR-EN directions]
Word Alignment: EN-FR
[Chart: alignment error rate of EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentences]
Word Alignment: FR-EN
[Chart: alignment error rate of EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentences]
Word Alignment: EN-ES
[Chart: alignment error rate of EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentences]
Word Alignment: ES-EN
[Chart: alignment error rate of EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentences]
Experiments Summary
In different settings, different baselines work better:
- Entity-relation extraction: CODL does better than PR
- Word alignment: PR does better than CODL
- Unsupervised POS tagging: it depends on the initialization
UEM allows us to choose the best algorithm in all of these cases; the best version of EM is a new one with 0 < γ < 1
Unified EM: Summary
UEM generalizes existing variations of EM/constrained EM
UEM provides new EM algorithms parameterized by a single parameter γ
An efficient dual projected subgradient ascent technique incorporates constraints into UEM
The best γ corresponds to neither EM (PR) nor hard EM (CODL) and is found through the UEM framework
Tuning γ adaptively changes the entropy of the posterior
UEM is easy to implement: add a few lines of code to existing EM code
Questions?