Learning Representations of Language
for Domain Adaptation
Alexander Yates
Fei (Irene) Huang
Temple University
Computer and Information Sciences
Outline
• Representations in NLP
– Machine Learning / Data mining perspective
– Linguistics perspective
• Domain Adaptation
• Learning Representations
• Experiments
A sequence-labeling task
Identify phrases that name birds and cats.
[BIRD Thrushes] build cup-shaped nests, sometimes lining them with mud.
[CAT Sylvester] was #33 on TV Guide's list of top 50 best cartoon characters, together with [BIRD Tweety Bird].
Machine Learning
Quick formal background:
Let X be a set of all possible data points (e.g., all English sentences)
Let Z be the space of all possible predictions (e.g., all sequences of labels)
A target is a function f: X → Z that we’re trying to learn
A learning machine is an algorithm L
Input: a set of examples S = {xi} drawn from distribution D, and a label zi = f(xi) for each example.
Output: a hypothesis h: X → Z that minimizes Ex~D 1[h(x) ≠ f(x)]
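As a toy illustration (not from the slides), the learner can only estimate this expectation from a finite sample; a minimal sketch of the empirical version of the objective:

```python
# Sketch: empirical estimate of E_{x~D} 1[h(x) != f(x)] on a finite sample.
def empirical_error(h, f, sample):
    """Fraction of sample points where hypothesis h disagrees with target f."""
    return sum(1 for x in sample if h(x) != f(x)) / len(sample)

# Hypothetical toy task: label integers as 'even'/'odd'.
f = lambda x: "even" if x % 2 == 0 else "odd"   # target function f: X -> Z
h = lambda x: "even"                            # a (bad) constant hypothesis
print(empirical_error(h, f, [1, 2, 3, 4]))      # 0.5
```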
Representations for Learning
Most NLP systems first transform the raw data into a more convenient representation.
• A representation is a function R: X → Y, for some suitable feature space Y, like Y = ℝd.
• A feature is a dimension in the feature space Y.
• Alternatively, we may use the word feature to refer to a value for one component of R(x), for some representation R and instance x.
A learning machine L takes as input a set of examples (R(xi), f(xi)) and returns h: Y → Z.
A traditional NLP representation
Thrushes build cup-shaped nests, …
[figure: each word is mapped to a binary feature vector over features such as:]
• Word = ‘build’ / Word = ‘nests’ / Word = ‘thrushes’
• Next word is ‘build’ / Next word is ‘nests’
• Previous word is ‘build’ / Previous word is ‘thrushes’
• Is capitalized
• Ends with ‘–s’ / Ends with ‘–ed’
Feature sets are carefully-engineered for specific tasks, but usually include at least the word-based features.
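A minimal sketch of such a word-based feature template (feature names here are illustrative, not the exact engineered set):

```python
# Sketch: word-identity, context, and orthographic features for token i.
def word_features(tokens, i):
    feats = {f"word={tokens[i].lower()}": 1}
    if i > 0:
        feats[f"prev_word={tokens[i-1].lower()}"] = 1
    if i + 1 < len(tokens):
        feats[f"next_word={tokens[i+1].lower()}"] = 1
    feats["is_capitalized"] = int(tokens[i][0].isupper())
    feats["ends_with_s"] = int(tokens[i].endswith("s"))
    feats["ends_with_ed"] = int(tokens[i].endswith("ed"))
    return feats

toks = "Thrushes build cup-shaped nests".split()
print(word_features(toks, 1))  # features for 'build'
```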
Sparsity
• Common indicators for birds:
– feather, beak, nest, egg, wing
• Uncommon indicators:
– aviary, archaeopteryx, insectivorous, warm-blooded
“The jackjaw stood irreverently on the scarecrow’s shoulder.”
• Sparsity analysis of Collins parser: (Bikel, 2004)
– bilexical statistics are available in < 1.5% of parse decisions
Sparsity in biomedical POS tagging
• Most part-of-speech taggers are trained on newswire text (Penn Treebank)
• In a standard biomedical data set, fully 23% of words never appear in the Penn Treebank
Tagger | Newswire accuracy | Biomedical accuracy | Unknown word accuracy
CRF with word and orthographic features | 94.0% | 88.3% | 67.3%
SCL (Blitzer et al., 2006) | – | 88.9% | 72.0%
Polysemy
• The word “thrush” is not necessarily an indicator of a bird:
– Thrush is the term for an overgrowth of yeast in a baby's mouth.
– Thrush products have been a staple of hot rodders for over 40 years as these performance mufflers bring together the power and sound favored by true enthusiasts.
• “Leopard”, “jaguar”, “puma”, “tiger”, “lion”, etc. all have various meanings as cats, operating systems, sports teams, and so on.
• Word meanings depend on their contexts, and word-based features do not capture this.
Embeddings
• Kernel trick:
– implicitly embed data points in a higher-dimensional space
• Dimensionality reduction:
– Embed data points in a lower-dimensional space
– Common technique in text mining, combined with vector space models
• PCA, LSA, SVD (Deerwester et al., 1990)
• Self-organizing maps (Honkela, 1997)
• Independent component analysis (Sahlgren, 2005)
• Random indexing (Väyrynen et al., 2007)
• But existing embedding techniques ignore linguistic structure.
A representation from linguistics
Many modern linguistic theories (GPSG, HPSG, LFG, etc.) treat language as a small set of constraints over a large number of lexical features.
But lexical entries are painstakingly crafted by hand.
thrushes:
  HEAD Noun
  SINGULAR -
  COUNT +
  VAL-SPR Det[SINGULAR -]
  VAL-COMP None
  SEM-AGENCY +
  … …

build:
  HEAD Verb
  VFORM Infinite
  AUX -
  VAL-SUBJ Noun[SINGULAR -, SEM-AGENCY +]
  VAL-COMP Noun
  … …
Outline
• Representations in NLP
• Domain Adaptation
• Learning Representations
• Experiments
Domains
Definition: A domain is a subset of language that is related through genre, topic, or style.
Examples:
newswire text
science fiction novels
biomedical research literature
Domain Dependence
Newswire Domain
… isn’t signaling (verb) a recession …
… acquiring the company, signaling (verb) to others that …
… in that list, signaling (verb) that all the company’s coal and …
Dow officials were signaling (verb) that the company …
… the S&P was signaling (verb) that the Dow could fall …
Biomedical Domain
… factor for Wnt signaling (noun), …
… for the Wnt signaling (noun) pathway via …
... in a novel signaling (noun) pathway from an extracellular guidance cue …
… in the Wnt signaling (noun) pathway, and mutation …
Domain adaptation: a hard test for NLP
Formally, a domain is a probability distribution D over the instance set X
e.g., sentences in the newswire domain ~ DNews(X)
sentences in the biomedical domain ~ DBio(X)
In domain adaptation, a learning machine is given training examples from a source domain
The hypothesis is then tested on data points drawn from a separate target domain.
Learning theory for domain adaptation
A recently-proved theorem:
The error rate of h on target domain T after being trained on source domain S depends on:
1. the error rate of h on the source domain S
2. the distance between S and T
• The claim depends on a particular notion of “distance” between probability distributions S and T
[Ben-David et al., 2009]
Formal version
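The slide's equation did not survive extraction. A reconstructed statement of the bound, following the Ben-David et al. result cited on the previous slide (notation supplied here, so treat it as a sketch rather than the slide's exact form):

```latex
% Reconstructed domain-adaptation bound (Ben-David et al.): for every h in H,
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{H\Delta H}(D_S, D_T) \;+\; \lambda
% where \epsilon_S, \epsilon_T are the source- and target-domain error rates,
% d_{H\Delta H} is the H\Delta H-divergence between the two domain distributions,
% and \lambda is the error of the best joint hypothesis on both domains.
```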
Outline
• Representations in NLP
– Machine Learning / Data mining perspective
– Linguistics perspective
• Domain Adaptation
• Learning Representations
• Experiments
Objectives for (lexical) representations
1. Usefulness: We want features that help in learning the target function.
2. Non-Sparsity: We want features that appear commonly in reasonable amounts of training data.
3. Context-dependence: We want features that somehow depend on, or take into account, the context of a word.
4. Minimal domain distance: We want features that appear approximately as often in one domain as any other.
5. Automation: We don’t want to have to manually construct the features.
Representation learning
Thrushes build cup-shaped nests
R (representation)
F1 F2 F3 F4 F5 F6 F7 F8 … Fd
0.1 -7 21 0 2 0 12.1 5 … (a feature vector in ℝd)
h (hypothesis)
BIRD X X X
We learn this
Why not learn this, too?!
1) Ngram Models for Representations

finches
ngram | Prob
- are plain | .0001
- range from | .001
- inhabit wooded | .0001
- sing | .0001
actually - are | .0005
true - are | .001
Darwin’s - | .0001
Galapagos - | .0001
true - | .01

thrushes
ngram | Prob
- eat worms | .001
- are plump | .001
- lay two | .0005
- sing | .0005
- build cup-shaped | .0001
large - in | .002
traditional - genera | .0001
soft-plumaged - | .001
1) Ngram Models for Representations

finches
feature | value
- are plain | .0001
- range from | .001
- inhabit wooded | .0001
- sing | .0001
actually - are | .0005
true - are | .001
Darwin’s - | .0001
Galapagos - | .0001
true - | .01

thrushes
feature | value
- eat worms | .001
- are plump | .001
- lay two | .0005
- sing | .0005
- build cup-shaped | .0001
large - in | .002
traditional - genera | .0001
soft-plumaged - | .001
1) Ngram Models for Representations
True finches are predominantly seed-eating songbirds.
Training
[figure: each token's vector of ngram feature values; the highlighted feature is “- sing”]
Labels: X BIRD X X X BIRD
1) Ngram Models for Representations
Thrushes build cup-shaped nests, sometimes …
Testing
[figure: each token's vector of ngram feature values; the highlighted feature is “- sing”]
Labels: BIRD X X X X
Ngram features:
• Advantages: Automated; Useful
• Disadvantages: Sparse; Not context-dependent
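One way to see where such feature values come from is to estimate them from corpus counts. A toy sketch (the corpus and the relative-frequency estimate are simplifications of the language-model probabilities in the tables above):

```python
from collections import Counter

# Sketch: the feature "- eat worms" for word w is estimated as the relative
# frequency with which w is immediately followed by "eat worms".
corpus = ["thrushes eat worms", "thrushes are plump", "thrushes sing",
          "finches sing", "finches are plain"]
right_context = Counter()
word_counts = Counter()
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        word_counts[w] += 1
        if i + 2 < len(toks):
            right_context[(w, toks[i + 1], toks[i + 2])] += 1

def ngram_feature(w, ctx):
    """Value of the feature '- ctx[0] ctx[1]' for word w."""
    return right_context[(w,) + ctx] / word_counts[w]

print(round(ngram_feature("thrushes", ("eat", "worms")), 3))  # 0.333
```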
Pause: let’s generalize the procedure
1. Train a language model on lots of (unlabeled) text (preferably from multiple domains)
2. Use the language model to annotate (labeled) training and test texts with latent information
3. Use the annotations as features in a CRF
4. Train and test CRF as usual
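The four steps can be sketched structurally as one pipeline (all component names here are hypothetical stand-ins, not an actual API; `fit_lm` and `fit_crf` stand in for the language-model and CRF trainers):

```python
# Structural sketch of the four-step procedure above.
def representation_pipeline(unlabeled_corpus, train, test, fit_lm, fit_crf):
    lm = fit_lm(unlabeled_corpus)                      # 1) train language model
    train_feats = [lm.annotate(x) for x, _ in train]   # 2) annotate latent info
    test_feats = [lm.annotate(x) for x in test]
    crf = fit_crf(train_feats, [y for _, y in train])  # 3) latent info as CRF features
    return [crf.predict(f) for f in test_feats]        # 4) train and test as usual
```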
Pause: how to improve procedure?
The main idea we’ve explored is:
– cluster words into sets of related words
– use the clusters as features
We can control the number of clusters, to make the features less sparse.
2) Distributional Clustering
True finches are predominantly seed-eating songbirds.
1) Construct a Naïve Bayes model for generating trigrams
• The parent node is a latent state with K possible values
• Trigrams are generated according to Pleft(word | parent), Pmid(word | parent), and Pright(word | parent)
2) Distributional Clustering – NB
True finches are predominantly seed-eating songbirds.
2) Train the prior P(parent) and conditional distributions on a large corpus using EM, treating all trigrams as independent.
3) For each token in training and test sets, determine the best value of the latent state, and use it as a new feature.
Thrushes build cup-shaped nests, …
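A small-scale sketch of steps 2–3 under the model above (a hypothetical, unsmoothed implementation with toy data; a real system would train on millions of trigrams):

```python
import numpy as np

# Sketch: EM for the Naive Bayes trigram model -- a latent state z with K
# values generates the three words of each trigram via Pl, Pm, Pr.
def em_trigram_nb(trigrams, vocab, K=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    V = len(vocab)
    idx = {w: i for i, w in enumerate(vocab)}
    T = np.array([[idx[a], idx[b], idx[c]] for a, b, c in trigrams])
    prior = np.full(K, 1.0 / K)
    Pl, Pm, Pr = [rng.dirichlet(np.ones(V), size=K) for _ in range(3)]  # (K, V)
    for _ in range(iters):
        # E-step: posterior over the latent state for each trigram
        post = prior * Pl[:, T[:, 0]].T * Pm[:, T[:, 1]].T * Pr[:, T[:, 2]].T
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the prior and the three word distributions
        prior = post.mean(axis=0)
        for P, col in ((Pl, 0), (Pm, 1), (Pr, 2)):
            P[:] = 1e-9                      # tiny floor to avoid zeros
            np.add.at(P.T, T[:, col], post)  # accumulate expected counts
            P /= P.sum(axis=1, keepdims=True)
    return post.argmax(axis=1)  # hard cluster assignment per trigram

trigrams = [("true", "finches", "are")] * 5 + [("build", "nests", "with")] * 5
labels = em_trigram_nb(trigrams, ["true", "finches", "are", "build", "nests", "with"])
print(labels)
```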
2) Distributional Clustering – NB
Advantages over ngram features
1) Sparsity: only K features, so each should be common
2) Context-dependence: The new feature depends not just on the token at position i, but also on tokens at i-1 and i+1

Potential problems:
1) Features are only sensitive to immediate neighbors
2) The model requires 3 observation distributions, each of which will be sparsely observed
3) Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?
3) Distributional clustering - HMMs
True finches are predominantly seed-eating songbirds.
Hidden Markov Model
• One latent node yi per token xi
• A conditional observation distribution Pobs(xi | yi)
• A conditional transition distribution Ptrans(yi | yi-1)
• A prior distribution Pprior(y1)
Joint probability:
P(x, y) = Pprior(y1) Pobs(x1 | y1) ∏i=2..N Ptrans(yi | yi-1) Pobs(xi | yi)
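The joint probability can be computed directly from the factorization; the toy parameters below are invented for illustration, not learned values:

```python
# Sketch: P(x, y) = Pprior(y1) Pobs(x1|y1) * prod_i Ptrans(yi|yi-1) Pobs(xi|yi)
def hmm_joint(x, y, prior, trans, obs):
    p = prior[y[0]] * obs[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[y[i - 1]][y[i]] * obs[y[i]][x[i]]
    return p

# Hypothetical two-state model ("N"-like and "V"-like clusters).
prior = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
obs = {"N": {"thrushes": 0.5, "nests": 0.5}, "V": {"build": 1.0}}
print(round(hmm_joint(["thrushes", "build"], ["N", "V"], prior, trans, obs), 2))  # 0.21
```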
3) Distributional clustering - HMMs
True finches are predominantly seed-eating songbirds.
1) Train the prior and conditional distributions on a large corpus using EM.
2) Use the Viterbi algorithm to find the best setting of all latent states for a given sentence.
3) Use the latent state value yi as a new feature for xi.
Thrushes build cup-shaped nests
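Step 2 can be sketched directly; the Viterbi decoder below uses toy hand-set parameters (states and probabilities are invented for illustration):

```python
# Sketch: Viterbi decoding -- the single best latent-state sequence under an HMM.
def viterbi(x, states, prior, trans, obs):
    # V[t][s] = (prob. of best path ending in state s at time t, backpointer)
    V = [{s: (prior[s] * obs[s].get(x[0], 0.0), None) for s in states}]
    for t in range(1, len(x)):
        V.append({s: max(((V[t - 1][r][0] * trans[r][s] * obs[s].get(x[t], 0.0), r)
                          for r in states), key=lambda pr: pr[0]) for s in states})
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(x) - 1, 0, -1):
        path.append(V[t][path[-1]][1])  # follow backpointers
    return path[::-1]

prior = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.8, "B": 0.2}}
obs = {"A": {"thrushes": 0.5, "nests": 0.5}, "B": {"build": 1.0}}
print(viterbi(["thrushes", "build", "nests"], ["A", "B"], prior, trans, obs))  # ['A', 'B', 'A']
```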
3) Distributional Clustering – HMMs
Advantages over NB features
1) Sparsity: same number of features, but the HMM model itself is less sparse: it includes only one observation distribution
2) Context-dependence: The new feature depends (indirectly) on the whole observation sequence
Potential problem:Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?
4) Multi-dimensional clustering
True finches are predominantly seed-eating songbirds.
Independent HMM (I-HMM) model:
• L layers of HMM models, each trained independently
• Each layer’s parameters are initialized randomly for EM
4) Multi-dimensional clustering
True finches are predominantly seed-eating songbirds.
As before, we decode each layer using the Viterbi algorithm to generate features.
Each layer represents a random projection from the full feature space to K boolean dimensions.
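The per-token feature construction can be sketched as follows. A real system would run Viterbi decoding for each layer's HMM; the deterministic stand-in decoder below is purely illustrative:

```python
# Sketch: per-token features from L independently decoded HMM layers.
def ihmm_features(tokens, num_layers=3, K=4):
    layers = []
    for layer in range(num_layers):
        # stand-in for Viterbi decoding of this layer's HMM (illustrative only)
        layers.append([(len(tok) * (layer + 3)) % K for tok in tokens])
    # each token gets one latent-state feature per layer: K^L distinct values
    return [tuple(layers[l][i] for l in range(num_layers))
            for i in range(len(tokens))]

print(ihmm_features(["True", "finches"]))  # [(0, 0, 0), (1, 0, 3)]
```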
4) Multi-dimensional clustering
Advantages over HMM features
1) Usefulness: closer to the lexical representation from linguistics
2) Usefulness: can represent K^L points (instead of just K)
Potential problem:Each layer is trained independently, so are they really providing additional (rather than overlapping) information?
Outline
• Representations in NLP
– Machine Learning / Data mining perspective
– Linguistics perspective
• Domain Adaptation
• Learning Representations
• Experiments
Experiments
• Part-of-speech tagging (and chunking)
– Train on newswire text
– Test on biomedical text
(Huang and Yates, ACL 2009; Huang and Yates, DANLP 2010)
• Semantic role labeling
– Train on newswire text
– Test on fiction text
(Huang and Yates, ACL 2010)
Part-of-Speech (POS) tagging

Tagger | Biomedical accuracy | Unknown word accuracy
baseline: CRF with word and orthographic features | 88.3% | 67.3%
baseline + NB features | 88.4% | 69.3%
SCL (Blitzer et al., 2006) | 88.9% | 72.0%
baseline + HMM features | 90.5% | 75.2%
baseline + Web ngram features | 93.1% | 75.6%
baseline + I-HMM features (7 layers) | 93.3% | 76.0%
SCL + 500 labeled biomedical sentences | 96.1 | –

Except for the Web ngram features, all features were derived from the Penn Treebank plus 70,000 sentences of unlabeled biomedical text.
Sparsity

 | Sparse | Not Sparse
Num. tokens | 463 | 12194
Baseline | 52.5 | 89.6
Web Ngrams | 61.8 | 94.0
NB (-Ngram) | 57.8 (-4.0) | 89.4 (-4.6)
HMM (-Ngram) | 60.2 (-1.6) | 91.6 (-2.4)
I-HMM (-Ngram) | 62.9 (+1.1) | 94.5 (+0.5)

Sparse: The word appears 5 times or fewer in all of our unlabeled text.
Not Sparse: The word appears 50 times or more in all of our unlabeled text.

Graphical models perform better on sparse words than not-sparse words, relative to Ngram models.
Polysemy

 | Polysemous | Not Polysemous
Num. tokens | 159 | 4321
Baseline | 59.5 | 78.5
Web Ngrams | 68.2 | 85.3
NB (-Ngram) | 64.5 (-3.7) | 88.7 (+3.4)
HMM (-Ngram) | 67.9 (-0.3) | 83.4 (-1.9)
I-HMM (-Ngram) | 75.6 (+7.4) | 85.2 (-0.1)

Polysemous: The word is associated with multiple, unrelated POS tags.
Not Polysemous: The word has only 1 POS tag in all of our labeled text.

Graphical models perform better on polysemous words than not-polysemous words, relative to Ngram models (except for NB).
Accuracy vs. domain distance
• Distance is measured as the Jensen-Shannon divergence between frequencies of features in S and T.
• For I-HMMs, we weighted the distance for each layer by the proportion of CRF parameter weights placed on that layer.
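The distance measure can be sketched directly. A minimal implementation of Jensen-Shannon divergence over feature-frequency distributions (represented here as dicts mapping feature to probability; the base-2 logarithm bounds the value in [0, 1]):

```python
from math import log2

def kl(p, q):
    """KL divergence (base 2) for distributions over a shared support."""
    return sum(p[f] * log2(p[f] / q[f]) for f in p if p[f] > 0)

def jensen_shannon(p, q):
    """JS divergence: average KL of p and q from their midpoint mixture m."""
    feats = set(p) | set(q)
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in feats}
    return 0.5 * kl({f: p.get(f, 0.0) for f in feats}, m) + \
           0.5 * kl({f: q.get(f, 0.0) for f in feats}, m)

print(jensen_shannon({"a": 1.0}, {"a": 1.0}))  # 0.0  (identical domains)
print(jensen_shannon({"a": 1.0}, {"b": 1.0}))  # 1.0  (disjoint feature use)
```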
Biomedical NP Chunking
The I-HMM representation can reduce error by over 57% relative to a standard representation, when training on news text and testing on biomedical journal text.
Chinese POS Tagging
Domain | Tokens | Stanford Chinese Tagger | CRF + HMM
Lore | 5428 | 88.4 | 89.7*
Religion | 3248 | 83.5 | 85.2
Humour | 3326 | 89.0 | 89.6
General fiction | 4913 | 87.5 | 89.4**
Essay | 5214 | 88.4 | 89.0
Mystery | 5774 | 87.4 | 90.1**
Romance | 5489 | 87.5 | 89.0*
Science-fiction | 3070 | 88.6 | 87.0
Skills | 5464 | 82.7 | 84.9**
Science | 5262 | 86.0 | 87.8**
Adventure fiction | 5071 | 82.1** | 80.0
Report | 6662 | 91.7 | 91.9
News | 9774 | 98.8** | 96.9
All but news | 58921 | 87.0 | 88.1**
All domains | 68695 | 88.7 | 89.5**
HMMs can beat a state-of-the-art system on many different domains.
Semantic Role Labeling (SRL)
(aka, Shallow Semantic Parsing)
Input:
1) Training sentences, labeled with syntax and semantic roles
2) A new sentence, and its syntax
Output: The predicate, arguments, and their roles
Example output:
Thrushes build cup-shaped nests
(Builder) (Predicate) (Thing Built)
Parsing
Chris broke the window with a hammer
[parse-tree figure: POS tags (Proper Noun, Verb, Det., Noun, Prep., Det., Noun), phrase structure (NP, VP, PP, S), with “Chris” as Subject and “the window” as Direct Object]
Semantic Role Labeling
Chris broke the window with a hammer
[same parse tree, now labeled with semantic roles: Breaker = Chris, Thing broken = the window, Means = a hammer]
Semantic Role Labeling
The window broke
[parse-tree figure: “The window” (NP, Subject) is the Thing broken]
Simple, open-domain SRL
Chris broke the window with a hammer
POS tag: Proper Noun | Verb | Det. | Noun | Prep. | Det. | Noun
Chunk tag: B-NP | B-VP | B-NP | I-NP | B-PP | B-NP | I-NP
dist. from predicate: -1 | 0 | +1 | +2 | +3 | +4 | +5
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline Features
Simple, open-domain SRL
Chris broke the window with a hammer
POS tag: Proper Noun | Verb | Det. | Noun | Prep. | Det. | Noun
Chunk tag: B-NP | B-VP | B-NP | I-NP | B-PP | B-NP | I-NP
dist. from predicate: -1 | 0 | +1 | +2 | +3 | +4 | +5
HMM label: [per-token HMM latent states]
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline + HMM
The importance of paths
Chris [predicate broke] [thing broken a hammer]
Chris [predicate broke] a window with [means a hammer]
Chris [predicate broke] the desk, so she fetched [not an arg a hammer] and nails.
Simple, open-domain SRL
Chris broke the window with a hammer
Word path: None | None | None | the | the-window | the-window-with | the-window-with-a
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline +HMM + Paths
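The word-path feature above can be sketched as the sequence of words strictly between the predicate and the candidate token, joined with '-' (a minimal version; the actual feature templates may differ):

```python
# Sketch: word-path feature between the predicate and a candidate argument.
def word_path(tokens, pred_i, arg_i):
    lo, hi = sorted((pred_i, arg_i))
    between = tokens[lo + 1:hi]          # words strictly between the two positions
    return "-".join(between) if between else "None"

toks = "Chris broke the window with a hammer".split()
print(word_path(toks, 1, 6))  # the-window-with-a
print(word_path(toks, 1, 2))  # None  (adjacent tokens)
```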
Simple, open-domain SRL
Chris broke the window with a hammer
Word path: None | None | None | the | the-window | the-window-with | the-window-with-a
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline +HMM + Paths
POS path: None | None | None | Det | Det-Noun | Det-Noun-Prep | Det-Noun-Prep-Det
Simple, open-domain SRL
Chris broke the window with a hammer
Word path: None | None | None | the | the-window | the-window-with | the-window-with-a
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline +HMM + Paths
POS path: None | None | None | Det | Det-Noun | Det-Noun-Prep | Det-Noun-Prep-Det
HMM path: None | None | None | [paths over HMM latent states]
Experimental results – F1
All systems were trained on newswire text from the Wall Street Journal (WSJ), and tested on WSJ and fiction texts from the Brown corpus (Brown).
Span-HMMs
Span-HMM features
Chris broke the window with a hammer
Span-HMM for “hammer”
SRL Label: Breaker | Pred | Thing Broken | Means
Span-HMM Features
Span-HMM feature
Span-HMM features
Chris broke the window with a hammer
Span-HMM for “a”
SRL Label: Breaker | Pred | Thing Broken | Means
Span-HMM Features
Span-HMM feature
Span-HMM features
Chris broke the window with a hammer
SRL Label: Breaker | Pred | Thing Broken | Means
Span-HMM Features
Span-HMM feature
None None None
Experimental results – SRL F1
All systems were trained on newswire text from the Wall Street Journal (WSJ), and tested on WSJ and fiction texts from the Brown corpus (Brown).
Experimental results – feature sparsity
Benefit grows with distance from predicate
Take-away lessons (1)
• Hand-crafted feature sets can be beaten.
– Distributional similarity (Harris, 1954) is an extremely valuable feature for many NLP applications.
– Features based on distributional similarity, derived from a large corpus, complement traditional features.
Take-away lessons (2)
• Context-dependent features matter a lot
– Ngram models have their advantages, but not so much as representations:
• Features are not dependent on local context
• Features are sparse
• Even web-scale models are outperformed by more sophisticated models trained on small datasets
– HMMs significantly outperform NB clustering
Take-away lessons (3)
• The trend is for more sophisticated models to perform better than simpler models (!)
– In contrast to the received wisdom that more data > better models (Banko and Brill, 2001)
– The community has not yet figured out the “right way” to define and measure distributional similarity.
Open problems and future work
• We need a mechanism for controlling for distance between domains in our feature sets.
• More sophisticated models for representations:
– Tree-based, rather than sequential, models
– Non-independent, multi-dimensional models
• Sophisticated models on larger corpora
Acknowledgments
Northwestern EECS
Prof. Doug Downey
Arun Ahuja
Temple CIS
Fei (Irene) Huang
Prof. Yuhong Guo
Avirup Sil
Anjan Nepal