Learning Representations of Language
for Domain Adaptation
Alexander Yates
Fei (Irene) Huang
Temple University
Computer and Information Sciences
Outline
• Representations in NLP
– Machine Learning / Data mining perspective
– Linguistics perspective
• Domain Adaptation
• Learning Representations
• Experiments
A sequence-labeling task
Identify phrases that name birds and cats.
[BIRD Thrushes] build cup-shaped nests, sometimes lining them with mud.
[CAT Sylvester] was #33 on TV Guide's list of top 50 best cartoon characters, together with [BIRD Tweety Bird].
Machine Learning
Quick formal background:
Let X be a set of all possible data points (e.g., all English sentences)
Let Z be the space of all possible predictions (e.g., all sequences of labels)
A target is a function f: X → Z that we’re trying to learn
A learning machine is an algorithm L
Input: a set of examples S = {xi} drawn from distribution D, and a label zi = f(xi) for each example.
Output: a hypothesis h: X → Z that minimizes Ex~D 1[h(x) ≠ f(x)]
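As a toy illustration (not from the slides), the learner can only estimate this expectation from a finite sample; a minimal sketch of the empirical version of the objective:

```python
# Sketch: empirical estimate of E_{x~D} 1[h(x) != f(x)] on a finite sample.
def empirical_error(h, f, sample):
    """Fraction of sample points where hypothesis h disagrees with target f."""
    return sum(1 for x in sample if h(x) != f(x)) / len(sample)

# Hypothetical toy task: label integers as 'even'/'odd'.
f = lambda x: "even" if x % 2 == 0 else "odd"   # target function f: X -> Z
h = lambda x: "even"                            # a (bad) constant hypothesis
print(empirical_error(h, f, [1, 2, 3, 4]))      # 0.5
```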
Representations for Learning
Most NLP systems first transform the raw data into a more convenient representation.
• A representation is a function R: X → Y, for some suitable feature space Y, like Y = ℝd.
• A feature is a dimension in the feature space Y.
• Alternatively, we may use the word feature to refer to a value for one component of R(x), for some representation R and instance x.
A learning machine L takes as input a set of examples (R(xi), f(xi)) and returns h: Y → Z.
A traditional NLP representation
Thrushes build cup-shaped nests, …
[figure: each word is mapped to a binary feature vector over features such as:]
• Word = ‘build’ / Word = ‘nests’ / Word = ‘thrushes’
• Next word is ‘build’ / Next word is ‘nests’
• Previous word is ‘build’ / Previous word is ‘thrushes’
• Is capitalized
• Ends with ‘–s’ / Ends with ‘–ed’
Feature sets are carefully-engineered for specific tasks, but usually include at least the word-based features.
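A minimal sketch of such a word-based feature template (feature names here are illustrative, not the exact engineered set):

```python
# Sketch: word-identity, context, and orthographic features for token i.
def word_features(tokens, i):
    feats = {f"word={tokens[i].lower()}": 1}
    if i > 0:
        feats[f"prev_word={tokens[i-1].lower()}"] = 1
    if i + 1 < len(tokens):
        feats[f"next_word={tokens[i+1].lower()}"] = 1
    feats["is_capitalized"] = int(tokens[i][0].isupper())
    feats["ends_with_s"] = int(tokens[i].endswith("s"))
    feats["ends_with_ed"] = int(tokens[i].endswith("ed"))
    return feats

toks = "Thrushes build cup-shaped nests".split()
print(word_features(toks, 1))  # features for 'build'
```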
Sparsity
• Common indicators for birds:
– feather, beak, nest, egg, wing
• Uncommon indicators:
– aviary, archaeopteryx, insectivorous, warm-blooded
“The jackjaw stood irreverently on the scarecrow’s shoulder.”
• Sparsity analysis of Collins parser: (Bikel, 2004)
– bilexical statistics are available in < 1.5% of parse decisions
Sparsity in biomedical POS tagging
• Most part-of-speech taggers are trained on newswire text (Penn Treebank)
• In a standard biomedical data set, fully 23% of words never appear in the Penn Treebank
Tagger | Newswire accuracy | Biomedical accuracy | Unknown word accuracy
CRF with word and orthographic features | 94.0% | 88.3% | 67.3%
SCL (Blitzer et al., 2006) | – | 88.9% | 72.0%
Polysemy
• The word “thrush” is not necessarily an indicator of a bird:
– Thrush is the term for an overgrowth of yeast in a baby's mouth.
– Thrush products have been a staple of hot rodders for over 40 years as these performance mufflers bring together the power and sound favored by true enthusiasts.
• “Leopard”, “jaguar”, “puma”, “tiger”, “lion”, etc. all have various meanings as cats, operating systems, sports teams, and so on.
• Word meanings depend on their contexts, and word-based features do not capture this.
Embeddings
• Kernel trick:
– implicitly embed data points in a higher-dimensional space
• Dimensionality reduction:
– Embed data points in a lower-dimensional space
– Common technique in text mining, combined with vector space models
• PCA, LSA, SVD (Deerwester et al., 1990)
• Self-organizing maps (Honkela, 1997)
• Independent component analysis (Sahlgren, 2005)
• Random indexing (Väyrynen et al., 2007)
• But existing embedding techniques ignore linguistic structure.
A representation from linguistics
Many modern linguistic theories (GPSG, HPSG, LFG, etc.) treat language as a small set of constraints over a large number of lexical features.
But lexical entries are painstakingly crafted by hand.
thrushes:
  HEAD Noun
  SINGULAR -
  COUNT +
  VAL-SPR Det[SINGULAR -]
  VAL-COMP None
  SEM-AGENCY +
  … …

build:
  HEAD Verb
  VFORM Infinite
  AUX -
  VAL-SUBJ Noun[SINGULAR -, SEM-AGENCY +]
  VAL-COMP Noun
  … …
Outline
• Representations in NLP
• Domain Adaptation
• Learning Representations
• Experiments
Domains
Definition: A domain is a subset of language that is related through genre, topic, or style.
Examples:
newswire text
science fiction novels
biomedical research literature
Domain Dependence
Newswire Domain
… isn’t signaling (verb) a recession …
… acquiring the company, signaling (verb) to others that …
… in that list, signaling (verb) that all the company’s coal and …
Dow officials were signaling (verb) that the company …
… the S&P was signaling (verb) that the Dow could fall …
Biomedical Domain
… factor for Wnt signaling (noun), …
… for the Wnt signaling (noun) pathway via …
... in a novel signaling (noun) pathway from an extracellular guidance cue …
… in the Wnt signaling (noun) pathway, and mutation …
Domain adaptation: a hard test for NLP
Formally, a domain is a probability distribution D over the instance set X
e.g., sentences in the newswire domain ~ DNews(X)
sentences in the biomedical domain ~ DBio(X)
In domain adaptation, a learning machine is given training examples from a source domain
The hypothesis is then tested on data points drawn from a separate target domain.
Learning theory for domain adaptation
A recently-proved theorem:
The error rate of h on target domain T after being trained on source domain S depends on:
1. the error rate of h on the source domain S
2. the distance between S and T
• The claim depends on a particular notion of “distance” between probability distributions S and T
[Ben-David et al., 2009]
Formal version
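The slide's equation did not survive extraction. A reconstructed statement of the bound, following the Ben-David et al. result cited on the previous slide (notation supplied here, so treat it as a sketch rather than the slide's exact form):

```latex
% Reconstructed domain-adaptation bound (Ben-David et al.): for every h in H,
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{H\Delta H}(D_S, D_T) \;+\; \lambda
% where \epsilon_S, \epsilon_T are the source- and target-domain error rates,
% d_{H\Delta H} is the H\Delta H-divergence between the two domain distributions,
% and \lambda is the error of the best joint hypothesis on both domains.
```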
Outline
• Representations in NLP
– Machine Learning / Data mining perspective
– Linguistics perspective
• Domain Adaptation
• Learning Representations
• Experiments
Objectives for (lexical) representations
1. Usefulness: We want features that help in learning the target function.
2. Non-Sparsity: We want features that appear commonly in reasonable amounts of training data.
3. Context-dependence: We want features that somehow depend on, or take into account, the context of a word.
4. Minimal domain distance: We want features that appear approximately as often in one domain as any other.
5. Automation: We don’t want to have to manually construct the features.
Representation learning
Thrushes build cup-shaped nests
R (representation)
F1 F2 F3 F4 F5 F6 F7 F8 … Fd
0.1 -7 21 0 2 0 12.1 5 … (a feature vector in ℝd)
h (hypothesis)
BIRD X X X
We learn this
Why not learn this, too?!
1) Ngram Models for Representations

finches
ngram | Prob
- are plain | .0001
- range from | .001
- inhabit wooded | .0001
- sing | .0001
actually - are | .0005
true - are | .001
Darwin’s - | .0001
Galapagos - | .0001
true - | .01

thrushes
ngram | Prob
- eat worms | .001
- are plump | .001
- lay two | .0005
- sing | .0005
- build cup-shaped | .0001
large - in | .002
traditional - genera | .0001
soft-plumaged - | .001
1) Ngram Models for Representations

finches
feature | value
- are plain | .0001
- range from | .001
- inhabit wooded | .0001
- sing | .0001
actually - are | .0005
true - are | .001
Darwin’s - | .0001
Galapagos - | .0001
true - | .01

thrushes
feature | value
- eat worms | .001
- are plump | .001
- lay two | .0005
- sing | .0005
- build cup-shaped | .0001
large - in | .002
traditional - genera | .0001
soft-plumaged - | .001
1) Ngram Models for Representations
True finches are predominantly seed-eating songbirds.
Training
[figure: each token's vector of ngram feature values; the highlighted feature is “- sing”]
Labels: X BIRD X X X BIRD
1) Ngram Models for Representations
Thrushes build cup-shaped nests, sometimes …
Testing
[figure: each token's vector of ngram feature values; the highlighted feature is “- sing”]
Labels: BIRD X X X X
Ngram features:
• Advantages: Automated; Useful
• Disadvantages: Sparse; Not context-dependent
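One way to see where such feature values come from is to estimate them from corpus counts. A toy sketch (the corpus and the relative-frequency estimate are simplifications of the language-model probabilities in the tables above):

```python
from collections import Counter

# Sketch: the feature "- eat worms" for word w is estimated as the relative
# frequency with which w is immediately followed by "eat worms".
corpus = ["thrushes eat worms", "thrushes are plump", "thrushes sing",
          "finches sing", "finches are plain"]
right_context = Counter()
word_counts = Counter()
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        word_counts[w] += 1
        if i + 2 < len(toks):
            right_context[(w, toks[i + 1], toks[i + 2])] += 1

def ngram_feature(w, ctx):
    """Value of the feature '- ctx[0] ctx[1]' for word w."""
    return right_context[(w,) + ctx] / word_counts[w]

print(round(ngram_feature("thrushes", ("eat", "worms")), 3))  # 0.333
```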
Pause: let’s generalize the procedure
1. Train a language model on lots of (unlabeled) text (preferably from multiple domains)
2. Use the language model to annotate (labeled) training and test texts with latent information
3. Use the annotations as features in a CRF
4. Train and test CRF as usual
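The four steps can be sketched structurally as one pipeline (all component names here are hypothetical stand-ins, not an actual API; `fit_lm` and `fit_crf` stand in for the language-model and CRF trainers):

```python
# Structural sketch of the four-step procedure above.
def representation_pipeline(unlabeled_corpus, train, test, fit_lm, fit_crf):
    lm = fit_lm(unlabeled_corpus)                      # 1) train language model
    train_feats = [lm.annotate(x) for x, _ in train]   # 2) annotate latent info
    test_feats = [lm.annotate(x) for x in test]
    crf = fit_crf(train_feats, [y for _, y in train])  # 3) latent info as CRF features
    return [crf.predict(f) for f in test_feats]        # 4) train and test as usual
```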
Pause: how to improve procedure?
The main idea we’ve explored is:
– cluster words into sets of related words
– use the clusters as features
We can control the number of clusters, to make the features less sparse.
2) Distributional Clustering
True finches are predominantly seed-eating songbirds.
1) Construct a Naïve Bayes model for generating trigrams
• The parent node is a latent state with K possible values
• Trigrams are generated according to Pleft(word | parent), Pmid(word | parent), and Pright(word | parent)
2) Distributional Clustering – NB
True finches are predominantly seed-eating songbirds.
2) Train the prior P(parent) and conditional distributions on a large corpus using EM, treating all trigrams as independent.
3) For each token in training and test sets, determine the best value of the latent state, and use it as a new feature.
Thrushes build cup-shaped nests, …
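A small-scale sketch of steps 2–3 under the model above (a hypothetical, unsmoothed implementation with toy data; a real system would train on millions of trigrams):

```python
import numpy as np

# Sketch: EM for the Naive Bayes trigram model -- a latent state z with K
# values generates the three words of each trigram via Pl, Pm, Pr.
def em_trigram_nb(trigrams, vocab, K=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    V = len(vocab)
    idx = {w: i for i, w in enumerate(vocab)}
    T = np.array([[idx[a], idx[b], idx[c]] for a, b, c in trigrams])
    prior = np.full(K, 1.0 / K)
    Pl, Pm, Pr = [rng.dirichlet(np.ones(V), size=K) for _ in range(3)]  # (K, V)
    for _ in range(iters):
        # E-step: posterior over the latent state for each trigram
        post = prior * Pl[:, T[:, 0]].T * Pm[:, T[:, 1]].T * Pr[:, T[:, 2]].T
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the prior and the three word distributions
        prior = post.mean(axis=0)
        for P, col in ((Pl, 0), (Pm, 1), (Pr, 2)):
            P[:] = 1e-9                      # tiny floor to avoid zeros
            np.add.at(P.T, T[:, col], post)  # accumulate expected counts
            P /= P.sum(axis=1, keepdims=True)
    return post.argmax(axis=1)  # hard cluster assignment per trigram

trigrams = [("true", "finches", "are")] * 5 + [("build", "nests", "with")] * 5
labels = em_trigram_nb(trigrams, ["true", "finches", "are", "build", "nests", "with"])
print(labels)
```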
2) Distributional Clustering – NB
Advantages over ngram features
1) Sparsity: only K features, so each should be common
2) Context-dependence: The new feature depends not just on the token at position i, but also on tokens at i-1 and i+1

Potential problems:
1) Features are only sensitive to immediate neighbors
2) The model requires 3 observation distributions, each of which will be sparsely observed
3) Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?
3) Distributional clustering - HMMs
True finches are predominantly seed-eating songbirds.
Hidden Markov Model
• One latent node yi per token xi
• A conditional observation distribution Pobs(xi | yi)
• A conditional transition distribution Ptrans(yi | yi-1)
• A prior distribution Pprior(y1)
Joint probability:
P(x, y) = Pprior(y1) Pobs(x1 | y1) ∏i=2..N Ptrans(yi | yi-1) Pobs(xi | yi)
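The joint probability can be computed directly from the factorization; the toy parameters below are invented for illustration, not learned values:

```python
# Sketch: P(x, y) = Pprior(y1) Pobs(x1|y1) * prod_i Ptrans(yi|yi-1) Pobs(xi|yi)
def hmm_joint(x, y, prior, trans, obs):
    p = prior[y[0]] * obs[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[y[i - 1]][y[i]] * obs[y[i]][x[i]]
    return p

# Hypothetical two-state model ("N"-like and "V"-like clusters).
prior = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
obs = {"N": {"thrushes": 0.5, "nests": 0.5}, "V": {"build": 1.0}}
print(round(hmm_joint(["thrushes", "build"], ["N", "V"], prior, trans, obs), 2))  # 0.21
```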
3) Distributional clustering - HMMs
True finches are predominantly seed-eating songbirds.
1) Train the prior and conditional distributions on a large corpus using EM.
2) Use the Viterbi algorithm to find the best setting of all latent states for a given sentence.
3) Use the latent state value yi as a new feature for xi.
Thrushes build cup-shaped nests
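Step 2 can be sketched directly; the Viterbi decoder below uses toy hand-set parameters (states and probabilities are invented for illustration):

```python
# Sketch: Viterbi decoding -- the single best latent-state sequence under an HMM.
def viterbi(x, states, prior, trans, obs):
    # V[t][s] = (prob. of best path ending in state s at time t, backpointer)
    V = [{s: (prior[s] * obs[s].get(x[0], 0.0), None) for s in states}]
    for t in range(1, len(x)):
        V.append({s: max(((V[t - 1][r][0] * trans[r][s] * obs[s].get(x[t], 0.0), r)
                          for r in states), key=lambda pr: pr[0]) for s in states})
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(x) - 1, 0, -1):
        path.append(V[t][path[-1]][1])  # follow backpointers
    return path[::-1]

prior = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.8, "B": 0.2}}
obs = {"A": {"thrushes": 0.5, "nests": 0.5}, "B": {"build": 1.0}}
print(viterbi(["thrushes", "build", "nests"], ["A", "B"], prior, trans, obs))  # ['A', 'B', 'A']
```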
3) Distributional Clustering – HMMs
Advantages over NB features
1) Sparsity: same number of features, but the HMM model itself is less sparse: it includes only one observation distribution
2) Context-dependence: The new feature depends (indirectly) on the whole observation sequence
Potential problem:Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?
4) Multi-dimensional clustering
True finches are predominantly seed-eating songbirds.
Independent HMM (I-HMM) model:
• L layers of HMM models, each trained independently
• Each layer’s parameters are initialized randomly for EM
4) Multi-dimensional clustering
True finches are predominantly seed-eating songbirds.
As before, we decode each layer using the Viterbi algorithm to generate features.
Each layer represents a random projection from the full feature space to K boolean dimensions.
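The per-token feature construction can be sketched as follows. A real system would run Viterbi decoding for each layer's HMM; the deterministic stand-in decoder below is purely illustrative:

```python
# Sketch: per-token features from L independently decoded HMM layers.
def ihmm_features(tokens, num_layers=3, K=4):
    layers = []
    for layer in range(num_layers):
        # stand-in for Viterbi decoding of this layer's HMM (illustrative only)
        layers.append([(len(tok) * (layer + 3)) % K for tok in tokens])
    # each token gets one latent-state feature per layer: K^L distinct values
    return [tuple(layers[l][i] for l in range(num_layers))
            for i in range(len(tokens))]

print(ihmm_features(["True", "finches"]))  # [(0, 0, 0), (1, 0, 3)]
```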
4) Multi-dimensional clustering
Advantages over HMM features
1) Usefulness: closer to the lexical representation from linguistics
2) Usefulness: can represent K^L points (instead of just K)
Potential problem:Each layer is trained independently, so are they really providing additional (rather than overlapping) information?
Outline
• Representations in NLP
– Machine Learning / Data mining perspective
– Linguistics perspective
• Domain Adaptation
• Learning Representations
• Experiments
Experiments
• Part-of-speech tagging (and chunking)
– Train on newswire text
– Test on biomedical text
(Huang and Yates, ACL 2009; Huang and Yates, DANLP 2010)
• Semantic role labeling
– Train on newswire text
– Test on fiction text
(Huang and Yates, ACL 2010)
Part-of-Speech (POS) tagging

Tagger | Biomedical accuracy | Unknown word accuracy
baseline: CRF with word and orthographic features | 88.3% | 67.3%
baseline + NB features | 88.4% | 69.3%
SCL (Blitzer et al., 2006) | 88.9% | 72.0%
baseline + HMM features | 90.5% | 75.2%
baseline + Web ngram features | 93.1% | 75.6%
baseline + I-HMM features (7 layers) | 93.3% | 76.0%
SCL + 500 labeled biomedical sentences | 96.1 | –

Except for the Web ngram features, all features were derived from the Penn Treebank plus 70,000 sentences of unlabeled biomedical text.
Sparsity

 | Sparse | Not Sparse
Num. tokens | 463 | 12194
Baseline | 52.5 | 89.6
Web Ngrams | 61.8 | 94.0
NB (-Ngram) | 57.8 (-4.0) | 89.4 (-4.6)
HMM (-Ngram) | 60.2 (-1.6) | 91.6 (-2.4)
I-HMM (-Ngram) | 62.9 (+1.1) | 94.5 (+0.5)

Sparse: The word appears 5 times or fewer in all of our unlabeled text.
Not Sparse: The word appears 50 times or more in all of our unlabeled text.

Graphical models perform better on sparse words than not-sparse words, relative to Ngram models.
Polysemy

 | Polysemous | Not Polysemous
Num. tokens | 159 | 4321
Baseline | 59.5 | 78.5
Web Ngrams | 68.2 | 85.3
NB (-Ngram) | 64.5 (-3.7) | 88.7 (+3.4)
HMM (-Ngram) | 67.9 (-0.3) | 83.4 (-1.9)
I-HMM (-Ngram) | 75.6 (+7.4) | 85.2 (-0.1)

Polysemous: The word is associated with multiple, unrelated POS tags.
Not Polysemous: The word has only 1 POS tag in all of our labeled text.

Graphical models perform better on polysemous words than not-polysemous words, relative to Ngram models (except for NB).
Accuracy vs. domain distance
• Distance is measured as the Jensen-Shannon divergence between frequencies of features in S and T.
• For I-HMMs, we weighted the distance for each layer by the proportion of CRF parameter weights placed on that layer.
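The distance measure can be sketched directly. A minimal implementation of Jensen-Shannon divergence over feature-frequency distributions (represented here as dicts mapping feature to probability; the base-2 logarithm bounds the value in [0, 1]):

```python
from math import log2

def kl(p, q):
    """KL divergence (base 2) for distributions over a shared support."""
    return sum(p[f] * log2(p[f] / q[f]) for f in p if p[f] > 0)

def jensen_shannon(p, q):
    """JS divergence: average KL of p and q from their midpoint mixture m."""
    feats = set(p) | set(q)
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in feats}
    return 0.5 * kl({f: p.get(f, 0.0) for f in feats}, m) + \
           0.5 * kl({f: q.get(f, 0.0) for f in feats}, m)

print(jensen_shannon({"a": 1.0}, {"a": 1.0}))  # 0.0  (identical domains)
print(jensen_shannon({"a": 1.0}, {"b": 1.0}))  # 1.0  (disjoint feature use)
```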
Biomedical NP Chunking
The I-HMM representation can reduce error by over 57% relative to a standard representation, when training on news text and testing on biomedical journal text.
Chinese POS Tagging
Domain | Tokens | Stanford Chinese Tagger | CRF + HMM
Lore | 5428 | 88.4 | 89.7*
Religion | 3248 | 83.5 | 85.2
Humour | 3326 | 89.0 | 89.6
General fiction | 4913 | 87.5 | 89.4**
Essay | 5214 | 88.4 | 89.0
Mystery | 5774 | 87.4 | 90.1**
Romance | 5489 | 87.5 | 89.0*
Science-fiction | 3070 | 88.6 | 87.0
Skills | 5464 | 82.7 | 84.9**
Science | 5262 | 86.0 | 87.8**
Adventure fiction | 5071 | 82.1** | 80.0
Report | 6662 | 91.7 | 91.9
News | 9774 | 98.8** | 96.9
All but news | 58921 | 87.0 | 88.1**
All domains | 68695 | 88.7 | 89.5**
HMMs can beat a state-of-the-art system on many different domains.
Semantic Role Labeling (SRL)
(aka, Shallow Semantic Parsing)
Input:
1) Training sentences, labeled with syntax and semantic roles
2) A new sentence, and its syntax
Output: The predicate, arguments, and their roles
Example output:
Thrushes build cup-shaped nests
(Builder) (Predicate) (Thing Built)
Parsing
Chris broke the window with a hammer
[parse-tree figure: POS tags (Proper Noun, Verb, Det., Noun, Prep., Det., Noun), phrase structure (NP, VP, PP, S), with “Chris” as Subject and “the window” as Direct Object]
Semantic Role Labeling
Chris broke the window with a hammer
[same parse tree, now labeled with semantic roles: Breaker = Chris, Thing broken = the window, Means = a hammer]
Semantic Role Labeling
The window broke
[parse-tree figure: “The window” (NP, Subject) is the Thing broken]
Simple, open-domain SRL
Chris broke the window with a hammer
POS tag: Proper Noun | Verb | Det. | Noun | Prep. | Det. | Noun
Chunk tag: B-NP | B-VP | B-NP | I-NP | B-PP | B-NP | I-NP
dist. from predicate: -1 | 0 | +1 | +2 | +3 | +4 | +5
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline Features
Simple, open-domain SRL
Chris broke the window with a hammer
POS tag: Proper Noun | Verb | Det. | Noun | Prep. | Det. | Noun
Chunk tag: B-NP | B-VP | B-NP | I-NP | B-PP | B-NP | I-NP
dist. from predicate: -1 | 0 | +1 | +2 | +3 | +4 | +5
HMM label: [per-token HMM latent states]
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline + HMM
The importance of paths
Chris [predicate broke] [thing broken a hammer]
Chris [predicate broke] a window with [means a hammer]
Chris [predicate broke] the desk, so she fetched [not an arg a hammer] and nails.
Simple, open-domain SRL
Chris broke the window with a hammer
Word path: None | None | None | the | the-window | the-window-with | the-window-with-a
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline +HMM + Paths
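The word-path feature above can be sketched as the sequence of words strictly between the predicate and the candidate token, joined with '-' (a minimal version; the actual feature templates may differ):

```python
# Sketch: word-path feature between the predicate and a candidate argument.
def word_path(tokens, pred_i, arg_i):
    lo, hi = sorted((pred_i, arg_i))
    between = tokens[lo + 1:hi]          # words strictly between the two positions
    return "-".join(between) if between else "None"

toks = "Chris broke the window with a hammer".split()
print(word_path(toks, 1, 6))  # the-window-with-a
print(word_path(toks, 1, 2))  # None  (adjacent tokens)
```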
Simple, open-domain SRL
Chris broke the window with a hammer
Word path: None | None | None | the | the-window | the-window-with | the-window-with-a
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline +HMM + Paths
POS path: None | None | None | Det | Det-Noun | Det-Noun-Prep | Det-Noun-Prep-Det
Simple, open-domain SRL
Chris broke the window with a hammer
Word path: None | None | None | the | the-window | the-window-with | the-window-with-a
SRL Label: Breaker | Pred | Thing Broken | Means
Baseline +HMM + Paths
POS path: None | None | None | Det | Det-Noun | Det-Noun-Prep | Det-Noun-Prep-Det
HMM path: None | None | None | [paths over HMM latent states]
Experimental results – F1
All systems were trained on newswire text from the Wall Street Journal (WSJ), and tested on WSJ and fiction texts from the Brown corpus (Brown).
Span-HMMs
Span-HMM features
Chris broke the window with a hammer
Span-HMM for “hammer”
SRL Label: Breaker | Pred | Thing Broken | Means
Span-HMM Features
Span-HMM feature
Span-HMM features
Chris broke the window with a hammer
Span-HMM for “a”
SRL Label: Breaker | Pred | Thing Broken | Means
Span-HMM Features
Span-HMM feature
Span-HMM features
Chris broke the window with a hammer
SRL Label: Breaker | Pred | Thing Broken | Means
Span-HMM Features
Span-HMM feature
None None None
Experimental results – SRL F1
All systems were trained on newswire text from the Wall Street Journal (WSJ), and tested on WSJ and fiction texts from the Brown corpus (Brown).
Experimental results – feature sparsity
Benefit grows with distance from predicate
Take-away lessons (1)
• Hand-crafted feature sets can be beaten.
– Distributional similarity (Harris, 1954) is an extremely valuable feature for many NLP applications.
– Features based on distributional similarity, derived from a large corpus, complement traditional features.
Take-away lessons (2)
• Context-dependent features matter a lot
– Ngram models have their advantages, but not so much as representations:
• Features are not dependent on local context
• Features are sparse
• Even web-scale models are outperformed by more sophisticated models trained on small datasets
– HMMs significantly outperform NB clustering
Take-away lessons (3)
• The trend is for more sophisticated models to perform better than simpler models (!)
– In contrast to the received wisdom that more data > better models (Banko and Brill, 2001)
– The community has not yet figured out the “right way” to define and measure distributional similarity.
Open problems and future work
• We need a mechanism for controlling for distance between domains in our feature sets.
• More sophisticated models for representations:
– Tree-based, rather than sequential, models
– Non-independent, multi-dimensional models
• Sophisticated models on larger corpora
Acknowledgments
Northwestern EECS
Prof. Doug Downey
Arun Ahuja
Temple CIS
Fei (Irene) Huang
Prof. Yuhong Guo
Avirup Sil
Anjan Nepal