Domain Adaptation in Natural Language Processing
Jing Jiang
Department of Computer Science, University of Illinois at Urbana-Champaign
Jan 8, 2008
Textual Data in the Information Age

• Contains much useful information
  – E.g. >85% of corporate data is stored as text
• Hard to handle
  – Large amount: e.g. by 2002, 2.5 billion documents on the surface Web, growing by 7.3 million per day
  – Diversity: emails, news, digital libraries, Web logs, etc.
  – Unstructured: vs. relational databases

How to manage textual data?
• Information retrieval: to rank documents based on relevance to keyword queries
• Not always satisfactory
  – More sophisticated services desired
Automatic Text Summarization
Question Answering
Information Extraction

Company   Founder
…         …
Google    Larry Page
…         …
Beyond Information Retrieval
• Automatic text summarization
• Question answering
• Information extraction
• Sentiment analysis
• Machine translation
• Etc.
All of these rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text.
Typical NLP Tasks
“Larry Page was Google’s founding CEO”

• Part-of-speech tagging:
  Larry/noun Page/noun was/verb Google/noun ’s/possessive-end founding/adjective CEO/noun
• Chunking:
  [NP: Larry Page] [V: was] [NP: Google ’s founding CEO]
• Named entity recognition:
  [person: Larry Page] was [organization: Google] ’s founding CEO
• Relation extraction:
  Founder(Larry Page, Google)
• Word sense disambiguation:
  “Larry Page” vs. “Page 81”

State-of-the-art solution: supervised machine learning
Supervised Learning for NLP

Part-of-speech tagging on news articles:
representative corpus (WSJ articles) → human annotation → POS-tagged WSJ articles
(e.g. Larry/NNP Page/NNP was/VBD Google/NNP ’s/POS founding/ADJ CEO/NN)
→ training with a standard supervised learning algorithm → trained POS tagger
In Reality…

Part-of-speech tagging on biomedical articles:
representative corpus (MEDLINE articles) → human annotation → POS-tagged MEDLINE articles
(e.g. We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS)
→ training with a standard supervised learning algorithm → trained POS tagger

But human annotation is expensive: in reality, only POS-tagged WSJ articles are available for training.
Many Other Examples
• Named entity recognition
  – News articles → personal blogs
  – Organism A → organism B
• Spam filtering
  – Public email collection → personal inboxes
• Sentiment analysis of product reviews (positive vs. negative)
  – Movies → books
  – Cell phones → digital cameras

What is the problem with this non-standard setting, where the domains differ?
Domain Difference → Performance Degradation

• Ideal setting: train on MEDLINE → test on MEDLINE: POS tagger ~96%
• Realistic setting: train on WSJ → test on MEDLINE: POS tagger ~86%
Another Example

• Ideal setting: gene name recognizer: 54.1%
• Realistic setting: gene name recognizer: 28.1%
Domain Adaptation

source domain (labeled) + target domain (labeled and unlabeled) → Domain Adaptive Learning Algorithm

Goal: to design learning algorithms that are aware of the domain difference and exploit all available data to adapt to the target domain.
With Domain Adaptation Techniques…
• Standard learning: Fly + Mouse → Yeast gene name recognizer: 63.3%
• Domain adaptive learning: Fly + Mouse → Yeast gene name recognizer: 75.9%
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Overview
[The following slides each show a diagram contrasting the source domain and the target domain.]

Ideal Goal

Standard Supervised Learning

Standard Semi-Supervised Learning

Idea 1: Generalization

Idea 2: Adaptation

How to formally formulate the ideas?
Instance Weighting
[Diagram: source and target domains in the instance space (each point represents an observed instance)]

Goal: to find appropriate weights for different instances.
Feature Selection
[Diagram: source and target domains in the feature space (each point represents a useful feature)]

Goal: to separate generalizable features from domain-specific features.
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Observation
[Diagram: source domain vs. target domain instances]
Analysis of Domain Difference

x: observed instance; y: class label (to be predicted)

p(x, y) = p(x) p(y | x)

• Instance difference: ps(x) ≠ pt(x) → instance adaptation
• Labeling difference: ps(y | x) ≠ pt(y | x) → labeling adaptation
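The decomposition p(x, y) = p(x) p(y | x) and the instance difference ps(x) ≠ pt(x) can be checked numerically. A toy sketch in Python (the word/label data below are entirely hypothetical, purely for illustration):

```python
from collections import Counter

def estimate(pairs):
    """Estimate p(x, y), p(x) and p(y | x) from observed (x, y) pairs."""
    n = len(pairs)
    p_xy = {xy: c / n for xy, c in Counter(pairs).items()}
    p_x = {x: c / n for x, c in Counter(x for x, _ in pairs).items()}
    p_y_x = {(x, y): p / p_x[x] for (x, y), p in p_xy.items()}
    return p_xy, p_x, p_y_x

# Hypothetical domains: x is a word, y = 1 if it is part of a gene name.
source = [("page", 0)] * 6 + [("gene", 1)] * 4
target = [("page", 0)] * 2 + [("gene", 1)] * 8

p_xy_s, p_x_s, p_y_x_s = estimate(source)
p_xy_t, p_x_t, p_y_x_t = estimate(target)

# p(x, y) = p(x) p(y | x) holds exactly for the empirical estimates.
for (x, y), p in p_xy_s.items():
    assert abs(p - p_x_s[x] * p_y_x_s[(x, y)]) < 1e-12

# Instance difference: ps(x) != pt(x).
print("ps(gene) =", p_x_s["gene"], " pt(gene) =", p_x_t["gene"])
```

Here the two domains disagree on p(x) but not on p(y | x), i.e. a pure instance difference.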
Labeling Adaptation
[Diagram: source domain vs. target domain]

Where pt(y | x) ≠ ps(y | x): remove or demote those source instances.
Instance Adaptation (pt(x) < ps(x))
[Diagram: source domain vs. target domain]

Where pt(x) < ps(x): remove or demote those source instances.
Instance Adaptation (pt(x) > ps(x))
[Diagram: source domain vs. target domain]

Where pt(x) > ps(x): promote those instances.
• Target domain instances are especially useful here.
Empirical Risk Minimization with Three Sets of Instances

Data sets: Ds (labeled source), Dt,l (labeled target), Dt,u (unlabeled target)

Optimal classification model:

θ_t* = argmin_θ ∫_X Σ_{y∈Y} p_t(x) p_t(y | x) L(x, y, θ) dx

• L(x, y, θ): loss function
• the objective is the expected loss; in practice, the empirical loss replaces the expected loss
Using Ds

θ̂ = argmin_θ Σ_{i=1}^{N_s} α_i β_i L(x_i^s, y_i^s, θ)

• β_i ≈ p_t(x_i^s) / p_s(x_i^s): corrects the instance difference (hard to estimate for high-dimensional data)
• α_i ≈ p_t(y_i^s | x_i^s) / p_s(y_i^s | x_i^s): corrects the labeling difference (needs labeled target data)
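A minimal numerical sketch of this weighted objective, assuming logistic loss and plain gradient descent in numpy; the data and the weights α_i β_i are made up for illustration:

```python
import numpy as np

def fit_weighted_logreg(X, y, w, lr=0.5, steps=2000):
    """Minimize the weighted empirical risk sum_i w_i * L(x_i, y_i, theta)
    with logistic loss; labels y are in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * X.T @ (w * (p - y)) / len(y)
    return theta

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=200)]   # bias + one feature
y = (X[:, 1] > 0).astype(float)

# Hypothetical weights alpha_i * beta_i: down-weight the first half of the
# source sample, as if p_t(x)/p_s(x) had been estimated to be small there.
w = np.r_[np.full(100, 0.2), np.full(100, 1.0)]
theta = fit_weighted_logreg(X, y, w)
print("theta =", theta)
```

Down-weighted instances simply contribute less to the gradient, which is all that the α_i β_i factors do in the objective above.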
Using Dt,l

θ̂ = argmin_θ Σ_{i=1}^{N_{t,l}} L(x_i^{t,l}, y_i^{t,l}, θ)

Problem: small sample size, so the estimation is not accurate.
Using Dt,u

θ̂ = argmin_θ Σ_{i=1}^{N_{t,u}} Σ_{y∈Y} γ_i(y) L(x_i^{t,u}, y, θ)

where γ_i(y) = P_t(y | x_i^{t,u}): use predicted labels (bootstrapping).
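A toy sketch of this bootstrapping idea, with γ_i(y) taken from the current model's predictions; all data here is synthetic and the logistic-regression trainer is a bare-bones gradient-descent stand-in:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, w=None, lr=0.5, steps=2000):
    """Weighted binary logistic regression via gradient descent."""
    w = np.ones(len(y)) if w is None else w
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * X.T @ (w * (sigmoid(X @ theta) - y)) / len(y)
    return theta

rng = np.random.default_rng(1)
Xs = np.c_[np.ones(100), rng.normal(0.0, 1.0, 100)]   # labeled source
ys = (Xs[:, 1] > 0).astype(float)
Xu = np.c_[np.ones(100), rng.normal(0.5, 1.0, 100)]   # unlabeled target

theta = fit_logreg(Xs, ys)                 # initial model from Ds
for _ in range(3):                         # bootstrapping rounds
    gamma = sigmoid(Xu @ theta)            # gamma_i(1) = Pt(y=1 | x_i)
    X_all = np.vstack([Xs, Xu, Xu])        # each target point enters with
    y_all = np.r_[ys, np.ones(100), np.zeros(100)]    # both labels, weighted
    w_all = np.r_[np.ones(100), gamma, 1.0 - gamma]   # by gamma_i(y)
    theta = fit_logreg(X_all, y_all, w_all)
print("theta =", theta)
```

Each unlabeled target instance appears once per label y, weighted by γ_i(y), exactly matching the Σ_y γ_i(y) L(x, y, θ) term above.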
Combined Framework

θ̂ = argmin_θ [ (λ_s / C_s) Σ_{i=1}^{N_s} α_i β_i L(x_i^s, y_i^s, θ)
             + (λ_{t,l} / C_{t,l}) Σ_{i=1}^{N_{t,l}} L(x_i^{t,l}, y_i^{t,l}, θ)
             + (λ_{t,u} / C_{t,u}) Σ_{i=1}^{N_{t,u}} Σ_{y∈Y} γ_i(y) L(x_i^{t,u}, y, θ)
             + λ R(θ) ]

with λ_s + λ_{t,l} + λ_{t,u} = 1, where each C normalizes the total instance weight of its data set and R(θ) is a regularization term.

A flexible setup covering both standard methods and new domain adaptive methods.
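The combined objective can be written down directly. A sketch assuming binary labels and logistic loss, with all weights and data invented for illustration:

```python
import numpy as np

def log_loss(X, y, theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def combined_objective(theta, Ds, Dtl, Dtu, lambdas, reg=0.1):
    """lambda_s + lambda_tl + lambda_tu = 1 weight the three terms; each term
    is normalized by its own total instance weight, as in the framework."""
    (Xs, ys, ab), (Xl, yl), (Xu, gamma) = Ds, Dtl, Dtu
    lam_s, lam_tl, lam_tu = lambdas
    term_s = lam_s * np.sum(ab * log_loss(Xs, ys, theta)) / np.sum(ab)
    term_l = lam_tl * np.mean(log_loss(Xl, yl, theta))
    ones, zeros = np.ones(len(Xu)), np.zeros(len(Xu))
    term_u = lam_tu * np.mean(gamma * log_loss(Xu, ones, theta)
                              + (1.0 - gamma) * log_loss(Xu, zeros, theta))
    return term_s + term_l + term_u + reg * theta @ theta

theta = np.array([0.0, 1.0])
Ds  = (np.array([[1.0, 2.0], [1.0, -1.0]]), np.array([1.0, 0.0]), np.array([1.0, 0.5]))
Dtl = (np.array([[1.0, 0.5]]), np.array([1.0]))
Dtu = (np.array([[1.0, -0.5]]), np.array([0.3]))
print(combined_objective(theta, Ds, Dtl, Dtu, (0.5, 0.3, 0.2)))
```

Setting (λ_s, λ_{t,l}, λ_{t,u}) to (1, 0, 0) recovers standard supervised learning on Ds; other settings recover the variants on the preceding slides.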
Experiments
• NLP tasks
  – POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE)
  – NE type classification: newswire → conversational telephone speech (CTS) and web log (WL) (ACE 2005)
  – Spam filtering: public email collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
• Three heuristics to partially explore the parameter settings
Instance Pruning
Removing the top k “misleading” instances from Ds:

NE Type:
k      CTS      k      WL
0      0.7815   0      0.7045
1600   0.8640   1200   0.6975
3200   0.8825   2400   0.6795
all    0.8830   all    0.6600

POS:
k       Oncology
0       0.8630
8000    0.8709
16000   0.8714
all     0.8720

Spam:
k     User 1   User 2   User 3
0     0.6306   0.6950   0.7644
300   0.6611   0.7228   0.8222
600   0.7911   0.8322   0.8328
all   0.8106   0.8517   0.8067

Useful in most cases; failed in some cases (e.g. WL).
When is it guaranteed to work? (future work)
Dt,l with Larger Weights

NE Type:
method        CTS      WL
Ds            0.7815   0.7045
Ds + Dt,l     0.9340   0.7735
Ds + 5Dt,l    0.9360   0.7820
Ds + 10Dt,l   0.9355   0.7840

POS:
method        Oncology
Ds            0.8630
Ds + Dt,l     0.9349
Ds + 10Dt,l   0.9429
Ds + 20Dt,l   0.9443

Spam:
method        User 1   User 2   User 3
Ds            0.6306   0.6950   0.7644
Ds + Dt,l     0.9572   0.9572   0.9461
Ds + 5Dt,l    0.9628   0.9611   0.9601
Ds + 10Dt,l   0.9639   0.9628   0.9633

Dt,l is very useful; promoting Dt,l is even more useful.
Bootstrapping with Larger Weights
Promote bootstrapped target instances until Ds and Dt,u are balanced:

NE Type:
method               CTS      WL
supervised           0.7781   0.7351
standard bootstrap   0.8917   0.7498
balanced bootstrap   0.8923   0.7523

POS:
method               Oncology
supervised           0.8630
standard bootstrap   0.8728
balanced bootstrap   0.8750

Spam:
method               User 1   User 2   User 3
supervised           0.6476   0.6976   0.8068
standard bootstrap   0.8720   0.9212   0.9760
balanced bootstrap   0.8816   0.9256   0.9772

Promoting target instances is useful, even with predicted labels.
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Observation 1: Domain-specific features

wingless, daughterless, eyeless, apexless, …
• describe phenotypes in fly gene nomenclature
• the feature “-less” is useful for recognizing gene names in this organism

Is the feature still useful for other organisms, whose gene names look like CD38, PABPC5, …? No!
Observation 2: Generalizable features

…decapentaplegic and wingless are expressed in analogous patterns in each…
…that CD38 is expressed by both neurons and glial cells…
…that PABPC5 is expressed in fetal brain and in a range of adult tissues.

→ the feature “X be expressed” is useful across organisms
Assume Multiple Source Domains

[Diagram: multiple labeled source domains and one unlabeled target domain feed the Domain Adaptive Learning Algorithm]
Detour: Logistic Regression Classifiers

x: a vector of p binary features (e.g. “-less”, “X be expressed”), extracted from text such as “… and wingless are expressed in …”
w_y: a weight vector for class y

p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
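In code, this softmax form is a one-liner. A small Python/numpy sketch; the feature values and class weights below are hypothetical, loosely echoing the slide:

```python
import numpy as np

def p_y_given_x(x, W):
    """p(y | x; w) = exp(w_y . x) / sum_y' exp(w_y' . x), one row of W per class."""
    scores = W @ x
    scores = scores - scores.max()   # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

# Hypothetical binary features: [ends with "-less", "X be expressed", other]
x = np.array([1.0, 0.0, 1.0])
W = np.array([[0.2, 4.5, -0.3],      # weights for class 0 (not a gene name)
              [2.1, -0.9, 0.4]])     # weights for class 1 (gene name)
probs = p_y_given_x(x, W)
print(probs, probs.sum())
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but avoids overflow.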
Learning a Logistic Regression Classifier

ŵ = argmin_w [ λ ||w||² − (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

• the second term is the log likelihood of the training data
• λ ||w||² is a regularization term: it penalizes large weights and controls model complexity
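A compact sketch of minimizing this objective (L2-regularized multinomial logistic regression) by plain gradient descent in numpy, on synthetic binary features:

```python
import numpy as np

def train_logreg_l2(X, Y, n_classes, lam=0.01, lr=0.5, steps=1500):
    """Minimize  lam * ||W||^2  -  (1/N) sum_i log p(y_i | x_i; W)."""
    N, d = X.shape
    W = np.zeros((n_classes, d))
    for _ in range(steps):
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)
        P[np.arange(N), Y] -= 1.0            # gradient of -log p(y_i | x_i)
        W -= lr * (P.T @ X / N + 2.0 * lam * W)
    return W

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)  # synthetic binary features
Y = (X[:, 0] == 1).astype(int)                       # labels driven by feature 0
W = train_logreg_l2(X, Y, n_classes=2)
print("weight gap on feature 0:", W[1, 0] - W[0, 0])
```

The `2.0 * lam * W` term is the gradient of λ||w||²; without it, the weights on perfectly predictive features would grow without bound.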
Generalizable Features in Weight Vectors

[Figure: weight vectors w1, w2, …, wK learned separately from K source domains D1, D2, …, DK. Some features receive consistently high weights across all domains (generalizable features); others receive high weights only in particular domains (domain-specific features).]
Decomposition of wk for Each Source Domain

wk = A^T v + uk

• v: weights of the generalizable features, shared by all domains
• uk: the domain-specific part
• A: a 0/1 matrix that selects the generalizable features
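The decomposition is easy to sketch with a 0/1 selection matrix A. Dimensions and numbers below are made up, loosely echoing the slide:

```python
import numpy as np

# Suppose features 2 and 4 (of p = 6) were selected as generalizable.
p, gen = 6, [2, 4]
A = np.zeros((len(gen), p))
for row, f in enumerate(gen):
    A[row, f] = 1.0                   # A picks out the generalizable features

v = np.array([4.6, 3.2])              # shared weights (one per selected feature)
u_k = np.array([0.2, 4.5, 0.0, -0.3, 0.0, 2.1])   # domain-specific part

w_k = A.T @ v + u_k                   # w_k = A^T v + u_k
print(w_k)
```

A^T v scatters the shared weights back into the full feature space; u_k then adds whatever is particular to domain k.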
Framework for Generalization
Fix A, optimize:

(v̂, {ûk}) = argmin_{v, {uk}} [ ||v||² + λs Σ_{k=1}^K ||uk||² − (1/K) Σ_{k=1}^K (1/Nk) Σ_{i=1}^{Nk} log p(y_i^k | x_i^k; A, v, uk) ]

• the last term is the log likelihood of the labeled data from the K source domains
• λs >> 1: penalizes domain-specific features
Framework for Adaptation
Fix A, optimize:

(v̂, ût, {ûk}) = argmin_{v, ut, {uk}} [ ||v||² + λs Σ_{k=1}^K ||uk||² + λt ||ut||²
    − (1/K) Σ_{k=1}^K (1/Nk) Σ_{i=1}^{Nk} log p(y_i^k | x_i^k; A, v, uk)
    − (1/m) Σ_{i=1}^m log p(y_i^t | x_i^t; A, v, ut) ]

• the last term is the log likelihood of m target domain examples with predicted labels
• λt = 1 << λs: lets the model pick up domain-specific features in the target domain
How to Find A? (1)
• Joint optimization:

(Â, v̂, {ûk}) = argmin_{A, v, {uk}} [ ||v||² + λs Σ_{k=1}^K ||uk||² − (1/K) Σ_{k=1}^K (1/Nk) Σ_{i=1}^{Nk} log p(y_i^k | x_i^k; A, v, uk) ]
How to Find A? (2)
• Domain cross validation
  – Idea: train on (K – 1) source domains and validate on the held-out source domain
  – Approximation:
    • w_f^k: weight for feature f learned from domain k
    • w̄_f^k: weight for feature f learned from the other domains
    • rank features by Σ_{k=1}^K w_f^k · w̄_f^k
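A sketch of this ranking heuristic on hypothetical per-domain feature weights:

```python
import numpy as np

# Hypothetical weights: w[k, f] learned from domain k alone, and
# w_bar[k, f] learned from the other K-1 domains (held-out-domain training).
# Features: f = 0 is "X be expressed", f = 1 is "-less".
w     = np.array([[1.5, 0.05],
                  [1.8, 0.10],
                  [1.2, 2.00]])    # "-less" is strong only in the fly domain
w_bar = np.array([[1.6, 0.08],
                  [1.5, 0.90],
                  [1.7, 0.07]])

score = (w * w_bar).sum(axis=0)    # sum_k  w_f^k * w_bar_f^k  for each f
ranking = np.argsort(-score)
print(score, "ranking:", ranking)
```

A feature only scores high if it carries weight both inside a domain and outside it, which is exactly what generalizable features like “expressed” do and domain-specific features like “-less” do not.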
Intuition for Domain Cross Validation

[Figure: feature weights across domains D1, D2, …, Dk-1, Dk (fly). A feature like “expressed” gets a high weight whether it is learned from Dk or from the other domains (e.g. 1.5, 1.8, 2.0), while “-less” gets a high weight only when fly data is used for training (e.g. 1.2 vs. 0.05, 0.1). Features are ranked by the product of w1 and w2.]
Experiments
• Data set
  – BioCreative Challenge Task 1B
  – Gene/protein recognition
  – 3 organisms/domains: fly, mouse and yeast
• Experimental setup
  – 2 organisms for training, 1 for testing
  – F1 as performance measure
Experiments: Generalization
F: fly, M: mouse, Y: yeast

Method             F+M→Y   M+Y→F   Y+F→M
BL                 0.633   0.129   0.416
DA-1 (joint-opt)   0.627   0.153   0.425
DA-2 (domain CV)   0.654   0.195   0.470

• Using generalizable features is effective
• Domain cross validation is more effective than joint optimization
Experiments: Adaptation
F: fly, M: mouse, Y: yeast

Method      F+M→Y   M+Y→F   Y+F→M
BL-SSL      0.633   0.241   0.458
DA-2-SSL    0.759   0.305   0.501

Domain-adaptive bootstrapping is more effective than regular bootstrapping.
Related Work
• Problem relatively new to the NLP and ML communities
  – Most related work was developed concurrently with ours

Instances used     | Standard                 | Instance Weighting                  | Feature Selection | IW + FS
Ds                 | supervised learning      | Shimodaira 00                       | Blitzer et al. 06 | our future work
Ds + Dt,l          | supervised learning      | Daumé III & Marcus 06; Daumé III 07 |                   |
Ds + Dt,u          | semi-supervised learning | ACL’07                              | HLT’06, CIKM’07   |
Ds + Dt,l + Dt,u   | semi-supervised learning |                                     |                   |
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Summary
• Domain adaptation is a critical novel problem in natural language processing and machine learning
• Contributions
  – First systematic formal analysis of domain adaptation
  – Two novel general frameworks, both shown to be effective
  – Potentially applicable to other classification problems outside of NLP
• Future work
  – A domain difference measure
  – Unifying the two frameworks
  – Incorporating domain knowledge into the adaptation process
  – Leveraging domain adaptation to perform large-scale information extraction on scientific literature and on the Web
Information Extraction System

[Diagram: an information extraction system with entity recognition and relation extraction components, built on intelligent learning: domain adaptive learning from labeled data of related domains, exploitation of knowledge resources (existing knowledge bases), and interactive supervision by a domain expert]
Applications

[Diagram: biomedical literature (MEDLINE abstracts, full-text articles, etc.), e.g. “DWnt-2 is expressed in somatic cells of the gonad throughout development.”, is processed by the information extraction system (entity recognition, relation extraction) into extracted facts, e.g. an expression relation between gene DWnt-2 and tissue/position gonad; an inference engine then supports pathway construction, hypothesis generation, knowledge base curation, …]
Applications (cont.)
• Similar ideas apply to Web text mining, e.g. product reviews
  – Existing annotated reviews are limited (certain products from certain sources)
  – Large amounts of semi-structured reviews on review websites
  – Unstructured reviews on personal blogs
Selected Publications

This talk:
• J. Jiang & C. Zhai. “A two-stage approach to domain adaptation for statistical classifiers.” In CIKM’07.
• J. Jiang & C. Zhai. “Instance weighting for domain adaptation in NLP.” In ACL’07.
• J. Jiang & C. Zhai. “Exploiting domain structure for named entity recognition.” In HLT-NAACL’06.

Feature exploration for relation extraction:
• J. Jiang & C. Zhai. “A systematic exploration of the feature space for relation extraction.” In NAACL-HLT’07.

Information retrieval:
• J. Jiang & C. Zhai. “Extraction of coherent relevant passages using hidden Markov models.” ACM Transactions on Information Systems (TOIS), Jul 2006.
• J. Jiang & C. Zhai. “An empirical study of tokenization strategies for biomedical information retrieval.” Information Retrieval, Oct 2007.

Gene summarization:
• X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. “Generating semi-structured gene summaries from biomedical literature.” Information Processing & Management, Nov 2007.
• X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. “Automatically generating gene summaries from biomedical literature.” In PSB’06.