Domain Adaptation in Natural Language Processing
Jing Jiang
Department of Computer Science, University of Illinois at Urbana-Champaign
Jan 8, 2008
Textual Data in the Information Age

• Contains much useful information
  – E.g. >85% of corporate data is stored as text
• Hard to handle
  – Large amount: e.g. by 2002, 2.5 billion documents on the surface Web, growing by 7.3 million per day
  – Diversity: emails, news, digital libraries, Web logs, etc.
  – Unstructured: vs. relational databases

How to manage textual data?
• Information retrieval: to rank documents based on relevance to keyword queries
• Not always satisfactory
  – More sophisticated services desired
Automatic Text Summarization
Question Answering
Information Extraction

Company   Founder
…         …
Google    Larry Page
…         …
Beyond Information Retrieval
• Automatic text summarization
• Question answering
• Information extraction
• Sentiment analysis
• Machine translation
• Etc.
All of these rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text.
Typical NLP Tasks
“Larry Page was Google’s founding CEO”

• Part-of-speech tagging:
  Larry/noun Page/noun was/verb Google/noun ’s/possessive-end founding/adjective CEO/noun
• Chunking:
  [NP: Larry Page] [V: was] [NP: Google ’s founding CEO]
• Named entity recognition:
  [person: Larry Page] was [organization: Google] ’s founding CEO
• Relation extraction:
  Founder(Larry Page, Google)
• Word sense disambiguation:
  “Larry Page” vs. “Page 81”

State-of-the-art solution: supervised machine learning
Supervised Learning for NLP

Part-of-speech tagging on news articles:
representative corpus (WSJ articles) → human annotation → POS-tagged WSJ articles
(e.g. Larry/NNP Page/NNP was/VBD Google/NNP ’s/POS founding/ADJ CEO/NN)
→ training with a standard supervised learning algorithm → trained POS tagger
In Reality…

Part-of-speech tagging on biomedical articles:
representative corpus (MEDLINE articles) → human annotation → POS-tagged MEDLINE articles
(e.g. We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS)
→ training with a standard supervised learning algorithm → trained POS tagger

But human annotation is expensive: in reality, only POS-tagged WSJ articles are available for training.
Many Other Examples
• Named entity recognition
  – News articles → personal blogs
  – Organism A → organism B
• Spam filtering
  – Public email collection → personal inboxes
• Sentiment analysis of product reviews (positive vs. negative)
  – Movies → books
  – Cell phones → digital cameras

What is the problem with this non-standard setting, where the domains differ?
Domain Difference → Performance Degradation

• Ideal setting: train on MEDLINE → test on MEDLINE: POS tagger ~96%
• Realistic setting: train on WSJ → test on MEDLINE: POS tagger ~86%
Another Example

• Ideal setting: gene name recognizer: 54.1%
• Realistic setting: gene name recognizer: 28.1%
Domain Adaptation

source domain (labeled) + target domain (labeled and unlabeled) → Domain Adaptive Learning Algorithm

Goal: to design learning algorithms that are aware of the domain difference and exploit all available data to adapt to the target domain.
With Domain Adaptation Techniques…
• Standard learning: Fly + Mouse → Yeast gene name recognizer: 63.3%
• Domain adaptive learning: Fly + Mouse → Yeast gene name recognizer: 75.9%
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Overview
[The following slides each show a diagram contrasting the source domain and the target domain.]

Ideal Goal

Standard Supervised Learning

Standard Semi-Supervised Learning

Idea 1: Generalization

Idea 2: Adaptation

How to formally formulate the ideas?
Instance Weighting
[Diagram: source and target domains in the instance space (each point represents an observed instance)]

Goal: to find appropriate weights for different instances.
Feature Selection
[Diagram: source and target domains in the feature space (each point represents a useful feature)]

Goal: to separate generalizable features from domain-specific features.
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Observation
[Diagram: source domain vs. target domain instances]
Analysis of Domain Difference

x: observed instance; y: class label (to be predicted)

p(x, y) = p(x) p(y | x)

• Instance difference: ps(x) ≠ pt(x) → instance adaptation
• Labeling difference: ps(y | x) ≠ pt(y | x) → labeling adaptation
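The decomposition p(x, y) = p(x) p(y | x) and the instance difference ps(x) ≠ pt(x) can be checked numerically. A toy sketch in Python (the word/label data below are entirely hypothetical, purely for illustration):

```python
from collections import Counter

def estimate(pairs):
    """Estimate p(x, y), p(x) and p(y | x) from observed (x, y) pairs."""
    n = len(pairs)
    p_xy = {xy: c / n for xy, c in Counter(pairs).items()}
    p_x = {x: c / n for x, c in Counter(x for x, _ in pairs).items()}
    p_y_x = {(x, y): p / p_x[x] for (x, y), p in p_xy.items()}
    return p_xy, p_x, p_y_x

# Hypothetical domains: x is a word, y = 1 if it is part of a gene name.
source = [("page", 0)] * 6 + [("gene", 1)] * 4
target = [("page", 0)] * 2 + [("gene", 1)] * 8

p_xy_s, p_x_s, p_y_x_s = estimate(source)
p_xy_t, p_x_t, p_y_x_t = estimate(target)

# p(x, y) = p(x) p(y | x) holds exactly for the empirical estimates.
for (x, y), p in p_xy_s.items():
    assert abs(p - p_x_s[x] * p_y_x_s[(x, y)]) < 1e-12

# Instance difference: ps(x) != pt(x).
print("ps(gene) =", p_x_s["gene"], " pt(gene) =", p_x_t["gene"])
```

Here the two domains disagree on p(x) but not on p(y | x), i.e. a pure instance difference.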
Labeling Adaptation
[Diagram: source domain vs. target domain]

Where pt(y | x) ≠ ps(y | x): remove or demote those source instances.
Instance Adaptation (pt(x) < ps(x))
[Diagram: source domain vs. target domain]

Where pt(x) < ps(x): remove or demote those source instances.
Instance Adaptation (pt(x) > ps(x))
[Diagram: source domain vs. target domain]

Where pt(x) > ps(x): promote those instances.
• Target domain instances are especially useful here.
Empirical Risk Minimization with Three Sets of Instances

Data sets: Ds (labeled source), Dt,l (labeled target), Dt,u (unlabeled target)

Optimal classification model:

θ_t* = argmin_θ ∫_X Σ_{y∈Y} p_t(x) p_t(y | x) L(x, y, θ) dx

• L(x, y, θ): loss function
• the objective is the expected loss; in practice, the empirical loss replaces the expected loss
Using Ds

θ̂ = argmin_θ Σ_{i=1}^{N_s} α_i β_i L(x_i^s, y_i^s, θ)

• β_i ≈ p_t(x_i^s) / p_s(x_i^s): corrects the instance difference (hard to estimate for high-dimensional data)
• α_i ≈ p_t(y_i^s | x_i^s) / p_s(y_i^s | x_i^s): corrects the labeling difference (needs labeled target data)
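A minimal numerical sketch of this weighted objective, assuming logistic loss and plain gradient descent in numpy; the data and the weights α_i β_i are made up for illustration:

```python
import numpy as np

def fit_weighted_logreg(X, y, w, lr=0.5, steps=2000):
    """Minimize the weighted empirical risk sum_i w_i * L(x_i, y_i, theta)
    with logistic loss; labels y are in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * X.T @ (w * (p - y)) / len(y)
    return theta

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=200)]   # bias + one feature
y = (X[:, 1] > 0).astype(float)

# Hypothetical weights alpha_i * beta_i: down-weight the first half of the
# source sample, as if p_t(x)/p_s(x) had been estimated to be small there.
w = np.r_[np.full(100, 0.2), np.full(100, 1.0)]
theta = fit_weighted_logreg(X, y, w)
print("theta =", theta)
```

Down-weighted instances simply contribute less to the gradient, which is all that the α_i β_i factors do in the objective above.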
Using Dt,l

θ̂ = argmin_θ Σ_{i=1}^{N_{t,l}} L(x_i^{t,l}, y_i^{t,l}, θ)

Problem: small sample size, so the estimation is not accurate.
Using Dt,u

θ̂ = argmin_θ Σ_{i=1}^{N_{t,u}} Σ_{y∈Y} γ_i(y) L(x_i^{t,u}, y, θ)

where γ_i(y) = P_t(y | x_i^{t,u}): use predicted labels (bootstrapping).
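A toy sketch of this bootstrapping idea, with γ_i(y) taken from the current model's predictions; all data here is synthetic and the logistic-regression trainer is a bare-bones gradient-descent stand-in:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, w=None, lr=0.5, steps=2000):
    """Weighted binary logistic regression via gradient descent."""
    w = np.ones(len(y)) if w is None else w
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * X.T @ (w * (sigmoid(X @ theta) - y)) / len(y)
    return theta

rng = np.random.default_rng(1)
Xs = np.c_[np.ones(100), rng.normal(0.0, 1.0, 100)]   # labeled source
ys = (Xs[:, 1] > 0).astype(float)
Xu = np.c_[np.ones(100), rng.normal(0.5, 1.0, 100)]   # unlabeled target

theta = fit_logreg(Xs, ys)                 # initial model from Ds
for _ in range(3):                         # bootstrapping rounds
    gamma = sigmoid(Xu @ theta)            # gamma_i(1) = Pt(y=1 | x_i)
    X_all = np.vstack([Xs, Xu, Xu])        # each target point enters with
    y_all = np.r_[ys, np.ones(100), np.zeros(100)]    # both labels, weighted
    w_all = np.r_[np.ones(100), gamma, 1.0 - gamma]   # by gamma_i(y)
    theta = fit_logreg(X_all, y_all, w_all)
print("theta =", theta)
```

Each unlabeled target instance appears once per label y, weighted by γ_i(y), exactly matching the Σ_y γ_i(y) L(x, y, θ) term above.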
Combined Framework

θ̂ = argmin_θ [ (λ_s / C_s) Σ_{i=1}^{N_s} α_i β_i L(x_i^s, y_i^s, θ)
             + (λ_{t,l} / C_{t,l}) Σ_{i=1}^{N_{t,l}} L(x_i^{t,l}, y_i^{t,l}, θ)
             + (λ_{t,u} / C_{t,u}) Σ_{i=1}^{N_{t,u}} Σ_{y∈Y} γ_i(y) L(x_i^{t,u}, y, θ)
             + λ R(θ) ]

with λ_s + λ_{t,l} + λ_{t,u} = 1, where each C normalizes the total instance weight of its data set and R(θ) is a regularization term.

A flexible setup covering both standard methods and new domain adaptive methods.
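The combined objective can be written down directly. A sketch assuming binary labels and logistic loss, with all weights and data invented for illustration:

```python
import numpy as np

def log_loss(X, y, theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def combined_objective(theta, Ds, Dtl, Dtu, lambdas, reg=0.1):
    """lambda_s + lambda_tl + lambda_tu = 1 weight the three terms; each term
    is normalized by its own total instance weight, as in the framework."""
    (Xs, ys, ab), (Xl, yl), (Xu, gamma) = Ds, Dtl, Dtu
    lam_s, lam_tl, lam_tu = lambdas
    term_s = lam_s * np.sum(ab * log_loss(Xs, ys, theta)) / np.sum(ab)
    term_l = lam_tl * np.mean(log_loss(Xl, yl, theta))
    ones, zeros = np.ones(len(Xu)), np.zeros(len(Xu))
    term_u = lam_tu * np.mean(gamma * log_loss(Xu, ones, theta)
                              + (1.0 - gamma) * log_loss(Xu, zeros, theta))
    return term_s + term_l + term_u + reg * theta @ theta

theta = np.array([0.0, 1.0])
Ds  = (np.array([[1.0, 2.0], [1.0, -1.0]]), np.array([1.0, 0.0]), np.array([1.0, 0.5]))
Dtl = (np.array([[1.0, 0.5]]), np.array([1.0]))
Dtu = (np.array([[1.0, -0.5]]), np.array([0.3]))
print(combined_objective(theta, Ds, Dtl, Dtu, (0.5, 0.3, 0.2)))
```

Setting (λ_s, λ_{t,l}, λ_{t,u}) to (1, 0, 0) recovers standard supervised learning on Ds; other settings recover the variants on the preceding slides.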
Experiments
• NLP tasks
  – POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE)
  – NE type classification: newswire → conversational telephone speech (CTS) and web log (WL) (ACE 2005)
  – Spam filtering: public email collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
• Three heuristics to partially explore the parameter settings
Instance Pruning
Removing the top k “misleading” instances from Ds:

NE Type:
k      CTS      k      WL
0      0.7815   0      0.7045
1600   0.8640   1200   0.6975
3200   0.8825   2400   0.6795
all    0.8830   all    0.6600

POS:
k       Oncology
0       0.8630
8000    0.8709
16000   0.8714
all     0.8720

Spam:
k     User 1   User 2   User 3
0     0.6306   0.6950   0.7644
300   0.6611   0.7228   0.8222
600   0.7911   0.8322   0.8328
all   0.8106   0.8517   0.8067

Useful in most cases; failed in some cases (e.g. WL).
When is it guaranteed to work? (future work)
Dt,l with Larger Weights

NE Type:
method        CTS      WL
Ds            0.7815   0.7045
Ds + Dt,l     0.9340   0.7735
Ds + 5Dt,l    0.9360   0.7820
Ds + 10Dt,l   0.9355   0.7840

POS:
method        Oncology
Ds            0.8630
Ds + Dt,l     0.9349
Ds + 10Dt,l   0.9429
Ds + 20Dt,l   0.9443

Spam:
method        User 1   User 2   User 3
Ds            0.6306   0.6950   0.7644
Ds + Dt,l     0.9572   0.9572   0.9461
Ds + 5Dt,l    0.9628   0.9611   0.9601
Ds + 10Dt,l   0.9639   0.9628   0.9633

Dt,l is very useful; promoting Dt,l is even more useful.
Bootstrapping with Larger Weights
Promote bootstrapped target instances until Ds and Dt,u are balanced:

NE Type:
method               CTS      WL
supervised           0.7781   0.7351
standard bootstrap   0.8917   0.7498
balanced bootstrap   0.8923   0.7523

POS:
method               Oncology
supervised           0.8630
standard bootstrap   0.8728
balanced bootstrap   0.8750

Spam:
method               User 1   User 2   User 3
supervised           0.6476   0.6976   0.8068
standard bootstrap   0.8720   0.9212   0.9760
balanced bootstrap   0.8816   0.9256   0.9772

Promoting target instances is useful, even with predicted labels.
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Observation 1: Domain-specific features

wingless, daughterless, eyeless, apexless, …
• describe phenotypes in fly gene nomenclature
• the feature “-less” is useful for recognizing gene names in this organism

Is the feature still useful for other organisms, whose gene names look like CD38, PABPC5, …? No!
Observation 2: Generalizable features

…decapentaplegic and wingless are expressed in analogous patterns in each…
…that CD38 is expressed by both neurons and glial cells…
…that PABPC5 is expressed in fetal brain and in a range of adult tissues.

→ the feature “X be expressed” is useful across organisms
Assume Multiple Source Domains

[Diagram: multiple labeled source domains and one unlabeled target domain feed the Domain Adaptive Learning Algorithm]
Detour: Logistic Regression Classifiers

x: a vector of p binary features (e.g. “-less”, “X be expressed”), extracted from text such as “… and wingless are expressed in …”
w_y: a weight vector for class y

p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
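In code, this softmax form is a one-liner. A small Python/numpy sketch; the feature values and class weights below are hypothetical, loosely echoing the slide:

```python
import numpy as np

def p_y_given_x(x, W):
    """p(y | x; w) = exp(w_y . x) / sum_y' exp(w_y' . x), one row of W per class."""
    scores = W @ x
    scores = scores - scores.max()   # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

# Hypothetical binary features: [ends with "-less", "X be expressed", other]
x = np.array([1.0, 0.0, 1.0])
W = np.array([[0.2, 4.5, -0.3],      # weights for class 0 (not a gene name)
              [2.1, -0.9, 0.4]])     # weights for class 1 (gene name)
probs = p_y_given_x(x, W)
print(probs, probs.sum())
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but avoids overflow.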
Learning a Logistic Regression Classifier

ŵ = argmin_w [ λ ||w||² − (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

• the second term is the log likelihood of the training data
• λ ||w||² is a regularization term: it penalizes large weights and controls model complexity
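A compact sketch of minimizing this objective (L2-regularized multinomial logistic regression) by plain gradient descent in numpy, on synthetic binary features:

```python
import numpy as np

def train_logreg_l2(X, Y, n_classes, lam=0.01, lr=0.5, steps=1500):
    """Minimize  lam * ||W||^2  -  (1/N) sum_i log p(y_i | x_i; W)."""
    N, d = X.shape
    W = np.zeros((n_classes, d))
    for _ in range(steps):
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)
        P[np.arange(N), Y] -= 1.0            # gradient of -log p(y_i | x_i)
        W -= lr * (P.T @ X / N + 2.0 * lam * W)
    return W

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)  # synthetic binary features
Y = (X[:, 0] == 1).astype(int)                       # labels driven by feature 0
W = train_logreg_l2(X, Y, n_classes=2)
print("weight gap on feature 0:", W[1, 0] - W[0, 0])
```

The `2.0 * lam * W` term is the gradient of λ||w||²; without it, the weights on perfectly predictive features would grow without bound.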
Generalizable Features in Weight Vectors

[Figure: weight vectors w1, w2, …, wK learned separately from K source domains D1, D2, …, DK. Some features receive consistently high weights across all domains (generalizable features); others receive high weights only in particular domains (domain-specific features).]
Decomposition of wk for Each Source Domain

wk = A^T v + uk

• v: weights of the generalizable features, shared by all domains
• uk: the domain-specific part
• A: a 0/1 matrix that selects the generalizable features
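The decomposition is easy to sketch with a 0/1 selection matrix A. Dimensions and numbers below are made up, loosely echoing the slide:

```python
import numpy as np

# Suppose features 2 and 4 (of p = 6) were selected as generalizable.
p, gen = 6, [2, 4]
A = np.zeros((len(gen), p))
for row, f in enumerate(gen):
    A[row, f] = 1.0                   # A picks out the generalizable features

v = np.array([4.6, 3.2])              # shared weights (one per selected feature)
u_k = np.array([0.2, 4.5, 0.0, -0.3, 0.0, 2.1])   # domain-specific part

w_k = A.T @ v + u_k                   # w_k = A^T v + u_k
print(w_k)
```

A^T v scatters the shared weights back into the full feature space; u_k then adds whatever is particular to domain k.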
Framework for Generalization
Fix A, optimize:

(v̂, {ûk}) = argmin_{v, {uk}} [ ||v||² + λs Σ_{k=1}^K ||uk||² − (1/K) Σ_{k=1}^K (1/Nk) Σ_{i=1}^{Nk} log p(y_i^k | x_i^k; A, v, uk) ]

• the last term is the log likelihood of the labeled data from the K source domains
• λs >> 1: penalizes domain-specific features
Framework for Adaptation
Fix A, optimize:

(v̂, ût, {ûk}) = argmin_{v, ut, {uk}} [ ||v||² + λs Σ_{k=1}^K ||uk||² + λt ||ut||²
    − (1/K) Σ_{k=1}^K (1/Nk) Σ_{i=1}^{Nk} log p(y_i^k | x_i^k; A, v, uk)
    − (1/m) Σ_{i=1}^m log p(y_i^t | x_i^t; A, v, ut) ]

• the last term is the log likelihood of m target domain examples with predicted labels
• λt = 1 << λs: lets the model pick up domain-specific features in the target domain
How to Find A? (1)
• Joint optimization:

(Â, v̂, {ûk}) = argmin_{A, v, {uk}} [ ||v||² + λs Σ_{k=1}^K ||uk||² − (1/K) Σ_{k=1}^K (1/Nk) Σ_{i=1}^{Nk} log p(y_i^k | x_i^k; A, v, uk) ]
How to Find A? (2)
• Domain cross validation
  – Idea: train on (K – 1) source domains and validate on the held-out source domain
  – Approximation:
    • w_f^k: weight for feature f learned from domain k
    • w̄_f^k: weight for feature f learned from the other domains
    • rank features by Σ_{k=1}^K w_f^k · w̄_f^k
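A sketch of this ranking heuristic on hypothetical per-domain feature weights:

```python
import numpy as np

# Hypothetical weights: w[k, f] learned from domain k alone, and
# w_bar[k, f] learned from the other K-1 domains (held-out-domain training).
# Features: f = 0 is "X be expressed", f = 1 is "-less".
w     = np.array([[1.5, 0.05],
                  [1.8, 0.10],
                  [1.2, 2.00]])    # "-less" is strong only in the fly domain
w_bar = np.array([[1.6, 0.08],
                  [1.5, 0.90],
                  [1.7, 0.07]])

score = (w * w_bar).sum(axis=0)    # sum_k  w_f^k * w_bar_f^k  for each f
ranking = np.argsort(-score)
print(score, "ranking:", ranking)
```

A feature only scores high if it carries weight both inside a domain and outside it, which is exactly what generalizable features like “expressed” do and domain-specific features like “-less” do not.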
Intuition for Domain Cross Validation

[Figure: feature weights across domains D1, D2, …, Dk-1, Dk (fly). A feature like “expressed” gets a high weight whether it is learned from Dk or from the other domains (e.g. 1.5, 1.8, 2.0), while “-less” gets a high weight only when fly data is used for training (e.g. 1.2 vs. 0.05, 0.1). Features are ranked by the product of w1 and w2.]
Experiments
• Data set
  – BioCreative Challenge Task 1B
  – Gene/protein recognition
  – 3 organisms/domains: fly, mouse and yeast
• Experimental setup
  – 2 organisms for training, 1 for testing
  – F1 as performance measure
Experiments: Generalization
F: fly, M: mouse, Y: yeast

Method             F+M→Y   M+Y→F   Y+F→M
BL                 0.633   0.129   0.416
DA-1 (joint-opt)   0.627   0.153   0.425
DA-2 (domain CV)   0.654   0.195   0.470

• Using generalizable features is effective
• Domain cross validation is more effective than joint optimization
Experiments: Adaptation
F: fly, M: mouse, Y: yeast

Method      F+M→Y   M+Y→F   Y+F→M
BL-SSL      0.633   0.241   0.458
DA-2-SSL    0.759   0.305   0.501

Domain-adaptive bootstrapping is more effective than regular bootstrapping.
Related Work
• Problem relatively new to the NLP and ML communities
  – Most related work was developed concurrently with ours

Instances used     | Standard                 | Instance Weighting                  | Feature Selection | IW + FS
Ds                 | supervised learning      | Shimodaira 00                       | Blitzer et al. 06 | our future work
Ds + Dt,l          | supervised learning      | Daumé III & Marcus 06; Daumé III 07 |                   |
Ds + Dt,u          | semi-supervised learning | ACL’07                              | HLT’06, CIKM’07   |
Ds + Dt,l + Dt,u   | semi-supervised learning |                                     |                   |
Roadmap
• What is domain adaptation in NLP?
• Our work
  – Overview
  – Instance weighting
  – Feature selection
• Summary and future work
Summary
• Domain adaptation is a critical novel problem in natural language processing and machine learning
• Contributions
  – First systematic formal analysis of domain adaptation
  – Two novel general frameworks, both shown to be effective
  – Potentially applicable to other classification problems outside of NLP
• Future work
  – A domain difference measure
  – Unifying the two frameworks
  – Incorporating domain knowledge into the adaptation process
  – Leveraging domain adaptation to perform large-scale information extraction on scientific literature and on the Web
Information Extraction System

[Diagram: an information extraction system with entity recognition and relation extraction components, built on intelligent learning: domain adaptive learning from labeled data of related domains, exploitation of knowledge resources (existing knowledge bases), and interactive supervision by a domain expert]
Applications

[Diagram: biomedical literature (MEDLINE abstracts, full-text articles, etc.), e.g. “DWnt-2 is expressed in somatic cells of the gonad throughout development.”, is processed by the information extraction system (entity recognition, relation extraction) into extracted facts, e.g. an expression relation between gene DWnt-2 and tissue/position gonad; an inference engine then supports pathway construction, hypothesis generation, knowledge base curation, …]
Applications (cont.)
• Similar ideas apply to Web text mining, e.g. product reviews
  – Existing annotated reviews are limited (certain products from certain sources)
  – Large amounts of semi-structured reviews on review websites
  – Unstructured reviews on personal blogs
Selected Publications

This talk:
• J. Jiang & C. Zhai. “A two-stage approach to domain adaptation for statistical classifiers.” In CIKM’07.
• J. Jiang & C. Zhai. “Instance weighting for domain adaptation in NLP.” In ACL’07.
• J. Jiang & C. Zhai. “Exploiting domain structure for named entity recognition.” In HLT-NAACL’06.

Feature exploration for relation extraction:
• J. Jiang & C. Zhai. “A systematic exploration of the feature space for relation extraction.” In NAACL-HLT’07.

Information retrieval:
• J. Jiang & C. Zhai. “Extraction of coherent relevant passages using hidden Markov models.” ACM Transactions on Information Systems (TOIS), Jul 2006.
• J. Jiang & C. Zhai. “An empirical study of tokenization strategies for biomedical information retrieval.” Information Retrieval, Oct 2007.

Gene summarization:
• X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. “Generating semi-structured gene summaries from biomedical literature.” Information Processing & Management, Nov 2007.
• X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. “Automatically generating gene summaries from biomedical literature.” In PSB’06.