1
Improving Supervised Classification
using Confidence Weighted Learning
Koby Crammer
Joint work with
Mark Dredze, Alex Kulesza and Fernando Pereira
Workshop in Machine Learning, The EE Department, Technion, January 20, 2010
2
Linear Classifiers
Input: instance to be classified
Weight vector of the classifier
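As a concrete illustration (a minimal sketch, not from the talk), the prediction rule of a linear classifier is just the sign of the inner product between the weight vector and the instance:

```python
import numpy as np

def predict(w, x):
    """Linear classifier: the label is the sign of the inner product w.x."""
    return 1 if np.dot(w, x) >= 0 else -1

# Illustrative values: two features, the first with a positive weight.
w = np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(predict(w, x))  # 2*1 - 0.5*3 = 0.5 >= 0, so the prediction is +1
```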
3
Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with target label
• Linear classifiers: features are associated with word-counts
• Heavy-tailed feature distribution
[Figure: feature counts vs. feature rank, showing a heavy-tailed distribution]
4
Sentiment Classification
• Who needs this Simpsons book? You DOOOOOOOO! This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!
Pang, Lee, Vaithyanathan, EMNLP 2002
Online Learning
• Maintain model M
• Get instance x
• Predict label ŷ = M(x)
• Get true label y
• Suffer loss l(y, ŷ)
• Update model M
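The loop above maps directly to code; a minimal sketch (the Model interface with predict/loss/update methods is hypothetical):

```python
def online_learning(model, stream):
    """Generic online learning loop: predict, observe the true label,
    suffer the loss, and update the model one example at a time."""
    total_loss = 0.0
    for x, y_true in stream:                      # get instance x
        y_pred = model.predict(x)                 # predict label y_hat = M(x)
        total_loss += model.loss(y_true, y_pred)  # suffer loss l(y, y_hat)
        model.update(x, y_true)                   # update model M
    return total_loss
```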
6
Sentiment Classification
• Many positive reviews with the word best increase the weight w_best
• Later, a negative review – "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But best appeared more often than boring
• Better to reduce the two weights at different rates: more for w_boring, less for w_best
7
Linear Model → Distribution over Linear Models
[Figure: an example instance, a distribution over weight vectors, and its mean weight-vector]
8
New Prediction Models
• Gaussian distributions over weight vectors
• The covariance is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance
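A minimal sketch of such a model state, assuming the diagonal covariance is stored as a per-feature variance vector (class and attribute names are illustrative, not from the papers):

```python
import numpy as np

class GaussianLinearModel:
    """Distribution over weight vectors: N(mean, diag(variance)).
    Prediction uses the mean weight vector; the per-feature variance
    encodes how confident the learner is about each weight."""

    def __init__(self, n_features, initial_variance=1.0):
        self.mean = np.zeros(n_features)
        self.variance = np.full(n_features, initial_variance)

    def predict(self, x):
        return 1 if np.dot(self.mean, x) >= 0 else -1

    def margin_variance(self, x):
        # Variance of the margin w.x when w ~ N(mean, diag(variance)).
        return np.dot(self.variance, x * x)
```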
9
Weight Vector (Version) Space
The algorithm forces most of the probability mass of the weight vector to reside in this region
10
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly
11
Aggressive Step
The mean is moved beyond the mistake line (large margin)
The covariance is shrunk in the direction of the input example
The algorithm projects the current Gaussian distribution onto the half-space
12
The Update
• Projection update:
• Can be solved analytically
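The projection the slide refers to can be sketched as follows (this follows the confidence-weighted formulation of Dredze, Crammer & Pereira, EMNLP 2008; η is the required confidence level, Φ the standard Gaussian CDF, and the exact closed form of the step size is in the paper):

$$
(\mu_{i+1}, \Sigma_{i+1}) = \arg\min_{\mu, \Sigma} \; \mathrm{D}_{\mathrm{KL}}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_i, \Sigma_i)\big)
\quad \text{s.t.} \quad \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\, y_i \,(w \cdot x_i) \ge 0 \,\big] \ge \eta
$$

The constraint can be rewritten as $y_i (\mu \cdot x_i) \ge \phi \sqrt{x_i^{\top} \Sigma x_i}$ with $\phi = \Phi^{-1}(\eta)$; solving yields a mean update of the form $\mu_{i+1} = \mu_i + \alpha\, y_i\, \Sigma_i x_i$, with the covariance shrinking along the direction of $x_i$.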
13
Synthetic Data
• 20 features
• 2 informative (rotated, skewed Gaussian)
• 18 noisy
• Using a single feature is as good as random prediction
14
Synthetic Data (cont'd)
[Figure: the learned distribution after 50 examples (feature x1)]
15
Synthetic Data (results)
[Figure: learning curves comparing Perceptron, PA, 2nd-order, CW-full, and CW-diag]
16
Data
• Binary document classification
  – Sentiment reviews: 6 Amazon domains (Blitzer et al.)
  – Reuters (RCV1): 3 pairs of labels
  – 20 Newsgroups: 3 pairs of labels
• About 2,000 instances per dataset
• Bag-of-words representation
• 10-fold cross-validation; 5 epochs
17
Results vs Batch - Sentiment
• Always better than the batch methods
• 3/6 significantly better
18
Results vs Batch - 20NG + Reuters
• 5/6 better than the batch methods
• 3/5 significantly better, 1/1 significantly worse
19
Parallel Training
• Split the large dataset into disjoint sets
• Train on each set independently
• Combine the resulting classifiers:
  – Average performance of the individual classifiers
  – Uniform mean of the linear weights
  – Weighted mean of the linear weights using the confidence information (sketched below)
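A minimal sketch of the two averaging rules, assuming each per-split CW classifier exposes a mean weight vector and a per-feature variance; weighting each coordinate by its inverse variance is one natural way to use the confidence information (the paper's exact weighting may differ):

```python
import numpy as np

def uniform_combine(means):
    """Uniform mean of the per-split linear weights."""
    return np.mean(means, axis=0)

def confidence_combine(means, variances):
    """Per-feature weighted mean of the per-split weights, weighting each
    split's weight by its inverse variance (higher confidence, larger weight)."""
    precisions = [1.0 / v for v in variances]
    total = np.sum(precisions, axis=0)
    weighted = np.sum([p * m for p, m in zip(precisions, means)], axis=0)
    return weighted / total
```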
20
Parallel Training
• Data size: Sentiment ~1M; Reuters ~0.8M
• #Features / #Docs: Sentiment ~13; Reuters ~0.35
• Performance degrades with the number of splits
• Weighting improves performance
[Figure: accuracy vs. number of splits, compared to the baseline (CW)]
21
Multi-Class Update
• Multiple constraints per instance
• Approximate using a single constraint
[Figure: constraints for the labels and their approximation]
Crammer, Dredze, Kulesza. EMNLP 2008
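In symbols (my paraphrase, not taken verbatim from the paper; $f(x, r)$ denotes the joint feature vector of instance $x$ with label $r$), the exact update would enforce one confidence constraint per competing label,

$$
\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\, w \cdot f(x_i, y_i) \ge w \cdot f(x_i, r) \,\big] \ge \eta \qquad \text{for all } r \ne y_i,
$$

while the single-constraint approximation keeps only the constraint for the currently highest-scoring wrong label $\hat r = \arg\max_{r \ne y_i} \mu_i \cdot f(x_i, r)$.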
22
Evaluation Setup
• Nine multi-class datasets
Task         Instances  Features  Labels  Balanced
20 News      18,828     252,115   20      Yes
Amazon 7     13,580     686,724   7       Yes
Amazon 3     7,000      494,481   3       Yes
Enron A      3,000      13,559    10      No
Enron B      3,000      18,065    10      No
NYT Desk     10,000     108,671   26      No
NYT Online   10,000     108,671   34      No
NYT Section  10,000     108,671   20      No
Reuters      4,000      23,699    4       No
Crammer, Dredze, Kulesza. EMNLP 2008
23
Evaluation
Better than all baselines (online and batch): 8 of 9 datasets
Crammer, Dredze, Kulesza. EMNLP 2008
24
20 Newsgroups
Better than all online baselines: 8 of 9 datasets
Crammer, Dredze, Kulesza. EMNLP 2008
25
Multi-Domain Learning
• Task: sentiment classification
• Goal: classify reviews that come from different domains
  – Electronics
  – Books
  – Movies
  – Kitchen Appliances
• Challenge: domains differ
  – Domains use different features
  – Domains may behave differently towards features
Blitzer, Dredze, Pereira, ACL 2007
Dredze, Kulesza, Crammer. MLJ 2009
26
Differing Feature Behaviors
• Share similar behaviors across domains
• Learn domain-specific behaviors
• Shared parameters: used for every domain
• Domain parameters: separate parameters for every domain
27
Combining Domain Parameters
[Figure: the combined weight vector w_combined is built from the shared weights and the domain-specific weights]
28
Classifier Combination
• A CW classifier is a distribution over weight vectors
[Figure: individual classifiers are combined into a single combined classifier via weighting]
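A minimal sketch of predicting with a combined classifier built from a shared CW classifier and a domain-specific one, weighting each coordinate by its confidence (inverse variance); this illustrates the idea and is not necessarily the exact rule from Dredze, Kulesza & Crammer (MLJ 2009):

```python
import numpy as np

def combined_predict(shared_mean, shared_var, domain_mean, domain_var, x):
    """Per-feature combination of shared and domain-specific weights,
    trusting whichever parameter has the lower variance, then predicting
    with the combined weight vector."""
    shared_conf = 1.0 / shared_var
    domain_conf = 1.0 / domain_var
    w_combined = (shared_conf * shared_mean + domain_conf * domain_mean) / (
        shared_conf + domain_conf
    )
    return 1 if np.dot(w_combined, x) >= 0 else -1
```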
29
Multi-Domain Regularization
• Combined classifier for prediction and updates
  – Based on Evgeniou and Pontil, KDD 2004
• Passive-aggressive update rule
  – Find the shared model and the individual model closest to the current corresponding models (smallest parameter change)
  – Such that their combination classifies the current example correctly
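Written out, the update amounts to the following optimization (a sketch; slack terms and the exact margin requirement follow the paper), where $w_s$ is the shared model and $w_d$ the model of the example's domain:

$$
(w_s^{\mathrm{new}}, w_d^{\mathrm{new}}) = \arg\min_{w_s, w_d}\; \|w_s - w_s^{\mathrm{old}}\|^2 + \|w_d - w_d^{\mathrm{old}}\|^2
\quad \text{s.t.} \quad y \,\big( (w_s + w_d) \cdot x \big) \ge 1
$$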
30
Evaluation on Sentiment
• Sentiment classification
  – Rate product reviews: positive/negative
• 4 datasets
  – All: 7 Amazon product types
  – Books: different rating thresholds
  – DVDs: different rating thresholds
  – Books+DVDs
• 1,500 train, 100 test per domain
31
Results
[Figure: test error (smaller is better) for Single, Separate, and MDR on Books, DVD, Books+DVD, and All; 10-fold CV, one pass of online training; Books, DVDs, Books+DVDs: p = .001]
32
Active Learning
Dredze & Crammer, ACL 2008
• Start with a pool of unlabeled examples
• Use a few labeled examples to choose an initial hypothesis
• Iterative algorithm:
  – Use the current classifier to pick an example to be labeled
  – Train using all labeled examples
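A minimal sketch of this pool-based loop (the selection strategy, oracle, and the model's train method are placeholders; concrete selection rules appear on the "Picking the Next Example" slide below):

```python
def active_learning(model, pool, oracle, select_index, budget):
    """Pool-based active learning: repeatedly pick an unlabeled example,
    query its label, and retrain on all labeled examples gathered so far."""
    labeled = []
    for _ in range(budget):
        i = select_index(model, pool)   # use the current classifier to pick
        x = pool.pop(i)                 # an example to be labeled
        labeled.append((x, oracle(x)))  # query the true label
        model.train(labeled)            # train using all labeled examples
    return model
```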
39
Picking the Next Example
• Random
• Linear Classifiers
  – Example with the lowest margin
• Active Confidence Learning
  – Example with the least confidence
  – Equivalent to the lowest normalized margin
Dredze & Crammer, ACL 2008
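The three strategies can be written as index-selection rules over the unlabeled pool; a sketch reusing the mean/variance representation from earlier (the CW rule divides the margin by its standard deviation under the weight distribution, which is one way to read "least confidence"):

```python
import random
import numpy as np

def random_index(model, pool):
    return random.randrange(len(pool))

def min_margin_index(model, pool):
    # Linear-classifier heuristic: query the example with the smallest |w.x|.
    return int(np.argmin([abs(np.dot(model.mean, x)) for x in pool]))

def min_confidence_index(model, pool):
    # CW heuristic: smallest margin normalized by its standard deviation
    # under w ~ N(mean, diag(variance)).
    scores = [abs(np.dot(model.mean, x)) / np.sqrt(np.dot(model.variance, x * x))
              for x in pool]
    return int(np.argmin(scores))
```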
40
Active Learning
• 13 datasets: Sentiment (4), 20NG (3), Reuters (3), SPAM (3)
Dredze & Crammer, ACL 2008
41
Active Learning
• Number of labels needed by CW Margin and ACL to reach 80% of the accuracy of training with all of the data
Dredze & Crammer, ACL 2008
42
Summary
• Online training is fast and effective …
• … but NLP data has a heavy-tailed feature distribution
• New model:
  – Add feature confidence parameters
• Benefits:
  – Better than state-of-the-art training algorithms for linear classifiers
  – Converges faster
  – Theoretical guarantees
  – Allows better combination of models trained in parallel and better active learning
  – Better domain adaptation