1
Improving Supervised Classification
using Confidence Weighted Learning
Koby Crammer
Joint work with
Mark Dredze, Alex Kulesza and Fernando Pereira
Workshop in Machine Learning, The EE Department, Technion, January 20, 2010
2
Linear Classifiers
Input: instance to be classified
Weight vector of the classifier
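As a concrete illustration (a minimal sketch, not from the talk), the prediction rule of a linear classifier is just the sign of the inner product between the weight vector and the instance:

```python
import numpy as np

def predict(w, x):
    """Linear classifier: the label is the sign of the inner product w.x."""
    return 1 if np.dot(w, x) >= 0 else -1

# Illustrative values: two features, the first with a positive weight.
w = np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(predict(w, x))  # 2*1 - 0.5*3 = 0.5 >= 0, so the prediction is +1
```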
3
Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with target label
• Linear classifiers: features are associated with word-counts
• Heavy-tailed feature distribution
[Figure: feature counts vs. feature rank, showing a heavy-tailed distribution]
4
Sentiment Classification
• Who needs this Simpsons book? You DOOOOOOOO! This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!
Pang, Lee, Vaithyanathan, EMNLP 2002
Online Learning
• Maintain model M
• Get instance x
• Predict label ŷ = M(x)
• Get true label y
• Suffer loss l(y, ŷ)
• Update model M
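The loop above maps directly to code; a minimal sketch (the Model interface with predict/loss/update methods is hypothetical):

```python
def online_learning(model, stream):
    """Generic online learning loop: predict, observe the true label,
    suffer the loss, and update the model one example at a time."""
    total_loss = 0.0
    for x, y_true in stream:                      # get instance x
        y_pred = model.predict(x)                 # predict label y_hat = M(x)
        total_loss += model.loss(y_true, y_pred)  # suffer loss l(y, y_hat)
        model.update(x, y_true)                   # update model M
    return total_loss
```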
6
Sentiment Classification
• Many positive reviews with the word best increase the weight w_best
• Later, a negative review – "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But best appeared more often than boring
• Better to reduce the two weights at different rates: more for w_boring, less for w_best
7
Linear Model → Distribution over Linear Models
[Figure: an example instance, a distribution over weight vectors, and its mean weight-vector]
8
New Prediction Models
• Gaussian distributions over weight vectors
• The covariance is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance
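A minimal sketch of such a model state, assuming the diagonal covariance is stored as a per-feature variance vector (class and attribute names are illustrative, not from the papers):

```python
import numpy as np

class GaussianLinearModel:
    """Distribution over weight vectors: N(mean, diag(variance)).
    Prediction uses the mean weight vector; the per-feature variance
    encodes how confident the learner is about each weight."""

    def __init__(self, n_features, initial_variance=1.0):
        self.mean = np.zeros(n_features)
        self.variance = np.full(n_features, initial_variance)

    def predict(self, x):
        return 1 if np.dot(self.mean, x) >= 0 else -1

    def margin_variance(self, x):
        # Variance of the margin w.x when w ~ N(mean, diag(variance)).
        return np.dot(self.variance, x * x)
```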
9
Weight Vector (Version) Space
The algorithm forces most of the probability mass of the weight vector to reside in this region
10
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly
11
Aggressive Step
The mean is moved beyond the mistake line (large margin)
The covariance is shrunk in the direction of the input example
The algorithm projects the current Gaussian distribution onto the half-space
12
The Update
• Projection update:
• Can be solved analytically
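The projection the slide refers to can be sketched as follows (this follows the confidence-weighted formulation of Dredze, Crammer & Pereira, EMNLP 2008; η is the required confidence level, Φ the standard Gaussian CDF, and the exact closed form of the step size is in the paper):

$$
(\mu_{i+1}, \Sigma_{i+1}) = \arg\min_{\mu, \Sigma} \; \mathrm{D}_{\mathrm{KL}}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_i, \Sigma_i)\big)
\quad \text{s.t.} \quad \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\, y_i \,(w \cdot x_i) \ge 0 \,\big] \ge \eta
$$

The constraint can be rewritten as $y_i (\mu \cdot x_i) \ge \phi \sqrt{x_i^{\top} \Sigma x_i}$ with $\phi = \Phi^{-1}(\eta)$; solving yields a mean update of the form $\mu_{i+1} = \mu_i + \alpha\, y_i\, \Sigma_i x_i$, with the covariance shrinking along the direction of $x_i$.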
13
Synthetic Data
• 20 features
• 2 informative (rotated, skewed Gaussian)
• 18 noisy
• Using a single feature is as good as random prediction
14
Synthetic Data (cont'd)
[Figure: the learned distribution after 50 examples (feature x1)]
15
Synthetic Data (results)
[Figure: learning curves comparing Perceptron, PA, 2nd-order, CW-full, and CW-diag]
16
Data
• Binary document classification
  – Sentiment reviews: 6 Amazon domains (Blitzer et al.)
  – Reuters (RCV1): 3 pairs of labels
  – 20 Newsgroups: 3 pairs of labels
• About 2,000 instances per dataset
• Bag-of-words representation
• 10-fold cross-validation; 5 epochs
17
Results vs Batch - Sentiment
• Always better than the batch methods
• 3/6 significantly better
18
Results vs Batch - 20NG + Reuters
• 5/6 better than the batch methods
• 3/5 significantly better, 1/1 significantly worse
19
Parallel Training
• Split the large dataset into disjoint sets
• Train on each set independently
• Combine the resulting classifiers:
  – Average performance of the individual classifiers
  – Uniform mean of the linear weights
  – Weighted mean of the linear weights using the confidence information (sketched below)
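A minimal sketch of the two averaging rules, assuming each per-split CW classifier exposes a mean weight vector and a per-feature variance; weighting each coordinate by its inverse variance is one natural way to use the confidence information (the paper's exact weighting may differ):

```python
import numpy as np

def uniform_combine(means):
    """Uniform mean of the per-split linear weights."""
    return np.mean(means, axis=0)

def confidence_combine(means, variances):
    """Per-feature weighted mean of the per-split weights, weighting each
    split's weight by its inverse variance (higher confidence, larger weight)."""
    precisions = [1.0 / v for v in variances]
    total = np.sum(precisions, axis=0)
    weighted = np.sum([p * m for p, m in zip(precisions, means)], axis=0)
    return weighted / total
```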
20
Parallel Training
• Data size: Sentiment ~1M; Reuters ~0.8M
• #Features / #Docs: Sentiment ~13; Reuters ~0.35
• Performance degrades with the number of splits
• Weighting improves performance
[Figure: accuracy vs. number of splits, compared to the baseline (CW)]
21
Multi-Class Update
• Multiple constraints per instance
• Approximate using a single constraint
[Figure: constraints for the labels and their approximation]
Crammer, Dredze, Kulesza. EMNLP 2008
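In symbols (my paraphrase, not taken verbatim from the paper; $f(x, r)$ denotes the joint feature vector of instance $x$ with label $r$), the exact update would enforce one confidence constraint per competing label,

$$
\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\, w \cdot f(x_i, y_i) \ge w \cdot f(x_i, r) \,\big] \ge \eta \qquad \text{for all } r \ne y_i,
$$

while the single-constraint approximation keeps only the constraint for the currently highest-scoring wrong label $\hat r = \arg\max_{r \ne y_i} \mu_i \cdot f(x_i, r)$.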
22
Evaluation Setup
• Nine multi-class datasets
Task         Instances  Features  Labels  Balanced
20 News      18,828     252,115   20      Yes
Amazon 7     13,580     686,724   7       Yes
Amazon 3     7,000      494,481   3       Yes
Enron A      3,000      13,559    10      No
Enron B      3,000      18,065    10      No
NYT Desk     10,000     108,671   26      No
NYT Online   10,000     108,671   34      No
NYT Section  10,000     108,671   20      No
Reuters      4,000      23,699    4       No
Crammer, Dredze, Kulesza. EMNLP 2008
23
Evaluation
Better than all baselines (online and batch): 8 of 9 datasets
Crammer, Dredze, Kulesza. EMNLP 2008
24
20 Newsgroups
Better than all online baselines: 8 of 9 datasets
Crammer, Dredze, Kulesza. EMNLP 2008
25
Multi-Domain Learning
• Task: sentiment classification
• Goal: classify reviews that come from different domains
  – Electronics
  – Books
  – Movies
  – Kitchen Appliances
• Challenge: domains differ
  – Domains use different features
  – Domains may behave differently towards features
Blitzer, Dredze, Pereira, ACL 2007
Dredze, Kulesza, Crammer. MLJ 2009
26
Differing Feature Behaviors
• Share similar behaviors across domains
• Learn domain-specific behaviors
• Shared parameters: used for every domain
• Domain parameters: separate parameters for every domain
27
Combining Domain Parameters
[Figure: the combined weight vector w_combined is built from the shared weights and the domain-specific weights]
28
Classifier Combination
• A CW classifier is a distribution over weight vectors
[Figure: individual classifiers are combined into a single combined classifier via weighting]
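A minimal sketch of predicting with a combined classifier built from a shared CW classifier and a domain-specific one, weighting each coordinate by its confidence (inverse variance); this illustrates the idea and is not necessarily the exact rule from Dredze, Kulesza & Crammer (MLJ 2009):

```python
import numpy as np

def combined_predict(shared_mean, shared_var, domain_mean, domain_var, x):
    """Per-feature combination of shared and domain-specific weights,
    trusting whichever parameter has the lower variance, then predicting
    with the combined weight vector."""
    shared_conf = 1.0 / shared_var
    domain_conf = 1.0 / domain_var
    w_combined = (shared_conf * shared_mean + domain_conf * domain_mean) / (
        shared_conf + domain_conf
    )
    return 1 if np.dot(w_combined, x) >= 0 else -1
```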
29
Multi-Domain Regularization
• Combined classifier for prediction and updates
  – Based on Evgeniou and Pontil, KDD 2004
• Passive-aggressive update rule
  – Find the shared model and the individual model closest to the current corresponding models (smallest parameter change)
  – Such that their combination classifies the current example correctly
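Written out, the update amounts to the following optimization (a sketch; slack terms and the exact margin requirement follow the paper), where $w_s$ is the shared model and $w_d$ the model of the example's domain:

$$
(w_s^{\mathrm{new}}, w_d^{\mathrm{new}}) = \arg\min_{w_s, w_d}\; \|w_s - w_s^{\mathrm{old}}\|^2 + \|w_d - w_d^{\mathrm{old}}\|^2
\quad \text{s.t.} \quad y \,\big( (w_s + w_d) \cdot x \big) \ge 1
$$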
30
Evaluation on Sentiment
• Sentiment classification
  – Rate product reviews: positive/negative
• 4 datasets
  – All: 7 Amazon product types
  – Books: different rating thresholds
  – DVDs: different rating thresholds
  – Books+DVDs
• 1,500 train, 100 test per domain
31
Results
[Figure: test error (smaller is better) for Single, Separate, and MDR on Books, DVD, Books+DVD, and All; 10-fold CV, one pass of online training; Books, DVDs, Books+DVDs: p = .001]
32
Active Learning
Dredze & Crammer, ACL 2008
• Start with a pool of unlabeled examples
• Use a few labeled examples to choose an initial hypothesis
• Iterative algorithm:
  – Use the current classifier to pick an example to be labeled
  – Train using all labeled examples
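A minimal sketch of this pool-based loop (the selection strategy, oracle, and the model's train method are placeholders; concrete selection rules appear on the "Picking the Next Example" slide below):

```python
def active_learning(model, pool, oracle, select_index, budget):
    """Pool-based active learning: repeatedly pick an unlabeled example,
    query its label, and retrain on all labeled examples gathered so far."""
    labeled = []
    for _ in range(budget):
        i = select_index(model, pool)   # use the current classifier to pick
        x = pool.pop(i)                 # an example to be labeled
        labeled.append((x, oracle(x)))  # query the true label
        model.train(labeled)            # train using all labeled examples
    return model
```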
39
Picking the Next Example
• Random
• Linear Classifiers
  – Example with the lowest margin
• Active Confidence Learning
  – Example with the least confidence
  – Equivalent to the lowest normalized margin
Dredze & Crammer, ACL 2008
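The three strategies can be written as index-selection rules over the unlabeled pool; a sketch reusing the mean/variance representation from earlier (the CW rule divides the margin by its standard deviation under the weight distribution, which is one way to read "least confidence"):

```python
import random
import numpy as np

def random_index(model, pool):
    return random.randrange(len(pool))

def min_margin_index(model, pool):
    # Linear-classifier heuristic: query the example with the smallest |w.x|.
    return int(np.argmin([abs(np.dot(model.mean, x)) for x in pool]))

def min_confidence_index(model, pool):
    # CW heuristic: smallest margin normalized by its standard deviation
    # under w ~ N(mean, diag(variance)).
    scores = [abs(np.dot(model.mean, x)) / np.sqrt(np.dot(model.variance, x * x))
              for x in pool]
    return int(np.argmin(scores))
```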
40
Active Learning
• 13 datasets: Sentiment (4), 20NG (3), Reuters (3), SPAM (3)
Dredze & Crammer, ACL 2008
41
Active Learning
• Number of labels needed by CW Margin and ACL to reach 80% of the accuracy of training with all of the data
Dredze & Crammer, ACL 2008
42
Summary
• Online training is fast and effective …
• … but NLP data has a heavy-tailed feature distribution
• New model:
  – Add feature confidence parameters
• Benefits:
  – Better than state-of-the-art training algorithms for linear classifiers
  – Converges faster
  – Theoretical guarantees
  – Allows better combination of models trained in parallel and better active learning
  – Better domain adaptation