Improving Supervised Classification using Confidence Weighted Learning


1

Improving Supervised Classification

using Confidence Weighted Learning

Koby Crammer

Joint work with

Mark Dredze, Alex Kulesza and Fernando Pereira

Workshop in Machine Learning, the EE Department, Technion, January 20, 2010

2

Linear Classifiers

Input Instance to be classified

Weight vector of classifier
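The prediction of a linear classifier reduces to a single inner product between the weight vector and the instance. A minimal sketch (function and variable names are illustrative, not from the talk):

```python
def predict(w, x):
    # Linear classifier: predict the sign of the inner product w . x
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1
```

For example, with w = [1.0, -2.0], the instance [3.0, 1.0] scores 1.0 and is labeled +1, while [0.0, 1.0] scores -2.0 and is labeled -1.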

3

Natural Language Processing

• Big datasets, large number of features

• Many features are only weakly correlated with target label

• Linear classifiers: features are associated with word-counts

• Heavy-tailed feature distribution

[Plot: feature counts vs. feature rank, showing a heavy-tailed distribution]

4

Sentiment Classification

• Who needs this Simpsons book? You DOOOOOOOO! This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!

Pang, Lee, Vaithyanathan, EMNLP 2002

Online Learning

Maintain model M; get instance x

Predict label ŷ = M(x)

Get true label y; suffer loss ℓ(ŷ, y)

Update Model M
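The protocol above can be sketched as a loop; the perceptron update below stands in for a generic update rule (an illustrative sketch, not the talk's algorithm):

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def online_perceptron(stream, n_features):
    # Online protocol: maintain model, predict, get true label, suffer loss, update.
    w = [0.0] * n_features
    mistakes = 0
    for x, y in stream:                         # get instance x and true label y
        y_hat = 1 if dot(w, x) >= 0 else -1     # predict label
        if y_hat != y:                          # suffer 0-1 loss
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]  # update model
    return w, mistakes
```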

6

Sentiment Classification

• Many positive reviews with the word best increase w_best

• Later, a negative review: "boring book – best if you want to sleep in seconds"

• A linear update will reduce both w_best and w_boring

• But best appeared far more often than boring, so it is better to reduce the two weights at different rates: reduce w_boring more than w_best

7

Linear Model → Distribution over Linear Models

Example

Mean weight-vector

8

New Prediction Models

• Gaussian distributions over weight vectors

• The covariance is either full or diagonal

• In NLP we have many features and use a diagonal covariance
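A minimal sketch of this model, assuming the diagonal covariance is stored as a per-feature variance vector (names are illustrative). A useful fact: under w ~ N(μ, Σ), the margin y(w · x) is itself Gaussian with mean y(μ · x) and variance xᵀΣx.

```python
class GaussianLinearModel:
    # Gaussian distribution N(mu, Sigma) over weight vectors,
    # with diagonal Sigma so it scales to NLP-sized feature sets.
    def __init__(self, n_features, init_var=1.0):
        self.mu = [0.0] * n_features          # mean weight vector
        self.var = [init_var] * n_features    # diagonal of the covariance

    def margin_mean_and_var(self, x, y):
        # Mean y * (mu . x) and variance x^T diag(var) x of the margin.
        m = y * sum(mi * xi for mi, xi in zip(self.mu, x))
        v = sum(vi * xi * xi for vi, xi in zip(self.var, x))
        return m, v
```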

9

The algorithm forces most of the probability mass of the weight vector to reside in this region

Weight Vector (Version) Space

10

Nothing to do: most of the weight vectors already classify the example correctly

Passive Step

11

Aggressive Step

The mean is moved beyond the mistake line (large margin)

The covariance is shrunk in the direction of the input example

The algorithm projects the current Gaussian distribution onto the half-space

12

The Update

• Projection update

• Can be solved analytically
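The exact CW projection has a closed form; the sketch below instead uses the closely related AROW-style squared-loss update (Crammer, Kulesza, Dredze, NIPS 2009), assuming a diagonal covariance, since it exhibits the same two effects: the mean moves toward classifying the example and the variance shrinks along it. The regularizer r is an assumed hyperparameter.

```python
def arow_diag_update(mu, var, x, y, r=1.0):
    # AROW-style second-order update with diagonal covariance
    # (a squared-loss relaxation of the CW projection).
    margin = y * sum(mi * xi for mi, xi in zip(mu, x))
    conf = sum(vi * xi * xi for vi, xi in zip(var, x))   # x^T Sigma x
    if margin >= 1.0:
        return mu, var          # passive step: no update needed
    beta = 1.0 / (conf + r)
    alpha = (1.0 - margin) * beta
    # Mean update: high-confidence (low-variance) coordinates move less.
    mu = [mi + alpha * y * vi * xi for mi, vi, xi in zip(mu, var, x)]
    # Covariance shrinks in the direction of the input example.
    var = [vi - beta * (vi * xi) ** 2 for vi, xi in zip(var, x)]
    return mu, var
```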

13

Synthetic Data

• 20 features

• 2 informative (rotated, skewed Gaussian)

• 18 noisy

• Using a single feature is as good as random prediction

14

Synthetic Data (cntd.)

Distribution after 50 examples (x1)

15

Synthetic Data (results)

[Plot: results for Perceptron, PA, 2nd-order, CW-full, and CW-diag on the synthetic data]

16

Data

• Binary document classification

– Sentiment reviews: 6 Amazon domains (Blitzer et al.)

– Reuters (RCV1): 3 pairs of labels

– 20 Newsgroups: 3 pairs of labels

• About 2000 instances per dataset

• Bag-of-words representation

• 10-fold cross-validation; 5 epochs

17

Results vs Batch - Sentiment

• CW is always better than the batch methods

• Significantly better on 3 of 6 datasets

18

Results vs Batch - 20NG + Reuters

• Better than the batch methods on 5 of 6 datasets

• Significantly better on 3 of those 5; significantly worse on the remaining 1

19

Parallel Training

• Split the large dataset into disjoint sets

• Train on each set independently

• Combine the resulting classifiers

– Average classifier performance

– Uniform mean of linear weights

– Weighted mean of linear weights using confidence information
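One way to realize the confidence-weighted combination, assuming each trained model exposes its mean weights and diagonal variances: weight each coordinate by its precision 1/var, so confident estimates dominate. This is a sketch; the slide does not spell out the exact formula.

```python
def combine(models):
    # models: list of (mu, var) pairs with diagonal covariance.
    # Precision-weighted mean per coordinate: confident (low-variance)
    # estimates get more weight than uncertain ones.
    n = len(models[0][0])
    combined = []
    for j in range(n):
        num = sum(mu[j] / var[j] for mu, var in models)
        den = sum(1.0 / var[j] for mu, var in models)
        combined.append(num / den)
    return combined
```

With two models estimating a weight as 1.0 (variance 0.5) and 3.0 (variance 1.0), the uniform mean is 2.0, but the precision-weighted mean is 5/3, pulled toward the more confident estimate.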

20

Parallel Training

• Data Size: Sentiment ~1M ; Reuters ~0.8M

• #Features/#Docs: Sentiment ~13 ; Reuters ~0.35

• Performance degrades with the number of splits

• Weighting improves performance

Baseline (CW)

21

Multi-Class Update

• Multiple constraints per instance, one per competing label

• Approximate using a single constraint

Crammer, Dredze, Kulesza. EMNLP 2008

22

Evaluation Setup

• Nine multi-class datasets

Task          Instances   Features   Labels   Balanced
20 News          18,828    252,115       20   Yes
Amazon 7         13,580    686,724        7   Yes
Amazon 3          7,000    494,481        3   Yes
Enron A           3,000     13,559       10   No
Enron B           3,000     18,065       10   No
NYT Desk         10,000    108,671       26   No
NYT Online       10,000    108,671       34   No
NYT Section      10,000    108,671       20   No
Reuters           4,000     23,699        4   No

Crammer, Dredze, Kulesza. EMNLP 2008

23

Evaluation

Better than all baselines (online and batch): 8 of 9 datasets

Crammer, Dredze, Kulesza. EMNLP 2008

24

20 Newsgroups

Better than all online baselines: 8 of 9 datasets

Crammer, Dredze, Kulesza. EMNLP 2008

25

Multi-Domain Learning

• Task: sentiment classification

• Goal: learn sentiment across review domains

– Electronics

– Books

– Movies

– Kitchen Appliances

• Challenge: domains differ

– Domains use different features

– Domains may behave differently towards the same features

Blitzer, Dredze, Pereira, ACL 2007

Dredze, Kulesza, Crammer. MLJ 2009

26

Differing Feature Behaviors

• Share similar behaviors across domains

• Learn domain specific behaviors

Shared parameters

Parameters used for every domain

Domain parameters

Separate parameters for every domain

27

Combining Domain Parameters

[Diagram: the combined weight w_combined averages the shared and the domain-specific weight, e.g. shared 2 and domain-specific -1 combine to .5]

28

Classifier Combination

Individual classifiers

Combined classifier

Weighting

• CW classifier is a distribution over weight vectors
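A sketch of the combined prediction, under the assumption that the combined weight for each feature is the average of the shared and the domain-specific parameter (matching the slide's example, where shared 2 and domain-specific -1 give .5):

```python
def combined_score(w_shared, w_domain, x):
    # Multi-domain combination: the effective weight per feature is the
    # average of the shared and the domain-specific parameter, so a domain
    # can override the shared behavior only where it has evidence.
    return sum(0.5 * (ws + wd) * xi for ws, wd, xi in zip(w_shared, w_domain, x))
```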

29

Multi-Domain Regularization

• Combined classifier for prediction and updates

– Based on Evgeniou and Pontil, KDD 2004

• Passive-aggressive update rule

– Find the shared model and individual model closest to the current corresponding models

– Such that their combination performs well on the current example

1) Smallest parameter change

2) Classify the example correctly

30

Evaluation on Sentiment

• Sentiment classification

– Rate product reviews: positive/negative

• 4 datasets

– All: 7 Amazon product types

– Books: different rating thresholds

– DVDs: different rating thresholds

– Books+DVDs

• 1500 train, 100 test per domain

31

Results

[Bar chart: test error (smaller is better) of Single, Separate, and MDR on Books, DVD, Books+DVD, and All; 10-fold CV, one pass of online training; MDR significantly better on Books, DVDs, and Books+DVDs (p = .001)]

32

Active Learning (Dredze & Crammer, ACL 2008)

• Start with a pool of unlabeled examples

• Use a few labeled examples to choose an initial hypothesis

• Iterative algorithm:

– Use the current classifier to pick an example to be labeled

– Train using all labeled examples


Picking the Next Example

• Random

• Linear classifiers

– Example with lowest margin

• Active Confidence Learning:

– Example with least confidence

– Equivalent to lowest normalized margin
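The selection rule can be sketched as picking the pool example with the smallest normalized margin under the CW model (diagonal covariance assumed; names are illustrative):

```python
import math

def pick_next(pool, mu, var):
    # Active Confidence Learning: pick the unlabeled example the CW model is
    # least confident about, i.e. with the smallest normalized margin
    # |mu . x| / sqrt(x^T Sigma x), with diagonal Sigma stored as var.
    def normalized_margin(x):
        m = abs(sum(mi * xi for mi, xi in zip(mu, x)))
        v = sum(vi * xi * xi for vi, xi in zip(var, x))
        return m / math.sqrt(v)
    return min(pool, key=normalized_margin)
```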

Dredze & Crammer, ACL 2008

40

Active Learning

• 13 datasets: Sentiment (4), 20NG (3), Reuters (3), SPAM (3)

Dredze & Crammer, ACL 2008

41

Active Learning

• Number of labels needed by CW Margin and ACL to reach 80% of the accuracy of training with all the data

Dredze & Crammer, ACL 2008

42

Summary

• Online training is fast and effective …

• … but NLP data has heavy-tailed feature distributions

• New model:

– Add per-feature confidence parameters

• Benefits:

– Better than state-of-the-art training algorithms for linear classifiers

– Converges faster

– Theoretical guarantees

– Allows better combination of models trained in parallel and better active learning

– Better domain adaptation
