Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles

Pinar Donmez and Jaime Carbonell, Language Technologies Institute, School of Computer Science, Carnegie Mellon University

CIKM ’08, Napa Valley, October 2008


Page 1

Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles

Pinar Donmez and Jaime Carbonell

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

CIKM ’08, Napa Valley, October 2008

Page 2

Active Learning Assumptions and the Real World

Active Learning
► unique oracle
► perfect oracle: always right, never tired
► works for free or charges uniformly

Real World
► multiple sources of information
► imperfect oracles: unreliable, reluctant
► expensive or charges non-uniformly

Page 3

Solution: Proactive Learning

► Proactive learning is a generalization of active learning that relaxes these assumptions
► A decision-theoretic framework to jointly optimize the instance-oracle pair
► A utility optimization problem under a fixed budget constraint

Page 4

Outline

► Methodology: 3 Scenarios
  ► Reluctance
  ► Fallibility
  ► Variable and Fixed Cost
► Evaluation
  ► Problem Setup
  ► Datasets
  ► Results
► Conclusion

Page 5

Scenario 1: Reluctance

► 2 oracles:
  ► reliable oracle: expensive but always answers with a correct label
  ► reluctant oracle: cheap but may not respond to some queries
► Define a utility score as the expected value of information at unit cost:

$$U(x, k) = \frac{P(\text{ans} \mid x, k) \cdot V(x)}{C_k}$$
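
The utility score above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout, and all numbers are hypothetical and chosen only to show how an instance-oracle pair is selected jointly.

```python
def utility(p_ans, value, cost):
    """U(x, k) = P(ans | x, k) * V(x) / C_k."""
    return p_ans * value / cost


def best_pair(values, p_ans, costs):
    """Return the (instance, oracle) pair with the highest utility.

    values: {instance: V(x)}
    p_ans:  {oracle: {instance: P(ans | x, k)}}
    costs:  {oracle: C_k}
    """
    pairs = ((x, k) for x in values for k in costs)
    return max(
        pairs,
        key=lambda xk: utility(p_ans[xk[1]][xk[0]], values[xk[0]], costs[xk[1]]),
    )


# Hypothetical numbers: the reliable oracle always answers but costs 3x more.
values = {"x1": 1.0, "x2": 0.6}
costs = {"reliable": 3.0, "reluctant": 1.0}
p_ans = {
    "reliable": {"x1": 1.0, "x2": 1.0},
    "reluctant": {"x1": 0.2, "x2": 0.9},
}
choice = best_pair(values, p_ans, costs)
```

With these numbers the cheap oracle is chosen for x2 even though x1 is more informative, because the reluctant oracle is unlikely to answer x1 and the reliable oracle is three times as expensive.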

Page 6

How to simulate oracle unreliability?

► Unreliability may depend on factors such as query difficulty (hard to classify), complexity of the data (requires long and time-consuming analysis), etc. In this work, we model it based on query difficulty.
► Assumptions:
  ► Perfect oracle ~ classifier having zero training error on the entire data
  ► Imperfect oracle ~ weak classifier trained on a subset of the entire data
► Train a logistic regression classifier on the subset to obtain $\hat{P}(y \mid x)$
► Identify instances with $\hat{P}(y \mid x) \in [0.45, 0.5]$
► These are the unreliable instances
► Challenge: tradeoff between the information value of an instance and the reliability of the oracle
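
The simulation steps above can be sketched as follows. This is a rough illustration, not the authors' code: it fits a one-dimensional logistic regression by plain gradient descent on a toy subset, and it assumes that the thresholded quantity is the posterior of the less likely class (for two classes, the only posterior that can fall in [0.45, 0.5]). All names and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_1d(xs, ys, lr=0.1, steps=200):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def is_unreliable(x, w, b, low=0.45, high=0.5):
    """Flag x when the posterior of the less likely class is in [low, high].

    Assumption: the slide's P(y | x) is read as that smaller posterior.
    """
    p = sigmoid(w * x + b)
    return low <= min(p, 1.0 - p) <= high


# Toy subset standing in for the data the weak oracle was trained on.
subset_x = [-2.0, -1.0, 1.0, 2.0]
subset_y = [0, 0, 1, 1]
w, b = train_logistic_1d(subset_x, subset_y)

# Instances near the decision boundary are the ones the oracle is unsure about.
pool = [-3.0, -0.05, 0.0, 0.05, 3.0]
unreliable = [x for x in pool if is_unreliable(x, w, b)]
```

Only the points close to the decision boundary end up flagged; the simulated oracle would decline (Scenario 1) or answer with low confidence (Scenario 2) on exactly those.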

Page 7

How to estimate $\hat{P}(\text{ans} \mid x, k)$?

► Cluster unlabeled data using k-means
► Ask the reluctant oracle for the label of each cluster centroid. If
  ► label received: increase $\hat{P}(\text{ans} \mid x, \text{reluctant})$ of nearby points
  ► no label: decrease $\hat{P}(\text{ans} \mid x, \text{reluctant})$ of nearby points

$$\hat{P}(\text{ans} \mid x, \text{reluctant}) = \frac{0.5}{Z} \exp\left( \frac{h(x_{c_t}, y_{c_t})}{2} \ln \frac{\max_d - \lVert x_{c_t} - x \rVert}{\lVert x_{c_t} - x \rVert} \right), \quad \forall x \in C_t$$

► $h(x_{c_t}, y_{c_t}) \in \{1, -1\}$ equals 1 when label received, -1 otherwise
► # clusters depends on the clustering budget and oracle fee
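
A small sketch of the estimate for a single probed cluster may make the update concrete. It is an illustration under assumptions rather than the authors' implementation: `max_d` is taken to be the largest centroid-to-point distance, the normalizer `z` is passed in, and the distance clamp `eps` and the clipping to 1 are additions of this sketch that keep the value finite and at most 1.

```python
import math


def answer_probability(x, centroid, answered, max_d, z=1.0, eps=1e-9):
    """Estimate P(ans | x, reluctant) for a point x in the probed cluster.

    answered: True if the oracle labeled the centroid (h = +1), else h = -1.
    max_d:    assumed to be the largest centroid-to-point distance.
    z, eps:   assumptions of this sketch (normalizer, and a clamp that
              keeps the logarithm finite at distance 0 and at max_d).
    """
    h = 1 if answered else -1
    d = math.dist(x, centroid)
    d = min(max(d, eps), max_d - eps)
    p = (0.5 / z) * math.exp(0.5 * h * math.log((max_d - d) / d))
    return min(p, 1.0)  # clipping to 1 is also an assumption


centroid = (0.0, 0.0)
near = (1.0, 0.0)      # close to the probed centroid
halfway = (2.0, 0.0)   # at distance max_d / 2

p_yes = answer_probability(near, centroid, answered=True, max_d=4.0)
p_no = answer_probability(near, centroid, answered=False, max_d=4.0)
p_half = answer_probability(halfway, centroid, answered=True, max_d=4.0)
```

A point near an answered centroid moves above 0.5, the same point near an unanswered centroid moves below 0.5, and a point halfway to `max_d` stays at 0.5 either way.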

Page 8

► Algorithm works in rounds until the budget is exhausted
► At each round, sampling continues until a label is obtained
► Be careful: you may spend the entire budget on a single attempt
► If no label, decrease the utility of remaining instances:

$$\hat{U}(x, \text{reluctant}) = \frac{\hat{P}(\text{ans} \mid x, \text{reluctant}) \cdot V(x)}{C_{\text{round}}}$$

  where $C_{\text{round}}$ is the amount spent thus far in the given round

► This is adaptive penalization of the reluctant oracle
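
One round of this procedure might look like the sketch below. It is a simplified reading, not the published algorithm: the function and argument names are hypothetical, a pair that was already queried is not retried, and before anything has been spent in the round the reluctant oracle's utility is divided by its fee rather than by C_round (which would be zero).

```python
def run_round(pool, budget, cost, p_ans, value, responds):
    """Query until some oracle returns a label or nothing is affordable.

    Returns ((instance, oracle) or None, amount spent in this round).
    """
    spent = 0.0      # C_round
    tried = set()    # (instance, oracle) pairs already queried
    while True:
        best, best_u = None, float("-inf")
        for x in pool:
            for k in cost:
                if (x, k) in tried or cost[k] > budget - spent:
                    continue
                # Assumption: before anything is spent, divide by C_k;
                # afterwards the reluctant oracle is divided by C_round.
                penalized = k == "reluctant" and spent > 0
                denom = spent if penalized else cost[k]
                u = p_ans[k][x] * value[x] / denom
                if u > best_u:
                    best, best_u = (x, k), u
        if best is None:
            return None, spent
        spent += cost[best[1]]
        tried.add(best)
        if responds(*best):
            return best, spent


# Hypothetical setup: the reluctant oracle never answers in this round.
pool = ["a", "b", "c"]
cost = {"reliable": 3.0, "reluctant": 1.0}
value = {"a": 1.0, "b": 1.0, "c": 1.0}
p_ans = {
    "reliable": {"a": 1.0, "b": 1.0, "c": 1.0},
    "reluctant": {"a": 0.9, "b": 0.8, "c": 0.6},
}
result = run_round(pool, 10.0, cost, p_ans, value, lambda x, k: k == "reliable")
```

In this run the learner tries the reluctant oracle twice, then the growing C_round pushes the third reluctant query below the reliable oracle's utility and it switches, ending the round after spending 5 units.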

Page 9

Algorithm for Scenario 1

Page 10

Scenario 2: Fallibility

► 2 oracles:
  ► One perfect but expensive oracle
  ► One fallible but cheap oracle, always answers
► Algorithm similar to Scenario 1 with slight modifications
► During exploration:
  ► Fallible oracle provides the label with its confidence
  ► Confidence = $\hat{P}(y \mid x)$ of fallible oracle
  ► If $\hat{P}(y \mid x) \in [0.45, 0.5]$ then we don’t use the label, but we still update $\hat{P}(\text{correct} \mid x, k)$
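
The exploration rule can be sketched as below. This is an assumption-laden illustration: `posterior` stands for whatever quantity the slide thresholds, and the +1/-1 feedback signal is borrowed from the indicator used in Scenario 1, since the slides do not spell out how $\hat{P}(\text{correct} \mid x, k)$ is updated. All names are hypothetical.

```python
LOW, HIGH = 0.45, 0.5  # uncertain band from the slide


def handle_fallible_answer(x, label, posterior, labeled, feedback):
    """Process one exploration answer from the fallible oracle.

    feedback[x] is +1 for a confident answer and -1 for an uncertain one;
    it is always recorded so P(correct | x, k) can be updated.
    The label itself is kept only when the answer is confident.
    """
    uncertain = LOW <= posterior <= HIGH
    feedback[x] = -1 if uncertain else 1
    if not uncertain:
        labeled[x] = label
    return not uncertain


labeled, feedback = {}, {}
used_x1 = handle_fallible_answer("x1", +1, 0.48, labeled, feedback)  # discarded
used_x2 = handle_fallible_answer("x2", -1, 0.10, labeled, feedback)  # kept
```

The uncertain label for x1 is dropped from the training set, but both answers leave a feedback entry that a reliability estimate could consume.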

Page 11

Outline of Scenario 2

Page 12

Scenario 3: Non-uniform Cost

► Uniform cost: fraud detection, face recognition, etc.
► Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
► 2 oracles:
  ► Fixed-cost oracle
  ► Variable-cost oracle

$$C_{\text{non-unif}}(x) = 1 - \frac{\max_{y \in Y} \hat{P}(y \mid x) - \frac{1}{|Y|}}{1 - \frac{1}{|Y|}}$$
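
The variable cost is easy to check numerically. The sketch below simply evaluates the formula for a list of class posteriors; the function name and the input format are assumptions made for the example.

```python
def non_uniform_cost(posteriors):
    """C_non-unif(x) = 1 - (max_y P(y|x) - 1/|Y|) / (1 - 1/|Y|)."""
    n = len(posteriors)
    if n < 2:
        raise ValueError("need at least two classes")
    chance = 1.0 / n
    return 1.0 - (max(posteriors) - chance) / (1.0 - chance)


hardest = non_uniform_cost([0.5, 0.5])    # posterior at chance level
easiest = non_uniform_cost([1.0, 0.0])    # fully confident posterior
middle = non_uniform_cost([0.75, 0.25])
```

The cost runs from 1 for an instance whose posterior is at chance level down to 0 for one the classifier is certain about, so harder instances are more expensive to label.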

Page 13

Outline of Scenario 3

Page 14

Evaluation

► Datasets: Face detection, UCI Letter (V-vs-Y), Spambase, and UCI Adult

Page 15

Oracle Properties and Costs

► The cost is inversely proportional to reliability
► Higher costs for the fallible oracle since a noisy label should be penalized more than no label at all
► Cost ratio creates an incentive to choose between oracles

Page 16

Underlying Sampling Strategy

► Conditional entropy based sampling, weighted by a density measure
► Captures the information content of a close neighborhood

$$U(x) = \log\left\{ \min_{y \in \{\pm 1\}} \hat{P}(y \mid x, \hat{w}) \sum_{k \in N_x} \exp\left(-\lVert x - k \rVert_2^2\right) * \min_{y \in \{\pm 1\}} \hat{P}(y \mid k, \hat{w}) \right\}$$

  where $N_x$ is the set of close neighbors of $x$
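
A rough sketch of a density-weighted uncertainty score of this kind follows. It rests on one reading of the slide's formula, namely that the logarithm wraps the product of the instance's own uncertainty and the distance-weighted uncertainty of its neighbors; that reading, the function names, and the toy posterior are all assumptions.

```python
import math


def min_posterior(p_pos):
    """min over y in {+1, -1} of P(y | x), given P(y = +1 | x)."""
    return min(p_pos, 1.0 - p_pos)


def density_weighted_utility(x, neighbors, posterior):
    """log( min_y P(y|x) * sum_k exp(-||x - k||^2) * min_y P(y|k) )."""
    own = min_posterior(posterior(x))
    density = sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, k)))
        * min_posterior(posterior(k))
        for k in neighbors
    )
    # Tiny floor (an assumption) keeps the logarithm defined.
    return math.log(max(own * density, 1e-300))


def toy_posterior(x):
    """Hypothetical classifier: P(y = +1 | x) = sigmoid(x[0])."""
    return 1.0 / (1.0 + math.exp(-x[0]))


u_boundary = density_weighted_utility((0.0,), [(0.1,), (-0.1,)], toy_posterior)
u_confident = density_weighted_utility((3.0,), [(3.1,), (2.9,)], toy_posterior)
```

An uncertain point surrounded by uncertain neighbors scores higher than a confidently classified point in an equally dense but confident region.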

Page 17

Results: Overall and Reluctance on Spambase Data

Page 18

Results: Reluctance

Page 19

Cost varies non-uniformly

statistically significant results (p<0.01)

Page 20

More light on the clustering step

► Run each baseline without the clustering step
► Entire budget is spent in rounds for data elicitation
► No separate clustering budget
► Results on Spambase under Scenario 1, cost 1:3

Page 21

Conclusion

► Address issues with the assumptions of active learning
► Introduction to a Proactive Learning framework
► Analysis of imperfect oracles with differing properties and costs
► Expected utility maximization across oracle-instance pairs
► Effective against exploitation of a single oracle