Classification Labels in a Fast Moving Environment
Alessandro Magnani, @WalmartLabs, Walmart Global eCommerce
California, USA
Friday 13th November, 2015
Classification Model Performance
[Diagram: Items → Classifier → estimate $\hat{y}_i$; N sampled items → Editor → true label $y_i$; Evaluation → accuracy]
◮ correctly evaluating classification models is critical and requires labels
◮ labeling products is expensive
◮ need to correctly and optimally use labels
Measuring accuracy, the common approach:
◮ sample N items uniformly at random
◮ compute accuracy
$$\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{y_i = \hat{y}_i\}$$
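The common approach above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `uniform_sample_accuracy` and its arguments are hypothetical, and `true_labels` stands in for the labels an editor would return for the sampled items only.

```python
import random


def uniform_sample_accuracy(true_labels, predicted_labels, n, seed=0):
    """Estimate accuracy from n items drawn uniformly at random.

    Hypothetical helper: in practice only the n sampled items would be
    sent to an editor; here true_labels simulates the editor's answers.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    rng = random.Random(seed)
    # Uniform sampling without replacement over item indices.
    sampled = rng.sample(range(len(true_labels)), n)
    # (1/N) * sum of indicators 1{y_i == yhat_i} over the sample.
    hits = sum(1 for i in sampled if true_labels[i] == predicted_labels[i])
    return hits / n
```

When `n` equals the number of items, the estimate coincides with the exact accuracy over all items.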
Practical challenges
◮ items change over time
◮ evaluation required over multiple subsets
◮ existing labels potentially hard to reuse
A motivating example
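Compute accuracy over 1M items with a budget of 1K labels
◮ sample 1K items and get labels $y_i$
◮ measure accuracy
$$\frac{1}{1\mathrm{K}} \sum_{i=1}^{1\mathrm{K}} \mathbf{1}\{y_i = \hat{y}_i\}$$
[Figure: sampling probability p across items, axis mark at 1M, level 1/1K]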
500K items added, compute accuracy on all 1.5M items
◮ use the previous accuracy measure
◮ most likely inaccurate
[Figure: sampling probability p across items, axis marks at 1M and 1.5M, level 1/1K]
500K items added, compute accuracy on all 1.5M items with an extra budget of 500 labels
◮ sample 500 items from the 1.5M
◮ compute accuracy on the new 500 labels
◮ previous 1K labels “wasted”
[Figure: sampling probability p across items, axis marks at 1M and 1.5M, level 1/3K]
500K items added, compute accuracy on all 1.5M items with an extra budget of 500 labels: a better approach
◮ sample 500 items from the new items
◮ compute accuracy on all 1.5K labels
◮ no label “wasted”
[Figure: sampling probability p across items, axis marks at 1M and 1.5M, level 1/1K]
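500K items added, compute accuracy on all 1.5M items. Only 250 labels of extra budget?
◮ sample 250 items from the new items
◮ need to account for the difference in sampling
◮ accuracy:
$$\frac{1}{1.5\mathrm{K}} \left( \sum_{i=1}^{1\mathrm{K}} \mathbf{1}\{y_i = \hat{y}_i\} + 2 \sum_{i=1}^{250} \mathbf{1}\{y_i^{\mathrm{new}} = \hat{y}_i^{\mathrm{new}}\} \right)$$
[Figure: sampling probability p across items, axis marks at 1M and 1.5M, level 1/2K]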
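To make the factor of 2 concrete, the sketch below evaluates the weighted formula on made-up counts and compares it with naively pooling all 1,250 labels. The counts (900 of 1K old labels correct, 150 of 250 new labels correct) and the function names are hypothetical and serve only as an illustration.

```python
def weighted_two_batch_accuracy(old_hits, new_hits):
    """Accuracy from 1K old labels and 250 new labels, as in the example.

    New items were sampled at half the rate of old items (1/2K vs 1/1K),
    so each new label counts twice; the normalizer is 1K + 2 * 250 = 1.5K.
    """
    return (old_hits + 2 * new_hits) / 1500


def naive_pooled_accuracy(old_hits, new_hits):
    """Pools all 1,250 labels and ignores the difference in sampling."""
    return (old_hits + new_hits) / 1250


# Hypothetical counts: 90% accuracy on old items, 60% on new items.
weighted = weighted_two_batch_accuracy(900, 150)
naive = naive_pooled_accuracy(900, 150)
```

With these counts the weighted estimate is 0.8, which matches the population value (1M × 0.9 + 0.5M × 0.6) / 1.5M, while the naive pooled value is 0.84 because the old items are over-represented among the labels.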
What are the challenges?
◮ sampling new test labels for every measurement is generally expensive
◮ knowing how previous labels were sampled is required to optimally sample new test items
◮ computing accuracy using all labels requires knowledge of the sampling profile
◮ over time, reusing labels can become very tricky
Evaluation framework
◮ $p_i$ is the probability that item i is selected for the test set (Bernoulli)
◮ each item carries $p_i$ and is marked if selected (store the sampling profile)
◮ accuracy:
$$\frac{1}{\sum_{i\ \text{selected}} \frac{1}{p_i}} \sum_{i\ \text{selected}} \frac{1}{p_i} \mathbf{1}\{y_i = \hat{y}_i\}$$
◮ for evaluation to be possible, $p_j > 0$ for all items j, labeled or unlabeled
◮ all labels are used
◮ with uniform sampling this is simply “standard” accuracy
◮ very closely related to importance sampling
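The framework above can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the `Item` record and the function names are hypothetical, and `correct` stands in for the indicator 1{y_i = ŷ_i} obtained after an editor labels a selected item.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    p: float               # selection probability stored with the item
    selected: bool         # mark set if the item was drawn for the test set
    correct: bool = False  # 1{y_i == yhat_i}; meaningful only if selected


def bernoulli_select(probabilities, seed=0):
    """Independently select item i with probability p_i and record p_i."""
    rng = random.Random(seed)
    return [Item(p=p, selected=rng.random() < p) for p in probabilities]


def weighted_accuracy(items):
    """Inverse-probability-weighted accuracy over the selected items."""
    selected = [item for item in items if item.selected]
    if not selected:
        raise ValueError("no selected items to evaluate")
    normalizer = sum(1.0 / item.p for item in selected)
    weighted_hits = sum(1.0 / item.p for item in selected if item.correct)
    return weighted_hits / normalizer
```

When every selected item has the same $p_i$, the weights cancel and `weighted_accuracy` reduces to the fraction of correct labels, which is the "standard" accuracy noted in the list.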
Given the existing sampling probabilities $p_i$ and an extra budget, how do we sample?
◮ minimize the variance of the accuracy estimate subject to the budget constraint
◮ can be formulated as an optimization problem
◮ easy to solve
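The slides do not state the optimization problem, so the following is only one plausible formulation, given as an assumption. Suppose each item ends up with a combined selection probability $q_i \ge p_i$, the classifier's correctness is unknown in advance, and we minimize $\sum_i 1/q_i$ subject to $\sum_i (q_i - p_i) \le B$, where B is the expected number of extra labels. Under these assumptions the optimum raises every probability below a common level τ up to τ and leaves the others unchanged. The names `allocate_budget` and `second_round_probabilities` are hypothetical.

```python
def allocate_budget(p, budget, iterations=100):
    """Raise the smallest probabilities to a common level tau.

    Assumed formulation (not from the slides): minimize sum(1 / q_i)
    subject to q_i >= p_i and sum(q_i - p_i) <= budget. The solution is
    q_i = max(p_i, tau); tau is found here by bisection.
    """
    if any(not 0.0 < pi <= 1.0 for pi in p):
        raise ValueError("each p_i must lie in (0, 1]")

    def spend(tau):
        # Expected number of extra labels needed to reach level tau.
        return sum(max(pi, tau) - pi for pi in p)

    if budget >= spend(1.0):
        return [1.0 for _ in p]  # enough budget to label everything
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        tau = (lo + hi) / 2.0
        if spend(tau) > budget:
            hi = tau
        else:
            lo = tau
    return [max(pi, lo) for pi in p]


def second_round_probabilities(p, q):
    """Probability r_i of drawing a not-yet-selected item in a second
    Bernoulli round so that 1 - (1 - p_i) * (1 - r_i) equals q_i."""
    return [0.0 if pi >= 1.0 else (qi - pi) / (1.0 - pi)
            for pi, qi in zip(p, q)]
```

Under this assumed formulation, the extra budget goes first to the items with the smallest $p_i$, and once the budget is large enough every item reaches the same level.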
it works as you’d expect as budget grows:
[Figure: plots of sampling probability p as the budget grows, with the new budget shown in blue]
◮ new budget (blue) is used more where $p_i$ is smaller
◮ given enough budget we obtain uniform sampling
Extensions
◮ framework works more generally for supervised learning
◮ framework can work with a wide range of different metrics
◮ optimal sampling can use model posterior to reduce variance
◮ this framework can be used on the training side together with active learning