What are the real challenges in data mining?
Charles Elkan
University of California, San Diego
August 21, 2003
Bogosity about learning with unbalanced data
1. The goal is yes/no classification.
   – No: ranking, or probability estimation.
   – Often, P(c=minority|x) < 0.5 for all examples x.
2. Decision trees and C4.5 are well-suited.
   – No: model each class separately, then use Bayes’ rule:
     P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ]
   – No: avoid small disjuncts.
   – With naïve Bayes: P(x|c) = ∏_i P(x_i | c)
3. Under/over-sampling are appropriate.
   – No: do cost-based example-specific sampling, then bagging.
4. ROC curves and AUC are important.
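As a minimal sketch of the "model each class separately, then use Bayes' rule" advice, the snippet below combines two naïve Bayes class-conditional models into a posterior P(c|x). The function name, the restriction to binary features, and all the numbers are illustrative assumptions, not values from the talk.

```python
from math import prod


def posterior(x, prior_c, cond_c, cond_not_c):
    """Return P(c|x) via Bayes' rule with naive Bayes class-conditional models.

    x          : tuple of binary feature values (0 or 1)
    prior_c    : P(c), the prior probability of the minority class
    cond_c     : cond_c[i] = P(x_i = 1 | c)
    cond_not_c : cond_not_c[i] = P(x_i = 1 | ~c)
    """
    def likelihood(theta):
        # Naive Bayes: P(x|class) is the product of per-feature probabilities.
        return prod(t if xi == 1 else 1.0 - t for xi, t in zip(x, theta))

    numerator = likelihood(cond_c) * prior_c
    denominator = numerator + likelihood(cond_not_c) * (1.0 - prior_c)
    return numerator / denominator


# Hypothetical numbers: a 1% minority class and two binary features.
p = posterior((1, 1), prior_c=0.01, cond_c=[0.8, 0.7], cond_not_c=[0.1, 0.2])
print(round(p, 4))
```

With these made-up numbers the posterior stays below 0.5 even though both features favor the minority class, which is the situation item 1 describes: a yes/no classifier would never predict the minority class, while the probability estimate is still useful for ranking.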
Learning to predict contact maps
[Figure panels: 3D protein, distance map, binary contact map]
(Source: Paolo Frasconi et al.)
Issues in contact map prediction
1. An ML researcher sees O(n^2) non-contacts and O(n) contacts.
2. But to a biologist, the concept “an example of a non-contact” is far from natural.
3. Moreover, there is no natural probability distribution defining the population of “all” proteins.
4. A statistician sees simply O(n^2) distance measures, but s/he finds least-squares regression is useless!
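To make the O(n^2) versus O(n) imbalance concrete, here is a toy sketch that thresholds pairwise distances into contacts and non-contacts. The 8.0 threshold, the 3.8 spacing, and the straight-line "chain" are assumptions chosen for illustration only; real proteins fold, and this is not the procedure used by the cited authors.

```python
from math import dist


def contact_counts(coords, threshold=8.0):
    """Count residue pairs closer than `threshold` (contacts) and the rest."""
    n = len(coords)
    contacts = 0
    non_contacts = 0
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) < threshold:
                contacts += 1
            else:
                non_contacts += 1
    return contacts, non_contacts


def straight_chain(n, spacing=3.8):
    """Hypothetical toy structure: n residues evenly spaced along a line."""
    return [(spacing * i, 0.0, 0.0) for i in range(n)]


for n in (10, 100):
    print(n, contact_counts(straight_chain(n)))
```

In this toy chain only residues at most two positions apart are in contact, so contacts grow linearly with n while non-contacts grow quadratically.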
For the rooftop detection task …
We used […] BUDDS, to extract candidate rooftops (i.e., parallelograms) from six large-area images. Such processing resulted in 17,829 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept “rooftop.”
(Source: Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown, Marcus Maloof, this workshop.)
How to detect faces in real-time?
Viola and Jones, CVPR ‘01:
– Slide window over image
– 45396 features per window
– Learn boosted decision-stump classifier
UCI datasets are small and not highly unbalanced
DATA SET    SIZE     FEATURES   MINORITY FRACTION
PIMA        768      8          0.35
PHONEME     5484     5          0.29
SATIMAGE    6435     36         0.10
MAMMOG.     11183    6          0.02
KRKOPT      28056    6          0.01
(Source: C4.5 and Imbalanced Data Sets, Nitin Chawla, this workshop.)
Features of the DMEF and similar datasets
1. At least 10^5 examples and 10^2.5 features.
2. No single well-defined target class.
3. Interesting cases have frequency < 0.01.
4. Much information on costs and benefits, but no overall model of profit/loss.
5. Different cost matrices for different examples
6. Most cost matrix entries are unknown.
Example-dependent costs and benefits
Observations:
1. Loss or profit depends on the transaction size x.
2. Figuring out the full profit/loss model is hard.
3. Opportunity costs are confusing.
4. Creative management transforms costs into benefits.
5. How do we account for long-term costs and benefits?
                         actual legitimate   actual fraudulent
predicted legitimate     +0.01x              -x
predicted fraudulent     -20                 -10
Correct decisions require correct probabilities
Let p = P(legitimate). The optimal decision is “approve” iff
0.01xp – (1-p)x > (-20)p + (-10)(1-p)
This calculation requires well-calibrated estimates of p.
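The decision rule above can be sketched in a few lines. The function names are hypothetical; the payoffs are the ones in the cost matrix (+0.01x, -x, -20, -10). Rearranging the inequality gives p(1.01x + 10) > x - 10, so the approval threshold on p depends on the transaction size x.

```python
def expected_payoffs(p, x):
    """Expected payoff of approving vs. denying a transaction of size x,
    where p is the estimated probability that the transaction is legitimate."""
    approve = 0.01 * x * p - x * (1.0 - p)
    deny = -20.0 * p - 10.0 * (1.0 - p)
    return approve, deny


def should_approve(p, x):
    """Approve iff the expected payoff of approving exceeds that of denying."""
    approve, deny = expected_payoffs(p, x)
    return approve > deny


def approval_threshold(x):
    """Smallest p at which approving becomes optimal, obtained by
    rearranging the inequality to p * (1.01x + 10) > x - 10."""
    return (x - 10.0) / (1.01 * x + 10.0)


for x in (5, 100, 10000):
    print(x, round(approval_threshold(x), 4))
```

Because the threshold moves with x, a fixed 0.5 cutoff on p is not optimal, and a small miscalibration of p near the threshold can flip the decision.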
ROC curves considered harmful
1. “AUC can give a general idea of the quality of the probabilistic estimates produced by the model”
   No, AUC only evaluates the ranking produced.
2. “Cost curves are equivalent to ROC curves”
   No, a single point on the ROC curve is optimal only if costs are the same for all examples.
Advice: Use $ profit to compare methods.
Issue: When is $ difference statistically significant?
(Source: Medical College of Georgia.)
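A small sketch of why AUC says nothing about calibration: applying a monotone transform to the scores leaves the ranking, and hence AUC, unchanged, while the probability estimates themselves get worse. The data are made up, and the Brier score (mean squared error of the probability estimates) is used here only as one convenient calibration-sensitive measure; it is not mentioned in the talk.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, scores):
    """Mean squared difference between predicted probability and outcome."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Hypothetical labels and probability estimates.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
squashed = [s ** 4 for s in scores]  # monotone transform: same ranking

print(auc(labels, scores), auc(labels, squashed))
print(brier(labels, scores), brier(labels, squashed))
```

Both score lists rank the examples identically, so AUC cannot tell them apart, yet any decision that multiplies the estimate by a dollar amount would differ between them.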
Usually we must learn a model to estimate costs
Cost matrix for soliciting donors to a charity.
The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model to predict x.
            actual donor    actual non-donor
solicit     x - $0.68       -$0.68
ignore      0               0
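Under this cost matrix, soliciting has expected payoff p(x - 0.68) + (1 - p)(-0.68) = px - 0.68, and ignoring has payoff 0. A minimal sketch of the resulting rule follows; the function names are hypothetical, and `p_donate` and `predicted_amount` stand for the outputs of whatever probability and regression models are learned.

```python
MAILING_COST = 0.68  # cost of one solicitation, from the cost matrix


def expected_profit_of_soliciting(p_donate, predicted_amount):
    """Expected profit p * x - 0.68, using the predicted donation amount x."""
    return p_donate * predicted_amount - MAILING_COST


def should_solicit(p_donate, predicted_amount):
    """Solicit iff the expected profit beats the zero payoff of ignoring."""
    return expected_profit_of_soliciting(p_donate, predicted_amount) > 0.0


# Hypothetical model outputs for two people with the same donation probability.
print(should_solicit(0.05, 20.0))
print(should_solicit(0.05, 10.0))
```

The two people differ only in predicted amount, so the decision depends on the regression estimate of x as much as on the probability estimate p.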
So, we learn a model to estimate costs …
Issue: The subset in the training set with x > 0 is a skewed sample for learning a model to estimate x.
Reason: Donation amount x and probability of donation p are inversely correlated.
Hence, the training set contains too few examples of large donations, compared to small ones.
The “reject inference” problem
1. Let humans make credit grant/deny decisions.
2. Collect data about repay/write-off, but only for people to whom credit is granted.
3. Learn a model from this training data.
4. Apply the model to all future applicants.
Issue: “All future applicants” is a sample from a different population than “people to whom credit is granted.”
Selection bias makes training labels incorrect
In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1)!
Historical actions are not optimal, but they are not chosen randomly either.
(Source: William H. Wolberg, M.D.)
Sequences of training sets
1. Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population.
2. Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002.
3. And so on…
Each time a new model is learned, its training set has been created using a different selection bias.
Let’s use the word “unbalanced” in the future
Google: Searched the web for imbalanced. … about 53,800.
Searched the web for unbalanced. … about 465,000.
C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978.
B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities are Both Unknown. KDD'01, pp. 204-213.
B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616.
N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02.
B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.