8/6/2019 DM SC 07 Some Advanced Topics
1/96
Data Mining and Soft Computing
Some Advanced Topics: Classification with Imbalanced Data Sets, Subgroup Discovery and Data Complexity

Francisco Herrera
Research Group on Soft Computing and Information Intelligent Systems (SCI2S)
Dept. of Computer Science and A.I.
Email: herrera@decsai.ugr.es
http://decsai.ugr.es/~herrera
Data Mining and Soft Computing

Summary
1. Introduction to Data Mining and Knowledge Discovery
2. Data Preparation
3. Introduction to Prediction, Classification, Clustering and Association
4. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
6. Genetic Fuzzy Systems: State of the Art and New Trends
7. Some Advanced Topics: Classification with Imbalanced Data Sets, Subgroup Discovery, Data Complexity
8. Final talk: How must I Do my Experimental Study? Design of Experiments and Non-parametric Tests. Some Cases of Study
Some Advanced Topics: Classification with Imbalanced Data Sets, Subgroup Discovery and Data Complexity

Outline
Imbalanced Data Sets
Subgroup Discovery
Data Complexity
Imbalanced Data Sets

Presentation
In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than of the other.
Such a situation poses challenges for typical classifiers, such as decision tree induction systems or multi-layer perceptrons, that are designed to optimize overall accuracy without taking into account the relative distribution of each class.
As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately.
Such a problem occurs in a large number of practical domains and is often dealt with by using re-sampling or cost-based methods.
This talk introduces classification with imbalanced data sets, analyzing in depth the solutions based on re-sampling.
Introduction to Imbalanced Data Sets

Learning in non-balanced domains.
Data balancing through resampling.
State-of-the-art algorithm: SMOTE.
Learning in non-balanced domains

Data sets are said to be balanced if there are, approximately, as many positive examples as negative ones.
The positive examples are usually the more interesting ones, or their correct classification is the more valuable.
[Figure: a scatter plot with many negative (-) examples and only a few positive (+) examples.]
G. Cohen, M. Hilario, H. Sax, S. Hugonnet, A. Geissbuhler. Learning from Imbalanced Data in Surveillance of Nosocomial Infection. Artificial Intelligence in Medicine 37 (2006) 7-18
Learning in non-balanced domains

The classes of small size are usually labeled as rare cases (rarities).
The most important knowledge usually resides in the rare cases.
These cases are common in classification problems. E.g.: detection of uncommon diseases.
Imbalanced data: few sick persons and lots of healthy persons.
Some real problems:
Fraudulent credit card transactions
Learning word pronunciation
Prediction of telecommunications equipment failures
Detection of oil spills from satellite images
Detection of melanomas
Intrusion detection
Insurance risk modeling
Hardware fault detection
Learning in non-balanced domains

Problem:
The problem with class imbalances is that standard learners are often biased towards the majority class.
This is because these classifiers attempt to reduce global quantities such as the error rate, not taking the data distribution into consideration.

Result:
Examples from the overwhelming class are well-classified, whereas examples from the minority class tend to be misclassified.
Learning in non-balanced domains

Why is it difficult to learn in imbalanced domains?
Class imbalance is not the only factor responsible for the lack of accuracy of an algorithm.
Class overlapping also influences the behaviour of the algorithms, and it is very typical in these domains.

N.V. Chawla, N. Japkowicz, A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6:1 (2004) 1-6
Learning in non-balanced domains

Why might learning from imbalanced data sets be difficult?
Four groups of negative examples:
Noise examples.
Borderline examples: unsafe, since a small amount of noise can make them fall on the wrong side of the decision border.
Redundant examples.
Safe examples.
Learning in non-balanced domains

Why might learning from imbalanced data sets be difficult?
Rare or exceptional cases correspond to small numbers of training examples in particular areas of the feature space. When learning a concept, the presence of rare cases in the domain is an important consideration. The reason why rare cases are of interest is that they cause small disjuncts in the induced classifier.
In real-world domains, rare cases are unknown, since high-dimensional data cannot be visualized to reveal areas of low coverage.

[Diagram: Dataset -> Learner -> Knowledge Model; the learner minimizes the learning error and maximizes generalization.]

T. Jo, N. Japkowicz. Class imbalances versus small disjuncts. SIGKDD Explorations 6:1 (2004) 40-49
Learning in non-balanced domains

Why might learning from imbalanced data sets be difficult?
[Diagram: focusing the problem of small disjuncts; a small disjunct (starved niche) leads either to more small disjuncts or to an overgeneral classifier.]
Learning in non-balanced domains

How can we evaluate an algorithm in imbalanced domains?

Confusion matrix for a two-class problem:

                  Positive Prediction    Negative Prediction
Positive Class    True Positive (TP)     False Negative (FN)
Negative Class    False Positive (FP)    True Negative (TN)

Classical evaluation:
Accuracy Rate: (TP + TN) / N

Accuracy doesn't take into account the False Negative Rate, which is very important in imbalanced problems.
Learning in non-balanced domains

Imbalanced evaluation based on the geometric mean:
Positive true ratio: a+ = TP / (TP + FN)
Negative true ratio: a- = TN / (FP + TN)
g = (a+ · a-)^(1/2)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure: (2 x precision x recall) / (recall + precision)

R. Barandela, J.S. Sánchez, V. García, E. Rangel. Strategies for learning in class imbalance problems. Pattern Recognition 36:3 (2003) 849-851
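These metrics can be computed directly from the confusion-matrix counts. A minimal sketch in Python; the counts below are made-up illustration values for a 9:1 imbalanced test set:

```python
import math

def imbalanced_metrics(tp, fn, fp, tn):
    a_pos = tp / (tp + fn)           # positive true ratio a+ (= recall)
    a_neg = tn / (fp + tn)           # negative true ratio a-
    g = math.sqrt(a_pos * a_neg)     # geometric mean
    precision = tp / (tp + fp)
    f_measure = 2 * precision * a_pos / (precision + a_pos)
    return g, precision, a_pos, f_measure

# Made-up counts: 90 negatives, 10 positives.
g, p, r, f = imbalanced_metrics(tp=5, fn=5, fp=9, tn=81)
print(round(g, 3), round(p, 3), r, round(f, 3))
```

Note how a classifier with 86% accuracy here still only reaches g of about 0.67, because half of the rare positives are missed.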
Learning in non-balanced domains

ROC Curves
The confusion matrix is normalized by columns:

          Real P    Real N
Pred P    0.8       0.121
Pred N    0.2       0.879

[Figure: the normalized rates plotted as a point in ROC space (x-axis: False Positives, y-axis: True Positives).]

A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30:7 (1997) 1145-1159
Learning in non-balanced domains

Crisp and soft classifiers:
A crisp classifier (discrete) predicts a class among the candidates. A soft classifier produces a prediction accompanied by a reliability value.

[Figure: a ROC curve in the unit square (x-axis: False Positives, y-axis: True Positives), with the AUC shaded beneath it.]

AUC: Area under the ROC curve. Scalar quantity widely used for estimating classifiers' performance.
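For a soft classifier, the ROC curve and AUC can be traced by sweeping a threshold over the reliability scores. A minimal sketch, assuming binary labels, untied scores, and made-up illustration data:

```python
def roc_auc(labels, scores):
    # Sort by decreasing score and sweep the threshold to trace the ROC curve.
    # Assumes binary labels (1 = positive) and, for simplicity, untied scores.
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal rule over the (false-positive rate, true-positive rate) points.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

# Made-up labels and reliability scores for six test examples.
points, auc = roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(auc)
```

Each point on the curve corresponds to one threshold; the trapezoidal sum over those points is the AUC.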
Learning in non-balanced domains

ROC analysis oriented to data resampling in imbalanced domains:
The resampling algorithm must allow adjusting the rate of under/over-sampling.
Performance of the classifier is measured with over/under-sampling at 25%, 50%, 100%, 200%, 300%, etc.
It can only be used with resampling methods that allow the adjustment of this parameter.

N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357
Introduction to Imbalanced Data Sets

Learning in non-balanced domains.
Data balancing through resampling.
State-of-the-art algorithm: SMOTE.
Data Balancing through re-sampling

Strategies to deal with imbalanced data sets:

Over-Sampling (Random, Focused): balance the training set.
Under-Sampling (Random, Focused): remove noisy instances; reduce the training set.
Modifying costs: adjust the decision boundaries; retain influential examples.
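The two random strategies can be sketched in a few lines: duplicate minority examples until balanced (over-sampling), or drop majority examples until balanced (under-sampling). The tiny two-feature data set below is a made-up illustration:

```python
import random

def random_over_sample(minority, majority, rng):
    # Duplicate randomly chosen minority examples until the classes balance.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_under_sample(minority, majority, rng):
    # Keep only a random subset of the majority class, same size as the minority.
    return minority, rng.sample(majority, len(minority))

rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.8)]
majority = [(-1.0, -1.0), (-1.5, -0.5), (-0.8, -1.2), (-1.1, -0.9)]

over_min, over_maj = random_over_sample(minority, majority, rng)
under_min, under_maj = random_under_sample(minority, majority, rng)
print(len(over_min), len(over_maj), len(under_min), len(under_maj))
```

Focused variants replace the random choices with criteria such as closeness to the decision border.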
Data Balancing through re-sampling

[Diagram: under-sampling removes examples of the majority (-) class until it matches the number of minority (+) examples; over-sampling adds minority (+) examples until they match the number of majority (-) examples.]
Data Balancing through re-sampling

[Diagram: re-sampling strategies (Random/Focused Over-Sampling, Random/Focused Under-Sampling, Cost Modifying) laid out according to the resulting numbers of examples of each class.]
Data Balancing through re-sampling

Under-sampling: Tomek links
Goal: to remove both noise and borderline examples of the majority class.
Tomek link: let Ei and Ej belong to different classes, and let d(Ei, Ej) be the distance between them. The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej).
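The definition above translates directly into a quadratic-time check. A sketch assuming Euclidean distance, with a made-up four-point data set:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def tomek_links(examples):
    # examples: list of (point, label) pairs; returns index pairs forming links.
    links = []
    for i, (ei, yi) in enumerate(examples):
        for j in range(i + 1, len(examples)):
            ej, yj = examples[j]
            if yi == yj:
                continue  # a Tomek link joins examples of different classes
            d = dist(ei, ej)
            # a Tomek link exists iff no third example is closer to Ei or Ej
            if not any(k not in (i, j) and
                       (dist(ei, ek) < d or dist(ej, ek) < d)
                       for k, (ek, _) in enumerate(examples)):
                links.append((i, j))
    return links

data = [((0.0, 0.0), '-'), ((0.4, 0.0), '+'),   # an isolated cross-class pair
        ((5.0, 5.0), '-'), ((5.2, 5.0), '-')]
print(tomek_links(data))
```

Only the isolated cross-class pair forms a link; the two far-away negatives have each other as closer neighbors, so they form none.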
Data Balancing through re-sampling

Under-sampling: OSS, CNN + Tomek links, NCL

One-sided selection (OSS): Tomek links followed by CNN.

CNN + Tomek links: proposed by the authors. Finding Tomek links is computationally demanding, so it would be computationally cheaper if it was performed on a data set already reduced by CNN.

NCL: removes majority class examples. Different from OSS, it emphasizes data cleaning more than data reduction. Algorithm:
- Find the three nearest neighbors for each example Ei in the training set.
- If Ei belongs to the majority class and its three nearest neighbors classify it as minority class, then remove Ei.
- If Ei belongs to the minority class and its three nearest neighbors classify it as majority class, then remove those three nearest neighbors.
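The NCL rule above can be sketched as follows, assuming squared Euclidean distance and a majority vote over the three nearest neighbors; the seven-point data set is a made-up illustration:

```python
def ncl_filter(examples, majority_label):
    # examples: list of (point, label) pairs; returns the cleaned list.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    to_remove = set()
    for i, (ei, yi) in enumerate(examples):
        # three nearest neighbors of Ei (decided on the original set)
        neigh = sorted((j for j in range(len(examples)) if j != i),
                       key=lambda j: dist(ei, examples[j][0]))[:3]
        votes = [examples[j][1] for j in neigh]
        predicted = max(set(votes), key=votes.count)
        if predicted != yi:
            if yi == majority_label:
                to_remove.add(i)          # misclassified majority example
            else:
                to_remove.update(neigh)   # neighbors of a minority example
    return [e for j, e in enumerate(examples) if j not in to_remove]

data = [((0.0, 0.0), '+'),                                # minority example
        ((0.1, 0.0), '-'), ((0.0, 0.1), '-'), ((0.1, 0.1), '-'),
        ((5.0, 5.0), '-'), ((5.1, 5.0), '-'), ((5.0, 5.1), '-')]
cleaned = ncl_filter(data, majority_label='-')
print(len(cleaned))
```

The three majority points crowding the minority example are removed, while the minority example itself and the distant majority cluster are kept: cleaning, not reduction.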
Introduction to Imbalanced Data Sets

Learning in non-balanced domains.
Data balancing through resampling.
State-of-the-art algorithm: SMOTE.
State-of-the-art algorithm: SMOTE

Over-sampling method: forms new minority class examples by interpolating between several minority class examples that lie together, operating in "feature space" rather than "data space".

Algorithm: for each minority class example, introduce synthetic examples along the line segments joining any/all of the k minority class nearest neighbors.
Note: depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. For example, if we are using 5 nearest neighbors and the amount of over-sampling needed is 200%, only two neighbors from the five nearest neighbors are chosen and one sample is generated in the direction of each.
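A minimal sketch of this algorithm in pure Python, assuming squared Euclidean distance for the neighbor search; the minority points are made-up illustration values:

```python
import random

def smote(minority, n_percent=200, k=5, rng=random.Random(0)):
    # Generates n_percent/100 synthetic samples per minority example.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    per_example = n_percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbors of x
        neigh = sorted((p for j, p in enumerate(minority) if j != i),
                       key=lambda p: dist(x, p))[:k]
        # randomly choose as many neighbors as samples needed
        for nn in rng.sample(neigh, min(per_example, len(neigh))):
            gap = rng.random()  # random number in [0, 1)
            # interpolate between x and the chosen neighbor
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(6.0, 4.0), (4.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
new = smote(minority, n_percent=200, k=3)
print(len(new))
```

With 200% over-sampling, each of the four minority examples yields two synthetic points, all lying on segments between minority examples.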
State-of-the-art algorithm: SMOTE
SMOTE: Synthetic Minority Over-sampling Technique

Consider a sample (6,4) and let (4,3) be its nearest neighbor: (6,4) is the sample for which the k-nearest neighbors are being identified, and (4,3) is one of those k-nearest neighbors.

Synthetic samples are generated in the following way:
- Take the difference between the feature vector (sample) under consideration and its nearest neighbor.
- Multiply this difference by a random number between 0 and 1.
- Add it to the feature vector under consideration.

Let:
f1_1 = 6, f2_1 = 4, f2_1 - f1_1 = -2
f1_2 = 4, f2_2 = 3, f2_2 - f1_2 = -1

The new samples will be generated as:
(f1', f2') = (6,4) + rand(0-1) * (-2,-1)
where rand(0-1) generates a random number between 0 and 1.
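The interpolation above can be checked numerically; the synthetic point always falls on the segment between (6,4) and (4,3):

```python
import random

rng = random.Random(42)
sample, neighbor = (6, 4), (4, 3)
diff = tuple(n - s for s, n in zip(sample, neighbor))   # feature differences
gap = rng.random()                                      # random number in [0, 1)
new = tuple(s + gap * d for s, d in zip(sample, diff))  # synthetic sample
print(diff, new)
```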
State-of-the-art algorithm: SMOTE

But what if there is a majority sample nearby?
[Figure: minority samples, majority samples, and a synthetic sample generated between two minority samples that falls close to the majority class region.]

N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357
State-of-the-art algorithm: SMOTE

SMOTE + Tomek links: SMOTE can generate artificial minority class examples too deep in the majority class space. As a cleaning step, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.
State-of-the-art algorithm: SMOTE

[Figure: the effect of applying SMOTE + Tomek links to a sample data set.]
State-of-the-art algorithm: SMOTE

SMOTE + ENN:
ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors.
ENN removes more examples than the Tomek links do.
ENN removes examples from both classes.
State-of-the-art algorithm: SMOTE

G.E.A.P.A. Batista, R.C. Prati, M.C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6:1 (2004) 20-29
State-of-the-art algorithm: SMOTE

Adaptive Synthetic Minority Oversampling Method (ASMO)
[Figure: minority samples are first clustered, then synthetic samples are generated within each cluster.]
State-of-the-art algorithm: SMOTE

Borderline-SMOTE: generates synthetic examples between minority examples close to the decision borders.

H. Han, W.-Y. Wang, B.-H. Mao. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: ICIC 2005. LNCS 3644 (2005) 878-887
Some Advanced Topics: Classification with Imbalanced Data Sets, Subgroup Discovery and Data Complexity

Outline
Imbalanced Data Sets
Subgroup Discovery
Data Complexity
Predictive DM:
Classification (learning of rulesets, decision trees, ...)
Prediction and estimation (regression)

Descriptive DM: description and summarization
Dependency analysis (association rule learning)
Discovery of properties and constraints
Segmentation (clustering)

Text, Web and image analysis

[Figures: a hypothesis H separating positive (+) and negative (-) examples; a grouping of unlabeled (x) points.]
Predictive vs. descriptive induction

Predictive induction: inducing classifiers for solving classification and prediction tasks. Rule learning, decision trees, ... Bayesian classifiers, ANN, SVM, ... Evaluated by testing.

Descriptive induction: discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks. Subgroup discovery, ... Exploratory data analysis.
Predictive vs. descriptive induction: a rule learning perspective

Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks.
Descriptive induction: discovers individual rules that describe interesting regularities in the data.
Therefore: different goals, different heuristics, different evaluation criteria.
Predictive vs. descriptive induction: a rule learning perspective

Prediction models: applied for inductive prediction and composed of rule sets used for classification.

Kweku-Muata Osei-Bryson. Evaluation of decision trees: a multicriteria approach.

Training data:

Age   Car type   Risk
20    Combi      High
18    Sport      High
40    Sport      High
35    Minivan    Low
30    Combi      High
32    Familiar   Low

An extraction algorithm (IND, S-Plus Trees, C4.5, CN2, FACT, QUEST, CART, OC1, LMDT, CAL5, T1, ...) produces a classifier (model), e.g. a decision tree splitting on Age < 31 and Car Type = Sport.