DM SC 07 Some Advanced Topics


  • Slide 1/96

    Data Mining and Soft Computing

    Some Advanced Topics: Classification with Imbalanced Data Sets,
    Subgroup Discovery and Data Complexity

    Francisco Herrera
    Research Group on Soft Computing and Information Intelligent Systems (SCI2S)
    Dept. of Computer Science and A.I., University of Granada
    Email: herrera@decsai.ugr.es
    http://decsai.ugr.es/~herrera

  • Slide 2/96

    Data Mining and Soft Computing

    Summary
    1. Introduction to Data Mining and Knowledge Discovery
    2. Data Preparation
    3. Introduction to Prediction, Classification, Clustering and Association
    4. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
    5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
    6. Genetic Fuzzy Systems: State of the Art and New Trends
    7. Some Advanced Topics: Classification with Imbalanced Data Sets, Subgroup Discovery, Data Complexity
    8. Final talk: How must I Do my Experimental Study? Design of Experiments. Non-parametric Tests. Some Cases of Study.

  • Slide 3/96

    Some Advanced Topics: Classification with Imbalanced Data Sets,
    Subgroup Discovery and Data Complexity

    Outline
    Imbalanced Data Sets
    Subgroup Discovery
    Data Complexity

  • Slide 4/96

  • Slide 5/96

    Imbalanced Data Sets

    Presentation
    In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than of the other.

    Such a situation poses challenges for typical classifiers such as decision tree induction systems or multi-layer perceptrons that are designed to optimize overall accuracy without taking into account the relative distribution of each class.

    As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately.

    Such a problem occurs in a large number of practical domains and is often dealt with by using re-sampling or cost-based methods.

    This talk introduces classification with imbalanced data sets, analyzing in depth the solutions based on re-sampling.

  • Slide 6/96

    Introduction to Imbalanced Data Sets

    Learning in non-balanced domains.
    Data balancing through re-sampling.
    State-of-the-art algorithm: SMOTE.

  • Slide 7/96

  • Slide 8/96

    Learning in non-balanced domains

    Data sets are said to be balanced if there are, approximately, as many positive examples as negative ones.

    The positive examples are usually the more interesting ones, or their misclassification is more costly.

    [Figure: scatter plot of an imbalanced data set, with many negative (-) examples and only a few positive (+) examples.]

    G. Cohen, M. Hilario, H. Sax, S. Hugonnet, A. Geissbuhler. Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine 37 (2006) 7-18

  • Slide 9/96

    Learning in non-balanced domains

    The classes of small size are usually labeled as rare cases (rarities).

    The most important knowledge usually resides in the rare cases.

    These cases are common in classification problems. E.g.: detection of uncommon diseases. Imbalanced data: few sick persons and lots of healthy persons.

    Some real problems:
    Fraudulent credit card transactions
    Learning word pronunciation
    Prediction of telecommunications equipment failures
    Detection of oil spills from satellite images
    Detection of melanomas
    Intrusion detection
    Insurance risk modeling
    Hardware fault detection

  • Slide 10/96

    Learning in non-balanced domains

    Problem:
    The problem with class imbalances is that standard learners are often biased towards the majority class. That is because these classifiers attempt to reduce global quantities such as the error rate, not taking the data distribution into consideration.

    Result:
    Examples from the overwhelming class are well-classified, whereas examples from the minority class tend to be misclassified.

  • Slide 11/96

    Learning in non-balanced domains

    Why is it difficult to learn in imbalanced domains?

    Class imbalance is not the only factor responsible for the lack of accuracy of an algorithm. Class overlapping also influences the behaviour of the algorithms, and it is very typical in these domains.

    N.V. Chawla, N. Japkowicz, A. Kolcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6:1 (2004) 1-6

  • Slide 12/96

    Learning in non-balanced domains

    Why learning from imbalanced data sets might be difficult?

    Four groups of negative examples:
    Noise examples
    Borderline examples: unsafe, since a small amount of noise can make them fall on the wrong side of the decision border.
    Redundant examples
    Safe examples

  • Slide 13/96

    Learning in non-balanced domains

    Why learning from imbalanced data sets might be difficult?
    Rare or exceptional cases correspond to small numbers of training examples in particular areas of the feature space. When learning a concept, the presence of rare cases in the domain is an important consideration. The reason why rare cases are of interest is that they cause small disjuncts in the learned classifier, which are known to be error-prone.

    In real-world domains, rare cases are unknown, since high-dimensional data cannot be visualized to reveal areas of low coverage.

    [Diagram: Dataset → Learner → Knowledge Model; the learner minimizes the learning error while maximizing generalization.]

    T. Jo, N. Japkowicz. Class imbalances versus small disjuncts. SIGKDD Explorations 6:1 (2004) 40-49

  • Slide 14/96

    Learning in non-balanced domains

    Why learning from imbalanced data sets might be difficult?

    Small disjuncts: focusing the problem.
    [Diagram: a small disjunct, or "starved niche", in the data; more small disjuncts lead to an overgeneral classifier.]

  • Slide 15/96

    Learning in non-balanced domains

    How can we evaluate an algorithm in imbalanced domains?

    Confusion matrix for a two-class problem:

                     Positive Prediction    Negative Prediction
    Positive Class   True Positive (TP)     False Negative (FN)
    Negative Class   False Positive (FP)    True Negative (TN)

    Classical evaluation:
    Accuracy Rate: (TP + TN) / N

    Accuracy doesn't take into account the False Negative Rate, which is very important in imbalanced problems.

  • Slide 16/96

    Learning in non-balanced domains

    Imbalanced evaluation based on the geometric mean:
    Positive true ratio: a+ = TP / (TP + FN)
    Negative true ratio: a- = TN / (FP + TN)
    g = sqrt(a+ · a-)

    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
    F-measure: (2 × precision × recall) / (recall + precision)

    R. Barandela, J.S. Sánchez, V. García, E. Rangel. Strategies for learning in class imbalance problems. Pattern Recognition 36:3 (2003) 849-851
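
    These measures follow directly from the four confusion-matrix counts. Below is a minimal Python sketch (not from the original slides; the counts are hypothetical) that computes them:

        import math

        def imbalance_metrics(tp, fn, fp, tn):
            """Compute the evaluation measures above from confusion-matrix counts."""
            a_pos = tp / (tp + fn)             # positive true ratio (recall)
            a_neg = tn / (fp + tn)             # negative true ratio
            g_mean = math.sqrt(a_pos * a_neg)  # geometric mean of both ratios
            precision = tp / (tp + fp)
            f_measure = 2 * precision * a_pos / (precision + a_pos)
            return g_mean, precision, a_pos, f_measure

        # Hypothetical counts: 100 positive and 900 negative examples.
        print(imbalance_metrics(tp=70, fn=30, fp=90, tn=810))

    For these hypothetical counts the accuracy is 0.88 even though 30% of the positives are missed, which is exactly what the g-mean (about 0.79) and the F-measure (about 0.54) expose.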

  • Slide 17/96

    Learning in non-balanced domains

    ROC Curves

    The confusion matrix is normalized by columns:

              Real
    Pred       P       N
     P        0.8    0.121
     N        0.2    0.879

    [Plot: the resulting point in ROC space, True Positives vs. False Positives.]

    A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30:7 (1997) 1145-1159

  • Slide 18/96

    Learning in non-balanced domains

    ROC curves can be built for crisp and soft classifiers:
    A crisp classifier (discrete) predicts a class among the candidates. A soft classifier predicts a class accompanied by a reliability value.

    [Plot: a ROC curve, True Positives vs. False Positives; the shaded area under it is the AUC.]

    AUC: Area under the ROC curve. A scalar quantity widely used for estimating classifier performance.
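
    As a usage sketch (assuming the scikit-learn library, which the slides do not mention), the curve and its AUC can be obtained from the real labels and the soft classifier's reliability values:

        from sklearn.metrics import roc_auc_score, roc_curve

        # Real labels (1 = positive/minority class) and the classifier's
        # reliability values for the positive class; both are made up here.
        y_true  = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
        y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.15, 0.9, 0.05]

        fpr, tpr, _ = roc_curve(y_true, y_score)        # points of the ROC curve
        print("AUC =", roc_auc_score(y_true, y_score))  # area under that curve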

  • Slide 19/96

    Learning in non-balanced domains

    ROC analysis oriented to data re-sampling in imbalanced domains:

    The re-sampling algorithm must allow adjusting the rate of under/over-sampling. Performance of the classifier is measured with over/under-sampling at 25%, 50%, 100%, 200%, 300%, etc. It can only be used with methods that allow the adjustment of this parameter.

    N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357

  • Slide 20/96

    Introduction to Imbalanced Data Sets

    Learning in non-balanced domains.
    Data balancing through re-sampling.
    State-of-the-art algorithm: SMOTE.

  • Slide 21/96

    Data Balancing through re-sampling

    Strategies to deal with imbalanced data sets:

    Over-Sampling: Random, Focused
    Under-Sampling: Random, Focused
    Cost Modifying

    Motivations: balance the training set, retain influential examples, remove noisy instances in the decision boundaries, reduce the training set. The two random variants are sketched below.
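
    A minimal sketch of the two random strategies (plain Python; the (features, label) pair representation and the "+"/"-" labels are illustrative assumptions):

        import random

        def random_resample(data, strategy="under"):
            """Randomly balance a two-class set of (features, label) pairs."""
            pos = [e for e in data if e[1] == "+"]  # minority class
            neg = [e for e in data if e[1] == "-"]  # majority class
            if strategy == "under":
                # random under-sampling: drop majority examples at random
                neg = random.sample(neg, len(pos))
            else:
                # random over-sampling: duplicate minority examples at random
                pos = pos + random.choices(pos, k=len(neg) - len(pos))
            return pos + neg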

  • Slide 22/96

    Data Balancing through re-sampling

    [Diagram: under-sampling reduces the number of majority-class (-) examples, while over-sampling increases the number of minority-class (+) examples, until both class sizes match.]

  • Slide 23/96

    Data Balancing through re-sampling

    [Diagram: re-sampling strategies located in the plane of # examples of + vs. # examples of -: Random/Focused Over-Sampling, Random/Focused Under-Sampling, and Cost Modifying.]

  • Slide 24/96

  • Slide 25/96

  • Slide 26/96

  • Slide 27/96

  • Slide 28/96

    Data Balancing through re-sampling

    Under-sampling: Tomek links

    Used to remove both noise and borderline examples of the majority class.

    Tomek link: let Ei and Ej belong to different classes, and let d(Ei, Ej) be the distance between them. The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej).
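
    The definition translates almost literally into code: (Ei, Ej) is a Tomek link exactly when Ei and Ej are mutual nearest neighbors of different classes. A naive O(n²) NumPy sketch (all names are illustrative):

        import numpy as np

        def tomek_links(X, y):
            """Return index pairs (i, j) of opposite-class examples forming Tomek links."""
            d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
            np.fill_diagonal(d, np.inf)
            links = []
            for i in range(len(X)):
                j = int(np.argmin(d[i]))  # nearest neighbour of Ei
                if y[i] != y[j] and int(np.argmin(d[j])) == i and i < j:
                    # mutual nearest neighbours of different classes: no El is
                    # closer to either of them, so (Ei, Ej) is a Tomek link
                    links.append((i, j))
            return links

    To under-sample, one would then drop the majority-class member of each returned pair (or both members, when the goal is data cleaning).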

  • Slide 29/96

  • Slide 30/96

    Data Balancing through re-sampling

    Under-sampling: OSS, CNN+TL, NCL

    One-sided selection (OSS): Tomek links + CNN.

    CNN + Tomek links (proposed by the authors): finding Tomek links is computationally demanding, so it would be computationally cheaper if it was performed on a data set already reduced by CNN.

    NCL: removes majority class examples. Different from OSS, it emphasizes data cleaning more than data reduction. Algorithm:
    Find the three nearest neighbors for each example Ei in the training set.
    If Ei belongs to the majority class and its three nearest neighbors classify it as minority class, then remove Ei.
    If Ei belongs to the minority class and its three nearest neighbors classify it as majority class, then remove the three nearest neighbors.
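
    A sketch of that NCL procedure, assuming numeric feature arrays, non-negative integer labels, and scikit-learn's neighbor search (none of which the slide prescribes):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def ncl(X, y, majority_label):
            """Neighborhood Cleaning Rule: boolean mask of examples to keep."""
            # 4 neighbours: the query point itself plus its 3 nearest neighbours
            _, idx = NearestNeighbors(n_neighbors=4).fit(X).kneighbors(X)
            keep = np.ones(len(X), dtype=bool)
            for i in range(len(X)):
                neigh = idx[i, 1:]                     # Ei's 3 nearest neighbours
                vote = np.bincount(y[neigh]).argmax()  # their majority vote
                if y[i] == majority_label and vote != majority_label:
                    keep[i] = False                    # remove the majority example
                elif y[i] != majority_label and vote == majority_label:
                    keep[neigh] = False                # remove the 3 neighbours
            return keep

        # Usage: mask = ncl(X, y, majority_label=0); X_clean, y_clean = X[mask], y[mask]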

  • Slide 31/96

    Introduction to Imbalanced Data Sets

    Learning in non-balanced domains.
    Data balancing through re-sampling.
    State-of-the-art algorithm: SMOTE.

  • Slide 32/96

    State-of-the-art algorithm: SMOTE.

    Over-sampling method: forms new minority class examples by interpolating between several minority class examples that lie together, in "feature space" rather than "data space".

    Algorithm: for each minority class example, introduce synthetic examples along the line segments joining any/all of the k minority class nearest neighbors.

    Note: depending upon the amount of over-sampling required, neighbors are randomly chosen from the k nearest neighbors. For example, if we are using 5 nearest neighbors and the amount of over-sampling needed is 200%, only two of the five nearest neighbors are chosen and one sample is generated in the direction of each.

  • Slide 33/96

    State-of-the-art algorithm: SMOTE.
    SMOTE: Synthetic Minority Over-sampling Technique

    Synthetic samples are generated in the following way:
    Take the difference between the feature vector under consideration and its nearest neighbor.
    Multiply this difference by a random number between 0 and 1.
    Add it to the feature vector under consideration.

    Example: consider a sample (6,4) and let (4,3) be its nearest neighbor; (6,4) is the sample for which the k-nearest neighbors are being identified, and (4,3) is one of them. Let:

    f1_1 = 6, f2_1 = 4, f2_1 - f1_1 = -2
    f1_2 = 4, f2_2 = 3, f2_2 - f1_2 = -1

    The new samples will be generated as
    (f1', f2') = (6,4) + rand(0-1) × (-2,-1)
    where rand(0-1) generates a random number between 0 and 1.
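
    Putting the three steps together, a compact NumPy sketch of the generation loop (X_min holds only the minority examples; N is the over-sampling amount as a multiple of the class size, so N=2 means 200%; all names are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)

        def smote(X_min, N, k=5):
            """Generate N * len(X_min) synthetic examples by interpolation (N <= k)."""
            dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
            np.fill_diagonal(dist, np.inf)
            knn = np.argsort(dist, axis=1)[:, :k]  # k nearest minority neighbours
            synthetic = []
            for i, x in enumerate(X_min):
                # choose N of the k neighbours at random, one synthetic sample each
                for j in rng.choice(knn[i], size=N, replace=False):
                    synthetic.append(x + rng.random() * (X_min[j] - x))
            return np.array(synthetic)

        # The slide's example: with x = (6,4) and neighbour (4,3),
        # each sample is (6,4) + rand(0-1) * (-2,-1).
        print(smote(np.array([[6.0, 4.0], [4.0, 3.0], [5.0, 6.0],
                              [7.0, 5.0], [6.0, 2.0], [8.0, 4.0]]), N=2))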

  • Slide 34/96

    State-of-the-art algorithm: SMOTE.

    But what if there is a majority sample nearby?

    [Figure: synthetic samples placed on the segments between minority samples, some of them close to majority samples. Legend: minority sample, majority sample, synthetic sample.]

    N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357

  • Slide 35/96

  • Slide 36/96

    State-of-the-art algorithm: SMOTE.

    SMOTE + Tomek links: SMOTE can generate artificial minority class examples too deep in the majority class space. Therefore, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.

  • Slide 37/96

    State-of-the-art algorithm: SMOTE.

    [Figure: the data set after applying SMOTE + Tomek links.]

  • Slide 38/96

    State-of-the-art algorithm: SMOTE.

    SMOTE + ENN: ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors.
    ENN removes more examples than the Tomek links do.
    ENN removes examples from both classes.
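
    Both hybrids are available off the shelf in the third-party imbalanced-learn package (an assumption; the slides predate it), so a usage sketch looks like:

        from sklearn.datasets import make_classification
        from imblearn.combine import SMOTEENN, SMOTETomek

        # A toy imbalanced problem: roughly 90% negative, 10% positive examples.
        X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

        X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)  # SMOTE + Tomek links
        X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)    # SMOTE + ENN
        print(len(X), "->", len(X_st), "and", len(X_se))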

  • Slide 39/96

    State-of-the-art algorithm: SMOTE.

    G.E.A.P.A. Batista, R.C. Prati, M.C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6:1 (2004) 20-29

  • Slide 40/96

    State-of-the-art algorithm: SMOTE.

    Adaptive Synthetic Minority Oversampling Method (ASMO)

    [Figure: ASMO first clusters the minority class and then generates synthetic samples per cluster. Legend: minority sample, majority sample, synthetic sample.]

  • Slide 41/96

    State-of-the-art algorithm: SMOTE.

    Borderline-SMOTE: generates synthetic examples between minority examples that are close to the borders.

    H. Han, W.-Y. Wang, B.-H. Mao. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: ICIC 2005. LNCS 3644 (2005) 878-887

  • Slide 42/96

    Some Advanced Topics: Classification with Imbalanced Data Sets,
    Subgroup Discovery and Data Complexity

    Outline
    Imbalanced Data Sets
    Subgroup Discovery
    Data Complexity

  • Slide 43/96

    Predictive DM:
    Classification (learning of rulesets, decision trees, ...)
    Prediction and estimation (regression)

    Descriptive DM:
    Description and summarization
    Dependency analysis (association rule learning)
    Discovery of properties and constraints
    Segmentation (clustering)

    Text, Web and image analysis

    [Figure: two scatter plots, one showing a predictive task (a hypothesis H separating + and - examples) and one showing a descriptive task (grouping unlabeled x examples).]

  • Slide 44/96

    Predictive vs. descriptive induction

    Predictive induction: inducing classifiers for solving classification and prediction tasks. Classification rule learning, decision trees, ... Bayesian classifier, ANN, SVM, ... Data analysis through hypothesis generation and testing.

    Descriptive induction: discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks. Symbolic clustering, association rule learning, subgroup discovery, ... Exploratory data analysis.

  • Slide 45/96

    Predictive vs. descriptive induction: a rule learning perspective

    Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks.

    Descriptive induction: discovers individual rules describing interesting regularities in the data.

    Therefore: different goals, different heuristics, different evaluation criteria.

  • Slide 46/96

    Predictive vs. descriptive induction: a rule learning perspective

    Prediction models: applied for inductive prediction and composed of rule sets used for classification.

    Kweku-Muata Osei-Bryson. Evaluation of decision trees: a multicriteria approach.

    Training data:

      Age   Car type   Risk
      20    Combi      High
      18    Sport      High
      40    Sport      High
      35    Minivan    Low
      30    Combi      High
      32    Familiar   Low

    Training data → Extraction algorithm (IND, S-Plus Trees, C4.5, CN2, FACT, QUEST, CART, OC1, LMDT, CAL5, T1) → Classifier (model)

    [Figure: a decision tree induced from the training data, with splits Age < 31 and Car Type = Sport.]