DM SC 07 Some Advanced Topics


  • Slide 1/96

    Data Mining and Soft Computing

    Some Advanced Topics: Classification with Imbalanced Data Sets,
    Subgroup Discovery and Data Complexity

    Francisco Herrera
    Research Group on Soft Computing and Information Intelligent Systems (SCI2S)
    Dept. of Computer Science and A.I., University of Granada
    Email: herrera@decsai.ugr.es
    http://decsai.ugr.es/~herrera

  • Slide 2/96

    Data Mining and Soft Computing

    Summary
    1. Introduction to Data Mining and Knowledge Discovery
    2. Data Preparation
    3. Introduction to Prediction, Classification, Clustering and Association
    4. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
    5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
    6. Genetic Fuzzy Systems: State of the Art and New Trends
    7. Some Advanced Topics: Classification with Imbalanced Data Sets, Subgroup Discovery, Data Complexity
    8. Final talk: How must I Do my Experimental Study? Design of Experiments. Non-parametric Tests. Some Cases of Study.

  • Slide 3/96

    Some Advanced Topics: Classification with Imbalanced Data Sets,
    Subgroup Discovery and Data Complexity

    Outline
    Imbalanced Data Sets
    Subgroup Discovery
    Data Complexity

  • Slide 4/96

  • Slide 5/96

    Imbalanced Data Sets

    Presentation
    In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than of the other.

    Such a situation poses challenges for typical classifiers such as decision tree induction systems or multi-layer perceptrons that are designed to optimize overall accuracy without taking into account the relative distribution of each class.

    As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately.

    Such a problem occurs in a large number of practical domains and is often dealt with by using re-sampling or cost-based methods.

    This talk introduces classification with imbalanced data sets, analyzing in depth the solutions based on re-sampling.

  • Slide 6/96

    Introduction to Imbalanced Data Sets

    Learning in non-balanced domains.
    Data balancing through re-sampling.
    State-of-the-art algorithm: SMOTE.

  • Slide 7/96

  • Slide 8/96

    Learning in non-balanced domains

    Data sets are said to be balanced if there are, approximately, as many positive examples as negative ones.

    The positive examples are usually the more interesting ones, or their misclassification is more costly.

    [Figure: scatter plot of an imbalanced data set, with many negative (-) examples and only a few positive (+) examples.]

    G. Cohen, M. Hilario, H. Sax, S. Hugonnet, A. Geissbuhler. Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine 37 (2006) 7-18

  • Slide 9/96

    Learning in non-balanced domains

    The classes of small size are usually labeled as rare cases (rarities).

    The most important knowledge usually resides in the rare cases.

    These cases are common in classification problems. E.g.: detection of uncommon diseases. Imbalanced data: few sick persons and lots of healthy persons.

    Some real problems:
    Fraudulent credit card transactions
    Learning word pronunciation
    Prediction of telecommunications equipment failures
    Detection of oil spills from satellite images
    Detection of melanomas
    Intrusion detection
    Insurance risk modeling
    Hardware fault detection

  • Slide 10/96

    Learning in non-balanced domains

    Problem:
    The problem with class imbalances is that standard learners are often biased towards the majority class. That is because these classifiers attempt to reduce global quantities such as the error rate, not taking the data distribution into consideration.

    Result:
    Examples from the overwhelming class are well-classified, whereas examples from the minority class tend to be misclassified.

  • Slide 11/96

    Learning in non-balanced domains

    Why is it difficult to learn in imbalanced domains?

    Class imbalance is not the only factor responsible for the lack of accuracy of an algorithm. Class overlapping also influences the behaviour of the algorithms, and it is very typical in these domains.

    N.V. Chawla, N. Japkowicz, A. Kolcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6:1 (2004) 1-6

  • Slide 12/96

    Learning in non-balanced domains

    Why learning from imbalanced data sets might be difficult?

    Four groups of negative examples:
    Noise examples
    Borderline examples: unsafe, since a small amount of noise can make them fall on the wrong side of the decision border.
    Redundant examples
    Safe examples

  • Slide 13/96

    Learning in non-balanced domains

    Why learning from imbalanced data sets might be difficult?
    Rare or exceptional cases correspond to small numbers of training examples in particular areas of the feature space. When learning a concept, the presence of rare cases in the domain is an important consideration. The reason why rare cases are of interest is that they cause small disjuncts in the learned classifier, which are known to be error-prone.

    In real-world domains, rare cases are unknown, since high-dimensional data cannot be visualized to reveal areas of low coverage.

    [Diagram: Dataset → Learner → Knowledge Model; the learner minimizes the learning error while maximizing generalization.]

    T. Jo, N. Japkowicz. Class imbalances versus small disjuncts. SIGKDD Explorations 6:1 (2004) 40-49

  • Slide 14/96

    Learning in non-balanced domains

    Why learning from imbalanced data sets might be difficult?

    Small disjuncts: focusing the problem.
    [Diagram: a small disjunct, or "starved niche", in the data; more small disjuncts lead to an overgeneral classifier.]

  • Slide 15/96

    Learning in non-balanced domains

    How can we evaluate an algorithm in imbalanced domains?

    Confusion matrix for a two-class problem:

                     Positive Prediction    Negative Prediction
    Positive Class   True Positive (TP)     False Negative (FN)
    Negative Class   False Positive (FP)    True Negative (TN)

    Classical evaluation:
    Accuracy Rate: (TP + TN) / N

    Accuracy doesn't take into account the False Negative Rate, which is very important in imbalanced problems.

  • Slide 16/96

    Learning in non-balanced domains

    Imbalanced evaluation based on the geometric mean:
    Positive true ratio: a+ = TP / (TP + FN)
    Negative true ratio: a- = TN / (FP + TN)
    g = sqrt(a+ · a-)

    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
    F-measure: (2 × precision × recall) / (recall + precision)

    R. Barandela, J.S. Sánchez, V. García, E. Rangel. Strategies for learning in class imbalance problems. Pattern Recognition 36:3 (2003) 849-851
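
    These measures follow directly from the four confusion-matrix counts. Below is a minimal Python sketch (not from the original slides; the counts are hypothetical) that computes them:

        import math

        def imbalance_metrics(tp, fn, fp, tn):
            """Compute the evaluation measures above from confusion-matrix counts."""
            a_pos = tp / (tp + fn)             # positive true ratio (recall)
            a_neg = tn / (fp + tn)             # negative true ratio
            g_mean = math.sqrt(a_pos * a_neg)  # geometric mean of both ratios
            precision = tp / (tp + fp)
            f_measure = 2 * precision * a_pos / (precision + a_pos)
            return g_mean, precision, a_pos, f_measure

        # Hypothetical counts: 100 positive and 900 negative examples.
        print(imbalance_metrics(tp=70, fn=30, fp=90, tn=810))

    For these hypothetical counts the accuracy is 0.88 even though 30% of the positives are missed, which is exactly what the g-mean (about 0.79) and the F-measure (about 0.54) expose.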

  • Slide 17/96

    Learning in non-balanced domains

    ROC Curves

    The confusion matrix is normalized by columns:

              Real
    Pred       P       N
     P        0.8    0.121
     N        0.2    0.879

    [Plot: the resulting point in ROC space, True Positives vs. False Positives.]

    A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30:7 (1997) 1145-1159

  • Slide 18/96

    Learning in non-balanced domains

    ROC curves can be built for crisp and soft classifiers:
    A crisp classifier (discrete) predicts a class among the candidates. A soft classifier predicts a class accompanied by a reliability value.

    [Plot: a ROC curve, True Positives vs. False Positives; the shaded area under it is the AUC.]

    AUC: Area under the ROC curve. A scalar quantity widely used for estimating classifier performance.
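
    As a usage sketch (assuming the scikit-learn library, which the slides do not mention), the curve and its AUC can be obtained from the real labels and the soft classifier's reliability values:

        from sklearn.metrics import roc_auc_score, roc_curve

        # Real labels (1 = positive/minority class) and the classifier's
        # reliability values for the positive class; both are made up here.
        y_true  = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
        y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.15, 0.9, 0.05]

        fpr, tpr, _ = roc_curve(y_true, y_score)        # points of the ROC curve
        print("AUC =", roc_auc_score(y_true, y_score))  # area under that curve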

  • Slide 19/96

    Learning in non-balanced domains

    ROC analysis oriented to data re-sampling in imbalanced domains:

    The re-sampling algorithm must allow adjusting the rate of under/over-sampling. Performance of the classifier is measured with over/under-sampling at 25%, 50%, 100%, 200%, 300%, etc. It can only be used with methods that allow the adjustment of this parameter.

    N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357

  • Slide 20/96

    Introduction to Imbalanced Data Sets

    Learning in non-balanced domains.
    Data balancing through re-sampling.
    State-of-the-art algorithm: SMOTE.

  • Slide 21/96

    Data Balancing through re-sampling

    Strategies to deal with imbalanced data sets:

    Over-Sampling: Random, Focused
    Under-Sampling: Random, Focused
    Cost Modifying

    Motivations: balance the training set, retain influential examples, remove noisy instances in the decision boundaries, reduce the training set. The two random variants are sketched below.
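
    A minimal sketch of the two random strategies (plain Python; the (features, label) pair representation and the "+"/"-" labels are illustrative assumptions):

        import random

        def random_resample(data, strategy="under"):
            """Randomly balance a two-class set of (features, label) pairs."""
            pos = [e for e in data if e[1] == "+"]  # minority class
            neg = [e for e in data if e[1] == "-"]  # majority class
            if strategy == "under":
                # random under-sampling: drop majority examples at random
                neg = random.sample(neg, len(pos))
            else:
                # random over-sampling: duplicate minority examples at random
                pos = pos + random.choices(pos, k=len(neg) - len(pos))
            return pos + neg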

  • Slide 22/96

    Data Balancing through re-sampling

    [Diagram: under-sampling reduces the number of majority-class (-) examples, while over-sampling increases the number of minority-class (+) examples, until both class sizes match.]

  • Slide 23/96

    Data Balancing through re-sampling

    [Diagram: re-sampling strategies located in the plane of # examples of + vs. # examples of -: Random/Focused Over-Sampling, Random/Focused Under-Sampling, and Cost Modifying.]

  • Slide 24/96

  • Slide 25/96

  • Slide 26/96

  • Slide 27/96

  • Slide 28/96

    Data Balancing through re-sampling

    Under-sampling: Tomek links

    Used to remove both noise and borderline examples of the majority class.

    Tomek link: let Ei and Ej belong to different classes, and let d(Ei, Ej) be the distance between them. The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej).
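
    The definition translates almost literally into code: (Ei, Ej) is a Tomek link exactly when Ei and Ej are mutual nearest neighbors of different classes. A naive O(n²) NumPy sketch (all names are illustrative):

        import numpy as np

        def tomek_links(X, y):
            """Return index pairs (i, j) of opposite-class examples forming Tomek links."""
            d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
            np.fill_diagonal(d, np.inf)
            links = []
            for i in range(len(X)):
                j = int(np.argmin(d[i]))  # nearest neighbour of Ei
                if y[i] != y[j] and int(np.argmin(d[j])) == i and i < j:
                    # mutual nearest neighbours of different classes: no El is
                    # closer to either of them, so (Ei, Ej) is a Tomek link
                    links.append((i, j))
            return links

    To under-sample, one would then drop the majority-class member of each returned pair (or both members, when the goal is data cleaning).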

  • Slide 29/96

  • Slide 30/96

    Data Balancing through re-sampling

    Under-sampling: OSS, CNN+TL, NCL

    One-sided selection (OSS): Tomek links + CNN.

    CNN + Tomek links (proposed by the authors): finding Tomek links is computationally demanding, so it would be computationally cheaper if it was performed on a data set already reduced by CNN.

    NCL: removes majority class examples. Different from OSS, it emphasizes data cleaning more than data reduction. Algorithm:
    Find the three nearest neighbors for each example Ei in the training set.
    If Ei belongs to the majority class and its three nearest neighbors classify it as minority class, then remove Ei.
    If Ei belongs to the minority class and its three nearest neighbors classify it as majority class, then remove the three nearest neighbors.
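
    A sketch of that NCL procedure, assuming numeric feature arrays, non-negative integer labels, and scikit-learn's neighbor search (none of which the slide prescribes):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def ncl(X, y, majority_label):
            """Neighborhood Cleaning Rule: boolean mask of examples to keep."""
            # 4 neighbours: the query point itself plus its 3 nearest neighbours
            _, idx = NearestNeighbors(n_neighbors=4).fit(X).kneighbors(X)
            keep = np.ones(len(X), dtype=bool)
            for i in range(len(X)):
                neigh = idx[i, 1:]                     # Ei's 3 nearest neighbours
                vote = np.bincount(y[neigh]).argmax()  # their majority vote
                if y[i] == majority_label and vote != majority_label:
                    keep[i] = False                    # remove the majority example
                elif y[i] != majority_label and vote == majority_label:
                    keep[neigh] = False                # remove the 3 neighbours
            return keep

        # Usage: mask = ncl(X, y, majority_label=0); X_clean, y_clean = X[mask], y[mask]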

  • Slide 31/96

    Introduction to Imbalanced Data Sets

    Learning in non-balanced domains.
    Data balancing through re-sampling.
    State-of-the-art algorithm: SMOTE.

  • Slide 32/96

    State-of-the-art algorithm: SMOTE.

    Over-sampling method: forms new minority class examples by interpolating between several minority class examples that lie together, in "feature space" rather than "data space".

    Algorithm: for each minority class example, introduce synthetic examples along the line segments joining any/all of the k minority class nearest neighbors.

    Note: depending upon the amount of over-sampling required, neighbors are randomly chosen from the k nearest neighbors. For example, if we are using 5 nearest neighbors and the amount of over-sampling needed is 200%, only two of the five nearest neighbors are chosen and one sample is generated in the direction of each.

  • Slide 33/96

    State-of-the-art algorithm: SMOTE.
    SMOTE: Synthetic Minority Over-sampling Technique

    Synthetic samples are generated in the following way:
    Take the difference between the feature vector under consideration and its nearest neighbor.
    Multiply this difference by a random number between 0 and 1.
    Add it to the feature vector under consideration.

    Example: consider a sample (6,4) and let (4,3) be its nearest neighbor; (6,4) is the sample for which the k-nearest neighbors are being identified, and (4,3) is one of them. Let:

    f1_1 = 6, f2_1 = 4, f2_1 - f1_1 = -2
    f1_2 = 4, f2_2 = 3, f2_2 - f1_2 = -1

    The new samples will be generated as
    (f1', f2') = (6,4) + rand(0-1) × (-2,-1)
    where rand(0-1) generates a random number between 0 and 1.
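
    Putting the three steps together, a compact NumPy sketch of the generation loop (X_min holds only the minority examples; N is the over-sampling amount as a multiple of the class size, so N=2 means 200%; all names are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)

        def smote(X_min, N, k=5):
            """Generate N * len(X_min) synthetic examples by interpolation (N <= k)."""
            dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
            np.fill_diagonal(dist, np.inf)
            knn = np.argsort(dist, axis=1)[:, :k]  # k nearest minority neighbours
            synthetic = []
            for i, x in enumerate(X_min):
                # choose N of the k neighbours at random, one synthetic sample each
                for j in rng.choice(knn[i], size=N, replace=False):
                    synthetic.append(x + rng.random() * (X_min[j] - x))
            return np.array(synthetic)

        # The slide's example: with x = (6,4) and neighbour (4,3),
        # each sample is (6,4) + rand(0-1) * (-2,-1).
        print(smote(np.array([[6.0, 4.0], [4.0, 3.0], [5.0, 6.0],
                              [7.0, 5.0], [6.0, 2.0], [8.0, 4.0]]), N=2))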

  • Slide 34/96

    State-of-the-art algorithm: SMOTE.

    But what if there is a majority sample nearby?

    [Figure: synthetic samples placed on the segments between minority samples, some of them close to majority samples. Legend: minority sample, majority sample, synthetic sample.]

    N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357

  • Slide 35/96

  • Slide 36/96

    State-of-the-art algorithm: SMOTE.

    SMOTE + Tomek links: SMOTE can generate artificial minority class examples too deep in the majority class space. Therefore, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.

  • Slide 37/96

    State-of-the-art algorithm: SMOTE.

    [Figure: the data set after applying SMOTE + Tomek links.]

  • Slide 38/96

    State-of-the-art algorithm: SMOTE.

    SMOTE + ENN: ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors.
    ENN removes more examples than the Tomek links do.
    ENN removes examples from both classes.
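
    Both hybrids are available off the shelf in the third-party imbalanced-learn package (an assumption; the slides predate it), so a usage sketch looks like:

        from sklearn.datasets import make_classification
        from imblearn.combine import SMOTEENN, SMOTETomek

        # A toy imbalanced problem: roughly 90% negative, 10% positive examples.
        X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

        X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)  # SMOTE + Tomek links
        X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)    # SMOTE + ENN
        print(len(X), "->", len(X_st), "and", len(X_se))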

  • Slide 39/96

    State-of-the-art algorithm: SMOTE.

    G.E.A.P.A. Batista, R.C. Prati, M.C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6:1 (2004) 20-29

  • Slide 40/96

    State-of-the-art algorithm: SMOTE.

    Adaptive Synthetic Minority Oversampling Method (ASMO)

    [Figure: ASMO first clusters the minority class and then generates synthetic samples per cluster. Legend: minority sample, majority sample, synthetic sample.]

  • Slide 41/96

    State-of-the-art algorithm: SMOTE.

    Borderline-SMOTE: generates synthetic examples between minority examples that are close to the borders.

    H. Han, W.-Y. Wang, B.-H. Mao. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: ICIC 2005. LNCS 3644 (2005) 878-887

  • Slide 42/96

    Some Advanced Topics: Classification with Imbalanced Data Sets,
    Subgroup Discovery and Data Complexity

    Outline
    Imbalanced Data Sets
    Subgroup Discovery
    Data Complexity

  • Slide 43/96

    Predictive DM:
    Classification (learning of rulesets, decision trees, ...)
    Prediction and estimation (regression)

    Descriptive DM:
    Description and summarization
    Dependency analysis (association rule learning)
    Discovery of properties and constraints
    Segmentation (clustering)

    Text, Web and image analysis

    [Figure: two scatter plots, one showing a predictive task (a hypothesis H separating + and - examples) and one showing a descriptive task (grouping unlabeled x examples).]

  • Slide 44/96

    Predictive vs. descriptive induction

    Predictive induction: inducing classifiers for solving classification and prediction tasks. Classification rule learning, decision trees, ... Bayesian classifier, ANN, SVM, ... Data analysis through hypothesis generation and testing.

    Descriptive induction: discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks. Symbolic clustering, association rule learning, subgroup discovery, ... Exploratory data analysis.

  • Slide 45/96

    Predictive vs. descriptive induction: a rule learning perspective

    Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks.

    Descriptive induction: discovers individual rules describing interesting regularities in the data.

    Therefore: different goals, different heuristics, different evaluation criteria.

  • Slide 46/96

    Predictive vs. descriptive induction: a rule learning perspective

    Prediction models: applied for inductive prediction and composed of rule sets used for classification.

    Kweku-Muata Osei-Bryson. Evaluation of decision trees: a multicriteria approach.

    Training data:

      Age   Car type   Risk
      20    Combi      High
      18    Sport      High
      40    Sport      High
      35    Minivan    Low
      30    Combi      High
      32    Familiar   Low

    Training data → Extraction algorithm (IND, S-Plus Trees, C4.5, CN2, FACT, QUEST, CART, OC1, LMDT, CAL5, T1) → Classifier (model)

    [Figure: a decision tree induced from the training data, with splits Age < 31 and Car Type = Sport.]