Data Mining Classification and Prediction by Dr. Tanvir Ahmed


  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    November 16, 2015, Data Mining: Concepts and Techniques

    Classification and Prediction

    Classification and Prediction

      • What is classification? What is prediction?
      • Issues regarding classification and prediction
      • Classification by decision tree induction
      • Bayesian classification
      • Lazy learners (or learning from your neighbors)
      • Other classification methods
      • Prediction
      • Accuracy and error measures
      • Ensemble methods
      • Model selection
      • Summary

    Classification vs. Prediction

      • Classification
          - predicts categorical class labels (discrete or nominal)
          - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model in classifying new data
      • Prediction
          - models continuous-valued functions, i.e., predicts unknown or missing values
      • Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

    Classification: A Two-Step Process

      • Model construction: describing a set of predetermined classes
          - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
          - The set of tuples used for model construction is the training set
          - The model is represented as classification rules, decision trees, or mathematical formulae
      • Model usage: classifying future or unknown objects
          - Estimate the accuracy of the model
              - The known label of each test sample is compared with the classified result from the model
              - Accuracy rate is the percentage of test-set samples that are correctly classified by the model
              - The test set is independent of the training set, otherwise over-fitting will occur
          - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

    Process (1): Model Construction

    Training data:

        NAME  RANK            YEARS  TENURED
        Mike  Assistant Prof  3      no
        Mary  Assistant Prof  7      yes
        Bill  Professor       2      yes
        Jim   Associate Prof  7      yes
        Dave  Assistant Prof  6      no
        Anne  Associate Prof  3      no

    The classification algorithm produces the classifier (model):

        IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

    Process (2): Using the Model in Prediction

    Testing data:

        NAME     RANK            YEARS  TENURED
        Tom      Assistant Prof  2      no
        Merlisa  Associate Prof  7      no
        George   Professor       5      yes
        Joseph   Assistant Prof  7      yes

    Unseen data: (Jeff, Professor, 4). Tenured?
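The two-step process can be sketched in a few lines. A minimal sketch: the rule is the one the model-construction slide produced, applied to the test tuples from the table above.

```python
# Sketch of the two-step process on the tenured example: the rule below is
# the classifier from the model-construction slide, applied to the test data.
def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [  # (name, rank, years, actual label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(r, y) == label for _, r, y, label in test_data)
print(f"accuracy = {correct}/{len(test_data)}")  # 3/4: Merlisa is misclassified
print(predict_tenured("Professor", 4))           # unseen tuple (Jeff, Professor, 4) -> yes
```

Note that the test set is exactly where over-fitting would show up: Merlisa triggers the `years > 6` clause yet is not tenured, so the accuracy rate is 3/4.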

    Supervised vs. Unsupervised Learning

      • Supervised learning (classification)
          - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
          - New data is classified based on the training set
      • Unsupervised learning (clustering)
          - The class labels of the training data are unknown

    Chapter 6. Classification and Prediction

      • What is classification? What is prediction?
      • Issues regarding classification and prediction
      • Classification by decision tree induction
      • Bayesian classification
      • Support Vector Machines (SVM)
      • Lazy learners (or learning from your neighbors)
      • Other classification methods
      • Prediction
      • Accuracy and error measures
      • Ensemble methods
      • Model selection
      • Summary

    Issues: Data Preparation

      • Data cleaning
          - Preprocess data in order to reduce noise and handle missing values
      • Relevance analysis (feature selection)
          - Remove the irrelevant or redundant attributes
      • Data transformation
          - Generalize and/or normalize data

    Issues: Evaluating Classification Methods

      • Accuracy
          - classifier accuracy: predicting class label
          - predictor accuracy: guessing the value of predicted attributes
      • Speed
          - time to construct the model (training time)
          - time to use the model (classification/prediction time)
      • Robustness: handling noise and missing values
      • Scalability: efficiency in disk-resident databases
      • Interpretability: understanding and insight provided by the model
      • Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

    Chapter 6. Classification and Prediction

      • What is classification? What is prediction?
      • Issues regarding classification and prediction
      • Classification by decision tree induction
      • Bayesian classification
      • Rule-based classification
      • Classification by backpropagation
      • Support Vector Machines (SVM)
      • Associative classification
      • Lazy learners (or learning from your neighbors)
      • Other classification methods
      • Prediction
      • Accuracy and error measures
      • Ensemble methods
      • Model selection
      • Summary

    Decision Tree Induction: Training Dataset

        age     income  student  credit_rating  buys_computer
        <=30    high    no       fair           no
        <=30    high    no       excellent      no
        31..40  high    no       fair           yes
        >40     medium  no       fair           yes
        >40     low     yes      fair           yes
        >40     low     yes      excellent      no
        31..40  low     yes      excellent      yes
        <=30    medium  no       fair           no
        <=30    low     yes      fair           yes
        >40     medium  yes      fair           yes
        <=30    medium  yes      excellent      yes
        31..40  medium  no       excellent      yes
        31..40  high    yes      fair           yes
        >40     medium  no       excellent      no

    This follows an example of Quinlan's ID3 (Playing Tennis).

    Output: A Decision Tree for buys_computer

        age?
         ├─ <=30   → student?
         │           ├─ no  → no
         │           └─ yes → yes
         ├─ 31..40 → yes
         └─ >40    → credit_rating?
                     ├─ excellent → no
                     └─ fair      → yes

    Algorithm for Decision Tree Induction

      • Basic algorithm (a greedy algorithm)
          - The tree is constructed in a top-down recursive divide-and-conquer manner
          - At start, all the training examples are at the root
          - Attributes are categorical (if continuous-valued, they are discretized in advance)
          - Examples are partitioned recursively based on selected attributes
          - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
      • Conditions for stopping partitioning
          - All samples for a given node belong to the same class
          - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
          - There are no samples left
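The basic algorithm above can be sketched as a short recursive procedure. The nested-dict tree representation and the pluggable `select_attribute` callback are illustrative choices, not from the slides.

```python
# Compact sketch of greedy top-down divide-and-conquer induction, with
# majority voting at the stopping conditions described above.
from collections import Counter

def majority(labels):
    """Majority voting, used when no attributes remain."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs, select_attribute):
    if len(set(labels)) == 1:        # all samples in one class -> leaf
        return labels[0]
    if not attrs or not rows:        # no attributes / no samples left
        return majority(labels)
    best = select_attribute(rows, labels, attrs)   # e.g., highest info gain
    node = {}
    for value in {row[best] for row in rows}:      # partition on each value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[(best, value)] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best], select_attribute)
    return node

# Trivial selector (always the first attribute) just to exercise the skeleton:
tree = build_tree([{"outlook": "sunny"}, {"outlook": "rain"}], ["no", "yes"],
                  ["outlook"], lambda rows, labels, attrs: attrs[0])
print(tree)  # maps ('outlook', value) keys to leaf labels
```

Any of the measures on the following slides (information gain, gain ratio, gini) can be dropped in as `select_attribute`.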

    Why is the decision tree popular?

      • Does not require domain knowledge
      • Can handle multidimensional data
      • Easy to understand

    Attribute Selection Measure: Information Gain (ID3/C4.5)

      • Select the attribute with the highest information gain (least impurity)
      • Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
      • Expected information (entropy) needed to classify a tuple in D:

            Info(D) = - sum_{i=1..m} p_i log2(p_i)

      • Information needed (after using A to split D into v partitions) to classify D:

            Info_A(D) = sum_{j=1..v} (|D_j| / |D|) × Info(D_j)

      • Information gained by branching on attribute A:

            Gain(A) = Info(D) - Info_A(D)
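A runnable sketch of these formulas, checked against the age column of the 14-tuple buys_computer training set (the dict-of-rows representation is an illustrative choice):

```python
# Runnable sketch of Info(D), Info_A(D), and Gain(A) on the age attribute.
from math import log2
from collections import Counter

def info(labels):
    """Info(D) = -sum p_i log2 p_i over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    n = len(labels)
    info_a = 0.0
    for v in {r[attr] for r in rows}:
        part = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        info_a += len(part) / n * info(part)
    return info(labels) - info_a

ages = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
        "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
rows = [{"age": a} for a in ages]

print(round(info(labels), 3))               # 0.94
print(round(gain(rows, labels, "age"), 3))  # 0.247 (the slides' 0.246 rounds intermediates)
```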

    Attribute Selection: Information Gain

      • Class P: buys_computer = "yes"; Class N: buys_computer = "no"

            Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

      • Splitting on age:

            Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

        (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yeses and 3 nos. Hence

            Gain(age) = Info(D) - Info_age(D) = 0.246

      • Similarly,

            Gain(income) = 0.029
            Gain(student) = 0.151
            Gain(credit_rating) = 0.048

    Computing Information Gain for Continuous-Valued Attributes

      • Let attribute A be a continuous-valued attribute
      • Must determine the best split point for A
          - Sort the values of A in increasing order
          - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
                (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
          - The point with the minimum expected information requirement for A is selected as the split point for A
      • Split:
          - D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
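As a quick sketch, the candidate split points for a sorted list of continuous values are just adjacent midpoints (the temperature values here are hypothetical, not from the slides):

```python
# Candidate split points for a continuous attribute: midpoints between
# adjacent sorted values (hypothetical temperature readings).
values = sorted([70, 64, 68, 71, 65, 69])
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(midpoints)   # [64.5, 66.5, 68.5, 69.5, 70.5]
```

Each candidate would then be scored by the expected information requirement Info_A(D) of the binary split it induces, and the minimum wins.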

    Gain Ratio for Attribute Selection (C4.5)

      • The information gain measure is biased towards attributes with a large number of values
      • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain)
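For reference, the standard C4.5 definitions (the formulas themselves did not appear on the slide, so they are supplied here from the usual presentation): gain ratio divides the gain by the split information of the partitioning attribute.

```latex
SplitInfo_A(D) = - \sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
```

The attribute with the maximum gain ratio is selected as the splitting attribute.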

    Gini Index (CART, IBM IntelligentMiner)

      • If a data set D contains examples from n classes, the gini index gini(D) is defined as

            gini(D) = 1 - sum_{j=1..n} p_j^2

        where p_j is the relative frequency of class j in D
      • If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

            gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

      • Reduction in impurity:

            Δgini(A) = gini(D) - gini_A(D)

      • The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

    Gini Index (CART, IBM IntelligentMiner)

      • Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

            gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

      • Suppose the attribute income partitions D into 10 in D1: {low, medium} and 4 in D2: {high}:

            gini_{income in {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

        but gini_{medium,high} is 0.30 and thus the best since it is the lowest
      • All attributes are assumed continuous-valued
      • May need other tools, e.g., clustering, to get the possible split values
      • Can be modified for categorical attributes
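The gini numbers above can be checked with a few lines; the class counts are read off the 14-tuple buys_computer training table.

```python
# Checking the gini computations on the buys_computer class counts.
def gini(counts):
    """gini(D) = 1 - sum of squared class frequencies."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([9, 5]), 3))    # 0.459 -> gini(D)

# Split on income: D1 = {low, medium} has 7 yes / 3 no,
# D2 = {high} has 2 yes / 2 no; weight each subset by its size.
g_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(g_split, 3))         # 0.443 for the {low, medium} / {high} split
```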

    Comparing Attribute Selection Measures

      • The three measures, in general, return good results, but
          - Information gain: biased towards multivalued attributes

    Other Attribute Selection Measures

      • CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
      • C-SEP: performs better than info. gain and gini index in certain cases

    Overfitting and Tree Pruning

      • Overfitting: an induced tree may overfit the training data
          - Too many branches, some may reflect anomalies due to noise or outliers
          - Poor accuracy for unseen samples
      • Two approaches to avoid overfitting
          - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
              - Difficult to choose an appropriate threshold
          - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
              - Use a set of data different from the training data to decide which is the "best pruned tree"

    Enhancements to Basic Decision Tree Induction

      • Allow for continuous-valued attributes
          - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
      • Handle missing attribute values
          - Assign the most common value of the attribute
          - Assign probability to each of the possible values
      • Attribute construction
          - Create new attributes based on existing ones that are sparsely represented
          - This reduces fragmentation, repetition, and replication

    Classification in Large Databases

      • Classification: a classical problem extensively studied by statisticians and machine learning researchers
      • Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
      • Why decision tree induction in data mining?
          - relatively faster learning speed (than other classification methods)
          - convertible to simple and easy-to-understand classification rules
          - can use SQL queries for accessing databases
          - comparable classification accuracy with other methods

    Scalable Decision Tree Induction Methods

      • SLIQ (EDBT'96, Mehta et al.)
          - Builds an index for each attribute; only the class list and the current attribute list reside in memory
      • SPRINT (VLDB'96, J. Shafer et al.)
          - Constructs an attribute-list data structure
      • PUBLIC (VLDB'98, Rastogi & Shim)
          - Integrates tree splitting and tree pruning: stop growing the tree earlier
      • RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)

    Scalability Framework for RainForest

      • Separates the scalability aspects from the criteria that determine the quality of the tree
      • Builds an AVC-list: AVC (Attribute, Value, Class_label)
      • AVC-set of an attribute X
          - Projection of the training dataset onto the attribute X and class label, where counts of the individual class labels are aggregated
      • AVC-group of a node n
          - Set of AVC-sets of all predictor attributes at the node n

    RainForest: Training Set and Its AVC Sets

    Training examples: the 14-tuple buys_computer table.

    AVC-set on age:
        age     yes  no
        <=30     2    3
        31..40   4    0
        >40      3    2

    AVC-set on income:
        income  yes  no
        high     2    2
        medium   4    2
        low      3    1

    AVC-set on student:
        student  yes  no
        yes       6    1
        no        3    4

    AVC-set on credit_rating:
        credit_rating  yes  no
        fair            6    2
        excellent       3    3

    Data Cube-Based Decision-Tree Induction

      • Integration of generalization with decision-tree induction (Kamber et al.'97)
      • Classification at primitive concept levels
          - E.g., precise temperature, humidity, outlook, etc.
          - Low-level concepts, scattered classes, bushy classification trees
          - Semantic interpretation problems
      • Cube-based multi-level classification
          - Relevance analysis at multiple levels
          - Information-gain analysis with dimension + level

    BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

      • Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
      • Each subset is used to create a tree, resulting in several trees
      • These trees are examined and used to construct a new tree T'
          - It turns out that T' is very close to the tree that would be generated using the whole data set together
      • Adv: requires only two scans of the DB; an incremental algorithm

    Presentation of Classification Results

    [Screenshot figure of classification results]

    Visualization of a Decision Tree (SGI/MineSet 3.0)

    [Screenshot figure]

    Interactive Visual Mining by Perception-Based Classification (PBC)

    [Screenshot figure]

    Chapter 6. Classification and Prediction

      • What is classification? What is prediction?
      • Issues regarding classification and prediction
      • Classification by decision tree induction
      • Bayesian classification
      • Rule-based classification
      • Classification by backpropagation
      • Support Vector Machines (SVM)
      • Associative classification
      • Lazy learners (or learning from your neighbors)
      • Other classification methods
      • Prediction
      • Accuracy and error measures
      • Ensemble methods
      • Model selection
      • Summary

    Bayesian Classification: Why?

      • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
      • Foundation: based on Bayes' Theorem
      • Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
      • Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
      • Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

    Bayesian Theorem: Basics

      • Let X be a data sample ("evidence"): class label is unknown
      • Let H be a hypothesis that X belongs to class C
      • Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
      • P(H) (prior probability): the initial probability
          - E.g., X will buy a computer, regardless of age, income, …
      • P(X): probability that the sample data is observed
      • P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
          - E.g., given that X will buy a computer, the prob. that X is 31..40 with medium income

    Bayesian Theorem

      • Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem:

            P(H|X) = P(X|H) P(H) / P(X)

      • Informally, this can be written as:

            posteriori = likelihood × prior / evidence

      • Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
      • Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
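A quick numeric sketch of the rule above, plugging in the class priors (9/14 and 5/14) and the two likelihoods (0.044 and 0.019) from the deck's buys_computer worked example:

```python
# Numeric sketch of posteriori = likelihood x prior / evidence, with the
# priors and likelihoods from the buys_computer example in this deck.
prior_yes, prior_no = 9 / 14, 5 / 14
lik_yes, lik_no = 0.044, 0.019          # P(X | yes), P(X | no)

evidence = lik_yes * prior_yes + lik_no * prior_no   # P(X) by total probability
posterior_yes = lik_yes * prior_yes / evidence
print(round(posterior_yes, 3))          # 0.807: "yes" is the more probable class
```

Note that the evidence P(X) is the same for every class, which is why the classifier on the next slides can simply maximize likelihood × prior.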

    Towards a Naïve Bayesian Classifier

      • Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
      • Suppose there are m classes C1, C2, …, Cm
      • Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
      • This can be derived from Bayes' theorem:

            P(C_i|X) = P(X|C_i) P(C_i) / P(X)

      • Since P(X) is constant for all classes, only

            P(X|C_i) P(C_i)

        needs to be maximized

    Derivation of Naïve Bayes Classifier

      • A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

            P(X|C_i) = prod_{k=1..n} P(x_k|C_i)

      • This greatly reduces the computation cost: only counts the class distribution
      • If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k, divided by |C_i,D| (# of tuples of C_i in D)
      • If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution

    41/124

    November 16, 2015Data Mining: Concepts and

    Techniques 71

    ?a ve 9aes an ass er: Tra n n!&ataset

    C!ass:

    C1:bu%sGcomputer H

    ^%es

    C2:bu%sGcomputer H ^no

    Data samp!e

    Z H )age KH0,

    $ncome H medium,

    tudent H %esCreditGrating H air+

    Example

      • P(C_i):
            P(buys_computer = "yes") = 9/14 = 0.643
            P(buys_computer = "no")  = 5/14 = 0.357
      • Compute P(X|C_i) for each class:
            P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
            P(age = "<=30" | buys_computer = "no")  = 3/5 = 0.6
            P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
            P(income = "medium" | buys_computer = "no")  = 2/5 = 0.4
            P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
            P(student = "yes" | buys_computer = "no")  = 1/5 = 0.2
            P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
            P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4
      • X = (age <= 30, income = medium, student = yes, credit_rating = fair)

            P(X|C_i):
            P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
            P(X | buys_computer = "no")  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

            P(X|C_i) × P(C_i):
            P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
            P(X | buys_computer = "no")  × P(buys_computer = "no")  = 0.007

      • Therefore, X belongs to class ("buys_computer = yes")
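The computation above can be redone with exact fractions, so the rounded values 0.028 and 0.007 are transparent; the table of conditionals is copied from the slide.

```python
# The naive Bayes example with exact fractions from the 14-tuple table.
from fractions import Fraction as F
from math import prod

priors = {"yes": F(9, 14), "no": F(5, 14)}
cond = {  # P(age<=30|C), P(income=medium|C), P(student=yes|C), P(credit=fair|C)
    "yes": [F(2, 9), F(4, 9), F(6, 9), F(6, 9)],
    "no":  [F(3, 5), F(2, 5), F(1, 5), F(2, 5)],
}

scores = {c: float(priors[c] * prod(cond[c])) for c in priors}
print({c: round(s, 3) for c, s in scores.items()})  # {'yes': 0.028, 'no': 0.007}
print("X is classified as:", max(scores, key=scores.get))
```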

    Avoiding the 0-Probability Problem

      • Naïve Bayesian prediction requires each conditional prob. to be non-zero; otherwise, the predicted prob. will be zero:

            P(X|C_i) = prod_{k=1..n} P(x_k|C_i)

      • Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
      • Use the Laplacian correction (or Laplacian estimator)
          - Add 1 to each case:
                Prob(income = low) = 1/1003
                Prob(income = medium) = 991/1003
                Prob(income = high) = 11/1003
          - The "corrected" prob. estimates are close to their "uncorrected" counterparts
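The Laplacian correction on the income counts above takes one line per estimate:

```python
# Laplacian correction: add 1 to each count, and the number of distinct
# attribute values to the denominator (0 low, 990 medium, 10 high of 1000).
counts = {"low": 0, "medium": 990, "high": 10}
n, k = sum(counts.values()), len(counts)          # 1000 tuples, 3 values
corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
for v in counts:
    print(v, f"{counts[v]}/{n} -> {counts[v] + 1}/{n + k} = {corrected[v]:.6f}")
```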

    Naïve Bayesian Classifier: Comments

      • Advantages
          - Easy to implement

    Bayesian Belief Networks

      • A Bayesian belief network allows a subset of the variables to be conditionally independent
      • A graphical model of causal relationships
          - Represents dependency among the variables

    Bayesian Belief Networks: Example

    [Network figure: FamilyHistory and Smoker are parents of LungCancer; Smoker is also a parent of Emphysema; PositiveXRay and Dyspnea depend on LungCancer]

    The conditional probability table (CPT) for variable LungCancer:

               (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
        LC       0.8      0.5       0.7       0.1
        ~LC      0.2      0.5       0.3       0.9

      • The CPT shows the conditional probability for each possible combination of values of its parents
      • Derivation of the probability of a particular combination of values of X, from the CPTs:

            P(x_1, …, x_n) = prod_{i=1..n} P(x_i | Parents(x_i))

    Training Bayesian Networks

      • Several scenarios:

    Chapter 6. Classification and Prediction

      • What is classification? What is prediction?
      • Issues regarding classification and prediction
      • Classification by decision tree induction
      • Bayesian classification
      • Rule-based classification
      • Classification by backpropagation
      • Support Vector Machines (SVM)
      • Associative classification
      • Lazy learners (or learning from your neighbors)
      • Other classification methods
      • Prediction
      • Accuracy and error measures
      • Ensemble methods
      • Model selection
      • Summary

    Using IF-THEN Rules for Classification

      • Represent the knowledge in the form of IF-THEN rules

            R: IF age = youth AND student = yes THEN buys_computer = yes

          - Rule antecedent/precondition vs. rule consequent
      • Assessment of a rule: coverage and accuracy
          - n_covers = # of tuples covered by R
          - n_correct = # of tuples correctly classified by R

                coverage(R) = n_covers / |D|   /* D: training data set */
                accuracy(R) = n_correct / n_covers

      • If more than one rule is triggered, need conflict resolution
          - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
          - Class-based ordering: decreasing order of prevalence or misclassification cost per class
          - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
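The coverage and accuracy of rule R can be sketched directly; the mini training set D below is hypothetical, chosen so both measures come out non-trivial.

```python
# Sketch of coverage(R) and accuracy(R) for
# R: IF age = youth AND student = yes THEN buys_computer = yes,
# over a small hypothetical training set D.
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]

covered = [t for t in D if t["age"] == "youth" and t["student"] == "yes"]
correct = [t for t in covered if t["buys_computer"] == "yes"]
print(f"coverage(R) = {len(covered)}/{len(D)}")        # 2/4
print(f"accuracy(R) = {len(correct)}/{len(covered)}")  # 1/2
```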

    Rule Extraction from a Decision Tree

      • Rules are easier to understand than large trees
      • One rule is created for each path from the root to a leaf
      • Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
      • Rules are mutually exclusive and exhaustive

    Example: rule extraction from our buys_computer decision tree:

        IF age = young AND student = no             THEN buys_computer = no
        IF age = young AND student = yes            THEN buys_computer = yes
        IF age = mid-age                            THEN buys_computer = yes
        IF age = old AND credit_rating = excellent  THEN buys_computer = yes
        IF age = old AND credit_rating = fair       THEN buys_computer = no
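The path-per-rule idea can be sketched on a nested-dict encoding of a tree (the encoding is an illustrative choice; the leaves follow the rule list above):

```python
# One rule per root-to-leaf path: walk the tree, collecting the
# (attribute, value) conjuncts, and emit a rule at each leaf.
tree = {("age", "young"): {("student", "no"): "no", ("student", "yes"): "yes"},
        ("age", "mid-age"): "yes",
        ("age", "old"): {("credit_rating", "excellent"): "yes",
                         ("credit_rating", "fair"): "no"}}

def extract_rules(node, conds=()):
    if not isinstance(node, dict):              # leaf: one rule per path
        conj = " AND ".join(f"{a} = {v}" for a, v in conds)
        return [f"IF {conj} THEN buys_computer = {node}"]
    rules = []
    for (attr, val), child in node.items():
        rules += extract_rules(child, conds + ((attr, val),))
    return rules

for r in extract_rules(tree):
    print(r)
```

The extracted rules are mutually exclusive and exhaustive precisely because every tuple follows exactly one root-to-leaf path.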

    Rule Extraction from the Training Data

      • Sequential covering algorithm: extracts rules directly from training data
      • Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
      • Rules are learned sequentially; each rule for a given class C_i will cover many tuples of C_i but none (or few) of the tuples of other classes
      • Steps:
          - Rules are learned one at a time
          - Each time a rule is learned, the tuples covered by the rule are removed
          - The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
      • Comp. w. decision-tree induction: learning a set of rules simultaneously

    How to Learn One Rule?

      • Start with the most general rule possible: condition = empty
      • Add new attributes by adopting a greedy depth-first strategy
          - Pick the one that most improves the rule quality
      • Rule-quality measures: consider both coverage and accuracy
          - Foil-gain (in FOIL & RIPPER): assesses info_gain by extending the condition

                FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) - log2( pos / (pos + neg) ) )

          - It favors rules that have high accuracy and cover many positive tuples
      • Rule pruning based on an independent set of test tuples

            FOIL_Prune(R) = (pos - neg) / (pos + neg)

        pos/neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.

    Chapter 6. Classification and Prediction

      • What is classification? What is prediction?
      • Issues regarding classification and prediction
      • Classification by decision tree induction
      • Bayesian classification
      • Rule-based classification
      • Classification by backpropagation
      • Support Vector Machines (SVM)
      • Associative classification
      • Lazy learners (or learning from your neighbors)
      • Other classification methods
      • Prediction
      • Accuracy and error measures
      • Ensemble methods
      • Model selection
      • Summary


Classification: A Mathematical Mapping

Classification predicts categorical class labels

E.g., personal homepage classification:
  xi = (x1, x2, x3, ...), yi = +1 or -1
  x1 : # of occurrences of the word "homepage"
  x2 : # of occurrences of the word "welcome"

Mathematically, x is in X = R^n and y is in Y = {+1, -1}; we want a
function f: X -> Y


Linear Classification

Binary classification problem

The data above the red line belongs to class 'x'; the data below the red
line belongs to class 'o'

Examples: SVM, Perceptron, Probabilistic Classifiers

[Figure: two clouds of 'x' and 'o' points separated by a straight line]


Discriminative Classifiers

Advantages:

  Prediction accuracy is generally high (as compared to Bayesian methods
  in general)

  Robust: works when training examples contain errors

  Fast evaluation of the learned target function (Bayesian networks are
  normally slow)

Criticism:

  Long training time

  Difficult to understand the learned function (weights); Bayesian networks
  can be used easily for pattern discovery

  Not easy to incorporate domain knowledge (easy in the form of priors on
  the data or distributions)


Perceptron & Winnow

Vectors: x, w; scalars: x, y, w

Input: {(x1, y1), ...}
Output: a classification function f(x) such that
  f(xi) > 0 for yi = +1
  f(xi) < 0 for yi = -1

f(x) = w . x + b = 0, or w1*x1 + w2*x2 + b = 0

[Figure: points in the (x1, x2) plane with a separating line]

Perceptron: update W additively
Winnow: update W multiplicatively
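The additive perceptron update can be sketched as follows. This is a minimal illustration, not the slide's own code; the toy data, learning rate, and epoch count are assumptions.

```python
# A minimal perceptron sketch: learn w, b so that sign(w.x + b) separates
# two classes; weights are updated additively on each misclassified tuple.
def train_perceptron(data, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:                      # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += lr * y * x1                   # additive update
                w[1] += lr * y * x2                   # (Winnow would multiply
                b += lr * y                           #  the weights instead)
    return w, b

# Linearly separable toy data:
data = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((1.0, 1.0), 1), ((0.9, 0.8), 1)]
w, b = train_perceptron(data)
predictions = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
               for (x1, x2), _ in data]
```

On separable data like this, the perceptron converges to weights that classify every training tuple correctly.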


Classification by Backpropagation

Backpropagation: a neural network learning algorithm

Started by psychologists and neurobiologists to develop and test
computational analogues of neurons

A neural network: a set of connected input/output units where each
connection has a weight associated with it

During the learning phase, the network learns by adjusting the weights so
as to be able to predict the correct class label of the input tuples

Also referred to as connectionist learning due to the connections between
units


Neural Network as a Classifier

Weakness:

  Long training time

  Requires a number of parameters typically best determined empirically,
  e.g., the network topology or "structure"

  Poor interpretability: difficult to interpret the symbolic meaning behind
  the learned weights and the "hidden units" in the network

Strength:

  High tolerance to noisy data

  Ability to classify untrained patterns

  Well-suited for continuous-valued inputs and outputs

  Successful on a wide array of real-world data

  Algorithms are inherently parallel

  Techniques have recently been developed for the extraction of rules from
  trained neural networks


A Neuron (= a perceptron)

The n-dimensional input vector x is mapped into variable y by means of the
scalar product and a nonlinear function mapping:

    y = sign( sum_{i=0}^{n} w_i * x_i + mu_k )

[Figure: inputs x0..xn with weights w0..wn (weight vector w) feed a
weighted sum; an activation function f produces the output y; mu_k is the
bias]

A Multi-Layer Feed-Forward Neural Network

[Figure: input vector X enters the input layer; weighted connections w_ij
lead through a hidden layer to the output layer, which emits the output
vector]

Net input and output of unit j:

    I_j = sum_i w_ij * O_i + theta_j
    O_j = 1 / (1 + e^(-I_j))

Error of an output unit j (with true output T_j) and of a hidden unit j:

    Err_j = O_j (1 - O_j) (T_j - O_j)
    Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk

Weight and bias updates with learning rate l:

    w_ij = w_ij + l * Err_j * O_i
    theta_j = theta_j + l * Err_j
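The update equations above can be sketched for a single sigmoid unit. This is a minimal illustration under assumed toy values (inputs, initial weights, learning rate), not a full multi-layer implementation.

```python
from math import exp

# One forward/backward step for a single sigmoid output unit j, following
# the equations above (l is the learning rate, theta_j the bias).
def forward(inputs, weights, theta_j):
    I_j = sum(w * o for w, o in zip(weights, inputs)) + theta_j
    return 1.0 / (1.0 + exp(-I_j))            # O_j = 1 / (1 + e^(-I_j))

def backward(inputs, weights, theta_j, target, l=0.5):
    O_j = forward(inputs, weights, theta_j)
    err_j = O_j * (1 - O_j) * (target - O_j)  # Err_j for an output unit
    weights = [w + l * err_j * o for w, o in zip(weights, inputs)]
    theta_j = theta_j + l * err_j
    return weights, theta_j

inputs, target = [1.0, 0.0], 1.0
weights, theta = [0.1, -0.2], 0.0
before = forward(inputs, weights, theta)
for _ in range(200):                          # repeated updates shrink the error
    weights, theta = backward(inputs, weights, theta, target)
after = forward(inputs, weights, theta)
```

Each pass nudges the weights in the direction that reduces the gap between the unit's output and the target, which is the essence of backpropagation.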

How a Multi-Layer Neural Network Works

The inputs to the network correspond to the attributes measured for each
training tuple

Inputs are fed simultaneously into the units making up the input layer

They are then weighted and fed simultaneously to a hidden layer; the number
of hidden layers is arbitrary, although usually only one

The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction

The network is feed-forward in that none of the weights cycles back to an
input unit or to an output unit of a previous layer

From a statistical point of view, networks perform nonlinear regression


Defining a Network Topology

First decide the network topology: # of units in the input layer, # of
hidden layers (if > 1), # of units in each hidden layer, and # of units in
the output layer

Normalize the input values for each attribute measured in the training
tuples to [0.0, 1.0]; one input unit per domain value, each initialized
to 0

Output: if for classification and more than two classes, one output unit
per class is used

Once a network has been trained and its accuracy is unacceptable, repeat
the training process with a different network topology or a different set
of initial weights
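The input normalization step above can be sketched as min-max scaling. A minimal illustration; the income values are made up.

```python
# Min-max normalization of one attribute's training values to [0.0, 1.0],
# as suggested for network inputs.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30000, 45000, 60000, 90000]
scaled = normalize(incomes)   # smallest value maps to 0.0, largest to 1.0
```

Scaling all attributes to the same range keeps one large-valued attribute from dominating the weighted sums early in training.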


Backpropagation

Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value

For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value


Backpropagation and Interpretability

Efficiency of backpropagation: each epoch (one iteration through the
training set) takes O(|D| * w) time, with |D| tuples and w weights, but the
# of epochs can be exponential in n, the number of inputs, in the worst
case

Rule extraction from networks: network pruning

  Simplify the network structure by removing weighted links that have the
  least effect on the trained network

  Then perform link, unit, or activation value clustering

  The sets of input and activation values are studied to derive rules
  describing the relationship between the input and hidden unit layers

Sensitivity analysis: assess the impact that a given input variable has on
a network output; the knowledge gained from this analysis can be
represented in rules


SVM: Support Vector Machines

A new classification method for both linear and nonlinear data

It uses a nonlinear mapping to transform the original training data into a
higher dimension

With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., "decision boundary")

With an appropriate nonlinear mapping to a sufficiently high dimension,
data from two classes can always be separated by a hyperplane

SVM finds this hyperplane using support vectors ("essential" training
tuples) and margins (defined by the support vectors)

SVM: History and Applications

Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis'
statistical learning theory in the 1960s

Features: training can be slow but accuracy is high, owing to their ability
to model complex nonlinear decision boundaries (margin maximization)

Used both for classification and prediction

Applications: handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests

SVM: General Philosophy

[Figure: two separating hyperplanes over the same data, one with a small
margin and one with a large margin; the support vectors lie on the margin]

Margins and Support Vectors

[Figure: the maximum-margin hyperplane with its support vectors
highlighted]

SVM: When Data Is Linearly Separable

Let the data D be (X1, y1), ..., (X|D|, y|D|), where Xi is a training tuple
and yi its associated class label

There are an infinite number of lines (hyperplanes) separating the two
classes, but we want to find the best one (the one that minimizes
classification error on unseen data)

SVM searches for the hyperplane with the largest margin, i.e., the maximum
marginal hyperplane (MMH)

SVM: Linearly Separable

A separating hyperplane can be written as

    W . X + b = 0

where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias)

For 2-D it can be written as

    w0 + w1*x1 + w2*x2 = 0

The hyperplanes defining the sides of the margin:

    H1: w0 + w1*x1 + w2*x2 >= 1   for yi = +1, and
    H2: w0 + w1*x1 + w2*x2 <= -1  for yi = -1

Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides
defining the margin) are support vectors

This becomes a constrained (convex) quadratic optimization problem
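The margin constraints above can be checked directly for a candidate hyperplane. A minimal sketch: the weights and data points here are illustrative, not the result of solving the quadratic program.

```python
# Checking the margin constraints for a candidate 2-D hyperplane
# w0 + w1*x1 + w2*x2 = 0 (weights are illustrative, not learned).
w0, w1, w2 = -3.0, 1.0, 1.0

def side(x1, x2):
    return w0 + w1 * x1 + w2 * x2

# Tuples with y = +1 must satisfy side(x) >= 1, y = -1 must satisfy
# side(x) <= -1; tuples meeting the bound with equality lie on H1 or H2
# and are the support vectors.
data = [((3.0, 1.0), 1), ((4.0, 2.0), 1), ((1.0, 1.0), -1), ((0.0, 1.0), -1)]
feasible = all(y * side(x1, x2) >= 1 for (x1, x2), y in data)
support_vectors = [p for p, y in data if abs(side(*p)) == 1.0]
```

Only the tuples sitting exactly on H1/H2 end up in `support_vectors`, matching the definition above.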

Why Is SVM Effective on High-Dimensional Data?

The complexity of the trained classifier is characterized by the # of
support vectors rather than the dimensionality of the data

The support vectors are the essential or critical training examples; they
lie closest to the decision boundary (MMH)

If all other training examples were removed and the training repeated, the
same separating hyperplane would be found

The number of support vectors found can be used to compute an (upper) bound
on the expected error rate of the SVM classifier, which is independent of
the data dimensionality

Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high

SVM: Linearly Inseparable

Transform the original input data into a higher dimensional space

Search for a linear separating hyperplane in the new space

[Figure: data inseparable in the original space (axes A1, A2) becomes
separable after the transformation]

SVM: Kernel Functions

Instead of computing the dot product on the transformed data tuples, it is
mathematically equivalent to apply a kernel function K(Xi, Xj) to the
original data, i.e.,

    K(Xi, Xj) = Phi(Xi) . Phi(Xj)

Typical kernel functions include the polynomial kernel, the Gaussian radial
basis function kernel, and the sigmoid kernel

SVM can also be used for classifying multiple (> 2) classes and for
regression analysis (with additional user parameters)
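The kernel identity K(Xi, Xj) = Phi(Xi) . Phi(Xj) can be verified numerically for the 2-D quadratic kernel, whose explicit mapping is known in closed form. A minimal sketch; the test points are made up.

```python
from math import sqrt

# The kernel trick for the quadratic kernel K(x, z) = (x . z)^2 in 2-D:
# it equals the dot product of the explicit mapping
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so phi never needs to be computed.
def k(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    return (x[0] ** 2, x[1] ** 2, sqrt(2) * x[0] * x[1])

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = k(x, z)                                       # kernel on original data
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))    # dot product after mapping
```

Both sides agree, which is why the SVM can work in the high-dimensional space while touching only the original tuples.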

Scaling SVM by Hierarchical Micro-Clustering

SVM is not scalable to the number of data objects in terms of training time
and memory usage

"Classifying Large Datasets Using SVMs with Hierarchical Clusters" by
Hwanjo Yu, Jiong Yang, Jiawei Han, KDD'03

CB-SVM (Clustering-Based SVM)

CB-SVM: Clustering-Based SVM

Training data sets may not even fit in memory

Read the data set once (minimizing disk access); construct a statistical
summary of the data (i.e., hierarchical clusters) given a limited amount of
memory

The statistical summary maximizes the benefit of learning the SVM

The summary plays a role in indexing SVMs

Essence of micro-clustering (hierarchical indexing structure):

  Use the micro-cluster hierarchical indexing structure to provide finer
  samples closer to the boundary and coarser samples farther from the
  boundary

  Selective de-clustering to ensure high accuracy

CF-Tree: Hierarchical Micro-cluster

[Figure: a CF-tree whose levels summarize the data at progressively finer
micro-cluster granularity]

CB-SVM Algorithm: Outline

Construct two CF-trees from the positive and negative data sets
independently; this needs one scan of the data set

Train an SVM from the centroids of the root entries

De-cluster the entries near the boundary into the next level; the children
entries de-clustered from the parent entries are accumulated into the
training set together with the non-declustered parent entries

Train an SVM again from the centroids of the entries in the training set

Repeat until nothing is accumulated

Selective Declustering

A CF tree is a suitable base structure for selective declustering

De-cluster only the clusters Ei such that Di - Ri < Ds, where Di is the
distance from the boundary to the center point of Ei and Ri is the radius
of Ei

De-cluster only the clusters whose subclusters could be the "support
cluster" of the boundary

"Support cluster": the cluster whose centroid is a support vector

Experiment on Synthetic Dataset

[Figure: experimental results on a synthetic dataset]

Experiment on a Large Data Set

[Figure: experimental results on a large data set]

SVM vs. Neural Network

SVM: relatively new concept; deterministic algorithm; nice generalization
properties; hard to learn (learned in batch mode using quadratic
programming techniques); using kernels, can learn very complex functions

Neural network: relatively old; nondeterministic algorithm; generalizes
well but lacks a strong mathematical foundation; can easily be learned in
incremental fashion; to learn complex functions, uses a multilayer
perceptron

SVM Related Links

SVM website: http://www.kernel-machines.org/

Representative implementations

  LIBSVM: an efficient implementation of SVM; multi-class classification,
  nu-SVM, one-class SVM, including various interfaces with Java, Python,
  etc.

  SVM-light: simpler, but performance is not better than LIBSVM; supports
  only binary classification and only the C language

  SVM-torch: another recent implementation, also written in C


Literature

"Statistical Learning Theory" by Vapnik: extremely hard to understand, and
contains many errors too

C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern
Recognition", Knowledge Discovery and Data Mining, 2(2), 1998: better than
Vapnik's book, but still too hard for an introduction, and the examples are
not intuitive

"An Introduction to Support Vector Machines" by N. Cristianini and
J. Shawe-Taylor: also hard as an introduction, but the explanation of
Mercer's theorem is better than in the above literature

The neural network book by Haykin: contains one nice chapter of SVM
introduction

Associative Classification

Associative classification: association rules are generated and analyzed
for use in classification

Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels

Classification: based on evaluating a set of rules in the form of

    p1 ^ p2 ^ ... ^ pl  ->  "Aclass = C"  (conf, sup)

Why effective? It explores highly confident associations among multiple
attributes and may overcome some constraints introduced by decision-tree
induction, which considers only one attribute at a time

In many studies, associative classification has been found to be more
accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)

  Mine possible association rules in the form of: Cond-set (a set of
  attribute-value pairs) -> class label

  Build classifier: organize rules according to decreasing precedence based
  on confidence and then support

CMAR (Classification based on Multiple Association Rules: Li, Han, Pei,
ICDM'01); classification: statistical analysis on multiple rules

CPAR (Classification based on Predictive Association Rules: Yin & Han,
SDM'03)
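The CBA precedence ordering, decreasing confidence with support as a tie-breaker, can be sketched as a sort. A minimal illustration; the rules and their (confidence, support) numbers are made up.

```python
# CBA-style precedence: order mined rules by decreasing confidence,
# breaking ties by decreasing support (rule data here is made up).
rules = [
    ("age=youth ^ student=yes -> buys=yes", 0.93, 0.20),
    ("income=high -> buys=yes",             0.93, 0.35),
    ("credit=fair -> buys=no",              0.70, 0.10),
]
ranked = sorted(rules, key=lambda r: (r[1], r[2]), reverse=True)
# The first rule in `ranked` whose condition matches a new tuple
# determines its class label.
```

The two 0.93-confidence rules tie, so the one with higher support wins the precedence contest.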


A Closer Look at CMAR

CMAR (Classification based on Multiple Association Rules: Li, Han, Pei,
ICDM'01)

Efficiency: uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset

Rule pruning whenever a rule is inserted into the tree


Lazy vs. Eager Learning

Lazy Learner: Instance-Based Methods

Instance-based learning: store training examples and delay the processing
("lazy evaluation") until a new instance must be classified

Typical approaches:

  k-nearest neighbor approach: instances represented as points in a
  Euclidean space

  Locally weighted regression: constructs a local approximation

  Case-based reasoning: uses symbolic representations and knowledge-based
  inference

The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-dimensional space

The nearest neighbors are defined in terms of Euclidean distance,
dist(X1, X2)

The target function could be discrete- or real-valued

For discrete-valued targets, k-NN returns the most common value among the
k training examples nearest to xq

Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples

[Figure: a query point xq among '+' and '-' training points, with the
Voronoi cells induced by 1-NN]

Discussion on the k-NN Algorithm

k-NN for real-valued prediction for a given unknown tuple: returns the mean
of the values of the k nearest neighbors

Distance-weighted nearest neighbor algorithm: weight the contribution of
each of the k neighbors according to their distance to the query xq, giving
greater weight to closer neighbors
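The majority-vote version of k-NN described above can be sketched in a few lines. A minimal illustration; the toy training points and labels are made up.

```python
from math import dist
from collections import Counter

# A minimal k-NN sketch: classify the query xq by majority vote among the
# k training tuples closest in Euclidean distance.
def knn_classify(training, xq, k=3):
    nearest = sorted(training, key=lambda t: dist(t[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((0, 0), "o"), ((1, 0), "o"), ((0, 1), "o"),
            ((5, 5), "x"), ((6, 5), "x"), ((5, 6), "x")]
label = knn_classify(training, (0.5, 0.5))   # query surrounded by "o" points
```

Being lazy, the classifier does no work at "training" time; all the distance computation happens when a query arrives.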


Case-Based Reasoning (CBR)

CBR: uses a database of problem solutions to solve new problems

Stores symbolic descriptions (tuples or cases), not points in a Euclidean
space

Applications: customer service (product-related diagnosis), legal ruling

Methodology:

  Instances represented by rich symbolic descriptions (e.g., function
  graphs)

  Search for similar cases; multiple retrieved cases may be combined

  Tight coupling between case retrieval, knowledge-based reasoning, and
  problem solving

Challenges:

  Find a good similarity metric

  Indexing based on a syntactic similarity measure, and on failure,
  backtracking and adapting to additional cases


Genetic Algorithms (GA)

Rough Set Approach

Rough sets are used to approximately or "roughly" define equivalence
classes

A rough set for a given class C is approximated by two sets: a lower
approximation (certain to be in C) and an upper approximation (cannot be
described as not belonging to C)

Finding the minimal subsets (reducts) of attributes for feature reduction
is NP-hard, but a discernibility matrix (which stores the differences
between attribute values for each pair of data tuples) is used to reduce
the computation intensity

Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree
of membership (such as using a fuzzy membership graph)

Attribute values are converted to fuzzy values; e.g., income is mapped into
the discrete categories {low, medium, high} with fuzzy values calculated

For a given new sample, more than one fuzzy value may apply

Each applicable rule contributes a vote for membership in the categories

Typically, the truth values for each predicted category are summed, and
these sums are combined
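The income-to-fuzzy-category mapping can be sketched as follows. The membership boundaries here are hypothetical, chosen only to illustrate that one income can partially belong to two categories.

```python
# A sketch of fuzzy membership for income: each category gets a truth value
# in [0, 1], and one income can partially belong to two categories.
def fuzzy_income(income):
    # Hypothetical piecewise-linear boundaries, for illustration only.
    low = max(0.0, min(1.0, (40000 - income) / 20000))
    high = max(0.0, min(1.0, (income - 60000) / 20000))
    medium = max(0.0, 1.0 - low - high)
    return {"low": low, "medium": medium, "high": high}

m = fuzzy_income(35000)   # partly "low", partly "medium": no crisp category
```

A rule conditioned on income=low and one conditioned on income=medium would both fire here, each contributing a vote weighted by its truth value.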


What Is Prediction?

(Numerical) prediction is similar to classification: construct a model,
then use the model to predict a continuous or ordered value for a given
input

Prediction is different from classification: classification predicts a
categorical class label; prediction models continuous-valued functions

Major method for prediction: regression; model the relationship between
one or more independent or predictor variables and a dependent or response
variable

Regression analysis: linear and multiple regression; nonlinear regression;
other regression methods: generalized linear model, Poisson regression,
log-linear models, regression trees

Linear Regression

Linear regression involves a response variable y and a single predictor
variable x:

    y = w0 + w1*x

where w0 (y-intercept) and w1 (slope) are the regression coefficients

Method of least squares: estimates the best-fitting straight line

    w1 = sum_{i=1}^{|D|} (xi - xbar)(yi - ybar) / sum_{i=1}^{|D|} (xi - xbar)^2
    w0 = ybar - w1 * xbar

Multiple linear regression involves more than one predictor variable

Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)

Ex.: for 2-D data, we may have y = w0 + w1*x1 + w2*x2; solvable by an
extension of the least squares method or using SAS, S-Plus

Many nonlinear functions can be transformed into the above
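The least squares formulas can be sketched directly. A minimal illustration; the toy data (which lies exactly on y = 30 + 5x) is made up.

```python
# Method of least squares for y = w0 + w1*x, following the formulas above.
def least_squares(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    w0 = ybar - w1 * xbar
    return w0, w1

# Toy data lying exactly on the line y = 30 + 5x:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [35.0, 40.0, 45.0, 50.0]
w0, w1 = least_squares(xs, ys)
```

Because the data is exactly linear here, the estimates recover the true intercept and slope; on noisy data they give the best-fitting line in the squared-error sense.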

Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function

A polynomial regression model can be transformed into a linear regression
model. For example,

    y = w0 + w1*x + w2*x^2 + w3*x^3

is convertible to linear with new variables x2 = x^2, x3 = x^3:

    y = w0 + w1*x + w2*x2 + w3*x3

Other functions, such as the power function, can also be transformed into
a linear model

Some models are intractably nonlinear (e.g., a sum of exponential terms);
it is possible to obtain least squares estimates through extensive
calculation on more complex formulae
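The variable-substitution trick above can be demonstrated with a single substituted variable. A minimal sketch, using an assumed model y = w0 + w2*x^2 and made-up data that fits it exactly.

```python
# Variable substitution turns polynomial regression into linear regression:
# to fit y = w0 + w2*x^2, define x2 = x^2 and fit a straight line of y
# against x2 with ordinary least squares.
def least_squares(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - w1 * xbar, w1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 11.0, 21.0, 35.0]        # exactly y = 3 + 2*x^2
x2 = [x ** 2 for x in xs]           # the substituted variable
w0, w2 = least_squares(x2, ys)
```

The linear solver, applied to the substituted variable, recovers the polynomial's coefficients without any nonlinear fitting machinery.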

Other Regression-Based Models

Regression Trees and Model Trees

Regression tree: proposed in the CART system (Breiman et al. 1984)

  CART: Classification And Regression Trees

  Each leaf stores a continuous-valued prediction; it is the average value
  of the predicted attribute for the training tuples that reach the leaf

Model tree: proposed by Quinlan (1992)

  Each leaf holds a regression model, a multivariate linear equation for
  the predicted attribute

  A more general case than the regression tree

Regression and model trees tend to be more accurate than linear regression
when the data are not represented well by a simple linear model

    Predictive Modelin! in Multidi+ensional&ata/ases


Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
  Minimal generalization
  Attribute relevance analysis


Prediction: Categorical Data


Chapter 6. Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary

Classifier Accuracy Measures

               predicted C1      predicted C2
  actual C1    true positive     false negative
  actual C2    false positive    true negative


Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 − acc(M)

Example (two classes, 10000 test tuples):

  classes    C1      C2      total    recognition (%)
  C1         6954    46      7000     99.34
  C2         412     2588    3000     86.27
  total      7366    2634    10000    95.42
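The accuracy and error-rate computation from the four confusion-matrix cells can be sketched directly; the counts below are a hypothetical two-class example:

```python
# Hypothetical confusion matrix for two classes (rows: actual, cols: predicted)
TP, FN = 6954, 46
FP, TN = 412, 2588

total = TP + FN + FP + TN        # 10000 test tuples
accuracy = (TP + TN) / total     # fraction correctly classified
error_rate = 1 - accuracy        # misclassification rate
print(round(accuracy * 100, 2), round(error_rate * 100, 2))  # 95.42 4.58
```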

Predictor Error Measures


Measure predictor accuracy: measure how far off the predicted value is from the actual known value
Loss function: measures the error between yi and the predicted value yi'
  Absolute error: |yi − yi'|
  Squared error: (yi − yi')²
Test error (generalization error): the average loss over the test set
  Mean absolute error: (1/d) Σ_{i=1..d} |yi − yi'|
  Mean squared error: (1/d) Σ_{i=1..d} (yi − yi')²
  Relative absolute error: Σ_{i=1..d} |yi − yi'| / Σ_{i=1..d} |yi − ȳ|
  Relative squared error: Σ_{i=1..d} (yi − yi')² / Σ_{i=1..d} (yi − ȳ)²
The mean squared error exaggerates the presence of outliers
Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
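These error measures are straightforward to compute; a sketch with hypothetical actual and predicted values for d = 4 test tuples:

```python
import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])    # actual values y_i
yp = np.array([2.5, 5.0, 3.0, 8.0])   # predicted values y_i'

mae = np.mean(np.abs(y - yp))          # mean absolute error
mse = np.mean((y - yp) ** 2)           # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error, same units as y
# Relative variants normalize by the error of always predicting the mean of y
rae = np.sum(np.abs(y - yp)) / np.sum(np.abs(y - y.mean()))
rse = np.sum((y - yp) ** 2) / np.sum((y - y.mean()) ** 2)
print(mae, mse, rmse)  # 0.5 0.375 ~0.612
```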


Evaluating the Accuracy of a Classifier or Predictor (I): Holdout Method


Evaluating the Accuracy of a Classifier or Predictor (II)

Bootstrap
  Works well with small data sets
  Samples the given training tuples uniformly with replacement
    i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
  Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
  Repeat the sampling procedure k times; the overall accuracy of the model is

    acc(M) = Σ_{i=1..k} (0.632 × acc(Mi)_test_set + 0.368 × acc(Mi)_train_set)
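The 63.2% / 36.8% split can be verified empirically. A sketch that draws d tuples with replacement and counts how many distinct originals appear (d = 10000 is an arbitrary choice for the demonstration):

```python
import random

# Empirically check the .632 rule: sampling d tuples with replacement
# leaves about 36.8% of the original data out of the training set.
random.seed(1)
d = 10000
sample = [random.randrange(d) for _ in range(d)]  # d draws with replacement
unique_frac = len(set(sample)) / d
print(round(unique_frac, 3))  # close to 0.632
```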


Ensemble Methods: Increasing the Accuracy


Ensemble methods
  Use a combination of models to increase accuracy
  Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
Popular ensemble methods
  Bagging: averaging the prediction over a collection of classifiers
  Boosting: weighted vote with a collection of classifiers
  Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation


Analogy: diagnosis based on multiple doctors' majority vote
Training: at each iteration i, a training set Di is sampled with replacement (a bootstrap sample) from the data D, and a classifier Mi is learned from Di
Classification: classify an unknown sample X
  Each classifier Mi returns its class prediction
  The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy
  Often significantly better than a single classifier derived from D
  For noisy data: not considerably worse, and more robust
  Proved improved accuracy in prediction
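The majority-vote step can be sketched as follows; the three threshold "classifiers" below are hypothetical stand-ins for models trained on different bootstrap samples:

```python
from collections import Counter

def bagging_predict(classifiers, x):
    """Majority vote over a collection of trained classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical classifiers: simple threshold rules on a 1-D input
classifiers = [
    lambda x: "pos" if x > 4 else "neg",
    lambda x: "pos" if x > 5 else "neg",
    lambda x: "pos" if x > 6 else "neg",
]
print(bagging_predict(classifiers, 5.5))  # "pos" wins the vote 2-1
```

For continuous-valued prediction, `most_common` would simply be replaced by the mean of the individual predictions.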

Boosting


Analogy: consult several doctors, based on a combination of weighted diagnoses, where each weight is assigned based on the previous diagnosis accuracy
How boosting works:
  Weights are assigned to each training tuple
  A series of classifiers is iteratively learned
  After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data

Adaboost (Freund and Schapire, 1997)


Given a set of d class-labeled tuples, (X1, y1), ..., (Xd, yd)
Initially, all the weights of the tuples are set the same (1/d)
Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

    error(Mi) = Σ_j wj × err(Xj)

where err(Xj) = 1 if tuple Xj is misclassified and 0 otherwise. The weight of classifier Mi's vote is

    log( (1 − error(Mi)) / error(Mi) )
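One round of the weight update can be traced by hand. A sketch with four equally weighted hypothetical tuples, one of which Mi misclassifies; correctly classified tuples are down-weighted by error/(1 − error) and all weights are then renormalized, which raises the misclassified tuple's share:

```python
import math

# One round of Adaboost-style reweighting (weights sum to 1)
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, True, False, False]  # which tuples M_i got wrong

error = sum(w for w, m in zip(weights, misclassified) if m)  # 0.25
alpha = math.log((1 - error) / error)  # the classifier's vote weight

# Down-weight the correctly classified tuples, then renormalize
weights = [w * (error / (1 - error)) if not m else w
           for w, m in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]
print(alpha, weights)  # the misclassified tuple now carries weight 0.5
```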


Model Selection: ROC Curves


ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
  Originated from signal detection theory
  Shows the trade-off between the true positive rate and the false positive rate
  The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
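Ranking the test tuples by score and sweeping down the list yields the ROC points, and the area under the curve can then be computed with the trapezoidal rule. A sketch with hypothetical scores and labels:

```python
# Build ROC points by ranking test tuples by predicted score (descending)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5]
labels = [1, 1, 0, 1, 0, 0]  # 1 = positive class

P = sum(labels)          # number of positives
N = len(labels) - P      # number of negatives
ranked = sorted(zip(scores, labels), reverse=True)

points = [(0.0, 0.0)]    # (false positive rate, true positive rate)
tp = fp = 0
for _, label in ranked:
    if label:
        tp += 1
    else:
        fp += 1
    points.append((fp / N, tp / P))

# Area under the curve via the trapezoidal rule
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)  # 8/9 for this ranking
```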


Summary (I)


Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms and rough set and fuzzy set approaches.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

Summary (II)


Stratified k-fold cross-validation is a recommended method for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
Significance tests and ROC curves are useful for model selection.
There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic.
No single method has been found to be superior over all others for all data sets.
Issues such as accuracy, training time, robustness, interpretability,