Bagging, Boosting


Dealing with Data, Bagging, Boosting


Types of Data: Binary Data

ID   Salary   Male/Female   Mortgage   Car
A    50000    1             0          0
B    85000    0             1          1
C    55000    1             0          1
D    95000    1             1          0
E    75000    0             0          0
F    45000    0             1          1
G    65000    1             1          0

A binary variable has two states, 0 or 1, where 0 means the variable is absent and 1 means it is present. Thus the variable smoker has value 1 if the person smokes and 0 if he does not. A binary variable is symmetric if both of its states are equally valuable and carry the same weight. A variable denoting the gender of a person is a symmetric binary variable, as the male and female values are equally important.

Consider the data above. Male/Female, Mortgage and Car are binary variables, as they take values 0 and 1 only. How do we find the distance between A and B in this case?


Types of Data: Binary Data

              object j
              1      0      sum
object i  1   q      r      q+r
          0   s      t      s+t
          sum q+s    r+t    p

We construct a matrix as shown above. The matrix shows the matching between two objects i and j: q denotes the number of matches between i and j where both are 1, r denotes the number of matches where i = 1 and j = 0, and so on.

$$d(i,j) = \frac{r+s}{q+r+s+t} = \frac{r+s}{p}$$

The distance between i and j is also called the dissimilarity between i and j.


Symmetric Data

Calculation of d(A,B), i.e. the dissimilarity between A and B:

            B
            1    0    sum
A     1     0    1    1
      0     2    0    2
      sum   2    1    3

d(A,B) = (r + s) / p = 3/3 = 1

Calculation of d(A,C), i.e. the dissimilarity between A and C:

            C
            1    0    sum
A     1     1    0    1
      0     1    1    2
      sum   2    1    3

d(A,C) = (r + s) / p = 1/3 ≈ 0.33
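To make the computation concrete, here is a minimal Python sketch of the symmetric-binary dissimilarity on the salary table above; the helper name simple_matching_distance is ours, not a library function.

```python
# Symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t) = (r + s) / p.
# Records from the table above, as (Male/Female, Mortgage, Car) tuples.
records = {
    "A": (1, 0, 0), "B": (0, 1, 1), "C": (1, 0, 1), "D": (1, 1, 0),
    "E": (0, 0, 0), "F": (0, 1, 1), "G": (1, 1, 0),
}

def simple_matching_distance(i, j):
    """Fraction of binary variables on which records i and j disagree."""
    mismatches = sum(a != b for a, b in zip(i, j))  # r + s
    return mismatches / len(i)                      # divided by p

print(simple_matching_distance(records["A"], records["B"]))            # 1.0
print(round(simple_matching_distance(records["A"], records["C"]), 2))  # 0.33
```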


Asymmetric Binary Variables

              object j
              1      0      sum
object i  1   q      r      q+r
          0   s      t      s+t
          sum q+s    r+t    p

A variable is asymmetric if the outcomes of its states are not equally important, such as the positive and negative outcomes of a disease test. Let the variable be the HIV status of a person: it is 1 if the disease is present and 0 if it is absent. Given two asymmetric binary variables, the agreement of two 1s (a positive match) is considered more significant than that of two 0s. In this case the formula for dissimilarity becomes:

$$d(i,j) = \frac{r+s}{q+r+s}$$

where t is not considered.


Asymmetric Binary Variables

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       Y       N        N        N        N

In this case gender is symmetric and the other attributes are asymmetric binary. We encode the asymmetric values as 1 for Yes/Positive and 0 for No/Negative:

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack   M        1       0       1        0        0        0
Mary   F        1       0       1        0        1        0
Jim    M        1       1       0        0        0        0

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) ≈ 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) ≈ 0.67
d(Mary, Jim)  = (1 + 2) / (1 + 1 + 2) = 0.75
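A short Python sketch of the asymmetric dissimilarity (the Jaccard-style distance above) on the Jack/Mary/Jim records; the function name asymmetric_distance is our own.

```python
# Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s);
# negative matches (t) are ignored.
patients = {
    "Jack": (1, 0, 1, 0, 0, 0),  # fever, cough, test-1 .. test-4
    "Mary": (1, 0, 1, 0, 1, 0),
    "Jim":  (1, 1, 0, 0, 0, 0),
}

def asymmetric_distance(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # positive matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    return (r + s) / (q + r + s)

print(round(asymmetric_distance(patients["Jack"], patients["Mary"]), 2))  # 0.33
print(round(asymmetric_distance(patients["Jack"], patients["Jim"]), 2))   # 0.67
print(round(asymmetric_distance(patients["Mary"], patients["Jim"]), 2))   # 0.75
```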


Categorical Variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a categorical variable that may take five states: red, yellow, green, pink, and blue.

The dissimilarity between two categorical objects i and j can be computed based on the ratio of mismatches:

$$d(i,j) = \frac{p-m}{p}$$

where m is the number of matches (i.e. the number of variables for which i and j are in the same state), and p is the total number of variables.
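A minimal Python sketch of this mismatch ratio. The slides' test-1 values did not survive in this text, so the two example objects below are hypothetical map_color-style records, used only for illustration.

```python
# Categorical dissimilarity: d(i, j) = (p - m) / p, where m is the number of
# variables on which i and j agree and p the total number of variables.
def categorical_distance(i, j):
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))  # number of matches
    return (p - m) / p

# Hypothetical two-variable records (not from the slides):
obj1 = ("red", "code-A")
obj2 = ("red", "code-B")
print(categorical_distance(obj1, obj2))  # 0.5: one match out of two variables
```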


    Categorical Variables

We take into account the object identifier and test-1 only and construct the dissimilarity matrix. We have p = 1, since only one variable is considered.


    Ordinal Variables

    A discrete ordinal variable resembles a categorical variable, except that the M

    states of the ordinal value are ordered in a meaningful sequence.


    Ordinal Variables

We consider the object identifier and test-2 (an ordinal variable). We replace each test-2 value by its rank. Since there are three states, namely excellent, fair and good, M_f = 3.


    Ordinal Variables

Rank: 1-fair, 2-good, 3-excellent. Each rank r is normalized as (r - 1)/(M_f - 1):

object-identifier   test-2 rank   normalized value
1                   3             (3-1)/(3-1) = 1
2                   1             (1-1)/(3-1) = 0
3                   2             (2-1)/(3-1) = 0.5
4                   3             (3-1)/(3-1) = 1

We next calculate the Euclidean distance between the objects using the normalized values. The distance between objects 2 and 1 is ((0 - 1)^2)^(1/2) = 1, the distance between objects 3 and 1 is ((0.5 - 1)^2)^(1/2) = 0.5, and so on, giving the dissimilarity matrix.
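A small Python sketch of the ordinal procedure: ranks are normalized to (r - 1)/(M - 1) and an ordinary Euclidean distance is taken on the result, using the test-2 ranks above.

```python
import math

# Ordinal variables: replace each value by its rank r in 1..M, normalize to
# z = (r - 1) / (M - 1), then use a numeric distance on z.
ranks = {1: 3, 2: 1, 3: 2, 4: 3}   # object-identifier -> rank of test-2
M = 3                              # states: 1-fair, 2-good, 3-excellent

z = {obj: (r - 1) / (M - 1) for obj, r in ranks.items()}

def euclidean(a, b):
    return math.sqrt((a - b) ** 2)  # one-dimensional case

print(euclidean(z[2], z[1]))  # 1.0
print(euclidean(z[3], z[1]))  # 0.5
```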


Ratio-Scaled Variables

For ratio-scaled variables we take the log of the values. Consider the object identifier and the test-3 variable.


Ratio-Scaled Variables

object-identifier   test-3   log value
1                   445      log(445)  = 2.65
2                   22       log(22)   = 1.34
3                   164      log(164)  = 2.21
4                   1210     log(1210) = 3.08

From the values in the last column we calculate the Euclidean distance, obtaining the corresponding distance matrix.
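A short Python sketch of the ratio-scaled transform, reproducing the log values above (base-10 logarithms, as the 2.65/1.34/2.21/3.08 values imply) and one pairwise distance.

```python
import math

# Ratio-scaled variables: apply a (base-10) log transform, then measure
# distance on the transformed values. Data: the test-3 column above.
test3 = {1: 445, 2: 22, 3: 164, 4: 1210}
logs = {obj: math.log10(v) for obj, v in test3.items()}

for obj, v in sorted(logs.items()):
    print(obj, round(v, 2))              # 2.65, 1.34, 2.21, 3.08

# One-dimensional Euclidean distance between objects 1 and 2:
print(round(abs(logs[1] - logs[2]), 2))  # 1.31
```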


Bagging

Bagging, also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from it.

Let x denote a one-dimensional attribute and y the class label. We apply a classifier that induces a one-level binary decision tree with a test condition x <= k, where k is a split point chosen to minimize the entropy of the leaf nodes.

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y   1     1     1     -1    -1    -1    -1    1     1     1
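A minimal Python sketch of bagging with one-level stumps on this data. Two simplifications are ours: the stump minimizes plain training error rather than entropy, and the ensemble size defaults to 10 rounds.

```python
import random

# Bagging with one-level decision stumps on the data above.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def fit_stump(xs, ys):
    """Find the test x <= k (and left-leaf label) with fewest training errors."""
    best = None
    for k in xs:
        for left in (1, -1):
            errors = sum((left if x <= k else -left) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, k, left)
    _, k, left = best
    return k, left

def bagging(rounds=10):
    stumps = []
    for _ in range(rounds):
        # Bootstrap sample: same size as the data, drawn with replacement.
        idx = [random.randrange(len(X)) for _ in range(len(X))]
        stumps.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))
    def predict(x):
        # Majority vote of the stumps' +1/-1 predictions.
        votes = sum(left if x <= k else -left for k, left in stumps)
        return 1 if votes >= 0 else -1
    return predict

predict = bagging()
print([predict(x) for x in X])
```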


Bagging

These values of y are determined from bagging round 1: round 1 states that for x = 0.35, y = -1. The values of y in the column are then added, which amounts to a majority vote across the rounds.


Boosting

An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records.

Initially, all N records are assigned equal weights. Unlike bagging, the weights may change at the end of each boosting round.


Boosting

Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased.


Boosting - AdaBoost

AdaBoost Algorithm

1: w = {w_j = 1/N | j = 1, 2, ..., N}  {initialize the weights for all N samples}
2: Let k be the number of boosting rounds.
3: for i = 1 to k do
4:   Create training set D_i by sampling (with replacement) from D according to w.
5:   Train a base classifier C_i on D_i.
6:   Apply C_i to all examples in the original training set D and calculate the weighted error
     $$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j\, \delta\big(C_i(x_j) \neq y_j\big)$$
7:   if ε_i > 0.5 then
       w = {w_j = 1/N | j = 1, 2, ..., N}  {reset the weights for all N examples}
       Go back to step 4.
8:   end if
9:   Calculate
     $$\alpha_i = \frac{1}{2} \ln\frac{1-\epsilon_i}{\epsilon_i}$$
10:  Update the weight of each example (weight-update equation below).
11: end for


Boosting - AdaBoost

Base classifiers: C_1, C_2, ..., C_T

Error rate:

$$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j\, \delta\big(C_i(x_j) \neq y_j\big)$$

Importance of a classifier:

$$\alpha_i = \frac{1}{2} \ln\frac{1-\epsilon_i}{\epsilon_i}$$


Boosting - AdaBoost

Weight update:

$$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \neq y_j \end{cases}$$

where Z_i is the normalization factor.

If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated.

Classification:

$$C^*(x) = \arg\max_{y} \sum_{i=1}^{T} \alpha_i\, \delta\big(C_i(x) = y\big)$$
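Putting the algorithm and these formulas together, here is a hedged Python sketch of AdaBoost on the same x/y data. It follows the slides' equations (including the 1/N factor in the weighted error); the stump fitting minimizes plain training error rather than entropy, a simplification of ours.

```python
import math
import random

# AdaBoost with one-level decision stumps on the data from the slides.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def fit_stump(xs, ys):
    """Return a predictor x -> +1/-1 for the best test x <= k."""
    best = None
    for k in xs:
        for left in (1, -1):
            errors = sum((left if x <= k else -left) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, k, left)
    _, k, left = best
    return lambda x, k=k, left=left: left if x <= k else -left

def adaboost(rounds=3):
    n = len(X)
    w = [1.0 / n] * n                                  # step 1: equal weights
    classifiers, alphas = [], []
    while len(classifiers) < rounds:
        idx = random.choices(range(n), weights=w, k=n)  # step 4: sample D_i
        C = fit_stump([X[i] for i in idx], [Y[i] for i in idx])  # step 5
        # Step 6: weighted error on the original data (with the 1/N factor).
        eps = sum(wj for wj, x, y in zip(w, X, Y) if C(x) != y) / n
        if eps > 0.5:                                   # step 7: reset, resample
            w = [1.0 / n] * n
            continue
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))  # step 9
        # Step 10: exp(-alpha) for correct records, exp(+alpha) for errors,
        # then divide by Z so the weights sum to 1 again.
        w = [wj * math.exp(alpha if C(x) != y else -alpha)
             for wj, x, y in zip(w, X, Y)]
        Z = sum(w)
        w = [wj / Z for wj in w]
        classifiers.append(C)
        alphas.append(alpha)
    def predict(x):  # alpha-weighted majority vote
        return 1 if sum(a * C(x) for a, C in zip(alphas, classifiers)) >= 0 else -1
    return predict

predict = adaboost()
print([predict(x) for x in X])
```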


    Boosting - AdaBoost

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y   1     1     1     -1    -1    -1    -1    1     1     1

N = 10, i.e. the number of elements shown above.
w = 1/N = 1/10 = 0.1 is the initial weight assigned to each element in the data.
Let k = number of boosting rounds = 3.


Boosting - AdaBoost

In each of the three boosting rounds the elements are sampled with replacement; hence an element can appear more than once in a round's training set.


Boosting - AdaBoost

In round 1 all elements are given the same weight, 1/10 = 0.1. The weights of the training records are then updated from round to round (the calculation is shown on subsequent slides).


Boosting - AdaBoost

The split point defines the stump's test x <= k. The round-1 stump misclassifies only the first three records (x = 0.1, 0.2, 0.3), which is what the error calculation below uses.


Boosting - AdaBoost

We need to calculate the values of ε_i and α_i so that the new weights can be calculated according to the equations:

$$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j\, \delta\big(C_i(x_j) \neq y_j\big)$$

$$\alpha_i = \frac{1}{2} \ln\frac{1-\epsilon_i}{\epsilon_i}$$

$$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \neq y_j \end{cases}$$

where Z_i is the normalization factor.


Boosting - AdaBoost

The calculation is as follows: δ(C_i(x_j) ≠ y_j) = 1 if the round's prediction for a record does not match its original label, and 0 otherwise. Thus δ = 1 for the first three data elements, and w = 0.1 is the weight assigned to each element in the first round.

$$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j\, \delta\big(C_i(x_j) \neq y_j\big)$$

ε_i = 1/10 × (0.1×1 + 0.1×1 + 0.1×1 + 0 + 0 + ...) = 0.1 × (0.3) = 0.03


Boosting - AdaBoost

We have the value of ε_i:

ε_i = 0.1 × (0.3) = 0.03
α_i = (1/2) ln((1 - 0.03) / 0.03) = 1.738


Boosting - AdaBoost

We now need to calculate the new weights, given by the equation:

$$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \neq y_j \end{cases}$$

where Z_i is the normalization factor, which ensures that Σ_j w_j^(i+1) = 1.

The condition in the equation distinguishes matching (correctly classified) from non-matching (misclassified) values:

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y   1     1     1     -1    -1    -1    -1    1     1     1


Boosting - AdaBoost

We need to calculate the value of Z_i, the normalization factor. Three records are non-matching (misclassified, factor e^1.738) and seven are matching (factor e^-1.738 ≈ 0.175):

1 = (0.1/Z_i)(e^1.738) + (0.1/Z_i)(e^1.738) + (0.1/Z_i)(e^1.738) + (0.1/Z_i)(0.175 × 7)

The value of Z_i must make the right-hand side equal to 1. Solving this equation gives Z_i ≈ 1.82.

For non-matching instances the new weights are:
(0.1 / 1.82) × e^1.738 ≈ 0.31

For matching instances the new weights are:
(0.1 / 1.82) × e^-1.738 ≈ 0.0096 ≈ 0.01

The whole process is then repeated with the new weights.
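A quick numeric check of these round-1 figures in Python (note the slides truncate Z to 1.82; the unrounded value is closer to 1.83):

```python
import math

# Round-1 figures: three records are misclassified (factor e^{+alpha})
# and seven are correct (factor e^{-alpha}).
alpha = 0.5 * math.log((1 - 0.03) / 0.03)
print(round(alpha, 3))                       # 1.738

Z = 0.1 * math.exp(alpha) * 3 + 0.1 * math.exp(-alpha) * 7
print(round(Z, 2))                           # 1.83 (truncated to 1.82 on the slides)

print(round(0.1 * math.exp(alpha) / Z, 2))   # 0.31   misclassified record
print(round(0.1 * math.exp(-alpha) / Z, 4))  # 0.0096 correctly classified record
```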


Boosting - AdaBoost

The final class is given by the α-weighted vote of the three round classifiers. For a record that the three classifiers predict as -1, 1 and 1, the vote is

-1 × (1.738) + 1 × (2.7784) + 1 × (4.1195) = 5.16, i.e. class 1,

while for a record predicted as -1, 1 and -1 it is

-1 × (1.738) + 1 × (2.7784) + (-1) × (4.1195) = -3.08, i.e. class -1.
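The same arithmetic as a small Python check, with the three α values (1.738, 2.7784, 4.1195) taken from the slides and the two prediction patterns shown above:

```python
# Final classification: sign of the alpha-weighted vote of the three
# round classifiers.
alphas = [1.738, 2.7784, 4.1195]

def weighted_vote(predictions):
    return sum(a * p for a, p in zip(alphas, predictions))

print(round(weighted_vote([-1, 1, 1]), 2))   # 5.16  -> class  1
print(round(weighted_vote([-1, 1, -1]), 2))  # -3.08 -> class -1
```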