
Appendix to
Fast Cross-Validation via Sequential Analysis

Tammo Krueger, Danny Panknin, Mikio Braun
Technische Universitaet Berlin
Machine Learning Group
10587 Berlin
[email protected], {panknin|mikio}@cs.tu-berlin.de

    1 Selection of Meta-Parameters for the Fast Cross-validation

The algorithm has a number of free parameters, as can be seen from the pseudo-code in Algorithm 1: maxSteps, the number of subsample sizes to consider, α, the significance level for the binarization of the test errors, αl, βl, the significance levels for the sequential analysis test, and earlyStoppingWindow, the number of steps to look back in the early stopping procedure. While we give an in-depth treatment of the selection of π0, π1 and the maxSteps parameter in the following sections, we here give some suggestions for the other parameters. The parameter α controls the significance level in each step of the test for similar behavior. We suggest to set this to the usual level of α = 0.05. Furthermore, βl and αl control the significance levels for accepting H0 (configuration is a loser) and H1 (configuration is a winner), respectively. We suggest an asymmetric setup, setting βl = 0.1, since we want to drop loser configurations relatively fast, and αl = 0.01, since we want to be really sure when we accept a configuration as overall winner. Finally, we set earlyStoppingWindow to 3 for maxSteps = 10 and to 6 for maxSteps = 20, as we have observed that this choice works well in practice.
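For reference, a minimal sketch of these suggested defaults in Python; the dictionary keys are illustrative names of ours, not a fixed API:

    # Suggested defaults from this section; the key names are illustrative only.
    SUGGESTED_DEFAULTS = {
        "maxSteps": 10,             # or 20, see Section 1.2
        "alpha": 0.05,              # per-step significance of the similarity test
        "alpha_l": 0.01,            # significance for accepting H1 (winner)
        "beta_l": 0.1,              # significance for accepting H0 (loser)
        "earlyStoppingWindow": 3,   # use 6 when maxSteps = 20
    }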

    1.1 Choosing the Optimal Sequential Test Parameters

As outlined in the main part of the paper, we want to use the sequential testing framework to eliminate underperforming configurations as fast as possible, while postponing the decision for a winner as long as possible. In terms of the parameters of the sequential testing framework, we have to choose π0 and π1 such that the area of acceptance for H0 (the region H0(π0, π1, αl, βl), denoted by LOSER in the overview figure) is maximized, while the earliest point of acceptance of H1 (Sa(π0, π1, αl, βl) in the overview figure) is postponed until the procedure has run at least maxSteps steps:

\[
(\pi_0, \pi_1) = \operatorname*{argmax}_{\pi_0, \pi_1} \left| H_0(\pi_0, \pi_1, \alpha_l, \beta_l) \right| \quad \text{s.t.} \quad S_a(\pi_0, \pi_1, \alpha_l, \beta_l) \in (\text{maxSteps} - 1, \text{maxSteps}] \tag{1}
\]

    It turns out that the global optimization in Equation (1) can be approximated by

\[
\pi_0 = 0.5, \qquad \pi_1 = \min \left\{ \pi_1 \;\middle|\; \operatorname{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) \le \text{maxSteps} \right\} \tag{2}
\]

where ASN(π0, π1 | π = 1.0) (Average Sample Number) is the expected number of steps until the given test yields a decision, if the real π = 1.0. For details of the sequential analysis please consult [1].
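Since Section 3 derives a closed form for ASN(π0, π1 | π = 1.0), the minimization in Equation (2) can be solved directly. A minimal Python sketch under that closed form; the function name is ours:

    import math

    def pi1_for_max_steps(max_steps, alpha_l=0.01, beta_l=0.1, pi0=0.5):
        """Smallest pi1 with ASN(pi0, pi1 | pi = 1.0) <= max_steps.

        Uses the closed form ASN = log((1 - beta_l)/alpha_l) / log(pi1/pi0)
        from Section 3; the ASN is decreasing in pi1, so the minimum is
        attained at equality.
        """
        return pi0 * math.exp(math.log((1 - beta_l) / alpha_l) / max_steps)

    # pi1_for_max_steps(10) -> 0.784..., pi1_for_max_steps(20) -> 0.626...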

Note that sequential analysis formally requires i.i.d. variables, which is clearly not the case in our setting. However, we focus on loser configurations, whose traces are always zero (ergo deterministic) and therefore i.i.d. by construction. Also note that the true distribution of the trace matrix is complex and in general unknown. Our method should therefore be considered a first approximation, with more refined methods being the topic of future work.


Algorithm 1 Fast Cross-Validation

 1: function FASTCV(data, maxSteps, configurations, α, αl, βl, earlyStoppingWindow)
 2:   Δ ← N/maxSteps; modelSize ← Δ; test ← getTest(maxSteps, αl, βl)
 3:   ∀s ∈ {1, . . . , maxSteps}, c ∈ configurations: traces[c, s] ← performance[c, s] ← 0
 4:   ∀c ∈ configurations: remainingModel[c] ← true
 5:   for s ← 1 to maxSteps do
 6:     pointwisePerformance ← calcPerformance(data, modelSize, remainingModel)
 7:     performance[remainingModel, s] ← averagePerformance(pointwisePerformance)
 8:     traces[bestPerformingConfigurations(pointwisePerformance, α), s] ← 1
 9:     remainingModel[loserConfigurations(test, traces[remainingModel, 1:s])] ← false
10:     if similarPerformance(traces[remainingModel, (s − earlyStoppingWindow):s], α) then
11:       break
12:     modelSize ← modelSize + Δ
13:   return selectWinner(performance, remainingModel)

[Figure 1 graphic. Left panel: relative speedup (y-axis, 1 to 70) over steps (x-axis, 20 to 140), one curve per experiment class (easy, medium, hard). Right panel: false negative rate (y-axis, 0.0 to 0.8) over the change point, one sub-panel for steps=10 and one for steps=20, one curve per π_before ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.]

Figure 1: Left: Relative speed gain of fast CV compared to full CV. We assume that training time is cubic in the number of samples. Shown are simulated runtimes for 10-fold CV on different problem classes with different loser/winner ratios (easy: 3:1, medium: 1:1, hard: 1:3) over 100 resamples. Right: False negatives generated for non-stationary configurations, i.e., at the given change point the Bernoulli variable changes its parameter from the indicated value π_before to 1.0.

    1.2 Determine the Number of Steps

In this section we consider the maxSteps parameter. In principle, a larger number of steps leads to more robust estimates, but also to an increase in computation time. We study the effect of different choices of this parameter in a simulation. For the sake of simplicity we assume that the binary top-or-flop scheme consists of independent Bernoulli variables with π_winner ∈ [0.9, 1.0] and π_loser ∈ [0.0, 0.1]. Figure 1 shows the resulting simulated runtimes for different settings. We see that the largest speed-up can be expected for 10 ≤ maxSteps ≤ 20. The speed gain rapidly decreases afterwards and becomes negligible between 40 steps for the hard setup and 100 steps for the easy setup. These simplified findings suggest that all following experiments should be carried out with either 10 or 20 steps.
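The simulation can be reproduced in spirit with a short Monte Carlo. The sketch below uses our own rough cost bookkeeping (one cubic-cost training per remaining configuration per step for the fast CV, versus folds · configurations full-size trainings for the full CV), so the absolute numbers will differ from Figure 1:

    import math, random

    def simulate_speedup(n_configs=25, loser_ratio=0.5, max_steps=10,
                         folds=10, alpha_l=0.01, beta_l=0.1, n_rep=100):
        """Monte Carlo estimate of the relative speedup of fast CV over full CV
        under the simplifying assumptions of this section: independent Bernoulli
        traces (winners: pi in [0.9, 1.0], losers: pi in [0.0, 0.1]), cubic
        training time, losers dropped by Wald's SPRT."""
        pi0 = 0.5
        pi1 = pi0 * ((1 - beta_l) / alpha_l) ** (1.0 / max_steps)  # Equation (2)
        log_b = math.log(beta_l / (1 - alpha_l))                   # loser boundary
        up = math.log(pi1 / pi0)
        down = math.log((1 - pi1) / (1 - pi0))
        speedups = []
        for _ in range(n_rep):
            p = [random.uniform(0.0, 0.1) if random.random() < loser_ratio
                 else random.uniform(0.9, 1.0) for _ in range(n_configs)]
            llr = [0.0] * n_configs
            alive = [True] * n_configs
            fast_cost = 0.0
            for s in range(1, max_steps + 1):
                step_cost = (s / max_steps) ** 3   # train on s * N/maxSteps samples
                for k in range(n_configs):
                    if not alive[k]:
                        continue
                    fast_cost += step_cost
                    llr[k] += up if random.random() < p[k] else down
                    if llr[k] <= log_b:            # SPRT drops this configuration
                        alive[k] = False
            full_cost = n_configs * folds          # full-size trainings, cost ~ 1 each
            speedups.append(full_cost / fast_cost)
        return sum(speedups) / n_rep

    # e.g. simulate_speedup(loser_ratio=0.75)   # roughly the 'easy' 3:1 setup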

    2 False Negative Rate

The types of errors we must be most concerned with in our procedure are false negatives: configurations which are eliminated although they are among the top configurations on the full sample. In the following we study the false negative rate: we prove the maximal number of times a configuration can be a loser before it is eliminated, and study the general effect in simulations.

Assume that there exists a change point cp such that a winning configuration loses for the first cp iterations. From the properties of our algorithm we can prove a security zone in which the fast cross-validation has a false negative rate (FNR) of zero (see the next section for details): As long as

\[
0 \le cp \le \text{maxSteps} \cdot \frac{\log \frac{\beta_l}{1 - \alpha_l} \, \log \frac{\pi_1}{\pi_0}}{\log \frac{1 - \beta_l}{\alpha_l} \, \log \frac{1 - \pi_1}{1 - \pi_0}} \quad \text{with} \quad \text{maxSteps} \ge \left\lceil \log \frac{1 - \beta_l}{\alpha_l} \Big/ \log 2 \right\rceil,
\]

the probability of a FNR larger than zero is zero. For instance, for αl = 0.01 and βl = 0.1 we can start a fast cross-validation run with minimally 7 steps, since there is no suitable test available for a smaller number of steps. For maxSteps = 10 steps, the security zone amounts to 0.27 · 10 = 2.7, meaning that if the change point for all switching configurations occurs at step one or two, the fast cross-validation procedure would not suffer from false negatives. Similarly, for maxSteps = 20 the security zone is 0.39 · 20 = 7.8.
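These two numbers can be reproduced directly from the quantities derived in Section 3; the helper below (our naming) simply evaluates n_drop/ASN:

    import math

    def security_zone_fraction(max_steps, alpha_l=0.01, beta_l=0.1, pi0=0.5):
        """n_drop / ASN from Section 3: the fraction of maxSteps up to which
        a change point provably produces no false negative."""
        pi1 = pi0 * ((1 - beta_l) / alpha_l) ** (1.0 / max_steps)
        n_drop = math.log(beta_l / (1 - alpha_l)) / math.log((1 - pi1) / (1 - pi0))
        asn = math.log((1 - beta_l) / alpha_l) / math.log(pi1 / pi0)
        return n_drop / asn

    # security_zone_fraction(10) -> 0.273..., security_zone_fraction(20) -> 0.394...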

To illustrate the false negative rate further, we simulate those switching configurations by independent Bernoulli variables, which change their parameter from a chosen π_before ∈ {0.1, 0.2, . . . , 0.5} to a constant 1.0 at a given change point. The relative loss of these configurations for 10 and 20 steps is plotted in Figure 1, right panel, for different change points. As stated by our theoretical result above, the FNR is zero for sufficiently small change points. After that, there is an increasing probability that the configuration will be removed. As our experiments showed, we nevertheless see consistently good performance of the fast cross-validation procedure, indicating that the change points are sufficiently small for real data sets.

    3 Proof of Security Zone Bound

In this section we prove the security zone bound of the previous section. We follow the notation and treatment of sequential analysis as found in the original publication of Wald [1], Sections 5.3 to 5.5. First of all, Wald proves in Equation 5:27 that the following approximation holds:

\[
\operatorname{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) = \frac{\log \frac{1 - \beta_l}{\alpha_l}}{\log \frac{\pi_1}{\pi_0}}.
\]

The minimal ASN(π0, π1 | π = 1.0) is therefore attained if log(π1/π0) is maximal, which is clearly the case for π1 = 1.0 and π0 = 0.5, the latter holding by construction. So we get the lower bound on maxSteps for given significance levels αl, βl:

\[
\text{maxSteps} \ge \left\lceil \log \frac{1 - \beta_l}{\alpha_l} \Big/ \log 2 \right\rceil.
\]

The lower line L0 of the graphical sequential analysis test, as exemplified in the overview figure of the paper, is defined as follows (see Equations 5:13 to 5:15):

\[
L_0 = \frac{\log \frac{\beta_l}{1 - \alpha_l}}{\log \frac{\pi_1}{\pi_0} - \log \frac{1 - \pi_1}{1 - \pi_0}} \;-\; n \cdot \frac{\log \frac{1 - \pi_1}{1 - \pi_0}}{\log \frac{\pi_1}{\pi_0} - \log \frac{1 - \pi_1}{1 - \pi_0}}.
\]

Setting L0 = 0, we get the intersection of the lower test line with the x-axis and therefore the earliest step n_drop at which the procedure will drop a constant loser configuration. This yields

\[
n_{\text{drop}} = \frac{\log \frac{\beta_l}{1 - \alpha_l}}{\log \frac{\pi_1}{\pi_0} - \log \frac{1 - \pi_1}{1 - \pi_0}} \Bigg/ \frac{\log \frac{1 - \pi_1}{1 - \pi_0}}{\log \frac{\pi_1}{\pi_0} - \log \frac{1 - \pi_1}{1 - \pi_0}} = \frac{\log \frac{\beta_l}{1 - \alpha_l}}{\log \frac{1 - \pi_1}{1 - \pi_0}}.
\]

Setting n_drop in relation to ASN(π0, π1 | π = 1.0) yields the security zone bound of the previous section.
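The last solving step can be checked symbolically, for instance with sympy:

    import sympy as sp

    # Solve L0(n) = 0 for n and compare against the closed form above.
    al, bl, p0, p1, n = sp.symbols('alpha_l beta_l pi_0 pi_1 n', positive=True)
    D = sp.log(p1 / p0) - sp.log((1 - p1) / (1 - p0))
    L0 = sp.log(bl / (1 - al)) / D - n * sp.log((1 - p1) / (1 - p0)) / D
    n_drop = sp.solve(sp.Eq(L0, 0), n)[0]
    assert sp.simplify(n_drop - sp.log(bl / (1 - al)) / sp.log((1 - p1) / (1 - p0))) == 0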

    4 Error Rates on Benchmark Data

The following table shows the mean absolute difference of test errors (fast versus full cross-validation) in percentage points, with 95% confidence intervals (standard error, 100 repetitions), for various setups. The fast setup runs with maxSteps = 10 steps, while the slow setup is executed with 20 steps. Each setup is employed once with and once without the early stopping rule.


Dataset        fast/early       fast             slow/early       slow
banana          0.20 % ± 0.18    0.11 % ± 0.15    0.32 % ± 0.22    0.07 % ± 0.10
breastCancer    2.00 % ± 1.85    2.09 % ± 1.64   -0.38 % ± 2.91    1.46 % ± 1.95
diabetis        0.56 % ± 0.88    0.80 % ± 0.82    0.68 % ± 0.81   -0.00 % ± 0.71
flareSolar      1.44 % ± 2.95    2.53 % ± 3.31    1.39 % ± 1.77   -0.11 % ± 1.86
german          0.45 % ± 0.70    0.92 % ± 0.58    1.14 % ± 0.53    0.86 % ± 0.62
image           0.19 % ± 0.19    0.22 % ± 0.20    0.46 % ± 0.26    0.41 % ± 0.24
ringnorm        0.03 % ± 0.03    0.00 % ± 0.04    0.05 % ± 0.04    0.03 % ± 0.04
splice          0.25 % ± 0.19    0.32 % ± 0.18    0.15 % ± 0.19    0.14 % ± 0.15
thyroid         0.39 % ± 0.53   -0.13 % ± 0.47   -0.06 % ± 0.56   -0.38 % ± 0.44
twonorm        -0.02 % ± 0.03   -0.03 % ± 0.04    0.00 % ± 0.05    0.00 % ± 0.03
waveform        0.27 % ± 0.12    0.21 % ± 0.17    0.33 % ± 0.15    0.21 % ± 0.15
covertype       0.78 % ± 0.21    0.89 % ± 0.19    0.65 % ± 0.19    0.88 % ± 0.20

    5 Example Run of Fast Cross-Validation

In this section we give an example of the whole fast cross-validation procedure on a toy data set of n = 1,000 data points, which is based on a sine wave y = sin(x) + ε, x ∈ [0, 2πd], with ε being Gaussian noise (μ = 0, σ = 0.25). The parameter d = 50 controls the inherent complexity of the data, and the sign of y is taken as the class membership. The fast cross-validation is executed with maxSteps = 10 and earlyStoppingWindow = 3. We use a ν-SVM [2] and test a parameter grid of log10 σ ∈ {−1, −0.5, 0, 0.5, 1} for the Gaussian kernel width and ν ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.
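For illustration, the toy data can be generated along the following lines; the uniform sampling of x and the seed are our assumptions, as the text does not specify the sampling scheme:

    import numpy as np

    # Sketch of the noisy sine toy data set; uniform sampling of x is an
    # assumption, the text only specifies the range, noise, and labels.
    rng = np.random.default_rng(0)
    n, d = 1000, 50
    x = rng.uniform(0.0, 2.0 * np.pi * d, size=n)    # x in [0, 2*pi*d]
    y = np.sin(x) + rng.normal(0.0, 0.25, size=n)    # Gaussian noise, sigma = 0.25
    labels = np.sign(y)                              # class membership = sign of y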

The procedure runs for 4 steps, after which the early stopping rule takes effect. This yields the following traces matrix (only remaining configurations are shown):

Configuration             modelSize=100  modelSize=200  modelSize=300  modelSize=400
log10 σ = 0,   ν = 0.1          1              1              0              0
log10 σ = 0,   ν = 0.2          1              1              0              0
log10 σ = 0,   ν = 0.3          1              1              0              0
log10 σ = 0,   ν = 0.4          1              1              0              0
log10 σ = 0,   ν = 0.5          1              1              0              0
log10 σ = 0.5, ν = 0.1          1              1              1              1
log10 σ = 0.5, ν = 0.2          1              1              1              1
log10 σ = 0.5, ν = 0.3          1              1              1              0
log10 σ = 0.5, ν = 0.4          1              1              1              0
log10 σ = 0.5, ν = 0.5          1              1              0              0
log10 σ = 1,   ν = 0.1          1              1              1              1
log10 σ = 1,   ν = 0.2          1              1              1              1
log10 σ = 1,   ν = 0.3          1              1              1              1
log10 σ = 1,   ν = 0.4          0              1              1              0

The corresponding performances (prediction accuracy) are as follows; from these the procedure chooses log10 σ = 1, ν = 0.2 as the final winning configuration:

Configuration             modelSize=100  modelSize=200  modelSize=300  modelSize=400
log10 σ = 0,   ν = 0.1        0.659          0.760          0.824          0.858
log10 σ = 0,   ν = 0.2        0.659          0.759          0.826          0.855
log10 σ = 0,   ν = 0.3        0.659          0.759          0.824          0.857
log10 σ = 0,   ν = 0.4        0.659          0.759          0.827          0.857
log10 σ = 0,   ν = 0.5        0.659          0.760          0.824          0.853
log10 σ = 0.5, ν = 0.1        0.657          0.757          0.841          0.873
log10 σ = 0.5, ν = 0.2        0.657          0.759          0.853          0.872
log10 σ = 0.5, ν = 0.3        0.657          0.762          0.851          0.867
log10 σ = 0.5, ν = 0.4        0.658          0.762          0.850          0.865
log10 σ = 0.5, ν = 0.5        0.658          0.756          0.837          0.857
log10 σ = 1,   ν = 0.1        0.652          0.743          0.847          0.878
log10 σ = 1,   ν = 0.2        0.648          0.746          0.866          0.895
log10 σ = 1,   ν = 0.3        0.646          0.766          0.861          0.883
log10 σ = 1,   ν = 0.4        0.624          0.745          0.861          0.860


    6 Non-Parametric Tests

The tests used in the fast cross-validation procedure are common tools in the field of statistical data analysis. Here we give a short summary based on the Dataplot manual [3]. Both methods deal with a data matrix of c experimental treatments with observations arranged in r blocks:

          Treatment
Block     1      2      . . .  c
1         x11    x12    . . .  x1c
2         x21    x22    . . .  x2c
3         x31    x32    . . .  x3c
. . .     . . .  . . .  . . .  . . .
r         xr1    xr2    . . .  xrc

Both tests treat similar questions (Do the c treatments have identical effects?) but are designed for different kinds of data: the Cochran Q test is tuned for binary xij, while the Friedman test acts on continuous values. In the context of the fast cross-validation procedure the tests are used for two different tasks:

1. Determine whether a set of configurations are the top performing ones (the corresponding step in the overview figure and the function bestPerformingConfigurations in Algorithm 1).

2. Check whether the remaining configurations behaved similarly in the past (the corresponding step in the overview figure and the function similarPerformance in Algorithm 1).

In both cases the configurations act as treatments, either on the samples (point 1 above) or on the last earlyStoppingWindow traces (point 2 above) of the remaining configurations. Depending on the learning problem, either the Friedman test (for regression tasks) or the Cochran Q test (for classification tasks) is used in point 1.

    In both cases the hypotheses for the tests are as follows:

    H0: All treatments are equally effective (no effect)

H1: There is a difference in the effectiveness among the treatments, i.e., there is at least one treatment showing a significant effect.

    6.1 Cochran Q Test

    The test statistic is calculated as follows:

\[
T = \frac{c (c - 1) \sum_{i=1}^{c} \left( C_i - \frac{N}{c} \right)^2}{\sum_{i=1}^{r} R_i (c - R_i)}
\]

with Ci denoting the column total for the ith treatment, Ri the row total for the ith block, and N the grand total of all values. We reject H0 if T > χ²(1 − α, c − 1), with χ²(1 − α, c − 1) denoting the (1 − α)-quantile of the χ² distribution with c − 1 degrees of freedom, and α the significance level.
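A direct NumPy transcription of this statistic might look as follows (our sketch, not a library API); for the rejection decision one compares the result against scipy.stats.chi2.ppf(1 - alpha, c - 1):

    import numpy as np

    def cochran_q(x):
        """Cochran Q statistic for a binary r-by-c matrix (blocks x treatments),
        transcribing the formula above. Note: blocks whose entries are all
        equal contribute nothing to the denominator."""
        x = np.asarray(x)
        r, c = x.shape
        col_totals = x.sum(axis=0)      # C_i, one total per treatment
        row_totals = x.sum(axis=1)      # R_i, one total per block
        n_total = x.sum()               # N, grand total
        num = c * (c - 1) * ((col_totals - n_total / c) ** 2).sum()
        den = (row_totals * (c - row_totals)).sum()
        return num / den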

    6.2 Friedman Test

Let R(xij) be the rank assigned to xij within block i (i.e., ranks within a given row). Average ranks are used in the case of ties. The ranks are summed to obtain

\[
R_j = \sum_{i=1}^{r} R(x_{ij}).
\]

    The test statistic is then calculated as follows:

\[
T = \frac{12}{r c (c + 1)} \sum_{j=1}^{c} \left( R_j - \frac{r (c + 1)}{2} \right)^2.
\]

We reject H0 if T > χ²(1 − α, c − 1), with χ²(1 − α, c − 1) denoting the (1 − α)-quantile of the χ² distribution with c − 1 degrees of freedom, and α the significance level.
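Analogously, a short NumPy sketch of the Friedman statistic; scipy.stats.friedmanchisquare computes an equivalent, tie-corrected variant and can serve as a cross-check:

    import numpy as np
    from scipy import stats

    def friedman_t(x):
        """Friedman statistic for an r-by-c matrix (blocks x treatments),
        transcribing the formula above (no tie correction)."""
        x = np.asarray(x, dtype=float)
        r, c = x.shape
        ranks = stats.rankdata(x, axis=1)   # rank within each block, ties averaged
        rj = ranks.sum(axis=0)              # R_j, rank sums per treatment
        return 12.0 / (r * c * (c + 1)) * ((rj - r * (c + 1) / 2.0) ** 2).sum()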


    References

    [1] Abraham Wald. Sequential Analysis. Wiley, 1947.

[2] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, May 2000.

[3] James J. Filliben and Alan Heckert. Dataplot Reference Manual, Volume 1: Commands. Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology.
