Solving Test Case Based Problems With Fuzzy Dominance

Jason Zutty
Georgia Tech Research Institute
925 Dalney St NW
Atlanta, GA 30332
[email protected]

Gregory Rohling
Georgia Tech Research Institute
925 Dalney St NW
Atlanta, GA 30332
[email protected]

ABSTRACT

Genetic algorithms and genetic programming lend themselves well to the field of machine learning, which involves solving test case based problems. However, most traditional multi-objective selection methods work with scalar objectives, such as minimizing false negative and false positive rates, that are computed from underlying test cases.

In this paper, we propose a new fuzzy selection operator that takes into account the statistical nature of machine learning problems based on test cases. Rather than use a Pareto rank or strength computed from scalar objectives, such as with NSGA2 or SPEA2, we will compute a probability of Pareto optimality. This will be accomplished through covariance estimation and Markov chain Monte Carlo simulation in order to generate probabilistic objective scores for each individual. We then compute a probability that each individual will generate a Pareto optimal solution. This probability is directly used with a roulette wheel selection technique.

Our method’s performance is evaluated on the evolution of a feature selection vector for a binary classification on each of eight different activities. Fuzzy selection performance varies, outperforming both NSGA2 and SPEA2 in both speed (measured in generations) and solution quality (measured by area under the curve) in some cases, while underperforming in others.

CCS CONCEPTS

• Computing methodologies → Genetic algorithms; Search methodologies; • Mathematics of computing → Probabilistic algorithms;

KEYWORDS

Genetic Algorithms, Machine Learning, Markov Chain Monte Carlo, Pareto Dominance

ACM Reference format:
Jason Zutty and Gregory Rohling. 2017. Solving Test Case Based Problems With Fuzzy Dominance. In Proceedings of GECCO ’17, Berlin, Germany, July 15-19, 2017, 8 pages.
https://doi.org/http://dx.doi.org/10.1145/3071178.3071234

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
GECCO ’17, July 15-19, 2017, Berlin, Germany
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-4920-8/17/07...$15.00
https://doi.org/http://dx.doi.org/10.1145/3071178.3071234

1 INTRODUCTION

In the current age of big data, a new field of data science has been emerging, seeking methods to automate the process of building machine learning algorithms using multiple-objective vector-based genetic programming [16, 17] or machine learning pipeline optimization [7, 8]. There are two significant differences between these methods and traditional genetic programming. First, evaluations are extremely expensive compared to those of a scalar based tree structure. Second, our objectives are all based on aggregate statistics, such as true positive or false positive rates, computed from multitudes of test cases.

While we have shown that traditional selection methods such as NSGA2 or SPEA2 have performed acceptably in evolving machine learning algorithms from these aggregate statistics, it is worth noting that these selection methods have been designed to be quick and efficient. In traditional genetic programming, the selection algorithm is often the bottleneck of the evolutionary process; in our case, however, it is the evaluations that take most of the processing time. To that end, we seek selection methods that may take more processing time, but also take advantage of the underlying statistical behavior of the machine learning algorithm results in order to produce better solutions (measured by area under the Pareto curve) more quickly (measured in generations).

This paper introduces the concept of fuzzy selection, which focuses on estimating the statistics of each solution in objective space from test cases, rather than a single point. Using these statistical representations allows for dominance comparisons in terms of probability rather than a resolved state. For example, in a pairwise comparison using scalar objectives one individual would either be dominant to, dominated by, or co-dominant with respect to the other. Instead, all of these cases have some probability of occurring based upon the distributions of the underlying test cases used to compute each objective score. We propose a method for computing a probability of selection based upon a probability of Pareto optimality, rather than using a resolved state of dominance between coordinates in objective space. We estimate this probability through covariance estimation combined with Markov chain Monte Carlo simulation. This probability is then used with a roulette wheel selection method. We then compare performance to traditional selection methods NSGA2 and SPEA2.

2 RELATED WORK

2.1 Traditional Multi-Objective Selection Methods

One of the major components of any genetic algorithm is the selection of parent genomes to produce the next generation of individuals. In multi-objective optimization most of these algorithms rely on the concept of Pareto optimality. An individual is Pareto optimal if in objective space there is no other individual that improves on all objectives, i.e. it is impossible to improve in one dimension without sacrificing performance in another. Mathematically, this is expressed as follows:

Let a population of size N be represented as:

$$x = x_0, x_1, \ldots, x_N$$

where $x_i$ represents a single individual in the population. Let an objective vector of M objectives be represented as:

$$f(x_i) = f_0(x_i), f_1(x_i), \ldots, f_M(x_i)$$

where $f_k$ is the kth objective function. Then an individual $x_i$ is Pareto optimal in a maximization scheme if:

$$\nexists\, j \in 0 \ldots N \mid f_k(x_j) > f_k(x_i)\ \forall k \in (0 \ldots M)$$

Or in a minimization scheme:

$$\nexists\, j \in 0 \ldots N \mid f_k(x_j) < f_k(x_i)\ \forall k \in (0 \ldots M)$$

The Pareto frontier, or front for short, is the set of all individuals that are non-dominated. Solutions on the Pareto front can be thought of as co-dominant with one another.
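As a minimal sketch of the minimization condition above, the following Python checks every individual against every other one. The function names and the example error rates are illustrative assumptions, not taken from the paper. The sketch follows the strict form written above, where a dominator must be lower on every objective; many references instead require only "no worse on all objectives and strictly better on at least one".

```python
from typing import List, Sequence


def strictly_dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is strictly lower than b on every objective (minimization)."""
    return all(ak < bk for ak, bk in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Indices of individuals that no other individual strictly dominates."""
    front = []
    for i, fi in enumerate(scores):
        dominated = any(
            strictly_dominates(fj, fi) for j, fj in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Made-up (false negative rate, false positive rate) pairs for four individuals.
population = [(0.10, 0.40), (0.20, 0.20), (0.40, 0.10), (0.30, 0.50)]
front = pareto_front(population)
print(front)
```

The last individual is lower than none of the others on either objective and is beaten on both by the first two, so it is the only one excluded from the front.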

The non-dominated sorting genetic algorithm (NSGA), introduced in Multi-Objective Optimization using Non-Dominated Sorting in Genetic Algorithms [11], was one of the first genetic algorithms to emphasize the Pareto optimal multi-objective space. NSGA is useful for pushing the Pareto front over the course of an evolution by assigning its highest fitness to the first non-dominated front, and decreasing fitness with increasing dominance levels. In their paper introducing NSGA2 [2], Deb et al. point out that NSGA also has several weaknesses, including a high computational load, a lack of elitism, and a need for a sharing parameter. NSGA2 introduced a fast non-dominated sort algorithm and diversity preservation through crowding distance sorting.

Zitzler et al. [13] introduced an improved version of the Strength Pareto Evolutionary Algorithm (SPEA) [14]. This fitness scoring method assigns each evaluated individual a score that corresponds to the number of other individuals in the population that dominate it. An individual on the Pareto front would be assigned a score of zero. The authors also found that of SPEA, SPEA2, NSGA2, and PESA, SPEA2 had the highest performance in high dimensional objective spaces. The best overall performance was shown by SPEA2 and NSGA2 [15].

While these methods based on Pareto optimality have been shown to be successful, we believe there is more to be gained by exploring the underlying statistics of the scalar objectives. For example, if on a particular objective of a binary classifier there are 100 test cases that are used to compute a probability of detection, there are $\binom{100}{50} \approx 1 \cdot 10^{29}$ different possible individuals that give a 50% error rate. By collapsing these test cases to their averages in order to use traditional selection techniques, we are losing information that could potentially help to steer the population to new and interesting solutions.
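The count quoted above can be checked directly. This short sketch uses only the Python standard library; the variable names are illustrative.

```python
import math

n_cases = 100   # test cases behind one objective
n_errors = 50   # a 50% error rate means exactly 50 of them are wrong

# Number of distinct per-test-case outcome vectors that share this error rate.
arrangements = math.comb(n_cases, n_errors)
print(f"{arrangements:.3e}")
```

Every one of these outcome vectors collapses to the same scalar objective value, which is the information loss the paragraph describes.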

2.2 Test Case Based Problems

Traditionally, evolutionary machine learning is performed on aggregate scores of large numbers of test cases. In recent years focus has begun to shift toward the usage of vectorized information on how a classifier or predictor performs on each test case. Some of these methods include: Novelty search [4], Discovery of search objectives by clustering [6], and Discovery of search objectives by non-negative matrix factorization [5].

The latter two of these approaches derive a small number of objectives from a large number of test cases using an interaction matrix. The interaction matrix places each individual on a row, and the result on each test case across the columns. They then use traditional multi-objective approaches, such as those described in the previous section, on the derived objectives. While these methods have both been shown to produce good results and isolate unique individuals, we believe they are still limited by not capturing the statistics of the underlying objectives.

Novelty search, on the other hand, removes the sense of objectives entirely, and selects based on the uniqueness of the solution in behavioral space. For our purposes of a binary classifier, this would be the vector pertaining to the predictions for each test case. The higher the distance between this vector and the rest of the population, the higher the probability of selection. While this method is freer to explore the behavior space, it offers no guarantee that the solution will quickly converge to something that is objectively good; in fact, the search itself is divergent by design.

3 MOTIVATION

3.1 Probability of Dominance

There are many techniques for selection: NSGA2, SPEA2, MOEA/D, to name just a few. Each method has its strengths and weaknesses, but each of them can be considered a black box that takes in a population and, using some probabilistic method, returns a new population. In effect, all of these methods have control over one thing, which is the probability of selection per individual. In single objective optimizations using a scalar objective score, such as accuracy, a simple roulette selection provides the best control over probability of selection. In multi-objective methods, we desire a probability of selection that can be fed into a roulette wheel based on dominance relationships and crowding. Instead, most multi-objective selection techniques rely on Pareto rankings and strengths combined with sorting or tournaments.

For real world applications of evolved machine learning, it is worth noting that we have scalar objectives (e.g. an average error rate) computed from many observations that are fed to the machine learning algorithm as test cases. Despite all of the statistical information that can be derived from these vectors of performance, we traditionally only use the aggregate scalar objective score. Instead, we can compute statistical information for each objective based on the distribution of performance on test cases, and compute a probability of dominance rather than a resolved state based on the scalar.

Therefore, what we would really like to feed into a roulette wheel is the following computation of probability of dominance: Given M objectives and N individuals, the probability that $X_i$ is non-dominated in a minimization scheme is one minus the probability that it is dominated, which is the probability that there is some individual $X_j$ that is lower on each of the M objectives of individual $X_i$. Summed up mathematically, the probability that $X_i$ is non-dominated is given by:

$$P\left(X_i \prec X_{j=1 \ldots N,\ j \neq i}\right) = 1 - \bigcup_{j \in 1 \ldots N,\ j \neq i}\ \bigcap_{m=1}^{M} P\left(X_j(m) < X_i(m)\right)$$

Here the union and intersection are understood as acting on the events $X_j(m) < X_i(m)$.

In this paper, we will refer to this as the probability of Pareto-dominance, or probability of dominance for short. Unfortunately, computing this probability is riddled with challenges: given a population of individuals, we do not have independence on performance of test cases, performance on aggregate scalar objectives, or performance on individuals. So rather than directly compute this quantity, we use a Markov chain Monte Carlo approach to computing a probability of dominance that will drive selection.

4 METHODS

4.1 Computing Covariance

In order to estimate a probability of dominance using Monte Carlo, we need to simulate potential solutions in objective space from a population. If we take each objective per individual to be a random variable, we can use the result from each test case of an individual to represent an observation on that random variable. In this way we can construct an NM by K observation matrix, where N is the number of individuals, M is the number of objectives, and K is the number of test cases. In practice, however, it is rare to find a problem with an equal number of test cases for each objective. For example, in a binary classification problem, we may have more test cases for false positive error (where the truth data indicates a 0) than we do for true positive error (where the truth data indicates 1). To rectify this, we group the test cases such that each objective has the same number of test case groups, allowing us to form one-to-one relationships between our random variables. For each group, the result is the average of the test cases that contribute. The number of groups is chosen to be the minimum number of test cases that make up each objective; as a result, at least one objective will have one test case per test case group.
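The grouping step can be sketched as follows, assuming per-test-case results are stored as one N by K_m array per objective. How test cases are assigned to groups is not stated in the text, so contiguous chunks from `np.array_split` are an assumption here, as are the function names and the row ordering (individual-major, objective-minor).

```python
import numpy as np


def group_means(results: np.ndarray, n_groups: int) -> np.ndarray:
    """Average per-test-case results (columns) within n_groups contiguous chunks."""
    chunks = np.array_split(results, n_groups, axis=1)
    return np.stack([chunk.mean(axis=1) for chunk in chunks], axis=1)


def observation_matrix(per_objective: list) -> np.ndarray:
    """Build the NM-by-K matrix from a list of M arrays of shape (N, K_m).

    K is the smallest K_m. Rows are interlaced: individual 0 objective 0,
    individual 0 objective 1, ..., individual 1 objective 0, and so on.
    """
    n_groups = min(r.shape[1] for r in per_objective)
    grouped = np.stack(
        [group_means(r, n_groups) for r in per_objective], axis=1
    )  # shape (N, M, K)
    n, m, k = grouped.shape
    return grouped.reshape(n * m, k)


# Made-up 0/1 error indicators: 3 individuals, 4 positive and 10 negative cases.
rng = np.random.default_rng(0)
fn_errors = rng.integers(0, 2, size=(3, 4)).astype(float)
fp_errors = rng.integers(0, 2, size=(3, 10)).astype(float)
X = observation_matrix([fn_errors, fp_errors])
print(X.shape)
```

With 4 and 10 test cases, K becomes 4, so the false negative objective keeps one test case per group while the false positive cases are averaged in chunks of two or three.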

Once the observation matrix $\bar{X}$ has been constructed, we use it to compute the covariance matrix Σ between our NM variables, resulting in an NM by NM matrix. Ideally, this covariance matrix would be found by:

$$\Sigma_{ij} = \frac{1}{K} \sum_{k=1}^{K} \left(X_{ik} - E[X_i]\right)\left(X_{jk} - E[X_j]\right)$$

However, an issue arises with the number of objectives and individuals typically found in a multi-objective population based algorithm, where large covariance matrices become ill conditioned. When that happens, the inverse, which must be computed in the Markov chain Monte Carlo simulation, will not be useful due to numerical precision issues. To solve this issue, we use the Ledoit-Wolf method [3]. This procedure, implemented in scikit-learn [9], results in a well-conditioned, invertible estimator of the covariance matrix.
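A minimal sketch of the contrast between the two estimators, assuming scikit-learn's `LedoitWolf` class is the implementation referred to. The matrix sizes and random stand-in data are illustrative only.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
n_variables, n_groups = 40, 25            # NM variables, K observations (made up)
X = rng.random((n_variables, n_groups))   # stand-in observation matrix, one row per variable

# Empirical covariance with the 1/K normalization; rank-deficient when K < NM.
empirical = np.cov(X, bias=True)

# scikit-learn expects observations on rows, so the matrix is transposed.
shrunk = LedoitWolf().fit(X.T).covariance_

print(np.linalg.matrix_rank(empirical), np.linalg.matrix_rank(shrunk))

# The shrunk estimate can be inverted to obtain a precision matrix.
precision = np.linalg.inv(shrunk)
```

With fewer observations than variables the empirical matrix cannot be inverted at all, whereas the shrinkage estimate is positive definite.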

4.2 Markov Chain Monte Carlo

Given the computed estimate of the covariance matrix, Σ, along with the expected value for each objective for each individual, µ,

$$\mu = E[X_1], E[X_2], \ldots, E[X_{NM}]$$

we now seek to draw samples from the distribution they represent. Following our example of a binary classifier, this is a truncated multivariate normal distribution on the range [0,1], because false negative and false positive rates cannot fall outside those bounds. We represent our distribution as:

$$TN(\mu, \Sigma)$$

Drawing samples from a distribution can be as simple as drawing a random sample from a uniform distribution on [0,1] and mapping it through the inverse of the desired distribution’s cumulative distribution function (CDF). However, because a truncated multivariate normal distribution has no closed form solution for its CDF, we instead implement a Gibbs sampler prescribed by Wilhelm [12] as follows:

Let j be the trial number of the Monte Carlo and i be our position in the NM dimensional vector x. Then we draw our sample as:

$$x_{i.-i} = \mu_{i.-i} + \sigma_{i.-i}\, \Phi^{-1}\left[ U \left( \Phi\left(\frac{1 - \mu_{i.-i}}{\sigma_{i.-i}}\right) - \Phi\left(\frac{0 - \mu_{i.-i}}{\sigma_{i.-i}}\right) \right) + \Phi\left(\frac{0 - \mu_{i.-i}}{\sigma_{i.-i}}\right) \right]$$

where the notation $i.-i$ means sample i given all samples not i, U is a randomly drawn Uni(0, 1), and Φ is the CDF of a standard normal N(0, 1). Furthermore,

$$\mu_{i.-i} = \mu_i - H_{ii}^{-1} H_{i,-i} \left(x_{-i} - \mu_{-i}\right)$$

$$\sigma_{i.-i} = \sqrt{H_{ii}^{-1}}$$

and H is known as the precision matrix, defined as:

$$H = \Sigma^{-1}$$

Finally,

$$x_{-i} = x_1^{(j)}, \ldots, x_{i-1}^{(j)}, x_{i+1}^{(j-1)}, \ldots, x_{NM}^{(j-1)}$$

$$\mu_{-i} = \mu_1^{(j)}, \ldots, \mu_{i-1}^{(j)}, \mu_{i+1}^{(j-1)}, \ldots, \mu_{NM}^{(j-1)}$$

This sampling is the slowest part of the proposed selection algorithm by several orders of magnitude, and scales with the number of objectives, individuals, and trials. It is worth noting that the time to perform the sampling is not dependent on the number of test cases, which does impact the evaluation time.
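The update equations above can be sketched in Python as below. This is an unoptimized illustration rather than the authors' implementation: the function name, the starting point, the absence of burn-in or thinning, and the numerical guards are all assumptions. It relies only on NumPy and the standard library's `statistics.NormalDist` for Φ and its inverse.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def gibbs_truncated_mvn(mu, sigma, n_trials, rng, lower=0.0, upper=1.0):
    """Draw n_trials Gibbs samples from N(mu, sigma) truncated to [lower, upper]^d."""
    mu = np.asarray(mu, dtype=float)
    H = np.linalg.inv(np.asarray(sigma, dtype=float))  # precision matrix
    d = mu.size
    x = np.clip(mu, lower, upper)  # starting point (an assumption)
    samples = np.empty((n_trials, d))
    for j in range(n_trials):
        for i in range(d):
            others = np.arange(d) != i
            cond_var = 1.0 / H[i, i]
            # Conditional mean of x_i given the current values of all other entries.
            cond_mu = mu[i] - cond_var * (H[i, others] @ (x[others] - mu[others]))
            cond_sd = float(np.sqrt(cond_var))
            f_lo = _STD_NORMAL.cdf((lower - cond_mu) / cond_sd)
            f_hi = _STD_NORMAL.cdf((upper - cond_mu) / cond_sd)
            p = f_lo + rng.random() * (f_hi - f_lo)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < p < 1
            value = cond_mu + cond_sd * _STD_NORMAL.inv_cdf(p)
            x[i] = min(max(value, lower), upper)  # guard against round-off
        samples[j] = x
    return samples


# Tiny made-up example with three variables.
mu = [0.2, 0.5, 0.8]
sigma = [[0.04, 0.01, 0.00],
         [0.01, 0.04, 0.01],
         [0.00, 0.01, 0.04]]
draws = gibbs_truncated_mvn(mu, sigma, n_trials=200, rng=np.random.default_rng(0))
print(draws.shape)
```

The doubly nested loop makes the cost proportional to the number of trials times NM, with an inner product of length NM at each step, which is consistent with the scaling described above.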

Now that we have J samples of our NM variables, we wish to use this matrix to compute a probability of non-dominance. We can iterate through each NM dimensional sample and deinterlace it to form N vectors of M elements each. Each of these N vectors represents a possible location in objective space for an individual given by the M elements. Because the vector was drawn simultaneously, the correlations between the individuals and their objective scores will be respected. We compute the non-dominated set of the N individuals for each sample, and maintain a 1 by N histogram of the number of times each individual was found to be non-dominated throughout the J samples. This histogram is then normalized such that the sum is 1, and the value $N_i$ is the probability that individual i is a non-dominated solution. These probabilities are then used to drive a roulette wheel selection with replacement, such that the values in the histogram truly represent a probability of selection.
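The histogram and roulette wheel steps can be sketched as follows, assuming each sample is laid out individual by individual so that a reshape to N by M deinterlaces it. Function names and the toy samples are hypothetical.

```python
import numpy as np


def nondominated_mask(points: np.ndarray) -> np.ndarray:
    """points has shape (N, M), minimization. True where no other row is lower on every objective."""
    # dominated_by[i, j] is True when individual j is strictly lower than i on all objectives.
    dominated_by = np.all(points[None, :, :] < points[:, None, :], axis=2)
    return ~dominated_by.any(axis=1)


def selection_probabilities(samples: np.ndarray, n_individuals: int) -> np.ndarray:
    """samples has shape (J, N*M). Returns the normalized non-dominance histogram."""
    counts = np.zeros(n_individuals)
    for sample in samples:
        counts += nondominated_mask(sample.reshape(n_individuals, -1))
    return counts / counts.sum()


def roulette_select(probabilities: np.ndarray, n_select: int, rng) -> np.ndarray:
    """Roulette wheel selection with replacement; returns selected indices."""
    return rng.choice(len(probabilities), size=n_select, replace=True, p=probabilities)


# Two made-up samples for N = 2 individuals and M = 2 objectives.
samples = np.array([
    [0.1, 0.1, 0.5, 0.5],   # individual 0 dominates individual 1
    [0.1, 0.9, 0.9, 0.1],   # neither dominates the other
])
probs = selection_probabilities(samples, n_individuals=2)
parents = roulette_select(probs, n_select=10, rng=np.random.default_rng(0))
print(probs, parents)
```

Individual 0 is non-dominated in both samples and individual 1 in only one, so the histogram is [2, 1] before normalization.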



5 EXPERIMENT

To test the proposed fuzzy selection technique, we chose to work on a binary classification problem from the Physical Activity Monitoring Data Set (PAMAP2) collected by Reiss and Stricker [10]. Data in this set was collected from nine subjects, each wearing three inertial measurement units and a heart rate monitor while performing 12 different activities. Prior to feature construction we cleaned the time series data from the sensors in the following manner, as prescribed by Baldominos et al. [1]:

• A boolean mask for the 54 (indexed [0-53]) time series sensor data columns is created.
  – Column 0, corresponding to the time stamp of the data, is marked for removal; we do not want to train on the time of an activity.
  – Columns 16-19, 33-36, and 50-53, which represent orientation data that should not be used as features, are marked for removal. This was indicated by Reiss and Stricker.
• Remove all columns that were marked for removal above.
• Due to differing sampling rates of the sensors, there will be missing data points in the data; we fill all the missing data by propagating our samples forward.
• At this point there may be some leading data points that are missing; fill all these by propagating backward.
• Remove all rows that were labeled as transient activities (class label 0); we do not want to train on or classify these behaviors, as we do not know what was occurring.
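The cleaning steps above can be sketched with NumPy as follows. Two things are assumptions rather than statements from the text: that missing values are represented as NaN, and that the activity label sits in column 1 of the raw data. The function and constant names are illustrative.

```python
import numpy as np

N_COLUMNS = 54
TIMESTAMP_COL = 0
LABEL_COL = 1  # assumption: column 1 holds the activity class label
ORIENTATION_COLS = list(range(16, 20)) + list(range(33, 37)) + list(range(50, 54))


def clean(raw: np.ndarray) -> np.ndarray:
    """Drop masked columns, forward/backward fill NaNs, and drop class-0 rows."""
    keep = np.ones(N_COLUMNS, dtype=bool)
    keep[[TIMESTAMP_COL] + ORIENTATION_COLS] = False
    data = raw[:, keep].copy()

    for col in range(data.shape[1]):
        column = data[:, col]
        missing = np.isnan(column)
        valid = np.flatnonzero(~missing)
        if valid.size == 0:
            continue  # nothing to propagate in an all-missing column
        # Index of the most recent valid row at or before each row (forward fill).
        idx = np.maximum.accumulate(np.where(missing, -1, np.arange(column.size)))
        idx[idx < 0] = valid[0]  # leading gaps take the first valid value (backward fill)
        data[:, col] = column[idx]

    label_pos = int(keep[:LABEL_COL].sum())  # position of the label after column removal
    return data[data[:, label_pos] != 0]


# Tiny made-up recording: 5 rows, with gaps in one sensor column.
raw = np.zeros((5, N_COLUMNS))
raw[:, LABEL_COL] = [0, 4, 4, 4, 0]
raw[:, 2] = [np.nan, 1.0, np.nan, 3.0, np.nan]
cleaned = clean(raw)
print(cleaned.shape)
```

Thirteen of the 54 columns are masked out, leaving 41; under the label-column assumption that is one label column plus 40 sensor columns.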

Once the data has been cleaned, features are constructed from the 40 remaining time series columns as prescribed by Baldominos [1]:

• Each column was transformed using a sliding fast Fourier transform (FFT) of size 512 samples, which represents 5.12 seconds of 100 Hz data.
• From each set of 512 coefficients (257 real, 255 imaginary, broken out from the first half of the FFT), 7 statistical features were extracted: mean, median, standard deviation, max, min, and first and third quartiles.
• The resulting feature matrix is N windows by 280 features (7 statistics on each of the 40 columns).

We experimented with different step sizes and found them to have little impact on the results, so we chose to step by 6.25% (1/16th of the window, or 0.32 seconds) to generate our feature matrix.
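A sketch of the windowed feature construction, under one reading of "257 real, 255 imaginary": the real-input FFT of 512 samples yields 257 complex coefficients, and the imaginary parts of the first and last bins are always zero, so they are dropped. That interpretation, the function names, and the random stand-in data are assumptions.

```python
import numpy as np

WINDOW = 512          # samples, i.e. 5.12 s at 100 Hz
STEP = WINDOW // 16   # 32 samples, i.e. 0.32 s


def fft_statistics(window: np.ndarray) -> np.ndarray:
    """Seven summary statistics of the 512 FFT coefficient values of one window."""
    spectrum = np.fft.rfft(window)  # 257 complex coefficients
    coeffs = np.concatenate([spectrum.real, spectrum.imag[1:-1]])  # 257 + 255 = 512
    q1, median, q3 = np.percentile(coeffs, [25, 50, 75])
    return np.array([
        coeffs.mean(), median, coeffs.std(), coeffs.max(), coeffs.min(), q1, q3,
    ])


def column_features(signal: np.ndarray) -> np.ndarray:
    """Slide the window over one sensor column; returns (n_windows, 7)."""
    starts = range(0, len(signal) - WINDOW + 1, STEP)
    return np.array([fft_statistics(signal[s:s + WINDOW]) for s in starts])


# Made-up data: 1024 time steps of 40 sensor columns.
rng = np.random.default_rng(0)
data = rng.normal(size=(1024, 40))
features = np.hstack([column_features(column) for column in data.T])
print(features.shape)
```

A 1024-sample recording produces 17 windows, and stacking the 7 statistics for each of the 40 columns gives the 280 features per window described above.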

Given the feature matrix, we select a single activity to classify from the 280 features. For example, if we select the activity of vacuuming, which is class label 16 in the data set, all instances of class 16 are replaced with a 1, while all other instances’ class labels are replaced with a 0. For vacuuming, this produces 701 true positive samples, and 6,888 false positive samples. The unbalanced data set makes an excellent candidate for multi-objective optimization, where driving by accuracy alone would quickly favor suppressing false positives. We repeat this partitioning on each of the eight activities carried out by all subjects: lying, sitting, standing, walking, ascending stairs, descending stairs, vacuuming, and ironing.
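The relabeling and the class imbalance it creates can be illustrated in a few lines. The helper name and the short label vector are made up; the counts 701 and 6,888 are the ones quoted above.

```python
import numpy as np


def one_vs_rest(labels, target: int) -> np.ndarray:
    """Replace the target class with 1 and every other class with 0."""
    return (np.asarray(labels) == target).astype(int)


labels = np.array([16, 4, 16, 17, 1])   # made-up class labels
y = one_vs_rest(labels, target=16)
print(y)

# With the vacuuming counts quoted above, a classifier that always predicts 0
# has no false positives and still reaches this accuracy:
n_positive, n_negative = 701, 6888
always_zero_accuracy = n_negative / (n_positive + n_negative)
print(round(always_zero_accuracy, 4))
```

The trivial classifier is right on roughly nine samples out of ten while missing every vacuuming window, which is why accuracy alone is a poor single objective here.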

Figure 1: Average area under the curve (AUC) of the Pareto frontier of 30 trials of 100 generations during the evolution of a binary classifier for vacuuming. Shading shows ± one standard deviation. Dashed lines show max and mins. AUC is shown on a log scale to emphasize the later generations.

The goal of the experiment is to understand how using the statistical nature of the underlying test cases (of which there are 7,589 for vacuuming) to compute a probability of dominance will compare with traditional multi-objective approaches that use only the false negative and false positive error rates. We compare on a linear boolean genome of length 280 that serves as a selection mask for our feature matrix prior to training a naive Bayes classifier. We chose this machine learning method for its quick training and evaluation time, in order to run through generations more quickly, in addition to its deterministic results when implemented in scikit-learn.
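A sketch of how one genome might be scored. The text does not say which naive Bayes variant or which train/test split was used, so `GaussianNB`, the held-out split, the worst-case score for an empty mask, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def error_rates(y_true, y_pred):
    """Return (false negative rate, false positive rate) for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fnr = float(np.mean(y_pred[y_true == 1] == 0))
    fpr = float(np.mean(y_pred[y_true == 0] == 1))
    return fnr, fpr


def evaluate_genome(genome, X_train, y_train, X_test, y_test):
    """Train on the masked feature columns and score both objectives."""
    mask = np.asarray(genome, dtype=bool)
    if not mask.any():
        return 1.0, 1.0  # assumption: an empty feature mask gets the worst score
    model = GaussianNB().fit(X_train[:, mask], y_train)
    return error_rates(y_test, model.predict(X_test[:, mask]))


# Synthetic stand-in data: 6 features, only feature 0 carries signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 6))
X[:, 0] += 5.0 * y
genome = [True, False, False, True, False, False]
fnr, fpr = evaluate_genome(genome, X[::2], y[::2], X[1::2], y[1::2])
print(fnr, fpr)
```

Keeping the per-test-case comparisons (`y_pred != y_true`) instead of only their means would give the observation vectors that the fuzzy selection method works from.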

The evolutionary settings used were, for the most part, the same as those used by Baldominos [1]. We added elitism to the optimization to push the optimization along faster. The evolutionary settings were:

• Population size: 512
• Single point crossover rate: 35%
• Mutation bit flip:
  – 8.333% per individual
  – 8.333% per gene in the individual
• Elitism: every generation, selection is performed from the Pareto frontier + current offspring
• Genome: 280 boolean attributes
• Fuzzy Pareto method uses 10,000 Monte Carlo trials
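The variation operators in the list above can be sketched with the standard library. Treating 8.333% as 1/12, and applying the per-gene rate only to individuals already chosen for mutation, are interpretations rather than details given in the text; all names are illustrative.

```python
import random

GENOME_LENGTH = 280
CROSSOVER_RATE = 0.35
MUTATE_INDIVIDUAL_RATE = 1 / 12   # 8.333%, read here as 1/12
MUTATE_GENE_RATE = 1 / 12


def single_point_crossover(a, b, rng):
    """With probability CROSSOVER_RATE, swap the tails of a and b at a random point."""
    if rng.random() >= CROSSOVER_RATE:
        return a[:], b[:]
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip(genome, rng):
    """Pick the individual for mutation with one rate, then flip each gene with another."""
    if rng.random() >= MUTATE_INDIVIDUAL_RATE:
        return genome[:]
    return [(not g) if rng.random() < MUTATE_GENE_RATE else g for g in genome]


rng = random.Random(0)
parent_a = [True] * GENOME_LENGTH
parent_b = [False] * GENOME_LENGTH
child_a, child_b = single_point_crossover(parent_a, parent_b, rng)
mutant = bit_flip(child_a, rng)
print(sum(child_a), sum(child_b), sum(mutant))
```

Because the parents here are all-true and all-false, the two children are always complementary, which makes the crossover point easy to read off the printed sums.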

6 RESULTS

6.1 Comparison to Traditional Methods

In order to analyze the performance of the fuzzy selection method, three optimizations were created, where the only variable was the selection method used: the proposed fuzzy selection, NSGA2, or SPEA2. Results of running this optimization are shown in Figure 1 for the binary classifier of the vacuuming activity. The top plot compares the proposed fuzzy selection with SPEA2, while the bottom shows the same fuzzy selection against NSGA2. In both cases note that for this experiment, fuzzy selection outperforms each of the traditional methods on the average case, best case, and worst case. After comparing all three selection methods on the case of a vacuuming classifier, we repeated the experiment for each of the seven other activities in the protocol from PAMAP2: lying, sitting, standing, walking, ascending stairs, descending stairs, and ironing. Results of all experiments are shown in Table 1.

Figure 2: Average area under the curve (AUC) of the Pareto frontier of 30 trials of 100 generations during the evolution of a binary classifier for walking. Shading shows ± one standard deviation. Dashed lines show max and mins. AUC is shown on a log scale to emphasize the later generations.

While fuzzy selection outperforms the standard multi-objective selection methods on all measured metrics for the ascending stairs, descending stairs, vacuuming, and ironing classifiers, it is clearly not a silver bullet. Fuzzy selection underperforms on some or all metrics for lying, sitting, standing, and walking. To illustrate some of the indicators of when fuzzy selection will perform well, we compare here the cases of vacuuming, where we outperformed on all metrics, and walking, where we underperformed on all metrics.

The 30 trial average area under the curve of the Pareto frontier is shown as a function of generation number for the evolution of the walking classifier in Figure 2. This evolution behaves differently than that of the vacuuming classifier shown previously in Figure 1. The fuzzy selection for the walking classifier, while starting out faster than SPEA2 and NSGA2, very quickly levels out around generation 10, in contrast to continuously decreasing as it did in the vacuuming evolution. However, SPEA2 and NSGA2 both continue to decrease for the walking classifier. This highlights some systematic issue in the fuzzy selection method.

6.2 Population Simulation

To better understand the successes and failures of fuzzy selection, we look at the behavior of fuzzy selection over the course of the evolution. Figure 3 shows scatter plots of objective space for four of the 100 generations evaluated for the vacuuming classifier, while Figure 4 shows the same generational snapshots for the walking classifier. The y-axis of each plot shows the false negative rate, i.e., the probability the classifier predicted a vacuuming sample as not vacuuming. The x-axis of each plot shows the false positive rate, i.e., the probability the classifier predicted a non-vacuuming sample as vacuuming. Each point on the plot represents the expected error rates for each of the two objectives computed over all test cases. The color of each point is correlated to the individual’s expected number of selections per generation, ranging from blue (zero selections) to red (more than two selections).

Figure 3: Four snapshots from the 100 generations of populations for a binary classifier for vacuuming, with the expected number of selections shown in color. Blue represents 0 selections per generation. Red represents more than 2 selections.

Figure 4: Four snapshots from the 100 generations of populations for a binary classifier for walking, with the expected number of selections shown in color. Blue represents 0 selections per generation. Red represents more than 2 selections.

Starting at generation zero, the initial results of the random population are evaluated, and the probability of selection increases toward the origin, as expected. This is true for both the vacuuming classifier and the walking classifier. By generation 10, the distribution of individuals in objective space begins to form an L, and probability of selection is highest for the points closest to the edges. This trend continues through the evolution as shown in generation 25 and the final generation, 99. The differences between the vacuuming and walking evolutions become apparent in the plot for generation 25. While at this point the vacuuming classifier still has a high probability of selection nearest the origin, the walking classifier has a lower probability of selection there than it does rising up the axis corresponding to false negative error rates.

Table 1: Results on 30 trials of 100 generations optimizing a binary activity classifier. Note that µ and σ are computed over the trials, and the subscripts represent generations. Best and worst represent the minimum and maximum over the trials.

| Activity | Method | µ_0 | µ_99 | σ_99 | µ_gen ≤ ½ µ_0 | Best_99 | Worst_99 |
|---|---|---|---|---|---|---|---|
| Lying | Fuzzy | 8.66 · 10^-2 | 4.87 · 10^-2 | 2.98 · 10^-3 | ≥ 100 | 3.62 · 10^-2 | 5.28 · 10^-2 |
| Lying | SPEA2 | 8.84 · 10^-2 | 5.29 · 10^-2 | 1.32 · 10^-3 | ≥ 100 | 5.04 · 10^-2 | 5.76 · 10^-2 |
| Lying | NSGA2 | 8.77 · 10^-2 | 5.32 · 10^-2 | 1.33 · 10^-3 | ≥ 100 | 5.16 · 10^-2 | 5.76 · 10^-2 |
| Sitting | Fuzzy | 4.55 · 10^-1 | 2.72 · 10^-1 | 9.92 · 10^-2 | ≥ 100 | 1.19 · 10^-1 | 3.51 · 10^-1 |
| Sitting | SPEA2 | 4.54 · 10^-1 | 2.34 · 10^-1 | 9.63 · 10^-2 | ≥ 100 | 9.54 · 10^-2 | 3.28 · 10^-1 |
| Sitting | NSGA2 | 4.47 · 10^-1 | 2.25 · 10^-1 | 1.01 · 10^-1 | ≥ 100 | 1.10 · 10^-1 | 3.35 · 10^-1 |
| Standing | Fuzzy | 8.54 · 10^-2 | 5.39 · 10^-3 | 7.00 · 10^-3 | 14 | 1.18 · 10^-3 | 3.65 · 10^-2 |
| Standing | SPEA2 | 8.43 · 10^-2 | 1.61 · 10^-3 | 2.74 · 10^-3 | 13 | 6.25 · 10^-4 | 1.63 · 10^-2 |
| Standing | NSGA2 | 8.37 · 10^-2 | 2.62 · 10^-3 | 2.64 · 10^-3 | 18 | 9.50 · 10^-4 | 1.49 · 10^-2 |
| Walking | Fuzzy | 8.51 · 10^-2 | 5.50 · 10^-2 | 7.82 · 10^-3 | ≥ 100 | 4.31 · 10^-2 | 7.24 · 10^-2 |
| Walking | SPEA2 | 8.77 · 10^-2 | 3.77 · 10^-2 | 1.50 · 10^-3 | 39 | 3.54 · 10^-2 | 3.98 · 10^-2 |
| Walking | NSGA2 | 8.37 · 10^-2 | 3.70 · 10^-2 | 1.84 · 10^-3 | 48 | 3.25 · 10^-2 | 3.98 · 10^-2 |
| Ascending Stairs | Fuzzy | 1.15 · 10^-1 | 4.98 · 10^-2 | 2.23 · 10^-3 | 35 | 4.35 · 10^-2 | 5.28 · 10^-2 |
| Ascending Stairs | SPEA2 | 1.14 · 10^-1 | 5.49 · 10^-2 | 4.59 · 10^-3 | 91 | 4.62 · 10^-2 | 6.43 · 10^-2 |
| Ascending Stairs | NSGA2 | 1.15 · 10^-1 | 5.90 · 10^-2 | 5.19 · 10^-3 | ≥ 100 | 4.97 · 10^-2 | 6.93 · 10^-2 |
| Descending Stairs | Fuzzy | 1.42 · 10^-1 | 6.23 · 10^-3 | 7.14 · 10^-3 | 12 | 3.24 · 10^-3 | 3.70 · 10^-2 |
| Descending Stairs | SPEA2 | 1.43 · 10^-1 | 1.96 · 10^-2 | 8.78 · 10^-3 | 35 | 5.32 · 10^-3 | 4.09 · 10^-2 |
| Descending Stairs | NSGA2 | 1.44 · 10^-1 | 2.10 · 10^-2 | 7.29 · 10^-3 | 41 | 1.26 · 10^-2 | 4.22 · 10^-2 |
| Vacuuming | Fuzzy | 1.58 · 10^-2 | 3.22 · 10^-4 | 1.87 · 10^-4 | 5 | 2.10 · 10^-5 | 7.45 · 10^-4 |
| Vacuuming | SPEA2 | 1.59 · 10^-2 | 5.40 · 10^-4 | 2.65 · 10^-4 | 9 | 1.73 · 10^-4 | 1.19 · 10^-3 |
| Vacuuming | NSGA2 | 1.69 · 10^-2 | 6.90 · 10^-4 | 3.07 · 10^-4 | 9 | 3.24 · 10^-4 | 1.35 · 10^-3 |
| Ironing | Fuzzy | 2.25 · 10^-2 | 2.37 · 10^-3 | 2.04 · 10^-4 | 8 | 1.89 · 10^-3 | 2.78 · 10^-3 |
| Ironing | SPEA2 | 2.31 · 10^-2 | 3.13 · 10^-3 | 5.02 · 10^-4 | 19 | 2.21 · 10^-3 | 4.37 · 10^-3 |
| Ironing | NSGA2 | 2.21 · 10^-2 | 3.30 · 10^-3 | 6.32 · 10^-4 | 25 | 2.51 · 10^-3 | 4.81 · 10^-3 |

A benefit of this method is its implicit handling of crowding. If there are many solutions in one section of objective space, they will naturally split the probability of dominance between them, especially if those solutions are highly correlated. On the other hand, solutions that do not have many neighbors will be more likely to be Pareto when they generate strong results from their random trials. This is observed in the scatter plots, where points far behind the Pareto frontier often have a stronger probability of selection than those that are closer to the frontier but have more neighbors.

We also see some regions in objective space that are crowded, but have particular individuals with high expected selections. We believe these are individuals who are capturing unique solutions from their neighbors, i.e. their means are similar on both objectives, but their performance on the underlying test cases is not highly correlated.

One other effect that we observe is that, due to higher variances of the underlying test cases, individuals along the axes may produce similar probabilities of selection to those that are closer to the origin. This is due to the properties of what is effectively an underlying Bernoulli random variable, where the variance for each objective is given by:

$$\sigma^2 = p(1 - p)$$

where p is the probability the event occurred. This is maximized when an objective is at 0.5 and falls off as p increases or decreases.
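The location of the maximum follows from setting the derivative of the variance to zero. The two numerical values at the end are illustrative, not taken from the experiments.

```latex
\frac{d}{dp}\, p(1 - p) = 1 - 2p = 0
\;\Rightarrow\; p = \tfrac{1}{2},
\qquad
\sigma^2_{\max} = \tfrac{1}{2}\left(1 - \tfrac{1}{2}\right) = \tfrac{1}{4}

% For comparison: an error rate of p = 0.05 gives 0.05 \cdot 0.95 = 0.0475,
% roughly a fifth of the variance at p = 0.5.
```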

In Figure 5 we see a visualization of the cumulative distribution function of a single dimensional truncated normal distribution with µ = p and σ² = p(1 − p). The color shows the value of the CDF for each possible value of µ across the bounded region 0 ≤ x ≤ 1. We can see that the band that represents the 10th percentile grows at a slower rate than the 90th. This is consistent with the observed phenomenon: individuals tend to have similar probabilities of selection where the line is flatter.
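The quantity visualized in Figure 5 can be evaluated with the standard formula for a normal distribution truncated to an interval. This is a sketch using only the Python standard library; the function name is an assumption, and p must lie strictly between 0 and 1 so that the standard deviation is positive.

```python
from statistics import NormalDist


def truncated_normal_cdf(x: float, p: float, lower: float = 0.0, upper: float = 1.0) -> float:
    """CDF at x of N(mu=p, sigma^2=p(1-p)) truncated to [lower, upper]."""
    mu = p
    sd = (p * (1.0 - p)) ** 0.5
    std = NormalDist()  # standard normal
    f_lo = std.cdf((lower - mu) / sd)
    f_hi = std.cdf((upper - mu) / sd)
    return (std.cdf((x - mu) / sd) - f_lo) / (f_hi - f_lo)


# Evaluate one column of the figure: fixed mean, several x values.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, round(truncated_normal_cdf(x, p=0.3), 3))
```

Sweeping p over a grid and recording where the CDF crosses 0.1 and 0.9 would trace out the 10th and 90th percentile bands discussed above.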

Another interesting item to look at over the course of the evo-lution is expected number of selections per generation of Paretoand non-Pareto individuals (where the Pareto optimality of an indi-vidual is determined by its average objective score across all test

534


Figure 5: Visualization of cumulative distribution functionof a truncated normal in a single dimension from an under-lying Bernoulli random variable.

Figure 6: Expected number of selections, and average num-ber of selections, per generation of Pareto and non-Paretoindividuals during the evolution of a binary classifier forvacuuming.

cases). We can see from Figure 6 that in the evolution of a vacu-uming classifier, fuzzy selection is appropriately favoring Paretoindividuals, selecting Pareto individuals on average twice as likelythan non-Pareto individuals. The favoritism is higher in earliergenerations when the Pareto solutions are larger drivers, but asthe population tightens towards the end, the ratio begins to de-cline. This is due to a number of factors, including the increase ofPareto individuals in the population over time. We can also see thatboth the number of Pareto individuals and the expected numberof selections of Pareto individuals grow over time, and level off atabout 80% of the population size. There are always more expectedselections of Pareto individuals than there are Pareto individuals,

Figure 7: Expected number of selections, and average number of selections, per generation of Pareto and non-Pareto individuals during the evolution of a binary classifier for walking.

which implies that Pareto individuals are being selected more than once each. This contrasts with the performance of the walking classifier shown in Figure 7. Here, the number of Pareto individuals and the expected number of Pareto selections are much noisier over the course of the optimization. The average number of non-Pareto selections crosses that of Pareto selections at about the same generation at which the area under the Pareto front starts to level off in Figure 2. The expected number of Pareto individuals selected in the walking classifier evolution levels off at approximately 6% of the population, in contrast with the 80% found in the evolution of the vacuuming classifier. We have found that, for this data set, favoritism of Pareto selections over non-Pareto selections is an indicator of whether fuzzy selection will perform successfully.
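The bookkeeping behind Figures 6 and 7 can be sketched as follows. The sketch assumes two minimized objectives, roulette wheel selection with one draw per population slot, and selection weights equal to each individual's probability of Pareto optimality. The mean scores and weights below are hypothetical values chosen only for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_mask(scores):
    """Flag each individual whose mean objective scores are not dominated by any other."""
    return [not any(dominates(other, s) for other in scores) for s in scores]


def expected_selections(weights, n_draws):
    """Expected selections per individual under roulette wheel selection with n_draws spins."""
    total = sum(weights)
    return [n_draws * w / total for w in weights]


# Hypothetical mean objective scores, e.g. (false positive rate, false negative rate).
mean_scores = [(0.10, 0.40), (0.30, 0.20), (0.35, 0.45), (0.50, 0.50)]
# Hypothetical selection weights (probabilities of Pareto optimality).
weights = [0.45, 0.40, 0.10, 0.05]

mask = pareto_mask(mean_scores)
expected = expected_selections(weights, n_draws=len(weights))

n_pareto = sum(mask)
pareto_expected = sum(e for e, m in zip(expected, mask) if m)
non_pareto_expected = sum(e for e, m in zip(expected, mask) if not m)

print("Pareto individuals:", n_pareto)
print("Expected Pareto selections:", round(pareto_expected, 2))
print("Expected non-Pareto selections:", round(non_pareto_expected, 2))
```

In this toy population the expected number of Pareto selections exceeds the number of Pareto individuals, which is the situation described for the vacuuming classifier.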

7 CONCLUSIONS

Fuzzy selection can offer significant improvements over traditional multi-objective approaches by taking into consideration the statistical nature of the objective performance. In our relative best case, which was the classifier for ascending stairs, we achieved a reduction of 68.21% in final area under the curve relative to SPEA2. We also showed a speed-up of 291.66% in reaching the generation that halves the initial area under the curve.

However, fuzzy selection is also sensitive to the underlying statistics of test cases, and in some cases will underperform when compared to those same traditional approaches. In our relative worst case, optimizing a standing classifier, we converged to a final area under the curve that was an increase of 234.78% over SPEA2, but only suffered a slowdown of 7.69% to reach the generation that halves the initial area under the curve.
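The two figures of merit used above, final area under the curve and the generation that halves the initial area, can be computed from a per-generation history as in the sketch below. The histories are made-up numbers, and `percent_change` uses one common convention; the exact formulas behind the percentages reported here are not restated in this section, so treat the convention as an assumption.

```python
def halving_generation(auc_history):
    """First generation whose area is at most half the initial area, or None if never reached."""
    target = 0.5 * auc_history[0]
    for generation, auc in enumerate(auc_history):
        if auc <= target:
            return generation
    return None


def percent_change(new, baseline):
    """Signed percent change of new relative to baseline (negative means a reduction)."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical area-under-the-curve histories (smaller is better), one value per generation.
fuzzy_history = [0.40, 0.30, 0.19, 0.15]
baseline_history = [0.40, 0.35, 0.30, 0.25, 0.21, 0.18]

fuzzy_half = halving_generation(fuzzy_history)
baseline_half = halving_generation(baseline_history)
final_change = percent_change(fuzzy_history[-1], baseline_history[-1])

print("fuzzy halves the initial area at generation", fuzzy_half)
print("baseline halves the initial area at generation", baseline_half)
print(f"change in final area: {final_change:.2f}%")
```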

It is also worth noting that fuzzy selection takes several orders of magnitude longer to perform for a 10,000-trial, two-objective, 512-individual population than the traditional selection methods.



This, however, is by design, as the selection process still takes less time than one typical evaluation of a reasonably challenging machine learning data set. For smaller populations, fewer trials are needed, and we can achieve faster performance.
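The source of this cost can be illustrated with a deliberately simplified sketch of a Monte Carlo estimate of each individual's probability of Pareto optimality. It assumes independent, minimized objectives, each drawn from a normal with mean $p$ and variance $p(1 - p)$ truncated to $[0, 1]$ by plain rejection sampling. The method in this paper instead relies on covariance estimation and Markov chain Monte Carlo, so this is not the authors' implementation; all function names and example values are hypothetical.

```python
import random


def sample_truncated_normal(rng, mean, sigma, lower=0.0, upper=1.0):
    """Rejection-sample a normal(mean, sigma) restricted to [lower, upper]."""
    if sigma == 0.0:
        return min(max(mean, lower), upper)
    while True:
        x = rng.gauss(mean, sigma)
        if lower <= x <= upper:
            return x


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_probabilities(objective_means, n_trials, seed=0):
    """Fraction of trials in which each individual's sampled scores are nondominated."""
    rng = random.Random(seed)
    counts = [0] * len(objective_means)
    for _ in range(n_trials):
        # One sampled objective vector per individual (independence is an assumption).
        draws = [
            tuple(sample_truncated_normal(rng, p, (p * (1.0 - p)) ** 0.5) for p in means)
            for means in objective_means
        ]
        # Pairwise dominance check: quadratic in the population size.
        for i, d in enumerate(draws):
            if not any(dominates(other, d) for other in draws):
                counts[i] += 1
    return [c / n_trials for c in counts]


# Hypothetical mean objective scores for three individuals.
means = [(0.05, 0.10), (0.10, 0.05), (0.60, 0.70)]
probs = pareto_probabilities(means, n_trials=2000)

# Roulette wheel selection weighted directly by the estimated probabilities.
parents = random.Random(1).choices(range(len(means)), weights=probs, k=len(means))
print("estimated probabilities:", [round(p, 3) for p in probs])
print("selected parent indices:", parents)
```

Every trial compares all pairs of individuals, so the work in this sketch grows with the number of trials times the square of the population size.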

While we have found indicators as to when fuzzy selection will perform successfully, such as a favoring of probability of selection of Pareto individuals to non-Pareto individuals, further work is necessary to understand the underlying mechanism that causes fuzzy selection to outperform or underperform traditional multi-objective techniques. Ideally, we will be looking for a set of features of a problem that is well suited to fuzzy selection, and a systematic method to correct the fuzzy selection procedure in order to improve performance on those problems that are currently not well suited for this method.

Finally, we would like to demonstrate success on further problems, including the benchmarking suite used to analyze the performance of the non-negative matrix factorization technique by Liskowski and Krawiec. We would like to compare the performance of a corrected fuzzy selection method against various test case based methods, including the NMF search drivers, novelty search, and online discovery of search objectives (DISCO).

8 REFERENCES

[1] Alejandro Baldominos, Yago Saez, and Pedro Isasi. 2015. Feature Set Optimization for Physical Activity Recognition Using Genetic Algorithms. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation. ACM, 1311–1318.

[2] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. Evolutionary Computation, IEEE Transactions on 6, 2 (2002), 182–197.

[3] Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of multivariate analysis 88, 2 (2004), 365–411.

[4] Joel Lehman and Kenneth O Stanley. 2008. Exploiting Open-Endedness to Solve Problems Through the Search for Novelty. In ALIFE. 329–336.

[5] Paweł Liskowski and Krzysztof Krawiec. 2016. Non-negative Matrix Factorization for Unsupervised Derivation of Search Objectives in Genetic Programming. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference. ACM, 749–756.

[6] Paweł Liskowski and Krzysztof Krawiec. 2016. Online Discovery of Search Objectives for Test-based Problems. Evolutionary computation (2016).

[7] Randal S Olson, Nathan Bartley, Ryan J Urbanowicz, and Jason H Moore. 2016. Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference. ACM, 485–492.

[8] Randal S Olson and Jason H Moore. 2016. Identifying and Harnessing the Building Blocks of Machine Learning Pipelines for Sensible Initialization of a Data Science Automation Tool. arXiv preprint arXiv:1607.08878 (2016).

[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[10] Attila Reiss and Didier Stricker. 2012. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 40.

[11] Nidamarthi Srinivas and Kalyanmoy Deb. 1994. Muiltiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary computation 2, 3 (1994), 221–248.

[12] Stefan Wilhelm. 2015. Gibbs Sampler for the Truncated Multivariate Normal Distribution. (2015).

[13] Eckart Zitzler, Marco Laumanns, Lothar Thiele, and others. 2001. SPEA2: Improving the strength Pareto evolutionary algorithm. (2001).

[14] Eckart Zitzler and Lothar Thiele. 1999. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE transactions on Evolutionary Computation 3, 4 (1999), 257–271.

[15] Jason Zutty. 2016. Creating Human-Competitive Algorithms using Multiple Objective Vector-Based Genetic Programming. Ph.D. Dissertation. Georgia Institute of Technology. Proposal.

[16] Jason Zutty, Daniel Long, Heyward Adams, Gisele Bennett, and Christina Baxter. 2015. Multiple objective vector-based genetic programming using human-derived primitives. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. ACM, 1127–1134.

[17] Jason Zutty, Daniel Long, and Gregory Rohling. 2016. Increasing the Throughput of Expensive Evaluations Through a Vector Based Genetic Programming Framework. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion. ACM, 1477–1478.
