
Speeding Up the Evaluation of Evolutionary Learning Systems Using GPGPUs


Speeding up the Evaluation of Evolutionary Learning Systems using GPGPUs

María A. Franco, Natalio Krasnogor and Jaume Bacardit

ASAP Research Group, School of Computer Science,
University of Nottingham, UK
{mxf,nxk,jqb}@cs.nott.ac.uk

July 10, 2010

M. Franco, N. Krasnogor, J. Bacardit. Uni. Nottingham Speeding up Evolutionary Learning using GPGPUs 1 / 27


Motivation

Nowadays the data collection rate easily exceeds the processing/data-mining rate.

Real-life problems = big + complex.

There is a need to improve the efficiency of evolutionary learning systems to cope with large-scale domains [Sastry, 2005] [Bacardit and Llorà, 2009].

This work focuses on boosting the performance of the BioHEL system by using the processing capacity of GPGPUs.


Outline

1 BioHEL
  BioHEL GBML System
  Characteristics of BioHEL

2 BioHEL using CUDA
  How does CUDA work?
  Challenges of using CUDA
  Implementation details

3 Experiments and results
  Stage 1: Raw evaluation
  Stage 2: Integration with BioHEL

4 Conclusions and Further Work


The BioHEL GBML System

BIOinformatics-oriented Hierarchical Evolutionary Learning (BioHEL) [Bacardit et al., 2009].

BioHEL was designed to handle large-scale bioinformatics datasets [Stout et al., 2008].

BioHEL is a GBML system that employs the Iterative Rule Learning (IRL) paradigm:
  First used in EC in Venturini's SIA system [Venturini, 1993]
  Widely used for both fuzzy and non-fuzzy evolutionary learning


Characteristics of BioHEL

The fitness function is based on the Minimum Description Length (MDL) principle [Rissanen, 1978] and tries to:
  Evolve accurate rules
  Evolve high-coverage rules
  Evolve rules with low complexity, as general as possible

BioHEL applies a supervised learning paradigm. To compute the fitness we need three metrics per classifier, computed from the training set:

1. # of instances that match the condition of the rule
2. # of instances that match the action of the rule
3. # of instances that match both the condition and the action
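The three metrics can be sketched in plain Python (a serial stand-in for the GPGPU match process; the `Rule` encoding and the predicate-based condition are illustrative, not BioHEL's actual representation):

```python
from collections import namedtuple

# Hypothetical encodings: a rule is a predicate over the attribute
# vector plus a predicted class; an instance is (attributes, label).
Rule = namedtuple("Rule", ["condition", "action"])

def match_counts(rule, instances):
    """Return the three per-rule metrics the fitness needs:
    (#matching the condition, #matching the action, #matching both)."""
    n_cond = n_act = n_both = 0
    for attributes, label in instances:
        cond = rule.condition(attributes)
        act = label == rule.action
        n_cond += cond
        n_act += act
        n_both += cond and act
    return n_cond, n_act, n_both
```

On the GPU each classifier-instance comparison inside this loop becomes one independent thread, which is what makes the match process so parallelizable.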


Characteristics of BioHEL

The ILAS windowing scheme [Bacardit, 2004] is an efficiency-enhancement method: the training set is divided into non-overlapping strata, and each iteration uses a different stratum for its fitness calculations.
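A minimal sketch of the windowing idea (round-robin strata by index; the real ILAS stratification preserves the class distribution, so this only shows the shape of the scheme):

```python
def ilas_windows(n_examples, n_strata):
    """Partition example indices into non-overlapping strata (windows)."""
    return [list(range(s, n_examples, n_strata)) for s in range(n_strata)]

def window_for_iteration(windows, iteration):
    """Each GA iteration evaluates fitness on a different stratum, in turn."""
    return windows[iteration % len(windows)]
```

Because the strata are disjoint and cycled, every example still influences learning over time, while each single iteration only pays for a fraction of the match cost.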


Evaluation process in BioHEL

The computationally heavy stage of the evaluation process is the match process. We perform stages 1 and 2 inside the GPGPU.


Why use GPGPUs?

GPGPU acceleration has already been used in GP [Langdon and Harrison, 2008], GAs [Maitre et al., 2009] and LCS [Loiacono and Lanzi, 2009].

Using GPGPUs in machine learning poses a greater challenge because it deals with very large volumes of data.

However, this also means the workload is potentially more parallelizable.


CUDA Architecture

NVIDIA's Compute Unified Device Architecture (CUDA) is a parallel computing architecture that exploits the capacity of NVIDIA's Graphics Processing Units.

CUDA runs thousands of threads at the same time ⇒ SPMD (Same Program, Multiple Data) paradigm.


CUDA Memories

There are different types of memory with different access speeds.

The memory is limited.

Memory copy operations take a considerable amount of execution time.

Since we aim to work with large-scale datasets, a good strategy for minimising the execution time is based on careful memory usage.


The ideal way would be...

Copy all the classifiers and instances, and launch one thread for each classifier-instance comparison.
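This ideal mapping can be sketched serially: with N classifiers and M instances, each of the N×M hypothetical thread ids owns exactly one comparison (flat indexing shown; a real CUDA launch would typically use a 2D grid):

```python
def thread_work(tid, n_instances):
    """Map a flat thread id to one (classifier, instance) comparison,
    the way a launch of N*M threads would carve up the work."""
    return tid // n_instances, tid % n_instances

# With 3 classifiers and 4 instances we would launch 3*4 = 12 threads:
pairs = [thread_work(t, 4) for t in range(3 * 4)]
```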


Challenges of using CUDA

Problem 1: Memory copy operations
In each iteration the classifiers and the instances to compare are different, so they need to be copied into global memory again in each iteration.

Problem 2: Memory bounds
It might not be possible to store all the classifiers and the example instances needed to make all the comparisons at the same time.


Solution: Memory calculations

If all the instances fit in memory...
We copy the instances only once, at the beginning of each GA run, and access different windows by using a memory offset.

If the instances do not all fit in memory...
We calculate the number of classifiers and instances that fit in memory, in order to minimise the memory copy operations.
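The memory calculation reduces to simple arithmetic. A sketch with illustrative assumptions (instances stored as flat arrays of 4-byte attribute values; function names hypothetical):

```python
def instances_per_copy(global_mem_bytes, n_attributes, bytes_per_attr=4):
    """How many instances fit in device global memory at once,
    assuming each instance is a flat array of attribute values."""
    return global_mem_bytes // (n_attributes * bytes_per_attr)

def n_copy_passes(total_instances, per_copy):
    """Host-to-device copy passes needed for the whole training set."""
    return -(-total_instances // per_copy)  # ceiling division
```

Minimising `n_copy_passes` is the point: each pass is a slow host-to-device transfer, so chunks are sized to use as much of global memory as possible.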


Challenges of using CUDA

Problem 3: Output structure size
If we make N × M comparisons, we would have to copy back to the host a structure of size O(N × M), which is very slow.

Solution: Compute the totals in device memory
Reduce the three values in device memory in order to minimise the time spent on memory copy operations.
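The reduction itself can be sketched serially. This is the halving pattern a parallel reduction performs on the device (each step folds element i+stride into element i), so that only the per-classifier totals, not the O(N × M) match matrix, are copied back:

```python
def tree_reduce(values):
    """Sum a list by repeated halving, mirroring how a parallel
    reduction combines partial results step by step."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        # One 'parallel step': all these additions are independent.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0] if vals else 0
```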


Kernel functions


Experimental set-up

Two stages of experiments:
1. The evaluation process on its own
2. Integration with the learning process

We use different functions to manage discrete, continuous and mixed problems, and we check the integration with the ILAS windowing scheme.

CUDA experiments: Pentium 4 at 3.6 GHz with 2 GB RAM, plus a Tesla C1060 with 4 GB of global memory and 30 multiprocessors.

Serial experiments: the UoN HPC facility, each node with 2 quad-core processors (Intel Xeon E5472, 3.0 GHz).


Speed in the evaluation process

Name  |T|      #Att  #Disc  #Cont  #Cl  T. Serial (s)       T. CUDA (s)  Speed Up

Continuous:
sat     5790    36     0     36     6       3.60±   0.21     1.92±0.01      1.9
wav     4539    40     0     40     3       2.57±   0.08     1.59±0.01      1.6
pen     9892    16     0     16    10       4.94±   0.24     2.25±0.02      2.2
SS     75583   300     0    300     3     770.61± 119.49    14.69±0.23     52.4
CN    234638   180     0    180     2    1555.90± 452.79    42.35±0.55     36.7

Mixed:
adu    43960    14     8      6     2     147.86±  30.93    10.38±0.09     14.2
far    90868    29    24      5     8     420.78±  90.58    23.13±1.04     18.2
kdd   444619    41    15     26    23    1715.66± 632.40    95.89±1.42     17.9
SA    493788   270    26    244     2    3776.36±1212.84    90.45±1.17     41.8
Par   235929    18    18      0     2     863.72± 163.13    60.04±0.58     14.4
c-4    60803    42    42      0     3     343.75±  71.93    17.86±0.18     19.2


Speed Up vs. Training set size

[Figure: speedup according to the training set size. X-axis: training set size (log scale, 100 to 1e+06); y-axis: speedup (0 to 60). One curve per dataset: adu (14 atts), pen (16 atts), Par (18 atts), far (29 atts), sat (36 atts), wav (40 atts), kdd (41 atts), c-4 (42 atts), CN (180 atts), SA (270 atts), SS (300 atts).]


Integration with the ILAS Windowing scheme

[Figure: total speedup according to the number of ILAS windows. X-axis: number of windows (5 to 50); y-axis: speedup (0 to 700). One curve per dataset: adu (14 atts), pen (16 atts), Par (18 atts), far (29 atts), sat (36 atts), wav (40 atts), kdd (41 atts), c-4 (42 atts), CN (180 atts), SA (270 atts), SS (300 atts).]


How do we get speed up?

The continuous problems get more speedup than the mixed problems.

Problems with a large number of attributes or large training sets get more speedup.

There is a sweet spot where the combination of the CUDA fitness function and the ILAS windowing scheme produces the most speedup.


Speed up of BioHEL using CUDA

Name  |T|      #Att  #Disc  #Cont  #Cl  T. Serial (s)            T. CUDA (s)      Speed Up

Continuous:
sat     5790    36     0     36     6         0.03±     0.01       25.91±  2.45      3.7
wav     4539    40     0     40     3        75.47±     9.38       24.69±  0.81      3.1
pen     9892    16     0     16    10       149.70±    19.93       40.04±  2.94      3.7
SS     75583   300     0    300     3    347979.80± 60982.74     5992.28±247.50     58.1
CN    234638   180     0    180     2    821464.70±167542.04    18644.31±943.98     44.1

Mixed:
adu    43960    14     8      6     2      5422.78±  1410.71      271.73± 26.03     20.0
far    90868    29    24      5     8      2471.28±   701.83       94.99± 41.53     26.0
kdd   444619    41    15     26    23     76442.32± 23533.21     2102.414±191.34    36.4
SA    493788   270    26    244     2   1252976.80±203186.55    28759.71±552.00     38.3
Par   235929    18    18      0     2    524706.70± 98949.46    19559.79±671.70     26.8
c-4    60803    42    42      0     3     52917.95±  8059.55     2417.83±170.19     21.9

The experiments that took 2 weeks to finish now run in 8 hours!


Conclusions

CUDA allows us to exploit the intrinsic parallelism within the populations by evaluating a group of individuals at the same time.

We can now handle much larger problems, which is the case for most real-life problems.

This fusion of CUDA and genetic algorithms helps push forward the boundaries of evolutionary learning by overcoming technical barriers.


Further work

Extend our methodology to use more than one GPGPU at a time.

Develop models to determine the limits of the use of CUDA, and disable the CUDA evaluation when the training set is small.

Study the impact of ILAS on the accuracy.

Adapt the CUDA methodology to other evolutionary learning systems.


Bacardit, J. (2004). Pittsburgh Genetics-Based Machine Learning in the Data Mining Era: Representations, Generalization, and Run-time. PhD thesis, Ramon Llull University, Barcelona, Spain.

Bacardit, J., Burke, E., and Krasnogor, N. (2009). Improving the scalability of rule-based evolutionary learning. Memetic Computing, 1(1):55–67.

Bacardit, J. and Llorà, X. (2009). Large scale data mining using genetics-based machine learning. In GECCO '09: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference, pages 3381–3412, New York, NY, USA. ACM.

Langdon, W. B. and Harrison, A. P. (2008). GP on SPMD parallel graphics hardware for mega bioinformatics data mining. Soft Computing, 12(12):1169–1183.

Loiacono, D. and Lanzi, P. L. (2009). Speeding up matching in XCS. In 12th International Workshop on Learning Classifier Systems.

Maitre, O., Baumes, L. A., Lachiche, N., Corma, A., and Collet, P. (2009). Coarse grain parallelization of evolutionary algorithms on GPGPU cards with EASEA. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pages 1403–1410, Montreal, Québec, Canada. ACM.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14:465–471.


Sastry, K. (2005). Principled efficiency enhancement techniques. Genetic and Evolutionary Computation Conference (GECCO 2005) tutorial.

Stout, M., Bacardit, J., Hirst, J. D., and Krasnogor, N. (2008). Prediction of recursive convex hull class assignments for protein residues. Bioinformatics, 24(7):916–923.

Venturini, G. (1993). SIA: a supervised inductive algorithm with genetic search for learning attributes based concepts. In Brazdil, P. B., editor, Machine Learning: ECML-93 - Proceedings of the European Conference on Machine Learning, pages 280–296. Springer-Verlag.


Questions or comments?


Iterative Rule Learning

IRL has been used for many years in the ML community under the name of separate-and-conquer.

Algorithm 4.1: IterativeRuleLearning(Examples)

Theory ← ∅
while Examples ≠ ∅ do
    Rule ← FindBestRule(Examples)
    Covered ← Cover(Rule, Examples)
    if RuleStoppingCriterion(Rule, Theory, Examples) then exit
    Examples ← Examples − Covered
    Theory ← Theory ∪ Rule
return Theory
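The algorithm above can be written as a runnable skeleton, with `find_best_rule`, `cover` and `stop` passed in as stand-ins for BioHEL's GA search, match process and MDL-based stopping criterion:

```python
def iterative_rule_learning(examples, find_best_rule, cover, stop):
    """Separate-and-conquer / IRL skeleton following Algorithm 4.1.

    The three callables are placeholders for the learner's actual
    components; only the control flow is BioHEL's."""
    theory = []
    remaining = list(examples)
    while remaining:
        rule = find_best_rule(remaining)
        covered = cover(rule, remaining)          # examples the rule matches
        if stop(rule, theory, remaining):         # e.g. MDL-based criterion
            break
        remaining = [e for e in remaining if e not in covered]
        theory.append(rule)
    return theory
```

Each pass "separates" the examples the new rule covers and "conquers" the rest, so the learned theory is an ordered rule list.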


BioHEL fitness function

The coverage term penalizes rules that do not cover a minimum percentage of the examples.
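A hypothetical sketch of such a penalty (the exact MDL-based formula is given in [Bacardit et al., 2009]; this only illustrates the "minimum coverage" breakpoint, with an assumed 5% threshold):

```python
def coverage_penalty(covered, total, min_coverage=0.05):
    """Illustrative coverage term: full score once a rule covers at
    least `min_coverage` of the training set, scaled down
    proportionally below that threshold."""
    ratio = covered / total
    return min(1.0, ratio / min_coverage)
```

This pressure towards high-coverage rules is what pushes BioHEL to evolve rules that are as general as possible.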
