
Page 1: Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era

Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era

Masaharu Munetomo
Information Initiative Center, Hokkaido University, Sapporo, JAPAN.

Page 2:

Toward the next generation of supercomputing

• Next year (2012): 10 PFlops systems will be installed

• “K” Computer (Fujitsu): 10 PFlops

• Sequoia (IBM): 20 PFlops

• Near future (~2018): toward exascale (10^18 Flops)

• International Exascale Software Project (IESP)

Page 3:

Potential Architecture in Exa-Flops Era

• # of nodes: O(10^6)

• # of concurrency per node: O(10^3)

• Total concurrency: O(10^9)

• Memory BW: 400 GB/s

• Inter-node BW: 50 GB/s

(Jack Dongarra Presentation@SC10)

Page 4:

Uniform or Heterogeneous

• Uniform Architecture: using cores with the same architecture

• PROS: natural approach, easy to program

• CONS: needs more power than a heterogeneous architecture

• Heterogeneous Architecture: CPU + accelerators with a different architecture

• PROS: energy efficiency, concurrency

• CONS: harder to program and optimize than a uniform architecture

Page 5:

Challenges toward Exa-Scale

• Improvements in interconnect bandwidth and latency do not keep pace with the increase in the number of cores

• It is difficult to develop massively parallel libraries for many-core architectures

• It is also difficult to develop scalable algorithms for many-core

• In the field of high performance computing, efforts concentrate on parallel numerical libraries such as large-scale matrix computations; extreme-scale parallel optimization is not considered.

Page 6:

Scalability of Master-Worker Model

• When the delay is proportional to the # of processors

• When the delay is proportional to the log of the # of processors (the ideal situation)
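The two cases above can be sketched numerically, assuming a simple cost model of my own (the constants and the model itself are illustrative, not from the slides): a master-worker run with p workers takes roughly work/p plus a communication term that grows linearly in p on a poor interconnect and only logarithmically on an ideal one.

```python
import math

def speedup(p, work=1_000_000.0, t_c=1.0, delay="linear"):
    """Speedup of a master-worker model with p workers.

    work  : total sequential evaluation time (arbitrary units)
    t_c   : per-message communication cost
    delay : 'linear' -> overhead grows as p * t_c
            'log'    -> overhead grows as log2(p) * t_c (ideal network)
    """
    comm = p * t_c if delay == "linear" else math.log2(p) * t_c
    return work / (work / p + comm)

# Linear delay saturates and then degrades; logarithmic delay keeps scaling.
for p in (10, 100, 1_000, 10_000):
    print(p, round(speedup(p, delay="linear"), 1), round(speedup(p, delay="log"), 1))
```

With these (made-up) constants the linear-delay curve stalls near p ≈ 1,000, while the logarithmic-delay curve still improves at p = 10,000, which is the qualitative point of the slide.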

Page 7:

Extremely Parallel Master-Slave Model

• P* should be around 100,000 ~ 1,000,000

• X should be more than 10^10!!


• It is preferable if we can hide communication overheads.
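One common reading of these numbers (an assumption on my part; the slide does not spell out its cost model) is the textbook master-slave analysis: with total evaluation time T_f and per-message cost T_c, the makespan T(p) = T_f/p + p·T_c is minimized at p* = sqrt(T_f/T_c). A p* in the 10^5 ~ 10^6 range then requires the ratio X = T_f/T_c to exceed roughly 10^10.

```python
import math

# Hypothetical cost model T(p) = T_f/p + p*T_c; setting dT/dp = 0 gives the
# optimal worker count p* = sqrt(T_f/T_c) = sqrt(X).
def optimal_workers(x):
    return math.sqrt(x)

print(optimal_workers(1e10))  # 100000.0 -> the lower end of the slide's P* range
print(optimal_workers(1e12))  # 1000000.0 -> the upper end
```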

Page 8:

Toward exa-scale massive parallelization

• Communication latency should be at most O(log n)

• Network hardware architecture: Hypercube, etc.

• Network software stack: Optimizing collective communications, etc.

• Algorithms that reduce communication

• Employing heterogeneous architecture: GPGPU, etc.
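To make the O(log n) latency target concrete, here is a toy simulation (illustrative only, not production MPI) of an all-reduce on a hypercube topology: p = 2^d nodes exchange along one hypercube dimension per round, so every node holds the global sum after d = log2(p) rounds.

```python
def hypercube_allreduce(values):
    """Simulate an all-reduce (sum) over p = 2^d nodes in d rounds."""
    p = len(values)
    assert p & (p - 1) == 0, "node count must be a power of two"
    vals = list(values)
    rounds = 0
    dim = 1
    while dim < p:
        new = vals[:]
        for node in range(p):
            partner = node ^ dim          # neighbor along this dimension
            new[node] = vals[node] + vals[partner]
        vals = new
        dim <<= 1
        rounds += 1
    return vals, rounds

sums, rounds = hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
# every node ends with the global sum 36 after log2(8) = 3 rounds
```

The same recursive-doubling pattern underlies optimized collective communications in MPI implementations, which is what the "network software stack" bullet alludes to.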

Page 9:

Massive parallelization of Evolutionary computation

• Master-Worker Model: Same as general parallelized algorithms

• Island models: difficult to implement massive parallelism as-is

• Localize inter-node communications such as cellular GA

• Massive parallelization of advanced methods like EDAs → dependencies (communications) among variables should be considered; ex) DBOA, pLINC, pD5, etc.

• Massive parallelization on cloud systems: MapReduce GA, ECGA, etc.

Page 10:

Research toward exa-scale ECs

• We should assume more than million-way parallelization

• Making communication overheads as small as possible

• Designing algorithms with less communications necessary

• Dependency on the target application problems

• When fitness evaluations are costly, simply parallelizing them is enough

• Parallelization of “intelligent” algorithms also needs to be considered

• Analyzing interactions among genes to solve complex problems

Page 11:

Promising approach toward massive parallelization

• Massively parallel cellular GAs with local communications on many-core

• Massive parallelization of linkage identification: pLINC, pLIEM, pD5

• Parallelization of EDAs: pBOA, gBOA (A GPGPU implementation)

• GPGPU implementations: gBOA, MAXSAT, MINLP

• MapReduce implementations: MapReduce GA, ECGA, linkage identification

Page 12:

A cellular GA implemented on CellBE [Asim 2009]

• A cellular GA (cGA) to solve the Capacitated Vehicle Routing Problem (CVRP) implemented on a many-core architecture, the Cell Broadband Engine (Sony, IBM, Toshiba).

• A cellular GA was employed due to the limited communication BW between the SPEs and the PPE.
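A minimal sketch of the cGA pattern this slide describes (OneMax stands in for CVRP, and the grid size, operators, and parameters are illustrative assumptions): each cell mates only with its four toroidal neighbors, so all communication stays local.

```python
import random

random.seed(0)

GRID, BITS = 8, 16  # 8x8 toroidal grid of 16-bit individuals

def fitness(ind):
    return sum(ind)  # OneMax: count of 1-bits (stand-in for CVRP cost)

def neighbors(x, y):
    # von Neumann neighborhood on a torus: only local communication needed
    return [((x - 1) % GRID, y), ((x + 1) % GRID, y),
            (x, (y - 1) % GRID), (x, (y + 1) % GRID)]

pop = {(x, y): [random.randint(0, 1) for _ in range(BITS)]
       for x in range(GRID) for y in range(GRID)}

def step(pop):
    new = {}
    for (x, y), ind in pop.items():
        mate = max((pop[n] for n in neighbors(x, y)), key=fitness)
        cut = random.randrange(1, BITS)             # one-point crossover
        child = ind[:cut] + mate[cut:]
        child[random.randrange(BITS)] ^= 1          # single bit-flip mutation
        new[(x, y)] = max(ind, child, key=fitness)  # local elitist replacement
    return new

for _ in range(30):
    pop = step(pop)
best = max(pop.values(), key=fitness)
```

Because every cell only reads its immediate neighbors, the grid can be partitioned across cores with each partition exchanging only its border cells, which is why the limited SPE–PPE bandwidth was tolerable.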

Page 13:

Parallelization of linkage identification techniques

• Linkage identification techniques are based on pair-wise perturbations to detect nonlinearity/non-monotonicity.

• It is relatively easy to parallelize by assigning the calculation of each pair to a processor

• pLINC, pLIEM, pD5 have been proposed.

• We have succeeded in solving a half-million-bit problem by employing pD5
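The perturbation test itself fits in a few lines. This is a hedged sketch of the LINC-style nonlinearity check only; the thresholds, population sampling, and other details of the actual pLINC/pLIEM/pD5 algorithms are omitted.

```python
def flip(s, i):
    """Return a copy of bit string s with bit i flipped."""
    t = list(s)
    t[i] ^= 1
    return t

def linked(f, s, i, j, eps=1e-9):
    """LINC-style check: bits i and j interact if flipping both changes
    fitness differently from the sum of the individual changes."""
    df_i = f(flip(s, i)) - f(s)
    df_j = f(flip(s, j)) - f(s)
    df_ij = f(flip(flip(s, i), j)) - f(s)
    return abs(df_ij - (df_i + df_j)) > eps  # nonlinearity => linkage

# Toy fitness (my example): bits 0 and 1 are coupled, bit 2 is separable.
f = lambda s: (s[0] & s[1]) * 10 + s[2]
s = [0, 0, 0]
print(linked(f, s, 0, 1))  # True  (nonlinear interaction)
print(linked(f, s, 0, 2))  # False (separable pair)
```

Each pair needs only a constant number of fitness evaluations and the pairs are independent, which is exactly why distributing one pair per processor parallelizes so naturally.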

Page 14:

A half-million-bit optimization with pD5 [Tsuji 2004]

• A 500,000-bit problem (5-bit trap × 100,000)

• # of local optima = 2^100,000 − 1 ≒ 10^30,000

• HITACHI SR-11000model K1 (5.4TFlops)

• 32 days by 1 node (16 CPU)

• 1 day by whole system (40 nodes)
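The size of that search space checks out with a quick sanity calculation (mine, not from the slide): 100,000 concatenated 5-bit traps each contribute 2 optima, so there are about 2^100,000 optima in total.

```python
import math

# 2^100,000 in decimal digits: log10(2^100,000) = 100,000 * log10(2)
log10_optima = 100_000 * math.log10(2)
print(round(log10_optima))  # 30103
```

So the count is about 10^30,103, matching the slide's order-of-magnitude figure of 10^30,000.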

Page 15:

pBOA on GPGPU [Asim 2008]

• Implementation of the parallel Bayesian Optimization Algorithm on NVIDIA CUDA.

Page 16:

Massive Parallelization over the Cloud: MapReduce

• MapReduce is a simple but powerful approach to realize massive parallelization over Cloud computing systems

• Several papers have been published on MapReduce implementations of GA and ECGA.

• MapReduce ECGA: calculation of the Marginal Product Model (MPM) and generation of offspring can be parallelized via the “Map” process.

• Fault tolerance for massive parallelization can be handled by MapReduce
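As a sketch of how the "Map" step parallelizes MPM statistics (the function names and the single two-bit group below are my illustration, not the papers' exact formulation): each map task counts schema occurrences over its shard of the population, and a reduce step merges the counts the Marginal Product Model needs.

```python
from collections import Counter
from functools import reduce

def map_counts(shard, group):
    """Map task: count occurrences of each schema over one population shard.
    group is the tuple of positions the MPM models jointly."""
    return Counter(tuple(ind[i] for i in group) for ind in shard)

def reduce_counts(counters):
    """Reduce task: merge per-shard schema counts into global counts."""
    return reduce(lambda a, b: a + b, counters, Counter())

population = [[0, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
shards = [population[:2], population[2:]]              # one shard per map task
partials = [map_counts(s, group=(0, 1)) for s in shards]
totals = reduce_counts(partials)
# totals: {(0, 0): 2, (0, 1): 1, (1, 1): 1}
```

Because each shard is processed independently, a failed map task can simply be re-run on its shard, which is the fault-tolerance property the last bullet refers to.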

Page 17:

Concluding Remarks

• Designing robust and scalable algorithms is not easy, since analysis of linkage, i.e., interactions among genes, is necessary, which usually requires at least O(l^2) computational overhead and communication among processors.

• In order to adapt to exa-scale massively parallel systems with more than O(10^6) processing cores, we need to reduce the communication overhead of each message exchange to less than O(log n). Otherwise, parallel algorithms cannot scale well to such massively parallel architectures.

• We also need to adapt to modern architectures and systems such as heterogeneous architectures with GPGPUs and cloud computing environments.