

Multi-GPU Island-Based Genetic Algorithm for Solving the Knapsack Problem

Jiri Jaros

1 Overview

Genetic algorithms (GAs) have been widely applied optimization tools since their development by Holland in 1975 [1]. The knapsack problem is one of the well-known NP-hard problems that GAs have solved successfully. However, large problem instances require millions of candidate solutions to be created and evaluated, raising the execution time to hours or days [2]. The latest GPUs are about 15 times faster than six-core Intel CPUs, which opens new possibilities for massive acceleration of GAs [3].

2 Multi-GPU Cluster Systems

The availability of multiple PCI-Express buses, even on very low-cost commodity computers, makes it possible to construct cluster nodes with multiple GPUs. Inter-node communication is done via MPI over a high-speed network, while intra-node communication exploits CPU shared memory.

3 Multi-GPU Island-Based Genetic Algorithm

The population of the GA is distributed over multiple GPUs. Each GPU is controlled by a single MPI process and evolves one entire island. Migration occurs after a predefined number of generations: islands exchange their best local solution and an optional number of randomly selected individuals over a ring topology.
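The migration scheme described above can be sketched in a few lines of Python. This is a minimal single-process simulation, not the poster's MPI/CUDA code: the function names, the unidirectional ring, and the replace-the-worst incorporation policy are all assumptions made for illustration, since the poster does not state them.

```python
import random

def ring_neighbours(rank, size):
    """Return (source, destination) ranks for a unidirectional ring (assumed direction)."""
    return (rank - 1) % size, (rank + 1) % size

def select_emigrants(island, scores, n_random, rng):
    """Pick the best local individual plus n_random randomly chosen ones."""
    best = max(range(len(island)), key=scores.__getitem__)
    extra = [rng.randrange(len(island)) for _ in range(n_random)]
    return [(island[i], scores[i]) for i in [best] + extra]

def incorporate(island, scores, immigrants):
    """Assumed policy: an immigrant replaces the current worst individual if it is better."""
    for individual, score in immigrants:
        worst = min(range(len(island)), key=scores.__getitem__)
        if score > scores[worst]:
            island[worst], scores[worst] = individual, score

def migrate(islands, all_scores, n_random, rng):
    """One migration step: every island sends to its successor on the ring."""
    size = len(islands)
    # Select all emigrants first so that no island forwards a freshly received immigrant.
    outgoing = [select_emigrants(islands[r], all_scores[r], n_random, rng)
                for r in range(size)]
    for r in range(size):
        source, _ = ring_neighbours(r, size)
        incorporate(islands[r], all_scores[r], outgoing[source])
```

In a distributed run, each island would live in its own MPI process, and the loop in `migrate` would become a send to the destination rank paired with a receive from the source rank.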

4 Local GPU Island Implementation Details [4]

• 32 knapsack items packed into a single integer value
• Individuals processed by CUDA warps in multiple rounds
• Most CUDA block barriers removed
• Negligible thread divergence (< 0.5%)
• Blocks of knapsack data shared within each CUDA block
• Uniform crossover, bit-flip mutation, binary tournament
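The chromosome layout and the three genetic operators listed above can be illustrated with a small CPU-side sketch in Python. It is not the CUDA implementation from [4]: the function names, the bit ordering inside a word, and the zero-fitness penalty for overweight solutions are assumptions chosen to keep the example short.

```python
import random

BITS = 32  # knapsack items per packed integer, as stated on the poster

def pack(bits):
    """Pack 0/1 genes into 32-bit words (item i -> bit i % 32 of word i // 32; assumed order)."""
    words = [0] * ((len(bits) + BITS - 1) // BITS)
    for i, b in enumerate(bits):
        if b:
            words[i // BITS] |= 1 << (i % BITS)
    return words

def fitness(words, prices, weights, capacity):
    """Total price of the selected items; overweight solutions score 0 (assumed penalty)."""
    price = weight = 0
    for i in range(len(prices)):
        if (words[i // BITS] >> (i % BITS)) & 1:
            price += prices[i]
            weight += weights[i]
    return price if weight <= capacity else 0

def uniform_crossover(a, b, rng):
    """For each word, a random 32-bit mask takes each bit from parent a or parent b."""
    child = []
    for wa, wb in zip(a, b):
        mask = rng.getrandbits(BITS)
        child.append((wa & mask) | (wb & ~mask & 0xFFFFFFFF))
    return child

def bit_flip_mutation(words, n_items, rate, rng):
    """Flip each of the n_items genes independently with probability rate."""
    out = list(words)
    for i in range(n_items):
        if rng.random() < rate:
            out[i // BITS] ^= 1 << (i % BITS)
    return out

def binary_tournament(population, scores, rng):
    """Draw two individuals at random and return the fitter one."""
    i, j = rng.randrange(len(population)), rng.randrange(len(population))
    return population[i] if scores[i] >= scores[j] else population[j]
```

Packing 32 items per word means crossover touches 32 genes with three bitwise operations, which is one reason such a layout suits warp-wide processing.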

5 Experimental Results

• Highly optimized CPU implementation running on four six-core Intel Xeon processors with a 40 Gb InfiniBand interconnect
• CUDA implementation running on 14 NVIDIA GTX 580 GPUs
• Knapsack problem with 10,000 items

6 Conclusions

The proposed multi-GPU island-based GA enables the solution of large-scale instances of the knapsack problem.

The significant benefits:

• Speedups of up to 35, 194, and 781 for 14 GPUs versus 24, 6, and 1 CPU cores, respectively
• Overall performance of 5.67 TFLOPS (14 GPUs)
• Overall efficiency of 26%
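The poster does not define how the efficiency figure is computed. One reading that is consistent with the reported numbers is achieved throughput divided by theoretical peak. The check below assumes a single-precision peak of about 1.581 TFLOPS per GTX 580 (512 CUDA cores × 2 FLOP per cycle × 1.544 GHz); that peak value is an assumption and does not come from the poster.

```latex
\[
\text{efficiency} \approx \frac{5.67\ \text{TFLOPS}}{14 \times 1.581\ \text{TFLOPS}}
  = \frac{5.67}{22.13} \approx 0.256 \approx 26\%
\]
\[
\text{speedup per GPU vs. one CPU core} \approx \frac{781}{14} \approx 55.8
\]
```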

The code will be released as open-source software (http://www.fit.vutbr.cz/~jarosjir).

[1] J. H. Holland, “Adaptation in Natural and Artificial Systems”, Ann Arbor, no. 53. University of Michigan Press, 1975, p. 211

[2] Z. Michalewicz and J. Arabas, “Genetic algorithms for the 0/1 knapsack problem”, in Lecture Notes in Computer Science, 1994, vol. 869/1994, pp. 134-143

[3] V. W. Lee et al., “Debunking the 100X GPU vs. CPU myth”, in Proceedings of the 37th Annual International Symposium on Computer Architecture - ISCA ’10, 2010, p. 451

[4] J. Jaros and P. Pospichal, “A Fair Comparison of Modern CPUs and GPUs Running the Genetic Algorithm under the Knapsack Benchmark”, in Applications of Evolutionary Computation, Heidelberg, DE, Springer, 2012, pp. 426-435

College of Engineering and Computer Science, Australian National University

This research has been partially supported by the research grant "Natural Computing on Unconventional Platforms", GP103/10/1517, Czech Science Foundation (2010-13).

[Figure: migration steps performed by each MPI process]

• Evaluate the local GPU island
• Select emigrants in the local GPU island
• Transfer the emigrants from GPU to CPU and then over the network
• Receive immigrants from the network by the CPU and upload them to the GPU
• Incorporate the immigrants into the local GPU island

[Figure: Solution quality. Fitness value (x1000) vs. individuals per local island (128 to 2048) for 1, 6, 7, 12, and 14 GPUs.]

[Figure: Total execution time. Execution time [s] vs. individuals per local island (128 to 2048) for 1, 6, 7, 12, and 14 GPUs.]

[Figure: Speedup on a single island reached by multicore CPU and GPU. Speedup vs. single-thread CPU against individuals per local island (128 to 2048) for 1 GPU, 2x6 CPU threads, and 6 CPU threads.]

[Figure: GPU speedup vs. four six-core Xeons. Speedup against individuals per local island (128 to 2048) for 1, 6, 7, 12, and 14 GPUs.]

[Figure: cluster diagram showing the interconnection network.]