A Parallel GPU Version of the Traveling Salesman Problem

Molly A. O’Neil, Dan Tamir, and Martin Burtscher*
Department of Computer Science

The Traveling Salesman Problem

Common combinatorial optimization problem
  Wire routing, logistics, robot arm movement, etc.
Given n cities, find shortest Hamiltonian tour
  Must visit all cities exactly once and end in first city
Usually expressed as a graph problem
  We use complete, undirected, planar, Euclidean graph
  Vertices represent cities
  Edge weights reflect distances
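The problem statement above can be made concrete with a minimal sketch. The four city coordinates and the function names `distance_matrix` and `tour_length` are hypothetical and chosen only for illustration; they are not taken from the TSP_GPU code.

```python
import math

def distance_matrix(cities):
    """Pairwise Euclidean distances; symmetric because the graph is undirected."""
    n = len(cities)
    return [[math.dist(cities[i], cities[j]) for j in range(n)] for i in range(n)]

def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the first city."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

# Four hypothetical cities on the corners of a 3 x 4 rectangle.
cities = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = distance_matrix(cities)

perimeter = tour_length([0, 1, 2, 3], dist)  # walks around the rectangle
crossing = tour_length([0, 2, 1, 3], dist)   # uses both diagonals
```

Both tours visit every city exactly once and return to the start, but the one that follows the perimeter is shorter than the one that crosses the diagonals.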

A Parallel GPU Version of the Traveling Salesman Problem July 2011

TSP Algorithm

Finding the optimal solution is NP-hard
  Heuristic algorithms used to approximate solution
We use an iterative hill climbing search algorithm
  Generate k random initial tours (k climbers)
  Iteratively refine them until local minimum reached
In each iteration, apply best opt-2 move
  Find best pair of edges (a,b) and (c,d) such that replacing them with (a,d) and (b,c) minimizes tour length
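The hill climbing search described above can be sketched as follows. This is a sequential illustration under stated assumptions, not the authors' implementation: the opt-2 (2-opt) move is realized as a segment reversal, the endpoint labels a, b, c, d are assigned so that the new edges come out as (a,d) and (b,c) as on the slide, distances are rounded to integers so that length differences are exact, and all function names and the example cities are hypothetical.

```python
import math
import random

def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def best_move(tour, dist):
    """Return (delta, i, j) for the reversal of tour[i..j] that shortens the tour most.

    Returns (0, -1, -1) if no move shortens the tour.
    """
    n = len(tour)
    best = (0, -1, -1)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            a, b = tour[i - 1], tour[i]
            d, c = tour[j], tour[(j + 1) % n]
            # Remove edges (a,b) and (c,d); add edges (a,d) and (b,c).
            delta = dist[a][d] + dist[b][c] - dist[a][b] - dist[c][d]
            if delta < best[0]:
                best = (delta, i, j)
    return best

def climb(tour, dist):
    """One climber: apply the best improving move until a local minimum is reached."""
    tour = list(tour)
    while True:
        delta, i, j = best_move(tour, dist)
        if delta >= 0:
            return tour
        tour[i:j + 1] = reversed(tour[i:j + 1])

def hill_climb(dist, k, seed=0):
    """Run k climbers from random initial tours and keep the shortest result.

    The climbers are independent; this sketch simply runs them one after another.
    """
    rng = random.Random(seed)
    best_tour, best_len = None, math.inf
    for _ in range(k):
        tour = list(range(len(dist)))
        rng.shuffle(tour)
        tour = climb(tour, dist)
        length = tour_length(tour, dist)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

# Hypothetical example: corners of a 3 x 4 rectangle, integer-rounded distances.
cities = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[round(math.dist(p, q)) for q in cities] for p in cities]
tour, length = hill_climb(dist, k=5, seed=0)
```

With integer distances every accepted move strictly shortens the tour, so each climber terminates.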


GPU Requirements

Lots of data parallelism
  Need 10,000s of ‘independent’ threads
Sufficient memory access regularity
  Sets of 32 threads should have ‘nice’ access patterns
Sufficient code regularity
  Sets of 32 threads should follow the same control flow
Plenty of data reuse
  At least O(n²) operations on O(n) data



TSP_GPU Implementation

Assuming 100-city problems & 100,000 climbers
Climbers are independent, can be run in parallel
  Plenty of data parallelism
  Potential load imbalance
    Different number of steps required to reach local minimum
Every step determines best of 4851 opt-2 moves
  Same control flow (but different data)
  Coalesced memory access patterns
  O(n²) operations on O(n) data
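The figure of 4851 moves per step is consistent with the following count. If the first city is held fixed and a move is identified by a pair of positions i < j among the remaining n - 1 cities, there are C(n-1, 2) candidate moves, which is 4851 for n = 100. Whether the authors' doubly nested loop enumerates moves in exactly this way is an assumption; the sketch only checks the arithmetic, and `count_moves` is a hypothetical name.

```python
from math import comb

def count_moves(n):
    """Count position pairs (i, j) with 1 <= i < j <= n - 1 using a doubly nested loop."""
    count = 0
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            count += 1
    return count

moves_per_step = count_moves(100)
closed_form = comb(100 - 1, 2)  # C(n-1, 2) for n = 100
```

The count grows quadratically in n while a tour holds only n cities, which matches the O(n²) operations on O(n) data noted above.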


Code Optimizations

Key code section: finding best opt-2 move
  Doubly nested loop
  Only computes difference in tour length, not absolute length
Highly optimized to minimize memory accesses
  “Caches” rest of data in registers
  Requires only 6 clock cycles per move on a Xeon CPU core
Local minimum compared to best solution so far
  Best solution updated if needed, otherwise tour is discarded
Other small optimizations (see paper)
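Why computing only the difference in tour length suffices can be checked with a small sketch. A move changes exactly two edges, so the difference needs four distance lookups, whereas the absolute length needs n. The code below compares both for every move on a random instance. The instance, the segment-reversal formulation of the move, and the names `reversal_delta` and `tour_length` are assumptions made for illustration.

```python
import math
import random

def tour_length(tour, dist):
    """Absolute tour length: n distance lookups."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def reversal_delta(tour, dist, i, j):
    """Change in length if tour[i..j] is reversed (1 <= i < j <= n - 1): 4 lookups."""
    n = len(tour)
    a, b = tour[i - 1], tour[i]
    d, c = tour[j], tour[(j + 1) % n]
    return dist[a][d] + dist[b][c] - dist[a][b] - dist[c][d]

# Hypothetical random instance with integer-rounded, symmetric distances.
rng = random.Random(1)
n = 12
points = [(rng.randrange(1000), rng.randrange(1000)) for _ in range(n)]
dist = [[round(math.dist(p, q)) for q in points] for p in points]
tour = list(range(n))
rng.shuffle(tour)

before = tour_length(tour, dist)
mismatches = 0
for i in range(1, n - 1):
    for j in range(i + 1, n):
        moved = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        if tour_length(moved, dist) - before != reversal_delta(tour, dist, i, j):
            mismatches += 1
```

The shortcut relies on the distance matrix being symmetric: the edges inside the reversed segment are traversed in the opposite direction but keep their weights.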


GPU Optimizations

Random tours generated in parallel on GPU
  Minimizes data transfer to GPU (CPU only generates distance matrix and prints result)
2D distance matrix resident in shared memory
  Ensures hits in software-controlled fast data cache
Tours copied to local memory in chunks of 1024
  Enables accessing them with coalesced loads & stores
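A rough size check shows why a full 2D distance matrix for 100 cities can stay resident in shared memory. The sketch assumes 48 KB of shared memory per multiprocessor and 4-byte matrix entries; neither value is stated on the slides, so both are labeled as assumptions, and `matrix_bytes` is a hypothetical helper.

```python
import math

SHARED_BYTES = 48 * 1024  # assumption: 48 KB of shared memory available per SM
ENTRY_BYTES = 4           # assumption: 4-byte distance entries

def matrix_bytes(n):
    """Size of a full n x n distance matrix in bytes."""
    return n * n * ENTRY_BYTES

fits_100 = matrix_bytes(100) <= SHARED_BYTES
# Largest n whose full matrix still fits under these assumptions.
max_cities = math.isqrt(SHARED_BYTES // ENTRY_BYTES)
```

Under these assumptions a 100-city matrix takes 40,000 bytes, which leaves little headroom for larger inputs.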



Evaluation Method

Systems
  NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs)
  Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons)
Datasets
  Five 100-city inputs from TSPLIB
Implementations
  CUDA (GPU), Pthreads (CPU), serial C (CPU)
  Use almost identical code for finding best opt-2 move


Runtime Comparison (kroE100 Input)

GPU is 7.8x faster than CPU with 8 cores
One GPU chip is as fast as 16 or 32 CPU chips


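[Figure: runtimes in ms (log scale) vs. number of threads for the pthreads CPU code (min, median, max), with the sequential CPU and CUDA GPU runtimes shown for reference]

Data labels from the chart (runtimes in ms):
  sequential CPU          154684
  pthreads, 1 thread      156413
  pthreads, 2 threads      78350
  pthreads, 4 threads      39175
  pthreads, 8 threads      19591
  pthreads, 16 threads      9802
  pthreads, 32 threads      4908
  pthreads, 64 threads      4368
  pthreads, 128 threads     2724
  pthreads, 256 threads     2539
  CUDA GPU                  2497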
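The headline ratios follow from the runtime labels in this chart by simple division. In the sketch below, the assignment of each label to a configuration is an assumption inferred from the order in which the labels appear; speedup is taken as the sequential runtime divided by the runtime of the configuration in question.

```python
# Runtimes in ms, read from the chart labels (assignment to configurations is assumed).
sequential_ms = 154684
pthreads_ms = {1: 156413, 2: 78350, 4: 39175, 8: 19591, 16: 9802,
               32: 4908, 64: 4368, 128: 2724, 256: 2539}
gpu_ms = 2497

# Speedup over the sequential code for each pthreads thread count and for the GPU.
pthreads_speedup = {t: round(sequential_ms / ms, 1) for t, ms in pthreads_ms.items()}
gpu_speedup = round(sequential_ms / gpu_ms, 1)

# GPU compared with the 8-thread pthreads run (one 8-core CPU).
gpu_vs_8_cores = round(pthreads_ms[8] / gpu_ms, 1)
```

Under this reading, the GPU runtime is close to the 256-thread pthreads runtime, which corresponds to 32 eight-core CPUs.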


Speedup over Serial (kroE100 Input)

Pthreads code scales well to 32 threads (4 CPUs)
CPU performance fluctuates (NUMA), GPU stable


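[Figure: speedup over sequential code vs. number of threads for the pthreads code (min, median, max), with the CUDA GPU speedup shown for reference]

Data labels from the chart (speedup over sequential code):
  pthreads, 1 thread       1.0
  pthreads, 2 threads      2.0
  pthreads, 4 threads      3.9
  pthreads, 8 threads      7.9
  pthreads, 16 threads    15.8
  pthreads, 32 threads    31.5
  pthreads, 64 threads    35.4
  pthreads, 128 threads   56.8
  pthreads, 256 threads   60.9
  CUDA GPU                61.9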


Solution Quality

Optimal tour found in 4 of 5 cases with 100,000 climbers
200,000 climbers find best solution in fifth case
Runtime independent of input and linear in climbers


          TSPLIB Database   CUDA GPU Solution Quality
Name      Optimal Cost      Climbers   Min. Tour Cost   Min. Tour #   Runtime (s)
kroA100   21,282            100,000    21,282           33,188        2.540
kroB100   22,141            100,000    22,141           5,969         2.499
kroC100   20,749            100,000    20,749           23,092        2.543
kroD100   21,294            100,000    21,294           32,142        2.497
kroE100   22,068            100,000    22,084           16,941        2.499
kroE100   22,068            200,000    22,068           117,583       4.952
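The two runtime claims can be checked against the table with a few lines of arithmetic. The runtimes are copied from the table; treating the 4.952 s run as the 200,000-climber kroE100 run is an assumption based on the bullet about the fifth case.

```python
# Runtimes in seconds from the table (100,000 climbers unless noted).
runtime_100k = {"kroA100": 2.540, "kroB100": 2.499, "kroC100": 2.543,
                "kroD100": 2.497, "kroE100": 2.499}
runtime_kroE_200k = 4.952  # assumed to be the 200,000-climber run

# Independence of input: relative spread across the five inputs.
fastest = min(runtime_100k.values())
slowest = max(runtime_100k.values())
spread = (slowest - fastest) / fastest

# Linearity in climbers: doubling the climbers should roughly double the runtime.
ratio = runtime_kroE_200k / runtime_100k["kroE100"]
```

The spread across inputs is under 2%, and doubling the number of climbers increases the runtime by a factor of about 1.98.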


Summary

TSP_GPU source code is freely available at
  http://www.cs.txstate.edu/~burtscher/research/TSP_GPU/
TSP_GPU algorithm
  Highly optimized implementation for GPUs
  Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
  Produces high-quality results
  May be better suited for GPU than ACO and GA algos.

Acknowledgments
  NSF TeraGrid (NICS), NVIDIA Corp., and Intel Corp.

