[IEEE TENCON 2013 - 2013 IEEE Region 10 Conference - Xi'an, China (2013.10.22-2013.10.25)] 2013 IEEE International Conference of IEEE Region 10 (TENCON 2013) - FPGA based parallel

FPGA based Parallel Neighborhood Search

Shuang Ying Yu and Yuet Ming Lam

Faculty of Information Technology
Macau University of Science and Technology

[email protected]

Abstract: An FPGA-based generic parallel neighborhood search that exploits parallelism at both the search and move levels is proposed. A neighborhood partitioning technique is employed to significantly increase parallelism at the move level with a minimal increase in hardware resources. The proposed approach is applied to a tabu search and evaluated using the quadratic assignment problem. Experimental results show that the proposed technique increases the search speed by 13.3 times and improves solution quality by 11.9%. Compared with a GPU implementation, this work achieves a speedup of 20.2 times.

Keywords: Parallel neighborhood search; Field Programmable Gate Arrays; Quadratic assignment problem.

I. INTRODUCTION

Many combinatorial optimization problems, such as vehicle route planning and job assignment, are NP-hard, and their complexity grows exponentially with the problem size. A neighborhood search algorithm [1] can find a feasible solution in polynomial time. Parallel neighborhood search algorithms have been proposed to improve run time or enhance solution quality [2].

There are several works on FPGA-based parallel neighborhood search. Wakabayashi et al. propose a tabu search architecture to solve the quadratic assignment problem [3]. This approach exploits data-level parallelism by employing a parallel architecture to evaluate a solution. Although a significant speedup is achieved over a CPU implementation, the applicability and scalability of this approach are highly limited because the hardware size grows dramatically with the problem size. Mavroidis et al. propose a parallel architecture to solve the travelling salesman problem [4]. This architecture exploits parallelism at the move level, where a number of symmetric processing elements are designed to evaluate the entire neighborhood in parallel. However, exploiting move-level parallelism alone may result in a loss of diversity. In particular, exploring the entire neighborhood is hard to scale up.

To maximize the achievable parallelism on an FPGA platform, a generic parallel neighborhood search is proposed. The contributions of this work are:

An FPGA-based generic parallel neighborhood search that exploits parallelism at both the search and move levels.

A neighborhood partitioning technique that achieves higher parallelism without a significant increase in hardware resources.

An evaluation of the proposed techniques using the quadratic assignment problem.

This paper is organized as follows. Section II gives background on tabu search and the quadratic assignment problem. The proposed parallel approach and its FPGA implementation are introduced in Section III. Section IV presents the experimental results. Finally, Section V concludes the paper.

II. BACKGROUND

A. Tabu Search

In neighborhood search, let S be the set of all possible solutions of the problem, i.e. the solution space. During the search, let s_n be the current solution in S and let V(s_n) be the set of neighboring solutions of s_n, each obtained by performing a move, i.e. an operation that produces a new solution, on s_n. If there is an s_{n+1} in V(s_n) such that E(s_{n+1}) ≤ E(s_n), where E is the cost function, then s_{n+1} is accepted as the starting point of the next iteration. This process is repeated until a termination condition is reached. Upon termination, the solution s with the smallest E(s) is returned as the search result.

Tabu search [1] is an advanced neighborhood search that can forbid the search from revisiting already explored regions of the solution space by using a memory structure called the tabu list. A new solution is discarded if it matches a solution in the tabu list. When a solution is accepted, it is appended to the tabu list and the oldest solution in the tabu list is removed.
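As a rough illustration of the tabu list mechanism described above, the following Python sketch runs a tabu search over hashable solutions. The function name `tabu_search`, its parameters, and the rule of always moving to the best non-tabu neighbor are assumptions made for this sketch; they are not taken from the paper.

```python
from collections import deque


def tabu_search(start, neighbors, cost, iterations, tabu_size):
    """Minimal tabu search sketch (illustrative assumptions, not the paper's design).

    start      -- initial solution (must be hashable/comparable)
    neighbors  -- function returning the neighborhood V(s) of a solution s
    cost       -- cost function E
    iterations -- termination condition: fixed number of iterations
    tabu_size  -- length of the tabu list
    """
    current = best = start
    # A bounded deque drops its oldest entry automatically when full.
    tabu = deque([start], maxlen=tabu_size)
    for _ in range(iterations):
        # Discard every neighbor that matches a solution in the tabu list.
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        # Assumption: accept the cheapest non-tabu neighbor.
        current = min(candidates, key=cost)
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
    return best
```

For example, `tabu_search(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2, 20, 5)` walks along the integers and keeps the lowest-cost point it has visited.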

978-1-4799-2827-9/13/$31.00 ©2013 IEEE

B. Quadratic Assignment Problem

The quadratic assignment problem (QAP) [5] is a typical combinatorial optimization problem that seeks the assignment of n facilities to n locations with minimum transportation cost. Assuming that each element a_ij of an n×n matrix DIS = (a_ij) gives the distance between locations i and j, and each element b_ij of an n×n matrix FLOW = (b_ij) gives the traffic flow between facilities i and j, the task is to find the optimal assignment such that the sum of all distance-flow products

\[
F(\varphi) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_{ij}\, b_{\varphi(i)\varphi(j)} \tag{1}
\]

is minimized. In Equation (1), φ is a permutation of n integers that denotes the assignment of n facilities to n locations, i.e. a QAP solution. To produce a new QAP solution in this work, two facilities p and q are selected at random and their assigned locations are swapped. The cost change ΔF of the new solution can be calculated as:

\[
\Delta F(\varphi) = 2 \sum_{k=0,\; k \neq p,q}^{n-1} \left(a_{pk} - a_{qk}\right)\left(b_{\varphi(q)\varphi(k)} - b_{\varphi(p)\varphi(k)}\right) \tag{2}
\]
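The relation between the full cost of Equation (1) and the incremental cost change of Equation (2) can be checked numerically. The sketch below assumes that DIS and FLOW are symmetric with zero diagonals, which is what the factor 2 in Equation (2) suggests; the function names and the random test matrices are hypothetical and only serve the illustration.

```python
import random


def qap_cost(a, b, phi):
    """Full cost F(phi) of Equation (1): O(n^2) products."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]] for i in range(n) for j in range(n))


def swap_delta(a, b, phi, p, q):
    """Cost change of swapping phi[p] and phi[q], as in Equation (2): O(n) products.

    Assumes a and b are symmetric with zero diagonals.
    """
    return 2 * sum(
        (a[p][k] - a[q][k]) * (b[phi[q]][phi[k]] - b[phi[p]][phi[k]])
        for k in range(len(phi))
        if k != p and k != q
    )


def random_symmetric(n, rng):
    """Hypothetical helper: random symmetric integer matrix with zero diagonal."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.randint(1, 9)
    return m
```

Under these assumptions, `qap_cost` of the swapped permutation minus `qap_cost` of the original equals `swap_delta`, so one move costs about n multiply-accumulate steps instead of n².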

In this work, the tabu list is a list L of facility pairs (p, q). A solution generated by swapping p and q is tabu if (p, q) exists in L. If a solution generated by the move operation described above is accepted, its pair of facilities (p, q) is added to the tabu list.
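A pair-based tabu list of this kind can be modelled with a bounded queue. In this sketch the class name `SwapTabuList`, the fixed capacity, and the treatment of (p, q) and (q, p) as the same pair are assumptions; the paper only states that accepted pairs are added to L.

```python
from collections import deque


class SwapTabuList:
    """Illustrative tabu list of facility pairs with first-in, first-out eviction."""

    def __init__(self, capacity):
        # The oldest pair is dropped automatically once capacity is reached.
        self._pairs = deque(maxlen=capacity)

    @staticmethod
    def _key(p, q):
        # Assumption: a swap is unordered, so normalise the pair.
        return (min(p, q), max(p, q))

    def is_tabu(self, p, q):
        return self._key(p, q) in self._pairs

    def add(self, p, q):
        self._pairs.append(self._key(p, q))
```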

III. METHODOLOGY

A. Parallel approach with neighborhood partitioning

Figure 1 illustrates the proposed generic parallel approach. The parallelization combines parallel search and parallel move (PSPM). Multiple searches are launched in parallel, and each search generates and evaluates multiple moves, i.e. solutions, concurrently in each iteration. Although multiple solutions are evaluated in each iteration, only the solution with the lowest cost is selected, and its acceptance is tested against the tabu list. If accepted, this solution is used for the next iteration. Upon termination, the best solution among all searches is extracted. With this approach, the number of solutions explored per iteration is the number of searches times the number of moves.
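The PSPM control flow can be mimicked sequentially in software. The following sketch is only a behavioural model under stated assumptions: the loops over searches and moves stand in for hardware units that would run concurrently, moves are drawn from the whole neighborhood (no partitioning yet), the full cost is recomputed instead of using the incremental cost change, and all names (`pspm`, `num_searches`, `num_moves`, `tabu_size`) are hypothetical.

```python
import random
from collections import deque


def cost(a, b, phi):
    """Full QAP cost of assignment phi."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]] for i in range(n) for j in range(n))


def swapped(phi, p, q):
    """Return a copy of phi with positions p and q exchanged."""
    out = list(phi)
    out[p], out[q] = out[q], out[p]
    return out


def pspm(a, b, num_searches, num_moves, iterations, tabu_size, seed=0):
    """Sequential model of parallel search / parallel move."""
    rng = random.Random(seed)
    n = len(a)
    searches = []
    for _ in range(num_searches):
        phi = list(range(n))
        rng.shuffle(phi)  # each search starts from its own random solution
        searches.append({"phi": phi, "best": list(phi), "tabu": deque(maxlen=tabu_size)})
    evaluated = 0
    for _ in range(iterations):
        for s in searches:  # parallel searches in hardware
            # Parallel moves: num_moves candidate swaps per search and iteration.
            moves = [tuple(sorted(rng.sample(range(n), 2))) for _ in range(num_moves)]
            evaluated += len(moves)
            # Only the lowest-cost candidate is tested against the tabu list.
            p, q = min(moves, key=lambda m: cost(a, b, swapped(s["phi"], *m)))
            if (p, q) in s["tabu"]:
                continue
            s["phi"] = swapped(s["phi"], p, q)
            s["tabu"].append((p, q))
            if cost(a, b, s["phi"]) < cost(a, b, s["best"]):
                s["best"] = list(s["phi"])
    # Upon termination, extract the best solution among all searches.
    best = min((s["best"] for s in searches), key=lambda phi: cost(a, b, phi))
    return best, evaluated
```

The returned counter equals `num_searches * num_moves * iterations`, matching the statement that the number of solutions explored per iteration is the number of searches times the number of moves.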

To generate and evaluate a solution, a set of data, such as the distance matrix and flow matrix in QAP, is required. The memory required to store these data therefore increases linearly with the number of parallel moves evaluated. To reduce the memory requirement, a neighborhood partitioning technique is proposed. As shown in Figure 1, the neighborhood V(s_n) of the current solution s_n is partitioned into multiple regions, and each move is allocated to one region, i.e. it generates and evaluates a solution in its corresponding region. The number of regions must equal the number of parallel moves. Since each move is restricted to a sub-region, the data required to evaluate a solution may be reduced as well.

Considering the design for QAP, each move must be associated with its own distance matrix and flow matrix if each move explores an independent solution in the entire neighborhood V(s_n), unless a dedicated memory can provide sufficient I/O bandwidth for all parallel moves. However, this is normally not feasible due to the considerable cost in hardware size and latency. Using the proposed approach, the neighborhood V(s_n) is partitioned into h regions, and each move explores a solution in one of these h regions. In other words, the i-th move evaluates a swap of two facilities (p, q_i), where p is the same for all moves in the same search and each q_i is defined as:

\[
q_i = \frac{n}{h}\, i + t, \quad \text{where } 0 \le t < \frac{n}{h},\ 0 \le i < h \tag{3}
\]

where t is a random number and n is the problem size. Hence each move requires only a portion of the distance and flow matrices to calculate the cost change (Equation 2). As a result, h parallel moves can be evaluated by sharing one set of distance and flow matrices. Besides the reduction in memory storage, the circuit size can also be reduced since all parallel moves in a search can share common logic such as control.
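The partitioning rule of Equation (3) can be sketched in a few lines. This example assumes that h divides n evenly and that one random offset t is shared by all h moves of a search; neither point is stated explicitly in the text, and the function name `partitioned_moves` is hypothetical. The trivial case q_i = p is not filtered out here.

```python
import random


def partitioned_moves(n, h, p, rng):
    """Return the h candidate swaps (p, q_i), one per neighborhood region."""
    region = n // h            # region size n/h (assumes h divides n)
    t = rng.randrange(region)  # 0 <= t < n/h, assumed shared by all moves
    return [(p, i * region + t) for i in range(h)]
```

For n = 64 and h = 4, the i-th move always lands in the index range [16i, 16i + 16), so each move only ever addresses its own quarter of the matrix rows.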

B. Parallel architecture

Figure 2 shows the simplified architecture of the proposed parallel approach for solving the quadratic assignment problem. The architecture illustrates an implementation with 4 searches and 4 moves. Each parallel search is implemented by a module named ‘PS’ and each parallel move is mapped to a module ‘PM’. The ‘Synchronizer’ extracts the best solution among all ‘PS’ modules. The component ‘M’, which contains two subtractors, one multiplier, and one accumulator, implements Equation 2 to calculate the cost change ΔF. Finally, the ‘Arbiter’ selects the best solution among all ‘PM’ modules, evaluates its acceptance, and updates the tabu list.

As mentioned before, parallel move is implemented by dividing the neighborhood, so each move requires only a portion of the distance and flow matrices to calculate the cost change. As a result, the memory that stores each matrix is divided into a number of sub-memories, and each sub-memory is associated with a ‘PM’ module. For clarity, Figure 2 shows only the partition and connection of the distance matrix memory (DIST_MEM). The original n×n memory is divided into 4 sub-memories (DIST_MEM0 ~ DIST_MEM3), each of size (n/4)×n, to supply data for a_qk in Equation 2. In this work, each sub-memory is configured as a dual-port memory. Note that the total memory requirement is still n×n, independent of the number of ‘PM’ modules, while the number of solutions evaluated by a search is increased due to parallel moves. In addition, since all moves share the same value of p, which is an arbitrary number in [0, n), a multiplexer is used to select a_pk (Equation 2) from one of these sub-memories and supply it to all ‘M’ sub-modules to calculate the cost change ΔF. For better scalability, a sequential pipelined design is employed in sub-module ‘M’ instead of a data-parallel architecture.

Fig. 1. The proposed generic parallel approach with neighborhood partition.
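A software analogy of this memory organisation is sketched below. It is a behavioural model only: the distance matrix is split into h row banks, a hypothetical `read_row` helper plays the role of the multiplexer that fetches row p from whichever bank holds it, and h is assumed to divide n. With q_i chosen as in Equation 3, move i reads local row t of bank i, so every bank is accessed at the same local address.

```python
def split_rows(matrix, h):
    """Split an n x n matrix into h banks of n // h rows each (assumes h divides n)."""
    rows = len(matrix) // h
    return [matrix[i * rows:(i + 1) * rows] for i in range(h)]


def read_row(banks, row):
    """Multiplexer model: fetch an arbitrary row from the bank that stores it."""
    rows = len(banks[0])
    return banks[row // rows][row % rows]
```

The banks together still hold exactly n×n elements, which mirrors the observation that the total memory requirement does not grow with the number of ‘PM’ modules.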

The flow matrix memory FLOW_MEM is partitioned in a way similar to the distance matrix memory. The difference is that the indices φ(p) and φ(q_i) must be obtained from the assignment solution φ. As a result, a sub-module ‘M’ may not align with its connected sub-memory. To solve this problem, if a swap (p, q) is accepted, the matrix columns in the two sub-memories to which p and q belong are swapped. This swap introduces an overhead of n cycles. However, the increased parallelism from parallel move can compensate for this overhead. Equation 2 shows that evaluating a solution takes about n cycles, hence 2n cycles are required in this work due to the n-cycle overhead. The proposed approach is therefore expected to achieve a speedup when more than 2 parallel moves are evaluated.
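The break-even point of 2 parallel moves follows from a simple cycle count. The estimate below is a back-of-the-envelope sketch: it assumes a single-move baseline that needs n cycles per solution with no swap overhead, and that every iteration of the proposed design pays the full n-cycle column swap. The symbols T_1 and T_h (solutions per cycle per search) are notation introduced here.

```latex
\[
T_1 = \frac{1}{n}, \qquad
T_h = \frac{h}{n + n} = \frac{h}{2n}, \qquad
\frac{T_h}{T_1} = \frac{h}{2} > 1 \iff h > 2 .
\]
```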

The advantage of the proposed approach is that it scales better than a data-parallel approach. If a hardware platform has limited resources, so that data-level parallelism is constrained, the proposed approach can employ a pipelined design for move evaluation and exploit parallelism at both the search and move levels. Furthermore, the parallelism at each level can be adjusted to adapt to various platform constraints. On the other hand, if the hardware platform provides sufficient resources, the proposed approach can also be extended to support data-level parallelism.

IV. EXPERIMENTAL RESULTS

The proposed architecture was implemented on a Xilinx ML605 FPGA evaluation board with an XC6VLX240T FPGA chip. The design tool is Xilinx ISE 13.2. To evaluate the performance, 10 instances of sizes 64 to 128 (Esc64a, Sko64, Tai64c, Lipa70a, Sko72, Lipa80a, Sko81, Lipa90a, Sko90, Esc128) are chosen from QAPLIB [6].

A. Search speed improvement

The search speed is measured as the number of solutions per second (NSPS), i.e. the number of solutions a particular implementation can evaluate per second. Table I shows the search speed of the proposed parallel approach for different numbers of parallel searches (NPS) and parallel moves (NPM). All implementations run at 250 MHz. A reference implementation without neighborhood partitioning, i.e. NPM = 1, is employed for comparison. The comparison is fair as all implementations use similar hardware resources. Although fewer parallel searches are launched in order to implement more parallel moves, the overall search speed is increased. This is because the hardware resources freed by reducing NPS are better utilized to implement more parallel moves using the proposed neighborhood partitioning technique. For example, when NPS is reduced from 27 to 24, an 11% reduction, NPM is increased from 8 to 16, a 100% increment. Figure 3 shows the speedup of the proposed approach over the reference implementation. A maximum speedup of 13.3 times is achieved at NPS = 8 and NPM = 64.

TABLE I. SEARCH SPEED COMPARISON OF DIFFERENT FPGA IMPLEMENTATIONS.

NPS          37     37     30     27     24     13     8
NPM          1      2      4      8      16     32     64
NSPS (10^6)  57.8   114.8  185.0  330.8  584.5  629.2  769.6
SLICE (%)    91     98     94     91     98     98     88

Fig. 2. Circuit diagram for a parallel approach with 4 parallel searches and 4 parallel moves.
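The speedup figures quoted above can be reproduced directly from the NSPS row of Table I. The short script below only restates the table values; the variable names are chosen for this illustration.

```python
# Values copied from Table I (NSPS in units of 10^6 solutions per second).
nps = [37, 37, 30, 27, 24, 13, 8]
npm = [1, 2, 4, 8, 16, 32, 64]
nsps = [57.8, 114.8, 185.0, 330.8, 584.5, 629.2, 769.6]

# Speedup of each configuration over the reference design (NPS = 37, NPM = 1).
speedups = [round(x / nsps[0], 1) for x in nsps]

# Number of solutions explored per iteration: searches times moves.
products = [s * m for s, m in zip(nps, npm)]

for s, m, sp, pr in zip(nps, npm, speedups, products):
    print(f"NPS={s:2d} NPM={m:2d} NPS*NPM={pr:3d} speedup={sp}")
```

The last configuration (NPS = 8, NPM = 64) has both the largest product, 512, and the largest speedup.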

B. Solution quality improvement

Figure 4 illustrates the solution quality improvement of the proposed approach over the reference design with NPS = 37 and NPM = 1. A solution quality (SQ) coefficient is calculated as:

\[
\text{Solution quality} = \frac{\mathrm{cost}_{opt}}{\mathrm{cost}_{achieved}}
\]

where cost_opt is the cost of the optimal solution reported in QAPLIB and cost_achieved is the cost of the achieved solution. The solution quality improvement of an approach A over another approach B is thus calculated as:

\[
\text{SQ improvement} = \frac{\mathrm{SQ}_A - \mathrm{SQ}_B}{\mathrm{SQ}_B} \times 100\%.
\]
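The two quality metrics translate into the following helper functions. The function names and the numbers used in the usage note are made up for illustration; they are not results from the paper.

```python
def solution_quality(cost_opt, cost_achieved):
    """SQ coefficient: 1.0 means the optimal cost was reached, lower is worse."""
    return cost_opt / cost_achieved


def sq_improvement(sq_a, sq_b):
    """Relative improvement of approach A over approach B, in percent."""
    return (sq_a - sq_b) / sq_b * 100.0
```

For a hypothetical instance with optimal cost 100, an approach A that reaches cost 110 and an approach B that reaches cost 125 give SQ values of about 0.909 and 0.8, i.e. an improvement of roughly 13.6%.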

The results in Figure 4 are obtained by running each implementation for 0.01n^2 iterations. The solution quality improvement increases with the product of NPS and NPM, as a larger product means more solutions can be explored per unit time. A maximum improvement of 11.9% is achieved at NPS = 8 and NPM = 64, which has the largest product.

C. Comparison with GPU

The performance of the proposed FPGA implementation is then compared with the GPU implementation reported in [7], which employs an Nvidia Tesla C2050 GPU card. Table II shows the search speed improvement of the proposed FPGA approach. The GPU implementation achieves a search speed of 38.1×10^6 NSPS. The results show that all FPGA implementations achieve a higher search speed than the GPU approach. A maximum speedup of 20.2 times is achieved.
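The FPGA-over-GPU speedups in Table II are the ratios of the measured NSPS values. The snippet below recomputes them from the figures given in the text and tables; the dictionary layout is an arbitrary choice for this illustration.

```python
gpu_nsps = 38.1  # GPU search speed from [7], in 10^6 solutions per second

# (NPS, NPM) -> FPGA search speed in 10^6 solutions per second (Table II).
fpga_nsps = {(30, 4): 185.0, (24, 16): 584.5, (8, 64): 769.6}

gpu_speedup = {cfg: round(v / gpu_nsps, 1) for cfg, v in fpga_nsps.items()}
print(gpu_speedup)
```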

V. CONCLUSIONS

A generic parallel neighborhood search with neighborhood partitioning is proposed. The approach is applied to parallelize a tabu search for solving the quadratic assignment problem. Experimental results show that the proposed techniques increase the search speed by 13.3 times and improve solution quality by 11.9%. Moreover, the proposed FPGA approach can search solutions 20.2 times faster than a GPU implementation. Current and future work involves extending the proposed approach to other neighborhood search algorithms and combinatorial optimization problems.

ACKNOWLEDGEMENT

The research leading to these results has received funding from the Macao Science and Technology Development Fund under grant number 058/2010/A.

REFERENCES

[1] M. Pirlot, "General local search methods," European Journal of Operational Research, vol. 92, no. 3, pages 493 – 511, 1996.

[2] T. V. Luong et al., "A GPU-based iterated tabu search for solving the quadratic 3-dimensional assignment problem," IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pages 1 – 8, 2010.

[3] S. Wakabayashi, K. Yoshihiro, and S. Nagayama, "FPGA implementation of Tabu search for the quadratic assignment problem," IEEE International Conference on Field Programmable Technology (FPT), pages 269 – 272, 2006.

[4] I. Mavroidis, I. Papaefstathiou, and D. Pnevmatikatos, "A fast FPGA-based 2-Opt solver for small-scale euclidean traveling salesman problem," IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 13 – 22, 2007.

[5] E. L. Lawler, "The quadratic assignment problem," Management Science, vol. 9, no. 4, pages 586 – 599, 1963.

[6] R. Burkard, S. Karisch, and F. Rendl, "QAPLIB - A Quadratic Assignment Problem Library," Journal of Global Optimization, vol. 10, no. 4, pages 391 – 403, 1997.

[7] P. Wang, Y. M. Lam, K. H. Tsoi, and W. Luk, "Solving Combinatorial Optimisation Problems on Many-core Platforms," Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures (PARMA), 2013.

TABLE II. SEARCH SPEED COMPARISON BETWEEN THE PROPOSED FPGA APPROACH AND A GPU IMPLEMENTATION.

GPU NSPS (10^6)   FPGA implementation   FPGA NSPS (10^6)   FPGA speedup (times)
38.1              NPS 30, NPM 4         185.0              4.9
                  NPS 24, NPM 16        584.5              15.3
                  NPS 8, NPM 64         769.6              20.2

Fig. 3. Search speed improvement of the proposed approach.

Fig. 4. Solution quality improvement of the proposed approach.
