Dynamic load-balancing on
multi-FPGA systems
a case study
Volodymyr Kindratenko
Innovative Systems Lab (ISL)
National Center for Supercomputing Applications (NCSA)
Robert Brunner and Adam Myers
Department of Astronomy
University of Illinois at Urbana-Champaign (UIUC)
SRC-6 Reconfigurable Computer
[Architecture diagram: dual-Xeon PCI-X host (2.8 GHz, 1 GB memory) with SNAP™ interface (1.4 GB/s sustained payload); SRC Hi-Bar™ 4-port switch (2400 MB/s per port) linking the host, Common Memory, and the MAP® Series C and Series E processors; each MAP has a Control FPGA, two User FPGAs, six on-board memory (OBM) banks A–F, and dual-ported memory; programmed with Carte™ 2.2.]
Angular Correlation Function
• TPACF, denoted as ω(θ), is the frequency distribution of angular separations θ between celestial objects in the interval (θ, θ + δθ)
– θ is the angular distance between two points
• Blue points (random data) are, on average, randomly distributed; red points (observed data) are clustered
– Blue points: ω(θ) = 0
– Red points: ω(θ) > 0
• ω(θ) can vary as a function of angular distance, θ (yellow circles)
– Blue: ω(θ) = 0 on all scales
– Red: ω(θ) is larger on smaller scales
Image source: http://astro.berkeley.edu/~mwhite/
The Method
• The angular correlation function is calculated using the estimator derived by Landy & Szalay (1993):

ω(θ) = ( n_R · 2·DD(θ) − Σ_{i=1..n_R} DR_i(θ) ) / Σ_{i=1..n_R} RR_i(θ) + 1

• where DD(θ) and RR_i(θ) are the autocorrelation functions of the data and of the i-th set of random points, respectively, DR_i(θ) is the cross-correlation between the data and the i-th set of random points, and n_R is the number of random point sets. This is the form computed in the serial code below, where DRS and RRS accumulate the per-random-set counts.
Serial Code Organization
// pre-compute bin boundaries, binb
// compute DD
doCompute{CPU|MAP}(data, npd, data, npd, 1, DD, binb, nbins);
// loop through random data files
for (i = 0; i < random_count; i++)
{
// compute RR
doCompute{CPU|MAP}(random[i], npr[i], random[i], npr[i], 1, RRS, binb, nbins);
// compute DR
doCompute{CPU|MAP}(data, npd, random[i], npr[i], 0, DRS, binb, nbins);
}
// compute w
for (k = 0; k < nbins; k++)
{
w[k] = (random_count * 2*DD[k] - DRS[k]) / RRS[k] + 1.0;
}
Reference C Kernel Implementation
for (i = 0; i < ((autoCorrelation) ? n1-1 : n1); i++)
{
  double xi = data1[i].x, yi = data1[i].y, zi = data1[i].z;
  for (j = ((autoCorrelation) ? i+1 : 0); j < n2; j++)
  {
    // dot product of the two unit vectors = cosine of their angular separation
    double dot = xi * data2[j].x + yi * data2[j].y + zi * data2[j].z;
    register int k, min = 0, max = nbins;
    if (dot >= binb[min]) data_bins[min] += 1;        // closer than the smallest bin boundary
    else if (dot < binb[max]) data_bins[max+1] += 1;  // farther than the largest bin boundary
    // run binary search
    else {
      while (max > min+1)
      {
        k = (min + max) / 2;
        if (dot >= binb[k]) max = k;
        else min = k;
      }
      data_bins[max] += 1;
    }
  }
}
[Diagram: kernel datapath evaluating a pair of points pi, pj against bin boundaries q0–q5.]
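The kernel bins each pair by the dot product of their unit vectors (the cosine of the angular separation), so binb[] must hold the bin-boundary cosines in descending order: a larger dot product means a smaller angle. A minimal sketch of how such boundaries might be precomputed is below; the logarithmic binning and every name other than binb and nbins are assumptions, not taken from the slides.

#include <math.h>

/* Hypothetical precomputation of bin boundaries for the kernel above.
 * Boundaries are stored as cos(theta), so they decrease as k grows and
 * the kernel's "dot >= binb[k]" tests select progressively smaller angles.
 * Logarithmic spacing between min_arcmin and max_arcmin is an assumption. */
void compute_binb(double *binb, int nbins, double min_arcmin, double max_arcmin)
{
    const double pi = 3.14159265358979323846;
    double log_min  = log10(min_arcmin);
    double log_step = (log10(max_arcmin) - log_min) / nbins;

    for (int k = 0; k <= nbins; k++) {
        double arcmin = pow(10.0, log_min + k * log_step);  /* boundary angle */
        binb[k] = cos(arcmin / 60.0 * pi / 180.0);          /* store its cosine */
    }
}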
OpenMP Implementation
[Diagram: host µP connected through the Hi-Bar switch to MAP C and MAP E.]
for (i = 0; i < random_count; i++) {
  #pragma omp parallel sections
  {
    #pragma omp section
    doComputeMAP1(…, mapC);
    #pragma omp section
    doComputeMAP2(…, mapE);
  }
}
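Filled in against the serial call signature shown earlier, the loop body could look roughly as follows; doComputeMAP and its trailing MAP-selection argument are assumptions standing in for the slides' doComputeMAP1/doComputeMAP2, and which computation lands on which MAP is illustrative.

/* Sketch only: per random file i, run RR on one MAP and DR on the other
 * concurrently, mirroring the serial doCompute{CPU|MAP} argument list. */
for (i = 0; i < random_count; i++) {
    #pragma omp parallel sections
    {
        #pragma omp section  /* RR(theta): random file i against itself */
        doComputeMAP(random[i], npr[i], random[i], npr[i], 1, RRS, binb, nbins, mapC);

        #pragma omp section  /* DR(theta): data against random file i */
        doComputeMAP(data, npd, random[i], npr[i], 0, DRS, binb, nbins, mapE);
    }
}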
[Chart: CPU and MAP execution times (seconds, log scale) and MAP-over-CPU speedup vs. dataset size (5,000–85,000 points); speedups of 79.2x, 82.8x, 84.9x, 86.7x, 88.1x, 89.1x, 89.6x, 89.7x, and 89.3x.]
[Chart: MAP C and MAP E execution times (seconds) vs. number of points in the dataset (x10,000); the MAP C processor is idle 18% of the time.]
Simplified Performance Model
• Analysis of a data/random file with 100 data points
each
– Autocorrelation between the points in the random data file requires
100*(100-1)/2=4,950 steps
– Cross-correlation between the observed data and random data
requires 100*100=10,000 steps
– MAP Series C processor is idle about 50% of the time! (see the arithmetic below)
[Timeline: MAP C runs the autocorrelation (AC, 4,950 steps) while MAP E runs the cross-correlation (CC, 10,000 steps); MAP C finishes early and waits.]
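The idle fraction follows directly from the two step counts: the MAP running the 4,950-step autocorrelation waits while the other finishes the 10,000-step cross-correlation.

$$\frac{10{,}000 - 4{,}950}{10{,}000} = 0.505 \approx 50\%$$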
Consider Data Partitioning…
Dataset A: 100 points, split into segments A1, A2, A3
Dataset B: 100 points, split into segments B1, B2, B3
Autocorrelation jobs (A against itself): A1-A1 (ac), A1-A2 (cc), A1-A3 (cc), A2-A2 (ac), A2-A3 (cc), A3-A3 (ac)
Cross-correlation jobs (A against B): A1-B1, A1-B2, A1-B3, A2-B1, A2-B2, A2-B3, A3-B1, A3-B2, A3-B3 (all cc)
The jobs are dispatched to MAP C and MAP E.
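A minimal sketch of how this job list could be enumerated for S segments per dataset is shown below; the slides do not include this code, and the type and function names are illustrative.

/* Hypothetical enumeration of correlation jobs between segmented datasets.
 * For S = 3 this yields the 6 autocorrelation-side jobs and the 9
 * cross-correlation jobs shown in the diagram above. */
typedef enum { AC, CC } job_kind;
typedef struct { int i, j; job_kind kind; } job_t;

int make_jobs(job_t *jobs, int S, int same_dataset)
{
    int k = 0;
    if (same_dataset) {
        /* A-A: Ai-Ai pairs are true autocorrelations, Ai-Aj (i < j) are cross terms */
        for (int i = 0; i < S; i++)
            for (int j = i; j < S; j++)
                jobs[k++] = (job_t){ i, j, (i == j) ? AC : CC };
    } else {
        /* A-B: every segment pair is a cross-correlation */
        for (int i = 0; i < S; i++)
            for (int j = 0; j < S; j++)
                jobs[k++] = (job_t){ i, j, CC };
    }
    return k; /* number of jobs written */
}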
Consider Data Partitioning…
• Analysis of a data/random file with 100 data points
each
– Each data file is divided into 3 equally sized segments
– Autocorrelation is computed first, followed by the cross-correlation
– Each MAP processor is invoked with the first available
unprocessed pair of segments
– MAP Series C processor is idle about 7% of the time! (see the step counts worked out below)
[Timeline: the fifteen jobs (autocorrelations of 528, 528, and 561 steps; cross-correlations of 1,089 or 1,122 steps) are interleaved across MAP C and MAP E, giving the two processors nearly equal amounts of work.]
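The step counts in the timeline are consistent with splitting each 100-point file into segments of 34, 33, and 33 points (an inference from the numbers shown, not stated on the slides):

$$\frac{33 \cdot 32}{2} = 528, \qquad \frac{34 \cdot 33}{2} = 561, \qquad 33 \cdot 33 = 1{,}089, \qquad 34 \cdot 33 = 1{,}122$$

The six A-A jobs still sum to 4,950 steps, but no single job exceeds roughly 1,150 steps, so dispatching each job to whichever MAP is free keeps both processors busy until the tail of the schedule, which is where the ~7% idle figure comes from.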
Job Scheduler
[Diagram: the scheduler turns pairs of dataset 1 / dataset 2 segments into `jobs` and dispatches them to the `workers`, MAP C and MAP E.]
for each pair of d1/d2 segments, pij
    for each MAP processor, m
        if m is free
            assign pij to m
            break
        endif
    endfor
endfor
Job Scheduler Implementation
do {
for (k = 0; k < K; k++) { // loop thru all the jobs
if (job[k].status == running) continue; // let it run
if (job[k].status == done) continue; // nothing to do anymore
if (job[k].status == finished) { // need to get results back
struct my_thread_data *mytd;
pthread_join(job[k].thread, (void **)&mytd); // join the thread and retrieve its result structure
for (i = 0; i < nbins+2; i++) res[i] += mytd->res[i]; // merge the per-job histogram into the global results
job[k].status = done; // set status to done
TOTAL++; // count number of fully executed jobs
continue;
}
for (t = NPROCS-1; t >= 0; t--) { // is there a free MAP to run this job?
if (thread_stat[t] == busy) continue; // thread is busy
if (self && i == j && t == 1) continue; // not a suitable thread for 'self'
struct my_thread_data *mytd = (struct my_thread_data *)malloc(sizeof(struct my_thread_data));
pthread_create(&(job[k].thread), NULL, my_map_proc, (void *)mytd);
thread_stat[t] = busy; // lock it
job[k].status = running; // set status to running
break; // no need to check the rest of the MAPs
}
}
usleep(1000);
} while (TOTAL != K);
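The listing relies on a job table, a per-thread result structure, and a worker routine that the slides do not show. A minimal sketch of what they might look like follows; apart from the names visible in the listing (my_thread_data, res, thread, status, my_map_proc), the fields and constants are assumptions.

#include <pthread.h>

#define NBINS 32                 /* illustrative; the real nbins is a program parameter */

enum job_status  { waiting, running, finished, done };  /* 'finished': kernel done, results not yet merged */
enum proc_status { freeproc, busy };

struct my_thread_data {
    int map_id;                  /* which MAP processor this job runs on */
    /* descriptors of the two segments to correlate would go here */
    long long res[NBINS + 2];    /* per-job histogram merged by the scheduler loop */
};

struct job {
    pthread_t thread;            /* worker thread driving one MAP call */
    enum job_status status;
    /* segment indices and the ac/cc flag would go here */
};

/* Worker: runs one MAP computation, then returns its result structure so the
 * scheduler's pthread_join() can pick it up; in the full code it would also
 * mark the job 'finished' and release its thread_stat slot. */
void *my_map_proc(void *arg)
{
    struct my_thread_data *mytd = (struct my_thread_data *)arg;
    /* doComputeMAP(..., mytd->res, ..., mytd->map_id);  -- actual MAP call elided */
    return mytd;
}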
Load-balanced Implementation
[Diagram: host µP connected through the Hi-Bar switch to MAP C and MAP E.]
for (i = 0; i < random_count; i++) {
  JobScheduler(data, random);
  JobScheduler(random, random);
}
[Chart: MAP C and MAP E execution times (seconds) vs. number of points in the dataset (x10,000); the MAP E processor is idle less than 1% of the time.]
[Chart: CPU and MAP execution times (seconds, log scale) and MAP-over-CPU speedup vs. dataset size (5,000–85,000 points); speedups of 46.5x, 79.5x, 84.8x, 88.9x, 92.8x, 94.7x, 95.8x, 96.4x, and 96.2x.]
Conclusions
• Pros
– A 9% performance improvement due to better utilization of otherwise idle resources
– Near-identical load on each of the MAPs
– A scalable solution that allows mixing compute subroutines with different performance characteristics
• Cons
– A performance hit for smaller datasets due to the overhead of calling the MAP processors
– More complex execution flow and data management
Acknowledgements
• This work is funded by NASA Applied Information
Systems Research (AISR) award number NNG06GH15G
– Prof. Robert Brunner and Dr. Adam Myers from UIUC Department
of Astronomy
• NCSA Collaborators
– Dr. Rob Pennington, Dr. Craig Steffen, David Raila, Michael
Showerman, Jeremy Enos, John Larson, David Meixner, Ken
Sartain
• SRC Computers, Inc.
– David Caliga, Dr. Jeff Hammes, Dan Poznanovic, David Pointer,
Jon Huppenthal