Dynamic load-balancing on
multi-FPGA systems
a case study
Volodymyr Kindratenko
Innovative Systems Lab (ISL)
National Center for Supercomputing Applications (NCSA)
Robert Brunner and Adam Myers
Department of Astronomy
University of Illinois at Urbana-Champaign (UIUC)
SRC-6 Reconfigurable Computer
[Architecture diagram: dual-Xeon PCI-X host (2.8 GHz, 1 GB memory) with SNAP™ interface (1.4 GB/s sustained payload); SRC Hi-Bar™ 4-port switch (2400 MB/s per port) linking the host, Common Memory, and the MAP® Series C and Series E processors; each MAP has a Control FPGA, two User FPGAs, six on-board memory (OBM) banks A–F, and dual-ported memory; programmed with Carte™ 2.2.]
Angular Correlation Function
• TPACF, denoted as ω(θ), is the frequency distribution of angular separations θ between celestial objects in the interval (θ, θ + δθ)
– θ is the angular distance between two points
• Blue points (random data) are, on average, randomly distributed; red points (observed data) are clustered
– Blue points: ω(θ) = 0
– Red points: ω(θ) > 0
• ω(θ) can vary as a function of angular distance, θ (yellow circles)
– Blue: ω(θ) = 0 on all scales
– Red: ω(θ) is larger on smaller scales
Image source: http://astro.berkeley.edu/~mwhite/
The Method
• The angular correlation function is calculated using the estimator derived by Landy & Szalay (1993):

ω(θ) = ( n_R · 2·DD(θ) − Σ_{i=1..n_R} DR_i(θ) ) / Σ_{i=1..n_R} RR_i(θ) + 1

• where DD(θ) and RR_i(θ) are the autocorrelation functions of the data and of the i-th set of random points, respectively, DR_i(θ) is the cross-correlation between the data and the i-th set of random points, and n_R is the number of random point sets. This is the form computed in the serial code below, where DRS and RRS accumulate the per-random-set counts.
Serial Code Organization
// pre-compute bin boundaries, binb
// compute DD
doCompute{CPU|MAP}(data, npd, data, npd, 1, DD, binb, nbins);
// loop through random data files
for (i = 0; i < random_count; i++)
{
// compute RR
doCompute{CPU|MAP}(random[i], npr[i], random[i], npr[i], 1, RRS, binb, nbins);
// compute DR
doCompute{CPU|MAP}(data, npd, random[i], npr[i], 0, DRS, binb, nbins);
}
// compute w
for (k = 0; k < nbins; k++)
{
w[k] = (random_count * 2*DD[k] - DRS[k]) / RRS[k] + 1.0;
}
Reference C Kernel Implementation
for (i = 0; i < ((autoCorrelation) ? n1-1 : n1); i++)
{
  double xi = data1[i].x, yi = data1[i].y, zi = data1[i].z;
  for (j = ((autoCorrelation) ? i+1 : 0); j < n2; j++)
  {
    // dot product of the two unit vectors = cosine of their angular separation
    double dot = xi * data2[j].x + yi * data2[j].y + zi * data2[j].z;
    register int k, min = 0, max = nbins;
    if (dot >= binb[min]) data_bins[min] += 1;        // closer than the smallest bin boundary
    else if (dot < binb[max]) data_bins[max+1] += 1;  // farther than the largest bin boundary
    // run binary search
    else {
      while (max > min+1)
      {
        k = (min + max) / 2;
        if (dot >= binb[k]) max = k;
        else min = k;
      }
      data_bins[max] += 1;
    }
  }
}
[Diagram: kernel datapath evaluating a pair of points pi, pj against bin boundaries q0–q5.]
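The kernel bins each pair by the dot product of their unit vectors (the cosine of the angular separation), so binb[] must hold the bin-boundary cosines in descending order: a larger dot product means a smaller angle. A minimal sketch of how such boundaries might be precomputed is below; the logarithmic binning and every name other than binb and nbins are assumptions, not taken from the slides.

#include <math.h>

/* Hypothetical precomputation of bin boundaries for the kernel above.
 * Boundaries are stored as cos(theta), so they decrease as k grows and
 * the kernel's "dot >= binb[k]" tests select progressively smaller angles.
 * Logarithmic spacing between min_arcmin and max_arcmin is an assumption. */
void compute_binb(double *binb, int nbins, double min_arcmin, double max_arcmin)
{
    const double pi = 3.14159265358979323846;
    double log_min  = log10(min_arcmin);
    double log_step = (log10(max_arcmin) - log_min) / nbins;

    for (int k = 0; k <= nbins; k++) {
        double arcmin = pow(10.0, log_min + k * log_step);  /* boundary angle */
        binb[k] = cos(arcmin / 60.0 * pi / 180.0);          /* store its cosine */
    }
}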
OpenMP Implementation
[Diagram: host µP connected through the Hi-Bar switch to MAP C and MAP E.]
for (i = 0; i < random_count; i++) {
  #pragma omp parallel sections
  {
    #pragma omp section
    doComputeMAP1(…, mapC);
    #pragma omp section
    doComputeMAP2(…, mapE);
  }
}
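Filled in against the serial call signature shown earlier, the loop body could look roughly as follows; doComputeMAP and its trailing MAP-selection argument are assumptions standing in for the slides' doComputeMAP1/doComputeMAP2, and which computation lands on which MAP is illustrative.

/* Sketch only: per random file i, run RR on one MAP and DR on the other
 * concurrently, mirroring the serial doCompute{CPU|MAP} argument list. */
for (i = 0; i < random_count; i++) {
    #pragma omp parallel sections
    {
        #pragma omp section  /* RR(theta): random file i against itself */
        doComputeMAP(random[i], npr[i], random[i], npr[i], 1, RRS, binb, nbins, mapC);

        #pragma omp section  /* DR(theta): data against random file i */
        doComputeMAP(data, npd, random[i], npr[i], 0, DRS, binb, nbins, mapE);
    }
}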
[Chart: CPU and MAP execution times (seconds, log scale) and MAP-over-CPU speedup vs. dataset size (5,000–85,000 points); speedups of 79.2x, 82.8x, 84.9x, 86.7x, 88.1x, 89.1x, 89.6x, 89.7x, and 89.3x.]
[Chart: MAP C and MAP E execution times (seconds) vs. number of points in the dataset (x10,000); the MAP C processor is idle 18% of the time.]
Simplified Performance Model
• Analysis of a data/random file with 100 data points
each
– Autocorrelation between the points in the random data file requires
100*(100-1)/2=4,950 steps
– Cross-correlation between the observed data and random data
requires 100*100=10,000 steps
– MAP Series C processor is idle about 50% of the time! (see the arithmetic below)
[Timeline: MAP C runs the autocorrelation (AC, 4,950 steps) while MAP E runs the cross-correlation (CC, 10,000 steps); MAP C finishes early and waits.]
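The idle fraction follows directly from the two step counts: the MAP running the 4,950-step autocorrelation waits while the other finishes the 10,000-step cross-correlation.

$$\frac{10{,}000 - 4{,}950}{10{,}000} = 0.505 \approx 50\%$$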
Consider Data Partitioning…
Dataset A: 100 points, split into segments A1, A2, A3
Dataset B: 100 points, split into segments B1, B2, B3
Autocorrelation jobs (A against itself): A1-A1 (ac), A1-A2 (cc), A1-A3 (cc), A2-A2 (ac), A2-A3 (cc), A3-A3 (ac)
Cross-correlation jobs (A against B): A1-B1, A1-B2, A1-B3, A2-B1, A2-B2, A2-B3, A3-B1, A3-B2, A3-B3 (all cc)
The jobs are dispatched to MAP C and MAP E.
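A minimal sketch of how this job list could be enumerated for S segments per dataset is shown below; the slides do not include this code, and the type and function names are illustrative.

/* Hypothetical enumeration of correlation jobs between segmented datasets.
 * For S = 3 this yields the 6 autocorrelation-side jobs and the 9
 * cross-correlation jobs shown in the diagram above. */
typedef enum { AC, CC } job_kind;
typedef struct { int i, j; job_kind kind; } job_t;

int make_jobs(job_t *jobs, int S, int same_dataset)
{
    int k = 0;
    if (same_dataset) {
        /* A-A: Ai-Ai pairs are true autocorrelations, Ai-Aj (i < j) are cross terms */
        for (int i = 0; i < S; i++)
            for (int j = i; j < S; j++)
                jobs[k++] = (job_t){ i, j, (i == j) ? AC : CC };
    } else {
        /* A-B: every segment pair is a cross-correlation */
        for (int i = 0; i < S; i++)
            for (int j = 0; j < S; j++)
                jobs[k++] = (job_t){ i, j, CC };
    }
    return k; /* number of jobs written */
}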
Consider Data Partitioning…
• Analysis of a data/random file with 100 data points
each
– Each data file is divided into 3 equally sized segments
– Autocorrelation is computed first, followed by the cross-correlation
– Each MAP processor is invoked with the first available
unprocessed pair of segments
– MAP Series C processor is idle about 7% of the time! (see the step counts worked out below)
[Timeline: the fifteen jobs (autocorrelations of 528, 528, and 561 steps; cross-correlations of 1,089 or 1,122 steps) are interleaved across MAP C and MAP E, giving the two processors nearly equal amounts of work.]
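The step counts in the timeline are consistent with splitting each 100-point file into segments of 34, 33, and 33 points (an inference from the numbers shown, not stated on the slides):

$$\frac{33 \cdot 32}{2} = 528, \qquad \frac{34 \cdot 33}{2} = 561, \qquad 33 \cdot 33 = 1{,}089, \qquad 34 \cdot 33 = 1{,}122$$

The six A-A jobs still sum to 4,950 steps, but no single job exceeds roughly 1,150 steps, so dispatching each job to whichever MAP is free keeps both processors busy until the tail of the schedule, which is where the ~7% idle figure comes from.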
Job Scheduler
[Diagram: the scheduler turns pairs of dataset 1 / dataset 2 segments into `jobs` and dispatches them to the `workers`, MAP C and MAP E.]
for each pair of d1/d2 segments, pij
    for each MAP processor, m
        if m is free
            assign pij to m
            break
        endif
    endfor
endfor
Job Scheduler Implementation
do {
for (k = 0; k < K; k++) { // loop thru all the jobs
if (job[k].status == running) continue; // let it run
if (job[k].status == done) continue; // nothing to do anymore
if (job[k].status == finished) { // need to get results back
struct my_thread_data *mytd;
pthread_join(job[k].thread, (void **)&mytd); // join the thread and retrieve its result structure
for (i = 0; i < nbins+2; i++) res[i] += mytd->res[i]; // merge the per-job histogram into the global results
job[k].status = done; // set status to done
TOTAL++; // count number of fully executed jobs
continue;
}
for (t = NPROCS-1; t >= 0; t--) { // is there a free MAP to run this job?
if (thread_stat[t] == busy) continue; // thread is busy
if (self && i == j && t == 1) continue; // not a suitable thread for 'self'
struct my_thread_data *mytd = (struct my_thread_data *)malloc(sizeof(struct my_thread_data));
pthread_create(&(job[k].thread), NULL, my_map_proc, (void *)mytd);
thread_stat[t] = busy; // lock it
job[k].status = running; // set status to running
break; // no need to check the rest of the MAPs
}
}
usleep(1000);
} while (TOTAL != K);
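The listing relies on a job table, a per-thread result structure, and a worker routine that the slides do not show. A minimal sketch of what they might look like follows; apart from the names visible in the listing (my_thread_data, res, thread, status, my_map_proc), the fields and constants are assumptions.

#include <pthread.h>

#define NBINS 32                 /* illustrative; the real nbins is a program parameter */

enum job_status  { waiting, running, finished, done };  /* 'finished': kernel done, results not yet merged */
enum proc_status { freeproc, busy };

struct my_thread_data {
    int map_id;                  /* which MAP processor this job runs on */
    /* descriptors of the two segments to correlate would go here */
    long long res[NBINS + 2];    /* per-job histogram merged by the scheduler loop */
};

struct job {
    pthread_t thread;            /* worker thread driving one MAP call */
    enum job_status status;
    /* segment indices and the ac/cc flag would go here */
};

/* Worker: runs one MAP computation, then returns its result structure so the
 * scheduler's pthread_join() can pick it up; in the full code it would also
 * mark the job 'finished' and release its thread_stat slot. */
void *my_map_proc(void *arg)
{
    struct my_thread_data *mytd = (struct my_thread_data *)arg;
    /* doComputeMAP(..., mytd->res, ..., mytd->map_id);  -- actual MAP call elided */
    return mytd;
}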
Load-balanced Implementation
[Diagram: host µP connected through the Hi-Bar switch to MAP C and MAP E.]
for (i = 0; i < random_count; i++) {
  JobScheduler(data, random);
  JobScheduler(random, random);
}
[Chart: MAP C and MAP E execution times (seconds) vs. number of points in the dataset (x10,000); the MAP E processor is idle less than 1% of the time.]
[Chart: CPU and MAP execution times (seconds, log scale) and MAP-over-CPU speedup vs. dataset size (5,000–85,000 points); speedups of 46.5x, 79.5x, 84.8x, 88.9x, 92.8x, 94.7x, 95.8x, 96.4x, and 96.2x.]
Conclusions
• Pros
– A 9% performance improvement due to better utilization of otherwise idle resources
– Near-identical load on each of the MAPs
– A scalable solution that allows mixing compute subroutines with different performance characteristics
• Cons
– A performance hit for smaller datasets due to the overhead of calling the MAP processors
– More complex execution flow and data management
Acknowledgements
• This work is funded by NASA Applied Information
Systems Research (AISR) award number NNG06GH15G
– Prof. Robert Brunner and Dr. Adam Myers from UIUC Department
of Astronomy
• NCSA Collaborators
– Dr. Rob Pennington, Dr. Craig Steffen, David Raila, Michael
Showerman, Jeremy Enos, John Larson, David Meixner, Ken
Sartain
• SRC Computers, Inc.
– David Caliga, Dr. Jeff Hammes, Dan Poznanovic, David Pointer,
Jon Huppenthal