Multilevel Parallelization Architecture of Boundary Element Solver Engines for
Cloud Computing
Dipanjan Gope, Don MacMillen, Devan Williams, Xiren Wang
Proprietary & Confidential 2
Outline
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: A compute-intensive application
Results and discussion
The Need For Parallelization
The commodity computing environment underwent a paradigm shift in the last seven years: power-hungry frequency scaling was replaced by more efficient scaling in the number of computing cores.
“No free lunch” for software applications
Parallelization: The Easy Way
[Diagram: a shared-memory multicore machine (ALUs sharing a cache) running one Tool.exe instance per independent test case]
“Embarrassingly parallel” class of problems - suitable for applications with low memory requirements
What about memory-intensive applications? - Large SPICE runs, extraction
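The per-test-case fan-out on this slide can be mimicked with a process pool; a minimal Python sketch, where `run_tool` is a hypothetical stand-in for launching one Tool.exe run (a real harness would use `subprocess.run`):

```python
# Sketch of the "embarrassingly parallel" regression flow: each test
# case is an independent run, so a process pool scales with core count.
from multiprocessing import Pool

def run_tool(case: int) -> int:
    # Stand-in for running Tool.exe on one test case; returns a
    # pretend "result" so the example is self-contained.
    return case * case

if __name__ == "__main__":
    cases = [1, 2, 3, 4]
    with Pool(processes=4) as pool:
        results = pool.map(run_tool, cases)  # one case per worker
    print(results)  # [1, 4, 9, 16]
```

Each worker holds its own copy of the tool's state, which is exactly why this scheme breaks down for memory-intensive applications.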
Parallelization Paradigms
Multithreading
MPI distribution
CPU-GPU (FPGA)
Benefits of Effective Parallelization
Algorithm in which 100% can be efficiently parallelized:
Compute Platform | Time | Utility Computing Cost
1 Compute Unit | 100 hours | $x
100 Compute Units | 1 hour | $x

Algorithm in which only 50% can be efficiently parallelized:
Compute Platform | Time | Utility Computing Cost
1 Compute Unit | 100 hours | $x
100 Compute Units | 50 hours | $50x
Outline
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: A compute-intensive application
Results and discussion
Amdahl’s Law
P = fraction of the algorithm that can be parallelized
(1 - P) = serial content of the algorithm
Parallel performance of an algorithm is severely affected by any serial content
Scaling = 1 / ((1 - P) + P/n)
Amdahl’s Law: Cloud Perspective
Speed Up = 1 / ((1 - P) + P/n)
Cost Multiplier = n(1 - P) + P
Power of perfect scaling: Commodity computing cost does not increase even when solving 100x faster
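The two formulas can be checked numerically; a small sketch (function names are illustrative), which also shows the cloud-cost point on this slide:

```python
# Amdahl's law: speedup and cost multiplier on n processors when a
# fraction P of the algorithm parallelizes perfectly.
def speedup(P: float, n: int) -> float:
    return 1.0 / ((1.0 - P) + P / n)

def cost_multiplier(P: float, n: int) -> float:
    # n processors are paid for over the (shorter) parallel runtime:
    # cost scales as n * ((1 - P) + P/n) = n*(1 - P) + P.
    return n * (1.0 - P) + P

print(round(speedup(1.0, 100), 1))        # 100.0 (ideal scaling)
print(round(cost_multiplier(1.0, 100)))   # 1     (same cloud cost, 100x faster)
print(round(speedup(0.99, 100), 1))       # 50.3
print(round(cost_multiplier(0.99, 100)))  # 2
```

Even 1% serial content halves the 100-processor speedup and doubles the utility-computing bill.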
[Plots: Speed Up and Cost Multiplier vs. Number of Processors (0-100) for P = 0.9, 0.99, 0.999, and the ideal P = 1]
Outline
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: A compute-intensive application
Results and discussion
Application: Large scale problems
Signal Integrity
Power Integrity
EMI
Large Scale Package-Board Analysis
Boundary Element Field Solver
- Surface is discretized into patches (pulse basis functions)
- Method of Moments matrix entries: Z_{i,j} = ∫_S ∫_S f_i(r) G(r, r') f_j(r') ds' ds
- Result: a dense Method of Moments matrix
Scheme | Setup | Solve | Time | Memory
Conventional BEM | O(N^2) | O(N^3) | 2 days | 360 GB
Fast BEM | O(N) | O(N) | 20 minutes | 4 GB

N = number of basis functions (here 150,000)
Large problems: N ~ 1 million
Fast Iterative Matrix Solution
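The 360 GB memory figure for conventional BEM can be sanity-checked, assuming a dense N x N double-precision complex matrix (16 bytes per entry) at N = 150,000:

```python
# Back-of-envelope check of the dense-BEM memory figure above.
N = 150_000
bytes_per_entry = 16  # one double-precision complex number
total_gb = N * N * bytes_per_entry / 1e9
print(f"{total_gb:.0f} GB")  # 360 GB, matching the table
```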
FMM: Fast Solver Physics (1992)
Gravitational Image: Courtesy ParticleMan
Astronomy Equivalent
Interaction Shell
Multi-level Algorithm
FMM: Fast Solver Algorithm (1992)
Up Tree: Q2M (charge to multipole), M2M (multipole to multipole), applied from the finest level upward
Across Tree: M2L (multipole to local) between boxes in each level's interaction shell
Down Tree: L2L (local to local), L2P (local to points), applied back down to the finest level
Sequential algorithm: the level-by-level dependence impedes effective parallelization
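The phase ordering, and the level-to-level dependence that makes it sequential, can be sketched on a toy two-level 1-D tree; the translation operators are replaced by plain sums, so only the data flow (not the physics) is modeled, and all names are illustrative:

```python
# Toy skeleton of the FMM phase ordering: Q2M -> M2M -> M2L -> L2L -> L2P.
charges = [1.0, 2.0, 3.0, 4.0]           # one charge per finest-level box

# Up tree: Q2M (charges -> leaf multipoles), then M2M (children -> parent).
leaf_M = list(charges)                    # Q2M: identity in this toy
parent_M = [leaf_M[0] + leaf_M[1], leaf_M[2] + leaf_M[3]]  # M2M

# Across tree: M2L (each parent box receives the other's multipole).
parent_L = [parent_M[1], parent_M[0]]

# Down tree: L2L (parent locals -> leaves), then L2P (evaluate at points).
leaf_L = [parent_L[0], parent_L[0], parent_L[1], parent_L[1]]  # L2L
potentials = leaf_L                       # L2P: identity in this toy

print(potentials)  # [7.0, 7.0, 3.0, 3.0]: each leaf sees the far half's charge
```

Each phase reads results produced by the previous one, so the phases themselves cannot be overlapped; parallelism must be found within a phase, across boxes of one level.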
SVD/QR Based Compression
A dense MoM matrix sub-block Zsub of size m x n is factored as Zsub = Q R, with Q of size m x r and R of size r x n
SVD/QR decomposition saving: storage drops from m*n to (m + n)*r entries, a saving whenever r << min(m, n)
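A minimal numerical sketch of the compression step, using numpy's SVD in place of the solver's SVD/QR routine (sizes and the rank tolerance are illustrative):

```python
# Low-rank compression of a far-field sub-block: an m x n block with
# numerical rank r stores (m + n)*r entries instead of m*n.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 100, 5
# Smooth far-field interactions give low-rank blocks; build one directly.
Zsub = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U, s, Vt = np.linalg.svd(Zsub, full_matrices=False)
rank = int(np.sum(s > 1e-10 * s[0]))      # numerical rank
Q = U[:, :rank] * s[:rank]                # m x rank factor
R = Vt[:rank, :]                          # rank x n factor

print(rank)                               # 5
print(np.allclose(Q @ R, Zsub))           # True: lossless to tolerance
print(m * n, (m + n) * rank)              # 20000 1500 (dense vs compressed)
```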
Predetermined Matrix Structure
Pre-determined matrix structure: sub-matrix ranks are pre-estimated to reasonable accuracy
OpenMP: Matrix Setup and Solve
Predetermined matrix structure given meshed geometry
Static load-balancing
Correction by dynamic scheduling
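Because per-block costs are known in advance from the pre-estimated ranks, the static assignment can be computed before the loop runs; a sketch of one common greedy scheme (largest block to the least-loaded thread), with illustrative costs, is below. The residual imbalance from estimation error is what the dynamic-scheduling correction absorbs at run time.

```python
# Static load balancing from pre-estimated per-block costs.
import heapq

def static_schedule(costs, n_threads):
    # Deal blocks largest-first to the currently least-loaded thread.
    heap = [(0.0, t) for t in range(n_threads)]    # (load, thread id)
    assignment = {t: [] for t in range(n_threads)}
    for block, cost in sorted(enumerate(costs), key=lambda x: -x[1]):
        load, t = heapq.heappop(heap)
        assignment[t].append(block)
        heapq.heappush(heap, (load + cost, t))
    return assignment

# Pre-estimated setup costs for 8 sub-blocks (illustrative numbers).
costs = [9, 7, 6, 5, 4, 3, 2, 1]
plan = static_schedule(costs, n_threads=2)
loads = sorted(sum(costs[b] for b in blocks) for blocks in plan.values())
print(loads)  # [18, 19]: near-even loads from the pre-estimated costs
```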
MPI: Matrix Setup and Solve
MPI processes Data Comm.
OpenMP Master Threads
OpenMP Worker Threads
Start
Read Geometry, Mesh
Setup
Solve
All nodes read the geometry and mesh
The combined DRAM of all nodes is used to store the whole matrix, increasing capacity
Solve: vectors are broadcast for matrix-vector operations
SGE: Parallel Frequency Distribution
Multiple cores on a node use shared memory to store and compute the matrix for one frequency
Matrices for different frequencies reside in separate nodes' memory
This exposes an embarrassingly parallel level without wasting shared memory
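The outer frequency loop needs no inter-node communication, so nodes can simply take disjoint subsets of frequency points; a sketch of a round-robin split (the mapping from SGE array-task IDs to frequencies is an assumption, and the frequency values are illustrative):

```python
# Embarrassingly parallel outer loop: each cluster node solves its own
# subset of frequency points with node-local shared memory.
def split_frequencies(freqs, n_nodes):
    # Round-robin: node i takes every n_nodes-th frequency.
    return [freqs[i::n_nodes] for i in range(n_nodes)]

freqs_ghz = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
per_node = split_frequencies(freqs_ghz, n_nodes=3)
print(per_node)
# [[0.5, 2.0, 3.5], [1.0, 2.5, 4.0], [1.5, 3.0]]
```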
Hybrid Parallel Framework
Several layers of parallelism, including embarrassingly parallel layers, are employed in a hybrid framework to work around Amdahl's law and achieve high scalability
Global controller decides the number and type of compute instances to be used at each level
File R/W for SGE based parallel layers is performed in binary
Outline
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: A compute-intensive application
Results and discussion
Matrix Setup Analysis
Predetermined matrix structure and static-dynamic load balancing achieve close-to-ideal speedups
Lack of dynamic scheduling across nodes limits the MPI speedup
Matrix Solve Analysis
For a large number of right-hand sides, SGE-based RHS distribution yields better scalability
A map-reduce step after every matrix-vector product reduces parallel efficiency
Performance on Full-Board Analysis
[Plots: speedup vs. number of processors (1, 2, 4, 8) for 12K- and 29K-unknown cases against ideal; setup, solve, and total time vs. number of cores (1-8)]
Performance on Amazon EC2
Compute Platform | Time Taken | Speed-up | Effective Compute Cost
2 cores (m1.large) | 100 minutes | 1 | $0.60
18 machines with 8 cores (c1.xlarge) | 2 minutes | 50 | $0.40

System LSI package and an early design board, merged; meshed with 20,000 basis functions
Broadband frequency simulation for EMI prediction
EMI from Cell-Phone Package
Summary
The future of compute- and memory-intensive algorithms is “parallel”
Revisit algorithms from the perspective of multi-processor efficiency rather than single-processor complexity
Even a small serial content can severely limit the speedup of an algorithm
Employ multiple levels of parallelism to work around Amdahl's law
The presented algorithm achieves around 6.5x speedup on a single 8-core machine and around 50x speedup in a cloud framework using 72 (18x4) cores