29
Multilevel Parallelization Architecture of Boundary Element Solver Engines for Cloud Computing Dipanjan Gope, Don MacMillen, Devan Williams, Xiren Wang

The need for parallelization Challenges towards effective parallelization A multilevel parallelization framework for BEM: A compute intensive application

Embed Size (px)

Citation preview

Page 1: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Multilevel Parallelization Architecture of Boundary Element Solver Engines for

Cloud Computing

Dipanjan Gope, Don MacMillen, Devan Williams, Xiren Wang

Page 2: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 2

The need for parallelization

Challenges towards effective parallelization

A multilevel parallelization framework for BEM: A compute intensive application

Results and discussion

Outline

Page 3: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 3

The commodity computing environment underwent a paradigm shift in the last 7 yearsPower hungry frequency-scaling replaced by more efficient scaling-in-computing-cores

The Need For Parallelization

“No free-lunch” for software applications

5M

60M

3.4G

Page 4: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 4

Parallelization: The Easy Way

Shared Memory

ALU ALU ALUALU

Cache

TestCase 1

TestCase 2

TestCase 3

TestCase 4

TestCase ..

TestCase ..

Tool.exe

Tool.exe

TestCase 1

Tool.exe Tool.exe Tool.exe

TestCase 2 TestCase 3 TestCase 4

“Embarrassingly Parallel” Class of Problems - Suitable for applications with low memory requirement

Case 2

Case3Case

4Case

1

What about memory intensive applications ? - Large SPICE runs, Extraction

Page 5: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 5

Parallelization ParadigmsMultithreading

MPIDistribution

CPU – GPU (FPGA)

Page 6: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 6

Benefits of Effective Parallelization

Compute Platform Time Utility Computing Cost

1 Compute Unit 100 hours $x

100 Compute Units 1 hours $x

Compute Platform Time Utility Computing Cost

1 Compute Unit 100 hours $x

100 Compute Units 50 hours $50x

Algorithm in which 100% can be efficiently parallelized

Algorithm in which 50% can be efficiently parallelized

Page 7: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 7

The need for parallelization

Challenges towards effective parallelization

A multilevel parallelization framework for BEM: A compute intensive application

Results and discussion

Outline

Page 8: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 8

Amdahl’s Law

P = Fraction of the algorithm that can be parallelized(1-P) = Serial content of the algorithm

Parallel performance of an algorithm is severely affected by any serial content

Scaling = nPP )1(

1

Page 9: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 9

Amdahl’s Law: Cloud Perspective

Speed Up = nPP )1(

1

Cost Multiplier = nPPn 1

Power of perfect scaling: Commodity computing cost does not increase even when solving 100x faster

0 20 40 60 80 1000

2

4

6

8

10

12

Number of Processors

Cos

t Mul

tiplie

r

0 20 40 60 80 1000

20

40

60

80

100

Number of Processors

Spee

d U

p

P=0.9

P=0.99

P=0.999

Ideal: P=1

P=0.9

P=0.99P=0.999

Ideal: P=1

Page 10: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 10

The need for parallelization

Challenges towards effective parallelization

A multilevel parallelization framework for BEM: A compute intensive application

Results and discussion

Outline

Page 11: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Application: Large scale problems

Signal Integrity

Power Integrity

EMI

Large Scale Package-Board Analysis

Page 12: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

- Surface is discretized into patches (Basis Functions)

, , , ' 'Z r r rj i

S

j i f G f ds

Pulse

Boundary Element Field Solver

Dense Method of Moments Matrix

Page 13: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Scheme Setup Solve Time Memory

Conventional BEM O(N2) O(N3) 2 Days 360 GB

Fast BEM O(N) O(N) 20minutes 4 GB

N = Number of basis functions; (150,000)

Large problems: N ~ 1 million

Fast Iterative Matrix Solution

Page 14: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 14

FMM: Fast Solver Physics (1992)

Gravitational Image: Courtesy ParticleMan

Astronomy Equivalent

Interaction Shell

Multi-level Algorithm

Page 15: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 15

FMM: Fast Solver Algorithm (1992)

Q2M

L2LM2M M2M

M2L

M2L

M2L

Sequential Algorithm: Impedes Effective Parallelization

Fin

est

Level

Fin

est

- 1

Level

L2P

L2L

Up Tree Across Tree Down Tree

Page 16: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 16

SVD/QR Based Compression

Dense MoM Matrix

Zsub

Zsub =

SVD/QR Decomposition Saving

Q

R

( )

m n

m n r

m

n r

mn

r

Page 17: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 17

Predetermined Matrix Structure

Pre-determined matrix structure Sub-matrix ranks pre-estimated to reasonable accuracy

Page 18: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 18

OpenMP: Matrix Setup and Solve

Predetermined matrix structure given meshed geometry

Static load-balancing

Correction by dynamic scheduling

Page 19: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 19

MPI: Matrix Setup and Solve

MPI processes Data Comm.

OpenMP Master Threads

OpenMP Worker Threads

Start

Read Geometry, Mesh

Setup

Solve

All nodes read geometry and mesh

Combined DRAM of nodes are used to storethe whole matrix, increasing capacity

Solve: Vectors are broadcast for matrix- vector operations

Page 20: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 20

SGE: Parallel Frequency Distribution

Multiple cores use shared memory utilized to store andcompute single frequency matrix

Different frequency matriceson separate node memory

Reveals an embarrassingly parallel level without wasting shared memory

Page 21: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 21

Hybrid Parallel Framework

Several layers of parallelism, including embarrassingly parallel layers are employed in a hybrid framework to work around Amdahl’s law and achieve high scalability

Global controller decides the number and type of compute instances to be used at each level

File R/W for SGE based parallel layers is performed in binary

Page 22: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 22

The need for parallelization

Challenges towards effective parallelization

A multilevel parallelization framework for BEM: A compute intensive application

Results and discussion

Outline

Page 23: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 23

Matrix Setup Analysis

Predetermined matrix structure and static-dynamic load balancing achieves close to ideal speed ups Lack of dynamic scheduling across nodes affects MPI speedup

Page 24: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 24

Matrix Solve Analysis

For large number of right hand sides, SGE based RHS distribution yields better scalability

Map-reduce after every matrix vector product affects parallel efficiency

Page 25: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 25

Performance on Full-Board Analysis

0

2

4

6

8

1 2 4 8#proc

Spee

dup 12 K

29 KI deal

1 core 2 cores

4 cores

6 cores

8 cores

0

100

200

300

400

500

600

Time Vs. the Number of Core

Total Time

Solve Time

Setup Time

Page 26: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 26

Performance on Amazon EC2

Page 27: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 27

Performance on Amazon EC2

Time Taken Speed-up EffectiveCompute Cost

2 core(m1.large)

100 minutes 1 $ 0.60

18 machines with 8 cores(c1.xlarge)

2 minutes 50 $ 0.40

System LSI package and an early design board mergedMeshed with 20,000 basis functions

Broadband frequency simulation for EMI prediction

Page 28: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 28

EMI from Cell-Phone Package

Page 29: The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application

Proprietary & Confidential 29

The future of compute and memory intensive algorithms is “parallel”

Revisit algorithms from a perspective of multi-processor efficiency as opposed to single-processor complexity

Even a small serial content can adversely affect the speedup of an algorithm

Employ levels of parallelism to work-around Amdahl’s law

Presented algorithm achieves around 6.5x speedup on a single 8 core machine and around 50x speed up in a cloud framework using 72 (18x4) cores

Summary