An Exascale Workload Study


Prasanna Balaprakash (ANL), Darius Buntinas (ANL), Anthony Chan (ANL), Apala Guha (UChicago, ANL), Rinku Gupta (ANL), Sri Hari Krishna Narayanan (ANL), Andrew A. Chien (UChicago, ANL), Paul Hovland (ANL), and Boyana Norris (ANL)

PROBLEM

10x10 is a new approach towards understanding “energy efficient + performance optimized” supercomputing

Traditional 90/10 Architecture

• Amdahl’s approach: 10% of code is where 90% of time is spent
• Derive common cases across applications and optimize those
• Worked very well in the past

Limitations of 90/10 approach

• Focuses on broad architectural changes which impact big portions of a broad range of application code
• No support for upcoming customized heterogeneous architectures
• Power efficiency and performance optimization are increasingly important

The 10 x 10 APPROACH

Step 1

• Understand and measure diverse characteristics of a broad range of workloads which are of interest to DOE in the exascale era

Step 2

• Identify which top ten characteristics form dominant modes (characteristics) across the various workloads

Step 3

• Design or identify architectures/accelerators best suited for each characteristic for the exascale era. These customized accelerators address specific characteristics and can be designed to be highly energy efficient and high performance.

ACKNOWLEDGEMENTS

This work was supported by the U.S. Department of Energy Office of Science DE-AC02-06CH11357 and NSF OCI-1-57921.

APPLICATION CHARACTERIZATION

Operations types in applications

Operations types in application hotspots

Understanding memory bandwidth variations of applications with increasing input size

Apps: PETSc, Mantevo, NEK5K

Tools: PIN, HPCT, PBound

• We focus on studying diverse applications in an effort to understand “dominant modes and characteristics”
• We understand how to measure these characteristics using current technologies
• Determine whether and how these characteristics will change for increasing application sizes during the exascale era
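As a rough illustration of how an operation mix such as "Operations types in applications" can be derived, the sketch below normalizes raw per-category operation counts into fractions. The counts and the name `operation_mix` are hypothetical. They are not measurements or tool output from the poster, and an instrumentation tool such as PIN would supply the real counts.

```python
# Operation categories as they appear in the poster's figure legends.
CATEGORIES = ["Loads", "Stores", "Floating Point", "Integer", "Branches", "Other"]


def operation_mix(counts):
    """Convert raw operation counts into fractions of total operations."""
    total = sum(counts.get(cat, 0) for cat in CATEGORIES)
    if total == 0:
        raise ValueError("no operations counted")
    return {cat: counts.get(cat, 0) / total for cat in CATEGORIES}


# Hypothetical counts for one application run (illustrative only).
example_counts = {
    "Loads": 40,
    "Stores": 10,
    "Floating Point": 30,
    "Integer": 10,
    "Branches": 5,
    "Other": 5,
}

mix = operation_mix(example_counts)
dominant = max(mix, key=mix.get)  # category with the largest share
for cat in CATEGORIES:
    print(f"{cat:15s} {mix[cat]:6.1%}")
print("dominant category:", dominant)
```

Applying the same normalization to the counts collected inside a hotspot only would give the corresponding hotspot mix.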

EXASCALE EXTRAPOLATION MODELS

• We build numerous extrapolative models for various applications to understand their key characteristics on exascale machines
• Models are statistically validated for accuracy
• Key modeled characteristics: compute intensity, memory intensity, instruction mix
• Characteristics provide empirical basis for designing future exascale architectures

Exascale Machine: Projected instruction mix

Exascale Machine: Projected app runtime

[Diagram: Workloads → Dominant characteristics → Customized architectures → Overall solution]

Memory requirements projection models for applications

Application exascale projection models, where N = n1*n2*n3, and c1, c2, c3, c4 are constants:

Application              Model
Ex19, Ex30               f(n1,n2,n3) = c1 + c2*N + c3*(n1*n2) + c4*n1
Ex20                     f(n1,n2,n3) = c1 + c2*(n1*n2) + c3*(n1*n2)^2
miniFE, miniMD, HPCCG    f(n1,n2,n3) = c1 + c2*N
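The three model forms can be written down directly as functions. The sketch below is only an illustration: the poster gives no values for c1..c4, so every constant used here is a placeholder, the function names are hypothetical, and the trailing "2" in the Ex20 model is read as a square. The helper `max_n_linear` inverts the linear model to show how a resource budget could translate into a largest feasible N, which is an assumption about usage rather than something the poster states.

```python
def model_ex19_ex30(n1, n2, n3, c1, c2, c3, c4):
    """f = c1 + c2*N + c3*(n1*n2) + c4*n1, with N = n1*n2*n3."""
    n_total = n1 * n2 * n3
    return c1 + c2 * n_total + c3 * (n1 * n2) + c4 * n1


def model_ex20(n1, n2, n3, c1, c2, c3):
    """f = c1 + c2*(n1*n2) + c3*(n1*n2)^2; n3 does not appear in this form."""
    plane = n1 * n2
    return c1 + c2 * plane + c3 * plane ** 2


def model_linear(n1, n2, n3, c1, c2):
    """f = c1 + c2*N, the form listed for miniFE, miniMD and HPCCG."""
    return c1 + c2 * (n1 * n2 * n3)


def max_n_linear(budget, c1, c2):
    """Largest N for which the linear model stays within a budget."""
    if c2 <= 0:
        raise ValueError("c2 must be positive to invert the model")
    return (budget - c1) / c2


# Placeholder constants, chosen only to make the arithmetic easy to follow.
print(model_ex19_ex30(2, 3, 4, c1=1, c2=1, c3=1, c4=1))
print(model_ex20(2, 3, 4, c1=1, c2=1, c3=1))
print(model_linear(2, 3, 4, c1=1, c2=2))
print(max_n_linear(budget=49, c1=1, c2=2))
```

In practice the constants would come from fitting each form to measurements at small input sizes before extrapolating.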

[Figure: Fraction of Total Operations (0% to 100%) for Ex19, Ex20, Ex30, miniFE, miniMD, HPCCG; legend: Loads, Stores, Floating Point, Integer, Branches, Other]

[Figure: Fraction of Ops in Hotspot (0% to 100%) for Ex10 1, Ex10 2, Ex19, Ex20, Ex30 1, Ex30 2, turbChan 1, turbChan 2, miniFE 1, miniFE 2, miniFE 3, miniMD, HPCCG; legend: Loads, Stores, Floating Point, Integer, Branches, Other]

[Figure: Bandwidth (MB/s), axis ticks 1, 10, 100, 1000, for Ex10, Ex19, Ex20, Ex30, vortex, turbChan, miniFE, miniMD, HPCCG]

EVALUATING MODELS ON POTENTIAL EXASCALE ARCHITECTURES

We evaluate our models on:

• Traditional exascale architecture (traditional memory model): CPU 10 TFlops; bandwidth 1 TB/s
• Processor-under-memory (PUM): bandwidth scales to 10 TB/s due to the stacked memory die architecture

Not all applications are bandwidth limited
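One way to see why only some applications benefit from the extra bandwidth is a simple roofline-style bound. This is a technique brought in here for illustration, not the poster's stated method. It uses the figures quoted above (10 TFlops, 1 TB/s, 10 TB/s). The arithmetic intensities in flops per byte are hypothetical and the function names are made up for this sketch.

```python
PEAK_TFLOPS = 10.0        # CPU peak, from the text above
TRADITIONAL_TBS = 1.0     # traditional memory bandwidth, TB/s
PUM_TBS = 10.0            # PUM memory bandwidth, TB/s


def attainable_tflops(intensity, bandwidth_tbs, peak_tflops=PEAK_TFLOPS):
    """Roofline bound: limited by either peak compute or bandwidth * intensity.

    intensity is in flops per byte, so TB/s * flops/byte gives TFlops.
    """
    return min(peak_tflops, bandwidth_tbs * intensity)


def pum_speedup(intensity):
    """Ratio of the PUM bound to the traditional bound at a given intensity."""
    traditional = attainable_tflops(intensity, TRADITIONAL_TBS)
    pum = attainable_tflops(intensity, PUM_TBS)
    return pum / traditional


# Hypothetical intensities: low (bandwidth bound), medium, high (compute bound).
for intensity in (0.5, 2.0, 20.0):
    print(f"{intensity:5.1f} flops/byte -> PUM speedup {pum_speedup(intensity):.2f}x")
```

Under this bound, a code with at least 10 flops per byte is already compute limited at 1 TB/s and gains nothing from PUM, while a code below 1 flop per byte can gain the full factor of 10.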

App      Scaling Limit   PUM Improvement           Key Limit   Programming
         (Exascale)      App-level    Node-level   (PUM)       Change
Ex19     MemCap          2.35         2.97         MemCap      Local
Ex20     Compute         4.02         1.00         Compute     Local
Ex30     Compute         4.08         1.00         Compute     High
miniMD   Compute         6.75         1.00         Compute     Local
miniFE   MemCap          6.57         6.56         MemCap      Moderate
HPCCG    MemCap          10.00        10.00        MemCap      Local

[Figure: Fraction of Total Operations (0% to 100%) for Ex19, Ex20, Ex30, miniFE, miniMD, HPCCG; legend: Loads, Stores, Floating Point, Integer, Branches, Other]

[Figure: Number of Days (1e-10 to 1e+15) versus Input Size in G (1e+01 to 1e+04) for Ex19, Ex20, Ex30, miniFE, miniMD, HPCCG]

[Figure: Exascale Machine: Projected memory requirement. Total Memory in PB (1e-05 to 1e+05) versus Input Size in G (1e+01 to 1e+03) for Ex19, Ex20, Ex30, miniFE, miniMD, HPCCG]

EXASCALE SCALING LIMITS

Apps     Exascale Limit   Exascale Limit   Time     Mem Cap   Critical   Feasible
         (24hrs)          (100PB)                             Limit      Size
miniMD   41G              600G             27 hrs   108 PB    Compute    45GB
Ex20     92G              500G             25 hrs   98 PB     Compute    92GB
Ex30     130G             1500G            27 hrs   110 PB    Compute    130GB
Ex19     5000G            1000G            23 hrs   125 PB    Memory     1TB
miniFE   5000G            250G             23 hrs   110 PB    Memory     250GB
HPCCG    5000G            250G             23 hrs   110 PB    Memory     250GB

• Extrapolation models allow us to classify apps as compute- or memory-limited
• Feasible dataset sizes are used to
   - estimate exascale memory requirements
   - assess potential benefit of exascale technologies such as PUM
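The classification in the table can be reproduced with a small rule: whichever limit is reached at the smaller input size is the critical one, and that size is the feasible size. The sketch below assumes the 24hrs and 100PB columns are the input sizes (in G) at which each limit is hit. That reading and the function name `critical_limit` are assumptions of this example, not statements from the poster. The rule matches the Critical Limit column for every application; for miniMD it gives 41G where the poster prints a feasible size of 45GB, and the sketch does not try to explain that difference.

```python
# (input size at the 24 hr limit, input size at the 100 PB limit), in G,
# copied from the table; the column interpretation is an assumption.
LIMITS_G = {
    "miniMD": (41, 600),
    "Ex20": (92, 500),
    "Ex30": (130, 1500),
    "Ex19": (5000, 1000),
    "miniFE": (5000, 250),
    "HPCCG": (5000, 250),
}


def critical_limit(size_at_24hrs, size_at_100pb):
    """Return (limit kind, feasible input size in G) for one application."""
    if size_at_24hrs <= size_at_100pb:
        return "Compute", size_at_24hrs   # runtime budget is exhausted first
    return "Memory", size_at_100pb        # memory capacity is exhausted first


results = {app: critical_limit(*sizes) for app, sizes in LIMITS_G.items()}
for app, (kind, feasible) in results.items():
    print(f"{app:7s} {kind}-limited, feasible size about {feasible}G")
```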

[Diagram labels: Compute Engine #1, Memory, Compute Engine #2]