
Page 1: Chip-Multiprocessors & You

Chip-Multiprocessors & You

John Dennis ([email protected])

March 16, 2007

Page 2: Chip-Multiprocessors & You

Intel "Tera Chip"

- 80-core chip
- 1 Teraflop (sanity-checked below)
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, High-K
- 2D mesh network
  - Each processor has a 5-port router
  - Connects to "3D memory"
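
The 1 Teraflop number is consistent with a quick back-of-the-envelope check. The sketch below assumes roughly four floating-point operations per core per cycle (e.g. two fused multiply-adds), which is not stated on the slide.

```python
# Rough sanity check of the "1 Teraflop" claim.
cores = 80
clock_hz = 3.16e9
flops_per_cycle = 4            # assumed, not taken from the slide
peak_flops = cores * clock_hz * flops_per_cycle
print(f"peak ~ {peak_flops / 1e12:.2f} Tflop/s")   # ~1.01 Tflop/s
```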

Page 3: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Page 4: Chip-Multiprocessors & You

Moore's Law

- Most things are twice as nice every 18 months:
  - Transistor count
  - Processor speed
  - DRAM density
- Historical result:
  - Solve a problem twice as large in the same time
  - Solve the same-size problem in half the time
- --> Inactivity leads to progress!

Page 5: Chip-Multiprocessors & You


The advent of Chip-multiprocessors

Moore’s Law gone bad!

Page 6: Chip-Multiprocessors & You

New implications of Moore's Law

- Every 18 months:
  - Number of cores per socket doubles
  - Memory density doubles
  - Clock rate may increase slightly
- 18 months from now:
  - 8 cores per socket
  - Slight increase in clock rate (~15%)
  - Same memory per core!!

Page 7: Chip-Multiprocessors & You

New implications of Moore's Law (con't)

- Inactivity leads to no progress!
- Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size:
    - Scalable memory? More processors enable a ~2x reduction in time to solution
    - Non-scalable memory? May limit the number of processors that can be used; waste 1/2 of the cores on each socket just to use the memory?
- All components of an application must scale to benefit from Moore's Law increases!
- Memory footprint problem will not solve itself!

Page 8: Chip-Multiprocessors & You

Questions?

Page 9: Chip-Multiprocessors & You

Parallel I/O library (PIO)

John Dennis ([email protected])
Ray Loy ([email protected])

March 16, 2007

Page 10: Chip-Multiprocessors & You

Introduction

- All component models need parallel I/O
- Serial I/O is bad!
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs

Page 11: Chip-Multiprocessors & You

Design goals

- Provide parallel I/O for all component models
- Encapsulate complexity into the library
- Simple interface for component developers to implement

Page 12: Chip-Multiprocessors & You

Design goals (con't)

- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats:
  - {sequential, direct} binary
  - netCDF
- Preserve format of input/output files
- Supports 1D, 2D and 3D arrays
  - Currently XY
  - Extensible to XZ or YZ
- (A hypothetical usage sketch follows below.)
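
The intended usage pattern for component developers is roughly "describe your decomposition once, then read and write distributed arrays through it." The sketch below only illustrates that pattern; `init_decomp`, `write_darray`, and the other names are hypothetical placeholders, not PIO's actual interface.

```python
# Hypothetical sketch of the usage pattern described on this slide;
# none of these names are the real PIO API.
class PIOSketch:
    def init_decomp(self, global_dims, my_global_dofs):
        """Describe once which global indices this MPI task owns."""
        self.global_dims = global_dims
        self.dofs = list(my_global_dofs)

    def write_darray(self, filename, varname, local_data, fmt="netcdf"):
        """Write the distributed array in the requested format
        ({sequential, direct} binary or netCDF), preserving file layout."""
        assert len(local_data) == len(self.dofs)
        # ... the library would hide rearrangement + MPI-IO/netCDF here ...

io = PIOSketch()
io.init_decomp((3600, 2400), my_global_dofs=range(100))   # toy example
io.write_darray("restart.nc", "SST", [0.0] * 100)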

Page 13: Chip-Multiprocessors & You

Terms and Concepts

- PnetCDF [ANL]:
  - High-performance I/O
  - Different interface
  - Stable
- netCDF4 + HDF5 [NCSA]:
  - Same interface
  - Needs the HDF5 library
  - Less stable
  - Lower performance
  - No support on Blue Gene

Page 14: Chip-Multiprocessors & You

Terms and Concepts (con't)

- Processor stride: allows a subset of MPI I/O nodes to be matched to the system hardware (illustrated below)
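
One common way to realize a processor stride is to nominate every k-th MPI rank as an I/O task and give those tasks their own communicator. The mpi4py sketch below illustrates that idea under that assumption; it is not PIO's actual implementation.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

stride = 4                          # hypothetical: one I/O task per 4 ranks
is_io_task = (rank % stride == 0)

# Group the I/O tasks into their own communicator so collective I/O can be
# issued over a subset of ranks matched to the machine's I/O hardware.
io_comm = comm.Split(color=0 if is_io_task else 1, key=rank)
```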

Page 15: Chip-Multiprocessors & You

Terms and Concepts (con't)

- IO decomp vs. COMP decomp
  - IO decomp == COMP decomp: MPI-IO + message aggregation
  - IO decomp != COMP decomp: need a rearranger (MCT)
- No component-specific info in the library
- Pair with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays (see the sketch below)
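
Because the library works on 1D arrays, a component hands it flattened data plus the global index (degree of freedom) of each local element; a rearranger can then move values from the compute decomposition to the I/O decomposition. A minimal single-process sketch of that bookkeeping, with made-up helper names:

```python
import numpy as np

def flatten_with_dofs(local2d, i0, j0, nx_global):
    """Flatten a local 2D tile and return (values, global dofs), assuming the
    tile starts at global offsets (i0, j0) in an nx_global-wide grid."""
    ny_local, nx_local = local2d.shape
    j, i = np.meshgrid(np.arange(ny_local), np.arange(nx_local), indexing="ij")
    dofs = (j0 + j) * nx_global + (i0 + i)
    return local2d.ravel(), dofs.ravel()

def rearrange(values, dofs, io_dofs):
    """Toy 'rearranger': pick out the values whose global dof belongs to the
    I/O decomposition (MCT/MPI would do this across ranks)."""
    lookup = dict(zip(dofs.tolist(), values.tolist()))
    return np.array([lookup[d] for d in io_dofs])
```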

Page 16: Chip-Multiprocessors & You

Component Model 'issues'

- POP & CICE:
  - Missing blocks
  - Update of neighbors' halo
  - Who writes the missing blocks?
  - Asymmetry between read/write
  - 'sub-block' decompositions are not rectangular
- CLM:
  - Decomposition not rectangular
  - Who writes the missing data?

Page 17: Chip-Multiprocessors & You

What works

- Binary I/O [direct]
  - Tested on POWER5, BGL
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF
  - Rearrange with MCT [new]
  - Reduced memory
- PnetCDF
  - Rearrange with MCT
  - No rearrangement
  - Tested on POWER5, BGL

Page 18: Chip-Multiprocessors & You

What works (con't)

- Prototype added to POP2
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]
  - Writes netCDF history files correctly
- POPIO benchmark
  - 2D array [3600x2400] (70 Mbyte)
  - Test code for correctness and performance
  - Tested on 30K BGL processors in Oct 06
- Performance
  - POWER5: 2-3x the serial I/O approach
  - BGL: mixed

Page 19: Chip-Multiprocessors & You

Complexity / Remaining Issues

- Multiple ways to express a decomposition:
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF)
- Limited expressiveness
  - Will not support 'sub-block' in POP & CICE, CLM
- Need a common language for the interface between the component model and the library (compared below)
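
For simple rectangular cases the two descriptions are interchangeable; the sketch below expresses the same subarray both as a start + count pair and as a flat GDOF list (illustrative only, not the library's interface).

```python
import numpy as np

def subarray_as_gdof(start, count, nx_global):
    """Express a rectangular (start, count) subarray of a 2D field as a flat
    list of global degrees of freedom (0-based, row-major)."""
    (j0, i0), (nj, ni) = start, count
    j, i = np.meshgrid(np.arange(j0, j0 + nj), np.arange(i0, i0 + ni),
                       indexing="ij")
    return (j * nx_global + i).ravel()

# A 2x3 patch starting at row 1, column 4 of a 3600-wide grid:
print(subarray_as_gdof((1, 4), (2, 3), nx_global=3600))
# -> [3604 3605 3606 7204 7205 7206]
```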

Page 20: Chip-Multiprocessors & You

Conclusions

- Working prototype
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon: discuss progress every 2 weeks
- Work in progress
  - Multiple efforts underway
  - Accepting help
- http://swiki.ucar.edu/ccsm/93
- In the CCSM subversion repository

Page 21: Chip-Multiprocessors & You

Fun with Large Processor Counts: POP, CICE

John Dennis ([email protected])

March 16, 2007

Page 22: Chip-Multiprocessors & You

Motivation

- Can the Community Climate System Model (CCSM) be a petascale application?
  - Use 10-100K processors per simulation
- Increasingly common access to large systems
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - Lag time in the Top 500 list [4-5 years]
  - @ NCAR before 2015

Page 23: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Page 24: Chip-Multiprocessors & You

Status of POP

- Access to 17K Cray XT4 processors
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won BGW cycle allocation
  - "Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport" [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers
  - Simulate the eddy diffusivity tensor

Page 25: Chip-Multiprocessors & You

Status of POP (con't)

- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write the history file
- Start runs in 4-6 weeks

Page 26: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Page 27: Chip-Multiprocessors & You

Status of CICE

- Tested CICE @ 1/10°
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC)
  - erfc
  - climatology

Page 28: Chip-Multiprocessors & You

POP (gx1v3) + Space-filling curve

Page 29: Chip-Multiprocessors & You

Space-filling curve partition for 8 processors

Page 30: Chip-Multiprocessors & You

Weighted Space-filling curves

- Estimate the work for each grid block:

  Work_i = w0 + P_i * w1

  where:
  - w0: fixed work for all blocks
  - w1: work if the block contains sea ice
  - P_i: probability that block i contains sea ice
- For our experiments: w0 = 2, w1 = 10 (a partitioning sketch follows below)
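
A weighted space-filling-curve partition can then be formed by walking the SFC-ordered block list and cutting it into contiguous chunks of roughly equal estimated work. The sketch below uses the slide's weights (w0 = 2, w1 = 10) with a simple greedy cut; it is an illustration of the idea, not the CICE partitioner.

```python
def block_work(p_ice, w0=2.0, w1=10.0):
    """Estimated work per block: Work_i = w0 + P_i * w1."""
    return w0 + p_ice * w1

def partition_sfc(blocks_in_sfc_order, p_ice, npes):
    """Greedily cut the SFC-ordered blocks into npes contiguous chunks of
    roughly equal estimated work."""
    work = [block_work(p_ice[b]) for b in blocks_in_sfc_order]
    target = sum(work) / npes
    parts, chunk, acc = [], [], 0.0
    for b, w in zip(blocks_in_sfc_order, work):
        chunk.append(b)
        acc += w
        if acc >= target and len(parts) < npes - 1:
            parts.append(chunk)
            chunk, acc = [], 0.0
    parts.append(chunk)
    return parts

# Toy example: 8 blocks along the curve, the last two likely to contain ice.
print(partition_sfc(list(range(8)), p_ice=[0, 0, 0, 0, 0, 0, 1, 1], npes=2))
```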

Page 31: Chip-Multiprocessors & You

Probability Function

- Error function (evaluated in the sketch below):

  P_i = erfc((μ - max(|lat_i|)) / σ)

  where:
  - lat_i: maximum latitude in block i
  - μ: mean sea-ice extent
  - σ: variance in sea-ice extent
- μ(NH) = 70°, μ(SH) = 60°, σ = 5°
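
Under the reading above (μ the mean ice-edge latitude, σ its spread), the probability can be computed directly with the complementary error function. Note that erfc ranges over (0, 2); the slide does not say whether a factor of 1/2 is applied, so none is applied here.

```python
import math

def p_seaice(max_abs_lat_deg, mu_deg, sigma_deg=5.0):
    """P_i = erfc((mu - max|lat_i|) / sigma)."""
    return math.erfc((mu_deg - max_abs_lat_deg) / sigma_deg)

print(p_seaice(80.0, mu_deg=70.0))   # block deep in the ice pack -> ~2.0
print(p_seaice(40.0, mu_deg=70.0))   # mid-latitude block         -> ~0.0
```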

Page 32: Chip-Multiprocessors & You

1° CICE4 on 20 processors

- Small domains @ high latitudes
- Large domains @ low latitudes

Page 33: Chip-Multiprocessors & You

0.1° CICE4

- Developed at LANL
- Finite difference
- Models sea ice
- Shares grid and infrastructure with POP
  - Reuse techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems:
  - ~15% of the grid has sea ice
  - Use weighted space-filling curves?
- Evaluate using benchmark: 1 day / initial run / 30-minute timestep / no forcing

Page 34: Chip-Multiprocessors & You

CICE4 @ 0.1°

Page 35: Chip-Multiprocessors & You

Timings for 1°, npes=160, NH=70°

- Load imbalance: Hudson Bay south of 70°

Page 36: Chip-Multiprocessors & You

Timings for 1°, npes=160, NH=55°

Page 37: Chip-Multiprocessors & You

Better Probability Function

- Climatological function (transcribed below):

  P_i = 1.0 if (Σ_j φ_ij) / n_i ≥ 0.1, and 0.0 otherwise

  where:
  - φ_ij: climatological maximum sea-ice extent [satellite observation]
  - n_i: the number of points within block i with non-zero φ_ij
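
A direct transcription of the climatological rule, assuming φ is supplied per grid point within the block:

```python
def p_seaice_clim(phi_block, threshold=0.1):
    """P_i = 1.0 if (sum_j phi_ij) / n_i >= threshold, else 0.0, where n_i
    counts the points in the block with non-zero climatological ice extent."""
    n_i = sum(1 for phi in phi_block if phi != 0.0)
    if n_i == 0:
        return 0.0
    return 1.0 if sum(phi_block) / n_i >= threshold else 0.0

print(p_seaice_clim([0.0, 0.0, 0.9, 0.8]))   # 1.0: climatological ice present
print(p_seaice_clim([0.0, 0.0, 0.0, 0.0]))   # 0.0: no climatological ice
```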

Page 38: Chip-Multiprocessors & You

Timings for 1°, npes=160, climate-based

- Reduces dynamics sub-cycling time by 28%!

Page 39: Chip-Multiprocessors & You

Acknowledgements / Questions?

Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)

Computer time:
- Blue Gene/L time:
  - NSF MRI Grant
  - NCAR
  - University of Colorado
  - IBM (SUR) program
  - BGW Consortium Days
  - IBM Research (Watson)
- Cray XT3/4 time:
  - ORNL
  - Sandia

Page 40: Chip-Multiprocessors & You

Partitioning with Space-filling Curves

- Map 2D -> 1D
- Variety of sizes (factor check below):
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partitioning a 1D array
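
The listed curves exist for block counts per side that factor into 2s, 3s and 5s, which is why the combined Hilbert-Peano-Cinco family covers Nb = 2^n 3^m 5^p. A small check of that condition:

```python
def sfc_supported(nb):
    """True if an nb x nb block grid matches one of the listed curves,
    i.e. nb factors entirely into 2s, 3s and 5s."""
    for f in (2, 3, 5):
        while nb % f == 0:
            nb //= f
    return nb == 1

print(sfc_supported(48))   # True:  48 = 2^4 * 3   (Hilbert-Peano)
print(sfc_supported(30))   # True:  30 = 2 * 3 * 5 (Hilbert-Peano-Cinco)
print(sfc_supported(28))   # False: contains a factor of 7
```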

Page 41: Chip-Multiprocessors & You

Scalable data structures

- Common problem among applications
- WRF
  - Serial I/O [fixed]
  - Duplication of lateral boundary values
- POP & CICE
  - Serial I/O
- CLM
  - Serial I/O
  - Duplication of grid info

Page 42: Chip-Multiprocessors & You

Scalable data structures (con't)

- CAM
  - Serial I/O
  - Lookup tables
- CPL
  - Serial I/O
  - Duplication of grid info
- Memory footprint problem will not solve itself!

Page 43: Chip-Multiprocessors & You

Remove Land blocks

Page 44: Chip-Multiprocessors & You

Case Study: Memory use in CLM

- CLM configuration:
  - 1x1.25 grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, DGVM
- Measure stack and heap on 32-512 BG/L processors

Page 45: Chip-Multiprocessors & You

Memory use of CLM on BGL

Page 46: Chip-Multiprocessors & You

Motivation (con't)

- Multiple efforts underway:
  - CAM scalability + high-resolution coupled simulation [A. Mirin]
  - Sequential coupler [M. Vertenstein, R. Jacob]
  - Single-executable coupler [J. Wolfe]
  - CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
  - HOMME in CAM [J. Edwards]

Page 47: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)

Page 48: Chip-Multiprocessors & You

Status of CLM

- Work of T. Craig
  - Elimination of global memory
  - Reworking of decomposition algorithms
  - Addition of PIO
- Short-term goal:
  - Participation in BGW days, June 07
  - Investigate scalability at 1/10°

Page 49: Chip-Multiprocessors & You

Status of CLM memory usage

- May 1, 2006: memory usage increases with processor count
  - Can run 1x1.25 on 32-512 processors of BGL
- July 10, 2006: memory usage scales to an asymptote
  - Can run 1x1.25 on 32-2K processors of BGL
  - ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree]
- January 2007: 150 persistent global arrays
  - 1/2 degree runs on 32-2K BGL processors
  - ~150 persistent global arrays [10.5 Gbytes/proc @ 1/10 degree]
- February 2007: 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree]
- Target: no persistent global arrays
  - 1/10 degree runs on a single rack of BGL (arithmetic check below)
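
The "Gbytes/proc" figures are consistent with each persistent global array being a full 2D field at 1/10 degree held on every process. The check below assumes a 3600 x 2400 global grid and 8-byte reals (both assumptions, not stated on this slide).

```python
# Assumed: 3600 x 2400 global grid at 1/10 degree, 8-byte reals.
nx, ny, bytes_per_val = 3600, 2400, 8
gb_per_array = nx * ny * bytes_per_val / 1e9     # ~0.069 GB per global array

for n_arrays in (350, 150, 18):
    print(f"{n_arrays} arrays -> ~{n_arrays * gb_per_array:.1f} GB per process")
# ~24.2, ~10.4 and ~1.2 GB: in line with the 24, 10.5 and 1.2 on the slide
```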

Page 50: Chip-Multiprocessors & You

Proposed Petascale Experiment

- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - Sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land model @ 0.1°
- Sequential design (105 days per run): 32K BGL / 10K XT3 processors
- Concurrent design (33 days per run): 120K BGL / 42K XT3 processors

Page 51: Chip-Multiprocessors & You

POPIO benchmark on BGW

Page 52: Chip-Multiprocessors & You

CICE results (con't)

- Correct weighting increases simulation rate
- wSFC works best for high resolution
- Variable-sized domains:
  - Large domains at low latitude -> higher boundary-exchange cost
  - Small domains at high latitude -> lower floating-point cost
- Optimal balance of computational and communication cost?
- Work in progress!