
Page 1: Chip-Multiprocessors & You

Chip-Multiprocessors & You

John Dennis ([email protected])

March 16, 2007

Page 2: Chip-Multiprocessors & You

Intel "Tera Chip"

- 80-core chip
- 1 Teraflop (sanity-checked below)
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, High-K
- 2D mesh network
  - Each processor has a 5-port router
  - Connects to "3D memory"
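
The 1 Teraflop number is consistent with a quick back-of-the-envelope check. The sketch below assumes roughly four floating-point operations per core per cycle (e.g. two fused multiply-adds), which is not stated on the slide.

```python
# Rough sanity check of the "1 Teraflop" claim.
cores = 80
clock_hz = 3.16e9
flops_per_cycle = 4            # assumed, not taken from the slide
peak_flops = cores * clock_hz * flops_per_cycle
print(f"peak ~ {peak_flops / 1e12:.2f} Tflop/s")   # ~1.01 Tflop/s
```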

Page 3: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Page 4: Chip-Multiprocessors & You

Moore's Law

- Most things are twice as nice every 18 months:
  - Transistor count
  - Processor speed
  - DRAM density
- Historical result:
  - Solve a problem twice as large in the same time
  - Solve the same-size problem in half the time
- --> Inactivity leads to progress!

Page 5: Chip-Multiprocessors & You


The advent of Chip-multiprocessors

Moore’s Law gone bad!

Page 6: Chip-Multiprocessors & You

New implications of Moore's Law

- Every 18 months:
  - Number of cores per socket doubles
  - Memory density doubles
  - Clock rate may increase slightly
- 18 months from now:
  - 8 cores per socket
  - Slight increase in clock rate (~15%)
  - Same memory per core!!

Page 7: Chip-Multiprocessors & You

New implications of Moore's Law (con't)

- Inactivity leads to no progress!
- Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size:
    - Scalable memory? More processors enable a ~2x reduction in time to solution
    - Non-scalable memory? May limit the number of processors that can be used; waste 1/2 of the cores on each socket just to use the memory?
- All components of an application must scale to benefit from Moore's Law increases!
- Memory footprint problem will not solve itself!

Page 8: Chip-Multiprocessors & You

Questions?

Page 9: Chip-Multiprocessors & You

Parallel I/O library (PIO)

John Dennis ([email protected])
Ray Loy ([email protected])

March 16, 2007

Page 10: Chip-Multiprocessors & You

Introduction

- All component models need parallel I/O
- Serial I/O is bad!
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs

Page 11: Chip-Multiprocessors & You

Design goals

- Provide parallel I/O for all component models
- Encapsulate complexity into the library
- Simple interface for component developers to implement

Page 12: Chip-Multiprocessors & You

Design goals (con't)

- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats:
  - {sequential, direct} binary
  - netCDF
- Preserve format of input/output files
- Supports 1D, 2D and 3D arrays
  - Currently XY
  - Extensible to XZ or YZ
- (A hypothetical usage sketch follows below.)
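
The intended usage pattern for component developers is roughly "describe your decomposition once, then read and write distributed arrays through it." The sketch below only illustrates that pattern; `init_decomp`, `write_darray`, and the other names are hypothetical placeholders, not PIO's actual interface.

```python
# Hypothetical sketch of the usage pattern described on this slide;
# none of these names are the real PIO API.
class PIOSketch:
    def init_decomp(self, global_dims, my_global_dofs):
        """Describe once which global indices this MPI task owns."""
        self.global_dims = global_dims
        self.dofs = list(my_global_dofs)

    def write_darray(self, filename, varname, local_data, fmt="netcdf"):
        """Write the distributed array in the requested format
        ({sequential, direct} binary or netCDF), preserving file layout."""
        assert len(local_data) == len(self.dofs)
        # ... the library would hide rearrangement + MPI-IO/netCDF here ...

io = PIOSketch()
io.init_decomp((3600, 2400), my_global_dofs=range(100))   # toy example
io.write_darray("restart.nc", "SST", [0.0] * 100)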

Page 13: Chip-Multiprocessors & You

Terms and Concepts

- PnetCDF [ANL]:
  - High-performance I/O
  - Different interface
  - Stable
- netCDF4 + HDF5 [NCSA]:
  - Same interface
  - Needs the HDF5 library
  - Less stable
  - Lower performance
  - No support on Blue Gene

Page 14: Chip-Multiprocessors & You

Terms and Concepts (con't)

- Processor stride: allows a subset of MPI I/O nodes to be matched to the system hardware (illustrated below)
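
One common way to realize a processor stride is to nominate every k-th MPI rank as an I/O task and give those tasks their own communicator. The mpi4py sketch below illustrates that idea under that assumption; it is not PIO's actual implementation.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

stride = 4                          # hypothetical: one I/O task per 4 ranks
is_io_task = (rank % stride == 0)

# Group the I/O tasks into their own communicator so collective I/O can be
# issued over a subset of ranks matched to the machine's I/O hardware.
io_comm = comm.Split(color=0 if is_io_task else 1, key=rank)
```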

Page 15: Chip-Multiprocessors & You

Terms and Concepts (con't)

- IO decomp vs. COMP decomp
  - IO decomp == COMP decomp: MPI-IO + message aggregation
  - IO decomp != COMP decomp: need a rearranger (MCT)
- No component-specific info in the library
- Pair with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays (see the sketch below)
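
Because the library works on 1D arrays, a component hands it flattened data plus the global index (degree of freedom) of each local element; a rearranger can then move values from the compute decomposition to the I/O decomposition. A minimal single-process sketch of that bookkeeping, with made-up helper names:

```python
import numpy as np

def flatten_with_dofs(local2d, i0, j0, nx_global):
    """Flatten a local 2D tile and return (values, global dofs), assuming the
    tile starts at global offsets (i0, j0) in an nx_global-wide grid."""
    ny_local, nx_local = local2d.shape
    j, i = np.meshgrid(np.arange(ny_local), np.arange(nx_local), indexing="ij")
    dofs = (j0 + j) * nx_global + (i0 + i)
    return local2d.ravel(), dofs.ravel()

def rearrange(values, dofs, io_dofs):
    """Toy 'rearranger': pick out the values whose global dof belongs to the
    I/O decomposition (MCT/MPI would do this across ranks)."""
    lookup = dict(zip(dofs.tolist(), values.tolist()))
    return np.array([lookup[d] for d in io_dofs])
```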

Page 16: Chip-Multiprocessors & You

Component Model 'issues'

- POP & CICE:
  - Missing blocks
  - Update of neighbors' halo
  - Who writes the missing blocks?
  - Asymmetry between read/write
  - 'sub-block' decompositions are not rectangular
- CLM:
  - Decomposition not rectangular
  - Who writes the missing data?

Page 17: Chip-Multiprocessors & You

What works

- Binary I/O [direct]
  - Tested on POWER5, BGL
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF
  - Rearrange with MCT [new]
  - Reduced memory
- PnetCDF
  - Rearrange with MCT
  - No rearrangement
  - Tested on POWER5, BGL

Page 18: Chip-Multiprocessors & You

What works (con't)

- Prototype added to POP2
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]
  - Writes netCDF history files correctly
- POPIO benchmark
  - 2D array [3600x2400] (70 Mbyte)
  - Test code for correctness and performance
  - Tested on 30K BGL processors in Oct 06
- Performance
  - POWER5: 2-3x the serial I/O approach
  - BGL: mixed

Page 19: Chip-Multiprocessors & You

Complexity / Remaining Issues

- Multiple ways to express a decomposition:
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF)
- Limited expressiveness
  - Will not support 'sub-block' in POP & CICE, CLM
- Need a common language for the interface between the component model and the library (compared below)
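
For simple rectangular cases the two descriptions are interchangeable; the sketch below expresses the same subarray both as a start + count pair and as a flat GDOF list (illustrative only, not the library's interface).

```python
import numpy as np

def subarray_as_gdof(start, count, nx_global):
    """Express a rectangular (start, count) subarray of a 2D field as a flat
    list of global degrees of freedom (0-based, row-major)."""
    (j0, i0), (nj, ni) = start, count
    j, i = np.meshgrid(np.arange(j0, j0 + nj), np.arange(i0, i0 + ni),
                       indexing="ij")
    return (j * nx_global + i).ravel()

# A 2x3 patch starting at row 1, column 4 of a 3600-wide grid:
print(subarray_as_gdof((1, 4), (2, 3), nx_global=3600))
# -> [3604 3605 3606 7204 7205 7206]
```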

Page 20: Chip-Multiprocessors & You

Conclusions

- Working prototype
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon: discuss progress every 2 weeks
- Work in progress
  - Multiple efforts underway
  - Accepting help
- http://swiki.ucar.edu/ccsm/93
- In the CCSM subversion repository

Page 21: Chip-Multiprocessors & You

Fun with Large Processor Counts: POP, CICE

John Dennis ([email protected])

March 16, 2007

Page 22: Chip-Multiprocessors & You

Motivation

- Can the Community Climate System Model (CCSM) be a petascale application?
  - Use 10-100K processors per simulation
- Increasingly common access to large systems
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - Lag time in the Top 500 list [4-5 years]
  - @ NCAR before 2015

Page 23: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Page 24: Chip-Multiprocessors & You

Status of POP

- Access to 17K Cray XT4 processors
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won BGW cycle allocation
  - "Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport" [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers
  - Simulate the eddy diffusivity tensor

Page 25: Chip-Multiprocessors & You

Status of POP (con't)

- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write the history file
- Start runs in 4-6 weeks

Page 26: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Page 27: Chip-Multiprocessors & You

Status of CICE

- Tested CICE @ 1/10°
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC)
  - erfc
  - climatology

Page 28: Chip-Multiprocessors & You

POP (gx1v3) + Space-filling curve

Page 29: Chip-Multiprocessors & You

Space-filling curve partition for 8 processors

Page 30: Chip-Multiprocessors & You

Weighted Space-filling curves

- Estimate the work for each grid block:

  Work_i = w0 + P_i * w1

  where:
  - w0: fixed work for all blocks
  - w1: work if the block contains sea ice
  - P_i: probability that block i contains sea ice
- For our experiments: w0 = 2, w1 = 10 (a partitioning sketch follows below)
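
A weighted space-filling-curve partition can then be formed by walking the SFC-ordered block list and cutting it into contiguous chunks of roughly equal estimated work. The sketch below uses the slide's weights (w0 = 2, w1 = 10) with a simple greedy cut; it is an illustration of the idea, not the CICE partitioner.

```python
def block_work(p_ice, w0=2.0, w1=10.0):
    """Estimated work per block: Work_i = w0 + P_i * w1."""
    return w0 + p_ice * w1

def partition_sfc(blocks_in_sfc_order, p_ice, npes):
    """Greedily cut the SFC-ordered blocks into npes contiguous chunks of
    roughly equal estimated work."""
    work = [block_work(p_ice[b]) for b in blocks_in_sfc_order]
    target = sum(work) / npes
    parts, chunk, acc = [], [], 0.0
    for b, w in zip(blocks_in_sfc_order, work):
        chunk.append(b)
        acc += w
        if acc >= target and len(parts) < npes - 1:
            parts.append(chunk)
            chunk, acc = [], 0.0
    parts.append(chunk)
    return parts

# Toy example: 8 blocks along the curve, the last two likely to contain ice.
print(partition_sfc(list(range(8)), p_ice=[0, 0, 0, 0, 0, 0, 1, 1], npes=2))
```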

Page 31: Chip-Multiprocessors & You

Probability Function

- Error function (evaluated in the sketch below):

  P_i = erfc((μ - max(|lat_i|)) / σ)

  where:
  - lat_i: maximum latitude in block i
  - μ: mean sea-ice extent
  - σ: variance in sea-ice extent
- μ(NH) = 70°, μ(SH) = 60°, σ = 5°
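
Under the reading above (μ the mean ice-edge latitude, σ its spread), the probability can be computed directly with the complementary error function. Note that erfc ranges over (0, 2); the slide does not say whether a factor of 1/2 is applied, so none is applied here.

```python
import math

def p_seaice(max_abs_lat_deg, mu_deg, sigma_deg=5.0):
    """P_i = erfc((mu - max|lat_i|) / sigma)."""
    return math.erfc((mu_deg - max_abs_lat_deg) / sigma_deg)

print(p_seaice(80.0, mu_deg=70.0))   # block deep in the ice pack -> ~2.0
print(p_seaice(40.0, mu_deg=70.0))   # mid-latitude block         -> ~0.0
```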

Page 32: Chip-Multiprocessors & You

1° CICE4 on 20 processors

- Small domains @ high latitudes
- Large domains @ low latitudes

Page 33: Chip-Multiprocessors & You

0.1° CICE4

- Developed at LANL
- Finite difference
- Models sea ice
- Shares grid and infrastructure with POP
  - Reuse techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems:
  - ~15% of the grid has sea ice
  - Use weighted space-filling curves?
- Evaluate using benchmark: 1 day / initial run / 30-minute timestep / no forcing

Page 34: Chip-Multiprocessors & You

CICE4 @ 0.1°

Page 35: Chip-Multiprocessors & You

Timings for 1°, npes=160, NH=70°

- Load imbalance: Hudson Bay south of 70°

Page 36: Chip-Multiprocessors & You

Timings for 1°, npes=160, NH=55°

Page 37: Chip-Multiprocessors & You

Better Probability Function

- Climatological function (transcribed below):

  P_i = 1.0 if (Σ_j φ_ij) / n_i ≥ 0.1, and 0.0 otherwise

  where:
  - φ_ij: climatological maximum sea-ice extent [satellite observation]
  - n_i: the number of points within block i with non-zero φ_ij
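
A direct transcription of the climatological rule, assuming φ is supplied per grid point within the block:

```python
def p_seaice_clim(phi_block, threshold=0.1):
    """P_i = 1.0 if (sum_j phi_ij) / n_i >= threshold, else 0.0, where n_i
    counts the points in the block with non-zero climatological ice extent."""
    n_i = sum(1 for phi in phi_block if phi != 0.0)
    if n_i == 0:
        return 0.0
    return 1.0 if sum(phi_block) / n_i >= threshold else 0.0

print(p_seaice_clim([0.0, 0.0, 0.9, 0.8]))   # 1.0: climatological ice present
print(p_seaice_clim([0.0, 0.0, 0.0, 0.0]))   # 0.0: no climatological ice
```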

Page 38: Chip-Multiprocessors & You

Timings for 1°, npes=160, climate-based

- Reduces dynamics sub-cycling time by 28%!

Page 39: Chip-Multiprocessors & You

Acknowledgements / Questions?

Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)

Computer time:
- Blue Gene/L time:
  - NSF MRI Grant
  - NCAR
  - University of Colorado
  - IBM (SUR) program
  - BGW Consortium Days
  - IBM Research (Watson)
- Cray XT3/4 time:
  - ORNL
  - Sandia

Page 40: Chip-Multiprocessors & You

Partitioning with Space-filling Curves

- Map 2D -> 1D
- Variety of sizes (factor check below):
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partitioning a 1D array
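
The listed curves exist for block counts per side that factor into 2s, 3s and 5s, which is why the combined Hilbert-Peano-Cinco family covers Nb = 2^n 3^m 5^p. A small check of that condition:

```python
def sfc_supported(nb):
    """True if an nb x nb block grid matches one of the listed curves,
    i.e. nb factors entirely into 2s, 3s and 5s."""
    for f in (2, 3, 5):
        while nb % f == 0:
            nb //= f
    return nb == 1

print(sfc_supported(48))   # True:  48 = 2^4 * 3   (Hilbert-Peano)
print(sfc_supported(30))   # True:  30 = 2 * 3 * 5 (Hilbert-Peano-Cinco)
print(sfc_supported(28))   # False: contains a factor of 7
```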

Page 41: Chip-Multiprocessors & You

Scalable data structures

- Common problem among applications
- WRF
  - Serial I/O [fixed]
  - Duplication of lateral boundary values
- POP & CICE
  - Serial I/O
- CLM
  - Serial I/O
  - Duplication of grid info

Page 42: Chip-Multiprocessors & You

Scalable data structures (con't)

- CAM
  - Serial I/O
  - Lookup tables
- CPL
  - Serial I/O
  - Duplication of grid info
- Memory footprint problem will not solve itself!

Page 43: Chip-Multiprocessors & You

Remove Land blocks

Page 44: Chip-Multiprocessors & You

Case Study: Memory use in CLM

- CLM configuration:
  - 1x1.25 grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, DGVM
- Measure stack and heap on 32-512 BG/L processors

Page 45: Chip-Multiprocessors & You

Memory use of CLM on BGL

Page 46: Chip-Multiprocessors & You

Motivation (con't)

- Multiple efforts underway:
  - CAM scalability + high-resolution coupled simulation [A. Mirin]
  - Sequential coupler [M. Vertenstein, R. Jacob]
  - Single-executable coupler [J. Wolfe]
  - CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
  - HOMME in CAM [J. Edwards]

Page 47: Chip-Multiprocessors & You

Outline

- Chip-Multiprocessor
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)

Page 48: Chip-Multiprocessors & You

Status of CLM

- Work of T. Craig
  - Elimination of global memory
  - Reworking of decomposition algorithms
  - Addition of PIO
- Short-term goal:
  - Participation in BGW days, June 07
  - Investigate scalability at 1/10°

Page 49: Chip-Multiprocessors & You

Status of CLM memory usage

- May 1, 2006: memory usage increases with processor count
  - Can run 1x1.25 on 32-512 processors of BGL
- July 10, 2006: memory usage scales to an asymptote
  - Can run 1x1.25 on 32-2K processors of BGL
  - ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree]
- January 2007: 150 persistent global arrays
  - 1/2 degree runs on 32-2K BGL processors
  - ~150 persistent global arrays [10.5 Gbytes/proc @ 1/10 degree]
- February 2007: 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree]
- Target: no persistent global arrays
  - 1/10 degree runs on a single rack of BGL (arithmetic check below)
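
The "Gbytes/proc" figures are consistent with each persistent global array being a full 2D field at 1/10 degree held on every process. The check below assumes a 3600 x 2400 global grid and 8-byte reals (both assumptions, not stated on this slide).

```python
# Assumed: 3600 x 2400 global grid at 1/10 degree, 8-byte reals.
nx, ny, bytes_per_val = 3600, 2400, 8
gb_per_array = nx * ny * bytes_per_val / 1e9     # ~0.069 GB per global array

for n_arrays in (350, 150, 18):
    print(f"{n_arrays} arrays -> ~{n_arrays * gb_per_array:.1f} GB per process")
# ~24.2, ~10.4 and ~1.2 GB: in line with the 24, 10.5 and 1.2 on the slide
```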

Page 50: Chip-Multiprocessors & You

Proposed Petascale Experiment

- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - Sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land model @ 0.1°
- Sequential design (105 days per run): 32K BGL / 10K XT3 processors
- Concurrent design (33 days per run): 120K BGL / 42K XT3 processors

Page 51: Chip-Multiprocessors & You

POPIO benchmark on BGW

Page 52: Chip-Multiprocessors & You

CICE results (con't)

- Correct weighting increases simulation rate
- wSFC works best for high resolution
- Variable-sized domains:
  - Large domains at low latitude -> higher boundary-exchange cost
  - Small domains at high latitude -> lower floating-point cost
- Optimal balance of computational and communication cost?
- Work in progress!