National Energy Research Scientific Computing Center (NERSC)
NERSC User Group October 4, 2005
Horst Simon and William Kramer, NERSC/LBNL
Science-Driven Computing
NERSC is enabling new science
NERSC Must Address Three Trends
• The widening gap between application performance and peak performance of high-end computing systems
• The recent emergence of large, multidisciplinary computational science teams in the DOE research community
• The flood of scientific data from both simulations and experiments, and the convergence of computational simulation with experimental data collection and analysis in complex workflows
Science-Driven Computing Strategy 2006-2010
Science-Driven Systems
• Balanced and timely introduction of best new technology for complete computational systems (computing, storage, networking, analytics)
• Engage and work directly with vendors in addressing the SC requirements in their roadmaps
• Collaborate with DOE labs and other sites in technology evaluation and introduction
Science-Driven Services
• Provide the entire range of services from high-quality operations to direct scientific support
• Enable a broad range of scientists to effectively use NERSC in their research
• Concentrate on resources for scaling to large numbers of processors, and for supporting multidisciplinary computational science teams
Science-Driven Analytics
• Provide architectural and systems enhancements and services to more closely integrate computational and storage resources
• Provide scientists with new tools to effectively manipulate, visualize and analyze the huge data sets from both simulations and experiments
Impact on Science Mission
• Majority of great science in SC is done with medium- to large-scale resources
• In 2003 and 2004 NERSC users reported the publication of at least 2,206 papers that were partly based on work done at NERSC.
Computational System Strategy
NERSC System Architecture
April 2005
• Symbolic manipulation server (SGI)
• Ethernet 10/100 Megabit; FC disk; STK robots
• ESnet connection: OC 48, 2,400 Mbps
• HPSS: 14 IBM SP servers, 35 TB of cache disk, 8 STK robots, 44,000 tape slots, 24 200-GB drives and 60 20-GB drives, max capacity 9 PB
• PDSF: ~800 processors (peak ~1.25 TFlop/s), ~1 TB of memory, 135 TB of shared disk, Gigabit and Fast Ethernet, Ratio = (0.8, 96)
• NERSC-3 "Seaborg" (IBM SP): 6,656 processors (peak 10 TFlop/s), SSP 1.35 Tflop/s, 7.8 TB of memory, 55 TB of shared disk, Ratio = (0.8, 4.8)
• NCS cluster "jacquard": 650-CPU Opteron, InfiniBand 4X/12X, 3.1 TF peak, 1.2 TB of memory, SSP 0.41 Tflop/s, 30 TB of disk, Ratio = (0.4, 10)
• Visualization server "escher": SGI Onyx 3400, 12 processors, 2 InfiniteReality4 graphics pipes, 24 GB of memory, 5 TB of disk
• Visualization server "Davinci": SGI Altix, 8 processors, 48 GB of memory, 3 TB of disk, Ratio = (0.5, 62)
• Testbeds and servers; HPSS access
• Networking: 10 Gigabit and 1 Gigabit Ethernet, jumbo-frame 1 Gigabit Ethernet
• Ratio = (RAM bytes per flop, disk bytes per flop) (worked example below)
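As an illustration of how these balance ratios are derived, here is a minimal sketch (the helper name is ours, not a NERSC tool): because 1 TB of memory per 1 Tflop/s of peak is 1 byte per flop, each ratio is simply the capacity in TB divided by the peak in Tflop/s. The jacquard numbers above reproduce the (0.4, 10) figure in the diagram.

```python
def balance_ratios(peak_tflops, memory_tb, disk_tb):
    """Return (RAM bytes per flop, disk bytes per flop) for a system.

    1 TB per 1 Tflop/s equals 1 byte per flop, so each ratio is just
    capacity in TB divided by peak performance in Tflop/s.
    """
    return memory_tb / peak_tflops, disk_tb / peak_tflops

# Jacquard (NCS cluster): 3.1 TF peak, 1.2 TB memory, 30 TB disk
ram_per_flop, disk_per_flop = balance_ratios(3.1, 1.2, 30)
print(f"jacquard ratio = ({ram_per_flop:.1f}, {disk_per_flop:.0f})")  # (0.4, 10)
```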
90% Gross Utilization
NERSC Focus is on Capability Computing
[Chart: hours of compute time delivered by job size, binned as 1-63, 64-255, 256-511, 512-1023, 1024-2047, and >2047 CPUs.]
Computational Systems Must Support a Diverse Workload
• NERSC’s priority is capability computing: “The goal [of capability systems] is to solve a large problem or to solve a single problem in a shorter period of time. Capability computing enables the solution of problems that cannot otherwise be solved in a reasonable period of time … also enables the solution of problems with real-time constraints (e.g., intelligence processing and analysis). The main figure of merit is time to solution.” – NRC Report on Supercomputing
 – Working definition: the use of one-tenth or more of an entire computing resource over an extended time period (see the sketch after this list)
 – Includes INCITE and SciDAC projects
• Large-scale computing
 – The use of significant computational resources over an extended time period
• Interactive and analysis computing
 – Uses significant amounts of memory and I/O bandwidth for interactive analysis with modest scalability
• A modest amount of capacity computing that is related to the capability and large-scale applications: “Smaller and cheaper systems … where smaller problems are solved. Capacity computing can be used to enable parametric studies or to explore design alternatives; it is often needed to prepare for more expensive runs on capability systems. … The main figure of merit is sustained performance per unit cost.” – NRC Report on Supercomputing
 – Working definition: computing that is comparable to running on a desktop system for a week
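To ground the capability working definition, a quick back-of-the-envelope illustration (the threshold helper below is ours, not NERSC policy code): one-tenth of Seaborg's 6,656 processors is roughly 665 CPUs, which places capability jobs in the 512-1023 CPU bin or above in the utilization chart.

```python
# Minimal sketch: what "one-tenth or more of an entire computing resource"
# means for Seaborg (6,656 processors, from the architecture slide above).
SEABORG_PROCESSORS = 6_656

capability_threshold = SEABORG_PROCESSORS // 10   # ~665 CPUs
print(f"Capability jobs on Seaborg use >= {capability_threshold} CPUs")
```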
Computational Systems: Technology Choices
• Completely commodity supercomputers built using off-the-shelf processors developed for workstations and commercial applications and connected by off-the-shelf networks
– Examples are NERSC’s PDSF system, the Virginia Tech Big Mac cluster, the UCB Millennium cluster, and a host of integrated solutions
• Custom supercomputers use specialized processors and interconnects, providing high-bandwidth interconnects and processor-memory interfaces
 – Examples are the Cray X1 and the NEC SX-8
• Hybrid supercomputers combine commodity processors with custom high-speed interconnects and/or accelerators to enhance science performance
 – Examples include the ASCI Red Storm, Cray T3E, SGI Altix, and IBM SP
 – System customization: adjust the system configuration without employing any new/unique technologies
  • Example: the Blue Planet node’s 8-way IH systems instead of 64-way
  • Requires careful attention to and understanding of code bottlenecks, and that those bottlenecks can be fixed by adding or swapping existing hardware
 – Technology customization: specialized/custom modification of embedded components for special value-added accelerators
  • Example: ViVA
  • Requires a long-term commitment by partners, given that product development cycles typically require a 2- to 5-year lead time
• Rather than three distinct categories, this taxonomy represents a spectrum of choices
Computational Systems: Strategy
• The total annual investment in computational systems will remain approximately one-third of NERSC’s annual funding
• Lease-to-own payments for a major system will be spread over three years
 – Technology availability may dictate a phased introduction
• NERSC uses the "best value" process
 – Allows considerable flexibility for NERSC and provides an opportunity for significant innovation by suppliers
 – One key metric is the Sustained System Performance (SSP) metric, which is based on benchmark performance integrated over three years
 – The Effective System Performance (ESP) test assesses system-level efficiency, namely the ability of the large-scale system to deliver a large fraction of its potential resources to the users
 – NERSC will use a set of benchmark kernels and full applications to assess systems
 – NERSC is exploring advanced modeling methods for its applications to project the performance of systems as well as to guide the Science-Driven System Architecture efforts
Sustained System Performance (SSP) Test
• NERSC focuses on the area under the measured performance curve
 – SSP is responsible for assuring delivered performance
• SSP is conservative, so most applications do better
 – To achieve the required performance, NERSC-3 has a 22% higher peak performance than planned
 – The higher final result benefits the community for the long term
[Chart: Peak vs. SSP for NERSC-3, Oct-99 through Apr-02 (months since installation). Left axis: SSP GFlop/s (0-350), showing measured SSP and planned SSP; right axis: peak TFlop/s (0-6), showing planned and actual peak. Annotations: software lags hardware; test/configuration, acceptance, etc.]

SSP = Measured Performance × Time
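A minimal sketch of that integration (illustrative numbers and a made-up helper, not the actual NERSC procurement calculation): delivered capability is the benchmark-measured SSP rate integrated over the time it is available, so a system delivered in phases accumulates SSP TF-years piecewise.

```python
def ssp_tf_years(phases):
    """Integrate measured SSP over time.

    phases: list of (ssp_tflops, years_available) tuples, e.g. a system
    delivered in stages or upgraded mid-contract.
    Returns total delivered performance in SSP TF-years.
    """
    return sum(ssp * years for ssp, years in phases)

# Illustrative only: a system running at 1.0 SSP TF for its first year,
# then at 1.35 SSP TF (Seaborg's measured SSP) for the remaining two
# years of a three-year evaluation window.
print(ssp_tf_years([(1.0, 1.0), (1.35, 2.0)]))  # 3.7 SSP TF-years
```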
Effective System Performance (ESP) Test
• The test uses a mix of NERSC test codes that run in a random order, exercising standard system scheduling
 – There are also full-configuration codes, I/O tests, and typical system administration activities
• Independent of hardware and compiler optimization improvements
• The test measures both how much and how often the system can do scientific work
[Diagram: ESP test schedule. Jobs of p_i CPUs each running for time t_i, including full-configuration jobs, are packed onto a system of P CPUs over an elapsed time T, plus a shutdown-and-boot period S.]

Effectiveness = (Σ_{i=1..N} p_i · t_i) / [P · (S + T)], where p_i and t_i are the CPU count and runtime of job i, P is the total number of CPUs, T is the elapsed time of the test, and S is the shutdown-and-boot time.
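A minimal sketch of that effectiveness calculation (illustrative code and numbers, not the actual ESP harness): each job contributes its CPU count times its runtime, and the total is divided by the whole machine's CPU-time over the test window, including shutdown and boot.

```python
def esp_effectiveness(jobs, total_cpus, elapsed_time, boot_time):
    """ESP effectiveness = sum(p_i * t_i) / (P * (S + T)).

    jobs: list of (p_i, t_i) pairs, the CPUs and runtime of each job
    total_cpus: P, the number of CPUs in the system
    elapsed_time: T, elapsed time of the test
    boot_time: S, shutdown-and-boot time
    """
    delivered = sum(p * t for p, t in jobs)
    return delivered / (total_cpus * (boot_time + elapsed_time))

# Illustrative numbers only: a 1,024-CPU system running a job mix
# over a 10-hour test with a 1-hour shutdown/boot period.
jobs = [(256, 4.0), (512, 3.0), (1024, 2.0), (64, 8.0)]
print(f"Effectiveness = {esp_effectiveness(jobs, 1024, 10.0, 1.0):.2f}")
```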
Computational Systems: Cost Projections
[Chart: SSP price/performance trends. $M per SSP TF (log scale, $0.1M to $10,000M), Jan-97 through Jan-09, showing the average $M/SSP and the selected system’s $M/SSP TF.]

[Chart: Peak price/performance trends. $M per peak TF (log scale, $0.1M to $100M), Jan-97 through Jan-09, showing the average $M/peak TF and the selected system’s $M/peak TF.]

[Chart: Peak price/performance trends by architecture class. Average $M/peak TF for hybrid, commodity, custom, and hybrid-commodity systems, plus the selected system.]

[Chart: SSP price/performance trends by architecture class, Mar-97 through Nov-09. Average $M/SSP TF for hybrid, custom, and hybrid-commodity systems, a commodity trend line, and the selected system.]
                       NERSC-5    NERSC-6
Total system cost      $35M       $35M
$ per SSP Mflop/s      $6-8       $1.5-2.5
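The dollars-per-SSP figures follow directly from the projected system cost and the SSP targets given elsewhere in the plan (NERSC-5 at ~4-6 SSP Tflop/s, NERSC-6 at ~15-20 SSP Tflop/s). A minimal sketch of that arithmetic (the helper is illustrative only):

```python
def dollars_per_ssp_mflops(cost_millions, ssp_tflops):
    """Dollars per sustained Mflop/s.

    Cost is in millions of dollars and 1 Tflop/s is a million Mflop/s,
    so the factors of 10^6 cancel: $M per SSP Tflop/s is numerically
    equal to $ per SSP Mflop/s.
    """
    return cost_millions / ssp_tflops

# NERSC-5: $35M at ~4-6 SSP Tflop/s -> roughly $6-$9 per SSP Mflop/s
print(dollars_per_ssp_mflops(35, 6), dollars_per_ssp_mflops(35, 4))
# NERSC-6: $35M at ~15-20 SSP Tflop/s -> roughly $1.75-$2.35 per SSP Mflop/s
print(dollars_per_ssp_mflops(35, 20), dollars_per_ssp_mflops(35, 15))
```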
Computational Systems: Cost Projections
NERSC Sustained System Performance
[Chart: baseline SSP TeraFlop/s by fiscal year, 2002 through 2010 (axis 0-18); values for FY2005 onward are estimates.]

37 Tflop/s-years of sustained (SSP) performance between 2005 and 2010
Science-Driven System Architecture
Science-Driven System Architecture Goals
• The broadest large-scale application base runs very well on SDSA solutions, with excellent sustained performance per dollar
• Even applications that do well on specialized architectures could perform near-optimally on SDSA architectures
Science-Driven System Architecture Goals
• Collaboration between scientists and computer vendors on science-driven system architecture is the path to continued improvement in application performance
• Create systems that best serve the entire science community
• Vendors are not knowledgeable about current and future algorithmic methods
 – When SDSA started, system designers were working with algorithms that were 10 years old
  • They did not consider sparse matrix methods or 3D FFTs in the design of CPUs
• NERSC staff and users represent the scientific application community
• Active collaboration with other groups: DARPA, NGA, etc.
• Early objectives:
 – ViVA-2 architecture development (Power6 scientific application accelerator)
 – Additional investigation of other architectures
 – Lower interconnect latency and large spanning
• Long-term objectives:
 – Integrate lessons from large-scale systems, such as the Blue Gene/L and HPCS experiments, with other technologies, into a hybrid system for petascale computing
• SDSA applies to all aspects of NERSC, not just parallel computing
 – Facility-wide file systems
NERSC Expertise is Critical to SDSA Process
• Samples of algorithms and computational science successes
 – Numerical and system libraries: SuperLU, ScaLAPACK, MPI-2, parallel netCDF
 – Applications and tools: ACTS Toolkit
 – Programming languages: UPC
 – System software: Linux checkpoint/restart, VIA protocol
 – Mixed-mode programming studies
 – APDEC/NASA Computational Technology Center
• Performance evaluation and analysis
 – LBNL staff includes authors of widely used evaluation tools: the NAS Parallel Benchmarks (NPB), the Sustained System Performance (SSP) benchmark, and the Effective System Performance (ESP) benchmark
 – The Performance Evaluation Research Center (PERC), a multi-institution SciDAC project funded by DOE
 – Tuning and analysis of dozens of applications on NERSC scalar and vector systems
• Architecture evaluation and design
 – Multi-application study of the Earth Simulator
 – Other studies of the Cray X1, NEC SX-6, IBM BlueGene/L, Cray Red Storm, Tera MTA, Sun Wildfire, etc.
 – Collaborators on architecture design projects, including Blue Planet
 – Clusters: UCB Millennium and NOW, Processor in Memory (PIM)
 – RAID
 – HPSS
Facility-Wide File System
Data Divergence Problem
[Diagram: today, each computational system has its own compute nodes, gateway nodes, and file system nodes joined by an internal interconnect, with each system’s file system reachable from the others only over the LAN/WAN.]
The memory divergence problem is masking the data divergence problem
Colliding Black Holes – 2-5 TB files for each time step
Facility-Wide File System
[Diagram: with a facility-wide file system, the gateway, compute, and file system nodes of all NERSC systems share a common file system over the LAN/WAN, rather than each system maintaining its own separate storage.]
Archive Storage and Networking
NERSC Storage Growth
Increase: 1.7x per year in data growth
45 million files makes NERSC one of the largest sites
Mass Storage: Projected Data Growth and Bandwidth Requirements, 2005-2010

Year   Total Archived Data   Data Transfers per Day   Transfer Rate
2005   1.5 PB                6 TB                     60 MB/s
2006   2.9 PB                10 TB                    120 MB/s
2007   5.0 PB                20 TB                    231 MB/s
2008   11.0 PB               33 TB                    392 MB/s
2009   21.0 PB               64 TB                    749 MB/s
2010   38.0 PB               117 TB                   1,356 MB/s
Up to 6 TB per day this year
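A minimal sketch of the compound growth implied by the projection table above (illustrative only; it simply takes the 2005 and 2010 endpoints of each column):

```python
# Compound annual growth factor implied by the 2005 and 2010 endpoints
# of the mass-storage projection table above.
def annual_growth_factor(start, end, years):
    return (end / start) ** (1.0 / years)

print(f"Archived data:   {annual_growth_factor(1.5, 38.0, 5):.2f}x per year")
print(f"Daily transfers: {annual_growth_factor(6, 117, 5):.2f}x per year")
print(f"Transfer rate:   {annual_growth_factor(60, 1356, 5):.2f}x per year")
```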
Mass Storage: Bandwidth Needs As Well
• A transfer of a 372 GB file takes 3 to 4 hours at today’s 30 MB/s
 – Striping the data across three tape drives shortens that to about 1 hour (see the sketch below)
• FY05 brings new-technology tape drives with transfer rates of 120 MB/s, so access time will be 5 minutes
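A minimal sketch of the transfer-time arithmetic behind the first bullet (illustrative helper, decimal units assumed): 372 GB at 30 MB/s is about 3.4 hours, and striping across three such drives brings that to roughly 1.1 hours.

```python
def transfer_hours(file_gb, drive_mb_per_s, num_drives=1):
    """Hours to move a file at a given aggregate tape-drive bandwidth.
    Assumes decimal units (1 GB = 1000 MB) and ideal striping."""
    seconds = file_gb * 1000 / (drive_mb_per_s * num_drives)
    return seconds / 3600

print(f"{transfer_hours(372, 30):.1f} h")                # one 30 MB/s drive: ~3.4 h
print(f"{transfer_hours(372, 30, num_drives=3):.1f} h")  # striped over 3 drives: ~1.1 h
```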
Networking Strategy
NERSC Data Storage and WAN Network Speed
[Chart: HPSS data stored (petabytes, 0.01-100, log scale, left axis) and NERSC-to-ESnet WAN speed (gigabits/sec, 0.01-100, log scale, right axis), by year from 1998 through 2010.]
Networking
• NERSC will join the Bay Area Metropolitan Area Network (BA MAN) in 2005
 – One 10 Gb/s link for production traffic
 – A second 10 Gb/s link for projects, dedicated high-bandwidth services, and testing
• NERSC and ESnet are working to deploy a Quality of Service (QoS) capability that will permit high-priority network traffic to receive dedicated bandwidth
 – Expand to allow dynamic provisioning of circuits across both Abilene and ESnet
 – “Bandwidth corridors” could support real-time processing of experimental data
• End-to-end support
 – NERSC expertise in troubleshooting and optimizing data transfers between remote user sites and NERSC resources
 – NERSC is one of the few sites that troubleshoots problems end to end and actively engages with service providers and site networking staff
What is Analytics?
• The science of reasoning
 – Generate insight and understanding from large, complex, disparate, sometimes conflicting data
• Visual analytics: the science of reasoning facilitated by visual interfaces
• Why visual?
 – High bandwidth through the human visual system
 – Better leverage human reasoning, knowledge, intuition, and judgment
• The intersection of: visualization, analysis, scientific data management, human-computer interfaces, cognitive science, statistical analysis, reasoning, …
NERSC’s Analytics Strategy
• Objective: improve scientific productivity by increasing analytics capabilities and capacity for the NERSC user community
• Several key strategy elements: scientific data management, visualization, analysis, support, and integrated activities
 – Understand user needs in analytics
 – Provide the visualization and analysis tools needed to realize analytics capabilities
 – Increase the capability and capacity of NERSC’s data management infrastructure
 – Support distributed computing (analytics) activities
 – Support the proliferation and use of analytics capabilities in the NERSC user community
2007
• Ethernet 10/100/1,000 Megabit; FC disk; STK robots
• HPSS: 100 TB of cache disk, 8 STK robots, 44,000 tape slots, max capacity 44 PB
• PDSF: ~1,000 processors, ~1.5 TF, 1.2 TB of memory, ~300 TB of shared disk, Ratio = (0.8, 20)
• NERSC-5: SSP ~4-6 Tflop/s
• Visualization and post-processing server (SGI): 64 processors, 0.4 TB of memory, 60 TB of disk
• NCS-b: SSP ~0.7-0.8 Tflop/s, 2 TB of memory, 70 TB of disk, Ratio = (0.25, 9)
• NCS cluster "jacquard": 650-CPU Opteron, InfiniBand 4X/12X, 3.1 TF, 1.2 TB of memory, SSP 0.41 Tflop/s, 30 TB of disk, Ratio = (0.4, 10)
• NERSC-3 "Seaborg" (IBM SP): 6,656 processors (peak 10 TFlop/s), SSP 1.35 Tflop/s, 7.8 TB of memory, 55 TB of shared disk, Ratio = (0.8, 4.8)
• GUPFS storage fabric
• External connection: OC 192, 10,000 Mbps
• Networking: 10 Gigabit and jumbo-frame 10 Gigabit Ethernet
• Testbeds and servers
• Ratio = (RAM bytes per flop, disk bytes per flop)
2009
• Ethernet 10/100/1,000 Megabit; FC disk; STK robots
• HPSS: 1,000 TB of cache disk, 8 STK robots, 44,000 tape slots, max capacity 150 PB
• PDSF: 41,000 processors (peak 833 GFlop/s), 4 TB of memory, 2,000 TB of shared disk, Ratio = (0.8, 96)
• NERSC-6: SSP ~15-20 Tflop/s
• NERSC-5: SSP ~4-6 Tflop/s
• NCS-c: ~13 TF, SSP ~3 Tflop/s
• NCS-b: SSP ~0.7-0.8 Tflop/s, 2 TB of memory, 70 TB of disk, Ratio = (0.25, 9)
• Visualization and post-processing server (SGI): 100 processors, 4 TB of memory
• GUPFS storage fabric
• External connection: OC 768, 40,000 Mbps
• Networking: 40 Gigabit and jumbo-frame 40 Gigabit Ethernet
• Testbeds and servers
• Ratio = (RAM bytes per flop, disk bytes per flop)
The Primary Messages
• Hybrid systems are the most likely choice of computational system for the diverse NERSC workload over this time period
• The Science-Driven System Architecture process is critical to the success of NERSC
• NERSC has methods for choosing systems that best serve the entire community and maximize scientific productivity
• NERSC provides a balanced facility with storage, networking, and support systems
Science-Driven Computing Strategy 2006-2010