Page 1

Pittsburgh Supercomputing Center

What the Users Want
or
The NSF’s Terascale Computing System and Teragrid: Support for Scientific Research
or
Making the Best of It

Mike Levine
Scientific Director, PSC

SOS7 – Durango – 5 Mar 2003

Page 2

Choice of a Title I

The first title, What Do the Users Want, was proposed by Neil. Thank you, Neil.

It is a lovely title and a very important question, but I don’t really understand things here that most of you do not already understand:
• The users want to be able to get their work done!
  – As efficiently as possible, in their time and in machine time
  – On ever-increasing problem sizes
  – (Usually, they are not picky about the method.)

Page 3

Choice of a Title II

The second title, The NSF’s Terascale Computing System and Teragrid: Support for Scientific Research, was my choice.

Dan Reed might do a better job on this subject but went for something more relevant to this meeting.

Perhaps not being as wise as he, I will say a few words on this subject.

Then, I will abuse Neil’s hospitality and move to another topic.

Page 4

NSF’s Terascale Computing Systems

• I have already introduced the TCS in the panel “Machines Already Operational”.
• TCS was meant to very substantially increase the size of machine open to US scientists.
  – This it has done.
• Soon to be joined by DTF (the Distributed Terascale Facility)
  – Described, yesterday, by Dan Reed.
  – A very large IBM/IA64 Linux cluster
  – Distributed between NCSA and SDSC, with
  – Additional capabilities at ANL and Caltech.
  – Interconnected by a multi-lambda network, 30 Gb/s to each site.

Page 5

Teragrid

• Join TCS to DTF
  – Upgrade network to be routed and extensible
  – Extend it to PSC and into TCS
  – Begin, shortly, to incorporate additional sites & resources.
• A basis for a National CyberInfrastructure
  – To “revolutionize our efforts in scientific research”
  – Incorporate computation, data-intensive work, visualization, instruments, diverse facilities.
  – Lots of software effort to provide a uniform, distributed environment for users.

One of the new resources is EV7/Marvel.

Page 6

EV7 (cpu)/Marvel (system)

“… the greatest scientific processor, ever…” (Bill Camp, yesterday)

• Pre-production systems at PSC, CEA, … for several months.
  – Jean Gonnord, yesterday, mentioned CEA’s work on EV7.
  – (He has a substantial body of benchmark information, to be summarized below.)
• Production systems are now shipping (2 at PSC).
• PSC is building up towards 2 systems
  – ~250 processors
  – ~1/2 TB memory/system
  – #1: NSF: large memory, high bw, SMP; ETF resource.
  – #2: NIH: all of the above, specific data-intensive applications.
• We believe that Marvel will supply a good deal of What the Users Want.
• In addition to science output, we hope to
  – Learn more about the application value of high-bandwidth systems.
  – Encourage vendors to match it or do better.

Page 7

NIH Marvel:

• Partner with four world leaders in three diverse fields:
  – Eric Lander and the Whitehead group (genomics)
  – Michael Klein et al. from the University of Pennsylvania (structural biology)
  – Klaus Schulten et al. from the University of Illinois (structural biology)
  – Terrence Sejnowski et al., UCSD and PSC (neuroscience)
• They present compelling examples of data-, memory-, and compute-intensive problems that can only realistically be attacked with the proposed architecture.

Page 8

Alpha EV7/Marvel

• Alpha EV7 = Alpha EV68 core (1 GHz)
  + on-chip 2D torus SMP interconnect
  + huge IO & memory bandwidth (per CPU)
    – 12.8 GB/s (=6B/f! recall Buddy’s chart; ES45=2 GB/s)
  + low memory latency
    – 80ns, local (ES45=140ns)
• Marvel = 2-128 proc SMP’s [HP has yet to promise systems >64p]
  – Low intra-SMP memory latency (~350 ns, furthest node)
  – Large aggregate memory (global, up to 8 GB/proc)
• 8P early test system testing at PSC
  – Multi-week tests (local & remote users, incl ORNL)
  – Multiple applications (& OS’s)
  – Excellent McCalpin Streams performance
• 2*16P production systems now at PSC
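The “6B/f” (bytes per flop) figure can be roughly reproduced from the numbers on this slide. The sketch below assumes a peak of two floating-point operations per clock cycle; that rate is not stated here, though it is consistent with the “640 Mflops (29% of peak)” at 1.1 GHz quoted in the CEA tests. The function and variable names are illustrative only.

```python
# Back-of-envelope check of the bytes-per-flop figure.
# ASSUMPTION: 2 floating-point operations per clock cycle (not stated on the slide).
FLOPS_PER_CYCLE = 2


def bytes_per_flop(mem_bw_gb_s: float, clock_ghz: float,
                   flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Memory bandwidth divided by peak floating-point rate."""
    peak_gflops = clock_ghz * flops_per_cycle
    return mem_bw_gb_s / peak_gflops


# Figures taken from the slide: 12.8 GB/s per CPU at 1 GHz.
ev7_bpf = bytes_per_flop(12.8, 1.0)

# Ratios of the EV7 and ES45 figures quoted on the slide.
bandwidth_ratio = 12.8 / 2.0    # EV7 vs ES45 memory bandwidth
latency_ratio = 140.0 / 80.0    # ES45 vs EV7 local memory latency

print(f"EV7: {ev7_bpf:.1f} bytes per flop")
print(f"bandwidth ratio {bandwidth_ratio:.1f}, latency ratio {latency_ratio:.2f}")
```

Under that assumption the result lands slightly above 6 bytes per flop, which matches the rounded figure on the slide.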

Page 9

EV7 – The System is the Silicon…

Building a System…

EV7 + I/O + Memory = SYSTEM!

Memory

Page 10

Direct processor-processor interconnects (16P Torus)
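To give a feel for why a torus keeps the furthest node close, here is a minimal sketch that counts hops on a 2D torus. It assumes the 16 processors are laid out as a plain 4 x 4 grid with wraparound links in both dimensions; the slide does not give the actual wiring, so treat the layout and all names as hypothetical.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


# ASSUMPTION: 16 processors arranged as a 4 x 4 grid.
DIMS = (4, 4)
nodes = list(product(range(DIMS[0]), range(DIMS[1])))
origin = nodes[0]  # node (0, 0); a torus is symmetric, so any node would do

# Worst-case and average distance from one processor to the others.
max_hops = max(torus_hops(origin, n, DIMS) for n in nodes)
mean_hops = sum(torus_hops(origin, n, DIMS) for n in nodes[1:]) / (len(nodes) - 1)

print(f"max hops: {max_hops}, mean hops: {mean_hops:.2f}")
```

Changing `DIMS` shows how the worst-case distance grows with system size under the same assumption.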

Page 11

128P Partitionable System using 8P Building Block Drawers

• Up to 128 processors

• Up to 4TB memory

• Loads of IO (PCI-X & AGP)

Page 12

EV7 tests (November 2001)

1.1 GHz, 1 GHz memory

TERA
  – 175^3: 640 Mflops (29% of peak)
  – 100^3: 660 Mflops

PUMA
  – 272 s (EV68@833: 405 s)
  – 1.48 times better than ES45 (clock ratio is 1.32)

PUMA with MPI, 16 processors (800 MHz)
  – ES45: 51.384 s
  – EV7: 33 s (ratio: 1.56)

Jean Gonnord, CEA/DAM
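The point of quoting the clock ratio next to the speedup is that the gain exceeds what the faster clock alone would explain. A small sketch of that arithmetic follows, using only the figures on this slide; the variable names are illustrative, and the "per-clock gain" is simply the speedup divided by the clock ratio, not a quantity the slide itself reports.

```python
# Figures copied from the slide.
ev7_seconds = 272.0    # PUMA on EV7 at 1.1 GHz
ev68_seconds = 405.0   # PUMA on EV68 at 833 MHz

speedup = ev68_seconds / ev7_seconds
clock_ratio = 1100 / 833

# Part of the speedup not explained by the faster clock.
per_clock_gain = speedup / clock_ratio

# PUMA with MPI on 16 processors: ES45 time over EV7 time.
mpi_speedup = 51.384 / 33.0

# Peak rate implied by "640 Mflops (29% of peak)".
implied_peak_mflops = 640 / 0.29

print(f"speedup {speedup:.3f}, clock ratio {clock_ratio:.3f}, "
      f"per-clock gain {per_clock_gain:.3f}")
print(f"MPI speedup {mpi_speedup:.3f}, implied peak {implied_peak_mflops:.0f} Mflops")
```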

Page 13

EV7 tests

A

            EV7@1100   EV68@833   ratio
500x192        45.11     77.633    1.72
1000x384      198.57      442.1    2.23

V

            EV7@1100   EV68@833   ratio
D 700            759        272    2.79
P 700            677        266    2.55
D 1000           755        236    3.20
P 1000           668        230    2.90

Jean Gonnord, CEA/DAM

Page 14

Choice of a Title III:

Let’s Make the Best of It

• An important topic drifted in and out of several talks yesterday, but was not given direct attention.
  – Bill Camp mentioned it as “reliability”.
  – Dan Reed mentioned it as “carefully engineered” clusters.
  – Dieter said we had heard enough of “fault tolerance”. (He, I think, was wrong.)
  – Dan Katz mentioned it under “software management and configuration”.
• It is directly supportive of What the Users Want.

Page 15

Let’s Make the Best of It

• The issue is “system reliability & availability”, hardware and software. “It”, in the title, refers to our lovely, expensive systems.
• More specifically, I refer to issues that I would characterize as “good engineering” and not to theoretical issues (which are, however, more fascinating).
• In contrast to some of the fault tolerance discussion, I suggest there is a fair amount of “low-hanging fruit”
  – requiring but small amounts of effort and
  – having little or no impact on performance.
• Many of the comments in the panel “Machines Already Operational” implied a clear lack of sufficient attention to these types of issues.

Page 16

(at this point I wish I had)

The Cartoon from The New Yorker

A picture of a man leaving Church, pausing at the door to say to the Minister:

“Thank you, Reverend, for not mentioning me by name in your sermon”

(but, I was preparing this during Thomas’ talk, yesterday evening)

Page 17

Whose Baby is This?

• We save bundles by buying commodity components, either raw or from large vendors.
• At the component level, we still benefit from the substantial engineering that went into their design.
• At the system level, however, multiple forms of danger lurk.
• When no one else claims ownership of dealing with these dangers, it is “our baby”.
  – (Applause for the efforts, described today, to strongly influence vendors.)

Page 18

What Forms of Danger?

• Issues of scale:
  – We are integrating these components into systems in many cases well beyond anything imagined by the original designers.
• Issues of style of usage:
  – We are often using these systems in modes different from that intended by the original designers and not understood by them.
  – (e.g. very large-scale, highly synchronized applications may be peculiar to HPTC.)
  – In addition to needing some things that they do not provide, we also often do not need things that they do provide and for which other system compromises have been made.
• These issues are terribly atypical of the vast majority of their customer base. (as has been mentioned, frequently)
• Not dealing with them limits our system scalability.

Page 19

What Don’t We Need?

• We do not run a life support system. We can get along pretty well with the temporary failure of a fair amount of hardware.
  – With increasing scale, attempts to totally prevent failure are insufficient.
  – Then, work to prevent “splatter” is more important than slightly reducing the frequency of failure.
• We do not need “rapid response” to most failures.
  – Once a node goes, the immediate job is toast.
  – Here, too, “containment” is very important.
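The claim that prevention alone stops working at scale can be illustrated with a simple model. The sketch below assumes independent, exponentially distributed node failures, which is a simplification, and the MTBF and job length are made-up numbers for illustration, not measurements from any system discussed here.

```python
import math


def p_job_survives(n_nodes: int, job_hours: float, node_mtbf_hours: float) -> float:
    """Probability that no node fails during the job.

    ASSUMPTION: node failures are independent and exponentially distributed.
    """
    return math.exp(-n_nodes * job_hours / node_mtbf_hours)


# Hypothetical figures, for illustration only.
NODE_MTBF_HOURS = 50_000.0
JOB_HOURS = 12.0

for n in (16, 256, 4096):
    baseline = p_job_survives(n, JOB_HOURS, NODE_MTBF_HOURS)
    # Even a 10% improvement in per-node MTBF barely moves the large-scale case.
    improved = p_job_survives(n, JOB_HOURS, NODE_MTBF_HOURS * 1.1)
    print(f"{n:5d} nodes: {baseline:.3f} -> {improved:.3f}")
```

In this model the survival probability falls off exponentially with node count, so at large scale a modest reduction in failure frequency helps less than making sure one failure takes out only one job.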

Page 20

Examples of Dealing With Such Issues

• If you have not, you should read the RAS requirements in the Red Storm solicitation. (Jim Tompkins mentioned that, briefly)
  – From those requirements, you can readily infer the kinds of problems that Camp, Tompkins et al. are working to avoid.
  – I, at least, was impressed by the level of effort SNL is prepared to expend in this domain.
• At PSC, working with HP, we have implemented continuous monitoring of soft-fault errors.
  – We are now doing true “preventative maintenance”.
  – Contrary to historical practice, analysis need not be done on nodes optimized for computation.
  – Dan Reed mentioned this sort of thing, yesterday.
  – Particularly applicable to disks, memory & network.
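As a rough illustration of what continuous soft-fault monitoring might look like, here is a minimal sketch that counts correctable errors per node and component within a sliding time window and flags anything over a threshold for maintenance. This is not the PSC/HP implementation, which the slides do not describe; the class, node names, window, and threshold are all hypothetical.

```python
from collections import defaultdict


class SoftFaultMonitor:
    """Tracks soft (correctable) error events and flags repeat offenders.

    Hypothetical sketch: the analysis runs off the compute nodes, on whatever
    host collects the error reports.
    """

    def __init__(self, window_s: float, threshold: int):
        self.window_s = window_s      # length of the sliding window, in seconds
        self.threshold = threshold    # events within the window that trigger a flag
        self.events = defaultdict(list)  # (node, component) -> list of timestamps

    def record(self, node: str, component: str, timestamp: float) -> None:
        """Store one soft-error report."""
        self.events[(node, component)].append(timestamp)

    def flagged(self, now: float):
        """Return (node, component) pairs at or over the threshold within the window."""
        out = []
        for key, stamps in self.events.items():
            recent = [t for t in stamps if now - t <= self.window_s]
            if len(recent) >= self.threshold:
                out.append(key)
        return sorted(out)


# Usage with made-up node names and timestamps.
monitor = SoftFaultMonitor(window_s=3600.0, threshold=3)
for t in (10.0, 500.0, 900.0):
    monitor.record("node017", "memory", t)
monitor.record("node042", "disk", 100.0)

to_service = monitor.flagged(now=1000.0)
print(to_service)
```

A node flagged this way could be drained and serviced between jobs, before a soft fault turns into a hard failure that kills a running job.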

Page 21

Solutions

• I am not here to propose a solution.
  – My immediate goal is to call attention to the problem.
• Just as it has become clear that HPC makes special demands on Linux, it also makes special demands on system configuration and operation.
  – Both might benefit from more coordinated attention.
• Perhaps at the next conference, Neil might consider some further attention to this issue.