
You have exascale problems?
◦ Load Balancing?
◦ Failure?
◦ Power Management?

My system software will solve these problems

System Software: It Slices, Dices, and makes Julienne Fries!

Coordinated checkpointing to the traditional parallel file system won’t scale

Checkpoint commit time approaches node MTBF => application efficiency drops quickly

Example: Fault Tolerance
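The efficiency drop can be sketched with Young's first-order approximation for the checkpoint interval. This is an illustrative model, not the analysis behind the slide: the function names are hypothetical, node failures are assumed independent so that system MTBF is node MTBF divided by node count, and the approximation is only meaningful while the commit time stays well below the system MTBF.

```python
import math


def system_mtbf_from_nodes(node_mtbf_s, nodes):
    """System MTBF in seconds, assuming independent node failures."""
    return node_mtbf_s / nodes


def young_interval(commit_time_s, mtbf_s):
    """Young's first-order checkpoint interval: sqrt(2 * commit * MTBF)."""
    return math.sqrt(2.0 * commit_time_s * mtbf_s)


def first_order_efficiency(commit_time_s, mtbf_s):
    """Fraction of time doing useful work under the first-order model.

    Waste has two terms: time spent writing checkpoints (commit / interval)
    and expected rework after a failure (interval / (2 * MTBF)).
    Restart time is ignored in this sketch.
    """
    tau = young_interval(commit_time_s, mtbf_s)
    waste = commit_time_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    node_mtbf_s = 5 * 365 * 24 * 3600.0  # assumed 5-year node MTBF
    commit_s = 600.0                     # assumed 10-minute commit time
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_from_nodes(node_mtbf_s, nodes)
        print(nodes, round(first_order_efficiency(commit_s, mtbf), 3))
```

With the commit time held fixed, the modeled efficiency falls as the node count grows and the system MTBF shrinks toward the commit time.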

Each MPI process runs twice; the application only fails if both processes in a rank fail

Handle full MPI semantics at scale

rMPI: Replicated (not Ronco™) MPI

Ferreira, et al. SC 2011.
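Why running each rank twice helps can be seen with a small Monte Carlo sketch. It is a toy model and not rMPI itself: it assumes exactly two replicas per rank on distinct nodes, no repair of failed nodes, and that each failure strikes a uniformly random surviving node. All function names are hypothetical.

```python
import random


def failures_until_interrupt(ranks, rng):
    """Count node failures until some rank has lost both of its replicas."""
    alive = list(range(2 * ranks))  # node i hosts a replica of rank i // 2
    degraded = set()                # ranks that have already lost one replica
    failures = 0
    while True:
        # Pick a random surviving node and remove it in O(1).
        idx = rng.randrange(len(alive))
        alive[idx], alive[-1] = alive[-1], alive[idx]
        node = alive.pop()
        failures += 1
        rank = node // 2
        if rank in degraded:
            return failures         # second replica lost: job is interrupted
        degraded.add(rank)


def mean_failures_until_interrupt(ranks, trials=2000, seed=0):
    """Average number of node failures a replicated job absorbs."""
    rng = random.Random(seed)
    total = sum(failures_until_interrupt(ranks, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for ranks in (10, 1_000, 100_000):
        print(ranks, mean_failures_until_interrupt(ranks, trials=200))
```

Without replication the first node failure interrupts the job. In this model the replicated job absorbs more failures as the rank count grows, at the price of twice the nodes.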

Your machine power budget and hardware acquisition budget (*)

Act now, and you’ll get twice the capacity computing functionality for FREE!

(*) plus contracting and granting

Only two low, low payments of…

Costs and benefits are really easy to understand
◦ Large and node-scalable improvement in system mean time to interrupt (MTTI)
◦ Using it as the primary fault tolerance technique means twice the power consumption on capability problems
◦ Buying twice the number of nodes is also quite painful

SC13 Panel: “Replication is too expensive… We [as a community] will have failed if we can't do better than that.” – Marc Snir

What are you trying to sell me?

Department of Computer Science

Everything’s A Nail

Why you don’t want system software to solve your problems (if you can help it)

Patrick G. [email protected]

April 22, 2014

Theorem: Every individual complete system-level solution to an application exascale problem is “too expensive” for some real workload

Rationale
◦ OS doesn’t know your application
◦ General solutions are expensive
◦ Specialized solutions have limited power or applicability

I have a hammer!

Save us, vendors!
◦ Adding reliability on the compute and control path is potentially hardware-intensive
◦ How much to pay in transistors, power, and $$?
◦ While stepping off the commodity price/performance curve…

Burst Buffers
◦ How much budget to spend on the I/O system?
◦ Memory is a scarce resource at exascale
◦ NVRAM and network bandwidth aren’t free in power
◦ Some nice recent work in this area

“Simple” Resilience Solutions

Idea: Each node checkpoints when most convenient and out of sync with other nodes

Benefit: get checkpointing off the peak B/W curve onto the sustained B/W curve

Has some (low) obvious costs, some less obvious costs

Asynchronous Checkpointing
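The peak versus sustained bandwidth argument reduces to simple arithmetic, sketched below. The model is idealized and the names are hypothetical: it assumes every node writes the same checkpoint size, that a coordinated checkpoint must drain within a fixed commit window, and that staggered writes spread perfectly across the checkpoint interval with no contention.

```python
def peak_bandwidth(nodes, ckpt_bytes, commit_window_s):
    """Bytes/s the I/O system must absorb when all nodes checkpoint at once."""
    return nodes * ckpt_bytes / commit_window_s


def sustained_bandwidth(nodes, ckpt_bytes, interval_s):
    """Bytes/s needed when the same data is spread over the whole interval."""
    return nodes * ckpt_bytes / interval_s


def staggered_offsets(nodes, interval_s):
    """Evenly spaced checkpoint start times, one per node (idealized)."""
    return [i * interval_s / nodes for i in range(nodes)]


if __name__ == "__main__":
    nodes, ckpt = 10_000, 16e9          # assumed: 16 GB checkpoint per node
    window, interval = 300.0, 3600.0    # assumed: 5-minute window, hourly checkpoints
    print(peak_bandwidth(nodes, ckpt, window) / 1e9, "GB/s coordinated")
    print(sustained_bandwidth(nodes, ckpt, interval) / 1e9, "GB/s staggered")
```

In this model the required bandwidth falls by the ratio of the interval to the commit window. The sketch does not capture the less obvious costs, such as keeping enough state to recover nodes that checkpointed at different times.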

Asynchronous checkpointing approaches are highly application-dependent

[Figure panels: Apps and Benchmarks; Proxy Applications]

Ferreira, et al. In submission.

• Note how bimodal these performance curves are!
• Clustered asynchronous checkpointing may hold promise here

Checkpoint-avoidance Systems

Levy, et al. In submission.

Cheap and powerful is here

No one inexpensive technique is enough, but each solves part of the problem

System software must stop trying to “rescue” the application and work with the application
◦ Application/runtime can cover part of the space
◦ System software can provide “last resort” solutions when the application cannot easily recover
◦ Right solution application and hardware dependent
◦ Like it is for linear solvers and load balancing

Not just a resilience issue

Still have to solve the problems

Characterization of techniques at scale

Continued development of new techniques

Good decision support
◦ Yet more knobs someone needs to turn
◦ Many of the tradeoffs are non-linear, stochastic, etc.
◦ Different problem areas interact “interestingly”
◦ Complex influence on acquisition decisions, too

Clean interfaces to runtime and application
◦ “From a runtime developer’s perspective, the way that current operating systems manage resources is fundamentally broken” – Mike Bauer, Legion project

What do we need to enable this?

Linux (like OSF/1) will solve all your problems for you
◦ Whether you like it or not
◦ While making sure you can’t do the things you (think you) should do
◦ Which is fine, as long as you don’t need to do anything interesting

Current OSes are Helicopter Parents

Runtimes: “…it is the OS's job to provide mechanism and stay out of the way…”

Sandia lightweight kernels: “The QK provides mechanism, PCT encapsulates policy”

Go ahead and try – if you fall, I’ll catch you

Exascale OS must be your partner

Applications more complex than when the LWK was originally designed
◦ Users want more complex interfaces and services
◦ Runtimes still want low-level hardware access
◦ But we still have to provide some level of isolation
◦ As well as backstop mechanisms in cooperation with hardware

Two predominant approaches:
◦ Composite OS (Fused OS, MAHOS, Argo OS/R, etc.)
◦ Virtualization (Kitten+Palacios VMM, Hobbes OS/R)

Lightweight OSes for Next-generation Systems

Safe low-level hardware access for runtime systems

Supports bringing your own OS with you

Don’t have to muck with the insides of Linux

Can be very fast

Why Virtualization?

HPCC FFT over virtualized 10GbE

CTH on Palacios/Kitten on Red Storm

Virtualization in Hobbes OS/R

Multiple virtualization architectures, not just one

Pick the point on the spectrum that provides the mechanisms your application/runtime needs

Interesting research challenges on the right mechanisms and interfaces to provide at and between each point

[Figure: spectrum of virtualization architectures, from Lightest Weight to Heaviest Weight]
◦ LWK, Virtual Linux Environment (Kitten, CNK)
◦ LWK, Custom (Catamount, HybridVM)
◦ Fused OS, Multiple native OSes (Pisces, Argo)
◦ Para-virtual, Implicit: VMM Changes Guest OS (Gears, Guarded Modules)
◦ Para-virtual, Explicit: Guest OS Modified or Augmented (Orig. Xen, Device Drivers)
◦ Full HW VM: Runs Unmodified Guest OSes, Passthru (Palacios, KVM, …)
◦ Software Virt: Emulate HW, Binary Translation, … (Qemu, Vmware, Emulate HW TransMemory pre-product)

Assumption is that the runtime (and/or virtualized OS) will do this for the LWK

Is a semi-static policy + local (HW or runtime) adaptation sufficient?

Or global dynamic adaptive runtime system that sets policy and resource allocation for millions of cores?
◦ With low overhead and application interference?
◦ “Burning a core” probably not viable at this problem size?
◦ Heuristics vs. more disciplined methods?

I want to believe but I have yet to see it
◦ Distributed, Decentralized
◦ Must be robust and efficient
◦ Can we tolerate imperfect and unfair?

Now that I’ve punted on policy…

No, the application and runtime really shouldn’t expect the OS to rescue it

System software can and should provide a range of modest, inexpensive mechanisms
◦ Which can backstop app when it can’t rescue itself
◦ Need well-quantified performance for techniques
◦ On real legacy and next-generation workloads

Virtualization can give the runtime the low-level mechanisms it wants inexpensively

Conclusions

Colleagues, collaborators and students on this work
◦ UNM: Dorian Arnold, Scott Levy, Cui Zheng
◦ Sandia: Ron Brightwell, Kurt Ferreira, Kevin Pedretti, Patrick Widener
◦ Northwestern: Peter Dinda, Lei Xia
◦ Oak Ridge: Barney Maccabe
◦ Pittsburgh: Jack Lange

Acknowledgements

This work was supported in part by:

◦ DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs

◦ Sandia National Labs including funding from the Hobbes project, which is funded by the 2013 Exascale Operating and Runtime Systems Program from the DOE Office of Science, Advanced Scientific Computing Research

◦ Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000

◦ U.S. National Science Foundation Awards CNS-0709168 and CNS-0707365

Acknowledgements (cont’d)
