Workload Benchmarks that Scale in Multiple Dimensions

John L. Gustafson
Sun Labs HPC Workload Characterization and System Analysis Team

November 25, 2002


Goal

Create a suite of purpose-based benchmarks representative of HPC, which adjust in several dimensions to match real workloads.

The description must be architecture-independent and language-independent.

Conjecture: This approach will yield improved predictive methods that are relatively invariant as technology evolves.

What Does it Mean to “Scale” a Workload?

• Performance = Work/Time.
• Work is usually undefined in benchmarks, so it’s “fixed” to avoid the issue.
• FLOPS are not work. Not in 2002, anyway.
• Multiple instances (small, medium, large) don’t make a workload “scalable.”
• True scalability requires an objective function.

Latency Tradeoffs Drive Need for New HPC Approaches

• ILLIAC IV, 1970: latency = 1000 nsec
• SGI O2000, 1996: latency = 800 nsec
• Sun SF15K, 2001: latency = 400 nsec

…but traditional complexity analysis counts operations and ignores memory latency!

[Figure: time for one memory access and time for one operation, plotted against year (1950 to 2010) on a time scale running from 1 ns to 1 ms.]

Moore’s Law near limit for microprocessors, and this time we mean it!

• The clock can’t make it across a 20 mm die at current GHz rates.
• Either we go to multiple cores or use Non-Uniform Cache Access (NUCA).
• This is a physical limit, not technological.

[Source: Chuck Moore, UT-Austin]

[Figure: panels labeled 130 nm, 100 nm, 70 nm, and 35 nm.]

Bandwidth Burns Energy; Maybe Our Measure of Work is… Joules?

Operation                         Energy (130 nm, 1.2 V)
32-bit ALU operation              5 pJ
32-bit register read              10 pJ
Read 32 bits from 8K RAM          50 pJ
Move 32 bits across 10 mm chip    100 pJ
Move 32 bits off chip             1300 to 1900 pJ

[Source: Bill Dally, Stanford]
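To make the "work as joules" idea concrete, the sketch below totals the energy of an operation mix using the per-operation figures quoted on this slide. The dictionary keys, the function name `energy_pj`, and the example mix are illustrative assumptions, and the off-chip cost uses the low end (1300 pJ) of the quoted range.

```python
# Energy per 32-bit operation in picojoules, from the slide (130 nm, 1.2 V).
# Assumption: the off-chip entry uses the low end of the 1300-1900 pJ range.
ENERGY_PJ = {
    "alu_op": 5,
    "register_read": 10,
    "ram_read_8k": 50,
    "move_across_chip": 100,
    "move_off_chip": 1300,
}

def energy_pj(mix):
    """Total energy in pJ for a dict mapping operation name -> count."""
    return sum(ENERGY_PJ[op] * count for op, count in mix.items())

# Hypothetical mix: one ALU operation whose two operands come from off chip.
mix = {"alu_op": 1, "move_off_chip": 2}
total = energy_pj(mix)
print(total)                         # total picojoules for the mix
print(ENERGY_PJ["alu_op"] / total)   # fraction of the energy spent on arithmetic
```

In this mix nearly all of the energy goes to data movement, which is the point the slide title makes about bandwidth.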

Purpose-Based Benchmarks

A purpose-based benchmark states an objective function that has direct interest to humans.

An activity-based benchmark states computer operations to be performed, usually defined by source code.

Either can be made to scale in multiple dimensions, but it’s harder to do with activity-based benchmarks.

Means-Based vs Ends-Based Metrics

MEANS-BASED                 ENDS-BASED
Flop/s                      Time to Compute Answer
Bytes of RAM                Detail, Content of Answer
Number of Processors        Feasible Problems to Attempt
Use of Commodity Parts      Cost, Availability of System
Word Size                   Closeness to Actual Physics
ECC Memory                  Reliability of Answer
Speedup                     Product Line Range

Two-Way Taxonomy Examples

Axes: Activity-Based vs. Purpose-Based; Fixed-Size vs. Scalable.

Benchmarks placed on the grid:

• LINPACK
• NAS Parallel
• SPEC (any year)
• LINPEAK
• SLALOM
• Streams, LMBench
• NSA suites
• Many RFP tests
• TPC-x, ECPerf
• HINT (?)
• Sun’s HPC Suite

Example of Prediction Failure

[Figure: scatter plot. Horizontal axis: LINPACK speed relative to IBM 3090 (0 to 8). Vertical axis: 0.0 to 2.5.]

• Nonlinear
• Non-monotone
• Low correlation

Peak FLOPS as Inverse Predictor

[Figure: scatter plot. Horizontal axis: peak advertised performance, GFLOPS (0 to 15). Vertical axis: 0.0 to 2.5.]

• Negative correlation
• Hardware imbalance

Teraflops advocates, please note!

Why F.P. Op Counts Make Poor Workload Definitions

HIGHER FLOP/S RATES             FASTER ANSWERS
Explicit Timestepping           Implicit Timestepping
Conventional Matrix Multiply    Strassen, Winograd Methods
Cholesky Decomposition          PC Conjugate Gradient
All-to-All N-Body Methods       Barnes-Hut, Greengard
Successive Over-Relaxation      Multigrid
Time-Domain Operators           FFTs
Recompute Gaussian Integrals    Compute Once and Store
Material Property Function      Table Look-Up

Peak Bandwidth Sometimes Works Surprisingly Well

• High correlation
• Monotone (4 points)
• Nonlinear, though

[Figure: scatter plot. Horizontal axis: peak memory bandwidth, GB/sec (0 to 80). Vertical axis: 0.0 to 2.5.]

An HPC Taxonomy

• Electronic Design
• Nuclear Applications
• Mechanical Design
• Mechanical Engineering
• Radar Cross-Section
• Crash Simulation
• Fluid Dynamics
• Weather, Climate modeling
• Signal/Image Processing
• Encryption
• Life Sciences
• Financial Modeling
• Petroleum

For each of these, we seek to match the following, to avoid the “toy benchmark” problem:

• Problem size
• Data locality
• Predominant data type
• Dynamic behavior in time
• Spatial irregularity
• Demands on I/O

Some Invariant Workload Dimensions

[Figure: workload categories plotted against workload dimensions. Labels: Data Parallelism; Real-Time; Physical Simulation; Math; Boolean; Small Discrete; Large Discrete; Low-precision continuous; Highly data-parallel; No locality of reference; Workload Category.]

Example of Workload Dimension: Data Parallelism (estimated)

Data-parallel content, from 100% (top) to 0% (bottom):

• Gene-matching, SETI, factoring large integers
• Vibrational analysis, ray-tracing
• Dense matrix multiply, factoring; convolution and filtering
• Eulerian fluid dynamics (decomposed spatially)
• N-Body problems
• Stress-strain analysis with finite elements; crash testing
• Lagrangian fluid dynamics (decomposed by fluid element)
• “Easy” databases (conflict-free, few updates)
• FFTs (frequency-space filtering)
• Particle simulations with imposed fields, game-tree exploration
• Circuit simulation
• Economic simulations
• Typical database applications; transaction processing

Prior Art: HINT
English Description of Scalable Task

Divide the unit square into 2^i by 2^i rectangles. Bound the integral of (1–x)/(1+x) using that resolution and hierarchical interval subdivision. Use only the knowledge that the function is monotone decreasing.

[Figure: the unit square, axes from 0 to 1, with regions marked as follows.]

• Known to contribute to lower bound
• Limited by arithmetic precision
• Available for further refinement
• Known not to contribute to upper bound
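The task description above can be illustrated with a small sketch. It is a simplified floating-point version, not the HINT reference code: it bounds the integral on subintervals of [0, 1] using only monotonicity, and always splits the subinterval with the largest gap between its bounds. The function names, the split count, and the use of 1/(upper - lower) as the quality measure are assumptions made for this sketch.

```python
import heapq
import math

def f(x):
    """The integrand from the task description; monotone decreasing on [0, 1]."""
    return (1.0 - x) / (1.0 + x)

def hint_bounds(splits):
    """Bound the integral of f over [0, 1] after `splits` subdivisions.

    Uses only monotonicity: on [a, b], f(b)*(b-a) <= integral <= f(a)*(b-a).
    Each step splits the subinterval whose bounds differ the most.
    """
    # Heap entries are (-gap, a, b); heapq is a min-heap, so the gap is negated.
    heap = [(-1.0 * (f(0.0) - f(1.0)), 0.0, 1.0)]
    for _ in range(splits):
        _, a, b = heapq.heappop(heap)
        mid = 0.5 * (a + b)
        for lo, hi in ((a, mid), (mid, b)):
            gap = (f(lo) - f(hi)) * (hi - lo)
            heapq.heappush(heap, (-gap, lo, hi))
    lower = sum(f(b) * (b - a) for _, a, b in heap)
    upper = sum(f(a) * (b - a) for _, a, b in heap)
    return lower, upper

lower, upper = hint_bounds(1000)
quality = 1.0 / (upper - lower)     # grows as the bounds tighten
exact = 2.0 * math.log(2.0) - 1.0   # closed form, used here only as a check
print(lower, exact, upper, quality)
```

Because the amount of subdivision is a free parameter, the same task can run for a microsecond or for minutes, which is what lets one description scale across machines.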

Region of Computation as Workload Scales

[Figure: log-log plot. Horizontal axis: time in seconds (1e-06 to 100). Vertical axis: 1e+05 to 1e+09. The region of computation is bounded by three limits: limited by latency; limited by "peak speed"; limited by precision or memory.]

Serial Systems Used to Test Model

System   MHz   CPU         Primary Cache   Secondary Cache   RAM     Operating System   Compiler
Indy 1   133   R4600       16 KB           0.5 MB            64 MB   IRIX 6.2           Mips C
Indy 2   100   R4400       16 KB           1.0 MB            64 MB   IRIX 6.2           Mips C
Indy 3   133   R4600       16 KB           None              64 MB   IRIX 6.2           Mips C
Indy 4   100   R4600       16 KB           None              64 MB   IRIX 6.2           Mips C
Indy 5   200   R4000       16 KB           1.0 MB            64 MB   IRIX 6.2           Mips C
PC       200   Pent. Pro   16 KB           0.5 MB            64 MB   Linux              gcc

Curve Crossings Predict Inconsistent Rankings

[Figure: one curve per system (Pentium Pro PC, SGI Indy 1, SGI Indy 5, SGI Indy 3, SGI Indy 4, SGI Indy 2). Horizontal axis: memory in bytes (100 to 100M, log scale). Vertical axis: 0 to 0.7.]

Note the obvious cache regions.

Shared Memory Systems Tested

System          MHz   CPU      Primary Cache   Secondary Cache   RAM      Operating System   Compiler
SGI Challenge   194   R10000   32 KB           0.5 MB            0.5 GB   IRIX 6.2           Mips C
SGI Onyx        194   R10000   32 KB           1.0 MB            0.5 GB   IRIX 6.2           Mips C
Cray C90        250   Vector   None            None              2.0 GB   UNICOS             Cray C

Here, Differences are Not Subtle!

[Figure: one curve per system (Cray C90, SGI Challenge, SGI Onyx). Horizontal axis: memory in bytes (100 to 1G, log scale). Vertical axis: 0 to 4.0.]

Parallel Systems Tested

System     Number of Processors   MHz   CPU         Pri. Cache   Sec. Cache   RAM            Op. System   Compiler
SGI Onyx   1, 2, 4, 8             194   R10000      32 KB        2.0 MB       2 GB (8-way)   IRIX 6.2     Mips C
nCUBE 2S   16, 32, 64, 128        25    custom      -            -            4 MB/node      Vertex       cc
Cray T3D   32, 64                 150   Alpha EV4   32 KB        -            64 MB/node     UNICOS       Cray C 4.0.3.5
IBM SP-2   8, 16, 32, 64, 128     67    RS6000      256 KB       -            128 MB/node    AIX 3        mpcc

HINT Curves for Parallel Systems

[Figure: log-log plot, one curve per configuration: SGI Onyx (8), IBM SP2 (8), IBM SP2 (64), IBM SP2 (128), Cray T3D (32), nCUBE 2S (256). Horizontal axis: time in seconds (0.0001 to 1). Vertical axis: 1 to 100.]

SPECint Correlation with HINT

Benchmark   HINT Sample   Correlation   HINT Sample   Rank Corr.
go          27 KB         0.9964        468 KB        0.996
m88ksim     39 KB         0.9985        164 KB        0.996
gcc         180 KB        0.9970        514 KB        0.996
compress    290 KB        0.9990        164 KB        0.996
li          17 KB         0.9947        164 KB        0.996
ijpeg       198 KB        0.9986        164 KB        0.992
perl        36 KB         0.9989        164 KB        0.979
vortex      180 KB        0.9985        514 KB        perfect
SPECint     180 KB        0.9996        514 KB        perfect
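The two statistics reported here, correlation and rank correlation, can be computed as in the sketch below. It assumes the usual definitions (Pearson correlation, and Pearson correlation of the ranks); the slides do not state which formulas were used, and the five-machine data set is hypothetical, not taken from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """Rank values from 1 (smallest); assumes no ties, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def rank_correlation(xs, ys):
    """Correlation of the ranks: 1.0 means the two orderings agree exactly."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical scores for five machines (not data from the talk).
hint_mquips = [2.0, 3.5, 5.0, 7.5, 9.0]
app_score = [15.0, 27.0, 36.0, 58.0, 66.0]
print(pearson(hint_mquips, app_score))
print(rank_correlation(hint_mquips, app_score))
```

A rank correlation of exactly 1.0 corresponds to the entries marked "perfect": the HINT sample orders the machines the same way the application does, even when the relationship is not exactly linear.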

SPECfp Correlation with HINT

Benchmark   HINT Sample   Correlation   HINT Sample   Rank Corr.
tomcatv     net           0.9990        4963 KB       0.996
swim        net           0.9998        4963 KB       0.983
su2cor      893 KB        0.9970        net           0.996
hydro2d     1080 KB       0.9983        4963 KB       0.996
mgrid       29 KB         0.9977        net           0.996
applu       net           0.9997        4963 KB       0.992
turb3d      16 KB         0.9994        net           0.996
apsi        1080 KB       0.9979        net           0.992
fpppp       29 KB         0.9933        16 KB         0.996
wave5       893 KB        0.9962        4101 KB       perfect
SPECfp      180 KB        0.9984        4101 KB       0.996

SPEC Linear Fits

SPECint95 = 7.5 × (MQUIPS at 180 KB), within 15%
SPECfp95 = 11. × (MQUIPS at 893 KB), within 23%

Time to run SPEC is about 8 hours.
Time to run HINT is about 10 minutes.
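As a worked example of using these fits, the sketch below turns a HINT reading into an estimated SPEC score with its stated tolerance band. The coefficient 11.0 is an assumption (the slide shows "11."), and the function name and the 2.0 MQUIPS reading are hypothetical.

```python
# Coefficients as read from the slide; "11." is taken as 11.0 (an assumption).
SPECINT95_PER_MQUIPS = 7.5   # applied to MQUIPS sampled at 180 KB
SPECFP95_PER_MQUIPS = 11.0   # applied to MQUIPS sampled at 893 KB

def estimate(mquips, slope, tolerance):
    """Return (low, center, high) for a linear fit with a relative tolerance."""
    center = slope * mquips
    return center * (1 - tolerance), center, center * (1 + tolerance)

# Hypothetical HINT reading of 2.0 MQUIPS at the 180 KB sample point.
low, center, high = estimate(2.0, SPECINT95_PER_MQUIPS, 0.15)
print(low, center, high)
```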

Heuristic App Profile: hydro2d

[Figure: horizontal axis: sample HINT memory point (10^2 to 10^8, log scale). Vertical axis: 0.965 to 1.000.]

Worst Prediction: FT, from NPB

[Figure: plot against NetQUIPS. Axis ranges: 0 to 350 and 0 to 0.8.]

Scalability Allows >0.995 Correlation with Other Benchmarks

[Figure: benchmarks positioned along a memory axis (memory in bytes, 100 to 10^8, log scale). Labels: Fhourstone, Dhrystone, Tower of Hanoi, Queens, Fibonacci, etc.; Whetstone; LINPACK 100×100; SPECint; SPECfp; LINPACK 1000×1000; Stream.]

You know those “EnergyGuide” stickers you see on refrigerators, water heaters, air conditioners, etc.…?

Wouldn’t it be great if they had something like this at retail computer stores?

[Figure: mock rating label in the style of an EnergyGuide sticker, with the following text.]

WINTEL PERSONAL COMPUTER, 64 MEGABYTES MAIN MEMORY, MODEL 9600

32-BIT INTEGER RATING
THIS MODEL: 15.6 net MQUIPS
June 1998 model with lowest performance: 3.2
June 1998 model with highest performance: 17.9

ONLY SINGLE-USER PERSONAL COMPUTERS ARE USED ON THIS SCALE

ESTIMATES ARE BASED ON THE HINT® PERFORMANCE MEASURE

Your performance will vary depending on how you use the product, and on how you modify it by installing software. This test is based on patented, scalable methods developed by a federal laboratory.

Estimated performance over a range of uses
How fast will this model run different types of problems?

[Inset plot: curves for 64-bit floating point and 32-bit integer. Horizontal axis: time for a computing task (1 microsecond to 1 second). Vertical axis: 0.0 to 1.6.]

Ask your salesperson for information about the needs of your application.

Benchmark #1: Truss Design

TASK: Given a point that must support a 100 ton load and three attachment points, find the geometry of struts and cables that creates the lightest structure. Each node requires 1 kg of steel. Structure must support its own weight.

[Figure: a 100 ton load hangs from the point (L,y0,z0), which is L meters from a wall carrying the three attachment points (0,y1,z1), (0,y2,z2), and (0,y3,z3).]

Strut Design-continued

[Figure: axis label: Stiffness (strength).]

More complex structures have higher strength. The benchmark initially sets the topology, and then perturbs the xyz positions of node points to optimize the resulting total mass required.

Strut Algorithm (Preprocess)

Read problem geometry and N = number of vertices from nonvolatile storage.
Iterate until there are N vertices:
    Sort edges (cables or struts) by length.
    Bisect longest edge, creating new vertex.
    Adjust vertex position to make it non-collinear.
    Add two non-coplanar edges between new vertex and neighbor vertices.
    Compute new lengths of edges.
Save entire mesh description to nonvolatile storage.

Strut Algorithm (Inner Solver)

For each edge:
    For each edge that touches the same vertex:
        Project edge vectors to this edge to obtain A_ij.
Compute external force from edge weights + load.
Solve linear system such that vertex forces = 0.
Compute the required cross-sections of cables and struts.
Return the total weight of the truss.
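The solve-then-size steps can be shown on the smallest possible case. The sketch below is an assumption-laden miniature, not the benchmark's solver: a planar two-member bracket with one loaded node, solved by Cramer's rule, with member self-weight ignored. The function names and the `stress` and `density` parameters are hypothetical.

```python
import math

def two_bar_forces(length, height, load):
    """Axial forces (tension positive) in a symmetric two-bar bracket.

    Loaded node at (length, 0); attachments at (0, +height) and (0, -height);
    the load acts in the -y direction. Solves the 2x2 system A t = b that
    makes the net force on the loaded node zero.
    """
    d = math.hypot(length, height)
    # Columns of A are unit vectors from the loaded node toward each attachment.
    a11, a12 = -length / d, -length / d   # x components
    a21, a22 = height / d, -height / d    # y components
    b1, b2 = 0.0, load                    # member forces must cancel (0, -load)
    det = a11 * a22 - a12 * a21
    t1 = (b1 * a22 - a12 * b2) / det      # Cramer's rule
    t2 = (a11 * b2 - a21 * b1) / det
    return t1, t2, d

def bracket_mass(length, height, load, stress, density):
    """Mass of both members, each sized so that |force| / area equals `stress`."""
    t1, t2, d = two_bar_forces(length, height, load)
    return sum(density * (abs(t) / stress) * d for t in (t1, t2))

t1, t2, d = two_bar_forces(4.0, 3.0, 6.0)
print(t1, t2)   # upper member in tension, lower member in compression
print(bracket_mass(4.0, 3.0, 6.0, stress=1.0, density=1.0))
```

In the full benchmark the matrix has one row per force component at every vertex, and the external force includes the weight of the members themselves, so the system must be re-solved as cross-sections change.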

Strut Optimization-Point Method

Iterate until time limit reached:
    Pick a strut (randomly or sequentially).
    Vary xyz coordinates of a vertex, adjusting the lengths of the edges that connect to it.
    Modify equations and re-solve.
    If truss weight is reduced, keep the modification;
    Else restore original mass distribution (or use annealing).
Report best solution found for structure.

This imitates the actions of a human engineer exploring a design space. Note the possibility of massive parallelism at the job level; each processor can try a different variation of the structure. The best solution found is then shared globally and used as a new starting point.
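The loop above is an accept-if-better random search. A minimal sketch follows, with two stated substitutions: a fixed trial count stands in for the time limit so that runs are repeatable, and a toy quadratic stands in for the truss weight returned by the inner solver. All names are hypothetical.

```python
import random

def point_optimize(objective, start, step, trials, seed=0):
    """Accept-if-better random search over a list of coordinates."""
    rng = random.Random(seed)
    best = list(start)
    best_value = objective(best)
    for _ in range(trials):
        candidate = list(best)
        i = rng.randrange(len(candidate))          # pick one coordinate
        candidate[i] += rng.uniform(-step, step)   # perturb it
        value = objective(candidate)
        if value < best_value:                     # keep only improvements
            best, best_value = candidate, value
        # Otherwise the candidate is dropped, which restores the old design.
    return best, best_value

# Hypothetical stand-in for "total truss weight", with its minimum at (1, -2).
def toy_weight(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2 + 3.0

best, value = point_optimize(toy_weight, [0.0, 0.0], step=0.5, trials=2000)
print(best, value)
```

For job-level parallelism, each processor could run this loop with its own seed and periodically adopt the best design found by any of them.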

Strut Optimization-Interval Method

Set initial bounds on vertex positions (can be very conservative).
Iterate until time limit reached:
    Subdivide M-dimensional space of vertex positions into subregions.
    For each subregion:
        Compute the truss weight, as an interval bound.
        Share bounds globally to exclude subregions from search.
Report best solution (range) found for structure.

This replaces trial-and-error with rigorous exclusion of infeasible sets. Massive parallelism is easy; each processor can try a different subspace.
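The exclusion idea can be sketched in one dimension. The code below is an illustration under assumptions: a toy objective with a hand-written interval bound replaces the truss weight, a fixed number of rounds replaces the time limit, and all names are hypothetical.

```python
def bound_objective(lo, hi):
    """Interval bound on the toy objective (x - 1)^2 + 3 over [lo, hi]."""
    ends = ((lo - 1.0) ** 2, (hi - 1.0) ** 2)
    smallest = 0.0 if lo <= 1.0 <= hi else min(ends)
    return smallest + 3.0, max(ends) + 3.0

def interval_search(lo, hi, rounds):
    """Bisect surviving subregions and discard those that cannot hold the
    minimum. Returns the surviving boxes and bounds on the optimal value."""
    boxes = [(lo, hi)]
    for _ in range(rounds):
        halves = []
        for a, b in boxes:
            mid = 0.5 * (a + b)
            halves += [(a, mid), (mid, b)]
        bounds = [bound_objective(a, b) for a, b in halves]
        best_upper = min(upper for _, upper in bounds)   # value shared globally
        # Exclude any subregion whose lower bound exceeds the best upper bound.
        boxes = [box for box, (lower, _) in zip(halves, bounds)
                 if lower <= best_upper]
    bounds = [bound_objective(a, b) for a, b in boxes]
    value_range = (min(lower for lower, _ in bounds),
                   min(upper for _, upper in bounds))
    return boxes, value_range

boxes, (value_lo, value_hi) = interval_search(-4.0, 6.0, rounds=20)
print(boxes, value_lo, value_hi)
```

Unlike the point method, the result is a range that is guaranteed to contain the optimum, and the only value processors need to exchange is the best upper bound.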

Strut Optimization Dimensions

• Number of vertices (if too many, fewer trials)
• Number of trials (if too many, fewer vertices)
• Type of solver (iterative, direct)
• Precision of solver (match to workload!)
• Search strategy (point, point parallel, interval exclusion)

Benchmark #2: Radiosity

[Figure: problem geometry, including the occluder; dimension labels as extracted: 3, 45, 1, 1, 1, 2, 4.]

TASK: Given the geometry above and three 1 by 1 diffuse light sources, find the placement of the light sources that results in the most even illumination of the bottom surface. All surfaces are Lambertian reflectors. Reflectivity of the top surface, lights, and vertical surfaces is 0.95; reflectivity of the bottom surface and the occluder is 0.70. Figure of merit = brightest/darkest ratio.


Radiosity-continued

Surfaces are subdivided into patches. Using more patches gives a better result. Point method uses Monte Carlo and only subdivides the bottom surface once; interval method uses an iterative solver and recursively subdivides all surfaces.

Radiosity Solver (Point)

Read problem geometry and N = number of patches from nonvolatile storage.
Subdivide bottom surface into N patches.
Until all patches have three-sigma confidence:
    Fire a random photon from a light source.
    Track reflections using probabilities until photon is absorbed.
    If absorbed in a bottom surface patch, increment histogram.
Find maximum and minimum photon counts.
Save bottom surface radiosities to nonvolatile storage.
Compute ratio.

Note: Highly parallel if care is used to create independent random number generation. Easy tests for occlusion compared to interval method.
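The photon loop and the point about independent random streams can be sketched with a toy tracer. This is a deliberately geometry-free illustration, not the benchmark: the probability `p_bottom` is a made-up stand-in for form factors, a fixed photon count replaces the three-sigma stopping rule, and distinct seeds are a simple stand-in for properly independent generators.

```python
import random

def trace_photons(n_photons, n_patches, seed, p_bottom=0.25,
                  rho_bottom=0.70, rho_other=0.95):
    """Toy photon tracer with no real geometry.

    Each bounce lands on the bottom surface with probability `p_bottom`;
    a photon is absorbed with probability 1 - reflectivity of the surface
    it hits. Absorptions on the bottom surface are tallied per patch.
    """
    rng = random.Random(seed)          # one private stream per worker
    counts = [0] * n_patches
    for _ in range(n_photons):
        while True:
            on_bottom = rng.random() < p_bottom
            rho = rho_bottom if on_bottom else rho_other
            if rng.random() >= rho:    # absorbed on this bounce
                if on_bottom:
                    counts[rng.randrange(n_patches)] += 1
                break
    return counts

# Two "workers" with different seeds; their histograms are simply added.
parts = [trace_photons(20000, 8, seed) for seed in (1, 2)]
counts = [a + b for a, b in zip(*parts)]
ratio = max(counts) / min(counts)      # figure of merit: brightest / darkest
print(counts, ratio)
```

Because each worker owns its generator and its histogram, no synchronization is needed until the final sum.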

Radiosity Solver (Interval)

Read problem geometry from nonvolatile storage.
Create initial subdivision into large rectangles.
Set up form factor matrix, and initial bounds on radiosity.
Until all patch intervals are 1% of the lightest-darkest range:
    Subdivide the patch with the largest uncertainty.
    Use radiosity equation (contractive mapping) to find new bounds.
    Compute lightest-darkest range on bottom surface.
Save subdivision geometry and patch ranges to nonvolatile storage.
Compute ratio.

Note: Parallel if asynchronous updating of radiosity is allowed (runs will not be exactly repeatable but will always converge). Closed-form expressions for form factors exist for this problem.
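The contractive-mapping step can be illustrated with ordinary (point) arithmetic. The sketch below repeatedly applies the standard radiosity equation B = E + ρ F B to a hypothetical two-patch scene until the iterates stop changing; an interval version would carry a lower and an upper value for each patch through the same update. The form factors and emission values are made up, not derived from the benchmark geometry.

```python
def solve_radiosity(emission, reflectivity, form_factors, tol=1e-12):
    """Iterate B = E + rho * (F B) until successive iterates agree to `tol`."""
    n = len(emission)
    b = list(emission)
    while True:
        new_b = [
            emission[i] + reflectivity[i] *
            sum(form_factors[i][j] * b[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_b, b)) < tol:
            return new_b
        b = new_b

# Hypothetical two-patch scene: patch 0 emits light, patch 1 only reflects.
E = [1.0, 0.0]
rho = [0.70, 0.95]
F = [[0.0, 0.5],
     [0.5, 0.0]]   # made-up form factors
B = solve_radiosity(E, rho, F)
print(B)
```

The iteration converges because every reflectivity is below 1, so each pass shrinks the remaining error; that same property is why asynchronous updates still converge.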

Radiosity Optimization-Point

Iterate until time limit reached:
    Pick a light source coordinate.
    Vary it by some small amount, like 0.1 meter.
    Re-solve the radiosity problem using the point method.
    Compare radiation evenness on bottom surface:
        If ratio is closer to 1.0, keep the modification;
        Else restore original light position.
Report best solution found for position of lights.

As before, this allows job parallelism if information about the search is shared by all processors after each run.

Radiosity Optimization-Interval

Set initial bounds on light positions (must be on ceiling, disjoint).
Iterate until time limit reached:
    Subdivide the 6-dimensional space of light positions.
    For each subset of the search space:
        Bound the ratio of lightest/darkest surface patch.
        Share bounds globally to exclude subregions from search.
Report best solution (range) found for light positions.

Parallelism exists at the job level and within each solver step.

Benchmark #3: Life Sciences (Proteomics)

Given a sequence of N peptides and a time limit T, find the minimum energy conformation of the peptide sequence.

Figure of merit: N, or N/T

This approaches protein folding as N grows.Answer validity can be tested against experiment.

Currently, N~55 is the frontier.

Benchmark #4: Electronic Design

Design an N-bit adder with carry look-ahead in a given process technology (for example, 0.10 micron, FO4, 6-layer Cu interconnect). Simulate with a cycle-based simulator for a complete set of test vectors. Optimize to minimize clock cycle and chip area.

Figure of merit: Clock speed or area.

This captures both integer (logical) and floating-point (analog) aspects of electronic design.

Benchmark #5: Financial Modeling

Generate real-time market behavior drawn from historical data to drive workload. Execute trades based on estimates of future value for N financial instruments over a period T.

Objective function: Profit!

Benchmark #6: Weather Modeling

Generate real-time weather behavior drawn from historical data to drive workload. Predict weather (temperature, precipitation, cloud cover, pressure, wind speed) for N days in advance.

Objective function: Minimum total log(error)/time.

Benchmark #7: Petroleum Reservoir Management

Given a geological structure containing oil, water, gas, and a set of M injector wells and N extraction wells, position the wells to maximize the total oil and gas extracted over a period of time T.

Objective function: Maximum fuel extracted.


SUMMARY

Scalability is much easier if the workloads are purpose-based.

Multiple dimensions of scalability arise naturally as adjustable parameters.

Predictive value looks promising based on prior experience with HINT.

We will share our HPC workload benchmarks with the HPC community when completed.