Suitability of Alternative Architectures for Scientific Computing in 5-10 Years
LDRD 2002 Strategic-Computational Review, July 31, 2001
PIs: Xiaoye Li, Bob Lucas, Lenny Oliker, Katherine Yelick
Others: Brian Gaeke, Parry Husbands, Hyun Jin Kim, Hyn Jin Moon

Page 1: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

LDRD 2002 Strategic-Computational Review

July 31, 2001

PIs: Xiaoye Li, Bob Lucas, Lenny Oliker, Katherine Yelick

Others: Brian Gaeke, Parry Husbands, Hyun Jin Kim, Hyn Jin Moon

Page 2: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

B. Gaeke, P. Husbands, X. Li, R. Lucas, L. Oliker, K. Yelick

Outline

• Project Goals
• FY01 progress report
  • Benchmark kernels definition
  • Performance on IRAM, comparisons with “conventional” machines
• Management plan
• Funding opportunities in future

Page 3: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Motivation and Goal

• NERSC-3 (now) and NERSC-4 (in 2-3 years) consist of large clusters of commodity SMPs. What about 5-10 years from now?
• Future architecture technologies:
  • PIM (e.g., IRAM, DIVA, Blue Gene)
  • SIMD/Vector/Stream (e.g., IRAM, Imagine, Playstation)
  • Low power, narrow data types (e.g., MMX, IRAM, Imagine)
• Feasibility of building large-scale systems:
  • What will the commodity building blocks (nodes and networks) be?
• Driven by NERSC and DOE scientific applications codes.
  • Where do the needs diverge from big market applications?
• Influence future architectures

Page 4: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Computational Kernels and Applications

• Kernels
  • Designed to stress memory systems
    • Some taken from the Data Intensive Systems Stressmarks
  • Unit and constant stride memory
    • Transitive-closure
    • FFT
    • Dense, sparse linear algebra (BLAS 1 and 2)
  • Indirect addressing
    • Pointer-jumping, Neighborhood (Image), sparse CG
    • NSA Giga-Updates Per Second (GUPS)
  • Frequent branching as well as irregular memory access
    • Unstructured mesh adaptation
• Examples of NERSC/DOE applications that may benefit:
  • Omega3P, accelerator design (SLAC; AMR and sparse linear algebra)
  • Paratec, material science package (LBNL; FFT and dense linear algebra)
  • Camille, 3D atmospheric circulation model (preconditioned CG)
  • HyperClaw, simulate gas dynamics in AMR framework (LBNL)
  • NWChem, quantum chemistry (PNNL; global arrays and linear algebra)
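To make the indirect-addressing category concrete, a GUPS-style update loop can be sketched as below. This is a minimal illustration, not the benchmark code from the project: the function name `gups_updates`, the table size, the XOR update rule, and the use of Python's `random` module in place of the benchmark's own generator are all assumptions.

```python
import random

def gups_updates(table_size, num_updates, seed=0):
    """Apply random read-modify-write updates to a table (GUPS-style access)."""
    rng = random.Random(seed)          # hypothetical stand-in for the benchmark's RNG
    table = list(range(table_size))    # table[i] starts out equal to i
    for _ in range(num_updates):
        r = rng.getrandbits(64)
        idx = r % table_size           # random index: an indirect, cache-unfriendly access
        table[idx] ^= r                # XOR update (assumed update rule)
    return table

table = gups_updates(table_size=1024, num_updates=4096)
```

Each iteration touches an unpredictable location, so the achieved update rate mostly reflects memory-system behaviour rather than arithmetic throughput.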

Page 5: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

VIRAM Overview (UCB)

[Figure: chip floorplan, 14.5 mm x 20.0 mm]

• MIPS core (200 MHz)
  • Single-issue, 8 Kbyte I&D caches
• Vector unit (200 MHz)
  • 32 64b elements per register
  • 256b datapaths (16b, 32b, 64b ops)
  • 4 address generation units
• Main memory system
  • 12 MB of on-chip DRAM in 8 banks
  • 12.8 GBytes/s peak bandwidth
• Typical power consumption: 2.0 W
• Peak vector performance
  • 1.6/3.2/6.4 Gops without multiply-add
  • 1.6 Gflops (single-precision)
• Same process technology as Blue Gene
  • But for a single chip, for multimedia
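The peak figures on this slide can be cross-checked with a little arithmetic. The sketch below is a reading of the slide rather than a statement from it: it assumes the three Gops figures correspond to 64-, 32- and 16-bit elements, that two arithmetic units issue in parallel, and that one of them handles floating point. The constants `ARITH_UNITS` and `FP_UNITS` are chosen because they reproduce the quoted numbers.

```python
CLOCK_HZ = 200e6        # vector unit clock, from the slide
DATAPATH_BITS = 256     # datapath width, from the slide
ARITH_UNITS = 2         # assumption: two arithmetic units issue in parallel
FP_UNITS = 1            # assumption: one unit handles floating point

def peak_gops(element_bits, units):
    """Peak giga-operations per second for a given element width."""
    elements_per_cycle = DATAPATH_BITS // element_bits
    return CLOCK_HZ * elements_per_cycle * units / 1e9

# Integer peak for 64-, 32- and 16-bit elements.
peak = {bits: peak_gops(bits, ARITH_UNITS) for bits in (64, 32, 16)}

# Single-precision floating-point peak.
single_precision_gflops = peak_gops(32, FP_UNITS)

# Bytes delivered per cycle at the quoted 12.8 GBytes/s peak bandwidth.
bytes_per_cycle = 12.8e9 / CLOCK_HZ
```

Under these assumptions the sketch gives 1.6, 3.2 and 6.4 Gops, 1.6 Gflops, and 64 bytes of memory bandwidth per cycle.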

Page 6: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Status of IRAM Benchmarking Infrastructure

• Improved the VIRAM simulator.
  • Refining the performance model for double-precision FP performance.
  • Making the backend modular to allow for other microarchitectures.
• Packaging the benchmark codes.
  • Build and test scripts plus input data (small and large data sets).
  • Added documentation.
• Prepare for final chip benchmarking
  • Tape-out scheduled by UCB for 9/01.

Page 7: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Media Benchmarks

• FFT uses in-register permutations, generalized reduction
• All others written in C with Cray vectorizing compiler

[Figure: media benchmark performance; y-axis in GOPS, 0 to 4]

Page 8: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Power Advantage of PIM+Vectors

• 100x100 matrix vector multiplication (column layout)
• Results from the LAPACK manual (vendor optimized assembly)
• VIRAM performance improves with larger matrices!
• VIRAM power includes on-chip main memory!

[Figure: MFLOPS and MFLOPS/Watt (y-axis 0 to 400) for VIRAM, Sun Ultra I, Sun Ultra II, MIPS R12K, Alpha 21264, PowerPC G3, Power3 630]

Page 9: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Benchmarks for Scientific Problems

• Transitive-closure (small & large data set)
• Pointer-jumping (small & large working set)
• Computing a histogram
  • Used for image processing of a 16-bit greyscale image: 1536 x 1536
  • 2 algorithms: 64-elements sorting kernel; privatization
  • Needed for sorting
• Neighborhood image processing (small & large images)
• NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
• Sparse matrix-vector product:
  • Order 10000, #nonzeros 177820
• 2D unstructured mesh adaptation
  • initial grid: 4802 triangles, final grid: 24010
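The privatization approach to the histogram kernel can be illustrated with a short sketch. It is not the project's vector code: the function name `histogram_privatized` and the `num_lanes` parameter are hypothetical, and the sequential loop only mimics how elements would be spread across lanes.

```python
def histogram_privatized(pixels, num_bins, num_lanes=4):
    """Histogram with one private copy of the counters per lane."""
    # Private copies ensure that two elements handled in the same vector step
    # never update the same counter, even if they fall in the same bin.
    private = [[0] * num_bins for _ in range(num_lanes)]
    for i, p in enumerate(pixels):
        private[i % num_lanes][p] += 1   # element i is handled by lane i % num_lanes
    # Reduce the private copies into the final histogram.
    return [sum(lane[b] for lane in private) for b in range(num_bins)]

pixels = [3, 3, 3, 0, 1, 3, 2, 2]
hist = histogram_privatized(pixels, num_bins=4)
```

The trade-off is memory: each lane carries a full copy of the counters, which for a 16-bit image means 65536 bins per copy.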

Page 10: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Benchmark Performance on IRAM Simulator

IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)

Page 11: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Conclusions and VIRAM Future Directions

• VIRAM outperforms Pentium III on scientific problems
  • With lower power and clock rate than the Mobile Pentium
• Vectorization techniques developed for the Cray PVPs applicable.
• PIM technology provides low power, low cost memory system.
• Similar combination used in Sony Playstation.
• Small ISA changes can have large impact
  • Limited in-register permutations sped up 1K FFT by 5x.
• Memory system can still be a bottleneck
  • Indexed/variable stride costly, due to address generation.
• Future work:
  • Ongoing investigations into impact of lanes, subbanks
  • Technical paper in preparation, expect completion 09/01
  • Run benchmark on real VIRAM chips
  • Examine multiprocessor VIRAM configurations

Page 12: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Project Goals for FY02 and Beyond

• Use established data-intensive scientific benchmarks with other emerging architectures:
• IMAGINE (Stanford Univ.)
  • Designed for graphics and image/signal processing
  • Peak 20 GFLOPS (32-bit FP)
  • Key features: vector processing, VLIW, a streaming memory system. (Not a PIM-based design.)
  • Preliminary discussions with Bill Dally.
• DIVA (DARPA-sponsored: USC/ISI)
  • Based on PIM “smart memory” design, but for multiprocessors
  • Move computation to data
  • Designed for irregular data structures and dynamic databases.
  • Discussions with Mary Hall about benchmark comparisons

Page 13: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Management Plan

• Roles of different groups and PIs
  • Senior researchers working on particular class of benchmarks
    • Parry: sorting and histograms
    • Sherry: sparse matrices
    • Lenny: unstructured mesh adaptation
    • Brian: simulation
    • Jin and Hyun: specific benchmarks
  • Plan to hire additional postdoc for next year (focus on Imagine)
  • Undergrad model used for targeted benchmark efforts
• Plan for using computational resources at NERSC
  • Few resources used, except for comparisons

Page 14: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Future Funding Prospects

• FY2003 and beyond
  • DARPA initiated DIS program
  • Related projects are continuing under Polymorphic Computing
  • New BAA coming in “High Productivity Systems”
  • Interest from other DOE labs (LANL) in general problem
• General model
  • Most architectural research projects need benchmarking
  • Work has higher quality if done by people who understand apps.
  • Expertise for hardware projects is different: system level design, circuit design, etc.
  • Interest from both IRAM and Imagine groups shows level of interest

Page 15: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Long Term Impact

• Potential impact on Computer Science
  • Promote research of new architectures and micro-architectures
• Understand future architectures
  • Preparation for procurements
• Provide visibility of NERSC in core CS research areas
• Correlate applications: DOE vs. large market problems
• Influence future machines through research collaborations

Page 16: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

The End

Page 17: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Integer Benchmarks

• Strided access important, e.g., RGB
  • narrow types limited by address generation
• Outer loop vectorization and unrolling used
  • helps avoid short vectors
  • spilling can be a problem

[Figure: bar chart comparing 1-lane, 2-lane, and 4-lane configurations; axis 0 to 7000]

Page 18: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Status of benchmarking software release

[Diagram: components of the release, each marked as optimized or unoptimized]

• Build and test scripts (Makefiles, timing, analysis, ...)
• Standard random number generator
• Optimized GUPS inner loop
• GUPS C codes
• Pointer Jumping
• Pointer Jumping w/Update
• Transitive
• Field
• Conjugate Gradient (Matrix)
• Neighborhood
• Optimized vector histogram code
• Vector histogram code generator
• GUPS Docs
• Test cases (small and large working sets)

Future work:

• Write more documentation, add better test cases as we find them

• Incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas

Page 19: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Status of benchmarking work

• Two performance models: simulator (vsim-p), and trace analyzer (vsimII)
• Recent work on vsim-p:
  • Refining the performance model for double-precision FP performance.
• Recent work on vsimII:
  • Making the backend modular
    • Goal: Model different architectures w/ same ISA.
  • Fixing bugs in the memory model of the VIRAM-1 backend.
  • Better comments in code for better maintainability.
  • Completing a new backend for a new decoupled cluster architecture.

Page 20: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Comparison with Mobile Pentium

• GUPS: VIRAM gets 6x more GUPS

Data element width     16 bit   32 bit   64 bit
Mobile Pentium GUPS    .045     .046     .036
VIRAM GUPS             .295     .295     .244
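The "6x" figure can be checked directly from the table. The snippet below only restates the tabulated values; the dictionary names are chosen for illustration.

```python
# GUPS by data element width (bits), copied from the table.
pentium_gups = {16: 0.045, 32: 0.046, 64: 0.036}
viram_gups = {16: 0.295, 32: 0.295, 64: 0.244}

# Ratio of VIRAM to Mobile Pentium update rates for each width.
speedup = {width: viram_gups[width] / pentium_gups[width] for width in pentium_gups}
```

All three ratios fall between 6 and 7, so "6x" is a slightly conservative summary.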

[Figure: three plots of total execution time (seconds), P-III vs. VIRAM 4-lane. Transitive: matrix sizes 50 to 550. Update: working set sizes tiny, test, test2, test3, test4. Pointer: working set sizes tiny, test, test2, test3.]

• VIRAM is 30-50% faster than P-III

• Execution time for VIRAM rises much more slowly with data size than for P-III
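For readers unfamiliar with pointer jumping, the classic form of the technique is sketched below. It is a textbook illustration and may differ from the Pointer benchmark timed here. The function name `pointer_jump` is hypothetical, and the input is assumed to be a forest whose roots point to themselves.

```python
def pointer_jump(parent):
    """Make every node point directly at the root of its tree.

    Assumes parent[] encodes a forest in which each root is its own parent.
    """
    parent = list(parent)
    changed = True
    while changed:
        # Each round reads parent[parent[i]] for all i: an indexed (gather) load.
        nxt = [parent[parent[i]] for i in range(len(parent))]
        changed = nxt != parent
        parent = nxt
    return parent

# Node 0 is the root of the chain 4 -> 3 -> 2 -> 1 -> 0; node 5 is the root of 6 -> 5.
roots = pointer_jump([0, 0, 1, 2, 3, 5, 5])
```

Each round roughly halves the remaining distance to the root, and every access is data-dependent, which is why the kernel stresses indexed memory operations.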

Page 21: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Sparse CG

• Solve Ax = b; sparse matrix-vector multiplication dominates.
• Traditional CRS format requires:
  • Indexed load/store for X/Y vectors
  • Variable vector length, usually short
• Other formats for better vectorization:
  • CRS with narrow band (e.g., RCM ordering)
    • Smaller strides for X vector
  • Segmented-Sum (modified the old code developed for Cray PVP)
    • Long vector length, of same size
    • Unit stride
  • ELL format: make all rows the same length by padding zeros
    • Long vector length, of same size
    • Extra flops
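The contrast between CRS and ELL can be sketched in a few lines. This is an illustrative scalar version, not the vectorized code used in the study; the function names and the small 3x3 example matrix are assumptions.

```python
def spmv_crs(row_ptr, col_idx, vals, x):
    """y = A @ x with A in compressed row storage (CRS)."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Inner loop length varies per row and is often short.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]   # indexed load of x
        y.append(acc)
    return y

def crs_to_ell(row_ptr, col_idx, vals):
    """Pad every row with explicit zeros up to the longest row's length."""
    n = len(row_ptr) - 1
    width = max(row_ptr[r + 1] - row_ptr[r] for r in range(n))
    ell_cols = [[0] * width for _ in range(n)]
    ell_vals = [[0.0] * width for _ in range(n)]
    for r in range(n):
        for j, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[r][j] = col_idx[k]
            ell_vals[r][j] = vals[k]
    return ell_cols, ell_vals

def spmv_ell(ell_cols, ell_vals, x):
    """y = A @ x with A in ELL format."""
    n, width = len(ell_vals), len(ell_vals[0])
    y = [0.0] * n
    for j in range(width):
        # The loop over all rows is the long, fixed-length vector operation.
        for r in range(n):
            y[r] += ell_vals[r][j] * x[ell_cols[r][j]]
    return y

# A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr, col_idx, vals = [0, 2, 3, 5], [0, 2, 1, 0, 2], [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y_crs = spmv_crs(row_ptr, col_idx, vals, x)
ell_cols, ell_vals = crs_to_ell(row_ptr, col_idx, vals)
y_ell = spmv_ell(ell_cols, ell_vals, x)
```

In this tiny example ELL stores 6 entries for 5 nonzeros. When row lengths are uneven, padding to the longest row inflates the flop count far more.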

Page 22: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

SMVM Performance

• DIS matrix: N = 10000, M = 177820 (about 17.8 nonzeros per row)
• Mobile PIII (500 MHz), CRS: 35 MFLOPS
• IRAM results (MFLOPS):

SubBanks                  1           2           4           8
CRS                       91          106         109         110
CRS banded                110         110         110         110
SEG-SUM                   135         154         163         165
ELL (4.6 X more flops)    511 (111)   570 (124)   612 (133)   632 (137)
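The parenthesized ELL values appear to be the raw rate divided by the 4.6x flop inflation, that is, the rate counting only useful flops. That reading is an assumption, but the arithmetic below reproduces the quoted numbers.

```python
# Raw ELL rates in MFLOPS by number of subbanks, from the table.
RAW_ELL_MFLOPS = {1: 511, 2: 570, 4: 612, 8: 632}
FLOP_INFLATION = 4.6   # "4.6 X more flops", from the table

# Useful rate: discount the flops spent on padded zeros.
useful_mflops = {banks: round(raw / FLOP_INFLATION)
                 for banks, raw in RAW_ELL_MFLOPS.items()}

# Average nonzeros per row of the DIS matrix.
nonzeros_per_row = 177820 / 10000
```

On this reading ELL delivers a useful rate below SEG-SUM at every subbank count, despite its much higher raw MFLOPS.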

Page 23: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

2D Unstructured Mesh Adaptation

• Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
• Complicated logic and data structures
• Difficult to achieve high efficiency
  • Irregular data access patterns (pointer chasing)
  • Many conditionals / integer intensive
• Adaptation is a tool for making numerical solution cost effective
• Three types of element subdivision

Page 24: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years

Vectorization Strategy and Performance Results

• Color elements based on vertices (not edges)
  • Guarantees no conflicts during vector operations
• Vectorize across each subdivision (1:2, 1:3, 1:4) one color at a time
• Difficult: many conditionals, low flops, irregular data access, dependencies
• Initial grid: 4802 triangles, final grid: 24010 triangles
• Preliminary results demonstrate VIRAM 4.5x faster than Mobile Pentium III 500
  • Higher code complexity (requires graph coloring + reordering)

             Pentium III 500   1 Lane   2 Lanes   4 Lanes
Time (ms)    61                18       14        13
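Vertex-based coloring can be illustrated with a simple greedy pass. The slides do not say which coloring algorithm was used, so the greedy strategy, the function name `color_triangles`, and the example mesh are assumptions made for illustration.

```python
def color_triangles(triangles):
    """Greedy coloring: triangles that share a vertex get different colors."""
    used_at_vertex = {}            # vertex id -> set of colors already touching it
    colors = []
    for tri in triangles:
        taken = set()
        for v in tri:
            taken |= used_at_vertex.get(v, set())
        c = 0
        while c in taken:          # smallest color unused at all three vertices
            c += 1
        colors.append(c)
        for v in tri:
            used_at_vertex.setdefault(v, set()).add(c)
    return colors

triangles = [(0, 1, 2), (1, 2, 3), (3, 4, 5), (0, 5, 6)]
colors = color_triangles(triangles)
```

Triangles of one color share no vertices, so they can be subdivided in a single vector operation without two elements writing to the same vertex data.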