
Page 1:

8-11 Aug. 2011

Introduction to Parallel Scientific Computing

TDT 4200 -- Lect. 3 & 4
Anne C. Elster

Dept. of Computer & Info. Science (IDI)
Norwegian University of Science and Technology

Trondheim, Norway

Page 2:

IDI 2011: Study programmes vs. sections – Complex Computer Systems (KDS)

Page 3:

TDT 4200 Fall 2011

•  Instructor: Dr. Anne C. Elster ([email protected])

•  Support staff: –  Research assistant (vit.ass.): TBA –  Teaching assistant (und.ass.): Ruben Spaans

•  Web page: http://www.idi.ntnu.no/~elster/tdt4200-f2011

•  Lectures: –  Wednesdays 10:15-12:00 in F3 (may move due to class size?) –  Thursdays 13:15-14:00 in F3

•  Recitation (øvingstimer / exercise sessions): –  Thursdays 14:15-16:00 in F3

•  It’s Learning!

Page 4:

Courses Taught by Dr. Elster (Computational Science):

TDT4200 Parallel Computing (Parallel programming with MPI & threads)

TDT24 Parallel Environments & Numerical Computing

- 2-day IBM CellBE Course (Fall 2007) - GPU & Thread programming

TDT 4205 Compilers

DTD 8117 Grid and Heterogeneous Computing

Page 5:

HPC History: Personal perspective

•  1980s: Concurrent and Parallel Pascal
•  1986: Intel iPSC Hypercube – CMI (Bergen) and Cornell (Cray at NTNU)
•  1987: Cluster of 4 IBM 3090s
•  1988-91: Intel hypercubes – some on BBN
•  1991-94: KSR (MPI 1 & 2)
•  1995-2005: SGI systems (some IBM SP)
•  2001-current: Clusters
•  2006: IBM supercomputer @ NTNU (Njord, 7+ TFLOPS, proprietary switch); GPU programming (Cg)
•  2008: Quad-core supercomputer at UiTø (Stallo); HPC-Lab at IDI/NTNU opens with several NVIDIA donations; several quad-core machines (1-2 donated by Schlumberger)
•  2009: More NVIDIA donations: NVIDIA Tesla S1070 and two Quadro FX 5800 cards (Jan. '09)

Page 6:

Hypercubes

•  Distributed-memory systems where each of the n processors connects to log2 n neighbours (examples: Intel iPSC, Connection Machine)

•  Maps tree structures well using Gray codes (each dimension in the cube maps to a level in the tree; see the sketch below)
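As a hedged illustration of that Gray-code mapping (a minimal sketch; the helper names are my own, not from the slides), the binary-reflected Gray code g = i ^ (i >> 1) assigns consecutive ranks to hypercube nodes that differ in exactly one bit, i.e. to directly connected neighbours:

    #include <stdio.h>

    /* Binary-reflected Gray code: consecutive ranks map to hypercube
     * nodes whose ids differ in exactly one bit (one hop apart). */
    static unsigned int gray(unsigned int i) { return i ^ (i >> 1); }

    /* Inverse mapping: recover the rank from a Gray-coded node id. */
    static unsigned int gray_inverse(unsigned int g) {
        unsigned int i = 0;
        for (; g; g >>= 1) i ^= g;
        return i;
    }

    int main(void) {
        const unsigned int dim = 3;                 /* 3-D hypercube: 8 nodes */
        for (unsigned int r = 0; r < (1u << dim); ++r)
            printf("rank %u -> node %u (and back to rank %u)\n",
                   r, gray(r), gray_inverse(gray(r)));
        return 0;
    }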

Page 7:

Shared vs. Distributed Memory

[Diagram: shared memory (all processors access one common memory) vs. distributed memory (each processor has its own local memory, connected by a network).]

Page 8:

Replicated vs. Distributed grids

[Diagram: each processor holding a full copy of the grid (replicated) vs. the grid partitioned across processors (distributed).]

Replicated grids:
–  Pros: easier to implement; load balanced
–  Con: does not scale well (e.g. need to sum the grids – see the sketch below)

Distributed grids:
–  Pros: scales much better, especially for large problems
–  Cons: more complicated message passing; may need load balancing of particles
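A minimal, hedged sketch of the grid-summing step a replicated scheme needs (the grid size and variable names are made up for the example): every process accumulates into its own copy, and MPI_Allreduce combines all copies so that each process ends up with the full sum.

    #include <mpi.h>
    #include <stdlib.h>

    #define NX 64
    #define NY 64

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Every process holds a full (replicated) copy of the grid. */
        double *local  = calloc((size_t)NX * NY, sizeof(double));
        double *global = malloc((size_t)NX * NY * sizeof(double));

        /* ... each process deposits its contributions into 'local' ... */

        /* Sum all replicated copies; every process receives the combined
         * grid.  This global operation is exactly what limits scaling as
         * the grid (and the process count) grows. */
        MPI_Allreduce(local, global, NX * NY, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(local);
        free(global);
        MPI_Finalize();
        return 0;
    }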

Page 9:

COTS-based SUPERCOMPUTER HARDWARE TRENDS:

•  Intel iPSC (mid-1980's)

–  The first iPSC had no separate communication processor ... –  Specialized OS –  2-128 nodes

•  Today’s PC clusters –  Fast Ethernet or better (more expensive interconnect) –  Linux OS –  32-bit cheapest, but many 64-bit cluster vendors –  Top500 supercomputers

Today's GPU farms are entering the Top500 list!

Page 10:

HPC Hardware Trends at IDI

Clustis3 (quad-core cluster)

Installed Spring 2009: 8 (9) ProLiant DL160 G5 servers, each with:

•  2 × E5405 2.0 GHz quad-core CPUs •  9 GB FB-DIMM memory •  160 GB SATA disk •  2 × GbE network cards

Page 11:

HPC Hardware Trends at IDI


NVIDIA 280 Tesla card

Unpacking NVIDIA Tesla S1070 and Quadro FX 5800 cards

Page 12:

New architectural features to consider:

•  Register usage
•  On-chip memory
•  Cache / stack manipulation
•  Precision effects
•  Multiplication time ≈ addition time

Note: Recursion benefits from on-chip registers available for stack


Page 13:

Memory Hierarchy (slower toward the bottom):

•  Registers
•  Cache (levels 1-3)
•  On-chip RAM
•  RAM
•  Disks - SSD
•  Disks - HD
•  Tapes (robot storage)


Page 14:

Memory Access

•  Floating-point optimization: factor of ~2 (for in-cache data)
•  Memory access optimizations: factor of 10 or more! (for out-of-cache data – see the loop-order sketch below)
•  Much more again for RAM vs. disk!!
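A hedged C sketch of why out-of-cache access dominates (the matrix size is arbitrary): summing a matrix along its storage order reuses each cache line, while summing across it misses on nearly every load once the matrix no longer fits in cache.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096

    int main(void) {
        double *a = malloc((size_t)N * N * sizeof(double));
        double sum = 0.0;
        clock_t t0, t1;

        for (size_t k = 0; k < (size_t)N * N; ++k) a[k] = 1.0;

        /* Row-major order (matches C storage): consecutive accesses hit
         * the same cache line, so most loads are served from cache. */
        t0 = clock();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += a[(size_t)i * N + j];
        t1 = clock();
        printf("row-major:    %.3f s (sum=%g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, sum);

        /* Column-major order: each access jumps N*8 bytes, so nearly
         * every load misses once the matrix exceeds the cache. */
        t0 = clock();
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                sum += a[(size_t)i * N + j];
        t1 = clock();
        printf("column-major: %.3f s (sum=%g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, sum);

        free(a);
        return 0;
    }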


Page 15:

Main HW/SW Challenges:

•  Slow interconnects (improving, but at a cost ...)

•  Slow protocols (TCP/IP --> VIA / newer technologies)

•  MEMORY BANDWIDTH!!!

Page 16:

Multi-level Caching

•  Accessing 2D/3D cells of a large grid within the same cache line can give large performance improvements
•  E.g. a 128-byte cache line (max. 16 64-bit floating-point numbers, i.e. a 4x4 grid block)
•  Traditional: 2*16 cache hits = 32; cell-caching: ((3*3)*1) + ((3*3)*2) + 4 = 25 cache hits – a 25% improvement (see the blocking sketch below)
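A hedged sketch of the general idea behind such cache-line-aware access (block size, names and the 5-point stencil are illustrative, not the slides' exact cell-caching scheme): visiting the grid in 4x4 blocks of doubles – one 128-byte cache line – means a cell and most of its neighbours come from lines that were just loaded.

    #include <stdlib.h>

    #define N 1024          /* grid dimension                           */
    #define B 4             /* 4x4 block of doubles = one 128-byte line */

    /* 5-point stencil sweep, visiting the grid in BxB tiles so that a
     * cell and its neighbours mostly share recently loaded cache lines. */
    static void stencil_blocked(const double *in, double *out) {
        for (int ii = 1; ii < N - 1; ii += B)
            for (int jj = 1; jj < N - 1; jj += B)
                for (int i = ii; i < ii + B && i < N - 1; ++i)
                    for (int j = jj; j < jj + B && j < N - 1; ++j)
                        out[i * N + j] = 0.25 * (in[(i - 1) * N + j] +
                                                 in[(i + 1) * N + j] +
                                                 in[i * N + j - 1]   +
                                                 in[i * N + j + 1]);
    }

    int main(void) {
        double *in  = calloc((size_t)N * N, sizeof(double));
        double *out = calloc((size_t)N * N, sizeof(double));
        stencil_blocked(in, out);
        free(in);
        free(out);
        return 0;
    }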


Page 17:

Cluster technologies for HPC

Advantage: very cost-effective hardware, since it uses COTS (Commercial Off-The-Shelf) parts

BUT: Typically much slower processor interconnects than traditional HPC systems

What about usability?


Page 18:

MESSAGE PASSING CAVEAT:

•  Global operations have a more severe impact on cluster performance than on traditional supercomputers, since communication between processors takes up relatively more of the total execution time (as the timing sketch below illustrates)
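One way to see this, as a hedged sketch (the amount of local work is arbitrary), is to time a global operation such as MPI_Allreduce with MPI_Wtime and compare it with the purely local work on your own cluster:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0, global = 0.0;

        /* Local work: cheap and perfectly parallel. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < 1000000; ++i) local += 1e-6;
        double t_local = MPI_Wtime() - t0;

        /* Global operation: its cost depends on the interconnect and grows
         * with the number of processes, so on a commodity cluster it eats
         * a larger share of the total run time. */
        t0 = MPI_Wtime();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t_comm = MPI_Wtime() - t0;

        if (rank == 0)
            printf("local work: %g s, global reduction: %g s\n", t_local, t_comm);

        MPI_Finalize();
        return 0;
    }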


Page 19:

HARDWARE TRENDS – CONTINUED:

•  32-bit --> 64-bit architectures •  1 CPU --> multiple CPUs (2-4)

THE WAL-MART EFFECT: •  game stations (e.g. Playstation-2 farm at UIUC) •  graphics cards •  Low-power COTS devices??


Page 20:

The ”Ideal” Cluster -- Hardware

•  High-bandwidth network •  Low-latency network •  Low Operating System overhead (TCP causes ”slow start”) •  Great floating-point performance (64-bit processors or more?)


Page 21:

The ”Ideal” Cluster -- Software •  Compiler that is:

–  Portable –  Optimizing

•  Do extra work to save communication •  Self-tuning / load-balanced •  Automatic selection of best algorithm •  One-sided communication support? •  Optimized middleware

Page 22:

The Wal-Mart Effect (PARA02)

•  Wal-Mart – bigger than Sears, K-mart and JC Penney's combined – is predicted to influence $40 billion of IT investments (MIT Review) and has much more impact than Microsoft and Cisco could ever hope for…

–  Not driven by the latest technology, but by its business model – bad news for HPC?

–  Game market --> HPC market: future high-performance chips and systems --> NVIDIA Tesla!

Page 23:

COTS-based SUPERCOMPUTER HARDWARE TRENDS:

•  Intel iPSC (mid-1980's)

–  The first iPSC had no separate communication processor ... –  Specialized OS –  2-128 nodes

•  Today’s PC clusters –  Fast Ethernet or better (more expensive interconnect) –  Linux OS –  32-bit cheapest, but many 64-bit cluster vendors –  Top500 supercomputers

Today's GPU farms are entering the Top500 list.

Page 24:

Main HW/SW Challenges:

•  Slow interconnects (improving, but at a cost ...)

•  Slow protocols (TCP/IP --> VIA / newer technologies)

•  MEMORY BANDWIDTH!!!

Page 25:

MPI (Message Passing Interface) <http://www.mcs.anl.gov/research/projects/mpi/>

•  A standard for communication routines, developed for multiprocessor systems and clusters of workstations
•  Originally targeted Fortran and C; now also C++
•  Newer strains: OpenMPI and MPI-Java

Page 26:

What is MPI? -- continued

•  Message-passing model
•  Standard (specification) with many implementations: MPICH was the first and most widely used; OpenMPI currently the most used implementation?
•  Two phases:
   –  MPI-1: traditional message passing
   –  MPI-2: remote memory (one-sided communication, sketched below), parallel I/O, and dynamic processes
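A hedged sketch of the MPI-2 one-sided style (buffer layout and variable names are my own for the example): rank 0 exposes an array as an MPI window, and the other ranks write into it with MPI_Put between two MPI_Win_fence calls, without rank 0 posting any receives.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Rank 0 exposes an array through an MPI-2 "window"; the other
         * ranks write into it directly, with no matching receive posted. */
        double *board = NULL;
        MPI_Win win;
        if (rank == 0) {
            board = calloc(nprocs, sizeof(double));
            MPI_Win_create(board, nprocs * sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        } else {
            MPI_Win_create(NULL, 0, sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        }

        MPI_Win_fence(0, win);               /* open the access epoch    */
        if (rank != 0) {
            double value = 100.0 + rank;
            MPI_Put(&value, 1, MPI_DOUBLE,   /* what we send             */
                    0, rank, 1, MPI_DOUBLE,  /* target rank 0, slot rank */
                    win);
        }
        MPI_Win_fence(0, win);               /* all puts are now complete */

        if (rank == 0)
            for (int r = 1; r < nprocs; ++r)
                printf("slot %d = %g\n", r, board[r]);

        MPI_Win_free(&win);
        if (rank == 0) free(board);
        MPI_Finalize();
        return 0;
    }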

Page 27:

Notes on blackboard re. MPI basics

From “A User's Guide to MPI” by Peter Pacheco:

1.  Intro

2.  Greetings! (see the sketch after this list)

3.  Collective Communication

4.  Grouping Data for Communication
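For reference, a minimal program in the spirit of Pacheco's "Greetings!" example (a sketch, not his exact listing): every rank except 0 sends a message, and rank 0 receives and prints them.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        char msg[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank != 0) {
            /* Every process except 0 sends a greeting to process 0. */
            sprintf(msg, "Greetings from process %d!", rank);
            MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* Process 0 receives and prints the greetings in rank order. */
            for (int src = 1; src < nprocs; ++src) {
                MPI_Recv(msg, 100, MPI_CHAR, src, 0, MPI_COMM_WORLD, &status);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }

It is typically compiled with mpicc and run with something like mpirun -np 4 ./greetings.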

Page 28:

LIBRARIES: •  MPICH (ANL) public domain

(working with LBNL on VIA version)

•  MPI LAM (more MPI-2 features) public domain •  MPI-FM (UIUC/UCSD) public domain

–  MPICH built on top of Fast Messages •  MPI/Pro (MPI Technologies, Inc) commercial

–  (working on VIA version) •  PaTENT MPI 4.0 (Genias GmbH) commercial

–  MPI for Windows NT (PaTENT = Parallel Tool Environment for NT) •  SCALI, Norway – commercial •  MPI from MESH Technologies (Brian Vinter) – commercial

•  Threaded MPI (Penti Hutnanen, others) •  OpenMP for clusters (B. Chapman), hybrid OpenMP/MPI

Page 29:

GPUs: Graphics Processing Units

HISTORY:
•  Late '70s / early '80s: graphics drawing calculations done on CPUs
•  Xerox Alto computer: first special bit-block-transfer instruction
•  Commodore Amiga: first mass-market video accelerator able to draw fills, shapes & animations in HW; graphics sub-system with several chips, incl. one dedicated to bit-block transfer
•  Early '90s: 2D acceleration
•  Ca. 1995: VIDEO GAMES! --> 3D GPUs

Page 30:

GPU History continued:

•  1995-1998: –  3D rasterization

(converting simple 3D geometric primitives (e.g. lines, triangles, rectangles) to 2D screen pixels)

–  Texture mapping (mapping 2D texture image to planar 3D surface)

•  1999-2000: 3D translation, rotation & scaling
•  Towards 2000: GPUs become more configurable
•  2001 and beyond: programmable (ability to change individual pixels)

Page 31:

Limitations

•  Branching usually not a good idea
•  GPU cache is different from CPU cache – optimized for 2D locality
•  Random memory access problematic
•  Floating-point precision
•  No integers or booleans (also currently no bit-wise operators, but Cg has reserved symbols for these)

Page 32:

GPU: general programming view

•  Programmable MIMD processor: the vertex processor (one vector & one scalar op per clock cycle)
•  Rasterizer: passes through or interpolates values (e.g. passing 4 coordinates to draw a rectangle leads to interpolation of the pixel coordinates of the vertices)
•  Programmable SIMD processor: the fragment processor (up to 32 ops/cycle)
•  Simple blending unit (serial) – z-compares and sends results to memory

Page 33:

GPU -- Outside view

[Diagram: memory feeding a pipeline of programmable MIMD (vertex) processor --> rasterization --> programmable SIMD (fragment) processor --> blend --> output.]

Page 34:

GPU Internal Structure

Page 35:

General programming on GPUs

•  Rendering = executing •  GPU textures = CPU arrays •  Fragment shader programs = inner loops •  Rendering to texture memory = feedback •  Vertex coordinates = computational range •  Texture coordinates = Computational domain

•  Now we have NVIDIA's CUDA libraries (BLAS & FFT)!

Page 36:

Limitations

•  Branching usually not a good idea
•  GPU cache is different from CPU cache – optimized for 2D locality
•  Random memory access problematic
•  Floating-point precision
•  No integers or booleans (also currently no bit-wise operators, but Cg has reserved symbols for these)

Page 37:

SNOW SIMULATION DEMO!

Robin Eidissen (Teaching Assistant)

Page 38:

Page 39:

Modularizing Large Codes

•  Split large codes into separate, independent modules (e.g. initializer, solvers, trackers, etc.)
•  Easier to maintain and debug
•  Allows use of external packages (BLAS, LAPACK, PETSc – see the sketch below)
•  Can use the code as a test-bed for parts of future codes
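As a hedged illustration of leaning on an external package rather than hand-rolled loops (the vectors and sizes are made up; cblas_daxpy is the standard CBLAS routine for y = a*x + y), a small solver module might call BLAS like this:

    #include <stdio.h>
    #include <cblas.h>          /* link with e.g. -lopenblas or -lcblas */

    #define N 8

    int main(void) {
        double x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0; }

        /* y = 2.0 * x + y, done by the (usually highly tuned) BLAS
         * library instead of a hand-written loop. */
        cblas_daxpy(N, 2.0, x, 1, y, 1);

        for (int i = 0; i < N; ++i)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }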