
Page 1:

8-11 Aug. 2011

Introduction to Parallel Scientific Computing

TDT 4200 -- Lect. 3 & 4
Anne C. Elster

Dept. of Computer & Info. Science (IDI)
Norwegian University of Science and Technology

Trondheim, Norway

Page 2:

IDI 2011: Study programmes vs. sections – Complex Computer Systems (KDS)

Page 3:

TDT 4200 Fall 2011

•  Instructor: Dr. Anne C. Elster ([email protected])

•  Support staff: –  Research assistant (vit.ass.): TBA –  Teaching assistant (und.ass.): Ruben Spaans

•  Web page: http://www.idi.ntnu.no/~elster/tdt4200-f2011

•  Lectures: –  Wednesdays 10:15-12:00 in F3 (may move due to class size?) –  Thursdays 13:15-14:00 in F3

•  Recitation (øvingstimer / exercise sessions): –  Thursdays 14:15-16:00 in F3

•  It’s Learning!

Page 4:

Courses Taught by Dr. Elster (Computational Science):

TDT4200 Parallel Computing (Parallel programming with MPI & threads)

TDT24 Parallel Environments & Numerical Computing

- 2-day IBM CellBE Course (Fall 2007) - GPU & Thread programming

TDT 4205 Compilers

DTD 8117 Grid and Heterogeneous Computing

Page 5:

HPC History: Personal perspective

•  1980s: Concurrent and Parallel Pascal
•  1986: Intel iPSC Hypercube – CMI (Bergen) and Cornell (Cray at NTNU)
•  1987: Cluster of 4 IBM 3090s
•  1988-91: Intel hypercubes – some on BBN
•  1991-94: KSR (MPI 1 & 2)
•  1995-2005: SGI systems (some IBM SP)
•  2001-current: Clusters
•  2006: IBM supercomputer @ NTNU (Njord, 7+ TFLOPS, proprietary switch); GPU programming (Cg)
•  2008: Quad-core supercomputer at UiTø (Stallo); HPC-Lab at IDI/NTNU opens with several NVIDIA donations; several quad-core machines (1-2 donated by Schlumberger)
•  2009: More NVIDIA donations: NVIDIA Tesla S1070 and two Quadro FX 5800 cards (Jan. '09)

Page 6:

Hypercubes

•  Distributed-memory systems where each of the n processors connects to log2 n neighbours (examples: Intel iPSC, Connection Machine)

•  Maps tree structures well using Gray codes (each dimension in the cube maps to a level in the tree; see the sketch below)
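As a hedged illustration of that Gray-code mapping (a minimal sketch; the helper names are my own, not from the slides), the binary-reflected Gray code g = i ^ (i >> 1) assigns consecutive ranks to hypercube nodes that differ in exactly one bit, i.e. to directly connected neighbours:

    #include <stdio.h>

    /* Binary-reflected Gray code: consecutive ranks map to hypercube
     * nodes whose ids differ in exactly one bit (one hop apart). */
    static unsigned int gray(unsigned int i) { return i ^ (i >> 1); }

    /* Inverse mapping: recover the rank from a Gray-coded node id. */
    static unsigned int gray_inverse(unsigned int g) {
        unsigned int i = 0;
        for (; g; g >>= 1) i ^= g;
        return i;
    }

    int main(void) {
        const unsigned int dim = 3;                 /* 3-D hypercube: 8 nodes */
        for (unsigned int r = 0; r < (1u << dim); ++r)
            printf("rank %u -> node %u (and back to rank %u)\n",
                   r, gray(r), gray_inverse(gray(r)));
        return 0;
    }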

Page 7:

Shared vs. Distributed Memory

[Diagram: shared memory (all processors access one common memory) vs. distributed memory (each processor has its own local memory, connected by a network).]

Page 8:

Replicated vs. Distributed grids

[Diagram: each processor holding a full copy of the grid (replicated) vs. the grid partitioned across processors (distributed).]

Replicated grids:
–  Pros: easier to implement; load balanced
–  Con: does not scale well (e.g. need to sum the grids – see the sketch below)

Distributed grids:
–  Pros: scales much better, especially for large problems
–  Cons: more complicated message passing; may need load balancing of particles
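A minimal, hedged sketch of the grid-summing step a replicated scheme needs (the grid size and variable names are made up for the example): every process accumulates into its own copy, and MPI_Allreduce combines all copies so that each process ends up with the full sum.

    #include <mpi.h>
    #include <stdlib.h>

    #define NX 64
    #define NY 64

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Every process holds a full (replicated) copy of the grid. */
        double *local  = calloc((size_t)NX * NY, sizeof(double));
        double *global = malloc((size_t)NX * NY * sizeof(double));

        /* ... each process deposits its contributions into 'local' ... */

        /* Sum all replicated copies; every process receives the combined
         * grid.  This global operation is exactly what limits scaling as
         * the grid (and the process count) grows. */
        MPI_Allreduce(local, global, NX * NY, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(local);
        free(global);
        MPI_Finalize();
        return 0;
    }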

Page 9:

COTS-based SUPERCOMPUTER HARDWARE TRENDS:

•  Intel iPSC (mid-1980's)

–  The first iPSC had no separate communication processor ... –  Specialized OS –  2-128 nodes

•  Today’s PC clusters –  Fast Ethernet or better (more expensive interconnect) –  Linux OS –  32-bit cheapest, but many 64-bit cluster vendors –  Top500 supercomputers

Today's GPU farms are entering the Top500 list!

Page 10:

HPC Hardware Trends at IDI

Clustis3 (quad-core cluster)

Installed Spring 2009: 8 (9) ProLiant DL160 G5 servers, each with:

•  2 × E5405 2.0 GHz quad-core CPUs •  9 GB FB-DIMM memory •  160 GB SATA disk •  2 × GbE network cards

Page 11:

HPC Hardware Trends at IDI


NVIDIA 280 Tesla card

Unpacking NVIDIA Tesla S1070 and Quadro FX 5800 cards

Page 12:

New architectural features to consider:

•  Register usage
•  On-chip memory
•  Cache / stack manipulation
•  Precision effects
•  Multiplication time ≈ addition time

Note: Recursion benefits from on-chip registers available for stack


Page 13:

Memory Hierarchy (slower toward the bottom):

•  Registers
•  Cache (levels 1-3)
•  On-chip RAM
•  RAM
•  Disks - SSD
•  Disks - HD
•  Tapes (robot storage)


Page 14:

Memory Access

•  Floating-point optimization: factor of ~2 (for in-cache data)
•  Memory access optimizations: factor of 10 or more! (for out-of-cache data – see the loop-order sketch below)
•  Much more again for RAM vs. disk!!
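A hedged C sketch of why out-of-cache access dominates (the matrix size is arbitrary): summing a matrix along its storage order reuses each cache line, while summing across it misses on nearly every load once the matrix no longer fits in cache.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096

    int main(void) {
        double *a = malloc((size_t)N * N * sizeof(double));
        double sum = 0.0;
        clock_t t0, t1;

        for (size_t k = 0; k < (size_t)N * N; ++k) a[k] = 1.0;

        /* Row-major order (matches C storage): consecutive accesses hit
         * the same cache line, so most loads are served from cache. */
        t0 = clock();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += a[(size_t)i * N + j];
        t1 = clock();
        printf("row-major:    %.3f s (sum=%g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, sum);

        /* Column-major order: each access jumps N*8 bytes, so nearly
         * every load misses once the matrix exceeds the cache. */
        t0 = clock();
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                sum += a[(size_t)i * N + j];
        t1 = clock();
        printf("column-major: %.3f s (sum=%g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, sum);

        free(a);
        return 0;
    }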


Page 15:

Main HW/SW Challenges:

•  Slow interconnects (improving, but at a cost ...)

•  Slow protocols (TCP/IP --> VIA / newer technologies)

•  MEMORY BANDWIDTH!!!

Page 16:

Multi-level Caching

•  Accessing 2D/3D cells of a large grid within the same cache line can give large performance improvements
•  E.g. a 128-byte cache line (max. 16 64-bit floating-point numbers, i.e. a 4x4 grid block)
•  Traditional: 2*16 cache hits = 32; cell-caching: ((3*3)*1) + ((3*3)*2) + 4 = 25 cache hits – a 25% improvement (see the blocking sketch below)
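A hedged sketch of the general idea behind such cache-line-aware access (block size, names and the 5-point stencil are illustrative, not the slides' exact cell-caching scheme): visiting the grid in 4x4 blocks of doubles – one 128-byte cache line – means a cell and most of its neighbours come from lines that were just loaded.

    #include <stdlib.h>

    #define N 1024          /* grid dimension                           */
    #define B 4             /* 4x4 block of doubles = one 128-byte line */

    /* 5-point stencil sweep, visiting the grid in BxB tiles so that a
     * cell and its neighbours mostly share recently loaded cache lines. */
    static void stencil_blocked(const double *in, double *out) {
        for (int ii = 1; ii < N - 1; ii += B)
            for (int jj = 1; jj < N - 1; jj += B)
                for (int i = ii; i < ii + B && i < N - 1; ++i)
                    for (int j = jj; j < jj + B && j < N - 1; ++j)
                        out[i * N + j] = 0.25 * (in[(i - 1) * N + j] +
                                                 in[(i + 1) * N + j] +
                                                 in[i * N + j - 1]   +
                                                 in[i * N + j + 1]);
    }

    int main(void) {
        double *in  = calloc((size_t)N * N, sizeof(double));
        double *out = calloc((size_t)N * N, sizeof(double));
        stencil_blocked(in, out);
        free(in);
        free(out);
        return 0;
    }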


Page 17:

Cluster technologies for HPC

Advantage: very cost-effective hardware, since it uses COTS (Commercial Off-The-Shelf) parts

BUT: Typically much slower processor interconnects than traditional HPC systems

What about usability?


Page 18:

MESSAGE PASSING CAVEAT:

•  Global operations have a more severe impact on cluster performance than on traditional supercomputers, since communication between processors takes up relatively more of the total execution time (as the timing sketch below illustrates)
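One way to see this, as a hedged sketch (the amount of local work is arbitrary), is to time a global operation such as MPI_Allreduce with MPI_Wtime and compare it with the purely local work on your own cluster:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0, global = 0.0;

        /* Local work: cheap and perfectly parallel. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < 1000000; ++i) local += 1e-6;
        double t_local = MPI_Wtime() - t0;

        /* Global operation: its cost depends on the interconnect and grows
         * with the number of processes, so on a commodity cluster it eats
         * a larger share of the total run time. */
        t0 = MPI_Wtime();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t_comm = MPI_Wtime() - t0;

        if (rank == 0)
            printf("local work: %g s, global reduction: %g s\n", t_local, t_comm);

        MPI_Finalize();
        return 0;
    }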


Page 19:

HARDWARE TRENDS – CONTINUED:

•  32-bit --> 64-bit architectures •  1 CPU --> multiple CPUs (2-4)

THE WAL-MART EFFECT: •  game stations (e.g. Playstation-2 farm at UIUC) •  graphics cards •  Low-power COTS devices??


Page 20:

The ”Ideal” Cluster -- Hardware

•  High-bandwidth network •  Low-latency network •  Low Operating System overhead (TCP causes ”slow start”) •  Great floating-point performance (64-bit processors or more?)


Page 21:

The ”Ideal” Cluster -- Software •  Compiler that is:

–  Portable –  Optimizing

•  Do extra work to save communication •  Self-tuning / load-balanced •  Automatic selection of best algorithm •  One-sided communication support? •  Optimized middleware

Page 22:

The Wal-Mart Effect (PARA02)

•  Wal-Mart – bigger than Sears, K-mart and JC Penney's combined – is predicted to influence $40 billion of IT investments (MIT Review) and has much more impact than Microsoft and Cisco could ever hope for…

–  Not driven by the latest technology, but by its business model – bad news for HPC?

–  Game market --> HPC market: future high-performance chips and systems --> NVIDIA Tesla!

Page 23:

COTS-based SUPERCOMPUTER HARDWARE TRENDS:

•  Intel iPSC (mid-1980's)

–  The first iPSC had no separate communication processor ... –  Specialized OS –  2-128 nodes

•  Today’s PC clusters –  Fast Ethernet or better (more expensive interconnect) –  Linux OS –  32-bit cheapest, but many 64-bit cluster vendors –  Top500 supercomputers

Today's GPU farms are entering the Top500 list.

Page 24:

Main HW/SW Challenges:

•  Slow interconnects (improving, but at a cost ...)

•  Slow protocols (TCP/IP --> VIA / newer technologies)

•  MEMORY BANDWIDTH!!!

Page 25:

MPI (Message Passing Interface) <http://www.mcs.anl.gov/research/projects/mpi/>

•  A standard for communication routines, developed for multiprocessor systems and clusters of workstations
•  Originally targeted Fortran and C; now also C++
•  Newer strains: OpenMPI and MPI-Java

Page 26:

What is MPI? -- continued

•  Message-passing model
•  Standard (specification) with many implementations: MPICH was the first and most widely used; OpenMPI currently the most used implementation?
•  Two phases:
   –  MPI-1: traditional message passing
   –  MPI-2: remote memory (one-sided communication, sketched below), parallel I/O, and dynamic processes
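A hedged sketch of the MPI-2 one-sided style (buffer layout and variable names are my own for the example): rank 0 exposes an array as an MPI window, and the other ranks write into it with MPI_Put between two MPI_Win_fence calls, without rank 0 posting any receives.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Rank 0 exposes an array through an MPI-2 "window"; the other
         * ranks write into it directly, with no matching receive posted. */
        double *board = NULL;
        MPI_Win win;
        if (rank == 0) {
            board = calloc(nprocs, sizeof(double));
            MPI_Win_create(board, nprocs * sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        } else {
            MPI_Win_create(NULL, 0, sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        }

        MPI_Win_fence(0, win);               /* open the access epoch    */
        if (rank != 0) {
            double value = 100.0 + rank;
            MPI_Put(&value, 1, MPI_DOUBLE,   /* what we send             */
                    0, rank, 1, MPI_DOUBLE,  /* target rank 0, slot rank */
                    win);
        }
        MPI_Win_fence(0, win);               /* all puts are now complete */

        if (rank == 0)
            for (int r = 1; r < nprocs; ++r)
                printf("slot %d = %g\n", r, board[r]);

        MPI_Win_free(&win);
        if (rank == 0) free(board);
        MPI_Finalize();
        return 0;
    }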

Page 27:

Notes on blackboard re. MPI basics

From “A User's Guide to MPI” by Peter Pacheco:

1.  Intro

2.  Greetings! (see the sketch after this list)

3.  Collective Communication

4.  Grouping Data for Communication
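For reference, a minimal program in the spirit of Pacheco's "Greetings!" example (a sketch, not his exact listing): every rank except 0 sends a message, and rank 0 receives and prints them.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        char msg[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank != 0) {
            /* Every process except 0 sends a greeting to process 0. */
            sprintf(msg, "Greetings from process %d!", rank);
            MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* Process 0 receives and prints the greetings in rank order. */
            for (int src = 1; src < nprocs; ++src) {
                MPI_Recv(msg, 100, MPI_CHAR, src, 0, MPI_COMM_WORLD, &status);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }

It is typically compiled with mpicc and run with something like mpirun -np 4 ./greetings.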

Page 28:

LIBRARIES: •  MPICH (ANL) public domain

(working with LBNL on VIA version)

•  MPI LAM (more MPI-2 features) public domain •  MPI-FM (UIUC/UCSD) public domain

–  MPICH built on top of Fast Messages •  MPI/Pro (MPI Technologies, Inc) commercial

–  (working on VIA version) •  PaTENT MPI 4.0 (Genias GmbH) commercial

–  MPI for Windows NT (PaTENT = Parallel Tool Environment for NT) •  SCALI, Norway – commercial •  MPI from MESH Technologies (Brian Vinter) – commercial

•  Threaded MPI (Penti Hutnanen, others) •  OpenMP for clusters (B. Chapman), hybrid OpenMP/MPI

Page 29:

GPUs: Graphics Processing Units

HISTORY:
•  Late '70s / early '80s: graphics drawing calculations done on CPUs
•  Xerox Alto computer: first special bit-block-transfer instruction
•  Commodore Amiga: first mass-market video accelerator able to draw fills, shapes & animations in HW; graphics sub-system with several chips, incl. one dedicated to bit-block transfer
•  Early '90s: 2D acceleration
•  Ca. 1995: VIDEO GAMES! --> 3D GPUs

Page 30:

GPU History continued:

•  1995-1998: –  3D rasterization

(converting simple 3D geometric primitives (e.g. lines, triangles, rectangles) to 2D screen pixels)

–  Texture mapping (mapping 2D texture image to planar 3D surface)

•  1999-2000: 3D translation, rotation & scaling
•  Towards 2000: GPUs become more configurable
•  2001 and beyond: programmable (ability to change individual pixels)

Page 31:

Limitations

•  Branching usually not a good idea
•  GPU cache is different from CPU cache – optimized for 2D locality
•  Random memory access problematic
•  Floating-point precision
•  No integers or booleans (also currently no bit-wise operators, but Cg has reserved symbols for these)

Page 32:

GPU: general programming view

•  Programmable MIMD processor: the vertex processor (one vector & one scalar op per clock cycle)
•  Rasterizer: passes through or interpolates values (e.g. passing 4 coordinates to draw a rectangle leads to interpolation of the pixel coordinates of the vertices)
•  Programmable SIMD processor: the fragment processor (up to 32 ops/cycle)
•  Simple blending unit (serial) – z-compares and sends results to memory

Page 33:

GPU -- Outside view

[Diagram: memory feeding a pipeline of programmable MIMD (vertex) processor --> rasterization --> programmable SIMD (fragment) processor --> blend --> output.]

Page 34:

GPU Internal Structure

Page 35:

General programming on GPUs

•  Rendering = executing •  GPU textures = CPU arrays •  Fragment shader programs = inner loops •  Rendering to texture memory = feedback •  Vertex coordinates = computational range •  Texture coordinates = Computational domain

•  Now we have NVIDIA's CUDA libraries (BLAS & FFT)!

Page 36:

Limitations

•  Branching usually not a good idea
•  GPU cache is different from CPU cache – optimized for 2D locality
•  Random memory access problematic
•  Floating-point precision
•  No integers or booleans (also currently no bit-wise operators, but Cg has reserved symbols for these)

Page 37:

SNOW SIMULATION DEMO!

Robin Eidissen (Teaching Assistant)

Page 38:

Page 39:

Modularizing Large Codes

•  Split large codes into separate, independent modules (e.g. initializer, solvers, trackers, etc.)
•  Easier to maintain and debug
•  Allows use of external packages (BLAS, LAPACK, PETSc – see the sketch below)
•  Can use the code as a test-bed for parts of future codes
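As a hedged illustration of leaning on an external package rather than hand-rolled loops (the vectors and sizes are made up; cblas_daxpy is the standard CBLAS routine for y = a*x + y), a small solver module might call BLAS like this:

    #include <stdio.h>
    #include <cblas.h>          /* link with e.g. -lopenblas or -lcblas */

    #define N 8

    int main(void) {
        double x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0; }

        /* y = 2.0 * x + y, done by the (usually highly tuned) BLAS
         * library instead of a hand-written loop. */
        cblas_daxpy(N, 2.0, x, 1, y, 1);

        for (int i = 0; i < N; ++i)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }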