Introduction to Parallel Scientific Computing
TDT 4200 -- Lect. 3 & 4 Anne C. Elster
Dept. of Computer & Info. Sci. (IDI), Norwegian Univ. of Science and Technology
Trondheim, Norway
IDI 2011: Study programmes vs. sections – Complex Computer Systems (KDS)
TDT 4200 Fall 2011
• Instructor: Dr. Anne C. Elster ([email protected])
• Support staff:
  – Vit.ass. (scientific assistant): TBA
  – Und.ass. (undergraduate teaching assistant): Ruben Spaans
• Web page: http://www.idi.ntnu.no/~elster/tdt4200-f2011
• Lectures:
  – Wednesdays 10:15-12:00 in F3 (may move due to class size?)
  – Thursdays 13:15-14:00 in F3
• Recitation (øvingstimer): Thursdays 14:15-16:00 in F3
• It's Learning!
Courses Taught by Dr. Elster (Beregningsvitenskap / Computational Science):
• TDT 4200 Parallel Computing (parallel programming with MPI & threads)
• TDT 24 Parallel Environments & Numerical Computing
  – 2-day IBM Cell BE course (Fall 2007)
  – GPU & thread programming
• TDT 4205 Compilers
• DT 8117 Grid and Heterogeneous Computing
HPC History: A Personal Perspective
• 1980s: Concurrent and Parallel Pascal
• 1986: Intel iPSC hypercube
  – CMI (Bergen) and Cornell (Cray at NTNU)
• 1987: Cluster of 4 IBM 3090s
• 1988-91: Intel hypercubes, some work on BBN
• 1991-94: KSR (MPI 1 & 2)
• 1995-2005: SGI systems (some IBM SP)
• 2001-current: Clusters
• 2006: IBM supercomputer @ NTNU (Njord, 7+ TFLOPS, proprietary switch); GPU programming (Cg)
• 2008: Quad-core supercomputer at UiTø (Stallo); HPC-Lab at IDI/NTNU opens with several NVIDIA donations; several quad-core machines (1-2 donated by Schlumberger)
• 2009: More NVIDIA donations: NVIDIA Tesla S1070 and two Quadro FX 5800 cards (Jan. '09)
Hypercubes
Distributed-memory systems in which each of the n processors connects to log n neighbours (e.g. Intel iPSC, Connection Machine).
Tree structures map well onto hypercubes using Gray codes (each dimension of the cube maps to a level in the tree); a small sketch follows below.
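To make the Gray-code idea concrete, here is a minimal C sketch (illustrative only, not from the lecture notes; the helper names are mine). Node i's neighbours in a d-dimensional hypercube differ from i in exactly one bit, and the binary-reflected Gray code g(i) = i ^ (i >> 1) orders the nodes so that consecutive ranks are hypercube neighbours.

/* Sketch: binary-reflected Gray codes and hypercube neighbours.
 * In a d-dimensional hypercube, two nodes are connected iff their
 * binary labels differ in exactly one bit. */
#include <stdio.h>

static unsigned gray(unsigned i)                    { return i ^ (i >> 1); }
static unsigned neighbour(unsigned n, unsigned dim) { return n ^ (1u << dim); }

int main(void)
{
    const unsigned d = 3;                           /* 3-cube: 8 nodes */
    for (unsigned i = 0; i < (1u << d); i++) {
        printf("rank %u -> node %u, neighbours:", i, gray(i));
        for (unsigned k = 0; k < d; k++)
            printf(" %u", neighbour(gray(i), k));
        printf("\n");
    }
    return 0;
}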
Shared vs. Distributed Memory
[Diagram: a shared-memory system, where all processors access one common memory, contrasted with a distributed-memory system, where each processor has its own local memory.]
Replicated vs. Distributed grids
[Diagram: the same particle grid either replicated on every processor or partitioned (distributed) across processors.]

Replicated grids:
- Pros: easier to implement; inherently load balanced
- Con: does not scale well (e.g. the per-processor grids must be summed)

Distributed grids:
- Pro: scales much better, especially for large problems
- Cons: more complicated message passing; may need load balancing of particles
"COTS"-based SUPERCOMPUTER HARDWARE TRENDS:
• Intel iPSC (mid-1980s)
  – The first iPSC had no separate communication processor ...
  – Specialized OS
  – 2-128 nodes
• Today's PC clusters
  – Fast Ethernet or better (more expensive) interconnects
  – Linux OS
  – 32-bit cheapest, but many 64-bit cluster vendors
  – Top500 supercomputers
• Today's GPU farms are entering the Top500 list!
HPC Hardware Trends at IDI: Clustis3 (quad-core cluster)
Installed Spring 2009: 8 (9) ProLiant DL160 G5 servers, each with:
• 2× Xeon E5405 2.0 GHz quad-core
• 9 GB FB-DIMM memory
• 160 GB SATA disk
• 2× GbE network cards
HPC Hardware Trends at IDI
[Photos: NVIDIA 280 Tesla card; unpacking the NVIDIA Tesla S1070 and Quadro FX 5800 cards]
New architectural features to consider:
• Register usage
• On-chip memory
• Cache / stack manipulation
• Precision effects
• Multiplication time ≈ addition time

Note: recursion benefits from on-chip registers being available for the stack.
Memory Hierarchy (faster to slower):
• Registers
• Cache (levels 1-3)
• On-chip RAM
• RAM
• Disks – SSD
• Disks – HD
• Tapes (robot storage)
Memory Access
Floating-point optimization: factor of ~2 (for in-cache data)
Memory-access optimizations: factor of 10 or more!! (for out-of-cache data)
Much more for RAM vs. disk!!
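As an aside (a hypothetical example, not from the slides), the classic illustration of why out-of-cache access costs so much: the two functions below compute the same sum, but only the first walks the array in the row-major order that C actually stores it in.

/* Sketch: same arithmetic, very different memory behaviour.
 * C stores a[i][j] row-major, so the j-inner loop walks memory
 * contiguously and reuses each cache line; swapping the loops
 * strides by N doubles per access and misses on almost every load. */
#include <stdio.h>

#define N 2048
static double a[N][N];                 /* ~32 MB, far larger than cache */

static double sum_row_major(void)      /* cache-friendly traversal */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

static double sum_col_major(void)      /* same result, many more cache misses */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}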
Main HW/SW Challenges:
• Slow interconnects (improving, but at a cost ...)
• Slow protocols (TCP/IP --> VIA and newer technologies)
• MEMORY BANDWIDTH!!!
Multi-level Caching
Accessing 2D/3D cells of a large grid within the same cache line can give large performance improvements.
E.g. with a 128-byte cache line (at most 16 64-bit floating-point numbers, i.e. a 4x4 grid block):
Traditional: 2*16 cache hits = 32
Cell caching: ((3*3)*1) + ((3*3)*2) + 4 = 25 cache hits, a 25% improvement!
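A related sketch (illustrative only, and not the counting scheme on the slide): cache-blocking a matrix transpose so that the cells touched together stay within cache lines that are already resident.

/* Sketch: cache-blocked matrix transpose.  Naive loops touch b[] column-wise
 * and miss on nearly every store; working in small BxB tiles keeps both the
 * source rows and destination rows being touched inside the cache. */
#include <stdio.h>

#define N 2048
#define B 16                           /* 16 doubles = one 128-byte cache line */
static double a[N][N], b[N][N];

static void transpose_blocked(void)
{
    for (int ib = 0; ib < N; ib += B)
        for (int jb = 0; jb < N; jb += B)
            for (int i = ib; i < ib + B; i++)   /* one hot tile at a time */
                for (int j = jb; j < jb + B; j++)
                    b[j][i] = a[i][j];
}

int main(void)
{
    a[1][2] = 3.0;
    transpose_blocked();
    printf("%f\n", b[2][1]);           /* prints 3.0 */
    return 0;
}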
Cluster technologies for HPC
Advantage: very cost-effective hardware, since it uses COTS (Commercial Off-The-Shelf) parts
BUT: typically much slower processor interconnects than traditional HPC systems
What about usability?
MESSAGE PASSING CAVEAT:
• Global operations have a more severe impact on cluster performance than on traditional supercomputers, since communication between processors takes a relatively larger share of the total execution time.
HARDWARE TRENDS – CONTINUED:
• 32-bit --> 64-bit architectures
• 1 CPU --> multiple CPUs (2-4)

THE WAL-MART EFFECT:
• Game stations (e.g. the PlayStation 2 farm at UIUC)
• Graphics cards
• Low-power COTS devices??
The "Ideal" Cluster -- Hardware
• High-bandwidth network
• Low-latency network
• Low operating-system overhead (TCP causes "slow start")
• Great floating-point performance (64-bit processors or more?)
The "Ideal" Cluster -- Software
• Compiler that is:
  – Portable
  – Optimizing
• Do extra work to save communication
• Self-tuning / load-balanced
• Automatic selection of the best algorithm
• One-sided communication support?
• Optimized middleware
The Wal-Mart Effect (PARA'02)
• Wal-Mart – bigger than Sears, K-Mart and JC Penney combined – is predicted to influence $40 billion of IT investments (MIT Technology Review), far more impact than Microsoft and Cisco could ever hope for ...
  – Not driven by the latest technology, but by its business model – bad news for HPC?
  – Game market --> HPC market: future high-performance chips and systems --> NVIDIA Tesla!
MPI (Message Passing Interface) <http://www.mcs.anl.gov/research/projects/mpi/>
A standard for communication routines, developed for multiprocessor systems and clusters of workstations.
Originally targeted Fortran and C; now also C++.
Newer strains: OpenMPI and MPI-Java.
What is MPI? -- continued
• Message-passing model
• Standard (specification) with many implementations
  – MPICH was the first and most widely used
  – OpenMPI currently the most used implementation?
• Two phases:
  – MPI-1: traditional message passing (see the sketch below)
  – MPI-2: remote memory (one-sided communication), parallel I/O, and dynamic processes
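A minimal MPI-1-style sketch in C, in the spirit of Pacheco's "Greetings!" program (this is not his exact code): every rank except 0 sends a message to rank 0, which prints them.

/* greetings.c -- minimal MPI-1 point-to-point sketch.
 * Compile:  mpicc greetings.c -o greetings
 * Run:      mpirun -np 4 ./greetings                                   */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char msg[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* start up MPI           */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* who am I?              */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes?    */

    if (rank != 0) {
        sprintf(msg, "Greetings from process %d of %d!", rank, size);
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            MPI_Recv(msg, 100, MPI_CHAR, src, 0, MPI_COMM_WORLD, &status);
            printf("%s\n", msg);
        }
    }

    MPI_Finalize();                           /* shut down MPI          */
    return 0;
}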
Notes on the blackboard re. MPI basics, from "A User's Guide to MPI" by Peter Pacheco:
1. Intro
2. Greetings!
3. Collective Communication (see the sketch below)
4. Grouping Data for Communication
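As a taste of item 3, a small hedged sketch (not from Pacheco's book) of collective communication: every process contributes a partial value, and MPI_Reduce combines them on rank 0.

/* reduce.c -- sketch of a collective operation (MPI_Reduce):
 * every rank contributes a partial value; rank 0 receives the sum. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double) rank;                    /* each rank's contribution */

    /* All ranks call MPI_Reduce; only the root (0) gets the combined
     * result.  Recall the earlier caveat: such global operations cost
     * relatively more on clusters with slow interconnects.             */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d ranks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}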
MPI LIBRARIES:
• MPICH (ANL) – public domain (working with LBNL on a VIA version)
• LAM/MPI (more MPI-2 features) – public domain
• MPI-FM (UIUC/UCSD) – public domain
  – MPICH built on top of Fast Messages
• MPI/Pro (MPI Technologies, Inc.) – commercial (working on a VIA version)
• PaTENT MPI 4.0 (Genias GmbH) – commercial
  – MPI for Windows NT (PaTENT = Parallel Tool Environment for NT)
• SCALI, Norway – commercial
• MPI from MESH Technologies (Brian Vinter) – commercial
• Threaded MPI (Penti Hutnanen, others)
• OpenMP for clusters (B. Chapman); hybrid OpenMP/MPI
GPUs: Graphics Processing Units
HISTORY:
• Late '70s / early '80s: graphics drawing calculations done on CPUs
• Xerox Alto computer: first special bit-block-transfer instruction
• Commodore Amiga: first mass-market video accelerator able to draw fills, shapes & animations in hardware; graphics subsystem with several chips, including one dedicated to bit-block transfer
• Early '90s: 2D acceleration
• Ca. 1995: VIDEO GAMES! --> 3D GPUs
GPU History continued:
• 1995-1998:
  – 3D rasterization (converting simple 3D geometric primitives, e.g. lines, triangles, rectangles, to 2D screen pixels)
  – Texture mapping (mapping a 2D texture image onto a planar 3D surface)
• 1999-2000: 3D translation, rotation & scaling
• Towards 2000: GPUs become more configurable
• 2001 and beyond: programmable (ability to change individual pixels)
Limitations
• Branching is usually not a good idea
• GPU cache is different from CPU cache
  – Optimized for 2D locality
• Random memory access is problematic
• Floating-point precision
• No integers or booleans (also currently no bit-wise operators, but Cg reserves symbols for these)
GPU: general programming view
• Programmable MIMD processor: the vertex processor (one vector & one scalar op per clock cycle)
• Rasterizer: passes through or interpolates values (e.g. passing 4 coordinates to draw a rectangle leads to interpolation of the pixel coordinates of the vertices)
• Programmable SIMD processor: the fragment processor (up to 32 ops/cycle)
• Simple blending unit (serial): does z-compares and writes to memory
GPU -- Outside view
[Pipeline diagram: Memory ↔ Programmable MIMD (vertex) processor → Rasterization → Programmable SIMD (fragment) processor → Blend → output]
GPU Internal Structure
General programming on GPUs
• Rendering = executing
• GPU textures = CPU arrays
• Fragment shader programs = inner loops
• Rendering to texture memory = feedback
• Vertex coordinates = computational range
• Texture coordinates = computational domain
• Now there is NVIDIA's CUDA library! (BLAS & FFT)
SNOW SIMULATION DEMO!
Robin Eidissen (Teaching Assistant)
Modularizing Large Codes
Split large codes into separate, independent modules (e.g. initializer, solvers, trackers, etc.)
• Easier to maintain and debug
• Allows use of external packages (BLAS, LAPACK, PETSc)
• Can use the code as a test-bed for parts of future codes