COMP381 by M. Hamdi

Multiprocessors - Parallel Computing
Processor Performance

• We have looked at various ways of increasing single-processor performance (excluding VLSI techniques):
– Pipelining
– ILP
– Super-scalars
– Out-of-order execution (scoreboarding)
– VLIW
– Caches (L1, L2, L3)
– Interleaved memories
– Compilers (loop unrolling, branch prediction, etc.)
– RAID
– Etc.
• However, quite often even the best microprocessors are not fast enough for certain applications!
Example: How Far Will ILP Go?

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming

[Figure: two plots - the fraction of total cycles (%) vs. the number of instructions issued (0 to 6+), and the speedup vs. instructions issued per cycle]
When Do We Need High Performance Computing?

• Case 1: To do a time-consuming operation in less time
– I am an aircraft engineer
– I need to run a simulation to test the stability of the wings at high speed
– I'd rather have the result in 5 minutes than in 5 days, so that I can complete the final aircraft design sooner.
When Do We Need High Performance Computing?

• Case 2: To do an operation before a tighter deadline
– I am a weather prediction agency
– I am getting input from weather stations/sensors
– I'd like to make the forecast for tomorrow before tomorrow
When Do We Need High Performance Computing?

• Case 3: To do a high number of operations per second
– I am an engineer at Amazon.com
– My Web server gets 10,000 hits per second
– I'd like my Web server and my databases to handle 10,000 transactions per second so that customers do not experience long delays
– Amazon does "process" several GBytes of data per second
The Need for High-Performance Computers: Just Some Examples

• Automotive design:
– Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity, and aerodynamics.
– Savings: approx. $1 billion per company per year.
• Semiconductor industry:
– Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
– Savings: approx. $1 billion per company per year.
• Airlines:
– System-wide logistics optimization runs on parallel systems.
– Savings: approx. $100 million per airline per year.
Grand Challenges

[Figure: computational performance requirements (100 MFLOPS to 1 TFLOPS) vs. storage requirements (10 MB to 1 TB) for grand-challenge applications: 2D airfoil, 48-hour weather, oil reservoir modelling, 3D plasma modelling, 72-hour weather, vehicle dynamics, chemical dynamics, pharmaceutical design, structural biology]
Weather Forecasting

• Suppose the whole global atmosphere is divided into cells of size 1 km × 1 km × 1 km to a height of 10 km (10 cells high) - about 5 × 10^8 cells.
• Suppose each cell calculation requires 200 floating-point operations. In one time step, 10^11 floating-point operations are necessary.
• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 GFLOPS (10^9 floating-point operations/s) - similar to the Pentium 4 - takes 10^6 seconds, or over 10 days.
• To perform the calculation in 5 minutes requires a computer operating at 3.4 TFLOPS (3.4 × 10^12 floating-point operations/s).
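These numbers can be reproduced with a short back-of-the-envelope script (a sketch; the cell count, per-cell FLOP cost, and step size are the figures quoted above):

```python
# Checking the weather-forecasting arithmetic from the slide.
cells = 5e8                                      # 1 km cells to 10 km height
flops_per_cell = 200
flops_per_step = cells * flops_per_cell          # 1e11 FLOPs per time step

steps = 7 * 24 * 60                              # 1-minute steps over 7 days
total_flops = flops_per_step * steps             # ~1e15 FLOPs

seconds_at_1gflops = total_flops / 1e9
print(seconds_at_1gflops / 86400)                # ~11.7 days on a 1 GFLOPS machine

required_rate = total_flops / (5 * 60)           # to finish in 5 minutes
print(required_rate / 1e12)                      # 3.36 TFLOPS needed
```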
1. The user enters a query on a web form, which is sent to the Google web server.
2. The web server sends the query to the Index Server cluster, which matches the query to documents.
3. The match is sent to the Doc Server cluster, which retrieves the documents to generate abstracts and cached copies.
4. The list, with abstracts, is displayed by the web server to the user, sorted (using a secret formula involving PageRank).
Google Requirements

• Google: a search engine that scales at Internet growth rates
• Search engines: 24x7 availability
• Google: 600M queries/day, an AVERAGE of 7,500 queries/s all day
• Response-time goal: < 0.5 s per search
• Google crawls the WWW and puts up a new index every 2 weeks
• Storage: 5.3 billion web pages, 950 million newsgroup messages, and 925 million images indexed; millions of videos
Google

• Requires high amounts of computation per request
• A single query on Google (on average):
– reads hundreds of megabytes of data
– consumes tens of billions of CPU cycles
• A peak request stream on Google requires an infrastructure comparable in size to the largest supercomputer installations
• Typical Google data center: 15,000 PCs (Linux), 30,000 disks: almost 3 petabytes!
• The Google application affords easy parallelization:
– Different queries can run on different processors
– A single query can use multiple processors, because the overall index is partitioned
Multiprocessing

• Multiprocessing (parallel processing): concurrent execution of tasks (programs) using multiple computing, memory, and interconnection resources. Use multiple resources to solve problems faster.
• Provides an alternative to a faster clock for performance:
– Assuming a doubling of effective processor performance every 2 years, a 1024-processor system can deliver the performance that it would take 20 years for a single-processor system to reach.
• Using multiple processors to solve a single problem:
– Divide the problem into many small pieces
– Distribute these small pieces to be solved by multiple processors simultaneously
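The divide-and-distribute idea can be sketched with Python's multiprocessing module (an illustrative toy, not a tuned implementation; real speedup depends on problem size and communication cost):

```python
# Divide a problem (summing a large list) into small pieces and solve the
# pieces simultaneously on multiple worker processes.
from multiprocessing import Pool

def partial_sum(chunk):
    # The "small problem" each worker solves independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers                      # even split, no remainder here
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)       # pieces solved in parallel

    print(sum(partials) == sum(data))                  # True: same answer, split 4 ways
```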
Multiprocessing

• For the last 30+ years, multiprocessing has been seen as the best way to produce orders-of-magnitude performance gains.
– Double the number of processors, get double the performance (at less than 2 times the cost).
• It turns out that the ability to develop and deliver software for multiprocessing systems has been the impediment to wide adoption.
Amdahl's Law

• A parallel program has a sequential part (e.g., I/O) and a parallel part:
– T1 = αT1 + (1-α)T1
– Tp = αT1 + (1-α)T1 / p
• Therefore:
Speedup(p) = 1 / (α + (1-α)/p)
           = p / (αp + 1 - α)
           ≤ 1/α
• Example: if a code is 10% sequential (i.e., α = 0.10), the speedup will always be lower than 1 + 90/10 = 10, no matter how many processors are used!
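Amdahl's Law is easy to state as code. A minimal sketch, using the slide's notation (α is the sequential fraction, p the processor count):

```python
# Amdahl's Law: speedup is capped at 1/alpha no matter how many processors.
def amdahl_speedup(alpha, p):
    return 1.0 / (alpha + (1.0 - alpha) / p)

# 10% sequential code: the speedup approaches, but never reaches, 1/alpha = 10.
print(round(amdahl_speedup(0.10, 10), 2))     # 5.26
print(round(amdahl_speedup(0.10, 1000), 2))   # 9.91
```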
Performance Potential Using Multiple Processors

• Amdahl's Law is pessimistic (in this case):
– Let s be the serial part
– Let p be the part that can be parallelized n ways
– Serial: SSPPPPPP (time = 8)
– 6 processors: SS runs serially, then the six Ps run in parallel, one per processor (time = 3)
– Speedup = 8/3 = 2.67
– T(n) = 1 / (s + p/n), the speedup with n processors (normalizing s + p = 1)
– As n → ∞, T(n) → 1/s
• Pessimistic
Amdahl's Law

[Figure: speedup vs. % serial (10% to 99%) for 4, 16, and 1000 CPUs; even with 1000 CPUs, speedup collapses quickly as the serial fraction grows]
Performance Potential: Another View

• Gustafson's view (more widely adopted for multiprocessors): the parallel portion increases as the problem size increases
– Serial time fixed (at s)
– Parallel time proportional to problem size (true most of the time)
• Old serial: SSPPPPPP
• 6 processors, same elapsed time:
SSPPPPPP
PPPPPP
PPPPPP
PPPPPP
PPPPPP
PPPPPP
• Hypothetical serial: SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP
– Speedup = (8 + 5×6)/8 = 4.75
– T'(n) = s + n·p; T'(n) → ∞ as n → ∞ !!!!
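Gustafson's scaled speedup can be sketched in a few lines, using the SSPPPPPP example above (treating s and p as work units, an assumption made for illustration):

```python
# Gustafson's scaled speedup: the serial time s stays fixed while the
# parallel work grows with the processor count n. The elapsed time on n
# processors stays s + p; the hypothetical serial time is s + n*p.
def gustafson_speedup(s, p, n):
    return (s + n * p) / (s + p)

# The slide's example: s = 2 units (SS), p = 6 units (PPPPPP), n = 6 processors.
print(gustafson_speedup(2, 6, 6))   # (2 + 36) / 8 = 4.75
```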
Amdahl vs. Gustafson-Barsis

[Figure: speedup (0 to 100) vs. % serial (10% to 99%) under the two models; the Gustafson-Barsis scaled-speedup curve stays far above the Amdahl curve]
TOP 5 Most Powerful Computers in the World - Must Be Multiprocessors

http://www.top500.org/

Rank | Site / Country | Processors | Manufacturer | Rmax (GFlops)
1 | DOE/NNSA/LLNL, United States | 212,992 (PowerPC) | IBM | 478,200
2 | Forschungszentrum Juelich (FZJ), Germany | 65,536 (PowerPC) | IBM | 167,300
3 | SGI/New Mexico Computing Applications Center (NMCAC), United States | 14,336 (Intel EM64T Xeon) | SGI | 126,900
4 | Computational Research Laboratories, TATA SONS, India | 14,240 (Intel EM64T Xeon) | HP | 117,900
5 | Government Agency, Sweden | 13,728 (AMD x86_64 Opteron Dual Core) | Cray | 42,900
Supercomputer Style Migration (Top500)

• In the last 8 years, uniprocessors and SIMDs disappeared while clusters and constellations grew from 3% to 80%.
– Cluster: whole computers interconnected using their I/O bus
– Constellation: a cluster that uses an SMP multiprocessor as the building block
Multiprocessing (Usage)

• Multiprocessor systems are being used for a wide variety of purposes.
• Redundant processing (safeguard): fault tolerance.
• Multiprocessor systems: increase throughput
– Many tasks (no communication between them)
– Multi-user departmental, enterprise, and web servers
• Parallel computing systems: decrease execution time
– Execute large-scale applications in parallel
Multiprocessing

• Multiple resources:
– Computers (e.g., clusters of PCs)
– CPUs (e.g., shared-memory computers)
– ALUs (e.g., multiprocessors within a single chip)
– Memory
– Interconnect
• Tasks:
– Programs (coarse-grain)
– Procedures
– Instructions (fine-grain)
• Different combinations result in different systems.
Why Did the Popularity of Multiprocessors Slow Down Compared to the 90s?

1. The ability to develop and deliver software for multiprocessing systems has been the impediment to wide adoption - the goal was to make programming transparent to the user (as with pipelining), which never happened. However, there have been a lot of advances here.
2. The tremendous advances in microprocessors (doubling in performance every 2 years) were able to satisfy the needs of 99% of applications.
3. It did not make a business case: vendors were only able to sell a few parallel computers (< 200). As a result, they were not able to invest in designing cheap and powerful multiprocessors.
4. Most parallel computer vendors went bankrupt by the mid-90s - there was no business.
Flynn's Taxonomy of Computing

• SISD (Single Instruction, Single Data):
– Typical uniprocessor systems that we've studied throughout this course.
– Uniprocessor systems can time-share and still be SISD.
• SIMD (Single Instruction, Multiple Data):
– Multiple processors simultaneously executing the same instruction on different data.
– Specialized applications (e.g., image processing).
• MIMD (Multiple Instruction, Multiple Data):
– Multiple processors autonomously executing different instructions on different data.
– Keep in mind that the processors are working together to solve a single problem.
SIMD Systems

[Figure: a von Neumann computer (one processor, one memory) vs. a SIMD machine - an array of processor-memory (P-M) pairs joined by some interconnection network]

• One control unit
• Lockstep
• All Ps do the same or nothing
MIMD Shared Memory Systems

[Figure: processors with caches connected through interconnection networks to memory modules forming a global memory]

• One global memory
• Cache coherence
• All Ps have equal access to memory
Cache Coherent NUMA

[Figure: processor-cache-memory nodes connected by an interconnection network]

• Each P has part of the shared memory
• Non-uniform memory access
MIMD Distributed Memory Systems

[Figure: processor-memory nodes connected by interconnection networks - e.g., a hypercube with binary node labels, or a LAN/WAN]

• No shared memory
• Message passing
• Topology
Cluster Architecture

[Figure: nodes, each with its own processor, cache, memory, I/O, and OS, joined by an interconnection network; middleware and a programming environment run above the nodes of the home cluster]
Grids

• Dependable, consistent, pervasive, and inexpensive access to high-end computing.
• Geographically distributed platforms connected over the Internet.
Multiprocessing within a Chip: Many-Core

[Figure: hardware threads per chip growing from 1 toward 100+ between 2003 and 2013 - from HT, through the multi-core era (scalar and parallel applications), to the many-core era (massively parallel applications)]

• Intel predicts 100's of cores on a chip in 2015
SIMD Parallel Computing

[Figure: one controller broadcasts a single instruction stream (e.g., Add r1, b) to multiple processors over an interconnection network; each processor applies it to its own data stream - SIMD execution]

• It can be a stand-alone multiprocessor
• Or embedded in a single processor for specific applications (MMX)
SIMD Applications

• Applications: databases, image processing, and signal processing.
• Image processing maps very naturally onto SIMD systems:
– Each processor (execution unit) performs operations on a single pixel or a neighborhood of pixels.
– The operations performed are fairly straightforward and simple.
– Data can be streamed into the system and operated on in real time or close to real time.
SIMD Operations

• Image processing on SIMD systems:
– Sequential pixel operations take a very long time to perform.
– A 512×512 image would require 262,144 iterations through a sequential loop, with each iteration executing 10 instructions. That translates to 2,621,440 clock cycles (if each instruction is a single cycle), plus loop overhead.

[Figure: 512×512 image; each pixel is operated on sequentially, one after another]
SIMD Operations

• Image processing on SIMD systems:
– On a SIMD system with 64×64 processors (e.g., very simple ALUs), the same operations would take 640 cycles, where each processor operates on an 8×8 set of pixels, plus loop overhead.

[Figure: 512×512 image; each processor operates on an 8×8 set of pixels in parallel. Speedup due to parallelism: 2,621,440/640 = 4096 = 64×64 (number of processors), loop overhead ignored]
SIMD Operations

• Image processing on SIMD systems:
– On a SIMD system with 512×512 processors (which is not unreasonable on SIMD machines), the same operation would take 10 cycles.

[Figure: 512×512 image; each processor operates on a single pixel in parallel. Speedup due to parallelism: 2,621,440/10 = 262,144 = 512×512 (number of processors)! Notice: no loop overhead]
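The cycle counts in the three examples above reduce to simple arithmetic; a sketch (assuming an even split of pixels across processors and ignoring loop overhead, as the slides do):

```python
# Cycle counts for the 512x512 image example at different processor counts.
def simd_cycles(image_side, instr_per_pixel, n_processors):
    pixels = image_side * image_side
    pixels_per_proc = pixels // n_processors   # assumes pixels divide evenly
    return pixels_per_proc * instr_per_pixel

sequential = simd_cycles(512, 10, 1)                  # 2,621,440 cycles
print(sequential // simd_cycles(512, 10, 64 * 64))    # speedup 4096 with 64x64 procs
print(sequential // simd_cycles(512, 10, 512 * 512))  # speedup 262144, one pixel each
```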
Pentium MMX: MultiMedia eXtensions

• 57 new instructions
• Eight 64-bit wide MMX registers
• First available in 1997
• Supported on:
– Intel Pentium-MMX, Pentium II, Pentium III, Pentium 4
– AMD K6, K6-2, K6-3, K7 (and later)
– Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later)
• Gives a large speedup in many multimedia applications
MMX SIMD Operations

• Example: consider image pixel data represented as bytes.
– With MMX, eight of these pixels can be packed together into a 64-bit quantity and moved into an MMX register.
– An MMX instruction performs the arithmetic or logical operation on all eight elements in parallel.
• PADD(B/W/D): Addition
– PADDB MM1, MM2 adds the 64-bit contents of MM2 to MM1, byte by byte; any carries generated are dropped, e.g., byte A0h + 70h = 10h.
• PSUB(B/W/D): Subtraction
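The wraparound behavior of PADDB can be emulated in plain Python (a model of the semantics only; real MMX code uses the instruction directly):

```python
# Emulating MMX PADDB: eight packed unsigned bytes added lane by lane,
# with the carry out of each byte dropped (i.e., addition modulo 256).
def paddb(a, b):
    # a, b: lists of eight byte values, one 64-bit MMX register each
    return [(x + y) & 0xFF for x, y in zip(a, b)]

# The slide's example lane: A0h + 70h = 110h, carry dropped -> 10h.
print(paddb([0xA0] * 8, [0x70] * 8))
```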
MMX: Image Dissolve Using Alpha Blending

• Example: MMX instructions speed up image composition.
• A flower will dissolve into a swan.
• Alpha (a standard scheme) determines the intensity of the flower.
• At full intensity, the flower's 8-bit alpha value is FFh, or 255.
• The equation below calculates each pixel:
Result_pixel = Flower_pixel × (alpha/255) + Swan_pixel × [1 - (alpha/255)]
• For alpha = 230, the resulting pixel is 90% flower and 10% swan.
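The blend equation as code (the pixel values in the example are made up for illustration):

```python
# Per-pixel alpha blend from the slide: alpha = 255 gives pure flower,
# alpha = 0 gives pure swan; alpha = 230 is roughly 90% flower.
def blend(flower_pixel, swan_pixel, alpha):
    a = alpha / 255.0
    return flower_pixel * a + swan_pixel * (1.0 - a)

print(round(blend(200, 100, 230)))   # 190: dominated by the flower pixel
```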
SIMD Multiprocessing

• It is easy to write applications for SIMD processors.
• The applications are limited (image processing, computer vision, etc.).
• SIMD is frequently used to speed up specific applications (e.g., the graphics co-processor in SGI computers).
• In the late 80s and early 90s, many SIMD machines were commercially available (e.g., the Connection Machine had 64K ALUs, and the MasPar had 16K ALUs).
Flynn's Taxonomy of Computing

• MIMD (Multiple Instruction, Multiple Data):
– Multiple processors autonomously executing different instructions on different data.
– Keep in mind that the processors are working together to solve a single problem.
• This is a more general form of multiprocessing, and can be used in numerous applications.
MIMD Architecture

• Unlike SIMD, a MIMD computer works asynchronously.
• Shared memory (tightly coupled) MIMD
• Distributed memory (loosely coupled) MIMD

[Figure: processors A, B, and C, each with its own instruction stream, consuming separate data input streams and producing separate data output streams]
Shared Memory Multiprocessor

[Figure: four processors, each with registers and caches, connected through a chipset to shared memory and to disk & other I/O]

• Memory: centralized, with Uniform Memory Access time ("UMA"), bus interconnect, and I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro
Shared Memory Programming Model

[Figure: processes running on processors load(X) and store(X) a shared variable X that lives in the system's shared memory]
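In miniature, the shared-memory model looks like this with Python threads: two threads load and store the same variable X, and a lock keeps the updates coherent (a sketch; real shared-memory hardware enforces coherence in the caches):

```python
# Two threads sharing one variable X through a common address space.
import threading

X = 0
lock = threading.Lock()

def worker(n):
    global X
    for _ in range(n):
        with lock:          # load(X) ... store(X) as one atomic step
            X += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(X)   # 20000: no increments lost, thanks to the lock
```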
Shared Memory Model

[Figure: virtual address spaces for a collection of processes communicating via shared addresses; the shared portion of each address space (P0, P1, P2, ..., Pn) maps to common physical addresses in the machine's physical address space, while each process keeps a private portion; loads and stores in the shared portion are visible to all processes]
Cache Coherence Problem

• Processor 3 does not see the value written by processor 0.

[Figure: four processors with private caches over a shared memory holding X = 42. P3 reads X (42) into its cache; P0 then writes X = 17; a later read by P3 still returns the stale 42 from its own cache]
Write-Through Does Not Help

• Processor 3 sees 42 in its cache; it does not get the correct value (17) from memory.

[Figure: the same four-processor system with write-through caches; P0's write of X = 17 updates memory, but P3's cached copy of X remains 42, so its read still returns the stale value]
One Solution: Shared Cache

[Figure: processors P1 ... Pn, each with a first-level cache, connected through a switch to a shared cache and interleaved main memory]

• Advantages:
– Cache placement identical to a single cache: only one copy of any cached block
• Disadvantages:
– Bandwidth limitation
Limits of the Shared Cache Approach

[Figure: processors with caches on a shared interconnect to memory and I/O; each processor demands 5.2 GB/s]

Assume:
• A 1 GHz processor without cache
• => 4 GB/s instruction bandwidth per processor (32-bit)
• => 1.2 GB/s data bandwidth at 30% load-store
• Need 5.2 GB/s of bus bandwidth per processor!
• Typical bus bandwidth can hardly support one processor.
Distributed Caches: Snoopy Cache-Coherence Protocols

[Figure: processors P1 ... Pn with caches on a shared bus to memory and I/O devices; each cache line holds state, address, and data, and a bus snoop observes every cache-memory transaction]

• The bus is a broadcast medium, and caches know what they have.
– Bus protocol: arbitration, command/address, data
– => Every device observes every transaction
Snooping Cache Coherency

• The cache controller "snoops" all transactions on the shared bus.
• A transaction is relevant if it involves a cache block currently contained in this cache.
• The controller takes action to ensure coherence (invalidate, update, or supply the value).
Hardware Cache Coherence

• Write-invalidate: when a processor changes X to X', all other cached copies of X are invalidated; later reads miss and fetch the new value.
• Write-update (also called distributed write): when a processor changes X to X', the new value is broadcast and all other cached copies are updated to X'.

[Figure: both schemes shown over caches connected to memory via an interconnection network (ICN)]
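A toy model of write-invalidate (a sketch only; a real protocol such as MSI/MESI also tracks ownership states and carries invalidations over the bus or interconnect):

```python
# Write-invalidate in miniature: a write removes every other cached copy of
# the block, so later reads must refetch the fresh value from memory.
memory = {"X": 42}
caches = [dict() for _ in range(4)]      # one private cache per processor

def read(p, addr):
    if addr not in caches[p]:
        caches[p][addr] = memory[addr]   # miss: fetch from memory
    return caches[p][addr]

def write(p, addr, value):
    for q, cache in enumerate(caches):
        if q != p:
            cache.pop(addr, None)        # invalidate all other copies
    caches[p][addr] = value
    memory[addr] = value                 # write-through, for simplicity

read(3, "X")            # P3 caches 42
write(0, "X", 17)       # P0 writes; P3's copy is invalidated
print(read(3, "X"))     # 17: the fresh value, not the stale 42
```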
Limits of Bus-Based Shared Memory

[Figure: processors with caches on a shared bus to memory and I/O; each processor demands 5.2 GB/s before caching, 140 MB/s after]

Assume:
• A 1 GHz processor without cache
• => 4 GB/s instruction bandwidth per processor (32-bit)
• => 1.2 GB/s data bandwidth at 30% load-store
Suppose a 98% instruction hit rate and a 95% data hit rate:
• => 80 MB/s instruction bandwidth per processor
• => 60 MB/s data bandwidth per processor
• => 140 MB/s combined bandwidth
Assuming 1 GB/s of bus bandwidth, 8 processors will saturate the memory bus.
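The saturation arithmetic above, checked in a few lines:

```python
# Per-processor bus demand after caching: only the misses reach the bus.
inst_bw = 4.0e9 * 0.02          # 2% of 4 GB/s instruction traffic misses -> 80 MB/s
data_bw = 1.2e9 * 0.05          # 5% of 1.2 GB/s data traffic misses -> 60 MB/s
per_proc = inst_bw + data_bw    # 140 MB/s per processor

bus_bw = 1.0e9                  # assumed 1 GB/s bus
print(bus_bw / per_proc)        # ~7.1 -> about 8 processors saturate the bus
```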
Scalable Shared Memory Architectures: Crossbar Switch

[Figure: memory modules on one side of a crossbar switch, processor/cache and I/O nodes on the other; any processor can reach any memory module through its own crosspoint]

• Used in the Sun Enterprise 10000
Scalable Shared Memory Architectures

• Used in the IBM SP multiprocessor

[Figure: a multistage interconnection network connecting processors P0-P7 to memory modules M0-M7, with 3-bit binary addresses (000-111) steering the routing at each stage]
Approaches to Building Parallel Machines

[Figure: three organizations arranged by increasing scale - a shared cache design (P1 ... Pn with first-level caches over a switch to a shared cache and interleaved main memory), a shared memory design (processors with private caches connected by an interconnection network to memory), and a distributed memory design (processor-cache nodes, each with local memory, connected by an interconnection network)]