Multiprocessors - Parallel Computing

Processor Performance

• We have looked at various ways of increasing single-processor performance (excluding VLSI techniques):

Pipelining

ILP

Super-scalars

Out-of-order execution (Scoreboarding)

VLIW

Cache (L1, L2, L3)

Interleaved memories

Compilers (Loop unrolling, branch prediction, etc.)

RAID

Etc …

• However, quite often even the best microprocessors are not fast enough for certain applications !!!

Example: How far will ILP go?

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming

[Figure: left panel, fraction of total cycles (%) vs. number of instructions issued; right panel, speedup vs. instructions issued per cycle.]


When Do We Need High Performance Computing?

• Case 1

– To do a time-consuming operation in less time

• I am an aircraft engineer

• I need to run a simulation to test the stability of the wings at high speed

• I’d rather have the result in 5 minutes than in 5 days so that I can complete the aircraft final design sooner.


When Do We Need High Performance Computing?

• Case 2

– To do an operation before a tighter deadline

• I am a weather prediction agency

• I am getting input from weather stations/sensors

• I’d like to make the forecast for tomorrow before tomorrow


When Do We Need High Performance Computing?

• Case 3

– To do a high number of operations per second

• I am an engineer at Amazon.com

• My Web server gets 10,000 hits per second

• I’d like my Web server and my databases to handle 10,000 transactions per second so that customers do not experience bad delays

– Amazon does “process” several GBytes of data per second


The need for High-Performance Computers: just some examples

• Automotive design:
– Major automotive companies use large systems (500+ CPUs) for:
• CAD-CAM, crash testing, structural integrity and aerodynamics.
– Savings: approx. $1 billion per company per year.

• Semiconductor industry:
– Semiconductor firms use large systems (500+ CPUs) for:
• device electronics simulation and logic validation.
– Savings: approx. $1 billion per company per year.

• Airlines:
– System-wide logistics optimization systems run on parallel systems.
– Savings: approx. $100 million per airline per year.

Grand Challenges

[Figure: grand-challenge applications plotted by computational performance requirements (100 MFLOPS to 1 TFLOPS) and storage requirements (10 MB to 1 TB): 2D airfoil, 48-hour weather, oil reservoir modelling, 3D plasma modelling, 72-hour weather, vehicle dynamics, chemical dynamics, pharmaceutical design, structural biology.]

Weather Forecasting

• Suppose the whole global atmosphere is divided into cells of size 1 km × 1 km × 1 km to a height of 10 km (10 cells high) – about 5 × 10^8 cells.

• Suppose each cell calculation requires 200 floating point operations. In one time step, 10^11 floating point operations are necessary.

• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10^9 floating point operations/s) – similar to the Pentium 4 – takes about 10^6 seconds, or over 10 days.

• To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10^12 floating point operations/s).
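As a sanity check, here is a minimal Python sketch of the same back-of-the-envelope arithmetic; the cell count, operations per cell, and time step are the slide's stated assumptions:

```python
# Back-of-the-envelope check of the weather-forecasting estimate above.
cells = 5e8               # ~5 x 10^8 cells (1 km^3 cells, 10 cells high, whole globe)
ops_per_cell = 200        # floating point operations per cell per time step
steps = 7 * 24 * 60       # 7 days of 1-minute time steps = 10,080 steps

total_ops = cells * ops_per_cell * steps        # ~10^15 operations
print(f"total operations: {total_ops:.2e}")

one_gflops = 1e9                                # Pentium-4-class machine
print(f"days at 1 Gflops: {total_ops / one_gflops / 86400:.1f}")   # > 10 days

target = 5 * 60                                 # want the forecast in 5 minutes
print(f"required rate: {total_ops / target:.2e} flops")            # ~3.4 Tflops
```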


Google

1. The user enters a query on a web form sent to the Google web server.

2. The web server sends the query to the Index Server cluster, which matches the query to documents.

3. The match is sent to the Doc Server cluster, which retrieves the documents to generate abstracts and cached copies.

4. The list, with abstracts, is displayed by the web server to the user, sorted (using a secret formula involving PageRank).


Google Requirements

• Google: search engine that scales at Internet growth rates

• Search engines: 24x7 availability

• Google : 600M queries/day, or AVERAGE of 7500 queries/s all day

• Response time goal: < 0.5 s per search

• Google crawls WWW and puts up new index every 2 weeks

• Storage: 5.3 billion web pages, 950 million newsgroup messages, and 925 million images indexed; millions of videos

Google

• Requires high amounts of computation per request

• A single query on Google (on average):
– reads hundreds of megabytes of data
– consumes tens of billions of CPU cycles

• A peak request stream on Google requires an infrastructure comparable in size to the largest supercomputer installations

• Typical Google data center: 15,000 PCs (Linux), 30,000 disks: almost 3 petabytes!

• The Google application affords easy parallelization:
– Different queries can run on different processors
– A single query can use multiple processors, because the overall index is partitioned


Multiprocessing

• Multiprocessing (Parallel Processing): Concurrent execution of tasks (programs) using multiple computing, memory and interconnection resources.

Use multiple resources to solve problems faster.

• Provides an alternative to a faster clock for performance

– Assuming a doubling of effective processor performance every 2 years, a 1024-processor system can deliver the performance that would take a single-processor system 20 years to reach (2^10 = 1024: ten doublings at two years each)

• Using multiple processors to solve a single problem

– Divide problem into many small pieces

– Distribute these small problems to be solved by multiple processors simultaneously

Multiprocessing

• For the last 30+ years, multiprocessing has been seen as the best way to produce orders-of-magnitude performance gains.
– Double the number of processors, get double the performance (at less than 2 times the cost).

• It turns out that the ability to develop and deliver software for multiprocessing systems has been the impediment to wide adoption.

Amdahl’s Law

• A parallel program has a sequential part (e.g., I/O) and a parallel part. Let α be the sequential fraction and T1 the single-processor time:

– T1 = αT1 + (1 − α)T1

– Tp = αT1 + (1 − α)T1 / p

• Therefore:

Speedup(p) = 1 / (α + (1 − α)/p) = p / (αp + 1 − α) ≤ 1/α

• Example: if a code is 10% sequential (i.e., α = 0.10), the speedup will always be lower than 1/α = 10, no matter how many processors are used!
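A one-function Python sketch of the law makes the saturation easy to see (the processor counts are chosen to match the chart on the next slide):

```python
def amdahl_speedup(alpha: float, p: int) -> float:
    """Amdahl's Law: alpha = sequential fraction, p = processor count."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

# 10% sequential code: the speedup saturates below 1/alpha = 10.
for p in (4, 16, 1000):
    print(f"p = {p:4d}: speedup = {amdahl_speedup(0.10, p):.2f}")
```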


Performance Potential Using Multiple Processors

• Amdahl’s Law is pessimistic (in this case):
– Let s be the serial part and p the part that can be parallelized n ways (s + p = 1).
– Serial: SSPPPPPP (8 time units)
– 6 processors: SSP–P–P–P–P–P (the six P’s run in parallel: 3 time units)
– Speedup = 8/3 = 2.67
– In general, T(n) = s + p/n, so Speedup(n) = 1/(s + p/n)
– As n → ∞, T(n) → s and the speedup is capped at 1/s – a pessimistic bound

Amdahl’s Law

[Figure: speedup vs. % serial (10% to 99%) for 4, 16, and 1000 CPUs; even with 1000 CPUs the speedup stays below 10 once the code is 10% serial.]


Performance Potential: Another View

• Gustafson’s view (more widely adopted for multiprocessors):
– The parallel portion increases as the problem size increases
• Serial time fixed (at s)
• Parallel time proportional to problem size (true most of the time)

• Old serial: SSPPPPPP
• 6 processors: SSPPPPPP, with each P-slot now doing 6 P’s worth of work in the same time
• Hypothetical serial run of the scaled problem: SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP

– Scaled speedup = (8 + 5×6)/8 = 4.75
– T′(n) = s + n·p, so the scaled speedup grows without bound as n → ∞
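The slide's scaled-speedup arithmetic, as a short Python sketch (s and p are the normalized serial and parallel fractions of the parallel run):

```python
def gustafson_speedup(s: float, n: int) -> float:
    """Gustafson-Barsis: scaled speedup with serial fraction s on n processors."""
    return s + n * (1.0 - s)

s = 2 / 8                        # SSPPPPPP: 2 serial units out of 8
print(gustafson_speedup(s, 6))   # (8 + 5*6)/8 = 4.75, and it grows with n
```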

Amdahl vs. Gustafson-Barsis

[Figure: speedup vs. % serial (10% to 99%); the Gustafson-Barsis scaled speedup stays far above the Amdahl curve across the whole range.]

TOP 5 most powerful computers in the world – must be multiprocessors

http://www.top500.org/

Rank | Site / Country | Processors | Manufacturer | Rmax (GFlops)
1 | DOE/NNSA/LLNL, United States | 212,992 (PowerPC) | IBM | 478,200
2 | Forschungszentrum Juelich (FZJ), Germany | 65,536 (PowerPC) | IBM | 167,300
3 | SGI/New Mexico Computing Applications Center (NMCAC), United States | 14,336 (Intel EM64T Xeon) | SGI | 126,900
4 | Computational Research Laboratories, TATA SONS, India | 14,240 (Intel EM64T Xeon) | HP | 117,900
5 | Government Agency, Sweden | 13,728 (AMD x86_64 Opteron Dual Core) | Cray | 42,900


Supercomputer Style Migration (Top500)

• In the last 8 years, uniprocessors and SIMDs disappeared while clusters and constellations grew from 3% to 80%

Cluster – whole computers interconnected using their I/O bus

Constellation – a cluster that uses an SMP multiprocessor as the building block

Multiprocessing (usage)

• Multiprocessor systems are being used for a wide variety of purposes.

• Redundant processing (safeguard) – fault tolerance.

• Multiprocessor systems – increase throughput:
– Many tasks (no communication between them)
– Multi-user departmental, enterprise and web servers.

• Parallel computing systems – decrease execution time:
– Execute large-scale applications in parallel.

Multiprocessing

• Multiple resources:
– Computers (e.g., clusters of PCs)
– CPUs (e.g., shared memory computers)
– ALUs (e.g., multiprocessors within a single chip)
– Memory
– Interconnect

• Tasks, from coarse-grain to fine-grain:
– Programs
– Procedures
– Instructions

• Different combinations result in different systems.

Why did the popularity of multiprocessors slow down compared to the 90s?

1. The ability to develop and deliver software for multiprocessing systems has been the impediment to wide adoption – the goal was to make programming transparent to the user (as with pipelining), which never happened. However, there have been a lot of advances here.

2. The tremendous advances of microprocessors (doubling in performance every 2 years) were able to satisfy the needs of 99% of applications.

3. It did not make a business case: vendors were only able to sell a few parallel computers (< 200). As a result, they were not able to invest in designing cheap and powerful multiprocessors.

4. Most parallel computer vendors went bankrupt by the mid-90s – there was no business.

Flynn’s Taxonomy of Computing

• SISD (Single Instruction, Single Data):
– Typical uniprocessor systems that we’ve studied throughout this course.
– Uniprocessor systems can time-share and still be SISD.

• SIMD (Single Instruction, Multiple Data):
– Multiple processors simultaneously executing the same instruction on different data.
– Specialized applications (e.g., image processing).

• MIMD (Multiple Instruction, Multiple Data):
– Multiple processors autonomously executing different instructions on different data.
– Keep in mind that the processors are working together to solve a single problem.

SIMD Systems

[Figure: a von Neumann computer (one processor, one memory) vs. an array of processor–memory (PM) pairs connected by an interconnection network.]

• One control unit

• Lockstep

• All Ps do the same or nothing

MIMD Shared Memory Systems

[Figure: processors with caches connected through an interconnection network to memory modules that form one global memory.]

• One global memory

• Cache coherence

• All Ps have equal access to memory

Cache Coherent NUMA

[Figure: nodes, each with a processor, cache and a slice of the shared memory, connected by an interconnection network.]

• Each P has part of the shared memory

• Non-uniform memory access

MIMD Distributed Memory Systems

[Figure: processor–memory pairs connected by an interconnection network; example topologies shown with binary node addresses, plus a LAN/WAN-connected variant.]

• No shared memory

• Message passing

• Topology

Cluster Architecture

[Figure: complete computers (each with processor, cache, memory, I/O and its own OS) connected by an interconnection network, with middleware and a programming environment layered on top (home cluster).]

Grids

• Dependable, consistent, pervasive, and inexpensive access to high-end computing.

• Geographically distributed platforms connected over the Internet.

Multiprocessing within a chip: Many-Core

[Figure: hardware threads per chip (1 to 100, log scale) from 2003 to 2013: the multi-core era (scalar and parallel applications) leading into the many-core era (massively parallel applications).]

• Intel predicts 100’s of cores on a chip in 2015

SIMD Parallel Computing

[Figure: a single controller broadcasts one instruction stream (e.g., Add r1, b) to multiple processors connected by an interconnection network, each applying it to its own data stream: SIMD execution.]

• It can be a stand-alone multiprocessor

• Or embedded in a single processor for specific applications (MMX)

SIMD Applications

• Database, image processing, and signal processing.

• Image processing maps very naturally onto SIMD systems:
– Each processor (execution unit) performs operations on a single pixel or neighborhood of pixels.
– The operations performed are fairly straightforward and simple.
– Data can be streamed into the system and operated on in real time or close to real time.

SIMD Operations

• Image processing on SIMD systems:
– Sequential pixel operations take a very long time to perform.
– A 512×512 image would require 262,144 iterations through a sequential loop, with each iteration executing 10 instructions. That translates to 2,621,440 clock cycles (if each instruction is a single cycle), plus loop overhead.

[Figure: 512×512 image; each pixel is operated on sequentially, one after another.]

SIMD Operations

• Image processing on SIMD systems:
– On a SIMD system with 64×64 processors (e.g., very simple ALUs), the same operations would take 640 cycles, where each processor operates on an 8×8 set of pixels, plus loop overhead.

[Figure: 512×512 image; each processor operates on an 8×8 set of pixels in parallel.]

• Speedup due to parallelism: 2,621,440/640 = 4096 = 64×64 (the number of processors), loop overhead ignored.

SIMD Operations

• Image processing on SIMD systems:
– On a SIMD system with 512×512 processors (which is not unreasonable on SIMD machines), the same operation would take 10 cycles.

[Figure: 512×512 image; each processor operates on a single pixel in parallel.]

• Speedup due to parallelism: 2,621,440/10 = 262,144 = 512×512 (the number of processors)!

• Notice: no loop overhead!
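To make the lockstep idea concrete, here is a minimal NumPy sketch contrasting the sequential and data-parallel views; the halving operation is an arbitrary stand-in for the slide's unspecified 10-instruction pixel operation:

```python
import numpy as np

image = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)

# Sequential view: 262,144 loop iterations, one pixel after another.
out_seq = np.empty_like(image)
for i in range(512):
    for j in range(512):
        out_seq[i, j] = image[i, j] // 2      # the per-pixel operation

# SIMD view: one operation applied to all 512x512 pixels "at once".
out_par = image // 2

assert (out_seq == out_par).all()
```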

Pentium MMX (MultiMedia eXtensions)

• 57 new instructions

• Eight 64-bit wide MMX registers

• First available in 1997

• Supported on:
– Intel Pentium-MMX, Pentium II, Pentium III, Pentium IV
– AMD K6, K6-2, K6-3, K7 (and later)
– Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later)

• Gives a large speedup in many multimedia applications

MMX SIMD Operations

• Example: consider image pixel data represented as bytes.
– With MMX, eight of these pixels can be packed together in a 64-bit quantity and moved into an MMX register.
– An MMX instruction performs the arithmetic or logical operation on all eight elements in parallel.

• PADD(B/W/D): Addition
– PADDB MM1, MM2 adds the 64-bit contents of MM2 to MM1, byte by byte; any carries generated are dropped, e.g., byte A0h + 70h = 10h.

• PSUB(B/W/D): Subtraction
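A rough Python model of what PADDB does, using NumPy's modulo-256 uint8 arithmetic to mimic the dropped carries (the register contents are made-up example values):

```python
import numpy as np

# Eight packed bytes, as in a 64-bit MMX register.
mm1 = np.array([0xA0, 0x01, 0xFF, 0x10, 0x80, 0x7F, 0x00, 0xC0], dtype=np.uint8)
mm2 = np.array([0x70, 0x02, 0x01, 0x20, 0x80, 0x01, 0xFF, 0x40], dtype=np.uint8)

# PADDB-style add: element-wise; carries out of each byte are dropped
# because uint8 arithmetic wraps modulo 256, e.g. 0xA0 + 0x70 = 0x10.
print([hex(v) for v in mm1 + mm2])
```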

MMX: Image Dissolve Using Alpha Blending

• Example: MMX instructions speed up image composition.

• A flower image dissolves into a swan image.

• Alpha (a standard scheme) determines the intensity of the flower.

• At full intensity, the flower’s 8-bit alpha value is FFh, or 255.

• The equation below calculates each pixel:

Result_pixel = Flower_pixel × (alpha/255) + Swan_pixel × (1 − alpha/255)

• For alpha = 230, the resulting pixel is 90% flower and 10% swan.
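The blending equation, sketched in NumPy on toy constant "images" (the 4×4 arrays and pixel values are invented for illustration):

```python
import numpy as np

def dissolve(flower: np.ndarray, swan: np.ndarray, alpha: int) -> np.ndarray:
    """Result = Flower*(alpha/255) + Swan*(1 - alpha/255), per the slide."""
    a = alpha / 255.0
    return (flower * a + swan * (1.0 - a)).astype(np.uint8)

flower = np.full((4, 4), 200, dtype=np.uint8)   # toy "flower" image
swan = np.full((4, 4), 50, dtype=np.uint8)      # toy "swan" image
print(dissolve(flower, swan, 230)[0, 0])        # ~90% flower, 10% swan
```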

SIMD Multiprocessing

• It is easy to write applications for SIMD processors.

• The applications are limited (image processing, computer vision, etc.).

• It is frequently used to speed up specific applications (e.g., the graphics co-processor in SGI computers).

• In the late 80s and early 90s, many SIMD machines were commercially available (e.g., the Connection Machine had 64K ALUs, and the MasPar had 16K ALUs).

Flynn’s Taxonomy of Computing

• MIMD (Multiple Instruction, Multiple Data):
– Multiple processors autonomously executing different instructions on different data.
– Keep in mind that the processors are working together to solve a single problem.

• This is a more general form of multiprocessing, and can be used in numerous applications.

MIMD Architecture

• Unlike SIMD, a MIMD computer works asynchronously.

• Shared memory (tightly coupled) MIMD

• Distributed memory (loosely coupled) MIMD

[Figure: processors A, B and C, each with its own instruction stream and its own data input and output streams.]

Shared Memory Multiprocessor

[Figure: several processors, each with registers and caches, connected through a chipset to shared memory, disk and other I/O.]

• Memory: centralized, with Uniform Memory Access time (“UMA”), bus interconnect and I/O.

• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro.

Shared Memory Programming Model

[Figure: processes on different processors communicate through a shared variable X in memory, one issuing store(X), the other load(X).]
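A minimal Python sketch of the model: two threads in one address space communicating through a shared variable X via plain loads and stores (the join ordering is there so the store is guaranteed to precede the load):

```python
import threading

X = 0                                  # shared variable in the common address space

def writer():
    global X
    X = 17                             # store(X)

result = []
def reader():
    result.append(X)                   # load(X)

w = threading.Thread(target=writer)
w.start(); w.join()                    # writer finishes before the reader starts
r = threading.Thread(target=reader)
r.start(); r.join()
print(result)                          # [17]: the store is visible to the reader
```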

Shared Memory Model

[Figure: virtual address spaces for a collection of processes communicating via shared addresses; each process has a private portion (P0–Pn private) and a shared portion mapped to common physical addresses in the machine physical address space; loads and stores to the shared portion are visible to all processes.]

Cache Coherence Problem

• Processor 3 does not see the value written by processor 0.

[Figure: four processors with caches over one memory; X = 42 in memory and in two caches; processor 0 writes X = 17 in its own cache, but processor 3 still reads X = 42.]

Write-Through Does Not Help

• Processor 3 sees 42 in its cache; it does not get the correct value (17) from memory.

[Figure: with write-through, the write X = 17 updates memory, but processor 3’s cached copy X = 42 is stale, so its read still returns 42.]

One Solution: Shared Cache

[Figure: processors P1..Pn with first-level caches, connected through a switch to a shared cache and interleaved main memory.]

• Advantages:
– Cache placement identical to a single cache: only one copy of any cached block.

• Disadvantages:
– Bandwidth limitation.

Limits of the Shared Cache Approach

[Figure: processors with caches, shared bus, memory and I/O.]

Assume:
– 1 GHz processor w/o cache
– => 4 GB/s instruction bandwidth per processor (32-bit)
– => 1.2 GB/s data bandwidth at 30% load-store

• Need 5.2 GB/s of bus bandwidth per processor!

• Typical bus bandwidth can hardly support one processor.

Distributed Cache: Snoopy Cache-Coherence Protocols

• The bus is a broadcast medium, and caches know what they have.
– Bus protocol: arbitration, command/address, data.
– => Every device observes every transaction.

[Figure: processors P1..Pn with caches (each line holding state, address and data) on a shared bus with memory and I/O devices; every cache snoops every cache-memory transaction.]

Snooping Cache Coherency

• The cache controller “snoops” all transactions on the shared bus.

• A transaction is relevant if it involves a cache block currently contained in this cache.

• The controller takes action to ensure coherence: invalidate, update, or supply the value.

Hardware Cache Coherence

• Write-invalidate: when a processor writes X → X′, all other cached copies of X are marked invalid.

• Write-update (also called distributed write): when a processor writes X → X′, all other cached copies are updated to X′.

[Figure: caches over memory and an interconnection network; under write-invalidate the other copies of X become Inv, under write-update they all become X′.]
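A toy Python simulation (not any real protocol implementation) contrasting the two policies: every cache snoops the broadcast write and either kills or refreshes its copy:

```python
# Each "cache" maps a block name to a value; None marks an invalid copy.
caches = [{"X": 42} for _ in range(4)]          # all four caches hold X = 42

def bus_write(writer: int, block: str, value: int, policy: str) -> None:
    caches[writer][block] = value
    for i, cache in enumerate(caches):          # every other cache snoops the bus
        if i != writer and block in cache:
            # write-invalidate kills the stale copy; write-update refreshes it
            cache[block] = None if policy == "invalidate" else value

bus_write(0, "X", 17, "invalidate")
print([c["X"] for c in caches])                 # [17, None, None, None]
```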

Limits of Bus-Based Shared Memory

[Figure: processors with caches, shared bus, memory and I/O.]

Assume:
– 1 GHz processor w/o cache
– => 4 GB/s instruction bandwidth per processor (32-bit)
– => 1.2 GB/s data bandwidth at 30% load-store

Suppose a 98% instruction hit rate and a 95% data hit rate:
– => 80 MB/s instruction bandwidth per processor
– => 60 MB/s data bandwidth per processor
– => 140 MB/s combined bandwidth

Assuming 1 GB/s bus bandwidth, 8 processors will saturate the memory bus.
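The same saturation arithmetic in a few lines of Python (all rates are the slide's assumptions):

```python
inst_bw = 4.0e9           # 1 GHz x 4-byte instructions = 4 GB/s per processor
data_bw = 1.2e9           # 30% load-stores => 1.2 GB/s per processor

bus_per_cpu = inst_bw * (1 - 0.98) + data_bw * (1 - 0.95)   # miss traffic only
print(f"{bus_per_cpu / 1e6:.0f} MB/s of bus traffic per processor")  # 140 MB/s

bus_bw = 1.0e9            # 1 GB/s bus
print(f"~{bus_bw / bus_per_cpu:.0f} processors saturate the bus")    # ~7-8
```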

Scalable Shared Memory Architectures: Crossbar Switch

[Figure: memory modules on one side, processor + cache and I/O nodes on the other, fully connected by a crossbar.]

• Used in the Sun Enterprise 10000.

Scalable Shared Memory Architectures

• Used in the IBM SP multiprocessor.

[Figure: a multistage interconnection network connecting processors P0–P7 to memory modules M0–M7; each stage of switches is steered by one bit of the 3-bit destination address (000–111).]

Approaches to Building Parallel Machines

[Figure: three organizations in order of increasing scale: shared cache (P1..Pn through a switch to a shared first-level cache and interleaved main memory); centralized shared memory (P1..Pn with private caches over an interconnection network to shared memory modules); distributed memory (P1..Pn, each with cache and local memory, connected by an interconnection network).]
