Chapter 7 Multicores, Multiprocessors, and Clusters


Page 1: Chapter 7

Chapter 7

Multicores, Multiprocessors, and Clusters

Page 2: Chapter 7

Introduction

- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Task-level (process-level) parallelism
  - Independent tasks running in parallel
  - High throughput for independent jobs
- Parallel processing program
  - A single program run on multiple processors

§7.1 Introduction

Page 3: Chapter 7

Introduction

- Multicore microprocessors
  - Chips with multiple processors; the processors are called cores in a multicore chip
  - Almost always Shared Memory Processors (SMPs)
- Clusters
  - Microprocessors housed in many independent servers
  - Search engines, web servers, email servers, and databases
  - Memory not shared

§7.1 Introduction

Page 4: Chapter 7

Introduction

- The number of cores is expected to increase with Moore's Law
- The challenge is to create hardware and software that make it easy to write correct parallel processing programs that execute efficiently, in both performance and energy, as the number of cores per chip increases

§7.1 Introduction

Page 5: Chapter 7

Hardware and Software

- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., Intel Core i7
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
- Challenge: making effective use of parallel hardware

Page 6: Chapter 7

Hardware and Software

                       Software: Sequential                   Software: Concurrent
Hardware: Serial       Matrix Multiply written in MatLab      Windows 7 running on an
                       running on an Intel Pentium 4          Intel Pentium 4
Hardware: Parallel     Matrix Multiply written in MatLab      Windows 7 running on an
                       running on an Intel Core i7            Intel Core i7

§7.6 SISD, MIMD, SIMD, SPMD, and Vector

Page 7: Chapter 7

Parallel Processing Programs

- The difficulties are not with the hardware
- There are too few important application programs written to complete tasks sooner on multiprocessors
- It is difficult to write software that uses multiple processors to complete one task faster
- The problem gets worse as the number of processors increases

Page 8: Chapter 7

Parallel Programming

- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!

§7.2 The Difficulty of Creating Parallel Processing Programs

Page 9: Chapter 7

Parallel Programming

- Challenges
  - Scheduling the workload
  - Partitioning the work into parallel pieces
  - Balancing the load evenly between the processors
  - Synchronizing the work
  - Communication overhead between the processors

§7.2 The Difficulty of Creating Parallel Processing Programs

Page 10: Chapter 7

Instruction and Data Streams

- An alternate classification

                                Data Streams
                                Single                      Multiple
Instruction      Single         SISD: Intel Pentium 4       SIMD: SSE instructions of x86
Streams          Multiple       MISD: No examples today     MIMD: Intel Xeon e5345

§7.6 SISD, MIMD, SIMD, SPMD, and Vector

Page 11: Chapter 7

Instruction and Data Streams

- Conventional uniprocessor
  - SISD: single instruction stream, single data stream
- Conventional multiprocessor
  - MIMD: multiple instruction streams, multiple data streams
  - Can be separate programs running on different processors, or a single program (SPMD) that runs on all processors
    - Relies on conditional statements when different processors should execute different sections of code (see the sketch below)

§7.6 SISD, MIMD, SIMD, SPMD, and Vector
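
A minimal sketch of the SPMD style, assuming the processor-ID notation Pn used in the sum-reduction slides later in this deck. Every processor runs the same program, and conditionals on Pn select the work each one does; here the IDs are simulated sequentially for illustration only.

    #include <stdio.h>

    /* SPMD sketch: every "processor" executes the same function;
       the processor ID (Pn) selects which section of code it runs.
       On real hardware each call would be a separate processor or
       thread; here a loop stands in for them. */
    static void spmd_body(int Pn, int nprocs) {
        if (Pn == 0) {
            printf("P%d: coordinating (e.g., collecting results)\n", Pn);
        } else {
            printf("P%d: working on slice %d of %d\n", Pn, Pn, nprocs);
        }
    }

    int main(void) {
        const int nprocs = 4;              /* assumed processor count */
        for (int Pn = 0; Pn < nprocs; Pn++)
            spmd_body(Pn, nprocs);
        return 0;
    }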

Page 12: Chapter 7

Instruction and Data Streams

- SIMD: single instruction stream, multiple data streams
  - Operates on vectors of data, e.g., one instruction adds 64 numbers by sending 64 data streams to 64 ALUs to form 64 sums within a single clock cycle
  - Works best when dealing with arrays in for loops
  - Must have a lot of identically structured data, e.g., graphics processing (a small SSE sketch follows below)
- MISD: multiple instruction streams, single data stream
  - Performs a series of computations on a single data stream in a pipelined fashion
  - E.g., parse input from network, decrypt data, decompress data, search for match

§7.6 SISD, MIMD, SIMD, SPMD, and Vector
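
A minimal sketch of the SIMD idea using the x86 SSE intrinsics mentioned in the classification table: one add instruction operates on four floats at a time. The array names and sizes are illustrative.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];

        /* One SIMD add (_mm_add_ps) produces four sums at once,
           so the loop advances four elements per iteration. */
        for (int i = 0; i < 8; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results */
        }

        for (int i = 0; i < 8; i++)
            printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }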

Page 13: Chapter 7

Hardware Multithreading

- While MIMD relies on multiple processes or threads to keep multiple processors busy, hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion, to utilize the hardware resources more efficiently

§7.5 Hardware Multithreading

Page 14: Chapter 7

Hardware Multithreading

- Performing multiple threads of execution in parallel
  - Processor must duplicate the independent state of each thread
  - Replicate registers, program counter, etc.
- Must have fast switching between threads
  - A process switch can take thousands of cycles
  - A thread switch can be essentially instantaneous

§7.5 Hardware Multithreading

Page 15: Chapter 7

Hardware Multithreading

- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
  - Instruction-level parallelism
- Coarse-grain multithreading
  - Only switch on long stalls (e.g., L2-cache miss)
  - Disadvantage: pipeline start-up cost
  - No instruction-level parallelism

§7.5 Hardware Multithreading

Page 16: Chapter 7

Simultaneous Multithreading

- In a multiple-issue, dynamically scheduled, pipelined processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Page 17: Chapter 7


Multithreading Example

Page 18: Chapter 7

Multithreading Example

- The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support
  - A major stall, such as an instruction cache miss, can leave the entire processor idle
- With coarse-grained multithreading, the long stalls are partially hidden by switching to another thread that uses the resources of the processor
  - Pipeline start-up overhead still leads to idle cycles

§7.5 Hardware Multithreading

Page 19: Chapter 7

Multithreading Example

- Fine-grained
  - Instruction-level parallelism
  - The interleaving of threads mostly eliminates idle clock cycles
- SMT
  - Thread-level parallelism and instruction-level parallelism are both exploited
  - Multiple threads use the issue slots in a single clock cycle

§7.5 Hardware Multithreading

Page 20: Chapter 7

Future of Multithreading

- Will it survive? In what form?
- Power considerations lead to simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
  - Thread switch may be most effective
- Multiple simple cores might share resources more effectively

Page 21: Chapter 7

Shared Memory

- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Nearly all current multicore chips are SMPs

§7.3 Shared Memory Multiprocessors

Page 22: Chapter 7

Shared Memory

- Memory access time can be either:
  - Uniform memory access (UMA): the access time to a word does not depend on which processor asks for it
  - Nonuniform memory access (NUMA): some memory accesses are much faster than others, depending on which processor asks for which word, because main memory is divided and attached to different microprocessors or to different memory controllers
- Synchronize shared variables using locks (a lock sketch follows below)

§7.3 Shared Memory Multiprocessors
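
A minimal sketch of lock-based synchronization of a shared variable, using POSIX threads; the thread count and shared counter are illustrative, not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long shared_count = 0;                 /* variable shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);            /* acquire the lock ...      */
            shared_count++;                       /* ... update shared data ... */
            pthread_mutex_unlock(&lock);          /* ... release the lock      */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("count = %ld\n", shared_count);    /* 4 * 100000 with the lock */
        return 0;
    }

Compiled with a POSIX threads flag (e.g., cc -pthread), the lock makes the increments atomic with respect to one another; without it, updates from different processors could be lost.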

Page 23: Chapter 7

Example: Sum Reduction

- Sum 100,000 numbers on a 100-processor UMA machine
- Each processor has an ID: 0 ≤ Pn ≤ 99

Page 24: Chapter 7

Example: Sum Reduction

- Sum 100,000 numbers on a 100-processor UMA machine
- Each processor has an ID: 0 ≤ Pn ≤ 99
- Partition: 1000 numbers per processor
- Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, …
  - Need to synchronize between reduction steps

Page 25: Chapter 7

Example: Sum Reduction

    half = 100;
    repeat
      synch();
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
      half = half/2;  /* dividing line on who sums */
      if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
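
A runnable sketch of the same reduction using POSIX threads, with a barrier standing in for synch(); the thread count and data are scaled down and chosen to be a power of two, so the odd-half case from the slide is not needed.

    #include <pthread.h>
    #include <stdio.h>

    #define P 8          /* "processors" (threads), scaled down from 100 */
    #define N 8000       /* numbers to sum, 1000 per thread as on the slide */

    static double A[N];
    static double sum[P];
    static pthread_barrier_t barrier;      /* plays the role of synch() */

    static void *reduce(void *arg) {
        int Pn = (int)(long)arg;

        /* Phase 1: each thread sums its own 1000-element slice. */
        sum[Pn] = 0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction, halving the active threads each step. */
        for (int half = P / 2; half >= 1; half /= 2) {
            pthread_barrier_wait(&barrier);            /* synch() */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = 1.0;        /* expected total: 8000 */
        pthread_barrier_init(&barrier, NULL, P);

        pthread_t t[P];
        for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, reduce, (void *)i);
        for (int i = 0; i < P; i++) pthread_join(t[i], NULL);

        printf("total = %.0f\n", sum[0]);
        pthread_barrier_destroy(&barrier);
        return 0;
    }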

Page 26: Chapter 7

Private Memory

- Each processor has a private physical address space
- Processors must communicate via explicit message passing (send and receive); a message-passing sketch follows below

§7.4 Clusters and Other Message-Passing Multiprocessors
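
A minimal message-passing sketch of the sum reduction, using MPI (a widely used message-passing library; it is not named on these slides, so this is only an illustrative choice). Each process keeps its data in its own private memory and partial sums are combined by explicit communication.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int Pn, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);      /* this process's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* number of processes */

        /* Each process computes a private partial sum in its own memory. */
        double local = 0.0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            local += 1.0;                        /* stands in for A[i] */

        /* Explicit communication: MPI_Reduce combines the partial sums
           by message passing and delivers the total to process 0. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (Pn == 0)
            printf("total = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }

Built with an MPI compiler wrapper (e.g., mpicc) and launched with mpirun, every process runs the same SPMD program but in a separate address space, matching the private-memory model above.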

Page 27: Chapter 7

Loosely Coupled Clusters

- Network of independent computers
  - Each has private memory and its own OS
  - Connected using the I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)

Page 28: Chapter 7

Clusters

- Clusters with high-performance, dedicated message-passing networks
- Many supercomputers today use custom networks
  - Much more expensive than local area networks

Page 29: Chapter 7

Warehouse-Scale Computers

- A building to house, power, and cool 100,000 servers with a dedicated network
- Like large clusters, but their architecture and operation are more sophisticated
- They act as one giant computer

Page 30: Chapter 7

Grid Computing

- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs

Page 31: Chapter 7

Networks inside a Chip

- Thanks to Moore's Law and the increasing number of cores per chip, we now need networks inside a chip as well

Page 32: Chapter 7

Network Characteristics

- Performance
  - Latency per message on an unloaded network to send and receive a message
  - Throughput: the maximum number of messages that can be transmitted in a given time period
  - Congestion delays caused by contention for a portion of the network
- Fault tolerance: working with broken components
  - Collision detection

Page 33: Chapter 7

Network Characteristics

- Cost includes:
  - Number of switches
  - Number of links on a switch to connect to the network
  - The width (number of bits) per link
  - The length of the links when the network is mapped into silicon
- Routability in silicon
- Power: energy efficiency

Page 34: Chapter 7

Interconnection Networks

- Network topologies
  - Arrangements of processors (black squares), switches (blue dots), and links

[Figure: bus, ring, 2D mesh, N-cube (N = 3), and fully connected topologies]

§7.8 Introduction to Multiprocessor Network Topologies
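
A small sketch of how the link cost mentioned on the previous slide grows with the processor count P for two of the topologies in the figure, assuming the usual textbook counts (a ring of P nodes needs P links; a fully connected network needs P*(P-1)/2).

    #include <stdio.h>

    /* Print link counts for a ring and a fully connected network as the
       number of processors P grows (assumed formulas: P and P*(P-1)/2). */
    int main(void) {
        int sizes[] = {4, 8, 16, 32, 64};
        printf("%6s %8s %18s\n", "P", "ring", "fully connected");
        for (int i = 0; i < 5; i++) {
            int P = sizes[i];
            long ring = P;
            long full = (long)P * (P - 1) / 2;
            printf("%6d %8ld %18ld\n", P, ring, full);
        }
        return 0;
    }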

Page 35: Chapter 7

Concluding Remarks

- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
- An ongoing challenge for computer architects!

§7.13 Concluding Remarks