Multiprocessors (Advanced Computer Architecture, UNIT 4): Flynn's classification, vector computers, pipelining in vector computers, Cray, multiprocessor interconnection, general-purpose multiprocessors, data flow computers




Page 1:

Multiprocessors

Advanced Computer Architecture, UNIT 4

Flynn's classification

Vector computers

Pipelining in Vector computers

Cray

Multiprocessor interconnection

General purpose Multiprocessor

Data Flow Computers

Page 2:

The Big Picture: Where are We Now?


The major issue is this:

We’ve taken copies of the contents of main memory and put them in caches closer to the processors. But what happens to those copies if someone else wants to use the main memory data?

How do we keep all copies of the data in sync with each other?

Page 3:

The Multiprocessor Picture


Processor/Memory Bus

PCI Bus

I/O Busses

Example: Pentium System Organization

Page 4:

CS 284a, 7 October 97 Copyright (c) 1997-98, John Thornley 4

Why Buy a Multiprocessor?

• Multiple users.
• Multiple applications.
• Multitasking within an application.
• Responsiveness and/or throughput.


Page 5:


Multiprocessor Architectures

• Message-Passing Architectures
  – Separate address space for each processor.
  – Processors communicate via message passing.

• Shared-Memory Architectures
  – Single address space shared by all processors.
  – Processors communicate by memory reads/writes.
  – SMP or NUMA.
  – Cache coherence is an important issue.


Page 6:


Message-Passing Architecture

[Diagram: several nodes, each with a processor, cache, and private memory, connected by an interconnection network.]


Page 7:


Shared-Memory Architecture

[Diagram: processors 1 through N, each with a cache, connected by an interconnection network to memories 1 through M.]


Page 8:


Shared-Memory Architecture: SMP and NUMA

• SMP = Symmetric Multiprocessor
  – All memory is equally close to all processors.
  – Typical interconnection network is a shared bus.
  – Easier to program, but doesn't scale to many processors.

• NUMA = Non-Uniform Memory Access
  – Each memory is closer to some processors than others.
  – a.k.a. "Distributed Shared Memory".
  – Typical interconnection is a grid or hypercube.
  – Harder to program, but scales to more processors.


Page 9:

Shared Memory Multiprocessor


[Diagram: four processors, each with registers and caches, connected through a chipset to memory and to disk & other I/O.]

• Memory: centralized, with Uniform Memory Access time ("UMA"), bus interconnect, and I/O

• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro

Page 10:

Shared Memory Multiprocessor


• Several processors share one address space
  – conceptually a shared memory
  – often implemented just like a multicomputer: address space distributed over private memories

• Communication is implicit
  – read and write accesses to shared memory locations

• Synchronization
  – via shared memory locations
    • spin waiting for non-zero
  – barriers

[Conceptual model: processors P connected by a network/bus to shared memory M.]

Page 11:

Message Passing Multicomputers


• Computers (nodes) connected by a network
  – Fast network interface
    • Send, receive, barrier
  – Nodes no different from a regular PC or workstation

• Cluster of conventional workstations or PCs with a fast network: "cluster computing"
  – Berkeley NOW
  – IBM SP2

[Diagram: nodes, each a processor P with local memory M, connected by a network.]

Page 12:

Large-Scale MP Designs


[Diagram: large-scale MP design with a low-latency, high-reliability interconnect; access latencies of 40 and 100 cycles.]

Memory: distributed, with non-uniform memory access time ("NUMA") and a scalable interconnect (distributed memory).

Page 13:

Shared Memory Architectures


In this section we will examine the issues around:

• Sharing one memory space among several processors.

• Maintaining coherence among several copies of a data item.

Page 14:

The Problem of Cache Coherency

[Diagram: CPU with a cache holding copies A' and B', memory holding A and B, and an I/O device.]

a) Cache and memory coherent: A' = A = 100, B' = B = 200.

b) The CPU writes A' = 550 in the cache while memory still holds A = 100. Cache and memory are incoherent: A' ≠ A, and an I/O output of A gives the stale value 100.

c) I/O inputs 440 to B in memory while the cache still holds B' = 200. Cache and memory are incoherent: B' ≠ B.

Page 15:

Some Simple Definitions


Mechanism     | How It Works                                                  | Performance                                       | Coherency Issues
Write Back    | Write modified data from cache to memory only when necessary. | Good, because it doesn't tie up memory bandwidth. | Can have problems with various copies containing different values.
Write Through | Write modified data from cache to memory immediately.         | Not so good: uses a lot of memory bandwidth.      | Modified values always written to memory; data always matches.

Page 16:

What Does Coherency Mean?


• Informally:
  – "Any read must return the most recent write"
  – Too strict and too difficult to implement

• Better:
  – "Any write must eventually be seen by a read"
  – All writes are seen in proper order ("serialization")

• Two rules to ensure this:
  – "If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"
  – Writes to a single location are serialized: seen in one order
    • The latest write will be seen
    • Otherwise writes could be seen in an illogical order (an older value could be seen after a newer value)

Page 17:

Vector Computers


• Vector Processing Overview
• Vector Metrics, Terms
• Greater Efficiency than Superscalar Processors
• Examples
  – CRAY-1 (1976, 1979): first vector-register supercomputer
  – Multimedia extensions to high-performance PC processors
  – Modern multi-vector-processor supercomputer: NEC ESS
• Design Features of Vector Supercomputers
• Conclusions

Page 18:

Vector Programming Model


[Diagram: vector programming model.]

Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] ... [VLRMAX-1]; the vector length register (VLR) sets how many elements an instruction operates on.

Vector arithmetic instructions, e.g. ADDV v3, v1, v2: elementwise addition of v1 and v2 into v3 over elements [0] ... [VLR-1].

Vector load and store instructions, e.g. LV v1, r1, r2: load v1 from memory starting at base address r1 with stride r2.

Page 19:

Vector Code Example


# Scalar Code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector Code
      LI     VLR, 64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3

# C code
for (i = 0; i < 64; i++)
    C[i] = A[i] + B[i];

Page 20:

Vector Arithmetic Execution


• Use deep pipeline (=> fast clock) to execute element operations

• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Diagram: V3 <- V1 * V2, with elements of V1 and V2 flowing through a six-stage multiply pipeline into V3.]

Page 21:

Vector Instruction Set Advantages


• Compact
  – one short instruction encodes N operations => N × FLOP bandwidth

• Expressive: tells hardware that these N operations
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store), OR access memory in a known pattern (strided load/store)

• Scalable
  – can run the same object code on more parallel pipelines or lanes

Page 22:

Properties of Vector Processors


• Each result is independent of the previous result
  => long pipeline; the compiler ensures no dependencies
  => high clock rate

• Vector instructions access memory with a known pattern
  => highly interleaved memory
  => amortize memory latency over 64-plus elements
  => no (data) caches required! (but use an instruction cache)

• Reduces branches and branch problems in pipelines

• A single vector instruction implies lots of work (≈ a loop)
  => fewer instruction fetches

Page 23:

Supercomputers


Definitions of a supercomputer:
• Fastest machine in the world at a given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

Page 24:

Supercomputer Applications


Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)

All involve huge computations on large data sets.

In the 70s-80s, supercomputer ≡ vector machine.

Page 25:

Vector Supercomputers


Epitomized by Cray-1, 1976:

Scalar Unit + Vector Extensions:
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Page 26:

Cray-1 (1976)


Page 27:


[Diagram: Cray-1 block diagram. Single-port memory: 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction-buffer refill. Four instruction buffers (64-bit × 16) feeding NIP, LIP, and CIP. Eight scalar registers S0-S7 backed by 64 T registers; eight address registers A0-A7 backed by 64 B registers; eight 64-element vector registers V0-V7 with a vector mask and vector length register. Functional units: FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, address multiply. Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]

Page 28:

Vector Memory System


[Diagram: an address generator takes a base and stride from the vector registers and fans successive element accesses out across memory banks 0-F.]

Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency.
• Bank busy time: the number of cycles between permitted accesses to the same bank

Page 29:

Vector Instruction Execution


ADDV C, A, B

[Diagram: execution using one pipelined functional unit. Pairs A[i], B[i] enter the pipeline in order; C[0], C[1], C[2] have emerged while A[3]+B[3] through A[6]+B[6] are still in flight.]

[Diagram: execution using four pipelined functional units. Lane 0 produces C[0], C[4], C[8], ...; lane 1 produces C[1], C[5], C[9], ...; lane 2 produces C[2], C[6], C[10], ...; lane 3 produces C[3], C[7], C[11], ...]

Page 30:

History of Microprocessors


1950s: IBM instituted a research program.

1964: Release of System/360.

Mid-1970s: Improved measurement tools demonstrated on CISC.

1979: 32-bit RISC microprocessor (the 801) developed, led by Joel Birnbaum.

1984: MIPS developed at Stanford, along with projects at Berkeley.

1988: RISC processors had taken over the high end of the workstation market.

Early 1990s: IBM's POWER (Performance Optimization With Enhanced RISC) architecture introduced with the RISC System/6000; the AIM (Apple, IBM, Motorola) alliance formed, resulting in PowerPC.

Page 31:

What is CISC?


A complex instruction set computer (CISC, pronounced like "sisk") is a microprocessor instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction.

The philosophy behind it is that hardware is always faster than software, so one should build a powerful instruction set that provides programmers with assembly instructions to do a lot with short programs.

So the primary goal of CISC is to complete a task in as few lines of assembly as possible.

Most common microprocessor designs such as the Intel 80x86 and Motorola 68K series followed the CISC philosophy.

Page 32:


• Memory in those days was expensive: a bigger program meant more storage, which meant more money. Hence the need to reduce the number of instructions per program.

• The number of instructions is reduced by having multiple operations within a single instruction.

• Multiple operations lead to many different kinds of instructions that access memory, in turn making instruction length variable and fetch-decode-execute time unpredictable. This makes the processor more complex; thus the hardware handles the complexity.

Page 33:

CISC philosophy


Use microcode
• A simplified microcode instruction set was used to control the data path logic. This type of implementation is known as a microprogrammed implementation.

Build rich instruction sets
• A consequence of using a microprogrammed design is that designers could build more functionality into each instruction.

Build high-level instruction sets
• The logical next step was to build instruction sets which map directly from high-level languages.

Page 34:

Characteristics of a CISC design


• Register-to-register, register-to-memory, and memory-to-register commands.

• Multiple addressing modes.

• Variable-length instructions, where the length often varies according to the addressing mode.

• Instructions which require multiple clock cycles to execute.

Page 35:

Addressing Modes


• Immediate

• Direct

• Indirect

• Register

• Register Indirect

• Displacement (Indexed)

• Stack

Page 36:

Immediate Addressing


• Operand is part of the instruction
• Operand = address field
• e.g. ADD 5
  – Add 5 to the contents of the accumulator
  – 5 is the operand
• No memory reference needed to fetch the data
• Fast
• Limited range

[Diagram: instruction = opcode + operand.]

Page 37:

Direct Addressing


• Address field contains the address of the operand
• Effective address (EA) = address field (A)
• e.g. ADD A
  – Add the contents of cell A to the accumulator
  – Look in memory at address A for the operand
• Single memory reference to access the data
• No additional calculation needed to work out the effective address
• Limited address space

Page 38:

Direct Addressing Diagram

[Diagram: instruction = opcode + address A; address A indexes memory directly to reach the operand.]

Page 39:

Indirect Addressing


• The memory cell pointed to by the address field contains the address of (a pointer to) the operand
• EA = (A)
  – Look in A, find address (A), and look there for the operand
• e.g. ADD (A)
  – Add the contents of the cell pointed to by the contents of A to the accumulator
• Large address space: 2^n, where n = word length
• May be nested, multilevel, cascaded, e.g. EA = (((A)))
• Multiple memory accesses to find the operand, hence slower

Page 40:

Indirect Addressing Diagram

[Diagram: instruction = opcode + address A; memory cell A holds a pointer to the operand, which is followed to reach the operand itself.]

Page 41:

CISC Disadvantages


Designers soon realised that the CISC philosophy had its own problems, including:

Earlier generations of a processor family were generally contained as a subset in every new version, so the instruction set and chip hardware become more complex with each generation of computers.

So that as many instructions as possible could be stored in memory with the least possible wasted space, individual instructions could be of almost any length. This means that different instructions take different amounts of clock time to execute, slowing down the overall performance of the machine.

Many specialized instructions aren't used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program.

CISC instructions typically set the condition codes as a side effect of the instruction. Not only does setting the condition codes take time, but programmers have to remember to examine the condition-code bits before a subsequent instruction changes them.

Page 42:

Examples - CISC


Examples of CISC processors:
• VAX
• PDP-11
• Motorola 68000 family
• Intel x86/Pentium CPUs