Multiprocessors (Advanced Computer Architecture, UNIT 4): Flynn's classification, vector computers, pipelining in vector computers, Cray, multiprocessor interconnection, general-purpose multiprocessors, data flow computers




Page 1:

Multiprocessors

Advanced Computer Architecture, UNIT 4

Flynn's classification

Vector computers

Pipelining in Vector computers

Cray

Multiprocessor interconnection

General purpose Multiprocessor

Data Flow Computers

Page 2:

The Big Picture: Where are We Now?


The major issue is this:

We’ve taken copies of the contents of main memory and put them in caches closer to the processors. But what happens to those copies if someone else wants to use the main memory data?

How do we keep all copies of the data in sync with each other?

Page 3:

The Multiprocessor Picture


Processor/Memory Bus

PCI Bus

I/O Busses

Example: Pentium System Organization

Page 4:

CS 284a, 7 October 97 Copyright (c) 1997-98, John Thornley 4

Why Buy a Multiprocessor?

• Multiple users.
• Multiple applications.
• Multitasking within an application.
• Responsiveness and/or throughput.


Page 5:


Multiprocessor Architectures

• Message-Passing Architectures
  – Separate address space for each processor.
  – Processors communicate via message passing.

• Shared-Memory Architectures
  – Single address space shared by all processors.
  – Processors communicate by memory reads/writes.
  – SMP or NUMA.
  – Cache coherence is an important issue.


Page 6:


Message-Passing Architecture

[Diagram: several nodes, each with a processor, cache, and private memory, connected by an interconnection network.]


Page 7:


Shared-Memory Architecture

[Diagram: processors 1 through N, each with a cache, connected by an interconnection network to memories 1 through M.]


Page 8:


Shared-Memory Architecture: SMP and NUMA

• SMP = Symmetric Multiprocessor
  – All memory is equally close to all processors.
  – Typical interconnection network is a shared bus.
  – Easier to program, but doesn't scale to many processors.

• NUMA = Non-Uniform Memory Access
  – Each memory is closer to some processors than others.
  – a.k.a. "Distributed Shared Memory".
  – Typical interconnection is a grid or hypercube.
  – Harder to program, but scales to more processors.


Page 9:

Shared Memory Multiprocessor


[Diagram: four processors, each with registers and caches, connected through a chipset to memory and to disk & other I/O.]

• Memory: centralized, with Uniform Memory Access time ("UMA"), bus interconnect, and I/O

• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro

Page 10:

Shared Memory Multiprocessor


• Several processors share one address space
  – conceptually a shared memory
  – often implemented just like a multicomputer: address space distributed over private memories

• Communication is implicit
  – read and write accesses to shared memory locations

• Synchronization
  – via shared memory locations
    • spin waiting for non-zero
  – barriers

[Conceptual model: processors P connected by a network/bus to shared memory M.]

Page 11:

Message Passing Multicomputers


• Computers (nodes) connected by a network
  – Fast network interface
    • Send, receive, barrier
  – Nodes no different from a regular PC or workstation

• Cluster of conventional workstations or PCs with a fast network: "cluster computing"
  – Berkeley NOW
  – IBM SP2

[Diagram: nodes, each a processor P with local memory M, connected by a network.]

Page 12:

Large-Scale MP Designs


[Diagram: large-scale MP design with a low-latency, high-reliability interconnect; access latencies of 40 and 100 cycles.]

Memory: distributed, with non-uniform memory access time ("NUMA") and a scalable interconnect (distributed memory).

Page 13:

Shared Memory Architectures


In this section we will examine the issues around:

• Sharing one memory space among several processors.

• Maintaining coherence among several copies of a data item.

Page 14:

The Problem of Cache Coherency

[Diagram: CPU with a cache holding copies A' and B', memory holding A and B, and an I/O device.]

a) Cache and memory coherent: A' = A = 100, B' = B = 200.

b) The CPU writes A' = 550 in the cache while memory still holds A = 100. Cache and memory are incoherent: A' ≠ A, and an I/O output of A gives the stale value 100.

c) I/O inputs 440 to B in memory while the cache still holds B' = 200. Cache and memory are incoherent: B' ≠ B.

Page 15:

Some Simple Definitions


Mechanism     | How It Works                                                  | Performance                                       | Coherency Issues
Write Back    | Write modified data from cache to memory only when necessary. | Good, because it doesn't tie up memory bandwidth. | Can have problems with various copies containing different values.
Write Through | Write modified data from cache to memory immediately.         | Not so good: uses a lot of memory bandwidth.      | Modified values always written to memory; data always matches.

Page 16:

What Does Coherency Mean?


• Informally:
  – "Any read must return the most recent write"
  – Too strict and too difficult to implement

• Better:
  – "Any write must eventually be seen by a read"
  – All writes are seen in proper order ("serialization")

• Two rules to ensure this:
  – "If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"
  – Writes to a single location are serialized: seen in one order
    • The latest write will be seen
    • Otherwise writes could be seen in an illogical order (an older value could be seen after a newer value)

Page 17:

Vector Computers


• Vector Processing Overview
• Vector Metrics, Terms
• Greater Efficiency than Superscalar Processors
• Examples
  – CRAY-1 (1976, 1979): first vector-register supercomputer
  – Multimedia extensions to high-performance PC processors
  – Modern multi-vector-processor supercomputer: NEC ESS
• Design Features of Vector Supercomputers
• Conclusions

Page 18:

Vector Programming Model


[Diagram: vector programming model.]

Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] ... [VLRMAX-1]; the vector length register (VLR) sets how many elements an instruction operates on.

Vector arithmetic instructions, e.g. ADDV v3, v1, v2: elementwise addition of v1 and v2 into v3 over elements [0] ... [VLR-1].

Vector load and store instructions, e.g. LV v1, r1, r2: load v1 from memory starting at base address r1 with stride r2.

Page 19:

Vector Code Example


# Scalar Code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector Code
      LI     VLR, 64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3

# C code
for (i = 0; i < 64; i++)
    C[i] = A[i] + B[i];

Page 20:

Vector Arithmetic Execution


• Use deep pipeline (=> fast clock) to execute element operations

• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Diagram: V3 <- V1 * V2, with elements of V1 and V2 flowing through a six-stage multiply pipeline into V3.]

Page 21:

Vector Instruction Set Advantages


• Compact
  – one short instruction encodes N operations => N × FLOP bandwidth

• Expressive: tells hardware that these N operations
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store), OR access memory in a known pattern (strided load/store)

• Scalable
  – can run the same object code on more parallel pipelines or lanes

Page 22:

Properties of Vector Processors


• Each result is independent of the previous result
  => long pipeline; the compiler ensures no dependencies
  => high clock rate

• Vector instructions access memory with a known pattern
  => highly interleaved memory
  => amortize memory latency over 64-plus elements
  => no (data) caches required! (but use an instruction cache)

• Reduces branches and branch problems in pipelines

• A single vector instruction implies lots of work (≈ a loop)
  => fewer instruction fetches

Page 23:

Supercomputers


Definitions of a supercomputer:
• Fastest machine in the world at a given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

Page 24:

Supercomputer Applications


Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)

All involve huge computations on large data sets.

In the 70s-80s, supercomputer ≡ vector machine.

Page 25:

Vector Supercomputers


Epitomized by Cray-1, 1976:

Scalar Unit + Vector Extensions:
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Page 26:

Cray-1 (1976)


Page 27:


[Diagram: Cray-1 block diagram. Single-port memory: 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction-buffer refill. Four instruction buffers (64-bit × 16) feeding NIP, LIP, and CIP. Eight scalar registers S0-S7 backed by 64 T registers; eight address registers A0-A7 backed by 64 B registers; eight 64-element vector registers V0-V7 with a vector mask and vector length register. Functional units: FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, address multiply. Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]

Page 28:

Vector Memory System


[Diagram: an address generator takes a base and stride from the vector registers and fans successive element accesses out across memory banks 0-F.]

Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency.
• Bank busy time: the number of cycles between permitted accesses to the same bank

Page 29:

Vector Instruction Execution


ADDV C, A, B

[Diagram: execution using one pipelined functional unit. Pairs A[i], B[i] enter the pipeline in order; C[0], C[1], C[2] have emerged while A[3]+B[3] through A[6]+B[6] are still in flight.]

[Diagram: execution using four pipelined functional units. Lane 0 produces C[0], C[4], C[8], ...; lane 1 produces C[1], C[5], C[9], ...; lane 2 produces C[2], C[6], C[10], ...; lane 3 produces C[3], C[7], C[11], ...]

Page 30:

History of Microprocessors


1950s: IBM instituted a research program.

1964: Release of System/360.

Mid-1970s: Improved measurement tools demonstrated on CISC.

1979: 32-bit RISC microprocessor (the 801) developed, led by Joel Birnbaum.

1984: MIPS developed at Stanford, along with projects at Berkeley.

1988: RISC processors had taken over the high end of the workstation market.

Early 1990s: IBM's POWER (Performance Optimization With Enhanced RISC) architecture introduced with the RISC System/6000; the AIM (Apple, IBM, Motorola) alliance formed, resulting in PowerPC.

Page 31:

What is CISC?


A complex instruction set computer (CISC, pronounced like "sisk") is a microprocessor instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction.

The philosophy behind it is that hardware is always faster than software, so one should build a powerful instruction set that provides programmers with assembly instructions to do a lot with short programs.

So the primary goal of CISC is to complete a task in as few lines of assembly as possible.

Most common microprocessor designs such as the Intel 80x86 and Motorola 68K series followed the CISC philosophy.

Page 32:


• Memory in those days was expensive: a bigger program meant more storage, which meant more money. Hence the need to reduce the number of instructions per program.

• The number of instructions is reduced by having multiple operations within a single instruction.

• Multiple operations lead to many different kinds of instructions that access memory, in turn making instruction length variable and fetch-decode-execute time unpredictable. This makes the processor more complex; thus the hardware handles the complexity.

Page 33:

CISC philosophy


Use microcode
• A simplified microcode instruction set was used to control the data path logic. This type of implementation is known as a microprogrammed implementation.

Build rich instruction sets
• A consequence of using a microprogrammed design is that designers could build more functionality into each instruction.

Build high-level instruction sets
• The logical next step was to build instruction sets which map directly from high-level languages.

Page 34:

Characteristics of a CISC design


• Register-to-register, register-to-memory, and memory-to-register commands.

• Multiple addressing modes.

• Variable-length instructions, where the length often varies according to the addressing mode.

• Instructions which require multiple clock cycles to execute.

Page 35:

Addressing Modes


• Immediate

• Direct

• Indirect

• Register

• Register Indirect

• Displacement (Indexed)

• Stack

Page 36:

Immediate Addressing


• Operand is part of the instruction
• Operand = address field
• e.g. ADD 5
  – Add 5 to the contents of the accumulator
  – 5 is the operand
• No memory reference needed to fetch the data
• Fast
• Limited range

[Diagram: instruction = opcode + operand.]

Page 37:

Direct Addressing


• Address field contains the address of the operand
• Effective address (EA) = address field (A)
• e.g. ADD A
  – Add the contents of cell A to the accumulator
  – Look in memory at address A for the operand
• Single memory reference to access the data
• No additional calculation needed to work out the effective address
• Limited address space

Page 38:

Direct Addressing Diagram

[Diagram: instruction = opcode + address A; address A indexes memory directly to reach the operand.]

Page 39:

Indirect Addressing


• The memory cell pointed to by the address field contains the address of (a pointer to) the operand
• EA = (A)
  – Look in A, find address (A), and look there for the operand
• e.g. ADD (A)
  – Add the contents of the cell pointed to by the contents of A to the accumulator
• Large address space: 2^n, where n = word length
• May be nested, multilevel, cascaded, e.g. EA = (((A)))
• Multiple memory accesses to find the operand, hence slower

Page 40:

Indirect Addressing Diagram

[Diagram: instruction = opcode + address A; memory cell A holds a pointer to the operand, which is followed to reach the operand itself.]

Page 41:

CISC Disadvantages


Designers soon realised that the CISC philosophy had its own problems, including:

Earlier generations of a processor family were generally contained as a subset in every new version, so the instruction set and chip hardware become more complex with each generation of computers.

So that as many instructions as possible could be stored in memory with the least possible wasted space, individual instructions could be of almost any length. This means that different instructions take different amounts of clock time to execute, slowing down the overall performance of the machine.

Many specialized instructions aren't used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program.

CISC instructions typically set the condition codes as a side effect of the instruction. Not only does setting the condition codes take time, but programmers have to remember to examine the condition-code bits before a subsequent instruction changes them.

Page 42:

Examples - CISC


Examples of CISC processors:
• VAX
• PDP-11
• Motorola 68000 family
• Intel x86/Pentium CPUs