LimitLESS Directories: A Scalable Cache Coherence Scheme
By: David Chaiken, John Kubiatowicz, Anant Agarwal
Presented by: Sampath Rudravaram
Cache Coherence
The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing.
Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time.
Cache coherence problems arise because multiple processors may read and modify the same memory block within their own caches.
Common solutions:
-> Snoopy coherence
-> Directory-based coherence <--
-> Compiler-directed coherence
Directory (Full-map)
Message-based protocols allocate a section of the system's memory to a directory. Each block of memory has an associated directory entry containing one bit per cache in the system; that bit indicates whether or not the associated cache holds a copy of the memory block.
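A full-map directory entry is essentially a presence-bit vector. The sketch below is illustrative (the class and method names are mine, not from the paper): one boolean per cache, set when that cache holds a copy of the block.

```python
class FullMapEntry:
    """One full-map directory entry: a presence bit for each of N caches."""

    def __init__(self, n_caches):
        self.bits = [False] * n_caches  # bit i: cache i holds a copy

    def add_sharer(self, cache_id):
        self.bits[cache_id] = True

    def remove_sharer(self, cache_id):
        self.bits[cache_id] = False

    def sharers(self):
        """Return the IDs of all caches currently holding a copy."""
        return [i for i, b in enumerate(self.bits) if b]


entry = FullMapEntry(n_caches=8)
entry.add_sharer(2)
entry.add_sharer(5)
print(entry.sharers())  # [2, 5]
```

On a write, the directory would walk this list to send invalidations, which is why no broadcast is ever needed.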
Directory-based Coherence
The basic concept is that a processor must ask for permission to load an entry from primary memory into its cache.
When an entry is changed, the directory must be notified, either before the change is initiated or when it is complete.
When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
Directory-based Coherence
Full-Map Directory Entry:

  | State: Read-Only | x | x | . | ... | x |
              caches:  1   2   3   ...   N    (one presence bit per cache)

Advantages?
-> No broadcast is necessary.
Disadvantages?
-> Coherence traffic is high, since all requests go to the directory.
-> Great need for memory (directory size grows as Θ(N²)).
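The Θ(N²) growth follows from simple arithmetic: each block needs N presence bits, and the number of blocks itself grows with the number of nodes. A quick sketch (the function name and block counts are illustrative assumptions, not from the paper):

```python
def fullmap_directory_bits(n_nodes, blocks_per_node):
    """Total directory storage for a full-map scheme, in bits.

    Each node contributes blocks_per_node memory blocks, and every
    block needs one presence bit per node in the machine.
    """
    total_blocks = n_nodes * blocks_per_node
    return total_blocks * n_nodes  # N bits per block => B * N^2 overall


print(fullmap_directory_bits(64, 1024))   # 4194304
print(fullmap_directory_bits(128, 1024))  # 16777216 -- 4x the bits for 2x the nodes
```

Doubling the machine quadruples the directory storage, which is exactly the scaling problem the limited and LimitLESS schemes attack.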
Directory-based Coherence
Limited Directory Entry:

  | State: Read-Only | 12 | 10 | 13 | 23 |
              (State, then a fixed number of node-ID pointers)

Advantages?
-> Performance is comparable to that of a full-map scheme when there is limited sharing of data between processors.
-> Cheaper to implement.
Disadvantages?
-> The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
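The thrashing behavior can be seen in a few lines. This is an illustrative sketch (names and the evict-oldest policy are my assumptions): when a new sharer arrives and every pointer is in use, an existing copy must be invalidated to free a pointer.

```python
class LimitedEntry:
    """A limited directory entry with a fixed number of node-ID pointers."""

    def __init__(self, n_pointers):
        self.n_pointers = n_pointers
        self.pointers = []       # node IDs of current sharers
        self.invalidations = 0   # copies evicted due to pointer overflow

    def add_sharer(self, node_id):
        if node_id in self.pointers:
            return
        if len(self.pointers) == self.n_pointers:
            self.pointers.pop(0)      # invalidate one existing copy
            self.invalidations += 1   # this churn is the thrashing cost
        self.pointers.append(node_id)


entry = LimitedEntry(n_pointers=4)
for node in [12, 10, 13, 23, 7]:  # a fifth sharer overflows the entry
    entry.add_sharer(node)
print(entry.pointers)        # [10, 13, 23, 7]
print(entry.invalidations)   # 1
```

With widely shared read-only data, every new reader evicts another reader, and the invalidation count climbs with each access.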
LimitLESS (Limited directory Locally Extended through Software Support)
The LimitLESS scheme combines the full-map and limited-directory ideas to achieve a robust yet affordable and scalable cache coherence solution.
The main idea is to handle the common case in hardware and the exceptional case in software.
A limited directory implemented in hardware keeps track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full-map directory is emulated in software.
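The hybrid can be sketched as follows. This is a simplified illustration (class and field names are mine): hardware pointers serve the common case, and the first overflow "traps" to software, which takes over with a full bit map for that block.

```python
class LimitLESSEntry:
    """Sketch of a LimitLESS entry: hardware pointers, software overflow."""

    def __init__(self, n_pointers):
        self.n_pointers = n_pointers
        self.hw_pointers = []   # common case, handled entirely in hardware
        self.sw_bitmap = None   # allocated by the trap handler on overflow
        self.traps = 0          # how many times software had to intervene

    def add_sharer(self, node_id):
        if self.sw_bitmap is not None:
            self.sw_bitmap.add(node_id)       # software-emulated full map
        elif node_id in self.hw_pointers:
            pass                              # already recorded
        elif len(self.hw_pointers) < self.n_pointers:
            self.hw_pointers.append(node_id)  # fits in hardware
        else:
            # Overflow: interrupt the local processor; the handler spills
            # the hardware pointers into a full bit map in local memory.
            self.traps += 1
            self.sw_bitmap = set(self.hw_pointers) | {node_id}
            self.hw_pointers = []

    def sharers(self):
        if self.sw_bitmap is not None:
            return sorted(self.sw_bitmap)
        return sorted(self.hw_pointers)


entry = LimitLESSEntry(n_pointers=4)
for node in [3, 1, 4, 1, 5, 9]:
    entry.add_sharer(node)
print(entry.sharers())  # [1, 3, 4, 5, 9]
print(entry.traps)      # 1
```

Blocks shared by at most four processors never trap at all, which is why performance stays close to a pure hardware scheme in the common case.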
Protocol messages for hardware coherence:

  Type             Symbol   Name               Data?
  Cache to Memory  RREQ     Read Request
                   WREQ     Write Request
                   REPM     Replace Modified    *
                   UPDATE   Update              *
                   ACKC     Invalidate Ack.
  Memory to Cache  RDATA    Read Data           *
                   WDATA    Write Data          *
                   INV      Invalidate
                   BUSY     Busy Signal
Directory states:

  Component  Name               Meaning
  Memory     Read-Only          Some number of caches have read-only copies of the data
             Read-Write         Exactly one cache has a read-write copy of the data
             Read-Transaction   Holding read request, update is in progress
             Write-Transaction  Holding write request, invalidation is in progress
  Cache      Invalid            Cache block may not be read or written
             Read-Only          Cache block may be read, but not written
             Read-Write         Cache block may be read or written
Annotation of the state transition diagram:

  Transition  Input        Precondition              Directory Entry        Output
  Label       Message                                Change                 Message(s)
  1           i -> RREQ    --                        P = P ∪ {i}            RDATA -> i
  2           i -> WREQ    P = {i}                   --                     WDATA -> i
              i -> WREQ    P = {}                    P = {i}                WDATA -> i
  3           i -> WREQ    P = {k1,...,kn} ∧ i ∉ P   P = {i}, AckCtr = n    ∀kj: INV -> kj
              i -> WREQ    P = {k1,...,kn} ∧ i ∈ P   P = {i}, AckCtr = n-1  ∀kj ≠ i: INV -> kj
  4           j -> WREQ    P = {i}                   P = {j}, AckCtr = 1    INV -> i
  5           j -> RREQ    P = {i}                   P = {j}, AckCtr = 1    INV -> i
  6           i -> REPM    P = {i}                   P = {}                 --
  7           j -> RREQ    --                        --                     BUSY -> j
              j -> WREQ    --                        --                     BUSY -> j
              j -> ACKC    AckCtr ≠ 1                AckCtr = AckCtr - 1    --
              j -> REPM    --                        --                     --
  8           j -> ACKC    AckCtr = 1, P = {i}       AckCtr = 0             WDATA -> i
              j -> UPDATE  P = {i}                   AckCtr = 0             WDATA -> i
  9           j -> RREQ    --                        --                     BUSY -> j
              j -> WREQ    --                        --                     BUSY -> j
              j -> REPM    --                        --                     --
  10          j -> UPDATE  P = {i}                   AckCtr = 0             RDATA -> i
              j -> ACKC    P = {i}                   AckCtr = 0             RDATA -> i
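A subset of these transitions can be made concrete in code. The sketch below (class and method names are mine) implements transitions 1, 2, 3, 7, and 8: reads in the Read-Only state, writes that must invalidate the sharing set, and ack collection during a Write-Transaction.

```python
class Directory:
    """Partial directory state machine for one memory block
    (transitions 1, 2, 3, 7, and 8 from the annotation table)."""

    def __init__(self):
        self.state = "Read-Only"
        self.P = set()      # pointer set of sharers
        self.ack_ctr = 0
        self.out = []       # messages the directory emits, in order

    def rreq(self, i):
        # Transition 1: read request while Read-Only.
        if self.state == "Read-Only":
            self.P.add(i)
            self.out.append(("RDATA", i))

    def wreq(self, i):
        if self.state != "Read-Only":
            return
        if self.P <= {i}:
            # Transition 2: P = {} or P = {i} -- grant write immediately.
            self.P = {i}
            self.out.append(("WDATA", i))
        else:
            # Transition 3: invalidate every other sharer; AckCtr counts
            # outstanding acknowledgements (n if i not in P, n-1 if it is).
            targets = self.P - {i}
            self.ack_ctr = len(targets)
            self.P = {i}
            self.state = "Write-Transaction"
            self.out += [("INV", k) for k in sorted(targets)]

    def ackc(self, j):
        # Transition 7 (AckCtr != 1): just decrement.
        self.ack_ctr -= 1
        # Transition 8 (last ack arrives): deliver the write data.
        if self.ack_ctr == 0 and self.state == "Write-Transaction":
            self.state = "Read-Write"
            requester = next(iter(self.P))
            self.out.append(("WDATA", requester))


d = Directory()
d.rreq(1)
d.rreq(2)
d.wreq(3)   # must invalidate caches 1 and 2 first
d.ackc(1)
d.ackc(2)
print(d.out)
# [('RDATA', 1), ('RDATA', 2), ('INV', 1), ('INV', 2), ('WDATA', 3)]
```

Transitions 4-6, 9, and 10 (ownership transfer, replacement, and the busy/retry paths) are omitted to keep the sketch short.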
Architectural Features of LimitLESS
Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
An Alewife node consists of a 33MHz SPARCLE processor, 64K bytes of direct-mapped cache, 4M bytes of globally-shared main memory, and a floating-point coprocessor.
A 16-node Alewife machine; a 128-node Alewife chassis.
Architectural Features of LimitLESS
The processor must be capable of rapid trap handling (five to ten cycles):
-> a rapid context-switching processor
-> a finely-tuned software trap architecture
The processor needs complete access to coherence-related controller state.
The directory controller must be able to invoke processor trap handlers when necessary.
An interface to the network allows the processor to launch and to intercept coherence protocol packets.
IPI (Interprocessor-Interrupt)
[Figure: processor-controller interface -- condition bits, trap lines, data bus, and address bus connect the processor and the controller]
Architectural Features of LimitLESS
The IPI provides a superset of the network functionality:
-> used to send and receive cache protocol packets
-> used to send preemptive messages to remote processors
Network packet structure:
-> Protocol opcode, for cache coherence traffic
-> Interrupt opcode, for interprocessor messages
Transmission of IPI packets: enqueue the request on the IPI output queue.
Reception of IPI packets: place the packet in the IPI input queue. IPI input traps are synchronous.
IPI packet structure:

  Source Processor | Packet Length
  Opcode | Operand 1 | Operand 2 | ... | Operand m-1
  Data word 1 | Data word 2 | ... | Data word n-1
[Figure: queue-based diagram of the Alewife controller]
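The packet layout above can be modeled as a flat list of words. This is purely illustrative (the field names, word widths, and length convention are my assumptions, not Alewife's actual encoding):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IPIPacket:
    """Hypothetical model of the IPI packet layout: header words
    (source, length, opcode), then operands, then data words."""
    source: int
    opcode: int                 # protocol opcode or interrupt opcode
    operands: List[int] = field(default_factory=list)
    data: List[int] = field(default_factory=list)

    @property
    def length(self):
        # Total words in the packet: 3 header words plus payload.
        return 3 + len(self.operands) + len(self.data)

    def encode(self):
        """Flatten to the on-wire word order shown in the figure."""
        return [self.source, self.length, self.opcode] + self.operands + self.data


pkt = IPIPacket(source=3, opcode=0x1, operands=[42], data=[0xDEAD, 0xBEEF])
print(pkt.encode())  # [3, 6, 1, 42, 57005, 48879]
```

Software trap handlers would parse packets in exactly this order when the controller places them on the IPI input queue.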
Meta States & Trap Handler
First-time overflow:
- The trap code allocates a full-map bit-vector in local memory.
- Empty all hardware pointers, setting the corresponding bits in the vector.
- The directory mode is set to Trap-On-Write before the trap returns.
Additional overflow:
- Empty all hardware pointers, setting the corresponding bits in the vector.
Termination (on WREQ or local write fault):
- Empty all hardware pointers.
- Record the identity of the requester in the directory.
- Set the AckCtr to the number of bits in the vector that are set.
- Place the directory in Normal Mode, Write-Transaction state.
- Invalidate all caches whose bit is set in the vector.
PERFORMANCE MEASUREMENT
Comparison of the performance of limited, LimitLESS, and full-map directories.
Evaluated in terms of the total number of cycles needed to execute an application on a 64-processor Alewife machine.
Measurement technique: ASIM, the Alewife System Simulator.
Performance Results

  Application  Dir4NB  LimitLESS4  Full-Map
  Multigrid    0.729   0.704       0.665
  SIMPLE       3.579   2.902       2.553
  Matexpr      1.296   0.317       0.171
  Weather      1.356   0.654       0.621

-> four-pointer limited protocol, full-map protocol, and the LimitLESS scheme with Ts = 50
-> 64-node Alewife machine with 64K-byte caches and 2D mesh networks
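The table's numbers can be compared directly by normalizing each scheme against the full-map baseline (the computation below just restates the table; the threshold for "close" is my own):

```python
# Cycle counts from the table above, per application and scheme.
times = {
    "Multigrid": {"Dir4NB": 0.729, "LimitLESS4": 0.704, "Full-Map": 0.665},
    "SIMPLE":    {"Dir4NB": 3.579, "LimitLESS4": 2.902, "Full-Map": 2.553},
    "Matexpr":   {"Dir4NB": 1.296, "LimitLESS4": 0.317, "Full-Map": 0.171},
    "Weather":   {"Dir4NB": 1.356, "LimitLESS4": 0.654, "Full-Map": 0.621},
}

for app, t in times.items():
    # Slowdown relative to the full-map baseline (1.00 = identical).
    slowdown = t["LimitLESS4"] / t["Full-Map"]
    print(f"{app}: LimitLESS4 runs in {slowdown:.2f}x full-map time")
```

For Multigrid and Weather, LimitLESS stays within about 6% of full-map; Matexpr, with its widely shared data, is the case where software emulation cost shows most.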
Performance Results (contd.)
-> Results when the variable in Weather is not optimized.
Performance Results (contd.)
-> Results when the variable in Weather is optimized.
Performance Results (contd.)
-> Results with emulation latency = 50 for the LimitLESS protocol.
Conclusion
This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
Hardware requirements include rapid trap handling and a flexible processor interface to the network.
Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.