LimitLESS Directories: A Scalable Cache Coherence Scheme
By: David Chaiken, John Kubiatowicz, Anant Agarwal
Presented by: Sampath Rudravaram
Cache Coherence
The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing.
Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time.
Cache coherence problems arise because multiple processors may read and modify the same memory block within their own caches.
Common solutions:
-> Snoopy coherence
-> Directory-based coherence <--
-> Compiler-directed coherence
Directory (Full-map)
Message-based protocols allocate a section of the system's memory to a directory. Each block of memory has an associated directory entry containing one bit per cache in the system; that bit indicates whether or not the associated cache holds a copy of the memory block.
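A full-map directory entry is essentially a presence-bit vector. The sketch below is illustrative (the class and method names are mine, not from the paper): one boolean per cache, set when that cache holds a copy of the block.

```python
class FullMapEntry:
    """One full-map directory entry: a presence bit for each of N caches."""

    def __init__(self, n_caches):
        self.bits = [False] * n_caches  # bit i: cache i holds a copy

    def add_sharer(self, cache_id):
        self.bits[cache_id] = True

    def remove_sharer(self, cache_id):
        self.bits[cache_id] = False

    def sharers(self):
        """Return the IDs of all caches currently holding a copy."""
        return [i for i, b in enumerate(self.bits) if b]


entry = FullMapEntry(n_caches=8)
entry.add_sharer(2)
entry.add_sharer(5)
print(entry.sharers())  # [2, 5]
```

On a write, the directory would walk this list to send invalidations, which is why no broadcast is ever needed.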
Directory-based Coherence
The basic concept is that a processor must ask for permission to load an entry from primary memory into its cache.
When an entry is changed, the directory must be notified, either before the change is initiated or when it is complete.
When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
Directory-based Coherence
Full-Map Directory Entry:

  | State: Read-Only | x | x | . | ... | x |
              caches:  1   2   3   ...   N    (one presence bit per cache)

Advantages?
-> No broadcast is necessary.
Disadvantages?
-> Coherence traffic is high, since all requests go to the directory.
-> Great need for memory (directory size grows as Θ(N²)).
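The Θ(N²) growth follows from simple arithmetic: each block needs N presence bits, and the number of blocks itself grows with the number of nodes. A quick sketch (the function name and block counts are illustrative assumptions, not from the paper):

```python
def fullmap_directory_bits(n_nodes, blocks_per_node):
    """Total directory storage for a full-map scheme, in bits.

    Each node contributes blocks_per_node memory blocks, and every
    block needs one presence bit per node in the machine.
    """
    total_blocks = n_nodes * blocks_per_node
    return total_blocks * n_nodes  # N bits per block => B * N^2 overall


print(fullmap_directory_bits(64, 1024))   # 4194304
print(fullmap_directory_bits(128, 1024))  # 16777216 -- 4x the bits for 2x the nodes
```

Doubling the machine quadruples the directory storage, which is exactly the scaling problem the limited and LimitLESS schemes attack.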
Directory-based Coherence
Limited Directory Entry:

  | State: Read-Only | 12 | 10 | 13 | 23 |
              (State, then a fixed number of node-ID pointers)

Advantages?
-> Performance is comparable to that of a full-map scheme when there is limited sharing of data between processors.
-> Cheaper to implement.
Disadvantages?
-> The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
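The thrashing behavior can be seen in a few lines. This is an illustrative sketch (names and the evict-oldest policy are my assumptions): when a new sharer arrives and every pointer is in use, an existing copy must be invalidated to free a pointer.

```python
class LimitedEntry:
    """A limited directory entry with a fixed number of node-ID pointers."""

    def __init__(self, n_pointers):
        self.n_pointers = n_pointers
        self.pointers = []       # node IDs of current sharers
        self.invalidations = 0   # copies evicted due to pointer overflow

    def add_sharer(self, node_id):
        if node_id in self.pointers:
            return
        if len(self.pointers) == self.n_pointers:
            self.pointers.pop(0)      # invalidate one existing copy
            self.invalidations += 1   # this churn is the thrashing cost
        self.pointers.append(node_id)


entry = LimitedEntry(n_pointers=4)
for node in [12, 10, 13, 23, 7]:  # a fifth sharer overflows the entry
    entry.add_sharer(node)
print(entry.pointers)        # [10, 13, 23, 7]
print(entry.invalidations)   # 1
```

With widely shared read-only data, every new reader evicts another reader, and the invalidation count climbs with each access.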
LimitLESS (Limited directory Locally Extended through Software Support)
The LimitLESS scheme combines the full-map and limited-directory ideas to achieve a robust yet affordable and scalable cache coherence solution.
The main idea is to handle the common case in hardware and the exceptional case in software.
A limited directory implemented in hardware keeps track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full-map directory is emulated in software.
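The hybrid can be sketched as follows. This is a simplified illustration (class and field names are mine): hardware pointers serve the common case, and the first overflow "traps" to software, which takes over with a full bit map for that block.

```python
class LimitLESSEntry:
    """Sketch of a LimitLESS entry: hardware pointers, software overflow."""

    def __init__(self, n_pointers):
        self.n_pointers = n_pointers
        self.hw_pointers = []   # common case, handled entirely in hardware
        self.sw_bitmap = None   # allocated by the trap handler on overflow
        self.traps = 0          # how many times software had to intervene

    def add_sharer(self, node_id):
        if self.sw_bitmap is not None:
            self.sw_bitmap.add(node_id)       # software-emulated full map
        elif node_id in self.hw_pointers:
            pass                              # already recorded
        elif len(self.hw_pointers) < self.n_pointers:
            self.hw_pointers.append(node_id)  # fits in hardware
        else:
            # Overflow: interrupt the local processor; the handler spills
            # the hardware pointers into a full bit map in local memory.
            self.traps += 1
            self.sw_bitmap = set(self.hw_pointers) | {node_id}
            self.hw_pointers = []

    def sharers(self):
        if self.sw_bitmap is not None:
            return sorted(self.sw_bitmap)
        return sorted(self.hw_pointers)


entry = LimitLESSEntry(n_pointers=4)
for node in [3, 1, 4, 1, 5, 9]:
    entry.add_sharer(node)
print(entry.sharers())  # [1, 3, 4, 5, 9]
print(entry.traps)      # 1
```

Blocks shared by at most four processors never trap at all, which is why performance stays close to a pure hardware scheme in the common case.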
Protocol messages for hardware coherence:

  Type             Symbol   Name               Data?
  Cache to Memory  RREQ     Read Request
                   WREQ     Write Request
                   REPM     Replace Modified    *
                   UPDATE   Update              *
                   ACKC     Invalidate Ack.
  Memory to Cache  RDATA    Read Data           *
                   WDATA    Write Data          *
                   INV      Invalidate
                   BUSY     Busy Signal
Directory states:

  Component  Name               Meaning
  Memory     Read-Only          Some number of caches have read-only copies of the data
             Read-Write         Exactly one cache has a read-write copy of the data
             Read-Transaction   Holding read request, update is in progress
             Write-Transaction  Holding write request, invalidation is in progress
  Cache      Invalid            Cache block may not be read or written
             Read-Only          Cache block may be read, but not written
             Read-Write         Cache block may be read or written
Annotation of the state transition diagram:

  Transition  Input        Precondition              Directory Entry        Output
  Label       Message                                Change                 Message(s)
  1           i -> RREQ    --                        P = P ∪ {i}            RDATA -> i
  2           i -> WREQ    P = {i}                   --                     WDATA -> i
              i -> WREQ    P = {}                    P = {i}                WDATA -> i
  3           i -> WREQ    P = {k1,...,kn} ∧ i ∉ P   P = {i}, AckCtr = n    ∀kj: INV -> kj
              i -> WREQ    P = {k1,...,kn} ∧ i ∈ P   P = {i}, AckCtr = n-1  ∀kj ≠ i: INV -> kj
  4           j -> WREQ    P = {i}                   P = {j}, AckCtr = 1    INV -> i
  5           j -> RREQ    P = {i}                   P = {j}, AckCtr = 1    INV -> i
  6           i -> REPM    P = {i}                   P = {}                 --
  7           j -> RREQ    --                        --                     BUSY -> j
              j -> WREQ    --                        --                     BUSY -> j
              j -> ACKC    AckCtr ≠ 1                AckCtr = AckCtr - 1    --
              j -> REPM    --                        --                     --
  8           j -> ACKC    AckCtr = 1, P = {i}       AckCtr = 0             WDATA -> i
              j -> UPDATE  P = {i}                   AckCtr = 0             WDATA -> i
  9           j -> RREQ    --                        --                     BUSY -> j
              j -> WREQ    --                        --                     BUSY -> j
              j -> REPM    --                        --                     --
  10          j -> UPDATE  P = {i}                   AckCtr = 0             RDATA -> i
              j -> ACKC    P = {i}                   AckCtr = 0             RDATA -> i
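A subset of these transitions can be made concrete in code. The sketch below (class and method names are mine) implements transitions 1, 2, 3, 7, and 8: reads in the Read-Only state, writes that must invalidate the sharing set, and ack collection during a Write-Transaction.

```python
class Directory:
    """Partial directory state machine for one memory block
    (transitions 1, 2, 3, 7, and 8 from the annotation table)."""

    def __init__(self):
        self.state = "Read-Only"
        self.P = set()      # pointer set of sharers
        self.ack_ctr = 0
        self.out = []       # messages the directory emits, in order

    def rreq(self, i):
        # Transition 1: read request while Read-Only.
        if self.state == "Read-Only":
            self.P.add(i)
            self.out.append(("RDATA", i))

    def wreq(self, i):
        if self.state != "Read-Only":
            return
        if self.P <= {i}:
            # Transition 2: P = {} or P = {i} -- grant write immediately.
            self.P = {i}
            self.out.append(("WDATA", i))
        else:
            # Transition 3: invalidate every other sharer; AckCtr counts
            # outstanding acknowledgements (n if i not in P, n-1 if it is).
            targets = self.P - {i}
            self.ack_ctr = len(targets)
            self.P = {i}
            self.state = "Write-Transaction"
            self.out += [("INV", k) for k in sorted(targets)]

    def ackc(self, j):
        # Transition 7 (AckCtr != 1): just decrement.
        self.ack_ctr -= 1
        # Transition 8 (last ack arrives): deliver the write data.
        if self.ack_ctr == 0 and self.state == "Write-Transaction":
            self.state = "Read-Write"
            requester = next(iter(self.P))
            self.out.append(("WDATA", requester))


d = Directory()
d.rreq(1)
d.rreq(2)
d.wreq(3)   # must invalidate caches 1 and 2 first
d.ackc(1)
d.ackc(2)
print(d.out)
# [('RDATA', 1), ('RDATA', 2), ('INV', 1), ('INV', 2), ('WDATA', 3)]
```

Transitions 4-6, 9, and 10 (ownership transfer, replacement, and the busy/retry paths) are omitted to keep the sketch short.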
Architectural Features of LimitLESS
Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
An Alewife node consists of a 33MHz SPARCLE processor, 64K bytes of direct-mapped cache, 4M bytes of globally-shared main memory, and a floating-point coprocessor.
A 16-node Alewife machine; a 128-node Alewife chassis.
Architectural Features of LimitLESS
The processor must be capable of rapid trap handling (five to ten cycles):
-> a rapid context-switching processor
-> a finely-tuned software trap architecture
The processor needs complete access to coherence-related controller state.
The directory controller must be able to invoke processor trap handlers when necessary.
An interface to the network allows the processor to launch and to intercept coherence protocol packets.
IPI (Interprocessor-Interrupt)
[Figure: processor-controller interface -- condition bits, trap lines, data bus, and address bus connect the processor and the controller]
Architectural Features of LimitLESS
The IPI provides a superset of the network functionality:
-> used to send and receive cache protocol packets
-> used to send preemptive messages to remote processors
Network packet structure:
-> Protocol opcode, for cache coherence traffic
-> Interrupt opcode, for interprocessor messages
Transmission of IPI packets: enqueue the request on the IPI output queue.
Reception of IPI packets: place the packet in the IPI input queue. IPI input traps are synchronous.
IPI packet structure:

  Source Processor | Packet Length
  Opcode | Operand 1 | Operand 2 | ... | Operand m-1
  Data word 1 | Data word 2 | ... | Data word n-1
[Figure: queue-based diagram of the Alewife controller]
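The packet layout above can be modeled as a flat list of words. This is purely illustrative (the field names, word widths, and length convention are my assumptions, not Alewife's actual encoding):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IPIPacket:
    """Hypothetical model of the IPI packet layout: header words
    (source, length, opcode), then operands, then data words."""
    source: int
    opcode: int                 # protocol opcode or interrupt opcode
    operands: List[int] = field(default_factory=list)
    data: List[int] = field(default_factory=list)

    @property
    def length(self):
        # Total words in the packet: 3 header words plus payload.
        return 3 + len(self.operands) + len(self.data)

    def encode(self):
        """Flatten to the on-wire word order shown in the figure."""
        return [self.source, self.length, self.opcode] + self.operands + self.data


pkt = IPIPacket(source=3, opcode=0x1, operands=[42], data=[0xDEAD, 0xBEEF])
print(pkt.encode())  # [3, 6, 1, 42, 57005, 48879]
```

Software trap handlers would parse packets in exactly this order when the controller places them on the IPI input queue.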
Meta States & Trap Handler
First-time overflow:
- The trap code allocates a full-map bit-vector in local memory.
- Empty all hardware pointers, setting the corresponding bits in the vector.
- The directory mode is set to Trap-On-Write before the trap returns.
Additional overflow:
- Empty all hardware pointers, setting the corresponding bits in the vector.
Termination (on WREQ or local write fault):
- Empty all hardware pointers.
- Record the identity of the requester in the directory.
- Set the AckCtr to the number of bits in the vector that are set.
- Place the directory in Normal Mode, Write-Transaction state.
- Invalidate all caches whose bit is set in the vector.
PERFORMANCE MEASUREMENT
Comparison of the performance of limited, LimitLESS, and full-map directories.
Evaluated in terms of the total number of cycles needed to execute an application on a 64-processor Alewife machine.
Measurement technique: ASIM, the Alewife System Simulator.
Performance Results

  Application  Dir4NB  LimitLESS4  Full-Map
  Multigrid    0.729   0.704       0.665
  SIMPLE       3.579   2.902       2.553
  Matexpr      1.296   0.317       0.171
  Weather      1.356   0.654       0.621

-> four-pointer limited protocol, full-map protocol, and the LimitLESS scheme with Ts = 50
-> 64-node Alewife machine with 64K-byte caches and 2D mesh networks
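The table's numbers can be compared directly by normalizing each scheme against the full-map baseline (the computation below just restates the table; the threshold for "close" is my own):

```python
# Cycle counts from the table above, per application and scheme.
times = {
    "Multigrid": {"Dir4NB": 0.729, "LimitLESS4": 0.704, "Full-Map": 0.665},
    "SIMPLE":    {"Dir4NB": 3.579, "LimitLESS4": 2.902, "Full-Map": 2.553},
    "Matexpr":   {"Dir4NB": 1.296, "LimitLESS4": 0.317, "Full-Map": 0.171},
    "Weather":   {"Dir4NB": 1.356, "LimitLESS4": 0.654, "Full-Map": 0.621},
}

for app, t in times.items():
    # Slowdown relative to the full-map baseline (1.00 = identical).
    slowdown = t["LimitLESS4"] / t["Full-Map"]
    print(f"{app}: LimitLESS4 runs in {slowdown:.2f}x full-map time")
```

For Multigrid and Weather, LimitLESS stays within about 6% of full-map; Matexpr, with its widely shared data, is the case where software emulation cost shows most.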
Performance Results (contd.)
-> Results when the variable in Weather is not optimized.
Performance Results (contd.)
-> Results when the variable in Weather is optimized.
Performance Results (contd.)
-> Results with emulation latency = 50 for the LimitLESS protocol.
Conclusion
This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
Hardware requirements include rapid trap handling and a flexible processor interface to the network.
Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.