§ Georgia Institute of Technology, † Intel Corporation
Cache Coherence Support for Non-Shared Bus
Architecture on Heterogeneous MPSoCs
Taeweon Suh §, Daehyun Kim †, and Hsien-Hsin S. Lee §
June 15, 2005
2
MPSoCs
Time-to-Market, Flexibility, Low cost
– Share memory interface to reduce pin count
– However, shared bus arch. hinders the versatility provided by each processor
– Non-Shared bus arch.
Real-time property
– Communication between processors
[Figure: MPSoC block diagram with uP, DSP, IP blocks, Wireless IP, ADC, and a memory controller connected to SDRAM memory]
3
Introduction
Cache Coherence
– Well-known technique for data consistency in multiprocessor systems
Protocol states: Modified, Exclusive, Owned, Shared, Invalid
[Figure: P0 and P1, each with a MOESI data cache (D$), connected to memory holding 1234]
Example operation sequence
– Initial state -> P0: I -----, P1: I -----
– P0: read -> P0: E 1234, P1: I -----
– P1: read -> P0: S 1234, P1: S 1234 (shared)
– P1: write (abcd) -> P0: I 1234, P1: M abcd (invalidate)
– P0: read -> P0: S abcd, P1: O abcd (cache-to-cache)
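The example operation sequence can be traced with a small model. This is a minimal sketch, assuming a simplified MOESI system with a single cache line: a read miss ends in E when no other cache holds the line, a write invalidates the other copies, and a dirty line is supplied cache-to-cache with the supplier moving to O. The class name `MoesiSystem` and its methods are illustrative and do not come from the slides.

```python
class MoesiSystem:
    """Toy MOESI model for one cache line shared by several caches."""

    def __init__(self, memory_value, num_caches=2):
        self.memory = memory_value
        self.state = ["I"] * num_caches   # one coherence state per cache
        self.data = [None] * num_caches   # each cache's copy of the line

    def read(self, p):
        if self.state[p] != "I":
            return self.data[p]           # hit: no bus transaction needed
        others = [q for q in range(len(self.state))
                  if q != p and self.state[q] != "I"]
        owner = next((q for q in others if self.state[q] in ("M", "O")), None)
        if owner is not None:
            # A dirty copy exists: cache-to-cache transfer, supplier becomes O.
            self.state[owner] = "O"
            self.data[p] = self.data[owner]
            self.state[p] = "S"
        elif others:
            # Only clean copies exist: shared signal asserted, all end in S.
            for q in others:
                self.state[q] = "S"
            self.data[p] = self.memory
            self.state[p] = "S"
        else:
            # No other copy: the requester gets the line exclusively.
            self.data[p] = self.memory
            self.state[p] = "E"
        return self.data[p]

    def write(self, p, value):
        for q in range(len(self.state)):
            if q != p:
                self.state[q] = "I"       # invalidate; stale data is left in place
        self.data[p] = value
        self.state[p] = "M"


# Replay the sequence from the slide: P0 read, P1 read, P1 write, P0 read.
system = MoesiSystem("1234")
steps = [("read", 0, None), ("read", 1, None), ("write", 1, "abcd"), ("read", 0, None)]
trace = []
for op, cpu, value in steps:
    if op == "read":
        system.read(cpu)
    else:
        system.write(cpu, value)
    trace.append(list(system.state))
```

After the loop, `trace` holds the [P0, P1] state pair after each step, which can be compared against the sequence listed on the slide.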
4
Previous Work
Integration techniques for shared-bus based platforms [1][2][3]
[Figure: Shared-signal assertion. Proc 0 (MSI) behind Wrapper 0 and Proc 1 (MESI) behind Wrapper 1, on one bus with the memory controller]
[Figure: Read-to-write conversion. Proc 0 (MEI) behind Wrapper 0 and Proc 1 (MESI) behind Wrapper 1, on one bus with the memory controller; labels: Read, Shared, Read/Write, Write]
[Figure: Snoop-hit buffer. Proc 0 (MEI) behind Wrapper 0 and Proc 1 (MESI) behind Wrapper 1, on one bus with the memory controller and a snoop-hit buffer (single cache line); labels: Read, Write-back, To memory]
[1] Taeweon Suh, Douglas M. Blough, and Hsien-Hsin S. Lee, Supporting cache coherence in heterogeneous multiprocessor systems, in DATE’04, Feb. 2004
[2] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 1, in IEEE Micro, July/August 2004
[3] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 2, in IEEE Micro, September/October 2004
5
Proposal
Cache Coherence-enforced Memory Controller (ccMC) for Non-Shared bus based MPSoCs
– Bypass approach
– Bookkeeping approach
Integration of invalidation-based protocols such as MEI, MSI, MESI, and MOESI
[Figure: MPSoC with Proc 0 (MESI) and Proc 1 (MEI) on separate buses (Bus 0, Bus 1), both connected through the ccMC to memory]
6
Bypass Approach
Blindly pass bus transactions if in shared range
Very inexpensive in terms of silicon area
[Figure: MPSoC with Proc 0 (MESI) and Proc 1 (MEI) on separate buses (Bus 0, Bus 1), both connected through the ccMC to memory]
[Figure: ccMC internals between Bus 0 and Bus 1: Start_addr_reg, Range_reg, a comparator on the bus-request address, a mux (inputs 0 and 1), and a snoop-hit buffer]
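The shared-range check of the bypass approach can be sketched in a few lines. This is a sketch under stated assumptions, not the actual ccMC logic: `Range_reg` is assumed to hold the size of the shared region in bytes, and the class name `BypassCcMC` and the returned target strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BypassCcMC:
    start_addr: int    # models Start_addr_reg
    range_bytes: int   # models Range_reg (assumed: region size in bytes)

    def in_shared_range(self, addr):
        # The comparator: is the request address inside the shared region?
        return self.start_addr <= addr < self.start_addr + self.range_bytes

    def targets(self, src_bus, addr):
        """Return where a request from src_bus (0 or 1) is sent."""
        result = []
        if self.in_shared_range(addr):
            # Blind bypass: every shared-range transaction goes to the
            # other bus so that the other processor can snoop it.
            result.append(f"snoop bus {1 - src_bus}")
        result.append("memory")
        return result


ccmc = BypassCcMC(start_addr=0x1000, range_bytes=0x100)
shared_request = ccmc.targets(src_bus=0, addr=0x1040)
private_request = ccmc.targets(src_bus=1, addr=0x2000)
```

The design keeps no per-line state, which is consistent with the slide's claim of low silicon cost; the price is that the other bus sees every shared-range transaction, including those it does not need.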
7
Bookkeeping Approach
Selectively pass bus transactions if in shared range
Expensive compared to bypass approach
[Figure: MPSoC with Proc 0 (MESI) and Proc 1 (MEI) on separate buses (Bus 0, Bus 1), both connected through the ccMC to memory]
[Figure: ccMC internals between Bus 0 and Bus 1: Start_addr_reg, Range_reg, a state table indexed by the request address if inside the shared range, and a snoop-hit buffer; a bus request is issued if the recorded state is M]
States (one row per cache line)
P0  P1
I   I
S   I
S   S
M   I
I   I
I   I
...
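The selective forwarding can be illustrated with a per-line state table. This is only a sketch under several assumptions that the slides do not spell out: reads are forwarded to the other bus only when the other processor's recorded state is M, writes are forwarded whenever the other processor holds the line, read misses are recorded as S, and evictions are not tracked. The class name `BookkeepingCcMC` and the 32-byte line size are hypothetical.

```python
class BookkeepingCcMC:
    def __init__(self, start_addr, range_bytes, line_bytes=32):
        self.start_addr = start_addr
        self.range_bytes = range_bytes
        self.line_bytes = line_bytes
        self.table = {}   # line index -> [P0 state, P1 state]; absent means I, I

    def request(self, src, addr, is_write):
        """Record a request from processor src; return True if it is
        passed on to the other processor's bus."""
        offset = addr - self.start_addr
        if not 0 <= offset < self.range_bytes:
            return False                      # outside shared range: never passed
        states = self.table.setdefault(offset // self.line_bytes, ["I", "I"])
        other = 1 - src
        if is_write:
            forwarded = states[other] != "I"  # an existing copy must be invalidated
            states[src], states[other] = "M", "I"
        else:
            forwarded = states[other] == "M"  # only a dirty copy must be fetched
            states[src] = "S"
            if forwarded:
                states[other] = "S"           # owner writes back, keeps a clean copy
        return forwarded


ccmc = BookkeepingCcMC(start_addr=0x1000, range_bytes=0x100)
passed = [
    ccmc.request(src=1, addr=0x1000, is_write=False),  # P1: read
    ccmc.request(src=1, addr=0x1000, is_write=True),   # P1: write
    ccmc.request(src=0, addr=0x1000, is_write=False),  # P0: read
]
final_states = ccmc.table[0]
```

In this sketch only the last request reaches the other bus, whereas a blind bypass would pass all three; the table is the extra hardware that the slide describes as expensive.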
8
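Example
Bookkeeping approach
[Figure: MPSoC with Proc 0 (MSI) and Proc 1 (MESI) on separate buses (Bus 0, Bus 1), connected through the ccMC to memory; the ccMC state table starts with all P0 and P1 entries in I]
Example operation sequence
– P1: read
– P1: write (abcd)
– P0: read
[Figure: animation of the sequence; labels: shared, invalidate, Breq; cache and table states S and M; memory contents change from 1234 to abcd]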
9
Integration with no-coherence support processor
No-coherence support processors work like having MEI w/o snooping: MEI-like integrated protocol
Interrupt is used to inform possible snoop-hits
[Figure: MPSoC with Proc 0 (MESI) and Proc 1 (no hardware support) on separate buses (Bus 0, Bus 1), connected through the ccMC to memory; an IRQ line is shown]
10
Simulation Model
Atalanta [4] RTOS
– Home-grown RTOS at Georgia Tech
– Designed for heterogeneous multiprocessor SoCs
Atalanta kernel simulation
– Task insertion/deletion
– Tasks are managed in TCBs (Task Control Blocks)
– TCBs are connected through a doubly-linked list
– Each processor's TCBs are accessible by the other processor
– Update the highest-priority TCB waiting for a system object, such as a semaphore, when the system object is ready
[4] Di-Shi Sun, Douglas M. Blough, and Vincent J. Mooney, A New Multiprocessor RTOS Kernel for System-on-a-Chip Applications. Technical Report GIT-CC-02-09, CERCS
11
Simulation Environment
Processors
– Platform 1: PPC755 (MEI) + ARM9 with MESI
– Platform 2: ARM9 with MSI + ARM9 with MESI
Simulators: Seamless CVE + ModelSim
[Figure: Proc 0 and Proc 1 on separate buses (Bus 0, Bus 1), connected through the ccMC to memory; DMA0 and DMA1 serve a 100 Mbps Ethernet and a 320x240 LCD controller]
12
Simulation Results
Bypass Approach: 2 tasks on each processor
[Figure: speedup over software solution (y-axis, 1.2 to 2.6) vs. miss penalty in cycles (x-axis, 10 to 45); curves: platform 1 (MEI-MESI) bypass, platform 1 (MEI-MESI) bypass with snoop-hit buffer, platform 2 (MSI-MESI) bypass, platform 2 (MSI-MESI) bypass with snoop-hit buffer]
13
Simulation Results
Bypass Approach: 32 tasks on each processor
[Figure: speedup over software solution (y-axis, 3 to 7) vs. miss penalty in cycles (x-axis, 10 to 45); curves: platform 1 (MEI-MESI) bypass, platform 1 (MEI-MESI) bypass with snoop-hit buffer, platform 2 (MSI-MESI) bypass, platform 2 (MSI-MESI) bypass with snoop-hit buffer]
14
Simulation Results
Bookkeeping Approach
– Platform 2, miss penalty 14 cycles
– Microbench simulation
[Figure: speedup over the bypass approach (y-axis, 0.96 to 1.12) vs. bus utilization attempt by DMAs in percent (x-axis, 0 to 100); one curve per number of accessed cache lines: 1, 2, 4, 8, 16, 32]
15
Conclusions
Proposed integration techniques for cache coherence on Non-shared bus based MPSoCs
– Bypass approach, Bookkeeping approach
Bypass approach
– Blindly pass shared memory operations
– Very cheap in terms of silicon area
Bookkeeping approach
– Selectively pass shared memory operations
– Expensive compared to bypass approach
Effective solutions for communication as more and more heterogeneous processors are integrated in a single chip
16
Questions, Comments?
Thanks for your attention!
17
Backup Slides
18
Motivation
Embedded systems increasingly require heterogeneous processors on a chip according to application needs
Efficient communication is imperative to meet the real-time property of embedded applications
Shared-bus architecture using AMBA or CoreConnect compromises the versatility provided by each processor
Pin count restricts the use of a dedicated memory interface for each processor on SoCs
– Commercial MPSoCs such as TI's OMAP and Philips' Nexperia employ a Non-shared bus architecture sharing the memory interface (check Nexperia)
19
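Bookkeeping Approach (cont’d)
Problem with E-state
[Figure: MPSoC with Proc 0 (MSI) and Proc 1 (MESI) on separate buses (Bus 0, Bus 1), connected through the ccMC to memory; the ccMC state table starts with all P0 and P1 entries in I]
Example operation sequence
– P1: read
– P1: write
– P0: read
[Figure: animation of the sequence; cache and table states E and M; memory holds 1234, cached data changes from 1234 to abcd; final labels: E, E 1234]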
20
Bookkeeping Approach (cont’d)
Solution: Prohibit E-state (shared signal assertion)
[Figure: MPSoC with Proc 0 (MSI) and Proc 1 (MESI) on separate buses (Bus 0, Bus 1), connected through the ccMC to memory; the ccMC state table starts with all P0 and P1 entries in I]
Example operation sequence
– P1: read
– P1: write
– P0: read
[Figure: animation of the sequence; labels: shared, invalidate, Breq; cache and table states S and M; memory contents change from 1234 to abcd]
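The E-state problem and its fix can be reproduced with a short model of the sequence P1: read, P1: write (abcd), P0: read. This is a sketch under the assumption that a write to a line held in E upgrades to M without any bus transaction, so the ccMC's recorded state goes stale, while a write to a line held in S is visible on the bus. The function name `run_sequence` and its variables are illustrative.

```python
def run_sequence(assert_shared):
    """Replay P1: read, P1: write (abcd), P0: read.

    assert_shared=True models the ccMC asserting the shared signal on
    P1's read miss, which prohibits the E state. Returns the value that
    P0 reads from memory.
    """
    memory = "1234"
    recorded_p1 = "I"                 # ccMC's bookkeeping entry for P1

    # P1: read (a miss, so the ccMC sees it and records the fill state).
    p1_state = "S" if assert_shared else "E"
    recorded_p1 = p1_state

    # P1: write abcd.
    if p1_state == "S":
        recorded_p1 = "M"             # S -> M is visible on the bus
    # E -> M happens silently inside the cache: recorded_p1 stays "E".
    p1_state, p1_data = "M", "abcd"

    # P0: read. The ccMC issues a bus request to P1 only if it recorded M.
    if recorded_p1 == "M":
        memory = p1_data              # P1 writes the dirty line back
        p1_state = recorded_p1 = "S"
    return memory


stale = run_sequence(assert_shared=False)   # E state allowed
fresh = run_sequence(assert_shared=True)    # E state prohibited
```

With the E state allowed, the model returns the old memory contents because the ccMC never learns about the write; with the shared signal asserted, the write is recorded and P0 receives the updated data.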
21
Previous Work (cont’d)
Snoop-hit Buffer [2][3]
Region-Based Cache Coherence (RBCC) [2][3]
[Figure: Snoop-hit buffer. Proc 0 (MEI) behind Wrapper 0 and Proc 1 (MESI) behind Wrapper 1, on one bus with the memory controller and a snoop-hit buffer (single cache line); labels: Read, Write-back, To memory]
[Figure: RBCC. Three processors behind Wrapper 0, Wrapper 1, and Wrapper 2 on one bus with the memory controller; labels: Proc 0 (MESI), Proc 1 (MESI), Proc 0 (MEI), MESI, MEI]