Cache Coherence Support for Non-Shared Bus Architecture on Heterogeneous MPSoCs

Taeweon Suh §, Daehyun Kim †, and Hsien-Hsin S. Lee §

§ Georgia Institute of Technology, † Intel Corporation

June 15, 2005

MPSoCs

Time-to-Market, Flexibility, Low cost

– Share memory interface to reduce pin count

– However, shared bus arch. hinders the versatility provided by each processor

– Non-Shared bus arch.

Real-time property

– communication between processors

[Figure: MPSoC block diagram; recoverable labels: Memory Controller, Wireless IP, Memory]

Introduction

Cache Coherence
– Well known technique for data consistency for multiprocessor systems

Protocol states (MOESI): Modified, Exclusive, Owned, Shared, Invalid

Example operation sequence (two processors, each with a MOESI D$, and Memory)

Operation            P0 D$      P1 D$      Bus event
(initial)            I -----    I -----
P0: read             E 1234     I -----
P1: read             S 1234     S 1234     shared
P1: write (abcd)     I 1234     M abcd     invalidate
P0: read             S abcd     O abcd     cache-to-cache
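The operation sequence above can be replayed with a small model. This is a minimal sketch, assuming a single cache line, a write that overwrites the whole line, and a dirty copy that is supplied cache-to-cache without updating memory. The names `Line`, `MoesiSystem`, and `replay` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    state: str = "I"              # one of M, O, E, S, I
    data: Optional[str] = None    # None until the line is first filled


class MoesiSystem:
    """Toy model: several caches that share one memory line."""

    def __init__(self, n_caches: int = 2, memory: str = "1234") -> None:
        self.memory = memory
        self.caches: List[Line] = [Line() for _ in range(n_caches)]

    def read(self, pid: int) -> Optional[str]:
        me = self.caches[pid]
        if me.state != "I":
            return me.data                       # hit, no bus traffic
        others = [c for i, c in enumerate(self.caches)
                  if i != pid and c.state != "I"]
        owner = next((c for c in others if c.state in ("M", "O")), None)
        if owner is not None:
            owner.state = "O"                    # cache-to-cache transfer
            me.state, me.data = "S", owner.data
        elif others:
            for c in others:
                c.state = "S"                    # shared signal seen: E becomes S
            me.state, me.data = "S", self.memory
        else:
            me.state, me.data = "E", self.memory
        return me.data

    def write(self, pid: int, value: str) -> None:
        for i, c in enumerate(self.caches):
            if i != pid:
                c.state = "I"                    # invalidate the other copies
        self.caches[pid].state = "M"
        self.caches[pid].data = value

    def snapshot(self) -> List[str]:
        return [f"{c.state} {c.data if c.data is not None else '-----'}"
                for c in self.caches]


def replay() -> List[List[str]]:
    """Replay P0: read, P1: read, P1: write (abcd), P0: read."""
    system = MoesiSystem()
    trace = []
    system.read(0)
    trace.append(system.snapshot())
    system.read(1)
    trace.append(system.snapshot())
    system.write(1, "abcd")
    trace.append(system.snapshot())
    system.read(0)
    trace.append(system.snapshot())
    return trace
```

Each entry of `replay()` lists the P0 and P1 line contents after one operation, in the same order as the sequence on the slide.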

Previous Work

Integration techniques for shared-bus based platforms [1][2][3]

[Figure: Memory Controller, Wrapper 0 with Proc 0 (MSI), Wrapper 1 with Proc 1 (MESI); technique: shared-signal assertion]

[Figure: Memory Controller, Wrapper 0 with Proc 0 (MEI), Wrapper 1 with Proc 1 (MESI); technique: read-to-write conversion; labels: Shared, Read/Write]

[Figure: Memory Controller with snoop-hit buffer (single cache line), Wrapper 0 with Proc 0 (MEI), Wrapper 1 with Proc 1 (MESI); labels: Write-back, To memory, Read]

[1] Taeweon Suh, Douglas M. Blough, and Hsien-Hsin S. Lee, Supporting cache coherence in heterogeneous multiprocessor systems, In DATE'04, Feb. 2004
[2] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 1, In IEEE Micro, July/August 2004
[3] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 2, In IEEE Micro, September/October 2004

Proposal

Cache Coherence-enforced Memory Controller (ccMC) for Non-Shared bus based MPSoCs
– Bypass approach
– Bookkeeping approach

Integration of invalidation-based protocols such as MEI, MSI, MESI, and MOESI

[Figure: ccMC connecting Proc 0 (MESI) and Proc 1 (MEI) to Memory; Bus 0 labeled]

Bypass Approach

Blindly pass bus transactions if in shared range

Very inexpensive in terms of silicon area

[Figure: ccMC between Bus 0 and Bus 1 with Proc 0 (MESI), Proc 1 (MEI), and Memory; inside the ccMC: Start_addr_reg, Range_reg, comparator on the bus request address (output 0/1), and snoop-hit buffer]
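The comparator in the bypass approach can be sketched in a few lines. This is an illustration only: it assumes that Range_reg holds the size of the shared region in bytes, which the slide does not state, and the names `SharedRange` and `bypass_targets` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SharedRange:
    start_addr: int   # models Start_addr_reg
    size: int         # models Range_reg, assumed here to be a byte count

    def contains(self, addr: int) -> bool:
        # The comparator: 1 if the address is in the shared range, else 0.
        return self.start_addr <= addr < self.start_addr + self.size


def bypass_targets(shared: SharedRange, source_bus: int, addr: int) -> List[int]:
    """Return the buses that must observe this transaction as a snoop.

    Bypass policy: every transaction that falls in the shared range is
    passed to the other bus, with no knowledge of the cache states.
    """
    other_bus = 1 - source_bus
    return [other_bus] if shared.contains(addr) else []
```

Because no per-line state is kept, the only storage in this sketch is the two registers, which is consistent with the low area cost noted above. The price is that the other processor is snooped on every shared-range transaction.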

Bookkeeping Approach

Selectively pass bus transactions if in shared range

Expensive compared to bypass approach

[Figure: ccMC between Bus 0 and Bus 1 with Proc 0 (MESI), Proc 1 (MEI), and Memory; inside the ccMC: Start_addr_reg, Range_reg, snoop-hit buffer, and a table indexed by address that holds the states of P0 and P1 (entries shown as I, I) for bus requests inside the shared range]
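The state table in the bookkeeping approach can be sketched as follows. This is a simplified model under stated assumptions: both processors can hold a line in S (as in the MSI + MESI platform), the E state is not used, the table is a dictionary keyed by cache-line index, and the line size of 32 bytes is arbitrary. The class and method names are hypothetical.

```python
class BookkeepingCcMC:
    """Toy ccMC that records per-line states for P0 and P1."""

    def __init__(self, start_addr, size, line_size=32):
        self.start_addr = start_addr    # models Start_addr_reg
        self.size = size                # models Range_reg (assumed byte count)
        self.line_size = line_size
        self.table = {}                 # line index -> [state of P0, state of P1]

    def states(self, addr):
        """Recorded (P0, P1) states for the line that holds addr."""
        return tuple(self.table.get(addr // self.line_size, ("I", "I")))

    def on_request(self, pid, addr, is_write):
        """Update the table; return True if the other bus must be snooped."""
        if not (self.start_addr <= addr < self.start_addr + self.size):
            return False                # outside the shared range: never pass
        entry = self.table.setdefault(addr // self.line_size, ["I", "I"])
        other = 1 - pid
        forward = entry[other] != "I"   # pass only if the other cache may hold it
        if is_write:
            entry[pid], entry[other] = "M", "I"
        else:
            if entry[other] == "M":
                entry[other] = "S"      # the other cache writes back and keeps S
            if entry[pid] == "I":
                entry[pid] = "S"
        return forward
```

Compared with the bypass sketch of a single range check, this version filters out snoops to a processor whose recorded state is I, at the cost of one table entry per shared cache line.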

Example

Bookkeeping approach

Operation sequence: P1: read, P1: write (abcd), P0: read

[Figure: Proc 0 (MSI), Proc 1 (MESI), and Memory; recoverable labels: states M and S abcd, bus events "shared" and "invalidate", data values 1234 and abcd]

Integration with no-coherence support processor

Processors without coherence support behave as if they had MEI without snooping: MEI-like integrated protocol

An interrupt is used to inform the processor of possible snoop hits

[Figure: MPSoC with ccMC connecting Proc 0 (MESI) and Proc 1 (no hardware support) to Memory; Bus 0 labeled; IRQ line to Proc 1]
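The interrupt path can be illustrated with a small software model. The slide only says that an interrupt informs the processor of possible snoop hits; what the handler does is an assumption here, namely that it writes back a dirty line and then invalidates it. `SoftwareManagedCache` and `snoop_irq_handler` are hypothetical names.

```python
class SoftwareManagedCache:
    """Toy cache of a processor without hardware snooping (M, E, or absent)."""

    def __init__(self, memory):
        self.memory = memory    # dict: address -> value, shared with the system
        self.lines = {}         # address -> [state, value], state is "M" or "E"

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = ["E", self.memory[addr]]   # fill from memory
        return self.lines[addr][1]

    def write(self, addr, value):
        self.lines[addr] = ["M", value]                   # dirty, memory is stale

    def snoop_irq_handler(self, addr):
        """Assumed ISR body: write back if dirty, then invalidate the line."""
        entry = self.lines.pop(addr, None)
        if entry is not None and entry[0] == "M":
            self.memory[addr] = entry[1]
```

In this sketch the ccMC would raise the IRQ before letting the other processor's request reach memory, so that the request observes the written-back value.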

Simulation Model

Atalanta [4] RTOS
– Home-grown RTOS at Georgia Tech
– Designed for heterogeneous multiprocessor SoCs

Atalanta kernel simulation
– Task insertion/deletion
– Tasks are managed in TCBs (Task Control Blocks)
– TCBs are connected through a doubly-linked list
– Each processor's TCBs are accessible by the other processor
– Update the highest priority TCB waiting for a system object, such as a semaphore, when the system object is ready

[4] Di-Shi Sun, Douglas M. Blough, and Vincent J. Mooney, A New Multiprocessor RTOS Kernel for System-on-a-Chip Applications. Technical Report GIT-CC-02-09, CERCS
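The kind of shared data structure exercised by the kernel simulation can be sketched as a priority-ordered doubly-linked list of TCBs. This is not Atalanta's code or API: the field names, the ordering rule (a lower number means higher priority), and the class names `TCB` and `TaskList` are assumptions for illustration.

```python
class TCB:
    """Task Control Block with links to its neighbours."""

    def __init__(self, task_id, priority):
        self.task_id = task_id
        self.priority = priority    # assumption: lower number = higher priority
        self.prev = None
        self.next = None


class TaskList:
    """Doubly-linked list of TCBs kept in priority order."""

    def __init__(self):
        self.head = None

    def insert(self, tcb):
        prev, cur = None, self.head
        while cur is not None and cur.priority <= tcb.priority:
            prev, cur = cur, cur.next
        tcb.prev, tcb.next = prev, cur
        if prev is None:
            self.head = tcb
        else:
            prev.next = tcb
        if cur is not None:
            cur.prev = tcb

    def delete(self, tcb):
        if tcb.prev is None:
            self.head = tcb.next
        else:
            tcb.prev.next = tcb.next
        if tcb.next is not None:
            tcb.next.prev = tcb.prev
        tcb.prev = tcb.next = None

    def ids(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.task_id)
            cur = cur.next
        return out
```

Every insertion and deletion rewrites pointers in neighbouring TCBs, so when both processors touch the same list those writes land in shared cache lines, which is the traffic the coherence support has to handle.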

Simulation Environment

Processors
– Platform1: PPC755 (MEI) + ARM9 with MESI
– Platform2: ARM9 with MSI + ARM9 with MESI

Simulators: Seamless CVE + ModelSim

[Figure: ccMC connecting Proc 0 and Proc 1 to Memory (Bus 0 labeled), with DMA0, DMA1, 100 Mbps Ethernet, and 320x240 LCD controller]

Simulation Results

Bypass Approach: 2 tasks on each processor

[Figure: plot against miss penalty (cycles), x-axis from 10 to 45; series: platform 1 (MEI-MESI) bypass, platform 1 (MEI-MESI) bypass with snoop-hit buffer, platform 2 (MSI-MESI) bypass, platform 2 (MSI-MESI) bypass with snoop-hit buffer]

Bypass Approach: 32 tasks on each processor

[Figure: plot against miss penalty (cycles), x-axis from 10 to 45; series: platform 1 (MEI-MESI) bypass, platform 1 (MEI-MESI) bypass with snoop-hit buffer, platform 2 (MSI-MESI) bypass, platform 2 (MSI-MESI) bypass with snoop-hit buffer]

Bookkeeping Approach
– Platform 2, miss penalty 14 cycles
– Microbench simulation

[Figure: plot against bus utilization attempt by DMAs (percent), x-axis from 0 to 100; one series per number of accessed cache lines: 1, 2, 4, 8, 16, 32]

Conclusions

Proposed integration techniques for cache coherence on non-shared bus based MPSoCs
– Bypass approach, Bookkeeping approach

Bypass approach
– Blindly pass shared memory operations
– Very cheap in terms of silicon area

Bookkeeping approach
– Selectively pass shared memory operations
– Expensive compared to bypass approach

Effective solutions for communication as more and more heterogeneous processors are integrated in a single chip

Questions, Comments?

Thanks for your attention!

Backup Slides

Motivation

Embedded systems increasingly require heterogeneous processors on a chip, according to application needs

Efficient communication is imperative to meet the real-time property of embedded applications

Shared-bus architecture using AMBA or CoreConnect compromises the versatility provided by each processor

Pin count restricts the use of a dedicated memory interface for each processor on SoCs
– Commercial MPSoCs such as TI's OMAP and Philips' Nexperia employ a non-shared bus architecture that shares the memory interface (check Nexperia)

Bookkeeping Approach (cont'd)

Problem with E-state

Operation sequence: P1: read, P1: write, P0: read

[Figure: Proc 0 (MSI), Proc 1 (MESI), and Memory; recoverable labels: state E 1234, data values 1234 and abcd]

Bookkeeping Approach (cont'd)

Solution: Prohibit E-state (shared signal assertion)

Operation sequence: P1: read, P1: write, P0: read

[Figure: Proc 0 (MSI), Proc 1 (MESI), and Memory; recoverable labels: states M and S abcd, bus events "shared" and "invalidate", data values 1234 and abcd]
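One reading of the E-state problem is that a MESI cache upgrades E to M on a write without any bus transaction, so the ccMC's table never learns that the line is dirty. The sketch below replays the slide's sequence under that assumption. The function `run` and its variables are hypothetical, and treating a recorded E entry as clean is a modelling choice made only to expose the failure.

```python
def run(assert_shared: bool) -> str:
    """Replay P1: read, P1: write (abcd), P0: read; return what P0 reads."""
    memory = "1234"
    p1_state, p1_data = "I", None
    ccmc_p1 = "I"               # the ccMC's recorded state for P1

    # P1: read. The miss goes through the ccMC, which records the fill.
    p1_state = "S" if assert_shared else "E"
    p1_data = memory
    ccmc_p1 = p1_state

    # P1: write (abcd).
    if p1_state == "E":
        p1_state = "M"          # silent upgrade: nothing appears on the bus
    else:
        p1_state = "M"          # S to M needs an invalidate on the bus,
        ccmc_p1 = "M"           # which the ccMC observes and records
    p1_data = "abcd"

    # P0: read. The ccMC snoops P1 only if its table says P1 is dirty.
    if ccmc_p1 == "M":
        memory = p1_data        # P1 writes back before P0 is served
        p1_state = ccmc_p1 = "S"
    return memory
```

With `run(False)` P0 is served the stale value from memory, while `run(True)`, which models shared-signal assertion, returns the value P1 wrote.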

Previous Work (cont'd)

Snoop-hit Buffer [2][3]

[Figure: Memory Controller with snoop-hit buffer (single cache line), Wrapper 0 with Proc 0 (MEI), Wrapper 1 with Proc 1 (MESI); labels: Write-back, To memory, Read]

Region-Based Cache Coherence (RBCC) [2][3]

[Figure: Memory Controller, Wrapper 2 with Proc 0 (MEI), Wrapper 1 with Proc 1 (MESI), Wrapper 0 with Proc 0 (MESI); regions labeled MESI and MEI]

[2] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 1, In IEEE Micro, July/August 2004
[3] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 2, In IEEE Micro, September/October 2004
