Decoupled Direct Memory Access

Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM

Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun,
Jongmoo Choi, Onur Mutlu

2

Logical System Organization

[Figure: processor, main memory, and IO devices; the processor reaches main memory by CPU access, the IO devices by IO access.]

Main memory connects the processor and IO devices as an intermediate layer.

3

Physical System Implementation

[Figure: processor, main memory, and IO devices, with CPU access and IO access paths. Callouts: High Pin Cost in Processor; High Contention in Memory Channel.]

4

Our Approach

[Figure: processor, main memory, and IO devices, with a CPU access path and a separate IO access path.]

Enabling an IO channel, decoupled & isolated from the CPU channel

5

Executive Summary

• Problem
  – CPU and IO accesses contend for the shared memory channel

• Our Approach: Decoupled Direct Memory Access (DDMA)
  – Design a new DRAM architecture with two independent data ports → Dual-Data-Port DRAM
  – Connect one port to the CPU and the other port to IO devices → Decouple CPU and IO accesses

• Application
  – Communication between compute units (e.g., CPU–GPU)
  – In-memory communication (e.g., bulk in-memory copy/init.)
  – Memory-storage communication (e.g., page fault, IO prefetch)

• Result
  – Significant performance improvement (20% in 2 ch. & 2 rank system)
  – CPU pin count reduction (4.5%)

6

Outline

1. Problem (current section)
2. Our Approach
3. Dual-Data-Port DRAM
4. Applications for DDMA
5. Evaluation

7

Problem 1: Memory Channel Contention

[Figure: Processor Chip with CPU, memory controller, DMA, and IO interface; DRAM Chip with main memory; IO devices: graphics, network, storage, USB. Callout: Memory Channel Contention.]

8

Problem 1: Memory Channel Contention

[Figure: Time Spent on CPU-GPU Communication. x-axis: Benchmarks; y-axis: Fraction of Execution Time (0% to 100%). Annotation: 33.5% on average.]

A large fraction of the execution time is spent on IO accesses.

9

Problem 2: High Cost for IO Interfaces

[Figure: two pie charts of Processor Pin Count.
 – w/ power pins: 959 pins in total; segments: power, memory (2 ch), IO interface (10.6%), others.
 – w/o power pins: 359 pins in total; segments: memory (2 ch), IO interface (28.4%), others.]

Integrating the IO interface on the processor chip leads to high area cost.
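As a quick consistency check on the pin-count figures of this slide, the two IO-interface percentages should describe the same number of pins. The sketch below assumes that 10.6% belongs to the 959-pin total (with power pins) and 28.4% to the 359-pin total (without power pins); that pairing is inferred from the flattened figure, and `pins_for_share` is a hypothetical helper.

```python
def pins_for_share(total_pins, share_percent):
    """Number of pins corresponding to a percentage share of the total."""
    return total_pins * share_percent / 100.0


# Assumed pairing of percentages and totals (inferred from the figure text).
with_power = pins_for_share(959, 10.6)      # IO interface share, w/ power pins
without_power = pins_for_share(359, 28.4)   # IO interface share, w/o power pins

print(round(with_power), round(without_power))
```

Under that assumption both shares come out at roughly a hundred pins, which suggests the two charts describe the same IO interface.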

10

Shared Memory Channel

• Memory channel contention for IO access and CPU access

• High area cost for integrating IO interfaces on the processor chip

11

Outline

1. Problem
2. Our Approach
3. Dual-Data-Port DRAM
4. Applications for DDMA
5. Evaluation

12

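Our Approach

[Figure: Processor Chip with CPU, memory controller, and DMA control (DMA CTRL.); DMA Chip with DMA IO interface, serving graphics, network, storage, and USB; DRAM Chip shown as main memory, then as a Dual-Data-Port DRAM with Port 1 and Port 2; a control channel links the chips. A "?" marks the open question of how the DMA Chip reaches memory.]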

13

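Our Approach: Decoupled Direct Memory Access

[Figure: Processor Chip with CPU, memory controller, and DMA control (DMA CTRL.); DMA Chip with DMA IO interface, serving graphics, network, storage, and USB; DRAM Chip is a Dual-Data-Port DRAM with Port 1 and Port 2; a control channel links the chips. Paths labeled CPU ACCESS and IO ACCESS.]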

14

Outline

1. Problem
2. Our Approach
3. Dual-Data-Port DRAM
4. Applications for DDMA
5. Evaluation

15

Background: DRAM Operation

[Figure: memory controller at CPU connected over the memory channel (data channel and control channel) to the DRAM's control port and data port; peripheral logic sits between the ports and the banks. Sequence shown: activate, bank READY, read.]

DRAM peripheral logic: i) controls banks, and ii) transfers data over the memory channel.

16

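Problem: Single Data Port

[Figure: memory controller at CPU connected over the data channel and control channel to the control port and data port; periphery and banks behind them. Two banks are READY and two reads are issued, but both must use the one data port. Callouts: Many Banks; Single Data Port.]

Requests are served serially due to the single data port.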

17

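Problem: Single Data Port

[Timing diagram, single data port: the Control Port issues RD, RD; the Data Port carries DATA, DATA one after the other over time.]

[Timing diagram, two data ports: the Control Port issues RD, RD; Data Port 1 carries one DATA burst and Data Port 2 carries the other.]

What about a DRAM with two data ports?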
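The difference between the two timelines can be mimicked with a toy model. This is only an illustrative sketch: the 4-cycle burst length, the one-command-per-cycle control port, and the round-robin port assignment are assumed values rather than figures from the slides, and `finish_time` is a hypothetical helper.

```python
def finish_time(num_reads, num_data_ports, burst_cycles=4):
    """Cycle at which the last data burst completes in a toy timing model.

    Assumptions (not from the slides): the control port issues one RD per
    cycle, each burst occupies a data port for `burst_cycles` cycles, and
    reads are assigned to data ports round-robin.
    """
    port_free_at = [0] * num_data_ports
    last_done = 0
    for i in range(num_reads):
        issue_cycle = i                       # one RD command per cycle
        port = i % num_data_ports             # round-robin port choice
        start = max(issue_cycle, port_free_at[port])
        port_free_at[port] = start + burst_cycles
        last_done = max(last_done, port_free_at[port])
    return last_done


single = finish_time(num_reads=2, num_data_ports=1)
dual = finish_time(num_reads=2, num_data_ports=2)
print(single, dual)
```

With one data port the second burst has to wait for the first to finish; with two ports the bursts overlap and the pair completes sooner.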

18

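Dual-Data-Port DRAM

[Figure: banks drive a bank data bus; muxes steered by a port select signal from the control port route data to Port 1 (upper) or to Port 2 (lower); data port 1 and data port 2 each connect to their own data channel, alongside the control channel; periphery sits between the ports and the banks.]

Twice the bandwidth & independent data ports with low overhead

Overhead: Area 1.6% ↑, Pins 20 ↑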

19

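DDP-DRAM Memory System

[Figure: the memory controller at CPU connects to data port 1 over the CPU channel and to the control port over the control channel with port select; the DDMA IO interface connects to data port 2 over the IO channel; muxes in the periphery connect the banks to either data port.]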

20

Three Data Transfer Modes

• CPU Access: Access through the CPU channel
  – DRAM read/write with CPU port selection

• IO Access: Access through the IO channel
  – DRAM read/write with IO port selection

• Port Bypass: Direct transfer between channels
  – DRAM access with port bypass selection
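One way to picture the three modes is as a port-select value that decides which two endpoints are connected. The sketch below is a hypothetical software model, not the DRAM command encoding: the names `PortSelect` and `connected_endpoints` are invented for illustration.

```python
from enum import Enum


class PortSelect(Enum):
    """Port-select values sent on the control channel (illustrative names)."""
    CPU = "cpu"
    IO = "io"
    BYPASS = "bypass"


def connected_endpoints(select):
    """Return the pair of endpoints that exchange data in each mode."""
    if select is PortSelect.CPU:
        return frozenset({"bank", "CPU channel"})
    if select is PortSelect.IO:
        return frozenset({"bank", "IO channel"})
    if select is PortSelect.BYPASS:
        # Direct transfer between the two channels; no bank is involved.
        return frozenset({"CPU channel", "IO channel"})
    raise ValueError(f"unknown port select: {select!r}")


for mode in PortSelect:
    print(mode.name, sorted(connected_endpoints(mode)))
```

A set is used instead of a (source, destination) pair because each mode covers both reads and writes, so the direction depends on the command rather than on the port selection.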

21

1. CPU Access Mode

[Figure: DDP-DRAM memory system with the memory controller at CPU, the DDMA IO interface, control port, data port 1 (CPU channel), data port 2 (IO channel), muxes, periphery, and banks. The control channel carries the port select, here the CPU channel; once the bank is READY, the read is served through data port 1 over the CPU channel.]

22

2. IO Access Mode

[Figure: DDP-DRAM memory system with the memory controller at CPU, the DDMA IO interface, control port, data port 1 (CPU channel), data port 2 (IO channel), muxes, periphery, and banks. The control channel carries the port select, here the IO channel; once the bank is READY, the read is served through data port 2 over the IO channel.]

23

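3. Port Bypass Mode

[Figure: DDP-DRAM memory system with the memory controller at CPU, the DDMA IO interface, control port, data port 1 (CPU channel), data port 2 (IO channel), muxes, periphery, and banks. The control channel carries the port bypass selection; data moves directly between data port 1 and data port 2, that is, between the CPU channel and the IO channel.]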

24

Outline

1. Problem
2. Our Approach
3. Dual-Data-Port DRAM
4. Applications for DDMA
5. Evaluation

25

Three Applications for DDMA

• Communication b/w Compute Units
  – CPU-GPU communication

• In-Memory Communication and Initialization
  – Bulk page copy/initialization

• Communication b/w Memory and Storage
  – Serving page fault/file read & write

26

1. Compute Unit ↔ Compute Unit (CPU → GPU)

[Figure: the CPU and the GPU each have a memory controller, a DDMA ctrl., a DDP-DRAM, and a DDMA IO interface, linked by ctrl. channels. The source DDP-DRAM is read with IO sel., data moves from the source DDMA IO interface to the destination DDMA IO interface, the destination DDP-DRAM is written with IO sel., and an Ack. is returned.]

Transfer data through DDMA without interfering w/ CPU/GPU memory accesses
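The CPU → GPU flow can be sketched as a toy software model. Everything here is hypothetical and only follows the labels on the slide (read with IO select at the source, write with IO select at the destination, then an acknowledgement); `DDPDram` and `ddma_transfer` are invented names, and real DDMA operates on DRAM commands rather than Python objects.

```python
class DDPDram:
    """Toy dual-data-port DRAM: stored pages plus per-port traffic counters."""

    def __init__(self):
        self.pages = {}
        self.traffic = {"cpu": 0, "io": 0}

    def read(self, addr, port):
        self.traffic[port] += 1
        return self.pages[addr]

    def write(self, addr, data, port):
        self.traffic[port] += 1
        self.pages[addr] = data


def ddma_transfer(src, src_addr, dst, dst_addr):
    """Copy one page between two memories using only their IO ports."""
    data = src.read(src_addr, port="io")      # read with IO select (source)
    dst.write(dst_addr, data, port="io")      # write with IO select (destination)
    return "ack"                              # acknowledgement to the requester


cpu_mem, gpu_mem = DDPDram(), DDPDram()
cpu_mem.write(0x10, b"payload", port="cpu")   # the CPU produces the data
status = ddma_transfer(cpu_mem, 0x10, gpu_mem, 0x20)
print(status, cpu_mem.traffic, gpu_mem.traffic)
```

In this model the transfer itself adds traffic only to the `io` counters, which is the point of the slide: the CPU-side and GPU-side ports stay free for their own memory accesses.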

27

2. In-Memory Communication

[Figure: CPU with memory controller and DDMA ctrl., connected over the ctrl. chan. to a DDP-DRAM with a DDMA IO interface. The source location is read with IO sel. and the destination location is written with IO sel.]

Transfer data in DRAM through DDMA without interfering with CPU memory accesses

28

3. Memory ↔ Storage

[Figure: CPU with memory controller and DDMA ctrl., connected over the ctrl. chan. to a DDP-DRAM (destination) with a DDMA IO interface; Storage (source) attaches to the DDMA IO interface. Steps labeled: Acc. Storage, write with IO sel., Ack.]

Transfer data from storage through DDMA without interfering with CPU memory accesses

29

Outline

1. Problem
2. Our Approach
3. Dual-Data-Port DRAM
4. Applications for DDMA
5. Evaluation

30

Evaluation Methods

• System
  – Processor: 4–16 cores
  – LLC: 16-way associative, 512KB private cache-slice/core
  – Memory: 1–4 ranks and 1–4 channels

• Workloads
  – Memory intensive: SPEC CPU2006, TPC, stream (31 benchmarks)
  – CPU-GPU communication intensive: polybench (8 benchmarks)
  – In-memory communication intensive: apache, bootup, compiler, filecopy, mysql, fork, shell, memcached (8 in total)

31

Performance (2 Channel, 2 Rank)

[Figure: two bar charts, CPU-GPU Comm.-Intensive and In-Memory Comm.-Intensive. x-axis: 4-Core, 8-Core, 16-Core; y-axis: Performance Improvement (0% to 25%). Callouts: High performance improvement; More performance improvement at higher core count.]

32

Performance on Various Systems

[Figure: two bar charts of Performance Improvement (0% to 40%). Channel Count panel: 1 ch, 2 ch, 4 ch. Rank Count panel: 1 rank, 2 rank, 4 rank.]

Performance increases with rank count.

33

DDMA vs. Doubling Channel

[Figure: two bar charts over the configurations 1 ch, 1 ch DDMA, and 2 ch. Processor Pin Count panel (axis 0 to 1200): 959, 915, and 1103 pins respectively. Performance panel (axis 0% to 180%).]

DDMA achieves higher performance at lower processor pin count.
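The pin-count comparison can be worked through as simple arithmetic. The sketch below assumes the three values 959, 915, and 1103 map to the 1 ch, 1 ch DDMA, and 2 ch configurations in the order they appear on the slide; `percent_change` is a hypothetical helper.

```python
# Pin counts as read from the slide; the mapping to configurations is assumed
# from the order of the labels.
PINS = {"1ch": 959, "1ch DDMA": 915, "2ch": 1103}


def percent_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base


ddma_vs_1ch = percent_change(PINS["1ch DDMA"], PINS["1ch"])
two_ch_vs_1ch = percent_change(PINS["2ch"], PINS["1ch"])

print(f"1ch DDMA vs 1ch: {ddma_vs_1ch:+.2f}%")
print(f"2ch vs 1ch:      {two_ch_vs_1ch:+.2f}%")
```

By this arithmetic DDMA removes 44 pins, a reduction of about 4.6% that is close to the 4.5% quoted in the summary, whereas adding a second channel costs 144 extra pins, roughly 15% more.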

34

Conclusion

• Problem
  – CPU and IO accesses contend for the shared memory channel

• Our Approach: Decoupled Direct Memory Access (DDMA)
  – Design a new DRAM architecture with two independent data ports → Dual-Data-Port DRAM
  – Connect one port to the CPU and the other port to IO devices → Decouple CPU and IO accesses

• Application
  – Communication between compute units (e.g., CPU–GPU)
  – In-memory communication (e.g., bulk in-memory copy/init.)
  – Memory-storage communication (e.g., page fault, IO prefetch)

• Result
  – Significant performance improvement (20% in 2 ch. & 2 rank system)
  – CPU pin count reduction (4.5%)

36

System Overhead

[Figure: Conventional System: processor, DRAM, IO devices. Proposed System: processor, DDP-DRAM, DDMA-IO, IO devices. A cost scale runs from Low to High.]

DDMA reduces more expensive on-chip area, while increasing less expensive off-chip area.

37

Channel Utilization Analysis

[Figure: stacked bar charts of Channel Utilization (0% to 100%); one panel is titled CPU-GPU Communication-Intensive. Groups on the x-axis: 1 Channel, 2 Channel, 2 Channel, 1 Rank, 2 Rank, 4 Rank, each with a CPU bar and an IO bar. Legend: Both Channels Busy; Single Channel Busy.]

Simultaneous Channel Utilization → Performance Improvement
