DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance



DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Dan Tang, Yungang Bao, Weiwu Hu, Mingyu Chen

2010.1

Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS)


The role of I/O

I/O is ubiquitous:
Load binary files: Disk → Memory
Browse the web, stream media: Network → Memory …

I/O is significant: many commercial applications are I/O intensive, e.g., databases.


State-of-the-Art I/O Technologies

I/O bus: 20 GB/s
PCI-Express 2.0, HyperTransport 3.0, QuickPath Interconnect

I/O devices
SSD RAID: 1.2 GB/s
10GE: 1.25 GB/s
Fusion-io: 8 GB/s, 1M IOPS (2 KB random, 70/30 read/write mix)


Direct Memory Access (DMA)

DMA is used for I/O operations in all modern computers.
DMA allows I/O subsystems to access system memory independently of the CPU.
Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs.


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


An Example of Disk Read: DMA Receiving Operation

[Figure: DMA receive path among the DMA engine, CPU, and memory, involving the driver buffer, buffer descriptor, and kernel buffer (numbered steps).]

Cache access latency: ~20 cycles
Memory access latency: ~200 cycles
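A back-of-envelope sketch (not from the slides) of why placement of the DMA'd data matters, using the two latencies above; the buffer and line sizes are assumed values for illustration.

```python
# Cycles for the CPU to consume a DMA-received buffer, depending on where it lands.
CACHE_LATENCY = 20      # cycles per access that hits in the cache (from the slide)
MEMORY_LATENCY = 200    # cycles per access that goes to memory (from the slide)
LINE_BYTES = 64         # assumed cache-line size
BUFFER_BYTES = 4096     # assumed 4 KB kernel buffer received via DMA

lines = BUFFER_BYTES // LINE_BYTES
print("CPU consumes buffer from memory:", lines * MEMORY_LATENCY, "cycles")
print("CPU consumes buffer from cache: ", lines * CACHE_LATENCY, "cycles")
```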


Direct Cache Access [Ram-ISCA05]

[Figure: the same DMA receive path, with the I/O data placed in the shared cache.]

This is a typical shared-cache scheme.
Prefetch-hint approach [Kumar-Micro07]


Problems of the Shared-Cache Scheme

Cache pollution and cache thrashing
Not suitable for other I/O; degrades performance when DMA requests are large (>100KB), e.g., for the "Oracle + TPC-H" application

To address this problem thoroughly, we need to investigate the characteristics of I/O data.


I/O Data vs. CPU Data

[Figure: I/O data and CPU data streams meet at the memory controller; HMTT captures the combined I/O data + CPU data trace.]


A short ad for HMTT [Bao-Sigmetrics08]

A hardware/software hybrid memory trace tool:
Supports the DDR2 DIMM interface on multiple platforms
Collects full-system off-chip memory traces
Provides traces with semantic information, e.g., virtual address, process ID, I/O operation
Collects traces of commercial applications, e.g., Oracle, web servers

[Figure: the HMTT system.]


Characteristics of I/O Data (1)

[Figures: % of memory references to I/O data; % of references of various I/O types.]


Characteristics of I/O Data (2)

[Figure: I/O request size distribution.]


Characteristics of I/O Data (3)

Sequential access in I/O data
Compared with CPU data, I/O data is very regular.


Characteristics of I/O Data (4)

Reuse Distance (RD), measured as LRU stack distance

[Figure: worked example of LRU stack distances for a short reference trace, and the CDF of reuse distance: x% of references have RD <= n.]
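For illustration, a minimal Python sketch of how LRU stack distance can be computed from a reference trace; the trace below is made up, not an HMTT trace.

```python
def lru_stack_distances(trace):
    stack, dists = [], []
    for block in trace:
        if block in stack:
            d = stack.index(block) + 1   # 1-based depth in the LRU stack
            stack.remove(block)
        else:
            d = float("inf")             # first touch: infinite reuse distance
        stack.insert(0, block)           # move block to most-recently-used position
        dists.append(d)
    return dists

# Regular, I/O-style reuse yields short distances after the first touch.
print(lru_stack_distances([1, 2, 1, 2, 3, 3, 4, 4]))   # [inf, inf, 2, 2, inf, 1, inf, 1]
```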


Characteristics of I/O Data (5)

[Figure: access patterns to I/O buffers vs. CPU data: DMA-W followed by CPU-R, CPU-W followed by DMA-R, and interleaved CPU-RW for CPU data.]


Rethink I/O & DMA Operation

20~40% of memory references are to I/O data in I/O-intensive applications.
Characteristics of I/O data differ from those of CPU data:
An explicit produce-consume relationship for I/O data
Reuse distance of I/O data is smaller than that of CPU data
References to I/O data are primarily sequential

Separating I/O data and CPU data


Separating I/O data and CPU data

[Figure: system organization before and after separating I/O data from CPU data.]


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


DMA Cache Design Issues

Write Policy, Cache Coherence, Replacement Policy, Prefetching

Dedicated DMA Cache (DDC)


DMA Cache Design Issues: Write Policy

Adopt a write-allocate policy
Both write-back and write-through policies are available
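A minimal sketch, assuming a simple block-indexed cache model, of write-allocate combined with a selectable write-back/write-through policy; this is an illustration, not the authors' RTL.

```python
class Line:
    def __init__(self):
        self.dirty = False
        self.data = None

def dma_write(cache, memory, addr, data, write_back=True):
    line = cache.setdefault(addr, Line())   # write-allocate: install the block on a miss
    line.data = data
    if write_back:
        line.dirty = True                   # memory updated later, on eviction
    else:
        memory[addr] = data                 # write-through: update memory immediately

cache, memory = {}, {}
dma_write(cache, memory, 0x1000, b"incoming I/O payload", write_back=True)
print(cache[0x1000].dirty)   # -> True
```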


DMA Cache Design Issues: Cache Coherence

IO-ESI protocol for the write-through (WT) policy
IO-MOESI protocol for the write-back (WB) policy

The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of state transitions are exchanged.
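A schematic sketch of that statement: represent a protocol as a transition table keyed by (state, source, event) and derive the IO- variant by swapping the local and probe sources. The three ESI transitions below are an illustrative fragment, not the paper's full table.

```python
ESI_FRAGMENT = {
    ("I", "local", "Read"):  "E",   # local read miss with no other copy: fill Exclusive
    ("E", "probe", "Read"):  "S",   # probed read from another agent: downgrade to Shared
    ("E", "probe", "Write"): "I",   # probed write: invalidate the local copy
}

def to_io_variant(table):
    swap = {"local": "probe", "probe": "local"}
    return {(state, swap[source], event): nxt
            for (state, source, event), nxt in table.items()}

IO_ESI_FRAGMENT = to_io_variant(ESI_FRAGMENT)
print(IO_ESI_FRAGMENT[("I", "probe", "Read")])   # -> "E"
```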


A Big Issue

How can we prove the correctness of integrating heterogeneous cache coherence protocols in one system?


A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98]

[Figure: the individual states of the DMA cache and CPU caches are summarized into global states; OS+I+ is a valid global state (√) while MS+I+ is a conflict state (X), with transitions labeled R|E, W|*, R|I, etc.]


Global State Cache Coherence Theorem

Given N (N>1) well-defined cache protocols, they do not conflict if and only if there does not exist any conflict global state in the global state transition machine.

[Figure: global state transition machine, with transitions labeled R|*, W|*, R|E, R|I, R|M, etc.]

5 global states, all conflict-free (√): S+I+, EI*, I*, MI*, OS*I*
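An illustrative sketch of the check the theorem calls for: enumerate the reachable global states of several composed caches and look for conflict states (here, more than one writable copy). The toy MSI-style transition rules are an assumption for illustration, not the paper's IO-MOESI/IO-ESI protocols.

```python
from collections import deque

def transition(global_state, requester, op):
    """Local Read/Write at `requester`; every other cache sees it as a probe."""
    nxt = list(global_state)
    if op == "Read":
        nxt = ["S" if (i != requester and s == "M") else s for i, s in enumerate(nxt)]
        nxt[requester] = "S"                              # reader ends up Shared
    else:                                                 # Write
        nxt = ["M" if i == requester else "I" for i in range(len(nxt))]
    return tuple(nxt)

def reachable_conflicts(num_caches=3):
    start = tuple("I" for _ in range(num_caches))
    seen, queue, conflicts = {start}, deque([start]), []
    while queue:
        gs = queue.popleft()
        if sum(s == "M" for s in gs) > 1:                 # conflict global state
            conflicts.append(gs)
        for requester in range(num_caches):
            for op in ("Read", "Write"):
                nxt = transition(gs, requester, op)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return conflicts

print(reachable_conflicts())   # -> []: no conflict global state is reachable
```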


MOESI + ESI

[Figure: global state transition machine for the combined MOESI and ESI protocols, with transitions labeled RC|E, WD|I, RD|SI, etc. (subscripts distinguish the two caches).]

6 global states, all conflict-free (√): S+I+, ECI*, I*, MCI*, EDI*, OCS*I*


DMA Cache Design Issues: Replacement Policy

An LRU-like replacement policy: 1. Invalid, 2. Shared, 3. Owned, 4. Exclusive, 5. Modified
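A minimal sketch, assuming the numbered states give the victim-selection preference (Invalid first, Modified last) with LRU age as the tie-breaker; an illustration, not the authors' implementation.

```python
STATE_PRIORITY = {"Invalid": 0, "Shared": 1, "Owned": 2, "Exclusive": 3, "Modified": 4}

def choose_victim(cache_set):
    """cache_set: list of (state, lru_age) per way; larger age = less recently used."""
    return min(range(len(cache_set)),
               key=lambda i: (STATE_PRIORITY[cache_set[i][0]], -cache_set[i][1]))

ways = [("Modified", 3), ("Shared", 0), ("Shared", 5), ("Exclusive", 1)]
print(choose_victim(ways))   # -> 2: the least recently used Shared line is evicted
```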

INSTITUTE OF COMPUTING

TECHNOLOGY

DMA Cache Design Issues: Prefetching

Adopt straightforward sequential prefetching
Prefetching is triggered by a cache miss
Fetch 4 blocks at a time
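A minimal sketch of such a miss-triggered sequential prefetch; the degree of 4 comes from the slide, while the cache and fetch interfaces are assumed placeholders.

```python
PREFETCH_DEGREE = 4   # total sequential blocks brought in on a miss (from the slide)

def on_dma_cache_miss(miss_block, cache, fetch_block):
    # Fetch the missing block plus the following sequential blocks.
    for b in range(miss_block, miss_block + PREFETCH_DEGREE):
        if b not in cache:
            cache[b] = fetch_block(b)

cache = {}
on_dma_cache_miss(100, cache, fetch_block=lambda b: f"data[{b}]")
print(sorted(cache))   # -> [100, 101, 102, 103]
```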


Design Complexity vs. Design Cost

Dedicated DMA Cache (DDC)
Partition-Based DMA Cache (PBDC)
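A sketch of the PBDC idea under the assumption that a few ways of the existing last-level cache are reserved for I/O data (consistent with the later note that PBDC needs no additional on-chip storage); the way counts are illustrative.

```python
LLC_WAYS = 16
DMA_WAYS = {14, 15}          # ways reserved for I/O data in PBDC mode (assumed)

def candidate_ways(is_dma_request):
    """Restrict allocation/replacement to the partition owned by the requester."""
    if is_dma_request:
        return sorted(DMA_WAYS)
    return [w for w in range(LLC_WAYS) if w not in DMA_WAYS]

print(candidate_ways(True))        # -> [14, 15]
print(len(candidate_ways(False)))  # -> 14 ways left for CPU data
```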


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


Speedup of Dedicated DMA Cache


% of Valid Prefetched Blocks

DMA caches exhibit impressively high prefetching accuracy because I/O data has a very regular access pattern.


Performance Comparisons

Although PBDC does not require additional on-chip storage, it achieves about 80% of DDC's performance improvement.


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


Conclusions

We have proposed a DMA cache technique to separate I/O data from CPU data.
We adopt a global state method for integrating heterogeneous cache protocols.
Experimental results show that DMA cache schemes are better than the existing approaches that use unified, shared caches for I/O data and CPU data.

Still open problems, e.g.:
Can I/O data go directly to the L1 cache?
How to design heterogeneous caches for different types of data?
How to optimize the memory controller with awareness of I/O?


Thanks!

Questions?


RTL Emulation Platform

LLC and DMA cache models from Loongson-2F
DDR2 memory controller from Loongson-2F
DDR2 DIMM model from Micron Technology

[Figure: memory traces drive the LL cache, DMA cache, memory controller, and DDR2 DIMM models.]


Parameters

[Table: platform parameters; memory is DDR2-666.]


Normalized Speedup for WB

The baseline is the snoop-cache scheme. DMA cache schemes exhibit better performance than the others.


DMA Write & CPU Read Hit Rate

Both the shared cache and the DMA cache exhibit high hit rates. Then where do the cycles go in the shared-cache scheme?


Breakdown of Normalized Total Cycles


Design Complexity of PBDC


More References on Cache Coherence Protocol Verification

Fong Pong, Michel Dubois, "Formal verification of complex coherence protocols using symbolic state models," Journal of the ACM (JACM), vol. 45, no. 4, pp. 557-587, July 1998.

Fong Pong, Michel Dubois, "Verification techniques for cache coherence protocols," ACM Computing Surveys (CSUR), vol. 29, no. 1, pp. 82-126, March 1997.
