INSTITUTE OF COMPUTING TECHNOLOGY DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance Dang Tang, Yungang Bao, Weiwu Hu, Mingyu Chen 2010.1 Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS)



DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Dang Tang, Yungang Bao, Weiwu Hu, Mingyu Chen

2010.1

Institute of Computing Technology (ICT)

Chinese Academy of Sciences (CAS)


The Role of I/O

• I/O is ubiquitous
  • Loading binary files: Disk → Memory
  • Browsing the web, media streaming: Network → Memory
  • …
• I/O is significant
  • Many commercial applications are I/O intensive: databases, etc.


State-of-the-Art I/O Technologies

• I/O bus: 20GB/s
  • PCI-Express 2.0
  • HyperTransport 3.0
  • QuickPath Interconnect
• I/O devices
  • SSD RAID: 1.2GB/s
  • 10GE: 1.25GB/s
  • Fusion-io: 8GB/s, 1M IOPS (2KB random 70/30 read/write mix)


Direct Memory Access (DMA)

• DMA is used for I/O operations in all modern computers
• DMA allows I/O subsystems to access system memory independently of the CPU
• Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


An Example of Disk Read: DMA Receiving Operation

[Figure: CPU, DMA engine, and memory holding the descriptor, driver buffer, and kernel buffer; steps ①, ②, ③]

• Cache access latency: ~20 cycles
• Memory access latency: ~200 cycles
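The gap between the two latencies is the motivation for keeping I/O data on chip. A minimal back-of-the-envelope sketch in Python follows; it uses only the two latency figures from the slide. The 64-byte line size, the function names, and the simple model with no overlap between accesses are assumptions made for illustration.

```python
CACHE_LATENCY = 20     # cycles, figure taken from the slide
MEMORY_LATENCY = 200   # cycles, figure taken from the slide


def average_latency(hit_rate, cache=CACHE_LATENCY, memory=MEMORY_LATENCY):
    """Expected cycles per access when a fraction `hit_rate` hits in cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache + (1.0 - hit_rate) * memory


def buffer_read_cycles(buffer_bytes, hit_rate, line_bytes=64):
    """Cycles for the CPU to read a DMA buffer one cache line at a time.

    Assumes a hypothetical 64-byte line and fully serialized accesses.
    """
    lines = -(-buffer_bytes // line_bytes)  # ceiling division
    return lines * average_latency(hit_rate)


# A 4KB buffer read entirely from memory versus entirely from cache.
print(buffer_read_cycles(4096, hit_rate=0.0))  # 12800.0
print(buffer_read_cycles(4096, hit_rate=1.0))  # 1280.0
```

Under these assumptions, reading a buffer that the DMA engine left in memory costs ten times as many cycles as reading one that is already cached.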


[Figure: CPU, DMA engine, and memory holding the descriptor, driver buffer, and kernel buffer; steps ①, ②, ③]

Direct Cache Access [Ram-ISCA05]

• This is a typical Shared-Cache Scheme

Prefetch-Hint Approach [Kumar-Micro07]


Problems of the Shared-Cache Scheme

• Cache pollution
• Cache thrashing
• Not suitable for other I/O
• Degrades performance when DMA requests are large (>100KB), as in the “Oracle + TPC-H” application

To address this problem in depth, we need to investigate the characteristics of I/O data.


I/O Data vs. CPU Data

[Figure labels: MemCtrl, I/O Data, CPU Data, HMTT, I/O Data + CPU Data]


A Short Ad for HMTT [Bao-Sigmetrics08]

A hardware/software Hybrid Memory Trace Tool

• Supports the DDR2 DIMM interface on multiple platforms
• Collects full-system off-chip memory traces
• Provides traces with semantic information, e.g., virtual address, process ID, I/O operation
• Collects traces of commercial applications, e.g., Oracle, web server

[Figure: The HMTT System]


Characteristics of I/O Data (1)

[Figure: % of memory references to I/O data]

[Figure: % of references by I/O type]


Characteristics of I/O Data (2): I/O request size distribution


Characteristics of I/O Data (3): Sequential access in I/O data

Compared with CPU data, accesses to I/O data are very regular


Characteristics of I/O Data (4): Reuse Distance (RD)

• RD is measured as LRU stack distance

[Figure: worked example of LRU stack distance on a sequence of references to blocks 1-4, and a CDF of RD showing that x% of references have RD <= n]
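Reuse distance measured as LRU stack distance can be computed directly from a block-address trace. The Python sketch below is a minimal illustration, not the tooling used for the slides. It assumes the convention that a re-reference to the most recently used block has distance 1, and it counts first-time references separately as cold misses. The function names and the sample trace are hypothetical.

```python
def lru_stack_distances(trace):
    """Return the LRU stack distance of every reference in `trace`.

    Distance 1 means the block was the most recently used one.
    First-time (cold) references are reported as None.
    """
    stack = []       # most recently used block sits at the end
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)  # the referenced block becomes most recent
    return distances


def rd_cdf(distances, n):
    """Fraction of re-references whose reuse distance is <= n."""
    reused = [d for d in distances if d is not None]
    if not reused:
        return 0.0
    return sum(d <= n for d in reused) / len(reused)


trace = [1, 2, 3, 1, 2, 2, 4, 3]
dists = lru_stack_distances(trace)
print(dists)             # [None, None, None, 3, 3, 1, None, 4]
print(rd_cdf(dists, 3))  # 0.75
```

A point (n, x%) on the CDF then reads as: an LRU cache holding n blocks would capture x% of the re-references in the trace.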


Characteristics of I/O Data (5)

[Figure: access patterns on I/O data, labeled DMA-W / CPU-R, CPU-RW / CPU-RW, and CPU-W / DMA-R]


Rethinking I/O & DMA Operation

• 20~40% of memory references are to I/O data in I/O-intensive applications.
• The characteristics of I/O data differ from those of CPU data:
  • I/O data has an explicit producer-consumer relationship
  • The reuse distance of I/O data is smaller than that of CPU data
  • References to I/O data are primarily sequential

Separating I/O data and CPU data


Separating I/O data and CPU data

[Figure: before separating vs. after separating]



DMA Cache Design Issues

• Write policy
• Cache coherence
• Replacement policy
• Prefetching

[Figure: Dedicated DMA Cache (DDC)]


DMA Cache Design Issues: Write Policy

• Adopt a write-allocate policy
• Both write-back and write-through policies are available


DMA Cache Design Issues: Cache Coherence

[Figure: IO-ESI protocol for the WT policy]

[Figure: IO-MOESI protocol for the WB policy]

The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of state transitions are exchanged.


A Big Issue

How can we prove the correctness of integrating heterogeneous cache coherence protocols in one system?


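A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98]

[Figure: the per-cache states of the DMA cache and the CPU caches are combined into one global state. Example: (O, S, I, …) gives OS+I+, which is legal (√); (M, I, S, …) gives MS+I+, which is a conflict (X)]

[Figure: example global-state transitions among EI+, MI+, and S+I+, with edges labeled R|E, W|*, and R|I]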


Global State Cache Coherence Theorem

Given N (N>1) well-defined cache protocols, they do not conflict if and only if no conflict global state exists in the global state transition machine.

[Figure: global state transition machine over the states S+I+, EI+, I+, MI+, and OS+I+, with edges labeled R|*, W|*, R|I, R|M, and R|E]

5 global states, all conflict-free (√): S+I+, EI*, I*, MI*, OS*I*
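The theorem reduces the correctness question to a search: enumerate every reachable global state and check that none is a conflict state. The Python sketch below illustrates that idea only. The conflict rules are the usual single-writer invariants for MOESI-style protocols, assumed here rather than taken from the slides. The transition table `TOY` is a hypothetical two-cache example and does not reproduce the protocols in the figures.

```python
from collections import deque

EXCLUSIVE_LIKE = {"M", "E"}    # assumed: must be the only valid copy
OWNER_LIKE = {"M", "O", "E"}   # assumed: at most one cache holds any of these


def is_conflict(global_state):
    """True if the tuple of per-cache states violates the assumed invariants."""
    valid = [s for s in global_state if s != "I"]
    owners = [s for s in valid if s in OWNER_LIKE]
    if len(owners) > 1:
        return True
    return any(s in EXCLUSIVE_LIKE for s in valid) and len(valid) > 1


def reachable_conflicts(initial, successors):
    """Breadth-first search over global states; returns conflict states found."""
    seen, queue, bad = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        if is_conflict(state):
            bad.append(state)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return bad


# Hypothetical two-cache transition table, for illustration only.
TOY = {
    ("I", "I"): [("E", "I"), ("I", "E")],
    ("E", "I"): [("M", "I"), ("S", "S"), ("I", "I")],
    ("I", "E"): [("I", "M"), ("S", "S"), ("I", "I")],
    ("M", "I"): [("O", "S"), ("I", "M"), ("I", "I")],
    ("I", "M"): [("S", "O"), ("M", "I"), ("I", "I")],
    ("S", "S"): [("M", "I"), ("I", "M"), ("I", "I")],
    ("O", "S"): [("M", "I"), ("I", "M"), ("I", "I")],
    ("S", "O"): [("M", "I"), ("I", "M"), ("I", "I")],
}

print(is_conflict(("O", "S", "I")))   # False, matches OS+I+ being legal
print(is_conflict(("M", "S", "I")))   # True, matches MS+I+ being a conflict
print(reachable_conflicts(("I", "I"), lambda s: TOY.get(s, [])))  # []
```

An empty result means the search found no reachable conflict state in the toy table. The cited work by Pong and Dubois uses symbolic state models so that the search does not depend on the number of caches; this sketch enumerates concrete states for a fixed number of caches instead.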


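MOESI + ESI

[Figure: global state transition machine for MOESI combined with ESI, over the states S+I+, ECI+, I+, MCI+, EDI+, and OCS+I+ (subscripts C and D: CPU cache and DMA cache), with edges labeled by CPU and DMA reads and writes (RC, RD, WC, WD)]

6 global states, all conflict-free (√): S+I+, ECI*, I*, MCI*, EDI*, OCS*I*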


DMA Cache Design Issues: Replacement Policy

An LRU-like replacement policy

1. Invalid

2. Shared

3. Owned

4. Exclusive

5. Modified
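One way to read the ordered list is as an eviction priority: prefer an Invalid line, then Shared, Owned, Exclusive, and Modified, with LRU breaking ties among lines in the same state. That reading is an assumption, as are the `Line` record and the `choose_victim` helper in this Python sketch.

```python
from dataclasses import dataclass

# Assumed eviction priority: lower rank is evicted first.
EVICTION_RANK = {"I": 1, "S": 2, "O": 3, "E": 4, "M": 5}


@dataclass
class Line:
    tag: int
    state: str       # one of "I", "S", "O", "E", "M"
    last_used: int   # timestamp of the most recent access


def choose_victim(cache_set):
    """Index of the line to evict: lowest state rank, then least recently used."""
    return min(
        range(len(cache_set)),
        key=lambda i: (EVICTION_RANK[cache_set[i].state], cache_set[i].last_used),
    )


ways = [Line(1, "M", 0), Line(2, "S", 5), Line(3, "S", 2), Line(4, "E", 1)]
print(choose_victim(ways))  # 2: the older of the two Shared lines
```

Under this reading, a Modified line is evicted last even when it is the least recently used, which avoids write-backs when a cheaper victim exists.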


DMA Cache Design Issues: Prefetching

• Adopt straightforward sequential prefetching
• Prefetching is triggered by a cache miss
• Fetch 4 blocks at a time
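Miss-triggered sequential prefetching can be sketched in a few lines of Python. The sketch assumes that "4 blocks" means the missing block plus the next three. It also ignores capacity and replacement, so it shows only the prefetch trigger. The class name and its fields are hypothetical.

```python
class SequentialPrefetchCache:
    """Toy cache with unbounded capacity and miss-triggered sequential prefetch."""

    def __init__(self, degree=4):
        self.degree = degree   # blocks fetched per miss, including the missing one
        self.resident = set()
        self.hits = 0
        self.misses = 0

    def access(self, block):
        """Access one block number; returns True on a hit."""
        if block in self.resident:
            self.hits += 1
            return True
        self.misses += 1
        # On a miss, fetch the missing block and the blocks that follow it.
        self.resident.update(range(block, block + self.degree))
        return False


cache = SequentialPrefetchCache()
for block in range(16):   # a purely sequential DMA buffer scan
    cache.access(block)
print(cache.misses, cache.hits)  # 4 12
```

On a purely sequential scan, every prefetched block is used, which is the behavior a regular I/O access pattern favors.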


Design Complexity vs. Design Cost

• Dedicated DMA Cache (DDC)
• Partition-Based DMA Cache (PBDC)



Speedup of Dedicated DMA Cache


% of Valid Prefetched Blocks

DMA caches exhibit impressively high prefetching accuracy, because I/O data has a very regular access pattern.


Performance Comparisons

Although PBDC does not require additional on-chip storage, it achieves about 80% of DDC’s performance improvement.



Conclusions

• We have proposed a DMA cache technique to separate I/O data and CPU data
• We adopt a global state method for integrating heterogeneous cache protocols
• Experimental results show that DMA cache schemes are better than the existing approaches that use unified, shared caches for I/O data and CPU data

Still-open problems, e.g.,

• Can I/O data go directly to the L1 cache?
• How to design heterogeneous caches for different types of data?
• How to optimize the memory controller (MC) with awareness of I/O?

Thanks!

Questions?


RTL Emulation Platform

• LLC and DMA cache model from Loongson-2F
• DDR2 memory controller from Loongson-2F
• DDR2 DIMM model from Micron Technology

[Figure: platform block diagram with memory trace, LL cache, DMA cache, MemCtrl, and DDR2 DIMM]


Parameters

DDR2-666


Normalized Speedup for WB

• The baseline is the snoop cache scheme
• DMA cache schemes exhibit better performance than the others


DMA Write & CPU Read Hit Rate

• Both the shared cache and the DMA cache exhibit high hit rates
• So where do the cycles go in the shared-cache scheme?


Breakdown of Normalized Total Cycles


Design Complexity of PBDC


More References on Cache Coherence Protocol Verification

Fong Pong, Michel Dubois, Formal verification of complex coherence protocols using symbolic state models, Journal of the ACM (JACM), v.45 n.4, p.557-587, July 1998

Fong Pong, Michel Dubois, Verification techniques for cache coherence protocols, ACM Computing Surveys (CSUR), v.29 n.1, p.82-126, March 1997