Ioana Burcea*
Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
Predictor Virtualization
*University of Toronto
Canada
§Carnegie Mellon University
#École Polytechnique Fédérale de Lausanne
ASPLOS XIII
March 4, 2008
Ioana Burcea · Predictor Virtualization · University of Toronto
Why Predictors? History Repeats Itself
[Diagram: predictors surround the CPU: branch prediction, prefetching, value prediction, pointer caching, cache replacement]
Application footprints grow → predictors need to scale to remain effective
Extra Resources: CMPs With Large On-Chip Caches
[Diagram: 4-core CMP, each core with its own I$/D$, sharing an L2 cache of 10s – 100s of MB, backed by main memory]
Predictor Virtualization
[Diagram: the same CMP; predictor tables live in the physical memory address space and are cached by the shared L2]
Predictor Virtualization (PV)
Emulate large predictor tables
Reduce the resources dedicated to predictor tables
Research Contributions
PV: predictor metadata stored in the conventional cache hierarchy
Benefits: emulate larger tables → increased accuracy; fewer dedicated resources
Why now? Large caches, CMPs, and the need for larger predictors
Will this work? Metadata locality is intrinsically exploited by caches
First step – a virtualized data prefetcher: performance within 1% on average; space reduced from 60KB to under 1KB
Talk Road Map
PV architecture
PV in action: virtualized "Spatial Memory Streaming" [ISCA 06]*
Conclusions
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming"
PV Architecture
[Diagram: CPU with I$/D$ above an L2 cache and main memory; a predictor table serves request/prediction pairs through an optimization engine. Under PV, the predictor table becomes a PVTable in the physical memory address space, located via a PVStart register; a small PVCache, managed by a PVProxy, serves indexed lookups through the L2 cache]
PV: Variable Prediction Latency
[Diagram: a metadata lookup hits the PVCache in the common case, falls back to the L2 cache infrequently, and goes to main memory only rarely]
Metadata Locality
Entry reuse:
Temporal – one entry is used for multiple predictions
Spatial – can be engineered: one miss is overcome by several subsequent hits
Metadata access patterns are predictable → predictor metadata prefetching
Spatial Memory Streaming [ISCA 06]*
[Diagram: memory regions and their spatial access patterns, e.g. 1100000001101…, 1100001010001…]
Spatial patterns are stored in a pattern history table (PHT)
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming"
Virtualizing "Spatial Memory Streaming" (SMS)
[Diagram: the data access stream feeds a detector that records spatial patterns; the predictor issues prefetches on trigger accesses. The ~60KB of pattern storage is virtualized, leaving ~1KB of dedicated state]
Virtualizing SMS
Virtual table: 1K sets × 11 ways; PVCache: 8 sets × 11 ways
Each table set maps to one 64-byte cache block: 11 entries of (11-bit tag, 32-bit pattern), with 39 bits unused
Current Implementation
Non-intrusive:
Virtual table stored in reserved physical address space
One table per core
Caches oblivious to metadata
Options:
Predictor tables stored in virtual memory
Single, shared table per application
Caches aware of metadata
Simulation Infrastructure
SimFlex
Full-system simulator based on Simics
Base processor configuration
4-core CMP
8-wide OoO
256-entry ROB
L1D/L1I 64KB 4-way set-associative
UL2 8MB 16-way set-associative
Commercial workloads
TPC-C: DB2 and Oracle
TPC-H: Query 1, Query 2, Query 16, Query 17
SpecWeb: Apache and Zeus
Original Prefetcher – Accuracy vs. Predictor Size
[Chart: L1 read misses (covered, uncovered, overpredictions; lower is better) for Apache, Oracle, and TPC-H Query 17, for PHT configurations from infinite down to 8 sets × 11 ways]
Small Tables Diminish Prefetching Accuracy
Virtualized Prefetcher – Performance
[Chart: speedup (0%–70%; higher is better) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, and Qry 17; bars for the original prefetcher with 1K, 16, and 8 sets, and the virtualized prefetcher with 8 sets]
Hardware cost: original prefetcher ~60KB vs. virtualized prefetcher < 1KB
Impact on L2 Memory Requests
[Chart: increase in L2 memory requests (0%–40%; lower is better) for Apache, Oracle, and Qry 17 with PV at 8 sets]
Dark side: increased L2 memory requests
Impact of Virtualization on Off-Chip Bandwidth
[Chart: off-chip bandwidth increase (0%–5%; lower is better) for Apache, Qry 17, and Oracle, split into application vs. PV L2 misses and write-backs, annotated with their direct and indirect impact on performance]
Minimal Impact on Off-Chip Bandwidth
Conclusions
Predictor Virtualization: metadata stored in the conventional cache hierarchy
Benefits: emulate larger tables → increased accuracy; fewer dedicated resources
First step – a virtualized data prefetcher: performance within 1% on average; space reduced from 60KB to under 1KB
Opportunities: metadata sharing and persistence; application-directed prediction; predictor adaptation
PV – Motivating Trends
Larger predictor tables bring increased performance, but with diminishing returns: dedicating resources to predictors is hard to justify
Chip multiprocessors: the space dedicated to predictors scales with the number of processors
Memory hierarchies offer the opportunity: increased capacity
→ Use conventional memory hierarchies to store predictor metadata
Virtualizing the Predictor Table
[Diagram: the PC and trigger-access address form a tag and index into the pattern history table; the matching entry's spatial pattern (e.g. 1 0 1 0 1 1 1 0) drives prefetches]
Virtualize: the PHT is stored in the physical address space, and multiple PHT entries are packed into one memory block, so one memory request brings in an entire table set
Packing Entries in One Cache Block
Index: PC + offset within the spatial group; PC → 16 bits; 32 blocks in a spatial group → 5-bit offset → 21-bit index; 32-bit spatial pattern per entry
Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
Cache block: 64 bytes → 11 entries per cache block → pattern table is 1K sets, 11-way set-associative
[Diagram: block layout – (11-bit tag, 32-bit pattern) entries packed back-to-back at bit offsets 0, 11, 43, 54, …, with the final 39 bits unused]
Memory Address Calculation
[Diagram: the 16-bit PC and 5-bit spatial offset form a 21-bit index; 10 of its bits select the table set and the remaining 11 bits form the tag. The set index, followed by six zero bits of block offset (000000), is added to the PV start address to produce the memory address]
Increase in Off-Chip Bandwidth – Different L2 Sizes
[Chart: off-chip bandwidth increase (0%–45%), split into L2 misses and write-backs, for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, and Qry 17, each with 2MB, 4MB, and 8MB L2 caches]
Increased L2 Latency
[Chart: speedup (0%–60%) comparing SMS with a dedicated 1K-set table (SMS - 1K) against SMS virtualized with an 8-set PVCache (SMS - PV8)]
Conclusions
PV: metadata stored in the conventional cache hierarchy
Benefits: fewer dedicated resources; emulate larger tables → increased accuracy
Example – a virtualized data prefetcher: performance within 1% on average; space reduced from 60KB to under 1KB
Why now? Large caches, CMPs, and the need for larger predictors
Will this work? Metadata locality is intrinsically exploited by caches; metadata access patterns are predictable
Opportunities: metadata sharing and persistence; application-directed prediction; predictor adaptation