StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache
Hyunjin Lee, Sangyeun Cho, Bruce R. Childers
Dept. of Computer Science, University of Pittsburgh


Page 1: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache

Hyunjin Lee, Sangyeun Cho, Bruce R. Childers

Dept. of Computer Science, University of Pittsburgh

Page 2: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Staggering processor chip yield

[Figure: probability (0% to 35%) vs. number of sound cores (1 to 8) for an 8-core CMP]

IBM Cell initial yield = 14%

Two sources of yield loss
• Physical defects
• Process variations

Smaller device sizes
• Critical defect size shrinks
• Process variations become more severe

As a result, yield is severely limited

Page 3: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Core disabling to rescue

Recent multicore processors employ “core disabling”
• Disable failed cores to salvage sound cores in a chip
• Significant yield improvement
• IBM Cell: 14% → 40% with core disabling of a single faulty core

Yet this approach unnecessarily disables many “good components”
• Ex: AMD Phenom X3 disables L2 cache of faulty cores

But… is it economical?

Page 4: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Core disabling uneconomical

Many unused L2 caches exist
Problem exacerbated with many cores

[Figure, 32-core CMP: probability (0% to 40%) vs. number of sound cores / L2 caches (16 to 32); series: L2 cache, processing logic, core (L2 + proc. logic)]

[Figure, 8-core CMP: probability (0% to 80%) vs. number of sound cores / L2 caches (1 to 8); series: L2 cache, processing logic, core (L2 + proc. logic)]
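The gap between processing-logic yield and L2 cache yield can be illustrated with a simple binomial model. This is only a sketch: it assumes that each core's processing logic and L2 cache fail independently, and the per-component yields `LOGIC_YIELD` and `L2_YIELD` are made-up values, not numbers from the slides.

```python
from math import comb

# Hypothetical per-component yields (assumptions, not data from the talk).
LOGIC_YIELD = 0.80   # probability that a core's processing logic is sound
L2_YIELD = 0.97      # probability that a core's L2 cache is sound


def sound_count_distribution(n, p):
    """P(exactly k of n components are sound) for k = 0..n (binomial model)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


def expected_excess_caches(n, logic_yield, l2_yield):
    """Expected number of sound L2 caches attached to failed processing logic."""
    return n * l2_yield * (1 - logic_yield)


if __name__ == "__main__":
    n = 8
    core_yield = LOGIC_YIELD * L2_YIELD  # a core needs both parts to be sound
    for name, p in [("L2 cache", L2_YIELD),
                    ("processing logic", LOGIC_YIELD),
                    ("core (L2 + proc. logic)", core_yield)]:
        dist = sound_count_distribution(n, p)
        print(name, [round(x, 3) for x in dist])
    print("expected excess caches:",
          expected_excess_caches(n, LOGIC_YIELD, L2_YIELD))
```

Under these assumptions, the expected number of excess caches grows linearly with the core count, which is one way to read "problem exacerbated with many cores".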

Page 5: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

StimulusCache

Basic idea
• Exploit “excess cache” (EC) in a failed core
• Core disabling (yield ↑) + larger cache capacity (performance ↑)

Simple HW architecture extension
• Cache controller has knowledge about EC utilization
• L2 caches are chain-linked using vector tables

Modest OS support
• OS manages the hardware data structures in cache controllers to set up EC utilization policies

Sizable performance improvement

Page 6: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

StimulusCache design issues

Questions
1: How to arrange ECs to give to cores?
2: Where to place data, in ECs or local L2?
3: What HW support is needed?
4: How to flexibly and dynamically allocate ECs?

[Figure: 8-core CMP (Cores 0 to 7) showing excess caches and working cores]

Page 7: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Excess cache utilization policies

[Figure: policy design space. Temporal axis: static vs. dynamic. Spatial axis: private vs. sharing. The space ranges from simple with limited performance to complex with maximized performance.]

Page 8: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Excess cache utilization policies

[Figure: policy design space. Temporal axis: static vs. dynamic. Spatial axis: private vs. sharing.]

Private
• Exclusive to a core
• Performance isolation
• Limited capacity usage

Sharing
• Shared by multiple cores
• Performance interference
• Maximum capacity usage

Page 9: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Excess cache utilization policies

[Figure: policy design space. Temporal axis: static vs. dynamic. Spatial axis: private vs. sharing.]

• Static private
• Static sharing
• Dynamic sharing
• Remaining quadrant: BAD (not evaluated)

Page 10: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Static private policy

[Figure: two ways of dividing four ECs among the L2 caches of Cores 0 to 3: symmetric allocation and asymmetric allocation]
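A minimal sketch of static private allocation, assuming the policy is fully described by how many ECs each core owns. The function name, the EC names, and the skewed split are hypothetical and chosen only for illustration.

```python
def static_private(ec_ids, shares):
    """Split ec_ids among cores; shares[i] is the number of ECs for core i.

    Each EC is owned by exactly one core, which is what makes the policy private.
    """
    if sum(shares) != len(ec_ids):
        raise ValueError("shares must add up to the number of ECs")
    allocation, start = {}, 0
    for core, count in enumerate(shares):
        allocation[core] = ec_ids[start:start + count]
        start += count
    return allocation


ecs = ["EC0", "EC1", "EC2", "EC3"]               # hypothetical EC names
symmetric = static_private(ecs, [1, 1, 1, 1])    # one EC per core
asymmetric = static_private(ecs, [3, 1, 0, 0])   # hypothetical skewed split
```

Because the split is fixed and exclusive, one core's traffic cannot evict another core's blocks from an EC, at the cost of leaving capacity idle when its owner does not need it.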

Page 11: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Static sharing policy

All ECs shared by all cores

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory]
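The shared chain can be modeled with a toy simulation. This is a sketch under several assumptions that the slides do not state: caches are fully associative with LRU replacement, a block that hits in an EC is promoted to the requesting core's L2, and victims cascade L2 → EC #1 → EC #2 → main memory. The class and function names are hypothetical.

```python
from collections import OrderedDict


class LRU:
    """Tiny fully associative LRU cache holding block addresses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def remove(self, addr):
        """Remove addr if present; return True on a hit."""
        return self.blocks.pop(addr, None) is not None

    def insert(self, addr):
        """Insert addr as most recently used; return the evicted block or None."""
        self.blocks[addr] = True
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            victim, _ = self.blocks.popitem(last=False)
            return victim
        return None


def access(l2, shared_ecs, addr):
    """Return where addr was found: 'L2', 'EC#k', or 'memory'."""
    if addr in l2.blocks:
        l2.blocks.move_to_end(addr)
        return "L2"
    found = "memory"
    for k, ec in enumerate(shared_ecs, start=1):
        if ec.remove(addr):           # hit in a shared EC: promote to local L2
            found = f"EC#{k}"
            break
    victim = l2.insert(addr)
    for ec in shared_ecs:             # victims cascade down the shared chain
        if victim is None:
            break
        victim = ec.insert(victim)
    return found


l2_a, l2_b = LRU(1), LRU(1)           # two cores with one-block L2 caches
ecs = [LRU(1), LRU(1)]                # EC #1 and EC #2, shared by both cores
trace = [access(l2_a, ecs, 1), access(l2_a, ecs, 2), access(l2_a, ecs, 1),
         access(l2_b, ecs, 3), access(l2_b, ecs, 4), access(l2_a, ecs, 2)]
```

In this trace, core B's eviction pushes core A's block from EC #1 down to EC #2, a small instance of the capacity interference that sharing can cause.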

Page 12: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Flow-in#N: number of data blocks flowing into EC#N
Hit#N: number of data blocks that hit at EC#N

[Figure: L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1, Flow-in#2, Hits#1, and Hits#2]

Page 13: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Hit/Flow-in ↑ → more ECs
Hit/Flow-in ↓ → fewer ECs

[Figure: L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1, Flow-in#2, Hits#1, and Hits#2]

Page 14: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Core 0: at least 1 EC, no harmful effect on EC#2 → allocate 2 ECs

[Figure: Core 0 traffic through EC #1, EC #2, and main memory, annotated with Flow-in#1 and Hits#1]

Page 15: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Core 1: at least 2 ECs → allocate 2 ECs

[Figure: Core 1 traffic through EC #1, EC #2, and main memory, annotated with Flow-in#1, Flow-in#2, and Hits#2]

Page 16: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Core 2: at least 1 EC, harmful effect on EC#2 → allocate 1 EC

[Figure: Core 2 traffic through EC #1, EC #2, and main memory, annotated with Flow-in#1, Hits#1, and Flow-in#2]

Page 17: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Core 3: no benefit with ECs → allocate 0 ECs

[Figure: Core 3 traffic through EC #1, EC #2, and main memory, annotated with Flow-in#1 and Flow-in#2]

Page 18: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing policy

Maximized capacity utilization
Minimized capacity interference

[Figure: number of ECs allocated to Cores 0 to 3: 2, 2, 1, 0]
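One possible reading of the four per-core examples can be written as a small decision rule. This is a sketch, not the authors' algorithm: the threshold `MIN_HIT_RATIO`, the counter values, and the rule itself (use every EC up to the deepest one with a useful Hit/Flow-in ratio, then also grant any deeper EC that receives no blocks from the core) are assumptions chosen to reproduce the 2, 2, 1, 0 outcome.

```python
# Hypothetical threshold: minimum hits per flowed-in block for an EC to be useful.
MIN_HIT_RATIO = 0.1


def ecs_to_allocate(flow_in, hits, min_ratio=MIN_HIT_RATIO):
    """Pick how many chained ECs a core should use.

    flow_in[k] and hits[k] are the core's Flow-in and Hit counters for EC#(k+1).
    """
    deepest_useful = 0
    for k, (f, h) in enumerate(zip(flow_in, hits), start=1):
        if f > 0 and h / f >= min_ratio:
            deepest_useful = k
    if deepest_useful == 0:
        return 0                      # no EC yields enough hits
    allocated = deepest_useful
    for f in flow_in[deepest_useful:]:
        if f != 0:                    # blocks would flow in without hitting
            break
        allocated += 1                # nothing flows in, so sharing is harmless
    return allocated


# Made-up counters as (flow_in, hits) per core, one entry per EC.
counters = {
    0: ([1000, 0], [400, 0]),         # hits at EC#1, nothing reaches EC#2
    1: ([1000, 900], [10, 500]),      # hits only appear at EC#2
    2: ([1000, 700], [300, 5]),       # hits at EC#1, pollutes EC#2
    3: ([1000, 1000], [0, 0]),        # streams through both ECs
}
allocation = {core: ecs_to_allocate(f, h) for core, (f, h) in counters.items()}
print(allocation)                     # {0: 2, 1: 2, 2: 1, 3: 0}
```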

Page 19: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

HW architecture: Vector table

ECAV: Excess Cache Allocation Vector (one bit per core, Core 0 to Core 7), e.g. 1 0 1 1 0 0 0 0

SCV: Shared Core Vector (one bit per core, Core 0 to Core 7), e.g. 0 0 0 0 0 0 0 0

NECP: Next Excess Cache Pointers (a valid bit and a next-core index per entry)

Page 20: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

ECAV: Excess cache allocation vector

Data search support

Core:        0 1 2 3 4 5 6 7
Core 6 ECAV: 1 0 1 1 0 0 0 0

[Figure: 8-core CMP (Cores 0 to 7) showing excess caches and working cores]
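As a small illustration of how an ECAV could drive data search, the sketch below decodes the bit vector into the list of cores whose L2 caches would be probed. The function name is hypothetical, and treating the leftmost bit as Core 0 follows the core labels on the slide.

```python
def ecs_to_search(ecav_bits):
    """Return the core IDs whose L2 caches serve as ECs for this core.

    ecav_bits[i] == 1 means the L2 cache at core i is allocated to this core.
    Index 0 is the leftmost bit on the slide (Core 0).
    """
    return [core for core, bit in enumerate(ecav_bits) if bit]


core6_ecav = [1, 0, 1, 1, 0, 0, 0, 0]   # Core 6's ECAV from the slide
print(ecs_to_search(core6_ecav))        # [0, 2, 3]
```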

Page 21: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

SCV: Shared core vector

Cache coherency support

Core:                     0 1 2 3 4 5 6 7
SCV of Cores 0, 2, and 3: 0 0 0 0 0 0 1 0

[Figure: 8-core CMP (Cores 0 to 7) showing excess caches and working cores]
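The slide's example is consistent with the SCV being the transpose of the working cores' ECAVs: bit 6 is set in the SCVs of Cores 0, 2, and 3 because Core 6's ECAV selects exactly those caches. The sketch below assumes that relationship, which the slides do not state explicitly, and uses a hypothetical function name.

```python
def derive_scvs(ecavs, n_cores=8):
    """Build each EC's SCV from the ECAVs of all working cores.

    ecavs maps a working core ID to its ECAV bit list.
    scvs[e][c] == 1 means working core c may hold data in the EC at core e.
    """
    scvs = {}
    for core, ecav in ecavs.items():
        for ec, bit in enumerate(ecav):
            if bit:
                scvs.setdefault(ec, [0] * n_cores)[core] = 1
    return scvs


# Core 6's ECAV from the slides; ECs at Cores 0, 2, and 3.
scvs = derive_scvs({6: [1, 0, 1, 1, 0, 0, 0, 0]})
```

With such a vector, an EC's controller would know which working cores could be holding copies of its blocks when coherence actions are needed.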

Page 22: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

NECP: Next excess cache pointers

Data promotion/eviction support

Core 6: valid = 1, index = 0 1 1 → EC at Core 3
Core 3: valid = 1, index = 0 1 0 → EC at Core 2
Core 2: valid = 1, index = 0 0 0 → EC at Core 0
Core 0: valid = 0, index = 0 0 0 → main memory

[Figure: 8-core CMP (Cores 0 to 7) showing excess caches and working cores]
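The NECP entries form a linked list, so the path an evicted block takes can be sketched by following the pointers until an invalid entry is reached. The table values come from the slide; the function name, the dictionary layout, and the cycle check are illustrative assumptions.

```python
MAIN_MEMORY = "main memory"

# NECP entries from the slide: core -> (valid bit, next-core index)
necp = {
    6: (1, 0b011),
    3: (1, 0b010),
    2: (1, 0b000),
    0: (0, 0b000),
}


def eviction_path(necp, start_core):
    """Follow NECP entries from start_core until an invalid entry is reached."""
    path, core, seen = [], start_core, set()
    while True:
        if core in seen:
            raise ValueError("NECP chain contains a cycle")
        seen.add(core)
        valid, nxt = necp[core]
        if not valid:                 # an invalid pointer means main memory
            path.append(MAIN_MEMORY)
            return path
        path.append(nxt)
        core = nxt


print(eviction_path(necp, 6))         # [3, 2, 0, 'main memory']
```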

Page 23: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Software support

OS enforces an excess cache utilization policy before programming cache controllers
• Explicit decision by administrator
• Autonomous decision based on system monitoring

OS may program cache controllers
• At system boot-up time
• Before a program starts
• During a program execution

OS may take into account workload characteristics before programming

Page 24: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Experimental setup

Intel ATOM-like cores w/ a 16-stage pipeline @ 2GHz

Memory hierarchy
• L1: 32KB I/D, 1 cycle
• L2: 512KB, 10 cycles
• Main memory: 300 cycles, contention modeled

On-chip network
• Crossbar for 8-core CMP (4 cores + 4 ECs)
• 2D mesh for 32-core CMP (16 cores + 16 ECs)
• Contention modeled

SPEC CPU2006, SPLASH-2, and SPECjbb2005

Page 25: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Static private – single thread

[Figure: performance improvement (-20% to 140%) with 1 EC, 2 ECs, 3 ECs, and 4 ECs for h264ref, hmmer, astar, bzip2, mcf, gcc, gromacs, gamess, soplex, sphinx3, GemsFDTD, and milc, grouped into INT and FP, each with Light, Medium, and Heavy classes]

Page 26: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Static private – multithread

[Figure: performance improvement (-5% to 50%) of static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multithreaded workloads (cholesky, lu)]

Figure callouts
• With “H” workloads
• All “H” workloads
• Without “H” workloads
• More than 40% improvement

Page 27: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Static sharing – multithread

[Figure: performance improvement (-5% to 50%) of static.sharing and static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multithreaded workloads (cholesky, lu)]

Figure callouts
• Capacity interference
• Multithreaded workloads
• Significant improvement

Page 28: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing – multithread

[Figure: performance improvement (-5% to 50%) of dynamic.sharing, static.sharing, and static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multithreaded workloads (cholesky, lu)]

Figure callouts
• Additional improvement
• Capacity interference avoided

Page 29: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Dynamic sharing – individual workloads

StimulusCache is always better than the baseline

[Figure: additional performance improvement (-2% to 20%) of core 0, core 1, core 2, and core 3 for LLLL, LLMM1, LLMM2, LLMM3, LLMM4, LLHH1, LLHH2, LLHH3, LLHH4, MMMM, MMHH1, MMHH2, MMHH3, MMHH4, and HHHH]

Figure callouts
• Significant additional improvement over static sharing
• Minimal degradation relative to static sharing

Page 30: StimulusCache :  Boosting Performance of Chip Multiprocessors with Excess Cache

Conclusions

Processing logic yield vs. L2 cache yield
• A large number of excess L2 caches

StimulusCache
• Core disabling (yield ↑) + larger cache capacity (performance ↑)
• Simple HW architecture extension + modest OS support

Three excess cache utilization policies
• Static private, static sharing, and dynamic sharing

Performance improvement by up to 135%.