Configurational Workload Characterization

Hashem H. Najaf-abadi

Eric Rotenberg

Heterogeneity

[Figure: Program 1 and Program 2 running on processors in three arrangements: a single core, multiple cores, and heterogeneous cores.]

Heterogeneous CMP Design

Must determine:

1) Best processor configuration for a group of workloads.

2) Best way to group workloads together.

The Challenge:

Communal Customization

[Figure: workloads A through N in the workload space, mapped to the best core configurations, Core 1 and Core 2.]

Existing Approaches

• Regression models: Enable speedy exploration.

• Subsetting: Reduce workloads to a representative subset based on characteristics.

The Argument

• Subsetting isn’t a valid substitute or facilitator for communal customization.

• Reason: complex interdependencies between different architectural units.

Ties that bind

1) The global clock intertwines the sizing of different architectural units.

2) The burden of compromise in one unit can be passed on to another.

Example: The Global Clock

[Figure: delay of the issue queue (solid line) and access delay of the cache (dashed line), fitted into pipeline stages at clock periods of 1 ns and 0.66 ns. Annotations: slack, less slack, pipeline too deep, small issue queue, needlessly large cache.]

Example: The Global Clock

The clock period, issue-queue size, and cache size cannot be optimized independently of one another.
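The coupling can be illustrated with a minimal Python sketch. Only the 1 ns and 0.66 ns clock periods come from the example; the unit delays (0.9 ns for the issue queue, 1.3 ns for the cache) and the helper names are hypothetical, chosen purely for illustration.

```python
import math

def stages_and_slack(delay_ns, clock_ns):
    """Return (pipeline stages needed, unused time left over) for one unit."""
    # The small epsilon guards against float noise when the delay is an
    # exact multiple of the clock period.
    stages = math.ceil(delay_ns / clock_ns - 1e-9)
    slack = stages * clock_ns - delay_ns
    return stages, slack

# Hypothetical unit delays in nanoseconds (not taken from the slides).
UNIT_DELAYS_NS = {"issue queue": 0.9, "cache": 1.3}

def report(clock_ns):
    """Stages and slack for every unit at a given clock period."""
    return {unit: stages_and_slack(delay, clock_ns)
            for unit, delay in UNIT_DELAYS_NS.items()}

if __name__ == "__main__":
    for clock in (1.0, 0.66):
        print(clock, report(clock))
```

With these assumed delays, a 1 ns clock fits the issue queue snugly but leaves the cache with a large slack, while a 0.66 ns clock does the opposite. Removing the slack means resizing one unit, which is why the three parameters have to be chosen together.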


Example: Passing on the Burden

[Figure: profiles of three workloads, α, β, and γ, each rated on five characteristics, all normalized to a scale of 0~10:

A) Working-set size
B) Branch predictability
C) Density of dependence chains
D) Frequency of loads
E) Frequency of conditional branches]

[Figure: the customized architectures for workloads α, β, and γ, showing the speed of the core and of the cache for each on a scale from L to H.]

A More Accurate Solution

• Represent workloads by their customized architectural configurations.

• Allows for direct and accurate evaluation of how well different workloads perform on customized configurations.

• We call this Configurational Workload Characterization.

Design Process Overview

How not to do it:

1) Start with the important workloads.
2) Select representative workloads based on workload behavior.
3) Search for the optimal core combination using the representative workloads.

How to do it:

1) Start with the important workloads.
2) Customize a core for each workload (configurational characterization), yielding the customized architectures.
3) Search for the optimal core combination among the customized architectures.

Pros & Cons

- more costly to determine

+ provides a better design solution

+ provides a systematic approach

+ can be performed prior to the design phase that is critical for time-to-market

XP-SCALAR

• A superscalar design-space exploration framework

• www4.ncsu.edu/~hhashem/xpscalar.htm

• Uses SimpleScalar to perform cycle-accurate simulations

• Uses the CACTI model to approximate the access latency of the different units

XP-SCALAR

What parameters are varied:

• Clock period
• Processor width
• Size of the issue queue
• Size of the register file
• Size of the load-store queue
• Sizes of the L1 and L2 caches

XP-SCALAR

How they are varied:

a) The clock period is varied, and the architecture parameters are adjusted to make latencies fit within pipeline stages.

b) The number of pipeline stages of a unit is varied, and its configuration is adjusted accordingly.
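Both ways of varying the parameters amount to fitting a unit's latency into a time budget of clock period times pipeline stages. The sketch below is a rough illustration in Python, not XP-SCALAR's actual code. The latency table is a hypothetical stand-in for the output of a CACTI-style model, and the function names are invented for this example.

```python
# Hypothetical access latencies (ns) keyed by cache size (KB).
# Illustrative numbers only; they are not CACTI output.
CACHE_LATENCY_NS = {8: 0.55, 16: 0.70, 32: 0.95, 64: 1.30, 128: 1.80}

def largest_fitting_size(latency_by_size, clock_ns, stages):
    """Largest size whose latency fits in `stages` pipeline stages, or None."""
    budget_ns = clock_ns * stages
    fitting = [size for size, latency in latency_by_size.items()
               if latency <= budget_ns + 1e-12]
    return max(fitting) if fitting else None

def sweep(latency_by_size, clock_periods_ns, stage_counts):
    """Approach (a) varies the clock period, approach (b) the stage count."""
    return {(clock, stages): largest_fitting_size(latency_by_size, clock, stages)
            for clock in clock_periods_ns
            for stages in stage_counts}

if __name__ == "__main__":
    for key, size in sweep(CACHE_LATENCY_NS, (1.0, 0.66), (1, 2)).items():
        print(key, size)
```

Under these assumed latencies, shortening the clock from 1 ns to 0.66 ns at one stage shrinks the largest fitting cache from 32 KB to 8 KB, while adding a second stage at 0.66 ns allows 64 KB.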

Determining the Best Cores

• Execute all benchmarks on each other's customized configurations.

• From that, determine best grouping through a complete search.
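The two steps above can be sketched in Python as follows. The cross-performance numbers and workload names (w1, w2, w3) are hypothetical. The sketch also assumes that each benchmark runs on whichever chosen core serves it best, which the slides do not state explicitly.

```python
from itertools import combinations
from statistics import fmean, harmonic_mean

# IPT[b][c]: hypothetical performance of benchmark b on the core that was
# customized for benchmark c (illustrative numbers, not measured results).
IPT = {
    "w1": {"w1": 3.0, "w2": 2.0, "w3": 1.0},
    "w2": {"w1": 1.6, "w2": 2.5, "w3": 1.2},
    "w3": {"w1": 0.5, "w2": 0.8, "w3": 2.0},
}

def score(ipt, cores, mean):
    """Aggregate performance when every benchmark uses its best chosen core."""
    return mean([max(row[c] for c in cores) for row in ipt.values()])

def best_cores(ipt, k, mean=fmean):
    """Complete search over all k-subsets of the customized configurations."""
    configs = sorted(ipt)
    return max(combinations(configs, k),
               key=lambda cores: score(ipt, cores, mean))

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, best_cores(IPT, k), best_cores(IPT, k, harmonic_mean))
```

With this made-up matrix, the best pair under the arithmetic mean is (w1, w3), while the harmonic mean prefers (w2, w3). The choice of metric can therefore change which cores are selected. The search enumerates every subset, so it is only practical for a modest number of benchmarks.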

Best Core Results

core selection                                     customized core(s)          avg. IPT  har. IPT
best config for avg. & har. IPT                    gcc                         2.06      1.57
2 best configs for avg. IPT                        parser, twolf               2.27      1.76
2 best configs for har. IPT                        gcc, mcf                    2.12      1.88
3 best configs for avg. IPT                        crafty, parser, twolf       2.35      1.82
3 best configs for har. IPT                        crafty, mcf, twolf          2.27      2.05
4 best configs for avg. & har. IPT                 crafty, mcf, parser, twolf  2.32      2.08
each benchmark on its own customized architecture  -                           2.38      2.12

The effect of subsetting

• Subsetting of a single pair of benchmarks results in the extraction of a totally different set of best cores.

Representation

• Dendrograms are

Conclusions

• There are interdependencies between architectural units in how they are customized.

• In the design of a heterogeneous CMP, subsetting can lead to performance degradation.