Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems

Andrew B. Kahng and Siddhartha Nath, VLSI CAD Laboratory, UC San Diego


Page 1: Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems

Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems

Andrew B. Kahng and Siddhartha Nath

VLSI CAD LABORATORY, UC San Diego

Page 2: Outline

Motivation
Previous Work
Our Work
Problem Formulation
Optimal (Discretized) Solution Flow
Results
Conclusions

Page 3: Reliability in Multicore Systems

Modern multicore processors operate in multiple operating modes
  – E.g., nominal, supply voltage scaling, turbo, etc.
Reliability is a key processor design consideration at leading-edge technology nodes to guarantee a prescribed system lifetime
Task scheduling affects how cores are used
  – A subset of cores can fail before others

Page 4: Scheduling in Multicore Systems

Scheduler packs tasks using some or all of the available processing cores

[Figure: #Cores vs. Time for Application A and Application B; %Active Time vs. #Active cores (1 to 8) for Application A, Application B, and Packed A, B]

Page 5: Core Wearout

Mean time to failure (MTTF) is a measure of the lifetime of a core
Reliability mechanisms degrade the MTTF of a core
  – E.g., electromigration (EM), stress migration, hot carrier injection, bias temperature instability, etc.
When not all cores are simultaneously active
  – Adjust task scheduling on a subset of active cores for balanced wearout

Page 6: Impact of Overdrive Frequency

Frequency due to overclocking the cores to meet performance and throughput requirements
Overdrive frequencies cause faster MTTF degradation
Two challenges
  – Can violate “acceptable throughput” for tasks
      Cores fail before all assigned tasks are completed
  – Can violate minimum “acceptable performance” for tasks
      Cores operate at lower frequencies

Page 7: Terminology

Power-on-hours
  – Effective number of lifetime hours consumed
  – Measure of a core’s lifetime degradation due to operating conditions, e.g., temperature, frequency
Nominal temperature
  – Temperature at which MTTF degradation is the same as the number of hours a core is active
Acceleration factor (AF)
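
As a rough illustration of how an acceleration factor could turn active hours into consumed lifetime, the sketch below assumes that power-on-hours equal AF times the hours a core is active, with AF = 1 at nominal temperature. The multiplicative model and the function name `effective_hours` are assumptions made for illustration; the slides do not spell out the AF definition. The value 9.77 is the AF the slides report from HotSpot simulations at 3GHz.

```python
def effective_hours(active_hours: float, af: float) -> float:
    """Lifetime hours consumed, ASSUMING wearout scales linearly with AF."""
    return active_hours * af

# 1000 active hours at nominal conditions (AF = 1) vs. at overdrive (AF = 9.77).
nominal = effective_hours(1000, 1.0)
overdrive = effective_hours(1000, 9.77)
print(nominal, overdrive)
```

Under this assumed model, one hour at overdrive costs as much lifetime as 9.77 hours at nominal conditions, which is why overdrive time has to be budgeted against the remaining MTTF.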


Page 9: Classification of Existing Works

Work           Type
Reiss12        NRC, NLG, NPG
Karpuzcu09     RC, NLG, NPG
Mihic04        RC, LG (Dynamic power management), NPG
Rosing07       RC, LG (Dynamic power management), NPG
Rong06         RC, LG (Dynamic power management), NPG
Coskun09       RC, LG (Dynamic thermal management), NPG
Srinivasan04   RC, LG (Dynamic reliability management), NPG
Karl08         RC, LG (Dynamic reliability management), NPG

(N)RC – (Non-) Reliability Constrained
(N)LG – (No) Lifetime Guarantee
(N)PG – (No) Performance Guarantee

Page 10: Counterexample to NRC Policies

Task schedule
Max frequency = 3GHz
Min acceptable frequency = 1.8GHz
Initial lifetime = 7 years (61320h)

#Active cores (m)   Nominal execution time (AF = 1)   Overdrive execution time (AF = 9.77)
1                   1000h                             3000h
2                   2000h                             5000h
3                   3000h                             8000h
4                   2000h                             5000h

All cores always operate at 3GHz
  – From HotSpot simulations, AF = 9.77
Lifetime after nominal tasks requiring m = 3 is 24947.5h
  – Tasks requiring m = 3 cannot complete overdrive execution
  – Tasks requiring m = 4 cannot complete at all
Cannot guarantee “acceptable throughput”!!!

Page 11: Counterexample to RC-LG Policies

Task schedule
Max frequency = 3GHz
Min acceptable frequency = 1.8GHz
Initial lifetime = 61320h

#Active cores (m)   Nominal execution time (AF = 1)   Overdrive execution time (AF = 9.77)
1                   1000h                             3000h
2                   2000h                             5000h
3                   3000h                             8000h
4                   2000h                             5000h

All cores operate initially at 3GHz, and then at 1.6GHz
  – From HotSpot simulations, AF = 9.77
All tasks are completed, but
  – Tasks requiring m = 3, 4 operate at 1.6GHz < 1.8GHz (acceptable performance)!!!
Cannot guarantee “acceptable performance”!!!


Page 13: What Do We Do Differently?

We formulate a new Maximum-Value Reliability-Constrained Overdrive Frequencies (MVRCOF) optimization (offline) problem
Important because
  – Overdrive frequencies are our optimization variables
  – User experience is the value
We guarantee prescribed levels of “acceptable performance” and “acceptable throughput”

Page 14: Comparison of Ours vs. Existing Works

Work           Type
Reiss12        NRC, NLG, NPG
Karpuzcu09     RC, NLG, NPG
Mihic04        RC, LG (Dynamic power management), NPG
Rosing07       RC, LG (Dynamic power management), NPG
Rong06         RC, LG (Dynamic power management), NPG
Coskun09       RC, LG (Dynamic thermal management), NPG
Srinivasan04   RC, LG (Dynamic reliability management), NPG
Karl08         RC, LG (Dynamic reliability management), NPG
Our Work       RC, LG (Dynamic reliability management), PG

(N)RC – (Non-) Reliability Constrained
(N)LG – (No) Lifetime Guarantee
(N)PG – (No) Performance Guarantee

Page 15: What is the Optimal Solution?

Task schedule
Max frequency = 3GHz
Min acceptable frequency = 1.8GHz
Initial lifetime = 61320h

#Active cores (m)   Nominal execution time (AF = 1)   Overdrive execution time (AF = 9.77)
1                   1000h                             3000h
2                   2000h                             5000h
3                   3000h                             8000h
4                   2000h                             5000h

Optimal (discretized) solution from exhaustive search

#Active cores (m)   Nominal frequency   Overdrive frequency
1                   1.5GHz              2.85GHz
2                   1.5GHz              2.3GHz
3                   1.5GHz              1.8GHz
4                   1.5GHz              1.8GHz

We guarantee both “acceptable performance” and “acceptable throughput” if a solution exists!!!

Page 16: Our Key Contributions

We develop a new MVRCOF formulation to maximize the value of operating multiple cores at overdrive frequencies
Our solutions provide guarantees for prescribed lower bounds on “acceptable performance” and “acceptable throughput”
We propose an optimal (discretized) solution using exhaustive search as well as an approximate heuristic flow
Our solutions determine optimal overdrive frequencies as well as execution times for each active core
We empirically determine that our optimal solutions improve the objective function value by up to 17.4% versus existing works


Page 18: Formulation

\[
\text{Maximize} \sum_{m=1}^{N} \left( w_{OD,m} \cdot f_{OD,m} \cdot E_{OD,m} + w_{nom,m} \cdot f_{nom,m} \cdot E_{nom,m} \right)
\]

Page 19: Formulation in English

\[
\text{Maximize} \sum_{m=1}^{N} \left( w_{OD,m} \cdot f_{OD,m} \cdot E_{OD,m} + w_{nom,m} \cdot f_{nom,m} \cdot E_{nom,m} \right)
\]

The value of operating at overdrive frequencies ($f_{OD,m}$) described by weights ($w_{OD,m}$) and the duration ($E_{OD,m}$)

The value of operating at nominal frequencies ($f_{nom,m}$) described by weights ($w_{nom,m}$) and the duration ($E_{nom,m}$)
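
To make the objective concrete, the sketch below evaluates the sum for the four-row task schedule and the discretized optimal frequencies shown in the earlier example. The function name `objective` is hypothetical, and the weight values are assumptions chosen only so the example runs; the slides do not state which weights belong to that schedule.

```python
def objective(w_od, f_od, e_od, w_nom, f_nom, e_nom):
    """Sum over m of w_OD*f_OD*E_OD + w_nom*f_nom*E_nom."""
    return sum(
        wo * fo * eo + wn * fn * en
        for wo, fo, eo, wn, fn, en in zip(w_od, f_od, e_od, w_nom, f_nom, e_nom)
    )

# One entry per number of active cores, m = 1..4.
f_od = [2.85, 2.3, 1.8, 1.8]      # GHz, overdrive frequencies from the example
f_nom = [1.5, 1.5, 1.5, 1.5]      # GHz, nominal frequencies from the example
e_od = [3000, 5000, 8000, 5000]   # hours in overdrive mode
e_nom = [1000, 2000, 3000, 2000]  # hours in nominal mode
w_od = [0.5, 0.3, 0.2, 0.4]       # ASSUMED weights, for illustration only
w_nom = [0.5, 0.7, 0.8, 0.6]      # ASSUMED weights, for illustration only

value = objective(w_od, f_od, e_od, w_nom, f_nom, e_nom)
print(value)  # in GHz * hours
```

Because each term is a product of weight, frequency and duration, raising an overdrive frequency pays off most for the rows with a large weight-times-duration product.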

Page 20: Formulation in English

Guarantees minimum “acceptable performance” and upper bounded by the maximum achievable frequency ($f_{max}$)

Guarantees “acceptable throughput”, i.e., all tasks complete within lifetime and cores wear out in a balanced manner

Upper bound on instantaneous power dissipated by any core

Upper bound on instantaneous temperature of all active cores

Page 21: MVRCOF Inputs: Task Description

[Diagram: App 1, App 2, ..., App X; Scheduler]

$E_{l,m}$: Execution times in nominal and overdrive modes with different numbers of active cores
$w_{l,m}$: Weights in nominal and overdrive modes with different numbers of active cores
$f_{nom,m}$: Nominal frequencies at different numbers of active cores

Page 22: MVRCOF Inputs: System Description

[Diagram: SoC Designer]

$N$: Number of available symmetric cores
$P_{max}$: Maximum power of any core
$f_{max}$: Maximum frequency of any core
$T_{max}$: Maximum die temperature
$T_{nom}$: Nominal temperature
MTTF: Initial MTTF of any core

Page 23: MVRCOF Outputs

[Diagram: MVRCOF solver]

$f_{OD,m}$: Optimal overdrive frequencies for each set of active cores
$v_{j,m,l}$: % execution time in each combination of the active cores
  – E.g., in a system with three available cores, two cores can be active in $\binom{3}{2} = 3$ ways
$u_{i,l}$: % lifetime each core operates at nominal and overdrive modes
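
The combinations of active cores can be enumerated directly. The sketch below lists every way to pick two active cores out of three; the core labels and the helper name `active_sets` are illustrative choices, not part of the slides.

```python
from itertools import combinations

def active_sets(cores, m):
    """All ways to choose m simultaneously active cores from the available ones."""
    return list(combinations(cores, m))

cores = ["A", "B", "C"]
pairs = active_sets(cores, 2)
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Each tuple corresponds to one combination index for a given number of active cores.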

Page 24: MVRCOF Inputs and Outputs

Task Description (Scheduler; App 1, App 2, ..., App X): $E_{l,m}$, $w_{l,m}$, $f_{nom,m}$
System Description (SoC Designer): $N$, $P_{max}$, $f_{max}$, $T_{max}$, $T_{nom}$, MTTF
MVRCOF solver
Outputs: $f_{OD,m}$, $v_{j,m,l}$, $u_{i,l}$


Page 26: Optimal (Discretized) Solution Flow

For each core
  – For each combination in which the core is active
      Choose discrete values of overdrive frequencies within a range
      Perform power and temperature simulations
      Create a one-time LUT
  – Example: If a system has 3 cores (Core A, B, C), the number of active cores can be 1, 2 or 3
      Core A is active
      – One (out of three) combination when m = 1; two (out of three) combinations when m = 2; one (out of one) combination when m = 3
Perform exhaustive search using the LUT for optimal overdrive frequencies that maximize the value of the objective function
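
A heavily simplified sketch of the exhaustive search step is given below. It assumes a single shared lifetime budget and a frequency-to-AF lookup table, whereas the flow described here builds the LUT per core and per combination from power and thermal simulations. The function name, the AF values at 1.8GHz and 2.4GHz, the weights and the budget are all hypothetical; only AF = 9.77 at 3GHz comes from the slides.

```python
from itertools import product

def exhaustive_search(freqs, af_lut, e_od, w_od, budget):
    """Try every frequency assignment and keep the best feasible one."""
    best_value, best_choice = None, None
    for choice in product(freqs, repeat=len(e_od)):
        # ASSUMED feasibility check: AF-scaled overdrive hours fit in one budget.
        consumed = sum(af_lut[f] * e for f, e in zip(choice, e_od))
        if consumed > budget:
            continue
        # Overdrive part of the objective: sum of w_OD * f_OD * E_OD.
        value = sum(w * f * e for w, f, e in zip(w_od, choice, e_od))
        if best_value is None or value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice

freqs = [1.8, 2.4, 3.0]                 # GHz, discretized candidates
af_lut = {1.8: 1.0, 2.4: 3.0, 3.0: 9.77}  # 1.8 and 2.4 entries are made up
best_value, best_choice = exhaustive_search(
    freqs, af_lut, e_od=[3000, 5000], w_od=[0.5, 0.3], budget=30000
)
print(best_choice, best_value)
```

The search cost grows as the number of candidate frequencies raised to the number of active-core counts, which is why the one-time LUT matters: each candidate is scored by table lookup rather than by a fresh simulation.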

Page 27: Heuristic Flow

We maximize the overdrive frequency in the order of the set of active cores for which the product of weights and execution times is maximum
  – Example: If a system has 3 cores, the number of active cores can be 1, 2 or 3
      If , we maximize and
This achieves large improvements in the value of the objective function

\[
\text{Maximize} \sum_{m=1}^{N} \left( \boldsymbol{w_{OD,m} \cdot f_{OD,m} \cdot E_{OD,m}} + w_{nom,m} \cdot f_{nom,m} \cdot E_{nom,m} \right)
\]
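
One possible reading of this ordering rule is sketched below: rank the active-core counts by the product of overdrive weight and execution time, then give each one, in that order, the highest frequency that still fits. The single lifetime budget, the AF lookup values other than 9.77 at 3GHz, and the function name `heuristic` are assumptions for illustration and are not taken from the slides.

```python
def heuristic(freqs, af_lut, e_od, w_od, budget):
    """Greedy sketch: prioritise entries with the largest w_OD * E_OD."""
    f_min = min(freqs)
    choice = [f_min] * len(e_od)
    # Start with everything at the lowest candidate frequency.
    consumed = sum(af_lut[f_min] * e for e in e_od)
    order = sorted(range(len(e_od)), key=lambda m: w_od[m] * e_od[m], reverse=True)
    for m in order:
        for f in sorted(freqs, reverse=True):
            # Extra lifetime consumed by raising entry m from f_min to f.
            extra = (af_lut[f] - af_lut[f_min]) * e_od[m]
            if consumed + extra <= budget:
                choice[m] = f
                consumed += extra
                break
    return order, choice

order, choice = heuristic(
    freqs=[1.8, 2.4, 3.0],
    af_lut={1.8: 1.0, 2.4: 3.0, 3.0: 9.77},  # 1.8 and 2.4 entries are made up
    e_od=[3000, 5000],
    w_od=[0.5, 0.4],
    budget=20000,
)
print(order, choice)  # [1, 0] [1.8, 2.4]
```

In this toy run the second entry has the larger weight-times-duration product, so it is raised first and uses up the budget that the first entry would have needed.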


Page 29: Experimental Setup

Each core is simulated with 72 copies of jpeg_encoder from OpenCores
  – SP&R implementation with commercial tools and foundry 45nm libraries
Power simulation using Synopsys PrimeTime-PX
  – Increase voltage from 0.8V to 1.2V in steps of 10mV
  – Increase frequency from 1.5GHz to 3GHz in steps of 50MHz
Thermal simulation using HotSpot
LP solver is lp_solve
Baseline policy is RC-LG from existing works
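
The size of the power-simulation sweep follows from the step sizes above. The sketch below builds the grid with integer millivolt and megahertz values to avoid floating-point drift; it assumes both endpoints are included and that every voltage is paired with every frequency, neither of which the slides state explicitly.

```python
# Integer units keep the step arithmetic exact.
voltages_mv = list(range(800, 1201, 10))   # 0.8V to 1.2V in 10mV steps
freqs_mhz = list(range(1500, 3001, 50))    # 1.5GHz to 3GHz in 50MHz steps

# ASSUMED full cross product of operating points, as (volts, GHz) pairs.
grid = [(v / 1000, f / 1000) for v in voltages_mv for f in freqs_mhz]

print(len(voltages_mv), len(freqs_mhz), len(grid))  # 41 31 1271
```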

Page 30: Testcases

Testcases are described by

Eight testcases in total
  – Format is -Testcase#
  – Seven have optimal solutions
  – One does not have a feasible solution

Example

Name                (Kh)         (Kh)
4-I    1, 2, 3, 4   1, 2, 3, 2   3, 5, 8, 5   0.5, 0.3, 0.2, 0.4   0.5, 0.7, 0.8, 0.6

Page 31: Optimal, Heuristic vs. RC-LG

[Figure: bar chart of Objective Function Value (0 to 45000) for Optimal, Heuristic and Baseline across testcases 4-I, 4-II, 4-III, 4-IV, 4-V, 6-I, 8-I; annotations: -3.3%, -17.4%, -12%, -9%]

Page 32: Runtime Comparison

[Figure: bar chart of Normalized Runtime (0 to 1.2) for Optimal and Heuristic across testcases 4-I, 4-II, 4-III, 4-IV, 4-V, 6-I, 8-I; annotations: 10, 2.3, 2.5, 2.5]


Page 34: Conclusions

We formulate and solve a new MVRCOF problem under lifetime reliability constraints
We develop an MVRCOF solver that implements our optimal (discretized) and heuristic flows
Our optimal solutions guarantee both “acceptable performance” and “acceptable throughput”
We empirically demonstrate that our optimal solutions achieve up to 17.4% greater value of the objective function than existing works
Our future work includes
  – Application of our methods to traces from actual server workloads
  – Extension of our methods to handle other objectives
  – Solutions that are temperature history-aware

Page 35: Thank You!

Page 36: Back up

Page 37: Notation

$m$: number of simultaneously active cores
$N$: number of symmetric cores in a system
$i$: index for a core
$f_{OD,m}$, $f_{nom,m}$: overdrive and nominal frequencies when $m$ cores are active
$w_{OD,m}$, $w_{nom,m}$: weights for overdrive and nominal frequencies
$E_{OD,m}$, $E_{nom,m}$: execution time in overdrive and nominal frequencies
$f_{max}$: maximum achievable frequency of any core
$P_{max}$: maximum power consumption of any core
$T_{max}$: maximum die temperature

Page 38: Optimal Solution Flow

[Flow diagram: $f_{OD,m}$ → Power simulation → Power($f_{OD,m}$) → Thermal simulation → ($f_{OD,m}$, temp, AF) LUT with columns (m, j), Core, Temp, $f_{OD,m}$, AF → Exhaustive Search (for each core i, $f_{OD,m}$ and combination j of m) with LP → Optimal obj fn value, $f_{OD,m}$ and $t_{j,m,l}$]