Day 4: Symbiotic Optimization Kim Hazelwood ACACES Summer School July 2009

Preview:

Citation preview

Day 4: Symbiotic Optimization

Kim HazelwoodACACES Summer School

July 2009

ACACES 2009 – Process Virtualization

Modern Computing Challenges

Performance

Power & temperature

Reliability

Parallelism (multicore)

Heterogeneity

Limited resources (embedded computing)

ACACES 2009 – Process Virtualization

Typical Approaches

Optimize using SW or HW techniques in isolation

Performance

• SW: Compile-time optimizations, OS scheduling

• HW: Architectural improvements, VLSI technology

Reliability: Code/data duplication (HW or SW)

Power & Temperature

• HW control mechanisms

• Profile + recompile cycle HW

SW

ACACES 2009 – Process Virtualization

Modern Design Constraints

Compilers – “Compile once, run anywhere”• Cannot ship “MS Office for 3Q08 batch of Core2

3GHz, > 8GB RAM, BrandX power supply, located in high altitudes…”

OSes – Scheduling algorithms don’t scale to modern architectures

Microarchitecture – Limited window of application knowledge • Past must predict the future

Circuits – Guaranteed correctness, reliability, • Must design for the worst case

ACACES 2009 – Process Virtualization

Tortola: Symbiotic Optimization

Enable HW/SW Communication

What small changes can we make to HW/OS to enable collaborative solutions?

SW Applications

Binary Modifier

OS/HW

runtime traits hardware

feedback;os load

scheduling hints

ACACES 2009 – Process Virtualization

The Power of Virtualization

• No longer restricted to a fixed ISA

• Reduce hardware complexity– No more backwards compatibility warts– Fix bugs after shipment– Reduce time to market for new architectures

HW

SW Applications

Binary Modifier

Initiallyx86x86

Eventually

SWI

HWI

ACACES 2009 – Process Virtualization

Tortola Applications

Combine global program information with run-time feedback

• System-specific power usage• Application-specific heat anomalies• Workload/input specific performance optimization

Two case studies

• The di/dt problem – solved using hardware feedback and dynamic optimization

• Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and OS hints from the binary modifier

ACACES 2009 – Process Virtualization

The di/dt Problem

Voltage stability is important for reliability, performance

Low-power techniques have a negative side effect: current variation

ITRS cites noise management as a Grand Challenge for 5-10 year time frame

Dips (undershoots) in supply voltage – can cause incorrect values to be calculated or stored

Spikes (overshoots) in supply voltage – can cause reliability problems

ACACES 2009 – Process Virtualization

di/dt Solutions

Software

MicroArch

Circuit-Level

Compiler Optimizations

Sensor/Actuator Mechanisms

Decoupling capacitors More Vdd Gnd pins on

package

Co-Designed MicroArch & SW Binary Modifier

ACACES 2009 – Process Virtualization

Sensor-Actuator Mechanisms

On-chip voltage sensors detect abnormally high/low voltage levels

On-chip actuator then attempts to quickly raise/lower the processor’s current draw

• Phantom firing– increases current (at the expense of power)• Resource throttling

– reduces current (at the expense of performance)

ACACES 2009 – Process Virtualization

1.05V

1.03V

1V

0.97V

0.95V

Op

era

tin

g

Volt

ag

e R

an

ge

Hard EmergencySoft

Emergency

ControlThreshold

Detecting Imminent Emergencies

ACACES 2009 – Process Virtualization

Targeting Mid-Frequency Di/dt

•Problematic: wide current spike

•Worst case: pulse at the resonant frequency

Su

pp

ly V

olta

ge (

V)

Pro

cess

or C

urr

ent

(A)

Time (Cycles)

Minimum Voltage

20 cycles

Su

pp

ly V

olta

ge (

V)

Pro

cess

or C

urr

ent

(A)

Time (Cycles)

Minimum Voltage

60 cycles

Maximum Voltage

*From:

Joseph et al.

HPCA 2003

ACACES 2009 – Process Virtualization

A di/dt Stressmark

BEGIN_LOOP:…ldt $f1, ($4)divt $f1, $f2, $f3divt $f3, $f2, $f3stt $f3, 8($4)ldq $7, 8($4)cmovne $31, $7, $3stq $3, $(4)stq $3, $(4)stq $3, $(4)…stq $3, $(4)…JUMP BEGIN_LOOP

Seq

uen

tial

Seq

uen

tial

Low

Pow

er

Low

Pow

er

Para

llel

Para

llel

Hig

h P

ow

er

Hig

h P

ow

er

But…Actuator engages every loop iteration degrading performance

Why not correct the problem in the code?

ACACES 2009 – Process Virtualization

Why use Dynamic Binary Modification?

Modify the instruction stream at run time

Much easier to react to an emergency than to predict one

Emergencies are processor dependent, but software should not be!

Enables run-time guarantees

ACACES 2009 – Process Virtualization

Leverage our additional software layer to supplement existing solutions

Microarchitecture provides feedback to our software-based virtual layer

Proposed Solution

Microprocessor

Sensor+Actuator Ext

AlteredAlteredExecutableExecutable

SWHW

BinaryModifierExecutableExecutable

VL

ACACES 2009 – Process Virtualization

Required Investigations

Characterizing emergencies

• How often do we see di/dt emergency loops?

• Are emergencies usually code-based?

Communication between the microarchitecture and the virtual layer

• What information should be passed to virtual layer during an emergency?

Fixing di/dt via binary modification

• Will existing techniques help?

• New algorithms?

ACACES 2009 – Process Virtualization

Last-Executed Branch

Data suggests modifying a few code sequences will eliminate many voltage emergencies

1

10

100

1000

10000

100000

1000000

gzip

vpr

gcc

mcf

crafty

parser

eon

perlbmk

gap

vortex

bzip2

twolf

wupwise

swim art

mgrid

applu

mesa

galgel

equake

facerec

sixtrack

ammp

apsi

Emergencies

Distinct Total

ACACES 2009 – Process Virtualization

Our goal is to• Smooth out current profile, or• Knock pulses off of the resonant frequency

Some existing options• Software pipelining, code motion, instruction

padding

Microprocessor

Sensor+Actuator Ext’ns

AlteredAlteredExecutableExecutable

BinaryModifier

ApplyApplyOptimizationsOptimizations

ExecutableExecutable

Possible Compiler Optimizations

ACACES 2009 – Process Virtualization

AABB

Iteration=1

Iteration=2

Iteration=3

Current

AABB

AABB

AABB

Software pipelining

smoothes profile

Current

Problematicloop:

AAAABBBB

Current

Loop unrolling disrupts

resonance pulse

Unrolledloop:

Loop Unrolling & SW Pipelining

ACACES 2009 – Process Virtualization

0.97V

0.98V

0.99V

1.00V

1.01V

1.02VBefore Loop Unrolling After Loop Unrolling

H

L

H1

H2

L1

L2

Unrolling the Di/dt Stressmark

ACACES 2009 – Process Virtualization

Two Case Studies

The di/dt problem – solved using hardware feedback and dynamic optimization

Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and hints from the binary modifier

ACACES 2009 – Process Virtualization

Process Heterogeneity

• Process variation

Design Heterogeneity

• Specialized processor cores

Today’s Multicore Designs

Future Multicore Designs

Heterogeneous Multicores

ACACES 2009 – Process Virtualization

The Challenge: Scheduling

Many OSes assume identical core resources

• Bad assumption, even today (hyperthreading)

The OS may not have enough information to make the best scheduling decisions

• depending on the type of heterogeneity

Ideal process-to-core mappings change dynamically

Should this be a task for the OS alone?

ACACES 2009 – Process Virtualization

Our Approach

Claim: OSes could benefit from scheduling hints

Solution: Combine historical performance data with application phase information

SW Applications

Binary Modifier

OS/HW

phase information performance

counter info; os load

scheduling hints

ACACES 2009 – Process Virtualization25

Initial Investigation

Hardware configuration

• In-order core

• Out-of-order core

Software configuration

• Two applications executing (SPEC all combinations)

Migration indicators

• IPC (from HW)

• Phase (from SW) 1M cycle granularity

out of order

in order

App1 App2

ACACES 2009 – Process Virtualization26

Results

1. Calculated ideal (omniscient) scheduling

2. Calculated random scheduling

3. Explored two migration heuristics• IPC-threshold scheduling• IPC-delta scheduling

Metric: Distance from ideal

ACACES 2009 – Process Virtualization27

Random Scheduling

Probability of Swapping at Each 1M Timeslice 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

60%

40%

20%

0%

Dis

tan

ce f

rom

Id

eal

ACACES 2009 – Process Virtualization28

IPC Threshold SchedulingPerc

en

t D

ista

nce

fro

m Id

eal

30

20

10

00 Backoff 0.5 Backoff 1.0 Backoff 1.5 Backoff 1.9 Backoff 2.0 Backoff

0.5 IPC 0.75 IPC 1.0 IPC 1.25 IPC 1.5 IPC 1.75 IPC 2.0 IPC

ACACES 2009 – Process Virtualization29

IPC Delta SchedulingPerc

en

t D

ista

nce

fro

m Id

eal

30

20

10

0 0.05 0.1 0.15 0.2 0.3 0.4 0.5 0.75 1 1.25 1.5 1.75 2

Migration Trigger - IPC Change

ACACES 2009 – Process Virtualization30

Ongoing Investigations

Varying heterogeneity

• cache sizes

• floating-point availability

Determining migration indicators

• cache misses

Combinations

Other software feedback

ACACES 2009 – Process Virtualization

Symbiotic Optimization

Cross-layer approaches can be powerful

Minor changes enable communication channels

Hardware feedback and binary modification can help solve the di/dt problem

Hardware feedback and program phases can guide OS scheduling decisions

The Tortola design can also target power reduction, temperature reduction, reliability, etc.

http://www.tortolaproject.com/

ACACES 2009 – Process Virtualization

Course Summary

Now you know …

Process VMs versus system VMs

Research issues in building process VMs

How to use process VMs for a variety of other research projects

Why cross-layer approaches can be powerful and how to use process VMs to get there

Thanks and enjoy the rest of your summer!

32

Recommended