HW/SW Co-designed Processors Challenges, Design Choices ... · HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation Rakesh Kumar

HW/SW Co-designed Processors: Challenges, Design Choices and a

Simulation Infrastructure for Evaluation

Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3,

Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6

1University of Edinburgh, UK 2Intel 311pets

4Pharmacelera 5ARM 6Universitat Politècnica de Catalunya, Spain

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)Santa Rosa, California, USA - April 24-25, 2017

Outline

• HW/SW co-designed processors

• Building a simulation infrastructure

• DARCO

• Evaluation

• Conclusions

2

HW/SW co-designed processors

3

Energy Efficiency

Performance

HW/SW co-designed processor

Operating System

Translation Optimization Layer (Software)

Libraries

Application Programs

Execution Hardware

Host ISA

Guest ISA

Conventional processor

Operating System

Execution Hardware

Libraries


ISA

• Simple Host ISA

– In-order cores; move complexity to software layer

• Dynamic Binary Optimizations in software (TOL)

– Aggressive and speculative

– Exploit application behavior at runtime

• IBM DAISY (1997)

– Targets binary compatibility from PowerPC to VLIW architectures

• IBM BOA (1999)

– Targets high frequency PowerPC through simple hardware design

• Transmeta Crusoe (2000) and Efficeon (2003)

– Execute x86 binaries on proprietary VLIW with low power consumption

– Better energy efficiency than Intel Pentium III

• Nvidia Denver (2014)

– Executes ARMv8 binaries on proprietary in-order core

– Applying dynamic optimizations matches Out-of-order Intel Haswell

HW/SW co-designed processors: History

20031997 1999 2000 2014

DAISY BOA

4???

• IBM DAISY (1997)

– Targets binary compatibility from PowerPC to VLIW architectures

• IBM BOA (1999)

– Targets high frequency PowerPC through simple hardware design

• Transmeta Crusoe (2000) and Efficeon (2003)

– Execute x86 binaries on proprietary VLIW with low power consumption

– Better energy efficiency than Intel Pentium III

• Nvidia Denver (2014)

– Executes ARMv8 binaries on proprietary in-order core

– Applying dynamic optimizations matches Out-of-order Intel Haswell

HW/SW co-designed processors: History

20031997 1999 2000 2014

DAISY BOA

5???

Anything missing?No major academic project!

Can lack of simulation infrastructure be the reason?

Outline

• Introduction



• DARCO

• Evaluation

• Conclusions

6

What will a simulation infrastructure enable?

• Where to implement (HW or SW?) microarchitectural features like

– Instruction decoding/reordering, memory disambiguation, register renaming, …

7

Operating System

Translation Optimization Layer (Software)

Libraries


Execution Hardware

Host ISA

Guest ISA




• How to reduce “startup delay”

– One of the major problems of Transmeta products

• When and where to translate/optimize the guest binaries

– As soon as code becomes “hot”?, wait for a core to become idle?,…

• How to address speculative execution (memory, control)

– Checkpointing granularity?, find susceptible speculation failure points?

• When and how to profile the execution

– Overhead vs opportunity for improvement

8




• How to reduce “startup delay”

– One of the major problems of Transmeta products

• When and where to translate/optimize the guest binaries

– As soon as code becomes “hot”?, wait for a core to become idle?,…

• How to address speculative execution (memory, control)

– Checkpointing granularity?, find susceptible speculation failure points?

• When and how to profile the execution

– Overhead vs opportunity for improvement

9

A simulation infrastructure can help

evaluate trade-offs and design choices

Simulation infrastructure: Complexity

10

• Compilation framework

– Code analysis/translation

– Optimizations

– Code generation

• Runtime system

– Profiling and instrumentation

– Profile-guided optimizations

• Microarchitectural simulator

– Model components like pipeline, caches, …

– Allow sampling

Simulation infrastructure: Complexity

11

• Compilation framework

– Code analysis/translation

– Optimizations

– Code generation

• Runtime system

– Profiling and instrumentation

– Profile-guided optimizations

• Microarchitectural simulator

– Model components like pipeline, caches, …

– Allow sampling

Simulation infrastructure complexity =

Compilation framework + Runtime system +

Microarchitectural simulator

• Correctness

– It should not change program behavior

• Minimum software layer (TOL) overhead

– TOL execution time must be small

• Minimum emulation cost

– Host to guest instruction ratio must be low

• Support for multiple guest ISAs (front-ends)

– Enables wider applicability

• Plug and play support

– Easy to include/evaluate new features

• Debugging

– Strong debug toolchain

Simulation infrastructure: Requirements

12

TOL

Hardware

x86 ARM Power

Outline

• Introduction



• DARCO: Infrastructure for Research on HW/SW Co-designed Processors

• Evaluation

• Conclusions

13

DARCO: The big picture

14

x86 Componentx86 Binary

Commands Path Commands Path

Timing Simulator

Process Tracker

Data and Instruction Path


Co-designed Component

Controller

x86 Functional Emulator

RISCFunctional Emulator

x86 OSTranslation Optimization Layer (TOL)

Emulated x86 Memory State

Authoritative x86 Register State

Authoritative x86 Memory State

State Checker

Emulated x86 Register State

• Models a processor that executes x86 code on a RISC host architecture

• Four main components: Co-designed, x86, Timing Simulator, Controller

DARCO: Co-designed Component

• Models the functionality of a HW/SW co-designed processor

• Composed of TOL and host ISA functional emulator (user code)

• Maintains emulated x86 architectural and memory states

15



Timing Simulator

Process Tracker




Controller







State Checker


DARCO: x86 Component

• Provides a full-system functional emulator for the guest x86 ISA

• Maintains authoritative x86 architectural and memory states

• Filters instruction stream and passes user code to co-designed component

16



Timing Simulator

Process Tracker




Controller







State Checker


DARCO: Timing Simulator

• Models a parameterized in-order core

• Can distinguish application and TOL code

• Includes power and energy modelling (McPAT)

17



Timing Simulator

Process Tracker




Controller







State Checker


DARCO: Controller

• Provides full control over the app execution and debugging utilities

• Compares authoritative and emulated x86 states to ensure correctness

18

User



Timing Simulator

Process Tracker




Controller







State Checker


19

DARCO: Starting execution

x86 Component Tracker Co-designed ComponentController

User

CommandsUser : Execute application XYZ

Controller : Send proper command to x86 component

x86 OS : Starts application XYZ

Tracker : Identifies application XYZ

Controller : Request 1st code page from x86 component and send to TOL with initial state

TOL : Load the code page and starts emulating

TOLXYZ Exec Space

Code Code Reg File Data Reg File

Tracks the emulated

application. Passes

user-level code to co-

designed component

DataXYZ

20

DARCO: Handling system calls

Co-designed Componentx86 ComponentController

User

Events and CommandsTOL : Execution sequence reaches a system call

TOL : Flush state and send request to x86 component

x86 Comp : Reach the same execution point and make the system call

Controller : Send new state to TOL

TOL : Continue emulation

TOLXYZ Exec Space

Code Code

Tracker

Reg File Data Reg File

Sys CallSys Call

DataHandler

The system call is faked – it is

NOT emulated in the co-

designed component

21



Timing Simulator

Process Tracker




Controller







State Checker


DARCO: Translation Optimization Layer (TOL)

• Interpretation (IM)

– x86 instructions interpreted sequentially

– Profiling for execution frequency of basic blocks

• Basic block translation (BBM)

– x86 basic blocks translated to intermediate representation

– Lightweight optimizations (dead code, constants)

– Profiling for branch directions and execution frequency of basic blocks

• Superblock optimization (SBM)

– Bigger optimization regions across multiple x86 basic blocks

– Aggressive and speculative optimizations

– No profiling (reduces overhead)

Cold Code

22

Warm Code

Hot Code

DARCO: TOL execution modes

23

x86 eip

Interpret

> BBth

No

BB translateYes

Store in Code $

Chain

Execute from Code $

In Code

$?

No Yes

From Code $

Create SB

Optimize SB

> SBth

Yes

No

Already optimized

DARCO: TOL execution flow

24

x86 eip

Interpret

> BBth

No

BB translateYes

Store in Code $

Chain

Execute from Code $

In Code

$?

No Yes

From Code $

Create SB

Optimize SB

> SBth

Yes

No

Already optimized

- x86 to Intermediate Representation (IR)- Control Speculation- Loop Unrolling

- Static Single Assignment (SSA)- Forward Pass

- Constant Folding - Constant/Copy Propagation- Common Subexpression Elimination

- Backward Pass- Dead Code Elimination

- Data Dependence Graph (DDG)- Memory Alias Analysis- Redundant Load Removal- Store Forwarding

-Instruction Scheduling- Data Speculation

- Register Allocation- Code Generation (opt. IR to Host ISA)

DARCO: TOL execution flow

ACD

DARCO: Building a superblock

25

A

branch

B

branch

C

branch

D

branch

E

branch

5% 95%

2%98%

30% 70%

A

branch

C

branch

D

branch

Exec counter > SBth

assert

assert

branch

Unbiased Branch

SuperBlock

ACD

DARCO: Building a superblock

26

A

branch

B

branch

C

branch

D

branch

E

branch

5% 95%

2%98%

30% 70%

A

branch

C

branch

D

branch

Exec counter > SBth

assert

assert

branch

Unbiased Branch

Impact: Bigger optimization regions

SuperBlock


27



Timing Simulator

Process Tracker




Controller







State Checker



• Models a configurable RISC superscalar in-order processor

• Pipeline decoupled into: Front-End, Instruction Queue, Back-End

28

Front End

AC IF DEC

In. Queue Back End

ISSUE RR EXE WB

TLB L1

TLB L2

L1-I$ L2$ L1-D$

Main Memory

PrefetcherBPBTB

DARCO: Meeting the requirements

• Correctness

– Architectural/memory states compared periodically

• Minimum TOL overhead

– 3-stages for translation/optimization; Chaining

• Minimum emulation cost

– Aggressive/speculative optimizations

• Support for multiple guest ISAs (front-ends)

– Incorporating additional front-ends is straightforward

• Plug and play support

– Modular design

• Debugging

– Powerful debug toolchain

29

TOL

x86 ARM? Power?

x86 Componentx86

Binary


Timin

g

Simulator

Process Tracker



Co-design Component

Controller




Emulated x86

Memory State

Authoritative x86 Register

State

Authoritative x86 Memory

State

State Checker

Emulated x86

Register State

SBMOptim.

x86 eip

Interpret

> BBt

h

No

BB translate

Yes

Store in Code $

Chain

Execute from Code $

In Code $?

No

Yes

From Code $

Create SB

Optimize SB

> SBt

h

Yes

No

Already optimized

Controller

Controller

Outline

• Introduction



• DARCO

• Evaluation

• Conclusions

30

Evaluation methodology

• Benchmarks analyzed

– SPEC CPU2006: Integer, Floating point

– Physicsbench

• We simulate 4B (billion) x86 instructions

– Some benchmarks have fewer instructions and run to completion

• DARCO speed (x86 instructions)

– 3.4 MIPS and ~0.4 MIPS enabling the Timing Simulator

– Intel Xeon E5-2630L (2.40 GHz) and L5630 Dual (2.13 GHz) with 24-128 GB RAM

• Highlights

– x86 dynamic code distribution

– Execution time distribution

– Emulation cost

31

0%10%20%30%40%50%60%70%80%90%

100%

40

0.p

erlb

ench

40

1.b

zip

2

40

3.g

cc

42

9.m

cf

44

5.g

ob

mk

45

8.s

jen

g

46

2.li

bq

uan

tum

46

4.h

26

4re

f

47

1.o

mn

etp

p

47

3.a

star

48

3.x

alan

cbm

k

41

0.b

wav

es

43

3.m

ilc

43

4.z

eusm

p

43

5.g

rom

acs

43

6.c

actu

sAD

M

43

7.le

slie

3d

44

4.n

amd

45

0.s

op

lex

45

3.p

ovr

ay

45

4.c

alcu

lix

45

9.G

emsF

DTD

47

0.lb

m

48

2.s

ph

inx3

10

0.b

reak

able

10

1.c

on

tin

uo

us

10

2.d

efo

rmab

le

10

4.e

xplo

sio

ns

10

5.h

igh

spee

d

10

6.p

erio

dic

10

7.r

agd

oll

SPEC

INT2

00

6

SPEC

FP2

00

6

Ph

ysic

sben

ch

SPECINT2006 SPECFP2006 Physicsbench Averages

X8

6 d

ynam

ic in

stru

ctio

n p

erc

en

tage

Insn executed in IM Insn execute in BBM Insn executed in SBM

Evaluation: x86 dynamic code distribution

32

• ~90% of the dynamic instruction stream comes from SBM (optimized code)

• Represents ~14% of the static code (not shown)

90%

Lower repetition factor of the SuperBlocks

Evaluation: Execution time distribution

• Application: cycles executing RISC instructions that emulate the x86 code

• TOL (overhead): cycles translating/optimizing the guest code, etc

33

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

40

0.p

erlb

ench

40

1.b

zip

2

40

3.g

cc

42

9.m

cf

44

5.g

ob

mk

45

8.s

jen

g

46

2.li

bq

uan

tum

46

4.h

26

4re

f

47

1.o

mn

etp

p

47

3.a

star

48

3.x

alan

cbm

k

41

0.b

wav

es

43

3.m

ilc

43

4.z

eusm

p

43

5.g

rom

acs

43

6.c

actu

sAD

M

43

7.le

slie

3d

44

4.n

amd

45

0.s

op

lex

45

3.p

ovr

ay

45

4.c

alcu

lix

45

9.G

emsF

DTD

47

0.lb

m

48

2.s

ph

inx3

10

0.b

reak

able

10

1.c

on

tin

uo

us

10

2.d

efo

rmab

le

10

4.e

xplo

sio

ns

10

5.h

igh

spee

d

10

6.p

erio

dic

10

7.r

agd

oll

SPEC

INT2

00

6

SPEC

FP2

00

6

Ph

ysic

sben

ch


Pe

rce

nta

ge o

f ex

ecu

tio

n t

ime

TOL (overhead) Application 20%

0

1

2

3

4

5

6

40

0.p

erlb

ench

40

1.b

zip

2

40

3.g

cc

42

9.m

cf

44

5.g

ob

mk

45

8.s

jen

g

46

2.li

bq

uan

tum

46

4.h

26

4re

f

47

1.o

mn

etp

p

47

3.a

star

48

3.x

alan

cbm

k

41

0.b

wav

es

43

3.m

ilc

43

4.z

eusm

p

43

5.g

rom

acs

43

6.c

actu

sAD

M

43

7.le

slie

3d

44

4.n

amd

45

0.s

op

lex

45

3.p

ovr

ay

45

4.c

alcu

lix

45

9.G

emsF

DTD

47

0.lb

m

48

2.s

ph

inx3

10

0.b

reak

able

10

1.c

on

tin

uo

us

10

2.d

efo

rmab

le

10

4.e

xplo

sio

ns

10

5.h

igh

spee

d

10

6.p

erio

dic

10

7.r

agd

oll

SPEC

INT2

00

6

SPEC

FP2

00

6

Ph

ysic

sben

ch


Ho

st In

stru

ctio

ns

pe

r x8

6 In

stru

ctio

n

Evaluation: Emulation cost

• Number of host instructions generated per x86 instruction ~3

• Reasonable for translating from CISC to RISC

34

3.2

Conclusions


– Huge potential to improve energy efficiency and performance

– Several industrial projects

– No major project in academia (lack of simulation infrastructures)

• Challenges

– To become mainstream (e.g. startup delay)

– In building a simulation infrastructure (e.g. software layer overhead)

• DARCO

– May enable academic research in HW/SW co-designed domain

– Provides a modular simulation infrastructure

– Easy to add new components/optimizations

35

HW/SW Co-designed Processors: Challenges, Design Choices and a

Simulation Infrastructure for Evaluation

Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3,

Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6

1University of Edinburgh, UK 2Intel 311pets

4Pharmacelera 5ARM 6Universitat Politècnica de Catalunya, Spain

THANK YOU !!!

Documents

HW/SW Co-designed Processors Challenges, Design Choices ... · HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation Rakesh Kumar