Darrell Boggs, CPU Architecture

Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman

HOT CHIPS 2014

NVIDIA’S DENVER PROCESSOR

TEGRA K1

The First 64-bit Android Kepler-Class Chip with Dual Denver CPUs

TEGRA K1

192-core Kepler-Class Chip

One Chip, Two Versions (Pin Compatible)

Quad A15 CPUs: 32-bit, 3-way superscalar, up to 2.3GHz, 32K+32K L1$

Dual Denver CPUs: 64-bit, 7-way superscalar, up to 2.5GHz, 128K+64K L1$

DENVER VALUE PROPOSITION

Highest performance and very energy-efficient ARMv8 processor

Greater dynamic sharing with GPU

Extended battery life

Low latency power-state transition

Best web browsing experience

Designed to bring PC-class performance to the ARM ecosystem

Content creation

Gaming

Enterprise applications

DENVER CPU

Highest Perf ARMv8 CPU

7-wide superscalar

Aggressive HW prefetcher

Dynamic Code Optimization: optimize once, use many times

OOO execution without the power

[Block diagram: Denver Core. Blocks: IFU, BPU, L1 Inst Cache (128 K, 4 Way), ARM HW Decoder, Scheduler, L1 Data Cache (64 K, 4 Way), HW Prefetch. Execution units: JSR, IEU0, IEU1, FP0, FP1, LS0, LS1.]

TEGRA K1 SUPERSCALAR ARCHITECTURE

Cortex-A15 (Tegra K1-32)

Branch: 1

Integer: 2

Multiply: 1

Floating Point/NEON: 2 x 64-bit

LD/ST: 1 LD and 1 ST

Peak IPC: 3

Denver (Tegra K1-64)

Branch: 1

Integer: 2 (+ Mul) + 2

Floating Point/NEON: 2 x 128-bit

LD/ST: 2 LD and/or ST

Peak IPC: 7+

[Figure: block diagrams of the two pipelines. One shows a scheduler and a 3-wide decoder feeding Branch, Integer, Integer, Mul, FP/NEON, FP/NEON, Load, and Store units. The other shows a scheduler and an 8-wide decoder (predecode) feeding Branch, Mul, Integer, Integer, FP/NEON, FP/NEON, and two Load/Store/Integer units.]
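
As a rough illustration of how the unit counts relate to the peak IPC figures, the sketch below bounds peak IPC by the narrower of the front-end width and the number of issue ports. The `peak_ipc` helper, the dictionary layout, and the reading that Denver's multiplier and LD/ST integer capability share existing ports are assumptions made for illustration; the slide only gives the per-class counts and the peak figures, and the "+" in "7+" is not modeled here.

```python
def peak_ipc(front_end_width, issue_ports):
    """Upper bound on instructions per cycle: limited by whichever is
    narrower, the front end or the total number of issue ports."""
    return min(front_end_width, sum(issue_ports.values()))


# Unit counts as listed on the slide; the grouping into ports is hypothetical.
cortex_a15 = {"branch": 1, "integer": 2, "multiply": 1,
              "fp_neon": 2, "load": 1, "store": 1}

# Assumption: on Denver the multiplier shares an integer port and the two
# LD/ST ports double as integer ports, so neither is counted twice.
denver = {"branch": 1, "integer": 2, "fp_neon": 2, "load_store": 2}

a15_peak = peak_ipc(3, cortex_a15)   # 3-way superscalar: front end is the limit
denver_peak = peak_ipc(7, denver)    # 7-way superscalar: seven ports to fill
print(a15_peak, denver_peak)
```

Under these assumptions the A15 is limited by its 3-wide front end even though it has more than three units, while Denver's seven ports match its stated width.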

Pipeline Microarchitecture – Mispredict Penalty

[Pipeline diagram: Denver. Front-end stages: IP1, IC2, IW3, IN4, IN5, SB1, SB2. Execute stages: EB0, EB1, EA2, ED3, EL4, EE5, ES6, EW7. Stage labels: ITLB, I$ Rd, Way Sel, Pick, Dec, PB, Sch, RF Rd, Bypass, ALU, CMP, JCC, RF wr. A "Misprediction Signal" arrow back to "Correct Target" is marked "13 Cycle Penalty".]

Tegra K1-32: 15 cycle mispredict

Tegra K1-64: 13 cycle mispredict

Higher ILP and efficiency (lower is better)
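
To see why a shorter mispredict penalty matters, a minimal sketch follows that estimates the stall cycles per instruction caused by branch mispredicts. The 15- and 13-cycle penalties come from the slide; the branch fraction and mispredict rate are hypothetical values chosen only for illustration, and `mispredict_stall_cpi` is a name introduced here.

```python
def mispredict_stall_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """Average stall cycles per instruction attributable to mispredicts:
    (branches per instruction) x (mispredicts per branch) x (cycles lost)."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 5% mispredict.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

k1_32 = mispredict_stall_cpi(BRANCH_FRACTION, MISPREDICT_RATE, 15)
k1_64 = mispredict_stall_cpi(BRANCH_FRACTION, MISPREDICT_RATE, 13)

print(f"Tegra K1-32: {k1_32:.3f} stall cycles per instruction")
print(f"Tegra K1-64: {k1_64:.3f} stall cycles per instruction")
```

With the same workload assumptions, the stall cost scales linearly with the penalty, so a 13-cycle penalty loses 13/15 of the cycles a 15-cycle penalty does.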

CORE CLUSTER RETENTION STATE

New power management state: CC4

Allows cache and architectural state retention

Allows voltage to be reduced below Vmin to a retention voltage

Fast entry and exit latencies enable wider range of use

DENVER IDLE POWER IMPROVES WITH RETENTION

[Chart annotations: "Overhead from energy required to flush L2"; "Leakage"; "Efficiency crossover"; "Power penalty if entered for short durations".]
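
The "efficiency crossover" idea can be sketched with a simple linear energy model: each idle state costs a fixed entry and exit overhead plus its residual power times the idle duration. Everything in the block below is an assumption for illustration: the state names other than the retention state, the `IdleState` class, and all numbers are hypothetical, not measured Denver values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    overhead_uj: float    # entry + exit energy in microjoules (hypothetical)
    idle_power_mw: float  # power while resident in milliwatts (hypothetical)

    def energy_uj(self, idle_ms):
        # mW x ms = microjoules
        return self.overhead_uj + self.idle_power_mw * idle_ms


def crossover_ms(a, b):
    """Idle duration at which states a and b cost the same energy,
    or None if the two energy lines do not cross at a positive time."""
    power_gap = a.idle_power_mw - b.idle_power_mw
    if power_gap == 0:
        return None
    t = (b.overhead_uj - a.overhead_uj) / power_gap
    return t if t > 0 else None


def best_state(states, idle_ms):
    """State with the lowest total energy for an idle period of idle_ms."""
    return min(states, key=lambda s: s.energy_uj(idle_ms))


STATES = [
    IdleState("clock_gated", overhead_uj=0.0, idle_power_mw=100.0),
    IdleState("retention_cc4", overhead_uj=50.0, idle_power_mw=20.0),
    # Deep state: high overhead stands in for the energy to flush L2.
    IdleState("power_gated", overhead_uj=2000.0, idle_power_mw=2.0),
]

for idle_ms in (0.1, 10, 1000):
    print(idle_ms, "ms ->", best_state(STATES, idle_ms).name)
```

In this toy model a state with a large flush overhead only pays off for long idle periods, while a low-overhead retention state wins across a wide middle range, which is the behavior the slide's annotations describe.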

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units). An "Optimization µ-interrupt" connects the hardware to the Optimizer.]

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units). The Optimizer takes Dynamic Profile Information and places Optimized µcode in the Optimization Cache.]

The optimizer:

Unrolls loops

Renames registers

Reorders loads and stores

Improves control flow

Removes unused computation

Hoists redundant computation

Sinks uncommonly executed computation

Improves scheduling
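
Two of the listed transformations, hoisting redundant computation and unrolling loops, can be illustrated at source level. This is only an analogy: Denver's optimizer works on ARM instructions and emits its own microcode, not Python, and the function names below are hypothetical.

```python
def scale_baseline(values, a, b):
    """Straightforward loop: the factor is recomputed on every iteration
    even though it never changes inside the loop."""
    out = []
    for v in values:
        k = a * b + 1
        out.append(v * k)
    return out


def scale_hoisted_unrolled(values, a, b):
    """Same result after two transformations applied by hand:
    the invariant factor is hoisted, and the loop is unrolled by two."""
    k = a * b + 1              # hoisted: computed once, before the loop
    out = []
    n = len(values)
    i = 0
    while i + 1 < n:           # unrolled body handles two elements per trip
        out.append(values[i] * k)
        out.append(values[i + 1] * k)
        i += 2
    if i < n:                  # leftover element when the length is odd
        out.append(values[i] * k)
    return out


sample = [1, 2, 3, 4, 5]
print(scale_baseline(sample, 2, 3) == scale_hoisted_unrolled(sample, 2, 3))
```

The transformed version does the same work with fewer redundant operations and fewer loop-control steps, which is the kind of saving that is then reused on every later execution.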

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units). An Optimization Lookup finds Optimized µcode in the Optimization Cache.]

Optimized code execution from optimization cache

DYNAMIC CODE OPTIMIZATION

OPTIMIZE ONCE, USE MANY TIMES

[Diagram: Instructions enter the Denver Hardware (Hardware Decoder, Execution Units, Optimization Lookup). The Optimizer uses Dynamic Profile Information. The Optimization Cache holds four Optimized µcode routines linked by "Chaining".]
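
The flow across these slides (profile, optimize, look up, chain) can be sketched as a toy model. This is an illustrative simplification under stated assumptions, not NVIDIA's implementation: the `ToyDCO` class, the hot threshold of 3, and the way chaining is recorded are all hypothetical.

```python
from collections import Counter


class ToyDCO:
    """Toy model of 'optimize once, use many times'."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # hypothetical trigger point
        self.profile = Counter()  # dynamic profile: runs per code region
        self.opt_cache = {}       # optimization cache: address -> routine
        self.chains = {}          # optimized region -> next optimized region

    def optimize(self, addr):
        # Stand-in for the optimizer: just record that a routine exists.
        self.opt_cache[addr] = f"optimized@{addr:#x}"

    def run_region(self, addr):
        if addr in self.opt_cache:          # optimization lookup hit
            return "optimized"
        self.profile[addr] += 1             # executed via the HW decoder
        if self.profile[addr] >= self.hot_threshold:
            self.optimize(addr)             # later runs will hit the cache
        return "hw_decoder"

    def run_trace(self, trace):
        paths = []
        prev_addr, prev_path = None, None
        for addr in trace:
            path = self.run_region(addr)
            # Chaining: link back-to-back optimized regions directly.
            if prev_path == "optimized" and path == "optimized":
                self.chains[prev_addr] = addr
            paths.append(path)
            prev_addr, prev_path = addr, path
        return paths


dco = ToyDCO()
paths = dco.run_trace([0x1000, 0x2000] * 5)
print(paths.count("hw_decoder"), "decoded,", paths.count("optimized"), "optimized")
```

In this model each region pays the decode path only until it becomes hot; after that, every execution is served from the cache, and regions that follow one another are linked so the lookup step can be skipped.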

DENVER PERFORMANCE

[Bar chart: performance relative to Tegra K1 32, y-axis from 0% to 300%. Benchmarks: DMIPS, SPECInt 2K, SPECFP 2K, AnTuTu 4, Geekbench 3 Single-Core, Google Octane v2.0, 16MB Memcpy (GB/s), 16MB Memset (GB/s), 16MB Memread (GB/s). Series: Baytrail (Celeron N2910), Krait-400 (8974-AA), iPhone 5s (A7 Cyclone), Haswell (Celeron 2955U), Tegra K1 64.]

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart: profile of execution. Y-axis: % Exec Types. Series: optimized µcode execution, HW Decoder execution, Dynamic Code Optimizer.]

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart: profile of new ARM instructions. Panels: % Exec Types and New ARM Code.]

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

Full benchmark run

[Chart: profile of Instructions Per Cycle. Panels: % Exec Types, New ARM Code, and IPC.]

DCO: AN EXAMPLE

SPECINT – CRAFTY EXECUTION

3% of benchmark run

[Chart: the same panels (% Exec Types, New ARM Code, and IPC) for a 3% slice of the benchmark run.]

CONCLUSION

Dynamic Code Optimization is the architecture of the future

Breaks the out-of-order window physical limitation

Opens synergy between HW and SW that current architectures lack

Improves efficiency by optimizing once and using many times

Delivering PC-class performance to mobile form factors

Enables PC-class gaming experience

Enables true enterprise applications

Enables content creation

ACKNOWLEDGMENT

We would like to thank the CPU team at NVIDIA for all the creativity, hard work, and dedication that brought this vision to reality.
