Scaling Datacenter Accelerators With Compute-Reuse Architectures Adi Fuchs and David Wentzlaff ISCA 2018 Session 5A June 5, 2018 Los Angeles, CA

Sources:
"Cramming more components onto integrated circuits", G. E. Moore, Computer 1965
"Next-Gen Power Solutions for Hyperscale Data Centers", DataCenter Knowledge 2016

Sources:
"Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018
"Cloud TPU", Google, https://cloud.google.com/tpu/
"FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017
"Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
"NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/

Transistor scaling stops. Chip specialization runs out of steam.

What’s Next?

Observation I: The Density of Emerging Memories is Projected to Increase

ITRS Logic Roadmap

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Temporal locality introduces redundancy in video encoders (recurrent blocks in white)

[Figure: video frames at t=0 sec, t=2 sec, and t=4 sec, with 0%, 38%, and 61% block recurrence, respectively]

Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
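One way to read the recurrence percentages is to split each frame into fixed-size blocks and count how many blocks already appeared in earlier frames. The sketch below does this on tiny synthetic frames. The 2x2 block size, the frames, and the function names are illustrative assumptions, not details from the talk.

```python
from typing import List, Set, Tuple

Frame = List[List[int]]  # grayscale pixels, row-major


def blocks(frame: Frame, size: int) -> List[Tuple[int, ...]]:
    """Split a frame into non-overlapping size x size blocks, each flattened to a tuple.

    Assumes the frame dimensions are multiples of `size`.
    """
    out = []
    for r in range(0, len(frame), size):
        for c in range(0, len(frame[0]), size):
            out.append(tuple(frame[r + i][c + j] for i in range(size) for j in range(size)))
    return out


def block_recurrence(frames: List[Frame], size: int = 2) -> List[float]:
    """For each frame, the fraction of its blocks already seen in earlier frames."""
    seen: Set[Tuple[int, ...]] = set()
    rates = []
    for frame in frames:
        current = blocks(frame, size)
        hits = sum(1 for b in current if b in seen)
        rates.append(hits / len(current))
        seen.update(current)  # blocks become reusable only for later frames
    return rates


# Synthetic 4x4 frames: frame 1 keeps the top half of frame 0 unchanged.
f0 = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
f1 = [[1, 1, 2, 2], [1, 1, 2, 2], [5, 5, 6, 6], [5, 5, 6, 6]]
rates = block_recurrence([f0, f1])
print(rates)  # [0.0, 0.5]
```

The first frame has no history, so its recurrence is zero, which matches the 0% figure at t=0 sec.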

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Search-term commonality retrieves similar content

Example queries (Source: Google):

intercontinental downtown los angeles

hotel in downtown los angeles near intercontinental

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Power laws suggest highly recurrent processing of popular content

Source: Twitter
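To see why a power law implies recurrence, the sketch below draws requests from a Zipf-like distribution and measures the fraction that repeat an earlier request, next to a uniform baseline. The catalog size, exponent, and request count are arbitrary assumptions chosen for illustration; they are not parameters from the talk.

```python
import random
from typing import List


def zipf_trace(num_items: int, num_requests: int, exponent: float, seed: int = 0) -> List[int]:
    """Sample item IDs where the item of rank k (1-based) has weight 1 / k**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_items + 1)]
    return rng.choices(range(num_items), weights=weights, k=num_requests)


def recurrence_rate(trace: List[int]) -> float:
    """Fraction of requests whose item already appeared earlier in the trace."""
    seen = set()
    repeats = 0
    for item in trace:
        if item in seen:
            repeats += 1
        seen.add(item)
    return repeats / len(trace)


# Assumed sizes: 100,000 distinct items, 50,000 requests.
zipf = zipf_trace(num_items=100_000, num_requests=50_000, exponent=1.0)
uniform = random.Random(0).choices(range(100_000), k=50_000)
print(f"Zipf-like recurrence: {recurrence_rate(zipf):.2f}")
print(f"Uniform recurrence:   {recurrence_rate(uniform):.2f}")
```

With the same number of items and requests, the skewed trace revisits popular items far more often than the uniform one, which is the property a memoization table exploits.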

Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.

COREx: Compute-Reuse Architecture For Accelerators
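The memoization principle can be sketched in software as a table keyed by the input. This only illustrates the idea: `slow_kernel` is a hypothetical stand-in for an accelerator computation, and the unbounded Python dictionary stands in for the compute-reuse storage.

```python
from typing import Callable, Dict, Tuple


class MemoTable:
    """Wraps a pure function and reuses stored outputs for recurring inputs."""

    def __init__(self, compute: Callable[[Tuple[int, ...]], int]) -> None:
        self.compute = compute
        self.table: Dict[Tuple[int, ...], int] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, inputs: Tuple[int, ...]) -> int:
        if inputs in self.table:      # recurring input: reuse the stored output
            self.hits += 1
            return self.table[inputs]
        self.misses += 1              # new input: compute and remember the result
        result = self.compute(inputs)
        self.table[inputs] = result
        return result


def slow_kernel(block: Tuple[int, ...]) -> int:
    """Hypothetical stand-in for an accelerator kernel (sum of squares)."""
    return sum(x * x for x in block)


memo = MemoTable(slow_kernel)
outputs = [memo(b) for b in [(1, 2, 3), (4, 5, 6), (1, 2, 3)]]
print(outputs, memo.hits, memo.misses)  # [14, 77, 14] 1 2
```

Reuse is only valid when the computation is a pure function of its inputs, so that a stored output is always the output a recomputation would produce.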

[Figure: COREx dataflow. Component labels: Host Processors, Shared LLC / NoC, Acceleration Fabric, DMA Engine, Scratchpad Memory, Accelerator Core, Input Lookup, Compute-Reuse Storage. Signal labels: input, output, lookup, hit, fetched result, core result]

Architectural Guidelines

▪ Accelerator Memoization is Natural

o Little or no additional programming effort

o Built-in input-compute-output flow

▪ But Not Straightforward!

o High lookup costs

o Unnecessary accesses

o High access costs

▪ COREx Key Ideas:

o Hashing (reduce lookup costs)

o Lookup filtering (fewer accesses)

o Banking (reduce access costs)

[Figure: accelerator SoC with the input-compute-output flow. Labels: General-Purpose CMP, Shared LLC, DMA Engine, Scratchpad, Accelerator Core, Specialized Compute Lanes; flow stages: Input, Compute, Output]

Goal: Extend Specialization with Workload-Specific Memoization
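The three key ideas can be illustrated in a few lines of software. This is a sketch under assumed parameters, not the COREx hardware design: the hash function (truncated BLAKE2b from Python's `hashlib`), the 8-bank layout, and the plain Python set used as the lookup filter are all illustrative choices.

```python
import hashlib
from typing import Dict, List, Optional, Set

NUM_BANKS = 8  # assumed bank count, for illustration only


def input_hash(data: bytes) -> int:
    """Hashing: compress an arbitrarily long input into a fixed 64-bit key."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class BankedReuseStore:
    def __init__(self) -> None:
        self.known_hashes: Set[int] = set()  # lookup filter
        self.banks: List[Dict[int, bytes]] = [{} for _ in range(NUM_BANKS)]
        self.bank_accesses = 0

    def lookup(self, data: bytes) -> Optional[bytes]:
        key = input_hash(data)
        if key not in self.known_hashes:   # filtering: skip storage on a sure miss
            return None
        self.bank_accesses += 1
        return self.banks[key % NUM_BANKS].get(key)  # banking: touch one bank only

    def insert(self, data: bytes, result: bytes) -> None:
        key = input_hash(data)
        self.known_hashes.add(key)
        self.banks[key % NUM_BANKS][key] = result


store = BankedReuseStore()
first = store.lookup(b"frame-block-A")    # filtered out: no bank access
store.insert(b"frame-block-A", b"result-A")
second = store.lookup(b"frame-block-A")   # passes the filter, reads one bank
print(first, second, store.bank_accesses)  # None b'result-A' 1
```

The sketch ignores hash collisions: two different inputs with the same 64-bit key would share an entry, so a complete design also has to confirm that the stored entry belongs to the same input.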

▪ New Modules:

o Input Hashing Unit (IHU)

o Input Lookup Unit (ILU)

o Computation History Table (CHT)
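The slide build annotates the flow between the new modules as Hashes, Fetch, Match Input, and Use Output. A software sketch of that flow follows. The class names mirror the module names, but the capacities, the LRU replacement in the ILU, and the hash choice are assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Dict, Tuple


def ihu_hash(data: bytes) -> int:
    """IHU stand-in: reduce the input to a fixed-size 32-bit hash."""
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


class ILU:
    """Small cache of hashes that filters lookups (LRU replacement assumed)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()

    def contains(self, h: int) -> bool:
        if h in self.entries:
            self.entries.move_to_end(h)  # refresh recency on a hit
            return True
        return False

    def insert(self, h: int) -> None:
        self.entries[h] = None
        self.entries.move_to_end(h)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used hash


class COREx:
    def __init__(self, core: Callable[[bytes], bytes], ilu_capacity: int = 4) -> None:
        self.core = core
        self.ilu = ILU(ilu_capacity)
        self.cht: Dict[int, Tuple[bytes, bytes]] = {}  # hash -> (input, output)
        self.reused = 0

    def run(self, data: bytes) -> bytes:
        h = ihu_hash(data)                       # Hashes
        if self.ilu.contains(h):
            stored_input, output = self.cht[h]   # Fetch
            if stored_input == data:             # Match Input
                self.reused += 1
                return output                    # Use Output
        output = self.core(data)                 # miss: run the accelerator core
        self.ilu.insert(h)
        self.cht[h] = (data, output)
        return output


accel = COREx(core=lambda d: d[::-1])  # hypothetical kernel: byte reversal
results = [accel.run(x) for x in (b"abc", b"xyz", b"abc")]
print(results, accel.reused)
```

Storing the original input next to the output lets the Match Input step reject a hash collision and fall back to recomputation.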

Top Level Architecture

[Figure: COREx top-level architecture. Labels: General-Purpose CMP, Shared LLC, SoC Interconnect, Accelerator Core, Specialized Compute Lanes, Scratchpad, DMA Engine, COREx Interconnect, IHU, ILU, Associative Cache, Cache Ctrl., CHT, RAM-Array Table, RAM-Array Ctrl. Legend: Mem. Chip, Func. Block, Datapath, Control. Flow annotations across the slide build: Hashes, Fetch, Match Input, Use Output]

IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.

Building COREx

Case Study: Acceleration of Video Motion Estimation

▪ Optimization Goals:

o Runtime, Energy, and Energy-Delay Product (EDP)

▪ Baseline: highly-tuned accelerators

o Sweep space for design alternatives (Aladdin)

o Find optimal accelerator design for each goal

Runtime OPT: 5.8 [us]

Energy OPT: 6.2 [uJ]

EDP OPT: 148.7 [pJs]
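Energy-delay product is the product of energy and runtime, so each goal can select a different design from the same sweep. The sketch below uses made-up design points (not Aladdin output and not the values on the slide) to show how the three optima are picked.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DesignPoint:
    name: str
    runtime_us: float   # microseconds
    energy_uj: float    # microjoules

    @property
    def edp_pjs(self) -> float:
        """Energy-delay product: uJ * us = 1e-12 J*s = pJ*s."""
        return self.energy_uj * self.runtime_us


# Hypothetical sweep results, for illustration only.
sweep: List[DesignPoint] = [
    DesignPoint("wide-lanes", runtime_us=6.0, energy_uj=40.0),
    DesignPoint("balanced", runtime_us=10.0, energy_uj=15.0),
    DesignPoint("narrow-lanes", runtime_us=30.0, energy_uj=7.0),
]

runtime_opt = min(sweep, key=lambda d: d.runtime_us)
energy_opt = min(sweep, key=lambda d: d.energy_uj)
edp_opt = min(sweep, key=lambda d: d.edp_pjs)

print(runtime_opt.name, energy_opt.name, edp_opt.name)
# wide-lanes narrow-lanes balanced
```

The same reasoning applies to the slide's numbers: 5.8 us times 6.2 uJ is about 36 pJs, well below the 148.7 pJs EDP optimum, so the runtime and energy optima must be different design points.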

▪ Memoization-Layer Specialization

o Extract input traces; examine hit and miss rates for different ILU/CHT sizes.

o Integrate accelerators with an emerging-memory-based ILU+CHT, and sweep the design space for gains.

▪ Example: Resistive RAM based COREx

o Energy Optimization: 56.6% energy saved (64KB ILU, 8MB CHT)

o EDP Optimization: 63.5% EDP saved (512KB ILU, 2GB CHT)

o Runtime Optimization: 2.7x speedup (512KB ILU, 32GB CHT)
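Examining hit rates across table sizes can be done offline by replaying an input trace against tables of different capacities. A minimal sketch follows, assuming an LRU-managed table and a small synthetic trace. The replacement policy and the trace are assumptions; the slides do not specify them.

```python
from collections import OrderedDict
from typing import Dict, Hashable, Iterable, List


def hit_rate(trace: Iterable[Hashable], capacity: int) -> float:
    """Replay a trace against an LRU table with `capacity` entries."""
    table = OrderedDict()
    hits = total = 0
    for key in trace:
        total += 1
        if key in table:
            hits += 1
            table.move_to_end(key)          # mark as most recently used
        else:
            table[key] = None
            if len(table) > capacity:
                table.popitem(last=False)   # evict the least recently used entry
    return hits / total if total else 0.0


def sweep_sizes(trace: List[Hashable], capacities: List[int]) -> Dict[int, float]:
    """Hit rate of the same trace for each candidate table capacity."""
    return {c: hit_rate(trace, c) for c in capacities}


trace = ["a", "b", "c", "a", "b", "c", "a", "b"]
print(sweep_sizes(trace, [1, 2, 3]))  # {1: 0.0, 2: 0.0, 3: 0.625}
```

The toy trace cycles through three inputs, so any table smaller than the cycle gets no hits at all. This shows why the best table size depends on the workload's reuse pattern.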

Experimental Setup

Workloads

Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 Million Search Queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets, 10 Million Zipfian Transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing, 10 Million Zipfian Transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize: 10 Million Zipfian Transactions.

Redundancy sources: Temporal Redundancy (DCT, SAD); Search Commonality (SNAPPY); Content Popularity at 75%, 90%, 95% Recurrence (SSSP, BFS, RBM)

Methodology

o Evaluate ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack (Destiny)

o Integrate with highly-tuned accelerators (Aladdin)

Results

▪ Runtime-OPT: Avg. 6.0-6.4x Speedup

o Negligible differences between memories

▪ EDP-OPT: Avg. 50%-68% Savings

o PCM/Racetrack: high write energy

o Smaller gains for low-bias apps (frequent updates)

▪ Energy-OPT: Avg. 22%-50% Savings

o PCM not beneficial for 75%-bias SSSP/RBM

▪ General Trends:

o Large CHTs (MBs-TBs) for speedup, smaller (KBs-GBs) for EDP, smallest (KBs-MBs) for energy
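A simple analytical model helps interpret these trends: every input pays a lookup cost, and only misses pay the full compute cost. The serial lookup-then-compute model and all numbers below are illustrative assumptions, not the evaluation methodology of the talk.

```python
def expected_runtime(t_compute: float, t_lookup: float, hit_rate: float) -> float:
    """Average time per input when every input is looked up and misses are computed."""
    return t_lookup + (1.0 - hit_rate) * t_compute


def speedup(t_compute: float, t_lookup: float, hit_rate: float) -> float:
    """Speedup over a baseline that always computes."""
    return t_compute / expected_runtime(t_compute, t_lookup, hit_rate)


def break_even_hit_rate(t_compute: float, t_lookup: float) -> float:
    """Hit rate above which memoization is faster than always computing."""
    return t_lookup / t_compute


# Hypothetical costs: 10 time units to compute, 1 to look up.
for h in (0.05, 0.5, 0.9):
    print(h, round(speedup(10.0, 1.0, h), 2))
print("break-even hit rate:", break_even_hit_rate(10.0, 1.0))
```

Under this model a larger table raises the hit rate but also tends to raise lookup time and energy. That is one way to read the trend of large tables for speedup and small tables for energy.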

Conclusions

▪ Memoization is Fit for Accelerators

o Memoization-Ready Programming Environment+Interface

▪ Memoization is Fit for Datacenters

o Temporal Redundancy, Search Commonality, Content Popularity

▪ COREx Extends Hardware Specialization

o Memoization-layer specialization tailored for the workload

▪ COREx Opens New Opportunities for Future Architectures

o Shift compute from non-scaling CMOS to still-scaling memories

Adi Fuchs ([email protected]), David Wentzlaff ([email protected])