Google Workloads for Consumer Devices: Mitigating Data ... · TensorFlow Mobile On average, energy...

Preview:

Citation preview

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Google Consumer Workloads

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

ChromeGoogle’swebbrowser

TensorFlowMobileGoogle’smachinelearning

framework

VideoPlaybackandVideoCaptureGoogle’svideocodec

Goals Data Movement Cost

DataMovement

1stkeyobservaBon:62.7%oftotalsystemenergyisspentondatamovement

Processing-in-Memory(PIM)

SoC

DRAM L2 L1

CPU CPU CPU CPU

Compute Unit

2ndkeyobservaBon:asignificantfracBonofdatamovementoOencomesfromsimplefuncBons

Welookatwidely-usedGoogleworkloadstoidenBfymajorsourcesofenergyconsumpBon:

1

2

UnderstandthedatamovementrelatedboSlenecksinmodernconsumerworkloads

Analyzethebenefitsofprocessing-in-memory(PIM)tomiBgatedatamovementcost

InvesBgatethePIMlogicthatcanmaximizeenergyefficiencygiventhe

limitedareaandenergybudgetinconsumerdevices3

Browser

0% 20% 40% 60% 80%

100%

Frac

tion

of

Tota

l Ene

rgy

Texture Tiling Color Blitting Other

GoogleDocs Gmail Google

CalendarWord-Press Twi6er Ani-ma8on AVG

41.9%ofpagescrollingenergyisspentontextureBlingandcolorbliZng

77%oftotalenergyconsumpBongoestodatamovement

CPU L1 LLC Inter- connect

Mem Ctrl

DRAM

Tota

l Ene

rgy

(pJ) Texture Tiling Color Blitting Other

0×1012 3×1012 6×1012 9×1012

12×1012 15×1012 18×1012

49.1%oftotaldatamovementcomesfrom

textureBlingandcolorbliZng

0%

10%

20%

30%

40%

Texture Tiling

Color Blitting

Frac

tion

of

Tota

l Ene

rgy

Data Movement Compute

HTML

CSS

HTML Parser

CSS Parser

Render Tree Layout Rasteriza-

tion Composit-

ing

Texture Tiling

Color Blitting

Rasteriza8on

InvokeComposi8ng

idle

high

Conversion

TextureTiles

%meCPU Memory

Rasteriza8on

ReadBitmap

Conversion

WriteBack

CPU PIM

CPU + PIM CPU-Only

LinearBitmap

TextureTilesInvoke

Composi8ng

TextureTiling

datamovement

LinearBitmap

TextureBlingisagoodcandidateforPIMexecuBon

RequiressimpleprimiBves:memcopy,bitwiseoperaBons,andsimplearithmeBcoperaBons

PIM Core PIM Accelerator

Texture Tiling

LinearBitmap TiledTexture

9.4%oftheareaavailableforPIMlogic

7.1%oftheareaavailableforPIMlogic

PIMcoreandPIMacceleratorarefeasibletoimplementin-memorytexturetiling

ChromeProcess

RenderProcess(tab1)

RenderProcess(tab2)

RenderProcess(tabn)

•  Chrome uses compression to reduce each tab’s memory footprint

•  To study data movement during tab switching, we perform an experiment: –  A user opens 50 tabs (most-accessed websites) –  Scrolls through each for a few second –  Switches to the next tab

Compressionanddecompressioncontributeto

18.1%ofthetotalsystemenergy

19.6GBofdatamovebetweenCPUandZRAM

TensorFlow Mobile

57.3%oftheinferenceenergyisspentondatamovement

54.4%ofthedatamovementenergycomesfrompacking/unpackingandquanBzaBon

Inference PredicBon

Reorderselementsofmatricestominimizecachemisses

Upto40%oftheinferenceenergyand31%ofinferenceexecuBonBme

Packing’sdatamovementaccountsfor

upto35.3%oftheinferenceenergy

Packing Matrix PackedMatrix

AsimpledatareorganizaBonprocessthatrequiressimplearithmeBc

Converts32-bitfloaBngpointto8-bitintegers

Upto16.8%oftheinferenceenergyand16.1%ofinferenceexecuBonBme

MajorityofquanBzaBonenergycomesfromdatamovement

Quantization floa8ngpoint integer

AsimpledataconversionoperaBonthatrequiresshiO,addiBon,andmulBplicaBon

Video Playback and Capture

63.5%ofthesystemenergyisspentondatamovement

Compressedvideo VP9Decoder

Display

80.4%ofthedatamovementenergycomesfromsub-pixelinterpolaBonanddeblockingfilter

Deblockingfilter:asimplelow-passfilterthataSemptsto

removedisconBnuityinpixels

Sub-pixelinterpolaBon:interpolatesthevalueof

pixelsatnon-integerlocaBon

59.1%ofthesystemenergyisspentondatamovement

Capturedvideo

VP9Encoder

Compressed

MajorityofthedatamovementenergycomesfrommoBonesBmaBon

MoBonesBmaBon:compressestheframesusingtemporalredundancybetweenthem

Evaluation

0

0.2

0.4

0.6

0.8

1

Texture Tiling

Color Blitting

Com- pression

Decom- pression

Packing Quantization Sub-Pixel Interpolation

Deblocking Filter

Motion Estimation

Nor

mal

ized

Ene

rgy

CPU-Only PIM-Core PIM-Acc

Onaverage,energyconsumpBonreducesby49.1%usingPIMcoreand55.4%usingPIMaccelerator

0.0

0.2

0.4

0.6

0.8

1.0

Texture Tiling

Color Blitting

Comp- ression

Decomp- ression

Sub-Pixel Interpolation

Deblocking Filter

Motion Estimation

TensorFlow

Nor

mal

ized

Run

tim

e

CPU-Only PIM-Core PIM-Acc

ChromeBrowser VideoPlaybackandCapture

TensorFlowMobile

Onaverage,energyconsumpBonreducesby44.6%usingPIMcoreand54.2%usingPIMaccelerator

ChromeRenderingPipeline

Scrolling Tab Switching •  Chrome employs a multi-process architecture

•  Main operations during tab switching: –  Context switch –  Load the new page

CPU

DRAM

ZRAM

CompressedTab

12

CPUCPU-Only

Memory CPUCPU + PIM

PIM%me

SwapoutNpages

ReadNPages

Compress

Othertasks

Writeback

compression

ZRAM

SwapoutNpages

OthertasksCompress

ZRAMNooff-chipdata

movementdatamovement

high

UncompressedPages

UncompressedPages

BothfuncBonscanbenefitfromPIMexecuBonandcanbeimplementedatPIMlogic

ColorbliZngisagoodcandidateforPIMexecuBon

Generatesalargeamountofdatamovement

Requireslow-costoperaBons:Memset,simplearithmeBc,andshiOoperaBons

Accountsfor19.1%ofthetotalsystemenergyduringscrolling

ColorBliZng

TextureTiling

ColorbliZngisfeasibletoimplementinPIMlogic

Recommended