Google’s Training Chips Revealed: TPUv2 and TPUv3

Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman P. Jouppi, and David Patterson


Page 1: Google’s Training Chips Revealed: TPUv2 and TPUv3

Google’s Training Chips Revealed: TPUv2 and TPUv3

Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman P. Jouppi, and David Patterson

Page 2: Google’s Training Chips Revealed: TPUv2 and TPUv3

Thanks to the Team

Page 3: Google’s Training Chips Revealed: TPUv2 and TPUv3

Production Pipeline

Figure: capability of each pipeline stage (Data, Training, Serving) versus the target level, circa 2014.

Page 4: Google’s Training Chips Revealed: TPUv2 and TPUv3

Production Pipeline

Figure: the same capability chart, circa 2015: TPUv1 now accelerates Serving.

Page 5: Google’s Training Chips Revealed: TPUv2 and TPUv3

Production Pipeline

Figure: the same capability chart, circa 2017: TPUv2 accelerates Training, alongside TPUv1 for Serving.

Page 6: Google’s Training Chips Revealed: TPUv2 and TPUv3

Challenges of ML Training

• More Computation: backprop, transpose, derivatives
• More Memory: keep data around for backprop
• Wider Operands: need dynamic range (more than int8); see the numeric sketch below
• More Programmability: user experimentation, optimizers
• Harder Parallelization: scale-up instead of scale-out
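To make the wider-operands row concrete, here is a minimal numeric sketch (the gradient values are invented for illustration, not from the talk): a single int8 scale factor cannot span values separated by many orders of magnitude, while bfloat16 keeps float32's full exponent range.

```python
# Illustrative only: training tensors (e.g., gradients) can mix values
# separated by many orders of magnitude. int8 offers 256 levels under one
# scale factor; bfloat16 keeps float32's 8-bit exponent (~1e-38 to ~3e38).
import numpy as np

grads = np.array([3e-9, 2e-3, 7e1], dtype=np.float32)  # invented magnitudes
print(np.iinfo(np.int8).max)       # 127: the whole int8 dynamic range
print(np.finfo(np.float32).max)    # ~3.4e38: a range bfloat16 also covers
print(grads.max() / grads.min())   # ~2.3e10: far beyond one int8 scale
```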

Page 7: Google’s Training Chips Revealed: TPUv2 and TPUv3

Our Constraints

• Time
• Staffing: we have a lean team

A development day saved is a deployment day earned.

Page 8: Google’s Training Chips Revealed: TPUv2 and TPUv3

Key Goals (that is, where will we do a great job?)

Build it quickly

Achieve high performance...

...at scale

...for new workloads out-of-the-box

...all while being cost effective

Page 9: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv1 Recap

Block diagram: the host connects through PCIe queues; DDR3 feeds the Matrix Multiply Unit directly; activations flow from Activation Storage through the Matrix Multiply Unit into Accumulators and a fixed-function Activation Pipeline, then back to Activation Storage. Legend: Compute / Memory / Host.


Page 13: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

Starting point: the TPUv1 block diagram above (PCIe queues, Matrix Multiply Unit, Activation Storage, Activation Pipeline, Accumulators, DDR3).


Page 16: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

A single Vector Memory replaces the separate buffers between fixed-function units. (Diagram now: Matrix Multiply Unit, Vector Memory, Activation Pipeline, DDR3, PCIe queues.)


Page 18: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

A general-purpose Vector Unit replaces the fixed-function activation pipeline. (Diagram now: Matrix Multiply Unit, Vector Memory, Vector Unit, DDR3, PCIe queues.)


Page 20: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

The Matrix Multiply Unit is connected as an offload for the Vector Unit.


Page 22: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

DRAM is connected into the memory system instead of directly into the matrix unit.

Page 23: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

The DDR3 DRAM is replaced with HBM for bandwidth.

Page 24: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

An interconnect is added for high-bandwidth scaling.

Page 25: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv2 Changes

TADA! The result: Matrix Multiply Unit, Vector Memory, Vector Unit, HBM, PCIe queues, and Interconnect.

Page 26: Google’s Training Chips Revealed: TPUv2 and TPUv3

Redrawn with more detail: Core 0 contains a Scalar Unit, a Vector Unit, a Matrix Multiply Unit, and a Transpose/Permute Unit, backed by HBM; the host attaches through PCIe queues; an on-chip Interconnect Router fans out to four Links. Legend: Compute / Memory / Host / Interconnect.

Page 27: Google’s Training Chips Revealed: TPUv2 and TPUv3

Previous slide, completed: the chip has two such cores (Core 0 and Core 1), each with its own Scalar, Vector, Matrix Multiply, and Transpose/Permute Units, HBM, and PCIe queues, sharing the Interconnect Router and its four Links. Legend: Compute / Memory / Host / Interconnect.

Page 28: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Core

● VLIW architecture
  ○ Leverage known compiler techniques
● Linear algebra ISA
  ○ Scalar, vector, and matrix
  ○ Built for generality

Page 29: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Core: Scalar Unit

● 322b VLIW bundle (sketched below)
  ○ 2 scalar slots
  ○ 4 vector slots (2 for load/store)
  ○ 2 matrix slots (push, pop)
  ○ 1 misc slot
  ○ 6 immediates
● Scalar Unit performs:
  ○ Full VLIW bundle fetch and decode
  ○ Scalar slot execution
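A hypothetical sketch of that bundle layout as a data structure. Only the slot counts come from the slide; the class, field names, and opcode mnemonics are invented for illustration, and field widths are not specified in the talk.

```python
# Slot mix of the 322b VLIW bundle per the slide; everything else here
# (names, mnemonics) is illustrative, not the actual ISA encoding.
from dataclasses import dataclass
from typing import List

@dataclass
class VLIWBundle:
    scalar_ops: List[str]  # 2 scalar slots, executed by the Scalar Unit
    vector_ops: List[str]  # 4 vector slots (2 of them for load/store)
    matrix_ops: List[str]  # 2 matrix slots (push operands, pop results)
    misc_op: str           # 1 misc slot
    immediates: List[int]  # 6 immediates shared across the slots

bundle = VLIWBundle(
    scalar_ops=["s_add", "s_branch"],
    vector_ops=["v_load", "v_mul", "v_add", "v_store"],
    matrix_ops=["m_push", "m_pop"],
    misc_op="sync",
    immediates=[0, 1, 8, 128, 0, 0],
)
```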

Page 30: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Core: Scalar Unit

Datapath diagram: an Instruction Bundle Memory feeds Scalar Decode and Issue, which dispatches to the Vector and Matrix Units; the scalar side has a 32 x 32b Scalar Register File, two ALUs (ALU0, ALU1), Sync Flags, a DMA Interface, and a 4Ki x 32b Scalar Memory.

Page 31: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Core: Vector Unit (Lane)

Datapath diagram: each lane has a 32 x 32b Vector Register File, two ALUs (ALU0, ALU1), and a 32Ki x 32b Vector Memory; Vector Control takes control from the Scalar Unit, and a DMA Interface plus paths to and from the Matrix Units move data in and out. The structure is replicated x8 (sublanes) and x128 (lanes).

Page 32: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Core: Matrix Multiply Unit

● 128 x 128 systolic array
  ○ Streaming LHS and results
  ○ Stationary RHS (w/ optional transpose)
● Numerics
  ○ bfloat16 multiply (see the bit-level sketch below)
    ■ {s, e, m} = {1, 8, 7}
    ■ The original!
  ○ float32 accumulation
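A minimal numpy sketch of the {1, 8, 7} format: bfloat16 is just the top half of a float32, so a truncating conversion is a 16-bit shift. Real hardware typically also rounds to nearest; that step is omitted here.

```python
import numpy as np

def f32_to_bf16_bits(x):
    # Reinterpret float32 bits and keep the top 16: 1 sign, 8 exponent,
    # and 7 mantissa bits. Truncation only; rounding is omitted.
    return (np.asarray(x, np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    # Re-expand to float32 by zero-filling the 16 dropped mantissa bits.
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, 1e-30, 3e38], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))
# -> approximately [3.1406, 1e-30, 2.99e38]: only ~3 decimal digits of
# precision survive, but the full float32 exponent range does, which is
# exactly the dynamic-range-over-precision tradeoff training needs.
```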

Page 33: Google’s Training Chips Revealed: TPUv2 and TPUv3

Why 128x128? (Constant FLOPS)

Chart, reconstructed as a table (both series normalized within their columns):

Array size (count at constant FLOPS) | Operations per operand | Utilization
256x256 (1x)                         | 4x                     | 1x (normalized)
128x128 (4x)                         | 2x                     | 1.6x
64x64 (16x)                          | 1x (normalized)        | 1.7x

128x128 is the sweet spot: it captures most of the utilization gain of smaller tiles (1.6x vs. 1.7x) while keeping twice the operand reuse of 64x64.
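A back-of-envelope check on the constant-FLOPS framing (my arithmetic, not from the slide): the three configurations hold the same number of multipliers, and an operand streamed into a $K \times K$ systolic array is reused $K$ times, which is where the 4x : 2x : 1x operations-per-operand ratios come from.

$$1 \times 256^2 = 4 \times 128^2 = 16 \times 64^2 = 65{,}536 \text{ MACs}, \qquad \frac{\text{ops}}{\text{operand}} \propto K \;\Rightarrow\; 4\text{x} : 2\text{x} : 1\text{x} \text{ for } K = 256, 128, 64.$$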

Page 34: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Core: Transpose, Reduction, Permute Unit

● Efficient common matrix transforms
  ○ Transpose
  ○ Reduction
  ○ Permutation
● Generally, allows reshuffling of data across vector lanes (see the numpy analogy below)
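For readers who think in numpy, the three transforms map onto familiar array operations. This is an analogy only; on chip these run in dedicated hardware across the 128 vector lanes.

```python
import numpy as np

a = np.arange(16, dtype=np.float32).reshape(4, 4)
t = a.T                   # transpose: swap rows and columns
r = a.sum(axis=0)         # reduction: collapse values across lanes
p = a[:, [2, 0, 3, 1]]    # permutation: reshuffle data across lanes
```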

Page 35: Google’s Training Chips Revealed: TPUv2 and TPUv3

Memory System

● The TPU core issues loads and stores against SRAM scratchpads
  ○ Provides predictable scheduling within the core
  ○ Can stall on sync flags
● HBM is accessible through asynchronous DMAs
  ○ DMAs indicate completion in sync flags

A sketch of the DMA-plus-sync-flag pattern follows below.
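A minimal sketch of that pattern as double-buffered pseudocode; dma_start, wait_ge, and compute are hypothetical names standing in for the real mechanisms, not a published API.

```python
# Pseudocode: overlap asynchronous HBM DMAs with statically scheduled
# compute on the scratchpad, with a sync flag ordering the two.
def double_buffered_loop(hbm_tiles, vmem, flag):
    dma_start(src=hbm_tiles[0], dst=vmem[0], flag=flag)  # prefetch tile 0
    for i in range(len(hbm_tiles)):
        if i + 1 < len(hbm_tiles):  # kick off the next tile's DMA early
            dma_start(src=hbm_tiles[i + 1], dst=vmem[(i + 1) % 2], flag=flag)
        wait_ge(flag, i + 1)        # stall until tile i has landed in vmem
        compute(vmem[i % 2])        # predictable, compiler-scheduled work
```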

Pages 36 and 37 repeat the Memory System slide with two annotations: the DMAs stride over vectors, and the HBM path is labeled "Bandwidth!"

Page 38: Google’s Training Chips Revealed: TPUv2 and TPUv3

Memory System

Diagram: two TPU cores, each with its own HBM and PCIe queues, joined through the Interconnect Router.

Page 39: Google’s Training Chips Revealed: TPUv2 and TPUv3

Memory System

The same diagram, with connection widths drawn proportional to bandwidths.

Page 40: Google’s Training Chips Revealed: TPUv2 and TPUv3

Interconnect

● On-die router with 4 links
● 500 Gbps per link
● Assembled into 2D torus
● Software view (sketched below):
  ○ Uses DMAs just like HBM
  ○ Restricted to push DMAs
  ○ Simply target another chip id
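A hypothetical sketch of that software view (dma_push and wait_ge are invented names, not a published API): a remote chip is addressed much like local HBM, with the restriction that you can only push.

```python
# Pseudocode: one reduce step of a ring all-reduce built from push DMAs.
# Remote memory is written by targeting another chip id, much as a DMA
# to HBM would target a local address; pull DMAs are not available.
def ring_reduce_step(my_shard, right_neighbor_chip_id, recv_buf, flag):
    dma_push(dst_chip=right_neighbor_chip_id,  # target another chip id
             src=my_shard, dst_buf=recv_buf, flag=flag)
    wait_ge(flag, 1)            # our left neighbor's push has landed
    return my_shard + recv_buf  # combine local and received shards
```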

Page 41: Google’s Training Chips Revealed: TPUv2 and TPUv3

Floorplan

Die plot colored by function. Legend: Compute / Memory / Host / Interconnect / Routing.

Page 42: Google’s Training Chips Revealed: TPUv2 and TPUv3

The Anti-Second System

TPUv3

Page 43: Google’s Training Chips Revealed: TPUv2 and TPUv3

Baseline: the TPUv2 two-core block diagram (TPU Core 0 and TPU Core 1, each with Matrix Multiply, Vector, Transpose/Permute, and Scalar Units, HBM, and PCIe queues, sharing the Interconnect Router and four Links). The TPUv3 changes follow. Legend: Compute / Memory / Host / Interconnect.

Page 44: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 change: the Matrix Multiply Unit is doubled (2x per core).

Page 45: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 changes so far: 2x Matrix Multiply Units per core; +30% clock frequency.

Page 46: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 changes so far: 2x Matrix Multiply Units per core; +30% clock frequency; +30% memory bandwidth.

Page 47: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 changes so far: 2x Matrix Multiply Units per core; +30% clock frequency; +30% memory bandwidth; 2x HBM capacity.

Page 48: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 changes so far: 2x Matrix Multiply Units per core; +30% clock frequency; +30% memory bandwidth; 2x HBM capacity; +30% interconnect link bandwidth.

Page 49: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 changes, complete: 2x Matrix Multiply Units per core; +30% clock frequency; +30% memory bandwidth; 2x HBM capacity; +30% interconnect link bandwidth; 4x nodes per system.

Page 50: Google’s Training Chips Revealed: TPUv2 and TPUv3

Retrospective: Key Goals

Build it quickly

Achieve high performance...

...at scale

...for new workloads out-of-the-box

...all while being cost effective

Page 51: Google’s Training Chips Revealed: TPUv2 and TPUv3

Retrospective: Build it quickly

● Co-design: simplified hardware with software predictability (e.g., VLIW, scratchpads)
● Willingness to make tradeoffs

Page 52: Google’s Training Chips Revealed: TPUv2 and TPUv3

Retrospective: Achieve high performance...

● Compute density with the bfloat16 systolic array
● HBM to feed the compute
● XLA compiler optimizations (see the JAX sketch below)
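As one concrete, modern example of what handing a program to XLA looks like from user code, here is a JAX sketch: jax.jit compiles the whole function, and XLA can fuse the elementwise work around the matmul so the intermediate never round-trips through HBM. This illustrates XLA in general, not the specific optimizations the talk had in mind.

```python
import jax
import jax.numpy as jnp

@jax.jit  # compile the whole function with XLA
def fused_layer(x, w, b):
    # The matmul maps to the matrix unit; bias-add and relu can run on
    # the vector unit, fused so the intermediate stays on chip.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((128, 512)); w = jnp.ones((512, 256)); b = jnp.zeros((256,))
y = fused_layer(x, w, b)
```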

Page 53: Google’s Training Chips Revealed: TPUv2 and TPUv3

Retrospective: ...at scale

● System-first approach
● Interconnect with a familiar interface for ease of use

Page 54: Google’s Training Chips Revealed: TPUv2 and TPUv3

Retrospective: ...for new workloads out-of-the-box

● Flexible big-data cores within a principled linear algebra framework
● XLA compiler
● HBM capacity

Page 55: Google’s Training Chips Revealed: TPUv2 and TPUv3

Retrospective: ...all while being cost effective

● Matrix Unit efficiency
● Simplicity
● High performance for good perf/$

Page 56: Google’s Training Chips Revealed: TPUv2 and TPUv3

System & Performance

Page 57: Google’s Training Chips Revealed: TPUv2 and TPUv3

Supercomputer with dedicated interconnect

• TPUv1: a single-chip system, built as a coprocessor to a CPU; works well for inference
• TPUv2, v3: ML supercomputers; multi-chip scaling is critical for practical training times
  • A single TPUv2 chip would take 60 to 400 days on production workloads

Page 58: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Training Pod Architecture

Diagram: TPU boards interconnected in a 2D torus, forming a dedicated network for synchronous parallel training.

Page 59: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPU Training Pod Architecture

Diagram: each host attaches to a TPU board over PCI-e; host NICs connect through a TOR switch to the cluster network and storage; the TPU boards themselves form the dedicated 2D torus for synchronous parallel training (sketched below).
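A minimal sketch of the synchronous data-parallel pattern this dedicated network serves, written with JAX (which targets TPUs through XLA). The toy model and names are illustrative, but jax.pmap and jax.lax.pmean are the real primitives that lower to all-reduces over the interconnect.

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    return jnp.mean((x @ params - y) ** 2)  # toy linear model

@functools.partial(jax.pmap, axis_name="chips")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # Synchronous step: average gradients across all chips; this is the
    # all-reduce that runs over the dedicated torus network.
    grads = jax.lax.pmean(grads, axis_name="chips")
    return params - 0.01 * grads

n = jax.local_device_count()  # one leading-axis entry per device
params = jnp.zeros((n, 8))
x = jnp.ones((n, 32, 8)); y = jnp.ones((n, 32))
params = train_step(params, x, y)
```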

Page 60: Google’s Training Chips Revealed: TPUv2 and TPUv3

Supercomputer with dedicated interconnect

TPUv2 boards = 4 chips; a TPUv2 supercomputer = 256 chips.

Page 61: Google’s Training Chips Revealed: TPUv2 and TPUv3

Supercomputer with dedicated interconnect

TPUv2 and TPUv3 boards = 4 chips each; TPUv2 supercomputer = 256 chips; TPUv3 supercomputer = 1024 chips.

Page 62: Google’s Training Chips Revealed: TPUv2 and TPUv3

Supercomputer with dedicated interconnect

TPUv2 supercomputer: 11.5 petaflops, 4 TB HBM, 2-D torus, 256 chips.

TPUv3 supercomputer: > 100 petaflops, 32 TB HBM, 1024 chips, liquid cooled; a new chip and a larger-scale system.

Page 63: Google’s Training Chips Revealed: TPUv2 and TPUv3

6 Production Applications

• Multilayer Perceptrons (MLP)
  • MLP0 is unpublished
  • MLP1 is RankBrain [Cla15]
• Convolutional Neural Networks (CNN)
  • CNN0 is AlphaZero, which mastered chess, Go, and shogi [Sil18]
  • CNN1 is a Google-internal model for image recognition
• Recurrent Neural Networks (RNN)
  • RNN0 is a translation model [Che18]
  • RNN1 is a speech model [Chi18]

[Cla15] Clark, J. Google Turning Its Lucrative Web Search Over to AI Machines. Bloomberg Technology, October 26, 2015.
[Che18] Chen, M.X. et al. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. arXiv preprint arXiv:1804.09849, 2018.
[Chi18] Chiu, C.C. et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In IEEE Int'l Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774-4778, April 2018.
[Sil18] Silver, D. et al. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419), pp. 1140-1144, 2018.

Page 64: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 Supercomputer Scaling: 6 Production Apps

● MLP0 & MLP1
  ○ 40% & 14% of perfect linear scaling at 1024-chip scale
    ■ Limited by embeddings

Page 65: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 Supercomputer Scaling: 6 Production Apps

● MLP0 & MLP1
  ○ 40% & 14% of perfect linear scaling
● CNN0
  ○ 96% of perfect linear scaling!
● CNN1, RNN0, RNN1
  ○ These 3 production apps run at 99% of perfect linear scaling at 1024 chips!

Page 66: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 Supercomputer Scaling: MLP0-next vs. MLP0

● Newer, larger models plus SW improvements deliver better quality and improved scaling
  ○ MLP0-next: 67% of perfect linear scaling at 1024 chips
    ■ Up from 40% for MLP0

Page 67: Google’s Training Chips Revealed: TPUv2 and TPUv3

TPUv3 vs TPUv2: Memory Bound or 2X MXUs Useful?

Chart: TPUv3-over-TPUv2 speedups for MLPerf 0.6 benchmarks, production apps, and HW specs; the two workload sets have geometric-mean speedups of 1.82 and 1.78.

Page 68: Google’s Training Chips Revealed: TPUv2 and TPUv3

Speedup: MLPerf v0.5 (11/2018) to v0.6 (5/2019)

Production apps also sped up:
● CNN0: 1.8x (more bfloat16 use)
● MLP0: 1.6x (better partitioning and placement of embeddings)

Performance enables larger models for improved accuracy.

Page 69: Google’s Training Chips Revealed: TPUv2 and TPUv3

Inference: TPUv2/v3 vs TPUv1

● Training chips can also do inference (it looks like a forward pass)
● bfloat16 numerics in TPUv2/v3 vs int8 in TPUv1

Chart: inference throughput for TPUv1, TPUv2, TPUv2 with a bigger batch, and TPUv3 with a bigger batch.

Page 70: Google’s Training Chips Revealed: TPUv2 and TPUv3

Key Takeaways

● TPUv2/v3 supercomputers with 256-1024 chips run production applications at scale, powering many Google products
● TPUs are widely used to accelerate both production and research
● Proven results from Model/HW/SW codesign, with more opportunities still available