20
Copyright (C) by William J. Dally, All Rights Reserved EE482C, L2, Apr 9, 2002 1 EE482S Lecture 2 Discussion of 2 Papers Conclusion of Introductory Material April 9, 2002 William J. Dally Computer Systems Laboratory Stanford University [email protected]

EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 1

EE482SLecture 2

Discussion of 2 PapersConclusion of Introductory Material

April 9, 2002

William J. DallyComputer Systems Laboratory

Stanford [email protected]

Page 2: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 2

Today’s Class Meeting

• Discuss two papers– Imagine: Media Processing with Streams

• Khailany et al., IEEE Micro, March-April 2001– Polygon Rendering on a Stream Architecture

• Owens et al., Eurographics HWWS, 2000.

• Conclude introductory discussion on Stream Architecture– What is a stream processor

Page 3: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 3

Discussion of Imagine Paper

Page 4: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 4

Discussion of Polygon Rendering Paper

Page 5: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 5

What is a Stream Processor?

• A processor that is optimized to execute a stream program

• Features include– Exploit parallelism

• TLP with multiple processors• DLP with multiple clusters within each processor• ILP with multiple ALUs within each cluster

– Exploit locality with a bandwidth hierarchy• Kernel locality within each cluster• Producer-consumer locality within each processor

• Many different possible architectures

Page 6: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 6

The Imagine Stream Processor

Stream Register File NetworkInterface

StreamController

Imagine Stream Processor

HostProcessorA

LU

Clu

ster

0

AL

U C

lust

er 1

AL

U C

lust

er 2

AL

U C

lust

er 3

AL

U C

lust

er 4

AL

U C

lust

er 5

AL

U C

lust

er 6

AL

U C

lust

er 7

SDRAMSDRAM SDRAMSDRAM

Streaming Memory SystemM

icro

cont

rolle

r

Net

wor

k

Page 7: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 7

Arithmetic Clusters

CU

Inte

rclu

ster

Net

wor

k

+

From SRF

To SRF

+ + * * /

Cross Point

Local Register File

Page 8: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 8

A Bandwidth Hierarchy exploits locality and concurrency

2GB/s 32GB/s

SDRAM

SDRAM

SDRAM

SDRAM

Stre

am

Reg

iste

r File

ALU Cluster

ALU Cluster

ALU Cluster

544GB/s

• VLIW clusters with shared control• 41.2 32-bit floating-point operations per word of memory BW

Page 9: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 9

A Bandwidth Hierarchy exploits kernel and producer-consumer locality

Memory BW Global RF BW Local RF BWDepth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s

MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s

Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s

QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s

2GB/s 32GB/s

SDRAM

SDRAM

SDRAM

SDRAMSt

ream

R

egis

ter F

ile

ALU ClusterALU Cluster

ALU Cluster

544GB/s

Page 10: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 10

Producer-Consumer Locality in the Depth Extractor

Memory/Global Data SRF/Streams Clusters/Kernels

new partial sums

sharpened row

Convolution(Laplacian)

row of pixels

previous partial sumsConvolution(Gaussian)

new partial sums

blurred row

previous partial sums

filtered row segment

filtered row segment

previous partial sums

new partial sumsdepth map row segment

SAD

1 : 23 : 317

Page 11: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 11

Bandwidth Demand of Applications

Page 12: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 12

Die Plot

Page 13: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 13

Die Photos

• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm• 300 MHz TTTT, hope for 400 MHz in lab• Chips arrived 4/1/02, no fooling!

[Khailany et al., ICCD ’02 (submitted)]

Page 14: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 14

Performance demonstrated on signal and image processing

12.1

19.8

11.0

23.925.6

7.0

0

5

10

15

20

25

30

GO

PS

depth mpeg qrd dct convolve fft

16-bit kernels16-bit

applications

floating-pointapplication

floating-pointkernel

Page 15: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 15

Initial studies indicate that it also applies to solving PDEs and ODEs

Page 16: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 16

Architecture of a Streaming Supercomputer

StreamProcessor64 FPUs

64GFLOPS

16 xDRDRAM2GBytes

38GBytes/s

20GBytes/s32+32 pairs

Node

On-Board Network

Node2

Node16 Board 2

16 Nodes1K FPUs1TFLOPS32GBytes

Intra-Cabinet Network(passive - wires only)

Board 64

160GBytes/s256+256 pairs

10.5" Teradyne GbX

Board

Cabinet

Inter-Cabinet Network

Cabinet 264 Boards1K Nodes64K FPUs64TFLOPS

2TBytes

E/OO/E

5TBytes/s8K+8K links

Ribbon Fiber

Cabinet 16

Bisection 64TBytes/s

All links 5Gb/s per pair or fiberAll bandwidths are full duplex

Page 17: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 17

Streaming processor

Stream Execution Unit

Stream Register File

Scal

arPr

oces

sor

AddressGenerators

MemoryControl

On-

Chi

pM

emor

y

LocalDRDRAM(38 GB/s)

Net

wor

kIn

terfa

ce

NetworkChannels(20 GB/s)

Clu

ster

0

Clu

ster

1

Clu

ster

15

CommFP

MUL

FPMUL

FPADD

FPADD

FPDSQ

Regs

Regs

Regs

Regs

Clu

ster

Sw

itch

Regs

Regs

Regs

Regs

Regs

Regs

Regs

Regs

Scra

tch

Pad

Regs

Stream Execution Unit

Cluster

Stream Processor

To/From SRF(260 GB/s)

LRF BW = 1.5TB/s

Page 18: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 18

Rough per-node budget

200200Processor chip

4$/M-GUPS (250/node)

15$/GFLOPS (64/node)

976Per-Node Cost

501Power

4950000Cabinet

1883000Board/Backplane

32020Memory chip

50200Router chip

Per NodeCostItem

Preliminary numbers, parts cost only, no I/O included.

Page 19: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 19

Many open problems

• A small sampling• Software

– Program transformation– Program mapping– Bandwidth optimization– Conditionals– Irregular data structures

• Hardware– Alternative stream models– Register organization– Bandwidth hierarchies– Memory organization– Short stream issues– ISA design– Cluster organization– Processor organization

Page 20: EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L2, Apr 9, 2002 20

Next Time

• Walk through a streaming application