Copyright (C) by William J. Dally, All Rights Reserved. EE482C, L2, Apr 9, 2002
EE482S
Lecture 2
Discussion of 2 Papers
Conclusion of Introductory Material
April 9, 2002
William J. Dally
Computer Systems Laboratory
Stanford
[email protected]
Today’s Class Meeting
• Discuss two papers
  – Imagine: Media Processing with Streams
    • Khailany et al., IEEE Micro, March-April 2001
  – Polygon Rendering on a Stream Architecture
    • Owens et al., Eurographics HWWS, 2000
• Conclude introductory discussion on Stream Architecture
  – What is a stream processor
Discussion of Imagine Paper
Discussion of Polygon Rendering Paper
What is a Stream Processor?
• A processor that is optimized to execute a stream program
• Features include
  – Exploit parallelism
    • TLP with multiple processors
    • DLP with multiple clusters within each processor
    • ILP with multiple ALUs within each cluster
  – Exploit locality with a bandwidth hierarchy
    • Kernel locality within each cluster
    • Producer-consumer locality within each processor
• Many different possible architectures
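To make the terms above concrete, here is a minimal Python sketch of a stream program: two kernels chained through an intermediate stream. It is not Imagine code or any real stream language. The names `scale_kernel`, `offset_kernel`, and `run_on_clusters`, and the default of 8 clusters, are illustrative assumptions. The round-robin loop only models data-level parallelism across clusters; it runs sequentially here.

```python
from typing import Callable, List

# Hypothetical kernel type: one input record in, one output record out.
Kernel = Callable[[float], float]


def scale_kernel(x: float) -> float:
    """First kernel (the producer): scales each record."""
    return 2.0 * x


def offset_kernel(x: float) -> float:
    """Second kernel (the consumer): offsets each record."""
    return x + 1.0


def run_on_clusters(kernel: Kernel, stream: List[float],
                    num_clusters: int = 8) -> List[float]:
    """Apply one kernel to every record of a stream.

    Records are dealt round-robin to the clusters and every cluster
    runs the same kernel on its own records (a model of DLP).
    """
    out = [0.0] * len(stream)
    for cluster in range(num_clusters):
        # Cluster `cluster` handles records cluster, cluster + N, ...
        for i in range(cluster, len(stream), num_clusters):
            out[i] = kernel(stream[i])
    return out


def stream_program(input_stream: List[float]) -> List[float]:
    """Chain two kernels through an intermediate stream.

    The intermediate stream stands in for data that would stay on
    chip between kernels (producer-consumer locality) instead of
    making a round trip to memory.
    """
    intermediate = run_on_clusters(scale_kernel, input_stream)
    return run_on_clusters(offset_kernel, intermediate)


result = stream_program([0.0, 1.0, 2.0, 3.0])
print(result)
```

The point of the structure is that only `input_stream` and the final result need to touch memory; `intermediate` is produced by one kernel and consumed by the next.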
The Imagine Stream Processor
[Figure: Imagine Stream Processor block diagram. Blocks shown: Host Processor, Stream Controller, Microcontroller, Stream Register File, ALU Clusters 0 to 7, Network Interface (to Network), and Streaming Memory System with four SDRAMs.]
Arithmetic Clusters
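[Figure: arithmetic cluster. Functional units labeled + + + * * / and a CU connected to the Intercluster Network; each unit is fed by a Local Register File through a Cross Point; streams arrive From SRF and return To SRF.]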
A Bandwidth Hierarchy exploits locality and concurrency
[Figure: bandwidth hierarchy. Four SDRAMs (2 GB/s), Stream Register File (32 GB/s), ALU Clusters (544 GB/s).]
• VLIW clusters with shared control
• 41.2 32-bit floating-point operations per word of memory BW
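The numbers on this slide can be related with a little arithmetic. The sketch below normalizes the three bandwidths to memory bandwidth and multiplies the stated operations per word by the memory word rate. The 4-byte word size is an assumption taken from "32-bit", and the resulting operation rate is a derived value, not one quoted on the slide.

```python
# Bandwidths from the slide, in GB/s.
MEMORY_GBPS = 2.0
SRF_GBPS = 32.0
LRF_GBPS = 544.0

OPS_PER_MEMORY_WORD = 41.2  # from the slide
BYTES_PER_WORD = 4          # assumption: "32-bit" words are 4 bytes


def hierarchy_ratios(memory: float, srf: float, lrf: float):
    """Return the three bandwidths normalized to memory bandwidth."""
    return (1.0, srf / memory, lrf / memory)


def implied_gops(memory_gbps: float, ops_per_word: float,
                 bytes_per_word: int = BYTES_PER_WORD) -> float:
    """Operation rate implied by ops-per-word times memory word rate."""
    gwords_per_s = memory_gbps / bytes_per_word
    return ops_per_word * gwords_per_s


print(hierarchy_ratios(MEMORY_GBPS, SRF_GBPS, LRF_GBPS))  # (1.0, 16.0, 272.0)
print(implied_gops(MEMORY_GBPS, OPS_PER_MEMORY_WORD))
```

Read this way, each level of the hierarchy supplies roughly an order of magnitude more bandwidth than the one below it, which is what lets the ALUs stay busy on 2 GB/s of memory traffic.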
A Bandwidth Hierarchy exploits kernel and producer-consumer locality
Application         Memory BW    Global RF BW   Local RF BW
Depth Extractor     0.80 GB/s    18.45 GB/s     210.85 GB/s
MPEG Encoder        0.47 GB/s    2.46 GB/s      121.05 GB/s
Polygon Rendering   0.78 GB/s    4.06 GB/s      102.46 GB/s
QR Decomposition    0.46 GB/s    3.67 GB/s      234.57 GB/s
[Figure: bandwidth hierarchy. Four SDRAMs (2 GB/s), Stream Register File (32 GB/s), ALU Clusters (544 GB/s).]
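One way to read the measured bandwidths is to normalize each row to its memory bandwidth and to ask what share of the summed traffic is served by the local register files. The sketch below does that with the values from the table; the function names and the "share of summed bandwidth" metric are illustrative choices, not something defined in the lecture.

```python
# (memory, global RF, local RF) bandwidth in GB/s, from the table.
MEASURED_GBPS = {
    "Depth Extractor": (0.80, 18.45, 210.85),
    "MPEG Encoder": (0.47, 2.46, 121.05),
    "Polygon Rendering": (0.78, 4.06, 102.46),
    "QR Decomposition": (0.46, 3.67, 234.57),
}


def relative_to_memory(mem: float, global_rf: float, local_rf: float):
    """Global and local RF bandwidth as multiples of memory bandwidth."""
    return (global_rf / mem, local_rf / mem)


def local_share(mem: float, global_rf: float, local_rf: float) -> float:
    """Fraction of the summed bandwidth that is local RF bandwidth."""
    return local_rf / (mem + global_rf + local_rf)


for app, bw in MEASURED_GBPS.items():
    g, l = relative_to_memory(*bw)
    print(f"{app:18s} 1 : {g:6.1f} : {l:6.1f}   "
          f"local share {local_share(*bw):.1%}")
```

In every row the local register files carry more than 90% of the summed bandwidth, which is the kernel locality the slide title refers to.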
Producer-Consumer Locality in the Depth Extractor
[Figure: depth extractor data flow, drawn in three columns: Memory/Global Data, SRF/Streams, Clusters/Kernels.
Kernels: Convolution (Gaussian), Convolution (Laplacian), SAD.
Streams: row of pixels, previous partial sums, new partial sums, blurred row, sharpened row, filtered row segment (two inputs to SAD), depth map row segment.
Ratio annotation: 1 : 23 : 317]
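The kernel chain in the figure (Gaussian convolution, Laplacian convolution, then SAD) can be sketched in a few lines. This is a heavily simplified, hypothetical 1-D version: the filter taps, the disparity range, the window size, and every function name are assumptions made for illustration, and the partial-sum streams that carry state between rows in the figure are omitted.

```python
from typing import List, Sequence

# Illustrative symmetric 3-tap filters (assumed, not from the lecture).
GAUSSIAN = (0.25, 0.5, 0.25)
LAPLACIAN = (-1.0, 2.0, -1.0)


def convolve_row(row: Sequence[float], taps: Sequence[float]) -> List[float]:
    """1-D filter over one row, clamping indices at the row edges.

    The taps are symmetric, so correlation and convolution coincide.
    """
    half, n = len(taps) // 2, len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(taps):
            j = min(max(i + k - half, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out


def filter_row(row: Sequence[float]) -> List[float]:
    """Blur, then sharpen. The blurred row is consumed immediately."""
    blurred = convolve_row(row, GAUSSIAN)
    return convolve_row(blurred, LAPLACIAN)


def sad_depth_row(left: Sequence[float], right: Sequence[float],
                  max_disparity: int = 3, window: int = 1) -> List[int]:
    """For each pixel, pick the disparity with the smallest
    sum of absolute differences (SAD) over a small window."""
    n = len(left)
    depth = []
    for i in range(n):
        best_d, best_cost = 0, float("inf")
        for d in range(max_disparity + 1):
            cost = 0.0
            for o in range(-window, window + 1):
                li = min(max(i + o, 0), n - 1)
                ri = min(max(i + o - d, 0), n - 1)
                cost += abs(left[li] - right[ri])
            if cost < best_cost:
                best_d, best_cost = d, cost
        depth.append(best_d)
    return depth


def depth_extractor_row(left_row, right_row) -> List[int]:
    """Two filtered row segments feed SAD, as in the figure."""
    return sad_depth_row(filter_row(left_row), filter_row(right_row))


print(depth_extractor_row([1.0, 5.0, 2.0, 8.0, 3.0, 7.0],
                          [1.0, 5.0, 2.0, 8.0, 3.0, 7.0]))
```

Only the input rows and the depth row would cross the memory boundary; the blurred and filtered rows are produced and consumed between kernels, which is the producer-consumer locality the slide illustrates.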
Bandwidth Demand of Applications
Die Plot
Die Photos
• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm
• 300 MHz TTTT, hope for 400 MHz in lab
• Chips arrived 4/1/02, no fooling!
[Khailany et al., ICCD ’02 (submitted)]
Performance demonstrated on signal and image processing
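[Figure: bar chart, vertical axis GOPS from 0 to 30. Categories: depth, mpeg, qrd, dct, convolve, fft. Bar values in the order extracted: 12.1, 19.8, 11.0, 23.9, 25.6, 7.0. Legend: 16-bit kernels, 16-bit applications, floating-point application, floating-point kernel.]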
Initial studies indicate that it also applies to solving PDEs and ODEs
Architecture of a Streaming Supercomputer
[Figure: packaging hierarchy of the streaming supercomputer]
• Node
  – Stream Processor: 64 FPUs, 64 GFLOPS
  – 16 x DRDRAM: 2 GBytes, 38 GBytes/s
  – Network: 20 GBytes/s, 32+32 pairs
• Board (Node, Node 2, ..., Node 16 on an On-Board Network)
  – 16 Nodes, 1K FPUs, 1 TFLOPS, 32 GBytes
  – 160 GBytes/s, 256+256 pairs, 10.5" Teradyne GbX
• Cabinet (Board, Board 2, ..., Board 64 on an Intra-Cabinet Network, passive - wires only)
  – 64 Boards, 1K Nodes, 64K FPUs, 64 TFLOPS, 2 TBytes
  – E/O and O/E conversion, 5 TBytes/s, 8K+8K links, Ribbon Fiber
• Inter-Cabinet Network (Cabinet, Cabinet 2, ..., Cabinet 16)
  – Bisection 64 TBytes/s
• All links 5 Gb/s per pair or fiber
• All bandwidths are full duplex
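The figures on this slide are mutually consistent, which a short calculation can confirm. The sketch below scales the per-node numbers up through the board and cabinet levels and derives each channel bandwidth from its pair count at 5 Gb/s per pair. Reading "1K" as 1024 is an assumption, and the 16-cabinet system totals are derived values that do not appear on the slide.

```python
K = 1024  # assumption: the slide's "K" means 1024

NODE = {"fpus": 64, "gflops": 64, "mem_gbytes": 2}
NODES_PER_BOARD = 16
BOARDS_PER_CABINET = 64
CABINETS = 16
LINK_GBITS_PER_S = 5  # per pair or fiber, per direction


def scale(level: dict, count: int) -> dict:
    """Totals for `count` copies of a packaging level."""
    return {name: value * count for name, value in level.items()}


def channel_gbytes_per_s(pairs_per_direction: int) -> float:
    """Bandwidth in one direction of a channel built from 5 Gb/s pairs."""
    return pairs_per_direction * LINK_GBITS_PER_S / 8


board = scale(NODE, NODES_PER_BOARD)        # 1K FPUs, 1 TFLOPS, 32 GBytes
cabinet = scale(board, BOARDS_PER_CABINET)  # 64K FPUs, 64 TFLOPS, 2 TBytes
system = scale(cabinet, CABINETS)           # derived, not on the slide

print(board, cabinet, system, sep="\n")
print(channel_gbytes_per_s(32))      # node channel: 20.0 GBytes/s
print(channel_gbytes_per_s(256))     # board channel: 160.0 GBytes/s
print(channel_gbytes_per_s(8 * K))   # cabinet channel: 5120.0 GBytes/s
```

With these assumptions the cabinet channel works out to 5120 GBytes/s, which matches the slide's 5 TBytes/s when a TByte is taken as 1024 GBytes.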
Streaming processor
[Figure: stream processor block diagram, in three nested views.
Stream Processor: Stream Execution Unit, Stream Register File, Scalar Processor, Address Generators, Memory Control, On-Chip Memory, Local DRDRAM (38 GB/s), Network Interface, Network Channels (20 GB/s).
Stream Execution Unit: Cluster 0, Cluster 1, ..., Cluster 15; To/From SRF (260 GB/s).
Cluster: two FP MUL, two FP ADD, FP DSQ, Comm, and Scratch Pad units, each with Regs, joined by a Cluster Switch; LRF BW = 1.5 TB/s.]
Rough per-node budget
Item                   Cost     Per Node
Processor chip         200      200
Router chip            200      50
Memory chip            20       320
Board/Backplane        3000     188
Cabinet                50000    49
Power                  1        50
Per-Node Cost                   976
$/GFLOPS (64/node)              15
$/M-GUPS (250/node)             4
Preliminary numbers, parts cost only, no I/O included.
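The two cost-efficiency figures follow from the per-node cost and the per-node performance given in parentheses. A minimal check, assuming the cost is in dollars (suggested by the "$/GFLOPS" label) and using a hypothetical helper name:

```python
PER_NODE_COST = 976      # from the slide
GFLOPS_PER_NODE = 64     # from "(64/node)"
MGUPS_PER_NODE = 250     # from "(250/node)"


def cost_per_unit(cost: float, units_per_node: float) -> float:
    """Per-node cost divided by per-node performance."""
    return cost / units_per_node


print(cost_per_unit(PER_NODE_COST, GFLOPS_PER_NODE))  # 15.25
print(cost_per_unit(PER_NODE_COST, MGUPS_PER_NODE))   # 3.904
```

The slide reports these rounded to 15 and 4.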
Many open problems
• A small sampling
• Software
  – Program transformation
  – Program mapping
  – Bandwidth optimization
  – Conditionals
  – Irregular data structures
• Hardware
  – Alternative stream models
  – Register organization
  – Bandwidth hierarchies
  – Memory organization
  – Short stream issues
  – ISA design
  – Cluster organization
  – Processor organization
Next Time
• Walk through a streaming application