Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
How to teach a billion transistor chip a new trickDr. ir. Bart Kienhuis
University Leiden, LIACS
Computer Systems Group
http://www.liacs.nl
ICT-Kenniscongres 10 April 2006
2
Outline� Focus is programming ICs for
Stream based Applications� FPGAs
� Stream Based Applications� Multi-media
� Imaging
� Bio-informatics
� Classical DSP
� Increasing demand for high-performance, embedded compute performance
� We know how to build billion transistor FPGAs, but cannot program efficiently
FPGABillion
Transistors
Applications
(C / Matlab / Java)
pro
gra
mm
ing?
3
Stream Based Applications
� Imaging� HDTV 1080x720 @ 100 Hz
� 77 Million samples per seconds
� Suppose 100 operations per sample
� 8 Billion Operations per second
� Classical DSP� LOFAR One antenna
� Sampling rate 200*2 = 400 Million Samples per second
� 100 Operations per second (First stage)
� 40 Billion Operations per second
� There are 10.000 Antennas expected = 40*10^13 operations per second
� Next beamforming is performed, statistics obtained, but on decimated signal
Operation
10 – 10.00 Operations per Sample
10 – 1000 Million Samples per second
Stream
Sample
Real-Time
4
Compute Requirements
Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate Trend: ferocious appetite for more
embedded compute power
0
500
1000
1500
2000
2500
2000 2001 2002 2003 2004 2005 2006
Bil
lio
n M
AC
/s
HDTV
MPEG4
Video
over IP
3G
Wireless/
WCDMA
Future
Broadband
Standards
Voice
over IP General Purpose DSP/CPU
Market Requirements
Increasing
Gap
2.5G
?
5
Intrinsic Computational Efficiency (ICE)
E. Roza, System-on-chip: what are the limits?, IEE Electronics
Communication Engineering Journal, vol13, No6, Dec 2001, pp 249-255.
Computational efficiency
[MOPS/W]
i386SX
i486DXP5
68040
microsparc
Supersparc
601604
Ultrasparc P6
604e21164a
Turbosparc 604e
21364
7400
106
105
104
103
102
101
100
2 1 0.5 0.25 0.13 0.07Feature size [µm]
Microprocessors
Intrinsic Computational
Efficiency of Silicon
Pla
yin
g fie
ld
7
Xilinx Virtex II Pro
� Heterogeneous Architecture� Multi CPUs
� Distributed Memory Blocks
� Programmable Logic (IP cores)
� Commercially Available Platform� Virtex-4 is released and available
� Virtex-5 is on the drawing board
� FPGAs are becoming more important in Embedded System Design� Flexible since they can easily be
re-programmed
� More functionality at lower cost
� replacing a larger part of the ASICs market
Reconfigurable logic
Heterogeneous Architecture
PowerPCs
8
4
16 w
ord
s x
1 b
it me
mo
ry
[A,B,C,D]
F
� A 4-input lookup table
(LUT) can implement any
function of 4 inputs.
� For example, a 1-bit
adder needs 2 LUTs:
A⊕B⊕Ci
A.B.Ci
AB
Ci
Co
S
Building an FPGA: Logic First
01011
01010
10010
10001
FAddress
(ABCD)
9
4
FF
CE RST
M
16 w
ord
s x
1 b
it me
mo
ry
Carry M
M
M
Din
WECin
Cout
� Make LUT RAM
a user resource.
� Fast carry ripple
to neighbor.
Arithmetic, Distributed RAM
10
4
4
4
4
4
4
4
4
40
� Group logic cells to
reduce overhead.
� Add H, V routing
channels with
switchboxes.
� Add input, output
MUXing between
logic and routing.
Add Interconnect
11
4
4
4
4
4
4
4
4
40
4
4
4
4
4
4
4
4
40
4
4
4
4
4
4
4
4
40
4
4
4
4
4
4
4
4
40
Build an Array
12
FPGA Technology
� Virtex-4 has already over a billion transistors!
� FPGA can take most advantage of Moore’s law� More an more logic
available on the same die
� CPUs in programmable logic � Hardcore =
� Softcore =
13
1/91 1/92 1/93 1/94 1/95 1/96 1/97 1/98 1/99 1/00 1/01 1/02 1/03 1/04
1
10
100
1000
Capacity
Speed
Price Virtex-II Pro
Spartan-3
Virtex-II Pro
Memory
Bandwidth
Year
FPGA, the last 10 years
Xilinx Research, Kees Vissers
14
FPGA Programming� FPGA is flexible in Hardware
and Software…� How to teach a billion
transistor chip a new trick
� Solution Industry doesn’t work
� Programming means:� Write “C” programs for the
Micro processors
� Write “VHDL” programs for the IP cores and interconnection network
� “C” programs and “VHDL” programs need to work together� High performance
� Error proof
� Nightmare to debug
� Programming is currently done manually with lots of engineers
“C”
“C”
“C”
“C”
VHDL
Virtex-II Pro FPGA
15
Productivity
Lines of Code/Man Month
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1981
1985
1989
1993
1997
2001
2005
2009
Logic transistors per chip
(K)
10
100
1,000
10,000
100,000
1,000,000
10,000,000
Logic Tr./Chip
58% / Yr. compoundcomplexity growth rate
Loc/MM
8-10% / Yr. compoundproductivity growth rate
Productivitygap
Software Productivity
We know how to build Billion transistor chips, but do not know how to program them efficiently
16
Research at LIACS (COMPAAN)
Kahn Process Network
Matlab/C/Java
Compaan
FPGA Laura /Espam
for j = 1:1:N,
[x(j)] = S1( ); endfor i = 1:1:K,
[y(i)] = S2( ); endfor j = 1:1:N,
for i = 1:1:K,[y(i), x(j)] = func(y(i), x(j) );
endendfor i = 1:1:K,
[Out(i)] = Sink( y( I ) ); end
F1 F2
S2 F3 F4
Sink
S1
How ICT Research at LIACS tackles the Productivity Gap(Leiden University, STW PROGRESS)
Parameterized Nested Loop Programs
17
Toolbox� Toolbox to change the
number of processes
� Unrolling (More Parallelism)
� Merging (Less Parallelism)
� Better use of Resources
� Re-timing
� C-Slowing
� Exploration
� Track Technology
changes PartitioningLessParallelism
More Parallelism
P1 P2
P3
Application
18
Results: Motion JPEG (1)
for k = 1:1:NumFrames,
for j = 1:1:VNumBlocks,
for i = 1:1:HNumBlocks,
[ Block ] = VideoInMain();[ Block ] = DCT( Block );
[ Block ] = Q( Block );
[ Packets ] = VLE( Block );
[ ] = VideoOut( Packets );end
end
end
Generated Process Network
DATE04, ‘System Design using Kahn Process Networks: The Compaan/Laura Approach’
Block( j , i )
19
Motion JPEG (2)
� From Matlab to FPGA implementation� Automatically derived a multi processor solution
� Free choice between Microprocessor (MicroBlaze) or Hardware (IPCore)
� Automatically generated� “C” program for all Microprocessors
� VHDL program for interconnections and IP Core integration
� Correct by Construction
� Seamless integration into FPGA Development environment (EDK)
ND_2DCT
400
ND_1
VideoIn
ND_2
DCT
ND_3
Q
ND_4
VLE
ND_5
VideoOut
10837 135036 68972 54210 2795
FIFO
Difference: 337
- Manual: 3 months
- Compaan: 3 weeks
20
Results: QR Case studyfor k = 1:1:21,
for j = 1:1:7, [r(j,j), rr(j,j), a, b, d(k)] =
bcell( r(j,j), rr(j,j),x(k,j), d(k));
for i = j+1:1:7,
[r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a, b );
end
end
end
0
200
400
600
800
1000
1200
1400
1600
1800
2000
QR QR
Skew
QR
Skew
+2st
QR
Skew
+4st
QR
Skew
+6st
QR
Skew
+7st
QR
Skew
+10st
MF
LO
PS x
28
QR decomposition
Beam former/Radar
FPL04, ‘Increasing pipelined IP core utilization in Process Networks using Exploration’
- Manual: 1 year- Compaan: 3 days
Exploration
- 60MFlops out-of-the-box
- 2 GFlops, with Compaan/Laura
21
Multidisciplinary Work
Mathematics
ElectricalEngineering
Mathematical Modeling (Polyhedral Mathematics)
Efficient Solvers (Parametric Integer Programming)
Processor DesignFPGA Technology
Electronics Design Automation
Models of ComputationsCompiler Technology
Software Engineering
Compaan Project: 6 years, 6 PhDs, 2 Postdocs
ComputerScience
22
Conclusions� Stream Based Applications require Heterogeneous
Multiprocessor architecture� Stream of 10 – 1000 million samples per second
� 10 – 100 Giga operations per second
� Latest FPGAs are Heterogeneous Multiprocessor architectures� Over a Billion Transistor and still counting
� Problem is that we know how to build FPGAs, but programming them is very hard
� Programming/Compiler Technology is a strategic ICT component to reduce Productivity Gap
� Showed the Compaan Project at LIACS, which provides a solution for stream based applications.
23
Thanks to
� Prof. Ed Deprettere
� Dr. Todor Stefanov
� Dr. Sven Verdoolaege
� Alexandru Turjan
� Claudiu Zissulescu
� Hristo Nikolov
� Sjoerd Meijer
� Jerome Lemaitre
� Bin Jiang
� Wei Zhong
For more information http://www.liacs.nl/~kienhuis