System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip Architectures
Tony Givargis (Frank Vahid, Jörg Henkel)
Center for Embedded Computer Systems
University of California
Irvine, CA 92697

Overview
Given:
– Parameterized SOC architecture
– Fixed application
Automatically explore the design space
Find optimal points with respect to power and performance
Application: void main(){ while(1){ Receive(); Decode(); Display(); }}
[Figure: SOC block diagram: CPU, Memory, JPEG CODEC, Math/FPU, UART, I$-D$, Bridge; cache parameters Size = {1K, 4K, 8K}, Line = {4, 8, 16}, Assoc = {1, 2, 4}]
[Figure: power (uW) versus execution time (us) for the explored configurations]
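The three cache parameter sets on this slide already define a small configuration space. As a minimal sketch (the names below are illustrative and not from the slides, and the sizes are assumed to be in bytes), the cross product can be enumerated like this:

```python
from itertools import product

# Cache parameter values as listed on the slide. Interpreting 1K/4K/8K as
# bytes is an assumption; the slide does not state the unit.
SIZES = [1024, 4096, 8192]
LINES = [4, 8, 16]
ASSOCS = [1, 2, 4]

def cache_configurations():
    """Return every (size, line, associativity) combination."""
    return list(product(SIZES, LINES, ASSOCS))

configs = cache_configurations()
print(len(configs))  # 27
```

Even this single core contributes 27 points, which is why the total over all cores grows so quickly.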

Motivation
Design trends:
– Growing demand for portable devices
– Growing demand for low power design
– Increased application complexity
– Shrinking time-to-market windows
Technology trends:
– Increased chip capacity
– Increased I/O pins
– Improved on-chip integration techniques (storage, digital, analog, …)
– SOC era
Need for greater designer productivity!

Motivation
One approach: reuse of existing IP
– IP selection?
– IP integration?
– SOC verification?
– Multi-source IP licensing
– More…
[Figure: SOC with open slots (CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, Bridge) and candidate IP: MIPS, ARM, RAM, SRAM, DRAM, JPEG CODEC1, JPEG CODEC2, Math/FPU, UART, USB, ISA bridge, AMBA bridge]

Motivation
Alternate approach: reuse of SOC
– Designed, integrated, tested
– Domain specific
– Parameterized
Designed by firms specializing in SOC
User: map application, then “configure-and-execute” (successors to microcontrollers!)
[Figure: parameterized SOC block diagram: CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, Bridge]

Motivation
Composed of 100s of cores
Cores are “configurable”
Configurations impact power/performance
Large number of total configurations!
Architecture is otherwise fixed!
[Figure: parameterized SOC block diagram: CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, Bridge]

Motivation
ATI Technologies – XILLEON™ 220 SOC for Digital Set-top Box Market
Tensilica – Xtensa™ 1040 configurable processor cores
Philips Semiconductors – Velocity RSP9™ SOC platforms
Adelante Technologies – offers complete SOC customizable platforms for DSP domains
More…

Outline
Previous work
Target architecture
Power/performance estimation
Parameter space exploration
Experiments
Conclusion

Previous Work
Parameterized SOC design
– [Malik00], [Veidenbaum99], [Vahid99], [Stan95]
Power/performance evaluation
– [Barndolese00], [Simunic99], [Li98], [Tiwari94]
Design space exploration (manual)
– [Givargis99], [Lieverse99]
Design space exploration (automatic)
– Focus of this work…

Previous Work
Y-chart [Lieverse99]
[Figure: Y-chart with boxes Architecture, Applications, Mapping, Analysis, Numbers; label: Auto]

Target Architecture
[Figure: target architecture block diagram: MIPS, I-Cache, D-Cache, Bridge, Memory, DMA, Peripheral Bus, UART, DCT CODEC]

Target Architecture
Voltage scale
Size, line, associativity
Bus width, encoding (gray, invert)
UART tx/rx buffer size
DCT resolution
[Figure: target architecture block diagram (MIPS, I-Cache, D-Cache, Bridge, Memory, DMA, Peripheral Bus, UART, DCT CODEC) annotated with the parameters above]

Target Architecture
26 parameters
10^14 configurations
What are the optimal configurations (given a fixed application)?
[Figure: target architecture block diagram: MIPS, I-Cache, D-Cache, Bridge, Memory, DMA, Peripheral Bus, UART, DCT CODEC]

Problem Summary
What are the possible power/performance tradeoffs? (100 trillion)
Need to efficiently evaluate power/performance (1/sec → 150,000 years)
Need to explore the configuration space
[Figure: parameterized SOC block diagram: CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, Bridge]

Power Evaluation
Exploration works with:
– Chip instrumentation (real-time)
– System-level simulation
– RTL simulation
– Gate-level simulation
– Circuit-level simulation
Relative accuracy required!
[Figure: bar chart for a digital camera application mapped on our SOC, capturing 1 image; bar labels: Chip 1, System 440, RTL 5400, Gate 28800, Circuit 180000]

Power Evaluation - Processor
[Tiwari94/00]’s instruction-level power model
Measure watt/inst
Account for stalls + dependency
Apply traces
[Figure: target architecture block diagram with the MIPS core highlighted]
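The instruction-level idea can be illustrated with a short sketch: charge each instruction a measured base cost, add a cost for stall cycles and for inter-instruction effects, and sum over a trace. Every number and name below is hypothetical and chosen only for illustration, and the sketch accumulates energy per instruction rather than the watt/inst figure named on the slide.

```python
# Hypothetical per-instruction energy costs in nanojoules (illustrative only).
BASE_COST_NJ = {"add": 1.0, "load": 2.5, "mul": 3.0}
# Hypothetical extra cost when one instruction class follows another.
PAIR_OVERHEAD_NJ = {("load", "add"): 0.4, ("mul", "add"): 0.3}
# Hypothetical cost of a single stall cycle.
STALL_COST_NJ = 0.8

def trace_energy_nj(trace):
    """Sum base, stall, and inter-instruction energy over a trace.

    `trace` is a list of (opcode, stall_cycles) tuples.
    """
    total = 0.0
    previous = None
    for opcode, stall_cycles in trace:
        total += BASE_COST_NJ[opcode]
        total += stall_cycles * STALL_COST_NJ
        if previous is not None:
            total += PAIR_OVERHEAD_NJ.get((previous, opcode), 0.0)
        previous = opcode
    return total

example_trace = [("load", 2), ("add", 0), ("mul", 0), ("add", 1)]
print(round(trace_energy_nj(example_trace), 2))  # 10.6
```

Because the per-instruction costs are looked up rather than simulated, evaluating a trace this way takes time linear in the trace length.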

Power Evaluation – Cache/Mem.
[Evans95]
Capacitance model of sub-components
Switching obtained via simulation (parameter dependent)
[Figure: target architecture block diagram with the caches and memory highlighted]

Power Evaluation – Buses
[Chern92]
Model bus capacitance
Switching derived from I/O traffic (parameter dependent)
[Figure: target architecture block diagram with the buses highlighted]
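To show why switching depends on the bus encoding parameter, the sketch below counts bit transitions on a word trace under plain binary and Gray encoding and converts transitions to energy with the standard 0.5 * C * Vdd^2 per wire transition. The trace, capacitance, and voltage are assumptions for illustration, not values from the slides.

```python
def to_gray(value):
    """Convert a non-negative integer to its Gray-code representation."""
    return value ^ (value >> 1)

def count_toggles(words):
    """Count bit transitions between consecutive words driven on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def switching_energy_j(words, wire_capacitance_f, vdd_v):
    """Estimate energy as 0.5 * C * Vdd^2 per single-wire transition."""
    return 0.5 * wire_capacitance_f * vdd_v ** 2 * count_toggles(words)

# Hypothetical sequential address trace, as produced by straight-line code.
addresses = list(range(8))
binary_toggles = count_toggles(addresses)
gray_toggles = count_toggles([to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)  # 11 7
```

For sequential addresses Gray code toggles exactly one wire per transfer, which is why it tends to help on address buses; on data buses with irregular traffic the benefit is not guaranteed.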

Power Evaluation – Peripherals
Observation: cores execute instructions!
Apply a technique similar to that used for processors!
[Figure: target architecture block diagram with the peripheral cores highlighted]

Power Evaluation – Summary
[Figure: target architecture block diagram annotated per core: UART (5%), MIPS (10%), I-Cache (8%), D-Cache (8%), Bridge (5%), Peripheral Bus, DCT CODEC (5%), Memory (8%), DMA (5%)]
~50-100K instructions/second! (Platune)

Exploration
Problem formulation
– Parameters P1, P2, …, Pn
– A configuration (point) is an assignment of values to all parameters
– How to efficiently generate all Pareto-optimal configurations?
[Figure: power (uW) versus execution time (us) for the explored configurations]

Exploration
Algorithm idea
– Parameters A (10 values), B (32 values), C (32 values)
– Without dependency knowledge: A (10) * B (32) * C (32) = 10240 points
– A and B interdependent: A (10) * B (32) = 320 points
– A and C are independent: A (10) + C (32) = 42 points
– C and B are independent: C (32) + B (32) = 64 points
– 138 points
With knowledge about dependency we prune 98.6%
Directed graph
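The counts on this slide follow from one rule: interdependent parameters must be explored jointly (product of their value counts), while independent parameters can be explored one at a time (sum of their value counts). A minimal sketch of that arithmetic, with function names chosen here for illustration:

```python
from math import prod

def interdependent_points(sizes):
    """Interdependent parameters are explored jointly: product of sizes."""
    return prod(sizes)

def independent_points(sizes):
    """Independent parameters are explored separately: sum of sizes."""
    return sum(sizes)

A, B, C = 10, 32, 32
print(interdependent_points([A, B]))     # 320
print(independent_points([A, C]))        # 42
print(independent_points([C, B]))        # 64
print(interdependent_points([A, B, C]))  # 10240

# Fraction pruned when 138 points are visited instead of all 10240.
pruned = 1 - 138 / interdependent_points([A, B, C])
print(round(pruned, 4))  # 0.9865
```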

Exploration
A → B: Pareto-optimal configurations of B are calculated after the Pareto-optimal configurations of the nodes along the path A → B
A → B → A (cycle): Pareto-optimal configurations of all the parameters on the cycle are calculated simultaneously
A (isolated node): Pareto-optimal configurations are calculated in isolation
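One way to read these rules is that parameters lying on a common cycle form a cluster, and clusters are processed in dependency order. The slides do not name an algorithm for this; the sketch below swaps in Kosaraju's strongly-connected-components algorithm as one plausible realization. The toy graph is hypothetical and is not the A–Z dependency graph of the target architecture.

```python
def strongly_connected_components(graph):
    """Group nodes that lie on common cycles (Kosaraju's algorithm).

    `graph` maps each node to the list of nodes it points to.
    Components come out in topological order of the condensed graph,
    so a component appears before the components that depend on it.
    """
    visited, finish_order = set(), []

    def visit(node):
        visited.add(node)
        for successor in graph[node]:
            if successor not in visited:
                visit(successor)
        finish_order.append(node)

    for node in graph:
        if node not in visited:
            visit(node)

    # Build the reversed graph for the second pass.
    reversed_graph = {node: [] for node in graph}
    for node, successors in graph.items():
        for successor in successors:
            reversed_graph[successor].append(node)

    assigned, components = set(), []

    def collect(node, component):
        assigned.add(node)
        component.append(node)
        for predecessor in reversed_graph[node]:
            if predecessor not in assigned:
                collect(predecessor, component)

    for node in reversed(finish_order):
        if node not in assigned:
            component = []
            collect(node, component)
            components.append(sorted(component))
    return components

# Hypothetical graph: A and B form a cycle, B feeds C, D is isolated.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": [], "D": []}
print(strongly_connected_components(toy_graph))
# [['D'], ['A', 'B'], ['C']]
```

In this toy case A and B would be explored simultaneously, C after them, and D in isolation.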

Exploration
Dependency Graph
[Figure: directed dependency graph over nodes A–Z]
Node  Core            Parameter
A     MIPS            Voltage scale
B     I$              Total size
C     I$              Line size
D     I$              Associativity
E     D$              Total size
F     D$              Line size
G     D$              Associativity
H     CPU I$ bus      Data bus width
I     CPU I$ bus      Data bus code
J     CPU I$ bus      Addr bus width
K     CPU I$ bus      Addr bus code
L     CPU D$ bus      Data bus width
M     CPU D$ bus      Data bus code
N     CPU D$ bus      Addr bus width
O     CPU D$ bus      Addr bus code
P     I/D$ Mem bus    Data bus width
Q     I/D$ Mem bus    Data bus code
R     I/D$ Mem bus    Addr bus width
S     I/D$ Mem bus    Addr bus code
T     Peripheral bus  Data bus width
U     Peripheral bus  Data bus code
V     Peripheral bus  Addr bus width
W     Peripheral bus  Addr bus code
X     UART            Tx buffer size
Y     UART            Rx buffer size
Z     DCT CODEC       Pixel resolution

Exploration
Dependency graph
– Based on designer knowledge
– Computed by simulating all pairs of nodes (quadratic time complexity, approx.)
– One time effort
[Figure: directed dependency graph over nodes A–Z]

Exploration – Algorithm
Step 1: Clustering followed by simulation
[Figure: dependency graph over nodes A–Z with nodes grouped into clusters]

Exploration – Algorithm
Step 2: Pair-wise merge followed by simulation
Initial clusters: {A,H,I} {B,C,D,E,F,G} {J,K,T,U} {Z} {L,M,P,Q} {N,O,V,W} {X,Y,R,S}
After merge 1: {A,H,I,B,C,D,E,F,G} {J,K,T,U,Z} {L,M,P,Q,N,O,V,W} {X,Y,R,S}
After merge 2: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z} {L,M,P,Q,N,O,V,W,X,Y,R,S}
After merge 3: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z,L,M,P,Q,N,O,V,W,X,Y,R,S}
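A pair-wise merge can be sketched as follows: combine the Pareto-optimal configurations of two clusters, simulate only those combinations, and keep the non-dominated results. This is an illustration under assumptions, not the authors' implementation; `toy_simulate` is a hypothetical stand-in for the power/performance simulator, and the parameter names and formulas in it are invented.

```python
from itertools import product

def pareto_filter(points):
    """Keep configurations whose (time, power) is not dominated by another."""
    kept = []
    for config, time, power in points:
        dominated = any(
            t <= time and p <= power and (t < time or p < power)
            for _, t, p in points
        )
        if not dominated:
            kept.append((config, time, power))
    return kept

def merge_clusters(pareto_a, pareto_b, simulate):
    """Combine two clusters' Pareto-optimal configurations and re-filter.

    Only len(pareto_a) * len(pareto_b) merged configurations are simulated.
    """
    candidates = []
    for config_a, config_b in product(pareto_a, pareto_b):
        merged = {**config_a, **config_b}
        time, power = simulate(merged)
        candidates.append((merged, time, power))
    return pareto_filter(candidates)

def toy_simulate(config):
    """Hypothetical stand-in for the simulator: returns (time, power)."""
    time = 100 / config["voltage"] + 64 / config["cache_kb"]
    power = 10 * config["voltage"] ** 2 + 2 * config["cache_kb"]
    return time, power

cluster_a = [{"voltage": 1.0}, {"voltage": 2.0}]
cluster_b = [{"cache_kb": 1}, {"cache_kb": 8}]
merged_front = merge_clusters(cluster_a, cluster_b, toy_simulate)
print(len(merged_front))  # 3
```

In the toy run, four merged configurations are simulated and one of them (voltage 2.0 with a 1 KB cache) is dropped because another point is both faster and lower power.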

Exploration
Exhaustive solution (only works for 1-4 parameters!)
– Evaluate all points
– Sort by decreasing execution time
– Walk through the space, eliminate points with power > minimum seen so far!
Substitute heuristics
[Figure: power (uW) versus execution time (us) for the explored configurations]
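The sort-and-walk filter can be sketched in a few lines. As an assumption of this sketch, points are sorted by increasing execution time, so that "minimum power seen so far" refers to the points that are at least as fast; a point survives only if it beats that minimum. The sample measurements are made up.

```python
def pareto_front(points):
    """Return the Pareto-optimal (time, power) points, minimizing both.

    Sorts by increasing execution time (ties broken by lower power), then
    walks once, keeping a point only if its power is below the minimum
    power seen so far.
    """
    front = []
    min_power = float("inf")
    for time, power in sorted(points):
        if power < min_power:
            front.append((time, power))
            min_power = power
    return front

# Hypothetical (execution time, power) measurements.
measured = [(100, 900), (200, 500), (250, 700),
            (400, 300), (400, 450), (900, 300)]
print(pareto_front(measured))
# [(100, 900), (200, 500), (400, 300)]
```

The sort dominates the cost, so filtering n evaluated points takes O(n log n); the expensive part of the exhaustive solution is evaluating the points in the first place.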

Exploration
Complexity: O((K + log(K)) * 2^(N/K))
– K is the number of clusters
– N is the number of parameters
– 2^(N/K) bounds the exhaustive computation
– (K + log(K)) bounds the number of iterations
– Worst case K=1, best case K=N
– 2^(N/K) decreases rapidly as K increases (e.g., 2^(26/2) + 2^(26/2) is much smaller than 2^26!)
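To get a feel for the bound, it can be evaluated for N = 26 and a few cluster counts. The sketch below assumes a base-2 logarithm, which the slide does not specify, and treats the expression as a plain number rather than an asymptotic class.

```python
from math import log2

def exploration_bound(n_params, n_clusters):
    """Evaluate (K + log2(K)) * 2**(N/K) for N parameters and K clusters.

    The base-2 logarithm is an assumption; the slide writes only log(K).
    """
    return (n_clusters + log2(n_clusters)) * 2 ** (n_params / n_clusters)

N = 26
for k in (1, 2, 13, 26):
    print(k, round(exploration_bound(N, k)))

# The slide's example: two clusters of 13 parameters versus one of 26.
print(2 ** 13 + 2 ** 13, 2 ** 26)  # 16384 67108864
```

Going from one cluster to two already reduces the bound by more than three orders of magnitude.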

Exploration – Results
JPEG
– Exploration time: 29.1 min
– Config. visited: 12352 (141)
– 5.10x exe. time
– 7.51x power
– 2.73x energy
– Pruning ratio > 0.99997
[Figure: JPEG, power (uW) versus execution time (usec)]

Exploration – Results
CKEY
– Exploration time: 108 min
– Config. visited: 15890 (223)
– 8.31x exe. time
– 6.08x power
– 2.57x energy
– Pruning ratio > 0.99993
[Figure: CKEY, power (uW) versus execution time (usec)]

Exploration – Results
IMAGE
– Exploration time: 50.2 min
– Config. visited: 10135 (80)
– 8.29x exe. time
– 8.57x power
– 1.81x energy
– Pruning ratio > 0.99998
[Figure: IMAGE, power (uW) versus execution time (usec)]

Exploration – Results
MATRIX
– Exploration time: 73.6 min
– Config. visited: 12623 (84)
– 10.7x exe. time
– 8.16x power
– 3.18x energy
– Pruning ratio > 0.99997
[Figure: MATRIX, power (uW) versus execution time (usec)]

Exploration – Results
[Figure: JPEG, energy (uJ) versus execution time (usec)]
[Figure: JPEG, power (uW) versus execution time (usec)]

Conclusion
Gave a system-level algorithm for exploring the solution space of an application mapped to a parameterized SOC architecture
– Given a dependency graph we extensively prune the solution space
– Pruning ratio > 0.99997 in experiments
Future work:
– Automatically compute the dependency model
– Replace the exhaustive sub-algorithm with a heuristic (e.g., gradient search, GA)