1 © 2003 TENSILICA INC.
Fundamental Change in MPSOC
A Fifteen-Year Outlook
Chris Rowen, President and CEO
Tensilica, Inc.
The Configurable Processor Company
2 © 2003 TENSILICA INC.
Moore’s Law: Opportunity, Crisis and ROI
[Figure: Moore’s Law: Standard cell density and speed. Density (K gates / mm2) and ASIC clock (MHz), 2001-2015. Source: ITRS 2001, Moore 1965, Tensilica]
[Figure: Design Productivity Crisis (SRC 1997): Potential Design Complexity and Designer Productivity. Logic transistors per chip (M) and productivity (K transistors per staff-month), 1981-2009. Complexity growth rate: 58% / yr compounded. Productivity growth rate: 21% / yr compounded.]
ROI = Return / Investment = volume * (chip ASP - chip unit cost) / chip development cost
3 © 2003 TENSILICA INC.
ROI Goal: One Design, Many Design-ins
$10M design cost, $15 manf. cost, 5% premium for programmability
One chip, many system designs: low-end still camera, high-end still camera, video camcorder
[Figure: SOC Flexibility = Cost Reduction. Total per-unit cost (0-120) vs. system designs per chip design (1-7). Model: 100K and 1M system volumes]
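The slide states only the model inputs ($10M design cost, $15 manufacturing cost, 5% premium for programmability) and the two system volumes. A minimal sketch of how such a curve could be computed is given below. The amortization formula, the choice to apply the premium to the manufacturing cost, and the function name `per_unit_cost` are all assumptions for illustration, not the author's stated model.

```c
#include <assert.h>

/* Hypothetical amortization model (an assumption, not taken from the slide):
 * per-unit cost = manufacturing cost with the programmability premium
 *               + design cost spread over every unit of every system design
 *                 that reuses the same chip. */
double per_unit_cost(double design_cost, double manf_cost, double premium,
                     int system_designs, double volume_per_system)
{
    assert(system_designs > 0 && volume_per_system > 0.0);
    double total_units = (double)system_designs * volume_per_system;
    return manf_cost * (1.0 + premium) + design_cost / total_units;
}
```

Under these assumptions, one system design at 100K units comes to about $115.75 per unit, which is consistent with the 0-120 range of the chart, and each additional design-in divides the amortized design cost further.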
4 © 2003 TENSILICA INC.
Configurable Processor’s Role
[Figure: Performance vs. Flexibility, positioning application-specific logic, general-purpose processors and configurable processors]
5 © 2003 TENSILICA INC.
Configurable Processors Enable New Roles
Taking Performance to a New Level
[Figure: Optimized ConsumerMarks/MHz (~ energy efficiency). Bars: Xtensa optimized, Xtensa out-of-box, MIPS64 20Kc, ARM1020E, MIPS64b (NEC VR5000), MIPS32b (NEC VR4122). Data labels: 2.0, 0.087, 0.080, 0.059, 0.058, 0.039]
[Figure: Optimized TeleMarks/MHz. Bars: Xtensa optimized, TI C6203 optimized, Xtensa out-of-box, TI C6203 out-of-box, MIPS64 20Kc, MIPS64b (NEC VR5000), ARM1020E, MIPS32b (NEC VR4122). Data labels: 0.473, 0.03, 0.023, 0.016, 0.013, 0.011, 0.23, 0.017]
[Figure: Optimized NetMarks/MHz. Bars: Xtensa optimized, Xtensa out-of-box, MIPS64 20Kc, ARM1020E, MIPS64b (NEC VR5000), MIPS32b (NEC VR4122). Data labels: 0.123, 0.03, 0.018, 0.017, 0.016, 0.01]
Source: EEMBC
6 © 2003 TENSILICA INC.
Automatic Generation of Processors Achieves Required Performance Faster
[Figure: processor generator flow. Labels: Electronic Specification; Processor Generator; Hardware Design (RISC, DSP, OCD, Timer, FPU, Designer-Defined, Cache); Customized Software; Build using any IC process]
Design processor in one hour
7 © 2003 TENSILICA INC.
Processors as Basic Build Block
[Figure: SOC block diagram. Block labels: CPU, Signal processing, Protocol processing, I/O, Memory, Application accelerator, Encrypt, Imaging, Audio, DSP, Application-specific logic (x5), Configurable processor (x14)]
8 © 2003 TENSILICA INC.
Flexibility is the Key to ROI
ROI = Return / Investment = volume * (chip ASP - chip unit cost) / chip development cost
Flexibility means more systems per design
Programmability means more “hot features” available
Little impact on chip cost – pennies per processor
Automatically-generated configurable processors reduce design time, team size and re-spin risk
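The ROI relation on this slide, return over investment with return equal to volume times the margin per chip, can be written as a small function. This is a sketch only: the function and parameter names are hypothetical, and any ASP used with it is an illustrative input rather than a figure from the presentation.

```c
#include <assert.h>

/* ROI = volume * (chip ASP - chip unit cost) / chip development cost.
 * All names are hypothetical; the relation itself is the one on the slide. */
double chip_roi(double volume, double chip_asp, double chip_unit_cost,
                double development_cost)
{
    assert(development_cost > 0.0);
    return volume * (chip_asp - chip_unit_cost) / development_cost;
}
```

Because development cost sits alone in the denominator, every extra system design that reuses the chip adds volume to the numerator without adding investment, which is the sense in which flexibility drives ROI.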
9 © 2003 TENSILICA INC.
Example: NEC TCP/IP Offload Engine (TOE) Platform
NEC TOE achieves full wire speed using eight parallel Tensilica cores plus two management and dispatch cores (10 total) for high-performance IP-based network storage (NAS and IP-SAN)
[Figure: TOE block diagram. Labels: 8 parallel Xtensa processors; 200 MHz; Gigabit Ethernet x 2 ports; MAC]
10 © 2003 TENSILICA INC.
Implications of Multiprocessor SOC
Designers will routinely “waste” processors to get other benefits
Greater speed to market and certainty of success
Higher abstraction in design
Tremendous creativity and diversity in on-chip communications
Topologies: buses, hierarchies of buses, cross-bars, systolic arrays, pipelines
New issues and methods: reliability, redundancy, asynchrony, QOS
Programming models for large numbers of tasks – finding parallelism
Software languages displace hardware languages
C/C++, not Verilog, VHDL, SystemVerilog etc.
Changing demographics of complex SOC design
Broader population of engineers and programmers capable of SOC design
Unified hardware-software design user interface “cockpit”
11 © 2003 TENSILICA INC.
New Types of Processors Exploiting Latent Parallelism
[Figure: Multiple processors vs. multiple-issue in network processors. Number of processors vs. operations per cycle, with contours at 8, 16 and 64 instrs/cycle. Source: K. Keutzer, UCB]
Devices plotted: Xelerated, Intel IXP1200, Broadcom BCM1250, Cognigine RCU/RSF, Cisco PXF, EZchip NP-1, IBM PowerNP, Lexra NetVortex, Motorola C-5, BRECIS, AMCC np7120, Clearwater CNP810, Vitesse IQ2x00, Agere PayloadPlus, Alchemy, Mindspeed CX27470
• Very small processors
• Modest extensions
• High task-level parallelism
• High-performance processors
• VLIW, SIMD and application-specific extensions
• High data- and instruction-level parallelism
12 © 2003 TENSILICA INC.
Projected Processor Speed and Density
Parameter | 2009 | 2016
Geometry | 50nm | 22nm
Clock | 1.8GHz | 5.7GHz
Small proc area (mm2) | 0.08 | 0.016
Small proc/chip | 240 | 1400
High perf proc/chip | 10 | 15
MIPS/chip | 600,000 | 11,000,000
40mm2 die size for consumer SOC
13 © 2003 TENSILICA INC.
The Law of SOC Processor Scaling
Processors/chip: Up to 30% per year
Total MIPS: 65% per year
Tensilica model based on ITRS 2001, 140mm2 die size
[Figure: Aggregate SOC Performance, billions of operations/second (log scale, 10 to 100,000), 2001-2015]
[Figure: Processors Per Chip (log scale, 10 to 10,000), 2001-2015]
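To get a feel for what the two annual rates imply over the 2001-2015 span of the charts, the compounding can be worked through directly. The helper below is a generic compound-growth sketch; the function name and the 14-year horizon in the note are illustrative choices, not part of the Tensilica model.

```c
#include <assert.h>

/* Compound a starting value at a fixed annual rate for a number of years.
 * annual_rate is a fraction: 0.30 means 30% per year. */
double compound_growth(double start, double annual_rate, int years)
{
    assert(years >= 0);
    double value = start;
    for (int y = 0; y < years; y++)
        value *= 1.0 + annual_rate;
    return value;
}
```

Over 14 years, 30% per year compounds to roughly a 39x increase and 65% per year to roughly 1,100x, so aggregate MIPS pulls away from processor count by more than an order of magnitude.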
14 © 2003 TENSILICA INC.
Enablers for Large Scale MPSOC
What is Tensilica Working On?
Performance
Throughput and efficiency mean more opportunities for application-specific processors over RTL
New processor interfaces enable greater parallelism
Insight
Unified hardware-software development environment
Performance and cost-oriented analysis
Automation
Automatic generation of compilers, RTOS, MP models
“Hands-free” instruction set optimization
15 © 2003 TENSILICA INC.
Performance: FLIX™
FLIX = Flexible Length Instruction Xtensions
FLIX freely intermixes 16-, 24-, and 64-bit instructions
No code bloat
No modes
Full backwards code compatibility with current Xtensa ISA
Long instructions implement complex extensions
Fast and parallel code when needed, else very compact code
Arbitrary Instruction Field Specification
Multiple independent operations packed into a wide instruction word
Multiple Load / Store Units
Minimal Overhead
~5000 gates added control logic
[Figure: Instruction packing in memory (little endian shown): 16-, 24- and 64-bit instructions packed back to back in 64-bit memory words]
16 © 2003 TENSILICA INC.
Performance: Pushing to new levels of throughput
256pt FFT (Radix-4)
Simple RISC Task Engine | Minimal Configuration Xtensa processor (18K gates) | 155,389 cycles
Scalar Performance | Base Xtensa processor with MUL32 option | 23,633 cycles
SIMD Performance | Xtensa processor with 4-way SIMD Vectra DSP Engine | 3,055 cycles
FLIX Performance | Conexant Testarossa DSP with 4-way SIMD and FLIX | 1,063 cycles
FLIX: Average of 6% larger code on complex code sets
17 © 2003 TENSILICA INC.
Performance: A Complex FLIX Example
[Figure: datapath. Labels: Register File (16 x 256); units A, B, C, D, E; two 4K x 256 RAMs, each with Addr input; In-Q; Out-Q; WrtA; WrtB]
64-bit instruction format:
Field | Bits | Width
MemA | 63:59 | 5
InQ | 58 | 1
WrtA | 57:53 | 5
ExA/B | 52:37 | 16
ExC/D | 36:25 | 12
ExE | 24:19 | 6
WrtB | 18:14 | 5
OutQ | 13:9 | 5
MemB | 8:4 | 5
1110 | 3:0 | 4
• 9 independent operation fields
• Multiple load/store
• Input/output queues
18 © 2003 TENSILICA INC.
Performance: Writing TIE for FLIX
length l64 64 { InstBuf[3:0] == 14 }
format flix64 l64
slot slot0 flix64[*]
slot slot1 flix64[*]
opcode L32I slot0
opcode S32I slot0
opcode ADD slot0
opcode NOP slot0
opcode ADD slot1
opcode ADDI slot1
opcode SUB slot1
opcode NOP slot1
All components of processor solution automatically generated from the TIE code in <2 hours.
• RTL & HW flow scripts
• Toolchain
• System models
• Operating System support
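The `length` declaration says that an instruction is 64 bits long when `InstBuf[3:0] == 14`. A rough sketch of how a decoder might pick the instruction length from that nibble follows. Only the value 14 comes from the slide. Treating `InstBuf[3:0]` as the low nibble of the first byte in memory, mapping values 8 to 13 to 16-bit instructions and values 0 to 7 to 24-bit instructions, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Instruction length in bytes, chosen from the low nibble of the first byte.
 * Only the "14 means 64-bit" rule is taken from the TIE on the slide; the
 * 16-bit / 24-bit split below is an ASSUMPTION for this sketch. */
int flix_instr_bytes(uint8_t first_byte)
{
    unsigned op0 = first_byte & 0x0Fu;  /* assumed to be InstBuf[3:0] */
    if (op0 == 14u) return 8;           /* from the slide: 64-bit FLIX word */
    if (op0 == 15u) return 0;           /* not assigned in this sketch */
    if (op0 >= 8u)  return 2;           /* ASSUMPTION: 16-bit instruction */
    return 3;                           /* ASSUMPTION: 24-bit instruction */
}

/* Walk a code buffer and count instructions; returns -1 on a malformed stream. */
int count_instructions(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    size_t pos = 0;
    int count = 0;
    while (pos < len) {
        int n = flix_instr_bytes(code[pos]);
        if (n == 0 || pos + (size_t)n > len)
            return -1;
        pos += (size_t)n;
        count++;
    }
    return count;
}
```

The point of the sketch is the "no modes" property: the length is decided per instruction from its own bits, so 16-, 24- and 64-bit instructions can be mixed freely in one stream.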
19 © 2003 TENSILICA INC.
Performance: Conexant DSP Architectural Requirements
VLIW-SIMD programming model
16- and 24-bit scalar instructions
64-bit instructions with multiple operations
2 or 4 16x16 MAC units
6R/3W Conexant-defined register file
At least two load store units
7-stage pipe with 2 cycles for I/D memory access
Stall on memory bank conflicts
Backward compatibility with previous Conexant DSPs via Translation Instruction Set (a sub-operation of the 64-bit instructions)
20 © 2003 TENSILICA INC.
Performance: Conexant Testarossa Encoding
64-bit instruction format:
Field | Bits | Width
ALU | 63:46 | 18
MAC | 45:28 | 18
Load/Store | 27:4 | 24
1110 | 3:0 | 4
Operation groups:
Testarossa load/store (vector, scalar and unaligned load/store) and Xtensa core instructions (load/store, branch, ALU): 234 operations
Complex multiply, real multiply, select: 24 operations
ALU, shift, 2nd load/store: 52 operations
21 © 2003 TENSILICA INC.
Insight: The Multiple Core SOC Design Problem
Three Skill Sets, Three Environments?
Software Development Environment
• C code development
• Debugging
• C project management
• Code profiling, tuning
Tools:
• GNU-based Tensilica software development tools
• Xtensa C/C++ compiler
• Xtensa Instruction Set Simulator
Environment:
• Command line interface
• Partner-provided software IDEs (WindRiver, ATI/Mentor, MontaVista)
Processor Optimization Environment
• TIE code development for extensions
• Configuration option management
Tools:
• Web-based Xtensa Processor Generator
• TIE Compiler: single source TIE file for processor extension
Environment:
• Command line interface
• Web browser interface
SOC System Architecture Exploration
• System modeling and simulation
• Multiple core debug
Tools:
• Xtensa Modeling Protocol (XTMP)
• Bus functional models for co-simulation / co-verification EDA tools
Environment:
• Command line interface
• EDA partners system analysis / debug environments
22 © 2003 TENSILICA INC.
Insight: Xtensa Xplorer
Software Development Environment
• C code development
• Debugging
• C project management
• Code profiling, tuning
Processor Optimization Environment
• TIE code development for extensions
• Configuration option management
SOC System Architecture Exploration
• System modeling and simulation
• Multiple core debug
23 © 2003 TENSILICA INC.
Insight: Develop and Manage Processor Configurations
Manage complexity of growing variety of processor optimization choices
Software and processor optimization within same IDE
Gate count estimate:
• per instruction
• per register file
• per user state
Interactive display of instruction…
• operands
• pipelining
• semantics
Interactive TIE Editor
• language-sensitive editing and help
24 © 2003 TENSILICA INC.
Insight: Create, Analyze & Tune ISA Extensions (TIE)
Profile and visualize performance impact of custom instructions
Pipeline Viewer shows instruction flow of disassembled code
Static analysis of pipeline stalls pinpoints areas for fine tuning
Highlight instructions with variable latency (e.g. cache misses)
Interlocks on deep TIE pipelines fully modeled and explained
25 © 2003 TENSILICA INC.
Insight: Analyze and Select Caches to Meet Speed/Area Goals
Automatically profile code across range of cache configuration options
Performance charts visually compare different configurations
26 © 2003 TENSILICA INC.
Insight: Chip-level Software and Simulation for MPSOC
Manage system memory maps & link/load for multiple-core SOCs
Develop, run and debug multiple-core simulations using Xtensa Modeling Protocol (XTMP)
Auto-generated XTMP model based on memory maps
• Specify chip-level memory maps for shared/private memories
• Place interrupt and reset vectors
• Assign code/data to distributed memories
27 © 2003 TENSILICA INC.
Automation: The Next Generation
[Figure: generator flow with a NEW Automation Tool. Labels: Application Source Code; Automation Tool; Electronic Specification; Xtensa Processor Generator; Complete Hardware Design (ALU, DSP, OCD, Timer, FPU, Register File, Cache); Customized Software Tools; Any Fab]
Application source code fragment shown on the slide (cut off on the slide):
int main()
{
  int i;
  short c[100];
  for (i=0;i<N;i++) {
    c[i] = 0;
  }
  for (i=0;i<N;i++)
28 © 2003 TENSILICA INC.
Automation: Goals for Processor Extension
Flexibility
Application code might be written/modified after tape-out
Generated TIE must be sufficiently general purpose so that small changes to application code do not degrade performance
Control
Full automation
C/C++ in, TIE out
C/C++ + generated TIE in, binary code out
Optional full control by user
Guide tool and/or to select instructions
Add to or change generated TIE
Tune application to better take advantage of TIE
Speed: minutes, not days
29 © 2003 TENSILICA INC.
Automation: Basic Operation - Fusion
Original C Code
int *a, *b, *c;
for (int i=0; i<n; i++)
  c[i] = (a[i] + b[i]) >> 2;
Complete TIE Code
operation add_shift (out AR c, in AR a, in AR b) {
  wire t[31:0] = a+b;
  assign c = {{2{t[31]}}, t[31:2]};
}
[Figure: dataflow graph of the add (+) feeding the shift (>> 2)]
Combined add-shift operator automatically used wherever equivalent expression occurs in source
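As a reference for what the fused operation has to compute, the C expression `(a + b) >> 2` can be modeled bit-exactly. The sketch below is a software model written for illustration, not output of the Tensilica tools. It assumes a 32-bit wraparound add followed by an arithmetic (sign-filling) shift, and the helper name `add_shift_loop` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a fused add-then-shift: 32-bit wraparound add, then an
 * arithmetic shift right by 2 with the sign bit replicated into the top two
 * bits. The unsigned-to-signed conversion assumes two's complement. */
int32_t add_shift(int32_t a, int32_t b)
{
    uint32_t t = (uint32_t)a + (uint32_t)b;
    uint32_t sign_fill = (t & 0x80000000u) ? 0xC0000000u : 0u;
    return (int32_t)((t >> 2) | sign_fill);
}

/* The loop from the slide, with the fused operation in place of the
 * separate add and shift. */
void add_shift_loop(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = add_shift(a[i], b[i]);
}
```

A model like this is useful as a golden reference when checking that a fused instruction matches the original two-operation sequence on negative inputs.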
30 © 2003 TENSILICA INC.
Automation: Basic Operation - Multiple Ops in FLIX
Original C Code
for (int i=0; i<n; i++)
  c[i] = (a[i] + b[i]) >> 2;
Complete TIE Code (64 Bit Instruction with 3 Slots: S0, S1, S2)
length l 64 { InstBuf[3:0] == 14 }
format f l
slot slot0 f[*] ADDI, NOP
slot slot1 f[*] ADD, SRAI, NOP
slot slot2 f[*] L32I, S32I, NOP
Generated Assembly
loop:
{addi a9,a9,4; add a12,a10,a8; l32i a8,a9,0}
{addi a11,a11,4; srai a12,a12,2; l32i a10,a11,0}
{addi a13,a13,4; nop; s32i a12,a13,0}
• Original C compiled to 3 cycles/iteration
31 © 2003 TENSILICA INC.
Automation: Basic Operation - SIMD/Vector
Original C Code
short *a, *b, *c;
for (int i=0; i<n; i++)
  c[i] = a[i] + b[i];
Complete TIE Code
regfile vec 64 16 v;
operation add16x4(out vec c, in vec a, in vec b) {
  assign c = {a[63:48]+b[63:48],
              a[47:32]+b[47:32],
              a[31:16]+b[31:16],
              a[15:0]+b[15:0]};
}
[Figure: vector add, a + b = c, element by element]
Four iterations in parallel
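The behavior of the `add16x4` operation can be modeled in plain C by treating a 64-bit word as four independent 16-bit lanes. This is an illustrative software model, not generated code; `vec_add` is a hypothetical helper showing how the scalar loop maps onto one vector operation per four elements.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of add16x4: four 16-bit adds inside one 64-bit word.
 * Each lane wraps on its own; no carry crosses a lane boundary. */
uint64_t add16x4(uint64_t a, uint64_t b)
{
    uint64_t c = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 16 * lane;
        uint16_t sum = (uint16_t)((a >> shift) + (b >> shift));
        c |= (uint64_t)sum << shift;
    }
    return c;
}

/* One vector add replaces four iterations of the scalar loop. */
void vec_add(const uint64_t *a, const uint64_t *b, uint64_t *c, int nvec)
{
    assert(nvec >= 0);
    for (int i = 0; i < nvec; i++)
        c[i] = add16x4(a[i], b[i]);
}
```

The lane isolation is the property that distinguishes this from an ordinary 64-bit add: an overflow in the lowest element must not disturb its neighbor.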
32 © 2003 TENSILICA INC.
Automation: Processor Extension Step 1
Compile the C/C++ application code
Designer specifies compiler optimization flag
Compiler generates comments to help user tune code
Optimized code yields better results
Compiler generates information from application
Feedback optimization ranks code regions by frequency
Vectorizer determines which loops can be vectorized
Fuser generates dataflow graphs for important regions
Operation counts for each type of opcode for every region
33 © 2003 TENSILICA INC.
Automation: Processor Extension Step 2
Generated information used to select and generate TIE:
For each code region, generate many potential sets of TIE extensions (configurations)
Vectorize by 1, 2, 4, 8
Add FLIX functional units
Add fusions
Generation guided by estimated performance
Evaluate all generated configurations across all regions
Find best set of merged configurations given budget
34 © 2003 TENSILICA INC.
Automation: Processor Extension Step 3
Use the TIE with a C/C++ or assembly application
Compiler reads TIE (automatically or manually generated) and generates code
FLIX slot/format TIE specification mapped to resource tables
Generalized graph matcher generates dataflow graphs from TIE
Vectorizer vectorizes a loop and checks if all required operations available in TIE
User free to tune the code in ANSI C/C++ or assembly
Simulator, assembler, debugger, RTOS support generated directly from TIE
35 © 2003 TENSILICA INC.
Automation: Example: “Sum-of-Absolute Differences” Search
Generated Configuration Parameters
Config | Speedup | Gates Added (K) | SIMD Factor | FLIX Width (Slots) | Load/Store Units | Fusion
1 | 8.7x | 74 | 8 | 3 | 2 | Yes
2 | 8.1x | 57 | 8 | 2 | 2 | Yes
3 | 7.6x | 46 | 4 | 3 | 2 | Yes
4 | 7.6x | 37 | 8 | 2 | 1 | Yes
5 | 6.8x | 33 | 4 | 2 | 2 | Yes
6 | 6.8x | 26 | 8 | 1 | 1 | Yes
7 | 6.1x | 18 | 4 | 2 | 1 | Yes
8 | 5.1x | 12 | 4 | 1 | 1 | Yes
9 | 4.3x | 8 | 2 | 2 | 1 | Yes
10 | 3.4x | 5 | 2 | 1 | 1 | Yes
11 | 1.4x | 0.3 | 1 | 1 | 1 | Yes
[Figure: plot of configurations 1-10]
Wide range of choices of performance increase versus hardware cost
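One way to read this table is as a menu from which a designer picks the fastest configuration that fits a gate budget. The sketch below does only that simple lookup over the eleven rows of the table. It is not the tool's search algorithm (which the slides say merges configurations across code regions), and the struct, array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sad_config {
    int id;           /* configuration number from the table */
    double speedup;   /* e.g. 8.7 for "8.7x" */
    double gates_k;   /* gates added, in thousands */
};

/* Rows copied from the "Sum-of-Absolute Differences" table. */
static const struct sad_config sad_configs[] = {
    {1, 8.7, 74.0}, {2, 8.1, 57.0}, {3, 7.6, 46.0}, {4, 7.6, 37.0},
    {5, 6.8, 33.0}, {6, 6.8, 26.0}, {7, 6.1, 18.0}, {8, 5.1, 12.0},
    {9, 4.3, 8.0},  {10, 3.4, 5.0}, {11, 1.4, 0.3},
};

/* Return the id of the fastest configuration whose added gates fit the
 * budget (in K gates); ties go to the smaller design. Returns 0 if none fit. */
int best_config_for_budget(double budget_k)
{
    assert(budget_k >= 0.0);
    const struct sad_config *best = NULL;
    size_t n = sizeof sad_configs / sizeof sad_configs[0];
    for (size_t i = 0; i < n; i++) {
        const struct sad_config *c = &sad_configs[i];
        if (c->gates_k > budget_k)
            continue;
        if (best == NULL || c->speedup > best->speedup ||
            (c->speedup == best->speedup && c->gates_k < best->gates_k))
            best = c;
    }
    return best ? best->id : 0;
}
```

With a 50K-gate budget, for example, configurations 3 and 4 both reach 7.6x, and the tie-break selects configuration 4 because it adds 37K gates rather than 46K.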
36 © 2003 TENSILICA INC.
Automation: Application Examples
Application | Speedup | Original Code Size (Before Acceleration) | Code Size After Acceleration | Code Size on MIPS32 (using gcc -O2) | Configurations Visited | Run Time to Generate Configurations
Radix-4 FFT | 10.6x | 1.5 KB | 3.6 KB | 4.4 KB | 175,796 | 3 minutes
GSM Encoder | 3.9x | 17 KB | 20 KB | 38 KB | 576,722 | 15 minutes
GSM Encoder (using FFT TIE) | 1.8x | 17 KB | 19 KB | 38 KB | N/A | N/A
MPEG4 Encoder | 3.3x | 111 KB | 136 KB | 356 KB | 1,340,312 | 30 minutes
37 © 2003 TENSILICA INC.
Conclusion
MPSOC represents a new medium of implementation:
Opportunity: Cost, power, bandwidth potential of semiconductors
Challenge: Return on investment for design of complex chips
The transition to MPSOC will drive…
…new parallel architectures (focus becomes interconnect not ISA)
…shift from hardwired design to programmable design
…new class of hardware/software environments for processor and SOC generation, integration and use
…rapid growth in processor counts and aggregate performance
Important historical parallel between integrated circuit (many small transistors per chip) and MPSOC (many small processors per chip)
38 © 2003 TENSILICA INC.
Key Research Directions
Tool environments for identification/exploitation of latent parallelism
Unified programming model for MP
Technical and economic tools for optimizing efficiency vs. flexibility (spectrum of early vs. late binding)
Application-specific interconnect topologies and generators
Role for hardware-centric programmability (FPGA) vs. software-centric programmability (processor)
Vision: set of communicating tasks + chip interface specification + performance constraints -> set of program binaries + chip GDSII
Profile-based automation:
Generation of ISA
Assignment of tasks to processors [1 -> n, n -> 1; static vs. dynamic allocation]
Profile-based implementation of messaging mechanism and physical interconnect
Memory configuration, memory map, shared code and data section allocation
University Program: Free license to tools and models for MPSOC design using extensible processors: Steve Roddy: [email protected]