Fundamental Change in MPSOC: A fifteen year outlook

Chris Rowen, President and CEO

Tensilica, Inc.

The Configurable Processor Company

© 2003 TENSILICA INC.


Moore’s Law: Opportunity, Crisis and ROI

[Figure: Moore’s Law: Standard cell density and speed. Density (K gates / mm2) and ASIC clock (MHz), log scale from 100 to 10,000, 2001 to 2015. Source: ITRS 2001, Moore 1965, Tensilica]

[Figure: Design Productivity Crisis (SRC 1997): Potential Design Complexity and Designer Productivity. Logic transistors per chip (M) and productivity (K transistors per staff-month), log scale, 1981 to 2009. Complexity growth rate: 58% / yr compounded. Productivity growth rate: 21% / yr compounded.]
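The two compounded rates diverge quickly. A minimal sketch in C of that divergence, assuming both curves start from the same baseline of 1.0 (the slide does not give starting values); `growth_gap` is a hypothetical helper name, not something from the deck:

```c
#include <assert.h>

/* Ratio of design complexity to designer productivity after `years` years.
 * Both quantities are assumed to start at 1.0 (illustrative baseline only). */
double growth_gap(double complexity_rate, double productivity_rate, int years)
{
    double gap = 1.0;
    assert(years >= 0);
    for (int y = 0; y < years; y++)
        gap *= (1.0 + complexity_rate) / (1.0 + productivity_rate);
    return gap;
}
```

With the slide's rates, `growth_gap(0.58, 0.21, 10)` comes out at roughly 14: under these assumptions, a decade of 58% versus 21% growth leaves complexity about 14 times further ahead of productivity than it started.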

\[ \mathrm{ROI} = \frac{\text{Return}}{\text{Investment}} = \frac{\text{chip volume} \times (\text{chip ASP} - \text{chip unit cost})}{\text{chip development cost}} \]


ROI Goal: One Design, Many Design-ins

One chip, many system designs: low-end still camera, high-end still camera, video camcorder

[Figure: SOC Flexibility = Cost Reduction. Total per-unit cost (0 to 120) versus system designs per chip design (1 to 7), for 100K and 1M system volumes. Model: $10M design cost, $15 manufacturing cost, 5% premium for programmability.]
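The cost model behind this chart can be sketched in a few lines of C. The slide gives only the parameters ($10M design cost, $15 manufacturing cost, 5% programmability premium), so the formula below is an assumption about how they combine, and `per_unit_cost` is a hypothetical name:

```c
#include <assert.h>

/* Total per-unit cost when one chip design is reused across several
 * system designs. Assumed model (not stated explicitly on the slide):
 *   cost = manf_cost * (1 + premium) + design_cost / (designs * volume)
 */
double per_unit_cost(double design_cost, double manf_cost, double premium,
                     int system_designs, double volume_per_system)
{
    assert(system_designs > 0 && volume_per_system > 0.0);
    double silicon = manf_cost * (1.0 + premium);
    double amortized_design =
        design_cost / ((double)system_designs * volume_per_system);
    return silicon + amortized_design;
}
```

Under this assumed model, one system design at 100K units costs about $115.75 per unit, while seven system designs sharing the chip bring it down to about $30.04. At 1M units per system the design cost matters far less (about $25.75 for a single design).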


Configurable Processor’s Role

[Figure: Performance (vertical axis) versus flexibility (horizontal axis), positioning application-specific logic, general-purpose processors, and configurable processors.]


Configurable Processors Enable New Roles: Taking Performance to a New Level

[Figure: Optimized ConsumerMarks/MHz, bar chart (axis 0.0 to 2.0). Bar labels in order: Xtensa optimized, Xtensa out-of-box, MIPS64 20Kc, ARM1020E, MIPS64b (NEC VR5000). Bar values as extracted: 2.0, 0.087, 0.080, 0.059, 0.058, 0.039.]

[Figure: Optimized TeleMarks/MHz, bar chart (axis 0.0 to 0.5). Bar labels in order: Xtensa optimized, TI C6203 optimized, Xtensa out-of-box, TI C6203 out-of-box, MIPS64 20Kc, MIPS64b (NEC VR5000), ARM1020E. Bar values as extracted: 0.473, 0.03, 0.023, 0.016, 0.013, 0.011, 0.23, 0.017.]

[Figure: Optimized NetMarks/MHz, bar chart (axis 0.00 to 0.14). Bar labels in order: Xtensa optimized, Xtensa out-of-box, MIPS64 20Kc, ARM1020E, MIPS64b (NEC VR5000), MIPS32b (NEC VR4122). Bar values as extracted: 0.123, 0.03, 0.018, 0.017, 0.016, 0.01.]

Marks/MHz ~ energy efficiency

Source: EEMBC


Automatic Generation of Processors: Achieves Required Performance Faster

[Figure: An electronic specification feeds the Processor Generator, which produces the hardware design (RISC, DSP, OCD, Timer, FPU, designer-defined blocks, Cache) and customized software. Build using any IC process.]

Design processor in one hour


Processors as Basic Building Block

[Figure: SOC block diagram with CPU, signal processing (DSP), protocol processing, I/O, memory, and application accelerator (encrypt, imaging, audio) blocks. Blocks labeled application-specific logic are shown alongside the configurable processor.]















Flexibility is the Key to ROI

\[ \mathrm{ROI} = \frac{\text{Return}}{\text{Investment}} = \frac{\text{chip volume} \times (\text{chip ASP} - \text{chip unit cost})}{\text{chip development cost}} \]

Flexibility means more systems per design

Programmability means more “hot features” available

Little impact on chip cost – pennies per processor

Automatically-generated configurable processors reduce design time, team size and re-spin risk


Example: NEC TCP/IP Offload Engine (TOE) Platform

The NEC TOE achieves full wire speed with ten Tensilica cores (eight parallel processing cores plus two management and dispatch cores), targeting high-performance IP-based network storage: NAS and IP-SAN.

[Figure: TOE block diagram with 8 parallel Xtensa processors at 200 MHz, a MAC, and Gigabit Ethernet × 2 ports.]


Implications of Multiprocessor SOC

Designers will routinely “waste” processors to get other benefits
• Greater speed to market and certainty of success
• Higher abstraction in design

Tremendous creativity and diversity in on-chip communications
• Topologies: buses, hierarchies of buses, cross-bars, systolic arrays, pipelines
• New issues and methods: reliability, redundancy, asynchrony, QOS
• Programming models for large numbers of tasks: finding parallelism

Software languages displace hardware languages
• C/C++, not Verilog, VHDL, SystemVerilog, etc.

Changing demographics of complex SOC design
• Broader population of engineers and programmers capable of SOC design
• Unified hardware-software design user interface “cockpit”


New Types of Processors: Exploiting Latent Parallelism

[Figure: Multiple processors vs. multiple-issue in network processors. Number of processors (vertical axis) versus operations per cycle (horizontal axis), with contours at 8, 16 and 64 instrs/cycle. Plotted devices: Xelerated, Intel IXP1200, Broadcom BCM1250, Cognigine RCU/RSF, Cisco PXF, EZchip NP-1, IBM PowerNP, Lexra NetVortex, Motorola C-5, BRECIS, AMCC np7120, Clearwater CNP810, Vitesse IQ2x00, Agere PayloadPlus, Alchemy, Mindspeed CX27470. Source: K. Keutzer, UCB]

Two design styles are annotated on the chart:

• Very small processors, modest extensions, high task-level parallelism

• High-performance processors; VLIW, SIMD and application-specific extensions; high data- and instruction-level parallelism


Projected Processor Speed and Density

                          2009       2016
Geometry                  50nm       22nm
Clock                     1.8GHz     5.7GHz
Small proc area (mm2)     0.08       0.016
Small proc/chip           240        1400
High perf proc/chip       10         15
MIPS/chip                 600,000    11,000,000

40mm2 die size for consumer SOC


The Law of SOC Processor Scaling

Processors/chip: Up to 30% per year

Total MIPS: 65% per year

[Figure: Aggregate SOC Performance. Billions of operations/second (log scale, 10 to 100,000), 2001 to 2015.]

[Figure: Processors Per Chip (log scale, 10 to 10,000), 2001 to 2015.]

Tensilica model based on ITRS 2001, 140mm2 die size


Enablers for Large Scale MPSOC: What is Tensilica Working On?

Performance
• Throughput and efficiency mean more opportunities for application-specific processors over RTL
• New processor interfaces enable greater parallelism

Insight
• Unified hardware-software development environment
• Performance and cost-oriented analysis

Automation
• Automatic generation of compilers, RTOS, MP models
• “Hands-free” instruction set optimization


Performance: FLIX™

FLIX = Flexible Length Instruction Xtensions

FLIX freely intermixes 16-, 24-, and 64-bit instructions
• No code-bloat
• No modes
• Full backwards code compatibility with current Xtensa ISA
• Long instructions implement complex extensions
• Fast and parallel code when needed, else very compact code

Arbitrary Instruction Field Specification
• Multiple independent operations packed into a wide instruction word
• Multiple Load / Store Units

Minimal Overhead
• ~5000 gates added control logic

[Figure: Instruction packing in memory (little endian shown). 16-, 24- and 64-bit instructions are packed back to back across 64-bit words (bit positions 63, 31 and 0 marked).]
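Mixing instruction lengths without modes implies that the length can be read from the instruction itself. A minimal sketch in C of such a decoder: only the 64-bit case (low four bits equal to 14) is taken from the deck's TIE `length` declarations; the split between 16- and 24-bit encodings, and the function names, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Instruction length in bytes, decoded from the low 4 bits of the first
 * byte in memory (little-endian layout). The value 14 for 64-bit
 * instructions follows `InstBuf[3:0] == 14` in the TIE examples; the
 * 16-/24-bit ranges below are ASSUMED, not taken from the slides. */
int instr_length_bytes(uint8_t first_byte)
{
    unsigned low4 = first_byte & 0x0Fu;
    if (low4 == 14u)
        return 8;                       /* 64-bit FLIX instruction */
    if (low4 >= 8u && low4 <= 13u)
        return 2;                       /* assumed 16-bit encodings */
    return 3;                           /* assumed 24-bit encodings */
}

/* Count the instructions in a packed byte stream of mixed lengths.
 * Returns -1 if the final instruction is truncated. */
int count_instructions(const uint8_t *code, size_t size)
{
    size_t pc = 0;
    int count = 0;
    while (pc < size) {
        size_t len = (size_t)instr_length_bytes(code[pc]);
        if (pc + len > size)
            return -1;
        pc += len;
        count++;
    }
    return count;
}
```

Because every instruction announces its own length, a fetch unit (or this loop) can step through 16-, 24- and 64-bit instructions in any order with no mode switch and no padding between them.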


Performance: Pushing to new levels of throughput

256pt FFT (Radix-4)

Configuration                                                                     Cycles
Simple RISC Task Engine: minimal configuration Xtensa processor (18K gates)      155,389
Scalar Performance: base Xtensa processor with MUL32 option                        23,633
SIMD Performance: Xtensa processor with 4-way SIMD Vectra DSP Engine                3,055
FLIX Performance: Conexant Testarossa DSP with 4-way SIMD and FLIX                  1,063

FLIX: Average of 6% larger code on complex code sets


Performance: A Complex FLIX Example

[Figure: Datapath with a register file (16 x 256), function units A, B, C, D and E, two 4K x 256 RAMs each with its own address port, an input queue (In-Q) and an output queue (Out-Q).]

64-bit instruction format:

Field    Bits     Width
MemA     63:59    5
InQ      58       1
WrtA     57:53    5
ExA/B    52:37    16
ExC/D    36:25    12
ExE      24:19    6
WrtB     18:14    5
OutQ     13:9     5
MemB     8:4      5
1110     3:0      4

• 9 independent operation fields

• Multiple load/store

• Input/output queues


Performance: Writing TIE for FLIX

    length l64 64 { InstBuf[3:0] == 14 }

    format flix64 l64
    slot slot0 flix64[*]
    slot slot1 flix64[*]

    opcode L32I slot0
    opcode S32I slot0
    opcode ADD slot0
    opcode NOP slot0

    opcode ADD slot1
    opcode ADDI slot1
    opcode SUB slot1
    opcode NOP slot1

All components of the processor solution are automatically generated from the TIE code in <2 hours:

• RTL & HW flow scripts

• Toolchain

• System models

• Operating System support


Performance: Conexant DSP Architectural Requirements

VLIW-SIMD programming model

16- and 24-bit scalar instructions

64-bit instructions with multiple operations

2 or 4 16x16 MAC units

6R/3W Conexant-defined register file

At least two load store units

7-stage pipe with 2 cycles for I/D memory access

Stall on memory bank conflicts

Backward compatibility with previous Conexant DSPs via Translation Instruction Set (a sub-operation of the 64-bit instructions)


Performance: Conexant Testarossa Encoding

64-bit instruction format:

Field         Bits     Width
ALU           63:46    18
MAC           45:28    18
Load/Store    27:4     24
1110          3:0      4

Operation groups:

• Testarossa Load/Store (vector, scalar and unaligned load/store) and Xtensa core instructions (load/store, branch, ALU): 234 operations

• Complex multiply, real multiply, select: 24 operations

• ALU, shift, 2nd load/store: 52 operations


Insight: The Multiple Core SOC Design Problem

Three Skill Sets, Three Environments?

Software Development Environment
• Tasks: C code development; debugging; C project management; code profiling, tuning
• Tools: GNU-based Tensilica software development tools; Xtensa C/C++ compiler; Xtensa Instruction Set Simulator
• Environment: command line interface; partner-provided software IDEs (WindRiver, ATI/Mentor, MontaVista)

Processor Optimization Environment
• Tasks: TIE code development for extensions; configuration option management
• Tools: web-based Xtensa Processor Generator; TIE Compiler (single source TIE file for processor extension)
• Environment: command line interface; web browser interface

SOC System Architecture Exploration
• Tasks: system modeling and simulation; multiple core debug
• Tools: Xtensa Modeling Protocol (XTMP); bus functional models for co-simulation / co-verification EDA tools
• Environment: command line interface; EDA partners' system analysis / debug environments


Insight: Xtensa Xplorer

Software Development Environment
• C code development
• Debugging
• C project management
• Code profiling, tuning

Processor Optimization Environment
• TIE code development for extensions
• Configuration option management

SOC System Architecture Exploration
• System modeling and simulation
• Multiple core debug


Insight: Develop and Manage Processor Configurations

Manage complexity of growing variety of processor optimization choices

Software and processor optimization within same IDE

Gate count estimate:
• per instruction
• per register file
• per user state

Interactive display of instruction…
• operands
• pipelining
• semantics

Interactive TIE Editor
• language-sensitive editing and help


Insight: Create, Analyze & Tune ISA Extensions (TIE)

Profile and visualize performance impact of custom instructions

Pipeline Viewer shows instruction flow of disassembled code

Static analysis of pipeline stalls pinpoints areas for fine tuning

Highlight instructions with variable latency (e.g. cache misses)

Interlocks on deep TIE pipelines fully modeled and explained


Insight: Analyze and Select Caches to Meet Speed/Area Goals

Automatically profile code across range of cache configuration options

Performance charts visually compare different configurations


Insight: Chip-level Software and Simulation for MPSOC

Manage system memory maps & link/load for multiple-core SOCs

Develop, run and debug multiple-core simulations using Xtensa Modeling Protocol (XTMP)

Auto-generated XTMP model based on memory maps

• Specify chip-level memory maps for shared/private memories

• Place interrupt and reset vectors

• Assign code/data to distributed memories


Automation: The Next Generation

[Figure: Application source code feeds a NEW automation tool, alongside the electronic specification, into the Xtensa Processor Generator. The generator outputs a complete hardware design (ALU, DSP, OCD, Timer, FPU, Register File, Cache), buildable at any fab, and customized software tools.]

Application source code shown in the figure (truncated in the original):

    int main(){ int i; short c[100]; for (i=0;i<N;i++) { c[i] = 0; } for (i=0;i<N;i++)


Automation: Goals for Processor Extension

Flexibility
• Application code might be written/modified after tape-out
• Generated TIE must be sufficiently general purpose so that small changes to application code do not degrade performance

Control
• Full automation
  - C/C++ in, TIE out
  - C/C++ + generated TIE in, binary code out
• Optional full control by user
  - Guide the tool and/or select instructions
  - Add to or change generated TIE
  - Tune application to better take advantage of TIE

Speed: minutes, not days


Automation: Basic Operation - Fusion

Original C Code:

    int *a, *b, *c;
    for (int i=0; i<n; i++)
        c[i] = (a[i] + b[i]) >> 2;

Complete TIE Code:

    operation add_shift (out AR c, in AR a, in AR b) {
        wire t[31:0] = a+b;
        assign c = {2{t[31]},t[31:2]};
    }

The add (+) and the shift (>> 2) are fused into a single operation. The combined add-shift operator is automatically used wherever an equivalent expression occurs in the source.
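To make the fused operation concrete, here is a minimal software model in C of the expression `(a + b) >> 2` as one step. It is a sketch of the intended semantics (32-bit wrapping add, then arithmetic shift right by two), not generated code; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a fused add followed by an arithmetic shift right by 2.
 * The add wraps modulo 2^32, as a 32-bit hardware adder would. */
int32_t add_shift(int32_t a, int32_t b)
{
    uint32_t t = (uint32_t)a + (uint32_t)b;
    uint32_t shifted = t >> 2;          /* logical shift */
    if (t & 0x80000000u)
        shifted |= 0xC0000000u;         /* copy the sign bit into the top 2 bits */
    /* Conversion back to int32_t assumes two's complement representation. */
    return (int32_t)shifted;
}

/* The loop from the slide with the fused operation in place of + and >>. */
void add_shift_array(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = add_shift(a[i], b[i]);
}
```

For example, `add_shift(8, 4)` gives 3 and `add_shift(-8, 0)` gives -2, matching an arithmetic `>> 2` on the sum.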


Automation: Basic Operation - Multiple Ops in FLIX

Original C Code:

    for (int i=0; i<n; i++)
        c[i] = (a[i] + b[i]) >> 2;

Complete TIE Code (64-bit instruction with 3 slots: S0, S1, S2):

    length l 64 { InstBuf[3:0] == 14 }
    format f l
    slot slot0 f[*] ADDI, NOP
    slot slot1 f[*] ADD, SRAI, NOP
    slot slot2 f[*] L32I, S32I, NOP

Generated Assembly:

    loop:
    {addi a9,a9,4;   add a12,a10,a8;  l32i a8,a9,0}
    {addi a11,a11,4; srai a12,a12,2;  l32i a10,a11,0}
    {addi a13,a13,4; nop;             s32i a12,a13,0}

• Original C compiled to 3 cycles/iteration


Automation: Basic Operation - SIMD/Vector

Original C Code:

    short *a, *b, *c;
    for (int i=0; i<n; i++)
        c[i] = a[i] + b[i];

Complete TIE Code:

    regfile vec 64 16 v;
    operation add16x4(out vec c, in vec a, in vec b) {
        assign c = {a[63:48]+b[63:48],
                    a[47:32]+b[47:32],
                    a[31:16]+b[31:16],
                    a[15:0]+b[15:0]};
    }

[Figure: vector a + vector b = vector c, element by element.]

Four iterations in parallel
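The lane-wise behavior of `add16x4` can be modeled in plain C. This is a sketch of the semantics the TIE operation describes (four independent 16-bit adds inside a 64-bit value), written for illustration rather than taken from any generated toolchain:

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 4-lane, 16-bit add on a 64-bit vector register.
 * Each lane wraps on its own; no carry crosses a lane boundary. */
uint64_t add16x4(uint64_t a, uint64_t b)
{
    uint64_t c = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 16 * lane;
        /* Truncating to 16 bits keeps only this lane's sum. */
        uint16_t sum = (uint16_t)((a >> shift) + (b >> shift));
        c |= (uint64_t)sum << shift;
    }
    return c;
}
```

The key difference from an ordinary 64-bit add is the missing carry between lanes: `add16x4(0xFFFF, 1)` yields 0, where a scalar 64-bit add would yield 0x10000.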


Automation: Processor Extension Step 1

Compile the C/C++ application code
• Designer specifies compiler optimization flag
• Compiler generates comments to help user tune code
• Optimized code yields better results

Compiler generates information from application
• Feedback optimization ranks code regions by frequency
• Vectorizer determines which loops can be vectorized
• Fuser generates dataflow graphs for important regions
• Operation counts for each type of opcode for every region



Automation: Processor Extension Step 2

Generated information used to select and generate TIE:
• For each code region, generate many potential sets of TIE extensions (configurations)
  - Vectorize by 1, 2, 4, 8
  - Add FLIX functional units
  - Add fusions
  - Generation guided by estimated performance
• Evaluate all generated configurations across all regions
• Find best set of merged configurations given budget



Automation: Processor Extension Step 3

Use the TIE with a C/C++ or assembly application
• Compiler reads TIE (automatically or manually generated) and generates code
  - FLIX slot/format TIE specification mapped to resource tables
  - Generalized graph matcher generates dataflow graphs from TIE
  - Vectorizer vectorizes a loop and checks whether all required operations are available in TIE
• User free to tune the code in ANSI C/C++ or assembly
• Simulator, assembler, debugger, RTOS support generated directly from TIE


Automation: Example: “Sum-of-Absolute Differences” Search

Generated Configuration Parameters

Config   Speedup   Gates Added (K)   SIMD Factor   FLIX Width (Slots)   Load/Store Units   Fusion
1        8.7x      74                8             3                    2                  Yes
2        8.1x      57                8             2                    2                  Yes
3        7.6x      46                4             3                    2                  Yes
4        7.6x      37                8             2                    1                  Yes
5        6.8x      33                4             2                    2                  Yes
6        6.8x      26                8             1                    1                  Yes
7        6.1x      18                4             2                    1                  Yes
8        5.1x      12                4             1                    1                  Yes
9        4.3x      8                 2             2                    1                  Yes
10       3.4x      5                 2             1                    1                  Yes
11       1.4x      0.3               1             1                    1                  Yes

[Figure: Scatter plot of the numbered configurations, performance increase versus hardware cost.]

Wide range of choices of performance increase versus hardware cost
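Picking a point on this trade-off curve for a given gate budget is a simple selection problem. The sketch below, in C, uses the speedup and gate-count columns from the table; the selection rule (fastest configuration that fits, fewer gates on a tie) and all names are assumptions for illustration, not a description of how the Tensilica tool chooses.

```c
#include <assert.h>
#include <stddef.h>

struct config {
    int id;
    double speedup;
    double gates_k;     /* gates added, in thousands */
};

/* Speedup and gates-added columns from the generated-configuration table. */
static const struct config sad_configs[] = {
    {1, 8.7, 74.0}, {2, 8.1, 57.0}, {3, 7.6, 46.0}, {4, 7.6, 37.0},
    {5, 6.8, 33.0}, {6, 6.8, 26.0}, {7, 6.1, 18.0}, {8, 5.1, 12.0},
    {9, 4.3, 8.0},  {10, 3.4, 5.0}, {11, 1.4, 0.3},
};

/* Return the id of the fastest configuration that fits the gate budget,
 * preferring fewer gates when speedups tie; -1 if nothing fits. */
int best_config(const struct config *cfgs, size_t n, double budget_k)
{
    const struct config *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const struct config *c = &cfgs[i];
        if (c->gates_k > budget_k)
            continue;
        if (best == NULL || c->speedup > best->speedup ||
            (c->speedup == best->speedup && c->gates_k < best->gates_k))
            best = c;
    }
    return best ? best->id : -1;
}

/* Convenience wrapper over the table above. */
int best_sad_config(double budget_k)
{
    return best_config(sad_configs,
                       sizeof sad_configs / sizeof sad_configs[0], budget_k);
}
```

With a 20K-gate budget this picks configuration 7 (6.1x for 18K gates). Raising the budget to 46K picks configuration 4 rather than 3, since both reach 7.6x and configuration 4 needs fewer gates.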


Automation: Application Examples

Application                    Speedup   Original code size      Code size after   Code size on MIPS32   Configurations   Run time to generate
                                         (before acceleration)   acceleration      (using gcc -O2)       visited          configurations
Radix-4 FFT                    10.6x     1.5 KB                  3.6 KB            4.4 KB                175,796          3 minutes
GSM Encoder                    3.9x      17 KB                   20 KB             38 KB                 576,722          15 minutes
GSM Encoder (using FFT TIE)    1.8x      17 KB                   19 KB             38 KB                 N/A              N/A
MPEG4 Encoder                  3.3x      111 KB                  136 KB            356 KB                1,340,312        30 minutes


Conclusion

MPSOC represents a new medium of implementation:
• Opportunity: Cost, power, bandwidth potential of semiconductors
• Challenge: Return on investment for design of complex chips

The transition to MPSOC will drive…
• …new parallel architectures (focus becomes interconnect, not ISA)
• …shift from hardwired design to programmable design
• …new class of hardware/software environments for processor and SOC generation, integration and use
• …rapid growth in processor counts and aggregate performance

Important historical parallel between the integrated circuit (many small transistors per chip) and MPSOC (many small processors per chip)


Key Research Directions

Tool environments for identification/exploitation of latent parallelism

Unified programming model for MP

Technical and economic tools for optimizing efficiency vs. flexibility (spectrum of early vs. late binding)

Application-specific interconnect topologies and generators

Role for hardware-centric programmability (FPGA) vs. software-centric programmability (processor)

Vision: set of communicating tasks + chip interface specification + performance constraints → set of program binaries + chip GDSII

Profile-based automation:
• Generation of ISA
• Assignment of tasks to processors [1-to-n, n-to-1; static vs. dynamic allocation]
• Profile-based implementation of messaging mechanism and physical interconnect
• Memory configuration, memory map, shared code and data section allocation

University Program: Free license to tools and models for MPSOC design using extensible processors. Contact: Steve Roddy, [email protected]
