Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Vlsi_03_2005.ppt/April_2005 1
VLSI Trends in MicroarchitecturePast, present and future
TAU universityJanuary 24, 2006
Uri Weiser
Vlsi_03_2005.ppt/April_2005 2
Agenda
Microarchitecture– VLSI– Trends Past and present:
» Pipeline, superpipeline» Out of Order» Branch prediction» Caches» Trace cache» Threads and Chip Multiprocessing
– Future» Asymmetric» Accelerators
Vlsi_03_2005.ppt/April_2005 3
“[In the beginning] we had little idea of what we had started. ...I remember... saying, ‘Okay, we’ve done integrated circuit. What do we do next?”
Gordon E. Moore
Vlsi_03_2005.ppt/April_2005 4
TRENDSIN
VLSI
Sources:Shekhar BorkarUri Weiser
Vlsi_03_2005.ppt/April_2005 5
Process TechnologyProcess Technology1.51.5µµ 1.01.0µµ 0.80.8µµ 0.60.6µµ 0.350.35µµ 0.250.25µµ 0.180.18µµ 0.130.13µµ
Intel386Intel386™™ DX DX ProcessorProcessor
Intel486Intel486™™ DX DX ProcessorProcessor
PentiumPentium®®ProcessorProcessor
PentiumPentium®® Pro Pro ProcessorProcessor
PentiumPentium®® II II ProcessorProcessor
PentiumPentium®® 4 4 ProcessorProcessor
PentiumPentium®® III III ProcessorProcessor
ProcessorProcessor
Technology trendTrends
Vlsi_03_2005.ppt/April_2005 6
Performance History
10
100
1,000
10,000
1.0μ 0.7μ 0.5μ 0.35μ 0.25μ 0.18μ
Freq
(Mhz
)
Freq (uArch)Freq (Process)
18.3X
3.3X
i486Pentium® proc
Pentium® II & III proc
1.0u-0.18u, 1989-2001
Frequency increased 61X
1. 18.3X due to process technology
2. Additional 3.3X due to uArch
1
10
100
1.0μ 0.7μ 0.5μ 0.35μ 0.25μ 0.18μ
Rel
ativ
e Pe
rfor
man
ce Perf due to uArchitecture
Perf due to Freq
14X
7X
i486Pentium® proc
Pentium® II & III proc
Pentium® 4 proc
Performance increased ~100X
1. 14X due to process tech
2. Additional 7X due to uArch & design
Trends
Vlsi_03_2005.ppt/April_2005 7
Process Technology: Minimum Feature Size
Source: Intel, SIA Technology RoadmapSource: Intel, SIA Technology Roadmap0.01
Feature SizeFeature Size(microns)(microns)
0.1
1
10
’68 ’71 ’76 ’80 ’84 ’88 ’92 ’96 ’00 ’04 ’08
IntelSIA
Trends
Vlsi_03_2005.ppt/April_2005 8
Transistors on a Chip
400480088080
8085 8086286
386486 Pentium® proc
P6
0.001
0.01
0.1
1
10
100
1000
1970 1980 1990 2000 2010Year
Tran
sist
ors
(MT)
2X growth in 1.96 years!
Transistors on a chip doubled every two yearsTransistors on a chip doubled every two years
Pentium® 4 proc
Trends
Vlsi_03_2005.ppt/April_2005 9
Die Size Growth
40048008
80808085
8086 286386
486 Pentium ® procP6
1
10
100
1970 1980 1990 2000 2010Year
Die
sid
e (m
m) =
Are
a1/2
Die size grows? Is it saturated?Die size grows? Is it saturated?
Trends
Vlsi_03_2005.ppt/April_2005 10
Frequency
P6Pentium ® proc
48638628680868085
8080800840040.1
1
10
100
1000
10000
1970 1980 1990 2000 2010Year
Freq
uenc
y (M
hz)
Doubles every 2 years
Lead Microprocessors frequency doubles every 2 yearsLead Microprocessors frequency doubles every 2 years
Trends
Vlsi_03_2005.ppt/April_2005 11
Frequency of Operation
PPro-300PPro-225
486DX2-66
386DX-33 486DX-33
8080
80868085
8088
80286
386DX-16486DX-25
486DX-50PP-66
PP-90
1
10
100
1000
1970 1975 1980 1985 1990 1995 2000Years
Freq
uenc
yM
HZ
IntelPPCOther
Alpha
Trends
Vlsi_03_2005.ppt/April_2005 12
Frequency of Operation (cont.)
PPro-300
486DX2 PP-66PP-90
PP-100PP-120
PPro-225
PPro-180PPro-150/PPC604-150
PP-133/PPC604-133
PPC601-100
PPC620C-200
A21064A-275A21164-300
M-R4400-200
0
50
100
150
200
250
300
350
400
1992 1993 1994 1995 1996 1997 1998Years
Freq
uenc
y (M
Hz)
Intel
PPC
Other
A21164-333
A21164A-417
Trends
Vlsi_03_2005.ppt/April_2005 13
Brainiacs and Speed demons
Source: ISCA 95, p. 174
1.6
0.57 0.6
0
0.5
1
1.5
2
0 50 100 150 200 250 300 350
Speed demons
Brai
niac
sSPECInt92 = 400
300
200
SPECInt92 = 10050
MHz
ALPHA
X86
PowerPC
PENTIUM
PENTIUM PRO
21164
21064
1.0
0.5
1.5
2.0
50 100 150 200 250 300 350 400
1
SPEC
Int9
2 / M
Hz
Trends
Vlsi_03_2005.ppt/April_2005 14
Trends of Future Processors
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
0 500 1000 1500 2000 2500
M HZ
SPEC
int/1
00M
HZ
PII, PIIISPECInt = 60SPECint = 40
SPECint = 27= 18
= 12= 8
= 4
SPECInt = 90
P IV
P M
Trends
Vlsi_03_2005.ppt/April_2005 15
1
10
100
1000
1.5μ 1μ 0.7μ 0.5μ 0.35μ 0.25μ 0.18μ 0.13μ 0.1μ 0.07μ
Wat
ts/c
m2
i386i486
Pentium ® processor
Pentium Pro ® processor
Pentium II ® processor Pentium III ® processor
Power density continues to get worse
Hot plate
Nuclear ReactorRocketNozzleSun’s
Surface
Trends
Vlsi_03_2005.ppt/April_2005 16
On Die Cache Memory
Core Logic
Cache
Core Logic
Cache
Core Logic
Cache
0.13μ
0.18μ
0.10μ
Pentium III & 4
Pentium IIIPentium IIPentium
ProPentium
0%
20%
40%
60%
80%
100%
0.7μ 0.5μ 0.35μ 0.25μ 0.18μ 0.13μ 0.10μ
Cache % offull chiparea ?
Larger % of die area will be memoryLarger % of die area will be memory
Trends
Vlsi_03_2005.ppt/April_2005 17
Process trend – the theory (cont)Performance driven era vs. Power aware era
Processes
“Old” processes
Pow
er ra
tioProcess generation p vs. p-1
@ same die area
n1µ
n+1 n+2 n+30.35µ
“new” processes
m0.25µ
m+1 m+2 m+30.09µ
1.4X Frequency0.9X voltage0.7X capacity/transistor1X area2X transistors
Power increase/generation: 1.6X
Power aware era(Performance within power envelop)
1.4X Frequency0.75X voltage0.7X capacity/transistor1X area2X transistor
Power increase/generation: 1.1X
Performance driven era(no power constrains)
Reference: Eric Spargle
1.0
2.0
3.0
4.0
Trends
Vlsi_03_2005.ppt/April_2005 18
Processor roadmap trend – real life (cont)Extension of Pollack’s Rule (Micro32, 1999)
Processor generation k vs. k-1 compacted @ the same process technology
Power
0
1
2
3
4
1.5 1 0.7 0.5 0.35 0.18Technology Generation
Gro
wth
(X)
Perf/power delta ratio
1 : >3Performance
!Perf/power delta
ratio3 : 1
Trends
Vlsi_03_2005.ppt/April_2005 19
Microarchitecture
Vlsi_03_2005.ppt/April_2005 20
The Generic Processor
Datasupply
Instructionsupply
Executionengine
Sophisticated organization to “service” instructionsInstruction supply– Instruction cache– Branch prediction– Instruction decoder– ...
Execution engine– Instruction scheduler– Register files– Execution units– ...
Data supply– Data cache– TLB’s– …
Goal - Maximum throughput – balanced design
Microarchitecture
Vlsi_03_2005.ppt/April_2005 21
“The Core” - A Block Diagram
FPU ALU1 ALU2AGU
+MEM
n
2
DataCache
D-TLBReg.File
Scheduler
Instruction flowRead SourcesWrite DestinationWrite Bypass
Write Back
Microarchitecture
Vlsi_03_2005.ppt/April_2005 22
Parallelism Evolution
ProcessorElement
Instruction
Basic configuration Pipeline
VLIW
Superscalar - In order
Superscalar - Out of Order
a
...
PE
a ...
PE
b
PE
c
PE
n
PE
a ...
PE
d
PE
f
PE
n
b
c
e
PE
a a
PE
b c n
PE PE PE PE
Parallelism
Vlsi_03_2005.ppt/April_2005 23
PipelineBreak the work to smaller pieces
Increased throughput– increased # of completed instructions per cycle and reduces cycle time– Number of stages varies
– Small: 4-5 (Pentium), “Superpipeline” ~14 (Pentium Pro), “ultra-pipeline” ~25 (PIV)
Calls for good balancing among stages
I1I2
I3
F D E WF D E W
F D E WF D E W
F D E WF D E W
F: FetchD: DecodeE: ExecuteW: Write Back
0 1 2 3 4 5 6 7 8 9 10 11 12 (Cycles)
1/4 IPC = 4 CPI
1 IPC = 1 CPIIPC = Instructions Per CycleCPI = Cycles Per Instructions
ExamplesIntel 486NS 32532
t
Parallelism
Vlsi_03_2005.ppt/April_2005 24
Pipeline Stalls
But there are “stalls” in the pipeline– “Data Hazards”: Data flow dependency (instructions output/input)
» Solved by: bypasses, renaming– “Control Hazards”: Control flow dependencies
» Solved by branch prediction– “Structural Hazards”: Limited resources– Other (Cache misses, long latency instructions, page faults….)
F D E WF D stall W
F D E WF E W
F D E W
F: FetchD: DecodeE: ExecuteW: Write Back
Estall
D
Data Flow stall(w/o bypass)
Control Flow stall
stallstallstallstall
Address Generation Interlock
stall
0 1 2 3 4 5 6 7 8 9 10 11 12 (Cycles)t
Parallelism
Vlsi_03_2005.ppt/April_2005 25
Super ScalarPerforms more in a single cycle
Ideally, can multiply the throughput– But stall occurs more frequently
0 1 2 3 4 5 6 7 8 9 10 11 12F D E W
F D E W
F D E W
F D E WF D E W
F D E W
F D E W
F D E W
F D E W
F D E W
F D E WF D E WStall
2 IPC = 1/2 CPI
ExamplesIntel Pentium® Proc. Alpha 21164
t
Parallelism
Vlsi_03_2005.ppt/April_2005 26
Super PipelineSplit to shorter stages - allows higher frequency
Ideally, can (again) multiply the throughput, but– Stall penalties do not scale (e.g., control flow stall, cache misses)– Clock setup/hold reduces net cycle time - each instruction takes longer!
In the example above: 2X stages, but performance gain is <33%
0 1 2 3 4 5 6 7 8 9 10 11 12
F: FetchD: DecodeE: ExecuteW: Write Back
1 IPC = 1 CPI33% higher freq!
0 1 2 3 4 5 6 7 8 9 10 11 12
F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2
F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2
F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2
F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2
Old clk =
New clk =
Examples:Intel Pentium® II/III/4
Parallelism
Vlsi_03_2005.ppt/April_2005 27
1 2 95 6 73 4 8
1 2 95 6 73 4 8
Out Of Order ExecutionIn Order Execution: instructions are processed in their program order.
– Limitation to potential Parallelism.OOO: Instructions are executed based on “data flow” rather than program orderBefore: src -> dest(1) load (r10), r21(2) mov r21, r31 (2 depends on 1)(3) load a, r11(4) mov r11, r22 (4 depends on 3)(5) mov r22, r23 (5 depends on 4)
After:(1) load (r10), r21; (3) load a, r11;
<wait for loads to complete>(2) mov r21,r31; (4) mov r11,r22;
(5) mov r22,r23;Usually highly superscalar
1F
2F
3F
4F
1D
2D
3D
4D
5F 5D 5W
1W
2E
3w
4E
5E
2W
3W
4W
5E
1E
2E
3E
4E
5E
1E
2E
3E
4E
5E
1F
2F
3F
4F
1D
2D
3D
4D
5F 5D 5W
1W
2E
3E
4E
5E
2W
3E
4E
5E
4E
5E
1E
2E
3E
4E
5E
1E
2E
3E
4E
5E
4W
5E
3W
In Order Processing
Out of Order Processing
In Order vs. OOO execution.Assuming:- Unlimited resources- 2 cycles load latency
Examples:Intel Pentium® II/III/4Compaq Alpha 21264
t
t
OOO
Vlsi_03_2005.ppt/April_2005 28
Out Of Order (cont.)Advantages
– Help exploit Instruction Level Parallelism (ILP)– Help cover latencies (e.g., cache miss, divide)– Artificially increase the Register file size (i.e. number of registers)– Superior/complementary to compiler scheduler
» Dynamic instruction window» Make usage of more registers than the Architecture Registers
Complex microarchitecture– Complex scheduler. Involves also
» Large instruction window» Speculative execution
– Requires reordering back-end mechanism (retirement) for:» Precise interrupt resolution» Misprediction/speculation recovery» Memory ordering
OOO
Vlsi_03_2005.ppt/April_2005 29
Speculation
Vlsi_03_2005.ppt/April_2005 30
Branch PredictionGoal - ensure enough instruction supply by correct prefetchingIn the past - prefetcher assumed fall-through
– Lose on unconditional branch (e.g., call)– Lose on frequently taken branches (e.g., loops)
Branch prediction– Predicts whether a branch is taken/not taken– Predicts the branch target address
Misprediction cost varies (higher w/ increased pipeline length)Typical Branch prediction rates: ~90%-96%
4%-10% misprediction,10-25 branches between mispredictions50-125 instructions between mispredictions
Misprediction cost increased with– Pipeline depth– Machine width
» e.g. 3 width x 10 stages = 30 inst flushed!
?
Speculation - BP
Vlsi_03_2005.ppt/April_2005 31
Target Array + Direction PredictionTarget and direction are predicted separately Tag may be partial
Branch IP
tag predicted target
predicted Target
Address
predicted direction(taken/not-taken)
hit / miss(indicates a branch)
DirectionPrediction
(for conditionalbranches only)
TargetPrediction
Speculation - BP
Vlsi_03_2005.ppt/April_2005 32
Speculative ExecutionExecution of instructions from a predicted (yet unsure) pathEventually, path may turn wrong.Advantages:
– Ensure instruction supply– Allow large scheduling window (for out of order)
Issues:– Misprediction cost– Misprediction recovery
Speculation - execution
Vlsi_03_2005.ppt/April_2005 33
Cache - Motivation & PrincipleMemory consumption is growing about 2X every 2 years
– Typical size: (Y2000) 64M-128M, (Y2002) 128M-256MCPU speed grows faster than memory and buses
– CPU/Bus grew from 1:1 to 6:1, and still growing486 Pentium P-II P-III P4 25-66MHz 66-233MHz 200-450MH 0.5-1.33GHz 1.4-2.4GHz33MHz 66MHZ 66-100MHz 133-200MHz 400MHz
– Memory: DRAM: 60-100ns (“10-16MHz”), Cost: <10$ per 1MSRAM is faster but much more expensiveMemory becomes the bottleneck for both instructions and data!SlowSlow or expensive
Solution: Cache - A Small,Small, Fast,Fast, CloseClose memory– Serves as a buffer between CPU and main memory
Contains copy of a portionof the main memory
– Small in size – Dynamically changed
Exploit space and time locality: – Code is fetched sequentially (Space)– Code is re-executed (loops, procedures) (Time)– Access close or previous data (Space, Time)
Cache MemoryCPU
Small, fastClose, Expensive
Large, SlowFast, Cheap
Speculation - Cache
Vlsi_03_2005.ppt/April_2005 34
The Generic Processor
Executionengine
Datasupply
Instruction Cache
BTB Instruction fetch/decode
Rename
Scheduler
Trace/decodedcache
Speculation – Trace Cache
35Vlsi_03_2005.ppt/April_2005 35
Fetch bandwidthexample
A CB B CAtime
Dynamic instruction stream
o o o o
A
B
CD
B E
Control flow graphA, B, C are instruction blocks
Speculation – Trace Cache
36Vlsi_03_2005.ppt/April_2005 36
Trace Cache Concept
• Hold in the “instruction”cache the dynamic stream of the executed instructions
=> Trace cache acts as “branch predictor” + wide instructions supplier
Speculation – Trace Cache
37Vlsi_03_2005.ppt/April_2005 37
Trace Cache Overview
I Cache
Decoder
Trace Cache
FillUnit
address
A B C
A B CA B C
To Execution Core
Build Mode
Stream Mode
Speculation – Trace Cache
38Vlsi_03_2005.ppt/April_2005 38
Trace cache line
• Tag: identifies starting address of trace
• N instructions (potentially decoded)• Next address: next fetch address• path info: branch flags (T, NT), number
of branches, trace ends w/ branch?,…)
N Instructions Next address path infoPC Tag
Speculation – Trace Cache
Vlsi_03_2005.ppt/April_2005 39
Threads
Vlsi_03_2005.ppt/April_2005 40
Scalar Execution
Dependencies reduce throughput/utilization
Time
Threads
Vlsi_03_2005.ppt/April_2005 41
Superscalar Execution
Generally increases throughput, but decreases utilization
Time
Threads
Vlsi_03_2005.ppt/April_2005 42
Predication
Generally increases utilization, increases throughput less(much of the utilization is thrown away)
Time
Threads
Vlsi_03_2005.ppt/April_2005 43
CMP – Chip Multi-ProcessorTime
Low utilization / higher throughput
Threads
Vlsi_03_2005.ppt/April_2005 44
Blocked Multithreading
May increase utilization and throughput, but must switch when currentthread goes to low utilization/throughput section (e.g. L2 cache miss)
Time
Threads
Vlsi_03_2005.ppt/April_2005 45
Fine Grained Multithreading
Increases utilization/throughput by reducing impact of dependences
Time
Threads
Vlsi_03_2005.ppt/April_2005 46
Simultaneous Multithreading
Time
Increases utilization/throughput
Threads
Vlsi_03_2005.ppt/April_2005 47
Future
Vlsi_03_2005.ppt/April_2005 48
Gain
Frequency
Analog Circuit Paradigm
GBWP = Gain Bandwidth Product = constant @ a given technology
BW1
Gain1
BW2Gain2
e.g. Gain1*BW1= Gain2*BW2
Future
Vlsi_03_2005.ppt/April_2005 49
Gain
Frequency
Analog Circuit Paradigm (cont.)
BW1 BWn Gain
Future
Vlsi_03_2005.ppt/April_2005 50
“Theory”Analog Gain Bandwidth Product (GBWP) is constant for a specific technology, this is also true for other “environments”…A computer structure can excel in performance for a specific application set but not at all applications (also true for benchmarks)a person can excel in several areas but not at all...…...
examples: benchmarks, application in coming foils people….
Future
Vlsi_03_2005.ppt/April_2005 51
Tuning for ApplicationsPerformance
“Applications”
Apps1 Appsn
Future
Vlsi_03_2005.ppt/April_2005 52
Provide Specialized “efficient”MIPS
Find a way to support the new performance requirements via an efficient “mechanism”A tailored solutions (to a specific application set) can provide an “efficient” MIPS via INTEGRATION, how?
Future
Vlsi_03_2005.ppt/April_2005 53
The Needthe environment
These days is the PC's 20th birthday – 835 Million PC sold 1981-2001– 138 million PCs in year 2001(IDC), 10X number of cars, 1.5X of
television sold annually– 2.2 Billion Email a day, 10X of the first class mail– 400 million on line users (200 in Sep99)– CPU performance improved ~8000X !!!
What will be the need for performance in the coming 20 years?What will be the technology progress in the coming 20 years? 10 years? 5 years?
Statistics courtesy of Gartner Dataquest, U.S. News & World Report, Jupiter Internet Population Model, and NUA Internet Surveys
Future
Vlsi_03_2005.ppt/April_2005 54
Windows XP examples that needs excessive performance:
- Movie Maker Video Indexing- Video smoothing
Example 1:Movie Maker Video Indexing
320*240, 30fps4X slower thanreal Time on CentrinoTM @1.6Ghz
1980x1080, 30fps~100X over CenrinoTM @1.6Mhz
Future
Vlsi_03_2005.ppt/April_2005 55
30 FPS (Enhanced)5 FPS (Jerky)
Example 2:Emulation of:Video smoothingVideo Enhancement
Video smoothing
352*240 pixelsCPU usage:70% of CentrinoTM @1.6Ghz
1980x1080, 30fps~21X over CenrinoTM @1.6Mhz
Future
Vlsi_03_2005.ppt/April_2005 56
The Need
Future
Vlsi_03_2005.ppt/April_2005 57
The need: Build a Panorama
M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003Performance: >30min P4 3GHz
Simplified capabilities at Microsoft Digital Image Suite 10 ($129.95)
Future
Vlsi_03_2005.ppt/April_2005 58
Future
Vlsi_03_2005.ppt/April_2005 59
CPU Usagean example(measured on IBM X20)
Streaming vs. General purpose
General purpose usageExcel and Outlook -->Burst need for MIPS
Streaming Processing6 low resolution videos-->Continuous need for MIPS
Avg.
Avg.
5min
~5min
Future
Vlsi_03_2005.ppt/April_2005 60
Process trend – the theory (cont)Performance driven era vs. Power aware era
Processes
“Old” processes
Pow
er ra
tioProcess generation p vs. p-1
@ same die area
n1µ
n+1 n+2 n+30.35µ
“new” processes
m0.25µ
m+1 m+2 m+30.09µ
1.4X Frequency0.9X voltage0.7X capacity/transistor1X area2X transistorsLeakage 30%
Power increase/generation: 1.6X
Power aware era(Performance within power envelop)
1.4X Frequency0.75X voltage0.7X capacity/transistor1X area2X transistorLeakage <5%
Power increase/generation: 1.1X
Performance driven era(no power constrains)
Reference: Eric Spargle
1.0
2.0
3.0
4.0
Future
Vlsi_03_2005.ppt/April_2005 61
Processor roadmap trend – real life (cont)Extension of Pollack’s Rule (Micro32, 1999)
Processor generation k vs. k-1 compacted @ the same process technology
Power
0
1
2
3
4
1.5 1 0.7 0.5 0.35 0.18Technology Generation
Gro
wth
(X)
Perf/power delta ratio
1 : >3Performance
!Perf/power delta
ratio3 : 1
Future
Vlsi_03_2005.ppt/April_2005 62
solution 1: CMP (Chip Multi-Processor)
Per
form
ance
Power
penalty: MP
Conventional design 1% performancefor 3% in power
One processor
2 CMP
3 CMP
4 CMP
*CMP = Symmetric General Purpose (GP) cores
CMP*
Future
Vlsi_03_2005.ppt/April_2005 63
solution 2: ACCMP (Asymmetric Cluster CMP)
Per
form
ance
Power
penalty: Specialized MIPS
AC
CMP
~1% performancefor 1% in power
>3% performancefor 1% in power
Future
Vlsi_03_2005.ppt/April_2005 64
ACCMPWhat is the ACCMP?– On Die Asymmetric Clusters of cores– Efficient specialized MIPS clusters with
>3-4X performance/power over GP cores– Compatible ISA?
Penalties– Multi-Processing (tasks or threads)
Specialized MIPS
ACCMP is a solution that enables to continue (for a while) Moore’s performance law within the power envelop
Future
Vlsi_03_2005.ppt/April_2005 65
ACCMP
Specialized MIPS A Cluster
Host coreHost core
L2 $
Host ClusterGeneral Purpose
MIPS
Interconnect
Specialized MIPS B Cluster
Ext.Bus
Future
Vlsi_03_2005.ppt/April_2005 66
Future - Processors
• applications need• Specialized MIPS• Detached from the CPU core• Different engines• Mixture of Programmable and fixed function
• ?