1
Short Course on Advanced Topics: from Emerging Media/Network Processors to Internet Computing
2
Topic 3: Fundamentals of Media Processor Designs
Overview of High-Performance Processors
- Multiple-issue, out-of-order, dynamic-window processors
- VLIW and vector processors
- Systolic and reconfigurable processors
- Hardwired stream processors
- Thread-level parallelism

Multimedia Extensions
- Media benchmarks/workloads
- Streaming media processing, sub-word parallelism
- Intel MMX/SSE media extensions
- IA-64 multimedia instructions

Media Processors
- IMAGINE: media processing with streams
- VIRAM: extending Intelligent RAM with a vector unit
- Trimedia: the price-performance challenge for media processing
3
Digital Signal Processing (DSP)
- In the 1970s, DSP for telecommunications required higher performance than the available microprocessors could deliver
- Computationally intensive: dominated by the vector dot product (multiply, multiply-add)
- Real-time requirements
- Streaming data: high memory bandwidth, simple memory access patterns
- Predictable program flow: nested loops, few branches, large basic blocks
- Sensitivity to numeric error
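The dot-product kernel is worth seeing concretely; a minimal C sketch of the multiply-accumulate (MAC) pattern, assuming hypothetical Q15 fixed-point data and the sequential access that autoincrement addressing hardware exploits:

    #include <stdint.h>

    /* Dot product of two Q15 fixed-point vectors: the classic DSP
       multiply-accumulate (MAC) kernel. Data streams sequentially,
       the pattern that autoincrement addressing hardware serves. */
    int32_t dot_q15(const int16_t *x, const int16_t *y, int n) {
        int32_t acc = 0;                     /* wide accumulator */
        for (int i = 0; i < n; i++)
            acc += (int32_t)x[i] * y[i];     /* one MAC per element */
        return acc;
    }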
4
Early DSPs
- Single-cycle multiplier
- Streamlined multiply-add operation
- Separate instruction / data memories for high memory bandwidth
- Specialized addressing hardware: autoincrement
- Complex instruction sets that combine multiple operations in a single instruction
- Special-purpose, fixed-function hardware: lacks flexibility and programmability
- TI TMS32010, 1982
5
Today’s DSPs (from 1995)
- Adapt general-purpose processor designs
- RISC-like instruction sets
- Multiple-issue: VLIW approaches, vector SIMD, superscalar, chip multiprocessing
- Programmability and compatibility: easier to program, a better compiler target, better compatibility with future architectures
- TI TMS320C62xx family: RISC instruction set, 8-issue VLIW design
6
General-Purpose Processors
- Growing applications (e.g., cellular phones) demand DSP tasks ($6 billion DSP market in 2000)
- Add architectural features to boost performance on common DSP tasks
- Extended multimedia instruction sets, adapted to and integrated with existing hardware in almost all high-performance microprocessors: Intel MMX/SSE (see the sketch after this list)
- New architectures that encompass DSP + general-purpose processing and exploit high parallelism: Stanford Imagine, etc.
- Future directions? Graphics processors?
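As an illustration of the subword parallelism these extensions provide, a minimal sketch with SSE2 intrinsics (the function and the 8-bit saturating-add use case are assumptions for illustration, not from the slides):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Saturating add of two 8-bit pixel arrays, 16 pixels per
       instruction; n is assumed to be a multiple of 16. */
    void add_pixels_saturate(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, int n) {
        for (int i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            /* 16 unsigned saturating byte adds in one instruction */
            __m128i vs = _mm_adds_epu8(va, vb);
            _mm_storeu_si128((__m128i *)(dst + i), vs);
        }
    }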
7
Media Processing
- Digital signal processing, 2D/3D graphics rendering, image/audio compression/decompression
- Real-time constraints, high performance density
- Large amounts of data parallelism, latency tolerance
- Streaming data, very little global data reuse
- Computationally intensive: 100-200 arithmetic operations per data element
- Requires efficient hardware mapping to the algorithm flow: special-purpose media processors
- Extended instruction sets / hardware on general-purpose processors
8
Multimedia Applications
- Image/video/audio compression (JPEG/MPEG/GIF/PNG)
- Front end of the 3D graphics pipeline (geometry, lighting): Pixar Cray X-MP, Stellar, Ardent, Microsoft Talisman MSP
- High-quality additive audio synthesis (Todd Hodes, UCB): vectorize across oscillators
- Image processing: Adobe Photoshop
- Speech recognition: front end of filters/FFTs, phoneme probabilities via neural net, back end of Viterbi/beam search
9
High-Performance Processors
- Exploit instruction-level parallelism
  - Superscalar, VLIW, vector SIMD, systolic array, etc.
  - Flexible (superscalar, VLIW) vs. regular (vector, systolic)
  - Data communication: through registers (VLIW, vector) vs. forwarding (superscalar, systolic, vector chaining)
  - Ratio of computation to memory access; data reuse ratio
  - Hardware (superscalar) vs. software (VLIW, vector, systolic) to discover ILP
- Exploit thread-level parallelism
  - Parallel computation (programming) models: streaming, macro-dataflow, SPMD, etc.
  - Data communication and data sharing behavior
  - Multiprocessor synchronization requirements
10
Instruction-Level Parallelism
- Limited instruction-level parallelism (ILP)
- Data dependences: true (RAW), anti (WAR), output (WAW)
- Control dependences: determine program flow

for (I = 1000; I > 0; I--)
    X[I] = X[I] + S;

Loop: load  F0,0(R1)    ; F0 = X[I]
      add   F4,F0,F2    ; F4 = X[I] + S  (F2 holds S)
      store F4,0(R1)    ; X[I] = F4
      addui R1,R1,#-8   ; advance to X[I-1] (8-byte elements)
      bne   R1,R2,Loop  ; repeat until R1 reaches R2

The load-add-store chain is a data dependence; the loop-closing branch is a control dependence. Both limit the ILP within a single iteration.
11
Dynamic Out-of-order Execution
- In-order fetch/issue, out-of-order execution, in-order completion: maintains precise interrupts

[Figure: the FP op queue and FP registers feed reservation stations in front of the FP adders; results are collected in a reorder buffer]

- A reorder buffer (ROB) holds the results of uncommitted instructions
- Registers are renamed to ROB entries, which drive dependent instructions
- Instructions commit in order: each is removed from the ROB and its result written to the architectural register
- Memory disambiguation
- Discovers ILP dynamically: flexible but costly; well suited to integer programs
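A toy sketch of the in-order commit discipline described above (the data structure and field names are generic textbook assumptions, not any specific processor):

    #include <stdbool.h>

    #define ROB_SIZE 32

    /* One reorder-buffer entry: destination register, result value,
       and whether execution has produced the result yet. */
    typedef struct {
        int  dest_reg;   /* architectural destination register */
        long value;      /* result, valid once done == true */
        bool done;       /* finished executing (possibly out of order) */
        bool busy;       /* entry is allocated */
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int  rob_head;           /* oldest in-flight instruction */
    static long arch_regs[64];      /* architectural register file */

    /* Commit stage: retire strictly in program order. Execution may
       finish out of order, but results reach the architectural
       registers only here, preserving a precise interrupt state. */
    void commit_step(void) {
        while (rob[rob_head].busy && rob[rob_head].done) {
            arch_regs[rob[rob_head].dest_reg] = rob[rob_head].value;
            rob[rob_head].busy = false;            /* free the entry */
            rob_head = (rob_head + 1) % ROB_SIZE;
        }
    }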
12
Fetch / Issue Unit
- Must fetch beyond branches: branch prediction
- Must feed the execution units at high bandwidth: trace cache
- Must utilize instruction/trace cache bandwidth: next-line prediction
- Instruction fetch is decoupled from execution
- Issue logic (+ renaming) is often included in the fetch unit
- Needs efficient (1-cycle) broadcast + wakeup + schedule logic for dependent-instruction scheduling

[Figure: instruction fetch with branch prediction supplies a stream of instructions to the out-of-order execution unit, which returns correctness feedback on branch results]
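Branch prediction is the enabler for fetching beyond branches; a minimal sketch of the classic 2-bit saturating-counter predictor (the table size and function names are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_BITS 12                     /* 4096-entry table */
    static uint8_t pht[1 << PHT_BITS];      /* 2-bit counters, 0..3 */

    /* Predict taken when the counter is in the upper half (2 or 3). */
    bool predict(uint32_t pc) {
        return pht[(pc >> 2) & ((1u << PHT_BITS) - 1)] >= 2;
    }

    /* Correctness feedback from execution: saturate the counter
       toward the actual branch outcome. */
    void update(uint32_t pc, bool taken) {
        uint8_t *c = &pht[(pc >> 2) & ((1u << PHT_BITS) - 1)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }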
13
Superscalar Out-of-order Execution
Loop: load  F0,0(R1)
      add   F4,F0,F2
      store F4,0(R1)
      addui R1,R1,#-8
      bne   R1,R2,Loop

Data dependences and the control dependence are as before. With branch prediction, the next iteration is fetched and renamed speculatively:

      load  F0',0(R1')
      add   F4',F0',F2
      store F4',0(R1')
      addui R1',R1',#-8
      bne   R1',R2,Loop

- Register renaming (R1 → R1', F0 → F0', F4 → F4') removes the false dependences between iterations
- Hardware discovers the ILP
- The most flexible approach
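A toy sketch of the rename step itself (the map table and free list are generic textbook structures, not a specific design):

    #define NUM_ARCH 32   /* architectural registers */
    #define NUM_PHYS 64   /* physical registers */

    static int map[NUM_ARCH];        /* arch reg -> current phys reg */
    static int free_list[NUM_PHYS];  /* stack of free physical regs */
    static int free_top;

    /* Rename "dest = src1 op src2": sources read the current mapping;
       the destination gets a fresh physical register, which removes
       WAR and WAW hazards between loop iterations. */
    void rename(int dest, int src1, int src2,
                int *pd, int *ps1, int *ps2) {
        *ps1 = map[src1];             /* read current mappings */
        *ps2 = map[src2];
        *pd  = free_list[--free_top]; /* allocate a fresh register */
        map[dest] = *pd;              /* later readers see the copy */
    }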
14
VLIW Approach – Static Multiple Issue
- Wide instructions carry multiple independent operations
- Loop unrolling, procedure inlining, trace scheduling, etc., enlarge basic blocks
- The compiler discovers independent operations and packs them into long instructions (see the unrolling sketch after this list)
- Difficulties:
  - Code size: clever encoding
  - Lock-step execution: hardware may allow unsynchronized execution
  - Binary code compatibility: object-code translation
- Compiler techniques to improve ILP; compiler optimization with hardware support
- Better suited to applications with predictable control flow: media / signal processing
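Loop unrolling is the key transformation for enlarging basic blocks; a minimal C sketch on the X[I] += S loop from the ILP slide (the unroll factor of 4 is an illustrative choice; the VLIW schedule on the next slide unrolls by 7):

    /* Original loop: one add per iteration, a tiny basic block. */
    void add_s(double *X, double S, int n) {
        for (int i = n; i > 0; i--)
            X[i] = X[i] + S;        /* X indexed 1..n, as in the slide */
    }

    /* Unrolled by 4: one large basic block with four independent
       adds the compiler can pack into separate VLIW slots.
       Assumes n is a multiple of 4 for brevity. */
    void add_s_unrolled(double *X, double S, int n) {
        for (int i = n; i > 0; i -= 4) {
            X[i]     = X[i]     + S;
            X[i - 1] = X[i - 1] + S;
            X[i - 2] = X[i - 2] + S;
            X[i - 3] = X[i - 3] + S;
        }
    }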
15
VLIW Approach – Example

The loop, unrolled seven times and compiler-scheduled into nine 5-slot VLIW instructions:

Memory Ref 1      | Memory Ref 2      | FP Operation 1  | FP Operation 2  | Integer/Branch
------------------+-------------------+-----------------+-----------------+-----------------
Load F0,0(R1)     | Load F6,-8(R1)    |                 |                 |
Load F10,-16(R1)  | Load F14,-24(R1)  |                 |                 |
Load F18,-32(R1)  | Load F22,-40(R1)  | Add F4,F0,F2    | Add F8,F6,F2    |
Load F26,-48(R1)  |                   | Add F12,F10,F2  | Add F16,F14,F2  |
                  |                   | Add F20,F18,F2  | Add F24,F22,F2  |
Store F4,0(R1)    | Store F8,-8(R1)   | Add F28,F26,F2  |                 |
Store F12,-16(R1) | Store F16,-24(R1) |                 |                 | Addui R1,R1,-56
Store F20,24(R1)  | Store F24,16(R1)  |                 |                 |
Store F28,8(R1)   |                   |                 |                 | Bne R1,R2,Loop
16
Vector Processors
- Single instruction, multiple data: exploit regular data parallelism; less flexible than VLIW
- Highly pipelined; tolerate memory latency
- Require high memory bandwidth (cache-less)
- Better suited to large scientific applications with heavy loop structures; also good for media applications
- Dynamic vector chaining, compound instructions
- Example (after vector loop blocking):

      Vload  V1,0(R1)
      Vadd   V2,V1,F2
      Vstore V2,0(R1)
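Vector loop blocking (strip mining) splits the loop into chunks of at most the machine's vector length; a minimal C sketch (VLEN and the vec_add_scalar helper are hypothetical stand-ins for the vector instructions above):

    #define VLEN 64   /* hypothetical maximum vector length */

    /* Stand-in for the Vload / Vadd / Vstore sequence, operating on
       len contiguous elements (len <= VLEN). */
    static void vec_add_scalar(double *x, double s, int len) {
        for (int i = 0; i < len; i++)
            x[i] += s;              /* one vector instruction's work */
    }

    /* Strip-mined loop: each trip issues one vector-length block. */
    void add_s_vector(double *X, double S, int n) {
        for (int i = 0; i < n; i += VLEN) {
            int len = (n - i < VLEN) ? (n - i) : VLEN;
            vec_add_scalar(X + i, S, len);
        }
    }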
17
Systolic Arrays, Reconfigurable Processors
- Systolic array: fixed function, fixed wiring

[Figure: operand streams (..., -8(R1), 0(R1)) and the constant F2 flow through a fixed pipeline of functional units and back to memory]

- Avoids register communication, but inflexible
- Reconfigurable hardware: MIT Raw, Stanford Smart Memories
  - A general-purpose engine is limited for media applications
  - Fixed-function, fixed-wire designs are too restrictive
  - Reconfigurable hardware provides compiler-programmable interconnections and system structure to suit the application
  - Exploits thread-level parallelism
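To make the systolic idea concrete, a toy C model of a 1-D systolic FIR pipeline (the cell structure and broadcast/shift scheme are illustrative assumptions): the input sample is broadcast to every cell, partial sums march one cell per step, and finished results leave the last cell with no register-file traffic in between.

    #define TAPS 4

    /* One cell of the pipeline: a coefficient wired in place and
       the partial sum currently passing through. */
    typedef struct {
        double coeff;
        double partial;
    } Cell;

    /* One systolic step: feed sample x, collect one finished result.
       After TAPS steps, each output is a FIR sum of the last TAPS
       samples weighted by the cells' fixed coefficients. */
    double systolic_step(Cell c[TAPS], double x) {
        double out = c[TAPS - 1].partial;       /* completed result */
        for (int k = TAPS - 1; k > 0; k--)      /* shift partials */
            c[k].partial = c[k - 1].partial + c[k].coeff * x;
        c[0].partial = c[0].coeff * x;          /* start a new sum */
        return out;
    }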
18
Thread-Level Parallelism
- Many applications, such as database transactions, scientific computations, and server applications, exhibit high-level (thread-level) parallelism
- Two basic approaches (see the sketch after this list):
  - Execute each thread on a separate processor: the traditional parallel-processing approach
  - Execute multiple threads on a single processor
    - Duplicate per-thread state (PC, registers, etc.) but share functional units, memory hierarchy, etc.; thread switching costs far less than a full context switch
    - Thread switching: coarse-grained vs. fine-grained
    - Simultaneous multithreading (SMT): thread-level and instruction-level parallelism are exploited at once, with multiple threads issuing in the same cycle
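A minimal pthreads sketch of thread-level parallelism on the earlier X[I] += S loop (the thread count and slicing are illustrative assumptions):

    #include <pthread.h>

    #define N        1000000
    #define NTHREADS 4

    static double X[N];
    static double S = 3.0;

    /* Each thread updates one contiguous slice of X: the threads
       share nothing but the array, so no locking is needed. */
    static void *worker(void *arg) {
        long t  = (long)arg;
        long lo = t * (N / NTHREADS);
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        for (long i = lo; i < hi; i++)
            X[i] += S;
        return 0;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], 0, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
        return 0;
    }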
19
Simultaneous Multithreading
- Simultaneous multithreading is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle.
- Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to compete for and share processor resources simultaneously.
- Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, SMT uses multiple threads to compensate for low single-thread ILP.
- The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads, including commercial databases, web servers, and scientific applications, in both multiprogrammed and parallel environments.
20
Comparison of Multithreading
[Figure: issue-slot utilization over time, compared across a superscalar, coarse-grained MT, fine-grained MT, and SMT]
21
Performance of SMT
SMT shows better instruction throughput than the superscalar baseline; however, the threads contend for the shared caches.
22
Summary
- Application-driven architecture studies
- Media applications
  - Computationally intensive, abundant parallelism, predictable control flow, real-time constraints
  - Memory intensive, streaming data access
  - 8-, 16-, and 24-bit data structures
- Suitable architectures
  - Dynamically scheduled, out-of-order processors are inefficient and overkill
  - VLIW, vector, and reconfigurable processors, or subword parallelism on general-purpose processors
  - Special handling of memory access