Short Course on Advanced Topics: from Emerging Media/Network Processors to Internet Computing

Topic 3: Fundamentals of Media Processor Designs

Overview of High-Performance Processors
• Multiple-issue, out-of-order, dynamic-window processors
• VLIW and vector processors
• Systolic and reconfigurable processors
• Hardwired stream processors
• Thread-level parallelism

Multimedia extensions
• Media benchmarks and workloads
• Streaming media processing, sub-word parallelism
• Intel MMX/SSE media extensions
• IA-64 multimedia instructions

Media processors
• IMAGINE: media processing with streams
• VIRAM: extending Intelligent RAM (IRAM) with a vector unit
• Trimedia: the price-performance challenge for media processing

Digital Signal Processing (DSP)

• In the 1970s, DSP in telecom required higher performance than the microprocessors then available
• Computationally intensive: dominated by the vector dot product – multiply and multiply-add (see the sketch after this list)
• Real-time requirements
• Streaming data, high memory bandwidth, simple memory access patterns
• Predictable program flow: nested loops, few branches, large basic blocks
• Sensitivity to numeric error
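A minimal C sketch of the dot-product kernel referred to above, with illustrative array and function names; the wide accumulator reflects the sensitivity to numeric error noted on the slide.

/* Dot product: the multiply-accumulate kernel that dominates DSP workloads. */
#include <stddef.h>

long dot_product(const short *x, const short *h, size_t n)
{
    long acc = 0;                  /* wide accumulator limits rounding/overflow error */
    for (size_t i = 0; i < n; i++)
        acc += (long)x[i] * h[i];  /* one multiply-add per streamed element */
    return acc;
}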

Early DSPs
• Single-cycle multiplier
• Streamlined multiply-add operation
• Separate instruction / data memories for high memory bandwidth
• Specialized addressing hardware, autoincrement
• Complex instruction set: combine multiple operations in a single instruction (see the sketch after this list)
• Special-purpose, fixed-function hardware; lacks flexibility and programmability
• TI TMS32010, 1982
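As a hedged illustration of "combine multiple operations in a single instruction": the same multiply-accumulate written with post-increment pointers, since an early DSP folded the multiply, the accumulate, and both address updates into one instruction (function and variable names are illustrative).

/* One MAC step per iteration, in the shape early DSP hardware executed it:
 * multiply, accumulate, and autoincrement both operand addresses. */
long fir_mac(const short *coef, const short *sample, int taps)
{
    long acc = 0;
    while (taps--)
        acc += (long)(*coef++) * (*sample++);  /* MAC + two address autoincrements */
    return acc;
}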

Today's DSPs (from 1995)
• Adapt general-purpose processor design
  – RISC-like instruction set
  – Multiple issue: VLIW approach, vector SIMD, superscalar, chip multiprocessing
• Programmability and compatibility
  – Easier to program, a better compiler target
  – Better compatibility with future architectures
• TI TMS320C62xx family: RISC instruction set, 8-issue VLIW design

General-Purpose Processors

• A growing number of applications (e.g. cellular phones) involve DSP tasks (a $6 billion DSP market in 2000)
• Add architectural features to boost performance on common DSP tasks
• Extended multimedia instruction sets, adapted and integrated with existing hardware in almost all high-performance microprocessors: Intel MMX/SSE
• New architectures that encompass DSP + general-purpose processing and exploit high parallelism: Stanford Imagine, etc.
• Future directions? Graphics processors?

Media Processing
• Digital signal processing, 2D/3D graphics rendering, image/audio compression and decompression
• Real-time constraints, high performance density
• Large amounts of data parallelism, latency tolerance
• Streaming data, very little global data reuse
• Computationally intensive: 100-200 arithmetic operations per data element (see the sketch after this list)
• Requires efficient hardware mapping onto the algorithm flow: special-purpose media processors
• Or extended instruction sets / hardware on general-purpose processors
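An illustrative (not course-provided) streaming kernel showing the access pattern described above: each element is read once, processed with a short burst of arithmetic, written once, and never reused.

#include <stddef.h>

/* Streaming media kernel: scale 8-bit samples by a fixed-point gain.
 * No global data reuse; only per-element arithmetic on a data stream. */
void scale_stream(const unsigned char *in, unsigned char *out,
                  size_t n, int gain_q8)
{
    for (size_t i = 0; i < n; i++) {
        int v = (in[i] * gain_q8) >> 8;              /* fixed-point multiply */
        out[i] = (unsigned char)(v > 255 ? 255 : v); /* saturate to 8 bits */
    }
}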

Multimedia Applications
• Image/video/audio compression (JPEG/MPEG/GIF/PNG)
• Front end of the 3D graphics pipeline (geometry, lighting): Pixar Cray X-MP, Stellar, Ardent, Microsoft Talisman MSP
• High-quality additive audio synthesis (Todd Hodes, UCB): vectorize across oscillators
• Image processing: Adobe Photoshop
• Speech recognition
  – Front end: filters/FFTs
  – Phoneme probabilities: neural net
  – Back end: Viterbi / beam search

High-Performance Processors

Exploit instruction-level parallelism
• Superscalar, VLIW, vector SIMD, systolic array, etc.
• Flexible (superscalar, VLIW) vs. regular (vector, systolic)
• Data communication: through registers (VLIW, vector) vs. forwarding (superscalar, systolic, vector chaining)
• Ratio of computation to memory accesses, data reuse ratio
• Hardware (superscalar) vs. software (VLIW, vector, systolic) to discover ILP

Exploit thread-level parallelism
• Parallel computation (programming) model: streaming, macro-dataflow, SPMD, etc.
• Data communication and data sharing behavior
• Multiprocessor synchronization requirements

Instruction-Level Parallelism

• Limited instruction-level parallelism (ILP)
• Data dependence: true (RAW), anti (WAR), output (WAW)
• Control dependence: determines program flow

for (I = 1000; I > 0; I--)
    X[I] = X[I] + S;

Loop: load  F0,0(R1)
      add   F4,F0,F2
      store F4,0(R1)
      addui R1,R1,#-8
      bne   R1,R2,Loop

(Data and control dependences between these instructions are marked with arrows on the slide.)

Dynamic Out-of-order Execution

• In-order fetch/issue, out-of-order execution, in-order completion: maintains precise interrupts

Diagram: an FP op queue and reservation stations feed the FP adders, with a reorder buffer alongside the FP registers.

• Use the reorder buffer to hold the results of uncommitted instructions (a data-structure sketch follows below)
• Register-rename to reorder-buffer entries to drive dependent instructions
• Instructions commit in order: removed from the reorder buffer, results written to the architectural registers
• Memory disambiguation
• Discovers ILP dynamically: flexible, costly, well suited to integer programs
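A rough software sketch, for illustration only, of the reorder-buffer idea described above: entries hold speculative results, and commit retires them strictly in program order (structure and function names are assumptions, not the slides' hardware).

#include <stdbool.h>

#define ROB_SIZE 32

struct rob_entry {
    bool valid;      /* entry allocated to an in-flight instruction */
    bool done;       /* result has been produced but not yet committed */
    int  dest_reg;   /* architectural register to update at commit */
    long value;      /* speculative result held in the reorder buffer */
};

struct rob {
    struct rob_entry e[ROB_SIZE];
    int head, tail;  /* commit from head, allocate at tail, in order */
};

/* Commit retires the head entry only when its result is ready, so results
 * reach the architectural registers in program order (precise interrupts). */
int rob_commit(struct rob *r, long *arch_regs)
{
    struct rob_entry *h = &r->e[r->head];
    if (!h->valid || !h->done)
        return 0;                       /* head not ready: nothing retires */
    arch_regs[h->dest_reg] = h->value;  /* result becomes architectural state */
    h->valid = false;
    r->head = (r->head + 1) % ROB_SIZE;
    return 1;
}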

Fetch / Issue Unit

• Must fetch beyond branches: branch prediction (a predictor sketch follows below)
• Must feed the execution units with high bandwidth: trace cache
• Must utilize instruction / trace cache bandwidth: next-line prediction
• Instruction fetch is decoupled from execution
• Issue logic (+ rename) is often included with the fetch unit
• Need efficient (1-cycle) broadcast + wakeup + schedule logic for dependent-instruction scheduling

Diagram: instruction fetch with branch prediction sends a stream of instructions to the out-of-order execution unit, which returns correctness feedback on branch results.
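A minimal sketch of the kind of dynamic branch predictor the fetch unit depends on: a table of 2-bit saturating counters indexed by the branch PC. The table size and function names are assumptions for illustration, not the course's specific design.

#define BHT_ENTRIES 1024

static unsigned char bht[BHT_ENTRIES];  /* 2-bit saturating counters, 0..3 */

/* Predict taken when the counter is in one of the two "taken" states. */
int predict_taken(unsigned long pc)
{
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
}

/* Train the counter with the actual outcome once the branch resolves. */
void train_predictor(unsigned long pc, int taken)
{
    unsigned char *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}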

Superscalar Out-of-order Execution

Loop: load  F0,0(R1)
      add   F4,F0,F2
      store F4,0(R1)
      addui R1,R1,#-8
      bne   R1,R2,Loop

Data dependence / control dependence

With branch prediction and register renaming (R1 → R1', F0 → F0', F4 → F4'), the speculative copy of the loop body becomes:

Loop: load  F0',0(R1)
      add   F4',F0',F2
      store F4',0(R1)
      addui R1',R1,#-8
      bne   R1',R2,Loop

• Hardware discovers ILP
• Most flexible

VLIW Approach – Static Multiple Issue
• Wide instructions holding multiple independent operations
• Loop unrolling, procedure inlining, trace scheduling, etc. to enlarge basic blocks (see the unrolled-loop sketch below)
• The compiler discovers independent operations and packs them into long instructions
• Difficulties:
  – Code size: clever encoding
  – Lock-step execution: hardware can allow unsynchronized execution
  – Binary code compatibility: object-code translation
• Compiler techniques to improve ILP; compiler optimization with hardware support
• Better suited to applications with predictable control flow: media / signal processing
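The example on the next slide packs seven unrolled iterations of the x[i] = x[i] + s loop into wide instructions. In C, the unrolling step the compiler performs looks roughly like this (the function name and cleanup handling are illustrative):

/* x[i] += s unrolled 7 times, exposing independent loads, adds, and stores
 * for the VLIW compiler to pack into wide instructions.
 * Assumes n is a multiple of 7; leftovers would go to a cleanup loop. */
void add_scalar_unrolled(double *x, double s, int n)
{
    for (int i = n; i > 0; i -= 7) {
        x[i-1] += s;  x[i-2] += s;  x[i-3] += s;  x[i-4] += s;
        x[i-5] += s;  x[i-6] += s;  x[i-7] += s;
    }
}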

VLIW Approach – Example

Memory Ref 1      | Memory Ref 2      | FP Operation 1  | FP Operation 2  | Integer/Branch
Load F0,0(R1)     | Load F6,-8(R1)    |                 |                 |
Load F10,-16(R1)  | Load F14,-24(R1)  |                 |                 |
Load F18,-32(R1)  | Load F22,-40(R1)  | Add F4,F0,F2    | Add F8,F6,F2    |
Load F26,-48(R1)  |                   | Add F12,F10,F2  | Add F16,F14,F2  |
                  |                   | Add F20,F18,F2  | Add F24,F22,F2  |
Store F4,0(R1)    | Store F8,-8(R1)   | Add F28,F26,F2  |                 |
Store F12,-16(R1) | Store F16,-24(R1) |                 |                 | Addui R1,R1,#-56
Store F20,24(R1)  | Store F24,16(R1)  |                 |                 |
Store F28,8(R1)   |                   |                 |                 | Bne R1,R2,Loop

Vector Processor
• Single instruction, multiple data: exploits regular data parallelism; less flexible than VLIW
• Highly pipelined, tolerates memory latency
• Requires high memory bandwidth (cache-less)
• Better suited to large scientific applications with heavy loop structures; good for media applications
• Dynamic vector chaining, compound instructions
• Example (a scalar strip-mined equivalent follows below):

/* after vector loop blocking */
Vload  V1,0(R1)
Vadd   V2,V1,F2
Vstore V2,0(R1)
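In scalar C, the loop blocking (strip mining) behind the Vload/Vadd/Vstore example looks roughly like this; each inner strip maps to one vector instruction sequence. The maximum vector length of 64 is an assumption for illustration.

#define MVL 64  /* assumed maximum vector length of the machine */

/* Strip-mined form of x[i] += s: each inner strip becomes one
 * Vload / Vadd / Vstore sequence on a vector processor. */
void add_scalar_strips(double *x, double s, int n)
{
    for (int i = 0; i < n; i += MVL) {
        int len = (n - i < MVL) ? n - i : MVL;  /* last strip may be shorter */
        for (int j = 0; j < len; j++)           /* one vector operation */
            x[i + j] += s;
    }
}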

Systolic Array, Reconfigurable Processor
• Systolic array: fixed function, fixed wiring
  Diagram: operand streams (0(R1), 8(R1), …; F2, F2, …) flow directly through the adder array
• Avoids register communication; inflexible
• Reconfigurable hardware: MIT Raw, Stanford Smart Memories
  – A general-purpose engine is limited for media applications
  – Fixed-function, fixed-wire hardware is too restricted
  – Reconfigurable hardware provides compiler-programmable interconnections and system structure to suit the application
  – Exploits thread-level parallelism

Thread-Level Parallelism
• Many applications, such as database transactions, scientific computations, and server applications, exhibit high-level (thread-level) parallelism
• Two basic approaches (see the threading sketch after this list):
  – Execute each thread on a separate processor: the old parallel-processing approach
  – Execute multiple threads on a single processor
• Duplicate per-thread state (PC, registers, etc.) but share the functional units, memory hierarchy, etc.; thread switching costs far less than a full context switch
• Thread switching: coarse-grained vs. fine-grained
• Simultaneous multithreading (SMT): thread-level and instruction-level parallelism are exploited simultaneously, with multiple threads issuing in the same cycle
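A small POSIX-threads sketch, assumed for illustration, of the thread-level-parallel version of the earlier x[i] = x[i] + s loop: each thread has its own PC, registers, and stack, while the array and the rest of the address space are shared.

#include <pthread.h>

#define NTHREADS 4
#define N 1000

static double x[N];
static double s = 3.0;

struct range { int lo, hi; };

/* Each thread updates its own slice of the shared array. */
static void *worker(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        x[i] += s;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    struct range r[NTHREADS];
    for (int k = 0; k < NTHREADS; k++) {
        r[k].lo = k * (N / NTHREADS);
        r[k].hi = (k + 1) * (N / NTHREADS);
        pthread_create(&t[k], NULL, worker, &r[k]);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], NULL);  /* wait for all threads to finish */
    return 0;
}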

Simultaneous Multithreading

Simultaneous multithreading is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle.

Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to simultaneously compete for and share processor resources.

Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread ILP.

The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads that include commercial databases, web servers and scientific applications in both multiprogrammed and parallel environments.

Comparison of Multithreading

Diagram: issue-slot usage over time for a superscalar, coarse-grained MT, fine-grained MT, and SMT processor.

Performance of SMT

SMT shows better performance than a superscalar processor; however, the threads contend for the caches.

Summary
• Application-driven architecture studies
• Media applications
  – Computationally intensive, lots of parallelism, predictable control flow, real-time constraints
  – Memory intensive, streaming data access
  – 8-, 16-, and 24-bit data structures
• Suitable architectures
  – Dynamically scheduled, out-of-order processors are inefficient and overkill
  – VLIW, vector, or reconfigurable processors, or exploit subword parallelism on general-purpose processors (see the sketch after this list)
  – Special handling of memory accesses
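As a closing illustration of subword parallelism on a general-purpose processor, assuming an SSE2-capable x86 and the standard <emmintrin.h> intrinsics: eight 16-bit samples are added with saturation per instruction (the function name is illustrative).

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Saturating add of 16-bit samples, eight per instruction:
 * subword parallelism applied to a simple media kernel. */
void add_saturate16(const short *a, const short *b, short *out, int n)
{
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epi16(va, vb));
    }
    for (; i < n; i++) {                    /* scalar tail for leftover samples */
        int v = a[i] + b[i];
        out[i] = (short)(v > 32767 ? 32767 : v < -32768 ? -32768 : v);
    }
}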