1
Short Course on Advanced Topics: from Emerging Media/Network Processors to Internet Computing
2
Topic 3: Fundamentals of Media Processor Designs
Overview of High-Performance Processors
- Multiple-issue, out-of-order, dynamic-window processors
- VLIW and vector processors
- Systolic and reconfigurable processors
- Hardwired stream processors
- Thread-level parallelism

Multimedia Extensions
- Media benchmarks/workloads
- Streaming media processing, sub-word parallelism
- Intel MMX/SSE media extensions
- IA-64 multimedia instructions

Media Processors
- IMAGINE: media processing with streams
- VIRAM: extending Intelligent RAM with a vector unit
- Trimedia: the price-performance challenge for media processing
3
Digital Signal Processing (DSP)
- In the 1970s, DSP for telecommunications required higher performance than the available microprocessors could deliver
- Computationally intensive: dominated by the vector dot product (multiply, multiply-add)
- Real-time requirements
- Streaming data: high memory bandwidth, simple memory access patterns
- Predictable program flow: nested loops, few branches, large basic blocks
- Sensitivity to numeric error
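The dot-product kernel is worth seeing concretely; a minimal C sketch of the multiply-accumulate (MAC) pattern, assuming hypothetical Q15 fixed-point data and the sequential access that autoincrement addressing hardware exploits:

    #include <stdint.h>

    /* Dot product of two Q15 fixed-point vectors: the classic DSP
       multiply-accumulate (MAC) kernel. Data streams sequentially,
       the pattern that autoincrement addressing hardware serves. */
    int32_t dot_q15(const int16_t *x, const int16_t *y, int n) {
        int32_t acc = 0;                     /* wide accumulator */
        for (int i = 0; i < n; i++)
            acc += (int32_t)x[i] * y[i];     /* one MAC per element */
        return acc;
    }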
4
Early DSPs
- Single-cycle multiplier
- Streamlined multiply-add operation
- Separate instruction / data memories for high memory bandwidth
- Specialized addressing hardware: autoincrement
- Complex instruction sets that combine multiple operations in a single instruction
- Special-purpose, fixed-function hardware: lacks flexibility and programmability
- TI TMS32010, 1982
5
Today’s DSPs (from 1995)
- Adapt general-purpose processor designs
- RISC-like instruction sets
- Multiple-issue: VLIW approaches, vector SIMD, superscalar, chip multiprocessing
- Programmability and compatibility: easier to program, a better compiler target, better compatibility with future architectures
- TI TMS320C62xx family: RISC instruction set, 8-issue VLIW design
6
General-Purpose Processors
- Growing applications (e.g., cellular phones) demand DSP tasks ($6 billion DSP market in 2000)
- Add architectural features to boost performance on common DSP tasks
- Extended multimedia instruction sets, adapted to and integrated with existing hardware in almost all high-performance microprocessors: Intel MMX/SSE (see the sketch after this list)
- New architectures that encompass DSP + general-purpose processing and exploit high parallelism: Stanford Imagine, etc.
- Future directions? Graphics processors?
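As an illustration of the subword parallelism these extensions provide, a minimal sketch with SSE2 intrinsics (the function and the 8-bit saturating-add use case are assumptions for illustration, not from the slides):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Saturating add of two 8-bit pixel arrays, 16 pixels per
       instruction; n is assumed to be a multiple of 16. */
    void add_pixels_saturate(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, int n) {
        for (int i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            /* 16 unsigned saturating byte adds in one instruction */
            __m128i vs = _mm_adds_epu8(va, vb);
            _mm_storeu_si128((__m128i *)(dst + i), vs);
        }
    }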
7
Media Processing
- Digital signal processing, 2D/3D graphics rendering, image/audio compression/decompression
- Real-time constraints, high performance density
- Large amounts of data parallelism, latency tolerance
- Streaming data, very little global data reuse
- Computationally intensive: 100-200 arithmetic operations per data element
- Requires efficient hardware mapping to the algorithm flow: special-purpose media processors
- Extended instruction sets / hardware on general-purpose processors
8
Multimedia Applications
- Image/video/audio compression (JPEG/MPEG/GIF/PNG)
- Front end of the 3D graphics pipeline (geometry, lighting): Pixar Cray X-MP, Stellar, Ardent, Microsoft Talisman MSP
- High-quality additive audio synthesis (Todd Hodes, UCB): vectorize across oscillators
- Image processing: Adobe Photoshop
- Speech recognition: front end of filters/FFTs, phoneme probabilities via neural net, back end of Viterbi/beam search
9
High-Performance Processors
- Exploit instruction-level parallelism
  - Superscalar, VLIW, vector SIMD, systolic array, etc.
  - Flexible (superscalar, VLIW) vs. regular (vector, systolic)
  - Data communication: through registers (VLIW, vector) vs. forwarding (superscalar, systolic, vector chaining)
  - Ratio of computation to memory access; data reuse ratio
  - Hardware (superscalar) vs. software (VLIW, vector, systolic) to discover ILP
- Exploit thread-level parallelism
  - Parallel computation (programming) models: streaming, macro-dataflow, SPMD, etc.
  - Data communication and data sharing behavior
  - Multiprocessor synchronization requirements
10
Instruction-Level Parallelism
- Limited instruction-level parallelism (ILP)
- Data dependences: true (RAW), anti (WAR), output (WAW)
- Control dependences: determine program flow

for (I = 1000; I > 0; I--)
    X[I] = X[I] + S;

Loop: load  F0,0(R1)    ; F0 = X[I]
      add   F4,F0,F2    ; F4 = X[I] + S  (F2 holds S)
      store F4,0(R1)    ; X[I] = F4
      addui R1,R1,#-8   ; advance to X[I-1] (8-byte elements)
      bne   R1,R2,Loop  ; repeat until R1 reaches R2

The load-add-store chain is a data dependence; the loop-closing branch is a control dependence. Both limit the ILP within a single iteration.
11
Dynamic Out-of-order Execution
- In-order fetch/issue, out-of-order execution, in-order completion: maintains precise interrupts

[Figure: the FP op queue and FP registers feed reservation stations in front of the FP adders; results are collected in a reorder buffer]

- A reorder buffer (ROB) holds the results of uncommitted instructions
- Registers are renamed to ROB entries, which drive dependent instructions
- Instructions commit in order: each is removed from the ROB and its result written to the architectural register
- Memory disambiguation
- Discovers ILP dynamically: flexible but costly; well suited to integer programs
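A toy sketch of the in-order commit discipline described above (the data structure and field names are generic textbook assumptions, not any specific processor):

    #include <stdbool.h>

    #define ROB_SIZE 32

    /* One reorder-buffer entry: destination register, result value,
       and whether execution has produced the result yet. */
    typedef struct {
        int  dest_reg;   /* architectural destination register */
        long value;      /* result, valid once done == true */
        bool done;       /* finished executing (possibly out of order) */
        bool busy;       /* entry is allocated */
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int  rob_head;           /* oldest in-flight instruction */
    static long arch_regs[64];      /* architectural register file */

    /* Commit stage: retire strictly in program order. Execution may
       finish out of order, but results reach the architectural
       registers only here, preserving a precise interrupt state. */
    void commit_step(void) {
        while (rob[rob_head].busy && rob[rob_head].done) {
            arch_regs[rob[rob_head].dest_reg] = rob[rob_head].value;
            rob[rob_head].busy = false;            /* free the entry */
            rob_head = (rob_head + 1) % ROB_SIZE;
        }
    }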
12
Fetch / Issue Unit
- Must fetch beyond branches: branch prediction
- Must feed the execution units at high bandwidth: trace cache
- Must utilize instruction/trace cache bandwidth: next-line prediction
- Instruction fetch is decoupled from execution
- Issue logic (+ renaming) is often included in the fetch unit
- Needs efficient (1-cycle) broadcast + wakeup + schedule logic for dependent-instruction scheduling

[Figure: instruction fetch with branch prediction supplies a stream of instructions to the out-of-order execution unit, which returns correctness feedback on branch results]
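Branch prediction is the enabler for fetching beyond branches; a minimal sketch of the classic 2-bit saturating-counter predictor (the table size and function names are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_BITS 12                     /* 4096-entry table */
    static uint8_t pht[1 << PHT_BITS];      /* 2-bit counters, 0..3 */

    /* Predict taken when the counter is in the upper half (2 or 3). */
    bool predict(uint32_t pc) {
        return pht[(pc >> 2) & ((1u << PHT_BITS) - 1)] >= 2;
    }

    /* Correctness feedback from execution: saturate the counter
       toward the actual branch outcome. */
    void update(uint32_t pc, bool taken) {
        uint8_t *c = &pht[(pc >> 2) & ((1u << PHT_BITS) - 1)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }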
13
Superscalar Out-of-order Execution
Loop: load  F0,0(R1)
      add   F4,F0,F2
      store F4,0(R1)
      addui R1,R1,#-8
      bne   R1,R2,Loop

Data dependences and the control dependence are as before. With branch prediction, the next iteration is fetched and renamed speculatively:

      load  F0',0(R1')
      add   F4',F0',F2
      store F4',0(R1')
      addui R1',R1',#-8
      bne   R1',R2,Loop

- Register renaming (R1 → R1', F0 → F0', F4 → F4') removes the false dependences between iterations
- Hardware discovers the ILP
- The most flexible approach
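A toy sketch of the rename step itself (the map table and free list are generic textbook structures, not a specific design):

    #define NUM_ARCH 32   /* architectural registers */
    #define NUM_PHYS 64   /* physical registers */

    static int map[NUM_ARCH];        /* arch reg -> current phys reg */
    static int free_list[NUM_PHYS];  /* stack of free physical regs */
    static int free_top;

    /* Rename "dest = src1 op src2": sources read the current mapping;
       the destination gets a fresh physical register, which removes
       WAR and WAW hazards between loop iterations. */
    void rename(int dest, int src1, int src2,
                int *pd, int *ps1, int *ps2) {
        *ps1 = map[src1];             /* read current mappings */
        *ps2 = map[src2];
        *pd  = free_list[--free_top]; /* allocate a fresh register */
        map[dest] = *pd;              /* later readers see the copy */
    }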
14
VLIW Approach – Static Multiple Issue
- Wide instructions carry multiple independent operations
- Loop unrolling, procedure inlining, trace scheduling, etc., enlarge basic blocks
- The compiler discovers independent operations and packs them into long instructions (see the unrolling sketch after this list)
- Difficulties:
  - Code size: clever encoding
  - Lock-step execution: hardware may allow unsynchronized execution
  - Binary code compatibility: object-code translation
- Compiler techniques to improve ILP; compiler optimization with hardware support
- Better suited to applications with predictable control flow: media / signal processing
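Loop unrolling is the key transformation for enlarging basic blocks; a minimal C sketch on the X[I] += S loop from the ILP slide (the unroll factor of 4 is an illustrative choice; the VLIW schedule on the next slide unrolls by 7):

    /* Original loop: one add per iteration, a tiny basic block. */
    void add_s(double *X, double S, int n) {
        for (int i = n; i > 0; i--)
            X[i] = X[i] + S;        /* X indexed 1..n, as in the slide */
    }

    /* Unrolled by 4: one large basic block with four independent
       adds the compiler can pack into separate VLIW slots.
       Assumes n is a multiple of 4 for brevity. */
    void add_s_unrolled(double *X, double S, int n) {
        for (int i = n; i > 0; i -= 4) {
            X[i]     = X[i]     + S;
            X[i - 1] = X[i - 1] + S;
            X[i - 2] = X[i - 2] + S;
            X[i - 3] = X[i - 3] + S;
        }
    }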
15
VLIW Approach – Example

The loop, unrolled seven times and compiler-scheduled into nine 5-slot VLIW instructions:

Memory Ref 1      | Memory Ref 2      | FP Operation 1  | FP Operation 2  | Integer/Branch
------------------+-------------------+-----------------+-----------------+-----------------
Load F0,0(R1)     | Load F6,-8(R1)    |                 |                 |
Load F10,-16(R1)  | Load F14,-24(R1)  |                 |                 |
Load F18,-32(R1)  | Load F22,-40(R1)  | Add F4,F0,F2    | Add F8,F6,F2    |
Load F26,-48(R1)  |                   | Add F12,F10,F2  | Add F16,F14,F2  |
                  |                   | Add F20,F18,F2  | Add F24,F22,F2  |
Store F4,0(R1)    | Store F8,-8(R1)   | Add F28,F26,F2  |                 |
Store F12,-16(R1) | Store F16,-24(R1) |                 |                 | Addui R1,R1,-56
Store F20,24(R1)  | Store F24,16(R1)  |                 |                 |
Store F28,8(R1)   |                   |                 |                 | Bne R1,R2,Loop
16
Vector Processors
- Single instruction, multiple data: exploit regular data parallelism; less flexible than VLIW
- Highly pipelined; tolerate memory latency
- Require high memory bandwidth (cache-less)
- Better suited to large scientific applications with heavy loop structures; also good for media applications
- Dynamic vector chaining, compound instructions
- Example (after vector loop blocking):

      Vload  V1,0(R1)
      Vadd   V2,V1,F2
      Vstore V2,0(R1)
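Vector loop blocking (strip mining) splits the loop into chunks of at most the machine's vector length; a minimal C sketch (VLEN and the vec_add_scalar helper are hypothetical stand-ins for the vector instructions above):

    #define VLEN 64   /* hypothetical maximum vector length */

    /* Stand-in for the Vload / Vadd / Vstore sequence, operating on
       len contiguous elements (len <= VLEN). */
    static void vec_add_scalar(double *x, double s, int len) {
        for (int i = 0; i < len; i++)
            x[i] += s;              /* one vector instruction's work */
    }

    /* Strip-mined loop: each trip issues one vector-length block. */
    void add_s_vector(double *X, double S, int n) {
        for (int i = 0; i < n; i += VLEN) {
            int len = (n - i < VLEN) ? (n - i) : VLEN;
            vec_add_scalar(X + i, S, len);
        }
    }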
17
Systolic Arrays, Reconfigurable Processors
- Systolic array: fixed function, fixed wiring

[Figure: operand streams (..., -8(R1), 0(R1)) and the constant F2 flow through a fixed pipeline of functional units and back to memory]

- Avoids register communication, but inflexible
- Reconfigurable hardware: MIT Raw, Stanford Smart Memories
  - A general-purpose engine is limited for media applications
  - Fixed-function, fixed-wire designs are too restrictive
  - Reconfigurable hardware provides compiler-programmable interconnections and system structure to suit the application
  - Exploits thread-level parallelism
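To make the systolic idea concrete, a toy C model of a 1-D systolic FIR pipeline (the cell structure and broadcast/shift scheme are illustrative assumptions): the input sample is broadcast to every cell, partial sums march one cell per step, and finished results leave the last cell with no register-file traffic in between.

    #define TAPS 4

    /* One cell of the pipeline: a coefficient wired in place and
       the partial sum currently passing through. */
    typedef struct {
        double coeff;
        double partial;
    } Cell;

    /* One systolic step: feed sample x, collect one finished result.
       After TAPS steps, each output is a FIR sum of the last TAPS
       samples weighted by the cells' fixed coefficients. */
    double systolic_step(Cell c[TAPS], double x) {
        double out = c[TAPS - 1].partial;       /* completed result */
        for (int k = TAPS - 1; k > 0; k--)      /* shift partials */
            c[k].partial = c[k - 1].partial + c[k].coeff * x;
        c[0].partial = c[0].coeff * x;          /* start a new sum */
        return out;
    }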
18
Thread-Level Parallelism
- Many applications, such as database transactions, scientific computations, and server applications, exhibit high-level (thread-level) parallelism
- Two basic approaches (see the sketch after this list):
  - Execute each thread on a separate processor: the traditional parallel-processing approach
  - Execute multiple threads on a single processor
    - Duplicate per-thread state (PC, registers, etc.) but share functional units, memory hierarchy, etc.; thread switching costs far less than a full context switch
    - Thread switching: coarse-grained vs. fine-grained
    - Simultaneous multithreading (SMT): thread-level and instruction-level parallelism are exploited at once, with multiple threads issuing in the same cycle
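A minimal pthreads sketch of thread-level parallelism on the earlier X[I] += S loop (the thread count and slicing are illustrative assumptions):

    #include <pthread.h>

    #define N        1000000
    #define NTHREADS 4

    static double X[N];
    static double S = 3.0;

    /* Each thread updates one contiguous slice of X: the threads
       share nothing but the array, so no locking is needed. */
    static void *worker(void *arg) {
        long t  = (long)arg;
        long lo = t * (N / NTHREADS);
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        for (long i = lo; i < hi; i++)
            X[i] += S;
        return 0;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], 0, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
        return 0;
    }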
19
Simultaneous Multithreading
- Simultaneous multithreading is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle.
- Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to compete for and share processor resources simultaneously.
- Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, SMT uses multiple threads to compensate for low single-thread ILP.
- The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads, including commercial databases, web servers, and scientific applications, in both multiprogrammed and parallel environments.
20
Comparison of Multithreading
[Figure: issue-slot utilization over time, compared across a superscalar, coarse-grained MT, fine-grained MT, and SMT]
21
Performance of SMT
SMT shows better instruction throughput than the superscalar baseline; however, the threads contend for the shared caches.
22
Summary
- Application-driven architecture studies
- Media applications
  - Computationally intensive, abundant parallelism, predictable control flow, real-time constraints
  - Memory intensive, streaming data access
  - 8-, 16-, and 24-bit data structures
- Suitable architectures
  - Dynamically scheduled, out-of-order processors are inefficient and overkill
  - VLIW, vector, and reconfigurable processors, or subword parallelism on general-purpose processors
  - Special handling of memory access