Bhagi Narahari, GWU CS 339 Spring 2000
CS 339 - Topics in Computer Architecture: Instruction-Level Parallel Processors and Embedded Systems
Course Outline: Part I
• ILP Architectures
  • Overview of Technology Trends - Narahari 1/27/00
    fabrication technology and implications on processors
  • Instruction Level Parallel (ILP) Processors 2/3/00
    Overview of ILP: superscalar, VLIW/EPIC - Narahari 2/3
    The HPL-PD architecture - Yul Williams 2/10
    Intel Itanium Processor - Yogesh Chobe
• Compiler Optimization for ILP Processors 2/17
  Introduction to Compiler Optimization - Narahari 2/17
  The Trimaran Infrastructure - Ajay Jayaray 2/24
  ILP Compiler techniques - 3/2/00
    Overview - Narahari 3/9/00
    Region Formation - Yul Williams / Ghita 3/16/00
    Scheduling - Narahari / Yogesh Chobe 3/30/00
  Parallelizing C compilers - Jason Mader
Course Outline: Part II
• Embedded Systems
  • Introduction - Narahari 3/30/00
  • Software support for Embedded Systems 3/30/00
    Issues and Requirements: Narahari 3/30/00
    Compiler Support Challenges: Narahari 4/6/00
    Power Optimization: Brian Crilly 4/13/00
    Validation: Brad Taylor 4/20/00
    Compiler optimization for power: Renato Levy 4/13/00
• Embedded devices and processors - ? 4/27/00
• Future architectures 4/27/00
  Reconfigurable processors - Brian Schott
  Software support challenges
• Project Reports: 5/4/2000!!!
Course Requirements
• Completely project based
  Term paper and paper presentations
  Term research project
• Readings
• Materials on web
• Lab resources provided by HACC lab
• Additional resources by SEAS CF
  some by NCAC/Jason!
Projects
• EPIC Architectures -- Intel’s IA-64
  Project: optimizing assembler for EPIC processors
  Learn and use Trimaran
  Implement specific components and procedures
  Enhance optimizing compiler
  If all works, then use Intel’s Itanium Software Development Kit
• Power Aware Computing Toolset
  modelling power consumption on a processor
  building a simulator that models power consumption
  compiler techniques for optimizing power consumption
• Validation techniques in compilers?
• Video on demand / set top boxes?
• Universal Parallel C (UPC)?
• More details next class!!!
Lecture 1: Hardware Vs. Software
Hardware: medium to compute functions
Software: functions to compute
Functions to compute
• Programming language
• Turing Machines, Recursive Functions
Hardware Media
• Gates
• Net lists
• Instruction sets (Add, Multiply, Branch, ...) / Assembly
• More complex functions: DSP, ASIC, Network Processors
(Micro) Architecture Challenges

• Cache optimizations, algorithmic strategies for compile-time optimizations
• Micro-architecture hardware support (e.g. HPL-PD)
  Hardware support must scale
  E.g. clock dilation
  Sensitive to hidden hardware costs
• ISA
Focusing on the ISA slice
• Can have any combination of instructions
• CISC: instructions are short programs
• RISC (of interest to this course): instructions take few cycles, and are composed to get the same effect as a single CISC instruction

Why bother?
The Backdrop
• Who will program these machines? Programmers
• What do they expect? Performance (till now)
• How? Write HLL program and compile
• Automatic compilation is key
  Short prototyping cycles
  “Assembly-like” performance
A study of mainframes at IBM in the 1970s revealed that even sophisticated optimizing compilers typically used only about 10% of the instruction set
Question
• Why have complex instructions?
  Automatic usage is key
  Limits to proliferation if we depend on hand-coded assembly
Reduced Instruction Set Computing
• Build this intuition into an ISA that a compiler can use
• Reinvest silicon in easy-to-engineer designs and performance (e.g. pipeline registers)
• Captured in ISA; examples will be discussed as case studies
By Contrast
Traditional DSPs
Engineering Complexity
Very difficult (CISC like) to compile for
This Course
• Philosophy: today, hardware is designed hand in hand with the software used to compile for it (automatically)
• Gives a snapshot of the current state of the art
• Compiler / ISA sweet spots
• Discuss issues in building both sides
• More of an architecture and compilers course
Furthermore
• Extrapolates beyond current technology to where processors and their compilers are headed next
• Next major step is a dynamically redefinable ISA
• Will study reconfigurable processors towards the end of this course
Technology Trends
• Fabrication
• Architecture
• Application
• Compilation/Software Support
Motivation
• Demands of embedded computing are impacting desiderata
  Faster, cheaper processors
  Shorter times to market
• Poor scalability of superscalars: complex control units
• EPIC / VLIW: simpler architectures, known compilation technology
More Motivation
• FPGA / Reconfigurable logic
Fine grained parallelism
Explicit control over micro-architectural features
Fast static communication
Trends In Technology, Applications, Architectures
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Delay vs. Feature Size

[Figure: gate delay and interconnect delay (in ps) for Cu & low-k and for Al & SiO2 interconnects, plotted against feature sizes from 650 nm down to 100 nm (1999 data); as features shrink, interconnect delay grows to dominate gate delay.]

Bohr, M. T., “Interconnect Scaling - The Real Limiter To High Performance ULSI”, Proceedings of the IEEE International Electron Devices Meeting, pages 241-242.
Impact Of Decreasing Feature Size
• Interconnect delay has greater impact than gate delay
• “...wires are not keeping pace with scaling of other features. … In fact, for CMOS processes below 0.25 micron ... an unacceptably small percentage of the die will be reachable during a single clock cycle.”
• “Architectures that require long-distance, rapid interaction will not scale well ...”

  “Will Physical Scalability Sabotage Performance Gains?”, D. Matzke, chief architect, TI, IEEE Computer (9/97)
As Wire Delays Become Significant...
• Focus on architectures that
do not involve long distance communication
distribute control and data processing logic
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Verification And Test
• With increasing chip complexity, verification and test costs form a significant component of the overall cost (based on trends in the previous slide)
• Scaling current superscalar architectural techniques is likely to exacerbate the test and verification cost factor
• A long testing process will also affect time to market
Impact of Rising Verification And Test Costs
• Keep the architecture simple and regular
  move complex decision-making logic from the processor to higher level tools (compiler)
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Bhagi Narahari, GWU CS 339 Spring 2000
Today’s Computational Requirements
[Dubey, IBM HotChips’97 Tutorial]
Application domain (equivalent # of Pentiums):
• Real-time groupware, video conferencing: 2
• High quality video for online interactive catalogs, streaming a/v: 4
• Workgroup collaboration with 3-d graphics: 10
• Video authoring, telegames with video/3-d graphics: 15
• Spontaneous speech recognition: 20
• Digital library and media mining: 20-30
• Broadband conferencing: 30+
• Electronic commerce with strong encryption: 50+
• “…media processing will become the dominant force in computer architecture and microprocessor design.”

  “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)
Application Trends Summary
• Real-time processing• Packed 8-, 16-, and 32-bit integer data• Continuous data streams• Fine grain parallelism• Long integer arithmetic, table look-ups• Common kernels (small code size)• Low temporal reuse• High spatial locality
The Impact of Media Application Trends
• Simple regular architectures are desirable scope for lots of MIMD processing tuned for media kernels need newer caching technology
requirements of predictability, high throughput low temporal reuse, high spatial reuse
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Transistors / Chip

[Figure: projected MPU transistors per chip (M) and DRAM bits per chip (G) for 1997-2012, scale 0-1600, with a "50 Pentiums" point marked on the projection.]
Available Instruction-Level Parallelism [Wall ’93, DECWRL]

[Figure: measured ILP (scale 0-100) per application (egre, sedd, yacc, eco, grr, met, alvi, comp, dodu, espr, fppp, gcc1, hydr, li, mdlj, ora, swm, tomc) under Perfect, Superb, and Good machine models.]
From Previous Two Slides...
• Lots of hardware parallelism available
  can accommodate approx. 50 Pentiums on one die in 6 years

However,

• Conventional architectures and compilation cannot expose enough parallelism in applications
  even the “superb” model yields an ILP < 10 on average
• Need for new architectures and compilation techniques!
What Is The Response Elsewhere To All This?
Architecture Research Approaches
• Past approaches
  better instruction fetch/issue
  improved instruction processing
  better prediction (branches, aliases)
  statically scheduled variants of VLIW
• Novel (different) approaches
  Reconfigurable processors
  IRAM and variants
  Simultaneous multi-threading
  On-chip multi-processing
Two Noteworthy Directions
• Reconfigurable Processors
  let compiler handle everything
  no commitment to a particular architecture
  compiler generates architecture and code for it
• Explicitly Controlled Architectures
  simplify architectures as much as possible
  architectural template is a known, conventional one
  compiler handles a lot of the processor’s decision making
    explicitly control issue, scheduling, allocation
  Explicitly Parallel Instruction Computing (EPIC): a subset of explicitly controlled architectures
Compiler vs. Processor

[Figure: division of labor between compiler and hardware. After the frontend and optimizer, the decisions are: determine dependences, determine independences, bind operations to function units, bind transports to busses, execute. Each architecture draws the compiler/hardware boundary at a different point:]

• Superscalar: hardware makes all of these decisions
• Dataflow: compiler determines dependences; hardware does the rest
• Independence architectures: compiler also determines independences
• VLIW: compiler additionally binds operations to function units
• TTA: compiler also binds transports to busses; hardware only executes

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.
Reconfigurable Computing: A Summary of Achievements
• Fastest RSA decryption 600Kb/s, 512b keys [DEC PRL]
• DNA sequence matching 100x faster than MPP’s, Cray3 [Splash, SRC]
• Filters on FPGA’s 10x faster than DSP’s [Xilinx,Altera application notes]
• Processor emulation [Butts IEEE CICC’95, Varghese IEEE Trans. VLSI’93]
• Hardware reuse [Multifunction PCMCIA, Wireless Video Coding UCLA]
Where can reconfigurability make a difference?
• Applications requiring non-standard data-path FFT,DCT,CORDIC
• Static data, adaptive precision constant co-efficient filters, encryption-decryption
• Fault tolerance, real-time threat sensitive adaptation defense communication systems
• High-performance multifunction portables PDA’s, cellular phones, wearable computers
• Regular, fine-grained parallel processing signal/image processing (e.g. pattern recognition)
What are the hurdles?
• Poor compilation times
  lack of correspondence between standard IR and final configurations
  place and route inherently complex
• Additional runtime overheads
  large configuration size implies high reconfiguration costs
  this also implies context switches are very costly
• Lack of convenient abstract models, language support
  models for algorithm development (e.g. RMESH, USC-MAARC)
  models for compiler targets (ReaCT-ILP)
  language support for hardware structural information
    but not as complex as HDL’s
State Of Compilation Technology
• Compilation for incremental architectures
well known technology
but bottleneck of “conventional compilation”
• Compilation for radically different architectures
no known efficient and automatic compilation
potential for breaking through the bottleneck
What can be efficiently compiled for today?

[Figure: architectures arranged by parallelism against approximate instruction packet size (0, 4, 16, 32, 64, 128-512, 1K-10K, 100K-1M, >1M): early x86, simple pipelined/embedded, SuperScalar, SuperSpeculative, TRACE (Multiscalar), SMT, VECTOR, Dataflow, EPIC/VLIW, TTA, Adaptive EPIC, RAW, MultiChip CVH, DPGA, GARP, FPGA, RaPiD, Simple ASIC, Complex ASIC.]
Trends In Technology, Applications, Architectures
(What can we infer?)

• Design/verification costs: simple, regular architectures
• Signal delays: shorter connections; local interactions
• Media processing: high throughput, highly compute-intensive processing, many integer types
  Not enough ILP through standard compilation
  Customized/special-purpose compilation? Adaptive architectures?

Adaptive Explicitly Parallel Instruction Computing?