Bhagi Narahari, GWU CS 339 Spring 2000
CS 339 - Topics in Computer Architecture: Instruction-Level Parallel Processors and Embedded Systems
Course Outline: Part I
• ILP Architectures
  • Overview of Technology Trends - Narahari 1/27/00
    fabrication technology and implications on processors
  • Instruction Level Parallel (ILP) Processors 2/3/00
    Overview of ILP: superscalar, VLIW/EPIC - Narahari 2/3
    The HPL-PD architecture - Yul Williams 2/10
    Intel Itanium Processor - Yogesh Chobe
• Compiler Optimization for ILP Processors 2/17
  Introduction to Compiler Optimization - Narahari 2/17
  The Trimaran Infrastructure - Ajay Jayaray 2/24
  ILP Compiler techniques - 3/2/00
    Overview - Narahari 3/9/00
    Region Formation - Yul Williams / Ghita 3/16/00
    Scheduling - Narahari / Yogesh Chobe 3/30/00
  Parallelizing C compilers - Jason Mader
Course Outline: Part II
• Embedded Systems
  • Introduction - Narahari 3/30/00
  • Software support for Embedded Systems 3/30/00
    Issues and Requirements: Narahari 3/30/00
    Compiler Support Challenges: Narahari 4/6/00
    Power Optimization: Brian Crilly 4/13/00
    Validation: Brad Taylor 4/20/00
    Compiler optimization for power: Renato Levy 4/13/00
• Embedded devices and processors - ? 4/27/00
• Future architectures 4/27/00
  Reconfigurable processors - Brian Schott
  Software support challenges
• Project Reports: 5/4/2000!!!
Course Requirements
• Completely project based
  Term paper and paper presentations
  Term research project
• Readings
• Materials on web
• Lab resources provided by HACC lab
• Additional resources by SEAS CF
  some by NCAC/Jason!
Projects
• EPIC Architectures -- Intel’s IA-64
  Project: optimizing assembler for EPIC processors
  Learn and use Trimaran
  Implement specific components and procedures
  Enhance optimizing compiler
  If all works, then use Intel’s Itanium Software Development Kit
• Power Aware Computing Toolset
  modelling power consumption on a processor
  building a simulator that models power consumption
  compiler techniques for optimizing power consumption
• Validation techniques in compilers?
• Video on demand / set top boxes?
• Universal Parallel C (UPC)?
• More details next class!!!
Lecture 1: Hardware Vs. Software
Hardware: medium to compute functions
Software: functions to compute
Functions to compute
• Programming language
• Turing Machines, Recursive Functions
Hardware Media
• Gates
• Net lists
• Instruction sets (Add, Multiply, Branch, ...) / Assembly
• More complex functions: DSP, ASIC, Network Processors
(Micro) Architecture Challenges

• Cache optimizations, algorithmic strategies for compile-time optimizations
• Micro-architecture hardware support (e.g. HPL-PD)
  Hardware support must scale
  E.g. clock dilation
  Sensitive to hidden hardware costs
• ISA
Focusing on the ISA slice
• Can have any combination of instructions
• CISC: instructions are short programs
• RISC (of interest to this course): instructions take few cycles, and are composed to get the same effect as a single CISC instruction

Why bother?
The Backdrop
• Who will program these machines? Programmers
• What do they expect? Performance (till now)
• How? Write HLL program and compile
• Automatic compilation is key
  Short prototyping cycles
  “Assembly-like” performance
A study of mainframes at IBM in the 1970s revealed that even sophisticated optimizing compilers typically used only about 10% of the instruction set
Question
• Why have complex instructions?
  Automatic usage is key
  Limits to proliferation if we depend on hand-coded assembly
Reduced Instruction Set Computing
• Build this intuition into an ISA that a compiler can use
• Reinvest silicon in easy-to-engineer designs and performance (e.g. pipeline registers)
• Captured in ISA; examples will be discussed as case studies
By Contrast
Traditional DSPs
Engineering Complexity
Very difficult (CISC like) to compile for
This Course
• Philosophy: today, hardware is designed hand in hand with the software used to compile for it (automatically)
• Gives a snapshot of the current state of the art
• Compiler / ISA sweet spots
• Discuss issues in building both sides
• More of an architecture and compilers course
Furthermore
• Extrapolates beyond current technology to where processors and their compilers are headed next
• Next major step is a dynamically redefinable ISA
• Will study reconfigurable processors towards the end of this course
Technology Trends
• Fabrication
• Architecture
• Application
• Compilation/Software Support
Motivation
• Demands of embedded computing are impacting desiderata
  Faster, cheaper processors
  Shorter times to market
• Poor scalability of superscalars: complex control units
• EPIC / VLIW: simpler architectures, known compilation technology
More Motivation
• FPGA / Reconfigurable logic
Fine grained parallelism
Explicit control over micro-architectural features
Fast static communication
Trends In Technology, Applications, Architectures
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Delay vs. Feature Size

[Figure: gate delay and interconnect delay (in ps) for Cu & low-k and for Al & SiO2 interconnects, plotted against feature sizes from 650 nm down to 100 nm (1999 data); as features shrink, interconnect delay grows to dominate gate delay.]

Bohr, M. T., “Interconnect Scaling - The Real Limiter To High Performance ULSI”, Proceedings of the IEEE International Electron Devices Meeting, pages 241-242.
Impact Of Decreasing Feature Size
• Interconnect delay has greater impact than gate delay
• “...wires are not keeping pace with scaling of other features. … In fact, for CMOS processes below 0.25 micron ... an unacceptably small percentage of the die will be reachable during a single clock cycle.”
• “Architectures that require long-distance, rapid interaction will not scale well ...”

  “Will Physical Scalability Sabotage Performance Gains?”, D. Matzke, chief architect, TI, IEEE Computer (9/97)
As Wire Delays Become Significant...
• Focus on architectures that
do not involve long distance communication
distribute control and data processing logic
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Verification And Test
• With increasing chip complexity, verification and test costs form a significant component of the overall cost (based on trends in the previous slide)
• Scaling current superscalar architectural techniques is likely to exacerbate the test and verification cost factor
• A long testing process will also affect time to market
Impact of Rising Verification And Test Costs
• Keep the architecture simple and regular
  move complex decision-making logic from the processor to higher level tools (compiler)
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Bhagi Narahari, GWU CS 339 Spring 2000
Today’s Computational Requirements
[Dubey, IBM HotChips’97 Tutorial]
Application domain (equivalent # of Pentiums):
• Real-time groupware, video conferencing: 2
• High quality video for online interactive catalogs, streaming a/v: 4
• Workgroup collaboration with 3-d graphics: 10
• Video authoring, telegames with video/3-d graphics: 15
• Spontaneous speech recognition: 20
• Digital library and media mining: 20-30
• Broadband conferencing: 30+
• Electronic commerce with strong encryption: 50+
• “…media processing will become the dominant force in computer architecture and microprocessor design.”

  “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)
Application Trends Summary
• Real-time processing• Packed 8-, 16-, and 32-bit integer data• Continuous data streams• Fine grain parallelism• Long integer arithmetic, table look-ups• Common kernels (small code size)• Low temporal reuse• High spatial locality
The Impact of Media Application Trends
• Simple regular architectures are desirable scope for lots of MIMD processing tuned for media kernels need newer caching technology
requirements of predictability, high throughput low temporal reuse, high spatial reuse
Technology and Application Trends
• Feature size and the effect on signal delay
• Cost of verification and test of new designs
• The new media application shift
• Chip density and ILP
Transistors / Chip

[Figure: projected MPU transistors per chip (M) and DRAM bits per chip (G) for 1997-2012, scale 0-1600, with a "50 Pentiums" point marked on the projection.]
Available Instruction-Level Parallelism [Wall ’93, DECWRL]

[Figure: measured ILP (scale 0-100) per application (egre, sedd, yacc, eco, grr, met, alvi, comp, dodu, espr, fppp, gcc1, hydr, li, mdlj, ora, swm, tomc) under Perfect, Superb, and Good machine models.]
From Previous Two Slides...
• Lots of hardware parallelism available
  can accommodate approx. 50 Pentiums on one die in 6 years

However,

• Conventional architectures and compilation cannot expose enough parallelism in applications
  even the “superb” model yields an ILP < 10 on average
• Need for new architectures and compilation techniques!
What Is The Response Elsewhere To All This?
Architecture Research Approaches
• Past approaches
  better instruction fetch/issue
  improved instruction processing
  better prediction (branches, aliases)
  statically scheduled variants of VLIW
• Novel (different) approaches
  Reconfigurable processors
  IRAM and variants
  Simultaneous multi-threading
  On-chip multi-processing
Two Noteworthy Directions
• Reconfigurable Processors
  let compiler handle everything
  no commitment to a particular architecture
  compiler generates architecture and code for it
• Explicitly Controlled Architectures
  simplify architectures as much as possible
  architectural template is a known, conventional one
  compiler handles a lot of the processor’s decision making
    explicitly control issue, scheduling, allocation
  Explicitly Parallel Instruction Computing (EPIC): a subset of explicitly controlled architectures
Compiler vs. Processor

[Figure: division of labor between compiler and hardware. After the frontend and optimizer, the decisions are: determine dependences, determine independences, bind operations to function units, bind transports to busses, execute. Each architecture draws the compiler/hardware boundary at a different point:]

• Superscalar: hardware makes all of these decisions
• Dataflow: compiler determines dependences; hardware does the rest
• Independence architectures: compiler also determines independences
• VLIW: compiler additionally binds operations to function units
• TTA: compiler also binds transports to busses; hardware only executes

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.
Reconfigurable Computing: A Summary of Achievements
• Fastest RSA decryption 600Kb/s, 512b keys [DEC PRL]
• DNA sequence matching 100x faster than MPP’s, Cray3 [Splash, SRC]
• Filters on FPGA’s 10x faster than DSP’s [Xilinx,Altera application notes]
• Processor emulation [Butts IEEE CICC’95, Varghese IEEE Trans. VLSI’93]
• Hardware reuse [Multifunction PCMCIA, Wireless Video Coding UCLA]
Where can reconfigurability make a difference?
• Applications requiring non-standard data-path FFT,DCT,CORDIC
• Static data, adaptive precision constant co-efficient filters, encryption-decryption
• Fault tolerance, real-time threat sensitive adaptation defense communication systems
• High-performance multifunction portables PDA’s, cellular phones, wearable computers
• Regular, fine-grained parallel processing signal/image processing (e.g. pattern recognition)
What are the hurdles?
• Poor compilation times
  lack of correspondence between standard IR and final configurations
  place and route inherently complex
• Additional runtime overheads
  large configuration size implies high reconfiguration costs
  this also implies context switches are very costly
• Lack of convenient abstract models, language support
  models for algorithm development (e.g. RMESH, USC-MAARC)
  models for compiler targets (ReaCT-ILP)
  language support for hardware structural information
    but not as complex as HDL’s
State Of Compilation Technology
• Compilation for incremental architectures
well known technology
but bottleneck of “conventional compilation”
• Compilation for radically different architectures
no known efficient and automatic compilation
potential for breaking through the bottleneck
What can be efficiently compiled for today?

[Figure: architectures arranged by parallelism against approximate instruction packet size (0, 4, 16, 32, 64, 128-512, 1K-10K, 100K-1M, >1M): early x86, simple pipelined/embedded, SuperScalar, SuperSpeculative, TRACE (Multiscalar), SMT, VECTOR, Dataflow, EPIC/VLIW, TTA, Adaptive EPIC, RAW, MultiChip CVH, DPGA, GARP, FPGA, RaPiD, Simple ASIC, Complex ASIC.]
Trends In Technology, Applications, Architectures
(What can we infer?)

• Design/verification costs: simple, regular architectures
• Signal delays: shorter connections; local interactions
• Media processing: high throughput, highly compute-intensive processing, many integer types
  Not enough ILP through standard compilation
  Customized/special-purpose compilation? Adaptive architectures?

Adaptive Explicitly Parallel Instruction Computing?