8/30/2006 eleg652-010-06F 1
Principles of Parallel Architecture
Fall 2006
Keys to a happy Life: Diversity and Variety. Diversity in the people that you meet. Variety in the things that you do
Contact Information
Instructor
Name: Joseph B. Manzano
Office: 137 Evans Hall
Phone: N/A Email: [email protected]
Teaching Assistants
Name: Juergen Ributzka
Office: 326 Dupont Hall
Phone: (302) 831 0327 Email: [email protected]
Course Webpage: http://www.capsl.udel.edu/courses/eleg652/2006/
Name: Eunjung Park
Office: 326 Dupont Hall
Phone: (302) 831 0327 Email: [email protected]
Important Course Information
Final Quiz: Wednesday, December 6th, 2006
Final Project Due Date: Friday, December 8th, 2006

Grade Distribution
Activity        Weight
Homeworks       33.00%
Class Project   33.00%
Final Exam      33.00%
Participation    1.00%

Activities: four homeworks, a comprehensive final examination, and a class project assigned by the instructor with a mentor
Reference Material
Reference Books
1. John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach, Third Edition. Morgan Kaufmann Publishers, Inc., 2003.
2. D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture. Morgan Kaufmann Publishers, Inc., 1999.
Supporting Materials
Selected publications from:
Journals
  IEEE Transactions on Parallel and Distributed Systems
  IEEE Computer
  IEEE Transactions on Computers
Conference Proceedings
  PACT: Parallel Architectures and Compilation Techniques
  MICRO: ACM/IEEE International Symposium on Microarchitecture
  ISCA: International Symposium on Computer Architecture
  HPCA: IEEE Symposium on High-Performance Computer Architecture
  PLDI: ACM SIGPLAN Conference on Programming Language Design and Implementation
Course Contents
Provides an overview of technologies that are applicable in almost all aspects of computing and will soon be part of consumer electronics in general.
Shows the principles on which parallel machines are built and how these concepts have infiltrated other parts of the computer and entertainment industries.
Provides an understanding of how these concepts affect both the hardware and software of their target machines, and of their different implementations.
Expectations about this Course
You should learn:
A basic idea about the lingo that is used in today's supercomputer/parallel machine market
Vector Processing and its place in consumer electronics
Different forms of parallelism and their current implementations
Shared memory models
Parallel Programming Models and Synchronization
Multithreaded Architectures
Why Study Parallel Architectures?
Concepts that should soon become ubiquitous
Productively write software that takes advantage of new features of upcoming or existing hardware
Understand how current technologies have evolved and how they can be improved
Course Overview
Terminology and General Knowledge
Vector Processing and its Legacy
Instruction Level Parallelism: a brief overview
Multicore and Cellular architectures
Parallel (shared) memory models and synchronization primitives
Advanced topics such as Dataflow and Transactional Memory
Course Introduction
The Role of a Computer Architect
Maximize Productivity and Performance
Productivity = Programmability and a reduction in development time
Performance = “Reasonable” Throughput given technology and cost limitations
Parallelism
  Two or more tasks may execute at the same time
  An alternative to higher clock frequencies
  Applies to all levels of computer design
Its importance has been rising constantly since several technology "walls" were hit
In the near future, it will become the dominant paradigm in all aspects of computing
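As a minimal sketch (not from the slides) of two or more tasks executing at the same time, a single job can be split into independent pieces handed to a worker pool. Note that CPython threads are concurrent rather than truly parallel for CPU-bound work, so this only illustrates the decomposition:

```python
# Illustrative sketch: splitting one task into pieces that may execute
# at the same time. (In CPython the GIL limits true CPU parallelism;
# the decomposition itself is what matters here.)
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each worker sums its own slice of the data independently."""
    return sum(chunk)

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]  # four independent sub-tasks

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

assert total == sum(data)  # same answer as the sequential version
```

The same decomposition idea applies at every level the slide mentions, from instructions to whole programs.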
The Transition
Most consumer electronics will have some form of parallel architecture inside of them by next year (2007)
Reasons for the Change
An evolutionary change in computing due to:
  Technology: decreasing feature sizes allow more components on a chip
  Applications: more and more performance- and power-hungry applications
  Architecture: effectively organizing components to maximize the use of resources while minimizing damaging scaling effects
  Economics: finding cost-effective ways to get the desired performance out of a given hardware/software combination
Applications Requirements
Demand for more cycles = More sophisticated Hardware
Wide Range of Performance Demands
Audio processing = real-time response with an allowed threshold of error
Business loads = a given quantum of time with no errors allowed
Application and parallel computer: Obtain a speed up in application runtime
Productive Parallel Systems
Current systems already build on parallel concepts and designs (e.g., desktop systems are multithreaded)
Parallel Computing and computers are becoming ubiquitous as we speak
Technology: An Overview
Decrease in Feature Size (Lambda)
  Clock rates: roughly proportional to 1/lambda
  Number of transistors: grows as at least 1/lambda^2
Performance: an increase of roughly 1000x in the last decade
  The fastest supercomputer in June 1996 (Tokyo's SR2201) was 220 GFLOPS
  The fastest supercomputer now is 280 TFLOPS (IBM's eServer Blue Gene Solution)
Clock frequency: an increase of roughly 20x in the same decade
  Intel Pentium Pro at 150 ~ 200 MHz in 1996; Intel Pentium D at 3.2 GHz in 2006
Extra components: parallelism vs. data locality, fighting for chip real estate
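The growth claims above can be sanity-checked with a quick back-of-the-envelope script; all input values are taken from this slide:

```python
# Checking the scaling ratios quoted on this slide (values from the slide).
peak_1996_gflops = 220.0          # Tokyo's SR2201, June 1996
peak_2006_gflops = 280_000.0      # BlueGene/L eServer Solution, 2006
print(peak_2006_gflops / peak_1996_gflops)   # ≈ 1272.7 -> "roughly 1000x"

clock_1996_mhz = 200.0            # Pentium Pro, upper end of 150~200 MHz
clock_2006_mhz = 3200.0           # Pentium D, 2006
print(clock_2006_mhz / clock_1996_mhz)       # 16.0 -> clock grew far slower
```

The point of the comparison: raw performance grew orders of magnitude faster than clock frequency, so the extra speed had to come from parallelism.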
Intel: An Example of Clock Frequency Growth
[Chart: clock frequency, in KHz, of the Intel microprocessor family from 1971 to 2004]
Growth has been steady until now!!!!
Pentium M
Thermal maps of the Pentium M obtained from simulated power density (left) and IREM measurement (right). Heat levels go from black (lowest) through red, orange, and yellow to white (highest).
Figures courtesy of Dani Genossar and Nachum Shamir, from their paper "Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization and Validation", published in the Intel Technology Journal, May 21, 2003
Storage and Transistor Count Growth
Transistor Count
  Expected to reach one billion during this decade (the 2000s)
  Grows faster than clock rate: 40% per year
Storage
  The gap between storage speed and processor speed is becoming more pronounced
  Larger memories = slower = larger memory hierarchies (e.g., caches, write/read buffers)
  Parallelism and locality inside memory systems: multi-port memories, parallel caches, RAIDs, parallel disks with caching, etc.
Moore's LawThe complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.
Gordon Moore's original statement, from "Cramming more components onto integrated circuits", Electronics Magazine, 19 April 1965
In layman's terms: the number of components on integrated circuits will roughly double every 18 months. With that, the complexity (effort) and the headcount should increase proportionally.
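The layman's reading can be turned into a one-line projection (an illustrative sketch; the 18-month period is the popularized figure, and the 2,300-transistor starting point of the 1971 Intel 4004 is an assumption chosen for illustration, not from the slide):

```python
# Sketch of fixed-period doubling ("Moore's Law" in its popular form).
def projected_count(initial, years, doubling_period_months=18):
    """Project a component count forward under fixed-period doubling."""
    return initial * 2 ** (years * 12 / doubling_period_months)

# 2,300 transistors (Intel 4004, 1971) projected 33 years to 2004:
print(round(projected_count(2_300, 33)))                            # 18-month doubling
print(round(projected_count(2_300, 33, doubling_period_months=24))) # 24-month doubling
```

The two projections differ by more than an order of magnitude, which is why the exact doubling period quoted (12, 18, or 24 months) matters so much in these debates.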
Architectural Trends
Designed for performance
Higher Frequency == Higher Performance?
Memory vs. Processor
Architectural Trend: Hide latencies at all costs!!!
Overlap Computation with Memory accesses [DMA]
Multithreaded execution and sharing of resources [SMT and HT technologies, MTA]
Give more chip real estate to speculative execution [branch prediction and prefetching]
Bring more used-data closer to the processor [memory hierarchies]
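The first of these ideas, overlapping computation with memory accesses, can be sketched as classic double-buffering (an illustrative Python sketch, not real DMA code; `slow_fetch` is a hypothetical stand-in for a long-latency transfer):

```python
# Double-buffering sketch: while the processor works on buffer i,
# the "DMA engine" (here, a background thread) fetches buffer i+1.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_fetch(i):
    """Stand-in for a long-latency memory/DMA transfer."""
    time.sleep(0.05)
    return list(range(i * 4, i * 4 + 4))

def compute(buf):
    return sum(buf)

results = []
with ThreadPoolExecutor(max_workers=1) as fetcher:
    pending = fetcher.submit(slow_fetch, 0)       # start the first transfer
    for i in range(1, 4):
        buf = pending.result()                    # wait for the current buffer
        pending = fetcher.submit(slow_fetch, i)   # prefetch the next buffer...
        results.append(compute(buf))              # ...while computing on this one
    results.append(compute(pending.result()))

print(results)
```

Because the fetch of buffer i+1 is in flight during the computation on buffer i, the transfer latency is hidden behind useful work instead of adding to the total runtime.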
Power Problem? Go Multicore!!!!
  One unit takes N time to finish a size-M problem using T amount of power
  Two units take N/2 + 2X time to finish a size-M problem using T/2 amount of power per unit
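The trade-off can be checked with simple arithmetic (an illustrative sketch; the functions and example numbers are hypothetical, and X stands for whatever parallelization overhead the slide's "2X" term represents):

```python
# Sketch of the slide's single-unit vs. two-unit trade-off.
# N = single-unit runtime, T = single-unit power, X = assumed parallel overhead.
def single_core(N, T):
    return {"time": N, "energy": N * T}

def dual_core(N, T, X):
    time = N / 2 + 2 * X                      # split the work, pay overhead 2X
    return {"time": time, "energy": time * (T / 2) * 2}  # two units at T/2 each

# Hypothetical example numbers: N = 100 s, T = 50 W, X = 5 s
print(single_core(100, 50))   # {'time': 100, 'energy': 5000}
print(dual_core(100, 50, 5))  # {'time': 60.0, 'energy': 3000.0}
```

With these numbers the two-unit version finishes sooner and burns less total energy even though peak power is the same, which is the argument for going multicore once the overhead X is small relative to N.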
Technology Progress Overview
Processor speeds = much faster (around 1000x)
Memory (RAM) speeds are increasing too but at a slower rate (around 10x)
But Memory (RAM) dimensions have grown even faster than processor's speed (around 1,000,000x)
Computation is almost free but bandwidth is very expensive
The Pentium Chip
Intel Pentium 4
Nine Years and Millions of Dollars Later
Next Gen
The Cell Chip Layout
Many of them, simpler and cheaper!!!
The Dawn of Parallelism
• Parallel architectures are becoming more attractive
• Milestone: the introduction of Pentium D (2005) and Centrino Duo (2006)
• Future Projects: IBM PERCS project, Cray Eldorado, Sun Hero, IBM Cell project, etc ...
• All the factors listed contributed to this “epiphany” in computing technology.
• Parallelism can be exploited at many levels in many ways
The World's Fastest
Japan Dominance
[Chart: top supercomputer performance in GFLOPS, 1993 to 1996 — Numerical Wind Tunnel at 192 GFLOPS; CP-PACS at 368 GFLOPS]
The World's Fastest
USA Takes the Lead
[Chart: top supercomputer performance in GFLOPS, 1993 to 2001 — ASCI Red at 1.3 TFLOPS; ASCI White (SP Power3, 375 MHz) at 7.3 TFLOPS]
The World's Fastest
Japan's Second Wind
[Chart: top supercomputer performance in GFLOPS, 1993 to 2003 — Earth Simulator at 35 TFLOPS]
The World's Fastest
...and again
[Chart: top supercomputer performance in GFLOPS, 1993 to 2006 — BlueGene/L Beta at 70 TFLOPS; BlueGene/L eServer Solution at 280 TFLOPS]
The World's Fastest
BlueGene/L eServer Solution
• 65,536 dual-processor nodes arranged in a 32 x 32 x 64 3D torus network
• Global tree structure for fast reduction and broadcast operations over all nodes
• One I/O node per 64 compute nodes
  – Inside a 64-node group: tree-structured connections between the I/O node and the compute nodes, with an aggregate bandwidth of 2.1 GB/s
  – Across 64-node groups: torus-like connections
• Total Memory: 32 TeraBytes
• Total Power Consumption: 1.5 MegaWatts
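As a sketch of how nodes in such a torus are addressed (illustrative Python assuming standard wraparound indexing, not anything BlueGene/L-specific):

```python
# In a 3D torus, every node has exactly six neighbors, found by stepping
# +/-1 along each axis with modular wraparound — even at the "edges".
DIMS = (32, 32, 64)   # the slide's 32 x 32 x 64 machine

def torus_neighbors(x, y, z, dims=DIMS):
    """Return the six wraparound neighbors of node (x, y, z)."""
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# A corner node still has six distinct neighbors thanks to wraparound:
print(torus_neighbors(0, 0, 0))

# The node count matches the slide: 32 * 32 * 64 = 65,536
assert 32 * 32 * 64 == 65536
```

The wraparound links are what keep the network diameter low: no message ever needs more than 16 + 16 + 32 hops on this topology.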
The Next Step
• So what is next?
• Multicore, System-on-a-Chip, PIM, etc.
  – Simpler, colder, cheaper...
• Intel Pentium D and Centrino Duo
• AMD Opteron
• The DARPA HPCS Project
• IBM, Cray and SUN multicore chips: CELL, Cyclops, BlueGene
• Alternatives: Clearspeed [programmable co-processors], etc...