
© 2008 IBM Corporation

From early RISC to CMPs

Perspectives on Computer Architecture

Valentina Salapura, [email protected]
IBM T.J. Watson Research Center


Hi, my name is Valentina Salapura

And I am a Computer Architect. I work at the IBM T.J. Watson Research Center

I used to be a professor of Computer Architecture at Technische Universität Wien in Vienna

I always wanted to be an Architect (but of a different kind of architect)


What Is Computer Architecture?

The manner in which the components of a computer or computer system are organized and integrated.

Merriam-Webster Dictionary

The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation.

Gene Amdahl, IBM Journal of R&D, Apr 1964

In this book the word architecture is intended to cover all three aspects of computer design – instruction set architecture, organization, and hardware.

Hennessy and Patterson, Computer Architecture: A Quantitative Approach


Let us look at real Architecture

Building Architecture
Sand, clay, wood, etc.

Bricks, timber, …

Compose them to form buildings

Build cities

Computer Architecture
Transistors, logic gates

ALUs, flip-flops, bit cells, crossbars, …

Compose them to form processors

Build machines


My Opinion: Architecture vs. Design

Computer architecture applies at many levels
System architecture
Memory system architecture
Cache architecture
Network architecture
…

Architecture is the plan; design is the implementation

Good architects understand design; good designers understand architecture

You have to know both!


Architecture of the IBM System/360

First separation of architecture from design
IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models

The first true ISA designed as portable hardware-software interface

“Before 1964, each new computer model was designed independently; the System/360 was the first computer system designed as a family of machines, all sharing the same instruction set.”

Amdahl, Blaauw and Brooks, IBM Journal of R&D, Apr 1964
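A minimal sketch of what “same ISA, different implementations” means (two hypothetical machines, not the real S/360 models): programs behave identically on both; only speed differs.

```python
# One ISA, two implementations: the architectural result is fixed
# by the instruction set; cycle costs belong to the implementation.

PROGRAM = [("li", "r1", 2), ("li", "r2", 3), ("add", "r3", "r1", "r2")]

def run(program, cycles_per_op):
    regs, cycles = {}, 0
    for op, dst, *src in program:
        if op == "li":                    # load immediate
            regs[dst] = src[0]
        elif op == "add":                 # register add
            regs[dst] = regs[src[0]] + regs[src[1]]
        cycles += cycles_per_op[op]       # model-specific cost
    return regs, cycles

low_end  = run(PROGRAM, {"li": 4, "add": 10})  # cheap, slow model
high_end = run(PROGRAM, {"li": 1, "add": 2})   # expensive, fast model
print(low_end)    # ({'r1': 2, 'r2': 3, 'r3': 5}, 18)
print(high_end)   # ({'r1': 2, 'r2': 3, 'r3': 5}, 4)
```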


My Opinion: Architecture vs. Software

Architecture – software interaction at many levels
Applications and algorithms

Programming models

Compilers

Operating systems

…

Architecture is the plan; software exploits it

Good architects understand software; good software developers understand architecture

You have to understand both!


CMOS Scaling

Semiconductor scaling – the driving force behind computer architecture research

Dennard’s scaling theory
Feature size decreases about 15% each year
Compute density increases about 35% each year
Enables higher compute speed
Die size increases about 10–20% each year
Therefore, transistor count per chip increases about 55% per year
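As a back-of-the-envelope check, the ~55% figure is just the density and die-size trends above compounded:

\[
\underbrace{\left(\frac{1}{0.85}\right)^{2}}_{\text{density}\ \approx\ 1.38\times} \times \underbrace{1.10\ \text{to}\ 1.20}_{\text{die area}} \;\approx\; 1.5\ \text{to}\ 1.65\times\ \text{transistors per chip, per year.}
\]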

[Figure: the same two-core design across CMOS generations – in the current generation the cores fill the die; one generation ahead they shrink; two generations ahead there is room to add function]


History of Computer Architecture

Computer architecture research has changed over time
Build better adders

Build better units

Build better ISA

Build better microarchitecture

Today
Build better multiprocessors, and better synchronization, coherence, communication, programming models and languages


Impact of IC Scaling

Much more fits on a chip
Early 1980s – entire 32-bit microprocessor
Late 1980s – on-chip caches
Late 1990s – dynamic + static ILP
Early 2000s – on-chip router
Future – billion transistors – multiprocessors? wide-issue ILP? ??

System balances always changing
Pipelining – cycle time vs. wire delay
Memory wall – cycle time vs. memory latency
I/O bottleneck – transistor count vs. pin count
Power wall – transistor count vs. power consumption
ILP wall – increasing instruction-level parallelism vs. performance increase


Predicting future trends for computer systems

“Computers in the future may weigh no more than 1.5 tons.”

Popular Mechanics, 1949

“There is no reason anyone would want a computer in their home.”

Ken Olsen, founder of DEC, 1977

“640K ought to be enough for anybody.”

Bill Gates, 1981

“Prediction is difficult, especially about the future.”

Yogi Berra

Source: IBM GTO


pre-history her-story


Early 80s

LSI became VLSI

Challenges:
Can we fit a whole processor on one chip?
– Yes, if it is sufficiently simple
– RISC processor revolution
– Keep it simple

Can we design 115 thousand transistors?
– Yes, with discipline and tools
– Mead–Conway VLSI design revolution
– Keep it simple


Research agenda for 80s – 90s

Build better processor cores
– Riding up Moore’s Law
– CMOS scaling means we can use more transistors
– Find ways to use the transistors profitably

Build better design methodologies and tools
– More complex systems require better ability to control design complexity
– More complex systems require better performance prediction

Graph source: www.physics.udel.edu/~watson/scen103/intel.html


Better modeling of microprocessors

MIPS R2000 clone

FPU & system components model, working with Siemens
Behavioral timing-exact modeling using a new language – VHDL
Exploration of other modeling approaches

StateCharts – David Harel

MIPS clone
Modeling and design using a new approach
Logic synthesis and VHDL


MIPS-I processor core

processor core as starting point
MIPS-I architecture
designed from public information

compatible MIPS-I ISA implementation
compatible hardware interface
different internal operation

not a clone

written in VHDL
described at behavioral/register-transfer level

hardware implementation using logic synthesis


MIPS-I core on FPGA

FPGAs for rapid prototyping
one VHDL model for FPGA and for the target technology

Significant reduction in verification time

IEEE Transactions on VLSI, April 2001


Building better cores

Application-specific processors
Can we optimize functions for specific applications?
Extend processor with application-specific units

Prototyping with FPGAs
A nascent technology taking advantage of increased integration
Methodology for FPGA design with synthesis

Share the same design for prototype and actual fabrication


Hardware/Software Co-evaluation

instruction set design based mostly on software benchmarks

minimize cycle count
minimize code size

but introducing new instructions changes hardware implementation

cycle time
resource usage

software analysis is not enough for a balanced view of costs and benefits ⇒ co-design

integrates hardware and software analysis
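A minimal sketch of the balance being described, with purely hypothetical numbers: a candidate instruction removes 20% of the cycles, but its datapath stretches the cycle time by 8%, so the net win is smaller than a software-only analysis suggests.

```python
# Hypothetical co-design cost/benefit check (illustrative numbers only):
# software analysis sees the cycle-count win; hardware analysis sees
# the cycle-time loss. Only together do they give the real speedup.

def exec_time_ns(cycles, cycle_time_ns):
    # total execution time = dynamic cycle count x cycle time
    return cycles * cycle_time_ns

baseline  = exec_time_ns(cycles=1_000_000, cycle_time_ns=10.0)
candidate = exec_time_ns(cycles=800_000,  cycle_time_ns=10.8)

print(f"software-only view: {1_000_000 / 800_000:.2f}x")   # 1.25x
print(f"actual speedup:     {baseline / candidate:.2f}x")  # 1.16x
```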

[Flowchart: the co-design loop – program profiling and partitioning between core and new units; interface compiler, compiler adaptation, and a new compiler; synthesis of the processor and compilation of the code; evaluation by simulation of the RTL model and by prototyping]


Joining IBM

Projects I worked on at IBM
– Cyclops

– SANlite

– Blue Gene/L

– Blue Gene/P


Late-late nineties and on

Diminishing returns on investment in transistors

Massively parallel many-threaded CMPs
– Cyclops

– SANlite

Application-specific accelerations
– Networking & Protocol Processing

– Networked Storage


SANlite: Multiprocessor subsystem for network processing

Self-contained, scalable multiprocessor system
– With its own general-purpose processors, interconnect, interfaces and memory

Connected to the SoC bus via a bridge
– Easy integration in the basic SoC structure

Inner structure and complexity is hidden from the SoC designer
– Significantly simplifies the SoC design
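To illustrate the encapsulation argument (hypothetical interfaces, not the SANlite RTL), the sketch below hides an entire cluster behind a two-operation bridge; the SoC side never sees the cores, crossbar or local memory:

```python
# Hypothetical sketch: the multiprocessor macro is self-contained;
# the SoC designer integrates only the bridge, a simple bus slave.

class MultiprocessorMacro:
    """Cluster of cores, interconnect and local memory (all private)."""
    def __init__(self, n_cores, local_mem_kb):
        self.n_cores = n_cores
        self._mem = bytearray(local_mem_kb * 1024)

    def handle(self, addr, data=None):
        # stand-in for the cluster's internal arbitration and routing
        if data is None:
            return self._mem[addr]
        self._mem[addr] = data

class PLBBridge:
    """All the SoC ever sees: read and write on the system bus."""
    def __init__(self, subsystem):
        self._subsystem = subsystem

    def read(self, addr):
        return self._subsystem.handle(addr)

    def write(self, addr, data):
        self._subsystem.handle(addr, data)

bridge = PLBBridge(MultiprocessorMacro(n_cores=8, local_mem_kb=64))
bridge.write(0x10, 42)
print(bridge.read(0x10))  # 42 – the cluster's insides stay hidden
```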

[Block diagram: a PowerPC 440 SoC – the Processor Local Bus (PLB) carries the 440 PowerPC core, SRAM, SDRAM controller, DMA controller, PCI-X bridge and 10/100/1G Ethernet MAC; an On-chip Peripheral Bus (OPB) behind an OPB bridge carries UARTs (2), I2C (2), GPIO, GPT, timers and the interrupt controller, plus a RAM/ROM/peripheral controller and external bus master; the multiprocessor macro (P0…PN, accelerator, memory) attaches to the PLB and drives the Fibre Channel and Gb Ethernet interfaces]

Patent US7353362: Multiprocessor subsystem in SoC with bridge between processor clusters interconnection and SoC system bus


Details of the multiprocessor subsystem

One or more processor clusters
– Cyclops cluster has 8 processor cores, 32 KB shared instruction cache, local SRAM
– Very simple processor cores
• Single-issue, in-order execution
• Four-stage pipeline

One or more local memory banks
– For storing data and instructions

Bridge to the SoC bus

One or more application-specific hardware interfaces
– Gigabit Ethernet, Fibre Channel, InfiniBand, etc.
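For intuition, here is a minimal Python sketch (a toy model, not the Cyclops design) of such a single-issue, in-order, four-stage pipeline, ignoring hazards and memory stalls:

```python
# Toy model of a single-issue, in-order, four-stage pipeline.
# Each cycle, every in-flight instruction advances one stage.
STAGES = ["IF", "ID", "EX", "WB"]

def timeline(n_instructions):
    """Which instruction occupies each stage, cycle by cycle."""
    rows = []
    for cycle in range(n_instructions + len(STAGES) - 1):
        row = {}
        for depth, stage in enumerate(STAGES):
            i = cycle - depth       # instruction i entered IF at cycle i
            if 0 <= i < n_instructions:
                row[stage] = f"i{i}"
        rows.append(row)
    return rows

for cycle, row in enumerate(timeline(5)):
    print(cycle, row)
# 5 instructions retire in 5 + 3 = 8 cycles: each takes 4 cycles
# end to end, but steady-state throughput is 1 instruction/cycle.
```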

[Block diagram: processor core clusters 0…m connect through a crossbar switch to memory banks 0…n (DRAM), to a PLB bridge onto the SoC PLB bus, and to the 1G/10G Ethernet and 1G/2G/4G Fibre Channel MACs and interfaces]


The 00’s

The Power Challenge, and how to build supercomputers in a power-constrained regime
– Scale-out

– SIMD

– Cost-effective scaling of multicore


Key Supercomputer Challenges

More Performance ⇒ More Power
– Systems limited by data center power supplies and cooling capacity
• New buildings for new supercomputers

Scaling single-core performance degrades power efficiency
– FLOPS/W not improving from technology

Traditional supercomputer design is hitting power & cost limits


Blue Gene/L: from Chip to System

Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute card: 2 chips (1x2x1), 11.2 GF/s, 1.0 GB
Node card: 16 compute cards (32 chips, 4x4x2), 0–2 I/O cards, 180 GF/s, 16 GB
Rack: 32 node cards, 5.6 TF/s, 512 GB
System: 104 racks (LLNL), 596 TF/s, 53 TB

The Blue Gene/L project has been supported and partially funded by the Lawrence Livermore National Laboratories on behalf of the United States Department of Energy under Lawrence Livermore National Laboratories Subcontract No. B517552.


Blue Gene/L compute ASIC

[Block diagram: two PPC440 cores, each with 32k I1/32k D1 caches and a double FPU, connect through L2 caches (with snoop between them) and a PLB (4:1) to a shared 4 MB eDRAM L3 cache (usable as on-chip memory) with a shared L3 directory w/ECC (1024b data + 144b ECC path), a DDR controller w/ECC (144b DDR interface), shared SRAM, and the network interfaces – torus (6b out, 6b in, 1.4 Gb/s per link), collective (3b out, 3b in, 2.8 Gb/s per link), 4 global barriers or interrupts, Gbit Ethernet, and JTAG access]

High integration: eDRAM
Leverage components: double FPU (DFPU)


Blue Gene/L ASIC

IBM Cu-11 0.13 µm CMOS ASIC process technology
11 x 11 mm die size
1.5/2.5 V
CBGA package, 474 pins
Transistor count 95M
Clock frequency 700 MHz
Power dissipation 12.9 W


Blue Gene/L: Exploring the Benefits of SIMD

Power efficient

Low overhead (doubles data computation without paying cost of instruction decode, issue etc.)

UMT2K runs on 1024 nodes

Code optimized to exploit SIMD floating point

IEEE Micro, Vol. 26, No. 5, 2006

[Chart: normalized time t, energy E, and E×t for scalar vs. SIMD code (smaller is better)]


Blue Gene/L packaging


Trends ’00: Processor performance slowing down

[Chart: uniprocessor performance relative to the VAX-11/780 (SPECint), 1978–2006 – growth of 25%/year early on, then 52%/year, then ??%/year in recent years]

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006


Microprocessors are dead!

“Today's microprocessors will become extinct by the end of the decade, to be replaced by computers built onto a single chip.”
– Greg Papadopoulos, Sun Microsystems, Fall Processor Forum 2003


Leakage Power

Yesterday:
– Power to clock latches was the dominant power dissipation component
– Active power dominates

Today:
– Power is consumed even when not clocking latches
– Leakage power has become a significant component
– Must develop means to “disconnect” unused circuits
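In first-order terms (the standard CMOS power decomposition, not taken from the slide):

\[
P \;=\; \underbrace{\alpha\, C\, V_{dd}^{2}\, f}_{\text{active (switching)}} \;+\; \underbrace{V_{dd}\, I_{\mathrm{leak}}}_{\text{leakage (always on)}}
\]

Clock gating reduces the activity factor α but leaves the leakage term untouched; power gating, i.e. “disconnecting” unused circuits, attacks the leakage current itself.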

[Chart: active power vs. leakage power as gate length scales from 1 µm down to 0.01 µm – leakage grows to become a significant fraction of the total]


Hardware Synthesis and Floorplanning

Synthesis
– Yesterday: reduce the number of gates to make timing
– Today: Placement Driven Synthesis (PDS)
• Wires have delays we have to consider

Floorplan
– Yesterday: did not worry about a floorplan
– Today: there is no architecture without a floorplan
• Floor-plan the architecture from the beginning
• Process variability makes timing closure hard


Improving Performance

[Chart: relative gain in performance per technology generation (0.18 µm, 0.13 µm, 90 nm, 65 nm), split into contributions from scaling and from innovation]

No longer possible by scaling alone
– New device structures
– New device design point
– New materials

[Figure: periodic table of elements used in CMOS manufacturing – before the ’90s: H, B, O, P, Si, As, Al; since the ’90s: N, F, Ge, Ti, Ta, Cu, W, Cl, Y, Zr, Hf, V, Nb, Cr, Mo, Sc, Re, Ru, Ir, Rh, Co, Pt, Pd, Ni, the lanthanides (La–Yb), Ca, Sr, Ba, Mg; beyond 2006: Br, Zn, Bi, C, He, Xe, Sb, Te]

Source: IBM GTO


The Discontinuity

Then (2002)
– Scaling drives performance
– Performance constrained
– Active power dominates
– Performance tailoring in manufacturing
– Focus on technology performance
– Single-core architectures

Now
– Architecture drives performance
• Scaling drives down cost
– Power constrained
– Standby power dominates
– Performance tailoring in design
– Focus on system performance
– Multi-core architectures


[Timeline chart: number of cores per chip (2, 2/MT, 4, 4/MT, 8) vs. introduction date, 2001–2008]

POWER4 (10/01): 180 nm, 412 mm², 174 M transistors, 115/125 W, 2 cores
POWER4+ (11/02): 130 nm, 380 mm², 184 M transistors, 70 W, 2 cores
POWER5 (5/04): 130 nm, 389 mm², 276 M transistors, 80 W (est.), 2 cores, 2-way MT/core
Pentium D 800 (Smithfield) (5/05): 90 nm, 206 mm², 130 W, 2 cores
Xbox360 (3Q/2005): 90 nm, 165 M transistors, 3 cores, each 2-way MT
POWER5+ (10/05): 90 nm, 230 mm², 276 M transistors, 70 W, 2 cores, 2-way MT/core
Cell BE (10/2005): 90 nm, 221 mm², 234 M transistors, 95 W (PPE: 2-way MT; SPEs: no MT)
SPARC Niagara (11/2005): 90 nm, 379 mm², 279 M transistors, 63 W, 4-way MT/core
Core 2 Duo (Conroe) (8/06): 65 nm, 143 mm², 2x1 core, 65 W
Core 2 Quad (Kentsfield) (1/07): 65 nm, 2x2 cores
SPARC Niagara II (2007): 65 nm, 342 mm², 72 W (est.), 8-way MT/core
Coming: 8+ cores, new paradigms

Product data: Sima, Goetz, Gschwind, Hofstee, 2007/2008; graphics: Sima 2007, Gschwind 2008


Designing Blue Gene/P

Emphasis on modular design and component reuse

Reuse of Blue Gene/L design components when feasible
– Optimized SIMD floating point unit
• Protect investment in Blue Gene/L application tuning
– Basic network architecture

Add new capabilities when profitable
– PPC 450 embedded core with hardware coherence support
– New data moving engine to improve network operation
• DMA transfers from network into memory
– New performance monitor unit
• Improved application analysis and tuning


Blue Gene/P – next generation node design

[Block diagram: four PPC450 cores, each with 32k I1/32k D1 caches and a double FPU, attach through private L2 caches with snoop filters and multiplexing switches to two shared 4 MB eDRAM L3 banks (usable as on-chip memory) with shared L3 directories w/ECC (512b data + 72b ECC paths), two DDR-2 controllers w/ECC driving a 13.6 GB/s DDR2 DRAM bus, a DMA engine, shared SRAM, a hybrid PMU w/ SRAM (256x64b), an arbiter, and the network interfaces – torus (6 links, 3.4 Gb/s bidirectional each), collective (3 links, 6.8 Gb/s bidirectional each), 4 global barriers or interrupts, 10 Gbit Ethernet, and JTAG access]


Multiprocessor cache management issues

[Diagram: two PPC450 cores with private L1D caches above a shared memory; all three initially hold address FF00 = 5, then one core writes FF00 = 6 and the other writes FF00 = 7]

In multiprocessor systems, each processor has a separate private cache. What happens when multiple processors cache and write the same data?
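The failure mode is easy to reproduce in a toy model (a sketch, not hardware): two cores with private write-back caches read the same address, then write different values.

```python
# Toy model of the problem on the slide: no coherence protocol yet.

class Core:
    def __init__(self, memory):
        self.cache = {}          # private L1D: address -> value
        self.memory = memory

    def load(self, addr):
        if addr not in self.cache:         # miss: fill from memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value  # write-back: memory not updated yet

memory = {0xFF00: 5}
p0, p1 = Core(memory), Core(memory)
p0.load(0xFF00); p1.load(0xFF00)   # both caches now hold FF00 = 5
p0.store(0xFF00, 6)
p1.store(0xFF00, 7)
print(p0.cache[0xFF00], p1.cache[0xFF00], memory[0xFF00])  # 6 7 5
# three different answers for one address: the caches are incoherent
```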


Multiprocessor cache management issues: Coherence protocol

[Diagram: the same two PPC450 cores with private L1D caches; a coherence protocol propagates each write (FF00 = 6, FF00 = 7) between the caches and memory]

Every time any processor modifies data
– Every other processor needs to check its cache

High overhead cost
– Cache busy snooping for significant fraction of cycles
– Increasing penalty as more processors are added


Snoop filtering of unnecessary coherence requests

[Block diagram: four processors, each with a snoop unit in front of its L2 cache; coherence traffic passes through the snoop units, while data transfers go to the shared L3 cache and DMA]

Increases performance (removes unnecessary lookups, reduces cache interference)

Reduces power (and energy)
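A minimal sketch of the filtering idea (the real Blue Gene/P snoop units use more compact structures, such as stream registers): each snoop unit tracks a conservative superset of the lines its local cache may hold, and forwards only the snoops that could actually hit.

```python
# Sketch: a snoop unit per core filters remote invalidations.

class SnoopUnit:
    def __init__(self):
        self.may_hold = set()   # superset of lines in the local cache
        self.forwarded = 0      # snoops that reached the cache
        self.filtered = 0       # snoops answered without a lookup

    def local_fill(self, line):
        self.may_hold.add(line)          # local cache fetched the line

    def snoop(self, line):
        """Remote write to `line`: forward only if it can hit."""
        if line in self.may_hold:
            self.forwarded += 1
            self.may_hold.discard(line)  # line invalidated locally
        else:
            self.filtered += 1           # local cache never disturbed

units = [SnoopUnit() for _ in range(4)]
units[0].local_fill(0xFF00)
units[1].local_fill(0xFF00)              # cores 0 and 1 share the line
for u in units[1:]:                      # core 0 writes: snoop the rest
    u.snoop(0xFF00)
print([(u.forwarded, u.filtered) for u in units[1:]])
# [(1, 0), (0, 1), (0, 1)] – only the sharer pays for a cache lookup
```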


Snoop filtering improves power and performance

[Chart: normalized time, power, E, and E×t with the snoop filter disabled vs. enabled – actual hardware measurements of the UMT2k application (smaller is better)]

HPCA 2008


Blue Gene/P ASIC

IBM Cu-08 90 nm CMOS ASIC process technology
Die size 173 mm²
Clock frequency 850 MHz
Transistor count 208M
Power dissipation 16 W


Blue Gene/P compute card and node card

The Blue Gene/P project has been supported and partially funded by Argonne National Laboratory and the Lawrence Livermore National Laboratory on behalf of the United States Department of Energy under Subcontract No. B554331.


Blue Gene/P cabinet

16 node cards in a half-cabinet (midplane)

512 nodes (8 x 8 x 8)

All wiring up to this level (>90%) is card-level

1024 nodes in a cabinet

Pictured: 512-way prototype (upper cabinet)


Research and Engineering is a team sport

Wherever you want to go, you will likely not get there alone
– Find like-minded colleagues for technical and moral support
– Share success with the team and treat everybody with respect

Being part of a successful team is key
– Enthusiasm and commitment to team success
– Learn to understand how your team works and how to get results


Building the World’s Fastest Supercomputer: The Blue Gene/L team

[Photo: Team BlueGene, No. 1]


The Blue Gene/P team – Yorktown


My support team


Career Opportunities and Challenges

Find an interesting subject that you are passionate about
– To make it work, you will spend many hours with it

Find interesting challenges of TOMORROW
– Everybody else is already working on TODAY’s problems
– And solving today’s problems by tomorrow is usually too late


Challenges which influenced the course of Computer Architecture

What can we build?
– Technology opportunities and threats
• VLSI & Dennard scaling
• Power crisis
• SER exposures

How can we build it?
– Tools
– Methodology
– Controlling complexity
• Humans
• Tools


Pioneers of Computer Architecture (and Related Fields)


Future Pioneers: 2006 Summer School