
Page 1: Dezső Sima

Dezső Sima

Multicore and Manycore Processors

December 2008

Overview and Trends

Page 2: Dezső Sima

Overview

1. Overview
2. Homogeneous multicore processors
   2.1 Conventional multicores
   2.2 Manycore processors
3. Heterogeneous multicore processors
   3.1 Master/slave architectures
   3.2 Attached processor architectures
4. Outlook

Page 3: Dezső Sima

1. Overview – inevitability of multicores

Page 4: Dezső Sima

Figure: Evolution of Intel’s IC fab technology [1]

1. Overview – inevitability of multicores (1)

Shrinking: ~ 0.7x / 2 years

Page 5: Dezső Sima

1. Overview – inevitability of multicores (2)

Moore's rule (IC fab technology)

Shrinking ~ 0.7x / 2 years means:
• the same number of transistors on ½ the Si die area, or
• on the same die area: 2x as many transistors every two years.

Doubling transistor counts ~ every two years (on the chips)

(2nd formulation: from 1975)
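Restated as a formula (an illustrative formulation, not from the slides): if a chip holds N_0 transistors at time t_0, then t years later

    N(t_0 + t) ≈ N_0 · 2^(t/2)

so, for example, over six years the transistor budget grows by a factor of 2^3 = 8.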

Page 6: Dezső Sima

Utilization of the surplus transistors?

Wider processor width: pipeline (1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)

Doubling transistor counts ~ every two years

1. Overview – inevitability of multicores (3)

Page 7: Dezső Sima

1. Overview – inevitability of multicores (4)

Figure: Parallelism available in applications [2]

Available parallelism in general purpose apps: ~ 4-5

Page 8: Dezső Sima

Utilization of the surplus transistors?

• Wider processor width: pipeline (1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)

Doubling transistor counts ~ every two years

1. Overview – inevitability of multicores (5)

Page 9: Dezső Sima

The inevitability of multicore processors

Increasing transistor count + diminishing return in performance from single cores ⇒ the best use of surplus transistors is multiple cores, with doubling of core numbers ~ every two years.

1. Overview – inevitability of multicores (6)

Page 10: Dezső Sima

Figure: Spreading Intel’s multicore processors [3]

1. Overview – inevitability of multicores (7)

Page 11: Dezső Sima

1. Overview – inevitability of multicores (8)

Figure 1.1: Main classes of multicore/manycore processors

Multicore processors:

• Homogeneous multicores
  - Conventional multicores (2 ≤ n ≤ 8 cores): desktops, servers; general purpose computing
  - Manycore processors (with >8 cores): prototypes/experimental systems
• Heterogeneous multicores
  - Master/slave architectures (MPC): MM/3D/HPC, production stage
  - Add-on architectures (CPU + GPU): HPC, near future

Page 12: Dezső Sima

2. Homogeneous multicores

2.1 Conventional multicores

2.2 Manycore processors

Page 13: Dezső Sima

2. Homogeneous multicores

Figure 2.1: Main classes of multicore/manycore processors (same taxonomy as Figure 1.1)

Page 14: Dezső Sima

2.1 Conventional multicores

• Multicore MP servers
• Intel's multicore MP servers
• AMD's multicore MP servers

Page 15: Dezső Sima

2.1 Intel’s multicore MP servers (1)

Figure 2.1.1: Intel’s Tick-Tock development model [13]

The evolution of Intel’s basic microarchitecture

Page 16: Dezső Sima

Figure 2.1.2: Overview of Intel's Tick-Tock model and the related MP servers [24]

• 90 nm TICK (Pentium 4 / Prescott): (Potomac): 1x1 C, 8 MB L2. 3/2005: first 64-bit MP Xeons.
• 90 nm TOCK (Pentium 4 / Irwindale): 7000 (Paxville MP): 2x1 C, ½ MB L2/C; (Cransfield): 1x1 C, 1 MB L2. 11/2005: first DC MP Xeon.
• 65 nm: 7100 (Tulsa): 2x1 C, 1 MB L2/C, 16 MB L3; 7200 (Tigerton DC): 1x2 C, 4 MB L2/C; 7300 (Tigerton QC): 2x2 C, 4 MB L2/C.
• 45 nm: 7400 (Dunnington): 1x6 C, 3 MB L2/2C, 16 MB L3.
• 45 nm (Nehalem): 7xxx (Beckton): 1x8 C, ¼ MB L2/C, 24 MB L3; due 1Q/2009.

2.1 Intel's multicore MP servers (2)

Intel's Tick-Tock model for MP servers

Page 17: Dezső Sima

Figure 2.1.3: Evolution of Intel's Xeon MP-based system architecture (until the appearance of Nehalem)

Up to 2005: four single-core (SC) Xeon MPs¹ share one FSB to the preceding NBs; NB link typically HI 1.5 (266 MB/s).

¹ Xeon MPs before Potomac

2.1 Intel's multicore MP servers (3)

System architecture (before Potomac)

Page 18: Dezső Sima

MP platforms, cores and chipsets:

Truland platform: chipsets 8500 (Twin Castle, 3/2005) and 8501 (4/2006); 2x FSB 667 MT/s resp. 800 MT/s; 4 x XMB (2 x DDR2 each); up to 32 GB.

MP cores:
• Xeon MP (Potomac SC), 3/2005: P4-based, 90 nm/675 mtrs; 1 MB L2, 8/4 MB L3; 667 MT/s; mPGA604. First 64-bit MP server processor.
• Xeon 7000 (Paxville MP DC), 11/2005: P4-based, 90 nm/2x169 mtrs; 2x1 (2) MB L2; 800/667 MT/s; mPGA604.
• Xeon 7100 (Tulsa DC), 8/2006: P4-based, 65 nm/1328 mtrs; 2x1 MB L2, 16/8/4 MB L3; 800/667 MT/s; mPGA604.

Figure 2.1.4: Intel's Xeon-based MP server platforms

2.1 Intel's multicore MP servers (4)

Page 19: Dezső Sima

Figure 2.1.5: Evolution of Intel's Xeon MP-based system architecture (until the appearance of Nehalem)

• Up to 2005: four SC Xeon MPs¹ share one FSB to the preceding NBs; typically HI 1.5 (266 MB/s).
• 2005 (Truland): four DC/SC Potomac²/Paxville MP³ processors on two FSBs to the 8500/8501 NB (Twin Castle); four serial XMB links (XMB: External Memory Bridge, ~7 GT/s) to DDR2 memory; 28 PCIe lanes + HI 1.5 (266 MB/s).

¹ Xeon MPs before Potomac
² First x86-64 MP processor
³ The 8500 also supports Cransfield (SC) and Tulsa (DC)

2.1 Intel's multicore MP servers (5)

Page 20: Dezső Sima

MP platforms, cores and chipsets:

Truland platform: chipsets 8500 (Twin Castle, 3/2005) and 8501 (4/2006); 2x FSB 667 MT/s resp. 800 MT/s; 4 x XMB (2 x DDR2 each); up to 32 GB.
Caneland platform: chipset 7300 (Clarksboro, 9/2007); 4x FSB 1066 MT/s; 4 x FB-DIMM channels (DDR2); up to 512 GB.

MP cores:
• Xeon MP (Potomac SC), 3/2005: P4-based, 90 nm/675 mtrs; 1 MB L2, 8/4 MB L3; 667 MT/s; mPGA604.
• Xeon 7000 (Paxville MP DC), 11/2005: P4-based, 90 nm/2x169 mtrs; 2x1 (2) MB L2; 800/667 MT/s; mPGA604.
• Xeon 7100 (Tulsa DC), 8/2006: P4-based, 65 nm/1328 mtrs; 2x1 MB L2, 16/8/4 MB L3; 800/667 MT/s; mPGA604.
• Xeon 7200/7300 (Tigerton DC/QC), 9/2007: Core2-based, 65 nm/2x291 mtrs; 2x4 resp. 2x(4/3/2) MB L2; 1066 MT/s; mPGA604.
• Xeon 7400 (Dunnington 6C), 9/2008: Core2-based, 45 nm/1900 mtrs; 9/6 MB L2, 16/12/8 MB L3; 1066 MT/s; mPGA604.

Figure 2.1.6: Intel's Xeon-based MP server platforms

2.1 Intel's multicore MP servers (6)

Page 21: Dezső Sima

Figure 2.1.7: Evolution of Intel's Xeon MP-based system architecture (until the appearance of Nehalem)

• Up to 2005: four SC Xeon MPs¹ share one FSB to the preceding NBs; typically HI 1.5 (266 MB/s).
• 2005 (Truland): four DC/SC Potomac²/Paxville MP³ processors on two FSBs to the 8500/8501 NB (Twin Castle); four XMB links (~7 GT/s) to DDR2; 28 PCIe lanes + HI 1.5 (266 MB/s).
• 2007 (Caneland): four 6C/QC/DC Dunnington/Tigerton processors on four independent FSBs to the 7300 NB (Clarksboro); FB-DIMM (DDR2) memory; 8 PCI-E lanes + ESI.

¹ Xeon MPs before Potomac
² First x86-64 MP processor
³ The 8500 also supports Cransfield (SC) and Tulsa (DC)

2.1 Intel's multicore MP servers (7)

Page 22: Dezső Sima

2.1 Intel’s multicore MP servers (8)

Figure 2.1.8: Nehalem’s key innovations concerning the system architecture [22]

Nehalem’s key innovations concerning the system architecture (11/2008)

Page 23: Dezső Sima

2.1 Intel’s multicore MP servers (9)

Figure 2.1.9: Nehalem’s key innovations concerning the system architecture [22]

Nehalem’s key innovations concerning the system architecture (11/2008)

Page 24: Dezső Sima

11/2008: Nehalem

Four Beckton 8C processors, fully interconnected by QPI links (QPI: QuickPath Interconnect), with additional QPI links to the I/O hub; each processor has its own 4x FB-DIMM memory channels.

Figure 2.1.10: Intel's Nehalem based MP server architecture

2.1 Intel's multicore MP servers (10)

Page 25: Dezső Sima

AMD’s multicore MP servers•

Page 26: Dezső Sima

2.1 AMD’s multicore MP servers (1)

AMD Direct Connect Architecture (2003)

• Integrated Memory Controller• Serial HyperTransport links

Figure 2.1.11: AMD’s Direct Connect Architecture [14]

Remark

• 3 HT 1.0 links at introduction (K8),• 4 HT 3.0 links with K10 (Barcelona)

Introduced in 2003 along with the x86-64 ISA extension

(Intel: 2008 with Nehalem)

Page 27: Dezső Sima

2.1 AMD’s multicore MP servers (2)

Use of available HyperTransport links [44]

• UPs: each link supports connections to I/O devices.
• DPs: two links support connections to I/O devices; any one of the three links may connect to another DP or MP processor.
• MPs: each link supports connections to I/O devices or other DP or MP processors.

Page 28: Dezső Sima

Figure 2.1.12: 2P and 4P server architectures based on AMD's Direct Connect Architecture [15], [16]

(Opteron sockets connected directly to one another by HT links; each socket has its own Reg. DDR2 memory; the remaining HT links attach PCI, PCI-X and PCI Express I/O bridges.)

2.1 AMD's multicore MP servers (3)

Page 29: Dezső Sima

Figure 2.1.13: Block diagram of Barcelona (K10) vs K8 [17]

2.1 AMD's multicore MP servers (4)

Page 30: Dezső Sima

Figure 2.1.14: Possible use of Barcelona’s four HT 3.0 links [39]

2.1 AMD’s multicore MP servers (5)

Page 31: Dezső Sima

Novel features of HT 3.0 links, such as
• higher speed, or
• splitting a 16-bit HT link into two 8-bit links,
can be utilized only with a new platform. Current platforms (2nd gen. Socket F with available chipsets) do not support HT 3.0 links [46].

2.1 AMD's multicore MP servers (6)

Page 32: Dezső Sima

Figure 2.1.15: AMD’s roadmap for server processors and platforms [19]

2.1 AMD’s multicore MP servers (7)

Page 33: Dezső Sima

2.2 Manycore processors

Page 34: Dezső Sima

2.2 Manycore processors

Figure 2.2.1: Main classes of multicore/manycore processors (same taxonomy as Figure 1.1)

Page 35: Dezső Sima

2.2 Manycore processors

Intel’s Larrabee•

Intel’s Tiled processor•

Page 36: Dezső Sima

Larrabee

Part of Intel's Tera-Scale Initiative.

• Brief history:
  Project started ~ 2005
  First unofficial public presentation: 03/2006 (withdrawn)
  First brief public presentation: 09/2007 (Otellini) [29]
  First official public presentations: in 2008 (e.g. at SIGGRAPH [27])
  Due in ~ 2009

• Performance (targeted): 2 TFLOPS

• Objectives:
  Not a single product but a base architecture for a number of different products.
  High end graphics processing, HPC.

2.2 Intel's Larrabee (1)

Page 37: Dezső Sima

Figure 2.2.2: Block diagram of the Larrabee [4]

Basic architecture

• Cores: In order, 4-way multithreaded x86 IA cores, augmented with SIMD-16 capability

• L2 cache: fully coherent

• Ring bus: 1024 bits wide

2.2 Intel’ Larrabee (2)

Page 38: Dezső Sima

Figure 2.2.5: Larrabee vs the Pentium [11]

Main extensions:

• 64-bit instructions
• 4-way multithreading (with 4 register sets)
• addition of a 16-wide (16x32-bit) VU
• increased L1 caches (32 KB vs 8 KB)
• access to its 256 KB local subset of a coherent L2 cache
• ring network to access the coherent L2 $ and allow inter-processor communication

2.2 Intel's Larrabee (3)

Page 39: Dezső Sima

Figure 2.2.3: Block diagram of the Vector Unit [5]

The Vector Unit

• VU scatter-gather instructions: load a VU vector register from 16 non-contiguous data locations anywhere in the on-die L1 cache without penalty, or store a VU register similarly. The L1 D$ thus acts as an extension of the register file.

• Numeric conversions: 8-bit and 16-bit integer and 16-bit FP data can be read from or written to the L1 $, with conversion to 32-bit integers, without penalty.

• Mask registers: have one bit per vector lane, to control which elements of a vector register or of memory data are read or written and which remain untouched.

2.2 Intel's Larrabee (4)

Page 40: Dezső Sima

Figure 2.2.4: Layout of the 16-wide vector ALU [5]

ALUs
• ALUs execute integer, SP and DP FP instructions.
• Multiply-add instructions are available.

2.2 Intel's Larrabee (5)

Page 41: Dezső Sima

Figure 2.2.6: System architecture of a Larrabee based 4-processor MP server [6]

2.2 Intel’ Larrabee (6)

CSI: Common System Interface (serial packet-based bus)

Page 42: Dezső Sima

2.2 Intel’ Larrabee (7)

Programming of Larrabee [5]

• Larrabee has x86 cores with an unspecified ISA extension.

Page 43: Dezső Sima

2.2 Intel’ Larrabee (8)

Figure 2.2.7: Intel’s ISA extensions [11]

AES: Advanced Encryption Standard

AVX: Advanced Vector Extension

FMA: FP fused multiply-add instr. supporting 256-bit/128-bit SIMD

Page 44: Dezső Sima

2.2 Intel’ Larrabee (9)

Programming of Larrabee [5]

• Larrabee has x86 cores with an unspecified ISA extension,

• the x86 cores allow Larrabee to be programmed like a usual x86 processor, using enhanced C/C++ compilers from MS, Intel, GCC etc.,

• this is a huge advantage compared to the competition (Nvidia, AMD/ATI).

Page 45: Dezső Sima

Intel’s Tiled processor•

Page 46: Dezső Sima

• First implementation of Intel’s Tera-Scale Initiative

Announced at IDF Fall 2006 9/2006Details at ISSCC 2007 2/2007Due to 2009/2010

• Aim: Tera-Scale research chip

- high bandwidth interconnect - energy management - programming manycore processors

(among more than 100 projects)

• Milestones of the development:

Tiled Processor

2.2 Intel’s Tiled processzor (1)

Remark

Based on ideas of the Raw processor (MIT)

Page 47: Dezső Sima

Figure 2.2.8: Basic structure of the Tiled Processor [7]

2.2 Intel’s Tiled processzor (2)

Page 48: Dezső Sima

Figure 2.2.9: Block diagram of a tile [7], [9]

(Each tile: 2 single precision FP multiply-add units (SP FP cores); VLIW microarchitecture?; debug support.)

2.2 Intel's Tiled processor (3)

Page 49: Dezső Sima

2.2 Intel’s Tiled processzor (4)

Figure 2.2.10: Die shot

of the Tiled Proc.[8]

Page 50: Dezső Sima

Figure 2.2.13: Ring based interconnect network topology [7]

2.2 Intel’s Tiled processzor (5)

Page 51: Dezső Sima

Figure 2.2.14: Mesh interconnect topology [7]

2.2 Intel’s Tiled processzor (6)

Page 52: Dezső Sima

Figure 2.2.11: Integration of dedicated hardware units (accelerators) [7]

2.2 Intel’s Tiled processzor (7)

Page 53: Dezső Sima

2.2 Intel’s Tiled processzor (8)

Figure 2.2.12: Sleeping inactivated cores [7]

Page 54: Dezső Sima

2.2 Intel’s Tiled processzor (9)

Figure 2.2.15: Performance figures of the Tiled Processor [7]

Matrix multiplication (Single Precision)

Peak performance: 4 SP FP/cycle; at 4 GHz: 1.6 TFLOPS

Page 55: Dezső Sima

3. Heterogeneous multicores

3.1 Master/slave architectures

3.2 Attached architectures

Page 56: Dezső Sima

3. Heterogeneous multicores

Figure 3.1: Main classes of multicore processors (same taxonomy as Figure 1.1)

Page 57: Dezső Sima

3.1 Master/slave architectures

• The Cell BE

Page 58: Dezső Sima

3.1 The Cell BE (1)

Computational model

Master/slave computational model with cacheless private memory spaces (LSs)

• The master/slave computational model allows tasks to be delegated to dedicated, task-efficient units, but needs efficient mechanisms for
  • transferring the tasks (programs and data) from the master to the slaves and the results back from the slaves to the master,
  • synchronization between the master and the slaves,
  • inter-core communication and synchronization.

• Cacheless private memory spaces allow efficient utilization of the die area for computations, but need an efficient LS-based microarchitecture for the slaves.

Page 59: Dezső Sima

Performance @ 3.2 GHz:

QS21 peak performance (SP FP): 409.6 GFLOPS (3.2 GHz x 2x8 SPEs x 2x4 SP FP/cycle)

3.1 The Cell BE (2)
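The peak figure above works out as follows: each SPE executes one 4-wide SP multiply-add per cycle (2 × 4 = 8 FLOPs/cycle), and a QS21 blade carries 2 Cell BEs with 8 SPEs each:

    3.2 GHz × 16 SPEs × 8 FLOPs/cycle = 409.6 GFLOPS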

Page 60: Dezső Sima

3.1 The Cell BE (3)

Figure 3.1.2: Cell roadmap from 2007 [22]

Page 61: Dezső Sima

3.2 Attached architectures

Page 62: Dezső Sima

Figure 3.2.1: Main classes of multicore/manycore processors (same taxonomy as Figure 1.1)

3.2 Attached architectures

Page 63: Dezső Sima

3.2 Attached architectures

• Introduction to GPGPUs
• The SIMT computational model (CM)
• Recent implementations of the SIMT CM
• Intel's future processors with attached architecture
• AMD's future processors with attached architecture

Page 64: Dezső Sima

• Introduction to GPGPUs

Page 65: Dezső Sima

Figure 3.2.2: Evolution of the microarchitecture of GPUs [23]

3.2 Introduction to GPGPUs (1)

Evolution of the microarchitecture of GPUs

Page 66: Dezső Sima

Figure 3.2.3: Simplified block diagram of AMD/ATI's RV770 [24]

160 cores x 5 execution units

3.2 Introduction to GPGPUs (2)

Page 67: Dezső Sima

Figure 3.2.4: Simplified structure of a core of the RV770 GPGPU [24]

Execution units (Stream Processing Units)

• 32-bit FP (ADD, MUL, MADD) • 64-bit FP • 32-bit FX . . .

3.2 Introduction to GPGPUs (3)

Page 68: Dezső Sima

3.2 Introduction to GPGPUs (4)

Figure 3.2.5: Peak SP FP performance figures: Nvidia's GPUs vs Intel's CPUs [25]

Page 69: Dezső Sima

3.2 Introduction to GPGPUs (5)

Figure 3.2.6: Bandwidth figures: Nvidia’s GPUs vs Intel’s CPUs [GB/s] [25]

Page 70: Dezső Sima

Figure 3.2.7: Utilization of the die area in CPUs vs GPUs [25]

3.2 Introduction to GPGPUs (6)

Page 71: Dezső Sima

Use of GPUs for HPC

Based on their FP32 computing capability and the large number of execution units available, GPUs with unified shader architecture are prospective candidates for speeding up HPC!

GPUs with unified shader architectures are also termed GPGPUs (General Purpose GPUs).

For HPC computations: the SIMT (Single Instruction Multiple Threads) computational model.

3.2 Introduction to GPGPUs (7)

Page 72: Dezső Sima

• The SIMT computational model (CM)

Page 73: Dezső Sima

Main alternatives of data parallel execution

Data parallel execution

• SIMD execution: one dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.

• SIMT execution: one/two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices).

Figure 3.2.8: Main alternatives of data parallel execution

3.2 The SIMT computational model (1)

Page 74: Dezső Sima

(At the programming level:)

• Scalar execution: domain of execution is single data elements.
• SIMD execution: domain of execution is elements of vectors.
• SIMT execution: domain of execution is elements of matrices.

Figure 3.2.9: Scope of the data parallel execution vs scalar execution (at the programming level)

Remarks

1. SIMT execution is also termed SPMD (Single-Program Multiple-Data) execution (Nvidia).
2. At the processor level, two dimensional domains of execution can be mapped to any set of cores (e.g. to a line of cores).

3.2 The SIMT computational model (2)
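To make the SIMT model concrete, here is a minimal CUDA sketch (one widely used SIMT implementation; the kernel name and block sizes are illustrative assumptions, not from the slides). Every element of a two-dimensional execution domain gets its own thread, and all threads execute the same instruction stream:

    #include <cuda_runtime.h>

    // Each thread processes one element of a w x h matrix (2-D domain).
    __global__ void scaleAdd2D(float *a, const float *b, float s, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < w && y < h) {
            int i = y * w + x;      // row-major flattening of the 2-D domain
            a[i] += s * b[i];       // the same operation on every element
        }
    }

    // Host side: the 2-D domain is mapped onto a grid of thread blocks.
    void launchScaleAdd2D(float *d_a, const float *d_b, float s, int w, int h)
    {
        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
        scaleAdd2D<<<grid, block>>>(d_a, d_b, s, w, h);
    }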

Page 75: Dezső Sima

Main alternatives of data parallel execution

Data parallel execution

• SIMD execution: one dimensional data parallel execution on all elements of given FX/FP input vectors. E.g. 2nd and 3rd generation superscalars.

• SIMT execution: one/two dimensional data parallel execution on all elements of given FX/FP input arrays (vectors/matrices); it is massively multithreaded, and provides data dependent flow control as well as barrier synchronization. E.g. GPGPUs, data parallel accelerators.

Figure 3.2.10: Main alternatives of data parallel execution

3.2 The SIMT computational model (3)
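The two SIMT features named above, data dependent flow control and barrier synchronization, can be sketched in CUDA as follows (a standard block-level maximum reduction; the names and the 256-thread block size are assumptions for illustration):

    #include <float.h>
    #include <cuda_runtime.h>

    // Block-level maximum reduction; assumes blockDim.x == 256.
    __global__ void blockMax(const float *in, float *out, int n)
    {
        __shared__ float buf[256];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        buf[tid] = (i < n) ? in[i] : -FLT_MAX;  // data dependent control flow
        __syncthreads();                        // barrier synchronization

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)                   // divergent, data dependent branch
                buf[tid] = fmaxf(buf[tid], buf[tid + stride]);
            __syncthreads();                    // barrier after every reduction step
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];           // one partial maximum per block
    }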

Page 76: Dezső Sima

• Recent implementations of the SIMT CM

Page 77: Dezső Sima

Basic implementation alternatives of the SIMT execution

• GPGPUs: programmable GPUs with appropriate programming environments; have display outputs. E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx, HD 48xx lines.

• Data parallel accelerators: dedicated units supporting data parallel execution, with appropriate programming environment; have no display outputs and larger memories than GPGPUs. E.g. Nvidia's Tesla lines, AMD's FireStream lines.

Figure 3.2.12: Basic implementation alternatives of the SIMT execution

3.2 Recent implementations of the SIMT CM (1)

Page 78: Dezső Sima

GPGPUs

Nvidia's line: 90 nm G80 → (shrink) 65 nm G92 → (enhanced arch.) G200
AMD/ATI's line: 80 nm R600 → (shrink) 55 nm RV670 → (enhanced arch.) RV770

Figure 3.2.13: GPGPU families of Nvidia and AMD/ATI

3.2 Recent implementations of the SIMT CM (2)

Page 79: Dezső Sima

Figure 3.2.14: Overview of GPGPUs

Nvidia:
• G80 (11/06, 90 nm/681 mtrs): 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit)
• G92 (10/07, 65 nm/754 mtrs): 8800 GT (112 ALUs, 256-bit)
• GT200 (6/08, 65 nm/1400 mtrs): GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
• CUDA: Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08)

AMD/ATI:
• R500 (11/05): 48 ALUs (Xbox)
• R600 (5/07, 80 nm/681 mtrs): HD 2900XT (320 ALUs, 512-bit)
• RV670 (11/07, 55 nm/666 mtrs): HD 3850 (320 ALUs, 256-bit), HD 3870 (320 ALUs, 256-bit)
• RV770 (6/08, 55 nm/956 mtrs): HD 4850 (800 ALUs, 256-bit), HD 4870 (800 ALUs, 256-bit)
• Brook+ (11/07, 3870 support), RapidMind (6/08)

3.2 Recent implementations of the SIMT CM (3)

Page 80: Dezső Sima

Implementation alternatives of data parallel accelerators

Data parallel accelerators

• On-card implementation (recent implementations): e.g. GPU cards; Nvidia's Tesla and AMD/ATI's FireStream accelerator families.

Figure 3.2.15: Implementation alternatives of data parallel accelerators

3.2 Recent implementations of the SIMT CM (4)

Page 81: Dezső Sima

NVidia Tesla:
• C870 card (6/07): G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
• D870 desktop (6/07): G80-based, 2 x C870 incl., 3 GB GDDR3, 1.037 TFLOPS
• S870 1U server (6/07): G80-based, 4 x C870 incl., 6 GB GDDR3, 2.074 TFLOPS
• C1060 card (6/08): GT200-based, 4 GB GDDR3, 0.936 TFLOPS
• S1070 1U server (6/08): GT200-based, 4 x C1060, 16 GB GDDR3, 3.744 TFLOPS
• CUDA: Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)

Figure 3.2.16: Overview of Nvidia's Tesla family

3.2 Recent implementations of the SIMT CM (5)

Page 82: Dezső Sima

AMD FireStream:
• 9170 card (announced 11/07, shipped 6/08): RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
• 9250 card (announced 6/08, shipped 10/08): RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64
• Stream Computing SDK Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); RapidMind support

Figure 3.2.17: Overview of AMD/ATI's FireStream family

3.2 Recent implementations of the SIMT CM (6)

Page 83: Dezső Sima

Implementation alternatives of data parallel accelerators

Data parallel accelerators

• On-card implementation (recent implementations): e.g. GPU cards; Nvidia's Tesla and AMD/ATI's FireStream accelerator families.

• On-die integration (future implementations): Intel's Heavendahl, AMD's Fusion integration technology.

Trend: toward on-die integration.

Figure 3.2.15: Implementation alternatives of data parallel accelerators

3.2 Recent implementations of the SIMT CM (4)

Page 84: Dezső Sima

3.2 Recent implementations of the SIMT CM (7)

Figure 3.2.18: Expected evolution of attached GPGPUs [42]

Integration to the chip

Page 85: Dezső Sima

Intel’s future processors with attached architecture•

Page 86: Dezső Sima

3.2 Intel’s future processors with attached architecture (1)

Figure 3.2.19: Intel’s desktop roadmap [26]

Pentium 4 → Core 2 → Core i7 (Nehalem)

Q4/08

Page 87: Dezső Sima

3.2 Intel’s future processors with attached architecture (2)

Figure 3.2.20: A part of Intel’s desktop roadmap [26]

Q4/08, Q1/09, Q2/09, Q3/09

(45 nm)

Page 88: Dezső Sima

AMD’s future processors with attached architecture•

Page 89: Dezső Sima

3.2 AMD’s future processors with attached architecture (1)

Figure 3.2.21: AMD’s view about the major phases of processor evolution [27]

Page 90: Dezső Sima

6/2006 The Torrenza initiative (2006 Technology Analyst Day)

• Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40].

3.2 AMD’s future processors with attached architecture (4)

Page 91: Dezső Sima

Figure 3.2.22: Introduction of the Torrenza platform level integration technique [40]

(cache coherent HT)

3.2 AMD’s future processors with attached architecture (3)

Page 92: Dezső Sima

6/2006 The Torrenza initiative (2006 Technology Analyst Day)

• Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40].

10/2006 Acquisition of ATI

10/2006 The Fusion initiative

• Silicon level integration of accelerators into AMD processors (first Fusion processors due by the end of 2008 / early 2009) [41]

3/2007 “Integration” of the Torrenza and the Fusion initiatives into a continuum of accelerated computing solutions

3.2 AMD’s future processors with attached architecture (4)

Page 93: Dezső Sima

Figure 3.2.23: The Torrenza platform and the Fusion integration technology as a continuum for accelerated computing solutions [29]

Remark: It is based on an earlier Alienware presentation from 6/2006 [38].

3.2 AMD’s future processors with attached architecture (5)

Page 94: Dezső Sima

Implementation of Fusion processors

• In 2007/2008 AMD made a number of confusing announcements and withdrawals [31]-[35].

• According to the latest announcements (11/2008), AMD plans to introduce 32 nm Fusion processors only in 2011 [37].

3.2 AMD’s future processors with attached architecture (6)

Page 95: Dezső Sima

Figure 3.2.24: AMD’ 2008 roadmap for client processors [37]

3.2 AMD’s future processors with attached architecture (7)

Page 96: Dezső Sima

4. Outlook

Page 97: Dezső Sima

4. Outlook (1)

Outlook

The future of heterogeneous multicores

Heterogeneous multicores:
• Master/slave architectures: 1(Ma):M(S) → 2(Ma):M(S) → M(Ma):M(S)
• Add-on architectures: 1(CPU):1(D) → M(CPU):1(D) → M(CPU):M(D)

With M(Ma) = M(CPU) and M(S) corresponding to M(D), both lines converge toward M(CPU):M(D).

Legend: Ma: Master; S: Slave; D: Dedicated (like GPU); H: Homogeneous; M: Many

Figure 4.1: Expected evolution of heterogeneous multicore processors

Page 98: Dezső Sima

4. Outlook (2)

Heterogeneous multicores: M(CPU):M(D)

Page 99: Dezső Sima

The future of homogeneous multicores

Larrabee

Tiled processor

In fact: both are of the same type: M(CPU):M(D)

4. Outlook (3)

Figure 4.2: Simplified block diagrams of Larrabee and the Tiled processor [4], [7]

Page 100: Dezső Sima

4. Outlook (4)

The main road of processor evolution: M(CPU):M(D)

Page 101: Dezső Sima

Thank you for your attention!

Page 102: Dezső Sima

5. References

[1]: Bhandarkar D., "The Dawn of a New Era," 11. EMEA, May 2006, Budapest

[2]: Wall D. W., "Limits of ILP," WRL TN-15, Dec. 1990, DEC, http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-15.html

[3]: Loktu A., "Itanium 2 for Enterprise Computing," http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps

[4]: Stokes J., "Larrabee: Intel's biggest leap ahead since the Pentium Pro," Aug. 4, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html

[5]: Shimpi A. L. & Wilson D., "Intel's Larrabee Architecture Disclosure: A Calculated First Move," Anandtech, Aug. 4, 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2

[6]: Timm J.-F., "Larrabee: Fakten zur Intel Highend-Grafikkarte," Computer Base, June 2, 2007, http://www.computerbase.de/news/hardware/grafikkarten/2007/juni/larrabee_fakten_intel_highend-grafikkarte/

[7]: Shrout R., "Intel's 80 Core Terascale Chip Explored: 4 GHz Clocks and more," PC Perspective, Feb. 11, 2007, http://www.pcper.com/article.php?aid=363

[8]: Goto H., "Intel's Manycore CPUs," PC Watch, June 11, 2007, http://pc.watch.impress.co.jp/docs/2007/0611/kaigai364.htm

Page 103: Dezső Sima

[9]: Hoskote Y. et al., "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, Vol. 27, No. 5, Sept./Oct. 2007, pp. 51-61

[10]: Taylor M. et al., "The Raw Processor," Hot Chips, Aug. 13, 2001, http://www.hotchips.org/archives/hc13/3_Tue/22mit.pdf

[11]: Goto H., "Larrabee architecture can be integrated into CPU," PC Watch, Oct. 6, 2008, http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm

[12]: Stokes J., "Larrabee: Intel's biggest leap since the Pentium Pro," Ars Technica, Aug. 4, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html

[13]: Singhal R., "Next Generation Intel Microarchitecture (Nehalem) Family: Architecture Insight and Power Management," IDF Taipei, Oct. 2008, http://intel.wingateweb.com/taiwan08/published/sessions/TPTS001/FA08%20IDF-Taipei_TPTS001_100.pdf

[14]: AMD Opteron Processor for Servers and Workstations, http://amd.com.cn/CHCN/Processors/ProductInformation/0,,30_118_8826_8832,00-1.html

[15]: AMD Opteron Processor with Direct Connect Architecture, 2P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/2P_Power_PID_41497.pdf

[16]: AMD Opteron Processor with Direct Connect Architecture, 4P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/4P_Power_PID_41498.pdf

Page 104: Dezső Sima

[17]: Kanter D., "Inside Barcelona: AMD's Next Generation," Real World Tech, May 16, 2007, http://www.realworldtech.com/page.cfm?ArticleID=RWT051607033728

[18]: Kanter D., "AMD's K8L and 4x4 Preview," Real World Tech, June 2, 2006, http://www.realworldtech.com/page.cfm?ArticleID=RWT060206035626&p=1

[19]: Enderle R., "AMD Shanghai: We are back!" TG Daily, Nov. 13, 2008, http://www.tgdaily.com/content/view/40176/128/

[20]: Gschwind M., "Chip Multiprocessing and the Cell BE," ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf

[21]: Wright C., Henning P., Bergen B., "Roadrunner Tutorial – An Introduction to Roadrunner and the Cell Processor," Feb. 7, 2008, http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Roadrunner-tutorial-session-1-web1.pdf

[22]: Hofstee H. P., "Industry Trends in Microprocessor Design," IBM, Oct. 4, 2007, http://lanl.gov/orgs/hpc/roadrunner/rrinfo/RR%20webPDFs/Cell_Hofstee_Non_Conf.pdf

[23]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[24]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

[25]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia

Page 105: Dezső Sima

[26]: Goto H., "Intel Desktop CPU Roadmap," 2008, http://pc.watch.impress.co.jp/docs/2008/0326/kaigai02.pdf

[27]: The Industry-Changing Impact of Accelerated Computing – Fusion White Paper, AMD, 2008, http://www.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf

[28]: AMD Announces Initiatives To Elevate AMD64 As Platform For System- And Industry-Wide Innovation, AMD, June 1, 2006, http://www.amd.com/us-en/Weblets/0,,7832_8366_5730~109409,00.html

[29]: Metal G., "AMD Torrenza and Fusion together," Metalghost, March 22, 2007, http://www.metalghost.ro/index.php?view=article&catid=30%3Ahardware&id=233%3Aamd-torrenza-and-fusion-together&option=com_content


[30]: Hester P., "Multi-Core and Beyond: Evolving the x86 Architecture," Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

[31]: Hester P., 2007 Technology Analyst Day, AMD, July 26, 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007_AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf

[32]: Rivas M., 2007 Financial Analyst Day, AMD, Dec. 13, 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007_AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf

[33]: Smalley T., "Shrike is AMD's First Fusion Platform," Trusted Reviews, June 9, 2008, http://www.trustedreviews.com/notebooks/news/2008/06/09/Shrike-Is-AMDs-First-Fusion-Platform/p1

Page 106: Dezső Sima

[34]: Hruska J., "AMD Fusion now pushed back to 2011," Ars Technica, Nov. 14, 2008, http://arstechnica.com/news.ars/post/20081114-amd-fusion-now-pushed-back-to-2011.html

[35]: Gruener W., "AMD delays Fusion processor to 2011," TG Daily, Nov. 13, 2008, http://www.tgdaily.com/content/view/40186/135

[36]: Wilson D., "AMD Analyst Day Platform Announcements," Anandtech, June 2, 2006, http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2768&p=2

[37]: Allen R., Financial Analyst Day, AMD, Nov. 13, 2008, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/RandyAllenAMD2008AnalystDay11-13-2008.pdf

[38]: Gonzales N., 2006 Technology Analyst Day, Alienware, June 1, 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHesterAMDAnalystDayV2.pdf

[39]: Hester P., 2006 Technology Analyst Day, AMD, June 1, 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHesterAMDAnalystDayV2.pdf

[40]: Seyer M., 2006 Technology Analyst Day, AMD, June 1, 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MartySeyerAMDAnalystWebv3.pdf

[41]: AMD Completes ATI Acquisition and Creates Processing Powerhouse, Oct. 25, 2006, http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~113741,00.html

Page 107: Dezső Sima

[42]: Stokes J., “A closer look at AMD’s CPU/GPU Fusion,” Ars Technica, Nov. 19. 2006, http://arstechnica.com/news.ars/post/20061119-8250.html

Page 108: Dezső Sima

Figure: AMD's Tick-Tock model and the related Opteron MP servers

• 130 nm: 840-850 (Sledgehammer): 1x1 C, 1 MB L2
• 90 nm: 842-856 (Athens): 1x1 C, 1 MB L2; 865-890 (Egypt): 1x2 C, 2 MB L2/2C; 82xx (Santa Rosa): 1x2 C, 2 MB L2/2C
• 65 nm: 8347-56 (Barcelona): 1x4 C, 1/2 MB L2/C, 2 MB L3
• 45 nm: 8378-84 (Shanghai): 1x4 C, 1/2 MB L2/C, 6 MB L3

Page 109: Dezső Sima

Figure: Larrabee’s Software stack [12]

Page 110: Dezső Sima

Figure: Layout of MIT’s Raw Processor [10]

Page 111: Dezső Sima

3.1 The Cell BE (1)

Cell BE

• Joint development of Sony, IBM and Toshiba
• Aim: games, multimedia, and in addition HPC

Summer 2000: basic decisions concerning the architecture
11/2006: Playstation 3 (PS3)

QS2x Blade Server family:
02/2006 Cell Blade QS20
08/2007 Cell Blade QS21
05/2008 Cell Blade QS22

Rumors (9/2008): 2011? Playstation 4 (competition: XBox3), with
• 2x PS3 performance
• 12 cores / 45 nm
• GDDR3/DDR3 (instead of XDR)

Page 112: Dezső Sima

Figure 3.1.1: Block diagram of the Cell BE [20]

EIB: Element Interface Bus
SPE: Synergistic Processing Element; SPU: Synergistic Processor Unit; SXU: Synergistic Execution Unit; LS: Local Store of 256 KB; SMF: Synergistic Memory Flow Unit
PPE: Power Processing Element; PPU: Power Processing Unit; PXU: POWER Execution Unit
MIC: Memory Interface Controller; BIC: Bus Interface Controller
XDR: Rambus DRAM

3.1 The Cell BE (3)

Page 113: Dezső Sima

Figure: Layout of the EIB [21]

3.1 Master/slave multicore processors - the Cell (4)

Page 114: Dezső Sima

Figure: Concurrent data transfers over the EIB [21]

3.1 Master/slave multicore processors - the Cell (5)

Page 115: Dezső Sima

Massive multithreading

Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain.

Figure: Threads allocated to the elements of an execution domain

Same instructions for all data elements

3.2 Attached multicore processors (10)
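A short CUDA launch sketch of this idea (the kernel name is hypothetical): one thread is created per data element, far more threads than execution units, so the hardware can switch among resident thread contexts to hide memory latency:

    // One thread per data element: massively oversubscribe the ALUs so that
    // ready threads can run while other threads wait on memory.
    int n = 1 << 24;                                  // e.g. 16M data elements
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    processElement<<<blocks, threadsPerBlock>>>(d_data, n);  // hypothetical kernel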

Page 116: Dezső Sima

Figure 3.2.11: Per-thread contexts needed per ALU for fast context switch

(A SIMT core: a fetch/decode unit, an array of ALUs, and a large register file (RF) holding many per-thread contexts (CTX); a context switch selects another resident context.)

3.2 The SIMT computational model (4)

Page 117: Dezső Sima

Alienware's early vision on the integration of CPUs and GPUs (6/2006)

Figure: Early vision on the integration of CPUs and GPUs (presented by Alienware, a performance PC maker) [38]

Page 118: Dezső Sima

Figure: AMD’s view about the evolution of mainstream computing [30]

5.2 AMD/ATI’s GPGPU line (1)

Page 119: Dezső Sima

Figure: AMD’s planned 32 nm mobile/mainstream Falcon Fusion family [31]

(32 nm brand new core)

AMD’s plans to implement Fusion class processors

The 32 nm Falcon processor with the Bulldozer CPU core (7/2007: Technology Analyst Day)

(UVD: Unified Video Decoder)

Page 120: Dezső Sima

The 45 nm Swift processor family (12/2007: Financial Analyst Day)

Figure: AMD’s planned 45 nm Swift Fusion processor family [32]

(K10)

Page 121: Dezső Sima

Figure: AMD’s planned 45 nm Shrike mobile platform with the Swift processor [33]

The 45 nm Shrike platform with the Swift processor (6/2008)

Page 122: Dezső Sima

Nov. 2008 (Financial Analyst Day):

AMD cancelled both the 45 nm Shrike platform and the Swift processor [34], [35].

Reason: the 45 nm implementation would bring only modest improvements in performance, power and cost.

Recent plan: 32 nm technology is awaited to implement the planned CPU-GPU integration; it is due in 2011.

Page 123: Dezső Sima


Page 124: Dezső Sima

Large-Scale Systems Modeling: Networks of QS2x Blades

Peter Altevogt, Tibor Kiss (IBM STG Boeblingen); Wolfgang Denzel (IBM Research Zurich); Miklos Kozlovszky (Budapest Tech)

Page 125: Dezső Sima

Research objectives

Provide simulation infrastructure for:
• detailed modification analysis of IO subsystems, networks and workloads;
• limited modification analysis of processor cores: as workload generators they are treated as black (grey) boxes;
• workload characterization based on low-level processor core simulations or measurements.

Subtasks: high-level simulation design of networks of QS2x blades; system representation; workload representation; implementation.

Page 126: Dezső Sima

Modeled Components

• Workload: as generated by the processor cores.
• System components:
  - processor cores* as workload generators, for executing computational delays;
  - memory and IO subsystems: bus interfaces, southbridges, network adapter;
  - network: switches, router, ...

* without bus interfaces

Page 127: Dezső Sima

General Setup

(Blades connected through a network; arrows: requests.)

Page 128: Dezső Sima

High-Level Simulation Design

• Blade system: hardware view

(Two processor core groups (Cores0/Cores1), each attached via an EIB (EIB0/EIB1) to its memory (mem0/mem1) and southbridge (SB0/SB1); a network adapter connects the buses to/from the network.)

Processor cores:
- generating requests against the IO subsystem / network,
- executing computational requests in the form of delays.

Page 129: Dezső Sima

High-Level Simulation Design (2)

• Blade system: detailed simulation view

(An adaptive workload generator drives the processor cores (2 chips in the case of blades); the IO subsystem (mem0/EIB0/SB0, mem1/EIB1/SB1) connects to the network.)

Workload generator: generates requests against the IO subsystem / network.
Processor cores: execute computational requests in the form of delays.

Page 130: Dezső Sima

Figure: Overview of the implementation of Intel's Tick-Tock model for MP servers [24]
(Same content as Figure 2.1.2 above.)

2. Intel's MP servers (5)

Page 131: Dezső Sima

Figure 2.2: Evolution of Intel's MP server chipsets

• 2005: four SC Potomac CPUs on a shared FSB to the preceding NB; DDR/DDR2 memory.
• 2006: four DC Paxville MP/Tulsa CPUs on two FSBs to the 8500 NB (Twin Castle); 4 XMBs; DDR/DDR2 memory.
• 2007: four DC/QC Tigerton CPUs on four FSBs to the 7300 NB (Clarksboro); FBDIMM/DDR2 memory.

2.1 – Intel's multicore MP server processors (2)

Page 132: Dezső Sima

Figure 2.3: Four-socket 7300 (Caneland) motherboard (Supermicro X7QC3)

(Four Xeon 7200 DC / 7300 QC (Tigerton) sockets on the 7300 NB; FB-DIMM DDR2 memory, up to 192 GB; ESB2 SB.)

2.1 – Intel's multicore MP server processors (3)

Page 133: Dezső Sima

UP: Opteron 100/1000; DP: Opteron 200/2000; MP: Opteron 800/8000

(Dual-core Opteron: two CPUs with 1 MB L2 cache each, connected through the System Request Interface and the Crossbar Switch to the integrated memory controller (2 x 72-bit channels) and three HyperTransport links; 800/8000: 3 coherent links, 200/2000: 1 coherent link.)

Figure 2.4: Basic structure of the Opteron family

2.1 – AMD's multicore MP server processors (1)

Page 134: Dezső Sima

Figure 2.5: AMD's 4P/8P Direct Connect server architecture

2.1 – AMD's multicore MP server processors (2)

Page 135: Dezső Sima

2.1 – AMD's multicore MP server processors (3)

Figure 2.6: System architecture of Intel's Nehalem processor family (Nov. 17, 2008)

On-die Memory Controller

Page 136: Dezső Sima

Unique features of the Cell BE

a) Heterogeneous MCP rather than a symmetrical MCP (as usual implementations)

Contrasting the PPE and the SPEs:

The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.

The SPEs
• are optimized to run compute-intensive SIMD apps,
• operate usually under the control of the PPE,
• run their individual apps (threads),
• have full access to a coherent shared memory, including the memory-mapped I/O space,
• can be programmed in C/C++.

As a result, the PPE is more adept at control-intensive tasks and quicker at task switching, while the SPEs are more adept at compute-intensive tasks and slower at task switching.

Overview of the Cell BE (4)

Page 137: Dezső Sima

b) The SPEs have an unusual storage architecture, as

• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
  o they fetch instructions from their private LS, and
  o their load/store instructions access their LS rather than the main store;
• SPEs access main memory (the effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS; DMA commands can be batched (up to 16 commands);
• the LS has no associated cache.

Overview of the Cell BE (5)

Page 138: Dezső Sima

Figure: Die shot and floorplan of the Cell BE (221 mm2, 234 mtrs) [15]

3.1 Master/slave multicore processors - the Cell (3)

Page 139: Dezső Sima

4. Outlook (1)

Intel's Nehalem (i7) family (Nov. 17, 2008)

Processor (technology): aim; cores; memory channels
• Bloomfield (45 nm): desktop; 4 cores; triple channel DDR3
• Beckton (45 nm): MP server; 8 cores; quad channel FB-DIMM (2)
• Westmere (32 nm): desktop: 4/6 cores, triple channel DDR3; DP server: 4/6 cores, quad channel DDR3

Main features
• Integrated memory controller
• 4/6/8 cores
• Dual-threaded
• FSB replaced by a serial bus (QuickPath Interconnect)

Page 140: Dezső Sima
Page 141: Dezső Sima
Page 142: Dezső Sima

http://pc.watch.impress.co.jp/docs/2007/0122/kaigai330.htm

Page 143: Dezső Sima
Page 144: Dezső Sima

http://translate.google.com/translate?hl=en&sl=ja&u=http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm&sa=X&oi=translate&resnum=3&ct=result&prev=/search%3Fq%3Damd%2Bfusion%2Bpcwatch%26hl%3Den%26sa%3DG

Page 145: Dezső Sima
Page 146: Dezső Sima

HTX slots will be standard interfaces connected directly to an AMD CPU's HyperTransport link. If both of these links are coherent, the device and the CPU will be able to communicate directly with each other with cache coherency. Because of this, latency can be reduced greatly over other buses as well, enabling hardware vendors to begin to create true coprocessor technology once again.

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2768&p=2

Page 147: Dezső Sima

http://www.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf

Fusion announced in Oct. 2006, due in 1H 2008.


Page 148: Dezső Sima

http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007_AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf

32 nm brand new core

Page 149: Dezső Sima

http://download.amd.com/Corporate/MarioRivasDec2007AMDAnalystDay.pdf

Page 150: Dezső Sima

Fusion constraints: die size, dissipation, memory bandwidth.

Phil Hester: Fusion will never go to the high end, due to dissipation.

AMD's CPU die sizes: high-end desktop CPUs about 200 mm2, mainstream CPUs 120-150 mm2, value CPUs around 100 mm2 or less. If a Fusion die spares half its area for the GPU core, the size of the GPU core is constrained accordingly. GPU die sizes: high-end GPUs more than 300 mm2, midrange GPUs 120-150 mm2, value GPUs around 100 mm2 or less. Therefore the GPU core that can be integrated into a 45 nm Fusion generation will have about the size of a 65 nm generation lower-end discrete GPU.

CPUs use commodity DRAM; GPUs use graphics DRAM (GDDR3/4/5). Memory data path: 8 B (CPU) vs 32/64 B (GPU).

http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm

Coexistence of Torrenza and Fusion (high end: Torrenza).

Page 151: Dezső Sima

(AMD appears to be considering this seriously. Conceptual diagram of the Fusion processor.)

http://translate.google.com/translate?hl=en&sl=ja&u=http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm&sa=X&oi=translate&resnum=3&ct=result&prev=/search%3Fq%3Damd%2Bfusion%2Bpcwatch%26hl%3Den%26sa%3DG

Page 152: Dezső Sima

http://arstechnica.com/news.ars/post/20081114-amd-fusion-now-pushed-back-to-2011.html

Nov. 14, 2008

Page 153: Dezső Sima

http://www.techpowerup.com/reviews/AMD/Analysts_Day

Page 154: Dezső Sima

(PC Watch, 07.01.31)

http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm

Roughly 1 GB/s of memory bandwidth is needed per 10 GFLOPS of compute performance.
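Equivalently, about 0.1 byte/s of memory bandwidth per FLOP/s (B ≈ 0.1 · F), so a 1 TFLOPS device would need on the order of 100 GB/s of memory bandwidth.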

Page 155: Dezső Sima

The 45 nm Fusion processor, initially promised as a 2009 chip and then moved into 2010, is essentially cancelled.

The chip, which was described to combine a CPU and GPU under one hood in the “Shrike” core, was found to only bring modest improvements over today’s platforms in terms of power efficiency, cost and performance. Instead, the company will introduce Fusion (which actually isn’t called Fusion anymore) as a 2011 model in a 32 nm version with Llano core. Allen said that 32 nm would be the right technology to introduce the product. Llano will feature four cores, 4 MB of cache, DDR3 memory support and an integrated GPU.

http://www.tgdaily.com/content/view/40186/135/

Nov 13 2008

Page 156: Dezső Sima

Possible use of surplus transistors

• Wider processor width: pipeline (1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)

Doubling transistor counts ~ every two years (Moore's rule)

1. The inevitability of multicore processors (3)

Page 157: Dezső Sima

Figure: Overview of Intel’s Tick-Tock model and the related MP servers [24]

TICK Pentium 4 /Prescott)

TOCK Pentium 4 /Irwindale) 90nm

11/2005: First DC MP Xeon

1Q/2009

7100 (Tulsa)

7300 (Tigerton QC)

7400 (Dunnington)

7xxx (Beckton)

(Potomac)

7000 (Paxville MP)

(Cransfield)

7200 (Tigerton DC)

2x1 C 1 MB L2/C 16 MB L3

2x2 C 4 MB L2/C

1x6 C 3 MB L2/2C 16 MB L3

1x8 C ¼ MB L2/C 24 MB L3

1x1 C 8 MB L2

2x1 C ½ MB L2/C

1x1 C 1 MB L2

1x2 C 4 MB L2/C

3/2005: First 64-bit MP Xeons