Dezső Sima © Dezső Sima 2011 (v1.0, Last updated 04/15/2011) GPGPUs/DPAs April 2011


1. Introduction (1)

Aim

Brief introduction and overview.

Content

1. Introduction

2. Basics of the SIMT execution

3. Overview of GPGPUs

4. Overview of data parallel accelerators

5. References

1. Introduction

Representation of objects by triangles

Figure labels: Vertex, Edge, Surface

Vertices

• have three spatial coordinates,

• carry supplementary information necessary to render the object, such as

  • color

  • texture

  • reflectance properties

  • etc.

1. Introduction (2)

Main types of shaders in GPUs

• Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen.

• Pixel shaders (fragment shaders): calculate the color of the pixels.

• Geometry shaders: can add or remove vertices from a mesh.

1. Introduction (3)

DirectX version Pixel SM Vertex SM Supporting OS

8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000

8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP/ Windows Server 2003

9.0 (12/2002) 2.0 2.0

9.0a (3/2003) 2_A, 2_B 2.x

9.0c (8/2004) 3.0 3.0 Windows XP SP2

10.0 (11/2006) 4.0 4.0 Windows Vista

10.1 (2/2008) 4.1 4.1 Windows Vista SP1/ Windows Server 2008

11 (in development) 5.0 5.0

Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

1. Introduction (4)

DirectX: Microsoft’s API set for MM/3D

Convergence of important features of the vertex and pixel shader models

Subsequent shader models typically introduce a number of new or enhanced features.

Shader model 2 [19]

• Different precision requirements

Vertex shader: FP32 (coordinates)

Pixel shader: FX24 (3 colors x 8)

• Different instructions

• Different resources (e.g. registers)

Differences between the vertex and pixel shader models in subsequent shader models concerning precision requirements, instruction sets and programming resources.

Shader model 3 [19]

• Unified precision requirements for both shaders (FP32) with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code

• Different instructions

• Different resources (e.g. registers)


Shader model 4 (introduced with DirectX10) [20]

• Unified precision requirements for both shaders (FP32) with the possibility to use new data formats.

• Unified instruction set

• Unified resources (e.g. temporary and constant registers)

Shader architectures of GPUs prior to SM4

GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features.

Drawback of having separate units for vertex and pixel shading

• Inefficiency of the hardware implementation (vertex shaders and pixel shaders often have complementary load patterns [21]).


Unified shader model (introduced in the SM 4.0 of DirectX 10.0)

The same (programmable) processor can be used to implement all shaders:

• the vertex shader,

• the pixel shader and

• the geometry shader (new feature of SM 4).

Unified, programmable shader architecture

1. Introduction (5)

Figure: Principle of the unified shader architecture [22]

1. Introduction (6)

Based on its FP32 computing capability and the large number of FP units available, the unified shader is a prospective candidate for speeding up HPC.

GPUs with unified shader architectures are also termed

• GPGPUs (General Purpose GPUs) or

• cGPUs (computational GPUs).

1. Introduction (8)

Peak FP32/FP64 performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [43]

1. Introduction (9)

Evolution of the FP32 performance of GPGPUs [44]

Evolution of the bandwidth of Nvidia's GPUs vs Intel's P4 and Core2 processors [43]

Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]

Background slides to Introduction

Figure: Peak SP FP performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [11]

Figure: Bandwidth values of Nvidia's GPUs vs Intel's P4 and Core2 processors [11]

2. Basics of the SIMT execution

Main alternatives of data parallel execution

SIMD execution

• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.

• Needs an FX/FP SIMD extension of the ISA.

• E.g. 2nd and 3rd generation superscalars.

SIMT execution

• Two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices),

• is massively multithreaded, and provides

  • data dependent flow control as well as

  • barrier synchronization.

• Needs an FX/FP SIMT extension of the ISA and the API.

• E.g. GPGPUs, data parallel accelerators.

Figure: Main alternatives of data parallel execution

2. Basics of the SIMT execution (1)

Scalar, SIMD and SIMT execution

• Scalar execution. Domain of execution: single data elements.

• SIMD execution. Domain of execution: elements of vectors.

• SIMT execution. Domain of execution: elements of matrices (at the programming level).

Figure: Domains of execution in case of scalar, SIMD and SIMT execution
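The three domains of execution can be illustrated with a small host-side sketch. It is only an analogy in plain Python, not GPU code; the function names (`scalar_add`, `simd_add`, `simt_add`, `kernel`) are hypothetical and chosen for this illustration.

```python
def scalar_add(a, b):
    """Scalar execution: one operation on a single pair of data elements."""
    return a + b


def simd_add(va, vb):
    """SIMD-style execution: the same operation on all elements of two vectors."""
    return [x + y for x, y in zip(va, vb)]


def simt_add(ma, mb):
    """SIMT-style execution: one logical thread per matrix element (i, j)."""
    rows, cols = len(ma), len(ma[0])
    out = [[0] * cols for _ in range(rows)]

    def kernel(i, j):
        # Every logical thread runs the same code on its own element.
        out[i][j] = ma[i][j] + mb[i][j]

    # A real device would run these threads in parallel; here they run serially.
    for i in range(rows):
        for j in range(cols):
            kernel(i, j)
    return out


print(scalar_add(1, 2))
print(simd_add([1, 2, 3], [10, 20, 30]))
print(simt_add([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
```

The point of the sketch is the shape of the index space: none for scalar, one index for SIMD, two indices for SIMT.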


Remark

SIMT execution is also termed SPMD (Single_Program Multiple_Data) execution (Nvidia).

Scalar, SIMD and SIMT execution

Key components of the implementation of SIMT execution

• Data parallel execution

• Massive multithreading

• Data dependent flow control

• Barrier synchronization


Data parallel execution

Performed by SIMT cores.

SIMT cores execute the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core typically perform the same operation).

Figure: Basic layout of a SIMT core (a Fetch/Decode unit feeding a number of ALUs)

SIMT cores are the basic building blocks of GPGPUs or data parallel accelerators.

During SIMT execution 2-dimensional matrices will be mapped to blocks of SIMT cores.

Remark 1

Different manufacturers designate SIMT cores differently, such as

• streaming multiprocessor (Nvidia),

• superscalar shader processor (AMD),

• wide SIMD processor, CPU core (Intel).


Each ALU is allocated a working register set (RF).

Figure: Main functional blocks of a SIMT core (Fetch/Decode unit, ALUs, and one RF per ALU)


SIMT ALUs typically perform RRR operations, that is, ALUs take their operands from and write the calculated results to the register set (RF) allocated to them.

Figure: Principle of operation of the SIMD ALUs


Remark 2

Actually, the register sets (RF) allocated to each ALU are given parts of a large enough register file.

Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs


Basic operation of recent SIMT ALUs

Recent SIMT ALUs

• execute basically SP FP-MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,

• are pipelined, capable of starting a new operation every new clock cycle (more precisely, every shader clock cycle),

• need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FMADD operations to the RF.

That is, without further enhancements their peak performance is 2 SP FP operations/cycle.
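The figure of 2 SP FP operations/cycle follows from counting a multiply-add as one multiplication plus one addition. A minimal sketch of how a card's peak FP32 performance follows from that is given below. The helper name `peak_gflops` is hypothetical; the ALU counts and shader clocks are those listed for the 8800 GTX and GTX 280 in the feature table of this deck.

```python
def peak_gflops(n_alus, shader_clock_ghz, ops_per_cycle=2):
    """Peak throughput in GFLOPS: ALUs x shader clock (GHz) x operations per cycle.

    ops_per_cycle defaults to 2 because one multiply-add counts as two FP operations.
    """
    return n_alus * shader_clock_ghz * ops_per_cycle


# 8800 GTX: 128 ALUs at a 1.35 GHz shader clock, MADD only.
print(round(peak_gflops(128, 1.35), 1))

# GTX 280: 240 ALUs at 1.296 GHz, counted with 3 operations per cycle as in the table.
print(round(peak_gflops(240, 1.296, ops_per_cycle=3), 1))
```

These are theoretical upper bounds; sustained performance depends on memory traffic and on how well stalls are hidden.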


Additional operations provided by SIMT ALUs

• FX operations and FX/FP conversions,

• DP FP operations,

• trigonometric functions (usually supported by special functional units).


Massive multithreading

Aim of massive multithreading

To speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).

Principle

• Suspend stalled threads from execution and allocate ready-to-run threads for execution.

• When a large enough number of threads is available, long stalls can be hidden.
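How the number of available threads affects the hiding of stalls can be sketched with a toy model. The model is an assumption made for illustration, not a description of any real scheduler: every thread issues one operation and then stalls for a fixed number of cycles, and switching to another ready thread costs no cycles. The name `alu_utilization` is hypothetical.

```python
def alu_utilization(n_threads, stall_cycles, total_cycles=1000):
    """Fraction of cycles in which one ALU can issue an operation.

    Toy model: after issuing, a thread is stalled for `stall_cycles` cycles;
    each cycle the first ready thread (if any) is selected at no switching cost.
    """
    ready_at = [0] * n_threads  # first cycle at which each thread may issue again
    busy = 0
    for cycle in range(total_cycles):
        for t in range(n_threads):
            if ready_at[t] <= cycle:
                busy += 1
                ready_at[t] = cycle + 1 + stall_cycles
                break  # only one thread issues per cycle
    return busy / total_cycles


for n in (1, 5, 10):
    print(n, alu_utilization(n, stall_cycles=9))
```

In this model the ALU is fully utilized once the number of threads reaches the stall length plus one; with fewer threads the utilization drops proportionally.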

Multithreading is implemented by

creating and managing parallel executable threads for each data element of the execution domain.

Figure: Parallel executable threads for each element of the execution domain

Same instructions for all data elements


Effective implementation of multithreading

Multithreading is effective if thread switches, called context switches, do not cause cycle penalties.

Achieved by

• providing separate contexts (register space) for each thread, and

• implementing a zero-cycle context switch mechanism.


Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU (the register file (RF) of the SIMT core holds several contexts (CTX) per ALU; a context switch selects the actual context)


Data dependent flow control

Implemented by SIMT branch processing.

In SIMT processing both paths of a branch are executed one after the other, such that for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. xi > 0).

Example

Figure: Execution of branches [24]

The given condition is checked separately for each thread.

Figure: Execution of branches [24]

First all ALUs meeting the condition execute the prescribed three operations, then all ALUs missing the condition execute the next two operations.

Figure: Resuming instruction stream processing after executing a branch [24]
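The two-pass branch execution described above can be sketched as follows. This is a plain Python analogy under stated assumptions: the branch bodies (`2*x + 1` and `-x`) and the function name `simt_branch` are made up for the illustration and are not taken from the cited figures.

```python
def simt_branch(xs):
    """Execute `y = 2*x + 1 if x > 0 else -x` the SIMT way, one lane per element."""
    # The condition is evaluated separately for each lane (thread).
    mask = [x > 0 for x in xs]
    ys = [None] * len(xs)

    # Pass 1: only the lanes meeting the condition are active.
    for lane, active in enumerate(mask):
        if active:
            ys[lane] = 2 * xs[lane] + 1

    # Pass 2: only the lanes missing the condition are active.
    for lane, active in enumerate(mask):
        if not active:
            ys[lane] = -xs[lane]

    # After both passes all lanes resume the common instruction stream.
    return mask, ys


mask, ys = simt_branch([3, -1, 0, 5])
print(mask)
print(ys)
```

Because both paths are executed in sequence, a divergent branch costs the sum of both path lengths, even though each lane does useful work in only one of them.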


Barrier synchronization

Makes all threads wait until all prior instructions have completed before the next instruction is executed.

Implemented e.g. in AMD's Intermediate Language (IL) by the fence threads instruction [10].

Remark

In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow (CF) instruction format [7].
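The effect of a barrier can be demonstrated on the host with Python's standard `threading.Barrier`. This is only an analogy for the principle; it is neither AMD IL nor GPU code, and the function name `run_with_barrier` is hypothetical.

```python
import threading


def run_with_barrier(n_threads=4):
    """Two-phase computation in which phase 2 reads results of other threads."""
    stage1 = [0] * n_threads
    stage2 = [0] * n_threads
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Phase 1: each thread writes only its own slot.
        stage1[tid] = tid * tid
        # No thread proceeds until all threads have finished phase 1.
        barrier.wait()
        # Phase 2: safe to read the neighbour's phase 1 result.
        neighbour = (tid + 1) % n_threads
        stage2[tid] = stage1[tid] + stage1[neighbour]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stage2


print(run_with_barrier())
```

Without the barrier a thread could read its neighbour's slot before it has been written, so the result would depend on thread timing.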


Principle of SIMT execution assuming serial kernel processing

Each kernel invocation executes all thread blocks (Block(i,j)) belonging to the related grid.

Figure: Hierarchy of threads (host and device) [25]

Remark

In the figure CUDA terminology is used.

Remark

Parallel kernel processing is also possible on advanced GPGPU devices (such as Nvidia's Fermi or AMD's HD 69xx GPGPUs) with appropriate software support.
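The grid/block/thread hierarchy can be sketched as a serial enumeration of all threads of all blocks. The mapping `block index * block size + thread index` is the customary way to derive a global element index in CUDA programs; the Python names `global_index` and `launch` are hypothetical, and a real device would run the threads in parallel rather than in nested loops.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Global element index of a thread along one dimension."""
    return block_idx * block_dim + thread_idx


def launch(grid_dim, block_dim, kernel):
    """Serially visit every thread of every block of a 2D grid.

    grid_dim and block_dim are (x, y) tuples: blocks per grid, threads per block.
    """
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    kernel(global_index(bx, block_dim[0], tx),
                           global_index(by, block_dim[1], ty))


visited = []
# A grid of 3 x 2 blocks, each with 4 x 2 threads, covers a 12 x 4 element domain.
launch((3, 2), (4, 2), lambda x, y: visited.append((x, y)))
print(len(visited), max(x for x, _ in visited), max(y for _, y in visited))
```

Each element of the execution domain is visited by exactly one thread, which is what allows the kernel body to be written for a single element.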

3. Overview of GPGPUs

Basic implementation alternatives of the SIMT execution

GPGPUs

• Programmable GPUs with appropriate programming environments.

• Have display outputs.

• E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx, HD 48xx lines.

Data parallel accelerators

• Dedicated units supporting data parallel execution with appropriate programming environment.

• No display outputs.

• Have larger memories than GPGPUs.

• E.g. Nvidia's Tesla lines, AMD's FireStream lines.

Figure: Basic implementation alternatives of the SIMT execution

3. Overview of GPGPUs (1)

GPGPUs

Nvidia's line: G80 (90 nm) → G92 (65 nm, shrink) → G200 (65 nm, enhanced arch.) → GF100 (Fermi) (40 nm, shrink, enhanced arch.)

AMD/ATI's line: R600 (80 nm) → RV670 (55 nm, shrink) → RV770 (55 nm, enhanced arch.) → RV870 (40 nm, shrink, enhanced arch.) → Cayman (40 nm, enhanced arch.)

Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines

Figure: Overview of GPGPUs and their basic software support (1) (2005 to 2008)

NVidia

• Cores: G80 (11/06, 90 nm/681 mtrs), G92 (10/07, 65 nm/754 mtrs), GT200 (6/08, 65 nm/1400 mtrs)

• Cards (G80): 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit)

• Cards (G92): 8800 GT (112 ALUs, 256-bit)

• Cards (GT200): GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)

• CUDA: Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08)

AMD/ATI

• Cores: R500 (11/05), R600 (5/07, 80 nm/681 mtrs), R670 (11/07, 55 nm/666 mtrs), RV770 (5/08, 55 nm/956 mtrs)

• Cards (R500): (Xbox) (48 ALUs)

• Cards (R600): HD 2900XT (320 ALUs, 512-bit)

• Cards (R670): HD 3850 (320 ALUs, 256-bit), HD 3870 (320 ALUs, 256-bit)

• Cards (RV770): HD 4850 (800 ALUs, 256-bit), HD 4870 (800 ALUs, 256-bit)

• Brook+: 11/07

• RapidMind: 3870 support (6/08)


Figure: Overview of GPGPUs and their basic software support (2) (2009 to 2011)

OpenCL standard: 12/08

NVidia

• Cores: GF100 (Fermi) (3/10, 40 nm/3000 mtrs), GF104 (Fermi) (07/10, 40 nm/1950 mtrs), GF110 (Fermi) (11/10, 40 nm/3000 mtrs)

• Cards (GF100): GTX 470 (448 ALUs, 320-bit), GTX 480 (480 ALUs, 384-bit)

• Cards (GF104): GTX 460 (336 ALUs, 192/256-bit)

• Cards (GF110): GTX 580, GTX 560 Ti (1/11); listed ALUs/memory interfaces: 512 ALUs/384-bit, 480 ALUs/384-bit

• CUDA: Version 2.1 (11/08), Version 2.2 (5/09), Version 2.3 (6/09), Version 3.0 (3/10), Version 3.1 (6/10), Version 3.2 (1/11), Version 4.0 Beta (3/11)

• OpenCL: OpenCL 1.0 (SDK 1.0 Early release, 6/09), OpenCL 1.0 (SDK 1.0, 10/09), OpenCL 1.1 (SDK 1.1, 6/10)

AMD/ATI

• Cores: RV870 (Cypress) (9/09, 40 nm/2100 mtrs), Barts Pro/XT (10/10, 40 nm/1700 mtrs), Cayman Pro/XT (12/10, 40 nm/2640 mtrs)

• Cards (RV870): HD 5850/70 (1440/1600 ALUs, 256-bit)

• Cards (Barts Pro/XT): HD 6850/70 (960/1120 ALUs, 256-bit)

• Cards (Cayman Pro/XT): HD 6950/70 (1408/1536 ALUs, 256-bit)

• Brook+: (SDK v.1.0), Brook+ 1.2 (SDK v.1.2, 9/08), Brook+ 1.3 (SDK v.1.3, 12/08), Brook+ 1.4 (SDK V.1.4 Beta, 3/09)

• OpenCL: OpenCL 1.0 (SDK V.2.0, 11/09), OpenCL 1.0 (SDK V.2.01, 03/10), OpenCL 1.1 (SDK V.2.2, 08/10)

• RapidMind: Intel bought RapidMind (8/09)

Remarks on AMD-based graphics cards [45], [66]

Beginning with their Cypress-based HD 5xxx line and SDK v.2.0, AMD left Brook+ and started supporting OpenCL as the basis of their HLL programming language.

As a consequence AMD also changed

• both the microarchitecture of their GPGPUs (by introducing Local and Global Data Share memories) and

• their terminology, by introducing Pre-OpenCL and OpenCL terminology, as discussed in Section 5.2.


Remarks on Fermi-based graphics cards [45], [66]

FP64 speed

• ½ of the FP32 speed for the Tesla 20-series

• 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards, 1/12 for other GeForce GTX 4xx cards

ECC

Available only on the Tesla 20-series.

Number of DMA engines

The Tesla 20-series has 2 DMA engines (copy engines). GeForce cards have 1 DMA engine. This means that CUDA applications can overlap computation and communication on Tesla using bi-directional communication over PCI-e.

Memory size

Tesla 20 products have larger on-board memory (3 GB and 6 GB).
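The FP64 ratios quoted above can be applied to the peak FP32 figures listed in the feature tables of this deck. The sketch below is a worked example only; the dictionary keys and the function name `peak_fp64_gflops` are hypothetical labels chosen for the illustration.

```python
# FP64 speed as a fraction of FP32 speed, as stated in the remarks above.
FP64_RATIO = {
    "Tesla 20-series": 1 / 2,
    "GeForce GTX 470/480/570/580": 1 / 8,
    "other GeForce GTX 4xx": 1 / 12,
}


def peak_fp64_gflops(peak_fp32_gflops, family):
    """Peak FP64 performance derived from the peak FP32 figure of a card."""
    return peak_fp32_gflops * FP64_RATIO[family]


# Peak FP32 values are those given in the tables of this deck.
print(round(peak_fp64_gflops(1030.4, "Tesla 20-series"), 1))            # Tesla C2050/C2070
print(round(peak_fp64_gflops(1345, "GeForce GTX 470/480/570/580"), 1))  # GTX 480
print(round(peak_fp64_gflops(907.2, "other GeForce GTX 4xx"), 1))       # GTX 460
```

The results agree with the FP64 rows of the corresponding tables to within rounding.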


Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]

Nvidia's compute capability concept

Nvidia manages the continuous evolution by

a) defining sets of capabilities and features designated as compute capability versions,

b) specifying which compute capability version is supported by their

• programming environments, represented by their SDKs, and

• GPGPU lines,

c) and specifying compatibility rules among them.



a) Defined sets of compute capability versions by Nvidia-1 [81]


a) Defined sets of compute capability versions by Nvidia-2 [81]

Fermi

b1) Compute capability versions of the PTX ISAs generated by different releases of CUDA SDKs [50]


b2) Support of the compute capability versions by Nvidia's GPGPU cards [81]

Compute capability 1.0

• GPGPU cores: G80

• GPGPU devices: GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M

Compute capability 1.1

• GPGPU cores: G86, G84, G98, G96, G96b, G94, G94b, G92, G92b

• GPGPU devices: GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50

Compute capability 1.2

• GPGPU cores: GT218, GT216, GT215

• GPGPU devices: GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M

Compute capability 1.3

• GPGPU cores: GT200, GT200b

• GPGPU devices: GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800

Compute capability 2.0

• GPGPU cores: GF100, GF110

• GPGPU devices: GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex7000, GTX570, GTX580

Compute capability 2.1

• GPGPU cores: GF108, GF106, GF104, GF114

• GPGPU devices: GT 420/30/40, GTS 450, GTX 460, 500M

c) Compatibility rules related to compute capability versions [50]

The basic rule is forward compatibility within the main versions (versions 1.x and 2.x), but not across main versions.

This is interpreted as follows:

Object files (called CUBIN files) compiled to a particular compute capability are supported on all devices having the same or higher version number within the same main version. E.g. object files compiled to the compute capability 1.0 are supported on all 1.x devices but not supported on compute capability 2.0 (Fermi) devices.

For more details see [52].
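The compatibility rule for CUBIN files stated above can be written down as a small predicate. This is only a sketch of the rule as worded in this deck; the function name `cubin_runs_on` and the tuple representation of versions are assumptions made for the illustration.

```python
def cubin_runs_on(compiled_for, device):
    """Return True if a CUBIN compiled for `compiled_for` is supported on `device`.

    Both arguments are (major, minor) compute capability tuples.
    Rule: same main version, and a device minor version that is the same or higher.
    """
    compiled_major, compiled_minor = compiled_for
    device_major, device_minor = device
    return compiled_major == device_major and device_minor >= compiled_minor


print(cubin_runs_on((1, 0), (1, 3)))  # 1.0 object file on a 1.3 device
print(cubin_runs_on((1, 0), (2, 0)))  # 1.0 object file on a Fermi device
print(cubin_runs_on((1, 3), (1, 1)))  # 1.3 object file on a 1.1 device
```

The predicate covers only object files as described here; it says nothing about other forms of compiled code.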


8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280

Core G80 G80 G92 GT200 GT200

Introduction 11/06 11/06 10/07 6/08 6/08

IC technology 90 nm 90 nm 65 nm 65 nm 65 nm

Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs

Die area 480 mm² 480 mm² 324 mm² 576 mm² 576 mm²

Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz

Computation

No of SMs (cores) 12 16 14 24 30

No. of FP32 EUs 96 128 112 192 240

Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz

No. of FP32 operations/cycle 2¹ 3 3

Peak FP32 performance 230.4 GFLOPS 345.6¹ GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS

Peak FP64 performance – – – 59.62 GFLOPS 77.76 GFLOPS

Memory

Mem. transfer rate (eff) 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s

Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit

Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s

Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB

Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3

Mem. channel 5*64-bit 6*64-bit 4*64-bit 7*64-bit 8*64-bit

System

Multi. CPU techn. SLI SLI SLI SLI SLI

Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

MS Direct X 10 10 10 10.1 subset 10.1 subset

TDP 146 W 155 W 105 W 182 W 236 W

Table: Main features of Nvidia’s GPGPUs-1


1: Nvidia also takes the FP32 capable Texture Processing Units into consideration and calculates with 3 FP32 operations/cycle.

Publications contain conflicting statements about whether the G80 makes use of dual issue (including a MAD and a Mul operation) within a period of four shader cycles. Official specifications [22] declare the capability of dual issue, but other literature sources [64] and even a textbook co-authored by one of the chief developers of the G80 (D. Kirk [65]) deny it. A clarification can be found in a blog [66], revealing that the higher figure given in Nvidia's specifications includes calculations made both by the ALUs in the SMs and by the texture processing units (TPUs).

Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks, such as texture filtering.

Accordingly, in our discussion focusing on numerical calculations it is fair to take only the MAD operations into account for specifying the peak numerical performance.


Remarks

Texture processing units

consisting of

• TA: Texture Address units

• TF: Texture Filter units

They are FP32 or FP16 capable [46].


Structure of an SM of the G80 architecture

GTX 470 GTX 480 GTX 460 GTX 570 GTX 580

Core GF100 GF100 GF104 GF110 GF110

Introduction 3/10 3/10 7/10 12/10 11/10




Core frequency 732 MHz 772 MHz

Computation

No of SMs (cores) 14 15 7 15 16

No. of FP32 EUs 448 480 336 480 512

Shader frequency 1215 MHz 1401 MHz 1350 MHz 1464 MHz 1544 MHz

No. FP32 operations/cycle 2 2 3 2 2

Peak FP32 performance 1088 GFLOPS 1345 GFLOPS 907.2 GFLOPS 1405 GFLOPS 1581 GFLOPS

Peak FP64 performance 136 GFLOPS 168 GFLOPS 75.6 GFLOPS 175.6 GFLOPS 197.6 GFLOPS

Memory

Mem. transfer rate (eff) 3348 Mb/s 3698 Mb/s 3600 Mb/s 3800 Mb/s 4008 Mb/s

Mem. interface 320-bit 384-bit 192/256-bit 320-bit 384-bit

Mem. bandwidth 133.9 GB/s 177.4 GB/s 86.4/115.2 GB/s 152 GB/s 192.4 GB/s

Mem. size 1.28 GB 1.536 GB 0.768/1.024 GB 1.28 GB 1.536/3.072 GB

Mem. type GDDR5 GDDR5 GDDR5 GDDR5 GDDR5

Mem. channel 5*64-bit 6*64-bit 3/4 *64-bit 5*64-bit 6*64-bit

System

Multi. CPU techn. SLI SLI SLI SLI SLI

Interface PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16

MS Direct X 11 11 11 11 11

TDP 215 W 250 W 150/160 W 219 W 244 W

Table: Main features of Nvidia’s GPGPUs-2


Remarks


1) The GDDR3 memory has a double clocked data transfer

Effective memory transfer rate = 2 x memory frequency

The GDDR5 memory has a quad clocked data transfer

Effective memory transfer rate = 4 x memory frequency

2) Both the GDDR3 and GDDR5 memories are 32-bit devices. Nevertheless, memory controllers of GPGPUs may be designed either to control a single 32-bit memory channel or dual memory channels, providing a 64-bit channel width.
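The relations in these remarks, together with the memory rows of the feature tables, can be checked with a short calculation. The sketch assumes that the memory bandwidth in GB/s equals the effective transfer rate per pin (in Mb/s) times the interface width in bits, divided by 8 bits per byte and by 1000; the function names and the 1107 MHz example clock are illustrative assumptions, while the transfer rates and interface widths are those of the GTX 480 and GTX 280 rows.

```python
# Data transfers per memory clock, as stated in remark 1.
DATA_RATE_MULTIPLIER = {"GDDR3": 2, "GDDR5": 4}


def effective_rate_mbps(memory_clock_mhz, mem_type):
    """Effective transfer rate per pin in Mb/s."""
    return DATA_RATE_MULTIPLIER[mem_type] * memory_clock_mhz


def bandwidth_gb_per_s(rate_mbps, interface_bits):
    """Memory bandwidth in GB/s for a given per-pin rate and interface width."""
    return rate_mbps * interface_bits / 8 / 1000


# GTX 480: 3698 Mb/s effective, 384-bit interface.
print(round(bandwidth_gb_per_s(3698, 384), 1))

# GTX 280: 2214 Mb/s effective (GDDR3, i.e. an assumed 1107 MHz memory clock), 512-bit.
print(round(bandwidth_gb_per_s(effective_rate_mbps(1107, "GDDR3"), 512), 1))
```

The same calculation reproduces the other bandwidth entries of the Nvidia tables, which makes it a convenient consistency check on the memory rows.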

Examples for Nvidia cards

Nvidia GeForce GTX 480 (GF100 based) [47]

Nvidia GeForce GTX 480 (GF100 based) and GTX 580 (GF110 based) cards [77]

A pair of GeForce GTX 480 cards (GF100 based) [47]

HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870

Core R600 RV670 RV670 RV770 (R700-based) RV770 (R700-based)

Introduction 5/07 11/07 11/07 5/08 5/08




Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

Computation

No. of ALUs 320 320 320 800 800

Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

No. FP32 operations./cycle 2 2 2 2 2

Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS

Peak FP64 performance – – – 200 GFLOPS 240 GFLOPS

Memory

Mem. transfer rate (eff) 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5)

Mem. interface 512-bit 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s

Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB

Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5

Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit

Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar

System

Multi. CPU techn. CrossFire X CrossFire X CrossFire X CrossFire X CrossFire X

Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

MS Direct X 10 10.1 10.1 10.1 10.1

TDP Max./Idle 150 W 75 W 105 W 110 W 150 W

Table: Main features of AMD/ATI's GPGPUs-1


Evergreen series HD 5850 HD 5870 HD 5970

Core Cypress PRO (RV870-based) Cypress XT (RV870-based) Hemlock XT (RV870-based)

Introduction 9/09 9/09 11/09

IC technology 40 nm 40 nm 40 nm

Nr. of transistors 2154 mtrs 2154 mtrs 2*2154 mtrs

Die area 334 mm² 334 mm² 2*334 mm²

Core frequency 725 MHz 850 MHz 725 MHz

Computation

No. of SIMD cores / VLIW5 ALUs 18/16 20/16 2*20/16

No. of EUs 1440 1600 2*1600

Shader frequency 725 MHz 850 MHz 725 MHz

No. FP32 inst./cycle 2 2 2

Peak FP32 performance 2088 GFLOPS 2720 GFLOPS 4640 GFLOPS

Peak FP64 performance 417.6 GFLOPS 544 GFLOPS 928 GFLOPS

Memory

Mem. transfer rate (eff) 4000 Mb/s 4800 Mb/s 4000 Mb/s

Mem. interface 256-bit 256-bit 2*256-bit

Mem. bandwidth 128 GB/s 153.6 GB/s 2*128 GB/s

Mem. size 1.0 GB 1.0/2.0 GB 2*(1.0/2.0) GB

Mem. type GDDR5 GDDR5 GDDR5

Mem. channel 8*32-bit 8*32-bit 2*8*32-bit

System

Multi. CPU techn. CrossFire X CrossFire X CrossFire X

Interface PCIe 2.1*16 PCIe 2.1*16 PCIe 2.1*16

MS Direct X 11 11 11

TDP Max./Idle 151/27 W 188/27 W 294/51 W

Table: Main features of AMD/ATI’s GPGPUs-2


Northern Islands series HD 6850 HD 6870

Core Barts Pro Barts XT

Introduction 10/10 10/10

IC technology 40 nm 40 nm

Nr. of transistors 1700 mtrs 1700 mtrs

Die area 255 mm² 255 mm²

Core frequency 775 MHz 900 MHz

Computation

No. of SIMD cores /VLIW5 ALUs 12/16 14/16

No. of EUs 960 1120

Shader frequency 775 MHz 900 MHz

No. FP32 inst./cycle 2 2

Peak FP32 performance 1488 GFLOPS 2016 GFLOPS

Peak FP64 performance - -

Memory

Mem. transfer rate (eff) 4000 Mb/s 4200 Mb/s

Mem. interface 256-bit 256-bit

Mem. bandwidth 128 GB/s 134.4 GB/s

Mem. size 1 GB 1 GB

Mem. type GDDR5 GDDR5

Mem. channel 8*32-bit 8*32-bit

System

Multi. CPU techn. CrossFire X CrossFire X

Interface PCIe 2.1*16 PCIe 2.1*16

MS Direct X 11 11

TDP Max./Idle 127/19 W 151/19 W

Table: Main features of AMD/ATI’s GPGPUs-3


Northern Islands series HD 6950 HD 6970 HD 6990 HD 6990 unlocked

Core Cayman Pro Cayman XT Antilles Antilles

Introduction 12/10 12/10 3/11 3/11

IC technology 40 nm 40 nm 40 nm 40 nm

Nr. of transistors 2.64 billion 2.64 billion 2*2.64 billion 2*2.64 billion

Die area 389 mm² 389 mm² 2*389 mm² 2*389 mm²

Core frequency 800 MHz 880 MHz 830 MHz 880 MHz

Computation

No. of SIMD cores /VLIW4 ALUs 22/16 24/16 2*24/16 2*24/16

No. of EUs 1408 1536 2*1536 2*1536

Shader frequency 800 MHz 880 MHz 830 MHz 880 MHz

No. FP32 inst./cycle / ALU 4 4 4 4

Peak FP32 performance 2.25 TFLOPS 2.7 TFLOPS 5.1 TFLOPS 5.4 TFLOPS

Peak FP64 performance 0.5625 TFLOPS 0.683 TFLOPS 1.275 TFLOPS 1.35 TFLOPS

Memory

Mem. transfer rate (eff) 5000 Mb/s 5500 Mb/s 5000 Mb/s 5000 Mb/s

Mem. interface 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 160 GB/s 176 GB/s 2*160 GB/s 2*160 GB/s

Mem. size 2 GB 2 GB 2*2 GB 2*2 GB

Mem. type GDDR5 GDDR5 GDDR5 GDDR5

Mem. channel 8*32-bit 8*32-bit 2*8*32-bit 2*8*32-bit

System

ECC - - - -

Multi. CPU techn. CrossFireX CrossFireX CrossFireX CrossFireX

Interface PCIe 2.1*16 PCIe 2.1*16 PCIe 2.1*16 PCIe 2.1*16

MS Direct X 11 11 11 11

TDP Max./Idle 200/20 W 250/20 W 350/37 W 415/37 W

Table: Main features of AMD/ATI's GPGPUs-4



Remark

The Radeon HD 5xxx line of cards is also designated as the Evergreen series, and the Radeon HD 6xxx line of cards as the Northern Islands series.

Examples for AMD cards

HD 5870 (RV870 based) [41]

HD 5970 (actually RV870 based) [80]

ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock

HD 5970 (actually RV870 based) [79]

AMD HD 6990 (actually Cayman based) [78]

AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock


Price relations (as of 01/2011)

Nvidia

• GTX 570 ~ 350 $

• GTX 580 ~ 500 $

AMD

• HD 6970 ~ 400 $

• HD 6990 ~ 700 $ (Dual 6970)


4. Overview of data parallel accelerators

Implementation alternatives of data parallel accelerators

On-card implementation (recent implementations)

• E.g. GPU cards, data-parallel accelerator cards

On-die integration (emerging implementations)

• E.g. Intel's Havendale, Intel's Sandy Bridge (2011), AMD's Torrenza integration technology, AMD's Fusion (2008) integration technology

Trend: from on-card implementation towards on-die integration

Figure: Implementation alternatives of dedicated data parallel accelerators

4. Overview of data parallel accelerators (1)

2010/2011

On-card accelerators

Card implementations

• Single cards fitting into a free PCI-E x16 slot of the host computer.

• E.g. Nvidia Tesla C870, Nvidia Tesla C1060, Nvidia Tesla C2070, AMD FireStream 9170, AMD FireStream 9250, AMD FireStream 9370

Desktop implementations

• Usually dual cards mounted into a box, connected through a cable to an adapter card that is inserted into a free PCI-E x16 slot of the host PC.

• E.g. Nvidia Tesla D870

1U server implementations

• Usually 4 cards mounted into a 1U server rack, connected through two switches and two cables to two adapter cards that are inserted into two free PCI-E x16 slots of a server.

• E.g. Nvidia Tesla S870, Nvidia Tesla S1070, Nvidia Tesla S2050/S2070

Figure: Implementation alternatives of on-card accelerators


NVidia Tesla-1 (2007 to 2008)

Card

• C870 (6/07, G80-based): 1.5 GB GDDR3, SP: 345.6 GFLOPS, DP: -

• C1060 (6/08, GT200-based): 4 GB GDDR3, SP: 933 GFLOPS, DP: 77.76 GFLOPS

Desktop

• D870 (6/07, G80-based): 2*C870 incl., 3 GB GDDR3, SP: 691.2 GFLOPS, DP: -

1U Server

• S870 (6/07, G80-based): 4*C870 incl., 6 GB GDDR3, SP: 1382 GFLOPS, DP: -

• S1070 (6/08, GT200-based): 4*C1060, 16 GB GDDR3, SP: 3732 GFLOPS, DP: 311 GFLOPS

CUDA

• Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)

Figure: Overview of Nvidia's G80/G200-based Tesla family-1

Figure: Main functional units of Nvidia’s Tesla C870 card [2]

FB: Frame Buffer


Figure: Nvidia’s Tesla C870 and AMD’s FireStream 9170 cards [2], [3]


Figure: Tesla D870 desktop implementation [4]


Figure: Nvidia’s Tesla D870 desktop implementation [4]


Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]


Figure: Concept of Nvidia’s Tesla S870 1U rack server [5]


Figure: Internal layout of Nvidia’s Tesla S870 1U rack [6]


Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards inserted into PCI-E x16 slots of the host server [6]


NVidia Tesla-2 (2009 to 2011), GF100 (Fermi)-based

Card

• C2050/C2070 (11/09): 3/6 GB GDDR5, SP: 1.03 TFLOPS¹, DP: 0.515 TFLOPS

Module

• M2050/M2070 (04/10): 3/6 GB GDDR5, SP: 1.03 TFLOPS¹, DP: 0.515 TFLOPS

• M2070Q (08/10): 6 GB GDDR5, SP: 1.03 TFLOPS¹, DP: 0.515 TFLOPS

1U Server

• S2050/S2070 (11/09): 4*C2050/C2070, 12/24 GB GDDR5, SP: 4.1 TFLOPS¹, DP: 2.06 TFLOPS

CUDA

• Version 2.2 (5/09), Version 2.3 (6/09), Version 3.0 (3/10), Version 3.1 (6/10), Version 3.2 (1/11)

OpenCL

• OpenCL 1.1 (6/10)

¹: Without SF (Special Function) operations

Figure: Overview of Nvidia's GF100 (Fermi)-based Tesla family


Fermi based Tesla devices

Tesla C2050/C2070 Card (11/2009) [71]

• Single GPU card

• 3/6 GB GDDR5

• 515 GFLOPS DP

• ECC

Tesla S2050/S2070 1U (11/2009) [72]

• Four GPUs

• 12/24 GB GDDR5

• 2060 GFLOPS DP

• ECC

Tesla M2050/M2070/M2070Q Processor Module (04/2010)

(Dual slot board with PCIe Gen. 2 x16 interface)


Used in the Tianhe-1A Chinese supercomputer (10/2010)

Remark

The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010)

Figure: Tesla M2050/M2070/M2070Q Processor Module [74]

Tianhe-1A (10/2010) [48]

• Upgraded version of the Tianhe-1 (China)

• 2.6 PetaFLOPS (fastest supercomputer in the world in 2010)

• 14 336 Intel Xeon 5670

• 7 168 Nvidia Tesla M2050

Specification data of the Tesla M2050/M2070/M2070Q modules [74]

(448 ALUs) (448 ALUs)

Support of ECC

• Fermi based Tesla devices introduced the support of ECC.

• By contrast, at present neither Nvidia’s straightforward GPGPU cards nor AMD’s GPGPU or DPA devices support ECC [76].


Tesla S2050/S2070 1U

GPU Specification

Number of processor cores: 448

Processor core clock: 1.15 GHz

Memory clock: 1.546 GHz

Memory interface: 384 bit

System Specification

Four Fermi GPUs

12.0/24.0 GB of GDDR5,

configured as 3.0/6.0 GB per GPU.

When ECC is turned on, available memory is ~10.5 GB

Typical power consumption: 900 W

Figure: Block diagram and technical specifications of Tesla S2050/S2070 [75]

The S2050 and S2070 differ only in memory size: the S2050 includes 12 GB, the S2070 24 GB.

AMD FireStream-1 (2007 to 2008)

Card

• 9170 (RV670-based, 11/07, shipped 6/08): 2 GB GDDR3, FP32: 500 GFLOPS, FP64: ~200 GFLOPS

• 9250 (RV770-based, 6/08, shipped 10/08): 1 GB GDDR3, FP32: 1000 GFLOPS, FP64: ~300 GFLOPS

Stream Computing SDK

• Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer)

• Version 1.2 (09/08): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer)

Rapid Mind

Figure: Overview of AMD/ATI’s FireStream family-1

AMD FireStream-2 (2009 to 2011), RV870-based

Card

• 9350/9370 (06/10, shipped 10/10): 2/4 GB GDDR5, FP32: 2016 GFLOPS, FP64: 403/528 GFLOPS

Stream Computing SDK

• Version 1.4 (03/09): Brook+

• Version 2.01 (03/10): OpenCL 1.0

• Version 2.1 (05/10): OpenCL 1.0

• Version 2.2 (08/10): OpenCL 1.1

• Version 2.3 (12/10): OpenCL 1.1

In 01/11 Version 2.3 was renamed to APP (Accelerated Parallel Processing).

Figure: Overview of AMD/ATI’s FireStream family-2



Table: Main features of Nvidia’s data parallel accelerator cards (Tesla line) [73]

Nvidia Tesla cards

Core type C870 C1060 C2050 C2070

Based on G80 GT200 T20 (GF100-based)

Introduction 6/07 6/08 11/09

Core

Core frequency 600 MHz 602 MHz 575 MHz

ALU frequency 1350 MHz 1296 MHz 1150 MHz

No. of SMs (cores) 16 30 14

No. of ALUs 128 240 448

Peak FP32 performance 345.6 GFLOPS 933 GFLOPS 1030.4 GFLOPS

Peak FP64 performance - 77.76 GFLOPS 515.2 GFLOPS

Memory

Mem. transfer rate (eff) 1600 Mb/s 1600 Mb/s 3000 Mb/s

Mem. interface 384-bit 512-bit 384-bit

Mem. bandwidth 76.8 GB/s 102 GB/s 144 GB/s

Mem. size 1.5 GB 4 GB 3 GB 6 GB

Mem. type GDDR3 GDDR3 GDDR5

System

ECC - - ECC

Interface PCIe *16 PCIe 2.0*16 PCIe 2.0*16

Power (max) 171 W 200 W 238 W 247 W

Table: Main features of AMD/ATI’s data parallel accelerator cards (FireStream line) [67]

AMD FireStream cards

Core type                   9170            9250            9350            9370
Based on                    RV670           RV770           RV870           RV870
Introduction                11/07           6/08            10/10           10/10

Core
Core frequency              800 MHz         625 MHz         700 MHz         825 MHz
ALU frequency               800 MHz         625 MHz         700 MHz         825 MHz
No. of EUs                  320             800             1440            1600
Peak FP32 performance       512 GFLOPS      1 TFLOPS        2016 GFLOPS     2640 GFLOPS
Peak FP64 performance       ~200 GFLOPS     ~250 GFLOPS     403.2 GFLOPS    528 GFLOPS

Memory
Mem. transfer rate (eff)    1600 Mb/s       1986 Mb/s       4000 Mb/s       4600 Mb/s
Mem. interface              256-bit         256-bit         256-bit         256-bit
Mem. bandwidth              51.2 GB/s       63.5 GB/s       128 GB/s        147.2 GB/s
Mem. size                   2 GB            1 GB            2 GB            4 GB
Mem. type                   GDDR3           GDDR3           GDDR5           GDDR5

System
ECC                         -               -               -               -
Interface                   PCIe 2.0 x16    PCIe 2.0 x16    PCIe 2.0 x16    PCIe 2.0 x16
Power (max)                 150 W           150 W           150 W           225 W
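The memory bandwidth row of the FireStream table follows from the effective transfer rate and the interface width. The sketch below is a minimal check, assuming the transfer rate is quoted per data pin in Mb/s; the helper name is illustrative and not part of any AMD tool.

```python
def mem_bandwidth_gb_s(transfer_rate_mb_s, interface_bits):
    """Bandwidth in GB/s: per-pin rate (Mb/s) x bus width (bits) / 8 bits per byte / 1000."""
    return transfer_rate_mb_s * interface_bits / 8 / 1000.0


# (effective transfer rate, ASSUMED to be Mb/s per pin; interface width in bits)
firestream_cards = {
    "9170": (1600, 256),
    "9250": (1986, 256),
    "9350": (4000, 256),
    "9370": (4600, 256),
}

bandwidth = {name: mem_bandwidth_gb_s(*spec) for name, spec in firestream_cards.items()}

for name, gb_s in bandwidth.items():
    print(f"{name}: {gb_s:.1f} GB/s")
```

The same relation applied to the Tesla C870 (1600 Mb/s per pin, 384-bit interface) gives 76.8 GB/s.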

Price relations (as of 1/2011)

Nvidia Tesla

C2050 ~ 2000 $
C2070 ~ 4000 $
S2050 ~ 13 000 $
S2070 ~ 19 000 $

Nvidia GTX

GTX580 ~ 500 $


Background slides for intro to SIMT processing

1. Introduction (8)

Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]

1. Introduction (8)

Figure: Bandwidth values of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]

1. Introduction (9)

5. References

5. References (to all four sections)

[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

[1]: Torricelli F., AMD in HPC, HPC07, 2007 http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf

[3]: AMD FireStream 9170, 2008, http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March 13 2008, http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

[6]: Torres G., Nvidia Tesla Technology, Nov. 2007, http://www.hardwaresecrets.com/article/495

[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008

[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007 http://ati.amd.com/developer/techpapers.html

5. References (1)

[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia

[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt

[13]: Goto H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008, http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf

[14]: Goto H., Nvidia G80, PC Watch, April 16 2007, http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm

[15]: Goto H., GeForce 8800GT (G92), PC Watch, Oct. 31 2007, http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm

[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review, PC Perspective, June 16 2008, http://www.pcper.com/article.php?aid=577&type=expert&pid=3

5. References (2)

[10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL), Version 2.0, AMD, Oct. 2008

[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture,” Center for Visual Information Technology, IIIT Hyderabad, March 2007, http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,” Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008

[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies, Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242

[23]: Goto H., Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia

5. References (3)

[18]: http://en.wikipedia.org/wiki/DirectX

[19]: Dietrich S., “Shader Model 3.0,” April 2004, Nvidia, http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf

[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[30]: Stokes J., “Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Ars Technica, Aug. 4 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., “Intel's Larrabee Architecture Disclosure: A Calculated First Move,” AnandTech, Aug. 4 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2

[32]: Hester P., “Multi-Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1, http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf., Aug. 2007, http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf

5. References (4)

[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,” ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007, http://www.pcper.com/article.php?aid=453

[28]: Goto H., “Larrabee”, PC Watch, Oct. 17, 2008, http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

[38]: Goto H., RV770 Overview, PC Watch, July 02 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

5. References (5)

[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies, Sept 30 2009, http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&mode=print

[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed, Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1

[41]: Wasson S., AMD's Radeon HD 5870 graphics processor, Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1

[42]: Bell B., ATI Radeon HD 5870 Performance Preview, Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/ati_radeon_hd_5870_performance_preview/default.asp

[35]: Mantor M., “AMD’s Radeon HD 2900,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., “Anatomy of AMD’s TeraScale Graphics Engine,” SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008, http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf

[43]: Nvidia CUDA C Programming Guide, Version 3.2, October 22 2010, http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf

5. References (6)

[44]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley, January 24-25 2011, http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computational_thinking_Berkeley_2011.pdf

[46]: Shrout R., Nvidia GeForce 8800 GTX Review – DX10 and Unified Architecture, PC Perspective, Nov 8 2006, http://swfan.com/reviews/graphics-cards/nvidia-geforce-8800-gtx-review-dx10-and-unified-architecture/g80-architecture

[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors, Tech Report, March 31 2010, http://techreport.com/articles.x/18682

[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept 30 2009, http://www.bit-tech.net/hardware/graphics/2009/09/30/ati-radeon-hd-5870-architecture-analysis/8

[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor, Tech Report, Nov 9 2010, http://techreport.com/articles.x/19934/1

[48]: Gangar K., Tianhe-1A from China is world’s fastest Supercomputer, Tech Ticker, Oct 28 2010, http://techtickerblog.com/2010/10/28/tianhe-1a-from-china-is-worlds-fastest-supercomputer/

[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies, Sept 25 2010 http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4

5. References (7)

[52]: Nvidia CUDA™ Fermi™ Compatibility Guide for CUDA Applications, Version 1.0, February 2010, http://developer.download.nvidia.com/compute/cuda/3_0/docs/NVIDIA_FermiCompatibilityGuide.pdf

[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview, Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/

[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov 21 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38178-nvidia-geforce-gtx-460-se-1gb-review-2.html

[53]: Hallock R., Dissecting Fermi, NVIDIA’s next generation GPU, Icrontic, Sept 30 2009, http://tech.icrontic.com/articles/nvidia_fermi_dissected/

[56]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture, Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf

[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads – Part 2, 2007-2009, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt

[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 2.2, Oct 14 2010, http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/ptx_isa_2.2.pdf

[58]: Nvidia’s Next Generation CUDA™ Compute Architecture: Fermi™, Version 1.1, 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

5. References (8)

[59]: Kirk D. & Hwu W. W., ECE498AL Lectures 8: Threading Hardware in G80, 2007-2009, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/lectures/lecture8-threading-hardware-spring-2009.ppt

[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors (GPUs), SAAB Technologies, Jan 27 2010, http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf

[62]: Smith R., NVIDIA’s GeForce GTX 460: The $200 King, AnandTech, July 11 2010, http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2

[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU Microarchitecture through Microbenchmarking, University of Toronto, 2010, http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played, Tom’s Hardware, Nov 9 2010, http://www.tomshardware.com/reviews/geforce-gtx-580-gf110-geforce-gtx-480,2781.html

[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 8 2006, http://www.beyond3d.com/content/reviews/1/11

[65]: Kirk D. & Hwu W., Programming Massively Parallel Processors, 2008, Chapter 3: CUDA Threads, http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf

[66]: NVIDIA Forums: General CUDA GPU Computing Discussion, 2008 http://forums.nvidia.com/index.php?showtopic=73056

5. References (9)

[67]: Wikipedia: Comparison of AMD graphics processing units, 2011 http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units

[69]: Chester E., Nvidia GeForce GTX 460 1GB Fermi Review, Trusted Reviews, July 13 2010, http://www.trustedreviews.com/graphics/review/2010/07/13/Nvidia-GeForce-GTX-460-1GB-Fermi/p1

[70]: NVIDIA GF100 Architecture Details, Geeks3D, 2008-2010, http://www.geeks3d.com/20100118/nvidia-gf100-architecture-details/

[68]: Nvidia OpenCL Overview, 2009 http://gpgpu.org/wp/wp-content/uploads/2009/06/05-OpenCLIntroduction.pdf

[71]: Murad A., Nvidia Tesla C2050 and C2070 Cards, Science and Technology Zone, Nov. 17 2009, http://forum.xcitefun.net/nvidia-tesla-c2050-and-c2070-cards-t39578.html

[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10, Nvidia, Nov. 16 2009 http://www.nvidia.com/object/io_1258360868914.html

[73]: Nvidia Tesla, Wikipedia, http://en.wikipedia.org/wiki/Nvidia_Tesla

[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules, Board Specification, v. 03, Nvidia, Aug. 2010, http://www.nvidia.asia/docs/IO/43395/BD-05238-001_v03.pdf

[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009, http://www.nvidia.com/docs/IO/43395/SP-04975-001-v04.pdf

5. References (10)

[76]: Kanter D., The Case for ECC Memory in Nvidia’s Next GPU, Real World Technologies, Aug. 19 2009, http://www.realworldtech.com/page.cfm?ArticleID=RWT081909212132

[77]: Hoenig M., Nvidia GeForce 580 Review, Hardware Canucks, Nov. 8, 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-geforce-gtx-580-review-5.html

[78]: Angelini C., AMD Radeon HD 6990 4 GB Review, Tom’s Hardware, March 8, 2011, http://www.tomshardware.com/reviews/radeon-hd-6990-antilles-crossfire,2878.html

[79]: Tom’s Hardware Gallery, http://www.tomshardware.com/gallery/two-cypress-gpus,0101-230369-7179-0-0-0-jpg-.html

[80]: Tom’s Hardware Gallery, http://www.tomshardware.com/gallery/Bare-Radeon-HD-5970,0101-230349-7179-0-0-0-jpg-.html

[81]: CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA

[82]: GeForce Graphics Processors, Nvidia, http://www.nvidia.com/object/geforce_family.html

[83]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at Nvidia’s 2009 GPU Technology Conference, (GTC), Sept. 30 2009, http://www.nvidia.com/object/gpu_tech_conf_press_room.html

5. References (10)

[84]: Tom’s Hardware Gallery, http://www.tomshardware.com/gallery/SM,0101-110801-0-14-15-1-jpg-.html

[85]: Butler M., Bulldozer, a new approach to multithreaded compute performance, Hot Chips 22, Aug. 24 2010, http://www.hotchips.org/index.php?page=hot-chips-22

[86]: Voicu A., NVIDIA Fermi GPU and Architecture Analysis, Beyond 3D, 23rd Oct 2010, http://www.beyond3d.com/content/reviews/55/1

[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD, AnandTech, Dec. 15 2010, http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950

[89]: Christian, AMD renames ATI Stream SDK, updates it with APU, OpenCL 1.1 support, Jan. 27 2011, http://www.tcmagazine.com/tcm/news/software/34765/amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support

[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec. 2008, http://www.ele.uri.edu/courses/ele408/StreamGPU.pdf

[87]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology, AMD, March 9 2010, http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf

[91]: Programming Guide: ATI Stream Computing Compute Abstraction Layer (CAL), Revision 2.01, AMD, March 2010, http://developer.amd.com/gpu_assets/ATI_Stream_SDK_CAL_Programming_Guide_v2.0.pdf

[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008, http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf

[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Revision 1.2, AMD, Jan. 2011, http://developer.amd.com/gpu/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

5. References (11)

[95]: Behr D., Introduction to OpenCL PPAM 2009, Sept. 15 2009, http://gpgpu.org/wp/wp-content/uploads/2009/09/B1-OpenCL-Introduction.pdf

[96]: Gohara D.W. PhD, OpenCL Episode 2 – OpenCL Fundamentals, Aug. 26 2009, MacResearch, http://www.macresearch.org/files/opencl/Episode_2.pdf

[94]: An Introduction to OpenCL, AMD, http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx

[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 14 2010, http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=3

[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks, Dec. 14 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38899-amd-radeon-hd-6970-hd-6950-review-3.html

[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0, Febr. 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf

[100]: Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010, http://www.many-core.group.cam.ac.uk/ukgpucc2/talks/Howes.pdf

5. References (12)

[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a, AMD, Febr. 2011, http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

[102]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010

[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL), May 30 2008, http://developer.amd.com/gpu/amdappsdk/assets/ATI%20Stream%20Computing%20-%20ATI%20Intermediate%20Language.ppt#547,9

[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL), http://coachk.cs.ucf.edu/courses/CDA6938/AMD_course/M5%20-%20Introduction%20to%20CAL.pdf

[105]: Reference Guide: AMD Accelerated Parallel Processing Technology, AMD Intermediate Language (IL), Revision 2.0e, March 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Intermediate_Language_(IL)_Specification_v2.pdf

[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing Using Graphics Hardware and Conventional CPUs, AMD, 2007, http://www.ll.mit.edu/HPEC/agendas/proc07/Day3/10_Hensley_Abstract.pdf

[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Febr. 1 2008, http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01a.pdf

[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central, http://developer.amd.com/gpu/amdappsdk/pages/default.aspx

5. References (13)

[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champaign, Dec. 2009, http://www.ks.uiuc.edu/Research/gpu/gpucomputing.net

[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 6 2011, http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-AMD-APP-SDK.aspx

[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference Guide, AMD, Febr. 2011, http://developer.amd.com/gpu/amdappsdk/assets/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf

[111]: Introduction to OpenCL Programming, AMD, No. 137-41768-10, Rev. A, May 2010, http://developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_to_OpenCL_Programming%20Training_Guide%20(201005).pdf

[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub (GMCH) Datasheet, June 1999 ftp://download.intel.com/design/chipsets/datashts/29065602.pdf

[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, Daily Tech, Oct. 25 2006, http://www.dailytech.com/article.aspx?newsid=4696

[116]: Newell D., AMD Financial Analyst Day, Nov. 9 2010, http://www.rumorpedia.net/wp-content/uploads/2010/11/rumorpedia02.jpg

[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision Hardware, Nov. 10 2010, http://www.dvhardware.net/article46449.html

[115]: Grim B., AMD Fusion Family of APUs, Dec. 7 2010, http://www.mytechnology.eu/wp-content/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf

5. References (14)

[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with OpenCL™ 1.1 Support

[119]: Burgess B., „Bobcat” AMD’s New Low Power x86 Core Architecture, Aug. 24 2010, http://www.hotchips.org/uploads/archive22/HC22.24.730-Burgess-AMD-Bobcat-x86.pdf

[121]: Stokes J., AMD reveals Fusion CPU+GPU, to challenge Intel in laptops, Febr. 8 2010, http://arstechnica.com/business/news/2010/02/amd-reveals-fusion-cpugpu-to-challege-intel-in-laptops.ars

[120]: AMD Ontario APU pictures, Xtreme Systems, Sept. 3 2010, http://www.xtremesystems.org/forums/showthread.php?t=258499

[122]: AMD Unveils Future of Computing at Annual Financial Analyst Day, CDRinfo, Nov. 10 2010, http://www.cdrinfo.com/sections/news/Details.aspx?NewsId=28748

[123]: Shimpi A. L., The Intel Core i3 530 Review - Great for Overclockers & Gamers, AnandTech, Jan. 22 2010, http://www.anandtech.com/show/2921

[125]: Wikipedia: Intel GMA, 2011, http://en.wikipedia.org/wiki/Intel_GMA

[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100 Tested, AnandTech, Jan. 3 2011, http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested/11

[124]: Hagedoorn H., Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review, Jan. 3 2011, http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2

5. References (15)

[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2, Sept. 14 2010, http://www.siliconmadness.com/2010/09/amd-ontario-zacate-die-sizes-take-2.html

[128]: De Vries H., AMD Bulldozer, 8 core processor, Nov. 24 2010, http://chip-architect.com/

[130]: Huynh A. T., Final AMD "Stars" Models Unveiled, Daily Tech, May 4 2007, http://www.dailytech.com/Final+AMD+Stars+Models+Unveiled+/article7157.htm

[129]: Intel® 845G/845GL/845GV Chipset Datasheet: Intel® 82845G/82845GL/82845GV Graphics and Memory Controller Hub (GMCH), May 2002, http://www.intel.com/design/chipsets/datashts/290746.htm

[131]: AMD Fusion, Wikipedia, http://en.wikipedia.org/wiki/AMD_Fusion

[132]: Nita S., AMD Llano APU to Get Dual-GPU Technology Similar to Hybrid CrossFire, Softpedia, Jan. 21 2011, http://news.softpedia.com/news/AMD-Llano-APU-to-Get-Dual-GPU-Technology-Similar-to-Hybrid-CrossFire-179740.shtml

[134]: Karmehed A., The graphical performance of the AMD A series APUs, Nordic Hardware, March 16 2011, http://www.nordichardware.com/news/69-cpu-chipset/42650-the-graphical-performance-of-the-amd-a-series-apus.html

[133]: Jotwani R., Sundaram S., Kosonocky S., Schaefer A., Andrade V. F., Novak A., Naffziger S., An x86-64 Core in 32 nm SOI CMOS, IEEE Xplore, Jan. 2011, http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5624589

5. References (16)

[135]: Butler M., „Bulldozer” A new approach to multithreaded compute performance, Aug. 24 2010, http://www.hotchips.org/uploads/archive22/HC22.24.720-Butler-AMD-Bulldozer.pdf

[136]: „Bulldozer” and „Bobcat” AMD’s Latest x86 Core Innovations, HotChips22, http://www.slideshare.net/AMDUnprocessed/amd-hot-chips-bulldozer-bobcat-presentation-5041615

[137]: Altavilla D., Intel Arrandale Core i5 and Core i3 Mobile Unveiled, Hot Hardware, Jan. 04 2010, http://hothardware.com/Reviews/Intel-Arrandale-Core-i5-and-Core-i3-Mobile-Unveiled/

[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware, Review of the IDF San Francisco, Sept. 2009, http://akshaydodeja.com/intel-arrandale-high-performance-for-the-mass

[139]: Shimpi A. L., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed, AnandTech, Jan. 4 2010, http://www.anandtech.com/show/2902

[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem) Based Platform, Presentation ARCS001, IDF 2009

[140]: Chiappeta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware, Jan. 03 2010, http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/

[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010

5. References (17)

[143]: Valich T., Intel's "Anti AMD Fusion" Sandy Bridge CPU tapes out, July 5 2009, http://www.brightsideofnews.com/news/2009/7/5/intels-anti-amd-fusion-sandy-bridge-cpu-tapes-out.aspx