ESE 345 Computer Architecture

Data Level Parallelism (DLP) and Single-Instruction Multiple Data (SIMD) Computing

Computer Architecture

CA: DLP and SIMD

SIMD Architectures

Data-Level Parallelism (DLP): Executing one operation on multiple data streams

Example: Multiplying a coefficient vector by a data vector (e.g. in filtering)

y[i] := c[i] × x[i], 0 ≤ i < n

Sources of performance improvement:

One instruction is fetched & decoded for the entire operation

Multiplications are known to be independent

Pipelining/concurrency in memory access as well
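
The coefficient-by-data product above can be written as a plain C loop. This is a minimal sketch; the function name vec_mul and the use of float are assumptions for illustration, not taken from the slides. No iteration depends on another, which is the property that lets one SIMD instruction cover several elements.

```c
#include <stddef.h>

/* Element-wise product y[i] = c[i] * x[i] for 0 <= i < n.
   vec_mul is a hypothetical name used only for this sketch.
   Every iteration is independent of the others, so several
   elements can be processed by a single SIMD instruction. */
void vec_mul(float *y, const float *c, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = c[i] * x[i];
    }
}
```

With 128-bit registers, four of these single-precision multiplications fit in one instruction, so the loop needs roughly a quarter of the fetched and decoded instructions.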

Intel’s SIMD Instructions

X86 SIMD Processing Extensions (1)

MMX – Intel Pentium MMX 1997

8 64-bit integer registers aliased with x87 FP regs

3DNow! – AMD K6-2 1998

Similar to MMX

extended with 21 SP FP ops

SSE (Streaming SIMD Extensions) – Intel Pentium III 1999

Single-precision floating-point instructions

8 new 128 bit XMM registers

SSE-2 – Intel Pentium 4 2001

Double-precision floating-point ops

X86 SIMD Processing Extensions (2)

SSE3 – 2004, SSE4 – 2007

AVX (Advanced Vector Extensions) – proposed by Intel in 2008

Intel Sandy Bridge processor – 2011

16 256-bit registers

AVX2 – Intel Haswell processor 2013

256-bit SSE and AVX ops

Fused multiply-add (FMA3)

AVX-512 Intel Knights Landing processor 2016

32 512-bit regs

4 operand instructions

Example: SIMD Array Processing

SSE Instruction Categories for Multimedia Support

SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types

XMM Registers

SSE/SSE2 Floating Point Instructions

Example: Add Single Precision FP Vectors

Packed and Scalar Double-Precision Floating-Point Operations

Example: Image Converter (1/5)

Example: Image Converter (2/5)

Example: Image Converter (3/5)

Example: Image Converter (4/5)

Example: Image Converter (5/5)

Intel SSE Intrinsics

Sample of SSE Intrinsics

Example: 2 × 2 Matrix Multiply

Inner loop from gcc -O -S

Performance-Driven ISA Extensions

Architectures for Data Parallelism

The Current Landscape: Chips that deliver TeraOps/s in 2014, and how they differ.

E5-26xx v2: Stretching the Xeon server approach for compute-intensive apps.

GK110: nVidia’s Kepler GPU, customized for compute applications.

SONY/IBM PS3 Cell processor in 2006.

GM107: nVidia’s Maxwell GPU, customized for energy-efficiency.

Sony/IBM Playstation PS3 Cell Chip - Released 2006

8 SPEs, 3.2 GHz clock, 200 GigaOps/s (peak)

Sony PS3 Cell Processor SPE Floating-Point Unit

4 single-precision multiply-adds issue in lockstep (SIMD, Single-Instruction Multiple-Data) per cycle.

6-cycle latency

3.2 GHz clock --> 25.6 SP GFLOPS
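
The 25.6 SP GFLOPS figure follows from the lane count and the clock rate. The sketch below spells out that arithmetic; peak_gflops and its parameter names are hypothetical, and it assumes each multiply-add is counted as two floating-point operations.

```c
/* Peak throughput in GFLOPS = units * lanes * flops_per_lane * clock_ghz.
   Assumption: a multiply-add counts as 2 floating-point operations.
   peak_gflops is a hypothetical helper, not part of any library. */
double peak_gflops(int units, int lanes, int flops_per_lane, double clock_ghz)
{
    return (double)units * lanes * flops_per_lane * clock_ghz;
}
```

Under that assumption, peak_gflops(1, 4, 2, 3.2) gives 25.6 for one SPE, and peak_gflops(8, 4, 2, 3.2) gives 204.8 for 8 SPEs, which is consistent with the roughly 200 GigaOps/s peak quoted for the chip.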

Intel Xeon Ivy Bridge E5-2697v2 (2013)

12-core Xeon Ivy Bridge

0.52 TeraOps/s (130W)

12 cores @ 2.7 GHz

Each core can issue 16 single-precision operations per cycle.

$2,600 per chip

Haswell: 1.04 TeraOps/s

Intel E5-2697v2 vs. Haswell

12 cores @ 2.7 GHz

Each core can issue 16 single-precision operations per cycle.

Haswell cores issue 32 SP FLOPS/cycle.

How?

Advanced Vector Extension (AVX) unit

Relative area has increased in Haswell

Smaller than L3 cache, but larger than L2 cache.

Die closeup of one Sandy Bridge core

AVX: Not Just Single-precision Floating-point

256-bit version -> double-precision vectors of length 4

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.
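
One way to picture these interpretations is a C union over the same 16 bytes. This is only an illustration: vec128_t and lanes are hypothetical names, and the union is not how the hardware or the intrinsics headers define their vector types.

```c
#include <stdint.h>

/* The same 128 bits viewed under different element types.
   vec128_t is a hypothetical type used only for illustration. */
typedef union {
    float   f32[4];   /* 4 single-precision floats  */
    double  f64[2];   /* 2 double-precision floats  */
    int8_t  i8[16];   /* 16 8-bit integers          */
    int16_t i16[8];   /* 8 16-bit integers          */
    int32_t i32[4];   /* 4 32-bit integers          */
} vec128_t;

/* Number of elements of a given width that fit in one register. */
unsigned lanes(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}
```

For example, lanes(128, 32) is 4 and lanes(128, 8) is 16, while lanes(256, 64) is 4, matching the double-precision vectors of length 4 in the 256-bit version.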

Sandy Bridge, Haswell (2014)

Sandy Bridge extends register set to 256 bits: vectors are twice the size.

x86-64 AVX/AVX2 has 16 registers (IA-32: 8)

Haswell adds 3-operand instructions a*b + c (Fused multiply-add (FMA))

2 EX units with FMA --> 2X increase in ops/cycle
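
The doubling can be checked with a short sketch. The names mul_add and sp_flops_per_cycle are hypothetical, and the breakdown into units and lanes is an assumption for illustration: 8 single-precision lanes per 256-bit register, with the earlier core modeled as one multiply unit plus one add unit.

```c
/* d[i] = a[i] * b[i] + c[i]: one multiply and one add per element,
   which a fused multiply-add instruction performs as one operation.
   (A compiler may or may not emit an FMA instruction for this loop.) */
void mul_add(float *d, const float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++) {
        d[i] = a[i] * b[i] + c[i];
    }
}

/* Single-precision operations per cycle for one core, modeled as
   units * lanes * ops_per_lane (a simplification, not a measurement). */
int sp_flops_per_cycle(int units, int lanes, int ops_per_lane)
{
    return units * lanes * ops_per_lane;
}
```

Under these assumptions, sp_flops_per_cycle(2, 8, 1) is 16 for separate multiply and add units, and sp_flops_per_cycle(2, 8, 2) is 32 for two FMA units.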

Out of Order Issue in Haswell (2014)

Haswell sustains 4 micro-op issues per cycle.

One possibility: 2 for AVX, and 2 for Loads, Stores and book-keeping.

Haswell has two copies of the FMA engine, on separate ports.

Graphical Processing Units

Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Basic idea:

Heterogeneous execution model

CPU is the host, GPU is the device

Develop a C-like programming language for GPU

Unify all forms of GPU parallelism as CUDA thread

Programming model is “Single Instruction Multiple Thread”

Kepler GK 110 nVidia GPU

5.12 TeraOps/s

2880 MACs @ 889 MHz (single-precision multiply-adds)

$999: GTX Titan Black with 6GB GDDR5 (and 1 GPU)

Applications

Multimedia processing

Video compression

Graphics

Image processing

Simulations

Engineering tools

CAD

Cryptography

Etc…

SIMD Summary

Intel SSE/AVX SIMD Instructions

One instruction fetch that operates on multiple operands simultaneously

512/256/128/64-bit multimedia registers (including XMM & YMM)

Embed the SSE machine instructions directly into C programs through use of intrinsics
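
As a minimal sketch of the intrinsics approach, the function below adds two float arrays four elements at a time. The function name vec_add is an assumption; the intrinsics _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps come from the standard xmmintrin.h header. The code assumes a compiler that defines __SSE__ when SSE is available and falls back to scalar code otherwise.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* z[i] = x[i] + y[i], four floats per SSE instruction where available.
   vec_add is a hypothetical name used only for this sketch. */
void vec_add(float *z, const float *x, const float *y, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);           /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(z + i, _mm_add_ps(vx, vy));  /* packed single-precision add */
    }
#endif
    for (; i < n; i++) {                           /* scalar tail or fallback */
        z[i] = x[i] + y[i];
    }
}
```

The unaligned load and store forms are used so the sketch works for any array address; aligned variants exist but require 16-byte aligned data.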

ESE345 Project: Pipelined Multimedia SIMD Unit Block Diagram

128-bit Multimedia ALU

Three 128-bit Inputs

One 128-bit Output

Packed Register Format

Word: Four 32-bit Fields

Halfword (HW): Eight 16-bit Fields

Word format:

Bits    127-96   95-64    63-32    31-0
Field   Word 3   Word 2   Word 1   Word 0

Halfword format:

Bits    127-112  111-96  95-80  79-64  63-48  47-32  31-16  15-0
Field   HW 7     HW 6    HW 5   HW 4   HW 3   HW 2   HW 1   HW 0

Example: Add instruction

Each instruction operates on every field of the specified format

Fields are treated as separate registers

Carry does not enter the field to the left

    a3      a2      a1      a0
+   b3      b2      b1      b0
=   a3+b3   a2+b2   a1+b1   a0+b0
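
A behavioral sketch of a packed add on halfword fields is given below. The type reg128_hw and the function add_halfwords are hypothetical names, and wraparound on overflow is an assumption made for this example; the slides only state that a carry does not enter the field to the left.

```c
#include <stdint.h>

/* A 128-bit packed register viewed as eight 16-bit halfword fields.
   hw[0] is HW 0 (bits 15-0), hw[7] is HW 7 (bits 127-112).
   reg128_hw is a hypothetical type used only for this sketch. */
typedef struct { uint16_t hw[8]; } reg128_hw;

/* Packed halfword add. Each field is added on its own, so a carry out
   of one field is dropped instead of entering the field to its left.
   Assumption: overflow wraps around within the field. */
reg128_hw add_halfwords(reg128_hw a, reg128_hw b)
{
    reg128_hw r;
    for (int i = 0; i < 8; i++) {
        r.hw[i] = (uint16_t)(a.hw[i] + b.hw[i]);
    }
    return r;
}
```

Adding 0xFFFF and 1 in HW 0 leaves 0 in that field while HW 1 holds only its own sum, which is the difference from a single 128-bit addition.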

Load Immediate

Unlike MIPS, an Immediate value can be loaded directly

Need to specify by Load Index which field to place Immediate into

Can only load 16 bits (halfword) at a time
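
The load-immediate behavior described above might be modeled as follows. The names reg128_hw and load_immediate are hypothetical, and two details are assumptions not stated on the slide: the load index runs from 0 to 7 (one value per halfword field), and the other seven fields keep their previous contents.

```c
#include <stdint.h>

/* A 128-bit register viewed as eight 16-bit halfword fields
   (hypothetical type used only for this sketch). */
typedef struct { uint16_t hw[8]; } reg128_hw;

/* Place a 16-bit immediate into the halfword field selected by
   load_index. Assumptions: load_index is 0..7 (masked to 3 bits here)
   and the remaining fields of rd are left unchanged. */
reg128_hw load_immediate(reg128_hw rd, unsigned load_index, uint16_t imm)
{
    rd.hw[load_index & 7u] = imm;
    return rd;
}
```

Filling a whole 128-bit register under this model takes eight such loads, one per halfword field.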

R4 Instructions

Multiply and Add/Subtract Instructions

Takes 3 Inputs

Multiplication Field Half-Size of Addition/Subtraction

This half is determined by High/Low bit

Supports two different multiplication sizes

With Saturation
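
A possible reading of the multiply-add case is sketched below for the smaller size: 16-bit halves are multiplied and the 32-bit product is added to a 32-bit field with saturation. The operand roles (rs1 as addend, rs2 and rs3 as multiplicands), the signed interpretation, and all names are assumptions for illustration; the project specification is the authority on the exact semantics.

```c
#include <stdint.h>

/* A 128-bit register viewed as four signed 32-bit word fields
   (hypothetical type used only for this sketch). */
typedef struct { int32_t w[4]; } reg128_w;

/* Clamp a wide intermediate result to the signed 32-bit range. */
static int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Extract the low (high == 0) or high (high != 0) 16-bit half of a
   32-bit field as a signed value. The uint16_t to int16_t conversion
   relies on the usual two's-complement behavior of common compilers. */
static int16_t half_of(int32_t field, int high)
{
    uint32_t u = (uint32_t)field;
    return (int16_t)(uint16_t)(high ? (u >> 16) : (u & 0xFFFFu));
}

/* For each word field: rd = saturate(rs1 + half(rs2) * half(rs3)). */
reg128_w mul_add_sat(reg128_w rs1, reg128_w rs2, reg128_w rs3, int high)
{
    reg128_w rd;
    for (int i = 0; i < 4; i++) {
        int64_t product = (int64_t)half_of(rs2.w[i], high) * half_of(rs3.w[i], high);
        rd.w[i] = saturate32((int64_t)rs1.w[i] + product);
    }
    return rd;
}
```

The 64-bit intermediate keeps the sum exact before clamping, so an overflowing field sticks at the largest or smallest representable value instead of wrapping around. The larger multiplication size and the subtract variants would follow the same pattern with wider fields.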

Acknowledgements

These slides contain material developed and copyrighted by:

Morgan Kaufmann (Elsevier, Inc.)

Arvind (MIT)

Krste Asanovic (MIT/UCB)

Joel Emer (Intel/MIT)

James Hoe (CMU)

John Kubiatowicz (UCB)

David Patterson (UCB)

Justin Hsia (UCB)

Mikhail Dorojevets (SBU)
