ESE 345 Computer Architecture

Data Level Parallelism (DLP) and Single-Instruction Multiple Data (SIMD) Computing

Computer Architecture

CA: DLP and SIMD

SIMD Architectures

Data-Level Parallelism (DLP): Executing one operation on multiple data streams

Example: Multiplying a coefficient vector by a data vector (e.g. in filtering)

y[i] := c[i] × x[i], 0 ≤ i < n

Sources of performance improvement:

One instruction is fetched & decoded for the entire operation

Multiplications are known to be independent

Pipelining/concurrency in memory access as well
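
The element-wise product above can be sketched in Python as a scalar loop next to a version that consumes the vectors in fixed-width groups, the way a single SIMD instruction would. This only models the data flow and is not real SIMD code; the names `scalar_multiply` and `simd_multiply` and the 4-lane width are assumptions chosen for illustration.

```python
LANES = 4  # assumed vector width: four elements per "instruction"


def scalar_multiply(c, x):
    """One multiply per loop iteration: y[i] = c[i] * x[i]."""
    return [c[i] * x[i] for i in range(len(x))]


def simd_multiply(c, x, lanes=LANES):
    """Process `lanes` independent multiplies per step, as one SIMD op would."""
    y = []
    steps = 0
    for base in range(0, len(x), lanes):
        # All products in this group are independent, so hardware could
        # compute them at the same time from a single decoded instruction.
        group_c = c[base:base + lanes]
        group_x = x[base:base + lanes]
        y.extend(ci * xi for ci, xi in zip(group_c, group_x))
        steps += 1
    return y, steps


c = [1, 2, 3, 4, 5, 6, 7, 8]
x = [10, 20, 30, 40, 50, 60, 70, 80]
y, steps = simd_multiply(c, x)
```

With n = 8 and four lanes, the grouped version needs 2 steps where the scalar loop needs 8 iterations, and both produce the same `y`.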

Intel’s SIMD Instructions

X86 SIMD Processing Extensions (1)

MMX – Intel Pentium MMX and Pentium II, 1997

8 64-bit integer registers aliased with x87 FP regs

3DNow! – AMD K6-2 1998

Similar to MMX

extended with 21 SP FP ops

SSE (Streaming SIMD Extensions) – Intel Pentium III 1999

Single-precision floating-point instructions

8 new 128-bit XMM registers

SSE2 – Intel Pentium 4 2001

Double-precision floating-point ops

X86 SIMD Processing Extensions (2)

SSE3 – 2004, SSE4 – 2007

AVX (Advanced Vector Extensions) – proposed by Intel and AMD in 2008

Intel Sandy Bridge processor – 2011

16 256-bit registers

AVX2 – Intel Haswell processor 2013

256-bit SSE and AVX ops

Fused multiply-add (FMA3)

AVX-512 – Intel Knights Landing processor 2016

32 512-bit regs

4-operand instructions

Example: SIMD Array Processing

SSE Instruction Categories for Multimedia Support

SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types

XMM Registers

SSE/SSE2 Floating Point Instructions

Example: Add Single Precision FP Vectors

Packed and Scalar Double-Precision Floating-Point Operations

Example: Image Converter (1/5)

Example: Image Converter (2/5)

Example: Image Converter (3/5)

Example: Image Converter (4/5)

Example: Image Converter (5/5)

Intel SSE Intrinsics

Sample of SSE Intrinsics

Example: 2 × 2 Matrix Multiply

Inner loop from gcc -O -S

Performance-Driven ISA Extensions

Architectures for Data Parallelism

The Current Landscape: Chips that deliver TeraOps/s in 2014, and how they differ.

E5-26xx v2: Stretching the Xeon server approach for compute-intensive apps.

GK110: nVidia’s Kepler GPU, customized for compute applications.

Sony/IBM PS3 Cell processor, released in 2006.

GM107: nVidia’s Maxwell GPU, customized for energy-efficiency.

Sony/IBM Playstation PS3 Cell Chip - Released 2006

8 SPEs, 3.2 GHz clock, 200 GigaOps/s (peak)

Sony PS3 Cell Processor SPE Floating-Point Unit

4 single-precision multiply-adds issue in lockstep (SIMD) per cycle.

6-cycle latency

Single-Instruction Multiple-Data

3.2 GHz clock --> 25.6 SP GFLOPS
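
The 25.6 GFLOPS figure can be checked with a few lines of arithmetic. The sketch below assumes that each multiply-add counts as two floating-point operations and that the unit is fully pipelined, so the 6-cycle latency does not lower the peak rate; the helper name `peak_gflops` is made up for illustration.

```python
def peak_gflops(lanes, ops_per_lane, clock_ghz, units=1):
    """Peak rate = parallel units x lanes x ops per lane per cycle x clock (GHz)."""
    return units * lanes * ops_per_lane * clock_ghz


# One SPE: 4 single-precision lanes, each multiply-add counted as 2 operations.
spe_gflops = peak_gflops(lanes=4, ops_per_lane=2, clock_ghz=3.2)

# Whole chip: 8 SPEs, as listed for the Cell chip.
chip_gflops = peak_gflops(lanes=4, ops_per_lane=2, clock_ghz=3.2, units=8)
```

Under these assumptions one SPE reaches 25.6 GFLOPS and eight SPEs reach 204.8 GFLOPS, which is consistent with the rounded 200 GigaOps/s peak quoted for the chip.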

Intel Xeon Ivy Bridge E5-2697v2 (2013): 12-core Xeon Ivy Bridge

0.52 TeraOps/s (130W)

12 cores @ 2.7 GHz

Each core can issue 16 single-precision operations per cycle.

$2,600 per chip

Haswell: 1.04 TeraOps/s

Intel E5-2697v2 vs. Haswell

12 cores @ 2.7 GHz

Each core can issue 16 single-precision operations per cycle.

Haswell cores issue 32 SP FLOPS/cycle.

How?

Advanced Vector Extension (AVX) unit

Relative area has increased in Haswell

Smaller than L3 cache, but larger than L2 cache.

Die closeup of one Sandy Bridge core

AVX: Not Just Single-precision Floating-point

256-bit version -> double-precision vectors of length 4

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.
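
The idea that the same 128 bits can be read as different packed types can be illustrated with Python's standard `struct` module. This is only a model of the register contents, assuming little-endian byte order as on x86; no SIMD instruction is executed.

```python
import struct

# Pack four single-precision values into 16 bytes (128 bits), little-endian.
register = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

# The same 16 bytes viewed through different packed data types.
as_floats = struct.unpack("<4f", register)      # 4 x 32-bit floats
as_doubles = struct.unpack("<2d", register)     # 2 x 64-bit doubles
as_halfwords = struct.unpack("<8H", register)   # 8 x 16-bit unsigned integers
as_bytes = struct.unpack("<16B", register)      # 16 x 8-bit unsigned integers
```

The bits never change; only the instruction variant (here, the format string) decides where the field boundaries fall. Reading float data as doubles gives numerically unrelated values, which is why the instruction must match the data type that was stored.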

Sandy Bridge, Haswell (2014)

Sandy Bridge extends register set to 256 bits: vectors are twice the size.

In 64-bit mode (Intel 64 / x86-64), AVX/AVX2 has 16 registers (IA-32: 8)

Haswell adds 3-operand fused multiply-add (FMA) instructions: a*b + c

2 EX units with FMA --> 2X increase in ops/cycle
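
The per-core and per-chip numbers quoted for these processors can be reproduced as a short calculation. The sketch assumes that Sandy Bridge / Ivy Bridge cores issue one 256-bit multiply and one 256-bit add per cycle, that each Haswell FMA counts as two operations per lane, and that the Haswell figure uses the same 12 cores at 2.7 GHz; the variable and function names are illustrative only.

```python
AVX_BITS = 256
SP_BITS = 32
lanes = AVX_BITS // SP_BITS  # 8 single-precision lanes per 256-bit register

# Assumption: Sandy Bridge / Ivy Bridge issue one 256-bit multiply and one
# 256-bit add per cycle (two units, one operation per lane each).
ivy_bridge_flops_per_cycle = lanes * 1 * 2

# Haswell: two FMA units, and each FMA counts as 2 operations per lane.
haswell_flops_per_cycle = lanes * 2 * 2


def chip_teraops(cores, flops_per_cycle, clock_ghz):
    """Peak single-precision TeraOps/s for the whole chip."""
    return cores * flops_per_cycle * clock_ghz / 1000.0


e5_2697v2 = chip_teraops(12, ivy_bridge_flops_per_cycle, 2.7)
haswell_same_config = chip_teraops(12, haswell_flops_per_cycle, 2.7)
```

This gives 16 and 32 operations per cycle per core, and roughly 0.52 and 1.04 TeraOps/s for a 12-core chip at 2.7 GHz, matching the quoted figures.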

Out of Order Issue in Haswell (2014)

Haswell sustains 4 micro-op issues per cycle.

One possibility: 2 for AVX, and 2 for loads, stores and book-keeping.

Haswell has two copies of the FMA engine, on separate ports.

Graphical Processing Units

Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Basic idea:

Heterogeneous execution model

CPU is the host, GPU is the device

Develop a C-like programming language for GPU

Unify all forms of GPU parallelism as the CUDA thread

Programming model is “Single Instruction Multiple Thread”

Kepler GK 110 nVidia GPU

5.12 TeraOps/s

2880 MACs @ 889 MHz (single-precision multiply-adds)

$999: GTX Titan Black with 6GB GDDR5 (and 1 GPU)
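
As a rough cross-check of the GPU figure and of how it compares with the 12-core Xeon E5-2697v2 described earlier in these slides, the sketch below recomputes both peaks and divides by the quoted prices. It assumes each multiply-add counts as two operations; the GigaOps-per-dollar metric and the names used are illustrative, not part of the original slides.

```python
def teraops(units, ops_per_unit_per_cycle, clock_ghz):
    """Peak TeraOps/s = units x ops per unit per cycle x clock (GHz) / 1000."""
    return units * ops_per_unit_per_cycle * clock_ghz / 1000.0


# GK110: 2880 multiply-add units, 2 operations each per cycle, 889 MHz.
gk110 = teraops(2880, 2, 0.889)

# Xeon E5-2697v2: 12 cores x 16 operations per cycle at 2.7 GHz.
xeon = teraops(12, 16, 2.7)

# GigaOps/s per dollar at the prices quoted on the slides ($999 and $2,600).
gk110_gops_per_dollar = gk110 * 1000 / 999
xeon_gops_per_dollar = xeon * 1000 / 2600
```

The GPU peak comes out at about 5.12 TeraOps/s, roughly ten times the Xeon's peak at well under half the price. These are peak numbers only; sustained performance depends on how well an application maps onto each architecture.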

Applications

Multimedia processing

Video compression

Graphics

Image processing

Simulations

Engineering tools

CAD

Cryptography

Etc…

SIMD Summary

Intel SSE/AVX SIMD Instructions

One instruction fetch that operates on multiple operands simultaneously

512/256/128/64-bit multimedia (XMM & YMM) registers

Embed the SSE machine instructions directly into C programs through use of intrinsics

ESE345 Project: Pipelined Multimedia SIMD Unit Block Diagram

128-bit Multimedia ALU

Three 128-bit Inputs

One 128-bit Output

Packed Register Format

Word: Four 32-bit Fields

Halfword (HW): Eight 16-bit Fields

Bits   127:96  95:64   63:32   31:0
Field  Word 3  Word 2  Word 1  Word 0

Bits   127:112  111:96  95:80  79:64  63:48  47:32  31:16  15:0
Field  HW 7     HW 6    HW 5   HW 4   HW 3   HW 2   HW 1   HW 0

Example: Add instruction

Each instruction operates independently on every field of the specified size

Fields are treated as separate registers

Carry does not enter field to left

  a3     a2     a1     a0
+ b3     b2     b1     b0
= a3+b3  a2+b2  a1+b1  a0+b0
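
The rule that a carry never enters the field to its left can be modeled by treating the 128-bit register as a Python integer and adding one field at a time. This is a sketch under the assumption that a plain (non-saturating) add wraps around within its field; `packed_add` and `pack` are hypothetical helper names.

```python
def packed_add(a, b, field_bits=32, reg_bits=128):
    """Add two registers field by field; carries never cross a field boundary."""
    mask = (1 << field_bits) - 1
    result = 0
    for shift in range(0, reg_bits, field_bits):
        fa = (a >> shift) & mask
        fb = (b >> shift) & mask
        # Assumed wrap-around within the field: the carry out is discarded.
        result |= ((fa + fb) & mask) << shift
    return result


def pack(fields, field_bits=32):
    """Build a register value from fields listed high to low (a3, a2, a1, a0)."""
    value = 0
    for f in fields:
        value = (value << field_bits) | (f & ((1 << field_bits) - 1))
    return value


a = pack([1, 2, 3, 0xFFFFFFFF])
b = pack([10, 20, 30, 1])
packed = packed_add(a, b)
plain = (a + b) & ((1 << 128) - 1)  # ordinary 128-bit add, for contrast
```

Here field 0 overflows: the packed add leaves field 1 at 33, while the ordinary 128-bit add lets the carry ripple into field 1 and produces 34.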

Load Immediate

Unlike MIPS, an Immediate value can be loaded directly

A Load Index specifies which field receives the Immediate

Can only load 16 bits (halfword) at a time
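
A minimal Python model of this behavior is sketched below. It assumes that the Load Index selects one of the eight halfword fields (0 to 7, with HW 0 in bits 15:0) and that the other seven fields keep their previous contents; neither detail is stated on the slide, and `load_immediate` is a hypothetical name.

```python
REG_BITS = 128
HW_BITS = 16


def load_immediate(reg, load_index, imm16):
    """Place a 16-bit immediate into halfword field `load_index` (0..7).

    Assumption: the other seven halfword fields keep their old contents.
    """
    if not 0 <= load_index < REG_BITS // HW_BITS:
        raise ValueError("load index must be in 0..7")
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    shift = load_index * HW_BITS
    cleared = reg & ~(0xFFFF << shift)  # zero the target field only
    return cleared | (imm16 << shift)


# Building the 32-bit value 0x12345678 in Word 0 takes two loads.
r = 0
r = load_immediate(r, 0, 0x5678)  # HW 0 = bits 15:0
r = load_immediate(r, 1, 0x1234)  # HW 1 = bits 31:16
```

Because only 16 bits can be loaded at a time, filling a complete 128-bit register this way would take eight such instructions.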

R4 Instructions

Multiply and Add/Subtract Instructions

Takes 3 Inputs

Multiplication operand fields are half the size of the addition/subtraction fields

The High/Low bit selects which half of each field is multiplied

Supports two different multiplication sizes

With Saturation
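
One way to picture such an instruction is the Python sketch below, which covers only one of the two multiplication sizes. It assumes signed arithmetic, 32-bit accumulate fields, 16-bit multiplication operands taken from the high or low half of each field, and clamping to the signed 32-bit range; the function names and operand order are hypothetical and not taken from the project specification.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def to_signed(value, bits):
    """Interpret the low `bits` of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate32(value):
    """Clamp to the signed 32-bit range instead of wrapping around."""
    return max(INT32_MIN, min(INT32_MAX, value))


def multiply_add_word(acc, b, c, high=False):
    """Return the 32-bit field sat(acc + half(b) * half(c)).

    `high` plays the role of the High/Low bit: it picks the upper or lower
    16 bits of each 32-bit source field as the multiplication operand.
    """
    shift = 16 if high else 0
    product = to_signed(b >> shift, 16) * to_signed(c >> shift, 16)
    return saturate32(to_signed(acc, 32) + product) & 0xFFFFFFFF


def multiply_add(r1, r2, r3, high=False):
    """Apply multiply_add_word to each of the four 32-bit fields."""
    out = 0
    for i in range(4):
        s = 32 * i
        out |= multiply_add_word(r1 >> s, r2 >> s, r3 >> s, high) << s
    return out
```

For example, `multiply_add_word(0x7FFFFFFF, 2, 2)` stays at `0x7FFFFFFF` rather than wrapping to a negative value, which is the point of saturation.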

Acknowledgements

These slides contain material developed and copyright by:

Morgan Kaufmann (Elsevier, Inc.)

Arvind (MIT)

Krste Asanovic (MIT/UCB)

Joel Emer (Intel/MIT)

James Hoe (CMU)

John Kubiatowicz (UCB)

David Patterson (UCB)

Justin Hsia (UCB)

Mikhail Dorojevets (SBU)
