ESE 345 Computer Architecture
Data Level Parallelism (DLP) and Single-Instruction Multiple Data (SIMD) Computing
CA: DLP and SIMD
SIMD Architectures
Data-Level Parallelism (DLP): executing one operation on multiple data streams
Example: multiplying a coefficient vector by a data vector (e.g., in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement:
One instruction is fetched & decoded for the entire operation
The multiplications are known to be independent
Pipelining/concurrency in memory access as well
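The filtering example above can be sketched with SSE intrinsics in C — one minimal version, where `vec_mul` is a hypothetical helper name, not code from the slides:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: y[i] = c[i] * x[i], four floats per SIMD multiply. */
void vec_mul(float *y, const float *c, const float *x, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 vc = _mm_loadu_ps(&c[i]);           /* load 4 coefficients  */
        __m128 vx = _mm_loadu_ps(&x[i]);           /* load 4 data elements */
        _mm_storeu_ps(&y[i], _mm_mul_ps(vc, vx));  /* 4 multiplies at once */
    }
    for (; i < n; i++)                             /* scalar tail          */
        y[i] = c[i] * x[i];
}
```

One fetched instruction (`mulps`) replaces four scalar multiplies, matching the "one fetch & decode for the entire operation" point above.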
Intel’s SIMD Instructions
X86 SIMD Processing Extensions (1)
MMX – Intel Pentium II, 1997
Eight 64-bit integer registers aliased with the x87 FP registers
3DNow! – AMD K6-2, 1998
Similar to MMX, extended with 21 single-precision FP ops
SSE (Streaming SIMD Extensions) – Intel Pentium III, 1999
Single-precision floating-point instructions
Eight new 128-bit XMM registers
SSE-2 – Intel Pentium 4, 2001
Double-precision floating-point ops
X86 SIMD Processing Extensions (2)
SSE3 – 2004; SSE4 – 2007
AVX (Advanced Vector Extensions) – proposed by Intel and AMD in 2008
Intel Sandy Bridge processor, 2011
16 256-bit YMM registers
AVX2 – Intel Haswell processor, 2013
256-bit SSE and AVX ops
Fused multiply-add (FMA3)
AVX-512 – Intel Knights Landing processor, 2016
32 512-bit registers
4-operand instructions
Example: SIMD Array Processing
SSE Instruction Categories for Multimedia Support
SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands
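The 16 × 8-bit operand case can be illustrated with an SSE2 intrinsic — a sketch using the saturating byte add common in pixel math (the helper name is an assumption, not from the slides):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add 16 unsigned bytes in one instruction,
 * with unsigned saturation (sums clamp at 255 instead of wrapping). */
void add_bytes_sat(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
}
```

The same 128-bit register can equally be treated as 8 × 16-bit operands by switching to the `_epi16` instruction variants.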
Intel Architecture SSE2+ 128-Bit SIMD Data Types
XMM Registers
SSE/SSE2 Floating-Point Instructions
Example: Add Single Precision FP Vectors
Packed and Scalar Double-Precision Floating-Point Operations
Example: Image Converter (slides 1/5 – 5/5)
Intel SSE Intrinsics
Sample of SSE Intrinsics
Example: 2 × 2 Matrix Multiply (slides 1–5)
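The worked code for this example did not survive conversion; a minimal sketch in its spirit, using SSE2 packed-double intrinsics and assuming column-major storage (the function name and layout are assumptions):

```c
#include <emmintrin.h>  /* SSE2: packed double-precision intrinsics */

/* C = A * B for 2x2 matrices of doubles, stored column-major.
 * Each __m128d holds one two-element column of the result. */
void mm_2x2(double C[4], const double A[4], const double B[4])
{
    __m128d c0 = _mm_setzero_pd(), c1 = _mm_setzero_pd();
    for (int k = 0; k < 2; k++) {
        __m128d ak = _mm_loadu_pd(&A[2 * k]);                       /* column k of A */
        c0 = _mm_add_pd(c0, _mm_mul_pd(ak, _mm_set1_pd(B[k])));     /* scale by B(k,0) */
        c1 = _mm_add_pd(c1, _mm_mul_pd(ak, _mm_set1_pd(B[2 + k]))); /* scale by B(k,1) */
    }
    _mm_storeu_pd(&C[0], c0);  /* column 0 of C */
    _mm_storeu_pd(&C[2], c1);  /* column 1 of C */
}
```

Each result column is built as a SIMD sum of scaled columns of A, so both elements of a column are computed by the same packed instruction.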
Inner loop from gcc -O -S
Performance-Driven ISA Extensions
Architectures for Data Parallelism
The Current Landscape: Chips that deliver TeraOps/s in 2014, and how they differ.
E5-26xx v2: Stretching the Xeon server approach for compute-intensive apps.
GK110: nVidia’s Kepler GPU, customized for compute applications.
Sony/IBM PS3 Cell processor in 2006.
GM107: nVidia’s Maxwell GPU, customized for energy-efficiency.
Sony/IBM Playstation PS3 Cell Chip - Released 2006
8 SPEs, 3.2 GHz clock, ~200 GigaOps/s (peak)
(8 SPEs × 4 madds/cycle × 2 ops/madd × 3.2 GHz ≈ 205 GigaOps/s)
Sony PS3 Cell Processor SPE Floating-Point Unit
Four single-precision multiply-adds issue in lockstep (SIMD) per cycle
6-cycle latency
3.2 GHz clock → 4 madds × 2 ops/madd × 3.2 GHz = 25.6 SP GFLOPS
Intel Xeon Ivy Bridge E5-2697v2 (2013): 12-core Xeon Ivy Bridge
0.52 TeraOps/s (130 W)
12 cores @ 2.7 GHz
Each core can issue 16 single-precision operations per cycle
(12 cores × 16 ops × 2.7 GHz ≈ 0.52 TeraOps/s)
$2,600 per chip
Haswell: 1.04 TeraOps/s
Intel E5-2697v2 vs. Haswell
12 cores @ 2.7 GHz
Each core can issue 16 single-precision operations per cycle.
Haswell cores issue 32 SP FLOPS/cycle. How?
Advanced Vector Extension (AVX) unit
Die closeup of one Sandy Bridge core: the AVX unit's relative area has increased in Haswell; it is smaller than the L3 cache, but larger than the L2 cache.
AVX: Not Just Single-precision Floating-point
256-bit version → double-precision vectors of length 4
AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.
Sandy Bridge, Haswell (2014)
CA: DLP and SIMD
Sandy Bridge extends the register set to 256 bits: vectors are twice the size.
x86-64 AVX/AVX2 has 16 registers (IA-32: 8)
Haswell adds 3-operand fused multiply-add (FMA) instructions: a*b + c
2 EX units with FMA → 2× increase in ops/cycle
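The point of an FMA is that a*b + c is computed with a single rounding step instead of two. A self-contained sketch of that semantics for single precision, using double arithmetic to hold the exact product (an illustration of the idea, not the hardware instruction itself):

```c
/* Emulate a single-precision fused multiply-add: the 24-bit x 24-bit
 * product is exact in a double (48 <= 53 mantissa bits), so only the
 * final conversion back to float rounds -- one rounding, as in FMA.
 * (A rare double-rounding corner case can still differ; this is a
 * sketch of the concept, not a bit-exact fmaf replacement.) */
float fma_emulated(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

On Haswell, C99's `fma()`/`fmaf()` compile to a single FMA3 instruction when the target supports it, which is how each FMA port delivers two FLOPs per cycle.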
Out of Order Issue in Haswell (2014)
Haswell sustains 4 micro-op issues per cycle.
One possibility:
2 for AVX, and
2 for loads, stores, and book-keeping.
Haswell has two copies of the FMA engine, on separate ports.
Graphical Processing Units
Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for the GPU
Unify all forms of GPU parallelism as the CUDA thread
Programming model is “Single Instruction, Multiple Thread” (SIMT)
Kepler GK 110 nVidia GPU
5.12 TeraOps/s
2880 MACs @ 889 MHz, single-precision multiply-adds
(2880 MACs × 2 ops/madd × 0.889 GHz ≈ 5.12 TeraOps/s)
$999: GTX Titan Black with 6 GB GDDR5 (and 1 GPU)
Applications
Multimedia processing
Video compression
Graphics
Image processing
Simulations
Engineering tools
CAD (computer-aided design)
Cryptography
etc.
SIMD Summary
Intel SSE/AVX SIMD Instructions
One instruction fetch that operates on multiple operands simultaneously
512/256/128/64-bit multimedia registers (MMX, XMM, YMM & ZMM)
Embed the SSE machine instructions directly into C programs through use of intrinsics
ESE345 Project: Pipelined Multimedia SIMD Unit Block Diagram
128-bit Multimedia ALU
Three 128-bit Inputs
One 128-bit Output
Packed Register Format
Word: Four 32-bit Fields
Halfword (HW): Eight 16-bit Fields
Word fields:     bits 127–96: Word 3 | 95–64: Word 2 | 63–32: Word 1 | 31–0: Word 0
Halfword fields: bits 127–112: HW 7 | 111–96: HW 6 | 95–80: HW 5 | 79–64: HW 4 | 63–48: HW 3 | 47–32: HW 2 | 31–16: HW 1 | 15–0: HW 0
Example: Add instruction
All operations take place in each field specified
Fields are treated as separate registers: a carry does not enter the field to the left
  a3      a2      a1      a0
+ b3      b2      b1      b0
= a3+b3   a2+b2   a1+b1   a0+b0
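The no-carry-across-fields behavior above can be modeled in plain C — a hypothetical software model of the project's register, not the hardware design itself:

```c
#include <stdint.h>

/* Model the 128-bit register as four independent 32-bit word fields. */
typedef struct { uint32_t w[4]; } reg128_t;

/* Packed word add: each field adds on its own; a carry out of one
 * field wraps within that field and never enters the field to the left. */
reg128_t packed_add_w(reg128_t a, reg128_t b)
{
    reg128_t r;
    for (int i = 0; i < 4; i++)
        r.w[i] = a.w[i] + b.w[i];  /* unsigned wrap-around per field */
    return r;
}
```

Using separate 32-bit lanes makes the isolation automatic: overflow in `w[0]` simply wraps and cannot perturb `w[1]`.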
Load Immediate
Unlike MIPS, an immediate value can be loaded directly
A Load Index specifies which field the immediate is placed into
Only 16 bits (one halfword) can be loaded at a time
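The indexed load-immediate behavior can be sketched the same way — a hypothetical model, with the struct and function names chosen for illustration:

```c
#include <stdint.h>

/* Model the 128-bit register as eight 16-bit halfword fields. */
typedef struct { uint16_t h[8]; } reg128h_t;

/* Place a 16-bit immediate into the field chosen by the Load Index;
 * every other field keeps its previous contents. */
reg128h_t load_immediate(reg128h_t r, int load_index, uint16_t imm)
{
    r.h[load_index] = imm;
    return r;
}
```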
R4 Instructions
Multiply and Add/Subtract instructions
Take 3 inputs
The multiplication field is half the size of the addition/subtraction field
Which half is used is determined by a High/Low bit
Supports two different multiplication sizes
With saturation
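One lane of such an instruction can be sketched as a half-width multiply feeding a full-width saturating add; the 16-bit/32-bit field sizes and the helper name below are illustrative assumptions:

```c
#include <stdint.h>

/* One lane of an R4-style multiply-add: a 16-bit x 16-bit multiply
 * feeding a 32-bit add with signed saturation (result clamps at the
 * 32-bit limits instead of wrapping). */
int32_t madd_sat32(int16_t a, int16_t b, int32_t c)
{
    int64_t t = (int64_t)a * b + c;        /* exact in 64 bits */
    if (t > INT32_MAX) return INT32_MAX;   /* saturate high */
    if (t < INT32_MIN) return INT32_MIN;   /* saturate low  */
    return (int32_t)t;
}
```

Computing in a wider intermediate and clamping at the end is the standard way to get saturating behavior without relying on undefined overflow.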
Acknowledgements
These slides contain material developed and copyright by:
Morgan Kaufmann (Elsevier, Inc.)
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)
Justin Hsia (UCB)
Mikhail Dorojevets (SBU)