ESE 345 Computer Architecture

Data Level Parallelism (DLP) and Single-Instruction Multiple Data (SIMD) Computing

Computer Architecture

CA: DLP and SIMD

SIMD Architectures

Data-Level Parallelism (DLP): Executing one operation on multiple data streams

Example: Multiplying a coefficient vector by a data vector (e.g. in filtering)

y[i] := c[i] × x[i], 0 ≤ i < n

Sources of performance improvement:

One instruction is fetched & decoded for the entire operation

Multiplications are known to be independent

Pipelining/concurrency in memory access as well
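A minimal sketch of the first point, assuming a hypothetical machine with 4 lanes per SIMD instruction. The function names and the instruction counter are illustrative only; they model fetch/decode counts, not real hardware timing.

```python
def multiply_scalar(c, x):
    """Element-wise product with one multiply 'instruction' per element."""
    y = []
    instructions = 0
    for ci, xi in zip(c, x):
        y.append(ci * xi)
        instructions += 1
    return y, instructions


def multiply_simd(c, x, lanes=4):
    """Element-wise product with one 'instruction' per group of `lanes` elements.

    The multiplications inside a group are independent, so a SIMD unit
    could perform them in the same cycle.
    """
    y = []
    instructions = 0
    for i in range(0, len(x), lanes):
        y.extend(ci * xi for ci, xi in zip(c[i:i + lanes], x[i:i + lanes]))
        instructions += 1
    return y, instructions
```

For n = 8 and 4 lanes, both versions produce the same y, but the SIMD version counts 2 instructions instead of 8.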

Intel’s SIMD Instructions

X86 SIMD Processing Extensions (1)

MMX – Intel Pentium II 1997

8 64-bit integer registers aliased with x87 FP regs

3DNow! – AMD K6-2 1998

Similar to MMX

extended with 21 SP FP ops

SSE (Streaming SIMD Extensions) – Intel Pentium III 1999

Single-precision floating-point instructions

8 new 128 bit XMM registers

SSE-2 Intel Pentium 4 2001

Double-precision floating-point ops

X86 SIMD Processing Extensions (2)

SSE3 – 2004, SSE4 – 2007

AVX (Advanced Vector Extensions) – proposed by Intel and AMD in 2008

Intel Sandy Bridge processor – 2011

16 256-bit registers

AVX2 – Intel Haswell processor 2013

Extends integer SSE/AVX ops to 256 bits

Fused multiply-add (FMA3)

AVX-512 Intel Knights Landing processor 2016

32 512-bit regs

4-operand instructions

Example: SIMD Array Processing

SSE Instruction Categories for Multimedia Support

SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types

XMM Registers

SSE/SSE2 Floating Point Instructions

SSE/SSE2 Floating Point Instructions

Example: Add Single Precision FP Vectors

Packed and Scalar Double-Precision Floating-Point Operations

Example: Image Converter (1/5)

Intel SSE Intrinsics

Sample of SSE Intrinsics

Example: 2 × 2 Matrix Multiply

Inner loop from gcc -O -S

Performance-Driven ISA Extensions

Architectures for Data Parallelism

The Current Landscape: Chips that deliver TeraOps/s in 2014, and how they differ.

E5-26xx v2: Stretching the Xeon server approach for compute-intensive apps.

GK110: nVidia’s Kepler GPU, customized for compute applications.

SONY/IBM PS3 Cell processor in 2006.

GM107: nVidia’s Maxwell GPU, customized for energy-efficiency.

Sony/IBM Playstation PS3 Cell Chip - Released 2006

8 SPEs,

3.2 GHz clock,

200 GigaOps/s (peak)

Sony PS3 Cell Processor SPE Floating-Point Unit

4 single-precision multiply-adds issue in lockstep (SIMD) per cycle.

6-cycle latency

Single-Instruction Multiple-Data

3.2 GHz clock --> 25.6 SP GFLOPS
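The 25.6 GFLOPS figure follows from lanes × operations per lane × clock rate. A small sketch of that arithmetic, assuming each multiply-add counts as 2 floating-point operations; the function name `peak_gflops` is illustrative.

```python
def peak_gflops(lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak throughput in GFLOPS: lanes x FLOPs per lane per cycle x GHz."""
    return lanes * flops_per_lane_per_cycle * clock_ghz


# One SPE: 4 SIMD lanes, each issuing one multiply-add (2 FLOPs) per cycle.
spe_gflops = peak_gflops(lanes=4, flops_per_lane_per_cycle=2, clock_ghz=3.2)

# Whole chip: 8 SPEs working in parallel.
cell_gflops = 8 * spe_gflops
```

The per-SPE result is about 25.6 GFLOPS and the 8-SPE total is about 204.8 GFLOPS, consistent with the roughly 200 GigaOps/s peak quoted for the chip.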

Intel Xeon Ivy Bridge E5-2697v2 (2013)

12-core Xeon Ivy Bridge

0.52 TeraOps/s (130W)

12 cores @ 2.7 GHz

Each core can issue 16 single-precision operations per cycle.

$2,600 per chip

Haswell: 1.04 TeraOps/s

Intel E5-2697v2 vs. Haswell

12 cores @ 2.7 GHz

Each core can issue 16 single-precision operations per cycle.

Haswell cores issue 32 SP FLOPS/cycle.

Advanced Vector Extension (AVX) unit

Relative area has increased in Haswell

Smaller than L3 cache, but larger than L2 cache.

Die closeup of one Sandy Bridge core

AVX: Not Just Single-precision Floating-point

256-bit version -> double-precision vectors of length 4

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.
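The idea that the same 128 bits can be read as different packed types can be illustrated with Python's standard `struct` module. This is only a model of the reinterpretation, not of any AVX instruction; little-endian byte order is assumed, as on x86.

```python
import struct

# 16 bytes = 128 bits, filled here with four single-precision values.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

as_floats = struct.unpack("<4f", raw)    # 4 x 32-bit floats
as_doubles = struct.unpack("<2d", raw)   # 2 x 64-bit doubles (same bits, different meaning)
as_int16 = struct.unpack("<8h", raw)     # 8 x 16-bit signed integers
as_int8 = struct.unpack("<16b", raw)     # 16 x 8-bit signed integers
```

The bits never change; only the field boundaries and the interpretation do, which is why the instruction, not the register, determines the data type.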

Sandy Bridge, Haswell (2014)

Sandy Bridge extends register set to 256 bits: vectors are twice the size.

AVX/AVX2 has 16 registers (IA-32: 8)

Haswell adds 3-operand fused multiply-add (FMA) instructions: a*b + c

2 EX units with FMA --> 2X increase in ops/cycle
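The per-core figures of 16 and 32 single-precision operations per cycle can be derived from the register width and the number of execution units. The sketch below assumes that Sandy Bridge/Ivy Bridge pair one 256-bit multiply unit with one 256-bit add unit (background not stated on the slides) and that an FMA counts as 2 operations per lane. Function and parameter names are illustrative.

```python
def sp_ops_per_cycle(register_bits, fp_units, ops_per_lane):
    """Single-precision operations per cycle for one core."""
    lanes = register_bits // 32          # 32-bit floats per vector register
    return lanes * fp_units * ops_per_lane


def chip_teraops(cores, clock_ghz, ops_per_cycle):
    """Peak chip throughput in TeraOps/s."""
    return cores * clock_ghz * ops_per_cycle / 1000.0


# Assumed: one multiply unit + one add unit, 1 operation per lane each.
ivy_bridge_core = sp_ops_per_cycle(256, fp_units=2, ops_per_lane=1)

# Two FMA units, each doing a multiply and an add (2 operations) per lane.
haswell_core = sp_ops_per_cycle(256, fp_units=2, ops_per_lane=2)

# 12 cores at 2.7 GHz, as quoted for the E5-2697v2.
ivy_bridge_chip = chip_teraops(12, 2.7, ivy_bridge_core)
haswell_chip = chip_teraops(12, 2.7, haswell_core)
```

Under these assumptions the chip-level results come out near 0.52 and 1.04 TeraOps/s, matching the quoted figures and the 2X increase attributed to FMA.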

Out of Order Issue in Haswell (2014)

Haswell sustains 4 micro-op issues per cycle.

One possibility:

2 for AVX, and

2 for Loads, Stores and book-keeping.

Haswell has two copies of the FMA engine, on separate ports.

Graphical Processing Units

Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Basic idea:

Heterogeneous execution model

CPU is the host, GPU is the device

Develop a C-like programming language for GPU

Unify all forms of GPU parallelism as CUDA thread

Programming model is “Single Instruction Multiple Thread”

Kepler GK 110 nVidia GPU

5.12 TeraOps/s

2880 single-precision multiply-add units (MACs) @ 889 MHz

GTX Titan Black with 6GB GDDR5

(and 1 GPU)

Applications

Multimedia processing

Video compression

Graphics

Image processing

Simulations

Engineering tools

Cryptography

Etc…

SIMD Summary

Intel SSE/AVX SIMD Instructions

One instruction fetch that operates on multiple operands simultaneously

512/256/128/64-bit multimedia registers (e.g., XMM & YMM registers)

Embed the SSE machine instructions directly into C programs through use of intrinsics

ESE345 Project: Pipelined Multimedia SIMD Unit Block Diagram

128-bit Multimedia ALU

Three 128-bit Inputs

One 128-bit Output

Packed Register Format

Word: Four 32-bit Fields

Halfword (HW): Eight 16-bit Fields

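Word format (bit ranges):

Bits 127-96: Word 3 | Bits 95-64: Word 2 | Bits 63-32: Word 1 | Bits 31-0: Word 0

Halfword format (bit ranges):

Bits 127-112: HW 7 | Bits 111-96: HW 6 | Bits 95-80: HW 5 | Bits 79-64: HW 4 | Bits 63-48: HW 3 | Bits 47-32: HW 2 | Bits 31-16: HW 1 | Bits 15-0: HW 0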

Example: Add instruction

All instructions take place for each field specified

Treated as separate registers

Carry does not enter field to left

a3 a2 a1 a0

b3 b2 b1 b0

a3+b3 a2+b2 a1+b1 a0+b0
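A behavioral sketch of the packed add, modeling a 128-bit register as a Python integer. The function name is illustrative, and wraparound on overflow within a field is an assumption; the slide only states that the carry does not propagate into the neighboring field.

```python
def packed_add(a, b, field_bits):
    """Add two 128-bit values field by field.

    Each field is added independently; any carry out of a field is
    discarded instead of entering the field to its left.
    """
    field_mask = (1 << field_bits) - 1
    result = 0
    for shift in range(0, 128, field_bits):
        fa = (a >> shift) & field_mask
        fb = (b >> shift) & field_mask
        result |= ((fa + fb) & field_mask) << shift
    return result


a = 0x00000001_FFFFFFFF_00000003_00000004
b = 0x00000001_00000001_00000001_00000001

as_words = packed_add(a, b, field_bits=32)      # four 32-bit fields
as_halfwords = packed_add(a, b, field_bits=16)  # eight 16-bit fields
```

With 32-bit fields, Word 2 overflows (0xFFFFFFFF + 1) and wraps to 0 while Word 3 still receives only 1 + 1 = 2. An ordinary 128-bit add would instead carry into Word 3 and produce 3 there.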

Load Immediate

Unlike MIPS, an Immediate value can be loaded directly

Need to specify by Load Index which field to place Immediate into

Can only load 16 bits (halfword) at a time
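A behavioral sketch of Load Immediate on a 128-bit register modeled as a Python integer. The function name and argument order are hypothetical, and leaving the other seven halfword fields unchanged is an assumption; the actual encoding is defined by the project specification.

```python
def load_immediate(reg, load_index, imm16):
    """Place a 16-bit immediate into halfword field `load_index` (0..7).

    Assumption: the remaining fields of the register keep their old values.
    """
    if not 0 <= load_index < 8:
        raise ValueError("load_index must select one of 8 halfword fields")
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    shift = 16 * load_index
    cleared = reg & ~(0xFFFF << shift)   # zero out the target field
    return cleared | (imm16 << shift)


# Filling a whole 128-bit register takes eight loads, one per field.
reg = 0
for index in range(8):
    reg = load_immediate(reg, index, 0x1000 + index)
```

Because only one halfword is written per instruction, building an arbitrary 128-bit constant costs eight instructions in this model.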

R4 Instructions

Multiply and Add/Subtract Instructions

Takes 3 Inputs

Multiplication Field Half-Size of Addition/Subtraction

This half is determined by High/Low bit

Supports two different multiplication sizes

With Saturation
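A behavioral sketch of one multiply-add variant, assuming signed 16-bit halves are multiplied and the 32-bit product is added to a 32-bit field with signed saturation. Register roles (rs1 as the addend, rs2 and rs3 as multiplicands) and all names are assumptions for illustration; the second multiplication size is not modeled here.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def saturate32(value):
    """Clamp to the signed 32-bit range instead of wrapping around."""
    return max(INT32_MIN, min(INT32_MAX, value))


def multiply_add_sat(rs1, rs2, rs3, high):
    """For each 32-bit field: rs1 + (16-bit half of rs2) * (16-bit half of rs3).

    `high` selects the upper (True) or lower (False) 16-bit half of each
    32-bit field of rs2 and rs3 as the multiplication operands.
    """
    half_shift = 16 if high else 0
    result = 0
    for shift in range(0, 128, 32):
        acc = to_signed(rs1 >> shift, 32)
        m2 = to_signed(rs2 >> (shift + half_shift), 16)
        m3 = to_signed(rs3 >> (shift + half_shift), 16)
        total = saturate32(acc + m2 * m3)
        result |= (total & 0xFFFFFFFF) << shift
    return result
```

Saturation means that 0x7FFFFFFF plus a positive product stays at 0x7FFFFFFF rather than wrapping to a negative value, which is usually the preferred behavior for multimedia data.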

Acknowledgements

These slides contain material developed and copyrighted by:

Morgan Kaufmann (Elsevier, Inc.)

Arvind (MIT)

Krste Asanovic (MIT/UCB)

Joel Emer (Intel/MIT)

James Hoe (CMU)

John Kubiatowicz (UCB)

David Patterson (UCB)

Justin Hsia (UCB)

Mikhail Dorojevets (SBU)
