SSE2

with a focus on floating point

For floating point (i.e., real numbers), MASM supports:

real4: single precision; IEEE standard; analogous to float

real8: double precision; IEEE standard; analogous to double

real10: double extended precision; not IEEE standard

NaN = Not a Number (see p. 4-14 of v1)

SSE2 supports 32 and 64 bit f.p. data

x87 supports 32, 64, and 80 bit f.p. data

Note: These are 24-bit binary numbers.

Here they are in base 10: 2.00000000000000 and 1.99999988079071
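These two decimal values are what a 24-bit significand would give: a real4 stores 23 fraction bits plus an implicit leading 1, so the largest real4 below 2.0 is 2 − 2^-23. The following is a minimal sketch in Python, used here only as a calculator. The bit patterns 40000000h and 3FFFFFFFh are an assumption about which two numbers the note refers to.

```python
import struct

def real4_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE single precision (real4) value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Assumed bit patterns (not given in the notes):
two = real4_from_bits(0x40000000)         # exponent field 128, fraction all 0s
just_below = real4_from_bits(0x3FFFFFFF)  # exponent field 127, fraction all 1s

print("%.14f" % two)         # 2.00000000000000
print("%.14f" % just_below)  # 1.99999988079071
```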

SSE2 = Streaming SIMD Extensions 2

SIMD = Single Instruction Multiple Data instructions

SSE2 was introduced in 2000 on Pentium 4 and Intel Xeon processors.

1996 Intel MMX

1998 AMD 3DNow!

1999 Intel SSE on P3

2000 Intel SSE2 on P4

2004 Intel SSE3 (since Prescott P4)

2006 Intel Supplemental SSE3 (SSSE3; since Woodcrest Xeons)

2006 Intel SSE4 (4.1 and 4.2)

2007 AMD SSE5 (proposed 2007, implemented 2011)

2008 Intel AVX (proposed 2008, implemented 2011 in Intel Sandy Bridge and AMD Bulldozer). XMM registers go from 128 bit to 256 bit, called YMM.

1. You must use MASM v6.15 or newer for SIMD support. (MASM v6.15 is available from the course software web page.)

2. You must enable MASM support for these instructions with the following:

.686 ;instructions for Pentium Pro (or better)
.xmm ;allow simd instructions
.model flat, stdcall ;no crazy segments!

Each one of the 8 128-bit registers (xmm0...xmm7) can hold:

16 packed 1 byte integers

8 packed word (2 byte) integers

4 packed doubleword (4 byte) integers

2 packed quadword (8 byte) integers

1 double quadword (16 byte)

4 packed single precision (4 bytes each) floating point values

2 packed double precision (8 bytes each) floating point values
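All of these formats are views of the same 16 bytes. The sketch below uses Python's struct module as a stand-in for the hardware. The byte values are arbitrary and chosen only for illustration.

```python
import struct

# 16 arbitrary bytes standing in for the contents of one xmm register,
# lowest-addressed byte first (as the register would be stored to memory).
xmm = bytes(range(16))

views = {
    "packed bytes":       struct.unpack("<16B", xmm),
    "packed words":       struct.unpack("<8H", xmm),
    "packed doublewords": struct.unpack("<4I", xmm),
    "packed quadwords":   struct.unpack("<2Q", xmm),
    "packed real4":       struct.unpack("<4f", xmm),
    "packed real8":       struct.unpack("<2d", xmm),
}

for name, values in views.items():
    print(len(values), name)
```

The instruction, not the register, decides which view applies: addpd treats the 16 bytes as 2 doubles, while an integer instruction would treat the same bits as bytes, words, doublewords, or quadwords.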

IA32 Registers:

8 32-bit GPRs: integer only

8 80-bit fp regs: floating point only

8 64-bit mmx regs: integer only; re-uses fp regs

8 128-bit xmm regs: integer and fp. These will be the focus of our discussion.

XMM register formats

The utilities.asm MASM code (on the course’s software web page) contains a function that you can call to display the contents of the 8 xmm registers (dump) as pairs of 64 bit double precision fp values.

call dumpXmm64

1. Data movement

2. Arithmetic

3. Comparison

4. Conversion

1. Data movement

movhpd Move High Packed Double-Precision Floating-Point Value

movlpd Move Low Packed Double-Precision Floating-Point Value

movsd Move Scalar Double-Precision Floating-Point Value

movhpd - Move High Packed Double-Precision Floating-Point Value

for memory to XMM move: DEST[127-64] ← SRC; DEST[63-0] unchanged

Ex. movhpd xmm0, m64

for XMM to memory move: DEST ← SRC[127-64]

Ex. movhpd m64, xmm2

movlpd - Move Low Packed Double-Precision Floating-Point Value

for memory to XMM move: DEST[127-64] unchanged; DEST[63-0] ← SRC

Ex. movlpd xmm1, m64

for XMM to memory move: DEST ← SRC[63-0]

Ex. movlpd m64, xmm2

movsd - Move Scalar Double-Precision Floating-Point Value

1. when source and destination operands are both XMM registers: DEST[127-64] remains unchanged; DEST[63-0] ← SRC[63-0]

Ex. movsd xmm1, xmm3

2. when source operand is XMM register and destination operand is memory location: DEST ← SRC[63-0]

Ex. movsd m64, xmm2

3. when source operand is memory location and destination operand is XMM register: DEST[127-64] ← 0000000000000000H; DEST[63-0] ← SRC

Ex. movsd xmm1, m64
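The difference between these moves is easy to lose in the notation, so here is a small Python model of the load forms. It is only a sketch: a register is modeled as a two-element list [low, high] of doubles, and the function names are made up for illustration.

```python
def movhpd_load(dest, m64):
    """movhpd xmm, m64: high half <- memory; low half unchanged."""
    return [dest[0], m64]

def movlpd_load(dest, m64):
    """movlpd xmm, m64: low half <- memory; high half unchanged."""
    return [m64, dest[1]]

def movsd_reg(dest, src):
    """movsd xmm, xmm: low half <- low half of src; high half unchanged."""
    return [src[0], dest[1]]

def movsd_load(dest, m64):
    """movsd xmm, m64: low half <- memory; high half is zeroed."""
    return [m64, 0.0]

xmm1 = [1.0, 2.0]  # [low, high]
xmm3 = [7.0, 8.0]

print(movhpd_load(xmm1, 9.0))  # [1.0, 9.0]
print(movlpd_load(xmm1, 9.0))  # [9.0, 2.0]
print(movsd_reg(xmm1, xmm3))   # [7.0, 2.0]
print(movsd_load(xmm1, 9.0))   # [9.0, 0.0]
```

Note the one asymmetry: movsd from memory clears the high half, while movlpd from memory and movsd between registers leave it alone.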

2. Arithmetic (scalar)

addsd - Add Scalar Double-Precision Floating-Point Values

subsd - Subtract Scalar Double-Precision Floating-Point Values

mulsd - Multiply Scalar Double-Precision Floating-Point Values

divsd - Divide Scalar Double-Precision Floating-Point Values

There is also sqrtsd, but there are no sin or cos SSE2 instructions! We have to use the x87 instructions for those.

addsd DEST[63-0] ← DEST[63-0] + SRC[63-0]; DEST[127-64] remains unchanged

subsd DEST[63-0] ← DEST[63-0] − SRC[63-0]; DEST[127-64] remains unchanged

mulsd DEST[63-0] ← DEST[63-0] * SRC[63-0]; DEST[127-64] remains unchanged

divsd DEST[63-0] ← DEST[63-0] / SRC[63-0]; DEST[127-64] remains unchanged
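A minimal Python model of the scalar forms follows, again treating a register as a list [low, high]. Python floats are IEEE doubles, so the low-half results should match what the hardware produces in the default rounding mode. The helper name scalar_op is hypothetical.

```python
import operator

def scalar_op(op, dest, src):
    """Apply op to the low halves only; the high half of dest passes through."""
    return [op(dest[0], src[0]), dest[1]]

def addsd(dest, src): return scalar_op(operator.add, dest, src)
def subsd(dest, src): return scalar_op(operator.sub, dest, src)
def mulsd(dest, src): return scalar_op(operator.mul, dest, src)
# Unlike divsd with masked exceptions, Python raises on division by zero.
def divsd(dest, src): return scalar_op(operator.truediv, dest, src)

xmm0 = [6.0, 111.0]  # [low, high]
xmm1 = [1.5, 222.0]

print(addsd(xmm0, xmm1))  # [7.5, 111.0]
print(subsd(xmm0, xmm1))  # [4.5, 111.0]
print(mulsd(xmm0, xmm1))  # [9.0, 111.0]
print(divsd(xmm0, xmm1))  # [4.0, 111.0]
```

In every case the high half of the source (222.0) is ignored and the high half of the destination (111.0) survives.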

2. Arithmetic (packed)

addpd - Add Packed Double-Precision Floating-Point Values

subpd - Subtract Packed Double-Precision Floating-Point Values

mulpd - Multiply Packed Double-Precision Floating-Point Values

divpd - Divide Packed Double-Precision Floating-Point Values

addpd - Add Packed Double-Precision Floating-Point Values

DEST[63-0] ← DEST[63-0] + SRC[63-0]

DEST[127-64] ← DEST[127-64] + SRC[127-64]

subpd - Subtract Packed Double-Precision Floating-Point Values

DEST[63-0] ← DEST[63-0] − SRC[63-0]

DEST[127-64] ← DEST[127-64] − SRC[127-64]

mulpd - Multiply Packed Double-Precision Floating-Point Values

DEST[63-0] ← DEST[63-0] * SRC[63-0]

DEST[127-64] ← DEST[127-64] * SRC[127-64]

divpd - Divide Packed Double-Precision Floating-Point Values

DEST[63-0] ← DEST[63-0] / SRC[63-0]

DEST[127-64] ← DEST[127-64] / SRC[127-64]
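The packed forms apply the same operation to both halves at once, which is what makes them useful for arrays. The sketch below models addpd in Python and uses it to add two arrays two doubles per step. The add_arrays helper and its even-length requirement are assumptions made for the illustration. subpd, mulpd, and divpd differ only in the operator (−, *, /).

```python
def addpd(dest, src):
    """addpd: add low halves and high halves independently."""
    return [dest[0] + src[0], dest[1] + src[1]]

def add_arrays(a, b):
    """Element-wise sum of two equal-length lists of even length,
    processed one register's worth (two doubles) at a time."""
    out = []
    for i in range(0, len(a), 2):
        xmm0 = a[i:i + 2]  # models loading two adjacent real8 values
        xmm1 = b[i:i + 2]
        out.extend(addpd(xmm0, xmm1))
    return out

print(add_arrays([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
# [11.0, 22.0, 33.0, 44.0]
```

Four additions take two addpd steps here, where the scalar addsd would need four.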

3. Comparison

comisd Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS
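comisd compares the low doubles of its two operands and reports the result in ZF, PF, and CF (OF, SF, and AF are cleared). A minimal Python model of the flag settings follows. The function takes the two low doubles directly rather than whole registers, which is a simplification for the sketch.

```python
import math

def comisd(a, b):
    """Return (ZF, PF, CF) as set by comisd when comparing low doubles a and b."""
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)  # unordered: at least one operand is a NaN
    if a > b:
        return (0, 0, 0)
    if a < b:
        return (0, 0, 1)
    return (1, 0, 0)      # equal (+0.0 and -0.0 compare equal)

print(comisd(2.0, 1.0))           # (0, 0, 0)
print(comisd(1.0, 2.0))           # (0, 0, 1)
print(comisd(1.0, 1.0))           # (1, 0, 0)
print(comisd(float("nan"), 1.0))  # (1, 1, 1)
```

Because the result lands in ZF and CF, the conditional jumps to use afterward are the unsigned-style ones (ja, jb, je), with jp available to detect the unordered case.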

4. Conversion

cvtsd2si Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer

cvtsi2sd Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

cvtsd2si Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer

DEST[31-0] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[63-0])

cvtsi2sd Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

DEST[63-0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31-0])

DEST[127-64] remains unchanged
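The pseudocode hides two details of cvtsd2si: the result is rounded according to the rounding mode in MXCSR (round to nearest, ties to even, by default), and a value that does not fit in a signed doubleword produces the "integer indefinite" value 80000000h. The Python sketch below models that behavior, assuming the default rounding mode and masked exceptions. The constant name is made up for the example.

```python
import math

INTEGER_INDEFINITE = -2**31  # 80000000h viewed as a signed doubleword

def cvtsd2si(x):
    """Model cvtsd2si r32, xmm/m64 in the default round-to-nearest-even mode."""
    if math.isnan(x) or math.isinf(x):
        return INTEGER_INDEFINITE
    n = round(x)  # Python's round() also rounds ties to even
    if not -2**31 <= n <= 2**31 - 1:
        return INTEGER_INDEFINITE
    return n

def cvtsi2sd(dest, n):
    """Model cvtsi2sd xmm, r/m32: low half <- n as a double; high half unchanged."""
    return [float(n), dest[1]]

print(cvtsd2si(2.5))             # 2
print(cvtsd2si(3.5))             # 4
print(cvtsd2si(-2.7))            # -3
print(cvtsd2si(1e10))            # -2147483648
print(cvtsi2sd([0.0, 5.0], 42))  # [42.0, 5.0]
```

Every doubleword integer is exactly representable as a double, so cvtsi2sd never rounds.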
