Intel® AVX-512 Architecture
Comprehensive vector extension for HPC and enterprise

• 512-bit wide vectors, 32 SIMD registers
• 8 new mask registers
• Embedded Rounding Control
• Embedded Broadcast
• New math instructions
• 2-source shuffles
• Gather and Scatter
• Compress and Expand
• Conflict Detection
AVX-512 – What's new?

NHM: SSE*
SNB: SSE*, AVX
HSW: SSE*, AVX, AVX2
KNL: SSE*, AVX, AVX2, AVX-512
AVX-512 subsets:
• AVX-512 Foundation
• Exponential and Reciprocal
• Prefetching
• Conflict Detection

More And Bigger Registers
32 SIMD registers, 512 bits wide
Conflict Detection

Sparse computations are hard to vectorize: the code below is wrong if any values within B[i] are duplicated. The VPCONFLICT instruction detects elements with conflicting indices.

for(i=0; i<16; i++) { A[B[i]]++; }

Naive vectorization (incorrect with duplicate indices):
index   = vload &B[i]              // Load 16 B[i]
old_val = vgather A, index         // Grab A[B[i]]
new_val = vadd old_val, +1.0       // Compute new values
vscatter A, index, new_val         // Update A[B[i]]
Conflict-safe version using VPCONFLICT:
index = vload &B[i]                // Load 16 B[i]
pending_elem = 0xFFFF
do {
    curr_elem = get_conflict_free_subset(index, pending_elem)
    old_val = vgather {curr_elem} A, index     // Grab A[B[i]]
    new_val = vadd old_val, +1.0               // Compute new values
    vscatter A {curr_elem}, index, new_val     // Update A[B[i]]
    pending_elem = pending_elem ^ curr_elem    // Retire finished lanes
} while (pending_elem)
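A minimal sketch of the loop above in pure Python may help. Here `get_conflict_free_subset` is a stand-in for the VPCONFLICT-based mask computation, not a real intrinsic; the masked gather/add/scatter is modeled with a plain per-lane loop:

```python
def get_conflict_free_subset(index, pending):
    """Return a bitmask of pending lanes whose index value has no
    earlier pending duplicate (a conflict-free subset)."""
    subset = 0
    seen = set()
    for lane in range(len(index)):
        if pending & (1 << lane) and index[lane] not in seen:
            seen.add(index[lane])
            subset |= 1 << lane
    return subset

def histogram_update(A, B):
    """Vector-style A[B[i]] += 1 over 16 lanes, correct even when
    B contains duplicate indices."""
    index = B[:16]                      # vload &B[i]
    pending = 0xFFFF
    while pending:
        curr = get_conflict_free_subset(index, pending)
        for lane in range(16):          # masked gather + add + scatter
            if curr & (1 << lane):
                A[index[lane]] += 1
        pending ^= curr                 # remove done lanes
    return A
```

Each pass through the loop retires at least one lane, so the loop terminates even when all 16 indices collide.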
Compress And Expand

Compress selected f64, f32, i64, or i32 elements of a 512-bit vector, chosen by a mask, into contiguous positions of a register or a memory location; expand does the inverse.

VCOMPRESSPD zmm1/mV {k1}, zmm2
VEXPANDPS zmm1 {k1}{z}, zmm2/mV

Compress with mask k[] = 01110011 (k0 leftmost):
  source: A0 A1 A2 A3 A4 A5 A6 A7
  result: A1 A2 A3 A6 A7 (packed into the low positions; remaining destination lanes are zeroed in the register form)

Expand with the same mask:
  source: A0 A1 A2 A3 A4
  result: X A0 A1 A2 X X A3 A4 (the "X" lanes are zeroed or remain unchanged)
Scalar pattern that compress accelerates:
for (i = 0; i < N; i++) {
    if (topVal > b[i]) {
        *dst = a[i];
        dst++;
    }
}
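The compress/expand semantics can be modeled in a few lines of Python. These helpers are illustrative (`compress`/`expand` are not real intrinsics); masks are passed as per-lane bit lists, k0 first, matching the slide's k[] notation:

```python
def compress(src, mask):
    """Model of VCOMPRESS: pack the selected elements into low positions."""
    return [v for v, m in zip(src, mask) if m]

def expand(src, mask, merge):
    """Model of VEXPAND (merge form): spread consecutive src elements
    into the masked lanes; unmasked lanes keep the merge value."""
    out, it = list(merge), iter(src)
    for i, m in enumerate(mask):
        if m:
            out[i] = next(it)
    return out
```

The scalar filter loop above is then just a compressed store: `dst = compress(a, [topVal > x for x in b])`.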
Predication Scheme

Source code:
if (condition) {
    A = ..
} else {
    A = ..
}

Machine code:
CMP %regMask, ops
…
ADD %regA1 = x1, y1
…
ADD %regA2 = x2, y2
BLEND %regA, %regMask, %regA1, %regA2

LLVM IR:
%Mask = cmp (condition)
…
%A1 =
…
%A2 =
%A = SELECT %Mask, %A1, %A2
Mask Propagation Pass – design ideas

Goal: set predicates (Mask and Not-Mask) on the instructions that calculate A1 and A2.
Result: if a mask is all-zero, the instruction is not executed.

Mask = cmp (condition)
Not-Mask = not(Mask)
{Mask}     C1 =
{Mask}     B1 =
{Mask}     A = B1+C1
{Not-Mask} C2 =
{Not-Mask} B2 =
{Not-Mask} A = B2+C2
Before the pass:
topVal =
Mask = cmp()
C1 = topVal*2
B1 =
A1 = B1 + C1
C2 =
B2 =
A2 = B2 + C2
A = BLEND(Mask, A1, A2)

A new machine pass:
• Runs before register allocation
• Starts from the BLEND operands and walks up recursively to the mask definition
• Checks all users of the destination operand before applying the mask
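The recursive walk described in the bullets above can be sketched on a toy SSA-like IR. This is a hedged illustration, not the real pass: `Instr` and `propagate` are made-up helpers, and live-in values (like topVal) are modeled simply as names with no def in the program:

```python
class Instr:
    def __init__(self, dest, ops=()):
        self.dest, self.ops, self.mask = dest, tuple(ops), None

def propagate(instrs, start, mask):
    """Mask the def of `start` (a BLEND operand), then walk up through its
    operands, masking a def only when all of its users already carry `mask`."""
    defs = {i.dest: i for i in instrs}
    users = {}
    for i in instrs:
        for op in i.ops:
            users.setdefault(op, []).append(i)

    def walk(name):
        d = defs.get(name)
        if d is None or d.mask is not None:   # live-in value, or already done
            return
        d.mask = mask
        for op in d.ops:
            if all(u.mask == mask for u in users.get(op, [])):
                walk(op)

    walk(start)

# The running example: topVal is a live-in, so C1 = topVal*2 gets the mask
# but topVal itself is left untouched.
prog = [Instr('C1', ('topVal',)), Instr('B1'), Instr('A1', ('B1', 'C1')),
        Instr('C2'), Instr('B2'), Instr('A2', ('B2', 'C2')),
        Instr('A', ('Mask', 'A1', 'A2'))]
propagate(prog, 'A1', 'Mask')
propagate(prog, 'A2', '!Mask')
```

The "check all users" test is what stops the mask from leaking onto a value that is also needed on the other path.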
Predicated memory accesses in LLVM IR would be helpful!

The Mask Propagation Pass does not guarantee full mask propagation over the whole path from blend to compare.
Load/store operations require exact masking.
FP operations require masking if exceptions are not suppressed – IR generators should use compiler intrinsics.
After propagating Mask up from the first BLEND operand:
topVal =
Mask = cmp()
{Mask} C1 = topVal*2
{Mask} B1 =
{Mask} A1 = B1 + C1
C2 =
B2 =
A2 = B2 + C2
A = BLEND(Mask, A1, A2)
After propagating !Mask up from the second operand as well:
topVal =
Mask = cmp()
!Mask = Not(Mask)
{Mask}  C1 = topVal*2
{Mask}  B1 =
{Mask}  A1 = B1 + C1
{!Mask} C2 =
{!Mask} B2 =
{!Mask} A2 = B2 + C2
A = BLEND(Mask, A1, A2)
Final form – the BLEND is eliminated:
topVal =
Mask = cmp()
!Mask = Not(Mask)
{Mask}  C1 = topVal*2
{Mask}  B1 =
{Mask}  A = B1 + C1
{!Mask} C2 =
{!Mask} B2 =
{!Mask} A = B2 + C2
Masking in LLVM

Unmasked elements remain unchanged:
VADDPD zmm1 {k1}, zmm2, zmm3
Or are zeroed:
VADDPD zmm1 {k1}{z}, zmm2, zmm3

float32 A[N], B[N], C[N];
for(i=0; i<16; i++)
{
    if (B[i] != 0)
        A[i] = A[i] / B[i];
    else
        A[i] = A[i] / C[i];
}

VMOVUPS zmm2, A
VCMPPS k1, zmm0, B
VDIVPS zmm1 {k1}{z}, zmm2, B
KNOT k2, k1
VDIVPS zmm1 {k2}, zmm2, C
VMOVUPS A, zmm1
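Per-lane, the sequence above is easy to model. This hedged sketch (a made-up `masked_div` helper, not an intrinsic) shows the zeroing divide under k1 followed by the merging divide under k2 = NOT k1:

```python
def masked_div(A, B, C):
    """Model of the masked-divide sequence: lanes with B[i] != 0 compute
    A/B (zeroing form), the remaining lanes then compute A/C (merge form)."""
    k1 = [b != 0.0 for b in B]                       # VCMPPS k1
    z = [a / b if m else 0.0                         # VDIVPS {k1}{z}
         for a, b, m in zip(A, B, k1)]
    return [v if m else a / c                        # VDIVPS {k2}, k2 = KNOT k1
            for v, a, c, m in zip(z, A, C, k1)]
```

The point of the masking is that the divide by B simply does not execute on lanes where B[i] is zero, so no spurious FP exception is raised and no extra blend is needed.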
Embedded Broadcast

A source from memory is repeated across all the elements:
vbroadcastss zmm3, [rax]
vaddps zmm1, zmm2, zmm3

becomes a single instruction:
vaddps zmm1, zmm2, [rax] {1to16}
Embedded Rounding Control

• Static (per instruction) rounding mode
• No need to access MXCSR any more!

vaddps zmm7 {k6}, zmm2, zmm4 {rd}
vcvtdq2ps zmm1, zmm2, {ru}

All exceptions are always suppressed when embedded RC is used.
Masking benefits:
• Memory fault suppression
• Avoid FP exceptions
• Avoid extra blends

Masked add (merge form) with k1 = 10111100 (bit 7 … bit 0) — lanes with the k1 bit set get b+c, the other lanes keep the old a value:
zmm1:   a7     a6  a5     a4     a3     a2     a1  a0
zmm2:   b7     b6  b5     b4     b3     b2     b1  b0
zmm3:   c7     c6  c5     c4     c3     c2     c1  c0
result: b7+c7  a6  b5+c5  b4+c4  b3+c3  b2+c2  a1  a0
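The diagram's merge behavior fits in one line of Python. `masked_add` is an illustrative helper (masks given as a per-lane list, lane 0 first):

```python
def masked_add(dst, b, c, k):
    """Merge-masked vector add: lanes with the k bit set get b+c,
    unset lanes keep the old dst value."""
    return [bi + ci if m else di for di, bi, ci, m in zip(dst, b, c, k)]
```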
Register files:
• XMM0-15: 128 bits
• YMM0-15: 256 bits
• ZMM0-31: 512 bits
Copyright © 2013 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Elena Demikhovsky, Intel® Software and Services Group, Israel