FPGAs for High Performance Computing

Admintech 2018, Valencia, May 9th

Francisco Perez, Field Applications Engineer
[email protected]

AGENDA

• FPGAs as hardware accelerators for applications
• Introduction to FPGA architecture
• Available tools and platforms -> FPGAs for programmers
• Real-world FPGA use cases in the data center
• Artificial Intelligence: benefits of FPGAs for CNN inference

Intro to FPGA

Multi-purpose accelerator engine

Programmable Solutions Group

What is an FPGA?

Definition

• An FPGA, or field-programmable gate array, is a configurable device containing thousands of digital logic blocks.
• How these blocks are connected together, and what each one does, is specified using a hardware description language.
• This array of programmable logic can implement anything from a simple circuit, such as a logic gate or combinational function, up to a complex system-on-chip solution.
• It is reprogrammable, so its functionality can be changed when needed.

What is an FPGA today?

An advanced, multi-function accelerator

• Offers greater throughput, execution speed, and energy efficiency than CPUs on the computationally intensive parts of algorithms
• Adapts quickly to changes in algorithms, new standards, data patterns, or performance needs
• Can be reconfigured in the field to accelerate any algorithm

Transforming Data Centers to a Single Accelerator Architecture

[Diagram: CPU, GPU, ASSP, ASIC, FPGA]

• Artificial Intelligence
• Big Data Analytics (Hadoop, Spark, SQL, NoSQL)
• Video Transcoding
• NFV/SDN
• Storage Acceleration
• Security and DPI (Deep Packet Inspection)

Accelerating Key Network Functions

Many Kinds of Boxes: Routers, Firewalls, Switches, Special-Purpose Appliances

• Switching
• Security
• Inspection & Reporting

What can FPGAs do for your application?

• Data Analytics
• Artificial Intelligence
• Video Transcoding
• Cyber Security
• Financial Acceleration
• Genomics

Data Analytics Solution: hyper-acceleration of Apache Spark

Bigstream: Spark acceleration solution

The only platform to offer seamless acceleration of Apache Spark using Intel FPGAs

• Zero code change for Spark
• Intelligent, automatic computation slicing
• Multilevel acceleration strategies
• Abstracts away programming front end and processor back end
• Intelligently, automatically programs FPGA

[Diagram: hyper-acceleration stack with Dataflow Adaptation Layer, Bigstream Dataflow, Bigstream Hypervisor]

Use of FPGA acceleration is limited

• Skill gap
• Programming model difference
• FPGA developers lack Spark programming models and big data knowledge
• Big Data developers lack FPGA experience

Spark acceleration: Accelerating Performance

• 0 code changes or additions to queries (1)
• 8X performance acceleration (1)

1: Running TPC-DS benchmark per Spark/SQL Business Intelligence Benchmarks. TPC-DS is a widely used industry-standard decision support benchmark used to evaluate the performance of data processing engines. Compared to open source Apache Spark running on Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz.

Video Application: image processing acceleration

Image Processing Needs and Challenges

Image boom
• Internet traffic is increasing by 24%* annually, and images are a large portion of internet data.
• Companies are handling huge volumes of images in the data center: cloud storage, mobile instant messaging, social networking, e-commerce.

Computational resources
• Decoding, resizing, cropping, and encoding of image files are typical processes that need large numbers of servers. This becomes cost prohibitive.

CPU performance
• CPU performance per core is struggling to keep pace.
• FPGA to the rescue.

*Source: Cisco VNI Forecast Highlights Tool

CTAccel Accelerates Image Processing

CTAccel Image Processing (CIP) effectively accelerates the following image processing/analytics workflows:
• Thumbnail generation/transcoding
• Image processing (sharpen/color filter)
• Image analytics

CIP includes the following FPGA-based accelerated functions:
• Decoder: JPEG
• Pixel processing: resizing/crop
• Encoder: JPEG, WebP, Lepton

Software compatibility with OpenCV, ImageMagick and Lepton

Image Processing: Accelerating Performance

• 4.9x faster JPEG to WebP (1)
• 5X lower latency (1)

1: Compared to Intel® Xeon® E5-2630 v2 CPU, JPEG to WebP.

High velocity cloud data applications: database access acceleration

Database Access Latency Challenges

Increasing data volumes
• Flood of data from multiple sources (big data, Internet of Things (IoT), business analysis, e-commerce)

Faster decision-making
• Companies are increasingly reliant on data to fuel innovation and decision making

Real-time performance
• Database analytics requires real-time performance (SaaS, finance, industrial, resource management)

Cloud/relational database based data analytics is employed across all industries, and access times impact business results.

Swarm64 Accelerates Database Access Times

Performance, scalability, throughput optimization

Seamless plug-in that enables popular databases and supports any configuration, in the cloud or on-premise

• High velocity data access
• Accelerates filtering, SQL-query pre-processing and de/compression
• Compatible with existing applications
• MySQL, PostgreSQL and MariaDB support (others in development)
• No change to IT infrastructure required, easy to deploy

Solving Real-World Problems: database acceleration

[Figure: 2X+, 3X+, 10X+ gains for traditional data warehousing (2), faster real-time data analytics (1), and storage compression (3)]

1. Based on database queries run with SWARM64 acceleration vs. no acceleration. Testing performed by Swarm64.
2. Data warehousing tested with queries and data taken from TPC-DS benchmark. Testing performed by Swarm64.
3. Based on database size run with SWARM64 acceleration vs. no acceleration. Testing performed by Swarm64.

Solution Roadmap

Timeline: 2017 to 2018, by quarter

Platform milestones:
- DCP 1.0 Beta
- DCP 1.0 Production
- DCP 1.1 Alpha (with network)
- DCP 1.1 Production (with network connectivity)

Demos:
- Key Value Store (Algo-Logic)
- PairHMM (Broad/Intel)
- GZIP (Accelize / CAST)
- SDR to HDR Conversion (Accelize/b<>com)

Production*:
- Key Value Store (Algo-Logic)
- PairHMM (Broad/Intel)
- SQL DB Acceleration (Swarm64)
- Spark Acceleration (Bigstream)
- AI Training & Inference (i-abra)
- Broadcast H.264 / H.265 Codecs (SoC Technologies)
- Financial back testing (Levyx)
- Genomics GATK Pipeline (Falcon Computing)

Beta:
- Deep Learning Acceleration Suite (Intel, Beta release)

Production*:
- NoSQL DB Acceleration (Reniac)
- C/C++ to OpenCL Compiler (Falcon Computing)
- JPEG to WebP (CTAccel)
- H.264 Transcode (Adaptive Microware)
- Spark/Hadoop Shuffle Accel (A3Cube)
- High Frequency Trading (Algo-Logic)
- Deep Learning Acceleration Suite (Intel)

Production*:
- PAL / SAP Hana Acceleration (Xelera)
- Advanced Firewall (F5 Networks)
- Security NIC (Napatech)
- 40G TCP/IP Offload (Enyx)
- Real-time Financial Analytics (Velocidata)
- Machine Learning Compiler (Myrtle Software)
- High Frequency Trading (Celerix Technology)

Production:
- H.264 Encoder/Decoder, H.265 Decoder (IBEX)
- Kafka / Spark Accelerator Engine (Megh Computing)
- Algorithmic trading (Xcelerit)
- AV1 Hybrid Codec (ATEME)
- Oil & Gas (Senai)
- Risk Check Compliance (Aplicata)
- Hadoop / Spark Acceleration (Wasai)

* Production status by end of quarter

Basic Architecture Description

FPGA Overview

▪ Field Programmable Gate Array (FPGA)
– Millions of logic elements
– Thousands of embedded memory blocks
– Thousands of DSP blocks
– Programmable routing
– High speed transceivers
– Various built-in hardened IP
▪ Used to create Custom Hardware!

[Diagram labels: DSP Block; Memory Block; Programmable Routing Switch; Logic Modules]

Let’s zoom in

Basic Elements

• 1-bit configurable operation: can be configured to perform any 1-bit operation (AND, OR, NOT, ADD, SUB)
• 1-bit register: stores the result
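A basic element of this kind can be pictured in software as a small truth table followed by a 1-bit register. The sketch below is an illustrative Python model, not vendor code: the names `BasicElement`, `configure`, and `clock` are hypothetical, and the two-input truth table is an assumption made to keep the example small.

```python
class BasicElement:
    """Toy model of one basic element: a configurable 1-bit
    operation followed by a 1-bit register (hypothetical names)."""

    def __init__(self, truth_table):
        # truth_table[(a, b)] -> 0 or 1; this plays the role of the
        # element's configuration.
        self.truth_table = dict(truth_table)
        self.register = 0  # the stored 1-bit result

    def clock(self, a, b):
        """Evaluate the configured operation and store the result."""
        self.register = self.truth_table[(a, b)]
        return self.register


def configure(op):
    """Build the truth table of a 2-input, 1-bit operation."""
    return {(a, b): op(a, b) & 1 for a in (0, 1) for b in (0, 1)}


and_element = BasicElement(configure(lambda a, b: a & b))
or_element = BasicElement(configure(lambda a, b: a | b))
# A 1-bit ADD that ignores the carry is the same as XOR.
add_element = BasicElement(configure(lambda a, b: a + b))
```

Changing the truth table changes the operation without changing the element itself, which is the sense in which the operation is "configurable".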

Flexible Interconnect

Wider custom operations are implemented by configuring and interconnecting Basic Elements.

Custom Operations Using Basic Elements

Examples:
• 16-bit add
• 32-bit square root
• Your custom 64-bit bit-shifter and encode
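As a rough illustration of how a wider operation such as the 16-bit add is assembled from 1-bit pieces, the sketch below chains 1-bit full adders in Python. It is a software model only; the function names `full_adder` and `ripple_add` are hypothetical, and a ripple-carry structure is assumed for simplicity (real devices may use dedicated carry logic).

```python
def full_adder(a, b, carry_in):
    """1-bit add: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=16):
    """Add two unsigned integers by chaining `width` 1-bit adders.

    The carry out of each bit position feeds the next one, and the
    result wraps at 2**width like a fixed-width hardware adder.
    """
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result
```

For example, `ripple_add(1234, 4321)` gives 5555, and `ripple_add(0xFFFF, 1)` wraps around to 0 because the final carry falls off the 16-bit result.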

Memory Blocks

• Each memory block holds 20 Kb and has addr, data_in, and data_out ports.
• Blocks can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches.
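To get a feel for how many 20 Kb blocks a given on-chip buffer consumes, a capacity-only estimate can be sketched as below. This is a simplification: it assumes 20 Kb means 20 × 1024 bits and ignores the specific depth/width configurations a real block supports, and `blocks_needed` is a hypothetical helper name.

```python
BLOCK_BITS = 20 * 1024  # one 20 Kb memory block (assumed 20 * 1024 bits)


def blocks_needed(depth, width_bits):
    """Capacity-only estimate of 20 Kb blocks for a depth x width memory."""
    total_bits = depth * width_bits
    # Integer ceiling division: round up to whole blocks.
    return -(-total_bits // BLOCK_BITS)


small_cache = blocks_needed(512, 32)      # 16 Kb of data
large_cache = blocks_needed(65536, 32)    # 2 Mb of data
```

Under these assumptions the 512-entry cache fits in a single block, while the 65536-entry cache needs 103 blocks grouped together.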

Floating Point Multiplier/Adder Blocks

Dedicated floating point multiply and add blocks, with data_in and data_out ports.

Configurable Routing

Blocks are connected into a custom data-path that matches your application.

Configurable IO

The custom data-path can be connected directly to custom or standard IO interfaces for inline data processing: PCIe, network interfaces, cameras, disk drives.

Traditional FPGA Design Entry

▪ Used by hardware designers only
▪ Circuits described using Hardware Description Languages (HDL) such as VHDL or Verilog
▪ A designer must describe the behavior of the algorithm to create a low-level digital circuit
– Logic, Registers, Memories, State Machines, etc.
▪ Complete designs can take up to several months!

Example: a 4-to-1 multiplexer in Verilog (inputs a, b, c, d; 2-bit select sel; output mux_out)

always @(a or b or c or d or sel)
begin
  case (sel)
    2'b00: mux_out = a;
    2'b01: mux_out = b;
    2'b10: mux_out = c;
    2'b11: mux_out = d;
  endcase
end

FPGA High Level Design with OpenCL™

Goal: design FPGA custom hardware with a C-based software language

▪ Benefits
– Makes FPGA acceleration available to software engineers
– Debug and optimize in a software-like environment
– Significant productivity gains compared to hardware-centric flow
– Easier to perform design exploration
– Abstracts away FPGA design flow and FPGA hardware

__kernel void _foo (__global float *x) {
  int i …
}

*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos

Pipeline Generation for FPGAs

Why does a software designer want an FPGA?

A simple program

OpenCL code:

kernel void add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

Instruction level (IR):

add:
  R0 Load Mem[100]
  R1 Load Mem[101]
  R2 Load #42
  R2 Mul R1, R2
  R0 Add R2, R0
  Store R0 Mem[100]

Why execute your program on an FPGA over a CPU?
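To make the instruction-level listing concrete, the sketch below steps through the six instructions on a tiny register-machine model in Python. The tuple encoding of the instructions and the function name `run_program` are assumptions made for illustration; only the instruction sequence itself comes from the slide.

```python
def run_program(mem):
    """Execute the six-instruction listing one instruction per step.

    `mem` maps addresses to integers and is updated in place.
    Returns the number of sequential steps taken.
    """
    regs = {"R0": 0, "R1": 0, "R2": 0}
    program = [
        ("load", "R0", 100),        # R0 Load Mem[100]
        ("load", "R1", 101),        # R1 Load Mem[101]
        ("loadi", "R2", 42),        # R2 Load #42
        ("mul", "R2", "R1", "R2"),  # R2 Mul R1, R2
        ("add", "R0", "R2", "R0"),  # R0 Add R2, R0
        ("store", "R0", 100),       # Store R0 Mem[100]
    ]
    steps = 0
    for instr in program:
        op = instr[0]
        if op == "load":
            regs[instr[1]] = mem[instr[2]]
        elif op == "loadi":
            regs[instr[1]] = instr[2]
        elif op == "mul":
            regs[instr[1]] = regs[instr[2]] * regs[instr[3]]
        elif op == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        elif op == "store":
            mem[instr[2]] = regs[instr[1]]
        steps += 1
    return steps


mem = {100: 5, 101: 3}
steps = run_program(mem)
```

With `Mem[100] = 5` and `Mem[101] = 3`, the model leaves 131 in `mem[100]` after six sequential steps, matching `Mem[100] += 42*Mem[101]`.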

A simple 3-address CPU

[Diagram labels: Instruction Fetch (PC, Op, Val); Registers (Aaddr, Baddr, Caddr, CData, CWriteEnable); ALU (inputs A and B, output C, Op); Load (LdAddr, LdData); Store (StAddr, StData)]

Executing the program on this CPU, one instruction per step:

1. R0 Load Mem[100]: load memory value into register
2. R1 Load Mem[101]: load memory value into register
3. R2 Load #42: load immediate value into register
4. R2 Mul R1, R2: multiply two registers, store result in register
5. R0 Add R2, R0: add two registers, store result in register
6. Store R0 Mem[100]: store register value into memory
CPU activity, step by step

[Figure: along the Time axis, the CPU executes one instruction per step: R0 Load Mem[100]; R1 Load Mem[101]; R2 Load #42; R2 Mul R1, R2; R0 Add R2, R0; Store R0 Mem[100]]

How about the FPGA?

• An FPGA does not have a fixed architecture
• An FPGA is massively parallel and configurable
• You should not limit yourself to a fixed architecture that executes instructions sequentially
• Try to execute instructions in parallel

Programmable Solutions Group Intel Confidential

Unroll the CPU hardware and specialize by position

[Figure: one copy of the CPU per instruction: R0 Load Mem[100]; R1 Load Mem[101]; R2 Load #42; R2 Mul R1, R2; R0 Add R2, R0; Store R0 Mem[100]]

1. Instructions are fixed. Remove “Fetch”.
2. Remove unused ALU ops.
3. Remove unused Load / Store.
4. Wire up registers properly, and propagate state.
5. Remove dead data.
6. Reschedule!

FPGA Custom Hardware

Custom datapath: your algorithm, in silicon!

▪ Typically creates a very deeply pipelined version of a kernel
– Huge number of operations simultaneously in flight
▪ Data can more easily be localized on chip

Build exactly what you need:
– Operations
– Data widths
– Memory size & configuration

Efficiency: Throughput / Latency / Power

High-level code: Mem[100] += 42 * Mem[101]

[Diagram: custom datapath with two loads, the constant 42, and a store]
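The benefit of keeping many operations in flight can be sketched with a small cycle-by-cycle model in Python. Everything here is an illustrative assumption rather than what a real compiler generates: the datapath is modeled with three stages (load, multiply by 42, add and store), each taking one cycle, and it is fed a stream of (x, y) pairs so that every pair computes x + 42*y, extending the single `Mem[100] += 42 * Mem[101]` update to many independent items. The name `simulate_datapath` is hypothetical.

```python
def simulate_datapath(pairs):
    """Model a 3-stage pipeline computing x + 42*y for each (x, y).

    Returns (results, cycles). A new pair enters every cycle, so
    several pairs are in flight at once.
    """
    pending = list(pairs)
    loaded = None    # pipeline register after the load stage
    product = None   # pipeline register after the multiply stage
    results = []
    cycles = 0
    while pending or loaded is not None or product is not None:
        # Update back to front so each stage consumes the value its
        # predecessor produced in the previous cycle.
        if product is not None:
            results.append(product[0] + product[1])  # add, then store
        product = (loaded[0], 42 * loaded[1]) if loaded is not None else None
        loaded = pending.pop(0) if pending else None
        cycles += 1
    return results, cycles


results, cycles = simulate_datapath([(5, 3), (1, 1)])
```

In this model two items finish in 4 cycles and n items finish in n + 2 cycles, whereas the six-instruction sequential listing needs 6 steps per item. Real pipelines are much deeper, but the throughput argument is the same: once the pipeline is full, one result completes per cycle.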

Summary

▪ FPGAs are composed of millions of reconfigurable logic elements, memory blocks, and DSP blocks.
▪ Algorithms can be implemented in hardware by chaining blocks together using a programmable interconnection matrix.
▪ FPGAs are an ideal solution for building high-performance processing datapaths that offload general-purpose processors.
▪ FPGAs provide a flexible, deterministic, low-latency, high-throughput, and energy-efficient solution for accelerating workloads.

Accelerating workloads with Intel® Xeon® CPUs and FPGAs

Francisco Perez, Intel Field Applications Engineer
[email protected]

Network Platforms Group

Data Movement and Processing Explosion

Hyper-connected world
• >20B connected devices by 2020‡
• 5G wireless
• High-speed wireline and wireless links bring the data to the data center at ever-increasing rates

High-performance computing demands
• Big data processing and analytics
• Explosion in data processing needs in network, storage, and compute

Workloads
• Processing must be done within a fixed space and power budget
• Data centers cannot grow unbounded
• By leveraging Intel accelerators, like FPGAs, these processing needs can be addressed

‡ Source: “Gartner Says 8.4 Billion Connected ‘Things’ Will Be in Use in 2017, Up 31 Percent From 2016”, 2/7/2017, http://www.gartner.com/newsroom/id/3598917 (Table 1 - IoT Units Installed Base by Category, 2020 column, Grand Total, including consumer+business units)

Intel® Xeon® Scalable Processor Family Acceleration Options

Options range from general purpose to optimized:

Intel® Xeon® CPU
• Workloads: general-purpose, AVX-512
• Product focus: software and instruction acceleration
• Software flexibility

Intel® FPGA
• Workloads: flexible and versatile set of algorithmic workloads
• Product focus: low-latency, parallel stream processing, custom
• Hardware flexibility

Intel® QuickAssist Technology
• Workloads: standard cryptography and compression
• Product focus: high-bandwidth
• Fixed HW acceleration

Intel® FPGA Data Center Platform Options

Enabled by the Acceleration Stack for Intel® Xeon® CPU with FPGAs

PCIe Acceleration Cards (PCIe Gen3 x8)

System flexibility with Intel Xeon CPU SKU options; can be slotted into 1U servers

Versatile workload acceleration
• Customizable hardware architecture using Arria® 10 GX FPGAs

High performance with Arria® 10 GX FPGA
• 1150K logic elements available with 53 Mb of embedded memory
• 8 GB DDR4 memory with ECC (2 banks), 2133 Mbps

High data ingestion and lower latency
• PCIe x8 Gen3 electrical, x16 mechanical *
• 1x QSFP with 4x 10GbE or 40GbE support

Low power in small form factor
• 70 W TDP, 45 W FPGA
• 650 LFM at Tla 55°C, passively cooled
• 1 RU, as small as ½ length, ½ height
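For a sense of what the PCIe Gen3 x8 host link on such a card can carry, the theoretical per-direction bandwidth follows from the Gen3 signaling rate (8 GT/s per lane) and its 128b/130b line encoding. The Python sketch below only reproduces that arithmetic; the function name is hypothetical, and the result is an upper bound that ignores packet and protocol overhead.

```python
GEN3_TRANSFERS_PER_SECOND = 8e9  # 8 GT/s per lane for PCIe Gen3
GEN3_ENCODING = 128 / 130        # 128b/130b line encoding efficiency


def pcie_gen3_bandwidth_gbytes(lanes):
    """Theoretical one-direction bandwidth in GB/s (10**9 bytes/s),
    before packet and protocol overhead."""
    bits_per_second = GEN3_TRANSFERS_PER_SECOND * GEN3_ENCODING * lanes
    return bits_per_second / 8 / 1e9


x8_bandwidth = pcie_gen3_bandwidth_gbytes(8)
```

For an x8 link this works out to roughly 7.9 GB/s in each direction, which bounds how fast data can move between host memory and the card.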

Intel® Programmable Acceleration Card (PAC) with Arria® 10 GX FPGA

Sampling today; general availability 1H2018

Next Generation PACs & Platforms
• Higher performance
• Increased connectivity
• More integration options

Powered by Intel® Xeon® CPU with FPGAs

Application & IP Migration to Multiple Platforms

HW Accelerator Execution Model

[Diagram labels: Host Machine; SW Application uses Accelerator; HW Accelerator Configuration; PCIe link; Device; HW Accelerator]

New Cloud Scale Services with FPGA in the Data Center

[Diagram labels: Public and Private Cloud Users; Launch workload; Orchestration Software (FPGA Enabled); Allocate Compute Unit; Pull workload from library; Secure Accelerator Store with workload accelerators (Intel developed, 3rd party developed, end user developed); Static/dynamic FPGA programming; Software Defined Infrastructure Resource Pool (Compute, FPGA, Storage, Network); Virtualized Workload 1, Workload 2, ... Workload N; Intel® Xeon® processor VM with Accel]

Microsoft Azure

A faster, more efficient, more intelligent cloud

• Data explosion: 4.4 ZB in 2013, 44 ZB in 2020 (Source: IDC 2014)
• ML, DNN, AI are driving requirements up faster
• Autonomous decision making
• Real-time insights into connected devices
• Interactive user experiences
• Cloud-scale services
• Searches and recommendations (Indexing the Internet!)

The need for SCALE
The need for LOW-LATENCY
The need for THROUGHPUT

WCS Gen4.1 Blade with NIC and Catapult FPGA

Catapult v2 Mezzanine card

[Diagram labels: Management; Fabric; Hardware (FPGA); Super low-latency network]

Traditional software (CPU) server plane

[Diagram labels: CPU, CPU, QPI; QSFP; 40 Gb/s; ToR; Web search ranking]

Hardware acceleration plane

[Diagram labels: CPU, CPU, QPI; FPGA; QSFP ports; 40 Gb/s links; ToR]

• Interconnected FPGAs form a separate plane of computation
• Can be managed and used independently from the CPU
• Workloads: web search ranking, deep neural networks, SDN offload, SQL

https://insidehpc.com/2018/04/cray-build-fpga-accelerated-supercomputer-paderborn-university/


FPGAs for every programmer


Software Developers are the New FPGA Developers

“I don’t speak FPGA!

What is the programming model, and where are the compilers, libraries and tools I am used to?”


Board Design & Qualification

Software Development

FPGA Accelerator Development

Intel® Investment in All These Areas Democratizes FPGA Acceleration


What Is the Acceleration Stack for Xeon with FPGA?

✓ Simplifies the FPGA programming model

Applications / Orchestration
▪ Rack Scale Design: orchestration / rack-level management
▪ User applications: deep learning, networking, genomics, etc.

Vertical Software Frameworks / Libs (DL, networking, genomics, etc.)
▪ Frameworks: Intel® DL Deployment Toolkit
▪ Acceleration libraries: Intel® DAAL, Intel® MKL, Intel® MKL-DNN

Common Infrastructure
▪ FPGA HW & SW tool chains: Intel FPGA SDK for OpenCL™, Intel Quartus® Prime
▪ Open Programmable Acceleration Engine (OPAE software API): drivers, virtualization, APIs, acceleration engine
▪ Operating systems (OS enablement): Linux, Windows
▪ FPGA images: loadable AFU image (.gbs)
▪ IP libraries: DLA, GEMM, VirtIO, pHMM, compression, encryption, etc.
▪ FPGA Interface Manager (FIM)

Hardware
▪ Intel Xeon, FPGA
▪ FPGA Platforms (Programmable Acceleration Cards)

Label on the slide: NDA required


Ecosystem of FPGA Workloads

Application & FPGA Development

FPGA Deployment & Management

Enabled by
▪ Data Center Operator
▪ Integrated Services Vendors
▪ HW & SW Developer
▪ End Application User


Out-of-Box Flow for Acceleration Stack

Common steps
▪ Buy server w/ PAC
▪ Install server OS

Deployment Flow (End Application User)
▪ Download & install the Deployment Package of the Acceleration Stack (Intel website)
▪ Download & install workload (vendor website)

Development Flow (HW & SW Developer)
▪ Download & install the Developer Package of the Acceleration Stack (Intel website)
▪ Download & install simulator
▪ Download & install HLS or OpenCL (optional)
▪ Create & simulate workload
▪ Write host application


How Can FPGA Accelerators Be Created?

Accelerator Functional Unit (AFU)

Self-Developed
▪ VHDL or Verilog: performance optimized
▪ C/C++ programming language: higher productivity (Intel® HLS Compiler, Intel® FPGA SDK for OpenCL™)

Externally-Sourced
▪ Ecosystem partner: contracted engagement
▪ Intel® reference designs


Components of the Acceleration Stack for Xeon with FPGA: Overview

Software (Intel® Xeon®)
▪ Application: developed by user
▪ Libraries: user, Intel, and 3rd party
▪ Open Programmable Acceleration Engine (OPAE): provided by Intel
▪ Drivers (PCIe* drivers): provided by Intel

Hardware (FPGA)
▪ Accelerator Functional Unit (AFU): user, Intel, or 3rd-party IP; plugs into the AFU slot
▪ FPGA Interface Manager (signal bridge and management): provided by Intel

FPGA Platforms (Programmable Acceleration Cards)
▪ Qualified and validated for volume deployment
▪ Provided by OEMs


Components of the Acceleration Stack: FPGA Interface Manager (FIM)

Simplifies the use of FPGAs


How Accelerator Functions Interface to the FPGA Interface Manager

Standard framework and abstraction layer for AFU integration with Acceleration Stack

FPGA Interface Manager (FIM)
▪ FPGA Interface Unit (FIU) with PCIe Gen3 x8 hard IP controller
▪ CCI-P to the AFU: 512-bit bidirectional data path at 400 MHz; channels CH0 (RX/TX), CH1 (RX/TX), CH2 (TX)
▪ Clocks: 400 MHz, 200 MHz, 100 MHz, Usr_Clk, Usr_Clk/2
▪ SDRAM Bank 0 interface: Avalon-MM (AV MM) master/slave, 512-bit at 267 MHz; DIMM 0: 64-bit with ECC at 1067 MHz
▪ SDRAM Bank 1 interface: Avalon-MM (AV MM) master/slave, 512-bit at 267 MHz; DIMM 1: 64-bit with ECC at 1067 MHz

Accelerator Function Unit (AFU)
▪ User accelerator logic (e.g. matrix multiply)

Interface to Xeon Host via common API Drivers (OPAE)
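A back-of-the-envelope calculation turns the widths and clock rates shown on this slide into peak bandwidths. It is only a sketch under stated assumptions: one transfer per clock on the CCI-P and Avalon-MM paths, double data rate on the 1067 MHz DIMM interface, and the standard PCIe Gen3 figures of 8 GT/s per lane with 128b/130b encoding. Sustained throughput in a real design will be lower.

```python
def gbytes_per_s(bits, mhz, transfers_per_clock=1):
    """Peak bandwidth in GB/s for a bus `bits` wide clocked at `mhz`."""
    return bits / 8 * mhz * 1e6 * transfers_per_clock / 1e9

ccip = gbytes_per_s(512, 400)       # CCI-P data path, per direction
avmm = gbytes_per_s(512, 267)       # Avalon-MM port of one SDRAM bank
dimm = gbytes_per_s(64, 1067, 2)    # 64-bit DIMM, assumed double data rate

# PCIe Gen3 x8: 8 GT/s per lane, 128b/130b encoding, 8 lanes, per direction.
pcie = 8e9 * (128 / 130) * 8 / 8 / 1e9

for name, value in [("CCI-P", ccip), ("Avalon-MM bank", avmm),
                    ("DIMM", dimm), ("PCIe Gen3 x8", pcie)]:
    print(f"{name:15s} {value:6.2f} GB/s")
```

Under these assumptions the two memory-side figures agree at roughly 17 GB/s per bank, and the PCIe link to the host is the narrowest path, which is one reason to keep working data in the card's local memory.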


Components of the Acceleration Stack: Open Programmable Acceleration Engine (OPAE)

Simplifies the use of FPGAs


OPAE: Simplified FPGA Programming Model for Application Developers

Software stack (top to bottom)
▪ Applications, Frameworks, Intel® Acceleration Libraries
▪ FPGA API (C): enumeration, management, access
▪ FPGA Driver (common: AFU, local memory, HSSI)
▪ Bare metal: FPGA Driver (physical function, PF) in the OS
▪ Virtual machine: FPGA Driver (virtual function, VF) in the guest OS, on top of the hypervisor
▪ FPGA Hardware + Interface Manager

Consistent API across product generations and platforms
▪ Abstraction for hardware-specific FPGA resource details

Designed for minimal software overhead and latency
▪ Lightweight user-space library (libfpga)

Open ecosystem for industry and developer community
▪ License: FPGA API (BSD), FPGA driver (GPLv2)

FPGA driver being upstreamed into Linux kernel

Supports both virtual machines and bare metal platforms

Faster development and debugging of Accelerator Functions with the included AFU Simulation Environment (ASE)

Includes guides, command-line utilities and sample code

Start developing for Intel FPGAs with OPAE today: http://01.org/OPAE


What an FPGA Accelerator looks like to Application Software

From the OS’s point of view

▪ FPGA hardware appears as a regular PCIe device

▪ FPGA accelerator appears as a set of features accessible by software programs running on host

Unified C API model

▪ Resource management and orchestration services in a data center can use it to discover and select FPGA resources and assign them to workloads

Architecture supports the Single Root I/O Virtualization (SR-IOV) PCIe extension, enabling host software to access the accelerator:

▪ Via a hypervisor/VMM (Virtual Function)

▪ Bypassing the VMM/hypervisor (Physical Function)

User Application Software

Orchestration Services

Application Libraries

Operating System

Drivers

Hypervisor

OPAE

AFU

FPGA


Intel® portal for all things related to FPGA acceleration

www.intel.com/fpgaaccelerationhub

• Acceleration Stack for Intel® Xeon® with FPGAs
• FPGA Acceleration Platforms
• Acceleration Solutions & Ecosystem
• Knowledge Center
• FPGA as a Service
• Academia
• 01.org *

* 01.org is an open source community site


Summary

▪ FPGAs provide a flexible, deterministic low-latency, high-throughput, and energy-efficient solution for accelerating workloads

▪ Intel® Programmable Acceleration Cards are PCIe cards already certified for servers

▪ The Acceleration Stack simplifies FPGA adoption for software programmers

▪ There is a growing list of ready-to-use workload accelerators that solve real use cases

▪ Intel® provides high-level synthesis tools to develop HW accelerators for custom needs


Deep Learning for Intel FPGAs

High Performance and Custom Inference

Francisco Perez
Intel Field Applications Engineer
[email protected]


Amazing new capabilities

Vehicle recognition: a car, a black car, a Volkswagen Passat, license plate number, not the owner!

People: people detection, people tracking, analysis of behavior/intentions ("Thief")


Challenges, Markets and Applications

Edge, Gateway/Fog, Data center/Cloud

Medical Imaging, Auto, Industrial
▪ Location-aware applications require real-time detection and identification of objects using a variety of input sensors and hybrid/heterogeneous processing.

Digital Surveillance, Smart City, Smart Classroom
▪ Digital surveillance solutions need to support many input cameras and provide real-time, low-latency identification of specific people, faces, vehicle plates & gestures.

Data Center, Cloud (Image, Audio, Speech, Text, NLP)
▪ Data center applications require efficient, low-latency compute across multiple nodes and a diverse set of workloads including image, speech, and text.


The full system

Edge, gateway, datacenter

More analytics moved to the edge

Faster response time, more control at the edge

Less bandwidth

Less storage required


Multiple approaches to AI

Machine Learning: How do you engineer the best features?
▪ Input: N × N image
▪ Hand-engineered features f1, f2, …, fK: roundness of face, distance between eyes, nose width, eye socket depth, cheek bone structure, jaw line length, etc.
▪ Classifier algorithm: SVM, Random Forest, Naïve Bayes, Decision Trees, Logistic Regression, ensemble methods
▪ Output: "Arjun"

Deep Learning: How do you guide the model to find the best features?
▪ Input: N × N image
▪ Neural network
▪ Output: "Arjun"


Deep learning: Training vs. inference

Training
▪ Lots of labeled data! (human, bicycle, strawberry)
▪ Forward: the network guesses "Strawberry" for an image labeled "Bicycle"
▪ Backward: the error is used to adjust the model weights

Inference
▪ Forward only: the trained model weights classify a new image ("Bicycle"?)

[Chart: accuracy vs. data set size]

Did you know? Training requires a very large data set and a deep neural network (i.e. many layers) to achieve the highest accuracy in most cases.
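The difference between the two phases can be sketched with a single sigmoid neuron: training runs a forward pass, compares the result with the label, and runs a backward step that adjusts the weights, while inference is the forward pass alone with the weights frozen. The data, learning rate and function names below are made up for illustration; real networks have many layers and use a framework such as Caffe or TensorFlow.

```python
import math

def forward(w, b, x):
    """Forward pass: a single sigmoid neuron."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def train(samples, epochs=500, lr=0.5):
    """Training = forward pass + backward pass (gradient step) over labeled data."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, label in samples:
            pred = forward(w, b, x)     # forward
            error = pred - label        # compare with the human-provided label
            w -= lr * error * x         # backward: gradient of the log-loss
            b -= lr * error
    return w, b

# Toy labeled data (made up): feature value -> class 0 or 1.
labeled = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train(labeled)

# Inference = forward pass only, with the weights frozen.
print(forward(w, b, 1.5) > 0.5)
print(forward(w, b, -1.5) > 0.5)
```

Only the `forward` function is needed at deployment time, which is why inference is the natural target for a fixed, pipelined accelerator.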


End-to-end AI compute

[Figure: product map across training and inference, for data center / workstation and gateway / edge. Labels: General AI, Mainstream AI, Deep Learning, Flexible Acceleration; Mainstream Training, Intensive Training; Mainstream Inference, Higher Inference Throughput, Real-time Inference, Custom Inference; NNP; Vision (1-20 W); Speech/Audio (1-100+ mW); Intel GNA (IP); Autonomous driving]

All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.


Intel® AI portfolio

▪ Experiences
▪ Tools: Intel® Deep Learning Deployment Toolkit, Intel® Computer Vision SDK, Intel® FPGA DL Acceleration Suite
▪ Frameworks: Mlib, BigDL, More*
▪ Libraries: Intel Python Distribution, Intel® DAAL, Intel® Math Kernel Library (MKL, MKL-DNN), Intel® Nervana™ Graph
▪ Hardware: Compute, Memory & Storage, Networking
▪ Other labels on the slide: Associative Memory Base, Visual Intelligence


Design Flow with Machine Learning

Flow: Data Collection, Data Store, Choose Network, Train Network, Inference Engine

Arrow labels: Selection, Architecture, Parameters

Choose Network topology
▪ Use framework (e.g. Caffe, TensorFlow)

Train Network
▪ A high-performance computing (HPC) workload from large dataset
▪ Weeks to months process

Inference Engine (FPGA Focus)
▪ Implementation of the neural network performing real-time inferencing

Improvement Strategies
• Collect more data
• Improve network


Deep Learning Topology Processing

Vision: CNNs

▪ Pre-processing: re-size / crop image
▪ Neural net, “the body”: image to features; most of the compute is here (Intel FPGA Deep Learning Acceleration Suite)
▪ “Heads” 1, 2, … 10: feature vector for index, tags, object detect
▪ Post-processing


Intel® Computer Vision SDK & Components

What’s Inside the Intel® Computer Vision SDK

Intel® Deep Learning Deployment Toolkit
▪ Trained models
▪ Model Optimizer: convert & optimize, producing IR
▪ Inference Engine: optimized inference on CPU, GPU, FPGA, VPU

Traditional Computer Vision for Intel CPU/CPU with integrated graphics: optimized computer vision libraries
▪ OpenCV*
▪ OpenVX*

Component tools

Increase Processor Graphics Performance (Linux* only)
▪ Intel® Media SDK (open source version)
▪ OpenCL™ drivers & runtimes for Intel® Integrated Graphics

Linux for FPGA only
▪ Bitstreams
▪ FPGA RunTime Environment (RTE) (from Intel® FPGA SDK for OpenCL™)

IR = Intermediate Representation format
GPU = Intel CPU with integrated graphics processing unit / Intel® Processor Graphics
VPU = Intel® Movidius™ Vision Processing Unit

OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos


Intel® Deep Learning Deployment Toolkit

Take Full Advantage of the Power of Intel® Architecture

Flow: Trained Model (Caffe, TF, MxNet), Model Optimizer (convert & optimize to fit all targets), IR (.data), Inference Engine (load, infer)

Model Optimizer

▪ What it is: Preparation step -> imports trained models

▪ Why important: Optimizes for performance/space with conservative topology transformations; biggest boost is from conversion to data types matching hardware.

Inference Engine

▪ What it is: High-level inference API; common API (C++) for optimized cross-platform inference

▪ Why important: Interface is implemented as dynamically loaded plugins for each hardware type. Delivers best performance for each type without requiring users to implement and maintain multiple code pathways.

Plugins
▪ CPU Plugin: extendibility via C++
▪ GPU Plugin: extendibility via OpenCL™
▪ FPGA Plugin: extendibility via OpenCL/TBD
▪ Myriad Plugin: extendibility TBD

IR = Intermediate Representation format

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos


Improve Performance with Model Optimizer

▪ Easy to use, Python*-based workflow does not require rebuilding frameworks.

▪ Import Models from various frameworks (Caffe*, TensorFlow*, MXNet*, more are planned…)

▪ More than 100 models for Caffe, MXNet and TensorFlow validated.

▪ Caffe is not required to generate IRs for models consisting of standard layers, or when the user provides their own custom layers

Flow: Trained Model, Model Optimizer (Analyze, Quantize, Optimize topology, Convert), Intermediate Representation (IR) file
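To give a feel for what converting weights to a hardware-friendly data type costs in accuracy, the sketch below rounds a few FP32-style weights to IEEE 754 half precision (FP16) and measures the relative error. It uses only Python's standard `struct` module; the weight values are made up, and this is not the Model Optimizer's actual algorithm, only an illustration of the precision trade-off.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

weights = [0.1, -0.33333, 1.5, 0.0007, -2.25]    # made-up example weights
fp16_weights = [to_fp16(w) for w in weights]

max_rel_error = max(abs(q - w) / abs(w) for w, q in zip(weights, fp16_weights))
print(fp16_weights)
print(f"max relative error: {max_rel_error:.2e}")
```

For values in FP16's normal range the relative rounding error stays below 2^-11 (about 0.05%), which is why many inference workloads tolerate the narrower type while halving memory traffic.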


Optimal Model Performance Using the Inference Engine

Transform Models & Data into Results & Intelligence

Plug-In Architecture
▪ Applications/Service
▪ Inference Engine Common API
▪ Inference Engine Runtime
▪ Intel Math Kernel Library (MKLDNN) Plugin: intrinsics; CPU: Intel® Xeon®/Core™/Atom®
▪ clDNN Plugin: OpenCL™; Intel® Integrated Graphics (GPU)
▪ FPGA Plugin: DLAS; Intel® FPGA
▪ Movidius Plugin: Movidius API; Movidius™ Myriad 2

▪ Simple & Unified API for Inference across all Intel® architecture (IA)

▪ Optimized inference on large IA hardware targets (CPU/GEN/FPGA)

▪ Heterogeneity support allows execution of layers across hardware types

▪ Asynchronous execution improves performance

▪ Futureproof/scale your development for future Intel® processors
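The heterogeneity point above can be sketched as a simple fallback rule: each layer goes to the preferred device if its plugin supports that layer type, otherwise to the next device in the priority list. Everything in this sketch is hypothetical, including the `PLUGINS` table, the layer support sets and the `assign_layers` helper; it is not the Inference Engine API, only an illustration of the idea.

```python
# Hypothetical plugin table: which layer types each device plugin can run.
PLUGINS = {
    "FPGA": {"Convolution", "ReLU", "Norm", "MaxPool", "FullyConnected"},
    "CPU": {"Convolution", "ReLU", "Norm", "MaxPool", "FullyConnected",
            "SoftMax", "Reshape"},
}

def assign_layers(layers, priority=("FPGA", "CPU")):
    """Give each layer to the first device in `priority` that supports it."""
    plan = []
    for layer in layers:
        device = next((d for d in priority if layer in PLUGINS[d]), None)
        if device is None:
            raise ValueError(f"no plugin supports layer {layer!r}")
        plan.append((layer, device))
    return plan

network = ["Convolution", "ReLU", "MaxPool", "FullyConnected", "SoftMax"]
for layer, device in assign_layers(network):
    print(f"{layer:15s} -> {device}")
```

With this table the compute-heavy layers land on the FPGA and only the final SoftMax falls back to the CPU, so application code does not need a separate path per device.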


Intel® FPGA DLA Suite Usage

• Supports common software frameworks (Caffe, TensorFlow)

• Intel DL software stack provides graph optimizations

• Intel FPGA Deep Learning Acceleration Suite provides turn-key or customized CNN acceleration for common topologies

Software flow
▪ Standard ML frameworks: Caffe, TensorFlow
▪ Intel Deep Learning Deployment Toolkit: Model Optimizer, Inference Engine
▪ DLA SW API
▪ Heterogeneous CPU/FPGA deployment: Intel® Xeon® Processor, Intel® FPGA

Pre-compiled Graph Architectures (hardware customization supported)
▪ GoogleNet Optimized Template
▪ ResNet Optimized Template
▪ SqueezeNet Optimized Template
▪ VGG Optimized Template
▪ Additional, Generic CNN Templates

Optimized Acceleration Engine
▪ Blocks: Conv PE Array, Crossbar, Feature Map Cache, Memory Reader/Writer, Config Engine, DDR


Machine Learning on Intel® FPGA Platform

Acceleration Stack Platform Solution

Software Stack
▪ Application
▪ ML Framework (Caffe*, TensorFlow*)
▪ DL Deployment Toolkit
▪ DLA Runtime Engine
▪ OpenCL™ Runtime
▪ Acceleration Stack

Hardware Platform & IP
▪ DLA Workload
▪ BBS
▪ PAC Family Boards
▪ Intel® Xeon CPU

For more information on the Acceleration Stack for Intel® Xeon® CPU with FPGAs on the Intel® Programmable Acceleration Card, visit the Intel® FPGA Acceleration Hub


Increase Deep Learning Performance on Public Models using the Intel® Computer Vision SDK even MORE with FPGA Accelerator Cards (Frames Per Second (FPS))

Intel Computer Vision SDK Accelerates Performance of Deep Learning Models running on Intel Hardware

Get Faster Results with Less Work

Baseline: Caffe* framework, out of box. These are multiples of how much faster than baseline the model will run.

Public model | Batch size | OpenCV* optimized (non-Intel) | Intel® CV SDK on CPU | Intel CV SDK w/ Floating Point 16 (FP16)¹ | Intel CV SDK on Intel® Arria 10-1150GX FPGA
Squeezenet* 1.1 | 1 | 4.27x | 7.03x | 4.39x | 16.51x
Vgg16* | 1 | 1.83x | 2.39x | 4.32x | 5.57x
GoogLeNet* v1 | 1 | 3.37x | 6x | 6.11x | 16.89x
SSD 300* | 1 | 1.85x | 2.66x | 4.54x | 8.61x
Squeezenet* 1.1 | 32 | 4.22x | 5.95x | 7.52x | 19.91x
Vgg16* | 32 | 1.91x | 2.64x | 4.35x | 8.08x
GoogLeNet* v1 | 32 | 3.48x | 5.77x | 7.11x | 18.81x
SSD 300* | 32 | 1.89x | 2.72x | 3.87x | 8.87x

Column annotations
▪ Intel® CV SDK on CPU: Optimize it, use Intel tools
▪ FP16 column: Or offload to Intel® Iris™ Pro Graphics
▪ FPGA column: Or offload to Intel® FPGA
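Because every column is expressed against the same out-of-box Caffe baseline, dividing one column by another gives the relative gain between two configurations. The sketch below does this for the FPGA column against the optimized-CPU column, using the figures from the slide; it assumes both columns of a row were measured on the same model and batch size, as the table layout suggests.

```python
# Speedups over the out-of-box Caffe* baseline, copied from the slide:
# (model, batch size) -> (Intel CV SDK on CPU, Intel CV SDK on Arria 10 FPGA)
speedups = {
    ("Squeezenet 1.1", 1): (7.03, 16.51),
    ("Vgg16", 1): (2.39, 5.57),
    ("GoogLeNet v1", 1): (6.0, 16.89),
    ("SSD 300", 1): (2.66, 8.61),
    ("Squeezenet 1.1", 32): (5.95, 19.91),
    ("Vgg16", 32): (2.64, 8.08),
    ("GoogLeNet v1", 32): (5.77, 18.81),
    ("SSD 300", 32): (2.72, 8.87),
}

fpga_vs_cpu = {key: fpga / cpu for key, (cpu, fpga) in speedups.items()}
for (model, batch), ratio in fpga_vs_cpu.items():
    print(f"{model:15s} batch {batch:2d}: FPGA is {ratio:.2f}x the optimized CPU result")
```

On these numbers the FPGA delivers roughly 2.3x to 3.3x the throughput of the already optimized CPU path, at both batch size 1 and batch size 32.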


Exploits the benefits of HW parallelism


CNN Computation in One Slide

$$I_{\text{new}}(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_{\text{old}}(x + x',\, y + y') \times F(x', y')$$

Input Feature Map (Set of 2D Images)

Filter (3D Space)

Output Feature Map

Repeat for Multiple Filters to Create Multiple “Layers” of Output Feature Map
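The sum on this slide translates almost line for line into code. The sketch below applies a 3x3 filter F to a single 2D image, computing I_new(x, y) as the sum of I_old(x + x', y + y') × F(x', y') for x' and y' in {-1, 0, 1}. It is a simplification of the slide: one input channel, no padding, stride 1, plain Python lists; the function name `conv3x3` and the test image are made up.

```python
def conv3x3(image, kernel):
    """Apply the 3x3 multiply-accumulate sum at every interior pixel."""
    height, width = len(image), len(image[0])
    out = [[0] * (width - 2) for _ in range(height - 2)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # kernel is indexed 0..2, so shift x', y' by +1
                    acc += image[y + dy][x + dx] * kernel[dy + 1][dx + 1]
            out[y - 1][x - 1] = acc
    return out

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # picks the centre pixel
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # sums the 3x3 neighbourhood

print(conv3x3(image, identity))   # [[6, 7], [10, 11]]
print(conv3x3(image, box))        # [[54, 63], [90, 99]]
```

Every output pixel is nine independent multiplies followed by an addition tree, and all output pixels are independent of each other, which is the parallelism an FPGA can unroll in hardware.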


Why Intel® FPGAs for Machine Learning? – Reason 1

Convolutional Neural Networks are Compute Intensive

Fine-grained & low latency between compute and memory

[Diagram: pipeline parallelism. IO, Function 1, Function 2, Function 3, IO, with optional memory between functions]

Feature | Benefit
Highly parallel architecture | Facilitates efficient low-batch video stream processing and reduces latency
Configurable distributed floating-point DSP blocks | FP32 9 TFLOPS, FP16, FP11; accelerates computation by tuning compute performance
Tightly coupled high-bandwidth memory | >50 TB/s on-chip SRAM bandwidth, random access, reduces latency, minimizes external memory access
Programmable data path | Reduces unnecessary data movement, improving latency and efficiency
Configurability | Support for variable precision (trade-off between throughput and accuracy); future-proof designs and system connectivity


Why Intel® FPGAs for Machine Learning? – Reason 2

Future Proof: Rapid Innovation of DL Topologies

▪ Deep learning is undergoing constant innovation

– Better accuracy / higher compute density

▪ Efforts to improve throughput and efficiency are ongoing

– Batching, sparsity, weight sharing, compression, etc.

▪ This rapid and constant evolution can present a challenge if implemented on a fixed architecture (e.g. a GPU)


Intel® FPGA Deep Learning Acceleration Suite

▪ CNN acceleration engine for common topologies executed in a graph loop architecture

– AlexNet, GoogleNet, LeNet, SqueezeNet, VGG16, ResNet, Yolo, SSD, LSTM…

▪ Software Deployment

– No FPGA compile required

– Run-time reconfigurable

▪ Customized Hardware Development

– Custom architecture creation w/ parameters

– Custom primitives using OpenCL™ flow

[Diagram blocks: DDR, Memory Reader/Writer, Feature Map Cache, Config Engine, Convolution PE Array, Crossbar with primitives (prim, prim, prim, custom)]


DLA Architecture: Built for Performance

▪ Maximize Parallelism on the FPGA

– Filter Parallelism (Processing Elements)

– Input-Depth Parallelism

– Winograd Transformation

– Batching

– Feature Stream Buffer

– Filter Cache

▪ Choosing FPGA Bitstream

– Data Type / Design Exploration

– Primitive Support
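The Winograd transformation in the list above trades multiplications for additions, which matters on an FPGA because multipliers (DSP blocks) are the scarcer resource. As an illustration only, here is the textbook one-dimensional F(2,3) form: two outputs of a 3-tap filter computed with 4 multiplications on the data path instead of 6. How the DLA applies the transform internally is not described on the slide, and the function names are made up.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter computed directly: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3): 4 multiplications on the data path."""
    # Filter transform; can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    return [m1 + m2 + m3, m2 - m3 - m4]

data = [1.0, 2.0, 3.0, 4.0]
taps = [0.5, -1.0, 2.0]
print(direct_f23(data, taps))     # [4.5, 6.0]
print(winograd_f23(data, taps))   # [4.5, 6.0]
```

The filter-side terms depend only on the weights, so they can be computed once per filter and reused for every input tile.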

[Diagram blocks: DDR (x4), Memory Reader/Writer, Feature Map Cache, Config Engine, Stream Buffer, Conv PE Array, Crossbar (ReLU, Norm, MaxPool); execute path: Stream Buffer, Convolution / Fully Connected, ReLU, Norm, MaxPool]


Mapping Graphs in DLA

AlexNet graph layer types: Conv, ReLU, Norm, MaxPool, Fully Conn.

DLA blocks: Stream Buffer, Convolution / Fully Connected, ReLU, Norm, MaxPool

Blocks are run-time reconfigurable and bypassable
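One way to picture "reconfigurable and bypassable" is as a scheduler that walks the network graph and, for each pass through the fixed block chain, switches on the blocks that pass needs and bypasses the rest. The sketch below is a hypothetical model of that idea, not the DLA graph compiler: the `PIPELINE` order, the `schedule` function and the example graph fragment are all assumptions made for illustration.

```python
# Fixed hardware pipeline, in the order the blocks appear on the slide.
PIPELINE = ["Conv/FC", "ReLU", "Norm", "MaxPool"]

def schedule(graph):
    """Group graph layers into passes through the pipeline.

    A new pass starts whenever the next layer maps to a block at or before
    the one just used; blocks a pass does not need are bypassed.
    """
    block_of = {"Conv": "Conv/FC", "Fully Conn.": "Conv/FC",
                "ReLU": "ReLU", "Norm": "Norm", "MaxPool": "MaxPool"}
    passes, current, last = [], set(), -1
    for layer in graph:
        idx = PIPELINE.index(block_of[layer])
        if idx <= last:
            passes.append(current)
            current = set()
        current.add(PIPELINE[idx])
        last = idx
    passes.append(current)
    return [[(b, "on" if b in p else "bypass") for b in PIPELINE] for p in passes]

# Hypothetical fragment of a CNN graph, not the full AlexNet layer list.
graph = ["Conv", "ReLU", "Norm", "MaxPool", "Conv", "ReLU", "Fully Conn."]
for i, config in enumerate(schedule(graph), 1):
    print(i, config)
```

The fragment needs three passes: the first uses every block, the second bypasses Norm and MaxPool, and the third uses only the Convolution / Fully Connected block.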



Support for Different Topologies

Core blocks: Stream Buffer, Convolution / Fully Connected, ReLU, Norm, MaxPool

Additional primitives: Permute, Flatten, PriorBox, SoftMax, Concat, LRN, Reshape


Support for Different Topologies

Tradeoff between features and performance

Architecture A: Convolution PE Array, Crossbar (ReLU, LRN, Norm, MaxPool), Memory Reader/Writer, Feature Map Cache, Config Engine

vs

Architecture B: the same blocks, with Prior Box, Permute, Concat, Flatten, SoftMax and Reshape added


User Flows for Intel® FPGA DL Acceleration Suite

Software Deployment Flow (Data Scientist)
▪ Neural net design
▪ CV SDK: Model Optimizer, Inference Engine API
▪ DLA Runtime Engine: DLA Graph Compiler, DLA Runtime API
▪ Bitstream library

Architecture Development Flow (IP Architect)
▪ Customized architecture: custom primitives, custom layers
▪ Offline compiler: Intel® FPGA SDK for OpenCL™

Stage labels on the slide: Design, Compile, Program


Summary

▪ FPGAs provide a flexible, deterministic low-latency, high-throughput, and energy-efficient solution for accelerating AI applications

▪ Intel® FPGA DLA Suite supports CNN inference on FPGAs

▪ Accessed through Intel® Computer Vision SDK

▪ Available for Intel® Programmable Acceleration Card

▪ Future Proof: can adapt to rapid innovation of DL Topologies
