Graham McKenzie – Acceleration Systems FAE (EMEA)



Programmable Solutions Group

Driving Trends for AI

Bigger Data

• Image: 1000 KB / picture
• Audio: 5000 KB / song
• Video: 5,000,000 KB / movie

Better Hardware

• Transistor density doubles every 18 months
• Cost / GB in 1995: $1000.00
• Cost / GB in 2015: $0.03

Smarter Algorithms

• Advances in neural networks leading to better accuracy in training models


Market Demands Scalability for Machine Learning

Cloud Analytics

• 1000s of Classes
• Large Workloads
• Highly Efficient (Performance / W)
• Varying accuracy
• Server Form Factor

Embedded Analytics

• <10 Classes
• Frame Rate: 15-30fps
• Power: 1W-5W
• Cost: Low
• Varying accuracy
• Custom form factor


Rapidly Evolving CNN Topologies


2012 AlexNet

2014 GoogLeNet

2015 ResNet

2016 FractalNet

SqueezeNet

LeNet

HMAX, NeoCognitron

NVIDIA DriveNet

Rapidly Evolving, Computation Intensive


Convolution

Input Feature Map (Set of 2D Images)

Filter (3D Space)

Output Feature Map

Repeat for Multiple Filters to Create Multiple “Layers” of Output Feature Map
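The convolution pictured on these slides can be sketched in a few lines of Python. This is a minimal, unoptimized illustration, assuming stride 1 and no padding. The name `conv2d` and the nested-list data layout are choices made here for illustration and do not come from the slides. As in most CNN frameworks, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(ifm, filters):
    """Apply each 3D filter to a stack of 2D input images.

    ifm:     input feature map, indexed [channel][row][col]
    filters: list of 3D filters, each indexed [channel][row][col]
    Returns one 2D output plane per filter (stride 1, no padding).
    """
    channels = len(ifm)
    height, width = len(ifm[0]), len(ifm[0][0])
    ofm = []
    for f in filters:  # one output "layer" per filter
        f_rows, f_cols = len(f[0]), len(f[0][0])
        plane = [
            [
                # Multiply-accumulate over the whole 3D filter volume.
                sum(
                    ifm[c][y + r][x + s] * f[c][r][s]
                    for c in range(channels)
                    for r in range(f_rows)
                    for s in range(f_cols)
                )
                for x in range(width - f_cols + 1)
            ]
            for y in range(height - f_rows + 1)
        ]
        ofm.append(plane)
    return ofm


# Two 3x3 input channels and two 2x2x2 filters (illustrative values only).
ifm = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
filters = [
    [[[1, 1], [1, 1]], [[1, 1], [1, 1]]],  # sums every value in the window
    [[[1, 0], [0, 0]], [[0, 0], [0, 0]]],  # picks the top-left value of channel 0
]
ofm = conv2d(ifm, filters)
print(ofm)  # [[[16, 20], [28, 32]], [[1, 2], [4, 5]]]
```

Every output value is an independent multiply-accumulate over the filter volume, which is why the workload parallelizes well.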


Why an FPGA for Deep Learning?


1 TFLOP floating point performance in Arria 10

– 35W total device power

– Enable massive parallelism, compute spatially

8 TB/s memory bandwidth: keep state on chip!

– Exceeds available external bandwidth by factor of 50*

– Random access, low latency (2 clks)

Avoid costly data movement

– Place all data in on-chip memory, compute temporally

Flexibility

– Support rapidly evolving algorithm and future architecture

– Enable accelerator pipeline for the best system efficiency

Fine-grained & low latency between compute and memory

[Figure: FPGA containing Kernel 1, Kernel 2 and Kernel 3, with two IO blocks and two optional external memories]

* DDR4 @ 3.2GHz, 72Bits…etc.
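The headline numbers on this slide can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted above (1 TFLOP, 35 W, 8 TB/s, factor of 50). The DDR4 part rests on assumptions that the footnote does not spell out: "3.2GHz" is read as 3200 MT/s, and the 72-bit interface is taken as 64 data bits plus 8 ECC bits.

```python
# Figures quoted on the slide.
PEAK_TFLOPS = 1.0        # floating point performance in Arria 10
DEVICE_POWER_W = 35.0    # total device power
ON_CHIP_BW_TBPS = 8.0    # on-chip memory bandwidth
EXTERNAL_FACTOR = 50.0   # on-chip bandwidth / external bandwidth

# Peak compute efficiency implied by the slide.
gflops_per_watt = PEAK_TFLOPS * 1000.0 / DEVICE_POWER_W

# External bandwidth implied by the "factor of 50" claim.
implied_external_gbps = ON_CHIP_BW_TBPS * 1000.0 / EXTERNAL_FACTOR

# ASSUMPTION: DDR4 at 3200 MT/s with 64 data bits per 72-bit channel.
TRANSFERS_PER_SEC = 3200e6
DATA_BITS = 64
ddr4_channel_gbps = TRANSFERS_PER_SEC * DATA_BITS / 8 / 1e9

# Number of such channels needed to reach the implied external bandwidth.
equivalent_channels = implied_external_gbps / ddr4_channel_gbps

print(f"{gflops_per_watt:.1f} GFLOPS/W peak")
print(f"{implied_external_gbps:.0f} GB/s implied external bandwidth")
print(f"{ddr4_channel_gbps:.1f} GB/s per assumed DDR4 channel")
print(f"{equivalent_channels:.2f} assumed channels to match")
```

Under these assumptions the factor of 50 corresponds to roughly six DDR4 channels of external bandwidth. The slide's "…etc." leaves the exact configuration unspecified.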


Deep Learning FPGA Accelerator IP


Turns FPGA into deep learning accelerator

Reconfigurable to different CNN topologies

– Use Intel Caffe to define topology

– No FPGA compile required

Optimized implementation of core primitives

– Common primitives for CNN topologies

– Can be used or bypassed to create custom graphs

[Figure: ML Framework (Torch, Theano, Caffe) on top of the MKL-DNN / DLA SW API; accelerator blocks: Stream Buffer, Convolution / Fully Connected, ReLU, Norm, MaxPool]
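The idea that primitives "can be used or bypassed to create custom graphs" can be pictured as a fixed chain of stages with a per-stage bypass switch. The sketch below is purely illustrative and is not the DLA IP interface: the stage functions work on flat Python lists, `scale_norm` is a simple stand-in for the Norm block, and all names are hypothetical.

```python
def relu(xs):
    """Clamp negative activations to zero."""
    return [max(0.0, x) for x in xs]


def scale_norm(xs):
    """Stand-in for the Norm block: scale by the largest magnitude."""
    peak = max(abs(x) for x in xs) or 1.0  # avoid dividing by zero
    return [x / peak for x in xs]


def max_pool(xs, window=2):
    """1D max pooling over non-overlapping windows."""
    return [max(xs[i:i + window]) for i in range(0, len(xs), window)]


# Fixed stage order, loosely following the block diagram (hypothetical).
STAGES = [("relu", relu), ("norm", scale_norm), ("maxpool", max_pool)]


def run_pipeline(xs, bypass=frozenset()):
    """Run the stages in order, skipping any whose name is in `bypass`."""
    for name, stage in STAGES:
        if name not in bypass:
            xs = stage(xs)
    return xs


activations = [-2.0, 4.0, 1.0, -1.0]
print(run_pipeline(activations))                   # [1.0, 0.25]
print(run_pipeline(activations, bypass={"norm"}))  # [4.0, 1.0]
```

Changing the bypass set changes the graph without touching the stage implementations, which mirrors the slide's point that a new topology needs no FPGA recompile.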


Energy Efficient Inference with Infrastructure Flexibility

Excellent energy efficiency: up to 25 images/sec/watt inference on Caffe/AlexNet

Reconfigurable accelerator can be used for a variety of data center workloads

Integrated FPGA with Intel® Xeon® processor fits in standard server infrastructure -OR- Discrete FPGA fits in PCIe card and embedded applications*

Intel® Arria® 10 FPGA: Superior Inference Capabilities

Offers high perf/watt for inference with low latency and flexible precision

*Xeon with Integrated FPGA refers to Broadwell Proof of Concept. Configuration details on slide: 44.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance

Source: Intel measured as of November 2016

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804


Scalable Architecture


Intel® Deep Learning Inference Accelerator (DLIA)

Turnkey inference solution to accelerate convolutional neural networks (CNN)

Image processing applications

Available Q2’17

Turnkey Solution

• Hardware: PCIe* add-in card with Intel® Arria 10 FPGA

• Software: Optimized deep learning framework with MKL-DNN and Caffe* integration. Preloaded CNN image recognition algorithms support multiple topologies

• IP: DLA IP accelerates 6 primitives on FPGA

• Validation, warranty and support

Value Proposition

• Accelerate time to market by simplifying deployment with turnkey solution & software ecosystem.

• Reduce TCO by offloading/accelerating inference workloads to support datacenter scalability

• Unified APIs on Intel Architecture provide a consistent user experience across Intel product families


Software Architecture


Software stack, from application down to hardware:

• CNN application: Application using AlexNet, GoogLeNet or custom topology for recognition
• Caffe integrated w/ MKL-DNN: Caffe* framework integrated with MKL-DNN. Primitives unsupported in MKL-DNN fall back to the CPU implementation in Caffe
• MKL-DNN: Unified Intel deep learning API integrated with DLA IP. Primitives unsupported in DLA IP fall back to CPU primitives in MKL-DNN
• DLIA SW API: SW API to expose the FPGA primitives
• OpenCL Runtime: OpenCL API to access FPGA
• Board Support Package: Driver to the board
• DLA IP: Deep Learning Accelerator (DLA) IP. Implements 6 CNN primitives (conv, FC, relu, pooling, norm, concat)

Robustness, Compatibility, Ease of use

Deep Learning SDK
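The layered fallback described on this slide (DLA IP first, then CPU primitives in MKL-DNN, then the CPU implementation in Caffe) can be sketched as a simple dispatch rule. The set of six DLA primitives comes from the slide. Everything else is hypothetical: the contents of `MKLDNN_CPU_ONLY` are illustrative assumptions, and `place_layer` is not part of any Intel API.

```python
# The six primitives the slide says the DLA IP implements.
DLA_PRIMITIVES = {"conv", "fc", "relu", "pooling", "norm", "concat"}

# ASSUMPTION: example primitives handled on the CPU by MKL-DNN.
# This set is illustrative only and is not taken from the slides.
MKLDNN_CPU_ONLY = {"softmax", "eltwise"}


def place_layer(kind):
    """Return where a layer of the given kind would run under the fallback rule."""
    if kind in DLA_PRIMITIVES:
        return "FPGA (DLA IP)"
    if kind in MKLDNN_CPU_ONLY:
        return "CPU (MKL-DNN)"
    return "CPU (Caffe)"  # final fallback: the framework's own implementation


def place_topology(layer_kinds):
    """Map every layer of a topology to its execution target."""
    return [(kind, place_layer(kind)) for kind in layer_kinds]


# Hypothetical topology fragment; "custom_op" stands for an unsupported layer.
for kind, target in place_topology(["conv", "relu", "pooling", "softmax", "custom_op"]):
    print(f"{kind:10s} -> {target}")
```

Because each level catches what the level below cannot run, an application written against Caffe keeps working even when a topology contains layers the FPGA does not accelerate.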


Intel® Nervana™ Portfolio: Common Architecture for AI Implementations

Targeted acceleration

• Intel® Xeon® Processors: Most widely deployed machine learning platform (>97%)
• Intel® Xeon Phi™ Processors: Higher performance, general purpose machine learning
• Intel® Xeon® Processor + FPGA: Higher perf/watt inference, programmable
• Intel® Xeon® Processor + Lake Crest: Best in class neural network performance


Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2016 Intel Corporation. Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
