HPC Advisory Council Conference, Lugano, Switzerland, April 11th, 2017
Claire Bild, Intel
Deep Learning with Intel® Xeon Phi™ processors
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, Intel Xeon, Intel Xeon Phi™, Intel® Atom™ are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
Copyright © 2017, Intel Corporation
*Other brands and names may be claimed as the property of others.
Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. The cost reduction scenarios described in this document are intended to enable you to get a better understanding of how the purchase of a given Intel product, combined with a number of situation-specific variables, might affect your future cost and savings. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs.
Intel® Advanced Vector Extensions (Intel® AVX)* are designed to achieve higher throughput for certain integer and floating point operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you should consult your system manufacturer for more information.
*Intel® Advanced Vector Extensions refers to Intel® AVX, Intel® AVX2 or Intel® AVX-512. For more information on Intel® Turbo Boost Technology 2.0, visit http://www.intel.com/go/turbo
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Copyright © 2017 Intel Corporation. All rights reserved.
Legal Notices & disclaimers
The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company’s expectations. Demand for Intel's products is highly variable and, in recent years, Intel has experienced declining orders in the traditional PC market segment. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; customer acceptance of Intel’s and competitors’ products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel’s results could be affected by the timing of closing of acquisitions, divestitures and other significant transactions. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC filings.
An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release. Rev. 4/15/14
Risk factors
• HPC, Analytics and Artificial Intelligence on IA
• Deep Learning with Intel® Xeon Phi™ Processors
• Deep Learning tools, frameworks and libraries
• Other ingredients for a Deep Learning solution on IA
• HPC, Analytics and Artificial Intelligence on IA
By 2020…
The average internet user will generate ~1.5 GB of traffic per day.
Smart hospitals will be generating over 3,000 GB per day.
Self-driving cars will be generating over 4,000 GB per day… each.
A connected plane will be generating over 40,000 GB per day.
A connected factory will be generating over 1,000,000 GB per day.
All numbers are approximated.
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
radar: ~10-100 KB per second
sonar: ~10-100 KB per second
GPS: ~50 KB per second
LIDAR: ~10-70 MB per second
cameras: ~20-40 MB per second
1 car: ~5 exaflops per hour
Bigger Data: Image 1,000 KB / picture; Audio 5,000 KB / song; Video 5,000,000 KB / movie.
Better Hardware: transistor density doubles every 18 months; cost / GB in 1995: $1000.00; cost / GB in 2015: $0.03.
Smarter Algorithms: advances in neural networks leading to better accuracy in training models.
Classic ML: using optimized functions or algorithms to extract insights from data. Algorithms: Random Forest, Support Vector Machines, Regression, Naïve Bayes, Hidden Markov, K-Means Clustering, Ensemble Methods, more…
Flow: Training Data* → model → Inference, Clustering, or Classification of New Data*

Deep Learning: using massive labeled data sets to train deep (neural) graphs that can make inferences about new data. Algorithms: CNN, RNN, RBM…
Step 1: Training. Use a massive labeled dataset (e.g. 10M tagged images) to iteratively adjust the weighting of neural network connections (untrained → trained). Hours to days in the cloud.
Step 2: Inference. Form an inference about new input data (e.g. a photo) using the trained neural network. Real-time at the edge or in the cloud.
*Note: not all classic machine learning functions require training
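To make the two steps concrete, here is a minimal, self-contained sketch (my own illustration, not from the deck, and far simpler than the CNN/RNN topologies it discusses): training iteratively adjusts the weights of a single sigmoid neuron from labeled examples, and inference then applies the frozen weights to new input.

// Hypothetical single-neuron model: y = sigmoid(w·x + b)
#include <cmath>
#include <cstdio>
#include <vector>

struct Neuron {
    std::vector<double> w;
    double b = 0.0;
    double predict(const std::vector<double>& x) const {
        double z = b;
        for (size_t i = 0; i < w.size(); ++i) z += w[i] * x[i];
        return 1.0 / (1.0 + std::exp(-z));   // sigmoid activation
    }
};

int main() {
    // Step 1: Training -- iteratively adjust weights on labeled examples
    // (here: logical AND, a toy stand-in for "10M tagged images").
    std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
    std::vector<double> y = {0, 0, 0, 1};
    Neuron n; n.w = {0.0, 0.0};
    const double lr = 0.5;                   // learning rate
    for (int epoch = 0; epoch < 2000; ++epoch) {
        for (size_t k = 0; k < X.size(); ++k) {
            double err = n.predict(X[k]) - y[k];   // gradient of the log-loss
            for (size_t i = 0; i < n.w.size(); ++i) n.w[i] -= lr * err * X[k][i];
            n.b -= lr * err;
        }
    }
    // Step 2: Inference -- apply the trained (frozen) weights to new input.
    std::printf("p(1,1) = %.3f\n", n.predict({1, 1}));  // well above 0.5
    std::printf("p(1,0) = %.3f\n", n.predict({1, 0}));  // well below 0.5
}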
Health – Early Tumor Detection: a leading medical imaging company; goal: early detection of malignant tumors in mammograms; data: millions of “diagnosed” mammograms; approach: Deep Learning (CNN) tumor image recognition; outcome: higher accuracy and earlier breast cancer detection.
Finance – Data Synthesis: a financial services institution with >$750B in assets; goal: parse information to reduce portfolio manager time to insight; data: vast stores of documents (news, emails, research, social); approach: Deep Learning (RNN with encoder/decoder); outcome: faster and more informed investment decisions.
Industrial – Smart Agriculture: a world leader in agricultural biotech; goal: accelerate hybrid plant development; data: large dataset of hybrid plant performance based on genotype markers; approach: Deep Learning (CNN) to detect favorable interactions between genotype markers; outcome: more accurate selection leading to cost reduction.
Modeling & Simulation, HPC Data Analytics, Artificial Intelligence, Visualization
Compute: Xeon®, Xeon Phi™
Fabric: Omni-Path Architecture
I/O: Silicon Photonics
Memory/Storage: 3D-NAND, 3D XPoint™
Acceleration: FPGAs
The Intel AI software stack, top to bottom:
experiences
tools: Intel® Deep Learning SDK, Intel® Computer Vision SDK, E2E Tool
Frameworks: Intel® Distribution, MLlib, BigDL, Intel® Nervana™ Graph* (*coming 2017)
libraries: Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL, Movidius MvTensor Library, Associative Memory Base
hardware: Compute (including Lake Crest), Memory & Storage, Networking, Visual Intelligence (Movidius Neural Compute Stick)
• Deep Learning with Intel® Xeon Phi™ Processors
Up to 72 cores (288 threads), Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
[Chart: >100X performance gap across CPU generations (2011-2016) between scalar & single-threaded code and vectorized & parallelized code, with scalar & parallelized and vectorized & single-threaded in between]
Intel® Xeon® processors are increasingly parallel and require modern code; Intel® Xeon Phi™ processors are extremely parallel and use general purpose programming.
Intel® Xeon Phi™ processor: bootable host CPU, processor package with integrated fabric.
• No PCIe bottleneck: bootable host processor
• Topple the memory wall: integrated 16GB memory
• Raise the memory ceiling: platform memory up to 384 GB (DDR4)
• Run any x86 workload: Intel® Xeon® processor binary-compatible
• Scale out seamlessly: efficient scaling like Intel® Xeon® processors
• Reduce cost1: dual-port Intel® Omni-Path Fabric
[Tile diagram: two cores, each with 2 VPUs, sharing 1MB L2 through a hub]
1 Reduced cost based on Intel internal estimate comparing cost of discrete networking components with the integrated fabric solution
Running a bandwidth-hungry workload – “STREAM addition” (MCDRAM vs. DDR4):

// Reconstructed from the slide; data_t, NUM_ELTS, NUM_ITERATIONS and the
// timing calls were elided there, so the definitions below are assumptions.
// hbw_malloc()/hbw_free() come from the memkind library's <hbwmalloc.h>
// (build with -fopenmp, link with -lmemkind); plain malloc() allocates in
// DDR4, hbw_malloc() in on-package MCDRAM.
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <hbwmalloc.h>

typedef double data_t;                       // assumed element type
typedef double results_t;                    // assumed: elapsed time [units]
static const size_t NUM_ELTS = 1 << 26;      // assumed working-set size
static const size_t NUM_ITERATIONS = 100;    // assumed repeat count

// Fill all six arrays deterministically (assumed helper).
static void init(data_t *A1, data_t *B1, data_t *C1,
                 data_t *A2, data_t *B2, data_t *C2)
{
    for (size_t i = 0; i < NUM_ELTS; ++i) {
        A1[i] = A2[i] = 1.0;
        B1[i] = B2[i] = 2.0;
        C1[i] = C2[i] = 0.0;
    }
}

results_t run_bench(data_t *A, data_t *B, data_t *C, const char *id)
{
    double t0 = omp_get_wtime();             // begin timing
    //!! START WORKLOAD
    for (size_t it = 0; it < NUM_ITERATIONS; ++it) {
        #pragma omp parallel for simd
        for (size_t i = 0; i < NUM_ELTS; ++i) {
            C[i] = A[i] + B[i];              // STREAM "add" kernel
        }
    }
    //!! END WORKLOAD
    results_t elapsed = omp_get_wtime() - t0; // end timing
    std::cout << id << " Calculations took " << elapsed << " [units].\n";
    return elapsed;
}

int main()
{
    data_t *A_reg = (data_t*) malloc(sizeof(data_t) * NUM_ELTS);
    data_t *B_reg = (data_t*) malloc(sizeof(data_t) * NUM_ELTS);
    data_t *C_reg = (data_t*) malloc(sizeof(data_t) * NUM_ELTS);
    data_t *A_hbw = (data_t*) hbw_malloc(sizeof(data_t) * NUM_ELTS);
    data_t *B_hbw = (data_t*) hbw_malloc(sizeof(data_t) * NUM_ELTS);
    data_t *C_hbw = (data_t*) hbw_malloc(sizeof(data_t) * NUM_ELTS);
    init(A_reg, B_reg, C_reg, A_hbw, B_hbw, C_hbw);
    auto res_reg = run_bench(A_reg, B_reg, C_reg, "[DDR]");
    auto res_hbw = run_bench(A_hbw, B_hbw, C_hbw, "[HBM]");
    std::cout << "Computations happened " << res_reg/res_hbw
              << "x times faster in high-bandwidth memory.\n";
    free(A_reg); free(B_reg); free(C_reg);
    hbw_free(A_hbw); hbw_free(B_hbw); hbw_free(C_hbw);
}

Sample run:
$ ./run.sh
[DDR] Calculations took 15376.5 [units].
[HBM] Calculations took 3056.19 [units].
Computations happened 5.03125x times faster in high-bandwidth memory.
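The slide's snippet assumes MCDRAM is present. As a hedged companion sketch using the same memkind hbwmalloc interface, hbw_check_available() can guard against running the HBM path on a machine without it:

#include <cstdio>
#include <hbwmalloc.h>

int main()
{
    // hbw_check_available() is part of memkind's hbwmalloc API;
    // it returns 0 when high-bandwidth memory can be allocated.
    if (hbw_check_available() == 0)
        std::puts("MCDRAM available: hbw_malloc() will use it");
    else
        std::puts("no MCDRAM: hbw_malloc() falls back per the hbw policy");
}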
[Diagram: Past vs. Present. Past: the CPU offloads work through a scheduler to a GPU/coprocessor. Present: the workload runs directly on the bootable many-core CPU.]
Deep Learning Image Classification Training Performance – MULTI-NODE Scaling
Topology: AlexNet*. Dataset: Large image database.
Normalized training time speed-up (higher is better): 1 node: 1.0x, 2 nodes: 1.9x, 4 nodes: 3.7x, 8 nodes: 6.6x, 16 nodes: 12.8x, 32 nodes: 23.5x, 64 nodes: 33.7x, 128 nodes: 52.2x.
Configurations: Up to 50X faster training on 128-node as compared to single-node based on AlexNet* topology workload (batch size = 1024) training time using a large image database running one node Intel Xeon Phi processor 7250 (16 GB MCDRAM, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux* 6.7 (Santiago), 1.0 TB SATA drive WD1003FZEX-00MK2A0 System Disk, running Intel® Optimized DNN Framework, training in 39.17 hours compared to 128 nodes identically configured with Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe x16 connectors training in 0.75 hours. Contact your Intel representative for more information on how to obtain the binary. For information on the workload, see https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Deep Learning Image Classification Training Performance – MULTI-NODE Scaling
Dataset: Large image database.
[Chart: scaling efficiency (%) vs. number of Intel® Xeon Phi™ processor 7250 (68-cores, 1.4 GHz, 16 GB) nodes (1 to 128) for OverFeat*, AlexNet*, VGG-A* and GoogLeNet: 87% efficiency at 32 nodes vs. 62% for 32 NVIDIA Tesla* GPUs. Up to 38% better scaling.]
Configurations: Up to 38% better scaling efficiency at 32-nodes claim based on GoogLeNet deep learning image classification training topology using a large image database comparing one node Intel Xeon Phi processor 7250 (16 GB MCDRAM, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat* Enterprise Linux 6.7, Intel® Optimized DNN Framework with 87% efficiency, to unknown hosts running 32 NVIDIA Tesla* K20 GPUs with 62% efficiency (Source: http://arxiv.org/pdf/1511.00175v2.pdf showing FireCaffe* with 32 NVIDIA Tesla* K20s (Titan Supercomputer*) running GoogLeNet* at 20x speedup over Caffe* with 1 K20).
*Other names and brands may be property of others
Deep Learning Image Classification Training Performance – MULTI-NODE Scaling (GoogLeNet V1)
Time-to-train scaling efficiency on Intel® Xeon Phi™ processor 7250 nodes: 1 node: 100%, 2 nodes: 94%, 4 nodes: 96%, 8 nodes: 97%, 16 nodes: 97%, 32 nodes: 97%.
Data pre-partitioned across all nodes in the cluster before training; no data is transferred over the fabric while training.
Configurations: 32 nodes of Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: flat mode), 96GB DDR4 memory, Red Hat* Enterprise Linux 6.7, export OMP_NUM_THREADS=64 (the remaining 4 cores are used for driving communication), Intel® MKL 2017 Update 1, MPI: 2017.1.132, Endeavor KNL bin1 nodes, export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2. Throughput is measured using the “train” command. Data pre-partitioned across all nodes in the cluster before training; no data transferred over the fabric while training. Scaling efficiency computed as: (single-node time-to-train / (N * time-to-train measured with N nodes)) * 100, where N = number of nodes. Intel® Caffe: Intel internal version of Caffe. GoogLeNetV1: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43022.pdf, batch size 1536.
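Plugging the AlexNet numbers from the earlier multi-node slide into that efficiency formula shows how it relates to raw speed-up (a small sketch; the 39.17 h and 0.75 h values come from that slide's own configuration note):

#include <cstdio>

// Scaling efficiency as defined in the configuration note:
// (single-node time-to-train / (N * time-to-train on N nodes)) * 100.
static double scaling_efficiency(double t_single, double t_n, int n)
{
    return t_single / (n * t_n) * 100.0;
}

int main()
{
    // AlexNet: 39.17 hours on 1 node vs. 0.75 hours on 128 nodes,
    // i.e. a 52.2x speed-up, which is 52.2/128 = ~41% efficiency.
    std::printf("%.1f%%\n", scaling_efficiency(39.17, 0.75, 128));  // 40.8%
}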
Normalized throughput (images/second): Intel® Xeon Phi™ processor 7250 relative performance, normalized to a 1.0 baseline of a 2S Intel® Xeon® processor E5-2699 v4. Caffe/AlexNet: up to 1.5x; TensorFlow/AlexNet-ConvNet: up to 1.3x.
Configurations: Caffe: Intel® Xeon® processor E5-2699 v4 (22 Cores, 2.2 GHz), 128GB memory, Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe available now, based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1 (22092016), Intel® MKL2017 prototxt, data in memory, images/sec results obtained using the “time” command, OMP_NUM_THREADS = number of CPU cores. Caffe: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe available now, based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1 (22092016), Intel® MKL2017 prototxt, data in memory, images/sec results obtained using the “time” command, OMP_NUM_THREADS = number of CPU cores. TensorFlow: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB 1200MHz DDR4 memory, Red Hat Enterprise Linux Server release 7.2 (Maipo), TensorFlow v0.10 available on demand, Intel® MKL GOLD UPDATE1 09/10/2016 nightly, data in memory. AlexNet: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, Batch Size: 256.
Normalized throughput (images/second): Intel® Xeon Phi™ processor 7250 relative performance, normalized to a 1.0 baseline of a 2S Intel® Xeon® processor E5-2697 v4. Caffe/AlexNet: up to 1.8x; TensorFlow/AlexNet-ConvNet: up to 1.7x; TensorFlow/ConvNet VGG: up to 1.4x.
Configurations: Intel® Xeon® processor E5-2697 v4 node w/dual sockets, 18 cores/socket, HT enabled @2.3GHz 145W, 128GB RAM DDR4 2400, 8x16GB DIMMs. Caffe: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe available now, based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1 (22092016), Intel® MKL2017 prototxt, data in memory, images/sec results obtained using the “time” command, OMP_NUM_THREADS = number of CPU cores. TensorFlow: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB 1200MHz DDR4 memory, Red Hat Enterprise Linux Server release 7.2 (Maipo), TensorFlow v0.10 available on demand, MKL GOLD UPDATE1 09/10/2016 nightly, data in memory. AlexNet: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, AlexNet / AlexNet ConvNet batch size: 256; VGG batch size: 64.
Knights Mill: next-generation Intel® Xeon Phi™ processor, coming SOON (2017)
• Optimized for Deep Learning
• Optimized for scale-out
• Flexible, high capacity memory
• Enhanced variable precision
• Improved efficiency
[Chart: single-precision teraflops, 1st generation Xeon Phi (2013) → 2nd generation Xeon Phi (2016) → Knights Mill (2017)]
Deep Learning Performance, normalized (estimated): Intel® Xeon Phi™ processor 7290 (baseline, 1.0) vs. Intel® Xeon Phi™ processor family – Knights Mill: up to 4x. Coming SOON.
Configurations: Knights Mill performance: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. BASELINE: Intel® Xeon Phi™ Processor 7290 (16GB, 1.50 GHz, 72 cores) with 192 GB total memory on Red Hat Enterprise Linux* 7.2, kernel 3.10.0-327, using Intel® MKL 11.3 Update 4, relative performance 1.0. NEW: Intel® Xeon Phi™ processor family – Knights Mill, relative performance up to 4x.
• Deep Learning tools, frameworks and libraries
Intel® MKL (Intel® Math Kernel Library): high performance math primitives granting a low level of control. Example usage: framework developers call matrix multiplication and convolution functions (see the sketch after this list).
MKL-DNN: free open source DNN functions for high-velocity integration with deep learning frameworks. Example usage: a new framework with functions developers call for maximum CPU performance.
Intel® Data Analytics Acceleration Library (DAAL): broad data analytics acceleration, an object-oriented library supporting distributed ML at the algorithm level. Example usage: call the distributed alternating least squares algorithm for a recommendation system.
Intel® Distribution (Python): most popular and fastest growing language for machine learning. Example usage: call the scikit-learn k-means function for credit card fraud detection.
Open Source Frameworks: toolkits driven by academia and industry for training machine learning algorithms. Example usage: script and train a convolutional neural network for image recognition.
Intel® Deep Learning SDK: accelerate deep learning model design, training and deployment. Example usage: deep learning training and model creation, with optimization for deployment on constrained end devices.
Intel® Computer Vision SDK: toolkit to develop and deploy vision-oriented solutions that harness the full performance of Intel CPUs and SOC accelerators. Example usage: use deep learning to do pedestrian detection.
software.intel.com/deep-learning-sdk
software.intel.com/ai
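As a concrete illustration of the MKL row above ("framework developers call matrix multiplication"), here is a hedged sketch of a direct call to MKL's standard CBLAS single-precision GEMM; the build setup (e.g. how MKL is linked) varies by toolchain and is not specified by the deck:

#include <mkl.h>
#include <cstdio>

int main()
{
    // C (2x2) = 1.0 * A (2x3) * B (3x2) + 0.0 * C, all row-major.
    const int M = 2, N = 2, K = 3;
    float A[M * K] = {1, 2, 3,
                      4, 5, 6};
    float B[K * N] = {1, 0,
                      0, 1,
                      1, 1};
    float C[M * N] = {0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  // 4 5 / 10 11
}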
Caffe/AlexNet normalized throughput (images/second) on Intel® Xeon Phi™ processor 7250 (higher is better): out-of-box (OOB*) performance: 1.0 baseline; current performance (Intel Caffe): up to 400x.
Configurations: BASELINE: Caffe out of the box, Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, BVLC-Caffe: https://github.com/BVLC/caffe, with OpenBLAS, relative performance 1.0. NEW: Caffe: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1, relative performance up to 400x. AlexNet used for both configurations as per https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, Batch Size: 256.
Collaborative Filtering is a technique used by recommender systems to predict which items to recommend to a specific shopper at Amazon*, which movies to watch at Netflix*, etc.
…using a matrix factorization…
…which can be solved with different approaches: Alternating Least Squares, Stochastic Gradient Descent, …
Alternating Least Squares is better than Stochastic Gradient Descent when:
• the hardware system offers massive parallelization
• the training set relies on implicit data (user preference inferred from clicks, purchase history)
• matrices are dense
Source: Matrix Factorization Techniques for Recommender Systems, Yehuda Koren, Robert Bell, Chris Volinsky
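To make the alternation concrete, here is a minimal rank-1 illustration (my own sketch, not from the deck; production ALS uses rank-k factors plus regularization, as in the Koren/Bell/Volinsky paper): fixing one factor turns the other into a closed-form least-squares solve, and the two solves alternate.

#include <cstdio>

int main()
{
    // Tiny dense ratings matrix R, approximated as R ≈ u * v^T (rank 1).
    const int USERS = 3, ITEMS = 4;
    double R[USERS][ITEMS] = {{5, 4, 1, 1},
                              {4, 5, 1, 2},
                              {1, 1, 5, 4}};
    double u[USERS] = {1, 1, 1};
    double v[ITEMS] = {1, 1, 1, 1};

    for (int sweep = 0; sweep < 50; ++sweep) {
        // Fix v, solve u in closed form: u[i] = (R v)_i / ||v||^2
        double vv = 0;
        for (int j = 0; j < ITEMS; ++j) vv += v[j] * v[j];
        for (int i = 0; i < USERS; ++i) {
            double rv = 0;
            for (int j = 0; j < ITEMS; ++j) rv += R[i][j] * v[j];
            u[i] = rv / vv;
        }
        // Fix u, solve v symmetrically: v[j] = (R^T u)_j / ||u||^2
        double uu = 0;
        for (int i = 0; i < USERS; ++i) uu += u[i] * u[i];
        for (int j = 0; j < ITEMS; ++j) {
            double ru = 0;
            for (int i = 0; i < USERS; ++i) ru += R[i][j] * u[i];
            v[j] = ru / uu;
        }
    }
    // Reconstructed score for user 0, item 2; compare against R[0][2] = 1.
    std::printf("prediction(user 0, item 2) = %.2f\n", u[0] * v[2]);
}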
Alternating Least Squares (ALS): time-to-train speed-up on an 8-node Intel® Xeon® Apache* Spark* cluster, normalized to the out-of-box BLAS library:
• 2S Intel® Xeon® processor E5-2697 v2 with F2JBLAS: 1.0
• 2S Intel® Xeon® processor E5-2697 v2 + Intel® MKL 2017: 3.4x
• 2S Intel® Xeon® processor E5-2699 v3 + Intel® MKL 2017: 8.8x
• 2S Intel® Xeon® processor E5-2697A v4 + Intel® MKL 2017: up to 18x
Configurations: BASELINE: Intel® Xeon® Processor E5-2697 v2 (12 Cores, 2.7 GHz), 256GB memory, CentOS 6.6*, F2JBLAS: https://github.com/fommil/netlib-java, relative performance 1.0. NEW: Intel® Xeon® processor E5-2697 v2 Apache* Spark* Cluster: 1 master + 8 workers, 10Gbit/sec Ethernet fabric, each system with 2 processors, Intel® Xeon® processor E5-2697 v2 (12 Cores, 2.7 GHz), Hyper-Threading enabled, 256GB RAM per system, 1x 240GB SSD OS drive, 12x 3TB HDD data drives per system, CentOS* 6.6, Linux 2.6.32-642.1.1.el6.x86_64, Intel® MKL 2017 build U1_20160808, Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* 1.6.1 standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200GB for Spark* master and workers, relative performance up to 3.4x. NEW: Intel® Xeon® processor E5-2699 v3 Apache* Spark* Cluster: 1 master + 8 workers, 10Gbit/sec Ethernet fabric, each system with 2 processors, Intel® Xeon® processor E5-2699 v3 (18 Cores, 2.3 GHz), Hyper-Threading enabled, 256GB RAM per system, 1x 480GB SSD OS drive, 12x 4TB HDD data drives per system, CentOS* 7.0, Linux 3.10.0-229.el7.x86_64, Intel® MKL 2017 build U1_20160808, Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* 1.6.1 standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200GB for Spark* master and workers, relative performance up to 8.8x. NEW: Intel® Xeon® processor E5-2697A v4 Apache* Spark* Cluster: 1 master + 8 workers, 10Gbit/sec Ethernet fabric, each system with 2 processors, Intel® Xeon® processor E5-2697A v4 (16 Cores, 2.6 GHz), Hyper-Threading enabled, 256GB RAM per system, 1x 800GB SSD OS drive, 10x 240GB SSD data drives per system, CentOS* 6.7, Linux 2.6.32-573.12.1.el6.x86_64, Intel® MKL 2017 build U1_20160808, Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* 1.6.1 standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200GB for Spark* master and workers, relative performance up to 18x. Machine learning algorithm used for all configurations: Alternating Least Squares (ALS), https://github.com/databricks/spark-perf
• Other ingredients for a Deep Learning solution on IA
Common Architecture for Machine & Deep Learning, with targeted acceleration:
• Intel® Xeon® Processors: most widely deployed machine learning platform (>97%*)
• Intel® Xeon Phi™ Processors: higher performance, general purpose machine learning
• Intel® Xeon® Processor + FPGA: higher perf/watt inference, programmable
• Intel® Xeon® Processor + Lake Crest: best in class neural network training performance
*Intel® Xeon® processors are used in 97% of servers that are running machine learning workloads today (Source: Intel)
Lake Crest: Deep Learning by Design, COMING 2017
Add-in card for unprecedented compute density in deep learning centric environments; everything needed for deep learning and nothing more!
• Hardware for DL workloads: custom-designed for deep learning; unprecedented compute density; more raw computing power than today's state-of-the-art GPUs
• Blazingly fast data access: 32 GB of in-package memory via HBM2 technology; 8 Tera-bits/s of memory access speed
• High speed scalability: 12 bi-directional high-bandwidth links; seamless data transfer via interconnects
Intel® Nervana™ Platform for Deep Learning: delivering up to a 100x reduction in time to train by 2020, compared to today's fastest solution1
• Lake Crest: discrete accelerator, first silicon 1H'2017
• Knights Crest: bootable Intel Xeon processor with integrated acceleration
1 Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Intel® Xeon® Processor + FPGA: superior inference capabilities. Add-in card for higher performance/watt inference with low latency and flexible precision.
Energy efficient inference with infrastructure flexibility:
• Excellent energy efficiency: up to 25 images/sec/watt inference on Caffe/AlexNet
• Reconfigurable accelerator can be used for a variety of data center workloads
• Integrated FPGA with Intel® Xeon® processor fits in standard server infrastructure, or discrete FPGA fits in a PCIe card and embedded applications*
*Xeon with integrated FPGA refers to the Broadwell Proof of Concept. Configurations: Intel® Arria 10 – 1150 FPGA energy efficiency on Caffe/AlexNet up to 25 img/s/w with FP16 at 297MHz; vanilla AlexNet classification implementation as specified by http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf, training parameters taken from the Caffe open-source framework, 224x224x3 input, 1000x1 output, FP16 with shared block-exponents, all compute layers (incl. fully connected) done on the FPGA except for Softmax, Arria 10-1150 FPGA, -1 speed grade on Altera PCIe DevKit with x72 DDR4 @ 1333 MHz, power measured through on-board power monitor (FPGA power only), ACDS 16.1 internal builds + OpenCL SDK 16.1 internal build, compute machine is an HP Z620 Workstation, Xeon E5-1660 at 3.3 GHz with 32GB RAM. The Xeon is not used for compute.
Intel® Omni-Path Architecture: world-class interconnect solution for shorter time to train. Fabric interconnect for breakthrough performance on scale-out apps like deep learning training.
Breakthrough performance: increases price performance and reduces communication latency compared to InfiniBand EDR1: up to 21% higher performance / lower latency at scale; up to 17% higher messaging rate; up to 9% higher application performance.
Building on some of the industry's best technologies: highly leverages existing Aries and Intel® True Scale fabrics; excellent price/performance and price/port, 48 radix; re-use of existing OpenFabrics Alliance software; over 80 Fabric Builders members.
Product line: HFI adapters (single port, x8 and x16); edge switches (1U form factor, 24 and 48 port); director switches (QSFP-based, 192 and 768 port); software (open source host software and fabric manager); cables (third party vendors, passive copper and active optical).
Innovative features improve performance, reliability and QoS through: Traffic Flow Optimization to maximize QoS in mixed traffic; Packet Integrity Protection for rapid and transparent recovery of transmission errors; Dynamic Lane Scaling to maintain link continuity.
1 Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo). Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA switch: Series 100 Edge Switch – 48 port (B0 silicon). Intel® OPA host software 10.1 or newer using Open MPI 1.10.x contained within host software package. EDR IB* testing performed with Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 – 36 port EDR InfiniBand switch. EDR tested with MLNX_OFED_Linux-3.2.x, OpenMPI 1.10.x contained within MLNX HPC-X. Message rate claim: Ohio State Micro Benchmarks v5.0, osu_mbw_mr, 8 B message (uni-directional), 32 MPI rank pairs. Maximum rank pair communication time used instead of average time; average timing introduced into Ohio State Micro Benchmarks as of v3.9 (2/28/13). Best of default, MXM_TLS=self,rc, and -mca pml yalla tunings. All measurements include one switch hop. Latency claim: HPCC 1.4.3 random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Application claim: GROMACS version 5.0.4 ion_channel benchmark, 16 nodes, 32 MPI ranks per node, 512 total MPI ranks, Intel® MPI Library 2017.0.064. Additional configuration details available upon request.
• Intel® Scalable System Framework: one framework for HPC, Analytics and Artificial Intelligence on IA
• Intel® Xeon Phi™ Processors: your path to Deep Learning
  o Breakthrough performance at scale for highly-parallel, memory-intensive apps
  o Next Gen Intel® Xeon Phi™ “Knights Mill”, soon!
• Choose Intel Deep Learning tools, frameworks and libraries to make the most of hardware performance