HPC Advisory Council Conference, Lugano, Switzerland, April 11th, 2017
Claire Bild, Intel
Deep Learning with Intel® Xeon Phi™ processors
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, Intel Xeon, Intel Xeon Phi™, Intel® Atom™ are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
Copyright © 2017, Intel Corporation
*Other brands and names may be claimed as the property of others.
Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. The cost reduction scenarios described in this document are intended to enable you to get a better understanding of how the purchase of a given Intel product, combined with a number of situation-specific variables, might affect your future cost and savings. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs.
Intel® Advanced Vector Extensions (Intel® AVX)* are designed to achieve higher throughput for certain integer and floating point operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you should consult your system manufacturer for more information.
*Intel® Advanced Vector Extensions refers to Intel® AVX, Intel® AVX2 or Intel® AVX-512. For more information on Intel® Turbo Boost Technology 2.0, visit http://www.intel.com/go/turbo
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Copyright © 2017 Intel Corporation. All rights reserved.
Legal Notices & disclaimers
The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company’s expectations. Demand for Intel's products is highly variable and, in recent years, Intel has experienced declining orders in the traditional PC market segment. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; customer acceptance of Intel’s and competitors’ products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel’s results could be affected by the timing of closing of acquisitions, divestitures and other significant transactions. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC filings.
An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release. Rev. 4/15/14
Risk factors
• HPC, Analytics and Artificial Intelligence on IA
• Deep Learning with Intel® Xeon Phi™ Processors
• Deep Learning tools, frameworks and libraries
• Other ingredients for a Deep Learning solution on IA
• HPC, Analytics and Artificial Intelligence on IA
By 2020…
The average internet user will generate ~1.5 GB of traffic per day.
Smart hospitals will be generating over 3,000 GB per day.
Self-driving cars will be generating over 4,000 GB per day… each.
A connected plane will be generating over 40,000 GB per day.
A connected factory will be generating over 1,000,000 GB per day.
All numbers are approximated.
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
radar: ~10-100 KB per second
sonar: ~10-100 KB per second
GPS: ~50 KB per second
LIDAR: ~10-70 MB per second
cameras: ~20-40 MB per second
1 car: ~5 exaflops per hour
Bigger Data: Image 1,000 KB / picture; Audio 5,000 KB / song; Video 5,000,000 KB / movie.
Better Hardware: transistor density doubles every 18 months; cost / GB in 1995: $1000.00; cost / GB in 2015: $0.03.
Smarter Algorithms: advances in neural networks leading to better accuracy in training models.
Classic ML: using optimized functions or algorithms to extract insights from data. Algorithms: Random Forest, Support Vector Machines, Regression, Naïve Bayes, Hidden Markov, K-Means Clustering, Ensemble Methods, more…
Flow: Training Data* → model → Inference, Clustering, or Classification of New Data*

Deep Learning: using massive labeled data sets to train deep (neural) graphs that can make inferences about new data. Algorithms: CNN, RNN, RBM…
Step 1: Training. Use a massive labeled dataset (e.g. 10M tagged images) to iteratively adjust the weighting of neural network connections (untrained → trained). Hours to days in the cloud.
Step 2: Inference. Form an inference about new input data (e.g. a photo) using the trained neural network. Real-time at the edge or in the cloud.
*Note: not all classic machine learning functions require training
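To make the two steps concrete, here is a minimal, self-contained sketch (my own illustration, not from the deck, and far simpler than the CNN/RNN topologies it discusses): training iteratively adjusts the weights of a single sigmoid neuron from labeled examples, and inference then applies the frozen weights to new input.

// Hypothetical single-neuron model: y = sigmoid(w·x + b)
#include <cmath>
#include <cstdio>
#include <vector>

struct Neuron {
    std::vector<double> w;
    double b = 0.0;
    double predict(const std::vector<double>& x) const {
        double z = b;
        for (size_t i = 0; i < w.size(); ++i) z += w[i] * x[i];
        return 1.0 / (1.0 + std::exp(-z));   // sigmoid activation
    }
};

int main() {
    // Step 1: Training -- iteratively adjust weights on labeled examples
    // (here: logical AND, a toy stand-in for "10M tagged images").
    std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
    std::vector<double> y = {0, 0, 0, 1};
    Neuron n; n.w = {0.0, 0.0};
    const double lr = 0.5;                   // learning rate
    for (int epoch = 0; epoch < 2000; ++epoch) {
        for (size_t k = 0; k < X.size(); ++k) {
            double err = n.predict(X[k]) - y[k];   // gradient of the log-loss
            for (size_t i = 0; i < n.w.size(); ++i) n.w[i] -= lr * err * X[k][i];
            n.b -= lr * err;
        }
    }
    // Step 2: Inference -- apply the trained (frozen) weights to new input.
    std::printf("p(1,1) = %.3f\n", n.predict({1, 1}));  // well above 0.5
    std::printf("p(1,0) = %.3f\n", n.predict({1, 0}));  // well below 0.5
}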
Health – Early Tumor Detection: a leading medical imaging company; goal: early detection of malignant tumors in mammograms; data: millions of “diagnosed” mammograms; approach: Deep Learning (CNN) tumor image recognition; outcome: higher accuracy and earlier breast cancer detection.
Finance – Data Synthesis: a financial services institution with >$750B in assets; goal: parse information to reduce portfolio manager time to insight; data: vast stores of documents (news, emails, research, social); approach: Deep Learning (RNN with encoder/decoder); outcome: faster and more informed investment decisions.
Industrial – Smart Agriculture: a world leader in agricultural biotech; goal: accelerate hybrid plant development; data: large dataset of hybrid plant performance based on genotype markers; approach: Deep Learning (CNN) to detect favorable interactions between genotype markers; outcome: more accurate selection leading to cost reduction.
Modeling & Simulation, HPC Data Analytics, Artificial Intelligence, Visualization
Compute: Xeon®, Xeon Phi™
Fabric: Omni-Path Architecture
I/O: Silicon Photonics
Memory/Storage: 3D-NAND, 3D XPoint™
Acceleration: FPGAs
The Intel AI software stack, top to bottom:
experiences
tools: Intel® Deep Learning SDK, Intel® Computer Vision SDK, E2E Tool
Frameworks: Intel® Distribution, MLlib, BigDL, Intel® Nervana™ Graph* (*coming 2017)
libraries: Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL, Movidius MvTensor Library, Associative Memory Base
hardware: Compute (including Lake Crest), Memory & Storage, Networking, Visual Intelligence (Movidius Neural Compute Stick)
• Deep Learning with Intel® Xeon Phi™ Processors
Up to 72 cores (288 threads), Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
[Chart: >100X performance gap across CPU generations (2011-2016) between scalar & single-threaded code and vectorized & parallelized code, with scalar & parallelized and vectorized & single-threaded in between]
Intel® Xeon® processors are increasingly parallel and require modern code; Intel® Xeon Phi™ processors are extremely parallel and use general purpose programming.
Intel® Xeon Phi™ processor: bootable host CPU, processor package with integrated fabric.
• No PCIe bottleneck: bootable host processor
• Topple the memory wall: integrated 16GB memory
• Raise the memory ceiling: platform memory up to 384 GB (DDR4)
• Run any x86 workload: Intel® Xeon® processor binary-compatible
• Scale out seamlessly: efficient scaling like Intel® Xeon® processors
• Reduce cost1: dual-port Intel® Omni-Path Fabric
[Tile diagram: two cores, each with 2 VPUs, sharing 1MB L2 through a hub]
1 Reduced cost based on Intel internal estimate comparing cost of discrete networking components with the integrated fabric solution
Running a bandwidth-hungry workload – “STREAM addition” (MCDRAM vs. DDR4):

// Reconstructed from the slide; data_t, NUM_ELTS, NUM_ITERATIONS and the
// timing calls were elided there, so the definitions below are assumptions.
// hbw_malloc()/hbw_free() come from the memkind library's <hbwmalloc.h>
// (build with -fopenmp, link with -lmemkind); plain malloc() allocates in
// DDR4, hbw_malloc() in on-package MCDRAM.
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <hbwmalloc.h>

typedef double data_t;                       // assumed element type
typedef double results_t;                    // assumed: elapsed time [units]
static const size_t NUM_ELTS = 1 << 26;      // assumed working-set size
static const size_t NUM_ITERATIONS = 100;    // assumed repeat count

// Fill all six arrays deterministically (assumed helper).
static void init(data_t *A1, data_t *B1, data_t *C1,
                 data_t *A2, data_t *B2, data_t *C2)
{
    for (size_t i = 0; i < NUM_ELTS; ++i) {
        A1[i] = A2[i] = 1.0;
        B1[i] = B2[i] = 2.0;
        C1[i] = C2[i] = 0.0;
    }
}

results_t run_bench(data_t *A, data_t *B, data_t *C, const char *id)
{
    double t0 = omp_get_wtime();             // begin timing
    //!! START WORKLOAD
    for (size_t it = 0; it < NUM_ITERATIONS; ++it) {
        #pragma omp parallel for simd
        for (size_t i = 0; i < NUM_ELTS; ++i) {
            C[i] = A[i] + B[i];              // STREAM "add" kernel
        }
    }
    //!! END WORKLOAD
    results_t elapsed = omp_get_wtime() - t0; // end timing
    std::cout << id << " Calculations took " << elapsed << " [units].\n";
    return elapsed;
}

int main()
{
    data_t *A_reg = (data_t*) malloc(sizeof(data_t) * NUM_ELTS);
    data_t *B_reg = (data_t*) malloc(sizeof(data_t) * NUM_ELTS);
    data_t *C_reg = (data_t*) malloc(sizeof(data_t) * NUM_ELTS);
    data_t *A_hbw = (data_t*) hbw_malloc(sizeof(data_t) * NUM_ELTS);
    data_t *B_hbw = (data_t*) hbw_malloc(sizeof(data_t) * NUM_ELTS);
    data_t *C_hbw = (data_t*) hbw_malloc(sizeof(data_t) * NUM_ELTS);
    init(A_reg, B_reg, C_reg, A_hbw, B_hbw, C_hbw);
    auto res_reg = run_bench(A_reg, B_reg, C_reg, "[DDR]");
    auto res_hbw = run_bench(A_hbw, B_hbw, C_hbw, "[HBM]");
    std::cout << "Computations happened " << res_reg/res_hbw
              << "x times faster in high-bandwidth memory.\n";
    free(A_reg); free(B_reg); free(C_reg);
    hbw_free(A_hbw); hbw_free(B_hbw); hbw_free(C_hbw);
}

Sample run:
$ ./run.sh
[DDR] Calculations took 15376.5 [units].
[HBM] Calculations took 3056.19 [units].
Computations happened 5.03125x times faster in high-bandwidth memory.
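The slide's snippet assumes MCDRAM is present. As a hedged companion sketch using the same memkind hbwmalloc interface, hbw_check_available() can guard against running the HBM path on a machine without it:

#include <cstdio>
#include <hbwmalloc.h>

int main()
{
    // hbw_check_available() is part of memkind's hbwmalloc API;
    // it returns 0 when high-bandwidth memory can be allocated.
    if (hbw_check_available() == 0)
        std::puts("MCDRAM available: hbw_malloc() will use it");
    else
        std::puts("no MCDRAM: hbw_malloc() falls back per the hbw policy");
}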
[Diagram: Past vs. Present. Past: the CPU offloads work through a scheduler to a GPU/coprocessor. Present: the workload runs directly on the bootable many-core CPU.]
Deep Learning Image Classification Training Performance – MULTI-NODE Scaling
Topology: AlexNet*. Dataset: Large image database.
Normalized training time speed-up (higher is better): 1 node: 1.0x, 2 nodes: 1.9x, 4 nodes: 3.7x, 8 nodes: 6.6x, 16 nodes: 12.8x, 32 nodes: 23.5x, 64 nodes: 33.7x, 128 nodes: 52.2x.
Configurations: Up to 50X faster training on 128-node as compared to single-node based on AlexNet* topology workload (batch size = 1024) training time using a large image database running one node Intel Xeon Phi processor 7250 (16 GB MCDRAM, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux* 6.7 (Santiago), 1.0 TB SATA drive WD1003FZEX-00MK2A0 System Disk, running Intel® Optimized DNN Framework, training in 39.17 hours compared to 128 nodes identically configured with Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe x16 connectors training in 0.75 hours. Contact your Intel representative for more information on how to obtain the binary. For information on the workload, see https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Deep Learning Image Classification Training Performance – MULTI-NODE Scaling
Dataset: Large image database.
[Chart: scaling efficiency (%) vs. number of Intel® Xeon Phi™ processor 7250 (68-cores, 1.4 GHz, 16 GB) nodes (1 to 128) for OverFeat*, AlexNet*, VGG-A* and GoogLeNet: 87% efficiency at 32 nodes vs. 62% for 32 NVIDIA Tesla* GPUs. Up to 38% better scaling.]
Configurations: Up to 38% better scaling efficiency at 32-nodes claim based on GoogLeNet deep learning image classification training topology using a large image database comparing one node Intel Xeon Phi processor 7250 (16 GB MCDRAM, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat* Enterprise Linux 6.7, Intel® Optimized DNN Framework with 87% efficiency, to unknown hosts running 32 NVIDIA Tesla* K20 GPUs with 62% efficiency (Source: http://arxiv.org/pdf/1511.00175v2.pdf showing FireCaffe* with 32 NVIDIA Tesla* K20s (Titan Supercomputer*) running GoogLeNet* at 20x speedup over Caffe* with 1 K20).
*Other names and brands may be property of others
Deep Learning Image Classification Training Performance – MULTI-NODE Scaling (GoogLeNet V1)
Time-to-train scaling efficiency on Intel® Xeon Phi™ processor 7250 nodes: 1 node: 100%, 2 nodes: 94%, 4 nodes: 96%, 8 nodes: 97%, 16 nodes: 97%, 32 nodes: 97%.
Data pre-partitioned across all nodes in the cluster before training; no data is transferred over the fabric while training.
Configurations: 32 nodes of Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: flat mode), 96GB DDR4 memory, Red Hat* Enterprise Linux 6.7, export OMP_NUM_THREADS=64 (the remaining 4 cores are used for driving communication), Intel® MKL 2017 Update 1, MPI: 2017.1.132, Endeavor KNL bin1 nodes, export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2. Throughput is measured using the “train” command. Data pre-partitioned across all nodes in the cluster before training; no data transferred over the fabric while training. Scaling efficiency computed as: (single-node time-to-train / (N * time-to-train measured with N nodes)) * 100, where N = number of nodes. Intel® Caffe: Intel internal version of Caffe. GoogLeNetV1: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43022.pdf, batch size 1536.
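Plugging the AlexNet numbers from the earlier multi-node slide into that efficiency formula shows how it relates to raw speed-up (a small sketch; the 39.17 h and 0.75 h values come from that slide's own configuration note):

#include <cstdio>

// Scaling efficiency as defined in the configuration note:
// (single-node time-to-train / (N * time-to-train on N nodes)) * 100.
static double scaling_efficiency(double t_single, double t_n, int n)
{
    return t_single / (n * t_n) * 100.0;
}

int main()
{
    // AlexNet: 39.17 hours on 1 node vs. 0.75 hours on 128 nodes,
    // i.e. a 52.2x speed-up, which is 52.2/128 = ~41% efficiency.
    std::printf("%.1f%%\n", scaling_efficiency(39.17, 0.75, 128));  // 40.8%
}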
Normalized throughput (images/second): Intel® Xeon Phi™ processor 7250 relative performance, normalized to a 1.0 baseline of a 2S Intel® Xeon® processor E5-2699 v4. Caffe/AlexNet: up to 1.5x; TensorFlow/AlexNet-ConvNet: up to 1.3x.
Configurations: Caffe: Intel® Xeon® processor E5-2699 v4 (22 Cores, 2.2 GHz), 128GB memory, Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe available now, based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1 (22092016), Intel® MKL2017 prototxt, data in memory, images/sec results obtained using the “time” command, OMP_NUM_THREADS = number of CPU cores. Caffe: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe available now, based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1 (22092016), Intel® MKL2017 prototxt, data in memory, images/sec results obtained using the “time” command, OMP_NUM_THREADS = number of CPU cores. TensorFlow: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB 1200MHz DDR4 memory, Red Hat Enterprise Linux Server release 7.2 (Maipo), TensorFlow v0.10 available on demand, Intel® MKL GOLD UPDATE1 09/10/2016 nightly, data in memory. AlexNet: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, Batch Size: 256.
Normalized throughput (images/second): Intel® Xeon Phi™ processor 7250 relative performance, normalized to a 1.0 baseline of a 2S Intel® Xeon® processor E5-2697 v4. Caffe/AlexNet: up to 1.8x; TensorFlow/AlexNet-ConvNet: up to 1.7x; TensorFlow/ConvNet VGG: up to 1.4x.
Configurations: Intel® Xeon® processor E5-2697 v4 node w/dual sockets, 18 cores/socket, HT enabled @2.3GHz 145W, 128GB RAM DDR4 2400, 8x16GB DIMMs. Caffe: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe available now, based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1 (22092016), Intel® MKL2017 prototxt, data in memory, images/sec results obtained using the “time” command, OMP_NUM_THREADS = number of CPU cores. TensorFlow: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB 1200MHz DDR4 memory, Red Hat Enterprise Linux Server release 7.2 (Maipo), TensorFlow v0.10 available on demand, MKL GOLD UPDATE1 09/10/2016 nightly, data in memory. AlexNet: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, AlexNet / AlexNet ConvNet batch size: 256; VGG batch size: 64.
Knights Mill: next-generation Intel® Xeon Phi™ processor, coming SOON (2017)
• Optimized for Deep Learning
• Optimized for scale-out
• Flexible, high capacity memory
• Enhanced variable precision
• Improved efficiency
[Chart: single-precision teraflops, 1st generation Xeon Phi (2013) → 2nd generation Xeon Phi (2016) → Knights Mill (2017)]
Deep Learning Performance, normalized (estimated): Intel® Xeon Phi™ processor 7290 (baseline, 1.0) vs. Intel® Xeon Phi™ processor family – Knights Mill: up to 4x. Coming SOON.
Configurations: Knights Mill performance: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. BASELINE: Intel® Xeon Phi™ Processor 7290 (16GB, 1.50 GHz, 72 cores) with 192 GB total memory on Red Hat Enterprise Linux* 7.2, kernel 3.10.0-327, using Intel® MKL 11.3 Update 4, relative performance 1.0. NEW: Intel® Xeon Phi™ processor family – Knights Mill, relative performance up to 4x.
• Deep Learning tools, frameworks and libraries
Intel® MKL (Intel® Math Kernel Library): high performance math primitives granting a low level of control. Example usage: framework developers call matrix multiplication and convolution functions (see the sketch after this list).
MKL-DNN: free open source DNN functions for high-velocity integration with deep learning frameworks. Example usage: a new framework with functions developers call for maximum CPU performance.
Intel® Data Analytics Acceleration Library (DAAL): broad data analytics acceleration, an object-oriented library supporting distributed ML at the algorithm level. Example usage: call the distributed alternating least squares algorithm for a recommendation system.
Intel® Distribution (Python): most popular and fastest growing language for machine learning. Example usage: call the scikit-learn k-means function for credit card fraud detection.
Open Source Frameworks: toolkits driven by academia and industry for training machine learning algorithms. Example usage: script and train a convolutional neural network for image recognition.
Intel® Deep Learning SDK: accelerate deep learning model design, training and deployment. Example usage: deep learning training and model creation, with optimization for deployment on constrained end devices.
Intel® Computer Vision SDK: toolkit to develop and deploy vision-oriented solutions that harness the full performance of Intel CPUs and SOC accelerators. Example usage: use deep learning to do pedestrian detection.
software.intel.com/deep-learning-sdk
software.intel.com/ai
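As a concrete illustration of the MKL row above ("framework developers call matrix multiplication"), here is a hedged sketch of a direct call to MKL's standard CBLAS single-precision GEMM; the build setup (e.g. how MKL is linked) varies by toolchain and is not specified by the deck:

#include <mkl.h>
#include <cstdio>

int main()
{
    // C (2x2) = 1.0 * A (2x3) * B (3x2) + 0.0 * C, all row-major.
    const int M = 2, N = 2, K = 3;
    float A[M * K] = {1, 2, 3,
                      4, 5, 6};
    float B[K * N] = {1, 0,
                      0, 1,
                      1, 1};
    float C[M * N] = {0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  // 4 5 / 10 11
}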
Caffe/AlexNet normalized throughput (images/second) on Intel® Xeon Phi™ processor 7250 (higher is better): out-of-box (OOB*) performance: 1.0 baseline; current performance (Intel Caffe): up to 400x.
Configurations: BASELINE: Caffe out of the box, Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, BVLC-Caffe: https://github.com/BVLC/caffe, with OpenBLAS, relative performance 1.0. NEW: Caffe: Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: cache mode), 96GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel® Caffe: https://github.com/intel/caffe based on BVLC Caffe as of Jul 16, 2016, Intel® MKL GOLD UPDATE1, relative performance up to 400x. AlexNet used for both configurations as per https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, Batch Size: 256.
Collaborative Filtering is a technique used by recommender systems to predict which items to recommend to a specific shopper at Amazon*, which movies to watch at Netflix*, etc.
…using a matrix factorization…
…which can be solved with different approaches: Alternating Least Squares, Stochastic Gradient Descent, …
Alternating Least Squares is better than Stochastic Gradient Descent when:
• the hardware system offers massive parallelization
• the training set relies on implicit data (user preference inferred from clicks, purchase history)
• matrices are dense
Source: Matrix Factorization Techniques for Recommender Systems, Yehuda Koren, Robert Bell, Chris Volinsky
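To make the alternation concrete, here is a minimal rank-1 illustration (my own sketch, not from the deck; production ALS uses rank-k factors plus regularization, as in the Koren/Bell/Volinsky paper): fixing one factor turns the other into a closed-form least-squares solve, and the two solves alternate.

#include <cstdio>

int main()
{
    // Tiny dense ratings matrix R, approximated as R ≈ u * v^T (rank 1).
    const int USERS = 3, ITEMS = 4;
    double R[USERS][ITEMS] = {{5, 4, 1, 1},
                              {4, 5, 1, 2},
                              {1, 1, 5, 4}};
    double u[USERS] = {1, 1, 1};
    double v[ITEMS] = {1, 1, 1, 1};

    for (int sweep = 0; sweep < 50; ++sweep) {
        // Fix v, solve u in closed form: u[i] = (R v)_i / ||v||^2
        double vv = 0;
        for (int j = 0; j < ITEMS; ++j) vv += v[j] * v[j];
        for (int i = 0; i < USERS; ++i) {
            double rv = 0;
            for (int j = 0; j < ITEMS; ++j) rv += R[i][j] * v[j];
            u[i] = rv / vv;
        }
        // Fix u, solve v symmetrically: v[j] = (R^T u)_j / ||u||^2
        double uu = 0;
        for (int i = 0; i < USERS; ++i) uu += u[i] * u[i];
        for (int j = 0; j < ITEMS; ++j) {
            double ru = 0;
            for (int i = 0; i < USERS; ++i) ru += R[i][j] * u[i];
            v[j] = ru / uu;
        }
    }
    // Reconstructed score for user 0, item 2; compare against R[0][2] = 1.
    std::printf("prediction(user 0, item 2) = %.2f\n", u[0] * v[2]);
}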
Alternating Least Squares (ALS): time-to-train speed-up on an 8-node Intel® Xeon® Apache* Spark* cluster, normalized to the out-of-box BLAS library:
• 2S Intel® Xeon® processor E5-2697 v2 with F2JBLAS: 1.0
• 2S Intel® Xeon® processor E5-2697 v2 + Intel® MKL 2017: 3.4x
• 2S Intel® Xeon® processor E5-2699 v3 + Intel® MKL 2017: 8.8x
• 2S Intel® Xeon® processor E5-2697A v4 + Intel® MKL 2017: up to 18x
Configurations: BASELINE: Intel® Xeon® Processor E5-2697 v2 (12 Cores, 2.7 GHz), 256GB memory, CentOS 6.6*, F2JBLAS: https://github.com/fommil/netlib-java, relative performance 1.0. NEW: Intel® Xeon® processor E5-2697 v2 Apache* Spark* Cluster: 1 master + 8 workers, 10Gbit/sec Ethernet fabric, each system with 2 processors, Intel® Xeon® processor E5-2697 v2 (12 Cores, 2.7 GHz), Hyper-Threading enabled, 256GB RAM per system, 1x 240GB SSD OS drive, 12x 3TB HDD data drives per system, CentOS* 6.6, Linux 2.6.32-642.1.1.el6.x86_64, Intel® MKL 2017 build U1_20160808, Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* 1.6.1 standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200GB for Spark* master and workers, relative performance up to 3.4x. NEW: Intel® Xeon® processor E5-2699 v3 Apache* Spark* Cluster: 1 master + 8 workers, 10Gbit/sec Ethernet fabric, each system with 2 processors, Intel® Xeon® processor E5-2699 v3 (18 Cores, 2.3 GHz), Hyper-Threading enabled, 256GB RAM per system, 1x 480GB SSD OS drive, 12x 4TB HDD data drives per system, CentOS* 7.0, Linux 3.10.0-229.el7.x86_64, Intel® MKL 2017 build U1_20160808, Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* 1.6.1 standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200GB for Spark* master and workers, relative performance up to 8.8x. NEW: Intel® Xeon® processor E5-2697A v4 Apache* Spark* Cluster: 1 master + 8 workers, 10Gbit/sec Ethernet fabric, each system with 2 processors, Intel® Xeon® processor E5-2697A v4 (16 Cores, 2.6 GHz), Hyper-Threading enabled, 256GB RAM per system, 1x 800GB SSD OS drive, 10x 240GB SSD data drives per system, CentOS* 6.7, Linux 2.6.32-573.12.1.el6.x86_64, Intel® MKL 2017 build U1_20160808, Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* 1.6.1 standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200GB for Spark* master and workers, relative performance up to 18x. Machine learning algorithm used for all configurations: Alternating Least Squares (ALS), https://github.com/databricks/spark-perf
• Other ingredients for a Deep Learning solution on IA
Common Architecture for Machine & Deep Learning, with targeted acceleration:
• Intel® Xeon® Processors: most widely deployed machine learning platform (>97%*)
• Intel® Xeon Phi™ Processors: higher performance, general purpose machine learning
• Intel® Xeon® Processor + FPGA: higher perf/watt inference, programmable
• Intel® Xeon® Processor + Lake Crest: best in class neural network training performance
*Intel® Xeon® processors are used in 97% of servers that are running machine learning workloads today (Source: Intel)
Lake Crest: Deep Learning by Design, COMING 2017
Add-in card for unprecedented compute density in deep learning centric environments; everything needed for deep learning and nothing more!
• Hardware for DL workloads: custom-designed for deep learning; unprecedented compute density; more raw computing power than today's state-of-the-art GPUs
• Blazingly fast data access: 32 GB of in-package memory via HBM2 technology; 8 Tera-bits/s of memory access speed
• High speed scalability: 12 bi-directional high-bandwidth links; seamless data transfer via interconnects
Intel® Nervana™ Platform for Deep Learning: delivering up to a 100x reduction in time to train by 2020, compared to today's fastest solution1
• Lake Crest: discrete accelerator, first silicon 1H'2017
• Knights Crest: bootable Intel Xeon processor with integrated acceleration
1 Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Intel® Xeon® Processor + FPGA: superior inference capabilities. Add-in card for higher performance/watt inference with low latency and flexible precision.
Energy efficient inference with infrastructure flexibility:
• Excellent energy efficiency: up to 25 images/sec/watt inference on Caffe/AlexNet
• Reconfigurable accelerator can be used for a variety of data center workloads
• Integrated FPGA with Intel® Xeon® processor fits in standard server infrastructure, or discrete FPGA fits in a PCIe card and embedded applications*
*Xeon with integrated FPGA refers to the Broadwell Proof of Concept. Configurations: Intel® Arria 10 – 1150 FPGA energy efficiency on Caffe/AlexNet up to 25 img/s/w with FP16 at 297MHz; vanilla AlexNet classification implementation as specified by http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf, training parameters taken from the Caffe open-source framework, 224x224x3 input, 1000x1 output, FP16 with shared block-exponents, all compute layers (incl. fully connected) done on the FPGA except for Softmax, Arria 10-1150 FPGA, -1 speed grade on Altera PCIe DevKit with x72 DDR4 @ 1333 MHz, power measured through on-board power monitor (FPGA power only), ACDS 16.1 internal builds + OpenCL SDK 16.1 internal build, compute machine is an HP Z620 Workstation, Xeon E5-1660 at 3.3 GHz with 32GB RAM. The Xeon is not used for compute.
Intel® Omni-Path Architecture: world-class interconnect solution for shorter time to train. Fabric interconnect for breakthrough performance on scale-out apps like deep learning training.
Breakthrough performance: increases price performance and reduces communication latency compared to InfiniBand EDR1: up to 21% higher performance / lower latency at scale; up to 17% higher messaging rate; up to 9% higher application performance.
Building on some of the industry's best technologies: highly leverages existing Aries and Intel® True Scale fabrics; excellent price/performance and price/port, 48 radix; re-use of existing OpenFabrics Alliance software; over 80 Fabric Builders members.
Product line: HFI adapters (single port, x8 and x16); edge switches (1U form factor, 24 and 48 port); director switches (QSFP-based, 192 and 768 port); software (open source host software and fabric manager); cables (third party vendors, passive copper and active optical).
Innovative features improve performance, reliability and QoS through: Traffic Flow Optimization to maximize QoS in mixed traffic; Packet Integrity Protection for rapid and transparent recovery of transmission errors; Dynamic Lane Scaling to maintain link continuity.
1 Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo). Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA switch: Series 100 Edge Switch – 48 port (B0 silicon). Intel® OPA host software 10.1 or newer using Open MPI 1.10.x contained within host software package. EDR IB* testing performed with Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 – 36 port EDR InfiniBand switch. EDR tested with MLNX_OFED_Linux-3.2.x, OpenMPI 1.10.x contained within MLNX HPC-X. Message rate claim: Ohio State Micro Benchmarks v5.0, osu_mbw_mr, 8 B message (uni-directional), 32 MPI rank pairs. Maximum rank pair communication time used instead of average time; average timing introduced into Ohio State Micro Benchmarks as of v3.9 (2/28/13). Best of default, MXM_TLS=self,rc, and -mca pml yalla tunings. All measurements include one switch hop. Latency claim: HPCC 1.4.3 random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Application claim: GROMACS version 5.0.4 ion_channel benchmark, 16 nodes, 32 MPI ranks per node, 512 total MPI ranks, Intel® MPI Library 2017.0.064. Additional configuration details available upon request.
• Intel® Scalable System Framework: one framework for HPC, Analytics and Artificial Intelligence on IA
• Intel® Xeon Phi™ Processors: your path to Deep Learning
  o Breakthrough performance at scale for highly-parallel, memory-intensive apps
  o Next Gen Intel® Xeon Phi™ “Knights Mill”, soon!
• Choose Intel Deep Learning tools, frameworks and libraries to make the most of hardware performance