Wei Li, PhD
Vice President, Software and Services Group
GM, Machine Learning and Translation
Dec, 2017
Accelerating AI
Avg. Internet user: 1.5 GB¹
Autonomous driving: 4,000 GB³
Smart hospital: 3,000 GB²
Smart factory: 1,000,000 GB²
Airplane: 40,000 GB²
All numbers are approximate.
¹ http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
² http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
³ https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
Data Deluge
Compute Breakthrough
Innovation Surge
Mainframes → Standards-based servers → Cloud computing
AI compute cycles will grow 12X by 2020. 12X is a measure of compute demand (volume x per-server performance/throughput). Overall compute demand is growing tremendously: early deployments had relatively low utilization, and over the forecast period this transitions to broader deployments powering multiple applications, with higher utilization, improved hardware performance, and continued software optimizations driving compute growth.
Consumer | Health | Finance | Retail | Government | Energy | Transport | Industrial | Other
Smart Assistants
Chatbots
Search
Personalization
Augmented Reality
Robots
Enhanced Diagnostics
Drug Discovery
Patient Care
Research
Sensory Aids
Algorithmic Trading
Fraud Detection
Research
Personal Finance
Risk Mitigation
Support
Experience
Marketing
Merchandising
Loyalty
Supply Chain
Security
Defense
Data Insights
Safety & Security
Resident Engagement
Smarter Cities
Oil & Gas Exploration
Smart Grid
Operational Improvement
Conservation
Autonomous Cars
Automated Trucking
Aerospace
Shipping
Search & Rescue
Factory Automation
Predictive Maintenance
Precision Agriculture
Field Automation
Advertising
Education
Gaming
Professional & IT Services
Telco/Media
Space Exploration
Sports
Step 1: Training (in the data center; over hours/days/weeks)
Step 2: Inference (at the edge or in the data center; instantaneous)
Person
90% person, 8% traffic light
97% person
Trained Model
Forward & Backward Propagation (training) | Forward Propagation (inference)
Output: classification
Create "deep neural net" math model
Massive data sets: labeled or tagged
Input data
Output: classification
Trained neural network model
New input from camera and sensors
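The two steps above can be sketched end to end in a few lines of NumPy. This is a toy two-layer network of our own invention (shapes, names and hyperparameters are illustrative, not Intel code): training repeats forward and backward propagation over a labeled data set; inference then runs only forward propagation on new input.

```python
import numpy as np

# Toy two-layer network. Purely illustrative: all names, shapes and
# hyperparameters are our own choices, not anything from Intel's stack.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))                       # labeled data set stand-in
y = (X.sum(axis=1) > 0).astype(np.float64).reshape(-1, 1)

W1 = 0.1 * rng.standard_normal((10, 16))
W2 = 0.1 * rng.standard_normal((16, 1))

def forward(inp):
    """Forward propagation: used in both training and inference."""
    h = np.maximum(inp @ W1, 0.0)                       # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))                 # sigmoid "classification"
    return h, p

def bce(p, t):
    """Binary cross-entropy loss."""
    return float(-(t * np.log(p + 1e-9) + (1 - t) * np.log(1 - p + 1e-9)).mean())

loss_before = bce(forward(X)[1], y)

# Step 1: training = forward AND backward propagation, over many passes.
for _ in range(300):
    h, p = forward(X)
    grad_out = (p - y) / len(X)                         # dLoss/dLogits for BCE
    grad_W2 = h.T @ grad_out
    grad_h = (grad_out @ W2.T) * (h > 0)                # backprop through ReLU
    grad_W1 = X.T @ grad_h
    W1 -= 0.1 * grad_W1                                 # gradient descent update
    W2 -= 0.1 * grad_W2

loss_after = bce(forward(X)[1], y)

# Step 2: inference = forward propagation only, on new input.
x_new = rng.standard_normal((1, 10))                    # "camera/sensor" input
_, score = forward(x_new)                               # confidence in [0, 1]
```

The same split holds at scale: the expensive forward-plus-backward loop runs in the data center over hours or days, while the cheap forward-only pass is what serves each new input near-instantaneously.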
HARDWARE for the cloud / the edge
Intel® Xeon® Processors
Intel® Core™ & Atom™ Processors
Intel® FPGA
Intel® Xeon Phi™ Processors*
Intel® Nervana™ Neural Network Processors
Intel® Processor Graphics
Movidius VPU: vision
Intel® GNA (IP)* + CPU: speech
Data center | Edge
Potent performance: train in hours instead of days, with up to 113X² performance vs. 2-3 year old servers (2.2X excluding optimized SW¹)
*Optimization notice slide 31
Intel® Nervana™Neural Network Processor
Compute Density
Blazing Data Access
High-speed Scalability
Project Brainwave for real-time AI: "A major leap forward in both performance and flexibility for cloud-based serving of deep learning models."
- Doug Burger, Distinguished Engineer
*Google Clips: https://www.theverge.com/2017/10/4/16402682/google-clips-camera-announced-price-release-date-wireless
Technologies
HOW DO WE DO IT: Innovate Hardware
Training: expert-led trainings, hands-on workshops, exclusive remote access, and more.
Tools: latest libraries, frameworks, tools and technologies from Intel.
Community: collaborate with industry luminaries, developers and Intel engineers.
Intel® Nervana™ DevCloud
Platforms
Frameworks
Experiences
Libraries
Intel® Nervana™DL Studio
Movidius(VPU)
Intel® ComputerVision SDK
Intel® Nervana™Cloud & Appliance
MLlib | BigDL
Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)
Intel® Data Analytics Acceleration Library
(DAAL)
Intel® Nervana™
Graph*
Intel Python Distribution
* Other names and brands may be claimed as the property of others.
Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)
Intel® Machine Learning Scaling Library (Intel® MLSL)
Computational Primitives | Communication Primitives
Deep Learning Frameworks
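The split between the two primitive families can be made concrete with a toy data-parallel step: each node computes a gradient on its shard of the mini-batch (computational primitives), then an allreduce averages the gradients across nodes (communication primitives). The NumPy stand-in below only simulates the allreduce; a real run would call Intel® MLSL or MPI, and all names here are ours.

```python
import numpy as np

def allreduce_mean(per_node_grads):
    """Simulated communication primitive: average gradients across nodes,
    as an MLSL/MPI allreduce would in synchronous data-parallel training.
    Pure-NumPy stand-in, not an MLSL API."""
    return sum(per_node_grads) / len(per_node_grads)

# Each "node" computed a gradient on its own shard of the mini-batch...
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]

# ...and after the allreduce every node holds the same averaged gradient,
# so all replicas apply an identical update and stay in sync.
avg_grad = allreduce_mean(grads)
```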
INFERENCE THROUGHPUT
Up to 138X higher inference throughput: Intel® Xeon® Platinum 8180 processor with Intel-optimized Caffe GoogleNet v1 and Intel® MKL, compared to Intel® Xeon® processor E5-2699 v3 with BVLC Caffe
INFERENCE using FP32. Batch size: Caffe GoogleNet v1 = 256, AlexNet = 256. Configuration details on slides 24, 28.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of June 2017
* Optimization Notice – Slide 31
TRAINING THROUGHPUT
Up to 113X higher training throughput: Intel® Xeon® Platinum 8180 processor with Intel-optimized Caffe AlexNet and Intel® MKL, compared to Intel® Xeon® processor E5-2699 v3 with BVLC Caffe
Deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable Family
Optimized Frameworks
Optimized Intel® MKL Libraries
Inference and training throughput measured using FP32 instructions
* Blog by Valeriu Codreanu, Damian Podareanu, Zheng Meyer-Zhao, Vikram Saletore: http://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/ System Details: https://www.bsc.es/news/bsc-news/marenostrum-4-begins-operation
System specs: 2S Intel® Xeon® Processor 8160
*We acknowledge PRACE for awarding us access to resource MareNostrum 4 based in Spain at Barcelona Supercomputing Center (BSC)
[Chart: training scaling from 4 to 256 nodes (y-axis 0-70); series: Ideal, SURFsara/Intel, IBM Caffe, IBM Torch, Facebook]
ResNet-50 training time of 44 minutes on 512 Xeons
* Optimization notice slide 31.
ResNet-50 Time to Train: 31 minutes
Intel® Xeon® Platinum 8160 processor, 1600 nodes
Batch size = 16K, 75.3% Top-1 accuracy
AlexNet Time to Train: 11 minutes
Intel® Xeon® Platinum 8160 processor, 1024 nodes
Batch size = 32K, 58.6% Top-1 accuracy
Measured on Intel Caffe and Intel® Machine Learning Scaling Library (Intel® MLSL)
• Technical report by Y. You, Z. Zhang, C-J. Hsieh, J. Demmel, K. Keutzer: https://people.eecs.berkeley.edu/~youyang/publications/imagenet_minutes.pdf
Large-batch training using the Layer-wise Adaptive Rate Scaling (LARS) algorithm
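The LARS idea referenced above can be sketched compactly: each layer gets a local learning rate proportional to the ratio of its weight norm to its gradient norm, which keeps updates proportionate across layers even at very large batch sizes. The function below is our own minimal reading of the You et al. report; parameter names and default values are illustrative, not the authors' code.

```python
import numpy as np

def lars_update(weights, grads, global_lr=0.1, trust=0.001, weight_decay=5e-4):
    """One LARS step (after You, Zhang et al.): scale each layer's update by
    trust * ||w|| / ||g||, so layers with small weights or large gradients
    are not destabilized when the batch (and hence the learning rate) grows.
    Sketch only; names and default values are our assumptions."""
    new_weights = []
    for w, g in zip(weights, grads):
        g = g + weight_decay * w                       # L2 regularization term
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Per-layer "local" learning rate; fall back to 1.0 at degenerate norms.
        local_lr = trust * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
        new_weights.append(w - global_lr * local_lr * g)
    return new_weights
```

With plain SGD, a single global learning rate that suits one layer can blow up another; the per-layer trust ratio is what makes 16K-32K batches trainable at the accuracies quoted above.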
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality,
or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product
User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with
other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured or estimated as of November 2017.
Get excellent multi-node scaling and generational performance with your existing hardware
* Optimization Notice: Slide 31
Configuration Details 1 Link
Higher is Better
FRAMEWORK HARDWARE
Deep learning Training in hours
FRAMEWORK HARDWARE
Deep learning training with huge dataset for stunning accuracy
Deep learning inference for a smooth user experience at lower cost
FRAMEWORK HARDWARE
www.pikazoapp.com
Similar properties:
Get the best AI performance by optimizing the workload:
• Scaling: improve load balancing; reduce synchronization events and all-to-all comms
• Utilize all the cores: OpenMP, MPI; reduce synchronization events and serial code; improve load balancing
• Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment
• Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation
The naïve convolution algorithm is slow:
• Not vectorization friendly
• Not cache friendly, even though convolution has a good compute/memory ratio
Optimized implementation of the convolution algorithm in MKL-DNN:
• Effective use of SIMD registers; vectorization across channels
• Blocked data format for cache-friendly data blocking, data reuse, and effective prefetching
• Parallelization across mini-batch, output channels and spatial domains utilizes all computational units and improves load balancing for small convolutions
Optimizations: AVX-512 vectorization, data reuse, parallelization
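The blocked data format can be illustrated with a plain NumPy reorder. The layout below mimics the nChw8c-style blocking used by MKL-DNN-like libraries (the function name is ours, not an MKL-DNN API): channels are grouped in blocks so the innermost dimension is contiguous in memory and maps naturally onto a SIMD register (8 lanes here; an AVX-512 FP32 kernel would use blocks of 16).

```python
import numpy as np

def nchw_to_blocked(x, block=8):
    """Reorder an NCHW tensor into a blocked layout (nChw8c-style): split
    channels into groups of `block` and make the channel block the innermost,
    contiguous dimension. Illustrative helper, not an MKL-DNN function."""
    n, c, h, w = x.shape
    assert c % block == 0, "pad channels to a multiple of the block first"
    # (N, C, H, W) -> (N, C/8, 8, H, W) -> (N, C/8, H, W, 8)
    return x.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 16 * 4 * 4, dtype=np.float32).reshape(2, 16, 4, 4)
xb = nchw_to_blocked(x)
# xb.shape == (2, 2, 4, 4, 8): for each pixel, 8 consecutive channels are
# now adjacent in memory, so one vector load feeds one SIMD multiply-add.
```

In plain NCHW, the 8 channel values a SIMD lane-group needs for one pixel are H*W elements apart; after blocking they are adjacent, which is what enables the vectorization across channels and effective prefetching listed above.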
Distributed SW Implementation: optimizing DL frameworks with Intel® Machine Learning Scaling Library (Intel® MLSL) & Intel® MPI
Increasing Scaling Efficiency: large mini-batch training methods; communication volume reduction
Support of Parallel Training Algorithms: data/model parallelism; synchronous, asynchronous & hybrid SGD
Massive scale-out solution for Deep Learning training with large datasets
Hybrid Approach:
• Nodes are split into groups
• Each group performs synchronous SGD
• Groups communicate with asynchronous SGD
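The grouping scheme above can be simulated in a few lines (our own toy, not the paper's implementation): gradients are averaged synchronously within each group, and each group's averaged update is then applied to the shared parameters independently, standing in for the asynchronous exchange between groups.

```python
import numpy as np

def hybrid_sgd_step(group_grads, params, lr=0.1):
    """One illustrative hybrid step. `group_grads` is a list of groups, each
    a list of per-node gradients. Within a group the average is taken
    synchronously (an allreduce in practice); each group then pushes its
    update to the shared parameters without waiting for the others, which
    we model as sequential independent applies. Simulation only."""
    for grads_in_group in group_grads:
        g = sum(grads_in_group) / len(grads_in_group)   # synchronous SGD inside
        params = params - lr * g                        # async-style apply
    return params
```

The hybrid keeps the statistical efficiency of synchronous SGD inside small groups while avoiding a global barrier across all 9600 nodes, which is what makes the petaflop-scale runs below feasible.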
Results:
• Kurth, Thorsten, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Ali Patwary, Tareq Malas et al. "Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data", arXiv:1708.05256 (2017).
• Scaled training of a single model to peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s on 9600 Intel® Xeon Phi™ nodes
• Implemented on top of Intel® Distribution of Caffe* and Intel® Machine Learning Scaling Library (Intel® MLSL)
Intel® Xeon® INT8 capability for deep learning inference: lower response time, higher throughput, less memory
IA-optimized frameworks provide an automated process to ease deployment:
– Collect statistics
– Calibrate scaling factor
– Quantize
Result: no significant accuracy loss compared with FP32
Topology       FP32 (Top-1 / Top-5 accuracy)   INT8 (Top-1 / Top-5 accuracy)
ResNet-50      72.50% / 90.87%                 71.76% / 90.48%
GoogLeNet v3   74.50% / 92.42%                 74.21% / 92.27%
SSD            77.68% (mAP)                    77.37% (mAP)
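The three calibration steps can be sketched in NumPy. This is a minimal symmetric max-scale calibration of our own; production calibrators in IA-optimized frameworks use more careful statistics (e.g. entropy-based saturation thresholds), but the flow is the same.

```python
import numpy as np

def calibrate_scale(activations):
    """Steps 1-2: collect statistics over a calibration set (here just the
    maximum absolute value) and derive the scaling factor that maps FP32
    values onto the signed INT8 range. Illustrative sketch only."""
    return 127.0 / max(np.abs(a).max() for a in activations)

def quantize(x, scale):
    """Step 3: quantize FP32 to INT8 with rounding and saturation."""
    return np.clip(np.rint(x * scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32 for the next FP32 op."""
    return q.astype(np.float32) / scale
```

INT8 tensors take 4x less memory than FP32 and pack more elements into each vector instruction, which is where the lower response time and higher throughput come from; the table above shows the accuracy cost stays within a few tenths of a percent.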
Accelerating AI from the cloud to the edge with both hardware and software
Visit www.intel.com/ai for more information
Configuration details: 32/64-node CPU system, Intel® Xeon® Gold 6148 processor, with 10Gb Ethernet / OPA
Benchmark Segment: AI/ML
Benchmark Type: Training
Benchmark Metric: Images/sec or time to train in seconds
Framework: Caffe
Topology: ResNet-50, VGG-16, GoogleNet V3
# of Nodes: 32/64
Platform: Wolfpass (Skylake)
Sockets: 2S
Processor: Xeon processor code-named Skylake, B0, ES2*, 24c, 2.4GHz, 145W, 2666MT/s, QL1K, CPUID=0x50652
BIOS: SE5C620.86B.01.00.0412.020920172159
Enabled Cores: 24 cores/socket
Slots: 12
Total Memory: 192GB
Memory Configuration: 12x16GB DDR4 2R, 1.2V, RDIMM, 2666MT/s
Memory Comments: Micron MTA 18ASF2G72PDZ-2G6B1
SSD: 800GB, model ATA INTEL SSDSC2BA80 (scsi)
OS: Oracle Linux Server 7.3, Linux kernel 3.10.0-514.6.2.0.1.el7.x86_64.knl1
Ethernet Configuration: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 03)
Omni-Path Configuration: Intel Omni-Path HFI Silicon PCIe Adapter 100 Series [discrete]; OFED version 10.2.0.0.158_72; 48-port OPA switch with dual leaf switches per rack, 48 nodes per rack, 24 spine switches
HT: ON; Turbo: ON
Computer Type: Server
Framework Version: internal Caffe version
Topology Version: internal ResNet-50 / VGG-16 / GoogleNet V3 topologies
Batch Size: ResNet-50: 128 x # of nodes; VGG-16: 64 x # of nodes; GoogleNet V3: 64 x # of nodes
Dataset, Version: ImageNet, ILSVRC 2012 (Endeavor location), JPEG resized 256x256
MKL-DNN: aab753280e83137ba955f8f19d72cb6aaba545ef; MKL: mklml_lnx_2018.0.1.20171007; MLSL: 2017.2.018
Compiler: Intel compiler 2017.4.196
• For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
System Configuration                                   Network Fabric   Minibatch Size   Top-1 Accuracy   Measured TTT
64-node Xeon system, SKX-6148 based *                  10Gb Ethernet    8192             75.9%            7.3 hours
64-card GPU system, P100 based (Facebook Research **)  50Gb Ethernet    8192             76.3%            4 hours
Throughput Scaling (1 node → 32/64 nodes); ResNet-50 Time-to-Train Performance
[Chart: images/sec for VGG-16, GoogLeNet V3 and ResNet-50 at 1 node, 32 nodes and 64 nodes, over 10Gb Ethernet and OPA; relative throughput vs. 1 node is roughly 28.6x-32.0x at 32 nodes and 63.9x-64.0x at 64 nodes]
Intel Internal measured data on system configuration noted above, Configuration slide 30
** https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf?
Measured with Intel® Distribution of Caffe* and Intel® Machine Learning Scaling Library (Intel® MLSL)
90% scaling efficiency with up to 74% Top-1 accuracy on 256 nodes
Configuration Details 2
PUBLIC