Object Classification Using CNN Across Intel® Architecture

Artificial Intelligence | Object Classification | Intel® AI Builders

White Paper


Abstract

In this work, we present the computational performance and classification accuracy of object classification using the VGG16 network on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The results can serve as criteria for selecting iteration counts in different experimental setups on these processors, including multinode architectures. Because the ultimate objective is to evaluate accuracy for real-time logo detection from video, the experiments are run on a logo image dataset, making the results directly applicable to measuring logo classification accuracy.

1. Introduction

Deep learning (DL), which refers to a class of neural network models with deep architectures, forms an important and expressive family of machine learning (ML) models. Modern deep learning models, such as convolutional neural networks (CNNs), have achieved notable successes in a wide spectrum of machine learning tasks including speech recognition [1], visual recognition [2], and language understanding [3]. The explosive success and rapid adoption of CNNs by the research community is largely attributable to high-performance computing hardware, such as the Intel® Xeon® processor, Intel® Xeon Phi™ processor, and graphics processing units (GPUs), as well as to a wide range of easy-to-use open source frameworks, including Caffe*, TensorFlow*, the Cognitive Toolkit (CNTK*), and Torch*.

2. Setting up a Multinode Cluster

The Intel® Distribution for Caffe* is designed for both single-node and multinode operation. There are two general approaches to parallelization (data parallelism and model parallelism), and Intel uses data parallelism.

In data parallelism, every node runs the same model but is fed different data. As a result, the total batch size in a single iteration equals the sum of the individual batch sizes of all nodes. For example, if a network is trained on three nodes, each with a batch size of 64, the (total) batch size in a single iteration of the stochastic gradient descent (SGD) algorithm is 3 * 64 = 192, as the sketch below illustrates. Model parallelism means using the same data across all nodes, but making each node responsible for estimating different parameters. The nodes then exchange their estimates with each other to arrive at the right estimate for all parameters.
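A quick sketch of this batch-size arithmetic (the node count and per-node batch are simply the example values above, not fixed by the framework):

# Under data parallelism, each node contributes its own per-node batch
# to every SGD iteration, so the global batch is the sum across nodes.
NODES=3
PER_NODE_BATCH=64
GLOBAL_BATCH=$((NODES * PER_NODE_BATCH))
echo "Global batch per SGD iteration: ${GLOBAL_BATCH}"   # prints 192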

Table of Contents

Abstract
1. Introduction
2. Setting up a Multinode Cluster
3. Experiments
3.1. Training Data
3.2. Model Building and Network Topology
4. Results
4.1 Observations on Intel® Xeon® Processor
4.2 Observations on Intel® Xeon Phi™ Processor
5. Conclusion and Future Work



To set up a multinode cluster, download and install the Intel® Machine Learning Scaling Library (Intel® MLSL) 2017 package from https://github.com/01org/MLSL/releases/tag/v2017-Preview, source mlslvars.sh, and then recompile the Caffe build with MLSL := 1 set in Makefile.config; a sketch of these steps is shown below. When the build completes successfully, start the Caffe training using the message passing interface (MPI) command that follows the sketch:
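A minimal sketch of the setup steps, assuming MLSL is installed under /opt/intel/mlsl_2017 (the install prefix, and the exact flag name in your Makefile.config, may differ by environment and Caffe version):

# Make the MLSL environment visible to the build and at runtime.
source /opt/intel/mlsl_2017/intel64/bin/mlslvars.sh

# Enable MLSL in Makefile.config (assumes a commented-out "# MLSL := 1"
# line exists) and rebuild the Intel Distribution for Caffe.
sed -i 's/^# *MLSL := 1/MLSL := 1/' Makefile.config
make clean
make -j "$(nproc)" all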

mpirun -n 3 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train \

--solver=models/bvlc_googlenet/solver_client.prototxt --engine=MKL2017

where n defines the number of nodes and ppn represents the number of processes per node. The nodes are configured in ~/mpd.hosts with their respective IP addresses, for example:

192.161.32.1

192.161.32.2

192.161.32.3

192.161.32.4

Ansible* scripts are used to copy the binaries or files across the nodes.
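As one hedged illustration (the inventory file and destination path are assumptions; the actual playbooks are not part of this paper), an Ansible ad-hoc command can push the locally built Caffe binary to every node:

# Copy the Caffe binary to the same path on all nodes listed in ~/mpd.hosts.
ansible all -i ~/mpd.hosts -m copy \
  -a "src=build/tools/caffe dest=/home/user/caffe/build/tools/caffe mode=0755"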

Clustering communication employs Intel® Omni-Path Architecture (Intel® OPA) [4].

The cluster setup is validated by running the command ‘opainfo’ on all machines; the port state must always be ‘Active’.
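A small sketch of this check across all nodes (the host list matches ~/mpd.hosts above; the exact opainfo output fields are an assumption, so adjust the grep pattern to your fabric stack):

# Confirm every node's Omni-Path port reports an Active state.
for host in 192.161.32.1 192.161.32.2 192.161.32.3 192.161.32.4; do
  echo "== ${host} =="
  ssh "${host}" opainfo | grep -i "PortState"
done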

3. Experiments

The current experiment focuses on measuring the performance of the VGG16 network on the Flickr* logo dataset, which has 32 different classes of logos. Intel® Optimized Technical Preview for Multinode Caffe* is used for the single-node experiments, with Intel® MLSL enabled for the multinode experiments. The input images were all converted to Lightning Memory-Mapped Database (LMDB) format for better efficiency (a conversion sketch follows the machine configurations below). All of the experiments are set to run for 10K iterations, and the observations are noted below. We conducted our experiments on the following machine configurations; due to lack of time, we had to limit our experiments to a single execution per architecture.

Intel Xeon Phi processor

• Model Name: Intel® Xeon Phi™ processor 7250 @ 1.40 GHz
• Core(s) Per Socket: 68
• RAM (free): 70 GB
• OS: CentOS* 7.3

Intel Xeon processor

• Model Name: Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz
• Core(s) Per Socket: 22
• RAM (free): 123 GB
• OS: Ubuntu* 16.1
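The LMDB conversion mentioned above can be done with Caffe's stock convert_imageset tool. A minimal sketch, assuming a train.txt file listing image paths with integer class labels (the directory names and resize dimensions here are assumptions, not the paper's exact values):

# Build the training LMDB from lines of the form "relative/path/img.jpg <label>".
./build/tools/convert_imageset --resize_height=256 --resize_width=256 --shuffle \
  /data/flickr_logos/images/ /data/flickr_logos/train.txt /data/flickr_logos/train_lmdb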


Figure 1: Intel® Omni-Path Architecture (Intel® OPA) cluster information.


The multinode cluster setup is configured as follows:

KNL 01 (Master)

• Model Name: Intel® Xeon Phi™ processor 7250 @ 1.40 GHz
• Core(s) Per Socket: 68
• RAM (free): 70 GB
• OS: CentOS 7.3

KNL 03 (Slave node)

• Model Name: Intel Xeon Phi processor 7250 @ 1.40 GHz
• Core(s) Per Socket: 68
• RAM (free): 70 GB
• OS: CentOS 7.3

KNL 04 (Slave node)

• Model Name: Intel Xeon Phi processor 7250 @ 1.40 GHz
• Core(s) Per Socket: 68
• RAM (free): 70 GB
• OS: CentOS 7.3

3.1. Training Data

The training and test image datasets were obtained from Datasets: FlickrLogos32 / FlickrLogos47, which is maintained by the Multimedia Computing and Computer Vision Lab, Augsburg University. The dataset contains 32 logo classes or brands, downloaded from Flickr, as illustrated in the following figure:

The 32 classes are as follows: Adidas*, Aldi*, Apple*, Becks*, BMW*, Carlsberg*, Chimay*, Coca-Cola*, Corona*, DHL*, Erdinger*, Esso*, Fedex*, Ferrari*, Ford*, Foster's*, Google*, Guinness*, Heineken*, HP*, Milka*, Nvidia*, Paulaner*, Pepsi*, Ritter Sport*, Shell*, Singha*, Starbucks*, Stella Artois*, Texaco*, Tsingtao*, and UPS*.

The training set consists of 8240 images: 6000 are no_logo images, and the remaining 2240 comprise 70 images per class for the 32 classes, making the dataset highly skewed. The training and test data are split in a ratio of 90:10 from the full 8240 samples.
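One hedged way to produce such a 90:10 split from a list of labeled samples (the file names are assumptions; the paper does not specify its splitting procedure):

# Shuffle the full list of 8240 "path label" lines, then take the first 90%
# for training and the remaining 10% for testing.
shuf all_images.txt > shuffled.txt
head -n 7416 shuffled.txt > train.txt   # 90% of 8240 samples
tail -n 824  shuffled.txt > test.txt    # remaining 10%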

Figure 2: Flickr logo image dataset with 32 classes.


3.2. Model Building and Network Topology

The VGG16 network topology was used for our experiments. VGG16 has 16 weight layers (13 convolutional and 3 fully connected (FC) layers) and uses very small (3 x 3) convolution filters. It showed significant improvements in network performance and detection accuracy over prior art, placing first and second in the ImageNet* challenge tracks in 2014, and has since been widely used as a reference topology.

4. Results

4.1 Observations on Intel® Xeon® Processor

The Intel Xeon processors run under the following software configuration:

Caffe Version: 1.0.0-rc3

MKL Version: _2017.0.2.20170110

MKL_DNN: SUPPORTED

GCC Version: 5.4.0

The following observations were noted while training for 10K iterations with a batch size of 32 and learning rate policy as POLY.
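For reference, these settings correspond to a Caffe solver definition along the following lines; this is a sketch under assumptions (base_lr, power, and momentum are illustrative values, not the paper's exact hyperparameters, and the model path is hypothetical; batch size is set in the net definition, not the solver):

# Write an illustrative solver using the "poly" learning-rate policy.
cat > models/vgg16/solver.prototxt <<'EOF'
net: "models/vgg16/train_val.prototxt"   # batch size 32 lives in the net file
max_iter: 10000                          # the 10K iterations used here
lr_policy: "poly"                        # the POLY policy referenced above
power: 0.5                               # assumed; shapes the decay curve
base_lr: 0.01                            # assumed starting learning rate
momentum: 0.9                            # assumed
solver_mode: CPU
EOF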

Figure 3: Training loss variation with iterations (batch size 32, LR policy as POLY).

Figure 4: Accuracy variation with iterations (batch size 32, LR policy as POLY).


The following observations were noted while training for 10K iterations with a batch size of 64 and learning rate policy as POLY.

The real-time training and test observations using different batch sizes on the Intel Xeon processor are depicted in Table 1, and Table 2 shows how the accuracy varies with batch size.

Batch Size | LR Policy | Start Time | End Time | Duration | Loss | Accuracy at Top 1 | Accuracy at Top 5
32 | POLY | 18:20 | 23:46 | 5:26 | 0.00016 | 0.62 | 0.84
64 | POLY | 16:20 | 9:57 | 17:37 | 0.00003 | 0.64 | 0.86
64 | STEP | 16:41 | 6:37 | 13:56 | 0.0005 | 0.65 | 0.85

Table 1: Real-time training results for the Intel® Xeon® processor.

Figure 5: Training loss variation with iterations (batch size 64, LR policy as POLY).

Figure 6: Accuracy variation with iterations (batch size 64, LR policy as POLY).



Iterations | Accuracy@Top1 (batch 32) | Accuracy@Top5 (batch 32) | Accuracy@Top1 (batch 64) | Accuracy@Top5 (batch 64)
0 | 0 | 0 | 0 | 0
1000 | 0.165937 | 0.49125 | 0.30375 | 0.6375
2000 | 0.374375 | 0.754062 | 0.419844 | 0.785156
3000 | 0.446875 | 0.74125 | 0.513906 | 0.803437
4000 | 0.50375 | 0.78625 | 0.522812 | 0.838437
5000 | 0.484062 | 0.783437 | 0.580781 | 0.848594
6000 | 0.549062 | 0.819062 | 0.584531 | 0.843594
7000 | 0.553125 | 0.826563 | 0.632969 | 0.847969
8000 | 0.615625 | 0.807187 | 0.64375 | 0.84875
9000 | 0.607813 | 0.83 | 0.624844 | 0.856406
10000 | 0.614567 | 0.83616 | 0.641234 | 0.859877

Table 2: Batch size versus accuracy details on the Intel® Xeon® processor.
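The per-iteration accuracies above are read from the Caffe test-phase log output. A minimal extraction sketch, assuming standard Caffe glog lines such as "Iteration 1000, Testing net (#0)" and "Test net output #0: accuracy = 0.165937" (the log file name is ours, not the paper's; the output name depends on the accuracy layer in the net definition):

# Print "iteration top1-accuracy" pairs from a Caffe training log.
awk '
  match($0, /Iteration [0-9]+, Testing/) { split(substr($0, RSTART), a, /[ ,]/); iter = a[2] }
  /Test net output #0: accuracy/         { print iter, $NF }
' train_xeon_batch32.log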

4.2 Observations on Intel® Xeon Phi™ Processor

The Intel Xeon Phi processors run under the following software configuration:

Caffe Version: 1.0.0-rc3

MKL Version: _2017.0.2.20170110

MKL_DNN: SUPPORTED

GCC Version: 6.2
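The paper does not list thread-affinity settings; on a 68-core Intel Xeon Phi 7250 self-boot node, Caffe runs are typically pinned along these lines (the values and solver path are assumptions for illustration, not the configuration used in these experiments):

# Illustrative OpenMP pinning for a 68-core Xeon Phi 7250 node.
export OMP_NUM_THREADS=68
export KMP_AFFINITY=granularity=fine,compact,1,0
./build/tools/caffe train --solver=models/vgg16/solver.prototxt --engine=MKL2017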

The following observations were noted while training for 10K iterations with a batch size of 32 and learning rate policy as POLY.


Figure 7: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 32, LR policy as POLY).


Figure 8: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 32, LR policy as POLY).

Figure 9: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 64, LR policy as POLY).

Figure 10: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 64, LR policy as POLY).


Iterations | Accuracy@Top1 (batch 32) | Accuracy@Top5 (batch 32) | Accuracy@Top1 (batch 64) | Accuracy@Top5 (batch 64)
0 | 0 | 0 | 0 | 0
1000 | 0.138125 | 0.427812 | 0.200469 | 0.54875
2000 | 0.24 | 0.589688 | 0.330781 | 0.678594
3000 | 0.295625 | 0.621875 | 0.362188 | 0.68375
4000 | 0.295312 | 0.660312 | 0.40625 | 0.708906
5000 | 0.337813 | 0.67 | 0.437813 | 0.74625
6000 | 0.374687 | 0.71 | 0.40625 | 0.723594
7000 | 0.335 | 0.6875 | 0.432187 | 0.749219
8000 | 0.38375 | 0.692187 | 0.455312 | 0.745781
9000 | 0.39625 | 0.70875 | 0.455469 | 0.722969
10000 | 0.40131 | 0.713456 | 0.469871 | 0.748901

Table 3: Batch size versus accuracy details on the Intel® Xeon Phi™ processor.

Figure 11: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 128, LR policy as POLY).

Figure 12: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 128, LR policy as POLY).



Iterations | Accuracy@Top1 (batch 128) | Accuracy@Top5 (batch 128)
0 | 0 | 0
1000 | 0.272266 | 0.665156
2000 | 0.397422 | 0.696328
3000 | 0.432813 | 0.750234
4000 | 0.46 | 0.723437
5000 | 0.446328 | 0.776641
6000 | 0.432969 | 0.74125
7000 | 0.473203 | 0.75
8000 | 0.419688 | 0.700938
9000 | 0.455312 | 0.763281
10000 | 0.478901 | 0.798771

Batch Size | LR Policy | Start Time | End Time | Duration | Loss | Accuracy at Top 1 | Accuracy at Top 5
32 | POLY | 17:53 | 20:36 | 2:43 | 0.005 | 0.4 | 0.71
64 | POLY | 10:59 | 16:07 | 6:08 | 0.00007 | 0.47 | 0.75
128 | POLY | 18:00 | 4:19 | 10:19 | 0.00075 | 0.48 | 0.8

Table 4: Real-time training results for the Intel® Xeon Phi™ processor.

5. Conclusion and Future Work

We observed from Table 1 that a batch size of 32 was the optimal configuration in terms of speed and accuracy. Though there is a slight increase in accuracy with batch size 64, the gain is quite low compared to the increase in training time. We also observed that the learning rate policy has a significant impact on training time and less impact on accuracy; the recalculation of the learning rate on every iteration may have slowed down this training. There is a minor gain in Top 5 accuracy with the POLY LR policy, which might be due to its more optimal calculation of the learning rate, and the gain might differ significantly on a larger dataset.

We observed from Table 3 that Intel Xeon Phi processor efficiency increases as the batch size increases, and that the loss also decreases faster with larger batch sizes. Table 4 suggests that higher batch sizes also run faster on Intel Xeon Phi processors relative to the number of images processed.

The observations in the above tables indicate that training on Intel Xeon Phi machines is faster than the same training conducted on Intel Xeon machines, thanks to the bootable host processor that delivers massive parallelism and vectorization. However, the accuracy produced by Intel Xeon Phi processors is much lower than that produced by Intel Xeon processors for the same number of iterations, so it must be noted that we have to run more iterations on Intel Xeon Phi processors than on Intel Xeon processors to reach the same accuracy levels.

List of Abbreviations

Abbreviation | Expanded Form
MLSL | Machine Learning Scaling Library
CNN | convolutional neural network
GPU | graphics processing unit
ML | machine learning
CNTK | Cognitive Toolkit
DL | deep learning
LMDB | Lightning Memory-Mapped Database



References

1. Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M. L., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. Recent Advances in Deep Learning for Speech Research at Microsoft. In ICASSP (2013).

2. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS (2012).

3. Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. In ICLRW (2013).

4. Cherlopalle, D., and Weage, J. Dell HPC Omni-Path Fabric: Supported Architecture and Application Study. June 2016.

More details on Intel Xeon Phi processor: Intel Xeon Phi Processor

Intel® Distribution for Caffe*: Manage Deep Learning Networks with Intel Distribution for Caffe

Multinode Guide: Guide to multi-node training with Intel® Distribution of Caffe*

Intel Omni Path Architecture Cluster Setup: Dell HPC Omni-Path Fabric: Supported Architecture and Application Study

Intel MLSL Package: Intel® MLSL 2017 Beta https://github.com/01org/MLSL/releases/tag/v2017-Beta


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation Printed in USA 0518/BA/PDF Please Recycle