State-of-the-Art Machine Learning Algorithms and How They Are Affected by Near-Term Technology Trends
Rob Farber
CEO TechEnablement.com
Contact info at techenablement.com for consulting, teaching, writing, and other inquiries
Machine learning has redefined the market
In the near future every piece of data in the data center will be interacted with by AI – Ian Buck (VP Accelerated Computing, NVIDIA)
By 2020 servers will run data analytics more than any other workload – Diane Bryant (VP and GM of the Data Center Group, Intel)
Why? Computational universality via training!
• The famous XOR problem nicely emphasizes the
importance of hidden neurons
• Networks with hidden units can implement all
Boolean functions used to build a computer
Computational Universal Machine
Learning!
• Networks without nonlinear hidden units cannot
learn XOR hence are not computationally universal
• Cannot represent large classes of problems
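The XOR point above can be checked with a small sketch. It is illustrative only: the threshold activation, the hand-picked hidden-unit weights, and the coarse weight grid are all assumptions made for this example, not anything taken from the talk. It searches for a single threshold unit (no hidden layer) that reproduces XOR, then evaluates a network with two hidden units.

```python
from itertools import product

def step(z):
    """Threshold (nonlinear) activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def single_unit(x1, x2, w1, w2, b):
    """A network with no hidden layer: one threshold unit."""
    return step(w1 * x1 + w2 * x2 + b)

def hidden_unit_net(x1, x2):
    """Two hidden threshold units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

# Search a coarse, illustrative grid of weights for the single unit.
grid = [i * 0.5 for i in range(-8, 9)]  # -4.0 ... 4.0
single_unit_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(single_unit(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR_CASES)
]

hidden_outputs = [hidden_unit_net(x1, x2) for (x1, x2), _ in XOR_CASES]
print(len(single_unit_solutions), hidden_outputs)
```

No weight setting on the grid solves XOR without a hidden layer, while the two-hidden-unit network does. The grid search is only a demonstration, not a proof.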
G(x)
NetTalk
Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel
network that learns to read aloud, Cognitive Science, 14, 179-211 http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
500 learning loops Finished
"Applications of Neural Net
and Other Machine Learning
Algorithms to DNA Sequence
Analysis", (1989).
"How Neural Networks work", (Lapedes, Farber 1987).
Deep-Learning (learn from data many of the things we do)
Speech recognition in noisy environments (Siri, Cortana, Google, Baidu, …)
Better than human
accuracy face
recognition
Self-driving cars
• Internet Search • Robotics • Self guiding drones • Much, much, more
Speech recognition is a Bellwether
A driving force
for ubiquitous
inferencing in
the data center
Expect amazing growth: $10T incremental value with 1000x increase in data volume
• CEO Saudi Telecom statement during his KAUST Global IT Keynote
• We expect 5G to increase the volume of mobile data
• $10T incremental value
Khalid Bin Hussein Bayari
CEO, Saudi Telecom
See also: http://www.mwc.gr/presentations/2016/kolokotronis.pdf and https://www.itu.int/en/ITU-T/Workshops-and-Seminars/standardization/201603/Documents/Abstracts-Presentations/S2P3_Ali_Amer.pptx
Source: METIS
From NetTalk to Bioinformatics
Internal
connections
The phoneme to
be pronounced
NetTalk
Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel network that learns to read aloud, Cognitive Science, 14, 179-211 http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
Internal
connections
t t e X A T C G T
"Applications of Neural Net and Other Machine
Learning Algorithms to DNA Sequence Analysis",
A.S. Lapedes, C. Barnes, C. Burks, R.M. Farber, K.
Sirotkin, Computers and DNA, SFI Studies in the
Sciences of Complexity, vol. VII, Eds. G. Bell and
T. Marr, Addison-Wesley, (1989).
T|F Exon region
From Bioinformatics to drug design (The closer you look the greater the complexity)
Electron Microscope
We formed a company, then The Question
How do we know you are not playing
expensive computer games with our money?
Train then utilize a blind test
Internal
connections
A0
Binding
affinity for a
specific
antibody
A1 A2 A3 A4 A5
Possible hexamers: $20^6$ = 64M
1k – 2k pseudo-random (hexamer, binding affinity) pairs
Approx. 0.001% sampling
“Learning Affinity Landscapes: Prediction of Novel Peptides”, Alan
Lapedes and Robert Farber, Los Alamos National Laboratory
Technical Report LA-UR-94-4391 (1994).
Hill climbing to find high affinity
Internal
connections
A0
$\mathrm{Affinity}_{\mathrm{Antibody}}$
A1 A2 A3 A4 A5
Learn: $\mathrm{Affinity}_{\mathrm{Antibody}} = f(A_0, \ldots, A_5)$
𝑓(F,F,F,F,F,F)
𝑓(F,F,F,F,F,L) 𝑓(F,F,F,F,F,V) 𝑓(F,F,F,F,L,L)
𝑓(P,C,T,N,S,L)
Predict P,C,T,N,S,L has the
highest binding affinity
Confirm
experimentally
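The hill-climbing idea can be sketched in a few lines. This is a toy under stated assumptions: `toy_affinity` is a hypothetical stand-in for the trained network f(A0, …, A5), and it simply counts positions that match the hexamer named on the slide. The greedy single-substitution search is one simple hill-climbing variant, not necessarily the procedure used in the cited report.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
assert len(AMINO_ACIDS) ** 6 == 64_000_000  # 20^6 possible hexamers

HYPOTHETICAL_BEST = "PCTNSL"  # taken from the slide purely for illustration

def toy_affinity(hexamer):
    """Stand-in for the trained network f(A0..A5): counts matching positions."""
    return sum(a == b for a, b in zip(hexamer, HYPOTHETICAL_BEST))

def hill_climb(start, affinity):
    """Greedy hill climbing over single-residue substitutions."""
    current, best = start, affinity(start)
    while True:
        candidates = [
            current[:i] + aa + current[i + 1:]
            for i in range(len(current))
            for aa in AMINO_ACIDS
            if aa != current[i]
        ]
        top = max(candidates, key=affinity)
        if affinity(top) <= best:
            return current, best  # no neighbour improves: local maximum
        current, best = top, affinity(top)

peptide, score = hill_climb("FFFFFF", toy_affinity)
print(peptide, score)
```

Each step scores only 6 × 19 = 114 neighbours with the model, so the search never has to enumerate the full 64M-hexamer landscape.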
Two important points
• The computer appears to correctly predict experimental data
• Demonstrated that complex binding affinity relationships can be learned from a small set of samples
• Necessary because it is only possible to sample a very small subset of the binding affinity landscape for drug candidates
Time series
Iterate
$X_{t+1} = f(X_t, X_{t-1}, X_{t-2}, \ldots)$
$X_{t+2} = f(X_{t+1}, X_t, X_{t-1}, \ldots)$
$X_{t+3} = f(X_{t+2}, X_{t+1}, X_t, \ldots)$
$X_{t+4} = f(X_{t+3}, X_{t+2}, X_{t+1}, \ldots)$
Internal
connections
Xt Xt-1
Learn: $X_{t+1} = f(X_t, X_{t-1}, X_{t-2}, \ldots)$
Xt-2 Xt-3 Xt-4 Xt-5
Xt+1
Works great! (better than other methods at that time)
"How Neural Nets Work", A.S. Lapedes, R.M. Farber, reprinted in Evolution, Learning, Cognition, and Advanced Architectures, World Scientific Publishing Co., (1987).
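A minimal sketch of the iteration scheme, in which each prediction is fed back as an input for the next step. The predictor here (`stand_in_model`) is a hypothetical placeholder chosen so the output is easy to check by hand. In the setting described on the slide, f would be the trained network.

```python
def iterate_forecast(f, history, steps):
    """Feed each prediction back in as an input to forecast several steps.

    history is ordered oldest-to-newest; f receives the window newest-first,
    i.e. f(X_t, X_{t-1}, ...), and returns the prediction for X_{t+1}.
    """
    window = list(history)
    predictions = []
    for _ in range(steps):
        next_value = f(*reversed(window))
        predictions.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward
    return predictions

def stand_in_model(x_t, x_tm1):
    """Hypothetical 'trained' predictor; here simply X_{t+1} = X_t + X_{t-1}."""
    return x_t + x_tm1

forecast = iterate_forecast(stand_in_model, history=[1, 1], steps=4)
print(forecast)  # [2, 3, 5, 8]
```

Because later predictions are built on earlier ones, any error in f compounds with each step.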
Pt+1
Sliding inference during training to increase accuracy
Xt Xt-1
Internal
connections
Xt-2 Xt-3 Xt-4 Xt-5 Xt+1 Xt+2 Xt+3
Pt+3
Error(example) = $\sum_{i=1}^{3} (X_{t+i} - P_{t+i})^2$
Pt+1
Pt+2 Pt+2
Designing ANNs for Integration and Bifurcation analysis – training a "netlet"
"Identification of Continuous-Time Dynamical Systems: Neural Network Based Algorithms and Parallel Implementation", R. M. Farber, A. S. Lapedes, R. Rico-Martinez and I. G. Kevrekidis, Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, Virginia, March 1993.
ANN schematic for continuous-time
identification. (a) A four-layered ANN based on
a fourth order Runge-Kutta integrator. (b) ANN
embedded in a simple implicit integrator.
(a) Periodic attractor of the Van der Pol oscillator for g= 1.0,
d= 4.0 and w = 1.0. The unstable steady state in the interior
of the curve is marked +. (b) ANN-based predictions for the
attractors of the Van der Pol oscillator shown in (a).
Dimension reduction
• The curse of dimensionality
  • People cannot visualize data beyond 3D + color
  • Search volume rapidly increases with dimension
  • Queries return too much data or no data
[Figure: network diagrams mapping inputs Sensor 1, Sensor 2, Sensor 3, … Sensor N to three coordinates X, Y, Z]
A general SIMD mapping: Optimize(LMS_Error = objFunc(p1, p2, … pn))
Host: Optimization Method (Powell, Conjugate Gradient, Other)
Step 1: Broadcast parameters p1, p2, … pn to each device
Step 2: Calculate partials
  GPU 1: Examples 0, N-1
  GPU 2: Examples N, 2N-1
  GPU 3: Examples 2N, 3N-1
  GPU 4: Examples 3N, 4N-1
Step 3: Sum partials to get energy
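The three-step mapping can be sketched in plain Python. This is a sketch under assumptions: the straight-line model, the helper names (`split_examples`, `partial_error`, `obj_func`), and the use of lists in place of real GPUs are all illustrative. The partials are computed one after another here, whereas on real hardware each device would compute its partial concurrently.

```python
import math

def split_examples(examples, n_devices):
    """Give each device a contiguous block of examples (0..N-1, N..2N-1, ...)."""
    block = math.ceil(len(examples) / n_devices)
    return [examples[i:i + block] for i in range(0, len(examples), block)]

def partial_error(params, shard):
    """Step 2: one device's sum of squared errors for a line y = a*x + b."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in shard)

def obj_func(params, shards):
    """Steps 1-3: broadcast params, compute partials, sum them on the host."""
    partials = [partial_error(params, shard) for shard in shards]  # per device
    return sum(partials)

examples = [(float(x), 2.0 * x + 1.0) for x in range(16)]  # data from y = 2x + 1
shards = split_examples(examples, n_devices=4)

error_at_truth = obj_func((2.0, 1.0), shards)
error_elsewhere = obj_func((0.0, 0.0), shards)
print(error_at_truth, error_elsewhere)
```

The optimizer on the host only ever sees `obj_func(params)`, so any method (Powell, conjugate gradient, or another) can drive the search without knowing how the examples are distributed.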
[Figure: TACC Stampede PCA scaling. Y axis: Average Sustained TF/s (0 to 2500); X axis: Number of Intel Xeon Phi coprocessors/Sandy Bridge nodes (0 to 3500)]
Many problems are too big for a single computer – Strong scaling execution model!
Perfect strong scaling decreases runtime linearly by the number of processing elements
• "O(log N) scaling is good enough"
See a path to exascale (MPI can map to thousands of GPU or Processor nodes)
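One place a logarithmic cost typically appears in this scheme is the reduction that sums per-node partial results. Reading the O(log N) remark this way is an assumption, and `tree_reduce` is a hypothetical helper that only counts rounds rather than running anything in parallel.

```python
def tree_reduce(values):
    """Sum values pairwise; returns (total, number_of_parallel_rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Each pair could be combined concurrently on different nodes.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

total, rounds = tree_reduce(range(1, 9))  # eight partial results
print(total, rounds)  # 36 3
```

Doubling the number of nodes adds only one more round, which is why the reduction step need not block scaling to thousands of devices.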
Always report "Honest Flops"
Expect significant algorithm and HW retooling
Today: only 7% of all servers are being used for machine learning and only 0.1% are running deep neural nets – Forbes
By 2020, 100% of servers running machine learning – Intel, NVIDIA, Wall Street
Roughly four different camps
• CPU
• GPU
• FPGA
• Chips
Cut through the marketing fluff
• Deep-learning originated as a technical term
  • Lots of layers between the Input and Output layers
  • Originally described attempts to mimic the many layers of the brain
  • Morphed into a marketing term to sell based on market preconceptions
  • Now, Ack! Morphing into "AI" – even worse for preconceptions!
• Training:
  • Is not "learning"
    • In the human sense (think of the learning to read aloud example)
  • Nor is training "AI"
Training is the numerical optimization of a set of model parameters to minimize a cost function
• Parallelism speeds training
  • The SIMD computational model maps beautifully and efficiently to processors, vector processors, accelerators, FPGAs, and custom chips alike.
• Training is memory bound!
  • Performance is limited by the cache and memory subsystems rather than flops/s.
• The training set must be large enough to use all the device parallelism
  • Else performance is wasted
  • Make sure you will have enough data for massive GPU parallelism
  • CPUs are likely to be better for data sets with hundreds to tens of thousands of examples
Training is math … not learning (ANNs have no concept of reality or a goal)
• A project in the 1990s attempted to train an ANN to distinguish between images of a tank vs. a car.
• A low error was found after training but in the field the real-world accuracy was abysmal.
• Further investigation found that most of the tank pictures were taken on a sunny day while the pictures of the cars were taken on cloudy days.
• The network solved the optimization problem by distinguishing cloudy vs. sunny days and not cars vs. tanks
• Bad news for people driving on a sunny day!
Inferencing is a sequential operation (unless done in volume like in a datacenter)
• Single inference operation parallelism:
  • Limited by the data dependencies defined by the ANN architecture
  • Generally low (else an additional speedup during training)
• Inferencing is memory bound!
  • Performance is limited by the cache and memory subsystems rather than flops/s.
• March example paper: Deep Voice: Real-time Neural Text-to-Speech
• Succinctly:
  • Most people will not need inferencing optimized devices
    • Unless they plan to perform volume processing of data in a data center.
  • Inferencing of individual data items will be dominated by the sequential performance of the device.
  • Expect GPUs to have poor low-volume/individual inferencing performance relative to CPUs, FPGAs, ASICs.
Use Gradients for orders-of-magnitude faster time-to-solution
• Provides an algorithmic speedup (L-BFGS, Conjugate Gradient, …)
  • Potentially orders-of-magnitude faster than gradient-free methods
  • Gradient approximations can add nParameters extra evaluations per dfunc.
• Popular software packages such as Theano make it easy
  • Can symbolically calculate the gradient through the use of automatic differentiation
  • Generates native code for speed
• Bad news: Gradient gets very large (O(N^2) Jacobian) according to model size
  • Memory and cache bound, plus can be atomic operation limited as well
  • Code can be simply too big for GPUs
• Limited memory devices use chunked gradient calculations
  • Adds communications overhead (the PCIe bus is famous for limiting GPU performance)
  • NVlink performance is unknown.
• Succinctly: real memory for real performance
  • Especially stacked memory!
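The cost of approximating a gradient can be made concrete with a small count of objective evaluations. This sketch uses a toy quadratic cost and a forward-difference approximation. Both are assumptions chosen for illustration, and the hand-written `analytic_gradient` stands in for what a package with automatic differentiation would generate.

```python
def make_counted(func):
    """Wrap func so the number of evaluations can be inspected."""
    def wrapped(params):
        wrapped.calls += 1
        return func(params)
    wrapped.calls = 0
    return wrapped

def objective(params):
    """Toy cost: sum of (p_i - i)^2, minimized at p_i = i."""
    return sum((p - i) ** 2 for i, p in enumerate(params))

def analytic_gradient(params):
    """Exact gradient of the toy cost: 2 * (p_i - i)."""
    return [2.0 * (p - i) for i, p in enumerate(params)]

def forward_difference_gradient(func, params, h=1e-6):
    """Approximate gradient: one base evaluation plus one per parameter."""
    base = func(params)
    grad = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += h
        grad.append((func(shifted) - base) / h)
    return grad

params = [0.0] * 50
counted = make_counted(objective)
approx = forward_difference_gradient(counted, params)
exact = analytic_gradient(params)
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
print(counted.calls, max_diff)
```

With 50 parameters the approximation costs 51 passes over the objective (and hence over the training data) per gradient, while the analytic form needs no extra objective evaluations.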
Most data scientists will use one or two layers
• Deep ANNs (DNNs) are useful, but many in the data analytics world will not use more than one or two hidden layers
• Due to the vanishing gradient problem. • Nice tutorial: http://neuralnetworksanddeeplearning.com/chap5.html.
• An illustration:
Tries to account for as much of the
gradient as possible
Tries to account for as much of the
remaining gradient as possible
Lower layers can learn slower
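A back-of-the-envelope sketch of the vanishing gradient effect, assuming sigmoid activations and a deliberately simplified chain of single-unit layers that share one weight. The function name `gradient_scale` and these simplifications are illustrative, not from the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

def gradient_scale(depth, weight=1.0, pre_activation=0.0):
    """Backpropagated gradient factor through `depth` sigmoid layers.

    Assumes a chain of single-unit layers that all share the same weight
    and pre-activation, which is a deliberate simplification.
    """
    return (weight * sigmoid_prime(pre_activation)) ** depth

for depth in (1, 2, 5, 10):
    print(depth, gradient_scale(depth))
```

Under these assumptions the factor shrinks by 4x per layer, so the layers furthest from the output receive the smallest updates.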
Reduced precision and specialized hardware
• Half-precision (FP16) can double the number of training iterations per unit time
  • Expect 4x from 8-bit math
  • Think of NVIDIA reduced-precision Tensor Core as the ANN version of MMX for multimedia – good for a few DNNs but not general machine-learning
• Reduced precision is likely to (IMHO) cause slow convergence
  • Time-to-model can increase significantly, far beyond the hardware speedup
  • See How Neural Networks Work, Lapedes and Farber
• Get stuck in local minima
  • Bad solutions (i.e. never leave the babbling phase in the reading aloud example)
• In general, avoid reduced precision for training as it will likely harm rather than help.
• Reduced precision is more useful for volume inferencing and specialized DNNs
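One possible mechanism behind slow convergence at reduced precision is that small parameter updates get rounded away. The sketch below illustrates only that one effect, under the assumption that the weight is stored in FP16 after every update. It uses the standard-library `struct` format code `"e"` for IEEE 754 half precision; the helper names and the update size are illustrative.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def accumulate(weight, update, steps, half_precision):
    """Apply the same small update repeatedly, optionally rounding to FP16."""
    for _ in range(steps):
        weight = weight + update
        if half_precision:
            weight = to_fp16(weight)
    return weight

w64 = accumulate(1.0, 1e-4, steps=1000, half_precision=False)
w16 = accumulate(1.0, 1e-4, steps=1000, half_precision=True)
print(w64, w16)
```

Near 1.0 the spacing between adjacent FP16 values is about 0.001, so each 0.0001 update rounds back to 1.0 and the FP16 weight never moves, while the double-precision weight reaches roughly 1.1.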
NVIDIA (GPUs)
• Restarted massive parallelism with CUDA and GPU computing
• Making big inroads into the data center
GPU Threads are grouped into threadblocks
• Threads can only communicate within a thread block
  • (yes, there are atomic ops)
• Fast hardware scheduling
  • Blocks run when dependencies are resolved
Data
• Blocks that are ready to run get assigned to processing elements
• Fast hardware scheduling
Scalability required to use all those cores (strong scaling execution model)
Active Queue
Executables can run unchanged on bigger GPUs
• Dealer Analogy
Scheduler SMX
Strong Scaling
Execution Model
NVIDIA Claims Big Perf. Increases since 2013
NVIDIA on speech recognition
NVIDIA Claims for P40 using INT8 math & TensorRT
Processor-based computing
AMD
Intel
Intel Xeon and Xeon Phi
• Intel Xeon Phi is many-core with two AVX-512 vector units per core
  • High bandwidth, good cache, lots of parallelism
  • Cores 1/3 performance of Intel Xeon.
    • Not a problem for training.
  • On-package photonics Intel Omni-Path
    • On-package puts Mellanox at risk of being outdated like the USB-WiFi dongle companies
• Intel Xeon Skylake
  • Extra memory channel, AVX-512 vector units, more cores, up to 8 sockets
  • Good cache and fastest sequential performance
  • FPGA/ASIC integration (ASICs may leapfrog CPU and GPU performance!)
  • On-package photonics Intel Omni-Path
  • Big memory plus (unconfirmed at this time) likely to get Intel 3D XPoint DIMMs.
  • Many procurements are delaying to get actual Skylake performance numbers
[Illustration: a traditional vector ISA, with SSE-wide units per core, next to cores that each have two 512-wide vector units]
Floating-point performance comes from the dual per core vector units
• AVX-512 = 16 32-bit ops/clock
[Figure: Performance versus Parallelism, from No Parallelism to Massive Parallelism; regions labeled "Scalar and single-threaded" and "Vector and single-threaded", with "Highest Performance" marked. Image courtesy Elsevier]
Convergence (for training and HPC in general)
• NVIDIA Pascal has a working MMU (Memory Management Unit)
  • Data can be automatically moved between CPU and GPU on a demand basis.
  • Offload programming is no longer a requirement, it's an optimization!
  • This is a really big deal as code changes are a barrier to GPU adoption
• Pascal GPUs have fast stacked memory (much faster, more capacity, energy efficient)
  • and NVlink (fast host/GPU memory bandwidth transfer – but only with IBM!)
• Intel Xeon Phi (formerly known as Knights Landing):
  • Data can be automatically moved between near (stacked) and far (DDR4) memory on a demand basis using cache mode.
  • Offload programming is no longer a requirement, it's an optimization!
  • Stacked memory is much faster, more capacity, energy efficient
• IBM
  • Data can be automatically moved between Power and GPU on a demand basis using NVlink.
  • Offload programming is no longer a requirement, it's an optimization!
IBM approach
• Sumit Gupta (VP, High Performance Computing and Data Analytics, IBM): "fundamentally, accelerators are the path forward."
  • These accelerators are GPUs for compute, storage accelerators for big data, and FPGAs for special functions.
• Watson (of Jeopardy fame) for software
• TrueNorth
  • Developed as part of the DARPA SyNAPSE program
  • 46 Billion synaptic op/s using 70 mW!
Sumit Gupta
The OpenPower special sauce
1. CAPI (IBM Coherent Accelerator Processor Interface)
2. NVlink used in the CORAL "Summit" and "Sierra" supercomputers
• Make application acceleration:
  • Much easier
  • Transparent for the application programmer.
• Open for all to join
CAPI is important
• Supports compute, storage, and special (e.g. FPGA) accelerators
• Shares virtual addressing – everything works with same memory addresses
• Provides hardware managed cache coherence
• Claims to eliminate 97% of code path length!
Data handling can take as much time as the
computational problem!
• ORNL Titan
– 112,128 GB of GPU memory in 18,688 K20x GPUs
• Data handling must be
– Language agnostic
– Scalable
SPEC M3 Benchmark on 68 TB CAPI system
• Fastest mean response times and most consistent response times (lowest standard deviation) ever reported, for all combinations of query type, data volume, and concurrent users.
• Each mean response time was 5.5x to 212x the previous best result, including:
• 21x to 212x the performance of the previous best published result for the market snap benchmarks (10T.YR[n]-MKTSNAP.TIME) *
• 21x the performance of the previous best published result for year high bid in the smallest year of the dataset (1T.OLDYRHIBID.TIME) *
• 8-10x the performance of the previous best published result for the 100-user volume-weighted average bid benchmarks (100T.YR[n]VWAB-12D-HO.TIME) **
• 5-8x the performance of the previous best published result for the N-year high-bid benchmarks (1T.[n]YRHIBID.TIME)
Fast scalable data loads for training via
parallel file systems
[Figure: Node 1, Node 2, Node 3, Node 4, Node 5, Node 6, Node 7, … Node 500]
Each MPI client on each node:
1. Opens file
2. Seeks to location
3. Reads data
4. Closes file
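The open/seek/read/close pattern can be sketched without MPI by passing the client's rank in as an argument. Everything named here (`read_block`, fixed-size records, contiguous blocks per client) is an assumption for illustration. In a real job the rank and client count would come from the MPI runtime and the file would live on a parallel file system.

```python
import os
import tempfile

def read_block(path, rank, n_clients, record_size):
    """Read this client's contiguous block of fixed-size records."""
    n_records = os.path.getsize(path) // record_size
    per_client = -(-n_records // n_clients)  # ceiling division
    first = min(rank * per_client, n_records)
    count = min(per_client, n_records - first)
    with open(path, "rb") as handle:           # 1. open (4. closed on exit)
        handle.seek(first * record_size)       # 2. seek to this client's data
        return handle.read(count * record_size)  # 3. read

# Usage: a temporary file of ten 4-byte records shared by three "clients".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(40)))
    data_path = tmp.name

blocks = [read_block(data_path, rank, 3, record_size=4) for rank in range(3)]
os.remove(data_path)
print([len(b) for b in blocks])
```

Because every client computes its own offset from its rank, no client needs to coordinate with any other while loading.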
Other training and inference solutions
FPGAs
• Fast, power efficient and perfect for variable precision arithmetic
• Accessible via CAPI
• Very difficult to program
Custom chips
• Fast, power efficient and perfect for variable precision arithmetic
Heatsink City
• A is quad Google TPU2 motherboard side view
• B is dual IBM Power Zaius motherboard
• C is dual IBM Power Minsky motherboard
• D is dual Intel Xeon Facebook Yosemite motherboard
• E is Nvidia P100 SXM2 module with heat sink and Facebook Big Basin motherboard
(Image courtesy The Next Platform)
Nervana and many other offerings
Google TPU2:
• 15x-30x faster than CPU and GPU, On
our production AI workloads that utilize
neural network inference
• 30x to 80x improvement in TOPS/Watt
measure
• Exclusive to Google Cloud
IBM TrueNorth (A path to the future???)
• PNAS (8/9/16) Convolutional networks for fast, energy-efficient neuromorphic computing, Steven K. Esser et al.
• Chip implements networks using integrate-and-fire spiking neurons
• IBM researchers ran the datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per watt)
• Can go really big
A system roughly the neuronal size of a rat brain
TrueNorth: Accuracy
• PNAS paper: "[We] demonstrate that neuromorphic computing … can implement deep convolution networks that approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, perform inference while preserving the hardware's underlying energy-efficiency and high throughput."
Accuracy of different sized networks
running on one or more TrueNorth chips to
perform inference on eight datasets. For
comparison, accuracy of state-of-the-art
unconstrained approaches are shown as
bold horizontal lines (hardware resources
used for these networks are not indicated).
Common software
• Theano: A Python library that generates C-code for a CPU or GPU
• TensorFlow: Google's open source library for machine learning
• cuDNN: NVIDIA's machine learning library
• Intel DAAL (Data Analytics Acceleration Library)
• Torch: An open source middleware library
• Caffe: Berkeley's popular framework
[Figure: a load-balancing splitter feeding Machine 1, Machine 2, and Machine 3, each running App A, App B, App C, and App D on its CPU]
Fast and scalable heterogeneous workflows
Full source code in my DDJ tutorial http://www.drdobbs.com/parallel/232601605
Volta
FPGA
Custom ASIC
So much more, you have been great, Thank You!