State-Of-The Art Machine Learning Algorithms and How They Are Affected By Near-Term Technology Trends

Rob Farber

CEO TechEnablement.com

Contact info at techenablement.com for consulting, teaching, writing, and other inquiries


Machine learning has redefined the market

In the near future every piece of data in the data center will be interacted with by AI – Ian Buck (VP Accelerated Computing, NVIDIA)

By 2020 servers will run data analytics more than any other workload – Diane Bryant (VP and GM of the Data Center Group, Intel)

Why? Computational Universality via training!

• The famous XOR problem nicely emphasizes the importance of hidden neurons

• Networks with hidden units can implement all Boolean functions used to build a computer

Computational Universal Machine Learning!

• Networks without nonlinear hidden units cannot learn XOR hence are not computationally universal

• Cannot represent large classes of problems

G(x)
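The XOR point can be checked by hand. The sketch below is an illustration only: the weights are hand-picked rather than trained, the step activation stands in for G(x), and the names `xor_with_hidden` and `any_single_unit_solves_xor` are made up for this example. It shows a two-hidden-unit network computing XOR, and a brute-force search over a coarse weight grid finding no single threshold unit that does.

```python
from itertools import product

def step(z):
    """Threshold nonlinearity standing in for G(x)."""
    return 1 if z > 0 else 0

def xor_with_hidden(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)

def single_unit(w1, w2, b, x1, x2):
    """One threshold unit with no hidden layer."""
    return step(w1 * x1 + w2 * x2 + b)

def any_single_unit_solves_xor(grid):
    """Search a grid of (w1, w2, b) for a unit that reproduces XOR."""
    inputs = list(product((0, 1), repeat=2))
    for w1, w2, b in product(grid, repeat=3):
        if all(single_unit(w1, w2, b, x1, x2) == (x1 ^ x2) for x1, x2 in inputs):
            return True
    return False

GRID = [i / 4 for i in range(-8, 9)]  # -2.0 .. 2.0 in steps of 0.25

if __name__ == "__main__":
    for x1, x2 in product((0, 1), repeat=2):
        print(x1, x2, xor_with_hidden(x1, x2))
    print("single unit solves XOR:", any_single_unit_solves_xor(GRID))
```

The grid search is not a proof, only a demonstration; the general result is that XOR is not linearly separable.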

NetTalk

Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel network that learns to read aloud, Cognitive Science, 14, 179-211

http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

500 learning loops Finished

"Applications of Neural Net and Other Machine Learning Algorithms to DNA Sequence Analysis", (1989).

"How Neural Nets Work", (Lapedes, Farber 1987).

Deep-Learning (learn from data many of the things we do)

Speech recognition in noisy environments (Siri, Cortana, Google, Baidu, …)

Better than human accuracy face recognition

Self-driving cars

• Internet Search

• Robotics

• Self guiding drones

• Much, much, more

Speech recognition is a Bellwether

A driving force for ubiquitous inferencing in the data center

Expect amazing growth: $10T incremental value with 1000x increase in data volume

• CEO Saudi Telecom statement during his KAUST Global IT Keynote

• We expect 5G to increase the volume of mobile data 1000x

• $10T incremental value

Khalid Bin Hussein Bayari, CEO, Saudi Telecom

See also: http://www.mwc.gr/presentations/2016/kolokotronis.pdf and https://www.itu.int/en/ITU-T/Workshops-and-Seminars/standardization/201603/Documents/Abstracts-Presentations/S2P3_Ali_Amer.pptx

Source: METIS









From NetTalk to Bioinformatics

Internal

connections

The phoneme to

be pronounced

NetTalk

Sejnowski, T. J. and Rosenberg, C. R. (1986) NETtalk: a parallel network that learns to read aloud, Cognitive Science, 14, 179-211

http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

Internal

connections

t t e X A T C G T

"Applications of Neural Net and Other Machine Learning Algorithms to DNA Sequence Analysis", A.S. Lapedes, C. Barnes, C. Burks, R.M. Farber, K. Sirotkin, Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, Eds. G. Bell and T. Marr, Addison-Wesley, (1989).

T|F Exon region



From Bioinformatics to drug design (The closer you look the greater the complexity)

Electron Microscope

We formed a company, then The Question

How do we know you are not playing expensive computer games with our money?

Train then utilize a blind test

[Figure: network with inputs A0, A1, A2, A3, A4, A5 and internal connections; the output is the binding affinity for a specific antibody]

Possible hexamers: 20^6 = 64M

1k – 2k pseudo-random (hexamer, binding affinity) pairs

Approx. 0.001% sampling

“Learning Affinity Landscapes: Prediction of Novel Peptides”, Alan Lapedes and Robert Farber, Los Alamos National Laboratory Technical Report LA-UR-94-4391 (1994).
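The sampling arithmetic on this slide is easy to reproduce. A minimal sketch, assuming 20 amino acids per position and six positions; the function name `sampling_percent` is just for illustration:

```python
AMINO_ACIDS = 20      # residues available at each position
HEXAMER_LENGTH = 6    # positions A0 .. A5

POSSIBLE_HEXAMERS = AMINO_ACIDS ** HEXAMER_LENGTH  # 64,000,000

def sampling_percent(n_samples):
    """Percentage of the hexamer space covered by n_samples measurements."""
    return 100.0 * n_samples / POSSIBLE_HEXAMERS

if __name__ == "__main__":
    print("possible hexamers:", POSSIBLE_HEXAMERS)
    for n in (1_000, 2_000):
        print(n, "samples cover", sampling_percent(n), "percent")
```

With 1k to 2k samples the coverage is a few thousandths of a percent, the same order of magnitude as the approximate figure quoted on the slide.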

Hill climbing to find high affinity

[Figure: network with inputs A0, A1, A2, A3, A4, A5 and internal connections; the output is $\mathrm{Affinity}_{\mathrm{Antibody}}$]

Learn: $\mathrm{Affinity}_{\mathrm{Antibody}} = f(A_0, \ldots, A_5)$

$f(F,F,F,F,F,F)$

$f(F,F,F,F,F,L)$ $f(F,F,F,F,F,V)$ $f(F,F,F,F,L,L)$

$f(P,C,T,N,S,L)$

Predict P,C,T,N,S,L has the highest binding affinity

Confirm experimentally
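Hill climbing over a learned affinity function can be sketched as below. Everything here is a stand-in: `toy_affinity` is an invented scoring function (the real f would be the trained network), and it is rigged so that PCTNSL scores highest purely to echo the slide, not because of any data.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
TOY_OPTIMUM = "PCTNSL"             # assumption: chosen to mirror the slide

def toy_affinity(hexamer):
    """Invented surrogate for the trained network f(A0, ..., A5)."""
    return float(sum(1 for a, b in zip(hexamer, TOY_OPTIMUM) if a == b))

def hill_climb(f, start):
    """Greedy search: accept any single-residue change that raises f."""
    current, best = start, f(start)
    improved = True
    while improved:
        improved = False
        for pos in range(len(current)):
            for residue in ALPHABET:
                candidate = current[:pos] + residue + current[pos + 1:]
                score = f(candidate)
                if score > best:
                    current, best, improved = candidate, score, True
    return current, best

if __name__ == "__main__":
    print(hill_climb(toy_affinity, "FFFFFF"))
```

On a real affinity landscape a greedy climb can stall in a local maximum, so the predicted peptide still needs the experimental confirmation the slide calls for.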

Two important points

• The computer appears to correctly predict experimental data

• Demonstrated that complex binding affinity relationships can be learned from a small set of samples

• Necessary because it is only possible to sample a very small subset of the binding affinity landscape for drug candidates

Time series

Iterate:

$X_{t+1} = f(X_t, X_{t-1}, X_{t-2}, \ldots)$

$X_{t+2} = f(X_{t+1}, X_t, X_{t-1}, \ldots)$

$X_{t+3} = f(X_{t+2}, X_{t+1}, X_t, \ldots)$

$X_{t+4} = f(X_{t+3}, X_{t+2}, X_{t+1}, \ldots)$

[Figure: network with inputs Xt, Xt-1, Xt-2, Xt-3, Xt-4, Xt-5 and internal connections; the output is Xt+1]

Learn: $X_{t+1} = f(X_t, X_{t-1}, X_{t-2}, \ldots)$

Works great! (better than other methods at that time)

"How Neural Nets Work", A.S. Lapedes, R.M. Farber, reprinted in Evolution, Learning, Cognition, and Advanced Architectures, World Scientific Publishing. Co., (1987).

https://papers.nips.cc/paper/59-how-neural-nets-work.pdf
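Iterating a one-step predictor is a short loop. In this sketch `f` is any callable that maps a window (most recent value first) to the next value; the two-term rule used for the demo is a placeholder for a trained network, not the method from the paper.

```python
def iterate_forecast(f, history, steps):
    """Feed each prediction back in as the newest input.

    history is ordered most recent first: [X_t, X_{t-1}, ...].
    """
    window = list(history)
    predictions = []
    for _ in range(steps):
        nxt = f(window)
        predictions.append(nxt)
        window = [nxt] + window[:-1]  # slide the window forward one step
    return predictions

def two_term_rule(window):
    """Placeholder for the learned f: next value is the sum of the last two."""
    return window[0] + window[1]

if __name__ == "__main__":
    print(iterate_forecast(two_term_rule, [1, 1], 4))
```

Because each step consumes earlier predictions, one-step errors compound as the forecast horizon grows.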

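Sliding inference during training to increase accuracy

[Figure: network with inputs Xt, Xt-1, Xt-2, Xt-3, Xt-4, Xt-5 and internal connections; predictions Pt+1, Pt+2, Pt+3 are compared against Xt+1, Xt+2, Xt+3]

$\mathrm{Error}(\mathrm{example}) = \sum_{i=1}^{3} (X_{t+i} - P_{t+i})^2$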
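A multi-step error of this kind can be accumulated while the predictor runs on its own output. The sketch below assumes three look-ahead steps, as in the figure labels, and uses a placeholder rule in place of the network being trained.

```python
def sliding_error(f, window, targets):
    """Sum of squared errors over several self-fed prediction steps.

    window is ordered most recent first; targets are X_{t+1}, X_{t+2}, ...
    """
    w = list(window)
    error = 0.0
    for x_true in targets:
        p = f(w)                      # P_{t+i}
        error += (x_true - p) ** 2
        w = [p] + w[:-1]              # feed the prediction back, not the truth
    return error

def two_term_rule(window):
    """Placeholder for the network: sum of the two most recent values."""
    return window[0] + window[1]

if __name__ == "__main__":
    print(sliding_error(two_term_rule, [1, 1], [2, 3, 5]))
    print(sliding_error(two_term_rule, [1, 1], [2, 3, 6]))
```

Minimizing this quantity penalizes a model whose one-step predictions look fine but drift once they are iterated.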

Designing ANNs for Integration and Bifurcation analysis – training a netlet

"Identification of Continuous-Time Dynamical Systems: Neural Network Based Algorithms and Parallel Implementation", R. M. Farber, A. S. Lapedes, R. Rico-Martinez and I. G. Kevrekidis, Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, Virginia, March 1993.

ANN schematic for continuous-time identification. (a) A four-layered ANN based on a fourth order Runge-Kutta integrator. (b) ANN embedded in a simple implicit integrator.

(a) Periodic attractor of the Van der Pol oscillator for g = 1.0, d = 4.0 and w = 1.0. The unstable steady state in the interior of the curve is marked +. (b) ANN-based predictions for the attractors of the Van der Pol oscillator shown in (a).

https://arxiv.org/pdf/comp-gas/9305001.pdf
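The idea of wrapping a learned right-hand side in a fourth order Runge-Kutta template can be sketched with a plain function in place of the netlet. This is a scalar illustration under that assumption, not the architecture from the paper; `decay` is a stand-in for the trained network.

```python
import math

def rk4_step(f, x, h):
    """One classical RK4 step for dx/dt = f(x); f is called four times."""
    k1 = f(x)
    k2 = f(x + 0.5 * h * k1)
    k3 = f(x + 0.5 * h * k2)
    k4 = f(x + h * k3)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(f, x0, h, n_steps):
    """Apply rk4_step repeatedly starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = rk4_step(f, x, h)
    return x

def decay(x):
    """Stand-in for the trained netlet: dx/dt = -x."""
    return -x

if __name__ == "__main__":
    print(integrate(decay, 1.0, 0.1, 10), math.exp(-1.0))
```

In the ANN version the same function `f` appears in all four stages, so the four stages share one set of weights.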





Dimension reduction

• The curse of dimensionality
  • People cannot visualize data beyond 3D + color

• Search volume rapidly increases with dimension
  • Queries return too much data or no data
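One way to see why queries return too much or nothing: a box query that spans half of each axis covers a vanishing fraction of the space as dimensions are added. A small sketch, assuming data spread uniformly over a unit hypercube (an assumption made for this illustration):

```python
def query_fraction(side, dims):
    """Fraction of a unit hypercube covered by a box with the given side."""
    return side ** dims

if __name__ == "__main__":
    for dims in (1, 2, 3, 10, 100):
        print(dims, query_fraction(0.5, dims))
```

At ten dimensions the half-width box already holds less than 0.1% of uniformly spread points, which is why reducing the data to a few coordinates first makes search practical.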

[Figure: network diagram with inputs Sensor 1, Sensor 2, Sensor 3, … Sensor N reduced to three coordinates X, Y, Z]

A general SIMD mapping: Optimize(LMS_Error = objFunc(p1, p2, … pn))

Host: Optimization Method (Powell, Conjugate Gradient, Other)

Step 1: Broadcast parameters p1, p2, … pn to each GPU

Step 2: Calculate partials
  • GPU 1: Examples 0, N-1
  • GPU 2: Examples N, 2N-1
  • GPU 3: Examples 2N, 3N-1
  • GPU 4: Examples 3N, 4N-1

Step 3: Sum partials to get energy
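The three steps can be sketched in plain Python, with list chunks standing in for devices. This is a serial illustration of the data layout only; the linear model and the names `partial_error`, `split` and `objective` are assumptions for this example.

```python
def partial_error(params, examples):
    """Step 2: squared error of a linear model over one device's examples."""
    a, b = params
    return sum((y - (a * x + b)) ** 2 for x, y in examples)

def split(examples, n_devices):
    """Give each device a contiguous block of examples."""
    block = (len(examples) + n_devices - 1) // n_devices
    return [examples[i:i + block] for i in range(0, len(examples), block)]

def objective(params, chunks):
    """Steps 1-3: same params to every chunk, then sum the partials."""
    partials = [partial_error(params, chunk) for chunk in chunks]
    return sum(partials)

if __name__ == "__main__":
    data = [(x, 2 * x + 1) for x in range(8)]
    chunks = split(data, 4)
    print(objective((2, 1), chunks), objective((1, 0), chunks))
```

Only the parameters and one partial sum per device cross the interconnect each evaluation; the examples stay resident on the device that owns them.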

[Figure: TACC Stampede PCA scaling. Average sustained TF/s (0 to 2500) versus number of Intel Xeon Phi coprocessors/Sandy Bridge nodes (0 to 3500)]

Many problems are too big for a single computer – Strong scaling execution model!

Perfect strong scaling decreases runtime linearly by the number of processing elements

• O(log N) scaling is "good enough"

See a path to exascale (MPI can map to thousands of GPU or Processor nodes)
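A toy runtime model shows why a logarithmic reduction term is tolerable. The constants below are hypothetical, chosen only to make the shape visible; they are not measurements from any system.

```python
import math

def runtime(n, work=1000.0, reduce_cost=0.01):
    """Modeled time: perfectly divided work plus a log2(n) reduction."""
    if n == 1:
        return work
    return work / n + reduce_cost * math.log2(n)

def speedup(n):
    """Speedup relative to one processing element."""
    return runtime(1) / runtime(n)

def efficiency(n):
    """Speedup divided by the number of processing elements."""
    return speedup(n) / n

if __name__ == "__main__":
    for n in (1, 16, 256, 1024):
        print(n, round(speedup(n), 1), round(efficiency(n), 3))
```

Under these assumed constants the model stays above 90% efficiency at 1024 elements, because the reduction term grows with log2(N) while the divided work shrinks with N.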

Always report "Honest Flops"

Expect significant algorithm and HW retooling

Today: only 7% of all servers are being used for machine learning and only 0.1% are running deep neural nets – Forbes

By 2020, 100% of servers running machine learning – Intel, NVIDIA, Wall Street

Roughly four different camps: CPU, GPU, FPGA, Chips

Cut through the marketing fluff

• Deep-learning originated as a technical term
  • Lots of layers between the Input and Output layers
  • Originally described attempts to mimic the many layers of the brain
  • Morphed into a marketing term to sell based on market preconceptions
  • Now, Ack! Morphing into AI – even worse for preconceptions!

• Training:
  • Is not "learning"
    • In the human sense (think of the learning to read aloud example)
  • Nor is training AI

Training is the numerical optimization of a set of model parameters to minimize a cost function
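That definition fits in a few lines. The sketch below minimizes a mean squared error cost for a two-parameter linear model with plain gradient descent; the model, the data and the learning rate are all assumptions chosen for illustration.

```python
def cost(params, data):
    """Mean squared error of y = a*x + b over the data."""
    a, b = params
    return sum((y - (a * x + b)) ** 2 for x, y in data) / len(data)

def gradient(params, data):
    """Analytic partial derivatives of cost with respect to a and b."""
    a, b = params
    n = len(data)
    da = sum(-2.0 * x * (y - (a * x + b)) for x, y in data) / n
    db = sum(-2.0 * (y - (a * x + b)) for x, y in data) / n
    return da, db

def train(data, learning_rate=0.05, steps=2000):
    """Gradient descent from (0, 0); returns the fitted (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        da, db = gradient((a, b), data)
        a -= learning_rate * da
        b -= learning_rate * db
    return a, b

DATA = [(x, 3.0 * x - 1.0) for x in (-2, -1, 0, 1, 2)]

if __name__ == "__main__":
    fitted = train(DATA)
    print(fitted, cost(fitted, DATA))
```

Nothing in the loop knows what the data means; it only drives the cost number down.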

• Parallelism speeds training
  • The SIMD computational model maps beautifully and efficiently to processors, vector processors, accelerators, FPGAs, and custom chips alike.

• Training is memory bound!
  • Performance is limited by the cache and memory subsystems rather than flops/s.

• The training set must be large enough to use all the device parallelism
  • Else performance is wasted

• Make sure you will have enough data for massive GPU parallelism

• CPUs likely to be better for data sets with hundreds to tens of thousands of examples

Training is math … not learning (ANNs have no concept of reality or a goal)

• A project in the 1990s attempted to train an ANN to distinguish between images of a tank vs. a car.

• A low error was found after training but in the field the real-world accuracy was abysmal.

• Further investigation found that most of the tank pictures were taken on a sunny day while the pictures of the cars were taken on cloudy days.

• The network solved the optimization problem by distinguishing cloudy vs. sunny days and not cars vs. tanks

• Bad news for people driving on a sunny day!

Inferencing is a sequential operation (unless done in volume like in a datacenter)

• Single inference operation parallelism: • Limited by the data dependencies defined by the ANN architecture

• Generally low (else an additional speedup during training)

• Inferencing is memory bound! • Performance is limited by the cache and memory subsystems rather than flops/s.

• March example paper: Deep Voice: Real-time Neural Text-to-Speech

• Succinctly: • Most people will not need inferencing optimized devices

• Unless they plan to perform volume processing of data in a data center.

• Inferencing of individual data items will be dominated by the sequential performance of the device.

• Expect GPUs to have poor low-volume/individual inferencing performance relative to CPUs, FPGAs, ASICs.

https://arxiv.org/pdf/1702.07825.pdf









Use Gradients for orders-of-magnitude faster time-to-solution

• Provides an algorithmic speedup (L-BFGS, Conjugate Gradient, …)
  • Potentially orders-of-magnitude faster than gradient free methods

• Gradient approximations can add nParameters extra evaluations per dfunc.

• Popular software packages such as Theano make it easy • Can symbolically calculate the gradient through the use of automatic differentiation

• Generates native code for speed

• Bad news: Gradient gets very large (O(N^2) Jacobian) according to model size
  • Memory and cache bound, plus can be atomic operation limited as well
  • Code can be simply too big for GPUs

• Limited memory devices use chunked gradient calculations
  • Adds communications overhead (the PCIe bus is famous for limiting GPU performance)
  • NVlink performance is unknown.

• Succinctly: "real memory for real performance"
  • Especially stacked memory!

http://deeplearning.net/software/theano/
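The cost of approximating a gradient can be counted directly. This sketch uses a forward difference, which needs one extra objective evaluation per parameter on top of the base evaluation; the call counter and the `sphere` test function are illustrative helpers, not part of any library.

```python
def counted(f):
    """Wrap f so the number of calls can be inspected afterwards."""
    calls = {"n": 0}
    def wrapped(params):
        calls["n"] += 1
        return f(params)
    return wrapped, calls

def forward_difference_gradient(f, params, h=1e-6):
    """Approximate the gradient with one shifted evaluation per parameter."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += h
        grad.append((f(shifted) - base) / h)
    return grad

def sphere(params):
    """Simple test objective: sum of squares, exact gradient is 2*p."""
    return sum(v * v for v in params)

if __name__ == "__main__":
    f, calls = counted(sphere)
    print(forward_difference_gradient(f, [1.0, 2.0, 3.0]), calls["n"])
```

For a model with millions of parameters those extra evaluations dominate, which is the motivation for analytic or automatically derived gradients.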

Most data scientists will use one or two layers

• Deep ANNs (DNNs) are useful, but many in the data analytics world will not use more than one or two hidden layers

• Due to the vanishing gradient problem. • Nice tutorial: http://neuralnetworksanddeeplearning.com/chap5.html.

• An illustration:

Tries to account for as much of the gradient as possible

Tries to account for as much of the remaining gradient as possible

Lower layers can learn slower

https://en.wikipedia.org/wiki/Vanishing_gradient_problem
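The vanishing gradient effect follows from the chain rule: with sigmoid units each layer multiplies the backpropagated signal by at most 0.25 times the weight. A simplified sketch, assuming a chain of single-neuron layers that all share one weight and one pre-activation value:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(depth, weight=1.0, z=0.0):
    """Factor applied to a gradient after passing back through depth layers."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_prime(z)
    return scale

if __name__ == "__main__":
    for depth in (1, 2, 5, 10):
        print(depth, gradient_scale(depth))
```

Even in this best case for the sigmoid (z = 0), ten layers shrink the gradient by roughly a factor of a million, so the layers nearest the input receive very small updates.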


Reduced precision and specialized hardware

• Half-precision (FP16) can double the number of training iterations per unit time
  • Expect 4x from 8-bit math

• Think of NVIDIA reduced-precision Tensor Cores as the ANN version of MMX for multimedia – good for a few DNNs but not general machine-learning

• Reduced precision is likely to (IMHO) cause slow convergence
  • Time-to-model can increase significantly, far beyond the hardware speedup
  • See "How Neural Nets Work", Lapedes and Farber
  • Get stuck in local minima
  • Bad solutions, i.e. never leave the babbling phase in the reading aloud example

• In general, avoid reduced precision for training as it will likely harm rather than help.

• Reduced precision is more useful for volume inferencing and specialized DNNs

https://papers.nips.cc/paper/59-how-neural-nets-work.pdf
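One mechanism behind slow convergence at reduced precision is that small updates round away. The sketch below emulates FP16 storage with the standard library `struct` module's half-precision format; the starting value, update size and step count are arbitrary illustrative choices.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(start, update, steps, half):
    """Add the same update repeatedly, optionally storing the sum in FP16."""
    total = start
    for _ in range(steps):
        total = total + update
        if half:
            total = to_fp16(total)
    return total

if __name__ == "__main__":
    print("fp64:", accumulate(1.0, 1e-4, 1000, half=False))
    print("fp16:", accumulate(1.0, 1e-4, 1000, half=True))
```

Near 1.0 the spacing between FP16 values is about 0.001, so an update of 0.0001 is lost on every step and the half-precision sum never moves.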

NVIDIA (GPUs)

• Restarted massive parallelism with CUDA and GPU computing

• Making big inroads into the data center

GPU Threads are grouped into threadblocks

• Threads can only communicate within a thread block • (yes, there are atomic ops)

• Fast hardware scheduling
  • Blocks run when dependencies are resolved

Data

• Blocks that are ready to run get assigned to processing elements

• Fast hardware scheduling

Scalability required to use all those cores (strong scaling execution model)

Active Queue

Executables can run unchanged on bigger GPUs

• Dealer Analogy

Scheduler SMX

Strong Scaling

Execution Model

NVIDIA Claims Big Perf. Increases since 2013

NVIDIA on speech recognition

NVIDIA Claims for P40 using INT8 math & TensorRT

Processor-based computing

AMD

Intel

Intel Xeon and Xeon Phi

• Intel Xeon Phi is many-core with two AVX-512 vector units per core
  • High bandwidth, good cache, lots of parallelism
  • Cores 1/3 performance of Intel Xeon.
    • Not a problem for training.
  • On-package photonics Intel Omni-Path
    • On-package puts Mellanox at risk of being outdated like the USB-WiFi dongle companies

• Intel Xeon Skylake
  • Extra memory channel, AVX-512 vector units, more cores, up to 8 sockets
  • Good cache and fastest sequential performance
  • FPGA/ASIC integration (ASICs may leapfrog CPU and GPU performance!)
  • On-package photonics Intel Omni-Path
  • Big memory plus (unconfirmed at this time) likely to get Intel 3D XPoint DIMMs.

• Many procurements are delaying to get actual Skylake performance numbers

[Illustration: traditional vector ISA cores with SSE-wide units versus cores with 512-wide vector units]

Floating-point performance comes from the dual per core vector units

• AVX-512 = 16 32-bit ops/clock
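The lane count and a rough peak figure follow from simple arithmetic. In this sketch the core count and clock are hypothetical placeholders, and counting two floating-point operations per lane per clock assumes fused multiply-add; none of these numbers describe a specific product.

```python
VECTOR_BITS = 512    # AVX-512 register width
ELEMENT_BITS = 32    # single-precision element

def lanes():
    """Number of 32-bit elements processed per vector instruction."""
    return VECTOR_BITS // ELEMENT_BITS

def peak_gflops(cores, ghz, units_per_core=2, flops_per_lane=2):
    """Rough peak: cores * clock * vector units * lanes * flops per lane."""
    return cores * ghz * units_per_core * lanes() * flops_per_lane

if __name__ == "__main__":
    print("lanes per vector unit:", lanes())
    print("hypothetical 64 cores at 1.0 GHz:", peak_gflops(64, 1.0), "GFLOP/s")
```

A peak like this is only reached when the code keeps both vector units fed, which the memory-bound nature of training usually prevents.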

[Figure: performance versus parallelism, from no parallelism to massive parallelism, with regions labeled scalar and single-threaded, vector and single-threaded, and highest performance. Image courtesy Elsevier]

Convergence (for training and HPC in general)

• NVIDIA Pascal has a working MMU (Memory Management Unit)
  • Data can be automatically moved between CPU and GPU on a demand basis.
  • Offload programming is no longer a requirement, it's an optimization!
  • This is a really big deal as code changes are a barrier to GPU adoption

• Pascal GPUs have fast stacked memory (much faster, more capacity, energy efficient)
  • and NVlink (fast host/GPU memory bandwidth transfer – but only with IBM!)

• Intel Xeon Phi (formerly known as Knights Landing):
  • Data can be automatically moved between near (stacked) and far (DDR4) memory on a demand basis using cache mode.
  • Offload programming is no longer a requirement, it's an optimization!
  • Stacked memory is much faster, more capacity, energy efficient

• IBM
  • Data can be automatically moved between Power and GPU on a demand basis using NVlink.
  • Offload programming is no longer a requirement, it's an optimization!

IBM approach

• Sumit Gupta (VP, High Performance Computing and Data Analytics, IBM): "fundamentally, accelerators are the path forward."

• These accelerators are GPUs for compute, storage accelerators for big data and FPGAs for special functions.

• Watson (of Jeopardy fame) for software

• TrueNorth • Developed as part of the DARPA SyNAPSE program

• 46 Billion synaptic op/s using 70 mW!

Sumit Gupta

http://www.artificialbrains.com/darpa-synapse-program




The OpenPower special sauce

1. CAPI (IBM Coherent Accelerator Processor Interface)

2. NVlink used in the CORAL "Summit" and "Sierra" supercomputers

• Make application acceleration:
  • Much easier
  • Transparent for the application programmer.

• Open for all to join

CAPI is important

• Supports compute, storage, and special (e.g. FPGA) accelerators

• Shares virtual addressing – everything works with same memory addresses

• Provides hardware managed cache coherence

• Claims to eliminate 97% of code path length!

Data handling can take as much time as the

computational problem!

• ORNL Titan

– 112,128 GB of GPU memory in 18,688 K20x GPUs

• Data handling must be

– Language agnostic

– Scalable

STAC-M3 Benchmark on 68 TB CAPI system

• Fastest mean response times and most consistent response times (lowest standard deviation) ever reported, for all combinations of query type, data volume, and concurrent users.

• Each mean response time was 5.5x to 212x the previous best result, including:

• 21x to 212x the performance of the previous best published result for the market snap benchmarks (10T.YR[n]-MKTSNAP.TIME) *

• 21x the performance of the previous best published result for year high bid in the smallest year of the dataset (1T.OLDYRHIBID.TIME) *

• 8-10x the performance of the previous best published result for the 100-user volume-weighted average bid benchmarks (100T.YR[n]VWAB-12D-HO.TIME) **

• 5-8x the performance of the previous best published result for the N-year high-bid benchmarks (1T.[n]YRHIBID.TIME)

Fast scalable data loads for training via parallel file systems

[Figure: Node 1, Node 2, Node 3, … Node 500 reading from the file system in parallel]

Each MPI client on each node:

1. Opens file

2. Seeks to location

3. Reads data

4. Close
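The four steps can be sketched without MPI by looping over ranks in one process. In a real job each rank would run `read_chunk` on its own node against a parallel file system; the even byte-range split and the temporary file here are assumptions for the demo.

```python
import os
import tempfile

def read_chunk(path, rank, n_ranks):
    """Read this rank's contiguous byte range of the file."""
    size = os.path.getsize(path)
    chunk = (size + n_ranks - 1) // n_ranks
    with open(path, "rb") as fh:      # 1. open the file
        fh.seek(rank * chunk)         # 2. seek to this rank's offset
        data = fh.read(chunk)         # 3. read the data
    return data                       # 4. file is closed on leaving 'with'

def demo(n_ranks=4):
    """Write a temporary file, then read it back one rank at a time."""
    payload = bytes(range(256)) * 10
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        parts = [read_chunk(path, rank, n_ranks) for rank in range(n_ranks)]
    finally:
        os.remove(path)
    return payload, parts

if __name__ == "__main__":
    payload, parts = demo()
    print([len(p) for p in parts], b"".join(parts) == payload)
```

No rank talks to any other rank during the load, which is what lets the read bandwidth scale with the number of nodes.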

Other training and inference solutions

FPGAs

• Fast, power efficient and perfect for variable precision arithmetic

• Accessible via CAPI

• Very difficult to program

Custom chips

• Fast, power efficient and perfect for variable precision arithmetic

Heatsink City

• A is quad Google TPU2 motherboard side view

• B is dual IBM Power9 "Zaius" motherboard

• C is dual IBM Power8 "Minsky" motherboard

• D is dual Intel Xeon Facebook "Yosemite" motherboard

• E is Nvidia P100 SMX2 module with heat sink and Facebook "Big Basin" motherboard

(Image courtesy The Next Platform)

Nervana and many other offerings

Google TPU2:

• 15x-30x faster than CPU and GPU, "on our production AI workloads that utilize neural network inference"

• 30x to 80x improvement in TOPS/Watt measure

• Exclusive to Google Cloud

IBM TrueNorth (A path to the future???)

• PNAS (8/9/16) Convolutional networks for fast, energy-efficient neuromorphic computing, Steven K. Esser et al.

• Chip implements networks using integrate-and-fire spiking neurons

• IBM researchers ran the datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per watt)

• Can go really big

A system roughly the neuronal size of a rat brain

http://www.pnas.org/content/early/2016/09/19/1604850113.full
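An integrate-and-fire neuron accumulates input until a threshold is crossed, emits a spike, and resets. The sketch below is a generic textbook-style version with made-up parameters; it does not reproduce TrueNorth's actual neuron model.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Return a 0/1 spike train for a sequence of input currents."""
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current          # integrate the input
        if potential >= threshold:    # fire when the threshold is reached
            spikes.append(1)
            potential = 0.0           # reset after the spike
        else:
            spikes.append(0)
    return spikes

if __name__ == "__main__":
    print(integrate_and_fire([0.5] * 6))
```

Because the output is a sparse train of events rather than a dense activation value, hardware built around such neurons only spends energy when something fires.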






TrueNorth: Accuracy

• PNAS paper: "[We] demonstrate that neuromorphic computing … can implement deep convolution networks that approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, perform inference while preserving the hardware's underlying energy-efficiency and high throughput."

Accuracy of different sized networks running on one or more TrueNorth chips to perform inference on eight datasets. For comparison, accuracy of state-of-the-art unconstrained approaches are shown as bold horizontal lines (hardware resources used for these networks are not indicated).

Common software

• Theano: A Python library that generates C-code for a CPU or GPU

• TensorFlow: Google's open source library for machine learning

• cuDNN: NVIDIA's machine learning library

• Intel DAAL (Data Analytics Acceleration Library)

• Torch: An open source middleware library

• Caffe: Berkeley's popular framework

https://github.com/Theano/Theano

https://www.tensorflow.org/

http://torch.ch/

http://caffe.berkeleyvision.org/

[Figure: a load-balancing splitter distributing work across Machine 1, Machine 2 and Machine 3, each running App A, App B, App C and App D on the CPU]

Fast and scalable heterogeneous workflows

Full source code in my DDJ tutorial http://www.drdobbs.com/parallel/232601605

Volta

FPGA

Custom Asic


So much more, you have been great, Thank You!

Rob Farber

CEO TechEnablement.com

Contact info at techenablement.com for consulting, teaching, writing, and other inquiries
