Joy of Designing Deep Neural Networks
November 28, 2019
https://www.electricbrain.io/ #bigdata2019 @electricbrainio
Imagine

Creating a complex function, but not having to actually program it:

def computeSomething(data):
    ...
    newData = something  # I don't give a shit
    return newData
History

● Avid video game player
● Programming from a young age
At the beginning
Globulation 2
A different direction
● Took a business degree
● Freelanced on the side
● Did a startup (failed)
● Broke; back to programming for cash
Sensibill
● Receipt processing technology
Sensibill

{
  "items": [{
    "name": "T1 Cafvn Lt Frapp",
    "regularPriceTotal": "4.25"
  }],
  "receiptDate": "04/06/2016",
  "receiptNumber": "656335",
  "receiptTime": "07:35 am",
  "store": [{
    "addressLines": [
      "438 Richmond Street West",
      "Toronto, ON M5V 3S6"
    ],
    "name": "Starbucks Coffee Canada #",
    "storeID": "4495"
  }],
  "taxes": [{
    "amount": "0.55",
    "currencyCode": "",
    "percent": "13",
    "ruleID": "HST"
  }],
  "tenders": [{
    "amount": "4.80",
    "currencyCode": "",
    "currentBalance": "14.80",
    "maskedCardNumber": "**** 3616",
    "tenderType": "Sbux Card"
  }],
  "total": {
    "currencyCode": "",
    "grand": "4.80",
    "subtotal": "4.25"
  }
}
Stumbling Around
● Hand-baked heuristic algorithm
● Later turned out to be a variant of the k-nearest-neighbors algorithm
Valuable Experience

● Building and maintaining AI datasets
● Designing annotators
● Building a data-operations team
● Cleaning and transforming data
● Testing different algorithms (like decision trees)
A hint
Deep Neural Network Upgrade

● Upgraded to a recurrent neural network
● Massive improvement in accuracy
It begins

● Became obsessed with neural networks and how they are designed
● Passion for deep learning ignites!
Diving deep

● Learning everything I can about deep neural networks
My dirty secret

● As a programmer, I'm not a big fan of mathematical equations
Not a problem

● Diving deep into the equations is very rarely needed
● The graphs, charts, and anecdotes say everything
Example
● Choosing an activation function for a neural network
● What’s the difference between sigmoid and tanh? They’re both just non-linear equations
Sigmoid
● Outputs between 0 and 1
● For inputs, the action is in the center, close to 0
Tanh
● Outputs between -1 and 1
● For inputs, the action is in the center, close to 0
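The practical difference shows up immediately if you just evaluate the two functions. A quick plain-Python sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# evaluate both activations across a range of inputs:
# sigmoid squashes into (0, 1); tanh squashes into (-1, 1)
for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={math.tanh(x):.3f}")
```

Both curves are steepest near 0, which is why the "action" is in the center either way.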
When does it matter?
● Do you want negative numbers or not?
● If you don't care, it doesn't really matter. Use whichever gets the highest accuracy
Example: What do the layers do?
● Neural networks are made of layers
● There are many types of layers
● How do we understand them?
What is an LSTM?
i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + W[c->i]c[t−1] + b[1->i])    (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + W[c->f]c[t−1] + b[1->f])    (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                 (3)
c[t] = f[t]c[t−1] + i[t]z[t]                                       (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + W[c->o]c[t] + b[1->o])      (5)
h[t] = o[t]tanh(c[t])                                              (6)
(source: https://github.com/Element-Research/rnn)
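For programmers, those six equations transliterate almost line for line into code. Below is a toy scalar sketch of one LSTM step (my own illustration, not any library's implementation): every weight is a single fixed number where a real layer would have a learned matrix.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of the (peephole) LSTM equations above, with scalar state."""
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])  # (1) input gate
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])  # (2) forget gate
    z = math.tanh(w["xc"] * x + w["hc"] * h_prev + w["bc"])                   # (3) candidate
    c = f * c_prev + i * z                                                    # (4) new cell state
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c + w["bo"])       # (5) output gate
    h = o * math.tanh(c)                                                      # (6) new hidden state
    return h, c

# run a tiny sequence with every weight fixed at 0.5 (purely illustrative)
w = {k: 0.5 for k in ["xi", "hi", "ci", "bi", "xf", "hf", "cf", "bf",
                      "xc", "hc", "bc", "xo", "ho", "co", "bo"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, w)
```

The gates (sigmoids, 0 to 1) decide how much information to let through; the cell state c is the memory that carries information across steps.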
What is an LSTM?
Image Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM Variant: GRU
Image Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
What is a convolutional network?
output[i][j][k] = bias[k]
    + sum_l sum_{s=1..kW} sum_{t=1..kH} weight[s][t][l][k] * input[dW*(i−1)+s][dH*(j−1)+t][l]
(source: https://github.com/torch/nn/blob/master/doc/convolution.md)
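Stripped of channels, strides, and bias, that formula is just a pair of nested loops sliding a small weight window over the input. A minimal single-channel sketch (illustrative only, with dW = dH = 1 and no bias term):

```python
def conv2d_valid(image, kernel):
    """Naive single-channel 2D convolution: stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # the same small kernel is matched against every position
            out[i][j] = sum(kernel[s][t] * image[i + s][j + t]
                            for s in range(kh) for t in range(kw))
    return out

# a kernel that responds to vertical left-edges, applied to a tiny image
image = [[1, 0, 0],
         [1, 0, 0],
         [1, 0, 0]]
kernel = [[1, -1],
          [1, -1]]
result = conv2d_valid(image, kernel)  # strong response only along the edge
```

The kernel is a fixed pattern matched at every position, which is the intuition behind the pattern-matcher description of convolutions later in the talk.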
What is a convolutional network?
Image Source: http://deeplearning.net/tutorial/lenet.html
Understanding the layers
● Layers can be understood as math equations, but why bother?
● The high-level intuitions are much more useful
● You don’t need to know how the CPU works to write Python code. Same with deep learning.
Examples
● Linear/Dense = Generally combines / processes data
● Convolution = Matches a fixed set of patterns against the data
● Recurrent = Processes data of arbitrary size, like sequences/arrays
Examples
● Attention Layer = Suppresses irrelevant information
● Dropout = Prevents overfitting, spreads the knowledge across the vector
● Batch Norm = Speeds up learning
● Pooling = Combines nearby data together
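The "suppresses irrelevant information" intuition for attention fits in a few lines. This toy version (names and vectors invented for illustration) scores each item against a query, softmaxes the scores, and takes the weighted sum, so items that don't match the query are nearly zeroed out:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, items):
    # score each item by its dot product with the query,
    # then mix the items using the softmaxed scores as weights
    scores = [sum(q * v for q, v in zip(query, item)) for item in items]
    weights = softmax(scores)
    width = len(items[0])
    return [sum(w * item[k] for w, item in zip(weights, items))
            for k in range(width)]

query = [1.0, 0.0]
items = [[5.0, 0.0],   # relevant: aligned with the query
         [0.0, 5.0]]   # irrelevant: orthogonal to the query
mixed = attend(query, items)  # the irrelevant item is almost fully suppressed
```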
Programming Analogies
● Projection - cast<float64[]>(data)
● Dense Layer - function (data) {...}
● Recurrent Layer - for () {...} loop
● Convolutional Layer - RegExp
● Attention Layer - Map & Reduce
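The "recurrent layer is a for loop" analogy can be made literal. A sketch in Python rather than the JS-ish pseudocode above; the update rule here is an arbitrary stand-in for the learned one:

```python
def step(state, x):
    """Stand-in for the learned update inside a recurrent layer."""
    return [0.5 * s + xi for s, xi in zip(state, x)]

def recurrent_layer(sequence, state_size=2):
    # structurally, a recurrent layer is just this:
    # carry a state forward while consuming one element at a time
    state = [0.0] * state_size
    for x in sequence:
        state = step(state, x)
    return state

final = recurrent_layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The loop is what lets it handle input of arbitrary length: the state has a fixed size no matter how long the sequence is.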
The architecture
● At the architecture level, the math is hideous! It can only be understood as a graph
● Many intuitions form based on the graphs
● What's possible, what’s useful
Understanding the architecture
Edges and Nodes
● What is an edge?
    ○ Technically, it's a vector, e.g. <1, 5.3, 3.1>
    ○ Intuitively, it's information
● What's a node?
    ○ Technically, a bundle of math equations
    ○ Intuitively, an information-processing unit
The architecture: recurrent translation
Image Source: http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
The architecture: recurrent translation
Image Source: http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
All information about that sentence!
The architecture: recurrent translation
Image Source: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
The intuition
● Any information can be represented, in the abstract, within a vector
● Vectors can be at various stages of processing, partway between the input and the finished output
The architecture: inception network
Image Source: https://arxiv.org/pdf/1409.4842.pdf
The intuition
● Sometimes it’s useful to process the same information multiple ways
The architecture: residual networks
The intuition
● Too many information-processing units in a row don't work, due to vanishing gradients
● Adding paths around processing modules allows networks to go deeper
● Works similarly to gossip
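A toy demonstration of that intuition (numbers invented for illustration, not a real network): stack ten "processing units" that each shrink their input, with and without an identity path around each one.

```python
def processing_unit(x):
    """Stand-in for a learned layer; here it just shrinks the signal."""
    return [0.1 * v for v in x]

def residual_block(x):
    # the identity path lets the input skip around the processing unit,
    # so the signal survives even when the unit contributes very little
    fx = processing_unit(x)
    return [a + b for a, b in zip(x, fx)]

def deep_stack(x, depth=10):
    for _ in range(depth):
        x = residual_block(x)
    return x

without_skips = [0.1 ** 10]     # ten shrinking units chained directly: signal vanishes
with_skips = deep_stack([1.0])  # the same units with skip paths: signal survives
```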
Going Rogue
● After enough reading, playing, and experimenting, you start to feel comfortable enough to create original designs
Big problem

● Infinite search space of possible architectures, each with a massive to infinite number of hyperparameters
● Intuition, experimentation, and trial and error dominate
Not like programming

● More or less, code either works or it doesn't
● A deep neural network always works at least a little bit. The question is, what makes it better or worse?
The Profound and Annoying Fact

● Julian Konomi: "The deep neural network always learns something"
● The only question is: did it learn something useful? Did it learn better than another design?
Example: Regulation Matching
● Client wants to detect if part of a project plan might violate a law or internal control
● Automated Compliance Review
Options ...
Options ...

Description:
Use a vanilla recurrent network stack

Process one paragraph at a time

Treat each regulation / control as an independent output

Output the probability that the whole paragraph violates each regulation

Hyperparameters:
● # of recurrent layers
● # of dense layers
● Type of recurrent layer
● Size of recurrent layer
● Type of activation function
● Size of dense layers
● Method for input word-vectors
● Use attention?
● Use residual connections?
● more….
Options ...
Options ...

Description:
Use a neural network with an external memory

Process the entire project plan rather than one paragraph at a time

Treat the regulations as a mutually exclusive, n-class output

Output, on a word-by-word basis, whether or not that word is describing a violation

Hyperparameters:
● # of recurrent layers in storage location stack
● # of recurrent layers in storage value stack
● # of recurrent layers in retrieval location stack
● Size of storage location recurrent layers
● Size of storage value recurrent layers
● Size of retrieval location recurrent layers
● # of dense layers at the end
● Type of activation function
● Size of dense layers
● Size of a single vector in NN memory
● Number of locations in NN memory
● Method for input word-vectors
● Use attention?
● Use residual connections?
● more….
Options ...
Options ...

Description:
Use two vanilla recurrent stacks, one for the project plan and one for the regulation

Process the project plan in sentences

Process the text of the regulation through the neural network as well, computing a 'regulation vector'

Use cosine distance between the 'regulation vector' and the 'project vector' to determine relevance

Hyperparameters:
● # of recurrent layers in project plan stack
● # of recurrent layers in regulation stack
● Size of project plan recurrent layers
● Size of regulation stack recurrent layers
● # of dense layers after project plan stack
● # of dense layers after regulation stack
● Type of activation function
● Size of dense layers
● Size of the matching vector
● Cutoff point to determine if project plan fails regulation
● Method for input word-vectors
● Use attention?
● Use residual connections?
● more….
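The matching step in this third option comes down to one standard formula. A sketch with invented stand-in vectors (the real ones would come out of the two recurrent stacks, and the cutoff is itself a hyperparameter):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# hypothetical outputs of the regulation stack and the project plan stack
regulation_vector = [0.9, 0.1, 0.3]
project_vector = [0.8, 0.2, 0.4]

relevance = cosine_similarity(regulation_vector, project_vector)
flagged = relevance > 0.8  # cutoff point: does this plan fail the regulation?
```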
Challenge and Joy
● Far too many techniques to test for any one client
● If only clients had unlimited money…. (**cough cough** Google)
● One must gain an intuition for what's likely to work - can't rely entirely on copying results from NIPS papers
Winging It
● Even the smartest PhDs don't really understand how or why deep neural networks work so well
● Deep learning is half science, half art
● Just dive in and get your hands dirty. Don't bother trying to "understand"
Conclusion
● Anyone with a technical background can learn to apply deep learning, and even to create novel architectures
● Mathematics not required
● Learning these intuitions is fun!
Have a great evening!