braincreators
@tommasogritti
Outline
● Deep Neural Network intuition
● Embeddings
● Transfer Learning
● Tips
Deep Neural Network omnipresence
https://trends.google.com/trends/explore?date=2008-03-09%202017-04-09&q=artificial%20intelligence,machine%20learning,deep%20learning
… or almost
Applications
http://www.yaronhadad.com/deep-learning-most-amazing-applications/
Human
● 10^11 neurons
● 10^4 synapses per neuron
● 10^16 “operations” / sec
● 250 M neurons per mm^3
● 180,000 km of “wires”
● 25 Watts

Deep Neural Networks sound cool
GPU
● 8×10^12 operations / sec
● 500 Watts
● 5760 (small) cores
● $2000
Toy example

Num website visits | Num pages visited | Average time on page | Converted?
1 | 13 | 55s | 1
2 | 1 | 141s | 1
1 | 8 | 10s | 0
3 | 5 | 127s | 0
2 | 3 | 18s | 0
Toy example
“Num website visits” does not seem to influence the output.
Toy example
“Num pages visited” above 9 seems to be a good threshold, but even a visitor with only 1 page visited can convert ⇒ no simple threshold.
Toy example
“Average time on page” above 128s seems to be a good threshold, but even a visitor at 55s can convert ⇒ no simple threshold.
Toy example

Num website visits = 1, Num pages visited = 13, Average time on page = 55s
Weights: w1 = ??, w2 = ??, w3 = ??
Multiply, then sum: 1*w1 + 13*w2 + 55*w3
> 0 → converted, < 0 → not converted
User converted? ??
Toy example

Num website visits = 1, Num pages visited = 13, Average time on page = 55s
Weights: w1 = -7.04, w2 = 0.28, w3 = 0.12
Multiply, then sum → 3.58
> 0 ? YES → user converted
Toy example

Num website visits = 3, Num pages visited = 5, Average time on page = 127s
Weights: w1 = -7.04, w2 = 0.28, w3 = 0.12
Multiply, then sum → -3.76
> 0 ? NO → user did not convert
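The two worked examples above fit in a few lines of code. A minimal sketch of the decision rule, using the weights from the slides; note the slide's sums (3.58 and -3.76) differ slightly from what these rounded weights give, so extra decimals or a bias term are presumably hidden, but the YES/NO decisions match:

```python
# Single-neuron decision rule from the toy example.
# Inputs: (num website visits, num pages visited, avg time on page in s).
# Weights are the ones shown on the slides.

def converts(visits, pages, avg_time, weights=(-7.04, 0.28, 0.12)):
    w1, w2, w3 = weights
    score = visits * w1 + pages * w2 + avg_time * w3
    return score > 0  # > 0 means "converted"

print(converts(1, 13, 55))   # → True  (first visitor converts)
print(converts(3, 5, 127))   # → False (fourth visitor does not)
```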
Toy example

Method #1: inputs (Num website visits = 3, Num pages visited = 5, Average time on page = 127s), weights (-7.04, 0.28, 0.12)
Method #2: inputs (3, 5, 127), weights (-2.4, 0.91, 0.013)
Method #3: inputs (3, 5, 127), weights (-3.9, 0.21, 0.03)
Method #4: inputs (3, 5, 127), weights (-1.1, 0.83, 0.18)
Toy example

Inputs: Num website visits = 1, Num pages visited = 13, Average time on page = 55s
Each input feeds Method #1, Method #2, Method #3 and Method #4; their outputs are combined into a final estimate.
Toy example

The inputs form the input layer, Methods #1–#4 form the hidden layer, and the final estimate is the output layer.
Deep = lots of hidden layers
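The three-layer picture can be sketched directly. The hidden-layer weights below are the four methods' weights from the slides; the output-layer weights and the ReLU-style thresholding are illustrative assumptions, since the slides do not show how the final estimate combines the four methods:

```python
# 3 inputs -> hidden layer of 4 units ("Method #1..#4") -> 1 output.

def dense(inputs, weights):
    # weighted sum: multiply each input by its weight, then sum
    return sum(i * w for i, w in zip(inputs, weights))

def relu(x):
    # simple thresholding nonlinearity
    return max(0.0, x)

HIDDEN = [
    (-7.04, 0.28, 0.12),   # Method #1 (weights from the slides)
    (-2.4,  0.91, 0.013),  # Method #2
    (-3.9,  0.21, 0.03),   # Method #3
    (-1.1,  0.83, 0.18),   # Method #4
]
OUTPUT = (0.25, 0.25, 0.25, 0.25)  # placeholder: average the four methods

def forward(inputs):
    hidden = [relu(dense(inputs, w)) for w in HIDDEN]
    return dense(hidden, OUTPUT)  # final estimate

print(forward((1, 13, 55)) > 0)  # → True: this visitor converts
```

"Deep" then just means stacking more such hidden layers between input and output.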
http://www.asimovinstitute.org/neural-network-zoo/
Lots of configurations
Open source toolkits
Neural Networks - Take Home Message
● Applicable to endless domains: object recognition, medical imaging, automotive, finance, robotics, natural language processing, translation systems, speech recognition
● At the simplest level, only a series of nodes doing sums & thresholding
● Lots of variety
● Lots of open source tools
Embeddings
Context: object recognition
Automatically classify product images into 1000s of categories
Dress Boot
Image Classifier (old school)
Image dataset → Image Features → Classifier
Features: ● Edges ● Contrast ● Local patterns ● Colors
Classifier: ● AdaBoost ● SVM ● Random Forests ● Neural Network
Image Features (old school)
Input Image → Feature extraction → Image features
V = [0.2, -0.3, 0.15, 0.75, 0.11, …, 0.93]
Classifier
f(V) > 0
Effort (old school)
Image dataset: 10% | Image Features: 45% | Classifier: 45%
...still in use today
Dataset
Data gathered from 100s of scraped webshops: 5 million products, uncategorised
● Keywords filtering
● Visual clustering
● Human inspection
⇒ ~500 labelled classes, ~1000 images / class
Image classifier (the new way)
Deep Convolutional Neural Network (DCNN)
~500 labelled classes, ~1000 images / class
Backpropagation + Gradient descent
Image classifier (the new way)
Forward pass: training image with label “pans” → predicted label = “shoe”
Backpropagation + Gradient descent = update weights “towards target”
Image classifier (the new way)
Forward pass
Backpropagation + Gradient descent
● Repeat for all training images
● Repeat until a stopping criterion is met
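The loop above (forward pass, compare prediction to label, update weights towards the target, repeat) can be sketched end to end. To keep it self-contained, the "network" here is a single sigmoid neuron trained on the toy conversion data rather than a DCNN on images; the hand-written gradient is the cross-entropy gradient for that one neuron, and the learning rate and epoch count are arbitrary choices:

```python
import math

# Sketch of the training loop: forward pass, error, weight update,
# repeated over all examples until a stopping criterion.

DATA = [  # (visits, pages, avg_time_s), converted?
    ((1, 13, 55), 1),
    ((2, 1, 141), 1),
    ((1, 8, 10), 0),
    ((3, 5, 127), 0),
    ((2, 3, 18), 0),
]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.01

for epoch in range(2000):              # repeat till stopping criterion
    for x, y in DATA:                  # repeat for all training examples
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # forward pass
        err = p - y                    # cross-entropy gradient wrt the score
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]       # update weights
        b -= lr * err                  # "towards target"

correct = sum(
    (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == bool(y)
    for x, y in DATA
)
print(f"{correct}/5 training examples classified correctly")
```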
Effort (the new way)
Data collection (~500 labelled classes, ~1000 images / class): 50%
Deep Convolutional Neural Network (DCNN): 50%
What is going on in the network?
Dress
http://vision03.csail.mit.edu/cnn_art/data/single_layer.png
predicted label = pans
Low-level image “concepts” → abstract image “concepts”
Embedding = self-learnt descriptors
Abstract-level concept / descriptor for “Dress”:
V = [0.2, -0.3, 0.15, 0.75, 0.11, …, 0.93]
Distance in embedding space
Images a, b, c are mapped to embeddings E(a), E(b), E(c).
d(E(a), E(b)) >> d(E(c), E(b))
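Embedding distance is just vector distance. A minimal sketch, where the 3-d vectors are made-up placeholders standing in for real network outputs:

```python
import math

# "Distance in embedding space": similar images get nearby vectors.

def d(u, v):
    # Euclidean distance between two embeddings
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

E_a = [0.2, -0.3, 0.15]   # e.g. a dress
E_b = [0.9, 0.8, -0.5]    # e.g. a boot
E_c = [0.85, 0.75, -0.4]  # e.g. another, visually similar boot

print(d(E_a, E_b) > d(E_c, E_b))  # → True: d(E(a), E(b)) >> d(E(c), E(b))
```

Sorting a product list by its distance to a query embedding is exactly what produces the "sorted on embedding" grids on the following slides.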
Bracelets (unsorted)
Bracelets (sorted on embedding)
Shoes (unsorted)
Shoes (sorted on embedding)
Iterative refinement
Newly discovered classes → re-train classifier → results:
95% of the 5M products classified with confidence > 96%
More than 250 new labeled categories
Context: identity recognition
Automatically recognize celebrities from red carpet events
Jennifer Aniston
LL Cool J
Embedding training
Triplet Loss
Train network to discriminate between triplets of images
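The triplet loss pushes the anchor-positive distance below the anchor-negative distance by at least a margin. A minimal sketch; the 2-d vectors and the margin value are illustrative placeholders:

```python
# Triplet loss on an (anchor, positive, negative) triplet of embeddings.

def sq_dist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # zero loss once the negative is at least `margin` further away
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

a = [0.1, 0.9]   # photo of celebrity A
p = [0.2, 0.8]   # another photo of celebrity A
n = [0.9, 0.1]   # photo of celebrity B

print(triplet_loss(a, p, n))  # → 0.0: this triplet is already well separated
```

Minimizing this loss over many triplets is what moves the randomly initialized embedding towards the trained one on the next slides.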
Triplets
Training: random embedding initialization → trained embedding
Celebrity identifier
NLP - Word embeddings
https://medium.com/@ageitgey/machine-learning-is-fun-part-5-language-translation-with-deep-learning-and-the-magic-of-sequences-2ace0acca0aa
With a different network setup we can learn an embedding for words:
Each word is represented by a vector. These vectors let us explore very interesting relationships learnt automatically from the data:
● King - man + woman → queen
● Paris - France + Italy → Rome
● Obama - USA + Russia → Putin
● President - power → prime minister
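The analogies above are literal vector arithmetic. A toy sketch: the tiny 3-d vectors below are hand-made placeholders (real word embeddings have hundreds of dimensions learnt from text), chosen so the arithmetic works out:

```python
import math

# Word-vector arithmetic: "king - man + woman ≈ queen".

VECTORS = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.2, 0.1],
    "man":    [0.1, 0.9, 0.2],
    "woman":  [0.1, 0.3, 0.2],
    "prince": [0.8, 0.85, 0.15],  # distractor
}

def nearest(query, exclude):
    # word whose vector is closest to the query point
    def dist(word):
        return math.sqrt(sum((q - v) ** 2 for q, v in zip(query, VECTORS[word])))
    return min((w for w in VECTORS if w not in exclude), key=dist)

query = [k - m + w for k, m, w in
         zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])]
print(nearest(query, exclude={"king", "man", "woman"}))  # → queen
```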
Embeddings - Take Home Message
● From feature engineering to data collection
● Neural Networks automatically learn relevant high-level abstractions
● Embedding spaces are very useful for exploring data
● Application areas: retrieval or ranking tasks (e.g. product recommendation, customer segmentation), classification
Transfer learning
ImageNet
1.5 million training examples
1000 categories
Training time ~ days on best GPUs
Transfer Learning
Randomly initialized weights + ImageNet → network trained to classify 1000 classes
Classifies correctly (>90%) images in those 1000 classes

New data, new classes: ?

Fine-tune the model (update weights) on the new data and classes:
● Faster training time
● Better performance
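The fine-tuning idea in miniature: start from pretrained weights instead of random ones, and update only the layers you choose. The layer names, weight values and gradients below are placeholders; in practice you would load a pretrained model from a model zoo and let a framework compute the gradients:

```python
# Fine-tune = keep pretrained weights, update (some of) them on new data.

weights = {                      # "pretrained" values, not random ones
    "conv1": [0.7, -0.3],
    "conv2": [0.2, 0.5],
    "head":  [0.0, 0.0],         # new layer for the new classes
}
TRAINABLE = {"head"}             # freeze everything except the new head

def sgd_step(grads, lr=0.1):
    for layer, g in grads.items():
        if layer in TRAINABLE:   # frozen layers keep their pretrained values
            weights[layer] = [w - lr * gi for w, gi in zip(weights[layer], g)]

sgd_step({"conv1": [1.0, 1.0], "conv2": [1.0, 1.0], "head": [1.0, 1.0]})
print(weights["conv1"], weights["head"])  # → [0.7, -0.3] [-0.1, -0.1]
```

Growing `TRAINABLE` to include more layers corresponds to fine-tuning more of the network.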
Sharing pre-trained models
● Model-Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
● Common format to share pre-trained models
● Active discussion and contributions
Transfer Learning - recommendations

Size of database × similarity of the data:
● small; similar → use existing embedding
● large; similar → fine-tune complete network
● small; different → use activations from earlier in the network
● large; different → fine-tune complete network (or start from scratch)
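The recommendation grid reads naturally as a lookup table; a direct transcription (the key strings are my own labels for the two axes):

```python
# The 2x2 transfer-learning recommendation grid as a lookup.

RECOMMENDATION = {
    ("small", "similar"):   "use existing embedding",
    ("large", "similar"):   "fine-tune complete network",
    ("small", "different"): "use activations from earlier in the network",
    ("large", "different"): "fine-tune complete network (or start from scratch)",
}

def recommend(dataset_size, similarity):
    return RECOMMENDATION[(dataset_size, similarity)]

print(recommend("small", "similar"))  # → use existing embedding
```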
Transfer Learning - Take Home Message
● Faster progress
● Training works even with much smaller amounts of data
● Check for the closest available pre-trained model before starting from scratch
Should we all go deep?
Some questions you should ask
● What is the performance of the baseline?
  ○ What can be achieved with a simpler system?
  ○ Can we start testing the value proposition with a simpler system?
● How much training data is required?
● Do we have the data, can we acquire it, or how long will it take to collect?
● Do we need labeled data, or can we use unlabeled data?
● How well does it work on data it has never seen? (Generalization / Overfitting)
● What are the failure cases?
● How reliable is the confidence of the prediction?
● Can we explain why a prediction was made?