Deep Learning Introduction
April 10, 2019
http://cross-entropy.net/ML410/Deep_Learning_0.pdf
Agenda for Tonight
• About the Class
• About the Textbooks
• Chapter 1: Deep Learning with Python
• Chapter 1: Introduction to Deep Learning
• Homework
Course Topics
• Forward Propagation of Features and Backward Propagation of Loss for Machine Learning
• Tensorflow Basics
• Keras Basics
• Multi-Layer Perceptron models
• Convolution: Filters, Strides, and Padding
• Available Keras Models for Image Classification, including the Residual Network (ResNet) model and the Dense Network (DenseNet) model
• Word Embeddings
• Recurrent Neural Networks, including Long Short-Term Memory cells and Gated Recurrent Unit cells
• Sequence-to-Sequence Models
• Transformers and Bidirectional Encoder Representations from Transformers
• Auto-Encoders
• Generative Adversarial Networks
• Deep Reinforcement Learning
• Best Practices
Technical Requirements
• We'll be using the Tensorflow and Keras libraries to construct and evaluate Deep Learning models. You'll need access to a browser, such as Chrome, to access the Vocareum lab environment as well as the Canvas learning management system. We'll also be using the following libraries for working with data:
• Python Imaging Library (PIL): Pillow fork
• spaCy: natural language processing library for syntactic parsing, implemented in Cython [also NLTK]
• libROSA: library for the Recognition and Organization of Speech and Audio
Student Assessment
Weekly assignments will include 14 questions from our textbook as well as 9 Kaggle tasks. Homework questions from the textbook will be worth 1 point each. Kaggle tasks will be worth 4 points each. For the Kaggle tasks, you must beat a "baseline" method to receive credit. You need to attend at least 8 class sessions and receive a total of at least 25 points to pass the course.
Textbook #1: Deep Learning with Python [DLP]
1. What is deep learning?
2. Before we begin: the mathematical building blocks of neural networks
3. Getting started with neural networks
4. Fundamentals of machine learning
5. Deep learning for computer vision
6. Deep learning for text and sequences
7. Advanced deep-learning best practices
8. Generative deep learning
9. Conclusions
Textbook #2: Introduction to Deep Learning [IDL]
1. Feed-Forward Neural Nets
2. Tensorflow
3. Convolutional Neural Networks
4. Word Embeddings and Recurrent Neural Networks
5. Sequence-to-Sequence Learning
6. Deep Reinforcement Learning
7. Unsupervised Neural-Network Models
Textbook #1
The cover of our textbook is captioned “Habit of a Persian Lady in 1568”, from Thomas Jefferys’ book, “A Collection of the Dresses of Different Nations”
[DLP] Chapter 1: What is Deep Learning?
1. Artificial Intelligence, Machine Learning, and Deep Learning
a. Artificial Intelligence
b. Machine Learning
c. Learning Representations from Data
d. The “Deep” in Deep Learning
e. Understanding How Deep Learning Works, in Three Figures
f. What Deep Learning Has Achieved So Far
g. Don’t Believe the Short-Term Hype
h. The Promise of AI
2. Before Deep Learning: a Brief History of Machine Learning
a. Probabilistic Modeling
b. Early Neural Networks
c. Kernel Methods
d. Decision Trees, Random Forests, and Gradient Boosting Machines
e. Back to Neural Networks
f. What Makes Deep Learning Different
g. The Modern Machine Learning Landscape
[DLP] Chapter 1: What is Deep Learning?
3. Why Deep Learning? Why Now?
a. Hardware
b. Data
c. Algorithms
d. A New Wave of Investment
e. The Democratization of Deep Learning
f. Will it Last?
• This chapter covers:
• High-level definitions of fundamental concepts
• Timeline of the development of machine learning
• Key factors behind deep learning’s rising popularity and future potential
Artificial Intelligence
• Concise definition: the effort to automate intellectual tasks normally performed by humans
• Initial take: expert rules
• Fine for chess
• Difficult to develop rules for image classification, speech recognition, or language translation
Relationships Between AI, ML, and DL
• Expert Rules
• Linear Regression
• Logistic Regression
• Random Forests
• Gradient Boosting
• Multi-Layer Perceptron Network
• Convolutional Neural Networks
• Recurrent Neural Networks
Expert Rules Example
% Data: fruit(X) :- attributes(Y)
fruit(banana) :- colour(yellow), shape(crescent).
fruit(apple) :- (colour(green); colour(red)), shape(sphere), stem(yes).
fruit(lemon) :- colour(yellow), (shape(sphere);shape('tapered sphere')), acidic(yes).
fruit(lime) :- colour(green), shape(sphere), acidic(yes).
fruit(pear) :- colour(green), shape('tapered sphere').
fruit(plum) :- colour(purple), shape(sphere), stone(yes).
fruit(grape) :- (colour(purple);colour(green)), shape(sphere).
fruit(orange) :- colour(orange), shape(sphere).
fruit(satsuma) :- colour(orange), shape('flat sphere').
fruit(peach) :- colour(peach).
fruit(rhubarb) :- (colour(red); colour(green)), shape(stick).
fruit(cherry) :- colour(red), shape(sphere), stem(yes), stone(yes).
What is the value for colour?
[red, orange, yellow, green, purple, peach]
green
What is the value for shape?
[sphere, crescent, tapered sphere, flat sphere, stick]
stick
The fruit is rhubarb
http://www.paulbrownmagic.com/blog/simple_prolog_expert
Machine Learning
• Ada Lovelace, 1843: “The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.”
• Alan Turing, 1950: quoting Ada Lovelace, while pondering whether general-purpose computers could be capable of learning
trained versus programmed
Learning Representations from Data
• Need three things (for supervised learning):
• Input data points: structured data, image files, sound files, text documents
• Examples of expected output
• Way to measure whether the algorithm is doing a good job
• Input representation examples:
• Image as Red, Green, and Blue picture element (pixel) values
• Image as Hue, Saturation, and Value
https://en.wikipedia.org/wiki/HSL_and_HSV
Example Data
• The inputs are the coordinates of our points
• The expected outputs are the colors of the points
• A way to measure whether our algorithm is doing a good job could be the percentage of points that are being classified correctly (accuracy)
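For instance, a minimal NumPy sketch of that accuracy measure (the predicted and actual labels below are made up):

import numpy as np
predicted = np.array(["white", "black", "black", "white"])
actual    = np.array(["white", "black", "white", "white"])
accuracy = np.mean(predicted == actual)   # 0.75: three of the four points are classified correctly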
New Representation
import numpy as np
# Input holds one (x, y) coordinate pair per row; the values below are made-up examples
Input = np.array([[1.0, 3.0], [3.0, 1.0], [2.0, 4.0]])
translation = np.array([-2, -2])                      # shift the points
theta = 0.25 * np.pi                                  # rotate by 45 degrees
Rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
np.dot(Input + translation, Rotation)                 # the new representation of the points
Deep Neural Network for Digit Classification
• Successive layers of increasingly meaningful representations
• Alternative names for deep learning:
• Layered representations learning
• Hierarchical representations learning
Deep Representations Learned by a Digit-Classification Model
How Deep Learning Works: Part 1 of 3
Neural Network Parameterized by its Weights
How Deep Learning Works: Part 2 of 3
Loss Function Measures Quality of Network’s Output
How Deep Learning Works: Part 3 of 3
Loss Score Used as Feedback Signal to Adjust Weights
Deep Learning Achievements
• Near-human-level image classification
• Near-human-level speech recognition
• Near-human-level handwriting transcription
• Improved machine translation
• Improved text-to-speech conversion
• Digital assistants such as Google Now and Amazon Alexa
• Near-human-level autonomous driving
• Improved ad targeting, as used by Google, Baidu, and Bing
• Improved search results on the web
• Ability to answer natural-language questions
• Superhuman Go playing
Deep Learning Hype
• Although some world-changing applications like autonomous cars are already within reach, many more are likely to remain elusive for a long time, such as believable dialogue systems, human-level machine translation across arbitrary languages, and human-level natural language understanding
• Previous AI “Winters”:
1. XOR (eXclusive OR): my perception is that the inability of the perceptron to solve this problem cast a shadow on AI (though this was understood at the time)
2. By the early 90s, rule-based systems had proven expensive to maintain, difficult to scale, and limited in scope
• We are currently in the intense optimism phase of a new cycle
Promise of AI
• Most of the research findings of deep learning aren’t yet applied to the full range of problems they can solve across industries
• “Your doctor doesn’t use AI, and neither does your accountant” [I thought I uploaded an image of a document during tax time]
• “Back in 1995, it would have been difficult to believe in the future impact of the internet”
• “In a not-so-distant future, AI will be your assistant; it will answer your questions, help educate your kids, and watch over your health. It will deliver your groceries to your door and drive you from point A to point B.”
• Don’t believe the short-term hype, but do believe in the long-term vision
History: Probabilistic Modeling
• Naïve Bayes
• Logistic Regression
Early Neural Networks
• 1950s: Perceptron
• 1980s: Backpropagation
• Late 1980s: Yann LeCun’s work on handwritten digit recognition (the work that later gave us the MNIST benchmark)
History: Kernel Methods
• Example kernel method for classification
History: Decision Tree Ensembles
• Parameters that are learned are questions about the data, e.g. “Is feature2 in the data greater than 3.5?”
History: Back to Neural Networks
• AlexNet was not the first fast GPU-implementation of a CNN to win an image recognition contest. A CNN on GPU by K. Chellapilla et al. (2006) was 4 times faster than an equivalent implementation on CPU.[6] A deep CNN of Dan Ciresan et al. (2011) at IDSIA was already 60 times faster[7] and achieved superhuman performance in August 2011.[8] Between May 15, 2011 and September 10, 2012, their CNN won no less than four image competitions.[9][10] They also significantly improved on the best performance in the literature for multiple image databases.[11]
• According to the AlexNet paper,[5] Ciresan's earlier net is "somewhat similar." Both were originally written with CUDA to run with GPU support. In fact, both are actually just variants of the CNN designs introduced by Yann LeCun et al. (1989)
https://en.wikipedia.org/wiki/AlexNet
Reasons for Deep Learning Success
• Incremental layer-by-layer way in which increasingly complex representations are developed
• Fact that these intermediate incremental representations are learned jointly
Why Deep Learning? Why Now?
• Three technical forces driving advances in machine learning• Hardware
• Datasets and benchmarks
• Algorithmic advances
• Following a scientific revolution, progress generally follows a sigmoid curve: it starts with a period of fast progress, which generally stabilizes as researchers hit hard limitations, and then, further improvements become incremental
Models to Try
• Deep Learning should be viewed as another tool in the toolbox
• There are many possible machine learning methods to apply
• Suggestion is to try …
• Linear models
• Tree-based ensembles; e.g. random forests and gradient boosting
• Deep learning; e.g. feedforward, convolutional, and recurrent networks
• Gradient boosting and deep learning have won a lot of Kaggle competitions
Textbook #2
The cover of our textbook pays homage to the Modified National Institute of Standards and Technology (MNIST) data set, which is what led to the development of a convolutional neural network [learning the filters to classify handwritten zip codes, instead of handcrafting filters]
About the Textbook
• Reads like a rough draft. Example: equation 1.21 in the first printing is incorrect [change the X (uppercase phi) to l_j]
• From the preface: “So I did what any self-respecting professor would do, scheduled myself to teach the stuff, started a crash course by surfing the web, and got my students to teach it to me”
• Put down the torches and pitchforks: this book does provide a decent introduction to the theory behind deep learning
[IDL] Chapter 1: Feed-Forward Neural Nets
1. Perceptrons
2. Cross-entropy Loss Functions for Neural Nets
3. Derivatives and Stochastic Gradient Descent
4. Writing Our Program
5. Matrix Representation of Neural Nets
6. Data Independence
7. References and Further Reading
8. Written Exercises
MNIST Digit as a Matrix
MNIST Digit as an Image
image looks funky because we’ve zoomed in on a low-resolution (28 x 28) image
Schematic Diagram of a Perceptron
A Typical Neuron
• A single biological neuron typically has …
• many inputs (dendrites)
• a cell body
• a single output (the axon)
• Neural networks are inspired by the biological neuron (not an accurate representation of how learning is performed by humans)
The Perceptron Function
The perceptron function is parameterized by the vector Φ, where 𝜙𝑖 ∈ Φ is the ith parameter [“fee sub eye” is an element of “cap fee (capital fee)”]
b is a bias term
w is a weight vector
The capital sigma (Σ) is a summation operator, with index i.
Nota Bene (note well): a neuron is a weight vector
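A minimal NumPy sketch of the threshold form described above (my own rendering, not the book’s code):

import numpy as np
def perceptron(x, w, b):
    # fire (output 1) when the dot product of the weights and inputs, plus the bias, exceeds zero
    return 1 if np.dot(w, x) + b > 0 else 0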
The Dot Product
The dot product is the inner product of two vectors.
The dot product is the numerator of the cosine of the angle between the two vectors. It can be viewed as measuring the similarity between the two vectors. A large positive number means the two vectors are similar, while a large negative number means the two vectors are dissimilar. Normalizing (dividing) the dot product by the product of the two vector lengths yields a number in the interval [-1, 1]. Taking the dot product of an input vector and a weight vector yields a new “feature”, measuring how similar the input vector is to the weight vector. This is why deep learning is sometimes referred to as learning representations [as in the International Conference on Learning Representations (ICLR); the other important Deep Learning conference is the Neural Information Processing Systems conference (NeurIPS)].
This is *the* most common operation in deep learning. We’ve peaked early. It’s all downhill from here ☺
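A small NumPy illustration of the dot product and its normalized (cosine) form, using made-up vectors:

import numpy as np
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
dot = np.dot(u, v)                                      # 28.0
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # 1.0, because v points in the same direction as u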
Input Example
𝒙^k = (x_1^k, …, x_l^k)
𝒙𝑘 identifies the input vector for example “k”, with features 1 through “l” (lower-case “l”)
I would not have chosen lower-case “l” to represent the length of the input feature vector, but I want to be consistent with the book [which is a decent textbook for deep learning].
Note: a lower-case bold letter often represents a vector, while an upper-case bold letter often represents a matrix.
𝑎𝑘 identifies the output answer for example “k”
The Perceptron Learning Algorithm
The capital delta (Δ) represents the quantity to be added to the weight.
We are treating the bias b as a weight for the imaginary feature 𝑥0 = 1.
N is our first hyperparameter, a value assigned by us to control the learning algorithm’s behavior.
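A rough NumPy sketch of the algorithm, assuming binary answers in {0, 1}, inputs with a leading column of 1s for the bias, and N passes over the data (variable names are mine, not the book’s):

import numpy as np
def train_perceptron(X, a, N):
    # X: one example per row, first column all 1s (the imaginary feature x0 for the bias)
    # a: the answer (0 or 1) for each example; N: number of passes over the training data
    phi = np.zeros(X.shape[1])
    for _ in range(N):
        for x_k, a_k in zip(X, a):
            f = 1 if np.dot(phi, x_k) > 0 else 0
            phi += (a_k - f) * x_k    # Δ𝜙 = (answer - prediction) * input; zero when the prediction is correct
    return phi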
Multiple Perceptrons for Multiple Classes
One perceptron for each class
Neural Network Showing Layers
• First rectangle represents the 5 input features: the Input layer
• Second rectangle represents the 3 output neurons: the Perceptron layer
Weight Updates
• The weight update is equal to the negated learning rate multiplied by the partial derivative of loss with respect to the weight
• This is true for *any* weight in the network
• The number of operations (e.g. products and activations) between the weight and the output of the model dictates the number of terms that comprise the partial derivative of loss with respect to the weight
• Δ: upper-case delta [I say “cap delta”]
• 𝜙: lower-case phi [I say “fee”]
• ℒ: script L
Loss as a Function of 𝜙1
• Our goal is to minimize the loss function value (vertical axis)
• This said, we try to avoid “large” weights
Softmax
• Softmax function takes logit [I say “low-jit”] values as inputs and produces probability estimates as outputs
• Softmax is an activation function, not a layer unto itself
• “Is this a layer?”
• “Does it have weights that are applied to activations from a previous layer?”
• 𝜎: lower-case sigma
• ℯ: script ‘e’ [Euler’s number ≈ 2.71828 : a constant; not a variable]
• Σ: upper-case sigma [summation operator; not a variable]
• bold, lower-case x: a vector [bold, upper-case X: a matrix]
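A minimal NumPy sketch of the softmax activation (the shift by the largest logit is a common numerical-stability trick, not part of the definition itself):

import numpy as np
def softmax(logits):
    exps = np.exp(logits - np.max(logits))   # subtracting the max does not change the result
    return exps / np.sum(exps)
print(softmax(np.array([2.0, 1.0, 0.1])))    # approximately [0.659 0.242 0.099]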
Cross-Entropy Loss Function
• 𝑋 : italicized ‘X’ represents the loss function
• Φ: upper-case phi [“cap fee”] represents the model parameters
• 𝑥: italicized ‘x’ represents an input vector
• ln: the natural logarithm function [Euler’s number is the base]
• 𝑎𝑥: the actual label for input vector ‘x’
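A small sketch of the loss for a single example, assuming probs holds the softmax outputs and a_x is the integer index of the actual label (made-up values):

import numpy as np
probs = np.array([0.659, 0.242, 0.099])   # softmax probability estimates
a_x = 0                                   # actual label for this input vector
loss = -np.log(probs[a_x])                # cross-entropy loss, about 0.417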
Graph of -ln x
• Reminder: a probability estimate will never be larger than 1
• -ln 0 = Infinity
• clip softmax probability estimates to avoid this
• min(max(estimate, 1e-6), 1 - 1e-6)
• 1e-6 = 1 * 10^(-6) = 0.000001
• min(max(estimate, 0.000001), 0.999999)
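In NumPy the clipping above can be written in one line (a sketch, with made-up probabilities):

import numpy as np
probs = np.array([1.0, 0.0, 0.0])        # an unclipped 0 would make -ln blow up
probs = np.clip(probs, 1e-6, 1 - 1e-6)   # min(max(estimate, 0.000001), 0.999999)
loss = -np.log(probs[1])                 # about 13.8 instead of infinity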
Relationships between Cross Entropy Loss, Softmax Activation, and Perceptron Outputs
• Cross entropy loss
• Softmax activation
• Perceptron (product) output
Chain Rule for Weight Update
Example from book is for last (only) layer
Book uses …
• X() for cross entropy loss
• lowercase ‘l’ for logit (product of input and weight vectors)
• lowercase ‘p’ for the softmax probability output

Book says / Dave says (the same chain rule, in the book’s symbols versus in words):

∂loss/∂weight = (∂product/∂weight) · (∂loss/∂product)

∂loss/∂product = (∂loss/∂activation) · (∂activation/∂product)

Putting those two equations together for the weight update:

∂loss/∂weight = (∂loss/∂activation) · (∂activation/∂product) · (∂product/∂weight)
Partial Derivatives Used for Chain Rule
The three components for updating weights in the last layer (multiply them together to derive the desired gradient) …
∂loss/∂weight = (∂loss/∂activation) · (∂activation/∂product) · (∂product/∂weight)
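As an illustration of those three components for the last layer (the standard softmax-plus-cross-entropy result, not code copied from the book; all values below are made up), ∂loss/∂product works out to the probability estimates minus the one-hot answer, and ∂product/∂weight is just the input:

import numpy as np
x = np.array([1.0, 0.5])               # input features
probs = np.array([0.7, 0.2, 0.1])      # softmax outputs for 3 classes
onehot = np.array([0.0, 1.0, 0.0])     # one-hot encoding of the actual label
grad_products = probs - onehot         # ∂loss/∂product for each output neuron
grad_W = np.outer(x, grad_products)    # ∂loss/∂weight: one column of gradients per output neuron
grad_b = grad_products                 # the bias gradients are just ∂loss/∂product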
Pseudocode for Simple Feed-Forward Digit Recognition
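The actual pseudocode is in the book; a rough, self-contained NumPy sketch of the same idea (one layer of perceptrons, softmax, cross-entropy, stochastic gradient descent; the hyperparameter values are arbitrary) might look like this:

import numpy as np

def train(images, labels, epochs=10, lr=0.01):
    # images: (num_examples, 784) rows of pixel values; labels: integer digits 0-9
    W = np.zeros((784, 10))
    b = np.zeros(10)
    for _ in range(epochs):
        for x, a in zip(images, labels):
            logits = x @ W + b
            exps = np.exp(logits - logits.max())
            probs = exps / exps.sum()            # softmax probability estimates
            grad_logits = probs.copy()
            grad_logits[a] -= 1.0                # probs minus the one-hot answer
            W -= lr * np.outer(x, grad_logits)   # stochastic gradient descent step
            b -= lr * grad_logits
    return W, b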
Matrix Manipulation
• X = Y + Z [element-wise addition]
• X = Y * Z [dot-products between rows of Y and columns of Z]
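In NumPy terms (note that * on NumPy arrays is element-wise; the row-by-column dot products come from np.dot or the @ operator):

import numpy as np
Y = np.array([[1, 2], [3, 4]])
Z = np.array([[5, 6], [7, 8]])
print(Y + Z)   # element-wise addition: [[ 6  8] [10 12]]
print(Y @ Z)   # rows of Y dotted with columns of Z: [[19 22] [43 50]]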
Quick Example
L = X * W + B
• L is the layer output
• X is the input batch
• W is the Weight matrix
• B is the Bias vector
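A shape-annotated sketch of that equation, with made-up sizes (a batch of 2 examples, 5 input features, 3 output neurons):

import numpy as np
X = np.random.rand(2, 5)   # input batch: 2 examples, 5 features each
W = np.random.rand(5, 3)   # Weight matrix: 5 features in, 3 neurons out
B = np.random.rand(3)      # one bias per neuron, broadcast across the batch
L = X @ W + B              # layer output: shape (2, 3)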
Matrix Transposition
Exchanging rows and columns
Matrix Form of Weight Update
The upside-down triangle (∇, “nabla”) is the symbol for the gradient, a multi-variable generalization of the derivative [below: it’s a vector of partial derivatives of loss with respect to the perceptron (product) output]
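A self-contained sketch of the matrix form, using random stand-in values, where grad_L plays the role of that gradient (one row of ∂loss/∂product values per example in the batch):

import numpy as np
learning_rate = 0.01
X = np.random.rand(2, 5)                    # input batch
W = np.random.rand(5, 3)                    # weights
B = np.random.rand(3)                       # biases
grad_L = np.random.rand(2, 3)               # stand-in for ∂loss/∂product, one row per example
W -= learning_rate * (X.T @ grad_L)         # the update sums the per-example contributions
B -= learning_rate * grad_L.sum(axis=0)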
Matrix Representation of Neural Networks
• We often use X for the input matrix, W for the Weight matrix, and b for the bias vector
• L represents a layer output: L = X * W + b
Data Independence
• iid assumption: data observations are independent and identically distributed
• Simply randomizing (randomly shuffling) the data presented for training can improve results
• this is handled for us by the “shuffle=True” default of Keras’ “fit()” method
• google “keras model fit shuffle” and click the “Jump to fit” link
• shuffling between epochs means the learning algorithm is unlikely to see the same batches again
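For example (a sketch; model stands for any compiled Keras model, and x_train/y_train for the training data):

model.fit(x_train, y_train, epochs=10, batch_size=32, shuffle=True)   # shuffle=True is already the default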