
An introduction to Neural Networks and Deep Learning

Talk given at the Department of Mathematics of the University of Bologna

February 20, 2018

Andrea Asperti

DISI - Department of Informatics: Science and Engineering, University of Bologna

Mura Anteo Zamboni 7, 40127, Bologna, ITALY
andrea.asperti@unibo.it


A branch of Machine Learning

What is Machine Learning?

There are problems that are difficult to address with traditional programming techniques:

• classify a document according to some criteria (e.g. spam detection, sentiment analysis, ...)

• compute the probability that a credit card transaction is fraudulent

• recognize an object in some image (possibly from an unusual viewpoint, in new lighting conditions, in a cluttered scene)

• ...

Typically the result is a weighted combination of a large number of parameters, each one contributing to the solution to a small degree.


The Machine Learning approach

Suppose we have a set of input-output pairs (training set)

{⟨x_i, y_i⟩}

The problem consists in guessing the map x_i ↦ y_i.

The M.L. approach:

• describe the problem with a model depending on some parameters Θ (i.e. choose a parametric class of functions)

• define a loss function to compare the results of the model with the expected (experimental) values

• optimize (fit) the parameters Θ to reduce the loss to a minimum
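A minimal sketch of these three steps in Python (the data, the linear model, and the use of scipy.optimize.minimize are our own illustration, not from the talk):

import numpy as np
from scipy.optimize import minimize

# made-up training set {<x_i, y_i>}
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])

# 1. model: a parametric class of functions, here linear in theta = (w, b)
def model(theta, x):
    w, b = theta
    return w * x + b

# 2. loss: compare the results of the model with the expected values
def loss(theta):
    return np.mean((model(theta, xs) - ys) ** 2)

# 3. fit: adjust the parameters theta to reduce the loss to a minimum
result = minimize(loss, x0=np.zeros(2))
print(result.x)   # fitted (w, b), close to (1.94, 1.09)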


Why Learning?

Machine Learning problems are in fact optimization problems! So, why talk about learning?

The point is that the solution to the optimization problem is not given in analytical form (often there is no closed-form solution).

So, we use iterative techniques (typically, gradient descent) to progressively approximate the result.

This form of iteration over data can be understood as progressive learning of the objective function based on the experience of past observations.



Using gradients

The objective is to minimize some loss function over (fixed) training samples, e.g.

Θ(w) = ∑_i E(o(w, x_i), y_i)

by suitably adjusting the parameters w.

See how Θ changes according to small perturbations Δ(w) of the parameters w: this is the gradient

∇_w Θ = [∂Θ/∂w_1, ..., ∂Θ/∂w_n]

of Θ w.r.t. w.

The gradient is a vector pointing in the direction of steepest ascent.


Gradient descent

Goal: minimize some loss function Θ(w) by suitably adjusting the parameters.

We can reach a minimal configuration for Θ(w) by iteratively taking small steps in the direction opposite to the gradient (gradient descent).

This is a general technique.

Warning: not guaranteed to work:

• it may end up in local minima

• it may get lost in plateaus
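The same fit as before can be done by hand with gradient descent; a minimal sketch with invented data and learning rate (not from the slides):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.1             # initial parameters, learning rate
for epoch in range(200):
    err = (w * x + b) - y
    grad_w = 2 * np.mean(err * x)    # dTheta/dw for the mean squared error
    grad_b = 2 * np.mean(err)        # dTheta/db
    w -= lr * grad_w                 # small step opposite to the gradient
    b -= lr * grad_b

print(w, b)                          # approaches (2.0, 1.0)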


Next arguments

A bit of taxonomy


Different types of Learning Tasks

• supervised learning: inputs + outputs (labels)
  - classification
  - regression

• unsupervised learning: just inputs
  - clustering
  - component analysis
  - autoencoding

• reinforcement learning: actions and rewards
  - learning long-term gains
  - planning



Classification vs. Regression

Two forms of supervised learning: {⟨x_i, y_i⟩}

[Figure: classification assigns a new input to a class ("probably a cat!"); regression maps a new input to an expected value.]

In classification y is discrete (y ∈ {•, +}); in regression y is (conceptually) continuous.


Many different techniques

• Different ways to define the models:
  - decision trees
  - linear models
  - neural networks
  - ...

• Different error (loss) functions:
  - mean squared error
  - logistic loss
  - cross entropy
  - cosine distance
  - maximum margin
  - ...

[Figures: a decision tree for the play-tennis example (Outlook: Sunny/Overcast/Rain; Humidity: High/Normal; Wind: Strong/Weak; leaves Yes/No), a neural net, and illustrations of the mean squared error and maximum margin losses.]
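Two of the listed losses as a minimal numpy sketch (the function names and toy conventions are ours, not from the slides):

import numpy as np

def mean_squared_error(y_true, y_pred):
    # average squared difference between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot targets, y_pred: predicted class probabilities
    return -np.mean(np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1))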


Next argument

Neural Networks


Neural Network

A network of (artificial) neurons

Artificial neuron

Each neuron takes multiple inputs and produces a single output (that can be passed as input to many other neurons).


The artificial neuron

[Figure: an artificial neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n, a bias b attached to a constant +1 input, a summation Σ, and an activation function producing the output.]

The purpose of the activation function is to introduce a thresholding mechanism (similar to the axon hillock of cortical neurons).


Different activation functions

The activation function is responsible for threshold triggering.

• threshold: if x > 0 then 1 else 0

• logistic function: 1/(1 + e^(−x))

• hyperbolic tangent: (e^x − e^(−x))/(e^x + e^(−x))

• rectified linear (ReLU): if x > 0 then x else 0
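The four activations above as a minimal numpy sketch (vectorized, so they apply elementwise to arrays):

import numpy as np

def threshold(x): return (x > 0).astype(float)     # step: 1 if x > 0 else 0
def logistic(x):  return 1.0 / (1.0 + np.exp(-x))  # squashes into (0, 1)
def tanh(x):      return np.tanh(x)                # squashes into (-1, 1)
def relu(x):      return np.maximum(0.0, x)        # x if x > 0 else 0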


A comparison with the cortical neuron


Next argument

Networks typology/topology


Layers

A neural network is a collection of artificial neurons connected together. Neurons are usually organized in layers.

If there is more than one hidden layer the network is deep; otherwise it is called a shallow network.


Feed-forward networks

If the network is acyclic, it is called a feed-forward network. Feed-forward networks are (at present) the commonest type of networks in practical applications.

Important: composing linear transformations makes no sense, since we still get a linear transformation. What is the source of non-linearity in Neural Networks?

The activation function



Dense networks

The most typical feed-forward network is a dense network where each neuron at layer k − 1 is connected to each neuron at layer k.

The network is defined by a matrix of parameters (weights) W_k for each layer (+ biases).

The matrix W_k has dimension L_{k−1} × L_k, where L_k is the number of neurons at layer k.
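A minimal sketch of a forward pass through such a network, with invented layer sizes (not from the slides):

import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    for W, b in zip(weights, biases):   # one weight matrix (+ bias) per layer
        x = activation(x @ W + b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                        # L_k: neurons per layer
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))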


Parameters and hyper-parameters

The weights W_k are the parameters of the model: they are learned during the training phase.

The number of layers and the number of neurons per layer are hyper-parameters: they are chosen by the user and fixed before training starts.

Other important hyper-parameters govern training, such as the learning rate, the batch size, the number of epochs, and many others.


Convolutional networks

Convolutional networks are used for inputs with a topological structure: signal sequences (e.g. sound) or images.

They repeatedly apply a (small) uniform linear transformation, called a kernel, shifting it over the whole input.


Example

[Figure: the kernel [−1 0 1], applied horizontally and, transposed, vertically over an image, highlights edges along the two directions.]
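A minimal 1D sketch of shifting the kernel [−1 0 1] over a signal (the step signal is invented); the response is large exactly where the signal changes:

import numpy as np

signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
kernel = np.array([-1.0, 0.0, 1.0])

# slide the kernel over the signal, taking a dot product at each position
out = np.array([signal[i:i+3] @ kernel for i in range(len(signal) - 2)])
print(out)   # [ 1.  1.  0. -1.]: responds at the rising and falling edges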


Computing features

Many interesting kernels (filters) known from Image Processing:

• first and second order derivatives, image gradients

• Sobel, Prewitt, ...

In Neural Networks, kernels are learned by training.

Since kernels are small and weights are shared, training is relatively fast.


Recurrent Networks

In a recurrent network you may have cycles:

• the dynamics is very complex; it is not even clear that it stabilizes

• difficult to train

• biologically more realistic

Restricted models:

• Long Short-Term Memory models (LSTM)

• Gated Recurrent Units (GRU)


LSTM and GRU

LSTMs are useful to model sequences:

• equivalent to very deep nets with one hidden layer per time slice (net unrolling)

• weights are shared between different time slices

• they can keep information for a long time in an internal state
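A minimal Keras sketch of an LSTM used for sequence classification; the dimensions (20 time slices, 8 input features, 3 classes) are invented:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(20, 8)))    # internal state carried across time slices
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')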


Symmetrically connected networks

Similar to recurrent networks, but connections between units are symmetrical (they have the same weight in both directions).

They have stable configurations corresponding to local minima of a suitable energy function.

Hopfield nets: symmetrically connected nets without hidden units.

Boltzmann machines: symmetrically connected nets with hidden units:

• more powerful models than Hopfield nets

• less powerful than general recurrent networks

• have a nice and simple learning algorithm


What a real network looks like

VGG 16 (Simonyan and Zisserman): 92.7% top-5 accuracy on ImageNet.

Picture by Davi Frossard: VGG in TensorFlow


How do we implement a neural net?

Neural nets look complicated.

How do we implement them?

There exist suitable languages:

• Theano, University of Montreal

• TensorFlow, Google Brain

• Caffe, Berkeley Vision

• Keras, F. Chollet

• PyTorch, Facebook

• ...


VGG 16 in Keras

From GitHub

# Keras 1-style API, as in the original GitHub source
from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D, MaxPooling2D

def VGG_16(weights_path=None):
    model = Sequential()
    # block 1: two 3x3 convolutions with 64 filters, then 2x2 max-pooling
    model.add(ZeroPadding2D((1,1), input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    # block 2: two 3x3 convolutions with 128 filters, then 2x2 max-pooling
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    ...

The whole model is defined in 50 lines of code.


But what about training?

So complex ...

fit(x, y, batch_size=32, epochs=10)

• x: input, an array of data (hence, typically, an array of arrays)

• y: labels, an array of target categories

• batch_size: integer, number of samples per gradient update

• epochs: integer, the number of epochs (passes over the data) to train the model
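For instance, assuming model is a Keras model like the one above and x_train, y_train hold the (invented) training data and labels:

model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=10)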


Next arguments

Features and deep features


Features

Any individual measurable property of the data useful for the solution of a specific task is called a feature.

Examples:

• Emergency C-section: age, first pregnancy, anemia, fetus malpresentation, previous premature birth, anomalous ultrasound, ...

• Weather: humidity, pressure, temperature, wind, rain, snow, ...

• Expected lifetime: age, health, annual income, kind of work, ...


Derived (inner) features

New interesting features may be derived as combinations of input features.

Suppose for instance that we want to model some phenomenon with a cubic function

f(x) = ax³ + bx² + cx + d

We can use x as input or ...

we can precompute x, x², and x³, reducing the problem to a linear model!
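A minimal sketch with invented data: after precomputing the powers of x, ordinary linear least squares recovers the cubic's coefficients:

import numpy as np

x = np.linspace(-1, 1, 50)
y = 2 * x**3 - x**2 + 0.5 * x + 1 + 0.05 * np.random.randn(50)

X = np.stack([x**3, x**2, x, np.ones_like(x)], axis=1)   # derived features
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)           # linear fit
print(coeffs)   # approximately (a, b, c, d) = (2, -1, 0.5, 1)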



Traditional Image Processing

In order to process an image we start by computing interesting derived features on the image:

- first order derivatives
- second order derivatives
- difference of Gaussians
- Laplacian
- ...

[Figure: an original image, Gaussian blurs with radii 10 and 25, and their difference.]

Then we use these derived features to get the desired output.


Deep learning, in a deeper sense

Discovering good features is a complex task.

Why not delegate the task to the machine, letting it learn them?

Deep learning exploits a hierarchical organization of the learning model, allowing complex features to be computed in terms of simpler ones, through non-linear transformations.


AI, machine learning, deep learning

• Knowledge-based systems: take an expert, ask him how he solves a problem and try to mimic his approach by means of logical rules

• Traditional Machine Learning: take an expert, ask him which features of the data are relevant to solve a given problem, and let the machine learn the mapping

• Deep Learning: get rid of the expert



Relations between research areas

[Figure: nested areas, from innermost to outermost: deep learning (example: MLPs), representation learning (example: autoencoders), machine learning (example: logistic regression), artificial intelligence (example: knowledge bases).]

Picture taken from “Deep Learning” by Y. Bengio, I. Goodfellow and A. Courville, MIT Press.


Components trained to learn

[Figure: four pipelines from input to output. Rule-based systems use a hand-designed program; classic machine learning uses hand-designed features and a learned mapping from features; representation learning also learns the features; deep learning learns simple features, then more complex features, then the mapping. Shaded boxes mark the components trained to learn.]

Picture taken from “Deep Learning” by Y. Bengio, I. Goodfellow and A. Courville, MIT Press.

Next arguments

Some successful applications

• MNIST and ImageNet
• Speech recognition
• Lip reading
• Text generation
• Deep dreams and Inceptionism
• Mimicking style
• Robot navigation
• Game simulation


MNIST

Modified National Institute of Standards and Technology database

• grayscale images of handwritten digits, 28 × 28 pixels each (each digit normalized to fit a 20 × 20 box)

• 60,000 training images and 10,000 testing images
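Keras ships a convenience loader for MNIST; a quick sketch (the loader is an aside, not from the slides):

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)   # (60000, 28, 28) (10000, 28, 28)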


MNIST

A comparison of different techniques

Classifier                       Error rate (%)
Linear classifier                7.6
K-Nearest Neighbors              0.52
SVM                              0.56
Shallow neural network           1.6
Deep neural network              0.35
Convolutional neural network     0.21

See LeCun’s page on the MNIST database for more data.


ImageNet

ImageNet (@Stanford Vision Lab)

• high resolution color images covering 22K object classes

• over 15 million labeled images from the web


ImageNet competition

Annual competition of image classification (since 2010).

• 1.2 million images (1,000 categories)

• make five guesses about the image label, ordered by confidence


ImageNet samples


ImageNet results


Speech recognition

Several stages (similar to optical character recognition):

• Segmentation. Convert the sound wave into a vector of acoustic coefficients. Typical sampling: 10 milliseconds.

• The acoustic model. Use adjacent vectors of acoustic coefficients to associate probabilities with phonemes.

• Decoding. Find the sequence of phonemes that best fits the acoustic data and a model of expected sentences.

Deep neural networks, pioneered by George Dahl and Abdel-rahman Mohamed, are replacing previous machine learning methods.


Speech recognition

Major industries are investing a lot of money in speech recognition: Amazon (with Intel), Google, Microsoft, ...

Achieving Human Parity in Conversational Speech Recognition. Speech & Dialog research group at Microsoft, 2016.

G. Zweig (project manager) attributes the accomplishment to the systematic use of the latest neural network technology in all aspects of the system.


Lip reading

Google’s DeepMind AI can lip-read TV shows better than a professional.


Text Generation

See Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”.

Examples of fake algebraic documents automatically generated by an RNN.


Deep dreams

Visit Deep dreams generator


Mimicking style

“A Neural Algorithm of Artistic Style”, L.A. Gatys, A.S. Ecker, M. Bethge.

Similar to Inceptionism, but with “style” (texture) instead of content.


More examples



Mimicking style: a different approach

Image-to-image translation with Cycle Generative Adversarial Networks


Robot navigation

Quadcopter Navigation in the Forest using Deep Neural Networks

Robotics and Perception Group, University of Zurich, Switzerland & Institute for Artificial Intelligence (IDSIA), Lugano, Switzerland

Based on Imitation Learning


Atari Games and Q-learning

Google DeepMind’s system playing Atari games (2013).

Recently extended with Imagination-Augmented Agents (2017).

(video)

Based on:

• deep neural networks

• an innovative reinforcement learning technique called Q-learning
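The tabular Q-learning update as a minimal sketch; DeepMind's DQN replaces the table with a deep network (the states, actions and constants here are invented):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99                 # learning rate, discount factor

def q_update(s, a, r, s_next):
    # move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)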


Atari Games and Q-learning

The same network architecture was applied to all games.

Inputs are screen frames.

Works well for reactive games, not for those requiring planning.


Recommended