Machine Learning: Neural Networks


Machine Learning: Neural Networks

Neural networks take their inspiration from the neurons of the human brain.

Input-Output Transformation

[Figure: a biological neuron viewed as an input-output transformation. Input spikes build up an excitatory post-synaptic potential (EPSP); a large enough potential triggers an output spike (a brief pulse).]

Human Learning

• Number of neurons: ~10^10
• Connections per neuron: ~10^4 to 10^5
• Neuron switching time: ~0.001 second
• Scene recognition time: ~0.1 second

→ That leaves room for only about 100 sequential inference steps, which doesn't seem like much.

Machine Learning Abstraction

[Figure: mapping the biology onto the abstraction. Dendrites, axon, and synapses on the biological side become nodes connected by weighted links on the machine-learning side; excitatory (+) and inhibitory (-) synapses become positive and negative weights.]

Artificial Neural Networks

• Typically, machine learning ANNs are very artificial, ignoring:
  – Time
  – Space
  – Biological learning processes

• More realistic neural models exist
  – Hodgkin & Huxley (1952) won a Nobel Prize for theirs (in 1963)

• Nonetheless, very artificial ANNs have been useful in many ML applications

Perceptrons

• The “first wave” in neural networks
• Big in the 1960's
  – McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962)

A single perceptron

[Diagram: inputs x_1, …, x_5 with weights w_1, …, w_5, plus a bias input x_0 = 1 (always) with weight w_0, feeding a single threshold unit.]

$$o(x_1,\dots,x_n) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
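A minimal sketch of this unit in Python with NumPy; the weights in the example call are placeholders, not values from the slides:

```python
import numpy as np

def perceptron(x, w):
    """Threshold unit: 1 if sum_i w_i * x_i > 0 (with bias input x0 = 1), else 0."""
    x = np.concatenate(([1.0], x))          # prepend the bias input x0 = 1, always
    return 1 if np.dot(w, x) > 0 else 0

# Example with made-up weights (w[0] is the bias weight w0)
print(perceptron(np.array([1.0, 0.0]), np.array([-0.3, 0.5, 0.5])))   # -> 1
```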

Logical Operators

The same threshold unit,

$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

implements basic logical operators; only the weights change (bias weight $w_0$ listed first):

• AND: $w_0 = -0.8,\; w_1 = 0.5,\; w_2 = 0.5$
• OR: $w_0 = -0.3,\; w_1 = 0.5,\; w_2 = 0.5$
• NOT: weights $0.0$ and $-1.0$ on $x_0$ and $x_1$
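A quick check of the AND and OR weight settings above (a self-contained restatement of the same threshold unit; the NOT case is omitted here):

```python
import numpy as np

def threshold_unit(x, w):
    # 1 if w . [1, x1, x2] > 0, else 0  (the bias input x0 = 1 is prepended)
    return 1 if np.dot(w, np.concatenate(([1.0], x))) > 0 else 0

for name, w in [("AND", [-0.8, 0.5, 0.5]), ("OR", [-0.3, 0.5, 0.5])]:
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(f"{name}({x1},{x2}) = {threshold_unit(np.array([x1, x2], float), np.array(w))}")
```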

Perceptron Hypothesis Space

• What’s the hypothesis space for a perceptron on n inputs?

$$H = \{\, \mathbf{w} \mid \mathbf{w} \in \mathbb{R}^{n+1} \,\}$$

Learning Weights

• Perceptron Training Rule
• Gradient Descent
• (Other approaches: Genetic Algorithms)

[Diagram: the threshold unit again, with question marks for the weights on inputs x_0, x_1, x_2: which values should they take?]

Perceptron Training Rule

• Weights are modified for each example
• Update rule:

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(t - o)\,x_i$$

where $\eta$ is the learning rate, $t$ the target value, $o$ the perceptron output, and $x_i$ the input value.
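A minimal sketch of this rule in Python; the dataset, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i, once per example."""
    X = np.hstack([np.ones((len(X), 1)), X])     # prepend the bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else 0     # current perceptron output
            w += eta * (target - o) * x          # the update rule above
    return w

# Example: learn OR, which is linearly separable, so the rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 1])
print(train_perceptron(X, t))                    # learned weights implement OR
```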

What weights make XOR?

• No combination of weights works
• Perceptrons can only represent linearly separable functions

[Diagram: the threshold unit on inputs x_0, x_1, x_2 with unknown weights; no assignment of weights produces XOR.]

Linear Separability

[Plots in the (x_1, x_2) plane, one slide each for OR, AND, and XOR: OR and AND can each be separated by a single line; XOR cannot.]

Perceptron Training Rule

• Converges to the correct classification IF
  – Cases are linearly separable
  – Learning rate is slow enough

• Minsky and Papert's 1969 analysis of these limitations killed widespread interest in perceptrons till the 80's

XOR

[Diagram: XOR built by wiring several threshold units together; the weights visible on the slide include 0.6, 0.6, -0.6, -0.6, and 1, 1, with bias weights of 0.]

What’s wrong with perceptrons?

• You can always plug multiple perceptrons together to calculate any function.

• BUT… who decides what the weights are?
  – Assignment of error to parental inputs becomes a problem…
  – This is because of the threshold…

• Who contributed the error?

Problem: Assignment of error

[Diagram: a perceptron with a step-function threshold and unknown weights on inputs x_0, x_1, x_2.]

• Hard to tell from the output who contributed what
• Stymies multi-layer weight learning

Solution: Differentiable Function

[Diagram: the same unit with the step threshold replaced by a simple linear function, output $= \sum_{i=0}^{n} w_i x_i$, weights still unknown.]

• Simple linear function
• Varying any input a little creates a perceptible change in the output
• This lets us propagate error to prior nodes

Measuring error for linear units

• Output function:

$$o(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$$

• Error measure over the training data D:

$$E[\mathbf{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where $t_d$ is the target value and $o_d$ the linear unit's output on example d.

Gradient Descent

• Gradient:

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{i,d})$$

• Training rule (gradient descent rule), giving the update rule:

$$w_i \leftarrow w_i + \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$$
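A minimal sketch of batch gradient descent for a single linear unit; the synthetic data, learning rate, and epoch count are illustrative:

```python
import numpy as np

def gradient_descent_linear(X, t, eta=0.01, epochs=200):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 with o = w . x."""
    X = np.hstack([np.ones((len(X), 1)), X])     # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                                # linear unit outputs for all examples
        grad = -(t - o) @ X                      # dE/dw_i = sum_d (t_d - o_d)(-x_{i,d})
        w -= eta * grad                          # w_i <- w_i + eta * sum_d (t_d - o_d) x_{i,d}
    return w

# Example: fit a noisy line t = 1 + 2x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
t = 1 + 2 * X[:, 0] + 0.05 * rng.standard_normal(50)
print(gradient_descent_linear(X, t))             # close to [1, 2]
```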

Back to XOR

[Diagram: the XOR network again, built from threshold units on inputs x_0, x_1, x_2 feeding a final threshold unit.]

Gradient Descent for Multiple Layers

[Diagram: the same XOR network with each threshold unit replaced by a linear unit $\sum_{i=0}^{n} w_i x_i$.]

We can compute the gradient $\dfrac{\partial E}{\partial w_{ij}}$ for every weight $w_{ij}$ in the network.

Gradient Descent vs. Perceptrons

• Perceptron rule & threshold units
  – Learner converges on an answer ONLY IF data is linearly separable
  – Can't assign proper error to parent nodes

• Gradient descent
  – Minimizes error even if examples are not linearly separable
  – Works for multi-layer networks

• But… linear units only make linear decision surfaces (can't learn XOR even with many layers)
  – And the step function isn't differentiable…

A compromise function

• Perceptron:
$$output = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

• Linear:
$$output = net = \sum_{i=0}^{n} w_i x_i$$

• Sigmoid (logistic):
$$output = \sigma(net) = \frac{1}{1 + e^{-net}}$$
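The three unit types side by side as a small Python sketch; the weights and inputs are placeholders:

```python
import numpy as np

def perceptron_output(w, x):
    return 1.0 if np.dot(w, x) > 0 else 0.0       # step threshold

def linear_output(w, x):
    return np.dot(w, x)                           # net = sum_i w_i x_i

def sigmoid_output(w, x):
    net = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-net))             # sigma(net) = 1 / (1 + e^-net)

w = np.array([-0.3, 0.5, 0.5])                    # made-up weights, bias weight first
x = np.array([1.0, 1.0, 0.0])                     # x0 = 1 (bias), x1 = 1, x2 = 0
print(perceptron_output(w, x), linear_output(w, x), sigmoid_output(w, x))
```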

The sigmoid (logistic) unit

• Has a differentiable activation function
  – Allows gradient descent

• Can be used to learn non-linear functions

[Diagram: a unit with unknown weights on inputs x_1, x_2 and output $\dfrac{1}{1 + e^{-\sum_{i=0}^{n} w_i x_i}}$.]

Logistic function

[Diagram: logistic regression drawn as a single sigmoid unit. Inputs (independent variables): Age = 34, Gender = 1, Stage = 4. Coefficients: 0.5, 0.8, 0.4. Output (prediction): 0.6, the “probability of being alive”, computed as $\frac{1}{1 + e^{-\sum_i w_i x_i}}$.]

Neural Network Model

[Diagram: the same inputs (independent variables Age = 34, Gender = 2, Stage = 4) now feed a hidden layer of sigmoid units through a first set of weights, and the hidden layer feeds the output unit (dependent variable) through a second set of weights; illustrative weight values (.6, .5, .8, .2, .1, .3, .7, .2, .4, .2) are shown, and the prediction is 0.6, the “probability of being alive”.]

Getting an answer from a NN

[Diagrams, three slides: the same Age / Gender / Stage network traced piece by piece. Each hidden unit computes a weighted sum of the inputs and passes it through the sigmoid; the output unit then combines the hidden-unit values through its own weights to produce the prediction, 0.6, the “probability of being alive”.]
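A minimal sketch of that forward pass for a network of this shape (3 inputs, 2 hidden sigmoid units, 1 sigmoid output); the weights below are placeholders rather than the slide's values, and bias inputs are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, w_out):
    """Forward pass: inputs -> hidden sigmoid units -> sigmoid output."""
    h = sigmoid(W_hidden @ x)          # hidden layer activations
    return sigmoid(np.dot(w_out, h))   # e.g. "probability of being alive"

x = np.array([34.0, 2.0, 4.0])                   # Age, Gender, Stage
W_hidden = np.array([[0.01, 0.5, 0.8],           # illustrative hidden-layer weights
                     [0.02, 0.3, 0.7]])
w_out = np.array([0.4, 0.2])                     # illustrative output weights
print(forward(x, W_hidden, w_out))               # around 0.64 with these made-up weights
```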

Minimizing the Error

[Plot: the error surface over weight space. Starting at w_initial (initial error), each step moves the weights opposite the derivative (a negative derivative means a positive change), ending at w_trained (final error), a local minimum.]

Differentiability is key!

• The sigmoid is easy to differentiate:

$$\frac{\partial \sigma(y)}{\partial y} = \sigma(y)\,\bigl(1 - \sigma(y)\bigr)$$

• For gradient descent on multiple layers, a little dynamic programming can help:
  – Compute errors at each output node
  – Use these to compute errors at each hidden node
  – Use these to compute errors at each input node

The Backpropagation Algorithm

For each training example ⟨x, t⟩:

1. Input the instance x to the network and compute the output $o_u$ of every unit u in the network.

2. For each output unit k, calculate its error term:
$$\delta_k = o_k\,(1 - o_k)\,(t_k - o_k)$$

3. For each hidden unit h, calculate its error term:
$$\delta_h = o_h\,(1 - o_h) \sum_{k \in outputs} w_{kh}\,\delta_k$$

4. Update each network weight $w_{ji}$:
$$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta\,\delta_j\,x_{ji}$$
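A minimal sketch of these four steps for one hidden layer of sigmoid units with per-example updates; the network size, initialization, learning rate, and epoch count are illustrative choices, not the slide's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W_h, W_o, eta=0.5):
    """One pass of the four backpropagation steps over the training set."""
    for x, t in zip(X, T):
        # 1. Forward pass: compute the output of every unit
        h = sigmoid(W_h @ x)                      # hidden unit outputs o_h
        hb = np.concatenate(([1.0], h))           # add a bias input for the output layer
        o = sigmoid(W_o @ hb)                     # output unit outputs o_k
        # 2. Error term for each output unit k: delta_k = o_k (1 - o_k) (t_k - o_k)
        delta_o = o * (1 - o) * (t - o)
        # 3. Error term for each hidden unit h: delta_h = o_h (1 - o_h) sum_k w_kh delta_k
        delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)
        # 4. Update each weight: w_ji <- w_ji + eta * delta_j * x_ji
        W_o += eta * np.outer(delta_o, hb)
        W_h += eta * np.outer(delta_h, x)

# Example: try to learn XOR (the first input column is the bias feature x0 = 1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(3, 3))          # 3 hidden units, 3 inputs
W_o = rng.normal(scale=0.5, size=(1, 4))          # 1 output unit, bias + 3 hidden inputs
for _ in range(5000):
    backprop_epoch(X, T, W_h, W_o)
print(sigmoid(W_o @ np.vstack([np.ones(4), sigmoid(W_h @ X.T)])))
# usually close to [[0, 1, 1, 0]]; backprop can still get stuck in a local minimum
```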

Getting an answer from a NN

[Diagram, repeated: the Age / Gender / Stage network with its hidden layer, producing 0.6, the “probability of being alive”.]

Expressive Power of ANNs

• Universal function approximator:
  – Given enough hidden units, an ANN can approximate any continuous function f

• Need 2+ hidden units to learn XOR

• Why not use millions of hidden units?
  – Efficiency (neural network training is slow)
  – Overfitting

Overfitting

[Plot: an overfitted model vs. the real distribution.]

Combating Overfitting in Neural Nets

• Many techniques

• Two popular ones:
  – Early stopping
    • Use “a lot” of hidden units
    • Just don't over-train
  – Cross-validation
    • Choose the “right” number of hidden units

Early Stopping

[Plot: error vs. training epochs for a = validation set and b = training set. The training-set error keeps falling, but the validation-set error reaches a minimum and then rises as the model overfits; the stopping criterion is the epoch of minimum validation error.]
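A minimal sketch of that stopping criterion; train_one_epoch and validation_error are hypothetical callables supplied by the surrounding training code:

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop once the validation-set error has not improved for `patience` epochs."""
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass over the training set (b)
        err = validation_error(model)       # error on the held-out validation set (a)
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
            # in practice, also save a copy of the current weights here
        else:
            since_best += 1
        if since_best >= patience:          # stopping criterion: validation error rising
            break
    return best_epoch, best_err
```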

Cross-validation

• Cross-validation: general-purpose technique for model selection – E.g., “how many hidden units should I use?”

• A more extensive version of the validation-set approach.

Cross-validation

• Break the training set into k sets
• For each model M:
  – For i = 1…k
    • Train M on all but set i
    • Test on set i
• Output the M with the highest average test score, trained on the full training set
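A minimal sketch of this loop; train(model, examples) and test(model, examples) are hypothetical functions, and models is a list of candidate models (e.g., different numbers of hidden units):

```python
import numpy as np

def cross_validate(models, data, k, train, test):
    """k-fold cross-validation: pick the model with the best average test score."""
    folds = np.array_split(np.arange(len(data)), k)

    def avg_score(model):
        scores = []
        for i in range(k):
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            trained = train(model, [data[j] for j in train_idx])       # train on all but set i
            scores.append(test(trained, [data[j] for j in folds[i]]))  # test on set i
        return np.mean(scores)

    best = max(models, key=avg_score)
    return train(best, data)          # retrain the winning model on the full training set
```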

Summary of Neural Networks

A non-linear regression technique that is trained with gradient descent.

Question: How important is the biological metaphor?

Summary of Neural Networks

When are neural networks useful?
  – Instances are represented by attribute-value pairs
    • Particularly when attributes are real-valued
  – The target function is
    • Discrete-valued
    • Real-valued
    • Vector-valued
  – Training examples may contain errors
  – Fast evaluation times are necessary

When not?
  – Fast training times are necessary
  – Understandability of the learned function is required

Advanced Topics in Neural Nets

• Batch Mode vs. Incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network Language Models

Incremental vs. Batch Mode

• In batch mode we minimize the error over the whole training set D:

$$E_D[\mathbf{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

• Same as computing the summed per-example weight change:

$$\Delta \mathbf{w}_D = \sum_{d \in D} \Delta \mathbf{w}_d$$

• Then setting:

$$\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}_D$$
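A minimal sketch contrasting the two update schedules for a linear unit; the learning rate and data are placeholders:

```python
import numpy as np

def batch_update(w, X, t, eta):
    """Batch mode: sum the per-example changes, then apply one update."""
    delta_w = np.zeros_like(w)
    for x, target in zip(X, t):
        delta_w += eta * (target - np.dot(w, x)) * x    # accumulate Delta w_d
    return w + delta_w                                   # w <- w + Delta w_D

def incremental_update(w, X, t, eta):
    """Incremental mode: update immediately after each example."""
    for x, target in zip(X, t):
        w = w + eta * (target - np.dot(w, x)) * x
    return w
```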

Advanced Topics in Neural Nets

• Batch Mode vs. Incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network Language Models

Hidden Layer Representations

• Input → hidden layer mapping:
  – a representation of the input vectors tailored to the task

• Can also be exploited for dimensionality reduction
  – A form of unsupervised learning in which we output a “more compact” representation of the input vectors
  – ⟨x_1, …, x_n⟩ → ⟨x'_1, …, x'_m⟩ where m < n
  – Useful for visualization, problem simplification, data compression, etc.

Dimensionality Reduction

[Diagram: the model and the function to learn; the network is trained to reproduce its input at its output through a smaller hidden layer, so the hidden layer provides the compact representation.]
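A minimal sketch of this idea as a tiny 8-3-8 network trained with the backpropagation steps above; reading the network as one trained to reproduce its input is an assumption about the slide's model, and all sizes and constants are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.eye(8)                                     # eight one-hot input patterns
W1 = rng.normal(scale=0.3, size=(3, 9))           # encoder weights (bias column first)
W2 = rng.normal(scale=0.3, size=(8, 4))           # decoder weights (bias column first)
eta = 0.5
for _ in range(10000):
    for x in X:
        xb = np.concatenate(([1.0], x))           # bias input
        h = sigmoid(W1 @ xb)                      # compact code <x'_1, x'_2, x'_3>
        hb = np.concatenate(([1.0], h))
        o = sigmoid(W2 @ hb)                      # reconstruction of the input
        delta_o = o * (1 - o) * (x - o)           # backprop with target = input
        delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)
        W2 += eta * np.outer(delta_o, hb)
        W1 += eta * np.outer(delta_h, xb)

H = sigmoid(W1 @ np.vstack([np.ones(8), X.T]))    # 3-dimensional codes for the 8 inputs
print(np.round(H.T, 2))                           # distinct inputs should get distinct codes
```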

Dimensionality Reduction: Example

[Figures, four slides: a worked example of the compact hidden-layer representation learned for dimensionality reduction.]

Advanced Topics in Neural Nets

• Batch Mode vs. Incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network Language Models


Neural Networks on Silicon

• Currently, neural networks are a simulation stacked on top of digital hardware:
  – Simulation of continuous device physics (neural networks)
  – Digital computational model (thresholding)
  – Continuous device physics (voltage)

• Why not skip the digital layer in the middle?

Example: Silicon Retina

• Simulates the function of a biological retina

• Single-transistor synapses adapt to luminance and temporal contrast

• Modeling the retina directly on the chip requires about 100x less power!

Example: Silicon Retina

• Synapses modeled with single transistors

[Figure: luminance adaptation.]

Comparison with Mammal Data

[Figures: responses of the real (biological) retina and of the artificial silicon retina.]

• Graphics and results taken from:

General NN learning in silicon?

• Seems less in vogue than in the late '90s

• Interest has turned somewhat to implementing Bayesian techniques in analog silicon

Advanced Topics in Neural Nets

• Batch Mode vs. Incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network Language Models

Neural Network Language Models

• Statistical language modeling:
  – Predict the probability of the next word in a sequence

  I was headed to Madrid, ____
  P(___ = “Spain”) = 0.5, P(___ = “but”) = 0.2, etc.

• Used in speech recognition, machine translation, and (recently) information extraction

• Estimate:

Formally

$$P\bigl(w_j \mid w_{j-1}, w_{j-2}, \dots, w_{j-n+1}\bigr) \;=\; P\bigl(w_j \mid h_j\bigr)$$

where $h_j$ is the history (the preceding words) of $w_j$.

Optimizations

• Key idea: learn simultaneously
  – vector representations for each word (50 dimensions)
  – a predictor of the next word

• Short-lists
  – Much of the complexity is in the hidden → output layer
    • The number of possible next words is large
  – Only predict a subset of words
    • Use a standard probabilistic model for the rest
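A minimal sketch of the core idea (word vectors feeding a softmax over the vocabulary); everything here (the tiny vocabulary, the dimensions, the untrained weights) is a placeholder, and no short-list trick is applied:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "was", "headed", "to", "Madrid", ",", "Spain", "but"]
V, dim, context = len(vocab), 8, 3                 # tiny sizes for illustration

E = rng.normal(scale=0.1, size=(V, dim))           # one learned vector per word
W = rng.normal(scale=0.1, size=(V, context * dim)) # history -> next-word scores

def next_word_probs(history):
    """P(w_j | h_j): concatenate the context word vectors, score every word, softmax."""
    idx = [vocab.index(w) for w in history[-context:]]
    h = np.concatenate([E[i] for i in idx])        # h_j, the encoded history
    scores = W @ h
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

p = next_word_probs(["headed", "to", "Madrid"])
print(dict(zip(vocab, np.round(p, 3))))            # untrained, so roughly uniform
```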

Design Decisions (1)

• Number of hidden units

• Almost no difference…

Design Decisions (2)

• Word representation (# of dimensions)

• They chose 120

Comparison vs. state of the art

[Table/figure: comparison of the neural language model with state-of-the-art language models.]
