
• We have been discussing some simple ideas from statistical learning theory.
• The risk minimization framework that we discussed gives us a better perspective on understanding the unifying theme in different learning algorithms.
• We will now go back to studying pattern classification algorithms.
• We will first briefly review algorithms for learning linear classifiers and then start looking at methods to learn nonlinear classifiers.


Linear Models

• In the two-class case, the linear classifier is given by

  h(X) = sign(W^T X + w_0)

• We have seen that we can also think of h(X) as

  h(X) = sign(W^T Φ(X) + w_0),

  where Φ(X) = [φ_1(X), ..., φ_m(X)]^T, as long as the φ_i are fixed (possibly nonlinear) functions, as in the sketch below.
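As a concrete illustration, here is a minimal NumPy sketch of such a classifier; the quadratic feature map phi below is only an assumed example of fixed nonlinear functions φ_i, not one prescribed in the course.

```python
import numpy as np

def phi(x):
    # An assumed example of fixed (nonlinear) basis functions for a
    # 2-d input: Phi(X) = [x1, x2, x1^2, x2^2, x1*x2]^T.
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def h(x, W, w0):
    # Linear classifier in the feature space: h(X) = sign(W^T Phi(X) + w0).
    return np.sign(W @ phi(x) + w0)
```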


• We discussed many algorithms for learning W (a sketch of the first follows).
• The Perceptron algorithm is a simple error-correcting method that is guaranteed to find a separating hyperplane if one exists.
• The perceptron convergence theorem shows that, given any training set of linearly separable patterns, the algorithm will find a separating hyperplane.
• Our discussion of statistical learning theory gives us an idea of how many iid examples we should have before we can be confident that a hyperplane that separates the examples will also do well on test data.
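A minimal sketch of the error-correcting update, assuming augmented feature vectors as rows of X and labels in {−1, +1}; the epoch cap is only a safeguard for non-separable data.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: correct W on each misclassified example.
    X holds augmented feature vectors as rows; y holds labels in {-1, +1}."""
    W = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (W @ xi) <= 0:   # xi is misclassified (or on the boundary)
                W += yi * xi         # error-correcting step
                mistakes += 1
        if mistakes == 0:            # a separating hyperplane has been found
            return W
    return W
```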


• We have also seen the least-squares method, where we find W to minimize

  J(W) = (1/n) Σ_i (W^T X_i − y_i)^2

  where, for simplicity of notation, we have assumed augmented feature vectors.
• In our risk minimization framework, H is parametrized by W, we take h(X) = W^T X, and we minimize the empirical risk under the squared-error loss function.


• We have seen how to obtain the least-squares solution (see the sketch below):

  W* = (A^T A)^{-1} A^T Y

  where the rows of the matrix A are the feature vectors and the components of Y are the y_i.
• The least-squares method can also be used to learn linear regression models.
• The only difference is that in a regression model the y_i are real-valued.
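A minimal sketch of the closed-form solution; solving the normal equations A^T A W = A^T Y is equivalent to the pseudo-inverse formula above whenever A^T A is invertible.

```python
import numpy as np

def least_squares(A, Y):
    """W* = (A^T A)^{-1} A^T Y via the normal equations.
    Rows of A are (augmented) feature vectors; Y holds the targets y_i."""
    return np.linalg.solve(A.T @ A, A.T @ Y)
```

In practice, np.linalg.lstsq(A, Y) is the numerically safer choice when A^T A is ill-conditioned.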


• We have seen that we can also minimize the empirical risk J(W) using gradient descent.
• We can also run this gradient descent in an incremental fashion by considering one example at a time.
• That gives us another classical algorithm, called the LMS algorithm (sketched below).
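A minimal sketch of the LMS update, assuming a fixed step size alpha (a hyperparameter not specified in the slides):

```python
import numpy as np

def lms(X, y, alpha=0.01, epochs=50):
    """LMS: incremental gradient descent on the squared error,
    updating after each example: W <- W + alpha * (y_i - W^T X_i) * X_i."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            W += alpha * (yi - W @ xi) * xi
    return W
```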


• We have also seen that we can use the least-squares idea to learn a model g(W^T X) by redefining J as

  J(W) = (1/n) Σ_i (g(W^T X_i) − y_i)^2

• An important example is logistic regression, where we take g to be the sigmoid function.
• We minimize J by an incremental version of gradient descent (sketched below).
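A minimal sketch under the squared-error criterion above, assuming targets in {0, 1}; the per-example gradient uses the identity g′ = g(1 − g) for the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_least_squares(X, y, alpha=0.1, epochs=100):
    """Incremental gradient descent on J(W) = (1/n) sum_i (g(W^T X_i) - y_i)^2,
    with g the sigmoid. Targets y_i are assumed to lie in {0, 1}."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = sigmoid(W @ xi)
            # d/dW (g - y)^2 = 2 (g - y) g (1 - g) x, since g' = g (1 - g)
            W -= alpha * 2.0 * (g - yi) * g * (1.0 - g) * xi
    return W
```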


• Another important method for learning linear classifiers is the Fisher Linear Discriminant.
• Here, we look for a direction W such that the patterns of the two classes get ‘well-separated’ when projected onto this one-dimensional subspace.
• As we mentioned, the Fisher Linear Discriminant can be thought of as a special case of the least-squares method of learning a linear regression model with special target values.
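A minimal sketch of the standard closed form W ∝ S_W^{-1}(μ_1 − μ_2), where S_W is the within-class scatter matrix; this form is standard, though the slides do not spell it out here.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher Linear Discriminant direction W ~ S_W^{-1} (mu1 - mu2).
    X1 and X2 hold the patterns of the two classes as rows."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    return np.linalg.solve(S_W, mu1 - mu2)
```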


Beyond linear models

• Learning linear models (classifiers) is generally efficient.
• However, linear models are not always sufficient.
• The best linear function may still be a poor fit.
• We have looked at three broad approaches to learning nonlinear classifiers.
• We now discuss neural network models.


Neural network models

• We need a ‘good’ parameterized class of nonlinear functions to learn nonlinear classifiers.
• Artificial neural networks are one such class.
• Nonlinear functions are built up through composition of summations and sigmoids.
• They are useful for both classification and regression.


• In this course we will study only multilayer feedforward networks.
• They are useful because they offer a good parameterized class of nonlinear functions, and there are some efficient algorithms to learn them.
• However, historically, the development of (artificial) neural network models was motivated by some ideas about the structure of the human brain.
• We briefly look at this perspective of neural networks as an approach to engineering intelligent systems.


What is an Artificial Neural Network?

"A parallel distributed information processor made up of simple processing units that has a propensity for acquiring problem-solving knowledge through experience"

• A large number of interconnected units
• Each unit implements a simple, nonlinear function
• The ‘knowledge’ resides in the interconnection strengths
• Problem-solving ability is often acquired through ‘learning’

An architecture inspired by the structure of the brain.


The Human Brain

• The neuron is the basic computing unit.
• The brain is a highly organized structure of networks of interconnected neurons.
• In the brain:
  Number of neurons ∼ 10^11 (100 billion)
  Average synapses per neuron ∼ 10,000 (1,000 to 100,000)
  Total synapses ∼ 10^15
  Neuron time constants ∼ milliseconds
  A single neuron can send 100 spikes per second


A rough estimate of processing power:

One arithmetic operation per synapse
→ 10^4 operations per neuron per spike
→ 10^6 operations per neuron per second
→ 10^17 operations per second!!
(A gigaflop is 10^9 operations; a teraflop is 10^12 operations!)

Massive parallelism can deliver massive computing power, if we know how to manage it.


Digital computers:
• Precise design, highly constrained, not very adaptive or fault tolerant, centralized control, deterministic, basic switching times ∼ 10^-9 sec

Natural neural networks:
• Massively parallel, highly adaptive and fault tolerant, self-configuring, self-repairing, noisy, stochastic, basic switching time ∼ 10^-3 sec

• Most capabilities of the brain are LEARNT.


Artificial Intelligence (AI)

• ‘Understanding’ intelligence in computational terms.
• Developing ‘machines’ that are ‘intelligent’.

There are at least two distinct approaches:

• Try to model intelligent behavior in terms of processing structured symbols. (The resulting methods, algorithms, etc. may not resemble the brain at the implementation level.)
• A second approach is based on mimicking the human brain at the architectural/implementation level.


The symbolic AI approach

• The brain is to be understood in computational terms only.
• The physical symbol system hypothesis.
• A digital computer is a universal symbol manipulator and can be programmed to be intelligent.
• Many useful engineering applications, e.g., expert systems.

An implicit faith: the architecture of the brain per se is irrelevant for engineering intelligent artifacts.


Artificial Neural Networks

• Can be viewed as one approach towards understanding the brain / building intelligent machines.
• Computational architectures inspired by the brain:
  computational methods for ‘learning’ dependencies in a data stream,
  e.g., pattern recognition, system identification.
• Characteristics: emergent properties, learning, self-adaptation.
• Modeling biology? Mathematically purified neurons!!


Artificial Neural Networks

Computing machines that try to mimic the brain's architecture:

• A large network of interconnected units
• Each unit has a simple input-output mapping
• Each interconnection has a numerical weight attached to it
• The output of a unit depends on the outputs and connection weights of the units connected to it
• ‘Knowledge’ resides in the weights
• Problem-solving ability is often acquired through learning


Single neuron model

• x_i are the inputs into the (artificial) neuron and w_i are the corresponding weights; y is the output of the neuron.
• Net input: η = Σ_j w_j x_j
• Output: y = f(η), where f(·) is called the activation function.
  (Perceptron and Adaline are such models; a sketch follows.)
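A minimal sketch of this unit; the hard-limiter used in the example is one of the activation functions discussed later.

```python
import numpy as np

def neuron(x, w, f):
    """Single neuron: net input eta = sum_j w_j x_j, output y = f(eta)."""
    eta = np.dot(w, x)
    return f(eta)

# Example: a Perceptron-like unit with a hard-limiter activation.
y = neuron(np.array([1.0, -2.0]), np.array([0.5, 0.3]),
           lambda eta: 1.0 if eta > 0 else 0.0)
```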


Networks of neurons

• We can connect a number of such units or neurons to form a network. Inputs to a neuron can be outputs of other neurons (and/or external inputs).
• Notation: y_j is the output of the jth neuron; w_ij is the weight of the connection from neuron i to neuron j.


• Each neuron computes the weighted sum of its inputs and passes it through its activation function to compute its output.
• For example, the output of neuron 5 is

  y_5 = f_5(w_35 y_3 + w_45 y_4)
      = f_5(w_35 f_3(w_13 y_1 + w_23 y_2) + w_45 f_4(w_14 y_1 + w_24 y_2))

• By convention, we take y_1 = x_1 and y_2 = x_2.
• Here, x_1, x_2 are the inputs and y_5, y_6 are the outputs (see the sketch below).
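A minimal sketch of this forward computation; for simplicity a single activation f is shared by all neurons (the slides allow a different f_j per neuron), and the weight values are assumed placeholders.

```python
def forward(x1, x2, w, f):
    """Forward pass for the example network: neurons 1, 2 are the inputs,
    3, 4 are hidden, and 5, 6 are the outputs; w[(i, j)] is the weight
    from neuron i to neuron j."""
    y1, y2 = x1, x2                          # by convention y1 = x1, y2 = x2
    y3 = f(w[(1, 3)] * y1 + w[(2, 3)] * y2)
    y4 = f(w[(1, 4)] * y1 + w[(2, 4)] * y2)
    y5 = f(w[(3, 5)] * y3 + w[(4, 5)] * y4)
    y6 = f(w[(3, 6)] * y3 + w[(4, 6)] * y4)
    return y5, y6
```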


• A single neuron ‘represents’ a class of functions from ℜ^m to ℜ.
• A specific set of weights realizes a specific function.
• By interconnecting many units/neurons, networks can represent more complicated functions from ℜ^m to ℜ^m′.
• The architecture constrains the function class that can be represented; the weights define a specific function in the class.
• To form meaningful networks, nonlinearity of the activation function is important.


Typical activation functions

1. Hard limiter:

   f(x) = 1 if x > τ
        = 0 otherwise

   We can keep τ at zero and add one more input line to the neuron. An example of a single neuron with this activation function is the Perceptron.

2. Sigmoid function:

   f(x) = a / (1 + exp(−bx)),  a, b > 0

3. tanh:

   f(x) = a tanh(bx),  a, b > 0

All three are written out in the sketch below.
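The three activation functions above, as a minimal NumPy sketch:

```python
import numpy as np

def hard_limiter(x, tau=0.0):
    # f(x) = 1 if x > tau, 0 otherwise
    return np.where(x > tau, 1.0, 0.0)

def sigmoid(x, a=1.0, b=1.0):
    # f(x) = a / (1 + exp(-b x)), with a, b > 0
    return a / (1.0 + np.exp(-b * x))

def tanh_activation(x, a=1.0, b=1.0):
    # f(x) = a tanh(b x), with a, b > 0
    return a * np.tanh(b * x)
```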


Why study such models?

• A belief that the architecture of the brain is critical to intelligent behavior.
• The models can implement highly nonlinear functions. They are adaptive and can be trained.
• They are useful in many applications:
  time series prediction,
  system identification and control,
  pattern recognition and regression.
• The models can also help us understand brain function (computational neuroscience).


Many different models are possible

• Evolution: discrete time / continuous time; synchronous / asynchronous; deterministic / stochastic
• Interconnections: feedforward / having feedback
• States or outputs of units: binary / finitely many / continuous


Recurrent networks

• The network we saw earlier has no feedback.
• Here is an example of a network with feedback.
• It can model a dynamical system (see the sketch below):

  y(k) = f(y(k − 1), x_1(k), x_2(k))
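A minimal sketch of iterating such a system; the particular f below is only an assumed placeholder built from a sigmoid unit with a feedback connection.

```python
import numpy as np

def simulate(f, y0, x1_seq, x2_seq):
    """Iterate the dynamical system y(k) = f(y(k-1), x1(k), x2(k))."""
    ys, y = [], y0
    for x1, x2 in zip(x1_seq, x2_seq):
        y = f(y, x1, x2)   # feedback: the output depends on its previous value
        ys.append(y)
    return ys

# Assumed example: a single sigmoid unit with a feedback weight of 0.5.
f = lambda y, x1, x2: 1.0 / (1.0 + np.exp(-(0.5 * y + 0.2 * x1 - 0.1 * x2)))
outputs = simulate(f, y0=0.0, x1_seq=[1, 0, 1], x2_seq=[0, 1, 1])
```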


• We will consider only feedforward networks, which provide a general class of nonlinear functions.
• These can always be organized as a layered network.
• This network represents a class of functions from ℜ^2 to ℜ^2.


• Each unit can also have a ‘bias’ input.
• This is shown for a single unit below.
• One can always think of the bias as an extra input:

  y = f( Σ_{i=1}^d w_i x_i + w_0 ) = f( Σ_{i=0}^d w_i x_i ),  with x_0 = +1
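A minimal sketch showing the equivalence of the two forms:

```python
import numpy as np

def neuron_with_bias(x, w, w0, f):
    # y = f( sum_{i=1}^d w_i x_i + w_0 )
    return f(np.dot(w, x) + w0)

def neuron_augmented(x, w_aug, f):
    # Equivalent form: absorb the bias by prepending x_0 = +1,
    # so y = f( sum_{i=0}^d w_i x_i ) with w_aug = [w_0, w_1, ..., w_d].
    return f(np.dot(w_aug, np.concatenate(([1.0], x))))
```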

Multilayer feedforward networks

• Here is a general multilayer feedforward network.
