Last lecture summary – Naïve Bayes Classifier


Page 1:

Last lecture summary
Naïve Bayes Classifier

Page 2:

Bayes Rule

$$\underbrace{P(Y\mid X)}_{\text{posterior}} \;=\; \frac{\overbrace{P(X\mid Y)}^{\text{likelihood}}\;\overbrace{P(Y)}^{\text{prior}}}{\underbrace{P(X)}_{\text{normalization constant}}}$$

Prior and likelihood must be learnt (i.e. estimated from the data).

Page 3:

• Learning the prior
  – A hundred independently drawn training examples will usually suffice to obtain a reasonable estimate of P(Y).

• Learning the likelihood
  – The Naïve Bayes Assumption: assume that all features are independent given the class label Y.

$$P(X_1,\dots,X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$

Page 4:

Example – Play Tennis

Page 5:

Example – Learning Phase

Outlook      Play=Yes   Play=No
Sunny        2/9        3/5
Overcast     4/9        0/5
Rain         3/9        2/5

Temperature  Play=Yes   Play=No
Hot          2/9        2/5
Mild         4/9        2/5
Cool         3/9        1/5

Humidity     Play=Yes   Play=No
High         3/9        4/5
Normal       6/9        1/5

Wind         Play=Yes   Play=No
Strong       3/9        3/5
Weak         6/9        2/5

P(Play=Yes) = 9/14 P(Play=No) = 5/14

P(Outlook=Sunny|Play=Yes) = 2/9
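These tables are simply relative frequencies counted per class. As an illustration (not part of the original slides), a short Python sketch that counts them, assuming the standard 14-example Play Tennis data set from Mitchell's textbook, which reproduces the fractions above:

```python
from collections import Counter

# Assumed: the standard 14-example Play Tennis data set (Mitchell, 1997);
# each row is (Outlook, Temperature, Humidity, Wind, Play).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(row[-1] for row in data)            # 9 Yes, 5 No -> P(Play)

def cond_count(feature_index, value, label):
    """Counts behind the relative frequency P(feature = value | Play = label)."""
    in_class = [row for row in data if row[-1] == label]
    matching = sum(1 for row in in_class if row[feature_index] == value)
    return matching, len(in_class)

print(class_counts)                          # Counter({'Yes': 9, 'No': 5})
print(cond_count(0, "Sunny", "Yes"))         # (2, 9)  -> P(Outlook=Sunny|Play=Yes) = 2/9
print(cond_count(3, "Strong", "No"))         # (3, 5)  -> P(Wind=Strong|Play=No)    = 3/5
```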

Page 6:

Example – Prediction
x’ = (Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong)

Look up tables

P(Outl=Sunny|Play=No) = 3/5

P(Temp=Cool|Play=No) = 1/5

P(Hum=High|Play=No) = 4/5

P(Wind=Strong|Play=No) = 3/5

P(Play=No) = 5/14

P(Outl=Sunny|Play=Yes) = 2/9

P(Temp=Cool|Play=Yes) = 3/9

P(Hum=High|Play=Yes) = 3/9

P(Wind=Strong|Play=Yes) = 3/9

P(Play=Yes) = 9/14

P(Yes|x’) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] · P(Play=Yes) = 0.0053
P(No|x’) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] · P(Play=No) = 0.0206

Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
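As a quick check of this arithmetic, here is a minimal Python sketch (my own illustration, not part of the slides) that multiplies the looked-up probabilities for both classes:

```python
# Unnormalized Naive Bayes scores for x' = (Sunny, Cool, High, Strong),
# using the conditional probabilities read off the learning-phase tables.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(x'|Yes) * P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(x'|No)  * P(No)

print(f"P(Yes|x') ~ {p_yes:.4f}")                # ~ 0.0053
print(f"P(No|x')  ~ {p_no:.4f}")                 # ~ 0.0206
print("prediction:", "Yes" if p_yes > p_no else "No")   # -> No
```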

Page 7:

Last lecture summary
Binary classifier performance

Page 8:

TP, TN, FP, FN

Precision, Positive Predictive Value (PPV) = TP / (TP + FP)

Recall, Sensitivity, True Positive Rate (TPR), Hit rate = TP / P = TP / (TP + FN)

False Positive Rate (FPR), Fall-out = FP / N = FP / (FP + TN)

Specificity, True Negative Rate (TNR) = TN / (TN + FP) = 1 − FPR

Accuracy = (TP + TN) / (TP + TN + FP + FN)
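These definitions translate directly into code. Below is a small illustrative Python helper (my own, not from the lecture) that computes them from the four confusion-matrix counts:

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classifier metrics from confusion-matrix counts."""
    precision   = tp / (tp + fp)                    # PPV
    recall      = tp / (tp + fn)                    # sensitivity, TPR, hit rate
    fpr         = fp / (fp + tn)                    # fall-out
    specificity = tn / (tn + fp)                    # TNR = 1 - FPR
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, fpr, specificity, accuracy

# example counts (made up for illustration): 40 TP, 45 TN, 5 FP, 10 FN
print(binary_metrics(40, 45, 5, 10))
```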

Page 9:

Page 10:

Neural networks (new stuff)

Page 11:

Biological motivation

• The human brain has been estimated to contain ~10¹¹ brain cells (neurons).

• A neuron is an electrically excitable cell that processes and transmits information by electrochemical signaling.

• Each neuron is connected with other neurons through the connections called synapses.

• A typical neuron possesses a cell body (often called the soma), dendrites (many, on the order of millimetres long), and an axon (one, 10 cm – 1 m long).

Page 12:

Page 13:

• A synapse permits a neuron to pass an electrical or chemical signal to another cell.
• A synapse can be either excitatory or inhibitory.
• Synapses are of different strengths (the stronger the synapse is, the more important it is).
• The effects of synapses accumulate inside the neuron.
• When the cumulative effect of synapses reaches a certain threshold, the neuron gets activated and the signal is sent to the axon, through which the neuron is connected to other neuron(s).

Page 14:

• Simplistic view of the function of a neuron
  – The neuron accumulates positive/negative stimuli from other neurons.
  – These are then processed further to produce an output, i.e. the neuron sends an output signal to the neurons connected to it.

Page 15:

Neural networks for applied science and engineering, Samarasinghe

Page 16:

Warren McCulloch (1898 – 1969)    Walter Pitts (1923 – 1969)

Threshold neuron

Page 17:

• 1st mathematical model of a neuron – the McCulloch & Pitts binary (threshold) neuron
  – only binary inputs and output
  – the weights are pre-set, no learning

x1    x2    t
0.2   0.3   0
0.2   0.8   0
0.8   0.2   0
1.0   0.8   1

– inputs – weights – activation (transfer) function – output

Page 18:

• In this exercise, both weights will be fixed

• When is the target classified as 0 and when as 1?

• Set the threshold. – If the weighted sum ≥ threshold, then the input is classified as 1. – If the weighted sum < threshold, then it is classified as 0.

• Which threshold would you use? – e.g. 1.3

$$\mathbf{w}\cdot\mathbf{x} = \sum_{j=1}^{2} w_j x_j = w_1 x_1 + w_2 x_2 = x_1 + x_2$$

x1    x2    t
0.2   0.3   0
0.2   0.8   0
0.8   0.2   0
1.0   0.8   1

Page 19:

Heaviside (threshold) activation function

Page 20:

• The threshold is incorporated as the weight w0 of one additional input with input value 1.0.

• Such an input is called the bias.

$$\mathbf{w}\cdot\mathbf{x} = \sum_{j=0}^{2} w_j x_j = w_0 \cdot 1.0 + w_1 x_1 + w_2 x_2$$

Page 21:

• Because the location of the threshold function defines the two categories, its value of 1.3 defines a classification boundary that can be formulated as $x_1 + x_2 = 1.3$.
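To make the boundary concrete, here is a minimal Python sketch (my own illustration, assuming both fixed weights equal 1, so the weighted sum is x1 + x2, and using the threshold 1.3 from the slide) that classifies the four training points:

```python
# McCulloch-Pitts style threshold neuron with fixed weights and threshold.
w1, w2, threshold = 1.0, 1.0, 1.3           # weights assumed; threshold 1.3 from the slide

def classify(x1, x2):
    u = w1 * x1 + w2 * x2                   # weighted sum w.x
    return 1 if u >= threshold else 0       # Heaviside (threshold) activation

data = [(0.2, 0.3, 0), (0.2, 0.8, 0), (0.8, 0.2, 0), (1.0, 0.8, 1)]
for x1, x2, t in data:
    print(x1, x2, "->", classify(x1, x2), "target", t)   # all four match their targets
```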

Page 22:

Perceptron (1957)

Frank Rosenblatt

Developed the learning algorithm.

Used his neuron (pattern recognizer = perceptron) for classification of letters.

Page 23:

• a binary classifier: maps its input x (a real-valued vector) to a binary value (0 or 1)
  – the output is 1 if w·x > 0 (with the weight vector w including the bias)
  – the output is 0 otherwise

• the perceptron can adjust its weights (i.e. it can learn) – the perceptron learning algorithm

Page 24:

Multiple output perceptron
• for multicategory (i.e. more than two classes) classification
• one output neuron for each class

(Diagram: input layer, output layer; single layer (one-layered) vs. double layer (two-layered).)

Page 25:

Learning

• Set the weights (including the threshold).

• Supervised learning: we know the target values t.

• We want the outputs y to be as close as possible to the desired target values t.

• We define an error E (the Sum of Squares Error, which we already know).

Page 26:

• “y to be as close as possible to t” means that E should be minimal.

• So we want to minimize E, which is a function of the weights w.
  – E is also called the objective function or sometimes the energy.

Page 27:

Requirements for the minimum:

$$\frac{\partial E}{\partial w_i} = 0, \qquad \frac{\partial^2 E}{\partial w_i\,\partial w_j} > 0$$

The gradient

$$\operatorname{grad} E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2} \right)$$

is a vector pointing in the direction of the greatest rate of increase of the function. We want E to decrease, so we move in the direction of −grad E.

Page 28:

Delta rule

• gradient descent
• How do we train a linear neuron using the delta rule?
• The demonstration will be given for one neuron with one input x, one weight w1, no bias, and one output y.

(Diagram: input x, weight w1, summation Σ, output y.)

Page 29:

• The neuron is presented with an input pattern.
• It calculates the weighted sum and its output as y = w1·x (no threshold is used).
• The error E:

$$E = \tfrac{1}{2}\,(t - y)^2 = \tfrac{1}{2}\,(t - w_1 x)^2$$

• If you plot E against w1, which curve do you get?

(Figure: the error E plotted against w1, with the error gradient indicated.)

Page 30:

• To find the gradient dE/dw1, differentiate the error E with respect to w1:

$$E = \frac{1}{2}(t - y)^2 = \frac{1}{2}(t - w_1 x)^2, \qquad \frac{dE}{dw_1} = ?$$

$$\frac{dE}{dw_1} = \frac{2}{2}\,(t - y)(-x) = -(t - y)\,x$$

• According to the delta rule, the weight change is proportional to the negative of the error gradient:

$$\Delta w_1 = \beta\,(t - y)\,x$$

• New weight:

$$w_1^{\text{new}} = w_1^{\text{old}} + \Delta w_1 = w_1^{\text{old}} + \beta\,(t - y)\,x$$

Page 31:

β is called the learning rate. It determines how far along the gradient it is necessary to move.

Page 32:

$$w_1^{\,i+1} = w_1^{\,i} + \Delta w_1^{\,i} = w_1^{\,i} + \beta\,(t - y)\,x$$

the new weight after the i-th iteration

Page 33:

• This is an iterative algorithm; one pass through the training set is not enough.

• One pass through the whole training data set is called an epoch.

• Adjusting the weights after each input pattern presentation (iteration) is called example-by-example (online) learning (a code sketch follows below).
  – For some problems this can cause the weights to oscillate – the adjustment required by one pattern may be canceled by the next pattern.
  – More popular is the next method.
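As an illustration of example-by-example (online) learning, here is a minimal Python sketch (my own, with made-up training data and learning rate) that applies the delta rule w1 ← w1 + β(t − y)x after every pattern:

```python
# Online (example-by-example) delta-rule training of a single linear neuron y = w1 * x.
# The data and the learning rate beta are illustrative, not from the lecture.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, t) pairs, roughly t = 2x
beta = 0.05                                    # learning rate
w1 = 0.0                                       # initial weight

for epoch in range(50):                        # several passes (epochs) over the data
    for x, t in data:
        y = w1 * x                             # neuron output
        w1 += beta * (t - y) * x               # delta rule: weight updated after every pattern

print(w1)                                      # settles close to 2.0
```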

Page 34:

• Batch learning – wait until all input patterns (i.e. the whole epoch) have been processed, and then adjust the weights in the average sense.
  – More stable solution.
  – Obtain the error gradient for each input pattern.
  – Average them at the end of the epoch.
  – Use this average value to adjust the weights using the delta rule:

$$\Delta w_1 = \frac{1}{n}\sum_{i=1}^{n} \beta\,(t_i - y_i)\,x_i$$
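For comparison with the online version above, a minimal batch-learning sketch (again with illustrative data and learning rate of my own choosing) that averages the per-pattern weight changes over the epoch before updating:

```python
# Batch delta-rule training of the same single linear neuron y = w1 * x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, t) pairs, illustrative
beta = 0.05                                    # learning rate
w1 = 0.0                                       # initial weight

for epoch in range(200):
    # accumulate the per-pattern weight changes over the whole epoch
    delta_sum = sum(beta * (t - w1 * x) * x for x, t in data)
    w1 += delta_sum / len(data)                # average change applied once per epoch

print(w1)                                      # again settles close to 2.0
```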