Lecture 6. Basic statistical modeling
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
Lecture outline
1. Introduction to statistical modeling
– Motivating examples
– Generative and discriminative models
– Classification and regression
2. Bayes and Naïve Bayes classifiers
3. Logistic regression
Last update: 5-Oct-2015
INTRODUCTION TO STATISTICAL MODELING
Part 1
Statistical modeling
• We have studied many biological concepts in this course
– Genes, exons, introns, ...
• We want to provide a description of a concept by means of some observable features
– Sometimes it can be (more or less) an exact rule:
• The enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
– In most cases it is not exact:
• If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein coding sequence of a yeast gene
• If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer
The examples
• Reasons for the descriptions to be inexact:
– Incomplete information
• What mutations on BRCA1/BRCA2? Any mutations on other genes?
– Exceptions
• "If one has fever, he/she has flu" – not everyone with flu has fever, and not everyone with fever has it because of flu
– Intrinsic randomness
Concept/Class: DNA recognized by the enzyme EcoRI
– Features observable from data: the DNA sequence (the string)
Concept/Class: Protein coding sequence of a yeast gene
– Features observable from data: raw: the DNA sequence; derived: the first three characters, the last three characters, the length
Concept/Class: Developing breast cancer
– Features observable from data: mutations at the BRCA1 gene, mutations at the BRCA2 gene
Features known, concept unsure
• In many cases, we are interested in the situation where the features are observed but whether a concept is true is unknown
– We know the sequence of a DNA region, but we do not know whether it corresponds to a protein coding sequence
– We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
– We know a subject is having fever, but we do not know whether he/she has a flu infection or not
Statistical models
• Statistical models provide a principled way to specify the inexact descriptions
• For the flu example, using some symbols:
– X: a set of features
• In this example, a single binary feature with X=1 if a subject has fever and X=0 if not
– Y: the target concept
• In this example, a binary concept with Y=1 if a subject has flu and Y=0 if not
– A model is a function that predicts values of Y based on observed values of X and parameters θ
Parameters
• Some details of a statistical model are provided by its parameters, θ
– Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
• That is, for each person with flu, the probability for him/her to have fever is q1 and the probability not to have fever is 1−q1.
• Different people are assumed to be statistically independent.
– Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
– Finally, the probability for a person to have flu is p
– Then the whole set of parameters is θ = {p, q1, q2}
Basic probabilities
• Pr(X)Pr(Y|X) = Pr(X and Y)
– If there is a 20% chance of rain tomorrow, and whenever it rains there is a 60% chance that the temperature will drop, then there is a 0.2×0.6 = 0.12 chance that tomorrow it will both rain and have a temperature drop
– Capital letters mean the statement is true for all values of X and Y
– Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
• Law of total probability (see the formula below; the summation considers all possible values of Y):
– If there is
• A 0.12 chance that it will both rain and have a temperature drop tomorrow, and
• A 0.08 chance that it will both rain and not have a temperature drop tomorrow
– Then there is a 0.12+0.08 = 0.2 chance that it will rain tomorrow
• Bayes' rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0
– Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y)
– Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0
$$\Pr(X) = \sum_{y} \Pr(X \text{ and } Y = y)$$
A complete numeric example
• Assume the following parameters (X: has fever or not; Y: has flu or not):
– 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
– 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
– 20% of people have flu: Pr(Y=1) = 0.2
• We have a simple model to predict Y from θ and X:
– Probability that someone has fever:
Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0) = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0) = (0.7)(0.2) + (0.1)(1−0.2) = 0.22
– Probability that someone has flu, given that he/she has fever:
Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1)/Pr(X=1) = (0.7)(0.2) / 0.22 ≈ 0.64
– Probability that someone does not have flu, given that he/she has fever:
Pr(Y=0|X=1) = 1 − Pr(Y=1|X=1) ≈ 0.36
– Probability that someone has flu, given that he/she does not have fever:
Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0) = [1 − Pr(X=1|Y=1)]Pr(Y=1) / [1 − Pr(X=1)] = (1 − 0.7)(0.2) / (1 − 0.22) ≈ 0.08
– Probability that someone does not have flu, given that he/she does not have fever:
Pr(Y=0|X=0) = 1 − Pr(Y=1|X=0) ≈ 0.92
Statistical estimation
• Questions we can ask:
– Given a model, what is the likelihood of the observation?
• Pr(X|Y,θ) – on the previous page, θ was omitted for simplicity
• If a person has flu, how likely is he/she to have fever?
– Given an observation, what is the probability that a concept is true?
• Pr(Y|X,θ)
• If a person has fever, what is the probability that he/she has flu?
– Given some observations, what is the likelihood of a parameter value?
• Pr(θ|X), or Pr(θ|X,Y) if whether the concept is true is also known
• Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?
Statistical estimation
• Questions we can ask (cont'd):
– Maximum likelihood estimation: Given a model with unknown parameter values, what parameter values maximize the data likelihood? (either of the first two expressions below)
– Prediction of concept: Given a model and an observation, what concept is most likely to be true? (the last expression below)
$$\arg\max_\theta \Pr(X \mid Y, \theta) \quad\text{or}\quad \arg\max_\theta \Pr(X, Y \mid \theta) \qquad\text{(maximum likelihood estimation)}$$
$$\arg\max_y \Pr(Y = y \mid X, \theta) \qquad\text{(prediction of concept)}$$
Generative vs. discriminative modeling
• If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
– Because we can use the model to generate data
– Example: Naïve Bayes
• If a model predicts Y by providing information about Pr(Y|X) directly, without providing information about Pr(X,Y), it is called a discriminative model
– Example: Logistic regression
Classification vs. regression
• If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
– Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in his/her lifetime
• Y=1: the subject will develop breast cancer
• Y=0: the subject will not develop breast cancer
• If Y takes on continuous values, it is a regression problem and the model is called an estimator
– Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
• Y: lifespan of the subject
BAYES AND NAÏVE BAYES CLASSIFIERS
Part 2
Bayes classifiers
• In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes' rule (see the formula below)
• We use capital letters to represent variables (single-valued or vector), and small letters to represent values
• When we do not specify the value, it means something is true for all values. For example, all the following are true according to Bayes' rule:
– Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
– Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
– Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
– Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X, Y = y)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X \mid Y = y)\Pr(Y = y)}$$
Terminology
– Pr(Y) is called the prior probability
• E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
• Can be considered the prior guess that the concept is true before seeing any evidence
– Pr(X|Y) is called the likelihood
• E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
– Pr(Y|X) is called the posterior probability
• E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X, Y = y)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X \mid Y = y)\Pr(Y = y)}$$
Generalizations
• In general, the above is true even if:
– X involves a set of features X = {X(1), X(2), ..., X(m)} instead of a single feature
• Example: predict whether one has flu after knowing whether he/she has fever, headache and running nose
– X can take on continuous values
• In that case, Pr(X) is the probability density of X
• Examples:
– Predict whether a person has flu after knowing his/her body temperature
– Predict whether a gene is involved in a biological pathway given its expression values in several conditions
Parameter estimation
• Let's consider the discrete case first
• Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn) – the training set
• How many parameters are there in the model?
– We need to know the prior probabilities, Pr(Y)
• Two parameters: Pr(Y=1), Pr(Y=0)
• Since Pr(Y=1) = 1 − Pr(Y=0), only one independent parameter
– We need to know the likelihoods, Pr(X|Y)
• Suppose we have m binary features: fever, headache, running nose, ...
• 2^(m+1) parameters for all X and Y value combinations
• 2(2^m − 1) independent parameters, since for each value y of Y, the sum of all Pr(X=x|Y=y) is one
– Total: 2(2^m − 1) + 1 independent parameters
• How large should n be in order to estimate these parameters accurately?
– Very large, given the exponential number of parameters
List of all the parameters
• Let Y be having flu (Y=1) or not (Y=0)
• Let X(1) be having fever (X(1)=1) or not (X(1)=0)
• Let X(2) be having headache (X(2)=1) or not (X(2)=0)
• Let X(3) be having running nose (X(3)=1) or not (X(3)=0)
• Then the complete list of parameters for a generative model is (parameters that are not independent are shown in gray on the original slide):
– Pr(Y=0), Pr(Y=1)
– Pr(X(1)=0, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=0)
– Pr(X(1)=0, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=1)
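As a quick check of the counts on the previous slide: with m = 3 binary features, the list above contains 2 prior parameters (1 independent) and 2 × 2³ = 16 likelihood parameters (2 × (2³ − 1) = 14 independent), i.e., 2(2³ − 1) + 1 = 15 independent parameters in total.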
Why is having many parameters a problem?
• Statistically, we will need a lot of data to accurately estimate the values of the parameters
– Imagine that we need to estimate the 15 parameters on the last page with data about only 20 people
• Computationally, estimating the values of an exponential number of parameters could take a long time
Conditional independence
• One way to reduce the number of parameters is to assume conditional independence: if X(1) and X(2) are two features, then
– Pr(X(1), X(2)|Y)
= Pr(X(1)|Y,X(2)) Pr(X(2)|Y) [standard probability]
= Pr(X(1)|Y) Pr(X(2)|Y) [conditional independence assumption]
– E.g., the probability for a flu patient to have fever is independent of whether he/she has running nose
– Important: this does not imply unconditional independence, i.e., X(1) and X(2) are not assumed independent, and thus we cannot say Pr(X(1), X(2)) = Pr(X(1))Pr(X(2))
• Without knowing whether a person has flu, having fever and having running nose are definitely correlated
Conditional independence and Naïve Bayes
• Number of parameters after making the conditional independence assumption:
– 2 prior probabilities Pr(Y=0) and Pr(Y=1)
• Only 1 independent parameter, as Pr(Y=1) = 1 − Pr(Y=0)
– 4m likelihoods Pr(X(j)=x|Y=y) for all possible values of j, x and y
• Only 2m independent parameters, as Pr(X(j)=1|Y=y) = 1 − Pr(X(j)=0|Y=y) for all possible values of j and y
– Total: 4m+2 (of which 2m+1 are independent), which is much smaller than 2(2^m − 1) + 1!
• The resulting model is usually called a Naïve Bayes model
Estimating the parameters
• Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
• The prior probabilities can be estimated as in the first formula below, where 𝕀 is the indicator function, with 𝕀(true) = 1 and 𝕀(false) = 0
– That is, the fraction of examples with class label y
• Similarly, for any particular feature X(j), its likelihoods can be estimated as in the second formula below
– That is, the fraction of class-y examples having value x at feature X(j)
– To avoid zeros, we can add pseudo-counts, as in the third formula below, where c has a small value
$$\Pr(Y = y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i = y)$$
$$\Pr(X^{(j)} = x \mid Y = y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = x, Y_i = y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
$$\Pr(X^{(j)} = x \mid Y = y) = \frac{c + \sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = x, Y_i = y)}{2c + \sum_{i=1}^{n} \mathbb{I}(Y_i = y)} \qquad\text{(with pseudo-counts)}$$
Example
• Suppose we have the training data as shown below
• How many parameters does the Naïve Bayes model have?
• Estimated parameter values using the formulas on the last page:
– Pr(Y=1) = 3/8
– Pr(X(1)=1|Y=1) = 2/3
– Pr(X(1)=1|Y=0) = 2/5
– Pr(X(2)=1|Y=1) = 1/3
– Pr(X(2)=1|Y=0) = 1/5
Subject i   Has fever? X(1)   Has headache? X(2)   Has flu? Y
1           Yes               Yes                  Yes
2           Yes               No                   Yes
3           No                No                   Yes
4           Yes               No                   No
5           No                Yes                  No
6           No                No                   No
7           No                No                   No
8           Yes               No                   No
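The estimates above can be reproduced with a short Python sketch (a minimal implementation of the formulas from the previous slide, assuming the training table is encoded as 0/1 tuples; `prior` and `likelihood` are hypothetical helper names, and the pseudo-count c defaults to 0 to match the fractions shown here):

```python
# Training data from the table: (X(1) fever, X(2) headache, Y flu), with Yes=1, No=0
data = [(1, 1, 1), (1, 0, 1), (0, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0)]
n = len(data)

def prior(y):
    # Pr(Y=y): fraction of examples with class label y
    return sum(1 for row in data if row[2] == y) / n

def likelihood(j, x, y, c=0):
    # Pr(X(j)=x | Y=y), optionally smoothed with pseudo-count c
    num = sum(1 for row in data if row[j] == x and row[2] == y)
    den = sum(1 for row in data if row[2] == y)
    return (c + num) / (2 * c + den)

print(prior(1))             # 3/8 = 0.375
print(likelihood(0, 1, 1))  # Pr(X(1)=1|Y=1) = 2/3
print(likelihood(0, 1, 0))  # Pr(X(1)=1|Y=0) = 2/5
print(likelihood(1, 1, 1))  # Pr(X(2)=1|Y=1) = 1/3
print(likelihood(1, 1, 0))  # Pr(X(2)=1|Y=0) = 1/5
```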
Meaning of the estimations
• The formulas for estimating the parameters are intuitive
• In fact they are also the maximum likelihood estimators – the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials
– Let q = Pr(X(j)=1|Y=1) be the probability for a flu patient to have fever
– This likelihood can be expressed as shown below
• That is, if a flu patient has fever, we include a q in the product; if a flu patient does not have fever, we include a 1−q in the product
– Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since the logarithm is an increasing function (a > b ⟺ ln a > ln b)
– This value can be found by differentiating the log likelihood and equating it to zero:
– The formula for estimating the prior probabilities Pr(Y) can be similarly derived
$$L(q) = \prod_{i: Y_i = 1} q^{\mathbb{I}(X_i^{(j)} = 1)} (1 - q)^{1 - \mathbb{I}(X_i^{(j)} = 1)}$$
$$\ln L(q) = \sum_{i: Y_i = 1} \left[ \mathbb{I}(X_i^{(j)} = 1) \ln q + \left(1 - \mathbb{I}(X_i^{(j)} = 1)\right) \ln(1 - q) \right]$$
$$\frac{d}{dq} \ln L(q) = \sum_{i: Y_i = 1} \left[ \frac{\mathbb{I}(X_i^{(j)} = 1)}{q} - \frac{1 - \mathbb{I}(X_i^{(j)} = 1)}{1 - q} \right]$$
$$\frac{d}{dq} \ln L(q) = 0 \;\Rightarrow\; q = \frac{\sum_{i: Y_i = 1} \mathbb{I}(X_i^{(j)} = 1)}{\sum_{i: Y_i = 1} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = 1, Y_i = 1)}{\sum_{i=1}^{n} \mathbb{I}(Y_i = 1)}$$
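As a small numeric sanity check of this derivation (a sketch using the counts from the earlier question: 70 of 100 flu patients have fever), a grid search confirms that the log likelihood peaks at q = k/n:

```python
from math import log

def log_likelihood(q, k=70, n=100):
    # Bernoulli log likelihood: k flu patients with fever out of n flu patients
    return k * log(q) + (n - k) * log(1 - q)

qs = [i / 1000 for i in range(1, 1000)]
print(max(qs, key=log_likelihood))  # 0.7, i.e., k/n, as the derivation predicts
```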
Short summary
• So far, we have obtained the formulas for estimating the parameters of a Naïve Bayes model, which correspond to the parameter values, among all possible values, that maximize the data likelihood
• The parameter estimates:
– Prior probabilities (first formula below)
– Likelihoods (second formula below)
$$\Pr(Y = y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i = y)$$
$$\Pr(X^{(j)} = x \mid Y = y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = x, Y_i = y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
Using the model
• Now with Pr(Y=y) and Pr(X(j)=x|Y=y) estimated for all features j and all values x and y, the model can be applied to estimate Pr(Y=y|X) for any X, either in the training set or not
– Recall the formula below
– For classification, we can compare Pr(Y=1|X) and Pr(Y=0|X), and
• Predict X to be of class 1 if the former is larger
• Predict X to be of class 0 if the latter is larger
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X, Y = y)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X \mid Y = y)\Pr(Y = y)}$$
Example
• Suppose we have the same training data as before
• Parameter values of the Naïve Bayes model we previously estimated:
– Pr(Y=1) = 3/8
– Pr(X(1)=1|Y=1) = 2/3
– Pr(X(1)=1|Y=0) = 2/5
– Pr(X(2)=1|Y=1) = 1/3
– Pr(X(2)=1|Y=0) = 1/5
• Now, for a new subject with fever but no headache, we would predict the probability of having flu as
Pr(Y=1|X(1)=1,X(2)=0)
= Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) / [Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) + Pr(X(1)=1|Y=0)Pr(X(2)=0|Y=0)Pr(Y=0)]
= (2/3)(1−1/3)(3/8) / [(2/3)(1−1/3)(3/8) + (2/5)(1−1/5)(1−3/8)]
= 5/11
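Continuing the earlier sketch (reusing the hypothetical `prior` and `likelihood` helpers defined after the training table), the 5/11 figure can be reproduced as follows:

```python
def posterior_flu(x1, x2):
    # Pr(Y=1 | X(1)=x1, X(2)=x2) under the conditional independence assumption
    num1 = likelihood(0, x1, 1) * likelihood(1, x2, 1) * prior(1)
    num0 = likelihood(0, x1, 0) * likelihood(1, x2, 0) * prior(0)
    return num1 / (num1 + num0)

print(posterior_flu(1, 0))  # 5/11 = 0.4545..., so we would predict Y=0 (no flu)
```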
Numeric features
• If X(j) can take on continuous values, we need a continuous distribution instead of a discrete one
– Fever is a feature with binary values: 1 means "has fever"; 0 means "does not have fever"
– Body temperature is a feature with continuous values
• For features with binary values, we have assumed that each feature X(j) has a Bernoulli distribution conditioned on Y, i.e., Pr(X(j)=1|Y=y) = q, with the value of parameter q to be estimated
• For continuous values, we can similarly estimate Pr(X(j)=x|Y=y) based on an assumed distribution
Gaussian distribution
• Suppose the body temperatures of flu patients follow a Gaussian distribution:
– There are two parameters to estimate:
• The mean (center) of the distribution, μ
• The variance (spread) of the distribution, σ²
[Figure: probability density of the body temperature of people with flu – a bell-shaped (Gaussian) curve over roughly 35–41 °C]
Estimating the parameters
• Maximum likelihood estimation [optional]:
$$\Pr(X_i^{(j)} = x \mid Y = y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i^{(j)} - \mu)^2}{2\sigma^2}}$$
$$L(\mu, \sigma) = \prod_{i: Y_i = y} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i^{(j)} - \mu)^2}{2\sigma^2}}$$
$$\ln L(\mu, \sigma) = \sum_{i: Y_i = y} \left[ \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(X_i^{(j)} - \mu)^2}{2\sigma^2} \right]$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = \sum_{i: Y_i = y} \frac{X_i^{(j)} - \mu}{\sigma^2}$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = 0 \;\Rightarrow\; \mu = \frac{\sum_{i: Y_i = y} X_i^{(j)}}{\sum_{i: Y_i = y} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = \sum_{i: Y_i = y} \left[ -\frac{1}{\sigma} + \frac{(X_i^{(j)} - \mu)^2}{\sigma^3} \right]$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \sigma^2 = \frac{\sum_{i: Y_i = y} (X_i^{(j)} - \mu)^2}{\sum_{i: Y_i = y} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, (X_i^{(j)} - \mu)^2}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
Estimating the parameters
• Results:
– The formulas are shown below
– Meanings: the mean and variance of the training data
• The formula for the variance is a biased estimator (i.e., when you have many sets of training data and each time you estimate the variance by this formula, the average of the estimates does not converge to the actual variance of the Gaussian distribution).
• May use the sample variance instead, which is the minimum variance unbiased estimator – see further readings.
$$\mu = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)} \qquad \sigma^2 = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, (X_i^{(j)} - \mu)^2}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
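A minimal sketch of these estimators in Python (the body temperatures are made-up values standing in for the class-y training examples):

```python
from math import exp, pi, sqrt

# Hypothetical body temperatures of training subjects with Y=y (e.g., flu patients)
temps = [38.2, 39.0, 37.8, 38.6, 39.4]
n_y = len(temps)

mu = sum(temps) / n_y                          # maximum likelihood mean
var = sum((x - mu) ** 2 for x in temps) / n_y  # maximum likelihood (biased) variance
var_unbiased = sum((x - mu) ** 2 for x in temps) / (n_y - 1)  # sample variance

def density(x):
    # Class-conditional density Pr(X(j)=x | Y=y) under the fitted Gaussian
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

print(mu, var, var_unbiased, density(38.0))
```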
LOGISTIC REGRESSION
Part 3
Discriminative learning
• In the Bayes and Naïve Bayes classifiers, in order to compute Pr(Y|X) we need to model Pr(X,Y) or Pr(X|Y)Pr(Y)
– It seems to be complicating things: using the solution of a harder problem [modeling Pr(X,Y)] to answer an easier question [Pr(Y|X)]
– We may not always have a good idea how to model Pr(X|Y) and Pr(Y)
• For example, while assuming a Gaussian for Pr(X|Y) is mathematically convenient, is it really suitable?
• What if we cannot find a good, well-studied distribution that fits the data well, or it is difficult to derive the maximum likelihood estimation formulas?
• We now study a discriminative method that models Pr(Y|X) directly
Logistic regression: the idea
• The logistic regression model relies on the assumption that the class can be determined by a linear combination of the features
– Conceptually, we hope to have a rule of this type: "If a1X(1) + a2X(2) + ... + amX(m) ≥ t, then Y=1; otherwise, Y=0"
• If 0.2 <body temperature> + 0.5 <headache> + 0.6 <running nose> ≥ 8.1, then <has flu> = 1
• The coefficients a1, a2, ..., am and the threshold t are model parameters, the values of which we want to estimate from training data
• Graphically, the rule is a step function:
[Figure: the step-function rule – <has flu> jumps from 0 to 1 when 0.2 <body temperature> + 0.5 <headache> + 0.6 <running nose> reaches the threshold 8.1]
Logistic regression: actual form
• However, the step function is mathematically not easy to handle
– For instance, it is not smooth, and thus not differentiable
• Let's model Pr(Y=1|X) using a smooth function, and then derive the classification rule based on it:
– Let f(X) = exp(a1X(1) + a2X(2) + ... + amX(m) − t)
– Pr(Y=1|X) = f(X) / [1 + f(X)] – the logistic function
– Pr(Y=0|X) = 1 − Pr(Y=1|X) = 1 / [1 + f(X)]
• When a1X(1) + a2X(2) + ... + amX(m) >> t, Pr(Y=1|X) ≈ 1 and Pr(Y=0|X) ≈ 0
• When a1X(1) + a2X(2) + ... + amX(m) << t, Pr(Y=1|X) ≈ 0 and Pr(Y=0|X) ≈ 1
• When a1X(1) + a2X(2) + ... + amX(m) = t, Pr(Y=1|X) = Pr(Y=0|X) = 0.5
– We predict X to be of class 1 if Pr(Y=1|X) ≥ Pr(Y=0|X), i.e., f(X) ≥ 1
• We need to estimate the values of the model parameters a1, ..., am and t from training data. We will discuss how to do this shortly.
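A sketch of this model in Python (the coefficients, threshold and feature values below are arbitrary placeholders, not fitted values):

```python
from math import exp

def pr_y1_given_x(x, a, t):
    # Logistic function: Pr(Y=1|X) = f(X) / (1 + f(X)), with f(X) = exp(a.x - t)
    f = exp(sum(aj * xj for aj, xj in zip(a, x)) - t)
    return f / (1 + f)

a = [0.2, 0.5, 0.6]   # hypothetical coefficients a1, a2, a3
t = 8.1               # hypothetical threshold
x = [39.0, 1, 1]      # body temperature, headache, running nose

p1 = pr_y1_given_x(x, a, t)
print(p1, "predict Y=1" if p1 >= 0.5 else "predict Y=0")
```

Note that Pr(Y=1|X) ≥ 0.5 exactly when f(X) ≥ 1, i.e., when a1X(1) + ... + amX(m) ≥ t, so the smooth model reproduces the step-function rule at the decision boundary.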
Visualizing the functions
• The rule "if f(X) ≥ 1, then predict Y=1" is exactly the same as "if a1X(1) + a2X(2) + ... + amX(m) ≥ t, then predict Y=1"
[Figures: (1) the original step function of Y against a1X(1) + a2X(2) + ... + amX(m); (2) Pr(Y=1|X) and Pr(Y=0|X) as smooth curves of the same quantity; (3) f(X), the ratio of Pr(Y=1|X) to Pr(Y=0|X), on a logarithmic scale. Parameter values can be set so that the smooth curves are very similar to the step function.]
Estimating the parameters
• For Naïve Bayes, we estimated parameters using maximum likelihood, i.e., we found the θ that maximizes the likelihoods (the first two products below), or their logarithms
• For logistic regression, we do not have models for these probabilities, so instead we directly maximize the conditional data likelihood (the last product below), where θ includes the parameters a1, a2, ..., am and t
$$\prod_{i=1}^{n} \Pr(X_i \mid Y_i, \theta) \qquad \prod_{i=1}^{n} \Pr(Y_i \mid \theta) \qquad\text{(likelihoods maximized for Naïve Bayes)}$$
$$\prod_{i=1}^{n} \Pr(Y_i \mid X_i, \theta) \qquad\text{(conditional data likelihood maximized for logistic regression)}$$
Maximizing the conditional likelihood
• The log conditional data likelihood can be written as follows [optional]:
$$\ln L(\theta) = \ln \prod_{i=1}^{n} \Pr(Y_i \mid X_i, \theta)$$
$$= \ln \prod_{i=1}^{n} \Pr(Y_i = 1 \mid X_i, \theta)^{Y_i} \Pr(Y_i = 0 \mid X_i, \theta)^{1 - Y_i}$$
$$= \sum_{i=1}^{n} \left[ Y_i \ln \Pr(Y_i = 1 \mid X_i, \theta) + (1 - Y_i) \ln \Pr(Y_i = 0 \mid X_i, \theta) \right]$$
$$= \sum_{i=1}^{n} \left[ Y_i \ln \frac{\Pr(Y_i = 1 \mid X_i, \theta)}{\Pr(Y_i = 0 \mid X_i, \theta)} + \ln \Pr(Y_i = 0 \mid X_i, \theta) \right]$$
$$= \sum_{i=1}^{n} \left[ Y_i \ln \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right) + \ln \frac{1}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
$$= \sum_{i=1}^{n} \left[ Y_i \left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right) - \ln\left(1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)\right) \right]$$
Maximizing the conditional likelihood
• Again, we can write down the expression for the partial derivative of ln L(θ) with respect to each parameter:
• However, when we set them to 0, each equation involves multiple parameters and we cannot get their optimal values separately
– That is, they form a system of non-linear equations
$$\frac{\partial \ln L(\theta)}{\partial a_k} = \sum_{i=1}^{n} \left[ Y_i X_i^{(k)} - \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} X_i^{(k)} \right] = \sum_{i=1}^{n} X_i^{(k)} \left[ Y_i - \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
$$\frac{\partial \ln L(\theta)}{\partial t} = \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
Gradient ascent
• The system of non-linear equations has no closed-form solution.
– Instead, we use a numerical method to solve it
– Main idea: since we hope each partial derivative to be zero, we move the parameters iteratively so that the derivatives get closer to zero
– For example, since ∂ln L(θ)/∂t takes the form shown below, we use the update rule for t shown after it, where η is a small constant
• On the right-hand side of the assignment, the current estimates of the parameters are used
$$\frac{\partial \ln L(\theta)}{\partial t} = \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
$$t := t + \eta \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
Meaning of gradient ascent
• Why? It is like climbing a hill
– At each point, the gradient is the direction with maximum increase
– We want to move towards that direction, but with a small step each time so as not to overshoot
• We don't know exactly how large the step should be, because if we knew that we could jump to the peak directly (i.e., when we have the closed-form formulas, as in the case of maximum likelihood for Gaussian Naïve Bayes)
[Figure: hill-climbing illustration – the surface of ln L(θ) over the parameters a1 and t, with an arrow from the current estimate in the direction of maximum increase towards the new estimate. Image source: http://www.absoluteastronomy.com/topics/Hill_climbing]
Relationship with Naïve Bayes
• Interestingly, logistic regression has a tight relationship with Naïve Bayes when each Pr(X(j)|Y) is a Gaussian distribution
• [Optional] First, in general, the posterior probability Pr(Y=1|X) of Naïve Bayes is as follows:
$$\Pr(Y=1 \mid X) = \frac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(X)} = \frac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0) + \Pr(Y=1)\Pr(X \mid Y=1)}$$
$$= \frac{\dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}}{1 + \dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}} = \frac{\exp\left[\ln\dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}\right]}$$
$$= \frac{\exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\ln\dfrac{\Pr(X^{(j)} \mid Y=1)}{\Pr(X^{(j)} \mid Y=0)}\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\ln\dfrac{\Pr(X^{(j)} \mid Y=1)}{\Pr(X^{(j)} \mid Y=0)}\right]}$$
Relationship with Naïve Bayes
• [Optional] Suppose Pr(X(j)|Y=1) has mean μj1 and variance σj², and Pr(X(j)|Y=0) has mean μj0 and variance σj² (different means but the same variance), then
$$\sum_{j=1}^{m} \ln\frac{\Pr(X^{(j)} \mid Y=1)}{\Pr(X^{(j)} \mid Y=0)} = \sum_{j=1}^{m} \ln\frac{\dfrac{1}{\sigma_j\sqrt{2\pi}} e^{-\frac{(X^{(j)} - \mu_{j1})^2}{2\sigma_j^2}}}{\dfrac{1}{\sigma_j\sqrt{2\pi}} e^{-\frac{(X^{(j)} - \mu_{j0})^2}{2\sigma_j^2}}}$$
$$= \sum_{j=1}^{m} \left[ \frac{(X^{(j)} - \mu_{j0})^2}{2\sigma_j^2} - \frac{(X^{(j)} - \mu_{j1})^2}{2\sigma_j^2} \right] = \sum_{j=1}^{m} \frac{2X^{(j)}(\mu_{j1} - \mu_{j0}) + \mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}$$
$$= \sum_{j=1}^{m} \left[ \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2} \right]$$
Relationship with Naïve Bayes
• Plugging this back into the formula for Pr(Y=1|X), we get the expression below, which is exactly the form of logistic regression, with the coefficients aj and threshold t given underneath
– The weight of a feature, aj, depends on how well it separates the two classes
– The threshold depends on the means of the Gaussians and the prior probabilities of the two classes
$$\Pr(Y=1 \mid X) = \frac{\exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\left(\dfrac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \dfrac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}\right)\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\left(\dfrac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \dfrac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}\right)\right]}$$
$$a_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} \qquad t = \sum_{j=1}^{m} \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_j^2} + \ln\frac{\Pr(Y=0)}{\Pr(Y=1)}$$
Relationship with Naïve Bayes
• In summary, if:
– A Naïve Bayes classifier models Pr(X(j)|Y) by a Gaussian distribution with equal variance for Pr(X(j)|Y=1) and Pr(X(j)|Y=0), AND
– A logistic regression classifier uses the coefficients and threshold given below,
• Then their predictions are exactly the same.
$$a_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} \qquad t = \sum_{j=1}^{m} \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_j^2} + \ln\frac{\Pr(Y=0)}{\Pr(Y=1)}$$
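A small numeric sketch of this equivalence (all parameter values are made up; with equal per-class variances, the Gaussian Naïve Bayes posterior and the logistic form built from the aj and t above coincide):

```python
from math import exp, log, pi, sqrt

mu1 = [38.5, 1.2]   # means of Pr(X(j) | Y=1) for each feature j
mu0 = [36.8, 0.3]   # means of Pr(X(j) | Y=0)
var = [0.6, 0.5]    # shared variances sigma_j^2
p1, p0 = 0.2, 0.8   # priors Pr(Y=1), Pr(Y=0)
x = [37.9, 1.0]     # an arbitrary feature vector

def gauss(v, mean, variance):
    return exp(-(v - mean) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

# Gaussian Naive Bayes posterior Pr(Y=1 | X=x)
num1, num0 = p1, p0
for j in range(len(x)):
    num1 *= gauss(x[j], mu1[j], var[j])
    num0 *= gauss(x[j], mu0[j], var[j])
post_nb = num1 / (num0 + num1)

# Logistic regression with coefficients and threshold derived from the Gaussians
a = [(mu1[j] - mu0[j]) / var[j] for j in range(len(x))]
t = sum((mu1[j]**2 - mu0[j]**2) / (2 * var[j]) for j in range(len(x))) + log(p0 / p1)
f = exp(sum(a[j] * x[j] for j in range(len(x))) - t)
post_lr = f / (1 + f)

print(post_nb, post_lr)  # the two posteriors agree
```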
Remarks
• When the conditional independence assumption is not true, logistic regression can be more accurate than the Naïve Bayes classifier
• However, when there are few observed data points (i.e., n is small), Naïve Bayes could be more accurate
CASE STUDY, SUMMARY AND FURTHER READINGS
Epilogue
Case study: Fallacies related to statistics
• "According to this gene model, this DNA sequence has a data likelihood of 0.6, while according to this model for intergenic regions, this DNA sequence has a data likelihood of 0.1. Therefore the sequence is more likely to be a gene."
– Right or wrong?
Case study: Fallacies related to statistics
• Likelihood vs. posterior:
– If Y represents whether the sequence is a gene (Y=1) or not (Y=0), and X is the sequence features, then the above statement is comparing the likelihoods Pr(X|Y=1) and Pr(X|Y=0), but we know that the posterior Pr(Y|X) = Pr(X|Y)Pr(Y)/Pr(X), and Pr(Y=1) << Pr(Y=0)
• Another famous example: "This cancer test has 99% accuracy, and is therefore highly reliable."
Case study: Fallacies related to statistics
• "Drug A is more effective than drug B for our male patients. Drug A is also more effective than drug B for our female patients. Therefore drug A is a better drug than drug B in general."
– Right or wrong?
Case study: Fallacies related to statistics
• Simpson's paradox:
– Consider the situation in the table below:
• Again, it is related to the different priors Pr(Gender) for the two drugs.
• You may argue that more females can be recruited to test drug A and more males can be recruited to test drug B.
– How about "the rate of a disease is higher for both males and females in population A than in population B"?
          Drug A                     Drug B
          Effective   Ineffective    Effective   Ineffective
Male      60          40             5           5
Female    7           3              65          35
Total     67          43             70          40
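Working out the rates from this table: drug A is effective for 60/100 = 60% of males and 7/10 = 70% of females, while drug B is effective for 5/10 = 50% of males and 65/100 = 65% of females; yet overall drug A is effective for 67/110 ≈ 61% of its patients and drug B for 70/110 ≈ 64%, because drug B was tested mostly on females, who respond better to either drug.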
Summary
• Statistical modeling allows us to predict the class Y (e.g., has flu) of an object by combining some observed features X (e.g., body temperature, fever and running nose) and some parameters θ
– Generative models: predict Pr(Y|X) by modeling Pr(Y) and Pr(X|Y)
• Example: Naïve Bayes classifier
– Discriminative models: predict Pr(Y|X) by modeling it directly
• Example: Logistic regression
Further readings
• A book chapter written by Tom Mitchell, Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression
– Available at http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
– Discusses how to avoid over-fitting by regularization
• Over-fitting: forming a model that is so complex that it fits the noise contained in the data, which does not help predictions
– Describes how logistic regression works when there are more than 2 classes
– Contains some discussion on using priors to turn maximum likelihood estimates into maximum a posteriori (MAP) estimates
– Note: some notations are different from what we use
• This year we have removed the expectation-maximization (EM) algorithm from the syllabus. The algorithm is for situations in which the concept depends on both the observed data and some unobserved hidden data
– If you are interested in learning EM, you can read its Wikipedia entry, which is quite well-written