Page 1

Lecture 6. Basic statistical modeling

The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics

Page 2

Lecture outline
1. Introduction to statistical modeling
   – Motivating examples
   – Generative and discriminative models
   – Classification and regression
2. Bayes and Naïve Bayes classifiers
3. Logistic regression

Page 3

INTRODUCTION TO STATISTICAL MODELING
Part 1

Page 4

Statistical modeling
• We have studied many biological concepts in this course
  – Genes, exons, introns, ...
• We want to provide a description of a concept by means of some observable features
  – Sometimes it can be (more or less) an exact rule:
    • The enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
  – In most cases it is not exact:
    • If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein coding sequence of a yeast gene
    • If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer

Page 5

The examples
• Reasons for the descriptions to be inexact:
  – Incomplete information
    • What mutations on BRCA1/BRCA2? Any mutations on other genes?
  – Exceptions
    • "If one has fever, he/she has a flu" – not everyone with a flu has fever, and not everyone with fever has it due to a flu
  – Intrinsic randomness

Concept/Class                               Features observable from data
DNA recognized by the enzyme EcoRI          The DNA sequence (the string)
Protein coding sequence of a yeast gene     Raw: the DNA sequence
                                            Derived: the first three characters,
                                            the last three characters, the length
Developing breast cancer                    Mutations at BRCA1 gene
                                            Mutations at BRCA2 gene

Page 6

Features known, concept unsure
• In many cases, we are interested in the situation that the features are observed but whether a concept is true is unknown
  – We know the sequence of a DNA region, but we do not know whether it corresponds to a protein coding sequence
  – We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
  – We know a subject is having fever, but we do not know whether he/she has a flu infection or not

Page 7

Statistical models
• Statistical models provide a principled way to specify the inexact descriptions
• For the flu example, using some symbols:
  – X: a set of features
    • In this example, a single binary feature with X=1 if a subject has fever and X=0 if not
  – Y: the target concept
    • In this example, a binary concept with Y=1 if a subject has flu and Y=0 if not
  – A model is a function that predicts values of Y based on observed values of X and the parameters θ

Page 8

Parameters
• Some details of a statistical model are provided by its parameters, θ
  – Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
    • That is, for each person with flu, the probability for him/her to have fever is q1 and the probability not to have fever is 1-q1
    • Different people are assumed to be statistically independent
  – Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
  – Finally, the probability for a person to have flu is p
  – Then the whole set of parameters is θ = {p, q1, q2}

Page 9

Basic probabilities
• Pr(X)Pr(Y|X) = Pr(X and Y)
  – If there is a 20% chance to rain tomorrow, and whenever it rains, there is a 60% chance that the temperature will drop, then there is a 0.2*0.6=0.12 chance that tomorrow it will both rain and have a temperature drop
  – Capital letters mean it is true for all values of X and Y
  – Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
• Law of total probability:
  $$\Pr(X) = \sum_{y} \Pr(X \text{ and } Y=y)$$
  (The summation should consider all possible values of Y)
  – If there is
    • A 0.12 chance that it will both rain and have a temperature drop tomorrow, and
    • A 0.08 chance that it will both rain and not have a temperature drop tomorrow
  – Then there is a 0.12+0.08 = 0.2 chance that it will rain tomorrow
• Bayes' rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0
  – Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y)
  – Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0

Page 10

A complete numeric example
• Assume the following parameters (X: has fever or not; Y: has flu or not):
  – 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
  – 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
  – 20% of people have flu: Pr(Y=1) = 0.2
• We have a simple model to predict Y from θ and X:
  – Probability that someone has fever:
    Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0)
            = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0)
            = (0.7)(0.2) + (0.1)(1-0.2) = 0.22
  – Probability that someone has flu, given that he/she has fever:
    Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1)/Pr(X=1) = (0.7)(0.2) / 0.22 = 0.64
  – Probability that someone does not have flu, given that he/she has fever:
    Pr(Y=0|X=1) = 1 - Pr(Y=1|X=1) = 0.36
  – Probability that someone has flu, given that he/she does not have fever:
    Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0)
                = [1 - Pr(X=1|Y=1)]Pr(Y=1) / [1 - Pr(X=1)]
                = (1 – 0.7)(0.2) / (1 – 0.22) = 0.08
  – Probability that someone does not have flu, given that he/she does not have fever:
    Pr(Y=0|X=0) = 1 – Pr(Y=1|X=0) = 0.92
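The arithmetic above can be checked with a short Python sketch (an illustrative addition, not part of the original slides):

```python
# Model parameters (theta)
p_fever_given_flu = 0.7      # Pr(X=1 | Y=1)
p_fever_given_noflu = 0.1    # Pr(X=1 | Y=0)
p_flu = 0.2                  # Pr(Y=1)

# Law of total probability: Pr(X=1)
p_fever = p_fever_given_flu * p_flu + p_fever_given_noflu * (1 - p_flu)

# Bayes' rule: Pr(Y=1 | X=1) and Pr(Y=1 | X=0)
p_flu_given_fever = p_fever_given_flu * p_flu / p_fever
p_flu_given_nofever = (1 - p_fever_given_flu) * p_flu / (1 - p_fever)

print(round(p_fever, 2))              # 0.22
print(round(p_flu_given_fever, 2))    # 0.64
print(round(p_flu_given_nofever, 2))  # 0.08
```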

Page 11

Statistical estimation
• Questions we can ask:
  – Given a model, what is the likelihood of the observation?
    • Pr(X|Y,θ) – in the previous page, θ was omitted for simplicity
    • If a person has flu, how likely would he/she have fever?
  – Given an observation, what is the probability that a concept is true?
    • Pr(Y|X,θ)
    • If a person has fever, what is the probability that he/she has flu?
  – Given some observations, what is the likelihood of a parameter value?
    • Pr(θ|X), or Pr(θ|X,Y) if whether the concept is true is also known
    • Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?

Page 12

Statistical estimation
• Questions we can ask (cont'd):
  – Maximum likelihood estimation: Given a model with unknown parameter values, what parameter values can maximize the data likelihood?
    • $\arg\max_{\theta} \Pr(X|Y,\theta)$ or $\arg\max_{\theta} \Pr(X,Y|\theta)$
  – Prediction of concept: Given a model and an observation, what is the concept most likely to be true?
    • $\arg\max_{y} \Pr(Y=y|X,\theta)$

Page 13

Generative vs. discriminative modeling

• If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
  – Because we can use the model to generate data
  – Example: Naïve Bayes
• If a model predicts Y by providing information about Pr(Y|X) directly without providing information about Pr(X,Y), it is called a discriminative model
  – Example: Logistic regression

Page 14

Classification vs. regression
• If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
  – Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in the life time
    • Y=1: the subject will develop breast cancer
    • Y=0: the subject will not develop breast cancer
• If Y takes on continuous values, it is a regression problem and the model is called an estimator
  – Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
    • Y: lifespan of the subject

Page 15

BAYES AND NAÏVE BAYES CLASSIFIERS
Part 2

Page 16

Bayes classifiers
• In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes' rule:
  $$\Pr(Y|X) = \frac{\Pr(X|Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X|Y)\Pr(Y)}{\sum_{y}\Pr(X, Y=y)} = \frac{\Pr(X|Y)\Pr(Y)}{\sum_{y}\Pr(X|Y=y)\Pr(Y=y)}$$
• We use capital letters to represent variables (single-valued or vector), and small letters to represent values
• When we do not specify the value, it means something is true for all values. For example, all the following are true according to Bayes' rule:
  – Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
  – Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
  – Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
  – Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)

Page 17

Terminology
• $$\Pr(Y|X) = \frac{\Pr(X|Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X|Y)\Pr(Y)}{\sum_{y}\Pr(X, Y=y)} = \frac{\Pr(X|Y)\Pr(Y)}{\sum_{y}\Pr(X|Y=y)\Pr(Y=y)}$$
  – Pr(Y) is called the prior probability
    • E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
    • Can be considered the prior guess that the concept is true before seeing any evidence
  – Pr(X|Y) is called the likelihood
    • E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
  – Pr(Y|X) is called the posterior probability
    • E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever

Page 18

Generalizations
• In general, the above is true even if:
  – X involves a set of features X={X(1), X(2), ..., X(m)} instead of a single feature
    • Example: predict whether one has flu after knowing whether he/she has fever, headache and running nose
  – X can take on continuous values
    • In that case, Pr(X) is the probability density of X
    • Examples:
      – Predict whether a person has flu after knowing his/her body temperature
      – Predict whether a gene is involved in a biological pathway given its expression values in several conditions

Page 19

Parameter estimation
• Let's consider the discrete case first
• Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn) – the training set
• How many parameters are there in the model?
  – We need to know the prior probabilities, Pr(Y)
    • Two parameters: Pr(Y=1), Pr(Y=0)
    • Since Pr(Y=1) = 1 - Pr(Y=0), only one independent parameter
  – We need to know the likelihoods, Pr(X|Y)
    • Suppose we have m binary features: fever, headache, running nose, ...
    • 2^(m+1) parameters for all X and Y value combinations
    • 2(2^m - 1) independent parameters, since for each value y of Y, the sum of all Pr(X=x|Y=y) is one
  – Total: 2(2^m - 1) + 1 independent parameters
• How large should n be in order to estimate these parameters accurately?
  – Very large, given the exponential number of parameters

Page 20

List of all the parameters
• Let Y be having flu (Y=1) or not (Y=0)
• Let X(1) be having fever (X(1)=1) or not (X(1)=0)
• Let X(2) be having headache (X(2)=1) or not (X(2)=0)
• Let X(3) be having running nose (X(3)=1) or not (X(3)=0)
• Then the complete list of parameters for a generative model is (variables not independent are in gray on the original slide):
  – Pr(Y=0), Pr(Y=1)
  – Pr(X(1)=0, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=0)
  – Pr(X(1)=0, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=1)

Page 21

Why is having many parameters a problem?
• Statistically, we will need a lot of data to accurately estimate the values of the parameters
  – Imagine that we need to estimate the 15 parameters on the last page with only data about 20 people
• Computationally, estimating the values of an exponential number of parameters could take a long time

Page 22

Conditional independence

• One way to reduce the number of parameters is to assume conditional independence: If X(1) and X(2) are two features, then
  – Pr(X(1), X(2)|Y)
    = Pr(X(1)|Y,X(2))Pr(X(2)|Y)   [Standard probability]
    = Pr(X(1)|Y)Pr(X(2)|Y)        [Conditional independence assumption]
  – Probability for a flu patient to have fever is independent of whether he/she has running nose
  – Important: This does not imply unconditional independence, i.e., X(1) and X(2) are not assumed independent, and thus we cannot say Pr(X(1), X(2)) = Pr(X(1))Pr(X(2))
    • Without knowing whether a person has flu, having fever and having running nose are definitely correlated

Page 23

Conditional independence and Naïve Bayes

• Number of parameters after making the conditional independence assumption:
  – 2 prior probabilities Pr(Y=0) and Pr(Y=1)
    • Only 1 independent parameter, as Pr(Y=1) = 1 – Pr(Y=0)
  – 4m likelihoods Pr(X(j)=x|Y=y) for all possible values of j, x and y
    • Only 2m independent parameters, as Pr(X(j)=1|Y=y) = 1 – Pr(X(j)=0|Y=y) for all possible values of j and y
  – Total: 4m+2 parameters (2m+1 independent), which is much smaller than 2(2^m - 1) + 1!
• The resulting model is usually called a Naïve Bayes model

Page 24

Estimating the parameters
• Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
• The prior probabilities can be estimated in this way:
  – $\Pr(Y=y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i = y)$, where $\mathbb{I}$ is the indicator function, with $\mathbb{I}(\text{true}) = 1$ and $\mathbb{I}(\text{false}) = 0$
  – That is, the fraction of examples with class label y
• Similarly, for any particular feature X(j), its likelihoods can be estimated in this way:
  – $\Pr(X^{(j)}=x|Y=y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=x,\, Y_i=y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i=y)}$
  – That is, the fraction of class y examples having value x at feature X(j)
  – To avoid zeros, we can add pseudo-counts:
    $\Pr(X^{(j)}=x|Y=y) = \frac{c + \sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=x,\, Y_i=y)}{2c + \sum_{i=1}^{n} \mathbb{I}(Y_i=y)}$, where c has a small value

Page 25

Example
• Suppose we have the training data as shown below
• How many parameters does the Naïve Bayes model have?
• Estimated parameter values using the formulas on the last page:
  – Pr(Y=1) = 3/8
  – Pr(X(1)=1|Y=1) = 2/3
  – Pr(X(1)=1|Y=0) = 2/5
  – Pr(X(2)=1|Y=1) = 1/3
  – Pr(X(2)=1|Y=0) = 1/5

Subject i   Has fever? X(1)   Has headache? X(2)   Has flu? Y
1           Yes               Yes                  Yes
2           Yes               No                   Yes
3           No                No                   Yes
4           Yes               No                   No
5           No                Yes                  No
6           No                No                   No
7           No                No                   No
8           Yes               No                   No
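As an illustration (not part of the original slides), the following Python sketch recomputes these estimates directly from the table, using the counting formulas from the last page:

```python
# Training data from the table: (has fever X1, has headache X2, has flu Y)
data = [
    (1, 1, 1), (1, 0, 1), (0, 0, 1), (1, 0, 0),
    (0, 1, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0),
]

n = len(data)
n_flu = sum(y for _, _, y in data)

prior_flu = n_flu / n                                                    # Pr(Y=1) = 3/8
p_fever_flu = sum(x1 for x1, _, y in data if y == 1) / n_flu             # Pr(X(1)=1|Y=1) = 2/3
p_fever_noflu = sum(x1 for x1, _, y in data if y == 0) / (n - n_flu)     # Pr(X(1)=1|Y=0) = 2/5
p_headache_flu = sum(x2 for _, x2, y in data if y == 1) / n_flu          # Pr(X(2)=1|Y=1) = 1/3
p_headache_noflu = sum(x2 for _, x2, y in data if y == 0) / (n - n_flu)  # Pr(X(2)=1|Y=0) = 1/5

print(prior_flu, p_fever_flu, p_fever_noflu, p_headache_flu, p_headache_noflu)
```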

Page 26

Meaning of the estimations
• The formulas for estimating the parameters are intuitive
• In fact they are also the maximum likelihood estimators – the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials
  – Let q = Pr(X(j)=1|Y=1) be the probability for a flu patient to have fever
  – This likelihood can be expressed as
    $$L(q) = \prod_{i: Y_i = 1} q^{\mathbb{I}(X_i^{(j)}=1)} (1-q)^{1-\mathbb{I}(X_i^{(j)}=1)}$$
    • That is, if a flu patient has fever, we include a q in the product; if a flu patient does not have fever, we include a 1-q in the product
  – Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since logarithm is an increasing function (a > b ⟺ ln a > ln b)
  – This value can be found by differentiating the log likelihood and equating it to zero:
    $$\ln L(q) = \sum_{i: Y_i = 1} \left\{ \mathbb{I}(X_i^{(j)}=1) \ln q + \left[1 - \mathbb{I}(X_i^{(j)}=1)\right] \ln(1-q) \right\}$$
    $$\frac{d}{dq} \ln L(q) = \sum_{i: Y_i = 1} \left\{ \frac{\mathbb{I}(X_i^{(j)}=1)}{q} - \frac{1 - \mathbb{I}(X_i^{(j)}=1)}{1-q} \right\}$$
    $$\frac{d}{dq} \ln L(q) = 0 \;\Rightarrow\; q = \frac{\sum_{i: Y_i = 1} \mathbb{I}(X_i^{(j)}=1)}{\sum_{i: Y_i = 1} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=1,\, Y_i=1)}{\sum_{i=1}^{n} \mathbb{I}(Y_i=1)}$$
  – The formula for estimating the prior probabilities Pr(Y) can be similarly derived
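As a quick numerical check (an illustrative addition), the grid search below maximizes this Bernoulli log likelihood for the earlier question of 70 out of 100 flu patients having fever; the maximum indeed lies at q = 0.7, the count fraction given by the closed-form result:

```python
import numpy as np

k, n_flu = 70, 100  # 70 of 100 flu patients have fever
q = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(q) + (n_flu - k) * np.log(1 - q)  # ln L(q) for these counts
print(q[np.argmax(log_lik)])  # ~0.7, matching the maximum likelihood estimator
```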

Page 27

Short summary
• So far, we have got the formulas for estimating the parameters of a Naïve Bayes model, which correspond to the parameter values, among all possible values, that maximize the data likelihood
• The parameter estimates:
  – Prior probabilities:
    $$\Pr(Y=y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(Y_i = y)$$
  – Likelihoods:
    $$\Pr(X^{(j)}=x|Y=y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=x,\, Y_i=y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i=y)}$$

Page 28

Using the model
• Now with Pr(Y=y) and Pr(X(j)=x|Y=y) estimated for all features j and all values x and y, the model can be applied to estimate Pr(Y=y|X) for any X, either in the training set or not
  – Recall that
    $$\Pr(Y|X) = \frac{\Pr(X|Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X|Y)\Pr(Y)}{\sum_{y}\Pr(X, Y=y)} = \frac{\Pr(X|Y)\Pr(Y)}{\sum_{y}\Pr(X|Y=y)\Pr(Y=y)}$$
  – For classification, we can compare Pr(Y=1|X) and Pr(Y=0|X), and
    • Predict X to be of class 1 if the former is larger
    • Predict X to be of class 0 if the latter is larger

Page 29

Example
• Suppose we have the same training data as shown below
• Parameter values of the Naïve Bayes model we have previously estimated:
  – Pr(Y=1) = 3/8
  – Pr(X(1)=1|Y=1) = 2/3
  – Pr(X(1)=1|Y=0) = 2/5
  – Pr(X(2)=1|Y=1) = 1/3
  – Pr(X(2)=1|Y=0) = 1/5
• Now, for a new subject with fever but not headache, we would predict its probability of having flu as
  Pr(Y=1|X(1)=1,X(2)=0)
  = Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) / [Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) + Pr(X(1)=1|Y=0)Pr(X(2)=0|Y=0)Pr(Y=0)]
  = (2/3)(1-1/3)(3/8) / [(2/3)(1-1/3)(3/8) + (2/5)(1-1/5)(1-3/8)]
  = 5/11

Subject i   Has fever? X(1)   Has headache? X(2)   Has flu? Y
1           Yes               Yes                  Yes
2           Yes               No                   Yes
3           No                No                   Yes
4           Yes               No                   No
5           No                Yes                  No
6           No                No                   No
7           No                No                   No
8           Yes               No                   No
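For illustration (not part of the original slides), the following Python sketch applies the Naïve Bayes formula with these estimated parameters and reproduces the 5/11 prediction:

```python
from fractions import Fraction as F

# Naive Bayes parameters estimated from the training table above
prior_flu = F(3, 8)                    # Pr(Y=1)
p_fever = {1: F(2, 3), 0: F(2, 5)}     # Pr(X(1)=1 | Y=y)
p_headache = {1: F(1, 3), 0: F(1, 5)}  # Pr(X(2)=1 | Y=y)

def joint(y, fever, headache):
    """Pr(X(1)=fever, X(2)=headache, Y=y) under the conditional independence assumption."""
    prior = prior_flu if y == 1 else 1 - prior_flu
    px1 = p_fever[y] if fever else 1 - p_fever[y]
    px2 = p_headache[y] if headache else 1 - p_headache[y]
    return prior * px1 * px2

# New subject with fever but no headache
num = joint(1, fever=1, headache=0)
den = num + joint(0, fever=1, headache=0)
print(num / den)  # 5/11
```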

Page 30

Numeric features
• If X(j) can take on continuous values, we need a continuous distribution instead of a discrete one
  – Fever is a feature with binary values: 1 means "has fever"; 0 means "does not have fever"
  – Body temperature is a feature with continuous values
• For the features with binary values, we have assumed that each feature X(j) has a Bernoulli distribution conditioned on Y, i.e., Pr(X(j)=1|Y=y) = q, with the value of parameter q to be estimated
• For continuous values, we can similarly estimate Pr(X(j)=x|Y=y) based on an assumed distribution

Page 31

Gaussian distribution
• Suppose the body temperatures of flu patients follow a Gaussian distribution:
  – There are two parameters to estimate:
    • The mean (center) of the distribution, μ
    • The variance (spread) of the distribution, σ²

[Figure: probability density of the body temperature of people with flu, a bell-shaped curve over roughly 35–41°C]

Page 32

Estimating the parameters
• Maximum likelihood estimations [optional]:

$$\Pr(X_i^{(j)} = x | Y = y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i^{(j)} - \mu)^2}{2\sigma^2}}$$

$$L(\mu, \sigma) = \prod_{i: Y_i = y} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i^{(j)} - \mu)^2}{2\sigma^2}}$$

$$\ln L(\mu, \sigma) = \sum_{i: Y_i = y} \left[ \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(X_i^{(j)} - \mu)^2}{2\sigma^2} \right]$$

$$\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = \sum_{i: Y_i = y} \frac{X_i^{(j)} - \mu}{\sigma^2}$$

$$\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = 0 \;\Rightarrow\; \mu = \frac{\sum_{i: Y_i = y} X_i^{(j)}}{\sum_{i: Y_i = y} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y) X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$

$$\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = \sum_{i: Y_i = y} \left[ -\frac{\sigma\sqrt{2\pi}}{\sigma^2\sqrt{2\pi}} + \frac{2(X_i^{(j)} - \mu)^2}{2\sigma^3} \right] = \sum_{i: Y_i = y} \left[ -\frac{1}{\sigma} + \frac{(X_i^{(j)} - \mu)^2}{\sigma^3} \right]$$

$$\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \sigma^2 = \frac{\sum_{i: Y_i = y} (X_i^{(j)} - \mu)^2}{\sum_{i: Y_i = y} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y) (X_i^{(j)} - \mu)^2}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$

Page 33

Estimating the parameters
• Results:
  – The formulas:
    $$\mu = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y) X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)} \qquad \sigma^2 = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y) (X_i^{(j)} - \mu)^2}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
  – Meanings: the mean and variance of the training data
    • The above formula for the variance is a biased estimation (i.e., when you have many sets of training data and each time you estimate the variance by this formula, the average of the estimations does not converge to the actual variance of the Gaussian distribution)
    • May use the sample variance instead, which is the minimum variance unbiased estimator – see further readings
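A small Python sketch of these estimators (an illustrative addition; the body temperatures and labels below are made-up values):

```python
import numpy as np

# Hypothetical body temperatures (X) and flu labels (Y), for illustration only
temps = np.array([38.5, 39.1, 37.2, 36.8, 38.9, 36.6, 37.0, 36.9])
flu = np.array([1, 1, 0, 0, 1, 0, 0, 0])

for y in (0, 1):
    x = temps[flu == y]
    mu = x.mean()                      # maximum likelihood mean for class y
    var_mle = ((x - mu) ** 2).mean()   # maximum likelihood (biased) variance, as in the formula above
    var_unbiased = x.var(ddof=1)       # sample variance, the unbiased alternative
    print(y, mu, var_mle, var_unbiased)
```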

Page 34

LOGISTIC REGRESSION
Part 3

Page 35

Discriminative learning
• In the Bayes and Naïve Bayes classifiers, in order to compute Pr(Y|X) we need to model Pr(X,Y) or Pr(X|Y)Pr(Y)
  – It seems to be complicating things: Using the solution of a harder problem [modeling Pr(X,Y)] to answer an easier question [Pr(Y|X)]
  – We may not always have a good idea how to model Pr(X|Y) and Pr(Y)
    • For example, while assuming Gaussian for Pr(X|Y) is mathematically convenient, is it really suitable?
    • What if we cannot find a good well-studied distribution that fits the data well, or it is difficult to derive the maximum likelihood estimation formulas?
• We now study a discriminative method that models Pr(Y|X) directly

Page 36

Logistic regression: the idea
• The logistic regression model relies on the assumption that the class can be determined by a linear combination of the features
  – Conceptually, we hope to have a rule of this type:
    "If a1X(1) + a2X(2) + ... + amX(m) ≥ t, then Y=1; otherwise, Y=0"
    • If 0.2 <body temperature> + 0.5 <headache> + 0.6 <running nose> ≥ 8.1, then <has flu> = 1
• The coefficients a1, a2, ..., am and the threshold t are model parameters the values of which we want to estimate from training data
• Graphically, the rule is a step function:

[Figure: step function of <has flu> (0 or 1) against 0.2 <body temperature> + 0.5 <headache> + 0.6 <running nose>, jumping from 0 to 1 at the threshold 8.1]

Page 37

Logistic regression: actual form
• However, the step function is mathematically not easy to handle
  – For instance, it is not smooth, and thus not differentiable
• Let's model Pr(Y=1|X) using a smooth function, and then derive the classification rule based on it:
  – Let f(X) = exp(a1X(1) + a2X(2) + ... + amX(m) - t)
  – Pr(Y=1|X) = f(X) / [1 + f(X)] -- the logistic function
  – Pr(Y=0|X) = 1 - Pr(Y=1|X) = 1 / [1 + f(X)]
    • When a1X(1) + a2X(2) + ... + amX(m) >> t, Pr(Y=1|X) ≈ 1 and Pr(Y=0|X) ≈ 0
    • When a1X(1) + a2X(2) + ... + amX(m) << t, Pr(Y=1|X) ≈ 0 and Pr(Y=0|X) ≈ 1
    • When a1X(1) + a2X(2) + ... + amX(m) = t, Pr(Y=1|X) = Pr(Y=0|X) = 0.5
  – We predict X to be of class 1 if Pr(Y=1|X) ≥ Pr(Y=0|X), i.e., f(X) ≥ 1
• We need to estimate the values of the model parameters a1, ..., am and t from training data. We will discuss how to do this below.
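As an illustration (not from the original slides), a minimal Python version of this model, using the hypothetical coefficients and threshold from the earlier rule rather than fitted values:

```python
import math

def logistic_pr_flu(features, coefs, t):
    """Pr(Y=1 | X) under the logistic model, with f(X) = exp(a.X - t)."""
    f = math.exp(sum(a * x for a, x in zip(coefs, features)) - t)
    return f / (1 + f)

# Hypothetical coefficients from the earlier rule (assumed, not estimated from data)
coefs = [0.2, 0.5, 0.6]   # body temperature, headache, running nose
t = 8.1

print(logistic_pr_flu([39.0, 1, 1], coefs, t))  # linear score 8.9 > 8.1, so Pr(Y=1|X) > 0.5 (about 0.69)
print(logistic_pr_flu([36.5, 0, 0], coefs, t))  # linear score 7.3 < 8.1, so Pr(Y=1|X) < 0.5 (about 0.31)
```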

Page 38

Visualizing the functions
• The rule "if f(X) ≥ 1, then predict Y=1" is exactly the same as "if a1X(1) + a2X(2) + ... + amX(m) ≥ t, then predict Y=1"

[Figures, all plotted against a1X(1) + a2X(2) + ... + amX(m): the original step function for Y; Pr(Y=1|X) and Pr(Y=0|X) as smooth logistic curves; and f(X), the ratio of Pr(Y=1|X) to Pr(Y=0|X), on a log scale. Parameter values can be set so that the step function and the logistic curve are very similar.]

Page 39

Estimating the parameters
• For Naïve Bayes, we estimated parameters using maximum likelihood, i.e., to find θ such that $\prod_{i=1}^{n} \Pr(X_i|Y_i,\theta)$ and $\prod_{i=1}^{n} \Pr(Y_i|\theta)$, or their logarithms, are maximized
• For logistic regression, we do not have models for these probabilities, so instead we directly maximize the conditional data likelihood, $\prod_{i=1}^{n} \Pr(Y_i|X_i,\theta)$, where θ includes the parameters a1, a2, ..., am and t

Page 40

Maximizing the conditional likelihood

• The log conditional data likelihood can be written as follows [optional]:

$$\begin{aligned}
\ln L(\theta) &= \ln \prod_{i=1}^{n} \Pr(Y_i|X_i,\theta) \\
&= \ln \prod_{i=1}^{n} \Pr(Y_i=1|X_i,\theta)^{Y_i} \Pr(Y_i=0|X_i,\theta)^{1-Y_i} \\
&= \sum_{i=1}^{n} \left[ Y_i \ln \Pr(Y_i=1|X_i,\theta) + (1-Y_i) \ln \Pr(Y_i=0|X_i,\theta) \right] \\
&= \sum_{i=1}^{n} \left[ Y_i \ln \frac{\Pr(Y_i=1|X_i,\theta)}{\Pr(Y_i=0|X_i,\theta)} + \ln \Pr(Y_i=0|X_i,\theta) \right] \\
&= \sum_{i=1}^{n} \left[ Y_i \ln \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right) + \ln \frac{1}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right] \\
&= \sum_{i=1}^{n} \left[ Y_i \left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right) - \ln\left(1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)\right) \right]
\end{aligned}$$

Page 41

Maximizing the conditional likelihood

• Again, we can write down the expression for the partial derivative of ln L(θ) with respect to each parameter:

$$\frac{\partial \ln L(\theta)}{\partial a_k} = \sum_{i=1}^{n} \left[ Y_i X_i^{(k)} - \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} X_i^{(k)} \right] = \sum_{i=1}^{n} X_i^{(k)} \left[ Y_i - \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$

$$\frac{\partial \ln L(\theta)}{\partial t} = \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$

• However, when we set them to 0, each equation involves multiple parameters and we cannot get their optimal values separately
  – That is, they form a system of non-linear equations

Page 42

Gradient ascent
• The system of non-linear equations has no closed-form solution
  – Instead, we use a numerical method to solve it
  – Main idea: since we hope each equation to be zero, we move them closer to zero iteratively
  – For example, since
    $$\frac{\partial \ln L(\theta)}{\partial t} = \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right],$$
    we use the following update rule for t:
    $$t := t + \eta \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right],$$
    where η is a small constant
    • In the right hand side of the assignment, the current estimates of the parameters are used
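The whole procedure can be sketched in a few lines of Python (an illustrative addition; the data set is synthetic and the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def train_logistic(X, Y, eta=0.01, iters=2000):
    """Fit a_1..a_m and t by gradient ascent on the log conditional likelihood ln L(theta).
    X: n-by-m array of feature values; Y: length-n array of 0/1 class labels."""
    n, m = X.shape
    a, t = np.zeros(m), 0.0
    for _ in range(iters):
        u = X @ a - t                    # linear score of each example
        p = 1.0 / (1.0 + np.exp(-u))     # Pr(Y=1 | X) under the current parameters
        a += eta * (X.T @ (Y - p))       # one step along the partial derivatives w.r.t. each a_k
        t += eta * np.sum(-Y + p)        # one step along the partial derivative w.r.t. t
    return a, t

# Small synthetic data set (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.5, size=100)
Y = (0.8 * X[:, 0] - 0.5 * X[:, 1] + noise > 0).astype(float)

a, t = train_logistic(X, Y)
pred = (X @ a - t >= 0).astype(float)
print(a, t, (pred == Y).mean())   # fitted parameters and training accuracy
```

In practice one would also monitor ln L(θ) between iterations to decide when to stop; the fixed iteration count here is only for simplicity.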

Page 43

Meaning of gradient ascent
• Why? It is like climbing a hill
  – At each point, the gradient is the direction with maximum increase
  – We want to move towards that direction, but for a small step each time so as not to overshoot
    • We don't know exactly how large the step should be, because if we knew that we could jump to the peak directly (i.e., when we have the closed-form formulas, as in the case of maximum likelihood for Gaussian Naïve Bayes)

[Figure: hill-climbing illustration of the surface of ln L(θ) over the parameters (a1, t), showing the current estimate, the direction with maximum increase, and the new estimate. Image source: http://www.absoluteastronomy.com/topics/Hill_climbing]

Page 44

Relationship with Naïve Bayes
• Interestingly, logistic regression has a tight relationship with Naïve Bayes when each Pr(X(j)|Y) is a Gaussian distribution
• [Optional] First, in general, the posterior probability Pr(Y=1|X) of Naïve Bayes is as follows:

$$\begin{aligned}
\Pr(Y=1|X) &= \frac{\Pr(Y=1)\Pr(X|Y=1)}{\Pr(X)} \\
&= \frac{\Pr(Y=1)\Pr(X|Y=1)}{\Pr(Y=0)\Pr(X|Y=0) + \Pr(Y=1)\Pr(X|Y=1)} \\
&= \frac{\dfrac{\Pr(Y=1)\Pr(X|Y=1)}{\Pr(Y=0)\Pr(X|Y=0)}}{1 + \dfrac{\Pr(Y=1)\Pr(X|Y=1)}{\Pr(Y=0)\Pr(X|Y=0)}} \\
&= \frac{\exp\left[\ln \dfrac{\Pr(Y=1)\Pr(X|Y=1)}{\Pr(Y=0)\Pr(X|Y=0)}\right]}{1 + \exp\left[\ln \dfrac{\Pr(Y=1)\Pr(X|Y=1)}{\Pr(Y=0)\Pr(X|Y=0)}\right]} \\
&= \frac{\exp\left[\ln \dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m} \ln \dfrac{\Pr(X^{(j)}|Y=1)}{\Pr(X^{(j)}|Y=0)}\right]}{1 + \exp\left[\ln \dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m} \ln \dfrac{\Pr(X^{(j)}|Y=1)}{\Pr(X^{(j)}|Y=0)}\right]}
\end{aligned}$$

Page 45

Relationship with Naïve Bayes
• [Optional] Suppose Pr(X(j)|Y=1) has mean μ_j1 and variance σ_j², and Pr(X(j)|Y=0) has mean μ_j0 and variance σ_j² (different means but same variance), then

$$\begin{aligned}
\sum_{j=1}^{m} \ln \frac{\Pr(X^{(j)}|Y=1)}{\Pr(X^{(j)}|Y=0)} &= \sum_{j=1}^{m} \ln \frac{\frac{1}{\sigma_j\sqrt{2\pi}} e^{-\frac{(X^{(j)} - \mu_{j1})^2}{2\sigma_j^2}}}{\frac{1}{\sigma_j\sqrt{2\pi}} e^{-\frac{(X^{(j)} - \mu_{j0})^2}{2\sigma_j^2}}} \\
&= \sum_{j=1}^{m} \left[ \frac{(X^{(j)} - \mu_{j0})^2}{2\sigma_j^2} - \frac{(X^{(j)} - \mu_{j1})^2}{2\sigma_j^2} \right] \\
&= \sum_{j=1}^{m} \frac{2X^{(j)}(\mu_{j1} - \mu_{j0}) + \mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2} \\
&= \sum_{j=1}^{m} \left[ \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2} \right]
\end{aligned}$$

Page 46

Relationship with Naïve Bayes
• Plugging it back into the formula for Pr(Y=1|X), we have

$$\Pr(Y=1|X) = \frac{\exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m} \left(\dfrac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \dfrac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}\right)\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m} \left(\dfrac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \dfrac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}\right)\right]}$$

which is exactly the form of logistic regression, with coefficients
$$a_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2}$$
and threshold
$$t = \sum_{j=1}^{m} \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_j^2} + \ln\frac{\Pr(Y=0)}{\Pr(Y=1)}$$
  – Weight of a feature, aj, depends on how well it separates the two classes
  – Threshold depends on the means of the Gaussians and the prior probabilities of the two classes

Page 47

Relationship with Naïve Bayes
• In summary, if:
  – A Naïve Bayes classifier models Pr(X(j)|Y) by a Gaussian distribution with equal variance for Pr(X(j)|Y=1) and Pr(X(j)|Y=0), AND
  – A logistic regression classifier uses the coefficients
    $$a_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2}$$
    and threshold
    $$t = \sum_{j=1}^{m} \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_j^2} + \ln\frac{\Pr(Y=0)}{\Pr(Y=1)}$$
• Then their predictions are exactly the same
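A small numerical check of this equivalence (an illustrative addition with made-up parameter values): compute Pr(Y=1|X) once directly from the Gaussian Naïve Bayes model and once from the logistic form with the coefficients and threshold above; the two results coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
mu1, mu0 = rng.normal(size=m), rng.normal(size=m)   # class-conditional means (assumed)
sigma2 = rng.uniform(0.5, 2.0, size=m)              # shared per-feature variances (assumed)
p1 = 0.3                                            # prior Pr(Y=1) (assumed)

def nb_posterior(x):
    """Pr(Y=1 | x) computed directly from the Gaussian Naive Bayes model."""
    def loglik(mu):
        return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))
    l1 = np.log(p1) + loglik(mu1)
    l0 = np.log(1 - p1) + loglik(mu0)
    return np.exp(l1) / (np.exp(l1) + np.exp(l0))

# Logistic-regression parameters implied by the Naive Bayes model
a = (mu1 - mu0) / sigma2
t = np.sum((mu1 ** 2 - mu0 ** 2) / (2 * sigma2)) + np.log((1 - p1) / p1)

def lr_posterior(x):
    f = np.exp(np.dot(a, x) - t)
    return f / (1 + f)

x = rng.normal(size=m)
print(nb_posterior(x), lr_posterior(x))   # the two numbers agree
```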

Page 48

Remarks
• When the conditional independence assumption is not true, logistic regression can be more accurate than the Naïve Bayes classifier
• However, when there are few observed data points (i.e., n is small), Naïve Bayes could be more accurate

Page 49

CASE STUDY, SUMMARY AND FURTHER READINGS
Epilogue

Page 50

Case study: Fallacies related to statistics

• "According to this gene model, this DNA sequence has a data likelihood of 0.6, while according to this model for intergenic regions, this DNA sequence has a data likelihood of 0.1. Therefore the sequence is more likely to be a gene."
  – Right or wrong?

Page 51

Case study: Fallacies related to statistics

• Likelihood vs. posterior:
  – If Y represents whether the sequence is a gene (Y=1) or not (Y=0), and X is the sequence features, then the above statement is comparing the likelihoods Pr(X|Y=1) and Pr(X|Y=0), but we know that the posterior Pr(Y|X) = Pr(X|Y)Pr(Y)/Pr(X), and Pr(Y=1) << Pr(Y=0)
• Another famous example: "This cancer test has a 99% accuracy, and therefore highly reliable."
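Why the 99% claim can mislead: a short sketch with assumed numbers (1% prevalence, 99% sensitivity and specificity; these figures are illustrative and not from the slides) shows that even then, only half of the positive results correspond to actual cancer:

```python
# Assumed numbers, for illustration: the test is right 99% of the time for both
# sick and healthy people, but only 1% of the tested population has the cancer.
sensitivity = 0.99   # Pr(test positive | cancer)
specificity = 0.99   # Pr(test negative | no cancer)
prevalence = 0.01    # Pr(cancer) -- the prior

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(round(p_cancer_given_positive, 2))  # 0.5: a positive result is far from conclusive
```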

Page 52

Case study: Fallacies related to statistics

• "Drug A is more effective than drug B for our male patients. Drug A is also more effective than drug B for our female patients. Therefore drug A is a better drug than drug B in general."
  – Right or wrong?

Page 53

Case study: Fallacies related to statistics

• Simpson's paradox:
  – Consider this situation:

                Drug A                        Drug B
                Effective    Ineffective      Effective    Ineffective
    Male        60           40               5            5
    Female      7            3                65           35
    Total       67           43               70           40

  • Again, it is related to different priors Pr(Gender) for the two drugs
  • You may argue that more females can be recruited to test drug A and more males can be recruited to test drug B
  – How about "Rate of a disease is higher for both males and females in population A than population B"?
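Computing the effectiveness rates from the table (an illustrative addition) makes the paradox explicit:

```python
# Effectiveness rates computed from the table above
groups = {
    ("A", "male"): (60, 40), ("A", "female"): (7, 3),
    ("B", "male"): (5, 5),   ("B", "female"): (65, 35),
}

def rate(effective, ineffective):
    return effective / (effective + ineffective)

for drug in ("A", "B"):
    male = rate(*groups[(drug, "male")])
    female = rate(*groups[(drug, "female")])
    total = rate(*[sum(x) for x in zip(groups[(drug, "male")], groups[(drug, "female")])])
    print(drug, round(male, 2), round(female, 2), round(total, 2))

# Drug A wins within each gender (0.6 > 0.5 and 0.7 > 0.65),
# yet drug B has the higher overall rate (0.64 > 0.61).
```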

Page 54

Summary
• Statistical modeling allows us to predict the class Y (e.g., has flu) of an object by combining some observed features X (e.g., body temperature, fever and running nose) and some parameters θ
  – Generative models: Predict Pr(Y|X) by modeling Pr(Y) and Pr(X|Y)
    • Example: Naïve Bayes classifier
  – Discriminative models: Predict Pr(Y|X) by modeling it directly
    • Example: Logistic regression

Page 55

Further readings
• A book chapter written by Tom Mitchell, Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression
  – Available at http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
  – Discusses how to avoid over-fitting by regularization
    • Over-fitting: forming a model that is too complex, so that it fits the data including the noise contained in it, which does not help predictions
  – Describes how logistic regression works when there are more than 2 classes
  – Contains some discussions on using priors to turn maximum likelihood estimates into maximum a posteriori (MAP) estimates
  – Note: Some notations are different from what we use
• This year we have removed the expectation-maximization (EM) algorithm from the syllabus. The algorithm is for situations in which the concept depends on both the observed data and some unobserved hidden data
  – If you are interested in learning EM, you can read its Wikipedia entry, which is quite well-written

Last update: 5-Oct-2015