Lecture 6. Basic statistical modeling
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
Lecture outline
1. Introduction to statistical modeling
– Motivating examples
– Generative and discriminative models
– Classification and regression
2. Bayes and Naïve Bayes classifiers
3. Logistic regression
Last update: 5-Oct-2015
INTRODUCTION TO STATISTICAL MODELING
Part 1
Statistical modeling
• We have studied many biological concepts in this course
– Genes, exons, introns, ...
• We want to provide a description of a concept by means of some observable features
– Sometimes it can be (more or less) an exact rule:
• The enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
– In most cases it is not exact:
• If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein coding sequence of a yeast gene
• If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer
The examples
• Reasons for the descriptions to be inexact:
– Incomplete information
• What mutations on BRCA1/BRCA2? Any mutations on other genes?
– Exceptions
• "If one has fever, he/she has flu" – not everyone with flu has fever, and not everyone with fever has it because of flu
– Intrinsic randomness
Concept/Class: DNA recognized by the enzyme EcoRI
– Features observable from data: the DNA sequence (the string)
Concept/Class: Protein coding sequence of a yeast gene
– Features observable from data: raw: the DNA sequence; derived: the first three characters, the last three characters, the length
Concept/Class: Developing breast cancer
– Features observable from data: mutations at the BRCA1 gene, mutations at the BRCA2 gene
Features known, concept unsure
• In many cases, we are interested in the situation where the features are observed but whether a concept is true is unknown
– We know the sequence of a DNA region, but we do not know whether it corresponds to a protein coding sequence
– We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
– We know a subject is having fever, but we do not know whether he/she has a flu infection or not
Statistical models
• Statistical models provide a principled way to specify the inexact descriptions
• For the flu example, using some symbols:
– X: a set of features
• In this example, a single binary feature with X=1 if a subject has fever and X=0 if not
– Y: the target concept
• In this example, a binary concept with Y=1 if a subject has flu and Y=0 if not
– A model is a function that predicts values of Y based on observed values of X and parameters θ
Parameters
• Some details of a statistical model are provided by its parameters, θ
– Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
• That is, for each person with flu, the probability for him/her to have fever is q1 and the probability not to have fever is 1−q1.
• Different people are assumed to be statistically independent.
– Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
– Finally, the probability for a person to have flu is p
– Then the whole set of parameters is θ = {p, q1, q2}
Basic probabilities
• Pr(X)Pr(Y|X) = Pr(X and Y)
– If there is a 20% chance of rain tomorrow, and whenever it rains there is a 60% chance that the temperature will drop, then there is a 0.2×0.6 = 0.12 chance that tomorrow it will both rain and have a temperature drop
– Capital letters mean the statement is true for all values of X and Y
– Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
• Law of total probability (see the formula below; the summation considers all possible values of Y):
– If there is
• A 0.12 chance that it will both rain and have a temperature drop tomorrow, and
• A 0.08 chance that it will both rain and not have a temperature drop tomorrow
– Then there is a 0.12+0.08 = 0.2 chance that it will rain tomorrow
• Bayes' rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0
– Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y)
– Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0
$$\Pr(X) = \sum_{y} \Pr(X \text{ and } Y = y)$$
A complete numeric example
• Assume the following parameters (X: has fever or not; Y: has flu or not):
– 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
– 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
– 20% of people have flu: Pr(Y=1) = 0.2
• We have a simple model to predict Y from θ and X:
– Probability that someone has fever:
Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0) = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0) = (0.7)(0.2) + (0.1)(1−0.2) = 0.22
– Probability that someone has flu, given that he/she has fever:
Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1)/Pr(X=1) = (0.7)(0.2) / 0.22 ≈ 0.64
– Probability that someone does not have flu, given that he/she has fever:
Pr(Y=0|X=1) = 1 − Pr(Y=1|X=1) ≈ 0.36
– Probability that someone has flu, given that he/she does not have fever:
Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0) = [1 − Pr(X=1|Y=1)]Pr(Y=1) / [1 − Pr(X=1)] = (1 − 0.7)(0.2) / (1 − 0.22) ≈ 0.08
– Probability that someone does not have flu, given that he/she does not have fever:
Pr(Y=0|X=0) = 1 − Pr(Y=1|X=0) ≈ 0.92
Statistical estimation
• Questions we can ask:
– Given a model, what is the likelihood of the observation?
• Pr(X|Y,θ) – on the previous page, θ was omitted for simplicity
• If a person has flu, how likely is he/she to have fever?
– Given an observation, what is the probability that a concept is true?
• Pr(Y|X,θ)
• If a person has fever, what is the probability that he/she has flu?
– Given some observations, what is the likelihood of a parameter value?
• Pr(θ|X), or Pr(θ|X,Y) if whether the concept is true is also known
• Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?
Statistical estimation
• Questions we can ask (cont'd):
– Maximum likelihood estimation: Given a model with unknown parameter values, what parameter values maximize the data likelihood? (either of the first two expressions below)
– Prediction of concept: Given a model and an observation, what concept is most likely to be true? (the last expression below)
$$\arg\max_\theta \Pr(X \mid Y, \theta) \quad\text{or}\quad \arg\max_\theta \Pr(X, Y \mid \theta) \qquad\text{(maximum likelihood estimation)}$$
$$\arg\max_y \Pr(Y = y \mid X, \theta) \qquad\text{(prediction of concept)}$$
Generative vs. discriminative modeling
• If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
– Because we can use the model to generate data
– Example: Naïve Bayes
• If a model predicts Y by providing information about Pr(Y|X) directly, without providing information about Pr(X,Y), it is called a discriminative model
– Example: Logistic regression
Classification vs. regression
• If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
– Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in his/her lifetime
• Y=1: the subject will develop breast cancer
• Y=0: the subject will not develop breast cancer
• If Y takes on continuous values, it is a regression problem and the model is called an estimator
– Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
• Y: lifespan of the subject
BAYES AND NAÏVE BAYES CLASSIFIERS
Part 2
Bayes classifiers
• In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes' rule (see the formula below)
• We use capital letters to represent variables (single-valued or vector), and small letters to represent values
• When we do not specify the value, it means something is true for all values. For example, all the following are true according to Bayes' rule:
– Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
– Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
– Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
– Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X, Y = y)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X \mid Y = y)\Pr(Y = y)}$$
Terminology
– Pr(Y) is called the prior probability
• E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
• Can be considered the prior guess that the concept is true before seeing any evidence
– Pr(X|Y) is called the likelihood
• E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
– Pr(Y|X) is called the posterior probability
• E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X, Y = y)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X \mid Y = y)\Pr(Y = y)}$$
Generalizations
• In general, the above is true even if:
– X involves a set of features X = {X(1), X(2), ..., X(m)} instead of a single feature
• Example: predict whether one has flu after knowing whether he/she has fever, headache and running nose
– X can take on continuous values
• In that case, Pr(X) is the probability density of X
• Examples:
– Predict whether a person has flu after knowing his/her body temperature
– Predict whether a gene is involved in a biological pathway given its expression values in several conditions
Parameter estimation
• Let's consider the discrete case first
• Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn) – the training set
• How many parameters are there in the model?
– We need to know the prior probabilities, Pr(Y)
• Two parameters: Pr(Y=1), Pr(Y=0)
• Since Pr(Y=1) = 1 − Pr(Y=0), only one independent parameter
– We need to know the likelihoods, Pr(X|Y)
• Suppose we have m binary features: fever, headache, running nose, ...
• 2^(m+1) parameters for all X and Y value combinations
• 2(2^m − 1) independent parameters, since for each value y of Y, the sum of all Pr(X=x|Y=y) is one
– Total: 2(2^m − 1) + 1 independent parameters
• How large should n be in order to estimate these parameters accurately?
– Very large, given the exponential number of parameters
List of all the parameters
• Let Y be having flu (Y=1) or not (Y=0)
• Let X(1) be having fever (X(1)=1) or not (X(1)=0)
• Let X(2) be having headache (X(2)=1) or not (X(2)=0)
• Let X(3) be having running nose (X(3)=1) or not (X(3)=0)
• Then the complete list of parameters for a generative model is (parameters that are not independent are shown in gray on the original slide):
– Pr(Y=0), Pr(Y=1)
– Pr(X(1)=0, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=0)
– Pr(X(1)=0, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=1)
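As a quick check of the counts on the previous slide: with m = 3 binary features, the list above contains 2 prior parameters (1 independent) and 2 × 2³ = 16 likelihood parameters (2 × (2³ − 1) = 14 independent), i.e., 2(2³ − 1) + 1 = 15 independent parameters in total.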
Why is having many parameters a problem?
• Statistically, we will need a lot of data to accurately estimate the values of the parameters
– Imagine that we need to estimate the 15 parameters on the last page with data about only 20 people
• Computationally, estimating the values of an exponential number of parameters could take a long time
Conditional independence
• One way to reduce the number of parameters is to assume conditional independence: if X(1) and X(2) are two features, then
– Pr(X(1), X(2)|Y)
= Pr(X(1)|Y,X(2)) Pr(X(2)|Y) [standard probability]
= Pr(X(1)|Y) Pr(X(2)|Y) [conditional independence assumption]
– E.g., the probability for a flu patient to have fever is independent of whether he/she has running nose
– Important: this does not imply unconditional independence, i.e., X(1) and X(2) are not assumed independent, and thus we cannot say Pr(X(1), X(2)) = Pr(X(1))Pr(X(2))
• Without knowing whether a person has flu, having fever and having running nose are definitely correlated
Conditional independence and Naïve Bayes
• Number of parameters after making the conditional independence assumption:
– 2 prior probabilities Pr(Y=0) and Pr(Y=1)
• Only 1 independent parameter, as Pr(Y=1) = 1 − Pr(Y=0)
– 4m likelihoods Pr(X(j)=x|Y=y) for all possible values of j, x and y
• Only 2m independent parameters, as Pr(X(j)=1|Y=y) = 1 − Pr(X(j)=0|Y=y) for all possible values of j and y
– Total: 4m+2 (of which 2m+1 are independent), which is much smaller than 2(2^m − 1) + 1!
• The resulting model is usually called a Naïve Bayes model
Estimating the parameters
• Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
• The prior probabilities can be estimated as in the first formula below, where 𝕀 is the indicator function, with 𝕀(true) = 1 and 𝕀(false) = 0
– That is, the fraction of examples with class label y
• Similarly, for any particular feature X(j), its likelihoods can be estimated as in the second formula below
– That is, the fraction of class-y examples having value x at feature X(j)
– To avoid zeros, we can add pseudo-counts, as in the third formula below, where c has a small value
$$\Pr(Y = y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i = y)$$
$$\Pr(X^{(j)} = x \mid Y = y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = x, Y_i = y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
$$\Pr(X^{(j)} = x \mid Y = y) = \frac{c + \sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = x, Y_i = y)}{2c + \sum_{i=1}^{n} \mathbb{I}(Y_i = y)} \qquad\text{(with pseudo-counts)}$$
Example
• Suppose we have the training data as shown below
• How many parameters does the Naïve Bayes model have?
• Estimated parameter values using the formulas on the last page:
– Pr(Y=1) = 3/8
– Pr(X(1)=1|Y=1) = 2/3
– Pr(X(1)=1|Y=0) = 2/5
– Pr(X(2)=1|Y=1) = 1/3
– Pr(X(2)=1|Y=0) = 1/5
Subject i   Has fever? X(1)   Has headache? X(2)   Has flu? Y
1           Yes               Yes                  Yes
2           Yes               No                   Yes
3           No                No                   Yes
4           Yes               No                   No
5           No                Yes                  No
6           No                No                   No
7           No                No                   No
8           Yes               No                   No
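The estimates above can be reproduced with a short Python sketch (a minimal implementation of the formulas from the previous slide, assuming the training table is encoded as 0/1 tuples; `prior` and `likelihood` are hypothetical helper names, and the pseudo-count c defaults to 0 to match the fractions shown here):

```python
# Training data from the table: (X(1) fever, X(2) headache, Y flu), with Yes=1, No=0
data = [(1, 1, 1), (1, 0, 1), (0, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0)]
n = len(data)

def prior(y):
    # Pr(Y=y): fraction of examples with class label y
    return sum(1 for row in data if row[2] == y) / n

def likelihood(j, x, y, c=0):
    # Pr(X(j)=x | Y=y), optionally smoothed with pseudo-count c
    num = sum(1 for row in data if row[j] == x and row[2] == y)
    den = sum(1 for row in data if row[2] == y)
    return (c + num) / (2 * c + den)

print(prior(1))             # 3/8 = 0.375
print(likelihood(0, 1, 1))  # Pr(X(1)=1|Y=1) = 2/3
print(likelihood(0, 1, 0))  # Pr(X(1)=1|Y=0) = 2/5
print(likelihood(1, 1, 1))  # Pr(X(2)=1|Y=1) = 1/3
print(likelihood(1, 1, 0))  # Pr(X(2)=1|Y=0) = 1/5
```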
Meaning of the estimations
• The formulas for estimating the parameters are intuitive
• In fact they are also the maximum likelihood estimators – the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials
– Let q = Pr(X(j)=1|Y=1) be the probability for a flu patient to have fever
– This likelihood can be expressed as shown below
• That is, if a flu patient has fever, we include a q in the product; if a flu patient does not have fever, we include a 1−q in the product
– Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since the logarithm is an increasing function (a > b ⟺ ln a > ln b)
– This value can be found by differentiating the log likelihood and equating it to zero:
– The formula for estimating the prior probabilities Pr(Y) can be similarly derived
$$L(q) = \prod_{i: Y_i = 1} q^{\mathbb{I}(X_i^{(j)} = 1)} (1 - q)^{1 - \mathbb{I}(X_i^{(j)} = 1)}$$
$$\ln L(q) = \sum_{i: Y_i = 1} \left[ \mathbb{I}(X_i^{(j)} = 1) \ln q + \left(1 - \mathbb{I}(X_i^{(j)} = 1)\right) \ln(1 - q) \right]$$
$$\frac{d}{dq} \ln L(q) = \sum_{i: Y_i = 1} \left[ \frac{\mathbb{I}(X_i^{(j)} = 1)}{q} - \frac{1 - \mathbb{I}(X_i^{(j)} = 1)}{1 - q} \right]$$
$$\frac{d}{dq} \ln L(q) = 0 \;\Rightarrow\; q = \frac{\sum_{i: Y_i = 1} \mathbb{I}(X_i^{(j)} = 1)}{\sum_{i: Y_i = 1} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = 1, Y_i = 1)}{\sum_{i=1}^{n} \mathbb{I}(Y_i = 1)}$$
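As a small numeric sanity check of this derivation (a sketch using the counts from the earlier question: 70 of 100 flu patients have fever), a grid search confirms that the log likelihood peaks at q = k/n:

```python
from math import log

def log_likelihood(q, k=70, n=100):
    # Bernoulli log likelihood: k flu patients with fever out of n flu patients
    return k * log(q) + (n - k) * log(1 - q)

qs = [i / 1000 for i in range(1, 1000)]
print(max(qs, key=log_likelihood))  # 0.7, i.e., k/n, as the derivation predicts
```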
Short summary
• So far, we have obtained the formulas for estimating the parameters of a Naïve Bayes model, which correspond to the parameter values, among all possible values, that maximize the data likelihood
• The parameter estimates:
– Prior probabilities (first formula below)
– Likelihoods (second formula below)
$$\Pr(Y = y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i = y)$$
$$\Pr(X^{(j)} = x \mid Y = y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)} = x, Y_i = y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
Using the model
• Now with Pr(Y=y) and Pr(X(j)=x|Y=y) estimated for all features j and all values x and y, the model can be applied to estimate Pr(Y=y|X) for any X, either in the training set or not
– Recall the formula below
– For classification, we can compare Pr(Y=1|X) and Pr(Y=0|X), and
• Predict X to be of class 1 if the former is larger
• Predict X to be of class 0 if the latter is larger
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X, Y = y)} = \frac{\Pr(X \mid Y)\Pr(Y)}{\sum_y \Pr(X \mid Y = y)\Pr(Y = y)}$$
Example
• Suppose we have the same training data as before
• Parameter values of the Naïve Bayes model we previously estimated:
– Pr(Y=1) = 3/8
– Pr(X(1)=1|Y=1) = 2/3
– Pr(X(1)=1|Y=0) = 2/5
– Pr(X(2)=1|Y=1) = 1/3
– Pr(X(2)=1|Y=0) = 1/5
• Now, for a new subject with fever but no headache, we would predict the probability of having flu as
Pr(Y=1|X(1)=1,X(2)=0)
= Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) / [Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) + Pr(X(1)=1|Y=0)Pr(X(2)=0|Y=0)Pr(Y=0)]
= (2/3)(1−1/3)(3/8) / [(2/3)(1−1/3)(3/8) + (2/5)(1−1/5)(1−3/8)]
= 5/11
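Continuing the earlier sketch (reusing the hypothetical `prior` and `likelihood` helpers defined after the training table), the 5/11 figure can be reproduced as follows:

```python
def posterior_flu(x1, x2):
    # Pr(Y=1 | X(1)=x1, X(2)=x2) under the conditional independence assumption
    num1 = likelihood(0, x1, 1) * likelihood(1, x2, 1) * prior(1)
    num0 = likelihood(0, x1, 0) * likelihood(1, x2, 0) * prior(0)
    return num1 / (num1 + num0)

print(posterior_flu(1, 0))  # 5/11 = 0.4545..., so we would predict Y=0 (no flu)
```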
Numeric features
• If X(j) can take on continuous values, we need a continuous distribution instead of a discrete one
– Fever is a feature with binary values: 1 means "has fever"; 0 means "does not have fever"
– Body temperature is a feature with continuous values
• For features with binary values, we have assumed that each feature X(j) has a Bernoulli distribution conditioned on Y, i.e., Pr(X(j)=1|Y=y) = q, with the value of parameter q to be estimated
• For continuous values, we can similarly estimate Pr(X(j)=x|Y=y) based on an assumed distribution
Gaussian distribution
• Suppose the body temperatures of flu patients follow a Gaussian distribution:
– There are two parameters to estimate:
• The mean (center) of the distribution, μ
• The variance (spread) of the distribution, σ²
[Figure: probability density of the body temperature of people with flu – a bell-shaped (Gaussian) curve over roughly 35–41 °C]
Estimating the parameters
• Maximum likelihood estimation [optional]:
$$\Pr(X_i^{(j)} = x \mid Y = y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i^{(j)} - \mu)^2}{2\sigma^2}}$$
$$L(\mu, \sigma) = \prod_{i: Y_i = y} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i^{(j)} - \mu)^2}{2\sigma^2}}$$
$$\ln L(\mu, \sigma) = \sum_{i: Y_i = y} \left[ \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(X_i^{(j)} - \mu)^2}{2\sigma^2} \right]$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = \sum_{i: Y_i = y} \frac{X_i^{(j)} - \mu}{\sigma^2}$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = 0 \;\Rightarrow\; \mu = \frac{\sum_{i: Y_i = y} X_i^{(j)}}{\sum_{i: Y_i = y} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = \sum_{i: Y_i = y} \left[ -\frac{1}{\sigma} + \frac{(X_i^{(j)} - \mu)^2}{\sigma^3} \right]$$
$$\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \sigma^2 = \frac{\sum_{i: Y_i = y} (X_i^{(j)} - \mu)^2}{\sum_{i: Y_i = y} 1} = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, (X_i^{(j)} - \mu)^2}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
Estimating the parameters
• Results:
– The formulas are shown below
– Meanings: the mean and variance of the training data
• The formula for the variance is a biased estimator (i.e., when you have many sets of training data and each time you estimate the variance by this formula, the average of the estimates does not converge to the actual variance of the Gaussian distribution).
• May use the sample variance instead, which is the minimum variance unbiased estimator – see further readings.
$$\mu = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)} \qquad \sigma^2 = \frac{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)\, (X_i^{(j)} - \mu)^2}{\sum_{i=1}^{n} \mathbb{I}(Y_i = y)}$$
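A minimal sketch of these estimators in Python (the body temperatures are made-up values standing in for the class-y training examples):

```python
from math import exp, pi, sqrt

# Hypothetical body temperatures of training subjects with Y=y (e.g., flu patients)
temps = [38.2, 39.0, 37.8, 38.6, 39.4]
n_y = len(temps)

mu = sum(temps) / n_y                          # maximum likelihood mean
var = sum((x - mu) ** 2 for x in temps) / n_y  # maximum likelihood (biased) variance
var_unbiased = sum((x - mu) ** 2 for x in temps) / (n_y - 1)  # sample variance

def density(x):
    # Class-conditional density Pr(X(j)=x | Y=y) under the fitted Gaussian
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

print(mu, var, var_unbiased, density(38.0))
```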
LOGISTIC REGRESSION
Part 3
Discriminative learning
• In the Bayes and Naïve Bayes classifiers, in order to compute Pr(Y|X) we need to model Pr(X,Y) or Pr(X|Y)Pr(Y)
– It seems to be complicating things: using the solution of a harder problem [modeling Pr(X,Y)] to answer an easier question [Pr(Y|X)]
– We may not always have a good idea how to model Pr(X|Y) and Pr(Y)
• For example, while assuming a Gaussian for Pr(X|Y) is mathematically convenient, is it really suitable?
• What if we cannot find a good, well-studied distribution that fits the data well, or it is difficult to derive the maximum likelihood estimation formulas?
• We now study a discriminative method that models Pr(Y|X) directly
Logistic regression: the idea
• The logistic regression model relies on the assumption that the class can be determined by a linear combination of the features
– Conceptually, we hope to have a rule of this type: "If a1X(1) + a2X(2) + ... + amX(m) ≥ t, then Y=1; otherwise, Y=0"
• If 0.2 <body temperature> + 0.5 <headache> + 0.6 <running nose> ≥ 8.1, then <has flu> = 1
• The coefficients a1, a2, ..., am and the threshold t are model parameters, the values of which we want to estimate from training data
• Graphically, the rule is a step function:
[Figure: the step-function rule – <has flu> jumps from 0 to 1 when 0.2 <body temperature> + 0.5 <headache> + 0.6 <running nose> reaches the threshold 8.1]
Logistic regression: actual form
• However, the step function is mathematically not easy to handle
– For instance, it is not smooth, and thus not differentiable
• Let's model Pr(Y=1|X) using a smooth function, and then derive the classification rule based on it:
– Let f(X) = exp(a1X(1) + a2X(2) + ... + amX(m) − t)
– Pr(Y=1|X) = f(X) / [1 + f(X)] – the logistic function
– Pr(Y=0|X) = 1 − Pr(Y=1|X) = 1 / [1 + f(X)]
• When a1X(1) + a2X(2) + ... + amX(m) >> t, Pr(Y=1|X) ≈ 1 and Pr(Y=0|X) ≈ 0
• When a1X(1) + a2X(2) + ... + amX(m) << t, Pr(Y=1|X) ≈ 0 and Pr(Y=0|X) ≈ 1
• When a1X(1) + a2X(2) + ... + amX(m) = t, Pr(Y=1|X) = Pr(Y=0|X) = 0.5
– We predict X to be of class 1 if Pr(Y=1|X) ≥ Pr(Y=0|X), i.e., f(X) ≥ 1
• We need to estimate the values of the model parameters a1, ..., am and t from training data. We will discuss how to do this shortly.
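A sketch of this model in Python (the coefficients, threshold and feature values below are arbitrary placeholders, not fitted values):

```python
from math import exp

def pr_y1_given_x(x, a, t):
    # Logistic function: Pr(Y=1|X) = f(X) / (1 + f(X)), with f(X) = exp(a.x - t)
    f = exp(sum(aj * xj for aj, xj in zip(a, x)) - t)
    return f / (1 + f)

a = [0.2, 0.5, 0.6]   # hypothetical coefficients a1, a2, a3
t = 8.1               # hypothetical threshold
x = [39.0, 1, 1]      # body temperature, headache, running nose

p1 = pr_y1_given_x(x, a, t)
print(p1, "predict Y=1" if p1 >= 0.5 else "predict Y=0")
```

Note that Pr(Y=1|X) ≥ 0.5 exactly when f(X) ≥ 1, i.e., when a1X(1) + ... + amX(m) ≥ t, so the smooth model reproduces the step-function rule at the decision boundary.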
Visualizing the functions
• The rule "if f(X) ≥ 1, then predict Y=1" is exactly the same as "if a1X(1) + a2X(2) + ... + amX(m) ≥ t, then predict Y=1"
[Figures: (1) the original step function of Y against a1X(1) + a2X(2) + ... + amX(m); (2) Pr(Y=1|X) and Pr(Y=0|X) as smooth curves of the same quantity; (3) f(X), the ratio of Pr(Y=1|X) to Pr(Y=0|X), on a logarithmic scale. Parameter values can be set so that the smooth curves are very similar to the step function.]
Estimating the parameters
• For Naïve Bayes, we estimated parameters using maximum likelihood, i.e., we found the θ that maximizes the likelihoods (the first two products below), or their logarithms
• For logistic regression, we do not have models for these probabilities, so instead we directly maximize the conditional data likelihood (the last product below), where θ includes the parameters a1, a2, ..., am and t
$$\prod_{i=1}^{n} \Pr(X_i \mid Y_i, \theta) \qquad \prod_{i=1}^{n} \Pr(Y_i \mid \theta) \qquad\text{(likelihoods maximized for Naïve Bayes)}$$
$$\prod_{i=1}^{n} \Pr(Y_i \mid X_i, \theta) \qquad\text{(conditional data likelihood maximized for logistic regression)}$$
Maximizing the conditional likelihood
• The log conditional data likelihood can be written as follows [optional]:
$$\ln L(\theta) = \ln \prod_{i=1}^{n} \Pr(Y_i \mid X_i, \theta)$$
$$= \ln \prod_{i=1}^{n} \Pr(Y_i = 1 \mid X_i, \theta)^{Y_i} \Pr(Y_i = 0 \mid X_i, \theta)^{1 - Y_i}$$
$$= \sum_{i=1}^{n} \left[ Y_i \ln \Pr(Y_i = 1 \mid X_i, \theta) + (1 - Y_i) \ln \Pr(Y_i = 0 \mid X_i, \theta) \right]$$
$$= \sum_{i=1}^{n} \left[ Y_i \ln \frac{\Pr(Y_i = 1 \mid X_i, \theta)}{\Pr(Y_i = 0 \mid X_i, \theta)} + \ln \Pr(Y_i = 0 \mid X_i, \theta) \right]$$
$$= \sum_{i=1}^{n} \left[ Y_i \ln \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right) + \ln \frac{1}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
$$= \sum_{i=1}^{n} \left[ Y_i \left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right) - \ln\left(1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)\right) \right]$$
Maximizing the conditional likelihood
• Again, we can write down the expression for the partial derivative of ln L(θ) with respect to each parameter:
• However, when we set them to 0, each equation involves multiple parameters and we cannot get their optimal values separately
– That is, they form a system of non-linear equations
$$\frac{\partial \ln L(\theta)}{\partial a_k} = \sum_{i=1}^{n} \left[ Y_i X_i^{(k)} - \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} X_i^{(k)} \right] = \sum_{i=1}^{n} X_i^{(k)} \left[ Y_i - \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
$$\frac{\partial \ln L(\theta)}{\partial t} = \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
Gradient ascent
• The system of non-linear equations has no closed-form solution.
– Instead, we use a numerical method to solve it
– Main idea: since we hope each partial derivative to be zero, we move the parameters iteratively so that the derivatives get closer to zero
– For example, since ∂ln L(θ)/∂t takes the form shown below, we use the update rule for t shown after it, where η is a small constant
• On the right-hand side of the assignment, the current estimates of the parameters are used
$$\frac{\partial \ln L(\theta)}{\partial t} = \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
$$t := t + \eta \sum_{i=1}^{n} \left[ -Y_i + \frac{\exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)}{1 + \exp\left(-t + \sum_{j=1}^{m} a_j X_i^{(j)}\right)} \right]$$
Meaning of gradient ascent
• Why? It is like climbing a hill
– At each point, the gradient is the direction with maximum increase
– We want to move towards that direction, but with a small step each time so as not to overshoot
• We don't know exactly how large the step should be, because if we knew that we could jump to the peak directly (i.e., when we have the closed-form formulas, as in the case of maximum likelihood for Gaussian Naïve Bayes)
[Figure: hill-climbing illustration – the surface of ln L(θ) over the parameters a1 and t, with an arrow from the current estimate in the direction of maximum increase towards the new estimate. Image source: http://www.absoluteastronomy.com/topics/Hill_climbing]
Relationship with Naïve Bayes
• Interestingly, logistic regression has a tight relationship with Naïve Bayes when each Pr(X(j)|Y) is a Gaussian distribution
• [Optional] First, in general, the posterior probability Pr(Y=1|X) of Naïve Bayes is as follows:
$$\Pr(Y=1 \mid X) = \frac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(X)} = \frac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0) + \Pr(Y=1)\Pr(X \mid Y=1)}$$
$$= \frac{\dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}}{1 + \dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}} = \frac{\exp\left[\ln\dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)\Pr(X \mid Y=1)}{\Pr(Y=0)\Pr(X \mid Y=0)}\right]}$$
$$= \frac{\exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\ln\dfrac{\Pr(X^{(j)} \mid Y=1)}{\Pr(X^{(j)} \mid Y=0)}\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\ln\dfrac{\Pr(X^{(j)} \mid Y=1)}{\Pr(X^{(j)} \mid Y=0)}\right]}$$
Relationship with Naïve Bayes
• [Optional] Suppose Pr(X(j)|Y=1) has mean μj1 and variance σj², and Pr(X(j)|Y=0) has mean μj0 and variance σj² (different means but the same variance), then
$$\sum_{j=1}^{m} \ln\frac{\Pr(X^{(j)} \mid Y=1)}{\Pr(X^{(j)} \mid Y=0)} = \sum_{j=1}^{m} \ln\frac{\dfrac{1}{\sigma_j\sqrt{2\pi}} e^{-\frac{(X^{(j)} - \mu_{j1})^2}{2\sigma_j^2}}}{\dfrac{1}{\sigma_j\sqrt{2\pi}} e^{-\frac{(X^{(j)} - \mu_{j0})^2}{2\sigma_j^2}}}$$
$$= \sum_{j=1}^{m} \left[ \frac{(X^{(j)} - \mu_{j0})^2}{2\sigma_j^2} - \frac{(X^{(j)} - \mu_{j1})^2}{2\sigma_j^2} \right] = \sum_{j=1}^{m} \frac{2X^{(j)}(\mu_{j1} - \mu_{j0}) + \mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}$$
$$= \sum_{j=1}^{m} \left[ \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2} \right]$$
Relationship with Naïve Bayes
• Plugging this back into the formula for Pr(Y=1|X), we get the expression below, which is exactly the form of logistic regression, with the coefficients aj and threshold t given underneath
– The weight of a feature, aj, depends on how well it separates the two classes
– The threshold depends on the means of the Gaussians and the prior probabilities of the two classes
$$\Pr(Y=1 \mid X) = \frac{\exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\left(\dfrac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \dfrac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}\right)\right]}{1 + \exp\left[\ln\dfrac{\Pr(Y=1)}{\Pr(Y=0)} + \sum_{j=1}^{m}\left(\dfrac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} X^{(j)} + \dfrac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma_j^2}\right)\right]}$$
$$a_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} \qquad t = \sum_{j=1}^{m} \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_j^2} + \ln\frac{\Pr(Y=0)}{\Pr(Y=1)}$$
Relationship with Naïve Bayes
• In summary, if:
– A Naïve Bayes classifier models Pr(X(j)|Y) by a Gaussian distribution with equal variance for Pr(X(j)|Y=1) and Pr(X(j)|Y=0), AND
– A logistic regression classifier uses the coefficients and threshold given below,
• Then their predictions are exactly the same.
$$a_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2} \qquad t = \sum_{j=1}^{m} \frac{\mu_{j1}^2 - \mu_{j0}^2}{2\sigma_j^2} + \ln\frac{\Pr(Y=0)}{\Pr(Y=1)}$$
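A small numeric sketch of this equivalence (all parameter values are made up; with equal per-class variances, the Gaussian Naïve Bayes posterior and the logistic form built from the aj and t above coincide):

```python
from math import exp, log, pi, sqrt

mu1 = [38.5, 1.2]   # means of Pr(X(j) | Y=1) for each feature j
mu0 = [36.8, 0.3]   # means of Pr(X(j) | Y=0)
var = [0.6, 0.5]    # shared variances sigma_j^2
p1, p0 = 0.2, 0.8   # priors Pr(Y=1), Pr(Y=0)
x = [37.9, 1.0]     # an arbitrary feature vector

def gauss(v, mean, variance):
    return exp(-(v - mean) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

# Gaussian Naive Bayes posterior Pr(Y=1 | X=x)
num1, num0 = p1, p0
for j in range(len(x)):
    num1 *= gauss(x[j], mu1[j], var[j])
    num0 *= gauss(x[j], mu0[j], var[j])
post_nb = num1 / (num0 + num1)

# Logistic regression with coefficients and threshold derived from the Gaussians
a = [(mu1[j] - mu0[j]) / var[j] for j in range(len(x))]
t = sum((mu1[j]**2 - mu0[j]**2) / (2 * var[j]) for j in range(len(x))) + log(p0 / p1)
f = exp(sum(a[j] * x[j] for j in range(len(x))) - t)
post_lr = f / (1 + f)

print(post_nb, post_lr)  # the two posteriors agree
```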
Remarks
• When the conditional independence assumption is not true, logistic regression can be more accurate than the Naïve Bayes classifier
• However, when there are few observed data points (i.e., n is small), Naïve Bayes could be more accurate
CASE STUDY, SUMMARY AND FURTHER READINGS
Epilogue
Case study: Fallacies related to statistics
• "According to this gene model, this DNA sequence has a data likelihood of 0.6, while according to this model for intergenic regions, this DNA sequence has a data likelihood of 0.1. Therefore the sequence is more likely to be a gene."
– Right or wrong?
Case study: Fallacies related to statistics
• Likelihood vs. posterior:
– If Y represents whether the sequence is a gene (Y=1) or not (Y=0), and X is the sequence features, then the above statement is comparing the likelihoods Pr(X|Y=1) and Pr(X|Y=0), but we know that the posterior Pr(Y|X) = Pr(X|Y)Pr(Y)/Pr(X), and Pr(Y=1) << Pr(Y=0)
• Another famous example: "This cancer test has 99% accuracy, and is therefore highly reliable."
Case study: Fallacies related to statistics
• "Drug A is more effective than drug B for our male patients. Drug A is also more effective than drug B for our female patients. Therefore drug A is a better drug than drug B in general."
– Right or wrong?
Case study: Fallacies related to statistics
• Simpson's paradox:
– Consider the situation in the table below:
• Again, it is related to the different priors Pr(Gender) for the two drugs.
• You may argue that more females can be recruited to test drug A and more males can be recruited to test drug B.
– How about "the rate of a disease is higher for both males and females in population A than in population B"?
          Drug A                     Drug B
          Effective   Ineffective    Effective   Ineffective
Male      60          40             5           5
Female    7           3              65          35
Total     67          43             70          40
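Working out the rates from this table: drug A is effective for 60/100 = 60% of males and 7/10 = 70% of females, while drug B is effective for 5/10 = 50% of males and 65/100 = 65% of females; yet overall drug A is effective for 67/110 ≈ 61% of its patients and drug B for 70/110 ≈ 64%, because drug B was tested mostly on females, who respond better to either drug.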
Summary
• Statistical modeling allows us to predict the class Y (e.g., has flu) of an object by combining some observed features X (e.g., body temperature, fever and running nose) and some parameters θ
– Generative models: predict Pr(Y|X) by modeling Pr(Y) and Pr(X|Y)
• Example: Naïve Bayes classifier
– Discriminative models: predict Pr(Y|X) by modeling it directly
• Example: Logistic regression
Further readings
• A book chapter written by Tom Mitchell, Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression
– Available at http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
– Discusses how to avoid over-fitting by regularization
• Over-fitting: forming a model that is so complex that it fits the noise contained in the data, which does not help predictions
– Describes how logistic regression works when there are more than 2 classes
– Contains some discussion on using priors to turn maximum likelihood estimates into maximum a posteriori (MAP) estimates
– Note: some notations are different from what we use
• This year we have removed the expectation-maximization (EM) algorithm from the syllabus. The algorithm is for situations in which the concept depends on both the observed data and some unobserved hidden data
– If you are interested in learning EM, you can read its Wikipedia entry, which is quite well-written