
Statistical Modeling and Data Analysis

Given a data set, the first question a statistician asks is,

“What is the statistical model for this data?”

We then characterize and analyze the parameters of the model with an objective in mind.

• Example : SBP of Cancer Patients vs. Normal patients

Cancer: 145, 165, 134, 120, 112, 156, 145, 133, 135, 120
Normal: 138, 120, 112, 110, 128, 134, 128, 109, 138, 140

Objective: Do cancer patients have higher SBP than the normal patients?
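As a first look at this objective, the following is a minimal sketch of an exact one-sided permutation test on the two samples above. It assumes only that the group labels are exchangeable if there is no difference between the populations; it is an illustration, not the normal-theory analysis the later slides build toward.

```python
from itertools import combinations

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
normal = [138, 120, 112, 110, 128, 134, 128, 109, 138, 140]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(cancer) - mean(normal)

# Relabel the 20 patients in every possible way (10 "cancer", 10 "normal")
# and count how often the relabelled cancer group has a sum, and hence a
# mean difference, at least as large as the one actually observed.
pooled = cancer + normal
observed_sum = sum(cancer)
n_splits = 0
n_extreme = 0
for group in combinations(pooled, len(cancer)):
    n_splits += 1
    if sum(group) >= observed_sum:
        n_extreme += 1

p_value = n_extreme / n_splits
print(f"difference in sample means: {observed_diff:.1f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

A small p-value would indicate that a difference this large is unusual when the labels carry no information.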


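[Figure: two systolic blood pressure distributions, one for the normal population with mean $\mu_1$ and one for the cancer population with mean $\mu_2$]

Objective is to test the hypothesis that cancer patients have the higher mean SBP, $\mu_2 > \mu_1$.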

Does the data support this hypothesis?

Population of cancer patients with a probability distribution

Population of normal patients with a probability distribution


Assumption: The data are random and are generated from normal distributions.

• Random Variable

The underlying population is the collection of all subjects. What we observe is one realization of the random variable.

• Random Sample:

We collect a sample of subjects


Observed Sample:

Assumption: Simple random sample (each possible sample is as likely as any other sample)

• Multivariate Observations

An observed vector is one realization of this, i.e.,


Random Sample:

Observed sample is a realization of

Note: If simultaneous inference is to be made on its components, the probability statement should be viewed in terms of the probability of observing the whole vector.


Stochastic Process

Observed value of this is one realization

Can we describe a probability distribution of the whole process?

The Kolmogorov consistency theorem says that such a probability distribution can be described through a consistent family of finite-dimensional distributions.

[Figure: three realizations of the stochastic process]


Discrete time points

If this process is stationary, then a probability model for it can be described in a concise way, for example by a model whose only random input is white noise.


Image Process:

The image process is indexed by the set of all pixels.

Note that what we observe is one realization of this process.

The same can be said about a weather map.


Data Analysis

Generally speaking, we perform one or more of the following tasks in data analysis (statistical inference):

• Estimate the model
• Hypothesis testing
• Predictive analysis

Given the sample data, the objective is to make inferences about the population described by the probability model.

All inferences are based on the assumed probability model.


Estimation

Think of estimating any parameter $\theta$ of a probability model, for example the parameters of a regression model.

How good is the estimate $\hat\theta$?

Well, you might say that if $\hat\theta$ is close to $\theta$, it is a good estimate.

Not so simple! Note that $\theta$ is unknown.


Frequentist’s Interpretation

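Note that $\hat\theta$ depends on the sample we observe.

[Table: repeated samples, each with its observed data and the resulting value of $\hat\theta$]

$\hat\theta_1$ is better than $\hat\theta_2$ if the average of $(\hat\theta_1 - \theta)^2$ is smaller than the average of $(\hat\theta_2 - \theta)^2$, i.e.,

$E(\hat\theta_1 - \theta)^2 \le E(\hat\theta_2 - \theta)^2$ for all $\theta$.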

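In this sense, $\hat\theta_1$ is better than $\hat\theta_2$ if $E(\hat\theta_1 - \theta)^2 \le E(\hat\theta_2 - \theta)^2$ for all $\theta$.

A best estimate, in this sense, is of course not possible. If $\hat\theta = \theta_0$ irrespective of the observed sample, then

$E(\hat\theta - \theta)^2 = 0$ for $\theta = \theta_0$,

so no other estimate can beat this constant at $\theta = \theta_0$, even though it is poor everywhere else.

We restrict attention to a class of estimators, and then try to find the best estimate within this class.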

For example, we may consider a class of all unbiased estimators.
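The point that no single estimate can be best for every parameter value can be checked by simulation. The sketch below assumes squared-error loss and normal data with known standard deviation 1; the constant estimator `always_five` and the chosen parameter values are hypothetical and used only for illustration.

```python
import random

def mse(estimator, theta, n=10, sigma=1.0, reps=5000, seed=0):
    """Monte Carlo estimate of E(estimate - theta)^2 over repeated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / reps

def sample_mean(sample):
    return sum(sample) / len(sample)

def always_five(sample):
    # Hypothetical estimator that ignores the data and always answers 5.
    return 5.0

for theta in (5.0, 6.0, 8.0):
    print(theta, round(mse(sample_mean, theta), 3), round(mse(always_five, theta), 3))
```

The constant estimator has zero error when the true value happens to be 5 and a large error elsewhere, while the sample mean, which is unbiased, has roughly the same error everywhere.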


Theories are well developed for achieving best estimates among the class of unbiased estimates for simple probability models.

For complicated models, we can always fall back on maximum likelihood estimates.

Obtain the estimate by maximizing the likelihood function.

For a small sample size $n$, this may not always yield a good estimate, but for a large sample size $n$, it generally yields optimal estimates.
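A minimal sketch of maximum likelihood by direct numerical maximization follows. It assumes an exponential model and made-up waiting times, neither of which comes from the slides, and uses a golden-section search so that the answer can be compared with the known closed form $n / \sum x_i$.

```python
import math

def log_likelihood(rate, data):
    """Log-likelihood of an exponential model with the given rate."""
    return len(data) * math.log(rate) - rate * sum(data)

def maximize(f, lo, hi, tol=1e-7):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d          # the maximum lies in [a, d]
        else:
            a = c          # the maximum lies in [c, b]
    return (a + b) / 2

data = [0.8, 2.1, 0.3, 1.7, 0.9, 3.2, 0.5, 1.1]   # made-up waiting times
mle_numeric = maximize(lambda r: log_likelihood(r, data), 1e-6, 10.0)
mle_closed_form = len(data) / sum(data)
print(mle_numeric, mle_closed_form)
```

For models without a closed form, the same idea applies with a more capable optimizer such as Newton-Raphson or the EM algorithm.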

Asymptotic Optimality of Maximum Likelihood Estimate

$\{\hat\theta_n\}$ – sequence of asymptotically normal estimates:

$\sqrt{n}\,(\hat\theta_n - \theta) \to N(0, V(\theta))$ in distribution as $n \to \infty$.

$V(\theta)$ can be interpreted as the asymptotic variance of $\hat\theta_n$.

$V(\theta) \ge I^{-1}(\theta)$,

$I(\theta)$ - Fisher Information Matrix

Under regular probability models, maximum likelihood estimates achieve the lower bound.


Bayesian Interpretation

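Prior Distribution - $\pi(\theta)$

Through this we might say that some values of $\theta$ are more likely than other values.

$\hat\theta_1$ is better than $\hat\theta_2$ if its expected squared error, averaged over the prior as well as over samples, is smaller:

$E_\pi\big[E(\hat\theta_1 - \theta)^2\big] \le E_\pi\big[E(\hat\theta_2 - \theta)^2\big]$.

A best estimate is now possible; for example,

$\hat\theta = E(\theta \mid \text{data})$.

The RHS is the expectation with respect to the posterior distribution of $\theta$.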


Prior Distribution - Really? Where did it come from?

You may not believe this, but we are really talking in terms of a statistical philosophy.

Can you really believe that the true state of nature is random?

[Figure: SBP distributions of the normal and cancer populations, with means $\mu_1$ and $\mu_2$]

$\mu_1$ and $\mu_2$ are supposed to be the fixed mean SBPs of the normal and cancer populations. Now we are saying that they are random.

Bayesian Paradigm

$\theta$ is never a fixed value; under most circumstances some values of $\theta$ are more likely than other values.

Before the data are analyzed, we should explore this prior, then update it based on the information provided by the data.

Prior: $\pi(\theta)$  Data: $x \sim f(x \mid \theta)$

Posterior: $\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$

All information about $\theta$ is contained in the posterior.


Example:

1 in 1,000 in the population carry a particular genetic disorder.

Certain tests on a person are performed, and data is collected

Data:

Prior: P(disorder) = 1/1,000

Posterior:
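The update from prior to posterior in this example can be sketched with Bayes' rule. The prior of 1 in 1,000 is from the example; the test's sensitivity and false-positive rate below are assumed values chosen for illustration, since the data on the slide did not survive.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """P(disorder | positive test) by Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

prior = 1 / 1000             # from the example: 1 in 1,000 carry the disorder
sensitivity = 0.99           # assumed P(positive | disorder)
false_positive_rate = 0.05   # assumed P(positive | no disorder)

post = posterior_probability(prior, sensitivity, false_positive_rate)
print(f"P(disorder | positive test) = {post:.4f}")
```

Under these assumed error rates the posterior probability stays below 2 percent, because the disorder is so rare that false positives outnumber true positives.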


The main issues with Bayesian inference are

(1) Appropriateness of the prior
(2) Computation of the posterior distribution

random sample from

Prior:

This is a conjugate prior because the posterior distribution is of the same form as the prior distribution.

Is this prior appropriate?
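A minimal sketch of a conjugate update is given below. It assumes a normal sample with known standard deviation and a normal prior on the mean, which may differ from the model on the slide; the values `sigma=15`, `prior_mean=120` and `prior_sd=10` are hypothetical. The data are the cancer SBP values from the opening example.

```python
def normal_posterior(data, sigma, prior_mean, prior_sd):
    """Posterior N(mean, sd^2) of a normal mean when sigma is known and the
    prior on the mean is N(prior_mean, prior_sd^2)."""
    n = len(data)
    xbar = sum(data) / n
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * xbar) / post_precision
    return post_mean, post_precision ** -0.5

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
# sigma, prior_mean and prior_sd are assumed values, not taken from the slides.
post_mean, post_sd = normal_posterior(cancer, sigma=15.0, prior_mean=120.0, prior_sd=10.0)
print(round(post_mean, 2), round(post_sd, 2))
```

The posterior is again normal, with a mean that is a precision-weighted compromise between the prior mean and the sample mean; a larger `prior_sd` makes the prior flatter and moves the posterior mean toward the sample mean.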

If nothing is known about the parameters, the prior can be chosen to be almost flat.

There are other ways to assign non-informative priors.

Note that if the prior is not of the conjugate form, then we will have the computational problem of computing the posterior distribution.


Computation of the posterior

There are two popular techniques for computing the posterior distribution:

1. Metropolis-Hastings algorithm
2. Gibbs sampler

These techniques can be used effectively for complex probability models and reasonable priors.
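The first of these techniques can be sketched in a few lines. The example below is a random-walk Metropolis sampler (the symmetric-proposal special case of Metropolis-Hastings) for the mean of the cancer SBP data, assuming a flat prior and a known standard deviation of 15; both assumptions, and the tuning values, are illustrative only.

```python
import math
import random

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
SIGMA = 15.0   # assumed known standard deviation

def log_posterior(mu):
    """Log posterior of mu up to a constant: flat prior times normal likelihood."""
    return -sum((x - mu) ** 2 for x in cancer) / (2 * SIGMA ** 2)

def metropolis(n_iter=20000, step=5.0, start=120.0, seed=1):
    rng = random.Random(seed)
    mu, logp = start, log_posterior(start)
    draws = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)      # symmetric random-walk proposal
        logp_new = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            mu, logp = proposal, logp_new
        draws.append(mu)
    return draws

draws = metropolis()
kept = draws[2000:]                               # discard burn-in
post_mean = sum(kept) / len(kept)
post_sd = (sum((d - post_mean) ** 2 for d in kept) / len(kept)) ** 0.5
print(round(post_mean, 2), round(post_sd, 2))
```

In this simple case the posterior is known exactly (normal with mean 136.5 and standard deviation $15/\sqrt{10}$), which makes it a convenient check on the sampler before applying it to models with no closed form.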


Frequentist vs. Bayesian

Frequentist:

• All data information is contained in the likelihood function.
• The estimates are viewed in terms of how they behave on the average.
• Estimates are generally obtained by maximizing the likelihood function. Techniques include Newton-Raphson, the EM algorithm, etc.

Bayesian:

• All data information is contained in the likelihood function and the prior.
• Estimates are viewed in terms of where they are located in the posterior.
• Estimates are obtained from the posterior. Techniques include the Gibbs sampler, Metropolis-Hastings, etc.


Mixture Models

Suppose the population is a mixture of two or more populations.

Bayesians would have a better answer for estimating this model than frequentists would.


Hypothesis Testing

Think about how it started in the statistical literature.

Data: drawn from a probability model.

Hypothesis: a statement associated with the probability model.

Does the data support this hypothesis?

Bayesians had an answer to this, but they were not popular at the time.

Ans. The posterior probability of the hypothesis given the data.

(Fisher)

Data: drawn from a probability model.

Hypothesis: a statement about that model.

Compute the probability, assuming the hypothesis is true, of data at least as extreme as those observed (the p-value).

If this is very small (< 0.05), then the data provide very little evidence in support of the hypothesis.

Conclusion: Reject the Hypothesis


Analysis of Variance (ANOVA)

ANOVA is one of the most popular statistical tools for analyzing data.

[Diagram: Factor 1, Factor 2 and Factor 3 pointing to Y, a response variable]

Does Y (the response) depend on any of the factors?

Example 1: You are doing research on mpg (miles per gallon) for a brand of automobiles.

Question: What affects mpg?

[Diagram: wind speed, air temperature and air moisture pointing to mpg]

Do wind speed, air temperature, and air moisture affect mpg?


Example 2:

Research Question: Does blood pressure (BP) depend on weight and gender?

[Diagram: weight and gender pointing to BP]

[Figure: scatter plot of BP against weight, with females and males marked separately]

There is a variation in BP. Some is due to weight, and some is due to gender.


Concept:

Variation(BP) = Variation(Weight) + Variation(Gender) + Variation(Error)

These variations can be described by sums of squares:

SS(BP) = SS(Weight) + SS(Gender) + SS(Error)

Each sum of squares has degrees of freedom (df), which represent the effective number of terms in that sum of squares.

F-Statistics

Weight: Test statistic $F = \dfrac{SS(\text{Weight})/df_{\text{Weight}}}{SS(\text{Error})/df_{\text{Error}}}$

Hypothesis: Weight is not a factor in BP

If the p-value is small (< 0.05), then there is little evidence that weight is not a factor.

Gender: Test statistic $F = \dfrac{SS(\text{Gender})/df_{\text{Gender}}}{SS(\text{Error})/df_{\text{Error}}}$

The same can be done to see if gender is a factor.
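The sums-of-squares decomposition and the F statistic can be sketched for the simplest case of a single factor. Since the slides give no BP-by-weight-and-gender data, the example below reuses the SBP values from the opening example, with group membership (cancer or normal) standing in as the one factor; this substitution is an assumption made for illustration.

```python
def one_way_anova(groups):
    """Return (SS_factor, SS_error, F) for a dict {level: [responses]}."""
    values = [y for ys in groups.values() for y in ys]
    grand_mean = sum(values) / len(values)
    ss_factor = 0.0
    ss_error = 0.0
    for ys in groups.values():
        level_mean = sum(ys) / len(ys)
        ss_factor += len(ys) * (level_mean - grand_mean) ** 2   # between levels
        ss_error += sum((y - level_mean) ** 2 for y in ys)      # within levels
    df_factor = len(groups) - 1
    df_error = len(values) - len(groups)
    f_stat = (ss_factor / df_factor) / (ss_error / df_error)
    return ss_factor, ss_error, f_stat

sbp = {
    "cancer": [145, 165, 134, 120, 112, 156, 145, 133, 135, 120],
    "normal": [138, 120, 112, 110, 128, 134, 128, 109, 138, 140],
}
ss_factor, ss_error, f_stat = one_way_anova(sbp)
print(round(ss_factor, 1), round(ss_error, 1), round(f_stat, 3))
```

The p-value is then the upper-tail probability of the F distribution with the corresponding degrees of freedom (here 1 and 18), which requires an F table or a statistics library.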


Neyman-Pearson Lemma

Basis for Classical Hypothesis Testing

• Null hypothesis
• Alternative hypothesis (research hypothesis)
• TS: test statistic
• Decision rule
• Conclusion

Type-I Error: False Discovery
Type-II Error: False Non-Discovery

Devise a decision rule so that $\alpha$ = Pr(False Discovery) is very small ($\alpha$ = 0.05). Through the Neyman-Pearson lemma, a most powerful decision rule can be obtained.

The uniformly most powerful unbiased decision rule rejects the null hypothesis when the test statistic falls in a rejection region, where the cut-off of the region is chosen such that Pr(False Discovery) = $\alpha$.

Note that this is a frequentist method since the probability statement should be interpreted in a frequentist manner.


Likelihood Approach

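The Neyman-Pearson lemma works only for simple probability models.

Test statistic: $-2\log\Lambda$, where $\Lambda = \dfrac{\max_{\theta \in H} L(\theta)}{\max_{\theta} L(\theta)}$ is the likelihood ratio.

If the hypothesis is correct, $-2\log\Lambda$ should be close to 0. Thus, we reject the hypothesis if $-2\log\Lambda > c$.

The cut-off point $c$ can be obtained through the asymptotic distribution of $-2\log\Lambda$, which is usually $\chi^2$.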
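A minimal sketch of the likelihood approach follows, applied to the SBP data from the opening example. It assumes both groups are normal with a common variance and tests the hypothesis of equal means; under that model $-2\log\Lambda = n \log(RSS_0 / RSS_1)$, and the cut-off is taken from the chi-square distribution with one degree of freedom.

```python
import math

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
normal = [138, 120, 112, 110, 128, 134, 128, 109, 138, 140]

def rss(groups):
    """Residual sum of squares when each group gets its own fitted mean."""
    total = 0.0
    for ys in groups:
        m = sum(ys) / len(ys)
        total += sum((y - m) ** 2 for y in ys)
    return total

n = len(cancer) + len(normal)
rss_null = rss([cancer + normal])   # hypothesis: one common mean
rss_alt = rss([cancer, normal])     # unrestricted: separate means

# For normal models with a common variance, -2 log(Lambda) reduces to this.
stat = n * math.log(rss_null / rss_alt)

# Upper tail of chi-square with 1 df: P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2.0))
print(round(stat, 3), round(p_value, 4))
```

This is a two-sided test of equal means; with only 20 observations the chi-square approximation is rough, so the p-value should be read as indicative.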


Model Selection

Suppose you want to choose one model out of several. This is a type of multiple hypotheses problem.

Regression:

Not all predictors are significant, and you want to select the set of significant predictors. This can be viewed as selecting one of several models, one for each subset of predictors.

Choose the model that yields the smallest error sum of squares, SS(Error).

This yields a biased selection, meaning that a model with a higher number of parameters has a better chance of being selected.

AIC or BIC Information criteria

Both criteria add a penalty for the number of parameters $k$. With the usual definitions $AIC = -2\log L + 2k$ and $BIC = -2\log L + k\log n$, select the model with the smallest value of AIC (or BIC); if the criterion is instead defined as the log-likelihood minus the penalty, select the model with the highest value.
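The effect of the penalty can be sketched on the SBP data from the opening example by comparing two normal models: model A with one common mean and model B with separate means for the two groups. The sketch assumes the common convention AIC = -2 log L + 2k and BIC = -2 log L + k log n, under which smaller values are preferred.

```python
import math

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
normal = [138, 120, 112, 110, 128, 134, 128, 109, 138, 140]

def rss(groups):
    """Residual sum of squares when each group gets its own fitted mean."""
    total = 0.0
    for ys in groups:
        m = sum(ys) / len(ys)
        total += sum((y - m) ** 2 for y in ys)
    return total

def aic_bic(rss_value, n, k):
    """AIC and BIC of a normal model with k estimated parameters, at the MLE."""
    neg2_loglik = n * (math.log(2 * math.pi * rss_value / n) + 1)
    return neg2_loglik + 2 * k, neg2_loglik + k * math.log(n)

n = len(cancer) + len(normal)
rss_a = rss([cancer + normal])   # model A: common mean and variance, k = 2
rss_b = rss([cancer, normal])    # model B: two means and variance, k = 3
aic_a, bic_a = aic_bic(rss_a, n, 2)
aic_b, bic_b = aic_bic(rss_b, n, 3)
print("RSS:", round(rss_a, 1), round(rss_b, 1))
print("AIC:", round(aic_a, 2), round(aic_b, 2))
print("BIC:", round(bic_a, 2), round(bic_b, 2))
```

The larger model necessarily has the smaller residual sum of squares, which is the bias described above; the penalized criteria need not agree with it, and BIC penalizes the extra parameter more heavily than AIC once $\log n > 2$.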

Bayesian Hypothesis Testing

Data: $x$ drawn from $f(x \mid \theta)$

Hypothesis: $H_0$ versus $H_1$

Prior: $P(H_0)$, $P(H_1)$

Posterior: $P(H_0 \mid x)$, $P(H_1 \mid x)$

Bayes Factor: $B = \dfrac{P(H_0 \mid x)/P(H_1 \mid x)}{P(H_0)/P(H_1)}$

If this Bayes factor is sufficiently large, the data have sufficient evidence to support the hypothesis $H_0$.

Frequentist vs. Bayesian

Note that both the p-value approach and classical hypothesis tests are frequentist, since their statements are made in terms of the probability of the data over repeated samples.

The Bayes factor is used in Bayesian tests, which are based on posterior probabilities.

Multiple Hypotheses:

Consider 1000 independent tests, each at a Type-I error of α = 0.05. Then, on average, 5% of the true null hypotheses would be falsely rejected. In other words, if 50 of the hypotheses were rejected, there is no guarantee that they were not all falsely rejected.

FWER: m = # of hypotheses

π = P(One or more falsely rejected hypotheses) = $1 - (1 - \alpha)^m$

$\alpha = 1 - (1 - \pi)^{1/m} \approx \pi/m$ (Bonferroni Correction)
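The growth of the familywise error rate with the number of tests can be checked numerically. The sketch below assumes m independent tests of true null hypotheses, each at level α, so that the FWER is $1 - (1 - \alpha)^m$; the function names are illustrative.

```python
def fwer(alpha, m):
    """P(one or more false rejections) for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def per_test_level(pi, m):
    """Per-test level that gives FWER exactly pi; close to pi / m (Bonferroni)."""
    return 1.0 - (1.0 - pi) ** (1.0 / m)

for m in (1, 10, 100, 1000):
    print(m, round(fwer(0.05, m), 4), per_test_level(0.05, m), 0.05 / m)
```

With 1000 tests at α = 0.05 a false rejection is almost certain, and holding the FWER at 0.05 requires a per-test level of about 0.00005.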

If m is large, α would be very small. Thus the power of detecting any true positive would be very small.

Sequential Bonferroni Corrections:

Let $p_{[1]} \le p_{[2]} \le \dots \le p_{[m]}$ be the ordered p-values of independent tests with corresponding null hypotheses $H_{(1)}, H_{(2)}, \dots, H_{(m)}$. In the procedures below, $\alpha$ denotes the desired overall (familywise) level.

Holm’s Method (Holm, 1979; Scand. J. Statist.)

• If $p_{[1]} > \alpha/m$, accept all nulls.

• If $p_{[1]} \le \alpha/m$, reject $H_{(1)}$; if $p_{[2]} > \alpha/(m-1)$, accept the rest of the nulls.

• Continue until the first $j$ such that $p_{[j]} > \alpha/(m-j+1)$. In that case reject all $H_{(i)}$, $i = 1, \dots, j-1$, and accept the rest of the nulls.
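Holm's step-down procedure can be sketched as follows: sort the p-values, compare the j-th smallest with α/(m − j + 1), and stop at the first one that fails. The p-values in the example are made up for illustration.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's method rejects the null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):           # rank 0 is the smallest p-value
        # Threshold alpha / (m - j + 1) with j = rank + 1.
        if p_values[idx] > alpha / (m - rank):
            break                                # accept this and all larger p-values
        reject[idx] = True
    return reject

p = [0.001, 0.045, 0.012, 0.20, 0.025]           # made-up p-values
print(holm(p))
```

Plain Bonferroni would compare every p-value with 0.05/5 = 0.01 and reject only the first hypothesis here; Holm's method also rejects the hypothesis with p = 0.012 because its threshold relaxes to 0.05/4 after the first rejection.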

Simes Method (Biometrika, 1986):

• If $p_{[m]} \le \alpha$, reject all nulls.

• If not, but if $p_{[m-1]} \le \alpha/2$, reject all $H_{(i)}$, $i = 1, 2, \dots, m-1$.

• Continue until the first $i$ such that $p_{[i]} \le \alpha/(m-i+1)$. In that case reject all $H_{(j)}$, $j = 1, 2, \dots, i$.

Note: Both Holm’s and Simes’ methods are designed to control the FWER while being less conservative than the Bonferroni correction.

False Discovery Rate (FDR): Benjamini and Hochberg (1995), JRSS B

When the number of hypotheses m is very large (say in thousands), and if each individual hypothesis is not important, then the FWER criterion is not very useful since it yields few discoveries. For example, in a microarray data analysis, the objective is to detect potential genes for future exploration. Here, each individual gene is not important. In such cases, tests with a controlled FWER would yield few discoveries.

FDR = Expected proportion of false rejections.

                    Accept Null   Reject Null   Total
True Null           U             V             $m_0$
True Alternatives   T             S             $m - m_0$
Total               m - R         R             m

FDR = $E[V/R]$, where $V/R = 0$ if $R = 0$

    = $E[V/R \mid R > 0]\,P(R > 0)$

Note that FWER = $P(V > 0)$, which equals $P(R > 0)$ when all null hypotheses are true.

Benjamini and Hochberg proved that the following procedure produces $FDR \le q$:

Let k be the largest integer i such that $p_{[i]} \le \dfrac{i}{m}\,q$, then reject all $H_{(j)}$, $j = 1, 2, \dots, k$.

The result was proved under the assumption of independent test statistics. It was later extended to positively correlated test statistics by Benjamini and Yekutieli, 2001; Ann. Stat.
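The Benjamini-Hochberg step-up procedure can be sketched as follows: find the largest rank i whose ordered p-value is at most iq/m, and reject every hypothesis up to that rank. The p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the BH procedure rejects the null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank                      # remember the largest rank that passes
    reject = [False] * m
    for idx in order[:k]:                 # reject the k smallest p-values
        reject[idx] = True
    return reject

p = [0.001, 0.045, 0.012, 0.20, 0.025]    # made-up p-values
print(benjamini_hochberg(p))
```

On these p-values the procedure rejects three hypotheses (p = 0.001, 0.012 and 0.025), more than a FWER-controlling method at the same level would typically allow, which is the trade-off FDR control is designed to make.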

Bayesian Interpretation (Storey, 2003, Ann. Stat.)

$H_{0i}: \theta_i = 0$ vs. $H_{ai}: \theta_i \ne 0$, $i = 1, 2, \dots, m$.

Let $H_i = 0$ if the $i$-th null hypothesis is true and $H_i = 1$ otherwise.

Let $T_i$ be test statistics that reject $H_i = 0$ if $T_i \ge c$. The $T_i$, $i = 1, 2, \dots, m$, are independently distributed.

$pFDR = E[V/R \mid R > 0]$

If $H_1, H_2, \dots, H_m$ are i.i.d. with $P(H_i = 0) = p_0$, then

$pFDR = P(H_i = 0 \mid T_i \ge c)$.

Note: pFDR is a posterior version of the Type-I error

Directional Hypothesis Problem (Three decision problem):

Suppose $H_0^i: \theta_i = 0$ is rejected, but it is also important to find the direction of $\theta_i$, i.e., $\theta_i < 0$ or $\theta_i > 0$.

So the problem is to find subsets $S_{-1}$ and $S_{1}$ of $\{1, 2, \dots, m\}$ such that

$S_{-1} = \{i : \theta_i < 0\}$ and $S_{1} = \{i : \theta_i > 0\}$.

Example: Gene selection

When genes are altered under an adverse condition, such as cancer, the affected genes show under- or over-expression in a microarray.

$X_i$ = expression level of gene $i$, with a probability distribution that depends on $\theta_i$.

$H_0^i: \theta_i = 0$ vs. $H_{-1}^i: \theta_i < 0$ or $H_1^i: \theta_i > 0$

The objective is to find the genes with under-expression and the genes with over-expression.

Directional Error (Type III error):

Type III error is defined as P(selection of a false direction if the null is rejected). The traditional method does not control the directional error. For example, if $|t| > t_{\alpha/2}$ and $t > t_{\alpha/2}$, an error occurs if $\theta < 0$.

• Sarkar and Zhou (2008, JSPI)
• Finner (1999, AS)
• Shaffer (2002, Psychological Methods)
• Lehmann (1952, AMS; 1957, AMS)

The main point of these works is that if the objective is to find the true direction of the alternative after rejecting the null, then the Type III error must be controlled instead of the Type I error.

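Bayesian Decision Theoretic Framework

$H_{-1}^i: \theta_i < 0$, $H_0^i: \theta_i = 0$, $H_1^i: \theta_i > 0$, $i = 1, 2, \dots, m$.

Suppose $\theta_1, \theta_2, \dots, \theta_m$ are generated from

$\pi(\theta) = p_{-1}\,\pi_{-1}(\theta) + p_0\,\pi_0(\theta) + p_1\,\pi_1(\theta)$,

where

$\pi_{-1}(\theta) = g_{-1}(\theta)\,I(\theta < 0)$, $\pi_0(\theta) = I(\theta = 0)$, $\pi_1(\theta) = g_1(\theta)\,I(\theta > 0)$,

$g_{-1}(\theta)$ = density with support contained in $(-\infty, 0)$,

$g_1(\theta)$ = density with support contained in $(0, \infty)$.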

$g_{-1}$ and $g_1$ could be truncated densities of a density on the real line.

The skewness in the prior is introduced by $p = (p_{-1}, p_0, p_1)$.

$p_1 > p_{-1}$ reflects that the right tail is more likely than the left tail.

$p_{-1} = 0$ (or $p_1 = 0$) would yield a one-tail test.

$p_{-1} = p_1$, with $g_{-1}$ and $g_1$ as truncations of a symmetric density, would yield a two-tail test.

$p_{-1}$ and $p_1$ can be assigned based on which tail is more important.

Loss Function

$L(\theta, d) = \sum_{i=1}^{m} L_i(\theta_i, d_i)$,

where $d_i = (d_{-1}^i, d_0^i, d_1^i)$ taking values

(1, 0, 0) for selecting $H_{-1}^i$
(0, 1, 0) for selecting $H_0^i$
(0, 0, 1) for selecting $H_1^i$

Let $\delta_i = (\delta_{-1}^i, \delta_0^i, \delta_1^i)$ be a randomized rule.

The average risk for a decision rule $\delta = (\delta_1, \dots, \delta_m)$ is given by

$r(\delta) = p_{-1}\,r_{-1}(\delta) + p_0\,r_0(\delta) + p_1\,r_1(\delta)$,

where $r_{-1}(\delta)$, $r_0(\delta)$ and $r_1(\delta)$ are the risks of $\delta$ averaged over $\pi_{-1}$, $\pi_0$ and $\pi_1$ respectively; in particular, $r_0(\delta)$ is the risk at $\theta = 0$.

For a fixed prior $\pi$, decision rules can be compared by comparing the space

$S(\pi) = \{(r_{-1}(\delta), r_0(\delta), r_1(\delta)) : \delta \in D^*\}$.

Consider the class of all rules for which $r_0(\delta)$ are the same.

[Figure: the risk set with $r(\delta)$ on both axes and the Bayes rule marked on its boundary; the slope of the supporting line depends on the prior probabilities $p$]

Remark: This theorem implies that if it is known a priori that $H_1^i$ is more likely than $H_{-1}^i$ ($p_1 > p_{-1}$), then the average risk of the Bayes rule in the positive direction will be smaller than the average risk in the negative direction. For the "0-1" loss, this would mean that the expected number of falsely detected genes in the positive direction would be less than the expected number of falsely detected genes in the negative direction.