An introduction to the Bootstrap method Hugh Shanahan University College London November 2001 I know that it will happen, Because I believe in the certainty

An introduction to the Bootstrap method

Hugh Shanahan

University College London

November 2001

I know that it will happen, Because I believe in the certainty of chance

The Divine Comedy

Outline

• Origin of Statistics
• Central Limit Theorem
• Difficulties in “Standard Statistics”
• Bootstrap - the basic idea
• A simple example
• Case Study I : Phylogenetic Trees
• Case Study II : Bayesian Networks
• Conclusions

Statistics 101

• We want the ‘average’ and ‘error’ for some variable
• Time between first and second division of frog embryo
• Half-life of a radioactive sample
• How many days does Wimbledon get delayed by (grrr……..)

Strategy

• Assuming only statistical variation
• Carry out measurement “many” times
• Error decreases as the number of measurements increases

In fact, there’s a huge amount of statistical machinery going on with this…….

Assume the Central Limit Theorem

“If random samples of n observations $y_1, y_2, \ldots, y_n$ are drawn from a population of finite mean $\mu$ and variance $\sigma^2$, then when n is sufficiently large, the sampling distribution of the sample mean can be approximated by a normal density with mean $\mu_{\bar{y}} = \mu$ and standard deviation $\sigma_{\bar{y}} = \sigma / n^{1/2}$”

THE MOST IMPORTANT THEOREM OF STATISTICS
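The theorem can be checked numerically. The following is a minimal sketch, not part of the original talk: it assumes a uniform population on [0, 1) (mean 0.5, standard deviation sqrt(1/12)), and the helper name `sample_means`, the sample size and the seed are all illustrative choices.

```python
import random
import statistics

def sample_means(draw, n, repeats, rng):
    """Return `repeats` sample means, each computed from n calls to draw(rng)."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repeats)]

rng = random.Random(0)       # fixed seed so the run is reproducible
n = 50                       # observations per sample (illustrative choice)
sigma = (1 / 12) ** 0.5      # standard deviation of the uniform population on [0, 1)

means = sample_means(lambda r: r.random(), n, 5000, rng)

observed_sd = statistics.stdev(means)   # spread of the sample means
predicted_sd = sigma / n ** 0.5         # what the theorem predicts: sigma / sqrt(n)

print(f"mean of sample means: {statistics.fmean(means):.3f}")
print(f"observed sd: {observed_sd:.4f}, predicted sd: {predicted_sd:.4f}")
```

Swapping the `lambda` for a draw from a different population (for instance `r.expovariate(1.0)`, with `sigma` adjusted to match) is one way to see how large n has to be before the agreement becomes good.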

Consequences of CLT

• Averages taken from any distribution (your experimental data) will have a normal distribution
• The error for such an observable will decrease slowly as the number of observations increases

But nobody tells you how big the sample has to be..
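To put a number on “slowly”, here is a short worked calculation. The notation is an assumption chosen to match the theorem as quoted: $\sigma$ is the population standard deviation and $\sigma_{\bar{y}}(n)$ is the error on an average of n observations.

$$
\sigma_{\bar{y}}(n) = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{\sigma_{\bar{y}}(4n)}{\sigma_{\bar{y}}(n)} = \frac{1}{2},
\qquad
\frac{\sigma_{\bar{y}}(100n)}{\sigma_{\bar{y}}(n)} = \frac{1}{10}
$$

So halving the error costs four times as many observations, and gaining one decimal digit costs a hundred times as many.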

[Figure: three pairs of histograms, each showing a distribution beside the distribution of its averages. Panels: Normal distribution / Averages of N.D.; a second distribution whose name was lost in extraction / its averages; Uniform distribution / Averages of U.D.]

Research is more than Statistics 101 !!

• Very often, we are looking at quite complicated objects, not just single variables. Even if we assume CLT, it is not clear how to propagate the uncertainty through to the final objects we are looking at.
• It is not clear when we have a large enough sample. We should do a histogram, but this may not be possible.

What the statistician sees…. (or rather what they talk about)

• The probability distribution rather than the data
• But we just have the data !
• The bootstrap method attempts to determine the probability distribution from the data itself, without recourse to CLT.
• The bootstrap method is not a way of reducing the error ! It only tries to estimate it.

Basic idea of Bootstrap

• Originally, from some list of data, one computes an object.
• Create an artificial list by randomly drawing elements from that list. Some elements will be picked more than once.
• Compute a new object.
• Repeat 100-1000 times and look at the distribution of these objects.
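The four steps above can be sketched in a few lines. This is an illustrative sketch only: the function name `bootstrap`, its parameters and the ten measurements are hypothetical, and the “object” here is simply the mean. It draws as many elements as the original list holds, with repeats allowed.

```python
import random
import statistics

def bootstrap(data, compute, n_boot=1000, rng=None):
    """Recompute `compute` on n_boot lists drawn with replacement from `data`."""
    rng = rng or random.Random()
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Artificial list: n random draws from the original list, repeats allowed
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(compute(resample))
    return replicates

# Hypothetical measurements, made up purely for illustration
data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 5.6]

replicates = bootstrap(data, statistics.fmean, n_boot=1000, rng=random.Random(1))

print(f"original mean: {statistics.fmean(data):.3f}")
print(f"bootstrap estimate of its error: {statistics.stdev(replicates):.3f}")
```

Because `compute` is passed in as a function, the same loop works unchanged for any object that can be computed from a list, which is the point of the method.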

A simple example

• Data available comparing grades before and after leaving graduate school amongst 15 U.S. Universities.
• Some linear correlation between grades (high incoming usually means high outgoing). Correlation coefficient = 0.776
• But how reliable is this result ?
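One way to answer that question is to bootstrap the correlation coefficient itself. The sketch below uses synthetic data as a stand-in, since the 15 data points are not reproduced here. The generated values, the seed and the helper name `pearson` are all assumptions, so the printed numbers will not match 0.776.

```python
import random
import statistics

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2001)
# Synthetic stand-in for the 15 universities (NOT the data from the talk)
incoming = [rng.gauss(0, 1) for _ in range(15)]
pairs = [(x, 0.8 * x + 0.6 * rng.gauss(0, 1)) for x in incoming]

r_original = pearson(pairs)

# Resample whole (incoming, outgoing) pairs so each university stays intact.
# A resample made of one repeated pair would leave r undefined, but with
# 15 distinct points that is vanishingly unlikely.
r_boot = [pearson(rng.choices(pairs, k=len(pairs))) for _ in range(1000)]
r_sorted = sorted(r_boot)

print(f"r = {r_original:.3f}")
print(f"bootstrap standard error = {statistics.stdev(r_boot):.3f}")
print(f"roughly the central 95% of replicates: {r_sorted[25]:.3f} to {r_sorted[974]:.3f}")
```

The spread of `r_boot` is the bootstrap's estimate of how far the quoted coefficient could move if the study were repeated on a different set of 15 universities.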

Addendum : The Jack-knife

• Jack-knife is a special kind of bootstrap.
• Each bootstrap subsample has all but one of the original elements of the list.
• For example, if original list has 10 elements, then there are 10 jack-knife subsamples.
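The leave-one-out construction is short enough to write out. This is a minimal sketch, with a hypothetical helper name `jackknife_subsamples` and a made-up list of 10 elements:

```python
def jackknife_subsamples(data):
    """Yield every list obtained by leaving exactly one element out."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:]

data = list(range(1, 11))  # hypothetical list of 10 elements
subsamples = list(jackknife_subsamples(data))

print(len(subsamples))     # 10 subsamples, one per omitted element
print(len(subsamples[0]))  # each subsample keeps 9 of the 10 elements
```

Unlike the random bootstrap, there is nothing to seed or repeat here: the set of subsamples is fixed by the data.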

How many bootstraps ?

• No clear answer to this. Lots of theorems on asymptotic convergence, but no real estimates !
• Rule of thumb : try it 100 times, then 1000 times, and see if your answers have changed by much.
• Anyway, a list of N elements has $N^N$ possible subsamples
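The rule of thumb is easy to automate. The sketch below is an illustration under stated assumptions: the helper `bootstrap_error`, the 30-point skewed sample and the seed are hypothetical, and the quantity being bootstrapped is just the mean.

```python
import random
import statistics

def bootstrap_error(data, n_boot, rng):
    """Standard deviation of the mean over n_boot resamples drawn with replacement."""
    means = [statistics.fmean(rng.choices(data, k=len(data))) for _ in range(n_boot)]
    return statistics.stdev(means)

rng = random.Random(7)
data = [rng.expovariate(1.0) for _ in range(30)]  # hypothetical skewed sample

# Try increasing numbers of bootstraps and watch whether the answer settles
for n_boot in (100, 1000, 10000):
    print(n_boot, round(bootstrap_error(data, n_boot, rng), 4))

# Number of ordered resamples of an N-element list: N to the power N
print(len(data) ** len(data))
```

Even for a list of 30 elements the last number has 45 digits, so in practice one only ever samples a tiny fraction of the possible subsamples.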

Is it reliable ?

• A very very good question !
• Jury still out on how far it can be applied, but for now nobody is going to shoot you down for using it.
• Good agreement for Normal (Gaussian) distributions; skewed distributions tend to be more problematic, particularly for the tails (bootstrap underestimates the errors).

Case Study I : Phylogenetic Trees

Get a multiple sequence alignment

     C1  C2  C3
S1   A   A   G
S2   A   A   A
S3   G   G   A
S4   A   G   A

Construct a tree using your favourite method (Parsimony, ML, etc.)

How confident are we of this tree ?

• For example, how confident are we that two sequences are in the same clade ?
• I.E. what is the probability distribution of our confidence of the branches ?
• Certainly not a problem that Stat. 101 can handle !
• Bootstrap can provide a way of determining this (first thought of by Felsenstein, 1985)

Having created an ensemble of phylogenetic trees, one can elucidate the statistical frequency of various features of the tree. E.G. do two sequences lie in the same clade ?

Can this be used for statistical significance ? This is very much an open question !!!! (Be cautious, and assume not…...)
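As a rough sketch of how such an ensemble is generated, the code below resamples the columns of the four-sequence alignment shown earlier, on the assumption (as in Felsenstein's scheme) that the resampled “elements” are alignment columns. Building real trees is beyond a short example, so a deliberately crude stand-in is used instead: it asks whether S2 is strictly the closest sequence to S1 by number of mismatches. The function names and the seed are hypothetical.

```python
import random

# The alignment from the slide: rows are sequences, positions are columns C1..C3
alignment = {"S1": "AAG", "S2": "AAA", "S3": "GGA", "S4": "AGA"}

def resample_columns(aln, rng):
    """Draw alignment columns with replacement, keeping the rows aligned."""
    n_cols = len(next(iter(aln.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}

def mismatches(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def s1_groups_with_s2(aln):
    """Crude stand-in for tree building: is S2 strictly the closest sequence to S1?"""
    d = {name: mismatches(aln["S1"], seq) for name, seq in aln.items() if name != "S1"}
    return d["S2"] < min(d["S3"], d["S4"])

rng = random.Random(1985)
n_boot = 1000
hits = sum(s1_groups_with_s2(resample_columns(alignment, rng)) for _ in range(n_boot))
support = hits / n_boot

print(f"S1 groups with S2 in {support:.0%} of replicates")
```

In a real analysis `s1_groups_with_s2` would be replaced by a full tree reconstruction followed by a test of whether the two sequences share a clade; the resampling loop itself would stay the same.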

Case Study II : Gene expression data and Bayesian (Probabilistic) networks

• A method for elucidating which genes are regulating the production of which genes.
• Problem is that it is difficult to determine how reliable the edges of the network are
• The bootstrap method is the favoured approach…..

Ideally, what you want is the following

Formally, we get a joint probability distribution which takes the form :

$P(G_1, G_2, \ldots) = \cdots \times P(G_3 \mid G_1, G_2) \times \cdots \times P(G_7 \mid G_3) \times \cdots$

etc….

More importantly, we can tell which genes directly affect which genes (e.g. G1 and G2 acting on G3) and which ones are indirect (e.g. G6 acting on G3)
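The product quoted above is one instance of the general factorisation of a Bayesian network. As a sketch, writing $\mathrm{Pa}(G_i)$ for the set of genes with an edge pointing directly into $G_i$ (this notation is an assumption, not from the slides):

$$
P(G_1, G_2, \ldots, G_n) = \prod_{i=1}^{n} P\bigl(G_i \mid \mathrm{Pa}(G_i)\bigr)
$$

The two factors quoted above correspond to $\mathrm{Pa}(G_3) = \{G_1, G_2\}$ and $\mathrm{Pa}(G_7) = \{G_3\}$. A direct effect is membership of $\mathrm{Pa}$; an indirect effect acts only through a chain of such factors.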

But there is a problem….

• Finding the right network is an NP-hard problem.
• Have to apply various heuristic techniques….
• Also, given the paucity of data it is not clear that any given connection between two genes is not a spurious correlation that will vanish with more statistics.

Summary of the Bootstrap method

• Original object O (a tree, a best fit...) is computed from a “list of data” (numbers, sequences, microarray data,….).
• Construct a new list, with the same number of elements, from the original list by randomly picking elements from the list. Any one element from the list can be picked any number of times.
• Compute new object, call it O1
• Repeat the process many times (typically 100-1000).
• The elements {O1 , O2 , ……} are assumed to be taken from a statistical distribution, so one can compute averages, variances, etc.

Conclusions

• Don’t feel bad if this went over your head !
• I’m happy to explain this again……..
• Textbook : Randomization, Bootstrap and Monte Carlo Methods in Biology, B.F.J. Manly, Chapman & Hall
• Many extra subtleties (parametric, non-parametric, random numbers) have not been discussed.
• Do NOT scrimp on the explanation of this method when you are writing it up !!!
