Introduction to Statistics and Data Analysis
Chris Holmes
Professor of Biostatistics, Department of Statistics,
& Wellcome Trust Centre for Human Genetics
DTC 2012
Chris Holmes Intro Stats
What is Statistics?
Statistics:
◦ Statistics is the science and art of data analysis, whether the data come from observational studies or from planned experiments
◦ Statistics is concerned with the collection, analysis and interpretation of data
◦ It is the science of the scientific method
Branches of statistics
◦ Statistics covers a wide range of areas, from how best to collect data (optimal design of experiments) to the construction of predictive stochastic (empirical) models
◦ Some areas of note include
- graphical displays of data
- stochastic modelling of systems
- predictive algorithms
Uncertainty
◦ At the heart of Statistics is the rigorous treatment of uncertainty or random variation, characterised via probability
◦ Statistics works in units of uncertainty (you can think of probability as the currency)
◦ Probability:
- probability provides a formal system to quantify uncertainty
- probability calculus provides a formal system to update uncertainty (beliefs) in light of information (data)
- allows for coherent accumulation of evidence supporting or refuting a scientific hypothesis of interest
◦ Statistics is about being precise about our level of imprecision
Probability
Probability (of continuous, real-valued observations) has deep-rooted mathematical foundations
But for us we shall only need to deal with some simple aspects:
Consider an arbitrary event denoted A, e.g., A ≡ “Britain will join the Euro currency in 2012”
Then:
Pr(A) ∈ [0, 1]
- Probability ranges between 0 and 1
Pr(A) = 0
- Event A can never occur: “I’d take a bet of 1p for A occurring in return for my life” (assuming you don’t wish to die)
Pr(A) = 1
- Event A will surely occur: “I’ll take a bet of 1p in return for all my worldly wealth if A does not happen”
And so on, e.g. Pr(A) = 0.5, equal chance of A occurring or not (note: we need to define “chance” without referring to probability!), “I’d be happy for my worst enemy to decide on A or A′ for me in a bet of equal odds”
Note that the use of Probability to refer to degrees of belief in arbitrary events is not without controversy
Some (many) feel that Probability should be restricted to events that can be measured via the long-run frequency of outcomes under perfectly repeatable trials (such as in hypothetical games of chance)
- The probability that a fair coin tossed 5 times gives {H,H,H,H,H}
- The probability of a Royal Flush in Poker
I find this overly restrictive and am happy to interpret Probability in terms of personal degrees of belief (measures of uncertainty) in events (Savage, 1954)
- What’s the probability that I had more than 10 quid in my wallet at some point yesterday?
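The two frequency-based examples on this slide can be computed by direct counting. The sketch below assumes a fair coin with independent tosses and a single 5-card deal from a standard 52-card deck (poker variants with draws or shared cards would change the second number); the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# Five independent tosses of a fair coin: each toss is H with probability 1/2.
p_five_heads = Fraction(1, 2) ** 5

# Royal Flush in a single 5-card deal: one winning hand per suit (4 in total)
# out of all comb(52, 5) equally likely 5-card hands.
p_royal_flush = Fraction(4, comb(52, 5))

print(p_five_heads)    # 1/32
print(p_royal_flush)   # 1/649740
```

No such counting argument is available for the wallet question, which is why it needs the degree-of-belief interpretation.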
Updating uncertainty
Coherent updating of uncertainty in light of information follows conditional probability calculus
Pr(A|B), to be read “my updated beliefs in event A occurring given knowledge of the status of event B”, is defined as
\[ \Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} \tag{1} \]
where Pr(A,B) is the joint probability of both (A,B) occurring, and Pr(B) is a normalising constant (that does not change with the outcome of A) which ensures Pr(A|B) ∈ [0, 1]. In fact, by the Theorem of Total Probability,
Pr(B) = Pr(A,B) + Pr(A′,B)
Note also that from (1) we have the useful identity Pr(A,B) = Pr(A|B)Pr(B)
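As a small numerical illustration of equation (1) and the Theorem of Total Probability, the sketch below uses a made-up joint distribution over two binary events; the four probabilities are assumptions chosen only so that they sum to 1.

```python
from fractions import Fraction

# Hypothetical joint probabilities for two binary events A and B.
# Keys are (A occurs?, B occurs?); the four values sum to 1.
joint = {
    (True, True): Fraction(10, 100),
    (True, False): Fraction(20, 100),
    (False, True): Fraction(30, 100),
    (False, False): Fraction(40, 100),
}

# Theorem of Total Probability: Pr(B) = Pr(A, B) + Pr(A', B)
pr_b = joint[(True, True)] + joint[(False, True)]

# Equation (1): Pr(A | B) = Pr(A, B) / Pr(B)
pr_a_given_b = joint[(True, True)] / pr_b

print(pr_b)          # 2/5
print(pr_a_given_b)  # 1/4
```

Here the belief in A before learning B is Pr(A) = 3/10, so learning that B occurred lowers it to 1/4.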
Updating uncertainty II – Bayes Rule
Given the definition of conditional probability of A|B, then clearly also
\[ \Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)} \]
and equating the two expressions for Pr(A,B) and rearranging leads us to
Bayes Rule:
\[ \Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)} \]
Bayes Rule allows us to express beliefs in A|B in terms of B|A and background beliefs Pr(A) (before we knew the status of B), which turns out to be extremely useful!
This is why Bayesian updating is also referred to as “inverse probability”
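A minimal worked instance of Bayes Rule, using a diagnostic-test setting that is not from the lecture: A is "the patient has the disease" and B is "the test is positive". All three input probabilities are hypothetical.

```python
from fractions import Fraction

# Hypothetical inputs: A = "patient has the disease", B = "test is positive".
pr_a = Fraction(1, 100)               # background belief Pr(A)
pr_b_given_a = Fraction(95, 100)      # Pr(B | A)
pr_b_given_not_a = Fraction(5, 100)   # Pr(B | A')

# Total probability: Pr(B) = Pr(B | A) Pr(A) + Pr(B | A') Pr(A')
pr_b = pr_b_given_a * pr_a + pr_b_given_not_a * (1 - pr_a)

# Bayes Rule: Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)
pr_a_given_b = pr_b_given_a * pr_a / pr_b

print(pr_a_given_b)  # 19/118, roughly 0.16
```

Even with a fairly accurate test, the updated belief stays modest because the background belief Pr(A) was small.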
Statistical Data Analysis
Broadly speaking, the analysis of data proceeds in two stages
◦ Exploratory analysis of data via graphing and summary statistics
◦ Formal statistical modelling of dependence structures of interest, e.g. for prediction, or for evaluating the empirical evidence for a particular scientific hypothesis
Today we will deal with topics related to exploratory analysis of data
Due to time constraints we will not cover the optimal design of experiments that precedes the above tasks for experimental studies, although this is an important discipline: “How to set up an experiment and collect samples so as to maximise the information content, reduce bias, and reduce confounding?”
Graphing data
◦ ALL good statistical data analysis begins with graphical plots and summary statistics of the data
◦ ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!
◦ Why?
Graphical Excellence
“Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency”
- Edward Tufte: The Visual Display of Quantitative Information
Graphical Excellence
Excellent graphics:
◦ show the data
◦ induce the viewer to think about the substance
◦ avoid bias
◦ make large complex data sets coherent
◦ encourage data exploration and debate
Moreover:
◦ Graphical plots and summary stats provide a feel for the variation in the data
◦ They can also highlight unusual results, measurement errors, outliers
- Such features can severely distort your results if left unchecked!
- Many formal tests assume that the data follow a certain pattern (a probability distribution such as the Normal); if these assumptions are invalid, the results can be completely misleading
- Confidence in these assumptions can be gained through plotting the data
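To make the point about unchecked outliers concrete, the sketch below compares two summary statistics on a made-up sample before and after a single recording error is introduced. The data values are assumptions for illustration only.

```python
from statistics import mean, median

# Hypothetical measurements; the last value of `with_error` is a made-up
# recording error (50.0 entered instead of 5.0).
clean = [4.8, 4.9, 5.0, 5.1, 5.2]
with_error = clean + [50.0]

print(mean(clean), median(clean))
print(mean(with_error), median(with_error))
```

The mean moves from about 5 to about 12.5 while the median barely changes, which is the kind of distortion a simple plot of the raw values would expose immediately.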
In the intro stats course (last term) we covered histograms, boxplots, scatterplots, qq-plots, pie-charts (don’t use them!) and barcharts. It is expected that you familiarise yourself with these
However, modern experiments in the life sciences, genetics, genomics and molecular biology can produce huge data sets, involving 1000s-to-1000000s of measurements on 100s-to-1000s of samples
For example, mRNA microarrays can measure gene-expression levels for all known genes in the genome within a tissue sample, and whole genome sequencing can read off all of the 3 × 10^9 DNA bases in a human genome
These data sets can hold signals associated with heritable disease risk, or provide a snapshot of the functional status of a population of cells, which have important applications in
◦ Biomarkers
◦ Understanding function
Exploring “Big Data”
Modern bio-data sets are large and highly structured
Underlying biological mechanisms lead to strong dependences between variables
Conventional exploratory tools such as scatterplots become impractical because of the dimensionality
To explore such high-dimensional data it is useful to consider multivariate techniques. We shall consider two:
◦ Cluster Analysis
◦ Principal Components Analysis
Cluster Analysis
In Big Data it is interesting to search for and highlight sub-groups of samples embedded within high-d data that show self-similarity, such that objects within a group are more similar to one another than to those in other groups, or to establish that no such groups exist
E.g., in mRNA gene-expression analysis of tumour samples it is interesting to see if there are undetected sub-groups that perhaps relate to heterogeneity in clinical outcome; or, in population genetics, to highlight individuals that are (cryptically) more closely or distantly related than expected
This process of detecting and assigning objects to sub-groups can be generally referred to as Cluster Analysis
Cluster Analysis
Broadly speaking, there are two characterisations of cluster analysis methods
- Model based or Model free, for defining the similarity between objects
- Hierarchical or Partition, for assigning objects to clusters based on their similarity
so you can have {{hierarchical model based}, {hierarchical model free}, {model based partition}, {model free partition}} clustering
Model based
- assumes a probability distribution for each group
- e.g. model objects within a group as arising from a Normal (Gaussian) distribution
Model free (or algorithmic)
- define an arbitrary distance metric between objects
- e.g., Euclidean \( d_{ij} = \sqrt{\sum_{s} (x_{is} - x_{js})^2} \)
Partition
- divide up the space (probabilistically for model based)
- assign objects to clusters by the partition they fall in
[Figure: (a) Model free Partition; (b) Model based Partition]
Partitioning
One issue with Partitioning is that you need to define the number of clusters
If you wish to explore the data over differing resolutions then you’d want to examine the clusterings obtained from k = {1, 2, . . . , n} clusters (with n data points)
You could simply run n separate Partition models, one for each value of k. But then there is no dependence between the clusterings, so that, say, the k = 5 and k = 6 clusterings might be very dissimilar
Hierarchical clustering allows one to explore the data across multiple resolutions via recursive partitioning (division) or recursive merging (agglomeration) of data objects
Hierarchical Clustering
HC works in either divisive or agglomerative fashion
Divisive (top down)
- Start with all data points in a single cluster
- Partition the cluster into two clusters
- For each cluster, partition into two clusters; Repeat
Agglomerative (bottom up)
- Start with each point in its own cluster
- Merge the two most similar points (clusters) into a single cluster
- Repeat until all points are in a single cluster
To complete either we’ll need to define a score (model free) or a probability (model based) that measures the similarity between two clusters
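The agglomerative (bottom up) recipe can be sketched in a few lines. This is a naive illustration, not an efficient implementation: it assumes one-dimensional data and uses the smallest pairwise absolute difference between two clusters as the model free score, which is only one possible choice. The function names and the data are hypothetical.

```python
def single_linkage(a, b):
    """Smallest absolute difference between a member of a and a member of b."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points, score=single_linkage):
    """Bottom-up clustering; returns the list of (merged cluster, score) steps."""
    clusters = [[p] for p in points]   # start with each point in its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest score
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score(clusters[i], clusters[j])
                if best is None or s < best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        # replace the two clusters by their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((merged, s))
    return merges


merges = agglomerate([10, 12, 50, 51, 90])
for cluster, height in merges:
    print(height, cluster)
```

The sequence of merges and the scores at which they happen is the information that hierarchical clustering summarises as a tree.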
Hierarchical Clustering: dendrogram
Such a recursive approach then produces a dendrogram (tree) that represents the clustering,
where the length of the branches quantifies the “distance” between clusters
The dendrogram provides a useful semi-quantitative description of the similarity and major groupings of objects in a data table
Model free HC
In order to decide which clusters to join / split we need to define a distance between objects
Common choices are
- Euclidean,
\[ d_{ij} = \lVert x_i - x_j \rVert_2 = \sqrt{\sum_{v=1}^{p} (x_{iv} - x_{jv})^2} \]
where $d_{ij}$ is the distance between the i’th and j’th objects, and $x_{iv}$ denotes the v’th of p measurements on $x_i$
- Absolute (Manhattan)
\[ d_{ij} = \lVert x_i - x_j \rVert_1 = \sum_{v=1}^{p} \lvert x_{iv} - x_{jv} \rvert \]
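Both distances translate directly into code. The sketch below assumes each object is a plain list of p numbers; the function names and the two example objects are illustrative.

```python
from math import sqrt


def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x_i = [0.0, 0.0]
x_j = [3.0, 4.0]
print(euclidean(x_i, x_j))  # 5.0
print(manhattan(x_i, x_j))  # 7.0
```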
Linkage
Given a metric we can calculate the pairwise distance matrix, D, that records the distance between every pair of objects, $(D)_{ij} = d_{ij}$, $i = 1, \dots, n$, $j = i + 1, \dots, n$
We now need to score any potential split / merge of a cluster(s) to decide on the best next step
The linkage method defines the overall distance between two sets (clusters) of observations
Consider two clusters A and B
Common types of linkage include
- Single linkage
\[ \min_{i \in A,\, j \in B} d_{ij} \]
- Complete linkage
\[ \max_{i \in A,\, j \in B} d_{ij} \]
- Average linkage
\[ \frac{1}{|A||B|} \sum_{i \in A,\, j \in B} d_{ij} \]
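The three linkage rules differ only in how they summarise the set of between-cluster distances. The sketch below assumes Euclidean distance (via the standard library's `math.dist`) and two small made-up clusters of 2-D points; the function names are illustrative.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def pairwise(A, B):
    """All distances d_ij with i in cluster A and j in cluster B."""
    return [dist(a, b) for a in A for b in B]


def single_linkage(A, B):
    return min(pairwise(A, B))


def complete_linkage(A, B):
    return max(pairwise(A, B))


def average_linkage(A, B):
    return sum(pairwise(A, B)) / (len(A) * len(B))


A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(A, B))    # smallest pairwise distance, 3
print(complete_linkage(A, B))  # largest pairwise distance, sqrt(17)
print(average_linkage(A, B))   # mean of the four pairwise distances
```

Single linkage tends to chain clusters together through their closest members, while complete linkage is driven by their most distant members; average linkage sits between the two.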
Single, Complete, and Average Linkage
Biclustering
Suppose the data are recorded in a matrix X with n rows of objects and p columns of measurements
It may well be of interest to cluster both the objects (rows) and the measurements (columns)
Then plot out the joint dendrogram on top of the distance matrix
Known as biclustering
Biclustering of mRNA from case-control samples reveals gene expression profiles
Principal Components Analysis
In biological systems, measurements of molecular phenotypes, such as mRNA, miRNA and DNA, are high dimensional and strongly dependent due to fundamental mechanisms such as gene function, biological pathways or recombination
In exploratory data analysis we would like to identify patterns embedded in complex data tables (from experiments)
PCA is one of the most important and widely used methods in exploratory statistics, used in a huge variety of applications
- reveals patterns hidden in high-d data tables
- provides low dimensional views of high-d data
PCA
Suppose X is an (m × n) data table, e.g. m rows measuring gene-expression, on n columns of samples
\[ X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix} \]
where
\[ x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \end{bmatrix} \]
We suspect that signal (interesting patterns) may be contained in a number of dimensions much lower than m. For example, within a few correlated genes (on a common pathway)
Change of Basis
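We can seek a linear projection of the m dimensional space into a new basis space via
\[ PX = Y \]
where P is an (m × m) square matrix,
\[ PX = \begin{bmatrix} - & p_1 & - \\ & \vdots & \\ - & p_m & - \end{bmatrix} \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix} \]
and Y is an (m × n) matrix,
\[ Y = \begin{bmatrix} p_1 x_1 & \cdots & p_1 x_n \\ \vdots & \ddots & \vdots \\ p_m x_1 & \cdots & p_m x_n \end{bmatrix} \]
with elements $(Y)_{ij} = p_i x_j$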
The row vectors $p_i$, i = 1, . . . , m, provide linear combinations of the m original measurements
\[ p_1 = [p_{11}, p_{12}, \dots, p_{1m}] \]
You should convince yourself that information is not lost by making such a projection (transformation): if P is square and invertible, so that $P^{-1}P = I$, then
\[ X = P^{-1}PX = P^{-1}Y \]
So we can invert the transformation to get from Y back to X
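One way to convince yourself is to try it on a tiny example. The sketch below uses a made-up invertible 2 × 2 matrix P (with its inverse written out by hand) and a made-up 2 × 3 data table X; `matmul` is a hypothetical helper, not a library function.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical 2 x 2 invertible P (determinant 1) and its inverse.
P = [[2, 1],
     [1, 1]]
P_inv = [[1, -1],
         [-1, 2]]

# Hypothetical data table: m = 2 measurements (rows), n = 3 samples (columns).
X = [[1, 2, 3],
     [4, 5, 6]]

Y = matmul(P, X)            # change of basis, Y = PX
X_back = matmul(P_inv, Y)   # invert it, X = P^{-1} Y

print(Y)       # [[6, 9, 12], [5, 7, 9]]
print(X_back)  # [[1, 2, 3], [4, 5, 6]]
```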
Optimal choice of basis
We wish to reveal patterns embedded in X
Hence we should construct P to compress (i.e. project) the structured parts of X (the signal) into the first few bases (dimensions of Y)
Then we can visualise and explore X in a much lower dimensional space
That is, choose the first few $p_1, p_2, \dots, p_k$ with $k \ll m$ to preserve the signal, so that $p_{k+1}, \dots, p_m$ contain unstructured variation (noise)
Optimal choice of basis
We still need to define precisely what we mean by “optimal”
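We shall first constrain P to form an orthonormal basis (to make the problem well posed)
\[ PP^T = I \]
Thinking of a scatterplot, each $p_i$ is now akin to a rotation of one of the original axes
\[ p_i \cdot p_i^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_i^T \\ | \end{bmatrix} = 1 \]
and
\[ p_i \cdot p_j^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_j^T \\ | \end{bmatrix} = 0 \quad \text{for } j \neq i \]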
Signal-to-noise
If we believe the unstructured noise is identically distributed and independent across the m measurements, then a second order statistic with which to define “optimality” is the signal-to-noise ratio (SNR), which we seek to maximise,
\[ \mathrm{SNR} = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}} \]
where $\sigma^2$ denotes the variance (note the assumption here is that variance is a good measure of “signal”)
So SNR >> 1 suggests a lot of signal (pattern) in the data
Constructive derivation of basis
Suppose we wish to construct our first basis (projection) p1
If the noise is common across measurements (equivalent to a ball of noise in the original axes) then to maximise the SNR in the first projection we can simply maximise the variance of the first axis (row) of Y
Let $C_Y$ denote the variance-covariance matrix of Y, and assume Y is centred (mean zero in each direction)
\[ C_Y = \frac{1}{n-1} Y Y^T \]
so $(C_Y)_{11}$ records the variance along the first row of Y
So find $p_1$ so as to maximise the spread of points (as defined by the variance) along the axis of $Y_{1\cdot}$
Then, having set $p_1$, find $p_2$ that maximises the variance, subject to $p_1 \cdot p_2^T = 0$
and then $p_3$ subject to { $p_1 \cdot p_3^T = 0$ and $p_2 \cdot p_3^T = 0$ }, and so on for $p_4$ etc.
Calculating the basis – for the more mathematical among you
Recall we are trying to find P so as to maximise the diagonal elements of $C_Y = \frac{1}{n-1} Y Y^T$, and
\[ \begin{aligned} C_Y &= \frac{1}{n-1} Y Y^T \\ &= \frac{1}{n-1} (PX)(PX)^T \\ &= \frac{1}{n-1} P X X^T P^T \\ &= \frac{1}{n-1} P (X X^T) P^T \end{aligned} \]
and we recognise $XX^T$ as the (unnormalised) variance-covariance of X
Eigen-decomposition
$XX^T$ is square symmetric and hence has an eigen-decomposition
\[ XX^T = U D U^T \]
where U is a matrix whose columns are the eigenvectors of $XX^T$ and D is a diagonal matrix storing the m eigenvalues of $XX^T$ in decreasing order
Now, select the rows of P to be the eigenvectors of $XX^T$, so $P = U^T$,
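in which case,
\[ \begin{aligned} C_Y &= \frac{1}{n-1} P (XX^T) P^T \\ &= \frac{1}{n-1} P (P^T D P) P^T \\ &= \frac{1}{n-1} (PP^T) D (PP^T) \\ &= \frac{1}{n-1} (I) D (I) \\ &= \frac{1}{n-1} D \end{aligned} \]
*** Setting $P = U^T$ makes P an orthonormal basis, $PP^T = I$, and maximises the variance in each direction (subject to orthogonality to the preceding directions) *** !!!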
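The result $C_Y = \frac{1}{n-1} D$ can be checked numerically on a toy table. The sketch below is restricted to m = 2 measurements so that the eigenvectors of the symmetric 2 × 2 matrix $XX^T$ have a closed form (a rotation by a single angle); for general m one would use a numerical eigen-solver instead. The data and helper functions are assumptions for illustration.

```python
from math import atan2, cos, sin


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical data table: m = 2 measurements (rows), n = 5 samples (columns).
X = [[-2.0, -1.0, 0.0, 1.0, 2.0],
     [-1.0, -1.0, 0.0, 1.0, 1.0]]
n = len(X[0])

# Centre each row so that X X^T is the (unnormalised) variance-covariance.
X = [[v - sum(row) / n for v in row] for row in X]

S = matmul(X, transpose(X))            # X X^T, here 2 x 2 and symmetric
a, b, c = S[0][0], S[0][1], S[1][1]

# For a symmetric 2 x 2 matrix the leading eigenvector is (cos t, sin t)
# with t = atan2(2b, a - c) / 2; the second eigenvector is orthogonal to it.
t = 0.5 * atan2(2 * b, a - c)
P = [[cos(t), sin(t)],                 # p_1: direction of maximum variance
     [-sin(t), cos(t)]]                # p_2: orthogonal to p_1

Y = matmul(P, X)                       # projected data, Y = PX
C_Y = [[v / (n - 1) for v in row] for row in matmul(Y, transpose(Y))]

for row in C_Y:
    print([round(v, 6) for v in row])
```

The off-diagonal entries of `C_Y` come out as zero up to rounding, and the diagonal holds the eigenvalues of $XX^T$ divided by n − 1, with the larger one first.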
Example of PCA
In population genetics and genetic epidemiology we often genotype individuals from a large cross-section of the population
Genotyping measures positions in the genome where we know common variation exists between individuals
- there are roughly 3,000,000 DNA bases in the human genome where we can expect greater than 1% of the population to show variation
E.g. suppose at a locus we know that some people might be {A,A}, some {A,T} and others {T,T}. We could encode this as {0, 1, 2}
Now construct a large data table X with elements $(X)_{ij}$ ∈ {0, 1, 2} for, say, m = 3,000,000 loci (rows) and n = 1000s of individuals (columns)
Perform PCA on X, project individuals onto the first few PCs, and explore
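The {0, 1, 2} encoding amounts to counting copies of one chosen allele at each locus. The sketch below builds a tiny genotype table this way; the function name, the choice of T as the counted allele, and the genotypes themselves are all hypothetical, and real data would have different alleles at different loci.

```python
def encode(genotype, counted_allele="T"):
    """Number of copies (0, 1 or 2) of `counted_allele` in an unordered pair."""
    return sum(allele == counted_allele for allele in genotype)


# Hypothetical genotypes: 2 loci (rows) by 3 individuals (columns).
genotypes = [
    [("A", "A"), ("A", "T"), ("T", "T")],
    [("T", "T"), ("T", "A"), ("A", "A")],
]

X = [[encode(g) for g in locus] for locus in genotypes]
print(X)  # [[0, 1, 2], [2, 1, 0]]
```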
PCA on genotype matrix mirrors geography, Nature, 456, 98-101
What have we learnt?
Statistics is the scientific discipline concerned with the analysis and interpretation of data (in an increasingly data rich world)
Statistics concerns itself with uncertainty, quantified using probability, and updated using probability calculus
The analysis of data should proceed via exploratory analysis followed by formal modelling (if required)
Exploratory analysis involves graphical interrogation of the data and summary statistics
In high-d data, cluster analysis and PCA can provide useful tools for exploration of structure
Key References
◦ Cleveland, W. S. (1993). Visualizing Data. Hobart Press
◦ Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edn. Springer
◦ Savage, L. J. (1954). The Foundations of Statistics. Dover. - worth reading Chapters 1-8 for those fascinated with the foundations of subjective probability (Bayesian stats)
◦ Tufte, E. (2001). The Visual Display of Quantitative Information. 2nd Edn. Graphics Press. - you will never look at a graph in the same way again
◦ Wainer, H. (1984). How to display data badly. The American Statistician, Vol. 38, No. 2