Introduction to Statistics and Data Analysis Chris Holmes Professor of Biostatistics, Department of Statistics, & Wellcome Trust Centre for Human Genetics DTC 2012 Chris Holmes Intro Stats

Introduction to Statistics and Data AnalysisIn the intro stats course (last term) we covered histograms, boxplots, scatterplots, qq-plots, pie-charts (don’t use them!), barcharts


Introduction to Statistics and Data Analysis

Chris Holmes

Professor of Biostatistics,Department of Statistics,

& Wellcome Trust Centre for Human Genetics

DTC 2012

Chris Holmes Intro Stats


What is Statistics?


Statistics:

◦ Statistics is the science and art of data analysis - from observational studies - or from planned experiments

◦ Statistics is concerned with the collection, analysis and interpretation of data

◦ It is the science of the scientific method


Branches of statistics

◦ Statistics covers a wide range of areas, from how best to collect data (optimal design of experiments) to the construction of predictive stochastic (empirical) models

◦ Some areas of note include

- graphical displays of data

- stochastic modelling of systems

- predictive algorithms


Uncertainty

◦ At the heart of Statistics is the rigorous treatment of uncertainty or random variation characterised via probability

◦ Statistics works in units of uncertainty (you can think of probability as the currency)

◦ Probability:

- probability provides a formal system to quantify uncertainty

- probability calculus provides a formal system to update uncertainty (beliefs) in light of information (data)

- allows for coherent accumulation of evidence supporting or refuting a scientific hypothesis of interest

◦ Statistics is about being precise about our level of imprecision


Probability

Probability (of continuous, real-valued observations) has deep-rooted mathematical foundations

But for us we shall only need to deal with some simple aspects:


Consider an arbitrary event denoted A, e.g., A ≡ “Britain will join the Euro currency in 2012”

Then:

Pr(A) ∈ [0, 1]

- Probability ranges between 0 and 1

Pr(A) = 0

- Event A can never occur: “I’d take a bet of 1p for A occurring in return for my life” (assuming you don’t wish to die)

Pr(A) = 1

- Event A will surely occur: “I’ll take a bet of 1p in return for all my worldly wealth if A does not happen”

And so on, e.g. Pr(A) = 0.5, equal chance of A occurring or not (note: we need to define “chance” without referring to probability!), “I’d be happy for my worst enemy to decide on A or A′ for me in a bet of equal odds”


Note that the use of Probability to refer to degrees of belief in arbitrary events is not without controversy

Some (many) feel that Probability should be restricted to events that can be measured via the long-run frequency of outcomes under perfectly repeatable trials (such as in hypothetical games of chance)

- The probability that a fair coin tossed 5 times gives {H,H,H,H,H}

- The probability of a Royal Flush in Poker
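Both repeatable-trial examples can be computed by counting equally likely outcomes. The sketch below is illustrative only: it assumes independent fair tosses and a single five-card hand dealt from a standard 52-card deck (the slides do not say which poker variant is meant).

```python
from fractions import Fraction
from math import comb

# Five independent tosses of a fair coin: one favourable sequence out of 2**5.
p_five_heads = Fraction(1, 2) ** 5

# Royal flush: one per suit (4 hands) out of all C(52, 5) five-card hands.
# Assumption: a single five-card deal from a standard 52-card deck.
p_royal_flush = Fraction(4, comb(52, 5))

print(p_five_heads)    # 1/32
print(p_royal_flush)   # 1/649740
```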

I find this overly restrictive and am happy to interpret Probability in terms of personal degrees of belief (measures of uncertainty) in events (Savage, 1954)

- What’s the probability that I had more than 10 quid in my wallet at some point yesterday?


Updating uncertainty

Coherent updating of uncertainty in light of information follows conditional probability calculus

Pr(A|B), to be read “my updated beliefs in event A occurring given knowledge of the status of event B”,

$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} \qquad (1)$$

where Pr(A,B) is the joint probability of both (A,B) occurring, and Pr(B) is a normalising constant (that does not change with the outcome of A) and ensures Pr(A|B) ∈ [0, 1]; in fact (Theorem of Total Probability)

$$\Pr(B) = \Pr(A, B) + \Pr(A', B)$$

Note also from (1) we have the useful identity Pr(A,B) = Pr(A|B) Pr(B)
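Equation (1) and the Theorem of Total Probability can be checked numerically. The sketch below uses a made-up joint distribution over two binary events; the four numbers are illustrative assumptions, not values from the course.

```python
from fractions import Fraction

# Hypothetical joint probabilities for two binary events A and B.
joint = {
    (True, True): Fraction(3, 20),     # Pr(A, B)
    (True, False): Fraction(5, 20),    # Pr(A, B')
    (False, True): Fraction(1, 20),    # Pr(A', B)
    (False, False): Fraction(11, 20),  # Pr(A', B')
}

# Theorem of Total Probability: Pr(B) = Pr(A, B) + Pr(A', B).
pr_b = joint[(True, True)] + joint[(False, True)]

# Conditional probability, equation (1): Pr(A | B) = Pr(A, B) / Pr(B).
pr_a_given_b = joint[(True, True)] / pr_b

print(pr_b)           # 1/5
print(pr_a_given_b)   # 3/4
```

Multiplying back, Pr(A|B) Pr(B) recovers the joint probability Pr(A,B) = 3/20, which is the identity noted above.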


Updating uncertainty II – Bayes Rule

Given the definition of conditional probability of A|B, then clearly also

$$\Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)}$$

and equating terms and rearranging leads us to,

Bayes Rule:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$

Bayes Rule allows us to express beliefs in A|B in terms of B|A and background beliefs Pr(A) (before we knew the status of B), which turns out to be extremely useful!

This led to Bayesian updating being referred to as “inverse probability”
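As a worked illustration of Bayes Rule, consider a screening test where A is "has the condition" and B is "tests positive". All three input numbers below are hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical inputs (assumptions for illustration only).
pr_a = 0.01              # prior Pr(A): background belief before seeing B
pr_b_given_a = 0.95      # Pr(B | A)
pr_b_given_not_a = 0.05  # Pr(B | A')

# Theorem of Total Probability: Pr(B) = Pr(B | A) Pr(A) + Pr(B | A') Pr(A').
pr_b = pr_b_given_a * pr_a + pr_b_given_not_a * (1.0 - pr_a)

# Bayes Rule: Pr(A | B) = Pr(B | A) Pr(A) / Pr(B).
pr_a_given_b = pr_b_given_a * pr_a / pr_b

print(round(pr_a_given_b, 3))  # 0.161
```

With these assumed numbers the updated belief stays modest even after a positive result, because the background belief Pr(A) is small.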


Statistical Data Analysis

Broadly speaking, the analysis of data proceeds in two stages

◦ Exploratory analysis of data via graphing and summary statistics

◦ Formal statistical modelling of dependence structures of interest, e.g. for prediction, or for evaluating the empirical evidence for a particular scientific hypothesis

Today we will deal with topics related to exploratory analysis of data


Due to time constraints we will not cover the optimal design of experiments that precedes the above tasks for experimental studies, although this is an important discipline: “How to set up an experiment and collect samples so as to maximise the information content, reduce bias, and reduce confounding?”


Graphing data

◦ ALL good statistical data analysis begins with graphical plots and summary statistics of the data

◦ ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!

◦ Why?


Graphical Excellence

“Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency”

- Edward Tufte: The Visual Display of Quantitative Information


Graphical Excellence

Excellent graphics:

◦ show the data

◦ induce the viewer to think about the substance

◦ avoid bias

◦ make large complex data sets coherent

◦ encourage data exploration and debate


Moreover:

◦ Graphical plots and summary stats provide a feel for the variation in the data

◦ They can also highlight unusual results, measurement errors, outliers

- Such features can severely distort your results if left unchecked!

- Many formal tests assume that the data follows a certain pattern (a probability distribution such as Normal); if these assumptions are invalid the results will be completely misleading

- Confidence in these assumptions can be gained through plotting the data
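To see how a single unchecked value can distort summary statistics, here is a small sketch using Python's standard `statistics` module. The measurements are invented for illustration; the last value mimics a recording error.

```python
import statistics

# Hypothetical measurements; the appended value mimics a recording error.
clean = [4.9, 5.1, 5.0, 5.2, 4.8]
with_error = clean + [50.0]

# The mean and standard deviation move a long way; the median barely changes.
print("mean  :", statistics.mean(clean), statistics.mean(with_error))
print("median:", statistics.median(clean), statistics.median(with_error))
print("stdev :", statistics.stdev(clean), statistics.stdev(with_error))
```

A plot of the six values would reveal the stray point at once, which is the reason for plotting before any formal analysis.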


In the intro stats course (last term) we covered histograms, boxplots, scatterplots, qq-plots, pie-charts (don’t use them!), barcharts. It is expected that you familiarise yourself with these

However, modern experiments in the life sciences, genetics, genomics, molecular biology can produce huge data sets, involving 1000s-to-1000000s of measurements on 100s-to-1000s of samples

For example, mRNA microarrays can measure gene-expression levels for all known genes in the genome within a tissue sample, and whole genome sequencing can read off all of the 3 × 10^9 DNA bases in a human genome

These data sets can hold signals associated with heritable disease risk, or provide a snapshot of the functional status of a population of cells, which have important applications in

◦ Biomarkers

◦ Understanding function


Exploring “Big Data”

Modern bio-data sets are large and highly structured

Underlying biological mechanisms lead to strong dependences between variables

Conventional exploratory tools such as scatterplots are not usable due to the dimensionality

To explore such high-dimensional data it is useful to consider multivariate techniques. We shall consider two,

◦ Cluster Analysis

◦ Principal Components Analysis


Cluster Analysis

In Big Data it is interesting to search for and highlight sub-groups of samples embedded within high-d data that show self-similarity, such that objects within a group are more similar to one another than those in other groups, or to know that no such groups exist


E.g., in mRNA gene-expression analysis of tumour samples it is interesting to see if there are undetected sub-groups that perhaps relate to heterogeneity in clinical outcome; or in population genetics, to highlight individuals that are (cryptically) more closely or distantly related than expected

This process of detecting and assigning objects to sub-groups can be generally referred to as Cluster Analysis


Cluster Analysis

Broadly speaking, there are two characterisations of cluster analysis methods

- Model based or Model free, for defining the similarity between objects

- Hierarchical or Partition, for assigning objects to clusters based on their similarity

so you can have { {Hierarchical model based}, {Hierarchical model free}, {Model based partition}, {Model free partition} } clustering


Model based

- assumes a probability distribution for each group

- e.g. model objects within a group as arising from a Normal (Gaussian) distribution

Model free (or algorithmic)

- define an arbitrary distance metric between objects

- e.g., Euclidean $d_{ij} = \sqrt{\sum_{s} (x_{is} - x_{js})^2}$


Partition

- divide up the space (probabilistically for model based)

- assign objects to clusters by the partition they fall in

[Figure: (a) Model free Partition; (b) Model based Partition]
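The slides do not name a specific partition algorithm. As one concrete instance of a model free partition, the sketch below uses k-means (Lloyd's algorithm) with Euclidean distance; the choice of k-means, the toy points, and the starting centres are all assumptions made for illustration.

```python
import math

def kmeans(points, centres, n_iter=20):
    """Lloyd-style k-means: a model free partition using Euclidean distance."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each object joins the partition of its nearest centre.
        labels = [min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
                  for p in points]
        # Update step: move each centre to the mean of its assigned objects.
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[k])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # the partition has stopped changing
        centres = new_centres
    return labels, centres

# Hypothetical 2-d objects forming two well separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centres = kmeans(points, centres=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters is fixed in advance by the number of starting centres supplied.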


Partitioning

One issue with Partitioning is that you need to define the number of clusters

If you wish to explore the data over differing resolutions then you’d want to examine the clusterings obtained from k = {1, 2, . . . , n} clusters (with n data points)

You could simply run n parallel Partition models for the differing values. But then there is no dependence between the clusterings such that, say, k = 5 and k = 6 clusterings might be very dissimilar

Hierarchical clustering allows one to explore the data across multiple resolutions via recursive partitioning (division) or recursive merging (agglomeration) of data objects


Hierarchical Clustering

HC works in either divisive or agglomerative fashion

Divisive (top down)

- Start with all data points in a single cluster

- Partition the cluster into two clusters

- For each cluster, partition into two clusters; Repeat

Agglomerative (bottom up)

- Start with each point in its own cluster

- Merge two points (clusters) into a single cluster

- Repeat

To complete either we’ll need to define a score (model free) or a probability (model based) that measures the similarity between two clusters


Hierarchical Clustering: dendrogram

Such a recursive approach then produces a dendrogram (tree) that represents the clustering, where the length of the branches quantifies the “distance” between clusters

The dendrogram provides a useful semi-quantitative description of the similarity and major groupings of objects in a data table


Model free HC

In order to decide which clusters to join / split we need to define a distance between objects

Common choices are

- Euclidean,

$$d_{ij} = \lVert x_i - x_j \rVert_2 = \sqrt{\sum_{v=1}^{p} (x_{iv} - x_{jv})^2}$$

where $d_{ij}$ is the distance between the i’th and j’th objects, and $x_{iv}$ denotes the v’th of p measurements on $x_i$

- Absolute (Manhattan)

$$d_{ij} = \lVert x_i - x_j \rVert_1 = \sum_{v=1}^{p} |x_{iv} - x_{jv}|$$
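The two distances can be written directly from their definitions. This is a minimal sketch; the function names and the two example objects are hypothetical.

```python
import math

def euclidean(x_i, x_j):
    """d_ij = sqrt(sum_v (x_iv - x_jv)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def manhattan(x_i, x_j):
    """d_ij = sum_v |x_iv - x_jv|."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

# Two hypothetical objects with p = 3 measurements each.
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 6.0, 3.0]

print(euclidean(x1, x2))  # 5.0
print(manhattan(x1, x2))  # 7.0
```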


Linkage

Given a metric we can calculate the pairwise distance matrix, D, that records the distance between every pair of objects, $(D)_{ij} = d_{ij}$, $i = 1, \ldots, n$, $j = i + 1, \ldots, n$

We now need to score any potential split / merge of a cluster(s) to decide on the best next step

The linkage method defines the overall distance between two sets (clusters) of observations


Consider two clusters A and B

Common types of linkage include

- Single linkage

$$\min_{i \in A,\, j \in B} d_{ij}$$

- Complete linkage

$$\max_{i \in A,\, j \in B} d_{ij}$$

- Average linkage

$$\frac{1}{|A||B|} \sum_{i \in A,\, j \in B} d_{ij}$$
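The three linkage rules, together with the bottom up (agglomerative) merging loop described earlier, can be sketched in a few lines. This is an unoptimised illustration under stated assumptions: Euclidean distance between objects, hypothetical function names, and four made-up one-dimensional objects.

```python
import math
from itertools import combinations

def single(A, B, d):
    return min(d(i, j) for i in A for j in B)

def complete(A, B, d):
    return max(d(i, j) for i in A for j in B)

def average(A, B, d):
    return sum(d(i, j) for i in A for j in B) / (len(A) * len(B))

def agglomerate(points, linkage):
    """Bottom up merging; returns the merge history as (cluster, cluster, distance)."""
    d = lambda i, j: math.dist(points[i], points[j])
    clusters = [(i,) for i in range(len(points))]  # each point in its own cluster
    history = []
    while len(clusters) > 1:
        # Score every candidate merge and take the closest pair of clusters.
        A, B = min(combinations(clusters, 2),
                   key=lambda ab: linkage(ab[0], ab[1], d))
        history.append((A, B, linkage(A, B, d)))
        clusters = [c for c in clusters if c not in (A, B)] + [A + B]
    return history

points = [(0.0,), (1.0,), (5.0,), (7.0,)]  # hypothetical 1-d objects
for name, rule in [("single", single), ("complete", complete), ("average", average)]:
    print(name, [dist for _, _, dist in agglomerate(points, rule)])
```

The recorded merge distances are what a dendrogram draws as branch lengths. In this toy example the first two merges agree across rules and only the final merge distance differs (4.0, 7.0 and 5.5 respectively).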


Single, Complete, and Average Linkage


Biclustering

Suppose the data is recorded in a matrix X with n rows of objects and p columns of measurements

It may well be of interest to cluster both the objects (rows) and the measurements (columns)

Then plot out the joint dendrogram on top of the distance matrix

This is known as biclustering


Biclustering of mRNA from case-control samples reveals gene expression profiles


Principal Components Analysis

In biological systems, measurements of molecular phenotypes, such as mRNA, miRNA, DNA, are high dimensional and strongly dependent due to fundamental mechanisms such as gene function, biological pathways or recombination

In exploratory data analysis we would like to identify patterns embedded in complex data tables (from experiments)

PCA is one of the most important and widely used methods in exploratory statistics, used in a huge variety of applications

- reveals patterns hidden in high-d data tables

- provides low dimensional views of high-d data


PCA

Suppose X is an (m × n) data table, e.g. m rows measuring gene-expression, on n columns of samples

$$
X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}
\quad \text{where} \quad
x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \end{bmatrix}
$$

We suspect that signal (interesting patterns) may be contained in dimensions much lower than m. For example, within a few correlated genes (on a common pathway)


Change of Basis

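We can seek a linear projection of the m dimensional space into a new basis space via,

$$PX = Y$$

where P is an (m × m) square matrix,

$$
PX = \begin{bmatrix} - & p_1 & - \\ & \vdots & \\ - & p_m & - \end{bmatrix}
\begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}
$$

and Y is an (m × n) matrix,

$$
Y = \begin{bmatrix} p_1 x_1 & \cdots & p_1 x_n \\ \vdots & \ddots & \vdots \\ p_m x_1 & \cdots & p_m x_n \end{bmatrix}
$$

with elements $(Y)_{ij} = p_i x_j$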


The row vectors $p_i$, $i = 1, \ldots, m$, provide linear combinations of the m original measurements

$$p_1 = [p_{11}, p_{12}, \ldots, p_{1m}]$$

You should convince yourself that information is not lost by making such a projection (transformation): if P is square and invertible, $P^{-1}P = I$, then,

$$X = P^{-1}PX = P^{-1}Y$$

So we can invert the transformation to get from Y back to X
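One way to convince yourself is to try it numerically. The sketch below assumes NumPy is available; the matrix P is an arbitrary invertible example (determinant 3) and X is random illustrative data, neither taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
X = rng.normal(size=(m, n))      # hypothetical data: m measurements, n samples

# An arbitrary invertible (m x m) matrix; its rows play the role of p_1..p_m.
P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])

Y = P @ X                        # change of basis

# Element (Y)_ij is the product of row p_i with column x_j.
print(np.isclose(Y[1, 2], P[1, :] @ X[:, 2]))   # True

# Invert the transformation: X = P^{-1} Y.
X_back = np.linalg.inv(P) @ Y
print(np.allclose(X_back, X))                   # True
```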


Optimal choice of basis

We wish to reveal patterns embedded in X

Hence we should construct P to compress (i.e. project) the structured parts of X (the signal) into the first few bases (dimensions of Y)

Then we can visualise and explore X in a much lower dimensional space

That is, choose the first few $p_1, p_2, \ldots, p_k$ with $k \ll m$ to preserve the signal, so that $p_{k+1}, \ldots, p_m$ contain unstructured variation (noise)


Optimal choice of basis

We still need to define precisely what we mean by “optimal”

We shall first constrain P to form an orthonormal basis (to make the problem well posed)

$$PP^T = I$$


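Think of a scatterplot: each $p_i$ is now akin to a rotation of the original axes

$$
p_i \cdot p_i^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_i^T \\ | \end{bmatrix} = 1
$$

and

$$
p_i \cdot p_j^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_j^T \\ | \end{bmatrix} = 0
\quad \text{for } j \neq i
$$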


Signal-to-noise

If we believe the unstructured noise is identically distributed and independent across the m measurements, then a second order statistic to define “optimality” is to maximise the signal-to-noise ratio (SNR),

$$\mathrm{SNR} = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}}$$

where $\sigma^2$ denotes the variance; note the assumption here is that variance is a good measure of “signal”

So $\mathrm{SNR} \gg 1$ suggests a lot of signal (pattern) in the data


Constructive derivation of basis

Suppose we wish to construct our first basis (projection) p1

If the noise is common across measurements (equivalent to a ball of noise in the original axes) then to maximise the SNR in the first projection we can simply maximise the variance of the first axis (row) of Y

Let $C_Y$ denote the variance-covariance matrix of Y, and assume Y is centred (mean zero in each direction)

$$C_Y = \frac{1}{n-1} YY^T$$

so $(C_Y)_{11}$ records the variance along the first row of Y
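The formula for the variance-covariance matrix can be checked against NumPy's built-in estimator. This is a sketch on random illustrative data (an assumption, not course data), with rows as directions and columns as samples.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 50
Y = rng.normal(size=(m, n))               # hypothetical data, rows = directions
Y = Y - Y.mean(axis=1, keepdims=True)     # centre: mean zero in each direction

C_Y = (Y @ Y.T) / (n - 1)

# np.cov treats rows as variables and normalises by n - 1 by default.
print(np.allclose(C_Y, np.cov(Y)))               # True
# The (1,1) element is the sample variance along the first row of Y.
print(np.isclose(C_Y[0, 0], Y[0].var(ddof=1)))   # True
```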


So find $p_1$ so as to maximise the spread of points (as defined by the variance) along the axis of $Y_{1\cdot}$

Then, having set $p_1$, find $p_2$ that maximises the variance, subject to $p_1 \cdot p_2^T = 0$

and then $p_3$ subject to { $p_1 \cdot p_3^T = 0$ and $p_2 \cdot p_3^T = 0$ }, and so on for $p_4$ etc.


Calculating the basis – for the more mathematical among you

Recall we are trying to find P so as to maximise the diagonal elements of $C_Y = \frac{1}{n-1} YY^T$ and,

$$
\begin{aligned}
C_Y &= \frac{1}{n-1} YY^T \\
    &= \frac{1}{n-1} (PX)(PX)^T \\
    &= \frac{1}{n-1} P X X^T P^T \\
    &= \frac{1}{n-1} P (XX^T) P^T
\end{aligned}
$$

and we recognise $XX^T$ as the (unnormalised) variance-covariance of X


Eigen-decomposition

$XX^T$ is square symmetric and hence has an eigen-decomposition

$$XX^T = UDU^T$$

where U is a matrix whose columns are the eigenvectors of $XX^T$ and D is a diagonal matrix storing the m eigenvalues of $XX^T$ in decreasing order

Now, select the rows of P to be the eigenvectors of $XX^T$, so $P = U^T$,


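in which case, since $XX^T = UDU^T = P^T D P$,

$$
\begin{aligned}
C_Y &= \frac{1}{n-1} P (XX^T) P^T \\
    &= \frac{1}{n-1} P (P^T D P) P^T \\
    &= \frac{1}{n-1} (PP^T) D (PP^T) \\
    &= \frac{1}{n-1} (I) D (I) \\
    &= \frac{1}{n-1} D
\end{aligned}
$$

*** Setting $P = U^T$ makes P an orthonormal basis, $PP^T = I$, and maximises the variance in each direction *** !!!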
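The result can be verified numerically. The sketch below assumes NumPy and uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted to match the decreasing order used here. The data are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 200
# Hypothetical correlated data: mix independent noise through a fixed matrix.
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.3]])
X = A @ rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)     # centre each measurement (row)

# Eigen-decomposition XX^T = U D U^T, sorted into decreasing eigenvalue order.
eigvals, U = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1]
D, U = np.diag(eigvals[order]), U[:, order]

P = U.T                                   # rows of P are the eigenvectors
Y = P @ X
C_Y = (Y @ Y.T) / (n - 1)

print(np.allclose(P @ P.T, np.eye(m)))    # True: P is an orthonormal basis
print(np.allclose(C_Y, D / (n - 1)))      # True: C_Y is diagonal, equal to D/(n-1)
```

The diagonal of `C_Y` then lists the variance captured by each new direction, largest first.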


Example of PCA

In population genetics and genetic epidemiology we often genotype individuals from a large cross-section of the population

Genotyping measures positions in the genome where we know common variation exists between individuals

- there are roughly 3,000,000 DNA bases in the human genome where we can expect greater than 1% of the population to show variation

E.g. suppose at a locus we know that some people might be {A,A} and some {A,T} and others {T,T}. We could encode this as {0, 1, 2}

Now construct a large data table X with elements $(X)_{ij} \in \{0, 1, 2\}$ for say m = 3,000,000 loci (rows) and n = 1000s of individuals (columns)

Perform PCA on X and project individuals into the first few PCs and explore
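A toy version of this analysis is sketched below on synthetic genotypes. Everything about the data is an assumption made for illustration: two groups of 20 individuals, 200 loci, and allele frequencies of 0.2 and 0.8. Because forming the m × m matrix $XX^T$ is impractical for millions of loci, the sketch uses the singular value decomposition of the centred table instead; its left singular vectors are the eigenvectors of $XX^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_per_group = 200, 20                  # toy sizes, far smaller than real studies

# Synthetic assumption: two groups with different allele frequencies at every locus.
freq_a, freq_b = 0.2, 0.8
X = np.hstack([
    rng.binomial(2, freq_a, size=(m, n_per_group)),
    rng.binomial(2, freq_b, size=(m, n_per_group)),
]).astype(float)                          # (X)_ij in {0, 1, 2}; loci x individuals

Xc = X - X.mean(axis=1, keepdims=True)    # centre each locus (row)

# Left singular vectors of Xc are eigenvectors of Xc Xc^T, so the SVD gives
# the same basis without forming the m x m matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = U[:, :2].T @ Xc                       # individuals projected onto PC1 and PC2

pc1 = Y[0]
print(pc1[:n_per_group].mean(), pc1[n_per_group:].mean())
```

In this simulated setting the two groups fall on opposite sides of zero along the first PC. The overall sign of a principal component is arbitrary, so only the separation is meaningful.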


PCA on genotype matrix mirrors geography, Nature, 456, 98-101


What have we learnt?

Statistics is the scientific discipline concerned with the analysis and interpretation of data (in an increasingly data rich world)

Statistics concerns itself with uncertainty, quantified using probability, and updated using probability calculus

The analysis of data should proceed via exploratory analysis followed by formal modelling (if required)

Exploratory analysis involves graphical interrogation of the data and summary statistics

In high-d data, cluster analysis and PCA can provide useful tools for exploration of structure


Key References

◦ Cleveland, W. S. (1993). Visualising data. Hobart Press

◦ Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edn. Springer

◦ Savage, L. J. (1954). The Foundations of Statistics. Dover. - worth reading Chapters 1-8 for those fascinated with foundations of subjective probability (Bayesian stats)

◦ Tufte, E. (2001). The Visual Display of Quantitative Information. 2nd Edn. Graphics Press. - you will never look at a graph in the same way again

◦ Wainer, H. (1984). How to display data badly. American Statistician. Vol. 38, No. 2
