Biostatistics and Experimental Design
Gerhard Thallinger
Institute of Computational Biotechnology, Graz University of Technology
http://genome.tugraz.at
based on lecture notes from Hubert Hackl
2017/2018
Outline

Aims of this course
Introduction
Descriptive statistics
Diagnostic tests and method comparison
Probability and theoretical distributions
Parameter estimation and confidence interval
Hypothesis testing
Comparing groups
Correlation and regression
Relation between several variables
Experimental design
Study design and clinical trials
Discussion of medical literature
Aims
At the end of this course, you should be able to ...

- understand statistical results (understand statistics in (bio)medical publications)
- analyze and visualise data by applying appropriate statistical methods
- design experiments for research and clinical studies
- judge statistical results from a critical point of view
- use R, a free software environment for statistical computing and graphics
Introduction
Simpson's paradox

All patients
Drug   Recovery: yes   no   Sum   Recovery rate
new        20          20    40      50%
old        16          24    40      40%

Female
Drug   Recovery: yes   no   Sum   Recovery rate
new        18          12    30      60%
old         7           3    10      70%

Male
Drug   Recovery: yes   no   Sum   Recovery rate
new         2           8    10      20%
old         9          21    30      30%
Example adapted from: Pearl J. Causality: Models, Reasoning, and Inference, Cambridge University Press, 1st edition, 2000; 174ff
Simpson EH. Journal of the Royal Statistical Society, Ser. B, 1951;13:238-241
Simpson’s paradox
Confounding variables
Examples: kidney stone treatment, sex bias, education, ...
Breadth and length of skulls (Pearson 1896)
Pearson K. Phil. Trans. R. Soc. Lond. A, 1896;187:253-318
Car/goat problem (The Monty Hall Paradox)
One of three doors hides a car (all three equally likely) and the other two hide goats. You choose Door A. The host, who knows where the car is, then opens one of the other two doors to reveal a goat, and asks whether you wish to change your choice. Say he opens Door C; should you stick with the original choice, Door A, or switch to Door B?
”Let’s make a deal”
Car/goat problem
Naïve approach
Regardless of the initial situation, there are now only two doors from which I could choose.

p(car is behind A) = p(car is not behind A) = 1/2

⇒ There is no advantage in switching the door.
Bayes theorem

$P(A \mid \text{open } C) = \dfrac{P(\text{open } C \mid A) \times P(A)}{P(\text{open } C)} = \dfrac{\frac{1}{2} \times \frac{1}{3}}{\frac{1}{2}} = \dfrac{1}{3}$

$P(B \mid \text{open } C) = \dfrac{P(\text{open } C \mid B) \times P(B)}{P(\text{open } C)} = \dfrac{1 \times \frac{1}{3}}{\frac{1}{2}} = \dfrac{2}{3}$

⇒ The probability of winning the car is higher if one switches doors.
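The result is easy to check by simulation. A minimal R sketch (not from the original slides; sample size is arbitrary):

# Monte Carlo check of the car/goat problem
set.seed(42)
n    <- 10000
car  <- sample(1:3, n, replace = TRUE)   # door hiding the car
pick <- sample(1:3, n, replace = TRUE)   # contestant's first choice
# staying wins when the first pick was right; switching wins otherwise
mean(pick == car)   # ~1/3: winning probability when staying
mean(pick != car)   # ~2/3: winning probability when switching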
Diagnosis study

1 in 1000 persons suffers from a disease. There is a test which gives wrong results with a probability of 5% (the false-positive rate is 5%).

What is the probability that a person with a positive test result has this disease?

The naïve approach would yield 95%.

Considering the prevalence of the disease, the probability of having the disease when the test is positive is less than 2%.
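A quick R sketch of this calculation (assuming, in addition to the slide, a sensitivity of 100%, since only the 5% error rate is specified):

prev <- 0.001   # 1 in 1000
sp   <- 0.95    # specificity = 1 - false-positive rate
sn   <- 1.00    # sensitivity (assumption, not stated above)
ppv  <- sn * prev / (sn * prev + (1 - sp) * (1 - prev))
ppv             # ~0.0196, i.e. less than 2%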
Biostatistics

Biostatistics
Application of statistics in biology and medicine and related research. Guidelines to conduct and interpret medical studies. Helps to objectify the evaluation of medical data.

Descriptive statistics
The aim is to describe data by characteristic values and to visualize them with graphical procedures in a short and concise way. Data are presented without a measure of significance.

Inferential statistics
Used to draw inferences about a population from a sample: hypothesis testing, quantifying the uncertainty of decisions, parameter estimation.
Key concepts
Population
Collection of all objects, events or individuals (people) about whom you would like to ask a research question.

Sample
To study a population, the researcher typically selects a small group, called a sample, from the population. The sample size is the number of individuals in the sample (not the number of measurements you make on each person!). The sample should be representative and random.

Random sample
Sample chosen from a population in a fashion that ensures every object, event, item or individual has an equal chance of being drawn. The selection of any one entity can in no way influence or affect the selection of any other (independent).

Individuals
Objects, events, persons, individuals (observation unit)
What statistical calculations can do
Statistical estimation
An example is to calculate the mean of a sample. This is only an estimate of the population mean and is called a point estimate. You also want to know how good this estimate is and want to give a range of values (confidence interval).

Statistical hypothesis testing
Statistical hypothesis testing helps to decide whether an observed difference is likely to be caused by chance, and provides a measure called the p-value.

Statistical modeling
Statistical modeling tests how well experimental data fit a mathematical model constructed from e.g. physical principles. An example for this is linear regression.
Examples

Sample size and population
Aristotle maintained that women have fewer teeth than men; although he was twice married, it never occurred to him to verify this statement by examining his wives' mouths.
Russell B, The Impact of Science on Society, Simon and Schuster, New York, 1953; p 7
Test whether a drug is effective in treating patients with HIV

- The population you really care about is more diverse than the population from which your data were sampled
- Collection of data from a "convenience sample" rather than a random sample
- The measured variable (CD4 lymphocytes) is a proxy for another variable you really care about (survival time)
- Measurements may be made or recorded incorrectly (quality of antibody!)
- Combination of different kinds of measurements to reach an overall conclusion
Applications in Medicine
- Epidemiology
- Biometry
- In vitro and animal experiments
- Clinical trials (Phase I to IV)
- Approval for drugs and medical devices
- Evaluation of new measurement and diagnostic techniques
- Meta analysis
- Evidence based medicine
Research projects
PLANNING → DESIGN → EXECUTION (data collection) → DATA PROCESSING → DATA ANALYSIS → PRESENTATION → INTERPRETATION → PUBLICATION
Classification of statistical methods
Univariate methods
Each variable is considered individually

Bivariate methods
The relation between 2 variables is studied

Multivariate methods
The relation between more than 2 variables is studied
Descriptive statistics
Measurement
Observation unit
The unit upon which measurements are made.
Blood samples, animals, test persons, patients ...

Variable
Observable or measurable properties of the observation unit which can take different values. Should address the question and follow objectivity, reliability, and validity.
Diagnosis, tumor stage, cholesterol levels ...

Value
A realized measurement; feature characteristic.
Type of surgery, 3 mol/ml, female ...
Types of data
Categorical data (qualitative)
Nominal data (sex: male, female; blood group: 0, A, B, AB)
Ordinal data (cancer stage I, II, III, IV)

Numerical data (quantitative)
Discrete data (number of children 0, 1, 2, 3, 4, 5+)
Continuous data (blood pressure; height in cm)

Other types of data
Ranks, percentages, rates and ratios, scores, visual analogue scale, censored data

Note: It is important to know the data type since representation and analysis depend on it.
Types of scales
Nominal scale
Equal or not equal (a = b, a ≠ b)

Ordinal scale
Ranking is possible (a < b, a = b, a > b)

Interval scale
Not only rank but also difference of values (c = a − b);
0 is taken arbitrarily (e.g. 2007 AD, temperature scale, diopter)

Ratio scale
Not only differences but also ratios (c = a/b);
0 is represented naturally in empirical data (e.g. age of a person, absolute zero)
Frequencies

Absolute frequency
The number k of observations bearing the same value or falling within a given class, out of n total observations:

$f_{abs} = k$

Relative frequency
Estimate of the probability of a single event for discrete data:

$f_{rel} = \frac{k}{n}, \quad 0 \le f_{rel} \le 1$

Relative frequency in percent:

$f_{rel\%} = f_{rel} \times 100\%$
Presentation of categorical (discrete) data
Frequency table
Blood group distribution of 2060 individuals from Croatia:

Blood group   frequency   relative frequency   relative frequency %
0                702          0.341                 34.1%
A                862          0.418                 41.8%
B                365          0.177                 17.7%
AB               131          0.064                  6.4%
Total           2060          1.000                100.0%
Mourant AE, et al. The Distribution of the Human Blood Groups and Other Polymorphisms, Oxford University Press, 1976; pp. 909
Together with relative frequencies, the sample size should be given.

1 man and 6 women are 14.286% and 85.714%
⇒ if the sample size is small, use absolute and avoid relative frequencies
⇒ percentages with many decimal places pretend a large sample size
Presentation of discrete data
In bar charts, bars should always start from 0.
Prefer bar charts to pie charts since the eye is good at judging linearmeasures and bad at judging relative areas.
3-dimensional pie charts show misleading proportions due to thechange of perspective.
Presentation of data over time
Consider relation between x- and y-scale.
Diagrams should start from 0.
Presentation of continuous data
A simple graphical way of depicting a complete set of observations is by means of the histogram, in which the number (or frequency) of observations is plotted for different values or groups of values.
Example
Serum cholesterol levels (mmol/l) of a sample of 86 stroke patients:

3.7 3.8 3.8 4.4 4.5 4.5 4.5 4.7 4.7 4.8 4.8 4.9 4.9
4.9 5.0 5.1 5.1 5.2 5.3 5.3 5.4 5.4 5.5 5.5 5.5 5.6
5.6 5.6 5.6 5.6 5.6 5.6 5.7 5.7 5.7 5.8 5.8 5.9 6.0
6.1 6.1 6.1 6.1 6.2 6.3 6.3 6.4 6.4 6.4 6.4 6.4 6.5
6.5 6.6 6.7 6.7 6.8 6.8 7.0 7.0 7.0 7.0 7.1 7.1 7.2
7.3 7.4 7.4 7.5 7.5 7.6 7.6 7.6 7.7 7.8 7.8 7.8 8.2
8.3 8.6 8.7 8.9 9.3 9.5 10.2 10.4
Markus HS, et al. Stroke, 1995;26(8):1329-1333
Histogram
Partition into classes

The following aspects should be considered:

- The partition comprises all values
- Values have to be assigned to the classes unequivocally
- The class width should be the same for all classes
- The mid-point of a class represents all values within the class
- The smaller the number of classes, the greater the class width and the greater the loss of information
- The higher the number of classes, the more of the uninteresting random effects become apparent

Empirical formulae:

$k \approx \sqrt{n}, \quad k \approx 5 \times \log_{10}(n)$

where k is the number of classes and n the number of values.
Histogram
Partition into classes (Example)

Range: min = 3.7, max = 10.4
Span width: max − min = 10.4 − 3.7 = 6.7
$k \approx \sqrt{86} = 9.27$

Class width = 1.0 and k = 8 ⇒

Interval      Tally                            Frequency   Relative frequency
3.00-3.99     ///                                   3           3.5%
4.00-4.99     ///// ///// /                        11          12.8%
5.00-5.99     ///// ///// ///// ///// ////         24          27.9%
6.00-6.99     ///// ///// ///// /////              20          23.3%
7.00-7.99     ///// ///// ///// ////               19          22.1%
8.00-8.99     /////                                 5           5.8%
9.00-9.99     //                                    2           2.3%
10.00-10.99   //                                    2           2.3%
Total                                              86         100.0%
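The same partition can be reproduced in R with hist(); the vector below holds the 86 cholesterol values listed above:

chol <- c(3.7, 3.8, 3.8, 4.4, 4.5, 4.5, 4.5, 4.7, 4.7, 4.8, 4.8, 4.9, 4.9,
          4.9, 5.0, 5.1, 5.1, 5.2, 5.3, 5.3, 5.4, 5.4, 5.5, 5.5, 5.5, 5.6,
          5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.7, 5.7, 5.7, 5.8, 5.8, 5.9, 6.0,
          6.1, 6.1, 6.1, 6.1, 6.2, 6.3, 6.3, 6.4, 6.4, 6.4, 6.4, 6.4, 6.5,
          6.5, 6.6, 6.7, 6.7, 6.8, 6.8, 7.0, 7.0, 7.0, 7.0, 7.1, 7.1, 7.2,
          7.3, 7.4, 7.4, 7.5, 7.5, 7.6, 7.6, 7.6, 7.7, 7.8, 7.8, 7.8, 8.2,
          8.3, 8.6, 8.7, 8.9, 9.3, 9.5, 10.2, 10.4)
sqrt(length(chol))        # ~9.3 classes by the square-root rule
5 * log10(length(chol))   # ~9.7 classes by the log rule
# class width 1.0 starting at 3.0 reproduces the tally table above
hist(chol, breaks = seq(3, 11, by = 1), right = FALSE,
     xlab = "Serum cholesterol (mmol/l)", main = "Stroke patients")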
Histogram
Histograms of cholesterol levels from stroke patients
Histogram
Histograms with different number of classes
Histograms have to be area-accurate: when f_rel or f_abs is plotted, the class width has to be constant. In cases of different class widths, the frequency density (f_rel/width_i) should be plotted.
Frequency density histogram
Age group   Relative frequency (%)   Frequency density (% per year)
0-4             25.3                     5.06
5-14            18.9                     1.89
15-44           30.3                     1.01
45-64           13.6                     0.68
65+             11.7                     0.33
Frequency polygon
Frequency polygons are useful for comparisons.
Cumulative frequency histogram and empirical cumulative distribution function
Measures of central tendency
Arithmetic mean

$\bar{x} = \frac{1}{n}(x_1 + x_2 + \ldots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$

where n is the number of observations and $x_1, x_2, \ldots, x_n$ is the sample (observations).

Median
For ranked data $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ the median $\tilde{x}$ is:

odd n:  $\tilde{x} = x_{((n+1)/2)}$
even n: $\tilde{x} = \frac{1}{2}(x_{(n/2)} + x_{(n/2+1)})$

Mode
The mode $x_{mod}$ is the most frequent observation.

It is the only measure for nominal data. For continuous data it is represented by the center of the class with the most frequent observations within the histogram, and it can be used for bimodal data.
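A minimal R sketch of the three measures (sample values made up; base R has no mode function for data, so it is derived from a frequency table):

x <- c(2, 3, 3, 5, 7, 7, 7, 10)
mean(x)                                  # arithmetic mean
median(x)                                # median
as.numeric(names(which.max(table(x))))   # mode = most frequent value (7)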
Measures of variability
Rank, rank list
The sample $x_1, x_2, \ldots, x_n$ sorted by the size of the values is $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and called the rank list, where the indices (1), ..., (n) are the ranks $R(x_i)$ of the values.

Range
Span width (range): $r = x_{max} - x_{min} = x_{(n)} - x_{(1)}$

Percentiles
The p% percentile ($Q_p$) means that p% of the values are smaller than or equal to the p% percentile.

$Q_p = \begin{cases} x_{(k)} & n \times p \text{ is not an integer } (k = \mathrm{int}(n \times p) + 1) \\ \frac{1}{2}(x_{(k)} + x_{(k+1)}) & n \times p \text{ is an integer } (k = n \times p) \end{cases}$
Measures of variability

Quartiles
1st quartile = Q1 = Q25
2nd quartile = Q2 = Q50 = median
3rd quartile = Q3 = Q75

Interquartile range
IQR = Q3 − Q1 = Q75 − Q25

Outlier detection
$x_i \ge Q_{75} + 1.5 \times IQR$ or $x_i \le Q_{25} - 1.5 \times IQR$ ... mild outlier
$x_i \ge Q_{75} + 3.0 \times IQR$ or $x_i \le Q_{25} - 3.0 \times IQR$ ... extreme outlier

This approach could be misleading for a small number of observations. There are also other methods for outlier detection and for determination of quartiles, e.g.:

$Q_p = (1-j) \times x_{(k+1)} + j \times x_{(k+2)}, \quad k = \mathrm{int}((n-1) \times p), \; j = (n-1) \times p - k$
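In R, quantile() implements several such definitions; its default (type = 7) corresponds to the alternative formula above. A short sketch with made-up values:

x <- c(3.7, 4.2, 4.8, 5.1, 5.6, 6.0, 6.4, 7.1, 8.3, 10.4)
quantile(x, probs = c(0.25, 0.50, 0.75))  # quartiles (default type = 7)
IQR(x)                                    # Q75 - Q25
q <- quantile(x, c(0.25, 0.75))
fence_lo <- q[1] - 1.5 * IQR(x)           # lower fence for mild outliers
fence_hi <- q[2] + 1.5 * IQR(x)           # upper fence for mild outliers
x[x < fence_lo | x > fence_hi]            # flagged observations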
Box-and-whiskers plot
Measures of variability
Variance (2nd moment)

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Standard deviation

$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

where n is the number of observations and n − 1 corresponds to the degrees of freedom.

Coefficient of variation

$CV = s/|\bar{x}|$ or $CV = s/|\bar{x}| \times 100\%$

provides a standardized measure for the variability (CV < 10% represents low and CV > 25% high variability).
Measures of variability

Standard error of the mean

$SE(\bar{x}) = s/\sqrt{n}$

describes not the data, but the accuracy of the estimation.

SE is sometimes misleadingly used.
Measures of shape

Skewness (3rd moment)

$g_1 = \frac{m_3}{\sqrt{m_2^3}} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\right)^3}$

$g_1 = 0$ means the distribution is symmetrical, $g_1 > 0$ right skewed, and $g_1 < 0$ left skewed; $m_i$ is the i-th central moment.

Kurtosis (4th moment)

$g_2 = \frac{m_4}{m_2^2} - 3 = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^2} - 3$

For the normal distribution $g_2 = 0$. If $g_2 > 0$ ($g_2 < 0$), more (fewer) values lie within the center of the distribution than for the normal distribution.
q-q plot

Comparison of sample quantiles with quantiles of a normal distribution.

Normally distributed observations should follow a straight line.
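A minimal R sketch of the q-q plot (simulated data), together with the Shapiro-Wilk test mentioned in the next section:

set.seed(1)
x <- rnorm(50, mean = 35, sd = 5)   # assumed normal sample
qqnorm(x)                           # sample vs. theoretical quantiles
qqline(x)                           # reference line
shapiro.test(x)                     # formal test for normality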
Transformations

Motivation
Most (parametric) statistical methods for analyzing continuous data assume a normal distribution. To test for a normal distribution, the Shapiro-Wilk test and the q-q plot can be used. Another important assumption is that different groups of observations have the same standard deviations (or CV). Transformations also reduce the influence of outlying values.

Transformations
Log (the most common transformation)
Square root
Reciprocal
Box-Cox (find the best transformation)
Rank
Log transformations

⇒ asymmetric confidence interval: $CI = b^{\overline{\log x} \pm t \times s_{\log x}/\sqrt{n}}$
Shapiro CM, et al., Am J Med Sci, 1987;293(6):365-370
[Figure: histograms of T4 counts (cells/mm³) for Hodgkin's and non-Hodgkin's disease patients, on the original and the log-transformed scale; the log transformation makes the skewed distributions approximately symmetric.]
Box-Cox-transformations
Define a function to find the best transformation:

$x' = \begin{cases} \dfrac{x^\lambda - 1}{\lambda} & \text{for } \lambda \ne 0 \\ \log(x) & \text{for } \lambda = 0 \end{cases}$

For the logarithmic transformation $\lambda = 0$, square root $\lambda = \frac{1}{2}$, cubic root $\lambda = \frac{1}{3}$, and reciprocal $\lambda = -1$.
Optimal λ can be calculated from the likelihood function L(λ).
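In R the profile likelihood L(λ) can be obtained with boxcox() from the MASS package (a sketch with simulated right-skewed data):

library(MASS)
set.seed(1)
x  <- rlnorm(100, meanlog = 1, sdlog = 0.5)        # right-skewed data
bc <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, 0.1))  # plots L(lambda)
bc$x[which.max(bc$y)]   # lambda maximizing the likelihood, here close to 0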
Standardization
Standardization
For the analysis of multivariate data a standardization is often wanted. That is a normalization where the mean gets 0 and the standard deviation gets 1.

$x'_i = \frac{x_i - \bar{x}}{s}$

$x'_i$ is also called the z-score. The data are centered and the area under the normal distribution gets 1. This is helpful for comparisons.

Ranging

$x'_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$

E.g. for construction of diagrams and figures.
Bivariate descriptive methods

Contingency table
nominal versus nominal (ordinal) scaled variable

          Light   Regular   Dark   Total
Male        20      40       50     110
Female      50      20       20      90
Total       70      60       70     200
Barplots
Bivariate descriptive methods
Boxplots
nominal versus metric scaled variable
Diagnostic tests and method comparison
Diagnostic tests

Sensitivity (SN) = TP / (TP + FN) = TPR (recall)

Specificity (SP) = TN / (FP + TN) = 1 − FPR

Positive predictive value (PPV) = TP / (FP + TP) (precision)

Negative predictive value (NPV) = TN / (TN + FN)

Prevalence (observed in this study) = (TP + FN) / n;  Accuracy = (TP + TN) / n
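A small R sketch computing these measures from hypothetical TP/FP/TN/FN counts:

tp <- 90; fn <- 10; fp <- 45; tn <- 855   # made-up counts
n    <- tp + fp + tn + fn
sn   <- tp / (tp + fn)      # sensitivity = TPR = recall
sp   <- tn / (fp + tn)      # specificity = 1 - FPR
ppv  <- tp / (fp + tp)      # positive predictive value (precision)
npv  <- tn / (tn + fn)      # negative predictive value
c(SN = sn, SP = sp, PPV = ppv, NPV = npv,
  ACC = (tp + tn) / n, PREV = (tp + fn) / n)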
Diagnostic tests
Consider the predictive ability of the test for the general population or groups with different prevalence of disease (T .. test result, D .. disease state):

$P(D^+ \mid T^+) = \frac{P(T^+ \mid D^+) \times P(D^+)}{P(T^+)} = \frac{P(T^+ \mid D^+) \times P(D^+)}{P(T^+ \mid D^+) \times P(D^+) + P(T^+ \mid D^-) \times P(D^-)}$

$P(D^+)$ = Prevalence (PREV)
$P(D^+ \mid T^+)$ = PPV
$P(T^+ \mid D^+)$ = SN
$P(T^+ \mid D^-)$ = 1 − SP

$PPV = \frac{SN \times PREV}{SN \times PREV + (1-SP) \times (1-PREV)}$

$NPV = \frac{SP \times (1-PREV)}{(1-SN) \times PREV + SP \times (1-PREV)}$
Diagnostic tests
Likelihood ratio:

$LR^+ = \frac{P(T^+ \mid D^+)}{P(T^+ \mid D^-)} = \frac{SN}{1-SP}$

$LR^- = \frac{P(T^- \mid D^+)}{P(T^- \mid D^-)} = \frac{1-SN}{SP}$

post-test odds = pre-test odds × LR

$\frac{PPV}{1-PPV} = \frac{PREV}{1-PREV} \times \frac{SN}{1-SP} \quad \ldots \quad PPV = \frac{SN \times PREV}{SN \times PREV + (1-SP) \times (1-PREV)}$
Receiver operating characteristics (ROC) curve
Example
Thompson IM, et al. Prevalence of Prostate Cancer among Men with a Prostate-Specific Antigen Level ≤4.0 ng per Milliliter. N Engl J Med, 2004;350(22).
Table 2. Relationship of the Prostate-Specific Antigen (PSA) Level to the Prevalence of Prostate Cancer and High-Grade Disease.

PSA Level       No. of Men   Men with Prostate Cancer   Men with High-Grade Prostate     Sensitivity   Specificity
                (N=2950)     (N=449), no. (%)           Cancer (N=67), no./total no. (%)
≤0.5 ng/ml         486          32 (6.6)                   4/32 (12.5)                      1.0           0.0
0.6–1.0 ng/ml      791          80 (10.1)                  8/80 (10.0)                      0.93          0.02
1.1–2.0 ng/ml      998         170 (17.0)                 20/170 (11.8)                     0.75          0.33
2.1–3.0 ng/ml      482         115 (23.9)                 22/115 (19.1)                     0.37          0.73
3.1–4.0 ng/ml      193          52 (26.9)                 13/52 (25.0)                      0.12          0.92
Example
Discussion

Screening tests
Testing a healthy population for early signs of a rare serious disease.
High sensitivity and PPV.
Don't want FN and accept a moderate number of FP.

Diagnostic tests
E.g. testing high-risk individuals.
High specificity and NPV.
A false positive diagnosis would have major consequences for the patient (HIV+).

Predictive values are strongly dependent on prevalence.
The choice of the cut-off is not a statistical decision.
The test must be repeatable and should have minimal inter-observer variation.
Method comparison for categorical data

Used to quantify the agreement of categorical assessments by different observers (data from Boyd et al., J Natl Cancer Inst, 1982;68(3):357-363):

                     Observer A
Obs. B     Normal   Benign   Suspect   Cancer   Total
Normal       21       12        0         0       33
Benign        4       17        1         0       22
Suspect       3        9       15         2       29
Cancer        0        0        0         1        1
Total        28       38       16         3       85

Observed agreement of frequencies

$p_o = \frac{1}{n}\sum_{i=1}^{k} f_{ii} = (21 + 17 + 15 + 1)/85 = 0.635 \; (64\%)$

Expected agreement of frequencies (by chance)

$p_e = \frac{1}{n^2}\sum_{i=1}^{k} r_i c_i = (33 \cdot 28 + 22 \cdot 38 + 29 \cdot 16 + 1 \cdot 3)/85^2 = 0.308 \; (31\%)$
Method comparison for categorical data
Measure of agreement: Cohen's κ

$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.635 - 0.308}{1 - 0.308} = 0.47$

Guidelines to interpret κ

Value of κ    Strength of agreement
<0.20         Poor
0.21-0.40     Fair
0.41-0.60     Moderate
0.61-0.80     Good
0.81-1.00     Very good
Altman DG, Practical statistics for medical research, Chapman, London, 1991; pp 404
Adapted from Landis and Koch, Biometrics, 1977;33(1):159-174
Method comparison for categorical data
Cohen's κ does not take into account the degree of disagreement ⇒ weighted κ adds weights to the frequencies in each cell according to their distance:

$w_{ij} = 1 - \frac{|i-j|}{k-1}$ (linear)   $w_{ij} = 1 - \frac{|i-j|^2}{(k-1)^2}$ (quadratic; Fleiss-Cohen)

$p_{o(w)} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij} f_{ij}$   $p_{e(w)} = \frac{1}{n^2}\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij} r_i c_j$

$\kappa_{lw} = \frac{p_{o(w)} - p_{e(w)}}{1 - p_{e(w)}} = \frac{0.866 - 0.691}{1 - 0.691} = 0.57$ (linear weights)

$\kappa_{qw} = \frac{p_{o(w)} - p_{e(w)}}{1 - p_{e(w)}} = \frac{0.947 - 0.841}{1 - 0.841} = 0.67$ (quadratic weights)

Note: Weighted kappa must not be applied to unordered categorical data.
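The κ values above can be reproduced in R directly from the observer table (a sketch using only base R):

tab <- matrix(c(21, 12,  0, 0,
                 4, 17,  1, 0,
                 3,  9, 15, 2,
                 0,  0,  0, 1), nrow = 4, byrow = TRUE)
n  <- sum(tab)
po <- sum(diag(tab)) / n                          # observed agreement, 0.635
pe <- sum(rowSums(tab) * colSums(tab)) / n^2      # chance agreement, 0.308
(po - pe) / (1 - pe)                              # unweighted kappa, ~0.47
k <- nrow(tab)
w <- 1 - abs(outer(1:k, 1:k, "-")) / (k - 1)      # linear weights
pow <- sum(w * tab) / n
pew <- sum(w * outer(rowSums(tab), colSums(tab))) / n^2
(pow - pew) / (1 - pew)                           # linearly weighted kappa, ~0.57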
Example: Kappa statistics for gene grouping
Huang DW, et al., Genome Biology, 2007;8(9):R183
Example: Diagnosis of renal artery stenosis
Vasbinder GB, et al. Ann Intern Med, 2004;141(9):674-682
Objective: To determine the validity of computed tomographic angiography (CTA) and magnetic resonance angiography (MRA) compared with digital subtraction angiography (DSA) for detection of renal artery stenosis.

Results: Twenty percent of patients who underwent all 3 tests had clinically relevant renal artery stenosis. Moderate interobserver agreement was found, with values ranging from 0.59 to 0.64 for CTA and 0.40 to 0.51 for MRA. The combined sensitivity and specificity were 64% (95% CI, 55% to 73%) and 92% (CI, 90% to 95%) for CTA and 62% (CI, 54% to 71%) and 84% (CI, 81% to 87%) for MRA.

Limitations: Eighteen percent of the patients were included nonconsecutively. Digital subtraction angiography may be an imperfect reference test.

Conclusion: Computed tomographic angiography and MRA are not reproducible or sensitive enough to rule out renal artery stenosis in hypertensive patients. Therefore, DSA remains the diagnostic method of choice.
Table 3. Overall Diagnostic Accuracy and Areas under the Receiver-Operating Characteristic Curves for All Observers*

Observer        Sensitivity, %   Specificity, %   PPV, %       NPV, %       AUC
CTA A               69               91             67           92          0.84
CTA B               61               89             59           90          0.76†
CTA C               61               97             83           91          0.84
CTA Combined        64 (55–73)       92 (90–95)     68 (59–77)   91 (88–94)  0.85 (0.79–0.91)
MRA D               67               77             42           90          0.75
MRA E               63               84             50           90          0.76
MRA F               57               90             59           89          0.81
MRA Combined        62 (54–71)       84 (81–87)     49 (40–58)   90 (87–93)  0.83 (0.77–0.89)

* Values in parentheses are 95% CIs. AUC = area under the receiver-operating characteristic curve; CTA = computed tomographic angiography; MRA = magnetic resonance angiography.
† The AUC for CTA observer B is statistically significantly lower than the AUCs for CTA observers A (P = 0.03) and C (P = 0.05).
Method comparison studies
The aim is to see if 2 (or more) methods (devices) agree well enough that they can be interchanged (e.g. quicker or cheaper methods).

The best approach is to analyze the differences between the measurements of the 2 methods on each subject.
Bland JM, Altman DG, Lancet, 1986;1(8476):307-310
Method comparison studies
It is expected that about 95% of the observations lie within the range mean ± 2 SD of the differences.

This range of values defines the 95% limits of agreement.

In case of variable agreement (wider scatter as the average increases) ⇒ log-transform.
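A minimal Bland-Altman sketch in R with simulated paired measurements (all numbers assumed):

set.seed(7)
m1 <- rnorm(50, 100, 10)        # method 1
m2 <- m1 + rnorm(50, 1, 4)      # method 2: small bias plus noise
avg <- (m1 + m2) / 2
d   <- m1 - m2
md  <- mean(d); s <- sd(d)
plot(avg, d, xlab = "Mean of the two methods", ylab = "Difference")
abline(h = c(md, md - 2 * s, md + 2 * s), lty = c(1, 2, 2))
# dashed lines: 95% limits of agreement (mean difference +/- 2 SD)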
Inappropriate use of the correlation coefficient r and significance testing:

1. r measures the strength of a relation between 2 variables, not the agreement between them (perfect correlation if the points lie along any straight line).
2. A change in the scale of measurement does not affect the correlation.
3. Correlation depends on the range of the true quantity in the sample.
4. The test of significance may only show that the two methods are related.
5. Data which seem to be in poor agreement can produce quite high correlations.
Method comparison studies
Repeatability of a method

The repeatability of a method can be assessed by comparing repeated measurements using the method on a series of subjects. The Bland-Altman plot can also be used to assess the repeatability. Since the same method is used for the repeated measurements, the mean difference should be zero. Hence, the coefficient of repeatability (CR) can be defined as:

$CR = 1.96 \times \sqrt{\frac{\sum_{i=1}^{n}(d_{2i} - d_{1i})^2}{n-1}}$

If there are more than 2 measurements per subject ⇒ ANOVA

Measuring agreement using repeated measurements:

Take the difference of the means from each method. The SD has to be corrected (law of error propagation):

$SD_c = \sqrt{SD^2 + \left(\frac{SD_1}{2}\right)^2 + \left(\frac{SD_2}{2}\right)^2}$
Error grid analysis (EGA)
Comparison of blood glucose meters with the gold standard (Beckman analyzer)
Brunner GA, et al., Diabetes Care, 1998;21(4):585-590
Clarke WL, et al., Diabetes Care, 1987;10(5):622-628
Probability and theoretical distributions
Combinatorics
Permutations
For n different elements there are n! permutations.

For example n = 3: ABC, ACB, BAC, BCA, CAB, CBA ⇒ 3! = 6 permutations.

For n objects in k groups, not distinguishable within a group, there are

$\frac{n!}{n_1! \times n_2! \times \ldots \times n_k!}$

permutations.

For example 2 red balls, 3 green balls, and 7 blue balls ⇒

$\frac{12!}{2! \times 3! \times 7!} = 7920$ permutations.
Combinations

Combinations
Selections of k out of n elements (rather than all n, as for permutations) are called combinations.

Binomial coefficient

$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n \times (n-1) \times (n-2) \times \ldots \times (n-k+1)}{1 \times 2 \times 3 \times \ldots \times k}$

Without repetitions
Order does not matter: $\binom{n}{k}$
Order matters: $k! \times \binom{n}{k}$

With repetitions
Order does not matter: $\binom{n+k-1}{k}$
Order matters: $n^k$
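These quantities are available directly in R (a quick sketch):

factorial(3)                  # 3! = 6 permutations of ABC
factorial(12) / (factorial(2) * factorial(3) * factorial(7))   # 7920
choose(5, 2)                  # unordered, without repetition
choose(5, 2) * factorial(2)   # ordered, without repetition
choose(5 + 2 - 1, 2)          # unordered, with repetition
5^2                           # ordered, with repetition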
Random experiments

All outcomes of the experiment are known in advance.

But it is a priori unknown which will be the outcome of each repetition of the experiment:

- Systematic and random errors
- Complex processes, result of many combined processes

The experiment can be repeated under identical conditions.

Examples are tossing a coin, throwing a die, or the life-time of a bulb.
Sample space and event

Sample space
Collection of possible elementary outcomes from a random experiment.

Throwing a die: Ω = {1, 2, 3, 4, 5, 6}
Life-time of a bulb: Ω = [0, ∞)
Diagnosis: Ω = {diseased, healthy}
Body height: Ω = R+

Event
A set of outcomes of the experiment.

A = {6}, A = {tail}, A = {diseased}, A = {height > 180 cm}
A = Ω ... certain event
A = ∅ ... impossible event

Sigma-field S
A σ-field (σ-algebra) S is a non-empty collection of subsets of Ω that satisfies:
- ∅ ∈ S
- A ∈ S ⇒ $A^c$ ∈ S
- if $A_i$ is a countable sequence of sets in S ⇒ $\bigcup_i A_i \in S$
Probability measure

The pair (Ω, S) is considered as the sample space associated with a statistical experiment.

A set function P defined on S is called a probability measure (or probability) if it satisfies the following conditions:

1. P(A) ≥ 0 for all A ∈ S.
2. P(Ω) = 1.
3. If $A_i \in S$ is a disjoint sequence of sets ($A_j \cap A_k = \emptyset$ for $j \ne k$), then
   $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$

P(A) is called the probability of the event A.

The triple (Ω, S, P) is called a probability space.
Probability
For an experiment with k equally probable outcomes:

$P(A_1) = P(A_2) = \ldots = P(A_k) = \frac{1}{k}, \quad \sum_{i=1}^{k} P(A_i) = 1$

If events are mutually exclusive, then the probability that one of them occurs is the sum of the probabilities of the individual events:

$P(A_1 \cup A_2 \cup \ldots \cup A_k) = P(A_1) + P(A_2) + \ldots + P(A_k) = \sum_{i=1}^{k} P(A_i)$

If events are independent, then the probability of occurrence of all events is the product of the probabilities of the individual events:

$P(A_1 \cap A_2 \cap \ldots \cap A_k) = P(A_1) \times P(A_2) \times \ldots \times P(A_k) = \prod_{i=1}^{k} P(A_i)$
Conditional probability
For 2 arbitrary events A and B:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$P(A^c) = 1 - P(A)$

What is the probability of event A given B?
$P(A \mid B) = P(A \cap B)/P(B)$

What is the probability of event B given A?
$P(B \mid A) = P(A \cap B)/P(A)$
Bayes theorem

$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$

Example 1
Are women promoted less often than men?

Of 200 promotions only 4 women were promoted (2%). For one position 40 women and 3270 men have applied.

$P(P \mid F) = \frac{P(F \mid P) \times P(P)}{P(F)} = \frac{0.02 \times \frac{200}{3310}}{\frac{40}{3310}} = 0.1 = 10\%$

$P(P \mid M) = \frac{P(M \mid P) \times P(P)}{P(M)} = \frac{0.98 \times \frac{200}{3310}}{\frac{3270}{3310}} = 0.0599 \approx 6\%$
Bayes theorem

$P(B) = P(A \cap B) + P(A^c \cap B) = P(B \mid A) \times P(A) + P(B \mid A^c) \times P(A^c)$

$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B \mid A) \times P(A) + P(B \mid A^c) \times P(A^c)} = \frac{P(B \mid A) \times P(A)}{\sum_{i=1}^{n} P(B \mid A_i) \times P(A_i)}$

Example 2
In 1990 a Briton was sentenced to 16 years in prison based on a random DNA match with a probability of 1 in 3 × 10^6 according to experts.

Suppose there are 10000 people in the DNA database. Then the probability that the suspect is innocent given a DNA match (that is what we want to know) can be calculated using the Bayes theorem:

$P(I \mid M) = \frac{P(M \mid I) \times P(I)}{P(M)} = \frac{\frac{1}{3 \times 10^6} \times \frac{9999}{10000}}{\frac{1}{3 \times 10^6} \times \frac{9999}{10000} + 1 \times \frac{1}{10000}} = 0.0033$

$P(M \mid I) = \frac{1}{3\,000\,000}$, whereas $P(I \mid M) \approx \frac{3}{1000}$
Likelihood function
$\underbrace{P(B \mid A)}_{\text{posterior}} = \frac{\overbrace{P(A \mid B)}^{\text{likelihood}} \times \overbrace{P(B)}^{\text{prior}}}{\underbrace{P(A)}_{\text{evidence, normalizing factor}}} \;\propto\; \underbrace{L(B \mid A)}_{\text{likelihood of B given fixed A}} \times \underbrace{P(B)}_{\text{prior}}$

Consider a model which gives the probability density function (PDF) of an observable random variable vector X as a function of a parameter θ (in general a parameter vector). Then for specific values $x_1, \ldots, x_n$ of X (a given realization), the function

$L(\theta \mid x_1, \ldots, x_n) = f(x_1, \ldots, x_n \mid \theta)$

is a likelihood function of θ. The likelihood function is functionally the same in form as the PDF. However, the emphasis is changed from the x to the θ. The PDF is a function of the x's while holding the parameters θ constant; L is a function of the parameters θ while holding the x's constant.
Likelihood ratio
Bayes theorem
The Bayes theorem can also be written in terms of a likelihood ratio and odds:

$O(A \mid B) = O(A) \times \Lambda(A \mid B)$

where $\Lambda(A \mid B)$ is the likelihood ratio,

$O(A \mid B) = \frac{P(A \mid B)}{P(A^c \mid B)}$ are the odds of A given B, and

$O(A) = \frac{P(A)}{P(A^c)}$ are the odds of A.

Likelihood ratio

$\Lambda(A \mid B) = \frac{L(A \mid B)}{L(A^c \mid B)} = \frac{P(B \mid A)}{P(B \mid A^c)}$
Maximum likelihood estimation

Choosing an estimator $\hat{\theta}(X)$ for θ that maximizes $L(\theta \mid x_1, \ldots, x_n)$, i.e. that satisfies

$L(\hat{\theta} \mid x_1, \ldots, x_n) = \sup_{\theta \in \Theta} L(\theta \mid x_1, \ldots, x_n),$

is called the maximum likelihood estimator (MLE).

Since products of probabilities are very small it is convenient to work with the logarithm of the likelihood function. log is a monotone function, therefore

$\log L(\hat{\theta} \mid x_1, \ldots, x_n) = \sup_{\theta \in \Theta} \log L(\theta \mid x_1, \ldots, x_n).$

If $\hat{\theta}$ exists it must satisfy the likelihood equations

$\frac{\partial \log L(\theta \mid x_1, \ldots, x_n)}{\partial \theta_j} = 0, \quad j = 1, 2, \ldots, k, \quad \theta = (\theta_1, \ldots, \theta_k).$
Maximum likelihood

If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) with probability density function (PDF) or probability mass function (PMF) f, the likelihood function can be calculated as:

$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$

and the log likelihood function:

$\log L(\theta \mid x_1, \ldots, x_n) = \log \prod_{i=1}^{n} f(x_i \mid \theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$

For example linear regression:

$\log L(y = ax + b \mid x) = \sum_{i=1}^{n} \log f(ax_i + b)$
Random variable

The probability measure P is a set function and hence difficult to work with.

Let (Ω, S) be a sample space. A random variable is defined as a finite, single-valued function that maps Ω into R if the inverse images under X of all Borel sets in R are events, that is if

$X: \Omega \to \mathbb{R}, \quad X^{-1}(B) = \{\omega : X(\omega) \in B\} \in S \text{ for all } B \subset \mathbb{R}$

In short, a random variable (r.v.) is a function that assigns a real number to the outcome of a random experiment.

The resulting value (X = x) is called a realization of the random variable X.
Discrete random variable
A discrete random variable can take a countable number of predetermined values.

Examples
Tossing a coin, throwing a die, or the number of cars crossing a line during a certain time interval.

Probability mass function (PMF)
For discrete random variables the mass function determines the probability of each element of the sample space:

$f(x_i) = P[X = x_i]$

Continuous random variable
Continuous random variables can take any real value.

Probability density function (PDF)
A probability density function is a function f(x) that describes the probability density in terms of the input variable x and satisfies:

1. $P[a \le X \le b] = \int_a^b f(x)\,dx$
2. $f(x) \ge 0 \;\; \forall x \in \mathbb{R}$
3. $\int_{-\infty}^{\infty} f(x)\,dx = 1$

The histogram is an estimator for the probability density function.
Cumulative distribution function (CDF)

$F(x) = P(X \le x) = \begin{cases} \int_a^x f(u)\,du & \text{continuous r.v.} \\ \sum_{x_i \le x} P(X = x_i) & \text{discrete r.v.} \end{cases}$

where a is the smallest value that the r.v. can take.

Properties of the CDF

1. $\lim_{x \to -\infty} F(x) = 0$; $\lim_{x \to +\infty} F(x) = 1$
2. $x < y \Rightarrow F(x) \le F(y)$
3. F(x) is continuous from the right: $F(x+h) \to F(x)$ as $h \to 0$

Probability and CDF

$P(X > x) = 1 - F(x)$
$P(x < X \le y) = F(y) - F(x)$
Measures for the distribution function and r.v.

Expectation
$E(X) = \mu = \int_{-\infty}^{\infty} x f(x)\,dx$

Variance
$var(X) = \sigma^2 = E[(X - E(X))^2] = E(X^2) - (E(X))^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx$

Standard deviation
$sd(X) = \sigma = \sqrt{E[(X - E(X))^2]}$

Covariance
$cov(X, Y) = E[(X - E(X))(Y - E(Y))]$

Correlation
$\rho = cov(X, Y)/(\sigma_x \sigma_y)$
Normal distribution
Factors of variation which act in an additive way result in a symmetric distribution which is called a normal distribution. The PDF of the normal distribution with parameters μ and σ (N(μ, σ); also called Gauss distribution) is:

$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad E(X) = \mu, \quad sd(X) = \sigma$
Normal distribution
Effects of different σ and µ on the PDF of the normal distribution:
Standard normal distribution
A variable that has a normal distribution with mean μ = 0 and variance σ² = 1 is called the standard normal variate and is commonly designated by the letter Z:

$Z = \frac{X - \mu}{\sigma} \sim N(0; 1)$
Standard normal distribution
The cumulative distribution function can be calculated as follows:

$F(x) = \int_{-\infty}^{x} f(u)\,du = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(u-\mu)^2}{2\sigma^2}}\,du$

Substituting μ = 0 and σ² = 1 yields:

$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{u^2}{2}}\,du = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$
Standard normal distribution and probability
Since the area under the standard normal distribution is 1, probabilities correspond to areas under the curve within the range of z:

$P(Z \le z) = \Phi(z)$

$P(-0.56 \le Z \le 2.00) = \Phi(2.00) - (1 - \Phi(0.56)) = 0.6895$

$P(-2.00 \le Z \le 2.00) = 2 \times \Phi(2.00) - 1 = 0.9545$
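The same probabilities in R, where Φ(z) is pnorm(z):

pnorm(2.00) - (1 - pnorm(0.56))   # P(-0.56 <= Z <= 2.00) = 0.6895
2 * pnorm(2.00) - 1               # P(-2.00 <= Z <= 2.00) = 0.9545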
Lognormal distribution
Factors of variation which act in a multiplicative way lead to an asymmetric distribution which is called a lognormal distribution.
Shapiro CM, et al., Am. J. Med Sci., 1987;293(6):365-370
[Figure: histograms of T4 counts (cells/mm³) for Hodgkin's and non-Hodgkin's disease patients, raw and log-transformed, as shown earlier; the log-transformed data are approximately normal.]
Lognormal distribution
The lognormal distribution with parameters μ and σ is denoted lnN(μ, σ) and has the following PDF:

$f(x; \mu, \sigma) = \frac{1}{x\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \quad E(X) = e^{\mu + \frac{\sigma^2}{2}}, \quad sd(X) = \sqrt{(e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}}$
Binomial distribution
The binomial distribution is the simplest probability distribution for discrete data.

It represents the probability distribution of the number of successes k in a sequence of n independent yes/no experiments, each of which yields success with probability p. It is denoted B(n, p).

$f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}, \quad E(X) = np, \quad sd(X) = \sqrt{np(1-p)}$

For n = 1 it is identical to the Bernoulli distribution.
Binomial distribution

Example
The probability of being in blood group B is 0.08, so the probability of being in group 0, A, or AB is 0.92.

For two unrelated people, the probability of both being in blood group B is 0.08 × 0.08 = 0.0064.

           Number in B   Probability
B B            2         0.08 × 0.08 = 0.0064
¬B B           1         0.92 × 0.08 = 0.0736
B ¬B           1         0.08 × 0.92 = 0.0736
¬B ¬B          0         0.92 × 0.92 = 0.8464
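The same probabilities via the binomial PMF in R:

dbinom(0:2, size = 2, prob = 0.08)   # 0.8464 0.1472 0.0064
# 0.1472 = 0.0736 + 0.0736: the two orderings with exactly one B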
Binomial distribution
[Figure: binomial probability mass functions for p = 0.08 and n = 2, 6, 10, 20, 50, and 100; with increasing n the distribution becomes more spread out and more symmetric.]
Binomial versus hypergeometric distribution
Binomial distribution
Probability distribution of the number of successes k in a sequence of n independent yes/no experiments (with replacement), each of which yields success with probability p:

$f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$

Hypergeometric distribution
Probability distribution that describes the number of successes k in a sequence of n draws from a finite population N without replacement.

$f(k; N, m, n) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}, \quad E(X) = \frac{nm}{N}, \quad sd(X) = \sqrt{\frac{nm}{N} \cdot \frac{N-m}{N} \cdot \frac{N-n}{N-1}}$

The finite population N consists in a drawing experiment e.g. of m white marbles and N − m black marbles.
Over-representation analysis
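A sketch of such an over-representation test in R using the upper hypergeometric tail (all counts made up: N genes in total, m in a gene set, n genes drawn, e.g. the significant ones, of which k fall into the set):

N <- 10000; m <- 200; n <- 150; k <- 12
phyper(k - 1, m, N - m, n, lower.tail = FALSE)   # P(X >= k)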
Poisson distribution

Another discrete probability distribution is the Poisson distribution. It describes the number of events k occurring over time (or space) at a fixed average rate λ, where each event occurs independently and at random (Pois(λ)). For example, the daily number of new registrations of cancer may be 2.2 on average, but on any day there may be no cases or there may be several.

$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad E(X) = \lambda, \quad sd(X) = \sqrt{\lambda}$

Examples are:

- The number of phone calls at a call center per minute
- The number of mutations in a given stretch of DNA after a certain amount of radiation
- The number of light bulbs that burn out in a certain time interval
- The number of cars that pass through a certain point on a road (distant from traffic lights) during a given period of time
Poisson distribution
Examples for different values of λ

[Figure: Poisson probability mass functions for λ = 2.2, 10.0, and 24.0; with increasing λ the distribution shifts right and becomes more symmetric.]
Negative binomial distribution

If count data are too dispersed to fit a Poisson distribution, they can be modeled by the two-parameter negative binomial distribution (Pascal distribution or Polya distribution).

The negative binomial distribution is the distribution of the number of trials n needed to get a fixed number of successes k, where each of the trials yields success with probability p. It is denoted NB(k, p).

The probability mass function is therefore given by:

$f(n; k, p) = \binom{n-1}{k-1} p^k (1-p)^{n-k}, \quad E(X) = \frac{k}{p}, \quad sd(X) = \frac{\sqrt{k(1-p)}}{p}$

For k = 1 it is identical to the geometric distribution.
Negative binomial distribution

Examples for a fixed number of successes

Negative binomial distribution

Examples for a fixed probability
Other distributions

Test distributions
χ², t, F, ...

Mathematically deduced distributions
Exponential, Gamma, Beta, Cauchy, logistic, uniform, Weibull, ...

Extended binomial distributions
Bernoulli, geometric, multinomial, ...
Multinomial, Beta, and Dirichlet distribution

Two possibilities:

Binomial distribution: $f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$

Beta distribution: $f(\theta; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$, with $x \ge 0: \Gamma(x+1) = x!$; $\alpha > 0, \beta > 0$

Three or more possibilities:

Multinomial distribution: $f(\theta_1, \ldots, \theta_k; n; p_1, \ldots, p_k) = \frac{n!}{\prod_{i=1}^{k}\theta_i!} \prod_{i=1}^{k} p_i^{\theta_i}$, with $\sum_{i=1}^{k} p_i = 1$, $0 \le \theta_i \le n$, $\sum_{i=1}^{k}\theta_i = n$

Dirichlet distribution: $f(\theta_1, \ldots, \theta_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k}\theta_i^{\alpha_i - 1}$, with $\theta_i \ge 0$, $\sum_{i=1}^{k}\theta_i = 1$

The Beta distribution is the conjugate of the binomial distribution (same functional form; however, variable and parameter are exchanged).
Parameter estimation and confidence interval
Parameter estimation and confidence interval
Aims
- Estimation of parameters of the relevant population by the statistics of the sample distribution
- Measures of uncertainty and quality of these estimations and specification of a confidence interval

To be valid, the sample must be representative of the population. For quantification of the strength of the evidence or its uncertainty, the characteristics of the sampling distributions are useful (e.g. properties of the distribution of the means of random samples).
Sampling distributions
The variability of sample means of many random samples of a given size from the population

- is less among the means of large samples than small samples
- is less than the variability of the individual observations in the population
- increases with greater variability (standard deviation) among the individual values

The distribution of sample means will be nearly normal whatever the distribution of the variable in the population, as long as the samples are large enough.
Distribution of means from random sampling
[Figure: histograms of the means of many random samples of size n = 10, 25, and 100 drawn from a normal population; the spread of the sample means shrinks as n grows.]
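The figure can be reproduced with a short simulation (population parameters assumed from the plot axes):

set.seed(1)
sample_means <- function(n, reps = 5000)
  replicate(reps, mean(rnorm(n, mean = 35, sd = 5)))
par(mfrow = c(1, 3))
for (n in c(10, 25, 100))
  hist(sample_means(n), main = paste("n =", n),
       xlab = "sample mean", xlim = c(25, 45))
sd(sample_means(100))   # close to sigma/sqrt(n) = 5/10 = 0.5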
Central limit theorem

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) with means $\mu_i$ and variances $\sigma_i^2$. Then

$X_{norm} = \frac{\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu_i}{\sqrt{\sum_{i=1}^{n} \sigma_i^2}}$

has a limiting cumulative distribution function which approaches a normal distribution (∼ N(0; 1)) for large n.

⇒ importance of the normal distribution
Standard error of sample mean
The standard deviation of a large number of sample means will be:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

where σ is the standard deviation of the variable in the population and n is the size of each sample.

We can estimate the standard error of the (population) mean (SEM) from a single sample using the observed standard deviation in that sample:

$SEM = \frac{s}{\sqrt{n}}$

The standard error of the mean is often abbreviated to standard error (SE). The standard error is a measure for the quality of the estimation of the population mean. SE can be used to construct a confidence interval.
Standard error
Standard error of the difference between two sample means

$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{[SE(\bar{x}_1)]^2 + [SE(\bar{x}_2)]^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Standard error of a sample proportion
From the binomial distribution we know the standard deviation

$s = \sqrt{np(1-p)} \;\Rightarrow\; SE = \sqrt{\frac{p(1-p)}{n}}$

This will be true only for large samples (np > 5 and n(1 − p) > 5).

Standard error of the difference between two proportions

$SE(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
Confidence interval

A (1 − α) confidence interval $[\hat{\theta}_l, \hat{\theta}_u]$ is a random interval which includes the unknown, true value θ with a probability of 1 − α:

$P[\hat{\theta}_l \le \theta \le \hat{\theta}_u] \ge 1 - \alpha$

Per convention α = 0.05, but it can be chosen arbitrarily.

$\bar{x} - t_{1-\alpha/2}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{1-\alpha/2}\,\frac{s}{\sqrt{n}}$ for a normally distributed population
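A minimal R sketch of this interval, by hand and via t.test() (simulated data):

set.seed(1)
x <- rnorm(30, mean = 35, sd = 6)
n <- length(x); xbar <- mean(x); s <- sd(x)
tq <- qt(0.975, df = n - 1)        # t quantile for alpha = 0.05
c(xbar - tq * s / sqrt(n), xbar + tq * s / sqrt(n))
t.test(x)$conf.int                 # the same interval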
Student’s t-distribution
If $X_1, \ldots, X_n$ are independent and N(0, 1) distributed then

$t = \frac{\bar{x}}{s/\sqrt{n}}$

is t-distributed with n − 1 degrees of freedom.

With small degrees of freedom (or small n) the t-distribution differs from the normal distribution considerably. If the degree of freedom is high, the t-distribution approximates the standard normal distribution.
Student’s t-distribution
Student, Biometrika, 1908;6(1):1-25
William Sealy Gosset (1876-1937)
Confidence interval
95% confidence intervals for mean serum albumin concentration from 216 patients with primary biliary cirrhosis, constructed from 100 random samples of size 100
Christensen E, et al., Gastroenterology, 1985;89(5):1084-1091
Confidence interval for relative frequencies
If $X_1, \ldots, X_n$ are independent binary variables (0, 1) with parameter $p = P[x_i = 1]$, then $k = \sum_{i=1}^{n} x_i$ is binomially distributed, with $\hat{p} = \frac{k}{n}$.

The (1 − α) confidence interval is $[p_l, p_u]$ with

$p_l = \frac{k}{k + (n-k+1)F^*_{1-\alpha/2}} \qquad p_u = \frac{(k+1)F_{1-\alpha/2}}{n-k+(k+1)F_{1-\alpha/2}}$

where $F^*_{1-\alpha/2}$ and $F_{1-\alpha/2}$ are quantiles of F-distributions.

In case of large n,

$z = \frac{k - np}{\sqrt{np(1-p)}}$

is approximately N(0, 1) distributed and the confidence interval is:

$\hat{p} - z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le p \le \hat{p} + z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Parameter estimation
We want to have good estimators for different parameters of the population distribution (θ = μ, θ = σ²).

An estimator $\hat{\theta}$ of a parameter θ should

- for large n approach θ, and
- for large n follow a normal distribution (central limit theorem).

These properties are satisfied most of the time, and we want quantitative criteria. The estimation error ($\hat{\theta} - \theta$) should be minimal:

1. Unbiasedness: $E[\hat{\theta} - \theta] = 0$ (bias)
2. Minimal variance: $Var[\hat{\theta}] \to$ minimal
3. Consistency: $\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| < \varepsilon) = 1$ ($\varepsilon > 0$)
4. Robustness: not unduly affected by outliers
Maximum likelihood estimation

Parameter estimation for the normal distribution:

$L(\theta \mid x) = \prod_{i=1}^{n} f(x_i \mid \theta) = \frac{1}{(\sigma\sqrt{2\pi})^n}\, e^{-\frac{1}{2\sigma^2}\sum (x_i - \mu)^2}$

$\log L = -n \log \sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2$

$\frac{d(\log L)}{d\mu} = 0 = \frac{1}{\sigma^2}\sum (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum x_i - n\mu\right) \;\Leftrightarrow\; \hat{\mu} = \frac{\sum x_i}{n}$

$\frac{d(\log L)}{d\sigma} = 0 = -\frac{n}{\sigma} + \frac{\sum (x_i - \mu)^2}{\sigma^3} \;\Leftrightarrow\; \hat{\sigma}^2 = \frac{\sum (x_i - \mu)^2}{n}$

Ordinary least squares (OLS) is a special case of the maximum likelihood method.
Thumbnail example

$X = \{x_1, \ldots, x_n\}$, where $x_t \in \{0, 1\}$

Binomial distribution: $P(X \mid \Theta) = \binom{n}{k}\Theta^k (1-\Theta)^{n-k}$ ... likelihood

Maximum likelihood estimation

$\log P(X \mid \Theta) = k \log \Theta + (n-k)\log(1-\Theta) + C$

$\frac{d}{d\Theta}\log P(X \mid \Theta) = \frac{k}{\Theta} - \frac{n-k}{1-\Theta} = 0 \;\Rightarrow\; \hat{\Theta} = \frac{k}{n}$
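A quick numeric check in R that maximizing the log likelihood indeed yields k/n (values made up):

k <- 7; n <- 10
loglik <- function(theta) k * log(theta) + (n - k) * log(1 - theta)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # ~0.7
k / n                                                                 # analytical MLE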
Since the data X are usually subject to random fluctuations and intrinsic uncertainty, repeating the whole process of data collection and parameter estimation under identical conditions will mostly lead to slightly different results.

⇒ if we are able to repeat the data-generating process several times, we will get a distribution of parameter estimates $\hat{\Theta}$, from which we can infer the intrinsic uncertainty of the estimation process.

Distribution of the parameter estimate $\hat{\Theta}$

The probability of k observations of heads in a sample of size n is given by

$P(k) = \binom{n}{k}\Theta^k(1-\Theta)^{n-k}$

$k = n\hat{\Theta} \;\Rightarrow\; P(k) = \binom{n}{n\hat{\Theta}}\Theta^{n\hat{\Theta}}(1-\Theta)^{n(1-\hat{\Theta})}$

In more complicated situations analytical solutions are usually not available ⇒ a computational procedure called bootstrapping is used.
Frequentist versus Bayesian paradigm

Bayesian approach

$\underbrace{P(\Theta \mid X)}_{\text{posterior probability}} \;\propto\; \underbrace{P(X \mid \Theta)}_{\text{likelihood}} \times \underbrace{P(\Theta)}_{\text{prior probability}}$

We want to compute the posterior probability from the likelihood and the prior probability.

It is mathematically convenient to choose a functional form that is invariant with respect to the transformation (see above), that is, for which the prior and the posterior probability are in the same function family (conjugate).

The conjugate of the binomial distribution is the beta distribution:

$P(\Theta \mid X) \propto \Theta^{k+\alpha-1}(1-\Theta)^{N-k+\beta-1}$

$P(\Theta \mid X) = B(\Theta \mid k+\alpha, N-k+\beta)$
Comparison of frequentist and Bayesian approach
Maximum a posteriori (MAP) estimate: $\hat{\Theta}_{MAP} = \mathrm{argmax}_\Theta \, P(\Theta \mid X)$

Maximum likelihood (ML) estimate: $\hat{\Theta}_{ML} = \mathrm{argmax}_\Theta \, P(X \mid \Theta)$

$N \to \infty \;\Rightarrow\; \hat{\Theta}_{MAP} = \hat{\Theta}_{ML}$

Suppose you are allowed to toss a thumbnail a few times only. You can use prior knowledge, e.g. the torque acting on the falling thumbnail from theoretical physics.

If you are allowed to toss the thumbnail arbitrarily often, the data will "speak for themselves", and including any prior knowledge no longer makes any difference to the prediction.
Comparison of frequentist and Bayesian approach
The main difference between the frequentist and the Bayesian approach is the different interpretation of Θ:

The frequentist statistician interprets Θ as a parameter and aims to estimate it with a point estimate, typically adopting the maximum likelihood approach.

The Bayesian statistician interprets Θ as a random variable and tries to infer its whole posterior distribution, P(Θ | X).

For the derivation of P(Θ | X) in complex inference problems, a powerful computational approach called Markov Chain Monte Carlo (MCMC) can be used (the Bayesian pendant to the frequentist's bootstrap approach).
Parameter free estimation / Resampling
Parameter-free means there are no assumptions about the form of the population distribution; instead the data (sample) and their distribution are used.

We are not interested in the parameters per se, but we want to test a hypothesis or want to know the quality of a prediction based on the data.

In both cases using resampling methods allows us to quantify the performance of the estimation.
Bootstrap
The idea of the bootstrap is to randomly sample n times with replacement from the original data points (based on the same distribution as the original data).

If this procedure is repeated often (e.g. 1000 times), the distribution of the medians should approximate a normal distribution, and the mean and variance of the medians can be calculated.

The 95% confidence interval can be derived from the sorted bootstrap samples (at the 25th and 975th value).
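A manual bootstrap sketch in R (sample values simulated):

set.seed(1)
x <- rlnorm(40, meanlog = 1.7, sdlog = 0.4)   # assumed skewed sample
meds <- replicate(1000, median(sample(x, replace = TRUE)))
quantile(meds, c(0.025, 0.975))   # 95% CI: 25th and 975th sorted value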
Permutation test
The permutation test (randomization test) is similar to the bootstrap, except that the resampling procedure is done without replacement.

As an example, the question is addressed whether active genes in a specific condition tend to be adjacent within the genome. For this purpose the positions within the genome were permuted 10000 times and the number of adjacent active genes was counted.

As a measure of the test, the z-score or the p-value (that is, the fraction of the rearrangements that have counts as far apart or more than actually observed) can be provided.
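A generic two-group permutation test sketch in R (simulated data, not the gene-adjacency example above):

set.seed(1)
a <- rnorm(20, 10, 2); b <- rnorm(20, 11.5, 2)
obs  <- mean(a) - mean(b)
pool <- c(a, b)
perm <- replicate(10000, {
  idx <- sample(length(pool), length(a))   # relabel without replacement
  mean(pool[idx]) - mean(pool[-idx])
})
mean(abs(perm) >= abs(obs))   # two-sided permutation p-value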
Jackknife
The jackknife approach is used to measure the performance of an estimator ($\hat{\theta}^*$) by systematically recomputing the statistic estimate ($\hat{\theta}^*_{-i}$), leaving out one observation at a time from the sample.

Finally the jackknife-corrected estimator ($\hat{\theta}_{jack}$) can be calculated from the $\hat{\theta}^*_{-i}$ as follows:

$\hat{\theta}_{jack} = n\hat{\theta}^* - \frac{n-1}{n}\sum_{i=1}^{n}\hat{\theta}^*_{-i}$

For example, estimating the mean:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and $\bar{x}_{-j} = \frac{\sum_{i=1}^{n} x_i - x_j}{n-1} \;\Rightarrow\; \tilde{x}_j = n\bar{x} - (n-1)\bar{x}_{-j}$

and analogously for general estimators:

$\tilde{\theta}^*_j = n\hat{\theta}^* - (n-1)\hat{\theta}^*_{-j}$ with $\hat{\theta}_{jack} = \frac{\sum_{j=1}^{n}\tilde{\theta}^*_j}{n}$
Quenouille M. Journal of the Royal Statistical Society, Ser. B, 1949;11:68-84Tukey JW. Annals of Mathematical Statistics, 1958;29:614
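A leave-one-out sketch in R; for the mean the jackknife-corrected estimate coincides with the plain mean, which makes it a convenient check:

set.seed(1)
x <- rnorm(25, 35, 6)
n <- length(x)
theta   <- mean(x)
theta_i <- sapply(1:n, function(i) mean(x[-i]))   # leave-one-out estimates
n * theta - (n - 1) / n * sum(theta_i)            # jackknife estimator
theta                                             # identical for the mean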
Parameter free estimation / Resampling - Summary
Bootstrap resampling
Generate samples of the same size n as x, with replacement, to establish confidence intervals.

Permutation subsampling
Generate samples of a (in general) smaller size than x, without replacement, to test hypotheses of 'no effect'.

Jackknife 'leave one out' sampling
Generate samples of size n − 1 to measure the performance of an estimator.
Hypothesis testing
Hypothesis test
In medicine, comparisons between treatments or procedures, or between groups of subjects, are often conducted. More generally, a research question is addressed and tested with an experiment.

The numerical value corresponding to the comparison of interest is called the effect.

A null hypothesis H0 can be stated that this effect of interest is zero, as well as an alternative hypothesis H1 that the effect is not zero.

The null hypothesis is in general the negation of the research hypothesis that generated the data.

The probability that we could have observed the data (or data that were more extreme) if the null hypothesis is true is called the p-value. The smaller the p-value, the more evidence we have against the null hypothesis.
Test statistic
For most problems, calculating a test statistic (a value which we can compare with the known distribution we expect when the null hypothesis is true) can be used to evaluate the probability:

$\text{test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error}}$

In many cases the hypothesized value is zero, so that the test statistic becomes the ratio of the observed quantity of interest to its standard error.
Error types
If H0 can be rejected, then based on this evidence you can accept the research hypothesis. In general there are two possible decisions:

- reject H0 and accept H1, or
- do not reject H0 and consider H1 as not approved

Note: H0 can never be accepted, however large the p-value may be.

As apparent in the following table, there are two possibilities to decide correctly and two possibilities to make errors:

Decision           H0 is really true                H0 is really false
Do not reject H0   correct                          Type II error (probability β)
Reject H0          Type I error (probability α)     correct
Significance
α, the (maximal) probability of the Type I error, is the level of significance. By reducing the risk of an error of the first kind we increase the risk of an error of the second kind.

- The conventional compromise is to choose α = 0.05 as the level of significance.
- If p ≤ α, H0 is rejected (the research hypothesis accepted) and the test is stated statistically significant.
- Sometimes, if α = 0.001 is chosen and p ≤ α, the test is stated 'highly' significant.

These are reasonable guidelines, however, not an absolute demarcation. There is not a great difference between p = 0.06 and p = 0.04, and they indicate similar strength of evidence. Therefore the p-values should be provided, and not only the statement that the test is significant.
Two-sided tests versus one-sided tests

Example for a one-sample test:
H0: μ = μ0
H1: μ = μ1 ≠ μ0 ... two-sided alternative hypothesis
H1: μ = μ1 > μ0 or μ = μ1 < μ0 ... one-sided alternative hypotheses

One-sided tests are rarely appropriate, and in most cases two-sided tests are used. Even when there are strong prior expectations, for example that a new treatment can't be worse than the old one, you cannot be sure (otherwise you would not need an experiment).

[Figure: critical regions. Two-sided test: reject H0 below t_{α/2} and above t_{1−α/2}. One-sided test (lower tail): reject H0 below t_α. One-sided test (upper tail): reject H0 above t_{1−α}.]
Power of a test

The statistical power of a test is defined as 1 − β. This is the probability that a new therapy or theory is proven better, if it is really better.

The power depends on the sample size n and the effect size δ, which refers to the magnitude of the effect under the alternative hypothesis.

If means of normally distributed data are compared, the effect size is:

$\delta = \frac{\mu_1 - \mu_0}{\sigma_0}$

- Optimal tests are defined such that at a given α the power is maximal.
- The power decreases if α decreases.
- The power increases if the variability decreases.
- The power is better for one-sided tests.
Power analysis
Since there is a relation between α, the power (1 − β), the sample size n, and the effect size δ, the optimal sample size can be calculated from the other parameters. This procedure is called power analysis.

1. Estimate the effect size (e.g. from literature)
2. Define α and β
3. Calculate the optimal sample size n
Calculation of sample size (continuous data)
Determination of a difference in the mean for given μ0, known variance σ0², and independent normally distributed data $x_1, \ldots, x_n$:

$z = \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma_0}$

Under H0: μ = μ0, z is normally distributed, and H0 is rejected if $|z| > z_{1-\alpha/2}$.

If $\mu = \mu_1 > \mu_0 \;\Rightarrow\; z = \sqrt{n}\,\frac{\bar{x} - \mu_1}{\sigma_0} + \sqrt{n}\,\frac{\mu_1 - \mu_0}{\sigma_0}$

$z_{1-\alpha/2} = z_\beta + \sqrt{n}\,\delta \;\Rightarrow\; n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$

For example: $\alpha = 0.05$, $\beta = 0.20$, $\delta = \frac{38 - 35}{6} = 0.5$ ⇒

$n = \frac{(z_{0.975} + z_{0.80})^2}{0.5^2} \approx \frac{(1.96 + 0.84)^2}{0.25} \approx 31$
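The built-in power.t.test() solves the same problem; it uses the t-distribution, so n comes out slightly larger than the normal-approximation value of 31:

power.t.test(delta = 3, sd = 6, sig.level = 0.05, power = 0.80,
             type = "one.sample")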
Calculation of sample size (proportions)
If two proportions p0 and p1 are compared, the effect size is: δ = p0 − p1.

The sample size can be calculated as follows:

$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \times [p_0(1-p_0) + p_1(1-p_1)]}{\delta^2}$

For example: $\alpha = 0.05$, $\beta = 0.20$, $p_0 = 0.80$, $p_1 = 0.75$ ⇒

$\delta = 0.80 - 0.75 = 0.05$

$n = \frac{(z_{0.975} + z_{0.80})^2 \times (0.80 \times 0.20 + 0.75 \times 0.25)}{0.05^2} \approx \frac{(1.96 + 0.84)^2 \times 0.35}{0.0025} \approx 1094$
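The corresponding R function is power.prop.test(), which returns a per-group sample size close to the approximation above:

power.prop.test(p1 = 0.80, p2 = 0.75, sig.level = 0.05, power = 0.80)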
Estimation versus hypothesis testing
- There is a close relation between confidence intervals and hypothesis testing: p < 0.05 (i.e. significant) ⇔ the 95% interval does not include the value specified in H0. The reason for this relation is that both methods are based on similar aspects of the theoretical distribution of the test statistic.
- The confidence interval shows the uncertainty, or lack of precision, in the estimate of interest, and thus conveys more useful information than the p-value.
- The use of a new treatment depends not only on the significance but also on the size of the effect. A single number (p-value) cannot convey the necessary information.
Testing for equality or noninferiority
In traditional comparative studies, the burden of proof rests on the alternative (research) hypothesis of a difference between the groups. If the evidence is not strong enough in favour of a difference, equality cannot be ruled out, but the null hypothesis cannot be accepted.

"Absence of evidence is not evidence of absence"
Altman DG and Bland JM, British Medical Journal, 1995;311:485.

It is not possible to establish an alternative hypothesis of exact equality. Therefore a region around the mean has to be defined where the two means are considered equal:

$|\mu_1 - \mu_0| < \delta$

δ represents the equivalence margin. This allows us to define the following null and alternative hypotheses:

H0: $|\mu_1 - \mu_0| \ge \delta$
H1: $|\mu_1 - \mu_0| < \delta$
Testing for equality or noninferiority
Two one-sided t-tests (TOST)
Perform two one-sided tests based on the following split null hypotheses:

H01 : µ1 − µ0 ≥ δ
H02 : µ1 − µ0 ≤ −δ

The p-value for the overall test is p = max(p1, p2). Whether a correction for multiple testing should be performed is heavily debated. If you want to be on the safe side, divide α by 2 (Bonferroni correction).
Schuirmann DJ, J Pharmacokin Biopharm, 1987;15:657-680.Wellek S, Testing Statistical Hypotheses of Equivalence. CRC Press, 2003.
Confidence interval
Construct the (1 − 2α) confidence interval of the difference of the means. If the CI for the difference is completely contained in the interval [−δ, δ], then we declare equivalence.
Tryon WW, Psychological Methods, 2001;6(4):371-386
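A minimal TOST sketch in R, using two calls to t.test() with shifted null values; tost() and the margin delta are illustrative names, not from the lecture:

    # Two one-sided tests for equivalence of two independent means
    tost <- function(x, y, delta, alpha = 0.05) {
      p1 <- t.test(x, y, mu =  delta, alternative = "less")$p.value    # H01: diff >= delta
      p2 <- t.test(x, y, mu = -delta, alternative = "greater")$p.value # H02: diff <= -delta
      c(p = max(p1, p2), equivalent = max(p1, p2) < alpha)
    }
    tost(rnorm(30, 10, 2), rnorm(30, 10.2, 2), delta = 1)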
Non-parametric tests
Parametric methods
I Make assumptions about the sampling distributions
I Based on theoretical distributions which are described by parameters (mean, standard deviation)
I Confidence intervals and hypothesis tests

Non-parametric (distribution-free) methods
I Often used to analyze data which are not normally distributed (i.e. skewed data)
I Mostly based on ranks or on comparing sums of ranks
I Tend to be more suited to hypothesis testing than to estimation
I In some cases estimation of confidence intervals is possible (e.g. for the median)
Multiple testing
Problem
If many hypothesis tests are performed in parallel, the probability of drawing wrong conclusions increases.

Example - Microarrays
Thousands of genes are tested for whether they are significantly differentially expressed.

I In case of 1000 tests, 50 false positives are expected at a Type I error level of 0.05.
I The probability for k independent tests that at least one p < α is 1 − (1 − α)^k, which converges towards 1 for large k.
I Multiple testing corrections adjust the p-values (or the significance level α) derived from multiple statistical tests to correct for the occurrence of false positives.
Type I error
Decision           H0 is really true       H0 is really false      total
Do not reject H0   U                       T (Type II error)       G − R
Reject H0          V (Type I error)        S                       R
total              G0                      G1                      G

Per family and per comparison error rate
PFER = E(V), PCER = E(V)/G

Family wise error rate (FWER)
FWER = P(V > 0)

False discovery rate (FDR)
FDR = E(V/R) for R > 0, and FDR = 0 for R = 0
Methods for multiple testing corrections
Method                            Error control
Bonferroni                        FWER (most stringent)
Bonferroni step down (Holm)       FWER
Westfall and Young permutation    FWER
Benjamini and Hochberg (FDR)      FDR (less stringent)

Family-wise error rate methods allow very few occurrences of false positives.

The false discovery rate allows a percentage of the called genes to be false positives.
Multiple testing corrections
Sort p-values from smallest to largest and apply correctioncorresponding to the selected method.
p         Bonferroni    Holm              BH (FDR)
p(1)      p(1)·n        p(1)·n            p(1)·n
p(2)      p(2)·n        p(2)·(n−1)        p(2)·n/2
:         :             :                 :
p(i)      p(i)·n        p(i)·(n−i+1)      p(i)·n/i
:         :             :                 :
p(n−1)    p(n−1)·n      p(n−1)·2          p(n−1)·n/(n−1)
p(n)      p(n)·n        p(n)              p(n)

padj = min(1, p)
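In R the table above corresponds to p.adjust(), which additionally enforces monotonicity of the adjusted values. A minimal sketch with made-up p-values:

    p <- c(0.0001, 0.004, 0.019, 0.095, 0.201)
    p.adjust(p, method = "bonferroni")
    p.adjust(p, method = "holm")
    p.adjust(p, method = "BH")        # Benjamini-Hochberg FDR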
Westfall and Young permutation
1. Compute the t statistic for each row in the original dataset.
2. Order them: |t(1)| ≥ |t(2)| ≥ ... ≥ |t(k)|
3. Permute the columns of the data matrix.
4. Compute t statistics for all rows of the permuted dataset: t1(b), ..., tk(b)
5. Compute uk(b) = |t(k)(b)| and uj(b) = max(uj+1(b), |t(j)(b)|) for 1 ≤ j ≤ k − 1
6. Repeat steps 3-5 N times and calculate the adjusted p-values:

p(j) = [ Σ_{b=1..N} I(uj(b) ≥ |t(j)|) ] / N

where I(·) is the indicator function, set to 1 if the condition in parentheses is true and 0 if false.
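The procedure can be sketched in a few lines of R for a two-group comparison; maxT, X, grp and N are illustrative names, and this is a sketch of the maxT idea rather than a production implementation (the Bioconductor package multtest provides one):

    # X: genes x samples matrix, grp: two-level factor, N: number of permutations
    maxT <- function(X, grp, N = 1000) {
      tstat <- function(m, g) apply(m, 1, function(x) t.test(x ~ g)$statistic)
      t0 <- abs(tstat(X, grp))
      o  <- order(t0, decreasing = TRUE)      # rank genes: |t(1)| >= |t(2)| >= ...
      count <- numeric(nrow(X))
      for (b in 1:N) {
        tb <- abs(tstat(X, sample(grp)))[o]   # permuted statistics, original ranking
        u  <- rev(cummax(rev(tb)))            # successive maxima u_j (steps 4-5)
        count <- count + (u >= t0[o])
      }
      padj <- cummax(count / N)               # adjusted p-values, made monotone
      padj[order(o)]                          # return in original gene order
    }

    X   <- matrix(rnorm(50 * 10), nrow = 50)  # simulated toy data
    grp <- factor(rep(c("A", "B"), each = 5))
    head(maxT(X, grp, N = 200))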
Comparing groups: general steps
I Determine the data type and putative distribution
I (Test the data for the presumed distribution)
I Select the test according to data type and distribution
I Formulate the null hypothesis and select the significance level α
I Calculate the test statistic value, determine the degrees of freedom (df)
I Determine the p-value from the test statistic value by lookup in a table with df
I Adjust the p-value if multiple tests were performed
I Reject the null hypothesis if the p-value ≤ α
Choosing an appropriate method
There are several aspects of the data to be considered whenchoosing an appropriate method of analysis:
I The number of groups of observations
I Independent or dependent groups of observations
I The type of the data
I The distribution of the data
I The objective of the analysis
Comparing groups of continuous data
One group of observations
Comparing the mean of a single group of observations with a specific value k.

Confidence interval for the mean
Is k within the (1 − α) CI: [x̄ − t_{1−α/2} s/√n, x̄ + t_{1−α/2} s/√n]?

One sample t-test
t = (x̄ − k) / (s/√n)

Confidence interval for the median
From the ranked data the CI is given by the values at the ranks nearest to
[np − 1.96 √(np(1−p)), np + 1.96 √(np(1−p))] with p = 1/2
One group of observations
Binomial sign test
z = (r − np) / √(np(1−p))
where r is the number of observations > k and p = 1/2

Binomial sign test with continuity correction
z = (|r − np| − 1/2) / √(np(1−p))

Wilcoxon signed rank sum test
1. Calculate the differences xi − k
2. Rank them according to the magnitude |xi − k|
3. Calculate the sum of all positive (negative) ranks corresponding to the observations above (below) k
⇒ get the p-value for the sum from the tabulated test statistic.
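All three one-group procedures are available in base R; a minimal sketch with illustrative data and k = 5.5:

    x <- c(5.1, 5.8, 6.3, 5.4, 6.0, 5.6)
    t.test(x, mu = 5.5)                  # one sample t-test (also reports the CI)
    binom.test(sum(x > 5.5), length(x))  # sign test, p = 1/2 under H0
    wilcox.test(x, mu = 5.5)             # Wilcoxon signed rank test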
Two groups of paired observations
Confidence interval for the differences between means
(1 − α) CI: [d̄ − t_{1−α/2} SE(d̄), d̄ + t_{1−α/2} SE(d̄)]

Paired t-test
The one sample t-test can also be used for the comparison of means, using the mean difference d̄:
t = (d̄ − k) / SE(d̄)   (e.g. k = 0)

Non-parametric methods
The one sample sign test and the Wilcoxon signed rank sum test can also be applied to the differences between the paired data (Wilcoxon matched pairs signed rank sum test).
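In R the paired versions are obtained with the paired argument; x and y are illustrative before/after measurements:

    x <- c(12, 15, 11, 18, 14, 16)
    y <- c(10, 11, 12, 12,  9, 13)
    t.test(x, y, paired = TRUE)       # paired t-test on the differences
    wilcox.test(x, y, paired = TRUE)  # Wilcoxon matched pairs test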
Two groups of independent observations
Confidence interval for the differences between means

Pooled variance: s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

Standard error: SE(x̄1 − x̄2) = s √(1/n1 + 1/n2)

(1 − α) CI: x̄1 − x̄2 ± t_{1−α/2} SE(x̄1 − x̄2)

Two sample t-test
t = (x̄1 − x̄2) / SE(x̄1 − x̄2)

Welch test
The Welch test is a modification of the t-test for the case of unequal variances in the two groups.

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

with degrees of freedom

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
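Both tests are provided by t.test(); Welch is R's default. A minimal sketch with simulated groups:

    set.seed(1)
    x1 <- rnorm(12, mean = 35, sd = 6)
    x2 <- rnorm(15, mean = 38, sd = 6)
    t.test(x1, x2, var.equal = TRUE)  # two sample t-test with pooled variance
    t.test(x1, x2)                    # Welch test (default)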
Two groups of independent observations
Mann-Whitney U-test

Rank all observations (as if they were from a single sample).

U1 = n1 n2 + n1(n1 + 1)/2 − Σ ri  (sum over the n1 ranks of group 1)
U2 = n1 n2 + n2(n2 + 1)/2 − Σ ri  (sum over the n2 ranks of group 2)
U = min(U1, U2)

U < U(α; n1; n2) ⇒ the test is significant (U(·) from the tabulated statistics).
Mann-Whitney U-test Example
Two groups: A = 7, 4, 9, 17 and B = 11, 6, 21, 14

Is there any evidence that A and B are drawn from populations with different levels of the variable? H0: There is no tendency for members of one population to exceed members of the other.

Ranked observations:
group:  A  B  A  A  B   B   A   B
value:  4  6  7  9  11  14  17  21

For each A (B), count how many Bs (As) precede it:
U = 0 + 1 + 1 + 3 = 5 and U′ = 1 + 3 + 3 + 4 = 11
As a check, U + U′ = n1 · n2 ⇔ 5 + 11 = 4 · 4

There are 70 different arrangements (8!/(4!4!)), each with equal probability 1/70 under the null hypothesis. E.g. for U = 2 there are two arrangements, AAABBABB and AABAABBB ⇒ p = 2/70 = 0.029.
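The example can be reproduced with R's wilcox.test(), which reports the U statistic of the first group as W:

    A <- c(7, 4, 9, 17); B <- c(11, 6, 21, 14)
    wilcox.test(A, B)   # W = 5, exact two-sided p-value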
Comparing two variances using the F-test
We can test the null hypothesis that two population variances are equal using the F-distribution.

If the data are normally distributed, the ratio of two independent estimates of the same variance follows an F-distribution:

F(ν1, ν2) = (χ1²/ν1) / (χ2²/ν2)

where ν1, ν2 are the degrees of freedom.

Calculate (s1/s2)² with s1 > s2 and look it up with degrees of freedom (ν1 = n1 − 1; ν2 = n2 − 1) in the tabulated F-statistic.
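In R this test is var.test(); note its sensitivity to non-normality:

    var.test(x1, x2)   # F test of equal variances (x1, x2 as in the sketch above)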
Chi-square distribution
The chi-square distribution results when independent variables with standard normal distributions are squared and summed:

X = (Z1 + c1)² + (Z2 + c2)² + ... + (Zν + cν)² has a χ² distribution

with ν degrees of freedom and non-centrality parameter δ² = Σ_{i=1..ν} ci².
More independent groups of observations
One way ANalysis Of VAriance (ANOVA)
The main objective is to identify the sources of variation that influence the data. The following model is suggested for the data, where just one factor is supposed to affect the population:

xij = µ + αi + εij,  i = 1, ..., k,  j = 1, ..., ni,  Σ_{i=1..k} ni = N

The idea is to test whether the data xij can be explained as the response to different treatments (groups i = 1, ..., k) of a given factor.

αi is the treatment effect and can be characterized by the sample mean of every subgroup:

αi = x̄i − x̄ = (1/ni) Σ_{j=1..ni} xij − (1/N) Σ_{i=1..k} Σ_{j=1..ni} xij

and εij = xij − x̄i
ANOVA
H0 : µ1 = µ2 = ... = µk
H1 : at least one of the µi is not equal to the others

The test of H0 is based on estimating σ². A general estimator of the variance is based on the variance within groups:

MSE = (s1² + s2² + ... + sk²)/k = (1/(N − k)) Σ_i Σ_j (xij − x̄i)²

The second estimator of the variance is based on the variance between groups:

MSA = n s_x̄² = (1/(k − 1)) Σ_i ni (x̄i − x̄)²

If H0 is true, both variances will be very similar; if MSA >> MSE, then H0 is rejected. This can be formulated by the F statistic:

F = MSA/MSE;  H0 is rejected if F exceeds the critical value f_{α; k−1, N−k}
ANOVA
All the information can be summarized in an ANOVA table:

Variation    df      Sum of squares                  MSS            F-value
Treatments   k − 1   SSA = Σ_i ni (x̄i − x̄)²          SSA/(k − 1)    MSA/MSE
Error        N − k   SSE = Σ_i Σ_j (xij − x̄i)²       SSE/(N − k)
Total        N − 1   SST = Σ_i Σ_j (xij − x̄)²
ANOVA
H0 : µ1 = µ2 = ... = µk
Rejecting the null hypothesis signifies that there is a statistically significant difference (at the level α) between some of the group means.
It is, however, not known which of the means differ. Therefore a”post-hoc” test is necessary to determine which specific means showa difference. The following tests are commonly used:
I Fisher’s Least Significant Difference (LSD): Similar to pair-wiset-tests between all groups, uses pooled SD of all groups. Doesnot correct for multiple testing.
I Tukey’s Honestly Significant Difference (HSD): Similar topair-wise t-tests between all groups, does correct for multipletesting.
I Scheffe’s method: Corrects α for all pair-wise and also for allcomparisons involving more than two means at a time.
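A one-way ANOVA with a Tukey HSD post-hoc test in R; the data frame is simulated for illustration:

    set.seed(1)
    d <- data.frame(y   = c(rnorm(10, 10), rnorm(10, 12), rnorm(10, 10.5)),
                    grp = gl(3, 10, labels = c("A", "B", "C")))
    fit <- aov(y ~ grp, data = d)
    summary(fit)    # the ANOVA table (df, SS, MS, F, p)
    TukeyHSD(fit)   # pairwise differences, corrected for multiple testing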
Kruskal-Wallis test
As ANOVA is a more general form of the t-test, the Kruskal-Wallis test is a more general form of the non-parametric Mann-Whitney U-test.

H = [12 / (N(N + 1))] Σ_{i=1..k} Ni (R̄i − R̄)²

where R̄ is the average of all ranks (R̄ = (N + 1)/2), Ri is the rank sum of the Ni observations in the i-th group, and R̄i is the average rank in each group (R̄i = Ri/Ni).

The H statistic can also be formulated equivalently:

H = [12 / (N(N + 1))] Σ_{i=1..k} Ri²/Ni − 3(N + 1)

H is χ² distributed with k − 1 degrees of freedom. For more than one tie, H has to be corrected by

C = 1 − Σ_i (ti³ − ti)/(N³ − N) and H′ = H/C
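The rank-based analogue of aov() in R is kruskal.test(), which accepts a list of group vectors or a formula; the third group here is invented for illustration:

    kruskal.test(list(c(7, 4, 9, 17), c(11, 6, 21, 14), c(8, 12, 15, 10)))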
Comparing groups of categorical data
I Categorical data are very common in medical research, when individuals are categorized into one or more mutually exclusive groups. The number falling into a particular group is called the frequency.
I The data are often shown in the form of frequency tables.
I They can also be summarized as the proportion of the total number of individuals in one of the categories.
One proportion
Confidence interval
p = r/n and SE(p) = √(p(1 − p)/n)

Based on the normal distribution when np > 5 and n(1 − p) > 5 ⇒ r > 5 and (n − r) > 5

95% CI: [p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)]

Hypothesis test
Test the null hypothesis that the population proportion is some pre-specified value k:

z = (p − k) / SE(p) with SE(p) = √(k(1 − k)/n)

and with continuity correction:

z = (|p − k| − 1/(2n)) / SE(p)
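Both the normal-approximation test and the exact test are one-liners in R; r = 30 successes out of n = 120 against k = 0.20 are illustrative numbers:

    prop.test(30, 120, p = 0.20)    # z-test equivalent, with continuity correction
    binom.test(30, 120, p = 0.20)   # exact binomial test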
Proportions in two independent groups
Confidence interval

SE(p1 − p2) = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )

95% CI: [p1 − p2 − 1.96 SE(p1 − p2), p1 − p2 + 1.96 SE(p1 − p2)]

Hypothesis test

p = (r1 + r2) / (n1 + n2)

SE(p1 − p2) = √( p(1 − p)/n1 + p(1 − p)/n2 ) = √( p(1 − p)(1/n1 + 1/n2) )

z = (p1 − p2) / SE(p1 − p2) and zc = ( |p1 − p2| − (1/n1 + 1/n2)/2 ) / SE(p1 − p2)
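In R, prop.test() compares two proportions directly from the counts (r1, r2 successes in n1, n2 trials; the numbers are illustrative):

    prop.test(c(45, 30), c(100, 100))   # CI and test for p1 - p2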
Two paired proportions
Example - Sleep difficulties
Two groups of individuals were investigated with regard to sleep difficulties. The individuals were matched with respect to age (within 5 years), level of education, marital status, occupation, tobacco smoking frequency and duration, and alcohol use.

Marijuana group   Control group   Number of pairs
yes               yes             a = 4
yes               no              b = 3
no                yes             c = 9
no                no              d = 16
total                             n = 32

p1 − p2 = (a + b)/n − (a + c)/n = (b − c)/n
Karacan I, et al. Ann NY Acad Sci, 1977;282(1):348-374
Two paired proportions
Confidence interval

SE(p1 − p2) = (1/n) √( b + c − (b − c)²/n )

95% CI: [p1 − p2 − 1.96 SE(p1 − p2), p1 − p2 + 1.96 SE(p1 − p2)]

Hypothesis test
Replace both b and c by (b + c)/2:

SE(p1 − p2) = (1/n) √( (b + c)/2 + (b + c)/2 ) = (1/n) √(b + c)

z = (p1 − p2) / SE(p1 − p2) = (b − c) / √(b + c)
Analysis of frequency tables
Chi squared test for an r × c table
The null hypothesis is that the two classifications (columns and rows) are unrelated in the relevant population.

Compare the observed frequencies with what we would expect if the null hypothesis were true:

X² = Σ_{i=1..r} Σ_{j=1..c} (Oij − Eij)² / Eij

with observed frequencies Oij and expected frequencies Eij.

The expected frequency in each cell is the product of the relevant row and column totals divided by the sum of all observed frequencies in the table (i.e. the sample size).

X² is χ² distributed with (r − 1)(c − 1) degrees of freedom.
2x2 frequency tables
        C1      C2      total
R1      a       b       a + b
R2      c       d       c + d
total   a + c   b + d   N

There are two common tests for 2 × 2 frequency tables:
I Chi squared test (if all Eij > 5)
I Fisher's exact test

Chi squared test

For the first cell:

(O11 − E11)²/E11 = (a − (a + b)(a + c)/N)² / ((a + b)(a + c)/N)

and for the sum of all 4 cells:

X² = N(ad − bc)² / [(a + b)(a + c)(b + d)(c + d)]

Continuity correction (also known as Yates' correction):

X²_Y = N(|ad − bc| − N/2)² / [(a + b)(a + c)(b + d)(c + d)]

The Chi squared test is equivalent to the comparison of proportions.
Fisher’s exact test
The method consists of evaluating the probability associated with all possible 2x2 tables which have the same row and column totals, under the assumption that the null hypothesis is true:

p(a, b, c, d) = (a + b)! (a + c)! (b + d)! (c + d)! / (N! a! b! c! d!)

To calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, there are 2 possibilities:

1) Add the probabilities in the 'tail' of the distribution in which the observed data fall and double the value to get a two-tailed test.
2) Add up the probabilities of all tables where p < p(a, b, c, d).
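Both 2 × 2 tests are available in base R; the cell counts are illustrative:

    tab <- matrix(c(12, 5, 8, 15), nrow = 2,
                  dimnames = list(group = c("G1", "G2"), outcome = c("yes", "no")))
    chisq.test(tab)                    # applies Yates' correction for 2x2 tables
    chisq.test(tab, correct = FALSE)   # uncorrected X^2
    fisher.test(tab)                   # exact test; also reports an odds ratio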
McNemar’s test for paired samples
                 Cases
           +      −      total
Control +  a      b      a + b
Control −  c      d      c + d
total      a + c  b + d  N

X² = (|b − c| − 1)² / (b + c)
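The sleep example above in R; mcnemar.test() applies the continuity correction by default:

    tab <- matrix(c(4, 9, 3, 16), nrow = 2)   # rows: marijuana yes/no, cols: control yes/no
    mcnemar.test(tab)                          # X^2 = (|b - c| - 1)^2 / (b + c)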
Ordered 2 x k contingency table
score/categories   x1 = 1   ...   xk = k   total
frequency          r1       ...   rk       R = Σ ri
total              n1       ...   nk       N = Σ ni

From the regression approach we get:

X²_trend = [ Σ_{i=1..k} ri xi − R x̄ ]² / { p(1 − p) [ Σ_{i=1..k} ni xi² − N x̄² ] },  df = 1,

with p = R/N and x̄ = Σ_{i=1..k} ni xi / N

An alternative approach is based on Kendall's rank correlation (τ):

X² = ( τ / SE(τ) )²
Coefficients of association
The following coefficients are defined to describe the association of nominal data (categories) from contingency tables (k = min(r, c)).

Contingency coefficient (Pearson), adjusted for the number of rows and columns:
CC = √(k/(k − 1)) · √(χ²/(n + χ²))

Cramer's V: V = √( χ² / (n(k − 1)) )

Phi (Cramer's V in non-square tables): φ = √(χ²/n)

Coefficient of association (Yule): Q = (ad − bc)/(ad + bc)

Eta is a coefficient of nonlinear association, designed for cases where one of the measures is nominal and the other numeric:
η = √( (1/(n − 1)) Σ_{i=1..k} ni (ȳi − ȳ)² / s_y² )
Comparing risks
Relative risk and odds ratio
             Outcome
           +     −     total   risk        odds
Exposure + a     b     a + b   a/(a + b)   a/b
Exposure − c     d     c + d   c/(c + d)   c/d
total      a+c   b+d   N

RR = [a/(a + b)] / [c/(c + d)]   OR = ad/bc

There is another way of analyzing 2 × 2 tables, which involves the comparison of two groups with respect to the risk of some event.

The methods were developed in epidemiology, especially for the analysis of case-control studies.

The parameters of interest are the relative risk (RR) and the odds ratio (OR).
Relative risk
In a prospective study, groups of subjects with different characteristics are followed up to see whether an outcome of interest occurs.

The risks in the two groups (exposed and non-exposed) are a/(a + b) and c/(c + d).

The relative risk RR = [a/(a + b)] / [c/(c + d)]

Under the null hypothesis the expected value of RR is 1.

SE(log RR) = √( 1/a − 1/(a + b) + 1/c − 1/(c + d) )

(1 − α) CI: [log RR − z_{1−α/2} SE(log RR), log RR + z_{1−α/2} SE(log RR)]
Odds ratio
In retrospective case-control studies the selection of subjects is based on the outcome. In this case the relative risk is not a valid estimate.

We can use the odds (a/b) of the outcome in the first group (cases), compare them to the odds (c/d) in the second group (controls), and get the odds ratio OR = ad/bc.

For case-control studies the outcome of interest is usually rare, so the odds ratio offers a method of getting an approximate relative risk.

SE(log OR) = √( 1/a + 1/b + 1/c + 1/d )

(1 − α) CI: [log OR − z_{1−α/2} SE(log OR), log OR + z_{1−α/2} SE(log OR)]
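Both interval estimates are straightforward by hand in R; the cell counts are illustrative:

    n11 <- 12; n12 <- 5; n21 <- 8; n22 <- 15           # a, b, c, d
    OR <- (n11 * n22) / (n12 * n21)
    SE <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
    exp(log(OR) + c(-1, 1) * qnorm(0.975) * SE)        # 95% CI for the OR
    RR  <- (n11 / (n11 + n12)) / (n21 / (n21 + n22))
    SEr <- sqrt(1/n11 - 1/(n11 + n12) + 1/n21 - 1/(n21 + n22))
    exp(log(RR) + c(-1, 1) * qnorm(0.975) * SEr)       # 95% CI for the RR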
Goodness-of-fit
I qq-plot
I Chi-square goodness-of-fit test
I Kolmogorov-Smirnov test (KS-test)
I Shapiro-Wilk test
I Anderson-Darling test
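Several of these are built into R (the Anderson-Darling test is provided by the nortest add-on package). A minimal sketch:

    set.seed(1); x <- rnorm(50)
    qqnorm(x); qqline(x)                   # graphical check
    shapiro.test(x)                        # Shapiro-Wilk
    ks.test(x, "pnorm", mean(x), sd(x))    # KS test; estimated parameters make it conservative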
Correlation and regression
The aim is to find associations between two or more variables (bivariate or multivariate data).

Possible questions are:

1. Is there a relation between the variables?
2. How strong is this relation?
3. What shape does this relation have?
4. Can a variable of interest be predicted by observation of other variables?
Correlation
Correlation is a method which analyzes the strength of the linear agreement between x and y, where x and y are pairwise observations of the same observation unit (bivariate data). As a measure, the (Pearson) correlation coefficient r is used.

Variance
s_x² = (1/(n − 1)) Σ_{i=1..n} (xi − x̄)²  and  s_y² = (1/(n − 1)) Σ_{i=1..n} (yi − ȳ)²

Covariance
Cov(x, y) = s_xy = (1/(n − 1)) Σ_{i=1..n} (xi − x̄)(yi − ȳ)

Correlation
r = s_xy / (s_x s_y) = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² Σ (yi − ȳ)² )
Correlation
Test for linear relation

H0: true correlation ρ = 0

For jointly normally distributed (x, y), the test statistic

T = r √( (n − 2)/(1 − r²) )

is t-distributed with n − 2 degrees of freedom.

With the following transformation the correlation is approximately standard normally distributed:

z′ = 0.5 (ln(1 + r) − ln(1 − r)) and SE = 1/√(N − 3)

(1 − α) CI: [ (e^{2zl} − 1)/(e^{2zl} + 1), (e^{2zu} − 1)/(e^{2zu} + 1) ] with
zl = z′ − z_{1−α/2}/√(N − 3) and zu = z′ + z_{1−α/2}/√(N − 3)
Spearman’s rank correlation
Spearman's rank correlation coefficient rs is obtained by ranking the values of the two variables separately and calculating Pearson's correlation on the ranks of the data. For ties the average rank is used.

When there are no ties, Spearman's rank correlation can be calculated more simply as:

rs = 1 − 6 Σ_{i=1..n} di² / (N³ − N)

where di are the differences in the ranks.
Kendall’s τ
Kendall's rank correlation coefficient τ is the proportion of concordant pairs (ordered the same way) minus the proportion of discordant pairs (ordered the opposite way):

τ = (nc − nd) / (n(n − 1)/2) = S / (n(n − 1)/2)

When there are no ties, nc + nd = n(n − 1)/2.

To allow for perfect correlation when there are ties between subjects for both variables, there is a different version:

τb = S / √( (n(n − 1)/2 − Σ t(t − 1)/2) (n(n − 1)/2 − Σ u(u − 1)/2) )
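Both rank correlations are available through the method argument of cor() and cor.test():

    cor.test(x, y, method = "spearman")   # x, y from the sketch above
    cor.test(x, y, method = "kendall")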
Considerations for calculation of correlation
1. If many variables are tested there are many correlations. As with multiple testing, the significant correlations are overestimated.
2. Spurious correlations for trends over time (divorce rate vs. price of gasoline)
3. Correlation by heterogeneity (voice frequency vs. body height: correlation based on gender)
4. Trivial correlations
5. Confounding variables (number of storks vs. birth rate; Simpson's paradox)
6. Non-linear relations
7. Extreme data points
Regression
We want to describe the relation between a set of data on twocontinuous variables and predict the value of one variable for anindividual when we only know the other variable.
Also the effect of one variable on the other variable is of interest.Therefore the relation is directed and the variables are categorized:
X .. independent, predictor value (plotted on the horizontal x-axis)
Y .. dependent, response or outcome variable (plotted on the verticaly-axis)
Whereas correlation provides the strength and sign of a relation, regression gives a quantitative model of the relation for the dependent variable.
Linear regression
Define a statistical model of regression:

yi = f(xi) + εi,  i = 1, ..., n

where f is the regression function and εi is random noise (error) with E[εi] = 0 and variance σ².

For linear regression the regression function is the linear function:

f(x) = β0 + β1 x

where β0 is the intercept and β1 the slope of the linear function.
Estimation of parameter
Minimum least squares method:

∂/∂β0 Σ_{i=1..n} (yi − β0 − β1 xi)² = 2 Σ_{i=1..n} (β1 xi + β0 − yi) = 0
∂/∂β1 Σ_{i=1..n} (yi − β0 − β1 xi)² = 2 Σ_{i=1..n} xi (β1 xi + β0 − yi) = 0

⇒

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = r s_y/s_x
β̂0 = ȳ − β̂1 x̄

ε̂i = yi − ŷi = yi − β̂0 − β̂1 xi = yi − ȳ − β̂1 (xi − x̄)
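In R the least squares estimates come from lm(); a minimal sketch with simulated data:

    set.seed(1)
    x <- runif(30, 0, 10)
    y <- 2 + 0.5 * x + rnorm(30)
    fit <- lm(y ~ x)
    coef(fit)        # beta0 (intercept) and beta1 (slope)
    summary(fit)     # r^2, residual SE, and the t test of beta1 = 0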
Residual variance
s²_res = Σ (yi − ŷi)² / (n − 2) = Σ (yi − ȳ − β̂1(xi − x̄))² / (n − 2) = (1 − r²) s_y²

The variance can be divided into the residual (unexplained) variance s²_res and the variance explained by the regression (s²_reg):

s_y² (total) = s²_reg (explained) + s²_res (unexplained) = r² s_y² + (1 − r²) s_y²

⇒ r² is a measure of the quality of the regression
Confidence interval
Slope
SE(β̂1) = s_res / √( Σ (xi − x̄)² )
(1 − α) CI: β̂1 ± t_{1−α/2} SE(β̂1)

Estimated ŷ for a given x
SE(ŷ) = s_res √( 1/n + (x − x̄)²/Σ (xi − x̄)² )
(1 − α) CI: ŷ ± t_{1−α/2} SE(ŷ)

Hypothesis test
H0: there is no relation ⇔ β1 = 0
The ratio β̂1 / SE(β̂1) is compared with the t-distribution with df = n − 2.
Prediction interval
s_pred = s_res √( 1 + 1/n + (x − x̄)²/Σ (xi − x̄)² )

(1 − α) prediction interval: ŷ ± t_{1−α/2} s_pred

Here the estimated standard deviation of the individual values y − ŷ at the value x is used, and not the standard error.
Note: the prediction interval is much wider than the confidenceinterval.
The confidence interval and the prediction interval can be added tothe scatter plot around the regression line.
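Continuing the lm() sketch from above, both bands can be computed with confint() and predict():

    confint(fit)                                        # CI for intercept and slope
    new <- data.frame(x = seq(0, 10, length.out = 50))
    head(predict(fit, new, interval = "confidence"))    # CI for the fitted line
    head(predict(fit, new, interval = "prediction"))    # wider prediction interval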
Causality
Correlation and regression are based on similar mathematicalbackground but are distinct methods with a different purpose.
Correlation and regression only give information about association,however, a causal relation cannot be directly inferred. This appliesregardless of the strength of the observed association.
One of the strongest ways to prove causal inference is to conduct anexperiment (i.e., systematically manipulate a variable to study itseffect on another).
Causal inference
Problem
I Confounding variables (see Simpson's paradox)

Methods
I Pearl's do-operator
I Control by selection (stratification): no variation in the confounding variable
I Statistical control: partial correlation, multiple regression model
I Directionality and time
Pearl’s do operator
The idea is to perform an atomic intervention, leaving all othermechanisms unperturbed. This is denoted by do(Xi = xi ) or shortdo(xi ).
Pearl J, Causality - Models, Reasoning, and Inference, Cambridge University Press, 2000
Partial correlation
Partial correlation represents the relationship between two variableswhile controlling for a third variable.
r_{YZ.X} = (r_{ZY} − r_{ZX} r_{XY}) / ( √(1 − r²_{ZX}) √(1 − r²_{XY}) )
Scatter plots
Multiple regression
In observational studies we are interested in the way one variable isinfluenced by several variables
X1, ...,Xk ... predictor variables, explanatory variables
Y ... dependent, response or outcome variable is expressed as acombination of the explanatory variables
It is not necessary for the explanatory variables to be continuous.
Statistical Model:
y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε
where β0...βk are the regression coefficients.
Multiple regression
Multiple regression

Y = Xβ + ε with β = (β0, β1, ..., βk)^T,

Y = (y1, ..., yn)^T, ε = (ε1, ..., εn)^T, and X the n × (k + 1) matrix whose i-th row is (1, xi1, ..., xik).

Minimum squares estimator:

Σ_{i=1..n} εi² = (Y − Xβ)^T (Y − Xβ) → min! ⇒

∂/∂β (Y − Xβ)^T (Y − Xβ) = −2 X^T (Y − Xβ) = 0

β̂ = (X^T X)^{−1} X^T Y
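The closed-form estimator can be verified against lm(); a minimal sketch with two illustrative predictors:

    set.seed(1)
    x1 <- rnorm(40); x2 <- rnorm(40)
    y  <- 1 + 2 * x1 - x2 + rnorm(40)
    X  <- cbind(1, x1, x2)                  # design matrix with intercept column
    solve(t(X) %*% X, t(X) %*% y)           # beta-hat from the normal equations
    coef(lm(y ~ x1 + x2))                   # identical up to numerical error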
Global F test
H0 : β1 = β2 = ... = βk = 0
Source       df          Sum of squares           MSS                  F-value
Regression   k           SSreg = Σ (ŷi − ȳ)²      SSreg/k              MSreg/MSres
Residues     n − k − 1   SSres = Σ (yi − ŷi)²     SSres/(n − k − 1)
Total        n − 1       SSy = Σ (yi − ȳ)²

SSy = SSreg + SSres
Goodness-of-fit
R² = 1 − SSres/SSy = SSreg/SSy = Σ (ŷi − ȳ)² / Σ (yi − ȳ)²

R² · 100% tells how many percent of the variability around the overall mean can be explained by the regression.

The expected value of R² increases as more variables are added to the model, independent of the influence of each variable ⇒

Adjusted R² = 1 − MSres/MSy = 1 − [(n − 1)/(n − k − 1)] (1 − R²)

For simple linear regression R² = r². For multiple regression models R is called the multiple correlation coefficient; however, it must not be interpreted the same way. The F-test is the only way to assess whether a model explains a significant proportion of variability.
Variable selection
A problem arises when the number of variables p is high compared to the number of observations.

⇒ Selection of variables:
I Only select those variables which are significant or most significant in pairwise comparison.
I In case of many strongly correlated variables, include only one of them in the model.
I Include variables with already known influence.
I Exclude correlated variables whose influence is not plausible.
Forward selection
I Start with the null model, or take only those variables which have to be in the model.
I Stepwise add those variables which lead to the greatest reduction of SSres.
I Stop the procedure when SSres cannot be reduced (or when the changes are very small) by adding a new variable.
Backward selection (elimination)
I Start with a model containing all variables (p = k).
I Remove, one by one, the variables which show the least increase of SSres.
I Stop the procedure when SSres would be substantially increased by removing one of the remaining variables.
All subsets regression
Selecting the best model by examining every possible model:

I There are 2^k − 1 subsets {i1, ..., ip} ⊆ {1, 2, ..., k}.
I Calculate for each subset a multiple regression with the variables Xi1, ..., Xip.
I Choose the model with the smallest p and an acceptable SSres.
I Assess the goodness-of-fit with the Cp statistic.
Goodness-of-fit measures
Adjusted R²:
R²adj = 1 − MSres/MSy = 1 − [(n − 1)/(n − p − 1)] (1 − R²)

F-test: comparison of a model with k − 1 variables with a model including an additional variable:
F = [SSres(k − 1) − SSres(k)] / [SSres(k)/(n − k − 1)]

Mallow's Cp:
Cp = SSres(p)/MSres(k) − n + 2(p + 1)

Akaike information criterion (AIC, smaller values are better):
AIC = n log(SSres(p)/n) + 2(p + 1) + n
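R's step() performs stepwise selection using the AIC; a minimal sketch on simulated data, where x3 is pure noise:

    set.seed(1)
    d <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
    d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(40)
    full <- lm(y ~ x1 + x2 + x3, data = d)
    step(full, direction = "backward")               # backward elimination
    step(lm(y ~ 1, data = d), scope = formula(full),
         direction = "forward")                      # forward selection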
Model assumptions
I Linearity: the expected value of Y depends linearly on the explanatory variables
I Homoscedasticity: homogeneity of the variance of the residuals, independent of the explanatory variables
I Normal distribution of the residuals
Methods to test assumptions
Two-way analysis of variance (ANOVA)
In one-way ANOVA the means across only one factor (treatment groups) are compared, whereas in two-way ANOVA the means across two factors are compared.

There are 2 common application cases:

1. Two-way cross classifications (e.g. randomized complete block design, RCBD)
2. Repeated measurements
Two-way cross classifications
[Table: two-way cross classification. Factor A with levels 1, ..., k (rows) and factor B with levels 1, ..., m (columns); each cell contains n repeated measurements xij1, ..., xijn. Ti.. are the row totals, T.j. the column totals, and T... the grand total.]
Two way cross classification model
Statistical model:
xijl = µ + αi + βj + γij + εijl

where the groups of A are i = 1, ..., k, the groups of B are j = 1, ..., m, and the repeated measurements are l = 1, ..., n.

The model describes whether the data xijl can be explained by the overall mean, the effects of the treatments of factor A, the treatments of factor B, and the interdependency (interaction) between A and B.

This is called an interdependency model; if the last term is omitted, it is basically an additive model.
Partitioning the variation
SST = SSA + SSB + SSAB + SSE

SST = Σ_i Σ_j Σ_l x²ijl − T².../N

SSA = Σ_i T²i../(mn) − T².../N

SSB = Σ_j T².j./(kn) − T².../N

SSAB = Σ_i Σ_j T²ij./n − T².../N − SSA − SSB

SSE = SST − SSA − SSB − SSAB
ANOVA table
Variation     df                SSQ     MSS
Factor A      k − 1             SSA     MSA = SSA/(k − 1)
Factor B      m − 1             SSB     MSB = SSB/(m − 1)
Interaction   (k − 1)(m − 1)    SSAB    MSAB = SSAB/[(k − 1)(m − 1)]
Error         N − km            SSE     MSE = SSE/(N − km)
Total         N − 1             SST
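A two-way cross classification in R; aov() with A * B fits both main effects and the interaction (the factors and data are simulated for illustration):

    set.seed(1)
    d <- expand.grid(A = gl(3, 1), B = gl(2, 1), rep = 1:4)
    d$y <- rnorm(nrow(d), mean = as.numeric(d$A))
    summary(aov(y ~ A * B, data = d))   # rows for A, B, A:B and residuals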
Fixed and random effects
Fixed effectsA variable (effect, factor) is considered fixed, when all possible valuescan be observed in the study (e.g. the gender of a patient, type ofcar). Categorical variables are (in general) fixed effects.
Random effectsA variable (effect, factor) is considered random, when only a subsetof a population can be observed in a study (e.g. only the threeuniversities in Graz out of all universities in Austria, a patient withmultiple measurements).
F-values from ANOVA for different effects
Effects    A fixed, B fixed   A random, B random   A fixed, B random   A random, B fixed
Factor A   F = MSA/MSE        F = MSA/MSAB         F = MSA/MSAB        F = MSA/MSE
Factor B   F = MSB/MSE        F = MSB/MSAB         F = MSB/MSE         F = MSB/MSAB
A × B      F = MSAB/MSE       F = MSAB/MSE         F = MSAB/MSE        F = MSAB/MSE
Repeated measurements
This analysis is considered an extension of the paired t-test, since the measurements are made on the same subject and therefore comprise paired data.

An example for this type of analysis is studying the short-term effects of a drug on the heart rate:

Subject   0 min   30 min   60 min   120 min
1         96      92       86       92
2         110     106      108      114
3         89      86       85       83
4         95      78       78       83
5         128     124      118      118
6         100     98       100      94
7         72      68       67       71
8         79      75       74       74
9         100     106      104      102
Statistical model for repeated measurements
Statistical model:
xij = µ + αi + βj(tj) + εij

where tj are the time points (or, in general, measuring points) and βj(tj) is the individual effect of subject j at time point tj.

The question to address is whether the time course is constant (αi = 0) or changes over the subjects (αi ≠ 0).

The analysis methods differ in their assumptions about the individual variations:

1. Multi-variate one-way analysis of variance (MANOVA)
2. Uni-variate model of analysis of variance with repeated measurements
MANOVA
Multi-variate analysis of variance (MANOVA) is used when there are2 or more dependent variables (DV).
MANOVA uses a linear combination of the response variables which maximizes the ratio of between-group and within-group variances of z:

zik = c0 + c1 xi1 + ... + ck xik

If H denotes the hypothesis sums of squares and cross-products matrix and E denotes the error sums of squares and cross-products matrix, then the matrix A can be expressed as A = HE^{−1}.

The eigenvalues λi of A correspond to the factors ci in the linear combination.

MANOVA

Based on the λi the following test statistics can be calculated:

Pillai's trace = trace[H(H + E)^{−1}] = Σ_{i=1..k} λi/(1 + λi)

Hotelling-Lawley's trace = trace(A) = Σ_{i=1..k} λi

Wilks' Λ = |E|/|H + E| = Π_{i=1..k} 1/(1 + λi)

Roy's largest root = max(λi)
These statistics are translated into F statistics in order to test the nullhypothesis.
Uni-variate model of analysis of variance withrepeated measurements
The within subjects design requires homogeneity of treatmentdifference variances. One can create a new set of variables,composed of all possible pairwise differences, and the variances ofthese differences must all be equal in the population. This is calledthe sphericity assumption.
The compound symmetry assumption - a special case of thesphericity assumption - is met if all the covariances (the off-diagonalelements of the covariance matrix) are equal and all the variancesare equal in the populations being sampled.
Since these assumptions often do not hold for more than 2 time points, there are corrections accounting for this, namely the Greenhouse-Geisser and the Huynh-Feldt corrections.
Logistic regression
In many studies the outcome variable of interest is the presence orabsence of some condition, or in general a binary variable.
For such data multiple linear regression cannot be used; a similar approach called multiple linear logistic regression, or simply logistic regression, is used instead.

Here the explanatory variables are used to predict a transformation of the dependent variable and to model a probability; therefore the linear model does not work directly.

The transformation is called logit:

logit(p) = log( p/(1 − p) ), where p/(1 − p) is the odds

and p is the proportion of individuals with the characteristic. The regression model can be formulated as:

log( p/(1 − p) ) = β0 + β1 x1 + β2 x2 + ... + βk xk
Logistic regression
p(x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x})

p(x) is the logistic distribution function (from which the name is derived) and models the probability that y = 1.

If you want to compare predictions for subjects with or without a particular characteristic (explanatory variable) you have:

log( p1/(1 − p1) ) − log( p2/(1 − p2) ) = log[ p1(1 − p2) / (p2(1 − p1)) ] = log(OR)
With the logit transformation there is now a linear relation betweenthe explanatory variables and the outcome.
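In R the model is fitted with glm() and the binomial family; exponentiated coefficients are the odds ratios. The data are simulated for illustration:

    set.seed(1)
    d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    d$y <- rbinom(100, 1, plogis(-0.5 + 1.2 * d$x1))
    fit <- glm(y ~ x1 + x2, family = binomial, data = d)
    summary(fit)      # Wald z tests of the coefficients
    exp(coef(fit))    # odds ratios per unit change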
Estimation and tests in logistic regression
Estimation of the regression coefficients βi and standard errors SE(βi) is done by the maximum likelihood method.

To test whether the influence of xi on P(y = 1|xi) is significant, the null hypothesis is H0: βi = 0 and the two-sided alternative hypothesis is βi ≠ 0.

The test statistic is called the Wald statistic:

W = β̂i / SE(β̂i)

which can be approximated by a normal distribution.
Interpretation of coefficients
Linear model
g(x) = log( p/(1 − p) ) = β0 + β1 x

Binary variable x
For x = 0 and x = 1 ⇒ g(0) = β0 and g(1) = β0 + β1
β1 = g(1) − g(0) = log(OR) and OR = e^{β1}

Continuous variable x
If x changes by k units:
∆g = k β1 = log(OR)
e^{kβ1} = (e^{β1})^k = OR^k ⇒ the OR is multiplicative.
Computation
One issue to consider is that for y = 0 or y = 1 the logit(p) is −∞ or ∞.

The method of analysis uses an iterative procedure, whereby the answer is obtained by several repeated cycles of calculation using the maximum likelihood approach.

The k + 1 non-linear equations can sometimes lead to numerical problems. It is recommended that data from at least 20 events and 20 non-events are available for each explanatory variable.

Due to the computational complexity, logistic regression is typically found only in larger statistical packages.
Quality of the prediction
Information from different significant influence factors (explanatory variables) can be combined into the prognostic index (PI):

PI = β1 x1 + β2 x2 + ... + βk xk

As for diagnostic tests, the PI can be divided at different cut-points, and the quality of the prediction can be studied by a receiver operating characteristic.

For every cut-point c one studies how well the outcome is predicted by the binary variable PI > c.

The AUC is a measure of the quality, which can be compared with that of each univariate predictor (explanatory variable).
Discriminant analysis
We wish to find some combination of variables that classifies a large proportion of subjects into the correct group, so that we have a good chance of allocating (diagnosing) new subjects correctly.
The basic idea of discriminant analysis is to find the combination of variables that maximizes the separation between the groups, as with logistic regression.
With more than two groups the groups can be further separated byconstructing a second combination of the same variables which arecalled canonical variates or discriminant functions.
Discriminant analysis
Group   x1    x2    ...   xk
A       96    92    86    92
A       79    75    74    74
A       89    86    85    83
A       95    78    78    83
B       128   124   118   118
B       100   98    100   94
B       110   106   108   114
B       93    87    91    89

The discriminant function can be defined as:

y = β0 + β1 x1 + ... + βk xk

The parameters βi are estimated such that the ratio of the between-groups variance to the within-groups variance is maximal.
Discriminant analysis
Discriminant (function) analysis DA is mathematically identical to asingle factor MANOVA: DA is multivariate analysis of variance(MANOVA) reversed. In MANOVA, the independent variables are thegroups and the dependent variables are the predictors. In DA, theindependent variables are the predictors and the dependent variablesare the groups.
Factor analysis and ordination techniques
Explorative methods to find an elementary explanation model formutual relations.
Overview of common ordination techniques
           indirect                                      direct
linear     Principal Component Analysis (PCA)            Redundancy Analysis (RDA)
unimodal   (Detrended) Correspondence Analysis ((D)CA)   Canonical CA (CCA)
Other common methods in this context include Multi Dimensional Scaling(MDS) and Principal Coordinate Analysis (PCoA).
Principal component analysis (PCA)
Variables are summarized by a linear combination to the principalcomponents.
The origin of the coordinate system is centered to the center of thedata (mean centering).
The coordinate system is rotated to maximize the variance along the first axis ⇒ the first principal component (PC) points in the direction of maximum variance from the origin, and each subsequent PC is orthogonal to the previous ones and describes the maximum residual variance.
This method can be approached by a singular value decomposition of the (m × n) data matrix X.

Principal component analysis (PCA)

X = U W V^T with U^T U = V^T V = V V^T = I

For mean-centered data the covariance matrix C can be calculated as X X^T.

U are the eigenvectors of X X^T, and the eigenvalues are on the diagonal of W, defined by the characteristic equation |C − λI| = 0.

The transformation of the input vectors into the principal component space can be described by Y = XU, where the projection of sample i along the axis of the j-th principal component is:

yij = Σ_{t=1..m} xit utj
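In R, prcomp() computes PCA via exactly this kind of singular value decomposition; X is an illustrative samples × variables matrix:

    set.seed(1)
    X <- matrix(rnorm(20 * 5), nrow = 20)
    pc <- prcomp(X, center = TRUE, scale. = TRUE)
    summary(pc)      # proportion of variance per component
    pc$x[, 1:2]      # sample coordinates on the first two PCs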
PCA for gene expression data
Correspondence analysis (CA)
CA is an extension of the analysis of contingency tables. In this case the states of the descriptors (objects in rows) are compared with those of other descriptors (variables in columns).

The aim of CA is to reduce the contingency table to a few summarizing variables, showing a lack of independence between rows and columns.

The approach is a combination of the χ² statistic and singular value decomposition, similar to that for principal component analysis.

Starting with an r × c contingency table, Ti are the row totals in row i and Tj are the column totals in column j.

The total number is N and the number of observations in row i and column j is nij.
Correspondence analysis (CA)
χ² = (O − E)²/E ⇒

A matrix S with elements sij can be constructed, where

s²ij = ( nij/N − Ti Tj/N² )² / ( Ti Tj/N² )

The matrix S can be singular value decomposed:

S = U W V^T

W is a diagonal matrix, and its diagonal elements are referred to as the singular values of S. We think of them as sorted from the largest to the smallest and denote them by λk.
Correspondence analysis (CA)
The coordinates for sample i in the new space are then given by aik = λk uik/√(Ti/N) for k = 1, ..., J, and the variables are viewed in the same space with variable j given coordinates bjk = λk vjk/√(Tj/N) for k = 1, ..., J.

These coordinates are called principal coordinates.
Overlay of PCA and CA of expression data of 773 genes in 73samples across 5 cell cycle phases in yeast. Three most informativecomponents and coordinates respectively are used.
Data from Spellman P, et al. Mol Biol Cell. 1998;9:3273-3297
Experimental design
Basic principles
I Replication
I Independence and pseudo-replication
I Controls
I Randomization
I Interspersion (blocking, stratification)
I Design types
I Power analysis
Replication
I Reduce the effect of uncontrolled variation (i.e., increase precision)
I Quantify uncertainty
I Increase the power of the significance test (power analysis)
Types of replication
Technical replicates: replicates that share the same sample; i.e. themeasurements are repeated
Biological replicates: replicate measurements from independentbiological samples
Pseudo replicates
I "Incorrect" replication when samples, not treatments, are replicated
I Replicates are not independent
I The Type I error (α) approaches 1 with an increasing number of samples per unit
Hurlbert SH, Ecological Monographs, 1984;54(2):187-211
Controls
I Any treatment against which one or more other treatments are compared
I It may be an "untreated" treatment, a "procedural" treatment, or simply a different treatment
I Controls must undergo an experimental procedure identical to that of the treated units (e.g. injection of a saline solution)
I This allows separation of the effects of different aspects of the experimental procedure
Randomization
I Random sampling from clearly defined populations
I Experimental subjects ("units") should be assigned to treatment groups at random (which does not mean haphazardly)
I One needs to explicitly randomize using a computer, dice, ...
I Avoids bias
I Ensures that statistical inferences are reliable

Interspersion
I Interspersion is necessary to avoid unbalanced effects of unforeseen events (e.g. weather or other defects) between treatment and control.
I Even with randomization, simple segregation can occur (with 3-fold replication the chance is 10%).
Hurlbert SH, Ecological Monographs, 1984;54(2):187-211
Common design types
I Factorial designs
I Completely randomized design
I Complete randomized block design
I Latin square design
I Cross-over designs
I Nested design
I Split-plot design
I Repeated measurements
Factorial design
One factorial experiment
The aim is to study the effect of one single factor (with several levels).

For example, the only interesting factor is drug treatment; all other factors (age, weight, sex, ...) are ignored (but should be kept constant).
Multi-factorial experimentThe design incorporates two or more factors that are crossed witheach other. The term crossed indicates that all combinations of thefactors are included and that every level (group) of each factor occursin combination with every level of the other factors.
Multi-factorial design allows the study of interaction between factors.
Analysis of a two factorial design with two-way ANOVA.
Randomized complete block design (RCBD)
I Treatments are assigned at random within blocks of adjacent subjects, each treatment once per block.
I The number of blocks is the number of replications.
I Any treatment can be adjacent to any other treatment, but not to the same treatment within the block.
I Used to control variation in an experiment by accounting for spatial effects.

Sample layout with 4 treatments (A-D) and 3 blocks (I-III):

Block I     A B C D
Block II    D A B C
Block III   B D C A
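Randomizing such a layout is a one-liner per block in R; treatments and block count are taken from the sample layout above:

    set.seed(42)
    trt <- c("A", "B", "C", "D")
    t(replicate(3, sample(trt)))   # one random treatment order per block (rows)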
Latin square design (LSD)
I Treatments are assigned at random within rows and columns, with each treatment once per row and once per column.
I There are equal numbers of rows, columns, and treatments.
I Useful where the experimenter desires to control variation in two different directions.

Sample layout with 4 treatments (A-D) assigned to 4 rows (I-IV) and 4 columns (1-4):

          1  2  3  4
Row I     A  B  C  D
Row II    C  D  A  B
Row III   D  C  B  A
Row IV    B  A  D  C
Crossover design
An experimental design that combines attributes of latin squares andrepeated measures designs is the crossover design, often used inexperiments that apply multiple treatments to individual organisms.
In its simplest form, the crossover design can be considered as a latinsquare where subjects are one blocking factor (e.g. rows) and timeperiods are a second blocking factor (e.g. columns) and treatmentsare applied to each combination of subject and period using one ofthe latin square randomizations:
            Period 1   Period 2
Subject 1   A          B
Subject 2   B          A
Problematic in this type of design are carryover effects.
Nested design
Multi-factorial experimental designs where a factor (B) is crossed with one factor (C) but nested within another (A).
A second factor (or set of factors) is then applied to whole blocks, withreplicate blocks for each level of this factor.
Split-plot design
Split-plot designs were originally used in agricultural experiments andrepresents a randomized complete block design, with one or morefactors applied to the experimental units within each block.
A second factor (or set of factors) is then applied to whole blocks, withreplicate blocks for each level of this factor.
Note: The units of replication are different for different factors.
Crawley MJ, Statistical Computing, Wiley, 2002:352ff
Keough MJ & Quinn GP, Ecological Applications, 1998;8(1):141-161
Repeated measure designs
Factor A: units of replication, termed "subjects"
Factor B (subjects): nested within A
Factor C: repeated recordings on each subject

A completely randomized design (2-factor design (2x8) with 10 replicates) would require 160 subjects.
Power analysis
There is a relation between the 4 parameters of a significance test: sample size n, significance level α (commonly 0.05), power 1 − β (commonly 80%), and effect size δ = ∆/σ (standardized difference of means).

1. Clearly define the null hypothesis and the alternative hypothesis
2. Identify the statistical model to be applied to the data, the desired power, and the significance level
3. Identify the assumptions of the statistical procedure; obtain a pilot estimate of the variation
4. Specify the effect size (e.g. from other studies of the same biological system)
5. Calculate the sample size
Power analysis
Cohen (1988) suggested values for small, medium, and large standardized differences (δ = 0.2, 0.5, 0.8).

A more useful approach may be to plot the detectable effect size versus the sample size, or the power versus the effect size.

If there are constraints on the size of the experiment or sampling program, an estimate of σ, chosen values for α and β, and the number of observations possible can be used to determine the minimum detectable effect size (MDES).
Experimental design for cDNA microarrays
Churchill GA, Nature Genetics, 2002;32(Suppl):490-495
Types of study design
1. Retrospective studies (of past events), including case-control studies
2. Prospective studies (of ongoing or future events)
3. Cohort studies or epidemiological designs (of ongoing or future events)
4. Clinical trials
Basic structure for different designs
Types of studies
Therapy study
Effectiveness of a drug, new surgery or alternative methods
Design: RCT

Diagnosis study
Validity and reliability of new diagnostic tests
Design: Cross-sectional

Screening study
Investigation of test results
Design: Cross-sectional

Prognosis study
Progress of an early diagnosed disease
Design: Cohort

Causal study
Association between dangerous substances and a disease
Design: Cohort, Case control
Hierarchy of medical studies
Clinical trials
Clinical studies form the class of all scientific approaches to evaluating medical disease prevention, diagnostic techniques, and treatments. Within this class, trials, often called clinical trials, form the subset of clinical studies that evaluate investigational drugs.
I Phase I trials focus on safety of a new investigational medicine.These are the first human trials after successful animal trials.
I Phase II trials are small trials to evaluate efficacy and focus moreon a safety profile.
I Phase III trials are well-controlled trials, the most rigorousdemonstration of a drug’s efficacy prior to federal regulatoryapproval.
I Phase IV trials are often conducted after a medicine is marketedto provide additional details about the medicine’s efficacy and amore complete safety profile.
Clinical trials
The goal in a phase I trial is to identify a maximum tolerated dose(MTD), a dose that has reasonable efficacy (i.e. is toxic enough to killcancer cells) but with tolerable toxicity (i.e. not toxic enough to kill thepatient).
Phase I trials are applied to patients for whom standard treatment has failed and who are at high risk of death in the short term.
In phase II trials the optimal dose (MTD) is applied to a small group ofpatients meeting predefined inclusion criteria (there are alsoexclusion criteria) and the response rate, the proportion orpercentage of patients who respond, is studied.
A second type of phase II trials consist of small comparative trialswhere we want to establish the efficacy of a new drug against acontrol or standard regimen.
Clinical trials
Phase III/IV are larger studies, and the standard is a randomized double-blind controlled trial (the "gold standard").

Controlled: The drug is tested against a control group receiving a placebo or the standard treatment. The size, shape, and procedure should be very similar, to control for psychological and emotional effects.

Randomized: Whether a patient gets the drug or the placebo is assigned randomly.

Stratified randomization: If there are expected confounding variables (e.g. age), patients are stratified and the treatment is randomly assigned within each stratum.

Minimization: A non-random treatment allocation for smaller trials. The allocation is based on the balance of several parameters, so that the (n + 1)-th treatment is assigned based on the sum of the numbers within the stratified variables (e.g. age ≤ 50 or age > 50).
Clinical trials
Double-blind: Blind to the patient and blind to the investigator(Triple-blind means that also regulatory officers/statisticians are”blinded”).
Selection of subjects: Based on inclusion/exclusion criteria
Alternative designs
I Crossover design
I Within group (paired) comparisons
I Sequential design
I Factorial design
I Adaptive design
I Zelen's design
Sample size
Sample size for Phase II trials and surveys:

n = z²_{1−α} p(1 − p) / d²  (response rate)

Sample size for other Phase II trials:

n = z²_{1−α} s² / d²  (continuous endpoint)

n = (z_{1−α} + z_{1−β})² / ( (1/2) ln((1 + r)/(1 − r)) )² + 3  (correlation endpoint)

Phase II designs for selection:

N = 4 z²_{1−α} s² / d²  (continuous endpoint)

N = 4 z²_{1−α} p(1 − p) / (p2 − p1)²  (binary endpoint)

Phase III trials:

N = 4 (z_{1−α} + z_{1−β})² σ² / d²  (comparison of 2 means)

N = 4 (z_{1−α} + z_{1−β})² p(1 − p) / (p2 − p1)²  (comparison of 2 proportions)
Number-needed-to-treat
Experimental event rate: EER = a/(a + b)
Control event rate: CER = c/(c + d)

Relative risk: RR = EER/CER
Relative risk reduction: RRR = (EER − CER)/CER
Absolute risk reduction: ARR = EER − CER
Number-needed-to-treat: NNT = 1/ARR = 1/(EER − CER)
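A minimal worked example in R, with illustrative event rates (here the "event" is improvement, so EER > CER):

    EER <- 0.60; CER <- 0.45
    ARR <- EER - CER        # absolute risk reduction: 0.15
    1 / ARR                 # NNT ~ 6.7, i.e. treat 7 patients per extra event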
Study protocol
I International Conference on Harmonisation of TechnicalRequirements for Registration of Pharmaceuticals for HumanUse (ICH) guidelines for Good Clinical Practise (GCP).
I Formal document outlining the proposed procedures (basicallycontain any information from patient selection criteria toresponsibilities)
I For protocol violations (e.g. patients didn’t take their treatments)the only safe way is to keep those in the analysis as intended(intention-to-treat).
Study safety
Sponsor
Informing the local site investigators of the true historical safety record of the drug; monitoring the results of the study (Data Monitoring Committee (DMC), also known as Data Safety Monitoring Board); collecting adverse event reports; writing the site-specific informed consent

Local site investigator
Conducting the study according to the study protocol; obtaining truly informed consent (risks, potential benefits)

Institutional review board (IRB) or Ethics Committee
Scrutinizes the study for both medical safety and protection of the patients
Regulatory agencies (FDA, EMEA)
Review all study data before allowing the drug to proceed to the next phase; audits for the local site investigator
Medical journals and sites
How to choose a statistical test
Motulsky H, Intuitive Biostatistics, 2nd Ed., Oxford Univ. Press, 2010:pp 387-389
Bayesians vs. Frequentists

Frequentist
The population value is seen as fixed (but unknown); confidence intervals and hypothesis tests are calculated. The entire information comes from the data.

Bayesian
The population mean follows a distribution (prior probability). Data can be used to modify the prior probability distribution, which gives the posterior probability distribution. Here a 95% credible interval (or Bayesian confidence interval) can be constructed, which is narrower than the confidence interval derived from the data alone. Difficulties can arise in deciding on the prior distribution (prior), and some Bayesian methods may lead to intractable computational problems.
Dos and Don’ts
I Don't carry out a significance test, get a large p-value, and then interpret this as meaning that there is no difference.
I A confidence interval for the mean difference would be much better than significance tests. A non-significant difference in 10 subjects cannot be interpreted.
I Quote your p-values correctly to one significant figure (e.g. p = 0.007; do not use p < 0.013, p < 0.01, p > 0.05, or p = NS).
I "Significant" should not be used if you mean "important".
I Don't do direct comparisons of p-values. It is not correct to compare two groups by testing the changes in each one separately. Significance does not depend only on the magnitude, but also on the variability and the sample size. A two sample t-test should be used to compare the log ratios in the two groups.
Dos and Don’ts (cont.)
I Always state whether you are using SD, SE or CI. Avoid ±.
I Give confidence intervals (or SEs) for group means, rather than for comparisons.
I Don't use three-dimensional effects.
I Don't analyze the data as if they were all from the same population, ignoring the fact that the 21 groups of subjects are from 9 different trials.
I Don't do a Chi-square test analysis of ordered categorical data.
Manuscript Writing Guidelines
1. Read the journal's instructions to authors. If they do not cover statistics, use those of one of the major general medical journals.
2. Never, ever, conclude that there is no difference or relationship because it is not significant.
3. Give confidence intervals where you can.
4. Give exact p-values where possible, not p < 0.05 or p = NS, though only one significant figure is necessary.
5. Be clear about what your main hypothesis and outcome variable are. Avoid multiple testing. (Note: this is not feasible nowadays; it should be changed to "Adjust p-values for multiple testing.")
Bland M, How to Upset the Statistical Referee. Talk presented to the LondonHypertension Society, 2004
Manuscript Writing Guidelines (cont.)
6. Get the design right; be clear about blinding and randomization; do a sample size calculation if you can.
7. Be clear about whether you are quoting standard deviations or standard errors; avoid the ± notation.
8. Avoid bar charts with error bars.
9. Check the assumptions of your statistical methods.
10. Give clear descriptions of your statistical methods.
Bland M, How to Upset the Statistical Referee. Talk presented to the LondonHypertension Society, 2004