Statistics

Achim Tresch, Gene Center, LMU Munich

Pope Benedict XVI, Andrey Kolmogorov

Two ways of dealing with uncertainty

Topics

I. Descriptive Statistics

II. Test theory

III. Common tests

IV. Bivariate Analysis

V. Regression

I. Description

• Tables
• Figures and graphical presentation
• Interpretation

„If you don't know, you have to believe“ Pan Tau

„I strongly believe that Iraq owns weapons of mass destruction“ George W. Bush

What is „data“?

Cases (Samples, Observations)

Endpoints (Variables)

Realizations (instances,values)

A collection of observations of a similar structure

Different Scales of a Variable

Categorical variables
Have only a finite number of instances: male/female; Mon/Tue/…/Sun

• Nominal data: Categorical variables without a given order
  E.g. eye color [brown, blue, green, grey]
  Special case: Binary (= dichotomous) variables (yes/no, 0/1, …)
• Ordinal data: Instances are ordered in a natural way
  E.g. tumor grade [I, II, III, IV], rank in a contest (1, 2, 3, …)

Continuous variables
Can take values in an interval of the real numbers
E.g. blood pressure [mmHg], costs [€]

85% shinier hair!

Problem: It is often difficult to map a variable to an appropriate scale, e.g. happiness, pain, satisfaction, social status, anger.
-> Check whether your choice of scale is meaningful!

Description of a categorical variable: Tables

Example: Blood antigens (ABO), n = 188 samples

Value                   A     B     AB    0     Total
(absolute) frequency    83    20    10    75    188
relative frequency      44%   11%   5%    40%   100%

• Always list absolute frequencies!
• Do not list relative frequencies in percent if the sample size is small (n < 20)
• Do not use decimal digits in percent numbers for …

„Side effects were observed in 14.2857% of all cases“
Nonsense, we conclude that n = 7 (1/7 ≈ 14.2857%)!
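As a minimal sketch of how such a table can be produced, the following Python snippet computes absolute and relative frequencies for the ABO counts given in the example. The function name `frequency_table` is chosen for illustration only; percentages are rounded to whole numbers, in line with the advice not to report decimal digits.

```python
# Absolute and relative frequencies for the ABO example (counts from the table).
counts = {"A": 83, "B": 20, "AB": 10, "0": 75}

def frequency_table(counts):
    """Return (value, absolute frequency, relative frequency in whole percent)."""
    n = sum(counts.values())
    return [(value, k, round(100 * k / n)) for value, k in counts.items()]

n = sum(counts.values())
for value, k, pct in frequency_table(counts):
    print(f"{value:>2}  {k:>3}  {pct:>3}%")
print(f"n = {n}")  # always report the sample size alongside the percentages
```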

Description of a categorical variable: Barplot

[Figure: bar plot of the frequencies of the blood groups A, B, AB, 0]

Description of continuous data: Histogram

[Figure: histograms of the same data with 50 bins, 12 bins and 4 bins; x-axis: value of the variable]

The size of the bins (= width of the bars) is a matter of choice and has to be determined sensibly!
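To illustrate how strongly the picture depends on the bin size, here is a small sketch that counts the same data in 50, 12 and 4 equal-width bins. The data are hypothetical (200 simulated draws from a standard normal distribution), and `histogram_counts` is an illustrative helper, not a function from the slides.

```python
import random

def histogram_counts(data, n_bins):
    """Count observations in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum is put into the last bin instead of opening a new one.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

random.seed(1)  # hypothetical data, fixed seed for reproducibility
data = [random.gauss(0, 1) for _ in range(200)]
for n_bins in (50, 12, 4):
    print(n_bins, histogram_counts(data, n_bins))
```

With 50 bins the counts are noisy, with 4 bins almost all structure is lost; a choice in between is usually more informative.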

Description of continuous data: Density plot

[Figure: density plots of the data; x-axis: value of the variable]

Caution: Data will be smoothed automatically. This is very suggestive and blurs discontinuities in a distribution.

The most important one: The Gaussian (normal) distribution

[Figure: Gaussian density curve, annotated with its expectation value and its standard deviation]

C. F. Gauss (1777-1855): Roughly speaking, continuous variables that are the (additive) result of a lot of other random variables follow a Gaussian distribution.
-> It is often sensible to assume a Gaussian distribution for continuous variables.

Measures of Location, Scale and Scatter

Mean: sum of all observations / number of samples

Ex.: observations: 2, 3, 7, 9, 14
sum: 2 + 3 + 7 + 9 + 14 = 35
number of observations: 5
Mean: 35 / 5 = 7

Median: A number M such that 50% of all observations are less than or equal to M, and 50% are greater than or equal to M. (Q: What if #observations is even?)
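The mean and median of the example can be checked with Python's standard `statistics` module. For an even number of observations, `statistics.median` averages the two middle values; this is one common convention, and any number between the two middle values would satisfy the definition given here.

```python
from statistics import mean, median

observations = [2, 3, 7, 9, 14]   # the example observations
print(mean(observations))          # 35 / 5
print(median(observations))        # middle value of the sorted data

# Even number of observations: the two middle values (3 and 7) are averaged.
even = [2, 3, 7, 9]
print(median(even))
```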

[Figure: rug plot of the observations; the median splits them into two halves of 50% each]

Mode: A value for which the density of the variable reaches a local maximum. If there is only one such value, the distribution is called unimodal, otherwise multimodal (special case: bimodal).

[Figure: density of a variable (y-axis: relative frequency) with the median marked]

Description of Location, Scale and Scatter

Distribution Shapes

Symmetric: Mean ≈ Median

Skewed to the right: Median < Mean

Skewed to the left: Mean < Median

The median should be preferred to the mean if
• the distribution is very asymmetric
• there are extreme outliers

The mean is more „precise“ than the median if the distribution is approximately normal.

Rule of thumb: If the skewness g of the distribution ranges between –1 and +1, the distribution is approximately symmetric.

Right skew: skewness g > 0

Left skew: skewness g < 0
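The slides do not define the skewness g. The sketch below assumes the usual moment coefficient of skewness, g = m3 / m2^(3/2), where m2 and m3 are the second and third central moments of the sample; the data sets are hypothetical.

```python
from statistics import mean, median

def skewness(data):
    """Moment coefficient of skewness m3 / m2**1.5 (an assumed definition of g)."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 2, 2, 3, 10]          # hypothetical data, long right tail
left_skewed = [-x for x in right_skewed]     # mirrored: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, d in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(d), 2), "mean:", round(mean(d), 2), "median:", median(d))
```

In the right-skewed example the single large value pulls the mean above the median, which is the situation in which the median is the more robust location measure.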

How would you describe this distribution?

„…it showed a giant boa swallowing an elephant. I painted the inside of the boa to make it visible to the adults. They always need explanations.“

Antoine de Saint-Exupéry, Le petit prince

Unexpected distributions have unexpected causes!

More Location measures

Quantile: A q-quantile Q (0≤q≤1) splits the data into a fraction of q points below or equal to Q and a fraction of 1-q points above or equal to Q.

[Figure: rug plot of the observations, split by the median into 50% / 50%]

Median = 50%-quantile

[Figure: rug plot of the observations, split by the quartiles into four parts of 25% each]

1st quartile = 25%-quantile
3rd quartile = 75%-quantile
0-quantile = minimum
1-quantile = maximum

The five-point Summary and the Boxplot

[Figure: rug plot of the observations with the corresponding boxplot]

Measures of Variation

Span: Maximum - Minimum

Interquartile range (IQR): 3rd quartile - 1st quartile
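A minimal sketch of the five-point summary, the span and the IQR, using `statistics.quantiles` from the Python standard library. Several conventions exist for computing quartiles from a finite sample; the `"inclusive"` method used here is an assumption, and other software may return slightly different quartiles for the same data.

```python
from statistics import quantiles

def five_point_summary(data):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

data = [2, 3, 7, 9, 14]              # the observations from the mean example
mn, q1, med, q3, mx = five_point_summary(data)
span = mx - mn                        # maximum - minimum
iqr = q3 - q1                         # 3rd quartile - 1st quartile
print(five_point_summary(data), span, iqr)
```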

How far do the observations scatter around their „center“ (= measure of location)?

[Figure: two distributions around the same location measure, one with small variation and one with large variation]

e.g.: location = median, variation = 3rd quartile – 1st quartile = interquartile range (IQR)

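e.g.: location = median $\tilde{x}$, variation = mean deviation (MD) from the median:

$\mathrm{MD} = \frac{1}{n} \sum_{j=1}^{n} |x_j - \tilde{x}|$

e.g.: location = median $\tilde{x}$, variation = median absolute deviation (MAD) from the median:

$\mathrm{MAD} = \operatorname{Median}\{\, |x_j - \tilde{x}|,\ j = 1, \dots, n \,\}$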
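Both deviation measures around the median can be sketched in a few lines of Python. The helper names are illustrative, and the data are the five observations from the mean example.

```python
from statistics import mean, median

def mean_deviation(data):
    """Mean of the absolute deviations |x_j - median|."""
    med = median(data)
    return mean([abs(x - med) for x in data])

def median_absolute_deviation(data):
    """Median of the absolute deviations |x_j - median|."""
    med = median(data)
    return median([abs(x - med) for x in data])

data = [2, 3, 7, 9, 14]   # median 7, absolute deviations 5, 4, 0, 2, 7
print(mean_deviation(data), median_absolute_deviation(data))
```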

e.g.: location = mean, variation = mean squared deviation = variance (v)
Or: variation = square root of the variance = standard deviation (s, std.dev)

Numbers for Gaussian variables:
Mean ± s contains ~68% of the data
Mean ± 2s contains ~95% of the data
Mean ± 3s contains ~99.7% of the data

[Figure: Gaussian density with x̄ – s, x̄ and x̄ + s marked]
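The three percentages follow from the Gaussian distribution function: the probability of falling within k standard deviations of the mean is erf(k / √2). The sketch below checks them, and also computes the variance as the mean squared deviation (dividing by n, i.e. `pvariance`; the sample variance, which divides by n – 1, would be `statistics.variance`).

```python
from math import erf, sqrt
from statistics import pstdev, pvariance

def gaussian_coverage(k):
    """P(|X - mean| <= k * s) for a Gaussian variable X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * gaussian_coverage(k), 1), "%")

data = [2, 3, 7, 9, 14]        # mean 7, squared deviations 25, 16, 0, 4, 49
v = pvariance(data)            # mean squared deviation = 94 / 5
s = pstdev(data)               # square root of the variance
print(v, round(s, 2))
```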

Histogram/Density Plot vs. Boxplot

Boxplot contains less information, but it is easier to interpret.

Multiple Boxplots

Sample: 2769 schoolchildren

Always report the sample size!

Summary

a) numerical
Median, Q1, Q3, Min., Max. (5-summary); for symmetric distributions alternatively: mean, standard deviation

b) graphical
Boxplots, histograms and/or density plots

c) verbal
e.g. „Blood pressure was reduced by 12 mmHg (interquartile range: 8 to 18 mmHg = 10 mmHg), whereas the reduction in the placebo group was only 3 mmHg (IQR: –2 to 4 mmHg = 6 mmHg).“

Two categorical variables: Cross Tables

Person   Medication   Response
A        Verum        yes
B        Placebo      no

Each case is one count in the table.

Cross table (rows: values of variable 1, potential causes; columns: values of variable 2, potential effects):

                        Response
                        yes   no
Medication   Verum      1     0
             Placebo    0     1

The most common question is: Are there differences between █ and

Cross Table: n = 80 cases

Absolute number (row percent, column percent)

                        Response
                        yes              no               Total
Medication   Verum      20 (50%, 67%)    20 (50%, 40%)    40
             Placebo    10 (25%, 33%)    30 (75%, 60%)    40
             Total      30 (37%)         50 (63%)         80 (100%)
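Row and column percentages are easy to confuse, so a short sketch may help: the row percent relates a cell to its row total (here: the medication group), the column percent relates it to its column total (here: the response group). The nested-dictionary layout and the function name `percentages` are illustrative choices; the counts are those of the table.

```python
table = {
    "Verum":   {"yes": 20, "no": 20},
    "Placebo": {"yes": 10, "no": 30},
}

def percentages(table):
    """Map (row, column) to (count, row percent, column percent)."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, k in cells.items():
            col_totals[c] = col_totals.get(c, 0) + k
    return {
        (r, c): (k, round(100 * k / row_totals[r]), round(100 * k / col_totals[c]))
        for r, cells in table.items()
        for c, k in cells.items()
    }

for cell, values in percentages(table).items():
    print(cell, values)
```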

What‘s bad about this table?

Cross tables: Independent vs. paired data

Independent data:

Person   Medication   Response
A        Verum        yes
B        Placebo      no

Paired data:

Person   Medic.: Verum   Medic.: Placebo
A        yes             yes
B        yes             no

Paired data: One object (or two closely related objects) serves for the measurement of two variables of the same kind.

Exercise: The influence of diet on body height is assessed in
1) a study with 100 randomly picked subjects.
2) a study with 50 identical twins that grew up separately.
Write down the cross tables. Which study is probably more informative?

Cross tables: Paired data

Person   Medic.: Verum   Medic.: Placebo
A        yes             yes
B        yes             no

                         Medic.: Placebo
                         yes   no
Medic.: Verum   yes      1     1
                no       0     0

Concordant observations: both measurements agree (cells yes/yes and no/no).
Discordant observations: the measurements differ (cells yes/no and no/yes).

A typical question is: Are the observations concordant or discordant? Is there a particularly large number in █ or █ ?

Induction from the sample to the population

Estimation, Regression: Measure in the sample -> Measure in the population? Variance? Confidence intervals?

Testing: Difference in the sample -> Difference in the population? Probability of a false call? Significance

What allows us to conclude from the sample to the population? The sample has to be representative (figures about drug abuse of students cannot be generalized to the whole population of Germany).

How is representativity achieved?
• Large sample numbers
• Random recruitment of samples from the population
  E.g.: Dial a random phone number. Choose a random name from the register of births. (Advantages/disadvantages?)

Randomization: Random allocation of the samples to the different experimental groups

Confidence intervals

95%-Confidence interval: An estimated interval which contains the „true value“ of a quantity with a probability of 95%.

[Figure: number line with a point estimate (e.g. % votes for the SPD in the EU elections) and an interval estimate from 20.5 to 29.5]

(1 – α)-Confidence interval: An estimated interval which contains the „true value“ of a quantity with a probability of (1 – α). 1 – α = confidence level, α = error probability

Use confidence intervals with caution!
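The slides do not say how the interval in the figure was obtained. As a sketch under stated assumptions, the code below uses the normal-approximation (Wald) interval for a proportion, with a hypothetical point estimate of 25% from a hypothetical poll of n = 356 respondents; these numbers were chosen only because they give an interval close to the one shown.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion (an assumed method)."""
    alpha = 1 - confidence                      # error probability
    z = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for 95%
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical poll: 25% of 356 respondents.
low, high = wald_interval(0.25, 356)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

A higher confidence level or a smaller sample widens the interval, which is one reason to read confidence intervals with caution.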

II. Testing

A non-sheep detector

Training: Measure the length of all sheep that cross your way. Determine the distribution of the quantity of interest.

[Figure: distribution of the sheep lengths; x-axis: length [cm], 70 to 140]

Test phase: For any unknown animal, test the hypothesis that it is a sheep. Measure its length and compare it to the learned length distribution of the sheep. If its length is „out of bounds“, the animal will be called a non-sheep (rejection of the hypothesis). Otherwise, we cannot say much (non-rejection).

[Figure: learned sheep length distribution with an out-of-bounds animal marked „Not a sheep“; x-axis: length [cm], 70 to 140]

Advantage of the method: One does not need to know much about sheep.

Disadvantage: It produces errors…

[Figure: two overlapping distributions on the length axis, separated by a decision boundary into negative calls and positive calls; the four areas are the true negatives, false negatives, false positives and true positives]
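The detector can be sketched in a few lines. Everything specific here is an assumption made for illustration: the training lengths are invented, and „out of bounds“ is taken to mean more than two standard deviations away from the mean of the training data.

```python
from statistics import mean, stdev

# Hypothetical training data: lengths [cm] of the sheep measured so far.
sheep_lengths = [92, 98, 101, 104, 105, 107, 110, 112, 115, 121]

def train(lengths, k=2.0):
    """Learn an acceptance interval mean ± k * sd (k = 2 is an assumption)."""
    m, s = mean(lengths), stdev(lengths)
    return m - k * s, m + k * s

def is_non_sheep(length, bounds):
    """Reject the hypothesis 'sheep' if the length is out of bounds."""
    low, high = bounds
    return not (low <= length <= high)

bounds = train(sheep_lengths)
for length in (60, 105, 180):
    verdict = "not a sheep" if is_non_sheep(length, bounds) else "cannot say much"
    print(length, verdict)
```

A very long or very short sheep would be called a non-sheep (false positive), and a sheep-sized dog would pass unnoticed (false negative).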

Statistical Hypothesis Testing

• State a null hypothesis H0 („nothing happens, there is no difference…“).

• Choose an appropriate test statistic (the data-derived quantity that finally leads to the decision). This implicitly determines the null distribution (the distribution of the test statistic under the null hypothesis).

[Figure: null distribution of the test statistic; x-axis: reduction in blood pressure [mmHg], -10 to 15]

• State an alternative hypothesis (e.g. „the test statistic is higher than expected under the null hypothesis“).

• Determine a decision boundary. This is equivalent to the choice of a significance level α, i.e. the fraction of false positive calls you are willing to accept.

[Figure: null distribution, split by the decision boundary into an acceptance region and a rejection region]

• Calculate the actual value of the test statistic in the sample, and make your decision according to the prespecified(!) decision boundary: keep H0 (no rejection) or reject H0 (assume the alternative hypothesis).
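The link between the significance level α and the decision boundary can be made concrete. The sketch assumes, purely for illustration, that the test statistic is normally distributed under H0 with mean 0 and standard deviation 5 mmHg, and that the alternative is one-sided (larger values speak against H0).

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic (hypothetical parameters).
null = NormalDist(mu=0, sigma=5)
alpha = 0.05                       # accepted fraction of false positive calls

# One-sided decision boundary: the (1 - alpha)-quantile of the null distribution.
d = null.inv_cdf(1 - alpha)

def decide(t, boundary=d):
    """Apply the prespecified decision boundary to an observed value t."""
    return "reject H0" if t > boundary else "keep H0"

print(round(d, 2))
print(decide(3), decide(12))
```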

Good test statistics, bad test statistics

                            Accept null hypothesis            Reject null hypothesis
Null hypothesis is true     right decision                    Type I error (False Positive)
Alternative is true         Type II error (False Negative)    right decision

[Figure: good statistic; the distribution of the test statistic under the null hypothesis and the distribution under the alternative hypothesis are well separated by the decision boundary d]

[Figure: bad statistic; the distribution of the test statistic under the null hypothesis and the distribution under the alternative hypothesis overlap strongly around the decision boundary d]

The Offenbach Oracle

Toni, 29, Offenbach, mechanic and moral philosopher

Throw the 20-sided die.
Score = 20: reject the null hypothesis
Score ≠ 20: keep the null hypothesis

This is (independent of the null hypothesis) a valid statistical test at a 5% type I error level!

[Figure: distribution of the score (1 to 20) under H0 and under H1]

The distribution of the test statistic under the null and the alternative hypothesis is identical. This test cannot discriminate between the two alternatives!

95% of the Positives will be missed: they are called negative just as often as the Negatives.
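The error rates of the oracle can be computed exactly by enumerating the 20 outcomes of a fair die. The variable names are illustrative; the point of the sketch is that the rejection probability is the same under H0 and H1 because the die ignores the data.

```python
from fractions import Fraction

SIDES = 20
outcomes = range(1, SIDES + 1)

def rejects(score):
    """The oracle rejects H0 only when the die shows 20."""
    return score == SIDES

# Same rejection probability whether H0 or H1 is true.
p_reject = Fraction(sum(rejects(s) for s in outcomes), SIDES)

type_1_error = p_reject          # H0 true, but rejected
power = p_reject                 # H1 true, and rejected
missed_positives = 1 - power     # H1 true, but H0 kept
print(type_1_error, power, missed_positives)
```

A useful test keeps the type I error at α while making the power as large as possible; the oracle achieves only the first.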

The p-value

Given a test statistic and its actual value t in a sample, a p-value can be calculated:

Each test value t maps to a p-value; the latter is the probability of observing a value of the test statistic which is at least as extreme as the actual value t [under the assumption of the null hypothesis].

[Figure: null distribution with two example values of t, one with p = 0.08 and one with p = 0.42]
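As a sketch of this mapping, assume (hypothetically) a normal null distribution with mean 0 and standard deviation 5, and take „at least as extreme“ to mean „at least as large“, i.e. a one-sided alternative. The p-value is then the upper tail area beyond t.

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic (hypothetical parameters).
null = NormalDist(mu=0, sigma=5)

def p_value_upper(t):
    """P(T >= t) under H0: the area of the null distribution to the right of t."""
    return 1 - null.cdf(t)

for t in (0, 1, 10):
    print(t, round(p_value_upper(t), 3))
```

The larger the observed t, the smaller the p-value; a value in the middle of the null distribution gives p = 0.5.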

Test decisions according to the p-value

Decision boundary d <-> significance level α
Observed test statistic t <-> p-value
t more extreme than d <=> p is smaller than α

p ≥ α: Keep H0 (no rejection)
p < α: Reject H0 (assume the alternative hypothesis)

[Figure: null distribution with α = 0.05 and two example values of t, one with p = 0.83 (keep H0) and one with p = 0.02 (reject H0)]

One- and two-sided hypotheses

One-sided alternative

H0: The value of a quantity of interest in group A is not higher than in group B.
H1: The value of a quantity of interest in group A is higher than in group B.

[Figure: null distribution with an acceptance region and a single rejection region in the upper tail]

Two-sided alternative

H0: The quantity of interest has the same value in group A and group B.
H1: The quantity of interest is different in group A and group B.

[Figure: null distribution with a central acceptance region and rejection regions in both tails]

Generally, two-sided alternatives are more conservative: Deviations in both directions are detected.

Example “Testing”: Colon Carcinoma

How about this fact?

Variable: Vaccine, scale: binary
Endpoint: 4-year survival, scale: binary

32 × 94% ≈ 30 survivors in the vaccine group
(62 – 32) × 77% ≈ 23 survivors in the non-vaccine group

Interesting questions:
Does the vaccine yield any effect?
Is this effect „significant“?

                         4-year survival
                         yes         no
Vaccine   yes (n=32)     30 (94%)    2 (6%)
          no (n=30)      23 (77%)    7 (23%)

Null hypothesis H0: Vaccination has no impact (neither positive nor negative) on the patients. The survival rates in the vaccine and non-vaccine group in the whole population are the same.

Alternative hypothesis H1: For the whole population, the survival rates in the vaccine and non-vaccine group are different.

Choose the significance level α (usually: α = 5%; 1%; 0.1%)

Interpretation of the significance level α: If there is no difference between the groups, one obtains a false positive result with a probability of α.

Choice of test statistic: „Fisher's Exact Test“

Sir Ronald Aylmer Fisher, 1890-1962: Theoretical Biology, Evolution Theory, Statistics

Calculate the value of the test statistic t after the experiment has been carried out. This value can be converted into a p-value:

p = 0.0766 ≈ 7.7%

Since we have chosen a significance level α = 5%, and p > α, we cannot reject the null hypothesis, thus we keep it.

Formulation of the result: At a 5% significance level (and using Fisher's Exact Test), no significant effect of vaccination on survival could be detected.

Consequence: We are not (yet) sufficiently convinced of the utility of this therapy. But this does not mean that there is no difference at all!
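The p-value of the example can be recomputed from the 2x2 table. The sketch below implements a two-sided Fisher's exact test under a common convention: with all margins fixed, it sums the hypergeometric probabilities of every table that is no more likely than the observed one. Whether the slides used exactly this convention is an assumption, although the result agrees with the reported value.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(x):
        # Number of ways to obtain x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Integer weights avoid floating-point ties when comparing tables.
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / comb(n, col1)

# Vaccine: 30 survivors, 2 deaths; no vaccine: 23 survivors, 7 deaths.
p = fisher_exact_two_sided(30, 2, 23, 7)
alpha = 0.05
print(round(p, 4), "reject H0" if p < alpha else "keep H0")
```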

“No test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis.“ Neyman J, Pearson E (1933) Phil Trans R Soc A

Jerzy Neyman (1894-1981), Egon Pearson (1895-1980)

Non-significance ≠ equivalence
Statistics can never prove a hypothesis, it can only provide evidence.

End of Part I
