Statistics
Achim Tresch, Gene Center, LMU Munich
Pope Benedikt XVI Andrej Kolmogoroff
Two ways of dealing with uncertainty
Topics
I. Descriptive Statistics
II. Test theory
III. Common tests
IV. Bivariate Analysis
V. Regression
I. Description
• Tables
• Figures and graphical presentation
• Interpretation
„If you don't know, you have to believe“ (Pan Tau)
„I strongly believe the Iraq owns weapons of mass destruction“ (George W. Bush)
What is „data“?
Cases (Samples, Observations)
Endpoints (Variables)
Realizations (instances,values)
The sample / the sample population ⊆ population
A collection of observations of a similar structure
Different Scales of a Variable
Categorical variables: Have only a finite number of instances, e.g. male/female; Mon/Tue/…/Sun
  Nominal data: Categorical variables without a given order, e.g. eye color [brown, blue, green, grey]
  Special case: Binary (= dichotomous) variables (yes/no, 0/1, …)
  Ordinal data: Instances are ordered in a natural way, e.g. tumor grade [I, II, III, IV], rank in a contest (1, 2, 3, …)
Continuous variables: Can take values in an interval of the real numbers, e.g. blood pressure [mmHg], costs [€]
85% shinier hair!
Problem: It is often difficult to map a variable to an appropriate scale, e.g. happiness, pain, satisfaction, social status, anger.
-> Check whether your choice of scale is meaningful!
Description of a categorical variable: Tables
Example: Blood antigens (ABO), n = 188 samples

Value                   A     B     AB    0     Total
Absolute frequency      83    20    10    75    188
Relative frequency      44%   11%   5%    40%   100%

Always list absolute frequencies!
• Do not list relative frequencies in percent if the sample size is small (n < 20)
• Do not use decimal digits in percent numbers for n < 300
„Side effects were observed in 14.2857% of all cases“: Nonsense, we conclude that n = 7!
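A minimal Python sketch of how such a frequency table could be computed. The raw data list is an assumption: it is constructed only to reproduce the counts of the blood antigen example.

```python
from collections import Counter

# Hypothetical raw data reproducing the counts of the blood antigen example.
blood_groups = ["A"] * 83 + ["B"] * 20 + ["AB"] * 10 + ["0"] * 75

counts = Counter(blood_groups)   # absolute frequencies
n = sum(counts.values())         # sample size

for value in ["A", "B", "AB", "0"]:
    share = 100 * counts[value] / n
    # No decimal digits in the percentages, because n < 300.
    print(f"{value:>2}: {counts[value]:3d}  {share:.0f}%")
print(f" n = {n}")
```

The absolute frequencies and n are printed next to every percentage, so the reader can always judge how much a percent value is worth.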
Description of a categorical variable: Barplot
[Figure: bar plot of the relative frequencies (%, 0 to 45) of the values A, B, AB, 0]
Description of continuous data: Histogram
[Figure: histogram; x-axis: value of the variable (-3 to 3), y-axis: number of cases]
[Figure: three histograms with 50 bins, 12 bins and 4 bins; x-axis: value of the variable, y-axis: number of cases]
The size of the bins (= width of the bars) is a matter of choice and has to be determined sensibly!
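The effect of the bin choice can be tried out with a small Python sketch. The sample is simulated (an assumption, not the data behind the slides), and `histogram` is a hypothetical helper that uses equal-width bins.

```python
import random

def histogram(data, n_bins):
    """Count how many observations fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum is put into the last bin instead of a bin of its own.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

random.seed(1)  # fixed seed so that the sketch is reproducible
data = [random.gauss(0, 1) for _ in range(500)]  # hypothetical sample

for n_bins in (50, 12, 4):
    counts = histogram(data, n_bins)
    print(n_bins, "bins, largest count:", max(counts))
```

Many narrow bins give a noisy picture, very few wide bins hide the shape of the distribution.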
Description of continuous data: Density plot
[Figure: three density plots; x-axis: value of the variable (-3 to 3), y-axis: relative frequency (0 to 0.4)]
Caution: Data will be smoothed automatically. This is very suggestive and blurs discontinuities in a distribution.
The most important one: The Gaussian (normal) distribution
[Figure: Gaussian density curve with expectation value and standard deviation marked]
C. F. Gauss (1777-1855): Roughly speaking, continuous variables that are the (additive) result of a lot of other random variables follow a Gaussian distribution.
-> It is often sensible to assume a Gaussian distribution for continuous variables.
Measures of Location, Scale and Scatter
Mean: sum of all observations / number of samples
Ex.: observations: 2, 3, 7, 9, 14
sum: 2+3+7+9+14 = 35
# observations: 5
Mean: 35/5 = 7
Median: A number M such that 50% of all observations are less than or equal to M, and 50% are greater than or equal to M. (Q: What if #observations is even?)
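The example can be checked with Python's standard `statistics` module. The second list with an even number of observations is an assumption added for illustration: in that case every number between the two middle values satisfies the definition, and the usual convention is to report their midpoint.

```python
from statistics import mean, median

observations = [2, 3, 7, 9, 14]
print(mean(observations))    # 35 / 5 = 7
print(median(observations))  # middle value of the sorted data: 7

# Even number of observations: every M between the two middle values
# satisfies the definition; statistics.median returns their midpoint.
even = [2, 3, 7, 9, 14, 20]
print(median(even))          # (7 + 9) / 2 = 8.0
```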
[Figure: strip plot of the data d; the median splits the observations into 50% on each side]
Mode: A value for which the density of the variable reaches a local maximum. If there is only one such value, the distribution is called unimodal, otherwise multimodal (special case: bimodal).
[Figure: histogram illustrating the mode]
Description of Location, Scale and Scatter
[Figure: distribution with mean and median marked]
Distribution Shapes
Symmetric: Mean ≈ Median
Skewed to the right: Median << Mean
Skewed to the left: Mean << Median
The median should be preferred to the mean if
• the distribution is very asymmetric
• there are extreme outliers
The mean is more „precise“ than the median if the distribution is approximately normal.
Rule of thumb: The skewness g of the distribution ranges between –1 and +1, i.e. the distribution is approx. symmetric.
Right skew: skewness g > 0
Left skew: skewness g < 0
[Figure: strip plot of a skewed sample d]
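The slides do not say how the skewness g is computed. The Python sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the 1.5th power of the second central moment). Both data sets are hypothetical.

```python
from statistics import fmean, median

def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed definition of g)."""
    n = len(data)
    center = fmean(data)
    m2 = sum((x - center) ** 2 for x in data) / n
    m3 = sum((x - center) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 2, 2, 3, 4, 20]   # hypothetical data with one extreme outlier
symmetric = [1, 2, 3, 4, 5]             # hypothetical symmetric data

print(skewness(right_skewed), fmean(right_skewed), median(right_skewed))
print(skewness(symmetric))
```

For the right-skewed sample the outlier pulls the mean far above the median, which is the situation in which the median should be preferred.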
How would you describe this distribution?
„…it showed a giant boa swallowing an elephant. I painted the inside of the boa to make it visible to the adults. They always need explanations.“
Antoine de Saint-Exupéry, Le petit prince
Unexpected distributions have unexpected causes!
More Location measures
Quantile: A q-quantile Q (0≤q≤1) splits the data into a fraction of q points below or equal to Q and a fraction of 1-q points above or equal to Q.
[Figure: strip plot of the data d, split into 50% / 50%]
Median = 50%-quantile
[Figure: strip plot of the data d, split into four parts of 25% each]
1st quartile = 25%-quantile
3rd quartile = 75%-quantile
0-quantile = minimum
1-quantile = maximum
The five-point Summary and the Boxplot
[Figure: boxplot above a strip plot of the data d]
Measures of Variation
Span: Maximum - Minimum
Interquartile range (IQR): 3rd quartile - 1st quartile
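A short Python sketch of the five-point summary, the span and the IQR. The observations are hypothetical, and the quartile convention (`method="inclusive"` of `statistics.quantiles`) is an assumption; other conventions give slightly different quartiles for small samples.

```python
from statistics import quantiles

data = [2, 3, 7, 9, 14, 15, 21, 22, 30]  # hypothetical observations

# "inclusive" is one of several quantile conventions; it interpolates
# between the sorted observations, with minimum and maximum as end points.
q1, med, q3 = quantiles(data, n=4, method="inclusive")

five_point_summary = (min(data), q1, med, q3, max(data))
span = max(data) - min(data)
iqr = q3 - q1
print(five_point_summary, span, iqr)
```

A boxplot is a drawing of exactly these five numbers.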
How far do the observations scatter around their „center“ (= measure of location)?
[Figure: two distributions around a location measure, one with large variation and one with small variation]
E.g.: location = median, variation = 3rd quartile – 1st quartile = interquartile range (IQR)
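[Figure: an observation $x_j$, the median $\tilde{x}$ and the distance $|x_j - \tilde{x}|$]
E.g.: location = median $\tilde{x}$, variation = mean deviation (MD) from $\tilde{x}$:
$$\mathrm{MD} = \frac{1}{n}\sum_{j=1}^{n} |x_j - \tilde{x}|$$
E.g.: location = median $\tilde{x}$, variation = median absolute deviation (MAD) from $\tilde{x}$:
$$\mathrm{MAD} = \mathrm{Median}\{\, |x_j - \tilde{x}| ,\ j = 1,\dots,n \,\}$$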
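Both deviation measures around the median can be sketched in a few lines of Python. The function names are hypothetical; the observations are the five numbers from the mean example.

```python
from statistics import fmean, median

def mean_deviation(data):
    """Mean absolute deviation (MD) of the observations from their median."""
    center = median(data)
    return fmean(abs(x - center) for x in data)

def median_absolute_deviation(data):
    """Median of the absolute deviations (MAD) from the median."""
    center = median(data)
    return median(abs(x - center) for x in data)

observations = [2, 3, 7, 9, 14]
print(mean_deviation(observations))             # (5 + 4 + 0 + 2 + 7) / 5 = 3.6
print(median_absolute_deviation(observations))  # median of 0, 2, 4, 5, 7 = 4
```

The MAD ignores how far away the most extreme observations are, so it is less sensitive to outliers than the MD.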
E.g.: location = mean $\bar{x}$, variation = mean squared deviation from $\bar{x}$ = variance (v):
$$v = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$$
Or: variation = square root of the variance = standard deviation (s, std.dev)
Numbers for Gaussian variables:
Mean ± s contains ~68% of the data
Mean ± 2s contains ~95% of the data
Mean ± 3s contains ~99.7% of the data
[Figure: Gaussian density with x̄ - s, x̄ and x̄ + s marked]
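A Python sketch of variance and standard deviation, together with a check of the numbers for Gaussian variables. It uses the version that divides by n, as in the formula on the slide; note that many software packages divide by n - 1 by default. The observations are the five numbers from the mean example.

```python
from statistics import NormalDist, pstdev, pvariance

observations = [2, 3, 7, 9, 14]
v = pvariance(observations)  # mean squared deviation from the mean (divides by n)
s = pstdev(observations)     # square root of the variance
print(v, s)

# Share of a Gaussian variable within mean ± k standard deviations.
standard_normal = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"mean ± {k}s: {100 * share:.1f}%")
```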
Histogram/Density Plot vs. Boxplot
Boxplot contains less information, but it is easier to interpret.
[Figure: panels 1 to 4 comparing histogram/density plots with boxplots]
Multiple Boxplots
Sample: 2769 schoolchildren
Summary
Always report the sample size!
a) numerical: Median, Q1, Q3, Min., Max. (5-summary); for symmetric distributions alternatively: mean, standard deviation
b) graphical: Boxplots, histograms and/or density plots
c) verbal: e.g. „Blood pressure was reduced by 12 mmHg (interquartile range: 8 to 18 mmHg = 10 mmHg), whereas the reduction in the placebo group was only 3 mmHg (IQR: –2 to 4 mmHg = 6 mmHg).“
Two categorical variables: Cross tables
Data:
Person   Medication   Response
A        Verum        yes
B        Placebo      no
Cross table
Rows: values of variable 1 (potential causes); columns: values of variable 2 (potential effects)
Each case is one count in the table

Medication   Response: yes   Response: no
Verum        1               0
Placebo      0               1
The most common question is: Are there differences between █ and █ ?
Cross table: n = 80 cases
Each cell: absolute number (row percent, column percent)

Medication   Response: yes    Response: no     Total
Verum        20 (50%, 67%)    20 (50%, 40%)    40 (50%)
Placebo      10 (25%, 33%)    30 (75%, 60%)    40 (50%)
Total        30 (37%)         50 (63%)         80 (100%)
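Row and column percentages of such a cross table can be computed as in the following Python sketch. The list of cases is an assumption: it is constructed only to reproduce the n = 80 cases of the table.

```python
from collections import Counter

# Hypothetical raw data reproducing the n = 80 cases of the cross table.
cases = (
    [("Verum", "yes")] * 20 + [("Verum", "no")] * 20
    + [("Placebo", "yes")] * 10 + [("Placebo", "no")] * 30
)

table = Counter(cases)  # each case is one count in the table
row_totals = Counter(medication for medication, _ in cases)
col_totals = Counter(response for _, response in cases)

for medication in ("Verum", "Placebo"):
    for response in ("yes", "no"):
        count = table[(medication, response)]
        row_pct = 100 * count / row_totals[medication]
        col_pct = 100 * count / col_totals[response]
        print(f"{medication:8} {response:3} {count:3d}  row {row_pct:.0f}%  column {col_pct:.0f}%")
```

The row percentages answer "what share of the Verum patients responded?", the column percentages answer "what share of the responders received Verum?". They should not be confused.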
What's bad about this table?
Cross tables: Independent vs. paired data
Independent data:
Person   Medication   Response
A        Verum        yes
B        Placebo      no
Paired data:
Person   Medic.: Verum   Medic.: Placebo
A        yes             yes
B        yes             no
Paired data: One object (or two closely related objects) serves for the measurement of two variables of the same kind.
Exercise: The influence of diet on body height is assessed in
1) a study with 100 randomly picked subjects,
2) a study with 50 identical twins that grew up separately.
Write down the cross tables. Which study is probably more informative?
Cross tables: Paired data
Person   Medic.: Verum   Medic.: Placebo
A        yes             yes
B        yes             no

Cross table (rows: values of variable 1; columns: values of variable 2):
                     Medic.: Placebo: yes   Medic.: Placebo: no
Medic.: Verum: yes   1                      1
Medic.: Verum: no    0                      0
A typical question is: Are the observations concordant or discordant? Is there a particularly large number in █ or █ ?
Concordant observations: both responses of a pair agree (yes/yes or no/no).
Discordant observations: the two responses of a pair differ (yes/no or no/yes).
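Counting concordant and discordant pairs can be sketched in Python as follows. The data are the two persons A and B from the paired example; the variable names are hypothetical.

```python
from collections import Counter

# Paired data: (response under Verum, response under Placebo) for each person.
pairs = {"A": ("yes", "yes"), "B": ("yes", "no")}

table = Counter(pairs.values())   # each person is one count in the table
concordant = table[("yes", "yes")] + table[("no", "no")]
discordant = table[("yes", "no")] + table[("no", "yes")]
print(concordant, discordant)
```

In a paired cross table each person appears exactly once, so the table total equals the number of persons, not the number of measurements.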
Induction from the sample to the population
Estimation, Regression: Measure in the sample -> Measure in the population? Variance? Confidence intervals?
Testing: Difference in the sample -> Difference in the population? Probability of a false call? Significance
What allows us to draw conclusions from the sample about the population?
The sample has to be representative (figures about drug abuse of students cannot be generalized to the whole population of Germany).
How is representativeness achieved?
• Large sample sizes
• Random recruitment of samples from the population, e.g.: Dial a random phone number. Choose a random name from the register of births (advantages/disadvantages?)
Randomization: Random allocation of the samples to the different experimental groups
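Randomization can be sketched in a few lines of Python. The function `randomize`, the group names and the subject labels are assumptions made for illustration.

```python
import random

def randomize(subjects, groups=("Verum", "Placebo"), seed=None):
    """Randomly allocate subjects to groups of (almost) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    allocation = {group: [] for group in groups}
    # Deal the shuffled subjects out to the groups in turn.
    for index, subject in enumerate(shuffled):
        allocation[groups[index % len(groups)]].append(subject)
    return allocation

subjects = [f"person_{i}" for i in range(1, 11)]   # hypothetical subject labels
allocation = randomize(subjects, seed=42)
print(allocation)
```

Because chance alone decides the group, neither the experimenter nor the subject can (consciously or not) influence the allocation.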
Confidence intervals
Point estimate (e.g. % votes for the SPD in the EU elections): 24.3
Interval estimate: (20.5, 29.5)
95% confidence interval: An estimated interval which contains the „true value“ of a quantity with a probability of 95%.
(1 – α) confidence interval: An estimated interval which contains the „true value“ of a quantity with a probability of (1 – α). 1 – α = confidence level, α = error probability
Use confidence intervals with caution!
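The slides do not state how the interval estimate was obtained. As an illustration only, the Python sketch below computes a simple normal-approximation (Wald) confidence interval for a proportion. The poll size of 1000 respondents is a hypothetical number, so the resulting interval is not the one shown above.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n                      # point estimate
    z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical poll: 243 of 1000 respondents vote for the party (24.3%).
low, high = proportion_confidence_interval(243, 1000)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

The interval gets narrower with a larger sample and wider with a higher confidence level.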
II. Testing
A non-sheep detector
Training: Measure the length of all sheep that cross your way. Determine the distribution of the quantity of interest.
[Figure: distribution of the sheep length; x-axis: length [cm], 70 to 140]
Test phase: For any unknown animal, test the hypothesis that it is a sheep. Measure its length and compare it to the learned length distribution of the sheep. If its length is „out of bounds“, the animal will be called a non-sheep (rejection of the hypothesis). Otherwise, we cannot say much (non-rejection).
[Figure: learned length distribution; x-axis: length [cm], 70 to 140; an animal far outside the distribution is labelled „Not a sheep“]
Advantage of the method: One does not need to know much about sheep.
Disadvantage: It produces errors…
[Figure: length distributions (x-axis: length [cm], 70 to 140) with a decision boundary that separates negative calls from positive calls; the areas are labelled true negatives, false negatives, false positives and true positives]
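The non-sheep detector can be sketched in Python. The sheep lengths are hypothetical, and the sketch assumes that sheep length follows a Gaussian distribution and that the central 95% of the learned distribution counts as „in bounds“.

```python
from statistics import NormalDist

# Training: hypothetical sheep lengths in cm.
sheep_lengths = [92, 98, 101, 104, 105, 107, 110, 112, 115, 119]
learned = NormalDist.from_samples(sheep_lengths)  # assumes a Gaussian length distribution

# Decision boundaries: the central 95% of the learned distribution is "in bounds".
alpha = 0.05
lower = learned.inv_cdf(alpha / 2)
upper = learned.inv_cdf(1 - alpha / 2)

def is_non_sheep(length_cm):
    """Reject the hypothesis 'this animal is a sheep' if the length is out of bounds."""
    return length_cm < lower or length_cm > upper

print(is_non_sheep(160))  # far too long: called a non-sheep
print(is_non_sheep(105))  # typical sheep length: no rejection
```

By construction, about 5% of all real sheep fall outside the bounds and are wrongly called non-sheep (false positives), and every non-sheep of sheep-like length goes undetected (false negative).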
Statistical Hypothesis Testing
State a null hypothesis H0 („nothing happens, there is no difference…“).
Choose an appropriate test statistic (the data-derived quantity that finally leads to the decision). This implicitly determines the null distribution (the distribution of the test statistic under the null hypothesis).
[Figure: null distribution; x-axis: blood pressure reduction [mmHg], -10 to 15]
State an alternative hypothesis (e.g. „the test statistic is higher than expected under the null hypothesis“).
Determine a decision boundary d. This is equivalent to the choice of a significance level α, i.e. the fraction of false positive calls you are willing to accept.
[Figure: null distribution of the blood pressure reduction [mmHg]; the decision boundary d separates the acceptance region from the rejection region, whose area is α]
Calculate the actual value of the test statistic in the sample, and make your decision according to the prespecified(!) decision boundary.
[Figure: null distribution of the blood pressure reduction [mmHg] with decision boundary d and rejection area α; below d: keep H0 (no rejection); beyond d: reject H0 (assume the alternative hypothesis)]
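The testing procedure can be sketched in Python. The null distribution used here (Gaussian, mean 0, standard deviation 3 mmHg) is purely hypothetical, as are the two observed values of the test statistic.

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic (blood pressure reduction in mmHg):
# Gaussian with mean 0 and standard deviation 3. Both numbers are hypothetical.
null_distribution = NormalDist(mu=0, sigma=3)

alpha = 0.05                              # accepted fraction of false positive calls
d = null_distribution.inv_cdf(1 - alpha)  # decision boundary, fixed before looking at the data

def decide(t):
    """One-sided test: reject H0 if the observed test statistic exceeds d."""
    return "reject H0" if t > d else "keep H0"

print(round(d, 2))
print(decide(4.2), "|", decide(7.5))
```

The boundary d depends only on the null distribution and on α, not on the observed data. This is what „prespecified“ means.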
Good test statistics, bad test statistics
[Figure: good statistic; distribution of the test statistic under the null hypothesis and under the alternative hypothesis, with decision boundary d]

                          Accept null hypothesis           Reject null hypothesis
Null hypothesis is true   right decision                   Type I error (False Positive)
Alternative is true       Type II error (False Negative)   right decision
[Figure: bad statistic; distribution of the test statistic under the null hypothesis and under the alternative hypothesis, with decision boundary d]
The Offenbach Oracle
Toni, 29, Offenbach, mechanic and moral philosopher
Throw the 20-sided die
Score = 20: reject the null hypothesis
Score ≠ 20: keep the null hypothesis
This is (independent of the null hypothesis) a valid statistical test at a 5% type I error level!
But:
The distribution of the test statistic under the null and the alternative hypothesis is identical. This test cannot discriminate between the two alternatives!
[Figure: distribution of the score (1 to 20) under H0 and under H1]
95% of the Positives (as well as the Negatives) will be missed.
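A small Python sketch makes the problem of the oracle explicit: the probability of a rejection is the same under both hypotheses. The variable names are hypothetical, and a fair die is assumed.

```python
from fractions import Fraction

scores = range(1, 21)   # the 20 equally likely faces of the die

def rejection_probability():
    """The die ignores the data, so this holds under H0 and under H1 alike."""
    rejecting = [score for score in scores if score == 20]
    return Fraction(len(rejecting), len(scores))

type_1_error = rejection_probability()   # P(reject | H0 is true)
power = rejection_probability()          # P(reject | H1 is true)
print(type_1_error, power, 1 - power)    # 1/20 1/20 19/20
```

A type I error level of 5% alone does not make a test useful; it also has to detect true effects with a reasonable probability.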
The p-value
Given a test statistic and its actual value t in a sample, a p-value can be calculated:
Each test value t maps to a p-value, the latter is the probability of observing a value of the test statistic which is at least as extreme as the actual value t [under the assumption of the null hypothesis].
[Figure: null distribution of the blood pressure reduction [mmHg]; t = 4.2 corresponds to p = 0.08]
[Figure: null distribution of the blood pressure reduction [mmHg]; t = 0.7 corresponds to p = 0.42]
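The mapping from t to a p-value can be sketched in Python. The null distribution behind the figures is not given on the slides; the sketch assumes a Gaussian with mean 0 and standard deviation 3 mmHg and a one-sided alternative. With this assumption t = 4.2 gives p ≈ 0.08 as in the first figure, while the value for t = 0.7 comes out close to, but not exactly at, the 0.42 of the second figure.

```python
from statistics import NormalDist

# Hypothetical null distribution: Gaussian, mean 0, standard deviation 3 mmHg.
null_distribution = NormalDist(mu=0, sigma=3)

def p_value(t):
    """One-sided p-value: probability under H0 of a value at least as large as t."""
    return 1 - null_distribution.cdf(t)

alpha = 0.05
for t in (4.2, 0.7):
    p = p_value(t)
    decision = "reject H0" if p < alpha else "keep H0"
    print(f"t = {t}: p = {p:.2f}, {decision}")
```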
Test decisions according to the p-value
Decision boundary d <-> significance level α
Observed test statistic t <-> p-value
t more extreme than d <=> p is smaller than α
p ≥ α: Keep H0 (no rejection)
p < α: Reject H0 (assume the alternative hypothesis)
[Figure: null distribution of the blood pressure reduction [mmHg] with α = 0.05 and decision boundary d; two example values of t with p = 0.02 and p = 0.83]
One- and two-sided hypotheses
One-sided alternative
H0: The value of a quantity of interest in group A is not higher than in group B.
H1: The value of a quantity of interest in group A is higher than in group B.
[Figure: null distribution of the blood pressure reduction [mmHg] with an acceptance region and a rejection region on one side]
Two-sided alternative
H0: The quantity of interest has the same value in group A and group B.
H1: The quantity of interest is different in group A and group B.
[Figure: null distribution of the blood pressure reduction [mmHg] with a rejection region on each side of the acceptance region]
Generally, two-sided alternatives are more conservative: Deviations in both directions are detected.
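The difference between the two kinds of alternatives can be sketched in Python. The null distribution (Gaussian, mean 0, standard deviation 3) and the observed value t = 5.5 are hypothetical.

```python
from statistics import NormalDist

null_distribution = NormalDist(mu=0, sigma=3)   # hypothetical null distribution

def p_one_sided(t):
    """H1: the value in group A is higher than in group B."""
    return 1 - null_distribution.cdf(t)

def p_two_sided(t):
    """H1: the values differ; deviations in both directions count as extreme."""
    distance = abs(t - null_distribution.mean)
    return 2 * (1 - null_distribution.cdf(null_distribution.mean + distance))

t = 5.5
print(round(p_one_sided(t), 3), round(p_two_sided(t), 3))
```

For a symmetric null distribution the two-sided p-value is twice the one-sided one, so the same observation may be significant at the 5% level in the one-sided test only.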
Example “Testing”: Colon Carcinoma
How about this fact?
Variable: Vaccine, scale: binary
Endpoint: 4-year survival, scale: binary
32 × 94% ≈ 30
(62 - 32) × 77% ≈ 23
Interesting questions:
Does the vaccine have any effect?
Is this effect „significant“?

                        4-year survival
Vaccine        yes         no
yes (n=32)     30 (94%)    2 (6%)
no (n=30)      23 (77%)    7 (23%)
Null hypothesis H0: Vaccination has no impact (neither positive nor negative) on the patients. The survival rates in the vaccine and non-vaccine group in the whole population are the same.
Alternative hypothesis H1: For the whole population, the survival rates in the vaccine and non-vaccine group are different.
Choose the significance level α (usually: α = 5%; 1%; 0.1%)
Interpretation of the significance level α: If there is no difference between the groups, one obtains a false positive result with a probability of α.
Choice of test statistic: „Fisher's Exact Test“
Sir Ronald Aylmer Fisher, 1890-1962: Theoretical Biology, Evolution Theory, Statistics
Value of the test statistic t after the experiment has been carried out. This value can be converted into a p-value:
p = 0.0766 ≈ 7.7%
Since we have chosen a significance level α = 5%, and p > α, we cannot reject the null hypothesis, thus we keep it.
Formulation of the result: At a 5% significance level (and using Fisher's Exact Test), no significant effect of vaccination on survival could be detected.
Consequence: We are not (yet) sufficiently convinced of the utility of this therapy. But this does not mean that there is no difference at all!
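The p-value of the example can be recomputed with a short Python sketch of a two-sided Fisher's exact test. The function name is hypothetical, and the two-sided rule used here (sum over all tables that are at most as probable as the observed one) is one common convention, assumed because the slides do not spell it out. With the counts of the vaccine table it gives a value of about 0.077, in line with the p = 0.0766 reported above.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def probability(x):
        # Hypergeometric probability of x in the top-left cell, all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = probability(a)
    smallest, largest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    return sum(
        probability(x)
        for x in range(smallest, largest + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )

# Vaccine yes: 30 survivors, 2 deaths; vaccine no: 23 survivors, 7 deaths.
p = fisher_exact_two_sided(30, 2, 23, 7)
print(round(p, 4))
```

The test is called exact because the p-value is obtained by enumerating all possible tables with the same margins, without any approximation by a continuous distribution.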
“No test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis.“ Neyman J, Pearson E (1933) Phil Trans R Soc A
Jerzy Neyman (1894-1981), Egon Pearson (1895-1980)
Non-significance ≠ equivalence
Statistics can never prove a hypothesis, it can only provide evidence
End of Part I