To P or not to P?: Does a statistical test hold what it promises?
Probably more than 70% of all medical and biological scientific studies are irreproducible!
"There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims." Ioannidis (2005, PLoS Medicine)
We compare the effects of two drugs to control blood pressure in two groups of patients.
We test the effect of a drug to control blood pressure against a null control group.
We test for a significant correlation
$$P(t) = P\left(t = \frac{\text{effect size}}{\text{standard error}}\right) < 0.001$$
We compare two observations.
We use the t-test to calculate a probability of difference.
We compare an observation against a null expectation
We use the t-test to assess the validity of a hypothesis.
We compare an observed statistic against an unobserved null assumption.
Formally, we test H1: r² = 0.57 against the alternative H0: r² = 0.
This is not the same as to test the hypothesis that X and Y are correlated against the hypothesis that X and Y are not correlated.
X and Y might not be correlated but still have r² ≠ 0. This occurs if X and Y are jointly constrained by marginal settings.
An intuitive significance test
We test an observation against a specific null assumption.
We compare two specific values of r².
Number of storks and reproductive rates in Europe (Matthews 2000)
[Diagram: Urbanisation → Storks; Urbanisation → Birth rate]
Pseudocorrelations between X and Y arise when
X = f(U)
Y = g(U)
f = h(g)
The sample spaces of both variables are constrained by one or more hidden variables that are themselves correlated.
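The hidden-variable mechanism can be sketched in a few lines (an illustrative simulation; variable names and noise levels are my own, not from the slides):

```python
import random
import math

# X and Y are causally unrelated, but both depend on a hidden variable U,
# so they correlate -- the stork/birth-rate situation with U = urbanisation.
random.seed(1)
n = 500
U = [random.gauss(0, 1) for _ in range(n)]
X = [u + random.gauss(0, 0.5) for u in U]   # X = f(U) + noise
Y = [u + random.gauss(0, 0.5) for u in U]   # Y = g(U) + noise

def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

r = pearson_r(X, Y)
print(round(r, 2))  # a strong "pseudocorrelation", although X never influences Y
```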
[Figure: storks vs. birth rate; r² = 0.25; P < 0.05]
Excel gets plots at log-scales wrong.
Country     | Area   | No. stork pairs | Stork density | Inhabitants | Annual no. births | Annual birth rate
Albania     | 28750  | 100   | 0.00348 | 3200000  | 83000   | 0.026
Belgium     | 30520  | 1     | 0.00003 | 9900000  | 87000   | 0.009
Bulgaria    | 111000 | 5000  | 0.04505 | 9000000  | 117000  | 0.013
Denmark     | 43100  | 9     | 0.00021 | 5100000  | 59000   | 0.012
Germany     | 357000 | 3300  | 0.00924 | 78000000 | 901000  | 0.012
France      | 544000 | 140   | 0.00026 | 56000000 | 774000  | 0.014
Greece      | 132000 | 2500  | 0.01894 | 10000000 | 106000  | 0.011
Netherlands | 41900  | 4     | 0.00010 | 15000000 | 188000  | 0.013
Italy       | 301280 | 5     | 0.00002 | 57000000 | 551000  | 0.010
Austria     | 83860  | 300   | 0.00358 | 7600000  | 87000   | 0.011
Poland      | 312680 | 30000 | 0.09594 | 38000000 | 610000  | 0.016
Portugal    | 92390  | 1500  | 0.01624 | 10000000 | 120000  | 0.012
Spain       | 504750 | 8000  | 0.01585 | 39000000 | 439000  | 0.011
Switzerland | 41290  | 150   | 0.00363 | 6700000  | 82000   | 0.012
Turkey      | 779450 | 25000 | 0.03207 | 56000000 | 1576000 | 0.028
Hungary     | 93000  | 5000  | 0.05376 | 11000000 | 124000  | 0.011
$$\phi(0;1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} Z_i^2}$$
$$\chi^2(k) \leftarrow \sum_{i=1}^{k} Z_i^2$$
The sum of k squared, normally distributed variates is approximately χ² distributed with k degrees of freedom.
$$N(k-1) \leftarrow \sum_{i=1}^{k} Z_i$$
The sum of differently distributed variates is approximately normally distributed with k-1 degrees of freedom (central limit theorem).
$$Z_i^2 = \left(\frac{x_i - \mu}{\sigma}\right)^2$$
$$\chi^2(k) \leftarrow \sum_{i=1}^{k} \left(\frac{x_i - \mu}{\sigma}\right)^2$$
The χ² test
A Poisson distribution has σ² = μ.
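A quick Monte Carlo check of the statement above (an illustrative sketch, not from the slides): a χ² distribution with k degrees of freedom has mean k and variance 2k.

```python
import random

# Draw k standard-normal variates, square and sum them, many times,
# and check the first two moments against the chi-square expectations.
random.seed(42)
k = 5
n_draws = 20000
sums = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(n_draws)]
mean = sum(sums) / n_draws
var = sum((s - mean) ** 2 for s in sums) / n_draws

print(round(mean, 2), round(var, 2))  # mean ≈ k = 5, variance ≈ 2k = 10
```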
Some basic knowledge
The likelihood of a function equals the probability of obtaining the observed data given certain parameter values.
The maximum likelihood refers to those parameter values of a function (model) that maximize the likelihood given the data.
$$L(\Theta \mid X) = P(X \mid \Theta)$$
$$\chi^2(k_1 - k_0) = -2\ln\left(\frac{L_0}{L_1}\right)$$
The sum of two χ² distributions has k = k₁ + k₂ degrees of freedom.
The log-quotient of two likelihoods is asymptotically χ² distributed (Wilks' theorem).
Likelihood ratios (odds)
Sir Ronald Fisher
$$\Lambda = -2\ln\left(\frac{\text{maximum likelihood of the null assumption}}{\text{maximum likelihood of the observation}}\right) = -2\ln\left(\frac{P(X=x \mid \theta_0)}{P(X=x \mid \theta_1)}\right)$$
Fisher argued that hypotheses can be tested using likelihoods.
$$\chi^2(k) \leftarrow \sum_{i=1}^{k} Z_i^2 \qquad \frac{\sum Z_1^2}{\sum Z_2^2} \rightarrow F_{k_1,k_2}$$
The quotient of two χ²-distributed variates is F distributed.
−2 ln(λ) is χ² distributed (Wilks).
Classical frequentist hypothesis testing
We toss a coin 100 times and get heads 59 times. Is the coin fair?
$$p\left(x = 59 \,\middle|\, \tfrac{1}{2}\right) = \binom{100}{59}\left(\tfrac{1}{2}\right)^{100} = 0.016$$
$$p\left(x = 59 \,\middle|\, \tfrac{59}{100}\right) = \binom{100}{59}\left(\tfrac{59}{100}\right)^{59}\left(1 - \tfrac{59}{100}\right)^{100-59} = 0.081$$
[Figure: binomial likelihood functions for the estimates P = 1/2 and P = 59/100]
Fisher would contrast two probabilities of a binomial process given the outcome of 59 heads.
Fisher:
• The significance P of a test is the probability of the hypothesis given the data!
• The significance P of a test refers to a hypothesis to be falsified.
• It is the probability of obtaining an effect in comparison to a random assumption.
• As a hypothesis, P is part of the discussion of a publication.
$$\Lambda = -2\ln\left(\frac{0.016}{0.081}\right) = -2\ln(0.20) = 3.26$$
$$P(\Lambda) = P(\chi^2 = 3.26;\ df = 1) = 0.93$$
The odds in favour of H1 are 0.016/0.081 = 0.2; H1 is five times more probable than H0. The probability that the observed binomial probability q1 = 59/100 differs from q0 = ½ given the observed data is Pobs = 0.93.
The probability in favour of the null assumption is therefore P0 = 1 − 0.93 = 0.07.
In Fisher’s view a test should falsify a hypothesis with respect to a null assumption given the data.
This is in accordance with the Popper - Lakatos approach to scientific methodology.
According to Fisher the test failed to reject the hypothesis of P = 59/100.
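Fisher's coin calculation can be reproduced directly (a sketch; the χ² CDF for df = 1 is computed via the error function):

```python
import math

# Likelihoods of 59 heads in 100 tosses under q0 = 1/2 and q1 = 59/100,
# then the likelihood-ratio statistic and its chi-square probability.
n, x = 100, 59
p0 = math.comb(n, x) * 0.5 ** n                     # ≈ 0.016
p1 = math.comb(n, x) * 0.59 ** x * 0.41 ** (n - x)  # ≈ 0.081

lam = -2 * math.log(p0 / p1)                        # ≈ 3.26
# chi-square CDF with df = 1: P(chi2 <= lam) = erf(sqrt(lam/2))
P = math.erf(math.sqrt(lam / 2))                    # ≈ 0.93
print(round(p0, 3), round(p1, 3), round(lam, 2), round(P, 2))
```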
Egon Pearson Jerzy Neyman
The Pearson – Neyman framework
$$P(x \geq 59) = \sum_{i=59}^{100}\binom{100}{i}\left(\tfrac{1}{2}\right)^{100} = 0.04$$
The likelihood result:
$$P(1 - \Lambda) = 1 - P(\chi^2 = 3.26;\ df = 1) = 0.07$$
Pearson-Neyman asked what is the probability of the data given the model!
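The corresponding Pearson-Neyman tail probability for the same coin data can be computed directly (sketch):

```python
import math

# Probability of 59 or more heads in 100 tosses under the null q = 1/2.
n = 100
P_tail = sum(math.comb(n, i) for i in range(59, n + 1)) * 0.5 ** n
print(round(P_tail, 3))  # ≈ 0.044
```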
The significance value α of a test is the probability (the evidence) against the null assumption.
          | H1 true            | H0 true
Reject H0 | 1−Q (correct)      | P (Type I error)
Reject H1 | Q (Type II error)  | 1−P (correct)
P is the probability of rejecting H0 given that H0 is true (the Type I error rate).
It is not allowed to equate P and Q, the probability of rejecting H1 given that H1 is true (the Type II error rate).
[Figure: distribution of the test statistic b under H0 and the cumulative distribution of b under H0; P is the tail probability and P(H1) the cumulative probability of H1]
Classical frequentist hypothesis testing
Pearson-Neyman:
• The significance P of a test is the probability that our null hypothesis is true in comparison to a precisely defined alternative hypothesis.
• This approach does not raise concerns if we have two and only two contrary hypotheses (tertium non datur).
• As a result, P is part of the results section of a publication.
In the view of Pearson and Neyman a test should falsify a null hypothesis with respect to the observation.
For 50 years Pearson and Neyman prevailed, because their approach is simpler in most applications.
P is not the probability that H0 is true!
1−P is not the probability that H1 is true!
Rejecting H0 does not mean that H1 is true.
Rejecting H1 does not mean that H0 is true.
A test aims at falsifying a hypothesis.
• The test does not rely on prior information.
• It does not consider additional hypotheses.
• The result is invariant of the way of testing.

Fisher: A test aims at falsifying a hypothesis. We test the observed data. We test for differences in the model parameters. P-values are part of the hypothesis development.
Pearson-Neyman: A test aims at falsifying a null assumption. We test against assumed data that have not been measured. P-values are central to hypothesis testing.
Modus tollens
A → ¬B: If Ryszard is from Poland, he is probably not a member of the Sejm.
Ryszard is probably a member of the Sejm.
Thus he is probably not a citizen of Poland.
If P(H1) > 0.95, H0 is probably false.
H0 is probably true.
P(H1) < 0.95.
This does not mean that H1 is probably false.
It only means that we don't know.
A word on logic
If multiple null assumptions are possible the results of classical hypothesis testing are difficult to interpret.
If multiple hypotheses are contrary to a single null hypothesis the results of classical hypothesis testing are difficult to interpret.
Pearson-Neyman and Fisher testing always works properly if there are two and only two truly contrasting alternatives.
• "The pattern of co-occurrences of the two species appeared to be random (P(H0) > 0.3)." (We cannot test for randomness.)
• "We reject our hypothesis about antibiotic resistances in the Bacillus thuringiensis strains, P(H1) > 0.1." (We can only reject null hypotheses.)
• "The two enzymes did not differ in substrate binding efficacy (P > 0.05)." (We do not know.)
• "Time of acclimation and type of injection significantly affected changes in Tb within 30 min after injection (three-way ANOVA: F5,461 = 2.29; P < 0.05)." (With n = 466, time explains 0.5% of the variation.)
• "The present study has clearly confirmed the hypothesis that non-native gobies are much more aggressive fish than are bullheads of comparable size... This result is similar to those obtained for invasive round goby in its interactions with the native North American cottid (F1,14 = 37.83)." (If others have found the same, we should rather test for lack of difference; the present null assumption is only a straw man.)
Examples
[Diagram: in the theorem of Bayes, p(A|B) is the posterior, p(B|A) the conditional, and p(A), p(B) are the priors]
Thomas Bayes (1702-1761)
Abraham de Moivre (1667-1754)
$$p(A \wedge B) = p(A \mid B)\, p(B) \qquad p(B \wedge A) = p(B \mid A)\, p(A)$$
$$p(A \wedge B) = p(B \wedge A) \;\Rightarrow\; p(A \mid B)\, p(B) = p(B \mid A)\, p(A)$$
$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$
Theorem of Bayes
The Bayesian philosophy
The law of conditional probability
Theorem of Bayes
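A toy application of the theorem (the diagnostic-test numbers are purely illustrative assumptions, not from the slides):

```python
# A diagnostic test with 99% sensitivity and 95% specificity
# applied to a disease with 1% prevalence.
p_disease = 0.01
p_pos_given_disease = 0.99          # p(B|A)
p_pos_given_healthy = 0.05          # false-positive rate
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))   # p(B)

# p(A|B) = p(B|A) p(A) / p(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.167: even a "highly significant"
                                      # positive result leaves the hypothesis
                                      # improbable under a low prior
```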
A frequentist test provides a precise estimate of probability.
[Figure: probability scale 0 - 0.1 - 0.5 - 0.9 - 0.99; a frequentist test marks a point P on the scale, a Bayesian test a shift ΔP]
A Bayesian interpretation of probability
Under a Bayesian interpretation, a statistical test estimates how much the test shifted an initial assumption about the probability of our hypothesis towards statistical significance.
Significance is the degree of belief based on prior knowledge.
Under a frequentist interpretation a statistical test provides an estimate of the probability in favour of our null hypothesis.
In the frequentist interpretation probability is an objective reality.
$$p(post \mid prior) = p(post)$$
Post is independent of prior.
$$p(post \mid prior) = \frac{p(prior \mid post)\, p(post)}{p(prior)}$$
Post is mediated by prior:
$$p(post \mid prior) \leq p(post)$$
The earth is round: P < 0.05 (Goodman 1995)
We perform a test in our bathroom to see whether the water in the filled bathtub is curved according to a globe or to a three-dimensional heart.
Our test gives P = 0.98 in favour of earth-like curvature (P(H0) < 0.05).
Does this change our view about the geometry of the earth? Does this mean that a heart model has 2% support?
[Figure: the Bayesian probability level in favour of H0 vs. the frequentist probability level in favour of H0 that the earth is a heart]
The higher our initial guess about the probability of our hypothesis, the less any new test contributes to further evidence.
Frequentist tests are not as strong as we think.
Often null hypotheses serve only as a straw man to "support" our hypothesis (fictional testing).
Confirmatory studies
Bayesian prior and conditional probabilities are often not known and have to be guessed.
Frequentist inference has done a good job; we have scientific progress.
$$p(post \mid prior) = \frac{p(prior \mid post)\, p(post)}{p(prior)}$$
Our test provides a significance level independent of prior information only if we are quite sure about the hypothesis to be tested.
A study reports that potato chips increase the risk of cancer, P < 0.01: P(H1) = 0.99.
However, previous work did not find a relationship. Thus we believe that p(H1) < 0.5.
Our posterior probability is P = (0.0 … 0.5) × 0.99 < 0.5.
The posterior test is not as significant as we believed.
Tests in confirmatory studies must consider prior information.
Bayes factor (odds):
$$BF = \frac{p(A)}{p(B)} = \frac{p(A \mid B)}{p(B \mid A)}$$
$$BF = \frac{p(t \mid H_1)}{p(t \mid H_0)}$$
We have 59 heads and 41 tails. Does this mean that heads have a higher probability?
$$p\left(x \geq 59 \,\middle|\, \tfrac{1}{2}\right) = \sum_{i=59}^{100}\binom{100}{i}\left(\tfrac{1}{2}\right)^{100} = 0.044$$
The frequentist approach
Under Bayesian logic the observed result is only about 5 times less probable than any other result:
$$K = \frac{0.044}{0.0099} = 4.44$$
The odds for a deviation are 4.44; 1/4.44 = 0.23.
Bayesian inference
The Bayes approach asks what is the probability of our model with respect to any other possible model.
Bayes factor in favour of H0 | Z-score | Parametric frequentist probability p(Z)
0.5     | 1.177 | 0.239032
0.1     | 2.146 | 0.031876
0.05    | 2.448 | 0.014375
0.01    | 3.035 | 0.002407
0.001   | 3.717 | 0.000202
0.0001  | 4.292 | 0.000018
0.00001 | 4.799 | 0.000002

The Bayes factor gives the odds in favour of H0.
A factor of 1/10 means that H0 is ten times less probable than H1.
For tests approximately based on the normal distribution (Z, t, F, χ²), Goodman defined the minimal Bayes factor BF as:
$$BF = e^{-Z^2/2}$$
For a hypothesis to be 100 times more probable than the alternative model we need a parametric significance level of P < 0.0024!
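Goodman's table can be recomputed from BF = e^(−Z²/2) (a sketch; the two-sided frequentist P uses the complementary error function):

```python
import math

def z_from_bf(bf):
    # invert BF = exp(-Z^2/2)
    return math.sqrt(-2 * math.log(bf))

def two_sided_p(z):
    # 2 * (1 - Phi(z)) via the complementary error function
    return math.erfc(z / math.sqrt(2))

for bf in (0.5, 0.05, 0.001):
    z = z_from_bf(bf)
    print(bf, round(z, 3), round(two_sided_p(z), 6))
```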
How to recalculate frequentist probabilities in a Bayesian framework
$$BF = \frac{p(t \mid H_1)}{p(t \mid H_0)} = e^{-\chi^2/2}$$
Bayesian statisticians call for using P < 0.001 as the upper limit of significance!
$$\Lambda = \chi^2 = -2\ln\left(\frac{p(t \mid H_1)}{p(t \mid H_0)}\right) = -2\ln(BF)$$
For large n, χ² is approximately normally distributed.
$$Z = \sqrt{-2\ln(BF)}$$
Hirotugu Akaike
[Figure: linear fit y = 90.901x (R² = 0.5747) and 4th-order polynomial fit y = −0.375x⁴ + 14.462x³ − 164.12x² + 609.02x − 356.84 (R² = 0.9607) to the same data]
Any test for goodness of fit will eventually become significant if we only enlarge the number of free parameters.
William of Ockham
Pluralitas non est ponenda sine necessitate
Occam’s razor
[Figure: trade-off between bias and explained variance/significance as the number of variables grows from few to many; the optimum of maximum information content lies in between]
The sample size corrected Akaike criterion of model choice:
$$AIC_c = 2k - 2\ln(\Lambda) + \frac{2k(k+1)}{n-k-1}$$
k: total number of model parameters + 1; n: sample size; Λ: maximum likelihood estimate of the model.
All models are wrong but some are useful.
Maximum likelihood estimated by χ²:
$$AIC_c = 2k + \chi^2 + \frac{2k(k+1)}{n-k-1}$$
Maximum likelihood estimated by r²:
$$AIC_c = 2k + \ln\left(\frac{1-r^2}{n}\right) + \frac{2k(k+1)}{n-k-1}$$
$$\Delta AIC = AIC_1 - AIC_2$$
We choose the model with the lowest AIC ("the most useful model").
This is often not the model with the lowest P-value.
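The model choice above can be sketched with the r²-based AICc variant; the parameter counts k = 2 (linear) and k = 6 (4th-order polynomial) and n = 19 are my reading of the example:

```python
import math

def aicc_r2(k, n, r2):
    # r^2-based AICc variant from the slides
    return 2 * k + math.log((1 - r2) / n) + 2 * k * (k + 1) / (n - k - 1)

n = 19
a_lin = aicc_r2(2, n, 0.5747)   # linear fit, r2 = 0.5747
a_pol = aicc_r2(6, n, 0.9607)   # 4th-order polynomial, r2 = 0.9607
print(round(a_lin, 2), round(a_pol, 2))
# the linear model has the lower AICc despite its much lower r2
```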
$$AIC = 2k - 2\ln(\Lambda) + \frac{2k(k+1)}{n-k-1}$$
The lower the AIC, the more parsimonious the model.
AIC model selection serves to find the best descriptor of observed structure.
It is a hypothesis generating method. It does not test for significance.
Model selection using significance levels is a hypothesis testing method.
When to apply AIC:
• General linear modelling (regression models, ANOVA, MANCOVA)
• Regression trees
• Path analysis
• Time series analysis
• Null model analysis
By r² (n = 19):
$$AIC_c = 2 \cdot 2 + \ln\left(\frac{1-0.5747}{19}\right) + \frac{2 \cdot 2 \cdot 3}{19-2-1} = 0.95$$
$$AIC_c = 2 \cdot 6 + \ln\left(\frac{1-0.9607}{19}\right) + \frac{2 \cdot 6 \cdot 7}{19-6-1} = 12.82$$
Significance levels and AIC must not be used together.
AIC should be used together with r2.
[Figure: the same data fitted with y = 90.901x (R² = 0.5747) and y = −0.375x⁴ + 14.462x³ − 164.12x² + 609.02x − 356.84 (R² = 0.9607); significance levels P = 0.95 and P = 0.9999]
Any statistical test will eventually become significant if we only enlarge the sample size.
At r² = 0.01, using an F-test (regression analysis), we need 385 data points to get a significant result at P < 0.05.
At very large sample sizes (N >> 100) classical statistical tests break down.
F-test:
$$F = \frac{r^2}{1-r^2}(n-2) \rightarrow p(F;\ 1,\ n-2)$$
The relationship between P, r², and sample size.
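A sketch of this relationship (the large-sample 5% critical value F(1, ∞) ≈ 3.84 is used as an approximation):

```python
# With a fixed tiny effect (r2 = 0.01) the F statistic grows linearly
# with n and eventually exceeds the 5% critical value.
r2 = 0.01
for n in (100, 385, 1000):
    F = r2 / (1 - r2) * (n - 2)
    print(n, round(F, 2), F > 3.84)
```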
Large data sets
3000 replicates of 100 pairs of Excel random numbers, with shared pairs of zeroes and ones appended:
• N = 100, one pair of zeroes and ones: 7.5% significant correlations
• N = 1000, 10 pairs of zeroes and ones: 16% significant correlations
• N = 10000, 100 pairs of zeroes and ones: 99.9% significant correlations
   A B C D
1  1 1 1 0
2  0 1 0 0
3  0 0 1 0
4  1 1 0 0
5  1 0 0 1
6  0 1 1 1
7  0 0 0 1
8  0 0 1 0
9  1 1 0 1
10 0 0 0 0

   A B C D E F G H
1  1 1 1 0 1 0 1 0
2  0 1 0 0 1 1 1 0
3  0 0 1 0 1 1 1 0
4  1 1 0 0 1 0 1 1
5  1 0 0 1 1 1 1 1
6  0 1 1 1 0 1 1 1
7  0 0 0 1 1 0 1 0
8  0 0 1 0 0 1 1 0
9  1 1 0 1 0 1 1 0
10 0 0 0 0 0 0 1 1

   A B C D E F G H I J K L M N O P
1  1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0
2  0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 1
3  0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0
4  1 1 0 0 1 0 1 1 1 1 0 0 1 0 1 1
5  1 0 0 1 1 1 1 1 0 0 0 1 0 0 0 1
6  0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
7  0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 1
8  0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 1
9  1 1 0 1 0 1 1 0 1 1 0 1 0 1 0 0
10 0 0 0 0 0 0 1 1 0 0 1 1 0 1 1 1
Number of species co-occurrences in comparison to a null expectation (data are simple random numbers).
The variance of the null space decreases due to statistical averaging.
[Figure: observed value Nobs against the null distribution]
$$t = \frac{\text{effect size}}{SE}$$
$$SE \rightarrow 0 \;\Rightarrow\; t \rightarrow \infty$$
• Any test that involves randomisation of a compound metric will eventually become significant due to the decrease of the standard error.
• This reduction is due to statistical averaging.
The null model relies on a randomisation of 1s and 0s in the matrix
At very large sample sizes (N >> 100) classical statistical tests break down.
Instead of using a predefined significance level use a predefined effect size or r2 level.
The T-test of Wilcoxon revealed a statistically significant difference in pH of surface water between the lagg site (Sphagno-Juncetum) and the two other sites.
Every statistical analysis must at least present sample sizes, effect sizes, and confidence limits.Multiple independent testing needs independent data.
Pattern seeking or P-fishing
Person | Blood pressure | Gender | Age class | Smoker
1 | 80  | m | 30 | y
2 | 133 | f | 40 | y
3 | 64  | m | 60 | n
4 | 139 | f | 40 | y
5 | 63  | m | 80 | n
6 | 105 | f | 70 | y
7 | 114 | f | 60 | y

Variables | SS | df | MS | F | P
Gender                  | 1      | 1   | 1.37    | 0.00 | 0.97
Age class               | 15183  | 8   | 1897.85 | 2.32 | 0.02
Smoker                  | 4062   | 1   | 4061.61 | 4.97 | 0.03
Gender*Age class        | 6507   | 7   | 929.57  | 1.14 | 0.34
Gender*Smoker           | 1168   | 1   | 1167.74 | 1.43 | 0.23
Age class*Smoker        | 8203   | 7   | 1171.81 | 1.43 | 0.19
Gender*Age class*Smoker | 4083   | 5   | 816.58  | 1.00 | 0.42
Error                   | 790913 | 968 | 817.06  |      |
Simple linear random numbers: of 12 trials, four gave significant results.
Using the same test several times with the same data needs a Bonferroni correction.
Single test
)(1)( sigpnsigp
)()(
))(1(1
))(1(1)(
))(1()(
signpsigp
signp
sigpsigp
sigpnsigp
testExp
test
ntestExp
ntestExp
n independent tests
nn TestTestExp
05.005.0
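The inflation of the family-wise error rate, and the Bonferroni repair, in a few lines (sketch):

```python
# Probability of at least one false positive among n independent tests
# at alpha = 0.05, and the Bonferroni-corrected level that compensates.
alpha = 0.05
for n in (1, 12, 100):
    p_any = 1 - (1 - alpha) ** n        # family-wise error rate
    bonf = alpha / n                    # Bonferroni-corrected threshold
    print(n, round(p_any, 2), bonf)
```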
The Bonferroni correction is very conservative.
False discovery rates (false detection error rates): The proportion of erroneously declared significances.
A sequential Bonferroni correction
Test | Significance | Ranked significance | Significance cut-off level | Corrected cut-off level | Result
7 | 0.03  | 0.001 | 0.01 | 0.001429 | Sig
6 | 0.14  | 0.007 | 0.01 | 0.001667 | Nsig
5 | 0.45  | 0.012 | 0.01 | 0.002    | Nsig
4 | 0.001 | 0.03  | 0.01 | 0.0025   | Nsig
3 | 0.012 | 0.06  | 0.01 | 0.003333 | Nsig
2 | 0.007 | 0.14  | 0.01 | 0.005    | Nsig
1 | 0.06  | 0.45  | 0.01 | 0.01     | Nsig

$$\alpha_{new}(i) = \frac{\alpha}{n-i+1};\quad i = 1, \ldots, k$$
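The sequential correction applied to the seven P-values of the table (a sketch of a Holm-type procedure at the table's cut-off α = 0.01):

```python
# Rank the P-values, compare each against alpha / (n - i + 1),
# and stop at the first failure.
alpha = 0.01
pvals = [0.03, 0.14, 0.45, 0.001, 0.012, 0.007, 0.06]

significant = 0
for i, p in enumerate(sorted(pvals), start=1):
    if p <= alpha / (len(pvals) - i + 1):
        significant += 1
    else:
        break

print(significant)  # only the smallest P-value (0.001) survives
```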
What is multiple testing?
• Single analysis?
• Single data set?
• Single paper?
• Single journal?
• Lifetime work?
There are no established rules!
K is the number of contrasts.
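A sketch of the Benjamini-Hochberg false-discovery-rate procedure (a standard FDR method; the slides do not spell it out) applied to the same seven P-values:

```python
# BH: find the largest rank i with p_(i) <= (i/m) * q;
# all P-values up to that rank are declared discoveries.
q = 0.05
pvals = sorted([0.03, 0.14, 0.45, 0.001, 0.012, 0.007, 0.06])
m = len(pvals)

cutoff = max((i for i, p in enumerate(pvals, start=1) if p <= i / m * q),
             default=0)
print(cutoff)  # BH declares more discoveries than Bonferroni
               # at the same error budget
```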
A data set on health status and reproductive success of Polish storks.
N: 866 stork chicks; K: 104 physiological and environmental variables.

Columns: Site, Study year, Campylobacter, Gender, Number of chicks, Chick no., Ring number, Body weight [g], Weight/age, Age [days], Bill length [mm], Ht Sr%, Hb [g/dl], RBC [T/l], WBC [G/l], MCV, MCH, MCHC, Urea [mg/dl], Uric acid, Cholesterol [mg/dl], Triglycerides [mg/dl], Total protein [g/dl], HDL [mg/dl], LDL [mg/dl], AspAT [U/l], AlAT [U/l], Na, K, Ca (mg/l), Mg (mg/l), Fe, Zn (mg/l), Cu, Mn, Co, Cd (mg/l), Pb.

Example rows:
1 2006 0 female 2 1 P 2151 2950 89.3939 33 96 30 7.75505 1.55 13.1 194 50 25.85 16.8 19.5 218.9 203.4 3.4 63 115.3 196.4 51.3 122 8 258.9 686 41 1.847 16 30 3 0.392 4
2 2006 1 female 2 2 P 2152 3010 83.6111 36 101 33 7.9 1.56 12.9 212 50.6 23.94 15.1 13.1 184.6 187.9 3.2 40.3 106.7 229.9 51.2 122 6 404.6 223 41 0.771 2 30 3 0.354 3
… … …
139 2012 0 male 3 152 P 2154 2650 71.6216 37 104 30 7.3 1.36 16.9 221 53.7 24.33 14.6 16.2 153.2 186.3 3.3 43.4 72.5 146.7 46.7 67 7 562.9 355 40 1.143 2 30 3 0.369 3
• Common practice is to screen the data for significant relationships and publish these significances.
• The respective paper does not mention how many variables have been tested.
• Hypotheses are constructed post factum to match the "findings".
• "Results" are discussed as if they corroborated the hypotheses.
P-fishing
If the data set is large:
• Divide the records at random into two or more parts.
• Use one part for hypothesis generation, use the other parts for testing.
• Always use significance levels corrected for multiple testing.
• Take care of non-independence of data. Try reduced degrees of freedom.
• Hypotheses must come from theory (deduction), not from the data.
• Inductive hypothesis testing is problematic.
• If the hypotheses are intended as a simple description, don't use P-values.
P < 0.000001
Possibly data are non-independent due to sampling sequence.
No clear hypothesis.
Final guidelines
Don’t mix data description, classification and hypotheses testing.Provide always sample sizes and effect sizes . If possible provide confidence limits.
Data description and model selection:
• Rely on AIC, effect sizes, and r² only.
• Do not use P-values.
• Check for logic and reason.
Hypothesis testing:
• Be careful with hypothesis induction. Hypotheses should stem from theory, not from the data.
• Do not develop and test hypotheses using the same data.
• Do not use significance testing without a priori defined and theory-derived hypotheses.
• Check for logic and reason.
• Check whether results can be reproduced.
• Do not develop hypotheses post factum (telling just-so stories).

Testing for simple differences and relationships:
• Be careful in the interpretation of P-values. P does not provide the probability that a certain observation is true.
• P does not provide the probability that the alternative observation is true.
• Check for logic and reason.
• Don't use simple tests in very large data sets. Use effect sizes only.
• Use predefined effect sizes and explained variances.
• If possible use a Bayesian approach.