Lecture 2 : Searching for correlations
• Lecture 1 : basic descriptive statistics
• Lecture 2 : searching for correlations
• Lecture 3 : hypothesis testing and model-fitting
• Lecture 4 : Bayesian inference
The dark energy puzzle
These lectures
• Correlation coefficient and its error
• How to quantify the significance of a correlation
• Bootstrap error estimates
• Non-parametric correlation tests
• Common pitfalls when searching for correlations
• Comparing two distributions
Lecture 2 : searching for correlations
• Two variables are correlated if they share a statistical dependence / relationship
• For example, measurements of temperature at noon and 1pm every day are correlated, because they both lie consistently above the mean daily temperature
• Correlations between variables are important because they indicate some underlying physical relationship between those variables
What is a correlation?
Uncorrelated variables !
• Trick for generating correlated variables : if x, z are unit Gaussian variables and $y = r x + z \sqrt{1 - r^2}$, then (x, y) have correlation coefficient r
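The trick above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the seed, the sample size and the target value r = 0.7 are illustrative assumptions, not part of the lecture.

```python
import math
import random


def correlated_pair(r, rng):
    """Draw one (x, y) pair with target correlation coefficient r."""
    x = rng.gauss(0.0, 1.0)  # unit Gaussian variable
    z = rng.gauss(0.0, 1.0)  # independent unit Gaussian variable
    y = r * x + z * math.sqrt(1.0 - r * r)
    return x, y


def sample_correlation(xs, ys):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable
pairs = [correlated_pair(0.7, rng) for _ in range(20000)]
xs, ys = zip(*pairs)
r_measured = sample_correlation(xs, ys)
print(f"target r = 0.7, measured r = {r_measured:.3f}")
```

With a sample this large the measured coefficient should land close to the target; smaller samples scatter more widely.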
• Example in astronomy : black-hole / bulge relation
[Figure: black hole mass (solar masses) against bulge velocity dispersion (km/s)]
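• Describes the strength of the correlation between (x, y)
• Means : $\mu_x = \langle x \rangle$, $\mu_y = \langle y \rangle$
• Standard deviations : $\sigma_x^2 = \langle (x - \mu_x)^2 \rangle$, $\sigma_y^2 = \langle (y - \mu_y)^2 \rangle$
• Definition of correlation coefficient : $\rho = \frac{\langle (x - \mu_x)(y - \mu_y) \rangle}{\sigma_x \sigma_y}$
Correlation coefficient
• No correlation [P(x,y) separable into f(x) g(y)] : $\rho = 0$
• Complete correlation : $\rho = +1$
• Complete anti-correlation : $\rho = -1$
• Possible range is $-1 \le \rho \le +1$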
Lies, damn lies and statistics
Why was this poor statistics? Correlation is not the same as causation. Other dietary or lifestyle habits could be a third variable! [ ... more examples later ... ]
Example 3
BBC NEWS | Health | Eating pizza 'cuts cancer risk'
http://news.bbc.co.uk/2/hi/health/3086013.stm
"Before people start dialling the pizza takeaway, they should consider that pizza can be high in saturated fat, salt and calories" (Nicola O'Connor, Cancer Research UK)
Pizzas are covered with a potentially protective tomato sauce
Last Updated: Tuesday, 22 July, 2003, 10:28 GMT 11:28 UK
Eating pizza 'cuts cancer risk'
Italian researchers say eating pizza could protect against cancer.
Researchers claim eating pizza regularly reduced the risk of developing oesophageal cancer by 59%.
The risk of developing colon cancer also fell by 26% and mouth cancer by 34%, they claimed.
The secret could be lycopene, an antioxidant chemical in tomatoes, which is thought to offer some protection against cancer, and which gives the fruit its traditional red colour.
But some experts cast doubt on the idea that pizza consumption was the explanation for why some people did not develop cancer.
They said other foods or dietary habits could play a part.
Lifestyle
The researchers looked at 3,300 people who had developed cancer of the mouth, oesophagus, throat or colon and 5,000 people who had not developed cancer.
They were asked about their eating habits, and how often they ate pizza.
Those who ate pizza at least once a week had less chance of developing cancer, they found.
Dr Silvano Gallus, of the Mario Negri Institute for Pharmaceutical Research in Milan, who led the research, said: "We knew that tomato sauce could offer protection against certain tumours, but we did not expect pizza as a complete meal also to offer such protective powers."
Nicola O'Connor, of Cancer Research UK, told BBC News Online: "This study is interesting and the results should probably be looked at in the context of what we already know about the Mediterranean diet and its association with a lower risk of certain types of cancer.
"But before people start dialling the pizza takeaway, they [...]
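• We can estimate the Pearson product-moment correlation coefficient as : $r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$ (c.f. the definition of the correlation coefficient)
• The possible range of values is $-1 \le r \le +1$
Estimating the correlation coefficient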
• If the correlation is statistically significant :
• 0 < |r| < 0.3 is a “weak correlation”
• 0.3 < |r| < 0.7 is a “moderate correlation”
• 0.7 < |r| < 1.0 is a “strong correlation”
Estimating the correlation coefficient
• Assumption : (x,y) are drawn from a bivariate Gaussian distribution about an underlying linear relation : $y = a + b x$
• If this model is true, then the uncertainty in the measured value of r is $\sigma_r = \sqrt{\frac{1 - r^2}{N - 2}}$
Estimating the correlation coefficient
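The estimate and its uncertainty can be sketched in a few lines of standard-library Python. The error formula used here, sqrt((1 - r^2)/(N - 2)), is an assumption inferred from the values quoted in these slides for the Lemaitre (N = 42) and Hubble (N = 24) samples, which it reproduces; the function names and the small dataset are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pearson_r_error(r, n):
    """Uncertainty in r under the bivariate Gaussian assumption."""
    return math.sqrt((1.0 - r * r) / (n - 2))


# Hypothetical measurements, for illustration only
x_data = [0.1, 0.4, 0.5, 0.9, 1.2, 1.6, 2.0]
y_data = [80.0, 150.0, 300.0, 310.0, 620.0, 700.0, 850.0]
r = pearson_r(x_data, y_data)
print(f"r = {r:.2f} +/- {pearson_r_error(r, len(x_data)):.2f}")

# Errors for the two coefficients quoted in the lecture example
err_lemaitre = pearson_r_error(0.38, 42)
err_hubble = pearson_r_error(0.79, 24)
```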
• Example 1 : who discovered the distance-redshift relation, Lemaitre (1927) or Hubble (1929)?
Estimating the correlation coefficient
A Hubble Eclipse: Lemaître and Censorship

David L. Block, School of Computational & Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa.

Abstract. One of the greatest discoveries of modern times is that of the expanding Universe, almost invariably attributed to Hubble (1929). What is not widely known is that the original treatise by Lemaître (1927) contained a rich fusion of both theory and of observation. The French paper was meticulously censored when printed in English - all discussions of radial velocities and distances (and the very first empirical determination of "H") were omitted. Fascinating insights are gleaned from a letter recently found in the Lemaître archives. An appeal is made for a Lemaître Telescope, to honour the discoverer of the expanding universe.

Lemaître (1927): a theoretical paper?

The title of the original 1927 paper indicates to the reader that the content will be a fusion of both theory and of observation:

"Un univers homogène de masse constante et de rayon croissant, rendant compte de la vitesse radiale des nébuleuses extra-galactiques" which is translated in the English tongue, thus:

"A homogeneous universe of constant mass and increasing radius accounting for the radial velocity of extra-galactic nebulae"

Lemaître spent the years 1924/5 at the Harvard College Observatory. He had an excellent foundation in observational astronomy, writing about terms such as the effective temperatures of stars, trigonometric parallaxes, moving-cluster parallaxes, absolute bolometric magnitudes, dwarf branch stars, giant branch stars, and the like.

To speak of Lemaître (1927) as a most remarkable and absolutely brilliant theoretical paper only, is a grave injustice to the very title. Not only does Lemaître theoretically derive a linear relationship between the radial velocities of galaxies and their distances in the above paper, but he is eager to determine the rate at which the Universe expands. Lemaître (1927) carefully uses the radial velocities of 42 extragalactic nebulae tabulated in Strömberg (1925), and he converts apparent magnitudes m into distance [log r = 0.2m + 4.04] following Hubble (1926). The actual value which Lemaître obtains in 1927 for the rate of expansion of the Universe is 625 km/s/Mpc; 575 km/s/Mpc with different weighting factors (Figure 1).
arXiv:1106.3928 etc.
• Part (a) : For each dataset, find the Pearson product-moment correlation coefficient and its error
• Lemaitre : r = 0.38 +/- 0.15
• Hubble : r = 0.79 +/- 0.13
Estimating the correlation coefficient
• What is the probability of obtaining the measured value of r if the true correlation is zero? (This also depends on N)
• In order to determine whether the correlation is significant, calculate $t = r \sqrt{\frac{N - 2}{1 - r^2}}$
• This obeys the Student’s t probability distribution with number of degrees of freedom $\nu = N - 2$
• Consult tables (2-tailed test) with these two numbers
Is a correlation significant?
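Instead of consulting tables, the two-tailed probability can be obtained by integrating the Student's t density numerically. This is a sketch under stated assumptions: the statistic is taken to be t = r sqrt((N - 2)/(1 - r^2)) with nu = N - 2, which reproduces the t values quoted in these slides, and the function names and the Simpson's-rule integration are illustrative choices, not the lecturer's method.

```python
import math


def t_statistic(r, n):
    """t value for a measured correlation coefficient r from n points."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def student_t_pdf(t, nu):
    """Probability density of Student's t with nu degrees of freedom."""
    log_norm = (math.lgamma(0.5 * (nu + 1)) - math.lgamma(0.5 * nu)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - 0.5 * (nu + 1) * math.log1p(t * t / nu))


def two_tailed_probability(t, nu, steps=10000):
    """P(|T| > |t|), using Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = student_t_pdf(0.0, nu) + student_t_pdf(upper, nu)
    for i in range(1, steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight * student_t_pdf(i * h, nu)
    inside = 2.0 * total * h / 3.0  # probability that |T| <= |t|
    return 1.0 - inside


t_lemaitre = t_statistic(0.38, 42)
p_lemaitre = two_tailed_probability(2.60, 40)
p_hubble = two_tailed_probability(6.03, 22)
print(f"Lemaitre: t = {t_lemaitre:.2f}, prob = {p_lemaitre:.1e}")
print(f"Hubble: prob = {p_hubble:.1e}")
```

A small probability means the measured r would rarely arise by chance from uncorrelated variables with the same number of data points.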
[Figure: distribution of t when rho = 0. The number of data points is crucial !]
• Standard tables list the critical values that t must exceed, as a function of nu, for the hypothesis that the two variables are unrelated to be rejected at a particular level of statistical significance (e.g. 95%, 99%)
Is a correlation significant?
“Two-tailed test” : correlation or anti-correlation
• Part (b) : Determine the statistical significance of the correlation
• Lemaitre : t=2.60 , nu=40, prob=1.3e-2 (2.5 sigma)
• Hubble : t=6.03 , nu=22 , prob=4.5e-6 (4.6 sigma)
Is a correlation significant?
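Linear regression line
• The regression line is the linear fit that minimizes the sum of the squares of the y-residuals
• With intercept [y = a + b x] : $b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$, $a = \bar{y} - b \bar{x}$
• Without intercept [y = b x] : $b = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$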
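Both fits follow from setting the derivative of the summed squared y-residuals to zero. A minimal sketch, with hypothetical function names and made-up data, could look like this:

```python
def fit_with_intercept(xs, ys):
    """Least-squares fit of y = a + b x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_through_origin(xs, ys):
    """Least-squares fit of y = b x (no intercept); returns b."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical distances and velocities, for illustration only
distance = [0.5, 1.0, 1.5, 2.0]
velocity = [260.0, 450.0, 640.0, 830.0]
a, b = fit_with_intercept(distance, velocity)
slope_only = fit_through_origin(distance, velocity)
print(f"with intercept: a = {a:.1f}, b = {b:.1f}")
print(f"through origin: b = {slope_only:.1f}")
```

The two slopes differ whenever the data do not pass through the origin, which is why the two forms of fit in the lecture example give different values.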
• Part (c) : Determine linear least-squares regression lines of the form V = H D and V = H’ D + C
• Lemaitre : H = 414.9 , H’ = 221.7 , C = 316.5
• Hubble : H = 423.9 , H’ = 453.9 , C = -40.4
Linear regression line
• Aside : Hubble and Lemaitre both found values of H0 ~ 420 km/s/Mpc with independent techniques ! How could they both be wrong? [Example of statistical bias]
• Today this agreement would probably suggest confirmation bias, but Hubble didn’t even cite Lemaitre’s result!
• Lemaitre : assumed galaxy apparent magnitude was standard candle - scuppered by Malmquist bias
• Hubble : used “brightest stars” as standard candles, but could not distinguish brightest star from HII region (systematic error bias due to aperture effect)
Linear regression line
Bootstrap errors and probabilities
• Bootstrap statistics allow us to determine parameter errors and probability distributions using just the data
• If we have N data points, repeatedly draw random samples of N points from them (with replacement)
• Re-compute the parameter of interest for each bootstrap sample
• The distribution of the re-computed parameters estimates the uncertainty in the measurement from the original sample
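The bootstrap procedure above can be sketched as follows, here applied to the correlation coefficient. The dataset is synthetic, and the function names, seeds, number of resamples and the choice of the 16th to 84th percentile range as the 68% interval are all assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap(xs, ys, statistic, n_boot=1000, seed=1):
    """Re-compute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    values = []
    for _ in range(n_boot):
        picks = [rng.randrange(n) for _ in range(n)]  # indices may repeat
        values.append(statistic([xs[i] for i in picks],
                                [ys[i] for i in picks]))
    return values


def half_width_68(values):
    """Half the range between the 16th and 84th percentiles."""
    ordered = sorted(values)
    low = ordered[int(0.16 * len(ordered))]
    high = ordered[int(0.84 * len(ordered))]
    return 0.5 * (high - low)


# Synthetic data with an underlying correlation of 0.6 (illustration only)
data_rng = random.Random(2)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
ys = [0.6 * x + 0.8 * data_rng.gauss(0.0, 1.0) for x in xs]

boot_values = bootstrap(xs, ys, pearson_r)
r_error = half_width_68(boot_values)
print(f"r = {pearson_r(xs, ys):.2f}, 68% conf +/- {r_error:.2f}")
```

Passing a different `statistic` function, for example a regression slope, gives the bootstrap error on that parameter with no other changes.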
Bootstrap errors and probabilities
• Applied to our example : create 1000 bootstrap samples and measure the correlation coefficients
68% conf +/- 0.11
68% conf +/- 0.07
Bootstrap errors and probabilities
• Applied to our example : create 1000 bootstrap samples and do a linear regression fit
68% conf +/- 50
68% conf +/- 42
• If we do not want to assume (x,y) are drawn from a bivariate Gaussian we can use a non-parametric correlation test
• Let (Xi,Yi) be the rank of (xi,yi) in the overall order such that $1 \le X_i \le N$, $1 \le Y_i \le N$
• Find Spearman rank correlation coefficient $r_s = 1 - \frac{6 \sum_{i=1}^{N} (X_i - Y_i)^2}{N^3 - N}$
• Compare to standard tables with $\nu = N - 2$
Non-parametric correlation coefficient
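A sketch of the rank correlation, assuming tied values share the average of the ranks they span (as in a rank list containing values such as 5.5). The coefficient is computed here as the Pearson coefficient of the ranks, which equals the 1 - 6 sum(d^2)/(N^3 - N) form when there are no ties; the function names and example data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = 0.5 * (start + end) + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rs(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank lists."""
    rx = average_ranks(xs)
    ry = average_ranks(ys)
    n = len(rx)
    mean = 0.5 * (n + 1)  # mean of the ranks 1..n, also with ties
    sxy = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    sxx = sum((a - mean) ** 2 for a in rx)
    syy = sum((b - mean) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def spearman_rs_no_ties(xs, ys):
    """Shortcut form, valid only when neither list contains ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2
             for a, b in zip(average_ranks(xs), average_ranks(ys)))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)


print(average_ranks([0.26, 0.28, 0.28, 0.45]))
print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```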
D [Mpc]   Rank
0.03      1.0
0.03      2.0
0.21      3.0
0.26      4.0
0.28      5.5
0.28      5.5
0.45      7.0
0.50      8.5
0.50      8.5
0.63      10.0
0.80      11.0
0.90      13.5
0.90      13.5
0.90      13.5
0.90      13.5
1.00      16.0
1.10      17.5
1.10      17.5
1.40      19.0
1.70      20.0
2.00      22.5
2.00      22.5
2.00      22.5
2.00      22.5
Non-parametric correlation coefficient
• Standard tables list the critical values that rs must exceed, as a function of nu, for the hypothesis that the two variables are unrelated to be rejected at a particular level of statistical significance (e.g. 95%, 99%)
• Part (e) : Determine the Spearman rank cross-correlation coefficient and its statistical significance
• Lemaitre : rs=0.42 , nu=40 , prob=6.0e-3 (2.8 sigma)
• Hubble : rs=0.80 , nu=22 , prob=3.4e-6 (4.7 sigma)
• [Results very similar to Pearson product-moment correlation coefficient, but fewer assumptions]
Non-parametric correlation coefficient
• Selection effects leading to spurious correlations, for example Malmquist bias
Issues with correlations
[Figure: log10(radio luminosity) against distance modulus (redshift), with two regions marked “Selection effects in this region”]
• Is the correlation driven by a small number of outliers, so is not robust?
Issues with correlations
[Figure: datasets for which the mean, variance, correlation coefficient and regression line are all identical]
“Rule of thumb”
• Correlation does not necessarily imply causation
• Sleeping in your shoes causes you to wake up with a headache! [third variable - you were probably having a few drinks the night before...]
• Ice cream sales cause drowning! [third variable - hotter weather means more people are at the beach...]
• Having grey hair causes cancer! [third variable - age...]
• Obesity causes global warming! [third variable - everyone is getting richer...]
Issues with correlations
Lies, damn lies and statistics
“We don’t accept the idea that there are harmful agents in tobacco” [Philip Morris, 1964]
Why was this poor statistics? Cigarette companies were attempting to invoke a “third variable”. But correlation does sometimes imply causation, if it can be demonstrated by independent lines of evidence
Example 4
• Are two samples drawn from the same distribution?
• Example 2 : samples of flux densities measured at random positions [A] and galaxy positions [B]
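Comparing two distributions
• Part (a) : Are their means consistent?
• Calculate t statistic and no. of degrees of freedom : $t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/N_A + s_B^2/N_B}}$, $\nu = \frac{(s_A^2/N_A + s_B^2/N_B)^2}{\frac{(s_A^2/N_A)^2}{N_A - 1} + \frac{(s_B^2/N_B)^2}{N_B - 1}}$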
• Compare to Student’s t distribution
• t = 0.31, nu = 661.5, prob of consistency = 0.76
• Small print : assumes (x,y) are normally-distributed populations
Comparing two distributions
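The non-integer number of degrees of freedom quoted above suggests the unequal-variance (Welch) form of the t test, so the sketch below assumes that form; this is an inference, not something the slide states. The function name and sample values are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Unequal-variance t statistic and its effective degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a = statistics.variance(sample_a) / n_a
    var_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    nu = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1)
                                 + var_b ** 2 / (n_b - 1))
    return t, nu


# Hypothetical flux densities at random positions [A] and galaxy positions [B]
flux_a = [1.0, 2.0, 3.0, 4.0, 5.0]
flux_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_value, nu_value = welch_t(flux_a, flux_b)
print(f"t = {t_value:.2f}, nu = {nu_value:.1f}")
```

The resulting t and nu are then compared to the Student's t distribution, exactly as for the correlation test.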
• Part (b) : Are the full distributions consistent?
• The Kolmogorov-Smirnov test considers the maximum value of the absolute difference between the cumulative probability distributions
Comparing two distributions
Max diff = D = 0.067
Prob(Q) = 0.427
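A sketch of the two-sample Kolmogorov-Smirnov test. The statistic D is the largest gap between the two empirical cumulative distributions. The probability uses the widely used asymptotic series with an effective sample size N_A N_B / (N_A + N_B); whether the lecture used exactly this approximation is an assumption, and the function names and data are hypothetical.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n_a, n_b = len(a), len(b)
    i = j = 0
    d_max = 0.0
    while i < n_a and j < n_b:
        x = min(a[i], b[j])
        while i < n_a and a[i] <= x:  # step past all points <= x in A
            i += 1
        while j < n_b and b[j] <= x:  # step past all points <= x in B
            j += 1
        d_max = max(d_max, abs(i / n_a - j / n_b))
    return d_max


def ks_probability(d, n_a, n_b, terms=100):
    """Approximate probability of a difference >= d arising by chance."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:
        # The series converges slowly here; the probability is ~1
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical samples, for illustration only
sample_a = [0.2, 0.5, 0.9, 1.3, 1.8, 2.4]
sample_b = [0.4, 0.7, 1.1, 1.2, 2.0, 2.9]
d = ks_statistic(sample_a, sample_b)
prob = ks_probability(d, len(sample_a), len(sample_b))
print(f"Max diff = D = {d:.3f}, Prob(Q) = {prob:.3f}")
```

A probability near 1 means the two samples are consistent with a common parent distribution; a very small value means they are not.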