Preview of Chapters 10 and 11
Mon 1 October 07
Chapter 10: Correlation and Regression
1. Correlation
2. Regression
3. Variation and Prediction Intervals
4. Rank Correlation
1. Correlation
• Relationship between two measured variables in a data set, at the interval or ratio level of measurement
• In this book: linear relationships only
• Mind the requirements!
• Measure: Pearson product-moment correlation, r (sample) or rho (population)
• No correlation: r = 0; maximum correlation: r = -1 or +1
• Critical values: Table A-6
Scatterplots of Paired Data
Figure 10-2
Formula 10-1
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$
The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample.
Calculators can compute r
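As a rough illustration of what a calculator does with Formula 10-1, the sketch below evaluates the sums term by term. The function name `pearson_r` and the five data pairs are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the book.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r, computed with the sums of Formula 10-1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data (not from Table 10-1).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

For these made-up pairs the numerator is 30 and the denominator is the square root of 1500, so r comes out at about 0.775.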
Figure 10-3
Hypothesis Test for a Linear Correlation
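The slides refer to Table A-6 for critical values of r. A commonly used equivalent is to convert r into a t statistic with n - 2 degrees of freedom and test H0: rho = 0 against a t critical value. The sketch below shows that conversion only; the function name and the example values of r and n are assumptions for illustration.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than 2 pairs")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and +1")
    return r / math.sqrt((1 - r * r) / (n - 2))

# Hypothetical sample: r = sqrt(0.6) (about 0.775) from n = 5 pairs.
t = correlation_t_statistic(math.sqrt(0.6), 5)
print(round(t, 3), "with", 5 - 2, "degrees of freedom")
```

The resulting t is then compared with the two-tailed critical value from a t table at the chosen significance level.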
2. Regression
• Follow-up to correlation
• Calculation of the regression line in the scatterplot: the line that best fits the cloud of points
• Purpose: predicting values
Regression
The typical equation of a straight line, y = mx + b, is expressed in the form $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.
The regression equation expresses a relationship between x (called the independent variable, predictor variable, or explanatory variable) and $\hat{y}$ (called the dependent variable or response variable).
Formulas for b0 and b1
Formula 10-2: $$b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \quad \text{(slope)}$$
Formula 10-3: $$b_0 = \bar{y} - b_1 \bar{x} \quad \text{(y-intercept)}$$
Calculators or computers can compute these values.
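A minimal sketch of Formulas 10-2 and 10-3 in code, for readers who want to see the sums spelled out. The function name `regression_line` and the data are hypothetical; the Old Faithful values from Table 10-1 are not reproduced here.

```python
def regression_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Formula 10-2
    b0 = sum_y / n - b1 * (sum_x / n)                              # Formula 10-3
    return b0, b1

# Hypothetical paired data (not from Table 10-1).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = regression_line(xs, ys)
print(f"y_hat = {b0:.1f} + {b1:.1f} x")
```

With these made-up pairs the slope is 30/50 = 0.6 and the intercept is 4 - 0.6 * 3 = 2.2.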
Given the sample data in Table 10-1, find the regression equation.
Example: Old Faithful (continued)
Procedure for Predicting
Figure 10-7
3. Variation and Prediction Intervals
• Follow-up to the regression line
• (Chapter 7) Confidence interval = interval estimate of a population parameter: proportion, mean, variance
• Here: interval estimate of the predicted value of a variable
Key Concept
In this section we proceed to consider a method for constructing a prediction interval, which is an interval estimate of a predicted value of y.
Prediction Interval for an Individual y
$$\hat{y} - E < y < \hat{y} + E$$
where
$$E = t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n(\sum x^2) - (\sum x)^2}}$$
$x_0$ represents the given value of x
$t_{\alpha/2}$ has n – 2 degrees of freedom
Standard Error of Estimate
Definition
The standard error of estimate, denoted by $s_e$, is a measure of the differences (or distances) between the observed sample y-values and the predicted values $\hat{y}$ that are obtained using the regression equation.
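The prediction interval and the standard error of estimate can be combined in one short sketch. It assumes the usual computational definition of the standard error, the square root of the sum of squared residuals divided by n - 2, which the slides describe only in words. The critical value `t_crit` has to be looked up by the reader (3.182 is the two-tailed 0.05 value for 3 degrees of freedom); the function name and data are hypothetical.

```python
import math

def prediction_interval(xs, ys, x0, t_crit):
    """Return (lower, upper) limits of the prediction interval for y at x = x0."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 pairs so that n - 2 > 0")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    x_bar = sum_x / n

    # Regression line (Formulas 10-2 and 10-3).
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * x_bar
    y_hat = b0 + b1 * x0

    # Standard error of estimate (assumed definition: residual SS over n - 2).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))

    # Margin of error E from the prediction interval formula.
    margin = t_crit * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat - margin, y_hat + margin

# Hypothetical data; t_crit = 3.182 for n - 2 = 3 degrees of freedom, alpha = 0.05.
low, high = prediction_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x0=3, t_crit=3.182)
print(round(low, 3), round(high, 3))
```

Because the margin grows with the distance between x0 and the sample mean of x, predictions far from the center of the data get wider intervals.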
4. Rank Correlation
• Nonparametric method = distribution-free test = no assumptions about the distribution in the population
• Test of association between two variables
• Spearman's: r_s (sample) or, for the population, rho_s
• Procedure in Figure 10-10 (p. 537)
Example
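One way to compute Spearman's rank correlation is to replace each value by its rank (tied values share the mean of their ranks) and then apply the ordinary linear correlation formula to the ranks. The sketch below follows that route; it is not the flowchart from the book, and the function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values get the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's r_s: the linear correlation coefficient of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    sum_x, sum_y = sum(rx), sum(ry)
    sum_xy = sum(a * b for a, b in zip(rx, ry))
    sum_x2 = sum(a * a for a in rx)
    sum_y2 = sum(b * b for b in ry)
    return (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    )

# Hypothetical data without ties.
rs = spearman_rs([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
print(round(rs, 3))
```

When there are no ties, this agrees with the shortcut 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the difference between paired ranks; for the data above the sum of d^2 is 4 and r_s is 0.8.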
1. Goodness-of-fit: multinomial
2. Contingency tables
3. Analysis of variance (ANOVA)
Chapter 11: Multinomial Experiments and Contingency Tables
Overview
We focus on analysis of categorical (qualitative or attribute) data that can be separated into different categories (often called cells).
Use the $\chi^2$ (chi-square) test statistic (Table A-4).
The goodness-of-fit test uses a one-way frequency table (single row or column).
The contingency table uses a two-way frequency table (two or more rows and columns).
1. Goodness-of-fit: multinomial
• Does the observed distribution of a nominal variable agree with an expected distribution?
• H0: p1 = x, p2 = y, p3 = z, p4 = etc.
• H1: At least one of the observed proportions differs from the expected probability.
Goodness-of-Fit Test in Multinomial Experiments
Critical Values
1. Found in Table A-4 using k – 1 degrees of freedom, where k = number of categories.
2. Goodness-of-fit hypothesis tests are always right-tailed.
Test Statistic
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
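The goodness-of-fit statistic is short enough to compute by hand, but a small sketch makes the role of the expected frequencies explicit: each E is the sample size times the probability claimed in H0. The function name and the observed counts are hypothetical; the counts of Table 11-2 are not reproduced here.

```python
def chi_square_gof(observed, expected_probs):
    """Return (chi2, degrees_of_freedom) for a goodness-of-fit test."""
    if len(observed) != len(expected_probs):
        raise ValueError("one expected probability per category is required")
    n = sum(observed)
    expected = [n * p for p in expected_probs]  # E = n * p for each category
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1              # k - 1 degrees of freedom

# Hypothetical counts for 5 categories claimed to be equally likely.
observed = [18, 22, 20, 25, 15]
chi2, df = chi_square_gof(observed, [0.2] * 5)
print(round(chi2, 3), df)
```

Here every expected frequency is 20, so the statistic is 58/20 = 2.9 with 4 degrees of freedom. Because the test is right-tailed, H0 is rejected only when the statistic exceeds the table value (9.488 for 4 degrees of freedom at the 0.05 level), which is not the case for these made-up counts.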
Example: Last Digit Analysis
Test the claim that the digits in Table 11-2 do not occur with the same frequency.
Relationships Among the $\chi^2$ Test Statistic, P-Value, and Goodness-of-Fit
Figure 11-3
2. Contingency Tables
• In this section we consider contingency tables (or two-way frequency tables), which include frequency counts for categorical data arranged in a table with at least two rows and at least two columns.
• We present a method for testing the claim that the row and column variables are independent of each other.
• We will use the same method for a test of homogeneity, whereby we test the claim that different populations have the same proportion of some characteristics.
                            Black    White    Yellow/Orange    Row Totals
Controls (not injured)      491      377      31               899
Cases (injured or killed)   213      112      8                333
Column Totals               704      489      39               1232
Case-Control Study of Motorcycle Drivers
Expected frequency of a cell:
$$E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}$$
For the upper left-hand cell:
$$E = \frac{(899)(704)}{1232} = 513.714$$
                            Black      White      Yellow/Orange    Row Totals
Controls (not injured)      491        377        31               899
  Expected                  513.714    356.827    28.459
Cases (injured or killed)   213        112        8                333
  Expected                  190.286    132.173    10.541
Column Totals               704        489        39               1232
$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(491 - 513.714)^2}{513.714} + \cdots + \frac{(8 - 10.541)^2}{10.541} = 8.775$$
H0: Row and column variables are independent.
H1: Row and column variables are dependent.
The test statistic is $\chi^2 = 8.775$.
$\alpha = 0.05$
The number of degrees of freedom is (r–1)(c–1) = (2–1)(3–1) = 2.
The critical value (from Table A-4) is $\chi^2_{.05,2} = 5.991$.
We reject the null hypothesis. It appears there is an association between helmet color and motorcycle safety.
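The whole helmet-color test can be retraced in a few lines. The sketch below uses the observed counts from the case-control table, derives every expected frequency as (row total)(column total)/(grand total), and compares the statistic with the critical value 5.991 quoted from Table A-4. Variable names are illustrative only.

```python
# Observed counts: rows = controls, cases; columns = black, white, yellow/orange.
observed = [
    [491, 377, 31],
    [213, 112, 8],
]

row_totals = [sum(row) for row in observed]        # 899, 333
col_totals = [sum(col) for col in zip(*observed)]  # 704, 489, 39
grand_total = sum(row_totals)                      # 1232

# Expected frequency of each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)

critical_value = 5.991  # Table A-4, alpha = 0.05, 2 degrees of freedom
reject_h0 = chi2 > critical_value

print(round(expected[0][0], 3), round(chi2, 3), df, reject_h0)
```

Small differences in the last decimal of the statistic can arise from rounding the expected frequencies before summing.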
Figure 11-4
3. Analysis of Variance (ANOVA)
• ANalysis Of VAriance
• H0: several population means are equal
• F distribution (Table A-7)
• Test based on the P-value
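The slides name only the F distribution and the P-value. As a hedged sketch of where the F statistic comes from in a one-way ANOVA, the code below divides the variance between the sample means by the variance within the samples. The function name and the three samples are hypothetical, and the P-value itself still has to come from Table A-7 or from software.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation between the group means, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical samples from three populations.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(round(f_stat, 3), df_b, df_w)
```

A large F (here 13 with 2 and 6 degrees of freedom) corresponds to a small P-value and therefore to evidence against equal population means.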
FINALLY: Bayesian statistics
• Texts and 2 assignments (to be handed out)
• 1. Intuitive approach
• 2. Formal approach
Example problem
• Given: in Orange County, USA, 51% of the population is male; 9.5% of the men smoke cigars, compared with 1.7% of the women
• Asked: what is the probability that a randomly selected cigar smoker is a man?
1. Intuitive approach
2. Formal approach
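For the formal approach, Bayes' theorem can be applied directly to the three percentages given in the example problem. The sketch below assumes that everyone in the population is counted as either a man or a woman, so that the proportion of women is 1 - 0.51 = 0.49. Variable names are illustrative.

```python
# Figures given in the example problem.
p_man = 0.51
p_cigar_given_man = 0.095
p_cigar_given_woman = 0.017

# Assumption: the population consists of men and women only.
p_woman = 1 - p_man

# Total probability that a randomly selected person smokes cigars.
p_cigar = p_man * p_cigar_given_man + p_woman * p_cigar_given_woman

# Bayes' theorem: P(man | cigar smoker).
p_man_given_cigar = p_man * p_cigar_given_man / p_cigar

print(round(p_man_given_cigar, 3))
```

The intuitive approach gives the same figure: out of 100,000 people, 51,000 are men, of whom 4,845 smoke cigars, and 49,000 are women, of whom 833 smoke cigars, so 4,845 of the 5,678 cigar smokers are men, which is about 0.853.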
End of preview
• Next week (week 6):
  – Question hour
  – No new material
  – Preparation for the practice exam
• Week 7: Monday 15 October
  – Friday group: discussion of exercises instead of Friday 12 October (due to Joris's absence)