Class 1: Sept. 9

• About instructor: Dylan Small, Assistant Professor, Department of Statistics.

• How I got interested in statistics.

My Current Research

• Statistical methods for comparing treatments/policies when a perfectly controlled randomized experiment cannot be done, using the method of “instrumental variables.” Applications to:

– Treatment of depression among the elderly in primary care practices

– Food policy in developing countries

• Statistical methods for panel studies, that is, studies that survey the same people repeatedly over time.

– Prediction of child morbidity/mortality in Pakistan using previous height and weight measurements.

Course Objectives

• To learn how to use two important statistical tools to analyze data: Regression and Analysis of Variance

• To get hands-on experience analyzing data and computing with data (using JMP)

• To gain experience in interpreting the results of a statistical analysis and communicating the results to others

Course requirements

• Responsible for both the material covered in the lecture and the reading associated with the lecture.

• Weekly homework, typically handed out on Thursday, due the following Thursday at the beginning of class. Late homework will be given at most half credit.

• Project: Analysis of a data set of interest to you using regression. Work in groups of 2-3 people. Final report, class presentation. More details in October.

• Midterm: Tuesday, October 21, 3:00 pm-4:20 pm

• Final: Tuesday, December 21, 8:30 am-10:30 am

Grading

• Grades will be based on:

– 20% Homework

– 30% Project

– 20% Midterm

– 30% Final

Web site/Textbooks

• Web site: http://www-stat.wharton.upenn.edu/~dsmall/stat112-f04 (can be reached by going to http://www-stat.wharton.upenn.edu , clicking on courses and clicking on Stat 112).

• Textbooks:

– Moore and McCabe, Introduction to the Practice of Statistics, 4th edition (Required). We will be covering Chapter 2, part of Chapter 3 and Chapters 10-13.

– JMP version 5 with handbook. Highly recommended. If you do not own it, you need to sign up for a Wharton account and use it in the Wharton labs.

– JMP manual for Introduction to the Practice of Statistics. Recommended.

Instructor Accessibility

• E-mail address: dsmall@wharton.upenn.edu

• My office hours (office: 464 Huntsman Hall):

– Tuesdays and Thursdays after class, 4:30-5:30.

– By appointment. I will be happy to meet with you if you send me an e-mail to arrange a time.

• I encourage you to come see me at least once during the semester to chat about your background, interests, concerns about the class and future plans.

• TA: Lie Wang, office hours TBA

• Stat Lab: Monday-Thursday, 9-3; Friday, 11-5

Class 1

• Reading: Introduction to Chapter 2, Chapter 2.1

• Topic: Relationships between variables measured on the same unit.

• Unit could be an individual, a state, a company, a year, etc.

• Data set: Penn Alcohol data set (pennalcohol.JMP under datasets on website). Survey given to 123 Penn undergraduates.

• Alcohol use: Number of days per month on which the person drinks.

[Figure: Distribution of Alcohol Use (horizontal axis: 0 to 25)]

[Figure: Bivariate Fit of Alcohol Use By Sensation Seeking (vertical axis: Alcohol Use, 0 to 25; horizontal axis: Sensation Seeking, -0.1 to 1.1)]

[Figure: Bivariate Fit of Alcohol Use By Life Change (vertical axis: Alcohol Use, 0 to 25; horizontal axis: Life Change, 500 to 3000)]

Association

• Two variables measured on the same unit are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.

• Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.

• Two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa.
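One simple way to make the direction of an association concrete, sketched here in Python rather than JMP, is to multiply each unit's deviation from the average of X by its deviation from the average of Y and look at the sign of the total. The function name `association_direction` and the data values are hypothetical, made up for illustration; they are not taken from the Penn Alcohol data set, and this is only one possible way to check the definitions above.

```python
from statistics import mean


def association_direction(x, y):
    """Classify the direction of association between paired values x and y.

    A product of deviations is positive when a unit is above average on both
    variables or below average on both, and negative when it is above average
    on one and below average on the other.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 units")
    x_bar, y_bar = mean(x), mean(y)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no clear direction"


# Made-up measurements on five units (illustration only).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]     # tends to rise with x
y_down = [6, 4, 5, 4, 2]   # tends to fall as x rises

print(association_direction(x, y_up))    # positive
print(association_direction(x, y_down))  # negative
```

The sign only describes direction. It says nothing about how strong the association is.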

Strength of association

• Strength of the association: A measure of how strong the positive or negative association is. Statistical associations are overall tendencies, not ironclad rules.

• If there is a strong association between two variables, then knowing one helps a lot in predicting the other. But when there is a weak association, information about one variable does not help much in guessing the other.

Association does not have to be linear or unidirectional

• Relationship between gas mileage (miles per gallon) and the speed at which a car is driven:

[Figure: Bivariate Fit of Mileage By Speed (vertical axis: Mileage, 5 to 40; horizontal axis: Speed, 0 to 110)]
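A relationship that rises and then falls can be illustrated with a small Python sketch. The speed and mileage values below are made up to have a rise-then-fall shape (they are not read off the plot), and `deviation_product_sum` is a hypothetical helper that adds up the products of deviations from the averages: a positive total suggests a positive association, a negative total a negative one.

```python
from statistics import mean


def deviation_product_sum(x, y):
    """Sum of (x - average of x) * (y - average of y) over all paired units."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


# Made-up values: mileage climbs up to a moderate speed, then drops.
speed = [10, 20, 30, 40, 50, 60, 70, 80, 90]
mileage = [15, 24, 30, 34, 37, 34, 30, 24, 15]

slow = deviation_product_sum(speed[:5], mileage[:5])   # speeds 10 to 50
fast = deviation_product_sum(speed[4:], mileage[4:])   # speeds 50 to 90
overall = deviation_product_sum(speed, mileage)        # all speeds together

print(slow)     # 540  -> positive association at low speeds
print(fast)     # -540 -> negative association at high speeds
print(overall)  # 0    -> no single direction, yet the variables are clearly related
```

In this made-up example, knowing the speed still tells you a lot about the mileage, even though the association is neither positive nor negative overall.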


Response and Explanatory Variable

• Response variable (Y) measures outcome of study.

• Explanatory variable (X) explains or causes change in the response variable.

• Y = gas mileage (miles per gallon), X = speed at which the car is driven.

• Response and explanatory variables in alcohol study?

Scatterplots

• A scatterplot shows the relationship between two quantitative variables measured on the same units. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each unit in the data appears as the point in the plot fixed by the values of both variables for that unit.

• Always plot the explanatory variable, if there is one, on the horizontal axis (the x axis of the scatterplot).
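To see how each unit becomes one point, here is a minimal Python sketch that draws a rough text scatterplot, with the explanatory variable on the horizontal axis and the response on the vertical axis. The function `text_scatterplot` and the data are hypothetical, for illustration only; the course itself uses JMP for plotting.

```python
def text_scatterplot(x, y, width=40, height=10):
    """Return a list of text lines forming a crude scatterplot.

    x is the explanatory variable (horizontal axis) and y is the response
    (vertical axis). Each unit is drawn as '*' at the position fixed by its
    (x, y) pair; units that land in the same cell share one '*'.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be paired and non-empty")
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each value to a column/row index inside the grid.
        col = 0 if x_max == x_min else round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = 0 if y_max == y_min else round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip so larger y is drawn higher
    return ["|" + "".join(cells) for cells in grid] + ["+" + "-" * width]


# Made-up data: X = hours of practice, Y = test score (illustration only).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 60, 68, 71, 75, 80]
print("\n".join(text_scatterplot(hours, score)))
```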

Scatterplots in JMP

• Click Analyze, Fit Y by X. Left click the response variable (so that it is highlighted) and then left click the Y, response button (so that it appears in the Y, response box). Similarly left click the explanatory variable and then left click the X, factor button. Click OK.

Examining a scatterplot

• Look for the overall pattern of the data and for striking deviations from that pattern.

• The overall pattern of a scatterplot can be described by the form, direction and strength of the relationship.

• An important kind of deviation is an outlier relative to the direction of the scatterplot: a point that falls outside the overall pattern of the relationship.

Brain size and body size in 96 mammals (mammalstudy.JMP)

[Figure: Bivariate Fit of BRAIN By BODY (vertical axis: BRAIN, 0 to 4000; horizontal axis: BODY, 0 to 3000; labeled points: Human being, Dolphin, African elephant, Hippopotamus)]

Labeling points in JMP

• To label a point in a scatterplot in JMP, put the cursor in the column that you want to use to name the point (species in the mammal study), then click Cols and then click Label. Then put the cursor on the row you want to label, click Rows and then click Label.

Association is not causation

• An association between what we call the response variable and what we call the explanatory variable does not prove that changes in the explanatory variable cause changes in the response variable.

• The relationship between two variables can be strongly influenced by other variables that are lurking in the background (lurking variables).

[Figure: Bivariate Fit of Salary of Presbyterian Ministers in MA By Price of Rum (vertical axis: Salary of Presbyterian Ministers in MA, 0 to 50000; horizontal axis: Price of Rum, 0 to 12.5; points labeled by year: 1886, 1926, 1954, 1982, 1998)]

Are the Presbyterian ministers benefiting from the rum trade or supporting it?
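A small Python sketch can show how a lurking variable produces an association between two quantities that have no direct link. It assumes, purely for illustration, that the lurking variable is the general price level rising over time; the slide does not say what the lurking variable is. All numbers and names (`price_level`, `real_rum_price`, `real_salary`, `association_direction`) are hypothetical, not the values behind the plot.

```python
from statistics import mean


def association_direction(x, y):
    """Direction of association from the sign of summed deviation products."""
    x_bar, y_bar = mean(x), mean(y)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no clear direction"


# Assumed lurking variable: the overall price level in five periods (made up).
price_level = [1, 2, 4, 8, 16]

# Made-up "real" values, constructed so they are unrelated to each other.
real_rum_price = [100, 100, 120, 100, 80]
real_salary = [1000, 1100, 1000, 900, 1000]

# What we would observe: both series are scaled up by the price level.
rum_price = [r * p for r, p in zip(real_rum_price, price_level)]
salary = [s * p for s, p in zip(real_salary, price_level)]

print(association_direction(rum_price, salary))  # positive

# Dividing out the price level removes the association in this example.
adjusted_rum = [r / p for r, p in zip(rum_price, price_level)]
adjusted_salary = [s / p for s, p in zip(salary, price_level)]
print(association_direction(adjusted_rum, adjusted_salary))  # no clear direction
```

In this constructed example the observed positive association comes entirely from the shared price level: neither quantity causes the other.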

Key Points from Lecture

• Association: Definition.

• Scatterplots:

– How to examine them.

– How to make them in JMP.

• Association is not causation.

• Next class: 2.2 (correlation), begin 2.3 (least squares regression)
