PSYC 6130C UNIVARIATE ANALYSIS Prof. James Elder


Introduction

PSYC 6130, PROF. J. ELDER 3

What is (are) statistics?

• A branch of mathematics concerned with understanding and summarizing collections of numbers

• A collection of numerical facts

• Estimates of population parameters, derived from samples


What is this course about?

• Applied statistics

• Emphasizes methods, not proofs

• Descriptive statistics

• Inferential statistics


Fall Term

Date | Title | Readings | Notes
10-Sep-08 | Introduction; Probability; Descriptive statistics | 1.1-1.3; 5.1-5.5, 5.7; 2.1, 2.2, 2.5, 2.7-2.9, 2.12, 2.13 |
17-Sep-08 | The normal distribution | 3.1-3.4 | Lab 1
24-Sep-08 | Introduction to hypothesis testing; t-tests | 4; 7 |
1-Oct-08 | Rosh Hashanah – No Classes | |
8-Oct-08 | t-tests | 7 | Lab 2
15-Oct-08 | Statistical power and effect size | 8 | Assignment 1 due
22-Oct-08 | Correlation and regression | 9 |
29-Oct-08 | One-way independent ANOVA | 11 | Lab 3
5-Nov-08 | Multiple comparisons | 12.1-12.12 |
12-Nov-08 | Multiple comparisons | 12.1-12.12 | Lab 4
19-Nov-08 | Two-way ANOVA | 13.1-13.11, 13.14 | Assignment 2 due
26-Nov-08 | Review | |
3-Dec-08 | Exam | |


Winter Term

Date | Title | Readings | Deadlines
7-Jan-09 | Repeated measures ANOVA | 14 |
14-Jan-09 | Two-way mixed design ANOVA | 14 | Lab 5; deadline for choosing project topic
21-Jan-09 | Reading Week | |
28-Jan-09 | Multiple regression | 15 | Lab 6
4-Feb-09 | The general linear model | 16 | Assignment 3 due; drop date is Feb 1
11-Feb-09 | The binomial distribution | 5.6, 5.8-5.10 | Lab 7
18-Feb-09 | Reading Week – No Classes | |
25-Feb-09 | Chi-square tests | 6 |
4-Mar-09 | Resampling and nonparametric techniques | 18 | Lab 8
11-Mar-09 | Student Presentations | |
18-Mar-09 | Student Presentations | | Assignment 4 due
25-Mar-09 | Review | |
1-Apr-09 | Exam | |

Some Background

(Howell Ch. 1)


Variables and Constants

• Constants are properties that never change (e.g., the speed of light in a vacuum, ~3 × 10^8 m/s).

• Most physiological and psychological parameters of interest vary considerably

– Between individuals (e.g., intelligence quotient)

– Within individuals (e.g., heart rate)

• Any variable whose variation is somewhat unpredictable is called a random variable (rv).


Scales of measurement

• Nominal scale: values are categories, having no meaningful correspondence to numbers.


• Ordinal scale: ordering is meaningful, but exact numerical values (if they exist) are not.


• Interval scale: values are numerically meaningful, and the interval between two values is meaningful.

– Example: Celsius temperature scale. It takes the same amount of energy to raise the temperature of a gram of water from 20 °C to 21 °C as it does to raise it from 30 °C to 31 °C.

• Ratio scale: ratio of two values is also meaningful.

– Example: Kelvin temperature scale. A gram of H2O at 300 K has twice the energy of a gram of H2O at 150 K.

– Ratio scales require a 0-point corresponding to a complete lack of the substance being measured.

• Example: a gram of H2O at 0 K has no heat (particles are motionless).


Continuous vs Discrete Variables

• A continuous variable may assume any real value within some range


• A discrete variable may assume only a countable number of values: intermediate values are not meaningful.


Independent vs Dependent Variables

• Experiments involve independent and dependent variables.

– The independent variable is controlled by the experimenter.

– The dependent variable is measured.

– We seek to detect and model effects of the independent variable on the dependent variable.

• Example: In a visual search task, subjects are asked to find the odd-man-out in a display of discrete items (e.g., a horizontal bar amongst vertical bars).

– The number of items in the display is an independent variable.

– Reaction time is the main dependent variable.

– Typically, we observe a roughly linear relationship between the number of items and the reaction time.


Experimental vs Correlational Research

• Experimental study:

– Researcher controls the independent variable.

– Seek to detect effects on the dependent variable.

– Direction of causation may be inferred (but may be indirect).

• Correlational study:

– There are no independent or dependent variables.

– No variables are under control of the researcher.

– Seek to find statistical relationships (dependencies) between variables.

– Direction of causation may not normally be inferred.


Correlational Studies: Examples


Populations vs Samples

• In human science, we typically want to characterize and make inferences not about a particular person (e.g., Uncle Bob) but about all people, or all people with a certain property (e.g., all people suffering from a bipolar disorder).

• These groups of interest are called populations.

• Typically, these populations are too large and inaccessible to study.

• Instead, we study a subset of the group, called a sample.

• In order to make reliable inferences about the population, samples are ideally randomly selected.

• The population properties of interest are called parameters.

• The corresponding measurements made on our samples are called statistics. Statistics are approximations (estimates) of parameters.


Different Types of Populations and Samples

• Outside of human science, populations do not necessarily refer to humans

– e.g. populations may be of bees, algae, quarks, stock prices, pork belly futures, ozone levels, etc…

• In clinical and social psychology you will often be conducting large-n studies on human populations.

• In cognitive psychology, you will often be doing small-n within-subject studies involving repeated trials on the same subject.

– Here, you may think of the ‘population’ as being the infinite set of responses you would obtain were you able to continue the experiment indefinitely.

– The sample is the set of responses you were able to collect in a finite number of trials (e.g., 5000) on the same subject.


Summation Notation

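Let $X_i$ = number of siblings for respondent $i$

Let $Y_i$ = number of children for respondent $i$

i | X_i | Y_i
1 | 1 | 2
2 | 2 | 1
3 | 2 | 1
… | … | …
N | 4 | 0

Then

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$$

where $N$ = number of respondents in sample.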


Some Summation Rules

1. Often abbreviate $\sum_{i=1}^{N} X_i$ as $\sum X_i$

2. $\sum (X_i + Y_i) = \sum X_i + \sum Y_i$

since $(X_1 + Y_1) + (X_2 + Y_2) = (X_1 + X_2) + (Y_1 + Y_2)$ (commutative and associative properties of addition)

Similarly, $\sum (X_i - Y_i) = \sum X_i - \sum Y_i$

3. $\sum C = NC$, where $C$ is a constant, since adding $C$ to itself $N$ times yields $N$ $C$'s.

4. $\sum C X_i = C \sum X_i$

since $C X_1 + C X_2 = C (X_1 + X_2)$ (multiplication is distributive over addition)

But note that

5. $\sum X_i Y_i \neq \sum X_i \sum Y_i$

since $X_1 Y_1 + X_2 Y_2 \neq (X_1 + X_2)(Y_1 + Y_2) = X_1 Y_1 + X_1 Y_2 + X_2 Y_1 + X_2 Y_2$
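The summation rules can be checked numerically. The following is a minimal sketch on made-up scores (the lists X and Y and the constant C are illustrative assumptions, not the survey data):

```python
# Hypothetical scores, chosen only for illustration.
X = [1, 2, 2, 4]
Y = [2, 1, 1, 0]
C = 3
N = len(X)

# Rule 2: the sum of sums (differences) equals the sum (difference) of the sums.
rule2_sum = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)
rule2_diff = sum(x - y for x, y in zip(X, Y)) == sum(X) - sum(Y)

# Rule 3: summing a constant N times gives N * C.
rule3 = sum(C for _ in range(N)) == N * C

# Rule 4: a constant factor can be pulled outside the sum.
rule4 = sum(C * x for x in X) == C * sum(X)

# Rule 5: the sum of products is NOT, in general, the product of the sums.
sum_of_products = sum(x * y for x, y in zip(X, Y))
product_of_sums = sum(X) * sum(Y)

print(rule2_sum, rule2_diff, rule3, rule4)
print(sum_of_products, product_of_sums)
```

The two numbers on the last line differ, which is the point of rule 5: the product of the sums also contains the cross terms.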


Summary

• What is (are) statistics

• Variables and constants

• Scales of measurement

• Continuous and discrete variables

• Independent and dependent variables

• Experimental and correlational research

• Populations and samples

• Summation Notation

Descriptive Statistics

(Howell, Ch. 2)


Frequency Tables

1991 U.S. General Social Survey: Number of Brothers and Sisters

Value | Frequency | Percent | Valid Percent | Cumulative Percent
0 | 74 | 4.88 | 4.92 | 4.92
1 | 236 | 15.56 | 15.68 | 20.60
2 | 276 | 18.19 | 18.34 | 38.94
3 | 236 | 15.56 | 15.68 | 54.62
4 | 209 | 13.78 | 13.89 | 68.50
5 | 118 | 7.78 | 7.84 | 76.35
6 | 80 | 5.27 | 5.32 | 81.66
7 | 81 | 5.34 | 5.38 | 87.04
8 | 58 | 3.82 | 3.85 | 90.90
9 | 47 | 3.10 | 3.12 | 94.02
10 | 34 | 2.24 | 2.26 | 96.28
11 | 22 | 1.45 | 1.46 | 97.74
12 | 11 | 0.73 | 0.73 | 98.47
13 | 9 | 0.59 | 0.60 | 99.07
14 | 5 | 0.33 | 0.33 | 99.40
15 | 3 | 0.20 | 0.20 | 99.60
16 | 1 | 0.07 | 0.07 | 99.67
17 | 2 | 0.13 | 0.13 | 99.80
18 | 1 | 0.07 | 0.07 | 99.87
21 | 1 | 0.07 | 0.07 | 99.93
26 | 1 | 0.07 | 0.07 | 100.00
Total (valid) | 1505 | 99.21 | 100.00 |
Missing: DK | 4 | 0.26 | |
Missing: NA | 8 | 0.53 | |
Missing: Total | 12 | 0.79 | |
Total | 1517 | 100.00 | |
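The percent columns of the frequency table can be derived from the raw frequencies. The sketch below assumes that "Percent" is taken relative to all 1517 respondents and "Valid Percent" relative to the 1505 valid responses, which is consistent with the totals shown; the variable names are illustrative.

```python
# Frequencies copied from the table (number of siblings -> count).
freq = {
    0: 74, 1: 236, 2: 276, 3: 236, 4: 209, 5: 118, 6: 80, 7: 81,
    8: 58, 9: 47, 10: 34, 11: 22, 12: 11, 13: 9, 14: 5, 15: 3,
    16: 1, 17: 2, 18: 1, 21: 1, 26: 1,
}
n_missing = 4 + 8  # "DK" plus "NA" responses

n_valid = sum(freq.values())
n_total = n_valid + n_missing

rows = []
cumulative = 0
for value in sorted(freq):
    f = freq[value]
    cumulative += f
    rows.append((
        value,
        f,
        100 * f / n_total,           # percent of all respondents
        100 * f / n_valid,           # percent of valid responses
        100 * cumulative / n_valid,  # cumulative valid percent
    ))

for value, f, pct, valid_pct, cum_pct in rows:
    print(f"{value:>2} {f:>4} {pct:6.2f} {valid_pct:6.2f} {cum_pct:7.2f}")
```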


Bar Graphs and Histograms


Grouped Frequency Distributions

• What are the apparent limits?

• What are the real limits?

Statistics Canada 2001 Census: Age of Respondent

X | f
<5 | 581
5 - 9 | 661
10 - 14 | 740
15 - 19 | 701
20 - 24 | 689
25 - 29 | 674
30 - 34 | 731
35 - 39 | 903
40 - 44 | 930
45 - 49 | 838
50 - 54 | 746
55 - 59 | 608
60 - 64 | 434
65 - 69 | 383
70 - 74 | 345
75 - 79 | 288
80 - 84 | 174
85+ | 97


Percentiles and Percentile Ranks

• Percentile: The score at or below which a given % of scores lie.

• Percentile Rank: The percentage of scores at or below a given score


Linear Interpolation to Compute Percentile Ranks

What if you have a 23-year-old respondent and would like to know her percentile rank?

Let $x$ = age (percentile) and $y$ = percentile rank.

Then the linear (affine) interpolation model is $y = ax + b$.

There are 2 unknowns ($a$ and $b$). If we have two data points $(x_1, y_1)$ and $(x_2, y_2)$ near the value of interest, we can solve:

$$y_1 = a x_1 + b, \qquad y_2 = a x_2 + b \quad\Rightarrow\quad a = \frac{y_2 - y_1}{x_2 - x_1}$$

Thus

$$y = ax + b = ax + y_1 - a x_1 = y_1 + a(x - x_1) = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1)$$

Statistics Canada 2001 Census: Age of Respondent

Age | Frequency | Percent | Cumulative Percent
<5 | 581 | 5.5 | 5.5
5 - 9 | 661 | 6.3 | 11.8
10 - 14 | 740 | 7.0 | 18.8
15 - 19 | 701 | 6.7 | 25.5
20 - 24 | 689 | 6.5 | 32.0
25 - 29 | 674 | 6.4 | 38.4
30 - 34 | 731 | 6.9 | 45.4
35 - 39 | 903 | 8.6 | 54.0
40 - 44 | 930 | 8.8 | 62.8
45 - 49 | 838 | 8.0 | 70.8
50 - 54 | 746 | 7.1 | 77.9
55 - 59 | 608 | 5.8 | 83.6
60 - 64 | 434 | 4.1 | 87.8
65 - 69 | 383 | 3.6 | 91.4
70 - 74 | 345 | 3.3 | 94.7
75 - 79 | 288 | 2.7 | 97.4
80 - 84 | 174 | 1.7 | 99.1
85+ | 97 | 0.9 | 100.0
Total | 10523 | 100.0 |
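As a worked instance of the interpolation formula, here is a minimal sketch for the 23-year-old respondent. It assumes the real limits of the 20 - 24 interval are 19.5 and 24.5 (the slides leave the real limits as a question, so a different convention would give a slightly different answer); the function name interpolate is illustrative.

```python
def interpolate(x, x1, y1, x2, y2):
    """Linearly interpolate y at x from two known points (x1, y1), (x2, y2)."""
    a = (y2 - y1) / (x2 - x1)  # slope of the line through the two points
    return y1 + a * (x - x1)

# Assumed real limits of the 20-24 interval: 19.5 and 24.5.
# Cumulative percents at those limits, read from the table: 25.5 and 32.0.
rank_23 = interpolate(23, 19.5, 25.5, 24.5, 32.0)
print(round(rank_23, 2))
```

Under these assumptions the respondent's percentile rank comes out at roughly 30.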


Linear Interpolation to Compute Percentiles

What if you want to know what the median age is? (Statistics Canada 2001 Census, Age of Respondent: same frequency table as on the previous slide.)

To compute percentiles, simply swap the x's and y's in the formula:

$$x = x_1 + \frac{x_2 - x_1}{y_2 - y_1}(y - y_1)$$
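The swapped formula can be applied to the whole cumulative table by first locating the interval that brackets the desired percentile rank. This is a sketch under the assumption that real limits lie half a year beyond the apparent limits (e.g., 34.5 for the 30 - 34 interval); the open-ended 85+ interval is left out because it has no upper limit. The names cumulative and percentile are illustrative.

```python
# (assumed upper real limit, cumulative percent) pairs read from the age table.
cumulative = [
    (4.5, 5.5), (9.5, 11.8), (14.5, 18.8), (19.5, 25.5), (24.5, 32.0),
    (29.5, 38.4), (34.5, 45.4), (39.5, 54.0), (44.5, 62.8), (49.5, 70.8),
    (54.5, 77.9), (59.5, 83.6), (64.5, 87.8), (69.5, 91.4), (74.5, 94.7),
    (79.5, 97.4), (84.5, 99.1),
]

def percentile(p, table):
    """Return the score at percentile rank p by linear interpolation."""
    for (x1, y1), (x2, y2) in zip(table, table[1:]):
        if y1 <= p <= y2:
            return x1 + (x2 - x1) / (y2 - y1) * (p - y1)
    raise ValueError("p lies outside the tabulated range")

median_age = percentile(50, cumulative)
print(round(median_age, 1))
```

The interpolated median is a little above 37, in line with the median of 37 reported for these data.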


Measures of Central Tendency

• The mode – applies to ratio, interval, ordinal or nominal scales.

• The median – applies to ratio, interval and ordinal scales

• The mean – applies to ratio and interval scales

Variable | Mean | Median | Mode
AGE | 37.1 | 37 | 41


The Mode

• Defined as the most frequent value (the peak)

• Applies to ratio, interval, ordinal and nominal scales

• Sensitive to sampling error (noise)

• Distributions may be referred to as unimodal, bimodal or multimodal, depending upon the number of peaks

Mode = 41


The Median

• Defined as the 50th percentile

• Applies to ratio, interval and ordinal scales

• Can be used for open-ended distributions

Median = 37


The Mean

• Applies only to ratio or interval scales

• Sensitive to outliers

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$

Sample mean: $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$

$\bar{X} = 37.1$
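The three measures of central tendency, and the mean's sensitivity to outliers, can be illustrated with Python's standard statistics module. The scores below are hypothetical and chosen only to make the contrast visible.

```python
import statistics

scores = [2, 3, 3, 4, 5, 6, 7]          # hypothetical scores
with_outlier = [2, 3, 3, 4, 5, 6, 70]   # same scores, largest replaced by an outlier

for label, data in [("original", scores), ("with outlier", with_outlier)]:
    print(
        label,
        "mode:", statistics.mode(data),
        "median:", statistics.median(data),
        "mean:", round(statistics.mean(data), 2),
    )
```

The mode and median are unchanged by the outlier, while the mean is pulled strongly toward it.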


Properties of the Mean

1. Suppose a constant $C$ is added (or subtracted) to every score in your sample: $X_i \to X_i + C$

Then the mean also increases (decreases) by $C$: $\bar{X} \to \bar{X} + C$

2. Suppose every score in your sample is multiplied (divided) by a constant $C$: $X_i \to C X_i$

Then the mean is also multiplied (divided) by $C$: $\bar{X} \to C\bar{X}$

3. $\sum (X_i - \bar{X}) = 0$
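These three properties of the mean are easy to confirm numerically. A minimal sketch, with hypothetical scores X and constant C:

```python
from statistics import mean

X = [3, 5, 8, 12]   # hypothetical scores
C = 4               # hypothetical constant

original_mean = mean(X)
shifted_mean = mean([x + C for x in X])   # property 1: add C to every score
scaled_mean = mean([C * x for x in X])    # property 2: multiply every score by C
deviation_sum = sum(x - original_mean for x in X)  # property 3

print(original_mean, shifted_mean, scaled_mean, deviation_sum)
```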


Properties of the Mean (Cntd…)

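Least-squares property: the mean minimizes the sum of squared deviations:

$$\sum (X_i - \bar{X})^2 \le \sum (X_i - X)^2 \quad \forall X$$

Proof:

$\sum (X_i - X)^2$ has a minimum where

$$\frac{d}{dX}\sum (X_i - X)^2 = 0 \quad\text{and}\quad \frac{d^2}{dX^2}\sum (X_i - X)^2 > 0$$

$$\frac{d}{dX}\sum (X_i - X)^2 = -2\sum (X_i - X) = 0 \quad\Rightarrow\quad X = \frac{1}{N}\sum X_i = \bar{X}$$

$$\frac{d^2}{dX^2}\sum (X_i - X)^2 = 2N > 0$$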


Measures of Variability (Dispersion)

• Range – applies to ratio, interval, ordinal scales

• Semi-interquartile range – applies to ratio, interval, ordinal scales

• Variance (standard deviation) – applies to ratio, interval scales


Range

• Interval between lowest and highest values

• Generally unreliable – changing one value (highest or lowest) can cause large change in range.

Range = 79 drinks


Semi-Interquartile Range

• The interquartile range is the interval between the first and third quartile, i.e. between the 25th and 75th percentile.

• The semi-interquartile range is half the interquartile range.

• Can be used with open-ended distributions

• Unaffected by extreme scores

N Valid | 19769
N Missing | 6004
Median | 4
Percentile 25 | 2
Percentile 50 | 4
Percentile 75 | 7

SIQ = 2.5 drinks
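The contrast between the range and the semi-interquartile range can be sketched as follows. The drink counts are hypothetical, and the quartiles use the "inclusive" method of Python's statistics.quantiles; other quartile conventions exist and may give slightly different values.

```python
import statistics

def value_range(data):
    """Interval between the lowest and highest values."""
    return max(data) - min(data)

def semi_interquartile_range(data):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 - q1) / 2

drinks = [0, 1, 2, 2, 3, 4, 4, 5, 7, 8, 9]   # hypothetical weekly drink counts
extreme = drinks[:-1] + [79]                 # same data with one extreme score

print(value_range(drinks), semi_interquartile_range(drinks))
print(value_range(extreme), semi_interquartile_range(extreme))

# With the quartiles reported on the slide (25th percentile = 2, 75th = 7):
siq_slide = (7 - 2) / 2
print(siq_slide)
```

A single extreme score changes the range a great deal but leaves the semi-interquartile range untouched.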


Population Variance and Standard Deviation

$X_i - \mu$ is known as the deviation of sample $i$.

Thus $SS = \sum (X_i - \mu)^2$ is known as the sum of squared deviations.

The population variance is simply the mean squared deviation:

$$\sigma^2 = \frac{1}{N}\sum (X_i - \mu)^2$$

The population standard deviation is simply the square root of the variance:

$$\sigma = \sqrt{\frac{1}{N}\sum (X_i - \mu)^2}$$

The standard deviation is particularly sensitive to outliers, due to the squaring operation.


Sample Variance and Standard Deviation

$X_i - \bar{X}$ is known as the deviation of sample $i$.

Thus $SS = \sum (X_i - \bar{X})^2$ is known as the sum of squared deviations.

The mean squared sample deviation $\frac{1}{N}\sum (X_i - \bar{X})^2$ is a biased estimator of the population variance $\sigma^2$ - it tends to underestimate $\sigma^2$.

A minor modification makes the sample variance $s^2$ unbiased:

$$s^2 = \frac{1}{N-1}\sum (X_i - \bar{X})^2$$

The corrected sample standard deviation is given by

$$s = \sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2}$$

$s$ is not an unbiased estimator of $\sigma$, but is close enough for most purposes.
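The bias of the mean squared sample deviation can be shown exactly on a toy example by enumerating every possible sample. The sketch below assumes a hypothetical four-value population and samples of size 2 drawn with replacement (so the draws are independent); exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

def mean(xs):
    return Fraction(sum(xs), len(xs))

def sum_sq_dev(xs, center):
    """Sum of squared deviations of xs from center."""
    return sum((Fraction(x) - center) ** 2 for x in xs)

population = [2, 4, 6, 8]                 # hypothetical population
mu = mean(population)
sigma2 = sum_sq_dev(population, mu) / len(population)   # population variance

n = 2                                     # sample size
samples = list(product(population, repeat=n))  # all samples, with replacement

# Average each estimator over every possible sample.
avg_biased = sum(sum_sq_dev(s, mean(s)) / n for s in samples) / len(samples)
avg_unbiased = sum(sum_sq_dev(s, mean(s)) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_biased, avg_unbiased)
```

Dividing by N underestimates the population variance on average (here by a factor of (N-1)/N), whereas dividing by N-1 recovers it exactly.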


Degrees of Freedom

The degrees of freedom ($df$) is the number of independent measurements available for estimating a population parameter.

The calculation of $s^2$ involves $\bar{X}$. Knowing $\bar{X}$ and $N-1$ of the sample values allows us to infer the value of the remaining sample value. Thus only $N-1$ of the sample values are independent, and $df = N-1$.
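To make the degrees-of-freedom argument concrete: once the sample mean is known, the last sample value is fully determined by the other N-1. A small sketch with hypothetical numbers:

```python
known = [4, 7, 9]    # N - 1 = 3 of the N = 4 sample values (hypothetical)
sample_mean = 6      # the known sample mean (hypothetical)
N = len(known) + 1

# Since the N values must sum to N * mean, the remaining value is forced.
missing = N * sample_mean - sum(known)
print(missing)
```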


Computational Formulas for Variance

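The deviational formula for the sum of squares: $SS = \sum (X_i - \bar{X})^2$

More efficient to use the computational formula: $SS = \sum X_i^2 - N\bar{X}^2$

Why are these equivalent?

$$\sum (X_i - \bar{X})^2 = \sum (X_i^2 - 2X_i\bar{X} + \bar{X}^2)$$

$$= \sum X_i^2 - 2\bar{X}\sum X_i + \sum \bar{X}^2$$

$$= \sum X_i^2 - 2N\bar{X}^2 + N\bar{X}^2$$

$$= \sum X_i^2 - N\bar{X}^2$$

Thus

$$s^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$$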
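The equivalence of the deviational and computational formulas can be checked on a small hypothetical sample:

```python
X = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores
N = len(X)
mean = sum(X) / N

# Deviational formula: sum of squared deviations from the mean.
ss_deviational = sum((x - mean) ** 2 for x in X)

# Computational formula: sum of squares minus N times the squared mean.
ss_computational = sum(x ** 2 for x in X) - N * mean ** 2

s_squared = ss_computational / (N - 1)
print(ss_deviational, ss_computational, s_squared)
```

One caveat when using floating-point arithmetic: the computational formula subtracts two potentially large, nearly equal numbers, so it can lose precision when the mean is large relative to the spread.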


Properties of the Standard Deviation

1. Suppose a constant $C$ is added (or subtracted) to every score in your sample: $X_i \to X_i + C$

Then the standard deviation does not change.


Properties of the Standard Deviation (cntd…)

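2. Suppose every score in your sample is multiplied (divided) by a constant $C$: $X_i \to C X_i$

Then the standard deviation is also multiplied (divided) by $C$ (assuming $C > 0$; in general the factor is $|C|$): $s \to Cs$

Proof:

$$s_{old} = \sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2}$$

$$s_{new} = \sqrt{\frac{1}{N-1}\sum (CX_i - C\bar{X})^2} = C\sqrt{\frac{1}{N-1}\sum (X_i - \bar{X})^2} = C s_{old}$$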
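Both properties of the standard deviation can be verified numerically. A minimal sketch with hypothetical scores; the last line also illustrates that a negative multiplier scales the standard deviation by its absolute value.

```python
import math
from statistics import stdev

X = [3, 5, 8, 12]   # hypothetical scores
C = 4               # hypothetical positive constant

s = stdev(X)                              # corrected sample standard deviation
s_shifted = stdev([x + C for x in X])     # property 1: add C
s_scaled = stdev([C * x for x in X])      # property 2: multiply by C
s_negated = stdev([-C * x for x in X])    # multiply by -C

print(math.isclose(s_shifted, s))
print(math.isclose(s_scaled, C * s))
print(math.isclose(s_negated, C * s))
```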


Standard Deviation Example

$\bar{X}$ = 5.7 drinks

$s$ = 5.8 drinks

cf. SIQ = 2.5 drinks

range = 79 drinks


Skew

• The mean and median are identical for symmetric distributions.

• Skew tends to push the mean away from the median, toward the tail (but not always)

Median=3

Mean=6.7


Skewness

• Properties of skewness

– Positive for positive skew (tail to the right)

– Negative for negative skew (tail to the left)

– Dimensionless

– Invariant to shifting or scaling data (adding or multiplying constants)

$$\text{Sample skewness} = \frac{N}{(N-1)(N-2)} \cdot \frac{\sum (X_i - \bar{X})^3}{s^3}$$
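The listed properties of skewness can be checked with a direct implementation. This sketch assumes the bias-adjusted form N/((N-1)(N-2)) times the sum of cubed deviations over s cubed, with s the corrected sample standard deviation; the function name and the data are illustrative.

```python
import math
from statistics import mean, stdev

def sample_skewness(data):
    """Bias-adjusted sample skewness; requires at least 3 values."""
    n = len(data)
    m = mean(data)
    s = stdev(data)   # corrected (N - 1) sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum((x - m) ** 3 for x in data) / s ** 3

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]             # long tail to the right
left_tail = [-x for x in right_tail]      # mirror image: tail to the left
rescaled = [3 * x + 7 for x in right_tail]

print(sample_skewness(symmetric))
print(sample_skewness(right_tail), sample_skewness(left_tail))
print(math.isclose(sample_skewness(rescaled), sample_skewness(right_tail)))
```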


Dealing with Outliers

• Trimming:

– Throw out the top and bottom k% of values (e.g., k = 5).

– May be justified if there is evidence for a confounding process interfering with the dependent variable being studied

• Example: participant blinks during presentation of a visual stimulus

• Example: participant misunderstands a question on a questionnaire.

• Transforming

– Scores are transformed by some function (e.g., log, square root)

– Often done to reduce or eliminate skewness
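Both ways of dealing with outliers can be sketched in a few lines. The reaction times below are hypothetical (one trial is far too slow, as might happen after a blink), and the helper trim is an illustrative name, not a library function.

```python
import math
from statistics import mean

def trim(data, k_percent):
    """Drop the lowest and highest k_percent of values (integer percent)."""
    ordered = sorted(data)
    n_drop = len(ordered) * k_percent // 100   # values removed from each end
    return ordered[n_drop:len(ordered) - n_drop]

# Hypothetical reaction times in milliseconds, with one extreme trial.
rt_ms = [310, 320, 335, 340, 350, 355, 360, 370, 380, 400,
         410, 420, 430, 450, 470, 490, 520, 560, 640, 2900]

trimmed = trim(rt_ms, 5)                   # trimming: drop top and bottom 5%
log_rt = [math.log(x) for x in rt_ms]      # transforming: natural log

print(round(mean(rt_ms), 1), round(mean(trimmed), 1))
print(round(max(rt_ms) / min(rt_ms), 2), round(max(log_rt) / min(log_rt), 2))
```

Trimming removes the extreme trial outright, while the log transform keeps every score but compresses the long right tail.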


Log-Transforming Data

[Figure: two histograms, labelled skewness = 0.67 and skewness = 0.08]

End of Lecture 1

Sept 10, 2008


Kurtosis

kurtosis > 0: leptokurtic (Laplacian)

kurtosis = 0: mesokurtic (Gaussian)

kurtosis < 0: platykurtic

$$\text{Sample kurtosis} = \frac{N(N+1)}{(N-1)(N-2)(N-3)} \cdot \frac{\sum (X_i - \bar{X})^4}{s^4} - \frac{3(N-1)^2}{(N-2)(N-3)}$$
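A direct implementation of the sample kurtosis shows the sign convention at work. This sketch assumes the bias-adjusted excess-kurtosis form, with s the corrected sample standard deviation; the function name and both data sets are illustrative.

```python
from statistics import mean, stdev

def sample_kurtosis(data):
    """Bias-adjusted sample excess kurtosis; requires at least 4 values."""
    n = len(data)
    m = mean(data)
    s = stdev(data)   # corrected (N - 1) sample standard deviation
    fourth = sum((x - m) ** 4 for x in data) / s ** 4
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

flat = list(range(1, 11))                        # evenly spread: platykurtic
heavy_tails = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]  # peaked with heavy tails

print(round(sample_kurtosis(flat), 3))
print(sample_kurtosis(heavy_tails) > 0)
```

The evenly spread data give a negative value (platykurtic), while the peaked, heavy-tailed data give a positive value (leptokurtic).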


Summary

• Measures of central tendency

• Measures of dispersion

• Skew

• Kurtosis
