Statistics Review I Class 13


CLASS OVERVIEW

Levels of Measurement

Measures of Centrality and Dispersion

* Centrality (mean, median, mode)

* Dispersion (range, variance, std. deviation, std. error)

* Z scores and Z distribution

Confidence Intervals

Exploring Data Sets

* Reasons

* Methods (histograms, features of distributions)

Dealing with Outliers


LEVELS OF MEASUREMENT

1. Categorical

2. Ordinal

3. Continuous

a. Interval

b. Ratio

c. Discrete


Categorical Variables

1. Refer to categories: human, cat, eggplant

2. All or none: Can’t be 1 third human, 2 thirds eggplant

3. Numbers serve as labels, not values: 1 = human, 2 = eggplant

“1” is not less than “2”; human is not less than eggplant

4. Common kinds of categorical variables: gender, race, major

5. Binary: only two values: Yes/No, Day/Night, present/absent

6. Non-Binary: Multiple values: animal, vegetable, mineral; Democrat, Republican, Independent

7. Nominal: Values are known signifiers:

“Did Joey go potty? Yes? Was it Number 1 or Number 2?”

In some sports, numbers on jerseys represent player position; e.g. 1 = tackler, 2 = runner, etc.


Ordinal Variables

Numeric values refer to the ordering of things

Rankings: 1 = First place, 2 = second place

Chronology: 1= occurred first, 2 = occurred second, etc.

Numeric values DO NOT indicate how much “1” differs from “2”

Bike race: 1st place (27.24); 2nd place (27.28); 3rd place (33.10)

Grant scores (ranked; the top-ranked scores are labeled "winners" and the bottom-ranked scores "losers"):

1. 99.89

2. 92.63

3. 89.76

4. 89.75

5. 88.84

6. 79.48


CONTINUOUS VARIABLES

Interval: Most stat tests rely on interval data

Equal intervals represent equal differences

Discrete: Virtually same as "interval" but there is a finite range of values, as in Likert scales.

“How happy are you with your cell phone service?”

1 = Not at all, 2 = Barely, 3 = Somewhat, 4 = Very, 5 = Greatly

Ratio: Ratios of values on scale are meaningful

Must have meaningful “0” point

Likert scale above is NOT ratio, b/c a rating of 4 does not mean twice as happy as a rating of 2

Temperature in kelvins, RT, number of yawns in class ARE ratio (Celsius and Fahrenheit have an arbitrary 0, so they are only interval)


GUESS THAT VARIABLE

Example → Variable

1 = female, 2 = male → Categorical, binary

32.75 miles per gallon → Ratio

1 = slightly tired, 2 = moder. tired, 3 = very tired → Interval

352 Smith Hall → Categorical, non-binary

Top 4 Reasons to Learn Stats (1. Necessary for career, 2. Source of serenity, 3. Great ice-breaker, 4. Fun for whole family) → Ordinal


Distress and Disclosure: A Sample Experiment That Never Occurred!!!

Hyp: Increased anxiety leads to disclosure.

Ss see scary movie or neutral movie.

Ss asked to rate how scary they found the movie

Ss write about thoughts and feelings movie created.


Measures of Centrality

MODE Most frequent value, occurrence

MEDIAN Middle-most value; 50% above/below

MEAN Arithmetic average

Number of words written: 2, 2, 3, 5, 8

MODE = 2

MEDIAN = 3

MEAN = 4 [(2 + 2 + 3 + 5 + 8) / 5 = 4]
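The three measures for the words-written example can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median, mode

# Number of words written, from the example above.
words_written = [2, 2, 3, 5, 8]

centrality = {
    "mode": mode(words_written),      # most frequent value
    "median": median(words_written),  # middle value of the sorted scores
    "mean": mean(words_written),      # sum of the scores divided by their count
}

print(centrality)
```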


Relations Btwn Mean, Median, Mode

Number of words written?

N = 5: 1, 2, 2, 3, 8

N = 10: 1, 2, 3, 3, 3, 4, 5, 5, 6, 8

N = 20: 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8

[Figure: three histograms of the samples above (N = 5, N = 10, N = 20); x-axis: score values 1 to 8; y-axis: frequency 0 to 5]

N = 5: Mode = 2.0, Median = 2.0, Mean = 3.2

N = 10: Mode = 3.0, Median = 3.5, Mean = 4.0

N = 20: Mode = 4.0, Median = 4.0, Mean = 4.35

How does change in N affect rel. btwn Mean, Median, and Mode?

If true distribution is normal, then as sample size increases the mean, median, and mode converge.
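One way to look at the question is to recompute the three measures directly from the three samples listed above and track how far apart they are. The sketch below is illustrative only: the "spread" index (largest of the three measures minus the smallest) is an assumption introduced here, not something from the slides.

```python
from statistics import mean, median, mode

# The three samples of "number of words written" listed above.
samples = {
    5: [1, 2, 2, 3, 8],
    10: [1, 2, 3, 3, 3, 4, 5, 5, 6, 8],
    20: [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8],
}


def centrality_spread(scores):
    """Return (mode, median, mean, spread); spread = largest minus smallest of the three."""
    values = (mode(scores), median(scores), mean(scores))
    return (*values, max(values) - min(values))


summary = {n: centrality_spread(scores) for n, scores in samples.items()}

for n, (mo, md, mn, spread) in summary.items():
    print(f"N = {n:2d}: mode = {mo}, median = {md}, mean = {mn:.2f}, spread = {spread:.2f}")
```

In these three samples the spread shrinks as N grows, which is the convergence the slide describes.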


MEASURES OF DISPERSION

Range: Difference between highest score and lowest score.

N = 20: 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8

Mode = 4.0, Median = 4.0, Mean = 4.35

8 – 1 = 7 = range

Deviation (from mean), AKA “Error”: Difference between individual score and mean

8 – 4.35 = 3.65 = 8’s deviation

Sum of Squared Errors (SS): Why? To get a meaningful index of average dispersion.

Summing the raw deviations of all 20 scores: $(1 - 4.35) + (2 - 4.35) + \dots + (7 - 4.35) + (8 - 4.35) = 0$. Useless!

Summing the squared deviations of all 20 scores: $SS = (1 - 4.35)^2 + (2 - 4.35)^2 + \dots + (7 - 4.35)^2 + (8 - 4.35)^2 = 58.55$. Useful!

Variance and Standard Deviation

Variance = $s^2$ = average squared deviation in sample: $s^2 = \frac{SS}{N - 1} = \frac{58.55}{20 - 1} = 3.08$

Standard Deviation = $s$ = sq. root of variance: $s = \sqrt{s^2} = \sqrt{3.08} = 1.76$

We need to get an estimate of average dispersion from mean, just like the mean gives an estimate of average score.

Two problems with variance:

1) units, based on sq’d deviations, are not relatable to actual scores.

2) Variance tends to be a large, unwieldy, number.

In a normal distribution, 1 sd above and below mean covers about 68% of distribution

In a normal distribution, 2 sd above and below mean covers about 95% of distribution
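The chain from range to deviations to SS to variance to standard deviation can be traced step by step for the N = 20 sample. This is a minimal sketch in plain Python; the variable names are illustrative, and it uses the N - 1 denominator given in the variance formula.

```python
from math import sqrt

# The N = 20 sample of words written used in the dispersion slides.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8]

n = len(scores)
mean = sum(scores) / n                      # 87 / 20 = 4.35
score_range = max(scores) - min(scores)     # highest score minus lowest score
deviations = [x - mean for x in scores]     # each score's deviation ("error")
sum_of_deviations = sum(deviations)         # positive and negative errors cancel
ss = sum(d ** 2 for d in deviations)        # sum of squared errors
variance = ss / (n - 1)                     # s^2, sample variance
sd = sqrt(variance)                         # s, back in the units of the scores

print(f"range = {score_range}")
print(f"sum of deviations = {sum_of_deviations:.2f}")
print(f"SS = {ss:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")
```

Taking the square root is what addresses the first problem with variance noted above: the standard deviation is expressed in words written, not squared words.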


Z Scores and Z Distribution

DV 1: “How anxious were you during movie?” (discrete data): Mean = 4.23, SD = 2.71

DV 2: Number of words written about movie (ratio data): Mean = 28.71, SD = 11.65

Issue: How do we compare anxiety with word production?

Z-score conversion: Effect is to convert different metrics into a common metric

$Z = \frac{X - \bar{X}}{s}$

Sub. 24: anxious = 3; words = 22

$Z_{\text{anxious}} = \frac{3 - 4.23}{2.71} = -.45$

$Z_{\text{words}} = \frac{22 - 28.71}{11.65} = -.58$

SPSS: Descriptives, “Save standardized values as variables”

Z distribution is normal, mean = 0, SD = 1
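The Z-score conversion for Subject 24 can be reproduced with a few lines of Python. The function name `z_score` is a hypothetical helper introduced for illustration; the means and SDs are the ones reported for the two DVs.

```python
def z_score(x, mean, sd):
    """Express a raw score as its distance from the mean in SD units."""
    return (x - mean) / sd


# Subject 24: anxiety rating of 3, and 22 words written.
z_anxious = z_score(3, mean=4.23, sd=2.71)
z_words = z_score(22, mean=28.71, sd=11.65)

print(f"Z_anxious = {z_anxious:.2f}")
print(f"Z_words   = {z_words:.2f}")
```

Both scores are now on the same metric: this subject is roughly half an SD below the mean on each measure.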


Standard Error of the Mean

Sample mean ($\bar{X}$) estimates true population mean ($\mu$)

Many sample means from same population will vary.

Standard Error of the Mean (SE) = the average amount that sample means vary around true mean.

If sample n ≥ 30, SE can be estimated based on s (std. deviation) and sample n.

Formula for SE: $SE_{\bar{X}} = \frac{s}{\sqrt{n}}$

Movie anxiety study: DV = reported anxiety; n = 43, s = 2.71

$SE = \frac{2.71}{\sqrt{43}} = 0.41$

Note: SE is much smaller than SD. Why?
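A short sketch of the SE calculation for the movie anxiety study follows. The helper name `standard_error` is hypothetical; the second call uses an arbitrary larger n, chosen here only to show how SE shrinks as the sample grows.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of the mean: the sample SD divided by the square root of n."""
    return sd / sqrt(n)


# Movie anxiety study: s = 2.71, n = 43.
se_anxiety = standard_error(sd=2.71, n=43)

# Same SD with four times the sample size (illustrative value, not from the study).
se_larger_sample = standard_error(sd=2.71, n=172)

print(f"SE (n = 43)  = {se_anxiety:.2f}")
print(f"SE (n = 172) = {se_larger_sample:.2f}")
```

Because SD is divided by the square root of n, SE is always smaller than SD for any sample larger than 1, and quadrupling n halves SE.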


CONFIDENCE INTERVALS

Issue: How do we know if the sample mean is a good estimate of the true mean? In other words, how do we estimate a mean’s accuracy?

Confidence Intervals (CI) estimate accuracy of sample means.

CI shows boundary values (highest & lowest) w/n which true mean is likely to occur.

Conventional boundary captures true mean 95% of time.

Calculation: Lower boundary = $\bar{X} - (1.96 \times SE)$

Upper boundary = $\bar{X} + (1.96 \times SE)$

Movie anxiety study: $\bar{X}$ = 4.23, SE = 0.41

Lower CI = 4.23 - (1.96 * 0.41) = 3.43

Upper CI = 4.23 + (1.96 * 0.41) = 5.03
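The 95% CI for the movie anxiety study can be sketched as follows. The function name `confidence_interval_95` is a hypothetical helper, and the sketch deliberately uses the rounded SE of 0.41 from the slide, so the boundaries match the worked values.

```python
def confidence_interval_95(mean, se, z=1.96):
    """Return (lower, upper) boundaries: the mean minus/plus z times the SE."""
    margin = z * se
    return mean - margin, mean + margin


# Movie anxiety study: sample mean = 4.23, SE = 0.41 (rounded, as on the slide).
lower, upper = confidence_interval_95(mean=4.23, se=0.41)

print(f"95% CI: {lower:.2f} to {upper:.2f}")
```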


GRAPHIC REPRESENTATION OF CI

[Figure: bar graph with error bars. Y-axis: Anxiety Rating (0 to 6); x-axis: Neutral Movie, Scary Movie; separate bars for Alone and With Friend]

Error bars overlap; means are likely from same distribution.

Differences are not meaningful.

Error bars DON’T overlap; means are likely from different distributions

Differences are meaningful


GRAPHICALLY EXPLORING DATA USING CENTRALITY AND DISPERSION

Why explore data?

1. Get a general sense or feel for your data.

2. Determine if distribution is normal, skewed, kurtotic, or multi-modal (more on this soon).

3. Identify outliers

4. Identify possible data entry errors


DATA BUGS ARE A HAZARD: KNOW WHAT'S IN YOUR DATA!

[Figure: the scores 12, 19, 17, 14, 17, 13, 17, 15, shown with and without one added value of 147]
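To see why a single stray value matters, the sketch below compares the mean and median of the eight scores with and without the 147. Treating 147 as a data-entry error added to the set is an assumption made here for illustration; the slide shows only the numbers.

```python
from statistics import mean, median

# The eight scores shown on the slide.
clean_scores = [12, 19, 17, 14, 17, 13, 17, 15]

# Assumption: 147 is a mis-entered ninth value.
bugged_scores = clean_scores + [147]

clean_mean, clean_median = mean(clean_scores), median(clean_scores)
bugged_mean, bugged_median = mean(bugged_scores), median(bugged_scores)

print(f"Without 147: mean = {clean_mean:.2f}, median = {clean_median}")
print(f"With 147:    mean = {bugged_mean:.2f}, median = {bugged_median}")
```

The mean nearly doubles while the median moves by only one point, which is one reason to compare the two when checking a data set.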


Normally Distributed Data Set

SPSS output: Note similarity between mean, median, mode


Skewed Distribution

Positive Skew: Possible "floor effect"

Negative Skew: Possible "ceiling effect"


Kurtosis

Positive kurtosis, “leptokurtic” (example: Neuroticism Measure)

Problems? "Normativity bias?"; DV doesn't discriminate; IV wasn't impactful

Negative kurtosis, “platykurtic” (example: Drinks Per Week)

Problems? Distinctiveness bias?; IV and/or DV too ambiguous; Population too diverse
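Skew and kurtosis can also be put into numbers. The sketch below uses the simple moment-based definitions, which is an assumption on our part: the slides give no formula, and SPSS reports bias-adjusted versions that differ slightly in small samples. The function names and the three small data sets are made up for illustration.

```python
def central_moment(scores, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** k for x in scores) / len(scores)


def skewness(scores):
    """Moment-based skewness: > 0 for a long right tail, < 0 for a long left tail."""
    return central_moment(scores, 3) / central_moment(scores, 2) ** 1.5


def excess_kurtosis(scores):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(scores, 4) / central_moment(scores, 2) ** 2 - 3


# Illustrative data sets (not from the slides).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # scores piled at the low end: positive skew
left_skewed = [1, 9, 10, 10, 10]  # scores piled at the high end: negative skew

for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(f"{name}: skew = {skewness(data):.2f}, excess kurtosis = {excess_kurtosis(data):.2f}")
```

A flat set such as `symmetric` has negative excess kurtosis (platykurtic), while a set dominated by one extreme tail shows positive excess kurtosis.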


Bimodality

Note: What clues in the “statistics” output suggest that the distribution may be bimodal?

Bimodality suggests 2 (or more) populations

Multimodal: More than two modes.


Outliers


BOX AND WHISKER GRAPH

[Figure: box-and-whisker plot, labeled from top to bottom: Top 25%, Upper Quartile, Median (50%), Lower Quartile, Bottom 25%]


BOX AND WHISKER GRAPH, AND DATA CHECKING

Detecting Skew

Detecting Outliers (outlying points are labeled by subject number)
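The quantities behind a box-and-whisker graph can be computed directly. The sketch below makes two assumptions that are not stated in the slides: it flags outliers with the common 1.5 × IQR fence convention, and it reuses the nine scores from the data-bug example, numbering subjects from 1 in the order listed.

```python
from statistics import quantiles

# Scores from the data-bug example, including the suspicious 147.
scores = [12, 19, 17, 14, 17, 13, 17, 15, 147]

# Lower quartile, median, upper quartile (Python's default "exclusive" method).
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Report each outlier with its subject number.
outliers = [(subject, x) for subject, x in enumerate(scores, start=1)
            if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"outliers (subject, score): {outliers}")
```

Different software packages compute quartiles in slightly different ways, so the box edges may not match SPSS output exactly.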


DEALING WITH OUTLIERS

1. Check raw data: Entry problem? Coding problem?

2. Remove the outlier:

a. Must be at least 2.5 SD from the mean (some say 3 SD)

b. Must declare deletions in pubs.

c. Try to identify reason for outlier (e.g., other anomalous responses).

3. Transform data: Convert data to a metric that reduces deviation. (More on this in next slide).

4. Change the score to a more conservative one (Field, 2009):

a. Next highest plus 1

b. 2 SD or 3 SD above (or below) the mean.

c. ISN’T THIS CHEATING? No (says Field) b/c retaining score biases outcome. Again, report this step in pubs.

5. Run more subjects!
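Steps 2 and 4a can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed procedure: the helper names are hypothetical, the data are the nine scores from the data-bug example, the SD is the sample SD computed with the outlier included, and the replacement rule handles only high outliers.

```python
from statistics import mean, stdev


def flag_outliers(scores, cutoff=2.5):
    """Return the scores lying at least `cutoff` SDs from the mean."""
    m, s = mean(scores), stdev(scores)
    return [x for x in scores if abs(x - m) / s >= cutoff]


def replace_with_next_highest_plus_one(scores, outliers):
    """Replace each flagged (high) score with the highest unflagged score plus 1."""
    ceiling = max(x for x in scores if x not in outliers) + 1
    return [ceiling if x in outliers else x for x in scores]


scores = [12, 19, 17, 14, 17, 13, 17, 15, 147]

flagged_at_2_5 = flag_outliers(scores, cutoff=2.5)
flagged_at_3 = flag_outliers(scores, cutoff=3)
adjusted = replace_with_next_highest_plus_one(scores, flagged_at_2_5)

print(f"flagged at 2.5 SD: {flagged_at_2_5}")
print(f"flagged at 3 SD:   {flagged_at_3}")
print(f"adjusted scores:   {adjusted}")
```

In this small sample the 147 is flagged at 2.5 SD but not at 3 SD, because the outlier itself inflates the SD it is measured against. That is a reason to inspect the raw data and plots and not rely on the cutoff alone.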


Data Transformations

1. Log Transformation (log(X)): Converting scores into Log X reduces positive skew, draws in scores on the far right side of distribution.

NOTE: This only works on sets where lowest value is greater than 0. Easy fix: add a constant to all values.

2. Square Root Transformation (√X): Sq. roots reduce large numbers more than small ones, so will pull in extreme outliers.

3. Reciprocal Transformation (1/X): Dividing 1 by each score reduces large values. BUT, remember that this effectively reverses valence, so that scores above the mean flip over to below the mean, and vice versa.

Fix: First, do a preliminary transform by changing each score to the highest score minus that score, then take the reciprocal. Do it all at the same time with $1/(X_{highest} - X)$. Add a small constant (e.g., 1) to the denominator so that the highest score itself does not produce a division by zero.

4. Correcting negative skew: All steps work on neg. skew, but first must reverse scores. Subtract each score from highest score. Then, re-reverse back to original scale after transform completed.
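The four transformations can be sketched in Python as follows. The function names are hypothetical, base-10 logs are an arbitrary choice (the slide does not specify a base), and the `+ 1` constants are assumptions added so that no score is reversed to 0 and no denominator is 0.

```python
from math import log10, sqrt


def log_transform(scores, constant=0):
    """Log of each score; pass a constant if the lowest score is not above 0."""
    return [log10(x + constant) for x in scores]


def sqrt_transform(scores):
    """Square root of each score; shrinks large values more than small ones."""
    return [sqrt(x) for x in scores]


def reciprocal_transform(scores):
    """1 / (highest - X + 1): a reciprocal that keeps the original ordering."""
    highest = max(scores)
    return [1 / (highest - x + 1) for x in scores]


def reverse_scores(scores):
    """Reverse-score for negative skew: highest - X + 1, so the top score becomes 1."""
    highest = max(scores)
    return [highest - x + 1 for x in scores]


# Illustrative, positively skewed scores (not from the slides).
skewed = [1, 10, 100]

print(log_transform(skewed))               # equal ratios become equal steps
print(log_transform([0, 9, 99], constant=1))  # constant added because lowest score is 0
print(sqrt_transform([4, 9, 100]))
print(reciprocal_transform([1, 2, 4]))
print(reverse_scores([1, 2, 5]))
```

For negatively skewed data, `reverse_scores` would be applied first, then one of the other transforms, and the result reversed back to the original direction afterwards.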