A Hypothetical Research Example


Is Hanx Writer better?

How would you design a research project to

answer this question?

Based on what criteria would you make a claim?

Better, no difference, worse

Some Fundamental Issues

How can you make a claim, usually a general

statement about the whole world, based on

a partial observation of the world?

Scientific research: to generate new knowledge

Knowledge needs to be as general as possible

What we can see is only part of the world

Key issues

Sample as a representation of population

Criteria for claiming positive, negative results

Quantitative Data Analysis

Foundation

Statistics,

Frequency Distributions

and Central Tendency

Statistics

Statistics

Mathematical procedures

Dealing with observed information

Organization, summarization, and

interpretation

Populations and Samples

What are they?

Parameters and Statistics

What do they describe?

Descriptive and Inferential Statistical Methods

Different purposes of these two methods

The relationship between a population

and a sample.

Parameters

Statistics

Descriptive Methods

Inferential Methods

Sampling Error

Implications

Sample statistics are

Representative

Not identical to the corresponding population parameters

Two different samples will have different statistics.

Differences can occur just by chance

Sampling error is inevitable!

Statistics in the Context of

Research

Teaching methods and Test Scores

Two samples

Two statistics with difference

What causes the difference?

Sampling error

Teaching methods

Using statistics to infer the characteristics of

the population

What Is the Research Question

Here?

Scientific Methods and

Research Design

Correlational Method

Observe different variables to see whether they are

related

Case   Income    Education (years)
#1     125,000   19
#2     100,000   20
#3      40,000   16
#4      35,000   16
#5      41,000   18
#6      29,000   12
#7      35,000   14
#8      24,000   12
#9      50,000   16
#10     60,000   17

Correlation: .79

Case   GPA   TV Time (hours/week)
#1     2.2   25
#2     3.5   21
#3     2.0   20
#4     2.9   15
#5     3.1   14
#6     3.2   13
#7     2.4   10
#8     3.4    9
#9     3.8    7
#10    3.7    4

Correlation: -.63
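The two correlations above can be checked with a short sketch. It assumes the reported values are Pearson correlations (the slides do not name the coefficient); the function name `pearson_r` is introduced here for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sum of products of deviations over sqrt(SS_x * SS_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)

# Data copied from the two tables above.
income = [125000, 100000, 40000, 35000, 41000, 29000, 35000, 24000, 50000, 60000]
education = [19, 20, 16, 16, 18, 12, 14, 12, 16, 17]
gpa = [2.2, 3.5, 2.0, 2.9, 3.1, 3.2, 2.4, 3.4, 3.8, 3.7]
tv_hours = [25, 21, 20, 15, 14, 13, 10, 9, 7, 4]

r_income_education = pearson_r(income, education)
r_gpa_tv = pearson_r(gpa, tv_hours)
print(round(r_income_education, 2), round(r_gpa_tv, 2))
```

A positive value means the two variables tend to rise together; a negative value means one tends to fall as the other rises. Neither value says anything about cause and effect.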

Scientific Methods and

Research Design

Experimental Method

Manipulate one variable and observe another to see whether they

have a cause-and-effect relationship

Independent and Dependent Variables

Independent

Manipulated by researchers

Different treatments

Dependent: observed

Variables and Measurement

Constructs

Internal attributes or characteristics that cannot be

directly observed

Intelligence, self-esteem, teaching/learning results

Discrete and Continuous Variables

Discrete

No values between two neighboring values

Continuous

A new value can always be found between any two values

Real limits

Boundaries of a real score

Scales of Measurement

The Nominal Scale: different names for categories

No quantitative distinction

The Ordinal Scale: ordered categories

The Interval Scale: ordered categories

Intervals between categories are comparable

The zero point is just for convenience

The Ratio Scale: the interval scale with an absolute zero point

Temperature: What scales are they? Fahrenheit, Celsius, Kelvin

Different statistical methods for different types of variables

Exercise

A survey collects the following data

Age

Gender

Self-perception of weight level (overweight, normal,

underweight)

Satisfaction with services (on a scale of 1-5)

Weight loss

Identify their scales of measurement

Discrete or continuous?

Statistical Notation

Summation Notation

Observed values or scores

$\sum X$: the sum of all X

How about others?

$\sum X^2$, $\sum (X+1)$, $\sum (X+1)^2$
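The order of operations in summation notation is easy to get wrong, so here is a minimal sketch. The three scores are hypothetical and chosen only to keep the arithmetic visible.

```python
scores = [1, 2, 3]  # hypothetical scores X

sum_x = sum(scores)                                       # sum of X
sum_x_squared = sum(x ** 2 for x in scores)               # square each X, then sum
square_of_sum = sum(scores) ** 2                          # sum first, then square
sum_x_plus_1 = sum(x + 1 for x in scores)                 # add 1 to each X, then sum
sum_x_plus_1_squared = sum((x + 1) ** 2 for x in scores)  # add 1, square, then sum

print(sum_x, sum_x_squared, square_of_sum, sum_x_plus_1, sum_x_plus_1_squared)
```

Note that squaring each score before summing is not the same as squaring the sum.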

Frequency Distributions

Purpose for Frequency

Distributions

To organize research results so that

researchers can see what happened

A frequency distribution does not simply

summarize the scores, but rather shows the

entire set of scores.

Frequency Distribution Tables

Two elements

Scores

Frequencies

Obtaining $\sum X$ from a

Frequency Distribution Table

Score and frequency

Proportions and Percentages
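A small sketch of how a frequency distribution table can be built and used. The quiz scores are hypothetical; the point is that the sum of all scores is recovered by multiplying each score by its frequency, and that a proportion is a frequency divided by the number of scores.

```python
from collections import Counter

scores = [8, 9, 8, 7, 10, 9, 6, 8, 9, 8]  # hypothetical quiz scores

freq = Counter(scores)              # score -> frequency
n = sum(freq.values())              # number of scores
sum_x = sum(score * f for score, f in freq.items())   # sum of f * X
proportions = {score: f / n for score, f in freq.items()}

# Print the table from the highest score down: X, f, proportion, percentage.
for score in sorted(freq, reverse=True):
    print(score, freq[score], proportions[score], f"{proportions[score]:.0%}")
```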

Grouped Frequency

Distribution Tables

Why do we need groups?

Age, income, …

It could be quite tricky

in selecting group intervals

Survey design

How many hours do you watch TV per week?

a) 0-10 b) 10-20 c) 20-30 d) 30-40 e) above 40

a) 0-5 b) 5-10 c) 10-15 d) 15-20 e) above 20

Frequency Distribution Graphs

A pictorial presentation of a frequency

distribution

The X-axis (the abscissa): measurement scales

(categories)

Independent variables

The Y-axis (the ordinate): frequency

Dependent variables

Graphs for Interval or Ratio

Data

Histograms

The width of the bar extends to the real limits.

Graphs for Interval or Ratio

Data (Cont.)

Polygons

A continuous line

Graphs for Nominal or Ordinal

Data

Bar Graphs

Similar to histograms but with space between bars

Using Excel to Create Graphs

You need to have the Data Analysis package

installed

Histogram in Excel

Just generate the frequency distribution table

Use the Chart Wizard to create Histogram graph

You need to manually remove the space between bars

Graphs for Population

Distributions

Relative Frequencies

Cannot get the absolute

frequency distribution

Smooth Curves

For numerical scores

measured by an interval

or ratio scale

Symmetrical vs. skewed

distribution

Central Tendency

Why do we need to measure

central tendency?

To identify the “average” or “typical” data

Products and services

Different measures in different distributions

and different types of data

The Mean

Formulas

Population mean: $\mu = \frac{\sum X}{N}$

Sample mean: $M = \frac{\sum X}{n}$

The Weighted Mean

Different groups: $M = \frac{\sum X_1 + \sum X_2}{n_1 + n_2}$

How about more groups? $M = \frac{\sum (\sum X)}{\sum n}$

Different weights (e.g., GPA): $M = \frac{\sum XC}{\sum C}$
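The weighted-mean formulas can be sketched with one helper. All numbers below are hypothetical, and reading C as credit hours in the GPA case is an assumption made for this example.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Combining groups: each group mean is weighted by its group size,
# which is the same as adding the group totals and dividing by the total n.
group_means = [6.0, 7.0]   # hypothetical M1, M2
group_sizes = [12, 8]      # hypothetical n1, n2
overall_mean = weighted_mean(group_means, group_sizes)

# GPA: each grade point is weighted by the credits of its course (assumed meaning of C).
grade_points = [4.0, 3.0, 2.0]
credits = [3, 4, 1]
gpa = weighted_mean(grade_points, credits)

print(overall_mean, gpa)
```

The overall mean lands closer to the mean of the larger group, which is why simply averaging the group means would be wrong here.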

Computing the Mean from a

Frequency Distribution Table

Distribution with frequency or percentage

Score Frequency

80 2

90 4

95 3

100 1

Score Percentage

80 20%

90 40%

95 30%

100 10%
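Both tables above describe the same distribution, so they should give the same mean. A minimal sketch, using the scores, frequencies, and percentages exactly as listed:

```python
scores = [80, 90, 95, 100]
frequencies = [2, 4, 3, 1]
percentages = [0.20, 0.40, 0.30, 0.10]

# From frequencies: sum of f * X divided by n, where n is the sum of f.
mean_from_frequencies = sum(x * f for x, f in zip(scores, frequencies)) / sum(frequencies)

# From percentages: the proportions already sum to 1, so no division is needed.
mean_from_percentages = sum(x * p for x, p in zip(scores, percentages))

print(mean_from_frequencies, mean_from_percentages)
```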

Characteristics of the Mean

What may affect the mean?

Remember the formula to calculate the mean

$M = \frac{\sum X}{n}$

The Median

What is a median?

Finding a median

An Odd Number of Scores

1, 1, 3, 4, 5, 6, 6, 6, 7

An Even Number of Scores

1, 1, 3, 4, 5, 6, 6, 6
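A sketch of the simple procedure for finding a median, applied to the two lists above. It uses the basic definition (the middle score, or the average of the two middle scores); some textbooks refine this with interpolation when scores are tied, which is not shown here.

```python
def median(scores):
    """Middle score for an odd count; average of the two middle scores for an even count."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([1, 1, 3, 4, 5, 6, 6, 6, 7])
even_median = median([1, 1, 3, 4, 5, 6, 6, 6])
print(odd_median, even_median)
```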

The Median, the Mean, and the

Middle

Which to use?

It all depends on what you mean by the middle.

Mean: a weighted middle

Median: an absolute middle

Mean and Median

1,2,3,4,5

1,2,3,4,55
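The two lists above differ only in their last score. A quick check with Python's standard `statistics` module shows how the mean and the median respond:

```python
from statistics import mean, median

balanced = [1, 2, 3, 4, 5]
with_extreme_score = [1, 2, 3, 4, 55]

balanced_mean, balanced_median = mean(balanced), median(balanced)
extreme_mean, extreme_median = mean(with_extreme_score), median(with_extreme_score)

print(balanced_mean, balanced_median)
print(extreme_mean, extreme_median)
```

The single extreme score pulls the mean far from the bulk of the data, while the median does not move.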

The Mode

The score with the greatest frequency

A distribution could have multiple modes.
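A brief sketch of finding modes with `statistics.multimode` (available in Python 3.8 and later). Both data sets are hypothetical; the second shows that a mode can be found even for nominal data, where a mean or median would make no sense.

```python
from statistics import multimode

scores = [2, 3, 3, 4, 5, 5, 6]                     # hypothetical scores, two modes
colors = ["red", "blue", "red", "green", "blue"]   # hypothetical nominal data

score_modes = multimode(scores)
color_modes = multimode(colors)
print(score_modes, color_modes)
```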

Selecting a Measure of Central

Tendency

When to Use the Median

Extreme Scores or skewed distributions

House prices

Undetermined Scores or Incomplete Data

Ordinal Scales

When to Use the Mode

Nominal Scales

You cannot rank scores.

Discrete Variables

The mean and median may be meaningless.

Central Tendency and the

Shape of the Distribution

Symmetrical Distributions

The mean and the median are the same

Central Tendency and the

Shape of the Distribution

Skewed Distributions

Variability and

Probability

Selecting An Olympian

An easy case

Selecting An Olympian

Picking a Stock

Stock A

Average annual return: 5% in the past 8 years

5%, 4%, 3%, 3%, 3%, 6%, 7%, 9%

Stock B

Average annual return: also 5% in the past 8 years

15%, 15%, -10%, 20%, -5%, 5%, -10%, 10%

Which one to choose?

Assuming you are not a risk-taker.
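One way to put numbers on this choice is to compare the spread of the two return series. The sketch below uses the population standard deviation, treating the eight listed years as the complete data set, which is an assumption made for this example.

```python
from statistics import mean, pstdev

stock_a = [5, 4, 3, 3, 3, 6, 7, 9]            # annual returns in percent
stock_b = [15, 15, -10, 20, -5, 5, -10, 10]

mean_a, mean_b = mean(stock_a), mean(stock_b)
spread_a, spread_b = pstdev(stock_a), pstdev(stock_b)

print(mean_a, round(spread_a, 2))
print(mean_b, round(spread_b, 2))
```

Both stocks have the same average return, but the returns of Stock B swing far more widely from year to year, which is what a risk-averse investor cares about.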

Data Collected from the Real

World Is Always Noisy.

How to Decide the Quality of

Data?

How to Tell the Difference

between Different Data Sets?

Variability

The spread of data

Purpose for measuring variability

Understanding the distribution of data

Evaluating performances

People or products

Simple Measures of Variability

Range and Interquartile Range

Range: difference between the maximum and the minimum

maximum – minimum + 1

Or maximum – minimum

Interquartile range: difference between the first and third quartiles

Algorithm

1. Order the scores

2. Split into two equal sets

3. Find the middle values for two sets

4. Get the difference between them

What is the interquartile range of the following data? 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 70
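The four-step algorithm can be sketched as follows. One detail is an assumption: when the number of scores is odd, the middle score is left out of both halves so that the two sets are equal in size. Other quartile conventions exist and can give slightly different answers.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def interquartile_range(scores):
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ordered = sorted(scores)            # 1. order the scores
    half = len(ordered) // 2            # 2. split into two equal sets
    lower = ordered[:half]              #    (the middle score is skipped when n is odd)
    upper = ordered[-half:]
    q1 = _median(lower)                 # 3. find the middle value of each set
    q3 = _median(upper)
    return q3 - q1                      # 4. take the difference

data = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 70]
iqr = interquartile_range(data)
print(iqr)
```

With this convention the first quartile is 15 and the third quartile is 37.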

Deviation

The problem with using the range to measure variability: it totally ignores the data in between

1, 11, 15, 19, 20, 24, 28, 34, 37, 47

1, 24, 24, 24, 24, 24, 24, 24, 24, 47

The problem with using the interquartile range: it totally ignores the data outside the quartiles.

1, 11, 15, 19, 20, 24, 28, 34, 37, 47

-100, 11, 15, 19, 20, 24, 28, 34, 37, 147

Deviation: distance from the mean (the average of all data)

Get deviation for every data point

Average absolute deviation

1, 3, 5, 7, 9: (4 + 2 + 0 + 2 + 4)/5 = 2.4

3, 3, 5, 7, 7: (2 + 2 + 0 + 2 + 2)/5 = 1.6
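The two hand calculations above can be reproduced with a short function. The name `average_absolute_deviation` is chosen here for illustration.

```python
def average_absolute_deviation(scores):
    """Mean of the absolute distances between each score and the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

spread_wide = average_absolute_deviation([1, 3, 5, 7, 9])
spread_narrow = average_absolute_deviation([3, 3, 5, 7, 7])
print(spread_wide, spread_narrow)
```

Unlike the range, this measure uses every data point, so pulling the two end scores toward the mean lowers it.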

Variance

The mean of the squared deviations

$\sigma^2 = \frac{\sum (X-\mu)^2}{N}$

$\mu$: the mean of the population (all existing data points)

N: the number of data points in a population

Average absolute deviation: $\frac{\sum |X-\mu|}{N}$

Why do we prefer variance over average absolute deviation?

Standard deviation $\sigma$: the square root of the variance

Standard Deviation and

Variance For Population

Calculation steps

Deviations

Squared deviations

Sum of squares: SS

Variance: $\sigma^2$ (SS divided by N)

Standard deviation: $\sigma$

Computation Formula for the Sum

of Squared Deviations SS

The definitional formula is difficult to use and

may lead to rounding errors.

$SS = \sum (X-\mu)^2$

The computational formula is often used.

$SS = \sum X^2 - \frac{(\sum X)^2}{N}$

They are equivalent.
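The equivalence of the two formulas, and the remaining steps to the population variance and standard deviation, can be checked on a small example. The five scores are hypothetical and are treated as a complete population.

```python
from math import sqrt

population = [1, 9, 5, 8, 7]   # hypothetical population of N = 5 scores
N = len(population)
mu = sum(population) / N

# Definitional formula: sum of squared deviations from the mean.
ss_definitional = sum((x - mu) ** 2 for x in population)

# Computational formula: sum of squared scores minus (sum of scores) squared over N.
ss_computational = sum(x ** 2 for x in population) - sum(population) ** 2 / N

variance = ss_definitional / N
standard_deviation = sqrt(variance)
print(ss_definitional, ss_computational, variance, round(standard_deviation, 3))
```

The computational form avoids subtracting a possibly rounded mean from every score, which is where hand calculations tend to pick up rounding errors.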

Standard Deviation and

Variance For Samples

Sample: A small portion of population

The calculation of standard deviation and variance for samples is very similar to that for a population

The only difference

n-1 is used in calculating variance

The computational formula for the sum of squares still uses n.

Why not use n?

Degrees of Freedom

Not all samples are random!

Statistics

Assume observed data are random, but follow certain

rules

Need to make an adjustment for nonrandom data

For a sample with n scores, only n-1 scores

are truly independent.

The number of truly independent deviations

All samples are biased.

Underestimate or overestimate parameters.

Example of Underestimated

Variance

N = 6 (0,0,3,3,9,9), n = 2
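This example can be worked through by brute force. The sketch assumes sampling with replacement and enumerates all 36 ordered samples of size 2, then averages the sample variance computed once with n and once with n - 1 in the denominator.

```python
from itertools import product

population = [0, 0, 3, 3, 9, 9]
N = len(population)
mu = sum(population) / N
population_variance = sum((x - mu) ** 2 for x in population) / N

def sample_variance(sample, ddof):
    """SS of the sample divided by (n - ddof); ddof=1 gives the n - 1 version."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / (n - ddof)

# Every ordered sample of size n = 2, drawn with replacement (an assumption).
samples = list(product(population, repeat=2))

mean_variance_dividing_by_n = sum(sample_variance(s, 0) for s in samples) / len(samples)
mean_variance_dividing_by_n_minus_1 = sum(sample_variance(s, 1) for s in samples) / len(samples)

print(population_variance, mean_variance_dividing_by_n, mean_variance_dividing_by_n_minus_1)
```

Dividing by n gives sample variances that average to only half of the population variance here, while dividing by n - 1 gives values that average to the population variance exactly.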

Adjustment (df) Is Necessary

Biased sample vs. unbiased sample

Biased statistics vs. unbiased statistics

Standard Deviation in

Descriptive and Inferential

Statistics

Descriptive

What is going on and how spread out the data are

Mean and standard deviation

Inferential

What may come? In particular, how likely will extreme scores be observed?

Probability as a function of the mean and standard deviation

But, be careful! Lessons learned from financial markets

Probability of the market crash in October 1987

Probability of the fall of Long-Term Capital Management in 1998

When Genius Failed: The Rise and Fall of Long-Term Capital Management

Population vs. Sample

Notations

N vs. n (size)

$\mu$ vs. M (mean)

$\sigma^2$ vs. $s^2$ (variance)

z-Score:

Location of real

scores in a

standardized

distribution

Why Do We Need z-Scores?

To help compare scores from different distributions

To help compute probability

Examples

Two students’ grades from two classes: 80 vs. 90

80 (M=70, s=10) vs. 90 (M=85, s=5)

How good is a test score? GRE Verbal: 160 (percentile 85)

Standardized scores

Probability

A z-Score

Tells the location of a score in a standardized

distribution

Sign: above or below the average

Number: distance to the mean

Formula

$z = \frac{X-\mu}{\sigma}$ (population), $z = \frac{X-M}{s}$ (sample)

z

How far is a score from the mean, measured by the

standard deviation?

Examples

Given $\mu = 500$, $\sigma = 100$ for all SAT scores

The z-score of an SAT score of 620

The score if z = -0.3

In a standardized test ($\sigma = 100$), if X = 720, z = 1.2

Calculate the mean of the test

$z = \frac{X-\mu}{\sigma}$

Using a Distribution Graph

For a distribution with a standard deviation of

$\sigma = 4$, a score of X = 52 corresponds to a

z-score of -2.0. What is the mean for this

distribution?
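The z-score formula can be rearranged to solve for whichever quantity is missing. The sketch below works the SAT examples and the question above; the three helper names are introduced here for illustration.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def score_from_z(z, mean, sd):
    """Rearranged for the score: X = mean + z * sd."""
    return mean + z * sd

def mean_from_z(x, z, sd):
    """Rearranged for the mean: mean = X - z * sd."""
    return x - z * sd

z_for_620 = z_score(620, 500, 100)               # SAT score of 620
score_for_z = score_from_z(-0.3, 500, 100)       # score at z = -0.3
test_mean = mean_from_z(720, 1.2, 100)           # X = 720, z = 1.2, sd = 100
distribution_mean = mean_from_z(52, -2.0, 4)     # X = 52, z = -2.0, sd = 4

print(z_for_620, score_for_z, test_mean, distribution_mean)
```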

Using z-Scores to Standardize

a Distribution

Questions

Can we compare students’ GRE scores obtained in different years? Say last year vs. two years ago?

Different groups of students in test

Different test questions

Are they really comparable?

Yes!

GRE scores are standardized

Score distribution is standardized each time.

Standardizing a Distribution

Relabeling each score

How to Standardize a Score?

Convert a score under a distribution, with any

mean and standard deviation, to a z-score

under a standardized distribution, with a

mean of 0 and a standard deviation of 1

Convert the z-score to a score under a

standardized distribution, with a

predetermined mean and standard deviation.
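The two-step conversion can be sketched as a single function. The original and target distributions below are hypothetical numbers picked for easy arithmetic.

```python
def restandardize(x, old_mean, old_sd, new_mean, new_sd):
    """Step 1: score -> z-score. Step 2: z-score -> score on the new scale."""
    z = (x - old_mean) / old_sd
    return new_mean + z * new_sd

# Hypothetical: a raw score of 64 where the mean is 57 and the standard deviation is 14,
# moved to a standardized scale with mean 50 and standard deviation 10.
new_score = restandardize(64, old_mean=57, old_sd=14, new_mean=50, new_sd=10)
print(new_score)
```

The score keeps its relative position: it sits half a standard deviation above the mean on both scales.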

Probability

Probability

Probability definition

Chance, odds, proportion

What is the probability of drawing a King from a deck

of cards?

Random Sampling

Each individual of the population has the

same chance of being selected

Constant probability for each and every

selection in case of repetitive samplings

Previously selected samples must be returned to

the population!

The General Formula

Example: coin toss

$\text{Probability of } A = \frac{\text{number of outcomes classified as } A}{\text{total number of possible outcomes}}$
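The general formula translates directly into counting. This sketch assumes every outcome is equally likely and that the deck is a standard 52-card deck; the `probability` helper is introduced here for illustration.

```python
from fractions import Fraction

def probability(outcomes, is_a):
    """Outcomes classified as A, divided by the total number of possible outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if is_a(outcome))
    return Fraction(favorable, len(outcomes))

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

p_king = probability(deck, lambda card: card[0] == "K")
p_heads = probability(["heads", "tails"], lambda side: side == "heads")
print(p_king, p_heads)
```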

Using Frequency

Distributions to Calculate

Probability

Probability = Proportion

More Complicated

Distributions?

[Figure: frequency distribution graph, with axis ticks from 0 to 600 and from 0 to 5]

What If Distributions Are

Smooth Curves?

Same technique: to find

the proportion

But how?

No blocks to count.

Calculus


Probability and the Normal

Distribution

The normal distribution

A particular shape

Can describe many phenomena if the sample is big

enough

The bell curve

Symmetrical

Single mode in the middle

Simulation (link)

Proportion of a z-Score in

the Normal Distribution

Has Been Pre-Calculated

The Unit Normal Table

Up to z = 4.00

Example

p (z>1.0)

p (z<1.5)

p (z<-0.5)

More

p (1<z<1.5)

p (-0.5<z<1.5)

Different tables for different distributions

Normal distribution is most often seen.
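The proportions listed in the examples above can also be computed instead of looked up. This sketch uses `statistics.NormalDist` from the Python standard library (3.8 and later) as a stand-in for the unit normal table.

```python
from statistics import NormalDist

unit_normal = NormalDist(mu=0, sigma=1)   # mean 0, standard deviation 1

p_z_above_1 = 1 - unit_normal.cdf(1.0)                        # p(z > 1.0)
p_z_below_1_5 = unit_normal.cdf(1.5)                          # p(z < 1.5)
p_z_below_minus_0_5 = unit_normal.cdf(-0.5)                   # p(z < -0.5)
p_z_1_to_1_5 = unit_normal.cdf(1.5) - unit_normal.cdf(1.0)    # p(1 < z < 1.5)
p_z_minus_0_5_to_1_5 = unit_normal.cdf(1.5) - unit_normal.cdf(-0.5)  # p(-0.5 < z < 1.5)

for p in (p_z_above_1, p_z_below_1_5, p_z_below_minus_0_5,
          p_z_1_to_1_5, p_z_minus_0_5_to_1_5):
    print(round(p, 4))
```

The proportion between two z-scores is always the proportion below the upper one minus the proportion below the lower one.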

Finding the z-Score from a

Probability

What if you cannot find the exact number?

Use the closest z-score

Interpolate

Probabilities For Scores from A

Normal Distribution

Transform scores to z-scores and then look them

up in the unit normal table

p(55<x<65)=?

Find a Score for a Particular

Probability

Look up the unit normal table for a z-score,

and then find a score in a distribution which

corresponds to the z-score

What is the minimum score necessary to be in the top 15%?

You Can Estimate Where Your

IQ Stands

$\mu_{IQ} = 100$

$\sigma_{IQ} = 15$
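With a mean of 100 and a standard deviation of 15, both directions of lookup can be sketched: from a score to a proportion, and from a proportion back to a score. The sketch assumes IQ scores are normally distributed, and the example score of 130 is hypothetical.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Score -> probability: proportion of people scoring below 130.
proportion_below_130 = iq.cdf(130)

# Probability of a score between two values: p(85 < X < 115).
proportion_85_to_115 = iq.cdf(115) - iq.cdf(85)

# Probability -> score: minimum score needed to be in the top 15%.
cutoff_top_15_percent = iq.inv_cdf(0.85)

print(round(proportion_below_130, 4),
      round(proportion_85_to_115, 4),
      round(cutoff_top_15_percent, 1))
```

`inv_cdf` plays the role of reading the unit normal table backwards and then converting the z-score to a score.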

Importance of Probability to

Research

Compare samples with the population mean

Does a particular sample belong to the population?

How sure is it?

Probabilities and

Samples:

Distribution of

Sample Means

Sampling Error

Samples Always Have Errors!

How Can We Infer Population Parameters based on One or

A Few Samples?

Recall Our Examples on GRE

and SAT Scores

We can make predictions

Using statistics

Mean, standard deviation

For any variable, if we know the mean and

standard deviation, we will have a way to deal

with it

Sample statistics can be treated in the same way.

Sample mean: the mean of all possible sample means

Sample variance: the variance of all possible sample

means

Distribution of Sample Means

A frequency distribution of sample means

Including all the possible samples with a

particular sample size n

The distribution of statistics

A sampling distribution

With a specific sample size

Example

The population: 2,4,6,8

Sample size: n=2

Random sample

What Do We Get?

The sample means pile up around the

population mean.

The distribution of sample means is like a

normal distribution.

It is more likely to get a sample mean close to

the population mean.

What is the probability of getting an extreme sample

mean?
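For the small population 2, 4, 6, 8 the whole distribution of sample means can be listed. The sketch assumes sampling with replacement, which gives 16 equally likely ordered samples of size 2; treating a sample mean above 7 as "extreme" is an arbitrary choice for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

population = [2, 4, 6, 8]
n = 2

samples = list(product(population, repeat=n))     # all 16 ordered samples
sample_means = [sum(s) / n for s in samples]
distribution = Counter(sample_means)              # sample mean -> frequency

mu = sum(population) / len(population)
sigma = sqrt(sum((x - mu) ** 2 for x in population) / len(population))

mean_of_means = sum(sample_means) / len(sample_means)
sd_of_means = sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means))

# Proportion of samples whose mean is above 7.
p_extreme = sum(1 for m in sample_means if m > 7) / len(sample_means)

for m in sorted(distribution):
    print(m, distribution[m])
print(mean_of_means, round(sd_of_means, 3), p_extreme)
```

The sample means pile up at 5, the population mean, and their standard deviation equals the population standard deviation divided by the square root of n.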

Central Limit Theorem

For a population ($\mu$, $\sigma$), the distribution of

sample means for sample size n will have a

mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$

It applies to any population

The shape of the distribution does not matter.

The mean and standard deviation do not matter.

Shape of Distribution

Would be a normal distribution if

The population is a normal distribution, or

The sample size is large enough, say larger than

30
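A simulation can make this concrete. The sketch draws repeated samples of size 30 from a strongly skewed population; the exponential distribution (mean 1, standard deviation 1), the seed, and the number of repetitions are all choices made for this illustration.

```python
import random
from math import sqrt

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 30                    # sample size
repetitions = 5000        # number of samples drawn

# Each sample mean comes from n draws of a skewed (exponential) population.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repetitions)
]

mean_of_means = sum(sample_means) / repetitions
sd_of_means = sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / repetitions)
predicted_sd = 1.0 / sqrt(n)   # population sd divided by sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_sd, 3))
```

Even though individual scores are far from normal, the sample means cluster around the population mean with a spread close to the predicted value. Plotting a histogram of `sample_means` would show the bell shape.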

The Mean of Sample Means

The expected value of M

It is near the population mean

The Standard Deviation of

Sample Means

The standard error of M

Standard distance between a sample mean M and $\mu$

Notation: $\sigma_M$

The larger the sample size is, the smaller the

standard error of M is.

The law of large numbers

The larger the sample size is, the more probable the

sample means will be close to the population mean.

The less likely a sample mean is to be very far away

from the population mean.

Distributions of Sample Means

for A Normal Distribution

[Figure: distributions of sample means for sample sizes n1, n2, and n3]

A Non-Normal Distribution

Distributions of Sample Means

[Figure: distributions of sample means for sample sizes n1 and n2]

Implication

The probability of sample means can be

estimated by using z-scores and the unit

normal table

Variables:

A sample mean

The population mean

The standard error

The population standard deviation and sample size

Example: $\mu = 500$, $\sigma = 100$

Given n = 25, what is the probability of getting a

sample mean larger than 540?
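This example can be worked in three steps: standard error, z-score of the sample mean, then the tail proportion. The sketch assumes the distribution of sample means is normal and uses `statistics.NormalDist` in place of the unit normal table.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500, 100, 25
sample_mean = 540

standard_error = sigma / sqrt(n)              # standard deviation of sample means
z = (sample_mean - mu) / standard_error       # location of M among all sample means
p_mean_above_540 = 1 - NormalDist().cdf(z)    # proportion in the upper tail

print(standard_error, z, round(p_mean_above_540, 4))
```

A single score of 540 would be unremarkable (z = 0.4), but a mean of 540 across 25 scores is two standard errors above the population mean.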

Standard Error

Like standard deviation

Measure the standard distance between a sample mean and the population mean

Provide information about sample error

Very often, we don’t know the population mean

All we have are sample means and standard errors.

How much do we know about the population mean based on the sample means?

Example

Comparing a new teaching method with the

traditional method based on testing scores

[Figure: new method vs. traditional method]

Important Concepts

Variability

Variance

Standard deviation

Population and sample

z-Score

Scores, mean and the standard deviation

Probability and frequency distribution

z-Score and probability

Use the unit normal table

Sampling distribution and standard errors

Probability of sample means

Go Back to the Hanx Writer

Example

What the research project is about

Assuming two populations

Hanx Writer user population

Normal keyboard user population

Obtaining one sample from each population

Using the means from two samples to estimate the populations

The central question:

How likely is it that the sample from the Hanx Writer population is actually a

sample from the normal keyboard population (which would mean no difference)?

If the probability is low, the sample is more likely from another population.

Otherwise, we cannot rule out the possibility that the two samples are from the

same population, the normal keyboard users.

Homework

On CANVAS

With red rectangles

Recommended