A Hypothetical
Research Example
Is Hanx Writer better?
How would you design a research project to
answer this question?
Based on what criteria would you make a claim?
Better, no difference, worse
Some Fundamental Issues
How can you make a claim, usually a general
statement about the whole world, based on
partial observation of the world?
Scientific research: to generate new knowledge
Knowledge needs to be as general as possible
What we can see is only part of the world
Key issues
Sample as a representation of population
Criteria for claiming positive, negative results
Quantitative Data Analysis
Foundation
Statistics,
Frequency Distributions
and Central Tendency
Statistics
Statistics
Mathematical procedures
Dealing with observed information
Organization, summarization, and
interpretation
Populations and Samples
What are they?
Parameters and Statistics
What do they describe?
Descriptive and Inferential Statistical Methods
Different purposes of these two methods
The relationship between a population
and a sample.
Parameters
Statistics
Descriptive Methods
Inferential Methods
Sampling Error
Implications
Sample statistics are
Representative
Not identical to the corresponding population parameters
Two different samples will have different statistics.
Differences can occur just by chance
Sampling error is inevitable!
Statistics in the Context of
Research
Teaching methods and Test Scores
Two samples
Two statistics with difference
What causes the difference?
Sampling error
Teaching methods
Using statistics to infer the characteristics of
the population
What Is the Research Question
Here?
Scientific Methods and
Research Design
Correlational Method
Produce different sets of variables to see whether they are
related
Person   Income    Education (years)
#1       125,000   19
#2       100,000   20
#3        40,000   16
#4        35,000   16
#5        41,000   18
#6        29,000   12
#7        35,000   14
#8        24,000   12
#9        50,000   16
#10       60,000   17
Correlation: .79
Person   GPA   TV Time (hours/week)
#1       2.2   25
#2       3.5   21
#3       2.0   20
#4       2.9   15
#5       3.1   14
#6       3.2   13
#7       2.4   10
#8       3.4    9
#9       3.8    7
#10      3.7    4
Correlation: -.63
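Both coefficients can be reproduced from the two tables. The sketch below assumes the reported values are Pearson correlations (the slides do not name the coefficient); `pearson_r` is an illustrative helper, not a library function.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: sum of products divided by sqrt(SS_x * SS_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


income = [125000, 100000, 40000, 35000, 41000, 29000, 35000, 24000, 50000, 60000]
education = [19, 20, 16, 16, 18, 12, 14, 12, 16, 17]
gpa = [2.2, 3.5, 2.0, 2.9, 3.1, 3.2, 2.4, 3.4, 3.8, 3.7]
tv_hours = [25, 21, 20, 15, 14, 13, 10, 9, 7, 4]

r_income_education = pearson_r(income, education)
r_gpa_tv = pearson_r(gpa, tv_hours)
print(round(r_income_education, 2), round(r_gpa_tv, 2))  # 0.79 -0.63
```

A positive value means the two variables rise together; a negative value means one falls as the other rises. Neither says anything about cause and effect.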
Scientific Methods and
Research Design
Experimental Method
Produce different sets of variables to see whether they
have a cause-and-effect relationship
Independent and Dependent Variables
Independent
Manipulated by researchers
Different treatments
Dependent: observed
Variables and Measurement
Constructs
Internal attributes or characteristics that cannot be
directly observed
Intelligence, self-esteem, teaching/learning results
Discrete and Continuous Variables
Discrete
No values between two neighboring values
Continuous
Always can add a new value between two values
Real limits
Boundaries of a real score
Scales of Measurement
The Nominal Scale
Different names for categories
No quantitative distinction
The Ordinal Scale
Ordered categories
The Interval Scale
Ordered categories
Intervals between categories are comparable
The zero point is just for convenience
The Ratio Scale
The interval scale with an absolute zero point
Temperature: What scales are they?
Fahrenheit, Celsius, Kelvin
Different statistical methods for different types of variables
Exercise
A survey collects the following data
Age
Gender
Self-perception of weight level (overweight, normal,
underweight)
Satisfaction with services (on a scale of 1-5)
Weight loss
Identify their scales of measurement
Discrete or continuous?
Statistical Notation
Summation Notation
Observed values or scores
$\sum X$: the sum of all X
How about others?
$\sum X^2$, $\sum (X+1)$, $\sum (X+1)^2$
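The order of operations is what separates these expressions. A small sketch with hypothetical scores (the list `scores` is not from the slides):

```python
scores = [3, 1, 7, 4]  # hypothetical X values

sum_x = sum(scores)                                        # ΣX: add the scores
sum_x_squared = sum(x ** 2 for x in scores)                # ΣX²: square each score, then add
squared_sum_x = sum(scores) ** 2                           # (ΣX)²: add first, then square
sum_x_plus_1 = sum(x + 1 for x in scores)                  # Σ(X+1): add 1 to each score, then add
sum_x_plus_1_squared = sum((x + 1) ** 2 for x in scores)   # Σ(X+1)²: add 1, square, then add

print(sum_x, sum_x_squared, squared_sum_x, sum_x_plus_1, sum_x_plus_1_squared)
# 15 75 225 19 109
```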
Frequency Distributions
Purpose for Frequency
Distributions
To organize research results so that
researchers can see what happened
A frequency distribution does not simply
summarize the scores, but rather shows the
entire set of scores.
Frequency Distribution Tables
Two elements
Scores
Frequencies
Obtaining $\sum X$ from a
Frequency Distribution Table
Score and frequency
Proportions and Percentages
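A frequency distribution table can be built directly from raw scores. The sketch below uses hypothetical scores and assumes the usual definitions: the sum of scores is the sum of each score times its frequency, and a proportion is a frequency divided by the total count.

```python
from collections import Counter

scores = [8, 9, 8, 7, 10, 9, 6, 4, 9, 8]  # hypothetical raw scores

freq = Counter(scores)                        # score -> frequency f
n = sum(freq.values())                        # total count = sum of f
sum_x = sum(x * f for x, f in freq.items())   # sum of scores = sum of f * X

# Print the table from the highest score down, with proportion and percentage.
for x in sorted(freq, reverse=True):
    p = freq[x] / n
    print(f"X={x:2d}  f={freq[x]}  p={p:.2f}  %={p:.0%}")
```

Multiplying each score by its frequency before adding gives the same total as adding the raw scores one by one.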
Grouped Frequency
Distribution Tables
Why do we need groups?
Age, income, …
It could be quite tricky
in selecting group intervals
Survey design
How many hours do you watch TV per week?
a) 0-10 b) 10-20 c) 20-30 d) 30-40 e) above 40
a) 0-5 b) 5-10 c) 10-15 d) 15-20 e) above 20
Frequency Distribution Graphs
A pictorial presentation of a frequency
distribution
The X-axis (the abscissa): measurement scales
(categories)
Independent variables
The Y-axis (the ordinate): frequency
Dependent variables
Graphs for Interval or Ratio
Data
Histograms
The width of the bar extends to the real limits.
Graphs for Interval or Ratio
Data (Cont.)
Polygons
A continuous line
Graphs for Nominal or Ordinal
Data
Bar Graphs
Similar to histograms but with space between bars
Using Excel to Create Graphs
You need to have the Data Analysis package
installed
Histogram in Excel
Just generate the frequency distribution table
Use the Chart Wizard to create Histogram graph
You need to manually remove the space between bars
Graphs for Population
Distributions
Relative Frequencies
Cannot get the absolute
frequency distribution
Smooth Curves
For numerical scores
measured by an interval
or ratio scale
Symmetrical vs. skewed
distribution
Central Tendency
Why do we need to measure
central tendency?
To identify the “average” or “typical” data
Products and services
Different measures in different distributions
and different types of data
The Mean
Formulas
Population mean: $\mu = \dfrac{\sum X}{N}$
Sample mean: $M = \dfrac{\sum X}{n}$
The Weighted Mean
Different groups: $M = \dfrac{\sum X_1 + \sum X_2}{n_1 + n_2}$
How about more groups? $M = \dfrac{\sum (\sum X)}{\sum n}$
Different weights (GPA): $M = \dfrac{\sum XC}{\sum C}$, where $C$ is the weight of each score $X$
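The weighted-mean formulas can be checked with a few lines of code. All numbers below are hypothetical, and treating the weight in the GPA case as credit hours is an assumption, since the slides do not say what the weight stands for.

```python
# Two groups combined: (sum of group 1 + sum of group 2) / (n1 + n2)
group_1 = [6, 6, 6, 6]          # hypothetical scores, group mean 6
group_2 = [7, 7, 7, 7, 7, 7]    # hypothetical scores, group mean 7
combined_mean = (sum(group_1) + sum(group_2)) / (len(group_1) + len(group_2))

# Different weights: sum of (score * weight) / sum of weights
courses = [(4.0, 3), (3.0, 4), (2.0, 1)]  # (grade points, credit hours), hypothetical
gpa = sum(x * c for x, c in courses) / sum(c for _, c in courses)

print(combined_mean, gpa)  # 6.6 3.25
```

The combined mean (6.6) sits closer to the larger group's mean than the simple average of the two group means (6.5) does.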
Computing the Mean from a
Frequency Distribution Table
Distribution with frequency or percentage
Score Frequency
80 2
90 4
95 3
100 1
Score Percentage
80 20%
90 40%
95 30%
100 10%
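Both tables describe the same distribution, so they should give the same mean. A minimal sketch using the values from the two tables (the variable names are illustrative):

```python
freq_table = {80: 2, 90: 4, 95: 3, 100: 1}               # score -> frequency
pct_table = {80: 0.20, 90: 0.40, 95: 0.30, 100: 0.10}    # score -> proportion

n = sum(freq_table.values())
mean_from_freq = sum(x * f for x, f in freq_table.items()) / n   # sum of f*X over n
mean_from_pct = sum(x * p for x, p in pct_table.items())         # sum of p*X

print(mean_from_freq, round(mean_from_pct, 2))  # 90.5 90.5
```

With percentages there is no need to know n: each proportion already plays the role of frequency divided by n.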
Characteristics of the Mean
What may affect the mean?
Remember the formula to calculate the mean
$M = \dfrac{\sum X}{n}$
The Median
What is a median?
Finding a median
An Odd Number of Scores
1, 1, 3, 4, 5, 6, 6, 6, 7
An Even Number of Scores
1, 1, 3, 4, 5, 6, 6, 6
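The two cases can be worked out in code. This sketch assumes the simple definition of the median (the middle score, or the average of the two middle scores) and does no interpolation within tied scores.

```python
def median(scores):
    """Middle score of the ordered list; average of the two middle scores if the count is even."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_scores = [1, 1, 3, 4, 5, 6, 6, 6, 7]
even_scores = [1, 1, 3, 4, 5, 6, 6, 6]

print(median(odd_scores))   # 5
print(median(even_scores))  # 4.5
```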
The Median, the Mean, and the
Middle
Which to use?
It all depends on what you mean by the middle.
Mean: a weighted middle
Median: an absolute middle
Mean and Median
1,2,3,4,5
1,2,3,4,55
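The two data sets differ only in their last score. A short sketch using Python's standard `statistics` module shows how each measure reacts:

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 55]

print(mean(a), median(a))  # 3 3
print(mean(b), median(b))  # 13 3
```

One extreme score pulls the mean from 3 to 13, while the median stays at 3.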
The Mode
The score with the greatest frequency
A distribution could have multiple modes.
Selecting a Measure of Central
Tendency
When to Use the Median
Extreme Scores or skewed distributions
House prices
Undetermined Scores or Incomplete Data
Ordinal Scales
When to Use the Mode
Nominal Scales
You cannot rank scores.
Discrete Variables
The mean and median may be meaningless.
Central Tendency and the
Shape of the Distribution
Symmetrical Distributions
The mean and the median are the same
Central Tendency and the
Shape of the Distribution
Skewed Distributions
Variability and
Probability
Selecting An Olympian
An easy case
Selecting An Olympian
Picking a Stock
Stock A
Average annual return: 5% in the past 8 years
5%, 4%, 3%, 3%, 3%, 6%, 7%, 9%
Stock B
Average annual return: also 5% in the past 8 years
15%, 15%, -10%, 20%, -5%, 5%, -10%, 10%
Which one to choose?
Assuming you are not a risk-taker.
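One way to put a number on the difference in risk is the standard deviation, which the following slides introduce. The sketch below assumes the eight yearly returns are treated as the whole population, so it uses `statistics.pstdev` (division by N).

```python
from statistics import mean, pstdev

stock_a = [5, 4, 3, 3, 3, 6, 7, 9]            # annual returns in percent
stock_b = [15, 15, -10, 20, -5, 5, -10, 10]   # annual returns in percent

for name, returns in (("A", stock_a), ("B", stock_b)):
    print(name, mean(returns), round(pstdev(returns), 2))
# A 5 2.06
# B 5 11.18
```

Both stocks average 5%, but the returns of stock B are spread more than five times as widely around that average.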
Data Collected from the Real
World Is Always Noisy.
How to Decide the Quality of
Data?
How to Tell the Difference
between Different Data Sets?
Variability
The spread of data
Purpose for measuring variability
Understanding the distribution of data
Evaluating performances
People or products
Simple Measures of Variability
Range and Interquartile Range
Range
Difference between the maximum and the minimum
maximum – minimum + 1
Or maximum – minimum
Interquartile range
Difference between the first and third quartiles
Algorithm
1. Order the scores
2. Split into two equal sets
3. Find the middle values for two sets
4. Get the difference between them
What is the interquartile range of the following data?
1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 70
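The four steps can be sketched as follows. One assumption is needed: with an odd number of scores, the middle score is left out so that the two sets are equal in size. Other quartile conventions exist and can give slightly different values.

```python
def middle_value(ordered):
    """Middle of an already sorted list (average of the two middle values if the count is even)."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def interquartile_range(scores):
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ordered = sorted(scores)        # 1. order the scores
    half = len(ordered) // 2
    lower = ordered[:half]          # 2. split into two equal sets
    upper = ordered[-half:]         #    (with an odd count the middle score is left out)
    q1 = middle_value(lower)        # 3. find the middle value of each set
    q3 = middle_value(upper)
    return q1, q3, q3 - q1          # 4. take the difference


data = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 70]
q1, q3, iqr = interquartile_range(data)
print(q1, q3, iqr)  # 15 37 22
```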
Deviation
The problem of using the range to measure variability
It totally ignores the data in between
1, 11, 15, 19, 20, 24, 28, 34, 37, 47
1, 24, 24, 24, 24, 24, 24, 24, 24, 47
The problem of using the interquartile range
It totally ignores the data outside the quartiles.
1, 11, 15, 19, 20, 24, 28, 34, 37, 47
-100, 11, 15, 19, 20, 24, 28, 34, 37, 147
Deviation
Distance from the mean (the average of all data)
Get the deviation for every data point
Average absolute deviation
1, 3, 5, 7, 9: (4 + 2 + 0 + 2 + 4)/5 = 2.4
3, 3, 5, 7, 7: (2 + 2 + 0 + 2 + 2)/5 = 1.6
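The two worked examples can be reproduced with a short function (`average_absolute_deviation` is an illustrative name):

```python
def average_absolute_deviation(scores):
    """Mean of the absolute distances between each score and the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)


print(average_absolute_deviation([1, 3, 5, 7, 9]))  # 2.4
print(average_absolute_deviation([3, 3, 5, 7, 7]))  # 1.6
```

Both sets have a mean of 5, but the second is packed more tightly around it.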
Variance
The mean of the squared deviations
$\sigma^2 = \dfrac{\sum (X-\mu)^2}{N}$
$\mu$: the mean of the population (all existing data points)
$N$: the number of data points in a population
Average absolute deviation: $\dfrac{\sum |X-\mu|}{N}$
Why do we prefer variance over average absolute deviation?
Standard deviation $\sigma$
The square root of the variance
Standard Deviation and
Variance For Population
Calculation steps
Deviations
Squared deviations
Sum of squares: SS
Variance: $\sigma^2$ (SS divided by $N$)
Standard deviation: $\sigma$
Computation Formula for the Sum
of Squared Deviations SS
The definitional formula is difficult to use and
may lead to rounding errors.
$SS = \sum (X-\mu)^2$
The computational formula is often used.
$SS = \sum X^2 - \dfrac{(\sum X)^2}{N}$
They are equivalent.
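The equivalence is easy to check numerically. A sketch with a small hypothetical population (the scores are not from the slides):

```python
from math import sqrt

scores = [1, 9, 5, 8, 7]   # hypothetical population
n_scores = len(scores)
mu = sum(scores) / n_scores

# Definitional formula: sum of squared deviations from the mean
ss_definitional = sum((x - mu) ** 2 for x in scores)

# Computational formula: sum of squares minus (squared sum / N)
ss_computational = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n_scores

variance = ss_definitional / n_scores   # population variance
std_dev = sqrt(variance)                # population standard deviation

print(ss_definitional, ss_computational, variance, round(std_dev, 2))
# 40.0 40.0 8.0 2.83
```

The computational form needs only the sum of the scores and the sum of the squared scores, so no deviation has to be computed or rounded along the way.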
Standard Deviation and
Variance For Samples
Sample: A small portion of population
The calculation of standard deviation and variance for samples is very similar to that for a population
The only difference
n-1 is used in calculating the variance: $s^2 = \dfrac{SS}{n-1}$
The computational formula for the sum of squares still uses n.
Why not use n?
Degrees of Freedom
Not all samples are random!
Statistics
Assume observed data are random, but follow certain
rules
Need to make an adjustment for nonrandom data
For a sample with n scores, only n-1 scores
are truly independent once the sample mean is fixed.
Degrees of freedom: the number of truly independent deviations
All samples are biased.
Underestimate or overestimate parameters.
Example of Underestimated
Variance
N = 6 (0,0,3,3,9,9), n = 2
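The underestimate can be seen by listing every possible sample. The sketch below assumes sampling with replacement and counts all 36 ordered samples of size 2; the function name `ss` is illustrative.

```python
from itertools import product
from statistics import mean

population = [0, 0, 3, 3, 9, 9]
n = 2

mu = mean(population)
pop_variance = sum((x - mu) ** 2 for x in population) / len(population)


def ss(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


samples = list(product(population, repeat=n))            # all 36 ordered samples
avg_dividing_by_n = mean([ss(s) / n for s in samples])
avg_dividing_by_n_minus_1 = mean([ss(s) / (n - 1) for s in samples])

print(pop_variance, avg_dividing_by_n, avg_dividing_by_n_minus_1)
# 14.0 7.0 14.0
```

Dividing by n gives sample variances that average only half of the population variance here; dividing by n-1 gives sample variances that average exactly the population variance.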
Adjustment (df) Is Necessary
Biased sample vs. unbiased sample
Biased statistics vs. unbiased statistics
Standard Deviation in
Descriptive and Inferential
Statistics
Descriptive
What is going on and how spread out the data are
Mean and standard deviation
Inferential
What may come? In particular, how likely will extreme scores be observed?
Probability as a function of the mean and standard deviation
But, be careful! Lessons learned from financial markets
Probability of the market crash in October 1987
Probability of the fall of Long-Term Capital Management in 1998
When Genius Failed: The Rise and Fall of Long-Term Capital Management
Population vs. Sample
Notations
N vs. n (size)
$\mu$ vs. $M$ (mean)
$\sigma^2$ vs. $s^2$ (variance)
z-Score:
Location of real
scores in a
standardized
distribution
Why Do We Need z-Scores?
To help compare scores from different distributions
To help compute probability
Examples
Two students’ grades from two classes
80 vs. 90
80 (M=70, s=10) vs. 90 (M=85, s=5)
How good is a test score?
GRE Verbal: 160 (percentile 85)
Standardized scores
Probability
A z-Score
Tells the location of a score in a standardized
distribution
Sign: above or below the average
Number: distance to the mean
Formula
$z = \dfrac{X-\mu}{\sigma}$ (population), $z = \dfrac{X-M}{s}$ (sample)
z
How far is a score from the mean, measured by the
standard deviation?
Examples
Given $\mu = 500$, $\sigma = 100$ for all SAT scores
The z-score of an SAT score of 620
The score if $z = -0.3$
In a standard test ($\sigma = 100$), if $X = 720$, $z = 1.2$
Calculate the mean of the test
$z = \dfrac{X-\mu}{\sigma}$
Using a Distribution Graph
For a distribution with a standard deviation of
$\sigma = 4$, a score of $X = 52$ corresponds to a
z-score of $-2.0$. What is the mean for this
distribution?
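Each of these exercises rearranges the same z-score formula. A minimal sketch (the three function names are illustrative) that also covers the earlier two-class comparison of 80 vs. 90:

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviations."""
    return (x - mean) / sd


def raw_score(z, mean, sd):
    """Score that sits z standard deviations from the mean."""
    return mean + z * sd


def mean_from(x, z, sd):
    """Mean of the distribution, given a score, its z-score, and the standard deviation."""
    return x - z * sd


print(round(z_score(620, 500, 100), 2))     # 1.2   (SAT score of 620)
print(round(raw_score(-0.3, 500, 100), 2))  # 470.0 (score for z = -0.3)
print(round(mean_from(720, 1.2, 100), 2))   # 600.0 (mean of the standard test)
print(round(mean_from(52, -2.0, 4), 2))     # 60.0  (mean when sd = 4, X = 52, z = -2.0)

# Two students from two classes: both are exactly one standard deviation above their class mean.
print(z_score(80, 70, 10), z_score(90, 85, 5))  # 1.0 1.0
```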
Using z-Scores to Standardize
a Distribution
Questions
Can we compare students’ GRE scores obtained in different years? Say last year vs. two years ago?
Different groups of students take the test
Different test questions
Are they really comparable?
Yes!
GRE scores are standardized
Score distribution is standardized each time.
Standardizing a Distribution
Relabeling each score
How to Standardize a Score?
Convert a score under a distribution, with any
mean and standard deviation, to a z-score
under a standardized distribution, with a
mean of 0 and a standard deviation of 1
Convert the z-score to a score under a
standardized distribution, with a
predetermined mean and standard deviation.
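The two-step conversion can be sketched as one function. The numbers are hypothetical and `restandardize` is an illustrative name.

```python
def restandardize(x, old_mean, old_sd, new_mean, new_sd):
    """Relabel a score: original distribution -> z-score -> new standardized distribution."""
    z = (x - old_mean) / old_sd      # step 1: convert to a z-score (mean 0, sd 1)
    return new_mean + z * new_sd     # step 2: convert to the predetermined mean and sd


# A score of 64 in a distribution with mean 57 and sd 14 (z = 0.5),
# relabeled for a standardized distribution with mean 50 and sd 10.
print(restandardize(64, 57, 14, 50, 10))  # 55.0
```

The score keeps its relative position: it is half a standard deviation above the mean both before and after relabeling.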
Probability
Probability
Probability definition
Chance, odds, proportion
What is the probability of getting a King from a deck
of cards?
Random Sampling
Each individual of the population has the
same chance of being selected
Constant probability for each and every
selection in repeated sampling
Previously selected individuals must be returned to
the population!
The General Formula
Example: coin toss
$p(A) = \dfrac{\text{number of outcomes classified as } A}{\text{total number of possible outcomes}}$
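The general formula is a simple count. The sketch below applies it to the coin toss and to the earlier question about drawing a King, assuming a standard 52-card deck without jokers; `Fraction` keeps the answers exact.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]   # 52 possible outcomes

kings = [card for card in deck if card[0] == "K"]           # outcomes classified as "King"
p_king = Fraction(len(kings), len(deck))

coin = ["heads", "tails"]
p_heads = Fraction(coin.count("heads"), len(coin))

print(p_king, p_heads)  # 1/13 1/2
```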
Using Frequency
Distributions to Calculate
Probability
Probability = Proportion
More Complicated
Distributions?
What If Distributions Are
Smooth Curves?
Same technique: to find
the proportion
But how?
No blocks to count.
Calculus
Probability and the Normal
Distribution
The normal distribution
A particular shape
Can describe many phenomena if the sample is big
enough
The bell curve
Symmetrical
Single mode in the middle
Simulation (link)
Proportion of a z-Score in
the Normal Distribution
Has Been Pre-Calculated
The Unit Normal Table
Up to z = 4.00
Example
p (z>1.0)
p (z<1.5)
p (z<-0.5)
More
p (1<z<1.5)
p (-0.5<z<1.5)
Different tables for different distributions
Normal distribution is most often seen.
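The table lookups for the examples above can be checked in code. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of a printed unit normal table, so the last digit may differ slightly from a value found by subtracting rounded table entries.

```python
from statistics import NormalDist

unit_normal = NormalDist()   # mean 0, standard deviation 1

p_above_1 = 1 - unit_normal.cdf(1.0)                        # p(z > 1.0)
p_below_1_5 = unit_normal.cdf(1.5)                          # p(z < 1.5)
p_below_minus_0_5 = unit_normal.cdf(-0.5)                   # p(z < -0.5)
p_1_to_1_5 = unit_normal.cdf(1.5) - unit_normal.cdf(1.0)    # p(1 < z < 1.5)
p_minus_0_5_to_1_5 = unit_normal.cdf(1.5) - unit_normal.cdf(-0.5)   # p(-0.5 < z < 1.5)

for p in (p_above_1, p_below_1_5, p_below_minus_0_5, p_1_to_1_5, p_minus_0_5_to_1_5):
    print(round(p, 4))
# 0.1587
# 0.9332
# 0.3085
# 0.0918
# 0.6247
```

A proportion between two z-scores is always the difference between two "below" proportions.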
Finding the z-Score from a
Probability
What if you cannot find the exact number?
Use the closest z-score
Interpolate
Probabilities For Scores from A
Normal Distribution
Transfer scores to z-scores and then look
up the unit normal table
p(55<x<65)=?
Find a Score for a Particular
Probability
Look up the unit normal table for a z-score,
and then find a score in a distribution which
corresponds to the z-score
What is the minimum score necessary to be in the top 15%?
You Can Estimate Where Your
IQ Stands
$\mu_{IQ} = 100$
$\sigma_{IQ} = 15$
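With the mean and standard deviation given here, both directions of the lookup can be sketched: from a score to a proportion, and from a proportion back to a score. The IQ of 120 is a hypothetical input, applying the top-15% question to IQ is an assumption made for illustration, and IQ is assumed to be normally distributed.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

my_iq = 120                               # hypothetical score
z = (my_iq - iq.mean) / iq.stdev          # location in the standardized distribution
proportion_below = iq.cdf(my_iq)          # share of the population scoring lower

cutoff_top_15 = iq.inv_cdf(0.85)          # minimum score needed to be in the top 15%

print(round(z, 2), round(proportion_below, 3), round(cutoff_top_15, 1))
# 1.33 0.909 115.5
```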
Importance of Probability to
Research
Compare samples with the population mean
Does a particular sample belong to the population?
How sure is it?
Probabilities and
Samples:
Distribution of
Sample Means
Sampling Error
Samples Always Have Errors!
How Can We Infer Population Parameters based on One or
A Few Samples?
Recall Our Examples on GRE
and SAT Scores
We can make predictions
Using statistics
Mean, standard deviation
For any variable, if we know the mean and
standard deviation, we will have a way to deal
with it
Sample statistics can be treated in the same way.
Sample mean: the mean of all possible sample means
Sample variance: the variance of all possible sample
means
Distribution of Sample Means
A frequency distribution of sample means
Including all the possible samples with a
particular sample size n
The distribution of statistics
A sampling distribution
With a specific sample size
Example
The population: 2,4,6,8
Sample size: n=2
Random sample
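Every possible sample can be listed for a population this small. The sketch below assumes sampling with replacement, as random sampling requires, which gives 16 ordered samples of size 2.

```python
from collections import Counter
from itertools import product
from math import sqrt

population = [2, 4, 6, 8]
n = 2

samples = list(product(population, repeat=n))     # all 16 ordered samples
sample_means = [sum(s) / n for s in samples]
distribution = Counter(sample_means)              # sample mean -> frequency

mu = sum(population) / len(population)
sigma = sqrt(sum((x - mu) ** 2 for x in population) / len(population))

mean_of_means = sum(sample_means) / len(sample_means)
sd_of_means = sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means))

for m in sorted(distribution):
    print(m, "#" * distribution[m])               # text histogram of the sample means
print(mean_of_means, round(sd_of_means, 3), round(sigma / sqrt(n), 3))
# 5.0 1.581 1.581
```

The sample means pile up around 5, the mean of the sample means equals the population mean, and their standard deviation equals the population standard deviation divided by the square root of n.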
What Do We Get?
The sample means pile up around the
population mean.
The distribution of sample means is like a
normal distribution.
It is more likely to get a sample mean close to
the population mean.
What is the probability of getting an extreme sample
mean?
Central Limit Theorem
For a population with mean $\mu$ and standard deviation $\sigma$, the distribution of
sample means for sample size $n$ will have a
mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$
It is about any population
The shape of distribution does not matter.
The mean and standard deviation do not matter.
Shape of Distribution
Would be a normal distribution if
The population is a normal distribution, or
The sample size is large enough, say larger than
30
The Mean of Sample Means
The expected value of M
It is near the population mean
The Standard Deviation of
Sample Means
The standard error of M
Standard distance between an $M$ and $\mu$
Notation: $\sigma_M$
The larger the sample size is, the smaller the
standard error of M is.
The law of large numbers
The larger the sample size is, the more probable the
sample means will be close to the population mean.
The more unlikely a sample mean is very far away
from the population mean.
Distributions of Sample Means
for A Normal Distribution
[Figure: distributions of sample means for sample sizes n1, n2, and n3]
A Non-Normal Distribution
Distributions of Sample Means
[Figure: distributions of sample means for sample sizes n1 and n2]
Implication
The probability of sample means can be
estimated by using z-scores and the unit
normal table
Variables:
A sample mean
The population mean
The standard error
The population standard deviation and sample size
Example: $\mu = 500$, $\sigma = 100$
Given n = 25, what is the probability of getting a
sample mean larger than 540?
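The example can be worked through in three steps: standard error, z-score of the sample mean, then the proportion in the tail. The sketch assumes the distribution of sample means is normal and uses `statistics.NormalDist` in place of the unit normal table.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500, 100, 25
sample_mean = 540

standard_error = sigma / sqrt(n)              # standard deviation of the sample means
z = (sample_mean - mu) / standard_error       # z-score of the sample mean
p_larger = 1 - NormalDist().cdf(z)            # proportion of sample means above 540

print(standard_error, z, round(p_larger, 4))  # 20.0 2.0 0.0228
```

A single score of 540 would be unremarkable (z = 0.4), but a mean of 540 across 25 scores is two standard errors above the population mean.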
Standard Error
Like standard deviation
Measure the standard distance between a sample mean and the population mean
Provide information about sampling error
Very often, we don’t know the population mean
All we have are sample means and standard errors.
How much do we know about the population mean based on the sample means?
Example
Comparing a new teaching method with the
traditional method based on testing scores
[Figure: score distributions for the new method and the traditional method]
Important Concepts
Variability
Variance
Standard deviation
Population and sample
z-Score
Scores, mean and the standard deviation
Probability and frequency distribution
z-Score and probability
Use the unit normal table
Sampling distribution and standard errors
Probability of sample means
Go Back to the Hanx Writer
Example
What the research project is about
Assuming two populations
Hanx Writer user population
Normal keyboard user population
Obtaining one sample from each population
Using the means from two samples to estimate the populations
The central question:
How likely is it that the Hanx Writer sample is actually a
sample from the normal keyboard population (which would mean no difference)?
If the probability is low, the sample is more likely from another population.
Otherwise, we cannot rule out the possibility that the two samples are from the
same population, the normal keyboard users.
Homework
On CANVAS
With red rectangles