8/12/2019 Correlation and Regression Feb2014
PLT 6133
QUANTITATIVE DATA
ANALYSIS
CORRELATION AND
REGRESSIONS
Statistics may be regarded as a method of dealing
with data. This definition stresses the view that
statistics is a tool concerned with the collection,
organization and analysis of numerical facts and
observations... The major concern of
descriptive statistics is to present information in a
convenient, usable, and understandable form.
- Richard Runyon & Audrey Haber
Summary of Major Types of Descriptive Statistics
TYPE OF TECHNIQUE   STATISTICAL TECHNIQUE            PURPOSE
Univariate          Frequency distribution,          Describe one variable
                    measures of central tendency,
                    standard deviation
Bivariate           Correlation, percentage          Describe a relationship
                    table, chi-square                or the association
                                                     between two variables
Multivariate        Elaboration paradigm,            Describe relationships
                    linear and multiple regression   among several variables,
                                                     or see how several
                                                     independent variables have
                                                     an effect on a dependent
                                                     variable
Three Broad Types of Research Questions:
1 Descriptive Research Questions
2 Associational Research Questions
3 Difference Research Questions
DESCRIPTIVE RESEARCH QUESTIONS
Descriptive Research Questions are not answered
with inferential statistics.
They merely describe or summarize data, without
trying to generalize to a larger population of individuals.
Mean, Percentage, SD, Mode, Median, etc.
INFERENTIAL STATISTICS rely on principles from probability
sampling, whereby a researcher uses a random process to
select cases from the entire population.
Inferential statistics are a precise way to talk about how
confident a researcher can be when inferring from the
results in a sample to the population.
ASSOCIATIONAL RESEARCH QUESTIONS
Associational Research Questions are those in
which 2 or more variables are associated or
related.
This approach usually involves an attempt to see
how 2 or more variables covary (as one grows
larger, the other grows larger or smaller) or how one
or more variables enable one to predict another
variable.
Pearson Correlation, Spearman Correlation, Eta
Correlation, etc.
DIFFERENCE RESEARCH QUESTIONS
Difference Research Questions: For these
questions, we compare scores (on the dependent
variable) of 2 or more different groups, each of
which is composed of individuals with one of the
values or levels on the independent variable.
This type of question attempts to demonstrate that
groups are not the same on the dependent
variable.
T-test, ANOVA, ANCOVA, MANOVA, MANCOVA,
etc.
CORRELATION
The correlation is one of the most common and
most useful statistics.
Definition - A correlation is a single number that describes
the degree of relationship (dependence) between two
variables. It characterizes the existence of a relationship
between variables.
Relationship between 2 variables can vary from strong to
weak.
More accurately, correlation is the co-variation of
standardized variables.
However, a correlation does not imply causation.
Just because there is a strong positive or strong
negative correlation between 2 variables, this
does not mean that one variable is caused by the
other variable. Many statisticians claim that a
strong correlation never implies a cause-effect
relationship between two variables.
GENERALLY
Two variables may correlate with each other in 3 possible ways:
Positive relationship:
Both variables vary in the same direction: as one goes up, the other goes up. E.g.
salary and years of education are positively correlated because people who get the
highest salaries tend to be the ones who have gone to school the longest.
Negative relationship:
Two variables vary in opposite directions: as one goes up, the other goes down. E.g. the
number of problems faced and the amount of immunoglobulin A in a person's system
are negatively correlated because as the number of problems goes up, the amount of
immunoglobulin A tends to go down.
Zero relationship:
Two variables have no relationship with each other: one changes without affecting the
other. E.g. the average speed of a car and the average speed of a mouse. Also, the
relationship between personality fluctuations and the movement of distant stars has a
zero correlation.
Degree of Correlation: How Strongly are
variables correlated?
The degree of correlation between two variables can
be established using two methods:
Scatter plot: a graph with plotted values for the two
variables being compared.
Correlation Coefficient methods.
SCATTER PLOTS
Scatter Plots - Example
Example of negative correlation
- Hours of exercise per week and months
machine owned
Example of uncorrelated data
- Height and months machine owned
Example of positive correlation
- Cardiovascular fitness score and months
machine owned
Scatter Plots - Example
Example of (a) weak and (b) strong correlation
Scatter Plots - Example
Researchers laid out 10 circular plots,
each 4 meters in diameter, in an area
where beavers were cutting down
cottonwood trees. The number of
stumps and the number of clusters of
beetle larvae were recorded in each
plot with the following results:
Stumps   Beetle Larvae
2        10
2        30
1        12
3        24
4        40
1        11
5        56
3        40
1        8
2        14
Scatter Plots - Example
The scatter plot for the previous data:
From the scatter plot, there appears to be a fairly strong positive association
between the number of cottonwood stumps and the number of clusters of
beetle larvae.
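The visual impression from the scatter plot can also be checked with a number. The sketch below computes Pearson's r for the ten plots from deviations about the means. The function name pearson_r is an illustrative choice, not something from the slides.

```python
import math

# Data from the beaver example: one entry per circular plot.
stumps = [2, 2, 1, 3, 4, 1, 5, 3, 1, 2]
larvae = [10, 30, 12, 24, 40, 11, 56, 40, 8, 14]

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(stumps, larvae)
print(round(r, 3))  # 0.916
```

A value this close to 1 agrees with the strong positive association seen in the plot.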
CORRELATION COEFFICIENT
Correlation coefficient
A correlation coefficient is used to measure the degree of
correlation between variables. It is a quantitative indicator.
There are several types of correlation coefficient, depending on
the type of relationship.
The most common is Pearson's correlation coefficient
(denoted by r), which is sensitive only to a linear relationship
between two variables.
Other common correlation coefficients include
Spearman's rank correlation coefficient (denoted by ρ) and
Kendall's rank correlation coefficient (denoted by τ).
Correlation Coefficient
A correlation coefficient is a calculated number that indicates the
degree of correlation between two variables:
A perfect positive correlation gives a value of
1 (or 100%).
A perfect negative correlation gives a value of
-1.
A value of zero shows no correlation at all.
Correlation Coefficient
TABLE 1.0 Interpreting a Correlation Coefficient
Size of the correlation coefficient   General interpretation
0.8 to 1.0                            Very strong relationship
0.6 to 0.8                            Strong relationship
0.4 to 0.6                            Moderate relationship
0.2 to 0.4                            Weak relationship
0.0 to 0.2                            Weak or no relationship
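Table 1.0 can be turned into a small lookup, sketched below. Two choices here are assumptions rather than part of the table: the sign of r is ignored (the table lists only positive sizes), and a value that falls exactly on a boundary such as 0.6 is placed in the higher band. The function name interpret is illustrative.

```python
def interpret(r):
    """Return the Table 1.0 label for a correlation coefficient r."""
    size = abs(r)  # assumption: strength is judged without regard to sign
    if size > 1:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    if size >= 0.8:  # assumption: boundary values go to the higher band
        return "Very strong relationship"
    if size >= 0.6:
        return "Strong relationship"
    if size >= 0.4:
        return "Moderate relationship"
    if size >= 0.2:
        return "Weak relationship"
    return "Weak or no relationship"

print(interpret(0.92))   # Very strong relationship
print(interpret(-0.5))   # Moderate relationship
```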
Correlation Coefficient
A much more precise way to interpret the correlation coefficient is to
compute the coefficient of determination. The coefficient of
determination is the percentage of variance in one variable that is
accounted for by the variance in the other variable.
Coefficient of determination = Square of correlation coefficient
Example: If the correlation between GPA and the number of hours of
study is 0.7, then the coefficient of determination is _______.
This means _______% of the variance in GPA can be explained by the
variance in studying time. The stronger the correlation, the more the
variance can be explained.
However, this means that _______ % cannot be explained. The amount
of unexplained variance is called the coefficient of alienation (or
coefficient of non-determination).
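The arithmetic behind the GPA example is short enough to check directly. This is a minimal sketch using the r = 0.7 stated in the example. The variable names are illustrative.

```python
r = 0.7                      # correlation between GPA and hours of study
r_squared = r ** 2           # coefficient of determination
unexplained = 1 - r_squared  # share of variance left unexplained

print(f"explained:   {r_squared:.0%}")    # explained:   49%
print(f"unexplained: {unexplained:.0%}")  # unexplained: 51%
```

Note that a correlation of 0.7 sounds strong, yet it leaves about half of the variance unexplained.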
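Pearson's Correlation Coefficient
If we have a series of n measurements of X and Y written as x_i and y_i where i = 1, 2, ..., n,
then the sample correlation coefficient can be used to estimate the population Pearson
correlation r between X and Y. The sample correlation coefficient is written as:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$
where \bar{x} and \bar{y} are the sample means of X and Y, and s_x and s_y are the sample standard
deviations of X and Y.
This can also be written as:
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$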
Correlation Coefficient Example
Is there a linear relationship between the age at
which a child first begins to speak and his or her
mental ability later on? To answer this question a
study was conducted in which the age (in months)
at which a child first spoke and the child's score on
an aptitude test as a teenager were recorded:
Age   Score
15    95
26    71
10    83
9     91
15    102
20    87
18    93
11    100
8     104
20    94
Draw a scatter plot and determine whether there
appears to be a linear relationship between these
two variables. If so, describe the relationship,
calculate r, and determine what percentage of the
variability in the aptitude score can be explained
by the variability in the age at which a child begins
speaking.
Correlation Coefficient Example
The scatter plot for the data:
There appears to be a moderate negative association between the
age at which a baby first begins to speak and mental ability later in
life.
Correlation Coefficient Example
Calculation of the correlation coefficient:
$$r = \frac{10(13676) - (152)(920)}{\sqrt{\left(10(2616) - 152^2\right)\left(10(85510) - 920^2\right)}} = -0.5973301213 \approx -0.60$$
The variability in the age at which a child first speaks explains only about
36% (r^2 = 0.36) of the variability in aptitude test scores later in life.
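The totals that go into this calculation (152, 920, 13676, 2616 and 85510) can be reproduced from the age and score data. The sketch below uses the sums-based form of Pearson's r. The variable names are illustrative.

```python
import math

age = [15, 26, 10, 9, 15, 20, 18, 11, 8, 20]         # months at first speech
score = [95, 71, 83, 91, 102, 87, 93, 100, 104, 94]  # aptitude test score

n = len(age)
sum_x, sum_y = sum(age), sum(score)
sum_xy = sum(x * y for x, y in zip(age, score))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in score)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 152 920 13676 2616 85510
print(round(r, 4), round(r * r, 2))          # -0.5973 0.36
```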
Exercise
Compute the correlation between the men's
heights (in cm) and weights (in kg) for the
following data:
Man   Height (X)   Weight (Y)
A     182          86
B     167          61
C     175          70
D     182          75
E     180          70
When is a correlation
strong enough?
0.9 very high correlation; very
dependable relationship
Words of Caution
Examine your data distribution (e.g. using a scatter
plot) before you do anything with the correlation,
and make sure you know the dos and don'ts of
the correlation coefficient!
The correlation coefficient is just an index of a
relationship; it tells nothing about the cause
and effect of the relationship!
Limit yourself to linear relationships if you don't
have an adequate statistical background!
REGRESSION
Regression Analysis
In statistics, regression analysis is a statistical technique for
estimating the relationships among variables. It includes
many techniques for modeling and analyzing several
variables, when the focus is on the relationship between a
dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand
how the typical value of the dependent variable changes
when any one of the independent variables is varied, while
the other independent variables are held fixed.
All regression analyses test whether a significant
quantitative relationship exists.
Some Commonly Used Jargon
Linear Regression
Line of Best Fit
Regression Equation
The General Idea About Regression
Suppose we are asked to investigate the relationship between two
variables, namely Variable P (the independent variable) and Variable
Q (the dependent variable):
What would be the predicted value of Q if P = 15? If P = 25?
How do you predict these?
Pair     Variable P   Variable Q
Pair 1   10           7
Pair 2   20           12
Pair 3   30           17
Pair 4   40           22
[Figure: scatter plot of the Q variable (vertical axis, 5 to 20) against the P variable (horizontal axis, 0 to 40), with points labelled Pair 1 to Pair 4.]
Notice that if we connect these points, we would get a
straight line. This line fits ALL the observed points.
This straight line is called the line of best fit or
regression line.
The line of best fit defines a basis for predicting values
of Q, given values of P (and vice versa).
The concept of the line of best fit can be extended to
form a basis for linear regression as well as non-linear
regression.
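For the four (P, Q) pairs, the line of best fit can be found from any two of the points, because all four lie on the same straight line. The sketch below does this and then answers the two prediction questions. The function name predict_q is an illustrative choice.

```python
# The four (P, Q) pairs from the table.
pairs = [(10, 7), (20, 12), (30, 17), (40, 22)]

# Slope and intercept taken from the first two pairs.
(p1, q1), (p2, q2) = pairs[0], pairs[1]
slope = (q2 - q1) / (p2 - p1)
intercept = q1 - slope * p1

def predict_q(p):
    """Predicted Q for a given P on the line Q = intercept + slope * P."""
    return intercept + slope * p

# Every observed pair lies exactly on the line, so it fits all the points.
assert all(predict_q(p) == q for p, q in pairs)

print(slope, intercept)              # 0.5 2.0
print(predict_q(15), predict_q(25))  # 9.5 14.5
```

With real data the points rarely fall exactly on a line, which is why a fitting method such as least squares is needed.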
Linear Regression
Non-Linear Regression
Regression Models
Regression models involve the following variables:
The unknown parameters, denoted as β, which may
represent a scalar or a vector.
The independent variables, X.
The dependent variable, Y.
Regression models can predict a value of the Y variable given
values of the X variables. Prediction within the range of
values in the dataset used for model-fitting is known
informally as interpolation. Prediction outside this range of
the data is known as extrapolation.
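The distinction between interpolation and extrapolation can be expressed as a simple range check. This is a minimal sketch for a single independent variable. The function name prediction_kind and the sample values are illustrative assumptions.

```python
def prediction_kind(x, observed_xs):
    """Classify a prediction at x relative to the range of the fitted data."""
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

observed = [10, 20, 30, 40]  # illustrative X values used for model-fitting

print(prediction_kind(25, observed))  # interpolation
print(prediction_kind(55, observed))  # extrapolation
```

Extrapolated predictions deserve more caution, since the data give no evidence that the fitted relationship continues to hold outside the observed range.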
Linear Regression
In linear regression, data is modeled using linear predictor
functions, and unknown model parameters are estimated from
the data.
Such models are called linear models. Most commonly, linear
regression refers to a model in which the conditional
mean of Y given the value of X is an affine function of X. Less
commonly, linear regression could refer to a model in which
the median, or some other quantile of the conditional
distribution of Y given X, is expressed as a linear function of X.
Like all forms of regression analysis, linear regression focuses on
the conditional probability distribution of Y given X, rather than
on the joint probability distribution of Y and X, which is the
domain of multivariate analysis.
Non-Linear Regression
In non-linear regression, data are modeled by a function
which is a non-linear combination of the model
parameters and depends on one or more independent
variables.
As linear regression is much easier, some non-linear
regression problems can be transformed or segmented into a linear
regression.
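One common case of transforming a non-linear model into a linear one is the exponential curve y = a * exp(b * x): taking logarithms gives ln(y) = ln(a) + b * x, which is a straight line in x. The sketch below illustrates this on hypothetical, noise-free data generated from a = 2 and b = 0.5. The data and the variable names are assumptions made for the illustration.

```python
import math

# Hypothetical data generated from y = 2 * exp(0.5 * x), with no noise.
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

# Taking logs turns y = a * exp(b * x) into ln(y) = ln(a) + b * x.
log_ys = [math.log(y) for y in ys]

# Ordinary least-squares slope and intercept for ln(y) against x.
n = len(xs)
mean_x = sum(xs) / n
mean_ly = sum(log_ys) / n
b = (sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
     / sum((x - mean_x) ** 2 for x in xs))
ln_a = mean_ly - b * mean_x
a = math.exp(ln_a)

print(round(a, 6), round(b, 6))
```

With noisy data, a fit on the log scale weights the errors differently from a direct non-linear fit, so the two approaches generally give somewhat different estimates.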
Method of least squares
The method of least squares gives a way to find the best
estimate of a particular measurement or data, assuming that the
errors (i.e. the differences from the true value) are random and
unbiased.
"Least squares" means that the overall solution minimizes the
sum of the squares of the errors made in the results of every
single equation.
The best fit in the least-squares sense minimizes the sum of
squared residuals, a residual being the difference between an
observed value and the fitted value provided by a model.
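For a straight-line model Y = a + bX, the quantity being minimized can be written out explicitly. The short derivation below is a sketch using that assumed model, with sums taken over the n observed (X, Y) pairs.

```latex
S(a, b) = \sum (Y - a - bX)^2

\frac{\partial S}{\partial a} = -2 \sum (Y - a - bX) = 0
  \;\Longrightarrow\; \sum Y = na + b \sum X

\frac{\partial S}{\partial b} = -2 \sum X (Y - a - bX) = 0
  \;\Longrightarrow\; \sum XY = a \sum X + b \sum X^2
```

The two resulting equations are usually called the normal equations. Solving them together gives the intercept a and the slope b of the least-squares line.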
Method of least squares: the line of best fit
The method of least squares calculates the line of best fit by minimising the sum
of the squares of the vertical distances of the points to the line. Let's illustrate
with a simple example.
Method of least squares: the line of best fit
Continued from previous slide.
Example - Method of least squares
Fit a least-squares line to the following data.
X 1 2 3 4 5
Y 2 5 3 8 7
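Example - Method of least squares
Solution:
X     Y     XY    X^2
1     2     2     1
2     5     10    4
3     3     9     9
4     8     32    16
5     7     35    25
Sum   15    25    88    55
(The last row gives the column sums: ΣX = 15, ΣY = 25, ΣXY = 88, ΣX^2 = 55.)
The equation of the least-squares line: $Y = a + bX$
Normal equation for a: $\sum Y = na + b\sum X$ ---- (1)
Normal equation for b: $\sum XY = a\sum X + b\sum X^2$ ---- (2)
Substituting the sums with n = 5 gives 25 = 5a + 15b for (1) and 88 = 15a + 55b for (2).
To eliminate a, multiply equation (1) by 3 and subtract it from
equation (2): 13 = 10b, so b = 1.3, and then a = (25 - 15b)/5 = 1.1.
With a = 1.1 and b = 1.3, the equation of the least-squares line becomes $Y = 1.1 + 1.3X$.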
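The values a = 1.1 and b = 1.3 can be checked by solving the two normal equations for a straight line Y = a + bX in code. This is a minimal sketch; the closed-form expressions for a and b are the standard solution of those two equations, and the variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Solve the two normal equations for a and b:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(sum_x, sum_y, sum_xy, sum_x2)  # 15 25 88 55
print(round(a, 2), round(b, 2))      # 1.1 1.3
```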
Exercise
A researcher investigates the relationship between individuals'
scores on a Reading Aptitude Test and the average number of hours
each individual spends reading (simply called Hours). The data
gathered from 10 students are as follows:
Student   Score on Reading Aptitude Test (X)   Hours (Y)
S1        20                                   5
S2        5                                    1
S3        5                                    2
S4        40                                   7
S5        30                                   8
S6        35                                   9
S7        5                                    3
S8        5                                    2
S9        15                                   5
S10       40                                   8
DO NOT WORRY ABOUT APPLYING THE
EQUATIONS!
You will use SPSS (Statistical Package for the
Social Sciences) to obtain all the analyses.
The first step in any applied research is
to get a good THEORETICAL grasp of the
topic to be studied.
The best data analysts don't start with
the data; they start with theory.
THANK YOU
PREPARED BY
ASSOC PROF DR NORMAH MULOP