Conjoint Analysis:
Conjoint Analysis is a marketing research technique designed to help determine the preferences of
customers. It is used to analyse how customers value the different attributes of a product (or service)
and thus gives an insight into the trade-offs they are willing to make among those attributes. To put it
simply, it tells how much each feature of a product is worth to consumers.
The study involves surveying people with a set of attribute combinations, which the survey-takers
rank or score by preference. The responses are then analysed to model customer preferences for
different combinations of attributes. The attributes are termed factors and their different values are
termed levels.
In this example, which runs Conjoint Analysis in SPSS, we analyse data on a carpet cleaner, using
Package design, Brand, Price, Seal and Money-back guarantee as the attributes on which consumers
state their preferences. Using two data sets (the plan file and the preference file), we calculate the
part-worths and determine the weight of each attribute implied by the respondents' preferences.
Variable name   Variable label            Value label
package         package design            A*, B*, C*
brand           brand name                K2R, Glory, Bissell
price           price                     $1.19, $1.39, $1.59
seal            Good Housekeeping seal    no, yes
money           money-back guarantee      no, yes
Code to import the data and run the analysis:
GET FILE='C:\Users\Abhi\Desktop\carpet_plan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
GET FILE='C:\Users\Abhi\Desktop\carpet_prefs.sav'.
DATASET NAME DataSet2 WINDOW=FRONT.
CONJOINT PLAN='C:\Users\Abhi\Desktop\carpet_plan.sav'
  /DATA='C:\Users\Abhi\Desktop\carpet_prefs.sav'
  /SEQUENCE=PREF1 PREF2 PREF3 PREF4 PREF5 PREF6 PREF7 PREF8 PREF9 PREF10 PREF11
    PREF12 PREF13 PREF14 PREF15 PREF16 PREF17 PREF18 PREF19 PREF20 PREF21 PREF22
  /SUBJECT=ID
  /FACTORS=PACKAGE BRAND (DISCRETE) PRICE (LINEAR LESS) SEAL (LINEAR MORE) MONEY (LINEAR MORE)
  /PRINT=SUMMARYONLY.
Model Description
Factor    N of Levels   Relation to Ranks or Scores
package   3             Discrete
brand     3             Discrete
price     3             Linear (less)
seal      2             Linear (more)
money     2             Linear (more)
Calculation of the part-worth of each attribute
Utilities
Factor       Level     Utility Estimate   Std. Error
package      A*        -2.233             .192
             B*        1.867              .192
             C*        .367               .192
brand        K2R       .367               .192
             Glory     -.350              .192
             Bissell   -.017              .192
price        $1.19     -6.595             .988
             $1.39     -7.703             1.154
             $1.59     -8.811             1.320
seal         no        2.000              .287
             yes       4.000              .575
money        no        1.250              .287
             yes       2.500              .575
(Constant)             12.870             1.282
This table shows the utility (part-worth) scores and their standard errors for each factor level. Higher
utility values indicate greater preference. For the discrete factors (package and brand), the part-worths
of the levels sum to zero, apart from rounding; this does not hold for the linear factors (price, seal and
money). With respect to brand, K2R is preferred over Bissell and Glory. As expected, there is an inverse
relationship between price and utility, with higher prices corresponding to lower utility. The presence
of a seal of approval or a money-back guarantee corresponds to a higher utility. The total utility of a
combination is the sum of the utilities of its factor levels plus the constant. For example:
If the cleaner had package design C*, brand Bissell, price $1.59, a seal of approval, and a money-back
guarantee, the total utility would be:
0.367 + (−0.017) + (−8.811) + 4.000 + 2.500 + 12.870 = 10.909.
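The same sum can be reproduced with a short Python sketch. The utilities are copied from the Utilities table; the dictionary layout and the function name `total_utility` are illustrative choices, not part of the SPSS output.

```python
# Part-worth utilities copied from the Utilities table above.
utilities = {
    "package": {"A*": -2.233, "B*": 1.867, "C*": 0.367},
    "brand": {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price": {"$1.19": -6.595, "$1.39": -7.703, "$1.59": -8.811},
    "seal": {"no": 2.000, "yes": 4.000},
    "money": {"no": 1.250, "yes": 2.500},
}
CONSTANT = 12.870


def total_utility(profile):
    """Add the part-worth of each chosen level to the constant."""
    return CONSTANT + sum(utilities[factor][level] for factor, level in profile.items())


# The profile described in the text (hypothetical variable name).
profile = {
    "package": "C*",
    "brand": "Bissell",
    "price": "$1.59",
    "seal": "yes",
    "money": "yes",
}
print(round(total_utility(profile), 3))
```

Any other combination of levels can be scored by changing the entries of `profile`.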
Importance:
Importance Values
package 35.635
brand 14.911
price 29.410
seal 11.172
money 8.872
We can see that the package attribute has the highest importance, followed by price. The money-back
guarantee is of least concern to the consumer. The values are computed by taking the utility range for
each factor separately and dividing by the sum of the utility ranges for all factors. The values thus
represent percentages and have the property that they sum to 100.
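A minimal Python sketch of the range-based calculation is given here, applied to the summary utilities from the Utilities table. It will not reproduce the Importance Values table exactly: as far as can be told, SPSS computes the importance for each respondent separately and reports the average, whereas this sketch uses only the pooled utilities. The function name `importance` is an illustrative choice.

```python
# Summary utilities copied from the Utilities table (pooled over respondents).
utilities = {
    "package": [-2.233, 1.867, 0.367],
    "brand": [0.367, -0.350, -0.017],
    "price": [-6.595, -7.703, -8.811],
    "seal": [2.000, 4.000],
    "money": [1.250, 2.500],
}


def importance(utils):
    """Utility range of each factor as a percentage of the sum of all ranges."""
    ranges = {factor: max(values) - min(values) for factor, values in utils.items()}
    total = sum(ranges.values())
    return {factor: 100 * r / total for factor, r in ranges.items()}


scores = importance(utilities)
for factor, score in scores.items():
    print(factor, round(score, 2))
```

Even with pooled utilities the ordering agrees with the text at the top: package has the largest range and therefore the largest share.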
Coefficients
Factor   B Coefficient Estimate
price    -5.542
seal     2.000
money    1.250
The utility for a particular factor level is determined by multiplying the level by the coefficient. For
example, the predicted utility for a price of $1.19 was listed as −6.595 in the utilities table. This is
simply the value of the price level, 1.19, multiplied by the price coefficient, −5.542.
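The multiplication can be checked with a few lines of Python. The coefficients come from the Coefficients table; the coding of the seal and money levels as 1 (no) and 2 (yes) is an inference from the Utilities table, not something stated in the output, and `linear_utility` is a hypothetical helper name. Because the printed coefficient is rounded to three decimals, the last digit may differ slightly from the Utilities table.

```python
# B coefficients copied from the Coefficients table.
COEFFICIENTS = {"price": -5.542, "seal": 2.000, "money": 1.250}


def linear_utility(level, coefficient):
    """Utility of a linear factor level: level value times coefficient."""
    return level * coefficient


# Price levels are the prices themselves.
for price in (1.19, 1.39, 1.59):
    print("price", price, round(linear_utility(price, COEFFICIENTS["price"]), 3))

# Assumed coding: no = 1, yes = 2 (inferred from the Utilities table).
for name in ("seal", "money"):
    for label, code in (("no", 1), ("yes", 2)):
        print(name, label, linear_utility(code, COEFFICIENTS[name]))
```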
This table provides measures of the correlation between the observed and estimated preferences.
Preference Scores of Simulations
Card Number   ID   Score
1             1    10.258
2             2    14.292
The real power of conjoint analysis is the ability to predict preference for product profiles that
weren't rated by the subjects. These are referred to as simulation cases.
Preference Probabilities of Simulations
Card Number   ID   Maximum Utility   Bradley-Terry-Luce   Logit
1             1    30.0%             43.1%                30.9%
2             2    70.0%             56.9%                69.1%
The maximum utility model determines the probability as the number
of respondents predicted to choose the profile divided by the total
number of respondents. For each respondent, the predicted choice is
simply the profile with the largest total utility.
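The three probability models can be sketched in Python as follows. The per-respondent utilities below are invented for illustration (the SPSS output does not list them), so the results will not match the table. The Bradley-Terry-Luce and logit formulas shown are the usual definitions (utility share and exponentiated utility share, each averaged over respondents) and are stated here as an assumption about what SPSS computes.

```python
import math

# Hypothetical total utilities: one row per respondent, one column per simulation profile.
respondent_utils = [
    [10.0, 14.0],
    [12.0, 11.0],
    [9.0, 13.0],
]


def max_utility_probs(utils):
    """Share of respondents whose highest-utility profile is each profile."""
    wins = [0] * len(utils[0])
    for row in utils:
        wins[row.index(max(row))] += 1
    return [w / len(utils) for w in wins]


def btl_probs(utils):
    """Bradley-Terry-Luce: utility divided by the row sum, averaged over respondents.
    Assumes all utilities are positive."""
    shares = [[u / sum(row) for u in row] for row in utils]
    return [sum(col) / len(utils) for col in zip(*shares)]


def logit_probs(utils):
    """Logit: exp(utility) divided by the row sum of exp(utility), averaged over respondents."""
    shares = [[math.exp(u) / sum(math.exp(v) for v in row) for u in row] for row in utils]
    return [sum(col) / len(utils) for col in zip(*shares)]


print(max_utility_probs(respondent_utils))
print(btl_probs(respondent_utils))
print(logit_probs(respondent_utils))
```

With these made-up numbers, two of the three respondents prefer the second profile, so the maximum utility model gives it a probability of two thirds.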
Number of Reversals
Factor    price        3
          money        2
          seal         2
          brand        0
          package      0
Subject   Subject 1    1
          Subject 2    2
          Subject 3    0
          Subject 4    0
          Subject 5    0
          Subject 6    1
          Subject 7    0
          Subject 8    0
          Subject 9    1
          Subject 10   2
This table displays the number of reversals for each factor and for each subject. For example, three
subjects showed a reversal for price. That is, they preferred product profiles with higher prices.
Reversal Summary
N of Reversals   N of Subjects
1                3
2                2
Q. Perform Discriminant Analysis on the given dataset. The dataset chosen contains statistics on a set of people who have been given bank loans, recording whether or not they defaulted along with their various characteristics.
Discriminant
Notes
Output Created 04-Apr-2013 18:39:05
Comments
Input Data E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav
Active Dataset DataSet1
File Label Bank Loan Default
Filter <none>
Weight <none>
Split File <none>
N of Rows in Working
Data File 850
Missing Value Handling Definition of Missing User-defined missing values are
treated as missing in the analysis
phase.
Cases Used In the analysis phase, cases with no
user- or system-missing values for
any predictor variable are used.
Cases with user-, system-missing, or
out-of-range values for the
grouping variable are always
excluded.
Syntax DISCRIMINANT
/GROUPS=default(0 1)
/VARIABLES=employ address age
/ANALYSIS ALL
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV UNIVF
BOXM COEFF CORR TABLE
/PLOT=COMBINED
/PLOT=CASES
/CLASSIFY=NONMISSING POOLED
MEANSUB.
Resources Processor Time 00:00:00.047
Elapsed Time 00:00:00.121
GET
FILE='E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
DISCRIMINANT
/GROUPS=default(0 1)
/VARIABLES=employ address age
/ANALYSIS ALL
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF CORR TABLE
/PLOT=COMBINED
/PLOT=CASES
/CLASSIFY=NONMISSING POOLED MEANSUB.
[DataSet1] E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav
Warnings
All-Groups Stacked Histogram is no longer displayed.
Analysis Case Processing Summary
Unweighted Cases N Percent
Valid 700 82.4
Excluded Missing or out-of-range
group codes 150 17.6
At least one missing
discriminating variable 0 .0
Both missing or out-of-
range group codes and at
least one missing
discriminating variable
0 .0
Total 150 17.6
Total 850 100.0
Group Statistics
Previously defaulted Mean Std. Deviation
Valid N (listwise)
Unweighted Weighted
No Years with current
employer 9.51 6.664 517 517.000
Years at current address 8.95 7.001 517 517.000
Age in years 35.51 7.708 517 517.000
Yes Years with current
employer 5.22 5.543 183 183.000
Years at current address 6.39 5.925 183 183.000
Age in years 33.01 8.518 183 183.000
Total Years with current
employer 8.39 6.658 700 700.000
Years at current address 8.28 6.825 700 700.000
Age in years 34.86 7.997 700 700.000
Tests of Equality of Group Means
Wilks' Lambda F df1 df2 Sig.
Years with current
employer .920 60.759 1 698 .000
Years at current address .973 19.402 1 698 .000
Age in years .981 13.482 1 698 .000
Pooled Within-Groups Matrices
Years with
current
employer
Years at
current address Age in years
Correlation Years with current
employer 1.000 .292 .524
Years at current address .292 1.000 .588
Age in years .524 .588 1.000
Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants
Previously defaulted Rank
Log
Determinant
No 3 11.012
Yes 3 10.501
Pooled within-groups 3 10.919
The ranks and natural logarithms of determinants
printed are those of the group covariance
matrices.
Test Results
Box's M 28.171
F Approx. 4.665
df1 6
df2 7.335E5
Sig. .000
Tests null hypothesis of equal population covariance matrices.
Box's M is significant (Sig. = .000), so the null hypothesis of equal covariance matrices in the two
groups is rejected.
The Pooled Within-Groups correlation matrix shows the correlations between the predictors. The largest
correlations occur between Age in years and the other two variables (.524 with Years with current
employer and .588 with Years at current address).
Summary of Canonical Discriminant Functions
Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .100a        100.0           100.0          .301
a. First 1 canonical discriminant functions were used in the analysis.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .909            66.251       3    .000
Standardized Canonical Discriminant
Function Coefficients
Function
1
Years with current
employer .980
Years at current address .436
Age in years -.330
Structure Matrix
Function
1
Years with current
employer .934
Years at current address .528
Age in years .440
Pooled within-groups correlations
between discriminating variables and
standardized canonical discriminant
functions
Variables ordered by absolute size of
correlation within function.
Functions at Group Centroids
Previously defaulted   Function 1
No                     .188
Yes                    -.530
Unstandardized canonical discriminant functions evaluated at group means
Classification Statistics
Classification Processing Summary
Processed 850
Excluded Missing or out-of-range
group codes 0
At least one missing
discriminating variable 0
Used in Output 850
Prior Probabilities for Groups
Previously defaulted   Prior   Cases Used in Analysis (Unweighted, Weighted)
No .500 517 517.000
Yes .500 183 183.000
Total 1.000 700 700.000
Classification Function Coefficients
Previously defaulted
No Yes
Years with current
employer -.192 -.302
Years at current address -.302 -.348
Age in years .797 .827
(Constant) -12.588 -12.444
Fisher's linear discriminant functions
Classification Results (a)
Rows: actual group (Previously defaulted). Columns: Predicted Group Membership (No, Yes) and Total.
Original Count No 300 217 517
Yes 44 139 183
Ungrouped cases 81 69 150
% No 58.0 42.0 100.0
Yes 24.0 76.0 100.0
Ungrouped cases 54.0 46.0 100.0
a. 62.7% of original grouped cases correctly classified.
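The classification table of the Discriminant Analysis compares each person's actual group (previously
defaulted or not) with the group predicted by the discriminant function. Of the 183 persons who had defaulted, 139 (76.0%) are correctly predicted as defaulters and 44 are not; of the 517 who had not defaulted, 300 (58.0%) are correctly predicted as non-defaulters and 217 are not. In both groups the correctly classified cases outnumber the misclassified ones (139 > 44 and 300 > 217), giving 62.7% correct classification overall.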
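The percentages in the Classification Results table follow directly from the counts, as this small Python sketch shows. The counts are copied from the table; the variable names are illustrative.

```python
# Counts from the Classification Results table: counts[actual][predicted].
counts = {
    "No": {"No": 300, "Yes": 217},
    "Yes": {"No": 44, "Yes": 139},
}

# Correctly classified cases sit on the diagonal (actual group == predicted group).
correct = sum(counts[group][group] for group in counts)
total = sum(sum(row.values()) for row in counts.values())
overall_pct = 100 * correct / total

# Percentage of each actual group that was classified correctly.
per_group_pct = {
    group: 100 * counts[group][group] / sum(counts[group].values()) for group in counts
}

print(round(overall_pct, 1))
print({group: round(pct, 1) for group, pct in per_group_pct.items()})
```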
Q. Perform Factor Analysis on the given dataset.
The dataset chosen contains responses to a fictional statistics anxiety questionnaire. It records the answers given by students regarding their ease of use, liking and usage of SPSS in statistics.
Using the Scree Plot, I chose 5 factors.
Since a student is likely to give related answers across the items, I considered the
variables to be inter-related and hence used Oblimin rotation. For example, a student who gives a high score to “I have little experience of computers” is likely to give a high score to “All computers hate me”, as the variables are somewhat correlated.
Using the SPSS options, the following Pattern Matrix was generated.
Pattern Matrixa
Component
1 2 3 4 5
I have little experience of
computers
.903
SPSS always crashes when I
try to use it
.732
All computers hate me .684
I worry that I will cause
irreparable damage because
of my incompetenece with
computers
.662
Computers have minds of
their own and deliberately go
wrong whenever I use them
.581
People try to tell you that
SPSS makes statistics easier
to understand but it doesn't
.446
Computers are out to get me .333
My friends are better at SPSS
than I am
.661
My friends are better at
statistics than me
.655
If I'm good at statistics my
friends will think I'm a nerd
.622
My friends will think I'm stupid
for not being able to cope
with SPSS
.504 .330
Everybody looks at me when
I use SPSS
.358 .358
I can't sleep for thoughts of
eigen vectors
-.728
I wake up under my duvet
thinking that I am trapped
under a normal distribtion
.324 -.543
Computers are useful only for
playing games
.359 .393 -.366
Standard deviations excite
me
.301 .356 .315
I have never been good at
mathematics
-.855
I did badly at mathematics at
school
-.736
I slip into a coma whenever I
see an equation
-.722
Statiscs makes me cry -.772
I don't understand statistics -.730
I weep openly at the mention
of central tendency
-.664
I dream that Pearson is
attacking me with correlation
coefficients
-.564
Extraction Method: Principal Component Analysis.
Rotation Method: Oblimin with Kaiser Normalization.
a. Rotation converged in 15 iterations.
The total variance explained by each factor is given below
Total Variance Explained
Component   Rotation Sums of Squared Loadings (Total)
1           5.522
2           2.452
3           2.383
4           3.535
5           4.913
Extraction Method: Principal Component Analysis.
For each factor, the percentage of variance is calculated by taking the sum of squared loadings of the
factor, dividing it by the number of variables, and multiplying by 100.
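A minimal Python sketch of this calculation follows. The sums of squared loadings are copied from the Total Variance Explained table, and 23 is the number of questionnaire items listed in the Pattern Matrix. Note as an aside that Oblimin rotation allows the components to correlate, so the resulting percentages overlap and should not be added together to give a total variance explained.

```python
# Rotation sums of squared loadings from the Total Variance Explained table.
rotation_ssl = {1: 5.522, 2: 2.452, 3: 2.383, 4: 3.535, 5: 4.913}

# Number of variables (items) in the Pattern Matrix.
N_VARIABLES = 23

# Percentage of variance per factor: SSL / number of variables * 100.
percent_variance = {
    factor: 100 * ssl / N_VARIABLES for factor, ssl in rotation_ssl.items()
}

for factor, pct in percent_variance.items():
    print(f"Factor {factor}: {pct:.2f}%")
```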
Hence the factoring would be as follows, depending on the loading values.
Factor   Variable Nos.
1        1, 2, 3, 4, 5, 6, 7, 14
2        8, 9, 10
3        13
4        17, 18, 19
5        20, 21, 22, 23
Variables 11, 12, 15 and 16 have similar loadings on more than one factor, which is undesirable because
each such variable is assessing more than one construct. Variable 12 has exactly the same loading (.358)
on two factors. These variables are said to have split loadings.
They are hence listed separately.
Factor   Variable Nos.
2        11, 16, 15
3        12, 15
As split loading is present, this is not a simple structure.
Factor 1: Anxiety about the usage of computers. By the calculation described above (sum of squared
loadings divided by the 23 variables), it corresponds to 24.01% of the variance and loads 8 of the variables.
Factor 2: View of students regarding their understanding of statistics and SPSS with regard to their peers.
It corresponds to 10.66% of the variance and loads 3 variables. It also split loads variables 11, 16 and 15.
Factor 3: Anxiety about eigenvectors. It corresponds to only 10.36% of the variance and loads only 1
variable directly, while it split loads variables 12 and 15.
Factor 4: Students' interest in mathematics. It corresponds to 15.37% of the variance and loads 3
variables.
Factor 5: Dislike for statistics. It corresponds to 21.36% of the variance and loads 4 variables.
Because Oblimin rotation allows the factors to correlate, these percentages overlap and cannot be added
to obtain the total variance explained.
CLUSTER ANALYSIS
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same
group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis
used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Proximities
Notes
Output Created 02-Apr-2013 22:00:05
Comments
Input Data C:\Users \dev
maletia\Downloads\ClusterAnonFaculty.sav
Active Dataset DataSet3
Filter <none>
Weight <none>
Split File <none>
N of Rows in Working Data File 44
Missing Value Handling Definition of Missing User-defined missing values are treated as missing.
Cases Used Statistics are based on cases with no missing values
for any variable used.
Syntax PROXIMITIES Salary FTE Rank Articles Experience
/MATRIX
OUT('C:\Users\DEVMAL~1\AppData\Local\Temp\spss6496\spssclus.tmp')
/VIEW=CASE
/MEASURE=SEUCLID
/PRINT NONE
/ID=Name
/STANDARDIZE=VARIABLE Z.
Resources Processor Time 00:00:00.078
Elapsed Time 00:00:00.082
Workspace Bytes 11152
Files Saved Matrix File C:\Users\DEVMAL~1\AppData\Local\Temp\spss6496\spssclus.tmp
The variables in the dataset are as follows:
• Name -- Although faculty salaries are public information under North Carolina state law
• Salary – annual salary in dollars, from the university report available in One Stop.
• FTE – Full time equivalent work load for the faculty member.
• Rank – where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor
• Articles – number of published scholarly articles, excluding things like comments in newsletters,
abstracts in proceedings, and the like.
• Experience – Number of years working as a full time faculty member in a Department of Psychology.
• ArticlesAPD – number of published articles as listed in the university’s Academic Publications
• Sex –biological sex from physical appearance.
In the first step SPSS computes, for each pair of cases, the squared Euclidean distance between the cases. This is
quite simply the sum across variables (from i = 1 to v) of the squared difference between the score on variable
i for the one case (Xi) and the score on variable i for the other case (Yi). The two cases which are separated by
the smallest Euclidean distance are identified and then classified together into the first cluster. At this point
there is one cluster with two cases in it.
Next SPSS re-computes the squared Euclidean distances between each entity (case or cluster) and each other
entity. When one or both of the compared entities is a cluster, SPSS computes the average squared Euclidean
distance between members of the one entity and members of the other entity. The two entities with the
smallest squared Euclidean distance are classified together. SPSS then re-computes the squared Euclidean
distances between each entity and each other entity, and the two with the smallest squared Euclidean distance
are classified together. This continues until all of the cases have been clustered into one big cluster.
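The two distance calculations described above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the SPSS implementation; the function names and the sample score vectors are hypothetical, and the scores are assumed to be already standardised to z-scores, as requested by /STANDARDIZE=VARIABLE Z.

```python
def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def average_linkage(cluster_a, cluster_b):
    """Average squared Euclidean distance over all pairs with one member
    from each cluster (between-groups average linkage)."""
    distances = [squared_euclidean(a, b) for a in cluster_a for b in cluster_b]
    return sum(distances) / len(distances)


# Hypothetical z-scored cases, each with two variables.
case_1 = [0.0, 0.0]
case_2 = [3.0, 4.0]
case_3 = [0.0, 2.0]

print(squared_euclidean(case_1, case_2))
print(average_linkage([case_1], [case_2, case_3]))
```

At each stage the pair of entities with the smallest such distance would be merged, and the distances re-computed.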
The output obtained can be seen below:
Case Processing Summary (a)
Cases
Valid           Missing         Total
N    Percent    N    Percent    N    Percent
44   100.0%     0    .0%        44   100.0%
a. Squared Euclidean Distance used
In the first step SPSS clustered case 32 with case 33. The squared Euclidean distance between these two cases is 0.000. At stages 2-4 SPSS creates three more clusters, each containing two cases. At stage 5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the 43rd stage all cases have been clustered into one entity. The results can be seen below:
Average Linkage (Between Groups)
Agglomeration Schedule
Columns: Stage | Cluster Combined (Cluster 1, Cluster 2) | Coefficients | Stage Cluster First Appears (Cluster 1, Cluster 2) | Next Stage
1 32 33 .000 0 0 9
2 41 42 .000 0 0 6
3 43 44 .000 0 0 6
4 37 38 .000 0 0 5
5 37 39 .001 4 0 7
6 41 43 .002 2 3 27
7 36 37 .003 0 5 27
8 20 22 .007 0 0 11
9 30 32 .012 0 1 13
10 21 26 .012 0 0 14
11 20 25 .031 8 0 12
12 16 20 .055 0 11 14
13 29 30 .065 0 9 26
14 16 21 .085 12 10 20
15 11 18 .093 0 0 22
16 8 9 .143 0 0 25
17 17 24 .144 0 0 20
18 13 23 .167 0 0 22
19 14 15 .232 0 0 32
20 16 17 .239 14 17 23
21 7 12 .279 0 0 28
22 11 13 .441 15 18 29
23 16 27 .451 20 0 26
24 3 10 .572 0 0 28
25 6 8 .702 0 16 36
26 16 29 .768 23 13 35
27 36 41 .858 7 6 33
28 3 7 .904 24 21 31
29 11 28 .993 22 0 30
30 5 11 1.414 0 29 34
31 3 4 1.725 28 0 36
32 14 31 1.928 19 0 34
33 36 40 2.168 27 0 40
34 5 14 2.621 30 32 35
35 5 16 2.886 34 26 37
36 3 6 3.089 31 25 38
37 5 19 4.350 35 0 39
38 1 3 4.763 0 36 41
39 5 34 5.593 37 0 42
40 35 36 8.389 0 33 43
41 1 2 8.961 38 0 42
42 1 5 11.055 41 39 43
43 1 35 17.237 42 40 0
Cluster Membership
Case 5 Clusters 4 Clusters 3 Clusters 2 Clusters
1:Rosalyn 1 1 1 1
2:Lawrence 2 2 1 1
3:Sunila 1 1 1 1
4:Randolph 1 1 1 1
5:Mickey 3 3 2 1
6:Louis 1 1 1 1
7:Tony 1 1 1 1
8:Raul 1 1 1 1
9:Catalina 1 1 1 1
10:Johnson 1 1 1 1
11:Beulah 3 3 2 1
12:Martina 1 1 1 1
13:Marie 3 3 2 1
14:Ernest 3 3 2 1
15:Christopher 3 3 2 1
16:Ernie 3 3 2 1
17:Christa 3 3 2 1
18:Linette 3 3 2 1
19:Bo 3 3 2 1
20:Carla 3 3 2 1
21:Alberto 3 3 2 1
22:Christina 3 3 2 1
23:Jonah 3 3 2 1
24:Tucker 3 3 2 1
25:Shanta 3 3 2 1
26:Melissa 3 3 2 1
27:Jenna 3 3 2 1
28:Johnny 3 3 2 1
29:Cleatus 3 3 2 1
30:Jonas 3 3 2 1
31:Tad 3 3 2 1
32:Amaryllis 3 3 2 1
33:Nathan 3 3 2 1
34:Deanna 3 3 2 1
35:Willy 4 4 3 2
36:Deana 5 4 3 2
37:Dea 5 4 3 2
38:Claude 5 4 3 2
39:Amanda 5 4 3 2
40:Boris 5 4 3 2
41:Garrett 5 4 3 2
42:Stew 5 4 3 2
43:Bree 5 4 3 2
44:Karma 5 4 3 2
Vertical Icicle:
The full vertical icicle plot cannot be displayed in this document, but its results are
described below.
For the two cluster solution you can see that one cluster consists of ten cases (Boris through Willy, followed by
a column with no X’s). These were our adjunct (part-time) faculty (excepting one), and the second cluster
consists of everybody else.
For the three cluster solution you can see the cluster of adjunct faculty and the others split into two. Deanna
through Mickey were our junior faculty and Lawrence through Rosalyn our senior faculty.
For the four cluster solution you can see that one case (Lawrence) forms a cluster of his own.
Dendrogram
It displays essentially the same information that is found in the agglomeration schedule but in graphic form.
* * * * * * * * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * * * * * * * *
Dendrogram using Average Linkage (Between Groups)
Rescaled Distance Cluster Combine
C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+
Amaryllis 32 ─┐
Nathan 33 ─┤
Jonas 30 ─┼─┐
Cleatus 29 ─┘ │
Alberto 21 ─┐ │
Melissa 26 ─┤ │
Carla 20 ─┤ ├─────┐
Christina 22 ─┤ │ │
Shanta 25 ─┤ │ │
Ernie 16 ─┤ │ │
Christa 17 ─┼─┘ │
Tucker 24 ─┤ │
Jenna 27 ─┘ ├───┐
Beulah 11 ─┐ │ │
Linette 18 ─┼─┐ │ │
Marie 13 ─┤ ├─┐ │ │
Jonah 23 ─┘ │ ├─┐ │ │
Johnny 28 ───┘ │ │ │ ├───┐
Mickey 5 ─────┘ ├─┘ │ │
Ernest 14 ─┬───┐ │ │ │
Christopher 15 ─┘ ├─┘ │ ├───────────────┐
Tad 31 ─────┘ │ │ │
Bo 19 ─────────────┘ │ │
Deanna 34 ─────────────────┘ │
Raul 8 ─┬─┐ │
Catalina 9 ─┘ ├─────┐ ├───────────────┐
Louis 6 ───┘ │ │ │
Tony 7 ─┬─┐ ├───┐ │ │
Martina 12 ─┘ ├─┐ │ │ │ │
Sunila 3 ─┬─┘ ├───┘ ├───────────┐ │ │
Johnson 10 ─┘ │ │ │ │ │
Randolph 4 ─────┘ │ ├───────┘ │
Rosalyn 1 ─────────────┘ │ │
Lawrence 2 ─────────────────────────┘ │
Garrett 41 ─┐ │
Stew 42 ─┼─┐ │
Bree 43 ─┤ │ │
Karma 44 ─┘ ├───┐ │
Dea 37 ─┐ │ │ │
Claude 38 ─┤ │ ├─────────────────┐ │
Amanda 39 ─┼─┘ │ │ │
Deana 36 ─┘ │ ├───────────────────────┘
Boris 40 ───────┘ │
Willy 35 ─────────────────────────┘
Multiple Regression Analysis
In this analysis we use a data file that was created by randomly sampling 400 elementary
schools from the California Department of Education's API 2000 dataset. This data file contains a
measure of school academic performance as well as other attributes of the elementary schools, such
as class size, enrolment and poverty.
We now perform a regression analysis using api00 as the outcome variable and the
variables acs_k3, meals and full as predictors. These measure the academic performance of the
school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of
students receiving free meals (meals) - which is an indicator of poverty, and the percentage of
teachers who have full teaching credentials (full). We expect that better academic performance would
be associated with lower class size, fewer students receiving free meals, and a higher percentage of
teachers having full teaching credentials. The output is as follows:
Regression
Notes
Output Created 02-Apr-2013 21:48:19
Comments
Input Data C:\Users\Divij\Desktop\SPSS Data\elemapi.sav
Active Dataset DataSet5
Filter <none>
Weight <none>
Split File <none>
N of Rows in Working Data File 400
Missing Value Handling Definition of Missing User-defined missing values are treated as missing.
Cases Used Statistics are based on cases with no missing values for any variable used.
Syntax regression
/dependent api00
/method=enter acs_k3 meals full
.
Resources Processor Time 00:00:00.063
Elapsed Time 00:00:00.026
Memory Required 2284 bytes
Additional Memory Required for
Residual Plots 0 bytes
Variables Entered/Removed (b)
Model   Variables Entered                                                 Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals (a)      .                   Enter
a. All requested variables entered.
b. Dependent Variable: api 2000
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .821a   .674       .671                64.153
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 2634884.261 3 878294.754 213.407 .000a
Residual 1271713.209 309 4115.577
Total 3906597.470 312
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
b. Dependent Variable: api 2000
Coefficientsa
Columns: Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig.
1 (Constant) 906.739 28.265 32.080 .000
avg class size k-3 -2.682 1.394 -.064 -1.924 .055
pct free meals -3.702 .154 -.808 -24.038 .000
pct full credential .109 .091 .041 1.197 .232
a. Dependent Variable: api 2000
Let's test whether the three predictors are statistically significant and, if so, the direction of the
relationship. The average class size (acs_k3, b=-2.682) is not significant (p=0.055), but only just so,
and the coefficient is negative, which would indicate that larger class sizes are related to lower
academic performance, which is what we would expect. Next, the effect of meals (b=-3.702, p=.000)
is significant and its coefficient is negative, indicating that the greater the proportion of students receiving
free meals, the lower the academic performance. We cannot say that free meals are causing lower
academic performance. The meals variable is highly related to income level and functions more as a
proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance.
Finally, the percentage of teachers with full credentials (full, b=0.109, p=.232) seems to be unrelated
to academic performance. This would seem to indicate that the percentage of teachers with full
credentials is not an important factor in predicting academic performance, which is unexpected.
From these results, we would tentatively conclude that lower class sizes are related to higher
performance (although this effect narrowly misses significance at the .05 level), that
fewer students receiving free meals is associated with higher performance, and that the percentage of
teachers with full credentials was not related to academic performance in the schools. Before we
write this up as our finding, we should do checks to make sure we can firmly stand behind these
results.
Examining Data
Step 1)
To start examining the data we have a look at the first 10 data points for the variables included in our
regression analysis. We need to lay focus on the number of missing data points in the given data.
api00   acs_k3   meals   full
693     16       67      76.00
570     15       92      79.00
546     17       97      68.00
571     20       90      87.00
478     18       89      87.00
858     20       .       100.00
918     19       .       100.00
831     20       .       96.00
860     20       .       100.00
737     21       29      96.00
Number of cases read: 10 Number of cases listed: 10
We see that among the first 10 observations, we have four missing values for meals. Keeping this in
mind, we can use the descriptives command with /var=all to get descriptive statistics for all of the
variables, and pay special attention to the number of valid cases for meals.
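The count of missing values in the listing can be reproduced with a short Python sketch. The ten rows are copied from the listing above, with SPSS's "." (system-missing) represented as `None`; the variable names are illustrative.

```python
# First ten cases from the listing: (api00, acs_k3, meals, full).
# None stands in for the system-missing value shown as "." in the SPSS listing.
rows = [
    (693, 16, 67, 76.0),
    (570, 15, 92, 79.0),
    (546, 17, 97, 68.0),
    (571, 20, 90, 87.0),
    (478, 18, 89, 87.0),
    (858, 20, None, 100.0),
    (918, 19, None, 100.0),
    (831, 20, None, 96.0),
    (860, 20, None, 100.0),
    (737, 21, 29, 96.0),
]
columns = ("api00", "acs_k3", "meals", "full")

# Count the missing values in each column.
missing = {
    name: sum(1 for row in rows if row[i] is None) for i, name in enumerate(columns)
}
print(missing)
```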
Step 2)
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
school number 400 58 6072 2866.81 1543.811
district number 400 41 796 457.73 184.823
api 2000 400 369 940 647.62 142.249
api 1999 400 333 917 610.21 147.136
growth 1999 to 2000 400 -69 134 37.41 25.247
pct free meals 315 6 100 71.99 24.386
english language learners 400 0 91 31.45 24.839
year round school 400 0 1 .23 .421
pct 1st year in school 399 2 47 18.25 7.485
avg class size k-3 398 -21 25 18.55 5.005
avg class size 4-6 397 20 50 29.69 3.841
parent not hsg 400 0 100 21.25 20.676
parent hsg 400 0 100 26.02 16.333
parent some college 400 0 67 19.71 11.337
parent college grad 400 0 100 19.70 16.471
parent grad school 400 0 67 8.64 12.131
avg parent ed 381 1.00 4.62 2.6685 .76379
pct full credential 400 .42 100.00 66.0568 40.29793
pct emer credential 400 0 59 12.66 11.746
number of students 400 130 1570 483.47 226.448
Percentage free meals in
3 categories 400 1 3 2.02 .819
Valid N (listwise) 295
We now examine the output for the variables we used in our regression analysis above,
namely api00, acs_k3, meals and full. For api00, we see that the values range from 369 to 940 and
there are 400 valid values. For acs_k3, the average class size ranges from -21 to 25 and there are 2
missing values. An average class size of -21 cannot be correct. The variable meals ranges from 6%
getting free meals to 100% getting free meals, so these values seem reasonable, but there are only
315 valid values for this variable. The percent of teachers being fully credentialed ranges from .42 to
100, and all of the values are valid.
This has uncovered a number of peculiarities worthy of further examination. We now obtain a
corrected data set from the same source. In this data set the errors have been corrected, so it is free from
the shortcomings diagnosed above. We run another multiple regression on the new data set.
New Multiple regression analysis
For this multiple regression example, we will regress the dependent variable, api00, on all of the
predictor variables in the data set.
Regression
Notes
Output Created 02-Apr-2013 22:54:47
Comments
Input Data C:\Users\Divij\Desktop\SPSS
Data\elemapi2.sav
Active Dataset DataSet8
Filter <none>
Weight <none>
Split File <none>
N of Rows in Working Data File 400
Missing Value Handling Definition of Missing User-defined missing values are treated as
missing.
Cases Used Statistics are based on cases with no missing
values for any variable used.
Syntax regression
/dependent api00
/method=enter ell meals yr_rnd mobility
acs_k3 acs_46 full emer enroll .
Resources Processor Time 00:00:00.031
Elapsed Time 00:00:00.022
Memory Required 4724 bytes
Additional Memory Required for
Residual Plots 0 bytes
Variables Entered/Removed (b)
Model   Variables Entered   Variables Removed   Method
1       number of students, avg class size 4-6, pct 1st year in school, avg class size k-3,
        pct emer credential, english language learners, year round school, pct free meals,
        pct full credential (a)   .   Enter
a. All requested variables entered.
b. Dependent Variable: api 2000
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .919a   .845       .841                56.768
a. Predictors: (Constant), number of students, avg class size 4-6, pct
1st year in school, avg class size k-3, pct emer credential, english
language learners, year round school, pct free meals, pct full
credential
ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 6740702.006 9 748966.890 232.409 .000a
Residual 1240707.781 385 3222.618
Total 7981409.787 394
a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg
class size k-3, pct emer credential, english language learners, year round school, pct free
meals, pct full credential
b. Dependent Variable: api 2000
Coefficientsa
Columns: Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig.
1 (Constant) 758.942 62.286 12.185 .000
english language learners -.860 .211 -.150 -4.083 .000
pct free meals -2.948 .170 -.661 -17.307 .000
year round school -19.889 9.258 -.059 -2.148 .032
pct 1st year in school -1.301 .436 -.069 -2.983 .003
avg class size k-3 1.319 2.253 .013 .585 .559
avg class size 4-6 2.032 .798 .055 2.546 .011
pct full credential .610 .476 .064 1.281 .201
pct emer credential -.707 .605 -.058 -1.167 .244
number of students -.012 .017 -.019 -.724 .469
a. Dependent Variable: api 2000
1) We now examine the output from this regression analysis. As with the first regression, we look to
the p-value of the F-test to see if the overall model is significant. With a p-value of zero to
three decimal places, the model is statistically significant. The R-squared is 0.845, meaning
that approximately 85% of the variability of api00 is accounted for by the variables in the
model. In this case, the adjusted R-squared indicates that about 84% of the variability
of api00 is accounted for by the model, even after taking into account the number of predictor
variables in the model. The coefficient for each of the variables indicates the amount of
change one could expect in api00 given a one-unit change in the value of that variable, given
that all other variables in the model are held constant. For example, consider the
variable ell. We would expect a decrease of 0.86 in the api00 score for every one-unit
increase in ell, assuming that all other variables in the model are held constant.
2) R-Square is the proportion of variance in the dependent variable (api00) which can be
predicted from the independent variables (ell, meals, yr_rnd,
mobility, acs_k3, acs_46, full, emer and enroll). The value of .845 indicates that about 85% of the
variance in api00 can be predicted from these variables.
3) The beta coefficients are used by some researchers to compare the relative strength of the
various predictors within the model. Because the beta coefficients are all measured in
standard deviations, instead of the units of the variables, they can be compared to one
another. In other words, the beta coefficients are the coefficients that you would obtain if the
outcome and predictor variables were all transformed to standard scores, also called z-scores,
before running the regression. In this example, meals has the largest Beta coefficient in absolute
value, -0.661, and acs_k3 has the smallest, 0.013. Thus, a one standard deviation increase
in meals is associated with a 0.661 standard deviation decrease in predicted api00, with the other
variables held constant. And a one standard deviation increase in acs_k3, in turn, is associated with a
0.013 standard deviation increase in predicted api00, with the other variables in the model held constant.
4) The adjusted R-square attempts to yield a more honest value to estimate the R-squared for
the population. The value of R-square was .8446, while the value of Adjusted R-square was
.8409.
5) The F Value is the Mean Square Regression (748966.89) divided by the Mean Square
Residual (3222.61761), yielding F=232.41. The p value associated with this F value is very
small (0.0000). These values are used to answer the question "Do the independent variables
reliably predict the dependent variable?". The p value is compared to your alpha level
(typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably
predict the dependent variable".
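6) The df column of the ANOVA table gives the degrees of freedom associated with the sources of
variance. The Total variance has N-1 degrees of freedom (DF). In this case, there were N=395
observations, so the DF for total is 394. The Regression DF is 9, the number of predictors, and the
Residual DF is 394 - 9 = 385.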
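The quantities discussed in points 1) to 6) can be recomputed from the ANOVA table with a few lines of Python. The sums of squares and degrees of freedom are copied from the table for the second regression; the variable names are illustrative, and the adjusted R-square uses the standard formula 1 - (1 - R²)(N - 1)/(N - k - 1).

```python
# Values copied from the ANOVA table of the second regression.
ss_regression = 6740702.006
ss_residual = 1240707.781
df_regression = 9      # number of predictors
df_residual = 385      # N - number of predictors - 1

# Mean squares and the F value.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_value = ms_regression / ms_residual

# R-square and adjusted R-square.
ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual
r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * df_total / df_residual

print(round(f_value, 2), round(r_square, 4), round(adj_r_square, 4))
```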