SPSS Analysis: Conjoint, Cluster, Regression, PCA, Discriminant

Conjoint Analysis :

Conjoint analysis is a marketing research technique used to determine customer preferences. It analyses how customers value the different attributes of a product (or service) and thus gives insight into the trade-offs they are willing to make among those attributes. Put simply, it tells us how much each feature of a product is worth to consumers.

The study involves surveying people with a set of attribute combinations, which the survey-takers rank or score by preference. The analysis then models customer preferences for the different combinations of attributes. The attributes are termed factors and their possible values are termed levels.

In the example used here to demonstrate conjoint analysis in SPSS, we analyse data on a carpet cleaner, taking price, brand, money-back guarantee, package design and seal of approval as the attributes on which consumers base their preferences. Using two data sets (the plan file and the preference file), we calculate the part-worths and determine the importance that respondents attach to each attribute.

Variable name   Variable label           Value labels
package         package design           A*, B*, C*
brand           brand name               K2R, Glory, Bissell
price           price                    $1.19, $1.39, $1.59
seal            Good Housekeeping seal   no, yes
money           money-back guarantee     no, yes

Code to import the data and run the analysis:

GET FILE='C:\Users\Abhi\Desktop\carpet_plan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
GET FILE='C:\Users\Abhi\Desktop\carpet_prefs.sav'.
DATASET NAME DataSet2 WINDOW=FRONT.
CONJOINT PLAN='C:\Users\Abhi\Desktop\carpet_plan.sav'
  /DATA='C:\Users\Abhi\Desktop\carpet_prefs.sav'
  /SEQUENCE=PREF1 PREF2 PREF3 PREF4 PREF5 PREF6 PREF7 PREF8 PREF9 PREF10 PREF11
    PREF12 PREF13 PREF14 PREF15 PREF16 PREF17 PREF18 PREF19 PREF20 PREF21 PREF22
  /SUBJECT=ID
  /FACTORS=PACKAGE BRAND (DISCRETE) PRICE (LINEAR LESS) SEAL (LINEAR MORE) MONEY (LINEAR MORE)
  /PRINT=SUMMARYONLY.

Model Description

Factor    N of Levels   Relation to Ranks or Scores
package   3             Discrete
brand     3             Discrete
price     3             Linear (less)
seal      2             Linear (more)
money     2             Linear (more)

Calculation of the part-worth of each attribute

Utilities

Factor       Level     Utility Estimate   Std. Error
package      A*        -2.233             .192
             B*        1.867              .192
             C*        .367               .192
brand        K2R       .367               .192
             Glory     -.350              .192
             Bissell   -.017              .192
price        $1.19     -6.595             .988
             $1.39     -7.703             1.154
             $1.59     -8.811             1.320
seal         no        2.000              .287
             yes       4.000              .575
money        no        1.250              .287
             yes       2.500              .575
(Constant)             12.870             1.282

This table shows the utility (part-worth) scores and their standard errors for each factor level. Higher utility values indicate greater preference. For the discrete factors (package and brand), the part-worths of the levels sum to zero, apart from rounding. Package design B* is the most preferred design, and K2R is preferred to Glory and Bissell. As expected, there is an inverse relationship between price and utility, with higher prices corresponding to lower utility. The presence of a seal of approval or a money-back guarantee corresponds to higher utility.

The total utility of any combination is the sum of the part-worths of its levels plus the constant. For example, if the cleaner had package design C*, brand Bissell, a price of $1.59, a seal of approval and a money-back guarantee, the total utility would be:

0.367 + (−0.017) + (−8.811) + 4.000 + 2.500 + 12.870 = 10.909.
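The same sum can be checked for any profile with a short script. This is only a sketch: the dictionary layout and the function name `total_utility` are illustrative choices, while the numbers are copied from the Utilities table.

```python
# Part-worth utilities copied from the Utilities table.
UTILITIES = {
    "package": {"A*": -2.233, "B*": 1.867, "C*": 0.367},
    "brand": {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price": {"$1.19": -6.595, "$1.39": -7.703, "$1.59": -8.811},
    "seal": {"no": 2.000, "yes": 4.000},
    "money": {"no": 1.250, "yes": 2.500},
}
CONSTANT = 12.870


def total_utility(profile):
    """Add the constant to the part-worth of each chosen factor level."""
    return CONSTANT + sum(UTILITIES[factor][level]
                          for factor, level in profile.items())


profile = {"package": "C*", "brand": "Bissell", "price": "$1.59",
           "seal": "yes", "money": "yes"}
print(round(total_utility(profile), 3))  # 10.909
```

Swapping in other levels (for example package B* and price $1.19) shows how the total utility changes from one product profile to another.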

Importance:

Importance Values

package 35.635

brand 14.911

price 29.410

seal 11.172

money 8.872

Package design has the highest importance, followed by price. The money-back guarantee is of least concern to consumers. The values are computed by taking the utility range for each factor separately and dividing by the sum of the utility ranges for all factors. The values thus represent percentages and sum to 100.
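The range-based formula can be sketched as follows. One caveat, stated as an assumption: SPSS appears to compute the importance for each subject separately and then average, so applying the formula to the overall utilities (as done here) illustrates the calculation but is not expected to reproduce the Importance Values table exactly.

```python
# Overall part-worth utilities copied from the Utilities table.
UTILITIES = {
    "package": [-2.233, 1.867, 0.367],
    "brand": [0.367, -0.350, -0.017],
    "price": [-6.595, -7.703, -8.811],
    "seal": [2.000, 4.000],
    "money": [1.250, 2.500],
}


def importance(utilities):
    """Utility range of each factor as a percentage of the summed ranges."""
    ranges = {factor: max(levels) - min(levels)
              for factor, levels in utilities.items()}
    total = sum(ranges.values())
    return {factor: 100 * r / total for factor, r in ranges.items()}


scores = importance(UTILITIES)
for factor, value in scores.items():
    print(f"{factor:8s} {value:6.2f}")
```

Whatever utilities are supplied, the resulting percentages always sum to 100, which is the property noted above.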

Coefficients

Factor   B Coefficient Estimate
price    -5.542
seal     2.000
money    1.250

The utility for a particular factor level is determined by multiplying the level by the coefficient. For

example, the predicted utility for a price of $1.19 was listed as −6.595 in the utilities table. This is

simply the value of the price level, 1.19, multiplied by the price coefficient, −5.542.
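A quick check of this relationship for all the linear factors is sketched below. The coding of seal and money as 1 (no) and 2 (yes) is an assumption inferred from the utilities table, and small differences in the third decimal are expected because the printed coefficient is rounded.

```python
# B coefficients from the Coefficients table.
COEFFICIENTS = {"price": -5.542, "seal": 2.000, "money": 1.250}

# Level values; the 1/2 coding for no/yes is assumed, not stated in the output.
LEVELS = {
    "price": {"$1.19": 1.19, "$1.39": 1.39, "$1.59": 1.59},
    "seal": {"no": 1, "yes": 2},
    "money": {"no": 1, "yes": 2},
}


def linear_utility(factor, level):
    """Part-worth of a linear factor level: level value times coefficient."""
    return LEVELS[factor][level] * COEFFICIENTS[factor]


for factor, levels in LEVELS.items():
    for level in levels:
        print(factor, level, round(linear_utility(factor, level), 3))
```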

This table provides measures of the correlation between the observed and estimated preferences.

Preference Scores of Simulations

Card Number   ID   Score
1             1    10.258
2             2    14.292

The real power of conjoint analysis is the ability to predict preference for product profiles that

weren't rated by the subjects. These are referred to as simulation cases.

Preference Probabilities of Simulations

Card Number   ID   Maximum Utility   Bradley-Terry-Luce   Logit
1             1    30.0%             43.1%                30.9%
2             2    70.0%             56.9%                69.1%

The maximum utility model determines the probability as the number

of respondents predicted to choose the profile divided by the total

number of respondents. For each respondent, the predicted choice is

simply the profile with the largest total utility.
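The maximum utility rule described above can be sketched as follows. The per-respondent utilities are hypothetical values invented for illustration (the report only shows the averaged scores), chosen so that three of ten respondents favour the first profile.

```python
def max_utility_shares(total_utilities):
    """Share of respondents whose highest-utility profile is each profile.

    total_utilities holds one row per respondent and one total utility per
    profile. Ties go to the first profile in the row (a simplification).
    """
    n_profiles = len(total_utilities[0])
    wins = [0] * n_profiles
    for row in total_utilities:
        wins[row.index(max(row))] += 1
    return [w / len(total_utilities) for w in wins]


# Hypothetical total utilities for 10 respondents and 2 simulation profiles.
hypothetical_utilities = [
    [12.1, 10.4], [9.8, 14.9], [11.0, 15.2], [13.5, 12.0], [8.7, 13.3],
    [10.2, 16.1], [9.9, 12.8], [14.0, 11.6], [7.5, 15.0], [10.8, 14.4],
]
shares = max_utility_shares(hypothetical_utilities)
print([f"{s:.1%}" for s in shares])  # ['30.0%', '70.0%']
```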

Number of Reversals

Factor    price        3
          money        2
          seal         2
          brand        0
          package      0
Subject   Subject 1    1
          Subject 2    2
          Subject 3    0
          Subject 4    0
          Subject 5    0
          Subject 6    1
          Subject 7    0
          Subject 8    0
          Subject 9    1
          Subject 10   2

This table displays the number of reversals for each factor and for each subject. For example, three

subjects showed a reversal for price. That is, they preferred product profiles with higher prices.

Reversal Summary

N of Reversals   N of Subjects
1                3
2                2

Q. Perform Discriminant Analysis on the given dataset. The dataset contains the characteristics of a set of people who have been given bank loans, together with whether or not each of them has previously defaulted.

Discriminant

Notes

Output Created 04-Apr-2013 18:39:05

Comments

Input Data E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav

Active Dataset DataSet1

File Label Bank Loan Default

Filter <none>

Weight <none>

Split File <none>

N of Rows in Working

Data File 850

Missing Value Handling Definition of Missing User-defined missing values are

treated as missing in the analysis

phase.

Cases Used In the analysis phase, cases with no

user- or system-missing values for

any predictor variable are used.

Cases with user-, system-missing, or

out-of-range values for the

grouping variable are always

excluded.

Syntax DISCRIMINANT

/GROUPS=default(0 1)

/VARIABLES=employ address age

/ANALYSIS ALL

/PRIORS EQUAL

/STATISTICS=MEAN STDDEV UNIVF

BOXM COEFF CORR TABLE

/PLOT=COMBINED

/PLOT=CASES

/CLASSIFY=NONMISSING POOLED

MEANSUB.

Resources Processor Time 00:00:00.047

Elapsed Time 00:00:00.121


GET FILE='E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav'.

DATASET NAME DataSet1 WINDOW=FRONT.

DISCRIMINANT

/GROUPS=default(0 1)

/VARIABLES=employ address age

/ANALYSIS ALL

/PRIORS EQUAL

/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF CORR TABLE

/PLOT=COMBINED

/PLOT=CASES

/CLASSIFY=NONMISSING POOLED MEANSUB.

[DataSet1] E:\VGSOM\STUDY\SECOND SEM\BRM\SPSS16\Samples\bankloan.sav

Warnings

All-Groups Stacked Histogram is no longer displayed.

Analysis Case Processing Summary

Unweighted Cases                                                N     Percent
Valid                                                           700   82.4
Excluded   Missing or out-of-range group codes                  150   17.6
           At least one missing discriminating variable         0     .0
           Both missing or out-of-range group codes and
           at least one missing discriminating variable         0     .0
           Total                                                150   17.6
Total                                                           850   100.0

Group Statistics

Previously defaulted   Variable                      Mean    Std. Deviation   Valid N (unweighted)   Valid N (weighted)
No                     Years with current employer   9.51    6.664            517                    517.000
                       Years at current address      8.95    7.001            517                    517.000
                       Age in years                  35.51   7.708            517                    517.000
Yes                    Years with current employer   5.22    5.543            183                    183.000
                       Age in years                  33.01   8.518            183                    183.000
Total                  Years with current employer   8.39    6.658            700                    700.000
                       Age in years                  34.86   7.997            700                    700.000

Tests of Equality of Group Means

Variable                      Wilks' Lambda   F        df1   df2   Sig.
Years with current employer   .920            60.759   1     698   .000
Years at current address      .973            19.402   1     698   .000
Age in years                  .981            13.482   1     698   .000

Pooled Within-Groups Matrices

Correlation                   Years with current employer   Years at current address   Age in years
Years with current employer   1.000                         .292                       .524
Years at current address      .292                          1.000                      .588
Age in years                  .524                          .588                       1.000

Analysis 1

Box's Test of Equality of Covariance Matrices

Log Determinants

Previously defaulted Rank

Log

Determinant

No 3 11.012

Yes 3 10.501

Pooled within-groups 3 10.919

The ranks and natural logarithms of determinants

printed are those of the group covariance

matrices.

Test Results

Box's M 28.171

F Approx. 4.665

df1 6

df2 7.335E5

Sig. .000

The pooled within-groups correlation matrix shows the correlations between the predictors. The largest correlations involve Age in years, which correlates .588 with Years at current address and .524 with Years with current employer.


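Box's M tests the null hypothesis of equal population covariance matrices. With Sig. = .000 the null hypothesis is rejected here, so the assumption of equal covariance matrices in the two groups is questionable.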

Summary of Canonical Discriminant Functions

Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .100 (a)     100.0           100.0          .301

a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .909            66.251       3    .000

Standardized Canonical Discriminant Function Coefficients

Variable                      Function 1
Years with current employer   .980
Years at current address      .436
Age in years                  -.330

Structure Matrix

Variable                      Function 1
Years with current employer   .934
Years at current address      .528
Age in years                  .440

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

Functions at Group Centroids

Previously defaulted   Function 1
No                     .188
Yes                    -.530

Unstandardized canonical discriminant functions evaluated at group means.

Classification Statistics

Classification Processing Summary

Processed                                                     850
Excluded   Missing or out-of-range group codes                0
           At least one missing discriminating variable       0
Used in Output                                                850

Prior Probabilities for Groups

Previously defaulted   Prior   Cases Used in Analysis (Unweighted)   Cases Used in Analysis (Weighted)
No                     .500    517                                   517.000
Yes                    .500    183                                   183.000
Total                  1.000   700                                   700.000

Classification Function Coefficients

                              Previously defaulted
Variable                      No        Yes
Years with current employer   -.192     -.302
Years at current address      -.302     -.348
Age in years                  .797      .827
(Constant)                    -12.588   -12.444

Fisher's linear discriminant functions

Classification Results (a)

                                           Predicted Group Membership
                   Previously defaulted    No      Yes     Total
Original   Count   No                      300     217     517
                   Yes                     44      139     183
                   Ungrouped cases         81      69      150
           %       No                      58.0    42.0    100.0
                   Yes                     24.0    76.0    100.0
                   Ungrouped cases         54.0    46.0    100.0

a. 62.7% of original grouped cases correctly classified.

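The classification results show how well the discriminant function separates the two groups. Of the 183 customers who previously defaulted, 139 (76.0%) are correctly predicted as defaulters and 44 (24.0%) are misclassified as non-defaulters. Of the 517 customers who did not default, 300 (58.0%) are correctly predicted as non-defaulters and 217 (42.0%) are misclassified as defaulters. Overall, 62.7% of the original grouped cases are correctly classified. The 150 ungrouped cases, whose default status is unknown, are split into 81 predicted non-defaulters and 69 predicted defaulters.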
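The percentages in the Classification Results table follow directly from the counts. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Counts from the Classification Results table: actual group -> predicted group.
confusion = {
    "No": {"No": 300, "Yes": 217},
    "Yes": {"No": 44, "Yes": 139},
}

# Cases on the diagonal are correctly classified.
correct = sum(confusion[group][group] for group in confusion)
total = sum(sum(row.values()) for row in confusion.values())
overall_pct = 100 * correct / total

# Hit rate within each actual group.
per_group_pct = {group: 100 * confusion[group][group] / sum(row.values())
                 for group, row in confusion.items()}

print(round(overall_pct, 1))                                # 62.7
print({g: round(p, 1) for g, p in per_group_pct.items()})   # {'No': 58.0, 'Yes': 76.0}
```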

Q. Perform Factor Analysis on the given dataset.

The dataset chosen is a fictional statistics anxiety questionnaire. It contains the responses given by students regarding their ease of use, liking and usage of SPSS in statistics.

Using the scree plot, I have chosen 5 factors.

Since a student's answers to different questions are likely to be related, I considered the variables, and hence the underlying factors, to be inter-related and therefore used Oblimin rotation. For example, a student who gives a high score to “I have little experience of computers” is likely to give a high score to “All computers hate me”, as the two variables are somewhat correlated.

Using the SPSS options, the following Pattern Matrix was generated.

Pattern Matrix (a)

No.   Variable                                                                                  Component loadings (components 1 to 5, listed left to right)
1     I have little experience of computers                                                     .903
2     SPSS always crashes when I try to use it                                                  .732
3     All computers hate me                                                                     .684
4     I worry that I will cause irreparable damage because of my incompetence with computers   .662
5     Computers have minds of their own and deliberately go wrong whenever I use them           .581
6     People try to tell you that SPSS makes statistics easier to understand but it doesn't     .446
7     Computers are out to get me                                                               .333
8     My friends are better at SPSS than I am                                                   .661
9     My friends are better at statistics than me                                               .655
10    If I'm good at statistics my friends will think I'm a nerd                                .622
11    My friends will think I'm stupid for not being able to cope with SPSS                     .504   .330
12    Everybody looks at me when I use SPSS                                                     .358   .358
13    I can't sleep for thoughts of eigen vectors                                               -.728
14    I wake up under my duvet thinking that I am trapped under a normal distribution           .324   -.543
15    Computers are useful only for playing games                                               .359   .393   -.366
16    Standard deviations excite me                                                             .301   .356   .315
17    I have never been good at mathematics                                                     -.855
18    I did badly at mathematics at school                                                      -.736
19    I slip into a coma whenever I see an equation                                             -.722
20    Statistics makes me cry                                                                   -.772
21    I don't understand statistics                                                             -.730
22    I weep openly at the mention of central tendency                                          -.664
23    I dream that Pearson is attacking me with correlation coefficients                        -.564

Extraction Method: Principal Component Analysis.
Rotation Method: Oblimin with Kaiser Normalization.
a. Rotation converged in 15 iterations.

The rotation sums of squared loadings for each factor are given below.

Total Variance Explained

Component   Rotation Sums of Squared Loadings (Total)
1           5.522
2           2.452
3           2.383
4           3.535
5           4.913

Extraction Method: Principal Component Analysis.

The percentage of variance for a factor is calculated by dividing its sum of squared loadings by the number of variables and multiplying by 100.
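Applying that formula to the five rotation sums of squared loadings gives the figures below. This sketch assumes 23 variables, which is the number of rows in the Pattern Matrix; with an oblique (Oblimin) rotation the factors are correlated, so the resulting percentages overlap and should not be added together.

```python
# Rotation sums of squared loadings from the Total Variance Explained table.
ROTATION_SSL = {1: 5.522, 2: 2.452, 3: 2.383, 4: 3.535, 5: 4.913}

# Assumed from the number of rows in the Pattern Matrix.
N_VARIABLES = 23


def percent_of_variance(ssl, n_variables):
    """Sum of squared loadings divided by the number of variables, times 100."""
    return 100 * ssl / n_variables


percentages = {factor: percent_of_variance(ssl, N_VARIABLES)
               for factor, ssl in ROTATION_SSL.items()}
for factor, pct in percentages.items():
    print(f"Factor {factor}: {pct:.1f}%")
```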

On the basis of the loading values, the variables group into factors as follows.

Factor   Variable Nos.
1        1, 2, 3, 4, 5, 6, 7, 14
2        8, 9, 10
3        13
4        17, 18, 19
5        20, 21, 22, 23

Variables 11, 12, 15 and 16 have loadings of similar size on more than one factor, which is undesirable because each of these variables is then assessing more than one construct. Variable 12 has exactly the same loading (.358) on two factors. Such variables are said to have split loadings and are therefore listed separately.

Factor   Variable Nos.
2        11, 16, 15
3        12, 15

As split loadings are present, this is not a simple structure.

Factor 1: Anxiety about the usage of computers. Its rotation sum of squared loadings (5.522) corresponds to about 24.0% of the variance (5.522 / 23 × 100), and 8 variables load on it.

Factor 2: Students' view of their understanding of statistics and SPSS relative to their peers. It corresponds to about 10.7% of the variance and loads 3 variables. It also split loads variables 11, 16 and 15.

Factor 3: Anxiety about eigenvectors. It corresponds to only about 10.4% of the variance and loads only 1 variable directly, while it split loads variables 12 and 15.

Factor 4: Students' interest in mathematics. It corresponds to about 15.4% of the variance and loads 3 variables.

Factor 5: Dislike for statistics. It corresponds to about 21.4% of the variance and loads 4 variables.

Because the Oblimin rotation allows the factors to correlate, these percentages overlap and cannot simply be added to obtain the total variance explained.

CLUSTER ANALYSIS

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same

group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis

used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Proximities

Notes


Comments

Input Data C:\Users\dev maletia\Downloads\ClusterAnonFaculty.sav
Filter <none>
Weight <none>
Split File <none>
N of Rows in Working Data File 44
Missing Value Handling Definition of Missing User-defined missing values are treated as missing.
Cases Used Statistics are based on cases with no missing values for any variable used.
Syntax PROXIMITIES Salary FTE Rank Articles Experience
  /MATRIX OUT('C:\Users\DEVMAL~1\AppData\Local\Temp\spss6496\spssclus.tmp')
  /VIEW=CASE
  /MEASURE=SEUCLID
  /PRINT NONE
  /ID=Name
  /STANDARDIZE=VARIABLE Z.
Workspace Bytes 11152
Files Saved Matrix File C:\Users\DEVMAL~1\AppData\Local\Temp\spss6496\spssclus.tmp

The variables in the dataset are as follows:

• Name -- Although faculty salaries are public information under North Carolina state law

• Salary – annual salary in dollars, from the university report available in One Stop.

• FTE – Full time equivalent work load for the faculty member.

• Rank – where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor

• Articles – number of published scholarly articles, excluding things like comments in newsletters,

abstracts in proceedings, and the like.

• Experience – Number of years working as a full time faculty member in a Department of Psychology.

• ArticlesAPD – number of published articles as listed in the university’s Academic Publications

• Sex –biological sex from physical appearance.

In the first step, SPSS computes the squared Euclidean distance between each pair of cases. This is simply the sum across variables (from i = 1 to v) of the squared difference between the score on variable i for one case (Xi) and the score on variable i for the other case (Yi). The two cases separated by the smallest squared Euclidean distance are identified and classified together into the first cluster. At this point there is one cluster with two cases in it.

Next, SPSS re-computes the squared Euclidean distances between each entity (case or cluster) and each other entity. When one or both of the compared entities is a cluster, SPSS computes the average squared Euclidean distance between members of one entity and members of the other. The two entities with the smallest squared Euclidean distance are classified together. This process is repeated until all of the cases have been clustered into one big cluster.
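The procedure just described can be sketched in a few lines. This is an illustrative re-implementation, not SPSS's own code: the five cases (salary, articles) are hypothetical, variables are z-standardized with the sample standard deviation to mirror /STANDARDIZE=VARIABLE Z, and the distance between clusters is the average squared Euclidean distance over all cross-cluster pairs.

```python
from itertools import combinations
from statistics import mean, stdev


def z_standardize(rows):
    """Convert every column to z-scores (mean 0, sample SD 1)."""
    stats = [(mean(col), stdev(col)) for col in zip(*rows)]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def average_linkage(rows):
    """Return the agglomeration schedule as (cluster 1, cluster 2, distance)."""
    data = z_standardize(rows)
    clusters = {i: [i] for i in range(len(data))}
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Average squared distance over all pairs across the two entities.
            d = mean(squared_euclidean(data[i], data[j])
                     for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merged cluster keeps the lower number
        schedule.append((a + 1, b + 1, round(d, 3)))
    return schedule


# Hypothetical cases: (salary, number of articles).
hypothetical_cases = [[40000, 2], [41000, 3], [90000, 40], [88000, 38], [15000, 0]]
schedule = average_linkage(hypothetical_cases)
for stage, (c1, c2, coef) in enumerate(schedule, start=1):
    print(stage, c1, c2, coef)
```

As in the SPSS agglomeration schedule, each merged cluster is labelled with the lowest case number it contains, and n cases always produce n - 1 stages.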

The output obtained can be seen below:

Case Processing Summary (a)

Cases
Valid           Missing         Total
N    Percent    N    Percent    N    Percent
44   100.0%     0    .0%        44   100.0%

a. Squared Euclidean Distance used

In the first step SPSS clustered case 32 with case 33. The squared Euclidean distance between these two cases is 0.000. At stages 2-4 SPSS creates three more clusters, each containing two cases. At stage 5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the 43rd stage all cases have been clustered into one entity. The results can be seen below:

Average Linkage (Between Groups)

Agglomeration Schedule

Columns: Stage; Cluster Combined (Cluster 1, Cluster 2); Coefficients; Stage Cluster First Appears (Cluster 1, Cluster 2); Next Stage

1 32 33 .000 0 0 9

2 41 42 .000 0 0 6

3 43 44 .000 0 0 6

4 37 38 .000 0 0 5

5 37 39 .001 4 0 7

6 41 43 .002 2 3 27

7 36 37 .003 0 5 27

8 20 22 .007 0 0 11

9 30 32 .012 0 1 13

10 21 26 .012 0 0 14

11 20 25 .031 8 0 12

12 16 20 .055 0 11 14

13 29 30 .065 0 9 26

14 16 21 .085 12 10 20

15 11 18 .093 0 0 22

16 8 9 .143 0 0 25

17 17 24 .144 0 0 20

18 13 23 .167 0 0 22

19 14 15 .232 0 0 32

20 16 17 .239 14 17 23

21 7 12 .279 0 0 28

22 11 13 .441 15 18 29

23 16 27 .451 20 0 26

24 3 10 .572 0 0 28

25 6 8 .702 0 16 36

26 16 29 .768 23 13 35

27 36 41 .858 7 6 33

28 3 7 .904 24 21 31

29 11 28 .993 22 0 30

30 5 11 1.414 0 29 34

31 3 4 1.725 28 0 36

32 14 31 1.928 19 0 34

33 36 40 2.168 27 0 40

34 5 14 2.621 30 32 35

35 5 16 2.886 34 26 37

36 3 6 3.089 31 25 38

37 5 19 4.350 35 0 39

38 1 3 4.763 0 36 41

39 5 34 5.593 37 0 42

40 35 36 8.389 0 33 43

41 1 2 8.961 38 0 42

42 1 5 11.055 41 39 43

43 1 35 17.237 42 40 0

Cluster Membership

Case 5 Clusters 4 Clusters 3 Clusters 2 Clusters

1:Rosalyn 1 1 1 1

2:Lawrence 2 2 1 1

3:Sunila 1 1 1 1

4:Randolph 1 1 1 1

5:Mickey 3 3 2 1

6:Louis 1 1 1 1

7:Tony 1 1 1 1

8:Raul 1 1 1 1

9:Catalina 1 1 1 1

10:Johnson 1 1 1 1

11:Beulah 3 3 2 1

12:Martina 1 1 1 1

13:Marie 3 3 2 1

14:Ernest 3 3 2 1

15:Christopher 3 3 2 1

16:Ernie 3 3 2 1

17:Christa 3 3 2 1

18:Linette 3 3 2 1

19:Bo 3 3 2 1

20:Carla 3 3 2 1

21:Alberto 3 3 2 1

22:Christina 3 3 2 1

23:Jonah 3 3 2 1

24:Tucker 3 3 2 1

25:Shanta 3 3 2 1

26:Melissa 3 3 2 1

27:Jenna 3 3 2 1

28:Johnny 3 3 2 1

29:Cleatus 3 3 2 1

30:Jonas 3 3 2 1

31:Tad 3 3 2 1

32:Amaryllis 3 3 2 1

33:Nathan 3 3 2 1

34:Deanna 3 3 2 1

35:Willy 4 4 3 2

36:Deana 5 4 3 2

37:Dea 5 4 3 2

38:Claude 5 4 3 2

39:Amanda 5 4 3 2

40:Boris 5 4 3 2

41:Garrett 5 4 3 2

42:Stew 5 4 3 2

43:Bree 5 4 3 2

44:Karma 5 4 3 2

Vertical Icicle:

The full vertical icicle plot cannot be displayed in this document, but its results are described below.

For the two-cluster solution, one cluster consists of ten cases (Boris through Willy, followed by a column with no X’s). These were our adjunct (part-time) faculty (excepting one), and the second cluster consists of everybody else.

For the three-cluster solution, the cluster of adjunct faculty remains and the others split into two: Deanna through Mickey were our junior faculty, and Lawrence through Rosalyn our senior faculty.

For the four-cluster solution, one case (Lawrence) forms a cluster of his own.

Dendrogram

It displays essentially the same information that is found in the agglomeration schedule but in graphic form.

[Dendrogram using Average Linkage (Between Groups). Horizontal axis: Rescaled Distance Cluster Combine, from 0 to 25. Cases in display order, top to bottom (label and case number): Amaryllis 32, Nathan 33, Jonas 30, Cleatus 29, Alberto 21, Melissa 26, Carla 20, Christina 22, Shanta 25, Ernie 16, Christa 17, Tucker 24, Jenna 27, Beulah 11, Linette 18, Marie 13, Jonah 23, Johnny 28, Mickey 5, Ernest 14, Christopher 15, Tad 31, Bo 19, Deanna 34, Raul 8, Catalina 9, Louis 6, Tony 7, Martina 12, Sunila 3, Johnson 10, Randolph 4, Rosalyn 1, Lawrence 2, Garrett 41, Stew 42, Bree 43, Karma 44, Dea 37, Claude 38, Amanda 39, Deana 36, Boris 40, Willy 35.]

Multiple Regression Analysis

In this analysis we use a data file that was created by randomly sampling 400 elementary schools from the California Department of Education's API 2000 dataset. The data file contains a measure of school academic performance as well as other attributes of the elementary schools, such as class size, enrolment and poverty.

We now perform a regression analysis using api00 as the outcome variable and the variables acs_k3, meals and full as predictors. These measure the academic performance of the school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of students receiving free meals (meals), which is an indicator of poverty, and the percentage of teachers who have full teaching credentials (full). We expect that better academic performance would be associated with smaller class sizes, fewer students receiving free meals, and a higher percentage of teachers having full teaching credentials. The output is as follows:

Regression

Notes


Comments

Input Data C:\Users\Divij\Desktop\SPSS Data\elemapi.sav


Filter <none>

Weight <none>

Split File <none>

N of Rows in Working Data File 400
Missing Value Handling Definition of Missing User-defined missing values are treated as missing.
Cases Used Statistics are based on cases with no missing values for any variable used.
Syntax regression
  /dependent api00
  /method=enter acs_k3 meals full.



Memory Required 2284 bytes

Additional Memory Required for

Residual Plots 0 bytes

Variables Entered/Removed (b)

Model   Variables Entered                                              Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals (a)   .                   Enter

a. All requested variables entered.
b. Dependent Variable: api 2000

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .821 (a)   .674       .671                64.153

a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals

ANOVA

Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   2634884.261      3     878294.754    213.407   .000 (a)
    Residual     1271713.209      309   4115.577
    Total        3906597.470      312

a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals


Coefficients (a)

                             Unstandardized Coefficients   Standardized Coefficients
Model                        B         Std. Error          Beta                        t         Sig.
1   (Constant)               906.739   28.265                                          32.080    .000
    avg class size k-3       -2.682    1.394               -.064                       -1.924    .055
    pct free meals           -3.702    .154                -.808                       -24.038   .000
    pct full credential      .109      .091                .041                        1.197     .232

a. Dependent Variable: api 2000

We now examine whether each of the three predictors is statistically significant and, if so, the direction of the relationship. The coefficient for average class size (acs_k3, b = -2.682) is not significant at the 0.05 level (p = 0.055), but only just so. The coefficient is negative, which would indicate that larger class sizes are related to lower academic performance, as we would expect. Next, the effect of meals (b = -3.702, p = .000) is significant and its coefficient is negative, indicating that the greater the proportion of students receiving free meals, the lower the academic performance. We cannot say that free meals are causing lower academic performance. The meals variable is highly related to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance. Finally, the percentage of teachers with full credentials (full, b = 0.109, p = .232) seems to be unrelated to academic performance. This would seem to indicate that the percentage of teachers with full credentials is not an important factor in predicting academic performance, which is unexpected.

From these results, we would tentatively conclude that smaller class sizes are related to higher performance (although this effect falls just short of significance), that fewer students receiving free meals is associated with higher performance, and that the percentage of teachers with full credentials is not related to academic performance in the schools. Before we write this up as our finding, we should check the data to make sure we can firmly stand behind these results.
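To make the coefficients concrete, the fitted equation can be evaluated for a single school. The sketch below uses the unstandardized B values from the Coefficients table; the example school (class size 16, 67% free meals, 76% fully credentialed teachers) takes the values of the first case in the data listing, and the function name `predict_api00` is an illustrative choice.

```python
# Unstandardized coefficients (B) from the Coefficients table.
COEF = {"const": 906.739, "acs_k3": -2.682, "meals": -3.702, "full": 0.109}


def predict_api00(acs_k3, meals, full):
    """Predicted api00 = constant + sum of coefficient * predictor value."""
    return (COEF["const"]
            + COEF["acs_k3"] * acs_k3
            + COEF["meals"] * meals
            + COEF["full"] * full)


predicted = predict_api00(acs_k3=16, meals=67, full=76.0)
print(round(predicted, 3))  # 624.077

# Raising meals by one percentage point, other predictors fixed, changes the
# prediction by the meals coefficient.
change = predict_api00(16, 68, 76.0) - predicted
print(round(change, 3))  # -3.702
```

The observed api00 for that case is 693, so the difference between the observed and predicted values (the residual) is roughly 69 points.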

Examining Data

Step 1)

To start examining the data, we look at the first 10 cases for the variables included in our regression analysis, paying particular attention to the number of missing values.

api00   acs_k3   meals   full
693     16       67      76.00
570     15       92      79.00
546     17       97      68.00
571     20       90      87.00
478     18       89      87.00
858     20       .       100.00
918     19       .       100.00
831     20       .       96.00
860     20       .       100.00
737     21       29      96.00

Number of cases read: 10 Number of cases listed: 10

We see that among the first 10 observations there are four missing values for meals. Keeping this in mind, we can use the descriptives command with /var=all to get descriptive statistics for all of the variables, paying special attention to the number of valid cases for meals.

Step 2)

Descriptive Statistics

N Minimum Maximum Mean Std. Deviation

school number 400 58 6072 2866.81 1543.811

district number 400 41 796 457.73 184.823

api 2000 400 369 940 647.62 142.249

api 1999 400 333 917 610.21 147.136

growth 1999 to 2000 400 -69 134 37.41 25.247

pct free meals 315 6 100 71.99 24.386

english language learners 400 0 91 31.45 24.839

year round school 400 0 1 .23 .421

pct 1st year in school 399 2 47 18.25 7.485

avg class size k-3 398 -21 25 18.55 5.005

avg class size 4-6 397 20 50 29.69 3.841

parent not hsg 400 0 100 21.25 20.676

parent hsg 400 0 100 26.02 16.333

parent some college 400 0 67 19.71 11.337

parent college grad 400 0 100 19.70 16.471

parent grad school 400 0 67 8.64 12.131

avg parent ed 381 1.00 4.62 2.6685 .76379

pct full credential 400 .42 100.00 66.0568 40.29793

pct emer credential 400 0 59 12.66 11.746

number of students 400 130 1570 483.47 226.448

Percentage free meals in 3 categories 400 1 3 2.02 .819

Valid N (listwise) 295

We now examine the output for the variables used in our regression analysis above, namely api00, acs_k3, meals and full. For api00, the values range from 369 to 940 and there are 400 valid values. For acs_k3, the average class size ranges from -21 to 25 and there are 2 missing values. An average class size of -21 cannot be correct. The variable meals ranges from 6% to 100% of students getting free meals, so these values seem reasonable, but there are only 315 valid values for this variable. The percentage of teachers with full credentials ranges from .42 to 100, and all of the values are valid.

This has uncovered a number of peculiarities worthy of further examination. We now obtain a corrected data set from the same source, which is free from the shortcomings diagnosed above, and run another multiple regression on it.
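The two checks used above (counting missing values and flagging implausible values) can be sketched as follows. The ten cases are those from the listing in Step 1, with None standing for a missing value; the plausible range of 1 to 100 for an average class size is an assumption made for this example, and the helper names are illustrative.

```python
# First 10 cases from the Step 1 listing; None stands for a missing value (".").
NAMES = ("api00", "acs_k3", "meals", "full")
cases = [
    (693, 16, 67, 76.0), (570, 15, 92, 79.0), (546, 17, 97, 68.0),
    (571, 20, 90, 87.0), (478, 18, 89, 87.0), (858, 20, None, 100.0),
    (918, 19, None, 100.0), (831, 20, None, 96.0), (860, 20, None, 100.0),
    (737, 21, 29, 96.0),
]


def missing_counts(rows, names):
    """Number of missing values per variable."""
    return {name: sum(row[i] is None for row in rows)
            for i, name in enumerate(names)}


def out_of_range(values, low, high):
    """Non-missing values that fall outside the plausible range [low, high]."""
    return [v for v in values if v is not None and not (low <= v <= high)]


missing = missing_counts(cases, NAMES)
print(missing)

# Minimum and maximum of acs_k3 reported in the Descriptive Statistics table.
flagged = out_of_range([-21, 25], low=1, high=100)
print(flagged)
```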

New Multiple regression analysis

For this multiple regression example, we will regress the dependent variable, api00, on all of the

predictor variables in the data set.

Regression

Notes


Comments

Input Data C:\Users\Divij\Desktop\SPSS Data\elemapi2.sav
Filter <none>
Weight <none>
Split File <none>
N of Rows in Working Data File 400
Missing Value Handling Definition of Missing User-defined missing values are treated as missing.
Cases Used Statistics are based on cases with no missing values for any variable used.
Syntax regression
  /dependent api00
  /method=enter ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll.



Memory Required 4724 bytes

Additional Memory Required for

Residual Plots 0 bytes

Variables Entered/Removed

Model   Variables Entered                                                       Variables Removed   Method
1       number of students, avg class size 4-6, pct 1st year in school,        .                   Enter
        avg class size k-3, pct emer credential, english language learners,
        year round school, pct free meals, pct full credential (a)

a. All requested variables entered.


Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .919 (a)   .845       .841                56.768

a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential

ANOVA

Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   6740702.006      9     748966.890    232.409   .000 (a)
    Residual     1240707.781      385   3222.618
    Total        7981409.787      394

a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential


Coefficients (a)

                                  Unstandardized Coefficients   Standardized Coefficients
Model                             B         Std. Error          Beta                        t         Sig.
1   (Constant)                    758.942   62.286                                          12.185    .000
    english language learners     -.860     .211                -.150                       -4.083    .000
    pct free meals                -2.948    .170                -.661                       -17.307   .000
    year round school             -19.889   9.258               -.059                       -2.148    .032
    pct 1st year in school        -1.301    .436                -.069                       -2.983    .003
    avg class size k-3            1.319     2.253               .013                        .585      .559
    avg class size 4-6            2.032     .798                .055                        2.546     .011
    pct full credential           .610      .476                .064                        1.281     .201
    pct emer credential           -.707     .605                -.058                       -1.167    .244
    number of students            -.012     .017                -.019                       -.724     .469

a. Dependent Variable: api 2000

1) We first examine the output from this regression analysis. As with the previous regression, we look at the p-value of the F-test to see whether the overall model is significant. With a p-value of zero to three decimal places, the model is statistically significant. The R-squared is 0.845, meaning that approximately 85% of the variability of api00 is accounted for by the variables in the model. The adjusted R-squared indicates that about 84% of the variability of api00 is accounted for by the model, even after taking into account the number of predictor variables in the model. The coefficient for each variable indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, with all other variables in the model held constant. For example, consider the variable ell. We would expect a decrease of 0.86 in the api00 score for every one-unit increase in ell, assuming that all other variables in the model are held constant.

2) R-Square is the proportion of variance in the dependent variable (api00) which can be predicted from the independent variables (ell, meals, yr_rnd, mobility, acs_k3, acs_46, full, emer and enroll). The value of .845 indicates that about 84.5% of the variance in api00 can be predicted from these variables.

3) The beta coefficients are used by some researchers to compare the relative strength of the various predictors within the model. Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed to standard scores, also called z-scores, before running the regression. In this example, meals has the largest Beta coefficient in absolute value, -0.661, and acs_k3 has the smallest, 0.013. Thus, a one standard deviation increase in meals is associated with a 0.661 standard deviation decrease in predicted api00, with the other variables held constant. Likewise, a one standard deviation increase in acs_k3 is associated with a 0.013 standard deviation increase in predicted api00, with the other variables in the model held constant.

4) The adjusted R-square attempts to yield a more honest estimate of the R-squared for the population. The value of R-square was .8446, while the value of adjusted R-square was .8409.

5) The F value is the Mean Square Regression (748966.89) divided by the Mean Square Residual (3222.61761), yielding F = 232.41. The p-value associated with this F value is very small (0.0000). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable".

6) The degrees of freedom are associated with the sources of variance. The total variance has N-1 degrees of freedom (DF). In this case, there were N = 395 observations, so the DF for total is 394.
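The quantities discussed in points 1 to 6 can all be recovered from the sums of squares and degrees of freedom in the ANOVA table. The following sketch uses the standard formulas; the function name `anova_summary` is an illustrative choice.

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return R-square, adjusted R-square and F from an ANOVA table."""
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual          # N - 1
    r_square = ss_regression / ss_total
    # The adjustment penalizes R-square for the number of predictors used.
    adj_r_square = 1 - (1 - r_square) * df_total / df_residual
    f_value = (ss_regression / df_regression) / (ss_residual / df_residual)
    return r_square, adj_r_square, f_value


# Values from the ANOVA table of the second regression.
r2, adj_r2, f = anova_summary(6740702.006, 1240707.781, 9, 385)
print(round(r2, 4), round(adj_r2, 4), round(f, 2))  # 0.8446 0.8409 232.41
```

Feeding in the first regression's ANOVA values (2634884.261, 1271713.209, 3 and 309) gives the .674, .671 and 213.41 reported for that model in the same way.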