

An assignment of Logistic and Multinomial Regression


Page 1: Business Analytics

BUSINESS ANALYTICS AND INTELLIGENCE

Assignment 1

AJAY KUMAR (1411211)

ANU KANKANE (1411214)

ANKIT GOSWAMI (1411286)

MALA HARISH (1411242)

SAMYARAJ DAS (1411398)

Contents

DESCRIPTIVE STATISTICS
    Proportion of Failure in Courses
    Probability of Dropout – Sports Activity
    Proportion of Dropout – Average Difficulty Level
    Proportion of Dropout – High School GPA
    Proportion of Dropout – Gender
    Proportion of Dropout – Year of Program
BINARY LOGISTIC MODEL
    Bifurcation of Data Set
    Variables
        Definition of drop out
    Independent Variables
        Interaction Variables
        New Variables created
    Selecting final Independent Variables
    OUTPUT
        The final output
        Result of testing on Validation Set
    Observations
    Recommendations
MULTINOMIAL LOGISTIC MODEL
    Variables
        Independent Variables
    Selecting final Independent Variables
    OUTPUT
        The final Output
    Observations and Recommendations
Exhibit 1
Exhibit 2
Exhibit 3
Exhibit 4
Exhibit 5
Exhibit 6
Exhibit 7
Exhibit 8
Exhibit 9: Excel Sheet and SPSS Output

DESCRIPTIVE STATISTICS

Descriptive statistics were calculated to understand how dropout status relates to other variables: courses opted, activity in sports, average difficulty level of courses, High School GPA, gender and year of program.

The data set was divided into two parts: Training Data and Validation Data. The descriptive analysis and model building were done on the Training Data, which had details of 100 students.

Proportion of Failure in Courses

[Figure: bar chart of the proportion of students who failed each course, C1 to C24. Vertical axis: Proportion of Students failed, 0.00 to 0.60. Data labels shown: 0.50, 0.03, 0.16, 0.33.]

From the above graph, it can be observed that most courses have a difficulty level (i.e. the proportion of students failing the subject) of 0.16 or less. However, courses C3 and C24 have a higher proportion of students who failed. The proportion of failure can be used as a proxy for the Difficulty Index of a subject. (Exhibit 2)

The dropout codes 0, 1 and 2 used for computing the following descriptive statistics are defined below.

Dropout Code (Y)   Dropout Code Description
0                  The candidate did not drop out
1                  The candidate dropped out and had failed in one or more courses
2                  The candidate dropped out despite passing all the courses he took

Probability of Dropout – Sports Activity

Dropout Criteria (Y)   Active in Sports   Inactive in Sports   Grand Total
0                      28                 16                   44
1                      20                 13                   33
2                      10                 13                   23
Grand Total            58                 42                   100

From the above table, a larger proportion (28/44) of students who did not drop out were inactive in sports. Also, a higher proportion of the students who dropped out because of failure in more than 1 course were observed to be inactive in sports.

Proportion of Dropout – Average Difficulty Level

Y   Average difficulty level of subjects taken
0   0.08028
1   0.09686
2   0.0914

On average, students who did not drop out (Y = 0) took courses with a lower average difficulty level (0.080) than students who dropped out (0.097 for Y = 1 and 0.091 for Y = 2).

Proportion of Dropout – High School GPA

Y   No. of students with HSGPA > 3
0   35
1   29
2   19

The largest number of students with a High School GPA above 3 (35) falls in the category of students who did not drop out. The number is lowest (19) for students who dropped out despite passing all their subjects.

Proportion of Dropout – Gender

More female students than male students stayed in the college without dropping out. The numbers of male and female students who dropped out are, however, comparable.

Proportion of Dropout – Year of Program

Year of drop out   No. of students who have dropped out
Year 1             29
Year 2             43
Year 3             8

The largest number of dropouts occurs in Year 2.

Number of students by gender and dropout code (Y):

Y   No. of male students   No. of female students
0   18                     26
1   18                     15
2   10                     13

BINARY LOGISTIC MODEL

Bifurcation of Data Set

The data was divided into two sets. Set 1, Training Data, consists of 100 randomly selected candidates, and Set 2, Validation Data, consists of 12 samples. Set 1 was used to estimate and build the model, and Set 2 was used to validate it.

The dependent variable had the following observed classification in the Training Data:

Total Number of Students   100
Students Dropped           56
Students Graduated         44

Variables

Dependent Variable: Whether a particular student dropped out during the term or graduated from Lovely Business School is taken as the dependent variable.

Final Result                 Y
Dropped Out                  1
Graduated/Did not Drop Out   0

Definition of drop out: A student who has not taken any courses in two consecutive terms is termed a drop out.

For instance, in the historical data provided, student 3544856 has not taken any subjects in the last two terms. Hence this student is considered a drop out.

Independent Variables:

The provided data has the following continuous variables, and one binary variable (gender).

i) HSGPA
ii) HSPctile
iii) HSSize
iv) SAT

Apart from this, the historical data also has details of the courses taken on a per-term basis.

Interaction Variables - The probability of passing a difficult course can be expected to depend on the past academic performance of the candidate.

a) A person with a higher HSGPA should have a higher probability of passing a course, say C1, than a person with a lower HSGPA. This leads to an intercept difference between the cases C1 = 0 and C1 = 1.

b) The logit of a model with one continuous variable and one course should be a function of the continuous variable, which leads us to expect a slope effect in the interaction terms.

To deal with such situations we created 96 interaction variables between the course dummy variables and the continuous variables.

An exhaustive list of all the independent variables is given in Exhibit-1.
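As an illustration of how such interaction variables can be built, the following minimal sketch multiplies each continuous variable by each course dummy for one student. The student record and the function name `add_interactions` are hypothetical. The sketch uses only the three continuous variables that appear in Exhibit 3 (72 terms); including a fourth variable, such as the Rank variable described later, would give 96.

```python
# Continuous variables and course dummies; names follow Exhibit 1.
CONTINUOUS = ["HSGPA", "HSPcT", "SAT"]
COURSES = [f"C{i}" for i in range(1, 25)]


def add_interactions(student):
    """Return the interaction terms (continuous value * course dummy) for one student."""
    terms = {}
    for var in CONTINUOUS:
        for course in COURSES:
            # The dummy is 1 if the student enrolled in the course, else 0.
            terms[f"{var}*{course}"] = student[var] * student.get(course, 0)
    return terms


# Hypothetical student record (illustrative values, not taken from the data set).
student = {"HSGPA": 3.2, "HSPcT": 80, "SAT": 1100, "C5": 1, "C19": 1}
interactions = add_interactions(student)

print(len(interactions))       # 72
print(interactions["SAT*C5"])  # 1100
print(interactions["SAT*C1"])  # 0
```

An interaction term is zero whenever the student did not take the course, so its coefficient only changes the logit for students who enrolled in that course.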

New Variables created –

a) The first new variable created is the “Difficulty Index of Subjects”. The Difficulty Index of each subject is the ratio of the number of students who took the subject and failed it to the total number of students who took the subject. Hence, the more students fail a subject, the higher its Difficulty Index.

$$DI_x = \frac{\text{no. of students who failed in subject } x}{\text{total no. of students who opted for subject } x}$$

The detailed Difficulty Index of each subject is given in Exhibit 2.
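The Difficulty Index can be reproduced from the counts in Exhibit 2. The sketch below does this for four of the courses and also averages the index over a student's courses. Taking the average difficulty as a simple mean is an assumption, as the report does not spell out the formula, and the function names are illustrative.

```python
# (total cases, cases failed) for a few courses, copied from Exhibit 2.
COURSE_COUNTS = {"C3": (2, 1), "C5": (44, 1), "C19": (100, 16), "C24": (3, 1)}


def difficulty_index(total, failed):
    """Share of the students who took a subject and failed it."""
    return failed / total


DI = {course: difficulty_index(total, failed)
      for course, (total, failed) in COURSE_COUNTS.items()}


def average_difficulty(courses_taken):
    """Mean Difficulty Index over a student's courses (assumed definition)."""
    return sum(DI[c] for c in courses_taken) / len(courses_taken)


print(DI["C3"], DI["C19"])                # 0.5 0.16
print(average_difficulty(["C5", "C19"]))  # roughly 0.091
```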

b) The second new variable created is “Rank”. Instead of taking ‘HSPct’ and ‘HSSize’ as separate variables, we explored whether $Rank = (1 - HSPct) \times HSSize$ gave better results. It did not, so we dropped Rank and the 24 interaction variables between Rank and the courses.

Selecting final Independent Variables:

Step 1: The ROC area was computed for each of the 72 interaction variables. (Exhibit 3)

Step 2: Since the interaction variables involving the same course Ci (i = 1 to 24) are correlated with one another, for each course we kept the interaction variable with the highest ROC area.

Step 3: Where the ROC areas of the interaction variables involving Ci are tied, the order of preference has been SAT > HSPcT > HSGPA, since the individual ROC values of these variables are in decreasing order.

Exhibit 3 gives the detailed list of ROC areas and the final variables selected.
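Steps 2 and 3 can be sketched as a small selection rule: for each course, take the interaction variable with the largest ROC area and break ties in the stated order of preference. The ROC areas below are four rows copied from Exhibit 3. The function name is illustrative, and the code shows the rule rather than the authors' actual procedure.

```python
# ROC areas per course in the column order of Exhibit 3: (SAT, HSPcT, HSGPA).
ROC_AREA = {
    "C1": (0.506, 0.511, 0.496),
    "C2": (0.443, 0.443, 0.443),
    "C5": (0.094, 0.094, 0.091),
    "C17": (0.292, 0.294, 0.296),
}

# Order of preference used to break ties (Step 3).
PREFERENCE = ("SAT", "HSPcT", "HSGPA")


def select_variable(course):
    """Pick the interaction variable with the highest ROC area for a course."""
    areas = ROC_AREA[course]
    # tuple.index returns the first position holding the maximum, so a tie
    # is resolved in favour of the variable listed earlier in PREFERENCE.
    winner = PREFERENCE[areas.index(max(areas))]
    return f"{winner}*{course}"


for course in ROC_AREA:
    print(course, select_variable(course))
```

For these four courses the rule returns HSPcT*C1, SAT*C2, SAT*C5 and HSGPA*C17, which matches the "Variable Selected" column of Exhibit 3.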

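OUTPUT:

The final output

$$\ln\frac{P(Y=1)}{P(Y=0)} = 5.603 + 55.205\,\mathrm{ADI} - 0.006\,\mathrm{SATC5} - 0.101\,\mathrm{HSPcTC19}$$

Here ADI is the average difficulty of the subjects taken by the student (Averagedifficulty in Exhibit 5).

The model classified 92% of the training cases correctly (Exhibit 4).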
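To show how the fitted equation turns into a prediction, the sketch below converts the logit into a dropout probability and applies the cut value of 0.6 reported in Exhibit 4. The coefficients are those of the final output. The two student inputs are hypothetical, and treating a probability at or above the cut as a predicted dropout is an assumed convention.

```python
import math


def dropout_probability(adi, sat_c5, hspct_c19):
    """P(Y=1) from the fitted logit; the inputs are the three model variables."""
    z = 5.603 + 55.205 * adi - 0.006 * sat_c5 - 0.101 * hspct_c19
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cut=0.6):
    """Predict dropout (1) when the probability reaches the cut value (assumed rule)."""
    return 1 if probability >= cut else 0


# Hypothetical students: the first took both C5 and C19, the second took neither,
# so both interaction terms are zero for the second student.
took_both = dropout_probability(adi=0.08, sat_c5=1100, hspct_c19=80)
took_neither = dropout_probability(adi=0.08, sat_c5=0, hspct_c19=0)

print(classify(took_both), classify(took_neither))  # 0 1
```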

Result of testing on Validation Set

              Predicted
              0    1
Observed  0   6    0
          1   0    6

Observations:

The probability of dropping out is dependent on the following:

1. ADI: the average difficulty level of all the subjects taken by a student. This is a logical result, since a student may perform poorly in an exam because the subject is difficult and so fail the course. The more such subjects a student takes, the higher the chance that the student’s performance declines and leads to failure in the examination. Beyond failure, there is also a chance that the student cannot handle the pressure and drops out of the course.

2. If a student has taken course C19, his/her probability of dropping out decreases compared to a student who has not taken it, and it decreases further as the student’s percentile in Higher Secondary School rises (the coefficient of HSPcTC19 is negative). This implies that if a person has taken course C19 and had a high percentile in school, his probability of dropping out decreases. One reason could be that course C19 is an extension of a high school course or has a structure similar to a subject studied in high school. With such similarity, the probability of performing well in the subject increases with a student’s percentile in high school (i.e. HSPcT) and hence the dropout probability decreases.

3. C5 is an easy subject as determined by the Difficulty Index. According to the model, for students who have taken C5 the probability of dropping out decreases as the SAT score increases. One reason for this result could be the same as in 2: C5 might be an aptitude-based course, so the probability of performing well increases with the SAT score, which in turn implies that the probability of dropping out decreases.

4. The probability of dropping out or continuing does not depend on the gender of the student.

5. Also, SAT score on its own and participation in sports do not affect the probability of dropping out.

Recommendations:

1. Since the probability of dropping out decreases when a student takes courses C19 and C5, the college should promote these courses among the students to reduce dropout rates. They could be made compulsory courses.

2. Since the probability of dropping out increases with the average difficulty level of the subjects taken by a student, the college should ensure that each student takes a balanced mix of subjects. The average difficulty of the subjects taken in a particular term should be kept at a level that does not raise the student’s probability of dropping out.

MULTINOMIAL LOGISTIC MODEL

Variables:

Dependent Variable: The dependent variable is coded as 0, 1 or 2.

Y   Description
0   The candidate did not drop out
1   The candidate dropped out and had failed in one or more courses
2   The candidate dropped out despite passing all the courses he took

The historical data had the following classification:

Y   Number of candidates
0   44
1   33
2   23

Independent Variables:

Same as used in Binary Logistic Regression.

Selecting final Independent Variables:

Step 1: The ROC area was computed for each of the 72 interaction variables. (Exhibit 3)

Step 2: Since the interaction variables involving the same course Ci (i = 1 to 24) are correlated with one another, for each course we kept the interaction variable with the highest ROC area.

Step 3: Where the ROC areas of the interaction variables involving Ci are tied, the order of preference has been SAT > HSPcT > HSGPA, since the individual ROC values of these variables are in decreasing order.

Exhibit 3 gives the detailed list of ROC areas and the final variables selected.

Step 4: Variables that were not significant at the 0.1 level were removed one by one, until all the variables remaining in the Likelihood Ratio Test were significant.

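OUTPUT:

The final Output-

$$P(Y=2) = \frac{\exp(z_2)}{1+\exp(z_1)+\exp(z_2)}$$

$$P(Y=1) = \frac{\exp(z_1)}{1+\exp(z_1)+\exp(z_2)}$$

$$P(Y=0) = \frac{1}{1+\exp(z_1)+\exp(z_2)}$$

$$\begin{aligned}
z_1 ={}& -1248.233 - 0.971\,\mathrm{SATC5} - 0.258\,\mathrm{SATC18} - 5.545\,\mathrm{HSPcTC1} + 5.594\,\mathrm{HSPcTC11} + 18.644\,\mathrm{HSPcTC14} \\
& - 2.842\,\mathrm{HSPcTC15} + 9294\,\mathrm{ADI} - 171.182\,\mathrm{Gender} - 890.011\,\mathrm{HSGPA} + 2.616\,\mathrm{SAT}
\end{aligned}$$

$$\begin{aligned}
z_2 ={}& -1204.327 - 0.960\,\mathrm{SATC5} - 0.239\,\mathrm{SATC18} + 5.330\,\mathrm{HSPcTC11} - 2.879\,\mathrm{HSPcTC15} \\
& + 8975.397\,\mathrm{ADI} - 173.323\,\mathrm{Gender} - 889.389\,\mathrm{HSGPA} + 2.592\,\mathrm{SAT}
\end{aligned}$$

The model classified 90% of the training cases correctly (Exhibit 8).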
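The three probabilities can be computed from the two linear predictors as in the minimal sketch below. Because the fitted coefficients run into the hundreds and thousands, exp(z) can overflow double precision, so the sketch subtracts the largest predictor before exponentiating. This is a standard numerical precaution added here, not something the report states, and the function name is illustrative.

```python
import math


def multinomial_probabilities(z1, z2):
    """Return (P(Y=0), P(Y=1), P(Y=2)) with Y=0 as the reference category."""
    # The reference category has linear predictor 0. Shifting all three
    # predictors by their maximum leaves the ratios unchanged and keeps
    # every exponent at or below zero, which avoids overflow.
    shift = max(0.0, z1, z2)
    e0 = math.exp(-shift)
    e1 = math.exp(z1 - shift)
    e2 = math.exp(z2 - shift)
    total = e0 + e1 + e2
    return e0 / total, e1 / total, e2 / total


# Hypothetical linear predictors, not values computed from the data set.
p0, p1, p2 = multinomial_probabilities(1.2, -0.4)
print(p0, p1, p2)

# A very large predictor does not overflow with the shifted form.
print(multinomial_probabilities(1000.0, -1000.0))
```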

Results of testing on validation set:

              Predicted
              0    1    2
Observed  0   6    0    0
          1   0    6    0
          2   0    0    0

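Observations and Recommendations:

The marginal effect of any variable $x_i$ on the probability of category $k$ ($k = 1, 2$), where $\beta_{ki}$ is the coefficient of $x_i$ in $z_k$, is

$$\frac{\partial P(Y=k)}{\partial x_i} = \frac{\beta_{ki}\exp(z_k)\left(1+\exp(z_1)+\exp(z_2)\right) - \left(\beta_{1i}\exp(z_1)+\beta_{2i}\exp(z_2)\right)\exp(z_k)}{\left(1+\exp(z_1)+\exp(z_2)\right)^2}$$

Summing this over $k = 1, 2$ gives $\left(\beta_{1i}\exp(z_1)+\beta_{2i}\exp(z_2)\right)/\left(1+\exp(z_1)+\exp(z_2)\right)^2$. From this it can be observed that if the coefficient of $x_i$ is positive in both $z_1$ and $z_2$, the probability of dropping out increases with $x_i$.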
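The quotient-rule expression for the marginal effect can be checked numerically. The sketch below uses the equivalent form dP(Y=k)/dx = P(Y=k) * (b_k - b_1*P(Y=1) - b_2*P(Y=2)), with b_0 = 0 for the reference category, and compares it with a central finite difference. All intercepts, slopes and the evaluation point are hypothetical and are not the fitted model.

```python
import math


def probabilities(z1, z2):
    """P(Y=0), P(Y=1), P(Y=2) for linear predictors z1 and z2."""
    d = 1.0 + math.exp(z1) + math.exp(z2)
    return 1.0 / d, math.exp(z1) / d, math.exp(z2) / d


def marginal_effects(z1, z2, b1, b2):
    """Analytic dP(Y=k)/dx for k = 0, 1, 2; b1 and b2 are the slopes of x in z1 and z2."""
    p0, p1, p2 = probabilities(z1, z2)
    weighted = b1 * p1 + b2 * p2  # probability-weighted average coefficient
    return -p0 * weighted, p1 * (b1 - weighted), p2 * (b2 - weighted)


# Hypothetical intercepts, slopes, evaluation point and step size.
A1, A2, B1, B2, X, H = 0.2, -0.3, 0.5, 0.8, 1.0, 1e-6

analytic = marginal_effects(A1 + B1 * X, A2 + B2 * X, B1, B2)
upper = probabilities(A1 + B1 * (X + H), A2 + B2 * (X + H))
lower = probabilities(A1 + B1 * (X - H), A2 + B2 * (X - H))
numeric = tuple((u - l) / (2 * H) for u, l in zip(upper, lower))

for k in range(3):
    print(k, analytic[k], numeric[k])
```

With both slopes positive the effect on P(Y=0) is negative, and the three effects sum to zero because the probabilities always sum to one.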

1. From Z2, a higher SAT score implies a higher probability of dropping out. One possible reason is that Lovely Business School may not be among the colleges that students aspire to. A student with a higher SAT score is more likely to get admission to another college.

2. Also, the probability of dropping out increases with the Average Difficulty Index of the courses taken by the student. This is a logical conclusion, since the student might not be able to handle the pressure of difficult subjects, leading him to drop out, or might fail the exam, which again increases his probability of dropping out. The coefficient of ADI is larger for Y=1 than for Y=2, which suggests that ADI has a greater impact on students who drop out after failing in more than one subject.

3. C5, C18 and C15 are among the courses with a low Difficulty Index, hence it is not surprising that students who take any of these courses have a lower probability of dropping out; there is also a significant interaction between past performance and these courses.

4. Male students have a lower probability of dropping out than female students, since the coefficient of Gender is negative in both cases. This may point towards a cultural issue in the nation when it comes to male and female education.

Exhibit 1:

Variable         Description
Dropped/Result   1 = dropped, 0 = continued
Gender           1 = Male, 0 = Female
Student ID       Identification Number
Course Year      Year in which the course was taken
Semester         Semester within the year
Result           PASS = Student passed the course; OTHER = Failed and discontinued
HSGPA            GPA in Higher Secondary School
HSPct            Percentile in Graduating Class in Higher Secondary
HSSize           Number of students in HS graduating Class
SAT              Overall SAT Score
Sports           1 = Active in Sports, 0 = Not a sports person
Ci               1 = enrolled in course with course code Ci, 0 = not enrolled in course Ci, i = 1 to 24
CPi              Difficulty index of the subject Ci; 0 <= CPi <= 1
SAT*Ci           Interaction Variable between SAT and Ci
HSPcT*Ci         Interaction Variable between HSPcT and Ci
HSGPA*Ci         Interaction Variable between HSGPA and Ci

Exhibit 2:

Course   Total cases   Cases failed   Probability of failing
C1       107           9              0.08411215
C2       7             0              0
C3       2             1              0.5
C4       74            2              0.027027027
C5       44            1              0.022727273
C6       71            7              0.098591549
C7       80            3              0.0375
C8       1             0              0
C9       106           12             0.113207547
C10      66            3              0.045454545
C11      70            11             0.157142857
C12      107           9              0.08411215
C13      78            3              0.038461538
C14      176           12             0.068181818
C15      80            5              0.0625
C16      97            12             0.12371134
C17      24            1              0.041666667
C18      73            3              0.04109589
C19      100           16             0.16
C20      10            0              0
C21      93            3              0.032258065
C22      96            6              0.0625
C23      103           14             0.13592233
C24      3             1              0.333333333

Exhibit 3: ROC Area

Course   SAT (0.558)   HSPcT (0.54)   HSGPA (0.553)   Variable Selected
C1       0.506         0.511          0.496           HSPcT*C1
C2       0.443         0.443          0.443           SAT*C2
C3       0.474         0.474          0.474           SAT*C3
C4       0.245         0.248          0.247           HSPcT*C4
C5       0.094         0.094          0.091           SAT*C5
C6       0.209         0.228          0.218           HSPcT*C6
C7       0.322         0.349          0.337           HSPcT*C7
C8       0.508         0.508          0.508           SAT*C8
C9       0.508         0.504          0.487           SAT*C9
C10      0.228         0.245          0.239           HSPcT*C10
C11      0.234         0.268          0.253           HSPcT*C11
C12      0.537         0.531          0.516           SAT*C12
C13      0.293         0.319          0.309           HSPcT*C13
C14      0.506         0.515          0.5             HSPcT*C14
C15      0.283         0.308          0.3             HSPcT*C15
C16      0.426         0.433          0.421           HSPcT*C16
C17      0.292         0.294          0.296           HSGPA*C17
C18      0.235         0.235          0.234           SAT*C18
C19      0.451         0.452          0.437           HSPcT*C19
C20      0.395         0.395          0.395           SAT*C20
C21      0.285         0.3            0.295           HSPcT*C21
C22      0.48          0.466          0.457           SAT*C22
C23      0.451         0.452          0.437           HSPcT*C23
C24      0.503         0.503          0.503           SAT*C24

Exhibit 4: Classification Table (a)

                                   Predicted
                                   Dropped/Result     Percentage
Observed                           .0       1.0       Correct
Step 1   Dropped/Result   .0       39       5         88.6
                          1.0      3        53        94.6
         Overall Percentage                           92.0

a. The cut value is .600
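The percentages in the classification table follow directly from the counts. The sketch below recomputes them for Exhibit 4; the same function applies to the three-class table in Exhibit 8. The function name is illustrative.

```python
def percent_correct(matrix):
    """Per-row and overall percentage correct for a square confusion matrix.

    Rows are observed classes and columns are predicted classes.
    A row with no cases yields NaN instead of a division error.
    """
    per_row = [
        100.0 * row[i] / sum(row) if sum(row) else float("nan")
        for i, row in enumerate(matrix)
    ]
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return per_row, 100.0 * correct / total


EXHIBIT_4 = [[39, 5], [3, 53]]  # counts copied from Exhibit 4
per_row, overall = percent_correct(EXHIBIT_4)
print([round(p, 1) for p in per_row], round(overall, 1))  # [88.6, 94.6] 92.0
```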

Exhibit 5: Variables in the Equation

                                 B        S.E.     Wald     df   Sig.   Exp(B)
Step 1 (a)   Averagedifficulty   55.205   26.299   4.406    1    .036   944476893839218000000000.000
             SATC5               -.006    .002     11.580   1    .001   .994
             HSPcTC19            -.101    .035     8.602    1    .003   .904
             Constant            5.603    3.790    2.185    1    .139   271.150

a. Variable(s) entered on step 1: Averagedifficulty, SATC5, HSPcTC19.

Exhibit 6: Likelihood Ratio Tests

                    Model Fitting Criteria               Likelihood Ratio Tests
Effect              -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept           58.827                               21.880       2    .000
SATC3               54.696                               17.749       2    .000
SATC5               94.745                               57.797       2    .000
SATC18              76.023                               39.076       2    .000
SATC22              54.765a                              17.818       2    .000
HSPcTC1             71.977                               35.029       2    .000
HSPcTC11            71.316                               34.369       2    .000
HSPcTC14            60.626                               23.679       2    .000
HSPcTC15            51.011                               14.064       2    .001
Averagedifficulty   60.115                               23.168       2    .000
Gender              51.829a                              14.882       2    .001
HSGPA               48.209                               11.262       2    .004
SAT                 60.633a                              23.685       2    .000

Exhibit 7:

Y     Variable            B           Std. Error   Wald        df   Sig.
Y=2   Intercept           -1248.23    407.134      9.4         1    0.002
      SATC3               -0.473      2.325        0.041       1    0.839
      SATC5               -0.971      0.335        8.386       1    0.004
      SATC18              -0.258      0.06         18.675      1    0
      SATC22              -0.246      0.284        0.753       1    0.386
      HSPcTC1             -5.545      3.354        2.733       1    0.098
      HSPcTC11            5.594       0.904        38.286      1    0
      HSPcTC14            18.644      4.656        16.032      1    0
      HSPcTC15            -2.842      1.093        6.765       1    0.009
      Averagedifficulty   9294.003    2667.809     12.137      1    0
      Gender              -171.182    73.416       5.437       1    0.02
      HSGPA               -890.011    94.813       88.116      1    0
      SAT                 2.616       0.01         69568.061   1    0
Y=1   Intercept           -1204.33    406.612      8.773       1    0.003
      SATC3               -0.474      2.315        0.042       1    0.838
      SATC5               -0.96       0.36         7.116       1    0.008
      SATC18              -0.239      0.059        16.279      1    0
      SATC22              -0.237      0.284        0.699       1    0.403
      HSPcTC1             23.759      15.423       2.373       1    0.123
      HSPcTC11            5.33        0.897        35.311      1    0
      HSPcTC14            -10.587     15.011       0.497       1    0.481
      HSPcTC15            -2.879      1.093        6.946       1    0.008
      Averagedifficulty   8975.397    2662.484     11.364      1    0.001
      Gender              -173.323    73.415       5.574       1    0.018
      HSGPA               -889.389    94.822       87.975      1    0
      SAT                 2.592       0            .           1    .

Exhibit 8: Classification

                     Predicted
Observed             0       1       2       Percent Correct
0                    44      0       0       100.0%
1                    0       27      6       81.8%
2                    0       4       19      82.6%
Overall Percentage   44.0%   31.0%   25.0%   90.0%

Exhibit 9: Excel Sheet and SPSS Output
