Non-Traditional Student Enrollment Analytics

N University: Office of Strategic Research and Analytics

OBJECTIVE

To understand how variables related to students accepted into N University’s programs affect their likelihood of actually enrolling in these programs.

DATA

About: We have data on 1268 subjects. Each subject is a student who has been admitted to one of four program types at N University: Bachelors Degree, Fast-Track 18-Month BS Degree Completion Blended Classes, Fast-Track 18-Month BS Degree Completion Online Classes, and Undergraduate Certificate. The available variables range from demographic information, to the student’s major, to the time elapsed between the application submission date and the decision date.

Variables of Interest: To ensure that our analysis is applicable to future students, we replaced factor variables that had a large number of very specific levels with factor variables that had fewer, more general levels. For example, the variable Country presented an issue because its observed levels did not cover every country a future student could come from. By creating a new variable coded “Domestic” for students from the United States and “International” otherwise, we retained the key information in the Country variable and ensured that every future student falls into one of the levels. Factor variables whose levels had no clear grouping, such as High School, were omitted. Variables that took the same value for every subject, such as Level (all students were listed as undergraduate), were also omitted.
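The recoding of Country can be sketched as follows. This is only an illustration: the function name and the set of spellings treated as domestic are assumptions, since the report does not list the raw labels in the data.

```python
def domestic_or_international(country: str) -> str:
    """Collapse a raw Country value into a two-level Domestic/International factor."""
    # Assumption: these spellings of the United States are illustrative only;
    # the actual labels in the university's data are not given in the report.
    domestic_labels = {"united states", "usa", "us"}
    if country.strip().lower() in domestic_labels:
        return "Domestic"
    # Any other country, including one never seen before, gets a valid level.
    return "International"


countries = ["United States", "Brazil", "usa", "A country not in the original data"]
levels = [domestic_or_international(c) for c in countries]
print(levels)
```

Because every value that is not recognized as domestic falls into “International”, a future applicant from a previously unseen country still maps to an existing level.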

Missing Values: Missing values of continuous variables can be filled in with methods such as imputing the column mean, but filling in categorical variables is not as simple. Applicants with a missing value for any of the categorical variables were therefore excluded from the analysis. After omitting these subjects, 931 subjects remained.
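A minimal sketch of this filtering step, assuming the data are held as a list of dictionaries. The column subset and the three example applicants are made up for illustration and are not taken from the report's data.

```python
# Hypothetical subset of the categorical columns; the full list is not given.
CATEGORICAL_COLUMNS = ["Dom_Intern", "Prog_type", "Gender"]


def complete_cases(rows, columns):
    """Keep only the rows that have a non-missing value in every listed column."""
    def present(value):
        return value is not None and str(value).strip() != ""

    return [row for row in rows if all(present(row.get(col)) for col in columns)]


# Made-up applicants: the second is missing Gender, the third lacks Prog_type.
applicants = [
    {"Dom_Intern": "Domestic", "Prog_type": "Bachelors Degree", "Gender": "Female"},
    {"Dom_Intern": "International", "Prog_type": "Undergraduate Certificate", "Gender": None},
    {"Dom_Intern": "Domestic", "Gender": "Male"},
]
kept = complete_cases(applicants, CATEGORICAL_COLUMNS)
print(len(kept), "of", len(applicants), "applicants kept")
```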

MODELING PROCEDURE

The response variable was Enrolled, coded 1 if the variable “Registered for Class” was “Yes” and 0 if it was “No.” A binary response of this kind follows a Bernoulli (binomial) distribution, so we fit a binomial generalized linear model with a logit link, that is, a logistic regression.

We split the 931 subjects into a test set of 121 randomly selected subjects and a training set of the remaining 810 subjects, on which we ran our analysis.
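The split described above can be sketched like this. The seed and the use of integer subject IDs are assumptions made for reproducibility of the example; the report does not say how the random selection was done.

```python
import random


def train_test_split(ids, test_size, seed):
    """Randomly hold out `test_size` ids as a test set; the rest form the training set."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    test = set(rng.sample(ids, test_size))
    train = [i for i in ids if i not in test]
    return train, sorted(test)


subject_ids = list(range(931))  # stand-ins for the 931 complete cases
train_ids, test_ids = train_test_split(subject_ids, test_size=121, seed=2024)
print(len(train_ids), len(test_ids))  # 810 121
```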

We started with the model that contained only an intercept and no predictor variables. From there we implemented a forward stepwise model selection procedure, using AIC as the criterion for deciding which variables to keep in the model.

The final model chosen by the procedure is presented in the following section.
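The selection loop can be sketched generically. This is not the report's code (which is not shown); `forward_stepwise` and `toy_aic` are hypothetical names, and the toy AIC uses invented log-likelihood gains in place of actually refitting a logistic regression at each step.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_stepwise(candidates: Iterable[str],
                     aic: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Greedy forward selection: add the variable that lowers AIC most, stop when none does."""
    selected: List[str] = []
    remaining = list(candidates)
    best_aic = aic(frozenset())  # intercept-only model
    while remaining:
        scored = [(aic(frozenset(selected + [var])), var) for var in remaining]
        trial_aic, trial_var = min(scored)
        if trial_aic >= best_aic:  # no candidate improves the model
            break
        selected.append(trial_var)
        remaining.remove(trial_var)
        best_aic = trial_aic
    return selected


# Toy stand-in for model fitting. Assumption: each variable adds one parameter
# and a fixed, made-up gain in log-likelihood.
GAINS = {"Dom_Intern": 40.0, "Age": 5.0, "Noise": 0.3}


def toy_aic(variables: FrozenSet[str]) -> float:
    log_likelihood = -550.0 + sum(GAINS[v] for v in variables)
    n_parameters = 1 + len(variables)  # intercept plus one per variable
    return 2 * n_parameters - 2 * log_likelihood


print(forward_stepwise(GAINS, toy_aic))
```

In the toy example a variable is kept only if its gain in log-likelihood exceeds 1, because each added parameter costs 2 AIC points; that is why “Noise” is left out.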

THE MODEL

R output: [coefficient table not reproduced in this copy]

In terms of log(odds):

$$
\begin{aligned}
\log(\text{odds of enrolling}) ={} & 1.301818 \\
& - 1.733925\,[\text{Dom\_InternInternational}] \\
& + 1.478557\,[\text{Prog\_typeFast-Track 18-Month BS Degree Completion Blended Classes}] \\
& + 0.654839\,[\text{Prog\_typeFast-Track 18-Month BS Degree Completion Online Classes}] \\
& - 0.249308\,[\text{Prog\_typeUndergraduate Certificate}] \\
& - 0.671543\,[\text{Campus\_PreferenceNo Preference}] \\
& - 0.536188\,[\text{Campus\_PreferenceOnline}] \\
& + 0.878252\,[\text{Veteran\_YNY}] \\
& - 0.017988\,[\text{Age}] \\
& + 0.425373\,[\text{Hispanic\_Latino\_YNY}] \\
& - 0.251832\,[\text{GenderMale}]
\end{aligned}
$$

In terms of odds:

$$
\begin{aligned}
\text{odds of enrolling} ={} & e^{1.301818} \\
& \times e^{-1.733925\,[\text{Dom\_InternInternational}]} \\
& \times e^{1.478557\,[\text{Prog\_typeFast-Track 18-Month BS Degree Completion Blended Classes}]} \\
& \times e^{0.654839\,[\text{Prog\_typeFast-Track 18-Month BS Degree Completion Online Classes}]} \\
& \times e^{-0.249308\,[\text{Prog\_typeUndergraduate Certificate}]} \\
& \times e^{-0.671543\,[\text{Campus\_PreferenceNo Preference}]} \\
& \times e^{-0.536188\,[\text{Campus\_PreferenceOnline}]} \\
& \times e^{0.878252\,[\text{Veteran\_YNY}]} \\
& \times e^{-0.017988\,[\text{Age}]} \\
& \times e^{0.425373\,[\text{Hispanic\_Latino\_YNY}]} \\
& \times e^{-0.251832\,[\text{GenderMale}]}
\end{aligned}
$$

We will interpret each variable’s coefficient in terms of its effect on the odds of a student enrolling in their intended program, as that is easier for a general audience to understand. However, we have included the model in terms of log(odds) above as well. The statistical significance of each coefficient at specific significance levels is included so that the reader may designate their own significance level and draw their own conclusions. Each coefficient is interpreted holding all other variables constant.
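To make the fitted model concrete, the sketch below evaluates it for one hypothetical applicant and converts the log(odds) to a probability with the inverse logit, p = 1 / (1 + e^(-log odds)). The coefficients are the ones reported above; the function names and the example student (a 30-year-old at every reference level) are assumptions for illustration.

```python
import math

# Coefficients as reported for the fitted model.
COEFFICIENTS = {
    "(Intercept)": 1.301818,
    "Dom_InternInternational": -1.733925,
    "Prog_typeFast-Track 18-Month BS Degree Completion Blended Classes": 1.478557,
    "Prog_typeFast-Track 18-Month BS Degree Completion Online Classes": 0.654839,
    "Prog_typeUndergraduate Certificate": -0.249308,
    "Campus_PreferenceNo Preference": -0.671543,
    "Campus_PreferenceOnline": -0.536188,
    "Veteran_YNY": 0.878252,
    "Age": -0.017988,
    "Hispanic_Latino_YNY": 0.425373,
    "GenderMale": -0.251832,
}


def log_odds(features):
    """Intercept plus coefficient * value for every supplied feature (omitted ones count as 0)."""
    total = COEFFICIENTS["(Intercept)"]
    for name, value in features.items():
        total += COEFFICIENTS[name] * value
    return total


def probability(features):
    """Inverse logit of the log(odds)."""
    return 1.0 / (1.0 + math.exp(-log_odds(features)))


# Hypothetical applicant: 30 years old, all indicator variables at 0
# (domestic, Bachelors Degree, Boston Main Campus, non-veteran,
# non-Hispanic/Latino, female).
student = {"Age": 30}
print(round(log_odds(student), 4), round(probability(student), 3))
```

For this example the log(odds) is 1.301818 - 0.017988 × 30 ≈ 0.762, which corresponds to a predicted probability of enrolling of about 0.68.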

The coefficient for the Intercept implies that the odds of enrolling for a female student who is domestic, was admitted to the Bachelors Degree program, prefers to complete their program on the Boston Main Campus, is not a veteran, is not Hispanic/Latino, and is zero years of age are $e^{1.301818} = 3.675974$. This coefficient has no practical meaning on its own, because it does not make sense for a person who is zero years of age to apply to the university.

Dom_Intern is a factor variable with two levels, indicating whether a student is domestic or international.

$$
\text{Dom\_InternInternational} =
\begin{cases}
1, & \text{student is international} \\
0, & \text{student is domestic}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of an international student enrolling in their intended program are lower by a factor of $e^{-1.733925} = 0.1765899$ than those of a domestic student, a reduction of about 82%. At any common significance level, there is a statistically significant difference between the odds of an international student and a domestic student enrolling in their intended program.

Prog_type is a factor variable with four levels, indicating what type of program the student was admitted into: Bachelors Degree, Fast-Track 18-Month BS Degree Completion Blended Classes, Fast-Track 18-Month BS Degree Completion Online Classes, and Undergraduate Certificate.

$$
\text{Prog\_typeFast-Track 18-Month BS Degree Completion Blended Classes} =
\begin{cases}
1, & \text{student's program type is Fast-Track 18-Month BS Degree Completion Blended} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a student in the Fast-Track 18-Month BS Degree Completion Blended program enrolling are higher by a factor of $e^{1.478557} = 4.386611$ than those of a student in the Bachelors Degree program. At a significance level of .001, this difference in odds of enrolling between students in the Bachelors Degree program and students in the Fast-Track 18-Month BS Degree Completion Blended program is statistically significant.

$$
\text{Prog\_typeFast-Track 18-Month BS Degree Completion Online Classes} =
\begin{cases}
1, & \text{student's program type is Fast-Track 18-Month BS Degree Completion Online} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a student in the Fast-Track 18-Month BS Degree Completion Online program enrolling are higher by a factor of $e^{0.654839} = 1.924833$ than those of a student in the Bachelors Degree program. At a significance level of .001, this difference in odds of enrolling between students in the Bachelors Degree program and students in the Fast-Track 18-Month BS Degree Completion Online program is statistically significant.

$$
\text{Prog\_typeUndergraduate Certificate} =
\begin{cases}
1, & \text{student's program type is Undergraduate Certificate} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a student admitted to the Undergraduate Certificate program enrolling are lower by a factor of $e^{-0.249308} = 0.7793399$ than those of a Bachelors Degree student. This difference in odds of enrolling between students in the Undergraduate Certificate program and students in the Bachelors Degree program is not statistically significant.

Campus_Preference is a factor variable with three levels, indicating whether a student prefers to complete their program on the Boston Main Campus, prefers to complete it online, or has no preference.

$$
\text{Campus\_PreferenceNo Preference} =
\begin{cases}
1, & \text{student has no campus preference} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a student with no campus preference enrolling in their program are lower by a factor of $e^{-0.671543} = 0.5109196$ than those of a student whose campus preference is the Boston Main Campus. The difference in odds of enrolling between students who prefer the Boston Main Campus and students who have no preference is not statistically significant.

$$
\text{Campus\_PreferenceOnline} =
\begin{cases}
1, & \text{student's campus preference is online} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a student who prefers to complete their program online enrolling in classes are lower by a factor of $e^{-0.536188} = 0.5849739$ than those of a student whose campus preference is the Boston Main Campus. At a significance level of .001, this difference in odds of enrolling between students who prefer to complete their program online and students who prefer the Boston Main Campus is statistically significant.

Veteran_YN is a two-level factor variable indicating the veteran status of a student.

$$
\text{Veteran\_YNY} =
\begin{cases}
1, & \text{student is a veteran} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a student who is a veteran enrolling in their program are higher by a factor of $e^{0.878252} = 2.406689$ than those of a student who is not a veteran. At a significance level of .01, the difference in odds of enrolling between students who are and are not veterans is statistically significant.

Age is a continuous variable indicating the age of the student in years. The ages of the students in our training data set ranged from 17 to 64. The coefficient for this variable implies that, holding all other variables constant, for every one-year increase in the age of the student, the odds of that student enrolling in their intended program decrease by a factor of $e^{-0.017988} = 0.9821728$. At a significance level of .01, this change in odds of enrolling for each additional year of age is statistically significant.
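Because the model is multiplicative on the odds scale, the per-year factor compounds over larger age gaps. As a worked illustration (the 10-year gap is chosen only as an example, and $a$ stands for any starting age within the observed range), holding all other variables constant:

$$
\frac{\text{odds of enrolling at age } a + 10}{\text{odds of enrolling at age } a}
= \left(e^{-0.017988}\right)^{10}
= e^{-0.17988}
\approx 0.835
$$

That is, a student ten years older than an otherwise identical student has odds of enrolling that are about 16% lower.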

Hispanic_Latino_YN is a two-level factor variable indicating whether or not the student is Hispanic/Latino.

$$
\text{Hispanic\_Latino\_YNY} =
\begin{cases}
1, & \text{student is Hispanic/Latino} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a Hispanic/Latino student enrolling in their intended program are higher by a factor of $e^{0.425373} = 1.530161$ than those of a non-Hispanic/Latino student. However, this difference in odds of enrolling between Hispanic/Latino and non-Hispanic/Latino students is statistically significant only at the .05 level.

Gender is a two-level factor variable indicating the gender of the student, either male or female.

$$
\text{GenderMale} =
\begin{cases}
1, & \text{student is male} \\
0, & \text{otherwise}
\end{cases}
$$

The coefficient for this variable implies that, holding all other variables constant, the odds of a male student enrolling in his intended program are lower by a factor of $e^{-0.251832} = 0.7773753$ than those of a female student. However, the difference in odds of enrolling between females and males is not statistically significant.
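The factors quoted in the interpretations above are simply the exponentiated coefficients. The short sketch below recomputes them and also expresses each as a percentage change in the odds. The coefficient values come from the model; the shortened labels are only for display.

```python
import math

# Reported coefficients, with shortened labels for display.
coefficients = {
    "International": -1.733925,
    "Fast-Track Blended": 1.478557,
    "Fast-Track Online": 0.654839,
    "Undergraduate Certificate": -0.249308,
    "Campus: No Preference": -0.671543,
    "Campus: Online": -0.536188,
    "Veteran": 0.878252,
    "Age (per year)": -0.017988,
    "Hispanic/Latino": 0.425373,
    "Male": -0.251832,
}

# Odds ratio = e^coefficient; values below 1 mean lower odds than the reference level.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<26} {ratio:.4f} {(ratio - 1) * 100:+.1f}%")
```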

EVALUATION OF THE MODEL

To evaluate the accuracy of the model, we performed a binary classification on the test set. The decision rule used to classify the predicted enrollment status of each student, as a function of the predicted probability of enrolling from our model, was the following:

$$
\text{Predicted Enrollment Status} =
\begin{cases}
0 \ (\text{Did Not Enroll}), & \text{Probability(enrolling)} < .5 \\
1 \ (\text{Enrolled}), & \text{Probability(enrolling)} \geq .5
\end{cases}
$$

Comparing the actual enrollment status of each student in the test set to the predicted enrollment status based on our model, the accuracy rate was 67%.
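The decision rule and the accuracy calculation can be sketched as below. The six outcomes and probabilities are invented purely to exercise the code; they are not the report's test set, and the function names are hypothetical.

```python
def classify(prob, threshold=0.5):
    """Apply the decision rule: predict 1 (Enrolled) when prob >= threshold, else 0."""
    return 1 if prob >= threshold else 0


def accuracy(actual, predicted_probs, threshold=0.5):
    """Share of subjects whose predicted status matches the actual status."""
    predicted = [classify(p, threshold) for p in predicted_probs]
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up example data (not from the report).
actual = [1, 0, 1, 1, 0, 0]
probs = [0.81, 0.35, 0.50, 0.42, 0.66, 0.10]
print(f"accuracy = {accuracy(actual, probs):.2f}")
```

Note that a probability of exactly .5 is classified as Enrolled, matching the "greater than or equal to" branch of the rule.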

DISCUSSION

In this section, we will qualitatively explain the results of our model and offer suggestions based on our conclusions. Our chosen significance level is .001, and thus we only discuss variables that had a significant effect on likelihood of enrollment at that level.

International students are less likely to enroll in their intended program than domestic students. This could be due to a multitude of reasons. Two possible reasons are that 1) it is more costly for international students to go to school in the United States, or 2) international students have a wider range of schools to choose from once those in their own country are included. It would take much more convincing to get an international student to make the effort to move to the United States and participate in N University’s program than it would a student who only has to travel from another US state. This difference in likelihood of enrollment between international and domestic students could be reduced if the University improves its international student services and/or improves the marketing of its existing programs, making clear how attending the university would contribute to the future success of international students.

Students admitted to the Fast-Track programs are much more likely to enroll than students admitted to the Bachelors Degree program. This could be because there are more traditional Bachelors Degree programs for students to choose from, creating more competition for the university. It also suggests that the flexibility and efficiency that the university offers through the Fast-Track programs are attractive to students, and that these programs tend to pull in a higher proportion of the admitted applicants. Thus it would be advisable to invest more in marketing these programs or to expand them.

Students who prefer to complete their program online are less likely to enroll in their intended program than students who prefer the Boston Main Campus: the model estimates that their odds of enrolling are lower by a factor of 0.585, holding all other variables constant. One possible explanation is that many schools offer online programs, so these students have more options to choose from, which poses more competition for the University. Because the University already offers online study as an option, investing in outreach to admitted students who prefer online classes may help close this gap.

At our chosen significance level, the difference in likelihood of enrolling is not statistically significant for the other factors in our model.