Trend of Saudi Arabia Students Taking Higher Education Abroad A THESIS SUBMITTED TO THE GRADUATE EDUCATIONAL COUNCIL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS For the degree MASTER OF SCIENCE By Majed Saeed Alghamdi Advisor Dr. Rahmatullah Imon Ball State University Muncie, Indiana May 2016


Trend of Saudi Arabia Students Taking

Higher Education Abroad

A THESIS

SUBMITTED TO THE GRADUATE EDUCATIONAL COUNCIL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

For the degree

MASTER OF SCIENCE

By

Majed Saeed Alghamdi

Advisor Dr. Rahmatullah Imon

Ball State University

Muncie, Indiana

May 2016


Trend of Saudi Arabia Students Taking Higher Education Abroad

A THESIS

SUBMITTED TO THE GRADUATE EDUCATIONAL COUNCIL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE

MASTER OF SCIENCE

By

Majed Saeed Alghamdi

Committee Approval:

…………………………………………………………………………………………….

Committee Chairman Date

……………………………………………………………………………………………

Committee Member Date

…………………………………………………………………………………………….

Committee Member Date

Department Head Approval:

……………………………………………………………………………………………

Head of Department Date

Graduate office Check:

……………………………………………………………………………………………

Dean of Graduate School

Date

Ball State University

Muncie, Indiana

May, 2016


ACKNOWLEDGEMENTS

I would like to express my special appreciation and thanks to my advisor, Professor Dr. Rahmatullah Imon, who has been a tremendous mentor to me, for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout my analysis and the writing of this report. I could not have imagined having a better advisor and mentor for my thesis. I would also like to thank my committee members, Professor Dr. Munni Begum and Dr. Yayuan Xiao, for their encouragement, insightful comments and patience. I am thankful to all my classmates for their kind support. Last but not least, I would like to thank my family: my parents, my brothers and sisters, for supporting me throughout my life.

Majed Alghamdi

May 7, 2016


ABSTRACT

In this study our prime objective was to investigate the trend of Saudi Arabia students who are studying abroad for higher education. We find that student enrolment has grown almost exponentially over the years. The most popular programs are Engineering and Medical Science and the least popular programs are Agriculture and Fine Arts. We also find evidence of gender discrimination against women among the Saudi Arabia students studying abroad. To determine which factors influence the number of students studying abroad we use regression analysis and find that the budget in higher education and the oil price are the most important variables in explaining student enrolment. Both the regression and the cross validation study reveal that the robust reweighted least squares (RLS) method fits the data better than the other models and yields better forecasts.


Table of Contents

CHAPTER 1
INTRODUCTION
1.1 Objective of the Study
1.2 Sources of Data
1.3 Methodology

CHAPTER 2
Trend of Saudi Arabia Students Studying abroad
2.1 Trend Analysis
2.2 Trend Analysis of Nine Major Programs
2.3 Trend Analysis of Some Other Relevant Variables
2.4 Summary Results of Trend Analysis

CHAPTER 3
Comparison between Genders and Different Programs
3.1 Comparison between Genders
3.2 Tests for the Equality of Means between Male and Female Students
3.3 Comparison of the Individual Treatment Means
3.4 Result Summary

CHAPTER 4
Modeling and Fitting of Data Using Regression Diagnostics and Robust Regression
4.1 Classical Regression Analysis
4.2 Regression Diagnostics
4.3 Robust Regression
4.4 Regression Results
4.5 Results Comparisons

CHAPTER 5
Cross Validation of Forecasts
5.1 Evaluation of Forecasts by Cross Validation
5.2 Cross Validation Results

CHAPTER 6
Conclusions and Areas of Further Research
6.1 Conclusions
6.2 Areas of Further Research

References
APPENDIX A
APPENDIX B

List of Tables

Chapter 2
Table 2.1: Trend Summary of the Total Number of Students
Table 2.2: Trend Summary of the Total Number of Social Science Students
Table 2.3: Trend Summary of the Total Number of Natural Science Students
Table 2.4: Trend Summary of the Total Number of Medical Science Students
Table 2.5: Trend Summary of the Total Number of Law Students
Table 2.6: Trend Summary of the Total Number of Humanities Students
Table 2.7: Trend Summary of the Total Number of Fine Arts
Table 2.8: Trend Summary of the Total Number of Engineering Students
Table 2.9: Trend Summary of the Total Number of Education Students
Table 2.10: Trend Summary of the Total Number of Agriculture Students
Table 2.11: Trend Summary of Oil Revenue
Table 2.12: Trend Summary of Budget in Higher Education
Table 2.13: Trend Summary of Oil Price
Table 2.14: Trend Summary

Chapter 3
Table 3.1: Summary Test Results for the Equality of Means between Male and Female Students
Table 3.2: Average Number of Students in Different Programs
Table 3.3: ANOVA Table for the Equality of Mean Test of Nine Programs

Chapter 4
Table 4.1: Regression Results Summary

Chapter 5
Table 5.1: Original and Forecasted Values for 2011-2014
Table 5.2: Cross Validation Result Summary

List of Figures

Chapter 2
Figure 2.1: Time Series Plot of the Total Number of Students
Figure 2.2: Trend Analysis of the Total Number of Students
Figure 2.3: Time Series Plot of Total Number of Students in Different Programs
Figure 2.4: Time Series Plot of Total Number of Students (in ln) in Different Programs
Figure 2.5: Trend Analysis Plot of the Total Number of Social Science Students
Figure 2.6: Trend Analysis Plot of the Total Number of Students for Natural Science
Figure 2.7: Trend Analysis Plot of the Total Number of Students for Medical Science
Figure 2.8: Trend Analysis Plot of the Total Number of Students for Law
Figure 2.9: Trend Analysis Plot of the Total Number of Students for Humanities
Figure 2.10: Trend Analysis Plot of the Total Number of Students for Fine Arts
Figure 2.11: Trend Analysis Plot of the Total Number of Students for Engineering
Figure 2.12: Trend Analysis Plot of the Total Number of Students for Education
Figure 2.13: Trend Analysis Plot of the Total Number of Students for Agriculture
Figure 2.14: Time Series Plot of the Budget in Higher Education
Figure 2.15: Time Series Plot of Oil Price
Figure 2.16: Time Series Plot of Oil Revenue
Figure 2.17: Trend Analysis of Oil Revenue
Figure 2.18: Trend Analysis of Budget in Higher Education
Figure 2.19: Trend Analysis of Oil Price

Chapter 3
Figure 3.1: Time Series Plot of Male and Female Students in Social Science
Figure 3.2: Time Series Plot of Male and Female Students in Natural Science
Figure 3.3: Time Series Plot of Male and Female Students in Medical Science
Figure 3.4: Time Series Plot of Male and Female Students in Law
Figure 3.5: Time Series Plot of Male and Female Students in Humanities
Figure 3.6: Time Series Plot of Male and Female Students in Engineering
Figure 3.7: Time Series Plot of Male and Female Students in Education
Figure 3.8: Time Series Plot of Male and Female Students in Fine Arts
Figure 3.9: Time Series Plot of Male and Female Students in Agriculture
Figure 3.10: Box Plot of Number of Students in Different Programs

Chapter 4
Figure 4.1: Scatter Plot of the Total Number of Students vs Budget in Higher Education
Figure 4.2: Scatter Plot of the Total Number of Students vs Oil Price
Figure 4.3: RLS and OLS Fit of the Total Number of Students vs Oil Price
Figure 4.4: Scatter Plot of the Total Number of Students vs Oil Revenue
Figure 4.5: Normal Probability Plot of the Residuals for Model A
Figure 4.6: Normal Probability Plot of the Residuals for Model B
Figure 4.7: Normal Probability Plot of the Residuals for Model C

Chapter 5
Figure 5.1: Scatterplot of RLS, OLS, Exponential Forecasts vs Original Values


CHAPTER 1

INTRODUCTION

As early as the reign of King Abdulaziz, the founding king of Saudi Arabia, students were being

sponsored to study abroad. Early programs were limited to Arab countries such as Egypt and

Lebanon to study Arabic and Islamic studies. The number of Saudi Arabian students studying

abroad has increased dramatically during the past decade. This explosive growth can be

attributed to an educational agreement brokered between former U.S. president George W. Bush and

Saudi King Abdullah bin Abdulaziz Al Saud in 2005. The agreement opened the doors for Saudi

students to pursue their higher educational degrees in the U.S. with their government paying all

of their educational expenses. As a result, over 100,000 Saudi students were enrolled in American

colleges and universities in 2013-14, making Saudi Arabia the fourth largest sponsor of

international students to the U.S.

Saudi enrollments overseas have been growing exponentially since the 2005 introduction of

the King Abdullah bin Abdulaziz Scholarship Program (KASP). In 2012, the KASP was extended

with the aim of helping a further 50,000 Saudis graduate from the world’s top 500 universities by

2020. According to data from the Institute for International Education, in the 2012/13 academic

year there were a total of 44,586 tertiary-level Saudi students in the United States, an almost 100

percent increase from 2010/11 and a 12-fold increase from 2005.

The most recent data from the Student and Exchange Visitor Program’s SEVIS database show that

there were a total of 70,366 active nonimmigrant Saudi students (including dependents) in the


United States in July 2014 on F, J or M visas. This compares to 61,944 at the same time in

2013. Saudi government data peg the 2013/14 number of Saudi students and dependents in the United States at a significantly larger 106,858. Of those, 89,423 were reported to be on government

scholarships. The same data show that there were 20,252 students in the United Kingdom, 18,926

in Canada, and 13,002 in Australia, with just under 200,000 total Saudi students at institutions

abroad (75% male) across the world.

By level of study, 120,000 students are at the undergraduate level, 47,500 at the master’s level and

10,400 at the doctoral level. The KASP will continue to prioritize fields designated as important

to progressing the Saudi “knowledge economy,” such as medicine, engineering and science.

Approximately 70 percent of scholarship students currently study in subjects related to Business

Administration, Engineering, Information Technology and Medicine. The top fields of study for

Saudi students in the United States last year were: Intensive English (27.2%), Engineering

(21.1%), Business/Management (17.1%), Math and Computer Science (7.4%), and Health

Professions (5.6%).

The Saudi government is projected to invest over 10% of its annual budget in higher education for the foreseeable future. Currently it invests nearly $2.4 billion in the KASP initiative annually, which includes academic funding as well as living expenses for over 100,000 students enrolled in graduate and undergraduate programs in the U.S. If the Saudi government continues to support KASP at the current level, it will soon surpass South Korea in the number of students sent abroad to study.


1.1 Objective of the Study

In this study our prime objective is to investigate the trend of Saudi Arabia students who are studying abroad for higher education. We investigate both the overall trend and the trends of individual programs, and we examine whether there is any special preference for any particular program. Another point of interest is whether there is any gender discrimination among the students. We also want to identify the factors that most influence the number of students studying abroad. We employ regression analysis for this purpose and check the validity of the model with recent diagnostics. If the conventionally used least squares method fails, we either use robust regression or choose some other model. To confirm which method fits the data best we apply cross validation.

1.2 Sources of Data

The most important data I need for my study are the numbers of Saudi Arabia students studying abroad for higher education. This data set is taken from the official website of the Ministry of Higher Education of Saudi Arabia, as given below.

https://www.mohe.gov.sa/ar/Ministry/Deputy-Ministry-for-Planning-and-Information-affairs/HESC/Ehsaat/Pages/default.aspx

We have data for both male and female students in nine programs from 1981-2014. The nine

programs are Social Science, Natural Science, Medical Science, Law, Humanities, Fine Arts,

Engineering, Education, and Agriculture.

We believe that Budget in Higher Education is a key factor in understanding the number of Saudi Arabia students studying abroad. The Budget in Higher Education data set from 1981 to 2014 is taken from the official website of the Ministry of Finance of Saudi Arabia. Here is the link to the data:

https://www.mof.gov.sa/english/DownloadsCenter/Pages/Budget.aspx

We know Saudi Arabia relies heavily on oil. We feel Oil Revenue and Oil Price could be very important variables for our study. We collect these data for 1981-2014 from the official website of the Saudi Arabian Monetary Agency (SAMA). Here is the link to the data:

http://www.sama.gov.sa/en-US/EconomicReports/Pages/YearlyStatistics.aspx

All these data are presented in Appendix A of my thesis.

1.3 Methodology

In this study we have employed a number of modern and sophisticated statistical techniques. We have used linear, quadratic and exponential trend models to investigate both the overall trend and the trends of individual programs. We have used experimental design techniques, together with Fisher’s LSD and Tukey’s tests, to see whether there is any special preference for any particular program and to investigate whether there is any gender discrimination among the students. To find the factors that most influence the number of students studying abroad, we employ regression analysis with recent diagnostics like Jarque-Bera and Rescaled Moments for normality and the robust reweighted least squares (RLS) technique. Finally we employ a cross validation study based on the mean squared percentage error (MSPE) to confirm which method fits the data best.


CHAPTER 2

Trend of Saudi Arabia Students Studying abroad

In this chapter we introduce the different time series models that we are going to use in our study, with their estimation procedures and properties. An excellent review of different aspects of time series models and their estimation is available in Pindyck and Rubinfeld (1998), Bowerman et al. (2005), and Montgomery et al. (2008). A time series is a chronological sequence of observations on a particular variable. A time series model accounts for patterns in the past movement of a variable and uses that information to predict its future movements, i.e., it is a sophisticated method of extrapolating data. There are two different approaches to modeling time series data: deterministic and stochastic.

2.1 Trend Analysis

We begin with simple models that can be used to forecast a time series on the basis of its past

behavior. Most of the series we encounter are not continuous in time, instead, they consist of

discrete observations made at regular intervals of time. We denote the values of a time series by $\{y_t\}$, $t = 1, 2, \ldots, T$. Our objective is to model the series $y_t$ and use that model to forecast $y_t$ beyond the last observation $y_T$. We denote the forecast $l$ periods ahead by $\hat{y}_{T+l}$.

We sometimes can describe a time series $y_t$ by using a trend model defined as

$$y_t = \mathrm{TR}_t + \varepsilon_t \qquad (2.1)$$

where $\mathrm{TR}_t$ is the trend in time period $t$ and $\varepsilon_t$ is the error term in time period $t$.

2.1.1 Linear Trend Model:

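$$\mathrm{TR}_t = \beta_0 + \beta_1 t \qquad (2.2)$$

We can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t \qquad (2.3)$$

Then the forecast $l$ periods ahead is given by

$$\hat{y}_{T+l} = \hat{\beta}_0 + \hat{\beta}_1 (T + l) \qquad (2.4)$$

For this particular model the distance value is

$$\mathrm{DV} = \frac{1}{T} + \frac{(T + l - \bar{t})^2}{\sum_{t=1}^{T} (t - \bar{t})^2}$$

Hence the $100(1 - \alpha)\%$ prediction interval for an individual value of the dependent variable is $\hat{y}_{T+l} \pm t_{\alpha/2,\,T-2}\, s \sqrt{1 + \mathrm{DV}}$.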
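As an illustration of equations (2.2) to (2.4) and the distance value, the following is a minimal sketch in plain Python. The function names, the toy series, and the choice of $s$ as the residual standard error $\sqrt{\mathrm{SSE}/(T-2)}$ are assumptions made for this example; the critical value $t_{\alpha/2,\,T-2}$ is passed in by the caller (for instance from a t-table) rather than computed.

```python
import math

def fit_linear_trend(y):
    """Least squares estimates (b0, b1) of y_t = b0 + b1*t for t = 1..T."""
    T = len(y)
    t_bar = (T + 1) / 2                      # mean of the time indices 1..T
    y_bar = sum(y) / T
    sxx = sum((t - t_bar) ** 2 for t in range(1, T + 1))
    sxy = sum((t - t_bar) * (y[t - 1] - y_bar) for t in range(1, T + 1))
    b1 = sxy / sxx
    return y_bar - b1 * t_bar, b1

def linear_forecast_interval(y, steps_ahead, t_crit):
    """Point forecast and prediction interval steps_ahead periods past T.

    t_crit is the t quantile with T - 2 degrees of freedom (supplied by the caller).
    s is taken to be sqrt(SSE / (T - 2)), an assumption of this sketch.
    """
    T = len(y)
    b0, b1 = fit_linear_trend(y)
    t_bar = (T + 1) / 2
    sxx = sum((t - t_bar) ** 2 for t in range(1, T + 1))
    dv = 1 / T + (T + steps_ahead - t_bar) ** 2 / sxx      # distance value
    sse = sum((y[t - 1] - (b0 + b1 * t)) ** 2 for t in range(1, T + 1))
    s = math.sqrt(sse / (T - 2))
    point = b0 + b1 * (T + steps_ahead)                    # equation (2.4)
    half_width = t_crit * s * math.sqrt(1 + dv)
    return point, point - half_width, point + half_width

# Toy series (not thesis data); 3.182 is the tabulated t value for 3 degrees
# of freedom at the 95% level.
point, lower, upper = linear_forecast_interval([2, 4, 5, 4, 5], 1, 3.182)
```

The interval widens as `steps_ahead` grows because the distance value increases with the gap between $T + l$ and the mean time index.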

2.1.2 Polynomial Trend Model of Order p

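$$\mathrm{TR}_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p \qquad (2.5)$$

If the number of observations is not too large, we can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2 + \cdots + \hat{\beta}_p t^p \qquad (2.6)$$

Then the forecast $l$ periods ahead is given by

$$\hat{y}_{T+l} = \hat{\beta}_0 + \hat{\beta}_1 (T + l) + \hat{\beta}_2 (T + l)^2 + \cdots + \hat{\beta}_p (T + l)^p \qquad (2.7)$$

The $100(1 - \alpha)\%$ prediction interval for an individual value of the dependent variable is

$$\hat{y}_{T+l} \pm t_{\alpha/2,\,T-p-1}\, s \sqrt{1 + \mathrm{DV}} \qquad (2.8)$$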

Quadratic Trend Model:

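It is a special case of the polynomial trend model with order $p = 2$. Hence from the above results we have

$$\mathrm{TR}_t = \beta_0 + \beta_1 t + \beta_2 t^2 \qquad (2.9)$$

If the number of observations is not too large, we can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2 \qquad (2.10)$$

Then the forecast $l$ periods ahead is given by

$$\hat{y}_{T+l} = \hat{\beta}_0 + \hat{\beta}_1 (T + l) + \hat{\beta}_2 (T + l)^2 \qquad (2.11)$$

The $100(1 - \alpha)\%$ prediction interval for an individual value of the dependent variable is

$$\hat{y}_{T+l} \pm t_{\alpha/2,\,T-3}\, s \sqrt{1 + \mathrm{DV}} \qquad (2.12)$$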
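The polynomial and quadratic trend models above can be fitted by ordinary least squares. The sketch below is one possible plain-Python implementation that solves the normal equations by Gaussian elimination; the function names and the toy series are hypothetical, and this is not necessarily the algorithm used by the statistical software in this study.

```python
def fit_polynomial_trend(y, p):
    """Least squares coefficients [b0, ..., bp] of a trend of order p, t = 1..T."""
    T = len(y)
    n = p + 1
    times = range(1, T + 1)
    # Normal equations (X'X) b = X'y, where column j of X holds t**j.
    A = [[float(sum(t ** (i + j) for t in times)) for j in range(n)] for i in range(n)]
    rhs = [float(sum((t ** i) * y[t - 1] for t in times)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    b = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(A[i][j] * b[j] for j in range(i + 1, n))
        b[i] = (rhs[i] - tail) / A[i][i]
    return b

def polynomial_forecast(b, T, steps_ahead):
    """Forecast steps_ahead periods past T, as in equations (2.7) and (2.11)."""
    return sum(bj * (T + steps_ahead) ** j for j, bj in enumerate(b))

# Toy quadratic series y_t = 1 + 2t + 3t^2 (not thesis data).
series = [1 + 2 * t + 3 * t ** 2 for t in range(1, 7)]
coefficients = fit_polynomial_trend(series, 2)
next_value = polynomial_forecast(coefficients, len(series), 1)
```

The normal equations become poorly conditioned when $T$ or $p$ is large, so this sketch is only suitable for short series and low orders such as $p = 2$.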

2.1.3 Comparisons of Different Methods

Minitab computes three measures of accuracy of the fitted model: MAPE, MAD, and MSD for

each of the simple forecasting and smoothing methods. For all three measures, the smaller the

value, the better the fit of the model. Use these statistics to compare the fits of the different

methods.

MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values. It expresses accuracy as a percentage.

$$\mathrm{MAPE} = \frac{\sum_{t=1}^{T} \left| (y_t - \hat{y}_t) / y_t \right|}{T} \times 100 \qquad (2.13)$$

where $y_t$ equals the actual value, $\hat{y}_t$ equals the fitted value, and $T$ equals the number of observations.

MAD, which stands for Mean Absolute Deviation, measures the accuracy of fitted time series values. It expresses accuracy in the same units as the data, which helps conceptualize the amount of error.

$$\mathrm{MAD} = \frac{\sum_{t=1}^{T} \left| y_t - \hat{y}_t \right|}{T} \qquad (2.14)$$

with $y_t$, $\hat{y}_t$, and $T$ as in (2.13).

MSD stands for Mean Squared Deviation. MSD is always computed using the same denominator, $T$, regardless of the model, so you can compare MSD values across models. MSD is a more sensitive measure of an unusually large forecast error than MAD.

$$\mathrm{MSD} = \frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{T} \qquad (2.15)$$

with $y_t$, $\hat{y}_t$, and $T$ as in (2.13).
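The three accuracy measures in (2.13) to (2.15) are simple to compute directly. The following sketch uses hypothetical function names and toy numbers rather than the thesis data; it assumes that no actual value is zero, since MAPE divides by $y_t$.

```python
def mape(actual, fitted):
    """Mean absolute percentage error, equation (2.13); actual values must be nonzero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, fitted)) / len(actual)

def mad(actual, fitted):
    """Mean absolute deviation, equation (2.14), in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)

def msd(actual, fitted):
    """Mean squared deviation, equation (2.15); penalizes large errors more heavily."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)

# Toy values for illustration only.
actual = [100, 200, 400]
fitted = [110, 180, 400]
measures = (mape(actual, fitted), mad(actual, fitted), msd(actual, fitted))
```

Because MAPE is relative while MAD and MSD are absolute, the three measures can rank the same set of models differently when the series spans very different magnitudes.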

2.1.4 Exponential smoothing

Exponential smoothing provides a forecasting method that is most effective when the components

of the time series may be changing over time. It is often more reasonable to have more recent

values of ty play a greater role than do earlier values. In such a case recent values should be

weighted more heavily in the moving average.

Suppose that the time series $y_t$ has a level (or mean) that may slowly change over time but has no trend or seasonal pattern. This series can be described as

$$y_t = \beta_0 + \varepsilon_t \qquad (2.16)$$

Then the estimate $\ell_T$ for the level of the series in time period $T$ is given by the smoothing equation

$$\ell_T = \alpha y_T + (1 - \alpha)\, \ell_{T-1} \qquad (2.17)$$

where $\alpha$ is a smoothing constant between 0 and 1, and $\ell_{T-1}$ is the estimate of the level in time period $T - 1$.

A point forecast for one period ahead is given by

$$\hat{y}_{T+1} = \ell_T \qquad (2.18)$$

which implies

$$\hat{y}_{T+1} = \alpha y_T + \alpha (1 - \alpha) y_{T-1} + \alpha (1 - \alpha)^2 y_{T-2} + \cdots = \alpha \sum_{\tau=0}^{\infty} (1 - \alpha)^{\tau} y_{T-\tau} \qquad (2.19)$$

It is easy to show that the $l$-period forecast $\hat{y}_{T+l}$ can be given by

$$\hat{y}_{T+l} = \alpha \sum_{\tau=0}^{\infty} (1 - \alpha)^{\tau} y_{T-\tau} \qquad (2.20)$$

There are several methods to choose the appropriate value of $\alpha$. The most popular method is to choose the $\alpha$ which minimizes the mean squared deviation (MSD) between the actual and forecasted values. Other measures of accuracy are the mean absolute percentage error (MAPE) and the mean absolute deviation (MAD).
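The smoothing recursion and the choice of the smoothing constant by minimum MSD can be sketched as follows. The function names, the initial level (set to the first observation), and the grid of candidate constants are assumptions of this example, not choices documented in the text.

```python
def exponential_smoothing(y, alpha, initial_level=None):
    """Return the smoothed levels for each period of y.

    Each level is alpha * observation + (1 - alpha) * previous level.
    The starting level defaults to the first observation (an assumption).
    """
    level = y[0] if initial_level is None else initial_level
    levels = []
    for observation in y:
        level = alpha * observation + (1 - alpha) * level
        levels.append(level)
    return levels

def one_step_msd(y, alpha):
    """MSD of one-step-ahead forecasts: the forecast for period t + 1 is the level at t."""
    levels = exponential_smoothing(y, alpha)
    errors = [y[t + 1] - levels[t] for t in range(len(y) - 1)]
    return sum(e * e for e in errors) / len(errors)

def choose_alpha(y, grid=None):
    """Pick the smoothing constant on a grid that minimizes the one-step MSD."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda a: one_step_msd(y, a))

# Toy series (not thesis data).
levels = exponential_smoothing([10, 12, 11, 13], 0.5)
best_alpha = choose_alpha([10, 12, 11, 13, 12, 14, 13, 15])
```

For a series with a strong trend, this procedure tends to select a constant close to 1, which is a sign that simple exponential smoothing without a trend component lags behind the data.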


2.2 Trend Analysis of Nine Major Programs

In this section we investigate the trend of the total number of students studying abroad in nine major programs. For each program we consider three different trend models: linear, quadratic, and exponential. We also compute MAPE, MAD and MSD to evaluate which model fits the data better.

2.2.1 All Programs

First we consider the total number of students studying abroad in all programs. Figure 2.1 gives the time series plot of the total number of students from 1980 to 2014. From this figure it is clear that the number of students studying abroad has an increasing trend. It seems to us that this increase is not linear but exponential.

Figure 2.1: Time Series Plot of the Total Number of Students (Total vs. Year)

Now we fit three trend models to these data: linear, quadratic and exponential. The graphs are presented in Figure 2.2.

Figure 2.2: Trend Analysis of the Total Number of Students (three panels of Total vs. Index, each showing actual and fitted values)
Linear Trend Model: Yt = -17044 + 2139*t
Quadratic Trend Model: Yt = 27234 - 5241*t + 210.8*t**2
Growth Curve Model: Yt = 2054.90 * (1.0895**t)

From Figure 2.2 it is clear that the number of Saudi Arabia students studying abroad has an

increasing trend. It seems to us that an exponential model may fit the data better. But graphical

summaries are very subjective in nature. So for more convincing conclusions we need to look at


numerical quantities. The following table gives a summary result to compare three different trend

models.

Table 2.1: Trend Summary of the Total Number of Students

Model          MAPE     MAD         MSD
Linear          208   17431   439097288
Quadratic       119    8259   110484665
Exponential      77   12364   505301336

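Results presented in Table 2.1 show that the quadratic trend model fits the data better than the linear model on all three measures and has the smallest MAD and MSD. The exponential trend model improves on the linear model in MAPE and MAD but has the largest MSD of the three. In terms of MAPE, however, the exponential trend model is better than the other two models.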
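To make the three fitted equations reported in Figure 2.2 concrete, the sketch below evaluates them at a chosen time index and reads off the growth rate implied by the growth curve. It assumes the index runs t = 1, 2, ... from the first year of the series, so that t = 34 would be the last observed year for data covering 1981 to 2014; the function names are hypothetical.

```python
import math

def linear_fit(t):
    """Fitted linear trend for the total number of students (Figure 2.2)."""
    return -17044 + 2139 * t

def quadratic_fit(t):
    """Fitted quadratic trend for the total number of students (Figure 2.2)."""
    return 27234 - 5241 * t + 210.8 * t ** 2

def growth_fit(t):
    """Fitted growth curve for the total number of students (Figure 2.2)."""
    return 2054.90 * 1.0895 ** t

# The growth curve multiplies the fitted value by 1.0895 each period,
# i.e. about 8.95% growth per period.
growth_rate_percent = (1.0895 - 1) * 100
# Number of periods for the fitted value to double under the growth curve.
doubling_time = math.log(2) / math.log(1.0895)

fitted_at_34 = {
    "linear": linear_fit(34),
    "quadratic": quadratic_fit(34),
    "growth": growth_fit(34),
}
```

Under the growth curve the fitted number of students doubles roughly every eight periods, which is one way to read the "almost exponential" growth described in the text.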

Now we will investigate trend models for nine separate programs.

Figure 2.3: Time Series Plot of Total Number of Students in Different Programs (number of students vs. Year, one series each for Agriculture, Education, Engineering, Fine Arts, Humanities, Law, Medical Science, Natural Science and Social Science)

Figure 2.3 shows that the number of Saudi Arabia students studying abroad in each program has an overall increasing trend. But there are huge differences in the number of students, so when they are plotted together some programs are not distinguishable at all. As a remedy to this problem we plot the same graph in natural log scale; the graph is presented in Figure 2.4.

Figure 2.4: Time Series Plot of Total Number of Students (in ln) in Different Programs (ln of the number of students vs. Year, one series each for Agriculture, Education, Engineering, Fine Arts, Humanities, Law, Medical Science, Natural Science and Social Science)

It is clear from Figure 2.4 that the number of students differs significantly from one program to another. The programs with the highest enrollment are Engineering, Natural Science, Medical Science and Social Science, but the number of students in Social Science dropped in the last few years. The programs with relatively few students are Agriculture and Fine Arts.
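Plotting in natural log scale is also informative about the trend itself: an exponential (growth curve) trend of the form Yt = b0 * b1**t becomes a straight line, ln Yt = ln b0 + t ln b1, on the log scale. The sketch below fits such a curve by least squares on the logged values. This is a common approach and is offered here only as an illustration; the function name and toy series are hypothetical, and the fitting software used in this study may proceed differently.

```python
import math

def fit_growth_curve(y):
    """Fit y_t = b0 * b1**t (t = 1..T) by least squares on ln(y_t).

    All values of y must be positive. Returns (b0, b1).
    """
    T = len(y)
    logs = [math.log(value) for value in y]
    t_bar = (T + 1) / 2                       # mean of the time indices 1..T
    z_bar = sum(logs) / T
    sxx = sum((t - t_bar) ** 2 for t in range(1, T + 1))
    sxz = sum((t - t_bar) * (logs[t - 1] - z_bar) for t in range(1, T + 1))
    slope = sxz / sxx                         # estimate of ln(b1)
    intercept = z_bar - slope * t_bar         # estimate of ln(b0)
    return math.exp(intercept), math.exp(slope)

# Toy series growing by 10% per period (not thesis data).
toy = [100 * 1.1 ** t for t in range(1, 11)]
b0, b1 = fit_growth_curve(toy)
```

Because the fit minimizes squared errors of ln Yt rather than of Yt, it weights relative errors, which is consistent with a growth curve doing well on MAPE while doing less well on MSD.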

Figure 2.5 (first two panels, Total of Social Sciences vs. Index, each showing actual and fitted values)
Quadratic Trend Model: Yt = 3155 - 503*t + 22.6*t**2
Linear Trend Model: Yt = -1599 + 289*t


2.2.2 Social Sciences

Among the nine programs, we first consider the total number of students studying abroad in the Social Science program. Figure 2.5 gives linear, quadratic and exponential trend fits for the Social Science program.

From the figure it is clear that the number of students studying abroad in the Social Science program shows an increasing trend. It seems to us that an exponential model may fit the data. The following table gives a summary result to compare the three different trend models.

Growth Curve Model panel (Total of Social Sciences vs. Index, actual and fitted values): Yt = 658.094 * (1.0530**t)

Figure 2.5: Trend Analysis Plot of the Total Number of Social Science Students

Table 2.2: Trend Summary of the Total Number of Social Science Students

Model         MAPE    MAD    MSD
Linear        234     3537   34029670
Quadratic     112     2595   30241525
Exponential   93      2552   39771799

Table 2.2 shows that the exponential trend model has the smallest MAPE and MAD, although the quadratic model has the smallest MSD. Judged by MAPE, the exponential trend model fits the data better than the other two models.
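The three trend models and the accuracy measures used above can be sketched in a few lines of Python. This is only a minimal illustration, not the MINITAB procedure used in this thesis. We assume that MAPE, MAD and MSD are the mean absolute percentage error, the mean absolute deviation and the mean squared deviation of the fits, and that the exponential (growth curve) model is fitted by least squares on ln(Yt). The series `y` and all function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def polyfit(t, y, degree):
    """Least-squares polynomial coefficients [b0, b1, ...] from the normal equations."""
    p = degree + 1
    a = [[sum(ti ** (r + c) for ti in t) for c in range(p)] for r in range(p)]
    b = [sum(yi * ti ** r for ti, yi in zip(t, y)) for r in range(p)]
    return solve(a, b)

def accuracy(y, fits):
    """Return (MAPE, MAD, MSD); MAPE is undefined if any observation is zero."""
    n = len(y)
    mape = 100 * sum(abs((yi - fi) / yi) for yi, fi in zip(y, fits)) / n
    mad = sum(abs(yi - fi) for yi, fi in zip(y, fits)) / n
    msd = sum((yi - fi) ** 2 for yi, fi in zip(y, fits)) / n
    return mape, mad, msd

def trend_fits(y):
    """Fit linear, quadratic and exponential trends against t = 1, 2, ..., n."""
    t = list(range(1, len(y) + 1))
    b = polyfit(t, y, 1)
    c = polyfit(t, y, 2)
    g = polyfit(t, [math.log(yi) for yi in y], 1)  # ln(Yt) = g0 + g1 * t
    return {
        "Linear": [b[0] + b[1] * ti for ti in t],
        "Quadratic": [c[0] + c[1] * ti + c[2] * ti ** 2 for ti in t],
        "Exponential": [math.exp(g[0]) * math.exp(g[1]) ** ti for ti in t],
    }

# Hypothetical series that grows by exactly 10% per period.
y = [100 * 1.1 ** ti for ti in range(1, 11)]
results = {name: accuracy(y, fits) for name, fits in trend_fits(y).items()}
for name, (mape, mad, msd) in results.items():
    print(f"{name:12s} MAPE={mape:.3f} MAD={mad:.3f} MSD={msd:.3f}")
```

Because the three measures weight errors differently, they need not agree on the best model, which is why the tables in this chapter report all three.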

2.2.3 Natural Sciences

Our next example is the total number of students studying abroad in the Natural Science program. Figure 2.6 gives linear, quadratic and exponential trend fits for this program. From the figure it is clear that the number of students studying abroad in the Natural Science program has an increasing trend and that an exponential model may fit the data better.

[Plot: Trend Analysis Plot for the Total of Natural Sciences, Linear Trend Model. Fitted equation: Yt = -4613 + 508*t. Axes: Index (horizontal), Total (vertical).]

[Plot: Trend Analysis Plot for the Total of Natural Sciences, Quadratic Trend Model. Fitted equation: Yt = 5952 - 1252*t + 50.31*t**2. Axes: Index (horizontal), Total (vertical).]

[Plot: Trend Analysis Plot for the Total of Natural Sciences, Growth Curve Model. Fitted equation: Yt = 279.595 * (1.1053**t). Axes: Index (horizontal), Total (vertical).]

Figure 2.6: Trend Analysis Plot of the Total Number of Students for Natural Science

[Plot: Trend Analysis Plot for the Total of Medical Science, Linear Trend Model. Fitted equation: Yt = -4742 + 528*t. Axes: Index (horizontal), Total (vertical).]

[Plot: Trend Analysis Plot for the Total of Medical Science, Quadratic Trend Model. Fitted equation: Yt = 5652 - 1205*t + 49.50*t**2. Axes: Index (horizontal), Total (vertical).]

Table 2.3: Trend Summary of the Total Number of Natural Science Students

Model         MAPE    MAD    MSD
Linear        278     4086   27110563
Quadratic     193     2392   8401020
Exponential   72      2666   30860217

Table 2.3 shows that the exponential trend model has by far the smallest MAPE, whereas the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential trend model fits the data better than the other two models.

2.2.4 Medical Science

Our next example is the total number of students studying abroad in the Medical Science program. Figure 2.7 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Medical Science program has an increasing trend and that an exponential model may fit the data better.

[Plot: Trend Analysis Plot for the Total of Medical Science, Growth Curve Model. Fitted equation: Yt = 259.904 * (1.1148**t). Axes: Index (horizontal), Total (vertical).]

Figure 2.7: Trend Analysis Plot of the Total Number of Students for Medical Science

Table 2.4: Trend Summary of the Total Number of Medical Science Students

Model         MAPE    MAD    MSD
Linear        249     4015   25461692
Quadratic     165     2250   7351186
Exponential   61      2408   25015184

Table 2.4 shows that the exponential trend model has the smallest MAPE, whereas the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential trend model fits the data better than the other two models.

2.2.5 Law

Here we consider the total number of students studying abroad in the Law program. Figure 2.8 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Law program has an increasing trend and that an exponential model may fit the data better.

Figure 2.8: Trend Analysis Plot of the Total Number of Students for Law

Table 2.5: Trend Summary of the Total Number of Law Students

Model         MAPE    MAD    MSD
Linear        563     657    644213
Quadratic     357     338    174624
Exponential   96      419    755189

Table 2.5 shows that the exponential trend model has by far the smallest MAPE, whereas the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential trend model fits the data better than the other two models.

2.2.6 Humanities

Now we consider the total number of students studying abroad in the Humanities program. Figure 2.9 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Humanities program has an increasing trend. We also observe from this plot that both the quadratic and the exponential models fit the data adequately.

Figure 2.9: Trend Analysis Plot of the Total Number of Students for Humanities

Table 2.6: Trend Summary of the Total Number of Humanities Students

Model         MAPE    MAD    MSD
Linear        167     1179   2573862
Quadratic     58      752    1348197
Exponential   87      880    2475024

Results presented in Table 2.6 clearly show that the quadratic trend model fits the data better than the other two models.

2.2.7 Fine Arts

Now we consider the total number of students studying abroad in the Fine Arts program. Figure 2.10 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Fine Arts program has an increasing trend and that an exponential model may fit the data better.

Figure 2.10: Trend Analysis Plot of the Total Number of Students for Fine Arts

Table 2.7: Trend Summary of the Total Number of Fine Arts Students

Model         MAPE    MAD     MSD
Linear        224.2   194.6   71151.6
Quadratic     180.2   132.1   29439.3
Exponential   69.5    126.9   84233.2

Table 2.7 shows that the exponential trend model has the smallest MAPE and MAD, although the quadratic model has the smallest MSD. Judged by MAPE, the exponential trend model fits the data better than the other two models.

2.2.8 Engineering

Now we consider the total number of students studying abroad in the Engineering program. Figure 2.11 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Engineering program has an increasing trend. We also observe from this plot that an exponential model may fit the data better.

Figure 2.11: Trend Analysis Plot of the Total Number of Students for Engineering

Table 2.8: Trend Summary of the Total Number of Engineering Students

Model         MAPE    MAD    MSD
Linear        397     4738   36869030
Quadratic     258     2724   11068847
Exponential   119     3466   50802116

Table 2.8 shows that the exponential trend model has the smallest MAPE, whereas the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential trend model fits the data better than the other two models.

2.2.9 Education

Now we consider the total number of students studying abroad in the Education program. Figure 2.12 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Education program has an increasing trend. We also observe from this plot that both the quadratic and the exponential models fit the data adequately.

Figure 2.12: Trend Analysis Plot of the Total Number of Students for Education

Table 2.9: Trend Summary of the Total Number of Education Students

Model         MAPE    MAD    MSD
Linear        134     577    464455
Quadratic     48      301    214264
Exponential   82      506    523959

Results presented in Table 2.9 clearly show that the quadratic trend model fits the data better than the other two models.

2.2.10 Agriculture

Finally we consider the total number of students studying abroad in the Agriculture program. Figure 2.13 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Agriculture program has an increasing trend. We also observe from this plot that both the quadratic and the exponential models fit the data adequately.

Figure 2.13: Trend Analysis Plot of the Total Number of Students for Agriculture

Table 2.10: Trend Summary of the Total Number of Agriculture Students

Model         MAPE     MAD      MSD
Linear        36.53    26.68    1190.99
Quadratic     28.773   20.265   610.926
Exponential   33.25    25.90    1214.57

Results presented in Table 2.10 clearly show that the quadratic trend model fits the data better than the other two models.

2.3 Trend Analysis of Some Other Relevant Variables

Here we consider some other variables which we believe may have a significant impact on the number of students studying abroad. These variables are the budget in higher education, oil price and oil revenue. Oil is the key factor in the Saudi Arabian economy, so oil price and oil revenue should affect almost all major policies of the government.

We first look at the trend of these variables. Time series plots of the three variables are presented in Figures 2.14 to 2.16.

[Plot: Time Series Plot of Budget in HE. Axes: Year (horizontal), Budget in HE (vertical).]

Figure 2.14: Time Series Plot of the Budget in Higher Education

We observe from Figure 2.14 that the budget in higher education has grown steadily over the years and clearly shows an increasing trend. Oil price dropped once but recovered later and thus shows an upward trend overall (Figure 2.15). Oil revenue also shows an increasing pattern (Figure 2.16).

[Plot: Time Series Plot of Oil Price. Axes: Year (horizontal), Oil Price (vertical).]

Figure 2.15: Time Series Plot of Oil Price

[Plot: Time Series Plot of Oil Revenue. Axes: Year (horizontal), Oil Revenue (vertical).]

Figure 2.16: Time Series Plot of Oil Revenue

Now we fit these three variables by three different trend models.

2.3.1 Oil Revenue

At first we consider oil revenue over the years. Figure 2.17 gives linear, quadratic and exponential

trend fits of this data. From the figure it is clear that oil revenue has an increasing trend. We also

observe from this plot that both quadratic and exponential models adequately fit the data.

[Plot: Trend Analysis Plot for Oil Revenue, Linear Trend Model. Fitted equation: Yt = -127953 + 26267*t. Axes: Index (horizontal), Oil Revenue (vertical).]

[Plot: Trend Analysis Plot for Oil Revenue, Quadratic Trend Model. Fitted equation: Yt = 294817 - 44194*t + 2013*t**2. Axes: Index (horizontal), Oil Revenue (vertical).]

[Plot: Trend Analysis Plot for Oil Revenue, Growth Curve Model. Fitted equation: Yt = 55445.0 * (1.0792**t). Axes: Index (horizontal), Oil Revenue (vertical).]

Figure 2.17: Trend Analysis of Oil Revenue

Table 2.11: Trend Summary of Oil Revenue

Model         MAPE          MAD           MSD
Linear        8.35439E+01   1.64688E+05   4.23297E+10
Quadratic     3.55741E+01   7.94309E+04   1.23704E+10
Exponential   4.79737E+01   1.26068E+05   3.40103E+10

Results presented in Table 2.11 clearly show that the quadratic trend model fits the data better than

the other two models.

2.3.2 Budget in Higher Education

Next we consider the budget in higher education. Figure 2.18 gives linear, quadratic and

exponential trend fits of this data. From the figure it is clear that the budget in higher education

shows an increasing trend. We also observe from this plot that both quadratic and exponential

models adequately fit the data.

[Plot: Trend Analysis Plot for Budget in HE, Linear Trend Model. Fitted equation: Yt = -38718871627 + 5497524487*t. Axes: Index (horizontal), Budget in HE (vertical).]

[Plot: Trend Analysis Plot for Budget in HE, Quadratic Trend Model. Fitted equation: Yt = 12499933066 - 3038942962*t + 243899070*t**2. Axes: Index (horizontal), Budget in HE (vertical).]

[Plot: Trend Analysis Plot for Budget in HE, Growth Curve Model. Fitted equation: Yt = 102994932 * (1.3105**t). Axes: Index (horizontal), Budget in HE (vertical).]

Figure 2.18: Trend Analysis of Budget in Higher Education

Table 2.12: Trend Summary of Budget in Higher Education

Model         MAPE          MAD           MSD
Linear        5.86496E+04   1.89828E+10   5.58537E+20
Quadratic     1.64690E+04   8.29748E+09   1.18811E+20
Exponential   5.35190E+02   8.71668E+10   3.87341E+22

Table 2.12 shows that the exponential trend model has by far the smallest MAPE, although it has the largest MAD and MSD, for which the quadratic model performs best. Judged by MAPE, the exponential trend model fits the data better than the other two models.

2.3.3 Oil Price

Next we consider oil price. Figure 2.19 gives linear, quadratic and exponential trend fits of this

data. From the figure it is clear that oil price shows an increasing trend. We also observe from this

plot that both quadratic and exponential models adequately fit the data.

[Plot: Trend Analysis Plot for Oil Price, Linear Trend Model. Fitted equation: Yt = 30.67 + 0.877*t. Axes: Index (horizontal), Oil Price (vertical).]

[Plot: Trend Analysis Plot for Oil Price, Quadratic Trend Model. Fitted equation: Yt = 82.96 - 7.838*t + 0.2490*t**2. Axes: Index (horizontal), Oil Price (vertical).]

[Plot: Trend Analysis Plot for Oil Price, Growth Curve Model. Fitted equation: Yt = 28.291 * (1.01911**t). Axes: Index (horizontal), Oil Price (vertical).]

Figure 2.19: Trend Analysis of Oil Price

Table 2.13: Trend Summary of Oil Price

Model         MAPE      MAD      MSD
Linear        59.160    20.980   554.086
Quadratic     18.8959   7.6090   95.7177
Exponential   48.344    19.889   565.457

Results presented in Table 2.13 clearly show that the quadratic trend model fits the data better than the other two models.

2.4 Summary Results of Trend Analysis

In this section we summarize the above trend results. Altogether we have considered 13 variables. Table 2.14 gives a quick view of which model is appropriate for which variable.

Table 2.14: Trend Summary

Variable                      Model         Direction
Total Number of Students      Exponential   Increasing
Students in Social Science    Exponential   Increasing
Students in Natural Science   Exponential   Increasing
Students in Medical Science   Exponential   Increasing
Students in Law               Exponential   Increasing
Students in Humanities        Quadratic     Increasing
Students in Fine Arts         Exponential   Increasing
Students in Engineering       Exponential   Increasing
Students in Education         Quadratic     Increasing
Students in Agriculture       Quadratic     Increasing
Oil Revenue                   Quadratic     Increasing
Budget in Higher Education    Exponential   Increasing
Oil Price                     Quadratic     Increasing

The entries for Engineering and Education follow the conclusions drawn from Tables 2.8 and 2.9. The above results show that a linear trend model is not the best fit for any of the 13 variables. For most of the variables the quadratic and exponential models perform similarly; the exponential model fits better in 8 cases and the quadratic model in the remaining 5. All 13 variables show an increasing trend.

CHAPTER 3

Comparison between Genders and Different Programs

We have separate information on male and female Saudi Arabia students who are studying abroad. In this chapter we examine whether there is any gender discrimination. We also examine whether the number of students differs significantly among the programs.

3.1 Comparison between Genders

We first investigate whether there is any gender discrimination by looking at the number of male and female students in each program.

3.1.1 Social Science

Figure 3.1 gives a time series plot of the number of male and female students in Social Science

program.

Figure 3.1: Time Series Plot of Male and Female Students in Social Science

It is clear from this figure that the number of male students is consistently higher and that the gap has become very large in recent years.

3.1.2 Natural Science

Figure 3.2 gives time series plot of the number of male and female students in Natural Science

program.

Figure 3.2: Time Series Plot of Male and Female Students in Natural Science

It is clear from this figure that the number of male students is consistently higher and that the gap has become very large in recent years.

3.1.3 Medical Science

Figure 3.3 gives a time series plot of the number of male and female students in Medical Science

program.

Figure 3.3: Time Series Plot of Male and Female Students in Medical Science

It is clear from this figure that the number of male students is consistently higher and that the gap has become very large in recent years.

3.1.4 Law

Figure 3.4 gives a time series plot of the number of male and female students in Law program.

Figure 3.4: Time Series Plot of Male and Female Students in Law

It is clear from this figure that the number of male students is consistently higher and that the gap has become very large in recent years.

3.1.5 Humanities

Figure 3.5 gives a time series plot of the number of male and female students in Humanities

program.

Figure 3.5: Time Series Plot of Male and Female Students in Humanities

It is clear from this figure that the number of female students was higher initially. The gap between male and female students then narrowed. In recent years, however, the number of male students has increased, and it currently exceeds the number of female students.

3.1.6 Engineering

Figure 3.6 gives a time series plot of the number of male and female students in Engineering

program.

Figure 3.6: Time Series Plot of Male and Female Students in Engineering

It is clear from this figure that the number of male students is consistently higher and that the gap has risen steeply in recent years.

3.1.7 Education

Figure 3.7 gives a time series plot of the number of male and female students in Education

program.

Figure 3.7: Time Series Plot of Male and Female Students in Education

It is clear from this figure that the number of male students was higher in earlier years, but the gap narrowed and the number of female students has now overtaken the number of male students.

3.1.8 Fine Arts

Figure 3.8 gives a time series plot of the number of male and female students in Fine Arts program.

Figure 3.8: Time Series Plot of Male and Female Students in Fine Arts

This appears to be the only program in which the number of female students is consistently higher than the number of male students, and the gap has widened in recent years.

3.1.9 Agriculture

Figure 3.9 gives a time series plot of the number of male and female students in Agriculture

program.

Figure 3.9: Time Series Plot of Male and Female Students in Agriculture

Figure 3.9 shows that the number of male students was much higher in earlier years. The gap has narrowed gradually, but the number of male students remains consistently higher than the number of female students.

3.2 Tests for the Equality of Means between Male and Female Students

In the previous section we saw that in most programs the number of male students is higher than the number of female students. Because graphical comparisons are subjective, here we formally test the difference between the mean numbers of male and female students. Let us denote the number of male students by X and the number of female students by Y, with sample sizes n and m respectively. We are interested in testing the hypothesis

\[ H_0 : \mu_X = \mu_Y \]

against

\[ H_1 : \mu_X \neq \mu_Y . \]

Under $H_0$, the test statistic becomes

\[ Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} . \]

Assuming further normality and large sample sizes, the critical region for the test becomes

\[ |z| = \frac{|\bar{x} - \bar{y}|}{\sqrt{S_X^2/n + S_Y^2/m}} \geq z_{\alpha/2} . \]

We test the equality of the means of male and female students for all nine programs, and the results are presented below. We present the average number of male and female students, the z-value and its corresponding p-value, whether the difference is significant or not, and if so, toward which gender it is biased. Here * stands for significant at the 10% level, ** for significant at the 5% level and *** for significant at the 1% level.

Table 3.1: Summary Test Results for the Equality of Means between Male and Female Students

Program           Male (Ave)   Female (Ave)   z-value   p-value   Difference       Biased to
Social Science    2737         722            2.20      0.032     **Significant    Male
Natural Science   3146         1137           2.09      0.040     **Significant    Male
Medical Science   3102         1388           1.86      0.068     *Significant     Male
Law               546          109            2.65      0.010     **Significant    Male
Humanities        890          957            -0.25     0.807     Insignificant
Fine Arts         57           127            -1.57     0.121     Insignificant
Engineering       4374         150            3.17      0.002     ***Significant   Male
Education         421          438            -0.14     0.887     Insignificant
Agriculture       79.6         5.74           11.40     0.000     ***Significant   Male

It is clear from this table that the number of male students is significantly higher than the number of female students in 6 out of 9 programs. Female students outnumber male students in only three programs (Humanities, Fine Arts and Education), and in none of them is the difference statistically significant. So we can say that male students hold an advantageous position over female students.
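The test described in this section can be sketched as follows. This is a minimal illustration under the large-sample normal approximation stated above, using only the Python standard library. The two short series are hypothetical and are not the thesis data, and the function name `two_sample_z` is our own.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z(x, y):
    """Two-sided large-sample z test of H0: mu_X = mu_Y.

    Uses the sample variances S_X^2 and S_Y^2 in place of the unknown
    population variances, as in the critical region given above.
    """
    n, m = len(x), len(y)
    se = math.sqrt(variance(x) / n + variance(y) / m)
    z = (mean(x) - mean(y)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical yearly counts for illustration only.
male = [120, 150, 170, 210, 260, 330, 400, 520]
female = [40, 45, 60, 70, 85, 90, 130, 160]

z, p = two_sample_z(male, female)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

A positive z indicates that the first group (here, male students) has the larger sample mean; the null hypothesis is rejected at level α when the p-value is below α.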

3.2.1 Comparison among All Programs

Now we would like to see whether the number of students differs among the programs.

Table 3.2: Average Number of Students in Different Programs

Program           Average Number of Students
Social Science    3459
Natural Science   4284
Medical Science   4490
Law               655
Humanities        1847
Fine Arts         184.6
Engineering       4524
Education         859
Agriculture       85.32

[Plot: Box plot of the number of students by program (Social Science, Natural Science, Medical Science, Law, Humanities, Fine Arts, Engineering, Education, Agriculture). Vertical axis: Data.]

Figure 3.10: Box Plot of Number of Students in Different Programs

The above table and figure clearly show differences in the average number of students, but we also need to know whether these differences are statistically significant.

3.2.2 Tests for the Equality of Means among All Programs

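Frequently, experimenters want to compare more than two groups. We will compare the means of m normal distributions under the assumption that the variances are all the same. Let us now consider m normal distributions with unknown means $\mu_1, \mu_2, \ldots, \mu_m$ and an unknown but common variance $\sigma^2$. We wish to test the null hypothesis

\[ H_0 : \mu_1 = \mu_2 = \cdots = \mu_m . \]

The observations and their group means can be arranged as follows:

\[
\begin{array}{cccccc|c}
X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1n_1} & \bar{X}_{1.} \\
X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2n_2} & \bar{X}_{2.} \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{in_i} & \bar{X}_{i.} \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
X_{m1} & X_{m2} & \cdots & X_{mj} & \cdots & X_{mn_m} & \bar{X}_{m.} \\
\hline
 & & & & & & \bar{X}_{..}
\end{array}
\]

The i-th group mean is

\[ \bar{X}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}, \quad i = 1, 2, \ldots, m, \]

and the grand mean is

\[ \bar{X}_{..} = \frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n_i} X_{ij} = \frac{1}{n} \sum_{i=1}^{m} n_i \bar{X}_{i.}, \]

where $n = n_1 + n_2 + \cdots + n_m$.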

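To determine a critical region for a test of $H_0$, we partition the total sum of squares as

\[
SS(TO) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})^2
= \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left[ (X_{ij} - \bar{X}_{i.}) + (\bar{X}_{i.} - \bar{X}_{..}) \right]^2
= \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 + \sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2 .
\]

Let

\[ \sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2 = SS(\text{Programs}), \]

the sum of squares among the different programs, and

\[ \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 = SS(\text{Error}), \]

the sum of squares within programs (often called the error sum of squares).

It is easy to show that, under $H_0$,

\[ \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})^2 / \sigma^2 \sim \chi^2_{n-1}, \quad \text{and} \quad \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 / \sigma^2 \sim \chi^2_{n_i - 1} . \]

Hence,

\[ \sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2 / \sigma^2 \sim \chi^2_{m-1} \quad \text{and} \quad \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 / \sigma^2 \sim \chi^2_{n-m}, \]

and the two quantities are independent. Thus

\[ F = \frac{SS(\text{Programs}) / (m - 1)}{SS(\text{Error}) / (n - m)} \sim F_{m-1,\, n-m} . \]

The information used for the tests of the equality of several means is often summarized in an analysis of variance (ANOVA) table, in which SS(P), SS(E) and SS(T) denote SS(Programs), SS(Error) and SS(TO).

Source     Sum of Squares (SS)   Degrees of Freedom   Mean Squares (MS)       F Ratio
Programs   SS(P)                 m – 1                MS(P) = SS(P)/(m – 1)   MS(P)/MS(E)
Error      SS(E)                 n – m                MS(E) = SS(E)/(n – m)
Total      SS(T)                 n – 1

We would reject $H_0$ if the observed value of F is too large. Thus the critical region is of the form

\[ F \geq F_{\alpha;\, m-1,\, n-m} . \]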
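The partition of the total sum of squares and the resulting F ratio can be sketched directly from the definitions above. The following is a minimal illustration in plain Python; the three small groups are hypothetical, and `one_way_anova` is our own name rather than a library function.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of groups of observations."""
    m = len(groups)                      # number of groups (programs)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: sum of n_i * (group mean - grand mean)^2
    ss_programs = sum(len(g) * (gm - grand_mean) ** 2
                      for g, gm in zip(groups, group_means))
    # Within-group (error) sum of squares
    ss_error = sum((x - gm) ** 2
                   for g, gm in zip(groups, group_means) for x in g)
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)

    ms_programs = ss_programs / (m - 1)
    ms_error = ss_error / (n - m)
    return {
        "SS(P)": ss_programs, "SS(E)": ss_error, "SS(T)": ss_total,
        "df": (m - 1, n - m, n - 1),
        "MS(P)": ms_programs, "MS(E)": ms_error,
        "F": ms_programs / ms_error,
    }

# Hypothetical data: three groups of three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for key, value in table.items():
    print(key, value)
```

The observed F would then be compared with the upper α point of the F distribution with (m – 1, n – m) degrees of freedom, taken from tables or statistical software.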

3.3 Comparison of the Individual Treatment Means

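There are several methods by which we can compare treatment means. In what follows, p denotes the number of treatments and N the total number of observations.

3.3.1 The Least Significant Difference (Fisher’s LSD) Method

Suppose that, following an analysis of variance F test in which the null hypothesis is rejected, we wish to test

\[ H_0 : \mu_i = \mu_j \quad \text{for all } i \neq j . \]

This could be done by using the t statistic

\[ t = \frac{\bar{y}_{i.} - \bar{y}_{j.}}{\sqrt{MS_E \, (1/n_i + 1/n_j)}} . \]

The pair of means $\mu_i$ and $\mu_j$ would be declared significantly different if

\[ |\bar{y}_{i.} - \bar{y}_{j.}| > t_{(1-\alpha/2),\, N-p} \sqrt{MS_E \, (1/n_i + 1/n_j)} . \]

The quantity

\[ LSD = t_{(1-\alpha/2),\, N-p} \sqrt{MS_E \, (1/n_i + 1/n_j)} \]

is called the least significant difference.

A design is called balanced when $n_1 = n_2 = \cdots = n_p = n$, and in that case

\[ LSD = t_{(1-\alpha/2),\, N-p} \sqrt{2 \, MS_E / n} . \]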

3.3.2 Duncan’s Multiple Range Test

A widely used procedure for comparing all pairs of means is the multiple range test proposed by Duncan. We first arrange the p treatment means in ascending order and compute the standard error of each average as

\[ s_{\bar{y}_{i.}} = \sqrt{MS_E / n_h} \]

where

\[ n_h = \frac{p}{\sum_{i=1}^{p} (1/n_i)} . \]

If $n_1 = n_2 = \cdots = n_p = n$, we have $n_h = n$, and hence $s_{\bar{y}_{i.}} = \sqrt{MS_E / n}$.

The significant ranges are calculated as

\[ R_k = r_{\alpha}(k, N-p) \, s_{\bar{y}_{i.}}, \quad k = 2, 3, \ldots, p \]

where the values of $r_{\alpha}(k, N-p)$ are obtained from a table given by Duncan. Then the observed differences between means are tested, beginning with the largest versus the smallest, which is compared with the least significant range $R_p$. Next, the difference between the largest and the second smallest is computed and compared with the least significant range $R_{p-1}$. Similarly, the difference between the second largest and the smallest is computed and compared with the least significant range $R_{p-1}$. This process is continued until the differences of all possible p(p–1)/2 pairs of means have been considered. If an observed difference is greater than the corresponding least significant range, then we conclude that the pair of means in question is significantly different.

3.3.3 The Newman-Keuls Test

This test is similar to Duncan’s multiple range test, except that the critical differences between means are calculated differently. Here we compute a set of critical values

\[ K_k = q_{\alpha}(k, N-p) \, s_{\bar{y}_{i.}}, \quad k = 2, 3, \ldots, p \]

where $q_{\alpha}(k, N-p)$ is the upper $\alpha$ percentage point of the Studentized range for groups of means of size k and N – p error degrees of freedom.

The Studentized range is defined as

\[ q = \frac{\bar{y}_{\max} - \bar{y}_{\min}}{\sqrt{MS_E / n}} . \]

3.3.4 Tukey’s Test

Tukey proposed a multiple comparison procedure based on the Studentized range statistic. His procedure requires the use of $q_{\alpha}(p, N-p)$ to determine the critical value for all pairwise comparisons, regardless of how many means are in the group. Thus, Tukey’s test declares two means significantly different if the absolute value of their sample difference exceeds

\[ T_{\alpha} = q_{\alpha}(p, N-p) \, s_{\bar{y}_{i.}} . \]

3.4 Result Summary

We first test the equality of the mean number of students in the nine programs. The summary results are presented in Table 3.3.

Table 3.3: ANOVA Table for the Equality of Means Test of Nine Programs

Source     SS           DF    MS          F Ratio   p-value
Programs   998821022    8     124852628   5.06      0.000
Error      7322357160   297   24654401
Total      8321178183   305

Table 3.3 clearly shows that the program effect is highly significant. So we must reject the hypothesis of equal means for the nine programs.
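As a quick arithmetic check, the mean squares and the F ratio in Table 3.3 can be recomputed from the reported sums of squares and degrees of freedom. The short sketch below uses only the figures printed in the table; the variable names are our own.

```python
# Sums of squares and degrees of freedom as reported in Table 3.3.
ss_programs, df_programs = 998821022, 8
ss_error, df_error = 7322357160, 297

ms_programs = ss_programs / df_programs   # MS(P) = SS(P) / (m - 1)
ms_error = ss_error / df_error            # MS(E) = SS(E) / (n - m)
f_ratio = ms_programs / ms_error          # F = MS(P) / MS(E)

print(round(ms_programs), round(ms_error), round(f_ratio, 2))
```

The rounded values agree with the MS and F Ratio columns of Table 3.3. The degrees of freedom are also consistent with nine programs of 34 observations each, since 9 – 1 = 8 and 9 × 34 – 9 = 297.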

To find out which programs differ significantly from the others, we report Tukey’s test and Fisher’s LSD, as they are very effective and readily available in MINITAB. Here we present only the summary results; the detailed results are presented in the Appendix.

Grouping Information Using Tukey Method

Program           N    Mean   Grouping
Engineering       34   4524   A
Medical Science   34   4490   A
Natural Science   34   4284   A B
Social Science    34   3459   A B C
Humanities        34   1847   A B C
Education         34   859    A B C
Law               34   655      B C
Fine Arts         34   185        C
Agriculture       34   85         C

Programs that do not share a letter have significantly different means. Tukey’s test shows that Engineering and Medical Science have the highest mean enrollments and belong only to group A, whereas Fine Arts and Agriculture have the lowest mean enrollments and belong only to group C.

Grouping Information Using Fisher Method

Program           N    Mean   Grouping
Engineering       34   4524   A
Medical Science   34   4490   A
Natural Science   34   4284   A
Social Science    34   3459   A B
Humanities        34   1847     B C
Education         34   859        C
Law               34   655        C
Fine Arts         34   185        C
Agriculture       34   85         C

Fisher’s LSD, however, places Engineering, Medical Science and Natural Science alone in the highest group, so these are the programs that most Saudi Arabia students go abroad to study, while the least popular programs are Agriculture, Fine Arts, Law and Education.
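The Fisher grouping can be checked approximately by hand with the balanced-design LSD formula of Section 3.3.1. The sketch below uses the error mean square from Table 3.3 and the program means listed above. The critical value 1.968 is an assumed approximation to the upper 2.5% point of the t distribution with 297 degrees of freedom, read from standard tables rather than taken from the MINITAB output.

```python
import math
from itertools import combinations

ms_error = 24654401    # MS(E) from Table 3.3
n_per_group = 34       # observations per program
t_crit = 1.968         # assumed approximate t quantile (0.975, 297 df)

# Balanced design: LSD = t * sqrt(2 * MS_E / n)
lsd = t_crit * math.sqrt(2 * ms_error / n_per_group)

# Program means as listed in the grouping table above.
means = {
    "Engineering": 4524, "Medical Science": 4490, "Natural Science": 4284,
    "Social Science": 3459, "Humanities": 1847, "Education": 859,
    "Law": 655, "Fine Arts": 185, "Agriculture": 85,
}

# Pairs whose absolute mean difference exceeds the LSD.
different = [(a, b) for a, b in combinations(means, 2)
             if abs(means[a] - means[b]) > lsd]

print(f"LSD is approximately {lsd:.0f}")
for a, b in different:
    print(f"{a} vs {b}: difference {abs(means[a] - means[b])}")
```

Under this assumption the LSD is roughly 2370 students, so, for example, Engineering and Humanities differ significantly while Social Science and Humanities do not, in line with the letters in the grouping table.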


CHAPTER 4

Modeling and Fitting of Data Using Regression Diagnostics and Robust Regression

In this chapter we first discuss the classical regression method with diagnostics and then discuss some robust methods that are commonly used in regression. We will employ these methods to investigate which variables have a significant impact on the number of Saudi Arabia students studying abroad.

4.1 Classical Regression Analysis

Regression is probably the most popular and commonly used statistical method in all branches of

knowledge. It is a conceptually simple method for investigating functional relationships among

variables. The user of regression analysis attempts to discern the relationship between a dependent

(response) variable and one or more independent (explanatory/predictor/regressor) variables.

Regression can be used to predict the value of a response variable from knowledge of the values

of one or more explanatory variables.

We write the multiple regression model as

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{4.1} \]

where Y is the dependent variable, the X’s are the independent variables, and $\varepsilon$ is the error term. Here we have one dependent variable and k explanatory variables, excluding the intercept term. This model is also called a k + 1 variable regression model.

The assumptions of the multiple regression model are quite similar to those of the two-variable linear regression model:

1. The relationship between Y and X is linear, but no exact linear relationship exists between two or more X’s.
2. The X’s are nonstochastic variables whose values are fixed.
3. The error has zero expected value: $E(\varepsilon_i) = 0$.
4. The error term has constant variance for all observations, i.e., $E(\varepsilon_i^2) = \sigma^2$, i = 1, 2, …, n.
5. The random variables $\varepsilon_i$ are statistically independent. Thus, $E(\varepsilon_i \varepsilon_j) = 0$ for all $i \neq j$.
6. The error term is normally distributed.

4.1.1 Estimation Technique

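We can express the multiple regression model in matrix notation as:

\[ Y = X\beta + \varepsilon \tag{4.2} \]

where

\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} .
\]

We obtain the OLS estimates of the k + 1 unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$ in such a way that the sum of squares (SS)

\[ \sum_{i=1}^{n} \varepsilon_i^2 = (Y - X\beta)'(Y - X\beta) \]

is minimized.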

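The value of $\beta$ that minimizes SS is given by the solution to

\[ \frac{\partial SS}{\partial \beta} = 0 . \]

We get

\[ -2X'Y + 2X'X\hat{\beta} = 0 \quad \Rightarrow \quad \hat{\beta} = (X'X)^{-1} X'Y \tag{4.3} \]

We also have

\[ V(\hat{\beta}) = \sigma^2 (X'X)^{-1} \tag{4.4} \]

For this model, the residuals are

\[ \hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \cdots - \hat{\beta}_k X_{ki}, \quad i = 1, 2, \ldots, n \tag{4.5} \]

An unbiased and consistent estimate of $\sigma^2$ is $s^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 / (n - k - 1)$. The estimated standard error of $\hat{\beta}_j$ is $s_{\hat{\beta}_j} = \sqrt{s^2 V_j}$, where $V_j$ is the j-th diagonal element of $(X'X)^{-1}$. When the errors are normally distributed, then

\[ \frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim t_{n-k-1} . \]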
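Equations (4.3) to (4.5) can be sketched in plain Python as follows. This is a minimal illustration that inverts X'X by Gauss-Jordan elimination; it is not numerically robust for ill-conditioned designs, and the function names and the small data set are hypothetical.

```python
import math

def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def ols(x_rows, y):
    """OLS fit of y on the regressors in x_rows, with an intercept added."""
    X = [[1.0] + list(row) for row in x_rows]
    n, p = len(X), len(X[0])                       # p = k + 1 parameters
    xtx = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
           for r in range(p)]
    xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    xtx_inv = inverse(xtx)
    beta = [sum(xtx_inv[r][c] * xty[c] for c in range(p)) for r in range(p)]  # (4.3)
    fitted = [sum(b * xi for b, xi in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]                            # (4.5)
    s2 = sum(e * e for e in resid) / (n - p)       # estimate of sigma^2
    se = [math.sqrt(s2 * xtx_inv[j][j]) for j in range(p)]
    t = [b / s for b, s in zip(beta, se)]          # t statistics for H0: beta_j = 0
    return {"beta": beta, "se": se, "t": t, "s2": s2,
            "fitted": fitted, "residuals": resid}

# Hypothetical data with a single explanatory variable.
fit = ols([(1,), (2,), (3,), (4,)], [2, 4, 5, 8])
print("beta:", fit["beta"])
print("standard errors:", fit["se"])
```

For this small data set the slope and intercept can also be verified by hand from the usual two-variable formulas, which gives a simple check on the matrix computation.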

4.1.2 Checking for Goodness of Fit

We can use the 2R statistic as a measure of goodness of fit for the multiple regression model. We

know that

2R = TSS

RSS = 1 –

TSS

ESS = 1 –

n

ii

n

ii

YY1

2

1

2

(4.6)

$R^2$ is the proportion of the total variation in Y explained by the regression of Y on X. It is easy to show that $R^2$ ranges in value between 0 and 1, but it is only a descriptive statistic. Roughly speaking, we associate a high value of $R^2$ (close to 1) with a good fit of the model by the regression line and a low value of $R^2$ (close to 0) with a poor fit. How large must $R^2$ be for the regression equation to be useful? That depends upon the area of application. If we could develop a regression equation to predict the stock market, we would be ecstatic if $R^2$ = 0.50. On the other hand, if we were predicting deaths in road accidents, we would want the prediction equation to have strong predictive ability, since the consequences of poor prediction could be quite serious.

But the difficulty with $R^2$ as a measure of goodness of fit is that it does not account for the number of degrees of freedom. A natural solution is to use variances, not variations, which leads to a corrected (adjusted) $R^2$, defined as

$\bar R^2 = 1 - [\text{Estimated } V(\epsilon) / \text{Estimated } V(Y)]$

Now

Estimated $V(\epsilon) = s^2 = \sum_{i=1}^{n} \hat\epsilon_i^2 / (n - k - 1)$

and

Estimated $V(Y) = \sum_{i=1}^{n} (Y_i - \bar Y)^2 / (n - 1)$

Thus the corrected $R^2$ becomes

$\bar R^2 = 1 - \frac{\sum_{i=1}^{n} \hat\epsilon_i^2}{\sum_{i=1}^{n} (Y_i - \bar Y)^2} \cdot \frac{n - 1}{n - k - 1} = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}$ (4.7)
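Equation (4.7) can be checked numerically. The short sketch below is only an illustration (the function name adjusted_r2 is hypothetical); the inputs are the R-Sq values that MINITAB reports in Section 4.4 for n = 34 observations with one and with two explanatory variables.

```python
def adjusted_r2(r2, n, k):
    """Equation (4.7): R^2 adjusted for k explanatory variables and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# R-Sq values as reported in the stepwise output of Section 4.4 (n = 34).
one_variable = adjusted_r2(0.8295, 34, 1)   # budget in higher education only
two_variables = adjusted_r2(0.8872, 34, 2)  # budget in higher education and oil price
print(round(one_variable, 4), round(two_variables, 4))
```

The two results are about 0.824 and 0.880, which agree with the reported R-Sq(adj) values of 82.42% and 87.99%.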

4.1.3 Tests of Regression Coefficients

We often like to establish that the explanatory variable X has a significant effect on Y, that is, that the coefficient of X (which is $\beta$) is significant. In this situation the null hypothesis is constructed in a way that makes its rejection possible. We begin with a null hypothesis, which usually states that a certain effect is not present, i.e., $\beta = 0$. We estimate $\hat\beta$ and its standard error $s_{\hat\beta}$ from the data and compute the statistic

$t = \hat\beta / s_{\hat\beta} \sim t_{n-k-1}$ (4.8)

4.2 Regression Diagnostics

Diagnostics are designed to find problems with the assumptions of any statistical procedure. In the diagnostic approach we estimate the parameters (in regression, fit the model) by the classical method (the OLS) and then check whether there is any violation of assumptions and/or irregularity in the results regarding the six standard assumptions mentioned at the beginning of this section. Among them, the assumption of normality is the most important.

4.2.1 Test for Normality

The normality assumption means that the errors are normally distributed. The simplest graphical display for checking normality in regression analysis is the normal probability plot. This method is based on the fact that if the ordered residuals are plotted against their cumulative probabilities on normal probability paper, the resulting points should lie approximately on a straight line. An excellent review of different analytical tests for normality is available in Imon (2003). A test based on the correlation of true observations and the expectation of normalized order statistics is known as the Shapiro-Wilk test. A test based on the empirical distribution function is known as the Anderson-Darling test. It is often very useful to test whether a given data set approximates a normal distribution. This can be evaluated informally by checking whether the mean and the median are nearly equal, whether the skewness is approximately zero, and whether the kurtosis is close to 3. A more formal test for normality is given by the Jarque-Bera statistic:

JB = $[n / 6] [S^2 + (K - 3)^2 / 4]$ (4.9)

where S is the sample skewness and K is the sample kurtosis.

Imon (2003) suggests a slight adjustment to the JB statistic to make it more suitable for regression problems. His proposed statistic, based on rescaled moments (RM) of ordinary least squares residuals, is defined as

RM = $[n c^3 / 6] [S^2 + c (K - 3)^2 / 4]$ (4.10)

where c = n/(n – k) and k is the number of independent variables in the regression model. Both the JB and the RM statistics asymptotically follow a chi-square distribution with 2 degrees of freedom. If the values of these statistics are greater than the critical value of the chi-square distribution, we reject the null hypothesis of normality.
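The two statistics in (4.9) and (4.10) can be sketched in a few lines. The function name jb_and_rm is hypothetical, and the skewness and kurtosis are computed here from the plain sample moments, which is an assumption about the convention intended in the text.

```python
def jb_and_rm(residuals, k):
    """Return (JB, RM) as in (4.9) and (4.10); k = number of independent variables."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n   # second central moment
    m3 = sum((e - mean) ** 3 for e in residuals) / n   # third central moment
    m4 = sum((e - mean) ** 4 for e in residuals) / n   # fourth central moment
    S = m3 / m2 ** 1.5                                  # skewness (moment-based, assumption)
    K = m4 / m2 ** 2                                    # kurtosis (moment-based, assumption)
    jb = (n / 6.0) * (S ** 2 + (K - 3.0) ** 2 / 4.0)
    c = n / (n - k)                                     # rescaling factor of (4.10)
    rm = (n * c ** 3 / 6.0) * (S ** 2 + c * (K - 3.0) ** 2 / 4.0)
    return jb, rm

# Hypothetical residuals, chosen only to exercise the formulas.
jb, rm = jb_and_rm([-2.0, -1.0, 0.0, 1.0, 2.0], k=1)
print(jb, rm)
```

Either value would be compared with the chi-square critical value with 2 degrees of freedom (about 5.99 at the 5% level). Since c > 1 whenever k ≥ 1, RM is never smaller than JB for the same residuals.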

4.2.2 Outliers

In statistics we often observe that the values of descriptive measures are heavily influenced by a few extreme observations, which are commonly known as outliers. According to Barnett and Lewis (1993), ‘Observations which stand apart from the bulk of the data are called outliers.’ Different aspects of outliers and their consequences are discussed by Hadi, Imon and Werner (2009). Hampel et al. (1986) claim that a routine data set typically contains about 1-10% outliers, and even the highest quality data set cannot be guaranteed free of outliers. Barnett and Lewis (1993) commented ‘Any outliers, however, are always extreme values in the sample.’ But this statement is not always true, especially in regression analysis.

In a regression problem, observations are judged as outliers on the basis of how unsuccessful the

fitted regression equation is in accommodating them and that is why observations corresponding

to excessively large residuals are treated as outliers.

Types of Outliers

X – Outlier: This is a point that is outlying in regard to the x–coordinate. In the literature an X–

outlier is more popularly known as a high leverage point.

Y – Outlier: This is a point that is outlying only because its y–coordinate is extreme.

X – and Y – Outlier: A point that is outlying in both x and y coordinates is known as x – and y –

outlier.

Residual Outlier: This is a point that has a large standardized (deletion) residual. Most of the

commonly used outlier detection methods are based on this approach where an observation is

judged as outlier on the basis of how unsuccessful the fitted regression equation is in

accommodating it.

Detection of Outliers

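We often use the following three types of residuals for the identification of outliers.

Standardized residuals: $d_i = \frac{y_i - x_i^T \hat\beta}{\hat\sigma}$, i = 1, 2, …, n (4.11)

Studentized residuals: $r_i = \frac{y_i - x_i^T \hat\beta}{\hat\sigma \sqrt{1 - w_{ii}}}$, i = 1, 2, …, n (4.12)

Deletion Studentized (Externally Studentized or R-Student) residuals: $t_i = \frac{y_i - x_i^T \hat\beta}{\hat\sigma_{(-i)} \sqrt{1 - w_{ii}}}$, i = 1, 2, …, n (4.13)

where $w_{ii}$ is the i-th diagonal element of the matrix $W = X(X^TX)^{-1}X^T$ and $\hat\sigma_{(-i)}^2$ is the OLS estimate of the mean squared error (MSE) based on a data set with the i-th observation deleted.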

As a rule of thumb, we call an observation an outlier when its corresponding residual exceeds 3 in absolute value. A good review of recent outlier detection techniques in linear regression is available in Imon (2008), and Hadi, Imon and Werner (2009).
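The three residuals in (4.11) to (4.13) can be computed together from one OLS fit. The sketch below is illustrative only: the function name and the toy data are hypothetical, and the deleted-case MSE is obtained from the standard updating identity rather than by refitting the model n times.

```python
import numpy as np

def outlier_residuals(X, y):
    """Standardized, Studentized and deletion Studentized residuals; X includes the ones column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    W = X @ np.linalg.inv(X.T @ X) @ X.T       # matrix W with diagonal elements w_ii
    w = np.diag(W)
    e = y - W @ y                              # OLS residuals
    sse = e @ e
    sigma = np.sqrt(sse / (n - p))             # sigma-hat from the full data
    d = e / sigma                              # (4.11)
    r = e / (sigma * np.sqrt(1.0 - w))         # (4.12)
    # MSE with the i-th observation deleted, without refitting the model
    s2_del = (sse - e ** 2 / (1.0 - w)) / (n - p - 1)
    t = e / np.sqrt(s2_del * (1.0 - w))        # (4.13)
    return d, r, t

# Hypothetical data (assumption): a straight line with one contaminated response.
x = np.arange(10.0)
y = 2.0 + 0.5 * x + 0.1 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
y[7] += 10.0                                   # planted outlier
X = np.column_stack([np.ones_like(x), x])
d, r, t = outlier_residuals(X, y)
flagged = np.where(np.abs(t) > 3)[0]           # rule of thumb: |residual| > 3
print(flagged)
```

In this toy example the deletion residual isolates the contaminated case more sharply than the other two, because the planted outlier also inflates the full-data estimate of sigma.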

4.2.3 Multicollinearity

One basic assumption of the multiple regression model is that there is no exact linear relationship

between any of the independent variables in the model. If such an exact linear relationship does

exist, we say that the independent variables are perfectly collinear or that perfect collinearity exists.

Multicollinearity arises when two or more variables (or combinations of variables) are highly

correlated with each other.

Effects of Multicollinearity

Wrong interpretation of the regression coefficients

Large variances and covariances for the OLS estimators of the regression parameters

Unduly large (in absolute value) estimates of the regression parameters

Indications of Multicollinearity

High Correlation Values

Calculate the correlation coefficients between all pairs of explanatory variables and test the maximum (in absolute value) correlation coefficient $r_{ij}$ by the statistic

$t = \frac{r_{ij} \sqrt{n - 2}}{\sqrt{1 - r_{ij}^2}} \sim t_{n-2}$

There is evidence of multicollinearity at the 5% level of significance if

$|t| > t_{n-2, 0.975}$

Large Variance Inflation Factor

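We know that the variance of $\hat\beta_j$ is $\sigma^2 V_j$, where $V_j$ is the j-th diagonal element of $(X^TX)^{-1}$. Consequently $V(\hat\beta_j)$ is large if $V_j$ is large. Hence $V_j$ will be called the variance inflation factor (VIF) of the explanatory variable $X_j$. When the explanatory variables are centered and scaled to unit length, $V_j$ equals $1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from the regression of $X_j$ on the other explanatory variables. One or more large VIFs indicate multicollinearity.

Thumb rule: VIF < 5 No multicollinearity

5 ≤ VIF ≤ 10 Moderate multicollinearity

VIF > 10 Severe multicollinearity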

Large Condition Number

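A condition number is associated with the characteristic roots (eigenvalues) of the matrix $X^TX$. The condition number of $X^TX$ is defined as

$\kappa = \sqrt{\lambda_{max} / \lambda_{min}}$

where $\lambda_{max}$ and $\lambda_{min}$ are the largest and the smallest eigenvalues. A large condition number indicates the existence of multicollinearity.

Thumb rule: $\kappa$ < 10 No multicollinearity

10 ≤ $\kappa$ ≤ 30 Moderate multicollinearity

$\kappa$ > 30 Severe multicollinearity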

Low Tolerance Value

Tolerance values are defined as the inverse of VIF values. In other words, we can define

Tolerance value = 1/VIF

Since tolerance values are the inverse of VIFs, low tolerance values indicate a multicollinearity problem.

Thumb rule: Tolerance > 0.2 No multicollinearity

0.1 ≤ Tolerance ≤ 0.2 Moderate multicollinearity

Tolerance < 0.1 Severe multicollinearity
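The three indicators above (VIF, tolerance and condition number) can be obtained together. The following sketch is a simplified illustration with hypothetical names and data; it works with the correlation matrix of the explanatory variables, that is, it assumes the variables have been centered and scaled, which may differ from the scaling used by a particular software package.

```python
import numpy as np

def collinearity_diagnostics(X):
    """X: n-by-k matrix of explanatory variables WITHOUT the intercept column."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)           # correlation matrix of the regressors
    vif = np.diag(np.linalg.inv(R))            # VIF_j = j-th diagonal of R^{-1}
    tolerance = 1.0 / vif                      # tolerance = 1 / VIF
    eig = np.linalg.eigvalsh(R)                # eigenvalues of the scaled cross-product matrix
    kappa = np.sqrt(eig.max() / eig.min())     # condition number (square-root form, assumption)
    return vif, tolerance, kappa

# Hypothetical data (assumption): two regressors with correlation 0.8.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
vif, tolerance, kappa = collinearity_diagnostics(X)
print(vif, tolerance, kappa)
```

With two regressors the VIF reduces to 1/(1 − r²), so a correlation of 0.8 gives a VIF of about 2.78 and a tolerance of 0.36, which the thumb rules above classify as no multicollinearity.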

4.2.4 Variable Selection

In some applications theoretical considerations or prior experience can be helpful in selecting the

regressors to be used in the model. Building a regression model that includes only a subset of

available regressors involves two conflicting objectives.

1. We would like the model to include as many regressors as possible so that the information

content in these factors can influence the fitted value of the response.

2. We want the model to include as few regressors as possible because the variance of the fitted

response increases as the number of regressors increases. Also the more regressors there are in a

model, the greater the cost of data collection and model maintenance.

Finding an appropriate subset of regressors for the model is called the variable selection problem.

Graphical Methods

A number of graphical displays are used for variable selection. Here is a list of few of them

Added Variable Plot

Partial Residual (PR) plot (Ezekiel, 1924)

Component and Component-plus-residual (CCPR) plot (Wood, 1973)

Augmented Partial Residual (APR) plot (Mallows, 1986)

Conditional Expectation and Residual (CERES) plot (Cook, 1993)

Robust Added Variable Plot (Imon, 2003)

Model Selection Criteria

Minimum Residual Mean Square (RMS)

$\hat\sigma^2 = \frac{SSE}{n - k - 1}$

where SSE = $\sum_{i=1}^{n} (y_i - \hat y_i)^2$ is the residual sum of squares, n is the number of observations, k is the number of explanatory variables.

Maximum R-Square

$R^2 = 1 - \frac{SSE}{SST}$,

where SST = $\sum_{i=1}^{n} (y_i - \bar y)^2$ is the total sum of squares.

Maximum Adjusted R-Square

$R_a^2 = 1 - \frac{SSE / (n - k - 1)}{SST / (n - 1)}$

Akaike Information Criterion

For a model with p = k + 1 predictors including the intercept, the Akaike information criterion suggests choosing the p for which the statistic

AIC (p) = $\ln\left( \frac{1}{n} \sum_{i=1}^{n} \hat\epsilon_i^2 \right) + \frac{2p}{n}$

is minimized. This statistic imposes a penalty for including insignificant variables.

Mallows Cp

For a model with p predictors (including the intercept),

$C_p = \frac{Y^T (I - W) Y}{\hat\sigma^2} + (2p - n)$,

where $\hat\sigma^2$ is a good estimate of $\sigma^2$ (usually obtained from the full model). The above expression can be reexpressed as

$C_p = \frac{(n - p) \hat\sigma_p^2}{\hat\sigma^2} + (2p - n)$,

where $\hat\sigma_p^2$ is the MSE from the sub model. It is straightforward to show that for the full model $C_p$ = p. But here we search for a sub model where $C_p$ ≈ p for a value of p which is less than the value of p for the full model.

Other Model Selection Criteria

Schwarz Criterion (SC)

Bayesian Information Criterion (BIC)

Final Prediction Error (FPE) or Prediction Criterion (PC)

Hannan-Quinn Criterion (HQC)

Variable Selection Methods

Forward Selection

Start with the empty model, then add the most significant variable (the one with the largest absolute t-value or smallest p-value). Repeat until all candidate variables to enter the model have insignificant regression coefficients.

Backward Elimination

Start with the full model, then delete the least significant variable (the one with the smallest absolute t-value or largest p-value). Repeat until all regression coefficients in the model are significant.

Stepwise Method

This is a combination of forward selection and backward elimination methods.

4.3 Robust Regression

Robustness now plays a key role in regression and time series analysis. According to Kadane (1984) ‘Robustness is a fundamental issue for all statistical analyses; in fact it might be argued that robustness is the subject of statistics.’ The term robustness signifies insensitivity to small deviations from the assumptions. That means a robust procedure is nearly as efficient as the classical procedure when the classical assumptions hold strictly but is considerably more efficient overall when there is a small departure from them. The main application of robust techniques in a regression or time series problem is to devise estimators that are not strongly affected by outliers or departures from the assumed model. Robust techniques grew up in parallel to diagnostics [see Hampel et al. (1986)] and initially they were used to estimate parameters and to construct confidence intervals in such a way that outliers or departures from the assumptions do not affect them. A large body of literature is now available [Rousseeuw and Leroy (1987), Maronna, Martin, and Yohai (2006), Hadi, Imon and Werner (2009)] for robust techniques that are readily applicable in linear regression or in time series.

4.3.1. L – estimator

A first step toward a more robust estimator was the consideration of the least absolute values estimator (often referred to as the L – estimator). In the OLS method, outliers may have a very large influence since the parameters are estimated by minimizing the sum of squared residuals

$\sum_{t=1}^{n} \hat u_t^2$

L estimates are considered to be less sensitive since they are determined by minimizing the sum of absolute residuals

$\sum_{t=1}^{n} |\hat u_t|$

The L estimator was first introduced by Edgeworth in 1887, who argued that the OLS method is overly influenced by outliers, but because of computational difficulties it was not popular and not much used until quite recently. Sometimes the L – estimator is treated in the literature as a special case of the $L_p$-norm estimator, where the estimators are obtained by minimizing

$\sum_{t=1}^{n} |\hat u_t|^p$

The $L_1$-norm estimator is the L – estimator, while the $L_2$-norm estimator is the OLS. Unfortunately, a single erroneous observation (high leverage point) can still completely upset the L-estimator.

4.3.2. Least Median of Squares

Rousseeuw (1984) proposed the Least Median of Squares (LMS) method, which is a fitting technique less sensitive to outliers than the OLS. In OLS, we estimate the parameters by minimizing the sum of squared residuals

$\sum_{t=1}^{n} \hat u_t^2$

which is obviously the same as minimizing the mean of squared residuals

$\frac{1}{n} \sum_{t=1}^{n} \hat u_t^2$.

Sample means are sensitive to outliers, but medians are not. Hence to make the criterion less sensitive we can replace the mean by a median to obtain the median of squared residuals

MSR ($\hat\beta$) = Median {$\hat u_t^2$} (4.14)

Then the LMS estimate of $\beta$ is the value that minimizes MSR ($\hat\beta$). Rousseeuw and Leroy (1987) have shown that LMS estimates are very robust with respect to outliers and attain the highest possible breakdown point of 50%.

4.3.3. Least Trimmed Squares

The least trimmed (sum of) squares (LTS) estimator was proposed by Rousseeuw (1984). In this method we try to estimate $\beta$ in such a way that

LTS ($\hat\beta$) = minimize $\sum_{t=1}^{h} \hat u_{(t)}^2$ (4.15)

Here $\hat u_{(t)}^2$ is the t-th smallest squared residual. For a trimming percentage of $\alpha$, Rousseeuw and Leroy (1987) suggested choosing the number of observations h on which the model is fitted as h = [n (1 – $\alpha$)] + 1. The advantage of using LTS over LMS is that in the LMS we always fit the regression line based on roughly 50% of the data, but in the LTS we can control the level of trimming. When we suspect that the data contain nearly 10% outliers, the LTS with 10% trimming is likely to produce a better result than the LMS. We can increase the level of trimming if we suspect there are more outliers in the data.
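To make the two criteria in (4.14) and (4.15) concrete, the sketch below evaluates them for a given set of residuals. It only computes the objective functions and does not search for the minimizing coefficients; the function names are hypothetical, and the ordinary sample median is used for the LMS criterion, which is an assumption.

```python
def lms_objective(residuals):
    """Median of the squared residuals, as in (4.14)."""
    sq = sorted(e * e for e in residuals)
    n = len(sq)
    mid = n // 2
    return sq[mid] if n % 2 else 0.5 * (sq[mid - 1] + sq[mid])

def lts_objective(residuals, alpha):
    """Sum of the h smallest squared residuals, as in (4.15), with h = [n(1 - alpha)] + 1."""
    sq = sorted(e * e for e in residuals)
    n = len(sq)
    h = min(n, int(n * (1 - alpha)) + 1)   # never use more than n observations
    return sum(sq[:h])

# Hypothetical residuals (assumption): four ordinary values and one gross outlier.
res = [1.0, -2.0, 3.0, -4.0, 100.0]
print(lms_objective(res), lts_objective(res, alpha=0.5))
print(sum(e * e for e in res))             # the OLS criterion, dominated by the outlier
```

The single residual of 100 dominates the sum of squares but has no effect on either robust criterion, which is the property both estimators exploit.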

4.3.4 Reweighted Least Squares

Another way to obtain a set of results based on a robust fit is the method of Reweighted Least

Squares (RLS) proposed by Rousseeuw and Leroy (1987). In this method, the parameters are

estimated by the LMS method and the outliers are identified. After that the final model is fitted by

the least squares without the potential outliers. Since this fitting does not involve any outliers this

method is claimed to be more appropriate for the majority of the observations. However, the

residuals of the deleted points are reestimated from the robust fit to produce a full set of residuals.
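The RLS idea can be sketched for a single explanatory variable as follows. This is a simplified illustration and not the procedure used to produce the results in this thesis: the LMS step is approximated by trying the line through every pair of points, the preliminary scale estimate and the cutoff of 2.5 are common conventions assumed here, and all names and data are hypothetical.

```python
from itertools import combinations
from statistics import median

def simple_ols(x, y):
    """Intercept and slope of the least squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1

def lms_line(x, y):
    """Approximate LMS fit: best line through any pair of points (assumed search strategy)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        crit = median((yy - b0 - b1 * xx) ** 2 for xx, yy in zip(x, y))
        if best is None or crit < best[0]:
            best = (crit, b0, b1)
    return best

def rls_line(x, y, cutoff=2.5):
    """LMS fit, flag points with large scaled residuals, then OLS on the remaining points."""
    crit, b0, b1 = lms_line(x, y)
    n = len(x)
    scale = 1.4826 * (1 + 5 / (n - 2)) * crit ** 0.5   # preliminary robust scale (assumption)
    keep = [i for i in range(n) if abs(y[i] - b0 - b1 * x[i]) <= cutoff * scale]
    return simple_ols([x[i] for i in keep], [y[i] for i in keep]), keep

# Hypothetical data (assumption): y = 2 + 3x with small noise and two gross outliers.
x = [float(i) for i in range(12)]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.2, 0.0, 0.0]
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]
y[10] -= 30
y[11] -= 30
(b0_rls, b1_rls), kept = rls_line(x, y)
b0_ols, b1_ols = simple_ols(x, y)
print(b1_ols, b1_rls, kept)
```

In this toy example the OLS slope is pulled far away from 3 by the two contaminated points, while the reweighted fit recovers a slope close to 3 after excluding them.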

4.4 Regression Results

Here we employ regression methods to understand which variables have a significant impact on the number of Saudi Arabia students studying abroad. The budget in higher education is an immediate choice. The Saudi Arabia economy relies heavily on oil, so the two other variables one can consider are oil price and oil revenue. We begin with simple linear regressions of the number of Saudi Arabia students studying abroad on the three explanatory variables, one at a time.

Figure 4.1 gives a scatter plot of the total number of students versus budget in higher education. We observe an upward and strong linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ is 0.83 and the coefficient of budget in higher education is highly significant (p-value 0.000).

[Scatter plot of Total No. of Students (vertical axis) versus Budget in HE (horizontal axis)]

Figure 4.1: Scatter Plot of the Total Number of Students vs Budget in Higher Education

Regression Analysis: Total No. of Students versus Budget in HE

The regression equation is
Total No. of Students = - 5982 + 0.000000 Budget in HE

Predictor       Coef         SE Coef      T       P       VIF
Constant        -5982        3025         -1.98   0.057
Budget in HE    0.00000046   0.00000004   12.48   0.000   1.000

S = 12621.3   R-Sq = 83.0%   R-Sq(adj) = 82.4%

Figure 4.2 gives a scatter plot of the total number of students versus oil price. We observe an upward and linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ is 0.529, which is not great. The graph also suggests that there are probably a few outliers in the data. So we think it is a good idea to employ a robust regression here. We fit the data by the reweighted least squares (RLS) method and the fitted plot is presented in Figure 4.3.

[Scatter plot of Total No. of Students (vertical axis) versus Oil Price (horizontal axis)]

Figure 4.2: Scatter Plot of the Total Number of Students vs Oil Price

Regression Analysis: Total No. of Students versus Oil Price

The regression equation is
Total No. of Students = - 19210 + 860 Oil Price

Predictor    Coef     SE Coef   T       P       VIF
Constant     -19210   7525      -2.55   0.016
Oil Price    860.3    143.6     5.99    0.000   1.000

S = 20985.0   R-Sq = 52.9%   R-Sq(adj) = 51.4%

[Plot of Total No. of Students (vertical axis) versus Oil Price (horizontal axis) with the OLS and RLS fitted lines]

Figure 4.3: RLS and OLS Fit of the Total Number of Students vs Oil Price

Regression Analysis: Total No. of Students_1 versus Oil Price_1

The regression equation is
Total No. of Students_1 = - 29017 + 1363 Oil Price_1

Predictor      Coef      SE Coef   T        P
Constant       -29017    2561      -11.33   0.000
Oil Price_1    1362.94   56.54     24.10    0.000

S = 6886.77   R-Sq = 96.2%   R-Sq(adj) = 96.0%

We observe from Figure 4.3 that the robust RLS fits the data much better than the traditionally used OLS. Now we observe an upward and very linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ increases from 0.529 to 0.962, which is a huge improvement. So we can say that robust regression performs much better than the classical regression method here.

[Scatter plot of Total No. of Students (vertical axis) versus Oil Revenue (horizontal axis)]

Figure 4.4: Scatter Plot of the Total Number of Students vs Oil Revenue

Figure 4.4 gives a scatter plot of the total number of students versus oil revenue. We observe an upward and linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ is 0.786, which is good.

Regression Analysis: Total No. of Students versus Oil Revenue

The regression equation is
Total No. of Students = - 6054 + 0.0797 Oil Revenue

Predictor      Coef       SE Coef    T       P
Constant       -6054      3443       -1.76   0.088
Oil Revenue    0.079707   0.007362   10.83   0.000

S = 14154.7   R-Sq = 78.6%   R-Sq(adj) = 77.9%

Since each of the three explanatory variables shows a linear relationship with the total number of

students studying abroad, now we fit a multiple linear regression model.

Response variable: The total number of students studying abroad

Explanatory variables: Budget in higher education, Oil price, and Oil revenue.

Regression Analysis: Total No. of versus Budget in HE, Oil Revenue, Oil Price

The regression equation is
Total No. of Students = - 18688 + 0.000000 Budget in HE - 0.0127 Oil Revenue
                        + 417 Oil Price

Predictor       Coef         SE Coef      T       P       VIF
Constant        -18688       4476         -4.18   0.000
Budget in HE    0.00000042   0.00000008   5.23    0.000   6.812
Oil Revenue     -0.01267     0.01897      -0.67   0.509   12.003
Oil Price       417.3        134.2        3.11    0.004   3.471

S = 10526.9   R-Sq = 88.9%   R-Sq(adj) = 87.8%

The attached MINITAB output for multiple regression is quite confusing. Here the value of $R^2$ is 0.889, which is good, but we observe that the effect of oil revenue is negative, which completely conflicts with our findings in Figure 4.4. It may be a clear case of the wrong sign problem caused by multicollinearity. We checked the VIF values and found the largest one to be 12.003, which shows that this model is severely affected by multicollinearity.

The above results suggest that we cannot keep all three explanatory variables in the model. To determine which of the explanatory variables should remain in the model, we apply the forward selection, backward elimination and stepwise regression methods, and the MINITAB results are reported below.

Stepwise Regression: Total No. of versus Oil Revenue, Budget in HE, ...

Forward selection. Alpha-to-Enter: 0.05

Response is Total No. of Students on 3 predictors, with N = 34

Step            1         2
Constant        -5982     -17088

Budget in HE    0.00000   0.00000
T-Value         12.48     9.92
P-Value         0.000     0.000

Oil Price                 350
T-Value                   3.98
P-Value                   0.000

S               12621     10432
R-Sq            82.95     88.72
R-Sq(adj)       82.42     87.99
Mallows Cp      16.0      2.4

Stepwise Regression: Total No. of versus Oil Revenue, Budget in HE, ...

Backward elimination. Alpha-to-Remove: 0.05

Response is Total No. of Students on 3 predictors, with N = 34

Step            1         2
Constant        -18688    -17088

Oil Revenue     -0.013
T-Value         -0.67
P-Value         0.509

Budget in HE    0.00000   0.00000
T-Value         5.23      9.92
P-Value         0.000     0.000

Oil Price       417       350
T-Value         3.11      3.98
P-Value         0.004     0.000

S               10527     10432
R-Sq            88.88     88.72
R-Sq(adj)       87.77     87.99
Mallows Cp      4.0       2.4

Stepwise Regression: Total No. of versus Oil Revenue, Budget in HE, ...

Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05

Response is Total No. of Students on 3 predictors, with N = 34

Step            1         2
Constant        -5982     -17088

Budget in HE    0.00000   0.00000
T-Value         12.48     9.92
P-Value         0.000     0.000

Oil Price                 350
T-Value                   3.98
P-Value                   0.000

S               12621     10432
R-Sq            82.95     88.72
R-Sq(adj)       82.42     87.99
Mallows Cp      16.0      2.4
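The Mallows Cp values in the outputs above can be reproduced from the reported S values using the second form of the Cp expression in Section 4.2.4. The sketch below is only a check under stated assumptions: the function name is hypothetical, p counts the intercept, and the rounded S values from the outputs are used, so agreement is to about one decimal place.

```python
def mallows_cp(s_sub, s_full, n, p):
    """C_p = (n - p) * s_sub^2 / s_full^2 + (2p - n), with p counting the intercept."""
    return (n - p) * s_sub ** 2 / s_full ** 2 + (2 * p - n)

n = 34
s_full = 10527                                   # S of the full three-variable model
cp_budget = mallows_cp(12621, s_full, n, 2)      # budget in HE only
cp_model_a = mallows_cp(10432, s_full, n, 3)     # budget in HE and oil price
cp_full = mallows_cp(s_full, s_full, n, 4)       # full model
print(round(cp_budget, 1), round(cp_model_a, 1), round(cp_full, 1))
```

The results are approximately 16.0, 2.4 and 4.0, in line with the reported values; for the full model Cp equals p exactly, as noted in Section 4.2.4.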

All three methods reach exactly the same conclusion, i.e. the explanatory variables that we should keep in our study are budget in higher education and oil price. Let us denote this as Model A.

Regression Analysis: Model A: Total No. of Stu versus Budget in HE, Oil Price

The regression equation is
Total No. of Students = - 17088 + 0.000000 Budget in HE + 350 Oil Price

Predictor       Coef         SE Coef      T       P       VIF
Constant        -17088       3747         -4.56   0.000
Budget in HE    0.00000037   0.00000004   9.92    0.000   1.519
Oil Price       350.09       87.97        3.98    0.000   1.519

S = 10432.5   R-Sq = 88.7%   R-Sq(adj) = 88.0%

The attached MINITAB output for Model A looks better now. Here the value of $R^2$ is 0.887, which is good, but more importantly we see that the effects of both explanatory variables are positive and statistically significant.

[Normal probability plot of the residuals with 95% CI. Mean 0, StDev 10111, N 34, AD 0.906, P-Value 0.019]

Figure 4.5: Normal Probability Plot of the Residuals for Model A

But when we look at the normal probability plot of the residuals in Figure 4.5, we do not feel very good about Model A. For this particular case the value of the Jarque-Bera statistic is 6.72 (p-value 0.0347) and that of the RM statistic is 8.37 (p-value 0.0152). So both tests reject the assumption of normality of errors at the 5% level and thus the model looks questionable. As an alternative we fit the same model by the robust reweighted least squares (RLS) method and call it Model B.

Regression Analysis: Model B: Total No. of Stu versus Budget in HE_1, Oil Price_1

The regression equation is
Total No. of Students_1 = - 24848 + 0.000000 Budget in HE_1 + 992 Oil Price_1

Predictor         Coef         SE Coef      T       P
Constant          -24848       2647         -9.39   0.000
Budget in HE_1    0.00000016   0.00000005   2.91    0.008
Oil Price_1       991.7        136.8        7.25    0.000

S = 5984.88   R-Sq = 97.2%   R-Sq(adj) = 97.0%

The attached MINITAB output shows that Model B produces an even better fit in terms of $R^2$, as its value goes up to 0.972 from 0.887 for the OLS fit. Here the effects of both explanatory variables are positive and statistically significant.

[Normal probability plot of the RLS residuals with 95% CI. Mean -7.56700E-12, StDev 5730, N 25, AD 0.532, P-Value 0.157]

Figure 4.6: Normal Probability Plot of the Residuals for Model B

For Model B, the normal probability plot of the residuals in Figure 4.6 looks much better than what we saw for Model A. For confirmation we compute the Jarque-Bera and the RM values for Model B. The value of the Jarque-Bera statistic is 1.56 (p-value 0.4584) and that of the RM statistic is 1.69 (p-value 0.4296). So neither test rejects the assumption of normality of errors and thus the model can be considered a valid one.

In the previous chapter we have seen that most of the variables we consider in our regression model show exponential growth. So it may be a good idea to fit the model using a log transformation of the response, as suggested by Montgomery et al. (2013). This third model will be denoted as Model C.
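A log-transformed response model of this kind can be sketched as follows. The sketch assumes the natural logarithm and uses hypothetical function names and toy data; forecasts on the original scale are obtained by exponentiating the fitted values.

```python
import numpy as np

def fit_log_linear(X, y):
    """OLS fit of ln(y) on the columns of X (an intercept column is added here)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, np.log(y), rcond=None)
    return beta

def predict_log_linear(beta, X):
    """Back-transform the fitted values to the original scale of the response."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.exp(X1 @ beta)

# Hypothetical data (assumption): an exactly exponential response, y = exp(1 + 0.5 x).
X = np.arange(6.0).reshape(-1, 1)
y = np.exp(1.0 + 0.5 * X[:, 0])
beta = fit_log_linear(X, y)
print(beta, predict_log_linear(beta, X))
```

Because the fit is carried out on the log scale, quantities such as S and R-Sq for this kind of model are measured on that scale and are not directly comparable with those of a model fitted to the untransformed response.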

Regression Analysis: Model C

The regression equation is
Total No. of Students_2 = 7.44 + 0.000000 Budget in HE_2 + 0.0217 Oil Price_2

Predictor         Coef         SE Coef      T       P
Constant          7.4370       0.1189       62.53   0.000
Budget in HE_2    0.00000000   0.00000000   10.12   0.000
Oil Price_2       0.021717     0.002792     7.78    0.000

S = 0.331114   R-Sq = 92.6%   R-Sq(adj) = 92.1%

[Normal probability plot of the residuals with 95% CI. Mean -4.44089E-15, StDev 0.3209, N 34, AD 0.459, P-Value 0.247]

Figure 4.7: Normal Probability Plot of the Residuals for Model C

The attached MINITAB output shows that Model C falls between Model A and Model B in terms of $R^2$. For this model the value of $R^2$ is 0.926, compared with 0.972 for Model B and 0.887 for Model A. Here the effects of both explanatory variables are positive and statistically significant.

The normal probability plot of the residuals for Model C looks good, as shown in Figure 4.7. Now we compute the Jarque-Bera and the RM values for Model C. The value of the Jarque-Bera statistic is 1.86 (p-value 0.3946) and that of the RM statistic is 1.97 (p-value 0.3734). So neither test rejects the assumption of normality of errors and thus the model can be considered a valid one.

4.5 Results Comparisons

In this section we summarize our above findings. To explain the number of students studying

abroad we began with three explanatory variables but this model failed the multicollinearity check.

After that we employed the variable selection procedure to select the best set of regressors. After

this selection was made we fit the data with three different models and the result summaries are

presented in Table 4.1.

Table 4.1: Regression Results Summary

Model            $R^2$   JB (p-value)   RM (p-value)   Normality
A: OLS           0.887   0.0347         0.0152         Rejected
B: RLS           0.972   0.4584         0.4296         Accepted
C: Exponential   0.926   0.3946         0.3734         Accepted

The above results suggest that the traditional least squares method performs worst among the three models considered here. It not only possesses the lowest $R^2$, it fails the normality test as well. Both the robust fit and the exponential model pass the normality test, but we put the robust RLS ahead of the exponential model in terms of both a higher $R^2$ and higher p-values in the tests of normality.

CHAPTER 5

Cross Validation of Forecasts

In this chapter our main objective is to evaluate forecasts made by different regression methods and models. We employ the cross validation method for this purpose.

5.1 Evaluation of Forecasts by Cross Validation

Cross-validation is a technique for assessing how the results of a statistical analysis will generalize

to an independent data set. It is mainly used in settings where the goal is prediction, and one wants

to estimate how accurately a predictive model will perform in practice. One round of cross-

validation involves partitioning a sample of data into complementary subsets, performing the

analysis on one subset (called the training set), and validating the analysis on the other subset

(called the validation set or testing set). An excellent review of different types of cross validation techniques is available in Izenman (2008). Picard and Cook (1984) developed the basic fundamentals of applying the cross validation technique in regression and time series.

According to Montgomery et al. (2013), three types of procedures are useful for validating a

regression or time series model.

(i) Analysis of the model coefficients and predicted values including comparisons with prior

experience, physical theory, and other analytical models or simulation results,

(ii) Collection of new data with which to investigate the model’s predictive performance,

(iii) Data splitting, that is, setting aside some of the original data and using these observations to investigate the model’s predictive performance.

Since we have a reasonably long data set, we prefer the data splitting technique for cross-validation of the fitted model.

In order to find the best prediction model we usually set aside, say, l observations as a holdback period. The size of l is usually 10% to 20% of the original data. Suppose that we tentatively select two models, namely A and B. We fit both models using the remaining (T – l) observations. Then we compute

$MSPE_A = \frac{1}{l} \sum_{t=1}^{l} e_{At}^2$ (5.1)

for model A and

$MSPE_B = \frac{1}{l} \sum_{t=1}^{l} e_{Bt}^2$ (5.2)

for model B, where $e_{At}$ and $e_{Bt}$ are the forecast errors of the two models in the holdback period. Several methods have been devised to determine whether one MSPE is statistically different from the other. One popular method is the F-test approach, where the F-statistic is constructed as the ratio of the two MSPEs, keeping the larger MSPE in the numerator. If the MSPE for model A is larger, this statistic takes the form:

$F = \frac{MSPE_A}{MSPE_B}$ (5.3)

This statistic follows an F distribution with (l, l) degrees of freedom under the null hypothesis of equal forecasting performance. If the F-test is significant we choose model B for this data; otherwise, we conclude that there is little difference between the two models.

5.2 Cross Validation Results

In this section we employ linear regression with the OLS and RLS methods and an exponential model for cross validation. Since we have 34 years of data, we use roughly the first 90% of our data (30 years) for fitting the models, and the last 10% of the observations (4 years) are forecasted by these three different methods.

Table 5.1: Original and Forecasted Values for 2011-2014

Year   Original   RLS       OLS       Exponential
2011   95991      89716.3   69734.2   70962
2012   86030      90866.4   78140.5   97382
2013   102302     95339.7   89855.8   136358
2014   90925      87741.9   89071.0   121570

[Scatter plot of the Original values and the RLS, OLS and Exponential forecasts (vertical axis: Forecast) versus the Original values (horizontal axis)]

Figure 5.1: Scatterplot of RLS, OLS, Exponential Forecasts vs Original Values

Table 5.1 provides the total number of students studying abroad. Three different sets of forecasted values for the years 2011-2014 are presented together with the original values.

Figure 5.1 gives a graphical display to show which forecasted values are closer to their corresponding original ones. The original values are plotted as black dots, and the RLS forecasts, plotted as red dots, are quite close to the black ones. This graph clearly shows that the RLS forecasts are much better than the OLS forecasts. Although the exponential model performed better than the OLS in terms of fit, in terms of forecasts it seems to perform even worse than the OLS.

Table 5.2: Cross Validation Result Summary

Model MSPE F p-value

OLS 227502579

RLS 30342093 7.49791 0.0383

Exponential 713559588 0.525061 0.7260

Since graphical summaries are subjective, we also carry out an analytical test to evaluate the forecasts, as designed in (5.1) to (5.3); the results are presented in Table 5.2. We observe from this table that the MSPE for the RLS is much smaller than those of the OLS and the exponential model. We also observe that the F test comparing the RLS with the OLS is significant at the 5% level (p-value 0.0383). However, the F test for the exponential forecasts is far from significant (p-value 0.7260). Thus we can conclude that the RLS produces the best set of forecasts, followed by the OLS forecasts. Exponential forecasts are the worst in this study.
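The MSPE values and the OLS versus RLS comparison can be rechecked from the rounded figures in Table 5.1. The sketch below is only an illustrative recomputation, not the software used for the thesis. The function names are hypothetical, and the closed-form upper tail probability is a shortcut that holds only for the F distribution with (4, 4) degrees of freedom, which is the case here because l = 4. Small differences from Table 5.2 are expected because the forecasts in Table 5.1 are rounded.

```python
# Values copied from Table 5.1 (holdback years 2011-2014).
original = [95991, 86030, 102302, 90925]
forecasts = {
    "RLS": [89716.3, 90866.4, 95339.7, 87741.9],
    "OLS": [69734.2, 78140.5, 89855.8, 89071.0],
    "Exponential": [70962, 97382, 136358, 121570],
}

def mspe(actual, predicted):
    """Mean squared prediction error, as in (5.1) and (5.2)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def f_upper_tail_4_4(f):
    """P(F > f) for F(4, 4) only (assumed shortcut, not a general routine).

    With equal degrees of freedom of 4, the tail probability reduces to the
    regularized incomplete beta I_x(2, 2) = 3x^2 - 2x^3 with x = 1 / (1 + f).
    """
    x = 1.0 / (1.0 + f)
    return 3 * x ** 2 - 2 * x ** 3

results = {name: mspe(original, fc) for name, fc in forecasts.items()}

# OLS has the larger MSPE, so it goes in the numerator, as in (5.3).
f_ols_vs_rls = results["OLS"] / results["RLS"]
p_ols_vs_rls = f_upper_tail_4_4(f_ols_vs_rls)

for name, value in results.items():
    print(f"{name:12s} MSPE = {value:,.0f}")
print(f"F (OLS vs RLS) = {f_ols_vs_rls:.4f}, p-value = {p_ols_vs_rls:.4f}")
```

Only the OLS versus RLS comparison is recomputed here; for other holdback sizes a general F distribution routine from a statistics library would be needed in place of the (4, 4) shortcut.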

CHAPTER 6

Conclusions and Areas of Further Research

In this chapter we will summarize the findings of our research to draw some conclusions and

outline ideas for our future research.

6.1 Conclusions

In this study our prime objective was to investigate the trend of Saudi Arabian students who are studying abroad for higher education. We investigated both the overall trend and the trends of nine individual programs. We observe that not a single variable fits a linear trend model; all of them fit either quadratic or exponential models. We then investigated the trends of some other variables, such as the budget for higher education, the oil price, and the oil revenue, which should influence the number of students studying abroad. We observe similar trends for these variables as well.

We also observe that the largest numbers of Saudi Arabian students go abroad to study Engineering and Medical Science, and the smallest numbers study Agriculture and Fine Arts. We also found that the number of male students is significantly higher than the number of female students in 6 out of 9 programs. Female students outnumber male students in only three programs, and those differences are not statistically significant. We therefore find evidence of gender discrimination among the Saudi Arabian students studying abroad.

To determine which factors influence the number of students studying abroad, we used regression analysis; the two variables found to have the greatest effect are the budget for higher education and the oil price. We also observe that the commonly used least squares method has several limitations in this case, so we finally used the robust reweighted least squares (RLS) method to fit the data. To verify how good the fit is, we used cross validation to generate forecasts for the last four years of data, and we found that the RLS fit produces much better forecasts than the other methods.

Our findings raise some concern about the future of the programs through which Saudi students go abroad for higher studies. Since the oil price has a significant positive impact on the number of students, we suspect that the recent fall in the oil price might affect these programs adversely.

6.2 Areas of Further Research

Although our data sets are time series, we were not able to consider a variety of time series methods due to time constraints; we considered only deterministic models in fitting the data. In future we would like to extend our research by considering stochastic ARIMA models. Volatility could be an essential feature of these data, so we would also like to consider ARCH/GARCH or ARFIMA/GARFIMA models.

APPENDIX A

Table A1: Number of Saudi Students Studying Abroad for Higher Education

Year Social Science Natural Science Medical Science

Male Female Total Male Female Total Male Female Total

1981 2015 84 2099 1124 48 1172 1312 235 1547

1982 2061 213 2274 1117 78 1195 758 130 888

1983 1735 156 1891 974 72 1046 666 110 776

1984 1356 141 1497 673 47 720 508 81 589

1985 1540 164 1704 611 53 664 621 86 707

1986 1199 161 1360 647 65 712 637 82 719

1987 1062 138 1200 645 61 706 654 86 740

1988 939 92 1031 597 71 668 578 64 642

1989 685 112 797 555 125 680 542 59 601

1990 570 79 649 462 100 562 448 58 506

1991 598 82 680 423 88 511 361 46 407

1992 628 81 709 430 79 509 431 60 491

1993 605 89 694 424 76 500 508 59 567

1994 647 88 735 428 73 501 552 60 612

1995 475 51 526 425 89 514 559 50 609

1996 531 58 589 481 133 614 550 62 612

1997 598 151 749 536 372 908 673 110 783

1998 107 75 182 535 424 959 860 206 1066

1999 676 254 930 595 388 983 966 248 1214

2000 1759 534 2293 974 537 1511 1361 313 1674

2001 1917 568 2485 1072 570 1642 1626 398 2024

2002 687 244 931 730 436 1166 1171 307 1478

2003 764 296 1060 788 392 1180 1214 362 1576

2004 754 333 1087 776 407 1183 1376 398 1774

2005 591 241 832 597 282 879 1709 467 2176

2006 2267 510 2777 2823 607 3430 3895 986 4881

2007 4663 968 5631 3136 720 3856 4983 1380 6363

2008 5424 1273 6697 5130 1262 6392 3652 1674 5326

2009 9462 2045 11507 7118 1715 8833 6173 2340 8513

2010 16318 4132 20450 8584 2567 11151 7524 3736 11260

2011 26043 7702 33745 11945 4481 16426 11589 6287 17876

2012 1547 1093 2640 16331 6306 22637 14717 7913 22630

2013 1542 1269 2811 19047 8230 27277 17208 9881 27089

2014 1287 1068 2355 16245 7711 23956 15097 8847 23944

Year Law Humanities Fine Arts

Male Female Total Male Female Total Male Female Total

1981 123 2 125 408 117 525 98 2 100

1982 313 19 332 327 2363 2690 45 33 78

1983 42 4 46 236 2203 2439 47 32 79

1984 32 9 41 190 274 464 27 11 38

1985 39 6 45 287 252 539 29 23 52

1986 43 6 49 321 261 582 24 26 50

1987 41 2 43 228 260 488 17 35 52

1988 39 1 40 191 168 359 13 18 31

1989 44 2 46 116 110 226 12 26 38

1990 39 1 40 97 49 146 10 11 21

1991 36 3 39 107 34 141 9 16 25

1992 35 1 36 107 44 151 10 22 32

1993 55 1 56 129 57 186 10 21 31

1994 54 1 55 108 61 169 12 21 33

1995 29 0 29 111 90 201 5 26 31

1996 29 0 29 335 501 836 3 22 25

1997 31 8 39 441 735 1176 13 35 48

1998 39 8 47 533 549 1082 9 37 46

1999 78 8 86 481 816 1297 6 31 37

2000 183 17 200 711 1048 1759 14 38 52

2001 292 25 317 754 1119 1873 18 53 71

2002 24 56 80 653 1018 1671 24 56 80

2003 105 5 110 568 1010 1578 14 58 72

2004 127 10 137 567 1030 1597 20 50 70

2005 240 25 265 268 744 1012 21 62 83

2006 506 37 543 677 977 1654 28 64 92

2007 625 58 683 949 1495 2444 27 119 146

2008 756 82 838 522 408 930 17 52 69

2009 1744 208 1952 4336 2820 7156 68 178 246

2010 1729 260 1989 1920 1786 3706 77 266 343

2011 2289 475 2764 1998 1455 3453 143 406 549

2012 2989 629 3618 5370 3800 9170 269 621 890

2013 3096 902 3998 3161 2646 5807 331 868 1199

2014 2715 827 3542 3050 2231 5281 474 994 1468

Year Engineering Education Agriculture

Male Female Total Male Female Total Male Female Total

1981 1490 20 1510 382 25 407 219 1 220

1982 1137 14 1151 516 212 728 176 4 180

1983 1026 68 1094 514 265 779 138 3 141

1984 849 12 861 339 202 541 107 2 109

1985 737 9 746 473 309 782 99 3 102

1986 537 17 554 296 344 640 81 4 85

1987 499 6 505 174 351 525 95 3 98

1988 449 10 459 157 192 349 82 0 82

1989 451 9 460 148 106 254 82 1 83

1990 428 10 438 120 68 188 52 1 53

1991 467 18 485 123 87 210 49 1 50

1992 362 3 365 120 93 213 50 0 50

1993 407 2 409 104 93 197 55 1 56

1994 411 6 417 109 88 197 62 1 63

1995 419 37 456 62 60 122 61 2 63

1996 544 15 559 74 55 129 54 2 56

1997 1123 34 1157 118 88 206 66 2 68

1998 1435 100 1535 107 75 182 58 6 64

1999 542 46 588 228 353 581 82 3 85

2000 498 43 541 459 631 1090 82 4 86

2001 516 44 560 458 560 1018 83 8 91

2002 681 88 769 193 311 504 79 4 83

2003 2711 162 2873 176 276 452 74 10 84

2004 5481 292 5773 177 167 344 54 13 67

2005 5080 130 5210 171 224 395 34 5 39

2006 6665 317 6982 300 323 623 81 15 96

2007 10647 360 11007 2144 1019 3163 80 31 111

2008 18104 692 18796 319 216 535 29 0 29

2009 21461 672 22133 1254 710 1964 44 0 44

2010 30164 968 31132 610 716 1326 74 2 76

2011 26255 860 27115 955 1341 2296 74 12 86

2012 1490 20 1510 863 1342 2205 88 19 107

2013 1137 14 1151 1016 1867 2883 88 18 106

2014 1026 68 1094 1059 2117 3176 74 14 88

Table A2: Saudi Arabia Oil Revenue, Oil Price and Budget in Higher Education

Year Oil Revenue Budget in HE Oil Price

1981 328594 2.76845E+06 77.80

1982 186006 9.35426E+06 74.58

1983 145123 1.03608E+07 68.43

1984 121348 9.30524E+06 69.36

1985 88425 1.10786E+07 67.16

1986 42464 7.13496E+09 26.21

1987 67405 6.00293E+09 28.38

1988 48400 6.15068E+09 20.45

1989 75900 5.73860E+09 25.20

1990 96800 5.75337E+09 28.40

1991 149497 6.09730E+09 23.50

1992 128790 3.18550E+10 22.64

1993 105976 3.41000E+10 20.52

1994 95505 3.51000E+10 19.31

1995 105728 2.69120E+10 19.24

1996 135982 2.76267E+10 23.07

1997 159985 4.17000E+10 23.04

1998 79998 4.31000E+10 15.08

1999 104447 4.41000E+10 21.60

2000 214424 4.92840E+10 35.64

2001 183915 5.43000E+10 31.14

2002 166100 4.70370E+10 31.27

2003 231000 6.75000E+10 30.92

2004 330000 6.36500E+10 35.14

2005 504540 7.01000E+10 50.21

2006 604470 8.73000E+10 59.94

2007 562186 9.67000E+10 62.59

2008 983369 1.05000E+11 80.38

2009 434420 1.22100E+11 53.89

2010 670265 1.37600E+11 68.60

2011 1034360 1.50000E+11 88.79

2012 1144818 1.68600E+11 93.06

2013 1035046 2.04000E+11 88.95

2014 913346 2.10000E+11 80.34

APPENDIX B

One-way ANOVA: Agriculture, Education, Engineering, Fine Arts, Humanities, ...

Source DF SS MS F P

Factor 8 998821022 124852628 5.06 0.000

Error 297 7322357160 24654401

Total 305 8321178183

S = 4965 R-Sq = 12.00% R-Sq(adj) = 9.63%

Individual 95% CIs For Mean Based on

Pooled StDev

Level N Mean StDev --------+---------+---------+---------+-

Agriculture 34 85 38 (-------*--------)

Education 34 859 895 (-------*--------)

Engineering 34 4524 8019 (--------*-------)

Fine Arts 34 185 340 (-------*-------)

Humanities 34 1847 2142 (-------*--------)

Law 34 655 1159 (-------*--------)

Medical Science 34 4490 7337 (-------*--------)

Natural Science 34 4284 7319 (-------*--------)

Social Science 34 3459 6584 (-------*--------)

--------+---------+---------+---------+-

0 2000 4000 6000

Pooled StDev = 4965
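As a quick check on how the quantities in the ANOVA output above relate to one another, the following sketch recomputes the mean squares, the F statistic, the pooled standard deviation, and the R-squared values from the reported sums of squares and degrees of freedom. It is an illustrative recomputation using the standard one-way ANOVA formulas, not output from the statistical package, and the variable names are hypothetical.

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_factor, df_factor = 998821022, 8
ss_error, df_error = 7322357160, 297

ss_total = ss_factor + ss_error
df_total = df_factor + df_error

ms_factor = ss_factor / df_factor      # mean square for the factor
ms_error = ss_error / df_error         # mean square error
f_stat = ms_factor / ms_error          # F = MS(factor) / MS(error)
pooled_sd = math.sqrt(ms_error)        # S, the pooled standard deviation

r_sq = ss_factor / ss_total
r_sq_adj = 1 - ms_error / (ss_total / df_total)

print(f"MS factor = {ms_factor:.0f}, MS error = {ms_error:.0f}")
print(f"F = {f_stat:.2f}, S = {pooled_sd:.0f}")
print(f"R-Sq = {100 * r_sq:.2f}%, R-Sq(adj) = {100 * r_sq_adj:.2f}%")
```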

Grouping Information Using Tukey Method

N Mean Grouping

Engineering 34 4524 A

Medical Science 34 4490 A

Natural Science 34 4284 A B

Social Science 34 3459 A B C

Humanities 34 1847 A B C

Education 34 859 A B C

Law 34 655 B C

Fine Arts 34 185 C

Agriculture 34 85 C

Means that do not share a letter are significantly different.

Tukey 95% Simultaneous Confidence Intervals

All Pairwise Comparisons

Individual confidence level = 99.79%

Agriculture subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Education -2965 774 4512 (-------*------)

Engineering 700 4439 8177 (-------*------)

Fine Arts -3639 99 3838 (------*-------)

Humanities -1977 1761 5500 (-------*------)

Law -3169 569 4308 (------*-------)

Medical Science 666 4405 8143 (-------*------)

Natural Science 460 4198 7937 (------*-------)

Social Science -365 3373 7112 (-------*------)

------+---------+---------+---------+---

-5000 0 5000 10000

Education subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Engineering -73 3665 7403 (------*-------)

Fine Arts -4413 -674 3064 (-------*------)

Humanities -2751 988 4726 (-------*------)

Law -3943 -204 3534 (-------*------)

Medical Science -107 3631 7369 (------*-------)

Natural Science -314 3425 7163 (-------*------)

Social Science -1138 2600 6338 (------*-------)

------+---------+---------+---------+---

-5000 0 5000 10000

Engineering subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Fine Arts -8078 -4339 -601 (------*-------)

Humanities -6415 -2677 1061 (-------*------)

Law -7607 -3869 -131 (------*-------)

Medical Science -3772 -34 3704 (-------*------)

Natural Science -3979 -240 3498 (-------*------)

Social Science -4803 -1065 2673 (-------*------)

------+---------+---------+---------+---

-5000 0 5000 10000

Fine Arts subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Humanities -2076 1662 5400 (------*-------)

Law -3268 470 4208 (-------*------)

Medical Science 567 4305 8044 (-------*------)

Natural Science 361 4099 7837 (------*-------)

Social Science -464 3274 7012 (-------*------)

------+---------+---------+---------+---

-5000 0 5000 10000

Humanities subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Law -4930 -1192 2546 (-------*------)

Medical Science -1095 2643 6382 (------*-------)

Natural Science -1301 2437 6175 (-------*------)

Social Science -2126 1612 5350 (------*-------)

------+---------+---------+---------+---

-5000 0 5000 10000

Law subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Medical Science 97 3835 7574 (-------*------)

Natural Science -109 3629 7367 (------*-------)

Social Science -934 2804 6542 (-------*------)

------+---------+---------+---------+---

-5000 0 5000 10000

Medical Science subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Natural Science -3945 -206 3532 (-------*------)

Social Science -4770 -1031 2707 (-------*------)

------+---------+---------+---------+---

-5000 0 5000 10000

Natural Science subtracted from:

Lower Center Upper ------+---------+---------+---------+---

Social Science -4563 -825 2913 (------*-------)

------+---------+---------+---------+---

-5000 0 5000 10000

Grouping Information Using Fisher Method

N Mean Grouping

Engineering 34 4524 A

Medical Science 34 4490 A

Natural Science 34 4284 A

Social Science 34 3459 A B

Humanities 34 1847 B C

Education 34 859 C

Law 34 655 C

Fine Arts 34 185 C

Agriculture 34 85 C

Means that do not share a letter are significantly different.

Fisher 95% Individual Confidence Intervals

All Pairwise Comparisons

Simultaneous confidence level = 43.41%

Agriculture subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Education -1596 774 3144 (------*------)

Engineering 2069 4439 6809 (------*-----)

Fine Arts -2271 99 2469 (-----*------)

Humanities -609 1761 4131 (------*------)

Law -1801 569 2939 (------*-----)

Medical Science 2035 4405 6775 (------*-----)

Natural Science 1828 4198 6568 (------*------)

Social Science 1003 3373 5743 (------*-----)

---------+---------+---------+---------+

-3500 0 3500 7000

Education subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Engineering 1295 3665 6035 (-----*------)

Fine Arts -3044 -674 1696 (------*------)

Humanities -1382 988 3358 (------*------)

Law -2574 -204 2166 (-----*------)

Medical Science 1261 3631 6001 (-----*------)

Natural Science 1055 3425 5795 (------*------)

Social Science 230 2600 4970 (-----*------)

---------+---------+---------+---------+

-3500 0 3500 7000

Engineering subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Fine Arts -6709 -4339 -1969 (------*-----)

Humanities -5047 -2677 -307 (-----*------)

Law -6239 -3869 -1499 (------*------)

Medical Science -2404 -34 2336 (------*------)

Natural Science -2610 -240 2130 (-----*------)

Social Science -3435 -1065 1305 (------*------)

---------+---------+---------+---------+

-3500 0 3500 7000

Fine Arts subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Humanities -708 1662 4032 (------*------)

Law -1900 470 2840 (-----*------)

Medical Science 1935 4305 6675 (-----*------)

Natural Science 1729 4099 6469 (------*-----)

Social Science 904 3274 5644 (-----*------)

---------+---------+---------+---------+

-3500 0 3500 7000

Humanities subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Law -3562 -1192 1178 (------*-----)

Medical Science 273 2643 5013 (------*-----)

Natural Science 67 2437 4807 (------*------)

Social Science -758 1612 3982 (------*-----)

---------+---------+---------+---------+

-3500 0 3500 7000

Law subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Medical Science 1465 3835 6205 (------*------)

Natural Science 1259 3629 5999 (-----*------)

Social Science 434 2804 5174 (------*------)

---------+---------+---------+---------+

-3500 0 3500 7000

Medical Science subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Natural Science -2576 -206 2164 (-----*------)

Social Science -3401 -1031 1339 (------*------)

---------+---------+---------+---------+

-3500 0 3500 7000

Natural Science subtracted from:

Lower Center Upper ---------+---------+---------+---------+

Social Science -3195 -825 1545 (------*-----)

---------+---------+---------+---------+

-3500 0 3500 7000