Polygon is a tribute to the scholarship and dedication of the faculty at Miami Dade College in interdisciplinary areas.
Editorial Note:
Polygon is MDC Hialeah's academic journal. It is a multi-disciplinary online publication whose purpose is to display the academic work produced by faculty and staff. We, the editorial committee of Polygon, are pleased to publish the 2016 Spring Issue, the tenth consecutive issue of Polygon. It includes five regular research papers, and we are pleased to present work from a diverse array of fields written by faculty from across the college. The editorial committee of Polygon is thankful to the Miami Dade College President, Dr. Eduardo J. Padrón; the Miami Dade College District Board of Trustees; the Hialeah Campus Academic Dean, Professor Joaquin G. Martinez; the Chairperson of Hialeah Campus Liberal Arts and Sciences, Dr. Caridad Castro; the Chairperson of Hialeah Campus English, Communications and World Languages, Dr. Victor McGlone; the Director of Hialeah Campus Administrative Services, Ms. Andrea M. Forero; and all staff and faculty of Hialeah Campus and Miami Dade College in general, for their continued support and cooperation in the publication of Polygon.

Sincerely,

The Editorial Committee of Polygon: Dr. M. Shakil (Mathematics), Dr. Jaime Bestard (Mathematics), and Professor Victor Calderin (English)
Patrons:
Professor Joaquin Martinez, Dean of Academic and Student Affairs
Dr. Caridad Castro, Chair of Liberal Arts and Sciences
Dr. Jon Mcglone, Chair of World Language
Miami Dade College District Board of Trustees:
Helen Aguirre Ferré, Chair
Armando J. Bucelo Jr.
Benjamin León III
Marili Cancio
Jose K. Fuentes
Armando J. Olivera
Bernie Navarro
Eduardo J. Padrón, College President
Mission of Miami Dade College

The mission of the College is to provide accessible, affordable, high-quality education that keeps the learner's needs at the center of the decision-making process.
CONTENTS
ARTICLES
Dynamic Stability Analysis of Tumor-Host Interactions
AUTHOR(S)
Dr. Keysner Boet
A comparison between TRON and Levenberg-Marquardt methods and their relationship to Tikhonov’s Regularization Method in Nonlinear Parameter Estimation
Dr. Justina L. Castellanos and Dr. Angel Pérez
SURVEY OF STUDENTS’ FAMILIARITY WITH DEVELOPMENTAL MATHEMATICS- A STATISTICAL ANALYSIS
Dr. M. Shakil
Item Analysis Statistics and Their Uses: An Overview
Dr. M. Shakil
Testing the Goodness of Fit of Continuous Probability Distributions to Some Flood Data
Dr. M. Shakil

Comments about Polygon: (http://www.mdc.edu/hialeah/Polygon2013/docs2013b/Comments_About_Polygon.pdf)
Previous Editions
Polygon, 2008 (http://issuu.com/polygon5/docs/polygon2008)
Polygon, 2009 (http://issuu.com/polygon5/docs/polygon2009)
Polygon, 2010 (http://issuu.com/polygon5/docs/polygon_2010)
Polygon, 2011 (http://issuu.com/polygon5/docs/polygon_2011)
Polygon, 2012 (http://issuu.com/polygon5/docs/polygon_2012)
Polygon, 2013 (http://issuu.com/polygon5/docs/polygon2013)
Polygon, 2014 (http://issuu.com/polygon5/docs/polygon_2014)
Disclaimer: The views and perspectives presented in the articles published in Polygon do not represent those of Miami Dade College.
Back to Front Cover
A comparison between TRON and
Levenberg-Marquardt methods and their
relationship to Tikhonov’s Regularization Method
in Nonlinear Parameter Estimation
Justina L. Castellanos∗  Angel Pérez†
Abstract
Parameter estimation problems are usually solved by minimizing a nonlinear or linear least squares function (NLS or LLS). For the nonlinear case, the Levenberg-Marquardt method (L-M) has long been the best method. The connection of this method to Tikhonov's regularization method is described. It is shown that a Trust Region Newton's method (TRON), which deals with bound constraints for NLS problems, performs like the L-M method.

Key words: Inverse Problem, Newton method, Trust Region strategy, Tikhonov method
1 Introduction
The parameter estimation problem for nonlinear models is of great interest not only for mathematicians but for many specialists in other applied areas such as engineering, biology, and so forth. It is usually posed as the solution of the following Nonlinear Least Squares (NLS) problem
$$\text{Minimize } F(x) = \frac{1}{2}\sum_{i=1}^{m}\bigl(\phi(x;t_i)-y_i^{\mathrm{obs}}\bigr)^2 = \frac{1}{2}\,\|f(x)\|_2^2 \quad \text{s.t. } l \le x \le u, \quad l, x, u \in \mathbb{R}^n \tag{1}$$

$$f(x) = (f_1(x),\ldots,f_m(x))^t, \qquad f_i(x) = \phi(x;t_i)-y_i^{\mathrm{obs}}, \quad i = 1,\ldots,m,$$

where φ(x; t) represents the desired model function, with t an independent variable at which the data {y_i^obs} are measured, which may be subject to experimental error. The independent variables {x_j}, j = 1, ..., n, can be interpreted as parameters of the problem that are to be manipulated in order to adjust the model
∗ Miami Dade College, e-mail: [email protected]
† Woolton Inc., e-mail: [email protected]
to the data. If the model is to have any validity, we can expect that ‖f(x*)‖₂ (with x* being the solution of (1)) will be "small" and that m, the number of data points, will be much greater than n. The vector function f is called the residual vector, and the vectors u and l are the upper and lower bounds on the unknown vector of parameters x, respectively.
Although problem (1) can be minimized by any general nonlinear optimization method, in most circumstances the properties of the function F make it worthwhile to use methods designed specifically for the least squares problem. In particular, the gradient and Hessian matrix of F have a special structure. If J(x) = [∇f_1(x)^t, ∇f_2(x)^t, ..., ∇f_m(x)^t]^t is the m×n Jacobian matrix of f(x), then the gradient is

$$g(x) = \nabla F(x) = J(x)^t f(x),$$

and the n×n Hessian matrix is

$$\nabla^2 F(x) = J(x)^t J(x) + \sum_{i=1}^{m} f_i(x)\,\nabla^2 f_i(x).$$

If ‖f(x*)‖₂ is sufficiently small, then the Hessian matrix can be approximated by J(x)^t J(x). One of these methods is the well-known Levenberg-Marquardt method (L-M), which, on the other hand, can be viewed as the iterative solution of the approximation of the nonlinear problem (1) by a linear least-squares problem, as we will see ahead.
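This special structure is easy to check numerically. The sketch below uses a small hypothetical model φ(x; t) = x₁·exp(x₂t) with made-up data (an illustration only, not a model from this paper) and verifies that g = Jᵗf agrees with a finite-difference gradient of F:

```python
import numpy as np

# Hypothetical model phi(x; t) = x1 * exp(x2 * t) -- illustrative only.
def residual(x, t, y_obs):
    return x[0] * np.exp(x[1] * t) - y_obs

def jacobian(x, t):
    # J[i, j] = d f_i / d x_j
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y_obs = np.array([1.1, 1.3, 1.7, 2.1, 2.8])   # made-up data
x = np.array([1.0, 0.4])

f = residual(x, t, y_obs)
J = jacobian(x, t)
F = 0.5 * f @ f          # objective F(x) = (1/2)||f||^2
g = J.T @ f              # gradient g(x) = J^t f
H_gn = J.T @ J           # Gauss-Newton approximation of the Hessian

# Check the gradient formula against central finite differences.
def F_of(z):
    r = residual(z, t, y_obs)
    return 0.5 * r @ r

eps = 1e-6
g_fd = np.array([(F_of(x + eps * np.eye(2)[j]) - F_of(x - eps * np.eye(2)[j])) / (2 * eps)
                 for j in range(2)])
assert np.allclose(g, g_fd, atol=1e-5)
```

The neglected second-order term Σ f_i ∇²f_i is exactly what the small-residual assumption makes negligible.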
The Linear Least-Squares problem (LLS) is the solution of

$$\min_{x\in\mathbb{R}^n}\; \frac{1}{2}\,\|Ax-b\|_2^2, \qquad A \text{ an } m\times n \text{ matrix}, \quad b \text{ an } m\text{-data vector} \tag{2}$$
where the difficulty in finding a reasonable approximate solution comes from the usual ill-posedness of the problem, which is reflected in the ill-conditioning of the matrix A. This is caused by the quasi-linear dependency of its columns and may produce solutions quite far from the correct one because of small errors in the data. Tikhonov's regularization method solves this problem by penalizing the LLS function in order to keep the solution vector from growing too large:
$$\min_{x\in\mathbb{R}^n}\; \frac{1}{2}\,\|Ax-b\|_2^2 + \lambda\,\|x\|_2^2 \tag{3}$$
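The effect of the penalty term can be illustrated on a deliberately ill-conditioned system. In the sketch below the Hilbert matrix, the noise level, and the value of λ are our own illustrative choices, not data from the paper; the normal equations for (3) are (AᵗA + 2λI)x = Aᵗb:

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned matrix (the 8x8 Hilbert matrix) -- illustrative only.
n = 8
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = A @ x_true + 1e-4 * rng.standard_normal(n)   # small errors in the data

# Plain least squares: the tiny singular values of A amplify the noise.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Tikhonov: minimize (1/2)||Ax - b||^2 + lam * ||x||^2
lam = 1e-6
x_tik = np.linalg.solve(A.T @ A + 2 * lam * np.eye(n), A.T @ b)

# The regularized solution stays far closer to the correct one.
assert np.linalg.norm(x_tik - x_true) < np.linalg.norm(x_ls - x_true)
assert np.linalg.norm(x_tik) < np.linalg.norm(x_ls)
```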
The difficulty is to select the scalar λ, which is problem dependent (see [7]). In NLS problems the approximation ∇²F(x_k) ≈ J(x_k)^t J(x_k) is used at each iteration, and a Linear Least-Squares problem like (2) is solved, with the Jacobian matrix J(x_k) and residual vector f(x_k) at the current iteration playing the roles of A and b, respectively. The minimization process seeks the descent direction s that solves

$$\min_{s\in\mathbb{R}^n}\; \frac{1}{2}\,\|J(x_k)s + f(x_k)\|_2^2 \tag{4}$$
The ill-conditioning of the Jacobian matrix can cause the iterations of the NLS method to generate solutions to the associated LLS subproblem that are quite far from the exact one. The resulting solution to the NLS problem might then also be far from the expected one. Thus Tikhonov's regularization is useful in order to avoid this situation.
In this paper, the relationship between the iteration of the Levenberg-Marquardt method and Tikhonov's regularization method is presented to explain the good performance of the former. Afterwards, the use of a Trust Region Newton's method (TRON) [5] to solve the NLS problem, taking the Hessian matrix as the approximation J(x_k)^t J(x_k) at each iteration, is shown. This method can deal with bound constraints, and its relationship to the Levenberg-Marquardt iteration guarantees its good behavior on NLS problems. Our main goal is to point out these relationships among the methods and, in a forthcoming paper, to use these ideas in practical applications.
The paper is organized as follows: the connection of the linear iteration of the Levenberg-Marquardt method to Tikhonov's regularization method is made in Section 2. Section 3 describes the TRON method, pointing out its similarity to the L-M method when applied to NLS using only first order information in the approximation of the Hessian matrix and, thus, the corresponding connection of the former method to Tikhonov's regularization method. Section 4 presents the way the Levenberg-Marquardt and TRON methods solve the linear subproblem (4) and compute the regularization parameter. Section 5 is devoted to the conclusions.
In what follows we will use the following notation: f_k for f(x_k), g_k for g(x_k), and J_k for J(x_k). The norm used here will be the Euclidean norm, so we omit the subscript.
2 Relationship between the Levenberg-Marquardt and Tikhonov's Regularization methods
The Levenberg-Marquardt method (L-M) [2] has been used to solve the NLS problem with great success, since it takes advantage of the specific form of the function to be minimized; that is to say, it exploits the particular form of the gradient and the possibility of approximating the true Hessian matrix at each iteration, under the assumption of small residuals, by J_k^t J_k, as was noticed in the introduction to this work.

The search direction is defined as the solution of the equations

$$(J_k^t J_k + \lambda_k I)\,s_k = -J_k^t f_k$$
where λ_k is a non-negative scalar. A unit step is always taken along s_k, giving x_{k+1} = x_k + s_k. It can be shown that, for some scalar ∆_k related to λ_k, the vector s_k is the solution of the constrained subproblem

$$\min_{s}\; \frac{1}{2}\,\|J_k s + f_k\|^2 \quad \text{subject to } \|s\| \le \Delta_k \tag{5}$$

which is equivalent to the solution of the unconstrained minimization of the Lagrangian function for problem (5):

$$\min_{s\in\mathbb{R}^n}\; \frac{1}{2}\,\|J_k s + f_k\|^2 + \lambda_k\bigl(\|s\|^2 - \Delta_k^2\bigr).$$
This is also equivalent to Tikhonov's regularization (3) for LLS problems except that, in that case, the bound ∆_k on the size of the descent direction is not known and therefore λ_k must be fixed. In the L-M iteration, ∆_k is given in some way by the major iteration, and then λ_k can be chosen as explained in Section 4.
This relationship between the L-M method and Tikhonov's regularization is the reason for the good behavior of the L-M method on noisy problems, since, in a certain way, it prevents the size of the iteration vector from growing too much.
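The behavior of the L-M step as λ_k varies can be seen in a few lines. In the sketch below the Jacobian and residual are hypothetical numbers chosen only for illustration: λ = 0 reproduces the Gauss-Newton direction, and as λ grows the step shrinks and turns toward the steepest-descent direction −g:

```python
import numpy as np

# Hypothetical J_k and f_k at the current iterate (not from the paper).
J = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
f = np.array([0.5, -0.2, 0.1])
g = J.T @ f                    # gradient of (1/2)||J s + f||^2 at s = 0

def lm_step(lam):
    # Solve (J^t J + lam I) s = -J^t f
    return np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), -g)

s_gn = lm_step(0.0)            # lam = 0: the Gauss-Newton direction
assert np.allclose(J.T @ J @ s_gn, -g)

# ||s(lam)|| decreases monotonically as lam grows ...
norms = [np.linalg.norm(lm_step(l)) for l in (0.0, 1.0, 10.0, 100.0)]
assert all(norms[i] > norms[i + 1] for i in range(len(norms) - 1))

# ... and for large lam the step is essentially a short move along -g.
s_big = lm_step(1e6)
cos = (s_big @ -g) / (np.linalg.norm(s_big) * np.linalg.norm(g))
assert cos > 0.999
```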
3 TRON: Trust Region Newton’s method
TRON [5] is a routine that uses a trust region version of Newton's method for general nonlinear minimization of bound constrained problems, i.e.,

$$\min\; F(x) \quad \text{s.t. } l \le x \le u,$$

where F is a smooth nonlinear scalar function in R^n, and l and u are the lower and upper bounds, respectively. The basic iteration of the method is

$$x_{k+1} = x_k + s_k, \qquad s_k = \arg\min_{\|s\|\le\Delta_k}\; \Bigl( m_k(s) = \frac{1}{2}\,s^t \nabla^2 F_k\, s + s^t g_k \Bigr).$$
When applied to the problem of parameter estimation (1) with the true Hessian matrix approximated using first order information (∇²F_k ≈ J_k^t J_k), this method becomes a Levenberg-Marquardt method, since at each iteration the nonlinear least squares problem is approximated by the solution of the associated linear least-squares problem. The basic iteration now becomes

$$x_{k+1} = x_k + s_k, \qquad s_k = \arg\min_{\|s\|\le\Delta_k}\; \Bigl( m_k(s) = \frac{1}{2}\,\|J_k s + f_k\|_2^2 \Bigr). \tag{6}$$
To solve the constrained LLS subproblem (6), the vector s_k is obtained using a Linear Preconditioned Conjugate Gradient method (LPCG). This is the essential part of the TRON method in which we are interested in this work. For more information about the treatment of the bound constraints see [5].
As the TRON method solves the same subproblem (6) as the L-M method when applied to a nonlinear least-squares problem using the approximate Hessian matrix J^t J, the equivalence to Tikhonov's regularization method is also valid for this method.
4 Relationship between the solution of the Linear Least Squares subproblem by the L-M and TRON methods
The L-M algorithm is of the trust region type, and a "good" value of λ_k (or ∆_k) must be chosen in order to ensure descent. If λ_k is zero, s_k is the Gauss-Newton direction; as λ_k → ∞, ∆_k → 0, ‖s_k‖ → 0, and s_k becomes parallel to the steepest-descent direction. The difficulty in this approach is an appropriate strategy for choosing ∆_k, which must rely on heuristic considerations. Most standard strategies (see Dennis and Schnabel [1], Moré [2]) were originally developed to "globalize" the convergence of the Gauss-Newton iteration for well-posed minimization problems, so the parameter λ_k is chosen once the corresponding ∆_k has been fixed by some criterion on the agreement between the nonlinear model and the linear one. If the Gauss-Newton direction s_k = −(J_k^t J_k)^{-1} J_k^t f_k satisfies the constraint on the norm and the matrix J_k^t J_k is non-singular, then λ_k is set to zero; but if one of these conditions fails, an algorithm is used to compute the scalar λ_k that is the root of the equation ‖s_k(λ)‖ − ∆_k = 0, where s_k(λ) = −(J_k^t J_k + λI)^{-1} J_k^t f_k.
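This root-finding step can be sketched with a simple safeguarded bisection on ‖s_k(λ)‖ − ∆_k, exploiting the fact that ‖s_k(λ)‖ decreases monotonically in λ. The matrix, residual, and radius below are hypothetical illustrations; production L-M codes such as Moré's [2] use a more efficient safeguarded Newton iteration instead of bisection:

```python
import numpy as np

# Hypothetical J_k and f_k at the current iterate (illustrative values).
J = np.array([[1.0, 0.0],
              [0.0, 1e-3],    # a nearly dependent column -> long Gauss-Newton step
              [1.0, 1.0]])
f = np.array([1.0, 1.0, -0.5])
g = J.T @ f
delta = 0.5                   # trust region radius

def step_norm(lam):
    s = np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), -g)
    return np.linalg.norm(s)

# If the Gauss-Newton step already fits, lam = 0; otherwise bisect on
# ||s(lam)|| - delta, since ||s(lam)|| is decreasing in lam.
if step_norm(0.0) <= delta:
    lam = 0.0
else:
    lo, hi = 0.0, 1.0
    while step_norm(hi) > delta:   # grow the bracket until ||s(hi)|| <= delta
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if step_norm(mid) > delta:
            lo = mid
        else:
            hi = mid
    lam = hi

assert abs(step_norm(lam) - delta) < 1e-6
```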
The TRON method solves (5) using an LPCG algorithm that begins at s_k^0 = 0, stopping the iteration whenever one of the following holds:

$$\|s_k^i\| > \Delta_k, \tag{7}$$

$$\|J_k^t J_k\, s_k^i + g_k\| \le \mathrm{rtol}\,\|g_k\|, \tag{8}$$

$$p_i^t\, J_k^t J_k\, p_i = 0, \tag{9}$$

where p_i is the conjugate gradient direction, the subscript i is the counter of the LPCG iterations, and rtol is a tolerance less than one.

These criteria solve the problem of the size of the descent direction s_k and the possible singularity of the approximate Hessian matrix. If the n iterations of the LPCG are made, or (8) is satisfied with a sufficiently low value of rtol or ‖g_k‖ near zero, then an approximate solution to the Gauss-Newton equations is obtained. If rtol is not sufficiently small or ‖g_k‖ is large, then the iteration is stopped very early and a direction close to the steepest descent is accepted. If (7) or (9) is satisfied, then the LPCG finds a scalar τ > 0 such that ‖s_k‖ = ‖s_k^i + τ p_i‖ = ∆_k. As the LPCG iteration starts at s_k^0 = 0, the iteration vector moves from the right-hand side of the equations (−g_k) toward their complete solution (the Gauss-Newton direction); the vector s_k is therefore the same as in the L-M iteration, and criterion (7) controls the size of the vector. The regularizing effect (see [4]) of the LPCG iterations assures that the approximate solution to the linear subproblem will be a regularized solution, where the regularization parameter λ_k is implicitly given by the number of the iteration at which the LPCG stops.
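The monotone growth of the CG iterates from 0 toward the Gauss-Newton solution, which is what makes criterion (7) meaningful, can be verified directly. The sketch below uses plain unpreconditioned CG on a random well-conditioned system, so it illustrates the idea rather than TRON's actual LPCG (no preconditioner, no bound handling):

```python
import numpy as np

# Illustrative normal equations (J^t J) s = -g with random data.
rng = np.random.default_rng(1)
J = rng.standard_normal((20, 6))
f = rng.standard_normal(20)
A = J.T @ J
b = -(J.T @ f)                  # right-hand side -g_k

s = np.zeros(6)                 # s_k^0 = 0, as in TRON's inner iteration
r = b - A @ s
p = r.copy()
norms = [0.0]
for _ in range(6):              # at most n CG iterations
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    s = s + alpha * p
    norms.append(np.linalg.norm(s))
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new

# Starting from 0, ||s^i|| increases monotonically, so stopping early
# yields a shorter (more regularized) step.
assert all(norms[i] < norms[i + 1] for i in range(len(norms) - 1))
# After n steps CG has solved the system (up to roundoff).
assert np.allclose(A @ s, b, atol=1e-8)
```

Truncating this loop at iteration i is what implicitly fixes the regularization parameter λ_k.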
Thus, the TRON and L-M methods solve the same subproblem (5) by applying a Tikhonov regularization to the linearized problem. The difference between them is in the method for computing the scalar λ_k, which in both cases depends on the trust region radius. In the TRON method λ_k is given implicitly by the iteration number of the LPCG method, while in the L-M method an algorithm explicitly computes the root of the univariate equation ‖s_k(λ)‖ − ∆_k = 0.
5 Conclusions
The Levenberg-Marquardt method is a standard method that gives very good results when applied to Nonlinear Least-Squares problems. Its good behaviour relies on the fact that at each iteration a Tikhonov regularization is applied to the associated linear least-squares subproblem, so that the iterates do not grow too far outside a "permissible" region, where the regularization parameter is chosen by an optimization criterion on the objective function.
The use of the TRON method using the approximate Hessian JtJ is an interesting alternative method, since it deals with bounds on the variables and the iterations have the nice properties of the L-M method, as was shown in this paper. Also the use of a LPCG in the inner iteration to compute the descent direction guarantees the regularization of the associated LLS subproblem in the presence of errors, because of the regularizing effect of this latter method. The use of the TRON method in nonlinear parameter estimation problems will be presented in a paper coming soon.
References
[1] Dennis J.E., Schnabel R.B., Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Philadelphia, 1996.

[2] Moré J.J., The Levenberg-Marquardt algorithm: implementation and theory, in Numerical Analysis (G.A. Watson, ed.), Lecture Notes in Mathematics 630, Springer-Verlag, pp. 105-116, 1977.

[3] Hanke M., Regularizing properties of a truncated Newton-CG algorithm for nonlinear inverse problems, Numer. Funct. Anal. Optim., 18, pp. 971-993, 1997.

[4] Hansen P.C., Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, 1998.

[5] Moré J.J., Lin C.-J., Newton's method for large-scale bound constrained optimization problems, SIAM Journal on Optimization, Vol. 9, No. 4, pp. 1100-1127, 1999.

[6] Nocedal J., Wright S.J., Numerical Optimization, Springer, 1999.

[7] Tikhonov A.N., Solution of incorrectly formulated problems and the regularization method, Soviet Math. Dokl., 4, pp. 1035-1038, 1963; English translation of Dokl. Akad. Nauk SSSR, 151, pp. 501-504, 1963.
SURVEY OF STUDENTS’ FAMILIARITY WITH DEVELOPMENTAL MATHEMATICS – A STATISTICAL ANALYSIS
M. Shakil, Ph.D.
Professor of Mathematics
Department of Liberal Arts and Sciences
Miami Dade College, Hialeah Campus
FL 33012, USA
E-mail: [email protected]
ABSTRACT
In recent years, there has been great interest in developmental mathematics and college students' familiarity with it at all levels. In this paper, students' familiarity with developmental mathematics has been studied from a statistical point of view. By administering a survey on developmental mathematics in some math classes, the data have been collected and analyzed statistically, which shows some interesting results. It is hoped that the findings of the paper will be quite useful for researchers in various disciplines.
KEYWORDS ANOVA, Developmental Mathematics, Hypothesis Testing, Shannon’s Diversity Index.
1. INTRODUCTION
The importance of college students' familiarity with developmental mathematics in the present-day instruction of mathematics at various levels cannot be overlooked. It appears from the literature that not much work has been done on the problem of students' familiarity with developmental mathematics. Motivated by its importance, in this paper the students' familiarity with developmental mathematics has been statistically investigated and analyzed. The interested readers are referred to Shakil et al. (2010) and references therein, where the authors conducted similar studies to analyze students' familiarity with the grammar and mechanics of the English language from an exploratory point of view. Please also see Shannon (1951) and Siromoney (1964), among others. The organization of this paper is as follows. Section 2 discusses the methodology. The results are given in Section 3. The discussion and conclusion are provided in Section 4.
2. METHODOLOGY
A survey consisting of 20 multiple choice questions on developmental mathematics (see Appendix I) was constructed to test students' familiarity with developmental mathematics. It was administered in six different math courses in the spring semester of 2016. The courses were MAC 1147, MAC 2233 (two sections) and STA 2023 (three sections), which will be referred to as MAC2233-A, MAC2233-B, STA2023-A, STA2023-B and STA2023-C. The survey was administered online in Blackboard by the instructor in each of these courses. A total of 126 students (out of 151 enrolled students) participated in the survey, the details of which are provided in Table 1 below.
Table 1: Surveyed Courses
Discipline Courses Respondents
STA STA2023 (Three Sections) 67
MAC MAC1147, MAC2233 (Two Sections) 59
Total 6 126
3. RESULTS

3.1 MASTERY REPORT

The total number of questions in the survey was 20. Each question was assigned 1 point, so the possible points in the survey were 20. The score unit was assumed to be percent. There was no passing or failing score in the survey. However, it was expected that the students at the above level of courses would achieve 100% (that is, 20 out of 20 points) or at least 75% (that is, 15 out of 20 points) in the developmental mathematics survey. Thus, the minimum passing score for the developmental mathematics survey was assumed to be 75% for a satisfactory knowledge of developmental mathematics. There were two students in the STA2023 courses who scored 5 (25%) and 8 (40%) out of 20 points respectively, and so were discarded from the analysis. The mastery report of the 124 survey participants (excluding the above two students in the STA2023 courses) is provided in Table 2 and Figure 1 below.
Table 2: Mastery Report - Proportion of Students

Total Number of Students Surveyed/Reported: 124
Total Number of Survey Questions: 20
Points Assigned Per Question: 1
Minimum % of Passing Score: 75%

% of Students Scoring 20 Points (100%):      40.30%
% of Students Scoring 15-19 Points (75-95%): 59.70%
Figure 1: Mastery Report

3.2 PERFORMANCE ANALYSIS

For the performance analysis of students in the developmental mathematics survey, the participants were divided into different categories as follows:

Category (A):
i. MAC Group: MAC1147, MAC2233 (Two Sections);
ii. STA Group: STA2023 (Three Sections).
Category (B):
(i) MAC1147; (ii) MAC2233 (Two Sections); (iii) STA2023 (Three Sections).
Category (C):
(i) MAC1147; (ii) MAC2233 (Section-1); (iii) MAC2233 (Section-2); (iv) STA2023 (Section-1) ; (v) STA2023 (Section-2) ; (vi) STA2023 (Section-3).
The descriptive statistics of the performance of the Categories (A) and (B) in the survey are provided respectively in Tables 3 and 4 below. For the descriptive statistics of the performance of the Category (C), please see the Table 7 in Sub-Section 3.3 below.
Table 3: Descriptive Statistics of Category (A)

Group      Respondents  Mean   Median  St. Dev.  Coeff. of Var.  Min. Score  Max. Score  1Q  2Q  3Q
MAC Group  59           18.73  19      1.26      6.71%           15          20          18  19  20
STA Group  65           18.85  19      1.30      6.91%           15          20          18  19  20
Table 4: Descriptive Statistics of Category (B)

Group                     Respondents  Mean   Median  St. Dev.  Coeff. of Var.  Min. Score  Max. Score  1Q  2Q  3Q
MAC1147                   23           18.70  19      1.30      6.92%           16          20          18  19  20
MAC2233 (Two Sections)    36           18.75  19      1.25      6.67%           15          20          18  19  20
STA2023 (Three Sections)  65           18.85  19      1.30      6.91%           15          20          18  19  20
3.3 HYPOTHESIS TESTING: INFERENCES ABOUT MEAN SCORES

This section discusses the hypothesis testing and draws inferences about the mean scores of different independent samples. The results of these tests of hypotheses are provided below.

(I) INFERENCES ABOUT MEAN SCORES OF CATEGORY (A): MAC AND STA PARTICIPANTS

Here we discuss the hypothesis testing and draw the inferences about the mean scores of two independent samples, the MAC and STA groups, defined as follows:

Category (A):
i. MAC Group: MAC1147, MAC2233 (Two Sections);
ii. STA Group: STA2023 (Three Sections).
For the descriptive statistics of the MAC and STA Groups, please see Table 3 above. Following the procedure on pages 474-475 in Triola (2010) for unequal variances (no pooling), the hypothesis testing was conducted for these two independent groups by using the statistical software package STATDISK. The results of the hypothesis test to draw the inferences about the mean scores of the MAC and STA Groups are provided in Table 5 and Figure 2 below.
Table 5: Hypothesis Testing about Mean Scores of MAC and STA Groups
Assumption: Unequal Variances (No Pooling); Alpha = 0.05
Let µ1 = Mean Score of MAC Group and µ2 = Mean Score of STA Group.
Claim (Null Hypothesis): µ1 = µ2
Alternative Hypothesis: µ1 ≠ µ2
Test Statistic, t: -0.5217
Critical t: ±1.979685
P-Value: 0.6028
Degrees of Freedom: 121.4640
95% Confidence Interval: -0.5753641 < µ1 - µ2 < 0.3353641
Conclusion: Fail to reject the null hypothesis. There is not enough evidence to warrant the rejection of the claim that µ1 = µ2.
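The t statistic and the Welch-Satterthwaite degrees of freedom in Table 5 can be reproduced from the rounded summary statistics in Table 3 (unequal variances, no pooling), as in the Triola (2010) procedure:

```python
from math import sqrt

# Summary statistics from Table 3 (rounded values as published).
n1, m1, s1 = 59, 18.73, 1.26    # MAC group
n2, m2, s2 = 65, 18.85, 1.30    # STA group

# Welch's t statistic (no pooling of variances).
se = sqrt(s1**2 / n1 + s2**2 / n2)
t = (m1 - m2) / se

# Welch-Satterthwaite degrees of freedom.
num = (s1**2 / n1 + s2**2 / n2) ** 2
den = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df = num / den

assert round(t, 4) == -0.5217          # matches Table 5
assert abs(df - 121.4640) < 0.01       # matches Table 5 up to rounding
```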
Figure 2: Hypothesis Testing about Mean Scores of MAC and STA Groups

(II) ANALYSIS OF VARIANCE (ANOVA): INFERENCES ABOUT MEAN SCORES OF CATEGORY (B): MAC1147, MAC2233 (TWO SECTIONS), AND STA2023 (THREE SECTIONS) PARTICIPANTS

For the descriptive statistics of Category (B): MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections) Groups, please see Table 4 above. Following the procedure on pages 628-631 in Triola (2010), we discuss here the ANOVA for testing the hypothesis of the equality of the mean scores of three independent groups based on the courses, that is, MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections). The results of the ANOVA are provided in Table 6 and Figure 3 below.
Table 6: ANOVA: Hypothesis Testing About Equality of Mean Scores
ANOVA of MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections); Alpha = 0.05
Claim (Null Hypothesis): Equality of the mean scores of the three independent groups based on the courses
Alternative Hypothesis: The mean scores are not all equal

Source     DF   SS          MS        Test Stat, F  Critical F  P-Value
Treatment  2    0.467283    0.233642  0.141296      3.071137    0.868375
Error      121  200.081104  1.653563
Total      123  200.548387

Conclusion: Fail to reject the null hypothesis. There is enough evidence to support the claim of the equality of the mean scores of the three independent groups based on the courses, that is, MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections).
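The F statistic in Table 6 is the ratio of the treatment and error mean squares, and it can be checked directly from the sums of squares reported in the table:

```python
# Sums of squares and degrees of freedom from Table 6.
ss_treatment, df_treatment = 0.467283, 2
ss_error, df_error = 200.081104, 121

ms_treatment = ss_treatment / df_treatment   # mean square between groups
ms_error = ss_error / df_error               # mean square within groups
F = ms_treatment / ms_error                  # ANOVA test statistic

assert abs(ms_treatment - 0.233642) < 1e-6   # matches Table 6
assert abs(ms_error - 1.653563) < 1e-6       # matches Table 6
assert abs(F - 0.141296) < 1e-5              # matches Table 6
assert F < 3.071137                          # F < critical F: fail to reject
```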
Figure 3: ANOVA: Hypothesis Testing About Equality of Mean Scores

(III) ANALYSIS OF VARIANCE (ANOVA): INFERENCES ABOUT MEAN SCORES OF CATEGORY (C): MAC1147, MAC2233 (SECTION-1), MAC2233 (SECTION-2), STA2023 (SECTION-1), STA2023 (SECTION-2), AND STA2023 (SECTION-3) PARTICIPANTS

For the descriptive statistics of Category (C): MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3) Groups, please see Table 7 below.
Table 7: Descriptive Statistics of Category (C)

Group                Respondents  Mean   Median  St. Dev.  Coeff. of Var.  Min. Score  Max. Score  1Q  2Q    3Q
MAC1147              23           18.70  19      1.30      6.92%           16          20          18  19    20
MAC2233 (Section-1)  18           18.44  18.5    1.20      6.50%           15          20          18  18.5  19
MAC2233 (Section-2)  18           19.06  20      1.26      6.61%           17          20          18  20    20
STA2023 (Section-1)  23           18.83  20      1.61      8.57%           15          20          17  20    20
STA2023 (Section-2)  21           18.76  19      1.14      6.05%           17          20          18  19    20
STA2023 (Section-3)  21           18.95  19      1.12      5.89%           17          20          18  19    20
Following the procedure on Pages 628 - 631 in Triola (2010), we discuss here the ANOVA for testing the hypothesis of the equality of the mean scores of six independent groups based on the courses, that is, Category C: MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3) Groups. The results of ANOVA are provided in Table 8 and Figure 4 below.
Table 8: ANOVA: Hypothesis Testing About Equality of Mean Scores
ANOVA of Category (C): MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3) Groups; Alpha = 0.05
Claim (Null Hypothesis): Equality of the mean scores of the Category (C) groups
Alternative Hypothesis: The mean scores are not all equal

Source     DF   SS          MS        Test Stat, F  Critical F  P-Value
Treatment  4    1.704643    0.426161  0.25042       2.461696    0.9088
Error      101  171.880262  1.701785
Total      105  173.584906

Conclusion: Fail to reject the null hypothesis. There is enough evidence to support the claim of the equality of the mean scores of the Category (C) groups: MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3).
Figure 4: ANOVA: Hypothesis Testing About Equality of Mean Scores

3.4 DIVERSITY ANALYSIS

This sub-section discusses the diversity analysis for testing the hypothesis of the evenness ratio of respondents (two independent samples: MAC and STA groups) based on gender. All these analyses were carried out by using the statistical software packages STATDISK and EXCEL.

(I) Respondent Performance Based on Gender (Two Independent Samples: MAC and STA Groups Based on Gender): The performance of respondents (two independent samples: MAC and STA groups) based on gender is provided in Table 9 and Figure 5 below.
Table 9: Respondent Performance Based on Gender (Two Independent Samples: MAC and STA Groups Based on Gender)
Group-Gender      % Scoring 20 Points (100%)  % Scoring 15-19 Points (75-95%)
MAC Group-Male    5.65%                       16.13%
STA Group-Male    7.26%                       8.87%
MAC Group-Female  12.10%                      13.71%
STA Group-Female  15.32%                      20.97%

(Percentages are out of the 124 students surveyed.)
Figure 5: Respondent Performance Based on Gender

(II) Diversity Analysis: This sub-section discusses the diversity analysis for testing the hypothesis of the evenness ratio of respondent performance based on gender, belonging to two independent samples: the MAC and STA groups. For the diversity analysis, we first compute the proportion (p) of male and female students scoring 20 and 15-19 points out of 20 points respectively, belonging to the MAC and STA groups, thus making eight different categories, which are provided in Table 10 below.
Table 10: Diversity Analysis Based on Gender
Group      Gender  Proportion (p) Scoring 20/20  Proportion (p) Scoring 15-19/20
MAC Group  Male    0.0565                        0.1613
STA Group  Male    0.0726                        0.0887
MAC Group  Female  0.1210                        0.1371
STA Group  Female  0.1532                        0.2097
Hypothesis: Does the respondent performance (that is, the proportion (p) of male and female student population scoring 20 and 15-19 points out of 20 points respectively belonging to the eight different categories as given in Table 10 above) suggest more diversity in the groups’ familiarity with developmental mathematics? The above hypothesis can be analyzed by applying Shannon’s Measure of Diversity Index (or, entropy) (Shannon, 1948), which is a measure of the diversity of a population, as given below.
Shannon's Diversity Index: For a discrete random variable associated with n (countable) possible outcomes E_i, where P(E_i) = p_i and P = (p_1, p_2, ..., p_n), Shannon's diversity index (or entropy), H_n(P), or simply H, is defined by the following formula:

$$H = -\sum_{i=1}^{n} p_i \ln p_i \tag{1}$$

It can be easily verified that Shannon's diversity index (or entropy), H, satisfies the following conditions:
(i) H is maximum when p_1 = p_2 = ... = p_n = 1/n.

(ii) H is minimum when p_i = 1 and p_j = 0 for all j ≠ i (i = 1, 2, ..., n); that is, H is minimum when one of the probabilities is unity and all others are zero.

(iii) From (i) and (ii), it follows that, for the discrete case, 0 ≤ H ≤ ln n.
Further, the largest value of Shannon's diversity index, H_max, is given by the following formula:

$$H_{\max} = \ln S, \tag{2}$$

where S denotes the number of categories in the population.
Evenness Ratio: The evenness ratio, E_H, is given by the following formula:

$$E_H = \frac{H}{H_{\max}}, \tag{3}$$

where 0 ≤ E_H ≤ 1. Note that if E_H = 1, there is complete evenness.
Now, using the values of the proportion (p) from Table 10 in Equations (1), (2) and (3), the values of Shannon's diversity index H, the largest value of Shannon's diversity index H_max, and the evenness ratio E_H are computed as follows:

H = 2.004878,  H_max = 2.079442,  and  E_H = 0.96414237.

Since E_H = 0.96414237 ≈ 1, there appears to be complete evenness in the respondent performance (that is, in the proportion (p) of male and female students scoring 20 and 15-19 points out of 20 points respectively, belonging to the eight different categories given in Table 10 above).
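These three values can be reproduced directly from the eight proportions in Table 10:

```python
from math import log

# The eight group-gender-score proportions from Table 10.
p = [0.0565, 0.1613,   # MAC male: 20 pts, 15-19 pts
     0.0726, 0.0887,   # STA male
     0.1210, 0.1371,   # MAC female
     0.1532, 0.2097]   # STA female

H = -sum(pi * log(pi) for pi in p)   # Shannon's diversity index, Eq. (1)
H_max = log(len(p))                  # ln S with S = 8 categories, Eq. (2)
E = H / H_max                        # evenness ratio, Eq. (3)

assert abs(H - 2.004878) < 1e-4
assert abs(H_max - 2.079442) < 1e-6
assert abs(E - 0.96414237) < 1e-4
```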
4. CONCLUSIONS

This paper discussed students' familiarity with developmental mathematics from a statistical point of view. A survey consisting of 20 multiple choice questions on developmental mathematics was constructed to test students' familiarity with developmental mathematics in six different courses, that is, MAC 1147, MAC 2233 (two sections) and STA 2023 (three sections), during the spring semester of 2016, and was administered online in Blackboard by the instructor. A total of 126 students (out of 151 enrolled students) participated in the survey. There were two students in the STA2023 courses who scored 5 (25%) and 8 (40%) out of 20 points respectively, and so were discarded from the analysis. The mastery report of the 124 survey participants (excluding the above two students in the STA2023 courses) is provided in Table 2 and Figure 1 in Sub-section 3.1. The minimum passing score was assumed to be 75%. Out of the 124 survey participants considered in this research project, 40.30% of the students scored 20 out of 20 points, whereas 59.70% scored 15-19 out of 20 points. Based on the hypothesis testing, the following inferences were drawn about the survey participants.
There was sufficient evidence to support the claim that the mean scores of the MAC group and STA group participants were the same.
There was sufficient evidence to support the claim of the equality of the mean scores of the three independent groups of participants based on the courses, that is, MAC1147, MAC2233 (Two Sections), and STA2023 (Three Sections).
There was sufficient evidence to support the claim of the equality of the mean scores of the Category C group participants: MAC1147, MAC2233 (Section-1), MAC2233 (Section-2), STA2023 (Section-1), STA2023 (Section-2), and STA2023 (Section-3).
There appeared to be complete evenness in the respondent performance (that is, in the proportion (p) of the male and female student population scoring 20 and 15-19 points out of 20 points, respectively, belonging to the eight different categories).
It is hoped that the findings of this paper will be quite useful to researchers in various disciplines.
ACKNOWLEDGMENT
The author would like to express his sincere gratitude and indebtedness to his students of the courses MAC 1147, MAC 2233 (two sections) and STA 2023 (three sections), in the spring semester of 2016, for their cooperation in participating in the survey. Further, the author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. He would also like to acknowledge his sincere indebtedness to the works of various authors and resources on the subject which he consulted during the preparation of this research project. The author is thankful to his wife for her patience and perseverance during the period in which this paper was prepared. The author would like to dedicate this paper to his late parents, brothers and sisters. Last but not least, the author is thankful to Miami Dade College for giving him an opportunity to serve this college, without which it would have been impossible to conduct this research.
REFERENCES
[1] Shakil, M., Calderin, V., and Pierre-Philippe, L. (2010). Survey of Students' Familiarity with Grammar and Mechanics of English Language – An Exploratory Analysis. Polygon, Vol. 4, pp. 43 – 55.
[2] Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, pp. 379 – 423; 623 – 656.
[3] Shannon, C.E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30, pp. 50 - 64.
[4] Siromoney, G. (1964). An Information-theoretical Test for Familiarity with a Foreign Language. Journal of Psychological Researches, viii, pp. 267 – 272.
[5] Triola, M. F. (2010). Elementary Statistics. Addison-Wesley, N. Y.
APPENDIX I
Spring 16
"Survey of Student's Familiarity with Developmental Mathematics"
Name: GENDER:
Current GPA: Major:
Course Name/Reference: Term/Year:
MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
Simplify.
1) Write "forty-one thousand five hundred forty-three" in standard form. 1)
A) 410,543 B) 415,043 C) 41,543 D) 401,543
Solve. Write the answer in simplest form.
2) Mary is saving 2/19 of her monthly income of $5358 for retirement. How much money is she setting 2)
aside each month for retirement?
A) $50,901 B) $282 C) $141 D) $564
Solve.
3) Subtract 9 from 54. 3)
A) 46 B) 55 C) 44 D) 45
Round the decimal to the indicated place value.
4) 10.849, nearest tenth 4)
A) 10.9 B) 10.8 C) 10.85 D) 10.7
Perform the indicated operations. Round the result to the nearest thousandth, if necessary.
5) 85.42 + 79.65 + 15.475 5) A) 180.645 B) 181.545 C) 180.555 D) 180.545
Simplify the expression.
6) -12 - (-7) 6) A) -5 B) 19 C) 5 D) -19
7) 5(-11) 7) A) -55 B) -60 C) -550 D) -155
Perform the indicated operations. Round the result to the nearest thousandth, if necessary.
8) A country reports total exports of $4,771 million last year. Write this number using 8) standard notation.
A) $4,771,000 B) $4,771,000,000 C) $4,771 D) $4,771,000,000,000
Write the decimal as a percent.
9) 0.41 9) A) 410% B) 0.041% C) 4.1% D) 41%
[Figure for Question 11: a triangle with two labeled angles, 48° and 66°.]
Solve.
10) In a survey of 100 people, 47 preferred ketchup on their hot dogs. What percent preferred ketchup? 10)
A) 47% B) 0.47% C) 47/100 % D) 4.7%
The following plane figure is called a triangle. The sum of the three angles of a triangle is always 180°. Find the measure of the missing angle in the figure.
11) 11)
A) 58° B) 66° C) 76° D) 48°
Fill in the blank with one of the words or phrases listed below.
equivalent, <, >, least common denominator, mixed number, least common multiple, like
12) The symbol ____ means "is greater than". 12)
A) equivalent B) < C) > D) like
Find the GCF for the list.
13) 36, 15 13) A) 6 B) 1 C) 15 D) 3
Simplify the radical. Indicate if the radical is not a real number. Assume that x represents a positive real number.
14) √625 14)
A) -25 B) 312 C) 25 D) Not a real number
15) In the following figure, the sum of the angles x and 30° is 90°, that is, the angles x and 30° are complementary of each other. Find the measure of x. 15)
A) 55° B) 115° C) 150° D) 60°
Insert <, >, or = to make the statement true.
16) -6 ____ -3 16)
A) = B) < C) >
Evaluate the expression for the given replacement values.
17) x2 + y2 for x = 5 and y = - 2 17)
A) 29 B) 100 C) 14 D) 20
Simplify the expression.
18) 7x + 2 - 3x + 1 18) A) 4x + 1 B) 7x C) 4x + 3 D) 10x + 3
Solve the equation. Don't forget to first simplify each side of the equation, if possible.
19) 6x - 5x + 4 = 4 19) A) 4 B) -4 C) 0 D) 8
The bar graph shows the number of students who flunk Dr. Jones' class each year.
20) During which year(s) did Dr. Jones have more than 10 students flunk his class? 20)
A) 1998, 1999 B) 2002 C) 1998, 1999, 2000 D) 1998, 2002
Item Analysis Statistics and Their Uses: An Overview
M. Shakil, Ph.D.
Professor of Mathematics
Department of Liberal Arts and Sciences
Miami Dade College, Hialeah Campus
FL 33012, USA
E-mail: [email protected]
Abstract
In this paper, we have presented an overview of some item analysis statistics which are available in the ParSCORE™ analysis report. The uses of item analysis statistics for some multiple-choice math examinations have been investigated. It is hoped that the present study will be quite useful in recognizing the most critical pieces of the test item data, and in evaluating whether or not a test item needs revision. The methods discussed in this project can be used to describe the relevance of test item analysis to classroom tests.
Keywords: Item Analysis Statistics, Multiple-Choice Examinations, ParSCORE™ Analysis.
1. Introduction
An item analysis involves many statistics that can provide useful information for determining the validity and improving the quality and accuracy of multiple-choice or true/false items. These statistics are used to measure the ability levels of examinees from their responses to each item. The ParSCORE™ item analysis report, produced when a multiple-choice exam is machine scored, consists of three types of reports, that is, a summary of test statistics, a test frequency table, and item statistics. The test statistics summary and frequency table describe the distribution of test scores. The item analysis statistics evaluate class-wide performance on each test item. The ParSCORE™ report on item analysis statistics gives an overall view of the test results and evaluates each test item, which is also useful in comparing the item analyses for different test forms. The organization of this paper is as follows. In Section 2, descriptions of some useful, common item analysis statistics, that is, item difficulty, item discrimination, distractor analysis, and reliability, are presented. For the sake of completeness, definitions of some test statistics as reported in the ParSCORE™ analysis report are also provided in Section 2. Section 3 contains the uses of item analysis statistics for some multiple-choice math examinations. The concluding remarks are presented in Section 4.
2. Item Analysis Statistics In what follows, we shall present some commonly used item analysis statistics available on ParSCORETM report when a Multiple-Choice Exam is machine scored. For details on these, the
interested readers are referred to Wood (1960), Lord & Novick (1968), Henrysson (1971), Nunally (1978), Thompson and Levitov (1985), Crocker and Algina (1986), Ebel and Frisbie (1986), Suen (1990), Thorndike et al. (1991), DeVellis (1991), Millman and Greene (1993), Haladyna (1999), Tanner (2001), Haladyna et al. (2002), and Mertler (2003), among others. (I) Item Difficulty: Item difficulty is a measure of the difficulty of an item. For items (that is, multiple-choice questions) with one correct alternative worth a single point, the item difficulty (also known as the item difficulty index, or the difficulty level index, or the difficulty factor, or the item facility index, or the item easiness index, or the p -value) is defined as the proportion of
respondents (examinees) selecting the answer to the item correctly, and is given by
p = c / n,

where p = the difficulty factor, c = the number of respondents selecting the correct answer to an item, and n = the total number of respondents. Item difficulty is relevant for determining whether
students have learned the concept being tested. It also plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not. Note that
(i) 0 ≤ p ≤ 1.

(ii) A higher value of p indicates a low difficulty level, that is, the item is easy. A lower value of p indicates a high difficulty level, that is, the item is difficult. In general, an ideal test should have an overall item difficulty of around 0.5; however, it is acceptable for individual items to have higher or lower facility (ranging from 0.2 to 0.8). In a criterion-referenced test (CRT), with emphasis on mastery-testing of the topics covered, the optimal value of p for many items is expected to be 0.90 or above. On the other hand, in a norm-referenced test (NRT), with emphasis on discriminating between different levels of achievement, the optimal value is p = 0.50. For details on these, see, for example, Chase (1999), among others.
(iii) To maximize item discrimination, the ideal (or moderate, or desirable) item difficulty level, denoted as p_M, is defined as a point midway between the probability of success, denoted as p_S, of answering the multiple-choice item correctly (that is, 1.00 divided by the number of choices) and a perfect score (that is, 1.00) for the item, and is given by

p_M = (p_S + 1) / 2.

(iv) Thus, using the above formula in (iii), the ideal (or moderate, or desirable) item difficulty levels for multiple-choice items can be easily calculated; these are provided in the following table (for details, see, for example, Lord, 1952; among others).
Number of Alternatives | Probability of Success (p_S) | Ideal Item Difficulty Level (p_M)
2 | 0.50 | 0.75
3 | 0.33 | 0.67
4 | 0.25 | 0.63
5 | 0.20 | 0.60
(Ia) Mean Item Difficulty (or Mean Item Easiness): Mean item difficulty is the average of the difficulty (easiness) values of all test items. It is an overall measure of the test difficulty and ideally ranges between 60 % and 80 % (that is, 0.60 ≤ p ≤ 0.80) for classroom achievement tests. Lower numbers indicate a difficult test, while higher numbers indicate an easy test.
(II) Item Discrimination: The item discrimination (or the item discrimination index) is a basic measure of the validity of an item. It is defined as the discriminating power or the degree of an item's ability to discriminate (or differentiate) between high achievers (that is, those who scored high on the total test) and low achievers (that is, those who scored low), which are determined on the same criterion, that is, (1) internal criterion, for example, test itself; and (2) external criterion, for example, intelligence test or other achievement test. Further, the computation of the item discrimination index assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right or wrong dichotomy of a student’s performance on an item. For details on the item discrimination index, see, for example, Kelly (1939), Wood (1960), Henrysson (1971), Nunally (1972), Ebel (1979), Popham (1981), Ebel & Frisbie (1986), Weirsma & Jurs (1990), Glass & Hopkins (1995), Brown (1996), Chase (1999), Haladyna (1999), Nitko (2001), Tanner (2001), Oosterhof (2001), Haladyna et al. (2002), and Mertler (2003), among others. There are several ways to compute the item discrimination, but, as shown on the ParSCORETM item analysis report and also as reported in the literature, the following formulas are most commonly used indicators of item’s discrimination effectiveness.
(a) Item Discrimination Index (or Item Discriminating Power, or D-Statistic), D: Let the students' test scores be rank-ordered from lowest to highest. Let

p_U = (No. of students in the upper 25 % – 30 % group answering the item correctly) / (Total number of students in the upper group),

and

p_L = (No. of students in the lower 25 % – 30 % group answering the item correctly) / (Total number of students in the lower group).

The ParSCORE™ item analysis report considers the upper 27 % and the lower 27 % as the analysis groups. The item discrimination index, D, is given by

D = p_U − p_L.
Note that
(i) −1 ≤ D ≤ +1.

(ii) Items with positive values of D are known as positively discriminating items, and those with negative values of D are known as negatively discriminating items.

(iii) If D = 0, that is, p_U = p_L, there is no discrimination between the upper and lower groups.

(iv) If D = +1.00, that is, p_U = 1.00 and p_L = 0, there is a perfect discrimination between the two groups.

(v) If D = −1.00, that is, p_U = 0 and p_L = 1.00, it means that all members of the lower group answered the item correctly and all members of the upper group answered the item incorrectly. This indicates the invalidity of the item, that is, the item has been miskeyed and needs to be rewritten or eliminated.

(vi) A guideline for the value of an item discrimination index is provided in the following table; see, for example, Chase (1999), and Mertler (2003), among others.
Item Discrimination Index, D | Quality of an Item
D ≥ 0.50 | Very Good Item; Definitely Retain
0.40 ≤ D ≤ 0.49 | Good Item; Very Usable
0.30 ≤ D ≤ 0.39 | Fair Quality; Usable Item
0.20 ≤ D ≤ 0.29 | Potentially Poor Item; Consider Revising
D < 0.20 | Potentially Very Poor; Possibly Revise Substantially, or Discard
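The D computation from the upper- and lower-group counts can be sketched as follows (the group sizes and counts below are made-up examples, not data from the paper):

```python
def discrimination_index(upper_correct, upper_total, lower_correct, lower_total):
    """D = p_U - p_L: proportion correct in the upper group minus that in the lower group."""
    return upper_correct / upper_total - lower_correct / lower_total

# Perfect discrimination: all of the upper group correct, none of the lower group.
print(discrimination_index(27, 27, 0, 27))   # 1.0
# Negative discrimination (a likely miskeyed item): lower group outperforms upper.
print(discrimination_index(5, 27, 20, 27))   # -0.555...
```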
(b) Mean Item Discrimination Index, D̄: This is the average discrimination index for all test items combined. A large positive value (above 0.30) indicates good discrimination between the upper and lower scoring students. Tests that do not discriminate well are generally not very reliable and should be reviewed.

(c) Point-Biserial Correlation (or Item-Total Correlation or Item Discrimination)
Coefficient, r_pbis: The point-biserial correlation coefficient is another item discrimination index of assessing the usefulness (or validity) of an item as a measure of individual differences in knowledge, skill, ability, attitude, or personality characteristic. It is defined as the correlation between the student performance on an item (correct or incorrect) and the overall test score, and is given by either of the following two equations (which are mathematically equivalent).
(i) Suen (1990); DeVellis (1991); Haladyna (1999):

r_pbis = ((X̄_C − X̄_T) / s) √(p / q),

where r_pbis = the point-biserial correlation coefficient; X̄_C = the mean total score for examinees who have answered the item correctly; X̄_T = the mean total score for all examinees; p = the difficulty value of the item; q = 1 − p; and s = the standard deviation of total exam scores.
(ii) Brown (1996):

r_pbis = ((m_p − m_q) / s) √(p q),

where r_pbis = the point-biserial correlation coefficient; m_p = the mean total score for examinees who have answered the item correctly; m_q = the mean total score for examinees who have answered the item incorrectly; p = the difficulty value of the item; q = 1 − p; and s = the standard deviation of total exam scores.
Note that

(i) The interpretation of the point-biserial correlation coefficient, r_pbis, is the same as that of the D-statistic.

(ii) It assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right or wrong dichotomy of a student performance on an item.

(iii) It is mathematically equivalent to the Pearson (product moment) correlation coefficient, which can be shown by assigning two distinct numerical values to the dichotomous variable (test item), that is, incorrect = 0 and correct = 1.

(iv) −1 ≤ r_pbis ≤ +1.

(v) r_pbis = 0 means little correlation between the score on the item and the score on the test.

(vi) A high positive value of r_pbis indicates that the examinees who answered the item correctly also received higher scores on the test than those examinees who answered the item incorrectly.

(vii) A negative value indicates that the examinees who answered the item correctly received low scores on the test, and those examinees who answered the item incorrectly did better on the test. It is advisable that an item with r_pbis = 0 or with a large negative value of r_pbis should be eliminated or revised. Also, an item with a low positive value of r_pbis should be revised for improvement.
(viii) Generally, the value of r_pbis for an item may be put into two categories as provided in the following table.

Point-Biserial Correlation Coefficient, r_pbis | Quality
r_pbis ≥ 0.30 | Acceptable Range
r_pbis = 1 | Ideal Value
(ix) The statistical significance of the point-biserial correlation coefficient, r_pbis, may be determined by applying the Student's t test; see, for example, Triola (2007), among others.
Remark: It should be noted that the use of the point-biserial correlation coefficient, r_pbis, is more advantageous than that of the item discrimination index statistic, D, because every student taking the test is taken into consideration in the computation of r_pbis, whereas only 54 % of the test-takers (that is, the upper 27 % and the lower 27 % groups) are used to compute D.
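The Brown (1996) form can be sketched as follows. Note that this sketch uses the population standard deviation of the total scores, under which the stated equivalence with the Pearson correlation is exact; the toy data are invented for illustration.

```python
import math

def point_biserial(item_correct, total_scores):
    """r_pbis = ((m_p - m_q) / s) * sqrt(p * q), the Brown (1996) form.

    item_correct: list of 0/1 item responses; total_scores: matching total test scores.
    Uses the population standard deviation s of the total scores.
    """
    n = len(total_scores)
    n_correct = sum(item_correct)
    p = n_correct / n                          # item difficulty
    q = 1.0 - p
    m_p = sum(t for c, t in zip(item_correct, total_scores) if c == 1) / n_correct
    m_q = sum(t for c, t in zip(item_correct, total_scores) if c == 0) / (n - n_correct)
    mean = sum(total_scores) / n
    s = math.sqrt(sum((t - mean) ** 2 for t in total_scores) / n)  # population SD
    return ((m_p - m_q) / s) * math.sqrt(p * q)

item = [1, 1, 0, 0]
totals = [10, 8, 6, 4]
print(point_biserial(item, totals))  # ≈ 0.894: high scorers tend to answer correctly
```

On this toy data the value agrees with the Pearson correlation between the 0/1 item column and the totals, illustrating note (iii) above.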
(d) Mean Item-Total Correlation Coefficient, r̄_pbis: It is defined as the average correlation of
all the test items with the total score. It is a measure of overall test discrimination. A large positive value indicates good discrimination between students.
(III) Internal Consistency Reliability Coefficient (Kuder-Richardson 20, KR_20, Reliability Estimate): The statistic that measures the test reliability of inter-item consistency, that is, how well the test items are correlated with one another, is called the internal consistency reliability coefficient of the test. For a test having multiple-choice items that are scored correct or incorrect, and that is administered only once, the Kuder-Richardson formula 20 (also known as KR-20) is used to measure the internal consistency reliability of the test scores; see, for example, Nunally (1972), and Haladyna (1999), among others. The KR-20 is also reported in the ParSCORE™ item analysis. It is given by the following formula:

KR_20 = [n / (n − 1)] [1 − (Σ_{i=1}^{n} p_i q_i) / s²],

where KR_20 = the reliability index for the total test; n = the number of items in the test; s² = the variance of test scores; p_i = the difficulty value of the i-th item; and q_i = 1 − p_i.
Note that

(i) 0.0 ≤ KR_20 ≤ 1.0.

(ii) KR_20 = 0 indicates a weaker relationship between test items, that is, the overall test score is less reliable. A large value of KR_20 indicates high reliability.

(iii) Generally, the value of KR_20 for a test may be put into the following categories as provided in the table below.
KR_20 | Quality
KR_20 ≥ 0.60 | Acceptable Range
KR_20 ≥ 0.75 | Desirable
0.80 ≤ KR_20 ≤ 0.85 | Better
KR_20 = 1 | Ideal Value
(iv) Remarks: The reliability of a test can be improved as follows:

a) By increasing the number of items in the test, for which the following Spearman-Brown prophecy formula is used (Mertler, 2003):

r_est = (n r) / (1 + (n − 1) r),

where r_est = the estimated new reliability coefficient; r = the original KR_20 reliability coefficient; and n = the number of times the test is lengthened.

b) Or, using the items that have high discrimination values in the test.

c) Or, performing an item-total statistic analysis as described above.
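The KR-20 and Spearman-Brown computations can be sketched on a made-up 0/1 score matrix (illustrative only; this is not the paper's exam data):

```python
def kr20(item_matrix):
    """KR-20 = [n/(n-1)] * [1 - sum(p_i * q_i) / s^2] for 0/1-scored items.

    item_matrix: one row of 0/1 item scores per examinee, one column per item.
    s^2 is the sample variance (n - 1 denominator) of the total scores,
    matching the paper's definition of the variance.
    """
    n_items = len(item_matrix[0])
    n_exam = len(item_matrix)
    totals = [sum(row) for row in item_matrix]
    mean = sum(totals) / n_exam
    var = sum((t - mean) ** 2 for t in totals) / (n_exam - 1)
    sum_pq = 0.0
    for j in range(n_items):
        p_j = sum(row[j] for row in item_matrix) / n_exam
        sum_pq += p_j * (1.0 - p_j)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var)

def spearman_brown(r, n):
    """Estimated reliability after lengthening the test n times."""
    return (n * r) / (1.0 + (n - 1) * r)

# Toy 4-examinee, 3-item score matrix (invented for illustration).
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(scores))               # 0.9375 (up to float rounding)
print(spearman_brown(0.75, 2))    # ≈ 0.857: doubling a test raises reliability
```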
(IV) Standard Error of Measurement (SE_m): It is another important component of test item analysis to measure the internal consistency reliability of a test; see, for example, Nunally (1972), and Mertler (2003), among others. It is given by the following formula:

SE_m = s √(1 − KR_20),  0.0 ≤ KR_20 ≤ 1.0,

where SE_m = the standard error of measurement; s = the standard deviation of test scores; and KR_20 = the reliability coefficient for the total test.

Note that

(i) SE_m = 0 when KR_20 = 1.

(ii) SE_m = s when KR_20 = 0.

(iii) A small value of SE_m (e.g., less than 3) indicates high reliability, whereas a large value of SE_m indicates low reliability.
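As a numerical check of the formula, the Version B summary values that appear later in Table 1 of Section 3 (s = 5.75, KR_20 = 0.90) reproduce the tabulated SEM:

```python
import math

def standard_error_of_measurement(s, kr20):
    """SE_m = s * sqrt(1 - KR_20)."""
    return s * math.sqrt(1.0 - kr20)

# Version B summary statistics (s = 5.75, KR_20 = 0.90) from Table 1 in Section 3.
print(round(standard_error_of_measurement(5.75, 0.90), 2))  # 1.82, as tabulated
```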
(iv) Remark: A higher reliability coefficient (that is, KR_20 close to 1) and a smaller standard deviation for a test indicate a smaller standard error of measurement. This is considered to be a more desirable situation for classroom tests.

(V) Test Item Distractor Analysis: It is an important and useful component of test item analysis. A test item distractor is defined as an incorrect response option in a multiple-choice test item. According to the research, there is a relationship between the quality of the distractors in a test item and the student performance on the test item, which also affects the student's performance on his/her total test score. The performance of these incorrect item response options can be determined through the test item distractor analysis frequency table, which contains the frequency, or number of students, that selected each incorrect option. The test item distractor analysis is also provided in the ParSCORE™ item analysis report. For details on test item distractor analysis, see, for example, Thompson & Levitov (1985), DeVellis (1991), Milman & Greene (1993), Haladyna (1999), and Mertler (2003), among others. A general guideline for the item distractor analysis is provided in the following table:
Item Response Options | Item Difficulty, p | Item Discrimination Index, D or r_pbis
Correct Response | 0.35 ≤ p ≤ 0.85 (Better) | D ≥ 0.30 or r_pbis ≥ 0.30 (Better)
Distractors | p ≥ 0.02 (Better) | D < 0 or r_pbis < 0 (Better)
(v) Mean: The mean is a measure of central tendency and gives the average test score of a sample of respondents (examinees), and is given by

x̄ = (Σ_{i=1}^{n} x_i) / n,

where x_i = individual test score and n = the number of respondents.
(vi) Median: If all scores are ranked from lowest to highest, the median is the middle score. Half of the scores will be lower than the median. The median is also known as the 50th percentile or the 2nd quartile.

(vii) Range of Scores: It is defined as the difference between the highest and lowest test scores. The range is a basic measure of variability.
(viii) Standard Deviation: For a sample of n examinees, the standard deviation, denoted by s, of test scores is given by the following equation:

s = √{ [Σ_{i=1}^{n} (x_i − x̄)²] / (n − 1) },

where x_i = individual test score and x̄ = average test score. The standard deviation is a measure of variability, or the spread of the score distribution. It measures how far the scores deviate from the mean. If the scores are grouped closely together, the test will have a small standard deviation. A test with a large value of the standard deviation is considered better in discriminating the student performance levels.
(ix) Variance: For a sample of n examinees, the variance, denoted by s², of test scores is defined as the square of the standard deviation, and is given by the following equation:

s² = [Σ_{i=1}^{n} (x_i − x̄)²] / (n − 1).
(x) Skewness: For a sample of n examinees, the skewness, denoted by α_3, of the distribution of the test scores is given by the following equation:

α_3 = [n / ((n − 1)(n − 2))] Σ_{i=1}^{n} [(x_i − x̄) / s]³,

where x_i = individual test score, x̄ = average test score, and s = standard deviation of test scores. It measures the lack of symmetry of the distribution. The skewness is 0 for a symmetric distribution, and is negative or positive depending on whether the distribution is negatively skewed (has a longer left tail) or positively skewed (has a longer right tail).
(xi) Kurtosis: For a sample of n examinees, the kurtosis, denoted by α_4, of the distribution of the test scores is given by the following equation:

α_4 = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] Σ_{i=1}^{n} [(x_i − x̄) / s]⁴ − [3(n − 1)²] / [(n − 2)(n − 3)],

where x_i = individual test score, x̄ = average test score, and s = standard deviation of test scores. It measures the tail-heaviness (the amount of probability in the tails). Note that this is an excess kurtosis measure, which is approximately 0 for the normal distribution. Thus, depending on whether α_4 > 0 or α_4 < 0, a distribution is heavier tailed or lighter tailed than the normal distribution.
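Both sample moments can be computed directly from the adjusted formulas above; a sketch on a small symmetric data set (invented for illustration, not the paper's scores):

```python
import math

def sample_skewness(x):
    """alpha_3 = [n/((n-1)(n-2))] * sum(((x_i - mean)/s)^3), s = sample SD."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return (n / ((n - 1) * (n - 2))) * sum(((v - mean) / s) ** 3 for v in x)

def sample_kurtosis(x):
    """Excess kurtosis:
    alpha_4 = [n(n+1)/((n-1)(n-2)(n-3))] * sum(((x_i - mean)/s)^4)
              - 3(n-1)^2 / ((n-2)(n-3)).
    """
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    fourth = sum(((v - mean) / s) ** 4 for v in x)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * fourth \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

scores = [1, 2, 3, 4, 5]           # symmetric toy data
print(sample_skewness(scores))     # 0.0 (symmetric)
print(sample_kurtosis(scores))     # -1.2 (lighter tailed than normal)
```

These are the same adjusted formulas used by common statistical software, so the signs of the outputs can be read directly against the guidelines above.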
3. Use of Item Analysis Statistics
This section provides some of the uses of item analysis statistics for some multiple-choice math examinations (which we call MAT0000-Version A and MAT0000-Version B). It consists of three parts, which are described below.

3.1. Test Item Analysis of MAT0000-Version A and MAT0000-Version B Exams

An item analysis of the data obtained from my MAT0000-Version A and MAT0000-Version B exam items is presented here based upon classical test theory (CTT). Various test item statistics and relevant statistical graphs (for both test forms, Versions A and B) using the ParSCORE™ item analysis report and the Minitab software are computed and summarized in Tables 1 – 5 below. Each version consisted of 30 items. There were two different groups of 7 students for each version. It appears from these statistical analyses that the large value of KR_20 = 0.90 for Version B indicates its high reliability in comparison to Version A, which is also substantiated by the large positive values of Mean D.I. = 0.450 > 0.30 and Mean Pt.-Bis. r = 0.4223, the small value of the standard error of measurement (that is, SEM = 1.82), and an ideal value of the mean (that is, 19.57 > 18, the passing score) for Version B. These analyses are also evident from the bar charts and scatter plots drawn for various test item statistics using Minitab, that is, item difficulty (p), item discrimination index (D), and point-biserial correlation coefficient (r_pbis), which are presented below in Figures 1 and 2.
Table 1
A Comparison of MAT0000-Version A and MAT0000-Version B Exam Test Items
Exam Version | KR_20 Reliability | Mean | SD | SEM | p < 0.3 | 0.3 ≤ p ≤ 0.7 | p > 0.7 | D ≥ 0.2
A | 0.53 | 17.14 | 2.80 | 1.92 | 8 | 10 | 12 | 14
B | 0.90 | 19.57 | 5.75 | 1.82 | 1 | 15 | 14 | 20

Exam Version | Mean D.I. | Mean Pt.-Bis. r
A | 0.233 | 0.2060
B | 0.450 | 0.4223
Table 2 (MAT0000-Version A - Data Display)
Row PU PL Disc. Ind. (D) Difficulty (p) Difficulty (p) % Pt-Bis (r)
1 1.0 0.0 1.0 0.4286 42.86 0.78
2 1.0 1.0 0.0 0.8571 85.71 0.02
3 1.0 0.5 0.5 0.8571 85.71 0.46
4 1.0 0.0 1.0 0.5714 57.14 0.66
5 1.0 0.0 1.0 0.5714 57.14 0.77
6 1.0 0.0 1.0 0.7143 71.43 0.82
7 0.5 0.0 0.5 0.5714 57.14 0.56
8 1.0 1.0 0.0 1.0000 100.00 0.00
9 0.0 0.5 -0.5 0.1429 14.29 -0.46
10 0.5 0.5 0.0 0.4286 42.86 0.27
11 0.5 0.5 0.0 0.4286 42.86 -0.15
12 1.0 1.0 0.0 1.0000 100.00 0.00
13 1.0 1.0 0.0 1.0000 100.00 0.00
14 0.0 0.0 0.0 0.0000 0.00 0.00
15 1.0 0.5 0.5 0.5714 57.14 0.25
16 1.0 0.5 0.5 0.7143 71.43 0.37
17 1.0 0.5 0.5 0.8571 85.71 0.60
18 1.0 1.0 0.0 1.0000 100.00 0.00
19 1.0 1.0 0.0 1.0000 100.00 0.00
20 1.0 0.5 0.5 0.8571 85.71 0.46
21 1.0 0.5 0.5 0.8571 85.71 0.46
22 0.5 0.5 0.0 0.5714 57.14 -0.16
23 0.0 0.5 -0.5 0.1429 14.29 -0.46
24 0.5 1.0 -0.5 0.5714 57.14 -0.27
25 0.0 0.0 0.0 0.2857 28.57 0.08
26 0.0 0.0 0.0 0.1429 14.29 -0.02
27 1.0 0.5 0.5 0.4286 42.86 0.37
28 0.5 0.0 0.5 0.1429 14.29 0.71
29 0.5 0.0 0.5 0.2857 28.57 0.53
30 0.0 0.5 -0.5 0.1429 14.29 -0.46
Table 3
Descriptive Statistics: MAT0000-Version A
Variable Mean SE Mean StDev Variance Minimum Q1 Disc. Ind. (D) 0.2333 0.0821 0.4498 0.2023 -0.5000 0.000000000 Difficulty (p) 0.5714 0.0573 0.3139 0.0985 0.000000000 0.2857 Difficulty (p) % 57.14 5.73 31.39 985.11 0.000000000 28.57 Pt-Bis (r) 0.2063 0.0703 0.3850 0.1482 -0.4600 -0.00500
Variable Median Q3 Maximum Disc. Ind. (D) 0.000000000 0.5000 1.0000 Difficulty (p) 0.5714 0.8571 1.0000 Difficulty (p) % 57.14 85.71 100.00 Pt-Bis (r) 0.1650 0.5375 0.8200
[Charts omitted: bar charts of Difficulty (p) %, Disc. Ind. (D), and Pt-Bis (r), and scatter plots and charts of Disc. Ind. (D) and Pt-Bis (r) vs Difficulty (p) %.]

Figure 1
(Bar Charts and Scatter Plots for p, D, and r_pbis, Version A)
Table 4 (MAT0000-Version B - Data Display)
Row PU PL Disc. Ind. (D) Difficulty (p) Difficulty (p) % Pt-Bis (r)
1 1.0 1.0 0.0 1.0000 100.00 0.00
2 1.0 1.0 0.0 0.7143 71.43 0.06
3 1.0 1.0 0.0 1.0000 100.00 0.00
4 1.0 1.0 0.0 0.8571 85.71 0.11
5 1.0 0.5 0.5 0.8571 85.71 0.54
6 1.0 0.5 0.5 0.7143 71.43 0.67
7 1.0 0.0 1.0 0.4286 42.86 0.92
8 1.0 0.5 0.5 0.4286 42.86 0.37
9 0.5 0.5 0.0 0.4286 42.86 0.42
10 1.0 0.0 1.0 0.4286 42.86 0.92
11 1.0 0.5 0.5 0.5714 57.14 0.69
12 1.0 1.0 0.0 1.0000 100.00 0.00
13 1.0 0.5 0.5 0.8571 85.71 0.32
14 0.5 0.0 0.5 0.4286 42.86 0.37
15 1.0 0.5 0.5 0.5714 57.14 0.54
16 0.5 0.0 0.5 0.5714 57.14 0.34
17 1.0 0.0 1.0 0.5714 57.14 0.69
18 1.0 1.0 0.0 1.0000 100.00 0.00
19 1.0 1.0 0.0 1.0000 100.00 0.00
20 1.0 0.5 0.5 0.8571 85.71 0.54
21 0.5 1.0 -0.5 0.8571 85.71 -0.39
22 1.0 0.5 0.5 0.7143 71.43 0.67
23 0.5 0.0 0.5 0.1429 14.29 0.67
24 1.0 0.0 1.0 0.4286 42.86 0.92
25 1.0 0.0 1.0 0.5714 57.14 0.44
26 1.0 0.0 1.0 0.4286 42.86 0.67
27 1.0 0.5 0.5 0.7143 71.43 0.06
28 0.5 0.0 0.5 0.1429 14.29 0.67
29 1.0 0.5 0.5 0.8571 85.71 0.54
30 1.0 0.0 1.0 0.4286 42.86 0.92
Table 5
Descriptive Statistics: MAT0000-Version B
Variable Mean SE Mean StDev Variance Minimum Q1 Disc. Ind. (D) 0.4500 0.0733 0.4015 0.1612 -0.5000 0.000000000 Difficulty (p) 0.6524 0.0458 0.2508 0.0629 0.1429 0.4286 Difficulty (p) % 65.24 4.58 25.08 628.81 14.29 42.86 Pt-Bis (r) 0.4223 0.0628 0.3440 0.1183 -0.3900 0.0600
Variable Median Q3 Maximum Disc. Ind. (D) 0.5000 0.6250 1.0000 Difficulty (p) 0.6429 0.8571 1.0000 Difficulty (p) % 64.29 85.71 100.00 Pt-Bis (r) 0.4900 0.6700 0.9200
[Charts omitted: bar charts of Difficulty (p) %, Disc. Ind. (D), and Pt-Bis (r), and scatter plots and charts of Disc. Ind. (D) and Pt-Bis (r) vs Difficulty (p) %.]

Figure 2
(Bar Charts and Scatter Plots for p, D, and r_pbis, Version B)
3.2. A Comparison of MAT0000-Version A and MAT0000-Version B Exams Performance

A Two-Sample T-Test: To identify if there is a significant difference between the MAT0000-Version A and MAT0000-Version B exam performance of the students, a two-sample T-test was conducted using the Minitab and Statdisk software. For this, first the assumption of normality was checked using the Anderson-Darling test for both groups, and the normality tests were met. The results are provided in Tables 6 – 7. Moreover, at the significance level of α = 0.05, the two-sample T-test fails to reject the claim that μ_A = μ_B, that is, the sample does not provide enough evidence to reject the claim.
Table 6
Descriptive Statistics: MAT0000-Version A and MAT0000-Version B Exams
Variable   Total Count   N   Mean    SE Mean   StDev   Variance   Minimum   Q1      Median
MAT0000A   7             7   17.14   1.14      3.02    9.14       13.00     14.00   17.00
MAT0000B   7             7   19.57   2.35      6.21    38.62      12.00     15.00   18.00

Variable   Q3      Maximum   Skewness   Kurtosis
MAT0000A   19.00   22.00     0.16       -0.03
MAT0000B   25.00   29.00     0.40       -1.31
Table 7
Two-Sample T-Test and CI: MAT0000-Version A and MAT0000-Version B
(Assume Unequal Variances)
Two-sample T for MAT0000-Version A vs MAT0000-Version B

           N   Mean    StDev   SE Mean
MAT0000A   7   17.14   3.02    1.1
MAT0000B   7   19.57   6.21    2.3

Difference = mu (MAT0000A) - mu (MAT0000B)
Estimate for difference: -2.42857
95% CI for difference: (-8.45211, 3.59497)
T-Test of difference = 0 (vs not =): T-Value = -0.93  P-Value = 0.380  DF = 8
Two-Sample T-Test and CI: MAT0000-Version A and MAT0000-Version B (Assume Equal Variances)
Two-sample T for MAT0000-Version A vs MAT0000-Version B

           N   Mean    StDev   SE Mean
MAT0000A   7   17.14   3.02    1.1
MAT0000B   7   19.57   6.21    2.3

Difference = mu (MAT0000A) - mu (MAT0000B)
Estimate for difference: -2.42857
95% CI for difference: (-8.11987, 3.26273)
T-Test of difference = 0 (vs not =): T-Value = -0.93  P-Value = 0.371  DF = 12
Both use Pooled StDev = 4.8868
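The two-sample results above can be checked from the summary statistics alone. A brief sketch using SciPy follows (values taken from the output above; note that SciPy computes the Welch degrees of freedom without rounding down, so its unequal-variances P-value differs slightly from Minitab's 0.380):

```python
from scipy import stats

# Summary statistics from the two-sample output (n, mean, std dev).
n_a, mean_a, sd_a = 7, 17.14, 3.02   # MAT0000-Version A
n_b, mean_b, sd_b = 7, 19.57, 6.21   # MAT0000-Version B

# Welch's two-sample t-test (unequal variances assumed).
t_w, p_w = stats.ttest_ind_from_stats(mean_a, sd_a, n_a,
                                      mean_b, sd_b, n_b, equal_var=False)

# Pooled two-sample t-test (equal variances assumed).
t_p, p_p = stats.ttest_ind_from_stats(mean_a, sd_a, n_a,
                                      mean_b, sd_b, n_b, equal_var=True)

# Both versions reproduce T = -0.93; neither P-value falls below 0.05,
# so the test fails to reject the claim of equal population means.
print(round(t_w, 2), round(p_w, 3))
print(round(t_p, 2), round(p_p, 3))
```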
3.3. A Comparison of MAT0000 Classroom Test Average (Pre) vs Final Exam (Post) Performance
A Paired Samples T-Test: To determine whether there is a significant gain in the MAT0000 posttest (state exit exam) relative to the pretest (classroom test average) performance of the students, a paired samples T-test was conducted using the Minitab and Statdisk software. First, to check whether the normality assumption for a paired samples T-test is met, hypothesis tests for the gain scores were conducted using Minitab; these indicated that the distributions are close to normal, so the normality assumption is easily met. The results are provided in Tables 8 to 10 and Figure 5 below. Moreover, at the significance level of α = 0.05, the paired samples T-test fails to reject the claim that μPost = μPre; that is, the sample does not provide enough evidence to reject the claim.
STATDISK OUTPUT: MAT0000
Paired T-Test and CI: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)
Figure 5 (Paired Samples T-Test: MAT0000 Pre vs Post Exams)
Table 8
MINITAB OUTPUT
Data Display: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)

Row   20071-Pre   20071-Post   Gain
1     69.4        56.7         -12.7
2     63.2        50.0         -13.2
3     54.8        60.0           5.2
4     78.0        83.3           5.3
5     75.6        76.7           1.1
6     66.8        63.3          -3.5
7     51.8        46.7          -5.1
8     44.6        40.0          -4.6
9     72.6        56.7         -15.9
10    68.4        60.0          -8.4
11    67.2        50.0         -17.2
12    76.6        96.7          20.1
13    82.6        73.3          -9.3
14    49.0        43.3          -5.7
Table 9
MAT0000
Descriptive Statistics: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)
Variable     Total Count   N    Mean    SE Mean   StDev   Variance   Minimum   Q1       Median
20071-Post   14            14   61.19   4.33      16.21   262.62     40.00     49.18    58.35
20071-Pre    14            14   65.76   3.12      11.66   136.01     44.60     54.05    67.80
Gain         14            14   -4.56   2.67      10.01   100.14     -17.20    -12.83   -5.40

Variable     Q3      Maximum   Range   IQR     Skewness   Kurtosis
20071-Post   74.15   96.70     56.70   24.98   0.84       0.22
20071-Pre    75.85   82.60     38.00   21.80   -0.51      -0.80
Gain         2.13    20.10     37.30   14.95   1.10       1.56
Table 10
MAT0000
Paired T-Test and CI: MAT0000-Post, MAT0000-Pre (Gain Score = Post – Pre)
Paired T for 20071-Post - 20071-Pre

             N    Mean       StDev      SE Mean
20071-Post   14   61.1929    16.2056    4.3311
20071-Pre    14   65.7571    11.6622    3.1169
Difference   14   -4.56429   10.00704   2.67450

95% CI for mean difference: (-10.34218, 1.21361)
T-Test of mean difference = 0 (vs not = 0): T-Value = -1.71  P-Value = 0.112
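The paired analysis above can be replicated from the fourteen pre/post pairs in the Data Display (Table 8); a short sketch using SciPy:

```python
from scipy import stats

# Pre (classroom test average) and Post (state exit exam) scores,
# as listed in the Data Display (Table 8).
pre  = [69.4, 63.2, 54.8, 78.0, 75.6, 66.8, 51.8,
        44.6, 72.6, 68.4, 67.2, 76.6, 82.6, 49.0]
post = [56.7, 50.0, 60.0, 83.3, 76.7, 63.3, 46.7,
        40.0, 56.7, 60.0, 50.0, 96.7, 73.3, 43.3]

# Paired samples t-test on the differences (Post - Pre).
t, p = stats.ttest_rel(post, pre)

# Mean gain score; matches the Minitab difference of -4.56429.
mean_gain = sum(b - a for a, b in zip(pre, post)) / len(pre)
print(round(mean_gain, 5), round(t, 2), round(p, 3))
```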
4. Concluding Remarks and Recommendation for Future Research
This paper has discussed some item analysis statistics available in the ParSCORE™ analysis report and has investigated their application to some multiple-choice mathematics examinations. It is hoped that the present study will be helpful in recognizing the most critical pieces of the state exit test item data, and in evaluating whether or not a test item needs revision. The methods discussed in this project can be used to describe the relevance of test item analysis to classroom tests. These procedures can also be used or modified to measure, describe, and improve tests or surveys such as college mathematics placement exams (that is, the CPT), mathematics study skills inventories, attitude surveys, test anxiety measures, information literacy assessments, and other general education learning outcomes. Further research may also investigate Bloom's cognitive taxonomy of test items, the applicability of beta-binomial models and Bayesian analysis of test items, and item response theory (IRT) using the 1-parameter logistic model (also known as the Rasch model), the 2- and 3-parameter logistic models, plots of the item characteristic curves (ICCs) of different test items, and other characteristics of IRT measurement instruments.
Acknowledgments
The author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. The author would also like to acknowledge his sincere indebtedness to the works of the various authors and resources on the subject consulted during the preparation of this research project. The author is thankful to his wife for her patience and perseverance during the period in which this paper was prepared. The author would like to dedicate this paper to his late parents, brothers and sisters. Last but not least, the author is thankful to Miami Dade College for the opportunity to serve this college, without which it would have been impossible to conduct this research.
References
Brown, J. D. (1996). Testing in language programs. Prentice Hall, Upper Saddle River, NJ.
Chase, C. I. (1999). Contemporary assessment for educators. Longman, New York.
Crocker, L. and Algina, J. (1986). Introduction to classical and modern test theory. Holt, Rinehart and Winston, New York.
DeVellis, R. F. (1991). Scale development: Theory and applications. Sage Publications, Newbury Park.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed). Prentice Hall, Englewood Cliffs, NJ.
Ebel, R. L. and Frisbie, D. A. (1986). Essentials of educational measurement. Prentice-Hall, Inc, Englewood Cliffs, NJ.
Glass, G. V. and Hopkins, K. D. (1995). Statistical Methods in Education and Psychology, 3rd edition. Allyn & Bacon, Boston.
Haladyna, T. M. (1999). Developing and validating multiple-choice exam items, 2nd ed. Lawrence Erlbaum Associates, Mahwah, NJ.
Haladyna, T. M., Downing, S. M. and Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.
Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational Measurement (p. 141). American Council on Education, Washington, DC.
Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17-24.
Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, MA.
Mertler, C. A. (2003). Classroom Assessment – A Practical Guide for Educators. Pyrczak Publishing, Los Angeles, CA.
Millman, J. and Greene, J. (1993). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (pp. 335-366). Oryx Press, Phoenix, AZ.
Nitko, A. J. (2001). Educational assessment of students (3rd edition). Prentice Hall, Upper Saddle River, NJ.
Nunnally, J. C. (1972). Educational measurement and evaluation (2nd ed). McGraw-Hill, New York.
Nunnally, J. C. (1978). Psychometric Theory, Second Edition. McGraw-Hill, New York.
Oosterhof, A. (2001). Classroom applications for educational measurement. Merrill Prentice Hall, Upper Saddle River, NJ.
Popham, W. J. (1981). Modern educational measurement. Prentice-Hall, Englewood Cliffs, NJ.
Suen, H. K. (1990). Principles of exam theories. Lawrence Erlbaum Associates, Hillsdale, NJ.
Tanner, D. E. (2001). Assessing academic achievement. Allyn & Bacon, Boston.
Thompson, B. and Levitov, J. E. (1985). Using microcomputers to score and evaluate test items. Collegiate Microcomputer, 3, 163-168.
Thorndike, R. M., Cunningham, G. K., Thorndike, R. L. and Hagen, E.P. (1991). Measurement and evaluation in psychology and education (5th ed). MacMillan, New York.
Triola, M. F. (2006). Elementary Statistics. Pearson Addison-Wesley, New York.
Wiersma, W. and Jurs, S. G. (1990). Educational measurement and testing (2nd ed). Allyn and Bacon, Boston, MA.
Wood, D. A. (1960). Test construction: Development and interpretation of achievement tests. Charles E. Merrill Books, Inc, Columbus, OH.
Testing the Goodness of Fit of Continuous Probability Distributions
to Some Flood Data
M. Shakil, Ph.D.
Professor of Mathematics
Department of Liberal Arts and Sciences
Miami Dade College, Hialeah Campus
FL 33012, USA
E-mail: [email protected]
Abstract
In this paper, we have tested the goodness of fit of the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions to the ordered differences in flood heights for two stations on the Fox River in Wisconsin over 33 years, as reported in Best et al. (2008). It was found that the generalized extreme value distribution was the best fit among the six continuous probability distributions for these data based on both the Kolmogorov-Smirnov and Anderson-Darling goodness-of-fit tests. On the other hand, the log-Pearson 3 distribution fitted the data best based on the Chi-Squared goodness-of-fit test. Since fitting a probability distribution to flood data may be helpful in predicting the probability, or forecasting the frequency, of occurrence of floods during monsoons and hurricanes, and in planning beforehand, it is hoped that this study will be quite useful in many problems of business and economic planning, hydrological processes and designs, and other applied research.
2010 Mathematics Subject Classifications: 62C12, 62F03, 62N02, 62N03, 62-07.
Keywords: Flood data, Goodness of fit test, Hurricane, Monsoon, Probability distribution.
1. Introduction
According to Wikipedia, “a flood is an overflow of water that submerges land which is
usually dry. The European Union (EU) Floods Directive defines a flood as a covering by water
of land not normally covered by water. Flooding may occur as an overflow of water from water
bodies, such as a river, lake, or ocean, in which the water overtops or breaks levees, resulting in
some of that water escaping its usual boundaries, or it may occur due to an accumulation of
rainwater on saturated ground in an areal flood. Floods can also occur in rivers when the flow
rate exceeds the capacity of the river channel, particularly at bends or meanders in the waterway.
Floods often cause damage to homes and businesses if they are in the natural flood plains of
rivers. Some floods develop slowly, while others such as flash floods, can develop in just a few
minutes and without visible signs of rain,” (https://en.wikipedia.org/wiki/Flood). The rainfall or
other types of precipitation produced by hurricanes also cause widespread flooding in the affected areas, due to which people face extensive damage and destruction of property, including loss of life, resulting in great socio-economic problems. The statistical analysis of flood data is therefore very crucial, and plays an important role in many studies of hydrological processes and designs. Many researchers have investigated the statistical analysis of flood data; see, for example, Pericchi and Rodríguez-Iturbe (1985), Opere et al. (2006), Yiou et al. (2006), Van Bladeren et al. (2007), Ghorbani et al. (2011), Win and Win (2014), and Ahn et al. (2014), and references therein. Since fitting a probability distribution to flood data may be helpful in predicting the probability, or forecasting the frequency, of occurrence of floods during monsoons and hurricanes, and in planning beforehand, it is hoped that this study will be useful in many problems of business and economic planning, hydrological processes and designs, and other applied research. Motivated by the importance of the study of flood data in many problems of hydrological processes and designs, we have investigated in this paper the goodness of fit (GOF) of the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions to the ordered differences in flood heights for two stations on the Fox River in Wisconsin over 33 years, as reported in Best et al. (2008), to determine their applicability and best fit to these data based on the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared goodness-of-fit tests. Other researchers have also investigated the statistical analysis of these data; see, for example, Bain and Engelhardt (1973), Puig and Stephens (2000), Meintanis (2004), Krishnamoorthy (2006), and Gulati (2011). For applications of the log-Pearson type 3 distribution in hydrology, see, for example, Phien and Ajirajah (1984). Also, for a discussion of the GOF tests, the interested readers are referred to Massey (1951), Stephens (1974), Conover (1999), Blischke and Murthy (2000), Hogg and Tanis (2006), and Ahsanullah et al. (2014), among others.
The organization of this paper is as follows. Section 2 contains the methodology, along with a description of the ordered differences in flood heights for two stations on the Fox River in Wisconsin over 33 years, as reported in Best et al. (2008); it also provides the continuous probability distributions considered in this paper, namely, the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions. In Section 3, we have presented the results and discussions of our findings. Some concluding remarks are given in Section 4.
2. Methodology
In this section, we test the goodness of fit (GOF) of the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions to the ordered differences in flood heights for two stations on the Fox River in Wisconsin over 33 years, as reported in Best et al. (2008), to determine their applicability and best fit to these data based on the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared goodness-of-fit tests. For the sake of completeness, the ordered differences in flood heights for the two stations are provided in Table 1 below. In Table 2, we have provided the probability density functions and parameters of the Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions considered in this paper.
Table 1
(Source: Best et al., 2008)
Ordered differences in flood heights for two stations on the Fox River
in Wisconsin for 33 years
1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27, 6.30, 6.76, 7.65, 7.84, 7.99, 8.51,
9.18, 10.13, 10.24, 10.25, 10.43, 11.45, 11.48, 11.75, 11.81, 12.34, 12.78, 13.06, 13.29,
13.98, 14.18, 14.40, 16.22 and 17.06.
Table 2
(Continuous Probability Distributions Used in Flood Data Analysis)
1. Cauchy:
   f(x) = 1 / { πσ [1 + ((x − μ)/σ)²] },
   where σ > 0 is a scale parameter, μ (real) is a location parameter, and −∞ < x < +∞.

2. Generalized Extreme Value:
   f(x) = (1/σ) exp[ −(1 + k z)^(−1/k) ] (1 + k z)^(−1 − 1/k)   for k ≠ 0,
   f(x) = (1/σ) exp[ −z − exp(−z) ]   for k = 0,
   where z = (x − μ)/σ; k is a shape parameter, σ > 0 is a scale parameter, and μ (real) is a
   location parameter, with 1 + k (x − μ)/σ > 0 for k ≠ 0 and −∞ < x < +∞ for k = 0.

3. Laplace:
   f(x) = (λ/2) exp( −λ |x − μ| ),
   where λ > 0 is an inverse scale parameter, μ (real) is a location parameter, and −∞ < x < +∞.

4. Log-Pearson III (LP3):
   f(x) = [1 / (x |β| Γ(α))] [(ln x − γ)/β]^(α − 1) exp[ −(ln x − γ)/β ],
   where α > 0, β ≠ 0, and γ (real), with e^γ < x < +∞ when β > 0 and 0 < x < e^γ when β < 0.

5. Logistic:
   f(x) = exp[ −(x − μ)/σ ] / { σ (1 + exp[ −(x − μ)/σ ])² },
   where σ > 0 is a scale parameter, μ (real) is a location parameter, and −∞ < x < +∞.

6. Normal:
   f(x) = [1 / (σ√(2π))] exp[ −(1/2) ((x − μ)/σ)² ],
   where σ > 0 is a scale parameter, μ (real) is a location parameter, and −∞ < x < +∞.
Fitting of the above-said distributions to the flood data is carried out as follows. As a first step, using the EasyFit software, we have computed the descriptive statistics of the flood data as given in Table 3. Also, using the Statdisk software, we have tested the normality of the flood data by the Ryan-Joiner test (similar to the Shapiro-Wilk test), along with drawing a histogram of the data, which are given in Figure 1 and Table 4.
We have tested the fitting of Cauchy, generalized extreme value, Laplace, log-Pearson 3,
logistic, and normal probability distributions to ordered differences in flood heights for two
stations on the Fox River in Wisconsin for 33 years (Table 1). For this, we have used the EasyFit software for estimating the parameters of these distributions and for the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared goodness-of-fit (GOF) tests, which are provided in Tables 5 and 6 below. For the parameters estimated in Table 5, the
Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability
distributions respectively have been superimposed on the histogram of the ordered differences in
flood heights, which is provided in Figure 2 below. For these distributions, we have provided the
cumulative distribution function, survival function, hazard function, cumulative hazard function,
P-P plot, Q-Q plot and probability difference in Figures 3 – 9 respectively as given below.
Table 3
(Descriptive Statistics)
Statistic Value
Sample Size 33
Range 15.1
Mean 9.3533
Variance 16.169
Std. Deviation 4.0211
Coef. of Variation 0.42991
Std. Error 0.69999
Skewness -0.07331
Excess Kurtosis -0.79828
Percentile Value
Min 1.96
5% 1.96
10% 3.68
25% (Q1) 6.025
50% (Median) 10.13
75% (Q3) 12.56
90% 14.312
95% 16.472
Max 17.06
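The summary values in Table 3 can be verified directly from the Table 1 observations; a quick check with NumPy (note that the sample standard deviation uses ddof = 1):

```python
import numpy as np

# Ordered differences in flood heights (Table 1; Best et al., 2008).
flood = np.array([
    1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27, 6.30, 6.76,
    7.65, 7.84, 7.99, 8.51, 9.18, 10.13, 10.24, 10.25, 10.43, 11.45,
    11.48, 11.75, 11.81, 12.34, 12.78, 13.06, 13.29, 13.98, 14.18,
    14.40, 16.22, 17.06])

n = flood.size                   # sample size: 33
rng = flood.max() - flood.min()  # range: 15.1
mean = flood.mean()              # mean: 9.3533
sd = flood.std(ddof=1)           # sample standard deviation: 4.0211

print(n, round(rng, 2), round(mean, 4), round(sd, 4))
```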
Figure 1: Normality Assessment of Flood Data
Table 4
(Ryan-Joiner Test of Normality Assessment)
Ryan-Joiner Test
Test statistic, Rp: 0.9925
Critical value for 0.05 significance level: 0.9666
Critical value for 0.01 significance level: 0.9528
Fail to reject normality with a 0.05 significance level.
Fail to reject normality with a 0.01 significance level.
Possible Outliers
Number of data values below Q1 by more than 1.5 IQR: 0
Number of data values above Q3 by more than 1.5 IQR: 0
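SciPy does not implement the Ryan-Joiner test; as a sketch, the closely related Shapiro-Wilk test and the 1.5 × IQR outlier screen of Table 4 can be run on the Table 1 data:

```python
import numpy as np
from scipy import stats

# Ordered differences in flood heights (Table 1; Best et al., 2008).
flood = np.array([
    1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27, 6.30, 6.76,
    7.65, 7.84, 7.99, 8.51, 9.18, 10.13, 10.24, 10.25, 10.43, 11.45,
    11.48, 11.75, 11.81, 12.34, 12.78, 13.06, 13.29, 13.98, 14.18,
    14.40, 16.22, 17.06])

# Shapiro-Wilk test (the Ryan-Joiner test is a correlation-based analogue);
# a large p-value means we fail to reject normality, as in Table 4.
w, p = stats.shapiro(flood)

# Tukey's 1.5 * IQR screen for possible outliers, as in Table 4.
q1, q3 = np.percentile(flood, [25, 75])
fence = 1.5 * (q3 - q1)
outliers = flood[(flood < q1 - fence) | (flood > q3 + fence)]

print(round(w, 4), p > 0.05, outliers.size)
```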
Table 5
Fitting Results
#   Distribution          Parameters
1   Cauchy                σ = 2.8118, μ = 9.6936
2   Gen. Extreme Value    k = −0.32444, σ = 4.2124, μ = 7.9779
3   Laplace               λ = 0.3517, μ = 9.3533
4   Log-Pearson 3         α = 2.9931, β = −0.31728, γ = 3.065
5   Logistic              σ = 2.217, μ = 9.3533
6   Normal                σ = 4.0211, μ = 9.3533
Table 6
Goodness of Fit – Summary
                          Kolmogorov-Smirnov   Anderson-Darling     Chi-Squared
#   Distribution          Statistic   Rank     Statistic   Rank     Statistic   Rank
1   Cauchy                0.11607     5        0.85112     5        1.3161      3
2   Gen. Extreme Value    0.07953     1        0.18391     1        1.3503      4
3   Laplace               0.15476     6        1.0484      6        5.5995      6
4   Log-Pearson 3         0.09304     3        0.22833     2        1.0865      1
5   Logistic              0.1142      4        0.4503      4        2.0981      5
6   Normal                0.0929      2        0.2467      3        1.1639      2
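The Kolmogorov-Smirnov ranking in Table 6 can be spot-checked with SciPy for two of the fitted models, using the parameter estimates from Table 5. Note that SciPy's genextreme shape parameter c is the negative of the k reported by EasyFit:

```python
import numpy as np
from scipy import stats

# Ordered differences in flood heights (Table 1; Best et al., 2008).
flood = np.array([
    1.96, 1.96, 3.60, 3.80, 4.79, 5.66, 5.76, 5.78, 6.27, 6.30, 6.76,
    7.65, 7.84, 7.99, 8.51, 9.18, 10.13, 10.24, 10.25, 10.43, 11.45,
    11.48, 11.75, 11.81, 12.34, 12.78, 13.06, 13.29, 13.98, 14.18,
    14.40, 16.22, 17.06])

# Normal fit (Table 5: sigma = 4.0211, mu = 9.3533).
d_norm = stats.kstest(flood, 'norm', args=(9.3533, 4.0211)).statistic

# GEV fit (Table 5: k = -0.32444, sigma = 4.2124, mu = 7.9779);
# SciPy's genextreme takes (c, loc, scale) with c = -k.
d_gev = stats.kstest(flood, 'genextreme',
                     args=(0.32444, 7.9779, 4.2124)).statistic

# The GEV statistic should be the smaller of the two, matching the
# Table 6 ranking (EasyFit reports 0.07953 for GEV, 0.0929 for normal).
print(round(d_gev, 4), round(d_norm, 4))
```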
Figure 2: Fitting of Probability Density Functions to the Flood Data
Figure 3: Fitting of Cumulative Distribution Functions to the Flood Data
Figure 4: Survival Functions of Distributions for the Flood Data
Figure 5: Hazard Functions of Distributions for the Flood Data
Figure 6: Cumulative Hazard Functions of Distributions for the Flood Data
Figure 7: P-P Plot of Distributions for the Flood Data
Figure 8: Q-Q Plot of Distributions for the Flood Data
Figure 9: Probability Differences of Distributions for the Flood Data
3. Results and Discussions
The descriptive statistics of the ordered differences in flood heights for two stations on the Fox
River in Wisconsin for 33 years, as reported in Best et al. (2008) (see Table 1), are
provided in Table 3 above. Also, we have tested the normality of the flood data by Ryan-Joiner
Test (Similar to Shapiro-Wilk Test), along with drawing a histogram of the data, which are given
in Figure 1 and Table 4. The following are the observations based on Ryan-Joiner Test of
Normality Assessment of the flood data, which is also confirmed from the skewness of the flood
data as computed in Table 3:
(a) Fail to reject normality with a 0.05 significance level,
(b) Fail to reject normality with a 0.01 significance level.
Further, we have tested the fitting of Cauchy, generalized extreme value, Laplace, log-Pearson 3,
logistic, and normal probability distributions to ordered differences in flood heights for two
stations on the Fox River in Wisconsin for 33 years. The estimates of parameters of Cauchy,
generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions
for the flood data are given in Table 5. For the parameters estimated in Table 5, the probability
density functions of Cauchy, generalized extreme value, Laplace, log-Pearson 3, logistic, and
normal probability distributions respectively have been superimposed on the histogram of the
flood data, which is provided in Figure 2. The goodness of fit (GOF) of the Cauchy,
generalized extreme value, Laplace, log-Pearson 3, logistic, and normal probability distributions
to the flood data by Kolmogorov-Smirnov, Anderson-Darling, and the Chi-Squared GOF tests is
summarized in the Table 6 above. Further, for these distributions, we have provided cumulative
distribution function, survival function, hazard function, cumulative hazard function, P-P plot, Q-
Q plot and probability difference in Figures 3 – 9 respectively as given above. From the
Kolmogorov-Smirnov and Anderson-Darling GOF tests as provided in Table 6 and Figure 2
above, we observed that the generalized extreme value distribution is the best fit amongst the six
continuous probability distributions to the ordered differences in flood heights for two stations
on the Fox River in Wisconsin for 33 years. On the other hand, the log-Pearson 3 distribution was found to be the best fit for these data by the Chi-Squared goodness-of-fit test (Table 6). The
graphs of cumulative distribution function, survival function, hazard function, cumulative hazard
function, P-P plot, Q-Q plot and probability difference as provided in Figures 3 – 9 respectively
also confirm these results.
4. Concluding Remarks
In many problems of hydrological processes and designs, fitting of a probability distribution to
the flood data may be helpful in predicting the probability or forecasting the frequency of
occurrence of the flood, and planning beforehand. Motivated by the importance of the study of
flood data in many problems of hydrological processes and designs, and planning beforehand, in
this paper, we have tested the goodness of fit of Cauchy, generalized extreme value, Laplace,
log-Pearson 3, logistic, and normal probability distributions to the ordered differences in flood
heights for two stations on the Fox River in Wisconsin for 33 years, as reported in Best et al.
(2008). It was found that the generalized extreme value distribution was the best fit for these
flood data by both Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests, whereas
the log-Pearson 3 distribution was found to be the best fit for these flood data by the Chi-Squared goodness-of-fit test. It is hoped that this study will be quite helpful in many problems of
hydrological research.
Acknowledgment
The author would like to thank the Editorial Committee of Polygon for accepting this paper for
publication in Polygon. Also, the author would like to thank Professor M. Ahsanullah, Rider
University, New Jersey, USA, and Professor B. M. Golam Kibria, FIU, Miami, USA, for their
valuable and helpful suggestions, which improved the quality and presentation of the paper.
Also, the author is thankful to his wife for her patience and perseverance for the period during
which this paper was prepared. The author would like to dedicate this paper to his late parents,
brothers and sisters. Last but not least, the author is thankful to Miami Dade College for the opportunity to serve this college, without which it would have been impossible to conduct this research.
References
Ahsanullah, M., Kibria, B. M. G., and Shakil, M. (2014). Normal and Student´s t Distributions
and Their Applications. Atlantis Press, Paris, France.
Ahn, J., Cho, W., Kim, T., Shin, H., & Heo, J. H. (2014). Flood frequency analysis for the annual
peak flows simulated by an event-based rainfall-runoff model in an urban drainage basin. Water,
6(12), 3841 - 3863.
Bain, L., and Engelhardt, M. (1973). Interval estimation for the two parameter double
exponential distribution. Technometrics, 15, 875 – 887.
Best, D., Rayner, J., and Thas, O. (2008). Comparison of some tests of fit for the Laplace
distribution. Computational Statistics and Data Analysis, 52, 5338 – 5343.
Blischke, W. R., and Murthy, D. N. P. (2000). Reliability, Modeling, Prediction, and
Optimization. John Wiley & Sons, New York.
Conover, W. J. (1999). Practical Nonparametric Statistics, John Wiley & Sons, New York.
Ghorbani, M. A., Ruskeepaa, H., Singh, V. P., and Sivakumar, B. (2011). Flood frequency
analysis using Mathematica. Turkish Journal of Engineering and Environmental Sciences, 34(3),
171 - 188.
Gulati, S. (2011). Goodness of fit test for the Rayleigh and the Laplace distributions.
International Journal of Applied Mathematics and Statistics™, 24(SI-11A), 74 - 85.
Hogg, R. V., and Tanis, E. A. (2006). Probability and Statistical Inference. Pearson/Prentice
Hall, NJ.
Krishnamoorthy, K. (2006). Handbook of Statistical Distributions with Applications. Chapman
and Hall, CRC, Boca Raton.
Massey, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal of the American
Statistical Association, 6, 68 - 78.
Meintanis, S. G. (2004). A class of omnibus tests for the Laplace distribution based on the
empirical characteristic function. Communications in Statistics, Theory and Methods, 33(4), 925
– 948.
Opere, A. O., Mkhandi, S., and Willems, P. (2006). At site flood frequency analysis for the Nile
Equatorial basins. Physics and Chemistry of the Earth, Parts A/B/C, 31(15), 919 - 927.
Pericchi, L. R., and Rodríguez-Iturbe, I. (1985). On the statistical analysis of floods. In A
celebration of statistics (pp. 511-541). Springer, New York.
Phien, H. N., and Ajirajah, T. J. (1984). Applications of the log Pearson type-3 distribution in
hydrology. Journal of hydrology, 73, 3, 359 - 372.
Puig, P., and Stephens, M. A. (2000). Tests of fit for the Laplace distribution, with applications.
Technometrics, 42(4), 417 – 424.
Stephens, M. A. (1974). EDF statistics for goodness-of-fit, and some comparisons. Journal of the
American Statistical Association, 69, 730 – 737.
Van Bladeren, D., Zawada, P. K., and Mahlangu, D. (2007). Statistical Based Regional
Flood Frequency Estimation Study for South Africa Using Systematic, Historical and
Palaeoflood Data: Pilot Study, Catchment Management Area 15. Water Research Commission.
Win, N. L., and Win, K. M. (2014). Comparative Study of Flood Frequency Analysis on
Selected Rivers in Myanmar. In InCIEC 2013 (pp. 287-299). Springer Singapore.
Wikipedia. https://en.wikipedia.org/wiki/Flood
Yiou, P., Ribereau, P., Naveau, P., Nogaj, M., and Brázdil, R. (2006). Statistical analysis of
floods in Bohemia (Czech Republic) since 1825. Hydrological Sciences Journal, 51(5), 930 -
945.