Global Journal of Pure and Applied Mathematics.
ISSN 0973-1768 Volume 13, Number 6 (2017), pp. 2563-2578
© Research India Publications
http://www.ripublication.com
*Corresponding author.
Email addresses: [email protected] (Makkulau), [email protected] (Edi Cahyono), [email protected] (Mukhsar), [email protected] (Asrul Sani), [email protected] (La Ode Saidi), [email protected] (Andi Tenri Ampa)
Outlier Detection in Multivariate Linear Models
Using Lagrange Multipliers
Makkulau¹, Edi Cahyono¹, Mukhsar¹*, Asrul Sani¹, La Ode Saidi¹, Andi Tenri Ampa¹
¹Department of Mathematics and Statistics, Faculty of Mathematics and Natural Sciences, Universitas Halu Oleo, Kampus Hijau Bumi Tridharma Andounohu, Kendari, 93232, Sulawesi Tenggara, Indonesia.
Abstract
In statistics, data that deviate markedly from the rest of the data are called outliers. Outliers can influence the resulting model and hence can have an impact on decision making. This paper reports a development of two outlier detection methods, the Likelihood Displacement Statistic (LD) method and the Likelihood Ratio Statistic for a Mean Shift (LR) method, into the Likelihood Displacement Statistic-Lagrange (LDL) method and the Likelihood Ratio Statistic for a Mean Shift-Lagrange (LRL) method, respectively. The LDL and LRL methods use Lagrange multipliers with a confidence-interval constraint on the parameter vector; detection proceeds by deleting m observations (LDL) or shifting the means (LRL) in the model and solving the resulting nonlinear program under the Karush-Kuhn-Tucker (KKT) conditions. The LDL method compares the computed LDL values with Chi-square table values, and the LRL method likewise compares the computed LRL values with Chi-square table values, since both statistics follow a Chi-square distribution. The advantage of the LDL and LRL methods, which use Lagrange multipliers, is that the optimal values obtained are restricted to the specified confidence interval.
Keywords: Likelihood Displacement Statistic-Lagrange, Likelihood Ratio Statistic for a Mean Shift-Lagrange, Karush-Kuhn-Tucker, Outlier, Lagrange Multipliers.
1. INTRODUCTION
In statistics, an outlier is an observation point that is distant from the other observations (data). Two kinds of outliers are distinguished: outliers in data and outliers in models. Existing outlier detection methods in models use the ordinary likelihood method; their limitation is that the optimal values produced might not be the true optimal values. When the research object, in the form of data, contains outliers, i.e. data that deviate markedly from the other data, these outliers can influence the resulting model and hence can have an impact on decision making. Based on the number of variables considered, outliers can be divided into outliers in univariate or multivariate observations and outliers in univariate or multivariate linear models. Outliers in linear models can occur in the predictor (independent) variables, in the response (dependent) variables, or in both at once. Outliers in predictor variables are easier to handle than outliers in both at once. Outlier detection methods in univariate and multivariate observations have been developed by [1, 2, 3, 4], which identify an observation that deviates markedly from the other observations. Outlier detection in univariate linear models has been developed by [5, 6, 7, 8], which identify outliers in univariate linear models with the Least Trimmed Squares method and Single Linkage Clustering to obtain potential outlier observations. Outliers in spatial data and outliers under flexible kernel density estimates are identified in [20, 21].
A formal test for detecting outliers in multivariate linear models was developed by [9] using a likelihood ratio test. [10] extended the univariate Cook's distance to detect outliers in multivariate linear models, while [11] proposed detecting influential observations in multivariate linear models using a modified Cook's distance. In developing outlier detection methods for multivariate linear models, [10] used the Likelihood Displacement Statistic (LD) method, the Likelihood Ratio Statistic for a Mean Shift (LR) method, and the multivariate Leverage method. In the LD and LR methods of [10], the parameters of the likelihood function of the multivariate linear model are estimated by Maximum Likelihood Estimation (MLE), and the optimal values are obtained by taking all values into account. A development of the LD method using the Lagrange multiplier method, called the Likelihood Displacement Statistic-Lagrange (LDL) method, was given by [12], and a development of the LR method using the Lagrange multiplier method, called the Likelihood Ratio Statistic for a Mean Shift-Lagrange (LRL) method, was given by [14]. An application of the LDL method to production data of sugar and molasses at the Djombang Baru Sugar Factory, Jombang, East Java Province can be found in [13]. The aim of this study is to determine how to detect and test outliers in multivariate linear models using the Likelihood Displacement Statistic-Lagrange (LDL) method and the Likelihood Ratio Statistic for a Mean Shift-Lagrange (LRL) method.
2. MATERIAL AND METHODS
A multivariate linear model is a linear model with more than one response variable [15]. Suppose $X_1, X_2, \ldots, X_p$ are predictor variables and $Y_1, Y_2, \ldots, Y_q$ are response variables; then the model for $Y_h$ is

$$Y_h = \beta_{0h} + \beta_{1h}X_1 + \beta_{2h}X_2 + \cdots + \beta_{ph}X_p + \varepsilon_h, \quad h = 1, 2, \ldots, q,$$

where $E(\varepsilon_h) = 0$, $\mathrm{Var}(\varepsilon_h) = \sigma_h^2$, and $\mathrm{Cov}(\varepsilon_h, \varepsilon_{h^*}) = \sigma_{hh^*}$, $h \neq h^*$; $h, h^* = 1, 2, \ldots, q$.

A multivariate linear model consisting of $q$ linear models simultaneously can be written as

$$\mathbf{Y}_{n \times q} = \mathbf{X}_{n \times (p+1)}\,\mathbf{B}_{(p+1) \times q} + \boldsymbol{\varepsilon}_{n \times q}, \qquad (1)$$

where the first column of $\mathbf{X}$ is the unit vector $\mathbf{1}_{n \times 1}$.

The model in (1) can be written in vector form as follows:

$$\mathrm{Vec}(\mathbf{Y}) = (\mathbf{I}_q \otimes \mathbf{X})\,\mathrm{Vec}(\mathbf{B}) + \mathrm{Vec}(\boldsymbol{\varepsilon}),$$

where $\otimes$ is the Kronecker product and

$$\mathrm{Vec}(\mathbf{Y}) = (y_{11}, y_{21}, \ldots, y_{n1}, y_{12}, y_{22}, \ldots, y_{n2}, \ldots, y_{1q}, y_{2q}, \ldots, y_{nq})^T,$$
$$\mathrm{Vec}(\mathbf{B}) = (\beta_{01}, \beta_{11}, \ldots, \beta_{p1}, \beta_{02}, \beta_{12}, \ldots, \beta_{p2}, \ldots, \beta_{0q}, \beta_{1q}, \ldots, \beta_{pq})^T,$$
$$\mathrm{Vec}(\boldsymbol{\varepsilon}) = (\varepsilon_{11}, \varepsilon_{21}, \ldots, \varepsilon_{n1}, \varepsilon_{12}, \varepsilon_{22}, \ldots, \varepsilon_{n2}, \ldots, \varepsilon_{1q}, \varepsilon_{2q}, \ldots, \varepsilon_{nq})^T.$$
2.1. Parameter Estimation and Hypothesis Testing in the Multivariate Linear Model
In the multivariate linear model, the error matrix $\boldsymbol{\varepsilon}$ is a random matrix. Assume that the error matrix satisfies

$$\boldsymbol{\varepsilon} \sim N_{nq}(\mathbf{0}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n), \quad \text{or} \quad \mathrm{Vec}(\boldsymbol{\varepsilon}) \sim N_{nq}(\mathbf{0}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n). \qquad (2)$$

The distribution of the matrix $\mathbf{Y}$ in (1) then follows. Because of (2),

$$\mathbf{Y} \sim N_{nq}(\mathbf{XB}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n), \quad \text{or} \quad \mathrm{Vec}(\mathbf{Y}) \sim N_{nq}\big((\mathbf{I}_q \otimes \mathbf{X})\mathrm{Vec}(\mathbf{B}), \boldsymbol{\Sigma} \otimes \mathbf{I}_n\big),$$

where $\mathrm{Vec}(\mathbf{Y}) = (\mathbf{I}_q \otimes \mathbf{X})\mathrm{Vec}(\mathbf{B}) + \mathrm{Vec}(\boldsymbol{\varepsilon})$.

Estimating the parameters $\mathbf{B}$ and $\boldsymbol{\Sigma}$ in (1) by the MLE method [15] gives

$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, \qquad (3)$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}) = \frac{1}{n}\mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}. \qquad (4)$$

$\hat{\boldsymbol{\Sigma}}$ is a biased estimator of $\boldsymbol{\Sigma}$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the projection matrix of $\mathbf{X}$, while the unbiased estimator of $\boldsymbol{\Sigma}$ is

$$\mathbf{S} = \frac{1}{n - \mathrm{rank}(\mathbf{X})}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}).$$
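A minimal sketch of the estimators (3) and (4) on simulated data, assuming numpy; the simulated design and the variable names are ours, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 2, 3

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
B_true = rng.normal(size=(p + 1, q))
Y = X @ B_true + rng.normal(scale=0.5, size=(n, q))

# Equation (3): MLE / least-squares estimate of B.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equation (4): biased MLE of Sigma; S is the unbiased version.
E = Y - X @ B_hat                              # residual matrix
Sigma_hat = E.T @ E / n                        # biased (divides by n)
S = E.T @ E / (n - np.linalg.matrix_rank(X))   # unbiased
```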
Using properties of the Kronecker product and the matrix inverse [16], (3) can be written in vector form as

$$\mathrm{Vec}(\hat{\mathbf{B}}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}).$$

To test the significance of (1), a simultaneous test is carried out with the hypotheses

$$H_0: \beta_{1h} = \beta_{2h} = \cdots = \beta_{ph} = 0, \quad h = 1, 2, \ldots, q,$$
$$H_1: \text{at least one } \beta_{jh} \neq 0, \quad j = 1, 2, \ldots, p \text{ and } h = 1, 2, \ldots, q.$$

Based on the likelihood function for the population,

$$L(\mathbf{B}, \boldsymbol{\Sigma}) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}|^{-n/2}\exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\!\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{XB})^T(\mathbf{Y} - \mathbf{XB})\right]\right\},$$

the estimate of the parameter matrix $\boldsymbol{\Sigma}$ is obtained, i.e. $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}$. Let $\hat{\mathbf{E}}_U = \mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}$ and $\hat{\mathbf{M}}_H = \mathbf{Y}^T\mathbf{H}_P\,\mathbf{Y}$, where $\mathbf{H}_P = \mathbf{H} - \mathbf{H}_0$, $\mathbf{H}_0 = \mathbf{X}_0(\mathbf{X}_0^T\mathbf{X}_0)^{-1}\mathbf{X}_0^T$, and $\hat{\mathbf{E}}_U$ is a matrix of size $q \times q$. The test statistic used to test the above hypothesis is Wilks's Lambda [17]:

$$\Lambda = \frac{|\hat{\mathbf{E}}_U|}{|\hat{\mathbf{E}}_U + \hat{\mathbf{M}}_H|}, \qquad 0 \leq \Lambda \leq 1,$$

where $\Lambda \sim \Lambda_{q;\,d;\,n-\mathrm{rank}(\mathbf{X})}$ follows Wilks's Lambda distribution. Reject $H_0$ if $\Lambda_{\mathrm{calculated}} \leq \Lambda_{\alpha;\,q;\,d;\,n-\mathrm{rank}(\mathbf{X})}$, where $q$ is the number of response variables, $\mathrm{rank}(\mathbf{X}) = p + 1$, and $d = \mathrm{rank}(\mathbf{X}) - \mathrm{rank}(\mathbf{X}_0)$.
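A sketch of the Wilks's Lambda computation, under the assumption (ours, for illustration) that $\mathbf{X}_0$ is the intercept-only design; in practice the critical value $\Lambda_{\alpha;\,q;\,d;\,n-\mathrm{rank}(\mathbf{X})}$ would come from Wilks's Lambda tables or an F approximation:

```python
import numpy as np

def wilks_lambda(X, Y, X0):
    """Wilks's Lambda for H0: all slope coefficients are zero.

    X : full design (with intercept column); X0 : reduced design.
    Returns Lambda = |E_U| / |E_U + M_H|.
    """
    H  = X  @ np.linalg.solve(X.T  @ X,  X.T)    # projection onto col(X)
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)   # projection onto col(X0)
    E_U = Y.T @ (np.eye(len(Y)) - H) @ Y         # error SSCP matrix
    M_H = Y.T @ (H - H0) @ Y                     # hypothesis SSCP matrix
    return np.linalg.det(E_U) / np.linalg.det(E_U + M_H)

# Example: reuse X, Y from the previous sketch; X0 is the intercept column.
# lam = wilks_lambda(X, Y, X[:, :1])
```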
2.2. Outliers in the Multivariate Linear Model

According to [2, 18], outliers in multivariate linear models can be classified into three categories: outliers in observations of the predictor variables $\mathbf{X}$, outliers in observations of the response variables $\mathbf{Y}$, and outliers in observations of the predictors $\mathbf{X}$ and responses $\mathbf{Y}$ together, i.e. outliers in the set of pairs $(\mathbf{y}_1, \mathbf{x}_1), (\mathbf{y}_2, \mathbf{x}_2), \ldots, (\mathbf{y}_n, \mathbf{x}_n)$. This study focuses on outlier detection and testing in this set of pairs.
2.3. Outlier Detection Methods in Multivariate Linear Model
There are three methods of outlier detection in multivariate linear models, namely the LD, LR, and multivariate Leverage methods [10]. Outlier detection based on the LD method is done by deleting the observations detected as outliers from the model stepwise. Suppose $m$ observations are collected from the set of pairs $(\mathbf{y}_1, \mathbf{x}_1), (\mathbf{y}_2, \mathbf{x}_2), \ldots, (\mathbf{y}_n, \mathbf{x}_n)$, and take the $m$ pairs of observation vectors detected as outliers $(m \leq n)$. Define $A_m$ as the set of the $m$ observations detected as outliers, i.e. $A_m = \{i_1, i_2, \ldots, i_m\}$, where $i_{j_a} \neq i_{j_b}$ for $j_a \neq j_b$ and each index $i_j \in \{1, 2, \ldots, n\}$ for $j = 1, 2, \ldots, m$. In other words, there are $m$ observations detected as outliers in $A_m$. Hence the pair $(\mathbf{Y}_{A_m}, \mathbf{X}_{A_m})$ is the matrix of observations detected as outliers, and $(\mathbf{Y}_{A_m^C}, \mathbf{X}_{A_m^C})$ is the matrix of observations without outliers, i.e. after the outliers have been deleted.

Based on (3), the estimate of the parameter matrix $\mathbf{B}$ after the observations detected as outliers have been deleted is

$$\hat{\mathbf{B}}_{A_m^C} = \hat{\mathbf{B}} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}, \qquad (6)$$

where $\mathbf{Q}_{A_m} = \mathbf{X}_{A_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T$, $\hat{\boldsymbol{\varepsilon}}_{A_m} = \mathbf{Y}_{A_m} - \mathbf{X}_{A_m}\hat{\mathbf{B}}$, and $(\mathbf{I} - \mathbf{Q}_{A_m})^{-1} = \mathbf{I} + \mathbf{Q}_{A_m}(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}$.

The estimate of the parameter matrix $\boldsymbol{\Sigma}$ after the observations detected as outliers have been deleted is

$$\hat{\boldsymbol{\Sigma}}_{A_m^C} = \frac{n}{n-m}\hat{\boldsymbol{\Sigma}} - \frac{1}{n-m}\hat{\boldsymbol{\varepsilon}}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}.$$

The expected value and variance of $\hat{\mathbf{B}}_{A_m^C}$ are then $E(\hat{\mathbf{B}}_{A_m^C}) = \mathbf{B}$ and

$$\mathrm{Var}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{A_m^C})\big) = \boldsymbol{\Sigma} \otimes \left[(\mathbf{X}^T\mathbf{X})^{-1} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\mathbf{X}_{A_m}(\mathbf{X}^T\mathbf{X})^{-1}\right].$$

Definition [15]. The LD for $\mathbf{B}$ given $\boldsymbol{\Sigma}$ is

$$\mathrm{LD}_{A_m} = 2\left[\ln L(\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) - \ln L\big(\hat{\mathbf{B}}_{A_m^C}, \hat{\boldsymbol{\Sigma}}(\hat{\mathbf{B}}_{A_m^C})\big)\right],$$

where $\hat{\boldsymbol{\Sigma}}(\hat{\mathbf{B}}_{A_m^C})$ is estimated using $\hat{\mathbf{B}}_{A_m^C}$ given in (6).

The test statistic used to test for the existence of outliers with the LD method is

$$\mathrm{LD}_{A_m} = 2\left[\ln L(\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) - \ln L\big(\hat{\mathbf{B}}_{A_m^C}, \hat{\boldsymbol{\Sigma}}(\hat{\mathbf{B}}_{A_m^C})\big)\right] = n\ln\frac{\left|\hat{\boldsymbol{\Sigma}} + \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{A_m}^T\mathbf{C}_{A_m}\hat{\boldsymbol{\varepsilon}}_{A_m}\right|}{|\hat{\boldsymbol{\Sigma}}|}, \qquad (7)$$

where $\mathbf{C}_{A_m} = (\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\mathbf{Q}_{A_m}(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}$ and $\lambda_1, \lambda_2, \ldots, \lambda_m$ are the eigenvalues of $\mathbf{C}_{A_m}$.

Next, the distribution of (7) is determined. According to [10], when $n$, $n - p$, and $n - q$ are large, $\mathrm{LD}_{A_m}$ follows a scaled Chi-square distribution with degrees of freedom (df) equal to the number of response variables, $\mathrm{LD}_{A_m} \sim \lambda_m\,\chi^2_q$. Because of this, the table value of LD for observations detected as outliers is $\mathrm{LD}_\alpha = \lambda_m\,\chi^2_{\alpha;\,\mathrm{df}}$, where $\alpha$ is the significance level, $\lambda_m$ is the largest eigenvalue, and $\mathrm{df} = q$ is the number of response variables.

Hypothesis testing in the LD method is based on the following hypotheses:
$H_0$: $A_m$ contains no outliers
$H_1$: $A_m$ contains outliers.

The test statistic used to test for the existence of outliers is (7). Reject $H_0$ when $\mathrm{LD}_{A_m} > \mathrm{LD}_\alpha$. In other words, detecting and determining outliers in multivariate linear models with the LD method comes down to comparing the value of $\mathrm{LD}_{A_m}$ with $\mathrm{LD}_\alpha$: when $\mathrm{LD}_{A_m} > \mathrm{LD}_\alpha$, the observations concerned are outliers.

The next step is to look for each pair of observations detected as an outlier based on the LD values; the largest LD values are chosen as the next pair. If $\mathrm{LD}_{A_m} \leq \mathrm{LD}_\alpha$ (no new observations are identified as outliers), or the computational burden becomes too large, or the value of $\mathrm{LD}_{A_m}$ no longer changes much, the process is terminated [10] and the final sets $A_m$ and $A_m^C$ are determined.
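The LD computation for one candidate set $A_m$ can be sketched as follows, using (4), (6), (7), and the scaled Chi-square cutoff described above; the helper name, the 5% level, and the use of scipy's Chi-square quantile are our illustrative choices:

```python
import numpy as np
from scipy import stats

def ld_statistic(X, Y, idx):
    """LD_{A_m} of equation (7) for the candidate outlier set `idx`."""
    n = len(Y)
    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Sigma_hat = (Y - X @ B_hat).T @ (Y - X @ B_hat) / n    # equation (4)

    Xm, Ym = X[idx], Y[idx]
    Q = Xm @ np.linalg.solve(X.T @ X, Xm.T)                # Q_{A_m}
    eps = Ym - Xm @ B_hat                                  # residuals of A_m
    IQinv = np.linalg.inv(np.eye(len(idx)) - Q)
    C = IQinv @ Q @ IQinv                                  # C_{A_m}

    ld = n * np.log(np.linalg.det(Sigma_hat + eps.T @ C @ eps / n)
                    / np.linalg.det(Sigma_hat))
    cutoff = np.linalg.eigvalsh(C).max() * stats.chi2.ppf(0.95, df=Y.shape[1])
    return ld, cutoff

# ld, cut = ld_statistic(X, Y, idx=[4, 17]); flagged as outliers if ld > cut
```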
The LR method is another way to detect and test outliers in multivariate linear models besides the LD method. Analogously to the LD method, the LR method removes the observations detected as outliers from the model stepwise, but based on a mean shift. The LR method starts by determining the estimate of the matrix $\boldsymbol{\Psi}$, where $\boldsymbol{\Psi}$ is the shift-coefficient matrix associated with the observations in the set $A_m$, and $A_m$ is the set of the $m$ observations detected as outliers. After that, $E(\mathbf{Y})$ is determined from the multivariate linear model. The multivariate linear model in the LR method is written in the form

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z}_{A_m}\boldsymbol{\Psi} + \boldsymbol{\varepsilon} = \mathbf{X}^*\mathbf{B}^* + \boldsymbol{\varepsilon}^*, \qquad (9)$$

where $\mathbf{X}^* = [\mathbf{X} \;\; \mathbf{Z}_{A_m}]$, $\mathbf{B}^* = [\mathbf{B}^T \;\; \boldsymbol{\Psi}^T]^T$, and $\mathbf{Z}_{A_m} = [\mathbf{z}_{i_1} \; \mathbf{z}_{i_2} \; \cdots \; \mathbf{z}_{i_m}]$, $i_k \in A_m$, is a matrix of column vectors in which $\mathbf{z}_{i_k}$ is a column vector whose $i_k$-th component is one and whose other components are zero. $\boldsymbol{\Psi}_{m \times q}$ is the matrix of shift coefficients associated with the observations in the set $A_m$.

Based on the general model (1), $E(\mathbf{Y}) = \mathbf{XB}$ is obtained, which is the mean of the model (Srivastava and von Rosen [9]), while based on (9) the expectation of the matrix $\mathbf{Y}$ is

$$E(\mathbf{Y}) = \mathbf{XB} + \mathbf{Z}_{A_m}\boldsymbol{\Psi}, \qquad (10)$$

which is the shifted mean. The estimate of the parameter matrix $\mathbf{B}^*$ in (9) is

$$\hat{\mathbf{B}}^* = (\mathbf{X}^{*T}\mathbf{X}^*)^{-1}\mathbf{X}^{*T}\mathbf{Y} = \begin{bmatrix} \hat{\mathbf{B}}_{A_m^C} \\ \hat{\boldsymbol{\Psi}} \end{bmatrix},$$

where $\hat{\mathbf{B}}$ is given by (3), $\mathbf{Q}_{A_m} = \mathbf{X}_{A_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T$, and $\hat{\boldsymbol{\Psi}} = (\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}$.

The estimate of the parameter matrix $\boldsymbol{\Sigma}^*$ in (9) is:
$$\hat{\boldsymbol{\Sigma}}^* = \hat{\boldsymbol{\Sigma}} - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m},$$

where $\hat{\boldsymbol{\Sigma}}$ is given by (4).

The test statistic used to test for the existence of outliers with the LR method is

$$\mathrm{LR}_{A_m} = -c\,\ln\frac{|\hat{\boldsymbol{\Sigma}}^*|}{|\hat{\boldsymbol{\Sigma}}|} = -c\,\ln\frac{\left|\hat{\boldsymbol{\Sigma}} - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}\right|}{|\hat{\boldsymbol{\Sigma}}|}, \qquad (11)$$

where $c = n - p - m - \frac{1}{2}(q - m + 1)$.

Next, the distribution of (11) is determined. According to Box in Xu et al. [10], when $n$, $n - p$, and $n - q$ are large, $\mathrm{LR}_{A_m}$ follows a Chi-square distribution with degrees of freedom (df) equal to the number of response variables, $\mathrm{LR}_{A_m} \sim \chi^2_{\mathrm{df}}$. Because of this, the table value of LR for observations detected as outliers is $\mathrm{LR}_\alpha = \chi^2_{\alpha;\,\mathrm{df}}$, where $\alpha$ is the significance level and df is the number of response variables.

Hypothesis testing in the LR method is based on the following hypotheses:

$H_0$: $A_m$ contains no outliers
$H_1$: $A_m$ contains outliers.

The test statistic used to test for the existence of outliers is (11). Reject $H_0$ when $\mathrm{LR}_{A_m} > \mathrm{LR}_\alpha$. In other words, detecting and determining outliers in multivariate linear models with the LR method comes down to comparing the value of $\mathrm{LR}_{A_m}$ with $\mathrm{LR}_\alpha$: when $\mathrm{LR}_{A_m} > \mathrm{LR}_\alpha$, the observations concerned are outliers.

The next step is to look for each pair of observations detected as an outlier based on the LR values; the largest LR values are chosen as the next pair. If $\mathrm{LR}_{A_m} \leq \mathrm{LR}_\alpha$ (no new observations are identified as outliers), or the computational burden becomes too large, or the value of $\mathrm{LR}_{A_m}$ no longer changes much, the process is terminated [10] and the final sets $A_m$ and $A_m^C$ are determined.
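An analogous sketch for the LR statistic of (11), using the closed form via $\mathbf{Q}_{A_m}$ rather than explicitly building the indicator matrix $\mathbf{Z}_{A_m}$; again the function name and the default 5% level are our illustrative choices:

```python
import numpy as np
from scipy import stats

def lr_statistic(X, Y, idx, alpha=0.05):
    """LR_{A_m} of equation (11) for the candidate outlier set `idx`."""
    n, q = Y.shape
    p = X.shape[1] - 1           # number of predictors (X has a 1s column)
    m = len(idx)

    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Sigma_hat = (Y - X @ B_hat).T @ (Y - X @ B_hat) / n    # equation (4)

    Xm, Ym = X[idx], Y[idx]
    Q = Xm @ np.linalg.solve(X.T @ X, Xm.T)                # Q_{A_m}
    eps = Ym - Xm @ B_hat
    Sigma_star = Sigma_hat - eps.T @ np.linalg.solve(
        np.eye(m) - Q, eps) / n                            # Sigma* under (9)

    c = n - p - m - 0.5 * (q - m + 1)
    lr = -c * np.log(np.linalg.det(Sigma_star) / np.linalg.det(Sigma_hat))
    return lr, stats.chi2.ppf(1 - alpha, df=q)

# lr, cut = lr_statistic(X, Y, idx=[4, 17]); flagged as outliers if lr > cut
```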
2.4. Nonlinear Optimization
Let $L(\boldsymbol{\theta})$ be the objective function with constraints $g_k(\boldsymbol{\theta}) \geq 0$ for $k = 1, 2, \ldots, K$, and let $\boldsymbol{\theta}^*$ be a feasible solution. The Karush-Kuhn-Tucker (KKT) conditions can then be written in the following form [19]:

Maximize the objective function

$$L(\boldsymbol{\theta}), \quad \boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_K), \quad \text{subject to } g_k(\boldsymbol{\theta}) \geq 0, \quad k = 1, 2, \ldots, K. \qquad (12)$$

Problem (12) can be solved with Lagrange multipliers as follows:

$$F(\boldsymbol{\theta}, \boldsymbol{\lambda}) = L(\boldsymbol{\theta}) + \sum_{k=1}^{K}\lambda_k\,g_k(\boldsymbol{\theta}).$$

Optimality under the KKT conditions is achieved when

$$\frac{\partial F}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \boldsymbol{\theta}} + \sum_{k=1}^{K}\lambda_k\frac{\partial g_k}{\partial \boldsymbol{\theta}} = \mathbf{0},$$
$$\frac{\partial F}{\partial \lambda_k} = g_k(\boldsymbol{\theta}) \geq 0,$$
$$\sum_{k=1}^{K}\lambda_k\,g_k(\boldsymbol{\theta}) = 0, \quad \text{and} \quad \lambda_k \geq 0,$$

where $\lambda_k$ is the Lagrange multiplier.
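As a concrete illustration of solving such a program numerically, the sketch below maximizes a toy concave objective under one inequality constraint with scipy's SLSQP solver, which enforces the KKT conditions internally; the objective and constraint are invented solely for this example:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: maximize L(theta) = -(theta1 - 1)^2 - (theta2 - 2)^2
# subject to g(theta) = 1 - theta1 - theta2 >= 0.
objective = lambda t: (t[0] - 1) ** 2 + (t[1] - 2) ** 2   # minimize -L
constraint = {"type": "ineq", "fun": lambda t: 1 - t[0] - t[1]}

res = minimize(objective, x0=np.zeros(2), method="SLSQP",
               constraints=[constraint])
print(res.x)   # optimum pushed onto the boundary theta1 + theta2 = 1
```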
3. METHOD
The proposed method is a development of the LD and LR methods using Lagrange multipliers. The Lagrange constraint used is the confidence interval of the parameter vector, which serves to optimize over the confidence interval obtained. This confidence interval is obtained in stages by deleting observations and shifting the means in the model numerically using a nonlinear program. The process of outlier detection and testing with the LDL and LRL methods is: collect the $m$ observations detected as outliers, $A_m$ and $A_m^*$; determine $\hat{\mathbf{B}}_L$, $\hat{\boldsymbol{\Sigma}}_L$, $\hat{\mathbf{B}}_{LA_m^C}$, $\hat{\boldsymbol{\Sigma}}_{LA_m^C}$, $\mathrm{LDL}_{A_m}$, and $\mathrm{LRL}_{A_m}$; and compare the LDL and LRL values with the corresponding table values.
4. RESULTS AND DISCUSSION
Outlier detection and testing based on the LDL method starts by forming the likelihood function for the population to obtain $\hat{\mathbf{B}}_L$, $\hat{\boldsymbol{\Sigma}}_L$, and $L(\hat{\mathbf{B}}_L, \hat{\boldsymbol{\Sigma}}_L)$ under assumption (2). Next, constraints based on a confidence interval for $\mathrm{Vec}(\mathbf{B}_L)$ are created. Based on Box in Xu et al. [10], when $n$, $n - p$, and $n - q$ are very large, the constraints follow an F distribution with df equal to the number of response variables. In general, the $(1 - \alpha)100\%$ confidence interval for $\mathrm{Vec}(\mathbf{B}_L)$ is

$$\big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big) \leq F_{\alpha;\,v_1;\,v_2}. \qquad (13)$$

Then, based on (13), the constraint region obtained is

$$g\big(\mathrm{Vec}(\mathbf{B}_L)\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big) \geq 0. \qquad (14)$$
The LDL method starts by determining $A_m$ for $m = 1$, i.e. $A_1 = \{i_1\}$ with $i_1 \in \{1, 2, \ldots, n\}$, so that $n$ candidate sets of observations detected as outliers $A_1$ are obtained; then $(\mathbf{Y}_{A_1}, \mathbf{X}_{A_1})$ and $(\mathbf{Y}_{A_1^C}, \mathbf{X}_{A_1^C})$ are determined. After that, a new $A_m$ is taken without involving the old $A_m$. For the new $A_{mb}$ with $m = 2$, $A_2 = \{i_1, i_2\}$ with $i_1, i_2 \in \{1, 2, \ldots, n\}$; the candidate sets of pairs of observations detected as outliers $A_2$ are obtained, and $(\mathbf{Y}_{A_2}, \mathbf{X}_{A_2})$ and $(\mathbf{Y}_{A_2^C}, \mathbf{X}_{A_2^C})$ are determined. This step is repeated until a set of $m = n - 1$ observations detected as outliers is obtained, e.g. $A_{n-1} = \{2, 3, \ldots, n - 2, n - 1, n\}$. The final set is then determined, giving $(\mathbf{Y}_{A_m}, \mathbf{X}_{A_m})$ and $(\mathbf{Y}_{A_m^C}, \mathbf{X}_{A_m^C})$.
The following theorem concerns the parameter estimates $\mathbf{B}_L$ and $\boldsymbol{\Sigma}_L$ obtained from the Lagrange function.

Theorem 1. If the natural logarithm of the Lagrange function $\ln L(\mathbf{B}_L, \boldsymbol{\Sigma}_L)$ is given by

$$\ln L(\mathbf{B}, \boldsymbol{\Sigma}) = -\frac{nq}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{XB})^T(\mathbf{Y} - \mathbf{XB})\right]$$

and the constraint is given by (14), then the estimators $\hat{\mathbf{B}}_L$ and $\hat{\boldsymbol{\Sigma}}_L$ that maximize the Lagrange function

$$F(\mathbf{B}_{LA_m^C}, \boldsymbol{\Sigma}_L, \lambda) = \ln L(\mathbf{B}_{LA_m^C}, \boldsymbol{\Sigma}_L) + \lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)$$

are

$$\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}) - 2\lambda\big(\mathrm{Vec}(\hat{\mathbf{B}}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)$$

and

$$\hat{\boldsymbol{\Sigma}}_L = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L)^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L),$$

where $\mathbf{B}_{LA_m^C}$ is the parameter matrix of the LDL model without outliers, $\boldsymbol{\Sigma}_L$ is the variance-covariance parameter matrix of the LDL model, and $\lambda$ is the Lagrange multiplier.
The optimum value of the parameter estimate $\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C})$ is determined by a numerical method that uses a nonlinear program with the KKT conditions. The nonlinear program of the LDL method with the KKT conditions is:

Maximize the objective function

$$L(\mathbf{B}_{LA_m^C}, \boldsymbol{\Sigma}_L) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_L|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}_L^{-1}(\mathbf{Y}_{A_m^C} - \mathbf{X}_{A_m^C}\mathbf{B}_{LA_m^C})^T(\mathbf{Y}_{A_m^C} - \mathbf{X}_{A_m^C}\mathbf{B}_{LA_m^C})\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_{LA_m^C})\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)^T\big[\hat{\boldsymbol{\Sigma}}_L \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}) - 2\lambda\big(\mathrm{Vec}(\hat{\mathbf{B}}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)$;

(iii) $\hat{\boldsymbol{\Sigma}}_L = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L)^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L)$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{LA_m^C})\big) = 0$;

(v) $\lambda \geq 0$.
The test statistic, its distribution, and the rejection rule for the hypothesis of outlier detection using the LDL method are given completely by Theorem 2.

Theorem 2. If the following hypotheses are given:

$H_0$: $A_{Lm}$ contains no outliers
$H_1$: $A_{Lm}$ contains outliers,

then:

(a) The test statistic for the hypothesis is

$$\mathrm{LDL}_{A_m} = n\ln\frac{\left|\hat{\boldsymbol{\Sigma}}_L + \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T\mathbf{C}_{A_m}\hat{\boldsymbol{\varepsilon}}_{LA_m}\right|}{|\hat{\boldsymbol{\Sigma}}_L|}.$$

(b) The distribution of the statistic $\mathrm{LDL}_{A_m}$ is Chi-square.

(c) The null hypothesis is rejected when

$$\mathrm{LDL}_{A_m} > \lambda_m\,\chi^2_{\alpha;\,\mathrm{df}},$$

where $\alpha$ is the significance level, $\lambda_m$ is the largest eigenvalue, and df is the number of response variables.

Detecting and determining outliers in multivariate linear models based on the LDL method is done by comparing the value of $\mathrm{LDL}_{A_m}$ with $\mathrm{LDL}_\alpha = \lambda_m\,\chi^2_{\alpha;\,\mathrm{df}}$. If $\mathrm{LDL}_{A_m} > \mathrm{LDL}_\alpha$, the observation concerned is an outlier. After that, the pairs of observations detected as outliers are determined stepwise based on the LDL values; the largest LDL values are chosen as the next pair.
The next step is to take the new $A_{mb}$ without involving the old $A_m$ that has already been detected as an outlier set, and to determine a new multiplier with a confidence interval for $\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})$. Based on (14), the new multiplier with its confidence interval can be written as follows:

$$g\big(\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb} \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) \geq 0.$$

The estimate of the parameter $\mathbf{B}_{LA_{mb}^C}$ is then obtained from a nonlinear program with the KKT conditions based on the new constraints. The nonlinear program of the LDL method with the KKT conditions based on the new constraints is: maximize the objective function
$$L(\mathbf{B}_{LA_{mb}^C}, \boldsymbol{\Sigma}_{Lb}) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_{Lb}|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}_{Lb}^{-1}(\mathbf{Y}_{A_{mb}^C} - \mathbf{X}_{A_{mb}^C}\mathbf{B}_{LA_{mb}^C})^T(\mathbf{Y}_{A_{mb}^C} - \mathbf{X}_{A_{mb}^C}\mathbf{B}_{LA_{mb}^C})\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb} \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}) - 2\lambda\big(\mathrm{Vec}(\hat{\mathbf{B}}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big)$;

(iii) $\hat{\boldsymbol{\Sigma}}_{Lb} = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_{Lb})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_{Lb})$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) = 0$;

(v) $\lambda \geq 0$.

Outlier detection and determination in multivariate linear models with the LDL method for the new $A_{mb}$ are based on the hypothesis test and test statistic discussed in Theorem 2: reject $H_0$ if $\mathrm{LDL}_{A_{mb}} > \mathrm{LDL}_{\alpha b}$. The next step is to take a new $A_{mc}$, determine the new multiplier, and then determine the parameter estimate of $\mathbf{B}_{LA_{mc}^C}$, obtained from the nonlinear program with the KKT conditions. Finally, the hypothesis test and test statistic are applied to test for the existence of outliers based on the new multiplier. These steps are repeated until no more observations are detected as outliers, i.e. $\mathrm{LDL}_{A_m} \leq \mathrm{LDL}_\alpha$. In addition, [10] noted that the process is also terminated when the computational burden becomes too large or the values of $\mathrm{LDL}_{A_m}$ no longer change much, in which case the final iteration is determined. The advantage of the LDL method, which uses a Lagrange multiplier, is that the optimal values obtained are restricted to the specified confidence interval.
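The stepwise search just described can be outlined in code as follows. This is a sketch of the iteration logic only: the `statistic` argument is assumed to return a (value, cutoff) pair, and plugging in the unconstrained `ld_statistic` from the earlier sketch is a stand-in for the constrained LDL statistic of Theorem 2, not the full KKT optimization:

```python
def stepwise_detect(X, Y, statistic):
    """Greedy stepwise search: grow the outlier set A one index at a time,
    keeping the candidate with the largest statistic value, and stop when
    no candidate exceeds its cutoff.  `statistic(X, Y, idx) -> (value, cutoff)`.
    """
    n = len(Y)
    A, remaining = [], set(range(n))
    while len(remaining) > 1:        # keep at least one observation out of A
        scored = [(*statistic(X, Y, A + [i]), i) for i in remaining]
        value, cutoff, best = max(scored)
        if value <= cutoff:          # nothing new flagged: terminate
            break
        A.append(best)
        remaining.discard(best)
    return A

# outliers = stepwise_detect(X, Y, ld_statistic)
```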
Outlier detection and testing based on the LRL method likewise starts by forming the likelihood function for the population to obtain $\hat{\mathbf{B}}_L^*$, $\hat{\boldsymbol{\Sigma}}_L^*$, and $L(\hat{\mathbf{B}}_L^*, \hat{\boldsymbol{\Sigma}}_L^*)$ under assumption (2). Next, constraints based on a confidence interval for $\mathrm{Vec}(\mathbf{B}_L^*)$ are created. Based on [10], when $n$, $n - p$, and $n - q$ are very large, the constraints follow a Chi-square distribution with df equal to the number of response variables. In general, the $(1 - \alpha)100\%$ confidence interval for $\mathrm{Vec}(\mathbf{B}_L^*)$ is

$$\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big) \leq \chi^2_{\alpha;\,\mathrm{df}}. \qquad (15)$$

Then, based on (15), the constraint region obtained is

$$g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big) \geq 0. \qquad (16)$$

The LRL method starts by determining $\hat{\boldsymbol{\Psi}}_L$ and $E(\mathbf{Y})$. After that, $A_m$ is determined for $m = 1$, i.e. $A_1 = \{i_1\}$ with $i_1 \in \{1, 2, \ldots, n\}$, so that $n$ candidate sets of observations detected as outliers $A_1$ are obtained; then $(\mathbf{Y}_{A_1}, \mathbf{X}_{A_1})$ and $(\mathbf{Y}_{A_1^C}, \mathbf{X}_{A_1^C})$ are determined. After that, a new $A_m$ is taken without involving the old $A_m$ already detected as outliers, and the new $\hat{\boldsymbol{\Psi}}_{Lb}$ is determined. For the new $A_{mb}$ with $m = 2$, $A_2 = \{i_1, i_2\}$ with $i_1, i_2 \in \{1, 2, \ldots, n\}$; the candidate sets of pairs of observations detected as outliers $A_2$ are obtained, and $(\mathbf{Y}_{A_2}, \mathbf{X}_{A_2})$ and $(\mathbf{Y}_{A_2^C}, \mathbf{X}_{A_2^C})$ are determined. This step is repeated until a set of $m = n - 1$ observations detected as outliers is obtained, e.g. $A_{n-1} = \{2, 3, \ldots, n - 2, n - 1, n\}$. The final set is then determined, giving $(\mathbf{Y}_{A_m}, \mathbf{X}_{A_m})$ and $(\mathbf{Y}_{A_m^C}, \mathbf{X}_{A_m^C})$.
The following theorem concerns the parameter estimates $\mathbf{B}_L^*$ and $\boldsymbol{\Sigma}_L^*$ obtained from the Lagrange function.

Theorem 3. If the natural logarithm of the Lagrange function $\ln L(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*)$ is given by

$$\ln L(\mathbf{B}, \boldsymbol{\Sigma}) = -\frac{nq}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{XB})^T(\mathbf{Y} - \mathbf{XB})\right]$$

and the constraint is given by (16), then the estimators $\hat{\mathbf{B}}_L^*$ and $\hat{\boldsymbol{\Sigma}}_L^*$ that maximize the Lagrange function

$$F(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*, \lambda) = \ln L(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*) + \lambda\,g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big)$$

are

$$\hat{\mathbf{B}}_L^* = \begin{bmatrix}\hat{\mathbf{B}} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m} \\ (\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m}\end{bmatrix} \quad \text{and} \quad \hat{\boldsymbol{\Sigma}}_L^* = \hat{\boldsymbol{\Sigma}}_L - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m},$$

where $\mathbf{B}_L^*$ is the parameter matrix of the LRL model without outliers, $\boldsymbol{\Sigma}_L^*$ is the variance-covariance parameter matrix of the LRL model, $\mathbf{Q}_{LA_m} = \mathbf{X}_{LA_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{LA_m}^T$, $(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1} = \mathbf{I} + \mathbf{Q}_{LA_m}(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}$, $\hat{\boldsymbol{\varepsilon}}_{LA_m} = \mathbf{Y}_{LA_m} - \mathbf{X}_{LA_m}\hat{\mathbf{B}}_L$, and $\lambda$ is the Lagrange multiplier.
Using properties of the Kronecker product, the estimate of the parameter matrix $\mathbf{B}_L^*$ in vec form can be written as

$$\mathrm{Vec}(\hat{\mathbf{B}}_L^*) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}). \qquad (17)$$

Because

$$\mathrm{Vec}(\hat{\mathbf{B}}) \sim N_{(p+1)q}\!\left(\mathrm{Vec}(\mathbf{B}),\, \boldsymbol{\Sigma} \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right),$$
it follows that

$$\mathrm{Vec}(\hat{\mathbf{B}}_L^*) \sim N\!\left(\mathrm{Vec}(\mathbf{B}_L^*),\, \boldsymbol{\Sigma}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right). \qquad (18)$$

Because of (18), the estimate of $\mathrm{Var}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*)\big)$ is

$$\widehat{\mathrm{Var}}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*)\big) = \hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}.$$
The optimum value of the parameter estimate $\mathbf{B}_L^*$ is determined by a numerical method that uses a nonlinear program with the KKT conditions. The nonlinear program of the LRL method with the KKT conditions is:

Maximize the objective function

$$L(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_L^*|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[(\boldsymbol{\Sigma}_L^*)^{-1}(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_L^*)^T(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_L^*)\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big)^T\big[\hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_L^*) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y})$;

(iii) $\hat{\boldsymbol{\Sigma}}_L^* = \hat{\boldsymbol{\Sigma}}_L - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m}$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big) = 0$;

(v) $\lambda \geq 0$.
The test statistic, its distribution, and the rejection rule for the hypothesis of outlier detection using the LRL method are given completely by Theorem 4.

Theorem 4. If the following hypotheses are given:

$H_0$: $A_{Lm}$ contains no outliers
$H_1$: $A_{Lm}$ contains outliers,

then:

(a) The test statistic for the hypothesis is

$$\mathrm{LRL}_{A_m} = -c\,\ln\frac{\left|\hat{\boldsymbol{\Sigma}}_L - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m}\right|}{|\hat{\boldsymbol{\Sigma}}_L|},$$

where $\mathbf{Q}_{LA_m} = \mathbf{X}_{LA_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{LA_m}^T$, $\hat{\boldsymbol{\varepsilon}}_{LA_m} = \mathbf{Y}_{LA_m} - \mathbf{X}_{LA_m}\hat{\mathbf{B}}_L$, and $c = n - p - m - \frac{1}{2}(q - m + 1)$.

(b) The distribution of the statistic $\mathrm{LRL}_{A_m}$ is Chi-square.

(c) The null hypothesis is rejected when

$$\mathrm{LRL}_{A_m} > \chi^2_{\alpha;\,\mathrm{df}},$$

where $\alpha$ is the significance level and df is the number of response variables.
Detecting and determining outliers in multivariate linear models based on the LRL method is done by comparing the value of $\mathrm{LRL}_{A_m}$ with $\chi^2_{\alpha;\,\mathrm{df}}$. If $\mathrm{LRL}_{A_m} > \chi^2_{\alpha;\,\mathrm{df}}$, the observation concerned is an outlier. After that, the pairs of observations detected as outliers are determined stepwise based on the LRL values; the largest LRL values are chosen as the next pair.

The next step is to take the new $A_{mb}^*$ without involving the old $A_m$ that has already been detected as an outlier set, and to determine a new multiplier with a confidence interval for $\mathrm{Vec}(\mathbf{B}_{Lb}^*)$. Based on (16), the new multiplier with its confidence interval can be written as follows:

$$g\big(\mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb}^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) \geq 0.$$

The estimate of the parameter $\mathbf{B}_{LA_{mb}^C}^*$ is then obtained from a nonlinear program with the KKT conditions based on the new constraints. The nonlinear program of the LRL method with the KKT conditions based on the new constraints is: maximize the objective function

$$L(\mathbf{B}_{Lb}^*, \boldsymbol{\Sigma}_{Lb}^*) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_{Lb}^*|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[(\boldsymbol{\Sigma}_{Lb}^*)^{-1}(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_{Lb}^*)^T(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_{Lb}^*)\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb}^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y})$;

(iii) $\hat{\boldsymbol{\Sigma}}_{Lb}^* = \hat{\boldsymbol{\Sigma}}_{Lb} - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_{mb}^*}^T(\mathbf{I} - \mathbf{Q}_{LA_{mb}^*})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_{mb}^*}$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) = 0$;

(v) $\lambda \geq 0$.
Outlier detection and determination in multivariate linear models with the LRL method for the new $A_{mb}^*$ are based on the hypothesis test and test statistic discussed in Theorem 4: reject $H_0$ if $\mathrm{LRL}_{A_{mb}} > \chi^2_{\alpha;\,\mathrm{df}}$. The next step is to take a new $A_{mc}$, determine the new multiplier, and then determine the parameter estimate of $\mathbf{B}_{LA_{mc}^C}^*$, obtained from the nonlinear program with the KKT conditions. Finally, the hypothesis test and test statistic are applied to test for the existence of outliers based on the new multiplier. These steps are repeated until no more observations are detected as outliers, i.e. $\mathrm{LRL}_{A_m} \leq \chi^2_{\alpha;\,\mathrm{df}}$. Likewise, [10] noted that the process is also terminated when the computational burden becomes too large or the values of $\mathrm{LRL}_{A_m}$ no longer change much, in which case the final iteration is determined. The advantage of the LRL method, which uses a Lagrange multiplier, is that the optimal values obtained are restricted to the specified confidence interval.
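The same greedy driver sketched above for the LDL loop applies here unchanged; a minimal usage sketch, with the caveat that the unconstrained `lr_statistic` defined earlier is only a stand-in for the constrained LRL statistic of Theorem 4:

```python
# Stepwise LRL-style search, reusing the helpers defined earlier.
outliers = stepwise_detect(X, Y, lr_statistic)
print(sorted(outliers))   # indices of observations flagged as outliers
```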
5. CONCLUSION AND FURTHER RESEARCH
The LDL method uses Lagrange multipliers with a confidence-interval constraint on the parameter vector, and works by deleting $m$ observations from the model. The optimal parameter estimates are determined by a numerical method using the KKT conditions in a nonlinear program. Outlier detection in multivariate linear models with the LDL method is carried out by comparing the calculated LDL values with the Chi-square table values, since the statistic follows a Chi-square distribution: if the LDL value is greater than the Chi-square table value, the observation is considered an outlier. The LRL method is another way to detect and test outliers in multivariate linear models besides the LDL method. It likewise uses Lagrange multipliers with a confidence-interval constraint on the parameter vector, but works by shifting the means in the model; the optimal parameter estimates are again determined numerically using the KKT conditions in a nonlinear program. Outlier detection with the LRL method is carried out by comparing the calculated LRL values with the Chi-square table values: if the LRL value is greater than the Chi-square table value, the observation is considered an outlier. The advantage of the LDL and LRL methods, which use a Lagrange constraint, is that the optimal values obtained are restricted to the specified confidence interval. The LDL and LRL methods are thus outlier detection methods for the multivariate linear model setting and can be applied to real data sets.
REFERENCES
[1] Hawkins, D.M. (1980). Identification of Outliers, Chapman & Hall, London.
[2] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd edition, John
Wiley, Great Britain.
[3] Peña, D. and Prieto, F.J. (2001). Multivariate Outlier Detection and Robust
Covariance Matrix Estimation, American Statistical Association and the
American Society for Quality, Technometrics, Vol. 43, No. 3, pp. 286-310.
[4] Filzmoser, P. (2005). Identification of Multivariate Outliers: A Performance
Study, Austrian Journal of Statistics, Vol. 34, No. 2, pp. 127-138.
[5] Rousseeuw, P.J. (1984). Least Median of Squares Regression, Journal of the
American Statistical Association, Vol. 79, pp. 871-880.
[6] Peña, D. and Guttman, I. (1993). Comparing Probabilistic Methods for Outlier Detection in Linear Models, Biometrika, Vol. 80, No. 3, pp. 603-610.
[7] Cook, R.D. (2000). Detection of Influential Observation in Linear Regression,
Technometrics, Vol. 42, No. 1, pp. 65-68.
[8] Adnan, R., Mohamad, M.N., and Setan, H. (2003). Multiple Outliers Detection
Procedures in Linear Regression, Matematika, Vol. 19, No. 1, pp. 29-45.
[9] Srivastava, M.S. and von Rosen, D. (1998). Outliers in Multivariate
Regression Models, Journal of Multivariate Analysis, Vol. 65, pp. 195-208.
[10] Xu, J., Abraham, B., and Steiner, S.H. (2005). Outlier Detection Methods in Multivariate Regression Models, http://www.bisrg.uwaterloo.ca/archive/RR-06-07.pdf, accessed April 4, 2007.
[11] Diaz-Garcia, J.A., Gonzalez-Farias, G., and Alvarado-Castro, V. (2007). Exact
Distributions for Sensitivity Analysis in Linear Regression, Applied
Mathematical Sciences, Vol. 1, No. 22, pp. 1083-1100.
[12] Makkulau, Linuwih, S., Purhadi, and Mashuri, M. (2010). Pendeteksian Outlier
dan Penentuan Faktor-faktor yang Mempengaruhi Produksi Gula dan Tetes Tebu
dengan Metode Likelihood Displacement Statistic-Lagrange, Jurnal Teknik
Industri: Jurnal Keilmuan dan Aplikasi Teknik Industri, Vol. 12, No.2, pp. 95-
100 (In Indonesian).
[13] Makkulau, Linuwih, S., Purhadi, and Mashuri, M. (2011). Pendeteksian Outlier
pada Pengamatan dalam Model Linear Multivariat dengan Metode Likelihood
Displacement Statistic-Lagrange, Jurnal Ilmu Dasar, Vol. 12, No.1, pp. 62-67
(In Indonesian).
[14] Makkulau, Linuwih, S., Purhadi, and Mashuri, M. (2012). Outlier Detection in Multivariate Linear Models using Likelihood Ratio Statistic for a Mean Shift-Lagrange Method, International Journal of Academic Research, Part A, Natural and Applied Sciences, Vol. 4, No. 6, pp. 5-13.
[15] Christensen, R. (1991). Linear Models for Multivariate, Time Series, and Spatial Data, Springer-Verlag, New York.
[16] Harville, D.A. (2008). Matrix Algebra From a Statistician’s Perspective,
Springer Science+Business Media, LLC, USA.
[17] Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis,
3rd edition, John Wiley, New York.
[18] Rousseeuw, P.J. and Hubert, M. (1997). Recent Developments in PROGRESS, in L1-Statistical Procedures and Related Topics, edited by Y. Dodge, Institute of Mathematical Statistics Lecture Notes and Monograph Series, Hayward, California, Vol. 31, pp. 201-214.
[19] Bazaraa, M.S., Sherali, H.D., and Shetty, C.M. (1993). Nonlinear Programming: Theory and Algorithms, 2nd edition, John Wiley & Sons, New York.
[20] Schubert, E., Zimek, A., and Kriegel, H.P. (2014a). Local Outlier Detection
Reconsidered: a Generalized View on Locality with Applications to Spatial,
Video, and Network Outlier Detection, Data Mining and Knowledge
Discovery, Vol. 28, No.1, pp. 190–237.
[21] Schubert, E., Zimek, A., and Kriegel, H.P. (2014b). Generalized Outlier
Detection with Flexible Kernel Density Estimates, In Proceedings of the 14th
SIAM International Conference on Data Mining (SDM), Philadelphia, PA.