Global Journal of Pure and Applied Mathematics.
ISSN 0973-1768 Volume 13, Number 6 (2017), pp. 2563-2578
© Research India Publications
http://www.ripublication.com
*Corresponding author.
Email addresses: [email protected] (Makkulau), [email protected] (Edi Cahyono), [email protected] (Mukhsar), [email protected] (Asrul Sani), [email protected] (La Ode Saidi), [email protected] (Andi Tenri Ampa)
Outlier Detection in Multivariate Linear Models
Using Lagrange Multipliers
Makkulau¹, Edi Cahyono¹, Mukhsar¹*, Asrul Sani¹, La Ode Saidi¹, Andi Tenri Ampa¹
¹Department of Mathematics and Statistics, Faculty of Mathematics and Natural Sciences, Universitas Halu Oleo, Kampus Hijau Bumi Tridharma Andounohu, Kendari, 93232, Sulawesi Tenggara, Indonesia.
Abstract
In statistics, data that deviate markedly from the rest of the data are called outliers. Outliers can influence the resulting model and hence can have an impact on decision making. This paper reports a development of two outlier detection methods, the Likelihood Displacement Statistic (LD) method and the Likelihood Ratio Statistic for a Mean Shift (LR) method, into the Likelihood Displacement Statistic-Lagrange (LDL) method and the Likelihood Ratio Statistic for a Mean Shift-Lagrange (LRL) method, respectively. The LDL and LRL methods use Lagrange multipliers with a confidence-interval constraint on the parameter vector; detection proceeds by deleting m observations (LDL) or shifting the means (LRL) in the model and solving the resulting nonlinear program under the Karush-Kuhn-Tucker (KKT) conditions. The LDL method compares the computed LDL values with Chi-square table values, and the LRL method likewise compares the computed LRL values with Chi-square table values, since both statistics follow a Chi-square distribution. The advantage of the LDL and LRL methods, which use Lagrange multipliers, is that the optimal values obtained are restricted to the specified confidence interval.
Keywords: Likelihood Displacement Statistic-Lagrange, Likelihood Ratio Statistic for a Mean Shift-Lagrange, Karush-Kuhn-Tucker, Outlier, Lagrange Multipliers.
1. INTRODUCTION
In statistics, an outlier is an observation point that is distant from the other observations (data). Two kinds of outliers are distinguished: outliers in data and outliers in models. Existing outlier detection methods in models use the ordinary likelihood method; their limitation is that the optimal values produced might not be the true optimal values. When the research object, in the form of data, contains outliers, i.e. data that deviate markedly from the other data, these outliers can influence the resulting model and hence can have an impact on decision making. Based on the number of variables considered, outliers can be divided into outliers in univariate or multivariate observations and outliers in univariate or multivariate linear models. Outliers in linear models can occur in the predictor (independent) variables, in the response (dependent) variables, or in both at once. Outliers in predictor variables are easier to handle than outliers in both at once. Outlier detection methods in univariate and multivariate observations have been developed by [1, 2, 3, 4], which identify an observation that deviates markedly from the other observations. Outlier detection in univariate linear models has been developed by [5, 6, 7, 8], which identify outliers in univariate linear models with the Least Trimmed Squares method and Single Linkage Clustering to obtain potential outlier observations. Outliers in spatial data and outliers under flexible kernel density estimates are identified in [20, 21].
A formal test for detecting outliers in multivariate linear models was developed by [9] using a likelihood ratio test. [10] extended the univariate Cook's distance to detect outliers in multivariate linear models, while [11] proposed detecting influential observations in multivariate linear models using a modified Cook's distance. In developing outlier detection methods for multivariate linear models, [10] used the Likelihood Displacement Statistic (LD) method, the Likelihood Ratio Statistic for a Mean Shift (LR) method, and the multivariate Leverage method. In the LD and LR methods of [10], the parameters of the likelihood function of the multivariate linear model are estimated by Maximum Likelihood Estimation (MLE), and the optimal values are obtained by taking all values into account. A development of the LD method using the Lagrange multiplier method, called the Likelihood Displacement Statistic-Lagrange (LDL) method, was given by [12], and a development of the LR method using the Lagrange multiplier method, called the Likelihood Ratio Statistic for a Mean Shift-Lagrange (LRL) method, was given by [14]. An application of the LDL method to production data of sugar and molasses at the Djombang Baru Sugar Factory, Jombang, East Java Province can be found in [13]. The aim of this study is to determine how to detect and test outliers in multivariate linear models using the Likelihood Displacement Statistic-Lagrange (LDL) method and the Likelihood Ratio Statistic for a Mean Shift-Lagrange (LRL) method.
2. MATERIAL AND METHODS
A multivariate linear model is a linear model with more than one response variable [15]. Suppose $X_1, X_2, \ldots, X_p$ are predictor variables and $Y_1, Y_2, \ldots, Y_q$ are response variables; then the model for $Y_h$ is

$$Y_h = \beta_{0h} + \beta_{1h}X_1 + \beta_{2h}X_2 + \cdots + \beta_{ph}X_p + \varepsilon_h, \quad h = 1, 2, \ldots, q,$$

where $E(\varepsilon_h) = 0$, $\mathrm{Var}(\varepsilon_h) = \sigma_h^2$, and $\mathrm{Cov}(\varepsilon_h, \varepsilon_{h^*}) = \sigma_{hh^*}$, $h \neq h^*$; $h, h^* = 1, 2, \ldots, q$.

A multivariate linear model consisting of $q$ linear models simultaneously can be written as

$$\mathbf{Y}_{n \times q} = \mathbf{X}_{n \times (p+1)}\,\mathbf{B}_{(p+1) \times q} + \boldsymbol{\varepsilon}_{n \times q}, \qquad (1)$$

where the first column of $\mathbf{X}$ is the unit vector $\mathbf{1}_{n \times 1}$.

The model in (1) can be written in vector form as follows:

$$\mathrm{Vec}(\mathbf{Y}) = (\mathbf{I}_q \otimes \mathbf{X})\,\mathrm{Vec}(\mathbf{B}) + \mathrm{Vec}(\boldsymbol{\varepsilon}),$$

where $\otimes$ is the Kronecker product and

$$\mathrm{Vec}(\mathbf{Y}) = (y_{11}, y_{21}, \ldots, y_{n1}, y_{12}, y_{22}, \ldots, y_{n2}, \ldots, y_{1q}, y_{2q}, \ldots, y_{nq})^T,$$
$$\mathrm{Vec}(\mathbf{B}) = (\beta_{01}, \beta_{11}, \ldots, \beta_{p1}, \beta_{02}, \beta_{12}, \ldots, \beta_{p2}, \ldots, \beta_{0q}, \beta_{1q}, \ldots, \beta_{pq})^T,$$
$$\mathrm{Vec}(\boldsymbol{\varepsilon}) = (\varepsilon_{11}, \varepsilon_{21}, \ldots, \varepsilon_{n1}, \varepsilon_{12}, \varepsilon_{22}, \ldots, \varepsilon_{n2}, \ldots, \varepsilon_{1q}, \varepsilon_{2q}, \ldots, \varepsilon_{nq})^T.$$
2.1. Parameter Estimation and Hypothesis Testing in the Multivariate Linear Model
In the multivariate linear model, the error matrix $\boldsymbol{\varepsilon}$ is a random matrix. Assume that the error matrix satisfies

$$\boldsymbol{\varepsilon} \sim N_{nq}(\mathbf{0}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n), \quad \text{or} \quad \mathrm{Vec}(\boldsymbol{\varepsilon}) \sim N_{nq}(\mathbf{0}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n). \qquad (2)$$

The distribution of the matrix $\mathbf{Y}$ in (1) then follows. Because of (2),

$$\mathbf{Y} \sim N_{nq}(\mathbf{XB}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n), \quad \text{or} \quad \mathrm{Vec}(\mathbf{Y}) \sim N_{nq}\big((\mathbf{I}_q \otimes \mathbf{X})\mathrm{Vec}(\mathbf{B}), \boldsymbol{\Sigma} \otimes \mathbf{I}_n\big),$$

where $\mathrm{Vec}(\mathbf{Y}) = (\mathbf{I}_q \otimes \mathbf{X})\mathrm{Vec}(\mathbf{B}) + \mathrm{Vec}(\boldsymbol{\varepsilon})$.

Estimating the parameters $\mathbf{B}$ and $\boldsymbol{\Sigma}$ in (1) by the MLE method [15] gives

$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, \qquad (3)$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}) = \frac{1}{n}\mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}. \qquad (4)$$

$\hat{\boldsymbol{\Sigma}}$ is a biased estimator of $\boldsymbol{\Sigma}$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the projection matrix of $\mathbf{X}$, while the unbiased estimator of $\boldsymbol{\Sigma}$ is

$$\mathbf{S} = \frac{1}{n - \mathrm{rank}(\mathbf{X})}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}).$$
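A minimal sketch of the estimators (3) and (4) on simulated data, assuming numpy; the simulated design and the variable names are ours, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 2, 3

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
B_true = rng.normal(size=(p + 1, q))
Y = X @ B_true + rng.normal(scale=0.5, size=(n, q))

# Equation (3): MLE / least-squares estimate of B.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equation (4): biased MLE of Sigma; S is the unbiased version.
E = Y - X @ B_hat                              # residual matrix
Sigma_hat = E.T @ E / n                        # biased (divides by n)
S = E.T @ E / (n - np.linalg.matrix_rank(X))   # unbiased
```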
Using properties of the Kronecker product and the matrix inverse [16], (3) can be written in vector form as

$$\mathrm{Vec}(\hat{\mathbf{B}}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}).$$

To test the significance of (1), a simultaneous test is carried out with the hypotheses

$$H_0: \beta_{1h} = \beta_{2h} = \cdots = \beta_{ph} = 0, \quad h = 1, 2, \ldots, q,$$
$$H_1: \text{at least one } \beta_{jh} \neq 0, \quad j = 1, 2, \ldots, p \text{ and } h = 1, 2, \ldots, q.$$

Based on the likelihood function for the population,

$$L(\mathbf{B}, \boldsymbol{\Sigma}) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}|^{-n/2}\exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\!\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{XB})^T(\mathbf{Y} - \mathbf{XB})\right]\right\},$$

the estimate of the parameter matrix $\boldsymbol{\Sigma}$ is obtained, i.e. $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}$. Let $\hat{\mathbf{E}}_U = \mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}$ and $\hat{\mathbf{M}}_H = \mathbf{Y}^T\mathbf{H}_P\,\mathbf{Y}$, where $\mathbf{H}_P = \mathbf{H} - \mathbf{H}_0$, $\mathbf{H}_0 = \mathbf{X}_0(\mathbf{X}_0^T\mathbf{X}_0)^{-1}\mathbf{X}_0^T$, and $\hat{\mathbf{E}}_U$ is a matrix of size $q \times q$. The test statistic used to test the above hypothesis is Wilks's Lambda [17]:

$$\Lambda = \frac{|\hat{\mathbf{E}}_U|}{|\hat{\mathbf{E}}_U + \hat{\mathbf{M}}_H|}, \qquad 0 \leq \Lambda \leq 1,$$

where $\Lambda \sim \Lambda_{q;\,d;\,n-\mathrm{rank}(\mathbf{X})}$ follows Wilks's Lambda distribution. Reject $H_0$ if $\Lambda_{\mathrm{calculated}} \leq \Lambda_{\alpha;\,q;\,d;\,n-\mathrm{rank}(\mathbf{X})}$, where $q$ is the number of response variables, $\mathrm{rank}(\mathbf{X}) = p + 1$, and $d = \mathrm{rank}(\mathbf{X}) - \mathrm{rank}(\mathbf{X}_0)$.
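A sketch of the Wilks's Lambda computation, under the assumption (ours, for illustration) that $\mathbf{X}_0$ is the intercept-only design; in practice the critical value $\Lambda_{\alpha;\,q;\,d;\,n-\mathrm{rank}(\mathbf{X})}$ would come from Wilks's Lambda tables or an F approximation:

```python
import numpy as np

def wilks_lambda(X, Y, X0):
    """Wilks's Lambda for H0: all slope coefficients are zero.

    X : full design (with intercept column); X0 : reduced design.
    Returns Lambda = |E_U| / |E_U + M_H|.
    """
    H  = X  @ np.linalg.solve(X.T  @ X,  X.T)    # projection onto col(X)
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)   # projection onto col(X0)
    E_U = Y.T @ (np.eye(len(Y)) - H) @ Y         # error SSCP matrix
    M_H = Y.T @ (H - H0) @ Y                     # hypothesis SSCP matrix
    return np.linalg.det(E_U) / np.linalg.det(E_U + M_H)

# Example: reuse X, Y from the previous sketch; X0 is the intercept column.
# lam = wilks_lambda(X, Y, X[:, :1])
```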
2.2. Outliers in the Multivariate Linear Model

According to [2, 18], outliers in multivariate linear models can be classified into three categories: outliers in observations of the predictor variables $\mathbf{X}$, outliers in observations of the response variables $\mathbf{Y}$, and outliers in observations of the predictors $\mathbf{X}$ and responses $\mathbf{Y}$ together, i.e. outliers in the set of pairs $(\mathbf{y}_1, \mathbf{x}_1), (\mathbf{y}_2, \mathbf{x}_2), \ldots, (\mathbf{y}_n, \mathbf{x}_n)$. This study focuses on outlier detection and testing in this set of pairs.
2.3. Outlier Detection Methods in Multivariate Linear Model
There are three methods of outlier detection in multivariate linear models, namely the LD, LR, and multivariate Leverage methods [10]. Outlier detection based on the LD method is done by deleting the observations detected as outliers from the model stepwise. Suppose $m$ observations are collected from the set of pairs $(\mathbf{y}_1, \mathbf{x}_1), (\mathbf{y}_2, \mathbf{x}_2), \ldots, (\mathbf{y}_n, \mathbf{x}_n)$, and take the $m$ pairs of observation vectors detected as outliers $(m \leq n)$. Define $A_m$ as the set of the $m$ observations detected as outliers, i.e. $A_m = \{i_1, i_2, \ldots, i_m\}$, where $i_{j_a} \neq i_{j_b}$ for $j_a \neq j_b$ and each index $i_j \in \{1, 2, \ldots, n\}$ for $j = 1, 2, \ldots, m$. In other words, there are $m$ observations detected as outliers in $A_m$. Hence the pair $(\mathbf{Y}_{A_m}, \mathbf{X}_{A_m})$ is the matrix of observations detected as outliers, and $(\mathbf{Y}_{A_m^C}, \mathbf{X}_{A_m^C})$ is the matrix of observations without outliers, i.e. after the outliers have been deleted.

Based on (3), the estimate of the parameter matrix $\mathbf{B}$ after the observations detected as outliers have been deleted is

$$\hat{\mathbf{B}}_{A_m^C} = \hat{\mathbf{B}} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}, \qquad (6)$$

where $\mathbf{Q}_{A_m} = \mathbf{X}_{A_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T$, $\hat{\boldsymbol{\varepsilon}}_{A_m} = \mathbf{Y}_{A_m} - \mathbf{X}_{A_m}\hat{\mathbf{B}}$, and $(\mathbf{I} - \mathbf{Q}_{A_m})^{-1} = \mathbf{I} + \mathbf{Q}_{A_m}(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}$.

The estimate of the parameter matrix $\boldsymbol{\Sigma}$ after the observations detected as outliers have been deleted is

$$\hat{\boldsymbol{\Sigma}}_{A_m^C} = \frac{n}{n-m}\hat{\boldsymbol{\Sigma}} - \frac{1}{n-m}\hat{\boldsymbol{\varepsilon}}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}.$$

The expected value and variance of $\hat{\mathbf{B}}_{A_m^C}$ are then $E(\hat{\mathbf{B}}_{A_m^C}) = \mathbf{B}$ and

$$\mathrm{Var}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{A_m^C})\big) = \boldsymbol{\Sigma} \otimes \left[(\mathbf{X}^T\mathbf{X})^{-1} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\mathbf{X}_{A_m}(\mathbf{X}^T\mathbf{X})^{-1}\right].$$

Definition [15]. The LD for $\mathbf{B}$ given $\boldsymbol{\Sigma}$ is

$$\mathrm{LD}_{A_m} = 2\left[\ln L(\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) - \ln L\big(\hat{\mathbf{B}}_{A_m^C}, \hat{\boldsymbol{\Sigma}}(\hat{\mathbf{B}}_{A_m^C})\big)\right],$$

where $\hat{\boldsymbol{\Sigma}}(\hat{\mathbf{B}}_{A_m^C})$ is estimated using $\hat{\mathbf{B}}_{A_m^C}$ given in (6).

The test statistic used to test for the existence of outliers with the LD method is

$$\mathrm{LD}_{A_m} = 2\left[\ln L(\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) - \ln L\big(\hat{\mathbf{B}}_{A_m^C}, \hat{\boldsymbol{\Sigma}}(\hat{\mathbf{B}}_{A_m^C})\big)\right] = n\ln\frac{\left|\hat{\boldsymbol{\Sigma}} + \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{A_m}^T\mathbf{C}_{A_m}\hat{\boldsymbol{\varepsilon}}_{A_m}\right|}{|\hat{\boldsymbol{\Sigma}}|}, \qquad (7)$$

where $\mathbf{C}_{A_m} = (\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\mathbf{Q}_{A_m}(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}$ and $\lambda_1, \lambda_2, \ldots, \lambda_m$ are the eigenvalues of $\mathbf{C}_{A_m}$.

Next, the distribution of (7) is determined. According to [10], when $n$, $n - p$, and $n - q$ are large, $\mathrm{LD}_{A_m}$ follows a scaled Chi-square distribution with degrees of freedom (df) equal to the number of response variables, $\mathrm{LD}_{A_m} \sim \lambda_m\,\chi^2_q$. Because of this, the table value of LD for observations detected as outliers is $\mathrm{LD}_\alpha = \lambda_m\,\chi^2_{\alpha;\,\mathrm{df}}$, where $\alpha$ is the significance level, $\lambda_m$ is the largest eigenvalue, and $\mathrm{df} = q$ is the number of response variables.

Hypothesis testing in the LD method is based on the following hypotheses:
$H_0$: $A_m$ contains no outliers
$H_1$: $A_m$ contains outliers.

The test statistic used to test for the existence of outliers is (7). Reject $H_0$ when $\mathrm{LD}_{A_m} > \mathrm{LD}_\alpha$. In other words, detecting and determining outliers in multivariate linear models with the LD method comes down to comparing the value of $\mathrm{LD}_{A_m}$ with $\mathrm{LD}_\alpha$: when $\mathrm{LD}_{A_m} > \mathrm{LD}_\alpha$, the observations concerned are outliers.

The next step is to look for each pair of observations detected as an outlier based on the LD values; the largest LD values are chosen as the next pair. If $\mathrm{LD}_{A_m} \leq \mathrm{LD}_\alpha$ (no new observations are identified as outliers), or the computational burden becomes too large, or the value of $\mathrm{LD}_{A_m}$ no longer changes much, the process is terminated [10] and the final sets $A_m$ and $A_m^C$ are determined.
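The LD computation for one candidate set $A_m$ can be sketched as follows, using (4), (6), (7), and the scaled Chi-square cutoff described above; the helper name, the 5% level, and the use of scipy's Chi-square quantile are our illustrative choices:

```python
import numpy as np
from scipy import stats

def ld_statistic(X, Y, idx):
    """LD_{A_m} of equation (7) for the candidate outlier set `idx`."""
    n = len(Y)
    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Sigma_hat = (Y - X @ B_hat).T @ (Y - X @ B_hat) / n    # equation (4)

    Xm, Ym = X[idx], Y[idx]
    Q = Xm @ np.linalg.solve(X.T @ X, Xm.T)                # Q_{A_m}
    eps = Ym - Xm @ B_hat                                  # residuals of A_m
    IQinv = np.linalg.inv(np.eye(len(idx)) - Q)
    C = IQinv @ Q @ IQinv                                  # C_{A_m}

    ld = n * np.log(np.linalg.det(Sigma_hat + eps.T @ C @ eps / n)
                    / np.linalg.det(Sigma_hat))
    cutoff = np.linalg.eigvalsh(C).max() * stats.chi2.ppf(0.95, df=Y.shape[1])
    return ld, cutoff

# ld, cut = ld_statistic(X, Y, idx=[4, 17]); flagged as outliers if ld > cut
```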
The LR method is another way to detect and test outliers in multivariate linear models besides the LD method. Analogously to the LD method, the LR method removes the observations detected as outliers from the model stepwise, but based on a mean shift. The LR method starts by determining the estimate of the matrix $\boldsymbol{\Psi}$, where $\boldsymbol{\Psi}$ is the shift-coefficient matrix associated with the observations in the set $A_m$, and $A_m$ is the set of the $m$ observations detected as outliers. After that, $E(\mathbf{Y})$ is determined from the multivariate linear model. The multivariate linear model in the LR method is written in the form

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z}_{A_m}\boldsymbol{\Psi} + \boldsymbol{\varepsilon} = \mathbf{X}^*\mathbf{B}^* + \boldsymbol{\varepsilon}^*, \qquad (9)$$

where $\mathbf{X}^* = [\mathbf{X} \;\; \mathbf{Z}_{A_m}]$, $\mathbf{B}^* = [\mathbf{B}^T \;\; \boldsymbol{\Psi}^T]^T$, and $\mathbf{Z}_{A_m} = [\mathbf{z}_{i_1} \; \mathbf{z}_{i_2} \; \cdots \; \mathbf{z}_{i_m}]$, $i_k \in A_m$, is a matrix of column vectors in which $\mathbf{z}_{i_k}$ is a column vector whose $i_k$-th component is one and whose other components are zero. $\boldsymbol{\Psi}_{m \times q}$ is the matrix of shift coefficients associated with the observations in the set $A_m$.

Based on the general model (1), $E(\mathbf{Y}) = \mathbf{XB}$ is obtained, which is the mean of the model (Srivastava and von Rosen [9]), while based on (9) the expectation of the matrix $\mathbf{Y}$ is

$$E(\mathbf{Y}) = \mathbf{XB} + \mathbf{Z}_{A_m}\boldsymbol{\Psi}, \qquad (10)$$

which is the shifted mean. The estimate of the parameter matrix $\mathbf{B}^*$ in (9) is

$$\hat{\mathbf{B}}^* = (\mathbf{X}^{*T}\mathbf{X}^*)^{-1}\mathbf{X}^{*T}\mathbf{Y} = \begin{bmatrix} \hat{\mathbf{B}}_{A_m^C} \\ \hat{\boldsymbol{\Psi}} \end{bmatrix},$$

where $\hat{\mathbf{B}}$ is given by (3), $\mathbf{Q}_{A_m} = \mathbf{X}_{A_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{A_m}^T$, and $\hat{\boldsymbol{\Psi}} = (\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}$.

The estimate of the parameter matrix $\boldsymbol{\Sigma}^*$ in (9) is:
$$\hat{\boldsymbol{\Sigma}}^* = \hat{\boldsymbol{\Sigma}} - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m},$$

where $\hat{\boldsymbol{\Sigma}}$ is given by (4).

The test statistic used to test for the existence of outliers with the LR method is

$$\mathrm{LR}_{A_m} = -c\,\ln\frac{|\hat{\boldsymbol{\Sigma}}^*|}{|\hat{\boldsymbol{\Sigma}}|} = -c\,\ln\frac{\left|\hat{\boldsymbol{\Sigma}} - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{A_m}^T(\mathbf{I} - \mathbf{Q}_{A_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{A_m}\right|}{|\hat{\boldsymbol{\Sigma}}|}, \qquad (11)$$

where $c = n - p - m - \frac{1}{2}(q - m + 1)$.

Next, the distribution of (11) is determined. According to Box in Xu et al. [10], when $n$, $n - p$, and $n - q$ are large, $\mathrm{LR}_{A_m}$ follows a Chi-square distribution with degrees of freedom (df) equal to the number of response variables, $\mathrm{LR}_{A_m} \sim \chi^2_{\mathrm{df}}$. Because of this, the table value of LR for observations detected as outliers is $\mathrm{LR}_\alpha = \chi^2_{\alpha;\,\mathrm{df}}$, where $\alpha$ is the significance level and df is the number of response variables.

Hypothesis testing in the LR method is based on the following hypotheses:

$H_0$: $A_m$ contains no outliers
$H_1$: $A_m$ contains outliers.

The test statistic used to test for the existence of outliers is (11). Reject $H_0$ when $\mathrm{LR}_{A_m} > \mathrm{LR}_\alpha$. In other words, detecting and determining outliers in multivariate linear models with the LR method comes down to comparing the value of $\mathrm{LR}_{A_m}$ with $\mathrm{LR}_\alpha$: when $\mathrm{LR}_{A_m} > \mathrm{LR}_\alpha$, the observations concerned are outliers.

The next step is to look for each pair of observations detected as an outlier based on the LR values; the largest LR values are chosen as the next pair. If $\mathrm{LR}_{A_m} \leq \mathrm{LR}_\alpha$ (no new observations are identified as outliers), or the computational burden becomes too large, or the value of $\mathrm{LR}_{A_m}$ no longer changes much, the process is terminated [10] and the final sets $A_m$ and $A_m^C$ are determined.
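An analogous sketch for the LR statistic of (11), using the closed form via $\mathbf{Q}_{A_m}$ rather than explicitly building the indicator matrix $\mathbf{Z}_{A_m}$; again the function name and the default 5% level are our illustrative choices:

```python
import numpy as np
from scipy import stats

def lr_statistic(X, Y, idx, alpha=0.05):
    """LR_{A_m} of equation (11) for the candidate outlier set `idx`."""
    n, q = Y.shape
    p = X.shape[1] - 1           # number of predictors (X has a 1s column)
    m = len(idx)

    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Sigma_hat = (Y - X @ B_hat).T @ (Y - X @ B_hat) / n    # equation (4)

    Xm, Ym = X[idx], Y[idx]
    Q = Xm @ np.linalg.solve(X.T @ X, Xm.T)                # Q_{A_m}
    eps = Ym - Xm @ B_hat
    Sigma_star = Sigma_hat - eps.T @ np.linalg.solve(
        np.eye(m) - Q, eps) / n                            # Sigma* under (9)

    c = n - p - m - 0.5 * (q - m + 1)
    lr = -c * np.log(np.linalg.det(Sigma_star) / np.linalg.det(Sigma_hat))
    return lr, stats.chi2.ppf(1 - alpha, df=q)

# lr, cut = lr_statistic(X, Y, idx=[4, 17]); flagged as outliers if lr > cut
```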
2.4. Nonlinear Optimization
Let $L(\boldsymbol{\theta})$ be the objective function with constraints $g_k(\boldsymbol{\theta}) \geq 0$ for $k = 1, 2, \ldots, K$, and let $\boldsymbol{\theta}^*$ be a feasible solution. The Karush-Kuhn-Tucker (KKT) conditions can then be written in the following form [19]:

Maximize the objective function

$$L(\boldsymbol{\theta}), \quad \boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_K), \quad \text{subject to } g_k(\boldsymbol{\theta}) \geq 0, \quad k = 1, 2, \ldots, K. \qquad (12)$$

Problem (12) can be solved with Lagrange multipliers as follows:

$$F(\boldsymbol{\theta}, \boldsymbol{\lambda}) = L(\boldsymbol{\theta}) + \sum_{k=1}^{K}\lambda_k\,g_k(\boldsymbol{\theta}).$$

Optimality under the KKT conditions is achieved when

$$\frac{\partial F}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \boldsymbol{\theta}} + \sum_{k=1}^{K}\lambda_k\frac{\partial g_k}{\partial \boldsymbol{\theta}} = \mathbf{0},$$
$$\frac{\partial F}{\partial \lambda_k} = g_k(\boldsymbol{\theta}) \geq 0,$$
$$\sum_{k=1}^{K}\lambda_k\,g_k(\boldsymbol{\theta}) = 0, \quad \text{and} \quad \lambda_k \geq 0,$$

where $\lambda_k$ is the Lagrange multiplier.
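As a concrete illustration of solving such a program numerically, the sketch below maximizes a toy concave objective under one inequality constraint with scipy's SLSQP solver, which enforces the KKT conditions internally; the objective and constraint are invented solely for this example:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: maximize L(theta) = -(theta1 - 1)^2 - (theta2 - 2)^2
# subject to g(theta) = 1 - theta1 - theta2 >= 0.
objective = lambda t: (t[0] - 1) ** 2 + (t[1] - 2) ** 2   # minimize -L
constraint = {"type": "ineq", "fun": lambda t: 1 - t[0] - t[1]}

res = minimize(objective, x0=np.zeros(2), method="SLSQP",
               constraints=[constraint])
print(res.x)   # optimum pushed onto the boundary theta1 + theta2 = 1
```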
3. METHOD
The proposed method is a development of the LD and LR methods using Lagrange multipliers. The Lagrange constraint used is the confidence interval of the parameter vector, which serves to optimize over the confidence interval obtained. This confidence interval is obtained in stages by deleting observations and shifting the means in the model numerically using a nonlinear program. The process of outlier detection and testing with the LDL and LRL methods is: collect the $m$ observations detected as outliers, $A_m$ and $A_m^*$; determine $\hat{\mathbf{B}}_L$, $\hat{\boldsymbol{\Sigma}}_L$, $\hat{\mathbf{B}}_{LA_m^C}$, $\hat{\boldsymbol{\Sigma}}_{LA_m^C}$, $\mathrm{LDL}_{A_m}$, and $\mathrm{LRL}_{A_m}$; and compare the LDL and LRL values with the corresponding table values.
4. RESULTS AND DISCUSSION
Outlier detection and testing based on the LDL method starts by forming the likelihood function for the population to obtain $\hat{\mathbf{B}}_L$, $\hat{\boldsymbol{\Sigma}}_L$, and $L(\hat{\mathbf{B}}_L, \hat{\boldsymbol{\Sigma}}_L)$ under assumption (2). Next, constraints based on a confidence interval for $\mathrm{Vec}(\mathbf{B}_L)$ are created. Based on Box in Xu et al. [10], when $n$, $n - p$, and $n - q$ are very large, the constraints follow an F distribution with df equal to the number of response variables. In general, the $(1 - \alpha)100\%$ confidence interval for $\mathrm{Vec}(\mathbf{B}_L)$ is

$$\big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big) \leq F_{\alpha;\,v_1;\,v_2}. \qquad (13)$$

Then, based on (13), the constraint region obtained is

$$g\big(\mathrm{Vec}(\mathbf{B}_L)\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L) - \mathrm{Vec}(\mathbf{B}_L)\big) \geq 0. \qquad (14)$$
The LDL method starts by determining $A_m$ for $m = 1$, i.e. $A_1 = \{i_1\}$ with $i_1 \in \{1, 2, \ldots, n\}$, so that $n$ candidate sets of observations detected as outliers $A_1$ are obtained; then $(\mathbf{Y}_{A_1}, \mathbf{X}_{A_1})$ and $(\mathbf{Y}_{A_1^C}, \mathbf{X}_{A_1^C})$ are determined. After that, a new $A_m$ is taken without involving the old $A_m$. For the new $A_{mb}$ with $m = 2$, $A_2 = \{i_1, i_2\}$ with $i_1, i_2 \in \{1, 2, \ldots, n\}$; the candidate sets of pairs of observations detected as outliers $A_2$ are obtained, and $(\mathbf{Y}_{A_2}, \mathbf{X}_{A_2})$ and $(\mathbf{Y}_{A_2^C}, \mathbf{X}_{A_2^C})$ are determined. This step is repeated until a set of $m = n - 1$ observations detected as outliers is obtained, e.g. $A_{n-1} = \{2, 3, \ldots, n - 2, n - 1, n\}$. The final set is then determined, giving $(\mathbf{Y}_{A_m}, \mathbf{X}_{A_m})$ and $(\mathbf{Y}_{A_m^C}, \mathbf{X}_{A_m^C})$.
The following theorem concerns the parameter estimates $\mathbf{B}_L$ and $\boldsymbol{\Sigma}_L$ obtained from the Lagrange function.

Theorem 1. If the natural logarithm of the Lagrange function $\ln L(\mathbf{B}_L, \boldsymbol{\Sigma}_L)$ is given by

$$\ln L(\mathbf{B}, \boldsymbol{\Sigma}) = -\frac{nq}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{XB})^T(\mathbf{Y} - \mathbf{XB})\right]$$

and the constraint is given by (14), then the estimators $\hat{\mathbf{B}}_L$ and $\hat{\boldsymbol{\Sigma}}_L$ that maximize the Lagrange function

$$F(\mathbf{B}_{LA_m^C}, \boldsymbol{\Sigma}_L, \lambda) = \ln L(\mathbf{B}_{LA_m^C}, \boldsymbol{\Sigma}_L) + \lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)$$

are

$$\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}) - 2\lambda\big(\mathrm{Vec}(\hat{\mathbf{B}}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)$$

and

$$\hat{\boldsymbol{\Sigma}}_L = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L)^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L),$$

where $\mathbf{B}_{LA_m^C}$ is the parameter matrix of the LDL model without outliers, $\boldsymbol{\Sigma}_L$ is the variance-covariance parameter matrix of the LDL model, and $\lambda$ is the Lagrange multiplier.
The optimum value of the parameter estimate $\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C})$ is determined by a numerical method that uses a nonlinear program with the KKT conditions. The nonlinear program of the LDL method with the KKT conditions is:

Maximize the objective function

$$L(\mathbf{B}_{LA_m^C}, \boldsymbol{\Sigma}_L) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_L|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}_L^{-1}(\mathbf{Y}_{A_m^C} - \mathbf{X}_{A_m^C}\mathbf{B}_{LA_m^C})^T(\mathbf{Y}_{A_m^C} - \mathbf{X}_{A_m^C}\mathbf{B}_{LA_m^C})\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_{LA_m^C})\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)^T\big[\hat{\boldsymbol{\Sigma}}_L \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_{LA_m^C}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}) - 2\lambda\big(\mathrm{Vec}(\hat{\mathbf{B}}) - \mathrm{Vec}(\mathbf{B}_{LA_m^C})\big)$;

(iii) $\hat{\boldsymbol{\Sigma}}_L = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L)^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_L)$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{LA_m^C})\big) = 0$;

(v) $\lambda \geq 0$.
The test statistic, its distribution, and the rejection rule for the hypothesis of outlier detection using the LDL method are given completely by Theorem 2.

Theorem 2. If the following hypotheses are given:

$H_0$: $A_{Lm}$ contains no outliers
$H_1$: $A_{Lm}$ contains outliers,

then:

(a) The test statistic for the hypothesis is

$$\mathrm{LDL}_{A_m} = n\ln\frac{\left|\hat{\boldsymbol{\Sigma}}_L + \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T\mathbf{C}_{A_m}\hat{\boldsymbol{\varepsilon}}_{LA_m}\right|}{|\hat{\boldsymbol{\Sigma}}_L|}.$$

(b) The distribution of the statistic $\mathrm{LDL}_{A_m}$ is Chi-square.

(c) The null hypothesis is rejected when

$$\mathrm{LDL}_{A_m} > \lambda_m\,\chi^2_{\alpha;\,\mathrm{df}},$$

where $\alpha$ is the significance level, $\lambda_m$ is the largest eigenvalue, and df is the number of response variables.

Detecting and determining outliers in multivariate linear models based on the LDL method is done by comparing the value of $\mathrm{LDL}_{A_m}$ with $\mathrm{LDL}_\alpha = \lambda_m\,\chi^2_{\alpha;\,\mathrm{df}}$. If $\mathrm{LDL}_{A_m} > \mathrm{LDL}_\alpha$, the observation concerned is an outlier. After that, the pairs of observations detected as outliers are determined stepwise based on the LDL values; the largest LDL values are chosen as the next pair.
The next step is to take the new $A_{mb}$ without involving the old $A_m$ that has already been detected as an outlier set, and to determine a new multiplier with a confidence interval for $\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})$. Based on (14), the new multiplier with its confidence interval can be written as follows:

$$g\big(\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb} \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) \geq 0.$$

The estimate of the parameter $\mathbf{B}_{LA_{mb}^C}$ is then obtained from a nonlinear program with the KKT conditions based on the new constraints. The nonlinear program of the LDL method with the KKT conditions based on the new constraints is: maximize the objective function
$$L(\mathbf{B}_{LA_{mb}^C}, \boldsymbol{\Sigma}_{Lb}) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_{Lb}|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}_{Lb}^{-1}(\mathbf{Y}_{A_{mb}^C} - \mathbf{X}_{A_{mb}^C}\mathbf{B}_{LA_{mb}^C})^T(\mathbf{Y}_{A_{mb}^C} - \mathbf{X}_{A_{mb}^C}\mathbf{B}_{LA_{mb}^C})\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) = F_{\alpha;\,v_1;\,v_2} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb} \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_{LA_{mb}^C}) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}) - 2\lambda\big(\mathrm{Vec}(\hat{\mathbf{B}}) - \mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big)$;

(iii) $\hat{\boldsymbol{\Sigma}}_{Lb} = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_{Lb})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_{Lb})$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{LA_{mb}^C})\big) = 0$;

(v) $\lambda \geq 0$.

Outlier detection and determination in multivariate linear models with the LDL method for the new $A_{mb}$ are based on the hypothesis test and test statistic discussed in Theorem 2: reject $H_0$ if $\mathrm{LDL}_{A_{mb}} > \mathrm{LDL}_{\alpha b}$. The next step is to take a new $A_{mc}$, determine the new multiplier, and then determine the parameter estimate of $\mathbf{B}_{LA_{mc}^C}$, obtained from the nonlinear program with the KKT conditions. Finally, the hypothesis test and test statistic are applied to test for the existence of outliers based on the new multiplier. These steps are repeated until no more observations are detected as outliers, i.e. $\mathrm{LDL}_{A_m} \leq \mathrm{LDL}_\alpha$. In addition, [10] noted that the process is also terminated when the computational burden becomes too large or the values of $\mathrm{LDL}_{A_m}$ no longer change much, in which case the final iteration is determined. The advantage of the LDL method, which uses a Lagrange multiplier, is that the optimal values obtained are restricted to the specified confidence interval.
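The stepwise search just described can be outlined in code as follows. This is a sketch of the iteration logic only: the `statistic` argument is assumed to return a (value, cutoff) pair, and plugging in the unconstrained `ld_statistic` from the earlier sketch is a stand-in for the constrained LDL statistic of Theorem 2, not the full KKT optimization:

```python
def stepwise_detect(X, Y, statistic):
    """Greedy stepwise search: grow the outlier set A one index at a time,
    keeping the candidate with the largest statistic value, and stop when
    no candidate exceeds its cutoff.  `statistic(X, Y, idx) -> (value, cutoff)`.
    """
    n = len(Y)
    A, remaining = [], set(range(n))
    while len(remaining) > 1:        # keep at least one observation out of A
        scored = [(*statistic(X, Y, A + [i]), i) for i in remaining]
        value, cutoff, best = max(scored)
        if value <= cutoff:          # nothing new flagged: terminate
            break
        A.append(best)
        remaining.discard(best)
    return A

# outliers = stepwise_detect(X, Y, ld_statistic)
```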
Outlier detection and testing based on the LRL method likewise starts by forming the likelihood function for the population to obtain $\hat{\mathbf{B}}_L^*$, $\hat{\boldsymbol{\Sigma}}_L^*$, and $L(\hat{\mathbf{B}}_L^*, \hat{\boldsymbol{\Sigma}}_L^*)$ under assumption (2). Next, constraints based on a confidence interval for $\mathrm{Vec}(\mathbf{B}_L^*)$ are created. Based on [10], when $n$, $n - p$, and $n - q$ are very large, the constraints follow a Chi-square distribution with df equal to the number of response variables. In general, the $(1 - \alpha)100\%$ confidence interval for $\mathrm{Vec}(\mathbf{B}_L^*)$ is

$$\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big) \leq \chi^2_{\alpha;\,\mathrm{df}}. \qquad (15)$$

Then, based on (15), the constraint region obtained is

$$g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big)^T\left[\hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big) \geq 0. \qquad (16)$$

The LRL method starts by determining $\hat{\boldsymbol{\Psi}}_L$ and $E(\mathbf{Y})$. After that, $A_m$ is determined for $m = 1$, i.e. $A_1 = \{i_1\}$ with $i_1 \in \{1, 2, \ldots, n\}$, so that $n$ candidate sets of observations detected as outliers $A_1$ are obtained; then $(\mathbf{Y}_{A_1}, \mathbf{X}_{A_1})$ and $(\mathbf{Y}_{A_1^C}, \mathbf{X}_{A_1^C})$ are determined. After that, a new $A_m$ is taken without involving the old $A_m$ already detected as outliers, and the new $\hat{\boldsymbol{\Psi}}_{Lb}$ is determined. For the new $A_{mb}$ with $m = 2$, $A_2 = \{i_1, i_2\}$ with $i_1, i_2 \in \{1, 2, \ldots, n\}$; the candidate sets of pairs of observations detected as outliers $A_2$ are obtained, and $(\mathbf{Y}_{A_2}, \mathbf{X}_{A_2})$ and $(\mathbf{Y}_{A_2^C}, \mathbf{X}_{A_2^C})$ are determined. This step is repeated until a set of $m = n - 1$ observations detected as outliers is obtained, e.g. $A_{n-1} = \{2, 3, \ldots, n - 2, n - 1, n\}$. The final set is then determined, giving $(\mathbf{Y}_{A_m}, \mathbf{X}_{A_m})$ and $(\mathbf{Y}_{A_m^C}, \mathbf{X}_{A_m^C})$.
The following theorem concerns the parameter estimates $\mathbf{B}_L^*$ and $\boldsymbol{\Sigma}_L^*$ obtained from the Lagrange function.

Theorem 3. If the natural logarithm of the Lagrange function $\ln L(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*)$ is given by

$$\ln L(\mathbf{B}, \boldsymbol{\Sigma}) = -\frac{nq}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\!\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{XB})^T(\mathbf{Y} - \mathbf{XB})\right]$$

and the constraint is given by (16), then the estimators $\hat{\mathbf{B}}_L^*$ and $\hat{\boldsymbol{\Sigma}}_L^*$ that maximize the Lagrange function

$$F(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*, \lambda) = \ln L(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*) + \lambda\,g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big)$$

are

$$\hat{\mathbf{B}}_L^* = \begin{bmatrix}\hat{\mathbf{B}} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m} \\ (\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m}\end{bmatrix} \quad \text{and} \quad \hat{\boldsymbol{\Sigma}}_L^* = \hat{\boldsymbol{\Sigma}}_L - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m},$$

where $\mathbf{B}_L^*$ is the parameter matrix of the LRL model without outliers, $\boldsymbol{\Sigma}_L^*$ is the variance-covariance parameter matrix of the LRL model, $\mathbf{Q}_{LA_m} = \mathbf{X}_{LA_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{LA_m}^T$, $(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1} = \mathbf{I} + \mathbf{Q}_{LA_m}(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}$, $\hat{\boldsymbol{\varepsilon}}_{LA_m} = \mathbf{Y}_{LA_m} - \mathbf{X}_{LA_m}\hat{\mathbf{B}}_L$, and $\lambda$ is the Lagrange multiplier.
Using properties of the Kronecker product, the estimate of the parameter matrix $\mathbf{B}_L^*$ in vec form can be written as

$$\mathrm{Vec}(\hat{\mathbf{B}}_L^*) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y}). \qquad (17)$$

Because

$$\mathrm{Vec}(\hat{\mathbf{B}}) \sim N_{(p+1)q}\!\left(\mathrm{Vec}(\mathbf{B}),\, \boldsymbol{\Sigma} \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right),$$
it follows that

$$\mathrm{Vec}(\hat{\mathbf{B}}_L^*) \sim N\!\left(\mathrm{Vec}(\mathbf{B}_L^*),\, \boldsymbol{\Sigma}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\right). \qquad (18)$$

Because of (18), the estimate of $\mathrm{Var}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*)\big)$ is

$$\widehat{\mathrm{Var}}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*)\big) = \hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}.$$
The optimum value of the parameter estimate $\mathbf{B}_L^*$ is determined by a numerical method that uses a nonlinear program with the KKT conditions. The nonlinear program of the LRL method with the KKT conditions is:

Maximize the objective function

$$L(\mathbf{B}_L^*, \boldsymbol{\Sigma}_L^*) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_L^*|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[(\boldsymbol{\Sigma}_L^*)^{-1}(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_L^*)^T(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_L^*)\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big)^T\big[\hat{\boldsymbol{\Sigma}}_L^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_L^*) - \mathrm{Vec}(\mathbf{B}_L^*)\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_L^*) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y})$;

(iii) $\hat{\boldsymbol{\Sigma}}_L^* = \hat{\boldsymbol{\Sigma}}_L - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m}$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_L^*)\big) = 0$;

(v) $\lambda \geq 0$.
The test statistic, its distribution, and the rejection rule for the hypothesis of outlier detection using the LRL method are given completely by Theorem 4.

Theorem 4. If the following hypotheses are given:

$H_0$: $A_{Lm}$ contains no outliers
$H_1$: $A_{Lm}$ contains outliers,

then:

(a) The test statistic for the hypothesis is

$$\mathrm{LRL}_{A_m} = -c\,\ln\frac{\left|\hat{\boldsymbol{\Sigma}}_L - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_m}^T(\mathbf{I} - \mathbf{Q}_{LA_m})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_m}\right|}{|\hat{\boldsymbol{\Sigma}}_L|},$$

where $\mathbf{Q}_{LA_m} = \mathbf{X}_{LA_m}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}_{LA_m}^T$, $\hat{\boldsymbol{\varepsilon}}_{LA_m} = \mathbf{Y}_{LA_m} - \mathbf{X}_{LA_m}\hat{\mathbf{B}}_L$, and $c = n - p - m - \frac{1}{2}(q - m + 1)$.

(b) The distribution of the statistic $\mathrm{LRL}_{A_m}$ is Chi-square.

(c) The null hypothesis is rejected when

$$\mathrm{LRL}_{A_m} > \chi^2_{\alpha;\,\mathrm{df}},$$

where $\alpha$ is the significance level and df is the number of response variables.
Detecting and determining outliers in multivariate linear models based on the LRL method is done by comparing the value of $\mathrm{LRL}_{A_m}$ with $\chi^2_{\alpha;\,\mathrm{df}}$. If $\mathrm{LRL}_{A_m} > \chi^2_{\alpha;\,\mathrm{df}}$, the observation concerned is an outlier. After that, the pairs of observations detected as outliers are determined stepwise based on the LRL values; the largest LRL values are chosen as the next pair.

The next step is to take the new $A_{mb}^*$ without involving the old $A_m$ that has already been detected as an outlier set, and to determine a new multiplier with a confidence interval for $\mathrm{Vec}(\mathbf{B}_{Lb}^*)$. Based on (16), the new multiplier with its confidence interval can be written as follows:

$$g\big(\mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb}^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) \geq 0.$$

The estimate of the parameter $\mathbf{B}_{LA_{mb}^C}^*$ is then obtained from a nonlinear program with the KKT conditions based on the new constraints. The nonlinear program of the LRL method with the KKT conditions based on the new constraints is: maximize the objective function

$$L(\mathbf{B}_{Lb}^*, \boldsymbol{\Sigma}_{Lb}^*) = (2\pi)^{-nq/2}\,|\boldsymbol{\Sigma}_{Lb}^*|^{-n/2}\exp\left\{-\tfrac{1}{2}\mathrm{tr}\!\left[(\boldsymbol{\Sigma}_{Lb}^*)^{-1}(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_{Lb}^*)^T(\mathbf{Y} - \mathbf{X}^*\mathbf{B}_{Lb}^*)\right]\right\}$$

subject to the constraints:

(i) $g\big(\mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) = \chi^2_{\alpha;\,\mathrm{df}} - \big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big)^T\big[\hat{\boldsymbol{\Sigma}}_{Lb}^* \otimes (\mathbf{X}^T\mathbf{X})^{-1}\big]^{-1}\big(\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) - \mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) \geq 0$;

(ii) $\mathrm{Vec}(\hat{\mathbf{B}}_{Lb}^*) = \big(\mathbf{I}_q \otimes (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\,\mathrm{Vec}(\mathbf{Y})$;

(iii) $\hat{\boldsymbol{\Sigma}}_{Lb}^* = \hat{\boldsymbol{\Sigma}}_{Lb} - \frac{1}{n}\hat{\boldsymbol{\varepsilon}}_{LA_{mb}^*}^T(\mathbf{I} - \mathbf{Q}_{LA_{mb}^*})^{-1}\hat{\boldsymbol{\varepsilon}}_{LA_{mb}^*}$;

(iv) $\lambda\,g\big(\mathrm{Vec}(\mathbf{B}_{Lb}^*)\big) = 0$;

(v) $\lambda \geq 0$.
Outlier detection and determination in multivariate linear models with the LRL method for the new $A_{mb}^*$ are based on the hypothesis test and test statistic discussed in Theorem 4: reject $H_0$ if $\mathrm{LRL}_{A_{mb}} > \chi^2_{\alpha;\,\mathrm{df}}$. The next step is to take a new $A_{mc}$, determine the new multiplier, and then determine the parameter estimate of $\mathbf{B}_{LA_{mc}^C}^*$, obtained from the nonlinear program with the KKT conditions. Finally, the hypothesis test and test statistic are applied to test for the existence of outliers based on the new multiplier. These steps are repeated until no more observations are detected as outliers, i.e. $\mathrm{LRL}_{A_m} \leq \chi^2_{\alpha;\,\mathrm{df}}$. Likewise, [10] noted that the process is also terminated when the computational burden becomes too large or the values of $\mathrm{LRL}_{A_m}$ no longer change much, in which case the final iteration is determined. The advantage of the LRL method, which uses a Lagrange multiplier, is that the optimal values obtained are restricted to the specified confidence interval.
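The same greedy driver sketched above for the LDL loop applies here unchanged; a minimal usage sketch, with the caveat that the unconstrained `lr_statistic` defined earlier is only a stand-in for the constrained LRL statistic of Theorem 4:

```python
# Stepwise LRL-style search, reusing the helpers defined earlier.
outliers = stepwise_detect(X, Y, lr_statistic)
print(sorted(outliers))   # indices of observations flagged as outliers
```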
5. CONCLUSION AND FURTHER RESEARCH
The LDL method uses Lagrange multipliers with a confidence-interval constraint on the parameter vector, and works by deleting $m$ observations from the model. The optimal parameter estimates are determined by a numerical method using the KKT conditions in a nonlinear program. Outlier detection in multivariate linear models with the LDL method is carried out by comparing the calculated LDL values with the Chi-square table values, since the statistic follows a Chi-square distribution: if the LDL value is greater than the Chi-square table value, the observation is considered an outlier. The LRL method is another way to detect and test outliers in multivariate linear models besides the LDL method. It likewise uses Lagrange multipliers with a confidence-interval constraint on the parameter vector, but works by shifting the means in the model; the optimal parameter estimates are again determined numerically using the KKT conditions in a nonlinear program. Outlier detection with the LRL method is carried out by comparing the calculated LRL values with the Chi-square table values: if the LRL value is greater than the Chi-square table value, the observation is considered an outlier. The advantage of the LDL and LRL methods, which use a Lagrange constraint, is that the optimal values obtained are restricted to the specified confidence interval. The LDL and LRL methods are thus outlier detection methods for the multivariate linear model setting and can be applied to real data sets.
REFERENCES
[1] Hawkins, D.M. (1980). Identification of Outliers, Chapman & Hall, London.
[2] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd edition, John
Wiley, Great Britain.
[3] Peña, D. and Prieto, F.J. (2001). Multivariate Outlier Detection and Robust
Covariance Matrix Estimation, American Statistical Association and the
American Society for Quality, Technometrics, Vol. 43, No. 3, pp. 286-310.
[4] Filzmoser, P. (2005). Identification of Multivariate Outliers: A Performance
Study, Austrian Journal of Statistics, Vol. 34, No. 2, pp. 127-138.
[5] Rousseeuw, P.J. (1984). Least Median of Squares Regression, Journal of the
American Statistical Association, Vol. 79, pp. 871-880.
[6] Peña, D. and Guttman, I. (1993). Comparing Probabilistic Methods for Outlier Detection in Linear Models, Biometrika, Vol. 80, No. 3, pp. 603-610.
[7] Cook, R.D. (2000). Detection of Influential Observation in Linear Regression,
Technometrics, Vol. 42, No. 1, pp. 65-68.
[8] Adnan, R., Mohamad, M.N., and Setan, H. (2003). Multiple Outliers Detection
Procedures in Linear Regression, Matematika, Vol. 19, No. 1, pp. 29-45.
[9] Srivastava, M.S. and von Rosen, D. (1998). Outliers in Multivariate
Regression Models, Journal of Multivariate Analysis, Vol. 65, pp. 195-208.
[10] Xu, J., Abraham, B., and Steiner, S.H. (2005). Outlier Detection Methods in Multivariate Regression Models, http://www.bisrg.uwaterloo.ca/archive/RR-06-07.pdf, accessed April 4, 2007.
[11] Diaz-Garcia, J.A., Gonzalez-Farias, G., and Alvarado-Castro, V. (2007). Exact
Distributions for Sensitivity Analysis in Linear Regression, Applied
Mathematical Sciences, Vol. 1, No. 22, pp. 1083-1100.
[12] Makkulau, Linuwih, S., Purhadi, and Mashuri, M. (2010). Pendeteksian Outlier
dan Penentuan Faktor-faktor yang Mempengaruhi Produksi Gula dan Tetes Tebu
dengan Metode Likelihood Displacement Statistic-Lagrange, Jurnal Teknik
Industri: Jurnal Keilmuan dan Aplikasi Teknik Industri, Vol. 12, No.2, pp. 95-
100 (In Indonesian).
[13] Makkulau, Linuwih, S., Purhadi, and Mashuri, M. (2011). Pendeteksian Outlier
pada Pengamatan dalam Model Linear Multivariat dengan Metode Likelihood
Displacement Statistic-Lagrange, Jurnal Ilmu Dasar, Vol. 12, No.1, pp. 62-67
(In Indonesian).
[14] Makkulau, Linuwih, S., Purhadi, and Mashuri, M. (2012). Outlier Detection in Multivariate Linear Models using Likelihood Ratio Statistic for a Mean Shift-Lagrange Method, International Journal of Academic Research, Part A, Natural and Applied Sciences, Vol. 4, No. 6, pp. 5-13.
[15] Christensen, R. (1991). Linear Models for Multivariate, Time Series, and Spatial Data, Springer-Verlag, New York.
[16] Harville, D.A. (2008). Matrix Algebra From a Statistician’s Perspective,
Springer Science+Business Media, LLC, USA.
[17] Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis,
3rd edition, John Wiley, New York.
[18] Rousseeuw, P.J. and Hubert, M. (1997). Recent Developments in PROGRESS, in L1-Statistical Procedures and Related Topics, edited by Y. Dodge, Institute of Mathematical Statistics Lecture Notes and Monograph Series, Hayward, California, Vol. 31, pp. 201-214.
[19] Bazaraa, M.S., Sherali, H.D., and Shetty, C.M. (1993). Nonlinear Programming: Theory and Algorithms, 2nd edition, John Wiley & Sons, New York.
[20] Schubert, E., Zimek, A., and Kriegel, H.P. (2014a). Local Outlier Detection
Reconsidered: a Generalized View on Locality with Applications to Spatial,
Video, and Network Outlier Detection, Data Mining and Knowledge
Discovery, Vol. 28, No.1, pp. 190–237.
[21] Schubert, E., Zimek, A., and Kriegel, H.P. (2014b). Generalized Outlier
Detection with Flexible Kernel Density Estimates, In Proceedings of the 14th
SIAM International Conference on Data Mining (SDM), Philadelphia, PA.