A Robust Outlier Detection Scheme for Large Data Sets Jian Tang Zhixiang Chen Ada Wai-chee Fu David Cheung Presented By David Lopez


• Outlier:
– An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

D. Hawkins


• Recent Detection Schemes
– Distance Based

• DB(n, q): if an object’s q-neighborhood contains fewer than n objects, then it is called an outlier with respect to n and q.

• (t, k) nearest neighbor: ranks objects by the distance to their kth nearest neighbor and reports the t objects with the largest such distances as outliers.
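The two distance-based definitions above can be sketched in a few lines. This is a brute-force illustration only, assuming Euclidean distance and points given as tuples; the function names `db_outliers` and `top_t_knn_outliers` are made up for this example and do not come from the paper.

```python
import math

def db_outliers(points, n, q):
    """Indices of DB(n, q) outliers: fewer than n other points within distance q."""
    result = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, b in enumerate(points) if j != i and math.dist(p, b) <= q
        )
        if neighbors < n:
            result.append(i)
    return result

def k_distance(points, i, k):
    """Distance from points[i] to its kth nearest neighbor (brute force)."""
    ds = sorted(math.dist(points[i], b) for j, b in enumerate(points) if j != i)
    return ds[k - 1]

def top_t_knn_outliers(points, t, k):
    """Indices of the t points with the largest kth-nearest-neighbor distance."""
    ranked = sorted(
        range(len(points)), key=lambda i: k_distance(points, i, k), reverse=True
    )
    return ranked[:t]

# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(points, n=2, q=2.0))        # [4]
print(top_t_knn_outliers(points, t=1, k=2))   # [4]
```

Both schemes flag the isolated point here; they differ in what the user must supply (a radius q and count n versus a rank cutoff t and neighbor index k).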


• Recent Detection Schemes (cont.)
– Density Based

• Let p, o be members of D and let k be a positive integer
• k-distance(o): the distance from o to its kth nearest neighbor
• reachability distance of p with respect to k:

reach-dist_k(p, o) = max {k-distance(o), dist(p, o)}


• Recent Detection Schemes (cont.)
– Density Based (cont.)

• The local reachability density of p for k, lrd_k(p), is the inverse of the average reachability distance from p to the objects in its k-distance neighborhood.

• Let N_k(p) stand for N_{k-distance(p)}(p)
• lrd_k(p) is defined as:

$$\mathrm{lrd}_k(p) = \left( \frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|} \right)^{-1}$$

• The local outlier factor of p, LOF_k(p), is the average ratio of the local reachability densities of p’s k-distance neighbors to that of p

• LOF_k(p) is defined as:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$
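As a rough illustration of the density-based definitions above, the following brute-force sketch computes LOF for a small point set. It assumes Euclidean distance and distinct points (duplicates would give a zero reachability distance and a division by zero); the helper names are hypothetical and the quadratic-time loops are for clarity, not for large data sets.

```python
import math

def k_distance_neighborhood(points, i, k):
    """Return (k-distance of points[i], indices within that distance)."""
    ds = sorted((math.dist(points[i], b), j) for j, b in enumerate(points) if j != i)
    kd = ds[k - 1][0]
    # Ties at the k-distance are included, so the neighborhood may exceed k points.
    return kd, [j for d, j in ds if d <= kd]

def lof(points, k):
    """LOF_k for every point, following the verbal definitions in the slides."""
    n = len(points)
    info = [k_distance_neighborhood(points, i, k) for i in range(n)]
    kdist = [kd for kd, _ in info]
    nbrs = [nb for _, nb in info]

    def reach_dist(p, o):
        # reach-dist_k(p, o) = max{k-distance(o), dist(p, o)}
        return max(kdist[o], math.dist(points[p], points[o]))

    # lrd_k(p): inverse of the average reachability distance to p's neighbors.
    lrd = [len(nbrs[p]) / sum(reach_dist(p, o) for o in nbrs[p]) for p in range(n)]
    # LOF_k(p): average of lrd_k(o) / lrd_k(p) over p's neighbors.
    return [sum(lrd[o] for o in nbrs[p]) / (lrd[p] * len(nbrs[p])) for p in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof(points, k=2)
print(scores)
```

In this toy data set the four square corners get a score of 1 (their density matches that of their neighbors), while the isolated point gets a score well above 1.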


• Recent Detection Schemes (cont.)
– Advantages of Distance Based

– Disadvantages of Distance Based

– Advantages of Density Based

– Disadvantages of Density Based

– Where does this leave us?


• A Unified Model for Outliers
– First some terms

• Let D = {I_1, …, I_N} be a data set in a multi-dimensional space S

• N_v(p) = {b : dist(p, b) <= v and b != p}; this is known as the v-neighborhood of p

– Some functions
• d( ) : D → R^+

• m( ) : D → R^+

• F( ) : R^+ x R^+ → R^+_0

• F(m(p), |N_{d(p)}(p)|) for every p in D is called an outlier measure on D
• d( ) and m( ) are known as the characteristic functions

We can now construct the new functions
– DB(n, q)

• d(p) = q and m(p) = n for all p in D
• F(x, y) = 1 if x > y and 0 otherwise
• The outlier measure function for DB(n, q) is F(n, |N_q(p)|), shortened as F_1(n, q, p)

• F_1(n, q, p) = 1 if n > |N_q(p)|, 0 otherwise
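One way to see how the unified model works is to code the generic measure F(m(p), |N_{d(p)}(p)|) once and plug in the characteristic functions for DB(n, q). The sketch below is an illustration under assumptions: the names `outlier_measure` and `db_scheme` are invented here, and one-dimensional points with absolute difference as the distance are used only to keep the example short.

```python
def outlier_measure(points, dist, d, m, F):
    """Evaluate F(m(p), |N_{d(p)}(p)|) for every p in points."""
    scores = []
    for i, p in enumerate(points):
        radius = d(p)
        # |N_v(p)|: number of other points within distance v = d(p) of p.
        size = sum(
            1 for j, b in enumerate(points) if j != i and dist(p, b) <= radius
        )
        scores.append(F(m(p), size))
    return scores

def db_scheme(n, q):
    """Characteristic functions and F for DB(n, q): constant d and m."""
    return {
        "d": lambda p: q,
        "m": lambda p: n,
        "F": lambda x, y: 1 if x > y else 0,
    }

points = [0.0, 0.1, 0.2, 5.0]
dist = lambda a, b: abs(a - b)
scores = outlier_measure(points, dist, **db_scheme(n=2, q=0.5))
print(scores)  # [0, 0, 0, 1]
```

Because d and m are constants here, every object is judged by the same radius and threshold, which is the limitation the later slides discuss.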


– (t, k) nearest neighbor is just a special case of DB(n, q) where
• q = (k-distance_t + k-distance_{t+1}) / 2

• Outlier function: F(k, |N_{(k-distance_t + k-distance_{t+1}) / 2}(p)|), written F_2(t, k, p)

• F_2(t, k, p) = 1 if k > |N_{(k-distance_t + k-distance_{t+1}) / 2}(p)|, 0 otherwise

– density based scheme
• d(p) = k-distance(p)

• F(x, y) = x / y^2

• this is the same as LOF_k(p)

• F_3(k, p) = LOF_k(p)
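The slide gives d(p) and F for the density-based scheme but does not state m(p). The short derivation below shows one choice of m(p) that makes F(m(p), |N_k(p)|) equal to LOF_k(p). It assumes the standard LOF definitions (lrd as the inverse of the average reachability distance, LOF as the average lrd ratio); the resulting m(p) is inferred here, not taken from the slides.

```latex
\begin{align*}
\mathrm{lrd}_k(p) &= \frac{|N_k(p)|}{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)} \\
\mathrm{LOF}_k(p) &= \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}
  = \frac{\Bigl(\sum_{o \in N_k(p)} \mathrm{lrd}_k(o)\Bigr)
          \Bigl(\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)\Bigr)}{|N_k(p)|^2} \\
\text{so with}\quad m(p) &= \Bigl(\sum_{o \in N_k(p)} \mathrm{lrd}_k(o)\Bigr)
  \Bigl(\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)\Bigr),
  \qquad F(x, y) = \frac{x}{y^2}, \\
F\bigl(m(p), |N_k(p)|\bigr) &= \mathrm{LOF}_k(p).
\end{align*}
```

Unlike DB(n, q), both d(p) and m(p) here vary from object to object.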


• Thoughts on the previous
– For the DB(n, q) outlier model the characteristic functions do not change as objects change
– To detect outliers whose neighborhoods possess different kinds of structures, we should use characteristic functions with different values for different structures.

• Enhancing the expressive power of a formulation scheme
– Formulation schemes have a tough time describing the outliers in terms of a user’s intuition
• User’s view of an outlier
• Outlier measure function’s view of an outlier

– Question to answer:
Under the constraint that the multiple patterns of a user’s interest for any data set are not available, can we enhance the expressive power of these schemes?


• More useful notations
– For any C subset of D and p member of D

• dist_max(C) = max { dist(x, y) : x and y are members of C }

• dist_min(C) = min { dist(x, y) : x and y are members of C and x != y }

• dist(p, C) = min { dist(p, x) : x member of C }

• Any outlier measure function is denoted by O(r, d, p), where 0 <= d <= dist_max(D), p is a member of D, and r is a member of Dom_O(D), the domain for the variable r of the function O
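The three set distances above translate directly into code. This is a small sketch assuming Euclidean distance on tuples; the function names mirror the notation but are otherwise arbitrary.

```python
import math
from itertools import combinations

def dist_max(C):
    """Largest pairwise distance in C (0.0 for fewer than two points, since x = y is allowed)."""
    return max((math.dist(x, y) for x, y in combinations(C, 2)), default=0.0)

def dist_min(C):
    """Smallest distance between two different members of C (needs at least two points)."""
    return min(math.dist(x, y) for x, y in combinations(C, 2))

def dist_to_set(p, C):
    """dist(p, C): distance from p to its closest member of C."""
    return min(math.dist(p, x) for x in C)

C = [(0, 0), (3, 4), (0, 1)]
print(dist_max(C))             # 5.0
print(dist_min(C))             # 1.0
print(dist_to_set((6, 8), C))  # 5.0
```

`dist_min` raises `ValueError` on a set with fewer than two points, which matches the definition being empty in that case.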


• Construct the new functions
– For DB(n, q):

O(n, q, p) = F_1(n, q, p) where n member of Dom_O(D) = {0, 1, …, |D| + 1}

– For (t, k) nearest neighbor:
O(t, k, p) = F_2(t, k, p) where t member of Dom_O(D) = {1, 2, …, |D|}

– For density based scheme:
O(r, k, p) = F_3(k, p) where the r variable is not needed


• Some definitions
– Definition 1

• Let D be a data set
• An interpretation of D is a partition D = D_o U D_n, where D_o and D_n denote the outlier set and non-outlier set, respectively

– Definition 2
• Let O(r, q, p) be an outlier measure function and I be an interpretation D = D_o U D_n
1. O(r, q, p) is O-compatible with I if there exists a u > 0 and a sequence (r_1, q_1), (r_2, q_2), …, (r_i, q_i) with i >= 1 and q_1 < … < q_i such that
(1.1) for every p in D_o, O(r_j, q_j, p) >= u for all j in {1, …, i}
(1.2) for every p in D_n, O(r_j, q_j, p) < u for some j in {1, …, i}

2. O(r, q, p) is N-compatible with I if there exists a u > 0 and a sequence (r_1, q_1), (r_2, q_2), …, (r_i, q_i) with i >= 1 and q_1 < … < q_i such that
(2.1) for every p in D_o, O(r_j, q_j, p) >= u for some j in {1, …, i}
(2.2) for every p in D_n, O(r_j, q_j, p) < u for all j in {1, …, i}


• For O-compatibility, the entire sequence must consent for the object to be an outlier, but one member is enough for it to be a non-outlier.

• For N-compatibility, it’s just the other way around.
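The difference between the two rules can be sketched as an "all" versus "any" vote over the sequence of (r, q) pairs. The code below is an illustration under assumptions: a pair is taken to vote "outlier" when O(r, q, p) >= u (the comparison used in the worked proof later in the slides), the DB-style measure is defined on distinct 1-D points, and the names `induced_outliers` and `make_db_measure` are hypothetical.

```python
def make_db_measure(points):
    """DB-style measure on distinct 1-D points: O(r, q, p) = 1 if r > |N_q(p)| else 0."""
    def O(r, q, p):
        size = sum(1 for b in points if b != p and abs(p - b) <= q)
        return 1 if r > size else 0
    return O

def induced_outliers(points, O, sequence, u, mode):
    """Objects declared outliers by a sequence of (r, q) pairs.

    mode "O": every pair must vote outlier (O-compatibility style).
    mode "N": a single pair voting outlier is enough (N-compatibility style).
    """
    combine = all if mode == "O" else any
    return [p for p in points if combine(O(r, q, p) >= u for r, q in sequence)]

points = [0.0, 0.1, 0.2, 5.0]
O = make_db_measure(points)
sequence = [(1, 0.5), (4, 6.0)]  # q values increase along the sequence

print(induced_outliers(points, O, sequence, u=1, mode="O"))  # [5.0]
print(induced_outliers(points, O, sequence, u=1, mode="N"))  # [0.0, 0.1, 0.2, 5.0]
```

With the same sequence, the "all" rule keeps only the isolated point, while the "any" rule lets every object through because the second pair is too permissive.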

• Thoughts
– Objective: trying to produce an outlier function that fits the user’s intuition.
– An O-compatibility scheme may filter out many objects
– An N-compatibility scheme may allow unworthies to pass through
– So, pick a scheme based upon the user’s requirements


• A concrete example:
– Consider the data set D = C1 U C2 U {o}

• Assume |C1| = 400, |C2| = 403

• Assume dist_min(C2) > dist(o, x3)

• Assume dist_max(C1) = dist(x1, x3) <= dist(o, x1) < dist(o, x2)

• Assume dist(o, C2) = dist(o, x2) = dist_max(C2)


• Assertion: Let D be the data set shown above in Figure 1(a), and let I be the interpretation with D_o = {o} and D_n = C1 U C2 (the partition the proof checks). Then the DB(n, q) outlier scheme is O-compatible but not N-compatible with I.
• Proof:

• Recall that the outlier measure function O for the DB(r, q) scheme is

O(r, q, p) = F_1(r, q, p) = 1 if r > |N_q(p)|, 0 otherwise


• We choose u = 1.
• Let:
• q1 = dist(o, C1) = dist(o, x1)
• r1 = 2
• q2 = dist(o, C2) = dist(o, x2)
• r2 = 402
• Use the properties given in the example to verify that u and the sequence (r1, q1), (r2, q2) satisfy the condition of Definition 2(1) for the outlier measure function O(r, q, p).

• Since q1 < dist(o, C2), x1 and o are on the diagonal line, and x1 is the top right corner point of the square that covers C1, we have |N_q1(o)| = |{x1}| = 1 < r1, hence O(r1, q1, o) = 1 >= u.

• Since |C1| = 400, o and x2 are on the diagonal line, x2 is the bottom left corner point of the circle that covers C2, and dist_max(C1) < q2 = dist(o, x2), we have N_q2(o) = C1 U {x2}, which implies |N_q2(o)| = 401 < r2, hence O(r2, q2, o) = 1 >= u.

• For any p member of C1, since dist_max(C1) = dist(x1, x3) <= q1, N_q1(p) has all points in C1 – {p}, but may or may not have the point o, i.e., |N_q1(p)| >= |C1| - 1 = 399 >= r1; thus, O(r1, q1, p) = 0 < u for any p member of C1. Since dist_max(C2) = q2, for any p member of C2, N_q2(p) contains all points in C2 – {p}, but may or may not have the point o, i.e., |N_q2(p)| >= 402 >= r2. Thus, O(r2, q2, p) = 0 < u for all p member of C2.

• It follows that u and the sequence (r1, q1), (r2, q2) satisfy the O-compatibility conditions (1.1) and (1.2). Therefore, O(r, q, p) is O-compatible with I.


• References:

1. Jian Tang, Zhixiang Chen, Ada Wai-chee Fu, David Cheung, “A Robust Outlier Detection Scheme for Large Data Sets”
