Detecting Collective Anomalies from Multiple Spatio-Temporal Datasets across Different Domains

Yu Zheng
Microsoft Research, Beijing, China
yuzheng@microsoft.com
http://research.microsoft.com/en-us/people/yuzheng/

Released Data & Codes

Existing Anomaly Detection
• Detecting anomalies (outliers) is sometimes more useful than finding regular patterns
• Existing research focuses on detecting anomalies based on a single dataset
  • Some anomalies may go undetected or be detected very late
  • Anomalies may be over-detected when using a sparse dataset (false alerts)

[Figure: regions r1 to r6 observed in three datasets: A) Bike renting, B) Social media, C) Taxi flow. An undetected example.]
[Figure: reports of sickness in a neighborhood over time, <0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …>, where (1 − μ) ≫ 3σ. A false alert.]
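The false-alert case can be illustrated with a short sketch. It assumes a plain three-sigma rule on a single sparse series; the function name `three_sigma_alert` and the 100-interval history are hypothetical and chosen only for illustration.

```python
import statistics

def three_sigma_alert(history, x, k=3.0):
    """Flag x when it deviates from the historical mean by more than k standard deviations."""
    mu = statistics.fmean(history)
    sigma = statistics.pstdev(history)
    return abs(x - mu) > k * sigma

# Hypothetical sparse history: 2 sickness reports in 100 time intervals.
history = [0] * 98 + [1] * 2

# A single report is already far outside 3 sigma, so it raises an alert
# even though one report is not unusual for this neighborhood.
single_report_alert = three_sigma_alert(history, 1)
no_report_alert = three_sigma_alert(history, 0)
print(single_report_alert, no_report_alert)
```

With such sparse data the mean and standard deviation are both close to zero, which is why a lone report satisfies (1 − μ) ≫ 3σ.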

Collective Anomalies
• Detect collective anomalies based on multiple spatio-temporal (ST) datasets
• ST data in different domains, each with its own categories
  • Noise complaints: <construction, loud music, traffic, …>
  • Air quality: <good, moderate, unhealthy, …>
  • Check-ins: <food, entertainment, shopping, arts, …>
  • Traffic conditions: <fast, normal, congestion>
  • Epidemic: <disease 1, disease 2, …, disease n>
  • ……
• Collective anomalies
  • Spatio-temporal collectiveness: a collection of nearby locations that are anomalous during a few consecutive time intervals
  • Data collectiveness: anomalous when checking multiple datasets simultaneously

[Figure: A) Traffic sensing, B) People's complaints, C) Social media; anomalies a1, a2, a3 across time intervals t1 to t4 in 2D geo-space.]

An Example
• Eight regions are collectively anomalous in five consecutive hours in terms of three datasets: taxicab, bike-sharing, and 311 complaints

Benefits
• Detect an underlying problem
• Denote an early stage of an epidemic disease or the beginning of a natural disaster
• Provide a panoramic view of an event

[Figure: A) Raw road network, B) Segmented regions; hourly snapshots from 8am to 1pm; distance threshold δd.]

Challenges
• Data sparsity and uncertainty
  • Difficult to estimate the true distributions based on limited observations
  • Hard to measure the deviation of an instance from its original distribution
  • Distribution? e.g. <0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …> and <1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …>
• Different scales and distributions
  • Difficult to aggregate them into an integrated (anomalous) measurement
  • Aggregation?
• Many combinations of regions and time intervals
  • High computational cost
  • Conflicts with online detection

[Figure: anomalies a1, a2, a3 across time intervals t1 to t4 in 2D geo-space.]

Methodology
• Multiple Sources Latent Topic (MSLT) model
  • Combines multiple datasets to better estimate the underlying distribution of a sparse dataset
  • Leads to more accurate anomaly detection
• Spatio-Temporal Log-Likelihood Ratio Test (ST_LRT)
  • Adapts the likelihood ratio test to a spatio-temporal setting
  • Aggregates the information of multiple datasets across multiple regions to detect anomalies
  • $\Lambda = -2 \log \frac{\text{likelihood for null model}}{\text{likelihood for alternative model}}$
• Candidate generation algorithm
  • Generates candidates using computational geometry
  • Prunes unnecessary combinations based on skylines

[Figure: graphical representation of MSLT and topic-word distributions across datasets D1, D2, D3.]
[Figure: candidate generation with distance threshold δd: A) Retrieve candidate regions for r1, B) Find intersection between two regions, C) Find combination of three regions, D) Output region sets (p1 p2: r1, r5; p2 p3: r1, r5, r6; p3 p4: r1, r6).]
[Figure: skyline over Noise and Taxi axes, each ranging from 0 to 1.0.]

ST_LRT Framework

[Figure: framework overview. Components shown: raw road network and segmented regions; learning distributions with the MSLT model; spatio-temporal entries over time intervals t1 to t4 in 2D geo-space; Circle_Based_Spatial_Check (spatial constraint); LRT; skyline detection over Noise and Taxi axes; detected anomalies a1, a2, a3.]

MSLT Model
• Combine multiple datasets to discover latent functions of a region
  • To better estimate the distribution of a sparse dataset
  • Different datasets in a region can mutually reinforce each other
  • A dataset can reference across different regions
• A topic model-based method:
  • A region → a document
  • Latent functions → latent topics
  • 311, bikes, taxicabs → words (dynamic)
  • POIs and road networks → keywords (static)
• $prop(w_i) = \sum_t \theta_{dt} \varphi_{t w_i}$

[Figure: A) Graphic representation of MSLT, B) Topic-words distribution across different datasets s1, s2, s3 with word sets W1, W2, W3.]
[Figure: A) Settings of MSLT, B) Settings of ST_LRT. Taxi and 311 counts <c1, c2, …, c10> are mapped to words <w1, w2, …, w10> per time interval (current time tc = 4:00). λ = ZIP(c_i | i = 1, 2, …, 10); p_i = prop(w_i) / Σ_i prop(w_i), 1 ≤ i ≤ 10; λ_i = λ × p_i, giving <λ_1, λ_2, …, λ_10>.]
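The word-proportion formula and the rate split λ_i = λ × p_i can be sketched as follows. This is a minimal illustration, not the released code: the function names, the two-topic `theta_d`, and the `phi` matrix are made-up values.

```python
def word_proportions(theta_d, phi):
    """prop(w_i) = sum over topics t of theta_d[t] * phi[t][i]."""
    n_topics = len(theta_d)
    n_words = len(phi[0])
    return [sum(theta_d[t] * phi[t][i] for t in range(n_topics))
            for i in range(n_words)]

def split_rate(lam, props):
    """lambda_i = lambda * p_i, with p_i = prop(w_i) / sum_i prop(w_i)."""
    total = sum(props)
    return [lam * p / total for p in props]

# Hypothetical region d with two latent topics and three words (categories).
theta_d = [0.7, 0.3]                 # topic distribution of the region
phi = [[0.5, 0.3, 0.2],              # topic-word distribution of topic 1
       [0.1, 0.1, 0.8]]              # topic-word distribution of topic 2

props = word_proportions(theta_d, phi)
rates = split_rate(2.0, props)       # 2.0 is an assumed overall rate lambda
print(props, rates)
```

Normalizing by the sum of the proportions keeps the per-category rates adding up to the overall rate.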

MSLT Model
• Learning
  • …, … and … are fixed parameters
  • Learn … and … based on observed … and …
  • Using a stochastic EM algorithm
• Structure
  • … of a region depends on its geographical properties
  • There are multiple topic-word distributions
• $prop(w_i) = \sum_t \theta_{dt} \varphi_{t w_i}$

[Figure: graphical models of Latent Dirichlet Allocation (LDA) and MSLT side by side; A) Graphic representation of MSLT, B) Topic-words distribution across different datasets s1, s2, s3 with word sets W1, W2, W3.]

ST_LRT
• Log-Likelihood Ratio Test (LRT)
• Apply LRT to a single ST dataset
  • in a single region
  • in multiple regions
• Apply LRT to multiple datasets
  • Distribution estimations for different datasets
  • Aggregate anomalous degrees of multiple datasets

ST_LRT
• LRT
  • Tests whether a simplifying assumption for a model is valid
  • The test statistic Λ can be approximated by a chi-square distribution

An example for a single region and a single dataset (region r, 12:00-14:00, null model Gaussian(200, 1300), observation x_t = 70)
1) …
2) The maximum likelihood for the alternative model (mean scaled to 70): p = 70 / 200 = 0.35; 200 × 0.35 = 70; 1300 × 0.35 = 455
3) χ²_cdf(Λ, 1) = 0.999

[Figure: A) region r with Gaussian(200, 1300) and x_t = 70 at 12:00-14:00; B) region r with Poisson(8), x1 = 14; Poisson(10), x2 = 14; Poisson(6), x3 = 8 at 12:00-14:00, 14:00-16:00, 16:00-18:00; C) PDF of χ² with the threshold χ²_cdf(Λ, 1) > 0.95 at Λ = 3.84.]
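The single-region example can be reproduced with a short sketch. It assumes that Gaussian(200, 1300) means mean 200 and variance 1300, and that the alternative model scales both by p = 0.35 as the slide's arithmetic suggests; the helper names are hypothetical.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of a Gaussian with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def chi2_cdf_df1(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2))

mean0, var0, x_t = 200.0, 1300.0, 70.0      # null model and observation
p = x_t / mean0                             # 0.35
mean1, var1 = mean0 * p, var0 * p           # alternative model: 70 and 455

# Lambda = -2 log(likelihood for null model / likelihood for alternative model)
lrt_stat = -2 * (gaussian_loglik(x_t, mean0, var0)
                 - gaussian_loglik(x_t, mean1, var1))
od = chi2_cdf_df1(lrt_stat)
print(round(lrt_stat, 2), round(od, 4))
```

Under these assumptions the anomalous degree exceeds 0.999, in line with the value on the slide and well above the 0.95 threshold (Λ = 3.84).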

ST_LRT
• Apply LRT to multiple regions (or time slots)
  • Case 1: a dataset varies consistently in different regions (or time slots)
    Example for region r over three slots: Poisson(8), x1 = 14 (12:00-14:00); Poisson(10), x2 = 14 (14:00-16:00); Poisson(6), x3 = 8 (16:00-18:00)
    1) …
    2) Calculate the scaling factor that maximizes the likelihood of the alternative model: 8 × 1.5 = 12, 10 × 1.5 = 15, 6 × 1.5 = 9
    3) Λ = 5.19; od = χ²_cdf(5.19, fd = 1) = 0.978
  • Case 2: a dataset changes differently in different regions (or slots)
    od(s) = √(Σ_i …)

[Figure: regions r1 to r6 observed in three datasets: A) Bike renting, B) Social media, C) Taxi flow.]
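The three-slot Poisson example can be checked numerically. The sketch below assumes the alternative model multiplies every null rate by one shared factor (1.5 here, the ratio of total observed counts to total expected counts); names such as `poisson_loglik` are illustrative.

```python
import math

def poisson_loglik(x, lam):
    """Log-probability of observing count x under Poisson(lam)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def chi2_cdf_df1(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2))

null_rates = [8.0, 10.0, 6.0]     # Poisson(8), Poisson(10), Poisson(6)
counts = [14, 14, 8]              # x1, x2, x3

# Shared factor that maximizes the alternative likelihood: 36 / 24 = 1.5
scale = sum(counts) / sum(null_rates)
alt_rates = [lam * scale for lam in null_rates]     # 12, 15, 9

null_ll = sum(poisson_loglik(x, lam) for x, lam in zip(counts, null_rates))
alt_ll = sum(poisson_loglik(x, lam) for x, lam in zip(counts, alt_rates))
lrt_stat = -2 * (null_ll - alt_ll)
od = chi2_cdf_df1(lrt_stat)
print(round(lrt_stat, 2), round(od, 3))
```

Because a single factor is fitted, the statistic is compared against a chi-square distribution with one degree of freedom, which gives a value close to the 0.978 shown on the slide.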

ST_LRT
• Deal with multiple datasets
  • Dealing with a sparse dataset, e.g. <0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …>
  • The zero-inflated Poisson (ZIP) model
    $P(X = 0) = p + (1 - p) e^{-\lambda}$
    $P(X = h) = (1 - p) \frac{e^{-\lambda} \lambda^{h}}{h!}$, for $h \ge 1$
  • Using the latent topic-word distribution
    $prop(w_i) = \sum_t \theta_{dt} \varphi_{t w_i}$
    $\lambda_i = \lambda \times prop(w_i)$

[Figure: a sparse dataset <0, 0, 0, 0, 0, 0, c1, 0, 0, 0, 0, 0, c2, 0, 0, …> is fitted with ZIP to obtain λ; its categories 1: <0, 0, 0, 0, 0, 0, c1, 0, …> and 2: <0, …, 0, c2, 0, 0, …> receive λ1 and λ2 through λ_i = λ × prop(w_i) before the LRT. A) Graphic representation of MSLT, B) Topic-words distribution across different datasets.]
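A minimal sketch of the ZIP probability mass function follows. The X = 0 branch uses the standard zero-inflated Poisson definition; the parameter values in the example are assumptions for illustration only.

```python
import math

def zip_pmf(h, p, lam):
    """Zero-inflated Poisson: extra mass p at zero, Poisson(lam) otherwise."""
    poisson = math.exp(-lam) * lam ** h / math.factorial(h)
    if h == 0:
        return p + (1 - p) * poisson
    return (1 - p) * poisson

# Hypothetical sparse dataset: 80% structural zeros, rate 0.5 otherwise.
p, lam = 0.8, 0.5
prob_zero = zip_pmf(0, p, lam)
prob_one = zip_pmf(1, p, lam)
total_mass = sum(zip_pmf(h, p, lam) for h in range(50))
print(round(prob_zero, 4), round(prob_one, 4), round(total_mass, 6))
```

The extra mass at zero lets the model absorb long runs of empty intervals without forcing the Poisson rate toward zero.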

ST_LRT
• Estimate distributions for different datasets

[Flowchart: for a dataset s, first test "Sparse?". If yes, estimate λ with ZIP, set λ_i = λ × prop(w_i), and apply the LRT. If no, test "variance(s) ≫ mean(s)?", which has a Y branch and an N branch; the branch labels did not survive extraction.]

ST_LRT
• Aggregate anomalous degrees of multiple datasets
  • Candidate sets of spatio-temporal entries, e.g. {<r1, t1>, <r1, t2>, <r2, t1>, <r4, t2>}, and region sets such as {r6, r7} come from the Circle-Based Spatial Check
  • Each candidate is associated with a vector of od values < …, … >, one per dataset
  • The skyline of the od vectors is reported
• Pruning
  • If the upper bound of the od values for a set of entries is dominated by existing skyline combinations, all the combinations of its subsets will be dominated by the skyline too.

[Figure: candidate generation with distance threshold δd: A) Retrieve candidate regions for r1, B) Find intersection between two regions, C) Find combination of three regions, D) Output region sets (p1 p2: r1, r5; p2 p3: r1, r5, r6; p3 p4: r1, r6).]
[Figure: skyline ods over Noise and Taxi axes, each ranging from 0 to 1.0.]
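Skyline detection over od vectors can be sketched as a Pareto-dominance filter. This is a simple quadratic-time illustration with made-up od vectors; it does not implement the upper-bound pruning described on the slide.

```python
def dominates(a, b):
    """a dominates b if a is at least as large everywhere and larger somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def skyline(vectors):
    """Keep the vectors that no other vector dominates."""
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors)]

# Hypothetical od vectors, one value per dataset (e.g. taxi, bike, 311).
candidates = [
    (0.9, 0.2, 0.3),
    (0.5, 0.8, 0.4),
    (0.4, 0.7, 0.4),    # dominated by (0.5, 0.8, 0.4)
    (0.2, 0.1, 0.95),
]
result = skyline(candidates)
print(result)
```

A skyline avoids collapsing the per-dataset od values into one weighted score, so a candidate that is extreme in only one dataset can still be reported.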

Evaluation
• Datasets

Data source | Period | Property | Value
Taxicab data | 1/1/2014-1/1/2015 | number of taxicabs | 14,144
 | | number of trips | 165M
 | | total duration (hour) | 36.5M
 | | total distances (km) | 5,671M
Bike data | 1/1/2014-1/1/2015 | number of stations | 344
 | | number of bikes | 6,811
 | | number of trips | 8,081,216
 | | total duration (hour) | 1.9M
311 complaints | 5/26/2013-12/13/2014 | number of categories | 10
 | | number of instances | 197,922
Road network | 2013 | number of nodes | 79,315
 | | number of road segments (level5) | 32,210
 | | number of road segments (level>5) | 83,655
 | | number of regions | 862
POIs | 2013 | number of categories | 14
 | | number of instances | 24,031

311 complaint categories: Construction; Commercial – Music/Party/Talking; Park – Music/Party/Talking; House – Music/Party/Talking/TV; Street – Music/Party/Talking; Dog; Air Condition/ventilation; Traffic; Manufacturing; Others

Data release: http://research.microsoft.com/pubs/255670/release_data.zip

Evaluation
• Evaluation on MSLT
  • Estimating the distribution for 311 data (sparse)
  • KL-divergence between estimations and ground truth
  • Down-sampling ground truth

[Figure: two plots of KL-divergence (y-axis) against the down-sampling rate 1/X (x-axis, 0 to 100), comparing MSLT and Count; a distribution of 311 categories c1 to c5 in regions r1 and r2.]
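The evaluation metric can be sketched as below. The direction KL(ground truth ‖ estimate), the smoothing constant, and the example counts are all assumptions for illustration; the slides do not specify them.

```python
import math

def normalize(counts, eps=1e-9):
    """Turn category counts into a distribution, smoothing zero counts slightly."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same categories."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 311 category counts in one region.
truth = normalize([40, 30, 20, 10])
# A count-based estimate after heavy down-sampling loses the rare categories.
down_sampled = normalize([2, 1, 0, 0])

kl_same = kl_divergence(truth, truth)
kl_sparse = kl_divergence(truth, down_sampled)
print(kl_same, kl_sparse)
```

A lower divergence after down-sampling indicates that the estimated distribution stays closer to the one obtained from the full data.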

Events were reported by nycinsiderguide.com, Nov. 1, 2014 to Nov. 30, 2014

ID | Event Name | Address | Start Time | End Time
1 | Bowlloween 2014 New York Halloween | 624-660 W 42nd St | 10/31/2014 9PM | 11/1/2014 2AM
2 | Largest Halloween Singles Party in NYC | 247 West 37th Street | 10/31/2014 7AM | 11/1/2014 3AM
3 | Kokun Cashmere Sample and Stock Sale | 237 W 37th Street | 11/5/2014 10:30AM | 11/7/2014 5:45PM
4 | Big Apple Film Festival | 54 Varick St | 11/5/2014 6PM | 11/9/2014 11PM
5 | InterHarmony Concert Series: The Soul of élégiaque | 881 7th Avenue | 11/6/2014 8PM | 11/6/2014 10PM
6 | Hiras Master Tailors New York Trunk Show | 301 Park Avenue | 11/6/2014 9AM | 11/9/2014 1PM
7 | in Collaboration with Carnegie Halls Neighborhood Concerts | 881 Seventh Avenue | 11/7/2014 6PM | 11/7/2014 10PM
8 | Thomas/Ortiz Dance Show | 248 West 60th Street | 11/7/2014 7PM | 11/8/2014 9PM
9 | Rebecca Taylor Sample Sale | 260 5th Ave | 11/11/2014 10AM | 11/15/2014 8PM
10 | The News NYC Sample Sale | 495 Broadway | 11/13/2014 9AM | 11/15/2014 6AM
11 | Giorgio Armani Sample Sale | 317 W 33rd St | 11/15/2014 9:30AM | 11/19/2014 6:30PM
12 | Get Buzzed 4 Good Charity Event NYC | 200 5th Ave | 11/15/2014 1PM | 11/15/2014 4PM
13 | Ment’or Young Chef Competition | 462 Broadway | 11/15/2014 2PM | 11/15/2014 6PM
14 | Gotham Comedy Club | 208 West 23rd Street | 11/17/2014 6PM | 11/17/2014 9PM
15 | Kal Rieman NYC Sample Sale | 265 West 37th Street | 11/18/2014 11AM | 11/20/2014 8PM
16 | Inhabit Cashmere Sample Sale | 250 West 39th St | 11/18/2014 10AM | 11/20/2014 6PM
17 | Shoshanna NYC Sample Sale | 231 W. 39th St | 11/19/2014 10AM | 11/20/2014 6:30PM
18 | ICB / J. Press NYC Sample Sale | 530 Seventh Avenue | 11/19/2014 12AM | 11/21/2014 12AM
19 | Thanksgiving in New York City 2014 | 1675 Broadway | 11/27/2014 6AM | 11/27/2014 10PM
20 | Thanksgiving Day Dinner at Croton Reservoir Tavern | 108 West 40th St | 11/27/2014 12PM | 11/27/2014 9PM

Baselines
• DB: distance-based methods
• Properties: Taxi Inflow, Taxi Outflow, Bike Inflow, Bike Outflow
• Single dataset
  • DB-S-Taxi-S: one property
  • DB-S-Bike-S: one property
  • DB-S-Taxi-B: both properties
  • DB-S-Bike-B: both properties
• Multi-datasets
  • DB-M-One: one of the properties satisfies the 3-time deviation
  • DB-M-ALL: all the properties need to satisfy the 3-time deviation

Results

Methods | Detected Anomalies/day | Hit Event IDs
DB-S-Taxi-S | 336.3 | 1, 9, 19, 20
DB-S-Bike-B | 25.7 | 9, 19, 20
DB-S-Taxi-S | 18.1 | 4, 19
DB-S-Bike-B | 1.83 | None
DB-M-One | 353.2 | 1, 4, 9, 19, 20
DB-M-ALL | 0.12 | None
ST_LRT | 28.5 | 1, 3, 9, 10, 11, 13, 15, 16, 20
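The two multi-dataset baselines differ only in how per-property alerts are combined. The sketch below assumes "3-time deviation" means more than three standard deviations from the historical mean; the function names and the sample values are hypothetical.

```python
import statistics

def deviates(history, x, k=3.0):
    """True if x is more than k standard deviations from the historical mean."""
    mu = statistics.fmean(history)
    sigma = statistics.pstdev(history)
    return abs(x - mu) > k * sigma

def db_m_one(histories, current):
    """DB-M-One style rule: alert if any property deviates."""
    return any(deviates(histories[name], current[name]) for name in histories)

def db_m_all(histories, current):
    """DB-M-ALL style rule: alert only if every property deviates."""
    return all(deviates(histories[name], current[name]) for name in histories)

# Hypothetical history and current observation for one region and time slot.
histories = {
    "taxi_inflow": [100, 110, 90, 105, 95],
    "bike_inflow": [20, 22, 18, 21, 19],
}
current = {"taxi_inflow": 160, "bike_inflow": 21}

one_alert = db_m_one(histories, current)
all_alert = db_m_all(histories, current)
print(one_alert, all_alert)
```

The "any" rule fires whenever a single property is unusual, while the "all" rule demands that every property is unusual at once, which matches the very different alert volumes of the two baselines in the results.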

[Figure: A) The News NYC Sample Sale; panels B) to E) and F) to I) each show taxi inflow, taxi outflow, bike inflow, and bike outflow (panel suffixes lost; time slots 18-20 and 20-22).]

od = <0.571, 0.912, 0.256>

Data source | Property | Values | od(s)
Taxicab data | In flow | 0.274, 0.593, 0.822, 0.932 | 0.571
 | Out flow | 0.383, 0.282, 0.612, 0.202 |
 | Total | 0.404, 0.700 |
Bike data | In flow | 0.796, 0.901, 0.932, 0.901 | 0.912
 | Out flow | 0.872, 0.953, 0.983, 0.987 |
 | Total | 0.882, 0.940 |
311 data | Complaints | \, \, \, \ | 0.256

• Beyond distance-based methods
• Beyond a single dataset
• Beyond a single region

Conclusion
• Detect collective anomalies based on multiple datasets
• Methodology
  • MSLT
  • ST_LRT
  • Candidate generation and pruning
• Evaluated based on five datasets in NYC
• Detect all anomalies in NYC in 3 minutes

Homepage; Released Data & Codes

Thanks!

Yu Zheng
yuzheng@microsoft.com

Collective Anomalies
• Formal definition
  • Given
    • regions {…, …}
    • multiple datasets {…, …} during the recent time intervals, and
    • those over a period of historical time
  • Formulate a spatio-temporal set of entries; each entry is associated with a vector denoting the number of instances in each category of each dataset in a region at a time interval
  • Detect collective anomalies, each of which is a collection of spatio-temporal entries from that set, subject to constraints (…, …; …, …; a check that must return true)

[Figure: datasets s1: <c1, c2> and s2: <c'1, c'2> observed over regions r1 to r6 and time intervals t1 to t4 in 2D geo-space, with anomalies a1, a2, a3.]