Beyond Streams and Graphs: Dynamic Tensor Analysis Jimeng Sun Christos Faloutsos Dacheng Tao Speaker: Jimeng Sun

Page 1

Beyond Streams and Graphs: Dynamic Tensor Analysis

Jimeng Sun, Christos Faloutsos, Dacheng Tao

Speaker: Jimeng Sun

Page 2

Motivation

Goal: incremental pattern discovery on streaming applications

Streams:
  E1: Environmental sensor networks
  E2: Cluster/data center monitoring

Graphs:
  E3: Social network analysis

Tensors:
  E4: Network forensics
  E5: Financial auditing
  E6: fMRI: Brain image analysis

How to summarize streaming data effectively and incrementally?

Page 3

E3: Social network analysis

Traditionally, people focus on static networks and find community structures.

We plan to monitor the change of the community structure over time and identify abnormal individuals.

[Figure: Authors × Keywords matrices from 1990 to 2004, with DB and DM groups]

Page 4

E4: Network forensics

Directional network flows

A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]

450 GB/hour with compression

Task: Identify abnormal traffic patterns and find the cause

[Figure: source × destination traffic matrices; panels: normal traffic, abnormal traffic]

Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie

Page 5

Static Data model

For a timestamp, the stream measurements can be modeled using a tensor

Dimension: a single stream

E.g., <Christos, “graph”>

Mode: a group of dimensions of the same kind

E.g., Source, Destination, Port

[Figure: Source × Destination matrix at Time = 0]

Page 6

Static Data model (cont.)

Tensor

Formally, an Mth order tensor is $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_M}$

Generalization of matrices

Represented as multi-array, data cube.

Order | 1st | 2nd | 3rd
Correspondence | Vector | Matrix | 3D array
Example | Sensors | Authors × Keywords | Sources × Destinations × Ports

Page 7

Dynamic Data model (our focus)

Streams come with structure:
  (time, source, destination, port)
  (time, author, keyword)

[Figure: Source × Destination matrices stacked along a time axis]
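As a minimal sketch of this data model, the snippet below groups structured stream records into one count tensor per timestamp. The function name `tensors_by_timestamp`, the integer encoding of sources, destinations, and ports, and the sample records are all hypothetical choices made for illustration.

```python
import numpy as np

def tensors_by_timestamp(records, shape):
    """Group (time, source, destination, port) records into one count tensor per timestamp."""
    tensors = {}
    for t, src, dst, port in records:
        # Create the tensor for timestamp t on first use, then count the record.
        X = tensors.setdefault(t, np.zeros(shape))
        X[src, dst, port] += 1
    # Return the tensors ordered by timestamp.
    return [tensors[t] for t in sorted(tensors)]

# Hypothetical records: (time, source id, destination id, port id).
records = [(0, 0, 1, 0), (0, 0, 1, 0), (0, 2, 1, 1), (1, 1, 0, 0)]
stream = tensors_by_timestamp(records, shape=(3, 2, 2))

print(len(stream))        # number of timestamps seen
print(stream[0][0, 1, 0])  # count for source 0, destination 1, port 0 at time 0
```

A dense array is used only to keep the sketch short; real traffic tensors are sparse, so a sparse representation would likely be preferable.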

Page 8

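Dynamic Data model (cont.)

Tensor Streams

A sequence of Mth order tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, where each $\mathcal{X}_i \in \mathbb{R}^{N_1 \times \cdots \times N_M}$ and n is increasing over time

Order | 1st | 2nd | 3rd
Correspondence | Multiple streams | Time evolving graphs | 3D arrays
Example | Sensors over time | Author × keyword over time | Sources × Destinations × Ports over time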

Page 9

Dynamic tensor analysis

[Figure: Old Tensors and a New Tensor (Source × Destination), the Old cores, and the projection matrices U_Source and U_Destination]

Page 10

Roadmap

Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion

Page 11

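Background – Singular value decomposition (SVD)

SVD: $A = U \Sigma V^T$, where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times R}$, $\Sigma \in \mathbb{R}^{R \times R}$, and $V^T \in \mathbb{R}^{R \times n}$

Best rank-k approximation in L2

PCA is an important application of SVD

[Figure: block diagram of the factorization, with labels Y, U^T, and k]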

Page 12

Latent semantic indexing (LSI)

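Singular vectors are useful for clustering or correlation detection

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
$$

The three factors are the document-concept, concept-association, and concept-term matrices.

Labels in the figure: terms pattern, cluster, query, cache, frequent; groups DM, DB.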
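The decomposition on this slide can be checked numerically. The sketch below, which assumes NumPy is available, computes a thin SVD of the 7 × 5 document-term matrix and keeps the two non-zero concepts. Signs of singular vectors are arbitrary, so absolute values are printed.

```python
import numpy as np

# Document-term matrix from the slide (7 documents x 5 terms).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # the matrix has rank 2, so two concepts reconstruct it exactly
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s[:k], 2))              # concept strengths
print(np.round(np.abs(U[:, :k]), 2))   # document-concept weights (up to sign)
print(np.round(np.abs(Vt[:k, :]), 2))  # concept-term weights (up to sign)
```

Each document loads on exactly one concept, which is why the singular vectors separate the two groups.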

Page 13

Tensor Operation: Matricize X(d)

Unfold a tensor into a matrix

[Figure: a 2 × 2 × 2 tensor with entries 1 to 8, unfolded into a matrix]

Acknowledgment: Tammy Kolda, for this slide
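A minimal NumPy sketch of matricizing follows. It assumes the convention that mode d becomes the columns, so the result has shape (product of the other dimensions) × N_d. The ordering of the rows is a choice made here and may differ from the layout in the original figure.

```python
import numpy as np

def matricize(X, d):
    """Unfold tensor X along mode d into a (product of other dims) x N_d matrix."""
    # Move mode d to the last axis, then flatten all remaining axes into rows.
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

# A 2 x 2 x 2 tensor holding the entries 1..8.
X = np.arange(1, 9).reshape(2, 2, 2)

for d in range(X.ndim):
    print(f"mode {d}:")
    print(matricize(X, d))
```

Every unfolding holds the same eight entries; only their arrangement into rows and columns changes with d.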

Page 14

Tensor Operation: Mode-product

Multiply a tensor with a matrix

[Figure: a source × destination × port tensor multiplied along the source mode by a group × source matrix, giving a group × destination × port tensor]
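The mode product can be sketched with `numpy.tensordot`. The convention assumed here is that the matrix has shape N_d × R and replaces mode d (length N_d) by a mode of length R; other texts store the matrix transposed. The tensor sizes and the random source-to-group map are hypothetical.

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X by matrix U along mode d.

    U has shape (N_d, R): mode d of X (length N_d) is replaced by a mode of length R.
    """
    # tensordot contracts mode d of X with axis 0 of U and appends the new axis last.
    Y = np.tensordot(X, U, axes=([d], [0]))
    # Move the new axis back to position d.
    return np.moveaxis(Y, -1, d)

rng = np.random.default_rng(0)
X = rng.random((4, 3, 2))       # hypothetical source x destination x port tensor
U_source = rng.random((4, 2))   # hypothetical map from 4 sources to 2 groups

Y = mode_product(X, U_source, 0)
print(Y.shape)  # (2, 3, 2): group x destination x port
```

Only the chosen mode changes size; the destination and port modes are carried through untouched.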

Page 15

Related work

Low rank approximation
  PCA, SVD: orthogonal based projection

Multilinear analysis
  Tensor decompositions: Tucker, PARAFAC, HOSVD

Stream mining
  Scan data once to identify patterns
  Sampling: [Vitter85], [Gibbons98]
  Sketches: [Indyk00], [Cormode03]

Graph mining
  Explorative: [Faloutsos04] [Kumar99] [Leskovec05] …
  Algorithmic: [Yan05] [Cormode05] …

Our Work

Page 16

Roadmap

Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion

Page 17

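Tensor analysis

Given a sequence of tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, find the projection matrices $U_i|_{i=1}^{M}$ such that the reconstruction error e is minimized:

$$e = \sum_{t=1}^{n} \left\| \mathcal{X}_t - \mathcal{X}_t \prod_{i=1}^{M} \times_i \left( U_i U_i^T \right) \right\|_F^2$$

Note that this is a generalization of PCA when n is a constant

[Figure: a Sources × Destinations × Ports tensor decomposed into a Core Tensor and the Source, Destination, and Port projection matrices]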
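The following sketch computes a core tensor and a reconstruction error for given projection matrices. It assumes the error is the sum over timestamps of the squared Frobenius norm of each tensor minus its projection onto the column spaces of the U matrices. The helper names and the random orthonormal matrices are hypothetical; nothing is optimized here, the matrices are simply evaluated.

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X along mode d by U (shape N_d x R)."""
    return np.moveaxis(np.tensordot(X, U, axes=([d], [0])), -1, d)

def core_tensor(X, Us):
    """Project X onto the columns of each U_d, one mode at a time."""
    Y = X
    for d, U in enumerate(Us):
        Y = mode_product(Y, U, d)
    return Y

def reconstruction_error(tensors, Us):
    """Sum over tensors of the squared Frobenius norm of X minus its projection."""
    total = 0.0
    for X in tensors:
        X_hat = X
        for d, U in enumerate(Us):
            # U @ U.T projects mode d onto the column space of U.
            X_hat = mode_product(X_hat, U @ U.T, d)
        total += float(np.sum((X - X_hat) ** 2))
    return total

rng = np.random.default_rng(0)
tensors = [rng.random((5, 4, 3)) for _ in range(3)]

# Hypothetical orthonormal projection matrices from QR factorizations of random matrices.
Us = [np.linalg.qr(rng.random((n, r)))[0] for n, r in [(5, 2), (4, 2), (3, 2)]]

print(core_tensor(tensors[0], Us).shape)  # (2, 2, 2)
print(reconstruction_error(tensors, Us))  # positive: the projection loses information
print(reconstruction_error(tensors, [np.eye(5), np.eye(4), np.eye(3)]))  # full rank: no loss
```

The core tensor is much smaller than the input, which is what makes the projection matrices useful as a summary.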

Page 18

Why do we care?

Anomaly detection
  Reconstruction error driven
  Multiple resolution

Multiway latent semantic indexing (LSI)

[Figure: Authors × Keywords tensors from 1990 to 2004 with projection matrices UA and UK; groups DB and DM; example authors Philip Yu and Michael Stonebraker; example keywords Query and Pattern; time axis]

Page 19

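1st order DTA - problem

Given $x_1, \ldots, x_n$ where each $x_i \in \mathbb{R}^N$, find $U \in \mathbb{R}^{N \times R}$ such that the error e is small:

$$e = \sum_{i=1}^{n} \left\| x_i - x_i U U^T \right\|^2$$

where each $x_i$ is a row of the n × N matrix X.

[Figure: the n × N matrix X, with rows x1 … xn over time and one column per sensor, is projected onto R patterns (U^T; e.g., indoor, outdoor) to give Y]

Note that Y = XU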

Page 20

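1st order DTA

Input: new data vector $x \in \mathbb{R}^N$, old variance matrix $C \in \mathbb{R}^{N \times N}$

Output: new projection matrix $U \in \mathbb{R}^{N \times R}$

Algorithm:
1. Update variance matrix: $C_{new} = x^T x + C$
2. Diagonalize: $U \Lambda U^T = C_{new}$
3. Determine the rank R and return U

Diagonalization has to be done for every new x!

[Figure: a new row x is appended to Old X over time; x^T x is added to C to give C_new, which is then diagonalized]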
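The three steps can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' code: the rank R is passed in rather than determined from the data, the function name `dta_first_order` is hypothetical, and the synthetic stream lies along an assumed direction `v` with a little noise.

```python
import numpy as np

def dta_first_order(x, C, R):
    """One 1st order DTA step: returns the new projection matrix and variance matrix."""
    x = np.asarray(x, dtype=float).reshape(1, -1)   # treat x as a 1 x N row vector
    C_new = x.T @ x + C                             # 1. update variance matrix
    eigvals, eigvecs = np.linalg.eigh(C_new)        # 2. diagonalize (ascending eigenvalues)
    top = np.argsort(eigvals)[::-1][:R]             # 3. keep the R leading eigenvectors
    return eigvecs[:, top], C_new

rng = np.random.default_rng(0)
v = np.array([0.6, 0.8])        # hypothetical dominant direction of the stream
C = np.zeros((2, 2))
for _ in range(200):
    x = rng.uniform(1, 2) * v + 0.01 * rng.normal(size=2)
    U, C = dta_first_order(x, C, R=1)

print(np.round(np.abs(U[:, 0]), 2))  # close to v, up to sign
```

The call to `eigh` on every arrival is the cost the slide warns about; it grows quickly with N.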

Page 21

1st order STA

Adjust U smoothly when new data arrive, without diagonalization [VLDB05]

For each new point x:
  Project onto current line
  Estimate error
  Rotate line in the direction of the error and in proportion to its magnitude

For each new point x and for i = 1, …, k:
  $y_i := U_i^T x$ (projection onto $U_i$)
  $d_i \leftarrow d_i + y_i^2$ (energy, i-th eigenvalue)
  $e_i := x - y_i U_i$ (error)
  $U_i \leftarrow U_i + (1/d_i)\, y_i e_i$ (update estimate)
  $x \leftarrow x - y_i U_i$ (repeat with remainder)

[Figure: Sensor 1 vs. Sensor 2 scatter plot showing the line U and the error]
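A minimal sketch of the per-point update follows, written directly from the five steps on the slide. The initial values (unit vector for U, a small positive energy d) and the synthetic two-sensor stream are assumptions made for illustration; the slide does not specify them.

```python
import numpy as np

def sta_update(x, U, d):
    """Update the columns of U (N x k) and the energies d (length k) in place for one point x."""
    x = np.array(x, dtype=float)       # copy, because x is deflated below
    for i in range(U.shape[1]):
        y = U[:, i] @ x                # projection onto U_i
        d[i] += y * y                  # accumulate energy of the i-th direction
        e = x - y * U[:, i]            # error of the current estimate
        U[:, i] += (y / d[i]) * e      # rotate U_i toward the error
        x -= y * U[:, i]               # continue with the remainder
    return U, d

rng = np.random.default_rng(0)
v = np.array([0.6, 0.8])               # hypothetical true direction of the two sensors
U = np.array([[1.0], [0.0]])           # assumed start: first unit vector
d = np.array([1e-3])                   # assumed small positive initial energy

for _ in range(500):
    x = rng.uniform(1, 2) * v + 0.01 * rng.normal(size=2)
    U, d = sta_update(x, U, d)

direction = U[:, 0] / np.linalg.norm(U[:, 0])
print(np.round(np.abs(direction), 2))  # close to v, up to sign
```

Each update costs a handful of vector operations per tracked direction, with no eigendecomposition.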

Page 22

Mth order DTA

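[Figure: update for mode d]

Matricizing, transpose: $\mathcal{X} \rightarrow X_{(d)}$, $X_{(d)}^T$

Construct variance matrix of incremental tensor: $X_{(d)}^T X_{(d)}$

Reconstruct variance matrix: $C_d = U_d S_d U_d^T$

Update variance matrix: $C_d \leftarrow C_d + X_{(d)}^T X_{(d)}$

Diagonalize variance matrix: $C_d = U_d S_d U_d^T$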
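The per-mode steps (matricize, reconstruct the variance matrix, update it, diagonalize) can be sketched as below. This is a sketch under stated assumptions rather than the authors' implementation: ranks are fixed in advance, the factors start at zero, and the test stream consists of scaled copies of one hypothetical rank-1 tensor so the expected result is easy to see.

```python
import numpy as np

def matricize(X, d):
    """Unfold X along mode d into a (product of other dims) x N_d matrix."""
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

def dta_update(X, Us, Ss, ranks):
    """One DTA step: update every mode's projection matrix with the new tensor X."""
    new_Us, new_Ss = [], []
    for d, (U, S, R) in enumerate(zip(Us, Ss, ranks)):
        C = U @ np.diag(S) @ U.T                # reconstruct variance matrix from old factors
        Xd = matricize(X, d)
        C = C + Xd.T @ Xd                       # add variance of the incremental tensor
        eigvals, eigvecs = np.linalg.eigh(C)    # diagonalize (ascending eigenvalues)
        top = np.argsort(eigvals)[::-1][:R]     # keep the R leading components
        new_Us.append(eigvecs[:, top])
        new_Ss.append(eigvals[top])
    return new_Us, new_Ss

shape, ranks = (5, 4, 3), (2, 2, 2)
rng = np.random.default_rng(0)
a, b, c = (rng.random(n) for n in shape)        # hypothetical rank-1 pattern

Us = [np.zeros((n, r)) for n, r in zip(shape, ranks)]   # assumed empty start
Ss = [np.zeros(r) for r in ranks]
for t in range(10):
    X = (t + 1) * np.einsum("i,j,k->ijk", a, b, c)
    Us, Ss = dta_update(X, Us, Ss, ranks)

print([U.shape for U in Us])  # [(5, 2), (4, 2), (3, 2)]
```

Only the truncated factors are carried between steps, so storage stays independent of the number of timestamps.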

Page 23

Mth order DTA – complexity

Storage: $O(\prod_i N_i)$, i.e., the size of an input tensor at a single timestamp

Computation: $\sum_i N_i^3$ (or $\sum_i N_i^2$) for the diagonalization of C, plus $\sum_i N_i \prod_j N_j$ for the matrix multiplication $X_{(d)}^T X_{(d)}$

For low order tensors (order < 3), diagonalization is the main cost

For high order tensors, matrix multiplication is the main cost

Page 24

Mth order STA

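[Figure: the tensor is matricized to X_(d)^T; a vector x is projected onto U1 to give y1, the error e1 is computed, and U1 is updated]

Run 1st order STA along each mode

Complexity:
  Storage: $O(\prod_i N_i)$
  Computation: $\sum_i R_i \prod_j N_j$, which is smaller than DTA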

Page 25

Roadmap

Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion

Page 26

Experiment

Objectives
  Computational efficiency
  Accurate approximation
  Real applications:
    Anomaly detection
    Clustering

Page 27

Data set 1: Network data

TCP flows collected at CMU backbone
Raw data: 500 GB with compression
Construct 3rd order tensors with hourly windows with <source, destination, port>
Each tensor: 500 × 500 × 100
1200 timestamps (hours)

Sparse data
[Figure: 10AM to 11AM on 01/06/2005]

Power-law distribution
[Figure: distribution of values]

Page 28

Data set 2: Bibliographic data (DBLP)

Papers from VLDB and KDD conferences
Construct 2nd order tensors with yearly windows with <author, keywords>
Each tensor: 4584 × 3741
11 timestamps (years)

Page 29

Computational cost

[Figure panels: 3rd order network tensor; 2nd order DBLP tensor]

OTA is the offline tensor analysis
Performance metric: CPU time (sec)
Observations:
  DTA and STA are orders of magnitude faster than OTA
  The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)

Page 30

Accuracy comparison

Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%

Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes.

[Figure panels: 3rd order network tensor; 2nd order DBLP tensor]

Page 31

Network anomaly detection

Reconstruction error gives an indication of anomalies.

The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).

[Figure: reconstruction error over time; panels: normal traffic, abnormal traffic]
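One simple way to turn a reconstruction-error series into alerts is a threshold on the error. The rule below (mean plus a multiple of the standard deviation) is an assumption chosen for illustration, not a rule stated in the slides, and the error values are made up.

```python
from statistics import mean, stdev

def flag_anomalies(errors, num_std=2.0):
    """Return the timestamps whose error exceeds mean + num_std * standard deviation."""
    threshold = mean(errors) + num_std * stdev(errors)
    return [t for t, e in enumerate(errors) if e > threshold]

# Hypothetical per-hour reconstruction errors; hour 5 stands out.
errors = [0.10, 0.12, 0.11, 0.09, 0.13, 0.95, 0.10, 0.12]
print(flag_anomalies(errors))  # [5]
```

A large spike inflates both the mean and the deviation, so a robust statistic such as the median may be a better baseline on real traffic.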

Page 32

Multiway LSI

Group | Authors | Keywords | Year
DB | michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
DB | surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004
DM | jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004

Two groups are correctly identified: Databases and Data mining

People and concepts are drifting over time

Page 33

Conclusion

Tensor stream is a general data model
DTA/STA incrementally decompose tensors into core tensors and projection matrices
The result of DTA/STA can be used in other applications:
  Anomaly detection
  Multiway LSI

Page 34

The world is not flat, neither should data mining be.

Final word: Think structurally!

Contact: Jimeng Sun [email protected]