Beyond Streams and Graphs: Dynamic Tensor Analysis

Preview:

DESCRIPTION

Beyond Streams and Graphs: Dynamic Tensor Analysis. Dacheng Tao. Christos Faloutsos. Jimeng Sun. Speaker: Jimeng Sun. Motivation. Goal: incremental pattern discovery on streaming applications Streams: E1: Environmental sensor networks E2: Cluster/data center monitoring Graphs: - PowerPoint PPT Presentation

Citation preview

Beyond Streams and Graphs:

Dynamic Tensor Analysis Jimeng Sun Christos FaloutsosDacheng

Tao

Speaker: Jimeng Sun

MotivationGoal: incremental pattern discovery on streaming applications

Streams: E1: Environmental sensor networks E2: Cluster/data center monitoring

Graphs: E3: Social network analysis

Tensors: E4: Network forensics E5: Financial auditing E6: fMRI: Brain image analysis

How to summarize streaming data effectively and incrementally?

E3: Social network analysisTraditionally, people focus on static networks and find community structuresWe plan to monitor the change of the community structure over time and identify abnormal individuals

DB

Aut

hors

Keywords

DM

DB

1990

2004

E4: Network forensicsDirectional network flowsA large ISP with 100 POPs, each POP 10Gbps link capacity [Hotnets2004]

450 GB/hour with compression

Task: Identify abnormal traffic pattern and find out the cause

normal trafficabnormal traffic

dest

inati

on

source

dest

inati

on

sourceCollaboration with Prof. Hui Zhang and Dr. Yinglian Xie

Static Data model

For a timestamp, the stream measurements can be modeled using a tensorDimension: a single stream

E.g, <Christos, “graph”>

Mode: a group of dimensions of the same kind.

E.g., Source, Destination, Port

Des

tina

tion

Source

Time = 0

Static Data model (cont.)Tensor

Formally,

Generalization of matrices

Represented as multi-array, data cube.Order 1st 2nd 3rd

Correspondence Vector Matrix 3D array

Example Sensors

Aut

hors

Keywords

SourcesD

estin

atio

nsP

orts

Dynamic Data model (our focus)

Streams come with structure(time, source, destination, port)(time, author, keyword)

time

De

stin

atio

n

Source

Dynamic Data model (cont.)Tensor Streams

A sequence of Mth order tensor

where

n is increasing over timeOrder 1st 2nd 3rd

Correspondence

Multiple streams Time evolving graphs

3D arrays

Example

Sources

Des

tinat

ions

Por

tstime

Sensors

time

au

thor

keyword

Dynamic tensor analysis

UDestination

USource

Old cores

Sourc

e

De

stin

atio

n

New TensorOld Tensors

Roadmap

Motivation and main ideasBackground and related workDynamic and streaming tensor analysisExperimentsConclusion

Y

Background – Singular value decomposition (SVD)

SVD

Best rank k approximation in L2PCA is an important application of SVD

Am

n

m

nRR

R

UVT k

k k

UT

Latent semantic indexing (LSI)

Singular vectors are useful for clustering or correlation detection

1 1 1 0 0

2 2 2 0 0

1 1 1 0 0

5 5 5 0 0

0 0 0 2 2

0 0 0 3 30 0 0 1 1

pattern

cluster

querycache

0.18 0

0.36 0

0.18 0

0.90 0

0 0.53

0 0.800 0.27

=DM

DB

9.64 0

0 5.29x

0.58 0.58 0.58 0 0

0 0 0 0.71 0.71

x

document-conceptconcept-term

concept-association

frequent

Tensor Operation: Matricize X(d)

Unfold a tensor into a matrix

5 76 81 3

2 4

Acknowledge to Tammy Kolda for this slide

Tensor Operation: Mode-product

Multiply a tensor with a matrix

source

dest

inati

on

port

group

source

dest

inati

on

port

groupso

urc

e

Related workLow Rank approximationPCA, SVD: orthogonal

based projection

Multilinear analysisTensor decompositions:

Tucker, PARAFAC, HOSVD

Stream miningScan data once to

identify patternsSampling: [Vitter85],

[Gibbons98]Sketches: [Indyk00],

[Cormode03]

Graph miningExplorative: [Faloutsos04]

[Kumar99][Leskovec05]…

Algorithmic: [Yan05][Cormode05]…

Our Work

Roadmap

Motivation and main ideasBackground and related workDynamic and streaming tensor analysisExperimentsConclusion

Tensor analysisGiven a sequence of tensorsfind the projection matricessuch that the reconstruction error e is minimized:

t

Note that this is a generalization of PCA when n is a constant

Sources

Des

tinat

ions

Por

ts

Source Projection

Des

tinat

ion

Pro

ject

ion

Port Projection

Core Tensor

DB

Aut

hors

Keywords

DM

DB

UA

UK

1990

2004

1990

2004

Why do we care?

Anomaly detectionReconstruction error drivenMultiple resolution

Multiway latent semantic indexing (LSI) Philip Yu

Michael Stonebreak

er

QueryPattern

time

1st order DTA - problemGiven x1…xn where each xi RN, find

URNR such that the error e is small:

n

N

x1

xn

….

?

tim

e

Sensors

UT

indooroutdoor

Y

Sensors

R

Note that Y = XU

1st order DTAInput: new data vector x RN, old variance

matrix C RN N

Output: new projection matrix U RN R

Algorithm:1. update variance matrix Cnew = xTx + C2. Diagonalize UUT = Cnew 3. Determine the rank R and return U

xT C UUTx

Cnew

Diagonalization has to be done for every new x!

Old X

x

tim

e

1st order STAAdjust U smoothly when new data arrive without diagonalization [VLDB05]

For each new point xProject onto current lineEstimate errorRotate line in the direction of the error and in proportion to its magnitude

For each new point x and for i = 1, …, k : yi := Ui

Tx (proj. onto Ui)

di di + yi2 (energy i-th eigenval.)

ei := x – yiUi (error)

Ui Ui + (1/di) yiei (update estimate)

x x – yiUi (repeat with remainder)

error

U

Sensor 1

Sen

sor

2

Mth order DTA

dU

TdU

Reconstruct Variance Matrix

dC

dC

Update Variance Matrix

dS

Diagonalize Variance Matrix

dU

TdU

dSX(d)X(d)

dX TdX

Mat

riciz

ing,

Tra

nspo

se

Construct Variance Matrix of Incremental Tensor

Matricizing

T

Mth order DTA – complexityStorage: O( Ni), i.e., size of an input tensor at a single

timestampComputation: Ni

3 (or Ni2) diagonalization of C

+ Ni Ni matrix multiplication X (d)T X(d)

For low order tensor(<3), diagonalization is the main cost

For high order tensor, matrix multiplication is the main cost

Mth order STA

TdX

Matricizing

Run 1st order STA along each modeComplexity:

Storage: O( Ni)

Computation: Ri Ni which is smaller than DTAy1

U1

xe1

U1 updated

Roadmap

Motivation and main ideasBackground and related workDynamic and streaming tensor analysisExperimentsConclusion

Experiment

ObjectivesComputational efficiencyAccurate approximationReal applications

Anomaly detection Clustering

Data set 1: Network dataTCP flows collected at CMU backboneRaw data 500GB with compressionConstruct 3rd order tensors with hourly windows with <source, destination, port>Each tensor: 500500100 1200 timestamps (hours)

Sparse data10AM to 11AM on 01/06/2005

valuePower-law distribution

Data set 2: Bibliographic data (DBLP)

Papers from VLDB and KDD conferencesConstruct 2nd order tensors with yearly windows with <author, keywords> Each tensor: 45843741 11 timestamps (years)

Computational cost

3rd order network tensor 2nd order DBLP tensorOTA is the offline tensor analysisPerformance metric: CPU time (sec)Observations:

DTA and STA are orders of magnitude faster than OTAThe slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)

Accuracy comparison

Performance metric: the ratio of reconstruction error between DTA/STA and OTA; fixing the error of OTA to 20%Observation: DTA performs very close to OTA in both datasets, STA performs worse in DBLP due to the bigger changes.

3rd order network tensor 2nd order DBLP tensor

Network anomaly detection

Reconstruction error gives indication of anomalies.Prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin).

Reconstruction error over time

Normal trafficAbnormal traffic

Multiway LSIAuthors Keywords Yearmichael carey, michaelstonebreaker, h. jagadish,hector garcia-molina

queri,parallel,optimization,concurr,objectorient

1995

surajit chaudhuri,mitch cherniack,michaelstonebreaker,ugur etintemel

distribut,systems,view,storage,servic,process,cache

2004

jiawei han,jian pei,philip s. yu,jianyong wang,charu c. aggarwal

streams,pattern,support, cluster, index,gener,queri

2004

Two groups are correctly identified: Databases and Data miningPeople and concepts are drifting over time

DB

DM

Conclusion

Tensor stream is a general data modelDTA/STA incrementally decompose tensors into core tensors and projection matricesThe result of DTA/STA can be used in other applications

Anomaly detectionMultiway LSI

The world is not flat, neither should data mining be.

Final word: Think structurally!

Contact: Jimeng Sun jimeng@cs.cmu.edu

Recommended