Multisite Internet Data Analysis
Alfred O. Hero, Clyde Shih, David Barsic
University of Michigan - Ann Arbor
[email protected]
http://www.eecs.umich.edu/~hero
Research supported in part by: NSF CCR-0325571

1. Network Data Collection
2. Distributed Data Analysis
   1. Dimension Reduction
   2. Model-Based Data Analysis
3. Conclusions

Page 1: Multisite Internet Data Analysis

Multisite Internet Data Analysis

Alfred O. Hero, Clyde Shih, David Barsic
University of Michigan - Ann Arbor

[email protected]
http://www.eecs.umich.edu/~hero

Research supported in part by: NSF CCR-0325571

1. Network Data Collection
2. Distributed Data Analysis
   1. Dimension Reduction
   2. Model-Based Data Analysis
3. Conclusions

Page 2: Multisite Internet Data Analysis

1. Network Data Collection

• Objectives
  – Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate the global network state while respecting information privacy constraints.
  – Local: collection sites gather data relevant to the local network state and share information as necessary to enhance local analysis.
• Types of data measured
  – Active: queries and requests, packet probes
  – Passive: netflow, router fields, honeypots, backscatter

Page 3: Multisite Internet Data Analysis

[Diagram: ISP 1, ISP 2, and ISP 3 with data collectors, local data collection and probing sites, data collection sites, and a monitoring center.]

Page 4: Multisite Internet Data Analysis

Abilene Netflow Data

[Figure: flow statistics by protocol for Dataset 1 and Dataset 2. Both axes are labeled No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes.]

Page 5: Multisite Internet Data Analysis

Abilene Netflow Data

[Figure: flow statistics by router for Dataset 1 and Dataset 2. Both axes are labeled No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes.]

Page 6: Multisite Internet Data Analysis

Abilene Netflow Data

[Figure: "Total Number of Flows for Data Set 1". x-axis: Time in sec. (0 to 200); y-axis: Num of Flows (5.5 to 8, ×10^4).]

Page 7: Multisite Internet Data Analysis

Challenges and Approaches

• Challenges
  – High dimensional measurement space
  – Non-linear dependencies and non-stationarity
  – Privacy and proprietary concerns
  – Insufficient bandwidth for continuously sampled data
• Approaches
  – Dimension reduction
  – Model-based distributed inference
  – Controlled information sharing
  – Hierarchical and modular collection/analysis

Page 8: Multisite Internet Data Analysis

Hierarchical Architecture

Page 9: Multisite Internet Data Analysis

2. Distributed Data Analysis

• Hypothesis: data collected at sites A,B,C follow a statistical distribution defined over a lower dimensional manifold.

• Overall objective: Find distributed strategies to perform reliable statistical inference with minimum amount of data sharing

[Diagram: Site A, Site B, and Site C.]

Page 10: Multisite Internet Data Analysis

2.1 Distributed Dimension Reduction

[Diagram labels: Unknown Distribution, Unknown Manifold, Unknown Embedding, Sampling, Observed Sample.]

Page 11: Multisite Internet Data Analysis

Geodesic Entropic Graphs

[Figure: A Planar Sample and its Euclidean MST.]

Page 12: Multisite Internet Data Analysis

GMST Dimension Estimation

[Figure: two plots titled "MST Length for 3 Land Vehicles (=1,m=20)", x-axis n, y-axis L_n (×10^5). The first spans n from 0 to 1200 with L_n from 0 to 9; the second zooms in on n from 1020 to 1038 with L_n from 8.5 to 8.64.]

GMST estimates: d = 13, H = 120 (bits)
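The dimension estimate on this slide comes from how the MST length L_n grows with the sample size n. The sketch below illustrates that idea under stated assumptions and is not the authors' GMST implementation. It uses a plain Euclidean MST rather than a geodesic one, and it assumes the growth law L_n ≈ c·n^((d-γ)/d). Under that law the slope a of log L_n against log n gives d = γ/(1 - a). The function names `mst_length` and `estimate_dimension` and the synthetic test data are hypothetical.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of (edge length ** gamma) over the Euclidean MST (Prim, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # shortest distance from each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Add the point that is closest to the current tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def estimate_dimension(points, sizes, gamma=1.0):
    """Fit log L_n = a * log n + b by least squares, then d = gamma / (1 - a)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(mst_length(points[:n], gamma)) for n in sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return gamma / (1.0 - slope)


# Synthetic check: a 2-D plane embedded linearly in 3-D (assumed test data).
random.seed(0)
pts = []
for _ in range(400):
    u, v = random.random(), random.random()
    pts.append((u, v, 0.5 * (u + v)))

d_hat = estimate_dimension(pts, sizes=[100, 200, 300, 400])
print("estimated intrinsic dimension:", round(d_hat, 2))
```

The estimate is noisy at these sample sizes. A geodesic version would replace `math.dist` with shortest-path distances on a neighborhood graph.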

Page 13: Multisite Internet Data Analysis

Distributed GMST Estimator

• Principal MST convergence result:

• Distributed BHH (Aggregation rule):

• Tight upper and lower bounds on limit: obtained if sites exchange rooted dual graphs [Yukich:97]

BHH Theorem:
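The formulas on this slide did not survive extraction. For orientation only, the block below gives the commonly quoted textbook form of the Beardwood-Halton-Hammersley (BHH) limit for power-weighted MST length. The notation is an assumption and may differ from the slide's: X_1, ..., X_n are i.i.d. points with density f on R^d, γ is the edge-weight exponent with 0 < γ < d, and β_{d,γ} is a constant that depends only on d and γ.

```latex
\lim_{n \to \infty} \frac{L_\gamma(X_1, \dots, X_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f(x)^{(d-\gamma)/d} \, dx
  \qquad \text{(almost surely)}
```

The growth exponent (d-γ)/d carries the dimension, and the integral on the right is tied to the Rényi entropy of f of order (d-γ)/d. This is why a single MST length curve can give both a dimension estimate and an entropy estimate.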

Page 14: Multisite Internet Data Analysis

2.2 Distributed Model-based Inference

• Global likelihood model

• Global M-estimator recursion:
  – Global Fisher score function

• Local Fisher score functions
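The equations for this slide were lost, so the following is only a minimal sketch of the general idea and not the authors' recursion. It assumes the data are independent across sites, in which case the global log-likelihood is the sum of the local ones and the global Fisher score is the sum of the local Fisher scores. The model (a Gaussian with unknown mean and unit variance), the Fisher-scoring step, and the names `local_score` and `distributed_m_estimate` are all illustrative assumptions.

```python
def local_score(data, theta):
    """Local Fisher score for a unit-variance Gaussian mean model:
    d/dtheta of sum_i log f(x_i; theta) = sum_i (x_i - theta)."""
    return sum(x - theta for x in data)


def distributed_m_estimate(site_data, theta0=0.0, n_iter=20):
    """Iterate theta <- theta + (global score) / (global Fisher information).

    Each site only reports its local score at the current theta; the raw
    samples never leave the site.
    """
    n_total = sum(len(d) for d in site_data)  # Fisher information for this model
    theta = theta0
    for _ in range(n_iter):
        # Global score = sum of the local scores (independence assumption).
        global_score = sum(local_score(d, theta) for d in site_data)
        theta += global_score / n_total
    return theta


# Three hypothetical sites holding different amounts of data.
sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
theta_hat = distributed_m_estimate(sites)
print("distributed estimate:", theta_hat)
```

For this simple model the recursion reaches the pooled sample mean, the same value a single site holding all of the data would compute.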

Page 15: Multisite Internet Data Analysis

Distributed M-estimator

[Diagram: sites A and B each run a "Compute" step followed by k = k+1.]

Page 16: Multisite Internet Data Analysis

Properties

• Communication requirement: 2p bytes/update/site.

• If the data are independent, the recursion attains stationary points of the global likelihood.

• All local MLEs are available to each site.

• For a multimodal likelihood, the local MLEs can be improved by aggregation under a mixture model.

Page 17: Multisite Internet Data Analysis

Global Likelihood Function

[Figure: global likelihood function with the global maximum and local maxima marked; the local MLEs are shown as x marks.]

Page 18: Multisite Internet Data Analysis

Key Theoretical Result

• The asymptotic distribution of local estimates is a Gaussian mixture dependent on global likelihood:

$$ f_{\hat{\theta}}(t) = \sum_{m=0}^{M} \frac{p_m}{(2\pi)^{K/2} \, |C_m|^{1/2}} \exp\left( -\frac{1}{2} (t - \theta_m)^T C_m^{-1} (t - \theta_m) \right) $$

• Parameters

Proof: asymptotic normal theory of local maxima (Huber:67); see Blatt&Hero:2003

Page 19: Multisite Internet Data Analysis

Local Estimator Aggregation Algorithm

[Diagram: Estimator 1, Estimator 2, ..., Estimator N feed two blocks, "Sample Covariance Analysis" and "Estimation of Gaussian Mixture Parameters (FS, EM, ...)", followed by "Aggregation To Final Estimate".]
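The pipeline in this diagram can be sketched roughly as follows, as a one-dimensional illustration under several assumptions rather than the authors' algorithm. The local estimates are clustered by fitting a Gaussian mixture with plain EM (Fisher scoring and CEM2 are not implemented here). The final estimate is then taken as the mean of the component whose members report the highest average log-likelihood. That selection rule, the function names `fit_gmm_1d` and `aggregate`, and the toy inputs are all hypothetical.

```python
import math


def fit_gmm_1d(xs, k=2, n_iter=100):
    """Plain EM for a 1-D Gaussian mixture.

    Returns (weights, means, variances, responsibilities)."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, pooled variance.
    means = [xs_sorted[int((j + 0.5) * n / k)] for j in range(k)]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    variances = [max(var_all, 1e-6)] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each estimate.
        resp = []
        for x in xs:
            dens = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            s = sum(dens) or 1e-300
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj, 1e-6)
    return weights, means, variances, resp


def aggregate(estimates, loglikes, k=2):
    """Pick the mixture component with the best average reported log-likelihood
    (an assumed selection rule) and return its mean as the final estimate."""
    _, means, _, resp = fit_gmm_1d(estimates, k)
    scores = []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        avg = sum(r[j] * ll for r, ll in zip(resp, loglikes)) / nj if nj > 0 else -math.inf
        scores.append(avg)
    best = max(range(k), key=scores.__getitem__)
    return means[best]


# Toy input: local estimates near 1.0 (higher likelihood) and near 3.0.
local_estimates = [0.9, 1.0, 1.1, 0.95, 1.05, 2.9, 3.0, 3.1]
local_loglikes = [-10.0] * 5 + [-12.0] * 3
final_estimate = aggregate(local_estimates, local_loglikes)
print("aggregated estimate:", round(final_estimate, 3))
```

In higher dimensions the per-component variances become covariance matrices, which is where the comparison against the inverse Fisher information matrix would come in.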

Page 20: Multisite Internet Data Analysis

Simple Example

IID Observation Model:

• Each site observes a 2-component Gaussian mixture

• Identical component variances

• Unknown mixing parameters

• Unknown component means

• 200 data collection sites

• 100 samples/site

• CEM2 algorithm implemented for estimation and aggregation

[Figure: ambiguity function with the global maximum and a local maximum marked.]

Page 21: Multisite Internet Data Analysis

Clustering and Discrimination

[Figure: scatter plot on axes labeled 1 and 2 (each 0 to 3) showing the global maximum, a local maximum, the inverse FIM, and the empirically estimated covariances via CEM2.]

Page 22: Multisite Internet Data Analysis

Validation of Key Result

[Figure: QQ plots for Cluster 1 and Cluster 2.]

Page 23: Multisite Internet Data Analysis

Conclusions

• Lossless distributed dimension reduction and model-based inference require:
  – Reliable local inference methods
  – Aggregation rules for combining local statistics

• Information sharing constraints?

• Effects of bandwidth constraints: data compression?

• Tracking in dynamical models?

Page 24: Multisite Internet Data Analysis

References

• A. O. Hero, B. Ma, O. Michel and J. D. Gorman, "Application of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.

• J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.

• D. Blatt and A. O. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.

• M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.

• N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.

Page 25: Multisite Internet Data Analysis

Information Sharing Game

Page 26: Multisite Internet Data Analysis

Addition of other Discriminants

[Figure: 3-D plot over the axes labeled 1 and 2 (each 0 to 3); vertical axis: log f(y_i; 1, 2) - E{log f(y_i; 1, 2)} (-3 to 1).]

Value-added due to transmission of likelihood values