
Measuring Dependency via Intrinsic Dimensionality (ICPR 2016)


Motivation Intrinsic Dimensionality Theory Intrinsic Dimensional Dependency Conclusions

ICPR 2016 – December 5th 2016

Measuring Dependency via Intrinsic Dimensionality

Simone Romano∗

[email protected]

@ialuronico

Oussama Chelly Nguyen Xuan Vinh James Bailey Michael E. Houle

∗Currently I am an applied scientist for in London UK

Simone Romano NII Tokyo


Motivation

Intrinsic Dimensionality Theory

Intrinsic Dimensional Dependency

Conclusions


Need for a novel type of Dependency Measure

There is clearly a strong dependency between the two sensors.

Figure: Number of cars counted with sensor X2 against the number counted with sensor X1, and the counts of X1 and X2 from Mon (01) to Wed (10). Sensors X1 and X2 are 0.36 km apart.

However, state-of-the-art dependency measures struggle to identify non-functional dependencies of this kind: e.g. Pearson correlation = 0.3 < 1.


Manifold Dependency

Variables that embed low-dimensional manifolds are strongly dependent.

Figure: E.g. D = 2 dependent variables: they embed 1-dimensional manifolds. Panels: Linear, Quadratic, Cubic, Sinusoidal low freq., Sinusoidal high freq., 4th Root, Circle, Step Function, Two Lines, X, Sinusoidal varying freq., Circle-bar.

Proposal: Intrinsic Dimensional Dependency (IDD) measures dependency between D continuous variables X = (X1 . . . XD):

IDD(X) = 1 if the underlying distribution of X is restricted to a constant number of 1-dimensional manifolds.
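To illustrate why a linear measure can miss a manifold dependency, the following minimal sketch computes the Pearson correlation for points lying exactly on a circle, one of the 1-dimensional manifolds listed above. The helper name `pearson` and the sample size are assumptions made for this illustration and do not come from the slides.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Points evenly spaced on the unit circle: X2 is determined by X1 up to its sign.
n = 1000
angles = [2 * math.pi * i / n for i in range(n)]
x1 = [math.cos(a) for a in angles]
x2 = [math.sin(a) for a in angles]

r_circle = pearson(x1, x2)  # close to 0 although the data lie on a 1-D manifold
r_line = pearson(x1, x1)    # identical variables give 1 up to rounding
print(r_circle, r_line)
```

The circle scores roughly zero under Pearson correlation even though the two variables are strongly dependent in the manifold sense described on this slide.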


Current state-of-the-art

Dependency Measure | Functional rel. | Manifold | D > 2
MIC: Maximal Information Coefficient [Reshef et al., 2011] | ✓ | - | -
MAC: Multivariate mAximal Correlation [Nguyen et al., 2014] | ✓ | - | ✓
UDS: Universal Dependency Score [Nguyen et al., 2016] | ✓ | - | ✓
MID: Mutual Information Dimension [Sugiyama and Borgwardt, 2013] | ✓ | ✓ | -
IDD: Intrinsic Dimensional Dependency (ICPR 2016) | ✓ | ✓ | ✓

IDD measures manifold dependency between multiple variables.

IDD is based on intrinsic dimensionality theory.


Renyi dimension

α-Renyi dimension [Renyi, 1961] measures the dimensionality of the data:

$$\dim_\alpha(X) \triangleq \lim_{\delta \to 0^+} \frac{H_\alpha(X, \delta)}{\log(1/\delta)},$$

where Hα(X, δ) is the α-Renyi entropy estimated on the discretized variable X using boxes of size δ.

Intuitively, the α-Renyi entropy quantifies the space-filling capacity of the data, and dimα(X) measures its growth rate:

small growth rate of the entropy ⇒ X embeds a small manifold

However, the α-Renyi dimension is tricky to estimate when D is large.
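The definition can be illustrated with a rough box-counting sketch for the Shannon case (α = 1). The helper names, box sizes and sample size below are assumptions, and a finite-difference slope between two box sizes stands in for the limit δ → 0+. The number of boxes grows like (1/δ)^D, which hints at why this route becomes difficult for large D.

```python
import math
import random
from collections import Counter

def box_entropy(points, delta):
    """Shannon entropy (alpha = 1) of the points discretized into boxes of side delta."""
    counts = Counter(tuple(math.floor(c / delta) for c in p) for p in points)
    n = len(points)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_growth_rate(points, delta_coarse, delta_fine):
    """Finite-difference slope of H(X, delta) against log(1/delta)."""
    dh = box_entropy(points, delta_fine) - box_entropy(points, delta_coarse)
    return dh / (math.log(1 / delta_fine) - math.log(1 / delta_coarse))

rng = random.Random(0)
n = 20000
line = [(t, t) for t in (rng.random() for _ in range(n))]   # 1-D manifold in 2-D
square = [(rng.random(), rng.random()) for _ in range(n)]   # fills the unit square

dim_line = entropy_growth_rate(line, 1 / 8, 1 / 32)
dim_square = entropy_growth_rate(square, 1 / 8, 1 / 32)
print(round(dim_line, 2), round(dim_square, 2))
```

The entropy of the diagonal line grows about half as fast as that of the filled square, which matches the intuition that a small growth rate indicates a small manifold.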


Local Intrinsic Dimensionality

The local Intrinsic Dimensionality (ID) was proposed by Houle [Houle, 2013] in the data mining community.

The local ID(x) measures the dimensionality of the manifold embedded in X at the locality x.

Here, we identify the connections between the local ID(x) and the global α-Renyi dimension.

This enables us to use existing local estimators of dimensionality for the α-Renyi dimension.

Figure: Local Intrinsic Dimensionality for different localities [von Brunken et al., 2015]. Axes: Column 0, Column 1 and Column 2 (each from -10 to 20); color scale: estimated dimensionality (1 to 20).


Connection between local ID and α-Renyi dimension

Theorem 1

The α-Renyi dimension can be expressed as:

$$\dim_\alpha(X) = \frac{\int f^\alpha(x)\, \mathrm{ID}(x)\, dx}{\int f^\alpha(x)\, dx},$$

where f is the pdf of X.

Special case: for α = 1, dim(X) is the growth rate of the Shannon entropy.

According to Theorem 1, dim(X) is the expectation of the local ID:

$$\dim(X) = \int f(x)\, \mathrm{ID}(x)\, dx.$$

This can be estimated using the local ID estimators proposed in [Amsaleg et al., 2015].
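As a worked step, the special case can be read off Theorem 1 directly. The sample x_1, ..., x_n drawn from f and the Monte Carlo approximation in the last line are assumptions added here for illustration.

```latex
\begin{align*}
\dim_1(X)
  &= \frac{\int f(x)\,\mathrm{ID}(x)\,dx}{\int f(x)\,dx}
     && \text{(Theorem 1 with } \alpha = 1\text{)} \\
  &= \int f(x)\,\mathrm{ID}(x)\,dx
     && \text{(since } \textstyle\int f(x)\,dx = 1\text{)} \\
  &= \mathbb{E}_{x \sim f}\bigl[\mathrm{ID}(x)\bigr]
   \approx \frac{1}{n}\sum_{i=1}^{n} \mathrm{ID}(x_i).
\end{align*}
```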


kNN estimators of dimensionality

We propose kNN estimators for the intrinsic dimensionality dimα of the variables X.

In the special case of α = 1, we can simply average local ID:

$$\dim(X) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{ID}(x_i) = -\frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{k} \sum_{j=1}^{k} \ln \frac{d_j(x_i)}{d_k(x_i)} \right)^{-1},$$

where d_j(x) is the distance of x to its j-th nearest neighbor.

We can employ the estimators for dimα to build a dependency measure between multiple variables.
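The averaged estimator for α = 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `knn_dimension`, the brute-force neighbor search, the toy data and the value of k are all choices made here, not details taken from the slides.

```python
import math
import random

def knn_dimension(points, k):
    """Average of local ID estimates, each built from the k nearest-neighbor distances.

    Brute-force neighbor search is used for brevity (an assumption of this
    sketch); it costs O(n^2) distance computations.
    """
    local_ids = []
    for i, x in enumerate(points):
        # Sorted distances from x to every other point; keep the k smallest.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)[:k]
        d_k = dists[-1]
        mean_log_ratio = sum(math.log(d / d_k) for d in dists) / k
        local_ids.append(-1.0 / mean_log_ratio)
    return sum(local_ids) / len(local_ids)

rng = random.Random(0)
n, k = 400, 20
segment = [(t, 2.0 * t) for t in (rng.random() for _ in range(n))]  # 1-D manifold
square = [(rng.random(), rng.random()) for _ in range(n)]           # 2-D region

dim_segment = knn_dimension(segment, k)
dim_square = knn_dimension(square, k)
print(round(dim_segment, 2), round(dim_square, 2))
```

The estimate for the segment should come out near 1 and the one for the square near 2, with some finite-sample bias that depends on k.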


Building a dependency measure: choice of α

Bigger α makes the estimation of dimensionality less sensitive to noise.

Figure: X2 against X1, and the estimated dimensionality dimα(X1; X2) as a function of α (1 to 3). With α ≈ 3 the estimated dimensionality is too small: it is indeed 1, as if the figure depicted a perfect 1-dimensional manifold.

When building a dependency measure, we want to be sensitive to noise.

Therefore we use dim(X) with α = 1 to build a dependency measure because:

- sensitive to noise
- simple estimator
- we can use properties of Shannon entropy


Intrinsic Dimensional Dependency

The Intrinsic Dimensional Dependency (IDD) for X:

$$\mathrm{IDD}(X) \triangleq \frac{\sum_{i=1}^{D} \dim(X_i) - \dim(X)}{\sum_{i=1}^{D} \dim(X_i) - \max_i \dim(X_i)}.$$

Properties

1. 0 ≤ IDD(X) ≤ 1;

2. IDD(X) = 0 iff all Xi are independent;

3. IDD(X) = 1 if there exist one or more manifolds of dimension 1 whose union embeds X.
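The definition and its properties can be checked on idealized dimension values. In this small sketch the function name `idd_from_dimensions` and the example dimensions are illustrative assumptions; in practice the dimensions would be estimated from data.

```python
def idd_from_dimensions(marginal_dims, joint_dim):
    """IDD from per-variable dimensions dim(X_i) and the joint dimension dim(X)."""
    total = sum(marginal_dims)
    return (total - joint_dim) / (total - max(marginal_dims))

# D = 3 continuous variables, each with marginal dimension 1.
marginals = [1.0, 1.0, 1.0]

idd_independent = idd_from_dimensions(marginals, joint_dim=3.0)  # dim(X) = sum of dim(X_i)
idd_manifold = idd_from_dimensions(marginals, joint_dim=1.0)     # X lies on a 1-D manifold
idd_partial = idd_from_dimensions(marginals, joint_dim=2.0)      # X lies on a 2-D surface
print(idd_independent, idd_manifold, idd_partial)  # 0.0 1.0 0.5
```

The denominator is the largest value the numerator can take, which is what keeps the score between 0 and 1.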


Algorithm to compute IDD

IDD is computed using a single parameter k (the number of nearest neighbors).

We use the copula transformation to make it invariant to the marginals.

We use KD-trees to speed up computations.

IDD(X, k)

1. Copula transform X
2. X = X + ε, where ε = 10^-6 Gaussian noise
3. Build KD-trees for X and Xi
4. Compute dim(X) and dim(Xi)
5. return IDD(X)

Average computational complexity: O(Dn log n + nk log n)
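The steps above can be sketched end to end as follows. This is not the authors' implementation: the function names, the rank-based copula transform, reading 10^-6 as the noise standard deviation, the `seed` parameter, and the brute-force neighbor search used in place of KD-trees are all assumptions made to keep the example short and dependency-free.

```python
import math
import random

def copula_transform(data):
    """Replace each column by its ranks scaled to (0, 1]; ties are broken by order."""
    n, d = len(data), len(data[0])
    out = [[0.0] * d for _ in range(n)]
    for col in range(d):
        order = sorted(range(n), key=lambda i: data[i][col])
        for rank, i in enumerate(order, start=1):
            out[i][col] = rank / n
    return out

def knn_dimension(points, k):
    """Average kNN local ID estimate (brute-force search instead of a KD-tree)."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)[:k]
        total += -k / sum(math.log(d / dists[-1]) for d in dists)
    return total / len(points)

def idd(data, k, noise_std=1e-6, seed=0):
    """Sketch of IDD(X, k): copula transform, tiny noise, dimensions, ratio."""
    rng = random.Random(seed)
    x = [tuple(v + rng.gauss(0.0, noise_std) for v in row)
         for row in copula_transform(data)]
    joint = knn_dimension(x, k)
    marginals = [knn_dimension([(row[c],) for row in x], k) for c in range(len(x[0]))]
    return (sum(marginals) - joint) / (sum(marginals) - max(marginals))

rng = random.Random(1)
n = 200
k = n // 4  # k of roughly n/4
t = [rng.uniform(-1, 1) for _ in range(n)]
monotone = [(v, v ** 3) for v in t]                 # functional, monotone relationship
independent = [(v, rng.uniform(-1, 1)) for v in t]  # no relationship

idd_monotone = idd(monotone, k)
idd_independent = idd(independent, k)
print(round(idd_monotone, 2), round(idd_independent, 2))
```

Because the copula transform only keeps ranks, the monotone cubic relationship is mapped onto the diagonal and should score close to 1, while the independent pair should score much lower. A KD-tree would replace the sorted brute-force search to approach the stated average complexity.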


Choice of k

Caveat: if k is chosen too small, no relationship will be identified even if there exists only a small amount of noise.

Figure: Two example relationships, (X1; Y1) and (X2; Y2), and their Intrinsic Dimensional Dependency, IDD(X1; Y1) and IDD(X2; Y2), as a function of k (0 to 400).

With small k the blue relationship gets scored ≈ 0.

In this work we chose k ≈ n/4.


Synthetic relationships

IDD behaves well on synthetic relationships:

- Rel. A: All variables are identical.
- Rel. B: There are multiple 1-dimensional manifolds.
- Rel. C: There is a functional relationship between one variable and the remaining variables.
- Rel. D: All variables are independent.

Figure: Four panels (Relationship A, B, C and D) plotting the scores of IDD, UDS and MAC (0 to 1) against the number of variables D (2 to 5).


Real data sets

Data set 1: identification of correlated traffic sensors in Melbourne

Due to the nature of traffic flow, the top 100 pairs identified by the dependency measure should consist of sensors that are geographically close (in km).

Measure | IDD | MID | MIC | MAC | UDS
Distance (km) | 6.6 ± 5.5 | 7.1 ± 5.0 | 7.1 ± 5.5 | 7.4 ± 5.6 | 7.5 ± 5.4

Data set 2: identify the building whose energy consumption is most dependent on the outdoor temperature at the University of Melbourne

Figure: Four panels with Max Temperature (0 to 60) on the horizontal axis: Parkville Building (top for IDD and UDS), Baillieu Library (top for MID), ICT Building (top for MIC), Sydney Myer Building (top for MAC).

IDD allows us to identify buildings whose cooling system has two different functioning regimes.


Conclusion - Summary

We discussed Intrinsic Dimensional Dependency (IDD), a dependency measure between multiple variables to identify manifold dependencies.

To achieve this goal:

- we identified the connection between α-Renyi dimension and local intrinsic dimensionality;
- we proposed novel global estimators of dimensionality.


Thank you.

Questions?

Simone Romano

[email protected]

@ialuronico

Code available online:

https://github.com/ialuronico


References I

Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M. E., Kawarabayashi, K.-i., and Nett, M. (2015). Estimating local intrinsic dimensionality. In SIGKDD, pages 29–38. ACM.

Houle, M. E. (2013). Dimensionality, discriminability, density and distance distributions. In Data Mining Workshops (ICDMW).

Nguyen, H.-V., Mandros, P., and Vreeken, J. (2016). Universal dependency analysis. In SDM.

Nguyen, H. V., Muller, E., Vreeken, J., Efros, P., and Bohm, K. (2014). Multivariate maximal correlation analysis. In ICML, pages 775–783.

Renyi, A. (1961). On measures of entropy and information.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science.


References II

Sugiyama, M. and Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In IJCAI.

von Brunken, J., Houle, M. E., and Zimek, A. (2015). Intrinsic dimensional outlier detection in high-dimensional data.
