Motivation Intrinsic Dimensionality Theory Intrinsic Dimensional Dependency Conclusions
ICPR 2016 – December 5th 2016
Measuring Dependency via Intrinsic Dimensionality
Simone Romano∗
@ialuronico
Oussama Chelly Nguyen Xuan Vinh James Bailey Michael E. Houle
∗Currently I am an applied scientist for in London UK
Simone Romano NII Tokyo
Motivation
Intrinsic Dimensionality Theory
Intrinsic Dimensional Dependency
Conclusions
Need for a novel type of Dependency Measure
There is clearly a strong dependency between the two sensors.
Figure: Scatter plot of the number of cars counted with sensor X1 against the number of cars counted with sensor X2, and the counts of both sensors over ten days, Mon (01) to Wed (10). Sensors X1 and X2 are 0.36 km apart.
However, state-of-the-art dependency measures struggle to identify non-functional dependencies of this kind: e.g., Pearson correlation = 0.3 < 1.
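The limitation of Pearson correlation on non-functional dependencies can be illustrated with a minimal sketch on synthetic data (not the traffic data above). The helper `pearson` and the two toy relationships are assumptions made for illustration: a line, which is a functional relationship, and a circle, which is a 1-dimensional manifold but not a function of either variable.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Functional (linear) relationship: y = 2x + 1.
ts = [i / 200 for i in range(200)]
r_line = pearson(ts, [2 * t + 1 for t in ts])

# Non-functional relationship: 200 evenly spaced points on the unit circle.
angles = [2 * math.pi * i / 200 for i in range(200)]
r_circle = pearson([math.cos(a) for a in angles], [math.sin(a) for a in angles])

print(f"line:   r = {r_line:.3f}")
print(f"circle: r = {r_circle:.3f}")
```

Both point sets lie on a 1-dimensional manifold, yet only the line receives a correlation close to 1; the circle is scored close to 0.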
Manifold Dependency
Variables that embed low-dimensional manifolds are strongly dependent.
Figure: Examples of D = 2 dependent variables that embed 1-dimensional manifolds. Panels: Linear, Quadratic, Cubic, Sinusoidal low freq., Sinusoidal high freq., 4th Root, Circle, Step Function, Two Lines, X, Sinusoidal varying freq., Circle-bar.
Proposal: Intrinsic Dimensional Dependency (IDD) measures dependency between D continuous variables $X = (X_1, \dots, X_D)$:
IDD(X) = 1 if the underlying distribution of X is restricted to a constant number of 1-dimensional manifolds.
Current state-of-the-art
| Dependency measure | Functional rel. | Manifold | D > 2 |
| --- | --- | --- | --- |
| MIC, Maximal Information Coefficient [Reshef et al., 2011] | ✓ | - | - |
| MAC, Multivariate mAximal Correlation [Nguyen et al., 2014] | ✓ | - | ✓ |
| UDS, Universal Dependency Score [Nguyen et al., 2016] | ✓ | - | ✓ |
| MID, Mutual Information Dimension [Sugiyama and Borgwardt, 2013] | ✓ | ✓ | - |
| IDD, Intrinsic Dimensional Dependency (ICPR 2016) | ✓ | ✓ | ✓ |
IDD measures manifold dependency between multiple variables.
IDD is based on intrinsic dimensionality theory.
Renyi dimension
α-Renyi dimension [Renyi, 1961] measures the dimensionality of the data:
$$\dim_\alpha(X) \triangleq \lim_{\delta \to 0^+} \frac{H_\alpha(X, \delta)}{\log(1/\delta)},$$
where $H_\alpha(X, \delta)$ is the α-Renyi entropy estimated on the discretized variable X using boxes of size δ.
Intuitively, the α-Renyi entropy quantifies the space-filling capacity of the data, and $\dim_\alpha(X)$ measures its growth rate:
small growth rate of the entropy ⇒ X embeds a small manifold.
However, the α-Renyi dimension is tricky to estimate when D is large.
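The definition above can be tried out with a simple box-counting sketch. This is an illustrative assumption rather than the estimator used in the talk: the limit δ → 0+ is replaced by a least-squares slope over a few hypothetical box sizes, and the names `renyi_entropy` and `renyi_dimension` are made up for this example.

```python
import math
import random
from collections import Counter

def renyi_entropy(points, delta, alpha=1.0):
    """Alpha-Renyi entropy (in nats) of points discretized into boxes of side delta."""
    boxes = Counter(tuple(math.floor(c / delta) for c in p) for p in points)
    probs = [count / len(points) for count in boxes.values()]
    if alpha == 1.0:
        # Shannon entropy is the alpha -> 1 limit of the Renyi entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

def renyi_dimension(points, deltas, alpha=1.0):
    """Slope of H_alpha(X, delta) against log(1/delta), fitted by least squares."""
    xs = [math.log(1.0 / d) for d in deltas]
    ys = [renyi_entropy(points, d, alpha) for d in deltas]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return num / sum((x - mean_x) ** 2 for x in xs)

rng = random.Random(0)
deltas = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
line = [(t, t) for t in (rng.random() for _ in range(5000))]       # 1-dimensional manifold
square = [(rng.random(), rng.random()) for _ in range(5000)]       # fills the unit square
line_dim = renyi_dimension(line, deltas)
square_dim = renyi_dimension(square, deltas)
print(f"line: {line_dim:.2f}, square: {square_dim:.2f}")
```

The number of boxes grows like (1/δ)^D, so for large D most boxes are empty at any usable sample size, which is one way to see why this direct approach becomes hard to use in high dimension.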
Local Intrinsic Dimensionality
The local Intrinsic Dimensionality (ID) was proposed by Houle [Houle, 2013] in the data mining community.
The local ID(x) measures the dimensionality of the manifold embedded in X at the locality x.
Here, we identify the connections between the local ID(x) and the global α-Renyi dimension.
This enables us to use existing local estimators of dimensionality for the α-Renyi dimension.
Figure: Local Intrinsic Dimensionality for different localities [von Brunken et al., 2015]. Axes: Column 0, Column 1, Column 2; color scale: estimated dimensionality (1 to 20).
Connection between local ID and α-Renyi dimension
Theorem 1. The α-Renyi dimension can be expressed as:
$$\dim_\alpha(X) = \frac{\int f^\alpha(x)\,\mathrm{ID}(x)\,dx}{\int f^\alpha(x)\,dx},$$
where f is the pdf of X.
Special case: for α = 1, dim(X) is the growth rate of the Shannon entropy.
According to Theorem 1, dim(X) is the expectation of the local ID:
$$\dim(X) = \int f(x)\,\mathrm{ID}(x)\,dx.$$
This can be estimated using the local ID estimators proposed in [Amsaleg et al., 2015].
kNN estimators of dimensionality
We propose kNN estimators for the intrinsic dimensionality $\dim_\alpha$ of the variables X.
In the special case of α = 1, we can simply average the local ID:
$$\dim(X) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{ID}(x_i) = -\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{k}\sum_{j=1}^{k}\ln\frac{d_j(x_i)}{d_k(x_i)}\right)^{-1},$$
where $d_j(x)$ is the distance of x to its j-th nearest neighbor.
We can employ the estimators for $\dim_\alpha$ to build a dependency measure between multiple variables.
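The α = 1 estimator above can be sketched in a few lines. The function names (`local_id`, `dim_estimate`) and the synthetic data are assumptions for illustration, and a brute-force neighbor search is used here for brevity instead of an index structure.

```python
import math
import random

def local_id(points, idx, k):
    """Local ID at points[idx] from the distances to its k nearest neighbors."""
    x = points[idx]
    dists = sorted(math.dist(x, p) for j, p in enumerate(points) if j != idx)[:k]
    d_k = dists[-1]  # distance to the k-th nearest neighbor
    mean_log = sum(math.log(d_j / d_k) for d_j in dists) / k
    return -1.0 / mean_log

def dim_estimate(points, k):
    """Average of the local ID estimates over all points (the alpha = 1 case)."""
    return sum(local_id(points, i, k) for i in range(len(points))) / len(points)

rng = random.Random(0)
line = [(t, 2 * t) for t in (rng.random() for _ in range(300))]    # 1-dimensional manifold
square = [(rng.random(), rng.random()) for _ in range(300)]        # fills the unit square
dim_line = dim_estimate(line, k=20)
dim_square = dim_estimate(square, k=20)
print(f"line: {dim_line:.2f}, square: {dim_square:.2f}")
```

On finite samples the estimates are only approximately 1 and 2, but the line should come out clearly lower than the square. The sketch assumes no duplicate points, since a zero distance would make the logarithm undefined.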
Building a dependency measure: choice of α
A bigger α makes the estimation of dimensionality less sensitive to noise.
Figure: A noisy relationship between X1 and X2, and the estimated dimensionality $\widehat{\dim}_\alpha(X_1, X_2)$ for α between 1 and 3. With α ≈ 3 the estimated dimensionality is too small: it is indeed 1, as if the figure depicted a perfect 1-dimensional manifold.
When building a dependency measure, we want to be sensitive to noise.
Therefore we use dim(X) with α = 1 to build a dependency measure because:
- it is sensitive to noise;
- it has a simple estimator;
- we can use properties of the Shannon entropy.
Intrinsic Dimensional Dependency
The Intrinsic Dimensional Dependency (IDD) for X:
$$\mathrm{IDD}(X) \triangleq \frac{\sum_{i=1}^{D}\dim(X_i) - \dim(X)}{\sum_{i=1}^{D}\dim(X_i) - \max_i \dim(X_i)}.$$
Properties
1. 0 ≤ IDD(X) ≤ 1;
2. IDD(X) = 0 iff all $X_i$ are independent;
3. IDD(X) = 1 if there exist one or more manifolds of dimension 1 whose union embeds X.
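As a small worked example of the definition, the sketch below evaluates the IDD ratio from given dimensionality values. The function name `idd_from_dims` and the idealized input dimensions are assumptions for illustration; in practice the dimensions would be estimated from data.

```python
def idd_from_dims(marginal_dims, joint_dim):
    """IDD ratio computed from the marginal dimensions dim(X_i) and the joint dimension dim(X)."""
    total = sum(marginal_dims)
    return (total - joint_dim) / (total - max(marginal_dims))

# D = 2, both variables 1-dimensional, joint data on a 1-dimensional manifold.
idd_manifold = idd_from_dims([1.0, 1.0], joint_dim=1.0)
# D = 2, independent variables: the joint data fills a 2-dimensional region.
idd_independent = idd_from_dims([1.0, 1.0], joint_dim=2.0)
# D = 3, joint data on a 2-dimensional surface: an intermediate score.
idd_surface = idd_from_dims([1.0, 1.0, 1.0], joint_dim=2.0)

print(idd_manifold, idd_independent, idd_surface)
```

The numerator is largest when the joint dimension is as small as the largest marginal dimension, which is why the denominator normalizes the score to the range in property 1.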
Algorithm to compute IDD
IDD is computed using a single parameter k (the number of nearest neighbors).
We use the copula transformation to make it invariant to the marginals.
We use KD-trees to speed up computations.
IDD(X, k)
1 Copula transform X
2 X = X + ε, where ε = 10^-6 Gaussian noise
3 Build KD-trees for X and X_i
4 Compute dim(X) and dim(X_i)
5 return IDD(X)
Average computational complexity: O(Dn log n + nk log n)
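The steps of the algorithm can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the copula transform is taken to be a rank transform, the noise is taken to be Gaussian with standard deviation 10^-6, and a brute-force neighbor search stands in for KD-trees (so the complexity above does not apply). All function names are hypothetical.

```python
import math
import random

def copula_transform(column):
    """Replace each value by its rank scaled to (0, 1] (assumed form of the copula transform)."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    ranks = [0.0] * len(column)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank / len(column)
    return ranks

def dim_estimate(points, k):
    """Average local ID over all points, using brute-force kNN instead of KD-trees."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, p) for j, p in enumerate(points) if j != i)[:k]
        total += -1.0 / (sum(math.log(d / dists[-1]) for d in dists) / k)
    return total / len(points)

def idd_score(columns, k, rng, noise=1e-6):
    """IDD of the variables in `columns` (one list of samples per variable)."""
    # Steps 1-2: copula transform each variable, then add small Gaussian noise.
    cols = [[u + rng.gauss(0.0, noise) for u in copula_transform(c)] for c in columns]
    # Steps 3-4: dimensionality of each marginal and of the joint data.
    marginal = [dim_estimate([(v,) for v in c], k) for c in cols]
    joint = dim_estimate(list(zip(*cols)), k)
    # Step 5: the IDD ratio.
    return (sum(marginal) - joint) / (sum(marginal) - max(marginal))

rng = random.Random(0)
x = [rng.random() for _ in range(300)]
y = [rng.random() for _ in range(300)]
score_identical = idd_score([x, x], k=20, rng=rng)
score_independent = idd_score([x, y], k=20, rng=rng)
print(f"identical: {score_identical:.2f}, independent: {score_independent:.2f}")
```

Identical variables should score close to 1. With this naive estimator and a small k, the score for independent variables is clearly lower but may not reach 0 because of finite-sample bias in the dimensionality estimates.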
Choice of k
Caveat: if k is chosen too small, no relationship will be identified even if there exists only a small amount of noise.
Figure: Two relationships, (X1; Y1) and (X2; Y2), and their Intrinsic Dimensional Dependency scores IDD(X1; Y1) and IDD(X2; Y2) as a function of k (0 to 400).
With small k the blue relationship gets scored ≈ 0.
In this work we chose k ≈ n/4.
Synthetic relationships
IDD behaves well on synthetic relationships:
- Rel. A: All variables are identical.
- Rel. B: There are multiple 1-dimensional manifolds.
- Rel. C: There is a functional relationship between one variable and the remaining variables.
- Rel. D: All variables are independent.
Figure: Scores (0 to 1) of IDD, UDS and MAC against the number of variables D (2 to 5). Panels: Relationship A, Relationship B, Relationship C, Relationship D.
Real data sets
Data set 1: identification of correlated traffic sensors in Melbourne.
Due to the nature of traffic flow, the top 100 pairs identified by the dependency measure should consist of sensors that are geographically close (in km).
| IDD | MID | MIC | MAC | UDS |
| --- | --- | --- | --- | --- |
| 6.6 ± 5.5 | 7.1 ± 5.0 | 7.1 ± 5.5 | 7.4 ± 5.6 | 7.5 ± 5.4 |
Data set 2: identify the building whose energy consumption is most dependent on the outdoor temperature at the University of Melbourne.
Figure: Energy consumption against Max Temperature for the top-ranked building of each measure. Panels: Parkville Building (top for IDD and UDS), Baillieu Library (top for MID), ICT Building (top for MIC), Sydney Myer Building (top for MAC).
IDD makes it possible to identify buildings whose cooling system has two different functioning regimes.
Conclusion - Summary
We discussed Intrinsic Dimensional Dependency (IDD), a dependency measure between multiple variables to identify manifold dependencies.
To achieve this goal:
- we identified the connection between α-Renyi dimension and local intrinsic dimensionality;
- we proposed novel global estimators of dimensionality.
Thank you.
Questions?
Simone Romano
@ialuronico
Code available online:
https://github.com/ialuronico
References
Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M. E., Kawarabayashi, K.-i., and Nett, M. (2015). Estimating local intrinsic dimensionality. In SIGKDD, pages 29–38. ACM.
Houle, M. E. (2013). Dimensionality, discriminability, density and distance distributions. In Data Mining Workshops (ICDMW).
Nguyen, H.-V., Mandros, P., and Vreeken, J. (2016). Universal dependency analysis. SDM.
Nguyen, H. V., Muller, E., Vreeken, J., Efros, P., and Bohm, K. (2014). Multivariate maximal correlation analysis. In ICML, pages 775–783.
Renyi, A. (1961). On measures of entropy and information.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science.
Sugiyama, M. and Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In IJCAI.
von Brunken, J., Houle, M. E., and Zimek, A. (2015). Intrinsic dimensional outlier detection in high-dimensional data.