Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets: removing linear and non-linear attribute correlations


Antonio Canabrava Fraideinberze

Jose F Rodrigues-Jr

Robson Leonardo Ferreira Cordeiro

Databases and Images Group

University of São Paulo

São Carlos - SP - Brazil

Terabytes? How to analyze that data?

Terabytes? Parallel processing and dimensionality reduction, for sure...

Terabytes? ..., but how to remove linear and non-linear attribute correlations, besides irrelevant attributes?

Terabytes? ..., and how to reduce dimensionality without human supervision and in a task-independent way?

Terabytes? Curl-Remover

Medium-dimensionality


Agenda

Fundamental Concepts

Related Work

Proposed Method

Evaluation

Conclusion



Fractal Theory


Embedded, Intrinsic and Fractal Correlation Dimension

Fractal Correlation Dimension ≅ Intrinsic Dimension

Embedded dimension ≅ 3, intrinsic dimension ≅ 1

Embedded dimension ≅ 3, intrinsic dimension ≅ 2

Fractal Correlation Dimension - Box Counting

[Figure: box-counting plots, horizontal axis log(r)]

Multidimensional Quad-tree [Traina Jr. et al., 2000]
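The box-counting idea can be illustrated with a small serial sketch. It is not the authors' implementation: the function names, the chaos-game sampler and the grid levels are assumptions made for illustration, and a plain dictionary of cells stands in for the multidimensional quad-tree. The sketch counts points per grid cell of side r, sums the squared counts, and takes the slope of log(sum of squared counts) against log(r) as the estimate of the fractal correlation dimension.

```python
import math
import random
from collections import Counter

def sierpinski(n, seed=0):
    """Sample n points of a right-angled Sierpinski triangle (chaos game)."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    x, y = 0.3, 0.3
    points = []
    for i in range(n + 20):          # the first 20 iterations are burn-in
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= 20:
            points.append((x, y))
    return points

def box_counts(points, level):
    """Count points per grid cell of side r = 2**-level in the unit cube."""
    side = 2 ** level
    return Counter(tuple(min(int(v * side), side - 1) for v in p) for p in points)

def correlation_dimension(points, levels=range(2, 7)):
    """Least-squares slope of log(sum of squared counts) versus log(r)."""
    xs, ys = [], []
    for level in levels:
        counts = box_counts(points, level)
        xs.append(math.log(2.0 ** -level))
        ys.append(math.log(sum(c * c for c in counts.values())))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For the Sierpinski sample the estimate should land near log(3)/log(2), and for points on a straight line embedded in three dimensions it should land near 1, which matches the embedded versus intrinsic dimension contrast above.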


Related Work

Dimensionality Reduction - Taxonomy 1

Dimensionality Reduction
    Supervised Algorithms
    Unsupervised Algorithms
        Principal Component Analysis
        Singular Value Decomposition
        Fractal Dimension Reduction

Dimensionality Reduction - Taxonomy 2

Dimensionality Reduction
    Feature Extraction
        Principal Component Analysis
        Singular Value Decomposition
    Feature Selection
        Fractal Dimension Reduction
        Embedded
        Filter
        Wrapper

Terabytes? Existing methods need supervision, miss non-linear correlations, cannot handle Big Data, or work for classification only.


General Idea

Removes the E - ⌈D2⌉ least relevant attributes, one at a time, in ascending order of relevance (E = number of attributes, D2 = fractal correlation dimension).

Builds partial trees for the full dataset and for its E (E-1)-dimensional projections.

Data pair: TreeID + cell spatial position, partial count of points.

Sums partial point counts and reports log(r) and log(sum2) for each tree.
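The counting steps above can be mimicked in plain Python, without Hadoop. This is only a sketch under stated assumptions: `FULL`, `mapper` and `reducer` are hypothetical names, a TreeID of j means "the projection that leaves attribute j out", attribute values are assumed to lie in [0, 1], and log(sum2) is read as the log of the sum of squared cell counts.

```python
import math
from collections import Counter, defaultdict

FULL = -1  # hypothetical TreeID of the full dataset; j >= 0 means attribute j is left out

def mapper(points, levels):
    """Emit ((tree_id, level, cell), partial_count) pairs for one data split."""
    partial = Counter()
    for p in points:
        for tree_id in [FULL] + list(range(len(p))):
            proj = tuple(v for j, v in enumerate(p) if j != tree_id)
            for level in levels:
                side = 2 ** level
                cell = tuple(min(int(v * side), side - 1) for v in proj)
                partial[(tree_id, level, cell)] += 1
    return list(partial.items())

def reducer(pairs):
    """Sum partial counts per cell; report (log r, log sum2) per tree and level."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    sum2 = defaultdict(int)
    for (tree_id, level, _cell), count in totals.items():
        sum2[(tree_id, level)] += count * count   # assumed meaning of "sum2"
    report = defaultdict(list)
    for (tree_id, level), s in sorted(sum2.items()):
        report[tree_id].append((math.log(2.0 ** -level), math.log(s)))
    return dict(report)
```

Calling `reducer(mapper(split_a, levels) + mapper(split_b, levels))` gives the same report as running a single mapper over all points, which is the property the shuffle relies on.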

Computes D2 for the full dataset and pD2 for each of its E (E-1)-dimensional projections.

Spots the least relevant attribute, i.e., the one not in the projection that minimizes | D2 - pD2 |.

Spots the second least relevant attribute …
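A serial sketch of the selection loop may help to fix the idea. It is an illustration, not the distributed method: `d2`, `select_features` and `demo_data` are hypothetical names, the box-counting estimate is a simple stand-in, and the sketch assumes that after each removal the comparison is made against the D2 of the already reduced dataset.

```python
import math
import random
from collections import Counter

def d2(points, levels=range(1, 6)):
    """Box-counting estimate of the fractal correlation dimension."""
    xs, ys = [], []
    for level in levels:
        side = 2 ** level
        cells = Counter(tuple(min(int(v * side), side - 1) for v in p) for p in points)
        xs.append(math.log(1.0 / side))
        ys.append(math.log(sum(c * c for c in cells.values())))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

def select_features(points):
    """Remove E - ceil(D2) attributes, least relevant first; return (kept, removed)."""
    kept = list(range(len(points[0])))
    to_remove = len(kept) - math.ceil(d2(points))
    removed = []
    for _ in range(max(to_remove, 0)):
        target = d2([tuple(p[j] for j in kept) for p in points])
        gaps = {}
        for a in kept:
            # pD2 of the projection that leaves attribute a out
            proj = [tuple(p[j] for j in kept if j != a) for p in points]
            gaps[a] = abs(target - d2(proj))
        least_relevant = min(gaps, key=gaps.get)
        kept.remove(least_relevant)
        removed.append(least_relevant)
    return kept, removed

def demo_data(n, seed=1):
    """Sierpinski triangle (x, y) plus a third attribute z = x*x (non-linear copy)."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    x, y, data = 0.3, 0.3, []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        data.append((x, y, x * x))
    return data
```

On `demo_data(5000)` the estimated D2 lies between 1 and 2, so one of the three attributes is removed. Since x and z carry the same information, either may be dropped, while y should be kept.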

3 Main Issues

1st issue: too much data to be shuffled; one data pair per cell/tree.

2nd issue: one data pass per irrelevant attribute.

3rd issue: not enough memory for mappers.

Proposed Method

Curl-Remover

1st issue: too much data to be shuffled; one data pair per cell/tree.

Our solution: two-phase dimensionality reduction:

a) Serial feature selection in a tiny data sample (one reducer). Used to speed up processing only;

b) All mappers project data into a fixed subspace.

Builds/reports the N (2 or 3) tree levels of lowest resolution, plus the points projected into the M (2 or 3) most relevant attributes of the sample.

Builds the full trees from their low-resolution level cells and the projected points.

High-resolution cells are never shuffled.

2nd issue: one data pass per irrelevant attribute.

Our solution: stores/reads the tree level of highest resolution, instead of the original data.

Rdb = cost to read the dataset;

TWRtree = cost to transfer, write and read the last tree level in the next reduce step;

If (Rdb > TWRtree) then writes the tree.
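The decision rule can be written down as a tiny cost comparison. The cost model below is an assumption made for illustration only (costs as bytes divided by throughput); `IoModel` and all function names are hypothetical, and the slides do not state how Rdb and TWRtree are actually measured.

```python
from dataclasses import dataclass

@dataclass
class IoModel:
    """Hypothetical throughputs, in bytes per second."""
    read: float
    write: float
    transfer: float

def cost_read_dataset(dataset_bytes, io):
    """Rdb: cost to read the dataset once more."""
    return dataset_bytes / io.read

def cost_tree_level(tree_bytes, io):
    """TWRtree: cost to transfer, write and read the last tree level."""
    return tree_bytes / io.transfer + tree_bytes / io.write + tree_bytes / io.read

def should_write_tree(dataset_bytes, tree_bytes, io):
    """Write the tree only when re-reading the dataset would cost more."""
    return cost_read_dataset(dataset_bytes, io) > cost_tree_level(tree_bytes, io)
```

With a dataset much larger than the last tree level the rule picks the tree; when both have similar sizes, the extra transfer and write costs make re-reading the dataset the cheaper option.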

Writes the tree's last level to HDFS.

Reads the tree's last level from HDFS.

Reads the dataset only twice.

3rd issue: not enough memory for mappers.

Our solution: sorts data in mappers and reports "tree slices" whenever needed.

Each mapper sorts its local points and builds "tree slices" while monitoring memory consumption.

[Figure: points plotted on X and Y axes]

Reports "tree slices" with very little overlap.
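One way to obtain slices with little overlap is sketched below. It is an assumption about how such slices could be built, not the method from the slides: points are sorted by a Z-order (Morton) key of their finest-level cell, so the points of any cell at any level form one contiguous run, and the cell counts are flushed whenever a hypothetical budget `max_cells` would be exceeded. All names are illustrative and attribute values are assumed to lie in [0, 1].

```python
from collections import Counter

def morton_key(cell, bits):
    """Interleave the bits of the cell coordinates (Z-order), high bits first."""
    key = 0
    for b in range(bits - 1, -1, -1):
        for c in cell:
            key = (key << 1) | ((c >> b) & 1)
    return key

def tree_slices(points, levels, max_cells):
    """Yield Counters mapping (level, cell) to a count, each with at most max_cells keys.

    Assumes max_cells >= len(levels).
    """
    bits = max(levels)
    side = 2 ** bits
    finest = [tuple(min(int(v * side), side - 1) for v in p) for p in points]
    finest.sort(key=lambda cell: morton_key(cell, bits))
    current = Counter()
    for cell in finest:
        # the cell of this point at every requested level
        keys = [(lv, tuple(c >> (bits - lv) for c in cell)) for lv in levels]
        new = sum(k not in current for k in keys)
        if current and len(current) + new > max_cells:
            yield current            # report the slice and free the memory
            current = Counter()
        for k in keys:
            current[k] += 1
    if current:
        yield current
```

Because of the sort order, two consecutive slices can share at most one cell per level, namely the cells whose runs straddle the flush point. Summing the slices gives the same counts as building the tree in one piece.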


Evaluation

Datasets

Sierpinski - Sierpinski Triangle + 1 attribute linearly correlated + 2 attributes non-linearly correlated. 5 attributes, 1.1 billion points;

Sierpinski Hybrid - Sierpinski Triangle + 1 attribute non-linearly correlated + 2 random attributes. 5 attributes, 1.1 billion points;

Yahoo! Network Flows - communication patterns between end-users in the web. 12 attributes, 562 million points;

Astro - high-resolution cosmological simulation. 6 attributes, 1 billion points;

Hepmass - physics-related dataset with particles of unknown mass. 28 attributes, 10.5 million points;

Hepmass Duplicated - Hepmass + 28 correlated attributes. 56 attributes, 10.5 million points.

[Figure: Fractal Dimension - Hepmass]

[Figure: Fractal Dimension - Hepmass Duplicated]

[Figure: Comparison with sPCA - Classification]

8% more accurate, 7.5% faster

[Figure: Comparison with sPCA - Percentage of Fractal Dimension after selection]


Conclusions

Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% more accurate than sPCA;

Scalability - scales linearly with the data size (theoretical analysis); experiments with up to 1.1 billion points;

Unsupervised - it requires neither a user guess of the number of attributes to be removed nor a training set;

Semantics - it is a feature selection method, thus it maintains the semantics of the attributes;

Generality - it suits analytical tasks in general, not only classification.
