© 2015 Applied Communication Sciences - A Vencore Labs Company.
Registered Trademark of Vencore Labs, Inc.
Product Formalisms for Measures on Spaces with Binary Tree Structures: Representation, Visualization, Inference, Decision and Application
July 2015
D. Bassu*, L. Ness**, Peter W. Jones***, D. Shallcross**
IP Application Analysis – Collaboration with R. Izmailov**
Partially Supported by AFOSR Grant Agreement FA9550-10-1-0125: Applications to Network Dynamics of Positive Measure and Product Formalisms: Analysis, Synthesis, Visualization and Missing Data Approximation
*AIG, Inc, ** Applied Communication Sciences, *** Yale University
Talk Outline

Background
Product Formula Representation Theorem
Visualization Theorem
Multi-scale Noise Model Theorems
Inference and Decision Methodology
Example 1: Classification of LIDAR data using product coefficients
Example 2: Classification of IP data using fused product coefficients
Product Coefficient Statistics and Haar-like Functions

An ordered binary set system S on a set X defines, for every set S, disjoint left and right subsets whose union is their parent set: L(S) ∪ R(S) = S.
− Example: X = [0,1]; S = the dyadic sub-intervals.
− The scale of a set is its "depth" in the tree.
Given a non-negative measure μ on S, define a product coefficient statistic a_S for each set S by
μ(L(S)) = ½ (1 + a_S) μ(S)
μ(R(S)) = ½ (1 - a_S) μ(S)
If μ(S) = 0, define a_S = 0.
The product coefficients measure relative volume; they take values in [-1, 1].
Let dy denote the naïve dyadic measure: dy(L(S)) = dy(R(S)) = ½ dy(S).
For each set S define a Haar-like function h_S:
h_S = 1 on L(S) and -1 on R(S)
h_S = 0 on X - S
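For the dyadic sub-intervals of [0,1], the product coefficients of an empirical measure can be computed by a simple recursion. A minimal sketch (the function name and the (scale, index) keying are our own conventions, not from the slides):

```python
import numpy as np

def product_coefficients(points, depth, lo=0.0, hi=1.0):
    """Product coefficients a_S of the empirical measure of `points`
    on the dyadic subintervals of [lo, hi), down to `depth` scales.
    Returns a dict mapping (scale, index) -> a_S."""
    coeffs = {}

    def recurse(pts, scale, index, a, b):
        if scale >= depth:
            return
        mid = 0.5 * (a + b)
        left, right = pts[pts < mid], pts[pts >= mid]
        n = len(pts)
        # mu(L(S)) = (1/2)(1 + a_S) mu(S)  =>  a_S = (n_L - n_R) / n;
        # a_S = 0 by definition when mu(S) = 0.
        coeffs[(scale, index)] = 0.0 if n == 0 else (len(left) - len(right)) / n
        recurse(left, scale + 1, 2 * index, a, mid)
        recurse(right, scale + 1, 2 * index + 1, mid, b)

    recurse(np.asarray(points, dtype=float), 0, 0, lo, hi)
    return coeffs
```

For example, for the points {0.1, 0.2, 0.3, 0.8} the scale-0 coefficient is (3 - 1)/4 = 0.5.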
Product Formula Representation Theorem

Theorem: A non-negative measure μ on the sigma algebra generated by the sets of an ordered binary set system S on a set X has a unique representation
μ = ∏_{S ∈ S} (1 + a_S h_S) dy
Any assignment of product coefficients in (-1, 1) determines a positive measure of total volume 1.
Any assignment of product coefficients in [-1, 1] which assigns zero product coefficients to sets of zero volume determines a non-negative measure of total volume 1.
The product formula for measures on the unit interval (for the dyadic sub-interval binary set system) appeared in "The Theory of Weights and the Dirichlet Problem for Elliptic Equations" by R. Fefferman, C. Kenig, and J. Pipher (Annals of Math., 1991).
Kolaczyk and Nowak (Annals of Statistics, 2004) also researched multiscale probability models.
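The converse direction of the theorem is constructive: given any coefficient assignment, the product formula builds a measure of total mass 1 scale by scale. A sketch, with coefficients keyed by (scale, index) pairs (our convention; missing keys are treated as 0):

```python
import numpy as np

def measure_from_coefficients(coeffs, depth):
    """Given product coefficients a_S in [-1, 1] indexed by (scale, index),
    return the masses mu(S) of the 2**depth finest dyadic intervals.
    Total mass is 1 by construction."""
    masses = np.array([1.0])  # mu(X) = 1
    for scale in range(depth):
        nxt = np.empty(2 * len(masses))
        for i, m in enumerate(masses):
            a = coeffs.get((scale, i), 0.0)
            nxt[2 * i] = 0.5 * (1 + a) * m      # mu(L(S))
            nxt[2 * i + 1] = 0.5 * (1 - a) * m  # mu(R(S))
        masses = nxt
    return masses
```

With all coefficients zero this reproduces the naïve dyadic measure dy; feeding back the coefficients computed from data reconstructs the empirical measure at the chosen depth.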
Remarks on Product Coefficients

Identify relative change for a scale in a "locality"; zero means no change.
Provide a very general method for inferring an approximate statistical model for a data set: practical for a few scales in high dimensions and for many scales in low dimensions (1, 2, 3, ...).
Provide a canonical set of parameters for all measures on the sigma algebra generated by a binary set system.
Provide a simple formula for a dyadic Wasserstein metric between two probability measures.
Relation to standard statistics: product coefficients are "signed standard deviations".
Joint distributions are often represented using product formulas determined by conditional probabilities (e.g. in probabilistic graphical models), where the graph structure is usually determined by domain knowledge. The PF Representation Theorem provides a canonical representation of any measure on a space X as a tree-structured product formula relative to an ordered binary tree structure on the data space X; the product coefficients are conditional probabilities.
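The slides do not spell out the dyadic Wasserstein formula. The standard tree-Wasserstein expression, W(μ, ν) = Σ_S 2^(-scale(S)) |μ(S) - ν(S)|, is one natural candidate; the sketch below assumes that formula and takes both measures as mass vectors on the finest dyadic intervals:

```python
import numpy as np

def dyadic_wasserstein(mu_fine, nu_fine):
    """Dyadic (tree) approximation to the Wasserstein-1 distance between two
    probability measures on [0,1], given as masses of the 2**d finest dyadic
    intervals. Sums 2**(-scale) * |mu(S) - nu(S)| over all dyadic sets S."""
    mu = np.asarray(mu_fine, dtype=float)
    nu = np.asarray(nu_fine, dtype=float)
    d = int(np.log2(len(mu)))
    total = 0.0
    for scale in range(d, 0, -1):
        total += 2.0 ** (-scale) * np.abs(mu - nu).sum()
        mu = mu.reshape(-1, 2).sum(axis=1)  # coarsen one scale
        nu = nu.reshape(-1, 2).sum(axis=1)
    return total
```

This tree metric dominates the usual W1 on the line (moving mass between sibling intervals pays the full tree distance), so it is an upper-bound surrogate rather than the exact Euclidean Wasserstein distance.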
Wind Data
Dataset from NREL gives wind speeds and potential wind turbine power levels for a large number of locations and elevations across the U.S., every 10 minutes for three years
− “These wind power data ("Data") are provided by the National Renewable Energy Laboratory ("NREL"), which is operated by the Alliance for Sustainable Energy ("Alliance") for the U.S. Department Of Energy ("DOE"). “
We looked at wind speeds for a single year, single location and elevation.
We can compare the product coefficients for each day with the original time series.
Product coefficients provide normalized representations of multi-scale wind patterns
Four Specific Days of Wind

• The Jan 16 and Dec 23 wind patterns have relatively little variation, so their product coefficients are small.
• The Mar 1 and Sep 27 wind patterns have a minimum in the early afternoon, so the scale-0 coefficients are > 0, the first scale-1 coefficient is > 0, and the second scale-1 coefficient is < 0.
Product coefficients specify multi-scale patterns.
Canonical Visualization of Positive Measures
Positive measures are associated with a plane Jordan curve Γ via the "welding map".
Let F_+ be a choice of Riemann mapping from the unit disk to the inside of Γ, and let F_- be a choice of Riemann mapping from the exterior of the unit disk to the outside of Γ. The welding map for Γ is
Φ = F_-^{-1} ∘ F_+
It is a homeomorphism of the unit circle to itself.
Because the welding map is a homeomorphism, its derivative is a measure. In fact, it is a finite measure if and only if the welding map has bounded variation.
The welding map of the von Koch snowflake curve is an example of a map of the unit circle whose derivative is not only singular with respect to Lebesgue measure, but in fact is supported on a set of Hausdorff dimension less than 1. There are examples of homeomorphisms of the unit circle which are not welding maps.
It suffices to visualize the measure on the unit circle or unit interval with the same product coefficients.
Measure Visualization Theorem

Theorem:
− A positive measure on the unit circle represented by a finite product formula is the derivative of a welding map determined by a Jordan curve. The Jordan curve is unique up to Möbius transformations.
− A positive measure on the unit circle represented by an infinite product formula, whose product coefficients under any rotation of the circle are bounded away from +1 and -1, is the derivative of a welding map determined by a Jordan curve, unique up to Möbius transformations.
Proof: in the original portion of Lars Ahlfors' book "Lectures on Quasiconformal Mappings".
Multi-scale Noise Model for a Measure

A multi-scale noise model for a measure μ multiplies each factor of the product formula by
exp(σ_S Z_S h_S - ½ σ_S² χ_S)
where {Z_S} is a set of independent Gaussian random variables with mean 0 and variance 1, and χ_S is the characteristic function of S (so each factor has expectation 1).
Note: if σ² < 2 and the product coefficients are all zero, the noise-model measure determines a welding curve with probability 1 (Jones et al., Acta Math. 2012).
We will assume the noise parameters depend only on the scale.
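One draw from this noise model can be simulated directly on the finest dyadic intervals, multiplying each factor (1 + a_S h_S) by exp(σ_s Z_S h_S - ½ σ_s² χ_S) with σ_s depending only on the scale s. This is a sketch under our reading of the (garbled in the source) exponent; the layout and names are illustrative:

```python
import numpy as np

def noisy_density(coeffs, sigma, depth, rng):
    """One sample of the noise-model measure: masses of the 2**depth finest
    dyadic intervals of [0,1]. coeffs maps (scale, index) -> a_S (missing = 0);
    sigma[s] is the noise level at scale s; rng is a numpy Generator."""
    dens = np.ones(2 ** depth)  # running product of factors, per finest cell
    for scale in range(depth):
        block = 2 ** (depth - scale - 1)  # finest cells per half of S
        for i in range(2 ** scale):
            a = coeffs.get((scale, i), 0.0)
            z = rng.standard_normal()
            s = sigma[scale]
            # factor is (1 + a) e^{s z - s^2/2} on L(S), (1 - a) e^{-s z - s^2/2} on R(S)
            start = i * 2 * block
            dens[start:start + block] *= (1 + a) * np.exp(s * z - 0.5 * s * s)
            dens[start + block:start + 2 * block] *= (1 - a) * np.exp(-s * z - 0.5 * s * s)
    return dens / 2 ** depth  # multiply by dy-mass of each finest cell

rng = np.random.default_rng(0)
masses = noisy_density({(0, 0): 0.5}, sigma=[0.3, 0.3], depth=2, rng=rng)
```

With all σ_s = 0 this reduces to the noiseless product formula; with noise, the total mass is 1 only in expectation.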
Multi-Scale Noise Model Theorem

Theorem: Given a measure μ on the sets S of a dyadic set system S on X with a finite product formula, and a noise model for the measure:
− The first-order approximation for the mean of a product coefficient of the noise model is the non-noisy product coefficient.
− An upper bound for the second-order approximation of the variance of the product coefficients of the noise model is
C Σ_{s ≤ scale(S)} σ_s²
Inference and Decision Methodology

Inference:
− Product coefficients from a set of samples of a measure can be averaged to obtain an estimate of the product coefficients of the underlying measure.
Decision:
− Product coefficients distinguish measures, so they can be used to determine rules to classify samples of two unknown measures.
− Confidence intervals for decisions can be determined from a single sample of a measure using a multi-scale noise model for the measure.
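The inference step is a per-set average of the sampled coefficients. A sketch, using a dict-of-coefficients layout that is our own convention (a set absent from a sample's dict is treated as coefficient 0):

```python
import numpy as np

def estimate_coefficients(sample_coeffs):
    """Average product coefficients over several sampled data sets to
    estimate the coefficients of the underlying measure.
    sample_coeffs: list of dicts mapping (scale, index) -> a_S."""
    keys = set().union(*sample_coeffs)
    return {k: float(np.mean([c.get(k, 0.0) for c in sample_coeffs]))
            for k in keys}
```

The multi-scale noise model theorem above is what justifies this: to first order the noisy coefficients are unbiased, and their variance bound gives per-scale confidence intervals around these averages.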
Experimental Validation

We validated that product coefficients distinguish real-world empirical measures in two supervised learning experiments:
− LIDAR data: distinguish "ground" points from "vegetation" points.
− IP data: distinguish 6 IP addresses by 12 observed time series.
In supervised learning, the goal is to find a classification rule that distinguishes two empirical measures on the same domain, i.e. points labeled with two different labels.
We demonstrated that a highly precise classification rule stated in terms of the product coefficients can be found using the Support Vector Machine (SVM) algorithm.
− For the LIDAR data we showed that it was possible to use histograms of the product coefficients to "guess" a good approximation to the rule found by SVM.
− For the IP data we showed that SVM with the radial basis kernel produced a much more accurate rule than SVM with the linear kernel, which demonstrates the inherently non-linear nature of the separation between the data sets. The product coefficients were concatenated to form an 84-dimensional vector.
The experiments validated that product coefficients distinguish real empirical measures just as predicted by the product formula representation theorem, i.e. product coefficient theory enables automated computation of features.
Classification of IP data using Fused
Product Coefficients
July 2015
D. Bassu and R. Izmailov
Data Collection & Pre-processing

The subset of the IP traffic samples corresponding to port 22 (SSH/SCP service) was selected for analysis.
For daily profiling, a day boundary was defined as running from 8:00:00 AM until 7:59:59.999 AM the next day.
For any given IPv4 address, the traffic on port 22 could be viewed as local (i.e., the IP corresponds to the server) or as remote (i.e., the IP corresponds to the client).
The following 12 raw feature signals were processed for each IP for each day:
− Packets inbound (local & remote, collected separately)
− Packets outbound (local & remote, collected separately)
− Bytes inbound (local & remote, collected separately)
− Bytes outbound (local & remote, collected separately)
− Degree inbound (local, collected separately)
− Degree outbound (local, collected separately)
PDPM coefficients were computed for each of the raw signals listed above for scales 0, 1, 2 (the finest time interval is of size 3 hours).
The total collected and pre-processed data size: 8915 vectors of dimension 240.
Of these 240 features, the 84 most informative features were used subsequently for analysis.
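With a 24-hour day split into eight 3-hour bins, scales 0-2 give 1 + 2 + 4 = 7 coefficients per signal, and 12 signals give 84 values; whether this is exactly how the 84 retained features decompose is our assumption, not stated on the slide. A sketch of building such a daily feature vector:

```python
import numpy as np

def pdpm_features(daily_signals):
    """One day's raw signals -> product coefficients at scales 0-2.
    daily_signals: 12 arrays of length 8 (3-hour totals over one day).
    Returns 12 * 7 = 84 coefficients as a flat vector."""
    feats = []
    for sig in daily_signals:
        masses = np.asarray(sig, dtype=float)  # 8 finest-bin masses
        for scale in range(3):                 # scales 0, 1, 2
            # aggregate the 8 bins into 2**(scale+1) cells at this scale
            cells = masses.reshape(2 ** (scale + 1), -1).sum(axis=1)
            for i in range(0, len(cells), 2):  # sibling pairs (L(S), R(S))
                tot = cells[i] + cells[i + 1]
                feats.append(0.0 if tot == 0 else (cells[i] - cells[i + 1]) / tot)
    return np.array(feats)
```

A flat signal yields an all-zero feature vector; bursty SSH activity concentrated in a few hours shows up as coefficients near +1 or -1 at the corresponding scales.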
Data Filtering

There are many IPs with a very small number of active days, which are probably not very informative.
As a result, we selected only the top six IPv4 addresses (in terms of number of days active).
The threshold for inclusion in the top IP addresses was set at "at least 140 days of activity".
(Figure: histogram of active days per IP. The top IPs have more than 140 active days; plenty of IPs have just a few active days.)
Top IP Addresses: Visualization

Several visualization methods were explored.
Here we show the visualization of all top six IP addresses after diffusion mapping to 10 dimensions with subsequent PCA into 3 dimensions.
Colors correspond to days of the week: Sun – dark blue, Mon – blue, Tue – light blue/cyan, Wed – green, Thu – yellow, Fri – orange/red, Sat – dark red.
It is clear that more accurate methods are needed to differentiate among these addresses.
Machine Learning Approach to IP Classification

In order to classify the top 6 IPs (with IDs 1055, 1174, 1184, 2276, 2407, and 2616) based on their observable characteristics, we designed six classification decision rules (known as "one-vs-others"):
− Correctly classify vectors from IP ID 1055 against all other IP IDs
− Correctly classify vectors from IP ID 1174 against all other IP IDs
− Correctly classify vectors from IP ID 1184 against all other IP IDs
− Correctly classify vectors from IP ID 2276 against all other IP IDs
− Correctly classify vectors from IP ID 2407 against all other IP IDs
− Correctly classify vectors from IP ID 2616 against all other IP IDs
In order to construct these rules, we extracted from the overall dataset 875 vectors of dimension 84, representing an almost equal mix of all top six IPs.
We did not conduct any feature selection (which could be done using the computed mutual information of each of the 84 individual features), so our designed rules were 84-dimensional and required, as their input, computation of the full set of the corresponding PDPM coefficients.
The goal of our machine learning study was to design decision rules that correctly classify any of these vectors to its correct ID, and to assess the performance of these classifications by:
(1) Error rate: probability that the classification is incorrect.
(2) Sensitivity: probability that a vector of the targeted class is correctly identified.
(3) Specificity: probability that a vector not belonging to the targeted class is correctly identified.
Our machine learning approach was based on applying two versions (linear and non-linear (RBF)) of the current state-of-the-art SVM (Support Vector Machines) algorithm to the constructed dataset.
Each of 10 runs consists of a random selection of 3/4 of the overall set as training set and assignment of the complementary 1/4 as test set (on which the performance metrics were measured).
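One run of this protocol can be sketched with scikit-learn (our choice of library; the slides do not name an implementation, and the function below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def one_vs_others_run(X, ids, target_id, kernel="rbf", seed=0):
    """One run: random 3/4 train / 1/4 test split, then the sensitivity,
    specificity and error rate of the one-vs-others SVM decision rule."""
    y = (np.asarray(ids) == target_id).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    pred = SVC(kernel=kernel).fit(X_tr, y_tr).predict(X_te)
    sens = (pred[y_te == 1] == 1).mean()   # targeted class correctly identified
    spec = (pred[y_te == 0] == 0).mean()   # other classes correctly rejected
    error = (pred != y_te).mean()
    return sens, spec, error
```

Averaging (sens, spec, error) over seeds 0..9 reproduces the structure of the tables on the next two slides (not their exact numbers, which depend on the real 875-vector dataset); switching `kernel="linear"` gives the linear-SVM variant.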
Performance of the linear SVM decision rule can be quite poor for some of the IPs (especially for 1184 and 2276). An error rate of 30%+ essentially makes those decision rules useless for practical applications.
IP ID 1174 vs all others IP ID 2407 vs all others IP ID 1184 vs all others
sens spec error sens spec error sens spec error
Run 1 90.71% 63.89% 13.70% 81.18% 54.55% 22.83% 62.57% 90.63% 33.33%
Run 2 85.95% 73.53% 15.98% 87.30% 33.33% 20.09% 62.43% 78.26% 34.25%
Run 3 82.70% 76.47% 18.26% 85.41% 44.12% 21.00% 68.68% 75.68% 30.14%
Run 4 77.60% 66.67% 24.20% 91.57% 51.22% 15.98% 58.79% 81.48% 35.62%
Run 5 82.51% 77.78% 18.26% 89.42% 46.67% 16.44% 67.42% 85.37% 29.22%
Run 6 87.71% 67.50% 15.98% 86.78% 44.44% 21.92% 58.79% 86.49% 36.53%
Run 7 86.36% 81.40% 14.61% 83.43% 57.89% 21.00% 58.64% 96.43% 36.53%
Run 8 85.87% 80.00% 15.07% 90.16% 58.33% 15.07% 65.03% 88.89% 31.05%
Run 9 89.13% 62.86% 15.07% 90.00% 46.15% 17.81% 68.68% 81.08% 29.22%
Run 10 82.07% 80.00% 18.26% 92.86% 40.54% 15.98% 68.48% 82.86% 29.22%
Average 85.06% 73.01% 16.94% 87.81% 47.72% 18.81% 63.95% 84.72% 32.51%
IP ID 2616 vs all others IP ID 1055 vs all others IP ID 2276 vs all others
sens spec error sens spec error sens spec error
Run 1 98.31% 88.10% 3.65% 99.47% 100.00% 0.46% 72.00% 90.91% 24.20%
Run 2 96.67% 89.74% 4.57% 100.00% 100.00% 0.00% 72.04% 96.97% 24.20%
Run 3 98.84% 87.23% 3.65% 100.00% 100.00% 0.00% 75.94% 96.88% 21.00%
Run 4 96.28% 90.32% 4.57% 98.95% 100.00% 0.91% 58.12% 96.43% 36.99%
Run 5 98.34% 81.58% 4.57% 98.88% 100.00% 0.91% 42.70% 97.06% 48.86%
Run 6 96.69% 84.21% 5.48% 100.00% 100.00% 0.00% 66.84% 100.00% 28.31%
Run 7 98.34% 73.68% 5.94% 99.46% 100.00% 0.46% 56.59% 100.00% 36.07%
Run 8 97.16% 88.37% 4.57% 100.00% 100.00% 0.00% 44.32% 100.00% 47.03%
Run 9 95.58% 94.74% 4.57% 99.47% 100.00% 0.46% 72.93% 92.11% 23.74%
Run 10 99.44% 80.49% 4.11% 100.00% 97.44% 0.46% 79.14% 90.63% 19.18%
Average 97.57% 85.85% 4.57% 99.62% 99.74% 0.37% 64.06% 96.10% 30.96%
Performance of Linear SVM
IP ID 1174 vs all others IP ID 2407 vs all others IP ID 1184 vs all others
sens spec error sens spec error sens spec error
Run 1 97.81% 66.67% 7.31% 97.85% 75.76% 5.48% 95.72% 59.38% 9.59%
Run 2 94.05% 58.82% 11.42% 100.00% 80.00% 2.74% 95.38% 63.04% 11.42%
Run 3 92.43% 67.65% 11.42% 99.46% 73.53% 4.57% 94.51% 75.68% 8.68%
Run 4 88.52% 66.67% 15.07% 100.00% 65.85% 6.39% 96.97% 62.96% 11.42%
Run 5 91.80% 63.89% 12.79% 100.00% 73.33% 3.65% 95.51% 65.85% 10.05%
Run 6 92.74% 62.50% 12.79% 98.85% 75.56% 5.94% 95.60% 54.05% 11.42%
Run 7 98.30% 53.49% 10.50% 97.79% 63.16% 8.22% 92.67% 75.00% 9.59%
Run 8 97.28% 65.71% 7.76% 92.35% 94.44% 7.31% 98.91% 61.11% 7.31%
Run 9 94.02% 74.29% 9.13% 96.67% 66.67% 8.68% 92.31% 62.16% 12.79%
Run 10 92.93% 57.14% 12.79% 92.31% 86.49% 8.68% 94.02% 62.86% 10.96%
Average 93.99% 63.68% 11.10% 97.53% 75.48% 6.17% 95.16% 64.21% 10.32%
IP ID 2616 vs all others IP ID 1055 vs all others IP ID 2276 vs all others
sens spec error sens spec error sens spec error
Run 1 98.87% 83.33% 4.11% 100.00% 100.00% 0.00% 98.29% 72.73% 6.85%
Run 2 96.11% 82.05% 6.39% 100.00% 97.30% 0.46% 96.77% 87.88% 4.57%
Run 3 99.42% 85.11% 3.65% 100.00% 97.14% 0.46% 96.26% 96.88% 3.65%
Run 4 96.81% 90.32% 4.11% 98.42% 100.00% 1.37% 96.86% 89.29% 4.11%
Run 5 98.34% 81.58% 4.57% 98.88% 100.00% 0.91% 98.38% 88.24% 3.20%
Run 6 96.13% 86.84% 5.48% 100.00% 96.30% 0.46% 97.33% 90.63% 3.65%
Run 7 97.79% 81.58% 5.02% 100.00% 100.00% 0.00% 99.45% 78.38% 4.11%
Run 8 97.16% 93.02% 3.65% 100.00% 100.00% 0.00% 100.00% 73.53% 4.11%
Run 9 95.58% 94.74% 4.57% 99.47% 100.00% 0.46% 91.16% 89.47% 9.13%
Run 10 99.44% 68.29% 6.39% 95.56% 100.00% 3.65% 97.86% 84.38% 4.11%
Average 97.57% 84.69% 4.79% 99.23% 99.07% 0.78% 97.24% 85.14% 4.75%
Performance of RBF SVM
Performance of the RBF SVM decision rule is significantly better than that of the linear SVM decision rule (on the given dataset) in all cases except "classification of ID 1055 versus all others", where the already very small error rate of 0.37% (for linear SVM) increased to 0.78% (for RBF SVM). However, the error rate for that case remains below 1%, which still makes RBF SVM an excellent choice for classifying the top IPs in our dataset.
Lidar Data Analysis using Product
Coefficients
July 2015
D. Shallcross
Lidar Dataset
10 sets of data, made publicly available by Brodu & Lague at http://nicolas.brodu.net/recherche/canupo/index.html
Each consists of points in R^3, representing a sample from the visible surface of a pioneer salt marsh in Mont Saint-Michel Bay, with scattered low vegetation against a sandy surface.
Points are labelled as "ground" or "vegetation".
Set   # pts   Veg   Ground
1 44677 4608 40069
2 7248 4828 2420
3 7886 1411 6475
4 102990 22351 80639
5 138864 65337 73527
6 8167 4757 3410
7 88151 2908 85243
8 151989 66500 85489
9 71604 20639 50965
10 17944 5871 12073
Product Formula Coefficients
Apply an individual translation to each of the 10 data sets, and one common scaling factor for each of the x, y, and z coordinates, to bring each data set into the unit cube with a common median point.
Treat each transformed dataset as a measure, with point masses at the individual Lidar sample points.
Compute product formula coefficients on these measures down to depth 30.
− Subdividing on each coordinate axis in succession (x,y,z,x,y,z,…)
− Avoiding computing any coefficients corresponding to regions of the cube with no data points.
We can look at histograms of all of the computed coefficients for all ten of the data sets, by level.
− +1 and -1 are most common, then 0.
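The alternating-axis subdivision can be sketched as follows (the binary-string region codes are our own indexing; regions without data points are skipped, as on the slide):

```python
import numpy as np

def cube_coefficients(pts, depth):
    """Product coefficients for a 3-D point cloud in the unit cube,
    splitting on x, y, z, x, y, z, ... in succession. Returns a dict
    mapping a region's binary code ('' = whole cube) to its coefficient;
    regions containing no points are not stored."""
    coeffs = {}

    def recurse(p, lo, hi, level, code):
        if level >= depth or len(p) == 0:
            return
        ax = level % 3                       # x, y, z, x, y, z, ...
        mid = 0.5 * (lo[ax] + hi[ax])
        mask = p[:, ax] < mid
        n_left, n_right = mask.sum(), (~mask).sum()
        coeffs[code] = (n_left - n_right) / len(p)
        hi_left = hi.copy(); hi_left[ax] = mid
        lo_right = lo.copy(); lo_right[ax] = mid
        recurse(p[mask], lo, hi_left, level + 1, code + "0")
        recurse(p[~mask], lo_right, hi, level + 1, code + "1")

    recurse(np.asarray(pts, dtype=float), np.zeros(3), np.ones(3), 0, "")
    return coeffs
```

Points lying on a surface parallel to a splitting plane tend to fall entirely on one side of it, which is why coefficients of ±1 dominate the histograms.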
Each coefficient represents a division of a sub-region of the cube into two pieces. We can associate with each point the coefficients corresponding to sub-regions containing the point.
If we make histograms of the associated coefficients by level for points in the two classes, we see differences.
On levels 9, 12, 15, 18, and 21 the ground points have coefficients clustering around +1 and -1, while vegetation points have coefficients clustering around 0.
Levels 10, 13, 14, 16, 17, 19, and 20 show the reverse behavior.
(Figure: histograms of the associated coefficients by level, "Ground" and "Vegetation" panels.)
Ground vs Vegetation
Ground vs Vegetation
Explanation: A coefficient value of +1 or -1 corresponds to a subdivision where all of the points lie in one of the two halves. The Lidar points represent the observed surface of objects, and coefficients of +1 or -1 are more likely when the surface is parallel to the plane of the subdivision. The ground points lie on a surface more nearly parallel to the subdivisions by z-coordinate, while the vegetation points are more likely to be parallel to the subdivisions by x- or y-coordinate.
An ad hoc discriminator, formed by taking the difference between the sum of absolute values of the coefficients on levels 9, 12, 15, 18, and 21 and the sum of absolute values of the coefficients on levels 10, 13, 14, 16, 17, 19, and 20, achieves some separation between the two classes.
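That ad hoc discriminator is a one-line score per point (the dict layout of per-point coefficients is our own convention; the cut-off on the score is then chosen on training data):

```python
GROUND_LEVELS = (9, 12, 15, 18, 21)              # ground coefficients cluster at +/-1
VEG_LEVELS = (10, 13, 14, 16, 17, 19, 20)        # vegetation coefficients cluster at +/-1

def adhoc_score(point_coeffs):
    """point_coeffs: dict mapping level -> coefficient of the region
    containing the point at that level. Larger scores suggest 'ground'."""
    ground = sum(abs(point_coeffs.get(lvl, 0.0)) for lvl in GROUND_LEVELS)
    veg = sum(abs(point_coeffs.get(lvl, 0.0)) for lvl in VEG_LEVELS)
    return ground - veg
```

Classification then compares the score against a threshold fit on the training sets (1, 2, 4, 6, and 8 in the experiments below).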
Support Vector Machines
We applied a machine learning algorithm called a linear support vector machine (SVM) to try to find an optimal linear separator between the ground points and the vegetation points, using the product coefficients.
We trained the system on data sets 1, 2, 4, 6, and 8, and evaluated the generated separator on data sets 3, 5, 7, 9, and 10.
We also evaluated the separator function on the entire data set, giving the histogram on the right.
In comparison, the ad hoc discriminator does not perform as well. Training on data sets 1, 2, 4, 6, and 8 to determine the cut-off value and evaluating on 3, 5, 7, 9, and 10 gives:
SVM Predicted
Ground Vegetation Total Error %
Actual Ground 217986 10297 228283 4.51%
Vegetation 14774 81392 96166 15.36%
Total 232760 91689 324449 7.73%
Ad Hoc Predicted
Ground Vegetation Total Error %
Actual Ground 213301 14982 228283 6.56%
Vegetation 18711 77455 96166 19.46%
Total 232012 92437 324449 10.38%
Errors from SVM Classification

We can see some pattern in the errors from the SVM classification. Here are 3 different rotated views of the classification for data set 10.
Color codes:
− Red: ground classified as ground
− Black: ground classified as vegetation
− Green: vegetation classified as vegetation
− Blue: vegetation classified as ground
We have an area of ground misclassified as vegetation near the main bush, and a separate bush mostly misclassified as ground.
Comparison with Multi-Scale SVD

In previous work, we had applied Multi-Scale Singular Value Decomposition to this classification problem.
− R. Izmailov, D. Bassu, A. McIntosh, L. Ness, and D. Shallcross, "Application of multi-scale singular vector decomposition to vessel classification in overhead satellite imagery", Proc. SPIE 9631, Seventh International Conference on Digital Image Processing.
This method selects regions of several scales around each point, and associates each point with the singular values and vectors of the subset of points that lie in each of these regions.
Using this information, we were able to use an SVM to obtain a linear classifier with much better performance.
− Again, training on data sets 1, 2, 4, 6, and 8 and evaluating on data sets 3, 5, 7, 9, and 10.
Using just the singular values and vectors for three scales, we had 3% misclassification of vegetation as ground and 2% misclassification of ground as vegetation.
Using additional discretization preprocessing, and including the (x, y, z) coordinates, we obtained 2% for both types of misclassification.
(Figure: nested regions at scale 1, scale 2, and scale 3 around a point.)