© 2015 Applied Communication Sciences - A Vencore Labs Company.
Registered Trademark of Vencore Labs, Inc.
Product Formalisms for Measures on Spaces with Binary Tree Structures: Representation, Visualization, Inference, Decision and Application
July 2015
D. Bassu*, L. Ness**, Peter W. Jones***, D. Shallcross**
IP Application Analysis – Collaboration with R. Izmailov**
Partially Supported by AFOSR Grant Agreement FA9550-10-1-0125: Applications to Network Dynamics of Positive Measure and Product Formalisms: Analysis, Synthesis, Visualization and Missing Data Approximation
*AIG, Inc, ** Applied Communication Sciences, *** Yale University
Talk Outline

Background
Product Formula Representation Theorem
Visualization Theorem
Multi-scale Noise Model Theorems
Inference and Decision Methodology
Example 1: Classification of LIDAR data using product coefficients
Example 2: Classification of IP data using fused product coefficients
Product Coefficient Statistics and Haar-like Functions

An ordered binary set system S on a set X defines, for every set S, disjoint left and right subsets whose union is their parent set: L(S) ∪ R(S) = S.
− Example: X = [0,1]; S = the dyadic sub-intervals.
− The scale of a set is its "depth" in the tree.
Given a non-negative measure μ on S, define a product coefficient statistic a_S for each set S by
μ(L(S)) = ½ (1 + a_S) μ(S)
μ(R(S)) = ½ (1 - a_S) μ(S)
If μ(S) = 0, define a_S = 0.
The product coefficients measure relative volume; they take values in [-1, 1].
Let dy denote the naïve dyadic measure: dy(L(S)) = dy(R(S)) = ½ dy(S).
For each set S define a Haar-like function h_S:
h_S = 1 on L(S) and -1 on R(S)
h_S = 0 on X - S
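For the dyadic sub-intervals of [0,1], the product coefficients of an empirical measure can be computed by a simple recursion. A minimal sketch (the function name and the (scale, index) keying are our own conventions, not from the slides):

```python
import numpy as np

def product_coefficients(points, depth, lo=0.0, hi=1.0):
    """Product coefficients a_S of the empirical measure of `points`
    on the dyadic subintervals of [lo, hi), down to `depth` scales.
    Returns a dict mapping (scale, index) -> a_S."""
    coeffs = {}

    def recurse(pts, scale, index, a, b):
        if scale >= depth:
            return
        mid = 0.5 * (a + b)
        left, right = pts[pts < mid], pts[pts >= mid]
        n = len(pts)
        # mu(L(S)) = (1/2)(1 + a_S) mu(S)  =>  a_S = (n_L - n_R) / n;
        # a_S = 0 by definition when mu(S) = 0.
        coeffs[(scale, index)] = 0.0 if n == 0 else (len(left) - len(right)) / n
        recurse(left, scale + 1, 2 * index, a, mid)
        recurse(right, scale + 1, 2 * index + 1, mid, b)

    recurse(np.asarray(points, dtype=float), 0, 0, lo, hi)
    return coeffs
```

For example, for the points {0.1, 0.2, 0.3, 0.8} the scale-0 coefficient is (3 - 1)/4 = 0.5.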
Product Formula Representation Theorem

Theorem: A non-negative measure μ on the sigma algebra generated by the sets of an ordered binary set system S on a set X has a unique representation
μ = ∏_{S ∈ S} (1 + a_S h_S) dy
Any assignment of product coefficients in (-1, 1) determines a positive measure of total volume 1.
Any assignment of product coefficients in [-1, 1] which assigns zero product coefficients to sets of zero volume determines a non-negative measure of total volume 1.
The product formula for measures on the unit interval (for the dyadic sub-interval binary set system) appeared in "The Theory of Weights and the Dirichlet Problem for Elliptic Equations" by R. Fefferman, C. Kenig, and J. Pipher (Annals of Math., 1991).
Kolaczyk and Nowak (Annals of Statistics, 2004) also researched multiscale probability models.
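The converse direction of the theorem is constructive: given any coefficient assignment, the product formula builds a measure of total mass 1 scale by scale. A sketch, with coefficients keyed by (scale, index) pairs (our convention; missing keys are treated as 0):

```python
import numpy as np

def measure_from_coefficients(coeffs, depth):
    """Given product coefficients a_S in [-1, 1] indexed by (scale, index),
    return the masses mu(S) of the 2**depth finest dyadic intervals.
    Total mass is 1 by construction."""
    masses = np.array([1.0])  # mu(X) = 1
    for scale in range(depth):
        nxt = np.empty(2 * len(masses))
        for i, m in enumerate(masses):
            a = coeffs.get((scale, i), 0.0)
            nxt[2 * i] = 0.5 * (1 + a) * m      # mu(L(S))
            nxt[2 * i + 1] = 0.5 * (1 - a) * m  # mu(R(S))
        masses = nxt
    return masses
```

With all coefficients zero this reproduces the naïve dyadic measure dy; feeding back the coefficients computed from data reconstructs the empirical measure at the chosen depth.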
Remarks on Product Coefficients

Identify relative change for a scale in a "locality"; zero means no change.
Provide a very general method for inferring an approximate statistical model for a data set: practical for a few scales in high dimensions and for many scales in low dimensions (1, 2, 3, ...).
Provide a canonical set of parameters for all measures on the sigma algebra generated by a binary set system.
Provide a simple formula for a dyadic Wasserstein metric between two probability measures.
Relation to standard statistics: product coefficients are "signed standard deviations".
Joint distributions are often represented using product formulas determined by conditional probabilities (e.g. in probabilistic graphical models), where the graph structure is usually determined by domain knowledge. The PF Representation Theorem provides a canonical representation of any measure on a space X as a tree-structured product formula relative to an ordered binary tree structure on the data space X; the product coefficients are conditional probabilities.
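The slides do not spell out the dyadic Wasserstein formula. The standard tree-Wasserstein expression, W(μ, ν) = Σ_S 2^(-scale(S)) |μ(S) - ν(S)|, is one natural candidate; the sketch below assumes that formula and takes both measures as mass vectors on the finest dyadic intervals:

```python
import numpy as np

def dyadic_wasserstein(mu_fine, nu_fine):
    """Dyadic (tree) approximation to the Wasserstein-1 distance between two
    probability measures on [0,1], given as masses of the 2**d finest dyadic
    intervals. Sums 2**(-scale) * |mu(S) - nu(S)| over all dyadic sets S."""
    mu = np.asarray(mu_fine, dtype=float)
    nu = np.asarray(nu_fine, dtype=float)
    d = int(np.log2(len(mu)))
    total = 0.0
    for scale in range(d, 0, -1):
        total += 2.0 ** (-scale) * np.abs(mu - nu).sum()
        mu = mu.reshape(-1, 2).sum(axis=1)  # coarsen one scale
        nu = nu.reshape(-1, 2).sum(axis=1)
    return total
```

This tree metric dominates the usual W1 on the line (moving mass between sibling intervals pays the full tree distance), so it is an upper-bound surrogate rather than the exact Euclidean Wasserstein distance.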
Wind Data
Dataset from NREL gives wind speeds and potential wind turbine power levels for a large number of locations and elevations across the U.S., every 10 minutes for three years
− “These wind power data ("Data") are provided by the National Renewable Energy Laboratory ("NREL"), which is operated by the Alliance for Sustainable Energy ("Alliance") for the U.S. Department Of Energy ("DOE"). “
We looked at wind speeds for a single year, single location and elevation.
We can compare the product coefficients for each day with the original time series.
Product coefficients provide normalized representations of multi-scale wind patterns
Four Specific Days of Wind

• The Jan 16 and Dec 23 wind patterns have relatively little variation, so their product coefficients are small.
• The Mar 1 and Sep 27 wind patterns have a minimum in the early afternoon, so the scale-0 coefficients are > 0, the first scale-1 coefficient is > 0, and the second scale-1 coefficient is < 0.
Product coefficients specify multi-scale patterns.
Canonical Visualization of Positive Measures
Positive measures are associated with a plane Jordan curve Γ via the "welding map".
Let F_+ be a choice of Riemann mapping from the unit disk to the inside of Γ, and let F_- be a choice of Riemann mapping from the exterior of the unit disk to the outside of Γ. The welding map for Γ is
Φ = F_-^{-1} ∘ F_+
It is a homeomorphism of the unit circle to itself.
Because the welding map is a homeomorphism, its derivative is a measure. In fact, it is a finite measure if and only if the welding map has bounded variation.
The welding map of the von Koch snowflake curve is an example of a map of the unit circle whose derivative is not only singular with respect to Lebesgue measure, but in fact is supported on a set of Hausdorff dimension less than 1. There are examples of homeomorphisms of the unit circle which are not welding maps.
It suffices to visualize the measure on the unit circle or unit interval with the same product coefficients.
Measure Visualization Theorem

Theorem:
− A positive measure on the unit circle represented by a finite product formula is the derivative of a welding map determined by a Jordan curve. The Jordan curve is unique up to Möbius transformations.
− A positive measure on the unit circle represented by an infinite product formula, whose product coefficients under any rotation of the circle are bounded away from +1 and -1, is the derivative of a welding map determined by a Jordan curve, unique up to Möbius transformations.
Proof: in the original portion of Lars Ahlfors' book "Lectures on Quasiconformal Mappings".
Multi-scale Noise Model for a Measure

A multi-scale noise model for a measure μ multiplies each factor of the product formula by
exp(σ_S Z_S h_S - ½ σ_S² χ_S)
where {Z_S} is a set of independent Gaussian random variables with mean 0 and variance 1, and χ_S is the characteristic function of S (so each factor has expectation 1).
Note: if σ² < 2 and the product coefficients are all zero, the noise-model measure determines a welding curve with probability 1 (Jones et al., Acta Math. 2012).
We will assume the noise parameters depend only on the scale.
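One draw from this noise model can be simulated directly on the finest dyadic intervals, multiplying each factor (1 + a_S h_S) by exp(σ_s Z_S h_S - ½ σ_s² χ_S) with σ_s depending only on the scale s. This is a sketch under our reading of the (garbled in the source) exponent; the layout and names are illustrative:

```python
import numpy as np

def noisy_density(coeffs, sigma, depth, rng):
    """One sample of the noise-model measure: masses of the 2**depth finest
    dyadic intervals of [0,1]. coeffs maps (scale, index) -> a_S (missing = 0);
    sigma[s] is the noise level at scale s; rng is a numpy Generator."""
    dens = np.ones(2 ** depth)  # running product of factors, per finest cell
    for scale in range(depth):
        block = 2 ** (depth - scale - 1)  # finest cells per half of S
        for i in range(2 ** scale):
            a = coeffs.get((scale, i), 0.0)
            z = rng.standard_normal()
            s = sigma[scale]
            # factor is (1 + a) e^{s z - s^2/2} on L(S), (1 - a) e^{-s z - s^2/2} on R(S)
            start = i * 2 * block
            dens[start:start + block] *= (1 + a) * np.exp(s * z - 0.5 * s * s)
            dens[start + block:start + 2 * block] *= (1 - a) * np.exp(-s * z - 0.5 * s * s)
    return dens / 2 ** depth  # multiply by dy-mass of each finest cell

rng = np.random.default_rng(0)
masses = noisy_density({(0, 0): 0.5}, sigma=[0.3, 0.3], depth=2, rng=rng)
```

With all σ_s = 0 this reduces to the noiseless product formula; with noise, the total mass is 1 only in expectation.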
Multi-Scale Noise Model Theorem

Theorem: Given a measure μ on the sets S of a dyadic set system S on X with a finite product formula, and a noise model for the measure:
− The first-order approximation for the mean of a product coefficient of the noise model is the non-noisy product coefficient.
− An upper bound for the second-order approximation of the variance of the product coefficients of the noise model is
C Σ_{s ≤ scale(S)} σ_s²
Inference and Decision Methodology

Inference:
− Product coefficients from a set of samples of a measure can be averaged to obtain an estimate of the product coefficients of the underlying measure.
Decision:
− Product coefficients distinguish measures, so they can be used to determine rules to classify samples of two unknown measures.
− Confidence intervals for decisions can be determined from a single sample of a measure using a multi-scale noise model for the measure.
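The inference step is a per-set average of the sampled coefficients. A sketch, using a dict-of-coefficients layout that is our own convention (a set absent from a sample's dict is treated as coefficient 0):

```python
import numpy as np

def estimate_coefficients(sample_coeffs):
    """Average product coefficients over several sampled data sets to
    estimate the coefficients of the underlying measure.
    sample_coeffs: list of dicts mapping (scale, index) -> a_S."""
    keys = set().union(*sample_coeffs)
    return {k: float(np.mean([c.get(k, 0.0) for c in sample_coeffs]))
            for k in keys}
```

The multi-scale noise model theorem above is what justifies this: to first order the noisy coefficients are unbiased, and their variance bound gives per-scale confidence intervals around these averages.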
Experimental Validation

We validated that product coefficients distinguish real-world empirical measures in two supervised learning experiments:
− LIDAR data: distinguish "ground" points from "vegetation" points.
− IP data: distinguish 6 IP addresses by 12 observed time series.
In supervised learning, the goal is to find a classification rule that distinguishes two empirical measures on the same domain, i.e. points labeled with two different labels.
We demonstrated that a highly precise classification rule stated in terms of the product coefficients can be found using the Support Vector Machine (SVM) algorithm.
− For the LIDAR data we showed that it was possible to use histograms of the product coefficients to "guess" a good approximation to the rule found by SVM.
− For the IP data we showed that SVM with the radial basis kernel produced a much more accurate rule than SVM with the linear kernel, which demonstrates the inherently non-linear nature of the separation between the data sets. The product coefficients were concatenated to form an 84-dimensional vector.
The experiments validated that product coefficients distinguish real empirical measures just as predicted by the product formula representation theorem, i.e. product coefficient theory enables automated computation of features.
Classification of IP data using Fused
Product Coefficients
July 2015
D. Bassu and R. Izmailov
Data Collection & Pre-processing

The subset of the IP traffic samples corresponding to port 22 (SSH/SCP service) was selected for analysis.
For daily profiling, a day boundary was defined as running from 8:00:00 AM until 7:59:59.999 AM the next day.
For any given IPv4 address, the traffic on port 22 could be viewed as local (i.e., the IP corresponds to the server) or as remote (i.e., the IP corresponds to the client).
The following 12 raw feature signals were processed for each IP for each day:
− Packets inbound (local & remote, collected separately)
− Packets outbound (local & remote, collected separately)
− Bytes inbound (local & remote, collected separately)
− Bytes outbound (local & remote, collected separately)
− Degree inbound (local, collected separately)
− Degree outbound (local, collected separately)
PDPM coefficients were computed for each of the raw signals listed above for scales 0, 1, 2 (the finest time interval is of size 3 hours).
The total collected and pre-processed data size: 8915 vectors of dimension 240.
Of these 240 features, the 84 most informative features were used subsequently for analysis.
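With a 24-hour day split into eight 3-hour bins, scales 0-2 give 1 + 2 + 4 = 7 coefficients per signal, and 12 signals give 84 values; whether this is exactly how the 84 retained features decompose is our assumption, not stated on the slide. A sketch of building such a daily feature vector:

```python
import numpy as np

def pdpm_features(daily_signals):
    """One day's raw signals -> product coefficients at scales 0-2.
    daily_signals: 12 arrays of length 8 (3-hour totals over one day).
    Returns 12 * 7 = 84 coefficients as a flat vector."""
    feats = []
    for sig in daily_signals:
        masses = np.asarray(sig, dtype=float)  # 8 finest-bin masses
        for scale in range(3):                 # scales 0, 1, 2
            # aggregate the 8 bins into 2**(scale+1) cells at this scale
            cells = masses.reshape(2 ** (scale + 1), -1).sum(axis=1)
            for i in range(0, len(cells), 2):  # sibling pairs (L(S), R(S))
                tot = cells[i] + cells[i + 1]
                feats.append(0.0 if tot == 0 else (cells[i] - cells[i + 1]) / tot)
    return np.array(feats)
```

A flat signal yields an all-zero feature vector; bursty SSH activity concentrated in a few hours shows up as coefficients near +1 or -1 at the corresponding scales.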
Data Filtering

There are many IPs with a very small number of active days, which are probably not very informative.
As a result, we selected only the top six IPv4 addresses (in terms of number of days active).
The threshold for inclusion in the top IP addresses was set at "at least 140 days of activity".
(Figure: histogram of active days per IP. The top IPs have more than 140 active days; plenty of IPs have just a few active days.)
Top IP Addresses: Visualization

Several visualization methods were explored.
Here we show the visualization of all top six IP addresses after diffusion mapping to 10 dimensions with subsequent PCA into 3 dimensions.
Colors correspond to days of the week: Sun – dark blue, Mon – blue, Tue – light blue/cyan, Wed – green, Thu – yellow, Fri – orange/red, Sat – dark red.
It is clear that more accurate methods are needed to differentiate among these addresses.
Machine Learning Approach to IP Classification

In order to classify the top 6 IPs (with IDs 1055, 1174, 1184, 2276, 2407, and 2616) based on their observable characteristics, we designed six classification decision rules (known as "one-vs-others"):
− Correctly classify vectors from IP ID 1055 against all other IP IDs
− Correctly classify vectors from IP ID 1174 against all other IP IDs
− Correctly classify vectors from IP ID 1184 against all other IP IDs
− Correctly classify vectors from IP ID 2276 against all other IP IDs
− Correctly classify vectors from IP ID 2407 against all other IP IDs
− Correctly classify vectors from IP ID 2616 against all other IP IDs
In order to construct these rules, we extracted from the overall dataset 875 vectors of dimension 84, representing an almost equal mix of all top six IPs.
We did not conduct any feature selection (which could be done using the computed mutual information of each of the 84 individual features), so our designed rules were 84-dimensional and required, as their input, computation of the full set of the corresponding PDPM coefficients.
The goal of our machine learning study was to design decision rules that correctly classify any of these vectors to its correct ID, and to assess the performance of these classifications by:
(1) Error rate: probability that the classification is incorrect.
(2) Sensitivity: probability that a vector of the targeted class is correctly identified.
(3) Specificity: probability that a vector not belonging to the targeted class is correctly identified.
Our machine learning approach was based on applying two versions (linear and non-linear (RBF)) of the current state-of-the-art SVM (Support Vector Machines) algorithm to the constructed dataset.
Each of 10 runs consists of a random selection of 3/4 of the overall set as training set and assignment of the complementary 1/4 as test set (on which the performance metrics were measured).
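One run of this protocol can be sketched with scikit-learn (our choice of library; the slides do not name an implementation, and the function below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def one_vs_others_run(X, ids, target_id, kernel="rbf", seed=0):
    """One run: random 3/4 train / 1/4 test split, then the sensitivity,
    specificity and error rate of the one-vs-others SVM decision rule."""
    y = (np.asarray(ids) == target_id).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    pred = SVC(kernel=kernel).fit(X_tr, y_tr).predict(X_te)
    sens = (pred[y_te == 1] == 1).mean()   # targeted class correctly identified
    spec = (pred[y_te == 0] == 0).mean()   # other classes correctly rejected
    error = (pred != y_te).mean()
    return sens, spec, error
```

Averaging (sens, spec, error) over seeds 0..9 reproduces the structure of the tables on the next two slides (not their exact numbers, which depend on the real 875-vector dataset); switching `kernel="linear"` gives the linear-SVM variant.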
Performance of the linear SVM decision rule can be quite poor for some of the IPs (especially for 1184 and 2276). An error rate of 30%+ essentially makes those decision rules useless for practical applications.
IP ID 1174 vs all others IP ID 2407 vs all others IP ID 1184 vs all others
sens spec error sens spec error sens spec error
Run 1 90.71% 63.89% 13.70% 81.18% 54.55% 22.83% 62.57% 90.63% 33.33%
Run 2 85.95% 73.53% 15.98% 87.30% 33.33% 20.09% 62.43% 78.26% 34.25%
Run 3 82.70% 76.47% 18.26% 85.41% 44.12% 21.00% 68.68% 75.68% 30.14%
Run 4 77.60% 66.67% 24.20% 91.57% 51.22% 15.98% 58.79% 81.48% 35.62%
Run 5 82.51% 77.78% 18.26% 89.42% 46.67% 16.44% 67.42% 85.37% 29.22%
Run 6 87.71% 67.50% 15.98% 86.78% 44.44% 21.92% 58.79% 86.49% 36.53%
Run 7 86.36% 81.40% 14.61% 83.43% 57.89% 21.00% 58.64% 96.43% 36.53%
Run 8 85.87% 80.00% 15.07% 90.16% 58.33% 15.07% 65.03% 88.89% 31.05%
Run 9 89.13% 62.86% 15.07% 90.00% 46.15% 17.81% 68.68% 81.08% 29.22%
Run 10 82.07% 80.00% 18.26% 92.86% 40.54% 15.98% 68.48% 82.86% 29.22%
Average 85.06% 73.01% 16.94% 87.81% 47.72% 18.81% 63.95% 84.72% 32.51%
IP ID 2616 vs all others IP ID 1055 vs all others IP ID 2276 vs all others
sens spec error sens spec error sens spec error
Run 1 98.31% 88.10% 3.65% 99.47% 100.00% 0.46% 72.00% 90.91% 24.20%
Run 2 96.67% 89.74% 4.57% 100.00% 100.00% 0.00% 72.04% 96.97% 24.20%
Run 3 98.84% 87.23% 3.65% 100.00% 100.00% 0.00% 75.94% 96.88% 21.00%
Run 4 96.28% 90.32% 4.57% 98.95% 100.00% 0.91% 58.12% 96.43% 36.99%
Run 5 98.34% 81.58% 4.57% 98.88% 100.00% 0.91% 42.70% 97.06% 48.86%
Run 6 96.69% 84.21% 5.48% 100.00% 100.00% 0.00% 66.84% 100.00% 28.31%
Run 7 98.34% 73.68% 5.94% 99.46% 100.00% 0.46% 56.59% 100.00% 36.07%
Run 8 97.16% 88.37% 4.57% 100.00% 100.00% 0.00% 44.32% 100.00% 47.03%
Run 9 95.58% 94.74% 4.57% 99.47% 100.00% 0.46% 72.93% 92.11% 23.74%
Run 10 99.44% 80.49% 4.11% 100.00% 97.44% 0.46% 79.14% 90.63% 19.18%
Average 97.57% 85.85% 4.57% 99.62% 99.74% 0.37% 64.06% 96.10% 30.96%
Performance of Linear SVM
IP ID 1174 vs all others IP ID 2407 vs all others IP ID 1184 vs all others
sens spec error sens spec error sens spec error
Run 1 97.81% 66.67% 7.31% 97.85% 75.76% 5.48% 95.72% 59.38% 9.59%
Run 2 94.05% 58.82% 11.42% 100.00% 80.00% 2.74% 95.38% 63.04% 11.42%
Run 3 92.43% 67.65% 11.42% 99.46% 73.53% 4.57% 94.51% 75.68% 8.68%
Run 4 88.52% 66.67% 15.07% 100.00% 65.85% 6.39% 96.97% 62.96% 11.42%
Run 5 91.80% 63.89% 12.79% 100.00% 73.33% 3.65% 95.51% 65.85% 10.05%
Run 6 92.74% 62.50% 12.79% 98.85% 75.56% 5.94% 95.60% 54.05% 11.42%
Run 7 98.30% 53.49% 10.50% 97.79% 63.16% 8.22% 92.67% 75.00% 9.59%
Run 8 97.28% 65.71% 7.76% 92.35% 94.44% 7.31% 98.91% 61.11% 7.31%
Run 9 94.02% 74.29% 9.13% 96.67% 66.67% 8.68% 92.31% 62.16% 12.79%
Run 10 92.93% 57.14% 12.79% 92.31% 86.49% 8.68% 94.02% 62.86% 10.96%
Average 93.99% 63.68% 11.10% 97.53% 75.48% 6.17% 95.16% 64.21% 10.32%
IP ID 2616 vs all others IP ID 1055 vs all others IP ID 2276 vs all others
sens spec error sens spec error sens spec error
Run 1 98.87% 83.33% 4.11% 100.00% 100.00% 0.00% 98.29% 72.73% 6.85%
Run 2 96.11% 82.05% 6.39% 100.00% 97.30% 0.46% 96.77% 87.88% 4.57%
Run 3 99.42% 85.11% 3.65% 100.00% 97.14% 0.46% 96.26% 96.88% 3.65%
Run 4 96.81% 90.32% 4.11% 98.42% 100.00% 1.37% 96.86% 89.29% 4.11%
Run 5 98.34% 81.58% 4.57% 98.88% 100.00% 0.91% 98.38% 88.24% 3.20%
Run 6 96.13% 86.84% 5.48% 100.00% 96.30% 0.46% 97.33% 90.63% 3.65%
Run 7 97.79% 81.58% 5.02% 100.00% 100.00% 0.00% 99.45% 78.38% 4.11%
Run 8 97.16% 93.02% 3.65% 100.00% 100.00% 0.00% 100.00% 73.53% 4.11%
Run 9 95.58% 94.74% 4.57% 99.47% 100.00% 0.46% 91.16% 89.47% 9.13%
Run 10 99.44% 68.29% 6.39% 95.56% 100.00% 3.65% 97.86% 84.38% 4.11%
Average 97.57% 84.69% 4.79% 99.23% 99.07% 0.78% 97.24% 85.14% 4.75%
Performance of RBF SVM
Performance of the RBF SVM decision rule is significantly better than that of the linear SVM decision rule (on the given dataset) in all cases except "classification of ID 1055 versus all others", where the already very small error rate of 0.37% (for linear SVM) increased to 0.78% (for RBF SVM). However, the error rate for that case remains below 1%, which still makes RBF SVM an excellent choice for classifying the top IPs in our dataset.
Lidar Data Analysis using Product
Coefficients
July 2015
D. Shallcross
Lidar Dataset
10 sets of data, made publicly available by Brodu & Lague at http://nicolas.brodu.net/recherche/canupo/index.html
Each consists of points in R^3, representing a sample from the visible surface of a pioneer salt marsh in Mont Saint-Michel Bay, with scattered low vegetation against a sandy surface.
Points are labelled as "ground" or "vegetation".
Set   # pts   Veg   Ground
1 44677 4608 40069
2 7248 4828 2420
3 7886 1411 6475
4 102990 22351 80639
5 138864 65337 73527
6 8167 4757 3410
7 88151 2908 85243
8 151989 66500 85489
9 71604 20639 50965
10 17944 5871 12073
Product Formula Coefficients
Apply an individual translation to each of the 10 data sets, and one common scaling factor for each of the x, y, and z coordinates, to bring each data set into the unit cube with a common median point.
Treat each transformed dataset as a measure, with point masses at the individual Lidar sample points.
Compute product formula coefficients on these measures down to depth 30.
− Subdividing on each coordinate axis in succession (x,y,z,x,y,z,…)
− Avoiding computing any coefficients corresponding to regions of the cube with no data points.
We can look at histograms of all of the computed coefficients for all ten of the data sets, by level.
− +1 and -1 are most common, then 0.
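The alternating-axis subdivision can be sketched as follows (the binary-string region codes are our own indexing; regions without data points are skipped, as on the slide):

```python
import numpy as np

def cube_coefficients(pts, depth):
    """Product coefficients for a 3-D point cloud in the unit cube,
    splitting on x, y, z, x, y, z, ... in succession. Returns a dict
    mapping a region's binary code ('' = whole cube) to its coefficient;
    regions containing no points are not stored."""
    coeffs = {}

    def recurse(p, lo, hi, level, code):
        if level >= depth or len(p) == 0:
            return
        ax = level % 3                       # x, y, z, x, y, z, ...
        mid = 0.5 * (lo[ax] + hi[ax])
        mask = p[:, ax] < mid
        n_left, n_right = mask.sum(), (~mask).sum()
        coeffs[code] = (n_left - n_right) / len(p)
        hi_left = hi.copy(); hi_left[ax] = mid
        lo_right = lo.copy(); lo_right[ax] = mid
        recurse(p[mask], lo, hi_left, level + 1, code + "0")
        recurse(p[~mask], lo_right, hi, level + 1, code + "1")

    recurse(np.asarray(pts, dtype=float), np.zeros(3), np.ones(3), 0, "")
    return coeffs
```

Points lying on a surface parallel to a splitting plane tend to fall entirely on one side of it, which is why coefficients of ±1 dominate the histograms.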
Each coefficient represents a division of a sub-region of the cube into two pieces. We can associate with each point the coefficients corresponding to sub-regions containing the point.
If we make histograms of the associated coefficients by level for points in the two classes, we see differences.
On levels 9, 12, 15, 18, and 21 the ground points have coefficients clustering around +1 and -1, while vegetation points have coefficients clustering around 0.
Levels 10, 13, 14, 16, 17, 19, and 20 show the reverse behavior.
(Figure: histograms of the associated coefficients by level, "Ground" and "Vegetation" panels.)
Ground vs Vegetation
Ground vs Vegetation
Explanation: A coefficient value of +1 or -1 corresponds to a subdivision where all of the points lie in one of the two halves. The Lidar points represent the observed surface of objects, and coefficients of +1 or -1 are more likely when the surface is parallel to the plane of the subdivision. The ground points lie on a surface more nearly parallel to the subdivisions by z-coordinate, while the vegetation points are more likely to be parallel to the subdivisions by x- or y-coordinate.
An ad hoc discriminator, formed by taking the difference between the sum of absolute values of the coefficients on levels 9, 12, 15, 18, and 21 and the sum of absolute values of the coefficients on levels 10, 13, 14, 16, 17, 19, and 20, achieves some separation between the two classes.
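That ad hoc discriminator is a one-line score per point (the dict layout of per-point coefficients is our own convention; the cut-off on the score is then chosen on training data):

```python
GROUND_LEVELS = (9, 12, 15, 18, 21)              # ground coefficients cluster at +/-1
VEG_LEVELS = (10, 13, 14, 16, 17, 19, 20)        # vegetation coefficients cluster at +/-1

def adhoc_score(point_coeffs):
    """point_coeffs: dict mapping level -> coefficient of the region
    containing the point at that level. Larger scores suggest 'ground'."""
    ground = sum(abs(point_coeffs.get(lvl, 0.0)) for lvl in GROUND_LEVELS)
    veg = sum(abs(point_coeffs.get(lvl, 0.0)) for lvl in VEG_LEVELS)
    return ground - veg
```

Classification then compares the score against a threshold fit on the training sets (1, 2, 4, 6, and 8 in the experiments below).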
Support Vector Machines
We applied a machine learning algorithm called a linear support vector machine (SVM) to try to find an optimal linear separator between the ground points and the vegetation points, using the product coefficients.
We trained the system on data sets 1, 2, 4, 6, and 8, and evaluated the generated separator on data sets 3, 5, 7, 9, and 10.
We also evaluated the separator function on the entire data set, giving the histogram on the right.
In comparison, the ad hoc discriminator does not perform as well. Training on data sets 1, 2, 4, 6, and 8 to determine the cut-off value and evaluating on 3, 5, 7, 9, and 10 gives:
SVM Predicted
Ground Vegetation Total Error %
Actual Ground 217986 10297 228283 4.51%
Vegetation 14774 81392 96166 15.36%
Total 232760 91689 324449 7.73%
Ad Hoc Predicted
Ground Vegetation Total Error %
Actual Ground 213301 14982 228283 6.56%
Vegetation 18711 77455 96166 19.46%
Total 232012 92437 324449 10.38%
Errors from SVM Classification

We can see some pattern in the errors from the SVM classification. Here are 3 different rotated views of the classification for data set 10.
Color codes:
− Red: ground classified as ground
− Black: ground classified as vegetation
− Green: vegetation classified as vegetation
− Blue: vegetation classified as ground
We have an area of ground misclassified as vegetation near the main bush, and a separate bush mostly misclassified as ground.
Comparison with Multi-Scale SVD

In previous work, we had applied Multi-Scale Singular Value Decomposition to this classification problem.
− R. Izmailov, D. Bassu, A. McIntosh, L. Ness, and D. Shallcross, "Application of multi-scale singular vector decomposition to vessel classification in overhead satellite imagery", Proc. SPIE 9631, Seventh International Conference on Digital Image Processing.
This method selects regions of several scales around each point, and associates each point with the singular values and vectors of the subset of points that lie in each of these regions.
Using this information, we were able to use an SVM to obtain a linear classifier with much better performance.
− Again, training on data sets 1, 2, 4, 6, and 8 and evaluating on data sets 3, 5, 7, 9, and 10.
Using just the singular values and vectors for three scales, we had 3% misclassification of vegetation as ground and 2% misclassification of ground as vegetation.
Using additional discretization preprocessing, and including the (x, y, z) coordinates, we obtained 2% for both types of misclassification.
(Figure: nested regions at scale 1, scale 2, and scale 3 around a point.)