Big Astronomical (Big Ast) Data, e.g., the National Virtual Observatory data

What record ordering is best for astronomical data?

Astronomical bodies lie on the celestial sphere, a sphere sharing its origin and equatorial plane with the earth but having no specified radius.

Hierarchical Triangle Mesh (HTM) is the standard ordering. It is an ordering of recursive equilateral triangulations.

Perfect Triangle Mesh (PTM) may be a better ordering, at least for data mining?

(Note: RA = right ascension (= longitudinal angle); dec = declination (= latitudinal angle).)

PTM is similar to HTM: The Celestial Sphere is recursively equilaterally triangulated using great circle segments as sides. But PTM differs from HTM in the way in which these triangles are ordered at each level.
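The recursive triangulation described above can be sketched as follows. This is a minimal illustration, not the HTM or PTM implementation: it assumes level-1 is the 8 faces of an octahedron (4 per hemisphere) and that each triangle is split into 4 by projecting edge midpoints onto the unit sphere. The function names are hypothetical, and no particular sub-triangle ordering (HTM or PTM) is implied. The sub-triangles produced this way are only approximately equal in size.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length (project it onto the unit sphere)."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def midpoint(a, b):
    """Midpoint of the great-circle segment between unit vectors a and b."""
    return normalize(tuple((x + y) / 2 for x, y in zip(a, b)))

def subdivide(tri):
    """Split one spherical triangle into 4 (3 corner triangles + 1 center)."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]

def octahedron():
    """Assumed level-1 mesh: 4 northern and 4 southern triangles."""
    north, south = (0, 0, 1), (0, 0, -1)
    eq = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
    tris = []
    for i in range(4):
        tris.append((north, eq[i], eq[(i + 1) % 4]))
        tris.append((south, eq[(i + 1) % 4], eq[i]))
    return tris

def triangulate(level):
    """Return the level-n mesh; level n has 8 * 4**(n-1) triangles."""
    tris = octahedron()
    for _ in range(level - 1):
        tris = [t for tri in tris for t in subdivide(tri)]
    return tris
```

HTM and PTM would differ only in the order in which the four children returned by `subdivide` are visited.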

[Figure: sub-triangle ordering in HTM and sub-triangle ordering in PTM, showing triangle 1 with sub-triangles 1,0 through 1,3 and their sub-triangles (e.g., 1,3,0 through 1,3,3 and 1,1,0 through 1,1,3).]

PTM with Recessional, then LRLR ordering (level-1 recesses the northern hemisphere, then reverse recesses the southern)

[Figure axis labels: RA, dec]

The following ordering produces a sphere-filling curve with good continuity characteristics. The picture at right shows the earth (blue ball at the center) and the celestial sphere out around it.

Traverse the next level of triangulation, alternating again with left-turn, right-turn, left-turn, right-turn...

Traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down), arriving at the southern neighbor of the start triangle.

[Figure: turn labels (L = left turn, R = right turn) along the traversal path.]

This scheme has the following good characteristics:
1. At level-2, it's better than moving to the edge-neighbor of 1st-last, since 2nd-last are closest to the target, where we 1st traverse the 3 nbrs of the 2nd-last pair of the source.
2. Level-1 always moves to an edge neighbor.
3. At all levels it forms a circle.
4. Level-1 can start anywhere.

Level-2 (blue) follows level-1 (green), with the LLRR pattern, where:

[Figure: level-2 turn labels (L, R) along the traversal path.]

PTM with LLRR, then LRLR order. Pierce at south pole. Peel to the north pole along quadrant great circles and around the equator. (Many other attempts failed, see appendix).

PTM_LLRR_LR_LR

[Figure: level-2 and level-3 turn labels (L, R) along the traversal path.]

It works the same way at level-4 and higher.

[Figure: turn labels (L, R) along the traversal path.]

Level-3 follows level-2 with LRLR when the level-2 pattern is L and RLRL when the level-2 pattern is R.

[Figure: turn labels (L) along the traversal path.]

Level-4 follows level-3 with the pattern LRLR when the level-3 pattern is L and RLRL when the level-3 pattern is R.

PTM_LLRR_LR_LR_LR_...

First, let's consider a mathematical theorem: an n-sphere-filling (n-1)-sphere.

Corollary-2: There exists a sphere-filling circle.

Proof-2: Let C_n ≡ the level-n circle. Then C ≡ lim_{n→∞} C_n is a circle which fills the 2-sphere!

Proof: Let x be any point on the 2-sphere.

distance(x, C_n) ≤ sidelength_n (= diameter) of the level-n triangles.

sidelength_{n+1} = ½ * sidelength_n.

d(x, C) ≡ lim d(x, C_n) ≤ lim sidelength_n = sidelength_1 * lim (½)^(n-1) = 0.

Standard K-means Clustering: Select k centroids. Assign each point to the closest centroid. Calculate the mean (= centroid) of each new cluster. Iterate until stop_condition = true.

PK-means: Same as above but the means are calculated without scanning (using one pTree formulas).

MPK-means: Mohammad's PK-means. Below, i=1...K.

1. Pick K centroids, {Ci} i=1..K.
2. Calculate distances, Di=D(X,Ci); the result of each is a PTreeSet of d pTrees. (d = bitwidth of diameter(X)? or?) (Md creates one width=d Distances_PTreeSet (for each i) for all of X without a scan. Md? Does d produce itself automatically?)
3. Calculate P(Di≤Dj), i<j (the mask pTree where the bit is 1 iff distance(x,Ci) ≤ distance(x,Cj)). (Md: ≤ instead of < ?)
4. Calculate cluster mask pTrees:
   PC1 = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤DK)
   PC2 = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤DK) & ~PC1
   PC3 = P(D3≤D4) & ... & P(D3≤DK) & ~PC1 & ~PC2
   . . .
   PCK = ~PC1 & ~PC2 & ... & ~PCK-1
5. Calculate new centroids, Ci = Sum(X&PCi)/count(PCi).
6. If stop_condition = false, start the next iteration with these new centroids.

PKL means: P(K-Less) means (pronounced "pickle means"). If n1 is the least value you could possibly want for K (e.g., n1=2) and n2 is the greatest, then for each K, n1 ≤ K ≤ n2, calculate:

4'. Calculate cluster mask pTrees. For K=2..n,
   PC1K = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤DK)
   PC2K = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤DK) & ~PC1
   . . .
   PCK = P(X) & ~PC1 & ... & ~PCK-1
6'. If ∃k s.t. stop_cond = true, stop and choose that k; else start the next iteration with these new centroids.
3.5'. At each iteration, only continue with those k's that meet a requirement, e.g., only continue with the top t each iteration. (We need a comparator that is easy to calculate (using pTrees) with which to compare two k's.) The comparator could involve, e.g.:
   a. Sum of the cluster diameters (which should be easy using max and min of D(Clusteri, Cj), or D(Clusteri, Clusterj)?)
   b. Sum of the diameters of the gaps between clusters (should be easy using D(listPCi, Cj) or D(listPCi, listPCj)).
   c. Other?
In 4', 1st round, pick n2 centroids, find all PC1K? (e.g., for K=2 (find PCh2, PCh2, h=n1..n2), on PCh2's do it again...

PKLD means: P K-Less Divisive means (pronounced "pickled means").
1. Apply PKL means with K=2.
2. Then on each subcluster repeat 1. until a stopping condition is reached on each branch.
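As a rough illustration of steps 3 to 5 of MPK-means, the sketch below uses Python integers as stand-ins for mask pTrees (bit r is set iff row r satisfies the predicate). It is only a sketch under that assumption: real pTrees would produce the comparison masks without a scan, whereas this version loops over rows. The names `cluster_masks` and `new_centroid` are hypothetical.

```python
def cluster_masks(points, centroids):
    """Return one bitmask per centroid; bit r is set iff row r is in that cluster."""
    n, k = len(points), len(centroids)
    # dist[i][r] = squared distance from row r to centroid i (plays the role of Di)
    dist = [[sum((a - b) ** 2 for a, b in zip(p, c)) for p in points]
            for c in centroids]

    def le_mask(i, j):
        """Stand-in for P(Di <= Dj): bit r is 1 iff dist(x_r, Ci) <= dist(x_r, Cj)."""
        mask = 0
        for r in range(n):
            if dist[i][r] <= dist[j][r]:
                mask |= 1 << r
        return mask

    masks, assigned = [], 0
    all_rows = (1 << n) - 1
    for i in range(k):
        mask = all_rows & ~assigned          # ~PC1 & ... & ~PC(i-1)
        for j in range(i + 1, k):
            mask &= le_mask(i, j)            # P(Di<=Dj) for every j > i
        masks.append(mask)
        assigned |= mask
    return masks

def new_centroid(points, mask):
    """Ci = Sum(X & PCi) / count(PCi); assumes the cluster is non-empty."""
    members = [p for r, p in enumerate(points) if (mask >> r) & 1]
    return tuple(sum(col) / len(members) for col in zip(*members))
```

The last mask needs no comparisons at all: it is simply every row not claimed by an earlier cluster, which matches the PCK formula in step 4.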

Can we create PTreeSet(distance(x,NOTx)) where NOTx=X-{x} ?

If so, we can then find the max to pick out the most anomalous point. If we want the n most anomalous points, we can work our way from the high-order bit down to the units bit to determine them, or we can just sort descending to rank all points by outlier-ness (anomalousness).

Alternatively, we can take a medoid, C, and steadily increase r until count(dis(x,Disk(C,r))) > count(X)–n. Then declare everything outside disk to be an outlier.

From: Mohammad Hossain [mailto:[email protected]]
Sent: Friday, May 04, 2012 5:33 PM
To: Perrizo, William; Mohammad Kabir Hossain ([email protected])
Cc: Arjun Roy ([email protected])
Subject: RE: time for a call today / tomorrow?

I can go through my algorithm tomorrow in the meeting. I have another thought: even if we need to go through every point using a loop, our algorithm would still be O(n), as we would need to go through the points only once, whereas the horizontal methods will be O(n^2), because from every point we have to find the distances from every other point (to be accurate, n(n-1)/2 distance calculations), which is O(n^2). As for the calculation of distance(x,C) where C is complement{x}, can we somehow predict C so that it is not complement{x} but rather a fixed subset of X? Seems we have a big discussion tomorrow.

Mohammad



[Figure: scatter plot (dim1 vs. dim2) of the points o, 0, r1, r2, r3, v1, v2, v3, v4.]

Algorithm1: Look for the dimension where clustering is best (good?). Below, dimension=1 (3 clusters: {r1,r2,r3,O}, {v1,v2,v3,v4} and {0}). How do we determine those clusters?
1.a: Take each chosen dimension, working left to right; when d(mean,median) > c*width, declare a cluster (c=¼?).
1.b: Next take those clusters one at a time to the next chosen dimension for further sub-clustering via the same algorithm.

[Figure: positions of the mean and median as points are added.]

At this point we declare {r1,r2,r3,O} a cluster and start over.

[Figure: positions of the mean and median as points are added.]

At this point we need to declare a cluster, but which one, {0,v1} or {v1,v2}? We will always take the one on the median side of the mean - in this case, {v1,v2}. And that makes {0} a cluster (actually an outlier, since it's singleton). Continuing with {v1,v2}:

[Figure: positions of the mean and median as points are added.]

Declare {v1,v2,v3,v4} a cluster. Note we have to loop. However, rather than each single projection, delta can be the next m projs if they're close.

Next we would take one of the clusters and go to the best dimension to subcluster...

Oblique version: Take a grid of oblique direction vectors, e.g., for a 3D dataset, a DirVect pointing to the center of each PTM triangle. With projections onto those lines, do 1 or 2 above. Ordering = any sphere surface grid: S^n ≡ {x≡(x1...xn)∈R^n | Σ xi^2 = 1}; in polar coords, {p≡(θ1...θn-1) | 0≤θi≤179}.

Can skip doubletons since mean always same as median.

Algorithm3: A variation is to calculate the mean and the vector of medians (vom). On projections onto the line connecting them, do 1a or 1b. Repeat on each declared cluster, but use a projection line other than the one through the mean and vom this second time (since the mean-vom line would likely be in approximately the same direction as in the first round). Do until no new clusters? Adjust? E.g., proj lines and stop cond, ...

Algorithm2: 2.a Take each dim in turn, working left to right, when density > Density_Threshold, declare a cluster ( density ≡ count/size ).

Algorithm4: Project onto the line through the dataset mean and vom: mn=(6.3,5.9), vom=(6,5.5) ((11,10) = outlier). 4.b: Repeat on any perpendicular line thru the mean. (mn, vom far apart ⇒ multi-modality.)
Algorithm4.1: 4.b.1: In each cluster, find the 2 points furthest from the line? (Does this require that the projection be done one point at a time? Or can we determine those 2 points in one pTree formula?)
Algorithm4.2: 4.b.2: Use a grid of unit direction lines, {dvi | i=1..m}. For each, calc mn, vom of the projs of each cluster (except singletons). Take the one for which the separation is max.

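[Figure: scatter plot (dim1 vs. dim2) of the points (4,9), (2,8), (5,8), (4,6), (3,4), (11,10), (10,5), (9,4), (8,3), (7,2), with the mean (6.3, 5.9) and the vom (6, 5.5) marked.]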
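The mean and vom quoted in Algorithm4 can be checked directly from the ten plotted points. The sketch below recomputes them and projects every point onto the line through the vom in the direction of the mean. It uses plain horizontal (row-at-a-time) arithmetic rather than pTrees, and the helper names `mean_and_vom` and `project` are hypothetical.

```python
from statistics import median

# The ten (dim1, dim2) points shown in the figure.
points = [(4, 9), (2, 8), (5, 8), (4, 6), (3, 4),
          (11, 10), (10, 5), (9, 4), (8, 3), (7, 2)]

def mean_and_vom(pts):
    """Return (mean vector, vector of per-dimension medians)."""
    cols = list(zip(*pts))
    mn = tuple(sum(c) / len(c) for c in cols)
    vom = tuple(float(median(c)) for c in cols)
    return mn, vom

def project(pts, origin, direction):
    """Signed length of each point's projection onto the line origin + t*direction."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    return [sum((p - o) * u for p, o, u in zip(pt, origin, unit)) for pt in pts]

mn, vom = mean_and_vom(points)
direction = tuple(m - v for m, v in zip(mn, vom))   # vom -> mean
projections = project(points, vom, direction)
```

On this line the point (11,10) has the largest projection by a wide margin, which is consistent with the slide marking it as the outlier.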

Use lexicographical polar coords? 180^n too many? Use, e.g., 30 deg units, giving 6^n vectors for dim=n. Attrib relevance important!

Alg1-2: Use 1st criteria to trigger from 1.a, 2.a to declare clusters.

MASTERMINE (Medoid-based Affinity Subset deTERMINEr)

[Figure: 3D point set (axes 1, 2, 3) with points labeled 435, 524, 504, 545, 323.]

mean=(8.18, 3.27, 3.73)

vom=(7,4,3)

1. no clusters determined yet.

[Figure: points labeled 924, b43, e43, c63, 752, f72.]

2. (9,2,4) determined as an outlier cluster.

3. Using red dim line, (7,5,2) is determined as an outlier cluster. maroon pts determined as cluster, purple pts too.

3.a However, continuing to use line connecting (new) mean and vom of the projections onto this plane, would the same be determined?

Other option? use (at some judicious point) a p-Kmeans type approach. This could be done using K=2 and a divisive top down approach (using a GA mutation at various times to get us off a non-convergent track)?

Notes:
Each round, reduce dim by one (low bound on the loop).
Each round, we just need a good line (in the remaining hyperplane) on which to project the cluster (so far).
1. Pick the line thru the proj'd mean and vom (vom is dependent on the basis used. Better way?)
2. Pick the line thru the longest diameter? (or diam 1/2 previous diam?)
3. Try a direction vector. Then hill climb it in the direction of increase in diam of the proj'd set.

From: Mark Silverman [mailto:[email protected]]
April 21, 2012 8:22 AM
Subject: RE: oblique faust

I’ve been doing some tests, so far not so accurate (I’m still validating the code – I “unhardcoded” it so I can deal with arbitrary datasets and it’s possible there’s a bug, so far I think it’s ok). Something rather unique about the test data I am using is that it has four attributes, but for all of the class decisions it is really one of the attributes driving the classification decision (e.g. for classes 2-10, attribute 2 is dominant decision, class 11 attribute 1 is dominant, etc). I have very wide variability in std deviation in the test data (some very tight, some wider). Thus, I think that placing “a” on the basis of relative deviation makes a lot of sense in my case (and probably in general). My assumption is that all I need to do is to modify as follows:

Now: a[r][v] = (Mr + Mv) * d / 2
Changes to: a[r][v] = (Mr + Mv) * d * std(r) / (std(r) + std(s))

Is this correct?


[Figure: scatter of class-r and class-v points with the class means mR and mV.]

FAUST Oblique (our best classifier?)

PR = P((X o dR) < aR): 1 pass gives the class-R pTree.

D ≡ mR→mV (the vector from mR to mV)

d = D/|D|

Separate class R using the midpoint of means (mom) method: calculate a as

a = (mR+(mV-mR)/2) o d = ((mR+mV)/2) o d (works also if D = mV→mR).

Training ≡ placing cut-hyper-plane(s) (CHP) (= an n-1 dim hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at a time).

Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., use the
1. vector_of_medians, vom, to represent each class, not the mean mV, where vomV ≡ (median{v1|v∈V}, median{v2|v∈V}, ...).
2. mom_std, vom_std methods: project each class on the d-line; then calculate the std (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR and mV).

[Figure: classes R and V in dim 1 vs. dim 2, showing vomR, vomV, the d-line, and the std of distances, vod, from the origin along the d-line.]

Note: training (finding a and d) is a one-time process. If we don’t have training pTrees, we can use horizontal data for a, d (one time), then apply the formula to test data (as pTrees).
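The midpoint-of-means training and the cut test X o d < a can be sketched with horizontal data as follows. This is an illustrative sketch only: it classifies one row at a time instead of producing a mask pTree in one pass, and the names `train_mom` and `in_class_r` are hypothetical.

```python
import math

def mean_vector(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def train_mom(class_r, class_v):
    """Return (d, a): the unit vector from mR toward mV and the cut value
    a = ((mR + mV) / 2) o d, i.e. the midpoint of means projected on d."""
    m_r, m_v = mean_vector(class_r), mean_vector(class_v)
    big_d = [v - r for r, v in zip(m_r, m_v)]          # D = mR -> mV
    length = math.sqrt(sum(c * c for c in big_d))
    d = [c / length for c in big_d]                    # d = D / |D|
    a = sum(((r + v) / 2) * c for r, v, c in zip(m_r, m_v, d))
    return d, a

def in_class_r(x, d, a):
    """A row is predicted class R when its projection x o d falls below a."""
    return sum(xi * di for xi, di in zip(x, d)) < a
```

For example, training on `[(1, 1), (2, 1), (1, 2)]` as class R and `[(8, 8), (9, 8), (8, 9)]` as class V places the cut roughly halfway between the two groups along the line joining their means. The std-ratio variants would change only how `a` is computed.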

APPENDIX: The PTreeSet Genius for Big Data

Big Vertical Data: PTreeSet (Dr. G. Wettstein's) is perfect for BVD! (pTrees both horiz and vert)

PTreeSets incl methods for horiz querying and vertical DM, multihop Query/DM, and XML. T(A1...An) is a PTreeSet data structure = bit matrix with (typically) each numeric attr converted to fixedpt(?), (negs?) bitsliced (pt_posschema) and each category attr bitmapped; or coded then bitmapped; or num coded then bitsliced (or as is, i.e., a char(25) NAME col stored outside the PTreeSet?). A1..Ak are numeric with bitwidths=bw1..bwk; Ak+1..An are categorical with counts=cck+1...ccn. The PTreeSet is the bit matrix:

[Figure: PTreeSet bit matrix. Rows are row numbers 1, 2, 3, 4, 5, ..., N; columns are the bit slices A1,bw1-1 ... A1,0, A2,bw2 ..., followed by the category bitmaps Ak+1,c1 ... An,ccn.]

Methods for this data structure can provide fast horizontal row access , e.g., an FPGA could (with zero delay) convert each bit-row back to original data row.

Methods already exist to provide vertical (level-0 or raw pTree) access.
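To make the two access paths concrete, the sketch below bit-slices one numeric column into vertical bit vectors and then rebuilds a single row from those slices. It is a simplified stand-in, assuming non-negative integers and using a Python integer per bit slice instead of a compressed pTree. The function names are hypothetical.

```python
def bit_slices(column, bitwidth):
    """Vertical view: one bitmask per bit position, high-order slice first.
    Bit r of a slice is the corresponding bit of the value in row r."""
    slices = []
    for b in range(bitwidth - 1, -1, -1):
        mask = 0
        for row, value in enumerate(column):
            if (value >> b) & 1:
                mask |= 1 << row
        slices.append(mask)
    return slices

def row_value(slices, row):
    """Horizontal view: read one bit-row across the slices to recover the value."""
    value = 0
    for mask in slices:
        value = (value << 1) | ((mask >> row) & 1)
    return value
```

For the column `[5, 3, 0, 7]` with bitwidth 3, the three slices are `0b1001`, `0b1010` and `0b1011`, and `row_value` returns the original value for each row.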

Any level-1 PTreeSet can be added: given any row partition (e.g., equiwidth=64 row intervalization) and a row predicate (e.g., 50% 1-bits).

Add "level-1 only" DM methods, e.g., an FPGA converts unclassified rowsets to equiwidth=64, 50% level-1 pTrees, then the entire batch would be FAUST classified in one horiz program. Or lev1 pCKNN.
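A level-1 bit vector of this kind can be sketched as follows, assuming the predicate "50% 1-bits" means "at least half of the bits in the interval are 1" (the slide does not say whether the comparison is strict). The function name `level1` is hypothetical, and a plain list of 0/1 values stands in for a level-0 pTree.

```python
def level1(bits, width=64, threshold=0.5):
    """Summarize a level-0 bit column: one bit per interval of `width` rows,
    set to 1 when the fraction of 1-bits in the interval is >= threshold."""
    summary = []
    for start in range(0, len(bits), width):
        chunk = bits[start:start + width]      # last interval may be shorter
        summary.append(1 if sum(chunk) >= threshold * len(chunk) else 0)
    return summary
```

The result has roof(N/64) entries for N rows, so a batch of unclassified rows shrinks by a factor of about 64 before any classification is run on it.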

[Figure: level-1 PTreeSet bit matrix. Rows are interval numbers 1, 2, ..., roof(N/64); columns are A1,bw1-1 ... A1,0, A2,bw2 ..., Ak+1,c1 ... An,ccn.]

pDGP (pTree Darn Good Protection) by permuting the column order (the permutation = the key). A random pre-pad for each bit-column would make it impossible to break the code by simply focusing on the first bit row.

Relationships (rolodex cards) are 2 PTreeSets, AHGPeoplePTreeSet (shown) and AHGBasePairPositionPTreeSet (rotation of shown).

Vertical Rule Mining, Vertical Multi-hop Rule Mining and Classification/Clustering methods (viewing AHG as either a People table (cols=BPPs) or as a BPP table (cols=People)). MRM and Classification done in combination?

Any table is a relationship between row and column entities (heterogeneous entity), e.g., an image = [reflect. labelled] relationship between the pixel entity and the wavelength interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree).

More security?: all pTrees same (max) depth, and intron-like pads randomly interspersed...

Most bioinformatics done so far is not really data mining but is more toward the database querying side (e.g., a BLAST search).

A radical approach: View the whole Human Genome as 4 binary relationships between People and base-pair-positions (ordered by chromosome first, then gene region?).

AHG [THG/GHG/CHG] is the relationship between People and adenine (A) [thymine (T)/guanine (G)/cytosine (C)] (1/0 for yes/no).

Order bpp? By chromosome and by gene or region (level2 is chromosome, level1 is gene within chromosome). Do it to facilitate cross-organism bioinformatics data mining?

Create both People and BPP PTreeSets with a human health records feature table (training set for classification and multi-hop ARM). A comprehensive decomp (ordering of bpps) FOR cross-species genomic DM. If there are separate PTreeSets for each chromosome (even each region: gene, intron, exon, ...), then we may be able to data mine horizontally across all of these vertical pTrees.

[Figure: AHG(P,bpp) bit matrix. Rows are People 1, 2, 3, 4, 5, ..., 7B with feature columns pc, bc, lc, cc, pe, age, ht, wt; columns are base-pair positions (bpp) 1, 2, 3, 4, 5, ..., 3B, grouped by gene and chromosome.]

The red person features are used to define classes. AHGp pTrees are used for data mining. We can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, overall, or anything else.

A facebook Member, m, purchases Item, x, tells all friends. Let's make everyone a friend of him/her self. Each friend responds back with the Items, y, she/he bought and liked.

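Facebook-Buys:

F≡Friends(M,M) (rows: Members 1-4; columns: Members 4 3 2 1)

Member | 4 3 2 1
1      | 0 1 1 1
2      | 1 0 1 1
3      | 0 1 1 0
4      | 1 1 0 1

P≡Purchase(M,I) (rows: Members 1-4; columns: I≡Items 2 3 4 5)

Member | 2 3 4 5
1      | 0 0 1 0
2      | 1 0 0 1
3      | 0 1 0 0
4      | 1 0 1 1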

For X⊆I, MX ≡ &x∈X Px = People that purchased everything in X.

FX ≡ OR m∈MX Fm = Friends of an MX person.

So, for X={x}, is "Mx purchases x" strong? Mx = OR m∈Px Fm; x is frequent if Mx is large. This is a tractable calculation: take one x at a time and do the OR. x is confident if ct(Mx & Px) / ct(Mx) > minconf.

[Figure: pTrees for x=2: P2 and the Friends rows of members 2 and 4, combined to give K2.]

K2 = {1,2,4}, P2 = {2,4}, ct(K2) = 3, ct(K2&P2)/ct(K2) = 2/3

To mine X, start with X={x}. If it is not confident, then no superset is. Closure: X={x,y} for x and y forming confident rules themselves...

ct((OR m∈Px Fm) & Px) / ct(OR m∈Px Fm) > minconf
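The frequency and confidence test for a single item x can be sketched with Python sets standing in for pTrees (set union for OR, intersection for AND). The friendship data below is hypothetical and is not the Friends matrix from the slide. The function name `rule_strength` is also hypothetical.

```python
def rule_strength(friends, purchasers):
    """Return (reached, support, confidence) for one item.

    reached    = OR over purchasers m of Friends(m)   (everyone who is told)
    support    = ct(reached)
    confidence = ct(reached & purchasers) / ct(reached)
    """
    reached = set()
    for m in purchasers:
        reached |= friends[m]
    support = len(reached)
    confidence = len(reached & purchasers) / support if support else 0.0
    return reached, support, confidence

# Hypothetical data: every member is a friend of him/herself.
friends = {1: {1, 2}, 2: {1, 2, 4}, 3: {3}, 4: {2, 4}}
purchasers_of_x = {2, 4}
reached, support, confidence = rule_strength(friends, purchasers_of_x)
```

Because each item is handled independently, the loop over items is a sequence of ORs followed by one AND and two counts, which is why the slide calls the calculation tractable.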

Kx = OR g∈(OR b∈Px Fb) Og; x is frequent if Kx is large (tractable: take one x at a time and OR).

F≡Friends(K,B) (rows: Buddies 1-4; columns: Kiddos 4 3 2 1)

Buddy | 4 3 2 1
1     | 0 1 1 1
2     | 1 0 1 1
3     | 0 1 1 0
4     | 1 1 0 1

P≡Purchase(B,I) (rows: Buddies 1-4; columns: I≡Items 2 3 4 5)

Buddy | 2 3 4 5
1     | 0 0 1 0
2     | 1 0 0 1
3     | 0 1 0 0
4     | 1 0 1 1

Others(G,K) (rows: Groupies 1-4)

Groupie |
1       | 0 0 1 0
2       | 1 0 0 1
3       | 0 1 0 0
4       | 1 0 1 1

[Figure: intermediate pTrees for x=2.]

K2={1,2,3,4}, P2={2,4}, ct(K2) = 4, ct(K2&P2)/ct(K2) = 2/4

Fcbk buddy, b, purchases x, tells friends. Friend tells all friends. Strong purchase poss? Intersect rather than union (AND rather than OR). Ad to friends of friends.

F≡Friends(K,B) (rows: Buddies 1-4; columns: Kiddos 4 3 2 1)

Buddy | 4 3 2 1
1     | 0 1 1 1
2     | 1 0 1 1
3     | 0 1 1 0
4     | 1 1 0 1

P≡Purchase(B,I) (rows: Buddies 1-4; columns: I≡Items 2 3 4 5)

Buddy | 2 3 4 5
1     | 0 0 1 0
2     | 1 0 0 1
3     | 0 1 0 0
4     | 1 0 1 1

Compatriots(G,K) (rows: Groupies 1-4)

Groupie |
1       | 0 0 1 0
2       | 1 0 0 1
3       | 0 1 0 0
4       | 1 0 1 1

[Figure: intermediate pTrees for x=2.]

K2={2,4}, P2={2,4}, ct(K2) = 2, ct(K2&P2)/ct(K2) = 2/2

The Multi-hop Closure Theorem. A hop is a relationship, R, hopping from entities E to F.

A condition is downward [upward] closed if, when it is true of A, it is true for all subsets [supersets], D, of A.

Given an (a+c)-hop multi-relationship, where the focus entity is a hops from the antecedent and c hops from the consequent, if a [or c] is odd/even then downward/upward closure applies to frequency and confidence.

A pTree, X, is said to be "covered by" a pTree, Y, if ∀ one-bit in X, there is a one-bit at that same position in Y.

Lemma-0: For any two pTrees, X and Y, X&Y is covered by X, and thus ct(X&Y) ≤ ct(X) and list(X&Y) ⊆ list(X).
Proof-0: ANDing with Y may zero some of X's ones, but it will never change any zeros to ones.

Lemma-1: Let A⊆D. Then &a∈A Xa covers &a∈D Xa.

Lemma-2: Let A⊆D. Then &c∈list(&a∈D Xa) Yc covers &c∈list(&a∈A Xa) Yc.

Proof-1&2: Let Z = &a∈D-A Xa; then &a∈D Xa = Z & (&a∈A Xa). Lemma-1 now follows from lemma-0, as does list(&a∈D Xa) ⊆ list(&a∈A Xa). Setting D' = list(&a∈A Xa) and A' = list(&a∈D Xa), we have A'⊆D', so applying lemma-1 to A' and D' gives lemma-2.

Lemma-3: Let A⊆D. Then &e∈list(&c∈list(&a∈A Xa) Yc) We covers &e∈list(&c∈list(&a∈D Xa) Yc) We.
Proof-3: Lemma-3 follows in the same way from lemma-1 and lemma-2.

Continuing this establishes: if there are an odd number of nested &'s, then the expression with D is covered by the expression with A. Therefore the count with D ≤ the count with A. Thus, if the frequency expression and the confidence expression are > threshold for D, then the same is true for every subset A of D. This establishes downward closure.

Exactly analogously, if there are an even number of nested &'s, we get the upward closures.
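Lemma-0 and lemma-1 are easy to check on small examples. The sketch below uses Python integers as uncompressed bit vectors in place of pTrees. The helper names (`and_all`, `covers`, `count`) and the sample bit patterns are hypothetical.

```python
from functools import reduce

def and_all(ptrees, keys, width):
    """AND together the bit vectors ptrees[k] for k in keys.
    An empty key set yields the all-ones vector of the given width."""
    full = (1 << width) - 1
    return reduce(lambda acc, key: acc & ptrees[key], keys, full)

def covers(y, x):
    """True iff every one-bit of x is also a one-bit of y (x is covered by y)."""
    return x & ~y == 0

def count(ptree):
    """Number of one-bits, i.e. ct(ptree)."""
    return bin(ptree).count("1")

# Hypothetical 4-row bit vectors.
X = {"a": 0b1110, "b": 0b0111, "c": 0b0110}
A = {"a"}
D = {"a", "b"}                      # A is a subset of D

expr_a = and_all(X, A, 4)           # &a in A of Xa
expr_d = and_all(X, D, 4)           # &a in D of Xa
```

Here `expr_a` covers `expr_d`, so `count(expr_d) <= count(expr_a)`: adding members to the antecedent set can only remove one-bits, which is the single-& (odd) case of the closure argument.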

PTM with alternative upper ordering (consistently left-right alternating) Pierce at the south pole. Peel the apple all the way north to the north pole...

1. We have stayed in the northern hemisphere for 3 upper triangles.

[Figure: upper-level path with turns labeled left, right, left, right, left, right.]

This forces a move to a vertex neighbor on move-6 and a jump to a polar opposite on move-7.

This and the next few slides show other alternatives. In all but the last one, the scheme fails.

The series of slides culminates in a method which succeeds and has very good characteristics.

PTM w alternative upper ordering (consistently left-right alternating). Pierce at the south pole. Peel the apple all the way north to the north pole...

2. At the upper level, always move to the triangle that is an edge-neighbor of the 1st and last sub-triangle.

[Figure: upper-level path with turn labels (L, R).]

Forces a move to triangle already covered on move-6

Note: One may wonder if this happened because we started at the south pole this time. I believe it can be proven that the north-south pole issue is moot, i.e., that these poles are just vertices like any other (this analysis is rotationally invariant).

PTM w alternative upper ordering (consistently left-right alternating). Pierce at the south pole. Peel the apple all the way north to the north pole...

2. At the upper level, always move to the triangle that is an edge-neighbor of the 2nd and last sub-triangle.

[Figure: upper-level path with turn labels (L, R).]


PTM w alternative upper ordering (consistently left-left non-alternating). Pierce at the south pole. Peel the apple all the way north to the north pole...

2. At the upper level, always move to the triangle that is an edge-neighbor of the 1st and last sub-triangle.

[Figure: upper-level path with turn labels (L).]


PTM w alternative upper ordering (consistently left-left non-alternating). Pierce at the south pole. Peel the apple all the way north to the north pole...

2. At the upper level, always move to the triangle that is an edge-neighbor of the 2nd and last sub-triangle.

[Figure: upper-level path with turn labels (L).]

