Clustering Analysis (of Spatial Data, using Peano Count Trees) (P-tree technology is patented by NDSU)
Notes: 1. Over 100 slides; we will not go through each in detail.
Clustering Methods
A Categorization of Major Clustering Methods:
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods
Clustering Methods Based on Partitioning
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
- k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
- k-medoids or PAM method (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one object in the cluster (~ the middle object or median-like object).
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps (assumes the partitioning criterion is: maximize intra-cluster similarity and minimize inter-cluster similarity. Of course, a heuristic is used; the method isn't really an optimization).
Partition objects into k nonempty subsets (or pick k initial means).
Compute the mean (center) or centroid of each cluster of the current partition (if one started with k means, this step is already done). The centroid is ~ the point that minimizes the sum of dissimilarities from the mean, or the sum of the square errors from the mean. Assign each object to the cluster with the most similar (closest) center.
Go back to Step 2
Stop when the new set of means doesn't change (or some other stopping condition is met).
[Figure: k-means iterations, Steps 1-4, on a 10 x 10 scatter plot.]
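As a concrete illustration of these four steps, here is a minimal k-means sketch (NumPy; the random initialization, the empty-cluster guard, and all names are our choices, not part of the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means; X is an (n, p) array of n objects with p variables."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: pick k initial means
    for _ in range(max_iter):
        # Step 3: assign each object to the most similar (closest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute the mean (centroid) of each cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # stop: means don't change
            break
        centers = new_centers                                # Step 4: go back to Step 2
    return centers, labels
```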
The K-Medoids Clustering Method
Find representative objects, called medoids (a medoid must be an actual object in the cluster, whereas the mean seldom is).
PAM (Partitioning Around Medoids, 1987):
- starts from an initial set of medoids
- iteratively replaces one of the medoids by a non-medoid; if the swap improves the aggregate similarity measure, retain it. Do this over all medoid/non-medoid pairs.
- PAM works for small data sets, but does not scale to large data sets.
CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990): sub-samples the data, then applies PAM.
CLARANS (Clustering Large Applications based on RANdom Search) (Ng & Han, 1994): randomizes the sampling.
PAM (Partitioning Around Medoids) (1987)
Use real objects to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_i,h.
3. For each pair of i and h, if TC_i,h < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
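A sketch of this swap loop, assuming the aggregate measure is the total distance of each object to its nearest medoid (the TC notation follows the slide; everything else is illustrative):

```python
import numpy as np

def pam(D, k, max_iter=50):
    """PAM sketch; D is an (n, n) dissimilarity matrix."""
    n = len(D)
    medoids = list(range(k))                           # 1. select k representatives arbitrarily
    cost = lambda meds: D[:, meds].min(axis=1).sum()   # total distance to nearest medoid
    for _ in range(max_iter):
        changed = False
        for i in list(medoids):                        # 2. every (selected i, non-selected h) pair
            for h in range(n):
                if h in medoids:
                    continue
                trial = [h if m == i else m for m in medoids]
                if cost(trial) - cost(medoids) < 0:    # TC_i,h < 0: the swap improves things
                    medoids, changed = trial, True
        if not changed:                                # 4. repeat until there is no change
            break
    labels = D[:, medoids].argmin(axis=1)              # 3. assign to most similar representative
    return medoids, labels
```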
CLARA (Clustering LARge Applications) (1990)
CLARA (Kaufmann and Rousseeuw, 1990) draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weaknesses: efficiency depends on the sample size; a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
CLARANS (Randomized CLARA) (1994)
CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng & Han, 1994) draws a sample of neighbors dynamically.
The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids.
If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum (Genetic-Algorithm-like). Finally the best local optimum is chosen after some stopping condition.
It is more efficient and scalable than both PAM and CLARA.
Distance-based partitioning has drawbacks:
- Simple and fast, O(N), but:
- The number of clusters, K, has to be chosen arbitrarily, before it is known what the correct number of clusters is.
- Produces round-shaped clusters, not arbitrary shapes (Chameleon data set below).
- Sensitive to the selection of the initial partition; may converge to a local minimum of the criterion function if the initial partition is not well chosen. [Figure: correct result vs. k-means result.]
Distance-based partitioning (cont.)
If we start with A, B, and C as the initial centroids around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses.
Whereas the correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means (rectangular clusters).
A Vertical Data Approach
Partition the data set using rectangular P-trees (a gridding).
These P-trees can be viewed as a grouping (partition) of data
Prune out outliers by disregarding those sparse values.
Input: total number of objects (N), percentage of outliers (t). Output: grid P-trees after pruning.
(1) Choose the grid P-tree with the smallest root count (Pgc)
(2) outliers := outliers OR Pgc
(3) if (outliers/N < t) go to (1); else stop
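A sketch of this pruning loop, using plain per-cell counts to stand in for P-tree root counts (the stopping test in step (3) is our reading of the truncated condition):

```python
def prune_outliers(cell_counts, N, t):
    """cell_counts: {cell id: point count}; N: total objects; t: outlier fraction."""
    cells, outliers = dict(cell_counts), 0
    while cells:
        gc = min(cells, key=cells.get)           # (1) cell with the smallest root count
        if (outliers + cells[gc]) / N > t:       # (3) pruning it would exceed the budget
            break
        outliers += cells.pop(gc)                # (2) OR the cell into the outlier set
    return cells, outliers
```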
Distance Function
Data Matrix: n objects x p variables.
Dissimilarity Matrix: n objects x n objects.
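For example, a small data matrix and its derived dissimilarity matrix, using Euclidean distance (the metric choice is ours):

```python
import numpy as np

X = np.array([[3, 4], [2, 6], [4, 5]])      # data matrix: n = 3 objects, p = 2 variables
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))        # dissimilarity matrix: n x n, symmetric, zero diagonal
```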
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990).
Uses the single-link method (the distance between two sets is the minimum pairwise distance).
Merges the nodes that are most similar.
Eventually all nodes belong to the same cluster.
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990).
Inverse order of AGNES: initially all objects are in one cluster, which is then split according to some criterion (e.g., again maximizing some aggregate measure of pairwise dissimilarity).
Eventually each node forms a cluster on its own.
Contrasting Clustering Techniques
Partitioning algorithms: partition a dataset into k clusters, e.g., k = 3.
Hierarchical algorithms: create a hierarchical decomposition of ever-finer partitions, e.g., top down (divisive) or bottom up (agglomerative).
Hierarchical Clustering
Hierarchical Clustering (top down)
In either case, one gets a nice dendrogram in which any maximal anti-chain (no two nodes linked) is a clustering (partition).
Hierarchical Clustering (cont.)
Recall that any maximal anti-chain (a maximal set of nodes in which no two are chained) is a clustering (a dendrogram offers many).
Hierarchical Clustering (cont.)
But the horizontal anti-chains are the clusterings resulting from the top-down (or bottom-up) method(s).
Hierarchical Clustering (cont.)
Most hierarchical clustering algorithms are variants of the single-link, complete-link, or average-link methods.
Of these, single-link and complete link are most popular.
In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn one from each cluster.
In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between pairs of patterns drawn one from each cluster.
In the average-link algorithm, the distance between two clusters is the average of all pairwise distances between pairs of patterns drawn one from each cluster (in the vector-space case this is closely related to the distance between the means, which is easier to calculate).
Distance Between Clusters
- Single link: smallest distance between any pair of points from the two clusters.
- Complete link: largest distance between any pair of points from the two clusters.
- Average link: average distance between points from the two clusters.
- Centroid: distance between the centroids of the two clusters.
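A sketch computing all four inter-cluster distances just listed (NumPy; function and key names are ours):

```python
import numpy as np

def cluster_distances(A, B):
    """A, B: arrays of points from two clusters; returns the four linkage distances."""
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        "single":   pairwise.min(),                        # smallest pairwise distance
        "complete": pairwise.max(),                        # largest pairwise distance
        "average":  pairwise.mean(),                       # mean over all pairs
        "centroid": np.linalg.norm(A.mean(0) - B.mean(0)), # distance between centroids
    }
```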
Single Link vs. Complete Link (cont.)
[Figure panels: single link works but complete link doesn't; complete link works but single link doesn't (two examples); single link works, complete link doesn't; single link doesn't work, complete link does (1-cluster, noise, 2-cluster).]
Hierarchical vs. Partitional
Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic (non-roundish) clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters. On the other hand, the time and space complexities of the partitional algorithms are typically lower than those of the hierarchical algorithms.
More on Hierarchical Clustering Methods
Major weaknesses of agglomerative clustering methods:
- do not scale well: time complexity of at least O(n^2), where n is the total number of objects
- can never undo what was done previously (greedy algorithm)
Integration of hierarchical with distance-based clustering:
BIRCH (1996): uses a Clustering Feature tree (CF-tree) and incrementally adjusts the quality of sub-clusters.
CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering using dynamic modeling
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
Several interesting studies:
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)
Density-Based Clustering: Background
Two parameters:
- Eps: maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(p) = {q in D | dist(p,q) <= Eps}
Directly (density-)reachable: a point p is directly density-reachable from a point q wrt Eps, MinPts if
1) p belongs to N_Eps(q), and
2) q is a core point: |N_Eps(q)| >= MinPts
Density-Based Clustering: Background (II)
Density-reachable: a point p is density-reachable from a point q wrt Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i.
Density reachability is reflexive and transitive, but not symmetric, since only core objects can be density-reachable from each other.
Density-connected: a point p is density-connected to a point q wrt Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt Eps, MinPts.
Density reachability is not symmetric; density connectivity inherits the reflexivity and transitivity and provides the symmetry. Thus density connectivity is an equivalence relation and therefore gives a partition (clustering).
[Figure: chains illustrating density-reachability and density-connectivity.]
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as an equivalence class of density-connected points.
This gives the transitive property for the density-connectivity binary relation, so it is an equivalence relation whose classes form a partition (clustering), by the usual correspondence between equivalence relations and partitions.
Discovers clusters of arbitrary shape in spatial databases with noise.
[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 3.]
DBSCAN: The Algorithm
1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p wrt Eps, MinPts.
3. If p is a core point, a cluster is formed (note: it doesn't matter which of the core points within a cluster you start at, since density reachability is symmetric on core points).
4. If p is a border point or an outlier, no points are density-reachable from p, and DBSCAN visits the next point of the database. Keep track of such points; if they don't get scooped up by a later core point, then they are outliers.
5. Continue the process until all of the points have been processed.
What about a simpler version of DBSCAN:
- Define core points and core neighborhoods the same way.
- Define an (undirected graph) edge between two points if they cohabit a core neighborhood.
- The connectivity-component partition is the clustering.
Other related methods? How does vertical technology help here? Gridding?
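A compact sketch of the algorithm as stated, using brute-force neighborhood retrieval rather than a spatial index (all names are ours):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Returns one cluster id per point; -1 marks outliers (noise)."""
    n = len(X)
    # N_Eps(p) for every p, by brute force (an index or P-tree gridding would speed this up)
    nbrs = [np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps) for i in range(n)]
    labels, cid = np.full(n, -1), 0
    for p in range(n):
        if labels[p] != -1 or len(nbrs[p]) < min_pts:
            continue                       # already clustered, or p is not a core point
        labels[p], seeds = cid, list(nbrs[p])
        while seeds:                       # grow the cluster via density-reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cid            # border points get scooped up here
                if len(nbrs[q]) >= min_pts:
                    seeds.extend(nbrs[q])  # q is core: expand through its neighborhood
        cid += 1
    return labels
```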
OPTICS: Ordering Points To Identify the Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99). http://portal.acm.org/citation.cfm?id=304187
Addresses the shortcoming of DBSCAN, namely choosing parameters.
Develops a special order of the database wrt its density-based clustering structure. This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings.
Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
OPTICS: Does this order resemble the Total Variation order?
OPTICS: Some Extensions from DBSCAN
Index-based: k = number of dimensions, N = number of points (20), p = 75%, M = N(1-p) = 5.
Complexity: O(kN^2).
Core distance.
Reachability distance: r(p, o) = max(core-distance(o), d(o, p)).
[Figure: MinPts = 5, Eps = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm.]
[Figure: reachability-distance plotted over the cluster order of the objects; some values undefined.]
DENCLUE: Using Density Functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD'98).
Major features:
- Solid mathematical foundation
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
- Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45, claimed by the authors???)
- But needs a large number of parameters
DENCLUE: Technical Essence
Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure.
Influence function: describes the impact of a data point within its neighborhood; F(x,y) measures the influence that y has on x.
A very good influence function is the Gaussian, F(x,y) = e^(-d^2(x,y)/(2*sigma^2)).
Others include functions similar to the squashing functions used in neural networks.
One can think of the influence function as a measure of the contribution to the density at x made by y.
The overall density of the data space can be calculated as the sum of the influence functions of all data points.
Clusters can then be determined mathematically by identifying density attractors.
Density attractors are local maxima of the overall density function.
DENCLUE(D, sigma, xi_c, xi)
1. Grid the data set (use r = sigma, the std. dev.).
2. Find (highly) populated cells (use a threshold = xi_c) (shown in blue); identify populated cells (+ nonempty cells).
3. Find density attractor points, C*, using hill climbing:
- Randomly pick a point, p_i.
- Compute the local density (use r = 4).
- Pick another point, p_{i+1}, close to p_i, and compute the local density at p_{i+1}.
- If LocDen(p_i) < LocDen(p_{i+1}), climb.
- Put all points within distance sigma/2 of the path p_i, p_{i+1}, ..., C* into a density-attractor cluster called C*.
4. Connect the density-attractor clusters, using a threshold, xi, on the local densities of the attractors.
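A sketch of the density function and the hill climb, using the Gaussian influence function above (the gradient step rule, step size, and names are our assumptions):

```python
import numpy as np

def density(x, data, sigma):
    """Overall density at x: sum of the Gaussian influences of all data points."""
    d2 = ((data - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def hill_climb(x, data, sigma, step=0.1, max_iter=200):
    """Follow increasing local density from x toward a density attractor."""
    for _ in range(max_iter):
        w = np.exp(-((data - x) ** 2).sum(axis=1) / (2 * sigma ** 2))
        grad = (w[:, None] * (data - x)).sum(axis=0)   # kernel-density gradient, up to a constant
        if np.linalg.norm(grad) < 1e-9:
            break                                      # at (or very near) a local maximum
        x_next = x + step * grad / np.linalg.norm(grad)
        if density(x_next, data, sigma) <= density(x, data, sigma):
            break                                      # cannot climb: x is the attractor
        x = x_next
    return x
```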
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Multimedia Databases with Noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining. AAAI Press, 1998. & KDD 99 Workshop.
Comparison: DENCLUE vs. DBSCAN
BIRCH (1996)
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96). http://portal.acm.org/citation.cfm?id=235968.233324
Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
Weaknesses: handles only numeric data, and is sensitive to the order of the data records.
BIRCH ABSTRACT
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments.
Clustering Feature Vector: CF = (N, LS, SS) = (5, (16,30), (54,190)) for the points (3,4), (2,6), (4,5), (4,7), (3,8), where N is the number of points, LS the linear sum, and SS the square sum (per dimension).
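Checking the example: N = 5, LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30), SS = (54, 190). A sketch (the additivity of CFs is what makes BIRCH incremental):

```python
import numpy as np

pts = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])

def cf(points):
    """Clustering Feature: (N, LS, SS), with per-dimension linear and square sums."""
    return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

N, LS, SS = cf(pts)              # (5, [16 30], [54 190]), matching the slide
centroid = LS / N                # derived statistics come straight from the CF
variance = SS / N - centroid ** 2
# CFs are additive: merging two subclusters just adds the three components
```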
[Embedded chart data: x-y point sets and computed centers for the k-means/PAM example plots; omitted.]
BIRCH example
Branching factor B = 6, threshold L = 7.
Iteratively put points into the closest leaf until the threshold is exceeded, then split the leaf.
Internal nodes (Inodes) summarize their subtrees, and Inodes get split when the threshold is exceeded.
Once the in-memory CF tree is built, use another method to cluster the leaves together.
CURE (Clustering Using REpresentatives)
CURE: proposed by Guha, Rastogi & Shim, 1998. http://portal.acm.org/citation.cfm?id=276312
- Stops the creation of a cluster hierarchy if a level consists of k clusters.
- Uses multiple representative points to evaluate the distance between clusters.
- Adjusts well to arbitrarily shaped clusters (not necessarily distance-based).
- Avoids the single-link effect.
Drawbacks of Distance-Based Methods
Drawbacks of square-error-based clustering methods:
- Consider only one point as representative of a cluster.
- Good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated.
CURE: The Algorithm
Very much a hybrid method (involves pieces from many others):
1. Draw a random sample s.
2. Partition the sample into p partitions of size s/p.
3. Partially cluster the partitions into s/(pq) clusters.
4. Eliminate outliers: by random sampling; if a cluster grows too slowly, eliminate it.
5. Cluster the partial clusters.
6. Label the data on disk.
CURE ABSTRACT
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers.We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size.CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction.Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers.To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters.Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms.Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
Data Partitioning and Clustering: s = 50, p = 2, s/p = 25, s/(pq) = 5. [Figure: sampled points marked x.]
CURE: Shrinking Representative Points
Shrink the multiple representative points towards the gravity center by a fraction alpha.
Multiple representatives capture the shape of the cluster.
Clustering Categorical Data: ROCK. http://portal.acm.org/citation.cfm?id=351745
ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE'99).
- Agglomerative hierarchical clustering.
- Uses links to measure similarity/proximity; not distance-based.
- Computational complexity: O(n^2 + n*m_m*m_a + n^2 log n).
Basic ideas: similarity function and neighbors.
Let T1 = {1,2,3}, T2 = {3,4,5}; sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = 1/5 = 0.2.
ROCK ABSTRACT
Clustering, in data mining, is useful to discover distribution patterns in the underlying data.Clustering algorithms usually employ a distance metric based (e.g., euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions.In this paper, we study clustering algorithms for data with boolean and categorical attributes.We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points.We develop a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters.Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge.In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques.For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.
ROCK: Algorithm
Links: the number of common neighbors between the two points.
Algorithm:
1. Draw a random sample.
2. Cluster with links.
3. Label the data on disk.
Example: {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
link({1,2,3}, {1,2,4}) = 3
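A sketch reproducing both numbers, assuming the neighbor threshold is theta = 0.5 (i.e., two baskets are neighbors when their Jaccard similarity is at least 0.5); the threshold value is our inference from the example:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

baskets = [{1,2,3},{1,2,4},{1,2,5},{1,3,4},{1,3,5},
           {1,4,5},{2,3,4},{2,3,5},{2,4,5},{3,4,5}]
theta = 0.5                                  # neighbor threshold on Jaccard similarity

def neighbors(t):
    return [b for b in baskets if b != t and jaccard(t, b) >= theta]

def link(t1, t2):
    """link = number of common neighbors of t1 and t2."""
    return sum(1 for b in neighbors(t1) if b in neighbors(t2))

print(jaccard({1,2,3}, {3,4,5}))             # 0.2, the T1, T2 example
print(link({1,2,3}, {1,2,4}))                # 3, as on the slide
```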
CHAMELEON
CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar (1999). http://portal.acm.org/citation.cfm?id=621303
Measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.
A two-phase algorithm:
1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters.
CHAMELEON ABSTRACT
Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model.By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters.Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged.Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters.Another set of schemes (the Rock algorithm, group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters.By considering either interconnectivity or closeness only, these algorithms can select and merge the wrong pair of clusters.Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters.Chameleon finds the clusters in the data set by using a two-phase algorithm.During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the data items into several relatively small subclusters.During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these sub-clusters.
Overall Framework of CHAMELEON
[Figure: Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters.]
Grid-Based Clustering Methods
Use a multi-resolution grid data structure.
Several interesting methods:
- STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
- WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using wavelets
- CLIQUE: Agrawal, et al. (SIGMOD'98)
Vertical gridding
We can observe that almost all methods discussed so far suffer from the curse of cardinality (for very-large-cardinality data sets, the algorithms are too slow to finish in an average lifetime!) and/or the curse of dimensionality (points are all at ~ the same distance).
The work-arounds employed to address the curses:
Sampling: throw out most of the points in such a way that what remains is of low enough cardinality for the algorithm to finish, and in such a way that the remaining sample contains all the information of the original data set (therein is the problem: that is impossible to do in general).
Gridding: agglomerate all points in a grid cell and treat them as one point (smooth the data set to this gridding level). The problem with gridding, often, is that information is lost and the data structure that holds the grid-cell information is very complex. With vertical methods (e.g., P-trees), all the information can be retained and griddings can be constructed very efficiently on demand. Horizontal data structures can't do this.
Subspace restrictions (e.g., Principal Components, Subspace Clustering)
Gradient-based methods (e.g., the gradient tangent vector field of a response surface reduces the calculations to the number of dimensions, not the number of combinations of dimensions).
j-hi gridding: the j hi-order bits identify a grid cell and the rest identify points within a particular cell. Thus, j-hi cells are not necessarily cubical (unless all attribute bit-widths are the same).
j-lo gridding: the j lo-order bits identify points within a particular cell and the rest identify the grid cell. Thus, j-lo cells always have a nice uniform shape (cubical).
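A sketch of both cell-id computations as bit operations (bit-widths and all names are ours):

```python
def j_hi_cell(point, bitwidths, j):
    """j-hi gridding: the j hi-order bits of each coordinate give the cell id."""
    return tuple(x >> (b - j) for x, b in zip(point, bitwidths))

def j_lo_cell(point, j):
    """j-lo gridding: drop the j lo-order bits; cells are cubical with side 2**j."""
    return tuple(x >> j for x in point)

p = (0b101, 0b01, 0b110)             # a point in R(A1, A2, A3), bit-widths 3, 2, 3
print(j_hi_cell(p, (3, 2, 3), 1))    # (1, 0, 1): one hi bit per dimension
print(j_lo_cell(p, 1))               # (2, 0, 3): remaining hi bits identify the cell
```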
1-hi gridding of vector space R(A1, A2, A3) in which all bit-widths are the same (= 3), so each grid cell contains 2^2 * 2^2 * 2^2 = 64 potential points. Grid cells are identified by their Peano id (Pid); internally, the point's cell coordinates are shown, called the grid cell id (gci), and cell points are identified by coordinates within their cell (gcp).
[Diagram: enumeration of the 64 gcp values (00,00,00 through 11,11,11) for the cell with Pid = 001.]
2-hi gridding of vector space R(A1, A2, A3) in which all bit-widths are the same (= 3), so each grid cell contains 2^1 * 2^1 * 2^1 = 8 points.
[Diagram: the 8 gcp values (0,0,0 through 1,1,1) for the cell with Pid = 001.001.]
1-hi gridding of R(A1, A2, A3), bit-widths 3, 2, 3.
[Diagram: gcp enumeration for the cell with Pid = 001.]
2-hi gridding of R(A1, A2, A3), bit-widths 3, 2, 3 (each grid cell contains 2^1 * 2^0 * 2^1 = 4 potential points).
[Diagram: the 4 gcp values for the cell with Pid = 3.1.3.]
HOBBit disks and rings (HOBBit = Hi Order Bifurcation Bit): a 4-lo grid where A1, A2, A3 have bit-widths b1+1, b2+1, b3+1. HOBBit grid centers are points of the form (exactly one per grid cell):
x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010), where the xi,j range over all binary patterns.
HOBBit disk about x, of radius 2^0: H(x, 2^0).
Note: we have switched the direction of A3.
[Diagram: the 8 points of H(x, 2^0), with coordinates of the form (x1,b1..x1,4 101a, x2,b2..x2,4 101b, x3,b3..x3,4 101c) for a, b, c in {0, 1}.]
H(x, 2^1): the HOBBit disk of radius 2^1 about a HOBBit grid center point x.
[Diagram: corner points of H(x, 2^1), with low-order bits ranging over 1000..1011 in each dimension.]
The regions of H(x, 2^1) are labeled by the dimensions in which the length is increased (e.g., in the 123-region all three dimensions are increased).
[Diagrams: the 123-, 13-, 23-, 12-, 3-, 2-, and 1-regions of H(x, 2^1); H(x, 2^0) is the 123-region of H(x, 2^0).]
Algorithm (for computing gradients):
1. Select an outlier threshold, ε_ot (points without neighbors in their ε_ot L∞-disk are outliers; that is, there is no gradient at these outlier points, so the instantaneous rate of response change is zero).
2. Create a j-lo grid with j = ot (see the previous slides, where HOBBit disks are built out from HOBBit centers x = (x1,b1..x1,ot+1 1010, ..., xn,bn..xn,ot+1 1010), with the xi,j ranging over all binary patterns).
3. Pick a point x in R. Build out alternating one-sided rings centered at x until a neighbor is found or the radius ε_ot is exceeded (in which case x is declared an outlier). If a neighbor is found at a radius r_i < ε_ot <= 2^j, then ∂f/∂x_k(x) is estimated as below.
Note: one can use L∞ HOBBit or L∞ ordinary distance.
Note: one-sided means that each successive build-out increases alternately only in the positive direction in all dimensions, then only in the negative direction in all dimensions.
Note: building out HOBBit disks from a HOBBit center automatically gives one-sided rings (a built-out ring is defined to be the built-out disk minus the previous built-out disk), as shown in the next few slides.
∂f/∂x_k(x) ≈ (RootCount D(x, r_i) - RootCount D(x, r_i)_k) / Δx_k, where D(x, r_i)_k is D(x, r_{i-1}) expanded in all dimensions except k.
Alternatively, in step 3, actually calculate the mean (or median?) of the new points encountered in D(x, r_i) (we have a P-tree mask for the set, so this is trivial) and measure the Δx_k-distance.
NOTE: one might want to go one more ring out to see if one gets the same or a similar gradient (this seems particularly important when j is odd, since the gradient then points the opposite way).
NOTE: the calculation of Δx_k can be done in various ways. Which is best?
[Figure sequence: gradient-estimation examples. Rings H(x, 2^1) are built out about a HOBBit center x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010) until the first new point is found; then, per dimension k:
Est ∂f/∂x_1(x) = (RootCount D(x, r_i) - RootCount D(x, r_i)_1) / Δx_1 = (2-2)/(-1) = 0
(RootCount D(x, r_i) - RootCount D(x, r_i)_2) / Δx_2 = (2-1)/(-1) = -1
(RootCount D(x, r_i) - RootCount D(x, r_i)_3) / Δx_3 = (2-1)/(-1) = -1
Other placements of the new point give zero components, e.g. (2-2)/(-1) = 0 in dimensions 2 and 3.]
Intuitively, this gradient-estimation method seems to work.
Next we consider a potential accuracy improvement, in which we take the medoid of all new points as the gradient (or, more accurately, as the point to which we climb in any response-surface hill-climbing technique).
[Figure: H(x, 2^1) with the new points marked; their centroid shown.] Estimate the gradient arrowhead as being at the medoid of the new point set (or, more correctly, estimate the next hill-climb step). Note: if the original points are truly part of a strong cluster, the hill climb will be excellent.
[Figure: the same construction on a weaker cluster.] Note: if the original points are not truly part of a strong cluster, the weak hill climb will indicate that.
[Figure: H(x, 2^2) build-out; the first new point.]
To evaluate how well the formula estimates the gradient, it is important to consider all cases of the new point appearing in one of these regions (if one point appears, gradient components are additive, so it suffices to consider one point at a time).
[Figure: H(x, 2^3).] Notice that the HOBBit center moves more and more toward the true center as the grid size increases.
Grid-Based Gradients and Hill Climbing
If we are using gridding to produce the gradient vector field of a response surface, might we always vary x_i in the positive direction only? How can that be done most efficiently?
1. j-lo gridding, building out HOBBit rings from HOBBit grid centers (see the previous slides, where this approach was used).
2. j-lo gridding, building out HOBBit rings from lo-value grid points (ending in j 0-bits): x = (x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0).
3. Ordinary j-lo gridding, building out rings from lo-value ids (ending in j zero bits).
4. Ordinary j-lo gridding, building out rings from true centers.
5. Other? (There are many other possibilities, but we will first explore 2.)
Using j-lo gridding with j = 3 and lo-value cell identifiers is shown on the next slide.
Of course, we need not use HOBBit build-out. With ordinary unit-radius build-out, the results are more exact, but the calculations may be more complex???
HOBBit j-lo rings using lo-value cell ids, x = (x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0):
Ring(x, 2) = PDisk(x, 2) ∧ PDisk(x, 1)', where PDisk(x, i) = P_{x,b} ∧ .. ∧ P_{x,j+1} ∧ P'_j ∧ .. ∧ P'_{i+1}.
Ordinary j-lo rings using lo-value cell ids: Ring(x, 3) = PDisk(x, 3) ∧ PDisk(x, 2)'.
k-Medoids Clustering Review
Find representative objects (medoids): actual objects in the cluster (the mean seldom is).
PAM (Partitioning Around Medoids):
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_i,h.
3. For each pair of i and h, if TC_i,h < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
CLARA (Clustering LARge Apps) draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness: Efficiency depends on the sample size. A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS (Clustering Large Apps based on RANdom Search) draws a sample of neighbors dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum (Genetic-Algorithm-like). Finally the best local optimum is chosen after some stopping condition. It is more efficient and scalable than both PAM and CLARA.
A Vertical k-Medoids Clustering Algorithm
Following PAM (to illustrate the main killer idea, but it can apply much more widely):
1. Select k component P-trees. The goal here is to efficiently get one P-tree mask for each component; e.g., calculate the smallest j such that the j-lo gridding has > k cells, then agglomerate into precisely k components (by ORing the P-trees of cells with the closest means/medoids/corners (single-link)).
2. Where PAM uses "for each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_i,h; for each pair of i and h, if TC_i,h < 0, i is replaced by h; then assign each non-selected object to the most similar object", instead: find the medoid of each component C_i: calculate TV(C_i, x) for all x in C_i (create a P-tree of all points with the minimum TV so far; on a smaller TV, reset this P-tree), ending up with a P-tree, PtM_i, of tying medoids for C_i. Calculate TV(PtM_i, x) for all x in PtM_i and pick its medoid (if there are still multiple, repeat). This is the C_i-medoid! Alternatively, just pick one pre-medoid. (Note: this avoids the expense of pairwise swappings, and avoids subsampling as in CLARA and CLARANS.)
3. Put each point with its closest medoid (building P-tree component masks as you do this).
4. Repeat 2 and 3 until (some stopping condition, such as: no change in the medoid set?).
Can we cluster at step 3 without a scan (creating component P-trees)?
A Vertical k-Means Clustering Algorithm
As mentioned on the previous slide, a vertical k-means algorithm goes similarly:
1. Select k component P-trees (the goal here is to efficiently get one P-tree mask for each component; e.g., calculate the smallest j such that the j-lo gridding has > k cells, and agglomerate into precisely k components by ORing the P-trees of cells with the closest means).
2. Calculate the mean of each component, C_h, by ANDing each basic P-tree with PC_h. In dimension A_k, calculate the k-th component of the mean as (Σ_{i=b_k..0} 2^i · rc(PC_h ∧ P_{k,i})) / rc(PC_h).
3. Put each point with the closest mean (building P-tree component masks as you do this).
4. Repeat 2 and 3 until (some stopping condition, such as: no change in the mean set?).
Can we cluster at step 3 without a scan (creating component P-trees)?
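The mean formula in step 2 can be checked with ordinary bit vectors standing in for the basic P-trees (rc = root count = number of 1-bits; the toy data is ours):

```python
import numpy as np

vals = np.array([5, 3, 6, 2, 7, 1])          # attribute A_k: 3-bit values for 6 points
bits = [(vals >> i) & 1 for i in (2, 1, 0)]  # basic P-trees P_k,2 .. P_k,0 as bit vectors

mask = np.array([1, 1, 0, 1, 0, 1])          # component mask PC_h (4 points)
def rc(p):                                   # root count = number of 1-bits
    return int(p.sum())

# mean over the component: sum_i 2^i * rc(PC_h AND P_k,i) / rc(PC_h)
mean_k = sum((2 ** i) * rc(mask & b) for i, b in zip((2, 1, 0), bits)) / rc(mask)
print(mean_k)                                # 2.75 == np.mean(vals[mask == 1])
```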
Zillions of vertical hybrid clustering algorithms leap to mind (involving partitioning, hierarchical, and density methods)! Pick one!
Finding Density Attractors for a Density Clustering Algorithm
Finding density attractors (and their attractor sets, which, when agglomerated via a density threshold, constitute the generic density clustering algorithm):
1. Pick a point.
2. Build out (HOBBit or ordinary or ?) rings one at a time until the first k neighbors are found.
3. In that ring, compute the medoid points (as in the medoid-finding step of the vertical k-medoids algorithm above).
4. If the medoid increases the density, climb to it and go to 2; else declare it a density attractor and go to 1.
j-grids and the P-tree relationship
1-hi-gridding of R(A1, A2, A3). Bit-widths: 3, 2, 3 (dimension cardinalities: 8, 4, 8).
Using tree-node-like identifiers:
- cell (tree) id (ci) of the form c0.c1...cd
- point (coordinate) id (pi) of the form p1,p2,p3
[Diagram: the cell/point ids 00.0.00 through 11.1.11.]
A 1-hi-grid yields a P-tree with level-0 (cell-level) fanout of 2^3 and level-1 (point-level) fanout of 2^5, if leaves are segment-labelled (not coords).
j-grids and the P-tree relationship (cont.)
One can view a standard P-tree as nested 1-hi-griddings, with constant subtrees compressed out. R(A1, A2, A3) with bit-widths 3, 2, 3.
Gridding categorical data?
The following bioinformatics (yeast genome) data was extracted mostly from the MIPS database (Munich Information center for Protein Sequences).
The left column shows features; treat these with a hi-order bit (1 iff the gene participates). There may be more levels of hierarchy (e.g., function: some genes actually cause the function when they express in sufficient quantities, while others are transcription factors for those primary genes. Primary genes have the hi bit on; transcription factors have the second bit on).
The right column shows the number of distinct feature values; bitmap these.
Data Representation: a gene-by-feature table.
For a categorical feature, we consider each category as a separate attribute or column by bit-mapping it.
The resulting table has a total of 8039 distinct feature bit vectors (corresponding to items in MBR) for 6374 yeast genes (corresponding to transactions in MBR).
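A sketch of the bit-mapping step, with one bit column per distinct category value (the toy feature and names are ours):

```python
import numpy as np

genes = ["g1", "g2", "g3"]
func = {"g1": "metabolism", "g2": "transport", "g3": "metabolism"}

cats = sorted(set(func.values()))    # one column per distinct category value
table = np.array([[int(func[g] == c) for c in cats] for g in genes])
print(cats)     # ['metabolism', 'transport']
print(table)    # [[1 0] [0 1] [1 0]]  -- the gene-by-feature bit table
```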
STING: A Statistical Information Grid Approach
Wang, Yang and Muntz (VLDB'97).
The spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution.
STING: A Statistical Information Grid Approach (2)
- Each cell at a high level is partitioned into a number of smaller cells at the next lower level.
- Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, s (standard deviation), min, max; type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries: start from a pre-selected layer, typically with a small number of cells; for each cell in the current level, compute the confidence interval.
STING: A Statistical Information Grid Approach (3)
- Remove the irrelevant cells from further consideration.
- When finished examining the current layer, proceed to the next lower level.
- Repeat this process until the bottom layer is reached.
Advantages: query-independent, easy to parallelize, incremental update; O(K), where K is the number of grid cells at the lowest level.
Disadvantages: all the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.
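For instance, a parent cell's count and mean follow directly from its children's (a minimal sketch; aggregating s, min, and max is analogous):

```python
def parent_stats(children):
    """children: list of (count, mean) pairs for the lower-level cells."""
    n = sum(c for c, _ in children)
    mean = sum(c * m for c, m in children) / n       # count-weighted mean
    return n, mean

print(parent_stats([(10, 2.0), (30, 4.0)]))          # (40, 3.5)
```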
WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB'98).
A multi-resolution clustering approach which applies a wavelet transform to the feature space. A wavelet transform is a signal-processing technique that decomposes a signal into different frequency sub-bands.
Both grid-based and density-based.
Input parameters: the number of grid cells for each dimension, the wavelet, and the number of applications of the wavelet transform.
What Is a Wavelet? (1)
WaveCluster (1998)
How to apply the wavelet transform to find clusters:
- Summarize the data by imposing a multidimensional grid structure onto the data space.
- These multidimensional spatial data objects are represented in an n-dimensional feature space.
- Apply the wavelet transform on the feature space to find the dense regions in the feature space.
- Apply the wavelet transform multiple times, which results in clusters at different scales, from fine to coarse.
What Is a Wavelet? (2)
Quantization
Transformation
WaveCluster (1998)
Why is the wavelet transformation useful for clustering?
- Unsupervised clustering: it uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries.
- Effective removal of outliers.
- Multi-resolution.
- Cost efficiency.
Major features:
- Complexity O(N)
- Detects arbitrarily shaped clusters at different scales
- Not sensitive to noise, not sensitive to input order
- Only applicable to low-dimensional data
CLIQUE (Clustering In QUEst): Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98). http://portal.acm.org/citation.cfm?id=276314
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
CLIQUE can be considered both density-based and grid-based:
- It partitions each dimension into the same number of equal-length intervals.
- It partitions an m-dimensional data space into non-overlapping rectangular units.
- A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter.
- A cluster is a maximal set of connected dense units within a subspace.
CLIQUE: The Major Steps
1. Partition the data space and find the number of points that lie inside each cell of the partition.
2. Identify the subspaces that contain clusters using the Apriori principle.
3. Identify clusters: determine dense units in all subspaces of interest; determine connected dense units in all subspaces of interest.
4. Generate a minimal description for the clusters: determine maximal regions that cover each cluster of connected dense units; determine a minimal cover for each cluster.
CLIQUE ABSTRACT
Data mining applications place special requirements on clustering algorithms, including: the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records.
We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented, and does not presume any specific mathematical form for the data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.
[Figure: the CLIQUE example; dense units in the salary (x $10,000, axis 20-60) and age (axis 30-70) dimensions, density threshold τ = 3.]
Strengths and Weaknesses of CLIQUE
Strengths:
- It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
- It is insensitive to the order of records in the input and does not presume some canonical data distribution.
- It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
Weakness:
- The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical model.
Statistical and AI approaches.
Conceptual clustering:
- A form of clustering in machine learning.
- Produces a classification scheme for a set of unlabeled objects.
- Finds a characteristic description for each concept (class).
COBWEB (Fisher, 1987):
- A popular and simple method of incremental conceptual learning.
- Creates a hierarchical clustering in the form of a classification tree.
- Each node refers to a concept and contains a probabilistic description of that concept.
COBWEB Clustering Method: a classification tree.
More on Statistical-Based Clustering
Limitations of COBWEB:
- The assumption that the attributes are independent of each other is often too strong, because correlations may exist.
- Not suitable for clustering large database data: skewed trees and expensive probability distributions.
CLASSIT:
- An extension of COBWEB for incremental clustering of continuous data.
- Suffers from problems similar to COBWEB's.
AutoClass (Cheeseman and Stutz, 1996):
- Uses Bayesian statistical analysis to estimate the number of clusters.
- Popular in industry.
Other Model-Based Clustering Methods
Neural network approaches:
- Represent each cluster as an exemplar, acting as a prototype of the cluster.
- New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure.
Competitive learning:
- Involves a hierarchical architecture of several units (neurons).
- Neurons compete in a winner-takes-all fashion for the object currently being presented.
Model-Based Clustering Methods
Self-Organizing feature Maps (SOMs)
- Clustering is also performed by having several units compete for the current object.
- The unit whose weight vector is closest to the current object wins.
- The winner and its neighbors learn by having their weights adjusted.
- SOMs are believed to resemble processing that can occur in the brain.
- Useful for visualizing high-dimensional data in 2- or 3-D space.
Hybrid Clustering
One common approach is to combine the k-means method and hierarchical clustering: first partition the dataset into K small clusters, then merge the clusters based on similarity using a hierarchical method.
Hybrid clustering combines the partitioning and hierarchical clustering approaches. [Figure: K = 7.]
Problems of Existing Hybrid Clustering
- Must predefine the number of preliminary clusters, K.
- Unable to handle noisy data.
378.682007
260.118011
432.519989
401.90799
274.730988
269.661987
266.098999
466.01001
436.523987
297.326996
258.661011
176.785995
87.992996
264.59201
345.386993
273.285004
210.695007
255.416
311.945007
161.763
438.864014
441.990997
415.569
243.820999
129.319
147.042999
439.656006
99.763
289.109985
290.513
178.498001
267.716003
282.105011
366.289001
132.516006
138.520004
338.832001
294.417999
454.813995
167.947998
175.723007
86.389
378.95401
71.641998
340.763
127.134003
144.188995
433.039001
250.498001
433.303009
377.10199
321.985992
452.03299
227.639008
428.309998
375.893005
296.425995
105.628998
118.564003
443.565002
318.289001
268.94101
415.536987
278.402008
256.351013
360.056
430.358002
339.683014
181.393005
97.700996
270.011993
417.610992
398.066986
418.42099
453.471985
424.186005
296.593994
309.856995
341.031006
379.769989
73.056
275.959015
331.747009
250.753006
401.087006
150.406998
76.752998
169.544998
257.312988
430.457001
78.228996
120.024002
340.161987
162.809006
259.507996
273.309998
199.544006
119.684998
72.434998
302.149994
437.391998
73.427002
211.983002
345.201996
275.21701
383.062012
253.695007
427.82901
156.423004
182.039993
274.351013
462.704987
80.511002
270.144012
455.846985
269.567993
67.773003
163.164001
199.304001
344.76001
139.994003
407.27301
195.783997
304.009003
107.469002
50.537998
433.842987
158.886002
129.537003
306.761993
180.453003
398.321014
404.480988
395.395996
156.139999
218.270996
323.076996
282.971008
278.497986
84.252998
336.920013
376.751007
395.835999
319.217987
446.821014
78.745003
217.658997
137.832993
370.489014
418.985992
142.259003
250.199005
325.141998
297.856995
228.520996
59.577999
209.645004
453.653015
386.773987
308.195007
274.651001
65.334999
101.911003
175.979004
342.975006
248.554001
173.005005
105.205002
343.441986
350.524994
322.252991
78.988998
54.297001
78.147003
84.073997
73.643997
270.988007
261.151001
369.681
302.684998
276.309998
391.437988
441.592987
389.21701
301.868011
332.975006
61.547001
267.192993
256.776001
215.445999
401.992004
334.279999
157.289993
284.052002
72.517998
260.419006
250.524002
161.416
283.020996
326.403015
57.397999
407.739014
252.589996
226.102005
72.181999
78.222
143.785995
248.574997
305.502014
339.597992
82.142998
243.878998
310.395996
455.938995
262.346008
288.390991
147.794006
321.225006
271.954987
343.845001
75.432999
416.102997
360.662994
194.169998
429.386993
77.363998
403.014008
70.153
154.610992
406.69101
376.410004
97.058998
96.771004
358.548004
346.89801
252.451996
426.06601
192.048996
328.825012
293.309998
257.38501
325.351013
319.729004
420.747986
112.197998
425.375
410.444
176.154999
114.669998
401.640991
396.261993
180.768005
99.038002
438.881012
316.339996
174.324005
328.757996
144.485001
249.505997
387.660004
87.093002
463.450989
431.338989
386.653992
415.201996
434.864014
224.052994
178.468002
440.371002
282.946991
270.343994
262.238007
215.779007
383.192993
170.195999
224.410004
87.686996
97.904999
290.696991
152.660004
351.863007
171.507996
302.785004
275.352997
398.480988
326.631989
97.151001
181.389008
207.485992
366.165009
100.343002
327.608002
176.455002
68.344002
378.123993
416.915985
139.733994
281.63501
190.054001
444.847992
340.57901
63.146
378.292999
273.065002
97.813004
320.622009
185.669006
473.703003
24.483999
412.091003
205.061996
387.076996
370.088013
76.366997
151.475006
29.521999
208.225006
95.373001
160.373993
146.220993
283.746002
279.040985
181.751007
409.786011
289.123993
213.195999
161.423004
298.419006
322.709015
280.876007
382.477997
84.125999
174.212997
351.070007
117.563004
164.625
311.829987
205.925003
294.925995
317.82901
147.365997
71.588997
194.880005
351.165985
140.703003
397.027008
94.189003
215.328003
102.426003
344.346008
333.808014
290.485992
206.615997
455.115997
278.895996
76.800003
243.326996
352.631012
148.660004
342.268005
342.846985
324.191986
360.988007
358.977997
390.992004
299.390991
203.037994
294.381012
329.169006
130.296997
367.48999
144.052994
432.006012
325.104004
404.565002
320.595001
175.610992
433.359009
397.621002
73.658997
316.027008
378.713013
245.934998
390.570007
249.983994
328.346985
77.617996
288.899994
248.367996
211.802002
350.609009
365.092987
298.55899
343.928009
346.196991
344.247009
358.319
324.983002
373.936005
259.334991
151.195999
336.997009
319.910004
291.950989
247.371994
155.651993
77.438004
98.464996
154.815002
226.987
245.671005
232.845993
170.934006
149.481995
358.247986
165.417007
254.391998
244.020004
433.979004
74.668999
407.270996
197.472
181.220001
371.238007
266.029999
409.21701
382.877014
114.547997
404.726013
242.753998
93.075996
60.666
280.398987
334.97699
183.686005
435.132996
364.247986
216.214996
72.888
389.563995
157.884995
198.061005
220.746994
409.695007
365.154999
183.472
382.11499
328.618011
333.996002
87.056
397.996002
426.53299
196.431
388.653992
398.041992
207.981003
76.360001
267.61499
300.113007
148.248993
92.306999
237.987
124.499001
92.168999
98.041
70.652
355.688995
276.632996
179.190994
259.859009
105.25
162.753998
152.190002
132.507004
258.367004
266.630005
292.63501
199.921005
291.968994
441.139008
90.570999
156.906006
44.348
316.960999
393.328003
352.893005
319.295013
333.735992
391.338013
419.11499
107.491997
183.457001
100.469002
58.148998
166.438995
371.234985
302.122009
434.735992
167.791
76.970001
169.772995
88.875
166.186005
370.381012
439.600006
225.410995
263.010986
336.962006
86.060997
138.197006
412.873993
138.320999
74.719002
286.018005
298.372986
311.81601
292.959015
430.203003
404.127014
117.292999
441.625
166.602997
223.067001
283.121002
257.003998
150.242004
346.567993
442.768005
327.471985
218.453003
71.360001
387.317993
97.149002
226.992996
234.658005
337.459991
261.569
153.382004
151.813995
400.225006
113.255997
257.829987
174.755005
371.658997
410.816986
410.968994
274.214996
429.25
333.81601
408.886993
144.095001
287.806
186.449997
447.127991
169.819
408.203003
95.992996
296.994995
82.318001
147.630005
394.014008
198.876999
225.738007
377.492004
396.618011
330.614014
391.707001
309.403015
73.32
212.580002
97.509003
299.308014
86.675003
277.989014
385.246002
385.959015
169.300003
403.140991
317.556
410.09201
363.571991
84.181999
151.227005
176.699997
344.612
54.167
286.217987
335.019989
331.001007
116.712997
295.329987
97.195
259.428009
241.511993
426.632996
406.520996
63.404999
361.436005
379.993988
138.009995
354.981995
133.296005
332.371002
395.049988
395.850006
341.876007
138.417007
196.807007
70.387001
245.037994
428.351013
150.449997
349.054993
311.506012
68.877998
370.610992
166.475998
104.400002
226.651993
401.196991
213.063995
358.035004
79.586998
319.174011
393.053009
291.835999
137.279999
403.631989
174.630005
266.860992
406.165009
270.903015
163.097
322.890015
300.397003
294.972992
163.610001
111.523003
415.471985
64.246002
167.699997
170.520996
84.876999
171.779999
274.222992
276.705994
386.942993
405.697998
101.296997
84.097
252.408005
378.540985
154.889999
69.556
383.60199
401.118011
378.358002
433.216003
163.024994
40.888
142.356003
414.045013
89.620003
156.181
408.997009
346.492004
72.890999
387.877014
134.304001
192.817001
263.433014
408.945007
328.544006
401.428009
243.158997
124.241997
103.211998
243.903
108.440002
278.07901
120.735001
156.869995
178.233994
402.009003
360.279999
314.890991
71.481003
235.912994
80.758003
442.54599
170.085999
324.546997
222.315002
179.199005
158.130005
54.550999
277.29599
302.360992
156.731995
410.261993
379.932007
317.088989
235.845001
312.161987
345.752991
391.566986
351.032013
165.990997
125.924004
428.001007
342.75
321.877991
182.382004
384.548004
101.642998
410.434998
138.641998
399.598999
147.949005
165.632996
348.501007
97.224998
166.901993
343.546997
363.811005
264.178009
199.192001
359.640991
74.566002
295.598999
422.855988
290.641998
336.321991
145.863998
416.57901
232.628006
278.605988
224.481995
283.066986
308.622009
411.105988
356.59201
127.920998
382.473999
393.781006
79.410004
264.089996
93.407997
56.235001
109.217003
302.606995
157.033997
237.731995
138.462006
256.975006
436.569
376.803009
447.808014
41.754002
145.516006
62.521999
341.408997
181.770996
383.121002
148.291
429.32901
258.213013
150.091995
225.423004
370.709015
406.423004
383.51001
457.565002
260.898987
86.235001
74.495003
370.786011
385.632996
426.46701
80.503998
274.59201
347.487
163.528
152.792999
96.172997
339.686005
89.195999
163.029999
361.953003
425.589996
70.236
140.511993
425.29599
169.134003
193.645996
453.074005
343.102997
109.096001
440.221985
367.928009
377.621002
75.623001
339.471985
325.281006
212.835007
289.411987
156.709
170.779007
91.709
375.01001
80.445
142.431
38.668999
331.184998
343.359009
278.678986
289.156006
93.642998
163.300003
369.51001
416.651001
89.018997
159.796997
316.333008
184.972
307.692993
320.852997
180.628998
306.605011
401.829987
234.240997
138.399994
421.140991
189.231003
75.636002
156.195999
188.830002
359.32901
362.572998
175.464005
414.601013
408.825012
130.658997
284.432007
249.268997
132.126999
343.442993
151.384003
422.764008
166.987
416.481995
109.268997
112.372002
405.946991
423.880005
356.048004
154.311996
82.528
131.164001
258.752991
186.382004
434.843994
214.748001
71.858002
74.245003
415.927002
418.856995
89.095001
308.346008
323.938995
353.807007
178.757996
138.608994
112.773003
471.365997
202.371994
381.23999
98.403
345.128998
231.653
142.179001
465.782013
315.213989
140.988998
367.084991
163.444
88.864998
399.480988
268.270996
468.875
151.671997
432.436005
77.856003
219.389008
344.312988
357.731995
172.016998
145.636002
164.259003
331.493988
67.574997
340.217987
155.054001
104.714996
267.143005
364.661987
202.912994
288.207001
57.477001
315.885986
447.316986
451.438995
451.384003
101.379997
78.402
370.286987
290.562988
419.558014
263.458008
138.792007
331.347992
375.790985
233.820007
392.415009
207.205002
162.774002
171.600006
317.272003
93.765999
429.627991
164.975998
296.438995
247.964005
330.001007
260.790009
342.645996
122.978996
212.537003
211.024002
167.429993
140.128998
342.687988
430.868011
219.878006
209.037003
353.421997
262.878998
437.635986
412.638
396.472992
203.552994
98.125999
308.752014
380.60199
460.664001
187.367996
76.397003
227.197006
168.507004
180.578003
66.889999
367.140015
210.742004
124.014999
111.598
158.343994
146.057007
406.005005
109.607002
222.345001
322.669006
273.713013
296.993011
368.178986
439.588989
146.792007
308.028015
166.697006
329.566986
142.259995
266.548004
205.113007
413.350006
107.244003
93.380997
318.235992
129.380005
320.734009
272.390015
353.627014
79.888
123.622002
85.425003
122.546997
420.131012
422.007996
120.481003
113.037003
84.649002
201.645004
49.078999
254.069
282.923004
303.098999
108.820999
440.880005
123.660004
380.328003
392.940002
426.683014
99.589996
107.380997
392.367004
109.783997
219.029999
113.351997
458.144989
159.531998
284.105988
281.928009
302.657013
85.002998
95.623001
126.794998
253.860992
180.358002
162.854996
335.924988
372.875
325.390991
48.554001
141.563995
271.148987
133.891006
390.184998
256.958008
89.113998
351.438995
345.062988
183.013
248.979004
194.481003
357.158997
307.311005
420.488007
426.623993
85.649002
298.033997
376.772003
270.138
367.346008
180.664993
307.286011
445.929993
315.346008
75.471001
271.596008
59.764999
194.692001
261.221008
104.393997
425.105988
317.450012
273.381989
83.054001
440.296997
290.566986
106.275002
258.665985
112.412003
213.373993
433.299011
265.584015
145.755005
81.689003
299.351013
96.694
87.267998
292.890991
236.787003
349.959015
276.385986
88.633003
309.554993
311.601013
55.071999
55.862999
255.544998
335.524994
379.445007
326.516998
264.540985
275.113007
272.235992
421.766998
445.601013
66.835999
419.242004
310.093994
159.546005
258.90799
156.988007
380.343994
176.826996
334.562012
363.760986
101.518997
158.098007
254.287994
383.035004
326.621002
227.156998
351.029999
346.748993
145.774994
292.131012
86.739998
427.131989
173.904007
375.30899
429.118011
444.993011
351.681
448.040985
90.628998
173.787003
392.322998
420.082001
161.701004
96.550003
177.444
367.837006
323.821014
288.085999
231.330002
302.390991
343.704987
93.008003
274.618988
86.745003
75.527
341.399994
153.158005
346.026001
209.048996
212.494995
309.464996
132.735992
394.15799
355.828003
373.347992
29.874001
77.667
432.730011
257.178009
164.735001
426.64801
165.335007
329.498993
157.410004
203.328003
71.196999
404.273987
326.903992
190.397995
180.082001
78.414001
288.450989
148.496002
354.68399
187.488998
298.529999
162.610001
405.229004
435.479004
192.783005
163.542999
393.212006
107.244003
335.77301
228.289993