Brendan Collins

Operation Point Cluster - Blue Raster Esri Developer Summit 2013 Presentation



“The function of the brain and nervous system is to protect us from being overwhelmed and confused by this mass of largely useless and irrelevant knowledge, by shutting out most of what we should otherwise perceive or remember at any moment, and leaving only that very small and special selection which is likely to be practically useful.”

-Aldous Huxley


103,000 Public Schools (No Clustering)


103,000 Public Schools (Count)


103,000 Public Schools (Mean Student Teacher Ratio)


Operation Point Cluster

• Review general clustering algorithms

• Suggest strategies & implementations for clustering in web applications
  – Server-side (C#)
  – Offline w/ArcGIS (Python)
  – Offline w/3rd Party (Python)


Data Classification (One-Dimensional Clustering)

• Equal-interval
  – Clusters have the same interval (max – min)

• Quantile
  – Clusters have the same count

• Natural Breaks (Jenks)
  – Clusters have minimum deviation from their class mean
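
The first two schemes are simple enough to sketch in a few lines of Python 3. This is an illustration only: the function names and the sample student-teacher ratios are made up, and Natural Breaks is left out because it needs an optimization step.

```python
def equal_interval_breaks(values, k):
    """Upper class limits for k classes that all span the same interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    """Upper class limits for k classes holding (roughly) the same count.

    Assumes len(values) >= k.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k - 1] for i in range(1, k + 1)]


# Hypothetical student-teacher ratios with one outlier.
ratios = [8, 10, 11, 12, 14, 15, 16, 18, 40]
interval_limits = equal_interval_breaks(ratios, 4)
quantile_limits = quantile_breaks(ratios, 3)
```

With an outlier like 40, the equal-interval classes leave the upper classes nearly empty, while the quantile classes stay balanced.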


KMeans (Centroid-based)


1. Choose random starting points (the initial centroids).
2. Assign each target point to its nearest centroid.
3. Replace each randomly chosen centroid with the mean of its group.
4. Repeat steps 2 & 3 until convergence.
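
A minimal pure-Python sketch of the four steps, assuming 2D points stored as tuples and squared Euclidean distance. The function name, the `seed` parameter, and the sample points are illustrative, not from the talk.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of (x, y) tuples into k groups. Returns (centroids, groups)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random starting points
    groups = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            groups[nearest].append((x, y))
        # Step 3: replace each centroid with the mean of its group.
        new_centroids = []
        for c, members in enumerate(groups):
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[c])  # keep an empty group's centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


# Two well-separated blobs of made-up points.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(sample, k=2)
```

Because the start is random, different seeds can give different results on less cleanly separated data.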


Grid Clustering (Grid-based)

1. Overlay mesh sized appropriate for zoom level

2. Compare point coordinates to mesh to create clusters.

• Very common on client-side

• Can lead to undesired “Grid” effect

• Somewhat non-deterministic
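
The two steps can be sketched as follows, assuming projected coordinates and a `cell_size` already chosen for the current zoom level. The function name and sample data are hypothetical. Each cluster carries a mean attribute as well as a count.

```python
import math
from collections import defaultdict


def grid_cluster(points, cell_size):
    """Group (x, y, value) tuples by the mesh cell they fall in."""
    cells = defaultdict(list)
    for x, y, value in points:
        # Compare the point's coordinates to the mesh: one key per cell.
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y, value))

    clusters = []
    for members in cells.values():
        n = len(members)
        clusters.append({
            "x": sum(m[0] for m in members) / n,  # mean location of the cell's points
            "y": sum(m[1] for m in members) / n,
            "count": n,
            "mean_value": sum(m[2] for m in members) / n,
        })
    return clusters


# Made-up schools: (x, y, student-teacher ratio).
schools = [(1, 1, 10), (2, 3, 20), (12, 1, 30), (-1, -1, 40)]
school_clusters = grid_cluster(schools, cell_size=10)
```

Placing each cluster at the mean of its points rather than at the cell center is one way to soften the “Grid” effect.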


QuadTree (Distance-based)

http://en.wikipedia.org/wiki/QUADTREE


1. Input minimum cluster tolerance
2. Recursively insert points into existing tree
   1. Where distance < tolerance, number of points++
   2. Where distance > tolerance, insert to child node.

• Easy to implement

• Can lead to “Grid” effect
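
A rough sketch of the insert logic, assuming a point-quadtree variant in which every node keeps the first point it received as the cluster location. The class name and sample points are illustrative; this is not the talk's C# implementation.

```python
import math


class ClusterNode:
    """Quadtree node that doubles as a cluster: a location plus a point count."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self.count = 1
        self.children = {}  # quadrant key -> ClusterNode

    def insert(self, x, y, tolerance):
        # Within tolerance of this node: absorb the point into the cluster.
        if math.hypot(x - self.x, y - self.y) < tolerance:
            self.count += 1
            return
        # Otherwise pass the point down to the child for its quadrant.
        quadrant = (x >= self.x, y >= self.y)
        child = self.children.get(quadrant)
        if child is None:
            self.children[quadrant] = ClusterNode(x, y)
        else:
            child.insert(x, y, tolerance)

    def clusters(self):
        """Yield (x, y, count) for this node and all of its descendants."""
        yield (self.x, self.y, self.count)
        for child in self.children.values():
            yield from child.clusters()


# Made-up points and a tolerance of 1 map unit.
pts = [(0, 0), (0.1, 0.1), (5, 5), (5.1, 5.0), (-5, 5)]
root = ClusterNode(*pts[0])
for px, py in pts[1:]:
    root.insert(px, py, tolerance=1.0)
quad_clusters = list(root.clusters())
```

The result depends on insertion order, since the first point to reach a node fixes the cluster location.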


DBSCAN (Density-based)

http://en.wikipedia.org/wiki/DBSCAN


1. Takes search radius and minimum number of points for cluster
2. Visit each point and count number of points in search radius

• Clusters can be any shape

• Search radius determined by zoom level
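
A compact sketch of the algorithm with a brute-force neighbor search, which is fine for small inputs but O(n²). The names `eps`, `min_pts`, and `NOISE` and the sample points are illustrative choices.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if math.hypot(x - xi, y - yi) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster_id += 1
    return labels


# Two dense groups and one isolated point (made-up data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

In a web map, `eps` would be derived from the zoom level, for example a fixed pixel radius multiplied by the map resolution.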


Strategies & Implementations for Web Apps (Server Object Extension vs. Pre-Crunched)


Where should clustering occur?

Client-side
• Small number of points ( < 10,000 )
• No additional server load
• Widely available within client APIs
• Limited by client-side languages

Server-side
• Medium number of points ( < 1M )
• Many language/library options
• Robust querying
• Very maintainable / extendible

Offline
• Large number of points ( > 1M )
• Many language/library options
• Limited querying
• Output is a normal feature class


Clustering Server Object Extension (C#/QuadTree)

1. Extends MapServer
2. Wraps map query based on extent
3. Returns clustered results
4. Stateless
5. Problems
   1. Re-calculates tree on each request
   2. Client-side wrappers
   3. Lost out-of-box ArcGIS Server functions


Clustering with Arcpy (distance-based / offline)

1. Divide data into logical chunks (where clause)
2. Integrate using tolerance
3. Collect Events
4. Spatial Join (add descriptive statistics)
5. Append all results
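
One way the workflow might be scripted. This is a sketch only: the function name, its arguments, and the in_memory dataset names are hypothetical, and it needs an ArcGIS install to actually run. The Spatial Join uses default settings, so the field mappings with merge rules that would produce descriptive statistics are left out.

```python
def cluster_offline(source_fc, where_clauses, tolerance, output_fc):
    """Cluster source_fc chunk by chunk and collect the results in output_fc.

    where_clauses: one SQL where clause per logical chunk.
    tolerance: a linear unit string such as "500 Meters".
    """
    import arcpy  # imported lazily; only available with ArcGIS installed

    for i, where in enumerate(where_clauses):
        layer = "chunk_lyr_{0}".format(i)
        chunk_fc = "in_memory/chunk_{0}".format(i)
        events_fc = "in_memory/events_{0}".format(i)
        joined_fc = "in_memory/joined_{0}".format(i)

        # 1. Divide: select one chunk and copy it, because Integrate
        #    edits its input in place.
        arcpy.MakeFeatureLayer_management(source_fc, layer, where)
        arcpy.CopyFeatures_management(layer, chunk_fc)

        # 2. Integrate: snap points within the tolerance onto each other.
        arcpy.Integrate_management(chunk_fc, tolerance)

        # 3. Collect Events: one output point per location, with a count field.
        arcpy.CollectEvents_stats(chunk_fc, events_fc)

        # 4. Spatial Join: attach attributes of the snapped points to each
        #    cluster point (default join settings).
        arcpy.SpatialJoin_analysis(events_fc, chunk_fc, joined_fc)

        # 5. Append: create the output on the first chunk, append afterwards.
        if arcpy.Exists(output_fc):
            arcpy.Append_management(joined_fc, output_fc, "NO_TEST")
        else:
            arcpy.CopyFeatures_management(joined_fc, output_fc)

        arcpy.Delete_management("in_memory")  # free memory before the next chunk
```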


Clustering w/Python

• Numpy/Scipy
  – De facto standard

• Scikit-Learn
  – Python machine learning library

• PyTables
  – HDF5, akin to NetCDF, but with support for hierarchical tables and very scalable
  – http://bcdcspatial.blogspot.com/2013/02/converting-arcgis-feature-class-to.html


Scikit-Learn

Scikit-Learn… btw it’s awesome: http://scikit-learn.org/stable/


Bleeding Edge Python

• PyPy, Cython, Anaconda, Numba Pro, Pandas

• Python is now a first-class citizen on the GPU!


In Summary:

• Clustering is not Panning

• Think outside Count

• Clustering is not only for spatial data


Thank You!

Follow us on Twitter:
@blueraster
@brendancol

Visit us at:
blueraster.com/blog
bcdcspatial.blogspot.com