O C T O B E R 1 1 -‐ 1 4 , 2 0 1 6 • B O S T O N , M A
H-‐Hypermap: Heatmap Analy?cs at Scale David Smiley
Freelance Search Developer/Consultant
About: David Smiley • So2ware Engineer (16 years) • Search (7 years) • Java (full-‐stack), Web, SpaGal
• Freelance search consultant / developer • Apache Lucene / Solr commiKer & PMC • Wrote first book on Solr, updated twice
Agenda • About this project • Architecture • Solr & Gme sharding • Experiences with: – Kotlin, Dropwizard, Swagger
– KaUa – Docker, Kontena
• Solr for geo-‐enrichment • Solr adapter for Lucene BKD Lat-‐Lon point search & sort
• Heatmaps – ExisGng funcGonality
• demo – New funcGonality
H-‐Hypermap / BOP • Harvard University, CGA: Center for GeospaGal Analysis hKp://gis.harvard.edu
• Harvard Hypermap Project – Managed by Ben Lewis
• BOP “Billion Object Pla^orm” – Funded by the Sloan FoundaGon
BOP Requirements Summary
• Most recent ~billion geo-‐tweets • RealGme search (<5 sec latency) • Sub-‐second queries – Including heatmaps!
• On the cheap: ~6 mediocre boxes
Provide a proof-‐of-‐concept pla^orm designed to lower the barrier for researchers who need to access big streaming spaGo-‐temporal datasets.
Logical High-‐Level Architecture
Archival
RealGme
HarvesGng Enrichment
various clients...
various clients...
Data flows via Apache KaLa Systems expose HTTP web services
“BOP”
Shard: W51
The BOP KaUa Topic Ingester
ZooKeeper
Shard: W52 Shard: W53 Shard: W54 Shard: RT
...
Web-‐Service
KaUa Streams • Create Solr doc • Routes to shard
REST/JSON API • Keyword search • FaceGng • Heatmaps • CSV export
...
BOP Solr Sharding Architecture RealGme
T2016_05_20 T2016_05_06 T2016_04_22 T2016_04_08
… 4-‐5 mo.
T2016_05_20 T2016_05_06 T2016_04_22 T2016_04_08
… 4-‐5 mo.
G_North_America G_Elsewhere
Lone RealGme CollecGon/Shard. 1-‐25 hrs Copy then delete, at night
• RealGme shard is where realGme search happens. No caches, but small.
• Primary collecGons have useful caches • Housekeeping Tasks:
• Move data from RT to primary • Create new shards; expire old • Merge/opGmize shards
Building a Search Web-‐Service • Kotlin language (JVM based) – Nullity as first-‐class language feature
• DropWizard framework – Designed for web-‐services
• Swagger – Dynamically generated dev UI for web-‐services
Apache KaUa • KaUa: a scalable message/queue pla^orm • See new KaUa Streams & KaUa Connect APIs • No back-‐pressure; can be a challenge • Non-‐obvious use: – For storage; Gme parGGoning
• Lots of benefits yet serious limitaGons
Docker • Easy to find/try/use so2ware – No installaGon – Simplified configuraGon (env variables)
– Common logging – Isolated
• Ideal for: – ConGnuous Int. servers – Trying new so2ware – ProducGon advantages
• But “new”
Docker in ProducGon • I use “Kontena” • Common logging, machine/proc stats, security – VPN to secure network; access everything as local
• No longer need to care about: – Ansible, Chef, Puppet, etc. – Security at network or proxy; not service specific
• Challenges: state & big-‐data
Enrichment
Geo: Query Solr via spaGal point query; aKach related metadata to tweet
KaUa Topic Enrich KaUa
Topic
TwiKer SenGment Classifier
Geo: Solr with regional polygons & metadata
Solr for Geo Enrichment • Tweets (docs) can have a geo lat/lon • Enrich tweet with Country, State/Province, … – GazeKeer lookup (point-‐in-‐polygon)
Data Set Features Raw size Index ?me Index size
Admin2 46,311 824 MB 510 min 892 MB
US States 74,002 747 MB 4.9 min 840 MB
MassachuseKs Census Blocks 154,621 152 MB 5.9 min 507 MB
Fast Point-‐in-‐Polygon Tricks Index/Config • OpGmize to 1 segment • RptWithGeometry
SpaGalField – precisionModel=
"floating_single" – autoIndex="true"
• <cache name= "perSegSpatial FieldCache_WKT" …
Search • Embed Solr (in-‐process) • Use docValues, not stored
– fl=block:field(GEOID10) Query like this: • q={!field cache=false
f=WKT}Intersects(POINT( $lon $lat))
Sub-‐Millisecond!
Lucene “LatLonPoint” • Uses new PointValues (BKD index) in Lucene 6 • Fastest: hKp://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module • Some limitaGons: WGS84 points only • Credit to Rob Muir and Mike McCandless
Solr Adapter For LatLonPoint • New Solr FieldType for Lucene LatLonPoint – Filter points by circle, rect, polygon – Distance sort; but no boos(ng
Coming soon! Solr 6.4?
Heatmaps: SpaGal Grid FaceGng • SpaGal density summary grid faceGng,
also useful for point-‐plovng search results • Lucene & Solr APIs • Scalable & fast usually…
• Usually rendered with a gradient radius -‐> • See: hKp://spacemansteve.github.io/
leaflet-‐solr-‐heatmap/example/index.html
How-‐to: Heatmaps • On an RPT field geo="false" worldBounds= "ENVELOPE( -180, 180, 180, -180)" prefixTree="packedQuad"
• Query: /select?facet=true &facet.heatmap=geo_rpt &facet.heatmap.geom= ["-180 -90" TO "180 90”] &facet.heatmap.format= ints2D or png
// Normal Solr response... "facet_counts":{ ... // facet response fields "facet_heatmaps":{ "geo_rpt":[ "gridLevel",2, "columns",32, "rows",32, "minX",-180.0, "maxX",180.0, "minY",-90.0, "maxY",90.0, "counts_ints2D”, [null, null, [0, 1, ... ]] ...
New HeatmapSpaGalField • Why? – With new BKD/PointValues, no “RPT” field to use – Scalable for heatmaps; don’t worry about search
• Scalable at all resoluGons; many millions of docs/shard
– Can be specific about grid resoluGons
Coming soon! Solr 6.4?
Heatmaps with Stats • Instead of counGng docs; calculate a metric – Ex: avg(minuteOfDay)
• Will require JSON Facet API • Inherently slower than just doc counts
Coming soon! Solr 6.4?
Final Remarks • Open-‐Source – hKps://github.com/dsmiley/hhypermap-‐bop
• In-‐progress • Improvements to Solr expected to be available before December; officially in Solr 6.4.