The Latest in Spatial & Temporal Search: Presented by David Smiley

The Latest in Spatial & Temporal Search David Smiley

Agenda

Spatial
•  Polygons and Accuracy: SerializedDVStrategy
•  FlexPrefixTree
•  BBoxSpatialStrategy
•  Student/Intern contributions, Geodesics

Temporal
•  Dates, and Date Ranges
   •  Search
   •  Faceting

About David Smiley

•  Freelance search consultant / developer
   •  Expert Lucene/Solr development skills, advice (consulting), training
   •  Java (full-stack), Web, Spatial
•  Apache Lucene / Solr committer & PMC, Eclipse LocationTech PMC
•  Authored 1st book on Solr, plus two editions
•  Presented at several conferences & meetups
•  Taught several Solr classes, self-developed & LucidWorks

Lucene Spatial Overview

•  Multiple approaches to index spatial data
   •  abstract class SpatialStrategy (5+ concrete implementations)
•  RecursivePrefixTreeStrategy (RPT) is most prominent, versatile
   •  Grid based
•  Uses Spatial4j lib for shapes, distance calculations, and WKT
   •  Uses JTS Topology Suite lib for polygons

Diagram labels: Shape; SpatialPrefixTree / Cell; PrefixTreeStrategy; IntersectsPrefixTreeFilter; Contains…; Within…; Geohash | Quad

SpatialPrefixTrees and Accuracy

RecursivePrefixTree (RPT) uses Lucene’s index as a PrefixTree
•  Thus represents shapes as grid cells of varying precision by prefix

Example, a point shape:
•  D, DR, DRT, DRT2, DRT2Y

Example, a polygon shape:
•  Too many to list… 508 cells

More details here: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
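To make the "cells by prefix" idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the point example on the slide is geohash-style and uses the standard geohash encoding to produce one token per level, each a prefix of the next. The class and method names are made up for illustration; this is not how Lucene's GeohashPrefixTree is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a point indexed as one grid cell per level, each token a prefix of the next. */
final class GeohashPrefixSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Standard geohash: interleave longitude/latitude bisection bits, 5 bits per character. */
    static String encode(double lat, double lon, int length) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean lonTurn = true; // geohash starts with a longitude bit
        int bits = 0;
        int bitCount = 0;
        while (hash.length() < length) {
            if (lonTurn) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { bits = (bits << 1) | 1; lonMin = mid; }
                else            { bits = bits << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { bits = (bits << 1) | 1; latMin = mid; }
                else            { bits = bits << 1;       latMax = mid; }
            }
            lonTurn = !lonTurn;
            if (++bitCount == 5) {
                hash.append(BASE32.charAt(bits));
                bits = 0;
                bitCount = 0;
            }
        }
        return hash.toString();
    }

    /** One term per level: every prefix of the full-precision token. */
    static List<String> cellTokens(double lat, double lon, int levels) {
        String full = encode(lat, lon, levels);
        List<String> tokens = new ArrayList<>();
        for (int level = 1; level <= levels; level++) {
            tokens.add(full.substring(0, level));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [u, u4, u4p, u4pr, u4pru]
        System.out.println(cellTokens(57.64911, 10.40744, 5));
    }
}
```

A point costs one term per level, which is why the point case scales linearly with the number of levels.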

…continued

•  For more accuracy, index more levels (longer prefixes)
   •  Points: linear relationship of levels to number of cells :)
   •  Non-points: exponential relationship… :(

RPT applies a distErrPct shape size ratio to non-point shapes to trade accuracy for scalability
•  distErrPct=0.025 (2.5% of the radius, the default):
   •  Massachusetts: level 6
   •  USA: level 4 (not as precise)
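The arithmetic behind distErrPct can be sketched roughly as follows: the allowed error is a fraction of the shape's radius, and the grid level is the shallowest one whose cells are no larger than that error. The sketch assumes a hypothetical quad-style grid whose cell width is 360°/2^level. Lucene's real calculation works from the shape's bounding box and the specific grid in use, so the levels computed here will not match the slide's numbers; only the trend (bigger shape, coarser level) carries over.

```java
/** Sketch: how a size-relative error budget picks a grid level (hypothetical quad grid). */
final class DistErrPctSketch {

    /** Allowed error in degrees: a fraction of the shape's approximate radius. */
    static double distErr(double shapeRadiusDeg, double distErrPct) {
        return shapeRadiusDeg * distErrPct;
    }

    /** Shallowest level whose cell width (360 / 2^level) does not exceed distErrDeg. */
    static int levelForDistance(double distErrDeg, int maxLevels) {
        double cellWidth = 360.0;
        for (int level = 1; level <= maxLevels; level++) {
            cellWidth /= 2;
            if (cellWidth <= distErrDeg) {
                return level;
            }
        }
        return maxLevels; // error budget is finer than the grid can represent
    }

    public static void main(String[] args) {
        // A small shape (radius ~1 degree) needs a deeper level than a large one (~20 degrees).
        System.out.println(levelForDistance(distErr(1.0, 0.025), 50));  // 14
        System.out.println(levelForDistance(distErr(20.0, 0.025), 50)); // 10
    }
}
```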

SerializedDVStrategy (Lucene 4.7)

•  Stores serialized geometry into Lucene BinaryDocValues
   •  It’s as accurate as the underlying geometry coordinates/shape
   •  But it’s not a spatial index – it’s retrievable on a per-document basis
•  Use RPT + SerializedDV for speed and accuracy!
•  More to come eventually:
   •  Solr adapter – SOLR-5728, ElasticSearch adapter #2361
   •  Speed: Skip the serialized geometry check for non-edge cells – LUCENE-5579

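Sample Code

    // Query shape and predicate. INTERSECTS is as written on the slide;
    // presumably it refers to SpatialOperation.Intersects.
    SpatialArgs args = new SpatialArgs(INTERSECTS, point);

    // RPT gives a fast, approximate grid match;
    // SerializedDV verifies candidates against the exact geometry.
    RecursivePrefixTreeStrategy treeStrategy =
        new RecursivePrefixTreeStrategy(grid, "geometry");
    SerializedDVStrategy verifyStrategy =
        new SerializedDVStrategy(ctx, "serialized_geometry");

    Query treeQuery = new ConstantScoreQuery(treeStrategy.makeFilter(args));
    // QUERY_FIRST: run the grid query first, then apply the exact check only to its hits.
    Query combinedQuery = new FilteredQuery(
        treeQuery,
        verifyStrategy.makeFilter(args),
        FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);

Code is from a related presentation by the Climate Corporation presented at FOSS4G 2014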

FlexPrefixTree (Coming to Lucene 5)

•  A new SpatialPrefixTree by Varun Shenoy (GSOC 2014)!
   •  LUCENE-4922; Still needs to be committed. Goal is for 5.0.
•  More optimized, more flexible, than Geohash & Quad
   •  Configurable sub-cells at each level: 4, 16, 64, 256
   •  You choose trade-off between index speed/disk size & search speed
   •  Internally uses an integer coordinate system
   •  Rectangle searches are particularly fast; minimal floating-point conversion
   •  Cells are always squares (equal sides) – better for heatmaps
   •  YMMV: 10% - 100% faster than GeohashPrefixTree

BBoxSpatialStrategy (Lucene 4.10)

•  Rectangles (BBox’s) only, one value per field
•  Wide predicate support
   •  Equals, Intersects, Within, Contains, Disjoint
•  Accurate (8-byte double floating point)
•  Area overlap relevancy
   •  Weight search results by a combination of query shape overlap & index shape overlap ratios
•  Solr BBoxField…

Solr BBoxField

•  Schema configuration
    <field name="bbox" type="bbox" />
    <fieldType name="bbox" class="solr.BBoxField"
               geo="true" units="degrees" numberType="_bbox_coord" />
    <fieldType name="_bbox_coord" class="solr.TrieDoubleField"
               precisionStep="8" docValues="true" stored="false"/>

•  Search with overlap ratio ordering
    &q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
   •  score can be: overlapRatio, area, area2D
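As an illustration of what an overlap-ratio score might compute, the sketch below combines two ratios: how much of the query rectangle is covered by the intersection, and how much of the indexed rectangle is covered. The weighting parameter queryWeight and all class and method names are assumptions for illustration, and edge cases such as dateline wrap and zero-area boxes are ignored. It is a simplified model of the idea, not Lucene's or Solr's implementation.

```java
/** Sketch: scoring a bounding box by how well it overlaps the query box. */
final class OverlapRatioSketch {

    /** Axis-aligned rectangle in degrees; ignores dateline wrap. */
    static final class Rect {
        final double minX, maxX, minY, maxY;

        Rect(double minX, double maxX, double minY, double maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }

        double area() {
            return (maxX - minX) * (maxY - minY);
        }
    }

    static double intersectionArea(Rect a, Rect b) {
        double w = Math.min(a.maxX, b.maxX) - Math.max(a.minX, b.minX);
        double h = Math.min(a.maxY, b.maxY) - Math.max(a.minY, b.minY);
        return (w > 0 && h > 0) ? w * h : 0.0;
    }

    /** Weighted blend of (intersection / query area) and (intersection / indexed area). */
    static double overlapRatio(Rect query, Rect indexed, double queryWeight) {
        double inter = intersectionArea(query, indexed);
        if (inter == 0.0) {
            return 0.0; // disjoint or merely touching
        }
        double queryRatio = inter / query.area();
        double indexedRatio = inter / indexed.area();
        return queryWeight * queryRatio + (1 - queryWeight) * indexedRatio;
    }

    public static void main(String[] args) {
        // ENVELOPE(-10, 20, 15, 10) lists minX, maxX, maxY, minY.
        Rect query = new Rect(-10, 20, 10, 15);
        Rect indexed = new Rect(0, 10, 10, 15); // fully inside the query box
        System.out.println(overlapRatio(query, indexed, 0.5));
    }
}
```

An indexed box identical to the query scores 1.0; a box that covers only a third of the query, as in `main`, scores lower even though it is fully contained.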

Recent Student/Intern Contributions

•  Varun Shenoy via GSOC: summer 2014
   •  Lucene spatial: new “FlexPrefixTree” – an optimized grid
•  Rebecca Alford via F.B. Open-Academy: winter 2014
   •  Spatial4j: geodesic polygons
•  Chris Pavlicek via F.B. Open-Academy: winter 2014
   •  Spatial4j: geodesic buffered lines
•  Evana Gizzi, MITRE intern: winter 2014
   •  Spatial4j: geodesic circle polygonizer
•  Liviy Ambrose, MITRE intern: fall 2013
   •  Lucene spatial: integrated with Lucene’s benchmark module

Temporal/Date Durations
or basically any numeric ranges

Approach: Simple Two-field
(as you might do in SQL or any system without native range types)

•  A start-time & end-time field pair
•  A search window (time span) becomes two range queries
   •  details vary by predicate (Intersects, Contains, vs. Within)
•  Single-valued only
   •  …even though Lucene supports multi-valued fields
   •  Theoretically possible but would be a lot of work
      •  because Lucene doesn’t store “position” info for numeric fields
      •  because numeric range/prefix queries are position-less
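The "two range queries" for each predicate can be written out as plain comparisons. The sketch below is ordinary Java over a single start/end pair, with hypothetical field names `start` and `end` in the comments. It shows the logic only, not Lucene query construction.

```java
/** Sketch: each predicate is two range conditions, one on the start field and one on the end field. */
final class TwoFieldRangeSketch {

    /** Document range overlaps the window: start:[* TO qEnd] AND end:[qStart TO *]. */
    static boolean intersects(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qEnd && docEnd >= qStart;
    }

    /** Document range fully covers the window: start:[* TO qStart] AND end:[qEnd TO *]. */
    static boolean contains(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qStart && docEnd >= qEnd;
    }

    /** Document range lies inside the window: start:[qStart TO *] AND end:[* TO qEnd]. */
    static boolean within(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart >= qStart && docEnd <= qEnd;
    }

    public static void main(String[] args) {
        // Document spans 10..20; the search window is 15..30.
        System.out.println(intersects(10, 20, 15, 30)); // true
        System.out.println(contains(10, 20, 15, 30));   // false
        System.out.println(within(10, 20, 15, 30));     // false
    }
}
```

With several ranges per document, the start condition could match one range and the end condition another, which is why this approach is single-valued only.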

Approach: 2D Spatial PrefixTree

•  Lucene Spatial QuadPrefixTree (2D) with RPT Strategy
•  Use ‘x’ for start-time, ‘y’ for end-time
•  A search window (time span) becomes a rectangle query
   •  details vary by predicate (Intersects, Contains, vs. Within)
•  Cool…
   •  But floating-point edge issues
   •  Only ~50 levels supported; not 64

Details: http://wiki.apache.org/solr/SpatialForTimeDurations
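A rough sketch of the x/y trick: each document's time span becomes the point (start, end), and each predicate becomes a rectangle that captures exactly the matching points. The world bounds MIN and MAX and all names are assumptions for illustration. A real setup would define them through the spatial context's world bounds.

```java
/** Sketch: a time span indexed as the 2D point (x = start, y = end). */
final class TimeSpanAsPointSketch {
    // Hypothetical world bounds for the time axis, e.g. days since some epoch.
    static final long MIN = 0;
    static final long MAX = 1_000_000;

    static final class Rect {
        final long minX, maxX, minY, maxY;

        Rect(long minX, long maxX, long minY, long maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }

        /** True if the document point (start, end) falls inside this rectangle. */
        boolean containsPoint(long start, long end) {
            return start >= minX && start <= maxX && end >= minY && end <= maxY;
        }
    }

    /** Spans overlapping the window: start <= qEnd and end >= qStart. */
    static Rect intersectsQuery(long qStart, long qEnd) {
        return new Rect(MIN, qEnd, qStart, MAX);
    }

    /** Spans covering the window: start <= qStart and end >= qEnd. */
    static Rect containsQuery(long qStart, long qEnd) {
        return new Rect(MIN, qStart, qEnd, MAX);
    }

    /** Spans inside the window: start >= qStart and end <= qEnd. */
    static Rect withinQuery(long qStart, long qEnd) {
        return new Rect(qStart, qEnd, qStart, qEnd);
    }

    public static void main(String[] args) {
        // Document spans 10..20; the search window is 15..30.
        System.out.println(intersectsQuery(15, 30).containsPoint(10, 20)); // true
        System.out.println(withinQuery(15, 30).containsPoint(10, 20));     // false
    }
}
```

Because every span is its own point, multi-valued fields work here: both conditions are always tested against the same span.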

Approach: DateRangePrefixTree (Lucene 5)

•  A new 1D SpatialPrefixTree: NumberRangePrefixTree
   •  NumberRangePrefixTree w/ DateRangePrefixTree subclass
   •  NR-SPT: Configurable sub-cells per level; no level limit
   •  Not just for ranges; instances too
   •  Index/Search with NumberRangePrefixTreeStrategy
      •  Indexing, and search predicate code (e.g. Intersects…) completely re-used
•  DateRangePrefixTree
   •  9 Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis

…continued…

Trade-offs of N/D-SPT

•  Indexing:
   •  “Common” date-ranges use ~ <50 terms, but random millisecond ranges use up to ~14K terms
   •  All date instances (not a range) <= 9 terms
   •  Comparison to 2D SPT: instance or range, always 50
•  Search:
   •  Query for “common” query ranges faster than uncommon
   •  Comparison to 2D SPT:
      •  Contains & Within predicates: overlapping values per document get coalesced, can’t be differentiated

Solr DateRangeField

•  Configuration in schema.xml:
    <field name="dateRange" type="dateRange" />
    <fieldType name="dateRange" class="solr.DateRangeField" />
•  Index field data, examples:
   •  2014-05-21T12:00:00.000Z (same as TrieDate)
   •  2014-05-21T12 (truncated to desired precision)
   •  [1990 TO 1995]
•  Query, examples:
   •  fq=dateRange:[* TO 2014-05-21]
   •  fq={!field f=dateRange op=Contains}[2000 TO 2014-05-21]
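One way to read "truncated to desired precision" is that a truncated date stands for the whole range it covers: 2014-05-21T12 means the full hour, and 1990 means the full year. The sketch below illustrates that reading with java.time. It handles only four precisions, ignores time zones (Solr works in UTC), and is not Solr's parser. The class and method names are made up.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

/** Sketch: a truncated date string interpreted as the range of time it covers. */
final class TruncatedDateSketch {

    /** Returns {start inclusive, end exclusive} for yyyy, yyyy-MM, yyyy-MM-dd or yyyy-MM-ddTHH. */
    static LocalDateTime[] toRange(String s) {
        LocalDateTime start;
        ChronoUnit unit;
        switch (s.length()) {
            case 4: // year
                start = LocalDate.of(Integer.parseInt(s), 1, 1).atStartOfDay();
                unit = ChronoUnit.YEARS;
                break;
            case 7: // year-month
                start = YearMonth.parse(s).atDay(1).atStartOfDay();
                unit = ChronoUnit.MONTHS;
                break;
            case 10: // date
                start = LocalDate.parse(s).atStartOfDay();
                unit = ChronoUnit.DAYS;
                break;
            case 13: // date plus hour
                start = LocalDateTime.parse(s + ":00:00");
                unit = ChronoUnit.HOURS;
                break;
            default:
                throw new IllegalArgumentException("Unsupported precision: " + s);
        }
        return new LocalDateTime[] { start, start.plus(1, unit) };
    }

    public static void main(String[] args) {
        LocalDateTime[] hour = toRange("2014-05-21T12");
        System.out.println(hour[0] + " .. " + hour[1]);
    }
}
```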

Visualizing Date Facets

•  http://bl.ocks.org/mbostock/4063318

Date Faceting

•  Option A: facet.range
   •  Not for indexed date-ranges
   •  Internally executes one query for each value & caches large bitset
•  Option B: facet.interval (Solr 4.10)
   •  Not for indexed date-ranges
   •  Requires DocValues (more index data)
   •  Supports variable/custom intervals
•  New work-in-progress option: Facet on DateRangeField
   •  Ranges are fixed/pre-determined (months, days, etc.)
   •  Optimized for thousands of ranges to count
      •  Each value-range is only 1 term!
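Conceptually, faceting on fixed, pre-determined ranges amounts to counting values per bucket, where each bucket (a month, say) has a single key. The sketch below shows only that counting idea over in-memory dates. It says nothing about how the work-in-progress DateRangeField faceting is implemented, and all names are illustrative.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: counting dates per fixed month bucket, one key per bucket. */
final class MonthFacetSketch {

    /** Returns month -> count, sorted by month. */
    static Map<YearMonth, Integer> countByMonth(List<LocalDate> dates) {
        Map<YearMonth, Integer> counts = new TreeMap<>();
        for (LocalDate date : dates) {
            counts.merge(YearMonth.from(date), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(
            LocalDate.of(2014, 5, 21),
            LocalDate.of(2014, 5, 2),
            LocalDate.of(2014, 6, 10));
        System.out.println(countByMonth(dates)); // {2014-05=2, 2014-06=1}
    }
}
```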

Future stuff I’m excited about

•  Continuing works in-progress
•  Spatial heatmaps! Coming in January 2015!
   •  Lucene layer & Solr adapter
•  Lucene term auto-prefixing LUCENE-5879
   •  Brings spatial, date, numeric, indexing/search to the next level!
•  More prefix-tree optimizations
   •  Inner vs edge leaf cell differentiation for non-point shapes
   •  RPT + SerializedDVStrategy; skip accuracy checks for inner cells
   •  Don’t index leaf cells twice

That’s all for now; thanks for coming!

Need Lucene/Solr guidance or custom development? Contact me!

Email: [email protected]
LinkedIn: http://www.linkedin.com/in/davidwsmiley
G+: +DavidSmiley
Twitter: @DavidWSmiley

ETA:  December  2014