Page 1: Lucene solr 4 spatial   extended deep dive

LUCENE / SOLR 4 SPATIAL DEEP DIVE

David Smiley Software Systems Engineer, Lead

Page 2: Lucene solr 4 spatial   extended deep dive

© 2013 The MITRE Corporation. All rights reserved.

LUCENE / SOLR 4 SPATIAL

DEEP-DIVE

2013 Lucene Revolution

Presented by David Smiley, MITRE

Page 3: Lucene solr 4 spatial   extended deep dive

About David Smiley

• Working at MITRE for 13 years

• web development, Java, search

• 3 Solr apps, 1 Endeca

• Published 1st book on Solr; then 2nd edition (2009, 2011)

• Apache Lucene / Solr committer/PMC member (2012)

• Specializing in spatial

• Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)

• Taught Solr classes at MITRE (2010, 2011, 2012)

• Solr search consultant within MITRE and its sponsors, and privately


Page 4: Lucene solr 4 spatial   extended deep dive

Agenda

• Background, overview

• Spatial4j

• Lucene spatial

• PrefixTree / Trie / Grid

• Solr spatial

• Demo

• Interesting use-cases

Page 5: Lucene solr 4 spatial   extended deep dive

BACKGROUND &

OVERVIEW

Page 6: Lucene solr 4 spatial   extended deep dive

What is Spatial Search?

Popular features:

• Spatial filter query

• Spatial distance sorting

• Spatial distance relevancy (i.e. spatial query score)

NOT “geocoding” – resolve “Boston” to its latitude and longitude

Typical use-case:

1. Index a location for each Lucene document given a latitude & longitude

2. Then search for matching documents by a circle (point-radius) or bounding box

3. Then sort results by distance
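The three steps of the typical use-case can be sketched in plain Java without Lucene. This is only an in-memory illustration, not how Lucene or Solr implement it: the `Doc` class, the `search` method and the rounded Earth radius are assumptions made for the example, and the distance uses the standard haversine formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PointRadiusSearchSketch {
    // Approximate mean Earth radius; an assumption of this sketch.
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Hypothetical stand-in for an indexed document with a single location (step 1). */
    static final class Doc {
        final String id;
        final double lat, lon;

        Doc(String id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in kilometers, using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Step 2 (circle filter) followed by step 3 (sort by distance, nearest first). */
    static List<Doc> search(List<Doc> docs, double lat, double lon, double radiusKm) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (haversineKm(lat, lon, d.lat, d.lon) <= radiusKm) {
                hits.add(d);
            }
        }
        hits.sort(Comparator.comparingDouble((Doc d) -> haversineKm(lat, lon, d.lat, d.lon)));
        return hits;
    }
}
```

A linear scan like this is exactly what a spatial index avoids; the strategies described later exist so that the filter step does not have to visit every document.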

Page 7: Lucene solr 4 spatial   extended deep dive

History of Spatial for Lucene & Solr

• 2007: Local-Lucene

• by Patrick O’Leary (AOL)

• 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0

• Local-Lucene graduates to an official Lucene contrib module

• 2009-12: Spatial Search Plugin (SSP) for Solr

• by Chris Male (JTeam -> Orange11, ElasticSearch)

• 2010-10: SOLR-2155 a geohash prefix tree filter

• by David Smiley (MITRE)

• 2011-01: Lucene Spatial Playground (LSP)

• by Ryan McKinley (Voyager GIS), David, and Chris

• 2011-03: Solr 3.1 new spatial features

• by Grant Ingersoll and Yonik Seeley (LucidWorks)

• 2012-03: LSP -> Lucene 4 spatial module + Spatial4j + SSP

• replaces former Lucene spatial contrib module

Page 8: Lucene solr 4 spatial   extended deep dive

Lucene Spatial Committers

• David Smiley

• Works for MITRE

• Boston area

• Ryan McKinley

• Works for Voyager GIS

• Silicon Valley

• Chris Male

• Formerly at ElasticSearch

• New Zealand

Page 9: Lucene solr 4 spatial   extended deep dive

Spatial decomposed

• Spatial4j

• Shapes, WKT, Distance calculations, JTS adapter

• Lucene spatial

• Strategies: PrefixTree (TermQuery & Recursive impl.), BBox, PointVector

• Solr adapters

• Misc: Spatial Solr Sandbox

• LSE

• JtsGeoStrategy

• Spatial-Demo (web app)

Page 10: Lucene solr 4 spatial   extended deep dive

Lines of Code for Spatial Components

Spatial4j 43%

Lucene spatial 35%

Solr adapters 6%

Misc 16%

Total: 4,781 Non-Comment Source Statements (without javadocs or tests)

as of 2012-09

Page 11: Lucene solr 4 spatial   extended deep dive

CarrotSearch Labs’ RandomizedTesting

• http://labs.carrotsearch.com/randomizedtesting.html

• Provides plumbing for repeatable randomized JUnit tests

• All the spatial test code uses it extensively

Randomized testing more generally is a philosophy / approach to testing

• A typical hard-coded test will only catch some regressions

• A randomized test will catch just about anything eventually, especially nasty edge cases

• Although it’s hard to read / write / maintain these tests

• Randomized testing helped find bugs related to…

• Computing the bounding box of a circle

• Computing the relationship of a circle to a rectangle that has all 4 of its corners inside it
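To make the idea of a repeatable randomized test concrete, here is a minimal sketch in plain Java. It does not use the RandomizedTesting library or Spatial4j; it relies on `java.util.Random` with a fixed seed, and it checks a deliberately simple Cartesian property (every point inside a circle lies inside the circle's bounding box). The class and method names are hypothetical.

```java
import java.util.Random;

class RandomizedBBoxCheck {
    /**
     * Generates random Cartesian circles and random points inside them, and counts
     * how often a point falls outside the circle's bounding box. Returns 0 when the
     * property held for every iteration.
     */
    static int run(long seed, int iterations) {
        Random rnd = new Random(seed); // fixed seed: a failing run can be replayed exactly
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            double cx = rnd.nextDouble() * 200 - 100;
            double cy = rnd.nextDouble() * 200 - 100;
            double r = rnd.nextDouble() * 50;

            // Bounding box of the circle.
            double minX = cx - r, maxX = cx + r, minY = cy - r, maxY = cy + r;

            // Random point inside the circle, in polar form around the center.
            double angle = rnd.nextDouble() * 2 * Math.PI;
            double dist = rnd.nextDouble() * r;
            double px = cx + dist * Math.cos(angle);
            double py = cy + dist * Math.sin(angle);

            if (px < minX || px > maxX || py < minY || py > maxY) {
                failures++;
            }
        }
        return failures;
    }
}
```

When a run does fail, reporting the seed is what makes the failure reproducible; the geodetic case is far harder than this flat-plane check, which is where such tests tend to earn their keep.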

Page 12: Lucene solr 4 spatial   extended deep dive

SPATIAL4J It’s all about the shapes

Page 13: Lucene solr 4 spatial   extended deep dive

Spatial4j: It’s all about the shapes

https://github.com/spatial4j/spatial4j (spatial4j.com redirect)

• Shapes

• A “Shape” abstraction with multiple implementations

• Geodetic (sphere) & Cartesian/2D implementations

• Computes intersection relationship with other shapes

• Also…

• Distance and area math utilities, Geohash utilities

• Parsing Well Known Text (WKT) formatted shapes

• ASL licensed project independent of Apache on GitHub

• Requires JTS (LGPL licensed) for polygons & WKT*

• JTS is “JTS Topology Suite”

• * WKT parsing soon to be implemented directly by Spatial4j

• Ported to .NET as Spatial4n and used by RavenDB

• by Itamar Syn-Hershko

Page 14: Lucene solr 4 spatial   extended deep dive

The case for Spatial4j’s existence

• Just for shapes? How much code could there be?

• You’d be surprised. Determining the relationship between a lat-lon rectangle and a geodetic circle (Within, Contains, Intersects, Disjoint) is non-trivial, and that’s just one shape.

• Lots of non-trivial test code goes with it.

• Why isn’t it a part of Lucene spatial?

• Parts of Spatial4j depend on JTS, an LGPL licensed library. The Lucene PMC voted not to introduce this compile-time dependency.

• Spatial4j is independently useful.

• Is this duplication of other open-source that could be used?

• Spatial4j needs to be ASL licensed to be a dependency of Lucene.

• Still… I haven’t found existing code that does what Spatial4j does.

• Can’t only the JTS dependent parts be external to Lucene?

Page 15: Lucene solr 4 spatial   extended deep dive

The Shape interface

(may become an abstract class in the next version)

interface Shape {
  Point getCenter();
  Rectangle getBoundingBox();
  boolean hasArea();
  double getArea();
  SpatialRelation relate(Shape other); // must support Point & Rectangle
}

• enum SpatialRelation

• DISJOINT, INTERSECTS, WITHIN, CONTAINS

• Note: simpler set than the “DE-9IM” spatial standard

• no “equals” or “touches”
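As an illustration of what relate() has to decide, the following sketch computes the four relations for two axis-aligned Cartesian rectangles. It is not Spatial4j's implementation: the `Rect` class is hypothetical, dateline wrap and geodetic math are ignored, and treating two identical rectangles as CONTAINS is an assumption made here because the enum has no "equals" value.

```java
class RectRelateSketch {
    enum SpatialRelation { DISJOINT, INTERSECTS, WITHIN, CONTAINS }

    /** Axis-aligned Cartesian rectangle; argument order mirrors makeRectangle(minX, maxX, minY, maxY). */
    static final class Rect {
        final double minX, maxX, minY, maxY;

        Rect(double minX, double maxX, double minY, double maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }

        /** Relationship of this rectangle to {@code other}, from this rectangle's point of view. */
        SpatialRelation relate(Rect other) {
            // No overlap on at least one axis.
            if (other.minX > maxX || other.maxX < minX || other.minY > maxY || other.maxY < minY) {
                return SpatialRelation.DISJOINT;
            }
            // other lies entirely inside this (identical rectangles land here; an assumption).
            if (other.minX >= minX && other.maxX <= maxX && other.minY >= minY && other.maxY <= maxY) {
                return SpatialRelation.CONTAINS;
            }
            // this lies entirely inside other.
            if (minX >= other.minX && maxX <= other.maxX && minY >= other.minY && maxY <= other.maxY) {
                return SpatialRelation.WITHIN;
            }
            return SpatialRelation.INTERSECTS;
        }
    }
}
```

Rectangle-to-rectangle is the easy case; the circle-to-rectangle and geodetic variants mentioned earlier are where most of the complexity lives.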

Page 16: Lucene solr 4 spatial   extended deep dive

Spatial4j shapes

Shape                            Cartesian   Cartesian with dateline wrap   Geodetic
Point                            Y           Y                              Y
Line & LineString (w/ buffer)    Y           N                              N
Rectangle                        Y           Y                              Y
Circle                           Y           N                              Y
ShapeCollection                  Y           Y                              Y
JTS Geometry (incl. polygons)    Y           Y                              N

• Cartesian (AKA Euclidean): a flat plane

• Dateline wrap assumes the plane circles back on itself

• Geodetic: a spherical mathematical model

Page 17: Lucene solr 4 spatial   extended deep dive

Well Known Text (WKT)

(see Wikipedia)

• A popular standard for representing shapes as strings

• Currently requires JTS’s WKT parser; Spatial4j’s own parser is in progress

• Extensions are TBD for Rectangles and Circles

• Limited support for EMPTY and “Z” and “M” dimensions (future)

• Some Examples:

• POINT (3 -2)

• LINESTRING(30 10, 10 30, …

• POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))

• MULTIPOLYGON (((…

• …

• Deprecated (may move to Solr):

• -90, -180

• -180 -90 180 90

• CIRCLE(4.56,1.23 d=0.071)

• TBD / Pending:

• ENVELOPE(-180,180,90,-90)

• BOX2D(-180 -90, 180 90)

Page 18: Lucene solr 4 spatial   extended deep dive

Spatial4j code sample

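import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.context.jts.JtsSpatialContext;
import com.spatial4j.core.shape.Circle;
import com.spatial4j.core.shape.Rectangle;
import com.spatial4j.core.shape.Shape;
import com.spatial4j.core.shape.SpatialRelation;

SpatialContext ctx = SpatialContext.GEO;
Rectangle r = ctx.makeRectangle(-71, -70, 42, 43); // minX, maxX, minY, maxY
Circle c = ctx.makeCircle(-72, 42, 1); // x, y, radius in degrees
SpatialRelation rel = r.relate(c);
System.out.println(rel);
rel.intersects(); // boolean

ctx = JtsSpatialContext.GEO; // the JTS-backed context is needed for polygons
Shape s = ctx.readShape("POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))");
double distanceDegrees = ctx.getDistCalc().distance(
    ctx.makePoint(2, 2), ctx.makePoint(3, 3));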

Distances (including circle radius) are in “Degrees”, not radians or KM
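Because distances are expressed in degrees, kilometers usually need converting before they are passed to calls such as makeCircle. The sketch below shows the arithmetic on a sphere: divide by the Earth's radius to get radians, then convert radians to degrees. The class and method names are hypothetical, and the radius constant is an approximate mean Earth radius assumed for this example.

```java
class DegreesSketch {
    // Approximate mean Earth radius in kilometers; an assumption of this sketch.
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    /** Kilometers along a great circle, expressed as degrees of arc. */
    static double kmToDegrees(double km) {
        return Math.toDegrees(km / EARTH_MEAN_RADIUS_KM);
    }

    /** Degrees of arc along a great circle, expressed in kilometers. */
    static double degreesToKm(double degrees) {
        return Math.toRadians(degrees) * EARTH_MEAN_RADIUS_KM;
    }
}
```

With these assumptions one degree comes out at roughly 111.2 km, so a 1 m precision is roughly 0.000009 degrees.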

Page 19: Lucene solr 4 spatial   extended deep dive

Spatial4j Future

• Built-in WKT support (no JTS dependency)

• Extensible to user-defined shapes

• API improvements

• Shape argument validation via WKT but not via ctx.makeShape(…)

• ShapeCollection visitor design pattern

• Refactor to remove need for isGeo()

• LineString dateline & geodetic support

• Projection / Datum support

Page 20: Lucene solr 4 spatial   extended deep dive

LUCENE SPATIAL Spatial index information retrieval

Page 21: Lucene solr 4 spatial   extended deep dive

Lucene 4 Spatial Module

• There isn’t one best way to implement spatial indexing for all use-cases

• Index just points, or other shapes too? Which?

• Multiple shapes per field?

• Query by Intersection? Contains? Within? Equals? Disjoint? …

• Distance sorting? Query boost by distance?

• Or more exotic shape relevancy like overlap percentage?

• Tradeoff shape precision for speed?

• Multiple SpatialStrategy implementations:

• RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy

• PointVectorStrategy

• BBoxStrategy (currently in trunk, not 4x)

• JtsGeoStrategy (in Spatial Solr Sandbox)

Page 22: Lucene solr 4 spatial   extended deep dive

Strategy: PointVector

• Similar to Solr’s PointType / LatLonType

• X & Y trie double fields; caching via FieldCache

• Characteristics

• Indexes points (only)

• Single-valued field (no multi)

• Query by rectangle or circle (only)

• Circle uses FieldCache (requires memory)

• Circle does bbox pre-filter for performance

• Relations: Intersects, Within (only)

• Exact precision for x & y coordinates and query shape

• Distance sort

• Uses FieldCache (requires memory)

Page 23: Lucene solr 4 spatial   extended deep dive

Strategy: BBox

• Implemented with 4 doubles & 1 boolean

• Ported from ESRI GeoPortal (Open Source)

• Characteristics:

• Indexes rectangles (only)

• Single-valued field (no multi)

• Query by rectangle (only)

• Supports all relations: Intersects, Within, Contains, …

• Distance sort from box center

• Uses FieldCache (requires memory)

• Area overlap sorting

• Sort results by percentage overlap between query and indexed boxes

• Uses FieldCache (requires memory)

• Note: FieldCache needs are somewhat high

Page 24: Lucene solr 4 spatial   extended deep dive

Strategy: JtsGeoStrategy

• Stores a JTS geometry in Lucene 4’s DocValues

• Stores WKB (Well Known Binary, the binary counterpart of WKT)

• Full vector geometry is retained for search

• DocValues is mostly a better FieldCache

• Faster loading into memory

• Can be disk resident or memory

• Multi-valued

• Characteristics:

• Indexes any shape, including Multi… varieties

• Query by any shape

• Uses DocValues (memory use optional)

• Supports all relations: intersect, within, contains, …

• Could easily also support JTS’s exotic DE-9IM based relations

• Exact precision to the vector geometry

• No sorting

• Experimental / immature status

• More of a proof-of-concept for now

Page 25: Lucene solr 4 spatial   extended deep dive

PREFIXTREE STRATEGY Spatial grid indexing

Page 26: Lucene solr 4 spatial   extended deep dive

Strategy: RecursivePrefixTree

• Grid / Tile / Trie / Prefix-Tree based

• With recursive descent algorithms

• Or TermQueryPrefixTree alternative

• Choose Geohash (geo only) or Quad tree

• The most mature strategy to date

• Highly tested

• The current evolution of SOLR-2155

Page 27: Lucene solr 4 spatial   extended deep dive

Strategy: RecursivePrefixTree

• Characteristics:

• Indexes all shapes

• Variable precision of shape edges

• Highly precise shapes other than Point won’t scale

• LineString possibly not precise enough for your needs

• Multi-valued field support

• Query by any shape

• Variable precision for query shape

• Highest precision usually scales

• All Relations: Intersects, Within, Contains, Disjoint

• Distance sort (w/ multi-value support)

• Warning: immature, won’t scale

• Uses significant amounts of memory

• Fast scalable spatial filtering; no caches needed

new in Lucene 4.3

How many search / NoSQL systems have these capabilities?

Page 28: Lucene solr 4 spatial   extended deep dive

Geohashes

• What is a Geohash?

• A lat/lon geocode system

• Has a hierarchical spatial structure

• Gradual precision degradation

• In the public domain

http://en.wikipedia.org/wiki/Geohash

• Example: (Boston) DRT2Y
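The hierarchical structure comes from how a geohash is built: the longitude and latitude ranges are halved alternately (longitude first), each halving contributes one bit, and every 5 bits become one base-32 character. The following is a minimal encoder sketch in plain Java; it is not the Spatial4j or Lucene geohash utility, and the class name is hypothetical. It emits lowercase characters, whereas the slide shows them in uppercase.

```java
class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a latitude/longitude pair into a geohash of the given length. */
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder(length);
        boolean lonBit = true; // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) {
                    value = (value << 1) | 1;
                    minLon = mid;
                } else {
                    value <<= 1;
                    maxLon = mid;
                }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) {
                    value = (value << 1) | 1;
                    minLat = mid;
                } else {
                    value <<= 1;
                    maxLat = mid;
                }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // 5 bits per base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

For a point in the Boston area, `GeohashSketch.encode(42.35, -71.07, 5)` returns `drt2y`, and asking for 3 characters returns the prefix `drt`. Truncating a geohash always yields the enclosing, coarser cell, which is the gradual precision degradation noted above.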

Page 30: Lucene solr 4 spatial   extended deep dive

Zooming In: D

Page 31: Lucene solr 4 spatial   extended deep dive

Zooming In: DR

Page 32: Lucene solr 4 spatial   extended deep dive

Zooming In: DRT

Page 33: Lucene solr 4 spatial   extended deep dive

Zooming In: DRT2

Page 34: Lucene solr 4 spatial   extended deep dive

Zooming In: DRT2Y

Page 35: Lucene solr 4 spatial   extended deep dive

Geohash Grids

Internal coordinates of an odd-length geohash (DRT2Y)…

…and an even-length geohash (DRT2)

Page 36: Lucene solr 4 spatial   extended deep dive

Demo

• Spatial Solr Playground

• Demo KML grid generation from geometries

• A sample point with quad tree indexes to these tokens:

• A, AD, ADB, ADBA

• A sample circle with quad tree indexes to these tokens:

• A, AB, ABA, ABAB+, ABAC+, ABAD+, ABB, ABBA+, ABBB+, ABBC+, ABBD+, ABC, ABCA+, ABCB+, ABCC+, ABCD+, ABD+, AD, ADA, ADAA+, ADAB+, ADAC+, ADAD+, ADB+, ADC, ADCA+, ADCB+, ADCD+, ADD, ADDA+, ADDB+, ADDC+, ADDD+, B, BA, BAA, BAAC+, BAAD+, BAC, BACA+, BACB+, BACC+, BACD+, BC, BCA, BCAA+, BCAB+, BCAC+, BCC, BCCA+, BCCC+, C, CB, CBB, CBBA+

• Tokens with a ‘+’ are actually indexed with and without the ‘+’
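How a point turns into a chain of quad tree tokens can be sketched as follows. This is an illustration only, not Lucene's QuadPrefixTree: the world bounds and the lettering of quadrants (A upper-left, B upper-right, C lower-left, D lower-right) are assumptions of the sketch, and leaf markers ('+') are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

class QuadTokensSketch {
    /** Returns one token per grid level, each a prefix of the next. */
    static List<String> tokensForPoint(double x, double y, int levels) {
        double minX = -180, maxX = 180, minY = -90, maxY = 90;
        StringBuilder cell = new StringBuilder();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < levels; i++) {
            double midX = (minX + maxX) / 2;
            double midY = (minY + maxY) / 2;
            boolean right = x >= midX;
            boolean upper = y >= midY;

            // Assumed lettering: A = upper-left, B = upper-right, C = lower-left, D = lower-right.
            cell.append(upper ? (right ? 'B' : 'A') : (right ? 'D' : 'C'));

            // Shrink the bounds to the chosen quadrant before descending.
            if (right) minX = midX; else maxX = midX;
            if (upper) minY = midY; else maxY = midY;

            tokens.add(cell.toString());
        }
        return tokens;
    }
}
```

For example, `QuadTokensSketch.tokensForPoint(-71.07, 42.35, 4)` yields `A, AD, ADA, ADAA`. A point produces one token per level, while a shape with area (like the circle above) produces many, because every cell it touches at every level is a candidate.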

Page 37: Lucene solr 4 spatial   extended deep dive

PrefixTreeStrategy Architecture

• Shape: calc rect relationship

• SpatialPrefixTree & Cell: byte string to/from Cell (rect)

• PrefixTreeStrategy: index & search algorithms

• Lucene: TermsEnum

• Filters: IntersectsPrefixTreeFilter, ContainsPrefixTreeFilter, WithinPrefixTreeFilter

Page 38: Lucene solr 4 spatial   extended deep dive

Lucene Spatial example code

SpatialContext ctx = SpatialContext.GEO;
SpatialStrategy strategy = new RecursivePrefixTreeStrategy(
    new GeohashPrefixTree(ctx, 11), "myGeoField"); // 11 = max grid levels

… // make indexWriter and a Document
for (Field f : strategy.createIndexableFields(shape))
  doc.add(f);
indexWriter.addDocument(doc);

// Circle of 200 km around x=-80.0, y=33.0; the radius is converted to degrees
Filter filter = strategy.makeFilter(
    new SpatialArgs(SpatialOperation.Intersects,
        ctx.makeCircle(-80.0, 33.0,
            DistanceUtils.dist2Degrees(200,
                DistanceUtils.EARTH_MEAN_RADIUS_KM))));
indexSearcher.search(userKeywordQuery, filter, 10);

See SpatialExample.java in Lucene spatial tests for more

Page 39: Lucene solr 4 spatial   extended deep dive

Future

• Possible de-emphasis of SpatialStrategy abstraction

• Better options for distance sorting of PrefixTree strategies

• Better PrefixTree encoding than both geohash & quad tree

• Google Summer of Code 2013 -- TBD

• Performance improvements to spatial Intersects RecursivePrefixTree Filter

• Remove the need to double-index leaf-nodes (with and without ‘+’)

• Exact geometry search by blending benefits of PrefixTree and JtsGeoStrategy

• A Single-dimensional PrefixTree (for numeric range index)

Page 40: Lucene solr 4 spatial   extended deep dive

SOLR SPATIAL Adapters to Lucene 4 spatial

Page 41: Lucene solr 4 spatial   extended deep dive

Solr 3 Spatial: LatLonType & friends

• Solr 3 was Solr’s first release to include spatial support

• Not based on Lucene’s old spatial contrib module

• Similar to PointVectorStrategy (formerly TwoDoublesStrategy) but more optimized

• Single-valued only, fast distance sorting, can choose floats (save memory)

• Fields:

• LatLonType (Geodetic)

• PointType (Cartesian)

• Query parsers (spatial filters):

• {!geofilt} (circle) “pt” and “sfield” and “d” params

• {!bbox} (bounding box of a circle)

• Distance function:

• geodist() and some esoteric others

NOT completely superseded by Solr 4 spatial fields

Page 42: Lucene solr 4 spatial   extended deep dive

Solr 4 Spatial

• See http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

<fieldType name="location_rpt"
    class="solr.SpatialRecursivePrefixTreeFieldType"
    spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    distErrPct="0.025"
    maxDistErr="0.000009"
    units="degrees" />

• spatialContextFactory: if you don’t need JTS (polygons), don’t set this

• distErrPct: non-point shapes are approximated to the grid up to 2.5% of the radius

• maxDistErr: max precision (1m) as measured in degrees
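To get a feel for what these two settings imply, the sketch below estimates how many quad tree levels are needed before a cell is no wider than maxDistErr, and how large the allowed error is for a shape of a given radius. This is a rough back-of-the-envelope model, not the algorithm Lucene uses to pick levels; it assumes a 360-degree-wide world that is halved at every level, and the method names are hypothetical.

```java
class GridLevelSketch {
    /** Smallest number of halvings of a 360-degree-wide world so that a cell is no wider than maxDistErr. */
    static int levelsForMaxDistErr(double maxDistErrDegrees) {
        if (maxDistErrDegrees <= 0) {
            throw new IllegalArgumentException("maxDistErr must be positive");
        }
        int level = 0;
        double cellWidth = 360.0;
        while (cellWidth > maxDistErrDegrees) {
            cellWidth /= 2;
            level++;
        }
        return level;
    }

    /** Allowed approximation error for a shape, following the "percentage of radius" description. */
    static double allowedError(double shapeRadiusDegrees, double distErrPct) {
        return shapeRadiusDegrees * distErrPct;
    }
}
```

Under these assumptions, maxDistErr="0.000009" corresponds to 26 levels, and a shape with a 2-degree radius and distErrPct="0.025" may be approximated to within 0.05 degrees, which needs far fewer levels than a point does.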

Page 43: Lucene solr 4 spatial   extended deep dive

Indexing

• Point: Latitude, Longitude (i.e. Y, X)

<field name="geo">43.17614, -90.57341</field>

• Point: X Y

<field name="geo">-90.57341 43.17614</field>

• Rect: minX minY maxX maxY

<field name="geo">-74.093 41.042 -69.347 44.558</field>

• Circle: point then d=radius (in degrees)

• will be deprecated

<field name="geo">Circle(4.56,1.23 d=0.0710)</field>

• WKT (preferred; it’s a standard)

<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>

Page 44: Lucene solr 4 spatial   extended deep dive

Filter (search)

• Using Solr 3’s bbox or geofilt query parsers

• Distance radius ‘d’ is interpreted as kilometers, just like LatLonType

• Limited to a circle (geofilt) or the bounding box of a circle (bbox)

fq={!geofilt}&sfield=geo&pt=45.15,-93.85&d=5

• Range query style (bounding box)

• Handles dateline wrap

fq=geo:[-90,-180 TO 90,180]

• Field query style

• Unique to Lucene 4 spatial; see SpatialArgsParser

fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"

• Predicates: Intersects, IsDisjointTo, IsWithin, Contains, …

• distErrPct (& distErr) optional; override field type’s default

SOLR-4242: A better spatial query parser

Page 45: Lucene solr 4 spatial   extended deep dive

Distance Sort & Relevancy Boost

• geodist() is for Solr 3 LatLonType only

sort=geodist(lltField,45.15,-93.85) desc

• Solr 4 spatial queries can return the distance as the score

q={!geofilt sfield=geo pt=45.15,-93.85 d=5 score=distance}&sort=score asc&fl=*,score

• Without a filter

sort=query($sortsq) asc&sortsq={!geofilt filter=false score=distance sfield=geo pt=45.15,-93.85 d=0}

• Relevancy boost

defType=edismax&boost=query($mysq)&mysq={!geofilt filter=false score=recipDistance pt=45.15,-98.85 d=5}

Page 46: Lucene solr 4 spatial   extended deep dive

Distance Faceting

• sfield=geo (the field)

• pt=45.15,-93.85 (point of reference)

• Within 10km

• facet.query={!geofilt d=10}

• Within 50km

• facet.query={!geofilt d=50}

• Within 100km

• facet.query={!geofilt d=100}

Page 47: Lucene solr 4 spatial   extended deep dive

Future

• A more Solr-friendly spatial query parser SOLR-4242

• Retrofit geodist() to support the SpatialStrategies?

• Expose more tunables

• A grid based heat-map faceting component

• Idea: a multi-strategy spatial field encompassing

• A PrefixTree field for points

• A PrefixTree field for non-points

• A TwoDoubles field for good distance sorting / relevancy

• Knows whether it’s single vs. multi-valued

• A FieldType for multi-value numeric ranges

Page 48: Lucene solr 4 spatial   extended deep dive

DEMO

Page 49: Lucene solr 4 spatial   extended deep dive

INTERESTING USE CASES

Page 50: Lucene solr 4 spatial   extended deep dive

Heatmap / Grid faceting

1. Geohash each point to multiple lengths and index each length into its own field

• geohash_1:D, geohash_2:DR, geohash_3:DRT, geohash_4:DRT2

2. Search with a rectangle (bbox) filter, and…

3. Facet on the geohash field with the desired resolution

• facet.field=geohash_4&facet.limit=10000

• Lots of tuning / customization options

• Projected / quad tree

• facet.prefix may help
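What the facet step returns can be pictured as counting documents per geohash prefix. The following in-memory sketch mimics that on a list of full-length geohashes; in Solr the counting is done by field faceting on the per-length fields, so the class and method here are purely hypothetical stand-ins.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GeohashFacetSketch {
    /** Counts geohashes per grid cell, where a cell is the first {@code resolution} characters. */
    static Map<String, Integer> facetCounts(List<String> geohashes, int resolution) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String g : geohashes) {
            if (g.length() < resolution) {
                continue; // too coarse to assign to a cell at this resolution
            }
            counts.merge(g.substring(0, resolution), 1, Integer::sum);
        }
        return counts;
    }
}
```

Given the geohashes DRT2Y, DRT2Z, DRT3B and DRMJ0, a resolution of 4 gives DRMJ=1, DRT2=2, DRT3=1, while a resolution of 3 collapses them to DRM=1, DRT=3. Each cell and its count can then be drawn as one heatmap tile.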

Page 51: Lucene solr 4 spatial   extended deep dive

Plotting many points on a map

• Why not ask Solr for rows=1000 ?

• It’s slow

• With a variable number of points per doc, this could yield 1 distinct point or 1M

• Instead facet on a geohash with facet.limit=1000

• Fast

• Guaranteed <= 1000 points

• But might need lots of memory

• Or result-grouping on a geohash

But do you really want to plot 1000+ points on a map?

Page 52: Lucene solr 4 spatial   extended deep dive

Filter by indexed distance constraints

• Imagine a dating site where both potential parties have a maximum distance they’re willing to travel

• Q: For the current user, who is not “too far” for you but is also not “too far” for them?

• A: Index each user’s location as a point in one field and as a circle in another. Query by the current user’s circle to the indexed point field as well as the current user’s point to the indexed circle field.
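The two-sided check can be sketched in plain Java as below. It only illustrates the logic that the two spatial queries express; the `User` class, the method names and the rounded Earth radius are assumptions of the example, and a real deployment would run the two filters against the indexed point and circle fields instead of comparing users pairwise.

```java
class MutualDistanceSketch {
    // Approximate mean Earth radius; an assumption of this sketch.
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Hypothetical user: a location (the point) plus a maximum travel distance (the circle radius). */
    static final class User {
        final double lat, lon, maxTravelKm;

        User(double lat, double lon, double maxTravelKm) {
            this.lat = lat;
            this.lon = lon;
            this.maxTravelKm = maxTravelKm;
        }
    }

    /** Great-circle distance in kilometers, using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True if {@code point}'s location falls inside {@code circleOwner}'s travel circle. */
    static boolean withinCircleOf(User circleOwner, User point) {
        return haversineKm(circleOwner.lat, circleOwner.lon, point.lat, point.lon)
                <= circleOwner.maxTravelKm;
    }

    /** Both conditions: candidate is not too far for the current user, and vice versa. */
    static boolean mutualMatch(User current, User candidate) {
        return withinCircleOf(current, candidate) && withinCircleOf(candidate, current);
    }
}
```

The first condition corresponds to querying the indexed point field with the current user's circle; the second corresponds to querying the indexed circle field with the current user's point.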

Page 53: Lucene solr 4 spatial   extended deep dive

Multi-valued durations

• What if your documents needed a variable number of time (or other numerical value) durations?

• This approach won’t work:

<field name="start" type="tdate" multiValued="true"/>

<field name="end" type="tdate" multiValued="true"/>

• Solr (without Solr 4 spatial fields) can’t do it!

• You need to think differently to solve this…

http://wiki.apache.org/solr/SpatialForTimeDurations

• Example use-cases

• Searching for hotel-room vacancies

• Searching for movie show-times

• (next slides) Each document is a person with a variable number of “shifts” that they are working…

Page 54: Lucene solr 4 spatial   extended deep dive

… model durations as points

Page 55: Lucene solr 4 spatial   extended deep dive

… queries become rectangles

Page 56: Lucene solr 4 spatial   extended deep dive

… some config & search details

• Configuration

<fieldType name="days_of_year"
    class="solr.SpatialRecursivePrefixTreeFieldType"
    geo="false" units="degrees"
    worldBounds="0 0 365 365"
    distErrPct="0" maxDistErr="1"/>

• Sample search: Find shifts that have any overlap with the 19th day through the 23rd

daysOfYear:Intersects(0 18.5 23.5 365)

• Caveat: Won’t scale to the full precision of a Java long (timestamp)
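The reason a rectangle query works can be checked with a few lines of Java. A shift is modelled as the point (x = start day, y = end day); it overlaps the query range when its start is no later than the query's end and its end is no earlier than the query's start, which is exactly membership in the rectangle minX=0, minY=queryStart, maxX=queryEnd, maxY=365. The class below is a hypothetical illustration of that equivalence, not Solr code.

```java
class DurationAsPointSketch {
    // World bounds taken from the worldBounds="0 0 365 365" setting above.
    static final double WORLD_MIN = 0, WORLD_MAX = 365;

    /** Plain point-in-rectangle test, with rectangle given as minX minY maxX maxY. */
    static boolean pointInRect(double x, double y,
                               double minX, double minY, double maxX, double maxY) {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }

    /** True if the shift [shiftStart, shiftEnd] overlaps the query range [queryStart, queryEnd]. */
    static boolean overlaps(double shiftStart, double shiftEnd,
                            double queryStart, double queryEnd) {
        // Shift as point (start, end); query as rectangle (0, queryStart, queryEnd, 365).
        return pointInRect(shiftStart, shiftEnd, WORLD_MIN, queryStart, queryEnd, WORLD_MAX);
    }
}
```

With the sample search's bounds (18.5 to 23.5), a shift from day 20 to day 25 matches, while shifts from day 1 to 10 or from day 30 to 40 do not. Because each shift is its own point, a multi-valued field keeps every start paired with its own end, which is what the two separate multi-valued date fields cannot do.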

Page 57: Lucene solr 4 spatial   extended deep dive

Thank you!

• References

• Lucene 4 spatial javadocs

• https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/spatial/

• Spatial4j at GitHub

• https://github.com/spatial4j/spatial4j (spatial4j.com redirect)

• http://spatial4j.16575.n6.nabble.com -- [email protected]

• Solr

• http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

• Spatial Solr Sandbox

• https://github.com/ryantxu/spatial-solr-sandbox

• Contact me:

• David Smiley [email protected] [email protected]

Page 58: Lucene solr 4 spatial   extended deep dive

CONTACT

David Smiley

[email protected]