OCTOBER 11-14, 2016 BOSTON, MA

Solr Highlighting at Full Speed: Presented by Timothy Rodriguez, Bloomberg & David Smiley, D W Smiley, LLC

Timothy M. Rodriguez, Verticals Search Team Lead, Bloomberg

David Smiley, Search Developer/Consultant

Agenda

§  Legal Search

§  Business Requirements

§  Highlighters Overview

§  Improving the Standard Highlighter

§  Unified Highlighter

§  Questions

Bloomberg Law

§  Suite of legal business tools for lawyers and legal professionals

§  Business development

§  Drafting

§  Analytics

§  Search

Legal Search

§  Recall Matters

§  Large Documents

§  Citizens United is 130 pages long

§  Some documents run to hundreds of MB

§  Researchers rely on highlighting to help them decide if they should read a document

Requirements

Accuracy

§  Legal users issue detailed searches

§  “cafeteria plan” AND tax

§  Custom Span Queries

§  “insurance fraud” /s conviction

“Just Right” Digest Sizing

Full Document Highlighting

Zone Highlighting

Speed

Solr Highlighters Overview

Highlighter                    Offset Source                         Accuracy   Speed
Default/Standard Highlighter   Analysis                              Better     Slowest
Default/Standard Highlighter   Term Vectors                          Better     Slow
Fast Vector Highlighter        Term Vectors                          Good       Medium
Postings Highlighter           Postings (+ Analysis for wildcards)   Okay       Fast*

* But poor wildcard performance

Offset Source & Index Size

§  Analysis requires no extra data on-disk

§  But analyzing text on the fly is expensive

§  Term vectors are heavy

§  Adding offsets to postings is much lighter than TVs

[Bar chart: index size of each component, in multiples of the stored value size (axis 0 to 3): Stored Value, Terms, Positions, Offsets, TV Terms, TV Positions, TV Offsets]

Initial Attempt

§  Chose the default highlighter and added in customizations as needed

§  Added payload support to the MemoryIndex - LUCENE-6155 v5.0

§  We investigated using the PostingsHighlighter and FastVectorHighlighter but the accuracy trade-offs were not acceptable for our users

§  Ran into performance problems as highlighting was taking the bulk of our execution time

Make it Faster

§  Improvements to the Standard Highlighter

§  Fast uninverting of term vectors to a token stream – LUCENE-6031 v5.0 (remove expensive sort)

§  Rely on Term Vectors for phrase highlighting instead of re-analyzing the text into the MemoryIndex – LUCENE-6034 v5.0

Still not fast enough…

Multithreaded Highlighting

§  Highlighting each returned doc is easily parallelizable

§  Greatly improved performance

§  Greatly increased memory consumption

Still a sizeable fraction of our query times…
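
Because each returned document is highlighted independently, the work maps naturally onto a thread pool. The following is a minimal sketch of that idea using only the JDK; the class name, the `highlight` stand-in, and the pool size are assumptions for illustration, not the code used in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelHighlightSketch {

    // Hypothetical stand-in for a real highlighter: wraps each occurrence of the term in <em> tags.
    static String highlight(String docText, String term) {
        return docText.replace(term, "<em>" + term + "</em>");
    }

    // Highlights every returned document on a fixed-size pool; results keep the input order.
    static List<String> highlightAll(List<String> docs, String term, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String doc : docs) {
                tasks.add(() -> highlight(doc, term));
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("highlighting failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Every in-flight task holds its own document text and working state at the same time, which is one plausible reason memory consumption grows with the thread count.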

Re-evaluating

§  Didn’t look like we could make the Standard Highlighter much faster

§  Perhaps we could federate to one of the highlighters based on the query?

§  Our customizations would have to be ported to each of the highlighters

§  Work would need to be repeated 3x

§  Increased disk utilization from adding offsets to the postings in the main index

Enhance the Postings Highlighter?

§  Fastest of the bunch

§  Add accuracy at least as good as the standard highlighter

§  Add support for the other offset sources too

§  (supports our full-doc-highlighting use-case)

§  But it’s a big job with major internal highlighter surgery…

Let’s do it!

Offsets Overview

§  Getting character offsets is key to highlighting. 3 ways:

§  Analysis: Analyzer → TokenStream → OffsetAttribute, oa.startOffset()

§  Term vectors: IndexReader.getTermVector(docId, field) → Terms → TermsEnum, te.postings(…, PostingsEnum.OFFSETS) → PostingsEnum → pe.startOffset()

§  Postings: IndexReader → LeafReader → Terms → … (see above)
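
To make the role of character offsets concrete, here is a small JDK-only sketch of the analysis route: the text is re-tokenized on the fly and each token carries the start and end offsets that a highlighter needs to mark up the original string. The `Token` type and the toy `analyze` method are assumptions for illustration; they imitate what an Analyzer and OffsetAttribute provide but are not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class OffsetSketch {

    // Hypothetical token: normalized term text plus character offsets into the original text.
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Toy analyzer (an assumption, not Lucene's): splits on non-alphanumerics and lowercases.
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    // Marks up every token equal to the query term, using offsets into the original text.
    static String highlight(String text, String queryTerm) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Token t : analyze(text)) {
            if (t.term.equals(queryTerm)) {
                sb.append(text, last, t.startOffset)
                  .append("<em>").append(text, t.startOffset, t.endOffset).append("</em>");
                last = t.endOffset;
            }
        }
        return sb.append(text, last, text.length()).toString();
    }
}
```

Term vectors and postings with offsets store the same start/end pairs at index time, so the highlighter can skip the re-tokenization step shown in `analyze`.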

PostingsHighlighter Algorithm

1.  Fetches all stored-value content needed up-front

2.  Highlight in field sorted order, then doc sorted order loop:

    1.  Get PostingsEnum from a Terms for each query term

    2.  MTQs: Fake PostingsEnum around filtered TokenStream

    3.  Process PostingsEnum[ ] into Passage[ ]

        java.text.BreakIterator: for passage delineation

        PassageScorer: for passage scoring (BM25 default)

    4.  PassageFormatter: for formatting/mark-up
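
The passage step can be illustrated with the JDK alone. The sketch below uses java.text.BreakIterator to cut the content into sentences and then picks the sentence containing the most match offsets. The class and method names are hypothetical, and the plain match count is a deliberate simplification standing in for the BM25-style PassageScorer named above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PassageSketch {

    // Collects the start offset of every literal occurrence of the term (stand-in for postings offsets).
    static List<Integer> findOffsets(String content, String term) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = content.indexOf(term); i >= 0; i = content.indexOf(term, i + 1)) {
            offsets.add(i);
        }
        return offsets;
    }

    // Returns the sentence containing the most match offsets, or "" if nothing matches.
    static String bestPassage(String content, List<Integer> matchOffsets) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(content);
        String best = "";
        int bestScore = 0;
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            int score = 0;
            for (int offset : matchOffsets) {
                if (offset >= start && offset < end) {
                    score++;
                }
            }
            // Simplified scoring (an assumption): raw match count instead of BM25.
            if (score > bestScore) {
                bestScore = score;
                best = content.substring(start, end).trim();
            }
        }
        return best;
    }
}
```

Swapping in a different BreakIterator (for example, one that treats the whole field as a single passage) changes passage delineation without touching the scoring loop.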

UnifiedHighlighter

§  Forked the PostingsHighlighter (PH); given a new name agnostic of offset source

§  Mostly same PH API; internals re-arranged and expanded

§  Solr adapter is nearly identical too

§  Untouched: Passage, PassageScorer, PassageFormatter

§  Re-uses some standard-Highlighter code too:

§  WeightedSpanTermExtractor (for phrase accuracy)

§  TokenStreamFromTermVector (for wildcards/MTQs)

UH: Accurate Phrases (including any SpanQuery)

§  Convert position-sensitive Queries to SpanQueries

§  Re-use WeightedSpanTermExtractor (WSTE) for this

§  Wrap PostingsEnum for position-sensitive words with one that filters by position-span extracted from span queries

§  Custom: WSTE is not used for this, although it’s similar

§  Note: not 100% accurate to the query, but very good
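
The position-filtering idea can be sketched without Lucene: first find the position spans where the phrase actually matches, then keep only those occurrences of a phrase word that fall inside a span. Everything here (class name, list-of-terms document model, exact-phrase matching with no slop) is an assumption for illustration rather than the UnifiedHighlighter's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseFilterSketch {

    // Finds [startPos, endPos] spans where the phrase terms occur at consecutive positions.
    static List<int[]> phraseSpans(List<String> docTerms, List<String> phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.isEmpty()) {
            return spans;
        }
        for (int pos = 0; pos + phrase.size() <= docTerms.size(); pos++) {
            if (docTerms.subList(pos, pos + phrase.size()).equals(phrase)) {
                spans.add(new int[] {pos, pos + phrase.size() - 1});
            }
        }
        return spans;
    }

    // Keeps only the positions of `term` that fall inside one of the matching spans.
    static List<Integer> filteredPositions(List<String> docTerms, String term, List<int[]> spans) {
        List<Integer> kept = new ArrayList<>();
        for (int pos = 0; pos < docTerms.size(); pos++) {
            if (!docTerms.get(pos).equals(term)) {
                continue;
            }
            for (int[] span : spans) {
                if (pos >= span[0] && pos <= span[1]) {
                    kept.add(pos);
                    break;
                }
            }
        }
        return kept;
    }
}
```

For the document "insurance policy covers insurance fraud" and the phrase "insurance fraud", only the second "insurance" survives the filter, so the first occurrence is not highlighted.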

UH: Analysis Offset Source

§  The most difficult offset source…

§  Honor positionIncrementGap for multi-valued data

§  Populates a MemoryIndex when query has phrases

§  But smartly filters irrelevant terms! (new trick)

§  Wildcards/MTQs too? Uninvert MemoryIndex with re-used TokenStreamFromTermVector

§  If just terms, treat them like wildcards to avoid MemoryIndex usage
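
The term-filtering trick can be pictured as follows: while building the per-document in-memory index, drop every token the query does not mention, but keep the original positions of the tokens that remain so phrase matching still works. This is a JDK-only sketch under that assumption; the class name and the map-based "index" are hypothetical and are not how MemoryIndex is actually populated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class FilteredIndexSketch {

    // Builds term -> positions for one document, keeping only terms the query mentions.
    // Positions are the original token positions, so gaps left by dropped terms are preserved.
    static Map<String, List<Integer>> index(List<String> docTerms, Set<String> queryTerms) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        for (int pos = 0; pos < docTerms.size(); pos++) {
            String term = docTerms.get(pos);
            if (queryTerms.contains(term)) {
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
            }
        }
        return positions;
    }
}
```

For a large document and a short query, the filtered structure holds a handful of entries instead of one per distinct term in the text.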

UH: Postings Plus Light TVs

§  Postings offset source is great, but not for MTQs (wildcards)

§  MTQs need to see all terms in just the document

§  A plain term vector (no offsets or positions) has that!

§  Trick:

§  Wrap the main index with a term vector’s TermsEnum

§  Then TokenStreamFromTermVector for MTQ
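
A rough way to picture the trick: the light term vector supplies just the set of terms in one document, which is enough to expand a wildcard, and the offsets for each matching term then come from the postings. The sketch below models both with plain JDK collections; the class name, the prefix-only wildcard, and the map standing in for postings are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;

class LightTermVectorSketch {

    // docTerms: the document's terms only (a "light" term vector with no positions or offsets).
    // postingsOffsets: term -> start offsets in this document, standing in for postings with offsets.
    static List<Integer> wildcardOffsets(SortedSet<String> docTerms,
                                         Map<String, List<Integer>> postingsOffsets,
                                         String prefix) {
        List<Integer> offsets = new ArrayList<>();
        // Terms sharing the prefix are contiguous in sorted order, starting at the prefix itself.
        for (String term : docTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            offsets.addAll(postingsOffsets.getOrDefault(term, Collections.emptyList()));
        }
        Collections.sort(offsets);
        return offsets;
    }
}
```

The expansion only ever touches terms that occur in the document being highlighted, rather than every term in the index that matches the wildcard.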

Benchmark Results

§  Unified Highlighter performed similarly or better than peers

§  Best performance: Postings with “light” Term Vectors

§  No use case for full term vectors anymore?

§  Caveats

§  Substantial variability in test runs (YMMV)

§  Depends on the specifics of your use case

§  Benchmark code available

Highlighter               Offset Source                 Terms   Phrases   Wildcards
(search)                  N/A                           1.0x    1.0x      1.0x
Standard Highlighter      Analysis                      4.6x    4.7x      7.4x
Unified Highlighter       Analysis                      2.8x    2.4x      3.7x
Standard Highlighter      Term Vectors                  2.7x    2.3x      3.7x
Fast Vector Highlighter   Term Vectors                  1.8x    2.1x      2.6x
Unified Highlighter       Term Vectors                  1.7x    1.8x      2.3x
Postings Highlighter      Postings                      1.8x    1.5x      3.8x
Unified Highlighter       Postings                      1.6x    1.3x      3.8x
Unified Highlighter       Postings with Term Vectors*   1.5x    1.3x      2.2x

Times shown in multiples of the original search time (top row).

Future Potential Improvements

§  Accuracy

§  Switch from WSTE approach to SpanCollector API

§  Honor conjunctions “(X AND Y) OR Z”

§  Relevancy

§  Consider term diversity across top-X passages

§  Incorporate query boosts in passage scores

§  Support “requireFieldMatch=false”

Summary

§  Importance of highlighting in Legal Search

§  Overview of the existing Highlighters

§  Improvements to the Standard Highlighter

§  UnifiedHighlighter

§  Contributed to Lucene/Solr! LUCENE-7438

§  Your new favorite highlighter?

Questions?

Timothy M. Rodriguez, Verticals Search Team Lead, Bloomberg

@Timothy055

David Smiley, Search Developer/Consultant

@DavidWSmiley