Lots of facets, fast: Anne Veling, BeyondTrees, [email protected], May 26th 2011

Lots of facets, fast

DESCRIPTION

We built a web application for a well-known US newspaper: a maps-like zooming interface on top of its 60,000 newspapers published since 1850, with Solr searching the 28,000,000 articles to draw an interactive heatmap over them. Using domain knowledge, we made the out-of-the-box faceting solution faster by an order of magnitude, which allowed us to create a great visual way of exploring trends in historical newspapers.

Page 1: Lots of facets, fast

lots of facets, fast

Anne Veling, [email protected], May 26th 2011

Page 2: Lots of facets, fast

introduction

Anne Veling
• Freelance Search Architect
• Lucene Trainer

Proquest New York Times

Page 3: Lots of facets, fast

visualization

data
• 1851 up to 2006: almost 60k newspapers

How to give semantic overview
• Context, where am I
• Detail

Exploration and Discovery

Page 4: Lots of facets, fast

zoom

Present all newspapers on one canvas
Dynamic zooming and panning
Search interface
• for discovery

Front-end by Q42
• HTML5 app
• iPad app

Not yet live

Page 5: Lots of facets, fast

architecture

[Architecture diagram; component labels: text, images, Tile Generator, Indexer, Solr index, Solr server, facet plugin, tiles, Web Server, client]

Page 6: Lots of facets, fast

tiling

Newspaper images, old ones scanned
• TIFF format
• Wrinkles, coffee stains

Tile generator
• Convert to JPG
• One virtual canvas of 512 Gpixel
• Multiple layers

3M tiles: ~100 GB in 11 levels
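A rough consistency check of the tile numbers above. It assumes 512x512-pixel tiles and a pyramid in which each level holds a quarter of the tiles of the next more detailed level; neither assumption is stated on the slide, and all names below are illustrative.

```scala
object TilePyramid {
  // Assumptions (not from the slides): 512x512-pixel tiles, binary giga,
  // and a factor-4 reduction in tile count per zoom level.
  val tilePixels: Long = 512L * 512L
  val canvasPixels: Long = 512L * (1L << 30) // 512 Gpixel
  val levels: Int = 11

  // Tile count per level, most detailed level first.
  val tilesPerLevel: Seq[Long] = {
    val base = canvasPixels / tilePixels
    (0 until levels).map(k => base >> (2 * k))
  }

  val totalTiles: Long = tilesPerLevel.sum
}
```

Under these assumptions the most detailed level has about 2.1M tiles and the 11 levels together about 2.8M, which is in line with the ~3M on the slide.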

Page 7: Lots of facets, fast

search

25,072,989 articles
867M Solr index
DataImportHandler
• Issue with memory: load all XML URLs in memory first
• Solved by indexing in batches

Special
• Nothing stored, not even IDs
• We need nothing returned from search…
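The slide does not say how the batching was done. As a generic sketch of the idea only, the URLs can be consumed lazily so that no more than one batch is held in memory at a time. `indexInBatches` and the `index` callback are hypothetical names, not part of DataImportHandler.

```scala
object BatchIndexing {
  // Hypothetical stand-in for "send these documents to the indexer".
  type IndexBatch = Seq[String] => Unit

  // Consumes the URLs lazily, so at most `batchSize` are held at once.
  // Returns the number of batches that were indexed.
  def indexInBatches(urls: Iterator[String], batchSize: Int)(index: IndexBatch): Int = {
    var batches = 0
    urls.grouped(batchSize).foreach { batch =>
      index(batch)
      batches += 1
    }
    batches
  }
}
```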

Page 8: Lots of facets, fast

[Diagram; labels: query, results, facets, 0, maxDoc]

Page 9: Lots of facets, fast

faceting memory

Store each facet as BitSet over 25M articles
• 58k facets x 25M docs x 1 bit = 169 GB (memory!)

So we use DocSet from Solr
• Sparse bit array -> now fits in 1 GB memory
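The memory figures above can be checked with a few lines of arithmetic. The dense number follows directly from the slide. The sparse estimate is an assumption: it supposes one 4-byte doc id per document per level (decade, year, month, day) and says nothing about how Solr's DocSet is implemented internally.

```scala
object FacetMemory {
  val facets: Long = 58000L   // "58k facets" from the slide
  val docs: Long = 25000000L  // "25M docs" from the slide

  // Dense: one bit per (facet, doc) pair.
  val denseBytes: Long = facets * docs / 8
  val denseGiB: Double = denseBytes.toDouble / (1L << 30)

  // Sparse (assumption): each doc has one day, so it appears once per
  // level and costs one 4-byte int per appearance.
  val levels: Int = 4
  val sparseBytes: Long = docs * levels * 4L
  val sparseMiB: Double = sparseBytes.toDouble / (1L << 20)
}
```

The dense layout comes to roughly 169 GiB, matching the slide, while the assumed sparse layout needs about 400 MB, comfortably under the 1 GB mentioned.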

Page 10: Lots of facets, fast

faceting performance

Facet initialization
• Takes ~1.5 minutes
• Cached

Facet evaluation
• Runtime!
• #docs x #facets

Page 11: Lots of facets, fast

performance

Facet initialization/creation
Runtime faceting

Solr LRU cache
Creation of all facets: ~72s
Runtime evaluation out of the box (ootb): 71 seconds…

/select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on&facet.date=thedate&facet.date.start=1850-01-01T00:00:00Z&facet.date.end=2007-01-01T00:00:00Z&facet.date.gap=%2B1DAY&facet=true
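To get a feel for the size of this request, the number of one-day buckets between `facet.date.start` and `facet.date.end` can be counted with `java.time`. This is only a sketch of the arithmetic; the object name is illustrative.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object DateFacetBuckets {
  val start: LocalDate = LocalDate.of(1850, 1, 1) // facet.date.start
  val end: LocalDate = LocalDate.of(2007, 1, 1)   // facet.date.end

  // With facet.date.gap=+1DAY, each day in [start, end) is one bucket.
  val dayBuckets: Long = ChronoUnit.DAYS.between(start, end)
}
```

That is 57,343 day buckets for a single query, each of which has to be counted against the result set.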

Client-side bottleneck vs Server-side

Page 12: Lots of facets, fast

<filterCache class="solr.FastLRUCache" size="70000" initialSize="512" autowarmCount="0"/>

Improved performance to ~300ms for “Amsterdam” [1825] query!
• 2.3 MB output…

<requestHandler name="/zoomr" class="com.proquest.zoom.ZoomrRequestHandler"></requestHandler>

Custom JSON output
• Base 36 encoded heatmap

01111111111111111122111222777986878768885568855899beddbcebbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdimlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
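A string like the one above can be produced by writing one base-36 digit (0-9, a-z) per bucket. The sketch below scales each count linearly to 0..35 relative to the largest count; that scaling rule and the names `encode` and `decode` are assumptions, since the slides only say the heatmap is base 36 encoded.

```scala
object HeatmapEncoder {
  // Scale each non-negative count to 0..35 relative to the largest count,
  // then write it as one base-36 digit. The linear scaling is an assumption.
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max <= 0) 0 else (c.toLong * 35 / max).toInt
      Character.forDigit(level, 36)
    }.mkString
  }

  // Inverse of the digit step: one character back to its 0..35 level.
  def decode(heatmap: String): List[Int] =
    heatmap.toList.map(ch => Character.digit(ch, 36))
}
```

One character per bucket keeps the payload small and is trivial to decode on the client.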

Page 13: Lots of facets, fast

runtime facet optimization

60,656 facets
Worst case: number of DocSet.exists(doc) checks
• Originally: 25M x 60k = 1.5E12 checks, 60k per doc
• Now: on average half of the facets at each level are checked = 34.5 per doc

Facets per level
• 16 decades
• 160 years
• 1,920 months
• 58,560 days
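The numbers on this slide can be reproduced as follows. The flat count uses the four level sizes above; for the top-down count, taking 16 decades, 10 years per decade, 12 months per year and 31 days per month reproduces the 34.5 on the slide, but the 31 is an assumption.

```scala
object FacetChecks {
  val docs: Long = 25000000L
  // decades + years + months + days
  val facets: Long = 16L + 160L + 1920L + 58560L

  // Flat: test every facet for every doc.
  val flatChecks: Long = docs * facets

  // Top-down: on average half of the children are tried at each level.
  // Children per level: 16 decades, 10 years, 12 months, 31 days (assumed).
  val childrenPerLevel: Seq[Int] = Seq(16, 10, 12, 31)
  val avgChecksPerDoc: Double = childrenPerLevel.map(_ * 0.5).sum
}
```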

Page 14: Lots of facets, fast

optimization

Custom facet runtime Collector
• Break if facet matched: single value per doc per facet, each doc has only 1 day
• Top-down facet selection: decade – year – month – day

Performance for 1850 docs and 60k facets improved from 300ms to 10ms

Custom optimized heatmap JSON
Bottleneck now in the client/canvas/JS
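The two ideas above, breaking on the first match and descending from decade to day, can be sketched without Solr. Here a plain `Set[Int]` stands in for Solr's DocSet, and `Facet` and `locate` are hypothetical names; this illustrates the technique and is not the actual plugin code.

```scala
// A facet node: the matching doc ids (a stand-in for Solr's DocSet) plus
// its finer-grained children (decade -> years -> months -> days).
final case class Facet(label: String, docs: Set[Int], children: List[Facet] = Nil)

object TopDownFaceting {
  // Walk down from the coarsest level. At each level stop at the first
  // child that contains the doc, since a doc has exactly one day. Returns
  // the matching labels and the number of membership checks performed.
  def locate(doc: Int, roots: List[Facet]): (List[String], Int) = {
    var checks = 0
    var path = List.empty[String]
    var level = roots
    while (level.nonEmpty) {
      val hit = level.find { f => checks += 1; f.docs.contains(doc) }
      hit match {
        case Some(f) => path = f.label :: path; level = f.children
        case None    => level = Nil
      }
    }
    (path.reverse, checks)
  }
}
```

Because `find` stops at the first hit, the number of checks per document is bounded by the sum of the fan-outs along one path rather than by the total number of facets.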

Page 15: Lots of facets, fast

show us or it didn’t happen

Web Application iPad App

Page 16: Lots of facets, fast

zooming

Page 17: Lots of facets, fast

facet heatmap

[Screenshots: facet heatmaps for “television” and “inflation”]

Page 18: Lots of facets, fast

conclusions

Great exploratory UI
Use domain knowledge to optimize for performance
• If you can

Next
• Bring it live on the Web and in App Store
• Using it for 1.2M books/CDs/DVDs of Belgium
• More search options
• Multipage

Page 19: Lots of facets, fast

enhancement suggestions

Lucene Collector
• def collect(doc: Int): Boolean

Solr SingleValueFacet
Break after first find
Automatic order based on #counts?

class ExistsCollector extends Collector {
  var exists = false

  // Proposed signature: returning false asks the searcher to stop collecting
  def collect(doc: Int) = {
    exists = true
    false
  }

  def acceptsDocsOutOfOrder() = true
  def setNextReader(reader: IndexReader, base: Int) {}
  def setScorer(scorer: Scorer) {}
}
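The collector above is written against the suggested Boolean-returning `collect`. The effect of that change can be tried out in isolation with the small sketch below. `StoppableCollector`, `ExistsOnly` and `drive` are hypothetical stand-ins and do not use Lucene's classes.

```scala
// Hypothetical collector contract: return false to stop collecting.
trait StoppableCollector {
  def collect(doc: Int): Boolean
}

// Remembers only whether any document was seen at all.
final class ExistsOnly extends StoppableCollector {
  var exists = false
  def collect(doc: Int): Boolean = { exists = true; false }
}

object EarlyStop {
  // Feeds docs to the collector until it asks to stop.
  // Returns the number of docs visited.
  def drive(docs: Iterator[Int], collector: StoppableCollector): Int = {
    var visited = 0
    var keepGoing = true
    while (keepGoing && docs.hasNext) {
      visited += 1
      keepGoing = collector.collect(docs.next())
    }
    visited
  }
}
```

With this contract an existence check visits a single matching document instead of the whole result set.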

Page 20: Lots of facets, fast

lessons learned

Java Graphics has limitations for large fonts (>26,000)

Handling large data sets is tricky
• Indexing
• Copying

There’s technology and there are corporate agendas

You can always make things 10x faster
• Lucene is ridiculously fast, if you configure it well
• Using domain knowledge can get you far

Page 21: Lots of facets, fast

thank you

[email protected]

@anneveling