Presented at Lucene/Solr Revolution 2014
Efficient Scalable Search in a Multi-Tenant Environment
Harry Hight
©2014 Bloomberg L.P.
Overview
• Background
• Architecture
• Scale
• Security
• Questions
Background
• Bloomberg Vault – hosted communication archive
  • Explosive growth of enterprise data communications
  • Compliance for regulated industries (e.g. e-mail, chat, mobile, voice, social media, files)
  • Private cloud
• E-Discovery - large historical data sets, but small query volume
  • Search to respond to litigation requests accurately and in a timely manner
  • Reconstruct communications across all channels and types
  • Extraction of large data sets from special storage (WORM)
[Diagram: User; Index; Query; Results; Extraction]
Sizing
• 80 billion documents
  • And growing
• Average document size is 50KB
  • Large variance - 1KB to hundreds of MB
• Hundreds of indexed fields
  • There is a lot of metadata that goes along with communication
• <10 searches/second
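The sizing figures above imply a rough raw corpus size. The following is only back-of-envelope arithmetic from the two numbers on the slide; the choice of decimal units (1 KB = 1000 bytes) is an assumption, and index size is not estimated at all.

```python
# Back-of-envelope corpus size from the figures on the Sizing slide.
DOCUMENTS = 80_000_000_000   # 80 billion documents (from the slide)
AVG_DOC_BYTES = 50 * 1000    # 50 KB average, assuming 1 KB = 1000 bytes

raw_bytes = DOCUMENTS * AVG_DOC_BYTES
raw_petabytes = raw_bytes / 1000**5  # decimal petabytes

print(f"Approximate raw corpus size: {raw_petabytes:.0f} PB")
```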
Architecture
• Massive scale - shards have to be left offline until needed
• Load only the shards needed to serve a search request
• Searches normally require ~30 shards, but can range from 1 to several hundred depending on application
• Open shards cached in case they are needed again
• Indexing is an external batch process
[Diagram: Shards; Solr (×4); Search Manager; Shard Mapping]
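A minimal sketch of what a shard mapping lookup might look like, so that only the shards relevant to a request are loaded. The talk does not describe the mapping's structure; the `ShardInfo` record, the per-tenant monthly shards, and the `shards_for_request` helper are all hypothetical.

```python
from datetime import date
from typing import NamedTuple


class ShardInfo(NamedTuple):
    """Hypothetical shard mapping entry: one tenant, one date range."""
    name: str
    tenant: str
    start: date
    end: date


def shards_for_request(shard_map, tenant, start, end):
    """Return the tenant's shards overlapping [start, end], newest first."""
    matches = [
        s for s in shard_map
        if s.tenant == tenant and s.start <= end and s.end >= start
    ]
    return sorted(matches, key=lambda s: s.start, reverse=True)


# Illustrative mapping only; real shard layout is not given in the talk.
SHARD_MAP = [
    ShardInfo("acme-2014-01", "acme", date(2014, 1, 1), date(2014, 1, 31)),
    ShardInfo("acme-2014-02", "acme", date(2014, 2, 1), date(2014, 2, 28)),
    ShardInfo("acme-2014-03", "acme", date(2014, 3, 1), date(2014, 3, 31)),
    ShardInfo("globex-2014-02", "globex", date(2014, 2, 1), date(2014, 2, 28)),
]

needed = shards_for_request(SHARD_MAP, "acme", date(2014, 1, 15), date(2014, 2, 10))
needed_names = [s.name for s in needed]
print(needed_names)
```

Only the returned shards would need to be brought online; every other shard, including other tenants' shards, stays offline.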
Incremental Search
• Calculating the full result set is time consuming
  • Query cache usually cold due to unload
  • Shard load takes time
• Users want to review a subset before exporting
• Shards and results are date sorted
  • Search shards sequentially, and return partial results as available
  • Creates a streaming interface
[Diagram: Shards; Solr (×4); Search Manager; Applications]
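The streaming behaviour described above can be sketched with a generator. This is an illustration under stated assumptions, not the Search Manager's actual code: shards are assumed to cover disjoint date ranges, and `search_shard`, the shard dictionaries, and the toy data are hypothetical.

```python
from itertools import islice


def incremental_search(shards, search_shard, query):
    """Yield hits one shard at a time, newest shard first."""
    # With disjoint date ranges per shard, visiting shards in descending
    # date order keeps the overall stream date sorted.
    for shard in sorted(shards, key=lambda s: s["newest"], reverse=True):
        hits = search_shard(shard, query)  # hypothetical: loads the shard if needed
        for hit in sorted(hits, key=lambda h: h["date"], reverse=True):
            yield hit


# Toy data standing in for real shards.
SHARDS = [
    {"name": "2014-01", "newest": "2014-01-31", "docs": [
        {"date": "2014-01-05", "text": "merger memo"},
    ]},
    {"name": "2014-03", "newest": "2014-03-31", "docs": [
        {"date": "2014-03-02", "text": "merger chat"},
        {"date": "2014-03-20", "text": "merger email"},
        {"date": "2014-03-25", "text": "lunch plans"},
    ]},
]
searched = []  # records which shards were actually touched


def toy_search(shard, query):
    searched.append(shard["name"])
    return [d for d in shard["docs"] if query in d["text"]]


stream = incremental_search(SHARDS, toy_search, "merger")
first_page = list(islice(stream, 2))  # the user reviews the first two hits
searched_after_first_page = list(searched)
rest = list(stream)                   # older shards are searched only now
```

Because the generator is lazy, the older shard is not searched until the caller asks for more results than the newest shard can supply.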
Pinned Shards
• Incremental search starts with the most recent data
• 'Pin' shards for most recent data
  • Subset of shards to be kept loaded at all times
  • Shards already loaded for the beginning of the stream
  • User doesn't see the load times for the rest since it happens while they review initial results
  • Allows query caches to be more effective
  • User sees results in seconds rather than minutes
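One way to combine pinning with the cache of open shards is a least-recently-used cache that exempts pinned shards from eviction. The sketch below assumes an LRU policy, which the talk does not specify; `PinnedShardCache` and the `open_shard` callback are hypothetical names.

```python
from collections import OrderedDict


class PinnedShardCache:
    """Pinned shards stay loaded; other shards are evicted LRU-first."""

    def __init__(self, capacity, pinned, open_shard):
        self.capacity = capacity      # max number of unpinned shards kept open
        self.open_shard = open_shard  # hypothetical callback that loads a shard
        # Pinned shards are loaded up front and never evicted.
        self._pinned = {name: open_shard(name) for name in pinned}
        self._lru = OrderedDict()     # unpinned shards, least recently used first

    def get(self, name):
        if name in self._pinned:
            return self._pinned[name]
        if name in self._lru:
            self._lru.move_to_end(name)  # mark as most recently used
            return self._lru[name]
        handle = self.open_shard(name)
        self._lru[name] = handle
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)  # unload the least recently used shard
        return handle

    def is_open(self, name):
        return name in self._pinned or name in self._lru


loads = []  # records every shard load


def toy_open(name):
    loads.append(name)
    return f"handle:{name}"


cache = PinnedShardCache(capacity=1, pinned=["2014-11"], open_shard=toy_open)
cache.get("2014-11")  # pinned: already loaded, no extra load
cache.get("2014-10")  # loaded on demand
cache.get("2014-09")  # loaded on demand; evicts 2014-10
```

The pinned shard is loaded once at start-up, so the beginning of a date-sorted result stream never waits on a shard load.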
Security
• What if each user has a different view of a document?
  • User 1 has permission to view the red text
  • User 2 has permission to view the green text
  • User 3 has permission to view everything
[Figure: sample document of lorem ipsum placeholder text, with red and green regions]
Security
• Post process each document
  • Ends up being horribly slow
  • Ties application logic to backend
• Generate a unique document for each view
  • 1000s of unique views make for an unmanageable index
  • Trillions of documents is a whole different problem!
• Dynamic fields
  • text_view1:value1, text_view2:value2, text_view3:"value1 value2"
  • Solr doesn't have a max number of fields, but string interning becomes an issue
• Mangle field values
  • text:"view1_value1 view2_value2 view3_value1 view3_value2"
  • Works pretty well
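The field-value mangling approach can be illustrated with a toy example. This is a sketch of the idea only: the helper names are hypothetical, and plain whitespace token matching stands in for Solr's analysis and query handling, which the slides do not detail.

```python
def mangle_for_index(view_values):
    """Build one indexed text value from a {view: [tokens]} mapping."""
    return " ".join(
        f"{view}_{token}"
        for view, tokens in view_values.items()
        for token in tokens
    )


def mangle_query(view, terms):
    """Prefix each query term with the searching user's view."""
    return [f"{view}_{term}" for term in terms]


def matches(indexed_text, view, terms):
    """Toy matcher: every mangled term must appear as an indexed token."""
    indexed = set(indexed_text.split())
    return all(term in indexed for term in mangle_query(view, terms))


# Same example as on the slide: view3 is allowed to see both values.
doc = {"view1": ["value1"], "view2": ["value2"], "view3": ["value1", "value2"]}
text = mangle_for_index(doc)
print(text)
```

A user restricted to view1 can match `value1` but not `value2`, because `view1_value2` was never indexed, and all views share a single field rather than one field per view.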
Questions?