Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry Hight, Bloomberg L.P

Efficient Scalable Search in a Multi-Tenant Environment Harry Hight ©2014 Bloomberg L.P.

Overview

•  Background
•  Architecture
•  Scale
•  Security
•  Questions

Background
•  Bloomberg Vault – hosted communication archive
•  Explosive growth of enterprise data communications
•  Compliance for regulated industries (e.g. e-mail, chat, mobile, voice, social media, files)
•  Private cloud
•  E-Discovery - large historical data sets, but small query volume
•  Search to respond accurately and in a timely manner to litigation requests
•  Reconstruct communications across all channels and types
•  Extraction of large data sets from special storage (WORM)

[Diagram labels: User; Index; Query; Results; Extraction]

Sizing
•  80 billion documents
   •  And growing
•  Average document size is 50KB
   •  Large variance - 1KB to hundreds of MB
•  Hundreds of indexed fields
   •  There is a lot of metadata that goes along with communication
•  <10 searches/second
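As a rough worked example of what these numbers imply, the raw document volume can be estimated by multiplying the document count by the average size. This is only a back-of-the-envelope sketch: it assumes decimal units (1KB = 1000 bytes) and says nothing about index size, which the slides do not state. The function name is hypothetical.

```python
def raw_corpus_petabytes(doc_count: int, avg_doc_bytes: int) -> float:
    """Estimate raw document volume in decimal petabytes (10**15 bytes)."""
    return doc_count * avg_doc_bytes / 10**15


# Figures from the slide: 80 billion documents, 50KB average size.
DOC_COUNT = 80_000_000_000
AVG_DOC_BYTES = 50 * 1000  # assumption: decimal kilobytes

estimate_pb = raw_corpus_petabytes(DOC_COUNT, AVG_DOC_BYTES)
print(f"approximately {estimate_pb:.1f} PB of raw documents")
```

Under these assumptions the archive holds on the order of 4 PB of raw documents, while serving fewer than 10 searches per second.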



Architecture
•  Massive scale - shards have to be left offline until needed
•  Load only the shards needed to serve a search request
•  Searches normally require ~30 shards, but can range from 1 to several hundred depending on application
•  Open shards cached in case they are needed again
•  Indexing is an external batch process

[Diagram labels: Shards; Solr (x4); Search Manager; Shard Mapping]
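The load-on-demand idea can be sketched as a small least-recently-used cache of open shards, driven by a shard mapping. This is a minimal illustration, not the implementation described in the talk: the `ShardCache` class, the `open_shards_for` helper, the per-tenant mapping layout, and the LRU eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class ShardCache:
    """Keeps at most `capacity` shards open, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical callable: shard_id -> open handle
        self.handles = OrderedDict()  # shard_id -> handle, oldest use first

    def get(self, shard_id):
        if shard_id in self.handles:
            # Already open: mark as most recently used and reuse it.
            self.handles.move_to_end(shard_id)
            return self.handles[shard_id]
        # Offline shard: load it only now that a search needs it.
        handle = self.loader(shard_id)
        self.handles[shard_id] = handle
        if len(self.handles) > self.capacity:
            self.handles.popitem(last=False)  # unload the least recently used shard
        return handle


def open_shards_for(tenant, shard_mapping, cache):
    """Open only the shards that the mapping lists for this tenant (assumed layout)."""
    return [cache.get(shard_id) for shard_id in shard_mapping.get(tenant, [])]
```

A request for one tenant then touches only that tenant's shards, and a repeated request reuses the handles that are still cached.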



Incremental Search
•  Calculating the full result set is time consuming
   •  Query cache usually cold due to unload
   •  Shard load takes time
•  Users want to review a subset before exporting
•  Shards and results are date sorted
•  Search shards sequentially, and return partial results as available
•  Creates a streaming interface

[Diagram labels: Shards; Solr (x4); Search Manager; Applications]
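The streaming behaviour described above maps naturally onto a generator: shards are searched one at a time in date order, and hits are handed to the caller as soon as each shard answers. The sketch below is illustrative only; `incremental_search`, the `search_shard` callable, and the shard IDs are hypothetical names, and it assumes each shard returns its own hits already date sorted.

```python
from itertools import islice


def incremental_search(shard_ids_newest_first, search_shard, query):
    """Yield hits shard by shard, so later shards are searched only if consumed."""
    for shard_id in shard_ids_newest_first:
        # search_shard is a hypothetical callable: (shard_id, query) -> list of hits
        yield from search_shard(shard_id, query)


# Usage with an in-memory stand-in for the shards.
fake_index = {
    "2014-03": ["m9", "m8"],
    "2014-02": ["m7"],
    "2014-01": ["m6", "m5"],
}
searched = []


def fake_search_shard(shard_id, query):
    searched.append(shard_id)
    return fake_index[shard_id]


stream = incremental_search(["2014-03", "2014-02", "2014-01"], fake_search_shard, "q")
first_page = list(islice(stream, 3))
```

Because the generator is lazy, taking the first three hits searches only the two newest shards; the oldest shard is never touched unless the user keeps reading.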

Pinned Shards
•  Incremental search starts with the most recent data
•  `Pin` shards for most recent data
   •  Subset of shards to be kept loaded at all times
   •  Shards already loaded for the beginning of the stream
   •  User doesn’t see the load times for the rest since it happens while they review initial results
   •  Allows query caches to be more effective
   •  User sees results in seconds rather than minutes
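One way to picture pinning is as an exemption from unloading: the newest shards are chosen as the pinned set, and whatever picks a shard to unload skips them. The two helpers below are a sketch under that assumption; `pick_pinned`, `pick_victim`, and the idea of a fixed pin count are hypothetical and not taken from the talk.

```python
def pick_pinned(shard_ids_newest_first, pin_count):
    """Choose the `pin_count` most recent shards to keep loaded at all times."""
    return set(shard_ids_newest_first[:pin_count])


def pick_victim(open_shards_lru_first, pinned):
    """Return the least recently used open shard that is not pinned, or None."""
    for shard_id in open_shards_lru_first:
        if shard_id not in pinned:
            return shard_id
    return None


# Usage: pin the two newest shards, then choose which open shard to unload.
pinned = pick_pinned(["2014-03", "2014-02", "2014-01", "2013-12"], pin_count=2)
victim = pick_victim(["2014-03", "2013-12", "2014-02"], pinned)
```

Here the least recently used open shard is pinned, so the unpinned `2013-12` shard is selected for unloading instead.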



Security
•  What if each user has a different view of a document?
   •  User 1 has permission to view the red text
   •  User 2 has permission to view the green text
   •  User 3 has permission to view everything

[Sample document text; the red and green highlighting from the slide is not reproduced here]

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Security
•  Post process each document
   •  Ends up being horribly slow
   •  Ties application logic to backend
•  Generate a unique document for each view
   •  1000s of unique views makes for an unmanageable index
   •  Trillions of documents is a whole different problem!
•  Dynamic fields
   •  text_view1:value1, text_view2:value2, text_view3:"value1 value2"
   •  Solr doesn’t have a max number of fields, but string interning becomes an issue
•  Mangle field values
   •  text:"view1_value1 view2_value2 view3_value1 view3_value2"
   •  Works pretty well
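The mangling approach can be illustrated in a few lines: at index time every value is prefixed with the view that may see it, and at query time the search term is prefixed with the user's view, so a single `text` field serves all views. This is a sketch only. The helper names are hypothetical, whitespace tokenization is assumed, and the in-memory `matches` check merely stands in for a real index lookup.

```python
def mangle_values(values_by_view):
    """Build one field value in which each token carries its view as a prefix."""
    tokens = []
    for view, values in values_by_view.items():
        for value in values:
            tokens.append(f"{view}_{value}")
    return " ".join(tokens)


def mangle_query_term(view, term):
    """Rewrite a user's search term so it only matches tokens from their view."""
    return f"{view}_{term}"


def matches(field_value, view, term):
    """Stand-in for an index lookup: exact match on whitespace-separated tokens."""
    return mangle_query_term(view, term) in field_value.split()


# The example from the slide: view3 can see both values.
text_field = mangle_values({
    "view1": ["value1"],
    "view2": ["value2"],
    "view3": ["value1", "value2"],
})
```

With this field value, a `view1` user searching for `value2` gets no hit, while a `view3` user does, without any per-view fields or per-view documents.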

Questions?
