Efficient Scalable Search in a Multi-Tenant Environment: Presented by Harry Hight, Bloomberg L.P.

DESCRIPTION

Presented at Lucene/Solr Revolution 2014

Efficient Scalable Search in a Multi-Tenant Environment
Harry Hight
©2014 Bloomberg L.P.

Overview

• Background
• Architecture
• Scale
• Security
• Questions

Background

• Bloomberg Vault - hosted communication archive
  • Explosive growth of enterprise data communications
  • Compliance for Regulated Industries (e.g. e-mail, chat, mobile, voice, social media, files)
  • Private Cloud
• E-Discovery - large historical data sets, but small query volume
  • Search to respond to litigation requests accurately and in a timely manner
  • Reconstruct communications across all channels and types
  • Extraction of large data sets from special storage (WORM)

[Diagram: User, Index, Query, Results, Extraction]

Sizing

• 80 billion documents
  • And growing
• Average document size is 50KB
  • Large variance - 1KB to hundreds of MB
• Hundreds of indexed fields
  • There is a lot of metadata that goes along with communication
• <10 searches/second
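
As a rough sanity check on these figures, the document count and average size can be multiplied out. This is only a back-of-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes), which the slide does not specify, and it estimates raw content volume, not index size.

```python
# Back-of-envelope sizing from the figures on the slide.
# Assumption: decimal units (1 KB = 1,000 bytes); the slide does not say.
DOCUMENTS = 80_000_000_000      # 80 billion documents
AVG_DOC_BYTES = 50 * 1_000      # 50 KB average document size

raw_bytes = DOCUMENTS * AVG_DOC_BYTES
raw_petabytes = raw_bytes / 1_000**5   # 1 PB = 10^15 bytes

print(f"Approximate raw content: {raw_petabytes:.0f} PB")
```

Under those assumptions the archive holds on the order of 4 PB of raw content, served at fewer than 10 searches per second.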

Overview

• Background
• Architecture
• Scale
• Security
• Questions

Architecture

• Massive scale - shards have to be left offline until needed
• Load only the shards needed to serve a search request
  • Searches normally require ~30 shards, but can range from 1 to several hundred depending on application
• Open shards cached in case they are needed again
• Indexing is an external batch process

[Diagram: Search Manager, Shard Mapping, Solr (x4), Shards]
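
The load-on-demand idea above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the `ShardCache` class, the `shards_for_request` helper, the `(tenant, year)` mapping key, the capacity limit, and least-recently-used eviction are all assumptions made for the example. The slide only states that a shard mapping exists and that open shards are cached.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Tuple


class ShardCache:
    """Keeps a bounded number of shards open.

    LRU eviction and the capacity limit are assumptions for this sketch.
    """

    def __init__(self, capacity: int, load: Callable[[str], object]) -> None:
        self.capacity = capacity
        self.load = load  # hypothetical loader, e.g. something that opens a Solr core
        self.open_shards: "OrderedDict[str, object]" = OrderedDict()

    def get(self, shard_id: str) -> object:
        if shard_id in self.open_shards:
            self.open_shards.move_to_end(shard_id)  # mark as recently used
            return self.open_shards[shard_id]
        if len(self.open_shards) >= self.capacity:
            self.open_shards.popitem(last=False)    # unload the coldest shard
        handle = self.load(shard_id)
        self.open_shards[shard_id] = handle
        return handle


def shards_for_request(
    mapping: Dict[Tuple[str, int], str], tenant: str, years: List[int]
) -> List[str]:
    """Look up only the shards that cover this tenant and these years."""
    return [mapping[(tenant, y)] for y in years if (tenant, y) in mapping]


# Hypothetical shard mapping: (tenant, year) -> shard id
mapping = {
    ("acme", 2012): "acme-2012",
    ("acme", 2013): "acme-2013",
    ("acme", 2014): "acme-2014",
    ("globex", 2014): "globex-2014",
}
loads: List[str] = []


def fake_load(shard_id: str) -> str:
    loads.append(shard_id)  # record each (expensive) load
    return f"handle:{shard_id}"


cache = ShardCache(capacity=2, load=fake_load)
for shard_id in shards_for_request(mapping, "acme", [2013, 2014]):
    cache.get(shard_id)
cache.get("acme-2014")  # already open: served from the cache, no second load
```

Only the two shards that cover the request are loaded, and the repeated access to `acme-2014` does not trigger another load.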

Overview

• Background
• Architecture
• Scale
• Security
• Questions

Incremental Search

• Calculating the full result set is time consuming
  • Query cache usually cold due to unload
  • Shard loading takes time
• Users want to review a subset before exporting
• Shards and results are date sorted
• Search shards sequentially, and return partial results as available
• Creates a streaming interface

[Diagram: Applications, Search Manager, Solr (x4), Shards]
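
One way to picture the streaming interface is a generator that visits date-sorted shards one at a time and hands back each batch as soon as it is ready. This is a sketch under assumptions: `incremental_search`, `fake_search`, and the toy shard data are hypothetical names for illustration, and a real search call would go to Solr rather than a dictionary.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def incremental_search(
    shards_newest_first: Iterable[str],
    search_shard: Callable[[str], List[dict]],
) -> Iterator[List[dict]]:
    """Search shards one at a time and yield each non-empty batch of hits.

    Because shards are visited in date order and each shard returns
    date-sorted hits, the concatenated stream stays date sorted.
    """
    for shard_id in shards_newest_first:
        hits = search_shard(shard_id)
        if hits:
            yield hits


# Toy data standing in for per-shard search results (hypothetical).
SHARD_DATA: Dict[str, List[dict]] = {
    "2014": [{"id": "c", "date": "2014-06-01"}, {"id": "b", "date": "2014-02-10"}],
    "2013": [],
    "2012": [{"id": "a", "date": "2012-11-30"}],
}
searched: List[str] = []


def fake_search(shard_id: str) -> List[dict]:
    searched.append(shard_id)  # record which shards were actually touched
    return SHARD_DATA[shard_id]


stream = incremental_search(["2014", "2013", "2012"], fake_search)
first_batch = next(stream)  # only the newest shard has been searched so far
```

The caller can show `first_batch` to the user immediately. Older shards are searched only if the caller keeps pulling from `stream`.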

Pinned Shards

• Incremental search starts with the most recent data
• `Pin` shards for most recent data
  • Subset of shards to be kept loaded at all times
• Shards already loaded for the beginning of the stream
• User doesn’t see the load times for the rest since it happens while they review initial results
• Allows query caches to be more effective
• User sees results in seconds rather than minutes
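
The interaction between pinning and incremental search can be sketched as a stream that treats pinned shards as already loaded and starts loading the next shard in the background while the current batch is being reviewed. This is an illustrative sketch only: `stream_with_prefetch`, its parameters, and the single background worker are assumptions, not details from the talk.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Set, Tuple


def stream_with_prefetch(
    shards_newest_first: List[str],
    pinned: Set[str],
    load_shard: Callable[[str], None],
    search_shard: Callable[[str], List[str]],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (shard_id, hits) newest first, loading the next shard ahead of time.

    Pinned shards are assumed to be loaded already, so they are never
    passed to load_shard.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        loading: Dict[str, Future] = {}

        def start_loading(shard_id: str) -> None:
            if shard_id not in pinned and shard_id not in loading:
                loading[shard_id] = pool.submit(load_shard, shard_id)

        for position, shard_id in enumerate(shards_newest_first):
            start_loading(shard_id)
            if position + 1 < len(shards_newest_first):
                # Prefetch: this load overlaps with the caller's review time.
                start_loading(shards_newest_first[position + 1])
            if shard_id in loading:
                loading[shard_id].result()  # blocks only if the load is unfinished
            yield shard_id, search_shard(shard_id)


loaded: List[str] = []  # records which shards had to be loaded on demand
results = list(
    stream_with_prefetch(
        ["2014", "2013", "2012"],
        pinned={"2014"},
        load_shard=loaded.append,
        search_shard=lambda shard_id: [f"hit-from-{shard_id}"],
    )
)
```

In this toy run the pinned `2014` shard is searched without any load, while `2013` and `2012` are loaded in the background ahead of being searched.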

Overview

• Background
• Architecture
• Scale
• Security
• Questions

Security

• What if each user has a different view of a document?
  • User 1 has permission to view the red text
  • User 2 has permission to view the green text
  • User 3 has permission to view everything

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Security

• Post process each document
  • Ends up being horribly slow
  • Ties application logic to backend
• Generate a unique document for each view
  • 1000s of unique views make for an unmanageable index
  • Trillions of documents is a whole different problem!
• Dynamic fields
  • text_view1:value1, text_view2:value2, text_view3:"value1 value2"
  • Solr doesn’t have a max number of fields, but string interning becomes an issue
• Mangle field values
  • text:"view1_value1 view2_value2 view3_value1 view3_value2"
  • Works pretty well
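
The value-mangling option can be illustrated with a small sketch. The `mangle_field` and `mangle_query` helpers are hypothetical names, and the query-side rewrite is an assumption about how such a scheme would be searched. The slide shows only the indexed value. The sketch also assumes the field's analyzer keeps each prefixed token (for example `view3_value1`) intact as a single term.

```python
from typing import Dict, List


def mangle_field(view_values: Dict[str, List[str]]) -> str:
    """Build one indexed field value by prefixing each token with its view."""
    return " ".join(
        f"{view}_{value}"
        for view, values in view_values.items()
        for value in values
    )


def mangle_query(view: str, terms: List[str]) -> str:
    """Rewrite a user's terms so they match only tokens tagged with that view.

    Assumption: queries are rewritten this way before being sent to Solr.
    """
    return " AND ".join(f"text:{view}_{term}" for term in terms)


# Reproduces the indexed value shown on the slide.
doc_text = mangle_field(
    {"view1": ["value1"], "view2": ["value2"], "view3": ["value1", "value2"]}
)

# A user restricted to view3 searching for both values.
query = mangle_query("view3", ["value1", "value2"])

print(doc_text)
print(query)
```

Because every token carries its view prefix, a user limited to `view2` can never match `value1` in this document: no `view2_value1` token was indexed.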

Questions?