Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadhav, Bloomberg L.P

Never Stop Exploring: Pushing the Limits of Solr

Anirudha Jadhav ©2014 Bloomberg L.P.

Who am I?

•  Big Search and Distributed database specialist

•  Built a Search as a Service platform

•  Lead Search Architect @ Bloomberg Vault

•  Credit Derivatives Analytics Engineer @ Bloomberg

•  Master's @ Courant Institute of Mathematical Sciences, New York University

•  Passionate about Search, Scuba Diving, Motorcycles and German Shepherds

bloomberg.com/company

Agenda

•  Search at Bloomberg

•  Goals and Objectives

•  A little background

•  Factors affecting indexing

•  Our tests and benchmarks

•  Design for a better NRT indexer

•  Future work

•  Q/A

Search at Bloomberg

•  News Search

•  Federated Search

•  Complex re-ranking of search results

•  Archival Search

•  GeoSpatial Search

•  Analytics and Statistics on Search

Objective

Significantly increase Near Real Time (NRT) indexing throughput

E.g. building a search application that receives market data

Indexing workflow

Indexing Data Flow in SolrCloud

Indexing Workflow

Consider the sentence:

We were talking about IBM during the fishing trip

•  Tokenization: A tokenizer splits the stream of characters into a series of tokens.

[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

•  Down Casting: Creates tokens by lowercasing all letters and dropping non-letters.

[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

•  Stemming / Lemmatization: Stemming algorithms reduce the words "fishing", "fished", "fish", and "fisher" to the root word, "fish". Lemmatization expands words to their inflected forms (i.e. fishing -> fished, fishes, fish, but not fisher).

[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip] [talk] [fish]

•  Stop Word Removal: Remove common stop words such as "and", "or", etc., which introduce noise in the search process.

•  Synonym Expansion: Mapping of words based upon a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.). For example, talk -> chat, IBM -> "big blue", trip -> journey.

[talking] [about] [ibm] [fishing] [trip] [talk] [big] [blue] [fish] [journey] [chat]
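The analysis stages on this slide can be sketched in a few lines of Python. This is only an illustration, not Solr's analyzer chain: the `STEMS`, `STOP_WORDS` and `SYNONYMS` tables are hypothetical lookups chosen to mirror the example sentence, where a real analyzer would use a stemming algorithm and a full thesaurus.

```python
import re

# Hypothetical lookup tables chosen to mirror the slide's example sentence.
STEMS = {"talking": "talk", "fishing": "fish"}
STOP_WORDS = {"we", "were", "during", "the"}
SYNONYMS = {"talk": ["chat"], "ibm": ["big", "blue"], "trip": ["journey"]}


def analyze(text):
    # Tokenization: split the character stream into letter-only tokens.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Down casting: lowercase every token.
    tokens = [t.lower() for t in tokens]
    # Stemming: keep each token and append its stem when one is known.
    tokens = tokens + [STEMS[t] for t in tokens if t in STEMS]
    # Stop word removal: drop tokens listed as noise.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Synonym expansion: append thesaurus entries for matching tokens.
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded


print(analyze("We were talking about IBM during the fishing trip"))
```

The result contains the same tokens as the final token stream in the example, though not necessarily in the same order.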

Designing the Search Index

Designing a good Search Application also involves many aspects of user interaction that directly influence indexing design

•  Data Type and Data Distribution

•  Server side parameters

•  Networking

•  Client side parameters

•  Query patterns

Factors Affecting Indexing

Data and Distribution of Tokens

Common types of data that we index in a search index

•  Textual data (human generated), e.g. messages, news, blogs

•  Textual data (machine generated), e.g. logs, tickets

•  Numerical data

•  Geospatial data

How does this affect search index design?

•  Query speed and indexing speed depend on the size of an index

•  Size is dependent on
    •  Number of documents in the index
    •  Average size of each document
    •  Distribution of tokens
    •  Index features, e.g. Faceting, Highlighting

Server-side Factors

•  Ratio of CPUs to the number of Solr cores running
    •  2 Solr indices per CPU or thread

•  Disk space
    •  Disk space for Solr index * 2 (headroom for merge cycles)

•  Memory
    •  JVM heap
    •  Off heap
        •  DocValues

Networking

Cluster design considerations

•  Should a cluster span data centers?
    •  Latency between data centers
    •  Reliability and availability SLAs

•  Where does your ZooKeeper ensemble live?
    •  How many election members
    •  Consider observers to scale ZooKeeper
    •  Dynamically promote an observer to election member

Manage concurrent connections on the server

Monitor network latencies for QoS guarantees

Client-side Factors

•  Managing connections and reusing connections

•  Which format to use for indexing data
    •  javabin
    •  csv
    •  json
    •  xml

•  How many simultaneous threads to use

Experiments with NRT Indexing

It’s not always efficient to send a single document to Solr for indexing

How do you decide how many documents to send?

Collector: a buffer that collects Solr update documents

•  Time Triggers (T)
    •  Time-based collector on the client side to batch document payloads to Solr

•  Document Size Triggers (S)
    •  Document-size-based collector on the client side to batch document payloads to Solr

•  Document Number Triggers (N)
    •  Document-count-based collector on the client side to batch document payloads to Solr

The collectors are all used simultaneously, in order of priority. The lower-priority collectors act as cut-off backups to safeguard against overflows.
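A minimal sketch of such a collector, assuming a hypothetical `flush` callable that would post one batch to Solr. The class name, parameter names and default limits are illustrative and not taken from the talk. For simplicity the sketch flushes as soon as any one of the three triggers fires, rather than modelling an explicit priority order.

```python
import time


class Collector:
    """Client-side buffer that flushes when any trigger fires (illustrative sketch)."""

    def __init__(self, flush, max_docs=500, max_bytes=1_000_000,
                 max_age_s=0.25, clock=time.monotonic):
        self.flush = flush          # hypothetical callable that receives a list of documents
        self.max_docs = max_docs    # Document Number Trigger (N)
        self.max_bytes = max_bytes  # Document Size Trigger (S)
        self.max_age_s = max_age_s  # Time Trigger (T)
        self.clock = clock          # injectable clock, which makes testing easier
        self._docs = []
        self._bytes = 0
        self._started = None

    def add(self, doc):
        # Start the time window when the first document of a batch arrives.
        if not self._docs:
            self._started = self.clock()
        self._docs.append(doc)
        self._bytes += len(doc.encode("utf-8"))
        self.poll()

    def poll(self):
        """Flush if a trigger has fired; call periodically so T also fires when idle."""
        if not self._docs:
            return False
        fired = (
            len(self._docs) >= self.max_docs
            or self._bytes >= self.max_bytes
            or self.clock() - self._started >= self.max_age_s
        )
        if fired:
            batch, self._docs, self._bytes = self._docs, [], 0
            self.flush(batch)
        return fired
```

In use, indexing threads would call `add()` for each incoming document while a timer calls `poll()`, so that a slow stream is still flushed by the time trigger.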

Tests and Benchmarks

Benchmarking Setup

•  Client application sending data to 4-way replicated SolrCloud

•  5 node ZooKeeper ensemble

•  All tests done with a similar dataset (machine generated text)

•  We synthesize a high throughput ingest stream, which serves as our input

•  Soft commits set at 1 sec

Benchmarking : Time Limit Tests

[Figure: indexing throughput (docs/sec) vs. Time Triggers: collection window in ms]

Benchmarking : Document Limit Tests

[Figure: indexing throughput (docs/sec) vs. Document Number Triggers: collection window in number of documents]

Benchmarking : Byte Limit Tests

[Figure: indexing throughput (docs/sec) vs. Document Size Triggers: collection window in bytes]

Observations

•  On average, we observed a 5x-7x increase in ingestion throughput

•  Optimization parameters depend on constantly changing factors

•  The tuning variables need to be constantly adjusted for best performance

•  How do we use this now?

Design for a better NRT indexer

PID Controller

Proportional term (P) - present

Output proportional to the current error value

Integral term (I) - past

Sum of the instantaneous error over time, giving the accumulated offset that should have been corrected previously

Derivative term (D) - future

Calculated by determining the slope of the error over time and multiplying this rate of change by the derivative gain
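For reference, the three terms combine into the standard textbook PID control law. The symbols are assumptions introduced here, not notation from the slides: e(t) is the error between the target and the measured value, u(t) is the controller output, and K_p, K_i, K_d are the tuning gains.

```latex
u(t) = K_p \, e(t) \;+\; K_i \int_0^{t} e(\tau)\, d\tau \;+\; K_d \, \frac{d e(t)}{dt}
```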

PID implementation in the indexer

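[Diagram: PID control loop inside the client indexer process]

•  Indexing threads send documents to SolrCloud

•  Sampling thread reads the Solr response and measures the process variable (docs/sec)

•  PID controller implementation adjusts the control variable

•  Control variable: pick one of the triggers, e.g. Time (T)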
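A minimal sketch of how such a loop might look, assuming a textbook discrete PID controller whose output nudges the time-trigger window. The gains, the target throughput and the helper `next_window_ms` are hypothetical. The sketch also assumes that, within the operating range, a wider collection window raises throughput, which would need to be checked against real benchmarks.

```python
class PID:
    """Textbook discrete PID controller (gains are illustrative assumptions)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target value of the process variable (docs/sec)
        self._integral = 0.0
        self._prev_error = None

    def update(self, measured, dt):
        error = self.setpoint - measured              # present
        self._integral += error * dt                  # past
        if self._prev_error is None:
            derivative = 0.0                          # no slope on the first sample
        else:
            derivative = (error - self._prev_error) / dt  # trend of the error
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def next_window_ms(window_ms, correction, lo=10.0, hi=5000.0):
    # Hypothetical mapping from controller output to the control variable:
    # add the correction to the current window and clamp it to [lo, hi].
    return max(lo, min(hi, window_ms + correction))


# One control step: the sampling thread reports 800 docs/sec against a 1000 docs/sec target.
pid = PID(kp=0.5, ki=0.1, kd=0.0, setpoint=1000.0)
window = next_window_ms(100.0, pid.update(measured=800.0, dt=1.0))
```

Each sampling interval the controller is fed the latest docs/sec reading, and the resulting window is handed to the collector's time trigger.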

Future work

•  Perfect the PID indexer

•  Add it to the YCSB benchmarking framework

•  Add other server side parameters on the PID indexer

•  Use the PID indexer along with the YCSB framework to size hardware

Never Stop Exploring: Pushing the Limits of Solr

Anirudha Jadhav, Bloomberg L.P.

Questions?