Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadhav, Bloomberg L.P


Never Stop Exploring: Pushing the Limits of Solr

Anirudha Jadhav ©2014 Bloomberg L.P.

Who am I ?

•  Big Search and Distributed database specialist

•  Built a Search as a Service platform

•  Lead Search Architect @ Bloomberg Vault

•  Credit Derivatives Analytics Engineer @ Bloomberg

•  Master's @ Courant Institute of Mathematical Sciences, New York University

•  Passionate about search, scuba diving, motorcycles and German Shepherds

bloomberg.com/company

Agenda

•  Search at Bloomberg

•  Goals and objectives

•  A little background

•  Factors affecting indexing

•  Our tests and benchmarks

•  Design for a better NRT indexer

•  Future work

•  Q/A

Search at Bloomberg

•  News Search

•  Federated Search

•  Complex re-ranking of search results

•  Archival Search

•  GeoSpatial Search

•  Analytics and Statistics on Search

Objective

Significantly increase Near Real Time (NRT) indexing throughput

E.g. building a search application that receives market data

Indexing workflow

Indexing Data Flow in SolrCloud

Indexing Workflow

Consider the sentence:

We were talking about IBM during the fishing trip

Tokenization: A tokenizer splits the stream of characters into a series of tokens.

[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

Down casting: Creates tokens by lowercasing all letters and dropping non-letters.

[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

Stemming / Lemmatization: Stemming algorithms reduce the words "fishing", "fished", "fish", and "fisher" to the root word "fish". Lemmatization maps a word to its dictionary form, so "fishing", "fished", and "fishes" map to "fish", but "fisher" does not.

[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip] [talk] [fish]

Stop word removal: Removes common stop words such as "and" and "or", which introduce noise in the search process.

Synonym expansion: Maps words based on a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.). For example: talk -> chat, IBM -> "big blue", trip -> journey.

[talking] [about] [ibm] [fishing] [trip] [talk] [big] [blue] [fish] [journey] [chat]
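The analysis steps above can be sketched in a few lines of Python. This is a toy illustration, not Solr's analyzer chain: the stop-word set, the stem lookup table, and the synonym map are all assumptions chosen to fit the example sentence, and a real deployment would use a proper stemmer and thesaurus.

```python
import re

# Illustrative data only (assumptions, not taken from any real analyzer config).
STOP_WORDS = {"we", "were", "during", "the", "and", "or"}
STEMS = {"talking": "talk", "fishing": "fish", "fished": "fish", "fishes": "fish"}
SYNONYMS = {"talk": ["chat"], "ibm": ["big", "blue"], "trip": ["journey"]}


def tokenize(text):
    """Split the character stream into tokens on whitespace."""
    return text.split()


def down_case(tokens):
    """Lowercase every token and drop non-letter characters."""
    cleaned = (re.sub(r"[^a-z]", "", t.lower()) for t in tokens)
    return [t for t in cleaned if t]


def stem(tokens):
    """Keep the original tokens and append the stem of any token in the lookup table."""
    return list(tokens) + [STEMS[t] for t in tokens if t in STEMS]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def expand_synonyms(tokens):
    """Keep the original tokens and append the synonyms of any token in the map."""
    extra = [s for t in tokens for s in SYNONYMS.get(t, [])]
    return list(tokens) + extra


def analyze(text):
    """Run the five steps in the order shown on the slide."""
    tokens = tokenize(text)
    tokens = down_case(tokens)
    tokens = stem(tokens)
    tokens = remove_stop_words(tokens)
    return expand_synonyms(tokens)


if __name__ == "__main__":
    print(analyze("We were talking about IBM during the fishing trip"))
```

The sketch appends stems and synonyms at the end of the token list, so the token order can differ from the slide even though the set of tokens is the same.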

Designing the Search Index

Designing a good search application also involves many aspects of user interaction that directly influence indexing design:

•  Data type and data distribution

•  Server-side parameters

•  Networking

•  Client-side parameters

•  Query patterns

Factors Affecting Indexing

Data and Distribution of Tokens

Common types of data that we index in a search index

•  Textual data (human generated), e.g. messages, news, blogs

•  Textual data (machine generated), e.g. logs, tickets

•  Numerical data

•  Geospatial data

How does this affect search index designs?

•  Query speed and indexing speed depend on the size of an index

•  Size is dependent on
    •  Number of documents in the index
    •  Average size of each document
    •  Distribution of tokens
    •  Index features, e.g. faceting, highlighting

Server-side Factors

•  Ratio of CPUs to the number of Solr cores running
    •  2 Solr indices per CPU or thread

•  Disk space
    •  Disk space for the Solr index * 2 (headroom for merge cycles)

•  Memory
    •  JVM heap
    •  Off heap
        •  DocValues
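The two rules of thumb above (2 Solr indices per CPU, and twice the index size in disk space) reduce to simple arithmetic. The helper below is a hypothetical sizing sketch; the function name, parameter names, and defaults are assumptions that restate those ratios and ignore memory entirely.

```python
import math


def size_node(index_gb, solr_cores, cores_per_cpu=2, merge_headroom=2):
    """Estimate disk and CPU needs for one node from the slide's rules of thumb.

    index_gb:       on-disk size of all Solr indices hosted on the node
    solr_cores:     number of Solr cores (indices) running on the node
    cores_per_cpu:  Solr cores served per CPU or hardware thread
    merge_headroom: disk multiplier that leaves room for merge cycles
    """
    return {
        "disk_gb": index_gb * merge_headroom,
        "cpus": math.ceil(solr_cores / cores_per_cpu),
    }


if __name__ == "__main__":
    # 100 GB of index spread over 8 Solr cores.
    print(size_node(index_gb=100, solr_cores=8))
```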

Networking

Cluster design consideration

•  Should a cluster span data centers?
    •  Latency between data centers
    •  Reliability and availability SLAs

•  Where does your ZooKeeper ensemble live?
    •  How many election members
    •  Consider observers to scale ZooKeeper
    •  Dynamically promote an observer to election member
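The question of how many election members to run comes down to majority arithmetic: a ZooKeeper ensemble stays available while a majority of its voting members is up, and observers do not vote. A small sketch of that calculation follows (the function names are made up for illustration):

```python
def quorum_size(voters):
    """Smallest majority of the voting (election) members."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voting members can fail while a majority remains."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(n, "voters: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

An even number of voters adds no fault tolerance over the next lower odd number, which is one reason to scale reads with observers instead of adding election members.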

Manage concurrent connections on the server

Monitor network latencies for QoS guarantees

Client-side Factors

•  Managing connections and reusing connections

•  Which format to use for indexing data

    •  javabin
    •  csv
    •  json
    •  xml

•  How many simultaneous threads to use
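To get a feel for the format choice, one can serialize the same batch in two of the text formats and compare payload sizes. The sketch below uses only the Python standard library and makes no network calls; the field names and sample documents are invented for illustration, and javabin is left out because it is a Solr-specific binary format.

```python
import csv
import io
import json


def to_json(docs):
    """Serialize a batch as a JSON array of documents."""
    return json.dumps(docs)


def to_csv(docs, fields):
    """Serialize a batch as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical market-data style documents.
    batch = [{"id": str(i), "ticker": "IBM", "price": 100 + i} for i in range(1000)]
    print("json bytes:", len(to_json(batch).encode("utf-8")))
    print("csv bytes: ", len(to_csv(batch, ["id", "ticker", "price"]).encode("utf-8")))
```

Payload size is only one factor; parsing cost on the server and support for nested or multi-valued fields also differ between formats, so measuring with real data is the safer guide.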

Experiments with NRT Indexing

It’s not always efficient to send a single document to Solr for indexing

How do you decide how many documents to send?

Collector: a buffer that collects Solr update documents

•  Time Triggers (T)
    •  Time-based collector on the client side to batch document payloads to Solr

•  Document Size Triggers (S)
    •  Document-size-based collector on the client side to batch document payloads to Solr

•  Document Number Triggers (N)
    •  Number-of-documents-based collector on the client side to batch document payloads to Solr

The collectors are all used simultaneously, in order of priority. The lower-priority collectors act as cut-off backups to safeguard against overflows.
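A collector with all three triggers can be sketched as a small buffer class. This is a minimal illustration under assumptions, not the implementation from the talk: the class name, the `flush` callback, the default limits, and the priority order (T, then S, then N) are all hypothetical, and the sketch is not thread-safe.

```python
import time


class Collector:
    """Buffers update documents and flushes when any trigger fires."""

    def __init__(self, flush, max_wait_s=0.5, max_bytes=1_000_000,
                 max_docs=1000, clock=time.monotonic):
        self.flush = flush            # callback: flush(batch, trigger_name)
        self.max_wait_s = max_wait_s  # time trigger (T)
        self.max_bytes = max_bytes    # document size trigger (S)
        self.max_docs = max_docs      # document number trigger (N)
        self.clock = clock            # injectable so tests can fake time
        self._docs = []
        self._bytes = 0
        self._started = None

    def add(self, doc):
        """Add one serialized document (bytes) to the buffer."""
        if not self._docs:
            self._started = self.clock()  # window opens with the first document
        self._docs.append(doc)
        self._bytes += len(doc)
        self._maybe_flush()

    def tick(self):
        """Call periodically so the time trigger fires even when input stalls."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._docs:
            return
        # Triggers are checked in priority order; the later ones act as cut-offs.
        if self.clock() - self._started >= self.max_wait_s:
            trigger = "T"
        elif self._bytes >= self.max_bytes:
            trigger = "S"
        elif len(self._docs) >= self.max_docs:
            trigger = "N"
        else:
            return
        batch, self._docs, self._bytes = self._docs, [], 0
        self.flush(batch, trigger)


if __name__ == "__main__":
    c = Collector(lambda batch, why: print(len(batch), "docs flushed by", why),
                  max_docs=3)
    for i in range(7):
        c.add(f"doc-{i}".encode())
```

In a real indexer the `flush` callback would send the batch to Solr, and `tick` would be driven by a timer thread.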

Tests and Benchmarks

Benchmarking Setup

•  Client application sending data to 4-way replicated SolrCloud

•  5 node Zookeeper ensemble

•  All tests done with a similar dataset (machine generated text)

•  We synthesize a high throughput ingest stream, which serves as our input

•  Soft commits set at 1 sec
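A benchmark of this shape needs a repeatable input stream and a way to measure docs/sec. The sketch below is an assumption about how such a harness might look, not the one used for these tests: the log-like document layout, the `send` callback, and the function names are invented, and `send` stands in for whatever client call delivers a batch to Solr.

```python
import random
import time


def synthetic_docs(n, seed=0):
    """Yield n machine-generated, log-like documents; the same seed gives the same stream."""
    rng = random.Random(seed)
    levels = ["INFO", "WARN", "ERROR"]
    for i in range(n):
        text = (f"{rng.choice(levels)} host-{rng.randrange(16)} "
                f"request took {rng.randrange(1, 500)} ms")
        yield {"id": str(i), "text": text}


def measure_throughput(docs, send, batch_size):
    """Send docs in fixed-size batches; return (docs_sent, docs_per_second)."""
    sent = 0
    batch = []
    start = time.perf_counter()
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            send(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        send(batch)
        sent += len(batch)
    elapsed = time.perf_counter() - start
    rate = sent / elapsed if elapsed > 0 else float("inf")
    return sent, rate


if __name__ == "__main__":
    # A no-op sender measures only the harness overhead.
    sent, rate = measure_throughput(synthetic_docs(100_000), lambda b: None, 500)
    print(sent, "docs at", round(rate), "docs/sec")
```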

Benchmarking : Time Limit Tests

[Plot: throughput in docs/sec vs. Time Triggers collection window in ms]

Benchmarking : Document Limit Tests

[Plot: throughput in docs/sec vs. Document Number Triggers collection window in number of documents]

Benchmarking : Byte Limit Tests

[Plot: throughput in docs/sec vs. Document Size Triggers collection window in bytes]

Observations

•  On average, we observed a 5x-7x increase in ingestion throughput

•  Optimization parameters depend on constantly changing factors

•  The tuning variables need to be constantly adjusted for best performance

•  How do we use this now?

Design for a better NRT indexer

PID Controller

Proportional term ( P ) - present

Output is proportional to the current error value

Integral term ( I ) - past

Sums the instantaneous error over time, giving the accumulated offset that should have been corrected previously

Derivative term ( D ) - future

Calculated from the slope of the error over time (its rate of change), multiplied by the derivative gain
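The three terms combine into a single output, which a short class makes concrete. This is a generic textbook PID in discrete time, offered as a sketch rather than the controller built for the indexer; the gain names `kp`, `ki`, `kd` and the `update` signature are assumptions.

```python
class PID:
    """Discrete-time PID controller: output = P + I + D."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp = kp              # proportional gain
        self.ki = ki              # integral gain
        self.kd = kd              # derivative gain
        self.setpoint = setpoint  # desired value of the process variable
        self._integral = 0.0
        self._prev_error = None

    def update(self, measurement, dt):
        """Return the control output for one sample taken dt time units after the last."""
        error = self.setpoint - measurement
        # I: accumulated error over time (the past).
        self._integral += error * dt
        # D: slope of the error (the future); zero on the first sample.
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        # P: reacts to the current error (the present).
        return self.kp * error + self.ki * self._integral + self.kd * derivative


if __name__ == "__main__":
    pid = PID(kp=2.0, ki=0.5, kd=1.0, setpoint=10.0)
    print(pid.update(6.0, dt=1.0))
    print(pid.update(8.0, dt=1.0))
```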

PID implementation in the indexer

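[Diagram: PID control loop inside the client indexer process]

•  Indexing threads send documents to SolrCloud and receive the Solr response

•  Sampling thread measures the process variable: docs/sec

•  PID controller implementation adjusts the control variable: pick one of the triggers, e.g. Time ( T )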
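The loop in the diagram can be exercised against a toy model before it is pointed at a real cluster. The sketch below makes several loud assumptions: `simulated_solr` is an invented throughput curve in which larger windows amortize a fixed per-request overhead, the controller is simplified to integral action only (the I term of a PID) so its behavior stays easy to reason about, and the target throughput, gain, and limits are arbitrary.

```python
def simulated_solr(window_ms, peak=10_000.0, overhead_ms=50.0):
    """Toy process model: docs/sec rises with the collection window, then saturates."""
    return peak * window_ms / (window_ms + overhead_ms)


def tune_window(target, window_ms=10.0, ki=0.005, steps=200, lo=1.0, hi=1000.0):
    """Adjust the time trigger T until measured docs/sec reaches the target.

    Each step plays the role of the sampling thread (measure docs/sec) followed
    by the controller (move the control variable in proportion to the error).
    """
    history = []
    for _ in range(steps):
        measured = simulated_solr(window_ms)  # process variable: docs/sec
        error = target - measured
        # Integral action: accumulate the error into the control variable,
        # clamped to a safe range for the window.
        window_ms = min(hi, max(lo, window_ms + ki * error))
        history.append((window_ms, measured))
    return window_ms, history


if __name__ == "__main__":
    window, history = tune_window(target=8000.0)
    print("window ms:", round(window, 1), "docs/sec:", round(simulated_solr(window)))
```

With a real cluster the measurement would come from Solr responses sampled over a time interval, and the gains would need retuning because the true throughput curve is unknown and shifts with load.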

Future Work

•  Perfect the PID indexer

•  Add it to the YCSB benchmarking framework

•  Add other server side parameters on the PID indexer

•  Use the PID indexer along with the YCSB framework to size hardware

Never Stop Exploring: Pushing the Limits of Solr

Anirudha Jadhav , Bloomberg LP

QUESTIONS ?