Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadhav, Bloomberg L.P

Never Stop Exploring: Pushing the Limits of Solr

Anirudha Jadhav ©2014 Bloomberg L.P.

Who am I?

•  Big Search and Distributed database specialist

•  Built a Search as a Service platform

•  Lead Search Architect @ Bloomberg Vault

•  Credit Derivatives Analytics Engineer @ Bloomberg

•  Master's @ Courant Institute of Mathematical Sciences, New York University

•  Passionate about Search, Scuba Diving, Motorcycles and German Shepherds

bloomberg.com/company

Agenda

•  Search at Bloomberg

•  Goals and Objectives

•  A little background

•  Factors affecting indexing

•  Our tests and benchmarks

•  Design for a better NRT indexer

•  Future work

•  Q/A

Search at Bloomberg

Search at Bloomberg

•  News Search

•  Federated Search

•  Complex re-ranking of search results

•  Archival Search

•  GeoSpatial Search

•  Analytics and Statistics on Search

Objective

Significantly increase Near Real Time (NRT) indexing throughput

E.g. building a search application that receives market data

Indexing workflow

Indexing Data Flow in SolrCloud

Indexing Workflow

Consider the sentence:

We were talking about IBM during the fishing trip

Tokenization: A tokenizer splits the stream of characters into a series of tokens.

[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

Down Casing: Creates tokens by lowercasing all letters and dropping non-letters.

[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

Stemming / Lemmatization: Stemming algorithms reduce the words "fishing", "fished", "fish", and "fisher" to the root word "fish". Lemmatization groups a word with its inflected forms (i.e. fishing -> fished, fishes, fish, but not fisher).

[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip] [talk] [fish]

Stop Word Removal: Remove common stop words ("and", "or", etc.) which introduce noise in the search process.

Synonym Expansion: Mapping of words based upon a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.). For example talk -> chat, IBM -> "big blue", trip -> journey.

[talking] [about] [ibm] [fishing] [trip] [talk] [big] [blue] [fish] [journey] [chat]
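The analysis stages on this slide can be sketched as a chain of small functions. This is an illustrative sketch only, not Solr's analyzer API: the lookup tables (`STEMS`, `STOP_WORDS`, `SYNONYMS`) and all function names are hypothetical and were chosen just to reproduce the example sentence.

```python
import re

# Toy lookup tables (assumptions), sized only for the slide's example.
STEMS = {"talking": "talk", "fishing": "fish"}
STOP_WORDS = {"we", "were", "during", "the"}
SYNONYMS = {"talk": ["chat"], "ibm": ["big", "blue"], "trip": ["journey"]}


def tokenize(text):
    """Split the character stream into tokens, dropping non-letters."""
    return re.findall(r"[A-Za-z]+", text)


def down_case(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stem(tokens):
    """Keep the original tokens and append a stem where one is known."""
    return tokens + [STEMS[t] for t in tokens if t in STEMS]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def expand_synonyms(tokens):
    """Append the synonyms of every token that has an entry."""
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded


def analyze(text):
    """Run the stages in the order the slide lists them."""
    tokens = tokenize(text)
    for stage in (down_case, stem, remove_stop_words, expand_synonyms):
        tokens = stage(tokens)
    return tokens


print(analyze("We were talking about IBM during the fishing trip"))
```

The result contains the same tokens as the final line of the slide, although the order of the appended stems and synonyms may differ.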

Designing the Search Index

Designing a good Search Application also involves many aspects of user interaction that directly influence indexing design

•  Data Type and Data Distribution

•  Server side parameters

•  Networking

•  Client side parameters

•  Query patterns

Factors Affecting Indexing

Data and Distribution of Tokens

Common types of data that we index in a search index

•  Textual data ( human generated ) e.g. messages, news, blogs

•  Textual data ( machine generated ) e.g. logs, tickets

•  Numerical data

•  Geospatial data

How does this affect search index designs ?

•  Query speed and indexing speed depend on the size of an index

•  Size is dependent on
    •  Number of documents in the index
    •  Average size of each document
    •  Distribution of tokens
    •  Index features e.g. Faceting, Highlighting

Server-side Factors

•  Ratio of CPUs to the number of Solr cores running
    •  2 Solr indices per CPU or thread

•  Disk space
    •  Disk space for Solr index * 2 ( headroom for merge cycles )

•  Memory
    •  JVM heap
    •  Off heap
        •  DocValues
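The two rules of thumb above (2 Solr indices per CPU or thread, and twice the index size in disk space) are simple arithmetic. A minimal sketch, assuming the ratios are applied as plain multipliers; the function names and the example numbers are hypothetical:

```python
import math


def min_cpus(num_solr_cores, cores_per_cpu=2):
    """CPUs (or hardware threads) needed at 2 Solr cores per CPU."""
    return math.ceil(num_solr_cores / cores_per_cpu)


def disk_to_provision_gb(index_size_gb, headroom_factor=2):
    """Disk to reserve so that merge cycles have room to rewrite segments."""
    return index_size_gb * headroom_factor


# Hypothetical node hosting 9 Solr cores with a 150 GB index.
print(min_cpus(9), disk_to_provision_gb(150))
```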

Networking

Cluster design consideration

•  Should a cluster span data centers?
    •  Latency between data centers
    •  Reliability and availability SLAs

•  Where does your ZooKeeper ensemble live?
    •  How many election members
    •  Consider observers to scale ZooKeeper
    •  Dynamically promote an observer to election member
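When choosing the number of election members, it helps to recall that ZooKeeper needs a strict majority of voting members to stay available, and that observers do not count toward that majority. The helper names below are hypothetical; the arithmetic is the standard majority-quorum rule.

```python
def quorum_size(voting_members):
    """Smallest strict majority of the voting (election) members."""
    return voting_members // 2 + 1


def tolerated_failures(voting_members):
    """Voting members that can be lost while a quorum still exists."""
    return voting_members - quorum_size(voting_members)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is one reason odd ensemble sizes are the usual choice.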

Manage concurrent connections on the server

Monitor network latencies for QoS guarantees

Client-side Factors

•  Managing connections and reusing connections

•  Which format to use for indexing data

    •  javabin
    •  csv
    •  json
    •  xml

•  How many simultaneous threads to use
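As one illustration of the format choice, a batch of documents can be sent to Solr's update handler as a JSON array. The sketch below only builds the HTTP request and does not send it. The host, the collection name `market_data`, and the document fields are assumptions made up for the example.

```python
import json
import urllib.request

# Hypothetical endpoint: host and collection name are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/market_data/update"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Serialize a batch of documents as one JSON array for a single POST."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; the field names are illustrative only.
batch = [
    {"id": "tick-1", "symbol": "IBM", "price": 182.5},
    {"id": "tick-2", "symbol": "IBM", "price": 182.7},
]
request = build_update_request(batch)
print(request.get_method(), len(request.data), "bytes")
```

Sending many documents per request amortizes the connection and request overhead, which is the motivation for the batching experiments that follow.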

Experiments with NRT Indexing

It’s not always efficient to send a single document to Solr for indexing

How do you decide how many documents to send?

Collector: A buffer that collects Solr update documents

•  Time Triggers ( T )
    •  Time based collector on the client side to batch document payloads to Solr

•  Document Size Triggers ( S )
    •  Document size based collector on the client side to batch document payloads to Solr

•  Document Number Triggers ( N )
    •  Number of documents based collector on the client side to batch document payloads to Solr

The collectors are all used simultaneously, in order of priority. The lower-priority collectors act as cut-off backups to safeguard against overflows.
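A collector that combines the three triggers might look like the following sketch. The class name, the default limits, and the check order (time, then size, then count) are assumptions; the slides do not say which trigger has priority. The time trigger is only evaluated when `add` or `poll` is called, so a real client would call `poll` from a timer.

```python
import time


class Collector:
    """Buffers serialized documents until one of the T/S/N triggers fires."""

    def __init__(self, max_wait_s=0.5, max_bytes=1_000_000, max_docs=1000,
                 clock=time.monotonic):
        self.max_wait_s = max_wait_s   # Time Trigger ( T )
        self.max_bytes = max_bytes     # Document Size Trigger ( S )
        self.max_docs = max_docs       # Document Number Trigger ( N )
        self._clock = clock
        self._docs = []
        self._bytes = 0
        self._started = 0.0

    def _should_flush(self):
        # Any single trigger is enough; the others act as cut-off backups.
        return (self._clock() - self._started >= self.max_wait_s
                or self._bytes >= self.max_bytes
                or len(self._docs) >= self.max_docs)

    def add(self, doc):
        """Buffer one document (bytes); return a batch if a trigger fired."""
        if not self._docs:
            self._started = self._clock()  # window opens on the first doc
        self._docs.append(doc)
        self._bytes += len(doc)
        return self.flush() if self._should_flush() else None

    def poll(self):
        """Check the triggers without adding a document."""
        if self._docs and self._should_flush():
            return self.flush()
        return None

    def flush(self):
        """Hand back the buffered batch and reset the buffer."""
        batch, self._docs, self._bytes = self._docs, [], 0
        return batch
```

Each batch returned by `add` or `poll` would then be sent to Solr in a single update request.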

Tests and Benchmarks

Benchmarking Setup

•  Client application sending data to 4-way replicated SolrCloud

•  5 node ZooKeeper ensemble

•  All tests done with a similar dataset ( machine generated text )
    •  We synthesize a high throughput ingest stream, which serves as our input

•  Soft commits set at 1 sec

Benchmarking : Time Limit Tests

[Chart: indexing throughput (docs/sec) against the Time Trigger collection window in ms]

Benchmarking : Document Limit Tests

[Chart: indexing throughput (docs/sec) against the Document Number Trigger collection window in number of documents]

Benchmarking : Byte Limit Tests

[Chart: indexing throughput (docs/sec) against the Document Size Trigger collection window in bytes]

Observations

•  On average we observed a 5x-7x increase in ingestion throughput

•  Optimization parameters depend on constantly changing factors

•  The tuning variables need to be constantly adjusted for best performance

•  How to use this now?

Design for a better NRT indexer

PID Controller

Proportional term ( P ) - present

Output proportional to the current error value

Integral term ( I ) - past

Sum of the instantaneous error over time; gives the accumulated offset that should have been corrected previously

Derivative term ( D ) - future

Calculated by determining the slope of the error over time and multiplying this rate of change by the derivative gain
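In the standard textbook form, the three terms are summed into one control output. The symbols are not from the slides: $e(t)$ is the error (setpoint $r(t)$ minus measured process variable $y(t)$), $u(t)$ is the control output, and $K_p$, $K_i$, $K_d$ are the proportional, integral and derivative gains.

```latex
u(t) = K_p\, e(t) \;+\; K_i \int_0^{t} e(\tau)\, d\tau \;+\; K_d\, \frac{d e(t)}{dt},
\qquad e(t) = r(t) - y(t)
```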

PID implementation in the indexer

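[Diagram: the client indexer process runs indexing threads that send documents to SolrCloud. A sampling thread reads the Solr responses and measures the process variable (docs/sec). The PID controller implementation takes this measurement and adjusts the control variable, which is one of the triggers, e.g. Time ( T ).]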
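The control loop in the diagram can be sketched with a discrete PID controller that adjusts the Time Trigger window. Everything below is an assumption made for illustration: the gains, the target throughput used as the setpoint, and above all `simulated_docs_per_sec`, a toy stand-in for the sampling thread. It is not a measurement from the talk and not the authors' implementation.

```python
class PID:
    """Discrete PID controller: output = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement          # present
        self._integral += error * dt                 # past
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt  # trend
        self._prev_error = error
        return (self.kp * error
                + self.ki * self._integral
                + self.kd * derivative)


def simulated_docs_per_sec(window_ms):
    """Toy model (assumption): throughput rises with the window, then saturates."""
    return 1000.0 * window_ms / (window_ms + 50.0)


def tune_time_trigger(target_docs_per_sec, steps=200, dt=1.0):
    """Adjust the Time Trigger window until throughput reaches the target."""
    pid = PID(kp=0.05, ki=0.1, kd=0.01, setpoint=target_docs_per_sec)
    window_ms = 10.0
    for _ in range(steps):
        measured = simulated_docs_per_sec(window_ms)   # process variable
        output = pid.update(measured, dt)
        window_ms = max(1.0, min(1000.0, output))      # control variable
    return window_ms


window = tune_time_trigger(800.0)
print(round(window, 1), round(simulated_docs_per_sec(window), 1))
```

A PID controller needs a setpoint, so this sketch drives throughput toward a fixed target. A controller that searches for maximum throughput instead would need a different error definition.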

Future Work

Future work

•  Perfect the PID indexer

•  Add it to the YCSB benchmarking framework

•  Add other server side parameters on the PID indexer

•  Use the PID indexer along with the YCSB framework to size hardware

Never Stop Exploring: Pushing the Limits of Solr

Anirudha Jadhav, Bloomberg L.P.

QUESTIONS ?
