Scaling Lucene: The advent of Elasticsearch
Stéphane Gamard
Scalability
Scalability is defined along 3 main axes:
• Index Size - The number of entries upon which we act
• QPS - Number of requests serviced per second
• Time to operation - Time taken to be operational
Lucene
In a nutshell, Lucene does not scale. Why?
• IR library - Purely focused on TF-IDF
• Bounded by native resources - Vertical scaling
• NRT (near-real-time) Inverse Lookup - Segments
Lucene
Segments: the Lucene storage
just a “bunch of files”
Lucene Indexing
From a “document” perspective
Documents:
{#hello, #world}
{#there, #is, #a, #brown, #fox}
{#the, … , #kitchen}
…
Segment
Dictionary (term -> term id):
#a T1
#is T2
…
#fox T45
…
Inverse Lookup (term id -> document ids):
T1 {#1, #33}
T2 {#2, … , #87}
…
T45 {#2, …}
…
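The dictionary and inverse lookup above can be sketched in a few lines. This is a toy model, not Lucene's on-disk format: the function names `build_segment` and `lookup` are hypothetical, and the third sample document is made up for illustration.

```python
from collections import defaultdict

def build_segment(docs):
    """Build a toy segment: a term dictionary plus an inverse lookup."""
    dictionary = {}                      # term -> term id
    inverse_lookup = defaultdict(list)   # term id -> ascending document ids
    for doc_id, terms in enumerate(docs, start=1):
        for term in sorted(set(terms)):
            # New terms get the next free id; known terms keep theirs.
            term_id = dictionary.setdefault(term, len(dictionary) + 1)
            inverse_lookup[term_id].append(doc_id)
    return dictionary, dict(inverse_lookup)

def lookup(segment, term):
    """Resolve a term to the documents that contain it."""
    dictionary, inverse_lookup = segment
    term_id = dictionary.get(term)
    if term_id is None:
        return []
    return inverse_lookup[term_id]

docs = [
    ["hello", "world"],
    ["there", "is", "a", "brown", "fox"],
    ["a", "quick", "fox"],               # made-up document, not from the slides
]
segment = build_segment(docs)
print(lookup(segment, "fox"))    # [2, 3]
print(lookup(segment, "hello"))  # [1]
```

Both structures grow with the data: the dictionary with every new distinct term, the inverse lookup with every new (term, document) pair.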
Lucene Indexing
Factors of growth
[Figure: the same segment, with its Dictionary and Inverse Lookup]
• Dictionary Size - NLP*
• New Inverse Entries
Lucene Indexing
From a storage perspective
[Figure: segments accumulating in storage, accessed by IndexReader(s) and an IndexWriter; together they make up the Lucene Index]
Lucene Indexing
The wonderful world of merging segments
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
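To make the idea of flushing and merging segments concrete, here is a minimal sketch. It is a deliberately naive model: the class `ToyIndex` and its `max_segments` trigger are hypothetical, and the merge rule (collapse everything into one segment) is not Lucene's actual merge policy.

```python
class ToyIndex:
    """Toy model: immutable segments, one writer buffer, naive merging."""

    def __init__(self, max_segments=3):
        self.segments = []                # each segment: term -> list of doc ids
        self.buffer = {}                  # in-memory postings, not yet flushed
        self.max_segments = max_segments  # hypothetical merge trigger
        self.next_doc_id = 1

    def add(self, terms):
        """Buffer one document and return its id."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for term in set(terms):
            self.buffer.setdefault(term, []).append(doc_id)
        return doc_id

    def flush(self):
        """Turn the buffer into a new segment, then merge if there are too many."""
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = {}
        if len(self.segments) > self.max_segments:
            self.merge()

    def merge(self):
        """Rewrite all segments into a single larger one."""
        merged = {}
        for segment in self.segments:
            for term, doc_ids in segment.items():
                merged.setdefault(term, []).extend(doc_ids)
        for doc_ids in merged.values():
            doc_ids.sort()
        self.segments = [merged]

    def search(self, term):
        """A reader consults every segment and combines the hits."""
        hits = []
        for segment in self.segments:
            hits.extend(segment.get(term, []))
        return sorted(hits)

index = ToyIndex(max_segments=3)
for words in (["brown", "fox"], ["red", "fox"], ["fox"], ["dog"]):
    index.add(words)
    index.flush()                 # one tiny segment per document

print(len(index.segments))        # 1 (the fourth flush triggered a merge)
print(index.search("fox"))        # [1, 2, 3]
```

The trade-off it illustrates: many small segments make writes cheap but force every search to visit more files, while merging costs I/O up front to keep searches fast.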
Lucene Wrap-up
A Lucene Index is:
• A collection of segments
• One or more IndexReaders
• A single IndexWriter
Lucene Wrap-up
A single Lucene Index scales with:
• Index Size - Available HDD/RAM for segments
• QPS - Number of IndexReader threads
• Time to operation - Speed at which the IndexWriter can ingest (IOPS)
It can only scale vertically!
Elasticsearch
Also known as the commodity scaling of Lucene ;)
There is no magic…
It’s about partitioning,
Using an index of indexes as its index.
Elasticsearch
A shard is the magic sauce of web scale
[Figure: an Elasticsearch Index made up of five Lucene indexes, one per shard]
Elasticsearch
Document Indexing
[Figure: a document being routed to one of five Lucene indexes]
• Distributed
• Routing
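Routing can be sketched as a stable hash of a routing key, taken modulo the number of shards. This is an illustration only: `shard_for` is a hypothetical helper, and `zlib.crc32` stands in for whatever hash function the real system uses.

```python
import zlib

def shard_for(routing_key: str, num_shards: int) -> int:
    """Map a routing key to a shard number in [0, num_shards)."""
    # crc32 is a stand-in hash; unlike Python's built-in hash() for strings,
    # it gives the same value in every process.
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards

NUM_SHARDS = 5                       # five Lucene indexes, as in the figure
shards = [[] for _ in range(NUM_SHARDS)]

doc_ids = [f"doc-{n}" for n in range(1, 21)]   # made-up document ids
for doc_id in doc_ids:
    # Each document is written by exactly one shard's IndexWriter.
    shards[shard_for(doc_id, NUM_SHARDS)].append(doc_id)

for number, contents in enumerate(shards):
    print(number, len(contents))
```

Because the shard number depends on the shard count, changing `NUM_SHARDS` in this scheme would send existing keys to different shards, which is why the count is best chosen up front.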
Elasticsearch
Request
[Figure: a search request fanned out to five Lucene indexes]
• Parallel
• Aggregated
{search: {…}}
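The parallel and aggregated steps amount to a scatter-gather: every shard computes its own top hits, and a coordinator merges them. The sketch below assumes a toy scoring rule (raw term count) and hypothetical names `search_shard` and `search`; real scoring and transport are far more involved.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, k):
    """Return this shard's top-k hits as (score, doc_id) pairs."""
    scored = ((terms.count(term), doc_id) for doc_id, terms in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))

def search(shards, term, k=3):
    """Query all shards in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term, k), shards))
    # Aggregation step: keep the global top-k across every shard's top-k.
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

# Three made-up shards, each mapping doc id -> terms.
shards = [
    {"d1": ["fox", "fox", "dog"], "d2": ["cat"]},
    {"d3": ["fox"], "d4": ["fox", "fox", "fox"]},
    {"d5": ["dog"]},
]
top = search(shards, "fox", k=2)
print(top)  # [(3, 'd4'), (2, 'd1')]
```

Each shard only needs to return k hits for the merged top-k to be correct, which keeps the aggregation cost proportional to the number of shards rather than the number of documents.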
Elasticsearch
In a nutshell
• Distributed - One IndexWriter per shard
• Parallel - Requests run in parallel, one IndexReader per shard
Clustering
How to leverage ES to scale Lucene
Sample sizing for xM indexed documents
Lucene
• 2 Threads - 1 searcher, 1 writer
• 2G RAM - Lucene Cache
• 30G disk - Index size
Elasticsearch Index
Clustering
Single machine scope: 8 cores, 16G RAM, 500G HDD
[Figure: one machine hosting four Lucene indexes, each sized 2T/2G/30G (threads/RAM/disk)]
can sustain 4 times xM documents
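The figure of 4 follows from dividing each machine resource by what one Lucene index needs and taking the smallest result. A small sketch of that arithmetic, assuming one thread per core (the helper `shards_per_machine` is hypothetical):

```python
def shards_per_machine(machine, shard):
    """Return (how many shards fit, which resource runs out first)."""
    per_resource = {name: machine[name] // need for name, need in shard.items()}
    limiting = min(per_resource, key=per_resource.get)
    return per_resource[limiting], limiting

# Sizing from the slides: one Lucene index holding xM documents.
shard = {"threads": 2, "ram_gb": 2, "disk_gb": 30}
# Assumption: one usable thread per core on the 8-core machine.
machine = {"threads": 8, "ram_gb": 16, "disk_gb": 500}

count, limiting = shards_per_machine(machine, shard)
print(count, limiting)  # 4 threads
```

RAM would allow 8 indexes and disk 16, so on this machine threads are the first resource to run out.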
Clustering
[Figure: plot of # Documents against QPS]
1 machine -> 4 * xM documents
Clustering
2 machines -> 2 * 4 * xM documents
[Figure: plot of # Documents against QPS]
• 4 Threads - 3 searchers, 1 writer
• 4G RAM - Lucene Cache
• 60G disk - Index size
Clustering
[Figure: plot of # Documents against QPS]
4 machines -> 2 * 4 * xM documents
twice the QPS
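One way to read these numbers is that machines can be spent either on more distinct shards (more documents) or on extra copies of the same shards (more QPS). The sketch below encodes that reading; it is an interpretation rather than something the slides state, and `cluster_capacity` and its `copies` parameter are hypothetical.

```python
def cluster_capacity(machines, shards_per_machine=4, copies=1):
    """Return (documents in units of xM, QPS multiplier) for a cluster.

    Assumes each machine hosts `shards_per_machine` Lucene indexes of xM
    documents, and that every distinct shard is stored `copies` times.
    """
    slots = machines * shards_per_machine
    distinct_shards = slots // copies      # copies consume slots without adding documents
    return distinct_shards, copies         # each copy can serve searches

print(cluster_capacity(1))             # (4, 1)  -> 4 * xM documents
print(cluster_capacity(2))             # (8, 1)  -> 2 * 4 * xM documents
print(cluster_capacity(4, copies=2))   # (8, 2)  -> 2 * 4 * xM documents, twice the QPS
print(cluster_capacity(4))             # (16, 1) -> 4 * 4 * xM documents
```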
Clustering
[Figure: plot of # Documents against QPS]
Is there a limit to this scalability?
Clustering
4 machines -> 4 * 4 * xM documents
[Figure: plot of # Documents against QPS]
• 8 Threads - 7 searchers, 1 writer
• 8G RAM - Lucene Cache
• 120G disk - Index size
Clustering
Rules of thumb
• Threads - are the core of the scalability factors
• IOPS - is generally the limiting factor for horizontal scaling
• RAM - is generally the limiting factor for vertical scaling
ES is generally excellent with its parameters
Clustering
Health
• Redundancy - auto-balance shards for best possible HA
• Timing - Warmup and Commit points
• Latency - Result merging (especially on remote aggregations)