From Lucene to Elasticsearch, a short explanation of horizontal scalability

Scaling Lucene: the advent of Elasticsearch

Stéphane Gamard

Scalability

Scalability is defined along 3 main axes:

• Index size - the number of entries upon which we act

• QPS - the number of requests serviced per second

• Time to operation - the time taken to become operational

Lucene

• IR library - purely focused on TF-IDF

• Bounded by native resources - vertical scaling

• NRT (near-real-time) inverse lookup - segments

In a nutshell, Lucene does not scale. Why?

Lucene segments: the Lucene storage

just a “bunch of files”

Lucene Indexing: from a “document” perspective

{#hello, #world}

{#there, #is, #a, #brown, #fox}

{#the, … , #kitchen}

Dictionary: #a -> T1, #is -> T2, #fox -> T45

Inverse Lookup: T1 -> {#1, #33}, T2 -> {#2, … , #87}, T45 -> {#2, …}

Segment
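
The dictionary and inverse lookup above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's actual on-disk format: the function name `build_inverted_index` and the sample document texts are made up for the example, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis step: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical documents, for illustration only.
docs = {
    1: "hello world",
    2: "there is a brown fox",
    3: "the fox is in the kitchen",
}
index = build_inverted_index(docs)
print(index["fox"])    # [2, 3]
print(index["hello"])  # [1]
```

Looking up a term is then a dictionary access followed by a walk over its list of document ids, which is why both the dictionary size and the number of inverse entries matter for growth.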

Lucene Indexing: factors of growth

• Dictionary size - NLP*

• New inverse entries

Lucene Indexing: from a storage perspective

Segment(s)

IndexReader(s)

IndexWriter

Lucene Index

Lucene Indexing: the wonderful world of merging segments

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
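
As a rough mental model of a merge, segments can be treated as immutable term-to-documents maps that are combined into one new, larger segment. The sketch below assumes that simplified representation. `merge_segments` and the sample segments are hypothetical, and real Lucene merges also handle deletions, stored fields and a merge policy, none of which is modelled here.

```python
def merge_segments(*segments):
    """Combine immutable segments (term -> sorted doc ids) into one new segment."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    # The inputs are left untouched; a brand new segment is returned.
    return {term: sorted(ids) for term, ids in sorted(merged.items())}

# Two small hypothetical segments.
seg_a = {"fox": [2], "is": [2]}
seg_b = {"fox": [45], "kitchen": [45]}

merged = merge_segments(seg_a, seg_b)
print(merged)  # {'fox': [2, 45], 'is': [2], 'kitchen': [45]}
```

Fewer, larger segments mean fewer lookups per query, at the cost of the I/O spent rewriting data during the merge.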

Lucene Wrap-up

A Lucene Index is:

• A collection of segments

• One or multiple IndexReaders

• A single IndexWriter

Lucene Wrap-up: a single Lucene Index scales to

• Index size - available HDD/RAM for segments

• QPS - number of IndexReader threads

• Time to operation - speed at which the IndexWriter can ingest (IOPS)

It can only scale vertically!

Elasticsearch: also known as the commodity scaling of Lucene ;)

There is no magic…

It’s about partitioning: using an index of indexes as its index.

Elasticsearch: a shard is the magic sauce of web scale

Lucene Lucene Lucene Lucene Lucene

Elasticsearch Index

Elasticsearch: document indexing

Lucene Lucene Lucene Lucene Lucene

• Distributed

• Routing
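
Routing can be pictured as hashing a routing key (by default the document id) and taking it modulo the number of primary shards. The sketch below assumes exactly that scheme. `pick_shard` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch uses internally.

```python
import zlib

def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Return the shard number a document with this routing key lands on."""
    # crc32 is a stand-in hash for illustration, not Elasticsearch's own.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# Five shards, as in the diagram: each one is a separate Lucene index.
shards = [[] for _ in range(5)]
for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    shards[pick_shard(doc_id, len(shards))].append(doc_id)

print(shards)
```

Because the shard is a pure function of the key and the shard count, the same document always lands on the same shard, which is also why the number of primary shards cannot be changed freely after the fact.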

Elasticsearch: request

Lucene Lucene Lucene Lucene Lucene

• Parallel

• Aggregated

{search: {…}}
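
The parallel and aggregated steps amount to a scatter-gather: the request is sent to every shard at once, and the per-shard hits are merged into one ranked list. The following is a minimal sketch under simplifying assumptions: shards are in-memory dicts, the score is a plain term count rather than a real relevance score, and `search_shard` and `search` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term):
    """Return (score, doc_id) hits for one shard; score is a plain term count."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits

def search(shards, term, top_k=3):
    # Scatter: query every shard in parallel.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, term), shards))
    # Gather: merge the partial results and keep the best top_k overall.
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(top_k, all_hits)

# Two hypothetical shards holding different documents.
shards = [
    {1: "hello world", 2: "a brown fox"},
    {3: "fox fox in the kitchen"},
]
results = search(shards, "fox")
print(results)  # [(2, 3), (1, 2)]
```

The gather step is where the result-merging latency comes from: the coordinating side cannot answer before the slowest shard has replied.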

Elasticsearch: in a nutshell

• Distributed - one IndexWriter per shard

• Parallel - requests are parallelised across one IndexReader per shard

Clustering: how to leverage ES to scale Lucene

Sample sizing of one Lucene index for xM indexed documents:

• 2 threads - 1 searcher, 1 writer

• 2G RAM - Lucene cache

• 30G disk - index size

Elasticsearch Index

Clustering

Single machine scope: 8 cores, 16G RAM, 500G HDD

4 Lucene indexes, each 2T/2G/30G (threads/RAM/disk)

Can sustain 4 times xM documents
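
The figure of 4 follows from dividing each machine resource by the per-index requirement and keeping the smallest result. The sketch below spells out that arithmetic, assuming one thread per core. `shards_per_machine` and the dictionary keys are names chosen for this example.

```python
def shards_per_machine(machine, shard):
    """Number of Lucene indexes one machine can host, per the scarcest resource."""
    return min(machine[resource] // shard[resource] for resource in shard)

# Per-index sizing for xM documents: 2 threads, 2G RAM, 30G disk.
shard = {"threads": 2, "ram_gb": 2, "disk_gb": 30}
# Single machine: 8 cores (assumed one thread per core), 16G RAM, 500G HDD.
machine = {"threads": 8, "ram_gb": 16, "disk_gb": 500}

per_resource = {r: machine[r] // shard[r] for r in shard}
print(per_resource)                        # {'threads': 4, 'ram_gb': 8, 'disk_gb': 16}
print(shards_per_machine(machine, shard))  # 4
```

With these numbers, threads are the binding constraint: RAM would allow 8 indexes and disk 16, but the cores run out at 4.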

Clustering

[Chart axes: # Documents, QPS]

1 machine -> 4 * xM documents

Clustering

2 machines -> 2 * 4 * xM documents

[Chart axes: # Documents, QPS]

• 4 threads - 3 searchers, 1 writer

• 4G RAM - Lucene cache

• 60G disk - index size

Clustering

[Chart axes: # Documents, QPS]

4 machines -> 2 * 4 * xM documents, with twice the QPS

Clustering

[Chart axes: # Documents, QPS]

Is there a limit to this scalability?

Clustering

[Chart axes: # Documents, QPS]

• 8 threads - 7 searchers, 1 writer

• 8G RAM - Lucene cache

• 120G disk - index size

4 machines -> 4 * 4 * xM documents

Clustering: rules of thumb

• Threads - the core of the scalability factors

• IOPS - generally the limiting factor of horizontal scaling

• RAM - generally the limiting factor of vertical scaling

ES generally does an excellent job with its default parameters

Clustering: health

• Redundancy - auto-balance shards for best possible HA

• Timing - Warmup and Commit points

• Latency - Result merging (especially on remote aggregations)
