From Lucene to Elasticsearch, a short explanation of horizontal scalability

Scaling Lucene: The advent of Elasticsearch

Stéphane Gamard

Scalability

Scalability is defined along 3 main axes:

• Index Size - The number of entries upon which we act

• QPS - Number of requests serviced per second

• Time to operation - Time taken to be operational

Lucene

In a nutshell, Lucene does not scale. Why?

• IR library - Purely focused on TF-IDF

• Bounded by native resources - Vertical scaling

• NRT (near-real-time) Inverse Lookup - Segments

Lucene: Segments, the Lucene storage

Just a “bunch of files”

Lucene Indexing: In a “document” perspective

Documents (as token sets):

{#hello, #world}

{#there, #is, #a, #brown, #fox}

{#the, … , #kitchen}

Segment:

Dictionary (token -> term): #is -> T2, #fox -> T45

Inverse Lookup (term -> document ids): T1 -> {#1, #33}, T2 -> {#2, … , #87}, T45 -> {#2, …}
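A minimal sketch of the two structures on this slide, a dictionary and an inverse lookup, written in Python rather than with Lucene itself. The names `build_segment` and `lookup` are hypothetical, the first two documents come from the slide, and the third document is invented for illustration.

```python
from collections import defaultdict

def build_segment(docs):
    """Build a toy segment: a term dictionary plus an inverse lookup."""
    dictionary = {}               # token -> term id (the "Dictionary")
    postings = defaultdict(list)  # term id -> document ids (the "Inverse Lookup")
    for doc_id, tokens in sorted(docs.items()):
        for token in sorted(set(tokens)):
            # Assign the next free term id the first time a token is seen.
            term_id = dictionary.setdefault(token, len(dictionary) + 1)
            postings[term_id].append(doc_id)
    return dictionary, dict(postings)

def lookup(segment, token):
    """Return the ids of the documents containing the token."""
    dictionary, postings = segment
    term_id = dictionary.get(token)
    return postings.get(term_id, []) if term_id is not None else []

docs = {
    1: ["hello", "world"],
    2: ["there", "is", "a", "brown", "fox"],
    3: ["a", "fox", "is", "here"],  # invented document, not from the slide
}
segment = build_segment(docs)
print(lookup(segment, "fox"))  # [2, 3]
```

Each new distinct token grows the dictionary, and each new document grows the lookup lists, which is what the next slide calls the factors of growth.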

Lucene Indexing: Factors of growth

Segment:

Dictionary (token -> term): #is -> T2, #fox -> T45

Inverse Lookup (term -> document ids): T1 -> {#1, #33}, T2 -> {#2, … , #87}, T45 -> {#2, …}

• Dictionary Size - NLP*

• New Inverse Entries

Lucene Indexing: In a storage perspective

Lucene Index: Segments, read by IndexReader(s) and written by an IndexWriter

Lucene Indexing: The wonderful world of merging segments

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
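To make the idea of merging concrete, here is a toy sketch in Python. It assumes each segment is just a mapping from a term to a sorted list of document ids, and that document ids are unique across segments. `merge_segments` is a hypothetical name. This is not Lucene's merge policy: real merges also drop deleted documents and renumber document ids.

```python
import heapq

def merge_segments(segments):
    """Merge several small toy segments into one larger segment."""
    lists_per_term = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            lists_per_term.setdefault(term, []).append(doc_ids)
    # heapq.merge combines already-sorted lists into one sorted stream.
    return {term: list(heapq.merge(*lists))
            for term, lists in lists_per_term.items()}

small_segments = [
    {"fox": [1, 4]},
    {"fox": [2], "is": [3]},
]
merged = merge_segments(small_segments)
print(merged)  # {'fox': [1, 2, 4], 'is': [3]}
```

Fewer, larger segments mean fewer files to consult per search, at the cost of the I/O spent on merging.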

Lucene Wrap-up

A Lucene Index is:

• A collection of segments

• One or more IndexReaders

• A single IndexWriter

Lucene Wrap-up

A single Lucene Index scales to:

• Index Size - Available HDD/RAM for segments

• QPS - Number of IndexReader threads

• Time to operation - Speed at which the IndexWriter can ingest (IOPS)

It can only scale vertically!

Elasticsearch: Also known as the commodity scaling of Lucene ;)

There is no magic…

It’s about partitioning, using an index of indexes as its index.

Elasticsearch: A shard is the magic sauce of web scale

Elasticsearch Index: Lucene | Lucene | Lucene | Lucene | Lucene (one Lucene index per shard)

Elasticsearch: Document Indexing

• Distributed

• Routing
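A sketch of what routing means here: each document is sent to exactly one shard, chosen by hashing a routing key modulo the number of shards. This is plain Python, not the Elasticsearch client. `route` and `index_document` are hypothetical names, the shards are simple dictionaries, and MD5 is used only as a stable stand-in hash (Elasticsearch uses its own hash function internally).

```python
import hashlib

NUM_SHARDS = 5  # assumption: matches the five Lucene boxes on the shard slide

def route(doc_id, num_shards):
    """Pick a shard number from a stable hash of the routing key."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Each "shard" stands in for one Lucene index with its own writer.
shards = [dict() for _ in range(NUM_SHARDS)]

def index_document(doc_id, body):
    """Store the document on the shard selected by routing."""
    shard_no = route(doc_id, len(shards))
    shards[shard_no][doc_id] = body
    return shard_no

for i in range(10):
    index_document(f"doc-{i}", {"text": f"document number {i}"})

print([len(shard) for shard in shards])
```

Because the shard number depends on the shard count, the same key always lands on the same shard only as long as the number of shards stays fixed.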

Elasticsearch: Request

• Parallel

• Aggregated

{search: {…}}

Elasticsearch: In a nutshell

• Distributed - One IndexWriter per shard

• Parallel - Requests run in parallel, one IndexReader per shard
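The "parallel, then aggregated" request path can be sketched as a scatter-gather: query every shard at once, then merge the partial results. This is an illustration in plain Python, not Elasticsearch code. `search_shard` and `search` are hypothetical names, each shard is a dictionary of document id to tokens, and the score is a bare term count rather than TF-IDF.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, k):
    """Local top-k on one shard; score = occurrences of the term."""
    hits = [(tokens.count(term), doc_id)
            for doc_id, tokens in shard.items() if term in tokens]
    return heapq.nlargest(k, hits)

def search(shards, term, k=3):
    """Scatter the query to all shards in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _score, doc_id in merged]

shards = [
    {"a": ["fox", "fox", "dog"], "b": ["cat"]},
    {"c": ["fox"], "d": ["fox", "fox", "fox"]},
]
print(search(shards, "fox", k=2))  # ['d', 'a']
```

The merge step is the part that adds latency as the number of shards grows, since the coordinator must wait for every shard before it can answer.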

Clustering: How to leverage ES to scale Lucene

Sample sizing of one Lucene index for xM indexed documents:

• 2 Threads - 1 searcher, 1 writer

• 2G RAM - Lucene Cache

• 30G disk - Index size

Clustering

Single Machine Scope: 8 cores, 16G RAM, 500G HDD

Elasticsearch Index: 4 Lucene shards, each 2T/2G/30G (threads/RAM/disk)

Can sustain 4 times xM documents
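The "4 times" figure can be checked with a little arithmetic, sketched here in Python. The per-shard sizing (2 threads, 2G RAM, 30G disk) and the machine scope come from the slides. The function name `shards_per_machine` and the assumption of one thread per core are mine.

```python
def shards_per_machine(cores, ram_gb, disk_gb,
                       threads_per_shard=2, ram_per_shard_gb=2,
                       disk_per_shard_gb=30):
    """Return how many shards fit on a machine and which resource limits it."""
    limits = {
        "threads": cores // threads_per_shard,  # assumes one thread per core
        "ram": ram_gb // ram_per_shard_gb,
        "disk": disk_gb // disk_per_shard_gb,
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck, limits

count, bottleneck, limits = shards_per_machine(cores=8, ram_gb=16, disk_gb=500)
print(count, bottleneck)  # 4 threads
print(limits)             # {'threads': 4, 'ram': 8, 'disk': 16}
```

Under these assumptions, RAM would allow 8 shards and disk 16, so the 8 cores are what cap the machine at 4 shards.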

Clustering

1 machine -> 4 * xM documents

2 machines -> 2 * 4 * xM documents

4 machines -> 2 * 4 * xM documents, twice the QPS

Is there a limit to this scalability?

4 machines -> 4 * 4 * xM documents

[Chart axes: # Documents, QPS]
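One way to read these slides is as a trade-off between capacity and throughput: the shard slots of a cluster can hold either distinct data or extra copies of the same data. The sketch below encodes that reading. It is an interpretation, not something the slides state outright, and `cluster_capacity` and its `copies` parameter are hypothetical.

```python
def cluster_capacity(machines, shards_per_machine=4, copies=1):
    """Return (documents in units of xM, QPS multiplier) for a cluster.

    copies=1 means every shard slot holds distinct data;
    copies=2 means every shard exists twice, so reads can be spread out.
    """
    slots = machines * shards_per_machine
    if slots % copies != 0:
        raise ValueError("shard slots must divide evenly among copies")
    return slots // copies, copies

print(cluster_capacity(1))            # (4, 1)
print(cluster_capacity(2))            # (8, 1)
print(cluster_capacity(4, copies=2))  # (8, 2)
print(cluster_capacity(4))            # (16, 1)
```

With 4 machines, the same hardware gives either 2 * 4 * xM documents at twice the QPS or 4 * 4 * xM documents at the original QPS.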

Clustering: The rules of thumb

• Threads - Are the core of the scalability factors

• IOPS - Is generally the limiting factor of horizontal scaling

• RAM - Is generally the limiting factor of vertical scaling

ES is generally excellent with its parameters

Clustering: Health

• Redundancy - auto-balance shards for best possible HA

• Timing - Warmup and Commit points

• Latency - Result merging (especially on remote aggregations)
