If you can't read please download the document
Upload
knoldus-software-llp
View
10.341
Download
0
Embed Size (px)
Citation preview
Deep Dive Into Elasticsearch
Kunal KapoorSoftware ConsultantKnoldus Software LLP
AGENDA
What is Elasticsearch
Getting Started
Key Terminologies
CRUD Operations
Understanding the physical layout
What happens when you index a document
How to make an inverted index mutable
How per-segment search works
How a delete operation works
Segment Merging
What is Elasticsearch
Search engine based on Lucene.
Provides near real-time search
Distributed
Fault Tolerant
Notable users:Facebook
Github
CERN
1. It allows you to explore your data at a speed.
2. Distributed-- 2 or more servers can be started at different locations and the data will be stored on both of them.
3. fault- Es-index is distributed across different nodes As the nodes are distributed across different zones, it is unlikely that a failure affects both the servers at the same time
Getting Started
Download the elasticsearch distribution from https://www.elastic.co/downloads/elasticsearch
To start the elasticsearch server run the following command from within the extracted directory ./bin/elasticsearch
Once the server or node is created you can check the health of your cluster by runningcurl 'localhost:9200/_cat/health?v'
Key Terminologies
Node - A node is a single server that is part of your cluster, stores your data, and participates in the clusters indexing and search capabilities
Cluster - A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides indexing and search capabilities across all nodes.
Index - An index is a collection of documents that have somewhat similar characteristics.
Key Terminologies
Type - A type is a logical category/partition of your index. It is defined for documents that have a set of common fields.
Shard A shard is basically an lucene index. Contains the documents and various data structures that help in searching.
CRUD Operations
Indexing a documentcurl -XPOST 'localhost:9200/test/test/1?pretty' -d '{"text":"Hello World"}'
Updating a documentcurl -XPOST 'localhost:9200/test/test/1?pretty' -d '{"text":"Hello"}'
Delete a documentcurl -XDELETE 'localhost:9200/test/test/1?pretty'
Search a documentcurl -XPOST 'localhost:9200/test/_search?pretty' -d '{"query": { "match": {"text": "hello" }}}'
Understanding the physical layout
Es is configured to use multicast out of the box.
In multicast it sends UDP pings across the local network to
discover the nodes.
Other nodes receive this ping and respond thus creating a
cluster.
Not good for production as it can also discover unwanted nodes if
cluster name is same
Thus unicast is used which accepts IpAddress of nodes that you want to add in the cluster
SHARDS (Lucene indices)
A node has multiple shards within them
An Es index can span across multiple nodes through shards.
A shard is the lowest level worker that contains the data that
is inserted in the index.
SEGMENTS
A lucene index or a shard contains various segments that are
like mini indices
These indices contain the datastructures required by elasticsearch
to provide near-real time search.
Inverted Index
Data structure storing a mapping, from content such as words or numbers, to its locations in a database file, or a set of documents.
Provides full-text search
Consists of 2 partsSorted Dictionary
Postings
Immutable
What happens when you index a document?
The node that receives the request becomes the controller for that request.
That node determines the shard in which the document should reside on the basis of the documents Id.shard = hash(document_id) % number_of_primary_shards
The request is then forwarded to the appropriate node which contains the shard.
The node forwards the request to the appropriate shard.
What happens when you index a document?
The shard performs analysis on the document and creates the appropriate inverted index which is helpful for searching.
The request is then sent to the replica shards.
The documents are analyzed by the standard analyzer by default.
It split the documents on white-space and lowercases the documents.
The documents are then ready to be inserted in the inverted index
What happens when you index a document?
{text:Elasticsearch is an awesome search engine}
{text:Elasticsearch is not a database}
elasticsearch, is, an, awesome, search, engine, elasticsearch,not, a, database
Terms
Document1Document2
elasticsearch
is
an-
awesome-
search-
engine-
not-
a-
database-
How to make an inverted index
mutable?
Earlier, the whole inverted index would be rewritten to disk with the changes.
Very costly approach
Lucene introduced the concept of per-segment search.
Now a Lucene index would mean a collection of segments plus a commit point.
A commit point is a file that contains the list of segments that are ready for search.
How per-segment search works?
New documents are collected in an in-memory buffer.
Every so often, the buffer is commited (refresh)A new supplementary segment with a commit point is written to file-system cache.
The transaction log is updated with the request for a full commit later.
The buffer is cleared and the segment is made available for search.
In-memory Buffer
COMMIT POINT
Transaction Log
How a delete operation works?
Every shard in the Elasticsearch node maintains a .del file along with the commit point.
.del file lists which documents in which segments have been deleted.
When a delete request is encountered, the appropriate document is marked as deleted.
The deleted document will still match the search query but will be filtered later on.
Later the document is purged from the file-system while segment merging.
Segment Merging
Each segment consumes file handles, memory, and CPU cycles.
The more the number of segments, the slower the search will be.
Elasticsearch solves this problem by merging segments in the background.
Small segments are merged into bigger segments.
This is the moment when those old deleted documents are purged from the file-system.
Segment Merging
This is how the merge process works:-The merge process selects a few segments of similar size and merges them into a new bigger segment in the background.
The new segment is flushed to disk.
A new commit point is written that includes the new segment and excludes the old, smaller segments.
The new segment is opened for search.
The old segments are deleted.
References
The Definitive Guide by Clinton Gormley and Zachary Tong
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
Questions?
References
The Definitive Guide by Clinton Gormley and Zachary Tong
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
THANK YOU