Introduction To ElasticSearch (DamnData)

LUCENE

• Tagline: “Proven Search Capabilities”
• Free & Open Source
• Created in 1999
• Features:
  • Indexes & Analyzes Data
    • Tokenizing, Stemming, Filtering
  • Search Queries
    • Phrases, wildcards, proximity searches, ranges, fielded searches
  • Relevance Scoring, Field Sorting
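To make "tokenizing, stemming, filtering" concrete, here is a minimal sketch of an analysis step feeding an inverted index. It is a toy and not Lucene's implementation: the stopword list, the suffix-stripping stemmer and the function names are all assumptions made for illustration.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}  # assumption: tiny illustrative list


def stem(token):
    """Toy suffix stripping; real stemmers (e.g. Porter) are far more careful."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize and lowercase, drop stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]   # filtering
    return [stem(t) for t in tokens]                     # stemming


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {1: "The Red Devils are playing Wales", 2: "Croatia plays the Red Devils"}
index = build_index(docs)
print(sorted(index["play"]))  # "playing" and "plays" both end up under "play"
```

Because queries go through the same analysis as documents, a search for "plays" would also be reduced to "play" and find both documents.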

ELASTICSEARCH

• Tagline: “You know, for Search”
• Free & Open Source
• Created by Shay Banon @kimchy
• Versions
  • First public release, v0.4 in February 2010
  • A rewrite of earlier “Compass” project, w/ scalability built-in from the very core
  • Latest release 0.90.5
• In Java, so inherently cross-platform

DISTRIBUTED & HIGHLY AVAILABLE

• Multiple servers (nodes) running in a cluster
  • Acts as single service (internal routing)
• Data is split into shards (# shards is configurable)
• Zero or more replicas
  • Replicas on different servers (server pools) for failover
  • Node in cluster goes down? Replica takes over.
• Self managing cluster
  • Automatic master detection + failover
  • Responsible for distribution/relocating shards
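The routing and failover ideas on this slide can be sketched in a few lines. This is a simplified model under stated assumptions: CRC32 stands in for Elasticsearch's internal hash, the round-robin placement is invented for illustration (real allocation is decided by the master node), and the node names are hypothetical.

```python
import zlib


def shard_for(routing_value, number_of_shards):
    """Pick a primary shard: hash of the routing value modulo the shard count."""
    # Assumption: CRC32 is only a stand-in for Elasticsearch's own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


def place_copies(shard, nodes, number_of_replicas):
    """Put the primary and its replicas on distinct nodes (toy round-robin)."""
    copies = min(number_of_replicas + 1, len(nodes))  # never two copies on one node
    start = shard % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


def surviving_copies(placement, failed_node):
    """Copies still available after a node fails; the first one can act as primary."""
    return [node for node in placement if node != failed_node]


nodes = ["node-a", "node-b", "node-c"]        # hypothetical node names
shard = shard_for("2", number_of_shards=5)    # the document id is the default routing value
copies = place_copies(shard, nodes, number_of_replicas=1)
print(shard, copies, surviving_copies(copies, copies[0]))
```

The modulo is also why the number of primary shards matters up front: changing it would change where existing documents are expected to live.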

$ cd ~/Downloads
$ wget https://download […] /elasticsearch-0.90.5.tar.gz
$ tar -xzf elasticsearch-0.90.5.tar.gz
$ cd elasticsearch-0.90.5/
$ ./bin/elasticsearch

$ curl -XPUT http://localhost:9200/reddevils/matches/1 -d '{"date": "2013-10-15T19:00:00Z", "opponent": "Wales", "result": "1-1"}'

{"ok":true,"_index":"reddevils","_type":"matches","_id":"1","_version":1}

$ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}'


$ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2", "girlfriend_attention_span": 30}'


“Aha! A NoSQL store?!”

QUERY DSL

• Full Text Search
  • Search for “Croatia”
• Structured Search
  • Search for “All matches where outcome was ‘1-1’”
• Analytics
  • Search for “Average attention span of my girlfriend”
  • Incl. custom functions (scripts)
• … or a combination of those!

QUERY DSL (CONT’D)

• Searching in your data set …
  • queries: full text search & relevance scoring
  • filters: exact matches
• Aggregating information from your data set …
  • facets:
    • Averages
    • Sums
    • Date histograms
    • …

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
  "query": {
    "query_string": {
      "query": "croatia"
    }
  }
}'

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 0.40240064,
    "hits" : [
      {
        "_index" : "reddevils", "_type" : "matches", "_id" : "2", "_score" : 0.40240064,
        "_source" : {"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}
      },
      {
        "_index" : "reddevils", "_type" : "matches", "_id" : "4", "_score" : 0.3125,
        "_source" : {"date": "2012-09-11T15:00:00Z", "opponent": "Croatia", "result": "1-1"}
      }
    ]
  }
}

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
  "query": {
    "constant_score": {
      "filter": {"term": {"result": "1-1"}}
    }
  }
}'

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
  "size": 0,
  "facets": {
    "opponent": {
      "terms": {
        "field": "opponent"
      }
    }
  }
}'

{ …
  "facets" : {
    "opponent" : {
      "_type" : "terms", "missing" : 0, "total" : 10, "other" : 0,
      "terms" : [
        { "term" : "wales", "count" : 2 },
        { "term" : "serbia", "count" : 2 },
        …
        { "term" : "croatia", "count" : 2 }
      ]
  …
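The "average attention span" question from the analytics bullet has no example in the slides. One way it could be asked in the 0.90 series is with a statistical facet; the sketch below only builds the request body. The facet name `attention` and the helper name are assumptions, and the field is the one indexed in the earlier PUT example.

```python
import json


def average_facet_body(field, facet_name="stats"):
    """Search body that returns no hits, only statistics over one numeric field."""
    return {
        "size": 0,
        "facets": {facet_name: {"statistical": {"field": field}}},
    }


body = average_facet_body("girlfriend_attention_span", facet_name="attention")
print(json.dumps(body, indent=2))
# Assumption: sent to the same /reddevils/matches/_search endpoint as the queries above.
# The average would then be read from the facet's "mean" value in the response.
```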

DOCUMENT RELATIONS

• ElasticSearch provides 2 mechanisms
  • Parent/Child Documents
    • add links between documents by defining parent/child ids.
    • query example: “return children where parent matches x”
    • use case: linking “product” and “offer” documents.
    • query-time join
  • Nested Documents
    • use case: “actions” on a “mention” (Engagor)
    • denormalized in Lucene index
    • in Lucene index data is stored nearby
      • thus local join, thus very fast.
    • index-time join
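As a rough illustration of the two mechanisms, the mappings and the query could look like the bodies built below. The type names `product`, `offer` and `mention` come from the slide; the nested path `actions`, its field types and the parent field `brand` are hypothetical.

```python
import json

# Parent/child: the child type names its parent type in its mapping.
# Each offer is then indexed with a reference to its parent product id.
offer_mapping = {"offer": {"_parent": {"type": "product"}}}

# Nested: actions live inside the mention document itself (index-time join).
mention_mapping = {
    "mention": {
        "properties": {
            "actions": {
                "type": "nested",
                "properties": {
                    "type": {"type": "string"},    # hypothetical field
                    "delay": {"type": "integer"},  # hypothetical field type
                },
            }
        }
    }
}

# "Return children where parent matches x", expressed as a has_parent query.
children_of_matching_parent = {
    "query": {
        "has_parent": {
            "parent_type": "product",
            "query": {"term": {"brand": "x"}},  # hypothetical parent field
        }
    }
}

for body in (offer_mapping, mention_mapping, children_of_matching_parent):
    print(json.dumps(body, indent=2))
```

The trade-off follows the slide: parent and child documents can be updated independently but are joined at query time, while nested documents are joined at index time and have to be reindexed together with their enclosing document.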

EXAMPLE EXPLAINED

• range filter on publish_date
• query_string w/ (internal version of) user-defined query string
• date_histogram facet on mention-document publish_date field
• terms_stats facet per action type on “delay” field of nested-document “action” of mention-document
  • result contains:
    • amount of mentions with action
    • amount of actions
    • total delay of actions
• facet_filter per defined facet.
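The request this slide describes is not included in the transcript, so the following is only a guess at its overall shape, assembled from the bullets. `publish_date` and the `delay` field are named on the slide; the nested path `actions`, the key field `actions.type`, the daily interval and the placeholder `match_all` facet filter are all assumptions.

```python
import json


def mentions_report_body(user_query, date_from, date_to, facet_filter=None):
    """Assemble a search body from the pieces listed on the slide."""
    # Placeholder: the filters actually used per facet are not given in the slides.
    facet_filter = facet_filter or {"match_all": {}}
    return {
        "size": 0,  # assumption: only the facets are of interest
        "query": {
            "filtered": {
                "query": {"query_string": {"query": user_query}},
                "filter": {
                    "range": {"publish_date": {"from": date_from, "to": date_to}}
                },
            }
        },
        "facets": {
            "mentions_over_time": {
                "date_histogram": {"field": "publish_date", "interval": "day"},
                "facet_filter": facet_filter,
            },
            "delay_per_action_type": {
                "terms_stats": {
                    "key_field": "actions.type",     # hypothetical field name
                    "value_field": "actions.delay",  # the "delay" field of the nested action
                },
                "nested": "actions",                 # hypothetical nested path
                "facet_filter": facet_filter,
            },
        },
    }


body = mentions_report_body("croatia OR wales", "2013-10-01", "2013-10-31")
print(json.dumps(body, indent=2))
```

A terms_stats facet reports values such as a count and a total per key, which is presumably where the per-action-type numbers listed on the slide come from.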

THE ENGAGOR SETUP

• Running ES for 2 years
• 1 billion social messages, sharded by client
• 20-node cluster
  • 24GB RAM, 12-18GB reserved for ES
• Main data source
  • Other storage systems in place mainly for backup
• Usage:
  • write heavy (indexing new data all the time, real time)
  • fewer reads (no need for micro-optimizing read caches, yet)
  • # updates on data depends on client use case
    • social care and/or pure analytics

3 lessons learned …

1/3: INDEXING SPEED

• Bulk Indexing is faster, obviously
  • Less network overhead
• With RabbitMQ
  • Handles peaks in data
  • Allows us to slow down throughput to ES while still consuming firehoses from our 3rd party services
• Bulk w/ Timeouts
  • (so Engagor users get their messages near-realtime)
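"Bulk w/ timeouts" can be sketched as a buffer that releases a batch when it is either full or has waited too long. This is a minimal sketch, not Engagor's code: the class name, thresholds and injectable clock are assumptions, and the RabbitMQ consumer and the HTTP call to `/_bulk` are left out. The body format (one action line, one source line, newline-terminated) is that of the bulk API.

```python
import json
import time


class BulkBuffer:
    """Collect documents and release them as one _bulk body when full or stale."""

    def __init__(self, max_docs=500, max_age_seconds=1.0, clock=time.monotonic):
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.lines = []
        self.count = 0
        self.oldest = None

    def add(self, index, doc_type, doc_id, source):
        """Queue one index action; return a bulk body if a flush is due, else None."""
        if self.oldest is None:
            self.oldest = self.clock()
        meta = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        self.lines.append(json.dumps(meta))
        self.lines.append(json.dumps(source))
        self.count += 1
        return self.flush() if self.due() else None

    def due(self):
        """True when the batch is full or its oldest document has waited too long."""
        if self.count == 0:
            return False
        too_old = self.clock() - self.oldest >= self.max_age_seconds
        return self.count >= self.max_docs or too_old

    def flush(self):
        """Return the newline-delimited body for POST /_bulk and reset the buffer."""
        if self.count == 0:
            return None
        body = "\n".join(self.lines) + "\n"  # the bulk API expects a trailing newline
        self.lines, self.count, self.oldest = [], 0, None
        return body


buffer = BulkBuffer(max_docs=2, max_age_seconds=5.0)
first = buffer.add("reddevils", "matches", "1", {"opponent": "Wales", "result": "1-1"})
second = buffer.add("reddevils", "matches", "2", {"opponent": "Croatia", "result": "1-2"})
print(first is None, second)
```

A consumer would also check `due()` on a timer and call `flush()` when it returns true, so that a quiet queue does not hold messages back longer than the timeout.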

2/3: CHOOSE SHARDING STRATEGY WISELY

• Plan # shards on expected growth, not on current set-up
• But, take care …
  • We have several shards per monitored topic (related to # customers and volume of data)
  • Biggest problem in our cluster right now is big # shards
  • Bugfixes in latest versions
• You can use “aliases” to create “virtual shards”/“windows on shards”
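An alias with a filter and a routing value is one way to get such a "window" onto a shared index. The sketch below builds the body for the `_aliases` endpoint; the index name `mentions`, the field `client_id` and the alias naming scheme are hypothetical.

```python
import json


def client_alias_action(index, client_id):
    """Alias that behaves like a per-client index on top of a shared one."""
    return {
        "add": {
            "index": index,
            "alias": "client_%s" % client_id,
            "filter": {"term": {"client_id": client_id}},  # only this client's documents
            "routing": str(client_id),                      # keep them on a single shard
        }
    }


body = {"actions": [client_alias_action("mentions", 42)]}
print(json.dumps(body, indent=2))
# Assumption: this body would be POSTed to /_aliases on the cluster.
```

Searching `client_42` would then apply the filter and only touch the shard the routing value maps to, without a dedicated index or shard per client.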

3/3: TRY TO KEEP UP WITH RELEASES

• ElasticSearch is a young product
• 0.90 releases
  • September 2013
  • August 2013
  • June 2013
  • May 2013
  • April 2013
• The 1.0 release is planned for early 2014.
• Updates help you
  • Great improvements in every release
  • Much-needed bugfixes in every release
• Bonus Tip: keep your JVM up to date

“filtering, free text search & analytics all in the same box”

“power of search and data-digging in the hands of your users”

flexible and powerful open source, distributed real-time search and analytics engine for the cloud

$ sudo bin/service/elasticsearch stop
