jurriaan-persyn
Slides for the “Introduction to ElasticSearch” talk given at Damn Data. In these slides I present an ElasticSearch use case that details some of the core functionality of the search & analytics platform. A blog post with more details on this subject is available here: http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
• Tagline: “Proven Search Capabilities”
• Free & Open Source
• Created in 1999
• Features:
  • Indexes & Analyzes Data
    • Tokenizing, Stemming, Filtering
  • Search Queries
    • Phrases, wildcards, proximity searches, ranges, fielded searches
  • Relevance Scoring, Field Sorting
LUCENE
• Tagline: “You know, for Search”
• Free & Open Source
• Created by Shay Banon @kimchy
• Versions
  • First public release, v0.4 in February 2010
  • A rewrite of the earlier “Compass” project, w/ scalability built in from the very core
  • Latest release 0.90.5
• In Java, so inherently cross-platform
ELASTICSEARCH
• Multiple servers (nodes) running in a cluster
  • Acts as single service (internal routing)
• Data is split into shards (# shards is configurable)
• Zero or more replicas
  • Replicas on different servers (server pools) for failover
  • Node in cluster goes down? Replica takes over.
• Self managing cluster
  • Automatic master detection + failover
  • Responsible for distribution/relocating shards
DISTRIBUTED & HIGHLY AVAILABLE
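The routing idea on this slide can be sketched in a few lines of Python. This is a toy model, not ElasticSearch code: the document id is hashed and taken modulo the number of shards, and each copy of a shard is placed on a different node. The CRC32 hash, the round-robin placement and the names `pick_shard` and `place_copies` are assumptions made for illustration only.

```python
import zlib
from typing import List


def pick_shard(routing_value: str, number_of_shards: int) -> int:
    """Map a routing value (by default the document id) to a shard number."""
    # Stand-in hash: CRC32 is used only for illustration, it is not the hash ES uses.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


def place_copies(shard: int, replicas: int, nodes: List[str]) -> List[str]:
    """Place the primary and every replica of a shard on different nodes."""
    if replicas + 1 > len(nodes):
        # Toy behaviour: a real cluster would leave the extra replicas unassigned.
        raise ValueError("not enough nodes to keep every copy on its own node")
    # Simple round-robin placement (an assumption, not the real allocation logic).
    return [nodes[(shard + i) % len(nodes)] for i in range(replicas + 1)]


nodes = ["node-a", "node-b", "node-c"]
shard = pick_shard("1", number_of_shards=5)
copies = place_copies(shard, replicas=1, nodes=nodes)
print(f"document 1 -> shard {shard}, stored on {copies}")
```

Because the shard is derived from the id and the shard count, the same document always lands on the same shard, which is why any node can route a request internally.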
$ cd ~/Downloads
$ wget https://download […] /elasticsearch-0.90.5.tar.gz
$ tar -xzf elasticsearch-0.90.5.tar.gz
$ cd elasticsearch-0.90.5/
$ ./bin/elasticsearch
$ curl -XPUT http://localhost:9200/reddevils/matches/1 -d '{"date": "2013-10-15T19:00:00Z", "opponent": "Wales", "result": "1-1"}'
{"ok":true,"_index":"reddevils","_type":"matches","_id":"1","_version":1}
$ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}'
{"ok":true,"_index":"reddevils","_type":"matches","_id":"2","_version":1}
$ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2", "girlfriend_attention_span": 30}'
{"ok":true,"_index":"reddevils","_type":"matches","_id":"2","_version":2}
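The responses above show that a second PUT to the same id replaces the whole document and bumps `_version`, and that a new field can be added without declaring a schema first. A toy in-memory model of that behaviour, assuming nothing about how ElasticSearch implements it (the `ToyIndex` class is hypothetical):

```python
class ToyIndex:
    """Tiny in-memory stand-in for one index, keyed by (type, id)."""

    def __init__(self, name):
        self.name = name
        self._docs = {}  # (doc_type, doc_id) -> (version, source)

    def put(self, doc_type, doc_id, source):
        # A PUT on an existing id replaces the document and increments its version.
        version, _ = self._docs.get((doc_type, doc_id), (0, None))
        version += 1
        self._docs[(doc_type, doc_id)] = (version, dict(source))
        return {"ok": True, "_index": self.name, "_type": doc_type,
                "_id": doc_id, "_version": version}

    def get(self, doc_type, doc_id):
        version, source = self._docs[(doc_type, doc_id)]
        return {"_version": version, "_source": source}


index = ToyIndex("reddevils")
match = {"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}
first = index.put("matches", "2", match)
second = index.put("matches", "2", dict(match, girlfriend_attention_span=30))
print(first["_version"], second["_version"])
```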
“Aha! A NoSQL store?!”
• Full Text Search
  • Search for “Croatia”
• Structured Search
  • Search for “All matches where outcome was ‘1-1’”
• Analytics
  • Search for “Average attention span of my girlfriend”
  • Incl. custom functions (scripts)
• … or a combination of those!
QUERY DSL
• Searching in your data set …
  • queries: full text search & relevance scoring
  • filters: exact matches
• Aggregating information from your data set …
  • facets:
    • Averages
    • Sums
    • Date histograms
    • …
QUERY DSL (CONT’D)
curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
  "query": {
    "query_string": {"query": "croatia"}
  }
}'
{ "took" : 18, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 0.40240064, "hits" : [ { "_index" : "reddevils", "_type" : "matches", "_id" : "2", "_score" : 0.40240064, "_source" : {"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"} }, { "_index" : "reddevils", "_type" : "matches", "_id" : "4", "_score" : 0.3125, "_source" : {"date": "2012-09-11T15:00:00Z", "opponent": "Croatia", "result": "1-1"} } ] }}
curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
  "query": {
    "constant_score": {
      "filter": {"term": {"result": "1-1"}}
    }
  }
}'
curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
  "size": 0,
  "facets": {
    "opponent": {
      "terms": {"field": "opponent"}
    }
  }
}'
{ … "facets" : { "opponent" : { "_type" : "terms", "missing" : 0, "total" : 10, "other" : 0, "terms" : [ { "term" : "wales", "count" : 2 }, { "term" : "serbia", "count" : 2 }, … { "term" : "croatia", "count" : 2 } ] …
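The three requests above each use one building block. A sketch of how they might be combined in a single request body, built here as a Python dict: a `filtered` query wraps the full text query and the exact-match filter, and a `statistical` facet reports the average of a numeric field. These names come from the 0.90-era query DSL and should be checked against the reference for the version in use; the helper `combined_search` is hypothetical.

```python
import json


def combined_search(text, result, stats_field):
    """Build one search body: full text query + exact filter + numeric facet."""
    return {
        "query": {
            "filtered": {
                # scored, full text part
                "query": {"query_string": {"query": text}},
                # unscored, exact-match part
                "filter": {"term": {"result": result}},
            }
        },
        "facets": {
            # statistical facet: count, total, min, max, mean, ... of one field
            "attention": {"statistical": {"field": stats_field}}
        },
    }


body = combined_search("croatia", "1-1", "girlfriend_attention_span")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be passed to curl with `-d` against the `_search` endpoint used in the earlier examples.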
• ElasticSearch provides 2 mechanisms
  • Parent/Child Documents
    • add links between documents by defining parent/child ids.
    • query example: “return children where parent matches x”
    • use case: linking “product” and “offer” documents.
    • query-time join
  • Nested Documents
    • use case: “actions” on a “mention” (Engagor)
    • denormalized in Lucene index
    • in Lucene index data is stored nearby
      • thus local join, thus very fast.
    • index-time join
DOCUMENT RELATIONS
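A sketch of what the two mechanisms could look like as mappings and queries, written as Python dicts. The type and field names (`product`, `offer`, `mention`, `actions`, `delay`) follow the slide's use cases but their exact shape is assumed, and the helper functions are hypothetical. The `_parent` mapping, the `parent` URL parameter, the `has_parent` query and the `nested` field type are taken from the 0.90-era API.

```python
# Parent/child: the child type declares its parent type in the mapping.
offer_mapping = {"offer": {"_parent": {"type": "product"}}}


def child_index_path(index, child_type, child_id, parent_id):
    """Path for indexing a child document; the parent id goes in the URL."""
    return f"/{index}/{child_type}/{child_id}?parent={parent_id}"


def children_where_parent_matches(parent_type, parent_query):
    """Query-time join: return children whose parent matches parent_query."""
    return {"query": {"has_parent": {"parent_type": parent_type,
                                     "query": parent_query}}}


# Nested: the actions live inside the mention and are joined at index time.
mention_mapping = {
    "mention": {
        "properties": {
            "publish_date": {"type": "date"},
            "actions": {
                "type": "nested",
                "properties": {
                    "type": {"type": "string", "index": "not_analyzed"},
                    "delay": {"type": "integer"},
                },
            },
        }
    }
}

path = child_index_path("shop", "offer", "42", parent_id="7")
query = children_where_parent_matches("product", {"term": {"brand": "acme"}})
print(path)
print(query)
```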
• range filter on publish_date
• query_string w/ (internal version of) user defined query string
• date_histogram facet on mention-document publish_date field
• terms_stats facet per action type on the “delay” field of the nested “action” document of the mention document
  • result contains:
    • number of mentions with action
    • number of actions
    • total delay of actions
• facet_filter per defined facet.
EXAMPLE EXPLAINED
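The pieces listed on this slide could be assembled roughly as follows. This is a reconstruction under assumptions, not the actual Engagor request: the field names `actions.type` and `actions.delay`, the facet names and the function `mention_report` are hypothetical, while `range`, `query_string`, `date_histogram`, `terms_stats`, `nested` and `facet_filter` are 0.90-era DSL names.

```python
def mention_report(user_query, start, end, interval="day", facet_filter=None):
    """Build a search body with a date range, a text query and two facets."""
    facets = {
        # number of mentions per time bucket
        "mentions_over_time": {
            "date_histogram": {"field": "publish_date", "interval": interval}
        },
        # per action type: counts and total/mean of the delay field
        "delay_per_action_type": {
            "terms_stats": {"key_field": "actions.type",
                            "value_field": "actions.delay"},
            "nested": "actions",  # run the facet on the nested action documents
        },
    }
    if facet_filter is not None:
        # optional extra restriction on the histogram facet (assumed usage)
        facets["mentions_over_time"]["facet_filter"] = facet_filter
    return {
        "size": 0,  # only the facets are needed, not the hits
        "query": {
            "filtered": {
                "query": {"query_string": {"query": user_query}},
                "filter": {"range": {"publish_date": {"gte": start, "lt": end}}},
            }
        },
        "facets": facets,
    }


report = mention_report("engagor OR elasticsearch", "2013-10-01", "2013-11-01",
                        facet_filter={"term": {"language": "en"}})
print(sorted(report["facets"]))
```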
• Running ES for 2 years
• 1 billion social messages, sharded by client
• 20-node cluster
  • 24 GB RAM, 12-18 GB reserved for ES
• Main data source
  • Other storage systems in place mainly for backup
• Usage:
  • write heavy (indexing new data all the time, in real time)
  • fewer reads (no need for micro-optimizing read caches, yet)
  • # updates on data depends on client use case
    • social care and/or pure analytics
THE ENGAGOR SETUP
3 lessons learned …
• Bulk Indexing is faster, obviously
  • Less network overhead
• With RabbitMQ
  • Handles peaks in data
  • Allows us to slow down throughput to ES while still consuming firehoses from our 3rd party services
• Bulk w/ Timeouts
  • (so Engagor users get their messages near-realtime)
1/3: INDEXING SPEED
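A minimal sketch of “bulk w/ timeouts”, assuming a consumer that receives one message at a time (for example from a queue). Documents are buffered and flushed either when the batch is full or when the oldest buffered document has waited too long, so quiet periods do not delay messages. `bulk_body` produces the newline-delimited format the `_bulk` endpoint expects; `BulkBuffer`, its parameters and the `flush` callback are hypothetical names, and the actual HTTP call is left to the caller.

```python
import json
import time


def bulk_body(index, doc_type, docs):
    """Serialize (id, source) pairs as a _bulk request body."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type,
                                           "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


class BulkBuffer:
    """Collect documents; flush when the batch is full or too old."""

    def __init__(self, flush, max_docs=500, max_age=1.0, clock=time.monotonic):
        self._flush = flush        # callback that receives a list of (id, source)
        self._max_docs = max_docs
        self._max_age = max_age    # seconds the oldest document may wait
        self._clock = clock
        self._docs = []
        self._oldest = None

    def add(self, doc_id, source):
        if not self._docs:
            self._oldest = self._clock()
        self._docs.append((doc_id, source))
        self.flush_if_due()

    def flush_if_due(self):
        """Call periodically, also when no new documents arrive."""
        if not self._docs:
            return False
        full = len(self._docs) >= self._max_docs
        stale = self._clock() - self._oldest >= self._max_age
        if full or stale:
            self._flush(list(self._docs))
            self._docs.clear()
            return True
        return False


batches = []
buffer = BulkBuffer(batches.append, max_docs=2, max_age=5.0)
buffer.add("1", {"opponent": "Wales"})
buffer.add("2", {"opponent": "Croatia"})  # batch is full, so it is flushed
print(bulk_body("reddevils", "matches", batches[0]))
```

Each flushed batch would be sent as one POST to `/_bulk`, which is where the reduced network overhead comes from.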
• Plan # shards on expected growth, not on current set-up
• But, take care …
  • We have several shards per monitored topic (related to # customers and volume of data)
  • Biggest problem in our cluster right now is the big # shards
  • Bugfixes in latest versions
• You can use “aliases” to create “virtual shards”/“windows on shards”
2/3: CHOOSE SHARDING STRATEGY WISELY
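One way to read the “aliases” tip: instead of giving every client its own index and shards, a filtered alias with a routing value gives each client a window on a shared index. A sketch of the body for the `_aliases` endpoint, built as a Python dict; the index name, the `client_id` field, the alias naming scheme and the function `client_alias_actions` are assumptions for illustration.

```python
import json


def client_alias_actions(index, client_id):
    """Build an _aliases request that adds a per-client window on an index."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"client-{client_id}",
                    # searches through the alias only see this client's documents
                    "filter": {"term": {"client_id": client_id}},
                    # documents and searches are routed to one shard per client
                    "routing": str(client_id),
                }
            }
        ]
    }


alias_request = client_alias_actions("mentions", 42)
print(json.dumps(alias_request, indent=2))
```

Applications can then index into and search `client-42` as if it were a separate index, while the number of physical shards stays fixed.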
• ElasticSearch is a young product
  • 0.90 releases
    • September 2013
    • August 2013
    • June 2013
    • May 2013
    • April 2013
  • The 1.0 release is planned for early 2014.
• Updates help you
  • Great improvements in every release
  • Much-needed bugfixes in every release
• Bonus Tip: keep your JVM up to date
3/3: TRY TO KEEP UP WITH RELEASES
“filtering, free text search & analytics all in the same box”
“power of search and data-digging in the hands of your users”
flexible and powerful open source, distributed real-time search and analytics engine for the cloud
$ sudo bin/service/elasticsearch stop