Author
oleksiy-panchenko
ELASTICSEARCH, LOGSTASH, KIBANA
COOL SEARCH, ANALYTICS, DATA MINING AND MORE…
OLEKSIY PANCHENKO / LOHIKA / 2015
MY NAME IS…
Oleksiy Panchenko
Software engineer, Lohika
E-mail: [email protected]: oleskiyp
LinkedIn: https://ua.linkedin.com/in/opanchenko
AGENDA
• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Elasticsearch ecosystem. ELK Stack + Demo
• Q & A
INTRODUCTION
WHAT IS IT ALL ABOUT?
HOW TO MAKE YOUR SITE SEARCHABLE?
http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
• Google search
• Why not use plain vanilla SQL? RDBMS rocks! select * from books join authors on … where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian
• Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …
WHO HAS EVER USED ELASTICSEARCH?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
LUCENE AS A CORE
• Lucene = low-level Java library (JAR) which implements search functionality
• Can be used in both web and standalone applications (desktop, mobile)
• Lucene stores its index as local binary files
• Implemented in Java, ports to other languages available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.2.1 (15 June 2015)
LUCENE AS A CORE
• Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch, currently Chief Architect at Cloudera) as a part of an open-source web search engine (Nutch)
http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
MORE ABOUT SEARCH ENGINES
Riak Search
TIME TO TALK ABOUT ELASTICSEARCH
https://www.elastic.co/products/elasticsearch
Near Real-Time Data (NRT)
Full-Text Search
Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete
https://www.elastic.co/products/elasticsearch
High Availability
Multitenancy
Distributed, Horizontally Scalable
https://www.elastic.co/products/elasticsearch
Document-Oriented
Schema-Free
Conflict Management
Optimistic Concurrency Control
https://www.elastic.co/products/elasticsearch
Apache 2 Open Source License
Awesome documentation
Large community
Developer-Friendly, RESTful API
Client libraries available for many programming languages and frameworks.
ELASTICSEARCH USERS
https://www.elastic.co/use-cases
https://en.wikipedia.org/wiki/Elasticsearch#Users
ELASTICSEARCH – PAST & PRESENT
• 2004. Shay Banon (aka Kimchy) started working on Compass – Java Search Engine on top of Lucene
• 2010. Initial release of Elasticsearch
• Latest stable release: 1.7.1 (July 29, 2015)
• 500K downloads per month
• https://github.com/elastic/elasticsearch
http://opensource.hk/sites/default/files/u1/shay-banon.jpg
ELASTICSEARCH AS A COMPANY
• 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– Found
JUMP START ELASTIC
DEMO TIME
INSTALLATION & CONFIGURATION
• Prerequisites:
– JDK 6 or above (recommended: JDK 8)
– RAM: min. 2 GB (recommended: 16–64 GB for production)
– CPU: number of cores matters more than clock rate
– Disks: SSD recommended
• Homebrew, apt, yum: apt-get install elasticsearch
• Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch
• Installation is straightforward: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html
LET’S TALK ABOUT TERMINOLOGY
Index ~ DB schema
Type ~ DB table
Document ~ Record, JSON object
Mapping ~ Schema definition in RDBMS
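To make the terminology concrete: in the 1.x REST API covered here, a single document is addressed by its index, type and ID. The sketch below only builds such a URL and a JSON body; the host, the "library" index, the "book" type and the field names are assumptions made up for illustration.

```python
import json


def document_url(index, doc_type, doc_id, host="http://localhost:9200"):
    """Build the REST path /{index}/{type}/{id} that addresses one document."""
    return f"{host}/{index}/{doc_type}/{doc_id}"


# Hypothetical names: a "library" index holding documents of type "book".
url = document_url("library", "book", 1)
body = json.dumps({"title": "Sample title", "year": 2015})

print("PUT", url)
print(body)
```

Read against the analogy above: "library" plays the role of the schema, "book" the table, and the JSON body the record.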
DEMO #1
http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg
http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
ARCHITECTURE AND DEPLOYMENT
WHY IS ELASTICSEARCH ELASTIC?
Cluster: one or more nodes which share the same cluster name
Node: a running instance of Elasticsearch which belongs to a cluster
Shard: a portion of data, a single Lucene instance. Default: 5 shards in an index
Primary shard: master copy of data
Replica shard: exact copy of a primary shard. Default: 1 replica
SINGLE-NODE CLUSTER
Shards: 0 1 2 3 4
Hash function*
{ "id": "123", "name": "john", … }
{ "id": "124", "name": "patricia", … }
{ "id": "125", "name": "scott", … }
* Also consider custom routing
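The hash function step can be sketched as follows. The idea is that a document's routing value (by default its ID) is hashed and taken modulo the number of primary shards. CRC32 is used below purely as a deterministic stand-in; it is an assumption for the example, not the hash Elasticsearch uses internally.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards=5):
    """Map a routing value (by default the document ID) to a shard number."""
    # Stand-in hash: CRC32 is NOT what Elasticsearch uses internally;
    # any deterministic hash illustrates the idea.
    digest = zlib.crc32(str(routing_value).encode("utf-8")) & 0xFFFFFFFF
    return digest % number_of_primary_shards


for doc_id in ("123", "124", "125"):
    print(doc_id, "-> shard", pick_shard(doc_id))
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing IDs to different shards, which is why that number is fixed when the index is created.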
TWO-NODE CLUSTER
Node 1: 0 1 R2 3 R4
Node 2: R0 R1 2 R3 4
* Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)
BENEFITS OF SHARDING
• Take advantage of multi-core CPUs (one shard is a single Lucene instance, and the shards on a node can be searched in parallel)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed on the fly; changing it requires full reindexing
• Max number of documents per shard: 2,147,483,519 (imposed by Lucene)
CUSTOM ROUTING
• Social network. Users, events
• event_id: 17567654, 17567655, 17567656, …
• user_id: 10300, 10301, …
• No Elasticsearch ID provided: ID will be auto-generated → events will be evenly distributed across the shards
• Obvious approach: Elasticsearch ID = event_id → events will be evenly distributed across the shards
• Routing value = user_id → events which belong to the same user will be stored in a single shard → no overhead, better performance
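A minimal sketch of the last option, assuming the event keeps event_id as its document ID while user_id is passed through the `routing` request parameter. The host and the "social" index and "event" type names are hypothetical.

```python
from urllib.parse import urlencode


def index_event_url(event_id, user_id, host="http://localhost:9200"):
    """URL for indexing one event, routed by the user it belongs to."""
    # "social" and "event" are hypothetical index and type names.
    query = urlencode({"routing": user_id})
    return f"{host}/social/event/{event_id}?{query}"


# Two events of the same user carry the same routing value,
# so they end up on the same shard.
print(index_event_url(17567654, 10300))
print(index_event_url(17567655, 10300))
```

A search for one user's events would pass the same routing value, so only the shard holding that user's data needs to be queried.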
ELASTICSEARCH NODE TYPES
• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• TCP ports: 9200 (ext), 9300 (int)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default): discovery.zen.ping.multicast.enabled
DEPLOYMENT DIAGRAM
INDEXING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
RETRIEVING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html
• In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either primary or replica shard
DISTRIBUTED SEARCH
• Given a search query, retrieve the 10 most relevant results
https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
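The merge step of a distributed search can be sketched like this: every shard returns its own best hits, and the coordinating node combines them into one global top-N list. This is a simplified illustration with made-up scores and document IDs, not the actual implementation.

```python
import heapq


def merge_top_hits(per_shard_hits, size=10):
    """Merge per-shard (score, doc_id) lists into a global top-`size` list."""
    all_hits = (hit for shard in per_shard_hits for hit in shard)
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[0])


# Made-up local results from two shards.
shard_0 = [(2.4, "a"), (1.1, "b")]
shard_1 = [(3.0, "c"), (0.7, "d")]

print(merge_top_hits([shard_0, shard_1], size=3))
# → [(3.0, 'c'), (2.4, 'a'), (1.1, 'b')]
```

Each shard has to supply its own top N because any of them might hold all of the globally best documents.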
CASE STUDIES
4 REAL-LIFE PROJECTS
http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
GENERAL INFO
• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data storage
• Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: gigabytes; millions of documents
• Back end: Java, Ruby
#1. SOCIAL INFLUENCER MARKETING PLATFORM
http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg
• Document types: Blog Posts, Bloggers (Influencers)
• Elasticsearch usage:
– search and rank Influencers by category, keywords, tags, location, audience, influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, DynamoDB, AWS
• Considered alternatives: Sphinx, Apache Solr
#2. JOB SITE
http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg
• Document types: Job Postings, Jobseekers
• Find relevant jobs
– Simple one-click search
– Advanced search (title, keywords, industry, location/distance, salary, requirements)
• Elasticsearch as a Recommendation Engine
Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc.
• 2 types of recommendations:
– Side banner (You also might be interested in…)
– E-mail subscriptions every 2 weeks
• Find appropriate candidates by location, requirements (experience, education, languages), salary expectations
• No fixed document structure (jobs from different providers)
• Full-text search
• Fuzzy search
• Geolocation (distance)
• Weighted search: boosted search clauses
• Dynamic scripting (MVEL until v1.4.0, then Groovy)
SEARCH QUERIES
SOME MORE FACTS
• Amount of data:
– Job postings: ~1M
– Applicants: ~20K
• Cluster size: 2 ‘medium’ EC2 instances
• Technology stack:
– Ruby on Rails
– Elasticsearch, PostgreSQL, Redis
– Heroku + add-ons, AWS (S3, EC2)
– Lots of 3rd party APIs and integrations
IMPLEMENTATION (RUBY)
• A Model is ActiveRecord (Ruby on Rails ORM)
• ActiveRecord can persist itself to the database
• ActiveRecord::Callbacks:
– after_commit on [:create, :update] { index_document }
– after_commit on [:destroy] { delete_document }
– after_create …
– after_save …
– after_destroy …
• Rake tasks to drop/recreate index, reindex documents
• Zero-downtime reindexing using aliases
• Ruby/Rails client: https://github.com/elastic/elasticsearch-rails
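Zero-downtime reindexing with aliases usually works like this: the application always queries an alias, a new index is built alongside the old one, and the alias is then switched in a single request to the `_aliases` endpoint. The sketch below only builds that request body, in Python rather than the project's Ruby; the alias and index names are hypothetical.

```python
import json


def swap_alias_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application searches "jobs"; "jobs_v2" was just rebuilt.
swap_body = swap_alias_actions("jobs", "jobs_v1", "jobs_v2")
print(json.dumps(swap_body, indent=2))
```

Both actions are sent in one request, so clients never see a moment when the alias points nowhere.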
LESSONS LEARNED
• On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku)
• Dynamic scripting
• PostgreSQL as a backup search engine sucks
#3. CAR TRADING
http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png
PARSING ADS
Price
$3900
1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
WAT???
• Fuzzy search (Levenshtein distance algorithm) used to parse ads and classify cars
• Elasticsearch index contains dictionary (Year, Make, Model, Trim)
• Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon → Volkswagen)
– Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
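To show what the edit-distance metric behind fuzzy search measures, here is a plain Levenshtein implementation applied to the misspelled make from the slide. It illustrates the metric only; it is not how the project or Elasticsearch runs fuzzy queries, and the small list of makes is made up.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_make(token, makes):
    """Pick the dictionary entry with the smallest edit distance to token."""
    return min(makes, key=lambda make: levenshtein(token.lower(), make.lower()))


makes = ["Volkswagen", "Volvo", "Chevrolet"]  # made-up dictionary excerpt
print(levenshtein("volkswagon", "volkswagen"))  # → 1
print(closest_make("volkswagon", makes))        # → Volkswagen
```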
#4. [NDA]
http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg
SOME UNCOVERED INFO
• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO experts)
– I have a dream that one day this nation will rise up and live…
– Normalization
I have a dream that one day this nation will rise up and live…
– Splitting a text into shingles (n-grams), n = 3..10
have dream that
dream that this
that this nation
this nation will
…
– Replacement: latin ‘c’ ↔ cyrillic ‘c’
• Custom or standard ES implementation of Shingle analysis
https://en.wikipedia.org/wiki/W-shingling
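The shingling step can be sketched in a few lines. The input is assumed to be already normalized into tokens, since the exact normalization rules are not given here. Comparing two texts by the Jaccard similarity of their shingle sets follows the w-shingling article linked above and is an assumption about how duplicates would be scored.

```python
def shingles(tokens, n=3):
    """All runs of n consecutive tokens, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def resemblance(shingles_a, shingles_b):
    """Jaccard similarity of two shingle collections (0.0 to 1.0)."""
    set_a, set_b = set(shingles_a), set(shingles_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Tokens as they appear after normalization on the slide.
tokens = "have dream that this nation will".split()
print(shingles(tokens))
# → ['have dream that', 'dream that this', 'that this nation', 'this nation will']
```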
QUERY API IN DEPTH
+ DEMO
FILTERS VS. QUERIES
As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values
Filters skip relevance scoring, so they are usually much faster than queries
Filters are usually great candidates for caching
27 filters available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
QUERIES VS. FILTERS
As a general rule, queries should be used instead of filters:
• for full-text search
• where the result depends on a relevance score
Common approach: filter out as many records as possible, then query the remainder.
38 queries available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
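The "filter first, then query" approach can be written with the `filtered` query of the 1.x query DSL that this talk covers: the filter narrows the candidate set without scoring, and the query ranks what remains. The field names and values below are hypothetical, and later Elasticsearch releases express the same idea with different syntax.

```python
import json


def filtered_search(text, city):
    """Request body: exact-value filter on city, scored full-text match on text."""
    # "city" and "description" are hypothetical field names.
    return {
        "query": {
            "filtered": {
                "filter": {"term": {"city": city}},         # yes/no, cacheable
                "query": {"match": {"description": text}},  # relevance-scored
            }
        }
    }


search_body = filtered_search("java developer", "kyiv")
print(json.dumps(search_body, indent=2))
```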
DEMO #2
http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg
SOME THEORY BEHIND RELEVANCE SCORING
full AND text AND search AND (elasticsearch OR lucene)
• Term Frequency: How often does the term appear in the document?
• Inverse Document Frequency: How often does the term appear in all documents in the collection?
• Field-length norm: How long is the field?
• TF, FLN etc. are calculated and stored at index time
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
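The three factors can be written down using the formulas given in the scoring-theory guide linked above (Lucene's classic TF/IDF similarity). The sketch computes each factor in isolation; the full scoring function combines them with further terms such as boosts and query normalization, which are left out here.

```python
import math


def term_frequency(freq):
    """tf = sqrt(number of times the term appears in the field)."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """idf = 1 + ln(numDocs / (docFreq + 1)); rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def field_length_norm(num_terms):
    """norm = 1 / sqrt(number of terms in the field); shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


print(term_frequency(4))                  # → 2.0
print(inverse_document_frequency(10, 9))  # → 1.0
print(field_length_norm(4))               # → 0.5
```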
MORE COOL FEATURES
• Indexing attachments: MS Office, ePub, PDF (Apache Tika)
• Autocomplete suggestion
• Did-you-mean suggestion
• Highlight results
SEARCH IMAGES
https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/
https://github.com/kzwang/elasticsearch-image
http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
ELASTICSEARCH ECOSYSTEM. ELK STACK
+ DEMO
CLIENTS
http://blog.euranova.eu/wp-content/uploads/2014/04/programming-languages.png
• Java: 1 native client + 1 community supported
• Python: 1 official + 7 community supported
• Ruby: 1 official + 7 community supported
• JavaScript: 1 official + 4
• PHP: 1 official + 4
• C#, .NET: 1 official + 2
• Scala: 4
• Groovy (1), Haskell (1), Perl (1), Clojure (1), Go (3), R (2), Erlang (3), OCaml (2), Smalltalk (1), ColdFusion (1), C++ (1)
• Command Line (2)
https://www.elastic.co/guide/en/elasticsearch/client/community/current/clients.html
INTEGRATIONS
• Django
• Ruby on Rails
• Spring, Spring Data
• Node.js
• Symfony, Drupal, WordPress
• Grails
• Play! Framework
https://www.elastic.co/guide/en/elasticsearch/client/community/current/integrations.html
FRONT ENDS
http://php.archive.razorflow.com/assets/img/header_v1.png
ELASTICSEARCH-HEAD
http://mobz.github.io/elasticsearch-head/
ESCLIENT
https://github.com/rdpatil4/ESClient
AVAILABLE FRONT ENDS
https://www.elastic.co/guide/en/elasticsearch/client/community/current/front-ends.html
• elasticsearch-head: A web front end for an Elasticsearch cluster
• browser: Web front end over Elasticsearch data
• Inquisitor: Front end to help debug/diagnose queries and analyzers
• Hammer: Web front end for Elasticsearch
• Calaca: Simple search client for Elasticsearch
• ESClient: Simple search, update, delete client for Elasticsearch
HEALTH AND PERFORMANCE
http://www.transcend-marketing.co.uk/wp-content/uploads/2014/09/health-check2.png
ELASTICSEARCH-HEAD
https://github.com/mobz/elasticsearch-head
BIGDESK
https://github.com/lukas-vlcek/bigdesk
WHATSON
https://github.com/xyu/elasticsearch-whatson
ELASTICOCEAN
https://itunes.apple.com/us/app/elasticocean/id955278030
HEALTH AND PERFORMANCE
https://www.elastic.co/guide/en/elasticsearch/client/community/current/health.html
• bigdesk: Live charts and statistics for an Elasticsearch cluster
• Kopf: Live cluster health and shard allocation monitoring with administration toolset
• paramedic: Live charts with cluster stats and indices/shards information
• ElasticsearchHQ: Free cluster health monitoring tool
• SPM for Elasticsearch: Performance monitoring with live charts showing cluster and node stats, integrated alerts, email reports, etc.
• check-es: Nagios/Shinken plugins for checking on Elasticsearch
• check_elasticsearch: An Elasticsearch availability and performance monitoring plugin for Nagios
• opsview-elasticsearch: Opsview plugin written in Perl for monitoring Elasticsearch
• SegmentSpy: Plugin to watch Lucene segment merges across your cluster
• es2graphite: Send cluster and indices stats and status to Graphite for monitoring and graphing
• Scout: Provides plugins for monitoring Elasticsearch nodes, clusters, and indices
• ElasticOcean: Elasticsearch & DigitalOcean iOS real-time monitoring tool to keep an eye on DigitalOcean Droplets, Elasticsearch instances, or both, on the go
10 ES METRICS TO WATCH
http://radar.oreilly.com/2015/04/10-elasticsearch-metrics-to-watch.html
1. Cluster health: nodes and shards
2. Node performance: CPU
3. Node performance: memory usage
4. Node performance: disk I/O
5. Java: heap usage and garbage collection
6. Java: JVM pool size
7. Search performance: request latency and request rate
8. Search performance: filter cache
9. Search performance: field data cache
10. Indexing performance: refresh times and merge times
RIVERS (DEPRECATED IN 1.5.0)
http://acuate.typepad.com/.a/6a0120a5e84a91970c01539381efff970b-pi
• JDBC River Plugin, CSV River Plugin
• MongoDB, CouchDB, Solr, Redis, Neo4j, DynamoDB, RethinkDB, Hazelcast, …
• JMS, RabbitMQ, ActiveMQ, Amazon SQS, Kafka, …
• Twitter, Wikipedia, Git, GitHub, Subversion, RSS, …
• FileSystem, Dropbox, Google Drive, Amazon S3, …
• IMAP/POP3, Web, LDAP
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#river
OTHER PLUGINS
https://d2wucpkmh57zie.cloudfront.net/wp-content/uploads/2015/04/plugins-together.jpg
• Internationalization, normalization, analysis, language support (Chinese, Japanese, Khmer, Thai etc.), transliteration etc.
• Discovery plugins: Amazon AWS, MS Azure, Google GCE, ZooKeeper
• Transport plugins: allow using the Elasticsearch REST API over Servlet, ZeroMQ, Jetty, Redis, Memcached
• Scripting in Elasticsearch queries: Groovy, JavaScript, Python, Clojure, SQL (!)
• Front ends (CRUD operations) & data visualization
• Snapshot/Restore repository: HDFS, AWS S3, GridFS
• Misc: attachments handling (uses Apache Tika), image support, tracking changes, Mock Solr, NewRelic integration, …
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html
ELASTICSEARCH PRODUCT PORTFOLIO
http://blog.archisnapper.com/wp-content/uploads/architecture-portfolio.jpg
FOUND ($)
• Elasticsearch as a service
• Starts from $45/mo (1 GB RAM, 8 GB SSD, 1 data center)
• No deployment and maintenance overhead
https://www.elastic.co/products/found
SHIELD ($)
• Authentication
• Authorization: RBAC
• Encrypted communication, IP filtering
• Audit logging
• Other approaches:
– Jetty instead of embedded server
– Nginx as a front end
https://www.elastic.co/products/shield
MARVEL ($)
• Elasticsearch cluster health check, monitoring, performance
• Real-time and historical analysis
• Customizable dashboards
https://www.elastic.co/products/marvel
WATCHER
• Alerts about anomalies in data
• Proactive monitoring of ES cluster (in conjunction with Marvel)
• Many notification channels: e-mail, SMS, webhooks
• Retrospective analysis
• High availability
https://www.elastic.co/products/watcher
ELK
https://pbs.twimg.com/media/CCAkRqVXIAA9cDE.png
LOGSTASH + ELASTIC + KIBANA
LOGSTASH ADVANCED
LOGSTASH
• Variety of inputs and outputs (165 plugins)
• 120 predefined patterns + custom log formats
• Flexible DSL to parse/normalize/enrich logs
• Implemented in Ruby, running on JRuby
https://www.elastic.co/products/logstash
SOME LOGSTASH INPUTS
https://www.elastic.co/guide/en/logstash/current/input-plugins.html
• file
• stdin
• syslog
• eventlog
• jdbc
• varnishlog
• websocket
• log4j
• jmx
• s3
• sqs
• rss
• redis
• rabbitmq
• zeromq
• kafka
• twitter
• elasticsearch
• github
• lumberjack
SOME LOGSTASH OUTPUTS
https://www.elastic.co/guide/en/logstash/current/output-plugins.html
• file
• stdout
• csv
• exec
• elasticsearch
• email
• nagios
• syslog
• redis
• loggly
• jira
• hipchat
• irc
• graphite
• http
• s3
• sqs
• sns
• rabbitmq
• zeromq
KIBANA
• Variety of charts: bar charts, line and scatter plots, histograms, pie charts, maps
• Flexible and customizable UI, responsive design
• Slice and dice data to get necessary details
• Seamless integration with Elasticsearch
• Simple data export
https://www.elastic.co/products/kibana
DEMO #3
http://25.media.tumblr.com/tumblr_mbduvkuspZ1qe6vsbo1_400.jpg
ELASTICSEARCH DRAWBACKS
• No transaction support. Elasticsearch is not a database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
PERFORMANCE?
http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
http://solr-vs-elasticsearch.com/
• Apache Solr can be faster than ES in search-only scenarios, while Elasticsearch usually outperforms Solr when doing writes and reads concurrently
• Sphinx is faster at indexing (up to 15 MB/s per core)
• Performance issues can usually be fixed by horizontal scaling
SUMMARY
• ES is not a silver bullet, but it is a really powerful tool
• Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple at the start but can get sophisticated as you go
• Kick off easily, then hire a good DevOps engineer for best results
• The ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your product and your CV ;)
http://www.aperfectworld.org/clipart/gestures/rockhard11.png
QUESTIONS?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
THANK YOU!
http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg
USEFUL LINKS
• Elasticsearch: https://www.elastic.co/products/elasticsearch
• Logstash: https://www.elastic.co/products/logstash
• Kibana: https://www.elastic.co/products/kibana
• Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK