Elastic{Search} Blueprint PyCon7 - Firenze - 2016-04-16 Speaker: Christian “Strap” Strappazzon


Page 1: Elastic{Search} Blueprint - pycon.it

Elastic{Search} Blueprint
PyCon7 - Firenze - 2016-04-16
Speaker: Christian “Strap” Strappazzon

Page 2: Elastic{Search} Blueprint - pycon.it

$ whoami

★ GS1 Italy - IT Specialist

★ Passionate programmer

★ From Codementor: “You’re not the dev every team needs, but you’re the dev every team deserves.”

★ Spends time reading technical books

★ Python Milano Organizer

★ BBQ Master

★ Dad, family addicted.

Page 3: Elastic{Search} Blueprint - pycon.it

Objective

Image from: http://ibaldi.blog.tiscali.it/lavori/

An overview of the “ELK-B Stack” with a focus on Elasticsearch and Python/Django integration.

Your homework: take some information from this presentation and then go deeper if you want to use these tools in your current/next projects.

Page 4: Elastic{Search} Blueprint - pycon.it

Why am I here?

● Google Site Search service was ending
○ we exceeded the allocated yearly query quota
○ service downgrade with ads
○ possible service suspension
● In the past we (they) used Solr, but the current hype was on Elasticsearch
● It was a good time to try a new tool
● Performance: Elasticsearch is fast
○ indexed ~350 web pages and ~150 PDFs in less than 4 minutes, index size ~55 MB
○ search comes in milliseconds and provide the limit for you
● Last but not least… the community voted for my talk - THANKS!!! - so I'll do my best! :-)

Let’s begin!

Image from: http://aragec.org/bip+bip.html

Page 5: Elastic{Search} Blueprint - pycon.it

Agenda

➢ The Open Source “ELK-B Stack” and commercial products

○ logstash, beats, kibana, sense, elasticsearch

➢ Python/Django Tools

○ haystack install/configure and some other related projects

➢ Final Thoughts

➢ Q & A

Page 6: Elastic{Search} Blueprint - pycon.it

The Open Source “ELK-B Stack” and commercial products

Page 7: Elastic{Search} Blueprint - pycon.it

Images from: http://elastic.co

Goodbye ELK-B(ee)

Say “Heya” to Elastic Stack and X-Pack

From Elastic{ON} 16

https://www.elastic.co/blog/heya-elastic-stack-and-x-pack

Page 8: Elastic{Search} Blueprint - pycon.it

Marvel - Monitoring

Watcher - Alerting

Shield - Security

Hadoop Connector

Sense - Console

Other plugins...

Graph, Reporting

Collect, parse and enrich data

Collect, parse and ship

Store, search, analyze

Visualize and explore data

Images from: http://elastic.co

Page 9: Elastic{Search} Blueprint - pycon.it

Logstash - Collect, Enrich and Transport

Logstash is a data pipeline that helps you process logs and other event data from a variety of systems. With 200 plugins and counting, Logstash can connect to a variety of sources and stream data at scale to a central analytics system.

Apache License 2.0

What is a log:

➔ a log is a record of activity by a system, application, etc.
➔ a timestamp and some data

What kind of problems it tries to solve:

➔ every application and device logs in its own special way
➔ each log can be analyzed separately
➔ searching across logs is difficult due to different formats
➔ logs are spread around your servers
➔ many servers and different kinds of logs
➔ ssh + grep don't scale
➔ an expert is required to read the logs

Image from: http://elastic.co

Page 10: Elastic{Search} Blueprint - pycon.it

Logstash - Install and Configure

➔ Install
◆ requires JVM 1.7+
◆ download and unzip
◆ prepare a logstash.conf config file
◆ run ./bin/logstash agent -f logstash.conf
➔ Configure
◆ create one or more config files
◆ grok is regex powered, with over 80 patterns plus custom patterns
◆ http://grokdebug.herokuapp.com

input {
  # Apache log, mail log, app log ...
}

filter {
  # Grok, GeoIP, Date ...
}

output {
  # Elasticsearch, Graphite ...
}
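Grok patterns are, at heart, named regular expressions. As a rough illustration in Python (this is not Logstash code), the sketch below pulls fields out of one Apache common-log line with named groups, which is roughly what a grok filter followed by a date filter would do. The pattern, the field names and the sample line are assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical, simplified pattern for the Apache "common" log format.
APACHE_COMMON = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = APACHE_COMMON.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Similar in spirit to a date filter: turn the text into a datetime.
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    event["response"] = int(event["response"])
    return event


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
event = parse_line(sample)
print(event["clientip"], event["verb"], event["response"])
```

Once every log line becomes a dict like this, logs with different formats can be searched with the same field names.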

https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html

Image from: http://elastic.co

Page 11: Elastic{Search} Blueprint - pycon.it

Beats - Collect, Parse and Ship

Beats is the platform for building lightweight, open source data shippers for many types of operational data you want to enrich with Logstash, search and analyze in Elasticsearch, and visualize in Kibana.

Written in Go, simple to deploy: download and install/unzip, configure the yaml file and run the daemon with sudo.

Apache License, Version 2.0

Types of beats:

➔ libbeat: for building more beats
➔ Packetbeat: tap into your wire data
➔ Topbeat: gather infrastructure metrics
➔ Filebeat: analyze log files in real time
➔ Winlogbeat: gather insight from Windows event logs
➔ {Future}beats: there's oh-so-much more to come

Image from: http://elastic.co

Page 12: Elastic{Search} Blueprint - pycon.it

Kibana - Visualize and Explore Data

➔ Flexible analytics and visualization platform
➔ Real-time summary and charting of streaming data
➔ Intuitive interface for a variety of users
➔ Instant sharing and embedding of dashboards
➔ Apache License, Version 2.0
➔ Easy to install:
◆ requires a modern browser
◆ download and unzip
◆ set elasticsearch.url to point to your ES instance
◆ run the binary

Image from: http://elastic.co

Image from: https://michael.bouvy.net

Page 13: Elastic{Search} Blueprint - pycon.it

Sense - Visually Interact with Elasticsearch REST APIs

Sense is a visual console that provides auto-complete, auto-indentation, and syntax checking, all through a Kibana plugin.

Some features:

➔ multiple requests
➔ auto formatting
➔ keyboard shortcuts
➔ history (500 requests)

Apache License, Version 2.0

Image from: http://elastic.co

Page 14: Elastic{Search} Blueprint - pycon.it

WARNING! A lot of information incoming...

Page 15: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Search, store and analyze

Elasticsearch is a search server based on Lucene.

It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.

Apache License, Version 2.0

Features

❖ Distributed, scalable and resilient
➢ designed for scale-out, high availability
❖ Developer friendly
➢ API first, schemaless, native JSON, client libraries for any language
❖ Real-time search & analytics
➢ real-time aggregations, geospatial, full-text search, querying structured and unstructured data

Image from: http://elastic.co

Page 16: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Node and Cluster

Node

➔ A running instance of elasticsearch (JVM process)

Cluster

➔ Multiple nodes working together

Image from: http://elastic.co

Node Types

Default node
➔ master eligible
➔ holds data
➔ indexing, aggregations, query…

Dedicated master node
➔ master eligible
➔ no data

Data node
➔ holds data
➔ indexing, aggregations, query...

Client node
➔ no data
➔ knows the state of the cluster
➔ routing

Page 17: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Index & Shards

Index

➔ An index is a lightweight container for data

Shard

➔ A single piece of an Elasticsearch index
➔ Indexes are partitioned into shards so they can be distributed across multiple nodes
➔ Each shard is a standalone Lucene index
➔ A shard is either a primary or a replica
➔ By default shards are copied for high availability
➔ Replica shards are always on different nodes from each other and their primary shard
➔ Searches may be performed against a primary or a replica
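How does a document end up in one particular shard? The Elasticsearch documentation describes the rule as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. Here is a minimal sketch of that idea; CRC32 is only a stand-in for illustration, since Elasticsearch uses its own hash function, and the function name is hypothetical.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Map a routing value (by default the document id) to a primary shard."""
    # Stand-in hash for this sketch: Elasticsearch does NOT use CRC32.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


for doc_id in ("1", "2", "42"):
    print(doc_id, "->", pick_shard(doc_id, 5))
```

The same id always lands on the same shard, and the result depends on the number of primary shards: with a different count, existing documents would map to different shards.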

Image from: http://elastic.co

Page 18: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Inverted Index

In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents (named in contrast to a Forward Index, which maps from documents to content)

from wikipedia https://en.wikipedia.org/wiki/Inverted_index

Image from: http://elastic.co

id | content
-----------------------------------------------------
1  | The quick brown fox jumped over the lazy dog
2  | Quick brown foxes leap over lazy dogs in summer
-----------------------------------------------------

Term      | Doc_1 | Doc_2
-------------------------
Quick     |       |  X
The       |   X   |
brown     |   X   |  X
dog       |   X   |
dogs      |       |  X
fox       |   X   |
foxes     |       |  X
in        |       |  X
jumped    |   X   |
lazy      |   X   |  X
leap      |       |  X
over      |   X   |  X
quick     |   X   |
summer    |       |  X
the       |   X   |
-------------------------
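A minimal Python sketch of how such a table can be built from the two sample documents. It uses a naive whitespace tokenizer and keeps the original case, which is why "Quick" and "quick" are separate terms; a real analyzer would normally lowercase and stem them. The function name is made up for this example.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenizer, case preserved
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(docs)
print(index["brown"])  # → [1, 2]
print(index["Quick"])  # → [2]
print(index["quick"])  # → [1]
```

A search for a term is then a dictionary lookup instead of a scan over every document.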

Page 19: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Let’s talk about Search! :-)

❖ Different types of search
➢ suggestions, synonyms, autocomplete, filters, aggregations
❖ Iterative process
➢ relevance tuning, accuracy, classification
❖ No downtime
➢ depends on the cluster

Image from: http://elastic.co

Page 20: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Mapping

When you insert a JSON document into ES, ES automatically creates a mapping by detecting the data types.

A mapping is composed of fields:

➔ each field requires a type
➔ a field type cannot be changed once added
➔ new fields can be added
➔ changing a field type requires re-indexing
➔ fields can have a boost

Field types: analyzed string, float, boolean, double, date, integer, not analyzed string, long, binary

Image from: http://elastic.co

{
  "myidx" : {
    "mappings" : {
      "meetup" : {
        "properties" : {
          "message" : { "type" : "string" },
          "post_date" : { "type" : "date", "format" : "dateOptionalTime" }
        }
      }
    }
  }
}
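To get a feeling for what "data detection" means, here is a toy Python sketch that guesses a field type for each value of a JSON document and prints a mapping-like structure. The rules, the single date format and the function names are simplifying assumptions for this example; they are not Elasticsearch's actual detection logic.

```python
import json
from datetime import datetime


def detect_type(value):
    """Guess a field type for a JSON value (toy rules for illustration)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            return "date"
        except ValueError:
            return "string"
    raise TypeError("unsupported value: %r" % (value,))


def guess_mapping(document):
    """Build a mapping-like dict with one guessed type per field."""
    return {
        "properties": {
            name: {"type": detect_type(value)}
            for name, value in document.items()
        }
    }


doc = json.loads(
    '{"message": "Hello PyCon", "post_date": "2016-04-16T10:00:00", "likes": 3}'
)
print(json.dumps(guess_mapping(doc), indent=2, sort_keys=True))
```

Because the first document decides the guessed types, it is usually safer to define the mapping yourself before indexing.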

Page 21: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Analyzing Text

❖ Tokenizer
➢ breaks up text into tokens
❖ Filters
➢ applied to tokens in sequence
❖ Analyzers
➢ associated with fields in the mapping, can be customized, applied at index and query time

Image from: http://elastic.co

...
"analyzer": {
  "italian": {
    "tokenizer": "standard",
    "filter": [
      "italian_elision",
      "lowercase",
      "italian_stop",
      "italian_keywords",
      "italian_stemmer"
    ]
  }
}
...
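The tokenizer-then-filters chain can be mimicked in a few lines of Python. This is only a toy model of the idea: the stopword list is a tiny made-up sample and there is no elision handling or stemming, so it is not a replacement for the italian analyzer.

```python
import re

STOPWORDS = {"il", "la", "di", "e", "un", "una"}  # tiny illustrative list


def tokenize(text):
    """Tokenizer: break the text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens found in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text, filters=(lowercase, remove_stopwords)):
    """Run the tokenizer, then apply each filter in sequence."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("La Torre di Pisa e il Duomo"))  # → ['torre', 'pisa', 'duomo']
```

Filter order matters: lowercasing runs first so that "La" matches the stopword "la". Running the same chain on the query text is what makes index-time and query-time terms comparable.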

Page 22: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Index Alias

➔ An alias is a view of one or more indexes
➔ Can be filtered
➔ Decouples the application from indexes
➔ Lightweight
➔ Used to re-index with no downtime, as an atomic operation

Image from: http://elastic.co

POST /_aliases
{
  "actions": [
    { "remove": { "index": "pyconit_v1", "alias": "pyconit" }},
    { "add": { "index": "pyconit_v2", "alias": "pyconit" }}
  ]
}
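From Python, the same request body can be generated by a small helper, so that every re-index reuses the same remove + add pair. A minimal sketch; the helper name is hypothetical, and the elasticsearch-py call in the final comment is an assumption about how you might send it, not something shown in this talk.

```python
import json


def swap_alias_actions(alias, old_index, new_index):
    """Build the POST /_aliases body that moves alias to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = swap_alias_actions("pyconit", "pyconit_v1", "pyconit_v2")
print(json.dumps(body, indent=2))
# With elasticsearch-py this body could be sent as, for example:
#   Elasticsearch().indices.update_aliases(body=body)
```

Both actions travel in one request, which is why the application, pointing only at the alias, never sees a moment without an index behind it.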

Page 23: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Querying the data

Elasticsearch provides a full Query DSL based on JSON to define queries.

Some options are:

➔ boost on fields at query time
➔ full-text and term queries
➔ score on results
➔ filter results
➔ aggregate

The documentation rocks! You’ll find everything you need. Trust me! :-)

Image from: http://elastic.co

{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "subject^3", "message" ]
  }
}

{
  "regexp": {
    "name.first": { "value": "s.*y", "boost": 1.2 }
  }
}
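Since the Query DSL is plain JSON, queries are easy to build as Python dicts. The sketch below generates the multi_match example with query-time boosts expressed as a dict; the helper name and its boosts parameter are made up for this example, and the resulting body is what you would pass to a search call.

```python
import json


def multi_match(text, boosts):
    """Build a multi_match query; boosts maps field name to boost (1 = none)."""
    fields = [
        name if boost == 1 else "%s^%s" % (name, boost)
        for name, boost in boosts.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = multi_match("this is a test", {"subject": 3, "message": 1})
print(json.dumps(body))
```

Here "subject^3" means a match in the subject field counts three times as much as a match in message when the score is computed.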

Page 24: Elastic{Search} Blueprint - pycon.it

Elasticsearch - Install and Run

➔ requires JVM 1.7+
➔ download and unzip
➔ run ./bin/elasticsearch
➔ edit the config file; some options can also be given on the command line

Pretty simple! :-)

Image from: http://elastic.co

Page 25: Elastic{Search} Blueprint - pycon.it

https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html

Page 26: Elastic{Search} Blueprint - pycon.it

OK, but... Let’s talk about Django… :-)

Page 27: Elastic{Search} Blueprint - pycon.it

Haystack - Modular search for Django

Haystack provides modular search for Django with an abstraction layer for different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.)

➔ It's a Django app
➔ The Elasticsearch backend depends on elasticsearch-py
➔ Provides signals, multiple routing, and a search query API similar to the Django ORM
➔ Documentation is lacking, but there is enough to get started
➔ You have to get your hands dirty if you want more
➔ Currently only supports ElasticSearch 1.x. ElasticSearch 2.x is not supported yet, if you would like to help, please see #1247.
➔ BSD License

Image from: http://haystacksearch.org/

Page 28: Elastic{Search} Blueprint - pycon.it

Haystack - Install and Settings

(env) $ pip install django-haystack

# add 'haystack' to INSTALLED_APPS
# add in settings.py HAYSTACK_CONNECTIONS which backend to use, e.g.:
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

Image from: http://haystacksearch.org/

Page 29: Elastic{Search} Blueprint - pycon.it

Haystack - Handling Data

(env) $ ./manage.py startapp search
(env) $ cd search
(env) $ touch search_indexes.py

# Edit with your editor of choice
# ... Vim or Emacs? Fight! # @raymondh

import datetime
from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        now = datetime.datetime.now()
        return self.get_model().objects.filter(pub_date__lte=now)

(env) $ ./manage.py rebuild_index

Image from: http://haystacksearch.org/

Page 30: Elastic{Search} Blueprint - pycon.it

Haystack - Setup Search View and URL

# inside urls.py
(r'^search/', include('haystack.urls')),

# Override the search/search.html default template

{# search.html #}
...
<form method="get" action=".">
    {{ form.as_p }}
    <p><input type="submit" value="Search"></p>
    ...
    {% for result in page.object_list %}
        <p>
            <a href="{{ result.object.get_absolute_url }}">{{ result.object.title }}</a>
        </p>
    {% empty %}
        <p>No results found.</p>
    {% endfor %}

Pay attention! Don't use result.object.something; use the fields stored on your index instead, e.g. result.title, because result.object.title hits the database!

Image from: http://haystacksearch.org/

Page 31: Elastic{Search} Blueprint - pycon.it

That’s it! OK… Let’s talk a bit about customizations...

Page 32: Elastic{Search} Blueprint - pycon.it

Haystack - Customization - The Hard Part

Custom Backend
https://github.com/bennylope/elasticstack
https://github.com/wingify/superelasticsearch
https://github.com/Jiydam/haystack-elasticsearch-raw-query
https://wellfire.co/learn/custom-haystack-elasticsearch-backend/
http://www.stamkracht.com/extending-haystacks-elasticsearch-backend/
http://stackoverflow.com/questions/27802628/search-for-multiple-words-elasticsearch-haystack
http://cstrap.blogspot.it/2015/06/dealing-with-elasticsearch-reindex-and.html

Attachment
https://gist.github.com/frague59/aab071f0bdce5b010ce4
http://cstrap.blogspot.it/2015/06/django-haystack-elasticsearch-index-pdf.html

I told you so… Here’s your homework... ;-)

Image from: http://haystacksearch.org/

Page 33: Elastic{Search} Blueprint - pycon.it

Final Thoughts

❖ Use haystack if you want to be up and running in (almost) no time
❖ Take some time to learn the Elasticsearch API
❖ Learn to use the elasticsearch-py client provided by Elastic
❖ Avoid hitting the database by preparing a good mapping
❖ Tuning takes time, not on the bare metal but on search contents
❖ Index aliases are your friends
❖ Good search needs good content
❖ You'll learn a lot about text processing
❖ Have Fun! :-)

Image from: http://www.focusonanimation.com/les-trois-courts-bip-bip-et-le-coyote-en-3d-6399/

Page 34: Elastic{Search} Blueprint - pycon.it

Links Summary

https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html
https://www.elastic.co/products/beats
https://www.elastic.co/guide/en/kibana/current/index.html
https://www.elastic.co/guide/en/sense/current/index.html
https://www.elastic.co/learn
https://www.elastic.co/use-cases/green-man-gaming
https://www.elastic.co/v5
https://info.elastic.co/cloud-enterprise.html
https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-is-broken.html
https://www.elastic.co/blog/changing-mapping-with-zero-downtime
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
http://haystacksearch.org/
http://django-haystack.readthedocs.org/en/latest/
https://github.com/elastic/elasticsearch-py
https://qbox.io/blog/series/elasticsearch-python-django-series

Join Us on Slack! :-) https://pythonmilano.herokuapp.com

Image from: http://xmastime.blogspot.it/

Page 35: Elastic{Search} Blueprint - pycon.it

Thanks!

Page 36: Elastic{Search} Blueprint - pycon.it

Answers?

Credits: Valentino Volonghi… some PyCon Italy ago…

Keep in touch! @cstrap on Twitter, Github, Bitbucket, LinkedIn