In this workshop we will learn how to build an Elasticsearch-powered Git search engine!
Before we dive into the components of this project and how to model Git data in Elasticsearch, let's take care of some housekeeping items like Elasticsearch concepts and environment setup.
One of the best ways to learn what Elasticsearch is, is to simply run it! To do this, we'll unpackage the workshop zip, extract it, and run Elasticsearch and Kibana.
Minimum Requirements
- Python 2.7+
- Java 1.8.0_20+
- at least 20% of free disk space
There is a percona-elastic-workshop zip containing Elasticsearch, Kibana, Beats and our application, Gitsearch.
Gitsearch Elastic Workshop
Getting Started
Workshop Checklist
Installing/Running Elasticsearch
1. Change into the workshop directory
cd percona-elastic-workshop
2. Change into the elasticsearch directory
cd elasticsearch
Once in the elasticsearch directory, you can simply run it with
# on Linux/Mac
./bin/elasticsearch
# on Windows
bin\elasticsearch.bat
To verify that Elasticsearch is running successfully, go ahead to your favorite web browser and visit http://localhost:9200
That's it! You now have Elasticsearch installed on your machine.
Installing/Running Kibana
1. Change into the workshop directory
cd percona-elastic-workshop
2. Change into the appropriate kibana directory for your platform (darwin = Mac)
cd kibana-version-platform
3. Run Kibana
./bin/kibana
That's it! Now you are running a Kibana instance that is communicating with your local Elasticsearch cluster.
Concepts
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster's indexing and search capabilities.
Index
An index is a collection of documents that have somewhat similar characteristics.
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you.
Document
A document is a basic unit of information that can be indexed. This document is expressed in JSON.
Shards & Replicas
Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. These shards allow for splitting/scaling your content volume horizontally and parallelizing operations. Each of these shards can be assigned replicas, which are copies of the shard. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).
Data Types
- text
- keyword
- date
- long
- float/scaled_float/half_float
- double
- integer
- boolean
- ip
- geo_point
- geo_shape
- completion
Using Elasticsearch
CRUD
Let's take a look at CRUD in Elasticsearch.
1. Indexing a document
PUT my_index/doc_type/1
{
  "username": "kimchy",
  "comment": "I like search!"
}
2. Retrieving an indexed document by ID
GET my_index/doc_type/1
3. Updating an indexed document
POST my_index/doc_type/1/_update
{
  "doc": {
    "comment": "I love search!!"
  }
}
4. Deleting a document
DELETE my_index/doc_type/1
Search
That is how easy it is to manipulate documents in Elasticsearch. Searching for these documents by keywords is just as easy!
GET my_index/_search
{
  "query": {
    "match": {
      "comment": "love"
    }
  }
}
Modeling Git Data To Elasticsearch
Now that we know the basics of talking to Elasticsearch, let's take a look at the original problem at hand. How do we represent our git commits in a way we can index them as documents into Elasticsearch?
We have five entities that relate to our dataset: authors, committers, commits, file-contents, and repositories.
We've got two separate indices we are interested in creating: one for commits, and one for the file contents.
Since we are only interested in searching across commits and file-contents, we can normalize the other entities into those two. We can attribute authors and committers to the commits themselves. The only repository information we have is the repository's name, so we can normalize that information into both the file-contents and commits.
Let's take a look at what that leaves us with!
Git Commits
Our commits have just a few attributes we are interested in:
author
The name of the author of the commit
authored_date
the time the author created the commit
committer
the name of the person who committed the change into the parent branch
committer_date
the time the committer committed the commit
parent_shas
the sha hash of the commit in the merged branch
description
the description written about the commit by the author
files
a list of the files that were changed in the commit
repo
the name of the repository that the data was committed into
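Put together, a single commit document shaped by those attributes might look like the sketch below; every value here (names, dates, sha, file paths) is made up for illustration:

```python
import json

# A hypothetical commit document matching the attributes above.
# All values are illustration data, not from a real repository.
commit_doc = {
    "author": {"name": "Jane Doe"},
    "authored_date": "2017-04-18T10:05:00",
    "committer": {"name": "John Smith"},
    "committer_date": "2017-04-18T10:07:00",
    "parent_shas": ["9fceb02d0ae598e95dc970b74767f19372d61af8"],
    "description": "Fix off-by-one error in pagination",
    "files": ["gitsearch/app.py", "gitsearch/queries.py"],
    "repo": "kafka",
}

print(json.dumps(commit_doc, indent=2))
```

Nesting the name under author (rather than a flat author_name field) is what lets us later map author.name as text with a keyword sub-field.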
File Contents
Since we will only be searching across the master branch's state of the files, we can simply attribute a repository name to each file.
Our files can look like:
content
The actual text content of the file
path
The path to this file in the repository
repo
The name of the repository that the file pertains to
Indexing Our Data
Mappings
Now that we have an idea of how to model our data, we can go ahead and create these two indices. To do so, we will want to specify some mappings.
Index/Type mappings represent the schemas of the data. You can think of this as an RDBMS table schema.
Creating Git Commits Index
PUT git
{
  "mappings": {
    "commits": {
      "properties": {
        "author": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "raw": {"type": "keyword"}
              }
            }
          }
        },
        "authored_date": {"type": "date"},
        "committer": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "raw": {"type": "keyword"}
              }
            }
          }
        },
        "committed_date": {"type": "date"},
        "parent_shas": {"type": "keyword"},
        "description": {"type": "text", "analyzer": "snowball"},
        "files": {
          "type": "text",
          "analyzer": "file_path",
          "fielddata": true,
          "fields": {
            "keyword": {"type": "keyword"}
          }
        },
        "repo": {"type": "keyword"}
      }
    }
  }
}
Creating Files Index
PUT files
{
  "mappings": {
    "files": {
      "properties": {
        "content": {
          "type": "text",
          "index_options": "offsets",
          "term_vector": "with_positions_offsets"
        },
        "path": {"type": "text", "analyzer": "file_path", "fielddata": true}
      }
    }
  }
}
Indexing using the Gitsearch script
I have already run this with the Kafka repository that was bundled in the zip, but feel free to clone another repository and run this against that. The application will work just as well with multiple repositories.
python load.py </path/to/git/repo> <branch_name>
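The internals of load.py are out of scope for this workshop, but conceptually it walks the repository's commits and shapes each one into a bulk index action. Here is a rough sketch of that idea; walk_commits and its output are made-up stand-ins, not the script's real code:

```python
def walk_commits(repo_path, branch):
    # Stand-in for real git traversal; yields one hypothetical commit dict.
    yield {
        "author": {"name": "Jane Doe"},
        "description": "Initial commit",
        "repo": "kafka",
    }

def build_bulk_actions(repo_path, branch):
    # Shape each commit into a bulk index action for the git/commits index.
    actions = []
    for commit in walk_commits(repo_path, branch):
        actions.append({"_index": "git", "_type": "commits", "_source": commit})
    return actions

actions = build_bulk_actions("/path/to/kafka", "trunk")
print(len(actions))  # 1
```

Batching actions like this and sending them through the bulk API is much faster than indexing commits one request at a time.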
Checking Index Information In Elasticsearch
Once we have the indices created and the data indexed, we can see what mappings and settings Elasticsearch actually assigned our indices:
GET git
GET files
Exploring Data In Kibana
Before we continue onto our Gitsearch application, let's take the time to explore our data and see how we can represent it in different ways.
After exploring Kibana a bit, we should now have an idea of how to create our main timeline page.
Bringing Searches and Aggregations into Gitsearch
Building Python Dependencies
pip install --user .
Building List Of Repositories
def get_repositories(self):
    query = {
        "aggs": {
            "top_repositories": {
                "terms": {
                    "field": "repo",
                    "size": 10,
                    "order": {"_count": "desc"}
                }
            }
        }
    }
    return self.es.search(index="git", doc_type="commits", body=query)
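To render a repository list in the UI, Gitsearch only needs the bucket keys out of the aggregation response. A minimal sketch of pulling them out follows; the response dict is hand-written in the standard terms-aggregation shape, with made-up repository counts:

```python
# Hand-written stand-in for a terms aggregation response:
# each bucket carries the term under "key" and its document count.
response = {
    "aggregations": {
        "top_repositories": {
            "buckets": [
                {"key": "kafka", "doc_count": 4423},
                {"key": "gitsearch", "doc_count": 112},
            ]
        }
    }
}

buckets = response["aggregations"]["top_repositories"]["buckets"]
repos = [bucket["key"] for bucket in buckets]
print(repos)  # ['kafka', 'gitsearch']
```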
Building List Of Contributors
def get_contributors(self):
    query = {
        "aggs": {
            "top_contributors": {
                "terms": {
                    "field": "author.name.raw",
                    "size": 10,
                    "order": {"_count": "desc"}
                }
            }
        }
    }
    return self.es.search(index="git", doc_type="commits", body=query)
Building The Timeline View
def search_newest_commits(self, q=""):
    match_all = {"match_all": {}}
    query_string = {
        "match": {
            "author.name.raw": {
                "query": q
            }
        }
    }
    query = {
        "query": {
            "function_score": {
                "query": query_string if q else match_all,
                "boost": 10,
                "functions": [
                    {
                        "linear": {
                            "committed_date": {
                                "scale": "7d",
                                "offset": "2d"
                            }
                        }
                    }
                ]
            }
        }
    }
    return self.es.search(index="git", doc_type="commits", body=query)
Building File Search
def search_code(self, q=""):
    query = {
        "query": {
            "match": {
                "content": q
            }
        },
        "highlight": {
            "fields": {
                "content": {
                    "type": "fvh",
                    "fragment_size": 500,
                    "number_of_fragments": 1
                }
            }
        }
    }
    return self.es.search(index="files", doc_type="files", body=query)
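On the results page we care about the file path and the highlighted fragment from each hit. Here is a small sketch of extracting both; the response is a hand-written stand-in in the usual hits/highlight shape, with an invented path and fragment:

```python
# Hand-written stand-in for a search response with highlighting enabled.
response = {
    "hits": {
        "hits": [
            {
                "_source": {"path": "core/src/main/Search.java", "repo": "kafka"},
                "highlight": {"content": ["found a <em>search</em> fragment"]},
            }
        ]
    }
}

for hit in response["hits"]["hits"]:
    path = hit["_source"]["path"]
    fragment = hit["highlight"]["content"][0]
    print(path, "->", fragment)
```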
That's it! With four queries, we can create a very powerful exploratory view of our git data.
To launch our application, you just need to run:
FLASK_APP=gitsearch python -m flask run
Monitoring Our Application
The stack we've developed up until now is a traditional search application. We have an opportunity to leverage our stack again for monitoring our data.
Our Gitsearch application is logging data into a file called gitsearch.log and its contents look like:
127.0.0.1 - - [20/Apr/2017 14:33:30] "GET /search?q=how+do+you+do HTTP/1.1" 200 -
This is very similar to how an Apache Web Server would log its data.
How can we parse this into Elasticsearch? We can use Ingest Node (https://www.elastic.co/guide/en/elasticsearch/reference/5.3/ingest.html) and Filebeat (https://www.elastic.co/products/beats/filebeat).
Ingest Node
You can use ingest node to pre-process documents before the actual indexing takes place. This pre-processing happens by an ingest node that intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs.
Creating an Ingest Pipeline
PUT _ingest/pipeline/gitsearch
{
  "description": "describe pipeline",
  "processors": [
    {
      "grok": {
        "field": "message",
        "ignore_failure": true,
        "pattern_definitions": {
          "MYDATE": "%{MONTHDAY}/%{MONTH}/%{YEAR} %{TIME}"
        },
        "patterns": [
          "%{IP:client_ip} %{USER:ident} %{USER:auth} \\[%{MYDATE:timestamp}\\] \"%{WORD:method} %{DATA:rawrequest} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": ["dd/MMM/YYYY HH:mm:ss"]
      }
    }
  ]
}
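As a sanity check, the grok pattern above can be mimicked with a plain regular expression against the sample log line. This is just a sketch to show which fields fall out of the line, not part of the workshop code:

```python
import re

# The sample line logged by the Gitsearch application.
log_line = '127.0.0.1 - - [20/Apr/2017 14:33:30] "GET /search?q=how+do+you+do HTTP/1.1" 200 -'

# Rough regex equivalent of the grok pattern in the pipeline above.
pattern = (
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<rawrequest>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<server_response>\d+)'
)

match = re.match(pattern, log_line)
print(match.groupdict())
```

Each named group corresponds to a field the grok processor will set on the document; the date processor then parses the extracted timestamp into @timestamp.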
Filebeat
Lightweight shipper for logs
Installing Filebeat
1. Change directory into the appropriate Filebeat
cd filebeat-version-platform
2. Open the default filebeat.yml configuration with a text-editor
3. Configure the path Filebeat will look into for logs
...
...
...
filebeat.prospectors:
# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.
- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /path/to/percona-elastic-workshop/gitsearch/*
...
...
...
...
#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["localhost:9200"]
  pipeline: "gitsearch"
...
...
...
1. Test the configuration
./filebeat -configtest
2. Start Filebeat
./filebeat