In this workshop we will learn how to build an Elasticsearch-powered Git search engine!
Before we dive into the components of this project and how to model Git data in Elasticsearch, let's take care of some housekeeping items like Elasticsearch concepts and environment setup.
One of the best ways to learn what Elasticsearch is, is to simply run it! To do this, we'll unpackage the workshop zip, extract it, and run Elasticsearch and Kibana.
Minimum Requirements
- Python 2.7+
- Java 1.8.0_20+
- at least 20% of free disk space
There is a percona-elastic-workshop zip containing Elasticsearch, Kibana, Beats and our application, Gitsearch.
Gitsearch Elastic Workshop
Getting Started
Workshop Checklist
Installing/Running Elasticsearch
1. Change into the workshop directory
cd percona-elastic-workshop
2. Change into the elasticsearch directory
cd elasticsearch
Once in the elasticsearch directory, you can simply run it with
# on Linux/Mac
./bin/elasticsearch
# on Windows
bin\elasticsearch.bat
To verify that Elasticsearch is running successfully, go ahead to your favorite web browser and visit http://localhost:9200
That's it! You now have Elasticsearch installed on your machine.
Installing/Running Kibana
1. Change into the workshop directory
cd percona-elastic-workshop
2. Change into the appropriate kibana directory for your platform (darwin = Mac)
cd kibana-version-platform
3. Run Kibana
./bin/kibana
That's it! Now you are running a Kibana instance that is communicating with your local Elasticsearch cluster.
Concepts
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster's indexing and search capabilities.
Index
An index is a collection of documents that have somewhat similar characteristics.
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you.
Document
A document is a basic unit of information that can be indexed. This document is expressed in JSON.
Shards & Replicas
Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. These shards allow for splitting/scaling your content volume horizontally and parallelizing operations. Each of these shards can be assigned replicas, which are copies of the shard. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).
Data Types
- text
- keyword
- date
- long
- float/scaled_float/half_float
- double
- integer
- boolean
- ip
- geo_point
- geo_shape
- completion
Using Elasticsearch
CRUD
Let's take a look at CRUD in Elasticsearch.
1. Indexing a document
PUT my_index/doc_type/1
{
  "username": "kimchy",
  "comment": "I like search!"
}
2. Retrieving an indexed document by ID
GET my_index/doc_type/1
3. Updating an indexed document
POST my_index/doc_type/1/_update
{
  "doc": {
    "comment": "I love search!!"
  }
}
4. Deleting a document
DELETE my_index/doc_type/1
Search
That is how easy it is to manipulate documents in Elasticsearch. Searching for these documents by keywords is just as easy!
GET my_index/_search
{
  "query": {
    "match": {
      "comment": "love"
    }
  }
}
Modeling Git Data To Elasticsearch
Now that we know the basics of talking to Elasticsearch, let's take a look at the original problem at hand. How do we represent our git commits in a way we can index them as documents into Elasticsearch?
We have five entities that relate to our dataset: authors, committers, commits, file-contents, and repositories.
We've got two separate indices we are interested in creating: one for commits, and one for the file contents.
Since we are only interested in searching across commits and file-contents, we can normalize the other entities into those two. We can attribute authors and committers to the commits themselves. The only repository information we have is the repository's name, so we can normalize that information into both the file-contents and commits.
Let's take a look at what that leaves us with!
Git Commits
Our commits have just a few attributes we are interested in:
author
The name of the author of the commit
authored_date
the time the author created the commit
committer
the name of the person who committed the change into the parent branch
committer_date
the time the committer committed the commit
parent_shas
the sha hash of the commit in the merged branch
description
the description written about the commit by the author
files
a list of the files that were changed in the commit
repo
the name of the repository that the data was committed into
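Put together, a single commit document shaped by those attributes might look like the sketch below; every value here (names, dates, sha, file paths) is made up for illustration:

```python
import json

# A hypothetical commit document matching the attributes above.
# All values are illustration data, not from a real repository.
commit_doc = {
    "author": {"name": "Jane Doe"},
    "authored_date": "2017-04-18T10:05:00",
    "committer": {"name": "John Smith"},
    "committer_date": "2017-04-18T10:07:00",
    "parent_shas": ["9fceb02d0ae598e95dc970b74767f19372d61af8"],
    "description": "Fix off-by-one error in pagination",
    "files": ["gitsearch/app.py", "gitsearch/queries.py"],
    "repo": "kafka",
}

print(json.dumps(commit_doc, indent=2))
```

Nesting the name under author (rather than a flat author_name field) is what lets us later map author.name as text with a keyword sub-field.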
File Contents
Since we will only be searching across the master branch's state of the files, we can simply attribute a repository name to each file.
Our files can look like:
content
The actual text content of the file
path
The path to this file in the repository
repo
The name of the repository that the file pertains to
Indexing Our Data
Mappings
Now that we have an idea of how to model our data, we can go ahead and create these two indices. To do so, we will want to specify some mappings.
Index/Type mappings represent the schemas of the data. You can think of this as an RDBMS table schema.
Creating Git Commits Index
PUT git
{
  "mappings": {
    "commits": {
      "properties": {
        "author": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "raw": {"type": "keyword"}
              }
            }
          }
        },
        "authored_date": {"type": "date"},
        "committer": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "raw": {"type": "keyword"}
              }
            }
          }
        },
        "committed_date": {"type": "date"},
        "parent_shas": {"type": "keyword"},
        "description": {"type": "text", "analyzer": "snowball"},
        "files": {
          "type": "text",
          "analyzer": "file_path",
          "fielddata": true,
          "fields": {
            "keyword": {"type": "keyword"}
          }
        },
        "repo": {"type": "keyword"}
      }
    }
  }
}
Creating Files Index
PUT files
{
  "mappings": {
    "files": {
      "properties": {
        "content": {
          "type": "text",
          "index_options": "offsets",
          "term_vector": "with_positions_offsets"
        },
        "path": {"type": "text", "analyzer": "file_path", "fielddata": true}
      }
    }
  }
}
Indexing using the Gitsearch script
I have already run this with the Kafka repository that was bundled in the zip, but feel free to clone another repository and run this against that. The application will work just as well with multiple repositories.
python load.py </path/to/git/repo> <branch_name>
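The internals of load.py are out of scope for this workshop, but conceptually it walks the repository's commits and shapes each one into a bulk index action. Here is a rough sketch of that idea; walk_commits and its output are made-up stand-ins, not the script's real code:

```python
def walk_commits(repo_path, branch):
    # Stand-in for real git traversal; yields one hypothetical commit dict.
    yield {
        "author": {"name": "Jane Doe"},
        "description": "Initial commit",
        "repo": "kafka",
    }

def build_bulk_actions(repo_path, branch):
    # Shape each commit into a bulk index action for the git/commits index.
    actions = []
    for commit in walk_commits(repo_path, branch):
        actions.append({"_index": "git", "_type": "commits", "_source": commit})
    return actions

actions = build_bulk_actions("/path/to/kafka", "trunk")
print(len(actions))  # 1
```

Batching actions like this and sending them through the bulk API is much faster than indexing commits one request at a time.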
Checking Index Information In Elasticsearch
Once we have the indices created and the data indexed, we can see what mappings and settings Elasticsearch actually assigned our indices:
GET git
GET files
Exploring Data In Kibana
Before we continue onto our Gitsearch application, let's take the time to explore our data and see how we can represent it in different ways.
After exploring Kibana a bit, we should now have an idea of how to create our main timeline page.
Bringing Searches and Aggregations into Gitsearch
Building Python Dependencies
pip install --user .
Building List Of Repositories
def get_repositories(self):
    query = {
        "aggs": {
            "top_repositories": {
                "terms": {
                    "field": "repo",
                    "size": 10,
                    "order": {"_count": "desc"}
                }
            }
        }
    }
    return self.es.search(index="git", doc_type="commits", body=query)
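To render a repository list in the UI, Gitsearch only needs the bucket keys out of the aggregation response. A minimal sketch of pulling them out follows; the response dict is hand-written in the standard terms-aggregation shape, with made-up repository counts:

```python
# Hand-written stand-in for a terms aggregation response:
# each bucket carries the term under "key" and its document count.
response = {
    "aggregations": {
        "top_repositories": {
            "buckets": [
                {"key": "kafka", "doc_count": 4423},
                {"key": "gitsearch", "doc_count": 112},
            ]
        }
    }
}

buckets = response["aggregations"]["top_repositories"]["buckets"]
repos = [bucket["key"] for bucket in buckets]
print(repos)  # ['kafka', 'gitsearch']
```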
Building List Of Contributors
def get_contributors(self):
    query = {
        "aggs": {
            "top_contributors": {
                "terms": {
                    "field": "author.name.raw",
                    "size": 10,
                    "order": {"_count": "desc"}
                }
            }
        }
    }
    return self.es.search(index="git", doc_type="commits", body=query)
Building The Timeline View
def search_newest_commits(self, q=""):
    match_all = {"match_all": {}}
    query_string = {
        "match": {
            "author.name.raw": {
                "query": q
            }
        }
    }
    query = {
        "query": {
            "function_score": {
                "query": query_string if q else match_all,
                "boost": 10,
                "functions": [
                    {
                        "linear": {
                            "committed_date": {
                                "scale": "7d",
                                "offset": "2d"
                            }
                        }
                    }
                ]
            }
        }
    }
    return self.es.search(index="git", doc_type="commits", body=query)
Building File Search
def search_code(self, q=""):
    query = {
        "query": {
            "match": {
                "content": q
            }
        },
        "highlight": {
            "fields": {
                "content": {
                    "type": "fvh",
                    "fragment_size": 500,
                    "number_of_fragments": 1
                }
            }
        }
    }
    return self.es.search(index="files", doc_type="files", body=query)
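On the results page we care about the file path and the highlighted fragment from each hit. Here is a small sketch of extracting both; the response is a hand-written stand-in in the usual hits/highlight shape, with an invented path and fragment:

```python
# Hand-written stand-in for a search response with highlighting enabled.
response = {
    "hits": {
        "hits": [
            {
                "_source": {"path": "core/src/main/Search.java", "repo": "kafka"},
                "highlight": {"content": ["found a <em>search</em> fragment"]},
            }
        ]
    }
}

for hit in response["hits"]["hits"]:
    path = hit["_source"]["path"]
    fragment = hit["highlight"]["content"][0]
    print(path, "->", fragment)
```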
That's it! With four queries, we can create a very powerful exploratory view of our git data.
To launch our application, you just need to run:
FLASK_APP=gitsearch python -m flask run
Monitoring Our Application
The stack we've developed up until now is a traditional search application. We have an opportunity to leverage our stack again for monitoring our data.
Our Gitsearch application is logging data into a file called gitsearch.log and its contents look like:
127.0.0.1 - - [20/Apr/2017 14:33:30] "GET /search?q=how+do+you+do HTTP/1.1" 200 -
This is very similar to how an Apache Web Server would log its data.
How can we parse this into Elasticsearch? We can use Ingest Node (https://www.elastic.co/guide/en/elasticsearch/reference/5.3/ingest.html) and Filebeat (https://www.elastic.co/products/beats/filebeat).
Ingest Node
You can use ingest node to pre-process documents before the actual indexing takes place. This pre-processing happens by an ingest node that intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs.
Creating an Ingest Pipeline
PUT _ingest/pipeline/gitsearch
{
  "description": "describe pipeline",
  "processors": [
    {
      "grok": {
        "field": "message",
        "ignore_failure": true,
        "pattern_definitions": {
          "MYDATE": "%{MONTHDAY}/%{MONTH}/%{YEAR} %{TIME}"
        },
        "patterns": [
          "%{IP:client_ip} %{USER:ident} %{USER:auth} \\[%{MYDATE:timestamp}\\] \"%{WORD:method} %{DATA:rawrequest} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": ["dd/MMM/YYYY HH:mm:ss"]
      }
    }
  ]
}
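As a sanity check, the grok pattern above can be mimicked with a plain regular expression against the sample log line. This is just a sketch to show which fields fall out of the line, not part of the workshop code:

```python
import re

# The sample line logged by the Gitsearch application.
log_line = '127.0.0.1 - - [20/Apr/2017 14:33:30] "GET /search?q=how+do+you+do HTTP/1.1" 200 -'

# Rough regex equivalent of the grok pattern in the pipeline above.
pattern = (
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<rawrequest>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<server_response>\d+)'
)

match = re.match(pattern, log_line)
print(match.groupdict())
```

Each named group corresponds to a field the grok processor will set on the document; the date processor then parses the extracted timestamp into @timestamp.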
Filebeat
Lightweight shipper for logs
Installing Filebeat
1. Change directory into the appropriate Filebeat
cd filebeat-version-platform
2. Open the default filebeat.yml configuration with a text-editor
3. Configure the path Filebeat will look into for logs
...
...
...
filebeat.prospectors:
# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.
- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /path/to/percona-elastic-workshop/gitsearch/*
...
...
...
...
#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["localhost:9200"]
  pipeline: "gitsearch"
...
...
...
1. Test the configuration
./filebeat -configtest
2. Start Filebeat
./filebeat