
Turbocharge your MySQL analytics with ElasticSearch

Guillaume Lefranc, Data & Infrastructure Architect, Productsup GmbH

Percona Live Europe 2017

About the Speaker

Guillaume Lefranc

• Data Architect at Productsup

• Replication Manager for MySQL and MariaDB - Lead Architect

• DBA Manager at MariaDB Corporation

• Infrastructure Consultant at Coinigy

• DB Architect at dailymotion.com

Takeaways

In this presentation, we will cover:

● How Elasticsearch works as a document and column store

● What its strengths and weaknesses are for analytics

● How to sync data with MySQL

● How to build aggregations

Case study

Disclaimer: This case study is about medium data, not big data

What is medium data?

Answer: from 100GB to a few TBs

Types of data:

● User activity (Clicks)

● Market Data

● Trips

Use Cases

Business cases:

● A ride-sharing app

Example dataset: NYC Taxi Data (6 months: 78 million trips)

● Cryptocurrency market data

200 million documents per month - Courtesy of coinigy.com

MySQL and Medium Data

● Medium data can scale well in MySQL (SELECT … WHERE id = ?)

● … but not with analytics

● Not every case can be covered by an index

● Aggregations can be slow, especially when they require table scans

Elasticsearch - What?

● “You know, for Search” -> Inverted Index

● Document Store

● REST API

○ POST /market_data -d '{ "market": "BTC/USD", "value": "3418.03953" }'

● JSON Native
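As a quick sketch of the REST API in action with curl (the host es1:9200 is the one used later in the Logstash output; the <doc_id> placeholder is hypothetical, since Elasticsearch auto-generates the ID on POST):

# Index a new document; Elasticsearch assigns an _id automatically
curl -XPOST 'http://es1:9200/market_data/market_raw' -d '
{ "market": "BTC/USD", "value": "3418.03953" }'

# Retrieve a document back by its _id
curl -XGET 'http://es1:9200/market_data/market_raw/<doc_id>'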

Elasticsearch - Anatomy of a Document

{ "_index" : "raw-2016-07", "_type" : "market_raw", "_id" : "108051174765", "_score" : 7.455347, "_source" : { "quantity" : 0.64130859, "time_local" : "2016-07-12 06:45:34", "type" : "SELL", "market" : "USD/BTC", "total" : 414.52263332, "@timestamp" : "2016-07-12T06:45:34.000Z", "price" : 646.37, "exchange" : "BITS", "id" : 108051174765, "tradeid" : "11649948" }

Elasticsearch - Why?

● Distributed

● Fault Tolerant

● Scales Horizontally

Elasticsearch - Column Store

● All fields are indexed by default

● A query is (almost) always an index scan

● Doc values

○ Field values serialized on disk

○ Not stored in the JVM heap: relies on the OS cache

○ Compression

○ By default: all numerics, geo_points, dates, IPs and not_analyzed strings (keywords)

● Not a general-purpose column-store replacement

● Not especially fast as a document DB
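To see where doc values appear in practice, here is a minimal mapping sketch (index, type, and field names are illustrative; doc_values is on by default for these types, so it only ever needs to be set explicitly to turn it off):

PUT /market_data
{
  "mappings": {
    "market_raw": {
      "properties": {
        "price":    { "type": "double" },
        "exchange": { "type": "keyword" },
        "comment":  { "type": "keyword", "doc_values": false }
      }
    }
  }
}

Disabling doc values on a field that is never sorted or aggregated on (like "comment" here) saves disk space.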

Bibliography:

● https://www.elastic.co/blog/elasticsearch-as-a-column-store

● https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

Alternatives (Open Source)

● MariaDB Columnstore

● Yandex Clickhouse

● Apache Spark

Bibliography:

https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/

MySQL Analytics: Performance

Fast if the query is covered by an index

Very slow if full table scans are needed

Example:

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
WHERE driver_id=102
GROUP BY date
ORDER BY pickup_datetime DESC;
-> returns in milliseconds

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
GROUP BY date
ORDER BY pickup_datetime DESC;
-> 4 minutes
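The difference comes down to index coverage. A hedged sketch of an index that would let the per-driver query be served from the index alone (the index name is an assumption; the full-table aggregation would still scan every row regardless):

-- Composite index covering the WHERE clause and the aggregated columns
ALTER TABLE trips ADD INDEX idx_driver_earnings (driver_id, pickup_datetime, total_amount);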

Importing data: Logstash

● Logstash is an open source data collection engine with real-time pipelining capabilities.

● The L in ELK Stack

● ETL for ElasticSearch

● Pipeline model (input -> filter -> output); a minimal runnable sketch follows
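A toy pipeline illustrating the three-stage model, using only stock Logstash plugins (this is a sketch, not the MySQL import, which follows on the next slides; the filename demo.conf is an assumption):

input  { stdin { } }

filter {
  # enrich every event with an extra field
  mutate { add_field => { "demo" => "true" } }
}

output { stdout { codec => json_lines } }

Run it with bin/logstash -f demo.conf; each line typed on stdin comes out as an enriched JSON event on stdout.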

Importing data - Input

input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
    jdbc_user => "root"
    jdbc_password => "admin"
    jdbc_fetch_size => 100000
    statement => "SELECT id, driver_id, passenger_id, pickup_datetime, dropoff_datetime,
                  CONCAT(pickup_latitude, ',', pickup_longitude) AS pickup_location,
                  CONCAT(dropoff_latitude, ',', dropoff_longitude) AS dropoff_location,
                  payment_type, total_amount FROM trips WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"
    #schedule => "*/5 * * * *"
  }
}

Importing data - Input

● Dealing with large result sets

jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
jdbc_fetch_size => 10000

● SQL Last Value (incremental sync)

use_column_value => true
tracking_column => "id"
statement => "SELECT … WHERE id > :sql_last_value"

● Scheduler

schedule => "*/5 * * * *"

Reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html

Importing Data - Filters

Filters (Input Transformation)

filter {
  mutate { convert => [ "pickup_datetime", "string" ] }
  date { match => [ "pickup_datetime", "ISO8601" ] }
}

● The date field is used for index partitioning; if there is no suitable field, Elasticsearch will use the current date and time

Importing Data - Output

output {
  elasticsearch {
    hosts => [ "es1:9200" ]
    user => "elastic"
    password => "elasticpassword"
    index => "taxi-%{+YYYY-MM}"
    document_type => "trips"
    document_id => "%{id}"
  }
}

Importing Data - Output

● Document partitioning

index => "taxi-%{+YYYY-MM}"

● Matching ID with MySQL

document_id => "%{id}"
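With the input, filter, and output stages above combined into a single config file, the import runs with (the filename is an assumption):

bin/logstash -f mysql-to-es.conf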

Schema Design - Indexes

● Document Partitioning

● ElasticSearch Types

● Number of Indices

● Number of Shards

● Replication Factor

Schema Design - Indexes

PUT /taxi
{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    }
  }
}

The same settings can be applied automatically to every taxi* index via an index template:

PUT _template/template1
{
  "template": "taxi*",
  "settings": {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    }
  }
}

Schema Design - Mapping

● Indexing with the optimal type

● Avoiding Full Text Search indexing (aka text or “analyzed”)

● Type overview

○ keyword

○ long

○ byte

○ date

○ geo_point

○ scaled_float

Schema Design - Mapping

{ "mappings": { "trips": { "properties": { "id": { "type": "long" }, "driver_id": { "type": "long" }, "passenger_id": { "type": "long" }, "pickup_datetime": { "type": "date" },

"dropoff_datetime": { "type": "date" }, "passenger_count": { "type": "byte" }, "pickup_location": { "type": "geo_point" }, "dropoff_location": { "type": "geo_point" }, "payment_type": { "type" : "byte" }, "total_amount": { "type": "scaled_float", "scaling_factor": 100 } } } }}

Query Design - Aggregation

High Level concepts:

● Buckets

○ driver name, passenger name, location

○ currency exchange (bitstamp, coinbase, etc)

○ Can be nested

● Metrics

○ count

○ pricing sum, avg, max, min, etc

In SQL terms: SELECT metric FROM index GROUP BY bucket

Building an aggregation, step by step

Goal: drivers’ average earnings per day, filtered to trips paid by credit card

● Search component: payment_type:1

● Bucket terms aggregation (driver name or id)

● Bucket date histogram aggregation (per day, month, specific interval, etc)

● Metric average aggregation

Query design - The search component

● Aggregations can be filtered by search

● Multiple indexes can be hit using wildcard patterns

GET /taxi-*/_search?q=payment_type:1

● Up to complex searches

GET /taxi-2016-01/_search?q=+pickup_datetime:2016-01-09 +total_amount:>60&size=10000

Query Design - The Terms bucket


{ "size": 0, <- do not return documents "aggs": { <- define aggregation "drivers": { <- canonical agg name "terms": { <- agg type "field": "driver_id" <- agg argument (and options) } } }}

Query Design - The date histogram bucket

"aggs": { "drivers": { "terms": { "field": "driver_id" }, "aggs": { "by_day": { "date_histogram": { "field": "pickup_datetime", "interval": "day" },

Query design - Metric aggregation

"aggs": { "by_day": { "date_histogram": { "field": "pickup_datetime", "interval": "day" }, "aggs": { "avg_earning": { "avg": { "field": "total_amount" } } } }

Query Results

Obviously too long to show here! General info comes first:

{ "took" : 1804, <- 1804ms execution time "timed_out" : false, "_shards" : { "total" : 30, <- number of shards parsed "successful" : 30, "failed" : 0 }, "hits" : { "total" : 46066859 <- number of parsed documents}

Query Results - Example aggregation result

{ "key" : "Shay Banon", "doc_count" : 5476, "by_day" : { "buckets" : [ { "key_as_string" : "2017-01-01 00:00:00", "doc_count" : 152, "avg_earning" : { "value" : 11.11

} },

Query filtering - Filtering by location

"filter": { "geo_polygon": { "pickup_location": {

"points": [ {"lat": -74.04, "lon": 40.56},

{"lat": -74.04, "lon": 40.74}, {"lat": -73.83, "lon": 40.74}, {"lat": -73.83, "lon": 40.57}

] } } }
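For context, a sketch of where such a filter sits in a full request, wrapped in a bool query (the index pattern comes from the earlier slides):

GET /taxi-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "geo_polygon": {
          "pickup_location": {
            "points": [
              { "lat": 40.56, "lon": -74.04 },
              { "lat": 40.74, "lon": -74.04 },
              { "lat": 40.74, "lon": -73.83 },
              { "lat": 40.57, "lon": -73.83 }
            ]
          }
        }
      }
    }
  }
}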

Going further - Bucket aggregations

Adjacency Matrix Aggregation, Children Aggregation, Date Histogram Aggregation, Date Range Aggregation, Diversified Sampler Aggregation, Filter Aggregation, Filters Aggregation, Geo Distance Aggregation, GeoHash Grid Aggregation, Global Aggregation, Histogram Aggregation, IP Range Aggregation, Missing Aggregation, Nested Aggregation, Range Aggregation, Reverse Nested Aggregation, Sampler Aggregation, Significant Terms Aggregation, Terms Aggregation

Going Further - Metric Aggregations

Avg Aggregation, Cardinality Aggregation, Extended Stats Aggregation, Geo Bounds Aggregation, Geo Centroid Aggregation, Max Aggregation, Min Aggregation, Percentiles Aggregation, Percentile Ranks Aggregation, Scripted Metric Aggregation, Stats Aggregation, Sum Aggregation, Top Hits Aggregation, Value Count Aggregation

Going Further - Kibana

Build cool dashboards with your aggregations

We’re Hiring!

● Frontend Developers

● Backend Developers

● Data Scientists
