Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud

Case Study: Elasticsearch Ingest @ Cisco Intercloud

Agenda

• Express Overview of StreamSets Data Collector

Kirit Basu, Product Management, StreamSets

• Introduction to Elastic

Catherine Johnson, Solutions Architect, Elastic

• Implementing Shipped Analytics Using StreamSets and Elasticsearch

Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group

Performance Management for Data Flows

© 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc.

History Founded by Informatica and Cloudera veterans.

Mission Bring operational excellence to managing data in motion.

Challenge Move data efficiently and with quality in the face of change.

Solution Open source software enabling performance management of data flows.

Use cases Hadoop Ingest, Search Ingest, Message Broker Enablement, Log Shipping, Cloud Migration, IoT, ...

Momentum Thousands of downloads, hundreds of companies using.

StreamSets At a Glance


StreamSets Data Collector

Adaptable Flows for Efficiency

Design ingest pipelines with minimal coding and maximum flexibility.

Data Flow KPIs for Control

Monitor and act on data flow performance and data quality.

Containerized Architecture for Agility

Operate continuously in the face of constant change.

Open source software for the rapid development and reliable operation of complex data flows.

Get Started with StreamSets

http://streamsets.com/opensource

https://github.com/streamsets/datacollector/

#streamsets

March 2016

Introduction to Elastic

Software that makes massive amounts of structured and unstructured data usable for search, logging, analytics, and more in mission-critical systems and applications

Examples: Elastic Stack Use Cases

Logging: IT Operations, Application Management, Security Analytics

Analytics: Marketing Insights, Business Development, Customer Sentiment

Search: Website Search, Internal/Intranet Search, URL Search

Internal Systems/Applications, External Systems/Applications

Developers, IT/Ops, Business Users

Elastic Solves Many Developer Use Cases

Social

Location

User-Activity

Machine(Log files)

Documents

Handles Complex & Diverse Data

Meets Today’s Core Developer Requirements

Developer requirements

Many users / use cases

Fast data processing

Large data volumes

Data quality & integrity

Cross-source insights

Solves Critical Use Cases

Application Search

Embedded Search

Logging

Security Analytics

Operational Analytics

More …

The Elastic Stack

Ingest

Store, Index, & Analyze

User Interface

Plugins Monitoring Security Alerting

Elastic Cloud: Hosted Elasticsearch

Thank you!

www.elastic.co


Implementing Shipped Analytics Using StreamSets and Elasticsearch

Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group

Tymofii Polekhin, Software Engineer

Agenda

MANTL & Shipped

Shipped Analytics for Shipped

Why do we need Shipped Analytics?

Architecture and Data Flow

StreamSets Pipelines

End-to-end data flow and performance with Elasticsearch

Benefits of StreamSets

Demo

Microservices managed and scaled separately

Microservices managed by Mesos in a single platform

Microservices architecture for Mesos frameworks and other components

[Diagram: infrastructure (CIS/AWS/Metastack/vSphere/UCS…) provisioned with Terraform onto nodes VM1/BM1 through VM5/BM5. On top: Spark Scheduler with Spark Executors 1..N, Kafka Scheduler with Kafka Brokers 1..N, Docker, Traefik, Microservices …, each exposed through a REST API. Provisioning paths: scripted provisioning, direct provisioning, policy and auto-scaling.]

Shipped Analytics Cluster

Probe

Probe

Probe

• Both Shipped and Shipped Analytics running on MANTL

• Shipped Analytics – infra and app logs and metrics analysis

[Diagram labels: mesos-master, mesos-slave, marathon, zookeeper, consul, syslog, frameworks; collectd: cpu, memory, interface, disk, df, load, docker, zookeeper, marathon, mesos-slave, mesos-master]

CollectD and Filebeat processes running on every node in the cluster.

Infrastructure Layer

Zookeeper Cluster Consul Cluster

Mesos Cluster

Marathon Framework

Kafka Cluster

topbeat filebeat

journalbeat dockerbeat

• Experimenting with Elastic Beats (unified arch., closer to micro-services model)

• Elastic Beats to replace collectd plugins and cAdvisor for containers

<file | top | *>beat collectd

logstash

DNS SRV beats.logstash.service.consul

Data normalization

Tagging

Cluster name decoration

Logstash runs as a single process per cluster, discoverable through the standard inter-cluster discovery mechanism. It receives metrics from collectd and logs from Filebeat on every slave, normalizes the data, and sends it to the desired output.

DNS SRV collectd.logstash.service.consul

NOTE: Logstash currently runs in a Docker container on every node; we will be moving to Filebeat and a Logstash Mesos framework soon.

logstash

Kafka 0.9.0.0 supports SSL authentication and data encryption for producers.

This is must-have security when sending data to an external destination over a WAN.
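As a rough illustration of the producer-side settings involved, the sketch below builds a Java-client style properties file. The property names are standard Kafka client settings; the broker address, file paths, and password are placeholders and do not come from the Shipped deployment.

```python
def ssl_producer_properties(bootstrap_servers, truststore, keystore, password):
    """Build Kafka producer properties for SSL encryption plus client authentication.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        # Encrypt traffic between the producer and the brokers.
        "security.protocol": "SSL",
        # Truststore: lets the producer verify the broker certificate.
        "ssl.truststore.location": truststore,
        "ssl.truststore.password": password,
        # Keystore: presents a client certificate so the broker can authenticate us.
        "ssl.keystore.location": keystore,
        "ssl.keystore.password": password,
        "ssl.key.password": password,
    }


def render_properties(props):
    """Render a dict as sorted key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


props = ssl_producer_properties(
    "kafka.example.internal:9093",       # hypothetical broker address
    "/path/to/client.truststore.jks",    # hypothetical path
    "/path/to/client.keystore.jks",      # hypothetical path
    "changeit",                          # placeholder password
)
print(render_properties(props))
```

Client authentication only takes effect when the brokers are configured to request client certificates; that broker-side setup is not shown here.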

Sending data to central SA cluster for long-term analytics

SSL encryption

WAN

kafka

SSL authentication

Shipped cluster

Shipped Analytics

StreamSets running in Mesos Spark cluster mode, processing data from multiple source Shipped clusters and storing it in an Elasticsearch cluster.

kafka

elasticsearch

StreamSets Spark Streaming Cluster

Spark Job

Master instance

Spark Job Spark Job Spark Job

Lambda Reference Architecture

Monitoring / Analytics Cluster (local, Texas-3)

Global Monitoring / Analytics Cluster (global, Texas-1)

Monitoring / Analytics Cluster (local, Ams.-1)

Monitoring / Analytics Cluster (local, Lon.-1)

Local components and deployment are the same as global, just smaller

Real-time and batch processing (Lambda), anomaly detection, visualization

SSL

Kafka

SSL

SSL

MQTT

Divide nodes by role for more stable cluster operation and ease of scalability

3 master/search nodes

5 live data nodes

3 archive data nodes

Shards=5 Replicas=4

Shards=5 Replicas=1

CPU=4, RAM=30GB, HDD=4TB



StreamSets pipelines process incoming messages and transform them according to business logic requirements: they normalize metrics, parse log lines, and extract important information using GROK filters or scripts.

Cluster Name Decorator

Fields Type Normalization

Metrics/Logs Stream Splitter

ES Logs Output

General GROK Filters

Float Value Truncate

ES Metrics Output

Shipped GROK Logic
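A grok filter is essentially a named regular expression applied to a log line. The sketch below shows the idea with Python's `re` module; the log layout, field names, and sample line are hypothetical and are not taken from the Shipped pipelines.

```python
import re

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <message>".
# A grok expression such as
#   %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
# plays the same role as the named groups below.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return the extracted fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_log_line("2016-03-01T12:00:00Z ERROR task failed to start")
print(fields)
```

Lines that do not match return `None`, which gives the caller a place to route them to error handling instead of indexing half-parsed records.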

Marathon

• StreamSets instances running in Docker containers in Marathon

o Easy deployment and scaling

o Fast upgrade to newer version

• Issues we faced with this approach:

o Containers were killed by Marathon

o Needed to re-import the pipeline every time we launched a container

Marathon

• Working with StreamSets to resolve the OOM issue, we increased container memory and SDC heap size

• At first, all looked normal and we thought that it was just starving for resources, but several days later SDC was killed again

• We increased MEM and HEAP even more, to 16G, but that bought us just another day or two before it was killed again

• It looked like the SDC heap was constantly filling with data that never went away, which eventually killed the container

• GC was also working hard, and sometimes we got freezes of up to 60 seconds

• Decided to move out of Docker

Marathon

• StreamSets reads JSON messages from the Kafka cluster and outputs them to the Elasticsearch cluster

o De-serializing and serializing JSON was very slow with a single-threaded process

o A performance test of consuming from Kafka showed:

JSON format: 5k records/sec avg
Text format: 50k records/sec avg
Binary format: 250k records/sec avg

• The StreamSets team was very proactive with this issue, and in 2 days we received a fix for multi-threaded JSON parsing

o New testing showed: JSON format: 66k records/sec avg
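For readers who want to reproduce this kind of comparison on their own data, here is a minimal single-threaded measurement sketch. It is not the test harness used in the talk: the synthetic messages and their field names are invented, and absolute numbers will differ from the SDC figures above.

```python
import json
import time


def records_per_second(decode, payloads):
    """Decode every payload once and return the observed records/sec."""
    start = time.perf_counter()
    for payload in payloads:
        decode(payload)
    elapsed = time.perf_counter() - start
    return len(payloads) / elapsed if elapsed > 0 else float("inf")


# Synthetic messages; the field names are invented for the example.
payloads = [
    json.dumps({"host": f"node-{i % 10}", "plugin": "cpu", "value": i * 0.5}).encode()
    for i in range(20000)
]

json_rate = records_per_second(json.loads, payloads)
text_rate = records_per_second(lambda raw: raw.decode("utf-8"), payloads)
print(f"JSON: {json_rate:,.0f} records/sec")
print(f"Text: {text_rate:,.0f} records/sec")
```

On the reported figures, the multi-threaded fix raised JSON throughput from 5k to 66k records/sec, a factor of about 13.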

Marathon

• StreamSets never failed because of internal logic bugs, but we kept seeing the oom-killer, and recovery was not automated

• We decided to leave Docker and run SDC natively on the host, still using Marathon for scaling and failover

• Without Docker, we can now upload our pipeline on SDC startup, and it starts working as soon as the instance has loaded

We can freely scale up/down whenever we need

We also got rid of the oom-killer issue

Each one of our 3 SDC instances has already processed ~3B messages, with no issues!

• The StreamSets pipeline consumes metrics gathered by collectd and logs gathered by Logstash from 4 different clusters (including itself), transforms and decorates them, and sends them to Elasticsearch for storage and analytics.

• First, we consume messages from a Kafka topic at an average of 5,000 messages per second. The consumer itself parses the JSON format and passes the records on.

• The next stage is a JavaScript script that decorates messages with the cluster name, based on the instance hostname in the message

• Then we exclude Marathon events from the stream, sending them directly to ES

• The next stage splits the stream into 2 parts: logs and metrics

• Metrics are sent straight to ES without any transformation

• Logs are the most interesting part:

o We pull Docker container logs out of the stream, delete the “time” field, which duplicates the timestamp, and send them to ES

o We separate logs from specific clusters, because we need to apply special logic to them

o Separation is done by mapping IPs to clusters in the pipeline in real time
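The stage order above can be sketched as a small routing function. The talk implements the decorator as a JavaScript stage inside the pipeline; Python is used here only as an illustration. The hostname prefixes, field names (`host`, `type`, `source`), and destination labels are all hypothetical.

```python
# Hypothetical hostname-prefix to cluster-name mapping.
CLUSTER_BY_PREFIX = {"tx3-": "texas-3", "ams1-": "amsterdam-1"}


def decorate(record):
    """Add a cluster name derived from the record's hostname."""
    host = record.get("host", "")
    cluster = next(
        (name for prefix, name in CLUSTER_BY_PREFIX.items() if host.startswith(prefix)),
        "unknown",
    )
    return {**record, "cluster": cluster}


def route(record):
    """Return (destination, record) following the stage order described above."""
    record = decorate(record)
    if record.get("type") == "marathon_event":
        return "es_events", record   # Marathon events bypass the splitter
    if record.get("type") == "metric":
        return "es_metrics", record  # metrics go to ES untransformed
    if record.get("source") == "docker":
        # Drop the "time" field, which duplicates the record timestamp.
        record = {k: v for k, v in record.items() if k != "time"}
    return "es_logs", record


destination, routed = route(
    {"host": "tx3-node1", "type": "log", "source": "docker", "time": "x", "message": "m"}
)
print(destination, routed)
```

Keeping decoration ahead of routing means every record carries its cluster name regardless of which output it reaches.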

• We collect data from several Mesos clusters and need to correlate container metrics with their logs

• Use appID, taskID and runID to identify a specific container’s logs

• Container logs themselves have all three of these, while mesos-master and mesos-agent logs lack runID

• All unidentified data is discarded
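A minimal sketch of this correlation step follows. It assumes a record counts as identified only when all three IDs are present; the talk does not say how records lacking runID are matched, so that rule is a simplification made for the example.

```python
REQUIRED_IDS = ("appID", "taskID", "runID")


def container_key(record):
    """Return the (appID, taskID, runID) tuple, or None if any ID is missing."""
    ids = tuple(record.get(name) for name in REQUIRED_IDS)
    return ids if all(ids) else None


def correlate(records):
    """Group metrics and logs per container; unidentified records are dropped."""
    grouped = {}
    for record in records:
        key = container_key(record)
        if key is None:
            continue  # unidentified data is discarded
        grouped.setdefault(key, []).append(record)
    return grouped


sample = [
    {"appID": "web", "taskID": "t1", "runID": "r1", "kind": "metric"},
    {"appID": "web", "taskID": "t1", "runID": "r1", "kind": "log"},
    {"appID": "web", "taskID": "t1", "kind": "log"},  # no runID
]
grouped = correlate(sample)
print(grouped)
```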

Current Shipped Analytics prod cluster configuration:

Kafka Cluster: 7 brokers with 4 CPU and 16GB RAM each
Logstash topic for all incoming messages with 7 partitions and 2 replicas

Current data flow is avg 5000 messages/sec to Kafka
Current data size is avg 1.2MB/sec to Kafka

StreamSets: 3 instances with identical pipeline configuration reading from the Kafka cluster
7 partitions are split between 3 instances as 3/2/2
All 3 instances run natively on the host (non-Docker) with Marathon
Marathon restarts a failed instance with automatic pipeline upload and start
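The 3/2/2 split is simply what an even distribution of 7 partitions over 3 consumers looks like. The helper below reproduces only that arithmetic; the actual assignment is made by Kafka's consumer group coordination, and `split_partitions` is a name invented for this sketch.

```python
def split_partitions(partition_count, consumer_count):
    """Spread partitions over consumers as evenly as possible."""
    base, extra = divmod(partition_count, consumer_count)
    # The first `extra` consumers take one additional partition.
    return [base + (1 if i < extra else 0) for i in range(consumer_count)]


print(split_partitions(7, 3))  # [3, 2, 2]
```

With 7 partitions, adding instances beyond 7 would leave the extra consumers idle, so the partition count caps the useful parallelism of this setup.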

Elasticsearch: 7 nodes with 4 CPU, 16GB RAM and 2TB storage each
Each metric is written to its own index, for a total of 15 indexes
Each index has 5 primary shards and 5 replica shards
Total Doc count: 17.5B
Total Doc size: 3.84TB
1 Day rate count: ~500M
1 Day rate size: ~120GB
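The figures on this slide can be cross-checked with a few lines of arithmetic. The inputs below are taken from the slide; the use of decimal gigabytes and the assumption that shards are spread evenly over the 7 nodes are choices made for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_sec = 5_000   # average Kafka ingest rate from the slide
megabytes_per_sec = 1.2    # average Kafka ingest size from the slide

docs_per_day = messages_per_sec * SECONDS_PER_DAY
gigabytes_per_day = megabytes_per_sec * SECONDS_PER_DAY / 1000  # decimal GB

indexes = 15
primaries_per_index = 5
replica_shards_per_index = 5
total_shards = indexes * (primaries_per_index + replica_shards_per_index)
shards_per_node = total_shards / 7  # assumes an even spread over 7 nodes

print(f"docs/day:    {docs_per_day:,}")
print(f"GB/day:      {gigabytes_per_day:.2f}")
print(f"shards:      {total_shards}")
print(f"shards/node: {shards_per_node:.1f}")
```

The averages work out to roughly 432M documents and 104GB per day entering Kafka, the same order of magnitude as the ~500M and ~120GB daily rates reported for Elasticsearch.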

StreamSets is a great product to work with, and the team is super helpful and works fast

• Lots of input and output connectors, huge processing capabilities
• Very intuitive and rich User Interface
• Easy to create pipelines visually, instead of writing code
• Clear data flow paths

• Small resource consumption compared to performance
• Can easily handle up to 10k records/sec to Elasticsearch with 1 CPU and 2GB RAM
• Simple configuration and deployment process
• Open source(!)
• Fast logic changes with minimum downtime
• Preview mode(!) – check every stage before throwing all your data at it
• Rich data transformation possibilities
• GROK filters – easy to migrate from Logstash
• Smart error handling
• Reliable: not once did StreamSets crash by itself; the only failures came from Docker, Marathon, and Mesos issues

Thank You!
