Case Study: Elasticsearch Ingest @ Cisco Intercloud
Agenda
• Express Overview of StreamSets Data Collector
Kirit Basu, Product Management, StreamSets
• Introduction to Elastic
Catherine Johnson, Solutions Architect, Elastic
• Implementing Shipped Analytics Using StreamSets and Elasticsearch
Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group
Performance Management for Data Flows
© 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc.
History Founded by Informatica and Cloudera veterans.
Mission Bring operational excellence to managing data in motion.
Challenge Move data efficiently and with quality in the face of change.
Solution Open source software enabling performance management of
data flows.
Use cases Hadoop Ingest, Search Ingest, Message Broker Enablement,
Log Shipping, Cloud Migration, IoT, ...
Momentum Thousands of downloads, hundreds of companies using.
StreamSets At a Glance
StreamSets Data Collector
Adaptable Flows for Efficiency
Design ingest pipelines with minimal coding and maximum flexibility.
Data Flow KPIs for Control
Monitor and act on data flow performance and data quality.
Containerized Architecture for Agility
Operate continuously in the face of constant change.
Open source software for the rapid development and reliable operation of complex data flows.
Get Started with StreamSets
http://streamsets.com/opensource
https://github.com/streamsets/datacollector/
#streamsets
March 2016
Introduction to Elastic
Software that makes massive amounts of structured and unstructured data usable for
search, logging, analytics, and more in mission critical systems and applications
Examples: Elastic Stack Use Cases
Logging: IT Operations, Application Management, Security Analytics
Analytics: Marketing Insights, Business Development, Customer Sentiment
Search: Website Search, Internal/Intranet Search, URL Search
Internal Systems/Applications, External Systems/Applications
Developers, IT/Ops, Business Users
Elastic Solves Many Developer Use Cases
Social
Location
User-Activity
Machine (Log files)
Documents
Handles Complex & Diverse Data
Meets Today’s Core Developer Requirements
Developer requirements
Many users / use cases
Fast data processing
Large data volumes
Data quality & integrity
Cross-source insights
Solves Critical Use Cases
Application Search
Embedded Search
Logging
Security Analytics
Operational Analytics
More …
The Elastic Stack
Ingest
Store, Index, & Analyze
User Interface
Plugins Monitoring Security Alerting
Elastic Cloud: Hosted Elasticsearch
Implementing Shipped Analytics Using StreamSets and Elasticsearch
Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group
Tymofii Polekhin, Software Engineer
Agenda
MANTL & Shipped
Shipped Analytics for Shipped
Why do we need Shipped Analytics?
Architecture and Data Flow
StreamSets Pipelines
End to end dataflow and performance with Elasticsearch
Benefits of StreamSets
Demo
Microservices managed and scaled separately
Microservices managed by Mesos in a single platform
Microservices architecture for Mesos frameworks and other components
CIS/AWS/Metastack/vSphere/UCS…
Terraform
Spark Executor N
Spark Executor 1
Spark Scheduler
Kafka Broker N
Kafka Broker 1
Kafka Scheduler
Docker, Docker, Traefik, Microservices …
REST API, REST API
Scripted provisioning
Direct provisioning
Policy, Auto-scaling
VM1 or BM1
VM2 or BM2
VM3 or BM3
VM4 or BM4
VM5 or BM5
Shipped Analytics Cluster
Probe
Probe
Probe
• Both Shipped and Shipped Analytics running on MANTL
• Shipped Analytics – infra and app logs and metrics analysis
mesos-master
mesos-slave
marathon
zookeeper
consul
syslog
frameworks
collectd
cpu
memory
interface
disk
df
load
docker
zookeeper
marathon
mesos-slave
mesos-master
CollectD and Filebeat processes running on every node in the cluster.
Infrastructure Layer
Zookeeper Cluster Consul Cluster
Mesos Cluster
Marathon Framework
Kafka Cluster
topbeat filebeat
journalbeat dockerbeat
• Experimenting with Elastic Beats (unified arch., closer to micro-services model)
• Elastic Beats to replace collectd plugins and cAdvisor for containers
<file | top | *>beat collectd
logstash
DNS SRV beats.logstash.service.consul
Data normalization
Tagging
Cluster name decoration
Logstash runs as a single process per cluster, discoverable through the standard inter-cluster discovery mechanism. It receives metrics from collectd and logs from Filebeat on every slave, normalizes the data, and sends it to the desired output.
DNS SRV collectd.logstash.service.consul
NOTE: Logstash currently runs in a Docker container on every node; we will move to a Filebeat and Logstash Mesos framework soon
logstash
Kafka 0.9.0.0 supports SSL authentication and data encryption for producers.
This is a must-have security feature when sending data to an external destination over a WAN.
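A producer-side setup for this could be sketched as below. The property names are the standard Kafka client SSL settings; the broker address, file paths, and passwords are placeholders and do not come from the Shipped deployment.

```python
def ssl_producer_properties(bootstrap, truststore, keystore, store_password, key_password):
    """Build Kafka producer properties for SSL encryption with client authentication.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "bootstrap.servers": bootstrap,
        # Encrypt traffic between producer and broker.
        "security.protocol": "SSL",
        # Truststore: lets the producer verify the broker certificate.
        "ssl.truststore.location": truststore,
        "ssl.truststore.password": store_password,
        # Keystore: lets the broker authenticate the producer.
        "ssl.keystore.location": keystore,
        "ssl.keystore.password": store_password,
        "ssl.key.password": key_password,
    }


def render_properties(props):
    """Render the settings in Java .properties format, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    example = ssl_producer_properties(
        "kafka.example.com:9093",          # hypothetical broker address
        "/etc/kafka/truststore.jks",       # hypothetical path
        "/etc/kafka/keystore.jks",         # hypothetical path
        "changeit",
        "changeit",
    )
    print(render_properties(example))
```

The broker side needs a matching SSL listener and, for client authentication, must be configured to require client certificates.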
Sending data to central SA cluster for long-term analytics
SSL encryption
WAN
kafka
SSL authentication
Shipped cluster
Shipped Analytics
StreamSets runs in Mesos/Spark cluster mode, processing data from multiple source Shipped clusters and storing it in an Elasticsearch cluster.
kafka
elasticsearch
Streamsets Spark Streaming Cluster
Spark Job
Master instance
Spark Job Spark Job Spark Job
Lambda Reference Architecture
Monitoring / Analytics Cluster (local, Texas-3)
Global Monitoring / Analytics Cluster (global, Texas-1)
Monitoring / Analytics Cluster (local, Ams. -1 )
Monitoring / Analytics Cluster (local, Lon.-1)
Local components and deployment are the same as global, just smaller
Real-time and batch processing (Lambda), anomaly detection, visualization
SSL
Kafka
SSL
SSL
MQTT
Divide nodes by role for more stable cluster operation and ease of scalability
3 master/search nodes
5 live data nodes
3 archive data nodes
master/search
master/search
master/search
live/ data
live/ data
live/ data
live/ data
live/ data
archive/data
archive/data
archive/data
Shards=5 Replicas=4 Shards=5 Replicas=1
archive/data
archive/data
CPU=4, RAM=30GB, HDD=4TB
CPU=4, RAM=30GB, HDD=4TB
CPU=4, RAM=30GB, HDD=4TB
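The shard settings on this slide can be turned into a quick count of shard copies. The sketch below assumes that "Shards=5 Replicas=4" applies to the five live data nodes and "Shards=5 Replicas=1" to the three archive nodes; the slide does not state this pairing explicitly.

```python
def shard_copies(primaries, replicas):
    """Total shard copies for one index: each primary plus its replicas."""
    return primaries * (1 + replicas)


def copies_per_node(primaries, replicas, nodes):
    """Average number of shard copies each data node holds for one index."""
    return shard_copies(primaries, replicas) / nodes


if __name__ == "__main__":
    # Assumed pairing of settings to node groups (not confirmed by the slide).
    print("live:", shard_copies(5, 4), "copies,", copies_per_node(5, 4, 5), "per node")
    print("archive:", shard_copies(5, 1), "copies,", copies_per_node(5, 1, 3), "per node")
```

Under that assumption, a live index with 4 replicas on 5 nodes places a copy of every shard on every live node, since Elasticsearch does not allocate two copies of the same shard to one node.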
StreamSets pipelines process incoming messages and transform them according to business logic requirements: normalizing metrics, parsing log lines, and extracting important information using GROK filters or scripts.
Cluster Name Decorator
Fields Type Normalization
Metrics/Logs Stream Splitter
ES Logs Output
General GROK Filters
Float Value Truncate
ES Metrics Output
Shipped GROK Logic
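A GROK filter is essentially a named regular expression that turns a log line into fields. The sketch below shows the idea with Python's `re` module, plus one possible reading of the "Float Value Truncate" stage. The log format, the field names, and the truncation rule are all assumptions for illustration, not the patterns used in the Shipped pipelines.

```python
import re

# Hypothetical log format: "<ISO timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return the named fields of a matching line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def truncate_float(value, digits=2):
    """Cut a float to a fixed number of decimal places without rounding up."""
    factor = 10 ** digits
    return int(value * factor) / factor


if __name__ == "__main__":
    print(parse_log_line("2016-03-01T12:00:00 ERROR disk full"))
    print(truncate_float(2.789, 1))
```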
Marathon
• StreamSets instances running in Docker containers on Marathon
o Easy deployment and scaling
o Fast upgrade to newer versions
• Issues we faced with this approach:
o Containers were killed by Marathon
o We needed to re-import the pipeline every time we launched a container
Marathon
• Working with StreamSets to resolve the OOM issue, we increased container memory and SDC heap size
• At first everything looked normal and we thought SDC had just been starved of resources, but several days later it was killed again
• We increased memory and heap even more, to 16G, but that bought us just another day or two before it was killed again
• It looked like the SDC heap was constantly filling with data that never went away, which eventually killed the container
• GC was also working hard, and sometimes we got freezes of up to 60 seconds
• We decided to move away from Docker
Marathon
• StreamSets reads JSON messages from the Kafka cluster and outputs them to the Elasticsearch cluster
o De-serializing and serializing JSON was very slow with a single-threaded process
o A Kafka consumer performance test showed:
JSON format: 5k records/sec avg
Text format: 50k records/sec avg
Binary format: 250k records/sec avg
• The StreamSets team was very proactive with this issue, and within 2 days we received a fix for multi-threaded JSON parsing
o New testing showed:
JSON format: 66k records/sec avg
Marathon
• StreamSets has never failed because of internal logic bugs, but we kept seeing the oom-killer pop up, and recovery was not automated
• We decided to leave Docker and run SDC natively on the host, still using Marathon for scaling and failover
• Without Docker, we can now upload our pipeline on SDC startup, and it starts working as soon as the instance has loaded
We can freely scale up and down whenever we need
We also got rid of the oom-killer issue
Each of our 3 SDC instances has already processed ~3B messages with no issues!
• The StreamSets pipeline consumes metrics gathered by collectd and logs gathered by Logstash from 4 different clusters (including itself), transforms and decorates them, and sends them to Elasticsearch for storage and analytics.
• First, we consume messages from a Kafka topic at an average of 5,000 messages per second. The consumer itself parses the JSON and passes the records downstream.
• The next stage is a JavaScript script that decorates each message with a cluster name, based on the instance hostname in that message
• We then exclude Marathon events from the stream, sending them directly to ES
• The next stage splits the stream into 2 parts: logs and metrics
• Metrics are sent straight to ES without any transformation
• Logs are the most interesting part:
o We pull Docker container logs out of the stream, delete the “time” field (a duplicate timestamp), and send them to ES
o We separate logs from specific clusters, because we need to apply special logic to them
o Separation is done by mapping IPs to clusters in the pipeline in real time
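The stages above can be sketched as two small functions: one that decorates a record with a cluster name and one that routes it. The real pipeline does this with StreamSets stages and a JavaScript script; the Python below only illustrates the logic. The field names (`host`, `source`, `type`, `cluster`), the address prefixes, and the destination names are hypothetical.

```python
# Hypothetical address-prefix-to-cluster table; the real mapping is not shown in the slides.
IP_TO_CLUSTER = {
    "10.0.1.": "cluster-a",
    "10.0.2.": "cluster-b",
}


def decorate(record):
    """Return a copy of the record with a cluster name derived from its host address."""
    host = record.get("host", "")
    for prefix, cluster in IP_TO_CLUSTER.items():
        if host.startswith(prefix):
            return {**record, "cluster": cluster}
    return {**record, "cluster": "unknown"}


def route(record):
    """Return (destination, record) for one decorated record."""
    # Marathon events bypass the rest of the pipeline.
    if record.get("source") == "marathon":
        return "es-events", record
    # Metrics go to Elasticsearch without any transformation.
    if record.get("type") == "metric":
        return "es-metrics", record
    # Docker container logs carry a duplicate timestamp in "time"; drop it.
    if record.get("source") == "docker":
        record = {key: value for key, value in record.items() if key != "time"}
    return "es-logs", record


if __name__ == "__main__":
    sample = {"host": "10.0.1.3", "source": "docker", "type": "log",
              "time": "2016-03-01T12:00:00", "message": "started"}
    print(route(decorate(sample)))
```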
• We collect data from several Mesos clusters and need to correlate container metrics with their logs
• We use appID, taskID and runID to identify a specific container's logs
• Container logs themselves carry all three, while mesos-master and mesos-agent logs lack runID
• All unidentified data is discarded
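The identification rule can be sketched as a grouping step keyed on the three IDs. The sketch assumes the IDs arrive as record fields named `appID`, `taskID` and `runID`, and it treats any record missing one of them as unidentified; how the real pipeline handles records that lack only runID is not described in the slides.

```python
REQUIRED_IDS = ("appID", "taskID", "runID")


def container_key(record):
    """Return (appID, taskID, runID), or None when any of the three is missing."""
    ids = tuple(record.get(name) for name in REQUIRED_IDS)
    return ids if all(ids) else None


def correlate(records):
    """Group identified records per container and drop unidentified ones."""
    grouped = {}
    for record in records:
        key = container_key(record)
        if key is None:
            continue  # unidentified data is discarded
        grouped.setdefault(key, []).append(record)
    return grouped


if __name__ == "__main__":
    records = [
        {"appID": "web", "taskID": "t1", "runID": "r1", "message": "started"},
        {"appID": "web", "taskID": "t1", "runID": "r1", "cpu": 0.4},
        {"appID": "web", "taskID": "t1", "message": "no runID here"},
    ]
    for key, items in correlate(records).items():
        print(key, len(items))
```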
Current Shipped Analytics prod cluster configuration:
Kafka Cluster: 7 brokers with 4 CPU and 16GB RAM each
Logstash topic for all incoming messages, with 7 partitions and 2 replicas
Current data flow averages 5,000 messages/sec to Kafka
Current data size averages 1.2MB/sec to Kafka
StreamSets: 3 instances with identical pipeline configuration reading from the Kafka cluster
7 partitions are split between the 3 instances as 3/2/2
All 3 instances run natively on the host (non-Docker) under Marathon
Marathon restarts a failed instance with automatic pipeline upload and start
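The 3/2/2 split is what an even division of 7 partitions over 3 consumers produces. The helper below is a minimal illustration of that arithmetic, not a reproduction of Kafka's own partition assignment code.

```python
def split_partitions(num_partitions, num_consumers):
    """Spread partitions as evenly as possible; earlier consumers get the remainder."""
    base, extra = divmod(num_partitions, num_consumers)
    return [base + 1 if index < extra else base for index in range(num_consumers)]


if __name__ == "__main__":
    print(split_partitions(7, 3))  # [3, 2, 2]
```

Because a partition is read by only one consumer in a group, adding more than 7 instances to this topic would leave the extra instances idle.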
Elasticsearch: 7 nodes with 4 CPU, 16GB RAM and 2TB storage each
Each metric is written to its own index, 15 indexes in total
Each index has 5 primary shards and 5 replica shards
Total doc count: 17.5B    Total doc size: 3.84TB
1-day rate count: ~500M    1-day rate size: ~120GB
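As a sanity check, the per-second averages can be scaled to a day and compared with the daily figures. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the slides do not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_day(rate_per_sec):
    """Scale a per-second average to a full day."""
    return rate_per_sec * SECONDS_PER_DAY


messages_per_day = per_day(5_000)      # 432,000,000 messages
megabytes_per_day = per_day(1.2)       # about 103,680 MB, roughly 104 GB
avg_message_bytes = 1.2e6 / 5_000      # 240 bytes per Kafka message
avg_doc_bytes = 3.84e12 / 17.5e9       # about 219 bytes per stored document

if __name__ == "__main__":
    print(messages_per_day, megabytes_per_day, avg_message_bytes, avg_doc_bytes)
```

The averages imply about 432M messages and 104GB per day, in the same range as the reported ~500M and ~120GB daily rates.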
StreamSets is a great product to work with, and the team is super helpful and works fast
• Lots of input and output connectors, huge processing capabilities
• Very intuitive and rich User Interface
• Easy to create pipelines visually, instead of writing code
• Clear data flow paths
• Small resource consumption compared to performance
• Can easily handle up to 10k records/sec to Elasticsearch with 1 CPU and 2GB RAM
• Simple configuration and deployment process
• Open source(!)
• Fast logic changes with minimum downtime
• Preview mode(!) – check every stage before throwing all your data at it
• Rich data transformation possibilities
• GROK filters – easy to migrate from Logstash
• Smart error handling
• Reliable: not once did StreamSets crash by itself – only Docker, Marathon, and Mesos issues
Thank You!