Log management isn’t easy to do at scale. We designed Loggly Gen2 using the latest social-media-scale technologies—including ElasticSearch, Kafka from LinkedIn, and Apache Storm—as the backbone of ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. Since we launched Gen2, we’ve learned a lot more about these technologies. We regularly contribute back to the open source community, so we decided that it’s time to give an update on our experience with Storm and explain why we have dropped it from our platform, at least for now. Read full blog post here: http://bit.ly/ScaleApacheStorm
| Log management as a service Simplified Log Management
Apache Storm What We Learned About Scaling with Apache Storm
Manoj Chaudhary CTO & VP of Engineering August 2014
We’re the world’s most popular cloud-based log management service
§ More than 5,000 customers
§ Near real-time indexing of events
Distributed architecture, built on AWS
Initial production services in 2011
§ Loggly Generation 2 released in Sept 2013
What Loggly Does
§ The unique challenges of log management
§ Overview of the Loggly event pipeline
§ Use of open source technologies
§ Lessons we have learned
§ Why we removed Storm
§ Conclusions: the Storm 411
Agenda for this Presentation
Everyone starts with …
§ A bunch of log files (syslog, application specific)
§ On a bunch of machines
Management consists of doing the simple stuff:
§ Rotate files, compress and delete
§ Information is there, but specific events are awkward to find
§ Log retention policies evolve over time
How Log Management Starts
Log Volume
Self-Inflicted Pain
“…hmmm, our logs are getting a bit bloated”
“…let’s spend time managing our log capacity”
“…how can I make this someone else’s problem!”
As Log Data Grows
Use existing logging infrastructure
§ Real-time syslog forwarding is built in
§ Application log file watching
Store logs in the cloud
§ Accessible when there is a system failure
§ Cost-effective data retention
Log messages in machine-parsable format
§ JSON encoding when logging structured information
§ Key-value pairs
Loggly Makes Log Management Much Easier
Gen1
• 2011-2013
• AWS EC2 deployment
• SOLR Cloud
• ZeroMQ for message queue
Gen2
• Launched September 2013
• AWS deployment
• Utilized ElasticSearch, Kafka, Storm
Incremental Improvements and Scale
Loggly’s Evolution
§ Big data
§ >750 billion events logged to date
§ Sustained bursts of 100,000+ events per second
§ Data space measured in petabytes
§ Need for high fault tolerance
§ Near real-time indexing requirements
§ Time-series index management
The Challenges of Log Management at Scale
Open sourced by Twitter in September 2011
§ Now an Apache Software Foundation project
§ Currently in Incubator status
Framework is for stream processing
§ Distributed
§ Fault tolerant
§ Computation
§ Fail-fast components
About Apache Storm
Storm Logical View
[Diagram: Example Topology with one spout and four bolts]
Spouts emit the source stream.
Bolts perform stream processing.
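As a rough illustration of the spout and bolt roles, the following Python sketch models a tiny topology with plain functions. It does not use Storm's API. The names `spout`, `parse_bolt`, `count_bolt`, `archive_bolt`, and `run_topology`, and the "customer message" line format, are assumptions made for this example only.

```python
def spout(lines):
    """Emit the source stream, one tuple per raw log line."""
    for line in lines:
        yield line


def parse_bolt(stream):
    """Split each line into a (hypothetical) customer id and a message."""
    for line in stream:
        customer, _, message = line.partition(" ")
        yield {"customer": customer, "message": message}


def count_bolt(event, counts):
    """Keep a running event count per customer."""
    counts[event["customer"]] = counts.get(event["customer"], 0) + 1


def archive_bolt(event, archive):
    """Collect messages, standing in for a storage step."""
    archive.append(event["message"])


def run_topology(lines):
    """Wire spout -> parse_bolt, then fan out to two downstream bolts."""
    counts, archive = {}, []
    for event in parse_bolt(spout(lines)):
        # Both bolts consume the same tuple. They run one after another
        # here; a real stream processor would run them concurrently.
        count_bolt(event, counts)
        archive_bolt(event, archive)
    return counts, archive
```

Calling `run_topology(["acme started", "acme stopped", "zeta login ok"])` returns the per-customer counts together with the list of archived messages.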
Storm Physical View
[Diagram: Nimbus, a ZooKeeper cluster, and Worker Nodes, each running a Supervisor and Worker Processes that contain Executors and Tasks]
Nimbus: Master daemon
§ Distributes code
§ Assigns tasks
§ Monitors failures
ZooKeeper: Stores operational cluster state.
Supervisor: Daemon listening for work assigned to its node.
Worker Process: Java process executing a subset of a topology.
Executor: Java thread spawned by a Worker; runs tasks of the same component.
Task: Component (spout / bolt) instance; performs the actual data processing.
Log Ingestion and Processing Overview
[Diagram: Load Balancing, Storm Event Processing, Kafka Stage 2]
§ Storm provides Complex Event Processing
§ Where we run much of our secret sauce
§ Stage 1 contains the raw Events
§ Stage 2 contains processed Events
§ Snapshot the last day of Stage 2 events to S3
Event Pipeline in Summary
§ The spout-and-bolt model fit our network approach, where logs could move from bolt to bolt sequentially or be consumed by several bolts in parallel
§ Guaranteed processing of the data stream
§ Allowed us to focus on writing the best possible code for different bolts
§ Dynamic deployment makes it easy to add or remove nodes to adjust for actual loads and requirements
§ Log data has peaks and valleys
What Attracted Us to Storm
Loggly Gen2 at Launch: Where Storm Fits In
[Diagram: Kafka Stage 1, Identify Customer, Summary Statistics, S3 Bucket, Kafka Stage 2]
What We Learned
Guaranteed delivery feature needed for log management resilience but…
Guaranteed Delivery Causes Big Performance Hit
[Diagram: the Example Topology (one spout, four bolts; spouts emit the source stream, bolts perform stream processing) annotated with ack messages]
2.5x hit to performance!!
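One way to see where the per-tuple cost comes from is to model the bookkeeping. Storm's documentation describes an acker that tracks each spout tuple's tree by XOR-ing tuple ids until the running value reaches zero. The Python sketch below is a toy model of that idea, not Storm's code. The class `AckTracker`, its methods, the helper `process_one_event`, and the small integer ids are assumptions for illustration (Storm uses random 64-bit ids).

```python
class AckTracker:
    """Toy XOR-based tracker for tuple trees (illustrative only)."""

    def __init__(self):
        self.pending = {}   # root id -> running XOR of outstanding tuple ids
        self.messages = 0   # bookkeeping messages received so far

    def emit_root(self, root_id, tuple_id):
        """Register the tuple a spout emits for a new event."""
        self.pending[root_id] = tuple_id
        self.messages += 1

    def ack(self, root_id, acked_id, child_ids=()):
        """Record that a bolt acked one tuple and anchored new child tuples.

        Returns True once every tuple in the tree has been acked.
        """
        value = self.pending[root_id] ^ acked_id
        for child in child_ids:
            value ^= child
        self.messages += 1
        if value == 0:
            del self.pending[root_id]
            return True
        self.pending[root_id] = value
        return False


def process_one_event(tracker):
    """Send one event through spout -> bolt A -> bolts B and C."""
    tracker.emit_root(root_id=100, tuple_id=1)
    done_a = tracker.ack(100, acked_id=1, child_ids=(2, 4))  # A emits tuples 2 and 4
    done_b = tracker.ack(100, acked_id=2)                    # B finishes tuple 2
    done_c = tracker.ack(100, acked_id=4)                    # C finishes tuple 4
    return done_a, done_b, done_c
```

In this model a single event costs four bookkeeping messages on top of its three processing steps, which hints at why ack'ing every tuple adds up at high event rates. The sketch says nothing about the measured slowdown itself.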
Our Performance Testing
Preload Kafka broker
• 50 GB of raw log data from production cluster
Deploy Storm topology with Kafka spout
• Kafka partitions with 8 spouts and 20 mapper bolts
• 4K provisioned IOPS backend AWS instance
Ack'ing per tuple turned off
• TOPOLOGY_ACKERS set to 0
• Kafka disks red hot
Ack'ing per tuple enabled
• Kafka disks not saturated
• Bolts not running on high capacity
[Chart: Average events per second processed per cluster, without guaranteed delivery vs. with guaranteed delivery; axis from 0 to 250,000]
§ Ack a set of logs instead of individual events
§ PROBLEM: not consistent with Storm’s semantics of a “message”
Potential Workaround: Batch Logs
It is not trivial to change the Kafka spout as well as each bolt to reinterpret a single message as a bunch of logs.
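To make the batching idea concrete, here is a minimal Python sketch that groups events so that one acknowledgement covers a whole batch. The helpers `make_batches` and `count_acks` and the batch sizes used are hypothetical. The sketch does not touch Storm or the Kafka spout, which is where the real difficulty lies.

```python
from itertools import islice


def make_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_acks(events, batch_size):
    """Number of acks needed if each batch is acked exactly once."""
    return sum(1 for _ in make_batches(events, batch_size))
```

With 1,000 events, `count_acks(range(1000), 1)` needs 1,000 acks while `count_acks(range(1000), 100)` needs 10. The likely trade-off is that a failed batch would have to be replayed as a whole.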
[Diagram: Load Balancing, Loggly Custom Module, Kafka Stage 2]
Ultimate Solution: Build Custom Queue for Module-to-Module Communication
§ High-performance, reliable communication that implements our workflow
§ Supports sustained rates of 100K+ events per second
§ Relatively easy to port
Benefits of New Approach
Conclusions
Storm 0.8.2 has plenty of potential
But… Log management’s unique challenges drive the need for a custom framework
Log Management is Our Full-Time Job. It Shouldn’t Be Yours.
About Us: Loggly is the world’s most popular cloud-based log management solution, used by more than 5,000 happy customers to effortlessly spot problems in real time, easily pinpoint root causes, and resolve issues faster to ensure application success.
Unless You Want it to Be (Join us!) Check out our career page to see if there’s a great match for your skills! loggly.com/careers.
Try Loggly for Free! → http://bit.ly/ScaleApacheStorm
Visit us at loggly.com or follow @loggly on Twitter.