DESCRIPTION
DevOps helps accelerate the delivery of software applications through automation and by removing the silos between Development and Operations. The Netflix Platform Engineering team has developed a robust data pipeline solution called SURO, which has been open sourced. Come learn from pioneers like Netflix how they are leveraging the data pipeline for new and innovative use cases. This is the presentation by Danny Yuan of the Netflix Platform Engineering team on the operational and monitoring aspects of applications on cloud platforms.
Netflix Data Pipeline
Sudhir Tonse (@stonse)
Danny Yuan (@g9yuayon)
Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/
Netflix is a log generating company that also happens to stream movies
- Adrian Cockcroft
Data is the most important asset at Netflix
If all the data is easily available to all teams, it can be leveraged in new and exciting ways
Dashboard
~1000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day
3.2M messages per second at peak time
3 GB per second at peak time
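The peak figures can be cross-checked against the daily total with a little arithmetic: 100 billion events per day averages to roughly 1.16 million events per second, so the 3.2M/s peak is about 2.8 times the average, and 3 GB/s spread over 3.2M messages/s comes to just under 1 KB per message. The helper below (a hypothetical PipelineMath class) reproduces these numbers, assuming decimal units (1 GB = 10^9 bytes) and that the "events" and "messages" figures count the same thing.

```java
/** Back-of-the-envelope arithmetic on the dashboard figures (hypothetical helper). */
class PipelineMath {
    static final double SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

    /** Average events per second implied by a daily total. */
    static double averagePerSecond(double eventsPerDay) {
        return eventsPerDay / SECONDS_PER_DAY;
    }

    /** Ratio of a peak rate to the daily average rate. */
    static double peakToAverage(double peakPerSecond, double eventsPerDay) {
        return peakPerSecond / averagePerSecond(eventsPerDay);
    }

    /** Average bytes per message at peak, assuming decimal units (1 GB = 1e9 bytes). */
    static double bytesPerMessage(double bytesPerSecond, double messagesPerSecond) {
        return bytesPerSecond / messagesPerSecond;
    }

    public static void main(String[] args) {
        double daily = 100e9;     // ~100 billion events/day
        double peakMsgs = 3.2e6;  // 3.2M messages/s at peak
        double peakBytes = 3e9;   // 3 GB/s at peak
        System.out.printf("average rate: %.0f msg/s%n", averagePerSecond(daily));
        System.out.printf("peak/average: %.2f%n", peakToAverage(peakMsgs, daily));
        System.out.printf("bytes/message at peak: %.0f%n", bytesPerMessage(peakBytes, peakMsgs));
    }
}
```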
Types of Events
• User Interface Events
  • Search Event (‘Matrix’ using PS3 …)
  • Star Rating Event (HoC: 5 stars, Xbox, US, …)
• Infrastructural Events
  • RPC Call (API -> Billing Service, ‘/bill/..’, 200, …)
  • Log Errors (NPE, “Movie is null”, …, …)
• Other Events …
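One way to picture this taxonomy in code is a small sealed hierarchy. The type and field names below are hypothetical, chosen only to mirror the examples on the slide; they are not Suro's event model, and "Other Events" is not modeled.

```java
import java.util.List;

/** Hypothetical event model mirroring the slide's categories (requires Java 17+). */
sealed interface Event permits SearchEvent, StarRatingEvent, RpcCallEvent, LogErrorEvent {}

// User interface events
record SearchEvent(String query, String device) implements Event {}
record StarRatingEvent(String title, int stars, String device, String country) implements Event {}

// Infrastructural events
record RpcCallEvent(String caller, String callee, String path, int status) implements Event {}
record LogErrorEvent(String exceptionType, String message) implements Event {}

class EventCategories {
    /** Classifies an event into one of the two top-level categories from the slide. */
    static String category(Event e) {
        if (e instanceof SearchEvent || e instanceof StarRatingEvent) {
            return "User Interface";
        }
        return "Infrastructural";
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new SearchEvent("Matrix", "PS3"),
            new StarRatingEvent("HoC", 5, "Xbox", "US"),
            new RpcCallEvent("API", "Billing Service", "/bill/..", 200),
            new LogErrorEvent("NPE", "Movie is null"));
        for (Event e : events) {
            System.out.println(category(e) + ": " + e);
        }
    }
}
```

A sealed interface lets the compiler know the full set of event types, which keeps classification code like category() honest as new types are added.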
Making Sense of Billions of Events
A Humble Beginning
Evolution … Scale!
We Want to Process App Data in Hadoop
Our Hadoop Ecosystem
@NetflixOSS Big Data Tools
Hadoop as a Service
Pig Scripting on Steroids
Pig Married to Clojure
S3MPER
S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
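The idea behind a consistent secondary index can be sketched as follows: record every key written in a store that is trusted to be consistent, then compare a possibly stale listing against it. In this sketch both the index and the listing are in-memory stand-ins, and ConsistencyChecker and its methods are hypothetical; none of this is S3mper's API.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch of listing-consistency checking against a secondary index (hypothetical names). */
class ConsistencyChecker {
    // Stand-in for the consistent secondary index: every key recorded at write time.
    private final Set<String> index = new TreeSet<>();

    /** Called alongside every write to the eventually consistent store. */
    void recordWrite(String key) {
        index.add(key);
    }

    /** Called alongside every delete so the index does not expect removed keys. */
    void recordDelete(String key) {
        index.remove(key);
    }

    /**
     * Returns keys the index says should exist under the prefix but that the
     * listing did not return; a non-empty result means the listing is stale.
     */
    Set<String> missingFromListing(String prefix, Set<String> listing) {
        Set<String> missing = new TreeSet<>();
        for (String key : index) {
            if (key.startsWith(prefix) && !listing.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        ConsistencyChecker checker = new ConsistencyChecker();
        checker.recordWrite("logs/part-0000");
        checker.recordWrite("logs/part-0001");
        // Simulate an eventually consistent listing that has not caught up yet.
        Set<String> listing = Set.of("logs/part-0000");
        System.out.println("missing: " + checker.missingFromListing("logs/", listing));
    }
}
```

On a non-empty result, a caller might wait and re-list, or fail the job, instead of silently processing incomplete input.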
Efficient ETL with Cassandra
Cassandra
Offline Analysis
Evolution … Speed!
hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3://bucket
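The slide does not explain hgrep's options, so the -k flag is left as shown. Assuming -C 10 means the same as in GNU grep (print ten lines of context around each match), the core of a context-aware search can be sketched over an in-memory list of log lines; ContextGrep is a hypothetical name, and reading files from S3 is out of scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of grep-style matching with N lines of context (like GNU grep -C N). */
class ContextGrep {
    /** Returns the lines within `context` lines of any match, in order, without duplicates. */
    static List<String> grep(List<String> lines, String regex, int context) {
        Pattern pattern = Pattern.compile(regex);
        boolean[] keep = new boolean[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            if (pattern.matcher(lines.get(i)).find()) {
                int from = Math.max(0, i - context);
                int to = Math.min(lines.size() - 1, i + context);
                for (int j = from; j <= to; j++) {
                    keep[j] = true; // mark the match and its surrounding context
                }
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (keep[i]) {
                out.add(lines.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of("start", "users 123 failed", "next", "idle", "idle", "done");
        System.out.println(grep(log, "users.*[1-9]{3}", 1));
    }
}
```

Unlike GNU grep, this sketch does not print "--" separators between non-adjacent groups of lines.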
We Want to Aggregate, Index, and Query Data in Real Time
Interactive Exploration
Let’s walk through some use cases
client activity event
*/name = “movieStarts”
Pipeline Challenges
• App owners: send and forget
• Data scientists: validation, ETL, batch processing
• DevOps: stream processing, targeted search
Message Routing
We Want to Consume Data Selectively in Different Ways
• Message broker
• High-throughput
• Persistent and replicated
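Selective consumption can be sketched as a router that tests each message against per-sink filters, so one event stream can feed batch storage and a narrowly filtered consumer at the same time. MessageRouter, the sink names, and the map-based message shape are hypothetical; the name-based filter mirrors the name = “movieStarts” example earlier. A real deployment would put a persistent, replicated broker between producers and sinks, which this in-process sketch omits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical router: each sink registers a filter and receives only matching messages. */
class MessageRouter {
    private final Map<String, Predicate<Map<String, Object>>> filters = new LinkedHashMap<>();
    private final Map<String, List<Map<String, Object>>> delivered = new LinkedHashMap<>();

    void addSink(String sinkName, Predicate<Map<String, Object>> filter) {
        filters.put(sinkName, filter);
        delivered.put(sinkName, new ArrayList<>());
    }

    /** Delivers one message to every sink whose filter accepts it. */
    void route(Map<String, Object> message) {
        filters.forEach((sink, filter) -> {
            if (filter.test(message)) {
                delivered.get(sink).add(message);
            }
        });
    }

    List<Map<String, Object>> deliveredTo(String sinkName) {
        return delivered.get(sinkName);
    }

    public static void main(String[] args) {
        MessageRouter router = new MessageRouter();
        router.addSink("hadoop", m -> true); // batch analysis keeps everything
        router.addSink("movieStarts", m -> "movieStarts".equals(m.get("name")));
        router.route(Map.of("name", "movieStarts", "device", "PS3"));
        router.route(Map.of("name", "search", "query", "Matrix"));
        System.out.println(router.deliveredTo("hadoop").size() + " / "
            + router.deliveredTo("movieStarts").size());
    }
}
```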
There Is More
Intelligent Alerts
Guided Debugging in the Right Context
What We Need
• Ad-hoc query with different dimensions
• Quick aggregations and Top-N queries
• Time series with flexible filters
• Quick access to raw data using boolean queries
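To make Top-N queries and filtered time series concrete, here is an in-memory sketch over a hypothetical Hit record. Boolean filters are composed with java.util.function.Predicate's and/or/negate. A store like Druid answers such queries over indexed data at far larger scale, which this sketch does not attempt.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** In-memory sketch of Top-N and filtered time-series queries (hypothetical data model). */
class EventQueries {
    record Hit(long timestampMillis, String device, String country, int status) {}

    /** Top-N values of a dimension by count, highest first (ties broken by name). */
    static List<Map.Entry<String, Long>> topN(List<Hit> hits, Function<Hit, String> dimension, int n) {
        return hits.stream()
            .collect(Collectors.groupingBy(dimension, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toList());
    }

    /** Counts matching hits per fixed-size time bucket, keyed by bucket start time. */
    static Map<Long, Long> timeSeries(List<Hit> hits, Predicate<Hit> filter, long bucketMillis) {
        return hits.stream()
            .filter(filter)
            .collect(Collectors.groupingBy(
                h -> h.timestampMillis() - h.timestampMillis() % bucketMillis,
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit(0, "PS3", "US", 200),
            new Hit(30_000, "Xbox", "US", 500),
            new Hit(70_000, "PS3", "CA", 200),
            new Hit(90_000, "PS3", "US", 500));
        System.out.println(topN(hits, Hit::device, 1)); // most active device
        Predicate<Hit> isError = h -> h.status() >= 500;
        Predicate<Hit> inUs = h -> h.country().equals("US");
        // Boolean filter: errors AND US, bucketed per minute
        System.out.println(timeSeries(hits, isError.and(inUs), 60_000));
    }
}
```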
Druid
• Rapid exploration of high-dimensional data
• Fast ingestion and querying
• Time series
Elasticsearch
• Real-time indexing of event streams
• Killer feature: boolean search
• Great UI: Kibana
The Old Pipeline
The New Pipeline
There Is More
It’s Not All About Counters and Time Series
Request ID | Parent ID | Node ID | Service Name | Status
4965-4a74 | 0 | 123 | Edge Service | 200
4965-4a74 | 123 | 456 | Gateway | 200
4965-4a74 | 456 | 789 | Service A | 200
4965-4a74 | 456 | abc | Service B | 200
Status:200
Distributed Tracing
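Because every row in the tracing table carries its caller's node ID, a request's call tree can be rebuilt by grouping rows by parent ID and walking down from the root (parent ID 0). The sketch below uses those rows; the Span record and the render method are hypothetical names, and the walk assumes the parent links contain no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Rebuilds a request's call tree from flat trace rows (hypothetical names). */
class TraceTree {
    record Span(String parentId, String nodeId, String service, int status) {}

    /** Renders the tree depth-first, indenting each callee under its caller. */
    static List<String> render(List<Span> spans, String rootParentId) {
        Map<String, List<Span>> children = new LinkedHashMap<>();
        for (Span s : spans) {
            children.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s);
        }
        List<String> out = new ArrayList<>();
        walk(children, rootParentId, 0, out);
        return out;
    }

    private static void walk(Map<String, List<Span>> children, String parentId, int depth, List<String> out) {
        for (Span s : children.getOrDefault(parentId, List.of())) {
            out.add("  ".repeat(depth) + s.service() + " (" + s.status() + ")");
            walk(children, s.nodeId(), depth + 1, out); // recurse into this span's callees
        }
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
            new Span("0", "123", "Edge Service", 200),
            new Span("123", "456", "Gateway", 200),
            new Span("456", "789", "Service A", 200),
            new Span("456", "abc", "Service B", 200));
        render(spans, "0").forEach(System.out::println);
    }
}
```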
A System that Supports All These
A Data Pipeline To Glue Them All
Make It Simple
Message Producing
• Simple and Uniform API
• messageBus.publish(event)
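Consumption Is Simple Too
consumer.observe().subscribe(new Subscriber<Ackable<IncomingMessage>>() {
    @Override
    public void onNext(Ackable<IncomingMessage> ackable) {
        // Deserialize the payload, process it, then acknowledge it
        process(ackable.getEntity(MyEventType.class));
        ackable.ack();
    }

    @Override
    public void onCompleted() { }

    @Override
    public void onError(Throwable e) {
        // handle or log the stream error
    }
});

// Pause consumption and later resume it
consumer.pause();
consumer.resume();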
RxJava
• Functional reactive programming model
• Powerful streaming API
• Separation of logic and threading model
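These RxJava points can be illustrated without a third-party dependency by swapping in the JDK's own reactive-streams interfaces, java.util.concurrent.Flow (Java 9+). This is a stand-in, not the RxJava or Suro API: the processing logic lives in the subscriber, the executor handed to SubmissionPublisher decides which thread delivers messages, and requesting one item at a time gives simple back-pressure. FlowConsumerDemo and ProcessingSubscriber are hypothetical names.

```java
import java.util.List;
import java.util.Locale;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

/** Consumer sketch using java.util.concurrent.Flow in place of RxJava (a swapped-in API). */
class FlowConsumerDemo {
    /** Processing logic only; which thread runs it is decided by the publisher's executor. */
    static class ProcessingSubscriber implements Flow.Subscriber<String> {
        final List<String> processed = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1); // pull one message at a time (back-pressure)
        }
        @Override public void onNext(String message) {
            processed.add(message.toUpperCase(Locale.ROOT)); // stand-in for process(...)
            subscription.request(1); // ask for the next message only after this one is handled
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    static List<String> run(List<String> messages) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor(); // the threading model lives here
        ProcessingSubscriber subscriber = new ProcessingSubscriber();
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>(pool, 256)) {
            publisher.subscribe(subscriber);
            messages.forEach(publisher::submit);
        } // close() signals onComplete once buffered items have been delivered
        subscriber.done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return subscriber.processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("movieStarts", "search")));
    }
}
```

Changing the executor (a thread pool, a single thread) changes where onNext runs without touching the processing code. There is no direct pause()/resume() here; withholding further request(n) calls has a similar effect.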
Design Decisions
• Top Priority: app stability and throughput
• Asynchronous operations
• Aggressive buffering
• Drops messages if necessary
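A producer following these decisions might hand each message to a bounded in-memory buffer with a non-blocking offer, let a background thread drain it, and count what it had to drop rather than block the application. This is a minimal sketch under those assumptions; AsyncPublisher and its methods are hypothetical, not Suro's client API, and the transport is any Consumer standing in for a network send.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical fire-and-forget publisher: never blocks the caller, drops when the buffer is full. */
class AsyncPublisher implements AutoCloseable {
    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread sender;
    private volatile boolean running = true;

    AsyncPublisher(int capacity, Consumer<String> transport) {
        buffer = new ArrayBlockingQueue<>(capacity);
        sender = new Thread(() -> {
            try {
                // Keep draining until closed and the buffer is empty.
                while (running || !buffer.isEmpty()) {
                    String msg = buffer.poll(100, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        transport.accept(msg); // a network send in a real client
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-publisher");
        sender.setDaemon(true);
    }

    void start() {
        sender.start();
    }

    /** Returns immediately; the message is dropped (and counted) if the buffer is full. */
    void publish(String message) {
        if (!buffer.offer(message)) {
            dropped.incrementAndGet();
        }
    }

    long droppedCount() {
        return dropped.get();
    }

    /** Stops accepting work and waits for the sender to flush what is already buffered. */
    @Override
    public void close() throws InterruptedException {
        running = false;
        sender.join();
    }

    public static void main(String[] args) throws InterruptedException {
        try (AsyncPublisher publisher = new AsyncPublisher(1024, System.out::println)) {
            publisher.start();
            publisher.publish("{\"name\":\"movieStarts\"}");
        }
    }
}
```

Dropping and counting keeps application threads responsive when downstream is slow; the drop counter is what an operator would alert on.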
Anything Can Fail
Cloud Resiliency
Fault Tolerance Features
• Write and forward with auto-reattached EBS (Amazon’s Elastic Block Storage)
• disk-backed queue: big-queue
• Customized scaling down
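The write-and-forward idea can be sketched with a file-backed FIFO: messages are appended to a file first and forwarded later, so a slow or unavailable downstream does not lose what was already written. This toy version stores length-prefixed records in a RandomAccessFile and keeps its read position only in memory; it is not big-queue's API or on-disk format, and it models neither EBS reattachment nor file compaction. DiskQueue is a hypothetical name.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy disk-backed FIFO queue of strings (length-prefixed records in one file). */
class DiskQueue implements AutoCloseable {
    private final RandomAccessFile file;
    private long readPos = 0; // in-memory only; a real queue would persist this too

    DiskQueue(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
    }

    /** Appends one record: a 4-byte length followed by the UTF-8 bytes. */
    synchronized void enqueue(String message) throws IOException {
        byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
        file.seek(file.length());
        file.writeInt(bytes.length);
        file.write(bytes);
    }

    /** Returns the oldest unread record, or null if everything has been read. */
    synchronized String dequeue() throws IOException {
        if (readPos >= file.length()) {
            return null;
        }
        file.seek(readPos);
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        readPos = file.getFilePointer();
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("disk-queue-sketch", ".queue");
        try (DiskQueue queue = new DiskQueue(path)) {
            queue.enqueue("event-1");
            queue.enqueue("event-2");
            // A forwarder would send each record downstream, then move on to the next.
            for (String m = queue.dequeue(); m != null; m = queue.dequeue()) {
                System.out.println("forwarding " + m);
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```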
There’s More to Do
• Contribute to @NetflixOSS!
• Join us :-)
You can build your own web-scale data pipeline using open source components
Thank You!
Sudhir Tonse http://www.linkedin.com/in/sudhirtonse Twitter: @stonse
Danny Yuan http://www.linkedin.com/pub/danny-yuan/4/374/862 Twitter: @g9yuayon