tianjian-chen
About Me
• Principal Architect @Baidu.com
• Contract Programmer
– C/C++/Python
• Post Engineering Disorder Therapist
Murphy’s Law
• Anything That Can Go Wrong Will Go Wrong
– Unstable Servers
– Unstable Network
– Unstable Data Source
– Unstable Managers
– Unstable…
Agenda
• Section I: A Scratch
– Brief Intro to Streams
• Section II: Build From Ground Up
– Modern Stream Processing Architecture
• Section III: Could Be Much Sexier
– Stream Evolution in Progress
Highlights
• Natural Beauty of Streams
• Stream Processing Design Language (SDL)
• Scratching Stream Applications
Definition of Stream
• A Series Of Data Packs
• Data is Structured or Semi-Structured
• With Internal Topologies
– DAG in most cases
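A minimal sketch of the definition above, assuming a toy topology: data packs enter at a source operator and flow through a DAG of operators in dependency order. The operator names (`parse`, `keep_even`, `square`) and the `run` helper are hypothetical and only illustrate the idea; they are not part of SDL.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical topology: each operator maps to the set of operators it reads from.
topology = {
    "parse": set(),
    "keep_even": {"parse"},
    "square": {"keep_even"},
}

# Hypothetical operators: each takes a list of data packs and returns a list.
operators = {
    "parse": lambda packs: [int(p) for p in packs],
    "keep_even": lambda packs: [p for p in packs if p % 2 == 0],
    "square": lambda packs: [p * p for p in packs],
}

def run(source_packs):
    """Push one series of data packs through the DAG in dependency order."""
    outputs = {}
    for name in TopologicalSorter(topology).static_order():
        upstream = topology[name]
        if not upstream:
            # Source operators read the raw input.
            inputs = source_packs
        else:
            # Other operators read everything their upstream operators emitted.
            inputs = [p for u in sorted(upstream) for p in outputs[u]]
        outputs[name] = operators[name](inputs)
    return outputs

result = run(["1", "2", "3", "4"])
```

Here `result["square"]` holds the packs that reached the last operator. A real system would process packs continuously rather than one list at a time.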
Common Join Rules
• Internal Mode
– Indexing Only Input Messages
– Limited Time Window
• External Mode
– Indexing External Data
– Extremely Useful for Integrating Asynchronous Components
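The internal mode can be sketched as a join that indexes only the input messages themselves and forgets them once they fall outside a limited time window. The class name `WindowJoin`, the `left`/`right` side labels, and the assumption that timestamps arrive in non-decreasing order are all illustrative choices, not something the slides specify. An external-mode join would instead look the key up in an external store.

```python
from collections import deque

class WindowJoin:
    """Join two input streams on a key, indexing only messages inside a time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        # One index per input side: deque of (timestamp, key, payload), oldest first.
        self.index = {"left": deque(), "right": deque()}

    def _expire(self, now):
        # Drop indexed messages older than the window.
        for buf in self.index.values():
            while buf and now - buf[0][0] > self.window:
                buf.popleft()

    def on_message(self, side, timestamp, key, payload):
        """Index the message and return the (left, right) pairs it completes."""
        self._expire(timestamp)
        other = "right" if side == "left" else "left"
        matches = [p for (_, k, p) in self.index[other] if k == key]
        self.index[side].append((timestamp, key, payload))
        if side == "left":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]

join = WindowJoin(window_seconds=10)
first = join.on_message("left", 0, "u1", "impression")
second = join.on_message("right", 5, "u1", "click")
late = join.on_message("right", 20, "u1", "late click")
```

The third message arrives after the window has closed, so the earlier impression has already been dropped from the index and no pair is produced.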
Common Partition Rules
• Feature Based – Application Related
• Random – Balancing Load
• Hash – Aggregating Data by User Defined Key
• Replication – For Improving Availability
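The four rules above can be sketched as functions that map a message to a list of target partitions. The function names, the dict-shaped message, and the `rule` callback are assumptions made for illustration only.

```python
import random
import zlib

def random_partition(message, n):
    """Random: spread load evenly, ignoring message content."""
    return [random.randrange(n)]

def hash_partition(message, n, key):
    """Hash: messages with the same user-defined key land on the same partition."""
    # crc32 is stable across processes, unlike the built-in hash() on strings.
    digest = zlib.crc32(str(message[key]).encode("utf-8"))
    return [digest % n]

def replicate(message, n):
    """Replication: send a copy to every partition for availability."""
    return list(range(n))

def feature_partition(message, n, rule):
    """Feature based: an application-supplied rule picks the partition."""
    return [rule(message) % n]

msg = {"user": "u42", "type": "image"}
same_user = {"user": "u42", "type": "text"}
images_first = lambda m: 0 if m["type"] == "image" else 1
```

With these definitions, `hash_partition(msg, 4, "user")` and `hash_partition(same_user, 4, "user")` return the same partition, which is what makes per-key aggregation possible.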
Sum Up!
• Streams are everywhere
• We have a modeling tool to describe stream processing flows(SDL)
• Proceed to build real systems
Why Does Reliability Matter?
• We Have Mission-Critical Applications
– Real time stock exchange analysis
– Ads network click monitoring & billing
Reliability Solutions
• Upstream Backup & Replay
• Source Backup & Replay
• Processing Status Backup & Replay
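As a rough sketch of upstream backup and replay, the upstream side can keep every sent message until the downstream operator acknowledges it, and resend whatever is still unacknowledged after a downstream failure. The class `UpstreamBuffer`, the sequence-number scheme, and the `deliver` callback are hypothetical; the slides do not describe a concrete protocol.

```python
from collections import OrderedDict

class UpstreamBuffer:
    """Keeps every sent message until the downstream operator acknowledges it."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = OrderedDict()  # seq -> message, in send order

    def send(self, message, deliver):
        """Back up the message, then hand it to the downstream operator."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = message
        deliver(seq, message)
        return seq

    def ack(self, seq):
        """Downstream has durably processed everything up to and including seq."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def replay(self, deliver):
        """After a downstream failure, resend all unacknowledged messages in order."""
        for seq, message in self.unacked.items():
            deliver(seq, message)

received, replayed = [], []
buf = UpstreamBuffer()
for m in ["a", "b", "c"]:
    buf.send(m, lambda s, msg: received.append((s, msg)))
buf.ack(0)  # downstream confirmed message 0 only
buf.replay(lambda s, msg: replayed.append((s, msg)))
```

Replay gives at-least-once delivery, so in this sketch the downstream side would need to drop sequence numbers it has already applied.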
Processing Status Backup & Replay
[Diagram: a Stream Operator and a Stream Operator Shadow, each placed between Upstream and Downstream. Labels: Status Synchronize Messages (between operator and shadow), Redirect (on both the upstream and downstream links).]
But Reliability Hurts Performance
• All Reliability Solutions are Based on
– Indexing
– Snapshot
– Replay
• A Club Of Performance Penalties
Tuning Strategies
• Stateful Operators vs. Stateless Operators
– Independent State Storage
– White Board Programming Model
– Lazy State Synchronization
• Micro Batching Snapshot
Micro Batching
Batch Size   Time Window   Throughput   Snapshot Cost   Restore Cost
1            1ms           1x           very high       very low
10           10ms          10x          high            low
100          100ms         100x         medium          low
1000         100ms         1000x        medium          low
10000        1s            5000x        low             high
(Slide annotations on the table: "Most Systems Are Here" and "May Be Constrained by Network Configuration".)
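The trade-off in the table can be illustrated with a small sketch: a stateful operator snapshots its state once per batch, and after a failure it restores the last snapshot and replays the logged messages that came after it. Larger batches mean fewer snapshots but more messages to replay. `CountingOperator`, the in-memory `log`, and the `simulate` helper are hypothetical names chosen for this example.

```python
import copy

class CountingOperator:
    """Counts messages per key and snapshots its state once per batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.state = {}           # key -> count
        self.processed = 0        # number of messages applied to state
        self.snapshot = ({}, 0)   # (state copy, offset into the message log)
        self.snapshots_taken = 0

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.processed += 1
        if self.processed % self.batch_size == 0:
            self.snapshot = (copy.deepcopy(self.state), self.processed)
            self.snapshots_taken += 1

    def restore(self, log):
        """Roll back to the last snapshot, then replay the messages after it."""
        state, offset = self.snapshot
        self.state = copy.deepcopy(state)
        self.processed = offset
        pending = log[offset:]
        for key in pending:
            self.process(key)
        return len(pending)

def simulate(batch_size, log):
    """Return (snapshots taken, messages replayed, state recovered correctly)."""
    op = CountingOperator(batch_size)
    for key in log:
        op.process(key)
    expected = dict(op.state)
    op.state = {}  # simulate losing the in-memory state
    replayed = op.restore(log)
    return op.snapshots_taken, replayed, op.state == expected

log = ["a", "b", "a", "c", "a", "b", "a"]
per_message = simulate(1, log)
small_batch = simulate(3, log)
huge_batch = simulate(100, log)
```

In this toy run, a batch size of 1 snapshots after every message and replays nothing, while a batch size larger than the log never snapshots and has to replay everything.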
Pit Stop
• Reliability is based on
– Message Backup & Replay
– Status Snapshot
• When tuning, think of
– How to handle operator status
– Micro Batching
Fluctuation Handling
• Technologies To Obtain
– High Performance RPC Framework
– Auto Partitioning
– Dynamic Resource Allocation
– Global Flow Control
High Performance RPC Framework
• Indication of High Performance
– Over 20k QPS, with 1-byte payload
– On a commodity server with two 6-core CPUs and Gigabit Ethernet
• See Also
– SOFA Framework from Baidu.com
– https://github.com/BaiduPS/sofa-pbrpc
Dynamic Resource Allocation
• Tree Model
• Evaluation Function
• Network Constraints
[Diagram: resource tree. Labels: IDC A, IDC B (data centers); PHY1, PHY2, PHY3, PHY4 (physical hosts); VM 1, VM 2, VM 3 (virtual machines).]
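One way to read the three bullets together is as a search over a resource tree: each leaf is scored by an evaluation function, and leaves that violate a network constraint are rejected. The sketch below assumes a particular tree layout, free-core counts, a same-IDC constraint, and a locality-based score; none of these details come from the slides.

```python
# Hypothetical resource tree: IDC -> physical host -> VM -> free CPU cores.
tree = {
    "IDC A": {"PHY1": {"VM 1": 2}, "PHY2": {"VM 2": 8}},
    "IDC B": {"PHY3": {"VM 3": 16}, "PHY4": {}},
}

def leaves(tree):
    """Yield every (idc, host, vm, free_cores) leaf of the tree."""
    for idc, hosts in tree.items():
        for host, vms in hosts.items():
            for vm, free_cores in vms.items():
                yield idc, host, vm, free_cores

def evaluate(candidate, peer, need_cores):
    """Score a placement; higher is better, None means a constraint is violated."""
    idc, host, vm, free = candidate
    peer_idc, peer_host = peer
    if free < need_cores:
        return None
    # Assumed network constraint: stay in the same IDC as the upstream peer.
    if idc != peer_idc:
        return None
    # Assumed evaluation: prefer the peer's host, then the most spare cores.
    locality = 2 if host == peer_host else 1
    return locality * 100 + (free - need_cores)

def allocate(tree, peer, need_cores):
    """Return the best-scoring leaf, or None if no leaf satisfies the constraints."""
    scored = [(evaluate(c, peer, need_cores), c) for c in leaves(tree)]
    scored = [(s, c) for s, c in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]

peer = ("IDC A", "PHY1")
small_job = allocate(tree, peer, need_cores=1)
large_job = allocate(tree, peer, need_cores=4)
too_large = allocate(tree, peer, need_cores=32)
```

The small job lands next to its peer on PHY1, the larger job falls back to another host in the same IDC, and the oversized request is rejected rather than placed across IDCs.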
Pit Stop
• Workload Fluctuation Handling
– Very Fast RPC Framework
– Auto Partitioning
– Dynamic Resource Allocation
– Global Flow Control
Sum Up!
• Basic Architecture of Stream Processing System
• Reliability of Stream Processing
• Ways to Handle Workload Fluctuations
Highlights
• Challenges in Real World Applications
• High Level Stream Programming
• Optimization Inside Hardware
Stream Web Crawling
[Diagram: crawling pipeline built from stream operators OP1 to OP10. Labels: Logging API, Log Filter, Data Join, Feature Extraction, Redis Cluster, Web Page Cache, Crawling Job Updater, User-Model Updater, HBase Cluster, Web DataBase, Crawling Job Generator, Crawling Bot, Image Crawling Job Generator, Cache Synchronizer, Image Crawling Cluster, StatusDB.]
Co-Serving
• Distributed RPC (DRPC)
• Dynamic Routing
[Diagram: stream operators OP1 to OP6. Labels: Online Web Services, Query Preprocess, Query Transform, Result Merge, Intent Extraction, Online Query Log, User Intent Mining, Hadoop. Annotations: "Initiating DRPC from Mapper", "Mapper Receiving DRPC Result".]
High Level Programming
• Stream Database & Stream SQL
• Stream Computing Description Language
• Stream Programming Framework
Pit Stop
Programming Interface   Complexity   Flexibility
Stream SQL              Low          Low
Stream DML              Medium       High
Stream Framework        High         Very High
Network Management Challenges
• Requirements Are Application Dependent
– QoS Requirements
– Relay Priority
– Security Strategy
– Resource Allocation
Pit Stop
• Network Management is Crucial
– Stream Processing is Bandwidth Consuming
– Stream Processing may be Latency Sensitive
• Solution is Simple
– Software Defined Network Integration (SDN)
See Also
• A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services
– 20x Performance
– ISCA 2014 by Microsoft Research
– http://research.microsoft.com/pubs/212001/Catapult_ISCA_2014.pdf
Pit Stop
• Scale-out is difficult; consider scale-up
• Reconfigurable hardware has delivered significant performance improvements
Conclusion
• Stream Processing Systems Can Be Well Modeled by SDL
• There Is a Trade-Off between Reliability & Performance
• High-Level Programming & Scale-Up Are Future Trends
References
• Stonebraker, Michael, Uğur Çetintemel, and Stan Zdonik. "The 8 requirements of real-time stream processing." ACM SIGMOD Record 34, no. 4 (2005): 42-47.
• Zaharia, Matei, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. "Discretized streams: Fault-tolerant streaming computation at scale." In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423-438. ACM, 2013.
• Murray, Derek G., Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi. "Naiad: a timely dataflow system." In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 439-455. ACM, 2013.
• Castro Fernandez, Raul, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch. "Integrating scale out and fault tolerance in stream processing using operator state management." In Proceedings of the 2013 international conference on Management of data, pp. 725-736. ACM, 2013.