STREAM PROCESSING IN UBER MARKETPLACE
~68 countries / 350+ cities
Transportation as reliable as running water, everywhere, for everyone
Agenda: What’s on the menu?
• Use Cases
• Problem Space
• Overall Architecture
• Choices & Tradeoffs
• Q & A
Use Case: Realtime OLAP
There is always a need for quick exploration
How many open cars in the world, NOW?
How many UberXs were driving clients in SF in the past 10 minutes by hexagons?
Driving time and other metrics over time by hexagonal area
Use Case: Complex Event Processing
There are patterns in event streams
How many drivers cancel requests more than 3 times in a row within a 10-minute window?
Which riders request pickups 100 miles apart within a half-hour window?
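The first pattern above can be sketched as a small stateful check per driver. This is only an illustration, not the pipeline described in the deck: the event shape `(driver_id, ts_seconds, kind)`, the `"cancel"` label, and the function name `find_repeat_cancellers` are all assumptions made for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60   # the 10-minute window from the slide
THRESHOLD = 3              # "more than 3 times in a row"

def find_repeat_cancellers(events):
    """Return driver ids with more than THRESHOLD consecutive cancels
    whose timestamps all fall within WINDOW_SECONDS.

    `events` is an iterable of (driver_id, ts_seconds, kind) tuples in
    timestamp order; this shape is hypothetical.
    """
    streaks = defaultdict(deque)  # driver_id -> timestamps of the current cancel run
    flagged = set()
    for driver_id, ts, kind in events:
        streak = streaks[driver_id]
        if kind != "cancel":
            streak.clear()        # any other event breaks the run
            continue
        streak.append(ts)
        # Drop cancels that have fallen out of the window.
        while ts - streak[0] > WINDOW_SECONDS:
            streak.popleft()
        if len(streak) > THRESHOLD:
            flagged.add(driver_id)
    return flagged
```

The state kept per driver is bounded by the window, which is what makes this kind of query practical on a stream.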
IF this -> THEN that
● Sigma is similar - but for offline/batch applications
Complex Event Processing
Use Case: Supply Positioning
Clusters Of Supply & Demand
Predicted Health Metrics
Actual Health Metrics
Monitor Marketplace Health
Challenges
OLAP of Geo-spatial Temporal Data
Reasonably Large Scale
Near Real Time
Hexagons
• Indexing, Lookup, Rendering
• Symmetric Neighbors
• Convex & Compact Regions
• Equal Areas
• Equal Shape
Scale
Geo Space × Vehicle Types × Time × Status
Granular Geo Areas
Over 10,000 hexagons in a city
Multiple Vehicle Types
7 vehicle types
Minute-level Time Buckets
1440 minutes in a day
Many Driver States
13 driver states
Many Cities
300 cities
Granular Data
1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
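The figure above is a plain product of the dimension sizes listed on the preceding slides. As a quick check, using constant names chosen here for illustration:

```python
CITIES = 300
HEXAGONS_PER_CITY = 10_000
VEHICLE_TYPES = 7
MINUTES_PER_DAY = 24 * 60   # 1440
DRIVER_STATES = 13

# Every (city, hexagon, vehicle type, minute, driver state) cell for one day.
combinations = (CITIES * HEXAGONS_PER_CITY * VEHICLE_TYPES
                * MINUTES_PER_DAY * DRIVER_STATES)
print(f"{combinations:,}")  # 393,120,000,000
```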
Unknown Query Patterns
Any combination of dimensions
Variety of Aggregations
- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo
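Two of the aggregations listed above, Top N and histogram, can be sketched over in-memory data. This is a minimal illustration only; the function names and input shapes are assumptions, and a real OLAP store would compute these server-side.

```python
from collections import Counter

def top_n(counts, n):
    """Return the n (key, count) pairs with the highest counts, highest first."""
    return Counter(counts).most_common(n)

def histogram(values, bucket_width):
    """Count values per bucket; each key is the bucket's lower bound."""
    buckets = Counter()
    for v in values:
        buckets[(v // bucket_width) * bucket_width] += 1
    return dict(sorted(buckets.items()))
```

For example, `top_n` could rank hexagons by open-car count, and `histogram` could bucket trip durations by a fixed width.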
Large Data Volume
• Hundreds of thousands of events per second
• At least dozens of fields in each event
Multiple Topics: Rider States, Driver States
Let’s build a stream processing pipeline
Pipeline Template
Event Collection
Multiple Event Types with Different Volume
Hundreds of Thousands of Events Per Second
Events Should Be Available Under a Second
Events Should Rarely Get Lost
Multiple Consumers
Natural Choice: Apache Kafka
- Low latency and high throughput
- Persistent events
- Distributes a topic by partitions
- Groups consumers by consumer groups
Event Processing
Transformation
Event Transformation Example
(Lat, Long) -> (zipcode, hexagon, S2)
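To make the hexagon part of this transformation concrete, here is a sketch that maps a planar point to the axial coordinates of its containing hexagon using the standard cube-rounding technique. It is not the deck's indexing scheme: treating coordinates as a flat plane, the pointy-top layout, the `size` parameter, and the name `point_to_hexagon` are all assumptions, and the zipcode and S2 lookups are not shown.

```python
import math

def point_to_hexagon(x, y, size):
    """Map planar (x, y) to axial (q, r) of the containing pointy-top hexagon.

    `size` is the hexagon's center-to-corner distance. Real lat/long data
    would first need a map projection; that step is omitted here.
    """
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Round in cube coordinates (q + r + s == 0), then repair the
    # component with the largest rounding error so the sum stays zero.
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr
```

Every point inside a hexagon maps to the same `(q, r)` pair, so the pair can serve as a grouping key for later aggregation.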
Pre-aggregation
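Pre-aggregation can be pictured as counting events per dimension tuple before they reach the query layer. The sketch below assumes events are dictionaries with hypothetical fields `hexagon`, `vehicle_type`, `status`, and `ts` (epoch seconds); none of these names come from the deck.

```python
from collections import Counter

def pre_aggregate(events):
    """Count events per (hexagon, vehicle_type, minute, status)."""
    counts = Counter()
    for e in events:
        minute = e["ts"] // 60  # epoch seconds -> minute bucket
        counts[(e["hexagon"], e["vehicle_type"], minute, e["status"])] += 1
    return counts
```

Collapsing raw events into minute-level counts like this trades per-event detail for a much smaller volume to store and query.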
Joining Multiple Streams
Sessionization
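The deck does not say how sessions are defined. One common approach, assumed here purely for illustration, is gap-based: consecutive events for the same key belong to one session as long as they are no more than `gap_seconds` apart. The function name and parameters are hypothetical.

```python
def sessionize(timestamps, gap_seconds):
    """Group one key's event timestamps into gap-based sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)   # close enough: extend the current session
        else:
            sessions.append([ts])     # gap too large: start a new session
    return sessions
```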
Multi-Staged Processing
Minimum Requirements
- State Management
- Checkpointing
- Automatic Resource Management
- Multi-staged processing
Apache Samza
Why Apache Samza?
- DAG on Kafka
- Excellent integration with Kafka
- Built-in checkpointing
- Built-in state management
- Excellent support from our data team
Samza Is Conceptually Simple
Slightly Expanded Version
Applications
Dashboard of Realtime Business Metrics
Ad-Hoc Queries
Visualization with Streaming
LocationUpdate where city=X
LocationUpdate where city=Y and vehicle='UberX'
[Figure labels: 100%, 100%, 100%, 10%, 5%]
LocationUpdate where city='SF'
LocationUpdate where city='LA' and vehicle
[Figure labels: 10%, 5%, 100%, 100%]
Ad-hoc Exploration
A Few Trade-Offs
Lambda vs Kappa
We Use Lambda
- Spark + HDFS/S3 for batch processing
- Yes, it is painful, but:
- We may need to go way back when business requirements change
- Batch processes can run faster; they scale differently
- It was not easy to start a new stream processing instance
Processing by Event Time Is Not Always Easy
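One reason event-time processing is hard is that events arrive out of order, so a minute bucket can receive data after it appears complete. The sketch below illustrates the problem with a simple allowed-lateness rule; the rule, the `(event_ts, arrival_ts)` tuple shape, and the function name are assumptions for illustration, not the approach taken in the deck.

```python
from collections import Counter

def bucket_by_event_time(events, allowed_lateness):
    """Count events per event-time minute, setting aside events that are too late.

    `events` is an iterable of (event_ts, arrival_ts) pairs in arrival
    order, both in seconds. An event is late when it trails the newest
    event time seen so far by more than `allowed_lateness` seconds.
    """
    counts = Counter()
    late = []
    newest = float("-inf")  # newest event time observed so far
    for event_ts, arrival_ts in events:
        newest = max(newest, event_ts)
        if newest - event_ts > allowed_lateness:
            late.append((event_ts, arrival_ts))
            continue
        counts[event_ts // 60] += 1
    return counts, late
```

Whatever ends up in `late` has to be dropped or merged into already-published results, which is where the choice of storage layer starts to matter.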
Leverage The Storage Layer
Dealing with Limitations of Samza
- No broadcasting. We have to override SystemStreamPartitionGrouper
- No dynamic topology. Can’t have an arbitrary number of nested CEP queries
- Tedious configuration and deployment of jobs. In-house code-gen and deployment solution
Thank You