133
Stream Processing & Analytics with Flink Danny Yuan, Engineer @ Uber @g9yuayon

Streaming Analytics in Uber

Embed Size (px)

Citation preview

Page 1: Streaming Analytics in Uber

Stream Processing & Analytics with FlinkDanny Yuan, Engineer @ Uber

@g9yuayon

Page 2: Streaming Analytics in Uber

Four Kinds of Analytics

- On demand aggregation and pattern detection

- Clustering

- Forecasting

- Pattern detection on geo-temporal data

Page 3: Streaming Analytics in Uber

Two Ingredients

Geo/Spatial Time

Page 4: Streaming Analytics in Uber

Real-time aggregation and pattern matching

Page 5: Streaming Analytics in Uber

Slide title

Complex Event Processing

Page 6: Streaming Analytics in Uber

How many cars enter and exit a user defined area in past 5 minutes

Examples

Page 7: Streaming Analytics in Uber

It doesn’t have to be within a time or areaCEP with full historical context

Notify me if a partner completed her 100th trip in a given area just now?

Page 8: Streaming Analytics in Uber

Patterns in the future

How many first-time riders will be dropped off in a given area in the next 5

minutes?

Page 9: Streaming Analytics in Uber

Patterns in the future

How many first-time riders will be dropped off in a given area in the next 5

minutes?

Page 10: Streaming Analytics in Uber

Geo: user flexibility is important

Page 11: Streaming Analytics in Uber

Geo: user flexibility is important

Page 12: Streaming Analytics in Uber

It needs to be scalable

Page 13: Streaming Analytics in Uber

It needs to be scalable

Page 14: Streaming Analytics in Uber

It needs to be scalable

- Every hexagon

- Every driver/rider

Page 15: Streaming Analytics in Uber

CEP Pipeline Built on Samza

- No hard-coded CEP rules

- Applying CEP rules per individual entity: topic, driver,

rider, cohorts, and etc

- Flexible checkpointing and statement management

Page 16: Streaming Analytics in Uber

Slide title

Page 17: Streaming Analytics in Uber

Slide title

Page 18: Streaming Analytics in Uber

Slide title

Page 19: Streaming Analytics in Uber

Slide title

Page 20: Streaming Analytics in Uber

Slide title

Page 21: Streaming Analytics in Uber

Slide title

Page 22: Streaming Analytics in Uber

Slide title

Page 23: Streaming Analytics in Uber

Slide title

Page 24: Streaming Analytics in Uber

We need to evolve our architecture for other analytics

Page 25: Streaming Analytics in Uber

Clustering

Page 26: Streaming Analytics in Uber

Slide title Manually Created Cluster

Page 27: Streaming Analytics in Uber

Slide titleCall for algorithmically created clusters

- Clustering based on key performance metrics

Page 28: Streaming Analytics in Uber

Slide titleCall for algorithmically created cluster

- Clustering based on key performance metrics

- Continuously measure the clusters

Page 29: Streaming Analytics in Uber

Slide titleCall for algorithmically created clusters

- Clustering based on key performance metrics

- Continuously measure the clusters

- Different clustering for different business needs

Page 30: Streaming Analytics in Uber

Slide titleCall for algorithmically created clusters

- Clustering based on key performance metrics

- Continuously measure the clusters

- Different clustering for different business needs

- Create clusters in minutes for all cities

Page 31: Streaming Analytics in Uber

Slide titleCall for algorithmically created clusters

- Clustering based on key performance metrics

- Continuously measure the clusters

- Different clustering for different business needs

- Create clusters in minutes for all cities

- Foundation for other stream analytics

Page 32: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

Page 33: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

Page 34: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

- Pluggable algorithms and measurements

Page 35: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

- Easily pluggable algorithms and measurements

- Historical geo-temporal data for clustering

Page 36: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

- Easily pluggable algorithms and measurements

- Historical geo-temporal data for clustering

- Real-time geo-temporal data for measurement

Page 37: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

- Easily pluggable algorithms and measurements

- Historical geo-temporal data for clustering

- Real-time geo-temporal data for measurement

- Shared optimizations

Page 38: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

- Easily pluggable algorithms and measurements

- Historical geo-temporal data for clustering

- Real-time geo-temporal data for measurement

- Shared optimizations. To put things in perspective:

- 70,000 hexagons in SF

- Naive distance function requires at least 70,000 x

70,000 = 4.9 billion pairs!

Page 39: Streaming Analytics in Uber

Slide titleHome-grown Clustering Service

- All cities under 3 minutes

- Easily pluggable algorithms and measurements

- Historical geo-temporal data for clustering

- Real-time geo-temporal data for measurement

- Shared optimizations

- Incremental updates

- Compact data representation

- Memoization

- Avoid anything more complex than O(nlog(n))

Page 40: Streaming Analytics in Uber

Forecasting- Every decision is based on forecasting

Page 41: Streaming Analytics in Uber

Forecasting- Forecasting based on both historical data and stream input

Page 42: Streaming Analytics in Uber

Forecasting- Forecasting based on both historical data and stream input

Page 43: Streaming Analytics in Uber

Forecasting- Forecasting based on both historical data and stream input

Anomaly, or emerging demand?

Page 44: Streaming Analytics in Uber

Forecasting- Spatially granular forecasting - down to every hexagon

Page 45: Streaming Analytics in Uber

Forecasting- Spatially granular forecasting - down to every hexagon

Page 46: Streaming Analytics in Uber

Forecasting- Temporally granular forecasting - down to every minute

Page 47: Streaming Analytics in Uber

Forecasting- Temporally granular forecasting - down to every minute

Page 48: Streaming Analytics in Uber

Pattern Detection

- Similarity of different metrics across geolocation and time

- Metric outliers across geolocations and time

- Frequent occurrences of certain patterns

- Clustered behavior

- Anomalies

Page 49: Streaming Analytics in Uber

Common Requirements in Pattern Detection

- Not just traditional time series analysis

- Incorporating insights on marketplace data

- Required both historical data and real-time input

- Spatially granular patterns - down to every hexagon

- Temporally granular patterns - down to every minute

Page 50: Streaming Analytics in Uber

Example: Anomaly Detection

- Simple time series analysis

- For a single geo area

- Can be noisy

Page 51: Streaming Analytics in Uber

A More Realistic Anomaly Detection

Page 52: Streaming Analytics in Uber

Example: Anomaly Detection

Page 53: Streaming Analytics in Uber

Example: Anomaly Detection

Page 54: Streaming Analytics in Uber

What’s the right architecture to support the analytics use cases?

Page 55: Streaming Analytics in Uber

Shared abstraction: multi-dimensional geo-temporal data

Page 56: Streaming Analytics in Uber

- Time series by event time

Shared abstraction: multi-dimensional geo-temporal data

Page 57: Streaming Analytics in Uber

- Time series by event time

Shared abstraction: multi-dimensional geo-temporal data

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Page 58: Streaming Analytics in Uber

- Time series by event time

Shared abstraction: multi-dimensional geo-temporal data

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Page 59: Streaming Analytics in Uber

- Time series by event time

Shared abstraction: multi-dimensional geo-temporal data

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Page 60: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

Shared abstraction: multi-dimensional geo-temporal data

Page 61: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- e.g. event-based triggers

Shared abstraction: multi-dimensional geo-temporal data

Page 62: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- e.g. event-based triggers

- e.g., triggers of computation results

Shared abstraction: multi-dimensional geo-temporal data

Page 63: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing

Shared abstraction: multi-dimensional geo-temporal data

Page 64: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing. E.g.,

Shared abstraction: multi-dimensional geo-temporal data

Page 65: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing. E.g.,

Shared abstraction: multi-dimensional geo-temporal data

Page 66: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing. E.g.,

Shared abstraction: multi-dimensional geo-temporal data

Page 67: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing. E.g.,

Shared abstraction: multi-dimensional geo-temporal data

Page 68: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing. E.g.,

State

Shared abstraction: multi-dimensional geo-temporal data

Page 69: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing. E.g.,

State per key

Shared abstraction: multi-dimensional geo-temporal data

Page 70: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing

- Unified stream

Shared abstraction: multi-dimensional geo-temporal data

Page 71: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing

- Unified stream

- Real-time streams: unbounded streams

Shared abstraction: multi-dimensional geo-temporal data

Page 72: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing

- Unified stream

- Real-time streams: unbounded streams - Batch: bounded streams

Shared abstraction: multi-dimensional geo-temporal data

Page 73: Streaming Analytics in Uber

- Time series by event time

- Flexible windowing - tumbling, sliding, conditionally triggered

- Stateful processing

- Unified stream

- Real-time streams: unbounded streams - Batch: bounded streams - s/lambda/kappa

Shared abstraction: multi-dimensional geo-temporal data

Page 74: Streaming Analytics in Uber

- Ordering by event time

- Flexible windowing with watermark and triggers

- Exactly-once semantics

- Built-in state management and checkpointing

- Nice data flow APIs

Apache Flink

Page 75: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 76: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 77: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 78: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 79: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 80: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 81: Streaming Analytics in Uber

Mental Picture for Processing Geo-temporal Data

Page 82: Streaming Analytics in Uber

A Simple Example: simple prediction

Sources

.fromKafka()

.config(config)

.cluster(aCluster)

.topics(topicList)

Page 83: Streaming Analytics in Uber

A Simple Example

assignTimestampsAndWatermarks

Page 84: Streaming Analytics in Uber

A Simple Example

keyBy(…)

Page 85: Streaming Analytics in Uber

A Simple Example

.timeWindow(…)

Page 86: Streaming Analytics in Uber

A Simple Example

.flatMap(…)

Page 87: Streaming Analytics in Uber

A Simple Example

.keyBy(…)

Page 88: Streaming Analytics in Uber

A Simple Example

.apply(statefulFn)

Page 89: Streaming Analytics in Uber

A Simple Example

.addSink(…)

Page 90: Streaming Analytics in Uber

High Level Data Flow

Page 91: Streaming Analytics in Uber

High Level Data Flow

Page 92: Streaming Analytics in Uber

High Level Data Flow

Page 93: Streaming Analytics in Uber

High Level Data Flow

Page 94: Streaming Analytics in Uber

High Level Data Flow

Page 95: Streaming Analytics in Uber

High Level Data Flow

Page 96: Streaming Analytics in Uber

High Level Data Flow

Page 97: Streaming Analytics in Uber

High Level Data Flow

Page 98: Streaming Analytics in Uber

High Level Data Flow

Page 99: Streaming Analytics in Uber

High Level Data Flow

Page 100: Streaming Analytics in Uber

High Level Data Flow

Page 101: Streaming Analytics in Uber

Geotemporal API for efficiency

Page 102: Streaming Analytics in Uber

Geotemporal API for efficiency

Page 103: Streaming Analytics in Uber

Geotemporal API for productivity

Page 104: Streaming Analytics in Uber

Geotemporal API for productivity

Page 105: Streaming Analytics in Uber

Geotemporal API for productivity

Page 106: Streaming Analytics in Uber

Forecasting as an example

Page 107: Streaming Analytics in Uber

Forecasting as an example

Page 108: Streaming Analytics in Uber

Forecasting as an example

Page 109: Streaming Analytics in Uber

Forecasting as an example

Page 110: Streaming Analytics in Uber

Forecasting as an example

Page 111: Streaming Analytics in Uber

Forecasting as an example

Page 112: Streaming Analytics in Uber

Forecasting as an example

Page 113: Streaming Analytics in Uber

Forecasting as an example

Page 114: Streaming Analytics in Uber

Forecasting as an example

Page 115: Streaming Analytics in Uber

Forecasting as an example

Page 116: Streaming Analytics in Uber

Forecasting as an example

Page 117: Streaming Analytics in Uber

Forecasting as an example

Page 118: Streaming Analytics in Uber

Forecasting as an example

Page 119: Streaming Analytics in Uber

Forecasting as an example

Page 120: Streaming Analytics in Uber

Forecasting as an example

Page 121: Streaming Analytics in Uber

Forecasting as an example

Page 122: Streaming Analytics in Uber

Forecasting as an example

Page 123: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

- Scaling up, namely single-node optimization matters

- Ensure exactly-once by proper data modeling

- Use external state store to avoid too much snapshotting

- Standardize monitoring and data validation

Page 124: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

Page 125: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

Page 126: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

Page 127: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

Page 128: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

- Scaling up, namely single-node optimization matters

Page 129: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

- Scaling up, namely single-node optimization matters

- Ensure exactly-once by proper data modeling

Page 130: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

- Scaling up, namely single-node optimization matters

- Ensure exactly-once by proper data modeling

- Use external state store to avoid too much snapshotting

- Standardize monitoring and data validation

Page 131: Streaming Analytics in Uber

Lessons Learned

- Make sure you have robust infrastructure support

- Scaling up, namely single-node optimization matters

- Ensure exactly-once by proper data modeling

- Standardize monitoring and data validation

Page 132: Streaming Analytics in Uber

Choose a Stream Processing Platform

Page 133: Streaming Analytics in Uber

Thank You