Generating Real-time, Streaming Recommendations [NiFi + Kafka + Spark ML]
Kafka Summit SF, April 26, 2016
Who am I?
Chris Fregly, Principal Data Solutions Engineer @ IBM Spark Technology Center
Previously, Data Engineer @ Netflix and Databricks
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow
Author @ Advanced Spark (advancedspark.com)
Agenda
Live, Interactive Demo!
NiFi
Spark Streaming
Streaming Recommendations
Netflix Pipeline (Bonus!)
NiFi
NiFi = “Niagara Files”
Maintainers @ Hortonworks since 2015
Developed @ NSA over last 8+ years
Integrates with EVERYTHING!
Provides Data Provenance
Data Flow Management
[Photo: Me (Normal Guy) with Joe Witt, NiFi Co-Creator; labels: Buffalo Wild Wings, Hat]
NiFi Provenance Event Types
ATTRIBUTES_MODIFIED (e.g. Extract Topic Name)
CONTENT_MODIFIED (e.g. Enrich with Geo)
RECEIVE (e.g. Handle Http Request)
ROUTE (e.g. Check Http Method)
SEND (e.g. PutKafka)
DROP (e.g. Handle Http Response)
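To make the event types concrete, here is a minimal plain-Python sketch of the provenance trail one request might leave as it moves through the demo flow. It does not use any NiFi API: `FlowFile`, `run_flow`, the `kafka.topic` attribute name, and the `geo=unknown` enrichment are all hypothetical names chosen for illustration. Only the event type names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FlowFile:
    """Hypothetical stand-in for a piece of data moving through the flow."""
    content: str
    attributes: Dict[str, str] = field(default_factory=dict)


def run_flow(request_path: str, body: str) -> Tuple[FlowFile, List[str]]:
    """Push one HTTP-style request through the flow, recording one event per step."""
    trail: List[str] = []

    flow_file = FlowFile(content=body)
    trail.append("RECEIVE")              # Handle Http Request

    trail.append("ROUTE")                # Check Http Method

    # Attribute name "kafka.topic" is an assumption for this sketch.
    flow_file.attributes["kafka.topic"] = request_path.strip("/")
    trail.append("ATTRIBUTES_MODIFIED")  # Extract Topic Name

    # Placeholder enrichment; a real flow would look up geo data.
    flow_file.content = flow_file.content + ",geo=unknown"
    trail.append("CONTENT_MODIFIED")     # Enrich with Geo

    trail.append("SEND")                 # PutKafka

    trail.append("DROP")                 # Handle Http Response
    return flow_file, trail
```

Reading the trail back in order is what lets you answer "where did this record come from, and what touched it?" for any message that lands in Kafka.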
Spark Streaming
Submits Time-Based Micro Batches of Data as Spark Jobs
Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA!
Framework for Custom Streaming Receivers
Flexible Window Operations, Optimized State Management
Basic Back Pressure and Throttling Support
At Least Once Guarantees through Write Ahead Log (WAL)
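The micro-batch and window ideas above can be sketched in plain Python, without Spark. This is only an illustration of the concept, not the Spark Streaming API: `micro_batches` and `windowed_counts` are hypothetical helper names, and timestamps are assumed to be seconds since the start of the stream.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def micro_batches(events: List[Tuple[float, str]],
                  batch_interval: float) -> Dict[int, List[str]]:
    """Group (timestamp, value) events into fixed-interval batches keyed by batch index."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return dict(batches)


def windowed_counts(batches: Dict[int, List[str]],
                    window_len: int, slide: int) -> Dict[int, int]:
    """Count events per sliding window; window_len and slide are measured in batches.

    The result is keyed by the index of the last batch in each window.
    """
    if not batches:
        return {}
    counts: Dict[int, int] = {}
    for end in range(slide - 1, max(batches) + 1, slide):
        start = max(end - window_len + 1, 0)
        counts[end] = sum(len(batches.get(i, [])) for i in range(start, end + 1))
    return counts
```

With a 1-second batch interval, a window of 2 batches sliding by 2 batches summarizes every pair of consecutive batches exactly once.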
Spark Streaming KafkaRDD
Kafka “Direct” Streaming Implementation (Spark 1.4+)
Recover/Replay from Kafka using File System-like Offsets
Removes need for Write Ahead Log (WAL)
Uses Kafka, itself, as the WAL!
KafkaRDD
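Why do offsets remove the need for a separate WAL? A rough plain-Python sketch, assuming an in-memory list stands in for one Kafka partition: the consumer reads an explicit offset range and only advances its position after the batch succeeds, so a failed batch is simply re-read from the log. `OffsetTrackingConsumer` is a hypothetical name and this is not the Spark or Kafka client API.

```python
from typing import Callable, List


class OffsetTrackingConsumer:
    """Reads an append-only log by offset range; advances only after a batch succeeds."""

    def __init__(self, log: List[str]) -> None:
        self.log = log        # stand-in for one Kafka partition
        self.committed = 0    # next offset to read

    def process_batch(self, max_records: int,
                      handler: Callable[[List[str]], None]) -> bool:
        """Process the offset range [start, end); return True if it was committed."""
        start = self.committed
        end = min(start + max_records, len(self.log))
        records = self.log[start:end]
        try:
            handler(records)
        except Exception:
            # Offset is not advanced, so the same range is replayed next time.
            return False
        self.committed = end
        return True
```

Because a failed range is replayed, the handler can see the same records twice; its output needs to be idempotent or transactional if duplicates matter.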
Streaming Recommendations
Incremental Matrix Factorization!!
(Based on github.com/brkyvz/streaming-matrix-factorization)
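One common way to make matrix factorization incremental is to apply a single stochastic gradient descent step per incoming rating. The sketch below shows that generic technique in plain Python; it is not taken from the linked repository, and the class name `IncrementalMF` plus the `rank`, `lr`, and `reg` defaults are assumptions for illustration.

```python
import random
from typing import Dict, List


class IncrementalMF:
    """Generic SGD matrix factorization, updated one rating at a time."""

    def __init__(self, rank: int = 2, lr: float = 0.05,
                 reg: float = 0.01, seed: int = 42) -> None:
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.user_factors: Dict[str, List[float]] = {}
        self.item_factors: Dict[str, List[float]] = {}

    def _factors(self, table: Dict[str, List[float]], key: str) -> List[float]:
        """Return the factor vector for key, creating a small random one if new."""
        if key not in table:
            table[key] = [self.rng.uniform(0.0, 0.1) for _ in range(self.rank)]
        return table[key]

    def predict(self, user: str, item: str) -> float:
        """Dot product of user and item factors; 0.0 for unseen users or items."""
        p = self.user_factors.get(user)
        q = self.item_factors.get(item)
        if p is None or q is None:
            return 0.0
        return sum(a * b for a, b in zip(p, q))

    def update(self, user: str, item: str, rating: float) -> float:
        """Apply one SGD step for a single rating; return the error before the step."""
        p = self._factors(self.user_factors, user)
        q = self._factors(self.item_factors, item)
        err = rating - sum(a * b for a, b in zip(p, q))
        for k in range(self.rank):
            pk, qk = p[k], q[k]
            p[k] += self.lr * (err * qk - self.reg * pk)
            q[k] += self.lr * (err * pk - self.reg * qk)
        return err
```

In a streaming job, `update` would be called for each rating in a micro-batch, so the model tracks new users and items without a full retrain.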
Recommendation Serving Layer
Use Case: Recommendation Service Depends on Redis Cache
Problem: Redis Cache Goes Down!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States:
Closed: Service OK
Open: Service DOWN
Fallback to Non-Personalized Recommendations from Disk
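The circuit breaker pattern can be sketched in a few lines of plain Python. This is not the Hystrix API, and it simplifies the trip rule to a count of consecutive failures, which is an assumption of this sketch. `CircuitBreaker`, `failure_threshold`, and `reset_timeout` are hypothetical names, and the single trial call after the timeout goes beyond the two states listed above.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Closed: calls go to the primary. Open: calls go straight to the fallback."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        """True while the circuit is open and the reset timeout has not elapsed."""
        if self.opened_at is None:
            return False
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, primary: Callable[[], List[str]],
             fallback: Callable[[], List[str]]) -> List[str]:
        """Try the primary unless the circuit is open; fall back on any failure."""
        if self.is_open:
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Here `primary` would be the Redis-backed personalized lookup and `fallback` would return the non-personalized recommendations read from disk. While the circuit is open, the service stops hitting the dead cache entirely.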
Netflix Data Pipeline
9 million events, 22 GB per second @ peak!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority
Recommendations Pipeline
Batch Matrix Factorization
Keep Batch Video (V) Matrix
Calculate Newer User (U) Matrix
Compute U x V Dot Product
Save Model to Disk and EVCache
https://github.com/Netflix/EVCache
Throw away batch user factors (U)
Keep video factors (V)
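The "keep V, recompute U" idea can be illustrated with a small fold-in sketch: video factors from the batch job stay fixed, and a fresh user vector is fitted against them from the user's recent ratings, then scored with a dot product. The slides do not say how the newer U matrix is actually computed, so the gradient-descent fit here is an assumption, as are the names `fold_in_user` and `top_n` and the hyperparameter defaults.

```python
from typing import Dict, Iterable, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fold_in_user(video_factors: Dict[str, List[float]],
                 ratings: Dict[str, float], rank: int,
                 lr: float = 0.05, reg: float = 0.01,
                 steps: int = 200) -> List[float]:
    """Fit one user's vector by gradient descent while video factors stay fixed."""
    u = [0.1] * rank
    for _ in range(steps):
        for video, rating in ratings.items():
            v = video_factors[video]
            err = rating - dot(u, v)
            for k in range(rank):
                u[k] += lr * (err * v[k] - reg * u[k])
    return u


def top_n(u: List[float], video_factors: Dict[str, List[float]],
          n: int, exclude: Iterable[str] = ()) -> List[str]:
    """Rank videos by the U x V dot product, skipping the excluded ones."""
    skip = set(exclude)
    scores = {vid: dot(u, v) for vid, v in video_factors.items() if vid not in skip}
    return sorted(scores, key=lambda vid: scores[vid], reverse=True)[:n]
```

Only the user vector changes, so the batch video factors can be served from cache while user factors are refreshed as new ratings arrive.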