Change Data Capture with Mongo + Kafka
By Dan Harvey
High level stack
React.js - Website
Node.js - API Routing
Ruby on Rails + MongoDB - Core API
Java - Opinion Streams, Search, Suggestions
Redshift - SQL Analytics
Problems
• Keep user experience consistent
• Streams / search index need to update
• Keep developers efficient
• Loosely couple services
• Trust denormalisations
Use case
• User to User recommender
• Suggest “interesting” users to a user
• Update as soon as you make a new opinion
• Instant feedback for contributing content
Log transformation
[Diagram: the Rails API writes Opinion and User documents to Mongo as JSON/BSON. The Optailer tails the oplog (change data capture) and publishes Avro records to the Kafka User and Opinion topics, which the Java services, such as the User Recommender, consume (stream processing).]
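The tailer's routing step can be sketched in a few lines. This is an illustrative Python sketch with hypothetical record shapes (`ns` for the Mongo namespace, `o` for the document, as oplog entries use), not the deck's actual Java Optailer: read an oplog entry, pick the Kafka topic from the collection name, and emit the full document keyed by its `_id`.

```python
# Hypothetical sketch of oplog-to-topic routing; topic names and entry
# shapes are assumptions, not the production Optailer code.
def route(oplog_entry, topics):
    ns = oplog_entry["ns"]                 # e.g. "app.users"
    collection = ns.split(".", 1)[1]       # topic per collection
    doc = oplog_entry["o"]                 # the full document
    topics.setdefault(collection, []).append((doc["_id"], doc))

topics = {}
route({"ns": "app.users",    "o": {"_id": "u1", "name": "Dan"}}, topics)
route({"ns": "app.opinions", "o": {"_id": "o9", "user": "u1"}}, topics)
# topics now holds "users" and "opinions" records keyed by _id
```

Keying by `_id` is what lets the downstream compacted topic retain exactly one latest document per entity.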
Op(log)tailer
• Converts BSON/JSON to Avro
• Guarantees latest document in topic (eventually)
• Does not guarantee all changes
• Compacting Kafka topic (only keeps latest)
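The "latest document, not all changes" guarantee falls directly out of log compaction semantics. A minimal Python sketch (not the Kafka implementation) of what a compacted, keyed topic retains:

```python
# Sketch of Kafka log compaction: keep only the last record per key.
def compact(log):
    latest = {}
    for key, value in log:      # replay the full log in order
        latest[key] = value     # later writes overwrite earlier ones
    return list(latest.items())

# Three oplog-derived records, two of them for user "u1":
oplog = [("u1", {"name": "Dan"}),
         ("u2", {"name": "Ana"}),
         ("u1", {"name": "Dan", "bio": "CDC"})]

compacted = compact(oplog)
# The intermediate "u1" state is gone; only the latest document survives.
```

A consumer replaying the compacted topic therefore rebuilds the current state of every document, but cannot observe every historical change.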
Avro Schemas
• Each Kafka topic has a schema
• Schemas evolve over time
• Readers and Writers will have different schemas
• Allows us to update services independently
Schema Changes
• Schema to ID managed by Confluent registry
• Readers and writers discover schemas
• Avro deals with resolution to compiled schema
• Must be forwards and backwards compatible
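Avro's reader/writer resolution is what makes independent service updates safe. A hand-rolled Python sketch of the two rules that matter here (this is not the Avro library; the schema shape is invented for illustration): the reader ignores fields it does not know, and fills fields the writer omitted from declared defaults.

```python
# Simplified stand-in for Avro schema resolution (illustrative only).
READER_SCHEMA = {
    "fields": {
        "id":  {},                  # required field, no default
        "bio": {"default": ""},     # added later, so it carries a default
    }
}

def resolve(record, reader_schema):
    out = {}
    for name, spec in reader_schema["fields"].items():
        if name in record:
            out[name] = record[name]
        elif "default" in spec:
            out[name] = spec["default"]   # backwards compatibility
        else:
            raise ValueError(f"missing required field: {name}")
    return out                            # unknown writer fields are dropped

# Written with an older schema that had an extra field and no "bio":
old_record = {"id": 42, "tmp": "ignored"}
resolve(old_record, READER_SCHEMA)  # → {"id": 42, "bio": ""}
```

"Forwards and backwards compatible" in the slide means both directions of this resolution must succeed: new fields need defaults, and removed fields must have had them.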
Kafka message: byte[]
[Diagram: each Kafka message is framed as a schema ID (int) followed by the Avro-encoded message (byte[]).]
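That framing can be sketched with `struct`. The Confluent wire format is a magic byte (0) followed by a 4-byte big-endian schema ID and then the Avro payload; this minimal sketch implements that layout but is not the registry client's serializer:

```python
import struct

# Confluent-style framing: magic byte, 4-byte schema ID, then Avro bytes.
def frame(schema_id: int, payload: bytes) -> bytes:
    return b"\x00" + struct.pack(">I", schema_id) + payload

def unframe(message: bytes):
    assert message[0] == 0                         # magic byte
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]

msg = frame(7, b"avro-bytes")
unframe(msg)  # → (7, b"avro-bytes")
```

The reader looks up the embedded ID in the registry, fetches the writer's schema, and resolves it against its own compiled schema.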
Search indexing
• User / Topic / Opinion search
• Re-use Kafka topics from before
• Index from Kafka to Elasticsearch
• Need to update quickly and reliably
Samza Indexers
• Index from Kafka to Elasticsearch
• Used Samza for transform and loading
• Far less code than Java Kafka consumers
• Stores offsets and state in Kafka
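The indexer's core loop is small. This is a plain-Python illustration, not the Samza API: consume from a topic, transform each record, write it to the index, and periodically checkpoint the offset so a restart resumes where it left off (Samza keeps that checkpoint in Kafka itself).

```python
# Illustrative consume-transform-index loop with offset checkpointing;
# "index" stands in for Elasticsearch, "state" for Samza's Kafka-backed store.
def transform(record):
    return {"id": record["user_id"], "text": record["text"].lower()}

def run_indexer(topic, index, state, batch=2):
    offset = state.get("offset", 0)        # resume from last checkpoint
    while offset < len(topic):
        doc = transform(topic[offset])
        index[doc["id"]] = doc             # stand-in for an ES write
        offset += 1
        if offset % batch == 0:
            state["offset"] = offset       # periodic checkpoint
    state["offset"] = offset

topic = [{"user_id": "u1", "text": "Hello"},
         {"user_id": "u2", "text": "World"}]
index, state = {}, {}
run_indexer(topic, index, state)
```

Because writes are keyed by document ID, replaying a batch after a crash just overwrites the same index entries, so the pipeline stays idempotent.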
Elasticsearch Producer
• Samza consumers/producers deal with I/O
• Wrote new ElasticsearchSystemProducer
• Contributed back to Samza project
• Included in Samza 0.10.0 (released soon)
Samza Good/Bad
• Good API
• Simple transformations easy
• Simple ops: logging, metrics all built in
• Only depends on Kafka
• Inbuilt state management
• Joins tricky, need consistent partitioning
• Complex flows are hard (Flink/Spark better)
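Why joins need consistent partitioning: both input streams must hash the join key identically, so matching records land in the same partition and hence the same task. A toy sketch (the hash is a deterministic stand-in, not Kafka's actual murmur2 partitioner):

```python
# Co-partitioning sketch: same key + same hash + same partition count
# means one task sees both sides of the join.
def hash_key(key):
    return sum(key.encode())        # stand-in for murmur2

def partition(key, num_partitions):
    return hash_key(key) % num_partitions

users    = ("user-topic",    partition("u42", 4))
opinions = ("opinion-topic", partition("u42", 4))
# Same key, same partition index on both topics.
```

If the two topics differ in partition count or partitioner, the join silently misses matches, which is part of why the slide calls joins tricky.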
Decoupling Good/Bad
• Easy to try out complex new services
• Easy to keep data stores in sync, low latency
• Started to duplicate core logic
• More overhead with more services
• Need high level framework for denormalisations
• Samza SQL being developed
Ruby Workers
• Ruby Kafka consumers not great…
• Optailer to AWS SQS (Shoryuken gem)
• No order guarantee like Kafka topics
• But guaranteed trigger off database writes
• Better for core data transformations
Future
• Segment.io user interaction logs to Kafka
• Use in product, view counts, etc…
• Fill Redshift for analytics (currently batch)
• Kafka CopyCat instead of our Optailer
• Avro transformation in Samza