Change Data Capture with Mongo + Kafka
By Dan Harvey
High level stack
React.js - Website
Node.js - API Routing
Ruby on Rails + MongoDB - Core API
Java - Opinion Streams, Search, Suggestions
Redshift - SQL Analytics
Problems
• Keep user experience consistent
• Streams / search index need to update
• Keep developers efficient
• Loosely couple services
• Trust denormalisations
Use case
• User to User recommender
• Suggest “interesting” users to a user
• Update as soon as you make a new opinion
• Instant feedback for contributing content
Log transformation
[Diagram: the Rails API writes Opinion and User documents to Mongo as JSON/BSON. The Optailer tails the oplog (change data capture) and publishes Avro records to the Kafka User and Opinion topics, which the Java services, such as the User Recommender, consume (stream processing).]
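The tailer's routing step can be sketched in a few lines. This is an illustrative Python sketch with hypothetical record shapes (`ns` for the Mongo namespace, `o` for the document, as oplog entries use), not the deck's actual Java Optailer: read an oplog entry, pick the Kafka topic from the collection name, and emit the full document keyed by its `_id`.

```python
# Hypothetical sketch of oplog-to-topic routing; topic names and entry
# shapes are assumptions, not the production Optailer code.
def route(oplog_entry, topics):
    ns = oplog_entry["ns"]                 # e.g. "app.users"
    collection = ns.split(".", 1)[1]       # topic per collection
    doc = oplog_entry["o"]                 # the full document
    topics.setdefault(collection, []).append((doc["_id"], doc))

topics = {}
route({"ns": "app.users",    "o": {"_id": "u1", "name": "Dan"}}, topics)
route({"ns": "app.opinions", "o": {"_id": "o9", "user": "u1"}}, topics)
# topics now holds "users" and "opinions" records keyed by _id
```

Keying by `_id` is what lets the downstream compacted topic retain exactly one latest document per entity.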
Op(log)tailer
• Converts BSON/JSON to Avro
• Guarantees latest document in topic (eventually)
• Does not guarantee all changes
• Compacting Kafka topic (only keeps latest)
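The "latest document, not all changes" guarantee falls directly out of log compaction semantics. A minimal Python sketch (not the Kafka implementation) of what a compacted, keyed topic retains:

```python
# Sketch of Kafka log compaction: keep only the last record per key.
def compact(log):
    latest = {}
    for key, value in log:      # replay the full log in order
        latest[key] = value     # later writes overwrite earlier ones
    return list(latest.items())

# Three oplog-derived records, two of them for user "u1":
oplog = [("u1", {"name": "Dan"}),
         ("u2", {"name": "Ana"}),
         ("u1", {"name": "Dan", "bio": "CDC"})]

compacted = compact(oplog)
# The intermediate "u1" state is gone; only the latest document survives.
```

A consumer replaying the compacted topic therefore rebuilds the current state of every document, but cannot observe every historical change.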
Avro Schemas
• Each Kafka topic has a schema
• Schemas evolve over time
• Readers and Writers will have different schemas
• Allows us to update services independently
Schema Changes
• Schema to ID managed by Confluent registry
• Readers and writers discover schemas
• Avro deals with resolution to compiled schema
• Must be forwards and backwards compatible
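Avro's reader/writer resolution is what makes independent service updates safe. A hand-rolled Python sketch of the two rules that matter here (this is not the Avro library; the schema shape is invented for illustration): the reader ignores fields it does not know, and fills fields the writer omitted from declared defaults.

```python
# Simplified stand-in for Avro schema resolution (illustrative only).
READER_SCHEMA = {
    "fields": {
        "id":  {},                  # required field, no default
        "bio": {"default": ""},     # added later, so it carries a default
    }
}

def resolve(record, reader_schema):
    out = {}
    for name, spec in reader_schema["fields"].items():
        if name in record:
            out[name] = record[name]
        elif "default" in spec:
            out[name] = spec["default"]   # backwards compatibility
        else:
            raise ValueError(f"missing required field: {name}")
    return out                            # unknown writer fields are dropped

# Written with an older schema that had an extra field and no "bio":
old_record = {"id": 42, "tmp": "ignored"}
resolve(old_record, READER_SCHEMA)  # → {"id": 42, "bio": ""}
```

"Forwards and backwards compatible" in the slide means both directions of this resolution must succeed: new fields need defaults, and removed fields must have had them.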
Kafka message: byte[]
[Diagram: each Kafka message is framed as a schema ID (int) followed by the Avro-encoded message (byte[]).]
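That framing can be sketched with `struct`. The Confluent wire format is a magic byte (0) followed by a 4-byte big-endian schema ID and then the Avro payload; this minimal sketch implements that layout but is not the registry client's serializer:

```python
import struct

# Confluent-style framing: magic byte, 4-byte schema ID, then Avro bytes.
def frame(schema_id: int, payload: bytes) -> bytes:
    return b"\x00" + struct.pack(">I", schema_id) + payload

def unframe(message: bytes):
    assert message[0] == 0                         # magic byte
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]

msg = frame(7, b"avro-bytes")
unframe(msg)  # → (7, b"avro-bytes")
```

The reader looks up the embedded ID in the registry, fetches the writer's schema, and resolves it against its own compiled schema.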
Search indexing
• User / Topic / Opinion search
• Re-use Kafka topics from before
• Index from Kafka to Elasticsearch
• Need to update quickly and reliably
Samza Indexers
• Index from Kafka to Elasticsearch
• Used Samza for transform and loading
• Far less code than Java Kafka consumers
• Stores offsets and state in Kafka
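The indexer's core loop is small. This is a plain-Python illustration, not the Samza API: consume from a topic, transform each record, write it to the index, and periodically checkpoint the offset so a restart resumes where it left off (Samza keeps that checkpoint in Kafka itself).

```python
# Illustrative consume-transform-index loop with offset checkpointing;
# "index" stands in for Elasticsearch, "state" for Samza's Kafka-backed store.
def transform(record):
    return {"id": record["user_id"], "text": record["text"].lower()}

def run_indexer(topic, index, state, batch=2):
    offset = state.get("offset", 0)        # resume from last checkpoint
    while offset < len(topic):
        doc = transform(topic[offset])
        index[doc["id"]] = doc             # stand-in for an ES write
        offset += 1
        if offset % batch == 0:
            state["offset"] = offset       # periodic checkpoint
    state["offset"] = offset

topic = [{"user_id": "u1", "text": "Hello"},
         {"user_id": "u2", "text": "World"}]
index, state = {}, {}
run_indexer(topic, index, state)
```

Because writes are keyed by document ID, replaying a batch after a crash just overwrites the same index entries, so the pipeline stays idempotent.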
Elasticsearch Producer
• Samza consumers/producers deal with I/O
• Wrote new ElasticsearchSystemProducer
• Contributed back to Samza project
• Included in Samza 0.10.0 (released soon)
Samza Good/Bad
• Good API
• Simple transformations easy
• Simple ops: logging, metrics all built in
• Only depends on Kafka
• Inbuilt state management
• Joins tricky, need consistent partitioning
• Complex flows are hard (Flink/Spark better)
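Why joins need consistent partitioning: both input streams must hash the join key identically, so matching records land in the same partition and hence the same task. A toy sketch (the hash is a deterministic stand-in, not Kafka's actual murmur2 partitioner):

```python
# Co-partitioning sketch: same key + same hash + same partition count
# means one task sees both sides of the join.
def hash_key(key):
    return sum(key.encode())        # stand-in for murmur2

def partition(key, num_partitions):
    return hash_key(key) % num_partitions

users    = ("user-topic",    partition("u42", 4))
opinions = ("opinion-topic", partition("u42", 4))
# Same key, same partition index on both topics.
```

If the two topics differ in partition count or partitioner, the join silently misses matches, which is part of why the slide calls joins tricky.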
Decoupling Good/Bad
• Easy to try out complex new services
• Easy to keep data stores in sync, low latency
• Started to duplicate core logic
• More overhead with more services
• Need high level framework for denormalisations
• Samza SQL being developed
Ruby Workers
• Ruby Kafka consumers not great…
• Optailer to AWS SQS (Shoryuken gem)
• No order guarantee like Kafka topics
• But guaranteed trigger off database writes
• Better for core data transformations
Future
• Segment.io user interaction logs to Kafka
• Use in product, view counts, etc…
• Fill Redshift for analytics (currently batch)
• Kafka CopyCat instead of our Optailer
• Avro transformation in Samza