MANCHESTER LONDON NEW YORK

Data in Motion: Streaming Static Data Efficiently


Martin Zapletal @zapletal_martin #ScalaDays

Data in Motion: Streaming Static Data Efficiently in Akka Persistence (and elsewhere)

@cakesolutions

Databases

Batch processing

Data at scale

● Reactive
● Real time, asynchronous and message driven
● Elastic and scalable
● Resilient and fault tolerant

Streams

Streaming static data

● Turning database into a stream

Pulling data from source

[Figure: animation of a consumer repeatedly pulling the rows 0, 5 and 10 from a table. The last two frames are labelled "Inserts" (a row 1 is added to the table) and "Updates" (the value of row 5 changes to 55).]
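One way to read the "Inserts" and "Updates" frames is that a cursor-based poll can miss changes that land behind the cursor. The following is a minimal sketch of that reading, not code from the talk; the in-memory `Table`, the `poll` function and the cursor scheme are all hypothetical.

```scala
import scala.collection.immutable.SortedMap

object PollingSketch {
  // Hypothetical in-memory "table": key -> value, ordered by key.
  type Table = SortedMap[Int, Int]

  // One poll: return the rows with a key greater than the cursor, plus the new cursor.
  def poll(table: Table, cursor: Int): (List[(Int, Int)], Int) = {
    val rows = table.iteratorFrom(cursor + 1).toList
    val next = rows.lastOption.map(_._1).getOrElse(cursor)
    (rows, next)
  }

  val initial: Table = SortedMap(0 -> 0, 5 -> 5, 10 -> 10)
  val (firstBatch, cursor1) = poll(initial, -1)

  // An insert below the cursor (key 1) and an update of an existing row (key 5).
  val changed: Table = initial + (1 -> 1) + (5 -> 55)
  val (secondBatch, cursor2) = poll(changed, cursor1)
}
```

In this sketch the second poll returns nothing, so neither the insert nor the update reaches the consumer.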

Pushing data from source

● Change log, change data capture

[Figure: animation of a table holding 0, 5 and 10; when row 1 is inserted, the change is pushed to the consumer.]

Infinite streams of finite data source

● Consistent snapshot and change log

[Figure: the consumer first receives a snapshot of the rows 0, 5 and 10, then the subsequent change (row 1).]

Log data structure

Offset  Key  Entry
0       0    Inserted value 0
1       5    Inserted value 5
2       10   Inserted value 10
3       1    Inserted value 1
4       5    Inserted value 55
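The log above can be modelled as an append-only sequence that readers address by offset. This is a small illustrative sketch only; `Log`, `append` and `readFrom` are hypothetical names, not part of any library mentioned in the talk.

```scala
object LogSketch {
  // An append-only log: entries are never modified, only added at the end.
  final case class Log[A](entries: Vector[A]) {
    def append(a: A): Log[A] = Log(entries :+ a)

    // Read everything from the given offset, paired with its offset.
    def readFrom(offset: Int): Vector[(Int, A)] =
      entries.zipWithIndex.drop(offset).map { case (a, i) => (i, a) }
  }

  // The five entries shown on the slide.
  val log: Log[String] = List(0, 5, 10, 1, 55)
    .foldLeft(Log(Vector.empty[String]))((l, v) => l.append(s"Inserted value $v"))

  // A reader that has already consumed offsets 0 to 2 resumes at offset 3.
  val tail: Vector[(Int, String)] = log.readFrom(3)
}
```

Because entries are only ever appended, a reader that remembers its offset sees every later change, including the update that the table-polling approach can miss.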

Pulling data from a log

[Figure: animation of a consumer pulling the entries 0, 5 and 10 from the log in order; a newly appended entry 15 is picked up by a later pull.]

Akka Persistence

[Figure: events 1 to 4 of persistence_id1 stored in the journal.]

Akka Persistence Query

● eventsByPersistenceId, allPersistenceIds, eventsByTag

[Figure: events 1 to 4 of persistence_id1 read back from the journal as a stream.]

Akka Persistence Query Cassandra

● Purely pull
● Event (log) data

persistence_id  partition_nr  events
0               0             event 0, event 1, event 2
0               1             event 100, event 101, event 102
1               0             event 0, event 1, event 2

Actor publisher

    private[query] abstract class QueryActorPublisher[MessageType, State: ClassTag](
        refreshInterval: Option[FiniteDuration])
      extends ActorPublisher[MessageType] {

      protected def initialState: Future[State]
      protected def initialQuery(initialState: State): Future[Action]
      protected def requestNext(state: State, resultSet: ResultSet): Future[Action]
      protected def requestNextFinished(state: State, resultSet: ResultSet): Future[Action]
      protected def updateState(state: State, row: Row): (Option[MessageType], State)
      protected def completionCondition(state: State): Boolean

      private[this] def nextBehavior(...): Receive = {
        if (shouldFetchMore(...)) {
          listenableFutureToFuture(resultSet.fetchMoreResults()).map(FetchedResultSet).pipeTo(self)
          awaiting(resultSet, state, finished)
        } else if (shouldIdle(...)) {
          idle(resultSet, state, finished)
        } else if (shouldComplete(...)) {
          onCompleteThenStop()
          Actor.emptyBehavior
        } else if (shouldRequestMore(...)) {
          if (finished) requestNextFinished(state, resultSet).pipeTo(self)
          else requestNext(state, resultSet).pipeTo(self)
          awaiting(resultSet, state, finished)
        } else {
          idle(resultSet, state, finished)
        }
      }
    }

[Figure: state machine of the actor publisher. Labels shown: initialQuery, initialNewResultSet, initialFinished, request, newResultSet, fetchedResultSet, finished, continue, shouldFetchMore, shouldIdle, shouldTerminate, shouldRequestMore, Cancel, SubscriptionTimeout.]

● Red transitions: deliver buffer and update internal state (progress)
● Blue transitions: asynchronous database query

Events by persistence id

    SELECT * FROM ${tableName} WHERE
      persistence_id = ? AND
      partition_nr = ? AND
      sequence_nr >= ? AND
      sequence_nr <= ?

[Figure: animation of the query reading persistence_id 0 one partition at a time: event 0, event 1 and event 2 from partition 0, then event 100, event 101 and event 102 from partition 1.]
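The partition-by-partition read shown above can be sketched with plain collections. This is a simplified illustration under stated assumptions: the `journal` map stands in for the Cassandra table, partitions are assumed to be numbered from 0 without gaps, and the scan stops at the first unused partition. The real plugin handles more cases (for example starting at the partition that contains the first requested sequence number).

```scala
object PartitionScanSketch {
  // Hypothetical journal layout: (persistence_id, partition_nr) -> sequence numbers.
  val journal: Map[(String, Long), Vector[Long]] = Map(
    ("0", 0L) -> Vector(0L, 1L, 2L),
    ("0", 1L) -> Vector(100L, 101L, 102L),
    ("1", 0L) -> Vector(0L, 1L, 2L)
  )

  // Walk the partitions of one persistence id in order, stop at the first
  // unused partition, and keep only the requested sequence number range.
  def eventsByPersistenceId(pid: String, from: Long, to: Long): Iterator[Long] =
    Iterator.from(0)
      .map(pnr => journal.getOrElse((pid, pnr.toLong), Vector.empty))
      .takeWhile(_.nonEmpty)
      .flatMap(rows => rows.filter(s => s >= from && s <= to))

  val result: List[Long] = eventsByPersistenceId("0", 1L, 101L).toList
}
```

The iterator is lazy, so a partition is only looked up once the consumer has drained the previous one, which mirrors the pull-based behaviour of the publisher.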

    private[query] class EventsByPersistenceIdPublisher(...)
      extends QueryActorPublisher[PersistentRepr, EventsByPersistenceIdState](...) {

      override protected def initialState: Future[EventsByPersistenceIdState] = {
        ...
        EventsByPersistenceIdState(initialFromSequenceNr, 0, currentPnr)
      }

      override protected def updateState(
          state: EventsByPersistenceIdState,
          row: Row): (Option[PersistentRepr], EventsByPersistenceIdState) = {
        val event = extractEvent(row)
        val partitionNr = row.getLong("partition_nr") + 1

        // Emit the event and advance the expected sequence number, count and partition.
        (Some(event),
          EventsByPersistenceIdState(event.sequenceNr + 1, state.count + 1, partitionNr))
      }
    }

All persistence ids

    SELECT DISTINCT persistence_id, partition_nr FROM $tableName

[Figure: animation of the query scanning the (persistence_id, partition_nr) keys (0, 0), (0, 1) and (1, 0) and emitting each persistence_id it has not seen before.]

    private[query] class AllPersistenceIdsPublisher(...)
      extends QueryActorPublisher[String, AllPersistenceIdsState](...) {

      override protected def initialState: Future[AllPersistenceIdsState] =
        Future.successful(AllPersistenceIdsState(Set.empty))

      override protected def updateState(
          state: AllPersistenceIdsState,
          row: Row): (Option[String], AllPersistenceIdsState) = {

        val event = row.getString("persistence_id")

        if (state.knownPersistenceIds.contains(event)) {
          (None, state)
        } else {
          (Some(event), state.copy(knownPersistenceIds = state.knownPersistenceIds + event))
        }
      }
    }
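To see what the `updateState` logic above does to a result set, it can be folded over a few rows outside of any actor. This is a standalone sketch with hypothetical names (`State`, `rows`, `emitted`); only the shape of `updateState` follows the slide.

```scala
object AllIdsSketch {
  final case class State(known: Set[String])

  // Same shape as updateState above: emit an id only the first time it is seen.
  def updateState(state: State, id: String): (Option[String], State) =
    if (state.known.contains(id)) (None, state)
    else (Some(id), State(state.known + id))

  // Rows as (persistence_id, partition_nr), as the DISTINCT query might return them.
  val rows: List[(String, Long)] = List(("0", 0L), ("0", 1L), ("1", 0L))

  private val folded = rows.foldLeft((List.empty[String], State(Set.empty))) {
    case ((out, st), (id, _)) =>
      val (maybeId, next) = updateState(st, id)
      (out ++ maybeId, next)
  }

  val emitted: List[String] = folded._1
}
```

The query returns persistence id 0 twice (once per partition), and the set of known ids is what keeps the second occurrence out of the stream.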

Events by tag

[Figure: animation over the journal partitions (persistence_id, partition_nr) in which some events carry tag 1; the tagged events are picked out of the partitions one at a time.]

Events by tag

    SELECT * FROM $eventsByTagViewName$tagId WHERE
      tag$tagId = ? AND
      timebucket = ? AND
      timestamp > ? AND
      timestamp <= ?
      ORDER BY timestamp ASC
      LIMIT ?

[Figure: animation of a view keyed by tag and time bucket (tag 1 1/1/2016, tag 1 1/2/2016) holding Id 0 event 1, Id 1 event 2, Id 0 event 100 and Id 0 event 2, read in timestamp order.]

[Figure: the same view read alongside a table of persistence_id and seq that records the last sequence number seen per persistence id; an entry is shown as "?" while it is unknown.]

    seqNumbers match {
      case None =>
        replyTo ! UUIDPersistentRepr(offs, toPersistentRepr(row, pid, seqNr))
        loop(n - 1)

      case Some(s) =>
        s.isNext(pid, seqNr) match {
          case SequenceNumbers.Yes | SequenceNumbers.PossiblyFirst =>
            seqNumbers = Some(s.updated(pid, seqNr))
            replyTo ! UUIDPersistentRepr(offs, toPersistentRepr(row, pid, seqNr))
            loop(n - 1)

          case SequenceNumbers.After =>
            replyTo ! ReplayAborted(seqNumbers, pid, s.get(pid) + 1, seqNr)
            // end loop

          case SequenceNumbers.Before =>
            // duplicate, discard
            if (!backtracking)
              log.debug(s"Discarding duplicate. Got sequence number [$seqNr] for [$pid], " +
                s"but current sequence number is [${s.get(pid)}]")
            loop(n - 1)
        }
    }
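The three outcomes handled above (next in sequence, duplicate, gap) can be illustrated with a small classifier. This is a simplified sketch, not the plugin's `SequenceNumbers` class; the `Verdict` type and `classify` function are hypothetical, and the first event for an id is simply accepted.

```scala
object SeqCheckSketch {
  sealed trait Verdict
  case object Next extends Verdict      // exactly the expected sequence number
  case object Duplicate extends Verdict // already seen, safe to discard
  case object Gap extends Verdict       // an earlier event is still missing

  // Classify an event against the last sequence number seen for its persistence id.
  def classify(seen: Map[String, Long], pid: String, seqNr: Long): Verdict =
    seen.get(pid) match {
      case None                            => Next // assumption: accept the first event for an id
      case Some(last) if seqNr == last + 1 => Next
      case Some(last) if seqNr <= last     => Duplicate
      case Some(_)                         => Gap
    }

  // Persistence id "0" was last seen at sequence number 1.
  val seen: Map[String, Long] = Map("0" -> 1L)

  val verdicts: List[Verdict] =
    List(classify(seen, "0", 2L), classify(seen, "0", 1L), classify(seen, "0", 100L))
}
```

A `Gap` corresponds to the `After` case above, where the replay is aborted and retried later rather than delivering events out of order.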

    def replay(): Unit = {
      val backtracking = isBacktracking
      val limit =
        if (backtracking) maxBufferSize
        else maxBufferSize - buf.size
      val toOffs =
        if (backtracking && abortDeadline.isEmpty) highestOffset
        else UUIDs.endOf(System.currentTimeMillis() - eventualConsistencyDelayMillis)
      context.actorOf(EventsByTagFetcher.props(tag, currTimeBucket, currOffset, toOffs, limit,
        backtracking, self, session, preparedSelect, seqNumbers, settings))
      context.become(replaying(limit))
    }

    def replaying(limit: Int): Receive = {
      case env @ UUIDPersistentRepr(offs, _) => // Deliver buffer
      case ReplayDone(count, seqN, highest) => // Request more
      case ReplayAborted(seqN, pid, expectedSeqNr, gotSeqNr) =>
        // Causality violation, wait and retry.
        // Only applicable if all events for persistence_id are tagged
      case ReplayFailed(cause) => // Failure
      case _: Request => // Deliver buffer
      case Continue => // Do nothing
      case Cancel => // Stop
    }

Akka Persistence Cassandra Replay

    def asyncReplayMessages(persistenceId: String, fromSequenceNr: Long, toSequenceNr: Long, max: Long)
        (replayCallback: (PersistentRepr) => Unit): Future[Unit] = Future {
      new MessageIterator(persistenceId, fromSequenceNr, toSequenceNr, max).foreach(msg => {
        replayCallback(msg)
      })
    }

    class MessageIterator(persistenceId: String, fromSequenceNr: Long, toSequenceNr: Long, max: Long)
      extends Iterator[PersistentRepr] {
      private val initialFromSequenceNr =
        math.max(highestDeletedSequenceNumber(persistenceId) + 1, fromSequenceNr)
      private val iter = new RowIterator(persistenceId, initialFromSequenceNr, toSequenceNr)
      private var mcnt = 0L
      private var c: PersistentRepr = null
      private var n: PersistentRepr = PersistentRepr(Undefined)
      fetch()
      def hasNext: Boolean = ...
      def next(): PersistentRepr = ...
      ...
    }

    class RowIterator(persistenceId: String, fromSequenceNr: Long, toSequenceNr: Long)
      extends Iterator[Row] {
      var currentPnr = partitionNr(fromSequenceNr)
      var currentSnr = fromSequenceNr
      var fromSnr = fromSequenceNr
      var toSnr = toSequenceNr
      var iter = newIter()

      // Blocking query for the current partition.
      def newIter() =
        session.execute(preparedSelectMessages.bind(persistenceId, currentPnr, fromSnr, toSnr)).iterator

      // inUse is defined elsewhere in the class and is not shown here.
      final def hasNext: Boolean = {
        if (iter.hasNext) {
          true
        } else if (!inUse) {
          false
        } else {
          // Current partition is exhausted: move on to the next one.
          currentPnr += 1
          fromSnr = currentSnr
          iter = newIter()
          hasNext
        }
      }

      def next(): Row = {
        val row = iter.next()
        currentSnr = row.getLong("sequence_nr")
        row
      }
    }

Non blocking asynchronous replay

    private[this] val queries: CassandraReadJournal =
      new CassandraReadJournal(
        extendedActorSystem,
        context.system.settings.config.getConfig("cassandra-query-journal"))

    override def asyncReplayMessages(
        persistenceId: String,
        fromSequenceNr: Long,
        toSequenceNr: Long,
        max: Long)(replayCallback: (PersistentRepr) => Unit): Future[Unit] =
      queries
        .eventsByPersistenceId(
          persistenceId, fromSequenceNr, toSequenceNr, max,
          replayMaxResultSize, None, "asyncReplayMessages")
        .runForeach(replayCallback)
        .map(_ => ())

Benchmarks

[Figure: benchmark plots comparing blocking and asynchronous replay. Panels: REPLAY, STRONG SCALING, WEAK SCALING. Vertical axis: Time (s). Horizontal axes: Actors; Threads, Actors; Threads.]

Alternative architecture

[Figure: a log partitioned by node_id (0 and 1) in which the events of persistence_id 0, 1 and 2 are interleaved: persistence_id 0 events 0 to 3, persistence_id 1 event 0 and persistence_id 2 event 0.]

[Figure: the node_id log shown next to three derived tables: tag 1, allIds, and events by persistence id and partition.]

    val boundStatements = statementGroup(eventsByPersistenceId, eventsByTag, allPersistenceIds)

    Future.sequence(boundStatements).flatMap { stmts =>
      val batch = new BatchStatement().setConsistencyLevel(...).setRetryPolicy(...)
      stmts.foreach(batch.add)
      session.underlying().flatMap(_.executeAsync(batch))
    }

    val eventsByPersistenceIdStatement = statementGroup(eventsByPersistenceIdStatement)
    val boundStatements = statementGroup(eventsByTagStatement, allPersistenceIdsStatement)
    ...
    session.underlying().flatMap { s =>
      val ebpResult = s.executeAsync(eventsByPersistenceIdStatement)
      val batchResult = s.executeAsync(batch)
      ...
    }

[Figure: derived tables for tag 1, allIds, and events by persistence id and partition.]

Event time processing

● Ingestion time, processing time, event time

Ordering

KEY  TIME      VALUE
0    12:34:56  0
1    12:34:57  1
2    12:34:58  2

[Figure: the three records shown first in arrival order (1, 0, 2) and then in event time order (0, 1, 2).]
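As a small illustration of ordering by event time rather than arrival order, the three records above can be sorted on their TIME column. This sketch assumes a bounded batch and fixed-width HH:mm:ss strings, so lexicographic order matches time order; `Record`, `arrival` and `byEventTime` are hypothetical names.

```scala
object EventTimeSketch {
  final case class Record(key: Int, time: String, value: Int)

  // Records in the order they arrived (ingestion order).
  val arrival: List[Record] = List(
    Record(1, "12:34:57", 1),
    Record(0, "12:34:56", 0),
    Record(2, "12:34:58", 2)
  )

  // Reordered by the time carried in the record itself (event time).
  val byEventTime: List[Record] = arrival.sortBy(_.time)
}
```

On an unbounded stream a full sort is not possible, so a streaming system has to decide how long to wait for late records before emitting.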

Distributed causal stream merging

[Figure: animation merging the logs of node_id 0 and node_id 1 into a single stream. A table of persistence_id and seq tracks the last sequence number emitted per persistence id, so that an event is emitted only after its predecessor. Merged output shown: Id 0 event 0, Id 0 event 1, Id 2 event 0, Id 0 event 2, Id 0 event 3, Id 1 event 0.]

Replay

[Figure: animation of the merge being replayed. Progress is tracked in two tables: persistence_id and seq, and stream_id and seq.]
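The merge described above can be sketched as follows: take the head of any node log as long as it is the next expected sequence number for its persistence id. This is an illustrative sketch only, assuming sequence numbers start at 0 and that logs are plain in-memory lists; `Event`, `merge` and the sample logs are hypothetical.

```scala
object CausalMergeSketch {
  final case class Event(pid: Int, seqNr: Int)

  // Merge per-node logs so that, for every persistence id, events come out in
  // sequence number order. Order across different ids is not constrained.
  def merge(logs: List[List[Event]]): List[Event] = {
    @annotation.tailrec
    def loop(logs: List[List[Event]], next: Map[Int, Int], out: Vector[Event]): Vector[Event] = {
      // Find the first log whose head is the next expected event for its id.
      val idx = logs.indexWhere(_.headOption.exists(e => next.getOrElse(e.pid, 0) == e.seqNr))
      if (idx < 0) out // nothing deliverable: every log is drained, or there is a gap
      else {
        val e = logs(idx).head
        loop(logs.updated(idx, logs(idx).tail), next.updated(e.pid, e.seqNr + 1), out :+ e)
      }
    }
    loop(logs, Map.empty, Vector.empty).toList
  }

  val node0: List[Event] = List(Event(0, 0), Event(0, 2), Event(1, 0))
  val node1: List[Event] = List(Event(0, 1), Event(2, 0), Event(0, 3))

  val merged: List[Event] = merge(List(node0, node1))
}
```

The `next` map plays the role of the persistence_id and seq table in the figures: it is the only state needed to decide whether an event may be emitted.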

Exactly once delivery

[Figure: animation of six events (Id 0 events 0 to 3, Id 1 event 0, Id 2 event 0) being delivered downstream. Five of them are received and acknowledged (ACK); Id 0 event 2 is missing from the acknowledged set.]

Checkpoint data

StateBackend

Source 1: 6791
Source 2: 7252
Source 3: 5589
Source 4: 6843
State 1: ptr 1
State 1: ptr 2
Sink 2: ack!
Sink 2: ack!

Page 106: Data in Motion: Streaming Static Data Efficiently

class KafkaSource(private var offsetManagers: Map[TopicAndPartition, KafkaOffsetManager])
  extends TimeReplayableSource {

  def open(context: TaskContext, startTime: Option[TimeStamp]): Unit = {
    fetch.setStartOffset(topicAndPartition, offsetManager.resolveOffset(time))
    ...
  }

  def read(batchSize: Int): List[Message]

  def close(): Unit
}
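The source above restarts from an offset resolved from a timestamp. A minimal sketch of that lookup, assuming the consumer has recorded (timestamp, offset) pairs while reading. TimeIndex is a hypothetical helper for illustration, not the offset manager used on the slide.

```scala
// Hypothetical index of (timestamp, offset) pairs recorded while consuming.
final class TimeIndex(entries: Seq[(Long, Long)]) {
  private val sorted = entries.sortBy(_._1)

  /** Offset of the first recorded entry at or after `time`, if any. */
  def resolveOffset(time: Long): Option[Long] =
    sorted.collectFirst { case (ts, offset) if ts >= time => offset }
}
```

Replaying from the resolved offset may re-read some messages, so downstream processing still needs to tolerate duplicates.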

Page 108: Data in Motion: Streaming Static Data Efficiently

class DirectKafkaInputDStream[
    K,
    V,
    U <: Decoder[K]: ClassTag,
    T <: Decoder[V]: ClassTag,
    R](
    _ssc: StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends InputDStream[R](_ssc) with Logging {

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = latestLeaderOffsets(maxRetries)
    ...
  }
}
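A minimal sketch of the idea behind fromOffsets and untilOffsets: each batch covers a half-open offset range per partition, and the next batch starts where the previous one ended. The types below are simplified, hypothetical stand-ins and not Spark's classes.

```scala
// Simplified, hypothetical model: one range [from, until) per partition.
final case class OffsetRange(partition: Int, from: Long, until: Long)

object Batches {
  /** Builds the ranges for one batch and the starting offsets for the next.
    * `current` holds the next offset to read; `latest` holds the newest
    * offsets known for each partition. */
  def nextBatch(
      current: Map[Int, Long],
      latest: Map[Int, Long]): (List[OffsetRange], Map[Int, Long]) = {
    val ranges = current.toList.sortBy(_._1).map { case (p, from) =>
      // Never move backwards if the latest offset is unknown or stale.
      OffsetRange(p, from, math.max(from, latest.getOrElse(p, from)))
    }
    (ranges, ranges.map(r => r.partition -> r.until).toMap)
  }
}
```

Because a batch is fully described by its ranges, a failed batch can be recomputed by reading exactly the same offsets again.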

Page 110: Data in Motion: Streaming Static Data Efficiently

Exactly once delivery

● Durable offset

[Diagram: offsets 0 1 2 3 4]
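A minimal sketch of a durable offset, assuming the result and the offset of the last processed message are committed together (for example in one transaction). DurableState is a hypothetical in-memory stand-in for such a store.

```scala
// Hypothetical in-memory stand-in for a store that commits the result
// and the offset atomically.
final class DurableState {
  private var committed: (Long, Int) = (-1L, 0) // (last offset, running sum)
  def offset: Long = committed._1
  def sum: Int = committed._2
  def commit(offset: Long, sum: Int): Unit = committed = (offset, sum)
}

object DurableOffset {
  /** Processes messages after the stored offset. `max` limits how many are
    * processed in this run, which simulates a crash part-way through. */
  def run(log: IndexedSeq[Int], state: DurableState, max: Int = Int.MaxValue): Unit = {
    val start = (state.offset + 1).toInt
    log.indices.drop(start).take(max).foreach { i =>
      state.commit(i.toLong, state.sum + log(i))
    }
  }
}
```

Restarting run after a simulated crash resumes at the stored offset, so each message contributes to the sum once even though the process ran twice.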

Page 114: Data in Motion: Streaming Static Data Efficiently

Optimisation

[Diagram: three stream sources and nine workers; operators shown: select, map, filter]

Page 115: Data in Motion: Streaming Static Data Efficiently

[Diagram: three stream sources and five workers, with four "select where" queries]

Page 116: Data in Motion: Streaming Static Data Efficiently

[Diagram: three stream sources and three workers, with six "select where" queries]

Page 117: Data in Motion: Streaming Static Data Efficiently

val partitioner = partitionerClassName match {
  case "org.apache.cassandra.dht.Murmur3Partitioner" => Murmur3TokenFactory
  case "org.apache.cassandra.dht.RandomPartitioner" => RandomPartitionerTokenFactory
  case _ =>
    throw new IllegalArgumentException(s"Unsupported partitioner: $partitionerClassName")
}

private def splitToCqlClause(range: TokenRange): Iterable[CqlTokenRange] = {
  if (range.end == tokenFactory.minToken)
    List(CqlTokenRange(s"token($pk) > ?", startToken))
  else if (range.start == tokenFactory.minToken)
    List(CqlTokenRange(s"token($pk) <= ?", endToken))
  else if (!range.isWrapAround)
    List(CqlTokenRange(s"token($pk) > ? AND token($pk) <= ?", startToken, endToken))
  else
    List(
      CqlTokenRange(s"token($pk) > ?", startToken),
      CqlTokenRange(s"token($pk) <= ?", endToken))
}
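A self-contained sketch of the same branching, so the four cases can be run in isolation. The types are simplified, hypothetical stand-ins for the connector's classes, and using Long.MinValue as the minimum token is an assumption that matches a Murmur3-style ring.

```scala
// Simplified, hypothetical model of a token ring; not the connector's types.
final case class TokenRange(start: Long, end: Long) {
  // A range wraps around the ring when it does not run forward from start to end.
  def isWrapAround: Boolean = start >= end
}
final case class CqlTokenRange(cql: String, values: Long*)

object TokenRanges {
  val MinToken: Long = Long.MinValue // assumption: Murmur3-style minimum token

  /** One range becomes one clause, or two clauses when it wraps around. */
  def toCql(pk: String, r: TokenRange): List[CqlTokenRange] =
    if (r.end == MinToken)
      List(CqlTokenRange(s"token($pk) > ?", r.start))
    else if (r.start == MinToken)
      List(CqlTokenRange(s"token($pk) <= ?", r.end))
    else if (!r.isWrapAround)
      List(CqlTokenRange(s"token($pk) > ? AND token($pk) <= ?", r.start, r.end))
    else
      List(
        CqlTokenRange(s"token($pk) > ?", r.start),
        CqlTokenRange(s"token($pk) <= ?", r.end))
}
```

Each clause can be executed independently, which is what lets separate workers read separate parts of the ring in parallel.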

Page 120: Data in Motion: Streaming Static Data Efficiently

override def getPreferredLocations(split: Partition): Seq[String] =
  split.asInstanceOf[CassandraPartition].endpoints.flatMap(nodeAddresses.hostNames).toSeq

override def getPartitions: Array[Partition] = {
  val partitioner = CassandraRDDPartitioner(connector, tableDef, splitCount, splitSize)
  val partitions = partitioner.partitions(where)
  partitions
}

override def compute(split: Partition, context: TaskContext): Iterator[R] = {
  val session = connector.openSession()
  val partition = split.asInstanceOf[CassandraPartition]
  val tokenRanges = partition.tokenRanges
  val metricsUpdater = InputMetricsUpdater(context, readConf)

  val rowIterator = tokenRanges.iterator.flatMap(
    fetchTokenRange(session, _, metricsUpdater))

  new CountingIterator(rowIterator, limit)
}

Page 122: Data in Motion: Streaming Static Data Efficiently

object PushPredicateThroughProject extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case filter @ Filter(condition, project @ Project(fields, grandChild))
        if fields.forall(_.deterministic) =>

      val aliasMap = AttributeMap(fields.collect {
        case a: Alias => (a.toAttribute, a.child)
      })

      project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
  }
}
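A toy version of the same rewrite, to show the shape of predicate pushdown without the optimizer's machinery. The plan types are invented for illustration and are not Spark's Catalyst classes. The sketch assumes a filter is a simple "column == value" test and that columns not listed as aliases pass through the projection unchanged.

```scala
// Toy logical plan, invented for illustration.
sealed trait Plan
final case class Scan(table: String) extends Plan
// aliases maps an output name to the source column it renames.
final case class Project(aliases: Map[String, String], child: Plan) extends Plan
// Keeps rows where `column` equals `value`.
final case class Filter(column: String, value: Int, child: Plan) extends Plan

object PushDown {
  /** Moves a Filter below a Project, rewriting an aliased column name to the
    * underlying column so the filter still refers to a valid input. */
  def apply(plan: Plan): Plan = plan match {
    case Filter(col, v, Project(aliases, grandChild)) =>
      Project(aliases, apply(Filter(aliases.getOrElse(col, col), v, grandChild)))
    case Filter(col, v, child)   => Filter(col, v, apply(child))
    case Project(aliases, child) => Project(aliases, apply(child))
    case scan: Scan              => scan
  }
}
```

Once the filter sits directly above the scan, a data source can turn it into a "select where" query and return fewer rows.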

Page 124: Data in Motion: Streaming Static Data Efficiently

Table and stream duality

[Diagram: items numbered 1 to 5]

Page 125: Data in Motion: Streaming Static Data Efficiently

Table and stream duality

[Diagram: items numbered 1 to 5; item 1 holds State X]

Page 126: Data in Motion: Streaming Static Data Efficiently

Table and stream duality

[Diagram: items numbered 1 to 5; item 1 holds State X; events Id 0 Event 1 and Id 0 Event 2]

Page 127: Data in Motion: Streaming Static Data Efficiently

Table and stream duality

Snapshot for offset N

[Diagram: items numbered 1 to 5; item 1 holds State X; events Id 0 Event 1 and Id 0 Event 2]

Page 128: Data in Motion: Streaming Static Data Efficiently

Table and stream duality

Snapshot for offset N

[Diagram: items numbered 1 to 5; item 1 holds State X; events Id 0 Event 1 and Id 0 Event 2; snapshot rows: Id 0, Offset 123, State X and Id 11, Offset 123, State X]
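A minimal sketch of the duality: a table is the fold of a stream of updates, and a snapshot for offset N plus the updates after N gives the same table as replaying everything. Update, Snapshot and Duality are hypothetical names, and the "set key to value" event is an assumption made to keep the example small.

```scala
// Hypothetical event type: set `key` to `value`; its log index is its offset.
final case class Update(key: String, value: Int)

object Duality {
  type Table = Map[String, Int]

  /** A snapshot pairs a table with the offset it is valid for. */
  final case class Snapshot(offset: Int, table: Table)

  /** Table state after applying log entries with offsets in [from, until). */
  def replay(log: IndexedSeq[Update], base: Table, from: Int, until: Int): Table =
    log.slice(from, until).foldLeft(base)((t, u) => t.updated(u.key, u.value))

  /** Snapshot of the table after the first `n` updates. */
  def snapshotAt(log: IndexedSeq[Update], n: Int): Snapshot =
    Snapshot(n, replay(log, Map.empty, 0, n))

  /** Restoring from a snapshot needs only the log suffix after its offset. */
  def restore(log: IndexedSeq[Update], s: Snapshot): Table =
    replay(log, s.table, s.offset, log.length)
}
```

A new consumer can therefore start from the snapshot and then follow the change stream, instead of reading the whole history.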

Page 129: Data in Motion: Streaming Static Data Efficiently

Cache / view / index / replica / system / service

Continuous stream applying transformation function

Updates to the source of truth data

Original table

Infinite streams application

Page 130: Data in Motion: Streaming Static Data Efficiently

internet

services

devices

social

Kafka Stream processing

apps

Stream consumer

Search

Apps

Services

Databases

Batch

Batch

Serialisation

Page 131: Data in Motion: Streaming Static Data Efficiently

Distributed systems

[Diagram: User, Mobile and System interacting with seven microservices; data stores: CQRS/ES, Relational, NoSQL]

Page 132: Data in Motion: Streaming Static Data Efficiently

[Diagram: Clients 1, 2 and 3, each with input data and model devices, exchange parameters P and updates ΔP with parameter devices]

Page 133: Data in Motion: Streaming Static Data Efficiently

Challenges

● All the solved problems
  ○ Exactly once delivery
  ○ Consistency
  ○ Availability
  ○ Fault tolerance
  ○ Cross service invariants and consistency
  ○ Transactions
  ○ Automated deployment and configuration management
  ○ Serialization, versioning, compatibility
  ○ Automated elasticity
  ○ No downtime version upgrades
  ○ Graceful shutdown of nodes
  ○ Distributed system verification, logging, tracing, monitoring, debugging
  ○ Split brains
  ○ ...

Page 134: Data in Motion: Streaming Static Data Efficiently

Conclusion

● From request, response, synchronous, mutable state
● To streams, asynchronous messaging

● Production ready distributed systems

Page 135: Data in Motion: Streaming Static Data Efficiently

Questions

Page 136: Data in Motion: Streaming Static Data Efficiently

@zapletal_martin @cakesolutions

347 708 1518

We are hiring: http://www.cakesolutions.net/careers