
Event Sourcing with Cassandra
Cassandra Japan Meetup, Tokyo, March 2016

Luke Tillman, Technical Evangelist
@LukeTillman

Who are you?
• Evangelist with a focus on Developers
  – Long-time Developer on RDBMS (lots of .NET)
• I still write a lot of code, but now I also do a lot of teaching and speaking

A Quick Recap of Event Sourcing


Persistence with Event Sourcing
• Instead of keeping the current state, keep a journal of all the deltas (events)
• Append only (no UPDATE or DELETE)
• We can replay our journal of events to get the current state

Shopping Cart (id = 1345)
  Cart Created: user_id = 4762, created_on = 7/10/2015, ...
  Item Added: item_id = 7621, quantity = 1, price = 19.99
  Item Added: item_id = 9134, quantity = 2, price = 16.99
  Item Removed: item_id = 7621
  Qty Changed: item_id = 9134, quantity = 1
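To make the replay idea concrete, here is a minimal sketch in Scala of folding the shopping cart events above into a current state. The event and state types are invented for illustration; they are not from the talk.

sealed trait CartEvent
case class CartCreated(userId: Long) extends CartEvent
case class ItemAdded(itemId: Long, quantity: Int, price: BigDecimal) extends CartEvent
case class ItemRemoved(itemId: Long) extends CartEvent
case class QtyChanged(itemId: Long, quantity: Int) extends CartEvent

// item id -> (quantity, price)
case class CartState(userId: Long = 0L, items: Map[Long, (Int, BigDecimal)] = Map.empty)

// Current state is just a left fold of the journal over an empty state.
def replay(events: Seq[CartEvent]): CartState =
  events.foldLeft(CartState()) {
    case (s, CartCreated(uid))      => s.copy(userId = uid)
    case (s, ItemAdded(id, qty, p)) => s.copy(items = s.items + (id -> (qty, p)))
    case (s, ItemRemoved(id))       => s.copy(items = s.items - id)
    case (s, QtyChanged(id, qty))   =>
      s.copy(items = s.items.get(id).fold(s.items) { case (_, p) => s.items + (id -> (qty, p)) })
  }

Replaying the five events above yields a cart for user 4762 containing only item 9134 with quantity 1.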

Event Sourcing in Practice
• Typically two kinds of storage:
  – Event Journal Store
  – Snapshot Store
• A history of how we got to the current state can be useful
• We've also got a lot more data to store than we did before

(Shopping Cart event journal from the previous slide shown again)

Why use Cassandra for Event Sourcing?
• Transactional (OLTP) workload
• Sequentially written, immutable data
  – Looks a lot like time series data
• Easy to scale out to capture more events

Event Sourcing Example: Akka Persistence


Akka Persistence Journal API Summary
• Write Method
  – For a given actor, write a group of messages
• Delete Method
  – For a given actor, permanently or logically delete all messages up to a given sequence number
• Read Methods
  – For a given actor, read back all the messages between two sequence numbers
  – For a given actor, read the highest sequence number that's been written
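For reference, these four operations map onto the Akka Persistence journal SPI roughly like this (a sketch against the Akka 2.4-era AsyncWriteJournal trait; method bodies omitted):

import scala.collection.immutable.Seq
import scala.concurrent.Future
import scala.util.Try
import akka.persistence.{AtomicWrite, PersistentRepr}
import akka.persistence.journal.AsyncWriteJournal

class CassandraJournal extends AsyncWriteJournal {
  // Write method: persist a group of messages for a given actor
  def asyncWriteMessages(messages: Seq[AtomicWrite]): Future[Seq[Try[Unit]]] = ???

  // Delete method: delete all messages up to a sequence number
  def asyncDeleteMessagesTo(persistenceId: String, toSequenceNr: Long): Future[Unit] = ???

  // Read method: replay all messages between two sequence numbers
  def asyncReplayMessages(persistenceId: String, fromSequenceNr: Long,
                          toSequenceNr: Long, max: Long)
                         (recoveryCallback: PersistentRepr => Unit): Future[Unit] = ???

  // Read method: highest sequence number that's been written
  def asyncReadHighestSequenceNr(persistenceId: String, fromSequenceNr: Long): Future[Long] = ???
}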

An Event Journal in Cassandra
Data Modeling for Reads and Writes

A Simple First Attempt
• Use persistence_id as the partition key
  – all messages for a given persistence id together
• Use sequence_number as a clustering column
  – order messages by sequence number inside a partition

CREATE TABLE messages (
  persistence_id text,
  sequence_number bigint,
  message blob,
  PRIMARY KEY (persistence_id, sequence_number)
);

• Read all messages between two sequence numbers

SELECT * FROM messages
WHERE persistence_id = ?
  AND sequence_number >= ? AND sequence_number <= ?;

• Read the highest sequence number

SELECT sequence_number FROM messages
WHERE persistence_id = ?
ORDER BY sequence_number DESC
LIMIT 1;

A Simple First Attempt
• Write a group of messages
• Use a Cassandra batch statement to ensure all messages (success) or no messages (failure) get written
• What's the problem with this data model (ignoring implementing deletes for now)?

(same messages table as above)

BEGIN BATCH
  INSERT INTO messages ... ;
  INSERT INTO messages ... ;
  INSERT INTO messages ... ;
APPLY BATCH;

Unbounded Partition Growth
• Cassandra has a hard limit of 2 billion cells in a partition
• But there's also a practical limit
  – Depends on row/cell data size, but likely not more than millions of rows

(diagram: INSERT INTO messages ... keeps appending rows seq_nr = 1, 2, ... ∞? to the single journal partition for persistence_id = '57ab...')

Fixing the Unbounded Partition Growth Problem
• General strategy: add a column to the partition key
  – Compound partition key
• Can be data that's already part of the model, or a "synthetic" column
• Allow users to configure a partition size in the plugin
  – Partition Size = number of rows per partition
  – This should not be changeable once messages have been written
• Partition number for a given sequence number is then easy to calculate (see the sketch below)
  – (seqNr - 1) / partitionSize
  – (100 - 1) / 100 = partition 0
  – (101 - 1) / 100 = partition 1

CREATE TABLE messages (
  persistence_id text,
  partition_number bigint,
  sequence_number bigint,
  message blob,
  PRIMARY KEY ((persistence_id, partition_number), sequence_number)
);
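In code, the formula above is one line of integer division (zero-based partition numbers):

// (seqNr - 1) / partitionSize, using integer division
def partitionNr(seqNr: Long, partitionSize: Long): Long = (seqNr - 1) / partitionSize

partitionNr(100, 100)  // partition 0
partitionNr(101, 100)  // partition 1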

Fixing the Unbounded Partition Growth Problem
• Read all messages between two sequence numbers (repeat per partition until we reach the target sequence number or run out of partitions)

SELECT * FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND sequence_number >= ? AND sequence_number <= ?;

• Read the highest sequence number (repeat per partition until we run out of partitions)

SELECT sequence_number FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
ORDER BY sequence_number DESC
LIMIT 1;
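A sketch of that second read, assuming the plugin walks partitions in order. Here `highestInPartition` is an illustrative stand-in for running the LIMIT 1 query above against one partition, passed in so the sketch stays self-contained:

// Walk partitions 0, 1, 2, ... until a partition has no rows;
// the last sequence number seen is the highest one written.
def readHighestSequenceNr(highestInPartition: Long => Option[Long]): Long = {
  var partition = 0L
  var highest = 0L
  var done = false
  while (!done) {
    highestInPartition(partition) match {
      case Some(seqNr) => highest = seqNr; partition += 1
      case None        => done = true
    }
  }
  highest
}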

Fixing the Unbounded Partition Growth Problem
• Write a group of messages
• A Cassandra batch statement might now write to multiple partitions (if the sequence numbers cross a partition boundary)
• Is that a problem?

BEGIN BATCH
  INSERT INTO messages ... ;
  INSERT INTO messages ... ;
  INSERT INTO messages ... ;
APPLY BATCH;

RTFM: Cassandra Batches Edition

"Batches are atomic by default. In the context of a Cassandra batch operation, atomic means that if any of the batch succeeds, all of it will."
- DataStax CQL Docs, http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html

"Although an atomic batch guarantees that if any part of the batch succeeds, all of it will, no other transactional enforcement is done at the batch level. For example, there is no batch isolation. Clients are able to read the first updated rows from the batch, while other rows are still being updated on the server."
- DataStax CQL Docs, http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html

Atomic? That's kind of a loaded word.

Multiple Partition Batch Failure Scenario

(diagram sequence: a journal cluster with RF = 3; the client sends BEGIN BATCH ... APPLY BATCH at CL = QUORUM; the coordinator first writes the batch to the batch log on multiple nodes, then applies the individual writes to each partition's replicas)

• Once written to the Batch Log successfully, we know all the writes in the batch will succeed eventually (atomic?)
• If replicas fail mid-batch, the batch has been partially applied, and the client sees a WriteTimeout with writeType = BATCH
• Possible to read a partially applied batch, since there is no batch isolation
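On the driver side this failure mode surfaces as a timeout whose write type tells you the batch log write already succeeded. A hedged sketch against the DataStax Java driver's exception types, called from Scala (the handling policy here is illustrative, not from the talk):

import com.datastax.driver.core.{BatchStatement, Session, WriteType}
import com.datastax.driver.core.exceptions.WriteTimeoutException

// writeType = BATCH means the batch made it into the batch log, so
// Cassandra will eventually apply all of its writes even though the
// client timed out waiting for them.
def executeBatch(session: Session, batch: BatchStatement): Unit =
  try { session.execute(batch); () }
  catch {
    case e: WriteTimeoutException if e.getWriteType == WriteType.BATCH =>
      () // eventual success: safe not to treat this as a hard failure
  }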

RTFM: Cassandra Batches Edition, Part 2

"For example, there is no batch isolation. Clients are able to read the first updated rows from the batch, while other rows are still being updated on the server. However, transactional row updates within a partition key are isolated: clients cannot read a partial update."
- DataStax CQL Docs, http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html

What we really need is isolation: when writing a group of messages, ensure that we write the group to a single partition.

Logic Changes to Ensure Batch Isolation
• Still use a configurable Partition Size
  – not a "hard limit" but a "best attempt"
• On write, see if the messages will all fit in the current partition
• If not, roll over to the next partition early (see the sketch below)
• Reading is slightly more complicated
  – For a given sequence number, the row might be in partition n or (n+1)

(diagram: with Partition Size = 100, partition_nr = 1 holds seq_nr 1 through 98; a group of messages with seq_nr 99, 100, 101 would cross the boundary, so the whole group is written to partition_nr = 2)
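A minimal sketch of that roll-over decision, under the assumptions above (zero-based partition numbers from the earlier formula; names are illustrative):

// Decide which partition a group of messages starting at firstSeqNr goes to.
def targetPartition(firstSeqNr: Long, groupSize: Long, partitionSize: Long): Long = {
  val partition = (firstSeqNr - 1) / partitionSize
  val lastSeqNr = firstSeqNr + groupSize - 1
  // If the group would cross a partition boundary, roll over early and
  // write the whole group to the next partition instead.
  if ((lastSeqNr - 1) / partitionSize != partition) partition + 1 else partition
}

targetPartition(99, 3, 100)  // = 1: seq_nr 99-101 all land in the next partition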

Accounting for Deletes


Option 1: Mark Individual Messages as Deleted
• Add an is_deleted column to our messages table

CREATE TABLE messages (
  persistence_id text,
  partition_number bigint,
  sequence_number bigint,
  message blob,
  is_deleted boolean,
  PRIMARY KEY ((persistence_id, partition_number), sequence_number)
);

• Read all messages between two sequence numbers (repeat until we reach the target sequence number or run out of partitions)

SELECT * FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND sequence_number >= ? AND sequence_number <= ?;

...  sequence_number  message  is_deleted
...  1                0x00     true
...  2                0x00     true
...  3                0x00     false
...  4                0x00     false

Option 1: Mark Individual Messages as Deleted
• Pros:
  – On replay, easy to check if a message has been deleted (comes included in the message query's data)
• Cons:
  – Messages are not immutable any more
  – Issue lots of UPDATEs to mark each message as deleted
  – Have to scan through a lot of rows to find the max deleted sequence number if we want to avoid issuing unnecessary UPDATEs

Option 2: Write a Marker Row for Each Deleted Row
• Add a marker column and make it a clustering column
  – Messages written with 'A'
  – Deletes get written with 'D'

CREATE TABLE messages (
  persistence_id text,
  partition_number bigint,
  sequence_number bigint,
  marker text,
  message blob,
  PRIMARY KEY ((persistence_id, partition_number), sequence_number, marker)
);

• Read all messages between two sequence numbers (repeat until we reach the target sequence number or run out of partitions)

SELECT * FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND sequence_number >= ? AND sequence_number <= ?;

...  sequence_number  marker  message
...  1                A       0x00
...  1                D       null
...  2                A       0x00
...  3                A       0x00

Option 2: Write a Marker Row for Each Deleted Row
• Pros:
  – On replay, easy to peek at the next row to check if deleted (comes included in the message query's data)
  – Message data stays immutable
• Cons:
  – Issue lots of INSERTs to mark each message as deleted
  – Have to scan through a lot of rows to find the max deleted sequence number if we want to avoid issuing unnecessary INSERTs
  – Potentially twice as many rows to store

Looking at Physical Deletes
• Physically delete messages up to a given sequence number
• Still probably want to scan through rows to see what's already been deleted first
• Can't range delete, so we have to do lots of individual DELETEs

BEGIN BATCH
  DELETE FROM messages
  WHERE persistence_id = ?
    AND partition_number = ?
    AND marker = 'A'
    AND sequence_number = ?;
  ...
APPLY BATCH;

Looking at Physical Deletes
• Read all messages between two sequence numbers (repeat until we reach the target sequence number or run out of partitions)
• With how DELETEs work in Cassandra, is there a potential problem with this query?

SELECT * FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND sequence_number >= ? AND sequence_number <= ?;

Tombstone Hell: Queue-like Data Sets

(diagram: the journal partition for persistence_id = '57ab...', partition_nr = 1, holding rows seq_nr = 1, 2, ... with marker = 'A' and message blobs)

Delete messages up to a sequence number:

BEGIN BATCH
  DELETE FROM messages
  WHERE persistence_id = '57ab...'
    AND partition_number = 1
    AND marker = 'A'
    AND sequence_number = 1;
  ...
APPLY BATCH;

(diagram: each deleted row now has a tombstone version alongside the original, holding NO DATA)

• At some point compaction runs and we don't have two versions any more, but tombstones don't go away immediately
  – Tombstones remain for gc_grace_seconds
  – Default is 10 days

Read all messages between two sequence numbers:

SELECT * FROM messages
WHERE persistence_id = '57ab...'
  AND partition_number = 1
  AND sequence_number >= 1
  AND sequence_number <= [max value];

(diagram: the read scans past tombstones for seq_nr = 1, 2, 3, 4, ... before reaching any live rows)

Avoid Tombstone Hell

We need a way to avoid reading tombstones when replaying messages.

SELECT * FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND sequence_number >= ?
  AND sequence_number <= ?;

If we know what sequence number we've already deleted to before we query, we could make that sequence_number >= ? lower bound smarter.

A Third Option for Deletes
• Use marker as a clustering column, but change the clustering order
  – Messages still 'A', Deletes 'D'

CREATE TABLE messages (
  persistence_id text,
  partition_number bigint,
  marker text,
  sequence_number bigint,
  message blob,
  PRIMARY KEY ((persistence_id, partition_number), marker, sequence_number)
);

• Read all messages between two sequence numbers (repeat until we reach the target sequence number or run out of partitions)

SELECT * FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND marker = 'A'
  AND sequence_number >= ? AND sequence_number <= ?;

...  sequence_number  marker  message
...  1                A       0x00
...  2                A       0x00
...  3                A       0x00

A Third Option for Deletes
• Message data no longer has deleted information, so how do we know what's already been deleted?
• Get the max deleted sequence number
• Can avoid tombstones if done before getting message data (see the sketch below)

SELECT sequence_number FROM messages
WHERE persistence_id = ?
  AND partition_number = ?
  AND marker = 'D'
ORDER BY marker DESC, sequence_number DESC
LIMIT 1;
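Putting the two queries together, a sketch of replay under this model; `queryMaxDeleted` and `queryMessages` are illustrative stand-ins for the 'D'-marker query above and the 'A'-marker range query from the previous slide:

// Step 1: one row tells us the highest deleted sequence number.
// Step 2: start the 'A' range above it, so the read never scans the
// tombstones left behind by physical deletes.
def replayPartition(queryMaxDeleted: () => Option[Long],
                    queryMessages: (Long, Long) => Seq[Array[Byte]],
                    fromSeqNr: Long, toSeqNr: Long): Seq[Array[Byte]] = {
  val deletedTo  = queryMaxDeleted().getOrElse(0L)
  val lowerBound = math.max(fromSeqNr, deletedTo + 1) // the "smarter lower bound"
  if (lowerBound > toSeqNr) Seq.empty else queryMessages(lowerBound, toSeqNr)
}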

A Third Option for Deletes
• Pros:
  – Message data stays immutable
  – Issue a single INSERT when deleting up to a sequence number
  – Read a single row to find out what's been deleted (no more scanning)
  – Can avoid reading tombstones created by physical deletes
• Cons:
  – Requires a separate query to find out what's been deleted before getting message data

Lessons Learned


Summary
• Seemingly simple data models can get a lot more complicated
• Avoid unbounded partition growth
  – Add data to your partition key
• Be aware of how Cassandra logged batches work
  – If you need isolation, only write to a single partition
• Avoid queue-like data sets and be aware of how tombstones might impact your queries
  – Try to query with ranges that avoid tombstones

Questions?

@LukeTillman
https://www.linkedin.com/in/luketillman/
https://github.com/LukeTillman/