C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structured Data in Cassandra

Speaker: Eric Zoerner, Senior Software Developer at eBuddy

Video: http://www.youtube.com/watch?v=fwgCJ2MzakA&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=12

In this session you'll learn about the design and implementation of a new open source, general-purpose Java library that supports storing structured data in Cassandra. Instead of mapping the data to multiple tables, as an ORM would, or embedding the data using serialization, this approach decomposes structured data of arbitrary complexity into separate columns of simple values, allowing the data to be retrieved or updated in parts using hierarchical paths. Implementations are included for Cassandra using both the Thrift and CQL3 APIs. Eric also shares his experiences with the challenges of using CQL3 vs. Thrift for schema-less data.

#CASSANDRAEU CASSANDRASUMMITEU

C* Path: Denormalize your data

Eric Zoerner | Software Developer, eBuddy BV

Cassandra Summit Europe 2013, London

About eBuddy

XMS

Cassandra in eBuddy Messaging Platform

• User Data Service

• User Discovery Service

• Persistent Session Store

• Message History

• Location-based Discovery

Some Statistics

• Current size of data – 1.4 TB total (replication of 3x); 467 GB actual data

• 12 million sessions (11 million users plus groups)

• Almost a billion rows in one column family (inverse social graph)

C* Path

The Problem (a “classic”)

[Slide diagram: a complex object. A Person (name: String, birthdate: Date, nickname: String) has many Addresses (street, city, province, postalCode, countryCode: String) and many Phones (name: String, number: String). How should it be stored in a key-value store (RDB table, NoSQL, etc.)?]

Some Strategies

Serialization

Normalization

[Slide figure: three relational tables with sample rows for two persons (ids 110 and 111): Person (id, name, birthdate, nickname), Address (address_id, person_id, street, city), and Phone (person_id, name, phone).]

Decomposition

    name/                  John
    addresses/@0/street    123 Main St.
    phones/@0/number       +31123456789
    ...                    ...

Strategies Comparison

                        Serialization   Normalization   Decomposition
    Single Write        ✔               ✘               ✔
    Single Read         ✔               ✘               ✔
    Consistent Updates  ✔               ✔               not enforced
    Structural Access   ✘               ✔               ✔
    Cycles              ✔               ✔               ✘

C* Path

Open Source Java Library for decomposing complex objects into Path-Value pairs — and storing them in Cassandra

https://github.com/ebuddy/c-star-path

* Artifacts available at Maven Central.

C* Path: Decomposition

• Easy to Use

• Simple API

• Good for Cassandra because:

– Structural Access: Write parts of objects without reading first

– Good for denormalizing data: large complex objects can be read or written with a single read or write operation

How does it work?

API Example - Write to a Path

    StructuredDataSupport<UUID> dao = … ;
    UUID rowKey = … ;
    Pojo pojo = … ;

    Path path = dao.createPath("some", "path", "to", "my", "pojo");

    dao.writeToPath(rowKey, path, pojo);

API Example - Read from a Path

    Path path = dao.createPath("some", "path", "to", "my", "pojo");

    Pojo pojo = dao.readFromPath(rowKey, path, new TypeReference<Pojo>() { });

API Example - Delete

    dao.deletePath(rowKey, path);

API Example - Batch Operations

    BatchContext batch = dao.beginBatch();

    dao.writeToPath(rowKey1, path, pojo1, batch);
    dao.writeToPath(rowKey2, path, pojo2, batch);
    dao.deletePath(rowKey3, path, batch);

    dao.applyBatch(batch);

Read or write at any level of a path

    Person person = …;

    Path path = dao.createPath("x");
    dao.writeToPath(rowKey, path, person);

    Path pathToName = path.withElements("name");
    String name = dao.readFromPath(rowKey, pathToName, stringTypeReference);

Write Implementation: Decomposition

• Step 1:

– Convert the domain object into a basic structure of Maps, Lists, and simple values. This uses the Jackson (FasterXML) library and honors Jackson annotations

• Step 2:

– Decompose this basic structure into a map of paths to simple values (i.e. String, Number, Boolean), done by Decomposer

• Step 3:

– Write this map as key-value pairs in the database

Example Decomposition - step 1

Simplify structure into regular Maps, Lists, and simple values

    Map
        name = "John"
        birthdate = "-39080932298"
        nickname = "Jack"
        addresses = <List>
            [0] = <Map>
                street = "123 Main St."
                place = "New York"
            [1] = <Map>
                street = "Singel 45"
                place = "Amsterdam"
        phones = <List>
            [0] = <Map>
                name = "mobile"
                number = "+31651234567"

Example Decomposition - step 2

    path                   value
    name/                  "John"
    birthdate/             "-39080932298"
    nickname/              "Jack"
    addresses/@0/street    "123 Main St."
    addresses/@0/place     "New York"
    addresses/@1/street    "Singel 45"
    addresses/@1/place     "Amsterdam"
    phones/@0/name         "mobile"
    phones/@0/number       "+31651234567"
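The decomposition step can be sketched in a few lines of plain Java. This is only an illustration of the idea, not the library's Decomposer: the class name `DecomposeSketch` and its methods are hypothetical, every path is written with a trailing slash (an assumption; the slides are not consistent on this), and the URL encoding of path elements is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of decomposition; not the C* Path Decomposer itself. */
final class DecomposeSketch {

    /** Flattens nested Maps and Lists into path -> simple value pairs. */
    static Map<String, Object> decompose(Object structure) {
        Map<String, Object> out = new LinkedHashMap<>();
        walk("", structure, out);
        return out;
    }

    private static void walk(String prefix, Object node, Map<String, Object> out) {
        if (node instanceof Map) {
            // A map contributes one path element per key.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                walk(prefix + e.getKey() + "/", e.getValue(), out);
            }
        } else if (node instanceof List) {
            // A list contributes "@index" path elements.
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                walk(prefix + "@" + i + "/", list.get(i), out);
            }
        } else {
            // Simple value (String, Number, Boolean): emit one path-value pair.
            out.put(prefix, node);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("street", "Singel 45");
        address.put("place", "Amsterdam");
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("name", "John");
        person.put("addresses", Arrays.asList(address));
        decompose(person).forEach((path, value) -> System.out.println(path + " " + value));
    }
}
```

The input here is the basic structure from step 1, so Jackson's conversion of the domain object is assumed to have happened already.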

Read implementation: Composition

• Step 1:

– Read path-value pairs from database

• Step 2:

– “Merge” path-value maps back into basic structure (Maps, Lists, simple values), done by Composer

• Step 3:

– Use Jackson to convert basic structure back into domain object using a TypeReference
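Step 2, the merge, might look roughly like the following. Again this is a hedged sketch rather than the library's Composer: `ComposeSketch` is a hypothetical name, paths are assumed to be slash-separated with `@index` list elements, and list terminators and URL decoding are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of composition; not the C* Path Composer itself. */
final class ComposeSketch {

    /** Merges path -> value pairs back into nested Maps and Lists. */
    static Object compose(Map<String, Object> pathValues) {
        Map<String, Object> root = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : pathValues.entrySet()) {
            String[] elements = e.getKey().split("/"); // a trailing "/" adds no empty element
            Map<String, Object> node = root;
            for (int i = 0; i < elements.length - 1; i++) {
                // Fails with ClassCastException if a simple value already sits here,
                // i.e. if the stored structure is inconsistent.
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>)
                        node.computeIfAbsent(elements[i], k -> new LinkedHashMap<String, Object>());
                node = child;
            }
            node.put(elements[elements.length - 1], e.getValue());
        }
        return listify(root);
    }

    /** Turns maps whose keys are all "@index" into lists ordered by index. */
    private static Object listify(Object node) {
        if (!(node instanceof Map)) {
            return node;
        }
        Map<?, ?> map = (Map<?, ?>) node;
        boolean isList = !map.isEmpty();
        for (Object key : map.keySet()) {
            isList &= ((String) key).startsWith("@");
        }
        if (isList) {
            TreeMap<Integer, Object> byIndex = new TreeMap<>();
            for (Map.Entry<?, ?> e : map.entrySet()) {
                byIndex.put(Integer.valueOf(((String) e.getKey()).substring(1)), listify(e.getValue()));
            }
            return new ArrayList<>(byIndex.values());
        }
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<?, ?> e : map.entrySet()) {
            result.put((String) e.getKey(), listify(e.getValue()));
        }
        return result;
    }
}
```

The result is the basic structure that step 3 would hand to Jackson for conversion into the domain object.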

Design & Challenges

Path Encoding

• Paths stored as strings

• Forward slashes in paths (but hidden by Path API)

• Path elements are internally URL encoded allowing use of special characters in the implementation

• Special characters: @ for list indices (@0, @1, @2, ...)
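To see why URL encoding frees up characters such as / and @ for internal use, consider this small sketch built on the JDK's `URLEncoder` and `URLDecoder`. The class `PathEncodingSketch` and the trailing-slash convention are assumptions made for illustration; the library's actual encoding may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of path encoding; not the C* Path implementation. */
final class PathEncodingSketch {

    /** Joins URL-encoded elements with "/" so that "/" and "@" inside an element cannot clash. */
    static String encode(List<String> elements) {
        StringBuilder path = new StringBuilder();
        try {
            for (String element : elements) {
                path.append(URLEncoder.encode(element, "UTF-8")).append('/');
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return path.toString();
    }

    /** Splits an encoded path on "/" and decodes each element. */
    static List<String> decode(String encodedPath) {
        List<String> elements = new ArrayList<>();
        try {
            for (String element : encodedPath.split("/")) {
                elements.add(URLDecoder.decode(element, "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return elements;
    }

    public static void main(String[] args) {
        List<String> elements = new ArrayList<>();
        elements.add("users");
        elements.add("a/b"); // the slash is encoded as %2F
        elements.add("@0");  // a user-supplied "@0" is encoded as %400
        System.out.println(encode(elements));
    }
}
```

Because a user-supplied `@0` is stored as `%400`, an unencoded `@0` in a stored path can only be a list index written by the implementation.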

Challenge: "Shrinking Lists"

➀ Write a list.

    dao.writeToPath(key, "x", {"1","2"});

    x/@0/    "1"
    x/@1/    "2"

➁ Write a shorter list.

    dao.writeToPath(key, "x", {"3"});

    x/@0/    "3"
    x/@1/    "2"

➂ Read the list.

    dao.readFromPath(key, "x", new TypeReference<List<String>>() {});

    {"3","2"}

Challenge: "Shrinking Lists"

Solution: Implementation writes a list terminator value.

    dao.writeToPath(key, "x", {"1","2"});

    x/@0/    "1"
    x/@1/    "2"
    x/@2/    0xFFFFFFFF

    dao.writeToPath(key, "x", {"3"});

    x/@0/    "3"
    x/@1/    0xFFFFFFFF
    x/@2/    0xFFFFFFFF

    dao.readFromPath(key, "x", new TypeReference<List<String>>() {});

    {"3"}

Unfortunately, this is only a partial solution, because it is still possible to read "stale" list elements using a positional index in the path.

This can be avoided by doing a delete before a write, but for performance reasons the library will not do that automatically.

Conclusion: The user must know what they are doing and understand the implementation.
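The terminator idea and its limitation can be reproduced with an in-memory map standing in for a Cassandra row. Everything in this sketch is hypothetical (the class `ListTerminatorSketch`, the sentinel object used in place of the 0xFFFFFFFF value shown on the slide, the use of a `TreeMap` as the row); it illustrates the behaviour described above and is not the library's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the list terminator; a Map stands in for a Cassandra row. */
final class ListTerminatorSketch {

    /** Sentinel standing in for the terminator value (0xFFFFFFFF on the slide). */
    static final Object TERMINATOR = new Object();

    /** Writes each element under path/@i/ and a terminator after the last one, without deleting first. */
    static void writeList(Map<String, Object> row, String path, List<?> values) {
        for (int i = 0; i < values.size(); i++) {
            row.put(path + "/@" + i + "/", values.get(i));
        }
        row.put(path + "/@" + values.size() + "/", TERMINATOR);
    }

    /** Reads elements in index order and stops at the first terminator or missing index. */
    static List<Object> readList(Map<String, Object> row, String path) {
        List<Object> result = new ArrayList<>();
        for (int i = 0; ; i++) {
            Object value = row.get(path + "/@" + i + "/");
            if (value == null || value == TERMINATOR) {
                return result;
            }
            result.add(value);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new TreeMap<>();
        writeList(row, "x", Arrays.asList("1", "2", "3"));
        writeList(row, "x", Arrays.asList("9"));
        System.out.println(readList(row, "x"));  // list read stops at the terminator
        System.out.println(row.get("x/@2/"));    // positional read still sees the old element
    }
}
```

Reading the whole list stops at the terminator, while the direct read of `x/@2/` still returns the element left over from the longer list, which is the stale read the slide warns about.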

Challenge: Inconsistent Updates

Because objects can be updated at any path, there is no protection against a write "corrupting" an object structure.

    Path path = dao.createPath("x");
    dao.writeToPath(key, path, person1);

    x/address/street/         "Singel 45"
    x/name/                   "John"

    path = dao.createPath("x", "name");
    dao.writeToPath(key, path, person1);

    x/address/street/         "Singel 45"
    x/name/                   "John"
    x/name/address/street/    "Singel 45"
    x/name/name/              "John"  ✘

Challenge: Inconsistent Updates

Solution: Don’t do that!

* If it does happen...

The implementation provides a way to still get the "corrupted" data as simple structures, but an attempt to convert it to a now-incompatible POJO will fail.

Conclusion: The user must know what they are doing and understand the implementation.

Issue: Sorting

Question: What about sorting path elements as something other than strings, such as numerical or time-based UUID elements?

Instead of storing paths as strings, the implementation could have used DynamicComposite.

We tried it. It can work. CQL supports it as a user-defined type. Unfortunately it causes cqlsh to crash, making it difficult to "browse" the data.

It is still in consideration to use DynamicComposite for paths in a future version.

Cassandra Data Model

Thrift

column family

    row key    column name           column value
    <UUID>     x/address/street/     "Singel 45"
               x/name                "John"
               …                     …

- OR -

super column family (coming soon)

    row key    super column name    column name        column value
    <UUID>     x                    address/street/    "Singel 45"
                                    name               "John"
                                    …                  …

Thrift

    ColumnFamilyOperations<K,String,Object> operations =
        new ColumnFamilyTemplate<K,String,Object>(
            keyspace, KeySerializer, StringSerializer, StructureSerializer);

    StructuredDataSupport<K> dao = new ThriftStructuredDataSupport<K>(operations);

Thrift implementation relies on the Hector client.

CQL

    CREATE TABLE person (
        key text,
        path text,
        value text,
        PRIMARY KEY (key, path)
    )

• Cannot use the path itself as a column name because it is “dynamic”

• Dynamic column family

CQL: Data Model Constraints

• Need to do a range (“slice”) query on the path ⇒ path must be a clustering key

• Also, the path must be the first clustering key, since otherwise a query would have to provide an equals condition on the preceding clustering keys.

• One might try putting a secondary index on the path instead of making it a clustering key, but this doesn't work, since Cassandra indexes only work with equals conditions:

    Bad Request: No indexed columns present in by-columns clause with Equal operator

    CREATE TABLE person (
        key text,
        path text,
        value text,
        PRIMARY KEY (key, path)
    )
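The range ("slice") requirement can be illustrated without a cluster by treating one partition as a sorted map from path to value. In this sketch the class `PathSliceSketch` and the upper bound of prefix plus U+FFFF are assumptions chosen for illustration; they show why the paths must be stored in sorted order, and are not how the library builds its queries.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch: a sorted map stands in for one partition ordered by path. */
final class PathSliceSketch {

    /** Returns every entry whose path starts with the given prefix, using a single range lookup. */
    static SortedMap<String, Object> slice(SortedMap<String, Object> row, String pathPrefix) {
        // All paths below the prefix sort between the prefix itself and prefix + U+FFFF
        // (assumes paths never contain U+FFFF, which holds for URL-encoded elements).
        return row.subMap(pathPrefix, pathPrefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedMap<String, Object> row = new TreeMap<>();
        row.put("x/address/street/", "Singel 45");
        row.put("x/name/", "John");
        row.put("y/name/", "Mary");
        // Reading the object stored at "x" touches only the contiguous range of "x/..." paths.
        slice(row, "x/").forEach((path, value) -> System.out.println(path + " " + value));
    }
}
```

With the table above, the corresponding query would presumably restrict `path` to a lower and an upper bound within a single `key`, which is only possible when `path` is a clustering key.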

CQL

    StructuredDataSupport<K> dao = new CqlStructuredDataSupport<K>(
        String tableName,
        String partitionKeyColumnName,
        String pathColumnName,
        String valueColumnName,
        Session session);

CQL implementation relies on the DataStax Java driver.

And the rest…

Planned Features

• Sets with simple values: element values stored in path

• DynamicComposites?

• Multiple row reads and writes

• Slice queries on path ranges

Credits and Acknowledgements

• Thanks to Joost van de Wijgerd at eBuddy for his ideas and feedback

• Jackson JSON Processor, which is core to the C* Path implementation: http://wiki.fasterxml.com/JacksonHome

• Image credits:

    Slide              Image name    Author        Link
    Some Strategies    binary        noegranado    http://www.flickr.com/photos/43360884@N04/6949896929/

C* Path

Open Source Java Library for decomposing complex objects into Path-Value pairs — and storing them in Cassandra

https://github.com/ebuddy/c-star-path

* Artifacts available at Maven Central.
