Apache Cassandra Developer Training Slide Deck


The C* Developer Training
Chuck Droukas, Systems Engineer – DataStax

Disclaimers

• This course is designed to be a “fast start” on the basics of data modeling with Cassandra.
• We will cover some basic administration information up front that is important to understand as you choose your data model.
• It is still important to take a proper Admin class if you are responsible for a production instance.
• This course focuses on CQL3, but Thrift will not be ignored.
• Please ask questions and interrupt me. It makes the day go faster for both of us.

Agenda

• Architecture Overview
  - Ring Topology
  - Write Path
  - Read Path
  - Updates and Deletes
• Break
• Columns and their components
• Column Families
• Lunch
• Keyspaces
• Complex Queries
• Break
• Timeseries Example
• User Activity Example
• Shopping Cart Example
• Logging Example

The Cassandra Schema

Consists of:
• Column
• Column Family (aka Table)
• Keyspace (aka Database)
• Cluster

High Level Overview

Keyspace
  Column Family / Table
    Rows
      Columns

Components of the Column

The column is the fundamental data type in Cassandra and includes:
• Column name
• Column value
• Timestamp
• TTL (optional)

The Column

Name

Value

Timestamp

(Name: “firstName”, Value: “Engelbert”, Timestamp: 1363106500)

Column Name

• Can be any value
• Can be any type
• Not optional
• Must be unique
• Stored with every value

Column Value

• Any value
• Any type
• Can be empty – but is required

Column Names and Values

• The data type for a column (or row key) value is called a validator.
• The data type for a column name is called a comparator.
• Cassandra validates the data type of row keys.
• Columns are sorted, and stored in sorted order on disk, so you have to specify a comparator for columns. The sort order can be reversed (more on this later).

Data Types

Column TimeStamp

• 64-bit integer
• Best practice
  - Should be created in a consistent manner by all your clients
• Required
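The reason consistent timestamps matter: when two writes target the same column, Cassandra keeps the version with the highest timestamp (last write wins). The following is a minimal Python sketch of that rule, not Cassandra code. The `Column` tuple and the `reconcile` helper are illustrative names, and tie-breaking between equal timestamps is ignored here.

```python
from collections import namedtuple

# Illustrative model of a column: name, value, and write timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])

def reconcile(a, b):
    """Return the winning version of a column: the higher timestamp wins."""
    return a if a.timestamp >= b.timestamp else b

old = Column("firstName", "Engelbert", 1363106500)
new = Column("firstName", "Engel", 1363106600)

# The order in which the writes arrive does not matter, only the timestamps.
winner = reconcile(new, old)
print(winner.value)
```

If one client's clock runs ahead of the others, its writes will win regardless of the order they were issued in, which is why the timestamps should be generated the same way everywhere.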

Column TTL

• Defined on INSERT
• Positive delay (in seconds)
• After the time expires, the column is marked for deletion
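A rough way to picture the TTL check, written as a small Python sketch. The `is_live` function and the sample times are hypothetical and only model the idea that a column stops being readable once its write time plus its TTL has passed.

```python
def is_live(write_time, ttl_seconds, now):
    """A column with a TTL is live until write_time + ttl_seconds."""
    if ttl_seconds is None:      # no TTL: the column never expires
        return True
    return now < write_time + ttl_seconds

written_at = 1000  # hypothetical write time, in seconds

print(is_live(written_at, 20, now=1010))  # 10 seconds after the write
print(is_live(written_at, 20, now=1020))  # 20 seconds after the write
```

The first call reports the column as live, the second as expired.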

Special Types of Columns

• Super
• Counter
• Collections

Counters

• Allows for addition / subtraction
• 64-bit value
• No timestamp
• Deletion does not require a timestamp

Collections

• New in 1.2!
• Set, Map, List

SET Example


Column Families / Tables

• Same as tables
  - Groupings of Rows
  - AcID
  - Eventual Consistency
• De-Normalization
  - To avoid I/O
  - Simplify the Read Path
• Static or Dynamic

Static Column Families

• Are the most similar to a relational table
• Most rows have the same column names
• Columns in rows can be different

Row Key  | Columns
jbellis  | Name: Jonathan | Email: jb@ds.com | Address: 123 main | State: TX
dhutch   | Name: Daria | Email: dh@ds.com | Address: 45 2nd St. | State: CA
egilmore | Name: eric | Email: eg@ds.com

Dynamic Column Families

• Also called “wide rows”
• Structured so a query into the row will answer a question

Subscribers

Row Key  | Columns
jbellis  | dhutch | egilmore | datastax | mzcassie
dhutch   | egilmore
egilmore | datastax | mzcassie

Dynamic Table CQL3 Example

CREATE TABLE timeline (
  user_id varchar,
  tweet_id uuid,
  author varchar,
  body varchar,
  PRIMARY KEY (user_id, tweet_id)
);

Clustering Order

• Sorts columns on disk by default
• Can change the order


Keyspaces

• Are groupings of Column Families
• Replication strategies
• Replication factor

CREATE KEYSPACE videodb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }

In production you would use NetworkTopologyStrategy for multiple DCs.

CREATE KEYSPACE "Excalibur" WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};

Complex Queries: Partitioning and Indexing

Partitioners

• Partitioner types
  - RandomPartitioner / Murmur3Partitioner
  - ByteOrderedPartitioner
• Random means that your tokens are random, so your ordering is random
• Ordered means your K → T (key to token) mapping is a no-op and ordering is lexical
  - For each node
  - And for the ring

Partitioners (cont’d)

•SELECT * FROM test WHERE token(k) > token(42);

Primary Index Overview

• Index for all of your row keys
• Per-node index
• Partitioner + placement manages which node
• Keys are just kept in ordered buckets
• Partitioner determines how K → Token
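The key-to-token step and the node lookup can be pictured with a short Python sketch. This is a simplified model and not Cassandra's implementation: it hashes the key with MD5 (the hash RandomPartitioner is based on), and the three-node `ring` plus the `token` and `owner` helpers are hypothetical names chosen for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token space for this sketch

def token(key):
    """Map a row key to a token by hashing it (random-style partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE

# Hypothetical ring: each node owns the token range that ends at its token.
ring = [
    (RING_SIZE // 3, "node1"),
    (2 * RING_SIZE // 3, "node2"),
    (RING_SIZE - 1, "node3"),
]
tokens = [t for t, _ in ring]

def owner(key):
    """Find the first node whose token is >= the key's token (wrapping around)."""
    i = bisect.bisect_left(tokens, token(key)) % len(ring)
    return ring[i][1]

for k in ("jbellis", "dhutch", "egilmore"):
    print(k, owner(k))
```

Because the token comes from a hash, neighbouring keys land on unrelated parts of the ring, which is why range queries over row keys are only meaningful through `token()` with a random-style partitioner.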

Natural Keys

• Examples:
  - An email address
  - A user id
• Easy to make the relationship
• Less de-normalization
• More risk of an ‘UPSERT’
• Changing the key requires a bulk copy operation

Surrogate Keys

• Example:
  - UUID
• Independently generated
• Allows you to store multiple versions of a user
• Relationship is now indirect
• Changing the key requires the creation of a new row, or column

Compound (Composite) Primary Keys

Sorting

• It’s Free!
• Like Open Source is free
• ONLY on the second column in compound Primary Key

Secondary Indexes

• Need for an easy way to do limited ad-hoc queries
• Supports multiple per row
• Single clause can support multiple selectors
• Implemented as a hash map, not B-Tree
• Low cardinality ONLY

Secondary Indexes

Conditional Operators

Data Modeling

The Basics of C* Modeling

• Work backwards
  - What does your application do?
  - What are the access patterns?
• Now design your data model

Procedures

Consider use case requirements
• What data?
• Ordering?
• Filtering?
• Grouping?
• Events in chronological order?
• Does the data expire?

De-Normalization

• The New Black: De-Normalization
  - Forget everything you’ve learned about normalization…then forget it again!!!
• The Ugly:
  - Resource contention
  - Latency
  - Client-side joins
• Avoid them in your C* code

Foreign Keys

• There are no foreign keys
• No server-side joins

What now?

• Ideally each query will be one row
  - Compared to other resources, disk space is cheap
• Reduce disk seeks
• Reduce network traffic

Workload Preference

• High level of de-normalization means you may have to write the same data many times
• Cassandra handles large numbers of writes well

Concurrent Writes

• A row is always referenced by a Key
• Keys are just bytes
• They must be unique within a CF
• Primary keys are unique
  - But Cassandra will not enforce uniqueness
  - If you are not careful you will accidentally [UPSERT] the whole thing
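To make the accidental upsert concrete, here is a small Python sketch in which a plain dictionary stands in for a column family. The `table` dictionary and the `insert` helper are hypothetical; they only mimic the behaviour that a second insert with an existing key overwrites columns instead of failing.

```python
table = {}  # stand-in for a column family: row key -> columns

def insert(key, columns):
    """Mimic an insert: no uniqueness check, existing columns are overwritten."""
    table.setdefault(key, {}).update(columns)

insert("jbellis", {"name": "Jonathan", "email": "jb@ds.com"})

# A second insert with the same key does not raise an error; it overwrites.
insert("jbellis", {"name": "Jon"})

print(table["jbellis"])
```

After the second call the row holds the new name and the original email, and nothing signalled that an existing row was touched.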

Let’s Review Some Examples…

Relational Concept - De-normalization

• To combine relations into a single row
• Used in relational modeling to avoid complex joins

Employees

id | First   | Last
1  | Edgar   | Codd
2  | Raymond | Boyce

Department

id | Dept
1  | Engineering
2  | Math

SELECT e.First, e.Last, d.Dept FROM Department d, Employees e WHERE 1 = e.id AND e.id = d.id

Take this and then...

Thursday, May 2, 13

Relational Concept - De-normalization

• Combine table columns into a single view
• No joins
• All in how you set the data for fast reads

Employees

id | First   | Last  | Dept
1  | Edgar   | Codd  | Engineering
2  | Raymond | Boyce | Math

SELECT First, Last, Dept FROM employees WHERE id = '1'

Cassandra Concept - One-to-Many

• Relationship without being relational
• Users have many videos
• Wait? Where is the foreign key?

Users

username | firstname | lastname | email
tcodd    | Edgar     | Codd     | tcodd@relational.com
rboyce   | Raymond   | Boyce    | rboyce@relational.com

Videos

videoid  | videoname    | username | description            | tags
99051fe9 | My funny cat | tcodd    | My cat plays the piano | cats,piano,lol
b3a76c6b | Math         | tcodd    | Now my dog plays       | dogs,piano,lol

Cassandra Concept - One-to-many

• Static table to store videos
• UUID for unique video id
• Add username to denormalize

CREATE TABLE videos (
  videoid uuid,
  videoname varchar,
  username varchar,
  description varchar,
  tags varchar,
  upload_date timestamp,
  PRIMARY KEY (videoid)
);

Cassandra Concept - One-to-Many

• Lookup video by username
• Write in two tables at once for fast lookups

CREATE TABLE username_video_index (
  username varchar,
  videoid uuid,
  upload_date timestamp,
  video_name varchar,
  PRIMARY KEY (username, videoid)
);

SELECT video_name FROM username_video_index WHERE username = 'tcodd' AND videoid = '99051fe9'

Creates a wide row!
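The "write in two tables at once" idea can be sketched in Python with two dictionaries standing in for the `videos` and `username_video_index` tables. The `add_video` helper is a hypothetical application-side function; the point is only that the application, not the database, keeps both copies in step.

```python
import uuid

videos = {}                 # videoid -> video columns
username_video_index = {}   # username -> {videoid: video_name} (the wide row)

def add_video(username, video_name, description):
    """Write the video and its index entry together so each lookup reads one row."""
    videoid = uuid.uuid4()
    videos[videoid] = {
        "videoname": video_name,
        "username": username,
        "description": description,
    }
    username_video_index.setdefault(username, {})[videoid] = video_name
    return videoid

vid = add_video("tcodd", "My funny cat", "My cat plays the piano")

# Lookup by username touches only the index row, no join needed.
print(username_video_index["tcodd"][vid])
```

Every additional video by the same user adds one more entry to that user's index row, which is what makes it a wide row.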


Cassandra concept - Many-to-many

• Users and videos have many comments

Users

username | firstname | lastname | email
tcodd    | Edgar     | Codd     | tcodd@relational.com
rboyce   | Raymond   | Boyce    | rboyce@relational.com

Videos

videoid  | videoname    | username | description            | tags
99051fe9 | My funny cat | tcodd    | My cat plays the piano | cats,piano,lol
b3a76c6b | Math         | tcodd    | Now my dog plays       | dogs,piano,lol

Comments

username | videoid  | comment
tcodd    | 99051fe9 | Sweet!
rboyce   | b3a76c6b | Boring :(

Cassandra concept - Many-to-many

• Model both sides of the view
• Insert both when comment is created
• View from either side

CREATE TABLE comments_by_user (
  username varchar,
  videoid uuid,
  comment_ts timestamp,
  comment varchar,
  PRIMARY KEY (username, videoid)
);

CREATE TABLE comments_by_video (
  videoid uuid,
  username varchar,
  comment_ts timestamp,
  comment varchar,
  PRIMARY KEY (videoid, username)
);
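A minimal Python sketch of modeling both sides, assuming two dictionaries keyed the same way as the two tables above. The `add_comment` helper is hypothetical and the video ids are the abbreviated ones from the sample data.

```python
comments_by_user = {}    # username -> {videoid: comment}
comments_by_video = {}   # videoid -> {username: comment}

def add_comment(username, videoid, comment):
    """Insert the comment into both views at the time it is created."""
    comments_by_user.setdefault(username, {})[videoid] = comment
    comments_by_video.setdefault(videoid, {})[username] = comment

add_comment("tcodd", "99051fe9", "Sweet!")
add_comment("rboyce", "b3a76c6b", "Boring :(")

print(comments_by_user["tcodd"])      # all comments written by one user
print(comments_by_video["b3a76c6b"])  # all comments on one video
```

One consequence of keying on (username, videoid) alone: a second comment by the same user on the same video would replace the first. Including the comment timestamp in the key is one way to keep both, if the application needs that.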

Time Series Data

• Sensors
  - CPU
  - Network Card
  - Wave-Form
  - Resource Utilization
• Clickstream data
• Historical trends
• Anything that varies on a temporal basis

Timeseries Example

WHITEBOARD TIME!!

Single Device Per Row

Single device per row - Time Series Pattern 1
• The simplest model for storing time series data is creating a wide row of data for each source.
• The timestamp of the reading will be the column name and the temperature the column value.
• Since each column is dynamic, our row will grow as needed to accommodate the data.
• We will also get the built-in sorting of Cassandra to keep everything in order.

http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling#!pc

Single Device Per Row

CREATE TABLE temperature (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
);

Slice Query

SELECT temperature
FROM temperature
WHERE weatherstation_id = '1234ABCD'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';
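Why a slice query is cheap: the columns of the wide row are already stored in sorted order, so a range over event_time is a contiguous run. The Python sketch below models one weather station's row as a sorted list and uses binary search to find the run. The `row` data and `slice_row` helper are illustrative only, with readings taken from the sample inserts in this deck.

```python
import bisect

# One wide row: (event_time, temperature) pairs kept sorted by event_time.
row = [
    ("2013-04-03 07:01:00", "73F"),
    ("2013-04-03 07:02:00", "73F"),
    ("2013-04-03 07:03:00", "72F"),
    ("2013-04-03 07:04:00", "74F"),
]
times = [t for t, _ in row]

def slice_row(start, end):
    """Return readings with start < event_time < end (both bounds exclusive)."""
    lo = bisect.bisect_right(times, start)
    hi = bisect.bisect_left(times, end)
    return row[lo:hi]

for event_time, temperature in slice_row("2013-04-03 07:01:00",
                                         "2013-04-03 07:04:00"):
    print(event_time, temperature)
```

With the same bounds as the query above, only the 07:02 and 07:03 readings fall inside the slice, because both comparisons are strict.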

Partitioning to limit row size 

Partitioning to limit row size – Time Series Pattern 2
• Cassandra can store up to 2 billion columns per row, but if we're storing data every millisecond we wouldn't even get a month’s worth of data.
• The solution is to use a pattern called row partitioning: add data to the row key to limit the number of columns you get per device.
• Using data already available in the event, we can use the date portion of the timestamp and add that to the weather station id.
• This will give us a row per day, per weather station, and an easy way to find the data.
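The arithmetic behind the "not even a month" claim, plus the row-key bucketing, as a short Python sketch. The 2 billion figure is the limit quoted above; the `partition_key` helper is a hypothetical name for building the composite (station, date) key.

```python
from datetime import datetime

MAX_COLUMNS = 2_000_000_000        # per-row column limit quoted above
MS_PER_DAY = 24 * 60 * 60 * 1000   # readings per day at one per millisecond

days_until_full = MAX_COLUMNS / MS_PER_DAY
print(round(days_until_full, 1))   # roughly 23 days, less than a month

def partition_key(weatherstation_id, event_time):
    """Build the row key from the station id and the date part of the timestamp."""
    return (weatherstation_id, event_time.strftime("%Y-%m-%d"))

print(partition_key("1234ABCD", datetime(2013, 4, 3, 7, 1, 0)))
```

With a one-day bucket, a station writing every millisecond produces 86.4 million columns per row, far below the limit.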

 

Partitioning to limit row size 

CREATE TABLE temperature_by_day (
  weatherstation_id text,
  date text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY ((weatherstation_id, date), event_time)
);

Get all the weather data for a single day:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date = '2013-04-03';

Reverse Order Time Series/Expiring Columns

Reverse order timeseries with expiring columns – Time Series Pattern 3
• Imagine we are using this data for a dashboard application and we only want to show the last 10 temperature readings.
• Older data is no longer useful, so it can be purged eventually.
• We can take advantage of a feature called expiring columns to have our data quietly disappear after a set number of seconds.

Reverse Order Time Series/Expiring Columns

CREATE TABLE latest_temperatures (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

Insert Data With TTLs

INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature) VALUES ('1234ABCD', '2013-04-03 07:03:00', '72F') USING TTL 20;
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature) VALUES ('1234ABCD', '2013-04-03 07:02:00', '73F') USING TTL 20;
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature) VALUES ('1234ABCD', '2013-04-03 07:01:00', '73F') USING TTL 20;
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature) VALUES ('1234ABCD', '2013-04-03 07:04:00', '74F') USING TTL 20;

Shopping cart use case

*Store shopping cart data reliably

*Minimize (or eliminate) downtime. Multi-dc

*Scale for the “Cyber Monday” problem

The bad

*Every minute off-line is lost $$

*Online shoppers want speed!

Shopping Cart Example

* Unashamedly ripped off from Patrick McFadin’s Cassandra Summit 2013 presentation

The 5 C* Commandments for Developers

1. Start with queries. Don’t data model for data modeling’s sake. That is sooo turn of the century.
2. It’s ok to duplicate data. Really. Get over it.
3. C* is designed to read and write sequentially. Great for rotational disk, awesome for SSDs, awful for NAS. So don’t do it. Ever.
4. Secondary indexes are not a band-aid for a poor data model.
5. Embrace wide rows and de-normalization.

…and Cassandra will not ask if your “wallet is open.”