The C* Developer Training
Chuck Droukas, Systems Engineer, DataStax
Disclaimers
•This course is designed to be a “fast start” on the basics of data modeling with Cassandra.
•We will cover some basic administration information up front, because it matters when you choose your data model.
•It is still important to take a proper Admin class if you are responsible for a production instance.
•This course focuses on CQL3, but Thrift will not be ignored.
•Please ask questions and interrupt me. It makes the day go faster for both of us.
Agenda
•Architecture Overview
-Ring Topology
-Write Path
-Read Path
-Updates and Deletes
•Break
•Columns and their components
•Column Families
•Lunch
•Keyspaces
•Complex Queries
•Break
•Timeseries Example
•User Activity Example
•Shopping Cart Example
•Logging Example
The Cassandra Schema
Consists of:
•Column
•Column Family (aka Table)
•Keyspace (aka Database)
•Cluster
High Level Overview
Keyspace
Column Family /Table
Rows
Columns
Components of the Column
The column is the fundamental data type in Cassandra and includes:
• Column name
• Column value
• Timestamp
• TTL (Optional)
The Column
Name
Value
Timestamp
(Name: “firstName”, Value: “Engelbert”, Timestamp: 1363106500)
Column Name
• Can be any value
• Can be any type
• Not optional
• Must be unique
• Stored with every value
Column Value
• Any value
• Any type
• Can be empty, but is required
Column Names and Values
•The data type for a column (or row key) value is called a validator.
•The data type for a column name is called a comparator.
•Cassandra validates the data type of row keys.
•Columns are sorted, and stored in sorted order on disk, so you have to specify a comparator for columns. The order can be reversed… more on this later.
Data Types
Column TimeStamp
• 64-bit integer
• Best Practice
– Should be created in a consistent manner by all your clients
• Required
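Why consistent timestamps matter can be sketched in a few lines of Python. This is only an illustration, assuming the usual "highest timestamp wins" reconciliation; the Column class and reconcile function are hypothetical names, not Cassandra APIs, and tie-breaking is simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    # Hypothetical stand-in for a Cassandra column (name, value, timestamp).
    name: str
    value: str
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version with the higher timestamp (ties keep b here, a simplification)."""
    return a if a.timestamp > b.timestamp else b


old = Column("firstName", "Engelbert", 1363106500)
new = Column("firstName", "Bert", 1363106501)
winner = reconcile(old, new)

# A client with a slow clock writes later in wall-clock time,
# but its timestamp is lower, so its write is silently discarded.
skewed = Column("firstName", "Stale", 1363106400)
still = reconcile(winner, skewed)
```

If clients generate timestamps from clocks that disagree, the "latest" write is decided by the timestamp, not by the order in which the writes actually arrived.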
Column TTL
• Defined on INSERT
• Positive delay (in seconds)
• After time expires it is marked for deletion
Special Types of Columns
• Super
• Counter
• Collections
Counters
• Allows for addition / subtraction
• 64-bit value
• No timestamp
• Deletion does not require a timestamp
Collections
•New in 1.2!
•Set, Map, List
SET Example
The Cassandra Schema
Consists of:
•Column
•Column Family
•Keyspace
•Cluster
Column Families / Tables
•Same as tables
-Groupings of Rows
-AcID
-Eventual Consistency
•De-Normalization
-To avoid I/O
-Simplify the Read Path
•Static or Dynamic
Static Column Families
•Are the most similar to a relational table
•Most rows have the same column names
•Columns in rows can be different
Row Key    Columns
jbellis    Name: Jonathan | Email: [email protected] | Address: 123 main | State: TX
dhutch     Name: Daria | Email: [email protected] | Address: 45 2nd St. | State: CA
egilmore   Name: eric | Email: [email protected]
Dynamic Column Families
•Also called “wide rows”
•Structured so a query into the row will answer a question
Subscribers
Row Key    Columns
jbellis    dhutch, egilmore, datastax, mzcassie
dhutch     egilmore
egilmore   datastax, mzcassie
Dynamic Table CQL3 Example
CREATE TABLE timeline (
user_id varchar,
tweet_id uuid,
author varchar,
body varchar,
PRIMARY KEY (user_id, tweet_id)
)
Clustering Order
•Sorts columns on disk by default
•Can change the order
The Cassandra Schema
Consists of:
•Column
•Column Family
•Keyspace
•Cluster
Keyspaces
•Are groupings of Column Families
•Replication strategies
•Replication factor
CREATE KEYSPACE videodb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
In production you would use NetworkTopologyStrategy for multiple DCs.
CREATE KEYSPACE "Excalibur" WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
Complex Queries
Partitioning and Indexing
Partitioners
•Partitioner Types
- RandomPartitioner / Murmur3Partitioner
- ByteOrderedPartitioner
•Random means that your tokens are random, so your ordering is random
•Ordered means your K → T (key to token) mapping is a no-op and ordering is lexical
- For each node
- And for the ring
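The difference between the two partitioner families can be sketched in Python. This is a simplified illustration, not the server's code: random_token hashes the key with MD5 in the spirit of RandomPartitioner (the real token derivation differs in detail), and ordered_token uses the key bytes unchanged, as a byte-ordered partitioner does. Both function names are hypothetical.

```python
import hashlib


def random_token(key: str) -> int:
    """Hash-based token: key order is scrambled, load spreads evenly."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def ordered_token(key: str) -> bytes:
    """Byte-ordered token: the key-to-token mapping is a no-op."""
    return key.encode("utf-8")


keys = ["jbellis", "dhutch", "egilmore", "datastax", "mzcassie"]

# Order in which the ring would hold the rows under each scheme.
by_hash = sorted(keys, key=random_token)
by_bytes = sorted(keys, key=ordered_token)
```

With byte ordering, by_bytes is simply the lexical order of the keys, so range scans over keys are meaningful. With hashing, by_hash has no relationship to the key order, which is why range queries go through token(k) instead of k.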
Partitioners (cont’d)
•SELECT * FROM test WHERE token(k) > token(42);
Primary Index Overview
•Index for all of your row keys
•Per-node index
•Partitioner + placement manages which node
•Keys are just kept in ordered buckets
•Partitioner determines how K → Token
Natural Keys
•Examples:
-An email address
-A user id
•Easy to make the relationship
•Less de-normalization
•More risk of an ‘UPSERT’
•Changing the key requires a bulk copy operation
Surrogate Keys
•Example:
-UUID
•Independently generated
•Allows you to store multiple versions of a user
•Relationship is now indirect
•Changing the key requires the creation of a new row, or column
Compound (Composite) Primary Keys
Sorting
•It’s Free!
•Like Open Source is free
•ONLY on the second column in compound Primary Key
Secondary Indexes
•Need for an easy way to do limited ad-hoc queries
•Supports multiple per row
•Single clause can support multiple selectors
•Implemented as a hash map, not B-Tree
•Low cardinality ONLY
Secondary Indexes
Conditional Operators
Data Modeling
The Basics of C* Modeling
•Work backwards
-What does your application do?
-What are the access patterns?
•Now design your data model
Procedures
Consider use case requirements
•What data?
•Ordering?
•Filtering?
•Grouping?
•Events in chronological order?
•Does the data expire?
De-Normalization
•The New Black: De-Normalization
-Forget everything you’ve learned about normalization…then forget it again!!!
•The Ugly:
-Resource contention
-Latency
-Client-side joins
•Avoid them in your C* code
Foreign Keys
•There are no foreign keys
•No server-side joins
What now?
•Ideally each query will be one row
-Compared to other resources, disk space is cheap
•Reduce disk seeks
•Reduce network traffic
Workload Preference
•High level of de-normalization means you may have to write the same data many times
•Cassandra handles large numbers of writes well
Concurrent Writes
•A row is always referenced by a Key
•Keys are just bytes
•They must be unique within a CF
•Primary keys are unique
-But Cassandra will not enforce uniqueness
-If you are not careful you will accidentally [UPSERT] the whole thing
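The accidental UPSERT can be modeled with a plain dictionary. This is a toy sketch, not a client API: table and insert are hypothetical names, and the point is only that a write to an existing key is merged column by column instead of being rejected.

```python
# Toy model of a column family: row key -> {column name: value}.
table = {}


def insert(table, key, columns):
    """Write columns under key; an existing row is updated, never rejected."""
    table.setdefault(key, {}).update(columns)


insert(table, "jbellis", {"name": "Jonathan", "state": "TX"})

# A second "new user" picks the same natural key. No error is raised:
# the name is overwritten and the old state column is left behind.
insert(table, "jbellis", {"name": "Someone Else"})
```

After the second call the row holds a mix of both users' data, which is the risk the slide warns about when natural keys collide.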
Let’s Review Some Examples…
Relational Concept - De-normalization
• To combine relations into a single row
• Used in relational modeling to avoid complex joins
Employees
id  First    Last
1   Edgar    Codd
2   Raymond  Boyce
Department
id  Dept
1   Engineering
2   Math
SELECT e.First, e.Last, d.Dept
FROM Department d, Employees e
WHERE 1 = e.id AND e.id = d.id
Take this and then...
Thursday, May 2, 13
Relational Concept - De-normalization
• Combine table columns into a single view
• No joins
• All in how you set the data for fast reads
Employees
id  First    Last   Dept
1   Edgar    Codd   Engineering
2   Raymond  Boyce  Math
SELECT First, Last, Dept
FROM employees
WHERE id = '1'
Cassandra Concept - One-to-Many
• Relationship without being relational
• Users have many videos
• Wait? Where is the foreign key?
Users
username  firstname  lastname  email
tcodd     Edgar      Codd      [email protected]
rboyce    Raymond    Boyce     [email protected]
Videos
videoid   videoname     username  description             tags
99051fe9  My funny cat  tcodd     My cat plays the piano  cats,piano,lol
b3a76c6b  Math          tcodd     Now my dog plays        dogs,piano,lol
Cassandra Concept - One-to-many
• Static table to store videos
• UUID for unique video id
• Add username to denormalize
CREATE TABLE videos (
  videoid uuid,
  videoname varchar,
  username varchar,
  description varchar,
  tags varchar,
  upload_date timestamp,
  PRIMARY KEY (videoid)
);
Cassandra Concept - One-to-Many
• Lookup video by username
• Write in two tables at once for fast lookups
CREATE TABLE username_video_index (
  username varchar,
  videoid uuid,
  upload_date timestamp,
  video_name varchar,
  PRIMARY KEY (username, videoid)
);
SELECT video_name
FROM username_video_index
WHERE username = 'tcodd' AND videoid = '99051fe9'
Creates a wide row!
Cassandra concept - Many-to-many
• Users and videos have many comments
Users
username  firstname  lastname  email
tcodd     Edgar      Codd      [email protected]
rboyce    Raymond    Boyce     [email protected]
Videos
videoid   videoname     username  description             tags
99051fe9  My funny cat  tcodd     My cat plays the piano  cats,piano,lol
b3a76c6b  Math          tcodd     Now my dog plays        dogs,piano,lol
Comments
username  videoid   comment
tcodd     99051fe9  Sweet!
rboyce    b3a76c6b  Boring :(
Cassandra concept - Many-to-many
• Model both sides of the view
• Insert both when comment is created
• View from either side
CREATE TABLE comments_by_user (
  username varchar,
  videoid uuid,
  comment_ts timestamp,
  comment varchar,
  PRIMARY KEY (username, videoid)
);
CREATE TABLE comments_by_video (
  videoid uuid,
  username varchar,
  comment_ts timestamp,
  comment varchar,
  PRIMARY KEY (videoid, username)
);
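The "insert both" step can be sketched in Python with two dictionaries standing in for the two tables. This is a toy model under stated assumptions: add_comment is a hypothetical helper, and a real application would issue two INSERT statements through its driver instead.

```python
from collections import defaultdict

# Stand-ins for the two tables, keyed the same way as their primary keys.
comments_by_user = defaultdict(dict)   # username -> {videoid: comment}
comments_by_video = defaultdict(dict)  # videoid -> {username: comment}


def add_comment(username, videoid, comment):
    """Write the same comment to both views so either side can be read directly."""
    comments_by_user[username][videoid] = comment
    comments_by_video[videoid][username] = comment


add_comment("tcodd", "99051fe9", "Sweet!")
add_comment("rboyce", "b3a76c6b", "Boring :(")

# Read from either side without a join.
tcodd_comments = comments_by_user["tcodd"]
cat_video_comments = comments_by_video["99051fe9"]
```

Note that with (username, videoid) as the whole primary key, a second comment by the same user on the same video overwrites the first, in the toy model and in the tables as declared.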
Time Series Data
•Sensors
- CPU
- Network Card
- Wave-Form
- Resource Utilization
•Clickstream data
•Historical trends
•Anything that varies on a temporal basis
Timeseries Example
WHITEBOARD TIME!!
Single Device Per Row
Single device per row - Time Series Pattern 1
• The simplest model for storing time series data is creating a wide row of data for each source.
• The timestamp of the reading will be the column name and the temperature the column value.
• Since each column is dynamic, our row will grow as needed to accommodate the data.
• We will also get the built-in sorting of Cassandra to keep everything in order.
http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling#!pc
Single Device Per Row
CREATE TABLE temperature (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
);
Slice Query
SELECT temperature
FROM temperature
WHERE weatherstation_id = '1234ABCD'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';
Partitioning to limit row size
Partitioning to limit row size - Time Series Pattern 2
• Cassandra can store up to 2 billion columns per row, but if we're storing data every millisecond we wouldn't even get a month’s worth of data.
• The solution is to use a pattern called row partitioning: add data to the row key to limit the number of columns you get per device.
• Using data already available in the event, we can use the date portion of the timestamp and add that to the weather station id.
• This will give us a row per day, per weather station, and an easy way to find the data.
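The arithmetic behind the "not even a month" claim, and the day-bucketed row key, can be checked with a short Python sketch. The 2 billion limit and the one-reading-per-millisecond rate come from the slide; partition_key is a hypothetical helper, and formatting the date in UTC is an assumption.

```python
from datetime import datetime, timezone

MAX_COLUMNS_PER_ROW = 2_000_000_000   # limit quoted on the slide
MS_PER_DAY = 24 * 60 * 60 * 1000      # 86,400,000 readings per day at 1 per ms

# One reading per millisecond fills a single row in roughly 23 days.
days_until_full = MAX_COLUMNS_PER_ROW / MS_PER_DAY


def partition_key(weatherstation_id: str, event_time: datetime) -> tuple:
    """Build the (station, day) row key used to cap the row at one day of data."""
    return (weatherstation_id, event_time.strftime("%Y-%m-%d"))


key = partition_key(
    "1234ABCD", datetime(2013, 4, 3, 7, 1, 0, tzinfo=timezone.utc)
)
```

With the date folded into the key, each row holds at most 86.4 million columns at that rate, far below the limit, and a day's data is found by computing the same key at read time.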
Partitioning to limit row size
CREATE TABLE temperature_by_day (
  weatherstation_id text,
  date text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY ((weatherstation_id, date), event_time)
);
Get all the weather data for a single day:
SELECT *
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date = '2013-04-03';
Reverse Order Time Series/Expiring Columns
Reverse order timeseries with expiring columns - Time Series Pattern 3
• Imagine we are using this data for a dashboard application and we only want to show the last 10 temperature readings.
• Older data is no longer useful, so it can be purged eventually.
• We can take advantage of a feature called expiring columns to have our data quietly disappear after a set number of seconds.
Partitioning to limit row size
CREATE TABLE latest_temperatures (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Insert Data With TTLs
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:03:00', '72F') USING TTL 20;
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:02:00', '73F') USING TTL 20;
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:01:00', '73F') USING TTL 20;
INSERT INTO latest_temperatures (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:04:00', '74F') USING TTL 20;
Shopping cart use case
*Store shopping cart data reliably
*Minimize (or eliminate) downtime. Multi-dc
*Scale for the “Cyber Monday” problem
The bad
*Every minute off-line is lost $$
*Online shoppers want speed!
Shopping Cart Example
* Unashamedly ripped off from Patrick McFadin’s Cassandra Summit 2013 presentation
The 5 C* Commandments for Developers
1. Start with queries. Don’t data model for data modeling’s sake. That is sooo turn of the century.
2. It’s ok to duplicate data. Really. Get over it.
3. C* is designed to read and write sequentially. Great for rotational disk, awesome for SSDs, awful for NAS. So don’t do it. Ever.
4. Secondary indexes are not a band-aid for a poor data model.
5. Embrace wide rows and de-normalization
…and Cassandra will not ask if your “wallet is open.”