DESCRIPTION
Everyone wants to build applications that are scalable and highly available. But how do you build a site that’s capable of withstanding the public’s insatiable demand for sharing cat videos, even if your data center gets hit with a nuclear bomb? In this session we’ll take a look at KillrVideo, an open source video sharing application demo (similar to YouTube) built on Apache Cassandra and Microsoft Azure. You’ll get an introduction to Cassandra, a highly available distributed database, including data modelling (and how it’s different from the relational world you probably have experience with), using CQL to query, and how to interact with Cassandra from your code. We’ll also touch on using Azure Media Services for processing and streaming video content as well as how to set up a Cassandra cluster in Azure. While the code samples in this session will be in C#, the same APIs are available and the same concepts apply to other languages (like Java and Python). If you’re interested in learning more about NoSQL solutions, Cassandra, or Azure, this talk will get you started. No kittens were harmed in the making of this talk.
Satisfying the Public’s Demand for Cat Videos with Cassandra and Azure
Luke Tillman (@LukeTillman)
Language Evangelist at DataStax
Who are you?!
• Evangelist with a focus on the .NET Community
• Long-time Developer
• Recently presented at Cassandra Summit 2014 with Microsoft
• Very Recent Denver Transplant
1 What is this KillrVideo thing you speak of?
2 Cassandra, the really short version
3 CQL: NoSQL, now with more SQL!
4 Breaking the Relational Mindset
5 Putting it all together: Cassandra, Azure, and .NET
What is this KillrVideo thing you speak of?
KillrVideo, a Video Sharing Site
• Think a YouTube competitor
– Users add videos, rate them, comment on them, etc.
– Can search for videos by tag
See the Live Demo, Get the Code
• Live demo available at http://www.killrvideo.com
– Written in C#
– Live Demo running in Azure
– Open source: https://github.com/luketillman/killrvideo-csharp
• Interesting use case because of different data modeling challenges and the scale of something like YouTube
– More than 1 billion unique users visit YouTube each month
– 100 hours of video are uploaded to YouTube every minute
Just How Popular are Cats on the Internet?
http://mashable.com/2013/07/08/cats-bacon-rule-internet/
Cassandra, the really short version
What is Cassandra?
• A Linearly Scaling and Fault Tolerant Distributed Database
• Fully Distributed
– Data spread over many nodes
– All nodes participate in a cluster
– All nodes are equal
– No SPOF (shared nothing)
What is Cassandra?
• Linearly Scaling
– Have More Data? Add more nodes.
– Need More Throughput? Add more nodes.
• Fault Tolerant
– Nodes Down != Database Down
– Datacenter Down != Database Down
What is Cassandra?
• Fully replicated across multiple DCs
• Clients write local
• Data syncs across WAN
• Replication Factor per DC
[Diagram: a client writing to a cluster that spans US and Europe datacenters]
Cassandra and the CAP Theorem
• The CAP Theorem limits what distributed systems can do
– Consistency
– Availability
– Partition Tolerance
• Limits? “Pick 2 out of 3”
• Cassandra is an AP system that is Eventually Consistent
Two knobs control Cassandra fault tolerance
• Replication Factor (server side)
– How many copies of the data should exist?
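[Diagram: four-node ring (A, B, C, D) with RF=3; each node also holds replicas of two other nodes' data, so a client's write of A lands on nodes A, B, and C]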
Two knobs control Cassandra fault tolerance
• Consistency Level (client side)
– How many replicas do we need to hear from before we acknowledge?
[Diagram: the same four-node ring, shown twice; a client writes A once with CL=QUORUM and once with CL=ONE]
Consistency Levels
• Applies to both Reads and Writes (i.e. is set on each query)
• ONE – one replica from any DC
• LOCAL_ONE – one replica from local DC
• QUORUM – a majority of replicas (floor(RF/2) + 1) from any DC
• LOCAL_QUORUM – a majority of replicas (floor(RF/2) + 1) in the local DC
• ALL – all replicas
• TWO – two replicas from any DC
Consistency Level and Availability
• Consistency Level choice affects availability
• For example, QUORUM can tolerate one replica being down and still be available (in RF=3)
[Diagram: four-node ring with RF=3; the replicas hold A=2 and a client reads A at CL=QUORUM]
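The availability claim above comes down to simple arithmetic. Here is a minimal C# sketch, assuming the usual definition of a quorum as floor(RF / 2) + 1 replicas. The class and method names are illustrative only and are not part of any Cassandra driver.

```csharp
using System;

static class ConsistencyMath
{
    // A quorum is a strict majority of replicas: floor(RF / 2) + 1.
    public static int QuorumSize(int replicationFactor) => replicationFactor / 2 + 1;

    // A request succeeds only if enough replicas are still up to satisfy the level.
    public static bool IsAvailable(int requiredReplicas, int replicationFactor, int replicasDown)
        => replicationFactor - replicasDown >= requiredReplicas;

    static void Main()
    {
        const int rf = 3;
        int quorum = QuorumSize(rf); // 2 when RF=3

        Console.WriteLine($"QUORUM at RF={rf} needs {quorum} replicas");
        Console.WriteLine($"QUORUM, one replica down:  {IsAvailable(quorum, rf, 1)}"); // True
        Console.WriteLine($"QUORUM, two replicas down: {IsAvailable(quorum, rf, 2)}"); // False
        Console.WriteLine($"ALL, one replica down:     {IsAvailable(rf, rf, 1)}");     // False
    }
}
```

With RF=3, QUORUM needs two replicas, which is why one node can be down without affecting availability, while ALL cannot tolerate any replica being down.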
Eventual Consistency
• Cassandra is an AP system that is Eventually Consistent, so replicas may disagree
• Column values are timestamped
• In Cassandra, Last Write Wins (LWW)
[Diagram: a client reads A at CL=QUORUM; one replica returns the newer value A=2, another returns the older value A=1, and the client receives A=2]
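Last Write Wins can be sketched in a few lines of C#. This is only an illustration of the idea: the `ReplicaResponse` type and the timestamp values are made up for the example, and the real reconciliation happens inside Cassandra, not in client code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// What one replica returns for a column: the value and when it was written.
sealed class ReplicaResponse
{
    public string Replica { get; }
    public int Value { get; }
    public long WriteTimestamp { get; }

    public ReplicaResponse(string replica, int value, long writeTimestamp)
    {
        Replica = replica;
        Value = value;
        WriteTimestamp = writeTimestamp;
    }
}

static class LastWriteWins
{
    // The response carrying the highest write timestamp wins.
    public static ReplicaResponse Resolve(IEnumerable<ReplicaResponse> responses)
        => responses.OrderByDescending(r => r.WriteTimestamp).First();

    static void Main()
    {
        var responses = new[]
        {
            new ReplicaResponse("B", value: 1, writeTimestamp: 1000), // older
            new ReplicaResponse("C", value: 2, writeTimestamp: 2000), // newer
        };

        Console.WriteLine($"A={Resolve(responses).Value}"); // A=2
    }
}
```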
CQL: NoSQL, now with more SQL!
Schema Definition (DDL)
• Easy to define tables for storing data
• First part of Primary Key is the Partition Key
CREATE TABLE videos (
  videoid uuid,
  userid uuid,
  name text,
  description text,
  preview_image_location text,
  tags set<text>,
  added_date timestamp,
  PRIMARY KEY (videoid)  -- videoid is the Partition Key
);
Partition Key Determines Data Distribution
• Partition Key determines node placement
videoid      | name                | description             | ...
689d56e5- …  | Keyboard Cat        | Keyboard Cat is the ... | ...
93357d73- …  | Nyan Cat            | Check out Nyan cat ...  | ...
d978b136- …  | Original Grumpy Cat | Visit Grumpy Cat’s …    | ...
Partition Key – Hashing
• The Partition Key is hashed with Murmur3, and the resulting token is used to place the data on a node (consistent hashing)
• The data is also replicated to RF-1 other nodes
[Diagram: videoid 689d56e5- ... (Keyboard Cat) is hashed with Murmur3 to A; on the four-node ring with RF=3 the row is stored on the node that owns A and replicated to two more nodes]
Hashing – Back to Reality
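• Back in reality, Partition Keys actually hash to 128 bit numbers (Murmur3 produces a 128 bit hash; Cassandra's Murmur3Partitioner uses a 64 bit token taken from it)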
• Nodes in Cassandra own token ranges (i.e. hash ranges)
[Diagram: four-node ring A, B, C, D]

Range | Start        | End
A     | 0xC000000..1 | 0x0000000..0
B     | 0x0000000..1 | 0x4000000..0
C     | 0x4000000..1 | 0x8000000..0
D     | 0x8000000..1 | 0xC000000..0

Partition Key videoid 689d56e5- ... -> Murmur3 -> 0xadb95e99da887a8a4cb474db86eb5769
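The range table and the example hash can be tied together with a short lookup. The C# sketch below hard-codes the four token ranges from the slide and assumes replicas are placed by walking the ring in order (A, B, C, D) from the owning node. Both are simplifications for illustration; `TokenRing` and its methods are hypothetical names, not driver APIs.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Numerics;

static class TokenRing
{
    const string Nodes = "ABCD"; // ring order used on the slides

    // 0x4000...0 in a 128-bit token space, i.e. one quarter of the ring.
    static readonly BigInteger Quarter = BigInteger.One << 126;

    // Parse a hex token as a non-negative number (the leading "0" keeps it positive).
    public static BigInteger ParseToken(string hex)
        => BigInteger.Parse("0" + hex, NumberStyles.HexNumber);

    // Each range is (start, end]; range A wraps around and ends at 0.
    public static char OwnerOf(BigInteger token)
    {
        if (token.IsZero || token > 3 * Quarter) return 'A';
        if (token <= Quarter) return 'B';
        if (token <= 2 * Quarter) return 'C';
        return 'D';
    }

    // Owner first, then the next RF-1 nodes around the ring.
    public static char[] ReplicasFor(BigInteger token, int replicationFactor)
    {
        int start = Nodes.IndexOf(OwnerOf(token));
        return Enumerable.Range(0, replicationFactor)
                         .Select(i => Nodes[(start + i) % Nodes.Length])
                         .ToArray();
    }

    static void Main()
    {
        BigInteger token = ParseToken("adb95e99da887a8a4cb474db86eb5769");
        Console.WriteLine(OwnerOf(token));                           // D
        Console.WriteLine(string.Join(", ", ReplicasFor(token, 3))); // D, A, B
    }
}
```

The example hash starts with hex digit `a`, which falls between 0x8000000..1 and 0xC000000..0, so it lands in range D.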
Clustering Columns
• Second part of Primary Key is Clustering Column(s)
• Clustering columns affect ordering of data (on disk)
• Ascending/Descending order is possible
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Clustering Columns – Wide Rows
• Use of Clustering Columns (and the layout on disk) is where the term “Wide Rows” comes from

videoid='0fe6a...' (one partition)
  commentid='82be1...' (10/1/2014 9:36AM): userid='ac346...', comment='Awesome!'
  commentid='765ac...' (9/17/2014 7:55AM): userid='f89d3...', comment='Garbage!'
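One way to picture a wide row is as a map from partition key to a sorted map of clustering key to cells. The C# sketch below models comments_by_video that way in memory. It is an analogy only: a `DateTime` stands in for the timeuuid commentid, and the `WideRowSketch` class is hypothetical rather than anything Cassandra or the driver provides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WideRowSketch
{
    // Sorts newest first, mirroring CLUSTERING ORDER BY (commentid DESC).
    sealed class NewestFirst : IComparer<DateTime>
    {
        public int Compare(DateTime x, DateTime y) => y.CompareTo(x);
    }

    // Partition key (videoid) -> clustering key (comment time) -> comment text.
    static readonly Dictionary<string, SortedList<DateTime, string>> Partitions =
        new Dictionary<string, SortedList<DateTime, string>>();

    public static void AddComment(string videoId, DateTime addedAt, string comment)
    {
        if (!Partitions.TryGetValue(videoId, out var rows))
        {
            rows = new SortedList<DateTime, string>(new NewestFirst());
            Partitions[videoId] = rows;
        }
        rows[addedAt] = comment; // same primary key overwrites the existing row
    }

    // Reads one partition in clustering order, like SELECT ... LIMIT n.
    public static List<string> LatestComments(string videoId, int limit)
        => Partitions.TryGetValue(videoId, out var rows)
            ? rows.Values.Take(limit).ToList()
            : new List<string>();

    static void Main()
    {
        AddComment("0fe6a...", new DateTime(2014, 9, 17, 7, 55, 0), "Garbage!");
        AddComment("0fe6a...", new DateTime(2014, 10, 1, 9, 36, 0), "Awesome!");

        Console.WriteLine(string.Join(" | ", LatestComments("0fe6a...", 10))); // Awesome! | Garbage!
    }
}
```

Because the rows inside a partition are already stored in clustering order, reading the latest comments for one video is a sequential read of a single partition.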
Inserts and Updates
• Use INSERT or UPDATE to add and modify data
• Both will overwrite data (no constraints like RDBMS)
• INSERT and UPDATE are functionally equivalent

INSERT INTO comments_by_video (videoid, commentid, userid, comment)
VALUES ('0fe6a...', '82be1...', 'ac346...', 'Awesome!');

UPDATE comments_by_video
SET userid = 'ac346...', comment = 'Awesome!'
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
TTL and Deletes
• Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE
• Use DELETE statement to remove data
• Can optionally specify columns to remove part of a row

INSERT INTO comments_by_video ( ... ) VALUES ( ... ) USING TTL 86400;

DELETE FROM comments_by_video
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
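The effect of a TTL can be modeled as an expiry time stored next to the value. The C# sketch below is an in-memory illustration under that assumption, with the current time passed in explicitly so the behavior is easy to follow. `TtlTable` is a hypothetical class and says nothing about how Cassandra implements expiry internally.

```csharp
using System;
using System.Collections.Generic;

sealed class TtlTable
{
    readonly Dictionary<string, (string Value, DateTime? ExpiresAt)> _cells =
        new Dictionary<string, (string, DateTime?)>();

    // A write is an upsert: it silently replaces any existing value for the key.
    public void Write(string key, string value, DateTime now, int? ttlSeconds = null)
    {
        DateTime? expiresAt = ttlSeconds.HasValue
            ? now.AddSeconds(ttlSeconds.Value)
            : (DateTime?)null;
        _cells[key] = (value, expiresAt);
    }

    // Expired values behave as if they had been deleted.
    public string Read(string key, DateTime now)
    {
        if (!_cells.TryGetValue(key, out var cell)) return null;
        if (cell.ExpiresAt.HasValue && cell.ExpiresAt.Value <= now) return null;
        return cell.Value;
    }

    public void Delete(string key) => _cells.Remove(key);
}

static class TtlDemo
{
    static void Main()
    {
        var table = new TtlTable();
        var start = new DateTime(2014, 10, 1, 9, 36, 0);

        table.Write("82be1...", "Awesome!", start, ttlSeconds: 86400); // USING TTL 86400

        Console.WriteLine(table.Read("82be1...", start.AddHours(1)) ?? "(gone)");  // Awesome!
        Console.WriteLine(table.Read("82be1...", start.AddHours(25)) ?? "(gone)"); // (gone)
    }
}
```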
Querying
• Use SELECT to get data from your tables
• Always include Partition Key and optionally Clustering Columns
• Can use ORDER BY (on Clustering Columns) and LIMIT
• Use range queries (for example, by date) to slice partitions
SELECT * FROM comments_by_video WHERE videoid = 'a67cd...' LIMIT 10;
Breaking the Relational Mindset
• How do we data model when we have to query by the Partition Key (and optionally Clustering Columns)?
• Denormalize all the things!
• Disk is cheap now and writes in Cassandra are FAST
• Data modeling is very much query driven
• Many times we end up with a “table per query”
Users – The Relational Way
• Single Users table with all user data and an Id Primary Key
• Add an index on email address to allow queries by email
• Queries to support:
– User logs into site: find user by email address
– Show basic information about user: find user by id
Users – The Cassandra Way
• User logs into site: find user by email address

CREATE TABLE user_credentials (
  email text,
  password text,
  userid uuid,
  PRIMARY KEY (email)
);

• Show basic information about user: find user by id

CREATE TABLE users (
  userid uuid,
  firstname text,
  lastname text,
  email text,
  created_date timestamp,
  PRIMARY KEY (userid)
);
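The “table per query” idea means the application writes the same user to both tables and then reads each one by its own partition key. Here is a small in-memory C# sketch of that pattern, with dictionaries standing in for the user_credentials and users tables. The `UserTables` class and its method names are made up for illustration and do not come from the KillrVideo source.

```csharp
using System;
using System.Collections.Generic;

static class UserTables
{
    // Stand-in for user_credentials: partition key is email.
    static readonly Dictionary<string, (string Password, Guid UserId)> UserCredentials =
        new Dictionary<string, (string, Guid)>();

    // Stand-in for users: partition key is userid.
    static readonly Dictionary<Guid, (string FirstName, string LastName, string Email)> Users =
        new Dictionary<Guid, (string, string, string)>();

    // One logical write fans out to every table that serves a query.
    public static Guid CreateUser(string firstName, string lastName, string email, string password)
    {
        Guid userId = Guid.NewGuid();
        UserCredentials[email] = (password, userId);
        Users[userId] = (firstName, lastName, email);
        return userId;
    }

    // "User logs into site": look up by email.
    public static Guid? FindUserIdByEmail(string email)
        => UserCredentials.TryGetValue(email, out var row) ? row.UserId : (Guid?)null;

    // "Show basic information about user": look up by id.
    public static string FindFullNameById(Guid userId)
        => Users.TryGetValue(userId, out var row) ? $"{row.FirstName} {row.LastName}" : null;

    static void Main()
    {
        Guid id = CreateUser("Grumpy", "Cat", "grumpy@example.com", "not-a-real-password");

        Guid? found = FindUserIdByEmail("grumpy@example.com");
        Console.WriteLine(found == id);                  // True
        Console.WriteLine(FindFullNameById(found.Value)); // Grumpy Cat
    }
}
```

The email address is stored in both tables, which is exactly the kind of duplicate that has to be kept in sync if it can change.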
Considerations When Duplicating Data
• Can the data change?
• How likely is it to change or how frequently will it change?
• Do I have all the information I need to update duplicates and maintain consistency?
• Just scratching the surface of data modeling examples here
Putting it all together: Cassandra, Azure, and .NET
KillrVideo on Azure
• Cassandra Cluster (DSE)
– App data storage (video metadata, comments, users, ratings, etc.)
• Azure Media Services
– Uploaded video encoding, thumbnail generation, video access URI generation
• Azure Storage
– Queues: notifications on encoding job progress
– Blob: uploaded video storage
• OpsCenter
– Provisioning, monitoring, management
• KillrVideo Web App
– C# MVC Web Application, Azure Web Role
– Serves up UI, JSON Endpoints
• KillrVideo Upload Worker
– C#, Azure Worker Role
– Monitors encoding job events, publishes completed uploads
• Web UI
– HTML5 / JavaScript (KnockoutJS, jQuery, Bootstrap, etc.)
Deploying Cassandra in Azure
• Cassandra is a JVM application and should be deployed on Linux VMs (Windows parity is coming, possibly in 3.0)
• IOPS is super important (A7 instances recommended for production, A4 for testing and development)
• New SSD instances in Azure look promising
• In-depth documentation and scripts available to help
.NET and Cassandra
• Open Source (on GitHub), available via NuGet
• Bootstrap using the Builder and then reuse the ISession object
Cluster cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .Build();
ISession session = cluster.Connect("killrvideo");
.NET and Cassandra
• Executing CQL
• Sync and Async API available
// userid is a uuid column, so bind a Guid (userId is assumed to be in scope)
var statement = new SimpleStatement("SELECT * FROM users WHERE userid = ?");
statement = statement.Bind(userId);
RowSet rows = await session.ExecuteAsync(statement);
.NET and Cassandra
• Getting values from a RowSet is easy
• RowSet is a collection of Row (IEnumerable<Row>)

RowSet rows = await session.ExecuteAsync(statement);
foreach (Row row in rows)
{
    var videoId = row.GetValue<Guid>("videoid");
    var addedDate = row.GetValue<DateTimeOffset>("added_date");
    var name = row.GetValue<string>("name");
}
.NET and Cassandra
• Mapping results to DTOs: if you like using CQL, try the CqlPoco package
• Note: This package may be pulled into the official driver soon.

public class User
{
    public Guid UserId { get; set; }
    public string Name { get; set; }
}

// Get a user by id from Cassandra or null if not found
var user = client.SingleOrDefault<User>(
    "SELECT userid, name FROM users WHERE userid = ?", someUserId);
.NET and Cassandra
• Mapping results to DTOs: if you like LINQ, use the built-in LINQ provider

[Table("users")]
public class User
{
    [Column("userid"), PartitionKey]
    public Guid UserId { get; set; }

    [Column("name")]
    public string Name { get; set; }
}

var user = session.GetTable<User>()
    .SingleOrDefault(u => u.UserId == someUserId)
    .Execute();
Some Tips for .NET and Cassandra
• Look at Prepared Statements in the documentation for an easy performance optimization
• Take advantage of the async API to run queries in parallel
• Don’t write boilerplate mapping code—use LINQ or CqlPoco
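Running queries in parallel with the async API usually means starting all of the tasks first and awaiting them together. The C# sketch below shows that shape with `Task.WhenAll`. It does not talk to Cassandra: `FakeGetVideoNameAsync` is a hypothetical stand-in for a driver call such as `session.ExecuteAsync(...)`, and the 100 ms delay simply simulates a network round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class ParallelQueries
{
    // Hypothetical stand-in for an async driver call (no Cassandra involved).
    public static async Task<string> FakeGetVideoNameAsync(int videoNumber)
    {
        await Task.Delay(100); // simulate a network round trip
        return $"video-{videoNumber}";
    }

    // Start every query, then await them together; results keep the input order.
    public static Task<string[]> GetAllAsync(IEnumerable<int> ids, Func<int, Task<string>> query)
        => Task.WhenAll(ids.Select(query));

    static async Task Main()
    {
        string[] names = await GetAllAsync(new[] { 1, 2, 3 }, FakeGetVideoNameAsync);
        Console.WriteLine(string.Join(", ", names)); // video-1, video-2, video-3
    }
}
```

Because the three simulated queries overlap, the total wait is roughly one round trip rather than three.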
What Next?
• Planet Cassandra: http://planetcassandra.org/
– Windows installer for Cassandra for development
– More information on the drivers
– Resources for Data Modeling
• Guidance and Scripts for deployments on Azure – https://academy.datastax.com/demos/enterprise-deployment-microsoft-azure-cloud
• KillrVideo Source
– https://github.com/luketillman/killrvideo-csharp
Questions?
Follow me on Twitter for updates or to ask questions later: @LukeTillman