
Introduction to
Apache Cassandra
(for Java Developers!)

Nate McCall
[email protected]
@zznate

Overview

Apache Cassandra is NOT a "key/value store"
Columns are dynamic inside a column family
(but they don't have to be)

Gain an understanding of concepts in Apache Cassandra that have a particular effect on application development

Brief Intro - Storage

SSTables are immutable
SSTables are merged on reads

Brief Intro - Compaction

Combine columns
Keep SSTable count down
Discard tombstones (more on this later)
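The two bullets above can be sketched in plain Java. This is an illustrative model of reconciliation during compaction, not Cassandra's actual implementation: for each column name the version with the highest timestamp wins, and tombstones are discarded from the compacted output (in real Cassandra, only after gc_grace_seconds).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of compaction: merge two SSTables' columns for one row,
// keeping the highest-timestamped version of each column and dropping tombstones.
public class CompactionSketch {
    // One column version; a null value represents a tombstone.
    record Cell(String value, long timestamp) {}

    static Map<String, Cell> merge(Map<String, Cell> older, Map<String, Cell> newer) {
        Map<String, Cell> merged = new HashMap<>(older);
        newer.forEach((name, cell) ->
            merged.merge(name, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
        // Discard tombstones (null values) from the compacted output.
        merged.values().removeIf(c -> c.value() == null);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> sstable1 = new HashMap<>(Map.of(
                "city", new Cell("Austin", 100L),
                "state", new Cell("TX", 100L)));
        Map<String, Cell> sstable2 = new HashMap<>(Map.of(
                "city", new Cell("Burlingame", 200L),   // newer write wins
                "state", new Cell(null, 200L)));        // tombstone: discarded
        System.out.println(merge(sstable1, sstable2));  // only "city" -> Burlingame survives
    }
}
```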

Brief Intro - The Ring

All nodes share the same role: No single point of failure

Easy to scale

Simplified operations

Brief Intro - Consistency Level - ONE

Cassandra provides consistency when
R + W > N (read replica count + write replica count > replication factor).
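The rule above is plain arithmetic, sketched here as a small helper (an illustrative class, not part of any Cassandra client API):

```java
// Sketch of the R + W > N rule: a read is guaranteed to overlap the latest
// acknowledged write when read count + write count exceeds the replication factor.
public class ConsistencyMath {
    // QUORUM for replication factor n: a majority of replicas.
    static int quorum(int n) { return n / 2 + 1; }

    // true if every read at level r must intersect every write at level w
    static boolean stronglyConsistent(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        System.out.println(stronglyConsistent(1, 1, n));                 // ONE/ONE: false
        System.out.println(stronglyConsistent(quorum(n), quorum(n), n)); // QUORUM/QUORUM: true
        System.out.println(stronglyConsistent(1, n, n));                 // ONE/ALL: true
    }
}
```

With N = 3, QUORUM is 2, so QUORUM reads plus QUORUM writes (2 + 2 > 3) are strongly consistent, while ONE/ONE (1 + 1 > 3 is false) is not.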

Brief Intro - Consistency Level - QUORUM

Brief Intro - Read Repair

vs. RDBMS - Consistency Level

*** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***

Idempotent: an operation can be applied multiple times without changing the result (except counters!)
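The counter caveat matters for retries. A plain in-memory sketch (not a client API) of why column writes are safe to replay after a timeout but counter increments are not:

```java
import java.util.HashMap;
import java.util.Map;

// Replaying a column write leaves the row in the same end state;
// replaying a counter increment changes the result.
public class IdempotencySketch {
    static String replayWrite(int times) {
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < times; i++) {
            row.put("city", "Austin"); // same end state however many times applied
        }
        return row.get("city");
    }

    static long replayIncrement(int times) {
        Map<String, Long> counters = new HashMap<>();
        for (int i = 0; i < times; i++) {
            counters.merge("views", 1L, Long::sum); // end state depends on replay count
        }
        return counters.get("views");
    }

    public static void main(String[] args) {
        System.out.println(replayWrite(3));     // Austin - idempotent
        System.out.println(replayIncrement(3)); // 3, not 1 - NOT idempotent
    }
}
```

This is why a timed-out column insert can simply be retried, while a timed-out counter increment may double-count if retried.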

vs. RDBMS - Append Only

Proper data modeling will minimize seeks
No read before write
(Go to Matt's presentation for more!)

How does this impact development?

Substantially.
For operations affecting the same data, that data will become consistent eventually, as determined by the timestamps.
Trade availability for consistency.
Store whatever you want. It's all just bytes.
Think about how you will query the data before you write it.

Neat. So Now What?

Like any database, you need a client!

Python:
Telephus: http://github.com/driftx/Telephus (Twisted)

Pycassa: http://github.com/pycassa/pycassa

Java:
Hector: http://github.com/rantav/hector (Examples: https://github.com/zznate/hector-examples)

Pelops: http://github.com/s7/scale7-pelops

Kundera: http://code.google.com/p/kundera/

Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin

Grails:
grails-cassandra: https://github.com/wolpert/grails-cassandra

.NET:
FluentCassandra: http://github.com/managedfusion/fluentcassandra

Aquiles: http://aquiles.codeplex.com/

Ruby:
Cassandra: http://github.com/fauna/cassandra

PHP:
phpcassa: http://github.com/thobbs/phpcassa

SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie

... but do not roll your own

Thrift

Fast, efficient serialization and network IO.

Lots of clients available (you can probably use it in other places as well)

Why you don't want to work with the Thrift API directly:

SuperColumn

ColumnOrSuperColumn (don't forget Counters!)

ColumnParent.super_column

ColumnPath.super_column

Map<ByteBuffer, Map<String, List<Mutation>>> mutation_map

Higher Level Clients

Hector

JMX Counters

Add/remove hosts:

automatically

programmatically

via JMX

Pluggable load balancing

Complete encapsulation of Thrift API

Type-safe approach to dealing with Apache Cassandra

Lightweight ORM (supports JPA 1.0 annotations)

JPA support: https://github.com/riptano/hector-jpa

Mavenized! http://repo2.maven.org/maven2/me/prettyprint/

CQL

Viable alternative as of 0.8.0

JDBC Driver implementation means lots of possibilities

Encapsulate API changes

In-tree support on the way for:

DataSource

Pooling

Avro, etc??

Gone. Added too much complexity after
Thrift caught up.

None of the libraries distinguished themselves as being a particularly crappy choice for serialization.

(See CASSANDRA-1765)

Thrift API Methods

Five general categories:
Retrieving

Writing/Updating/Removing (all the same op!)
Incrementing counters

Meta Information

Schema Manipulation

CQL Execution

On to the Code...

https://github.com/zznate/cassandra-tutorial
Uses Maven.
Really basic.
Modify/abuse/alter as needed.
Descriptions of what is going on and how to run each example are in the Javadoc comments.
Sample data is based on the North American Numbering Plan (easy to find thanks to InfoChimps):
http://infochimps.com/datasets/area-code-and-exchange-to-location-north-america-npanxx

Data Shape

512 202 30.27 097.74 W TX Austin
512 203 30.27 097.74 L TX Austin
512 204 30.32 097.73 W TX Austin
512 205 30.32 097.73 W TX Austin
512 206 30.32 097.73 L TX Austin
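A quick sketch of how one of those lines maps onto the row key and column names the later examples query. The field positions are an assumption read off the sample rows above (NPA, NXX, lat, lng, a type flag, state, city); the column names follow the tutorial code.

```java
import java.util.HashMap;
import java.util.Map;

// Parse one whitespace-separated NPA-NXX line into the shape of an Npanxx row:
// row key "NPA"+"NXX", columns city/state/lat/lng.
public class NpanxxRow {
    static Map<String, String> parse(String line) {
        // e.g. "512 202 30.27 097.74 W TX Austin"
        String[] f = line.trim().split("\\s+", 7);
        Map<String, String> columns = new HashMap<>();
        columns.put("key",   f[0] + f[1]);  // row key, e.g. "512202"
        columns.put("lat",   f[2]);
        columns.put("lng",   f[3]);
        columns.put("state", f[5]);         // f[4] is the W/L type flag
        columns.put("city",  f[6]);
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(parse("512 202 30.27 097.74 W TX Austin"));
    }
}
```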

Get a Single Column for a Key

GetCityForNpanxx.java

columnQuery.setColumnFamily("Npanxx");
columnQuery.setKey("512204");
columnQuery.setName("city");

Get the Contents of a Row

GetSliceForNpanxx.java

sliceQuery.setColumnFamily("Npanxx");
sliceQuery.setKey("512202");
sliceQuery.setColumnNames("city", "state", "lat", "lng");

Get the (sorted!) Columns of a Row

GetSliceForStateCity.java

sliceQuery.setColumnFamily("StateCity");
sliceQuery.setKey("TX Austin");
sliceQuery.setRange(202L, 204L, false, 5);

Get the Same Slice from Several Rows

MultigetSliceForNpanxx.java

multigetSlicesQuery.setColumnFamily("Npanxx");
multigetSlicesQuery.setColumnNames("city", "state", "lat", "lng");
multigetSlicesQuery.setKeys("512202", "512203", "512205", "512206");

Get Slices From a Range of Rows

GetRangeSlicesForStateCity.java
The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)

rangeSlicesQuery.setColumnFamily("Npanxx");
rangeSlicesQuery.setColumnNames("city", "state", "lat", "lng");
rangeSlicesQuery.setKeys("512202", "512205");
rangeSlicesQuery.setRowCount(5);

Get Slices From a Range of Rows - 2

GetSliceForAreaCodeCity.java
Bonus: DynamicComparator and DynamicComposite (Ed's talk)

sliceQuery.setKey("512");

sliceQuery.setRange("Austin", "Austin__204", false, 5);

Get Slices from Indexed Columns

GetIndexedSlicesForCityState.java
You only need to index a single column to apply clauses on other columns

isq.setColumnFamily("Npanxx");

isq.setColumnNames("city", "lat", "lng");
isq.addEqualsExpression("state", "TX");

isq.addEqualsExpression("city", "Austin");

isq.addGteExpression("lat", "30.30");

Insert, Update and Delete

... are effectively the same operation: the application of columns to a row

Insertion

InsertRowsForColumnFamilies.java

mutator.addInsertion("650222", "Npanxx", HFactory.createStringColumn("lat", "37.57"));
mutator.addInsertion("650222", "Npanxx", HFactory.createStringColumn("lng", "122.34"));

mutator.addInsertion("650222", "Npanxx", HFactory.createStringColumn("city", "Burlingame"));

mutator.addInsertion("650222", "Npanxx", HFactory.createStringColumn("state", "CA"));

mutator.addInsertion("CA Burlingame", "StateCity", HFactory.createColumn(650L, "37.57x122.34", longSerializer, stringSerializer));

mutator.addInsertion("650", "AreaCode",

HFactory.createStringColumn("Burlingame__650", "37.57x122.34"));

Add insertions for the other two column families to the same mutation

Deletion

DeleteRowsForColumnFamily.java

mutator.addDeletion("650222", "Npanxx", city, stringSerializer);

mutator.addDeletion("CA Burlingame", "StateCity", null, stringSerializer);
mutator.addDeletion("650", "AreaCode", null, stringSerializer);

mutator.addDeletion("650222", "Npanxx", null, stringSerializer);

Or row level

Record Level

Deletion

[default@Tutorial] list StateCity;
Using default limit of 100

-------------------

RowKey: CA Burlingame

=> (column=650, value=33372e3537783132322e3334, timestamp=1310340410528000)

-------------------

RowKey: TX Austin

=> (column=202, value=33302e3237783039372e3734, timestamp=1310143852392000)

=> (column=203, value=33302e3237783039372e3734, timestamp=1310143852444000)

=> (column=204, value=33302e3332783039372e3733, timestamp=1310143852448000)

=> (column=205, value=33302e3332783039372e3733, timestamp=1310143852453000)

=> (column=206, value=33302e3332783039372e3733, timestamp=1310143852457000)

Deletion

[default@Tutorial] list StateCity;
Using default limit of 100

-------------------

RowKey: CA Burlingame

-------------------

RowKey: TX Austin

=> (column=202, value=33302e3237783039372e3734, timestamp=1310143852392000)

=> (column=203, value=33302e3237783039372e3734, timestamp=1310143852444000)

=> (column=204, value=33302e3332783039372e3733, timestamp=1310143852448000)

=> (column=205, value=33302e3332783039372e3733, timestamp=1310143852453000)

=> (column=206, value=33302e3332783039372e3733, timestamp=1310143852457000)

Deletion - FYI

mutator.addDeletion("202230", "Npanxx", city, stringSerializer);

You just inserted a tombstone!

Sending a deletion for a non-existing row:

[default@Tutorial] list Npanxx;
Using default limit of 100

. . .

-------------------

RowKey: 202230

-------------------

. . .

ColumnFamilyTemplate

ColumnFamilyUpdater updater = template.createUpdater("cskey1");

updater.setString("stringval","value1");

updater.setDate("curdate", date);

updater.setLong("longval", 5L);

template.update(updater);

template.addColumn("stringval", se);

template.addColumn("curdate", DateSerializer.get());

template.addColumn("longval", LongSerializer.get());

ColumnFamilyResult wrapper = template.queryColumns("cskey1");

Template method design pattern
https://github.com/rantav/hector/wiki/Getting-started-%285-minutes%29

Development Resources

Cassandra Maven Plugin
http://mojo.codehaus.org/cassandra-maven-plugin/

CCM - localhost cassandra cluster
https://github.com/pcmanus/ccm

OpsCenter
http://www.datastax.com/products/opscenter

Cassandra AMIs
https://github.com/riptano/CassandraClusterAMI

Stuff I Punted on for the Sake of Brevity

meta_* methods
CassandraClusterTest.java: L43-81 @ hector

system_* methods
SchemaManipulation.java @ hector-examples
CassandraClusterTest.java: L84-157 @ hector

ORM (it works and is in production)
https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29

Multiple nodes and failure scenarios

Data modeling (go see Matt's presentation)

Things to Remember

deletes and timestamp granularity

range ghosts and tombstones

using the wrong column comparator, key/default validators and InvalidRequestException

"Schema-less" really means schema optional

use column-level TTL to automate deletion

"how do I iterate over all the rows in a column family"?

get_range_slices, but don't do that

a good sign your data model is wrong
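The "deletes and timestamp granularity" point above can be made concrete. In this illustrative model (not Cassandra internals), reconciliation is purely last-write-wins on the client-supplied timestamp, so a delete issued with a timestamp older than the live column's is silently lost:

```java
// Last-write-wins reconciliation between a live column and a tombstone,
// decided only by the (client-supplied, microsecond) timestamps.
public class TimestampSketch {
    // One column version; a null value represents a tombstone.
    record Cell(String value, long timestamp) {}

    static Cell reconcile(Cell a, Cell b) {
        return a.timestamp() > b.timestamp() ? a : b; // later timestamp wins
    }

    public static void main(String[] args) {
        Cell write = new Cell("Austin", 1310143852392000L);      // microseconds
        Cell staleDelete = new Cell(null, 1310143852391000L);    // older timestamp
        // The delete's timestamp is older, so the live column survives.
        System.out.println(reconcile(write, staleDelete).value()); // Austin
    }
}
```

This is why clients that generate timestamps with too coarse a granularity (e.g. milliseconds) can see deletes "not take": a delete and a write landing in the same tick no longer have a well-defined winner.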

Questions?