Cassandra Training Modeling

Cassandra: 0-60

Jonathan Ellis / @spyced

Keyspaces & ColumnFamilies

● Conceptually, like “schemas” and “tables”

Inside CFs, columns are dynamic

● Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”

ColumnFamilies

Columns

“static” CFs vs “dynamic”

Inserting

● Really “insert or update”
● As much of the row as you want

(remember sstable merge-on-read)
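A minimal sketch of why an insert behaves as "insert or update", assuming a simplified model in which each sstable holds a fragment of the row as a dict of column name to (value, timestamp). The function `merge_row` and the sample data are hypothetical; real reconciliation also handles tombstones and timestamp ties.

```python
def merge_row(*fragments):
    """Merge row fragments; for each column the highest timestamp wins."""
    merged = {}
    for fragment in fragments:
        for name, (value, timestamp) in fragment.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    return merged


# Two writes to the same row that ended up in different (hypothetical) sstables.
older = {'username': ('ericflo', 1), 'password': ('secret', 1)}
newer = {'password': ('changed', 2)}  # partial write: only one column

row = merge_row(older, newer)
print(row)
```

The second write never read the row first; the read path merges the fragments and the newer column value shadows the older one.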

Column indexes

● Name vs range filters
● “reversed=true”
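The two filter styles can be sketched against an in-memory row, assuming columns are kept sorted by name. The helpers `by_names` and `by_range` and their parameters are hypothetical stand-ins, not the Cassandra API, and the range semantics are simplified (the bounds are always given low to high, then optionally reversed).

```python
from bisect import bisect_left, bisect_right


def by_names(row, names):
    """Name filter: fetch exactly the requested columns that exist."""
    return [(n, row[n]) for n in names if n in row]


def by_range(row, low=None, high=None, count=100, reverse=False):
    """Range filter: a contiguous slice of the sorted column names."""
    names = sorted(row)
    lo = 0 if low is None else bisect_left(names, low)
    hi = len(names) if high is None else bisect_right(names, high)
    selected = names[lo:hi]
    if reverse:
        selected.reverse()
    return [(n, row[n]) for n in selected[:count]]


row = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
print(by_names(row, [2, 4]))
print(by_range(row, low=2, high=4))
print(by_range(row, count=2, reverse=True))  # "the last two columns"
```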

Denormalization

● Whiteboard: turn long, skinny tables into long rows

● Reduces the I/O and CPU needed to perform a read

Example: twissandra

● http://twissandra.com

CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(64),
    password VARCHAR(64)
);

CREATE TABLE following (
    user INTEGER REFERENCES users(id),
    followed INTEGER REFERENCES users(id)
);

CREATE TABLE tweets (
    id INTEGER,
    user INTEGER REFERENCES users(id),
    body VARCHAR(140),
    timestamp TIMESTAMP
);

Cassandrified

<Keyspaces>
  <Keyspace Name="Twissandra">
    <ColumnFamily CompareWith="UTF8Type" Name="User"/>
    <ColumnFamily CompareWith="BytesType" Name="Username"/>
    <ColumnFamily CompareWith="BytesType" Name="Friends"/>
    <ColumnFamily CompareWith="BytesType" Name="Followers"/>
    <ColumnFamily CompareWith="UTF8Type" Name="Tweet"/>
    <ColumnFamily CompareWith="LongType" Name="Userline"/>
    <ColumnFamily CompareWith="LongType" Name="Timeline"/>
  </Keyspace>
</Keyspaces>

Connecting

CLIENT = pycassa.connect_thread_local()

USER = pycassa.ColumnFamily(CLIENT, 'Twissandra', 'User', dict_class=OrderedDict)

Users

'a4a70900-24e1-11df-8924-001ff3591711': {
    'id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'username': 'ericflo',
    'password': '****',
}

username = 'jericevans'
password = '**********'
useruuid = str(uuid())

columns = {'id': useruuid, 'username': username, 'password': password}
USER.insert(useruuid, columns)

Natural keys vs surrogate

Friends and Followers

'a4a70900-24e1-11df-8924-001ff3591711': {
    # friend id: timestamp when the friendship was added
    '10cf667c-24e2-11df-8924-...': '1267413962580791',
    '343d5db2-24e2-11df-8924-...': '1267413990076949',
    '3f22b5f6-24e2-11df-8924-...': '1267414008133277',
}

frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711'

FRIENDS.insert(useruuid, {frienduuid: time.time()})
FOLLOWERS.insert(frienduuid, {useruuid: time.time()})

Your row is your index

● Long skinny table vs short, fat columnfamily
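As a rough illustration of "your row is your index", the sketch below folds a long, skinny `following` table into two wide rows per user, mirroring the Friends and Followers column families. Plain dicts stand in for column families, and `to_wide_rows` and the sample data are hypothetical.

```python
from collections import defaultdict

# Relational shape: one (user, followed, added_at) tuple per relationship.
following_table = [
    ('alice', 'bob', 1),
    ('alice', 'carol', 2),
    ('bob', 'carol', 3),
]


def to_wide_rows(rows):
    """Build one row per user, with one column per relationship."""
    friends = defaultdict(dict)    # user -> {followed: added_at}
    followers = defaultdict(dict)  # followed -> {user: added_at}
    for user, followed, added_at in rows:
        friends[user][followed] = added_at
        followers[followed][user] = added_at
    return dict(friends), dict(followers)


friends, followers = to_wide_rows(following_table)
print(friends['alice'])     # everyone alice follows: a single row lookup
print(followers['carol'])   # everyone following carol: a single row lookup
```

Each question the application asks becomes one row read; the price is writing every relationship twice.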

Tweets

'7561a442-24e2-11df-8924-001ff3591711': {
    'id': '89da3178-24e2-11df-8924-001ff3591711',
    'user_id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'body': 'Trying out Twissandra. This is awesome!',
    '_ts': '1267414173047880',
}

Userline

'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}

Timeline

'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}

Adding a tweet

tweetuuid = str(uuid())
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = long(time.time() * 1e6)

columns = {'id': tweetuuid, 'user_id': useruuid, 'body': body, '_ts': timestamp}
TWEET.insert(tweetuuid, columns)

columns = {struct.pack('>d', timestamp): tweetuuid}
USERLINE.insert(useruuid, columns)

TIMELINE.insert(useruuid, columns)
for otheruuid in FOLLOWERS.get(useruuid, 5000):
    TIMELINE.insert(otheruuid, columns)

Reads

timeline = USERLINE.get(useruuid, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

start = request.GET.get('start')
limit = NUM_PER_PAGE

timeline = TIMELINE.get(useruuid, column_start=start, column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())
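The write fan-out and the paged, newest-first read can be tried without a cluster. This is a sketch in Python 3 that uses plain dicts in place of the column families; `save_tweet`, `get_timeline` and the sample users are hypothetical, and fetching one extra column to find the next page's `start` is an assumption about how paging might be done, not something the slides specify.

```python
from collections import defaultdict

TWEET = {}                     # tweet id -> tweet columns
USERLINE = defaultdict(dict)   # user id -> {timestamp: tweet id}
TIMELINE = defaultdict(dict)   # user id -> {timestamp: tweet id}
FOLLOWERS = defaultdict(dict)  # user id -> {follower id: followed-at}


def save_tweet(tweet_id, user_id, body, timestamp):
    """Write the tweet once, then copy its id into every follower's timeline."""
    TWEET[tweet_id] = {'id': tweet_id, 'user_id': user_id,
                       'body': body, '_ts': timestamp}
    USERLINE[user_id][timestamp] = tweet_id
    TIMELINE[user_id][timestamp] = tweet_id
    for follower_id in FOLLOWERS[user_id]:
        TIMELINE[follower_id][timestamp] = tweet_id


def get_timeline(user_id, start=None, limit=2):
    """Return (tweets, next_start), newest first; `start` is inclusive."""
    stamps = sorted(TIMELINE[user_id], reverse=True)
    if start is not None:
        stamps = [ts for ts in stamps if ts <= start]
    page = stamps[:limit + 1]            # one extra column marks the next page
    next_start = page[limit] if len(page) > limit else None
    tweets = [TWEET[TIMELINE[user_id][ts]] for ts in page[:limit]]
    return tweets, next_start


FOLLOWERS['alice']['bob'] = 0
for ts, body in enumerate(['first', 'second', 'third'], start=1):
    save_tweet('t%d' % ts, 'alice', body, ts)

page1, next_start = get_timeline('bob')
page2, end = get_timeline('bob', start=next_start)
print([t['body'] for t in page1], [t['body'] for t in page2])
```

Reads stay cheap because the timeline row is already in display order; the cost moves to the write, which touches one row per follower.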

I can has smarter clients?

● Shouldn't need to pack('>d', int): Cassandra provides describe_keyspace, so this can be introspected
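One way a smarter client might work is to look up each column family's comparator and pack column names accordingly. In this sketch the `SCHEMA` dict is a hand-written stand-in for whatever describe_keyspace returns (its real return shape is not shown here), `PACKERS` and `pack_name` are hypothetical, and packing LongType as a signed 64-bit big-endian integer ('>q') is this sketch's choice rather than the '>d' used on the slides.

```python
import struct

# Hypothetical comparator -> packer table; only the types used by Twissandra.
PACKERS = {
    'LongType': lambda value: struct.pack('>q', value),
    'UTF8Type': lambda value: value.encode('utf-8'),
    'BytesType': lambda value: value,
}

# Stand-in for introspected schema: column family -> CompareWith.
SCHEMA = {
    'User': 'UTF8Type',
    'Friends': 'BytesType',
    'Userline': 'LongType',
    'Timeline': 'LongType',
}


def pack_name(column_family, name):
    """Pack a column name the way the column family's comparator expects."""
    return PACKERS[SCHEMA[column_family]](name)


packed = pack_name('Userline', 1267414247561777)
print(len(packed), pack_name('User', 'username'))
```

With this in the client, application code could pass plain integers and strings and never call struct itself.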

Raw thrift API: Connecting

def get_client(host='127.0.0.1', port=9170):
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client

Raw thrift API: Inserting

data = {'id': useruuid, ...}
# Column timestamps are integers; use microseconds, as when adding a tweet
timestamp = long(time.time() * 1e6)
columns = [Column(k, v, timestamp) for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns]
rows = {useruuid: {'User': mutations}}

client.batch_mutate('Twissandra', rows, ConsistencyLevel.ONE)

Raw thrift API: Fetching

● get, get_slice, get_count, multiget_slice, get_range_slices

● ColumnOrSuperColumn
● http://wiki.apache.org/cassandra/API

Running twissandra

● cd twissandra
● python manage.py runserver
● Navigate to http://127.0.0.1:8000

Pycassa cheat sheet

● get(key, …)
● multiget(key_list)
● get_range(...)
● insert(key, columns_dict)
● remove(key, ...)

Exercise

● python manage.py shell
● import cass
● help(cass.TWEET.remove)
● Delete the most recent tweet by user

Exercise

● Open cass.py
● Finish save_retweet

Language support

● Python
● Scala
● Ruby

● Speed is a negative

● Java

PHP [thrift] tickets

● https://issues.apache.org/jira/browse/THRIFT-347




Done yet?

● Still doing 1+N queries per page

SuperColumns

Applying SuperColumns to Twissandra

ColumnParent

Supercolumns: limitations

UUIDs

● Column names should be uuids, not longs, to avoid collisions

● Version 1 UUIDs can be sorted by time (“TimeUUID”)

● Any UUID can be sorted by its raw bytes (“LexicalUUID”)
● Usually Version 4

● Slightly less overhead
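The two orderings, and the collision argument, can be checked with the standard `uuid` module. This sketch builds Version 1 UUIDs from explicit field values so the result is deterministic; the helper `time_uuid` and the node numbers are hypothetical, and sorting by `u.time` versus `u.bytes` only approximates what the TimeUUID and LexicalUUID comparators do.

```python
import uuid


def time_uuid(timestamp, node=0x001ff3591711, clock_seq=0):
    """Build a Version 1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = timestamp & 0xffffffff
    time_mid = (timestamp >> 32) & 0xffff
    time_hi_version = ((timestamp >> 48) & 0x0fff) | 0x1000  # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3f) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xff
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


earlier = time_uuid(0x0ffffffff)
later = time_uuid(0x100000000)

# Time order uses the embedded timestamp; raw bytes start with time_low,
# so the byte order of Version 1 UUIDs is not chronological.
by_time = sorted([later, earlier], key=lambda u: u.time)
by_bytes = sorted([later, earlier], key=lambda u: u.bytes)
print(by_time == [earlier, later], by_bytes == [later, earlier])

# Same instant on two machines: a long column name would collide,
# while the UUIDs differ in their node field.
a = time_uuid(42, node=1)
b = time_uuid(42, node=2)
print(a.time == b.time, a != b)
```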

0.7: secondary indexes

● Obviates the need for Userline (but not Timeline)

Lucandra

● What documents contain term X?
● … and term Y?

● … or start with Z?

Lucandra ColumnFamilies

<ColumnFamily Name="TermInfo"
              CompareWith="BytesType"
              ColumnType="Super"
              CompareSubcolumnsWith="BytesType"
              KeysCached="10%" />
<ColumnFamily Name="Documents"
              CompareWith="BytesType"
              KeysCached="10%" />

Lucandra data

            Key               col name     value
Term        "field/term"  => { documentId , position vector }
Document    "documentId"  => { fieldName  , value }

Lucandra queries

● get_slice
● get_range_slices
● No silver bullet
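The Lucandra layout can be imitated with dicts to see how the three questions are answered. This is a sketch only: `index_document`, `docs_with` and `docs_with_prefix` are hypothetical names, whitespace tokenizing stands in for a Lucene analyzer, and the sorted key scan is only an analogy for a range query over ordered row keys.

```python
from collections import defaultdict

TERM_INFO = defaultdict(dict)  # "field/term" -> {documentId: [positions]}
DOCUMENTS = {}                 # documentId -> {fieldName: value}


def index_document(doc_id, fields):
    """Store the document and one TermInfo column per (term, document)."""
    DOCUMENTS[doc_id] = dict(fields)
    for field, text in fields.items():
        for position, term in enumerate(text.lower().split()):
            key = '%s/%s' % (field, term)
            TERM_INFO[key].setdefault(doc_id, []).append(position)


def docs_with(field, term):
    """Which documents contain the term? One row lookup."""
    return set(TERM_INFO.get('%s/%s' % (field, term), {}))


def docs_with_prefix(field, prefix):
    """Which documents have a term starting with prefix? A key range scan."""
    start = '%s/%s' % (field, prefix)
    hits = set()
    for key in sorted(TERM_INFO):
        if key.startswith(start):
            hits.update(TERM_INFO[key])
    return hits


index_document('doc1', {'body': 'Cassandra is fast'})
index_document('doc2', {'body': 'Cassandra scales'})

print(docs_with('body', 'cassandra') & docs_with('body', 'fast'))  # X and Y
print(docs_with_prefix('body', 'sc'))                              # starts with Z
```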

FAQ: counting

● UUIDs + batch process
● Mutex (contrib/mutex or “cages”)
● Use redis or mysql or memcached
● 0.7: vector clocks
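One reading of the "UUIDs + batch process" option: record each event as its own UUID-named column, so concurrent writers never overwrite each other, and let a periodic job fold the columns into a total. The sketch below assumes that reading; `record`, `roll_up` and the dict-based storage are hypothetical.

```python
import uuid
from collections import defaultdict

EVENTS = defaultdict(dict)  # counter name -> {event uuid: ''}
TOTALS = {}                 # counter name -> rolled-up count


def record(counter):
    """Write path: a blind insert of one uniquely named column."""
    EVENTS[counter][uuid.uuid4()] = ''


def roll_up():
    """Batch job: add the pending columns to the total, then clear them."""
    for counter, columns in EVENTS.items():
        TOTALS[counter] = TOTALS.get(counter, 0) + len(columns)
        columns.clear()


for _ in range(3):
    record('page_views')
roll_up()
first_total = TOTALS['page_views']

for _ in range(2):
    record('page_views')
roll_up()
print(first_total, TOTALS['page_views'])
```

The count is only as fresh as the last batch run, which is the trade-off against the mutex and external-store options.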

Tips

● Insert instead of check-then-insert
● Bulk delete with 'forged' timestamps

● In 0.7: use TTL instead
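A sketch of how a 'forged' timestamp might give a bulk delete, assuming a row delete shadows every column written at or before the delete's timestamp. Issuing the delete with a timestamp in the past would then expire old columns and keep newer ones. The function `live_columns` and the data are hypothetical.

```python
def live_columns(columns, row_deleted_at=None):
    """Return the columns that survive a row delete stamped row_deleted_at."""
    if row_deleted_at is None:
        return dict(columns)
    return {name: (value, timestamp)
            for name, (value, timestamp) in columns.items()
            if timestamp > row_deleted_at}


columns = {
    'old1': ('a', 10),
    'old2': ('b', 20),
    'new': ('c', 30),
}

# Delete "as of" timestamp 20 instead of now: only older data disappears.
survivors = live_columns(columns, row_deleted_at=20)
print(survivors)
```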

as notroot/notroot:
git clone http://github.com/ericflo/twissandra.git

as root/riptano:
apt-get update
apt-get install python-setuptools
apt-get install python-django
easy_install -U thrift
rm -r /var/lib/cassandra/*
cp twissandra/storage-conf.xml /etc/cassandra
edit /etc/cassandra/log4j.properties to DEBUG
/etc/init.d/cassandra start
tail -f /var/log/cassandra/system.log

as notroot:
find templates |xargs grep empty
# r/m the {empty} blocks
python manage.py runserver