Cassandra: 0-60 Jonathan Ellis / @spyced

Cassandra Training Modeling

Page 1: Cassandra Training Modeling

Cassandra: 0-60

Jonathan Ellis / @spyced

Page 2: Cassandra Training Modeling

Keyspaces & ColumnFamilies

● Conceptually, like “schemas” and “tables”

Page 3: Cassandra Training Modeling

Inside CFs, columns are dynamic

● Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”

Page 4: Cassandra Training Modeling

ColumnFamilies

Columns

Page 5: Cassandra Training Modeling

“static” CFs vs “dynamic”

Page 6: Cassandra Training Modeling

Inserting

● Really “insert or update”

● As much of the row as you want

(remember sstable merge-on-read)
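
A minimal sketch of why an insert doubles as an update, assuming a simplified model in which each sstable holds a row fragment as a dict of column name to (value, timestamp) and the newest timestamp wins at read time. The function `merge_row` and the sample data are hypothetical and are not Cassandra internals.

```python
def merge_row(*fragments):
    """Merge row fragments from several sstables; newest timestamp wins per column."""
    merged = {}
    for fragment in fragments:
        for name, (value, timestamp) in fragment.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    # Drop the timestamps for the caller's view of the row.
    return {name: value for name, (value, _) in merged.items()}


# First write: the whole row lands in one sstable.
sstable_1 = {'username': ('ericflo', 1), 'password': ('old', 1)}
# Later write: only the changed column is written, to a newer sstable.
sstable_2 = {'password': ('new', 2)}

row = merge_row(sstable_1, sstable_2)
print(row)
```

Because the merge is driven by timestamps, the order in which fragments are read does not change the result.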

Page 7: Cassandra Training Modeling

Column indexes

● Name vs range filters

● “reversed=true”
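
The two filter styles can be sketched over an in-memory list, assuming columns within a row are kept sorted by name. The helpers `by_names` and `by_range` are hypothetical stand-ins and not a client API; in the reversed case `start` is treated as the upper bound.

```python
from bisect import bisect_left, bisect_right

# Columns in a row, sorted by name (here: integer names).
columns = [(10, 'a'), (20, 'b'), (30, 'c'), (40, 'd')]
names = [name for name, _ in columns]


def by_names(wanted):
    """Name filter: fetch exactly the listed columns."""
    lookup = dict(columns)
    return [(n, lookup[n]) for n in wanted if n in lookup]


def by_range(start=None, finish=None, count=100, reversed_=False):
    """Range filter: a contiguous slice, optionally walked from the high end."""
    if reversed_:
        start, finish = finish, start  # reversed slices run from high to low
    lo = 0 if start is None else bisect_left(names, start)
    hi = len(names) if finish is None else bisect_right(names, finish)
    window = columns[lo:hi]
    if reversed_:
        window = window[::-1]
    return window[:count]


print(by_names([40, 10]))
print(by_range(20, 30))
print(by_range(reversed_=True, count=2))
```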

Page 8: Cassandra Training Modeling

Denormalization

● Whiteboard: Turn long, skinny tables into long rows

● Reduces I/O and CPU needed to perform a read
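
As an illustration of the whiteboard idea, the sketch below folds a long, skinny relational table into one wide row per key. Plain dicts stand in for ColumnFamilies, and the sample data and variable names are hypothetical.

```python
from collections import defaultdict

# A long, skinny table: one (user, followed) tuple per relationship.
following = [('alice', 'bob'), ('alice', 'carol'), ('bob', 'carol')]

# One wide row per user; each relationship becomes a column in that row.
friends = defaultdict(dict)    # row key: user, column names: who they follow
followers = defaultdict(dict)  # row key: user, column names: who follows them
for user, followed in following:
    friends[user][followed] = ''
    followers[followed][user] = ''

# Answering "who follows carol?" is now a single row read, with no join.
print(sorted(followers['carol']))
```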

Page 9: Cassandra Training Modeling
Page 10: Cassandra Training Modeling

Example: twissandra

● http://twissandra.com

Page 11: Cassandra Training Modeling

CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(64),
    password VARCHAR(64)
);

CREATE TABLE following (
    user INTEGER REFERENCES users(id),
    followed INTEGER REFERENCES users(id)
);

CREATE TABLE tweets (
    id INTEGER,
    user INTEGER REFERENCES users(id),
    body VARCHAR(140),
    timestamp TIMESTAMP
);

Page 12: Cassandra Training Modeling

Cassandrified

<Keyspaces>
  <Keyspace Name="Twissandra">
    <ColumnFamily CompareWith="UTF8Type" Name="User"/>
    <ColumnFamily CompareWith="BytesType" Name="Username"/>
    <ColumnFamily CompareWith="BytesType" Name="Friends"/>
    <ColumnFamily CompareWith="BytesType" Name="Followers"/>
    <ColumnFamily CompareWith="UTF8Type" Name="Tweet"/>
    <ColumnFamily CompareWith="LongType" Name="Userline"/>
    <ColumnFamily CompareWith="LongType" Name="Timeline"/>
  </Keyspace>
</Keyspaces>

Page 13: Cassandra Training Modeling

Connecting

CLIENT = pycassa.connect_thread_local()

USER = pycassa.ColumnFamily(CLIENT, 'Twissandra', 'User', dict_class=OrderedDict)

Page 14: Cassandra Training Modeling

Users

'a4a70900-24e1-11df-8924-001ff3591711': {
    'id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'username': 'ericflo',
    'password': '****',
}

username = 'jericevans'
password = '**********'
useruuid = str(uuid())

columns = {'id': useruuid, 'username': username, 'password': password}
USER.insert(useruuid, columns)

Page 15: Cassandra Training Modeling

Natural keys vs surrogate

Page 16: Cassandra Training Modeling

Friends and Followers

'a4a70900-24e1-11df-8924-001ff3591711': {
    # friend id: timestamp when the friendship was added
    '10cf667c-24e2-11df-8924-...': '1267413962580791',
    '343d5db2-24e2-11df-8924-...': '1267413990076949',
    '3f22b5f6-24e2-11df-8924-...': '1267414008133277',
}

frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711'

FRIENDS.insert(useruuid, {frienduuid: time.time()})
FOLLOWERS.insert(frienduuid, {useruuid: time.time()})

Page 17: Cassandra Training Modeling

Your row is your index

● Long skinny table vs short, fat columnfamily

Page 18: Cassandra Training Modeling

Tweets

'7561a442-24e2-11df-8924-001ff3591711': {
    'id': '89da3178-24e2-11df-8924-001ff3591711',
    'user_id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'body': 'Trying out Twissandra. This is awesome!',
    '_ts': '1267414173047880',
}

Page 19: Cassandra Training Modeling

Userline

'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}

Page 20: Cassandra Training Modeling
Page 21: Cassandra Training Modeling

Timeline

'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}

Page 22: Cassandra Training Modeling

Adding a tweet

tweetuuid = str(uuid())
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = long(time.time() * 1e6)

columns = {'id': tweetuuid, 'user_id': useruuid, 'body': body, '_ts': timestamp}
TWEET.insert(tweetuuid, columns)

columns = {struct.pack('>d', timestamp): tweetuuid}
USERLINE.insert(useruuid, columns)

TIMELINE.insert(useruuid, columns)
for otheruuid in FOLLOWERS.get(useruuid, 5000):
    TIMELINE.insert(otheruuid, columns)

Page 23: Cassandra Training Modeling

Reads

timeline = USERLINE.get(useruuid, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

start = request.GET.get('start')
limit = NUM_PER_PAGE

timeline = TIMELINE.get(useruuid, column_start=start,
                        column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())
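
The paged read can be sketched against an in-memory row. The "fetch one extra column to learn the next start" trick is an assumption of this sketch rather than something stated on the slides, and `get_page` is a hypothetical helper, not a pycassa call.

```python
def get_page(row, start=None, count=3):
    """Return (page, next_start) for a row read newest-first."""
    names = sorted(row, reverse=True)             # like column_reversed=True
    if start is not None:
        names = [n for n in names if n <= start]  # start column is inclusive
    window = names[:count + 1]                    # fetch one extra column
    next_start = window[count] if len(window) > count else None
    return [(n, row[n]) for n in window[:count]], next_start


# Hypothetical timeline row: column name is a timestamp, value is a tweet id.
timeline = {ts: 'tweet-%d' % ts for ts in range(1, 8)}

page1, next_start = get_page(timeline)
page2, next_start2 = get_page(timeline, start=next_start)
print(page1, next_start)
print(page2, next_start2)
```

The returned `next_start` would be handed back to the client as the `start` parameter of the following request; `None` signals the last page.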

Page 24: Cassandra Training Modeling

I can has smarter clients?

● Shouldn't need to pack('>d', int), Cassandra provides describe_keyspace so this can be introspected
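
Until clients do this automatically, the packing can at least be hidden behind a small helper. The sketch below is an assumption about what such a helper might look like: `pack_long` and `unpack_long` are hypothetical names, and it uses '>q' (signed 64-bit integer, big-endian) rather than the '>d' shown on the slides.

```python
import struct


def pack_long(value):
    """Encode an integer as the 8 big-endian bytes a LongType comparator sorts on."""
    return struct.pack('>q', value)


def unpack_long(raw):
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack('>q', raw)[0]


timestamps = [1267414319522925, 1267414247561777, 1267414305866969]

# For non-negative values, sorting the packed bytes matches numeric order.
packed = sorted(pack_long(ts) for ts in timestamps)
print([unpack_long(p) for p in packed])
```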

Page 25: Cassandra Training Modeling

Raw thrift API: Connecting

def get_client(host='127.0.0.1', port=9160):
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client

Page 26: Cassandra Training Modeling

Raw thrift API: Inserting

data = {'id': useruuid, ...}
timestamp = long(time.time() * 1e6)  # integer microseconds, as in "Adding a tweet"
columns = [Column(k, v, timestamp) for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns]
rows = {useruuid: {'User': mutations}}

client.batch_mutate('Twissandra', rows, ConsistencyLevel.ONE)

Page 27: Cassandra Training Modeling

Raw thrift API: Fetching

● get, get_slice, get_count, multiget_slice, get_range_slices

● ColumnOrSuperColumn

● http://wiki.apache.org/cassandra/API

Page 28: Cassandra Training Modeling

Running twissandra

● cd twissandra

● python manage.py runserver

● Navigate to http://127.0.0.1:8000

Page 29: Cassandra Training Modeling

Pycassa cheat sheet

● get(key, …)

● multiget(key_list)

● get_range(...)

● insert(key, columns_dict)

● remove(key, ...)

Page 30: Cassandra Training Modeling

Exercise

● python manage.py shell

● import cass

● help(cass.TWEET.remove)

● Delete the most recent tweet by user

Page 31: Cassandra Training Modeling

Exercise

● Open cass.py

● Finish save_retweet

Page 32: Cassandra Training Modeling

Language support

● Python

● Scala

● Ruby

● Speed is a negative

● Java

Page 33: Cassandra Training Modeling

PHP [thrift] tickets

● https://issues.apache.org/jira/browse/THRIFT-347

● https://issues.apache.org/jira/browse/THRIFT-638

● https://issues.apache.org/jira/browse/THRIFT-780

● https://issues.apache.org/jira/browse/THRIFT-788

Page 34: Cassandra Training Modeling

Done yet?

● Still doing 1+N queries per page

Page 35: Cassandra Training Modeling

SuperColumns

Page 36: Cassandra Training Modeling

Applying SuperColumns to Twissandra

Page 37: Cassandra Training Modeling

ColumnParent

Page 38: Cassandra Training Modeling

Supercolumns: limitations

Page 39: Cassandra Training Modeling

UUIDs

● Column names should be uuids, not longs, to avoid collisions

● Version 1 UUIDs can be sorted by time (“TimeUUID”)

● Any UUID can be sorted by its raw bytes (“LexicalUUID”)

● Usually Version 4

● Slightly less overhead
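
The difference between the two orderings can be sketched with Python's standard `uuid` module. The helper `uuid1_from_ticks` is hypothetical: it builds version 1 UUIDs from hand-picked timestamps so the result is deterministic, and the key functions only approximate what TimeUUID and LexicalUUID comparators do.

```python
import uuid


def uuid1_from_ticks(ticks, clock_seq=0, node=0x001ff3591711):
    """Build a version 1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000      # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


earlier = uuid1_from_ticks(0x0FFFFFFFF)
later = uuid1_from_ticks(0x100000000)

# Time order uses the embedded timestamp; byte order starts with time_low,
# so the two orderings can disagree.
by_time = sorted([later, earlier], key=lambda u: u.time)
by_bytes = sorted([later, earlier], key=lambda u: u.bytes)

# Same timestamp, different clock sequence: still two distinct column names,
# whereas two plain long timestamps would collide.
same_tick_a = uuid1_from_ticks(0x100000000, clock_seq=1)
same_tick_b = uuid1_from_ticks(0x100000000, clock_seq=2)
print(by_time == [earlier, later], by_bytes == [later, earlier])
```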

Page 40: Cassandra Training Modeling

0.7: secondary indexes

● Obviates the need for Userline (but not Timeline)

Page 41: Cassandra Training Modeling

Lucandra

● What documents contain term X?

● … and term Y?

● … or start with Z?

Page 42: Cassandra Training Modeling

Lucandra ColumnFamilies

<ColumnFamily Name="TermInfo"
              CompareWith="BytesType"
              ColumnType="Super"
              CompareSubcolumnsWith="BytesType"
              KeysCached="10%" />

<ColumnFamily Name="Documents"
              CompareWith="BytesType"
              KeysCached="10%" />

Page 43: Cassandra Training Modeling

Lucandra data

Term Key            col name      value
"field/term"  =>  { documentId ,  position vector }

Document Key
"documentId"  =>  { fieldName , value }
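
The layout above can be mimicked with two dicts to see how a term lookup works. This is a rough sketch only: whitespace tokenization, the '/' key separator, and the helpers `index_document` and `docs_with` are assumptions for illustration and not Lucandra's actual encoding.

```python
from collections import defaultdict

term_info = defaultdict(dict)  # "field/term" -> {documentId: position vector}
documents = {}                 # documentId -> {fieldName: value}


def index_document(doc_id, fields):
    """Store the document and add one term column per token position."""
    documents[doc_id] = dict(fields)
    for field, text in fields.items():
        for position, term in enumerate(text.lower().split()):
            key = '%s/%s' % (field, term)
            term_info[key].setdefault(doc_id, []).append(position)


def docs_with(field, *terms):
    """Documents containing every given term (term X ... and term Y)."""
    sets = [set(term_info.get('%s/%s' % (field, t), {})) for t in terms]
    return set.intersection(*sets) if sets else set()


index_document('doc1', {'body': 'cassandra is fast'})
index_document('doc2', {'body': 'lucene on cassandra'})

print(docs_with('body', 'cassandra'))
print(docs_with('body', 'cassandra', 'fast'))
```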

Page 44: Cassandra Training Modeling

Lucandra queries

● get_slice

● get_range_slices

● No silver bullet

Page 45: Cassandra Training Modeling

FAQ: counting

● UUIDs + batch process

● Mutex (contrib/mutex or “cages”)

● Use redis or mysql or memcached

● 0.7: vector clocks

Page 46: Cassandra Training Modeling

Tips

● Insert instead of check-then-insert

● Bulk delete with 'forged' timestamps

● In 0.7: use ttl instead
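
A rough sketch of the 'forged' timestamp tip, assuming a simplified model in which a row delete at timestamp T hides every column whose timestamp is at or below T. The `Row` class and the day-number timestamps are hypothetical and only mimic timestamp reconciliation.

```python
class Row:
    """Toy row with per-column timestamps and a row-level tombstone."""

    def __init__(self):
        self.columns = {}     # name -> (value, timestamp)
        self.deleted_at = -1  # timestamp of the newest row delete

    def insert(self, name, value, timestamp):
        # Insert-or-update: no read is needed before writing.
        current = self.columns.get(name)
        if current is None or timestamp > current[1]:
            self.columns[name] = (value, timestamp)

    def remove(self, timestamp):
        self.deleted_at = max(self.deleted_at, timestamp)

    def read(self):
        return {name: value for name, (value, ts) in self.columns.items()
                if ts > self.deleted_at}


row = Row()
# Forged timestamps: use the day number instead of the wall clock.
row.insert('mon', 'a', timestamp=1)
row.insert('tue', 'b', timestamp=2)
row.insert('wed', 'c', timestamp=3)

# One delete stamped "day 2" hides everything written for day 2 or earlier.
row.remove(timestamp=2)
print(row.read())
```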

Page 47: Cassandra Training Modeling
Page 48: Cassandra Training Modeling

as notroot/notroot:
git clone http://github.com/ericflo/twissandra.git

as root/riptano:
apt-get update
apt-get install python-setuptools
apt-get install python-django
easy_install -U thrift
rm -r /var/lib/cassandra/*
cp twissandra/storage-conf.xml /etc/cassandra
edit /etc/cassandra/log4j.properties to DEBUG
/etc/init.d/cassandra start
tail -f /var/log/cassandra/system.log

as notroot:
find templates |xargs grep empty
# r/m the {empty} blocks
python manage.py runserver