Cassandra: 0-60
Jonathan Ellis / @spyced
Keyspaces & ColumnFamilies
● Conceptually, like “schemas” and “tables”
Inside CFs, columns are dynamic
● Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”
ColumnFamilies
Columns
“static” Cfs vs “dynamic”
Inserting
● Really “insert or update”
● As much of the row as you want
(remember sstable merge-on-read)
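A toy sketch of why insert and update are the same operation: every write is a column fragment with a timestamp, and a read merges the fragments, keeping the newest value per column. This is an in-memory model for illustration only, not Cassandra's storage engine; `merge_on_read` and the sample values are made up.

```python
# Toy model: each "sstable" fragment maps column name -> (value, timestamp).

def merge_on_read(fragments):
    """Merge row fragments, keeping the newest timestamp per column."""
    merged = {}
    for fragment in fragments:
        for name, (value, ts) in fragment.items():
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    return {name: value for name, (value, _ts) in merged.items()}

# Three writes to the same row key, each touching only part of the row.
fragments = [
    {"username": ("ericflo", 1), "password": ("old", 1)},
    {"password": ("new", 2)},       # an "update" is just another insert
    {"location": ("Austin", 3)},    # a column nobody declared in advance
]
row = merge_on_read(fragments)
```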
Column indexes
● Name vs range filters
● “reversed=true”
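A rough in-memory sketch of the two filter styles, assuming columns in a row are kept sorted by name. The function names are hypothetical. As in the Thrift API, a reversed slice takes its start at the high end of the range.

```python
from bisect import bisect_left, bisect_right

def get_by_names(row, names):
    """Name filter: fetch specific columns by exact name."""
    return {n: row[n] for n in names if n in row}

def get_slice(row, start=None, finish=None, count=100, reversed_=False):
    """Range filter: a contiguous run of columns in sorted order."""
    names = sorted(row)
    # When reversed, `start` is the upper bound and `finish` the lower one.
    low, high = (finish, start) if reversed_ else (start, finish)
    lo = 0 if low is None else bisect_left(names, low)
    hi = len(names) if high is None else bisect_right(names, high)
    selected = names[lo:hi]
    if reversed_:
        selected.reverse()
    return [(n, row[n]) for n in selected[:count]]

userline = {
    1267414247561777: "tweet-a",
    1267414277402340: "tweet-b",
    1267414305866969: "tweet-c",
}
newest_two = get_slice(userline, count=2, reversed_=True)
```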
Denormalization
● Whiteboard: Turn long, skinny tables into long rows
● Reduces I/O and CPU needed to perform a read
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(64),
    password VARCHAR(64)
);
CREATE TABLE following (
    user INTEGER REFERENCES users(id),
    followed INTEGER REFERENCES users(id)
);
CREATE TABLE tweets (
    id INTEGER,
    user INTEGER REFERENCES users(id),
    body VARCHAR(140),
    timestamp TIMESTAMP
);
Cassandrified
<Keyspaces>
  <Keyspace Name="Twissandra">
    <ColumnFamily CompareWith="UTF8Type" Name="User"/>
    <ColumnFamily CompareWith="BytesType" Name="Username"/>
    <ColumnFamily CompareWith="BytesType" Name="Friends"/>
    <ColumnFamily CompareWith="BytesType" Name="Followers"/>
    <ColumnFamily CompareWith="UTF8Type" Name="Tweet"/>
    <ColumnFamily CompareWith="LongType" Name="Userline"/>
    <ColumnFamily CompareWith="LongType" Name="Timeline"/>
  </Keyspace>
</Keyspaces>
Connecting
CLIENT = pycassa.connect_thread_local()
USER = pycassa.ColumnFamily(CLIENT, 'Twissandra', 'User', dict_class=OrderedDict)
Users
'a4a70900-24e1-11df-8924-001ff3591711': {
    'id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'username': 'ericflo',
    'password': '****',
}

username = 'jericevans'
password = '**********'
useruuid = str(uuid())

columns = {'id': useruuid, 'username': username, 'password': password}
USER.insert(useruuid, columns)
Natural keys vs surrogate
Friends and Followers
'a4a70900-24e1-11df-8924-001ff3591711': {
    # friend id: timestamp when the friendship was added
    '10cf667c-24e2-11df-8924-...': '1267413962580791',
    '343d5db2-24e2-11df-8924-...': '1267413990076949',
    '3f22b5f6-24e2-11df-8924-...': '1267414008133277',
}

frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711'

FRIENDS.insert(useruuid, {frienduuid: time.time()})
FOLLOWERS.insert(frienduuid, {useruuid: time.time()})
Your row is your index
● Long skinny table vs short, fat columnfamily
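A minimal sketch of the "row is your index" idea: the relational `following` table has one short row per relationship, while the ColumnFamily version has one wide row per user whose column names are the related ids. Plain dicts stand in for the Friends and Followers ColumnFamilies; the user names and timestamps are invented for illustration.

```python
from collections import defaultdict

# Long, skinny table: one (user, followed, added_at) tuple per relationship.
following_rows = [
    ("alice", "bob", 100),
    ("alice", "carol", 101),
    ("bob", "alice", 102),
]

# Short, fat rows: row key -> {column name: value}.
friends = defaultdict(dict)    # user     -> {followed: added_at}
followers = defaultdict(dict)  # followed -> {user: added_at}
for user, followed, added_at in following_rows:
    friends[user][followed] = added_at
    followers[followed][user] = added_at

# "Who does alice follow?" is now a single row lookup, no join or scan.
alice_friends = sorted(friends["alice"])
```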
Tweets
'7561a442-24e2-11df-8924-001ff3591711': {
    'id': '89da3178-24e2-11df-8924-001ff3591711',
    'user_id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'body': 'Trying out Twissandra. This is awesome!',
    '_ts': '1267414173047880',
}
Userline
'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}
Timeline
'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}
Adding a tweet
tweetuuid = str(uuid())
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = long(time.time() * 1e6)

columns = {'id': tweetuuid, 'user_id': useruuid, 'body': body, '_ts': timestamp}
TWEET.insert(tweetuuid, columns)

columns = {struct.pack('>d', timestamp): tweetuuid}
USERLINE.insert(useruuid, columns)

TIMELINE.insert(useruuid, columns)
for otheruuid in FOLLOWERS.get(useruuid, 5000):
    TIMELINE.insert(otheruuid, columns)
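The write path above is fan-out on write: one tweet costs one write to Tweet, one to the author's Userline, one to the author's Timeline, and one per follower. A self-contained Python 3 sketch with dicts standing in for the ColumnFamilies (no pycassa, no packing of column names); the follower data and the `add_tweet` helper are hypothetical.

```python
from collections import defaultdict

FOLLOWERS = {"ericflo": {"alice": 1, "bob": 2}}   # hypothetical follower rows
TWEET = {}
USERLINE = defaultdict(dict)
TIMELINE = defaultdict(dict)

def add_tweet(tweet_id, user_id, body, ts):
    """Store the tweet once, then copy a pointer to it into every timeline."""
    TWEET[tweet_id] = {"id": tweet_id, "user_id": user_id, "body": body, "_ts": ts}
    pointer = {ts: tweet_id}
    USERLINE[user_id].update(pointer)
    TIMELINE[user_id].update(pointer)
    followers = FOLLOWERS.get(user_id, {})
    for follower_id in followers:
        TIMELINE[follower_id].update(pointer)
    return 3 + len(followers)   # number of row writes performed

writes = add_tweet("t1", "ericflo", "hello", 1267414247561777)
```

Writes grow with the follower count, but each reader's timeline stays a single-row read.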
Reads
timeline = USERLINE.get(useruuid, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

start = request.GET.get('start')
limit = NUM_PER_PAGE

timeline = TIMELINE.get(useruuid, column_start=start,
                        column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())
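One common way to derive the next page's `start` value is to fetch one column more than the page size and hand that extra column name back as the next start. The sketch below models this on a plain dict, newest first; `get_page` is a hypothetical helper, and whether the deck's app does exactly this is an assumption.

```python
def get_page(timeline, start=None, limit=2):
    """Return (page, next_start), reading column names newest first."""
    names = sorted(timeline, reverse=True)
    if start is not None:
        names = [n for n in names if n <= start]   # start is inclusive
    window = names[:limit + 1]                     # fetch one extra column
    next_start = window[limit] if len(window) > limit else None
    page = [(n, timeline[n]) for n in window[:limit]]
    return page, next_start

timeline = {10: "a", 20: "b", 30: "c", 40: "d", 50: "e"}
page1, next1 = get_page(timeline)
page2, next2 = get_page(timeline, start=next1)
page3, next3 = get_page(timeline, start=next2)
```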
I can has smarter clients?
● Shouldn't need to pack('>d', int): Cassandra provides describe_keyspace, so clients can introspect the comparator
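A sketch of the conversion a smarter client could apply automatically once it knows a ColumnFamily's comparator. It assumes a LongType-style comparator that takes an 8-byte big-endian integer, so it uses '>q' rather than the '>d' shown on the slides; `pack_long` and `unpack_long` are hypothetical helper names.

```python
import struct

def pack_long(n):
    """Encode an int as an 8-byte big-endian signed integer."""
    return struct.pack(">q", n)

def unpack_long(raw):
    """Decode 8 big-endian bytes back into an int."""
    return struct.unpack(">q", raw)[0]

timestamps = [1267414319522925, 1267414247561777, 1267414305866969]

# For non-negative values, sorting the packed bytes matches numeric order,
# which is why big-endian encoding suits sorted column names.
packed_sorted = sorted(pack_long(t) for t in timestamps)
roundtrip = [unpack_long(raw) for raw in packed_sorted]
```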
Raw thrift API: Connecting
def get_client(host='127.0.0.1', port=9170):
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client
Raw thrift API: Inserting
data = {'id': useruuid, ...}
columns = [Column(k, v, time.time()) for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns]
rows = {useruuid: {'User': mutations}}
client.batch_mutate('Twissandra', rows, ConsistencyLevel.ONE)
Raw thrift API: Fetching
● get, get_slice, get_count, multiget_slice, get_range_slices
● ColumnOrSuperColumn
● http://wiki.apache.org/cassandra/API
Running twissandra
● cd twissandra
● python manage.py runserver
● Navigate to http://127.0.0.1:8000
Pycassa cheat sheet
● get(key, …)
● multiget(key_list)
● get_range(...)
● insert(key, columns_dict)
● remove(key, ...)
Exercise
● python manage.py shell
● import cass
● help(cass.TWEET.remove)
● Delete the most recent tweet by user
Exercise
● Open cass.py
● Finish save_retweet
Language support
● Python
● Scala
● Ruby
● Speed is a negative
● Java
PHP [thrift] tickets
● https://issues.apache.org/jira/browse/THRIFT-347
● https://issues.apache.org/jira/browse/THRIFT-638
● https://issues.apache.org/jira/browse/THRIFT-780
● https://issues.apache.org/jira/browse/THRIFT-788
Done yet?
● Still doing 1+N queries per page
SuperColumns
Applying SuperColumns to Twissandra
ColumnParent
Supercolumns: limitations
UUIDs
● Column names should be uuids, not longs, to avoid collisions
● Version 1 UUIDs can be sorted by time (“TimeUUID”)
● Any UUID can be sorted by its raw bytes (“LexicalUUID”)
● Usually Version 4
● Slightly less overhead
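A small sketch of the collision problem and the UUID fix, using only Python's standard `uuid` module and dicts in place of ColumnFamily rows. Sorting by the `.time` attribute only approximates what a TimeUUID comparator would do on the server; that equivalence is an assumption here.

```python
import uuid

ts = 1267414247561777   # two tweets landing on the same microsecond

# Long column names: the second insert silently overwrites the first.
long_named = {}
long_named[ts] = "tweet-1"
long_named[ts] = "tweet-2"

# Version 1 UUID column names: each call yields a distinct name.
uuid_named = {}
uuid_named[uuid.uuid1()] = "tweet-1"
uuid_named[uuid.uuid1()] = "tweet-2"

# Version 1 UUIDs embed a 60-bit timestamp, exposed in Python as `.time`.
newest_first = sorted(uuid_named, key=lambda u: u.time, reverse=True)

# Any UUID, including a random Version 4 one, can be ordered by raw bytes.
lexical = sorted([uuid.uuid4(), uuid.uuid4()], key=lambda u: u.bytes)
```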
0.7: secondary indexes
● Obviate the need for Userline (but not Timeline)
Lucandra
● What documents contain term X?
● … and term Y?
● … or start with Z?
Lucandra ColumnFamilies
<ColumnFamily Name="TermInfo"
              CompareWith="BytesType"
              ColumnType="Super"
              CompareSubcolumnsWith="BytesType"
              KeysCached="10%" />
<ColumnFamily Name="Documents"
              CompareWith="BytesType"
              KeysCached="10%" />
Lucandra data
Term
Key              col name     value
"field/term" => { documentId , position vector }

Document
Key
"documentId" => { fieldName , value }
Lucandra queries
● get_slice
● get_range_slices
● No silver bullet
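A toy inverted index following the Term and Document layouts above, with dicts in place of the TermInfo and Documents ColumnFamilies. Term lookups stand in for get_slice on a "field/term" row, and the prefix scan stands in for get_range_slices over sorted keys. The function names, the whitespace tokenizer, and the sample documents are all illustrative assumptions, not Lucandra's actual code.

```python
from bisect import bisect_left

term_info = {}   # "field/term" -> {document_id: [positions]}
documents = {}   # document_id  -> {field_name: value}

def index(doc_id, field, text):
    """Store the document and one term row entry per token."""
    documents.setdefault(doc_id, {})[field] = text
    for pos, term in enumerate(text.lower().split()):
        row = term_info.setdefault(f"{field}/{term}", {})
        row.setdefault(doc_id, []).append(pos)

def docs_with(field, *terms):
    """Documents containing term X and term Y and so on."""
    sets = [set(term_info.get(f"{field}/{t}", {})) for t in terms]
    return set.intersection(*sets) if sets else set()

def docs_with_prefix(field, prefix):
    """Documents with a term starting with Z: a range scan over sorted keys."""
    keys = sorted(term_info)
    start = f"{field}/{prefix}"
    found = set()
    i = bisect_left(keys, start)
    while i < len(keys) and keys[i].startswith(start):
        found |= set(term_info[keys[i]])
        i += 1
    return found

index("d1", "body", "cassandra is fast")
index("d2", "body", "cassandra scales out")
index("d3", "body", "fast casual dining")
```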
FAQ: counting
● UUIDs + batch process
● Mutex (contrib/mutex or “cages”)
● Use redis or mysql or memcached
● 0.7: vector clocks
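A sketch of the first option, "UUIDs + batch process": each event is written as its own uniquely named column, so concurrent writers never overwrite each other, and a periodic job folds the columns into a total. Dicts stand in for ColumnFamilies; `record_hit` and `roll_up` are hypothetical names.

```python
import uuid
from collections import defaultdict

events = defaultdict(dict)   # row key -> {uuid column name: ""}
totals = defaultdict(int)    # row key -> rolled-up count

def record_hit(key):
    """Blind insert: no read-modify-write, so no lost updates."""
    events[key][uuid.uuid4()] = ""

def roll_up():
    """Batch process: count the pending columns, then clear them."""
    for key in list(events):
        columns = events.pop(key)
        totals[key] += len(columns)

for _ in range(3):
    record_hit("tweet-1")
record_hit("tweet-2")
roll_up()
record_hit("tweet-1")
roll_up()
```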
Tips
● Insert instead of check-then-insert
● Bulk delete with 'forged' timestamps
● In 0.7: use ttl instead
as notroot/notroot:
git clone http://github.com/ericflo/twissandra.git

as root/riptano:
apt-get update
apt-get install python-setuptools
apt-get install python-django
easy_install -U thrift
rm -r /var/lib/cassandra/*
cp twissandra/storage-conf.xml /etc/cassandra
edit /etc/cassandra/log4j.properties to DEBUG
/etc/init.d/cassandra start
tail -f /var/log/cassandra/system.log

as notroot:
find templates |xargs grep empty
# r/m the {empty} blocks
python manage.py runserver