A Walk down NOSQL Lane in the Cloud
New York City Cloud Computing Group
February 2011
Alexander Sicular
@siculars
Who is this blowhard?
Columbia University pays my mortgage
For the better part of a decade in Medical Informatics
Am not shilling for any of these companies
Am not a computer scientist
Am a computer science enthusiast particularly in the area of Informatics
When I put my data in the “cloud”, to me it just means that it’s
virtualized in someone else’s server room
Many, many providers and only growing
Amazon, Rackspace, Joyent, CouchOne, Cloudant, Azure, GAE, Heroku, no.de
Outsourced management
Zero capex
Controlled costs
...the Silver Lining
...With a Chance of Rain?
Vendor lock-in
Unreliable performance
i/o
cpu, memory
Bare metal > software virtualization
NoSQL or NOSQL?
Not Only SQL
Non/post relational
Big tent policy
Umbrella term
Fragmented
http://www.flickr.com/photos/morgennebel/2933723145/
Your Usage Patterns
Read vs. Write
Mutable vs. Immutable
Product Considerations:
In place updates
Write Only Logs
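To make the two product styles above concrete, here is a minimal in-memory sketch. The class names `InPlaceStore` and `AppendOnlyStore` are hypothetical, invented for illustration; no real product is implemented this way verbatim.

```python
class InPlaceStore:
    """Mutable style: each write overwrites the previous value."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value  # old value is gone after this

    def get(self, key):
        return self.data.get(key)


class AppendOnlyStore:
    """Log style: writes are appended, existing records are never modified."""

    def __init__(self):
        self.log = []    # (key, value) records in write order
        self.index = {}  # key -> position of the latest record for that key

    def put(self, key, value):
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

    def history(self, key):
        # Older versions remain in the log until something compacts it.
        return [v for k, v in self.log if k == key]
```

The trade-off to notice: the log keeps every version (cheap sequential writes, but it grows until compacted), while the in-place store keeps only the latest value.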
This vs. That
Riak wiki comparisons page http://wiki.basho.com/Riak-Comparisons.html
Popular one page comparison of a number of NOSQL players by Kristof Kovacs: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
NOSQL concepts are Not Brand New
Memcached since 2003 http://memcached.org
Google papers 2004-2006
Amazon Dynamo 2007
Consistent Hashing 2007 http://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients
Using relational systems as a key-value blob store
2009 FriendFeed (not the first) http://bret.appspot.com/entry/how-friendfeed-uses-mysql
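A rough sketch of the "relational system as key-value blob store" idea. Assumptions: `sqlite3` stands in for MySQL so the example runs anywhere, the `entities` table and the `put`/`get` helpers are hypothetical names, and JSON is used as the blob encoding purely for illustration.

```python
import json
import sqlite3

# One opaque blob per key; the database knows nothing about the fields inside.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body BLOB NOT NULL)")


def put(entity_id, doc):
    """Serialize the whole document and store it under its key."""
    body = json.dumps(doc).encode("utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity_id, body),
    )
    conn.commit()


def get(entity_id):
    """Fetch by key and deserialize; returns None if the key is absent."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return None if row is None else json.loads(row[0].decode("utf-8"))
```

Adding a new field to a document needs no schema change, which is much of the appeal; the cost is that the database cannot query inside the blob without extra index tables.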
Why NOSQL
Support for “Very Large” data sets
Schemaless
Denormalized
Green field
New applications
http://www.flickr.com/photos/gailtang/1243984297/
Academia
Google:
Bigtable http://labs.google.com/papers/bigtable.html
GFS http://labs.google.com/papers/gfs.html
M/R http://labs.google.com/papers/mapreduce.html
Amazon:
Dynamo http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
NOSQL Summer http://nosqlsummer.org/papers
Under the Hood Terminology
Write Only Log http://en.wikipedia.org/wiki/Log-structured_file_system
Merkle Trees http://en.wikipedia.org/wiki/Hash_tree
B-trees http://en.wikipedia.org/wiki/B-tree
Vector clock http://en.wikipedia.org/wiki/Vector_clock
Bloom filters http://en.wikipedia.org/wiki/Bloom_filters
Big O Notation http://en.wikipedia.org/wiki/Big_o_notation
Consistent Hashing http://en.wikipedia.org/wiki/Consistent_hashing
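Of the terms above, consistent hashing is probably the easiest to show in a few lines. This is a minimal sketch, not any particular product's implementation: the `HashRing` class, the `vnodes` parameter, and the choice of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Keys and nodes hash onto the same ring; a key belongs to the next node point clockwise."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several points ("virtual nodes") to even out the spread.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # First ring point past the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]
```

The property that matters: when a node is added, the only keys that change owner are the ones that move to the new node. With a naive `hash(key) % node_count` scheme, most keys would move.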
CAP Theorem
http://en.wikipedia.org/wiki/CAP_theorem
Consistency
Availability
Partition Tolerance
Pick two?
http://guide.couchdb.org/draft/consistency.html
CouchDB
CouchOne, Cloudant
Erlang
Extreme replication scenarios
Works on phones
Updated indexing (b-tree)
HTTP interface
Offline usage
Sharded scaling
CouchDB Internal Architecture
http://nosqlpedia.com/wiki/File:CouchDB-Arch.JPG
MongoDB
10Gen, MongoHQ, MongoLab
C++
huMONGOus
Sharded scaling, replicated master/slave
Located in NYC (go visit them)
Soft landing for those coming from mysql (relational databases)
Native javascript
Secondary indexes
MongoDB Sharding Diagram
http://www.snailinaturtleneck.com/blog/2010/03/30/sharding-with-the-fishes/
MySQL to Mongo Query similarity
http://nosqlpedia.com/wiki/File:MongoDB.JPG
Riak
Basho, Joyent
Erlang
Distributed
HTTP, protobuf
Native javascript, erlang
Multiple backends
Homogeneous
CAP tunable
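"CAP tunable" in Dynamo-style stores usually comes down to three numbers: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. The sketch below only illustrates the arithmetic; the function names are hypothetical and this is not Riak client code.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """True when every read quorum must share at least one replica with every write quorum."""
    return r + w > n


def overlap_by_enumeration(n, r, w):
    """Brute-force check of the same property, for small n."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

With N=3, choosing R=2 and W=2 means a read always touches a replica that saw the latest acknowledged write. Choosing R=1 and W=1 gives lower latency and higher availability, but a read may land on a replica that has not yet received the write.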
Hadoop
Cloudera, Apache Foundation
Java
High latency
Batch oriented
HDFS is GFS based
Open source Google stack via the Google papers
Huge ecosystem
Yahoo, FB, Twitter, Fortune 500
Pig, Hive, Flume
HBase
Java
Low latency store
sits on top of Hadoop
Modeled after Google Bigtable
Column oriented
Thrift, protobuf
Backend for new Facebook Messaging service
Cassandra
Apache
Java
Column oriented
Like Bigtable and Dynamo
Originated at Facebook
At Twitter, Distributed counting
http://www.infoq.com/presentations/NoSQL-at-Twitter-by-Ryan-King
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
Redis
OpenRedis
C
REmote DIctionary Server
Specific data structures
incredibly fast
memcached on steroids
replicated master/slave
Commonalities
Open Source
Adherence to common or standard:
data formats
json, bson, utf8, binary
data transport mechanisms
http, thrift, protobuf, simple wire protocols
Ok. So Now What?
Analyze your requirements
Mailing lists
IRC, twitter
Project pages, wiki
Github/Google Code/Bitbucket:
project page
specific language clients
Variety Pack
Hybrid architectures will become the norm
Twitter - mysql, cassandra, hadoop
Google - mysql, GAE (BT)
Facebook - mysql, cassandra, hbase, memcached
Yahoo - mysql, hadoop
LinkedIn - voldemort
http://www.flickr.com/photos/uncleweed/82245324/
Questions?