Real World NoSQL (by Chris Yuen)

The Hong Kong Big Data community had a guest speaker at our Tuesday, 18 February meeting. Chris Yuen from Demyst Data discussed his experience with three NoSQL solutions: Cassandra, MongoDB, and HBase. For more information see http://www.infoincog.com/hong-kong-big-data-meeting-tuesday-18-february/.

Real World NoSQL x Big Data

Overview

Introduction: motivation for NoSQL, the NoSQL landscape

Experience sharing: HBase, MongoDB, Cassandra

Tying it up – how does it really matter

Motivation

Too much data – the need to “scale out”

CAP theorem

Performance: RDBMS joins are slow; denormalization

Key-value data store

Alternative data representation: schemaless “No SQL”

Document data store
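To make the denormalization point concrete, here is a minimal sketch in plain Python (no database involved). The `users` and `orders` data and both function names are hypothetical, invented for illustration; they are not from the talk. It contrasts a read that has to join two normalized collections with a read that fetches one pre-joined document by key.

```python
# Normalized layout: two "tables" that must be joined at read time.
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "book"},
    {"order_id": 101, "user_id": 1, "item": "pen"},
    {"order_id": 102, "user_id": 2, "item": "lamp"},
]


def orders_with_join(user_id):
    """Relational style: scan orders and look up the user (a join)."""
    user = users[user_id]
    return [
        {"name": user["name"], "item": o["item"]}
        for o in orders
        if o["user_id"] == user_id
    ]


# Denormalized layout: one document per user, keyed by user id,
# with the user's name and orders embedded together at write time.
documents = {}
for o in orders:
    doc = documents.setdefault(
        o["user_id"], {"name": users[o["user_id"]]["name"], "orders": []}
    )
    doc["orders"].append({"order_id": o["order_id"], "item": o["item"]})


def orders_from_document(user_id):
    """Document / key-value style: a single lookup by key, no join."""
    doc = documents[user_id]
    return [{"name": doc["name"], "item": o["item"]} for o in doc["orders"]]
```

Both functions return the same rows; the trade-off is that the document layout duplicates the user's name and has to be kept in sync on writes.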

HBase

Builds on top of HDFS

Consistent “big-data” database

Automatically scales out

HBase

… but we didn’t use it in the end

HBase

A nightmare to set up and maintain

Depends on Hadoop, HDFS, Zookeeper

No secondary index

“Table” alteration requires downtime

Not spectacular latency for OLTP usage

MongoDB

De facto “big-data” “NoSQL” database

Document based data representation

MongoDB

A good balance of “traditional” usage and “NoSQL” usage

Supports secondary index

Range query

Can do table scan
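As a rough illustration of what a secondary index, a range query, and a table scan mean here, the following is a toy in-memory sketch in Python. It is not MongoDB's API or implementation; the `IndexedCollection` class and its methods are hypothetical names made up for this example.

```python
import bisect


class IndexedCollection:
    """Toy document collection with one secondary index (illustrative only)."""

    def __init__(self, field):
        self.field = field
        self.docs = {}    # primary key -> document
        self.index = []   # sorted list of (field value, primary key)

    def insert(self, key, doc):
        # Assumes each key is inserted once; updates are not handled.
        self.docs[key] = doc
        bisect.insort(self.index, (doc[self.field], key))

    def range_query(self, low, high):
        """Return documents with low <= doc[field] <= high via the index."""
        start = bisect.bisect_left(self.index, (low,))
        result = []
        for value, key in self.index[start:]:
            if value > high:
                break
            result.append(self.docs[key])
        return result

    def table_scan(self, predicate):
        """Fallback: check every document, no index needed."""
        return [d for d in self.docs.values() if predicate(d)]


people = IndexedCollection("age")
people.insert("a", {"name": "Ann", "age": 30})
people.insert("b", {"name": "Bob", "age": 25})
people.insert("c", {"name": "Cid", "age": 41})
```

With this data, `people.range_query(25, 30)` walks only the matching slice of the sorted index, while `people.table_scan(...)` touches every document.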

MongoDB

“Big-data” features: sharding, replica set

MongoDB

… but it got ugly pretty fast

Devil’s in the details

Replica set management fiasco

Sharding is difficult to set up and poorly implemented

https://github.com/kizzx2/mongolab

MongoDB

Reality – it doesn’t scale beyond one machine

Replica set

Cassandra

Column family data store

More “NoSQL” than MongoDB; fewer features

Column data store – strictly key/value query

Cassandra

Auto-sharding just works

Replica set requires 0 configuration

Append-only, LSM-tree based storage format: good for SSDs, high insert throughput

For storing analytic data
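The append-only, LSM-tree idea can be sketched in a few lines of Python. This is a simplified in-memory model, not Cassandra's storage engine: the `TinyLSM` class, its `memtable_limit` parameter, and the method names are all hypothetical. Writes go to an in-memory table; when it fills up, it is frozen into an immutable sorted run; reads check the newest data first; compaction merges runs.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: a memtable plus immutable sorted runs (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # sorted lists of (key, value), oldest first

    def put(self, key, value):
        # Writes never modify an existing run; they only touch the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new sorted, immutable run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the memtable, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:   # oldest first, so newer values win
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

Because a write only appends to the memtable and flushes are sequential, inserts stay cheap; the cost moves to reads (several runs may need checking) and to background compaction.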

Cassandra

Has rudimentary support for secondary index

Difficult to do table scan or range scan

Requires a substantial application / paradigm shift
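One likely reason range scans are awkward, assuming row keys are spread across nodes by a hash of the key (the slides do not say which partitioner was used, so treat this as an assumption), is that neighbouring keys do not end up next to each other. The sketch below uses simple modulo hashing rather than a real consistent-hash ring; `NODES` and `node_for` are hypothetical names for illustration.

```python
import hashlib

# Hypothetical three-node cluster.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key):
    """Pick a node from a hash of the key (modulo hashing, not a real ring)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Keys that are adjacent in sort order are placed independently of each other,
# so a scan over "user:0001".."user:0005" may have to contact every node.
placement = {key: node_for(key) for key in (f"user:{i:04d}" for i in range(1, 6))}
```

A lookup by exact key goes straight to one node, which is why the key/value access pattern fits so well and range-style access calls for a different data model.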

Real World Implications

Why does NoSQL matter to Big Data?

Schemaless storage model

Performance

Scalability

Rapidly incorporate unstructured new data sources without extensive planning

How to Choose

Maintenance / Scalability

Supported operations

OLAP vs. OLTP

Thank You

Chris Yuen

http://cfc.kizzx2.com

http://github.com/kizzx2

@kizzx2

chris@kizzx2.com
