A general overview of the non-relational database
By Andrew Kandels
When to Use an RDBMS?
Organized, structured data matched by common characteristics.
• Financial & Medical Records
• Personal Information
• Access Control (Usernames & Passwords)
• Order Processing
• Logistics
• Mailing Lists
… or, any data that works more efficiently when normalized
What Relational Databases are Bad At
• Content Management System (CMS)
• Real-time Analytics
• Caching
• Logging and Archiving Events
• Messaging
• Job Queue
• Social Networking
• Data Mining and Warehousing
When to Consider NoSQL?
• De-normalizing SQL as last resort
• Consistency can be sacrificed for scale
• Dynamic data models
• Tables storing meta-data
• BLOB tables storing serialized data!
• Very high writes, reads, or both
• Don’t have a DBA
• Temporary & volatile data
Caching layers are a band-aid for problems the RDBMS was never meant to handle.
Consistency
Service operates fully or not at all. You either clicked “Place Order” or you didn’t.
Availability
Service is always available with no need for scheduled downtime or maintenance windows.
Partition Tolerance
No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
Pick two.
Brewer’s CAP Theorem
(CA) Consistency, Availability
• Relational Databases
Trouble with partitions & scale. Deal with it through replication.
(CP) Consistency, Partition-Tolerant
• MongoDB
• HBase
• Redis
Trouble with availability while staying consistent.
(AP) Availability, Partition-Tolerant
• CouchDB
• Cassandra
• Riak
• Voldemort
Trouble with consistency while staying available.
Non-Relational Databases
• Key/Value Stores
• Document Databases
• Graph Databases
• Big Data & Warehousing Databases
Key/Value Store
Memcached
Simple, high-performance distributed memory object caching system.
Pros:
• Caching
• Rate limiting
• Real-time analytics
Cons:
• Serialization
• Replication
• Not fault tolerant
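Rate limiting is listed above as a good fit for a key/value cache. A minimal sketch of the idea, a fixed-window counter keyed by client and time window, might look like this. A plain Python dict stands in for the cache (an assumption for illustration; a real deployment would rely on the cache's own atomic increment and key expiry), and the names `allow_request`, `LIMIT` and `WINDOW_SECONDS` are hypothetical.

```python
import time

LIMIT = 3            # hypothetical: max requests allowed per window
WINDOW_SECONDS = 60  # hypothetical: window length in seconds

cache = {}  # stand-in for a key/value cache such as Memcached


def allow_request(client_id, now=None):
    """Return True while the client is under LIMIT in the current window."""
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)
    key = f"rate:{client_id}:{window}"
    # One counter per client per window; in a real cache old keys would expire.
    cache[key] = cache.get(key, 0) + 1
    return cache[key] <= LIMIT


if __name__ == "__main__":
    # Four requests inside the same window: the fourth exceeds the limit.
    print([allow_request("alice", now=1000.0) for _ in range(4)])
```

The whole limiter is one counter per key, which is why it maps so naturally onto a key/value store.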
Redis
Advanced key-value store with support for hashes, lists, sets and sorted sets.
Pros:
• Disk-backed, persistent, journaled (fault tolerant)
• Replication out-of-the-box
• VERY fast reads/writes
Cons:
• Complex to query
Cassandra
Very scalable, distributed and decentralized data store.
Pros:
• Extremely fast reads and writes (Twitter boasts 100k/second+)
• Massive, engaged open source community (Twitter, Facebook)
• Fault tolerant
Cons:
• Java (see: Riak, an Erlang/C alternative that’s very similar)
• Not production ready
Voldemort
LinkedIn’s distributed persistent caching solution.
Pros:
• Distributed storage
• In-memory with disk-backed persistence and fault tolerance (no single point of failure)
• Very fast reads and writes (10-20k/second)
• Drop-in storage layer (great for unit testing mock objects)
• MVCC
• Native Serialization (hash tables, arrays, etc.)
Document Databases
MongoDB
Scalable, high performance database with familiar RDBMS functionality.
Pros:
• Semi-structured (hash tables, lists, dates, …)
• Full, range and nested Indexes
• Replication and distributed storage
• Query language and Map/Reduce
• GridFS file storage (NFS replacement)
• BSON Serialization
• Capped Collections
Cons:
• Map/Reduce is single process (soon to be resolved)
CouchDB
Portable, fault-tolerant document database.
Pros:
• Bi-directional replication (offline access)
• Some transaction support (ACID)
Cons:
• Complicated to query (Map/Reduce)
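Map/Reduce comes up for both document databases above. As a rough in-memory sketch of the pattern only (not either database's API): a map step emits key/value pairs for each document, and a reduce step folds the values collected for each key. The documents and function names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical documents, e.g. page-view events.
docs = [
    {"page": "/home", "views": 3},
    {"page": "/about", "views": 1},
    {"page": "/home", "views": 2},
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["page"], doc["views"]


def reduce_values(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    """Group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


totals = map_reduce(docs, map_doc, reduce_values)
print(totals)
```

Even a simple "sum per key" needs two functions and a grouping pass, which is part of why Map/Reduce feels heavier than an ad hoc query.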
Graph Databases
Neo4J
Designed on an object-oriented, flexible network structure rather than with strict and static tables. Ideal for social networking applications.
Pros:
• Read optimized
• Indexing
• Complex relationship tree processing
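To make "complex relationship tree processing" concrete, here is a small sketch of the kind of question a graph model answers directly: who is reachable within N hops of a person. It uses a plain adjacency list and breadth-first search in Python, not Neo4J's own API; the graph data and the `within_hops` helper are hypothetical.

```python
from collections import deque

# Hypothetical social graph as an adjacency list: person -> people they follow.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": [],
    "eve": ["ann"],
}


def within_hops(start, max_hops):
    """Breadth-first search: map each reachable person to their hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, []):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]  # the starting person is not part of the result
    return distance


print(within_hops("ann", 2))
```

In a relational schema the same question typically needs one self-join per hop, which is where a graph database earns its keep.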
Big Data & Warehouse Databases
HBase
The Hadoop database. For very large tables (billions of rows times millions of columns) on commodity hardware.
Pros:
• On-demand distributed processing (Map/Reduce)
• ETL optional
• Integrates tightly in Hadoop ecosystem (Pig, Hive, HDFS)
Cons:
• Slow, seconds or minutes (not milliseconds)
InfiniDB
Distributed column-oriented database.
Pros:
• Data warehousing (high speed data loader)
• Very fast queries and joins
• Analytics & Metrics
Cons:
• Slow Updates
• Schema designed up-front (hard to change later)
My Two Cents
• Content Management System (CMS): MongoDB, CouchDB
• Real-time analytics: MongoDB, Cassandra (Rainbird)
• Page/Query Cache: Redis, Voldemort
• Logging and Archiving Events: MongoDB, Cassandra
• Messaging: Redis, Cassandra
• Job Queue: MongoDB (capped collections), Redis
• Social Networking: Neo4J, Cassandra
• Data Mining & Warehousing: HBase, InfiniDB
• Binary Storage (Files, Images, …): MongoDB
• Sessions: Redis, MongoDB
Why Choose MongoDB?
• Semi-structured Data
• Native BSON Serialization
• Full Index Support
• Built-In Replication & Cluster Management
• Distributed Storage (Sharding)
• Easy to Query
• Fast In-Place Updates
• GridFS File Storage
• Capped collections
MongoDB in many ways “feels” like an RDBMS. It’s easy to learn and quick to implement.
Semi-Structured Data
MongoDB is NOT a key/value store. Store complex documents as arrays, hash tables, integers, objects and everything else supported by JSON:
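As an illustration of what such a document can look like, here is a hypothetical user record written as a Python dict and round-tripped through JSON. Every field name and value is invented for the example.

```python
import json

# Hypothetical user document; all fields are illustrative only.
user = {
    "username": "jdoe",
    "age": 31,                                      # integer
    "tags": ["admin", "beta"],                      # array
    "address": {"city": "Springfield", "zip": "00000"},  # nested hash table
    "logins": [                                     # array of objects
        {"ip": "10.0.0.1", "ok": True},
        {"ip": "10.0.0.2", "ok": False},
    ],
}

encoded = json.dumps(user)     # serialize to a JSON string
decoded = json.loads(encoded)  # and back to native structures
print(decoded["address"]["city"], decoded["tags"])
```

The nested address and the list of logins live inside the one record, where a normalized relational schema would usually spread them over several tables.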
Native BSON Serialization
[Chart: BSON vs. JSON vs. PHP serialization; axis from 0 to 1.2]
100,000 serialize/de-serialize runs of bson_encode(), json_encode() and serialize() in PHP.
The PHP MongoDB extension serializes the data in C, outside of the runtime, leading to even better results.
Full Index Support
Built-In Replication & Cluster Management
• Data redundancy
• Fault tolerant (automatic failover AND recovery)
• Consistency (wait-for-propagate or write-and-forget)
• Distribute read load
• Simplified maintenance
• Servers in the cluster managed by an elected leader
Easy to Query
Fast In-Place Updates
MongoDB stores documents in padded memory slots. Typical RDBMS updates on VARCHAR columns:
• Mark the row and index as deleted (without freeing the space)
• Append the new updated row
• Append the new index and possibly rebuild the tree
Most updates are small and don’t drastically change the size of the row:
• Last login date
• UUID replace / Password update
• Session cookie
• Counters (failed login attempts, visits)
MongoDB can apply most updates over the existing row, keeping the index and data structure relatively untouched – and do so VERY FAST.
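The padded-slot idea can be modelled with a toy record store: each record gets a slot somewhat larger than its data, so a small update overwrites in place and only an update that outgrows the slot has to relocate the record. This is a simplified sketch of the concept, not MongoDB's actual storage engine; the `PaddedStore` class and the `PADDING` factor are hypothetical.

```python
PADDING = 1.5  # hypothetical padding factor applied when a slot is allocated


class PaddedStore:
    """Toy record store: each record lives in a slot larger than its data."""

    def __init__(self):
        self.slots = []  # each slot is [capacity, data]
        self.index = {}  # key -> slot position
        self.moves = 0   # number of updates that had to relocate a record

    def insert(self, key, data):
        capacity = int(len(data) * PADDING)
        self.slots.append([capacity, data])
        self.index[key] = len(self.slots) - 1

    def update(self, key, data):
        pos = self.index[key]
        capacity = self.slots[pos][0]
        if len(data) <= capacity:
            # Fits inside the padding: overwrite in place, index untouched.
            self.slots[pos][1] = data
        else:
            # Outgrew the slot: mark the old slot dead and append a new one.
            self.slots[pos][1] = None
            self.insert(key, data)
            self.moves += 1

    def get(self, key):
        return self.slots[self.index[key]][1]


store = PaddedStore()
store.insert("u1", b"last_login=2011-01-01")
store.update("u1", b"last_login=2011-02-15")  # same size: in place
print(store.moves, len(store.slots))
```

Only the second branch of `update` touches the index, and the small updates listed above rarely reach it.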
GridFS File Storage
Efficiently store binary files in MongoDB:
• Videos
• Pictures
• Translations
• Configuration files
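Files are split into chunks (256 KB by default) and stored redundantly in your MongoDB network, so a single file can exceed the per-document size limit (4 or 16 MB, depending on the MongoDB version).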
• No serialization / fast reads
• Command line and PHP extension access
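The chunking behind GridFS can be sketched in a few lines: a file's bytes are cut into fixed-size pieces, each stored as its own small document carrying the file's id and a sequence number, and reading the file back means ordering the pieces and joining them. The helper names and the tiny chunk size below are assumptions for the demo; this is not the driver's API.

```python
def split_into_chunks(file_id, payload, chunk_size):
    """Split bytes into numbered chunk documents."""
    return [
        {"files_id": file_id, "n": n, "data": payload[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(payload), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the original bytes by ordering chunks on their sequence number."""
    ordered = sorted(chunks, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


payload = b"0123456789" * 5  # 50 bytes standing in for a file
chunks = split_into_chunks("file-1", payload, chunk_size=16)
print(len(chunks), reassemble(chunks) == payload)
```

Because each chunk is an ordinary document, chunks can be replicated and distributed like any other data.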
Capped Collections
Fixed-size round robin tables with extremely fast reads and writes.
Perfect for:
• Logging
• Messaging
• Job Queues
• Caching
Features:
• Automatically “ages out” old data
• Can also query, delete and update out of FIFO order
• FIFO reads/writes are nearly as fast as cat > file; tail -f file
• Tailable cursor stays open and reads rows as they are added
• Persistent, fault-tolerant, distributed
• Atomically pop items off the stack
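The "fixed-size round robin" behaviour can be imitated in memory with a bounded deque: once the collection is full, each insert silently drops the oldest item. This sketch caps by item count rather than by bytes, a simplification, and the `CappedCollection` class is a hypothetical stand-in, not MongoDB's API.

```python
from collections import deque


class CappedCollection:
    """Toy fixed-size, insertion-ordered collection that ages out old items."""

    def __init__(self, max_items):
        # A deque with maxlen drops the oldest item when a new one is appended.
        self.items = deque(maxlen=max_items)

    def insert(self, doc):
        self.items.append(doc)

    def pop_oldest(self):
        """Remove and return the oldest item (FIFO), or None if empty."""
        return self.items.popleft() if self.items else None

    def find(self, predicate):
        """Query out of FIFO order."""
        return [doc for doc in self.items if predicate(doc)]


log = CappedCollection(max_items=3)
for i in range(5):
    log.insert({"seq": i, "level": "error" if i % 2 else "info"})

print([doc["seq"] for doc in log.items])
```

Used as a log or job queue, the collection never needs a cleanup job: old entries fall off as new ones arrive.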
Object Document Mapper
doctrine-project.org/projects/mongodb_odm
The Doctrine MongoDB Object Document Mapper is built for PHP 5.3.2+ and provides transparent persistence for PHP objects.
The PHP MongoDB extension is simple, but the ODM makes things even easier:
• Document generation seamlessly from your class
• Query using your existing class structures
• Very easy migration path from an ORM
• Rapid Application Development