Big Data Stores

Introduction to Big Data stores:

Key Value stores:

Cassandra:

• First developed at Facebook (powered the Inbox Search).
• Uses decentralized clustered nodes.
• Considered one of the most scalable NoSQL systems.
• Very high availability - no single point of failure.
• Flexible data storage (structured/unstructured).
• Relatively easy to configure.
• Designed for high transaction rates.
• Java based - available under the Apache License 2.0.

Key-Value NoSQL databases:

DynamoDB:

• Amazon DynamoDB stores data on Solid State Drives (SSDs).
• DynamoDB implements cryptographic methods to authenticate users and prevent unauthorized data access.
• Optional strongly consistent reads return the latest values, and atomic counters are supported.
• Relieves developers of the overhead of scaling and replication.
• Synchronous replication across multiple AWS Availability Zones in a single Region.
• Combined with other AWS services such as Amazon EMR and AWS Data Pipeline, DynamoDB can perform complex analytics and data movement, respectively.


Riak:

• Riak adopts a masterless peer-to-peer architecture.
• Written in Erlang and C, with some JavaScript.
• Distributes data and performs replication across nodes with consistent hashing.
• Riak uses HTTP/REST or a custom binary protocol to exchange data with the cluster nodes.
• Riak replication has two modes of operation: fullsync (synchronization occurs every 6 hours) and real-time (requires a synchronization trigger).
• When new nodes are added to the cluster, data is rebalanced across nodes with no downtime.
• Used by 25% of Fortune 50 companies; users include AT&T, AOL, Ask.com, Best Buy, Boeing and Comcast.
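The consistent hashing mentioned for Riak can be illustrated with a minimal Python sketch. This is a toy ring, not Riak's actual implementation: the class name HashRing, the helper ring_position, the node names and the number of virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the hash ring (160-bit SHA-1 space)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical; not Riak's real code)."""

    def __init__(self, nodes, vnodes_per_node=8):
        self.vnodes_per_node = vnodes_per_node
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node claims several points ("virtual nodes") on the ring.
        for i in range(self.vnodes_per_node):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def preference_list(self, key, replicas=3):
        """Walk clockwise from the key's position and collect distinct nodes."""
        start = bisect.bisect(self._ring, (ring_position(key), ""))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners


ring = HashRing(["node-a", "node-b", "node-c"])
owners = ring.preference_list("user:42")  # nodes that would hold the replicas
```

When a node is added, only the keys whose ring segment is claimed by the new node change owner, which is why rebalancing can happen without taking the cluster down.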


Redis:

• Redis adopts a master-slave architecture.
• Slaves are allowed to communicate with each other.
• Redis is written in ANSI C and is best suited for rapidly changing data of predictable size, e.g. stock analysis.
• By default, latency monitoring is disabled; the user can enable it by setting a threshold value in the configuration variable "latency-monitor-threshold".
• Redis is designed to be accessed by trusted users within a trusted environment.
• Performs hash or range partitioning (mapping a range of objects to a specific Redis instance).
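The two partitioning schemes can be sketched in a few lines of Python. This is only an illustration of the idea, done on the client side: the instance names, the range size of 10,000 and the function names are assumptions, and no Redis client library is involved.

```python
import zlib

# Hypothetical instance names; in practice these would be host:port pairs.
INSTANCES = ["redis-0", "redis-1", "redis-2", "redis-3"]


def hash_partition(key: str, instances=INSTANCES) -> str:
    """Hash partitioning: a stable hash of the key, modulo the instance count."""
    return instances[zlib.crc32(key.encode("utf-8")) % len(instances)]


def range_partition(object_id: int, range_size=10_000, instances=INSTANCES) -> str:
    """Range partitioning: IDs 0..9999 map to the first instance, and so on."""
    index = object_id // range_size
    if not 0 <= index < len(instances):
        raise ValueError(f"object id {object_id} is outside the mapped ranges")
    return instances[index]
```

Range partitioning needs a table of ranges to be maintained, while hash partitioning works for any key but gives no control over which keys end up together.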


CouchDB:

• Written in Erlang.
• Instead of locks, CouchDB uses Multi-Version Concurrency Control (MVCC) to manage concurrent access to the database.
• CouchDB achieves eventual consistency between multiple databases by using incremental replication.
• Validates documents using JavaScript functions that approve or deny each document update.
• CouchDB supports both pull replication (the node acts as target) and push replication (the node acts as source).
• CouchDB is best suited for data that changes occasionally.
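The lock-free update rule behind MVCC can be sketched as a revision check: a writer must name the revision it read, and the write is rejected if the document has moved on since. The Python below is a simplified in-memory model, not CouchDB's API; TinyDocStore, ConflictError and the "generation-digest" revision format are assumptions that only imitate the general shape of CouchDB revisions.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class TinyDocStore:
    """Hypothetical in-memory store showing revision-checked writes."""

    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.sha256(payload).hexdigest()[:16]}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        # No lock is taken: the write simply fails if the caller's revision
        # is no longer the current one.
        if rev != current_rev:
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

A client that receives a conflict re-reads the document, merges its change, and retries with the new revision.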

Azure Table Storage:

• Maximum data size is 200 TB per table.
• A query against Azure Table returns a maximum of 1000 entities (rows) at a time.
• Azure Table Storage provides ACID transactions that guarantee CRUD operations for a single entity in a table.
• The storage access architecture of Azure Table Storage has a three-layered structure:
o Front-End (FE) layer - authenticates and authorizes the request.
o Partition layer - partitions the object data and performs load balancing.
o Distributed and replicated File System (DFS) layer - distributes and replicates data across many clusters.
• Azure Table Storage does not provide a way to represent relationships between data.
• To provide fault tolerance, the stored data is replicated three times within the region and an additional three times in another region.


BerkeleyDB:

• Berkeley DB is an embedded database engine and is suitable for storing key/value data.
• Key and data items are stored in simple structures called DBTs (DBT is an acronym for "database thang"), each containing a reference to memory and a length.
• Berkeley DB supports concurrent access by multiple threads, even in databases of large size.
• The program accessing Berkeley DB determines how data is to be stored in records.
• Berkeley DB has three different products:
o Berkeley DB - contains the database implementations and is written in C.
o Berkeley DB Java Edition - log-structured storage architecture, coded in pure Java.
o Berkeley DB XML - specializes in the storage of XML documents.

Column-Family NoSQL databases:

HBase:

• First developed at Powerset (to power natural language search).
• Distributed column-oriented database on top of Hadoop/HDFS.
• Continuous access to data - multiple master nodes.
• Linear and modular scalability.
• Provides interactive commands for manipulating the database.
• Single-row atomic operations and row-level exclusive locks.
• Multiple client interfaces, such as its native Java library, Thrift, and REST.


BigTable:

• First developed at Google (for structured data).
• Sparse, distributed, persistent, multidimensional sorted map.
• Self-managing (servers can be added or removed dynamically, and servers adjust to load imbalance).
• Fault tolerant and persistent.
• Designed to scale into the petabyte range.
• Tables are optimized for GFS (Google File System) by being split into multiple tablets.
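The "sparse, multidimensional sorted map" can be pictured as a map from (row, column, timestamp) to a value, with rows kept in sorted order so that contiguous row ranges can be scanned or split off. The Python below is a toy in-memory model under that reading; the class and method names are assumptions and do not correspond to any BigTable API.

```python
import bisect


class SparseSortedMap:
    """Toy (row, column, timestamp) -> value map kept in row-key order."""

    def __init__(self):
        self._rows = {}      # row key -> {column -> [(timestamp, value), ...]}
        self._row_keys = []  # row keys in sorted order

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        # Sparse: only the cells that are actually written take up space.
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (default: latest)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Return the row keys in [start_row, end_row), in sorted order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        return self._row_keys[lo:hi]
```

A contiguous range of sorted row keys, as returned by scan, is the kind of unit a table can be split into.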


HyperTable:

• Developed as in-house software at Zvents.
• Manages massive sparse tables with timestamped cell versions.
• Maximum efficiency (less hardware, power, datacenter).
• Good fit for a wide range of applications.
• Clean semantics.
• High performance.

Graph NoSQL databases:

Neo4j:

• Developed by Neo Technology.
• Highly scalable and robust.
• Graph structures with nodes, edges and properties to store data.
• Provides index-free adjacency.
• Neo4j is schema free - data does not have to adhere to any convention.
• ACID - atomic, consistent, isolated and durable for logical units of work.
• Easy to get started with and use.
• Support for a wide variety of languages (Java, Python, Perl, Scala, Cypher, etc.).
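Index-free adjacency means each node holds direct references to its relationships, so a traversal follows pointers instead of looking neighbours up in a global index. A minimal Python sketch of that property-graph idea follows; the Node and Relationship classes and the sample data are assumptions for illustration and are unrelated to Neo4j's own drivers or storage format.

```python
class Node:
    """A graph node with free-form properties (no fixed schema)."""

    def __init__(self, **properties):
        self.properties = properties
        self.outgoing = []  # direct references to Relationship objects


class Relationship:
    """A typed, directed edge that may carry its own properties."""

    def __init__(self, start, rel_type, end, **properties):
        self.start, self.type, self.end = start, rel_type, end
        self.properties = properties
        start.outgoing.append(self)  # adjacency is stored on the node itself


def neighbours(node, rel_type):
    """Follow the node's own relationship list; no index lookup is needed."""
    return [r.end for r in node.outgoing if r.type == rel_type]


alice, bob, carol = Node(name="Alice"), Node(name="Bob"), Node(name="Carol")
Relationship(alice, "KNOWS", bob, since=2010)
Relationship(bob, "KNOWS", carol)
Relationship(alice, "WORKS_WITH", carol)

friends_of_friends = [
    fof.properties["name"]
    for friend in neighbours(alice, "KNOWS")
    for fof in neighbours(friend, "KNOWS")
]
```

The cost of each hop depends on the number of relationships on the node being visited, not on the total size of the graph.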

Document NoSQL databases:

MongoDB:

• Developed by the software company 10gen as part of a service product, then later shifted to open source.
• Document-oriented database.
• Implemented in C++ for best performance (built for speed).
• Super low latency access to your data (very little CPU overhead).
• Auto-sharding for easy scalability.
• Map/Reduce for aggregation.
• Full index support for high performance.
• Language drivers for Ruby/Ruby on Rails, Java, C#, JavaScript, Python, Perl, Erlang, etc.
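Map/Reduce aggregation over documents can be sketched in plain Python: a map function emits (key, value) pairs per document, and a reduce function folds all values that share a key. The helper map_reduce, the sample orders and the field names are assumptions for illustration; this does not call MongoDB or any of its drivers.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Group the pairs emitted by map_fn by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical order documents.
orders = [
    {"customer": "ann", "amount": 25},
    {"customer": "bob", "amount": 10},
    {"customer": "ann", "amount": 5},
]


def emit_amount(doc):
    # Map step: one (customer, amount) pair per order document.
    yield doc["customer"], doc["amount"]


def total(key, values):
    # Reduce step: sum all amounts emitted for one customer.
    return sum(values)


totals = map_reduce(orders, emit_amount, total)
```

Here totals maps each customer to the sum of that customer's order amounts.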

Documents
