kumaran-ramanujam
Introduction to Big Data stores:
Key Value stores:
Cassandra:
• First developed at Facebook (powered the Inbox Search)
• Uses decentralized clustered nodes
• Considered one of the most scalable NoSQL systems
• Very high availability - no single point of failure
• Flexible data storage (structured/unstructured)
• Relatively easy to configure
• Designed for high transaction rates
• Java based - available under the Apache License 2.0
DynamoDB:
• Amazon DynamoDB stores data on solid-state drives (SSDs).
• DynamoDB uses cryptographic methods to authenticate users and prevent unauthorized data access.
• Supports strongly consistent reads, which return the latest values, as well as atomic counters.
• Relieves developers of the overhead of scaling and replication.
• Replicates data synchronously across multiple AWS Availability Zones in a single Region.
• Combined with other AWS services such as Amazon EMR and AWS Data Pipeline, DynamoDB can support complex analytics and data movement, respectively.
Riak:
• Riak adopts a masterless peer-to-peer architecture.
• Written in Erlang and C, with some JavaScript.
• Distributes data and performs replication across nodes with consistent hashing.
• Riak uses HTTP/REST or a custom binary protocol to communicate with the cluster nodes.
• Riak has two synchronization modes: fullsync (synchronization occurs every 6 hours) and real-time (requires a synchronization trigger).
• When new nodes are added to the cluster, data is rebalanced across nodes with no downtime.
• Used by 25% of Fortune 50 companies; users include AT&T, AOL, Ask.com, Best Buy, Boeing, and Comcast.
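The consistent-hashing idea mentioned for Riak can be illustrated with a small sketch. This is a toy model, not Riak's actual code: the class name `HashRing`, the use of SHA-1, and the `vnodes_per_node` and `replicas` parameters are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (an integer derived from SHA-1)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, not Riak's code)."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = []  # sorted list of (position, node_name)
        self._vnodes = vnodes_per_node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node claims several positions on the ring.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def preference_list(self, key, replicas=3):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect(self._ring, (_hash(key), ""))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners
```

In this sketch, adding a node only takes over keys that fall just before its new ring positions; every other key keeps its previous owner, which is why rebalancing can proceed without downtime.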
Redis:
• Redis adopts a master-slave architecture.
• Slaves are allowed to communicate with each other.
• Redis is written in ANSI C and is best suited for rapidly changing data with a predictable size (e.g., stock analysis).
• Latency monitoring is disabled by default; the user can enable it by setting a threshold value in the "latency-monitor-threshold" configuration parameter.
• Redis is designed to be accessed by trusted users within a trusted environment.
• Performs hash partitioning or range partitioning (mapping ranges of objects to specific Redis instances).
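The two partitioning schemes named for Redis can be sketched as plain functions that pick an instance for a key. The instance names, the ID ranges, and the choice of CRC32 as the hash are assumptions for illustration; no Redis client library is involved.

```python
import zlib

INSTANCES = ["redis-0", "redis-1", "redis-2"]  # hypothetical instance names


def hash_partition(key: str, instances=INSTANCES) -> str:
    """Hash partitioning: checksum the key, then take it modulo the instance count."""
    return instances[zlib.crc32(key.encode("utf-8")) % len(instances)]


# Range partitioning: each entry is (exclusive upper bound of object ID, instance).
RANGES = [(10_000, "redis-0"), (20_000, "redis-1"), (30_000, "redis-2")]


def range_partition(object_id: int, ranges=RANGES) -> str:
    """Return the instance whose ID range contains object_id."""
    for upper, instance in ranges:
        if object_id < upper:
            return instance
    raise KeyError(f"no instance configured for object id {object_id}")
```

Range partitioning needs a table of ranges to be maintained, while hash partitioning works for any key but reshuffles most keys when the number of instances changes.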
CouchDB:
• Written in Erlang.
• Instead of locks, CouchDB uses Multi-Version Concurrency Control (MVCC) to manage concurrent access to the database.
• CouchDB achieves eventual consistency between multiple databases by using incremental replication.
• Validates documents with JavaScript functions that approve or deny each document update.
• CouchDB supports both pull replication (the node acts as the target) and push replication (the node acts as the source).
• CouchDB is best suited for data that changes occasionally.
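The lock-free MVCC approach can be sketched as a store in which every update must name the revision it was based on. This is a simplified illustration, not CouchDB's implementation: `ToyMVCCStore` and `ConflictError` are hypothetical names, and integer revisions stand in for CouchDB's real revision identifiers.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class ToyMVCCStore:
    """Toy revision-checked document store (hypothetical, not CouchDB's code)."""

    def __init__(self):
        self._docs = {}  # doc_id -> list of (rev, body); last entry is current

    def get(self, doc_id):
        """Return (rev, copy of body) for the current version; never blocks."""
        rev, body = self._docs[doc_id][-1]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        """Append a new version if rev matches the current one, else conflict."""
        versions = self._docs.get(doc_id, [])
        current_rev = versions[-1][0] if versions else None
        if rev != current_rev:
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        # Old versions are kept, so earlier readers still see a consistent copy.
        self._docs[doc_id] = versions + [(new_rev, dict(body))]
        return new_rev
```

Two writers that read the same revision cannot both succeed: the second `put` raises `ConflictError` and has to re-read the document and retry, which replaces the need for a lock.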
Azure Table Storage:
• Maximum data size is 200 TB per table.
• Azure Table returns a maximum of 1,000 entities per query.
• Azure Table Storage provides ACID transactions that guarantee CRUD operations for a single entity in a table.
• The storage access architecture of Azure Table Storage has three layers:
o Front-End (FE) layer - authenticates and authorizes the request.
o Partition layer - partitions the object data and performs load balancing.
o Distributed and replicated File System (DFS) layer - distributes and replicates data across many clusters.
• Azure Table Storage does not provide a way to represent relationships between data.
• To provide fault tolerance, the stored data is replicated three times within the region and an additional three times in another region.
BerkeleyDB:
• Berkeley DB is an embedded database engine suitable for storing key/value data.
• Key and data items are stored in simple structures called DBTs (DBT is an acronym for "database thang"), each of which contains a reference to memory and a length.
• Berkeley DB supports concurrent access from multiple threads, even in large databases.
• The program accessing Berkeley DB determines how data is to be stored in records.
• Berkeley DB has three different products:
o Berkeley DB - contains the database implementations and is written in C.
o Berkeley DB Java Edition - uses a log-structured storage architecture and is coded in pure Java.
o Berkeley DB XML - specializes in the storage of XML documents.
Column-Family NOSQL databases:
HBase:
• First developed at Powerset (to power natural language search)
• Distributed column oriented database on top of Hadoop/HDFS.
• Continuous access to data - multiple master nodes.
• Linear and modular scalability.
• Provides interactive commands for manipulating the database.
• Single-row atomic operations and row-level exclusive locks.
• Multiple client interfaces, including its native Java library, Thrift, and REST.
BigTable:
• First developed at Google (for structured data).
• Sparse, distributed, persistent multidimensional sorted map.
• Self-managing (servers can be added or removed dynamically, and servers adjust to load imbalance).
• Fault tolerant and persistent.
• Designed to scale into the petabyte range.
• Tables are optimized for GFS (Google File System) by being split into multiple tablets.
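The phrase "sparse, distributed, persistent multidimensional sorted map" can be made concrete with a single-process sketch that maps (row, column, timestamp) to a value and keeps keys sorted. It is a toy model under stated assumptions: `ToySortedTable` is a hypothetical name, and distribution and persistence are left out entirely.

```python
import bisect


class ToySortedTable:
    """Toy sparse, sorted, multidimensional map (not BigTable's implementation)."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # same key -> value; absent cells take no space

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the most recent value for (row, column), or None if absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1
```

Because keys are sorted by row, a contiguous row range can be read with one scan; splitting a table into tablets amounts to assigning such row ranges to different servers.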
HyperTable:
• Developed as in-house software at Zvents.
• Manages massive sparse tables with timestamped cell versions.
• Maximum efficiency (less hardware, power, and datacenter space).
• Good fit for a wide range of applications.
• Clean semantics.
• High performance.
Graph NOSQL databases:
Neo4j:
• Developed by Neo Technology.
• Highly scalable and robust.
• Uses graph structures with nodes, edges, and properties to store data.
• Provides index-free adjacency.
• Neo4j is schema free - data does not have to adhere to any convention.
• ACID - atomic, consistent, isolated, and durable for logical units of work.
• Easy to get started with and use.
• Support for a wide variety of languages (Java, Python, Perl, Scala, Cypher, etc.).
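Index-free adjacency means each node holds direct references to its relationships, so a traversal follows pointers instead of consulting a global index. The sketch below illustrates the idea in plain Python; the `Node` and `Relationship` classes and the "KNOWS" relationship type are hypothetical and do not reflect Neo4j's storage format or API.

```python
class Node:
    """A graph node with properties and direct references to its relationships."""

    def __init__(self, **properties):
        self.properties = properties
        self.outgoing = []  # Relationship objects that start at this node


class Relationship:
    """A typed, directed edge that registers itself on its start node."""

    def __init__(self, start, end, rel_type, **properties):
        self.start, self.end, self.type = start, end, rel_type
        self.properties = properties
        start.outgoing.append(self)


def neighbors(node, rel_type):
    """Follow stored references; no global index lookup is involved."""
    return [r.end for r in node.outgoing if r.type == rel_type]


def friends_of_friends(node):
    """Nodes two KNOWS hops away, excluding the node and its direct friends."""
    friends = neighbors(node, "KNOWS")
    seen = {id(node)} | {id(f) for f in friends}
    result = []
    for friend in friends:
        for fof in neighbors(friend, "KNOWS"):
            if id(fof) not in seen:
                seen.add(id(fof))
                result.append(fof)
    return result
```

The cost of each hop depends only on the number of relationships attached to the current node, not on the total size of the graph.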
Document NOSQL databases:
MongoDB:
• Developed by the software company 10gen as a service product, later shifted to open source.
• Document-oriented database.
• Implemented in C++ for best performance (built for speed).
• Super low latency access to your data (very little CPU overhead).
• Auto-sharding for easy scalability.
• Map/Reduce for aggregation.
• Full index support for high performance.
• Language drivers for Ruby/Ruby on Rails, Java, C#, JavaScript, Python, Perl, Erlang, etc.
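The Map/Reduce style of aggregation over documents can be sketched in a few lines. This is an in-memory illustration of the pattern only, not MongoDB's API: the `map_reduce` helper and the sample field names `customer` and `amount` are assumptions.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_amount_by_customer(doc):
    """Map step: emit one (customer, amount) pair per order document."""
    yield doc["customer"], doc["amount"]


def sum_values(key, values):
    """Reduce step: total all amounts emitted for one customer."""
    return sum(values)
```

Calling `map_reduce(orders, emit_amount_by_customer, sum_values)` on a list of order documents returns a dictionary of totals per customer.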