Big Data and the growing relevance of NoSQL

Big Data trends and the rising importance of NOSQL

Abhijit Sharma, Architect,

Innovation & Incubation Lab, BMC Software

Trends in cloud, web, and even enterprise scale apps
Unprecedented growth in:
  Data set sizes which need to be stored and analyzed
    Big Data: cloud scale services generate TBs to PBs (FB, eBay, Digg, Foursquare)
  Connectedness and democratization of data
    Social networks, feeds, blogs, wikis, tags, semantic web
    Data APIs: mash up data using the Twitter, FB and Flickr APIs
    Semi-structured or unstructured data
  Performance requirements of these apps
    Humongous R/W scalability
    High availability
    Trading consistency for availability: ACID not mandatory

RDBMS woes
  Challenge: storing and scaling humongous amounts of data while remaining highly available
  Vertical scaling mostly: has an upper limit and is expensive
  Horizontal scaling: no automatic sharding, no rebalancing, no infrastructure
  Distributed transactions and joins (needed because of normalization) inhibit performance and availability
  Schema-less data is a poor fit: rigid schema, ALTER TABLE, null columns
  Deeply connected data: not designed for this

The NOSQL Alternative

NOSQL is NOT "No SQL"

NOSQL is simply "Not only SQL"

NOSQL: so what else is it?
  "One size fits all" RDBMS is not working
  NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by the trends. They can be categorized along these axes:
    Data model: simple to complex
    Scalability: single node to horizontal
    Persistence

NOSQL categories

Graph Databases
  Based on graph theory
  Data model: graph, nodes, edges, properties
  Scalability: single node, high performance
  Persistence: on-disk data structures
  Examples: Neo4J, AllegroGraph

Document Databases
  Based loosely on documents/Lotus Notes
  Data model: collections of documents
  Scalability: horizontal, auto-sharding and replication
  Persistence: B-Tree
  Examples: mongoDB, CouchDB

Column Stores
  Based on Google's BigTable design
  Data model: big table, column families
  Scalability: horizontal, auto-sharding and replication
  Persistence: memory + file (on DFS)
  Examples: HBase, Cassandra

Key Value Stores
  Based on DHT, Amazon's Dynamo design
  Data model: collection of key value pairs
  Scalability: horizontal, auto-sharding and replication
  Persistence: memory or file
  Examples: Redis, Amazon Dynamo, Voldemort

Graph Databases

Graph oriented data
  Graphs are ubiquitous: social networks, wikis, the web, recommendation engines, et al.
  Deep trees, complex networks
  Graph traversal: apt for expressing graph related problems (shortest path, network size, etc.)

LinkedIn Social Graph

Why not RDBMS for large scale graphs?
  Difficult to model and traverse graphs in an RDBMS
  Recursive approaches: slow SQL queries that span many table joins
  Hacks like storing paths for trees

node  name
1     abhijit
2     sameer

from  to
1     2
2     3
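To make the join cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `edge` and the column names `src` and `dst` are assumptions for illustration (they stand in for the slide's from/to columns, which are SQL keywords). It only shows the shape of the queries, not their performance at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3)])

# Two hops need one self-join. Three hops would need two self-joins, and so on.
two_hops = conn.execute(
    "SELECT e1.src, e2.dst FROM edge e1 JOIN edge e2 ON e1.dst = e2.src"
).fetchall()

# Arbitrary depth needs a recursive query that repeatedly joins back to edge.
reachable = conn.execute(
    """
    WITH RECURSIVE reach(node) AS (
        SELECT dst FROM edge WHERE src = ?
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
    """,
    (1,),
).fetchall()
```

Every additional level of depth adds another join (or another recursion step) over the whole edge table, which is the cost the slide refers to.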

Graph Databases
  Designed for efficient storage and traversal of large scale graphs
  Natural modeling of graph networks: nodes, relationships and their properties
  Neo4J is a leading graph db
    Supports billions of nodes/edges, traverses depths of 1000 levels in ms, 1000x faster than an RDBMS
    Handles large graphs that don't fit in memory: persistent transactional store optimized for graphs
    REST API and various language bindings
    Graph pattern matching, Cypher query language
    Indexer: Lucene

Graph basics

All Paths & My Network size

Shortest path between …

Is connected to?

You may know…
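The graph questions on these slides (network size, shortest path, "is connected to?", "you may know") all reduce to breadth-first traversal. The following is a minimal in-memory sketch, not how Neo4J implements it; the graph contents and the function names are hypothetical.

```python
from collections import deque

# Hypothetical undirected social graph stored as an adjacency map.
GRAPH = {
    "abhijit": {"sameer", "ravi"},
    "sameer": {"abhijit", "meera"},
    "ravi": {"abhijit"},
    "meera": {"sameer", "tara"},
    "tara": {"meera"},
    "zoe": set(),
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

def network(graph, start, max_depth):
    """Everyone reachable from start within max_depth hops, excluding start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue
        for neighbour in graph[node]:
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return set(depth) - {start}

def is_connected(graph, a, b):
    return shortest_path(graph, a, b) is not None

def you_may_know(graph, person):
    """Friends of friends who are not already direct friends."""
    return network(graph, person, 2) - graph[person]
```

Each question touches only the nodes near the starting point, which is why a store that keeps neighbours adjacent can answer them without scanning or joining whole tables.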

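Mining your network
Centrality algorithms
  Degree: who has the most followers on Twitter
  Closeness: who can reach everyone else in the fewest hops
  Betweenness: who sits on the most shortest paths between other people
  Eigenvector: who has more influential people following them, e.g. PageRank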
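As a rough illustration of why counting followers and eigenvector-style measures can disagree, here is a small PageRank sketch using plain power iteration. The follow graph, the damping value of 0.85 and the iteration count are assumptions chosen for the example.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """follows maps each user to the set of users they follow (outgoing edges)."""
    users = list(follows)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            # A user who follows nobody spreads their rank evenly over everyone.
            targets = follows[u] or set(users)
            share = damping * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical graph: a and b follow c, c follows d, d follows nobody.
FOLLOWS = {"a": {"c"}, "b": {"c"}, "c": {"d"}, "d": set()}

follower_count = {u: sum(u in t for t in FOLLOWS.values()) for u in FOLLOWS}
ranks = pagerank(FOLLOWS)
most_followed = max(follower_count, key=follower_count.get)
most_influential = max(ranks, key=ranks.get)
```

In this toy graph c has the most followers, but d ends up with the highest rank because its single follower is itself well followed.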

Document Databases

Flexible document oriented data
  Document style unstructured data: schema-less, e.g. JSON documents
    No ALTER TABLE needed as in an RDBMS; de-normalized data
    Useful for iterative/agile development
  Humongous scale: billions of documents, millions of reads/writes per second, horizontal scalability, availability
  mongoDB is a leading document database
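A small sketch of what "schema-less" means in practice, using plain Python dictionaries as stand-ins for JSON documents. The collection, the field names and the `find` helper are hypothetical and are not a mongoDB API.

```python
import json

# A "collection" is just a list of documents here.
postings = []
postings.append({"_id": 1, "title": "Bike for sale", "price": 120})

# A later version of the app adds fields. Older documents are left untouched,
# so nothing comparable to ALTER TABLE is needed.
postings.append({"_id": 2, "title": "Flat to rent", "price": 900,
                 "location": {"city": "Pune"}, "tags": ["housing", "rent"]})

def find(collection, **criteria):
    """Return documents whose top-level fields equal all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

cheap = find(postings, price=120)
with_tags = [doc for doc in postings if "tags" in doc]
serialized = json.dumps(postings[1], sort_keys=True)
```

Documents of both shapes live side by side, and readers simply treat a missing field as absent.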

Document Database: use cases
  Archiving of historic data which has undergone many schema changes
  Flexible set of performance metrics (web site page views, unique visitors, etc.) that change over time: no need to update existing JSON documents
  Track near real time metrics: optimized increment of perf counters
  Geo location based mobile and gaming apps (geospatial indices can be key here)

Craigslist Archival Database
  Premium service to customers allowed search over their historical postings
  Archival (no purging) of 10 years of postings: billions of documents
  Schema changes across versions
  MySQL based archival database: ALTER TABLE took a month to complete

Foursquare
  Find a venue whose name is Starbucks and mayor is Abhijit
  Geo: optimized for geo location queries, e.g. find Starbucks near my current GPS location
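The two Foursquare queries can be sketched over in-memory documents as follows. The venue data, coordinates and helper names are hypothetical; the query dictionary only mirrors the general shape of a document query, and the distance filter is a full scan standing in for what a geospatial index would make cheap.

```python
import math

# Hypothetical venue documents; loc is (latitude, longitude) in degrees.
VENUES = [
    {"name": "Starbucks", "mayor": "Abhijit", "loc": (18.5204, 73.8567)},
    {"name": "Starbucks", "mayor": "Sameer", "loc": (18.5310, 73.8440)},
    {"name": "Corner Cafe", "mayor": "Abhijit", "loc": (18.5167, 73.8411)},
]

def matches(doc, query):
    """True if every field in the query equals the document's field."""
    return all(doc.get(field) == value for field, value in query.items())

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def near(docs, query, here, max_km):
    """Matching documents within max_km of here, nearest first."""
    hits = [(haversine_km(here, d["loc"]), d) for d in docs if matches(d, query)]
    return [d for dist, d in sorted(hits, key=lambda pair: pair[0]) if dist <= max_km]

exact = [d for d in VENUES if matches(d, {"name": "Starbucks", "mayor": "Abhijit"})]
nearby = near(VENUES, {"name": "Starbucks"}, (18.5204, 73.8567), 5.0)
very_close = near(VENUES, {"name": "Starbucks"}, (18.5204, 73.8567), 1.0)
```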

mongoDB Architecture

[Figure: mongoDB architecture showing a Client, Mongo Routers, Mongo Configuration Servers and Shards]
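The role of the router in such an architecture can be sketched with simple range partitioning: the router consults a table of key ranges and forwards each request to the shard that owns the key. This is a toy illustration of the general idea, not mongoDB's implementation; the class, the split points and the shard names are hypothetical.

```python
import bisect

class Router:
    """Toy router: looks up the shard that owns the range a key falls into."""

    def __init__(self, split_points, shard_names):
        # split_points [k1, k2] define the ranges (-inf, k1), [k1, k2), [k2, +inf).
        assert len(shard_names) == len(split_points) + 1
        self.split_points = split_points
        self.shard_names = shard_names
        self.shards = {name: {} for name in shard_names}

    def shard_for(self, key):
        return self.shard_names[bisect.bisect_right(self.split_points, key)]

    def insert(self, key, doc):
        self.shards[self.shard_for(key)][key] = doc

    def find(self, key):
        return self.shards[self.shard_for(key)].get(key)

router = Router(["h", "p"], ["shard0", "shard1", "shard2"])
for user in ["abhijit", "sameer", "meera"]:
    router.insert(user, {"user": user})
```

The client only ever talks to the router, so shards can be added or ranges moved without the client changing.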

mongoDB Features
  JSON documents, collection oriented storage
  Rich, document-based queries
  Indexes on document attributes
  Fast in-place updates
  Scalability features
    Horizontal scalability
    Configurable replication and high availability
    Auto-sharding and rebalancing
  Language specific drivers: Java, Scala, Ruby, etc.

Column Stores

Column Store
  Reasonably rich data model: sparse, distributed, persistent multi-dimensional sorted map
    Sorted row keys, columns
  Use cases: large scale data storage and analysis
    Time series data along with associated dimension data
      Row keys are timestamps and thus sorted, which helps time range queries
    Google Analytics
      Provides aggregate statistics: # unique visitors/day, page views/URL/day
      Raw click table (~200 TB) has a row for each URL + user session time, which keeps rows for the same URL contiguous and chronologically sorted
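Why sorted row keys help range queries can be shown with a tiny sorted map. This is a single-process sketch, assuming a composite row key of (URL, session start time); the class name, URLs and timestamps are hypothetical.

```python
import bisect

class SortedTable:
    """Rows kept in row-key order so that a key range is one contiguous slice."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, row_key, row):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = row

    def scan(self, start_key, stop_key):
        """Yield (key, row) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

clicks = SortedTable()
# Row key = (url, session start time); inserted out of order on purpose.
clicks.put(("example.com/a", 1700000300), {"visitor": "u2"})
clicks.put(("example.com/b", 1700000100), {"visitor": "u1"})
clicks.put(("example.com/a", 1700000100), {"visitor": "u1"})
clicks.put(("example.com/a", 1700000900), {"visitor": "u3"})

window = list(clicks.scan(("example.com/a", 1700000000),
                          ("example.com/a", 1700000600)))
```

The scan finds its start and end by binary search and then reads neighbouring rows, so a time window for one URL never touches rows for other URLs.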

[Figure: data cube of CPU metrics along the dimensions Time, OS and DC]

Column Store Performance
  Excellent R/W performance with large storage: PBs
  High scalability: horizontal scaling, auto-sharding
  High availability: transparent replication of data
  HBase is a leading column store, built on Hadoop with HDFS as the underlying persistence

Column Store - HBase
  Table defines Column Families: groups of similar attributes, vertical partitioning
  (Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value
  Table is split into multiple, equally distributed regions, each of which is a range of sorted keys (partitioned automatically by the key)
  Rows ordered by key, columns ordered within a Column Family
  Rows can have different numbers of columns
  Columns have a value and versions (any number)
  Row range, column range and key range queries

Row Key       Column Family (dimensions)   Column Family (metric)
112334-7782   server:host1, dc:PUNE        value:20
112334-7783   server:host2                 value:10
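The (Row, ColumnFamily:Column, Timestamp) to value mapping can be sketched as nested dictionaries. This is an in-memory illustration only, not the HBase client API; the family names `dim` and `metric`, the timestamps and the second version of the metric are assumptions.

```python
from collections import defaultdict

class Table:
    def __init__(self, column_families):
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Latest version at or before timestamp (latest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def columns(self, row):
        """Column names present in this row, in sorted order."""
        return sorted(self._rows.get(row, {}))

metrics = Table(["dim", "metric"])
metrics.put("112334-7782", "dim:server", "host1", 1)
metrics.put("112334-7782", "dim:dc", "PUNE", 1)
metrics.put("112334-7782", "metric:value", 20, 1)
metrics.put("112334-7782", "metric:value", 25, 2)  # hypothetical newer version
metrics.put("112334-7783", "dim:server", "host2", 1)
metrics.put("112334-7783", "metric:value", 10, 1)
```

The second row simply has no `dim:dc` cell, which is what "sparse" means here: absent columns cost nothing and need no null placeholder.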

HBase Architecture

Key Value Stores

Key Value Stores
  Simplest possible data model
  Caching a user's personalized, rendered page: avoids hitting the DB
  S3 bucket storage for blob data against a unique id

Range of KV stores
  Distributed, scalable, persistent key-value storage: Dynamo, Voldemort
    Auto-partitioned key space
    Replicated KV
    Highly available
  Largely in-memory KV stores: Redis, memcached
    Redis is blazing fast for caching and other interesting operations
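An auto-partitioned, replicated key space is commonly built on consistent hashing, the technique described in Amazon's Dynamo design. The following is a minimal sketch under that assumption; the class name, node names, virtual node count and hash choice are all hypothetical.

```python
import bisect
import hashlib

def _hash(value):
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=2, vnodes=64):
        self.replicas = replicas
        # Each node is placed at many points on the ring to even out the load.
        self._points = sorted((_hash(f"{node}#{i}"), node)
                              for node in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in self._points]

    def owners(self, key):
        """First `replicas` distinct nodes clockwise from the key's position."""
        start = bisect.bisect(self._hashes, _hash(key))
        found = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in found:
                found.append(node)
                if len(found) == self.replicas:
                    break
        return found

three = Ring(["n1", "n2", "n3"])
four = Ring(["n1", "n2", "n3", "n4"])
keys = [f"user:{i}" for i in range(1000)]
moved = [k for k in keys if three.owners(k)[0] != four.owners(k)[0]]
```

When a node is added, only the keys that now fall to the new node change their primary owner, which is what keeps rebalancing cheap.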

Redis
  In-memory KV store
    Blazing fast: 100 K/sec R/W
    Async snapshot to disk
  More than a KV store, a data structure store
    Supports lists, queues, sets and operations on them
    Sorted list range operations
    Set operations: UNION, INTERSECTION, DIFF

Redis: use cases
  Web session caching with EXPIRE set for session expiry
  Live real time bit.ly URL stats like clicks: fast increments of counters
  Auto complete: typing the first few characters maps to a sorted list on which a range query is fired
  Publish/Subscribe: fan out a message to subscribers
  Set operations: my Twitter <Followees DIFF Followers> tells me whom I follow but who don't follow me back
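Two of these use cases can be sketched with Python's built-in sets and a sorted list standing in for the Redis data structures. The account names and terms are hypothetical. Note that the intersection of followers and followees gives mutual follows, while the difference (followees minus followers) gives the accounts I follow that do not follow me back; this is roughly what the Redis SINTER and SDIFF commands compute.

```python
import bisect

followers = {"sameer", "meera", "tara"}  # accounts that follow me
followees = {"sameer", "ravi", "zoe"}    # accounts I follow

mutual = followers & followees              # INTERSECTION: we follow each other
not_following_back = followees - followers  # DIFF: I follow them, they don't follow me
fans_only = followers - followees           # DIFF the other way round

# Autocomplete: keep terms sorted and range-scan everything sharing the prefix.
TERMS = sorted(["redis", "redo", "red", "reddit", "read", "rest"])

def complete(prefix):
    lo = bisect.bisect_left(TERMS, prefix)
    # "\uffff" sorts after any character used in these terms.
    hi = bisect.bisect_left(TERMS, prefix + "\uffff")
    return TERMS[lo:hi]
```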

Thanks

Email : [email protected] : sharmaabhijit
Blog : abhijitsharma.blogspot.com

