Big Data – GITPRO 2013
By Sameer Wadkar, Co-Founder & Big Data Architect / Data Scientist at Axiomine

Big Data presentation at GITPRO 2013


Page 1: Big Data presentation at GITPRO 2013

Big Data – GITPRO 2013
By Sameer Wadkar
Co-Founder & Big Data Architect / Data Scientist at Axiomine

Page 2: Big Data presentation at GITPRO 2013

Agenda

• What is Big Data?
• Big Data Characteristics
• Big Data and Business Intelligence Applications
• Big Data and Transactional Applications
• Demo

Page 3: Big Data presentation at GITPRO 2013

What is Big Data?

[Diagram: the three Vs of Big Data – Volume, Velocity, Variety]

12 Terabytes of Tweets are monitored each day to improve product sentiment analysis (source: IBM)

Amazon and PayPal use Big Data for real-time fraud detection (source: McKinsey)

In 15 of the US economy’s 17 sectors, companies with upward of 1,000 employees store, on average, more information than the Library of Congress (source: McKinsey)


Most Big Data applications are based around the Volume dimension

Page 4: Big Data presentation at GITPRO 2013

Visualizing Big Data

• 1 Petabyte is about 54,000 movies in digital format
• Reading 1 Terabyte of data sequentially from a single disk drive takes about 3 hours at a typical hard-disk read speed of 80 MB/sec (see the back-of-the-envelope check below)
• Traversing 1 Terabyte of data randomly over one disk (a typical database access scenario) takes orders of magnitude longer
• Sequential disk transfer is significantly faster than random access, which is dominated by disk seek time
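A quick back-of-the-envelope check of the sequential-read figure above, using the stated 80 MB/sec rate: 1 TB ≈ 1,000,000 MB, and 1,000,000 MB ÷ 80 MB/sec = 12,500 sec ≈ 3.5 hours.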

Single node processing capacity will drown in the face of Big Data

Page 5: Big Data presentation at GITPRO 2013

Big Data vs. Traditional Architecture

In Big Data architectures the application moves to the data. Why?

Three Tier Architecture (separate Application Tier and Data Tier):
1. User launches a batch job
2. App Tier requests data from the Data Tier
3. Data Tier sends the data to the App Tier
4. App Tier processes the data
5. App Tier sends the report to the user

Big Data Architecture (Master Node; Application & Data Tier combined on every node):
1. User requests a report
2. Master distributes the application to the nodes
3. Master launches the app on each node
4. All nodes process the data stored locally on their own node
5. User downloads the results

Page 6: Big Data presentation at GITPRO 2013

Why is Big Data hard?

Dividing the data out and conquering it in place is the core Big Data strategy

• The goal is to divide the data across multiple nodes and conquer by processing the data in place on each node
• Real-world processing cannot always be divided into smaller sub-problems (divide and conquer is not always feasible; see the sketch after this list)
• Data has dependencies
  • Normalization vs. Denormalization
• There are processing dependencies: a later phase of the process may require the results of an earlier phase
  • Single-pass vs. Multi-pass
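A minimal Python sketch (an illustration added here, not from the deck) of why divide and conquer is not always feasible: a SUM can be computed per node and trivially combined, while a MEDIAN cannot be recovered from per-node medians and needs a second pass or a global sort.

    import statistics

    nodes = [[3, 1, 4], [1, 5, 9], [2, 6]]   # data divided across three nodes

    # SUM partitions cleanly: each node computes its piece in place,
    # and the partial results combine trivially.
    total = sum(sum(chunk) for chunk in nodes)            # 31

    # MEDIAN does not: the median of per-node medians is generally
    # NOT the true median of the whole data set.
    median_of_medians = statistics.median(
        statistics.median(chunk) for chunk in nodes)      # 4
    true_median = statistics.median(
        value for chunk in nodes for value in chunk)      # 3.5
    print(total, median_of_medians, true_median)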

Page 7: Big Data presentation at GITPRO 2013

Big Data Characteristics

Scale-out, Fault Tolerance & Graceful Recovery are essential features

• Big Data systems must scale out
  • Adding more nodes should lead to greater parallelization
• Big Data systems must be resilient to partial failure
  • If one part of the system fails, other parts should continue to function
• Big Data systems must be able to self-recover from partial failure
  • If any part of the system fails, another part of the system will attempt to recover from the failure
  • Data must be replicated on separate nodes
  • Loss of any node loses neither data nor processing
  • Recovery should be transparent to the end user

Page 8: Big Data presentation at GITPRO 2013

Big Data Applications

Big Data design is dictated by the nature of the applications

• Business Intelligence applications
  • Read-only systems
  • ETL systems
  • Query massive data to generate reports or to perform large-scale transformations and import into a destination data source
• Transactional applications
  • One part of the system updates data while another part reads the data
  • Example: imagine running an online store of Amazon.com scale

Page 9: Big Data presentation at GITPRO 2013

BI - Sample Use-Case

A very simple query but size makes all the difference

• “SELECT YEAR, SUM(SALES_AMT) FROM SALES WHERE STATE = ‘MD’ GROUP BY YEAR ORDER BY YEAR”

• Find the total sales revenue by year for Maryland (“MD”), ordered by year

• What if the SALES table has billions of rows accumulated over 20 years?

Input: Sales Transactions Table -> Big Data Reporting -> Output:

Year   Sales Revenue
1980   11 Million
1981   13 Million
…      …
2010   10 Billion

Page 10: Big Data presentation at GITPRO 2013

BI Big Data Flavors

We discuss three flavors in increasing order of scale-out capability

Big Data Flavor                       Products
In-Memory Databases                   Oracle Exalytics, SAP HANA
Massively Parallel Processing (MPP)   Greenplum, Netezza
MapReduce                             Hadoop

Page 11: Big Data presentation at GITPRO 2013

In Memory Databases

If the next query filters on State=‘VA’ and the cache is only big enough to hold one state’s results at a time, a cache miss occurs and there are no performance gains

Simplified version – data is partitioned randomly across all data nodes.

[Diagram: Data Node, Data Node, …, Data Node -> Processing Node -> In-Memory TB Cache with SQL Interface -> User SQL Interface]

Selection Phase
1. Each data node contains fast memory (SSD) and a mechanism to apply the WHERE clause
2. Only the necessary data (the “MD” records) is passed over the expensive network I/O to the processing node

Processing Phase
1. The processing node computes SUM(sales_amt) by year
2. Orders the results
3. Places them in the in-memory cache

Fetch Phase
The user is served the results from the cache through the familiar SQL interface

• The first execution of the query is slow
• Subsequent executions are very fast (almost real-time) because the cache is hot
• The cache has a SQL interface, so the user experiences “real time”! (a toy sketch of this behavior follows)
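A toy Python sketch of the hot-cache behavior described above (an illustration under assumed mechanics, not vendor code; the query text, timing, and result rows are made up):

    import time

    cache = {}  # query text -> cached result rows

    def run_query(sql):
        if sql in cache:                    # hot cache: near real-time response
            return cache[sql]
        time.sleep(2)                       # stand-in for the slow selection/processing phases
        result = [(1980, 11_000_000), (1981, 13_000_000)]  # placeholder rows
        cache[sql] = result                 # warm the cache for repeat executions
        return result

    q = "SELECT YEAR, SUM(SALES_AMT) FROM SALES WHERE STATE='MD' GROUP BY YEAR"
    run_query(q)  # first execution: slow (cache miss)
    run_query(q)  # repeat execution: fast (cache hit)
    # A query for STATE='VA' is a different key and would miss the cache again.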

Page 12: Big Data presentation at GITPRO 2013

In Memory Databases (cont.)

In-Memory DBs provide real-time querying on moderately sized data

Characteristics

• Specialized hardware
• Specialized I/O and flash memory for faster I/O
• Massive in-memory cache (multi-terabyte) with a SQL interface

Pros

• Familiar model (SQL interface)
• Can integrate with standard toolkits and BI solutions
• Unified software/hardware solution

Cons

• Vendor lock-in
• Expensive, in hardware as well as licensing cost
• Typically cannot scale beyond 1-2 TB of data
• Works best when the same data is read often (the cache remains hot)

Page 13: Big Data presentation at GITPRO 2013

MPP (Typical Architecture)

If the query is “Group by State”, it no longer runs as fast. Why?

Data is partitioned horizontally across all slave nodes. Assume “Sale Year” is the distribution key. Secondary indexes by other keys can be added to each slave node.

[Diagram: Master Node over slave nodes – Slave Node (1980 & 1990 data), Slave Node (1981 & 1991 data), …, Slave Node (2000 & 2010 data)]

Distributed Query Phase
1. Each slave node computes the query for the data contained in its own node
2. Each year’s data is held completely within a single node
3. This phase produces partial query results which are complete for each year

Accumulation Phase
1. All slave results are aggregated and sorted at the Master Node

• Scale out – more nodes means fewer years of data per node
• Redundancy & failover – each node has a backup node
• Compatibility between the data distribution strategy and the access patterns determines performance
• Enormous network overhead if access patterns do not respect the distribution strategy (see the sketch below)
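A small Python sketch of why the distribution key matters (an illustration under assumed data, not Greenplum/Netezza internals): with rows distributed by year, a GROUP BY YEAR is complete on each slave, while a GROUP BY STATE leaves every state split across all slaves and forces a network merge.

    from collections import defaultdict

    # Rows: (year, state, sales_amt). Distribution key is YEAR, so each slave
    # node holds ALL the rows for its assigned years.
    slaves = [
        [(1980, "MD", 5.0), (1980, "VA", 2.0)],  # slave 1: all 1980 data
        [(1981, "MD", 7.0), (1981, "VA", 1.0)],  # slave 2: all 1981 data
    ]

    def partial_groupby(rows, key_index):
        out = defaultdict(float)
        for row in rows:
            out[row[key_index]] += row[2]   # row[2] is sales_amt
        return dict(out)

    # GROUP BY YEAR (key_index=0): each year appears in exactly one slave's
    # partial result, so the master only concatenates – no cross-node merging.
    year_partials = [partial_groupby(rows, 0) for rows in slaves]

    # GROUP BY STATE (key_index=1): every slave emits a partial sum for every
    # state, so all partials must travel over the network and be re-merged.
    state_partials = [partial_groupby(rows, 1) for rows in slaves]
    merged = defaultdict(float)
    for partial in state_partials:
        for state, amt in partial.items():
            merged[state] += amt
    print(year_partials)                  # year keys are disjoint across slaves
    print(state_partials, dict(merged))   # 'MD' and 'VA' appear in BOTH partials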

Page 14: Big Data presentation at GITPRO 2013

MPP (cont.)

MPP supports familiar RDBMS paradigm for medium scalability

Characteristics

• Balances throughput with responsiveness.• Some implementations use specialized hardware (Ex. Netezza uses FPGA)• Familiar RDBMS (SQL) paradigm• Can scale to 10’s of Terabytes in most cases

Pros

• Familiar model (SQL Interface)• Can integrate with standard toolkits and BI Solutions

Cons

• Vendor lock-in• Cannot scale for ad-hoc queries• Queries must respect data distribution strategy for acceptable performance.

Page 15: Big Data presentation at GITPRO 2013

MapReduce

If the query is “Group by State” – it still works!

Data is partitioned randomly/redundantly across all data nodes. Every data node contains sales data for every state and every year.

[Diagram: Master Node coordinating Map Nodes (Data Node, Data Node, …, Data Node) feeding a Reduce Node]

Map Phase
1. Each data node reads all of its records sequentially
2. It filters out all non-“MD” state records
3. It computes SUM(sales_amt) by year for each year

Reduce Phase
1. The Reduce node receives SUM(sales_amt) for state “MD” by year from each node
2. It adds all the map results by year and computes the final SUM(sales_amt) by year for “MD” sales
3. It orders the results by year

• Data blocks (on the order of 128 MB) are stored and accessed contiguously
• Scales out efficiently and degrades gracefully
• If a task fails, the framework restarts it automatically (on another node if necessary) – redundancy and graceful recovery

(A minimal code sketch of this map/reduce flow follows.)
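A minimal sketch of the map and reduce steps described above, in plain Python rather than actual Hadoop code (the record layout and function names are illustrative assumptions):

    from collections import defaultdict

    def map_phase(records, state="MD"):
        """Each map node scans its records sequentially, drops non-MD rows,
        and emits a partial SUM(sales_amt) keyed by year."""
        partial = defaultdict(float)
        for rec_state, year, sales_amt in records:
            if rec_state == state:
                partial[year] += sales_amt
        return partial

    def reduce_phase(partials):
        """The reduce node adds the per-node partial sums by year and
        returns the totals ordered by year."""
        totals = defaultdict(float)
        for partial in partials:
            for year, amt in partial.items():
                totals[year] += amt
        return sorted(totals.items())

    # Example: two "data nodes", each holding a random slice of the sales table.
    node1 = [("MD", 1980, 5.0), ("VA", 1980, 2.0), ("MD", 1981, 7.0)]
    node2 = [("MD", 1980, 6.0), ("MD", 1981, 6.0), ("CA", 2010, 9.0)]
    print(reduce_phase([map_phase(node1), map_phase(node2)]))
    # [(1980, 11.0), (1981, 13.0)]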

Page 16: Big Data presentation at GITPRO 2013

MapReduce (cont.)

MapReduce – How it works

Map Process 1 output:

Year   Sales
1990   $1M
1982   $2M
…      …
1999   $20M

(… map processes 2 through 19 produce similar partial results …)

Map Process 20 output:

Year   Sales
1998   $6M
1982   $5M
…      …
2010   $30M

The Reduce node adds up all the map results and sorts by year to give the final result:

Year   Sales
1980   $100M
1981   $102M
…      …
2010   $250M

Page 17: Big Data presentation at GITPRO 2013

MapReduce (cont.)

MapReduce is general purpose but requires complex skills.

Characteristics

• Batch oriented – maximizes throughput, not responsiveness

Pros

• Simple programming model
• Scales out efficiently
• Failure handling and redundancy built in
• Adapts well to a wide variety of problems

Cons

• Requires custom programming
• Higher-level languages (SQL-like) exist, but programming skills are often critical
• Requires a complex array of skills to manage & maintain a MapReduce system

Page 18: Big Data presentation at GITPRO 2013

Summary of BI Apps

Each option has tradeoffs. Choose based on requirements

Big Data Flavor                How much data can it typically handle?
In-Memory Databases            Order of 1 TB
Massively Parallel Databases   Order of 10 TB
MapReduce                      Order of 100s of TB, into the petabyte range

Page 19: Big Data presentation at GITPRO 2013

Transactional System - Use-Case

How many items in stock do users A and B see on their second access?

[Diagram: Web-based online store backed by a database]

1. User A looks up item X
2. User B looks up item X
3. User C buys item X, which updates the inventory
4. User A looks up item X again
5. User B looks up item X again

Page 20: Big Data presentation at GITPRO 2013

Context – CAP Theorem

You can get any two but not all three features in any system

Characteristic        Meaning
Consistency           All nodes (and users) see the same data at the same time.
Availability          Every request receives a valid response; the site does not go down or appear down under heavy load.
Partition Tolerance   The system continues to function despite the loss of one of its components.

Page 21: Big Data presentation at GITPRO 2013

CA – Single RDBMS

A single RDBMS instance is both consistent and available

[Diagram: Web-based online store backed by a single RDBMS]

1. User A looks up item X
2. User B looks up item X
3. User C buys item X, which updates the inventory
4. User A looks up item X again
5. User B looks up item X again

• When set up in “Read Committed” isolation, every user sees the same inventory count
• The system responds with the last committed inventory count, even during updates

• Consistent
• Available

Page 22: Big Data presentation at GITPRO 2013

CP – Distributed RDBMS

A Distributed RDBMS is consistent and resilient to failure of nodes

[Diagram: Web-based online store backed by an East Region RDBMS and a West Region RDBMS, synchronized via 2-phase commit]

1. User A looks up item X
2. User B looks up item X
3. User C buys item X, which updates the inventory
4. User A looks up item X again
5. User B looks up item X again

• Under “Read Committed” mode, all users see consistent counts
• If one DB fails, the other one serves all users (Partition Tolerance)
• During the two-phase commit, the system is unavailable

• Consistent
• Partition Tolerant

Page 23: Big Data presentation at GITPRO 2013

AP – Distributed RDBMS

Eventual Consistency is the key to Big Data Transactional Systems

[Diagram: Web-based online store backed by two replicated databases]

1. User A looks up item X
2. User B looks up item X
3. User C buys item X, which updates the inventory
4. User A looks up item X again
5. User B looks up item X again

• Amazon Dynamo and Apache Cassandra work on this principle
• If one DB fails, the other one serves all users (Partition Tolerance)
• Users will always be able to browse all products, but occasionally some users will see a stale inventory count (Eventual Consistency; see the sketch below)

• Available
• Partition Tolerant
• Eventually Consistent
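A toy Python sketch of the eventual-consistency behavior above (an illustration under assumed mechanics, not Dynamo or Cassandra code):

    # Two regional replicas with asynchronous replication between them.
    east = {"item_x": 10}
    west = {"item_x": 10}
    replication_log = []            # changes waiting to be shipped to the peer

    def buy(store, item):
        store[item] -= 1            # local write succeeds immediately (Available)
        replication_log.append((item, store[item]))

    def replicate(target):
        while replication_log:      # ships asynchronously, some time later
            item, count = replication_log.pop(0)
            target[item] = count

    buy(east, "item_x")             # User C buys item X via the East replica
    print(west["item_x"])           # 10 – a STALE count, not yet replicated
    replicate(west)                 # replication eventually catches up
    print(west["item_x"])           # 9 – replicas converge (Eventual Consistency)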

Page 24: Big Data presentation at GITPRO 2013

Hybrid Solution

Big Data techniques – not an either/or choice!

[Diagram: hybrid pipeline]

Large Structured DB + Large Unstructured DB (100 TB to 1 PB)
  -> MapReduce-based ETL
      -> MPP DB (5-10 TB) -> In-Memory DB (1 TB) -> Familiar BI Solution
      -> NoSQL DB (few 100 GB) -> Programs & Scripts

• Business users can use familiar SQL-based tools in real time; the In-Memory DB allows that.
• Programmers and system admins with no real-time requirements can use all three techniques. NoSQL DBs allow technical users to gain real-time benefits in ways that suit their complex needs.

Page 25: Big Data presentation at GITPRO 2013

Exploring Millions of US Patent Pages at the Speed of Thought

www.axiomine.com/patents/

Demo – US Patent Explorer

Page 26: Big Data presentation at GITPRO 2013

Patent Explorer Goals

Seamlessly navigate Structured and Unstructured data in real-time

• Navigate 3 million US patents (text and metadata) from 1963 to 1999 at the speed of thought
• Data sources
  • Patent metadata – National Bureau of Economic Research
  • Patent text – bulk download from the Google site
    • Each week, granted patents are published to the Google site as an archive
• Size of uncompressed data
  • Structured metadata – approximately 2 GB
  • Patent text data – approximately 300 GB

Page 27: Big Data presentation at GITPRO 2013

Patent Metadata

Cannot answer – What is the title of Patent No 8086905?

Source – National Bureau of Economic Research http://data.nber.org/patents/

[Diagram: data model – Patent Master linked one-to-many to Pairwise Citations and Inventors, and to other master data: Company Master, Country Master, Classification Master]

Contains only metadata; no text data such as the patent title is available. For example, Pairwise Citations contains millions of patent ID pairs.

Page 28: Big Data presentation at GITPRO 2013

Patent Text

Need to merge both metadata & text

Source – Google http://www.google.com/googlebooks/uspto.html

[Image: sample file]

Page 29: Big Data presentation at GITPRO 2013

High Level Architecture

Need to merge both metadata & text

[Diagram: three tiers]

• Raw Data Tier – Patent Metadata and Patent Text
• ETL & Text Analytics Tier – Hadoop, producing Text-Enhanced Citation Data
• Search & Visualization Tier – Apache Solr for Navigation, Search & Text Analytics (navigate, search & visualize); MongoDB for Patent Details (drill down to patent details)

Page 30: Big Data presentation at GITPRO 2013

Big Data Flavors – Summary

Choose a Big Data tool and product based on requirements

Flavor: MapReduce
• Massive 100 TB to 1 PB scale ETL
• Complex analytics on massive data
• Large-scale unstructured data analysis

Flavor: Massively Parallel Processing (MPP)
• Batch-oriented aggregations
• Analytics on moderately large structured data with predictable access patterns

Flavor: In-Memory DB
• Similar to MPP, but where real-time access patterns are required
• Rich and interactive Business Intelligence apps

Flavor: NoSQL databases
• Similar to In-Memory DB, but with simpler (non-SQL) access patterns
• Provide fast access to detail data where other techniques are used to serve summary data

Flavor: GPGPU
• Real-time Value at Risk (financial risk management)
• Compute-intensive analytics, e.g. simulating a hospital waiting room over 1 year