Big Data - Umesh Bellur


Not Only Big Data But FAST

Prof. Umesh Bellur
Department of Computer Science
The Indian Institute of Technology (IIT) Bombay, India

What’s Big Data?

No single definition; here is one from Wikipedia:

• “…difficult to process using on-hand database management tools or traditional data processing applications.”

• This is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”


The Vs of Big Data


Volume

• 12+ TBs of tweet data every day

• 25+ TBs of log data every day

• ? TBs of data every day

• 2+ billion people on the Web by end 2011

• 30 billion RFID tags today (1.3B in 2005)

• 4.6 billion camera phones worldwide

• 100s of millions of GPS-enabled devices sold annually

• 76 million smart meters in 2009… 200M by 2014

Variety - A Single Perspective of the Digital Universe

Figure labels: Customer; Social Media; Gaming; Entertain; Banking Finance; Our Known History; Purchase

Velocity (Speed)

• Data is being generated fast and needs to be processed fast

• Online data analytics

• Late decisions mean missed opportunities

• Examples
– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction


Motivational Use Cases

Customer: Influence Behavior

• Product recommendations that are relevant & compelling

• Friend invitations to join a game or activity that expands business

• Preventing fraud as it is occurring & preventing more proactively

• Learning why customers switch to competitors and their offers, in time to counter

• Improving the marketing effectiveness of a promotion while it is still in play

“Fast” in Smart Grids

An electricity network that can intelligently integrate the actions of all users connected to it (generators, consumers, and those that do both) in order to efficiently deliver sustainable, economic, and secure electricity supplies.

No longer just an experiment!

Estimated investments of ~60-75 billion euro by 2020

Hinges on

• Real-time decision making to route energy from producers to consumers

• Based on fine-grained energy demand predictions

• Millions of events a second have to be processed “on the fly”
– A billion events per day (10,000 smart plugs, per-second readings)
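As a rough check on the event volume quoted above, the following sketch works out the arithmetic. It assumes each of the 10,000 smart plugs reports exactly once per second around the clock; the variable names are illustrative and not from the slides.

```python
# Back-of-the-envelope check of the smart-plug event volume.
# Assumption: every plug emits exactly one reading per second, all day.
SMART_PLUGS = 10_000
READINGS_PER_PLUG_PER_SECOND = 1
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_second = SMART_PLUGS * READINGS_PER_PLUG_PER_SECOND
events_per_day = events_per_second * SECONDS_PER_DAY

print(f"{events_per_second:,} events/second")
print(f"{events_per_day:,} events/day")
```

Under these assumptions the total comes to roughly 0.86 billion events per day, which is in line with the "billion events per day" figure. Rates of millions of events per second would correspond to a much larger deployment than 10,000 plugs.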

Another Motivational Angle for “Fast”

Performance of disks:

                         1987         2004             Increase
CPU Performance          1 MIPS       2,000,000 MIPS   2,000,000 x
Memory Size              16 Kbytes    32 Gbytes        2,000,000 x
Memory Performance       100 usec     2 nsec           50,000 x
Disc Drive Capacity      20 Mbytes    300 Gbytes       15,000 x
Disc Drive Performance   60 msec      5.3 msec         11 x

Source: Seagate Technology Paper: “Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements”

Memory I/O is much faster than disk I/O!


Processing Fast Data

• Streams of data that must be processed in one pass in real time:
– No random access allowed
– Continuous
– Massive
– Unbounded
– May be dense or sparse
– Events arrive faster than they can be “mined”
– Uncertainty – missing values

Lack of a real-time response may be either life-threatening or result in large revenue losses

Challenges

• Time/space constrained
– Not enough memory
– Can’t afford storing/revisiting the data

• Single-pass computation
– External memory algorithms for handling data sets larger than main memory cannot be used:
  • They do not support continuous queries
  • Their real-time response is too slow

• Noise
– Missing data is a common feature
– Outliers
– Aged (stale) data
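To make the single-pass, bounded-memory constraint concrete, here is a minimal sketch that keeps a running mean and variance using Welford's method. This technique is chosen here for illustration and is not named in the slides. The class and function names are hypothetical, and missing readings are assumed to arrive as `None` or NaN and are simply counted and skipped.

```python
import math
from typing import Iterable, Optional


class RunningStats:
    """Single-pass mean/variance (Welford's method) in O(1) memory."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean
        self.missing = 0

    def update(self, value: Optional[float]) -> None:
        # Missing readings are counted and skipped rather than stored.
        if value is None or math.isnan(value):
            self.missing += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two readings.
        return self._m2 / (self.count - 1) if self.count > 1 else float("nan")


def consume(stream: Iterable[Optional[float]]) -> RunningStats:
    stats = RunningStats()
    for reading in stream:  # each event is seen exactly once
        stats.update(reading)
    return stats


stats = consume(iter([2.0, 4.0, None, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(stats.count, stats.missing, stats.mean, stats.variance)
```

The state is three numbers and two counters regardless of how long the stream runs, so no reading ever needs to be stored or revisited.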

So…..

• No time to stop and smell the roses

• Only one chance to look at the data…

Harnessing Big Data – the Evolution

• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)


DBMS vs. DSMS

DBMS: an SQL query goes to query processing, which works over main memory and disk, and returns a result.

DSMS: a continuous query (CQ) goes to query processing, which works in main memory over incoming data stream(s), and returns a result, possibly as data stream(s).

DBMS

• Persistent relations (relatively static, stored)

• Random access

• “Unbounded” disk store

• Only current state matters

• No real-time services

DSMS

• Transient data streams

• Continuous queries

• Bounded memory

• Real-time requirements
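The contrast between a one-off query and a continuous query can be sketched as follows. This is an illustrative example and not the API of any particular DSMS: the function name `windowed_average`, the `(timestamp, value)` event shape, and the assumption that events arrive in timestamp order are all hypothetical. The query emits an updated result for every incoming event and ages out readings older than the window.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value)


def windowed_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Continuous query: emit the average over the last `window_seconds`."""
    window: Deque[Event] = deque()
    total = 0.0
    for timestamp, value in events:  # assumes timestamps arrive in order
        window.append((timestamp, value))
        total += value
        # Age out events that have fallen outside the window (t - w, t].
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total / len(window)


readings = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
results = list(windowed_average(readings, window_seconds=3))
print(results)
```

Memory use here is bounded by the number of events that fall inside one window, not by the length of the stream.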

Technical Aspects of DSMS

Techniques

• Synopsis: random sampling, histograms, wavelets

• Aging: sliding window

• Stream processing: temporal and spatial operators; distributed complex event processing

• Approximations: deterministic bounds, probabilistic bounds
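As one example of a random-sampling synopsis, the sketch below uses reservoir sampling (Algorithm R) to keep a uniform sample of fixed size `k` from a stream of unknown length. The slides mention random sampling only in general terms, so the choice of this specific algorithm and the function name `reservoir_sample` are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of up to k items in one pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

The reservoir never grows beyond `k` items, so the synopsis fits in bounded memory however long the stream runs.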

Maturity Model

• Monitoring

• Insights

• Process Optimization

• Data Monetization

• Metamorphosis

(Role of) Standards in Big Data Adoption

• OGC Standards
– SOS – Sensor Observation Service

• IEEE Big Data Initiative (BDI)
– Metadata standards for Big Data management
– Verticals – healthcare, energy, etc.

• ISO/IEC CD 20546 – Big Data Vocabulary

• NIST Public Working Group on Big Data

• ITU-T Technology Watch report on Big Data

• …

Summary

• Fast data processing is fundamentally different from Big Data processing
– DSMS vs. Hadoop/data warehousing, etc.

• More and more applications have real-time needs.

• While there are some solutions, there is wide open space for research and technological innovation.
– The role of standards cannot be emphasized enough

Questions?

umesh@cse.iitb.ac.in

NIST Reference Architecture for Big Data