© 2012 IBM Corporation IBM Security Systems 1 © 2013 IBM Corporation 1 Big Data Analytics Lecture Series Kalapriya Kannan IBM Research Labs July, 2013

Page 1: Big Data Analytics Lecture Series - Mu

© 2012 IBM Corporation

IBM Security Systems

© 2013 IBM Corporation

Big Data Analytics Lecture Series

Kalapriya Kannan, IBM Research Labs, July 2013

Page 2: Big Data Analytics Lecture Series - Mu

Small changes/additions done by Dr. Enis Karaarslan, 2014

Page 3: Big Data Analytics Lecture Series - Mu

What is the aim of the course?

Focus is on “Systems” and applications for cloud-based storage and processing of BIG DATA.

+ Big Data - Definition
+ Big Data - Analytics
+ Big Data - Storage (HDFS)
+ Big Data - Computing (Map/Reduce)
+ Big Data - Database (HBase)
+ Big Data - Graph DB (Titan)
+ Big Data - Streaming (Storm)

Page 4: Big Data Analytics Lecture Series - Mu

Pre-Requisite

“Nothing” – All of you are equally qualified.

A virtual machine, run through either VMware Player or VirtualBox.

Acknowledgements:

– IBM material, examples, machines, etc.

– IBM external talks, publicly available material, and the authors of the same.

– Various Internet material – thanks to the “Internet”

– Apache documentation and examples

Page 5: Big Data Analytics Lecture Series - Mu

Mantra

“Learning is not just restricted to listening; it is actively asking relevant questions”

Page 6: Big Data Analytics Lecture Series - Mu

After 6 hrs of lecture

Get convinced about “Big Data”.

Understand why we need a different paradigm.

Ascertain with confidence the need to look at data computing in a different way.

Realize the potential of big data
– All of you are skilled enough to get into it.

What we will not do
– Do research on why things have evolved into the current trends as they stand.
– Try to be hands-on – but not guaranteed

Page 7: Big Data Analytics Lecture Series - Mu

Today’s 1 hr.

Page 8: Big Data Analytics Lecture Series - Mu

Introduction to Big Data

Kalapriya Kannan, IBM Research Labs, July 2013

Page 9: Big Data Analytics Lecture Series - Mu

What are we going to understand?

What is Big Data?

Why did we land up there?

To whom does it matter?

Where is the money?

Are we ready to handle it?

What are the concerns?

Tools and Technologies
– Is Big Data <=> Hadoop?

Page 10: Big Data Analytics Lecture Series - Mu

Simple to start

What is the maximum file size you have dealt with so far?
– Movies/files/streaming video that you have used?
– What have you observed?

What is the maximum download speed you get?

Simple computation
– How much time does it take just to transfer the data?
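The simple computation can be sketched as below. The 1 TB size and the 10 Mbit/s link are illustrative assumptions, not figures from the lecture, and the function name is made up for this example. Note that link speeds are usually quoted in bits per second while file sizes are quoted in bytes, hence the factor of 8.

```python
def transfer_time_seconds(size_bytes: float, speed_bits_per_sec: float) -> float:
    """Time to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / speed_bits_per_sec


# Hypothetical inputs: a 1 TB data set over a 10 Mbit/s connection.
ONE_TB = 10**12        # bytes (decimal terabyte)
TEN_MBIT = 10 * 10**6  # bits per second

seconds = transfer_time_seconds(ONE_TB, TEN_MBIT)
days = seconds / 86_400  # 86,400 seconds in a day
print(f"{seconds:,.0f} s, about {days:.1f} days")
```

Under these assumptions the transfer alone takes more than nine days, before any processing has started.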

Page 11: Big Data Analytics Lecture Series - Mu

640 K ought to be enough for everybody

Page 12: Big Data Analytics Lecture Series - Mu

● Google processes 20 PB a day (2008); 1 PB = 10^15 bytes

● Wayback Machine has 3 PB + 100 TB/month (3/2009)

● Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

● eBay has 6.5 PB of user data + 50 TB/day (5/2009)

● CERN’s Large Hadron Collider (LHC) generates 15 PB a year

Page 13: Big Data Analytics Lecture Series - Mu

The Earthscope

The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

Page 14: Big Data Analytics Lecture Series - Mu

What is big data?

“Every day, we create 2.5 quintillion (10^18) bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few.

This data is “big data.”

Page 15: Big Data Analytics Lecture Series - Mu

Huge amount of data

There are huge volumes of data in the world:

+ From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data.

+ In 2011, the same amount was created every two days.

+ In 2013, the same amount of data is created every 10 minutes.
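To make the growth concrete, the figures above can be turned into daily rates. This is a back-of-the-envelope sketch that uses decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the variable names are chosen for this example only.

```python
EXABYTE = 10**18             # bytes
AMOUNT = 5 * 10**9 * 10**9   # 5 billion gigabytes, expressed in bytes

MINUTES_PER_DAY = 24 * 60

# 2011: the same amount every two days.
rate_2011 = AMOUNT / 2
# 2013: the same amount every 10 minutes.
rate_2013 = AMOUNT * MINUTES_PER_DAY / 10

print(f"amount:    {AMOUNT / EXABYTE:.1f} EB")
print(f"2011 rate: {rate_2011 / EXABYTE:.1f} EB/day")
print(f"2013 rate: {rate_2013 / EXABYTE:.1f} EB/day")
print(f"growth:    {rate_2013 / rate_2011:.0f}x")
```

The 2011 rate works out to 2.5 EB per day, and the 2013 rate to 720 EB per day, a 288-fold increase in two years if the quoted figures are taken at face value.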

Page 16: Big Data Analytics Lecture Series - Mu

Big data spans three dimensions: Volume, Velocity and Variety

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.

– Turn 12 terabytes of Tweets created each day into improved product sentiment analysis

– Convert 350 billion annual meter readings to better predict power consumption

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.

– Scrutinize 5 million trade events created each day to identify potential fraud

– Analyze 500 million daily call detail records in real-time to predict customer churn faster

– The latest I have heard is that a delay of 10 nanoseconds is too much.

Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

– Monitor hundreds of live video feeds from surveillance cameras to target points of interest

– Exploit the 80% data growth in images, video and documents to improve customer satisfaction
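The daily volumes quoted on this slide are easier to relate to system design once they are expressed as sustained per-second rates. The sketch below does only that conversion; it assumes the load is spread evenly over the day, which real traffic is not, and it uses decimal terabytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(per_day: float) -> float:
    """Average per-second rate for a quantity given per day."""
    return per_day / SECONDS_PER_DAY


# Daily figures taken from the slide; tweet volume is in bytes (12 TB).
daily_volumes = {
    "trade events": 5_000_000,
    "call detail records": 500_000_000,
    "tweet bytes": 12 * 10**12,
}

for name, per_day in daily_volumes.items():
    print(f"{name}: {per_second(per_day):,.1f} per second")
```

Even as averages, these are thousands of call records and over a hundred megabytes of tweets every second, which is why the slide stresses processing data as it streams in.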

Page 17: Big Data Analytics Lecture Series - Mu

Finally….

‘Big Data’ is similar to ‘Small Data’, but bigger

…but having bigger data requires different approaches:

Techniques, tools, architecture… with an aim to solve new problems

Or old problems in a better way

Page 18: Big Data Analytics Lecture Series - Mu

To whom does it matter?

Research Community

Business Community - New tools, new capabilities, new infrastructure, new business models, etc.

On sectors

Financial Services..

Page 19: Big Data Analytics Lecture Series - Mu

What do the revenues look like?

Page 20: Big Data Analytics Lecture Series - Mu

The Social Layer in an Instrumented Interconnected World

2+ billion people on the Web by end 2011

30 billion RFID tags today (1.3B in 2005)

4.6 billion camera phones worldwide

100s of millions of GPS-enabled devices sold annually

76 million smart meters in 2009… 200M by 2014

12+ TBs of tweet data every day

25+ TBs of log data every day

? TBs of data every day

Page 21: Big Data Analytics Lecture Series - Mu

What does Big Data trigger?

From “Big Data and the Web: Algorithms for Data Intensive Scalable Computing”, Ph.D Thesis, Gianmarco

Page 22: Big Data Analytics Lecture Series - Mu

BIG DATA is not just HADOOP

Manage & store huge volume of any data: Hadoop File System, MapReduce

Manage streaming data: Stream Computing

Analyze unstructured data: Text Analytics Engine

Structure and control data: Data Warehousing

Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

Understand and navigate federated big data sources: Federated Discovery and Navigation

Page 23: Big Data Analytics Lecture Series - Mu

Types of tools typically used in Big Data Scenario

Where is the processing hosted?
– Distributed server/cloud

Where is the data stored?
– Distributed storage (e.g., Amazon S3)

What is the programming model?
– Distributed processing (MapReduce)

How is the data stored and indexed?
– High-performance schema-free database

What operations are performed on the data?
– Analytic/semantic processing (e.g., RDF/OWL)
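The MapReduce programming model named above can be illustrated with a word count. This is a single-process sketch of the three stages (map, shuffle, reduce) in plain Python, not Hadoop's API; the function names and the two input records are hypothetical. On a real cluster the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, as happens between the two phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records; on a cluster each could live on a different node.
docs = ["big data is big", "data is everywhere"]
counts = word_count(docs)
print(counts)
```

The point of the model is that `map_phase` and `reduce_phase` see only one record or one key at a time, so the same two functions work unchanged whether the input is two strings or petabytes spread over many machines.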

Page 24: Big Data Analytics Lecture Series - Mu

When dealing with Big Data is hard

When the operations on data are complex:
– E.g., simple counting is not a complex problem.
– Modeling and reasoning with data of different kinds can get extremely complex.

Good news with big data:
– Often, because of the vast amount of data, modeling techniques can get simpler (e.g., smart counting can replace complex model-based analytics)…
– …as long as we deal with the scale.
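One way to picture "smart counting" replacing a model is next-word prediction from raw pair counts. The sketch below is an illustration chosen for this note, not an example from the lecture: the function names and the three-sentence corpus are hypothetical, and a useful predictor would need vastly more text, which is exactly where the scale problem comes in.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word is followed by each other word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counter = follows.get(word.lower())
    if not counter:
        return None
    return counter.most_common(1)[0][0]


# Toy corpus (hypothetical); real systems count over billions of sentences.
corpus = [
    "big data needs new tools",
    "big data needs distributed storage",
    "big clusters need power",
]
model = train_bigram_counts(corpus)
print(predict_next(model, "big"))
print(predict_next(model, "data"))
```

There is no grammar or linguistic model here, only counting; the quality of the predictions comes entirely from how much data has been counted.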

Page 25: Big Data Analytics Lecture Series - Mu

Time for thinking

What do you do with the data?
– Let’s take an example:

• “From application developers to video streamers, organizations of all sizes face the challenge of capturing, searching, analyzing, and leveraging as much as terabytes of data per second—too much for the constraints of traditional system capabilities and database management tools.”

Page 26: Big Data Analytics Lecture Series - Mu

Why Big-Data?

Key enablers for the appearance and growth of ‘Big-Data’ are:

+ Increase in storage capabilities
+ Increase in processing power
+ Availability of data

Page 27: Big Data Analytics Lecture Series - Mu

THINK