Mingon Kang, Ph.D.

Department of Computer Science, University of Nevada, Las Vegas

CS 789 ADVANCED BIG DATA ANALYTICS

BIG DATA

* The contents are adapted from Dr. Jeongkyu Lee@UB

Source: mkang.faculty.unlv.edu/teaching/CS789/04.Big Data.pdf



Era of Big Data

[Figure: timeline, 1970 to 2030]

Computing eras: Main Frame Computer, PC, Internet, Mobile Computer, IT everywhere

Technologies: WWW, PC, Broadband, SNS, Mobile, Virtual Reality, AI

2011: Amount of Digital Information = 1.8 ZB

2020: maybe 50 times more??

Data Size / Data Type / Data Characteristic

• EB (exabyte), '90s = 100 EB / Structured data (RDBMS, office info) / Organized data

• Beginning of ZB, 2011 = 1.8 ZB / Unstructured data (MM, SNS, email) / Complex, SNS data

• ZB era, 2020 = x50 data / Object, spatial (IoT, RFID, sensor) / Real-time data


What is BIG DATA?

Wiki said in 2012 …

data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time

Wiki says NOW …

A broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, ….. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set.


What is BIG DATA?

Gartner says …

Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization

Oxford English Dictionary says …

big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.


3V: Volume (Scale)

Data Volume

44x increase from 2009 to 2020

From 0.8 ZB (zettabytes) to 35 ZB

Data volume is increasing exponentially

Exponential increase in collected/generated data
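
As a quick check on these figures, the sketch below recomputes the overall growth factor and the annual rate it would imply. It assumes smooth compound growth between 2009 and 2020, which the slide does not state; the variable names are illustrative only.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall multiple between the two endpoints.
growth_factor = end_zb / start_zb

# Compound annual growth rate, ASSUMING steady year-on-year growth.
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.1f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption the data volume grows by roughly 41% per year, which is how a 44x increase accumulates over eleven years.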


3V: Variety (Complexity)

Various formats, types, and structures

Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…

Static data vs. streaming data

A single application can be generating/collecting many types of data

To extract knowledge ➔ all these types of data need to be linked together


3V: Velocity (Speed)

Data is being generated fast and needs to be processed fast

Online Data Analytics

Late decisions ➔ missing opportunities

Examples

E-Promotions: Based on your current location, your purchase history, what you like ➔ send promotions right now for the store next to you

Healthcare monitoring: sensors monitoring your activities and body ➔ any abnormal measurements require immediate reaction
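
To make the healthcare-monitoring example concrete, here is a minimal sketch of flagging abnormal readings as they arrive rather than in a later batch. The rolling-window rule, the function name `detect_abnormal`, the window size, the threshold, and the sample values are all assumptions for illustration, not part of the original slides.

```python
from collections import deque
from statistics import mean, stdev


def detect_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline window
        recent.append(value)


# Made-up heart-rate samples (beats per minute), one per sensor tick.
heart_rate = [72, 74, 71, 73, 72, 75, 73, 140, 74, 72]
alerts = list(detect_abnormal(heart_rate))
print(alerts)
```

Because the function is a generator over a stream, each reading is judged as soon as it is seen, which is the point of the velocity requirement: the reaction does not wait for the full data set.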


4V: Veracity


Who’s Generating Big Data

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion


How to use big data


What’s driving Big Data

- Ad-hoc querying and reporting

- Basic data mining techniques

- Structured data, typical sources

- Small to mid-size datasets

- Optimizations and predictive analytics

- Complex statistical analysis and huge data mining

- All types of data, and many sources

- Very large datasets

- More of a real-time


How to use Big Data

• Big Data, like Business Intelligence, can be used to improve stuff.

• It can also be used to solve problems (i.e. answer “big questions”).


How to use Big Data

• Suppose you have a Combine Harvester.


• Sensors are becoming increasingly cheap, so it would be quite easy to cover the harvester in sensors (temperature, GPS, pressure, capacity, etc…).

• This will generate big data, especially if all the harvesters in Europe are equipped with the same sensors.


How to use Big Data

But what would you use this data for?

• Finding the most economical driving style by monitoring driving habits, tracking the position of the harvester and fuel levels in the tank.

• Monitoring vibrations and temperature patterns in the parts to predict when parts might break. This could then tie into a system that automatically orders parts.

• Tracking the harvester’s position and yield, to identify the most fertile areas and those which require fertilisers.
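
The last use case can be sketched as a simple aggregation: bin each position reading into a grid cell and average the yield per cell. The tuple layout, the 100 m cell size, the function name `mean_yield_by_cell`, and the sample readings are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict


def mean_yield_by_cell(samples, cell_size=100.0):
    """Average yield per square grid cell; samples are (x_m, y_m, yield) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of yields, count]
    for x, y, crop_yield in samples:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell][0] += crop_yield
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


# Made-up readings: position in metres within a field, yield in tonnes per hectare.
samples = [(10, 20, 8.0), (60, 80, 9.0), (150, 30, 4.0), (180, 90, 5.0)]
by_cell = mean_yield_by_cell(samples)

# The cell with the lowest average yield is a candidate for fertiliser.
least_fertile = min(by_cell, key=by_cell.get)
print(by_cell, least_fertile)
```

With readings from many harvesters the same aggregation would normally run on a distributed platform, but the logic per reading stays this small.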


How to use Big Data

• How did Google use Big Data?

• They stored search history for every user as well as what every user clicked.

• This data seemed needless and pointless (data exhaust).

• What do you think that Google did with this data?


How to use Big Data

• Google used that data to power a spell-checker: if I search for “bansnas” and click on something relating to “bananas”, the chances are that I meant to search for “bananas” in the first place.

• About 2 billion searches a day are made on Google.
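
The idea of learning corrections from search and click history can be illustrated with a toy example. This is not Google's actual method; the log format, the function name `build_corrections`, and the data are invented for the sketch.

```python
from collections import Counter, defaultdict


def build_corrections(click_log):
    """Map each typed query to the term users most often clicked through to."""
    clicks = defaultdict(Counter)
    for typed, clicked in click_log:
        clicks[typed][clicked] += 1
    return {typed: counts.most_common(1)[0][0] for typed, counts in clicks.items()}


# Made-up log of (typed query, topic of the clicked result) pairs.
click_log = [
    ("bansnas", "bananas"),
    ("bansnas", "bananas"),
    ("bansnas", "bandanas"),
    ("bananas", "bananas"),
]
corrections = build_corrections(click_log)
print(corrections["bansnas"])
```

The more searches are logged, the more reliable the majority vote becomes, which is why data that looks like exhaust turns out to be valuable at scale.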


How to use Big Data

• The LAPD uses PredPol to predict crimes.

• https://www.predpol.com/


How to use Big Data

• The LAPD mined 13 million crime reports with a specialised algorithm.

1. Type of Crime

2. Place of Crime

3. Time of Crime

• Those 13 million reports cover 80 years of crime data.

• They then built mission maps covering dangerous areas and patrolled them to minimise crime.
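
As a rough illustration of how type, place, and time can be turned into a patrol map, the sketch below simply counts past reports per place and hour and ranks the busiest combinations. It is a deliberately naive stand-in, not PredPol's algorithm; the function name `top_hotspots` and the sample reports are made up.

```python
from collections import Counter


def top_hotspots(reports, n=2):
    """Rank (place, hour-of-day) pairs by number of past reports."""
    counts = Counter((place, hour) for _crime_type, place, hour in reports)
    return counts.most_common(n)


# Made-up reports: (type of crime, place, hour of day).
reports = [
    ("burglary", "block A", 22),
    ("burglary", "block A", 22),
    ("theft", "block A", 22),
    ("theft", "block B", 14),
    ("burglary", "block B", 14),
    ("theft", "block C", 9),
]
hotspots = top_hotspots(reports)
print(hotspots)
```

Filtering `reports` by crime type before counting would give a separate ranking per type.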


How to use Big Data

It worked. It reduced:

• Property crime by 12% and burglary by 26%


How to manage Big data


Challenges in Handling Big Data

The Bottleneck is in technology

New architecture, algorithms, techniques are needed

Also in technical skills

Experts in using the new technology and dealing with big data


Storing Big Data

• Here are a few tools that can be used to store Big Data.


Traditional Large-Scale Computation


Distributed System: Problems


Distributed Systems: Data Storage


Data-Driven World


Data Become the Bottleneck


Requirements for a new approach


Partial Failure Support


Data Recoverability


Component Recovery


Consistency


Scalability


Newbie for Big Data

- Hadoop Eco-System


Hadoop History


Core Hadoop Concepts


Very High-level Overview


Fault Tolerance


Hadoop in IBM


Hadoop in Oracle


Hadoop in Teradata


Hadoop in Microsoft


Hadoop in EMC
