Description: An exploration of what Big Data is and when it adds incremental information and when it does not.
Big Data
Guy Lion
April 5, 2013
Table of Contents
1) Big Data trends
2) How Big is your Data?
3) Big Data potential
4) Big Technologies. New Databases
5) Big Quantitative Methods. New Stats
6) Big Data temperaments
7) Is Big always better?
1) Big Data Trends
Cost of data storage has dropped.
Social networks are creating huge amounts of live unstructured data.
Social media (Facebook & Twitter) has grown exponentially:
• Twitter started in March 2006 and has 500 million users.
• Facebook started in February 2004 and has 1 billion active users.
[Chart: Facebook vs Twitter, # active users in 000, Jan 2008 – Jan 2013, showing exponential growth]
Unstructured Data is taking over…
2) How Big is your Data?
• How Tall is it? How large is your sample (rows)?
• How Wide is it? How many variables (columns)?
• What is its Velocity? How frequently is it updated?
• Does it include unstructured data (documents, emails, Social Media)?
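These size questions can be answered mechanically. A minimal Python sketch, using made-up CSV data (all names and values are hypothetical), measuring how Tall and Wide a data set is:

```python
import csv
import io

# Sizing a data set along the slide's dimensions (illustrative sketch):
# "Tall" = number of rows in the sample, "Wide" = number of variables (columns).
raw = """customer_id,age,city,last_purchase
1,34,Boston,2013-03-01
2,41,Denver,2013-03-02
3,29,Austin,2013-03-04
"""

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]
print(f"Wide: {len(header)} columns, Tall: {len(records)} rows")
# Velocity is about update frequency: how often do new rows like these arrive?
```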
3) Big Data Potential
4) Big Technologies. New Databases
Database: Structured vs Unstructured

                     Structured                       Unstructured
Data Type            Customers, transactions,         Social Media, text documents,
                     numbers in rows                  Web services
Database language    SQL (Structured Query Language)  NoSQL ("not only SQL")
Database type        Relational database              Non-relational database
Database structure   Data Warehouse, Data Marts       Hadoop, Hadoop Connectors
Reporting            Business Intelligence:           Reporting tool
                     Oracle Essbase & IBM Cognos
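To make the structured column concrete, here is a minimal sketch of structured data in a relational SQL database, using Python's built-in sqlite3 (the table and values are made up):

```python
import sqlite3

# Structured data in a relational (SQL) database -- a minimal sketch.
# Fixed columns and typed rows suit customers and transactions; free-form
# text or social-media posts fit poorly, which is the niche NoSQL targets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("alice", 40.0), ("bob", 25.5), ("alice", 10.0)])

# SQL excels at structured queries, e.g. total spend per customer.
totals = conn.execute("SELECT customer, SUM(amount) FROM transactions "
                      "GROUP BY customer ORDER BY customer").fetchall()
print(totals)  # [('alice', 50.0), ('bob', 25.5)]
```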
5) Big Quantitative Methods. New Stats
New Stats Map
Predictive Analytics encompasses two overlapping fields:
• Statistics & Regression: A/B Testing (hypothesis testing), Regression, Time Series Analysis, Spatial Analysis, Signal Processing
• Data Mining & Machine Learning (formerly Artificial Intelligence): Association Rule Learning, Cluster Analysis, Classification, Neural Networks, Natural Language Processing, Sentiment Analysis, Optimization, Genetic Algorithms, Pattern Recognition
Definitions. Part I
Association Rule Learning: method to uncover interesting relationships
by generating and testing possible rules. One application is “market
basket analysis”, where a retailer figures out what products are
frequently bought together. A cited example is that shoppers who buy
diapers often buy beer.
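The "market basket" idea can be sketched in a few lines: with toy transactions (made-up data), the support and confidence of a candidate rule such as diapers → beer are just frequency ratios:

```python
# Toy transactions (hypothetical data): each set is one shopper's basket.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule "diapers -> beer": how often do diaper buyers also buy beer?
print(support({"diapers"}, baskets))                 # 0.8
print(confidence({"diapers"}, {"beer"}, baskets))    # 0.75
```

A rule-learning system generates many such candidate rules and keeps the ones whose support and confidence clear a threshold.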
Classification: identifies the categories to which new data belongs,
based on an existing data set grouped into predefined categories. It
differs from Cluster Analysis, which starts without predefined categories.
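A sketch of classification against predefined categories (the data and labels are made up; real work would use richer models than one nearest neighbor):

```python
# Tiny 1-nearest-neighbor classifier (illustrative sketch): a new point
# gets the label of the closest point among the predefined categories.
# Clustering, by contrast, would start with no labels at all.
def nearest_neighbor(point, labeled):
    def dist2(p, q):  # squared Euclidean distance
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(labeled, key=lambda item: dist2(point, item[0]))[1]

training = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
            ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]

print(nearest_neighbor((1.1, 0.9), training))  # small
print(nearest_neighbor((8.5, 9.1), training))  # large
```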
Genetic algorithms: an optimization method inspired by the "survival of
the fittest" process. Potential solutions are encoded as "chromosomes"
that can combine and mutate. The chromosomes are selected for
survival within a modeled "environment." Example: optimizing the
performance of an investment portfolio.
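A toy version of the chromosome/crossover/mutation/survival loop (an illustrative sketch, not taken from the deck):

```python
import random

random.seed(0)

# Toy genetic algorithm: evolve 8-bit "chromosomes" to maximize the count
# of 1-bits, via survival of the fittest plus combination and mutation.
def fitness(chrom):
    return sum(chrom)

def crossover(a, b):  # combine two parent chromosomes at a random cut
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.1):  # flip each bit with small probability
    return [bit ^ (random.random() < rate) for bit in chrom]

pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
for _ in range(40):  # generations
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]  # the fittest half survives
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(10)]
    pop = survivors + children

print(fitness(max(pop, key=fitness)))  # best fitness found (max possible is 8)
```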
Definitions. Part II
Natural language processing (NLP): uses algorithms to analyze text data.
Sentiment Analysis is a common application: it measures customers'
reaction to a product campaign by analyzing social media.
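A deliberately crude lexicon-based sketch of sentiment scoring (the word lists are made up; production systems use trained models and far larger lexicons):

```python
# Crude sentiment scorer: count positive vs negative words (sketch only).
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love the new phone, great camera"))  # positive
print(sentiment("terrible battery, I hate it"))         # negative
```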
Neural networks: models inspired by the workings of neurons and synapses
within the brain, used for finding nonlinear patterns. They can be applied to
Pattern Recognition and Optimization. Examples of neural network
applications include identifying customers who may leave and flagging
fraudulent insurance claims.
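The simplest such unit, a single perceptron, can be trained in a few lines (an illustrative sketch; real networks stack many units and use gradient-based training):

```python
# A single artificial neuron (perceptron) learning the OR function --
# the building block that neural networks stack in layers.
def step(x):
    return 1 if x > 0 else 0

weights, bias, lr = [0.0, 0.0], 0.0, 0.1
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for _ in range(20):  # training epochs
    for (x1, x2), target in data:
        out = step(weights[0] * x1 + weights[1] * x2 + bias)
        err = target - out  # the error drives the synapse-weight updates
        weights[0] += lr * err * x1
        weights[1] += lr * err * x2
        bias += lr * err

print([step(weights[0] * a + weights[1] * b + bias) for (a, b), _ in data])
# [0, 1, 1, 1]
```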
Signal processing: an electrical engineering discipline for analyzing signals
(radio, etc.) and discerning signal from noise. It is used to extract
the signal from a set of noisy, less precise data [Signal Detection
Theory].
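One elementary way to pull signal out of noise is a moving-average filter (an illustrative sketch, not a method named in the deck):

```python
# Moving-average filter: smoothing a noisy series to recover the underlying
# trend is one of the simplest signal-vs-noise tools.
def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# A steady upward trend with alternating +/-1 noise on top.
noisy = [t + (1 if t % 2 else -1) for t in range(10)]
smoothed = moving_average(noisy)
print(smoothed)  # rises steadily even though `noisy` zig-zags
```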
Definitions. Part III
Spatial Analysis: analyzes geographic location encoded within
the data; the location information typically comes from GPS. Applications
include spatial regression to estimate a consumer's willingness to purchase
a product given his or her location.
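A basic spatial-analysis primitive is turning GPS coordinates into distances. A sketch using the haversine great-circle formula (the coordinates are just examples):

```python
import math

# Great-circle (haversine) distance: converts two GPS fixes into kilometers,
# the first step in most location-based analyses.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. distance from a customer's GPS fix to a store across the country
print(haversine_km(40.7128, -74.0060, 34.0522, -118.2437))  # NYC -> LA, roughly 3,900-4,000 km
```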
6) Big Data Temperaments
Source: Harvard Business Review, April 2012 by Shvetank Shah, Andrew Horne
and Jaime Capella.
7) Is Big always better?
No! says Nate Silver
• He refers to John P. Ioannidis's 2005 paper, "Why Most Published Research Findings Are False": two-thirds of scientific papers' results can't be replicated!
• "… numbers have no way of speaking for themselves. We speak for them."
• "I came to realize that prediction in the era of Big Data was not going very well."
• "If the quantity of information is increasing [exponentially]… most of it is just noise."
Nate's targets
• Political pundits. Their "intuitive" election predictions have been disastrous. Granted, it was not because of Big Data but of No Data. He showed them how to do it using Small Data (polls with samples < 1,000);
• Economic forecasters. They have used Big Data with poor results. The majority of them can't forecast a recession already underway. ECRI predicted with certainty a double-dip recession in 2011 using tens of variables it did not understand. Instead, the economy improved;
• Stock market & financial market forecasters. Similar performance to economic forecasters;
• Earthquake forecasting. The field is not well understood.
"… Statistical inferences are much stronger when backed up by theory… about their root causes."
No! says Vincent Granville
• Big Data is huge, but its information is very sparse;
• Storing and processing the entire data set is very inefficient;
• You can do better by smartly sampling only 5% of the data.
You don't need Big Data, you need Smart Data.
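Granville's point can be illustrated with a quick simulation (made-up Gaussian data): a 5% random sample recovers the population mean almost as well as a full scan:

```python
import random

random.seed(42)

# "Smart Data" sketch: a small random sample often estimates a population
# statistic about as well as processing all of it (illustrative only).
population = [random.gauss(100, 15) for _ in range(200_000)]
sample = random.sample(population, len(population) // 20)  # 5% sample

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(round(full_mean, 1), round(sample_mean, 1))  # the two means are close
```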
Yes! says Chris Anderson
• He quotes Peter Norvig, Google's research director: "All models are wrong, and increasingly you can succeed without them."
• "… with massive data, [the scientific method] is becoming obsolete."
• "We can throw the numbers into the biggest computing clusters … and let statistical algorithms find patterns where science cannot." He mentions examples such as J. Craig Venter's gene sequencing, Google Search, and Google Translate, among other successes.
"Correlation supersedes causation, and science can advance without coherent models, unified theories, or … any … explanation at all."
"With enough data, the numbers speak for themselves."
Big Data Effectiveness Map

             Theory not well understood       Theory well understood
Tall data    More data, more noise            More data, more signal
             (oversampling)                   (oversampling)
Wide data    More variables, more false       More variables, more explanation
             positives (multicollinearity,    (multicollinearity, model
             model overfitting)               overfitting)
Examples     Economics, financial markets,    Weather forecasting, customer
             earthquake forecasting           behavior, games & sports (chess,
                                              baseball, etc.), politics

A field needing causal understanding stays rule based: more data, better model performance.
A field not needing causal understanding (Google Search, Google Translate, Google Flu Trends, customer behavior): more data, better model performance. This is where Big Data is Recommended.