Description: An exploration of what Big Data is and when it adds incremental information and when it does not.
Big Data
Guy Lion
April 5, 2013
Table of Contents
1) Big Data trends
2) How Big is your Data?
3) Big Data potential
4) Big Technologies. New Databases
5) Big Quantitative Methods. New Stats
6) Big Data temperaments
7) Is Big always better?
1) Big Data Trends
Cost of data storage has dropped.
Social networks are creating huge amounts of live unstructured data.
Social media (Facebook & Twitter) has grown exponentially:
• Twitter started in March 2006 and has 500 million users.
• Facebook started in February 2004 and has 1 billion active users.
[Chart: Facebook vs Twitter, # active users in 000, Jan 2008 – Jan 2013, showing exponential growth]
Unstructured Data is taking over…
2) How Big is your Data?
• How Tall is it? How large is your sample (rows)?
• How Wide is it? How many variables (columns)?
• What is its Velocity? How frequently is it updated?
• Does it include unstructured data (documents, emails, Social Media)?
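These size questions can be answered mechanically. A minimal Python sketch, using made-up CSV data (all names and values are hypothetical), measuring how Tall and Wide a data set is:

```python
import csv
import io

# Sizing a data set along the slide's dimensions (illustrative sketch):
# "Tall" = number of rows in the sample, "Wide" = number of variables (columns).
raw = """customer_id,age,city,last_purchase
1,34,Boston,2013-03-01
2,41,Denver,2013-03-02
3,29,Austin,2013-03-04
"""

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]
print(f"Wide: {len(header)} columns, Tall: {len(records)} rows")
# Velocity is about update frequency: how often do new rows like these arrive?
```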
3) Big Data Potential
4) Big Technologies. New Databases
Database: Structured vs Unstructured

                     Structured                       Unstructured
Data Type            Customers, transactions,         Social Media, text documents,
                     numbers in rows                  Web services
Database language    SQL (Structured Query Language)  NoSQL ("not only SQL")
Database type        Relational database              Non-relational database
Database structure   Data Warehouse, Data Marts       Hadoop, Hadoop Connectors
Reporting            Business Intelligence:           Reporting tool
                     Oracle Essbase & IBM Cognos
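To make the structured column concrete, here is a minimal sketch of structured data in a relational SQL database, using Python's built-in sqlite3 (the table and values are made up):

```python
import sqlite3

# Structured data in a relational (SQL) database -- a minimal sketch.
# Fixed columns and typed rows suit customers and transactions; free-form
# text or social-media posts fit poorly, which is the niche NoSQL targets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("alice", 40.0), ("bob", 25.5), ("alice", 10.0)])

# SQL excels at structured queries, e.g. total spend per customer.
totals = conn.execute("SELECT customer, SUM(amount) FROM transactions "
                      "GROUP BY customer ORDER BY customer").fetchall()
print(totals)  # [('alice', 50.0), ('bob', 25.5)]
```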
5) Big Quantitative Methods. New Stats
New Stats Map
Predictive Analytics encompasses two overlapping fields:
• Statistics & Regression: A/B Testing (hypothesis testing), Regression, Time Series Analysis, Spatial Analysis, Signal Processing
• Data Mining & Machine Learning (formerly Artificial Intelligence): Association Rule Learning, Cluster Analysis, Classification, Neural Networks, Natural Language Processing, Sentiment Analysis, Optimization, Genetic Algorithms, Pattern Recognition
Definitions. Part I
Association Rule Learning: method to uncover interesting relationships
by generating and testing possible rules. One application is “market
basket analysis”, where a retailer figures out what products are
frequently bought together. A cited example is that shoppers who buy
diapers often buy beer.
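The "market basket" idea can be sketched in a few lines: with toy transactions (made-up data), the support and confidence of a candidate rule such as diapers → beer are just frequency ratios:

```python
# Toy transactions (hypothetical data): each set is one shopper's basket.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule "diapers -> beer": how often do diaper buyers also buy beer?
print(support({"diapers"}, baskets))                 # 0.8
print(confidence({"diapers"}, {"beer"}, baskets))    # 0.75
```

A rule-learning system generates many such candidate rules and keeps the ones whose support and confidence clear a threshold.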
Classification: identifies the categories to which new data belongs,
based on an existing data set grouped into predefined categories. It
differs from Cluster Analysis, which starts without predefined categories.
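A sketch of classification against predefined categories (the data and labels are made up; real work would use richer models than one nearest neighbor):

```python
# Tiny 1-nearest-neighbor classifier (illustrative sketch): a new point
# gets the label of the closest point among the predefined categories.
# Clustering, by contrast, would start with no labels at all.
def nearest_neighbor(point, labeled):
    def dist2(p, q):  # squared Euclidean distance
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(labeled, key=lambda item: dist2(point, item[0]))[1]

training = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
            ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]

print(nearest_neighbor((1.1, 0.9), training))  # small
print(nearest_neighbor((8.5, 9.1), training))  # large
```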
Genetic algorithms: an optimization method inspired by the "survival of
the fittest" process. Potential solutions are encoded as "chromosomes"
that can combine and mutate. The chromosomes are selected for
survival within a modeled "environment." Example: optimizing the
performance of an investment portfolio.
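A toy version of the chromosome/crossover/mutation/survival loop (an illustrative sketch, not taken from the deck):

```python
import random

random.seed(0)

# Toy genetic algorithm: evolve 8-bit "chromosomes" to maximize the count
# of 1-bits, via survival of the fittest plus combination and mutation.
def fitness(chrom):
    return sum(chrom)

def crossover(a, b):  # combine two parent chromosomes at a random cut
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.1):  # flip each bit with small probability
    return [bit ^ (random.random() < rate) for bit in chrom]

pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
for _ in range(40):  # generations
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]  # the fittest half survives
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(10)]
    pop = survivors + children

print(fitness(max(pop, key=fitness)))  # best fitness found (max possible is 8)
```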
Definitions. Part II
Natural language processing (NLP): uses algorithms to analyze text data.
Sentiment Analysis is a common application: it measures customers'
reaction to a product campaign by analyzing social media.
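A deliberately crude lexicon-based sketch of sentiment scoring (the word lists are made up; production systems use trained models and far larger lexicons):

```python
# Crude sentiment scorer: count positive vs negative words (sketch only).
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love the new phone, great camera"))  # positive
print(sentiment("terrible battery, I hate it"))         # negative
```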
Neural networks: models inspired by the workings of neurons and synapses
within the brain, used for finding nonlinear patterns. They can be applied to
Pattern Recognition and Optimization. Examples of neural network
applications include identifying customers who may leave and flagging
fraudulent insurance claims.
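The simplest such unit, a single perceptron, can be trained in a few lines (an illustrative sketch; real networks stack many units and use gradient-based training):

```python
# A single artificial neuron (perceptron) learning the OR function --
# the building block that neural networks stack in layers.
def step(x):
    return 1 if x > 0 else 0

weights, bias, lr = [0.0, 0.0], 0.0, 0.1
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for _ in range(20):  # training epochs
    for (x1, x2), target in data:
        out = step(weights[0] * x1 + weights[1] * x2 + bias)
        err = target - out  # the error drives the synapse-weight updates
        weights[0] += lr * err * x1
        weights[1] += lr * err * x2
        bias += lr * err

print([step(weights[0] * a + weights[1] * b + bias) for (a, b), _ in data])
# [0, 1, 1, 1]
```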
Signal processing: an electrical engineering discipline for analyzing signals
(radio, etc.) and discerning signal from noise. It is used to extract
the signal from a set of noisy, less precise data [Signal Detection
Theory].
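One elementary way to pull signal out of noise is a moving-average filter (an illustrative sketch, not a method named in the deck):

```python
# Moving-average filter: smoothing a noisy series to recover the underlying
# trend is one of the simplest signal-vs-noise tools.
def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# A steady upward trend with alternating +/-1 noise on top.
noisy = [t + (1 if t % 2 else -1) for t in range(10)]
smoothed = moving_average(noisy)
print(smoothed)  # rises steadily even though `noisy` zig-zags
```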
Definitions. Part III
Spatial Analysis: analyzes geographic location encoded within
the data; the location information typically comes from GPS. Applications
include spatial regression to estimate a consumer's willingness to purchase
a product given his or her location.
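A basic spatial-analysis primitive is turning GPS coordinates into distances. A sketch using the haversine great-circle formula (the coordinates are just examples):

```python
import math

# Great-circle (haversine) distance: converts two GPS fixes into kilometers,
# the first step in most location-based analyses.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. distance from a customer's GPS fix to a store across the country
print(haversine_km(40.7128, -74.0060, 34.0522, -118.2437))  # NYC -> LA, roughly 3,900-4,000 km
```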
6) Big Data Temperaments
Source: Harvard Business Review, April 2012 by Shvetank Shah, Andrew Horne
and Jaime Capella.
7) Is Big always better?
No! says Nate Silver
• He refers to John P. Ioannidis's 2005 paper, "Why Most Published Research Findings Are False": two-thirds of scientific papers' results can't be replicated!
• "… numbers have no way of speaking for themselves. We speak for them."
• "I came to realize that prediction in the era of Big Data was not going very well."
• "If the quantity of information is increasing [exponentially]… most of it is just noise."
Nate's targets
• Political pundits. Their "intuitive" election predictions have been disastrous. Granted, it was not because of Big Data but of No Data. He showed them how to do it using Small Data (polls with samples < 1,000);
• Economic forecasters. They have used Big Data with poor results. The majority of them can't forecast a recession already underway. ECRI predicted with certainty a double-dip recession in 2011 using tens of variables it did not understand. Instead, the economy improved;
• Stock market & financial market forecasters. Similar performance to economic forecasters;
• Earthquake forecasting. The field is not well understood.
"… Statistical inferences are much stronger when backed up by theory… about their root causes."
No! says Vincent Granville
• Big Data is huge, but its information is very sparse;
• Storing and processing the entire data set is very inefficient;
• You can do better by smartly sampling only 5% of the data.
You don't need Big Data, you need Smart Data.
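Granville's point can be illustrated with a quick simulation (made-up Gaussian data): a 5% random sample recovers the population mean almost as well as a full scan:

```python
import random

random.seed(42)

# "Smart Data" sketch: a small random sample often estimates a population
# statistic about as well as processing all of it (illustrative only).
population = [random.gauss(100, 15) for _ in range(200_000)]
sample = random.sample(population, len(population) // 20)  # 5% sample

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(round(full_mean, 1), round(sample_mean, 1))  # the two means are close
```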
Yes! says Chris Anderson
• He quotes Peter Norvig, Google's research director: "All models are wrong, and increasingly you can succeed without them."
• "… with massive data, [the scientific method] is becoming obsolete."
• "We can throw the numbers into the biggest computing clusters … and let statistical algorithms find patterns where science cannot." He mentions examples such as J. Craig Venter's gene sequencing, Google Search, and Google Translate, among other successes.
"Correlation supersedes causation, and science can advance without coherent models, unified theories, or … any … explanation at all."
"With enough data, the numbers speak for themselves."
Big Data Effectiveness Map

             Theory not well understood       Theory well understood
Tall data    More data, more noise            More data, more signal
             (oversampling)                   (oversampling)
Wide data    More variables, more false       More variables, more explanation
             positives (multicollinearity,    (multicollinearity, model
             model overfitting)               overfitting)
Examples     Economics, financial markets,    Weather forecasting, customer
             earthquake forecasting           behavior, games & sports (chess,
                                              baseball, etc.), politics

A field needing causal understanding stays rule based: more data, better model performance.
A field not needing causal understanding (Google Search, Google Translate, Google Flu Trends, customer behavior): more data, better model performance. This is where Big Data is Recommended.