This is an introductory lecture on one of today's most talked-about technology domains. The domain covers many new concepts, keywords, theories and paradigm shifts, from computer science to business.
GILAD BARKAN
The Rise of Big Data Science
Big Data Science
[Diagram: Big Data, Data Science, Big Data Science]
Big Data
Why ? What ? How ?
Why Big Data ?
We live in an era flooded with information
In a world where data is power, big data is big power
Why Big Data ?
Web 2.0
Why should we care about Big Data ?
The big business opportunities
Competitive fast moving marketplace
Capitalize on business opportunities before everyone else
Existing channels to every person on the planet
Maximizing revenues from customers
Segment-of-1 - more personal customer experiences
What is Big Data ?
Volume
Variety
Velocity
The 3 V’s
Big Data - Volume
Big Users: More Users, All the Time
Global online population: 2 billion
Smartphone users: 1 billion+
Hours spent online: 35 billion hours
Big Data: More Users + More Data
Heterogeneous sources of data: Structured, Unstructured
Trillions of Gigabytes (Zettabytes)
Text, Log Files, Click Streams, Blogs, Tweets, Audio, Video, etc.
Big Data - Variety
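Category                             Data type   Size
Structured Data (traditional SQL)    tables      5 KB / record
Un/Semi-Structured Data (NoSQL)      text        50 KB / record
Un/Semi-Structured Data (NoSQL)      images      1000 KB / image
Un/Semi-Structured Data (NoSQL)      audio       5000 KB / song
Un/Semi-Structured Data (NoSQL)      video       700 MB / movie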
Big Data - Velocity
How the hell does Google return an answer in 0.28 seconds by looking at 4 Billion pages?
Big Data - Velocity
Online Advertisement - Real Time Bidding (RTB)
Big Data - Velocity
Recommendations
How is Big Data Handled ?
The challenge is huge
Store, analyze and serve a huge volume and variety of data at high velocity
We can’t achieve this using a single machine, no matter how strong it is. Why?
Expensive – stay tuned
Load balancing requests
Outbrain serves 3,000 requests per second
DG (MediaMind) serves 500K requests per second!!!
Not fault tolerant
Distributing the Data
The Big Data Paradigm Shifts
Volume
Scale Up (Vertical): SQL Server
Scale Out (Horizontal): HDFS (GFS), nodes in a Hadoop cluster
Big Data – Reducing Costs
Hadoop is a 5 times cheaper infrastructure !!!
TCO (purchase + maintenance) for 3 years per 300 TB:
75-node cluster = 1 M$
DBMS server = 5 M$
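The arithmetic behind the "5 times cheaper" claim can be sketched as below. The input figures are the ones on the slide; the per-terabyte-per-year and per-node breakdowns are derived here for illustration only, and the function and constant names are hypothetical.

```python
# Figures quoted on the slide (TCO = purchase + maintenance).
CAPACITY_TB = 300
YEARS = 3
NODES = 75
HADOOP_TCO_USD = 1_000_000   # 75-node cluster
DBMS_TCO_USD = 5_000_000     # DBMS server


def cost_per_tb_year(tco_usd, capacity_tb=CAPACITY_TB, years=YEARS):
    """Spread a total cost of ownership over capacity and time."""
    return tco_usd / (capacity_tb * years)


ratio = DBMS_TCO_USD / HADOOP_TCO_USD          # how many times cheaper
hadoop_rate = cost_per_tb_year(HADOOP_TCO_USD)  # $ per TB per year
dbms_rate = cost_per_tb_year(DBMS_TCO_USD)      # $ per TB per year
tb_per_node = CAPACITY_TB / NODES               # storage share of each node

print(f"Hadoop: ${hadoop_rate:,.2f} per TB per year")
print(f"DBMS:   ${dbms_rate:,.2f} per TB per year")
print(f"Ratio:  {ratio:.0f}x, with {tb_per_node:.0f} TB per node")
```

Under these figures each node holds about 4 TB, which is the scale-out idea in small: many modest machines rather than one large one.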
Big Data Paradigm Shift - Computing
MapReduce Computing Paradigm
Exploiting the distributed architecture for large scale computations in parallel
MapReduce
“Hello MapReduce” – counting words
Word counts (C = count, W = word) for three pages (URL 1, URL 2, URL 3) and the combined result:
Word     Per-page counts    Total
the      5, 7, 9            21
Cow      0, 1, 1            2
quick    2, 0, 3            5
[Diagram: MapReduce + Hadoop Cluster: Master, Mappers, Reducer; {w, c} pairs]
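The word-counting example can be sketched in plain Python as below. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop code; the page texts and function names are hypothetical, since the slide shows only the resulting counts.

```python
from collections import defaultdict

# Hypothetical page contents; the slide gives counts, not the text itself.
pages = {
    "url1": "the quick cow the",
    "url2": "the cow",
    "url3": "quick the quick",
}


def map_phase(text):
    """Mapper: emit one (word, 1) pair for every word in a single page."""
    return [(word, 1) for word in text.split()]


def shuffle(pairs):
    """Group the emitted counts by word, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum all counts collected for one word."""
    return word, sum(counts)


def word_count(pages):
    emitted = []
    for text in pages.values():   # on a cluster, each page could go to a different mapper
        emitted.extend(map_phase(text))
    return dict(reduce_phase(w, c) for w, c in shuffle(emitted).items())


totals = word_count(pages)
print(totals)
```

Because each mapper only needs its own page and each reducer only needs the counts for its own words, both phases can run in parallel across nodes.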
Big Data Paradigm Shift – NoSQL
Schema-less databases to support the variety of data
Complex SQL queries (joins, etc.) in a distributed data framework are extremely inefficient
Key-Value Store NoSQL
Variety
Key         Value
user_id     tables
url         text
image_id    images
video_id    video
Key: any – not a single primary key as in SQL
Value: any
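A key-value store of this kind can be sketched with an in-memory dictionary, as below. This is a toy illustration under stated assumptions, not the API of any particular NoSQL product; the class name, keys and sample values are all hypothetical.

```python
class KeyValueStore:
    """Toy schema-less store: any hashable key maps to any value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # A lookup by key only; there are no joins and no fixed columns.
        return self._data.get(key, default)


store = KeyValueStore()
# Values of different shapes live side by side without a shared schema.
store.put(("user_id", 42), {"name": "Dana", "plan": "premium"})   # table-like record
store.put(("url", "http://example.com/a"), "<html>...</html>")   # text
store.put(("image_id", 7), b"\x89PNG")                            # binary image data

print(store.get(("user_id", 42)))
print(store.get(("video_id", 1)))  # missing keys return the default
```

Lookups touch a single key, which is what makes this model easy to partition across many nodes.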
Big Data Paradigm Shift –
RAM-based DBs instead of traditional disk-based DBs
Store critical data in memory (much more expensive)
If the data doesn't come to the algorithm (Alg), the algorithm will come to the data
Velocity
[Diagram: Alg reading and writing Data, traditional vs. today]
Big Data - Summary
BIG business opportunities
The 3 V’s: Volume, Variety, Velocity
Technological paradigm shifts
Big Data Technological Paradigm Shifts
[Diagram: paradigm shifts for Volume, Variety and Velocity: Scale Up to Scale Out; Map and Reduce with Master, Mappers and Reducer; NoSQL Key-Value; Alg and Data]
Flood of New Big Data Technologies
Open Source
Big Buzz ?
Big Data - Summary
BIG business opportunities
The 3 V’s: Volume, Variety, Velocity
Computing and DB paradigm shifts
Flood of new (open source) technologies
It’s definitely not just a buzz
It’s a real response to the world's hectic pace of evolution
Reducing costs by an order of magnitude
Still, it doesn’t mean every business today will / should transform its technology stack to support big data
Data Science
Why ? What ? How ?
data scientists
Why Data Science ?
Data has real value
Facebook acquires Onavo for ~150M$
Welcome to the Intelligent world
Data Science
Data Analysis
Data Mining
Automatic Decisioning
Predictive Analytics
Machine Learning
Data Analytics
Data Miners are the New Gold Miners
Search
Online Advertisement - Real Time Bidding (RTB)
Recommendations
Text Analysis
CRM – Customer Churn Prediction
Time Series Analysis
Machine Learning
Classification
Clustering
Regression
Recommendation
Classification
Amdocs Insight™ - why is the customer calling the Call Center ?
Third Party Charges
Pay Bill
Abnormal fee
Bill too high
Overage
Clustering
Market Segmentation
Social Network Analysis
Regression
Housing price prediction
[Plot: Price ($) in 1000’s versus Size in m2; the values 130, 215 and 280 are marked on the plot]
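Housing price prediction can be sketched as a least-squares line fit, as below. The training points are hypothetical and chosen only to match the axis ranges of the plot; they are not the data behind the slide, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: size in m2, price in $1000's.
sizes = [50, 100, 150, 200, 250]
prices = [100, 175, 250, 325, 400]

slope, intercept = fit_line(sizes, prices)


def predict(size_m2):
    """Predicted price in $1000's for a house of the given size."""
    return slope * size_m2 + intercept


print(f"price = {slope:.2f} * size + {intercept:.2f}")
print(f"predicted price for 130 m2: {predict(130):.1f}")
```

Regression differs from classification in that the output is a continuous number (a price) rather than one of a fixed set of labels.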
The Data Scientist
Data Scientist Skillset
Hands-on with tools, languages, technologies
MSc / PhD in Math, CS, Stats, Physics
Hands-on with the specific problem domain
Data Science ≠ BI
Apply advanced statistical machine learning algorithms to:
dig deeper to find patterns that traditional BI tools may not reveal
cover a much wider spectrum of domains / applications
Predictive Analytics ≠ Exploratory Analytics
[Diagram: Business Intelligence (Traditional BI, Exploratory Analytics) vs. Data Science (Predictive Analytics, Big Data Science)]
Academia Response to Data Science
The Art of Data Science
We need at least a one-semester course for it
Still…
Data Science Life Cycle
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Offline Data Analysis
Run Time
Business Goal
Closing the Loop
Technically speaking, what do you think? Is Big Data good or bad for Data Science ?
The Bad - Finding a Needle in a Haystack
It’s the same treasure that hides – the problem is that the pile is now huge
Big Data Big Noise
The Good - The Statistical View
Statistics is predictive analytics’ fuel !
The more data you have (Big Data) the better your predictive models will perform
Law of Large Numbers
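The Law of Large Numbers can be illustrated with a small simulation, as below: the observed share of heads in fair coin flips gets closer to 0.5 as the number of flips grows. This is a sketch; the function name, seed and sample sizes are arbitrary choices for illustration.

```python
import random


def sample_mean_of_coin_flips(n, seed=0):
    """Fraction of heads in n simulated fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n


for n in (10, 1_000, 100_000):
    mean = sample_mean_of_coin_flips(n)
    print(f"n = {n:>7}: share of heads = {mean:.4f}, distance from 0.5 = {abs(mean - 0.5):.4f}")
```

The same effect is why more data tends to give more stable estimates in predictive models, provided the extra data is of comparable quality.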
Combining the Good & Bad
Data is a function of quality and quantity
[Chart: Quality (Low to High) versus Quantity (Small to Big)]
Big Data Science - Summary
Big Data
Big Numbers
Big Opportunities
Big Data is the buzziest technology nowadays
Data Scientists
The ones that coax the treasures out of the big data for their companies
Are multi-discipline skilled
The new industry rock stars
Thank You for your attention