Slide 2: Outline
• What is big data?
• Definition of big data and the 4 V's
• Some stats
• Who is using big data?
• Applications
• Intro to Hadoop and MapReduce
• Coin-counting analogy
• Typical workflow of Hadoop
• Big data for statisticians
• Problems with big data
• Conclusion
Slide 3: What is Big Data?
Big data refers to a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications; that is, beyond current comfort levels. "Big" is relative, depending on context, the amount of data, and the complexity of the problem.
Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner)
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. (Webopedia)
• Multiple terabytes or petabytes
• Today's big may be tomorrow's normal
• It is relative to its context
Ref: https://www.stat.wisc.edu/bigdata
Slide 5: Stats
Obama administration unveils "Big Data" initiative: announces $200 million in new R&D investments.
https://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
Slide 8
Ref: http://www.slideshare.net/VipinBatra/introduction-to-big-data-45010980
Slide 10: Hadoop
• Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets.
• Works on computer clusters built from commodity hardware.
• Popularized by Google's MapReduce paper in 2004.
• Written in Java.
• Has two main components: MapReduce and HDFS.
Slide 11: Coin-Counting Analogy
Img src: http://thelogicalindian.com/wp-content/uploads/2015/09/Untitled-138-750x500.jpg
Slide 12: Anatomy of MapReduce
[Diagram: word-count example. Input text is read from HDFS and split across mappers; each mapper emits (word, 1) pairs; the shuffle groups the pairs by key (a: 1 1 1, b: 1, c: 1 1); reducers sum each group and write the totals (a 3, b 1, c 2) back to HDFS.]
HDFS → mappers → shuffle → reducers → HDFS
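To make the diagram concrete, here is a minimal local simulation of the same word-count flow in Python. This is not Hadoop itself; it only mirrors the map, shuffle, and reduce stages, and the input lines are hypothetical ones chosen to reproduce the counts shown above.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map stage: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle stage: group the intermediate values by key (the word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reducer(word, counts):
    # Reduce stage: sum the 1s for each word to get its total count.
    return word, sum(counts)

# Hypothetical input lines standing in for blocks read from HDFS.
lines = ["a b c", "a c", "a"]
mapped = chain.from_iterable(mapper(line) for line in lines)
grouped = shuffle(mapped)
for word, counts in sorted(grouped.items()):
    print(reducer(word, counts))   # ('a', 3), ('b', 1), ('c', 2)
```

On a real cluster the same mapper and reducer logic runs in parallel across many machines, with HDFS providing the input splits and storing the final output.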
Slide 17: Seven Tips for Statisticians Using Big Data
Ref: http://bit.ly/S1ma4Z
Slide 18: Problem First, Not Solution Backward
One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails.
There is a similar temptation in big data: getting fixated on a tool (Hadoop, Pig, Hive, NoSQL databases, distributed computing, GPGPU, etc.) and ignoring the actual problem, namely whether x relates to y or whether x predicts y.
Slide 19: Make Your Code and Data Available and Have Smart People Check It
Even in small-data examples, there can be a bug in the code used to analyze the data. With big data and complex models this is even more important. Mozilla Science is doing interesting work on code review for data analysis in science, but in general, just having a friend look over your code will catch a huge fraction of the problems you might have.
Slide 20: Unless You Ran a Randomized Trial, Potential Confounders Should Keep You Up at Night
Any time you discover a cool new result, your first thought should be, "What are the potential confounders?"
Slide 21: Know What Your Real Sample Size Is
It is easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases, the size of the data increases, but the amount of information may not. In general, the bigger the sample size the better, but sample size and data size aren't always tightly correlated.
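As a rough illustration of that point, the following sketch (assuming NumPy is available) renders the same black circle at several resolutions: the byte count grows quadratically while the information content, essentially a center and a radius, does not.

```python
import numpy as np

def circle_image(resolution, radius_frac=0.3):
    # Render a black circle on a white background at the given resolution.
    y, x = np.indices((resolution, resolution))
    center = resolution / 2
    radius = radius_frac * resolution
    inside = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
    return np.where(inside, 0, 255).astype(np.uint8)

for res in (32, 256, 2048):
    img = circle_image(res)
    # Data size grows with resolution squared...
    print(f"{res}x{res}: {img.nbytes:>12,} bytes")
# ...but the underlying information is still just a center and a radius.
```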
Slide 22: Before You Analyze Your Data with Computers, Be Sure to Plot It
A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious patterns in the data if you don't plot it first.
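A minimal sketch of that habit, using matplotlib on hypothetical data: scatter-plot a random sample before fitting anything, since even a quick plot of a subset reveals structure (here, curvature) that summary statistics or a default linear fit would hide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: a relationship a straight-line fit alone would misrepresent.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100_000)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)

# Plot a random sample before reaching for any model.
idx = rng.choice(x.size, size=2_000, replace=False)
plt.scatter(x[idx], y[idx], s=4, alpha=0.4)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Always look at the data before fitting")
plt.show()
```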
Slide 23: Interactive Analysis Is the Best Way to Really Figure Out What Is Going On in a Data Set
If you want to understand a data set, you have to be able to play around with it and explore it. You need to make tables, make plots, and identify quirks, outliers, missing-data patterns, and problems with the data. To do this you need to interact with the data quickly. One way is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig, but an often easier, better, and more cost-effective approach is to use random sampling. As Robert Gentleman put it, "make big data as small as possible as quick as possible".
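One way to make big data small quickly is plain random sampling. The sketch below uses reservoir sampling so the file never has to fit in memory; the file name is hypothetical.

```python
import random

def sample_lines(path, k, seed=0):
    """Reservoir-sample k lines from a file too large to load into memory."""
    random.seed(seed)
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = random.randint(0, i)  # each line kept with probability k/(i+1)
                if j < k:
                    reservoir[j] = line
    return reservoir

# e.g., pull 10,000 rows out of a multi-gigabyte log for interactive exploration
# sample = sample_lines("events.log", k=10_000)
```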
Slide 24: If the Goal Is Prediction Accuracy, Average Many Prediction Models Together
In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix Prize blend multiple models together. The idea is that by averaging (or majority voting over) multiple good prediction algorithms, you can reduce variance without a corresponding increase in bias.
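A minimal sketch of that blending idea with scikit-learn, using synthetic data as a stand-in for a real prediction problem: fit a few different regressors and average their predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real prediction problem.
X, y = make_regression(n_samples=2_000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [Ridge(), RandomForestRegressor(random_state=0), GradientBoostingRegressor(random_state=0)]
preds = []
for m in models:
    m.fit(X_tr, y_tr)
    p = m.predict(X_te)
    preds.append(p)
    print(type(m).__name__, mean_squared_error(y_te, p))

# Simple average of the individual predictions: the "blend".
blend = np.mean(preds, axis=0)
print("Average of models", mean_squared_error(y_te, blend))
```

Libraries also offer this directly (e.g., scikit-learn's VotingRegressor), but a plain average keeps the idea explicit.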
Slide 25: Big Data Problems
• "The parable of Google Flu: traps in big data analysis"
• "Google Flu Trends: the limits of big data"
• "Eight (No, Nine!) Problems with Big Data"
Ref: http://bit.ly/1fUzZO1