Slide 2: Outline
• What is big data?
• Definition of big data and the 4 V's
• Some stats
• Who is using big data?
• Applications
• Intro to Hadoop and MapReduce
• Coin-counting analogy
• Typical workflow of Hadoop
• Big data for statisticians
• Problems with big data
• Conclusion
Slide 3: What is Big Data?
Big data refers to a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications; that is, beyond current comfort levels. "Big" is relative, depending on context, the amount of data, and the complexity of the problem.
Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner)
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. (Webopedia)
• Multiple terabytes or petabytes
• Today's big may be tomorrow's normal
• It is relative to its context
Ref: https://www.stat.wisc.edu/bigdata
Slide 5: Stats
Obama administration unveils "Big Data" initiative: announces $200 million in new R&D investments.
https://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
Slide 8
Ref: http://www.slideshare.net/VipinBatra/introduction-to-big-data-45010980
Slide 10: Hadoop
• Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets.
• Works on computer clusters built from commodity hardware.
• Popularized by Google's MapReduce paper in 2004.
• Written in Java.
• Has two main components: MapReduce and HDFS.
Slide 11: Coin-Counting Analogy
Img src: http://thelogicalindian.com/wp-content/uploads/2015/09/Untitled-138-750x500.jpg
Slide 12: Anatomy of MapReduce
[Diagram: word-count example. Input text is read from HDFS and split across mappers; each mapper emits (word, 1) pairs; the shuffle groups the pairs by key (a: 1 1 1, b: 1, c: 1 1); reducers sum each group and write the totals (a 3, b 1, c 2) back to HDFS.]
HDFS → mappers → shuffle → reducers → HDFS
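To make the diagram concrete, here is a minimal local simulation of the same word-count flow in Python. This is not Hadoop itself; it only mirrors the map, shuffle, and reduce stages, and the input lines are hypothetical ones chosen to reproduce the counts shown above.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map stage: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle stage: group the intermediate values by key (the word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reducer(word, counts):
    # Reduce stage: sum the 1s for each word to get its total count.
    return word, sum(counts)

# Hypothetical input lines standing in for blocks read from HDFS.
lines = ["a b c", "a c", "a"]
mapped = chain.from_iterable(mapper(line) for line in lines)
grouped = shuffle(mapped)
for word, counts in sorted(grouped.items()):
    print(reducer(word, counts))   # ('a', 3), ('b', 1), ('c', 2)
```

On a real cluster the same mapper and reducer logic runs in parallel across many machines, with HDFS providing the input splits and storing the final output.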
Slide 17: Seven Tips for Statisticians Using Big Data
Ref: http://bit.ly/S1ma4Z
Slide 18: Problem First, Not Solution Backward
One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails.
There is a similar temptation in big data: getting fixated on a tool (Hadoop, Pig, Hive, NoSQL databases, distributed computing, GPGPU, etc.) and ignoring the actual problem, namely whether x relates to y or whether x predicts y.
Slide 19: Make Your Code and Data Available and Have Smart People Check It
Even in small-data examples, there can be a bug in the code used to analyze the data. With big data and complex models this is even more important. Mozilla Science is doing interesting work on code review for data analysis in science, but in general, just having a friend look over your code will catch a huge fraction of the problems you might have.
Slide 20: Unless You Ran a Randomized Trial, Potential Confounders Should Keep You Up at Night
Any time you discover a cool new result, your first thought should be, "What are the potential confounders?"
Slide 21: Know What Your Real Sample Size Is
It is easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases, the size of the data increases, but the amount of information may not. In general, the bigger the sample size the better, but sample size and data size aren't always tightly correlated.
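As a rough illustration of that point, the following sketch (assuming NumPy is available) renders the same black circle at several resolutions: the byte count grows quadratically while the information content, essentially a center and a radius, does not.

```python
import numpy as np

def circle_image(resolution, radius_frac=0.3):
    # Render a black circle on a white background at the given resolution.
    y, x = np.indices((resolution, resolution))
    center = resolution / 2
    radius = radius_frac * resolution
    inside = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
    return np.where(inside, 0, 255).astype(np.uint8)

for res in (32, 256, 2048):
    img = circle_image(res)
    # Data size grows with resolution squared...
    print(f"{res}x{res}: {img.nbytes:>12,} bytes")
# ...but the underlying information is still just a center and a radius.
```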
Slide 22: Before You Analyze Your Data with Computers, Be Sure to Plot It
A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious patterns in the data if you don't plot it first.
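A minimal sketch of that habit, using matplotlib on hypothetical data: scatter-plot a random sample before fitting anything, since even a quick plot of a subset reveals structure (here, curvature) that summary statistics or a default linear fit would hide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: a relationship a straight-line fit alone would misrepresent.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100_000)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)

# Plot a random sample before reaching for any model.
idx = rng.choice(x.size, size=2_000, replace=False)
plt.scatter(x[idx], y[idx], s=4, alpha=0.4)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Always look at the data before fitting")
plt.show()
```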
Slide 23: Interactive Analysis Is the Best Way to Really Figure Out What Is Going On in a Data Set
If you want to understand a data set, you have to be able to play around with it and explore it. You need to make tables, make plots, and identify quirks, outliers, missing-data patterns, and problems with the data. To do this you need to interact with the data quickly. One way is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig, but an often easier, better, and more cost-effective approach is to use random sampling. As Robert Gentleman put it, "make big data as small as possible as quick as possible".
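One way to make big data small quickly is plain random sampling. The sketch below uses reservoir sampling so the file never has to fit in memory; the file name is hypothetical.

```python
import random

def sample_lines(path, k, seed=0):
    """Reservoir-sample k lines from a file too large to load into memory."""
    random.seed(seed)
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = random.randint(0, i)  # each line kept with probability k/(i+1)
                if j < k:
                    reservoir[j] = line
    return reservoir

# e.g., pull 10,000 rows out of a multi-gigabyte log for interactive exploration
# sample = sample_lines("events.log", k=10_000)
```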
Slide 24: If the Goal Is Prediction Accuracy, Average Many Prediction Models Together
In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix Prize blend multiple models together. The idea is that by averaging (or majority voting over) multiple good prediction algorithms, you can reduce variance without a corresponding increase in bias.
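A minimal sketch of that blending idea with scikit-learn, using synthetic data as a stand-in for a real prediction problem: fit a few different regressors and average their predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real prediction problem.
X, y = make_regression(n_samples=2_000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [Ridge(), RandomForestRegressor(random_state=0), GradientBoostingRegressor(random_state=0)]
preds = []
for m in models:
    m.fit(X_tr, y_tr)
    p = m.predict(X_te)
    preds.append(p)
    print(type(m).__name__, mean_squared_error(y_te, p))

# Simple average of the individual predictions: the "blend".
blend = np.mean(preds, axis=0)
print("Average of models", mean_squared_error(y_te, blend))
```

Libraries also offer this directly (e.g., scikit-learn's VotingRegressor), but a plain average keeps the idea explicit.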
Slide 25: Big Data Problems
• "The parable of Google Flu: traps in big data analysis"
• "Google Flu Trends: the limits of big data"
• "Eight (No, Nine!) Problems with Big Data"
Ref: http://bit.ly/1fUzZO1