Scalable Data Analysis (CIS 602-02)dkoop/cis602-2015fa/lectures/lecture05.pdf · While vanilla Hadoop performs poorly in a scale-up conﬁgurations, a series of optimizations makes

D. Koop, CIS 602-02, Fall 2015

Scalable Data Analysis (CIS 602-02)

Data Sources

Dr. David Koop

Technical Papers• A document that describes scientific research • Two general categories:

- Survey: What has been done in a specific area - Research: a problem, related work, solution, and results

• Writing helps clarify your own thinking and communicate it to others [N. Feamster]

• "The purpose of research is to increase the store of human knowledge, and so even the very best work is useless if you cannot effectively communicate it to the rest of the world." — M. Ernst

• Research papers are primary sources, textbooks are secondary sources

• Most recent research is not in a textbook • Technical Reports vs. Journal Articles/Conference Proceedings

2D. Koop, CIS 602-02, Fall 2015

http://greatresearch.org/2013/10/11/storytelling-101-writing-tips-for-academics/

https://homes.cs.washington.edu/~mernst/advice/write-technical-paper.html

Paper Structure• Title & Author List • Abstract • Introduction • [Background/Preliminaries] • Contribution (Approach/Theory/Specification/Implementation) • Evaluation (Experiments, case studies) • [Discussion] • Related Work (here or after introduction) • Conclusion [& Future Work] • [Appendices]

3D. Koop, CIS 602-02, Fall 2015

Citations and References

4D. Koop, CIS 602-02, Fall 2015

[Piled Higher and Deeper, J. Cham, 9/11/2015]

http://phdcomics.com/comics/archive.php?comicid=1820

Reading Papers• "How to Read and Evaluate Technical Papers", B. Griswold

modified by G. Murphy • Sometimes useful to read the paper "out of order" • Five questions you should answer when reading a paper:

1. What are the motivations for this work? • People problem • Technical problem

2. What is the proposed solution? 3. What is the evaluation of the proposed solution? 4. What are the contributions? 5. What are future directions for this research?

5D. Koop, CIS 602-02, Fall 2015

http://www.apple.com

Reading Responses• Read the white paper: The Digital Universe of Opportunities • Write a reading response and turn in via myCourses

- Summarize the key contributions (1 paragraph) • Highlight key contributions of the paper

- Offer your critique of the paper (1-2 paragraphs) • Describe why the paper is important, what concepts you agree

or disagree with, and how the work might be extended or integrated with other work.

• May include questions in the critique, but they must be questions that are not trivial

- The response must be your own writing; you may not copy text from any other source including the paper itself, other students, online resources, etc.

6D. Koop, CIS 602-02, Fall 2015

Assignment 1• Start working with data in Python • http://www.cis.umassd.edu/~dkoop/cis602/assignment1.html • United Nations High Commissioner for Refugees maintains

statistics about population of concern around the world • Includes data about refugees and asylum seekers in different

countries • Use the provided template • You must complete the provided functions and not rename them • Turn in a single .py file via myCourses

7D. Koop, CIS 602-02, Fall 2015

http://www.cis.umassd.edu/~dkoop/cis602/assignment1.html

Reading Presentations• Present paper and critique of it • Goal: help the class better understand a specific paper topic • Two students per reading

- Both summarize key points - One student focuses on positive points, associated ideas, and

past work - One student focuses on criticisms, improvements, later work

• After presentations, everyone in the class should be involved in discussing items further

• Let me know of your interest in specific topics • Volunteers for Tuesday's paper?

8D. Koop, CIS 602-02, Fall 2015

D. Koop, CIS 602-02, Fall 2015

The Digital Universe

V. Taylor, D. Reinsel, J. F. Gantz, S. Minton

History• EMC and IDC have been examining the growth of the "Digital

Universe" since 2007 • Yearly reports that focus on key trends and data • Focused on supporting business, not necessarily academic pursuits • Collect some useful and provoking statistics

10D. Koop, CIS 602-02, Fall 2015

Growth of Data

11D. Koop, CIS 602-02, Fall 2015

Geography of Data

12D. Koop, CIS 602-02, Fall 2015

Classification of Data

13D. Koop, CIS 602-02, Fall 2015

Usefulness of Data

14D. Koop, CIS 602-02, Fall 2015

Analyzed Data

15D. Koop, CIS 602-02, Fall 2015

Security and Privacy

16D. Koop, CIS 602-02, Fall 2015

Data in the Cloud

17D. Koop, CIS 602-02, Fall 2015

What will be stored in the Cloud?

18D. Koop, CIS 602-02, Fall 2015

Data Sources• Analog to digital transitions

- Photos - Voice Data - Video - Sensors

• Transient data - "Normal" data is often discarded - Data may be summarized or aggregated

19D. Koop, CIS 602-02, Fall 2015

"Internet of Things"• Many devices are now digital, connected to the internet • Wired or wireless • 14 billion devices now (out of ~200 billion), 32 billion by 2020 • Often real-time, no need to wait for data • Containers (aka "Files"):

- Sensors produce small bits of data often - Why would this data not be aggregated before being stored?

20D. Koop, CIS 602-02, Fall 2015

Example Data Sources

21D. Koop, CIS 602-02, Fall 2015

Example Data Sources• Radio Telescopes • Twitter • Wind Turbine Sensors • Surveillance Cameras • Cars & Airplanes • Dog Collars • Dishwashers • Traffic Lights • MRI Scanners • NFL Football Players • Farming

22D. Koop, CIS 602-02, Fall 2015

[Zebra MotionWorks]

[CC-SA 2.0, Stephan Trebs]

Easy to access.Can you obtain the data, or is it hopelessly locked away on end-user PCs, shuttling about on closed-end data processing systems, or trapped in proprietary embedded systems?

Transformative.Could this kind of data, properly analyzed and acted upon, actually change a company or society in a meaningful way?

Real-time.Is the data available in real-time, or does much of it come too late to drive real-time decisions and actions?

Intersection synergy. Could this kind of data have more than one of the above attributes?

Footprint.

Could top-notch analysis of this data affect a lot of people, major parts of the organization, or lots of customers?

High-Value Data

23D. Koop, CIS 602-02, Fall 2015

[The Digital Universe of Opportunities]

http://www.emc.com/leadership/digital-universe/2014iview/index.htm

Messy Data• Unstructured • Diversely formatted • Uncertain accuracy • Unknown origin • Unpredictable value • Demands real-time attention

• "Data Lakes" - Flexible storage - Processed and classified later

24D. Koop, CIS 602-02, Fall 2015

The human element• "Increased investment in human capital and the new skills required

today is the first order of business for all organizations" • How do we decide what data is useful? • What questions do we ask about the data to process it?

25D. Koop, CIS 602-02, Fall 2015

Critique• Very focused on how data affects business • Defines very real problems and suggests solutions • No description of how the data was collected

- Were surveys of businesses done? - Where do the estimates for the amount of data come from?

• No citations (white paper) - Hard to evaluate the importance of the problems - Hard to evaluate the utility of the solutions

26D. Koop, CIS 602-02, Fall 2015

Discussion• Other sources of data? • What makes data valuable? • Cloud storage? • Security & privacy

27D. Koop, CIS 602-02, Fall 2015

Copyright © National Academy of Sciences. All rights reserved.

Frontiers in Massive Data Analysis

INTRODUCTION 19

ORGANIZATION OF THIS REPORT

The statement of task for the study that led to this report reads as follows:

The study will carry out the following tasks:

• Assess the current state of data analysis for mining of massive sets and streams of data,

• Identify gaps in current practice and theory, and • Propose a research agenda to fill those gaps.

TABLE 1.1 Scientific and Engineering Fields Impacted by Massive Data

Area Affected in 1995 Area Affected in 2012 Noteworthy Use Cases

Physical sciences Physical sciences Astronomy, particle physics

Climatology Climatology

Signal processing Signal processing

Medicine Medicine Imaging, medical records

Artificial intelligence Artificial intelligence Natural language processing, computer vision

Marketing Marketing Internet advertising, corporate loyalty programs

N/A Political science Agent-based modeling of regime change

N/A Forensics Fraud detection, drug/human/CBRNe trafficking

N/A Cultural studies Human terrain assessment, land use, cultural geography

N/A Sociology Comparative sociology, social networks, demography, belief and information diffusion

N/A Biology Genomics, proteomics, ecology

N/A Neuroscience fMRI, multi-electrode recordings

N/A Psychology Social psychology

NOTE: CBRNe, chemical, biological, radiological, nuclear, enhanced improvised explosive devices; fMRI, functional magnetic resonance imaging; N/A, not applicable.

Scientific Fields Impacted by Massive Data

28D. Koop, CIS 602-02, Fall 2015

[Frontiers in Massive Data Analysis]

http://www.nap.edu/catalog/18374/frontiers-in-massive-data-analysis

Examples of Massive Data• Earth and Planetary Observations • Astronomy • Biological and Medical Research • Large Numerical Simulations • Telecommunications and Networking • Social Networks • National Security

29D. Koop, CIS 602-02, Fall 2015

[Frontiers in Massive Data Analysis]

http://www.nap.edu/catalog/18374/frontiers-in-massive-data-analysis

Square Kilometer Array (SKA) 1 Mid

30D. Koop, CIS 602-02, Fall 2015

[www.skatelescope.org]

http://www.skatelescope.org

SKA1-Low

31D. Koop, CIS 602-02, Fall 2015

[www.skatelescope.org]

http://www.skatelescope.org

Large Synoptic Survey Telescope (LSST)

32D. Koop, CIS 602-02, Fall 2015

[http://www.lsst.org]

• Image every 15 seconds • 100PB over 10 years

http://www.lsst.org

Biology and Medical Research• MRIs for Neuroscience (10s of TB per session) • Genomics: Short Read Archive approaching 1PB • Brain Imaging • High-resolution microscopy

33D. Koop, CIS 602-02, Fall 2015

Large Numerical Simulations• Millennium simulation: dark matter, 30TB raw data

34D. Koop, CIS 602-02, Fall 2015

Figure 1: The dark matter density field on various scales. Each individual image shows the projecteddark matter density field in a slab of thickness 15h−1Mpc (sliced from the periodic simulation volumeat an angle chosen to avoid replicating structures in the lower two images), colour-coded by densityand local dark matter velocity dispersion. The zoom sequence displays consecutive enlargements byfactors of four, centred on one of the many galaxy cluster halos present in the simulation.

5

[V. Springel et al., 2005]

Social Networks• Facebook storing petabytes of data per year • Twitter: 500,000,000 tweets (8 TB of data) per day • Millions of users, different links • Huge graphs

35D. Koop, CIS 602-02, Fall 2015

More Big Data Use Cases• Government Operation (Census, Archives, Surveys) • Commercial (Financial, Netflix, Web Search, Cargo Shipping,

Materials Data) • Defense (Geospatial Analysis, Object Identification) • Health (Electronic Medical Records, Digital Pathology, Epidemiology,

Biodiversity) • Energy (Smart Grids)

36D. Koop, CIS 602-02, Fall 2015

[NIST Big Data]

http://bigdatawg.nist.gov/usecases.php

Big Data or Small Data?• "Nobody ever got fired for buying a cluster", R. Appuswamy et al.,

2013, http://research.microsoft.com/apps/pubs/default.aspx?id=179615

• "Forget Big Data, Small Data is the Real Revolution", R. Pollock, http://blog.okfn.org/2013/04/22/forget-big-data-small-data-is-the-real-revolution/

• "Forget Big Data -- Small Data Is Driving The Internet Of Things", M. Kavis, http://www.forbes.com/sites/mikekavis/2015/02/25/forget-big-data-small-data-is-driving-the-internet-of-things/

• "Big Data Doesn't Exist", S. Victoroff, http://techcrunch.com/2015/09/10/big-data-doesnt-exist

• http://smalldatagroup.com

37D. Koop, CIS 602-02, Fall 2015

http://research.microsoft.com/apps/pubs/default.aspx?id=179615

http://blog.okfn.org/2013/04/22/forget-big-data-small-data-is-the-real-revolution/

http://www.forbes.com/sites/mikekavis/2015/02/25/forget-big-data-small-data-is-driving-the-internet-of-things/

http://techcrunch.com/2015/09/10/big-data-doesnt-exist

http://smalldatagroup.com

tions to Hadoop that improve scale-up performance with-out compromising the ability to scale out.

While vanilla Hadoop performs poorly in a scale-upconfigurations, a series of optimizations makes it com-petitive with scale-out. Broadly, we remove the initialdata load bottleneck by showing that it is cost-effectiveto replace disk by SSDs for local storage. We then showthat simple tuning of memory heap sizes results in dra-matic improvements in performance. Finally, we showseveral small optimizations that eliminate the “shufflebottleneck”.

This paper makes two contributions. First, it showsthrough an analysis of real-world job sizes as well as anevaluation on a range of jobs, that scale-up is a com-petitive option for the majority of Hadoop MapReducejobs. Of course, this is not true for petascale or multi-terabyte scale jobs. However, there is a large number ofjobs, in fact the majority, that are sub-terabyte in size.For these jobs we claim that processing them on clus-ters of 10s or even 100s of commodity machines, as iscommonly done today, is sub-optimal. Our second con-tribution is a set of transparent optimizations to Hadoopthat enable good scale-up performance. Our results showthat with these optimizations, raw performance on a sin-gle scale-up server is better than scale-out on an 8-nodecluster for 9 out of 11 jobs, and within 5% for the other2. Larger cluster sizes give better performance but incurother costs. Compared to a 16-node cluster, a scale-upserver provides better performance per dollar for all jobs.When power and server density are considered, scale-upperformance per watt and per rack unit are significantlybetter for all jobs compared to either size of cluster.

Our results have implications both for data centerprovisioning and for software infrastructures. Broadly,we believe it is cost-effective for providers supporting“big data” analytic workloads to provision “big memory”servers (or a mix of big and small servers) with a view torunning jobs entirely within a single server. Second, itis then important that the Hadoop infrastructure supportboth scale-up and scale-out efficiently and transparentlyto provide good performance for both scenarios.

The rest of this paper is organized as follows. Sec-tion 2 shows an analysis of job sizes from real-worldMapReduce deployments that demonstrates that mostjobs are under 100 GB in size. It then describes 11 exam-ple Hadoop jobs across a range of application domainsthat we use as concrete examples in this paper. Section 3then briefly describes the optimizations and tuning re-quired to deliver good scale-up performance on Hadoop.Section 4 compares scale-up and scale-out for Hadoopfor the 11 jobs on several metrics: performance, cost,power, and server density. Section 5 discusses someimplications for analytics in the cloud as well as thecrossover point between scale-up and scale-out. Sec-

Figure 1: Distribution of input job sizes for a large analytics cluster

tion 6 describes related work, and Section 7 concludesthe paper.

2 Job sizes and example jobs

A key claim of this paper is that the majority of real-world analytic jobs can fit into a single “scale-up” serverwith up to 512 GB of memory. We analyzed 174,000 jobssubmitted to a production analytics cluster in Microsoftin a single month in 2011 and recorded the size of theirinput data sets. Figure 1 shows the CDF of input datasizes across these jobs. The median job input data set sizewas less than 14 GB, and 80% of the jobs had an inputsize under 1 TB. Thus although there are multi-terabyteand petabyte-scale jobs which would require a scale-outcluster, these are the minority.

Of course, these job sizes are from a single clusterrunning a MapReduce like framework. However we be-lieve our broad conclusions on job sizes are valid forMapReduce installations in general and Hadoop installa-tions in particular. For example, Elmeleegy [10] analyzesthe Hadoop jobs run on the production clusters at Ya-hoo. Unfortunately, the median input data set size is notgiven but, from the information in the paper we can esti-mate that the median job input size is less than 12.5 GB.1Ananthanarayanan et al. [4] show that Facebook jobs fol-low a power-law distribution with small jobs dominating;from their graphs it appears that at least 90% of the jobshave input sizes under 100 GB. Chen et al. [7] presenta detailed study of Hadoop workloads for Facebook as

1The paper states that input block sizes are usually 64 or 128 MBwith one map task per block, that over 80% of the jobs finish in 10minutes or less, and that 70% of these jobs very clearly use 100 orfewer mappers (Figure 2 in [10]). Therefore conservatively assuming128 MB per block, 56% of the jobs have an input data set size of under12.5 GB.

2

Jobs on a Large Analytics Cluster

38D. Koop, CIS 602-02, Fall 2015

[R. Appuswamy et al., 2013]

http://research.microsoft.com/pubs/179615/msrtr-2013-2.pdf

Big Data or Small Data?• Many companies feel the need to overclaim the amount of data • "when you take a normal tech company and sprinkle on data, you

get the next Google" — [C. O'Neil] • Many large datasets are not useful • Twitter processes 8TB, but the tweets only take about 30GB… • Wikipedia can be downloaded onto a USB drive • All MP3s can be stored on a moderately sized disk array • Can learn a lot from a "small" dataset, e.g. sensors from a single

turbine, grocery store, Apple Watch • Small data focused on end-user, more timely insights?

39D. Koop, CIS 602-02, Fall 2015

Big Data or Small Data• Projects with big data concerns that require strategies and

infrastructure • Lots of small data waiting for analysis

40D. Koop, CIS 602-02, Fall 2015

Next Class• Data Cleaning

- Research paper - Volunteers to present?

• Reading Response: by 12pm noon before class • Assignment 1 due by 11:59pm (after class) • Topic interests • Be prepared for quizzes about the readings

41D. Koop, CIS 602-02, Fall 2015

Documents

Scalable Data Analysis (CIS 602-02)dkoop/cis602-2015fa/lectures/lecture05.pdf · While vanilla Hadoop performs poorly in a scale-up conﬁgurations, a series of optimizations makes