CIS 602-01: Scalable Data Analysis
Reproducibility
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017



Page 1: CIS 602-01: Scalable Data Analysis (dkoop/cis602-2017fa/lectures/lecture23.pdf)

CIS 602-01: Scalable Data Analysis

Reproducibility
Dr. David Koop

D. Koop, CIS 602-01, Fall 2017

Page 2:

Analyzing Text: Common words in Tom Sawyer

Word   Freq.   Use
the    3332    determiner (article)
and    2972    conjunction
a      1775    determiner
to     1725    preposition, verbal infinitive marker
of     1440    preposition
was    1161    auxiliary verb
it     1027    (personal/expletive) pronoun
in     906     preposition
that   877     complementizer, demonstrative
he     877     (personal) pronoun
I      783     (personal) pronoun
his    772     (possessive) pronoun
you    686     (personal) pronoun
Tom    679     proper noun
with   642     preposition

D. Koop, CIS 602-02, Fall 2015

Page 3:

Zipf's Law in Tom Sawyer

Word    Freq. f   Rank r   f · r
the     3332      1        3332
and     2972      2        5944
a       1775      3        5325
he      877       10       8770
but     410       20       8200
be      294       30       8820
there   222       40       8880
one     172       50       8600
about   158       60       9480
more    138       70       9660
never   124       80       9920
Oh      116       90       10440
two     104       100      10400

Word        Freq. f   Rank r   f · r
turned      51        200      10200
you'll      30        300      9000
name        21        400      8400
comes       16        500      8000
group       13        600      7800
lead        11        700      7700
friends     10        800      8000
begin       9         900      8100
family      8         1000     8000
brushed     4         2000     8000
sins        2         3000     6000
Could       2         4000     8000
Applausive  1         8000     8000

D. Koop, CIS 602-02, Fall 2015
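The f · r pattern in the tables above is easy to check on any text. A minimal sketch using only the standard library (the toy corpus is illustrative, not Tom Sawyer):

```python
import re
from collections import Counter

def zipf_table(text, ranks=(1, 2, 3, 10, 50, 100)):
    """Rank words by frequency and report f * r, which Zipf's law
    predicts is roughly constant across ranks."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words).most_common()
    return [(w, f, r, f * r) for r, (w, f) in enumerate(counts, start=1)
            if r in ranks]

text = "the " * 300 + "and " * 150 + "a " * 100  # toy corpus
for word, freq, rank, product in zipf_table(text, ranks=(1, 2, 3)):
    print(word, freq, rank, product)
```

Running this on a real novel shows the same rough constancy of f · r that the slide's tables display.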

Page 4:

Bigrams in the New York Times

Filtered common bigrams in the NYT:

Frequency   Word 1      Word 2      POS pattern
11487       New         York        A N
7261        United      States      A N
5412        Los         Angeles     N N
3301        last        year        A N
3191        Saudi       Arabia      N N
2699        last        week        A N
2514        vice        president   A N
2378        Persian     Gulf        A N
2161        San         Francisco   N N
2106        President   Bush        N N
2001        Middle      East        A N
1942        Saddam      Hussein     N N
1867        Soviet      Union       A N
1850        White       House       A N
1633        United      Nations     A N
1337        York        City        N N
1328        oil         prices      N N
1210        next        year        A N
1074        chief       executive   A N
1073        real        estate      A N

D. Koop, CIS 602-02, Fall 2015
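Counting bigram frequencies like those above takes only a few lines; the POS-pattern filtering in the table would additionally need a tagger, so this sketch counts raw bigrams only:

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "New York is in New York State".split()
for (w1, w2), freq in bigram_counts(tokens).most_common(3):
    print(freq, w1, w2)
```

Filtering to patterns like "A N" (adjective noun) is what removes high-frequency but uninteresting pairs such as "of the".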

Page 5:

Text Visualization: Word Tree

Fig 9. Word tree of the King James Bible showing all occurrences of "love the."
Fig 10. Word Tree showing all occurrences of "I have a dream" in Martin Luther King's historical speech.

D. Koop, CIS 602-02, Fall 2015

[Wattenberg & Viegas, 2007]

Page 6:

Positive or negative movie review?
• Unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• This is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.

D. Koop, CIS 602-01, Fall 2017

[D. Jurafsky]


Page 11:

Twitter Mood Predicts the Stock Market

D. Koop, CIS 602-01, Fall 2017

[J. Bollen et al., 2011]

Page 12:

Twitter Mood Predicts the Stock Market

CALM predicts the Dow Jones Industrial Average 3 days later

D. Koop, CIS 602-01, Fall 2017

[J. Bollen et al., 2011]

Page 13:

Why sentiment analysis?
• Movie: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment

D. Koop, CIS 602-01, Fall 2017

[D. Jurafsky]

Page 14:

Sentiment Analysis Tasks
• Simplest task:
  - Is the attitude of this text positive or negative?
• More complex:
  - Rank the attitude of this text from 1 to 5
• Advanced:
  - Detect the target, source, or complex attitude types

D. Koop, CIS 602-01, Fall 2017

[D. Jurafsky]

Page 17:

Features for Sentiment Classification
• How to handle negation?
  - "I didn't like this movie" vs. "I really like this movie"
  - Add NOT_ to words after a negation and before the next punctuation
  - "didn't like this movie, but I" → "didn't NOT_like NOT_this NOT_movie, but I"
• Which words to use?
  - Only adjectives
  - All words
  - All words turns out to work better on some data

D. Koop, CIS 602-01, Fall 2017

[D. Jurafsky]
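The NOT_ transformation for negation can be sketched as follows; the set of negation words is a small illustrative sample, not an exhaustive lexicon:

```python
import re

def mark_negation(text):
    """Prepend NOT_ to tokens that follow a negation word,
    up to the next punctuation mark."""
    negations = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't", "can't"}
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negated = [], False
    for tok in tokens:
        if tok in ".,!?;":
            negated = False       # punctuation ends the negation scope
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in negations:  # start negating from the next token
                negated = True
    return " ".join(out)

print(mark_negation("didn't like this movie, but I enjoyed the cast"))
```

The negated tokens then act as distinct features for the classifier, so "like" and "NOT_like" get separate weights.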

Page 18:

Hard Cases
• Perfume review in "Perfumes: the Guide": "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut."
• Dorothy Parker on Katharine Hepburn: "She runs the gamut of emotions from A to B"
• "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up."
• "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised."

D. Koop, CIS 602-01, Fall 2017

[D. Jurafsky]

Page 19:

Scalable Sentiment Analysis
• Challenges:
  - Different lexicon
    • Twitter is not formal language
    • Existing lexicons don't consider misspellings
  - Volume of tweets
  - Time to process
• Use MapReduce
• Components:
  - Lexicon Builder
  - Sentiment Classifier

D. Koop, CIS 602-01, Fall 2017

[Khuc et al., 2012]

Page 20:

Scalable Sentiment Analysis Architecture

D. Koop, CIS 602-01, Fall 2017

In our lexicon builder, tweets are first normalized. The reason is that Twitter users frequently express their sentiments by repeating characters in words, such as "soooo", "gooodddd", etc. Assuming that the words "gooodddd" and "goooodd" have the same sentiment, we map them into the same word "goodd", where every "ooo" and "ddd" are replaced by "oo" and "dd" respectively. We argue that this way of mapping is useful since even if the word "gooddd" does not occur in the training data, its mapped form is still included in the lexicon (this normalization technique is similar to the technique used by Go et al. in [12]). We utilize a Twitter POS tagger (http://code.google.com/p/ark-tweet-nlp/) created by Gimpel et al. [11] to extract (only) nouns, adjectives, adverbs, verbs, interjections, abbreviations, hashtags, and emoticons from tweets, since only these words can carry sentiment. This preprocessing step is done so as to reduce the total number of nodes in the graph.
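The character-repetition normalization described above ("gooodddd" → "goodd") can be sketched with a single regular expression that collapses any run of three or more identical characters down to exactly two:

```python
import re

def normalize_repeats(word):
    """Collapse runs of 3+ identical characters to exactly 2,
    so 'gooodddd' and 'goooodd' both map to 'goodd'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(normalize_repeats("soooo"))
print(normalize_repeats("gooodddd"))
print(normalize_repeats("goooodd"))
```

Collapsing to two characters (rather than one) keeps legitimate doubled letters like "good" distinct from "god".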

Figure 1. Architecture of the large-scale distributed system for real-time Twitter sentiment analysis.

3.1.1 Generating co-occurrence matrix
A MapReduce job is used for computing co-occurrences between words/phrases in the training tweets. Phrases are formed by connecting continuous words. In our experiment, we only used bi-gram phrases. The co-occurrence matrix is stored in an HBase table described in Table 1, where k_ij is the number of co-occurrences between w_i and w_j.

Table 1. Co-occurrence table

key   column family "cooccurrence"
w_1   w_1:k_11 … w_n:k_1n
…     …
w_n   w_1:k_n1 … w_n:k_nn

3.1.2 Computing cosine similarity and creating the graph of words

The cosine similarity between two words w_i and w_j is computed by dividing the inner product of their co-occurrence vectors, with coordinates (k_i1, k_i2, …, k_in) and (k_j1, k_j2, …, k_jn) respectively, by the product of their lengths. In order to avoid re-computing vector lengths, all vectors w_i are first normalized to unit vectors w'_i, and the cosine similarity values are then calculated as the pairwise dot products between the unit vectors w'_i = (k'_i1, k'_i2, …, k'_in):

cosine_sim(w_i, w_j) = Σ_c k'_ic · k'_jc    (1)

A MapReduce algorithm for computing the pairwise inner products of a large collection of vectors has been proposed by Elsayed et al. [9]. The idea is that the index c contributes to the sum in formula (1) if and only if both coordinates k'_ic and k'_jc are nonzero. If we denote by C_i the set of indices c for which the coordinates k'_ic of vector w'_i are nonzero, then (1) can be rewritten as:

cosine_sim(w_i, w_j) = Σ_{c ∈ C_i ∩ C_j} k'_ic · k'_jc    (2)

Vector length normalization and the set of indices Ci are calculated in one MapReduce job as described in Algorithm 1a.

Mappers: For each vector w_i,
  a) Compute its length len_i.
  b) For every non-zero coordinate k_ic, emit <w_c, <w_i, norm_ic>> where norm_ic = k_ic / len_i.
Reducers: Do nothing.

Algorithm 1a. Vector normalization and non-zero coordinate indexing

Another MapReduce job is executed for calculating the cosine similarity scores and creating the graph table, as described in Algorithm 1b. The output graph table has a schema with a column family "weight" (similar to the co-occurrence table).

Mappers: For each key w_c,
  For every pair of its values <w_u, norm_uc> and <w_v, norm_vc>, emit <<w_u, w_v>, cosine_sim_c> where cosine_sim_c = norm_uc * norm_vc.
Reducers: For each key <w_u, w_v>,
  a) Take the cosine similarity score between w_u and w_v as the sum of all associated values: cosine_sim(w_u, w_v) = Σ_c cosine_sim_c.
  b) If cosine_sim(w_u, w_v) ≥ α, insert into the graph table two edges (w_u, w_v) and (w_v, w_u) with the same weight cosine_sim(w_u, w_v).

Algorithm 1b. Cosine similarity calculation.
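Algorithms 1a and 1b can be simulated in memory. This sketch is plain Python rather than an actual MapReduce job, and the toy co-occurrence vectors are invented; the grouping by context word mimics what the shuffle phase would do:

```python
import math
from collections import defaultdict

# Toy co-occurrence vectors: word -> {context word: count}
vectors = {
    "good":  {"movie": 4, "film": 2, "plot": 1},
    "great": {"movie": 3, "film": 3},
    "bad":   {"movie": 1, "boxing": 5},
}

# "Job 1a": normalize each vector and index its non-zero coordinates
# by context word (the mapper's emit key).
index = defaultdict(list)  # context word c -> [(word, normalized coordinate)]
for w, coords in vectors.items():
    length = math.sqrt(sum(k * k for k in coords.values()))
    for c, k in coords.items():
        index[c].append((w, k / length))

# "Job 1b": within each context word, every pair of indexed entries
# contributes one partial product; summing over context words gives
# the cosine similarity, per formula (2).
sims = defaultdict(float)
for c, entries in index.items():
    for i, (wu, nu) in enumerate(entries):
        for wv, nv in entries[i + 1:]:
            sims[tuple(sorted((wu, wv)))] += nu * nv

for pair, s in sorted(sims.items()):
    print(pair, round(s, 3))
```

Only word pairs sharing at least one context word ever meet in the same group, which is exactly the sparsity that formula (2) exploits.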

3.1.3 Removing edges with low weights
The purpose of this step is to keep only the edges with the highest weights, so that we can avoid non-sentiment words and improve the speed of sentiment score propagation. An edge (w_i, w_j) is kept if it is one of the TOP_N highest-weighted edges adjacent to nodes w_i and w_j. The MapReduce algorithm for removing low-weighted edges is described in Algorithm 2 below.

Mappers: For each key w_i in the graph table,
  Calculate the list L1 of the TOP_N highest-weighted edges adjacent to w_i and the list L2 = L \ L1, where L is the list of all edges adjacent to w_i.
  a) For each edge e ∈ L1, emit <e, L1>
  b) For each edge e' ∈ L2, emit <e', ∅>
Reducers: For each key e with two associated values L', L'',
  a) If one of the associated values is ∅, remove edge e from the graph table.
  b) If e is not among the TOP_N highest-weighted edges in the combined list L = L' ∪ L'', edge e is also removed.

Algorithm 2. Removing edges with low weights
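The pruning rule in Algorithm 2 — keep an edge only if both endpoints rank it among their TOP_N heaviest — can be simulated in memory. The graph below is a made-up toy, and the vote counting stands in for the reducer's check:

```python
from collections import defaultdict

TOP_N = 2

# Toy weighted graph: node -> {neighbor: edge weight}
graph = {
    "good":  {"great": 0.9, "nice": 0.8, "movie": 0.2},
    "great": {"good": 0.9, "nice": 0.7},
    "nice":  {"good": 0.8, "great": 0.7},
    "movie": {"good": 0.2},
}

# Map phase: each node "votes" for its TOP_N heaviest incident edges.
votes = defaultdict(int)
for node, nbrs in graph.items():
    top = sorted(nbrs, key=nbrs.get, reverse=True)[:TOP_N]
    for nbr in top:
        votes[frozenset((node, nbr))] += 1

# Reduce phase: keep an edge only if both endpoints voted for it.
kept = {e for e, v in votes.items() if v == 2}
print(sorted(tuple(sorted(e)) for e in kept))
```

Here the weak "good"–"movie" edge is dropped even though it is "movie"'s best edge, because "good" did not rank it in its own top 2.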

3.1.4 Sentiment score propagation This is the main algorithm which propagates sentiment scores from seed nodes into other nodes which are located within a distance D from them. Score propagation is implemented with the help of an HBase table named “propagate”. In this table, nodes reachable from a seed node are stored as qualifiers in a column family named “visited” in the row with the key equal to the seed node’s id. The MapReduce implementation of sentiment score propagation is described in Algorithm 3.

Page 21:

Final Project
• Presentations: Saturday, Dec. 16 (8-11am), Sorry :(
• Presentation (5-6 minutes)
  - Brief Introduction to Dataset/Problem (1 minute)
  - Scalability Challenges (1-2 minutes)
  - Results (2-3 minutes)
• Report
  - Expand proposal into technical report (figures, tables, references)
• Code
  - Include README
  - Source with build/run instructions
  - Link to data (make subset available to me if not publicly available)

D. Koop, CIS 602-01, Fall 2017

Page 22:

Reproducibility: Gene Names and Excel

D. Koop, CIS 602-01, Fall 2016

Page 23:

Gene names and Excel
• SEPT2 → 9/2/2016, MARCH1 → 3/1/2016
• First cited in 2004
• Blog posts:
  - https://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel-for-biological-data/
  - http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not-learned/
• Studied supplemental data from 18 journals, 35,175 Excel files
• Errors increased at an annual rate of 15% over the past five years
• Not just Excel: e.g., Apache OpenOffice Calc is also affected

D. Koop, CIS 602-01, Fall 2016

["Gene name errors are widespread in the scientific literature", M. Ziemann et al., 2016]
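A quick guard against this class of error is to scan a gene list for symbols that spreadsheet software interprets as dates. This is a minimal sketch: the regex covers only the month-abbreviation symbols like SEPT2 and MARCH1, not every coercion a spreadsheet performs:

```python
import re

# Month-like gene symbols (e.g. SEPT2, MARCH1, DEC1) that Excel
# silently converts to dates when a file is opened or saved.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)\d{1,2}$",
    re.IGNORECASE,
)

def flag_date_like(genes):
    """Return the gene symbols a spreadsheet would likely coerce to dates."""
    return [g for g in genes if DATE_LIKE.match(g)]

print(flag_date_like(["SEPT2", "MARCH1", "TP53", "BRCA1", "DEC1"]))
```

A more robust fix is to avoid opening gene tables in a spreadsheet at all, or to import the gene-name column explicitly as text.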

Page 24:

Affected Gene Lists

D. Koop, CIS 602-01, Fall 2016

["Gene name errors are widespread in the scientific literature", M. Ziemann et al., 2016]

Page 25:

One could teach a course on reproducibility.

D. Koop, CIS 602-01, Fall 2017

Page 26:

I did. (Fall 2016)

D. Koop, CIS 602-01, Fall 2017

Page 27:

Reproducibility Overview
• Code Availability, Collaboration, and Version Control
• Data Availability, Citation, and Curation
• Virtual Machines and Containers
• Scientific Workflows
• Provenance
• Tools
• Numerical Reproducibility & Scalability
• Cultural, Ethical, and Legal Challenges

D. Koop, CIS 602-01, Fall 2016

Page 28:

Importance of Reproducibility

D. Koop, CIS 602-01, Fall 2016

Page 29:

Better Understanding of how to do Reproducible Work

D. Koop, CIS 602-01, Fall 2016

Page 30:

Rules for Reproducible Computational Research
• Rule 1: For Every Result, Keep Track of How It Was Produced
• Rule 2: Avoid Manual Data Manipulation Steps
• Rule 3: Archive the Exact Versions of All External Programs Used
• Rule 4: Version Control All Custom Scripts
• Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

D. Koop, CIS 602-01, Fall 2016

[Sandve et al., 2013]

Page 31:

Rules for Reproducible Computational Research
• Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
• Rule 7: Always Store Raw Data behind Plots
• Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
• Rule 9: Connect Textual Statements to Underlying Results
• Rule 10: Provide Public Access to Scripts, Runs, and Results

D. Koop, CIS 602-01, Fall 2016

[Sandve et al., 2013]
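Rule 6 in practice: pick a seed explicitly, record it alongside the results, and reuse it to regenerate them. A minimal sketch (the seed value and record structure are illustrative):

```python
import json
import random

seed = 20171116          # chosen explicitly, never left implicit
random.seed(seed)
sample = [random.randint(0, 100) for _ in range(5)]

# Record the seed next to the result so the run can be regenerated exactly.
record = json.dumps({"seed": seed, "sample": sample})

# Reproducing: re-seeding yields the identical sequence.
saved = json.loads(record)
random.seed(saved["seed"])
assert [random.randint(0, 100) for _ in range(5)] == saved["sample"]
```

The same idea applies to NumPy, ML frameworks, and shuffling in distributed jobs: any stochastic step whose seed is not recorded makes the result unreproducible.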

Page 32:

Scalability Concerns?

D. Koop, CIS 602-01, Fall 2017

Page 33:

Scalability Concerns?
• Rule 1: For Every Result, Keep Track of How It Was Produced
• Rule 2: Avoid Manual Data Manipulation Steps
• Rule 3: Archive the Exact Versions of All External Programs Used
• Rule 4: Version Control All Custom Scripts
• Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

D. Koop, CIS 602-01, Fall 2016

[Sandve et al., 2013]

Page 34:

Scalability Concerns?
• Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
• Rule 7: Always Store Raw Data behind Plots
• Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
• Rule 9: Connect Textual Statements to Underlying Results
• Rule 10: Provide Public Access to Scripts, Runs, and Results

D. Koop, CIS 602-01, Fall 2016

[Sandve et al., 2013]

Page 35:

Enabling Reproducibility in Big Data Research
V. Stodden

D. Koop, CIS 602-01, Fall 2017

Page 36:

Reliable Scientific Conclusions in the Big Data Era
• "Ubiquity of Error":
  - Mistakes and self-delusion can creep in absolutely anywhere
  - Much of a scientist's effort is expended in recognizing and rooting out error
• "Computations are frequently of such a complexity that an explanation sufficiently detailed to enable others to replicate the results is not possible in a typical scientific publication."

D. Koop, CIS 602-01, Fall 2017

Page 37:

Scale of Data (LSST)
• Image every 15 seconds
• 100PB over 10 years

D. Koop, CIS 602-01, Fall 2017

[http://www.lsst.org]

Page 38:

Scalable Data Science

D. Koop, CIS 602-01, Fall 2017

Page 39:

Scalability
• "Big Data"
  - What is "big"? For whom is it "big"?
  - Variety, velocity, volume, …
• Lots of data that used to be big is not an issue now
• Understanding the scalability of techniques is important
• There will always be larger datasets, so we want to understand:
  - how methods scale
  - performance bounds
  - storage constraints

D. Koop, CIS 602-01, Fall 2017

Page 40:

Scalable Visualization Challenges
• Visual clutter: How can we avoid visual clutter like overlaps and crossings?
• Performance issues: How can we render huge datasets in real time with rich interactions?
• Limited cognition: How can users understand the visual representation when the information is overwhelming?

D. Koop, CIS 602-01, Fall 2017

[CY Lin, Columbia University, E6893 Big Data Analytics, Lecture 10: Data Visualization, 2016]

Page 41:

Progressive Visualization

D. Koop, CIS 602-01, Fall 2017

(Excerpt from Zgraggen et al., "How Progressive Visualizations Affect Exploratory Analysis":)

…the currently applied brush, will force all other visualizations (i.e., v1, v3, and v4) to recompute. Depending on the visualization condition, v1, v3, and v4 will either show the results of the new brushing action instantaneously, a loading animation until the full result is available, or incrementally updated progressive visualizations. The system does not perform any form of result caching.

Upon startup, the system loads the selected dataset into memory. The system randomly shuffles the data points in order to avoid artifacts caused by the natural order of the data and to improve convergence of progressive computations [46]. We approximate the instantaneous visualization condition by computing queries over the entire dataset as quickly as possible. Initial microbenchmarks for a variety of queries over the small datasets used in the experiment yielded the following measurements: time to compute a result ≈ 100 ms and time to render a visualization ≈ 30 ms. Although not truly instantaneous, a total delay of only ≈ 130 ms is well below the guideline of one second, therefore allowing us to simulate the instantaneous condition. We simulate the blocking condition by artificially delaying the rendering of a visualization by the number of seconds specified through the dataset-delay factor. In other words, we synthetically prolong the time necessary to compute a result in order to simulate larger datasets. While a blocking computation is ongoing, we display a simple loading animation, but users can still interact with other visualization panels, change the ongoing computation (e.g., selecting a different subset), or replace the ongoing computation with a new query (e.g., changing an axis). Fig. 4 (top) shows an example of a blocking visualization over time.

We implemented the progressive condition by processing the data in chunks of 1,000 data points at a time, with approximate results displayed after each chunk. In total, we refresh the progressive visualization 10 times, independent of the dataset-delay factor. We display the first visualization as quickly as possible, with subsequent updates appropriately delayed so that the final accurate visualization is displayed after the specified dataset-delay condition. Note that the initial min and max estimates might change after seeing additional data, in which case we extend the visualization by adding bins of the same width to accommodate new incoming data. Throughout the incremental computation, we display a progress indication in the bottom left corner of a visualization (Fig. 3d). Progressive visualizations are augmented with error metrics indicating that the current view is only an approximation of the final result. Even though 95 percent confidence intervals based on the standard error have been shown to be problematic to comprehend in certain cases [47], we still opted to use them for bar charts due to their wide usage and familiarity. We render labels with margins of error (e.g., "±3%") in each bin of a 2D-histogram. Fig. 4 (bottom) shows an example of a progressive visualization and how it changes over time. Note that the confidence intervals in the example are rather small, but their size can change significantly based on the dataset and query.

Our system is implemented in C# / Direct2D. We tested our system and ran all sessions on a quad-core 3.60 GHz, 16 GB RAM, Microsoft Windows 10 desktop machine with a 16:9 format, 1,920 × 1,080 pixel display.

3.4 Procedure
We recruited 24 participants from a research university in the US. All participants were students (22 undergraduate, two graduate), all of whom had some experience with data exploration or analysis tools (e.g., Excel, R, Pandas). 21 of the participants were currently enrolled in and halfway through an introductory data science course. Our experiment included visualization condition as a within-subject factor and dataset-delay and dataset-order as between-subject factors. Note that the dataset-delay has no direct influence on the instantaneous condition, but it is used as a between-subject factor to test if it affects the other visualization conditions. To control against ordering and learning effects, we fully randomized the sequence in which we presented the different visualization conditions to the user and counterbalanced across dataset-delay conditions. Instead of fully randomizing the ordering of datasets, we opted to create two predefined dataset-orderings and factored them into our analysis. The two possible dataset-ordering values were 123 and 312 (i.e., DS1 followed by DS2 followed by DS3, and DS3 followed by DS1 followed by DS2, respectively), and we again counterbalanced across dataset-ordering. We seed our system's random number generator differently for each session to account for possible effects in the progressive condition caused by the sequence in which data points are processed. While each participant did not experience all possible combinations of visualization, dataset-ordering, and dataset-delay conditions, all users saw all visualization conditions with one specific dataset-delay and dataset-ordering. For example, participant P14 was assigned dataset-ordering 312, dataset-delay 6 s, and visualization condition ordering [blocking, instantaneous, progressive]. In total, this adds up to four trial sub-groups, each with six participants: (dataset-delay = 6 s & dataset-order = 123), (dataset-delay = 12 s & dataset-order = 123), (dataset-delay = 6 s & dataset-order = 312), and (dataset-delay = 12 s & dataset-order = 312). Within each subgroup, we fully randomized the order in which we presented the different visualization conditions.

After a 15-minute tutorial of the system, including a summary of the visualization conditions and how to interpret confidence intervals and margins of error, we instructed the participants to perform three exploration sessions. In the case of participant P14, these three sessions included (1) blocking visualizations on DS3, (2) instantaneous visualizations on DS1, and (3) progressive visualizations on DS2, all with 6 s

Fig. 4. The blocking and progressive visualization conditions.

[Zgraggen et al., "How Progressive Visualizations Affect Exploratory Analysis", p. 1981]

Page 42:

Data Science Tools

D. Koop, CIS 602-01, Fall 2017

[KDNuggets]

Page 43:

The Cloud: Scaling Up

D. Koop, CIS 602-01, Fall 2017

[Haeberlen and Ives, 2015]

PC → Server → Cluster → Data center → Network of data centers

Page 48: CIS 602-01: Scalable Data Analysis

Cloud Workloads

38 · D. Koop, CIS 602-01, Fall 2017

Figure 2: Log-log scale inverted CDF of job durations (x-axis: job duration in hours, 10⁻² to 10³; y-axis: number of jobs longer, 10³ to 10⁶). Only the duration for which the job runs during the trace time period is known; thus, for example, we do not observe durations longer than around 700 hours. The thin, black line shows all jobs; the thick line shows production-priority jobs; and the dashed line shows non-production priority jobs.

types of jobs besides their actual resource usage and duration: even 'scheduling class', which the trace providers say represents how latency-sensitive a job is, does not separate short-duration jobs from long-duration jobs. Nevertheless, the qualitative difference in the aggregate workloads at the higher and lower priorities shows that the trace is both unlike batch workload traces (which lack the combination of diurnal patterns and very long-running jobs) and unlike interactive services (which lack large sets of short jobs with little pattern in demand).

3.3 Job durations

Job durations range from tens of seconds to essentially the entire duration of the trace. Over 2000 jobs (from hundreds of distinct users) run for the entire trace period, while a majority of jobs last for only minutes. We infer durations from how long tasks are active during the one-month time window of the trace; jobs which are cut off by the beginning or end of the trace are a small portion (< 1%) of jobs and consist mostly of jobs which are active for at least several days, and so are not responsible for us observing many shorter job durations. These come from a large portion of the users, so it is not likely that the workload is skewed by one particular individual or application. Consistent with our intuition about priorities correlating with job types, production priorities have a much higher proportion of long-running jobs and the 'other' priorities have a much lower proportion. But slicing the jobs by priority or 'scheduling class' (which the trace providers say should reflect how latency-sensitive a job is) reveals a similar heavy-tailed distribution shape with a large number of short jobs.
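The "jobs longer than" curve plotted in Figure 2 is an inverted CDF. A minimal sketch of how such a curve is computed from a list of job durations (illustrative only, not the paper's trace-analysis code):

```python
def inverted_cdf(durations):
    """For each distinct duration d, count the jobs that run at least d
    (the 'jobs longer than' quantity on the y-axis of Figure 2)."""
    return [(d, sum(1 for x in durations if x >= d))
            for d in sorted(set(durations))]

curve = inverted_cdf([1, 1, 2, 5])  # toy durations in hours
```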

3.4 Task shapes

Each task has a resource request, which should indicate the amount of CPU and memory space the task will require. (The requests are intended to represent the submitter's predicted "maximum" usage for the task.) Both the amount of the resources requested and the amount actually used by tasks varies by several orders of magnitude; see Figures 3 and 4, respectively. These are not just outliers. Over 2000 jobs request less than 0.0001 normalized units of memory per task, and over 8000 jobs request more than 0.1 units of memory per task. Similarly, over 70000 jobs request less than

Figure 3: CDF of instantaneous task requested resources. (1 unit = max machine size.) These are the raw resource requests in the trace; they do not account for task duration.

Figure 4: Log-log scale inverted CDF of instantaneous median job usage, accounting for both varying per-task usage and varying job task counts.

0.0001 units of CPU per task, and over 8000 request more than 0.1 units of CPU. Both tiny and large resource-requesting jobs include hundreds of distinct users, so it is not likely that the particularly large or small requests are caused by the quirky demands of a single individual or service.

We believe that this variety in task "shapes" has not been seen in prior workloads, if only because most schedulers simply do not support this range of sizes. The smallest resource requests are likely so small that it would be difficult for any VM-based scheduler to run a VM using that little memory. (0.0001 units would be around 50 MB if the largest machines in the cluster had 512 GB of memory.) Also, any slot-based scheduler, which includes all HPC and Grid installations we are aware of, would be unlikely to have thousands of slots per commodity machine.

The ratio between CPU and memory requests also spans a large range. The memory and CPU request sizes are correlated, but weakly (linear regression R² ≈ 0.14). A large number of jobs request 0 units of CPU; presumably they require so little CPU they can depend on running in the 'left-over' CPU of a machine, and it makes little sense to talk about the CPU:memory ratio without adjusting these. Rounding these requests to the next largest request size, the

[Reiss et al., 2012]
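The "weakly correlated" claim above is quantified with a linear-regression R². A minimal sketch of how that coefficient is computed for two request-size series (a generic helper, not the authors' analysis code):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs:
    R^2 = Sxy^2 / (Sxx * Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```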

Page 49: CIS 602-01: Scalable Data Analysis

MapReduce

39 · D. Koop, CIS 602-01, Fall 2017

Distributing Costly Computation: e.g. Rendering Map Tiles

Input: a geographic feature list (I-5, Lake Washington, WA-520, I-90)
Map: emit each feature to all overlapping latitude-longitude rectangles, e.g. (N, I-5), (N, Lake Wash.), (N, WA-520), (S, I-5), (S, Lake Wash.), (S, I-90)
Shuffle: sort by key (key = rectangle id); this is the bucket pattern
Reduce: render each tile using data for all enclosed features (parallel rendering)
Output: rendered tiles

[Google]
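The tile-rendering pipeline above can be sketched as a single-process simulation of map, shuffle, and reduce. The `overlaps` table and `render` function are toy stand-ins for the real geographic computation:

```python
from collections import defaultdict

def map_reduce(features, overlaps, render):
    # Map: emit each feature under every overlapping rectangle id
    emitted = [(tile, f) for f in features for tile in overlaps[f]]
    # Shuffle: group emitted pairs by key (the rectangle id)
    buckets = defaultdict(list)
    for tile, f in emitted:
        buckets[tile].append(f)
    # Reduce: render each tile from all enclosed features (parallelizable)
    return {tile: render(tile, fs) for tile, fs in buckets.items()}

# Toy feature -> overlapping-tile table from the slide's example
overlaps = {"I-5": ["N", "S"], "Lake Wash.": ["N", "S"],
            "WA-520": ["N"], "I-90": ["S"]}
tiles = map_reduce(list(overlaps), overlaps,
                   lambda t, fs: "tile " + t + ": " + ", ".join(sorted(fs)))
```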

Page 50: CIS 602-01: Scalable Data Analysis

Spark and RDDs

40 · D. Koop, CIS 602-01, Fall 2017

Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.

map to the parent’s records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.⁷

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory. Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

⁷Note that our union operation does not drop duplicate values.

Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3.

Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.
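The stage-cutting rule described in §5.1 (cut at wide dependencies, pipeline narrow ones) can be sketched over a toy lineage graph. This is an illustrative reconstruction, not Spark's scheduler; the graph and edge labels are hypothetical:

```python
def build_stages(target, deps, wide):
    """Cut the lineage graph of `target` into stages: each wide dependency
    starts a new stage, while narrow dependencies are pipelined into the
    current stage. Parent stages are listed before the final stage."""
    stages = []

    def visit(node, stage):
        stage.append(node)
        for parent in deps.get(node, []):
            if (parent, node) in wide:       # shuffle boundary: new stage
                s = []
                visit(parent, s)
                stages.append(s)
            else:                            # narrow: pipeline into this stage
                visit(parent, stage)

    final = []
    visit(target, final)
    stages.append(final)
    return stages

# Toy lineage: B = groupBy(A) (wide), F = map(E) (narrow),
# G = join(B, F) where B->G is co-partitioned (narrow) and F->G is wide
deps = {"G": ["B", "F"], "B": ["A"], "F": ["E"]}
wide = {("A", "B"), ("F", "G")}
stages = build_stages("G", deps, wide)
```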

[Zaharia et al., 2012]


Page 51: CIS 602-01: Scalable Data Analysis

Data Cleaning By Example

41 · D. Koop, CIS 602-01, Fall 2017

FOOFAH: A Programming-By-Example System for Synthesizing Data Transformation Programs

Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H. V. Jagadish, University of Michigan

Most real-world data is unstructured and must be transformed into a structured form to be used. Manual transformation (e.g., using Excel) requires too much user effort. Traditional programmatic transformation requires programming skills beyond those of most users. Data transformation tools, like Data Wrangler [1], often require repetitive and tedious work and a depth of data transformation knowledge from the user. Our goal: minimize a user's effort and reduce the required background knowledge for data transformation tasks.

Motivation

Design of FOOFAH

Related Work

[email protected] · https://markjin1990.github.io · SIGMOD 2017

Our Solution

Proposed Heuristic Function

1. Kandel, Sean, et al. "Wrangler: Interactive visual specification of data transformation scripts." CHI, 2011.
2. V. Raman and J. M. Hellerstein. "Potter's Wheel: An interactive data cleaning system." VLDB, 2001.
3. Gulwani, Sumit. "Automating string processing in spreadsheets using input-output examples." ACM SIGPLAN Notices, Vol. 46, No. 1. ACM, 2011.
4. Harris, William R., and Sumit Gulwani. "Spreadsheet table transformations from examples." ACM SIGPLAN Notices, Vol. 46, No. 6. ACM, 2011.
5. Barowy, Daniel W., et al. "FlashRelate: extracting relational data from semi-structured spreadsheets using examples." ACM SIGPLAN Notices, Vol. 50, No. 6. ACM, 2015.
6. Guo, Philip J., et al. "Proactive wrangling: mixed-initiative end-user programming of data transformation scripts." Proceedings of the 24th annual ACM symposium on User Interface Software and Technology. ACM, 2011.

User Study

Our PBE technique prototype FOOFAH:
1. can handle most test cases from the benchmarks
2. requires little user effort
3. is generally efficient (low system runtime)

Benchmark Tests

Tasks: 8 tasks from benchmarks covering both simple and complex tasks
Comparison: Wrangler

• FOOFAH on average requires 60% less user effort than Wrangler

Chart: task completion time in seconds (0–600), Wrangler vs. Foofah.

Chart: sizes of input-output examples required for benchmark tests (50% of tests needed 1 record, 40% needed 2 records, 10% failed).

Chart: worst-case system runtime for each synthesis (74% of test scenarios within 1 sec, 86% within 5 sec, 88% within 30 sec).

Tasks: 50 test scenarios selected from [1, 2, 4, 6]
Test approach: lazy approach [4]
Comparison: [1, 3, 4, 5]

Given an input example e_i and an output example e_o, finding a transformation is a search problem solved by the A* algorithm:

nodes: different views of the data
edges: operations
A* search: iteratively explore the node with minimum f(n) = g(n) + h(n), where g(n) is the observed distance and h(n) is the estimated distance

Intuition: Most data transformation operations can be seen as many cell-level transformation operations.

Solution: Table Edit Distance as the heuristic function.

Table Edit Distance (TED) definition: the cost of transforming Table T1 to Table T2 using the cell-level operators Add/Remove/Move/Transform cell:

TED(T1, T2) = min over (op_1, ..., op_k) ∈ P(T1, T2) of Σ_i cost(op_i)

• P(T1, T2): set of all "paths" transforming T1 to T2 using cell-level operators

Batching: a remedy that scales down the Table Edit Distance heuristic by batching geometrically-adjacent cell-level operations of the same type, e.g., 8 Transform operations become 2 "batched" Transform operations.
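FOOFAH's search can be sketched with a generic A* over states. The toy states and operations below are hypothetical stand-ins for intermediate tables and Potter's Wheel operators; the heuristic here is a trivial length difference, not the batched TED:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Generic A*: repeatedly expand the frontier node with minimum
    f(n) = g(n) + h(n); returns the operation sequence reaching `goal`."""
    frontier = [(h(start, goal), 0, start, [])]
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in best_g and best_g[node] <= g:
            continue
        best_g[node] = g
        for op, nxt, cost in neighbors(node):
            heapq.heappush(frontier,
                           (g + cost + h(nxt, goal), g + cost, nxt, path + [op]))
    return None

# Toy stand-in: a "table" is a tuple of rows; ops drop the first row
# or uppercase every row, each at unit cost
def neighbors(state):
    if not state:
        return []
    return [("drop first row", state[1:], 1),
            ("uppercase", tuple(s.upper() for s in state), 1)]

path = a_star(("a", "b", "c"), ("b", "c"), neighbors,
              lambda s, goal: abs(len(s) - len(goal)))
```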

Chart: success rates on pure layout transformation benchmark tasks: Foofah 88.4%, FlashRelate 97.7%, ProgFromEx 74.4%, Wrangler 55.8%.

Chart: success rates on benchmark tasks requiring syntactic transformations: Foofah 100%, FlashRelate 0%, ProgFromEx 0%, Wrangler 85.7%.

Program to synthesize: a loop-free Potter's Wheel [2] program.

System diagram: Raw Data + Input-output Example → System → Synthesized Program

Programming-By-Example interaction model: User provides input-output examples rather than demonstrating correct operations

Note: Ideally, Wrangler should be able to handle the same tasks as FOOFAH.

User Input:
• Sample from raw data
• Transformed view of the sample

Raw Data:
• A grid of values, i.e., spreadsheets
• "Somewhat" structured: must have some regular structure or be automatically generated

Transformations Targeted:
1. Layout transformation
2. String transformation

String transformation example: 05/16/2017 → 05-16-2017; 05/17/2017 → 05-17-2017

[Z. Jin et al., 2017]

Page 52: CIS 602-01: Scalable Data Analysis

Data Cleaning Using Samples

42 · D. Koop, CIS 602-01, Fall 2017

Figure 2: The SampleClean framework. Dirty data flows through sample creation to a dirty sample, which data cleaning turns into a cleaned sample; aggregate queries are answered by result estimation (RawSC or NormalizedSC), producing results with confidence intervals.

entire table. To determine it, one way would be to estimate its value from the sample. However, both analytical proofs and empirical tests have shown that this method can lead to highly inaccurate query results [10]. Therefore, in our paper, we determine the duplication factor from the complete relation.

It is important to note, however, that compared to full cleaning, we only need to determine the duplication factor for those tuples in the sample. As with other uses of sampling, this can result in significant cost savings in duplicate detection. In the following, we will describe how to apply existing deduplication techniques to compute the duplication factor, and explain why it is cheaper to determine the duplication factor for a sample of the data, even though doing so requires access to the complete relation.

Duplicate detection (also known as entity resolution) aims to identify different tuples that refer to the same real-world entity. This problem has been extensively studied for several decades [22]. Most deduplication approaches consist of two phases:

1. Blocking. Due to the large (quadratic) cost of all-pair comparisons, data is partitioned into a number of blocks, and duplicates are considered only within a block. For instance, if we partition papers based on conference_name, then only the papers that are published in the same conference will be checked for duplicates;

2. Matching. To decide whether two tuples are duplicates or not, existing techniques typically model this problem as a classification problem, and train a classifier to label each tuple pair as duplicate or non-duplicate [9]. In some recent research (and also at many companies) crowdsourcing is used to get humans to match tuples [20, 54].

A recent survey on duplicate detection has argued that the matching phase is typically much more expensive than the blocking phase [13]. For instance, an evaluation of the popular duplicate detection technique [9] shows that the matching phase takes on the order of minutes for a dataset of thousands of tuples [39]. This is especially true in the context of crowdsourced matching, where each comparison is performed by a crowd worker, costing both time and money. SampleClean reduces the number of comparisons in the matching phase, as we only have to match each tuple in the sample with the others in its block. For example, if we sample 1% of the table, then we can reduce the matching cost by a factor of 100.
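The two-phase blocking-then-matching structure can be sketched as follows; the records, blocking key, and matcher are toy stand-ins, not SampleClean's implementation:

```python
from itertools import combinations

def dedup(records, block_key, match):
    """Phase 1 (blocking): partition records by a cheap key.
    Phase 2 (matching): run the expensive matcher only within each block."""
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    return [(a, b) for rs in blocks.values()
            for a, b in combinations(rs, 2) if match(a, b)]

# Toy records: papers blocked by conference name, matched by title
papers = [{"conf": "VLDB", "title": "SampleClean"},
          {"conf": "VLDB", "title": "sampleclean"},
          {"conf": "SIGMOD", "title": "Foofah"}]
dupes = dedup(papers, lambda r: r["conf"],
              lambda a, b: a["title"].lower() == b["title"].lower())
```

Blocking keeps the quadratic matcher off cross-block pairs: only the two VLDB papers are compared against each other.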

2.3.3 Result Estimation

After cleaning a sample, SampleClean uses the cleaned sample to estimate the result of aggregate queries. Similar to existing SAQP systems, we can estimate query results directly from the cleaned sample. However, due to data error, result estimation can be very challenging. For example, consider the avg(citation_count) query in the previous section. Assume that the data has duplication errors and that papers with a higher citation count tend to have more duplicates. The greater the number of duplicates, the higher the probability a paper is sampled, and thus the cleaned sample may contain more highly cited papers, leading to an over-estimated citation count. We formalize these issues and propose the RawSC approach to address them in Section 3.

Another quantity of interest is how much the dirty data differs from the cleaned data. We can estimate the mean difference based on comparing the dirty and cleaned sample, and then correct a query result on the dirty data with this estimate. We describe this alternative approach, called NormalizedSC, and compare its performance with RawSC in Section 4.

SampleClean vs. SAQP: SAQP assumes perfectly clean data while SampleClean relaxes this assumption and makes cleaning feasible. In RawSC, we take a sample of data, apply a data cleaning technique, and then estimate the result. The result estimation is similar to SAQP; however, we require a few additional scaling factors related to the cleaning. On the other hand, NormalizedSC is quite different from typical SAQP frameworks. NormalizedSC estimates the average difference between the dirty and cleaned data, and this is only possible in systems that couple data cleaning and sampling. What is surprising about SampleClean is that sampling a relatively small population of the overall data makes it feasible to manually or algorithmically clean the sample, and experiments confirm that this cleaning often more than compensates for the error introduced by the sampling.

2.3.4 Example: SampleClean with OpenRefine

In this section, we will walk through an example implementation of SampleClean using OpenRefine [46] to clean the data. Consider our example dirty dataset of publications in Figure 1(a). First, the user creates a sample of data (e.g., 100 records) and loads this sample into the OpenRefine spreadsheet interface. The user can use the tool to detect data errors such as missing attributes, and fill in the correct values (e.g., from another data source or based on prior domain expertise). Next, for deduplication, the system will propose potential matches for each publication in the sample based on a blocking technique, and the user can accept or reject these matches. Finally, the clean sample with the deduplication information is loaded back into the dataset. In this example, sampling reduces the data cleaning effort for the user. The user needs to inspect only 100 records instead of the entire dataset, and has no more than 100 sets of potential duplicates to manually check.

To query this clean sample, we need to apply SampleClean's result estimation to ensure that the estimate remains unbiased after cleaning, since some records may have been corrected or marked as duplicates. In the rest of the paper, we discuss the details of how to ensure unbiased estimates, and how large the sample needs to be to get a result of acceptable quality.
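A RawSC-style estimate (clean the sample, then report a mean with a confidence interval) might look like the following sketch. `rawsc_mean` is a hypothetical simplification: it applies a per-value cleaning function and a plain CLT interval, ignoring SampleClean's duplication-factor scaling:

```python
import math

def rawsc_mean(sample, clean, z=1.96):
    """Clean each sampled value, then return the sample mean and a
    CLT-based margin of error at roughly 95% confidence."""
    vals = [clean(v) for v in sample]
    n = len(vals)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)

# Toy dirty sample: numbers stored as strings, "cleaned" by parsing
mean, moe = rawsc_mean(["1", "2", "3", "4", "5"], float)
```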

3. RawSC ESTIMATION

In this section, we present the RawSC estimation approach. RawSC takes a sample of data as input, applies a data cleaning technique to the sample, runs an aggregate query directly on the clean sample, and returns a result with a confidence interval.

3.1 Sample Estimates

We will first introduce the estimation setting without data errors and explain some results about estimates from sam-

[J. Wang et al., 2014]

Page 53: CIS 602-01: Scalable Data Analysis

Data Fusion: Less is More

43 · D. Koop, CIS 602-01, Fall 2017

Figure 1: Coverage of results.

Figure 2: Returned correct results (y-axis: gain = #(correct authors), 0 to 100; x-axis: #sources, 0 to 850; curves for #(returned books), Vote, Accu, and Cost), plotting authors as we add sources.

Figure 3: Different integration models.

We manually checked the book title page for 100 randomly selected books to obtain the correct author lists as the gold standard.

Ideally, we would like to find the correct author list from conflicting values. We did this in two ways. First, VOTE applies the voting strategy and chooses the author list provided by the largest number of sources. Second, ACCU considers in addition the accuracy of the sources: it takes the accuracy of each source as input (computed by the percentage of correctly provided values for the books inside the gold standard), assigns a higher vote to a source with a higher accuracy, and chooses the author list with the highest sum of the votes (details in Section 3).

We considered the sources in decreasing order of their accuracy (this is just for illustration purposes and we discuss ordering of sources in Section 1.3). Fig. 2 plots the gain, defined as the number of correctly returned author lists for these 100 books, as we added each new source. We observed that we obtained all 100 books after processing 548 sources (see the line for #(Returned books)). The number of correct author lists increased at the beginning for both methods; then, VOTE hits the highest number, 93, after integrating 583 sources, and ACCU hits the highest number after integrating 579 sources; after that, the numbers decreased for both methods and dropped to 78 and 80 respectively for VOTE and ACCU. In other words, integrating the 584th to 894th sources has a negative gain for VOTE, and similarly for ACCU. □

This example shows that for data, "the more the better" does not necessarily hold, and sometimes "less is more". As the research community for data integration has been focusing on improving various integration techniques, which is important without a doubt, we argue that it is also worthwhile to ask whether integrating all available data is the best thing to do. Indeed, Fig. 2 shows that although in general the more advanced method, ACCU, is better than the naive method, VOTE, the result of ACCU on all sources is not as good as that of VOTE on the first 583 sources. This question is especially relevant in the big data environment: not only do we have a larger volume of data, but we also have a larger number of sources and more heterogeneity, so we wish to spend the computing resources in a wise way. This paper studies how we can select sources wisely before real integration or aggregation such that we can balance the gain and the cost. Source selection can be important in many scenarios, ranging from Web data providers that aggregate data from multiple sources, to enterprises that purchase data from third parties, and to individual information users who shop for data from data markets [1].

1.2 Source selection by Marginalism

Source selection in the planning phase falls in the category of resource optimization. There are two standard ways to formalize the problem: finding the subset of sources that maximizes the result quality under a given budget, or finding the subset that minimizes the cost while satisfying a minimal requirement of quality. However, neither of them may be ideal in our context, as shown next.

EXAMPLE 1.3. Consider ACCU in Fig. 2. Assume only for simplicity that the applied order is the best for exploring the sources. Suppose the budget allows integrating at most 300 sources; then we may select all of the first 300 sources and obtain 17 correct author lists. However, if we select only the first 200 sources, we can cut the cost by 1/3, while obtaining only 3 fewer correct author lists (17.6% fewer); arguably, the latter selection is better. On the other hand, suppose we require at least 65 correct author lists; then we may select the first 520 sources, obtaining 65 correct lists. However, if we instead select 526 sources, we introduce 1% more cost but can obtain 81 correct lists (improving by 25%); arguably, spending the few extra resources is worthwhile. □

We propose a solution inspired by the Marginalism principle in economic theory [11]. Assuming we can measure gain and cost using the same unit (many enterprises do predict revenue and expense from integration in dollars according to some business models), we wish to stop selecting a new source when the marginal gain is less than the marginal cost. Here, the marginal gain is the difference between the gain after and before integrating the new source, and similarly for the marginal cost. In our example, if the gain of finding one correct author list is 1 while the cost of integrating one source is .1, the 548th source is one such marginal point.
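The stopping rule can be sketched as scanning a gain curve for points where the next source's marginal gain falls below the marginal cost. This is an illustrative simplification of the paper's Marginalism analysis, with a made-up gain curve:

```python
def marginal_points(gain, marginal_cost):
    """Indices i where integrating source i+1 would add less gain than it
    costs: candidate stopping points when returns are not monotone."""
    return [i for i in range(len(gain) - 1)
            if gain[i + 1] - gain[i] < marginal_cost]

# Toy gain curve: flattens after the 4th source, then picks up again,
# so there is a marginal point in the middle of the curve
stops = marginal_points([0, 5, 9, 10, 10.05, 15], 1)
```

Because the Law of Diminishing Returns need not hold, a curve like this can have several such points, which is why the paper finds all of them before stopping.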

1.3 Challenges for data integration

Source selection falls outside the scope of traditional integration tasks, such as mapping schemas, linking records that refer to the same real-world entity, and resolving conflicts. On the one hand, it is a prelude to data integration. On the other hand, how we select the sources is closely related to the integration techniques we apply. Applying the Marginalism principle to source selection for data integration faces many challenges.

First, in economic theory the Law of Diminishing Returns [11] (i.e., keeping adding resources will gradually yield lower per-unit returns) often holds, and so we can keep adding resources until the marginal cost exceeds the marginal gain for the next unit of resource. However, the Law of Diminishing Returns does not necessarily hold in data integration, so there can be multiple marginal points. In our example (Fig. 2), after we integrate 71 sources by ACCU, the gain curve flattens out and the marginal gain is much less than the marginal cost; however, starting from the 381st source, there is a sharp growth in the gain. Indeed, there are four marginal points on the curve: the 35th, 71st, 489th, and 548th sources (marked by vertical lines in Fig. 2). We thus need to find all marginal points before we stop investigation.

Second, the data sources are different, providing data with different coverage and quality, so integrating the sources in different orders can lead to different quality curves (and so gain curves). Each curve has its own marginal points, so we need to be able to compare all marginal points in some way and choose one as the best.


[X. L. Dong et al., 2012]

Page 54: CIS 602-01: Scalable Data Analysis

CAP Theorem

44 · D. Koop, CIS 602-01, Fall 2017

[E. Brewer]

Page 55: CIS 602-01: Scalable Data Analysis

NoSQL: Replication and Consistency Levels

45 · D. Koop, CIS 602-01, Fall 2017

Multi-DC replication (diagram: Write → DC 1 → DC 2)

[R. Stupp]

Page 56: CIS 602-01: Scalable Data Analysis

Five Tweets: Location and Device

Schema: S = [ [ℓspatial1, ℓspatial2], [ℓdevice] ], where the two spatial labels address a cell at coarse (e.g., 0,1) and fine (e.g., 01,10) resolution, and ℓdevice labels each tweet's device (iPhone or Android). The five tweets o1, ..., o5 are inserted one per step (steps 1 through 5), updating the index incrementally.

Indexing schema legend: parent-child links (same dimension) are either proper or shared; content links (next dimension) are either proper or shared; the diagram also marks nodes updated in the current step and the dimension boundary.

Fig. 2. An illustration of how to build a nanocube for five points [o1, . . . ,o5] under schema S. The complete process is described in Section 4.

In Section 4, we show how to construct a data cube that fits in the main memory of a modern laptop computer or workstation, extending the work of Sismanis et al. [31]. In addition, the query times to build the visual encodings in which we are interested will be at most proportional to the size of the output, which is bounded by the number of screen pixels (within a small factor). This is an important observation: the time complexity of a visualization algorithm should ideally be bounded by the number of pixels it touches on the screen. Our technique enables real-time exploratory visualization on datasets that are large, spatiotemporal, and multidimensional. Because the speed of our data cube structure hinges partly on it being small enough to fit in main memory, we call it a nanocube.

By real-time, we mean query times on average under a millisecond for a single thread running on computers ranging from laptops, to workstations, to server-class computing nodes (Section 6). By large, we mean that the datasets we support have millions to billions of entries.

By spatiotemporal, we mean that nanocubes support queries typical of spatial databases, such as counting events in a spatial region that can be either a rectangle covering most of the world, or a heatmap of activity in downtown San Francisco (Section 4.3.1). By the same token, nanocubes support temporal queries at multiple scales, such as event counts by hour, day, week, or month over a period of years (Section 4.3.3). Data cubes in general enable the Visual Information-Seeking Mantra [29] of "Overview first, zoom and filter, then details-on-demand" by providing summaries and letting users drill down by expanding along the wanted dimensions. Nanocubes also provide overviews, filters, zooming, and details-on-demand inside the spatiotemporal dimensions themselves.

By multidimensional, we mean that besides latitude, longitude, and time, each entry can have additional attributes (see Section 6) that can be used in query selections and rollups.

As we will show, nanocubes lend themselves very well to building visual encodings which are fundamental building blocks of interactive visualization systems, such as scatterplots, histograms, parallel coordinate plots, and choropleth maps. In summary, we contribute:

• a novel data structure that improves on the current state-of-the-art data cube technology to enable real-time exploratory visualization of multidimensional, spatiotemporal datasets;

• algorithms to query the nanocube and build linked and brushable visual encodings commonly found in visualization systems; and

• case studies highlighting the strengths and weaknesses of our technique, together with experiments to measure its utilization of space, time, and network bandwidth.

2 RELATED WORK

Relational databases are so widespread and fundamental to the practice of computing that they were a natural target for information visualization almost since the field's inception [20]. Mackinlay's Automatic Presentation Tool is the breakthrough result that critically connected the relational structure of the data with the graphical primitives available for display [23] and ultimately led to data cube visualization tools like Polaris [34, 35] and Show Me [24]. Nanocubes are specifically designed to speed up queries for spatiotemporal data cubes, and could eventually be used as a backend for these types of applications.

In contrast, some of the work in large data visualization involves shipping the computation and data to a cluster of processing nodes. While parallelism is an attractive option for increasing throughput, it does not necessarily help achieve low latency, which is essential for fluid interactions with a visualization tool. As a result, sophisticated techniques such as query prediction become necessary [6]. Leveraging the enormous power of graphics processing units has also become popular [25, 21], but without algorithmic changes, linear scans through the dataset will still be too slow for fluid interaction, even with GPUs.

Another popular way to cope with large datasets is through sampling. Statistical sampling can be performed on the database backend [26, 1, 10, 14], or on the front-end [11]. Still, the techniques we introduce with nanocubes can produce results quickly and exactly (to within screen precision) without requiring approximations, which we believe is preferable. In addition, as Liu et al. argue, sampling by itself is not sufficient to prevent overplotting, and might actually mask important data outliers [21].

Fekete and Plaisant have proposed modifications of traditional visual encodings which use the computer screen more efficiently [13]. These scale better with dataset size, but nevertheless require a traversal of all input data points that renders the proposal less attractive for larger datasets. Carr et al. were among the first to propose techniques replacing a scatterplot with an equivalent density plot [5]; nanocubes enable these visualizations at a variety of dataset sizes and scales.

Careful data aggregation [17], then, appears to be one of the few scalable solutions for low-latency large data graphics. While Elmqvist and Fekete propose variations of visualization techniques that include aggregation as a first-class citizen [12], in this paper we show how to issue queries such that, at the screen resolution in which the application is operating, the result is indistinguishable (or close to) from a complete

Data Cubes and Nanocubes

46 D. Koop, CIS 602-02, Fall 2015

[Lins et al., 2013]

Page 57: CIS 602-01: Scalable Data Analysis

Scalable Machine Learning

47 D. Koop, CIS 602-01, Fall 2017

[Figure: a TensorFlow-style computation graph (Mul, Add, biases, a learning-rate '-=' update) partitioned across Device A and Device B, where devices can be processes, machines, GPUs, etc. The distributed split inserts paired Send and Recv nodes at each edge that crosses a device boundary.]

[J. Dean, 2015]
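The Send/Recv idea above can be sketched without any ML framework: two threads stand in for the devices, and a queue plays the role of a paired Send/Recv node. All names and values here are hypothetical, for illustration only.

```python
import threading
import queue

channel = queue.Queue()  # stand-in for a Send/Recv node pair
results = {}

def device_a(x, w, b):
    h = x * w + b        # Mul then Add, as in the sliced graph
    channel.put(h)       # Send across the device boundary

def device_b(lr):
    h = channel.get()    # Recv blocks until device A has sent
    results["h"] = h
    results["b_new"] = h - lr * 1.0  # stand-in for the '-=' update step

ta = threading.Thread(target=device_a, args=(2.0, 3.0, 1.0))
tb = threading.Thread(target=device_b, args=(0.1,))
ta.start(); tb.start()
ta.join(); tb.join()
```

The point of the split is that neither "device" needs to see the other's subgraph; communication happens only at the inserted Send/Recv pair.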

Page 58: CIS 602-01: Scalable Data Analysis

Graph Analytics

48 D. Koop, CIS 602-01, Fall 2017

[J. Gonzalez, 2014]

[Figure: a Wikipedia analytics pipeline. Raw Wikipedia XML is parsed into a text table (Title, Body); extracted hyperlinks form a graph for PageRank, yielding the Top 20 Pages; the body text feeds a topic model (LDA) producing word-topic assignments; an editor graph drives community detection, assigning users to communities; joining user communities with the term-document graph and discussion table yields community-topic summaries.]
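The PageRank stage of such a pipeline can be sketched as power iteration on a toy hyperlink graph (hypothetical pages, not actual Wikipedia data):

```python
# Minimal PageRank by power iteration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
damping = 0.85
n = len(links)
rank = {p: 1.0 / n for p in links}

for _ in range(50):
    # Every page keeps a baseline (1 - d)/n, plus shares from in-links.
    new = {p: (1 - damping) / n for p in links}
    for src, outs in links.items():
        share = damping * rank[src] / len(outs)
        for dst in outs:
            new[dst] += share
    rank = new

top = sorted(rank, key=rank.get, reverse=True)  # analogous to "Top 20 Pages"
```

At Wikipedia scale the same iteration is distributed over a cluster, which is exactly where graph-processing systems come in.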

Page 59: CIS 602-01: Scalable Data Analysis

Internet of Things

49 D. Koop, CIS 602-01, Fall 2017

A. Botta et al. / Future Generation Computer Systems 56 (2016) 684–700 689

Fig. 5. Services made possible thanks to the CloudIoT paradigm. Source: http://siliconangle.com/.

Fig. 6. Application scenarios driven by the CloudIoT paradigm and related challenges.

4. Applications

CloudIoT gave birth to a new set of smart services and applications that can strongly impact everyday life (Fig. 5). Many of the applications described in the following (may) benefit from Machine-to-Machine communications (M2M) when the things need to exchange information among themselves and not only send it towards the cloud [42]. This represents one of the open issues in this field, as discussed in Section 7. In this section we describe the wide set of applications that are made possible or significantly improved thanks to the CloudIoT paradigm. For each application we point out the challenges – see Fig. 6 – which we discuss in detail in Section 5.

Healthcare. The adoption of the CloudIoT paradigm in the healthcare field can bring several opportunities to medical IT, and experts believe that it can significantly improve healthcare services and contribute to their continuous and systematic innovation [43]. Indeed, CloudIoT employed in this scenario is able to simplify healthcare processes and makes it possible to enhance the quality of medical services by enabling cooperation among the different entities involved. Ambient assisted living (AAL), in particular, aims at easing the daily lives of people with disabilities and chronic medical conditions.

Through the application of CloudIoT in this field it is possible to supply many innovative services, such as: collecting patients' vital data via a network of sensors connected to medical devices,

[Botta et al., 2016]

Page 60: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture23.pdf · D. Koop, CIS 602-02, Fall 2015 2. Zipf's Law in Tom Sawyer ... 3191 Saudi Arabia N N 2699 last week

Final Project
• Presentations: Saturday, Dec. 16 (8-11am), Sorry :(
• Presentation (5-6 minutes)
  - Brief Introduction to Dataset/Problem (1 minute)
  - Scalability Challenges (1-2 minutes)
  - Results (2-3 minutes)
• Report
  - Expand proposal into technical report (figures, tables, references)
• Code
  - Include README
  - Source with build/run instructions
  - Link to data (make subset available to me if not publicly available)

50 D. Koop, CIS 602-01, Fall 2017