Intro to Data Science Concepts

Bill Howe, PhD

Introduction to Data Science

04/15/2023 2

This morning

• Context for “Data Science”
• Databases and Relational Algebra
• NoSQL

Bill Howe, UW

http://commons.wikimedia.org/wiki/File:ElectoralCollege2012.svg (public domain)

“The intuition behind this ought to be very simple: Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes.”

Nate Silver, Oct. 26, 2012

“…the argument we’re making is exceedingly simple. Here it is: Obama’s ahead in Ohio.”

Nate Silver, Nov. 2, 2012

“The bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.”

Nate Silver, Nov. 10, 2012

DailyBeast

fivethirtyeight.com

fivethirtyeight.com (source: Randy Stewart)

Nate Silver


“…the biggest win came from good old SQL on a Vertica data warehouse and from providing access to data to dozens of analytics staffers who could follow their own curiosity and distill and analyze data as they needed.”

Dan Woods Jan 13 2013, CITO Research

“The decision was made to have Hadoop do the aggregate generations and anything not real-time, but then have Vertica to answer sort of ‘speed-of-thought’ queries about all the data.”

Josh Hendler, CTO of H & K Strategies

Related: Obama campaign’s data-driven ground game

“In the 21st century, the candidate with [the] best data, merged with the best messages dictated by that data, wins.”

Andrew Rasiej, Personal Democracy Forum

Hurricane Sandy

http://rpubs.com/JoFrhwld/sandy Josef Fruehwald


Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030

1) Convert all the digitized books in the 20th century into n-grams (Thanks, Google!)

(http://books.google.com/ngrams/)

2) Label each 1-gram (word) with a mood score. (Thanks, WordNet!)

3) Count the occurrences of each mood word

A 1-gram: “yesterday”
A 5-gram: “analysis is often described as”
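The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny MOOD_LEXICON dictionary, the sample sentence, and the function names are hypothetical stand-ins, not the word lists or code used in the paper.

```python
from collections import Counter

# Hypothetical mood lexicon (word -> mood category); illustrative only,
# not the WordNet-derived lists used in the paper.
MOOD_LEXICON = {"happy": "joy", "glad": "joy", "sad": "sadness", "afraid": "fear"}


def ngrams(tokens, n):
    """Step 1: return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mood_counts(tokens, lexicon=MOOD_LEXICON):
    """Steps 2 and 3: label each 1-gram with a mood, then count per mood."""
    return Counter(lexicon[t] for t in tokens if t in lexicon)


# Made-up sample text, split on whitespace.
tokens = "analysis is often described as happy work but yesterday was sad".split()
five_grams = ngrams(tokens, 5)
counts = mood_counts(tokens)

print(five_grams[0])
print(dict(counts))
```

At real scale the counting would run over precomputed n-gram tables rather than raw text, but the shape of the computation stays the same.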



…
2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644.
3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137.
4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176.
…
6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195.
…


What is Data Science?

• Fortune – “Hot New Gig in Tech”

• Hal Varian, Google’s Chief Economist, NYT, 2009:
  – “The next sexy job”
  – “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

• Mike Driscoll, CEO of metamarkets:
  – “Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.”
  – “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible.”


Drew Conway’s Data Science Venn Diagram


What do data scientists do?

“They need to find nuggets of truth in data and then explain it to the business leaders”

Data scientists “tend to be ‘hard scientists’, particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem.”


-- DJ Patil, Chief Scientist at LinkedIn

-- Richard Snee, EMC


Mike Driscoll’s three sexy skills of data geeks

• Statistics – traditional analysis

• Data Munging – parsing, scraping, and formatting data

• Visualization – graphs, tools, etc.


“Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.”


Jeffrey Stanton Syracuse University School of Information Studies

An Introduction to Data Science


Data Science is about Data Products

• “Data-driven apps”
  – Spellchecker
  – Machine Translator

• Interactive visualizations
  – Google flu application
  – Global Burden of Disease

• Online Databases
  – Enterprise data warehouse
  – Sloan Digital Sky Survey


(Mike Loukides)

Data science is about building data products, not just answering questions

Data products empower others to use the data.

May help communicate your results (e.g., Nate Silver’s maps)

May empower others to do their own analysis (e.g., Global Burden of Disease)


A Typical Data Science Workflow


1) Preparing to run a model

2) Running the model

3) Interpreting the results

Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging

“80% of the work”

-- Aaron Kimball

“The other 80% of the work”

DB

ML/Stats

Vis


What are the abstractions of data science?

“Data Jujitsu”
“Data Wrangling”
“Data Munging”

Translation: “We have no idea what this is all about”


1850s: matrices and linear algebra (today: engineers and scientists)
1950s: arrays and custom algorithms (today: C/Fortran performance junkies)
1950s: s-expressions and pure functions (today: language purists)
1960s: objects and methods (today: software engineers)
1970s: files and scripts (today: system administrators)
1970s: relations and relational algebra (today: large-scale data engineers)
1980s: data frames and functions (today: statisticians)
2000s: key-value pairs + one of the above (today: NoSQL hipsters)

But what are the abstractions of data science?


DATABASES AND RELATIONAL ALGEBRA


04/15/2023 Bill Howe, eScience Institute 22

Pre-Relational: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Relational Database History

-- Codd 1970


Key Idea: “Physical Data Independence”

physical data independence

files and pointers

relations

SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';

f = fopen('table_file');
fseek(10030440);
while (True) {
  fread(&buf, 1, 8192, f);
  if (buf == GATTACGATATTA) {
    . . .
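A small sketch of what physical data independence buys you, using Python's built-in sqlite3 module. The table name and query follow the slide; the sample rows, the index name idx_seq, and the helper run_query are assumptions made for illustration. The same SQL text runs unchanged whether or not an index exists; only the physical layout differs.

```python
import sqlite3

# Made-up sample rows; only the second matches the slide's query.
SEQS = ["ACGT", "GATTACGATATTA", "TTGACA"]
QUERY = "SELECT seq FROM ncbi_sequences WHERE seq = ?"


def run_query(with_index):
    """Build the table, optionally add an index, and run the same query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ncbi_sequences (id INTEGER PRIMARY KEY, seq TEXT)")
    conn.executemany("INSERT INTO ncbi_sequences (seq) VALUES (?)",
                     [(s,) for s in SEQS])
    if with_index:
        # Changes the physical representation only; the query text is untouched.
        conn.execute("CREATE INDEX idx_seq ON ncbi_sequences (seq)")
    rows = conn.execute(QUERY, ("GATTACGATATTA",)).fetchall()
    conn.close()
    return rows


rows_without_index = run_query(with_index=False)
rows_with_index = run_query(with_index=True)
print(rows_without_index == rows_with_index)
```

The file-and-pointer version, by contrast, would have to be rewritten whenever the offset or block layout changed.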


Key Idea: An Algebra of Tables

select

project

join

Other operators: aggregate, union, difference, cross product
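To make the operators concrete, here is a minimal sketch of select, project, and join over relations stored as lists of dictionaries. The function names mirror the operators; the genes and organisms tables are hypothetical sample data, and a real engine would choose far more efficient physical algorithms.

```python
def select(relation, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Keep only the named columns, dropping duplicate rows."""
    out = []
    for row in relation:
        slim = {a: row[a] for a in attributes}
        if slim not in out:
            out.append(slim)
    return out


def join(left, right, key):
    """Equijoin: pair rows from both relations that agree on `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical sample relations.
genes = [
    {"gene_id": 1, "name": "BRCA1", "organism_id": 10},
    {"gene_id": 2, "name": "TP53", "organism_id": 10},
    {"gene_id": 3, "name": "lacZ", "organism_id": 20},
]
organisms = [
    {"organism_id": 10, "species": "human"},
    {"organism_id": 20, "species": "E. coli"},
]

# Every operator takes tables and returns a table, so the calls compose.
human_gene_names = project(
    select(join(genes, organisms, "organism_id"),
           lambda r: r["species"] == "human"),
    ["name"],
)
print(human_gene_names)
```

Because each operator's output is again a table, expressions nest freely, which is what makes algebraic rewriting possible.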

Equivalent logical expressions


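$\sigma_{p=\mathrm{knows}}(R) \bowtie_{o=s} \left(\sigma_{p=\mathrm{holdsAccount}}(R) \bowtie_{o=s} \sigma_{p=\mathrm{accountHomepage}}(R)\right)$ (right associative)

$\left(\sigma_{p=\mathrm{knows}}(R) \bowtie_{o=s} \sigma_{p=\mathrm{holdsAccount}}(R)\right) \bowtie_{o=s} \sigma_{p=\mathrm{accountHomepage}}(R)$ (left associative)

$\sigma_{p_1=\mathrm{knows} \,\wedge\, p_2=\mathrm{holdsAccount} \,\wedge\, p_3=\mathrm{accountHomepage}}(R \times R \times R)$ (distributive)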
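The equivalence of these expressions can be checked on a toy relation. The sketch below assumes R is a set of (subject, predicate, object) triples; the sample triples and helper names are hypothetical. It also assumes that the cross-product form carries the o=s join conditions inside its selection, which the slide's abbreviated notation leaves implicit.

```python
from itertools import product

# Hypothetical triples: (subject, predicate, object).
R = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("dave", "knows", "alice"),
    ("bob", "holdsAccount", "acct1"),
    ("carol", "holdsAccount", "acct2"),
    ("acct1", "accountHomepage", "http://example.org/bob"),
]


def sel(pred):
    """Selection on the predicate column, kept as (subject, object) paths."""
    return [(s, o) for (s, p, o) in R if p == pred]


def join(left, right):
    """Join on o=s: the last node of a left path equals the first of a right path."""
    return [a + b[1:] for a in left for b in right if a[-1] == b[0]]


knows, holds, home = sel("knows"), sel("holdsAccount"), sel("accountHomepage")

right_assoc = join(knows, join(holds, home))
left_assoc = join(join(knows, holds), home)

# Cross product followed by one big selection (join conditions made explicit).
cross = [
    (t1[0], t1[2], t2[2], t3[2])
    for t1, t2, t3 in product(R, R, R)
    if t1[1] == "knows" and t2[1] == "holdsAccount" and t3[1] == "accountHomepage"
    and t1[2] == t2[0] and t2[2] == t3[0]
]

print(sorted(right_assoc) == sorted(left_assoc) == sorted(cross))
```

All three plans return the same rows, but the cross-product version inspects every combination of three triples, which is why an optimizer prefers to pick among the cheaper equivalent forms.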


Why do we care? Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:
N = (2+3)*z

two operations instead of five, no division operator
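The rewrite can be mimicked with a tiny rule-based simplifier over expression trees. This is a sketch, not a general optimizer: expressions are nested tuples (op, left, right), the variable is the string "z", and only the four laws listed above are encoded, in the specific shapes this example needs.

```python
# N = ((z*2) + ((z*3) + 0)) / 1 as a nested-tuple expression tree.
EXPR = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)


def simplify(e):
    """Apply the four algebraic laws bottom-up."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:          # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:          # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x+y), then law 4 to write it as (x+y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)


def count_ops(e):
    """Number of operator nodes in the tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])


def evaluate(e, z):
    """Evaluate the tree for a given value of z."""
    if e == "z":
        return z
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    x, y = evaluate(a, z), evaluate(b, z)
    if op == "+":
        return x + y
    if op == "*":
        return x * y
    return x / y


simplified = simplify(EXPR)
print(simplified, count_ops(EXPR), count_ops(simplified))
```

A relational optimizer works the same way in spirit: it rewrites a query plan using laws such as join associativity, then runs whichever equivalent form it estimates to be cheapest.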

Same idea works with the Relational Algebra!

So what? RA is now ubiquitous

• Galaxy – “bioinformatics workflows”: “…Operate on Genomics Intervals -> Join”

• Pandas and Blaze: High Performance Arrays in Python
  – merge(left, right, on='key')

• dplyr in R
  – filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y)

• Hadoop and contemporaries all evolved to support RA-like interfaces: Pig, HIVE, Cascalog, Flume, Spark/Shark, Dremel


“NOSQL” SYSTEMS


Year | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | My label
1971 | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | sql-like
2003 | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | nosql
2004 | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | batch
2005 | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | nosql
2006 | BigTable/Hbase | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | nosql
2007 | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | nosql
2007 | Dynamo | ✔ | ✔ | O | O | O | O | O | O | ext. record | nosql
2008 | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | sql-like
2008 | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
2008 | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | nosql
2009 | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | nosql
2009 | Riak | ✔ | ✔ | ✔ | EC, record | MR | O | | | key-val | nosql
2010 | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | tables | sql-like
2011 | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | tables | nosql
2011 | Tenzing | ✔ | O | O | O | O | ✔ | ✔ | ✔ | tables | sql-like
2011 | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
2012 | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | tables | sql-like
2013 | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like

NoSQL and related systems, by feature


Scale was the primary motivation!


Rick Cattell’s clustering from “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, 2010

extensible record stores

document stores

key-value stores


MapReduce-based Systems


[Timeline figure, 2004–2012: MapReduce-based Systems. Systems shown: MapReduce, Hadoop, Pig, HIVE, Tenzing, Impala. Legend: non-Google open source implementation; direct influence / shared features; compatible; implementation of.]


[Timeline figure, 2003–2012: NoSQL Systems. Systems shown: memcached, BigTable, Dynamo, CouchDB, MongoDB, Cassandra, Voldemort, Riak, Megastore, Spanner, Accumulo. Legend: direct influence / shared features; compatible; implementation of.]


A lot of these systems give up joins!

Year | Source | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | My label
1971 | many | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | SQL-like
2003 | other | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | lookup
2004 | Google | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | MR
2005 | couchbase | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | filter/MR
2006 | Google | BigTable (Hbase) | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | filter/MR
2007 | 10gen | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | filter
2007 | Amazon | Dynamo | ✔ | ✔ | O | O | O | O | O | O | key-val | lookup
2007 | Amazon | SimpleDB | ✔ | ✔ | ✔ | O | O | O | O | O | ext. record | filter
2008 | Yahoo | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | RA-like
2008 | Facebook | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like
2008 | Facebook | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | filter
2009 | other | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | lookup
2009 | basho | Riak | ✔ | ✔ | ✔ | EC, record | MR | O | | | key-val | filter
2010 | Google | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | tables | SQL-like
2011 | Google | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | tables | filter
2011 | Google | Tenzing | ✔ | O | O | O | ✔ | ✔ | ✔ | ✔ | tables | SQL-like
2011 | Berkeley | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like
2012 | Google | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | tables | SQL-like
2012 | Accumulo | Accumulo | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | filter
2013 | Cloudera | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like


Joins

• Ex: Show all comments by “Sue” on any blog post by “Jim”

• Method 1:
  – Lookup all blog posts by Jim
  – For each post, lookup all comments and filter for “Sue”

• Method 2:
  – Lookup all comments by Sue
  – For each comment, lookup all posts and filter for “Jim”

• Method 3:
  – Filter comments by Sue, filter posts by Jim
  – Sort all comments by blog id, sort all blogs by blog id
  – Pull one from each list to find matches
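The three methods can be sketched as follows. The posts and comments lists, their field names, and the in-memory dictionaries standing in for key-value lookups are all hypothetical; the sketch also assumes each blog id identifies exactly one post. Without a join operator, the application has to pick and hand-code one of these strategies itself.

```python
# Hypothetical data: each post has a unique blog_id.
posts = [
    {"blog_id": 1, "author": "Jim"},
    {"blog_id": 2, "author": "Ann"},
    {"blog_id": 3, "author": "Jim"},
]
comments = [
    {"comment_id": 10, "blog_id": 1, "author": "Sue"},
    {"comment_id": 11, "blog_id": 1, "author": "Bob"},
    {"comment_id": 12, "blog_id": 2, "author": "Sue"},
    {"comment_id": 13, "blog_id": 3, "author": "Sue"},
    {"comment_id": 14, "blog_id": 3, "author": "Sue"},
]

# Dictionaries stand in for key-value lookups by blog id.
comments_by_post = {}
for c in comments:
    comments_by_post.setdefault(c["blog_id"], []).append(c)
posts_by_id = {p["blog_id"]: p for p in posts}


def method1():
    """Start from Jim's posts, look up each post's comments, filter for Sue."""
    return [c["comment_id"]
            for p in posts if p["author"] == "Jim"
            for c in comments_by_post.get(p["blog_id"], [])
            if c["author"] == "Sue"]


def method2():
    """Start from Sue's comments, look up each comment's post, filter for Jim."""
    return [c["comment_id"]
            for c in comments if c["author"] == "Sue"
            if posts_by_id[c["blog_id"]]["author"] == "Jim"]


def method3():
    """Filter both sides, sort by blog id, then merge the two sorted lists."""
    jim = sorted((p for p in posts if p["author"] == "Jim"),
                 key=lambda p: p["blog_id"])
    sue = sorted((c for c in comments if c["author"] == "Sue"),
                 key=lambda c: c["blog_id"])
    i = j = 0
    out = []
    while i < len(jim) and j < len(sue):
        if jim[i]["blog_id"] < sue[j]["blog_id"]:
            i += 1
        elif jim[i]["blog_id"] > sue[j]["blog_id"]:
            j += 1
        else:
            # Blog ids are unique among posts, so only the comment side advances.
            out.append(sue[j]["comment_id"])
            j += 1
    return out


print(method1(), method2(), method3())
```

Which method is cheapest depends on how many posts Jim wrote versus how many comments Sue left; a relational optimizer makes that choice automatically, whereas here the programmer must.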


NoSQL Criticism

• Two value propositions
  – Performance: “I started with MySQL, but had a hard time scaling it out in a distributed environment”
  – Flexibility: “My data doesn’t conform to a rigid schema”

Stonebraker CACM (blog 2)


NoSQL Criticism: flexibility argument

• Who are the customers of NoSQL?
  – Lots of startups

• Very few enterprises. Why? Most applications are traditional OLTP on structured data; a few other applications sit around the “edges”, but are considered less important.


Stonebraker CACM (blog 2)


Some Takeaways

• Data wrangling is the hard part of data science, not statistics

• Relational algebra is the right abstraction for reasoning about data wrangling

• Even “NoSQL” systems that explicitly rejected relational concepts eventually brought them back
