71
Bill Howe, PhD Director of Research, Scalable Data Analytics University of Washington eScience Institute eScience and Data Science at the UW eScience Institute 10/28/2022 Bill Howe, UW 1

eResearch New Zealand Keynote

Embed Size (px)

DESCRIPTION

Bill Howe gave a keynote at eResearch New Zealand in early July 2013

Citation preview

Page 1: eResearch New Zealand Keynote

04/11/2023 1

Bill Howe, PhDDirector of Research,

Scalable Data AnalyticsUniversity of Washington

eScience Institute

eScience and Data Science at the UW eScience Institute

Bill Howe, UW

Page 2: eResearch New Zealand Keynote

2

“It’s a great time to be a data geek.”-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera

Page 3: eResearch New Zealand Keynote

Jim Gray

Page 4: eResearch New Zealand Keynote

04/11/2023 4

The University of Washington eScience Institute

• Rationale– The exponential increase in physical and virtual sensing tech is

transitioning all fields of science and engineering from data-poor to data-rich

– Techniques and technologies include• Sensors and sensor networks, data management, data mining,

machine learning, visualization, cluster/cloud computing • Mission

– Help position the University of Washington and partners at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them.

• Strategy– Bootstrap a cadre of Research Scientists– Add faculty in key fields– Build out a “consultancy” of students and non-research staff

Bill Howe, UW

Page 5: eResearch New Zealand Keynote

04/11/2023 5Bill Howe, UW

# o

f b

yte

s

# of data sources

telescopes

spectra

LSST (~100PB; images, spectra)

PanSTARRS (~40PB; images, trajectories)

OOI (~50TB/year; sims, RSN)IOOS (~50TB/year; sims, satellite, gliders,

AUVs, vessels, more)CMOP (~10TB/year; sims, stations, gliders,

AUVs, vessels, more)

SDSS (~400TB; images, spectra, catalogs)

n-body sims

models

AUVs

stations

cruises, CTDsflow cytometry

gliders

ADCPsatellites

Astronomy

Ocean Sciences

3 V’s of Big Data

Volume

Variety

Velocity

Page 6: eResearch New Zealand Keynote

PDB

GenBank

UniProt

Pfam

Spreadsheets, NotebooksLocal, Lost

High throughput experimental methodsIndustrial scaleCommons based productionPublicly data setsCherry picked resultsPreserved

CATH, SCOP(Protein Structure Classification)

ChemSpider

Long Tail of Research Data

[src: Carol Goble]

Page 7: eResearch New Zealand Keynote

Types of Data Stored

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

5.20%

11.70%

15.20%

20.50%

22.80%

48.00%

61.20%

71.60%Quantitative/tabular/statistical

Text (literature, transcriptions, field notes)

Images (photographs, maps)

Video recordings

Audio recordings

Multimedia digital objects

Geo-tagged objects/ spatial data

Other

Lewis et al 2008

src: Conversations with Research Leaders (2008)

Page 8: eResearch New Zealand Keynote

Where do you store your data?

src: Conversations with Research Leaders (2008)

src: Faculty Technology Survey (2011)

Other

External (non-UW) data center

Department-managed server

My computer

0% 20% 40% 60% 80% 100%

5%6%

12%27%

41%66%

87%

Lewis et al 2011

Page 9: eResearch New Zealand Keynote

How much data do you work with?

Wright 2013

Page 10: eResearch New Zealand Keynote

The Long Tail is worthy of investment

Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6)

log(NSERC grants) ($) log(NSERC + CIHR rants) ($)

Page 11: eResearch New Zealand Keynote

The Long Tail is worthy of investment

Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6)

“In sum, greater productivity is not strongly related to greater funding.”

“…the best article of one rich researcher received, on average, 14% fewer citations than the best article from any random pair of researchers, each of whom received only half as much funding!”

“if maximizing the total impact of the entire pool of grantees is the goal, then the “few big” strategy would be less effective than the “many small” strategy.”

Page 12: eResearch New Zealand Keynote

04/11/2023 12Bill Howe, UW

“Ensuring a successful future for the biological sciences will require restraint in the growth of large centers and -omics-like projects, so as to provide more financial support for the critical work of innovative small laboratories striving to understand the wonderful complexity of living systems.”

Alberts B (2012) The End of “Small Science”? Science 337: 1583

The long tail is worthy of investment

Page 13: eResearch New Zealand Keynote

04/11/2023 13

Page 14: eResearch New Zealand Keynote

04/11/2023 14Bill Howe, UW

Page 15: eResearch New Zealand Keynote

100x more impact

How much more productive is a great programmer than a mediocre programmer?

Page 16: eResearch New Zealand Keynote

04/11/2023 16

Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”

Bill Howe, UW

Page 17: eResearch New Zealand Keynote

1704/11/2023 Bill Howe, UW

Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit

Page 18: eResearch New Zealand Keynote

04/11/2023 18Bill Howe, UW

“The future is already here; it’s just not very evenly distributed.”

-- William Gibson

Page 19: eResearch New Zealand Keynote

04/11/2023 19

Three Avenues of Attack

• Technological• Educational• Organizational

Bill Howe, UW

Page 20: eResearch New Zealand Keynote

SQLSHARE: QUERY-AS-A-SERVICETechnology for the Long Tail

Page 21: eResearch New Zealand Keynote

Alex Szalay Jim Gray

How can we deliver 1000 little SDSSs?

Page 22: eResearch New Zealand Keynote

1) Upload data “as is”Cloud-hosted; no need to install or design a database; no pre-defined schema

2) Analyze data with SQLRight in your browser, writing queries on top of queries on top of queries ...

SELECT hit, COUNT(*)

FROM tigrfam_surface

GROUP BY hit

ORDER BY cnt DESC

3) Share the results Queries on queries on queries…

Page 23: eResearch New Zealand Keynote

04/11/2023 23

Browse English descriptions

Page 24: eResearch New Zealand Keynote

04/11/2023 24

Save the results, share them with others

Edit a Query

Page 25: eResearch New Zealand Keynote

VizDeck

Python Client

“Flagship” SQLShare App (Python) on EC2

SQLShare REST API

Excel Addin

R Client

Spreadsheet Crawler

ASP.NET

OAuth2

WCF

Page 26: eResearch New Zealand Keynote

04/11/2023 26

METADATAAside

Bill Howe, UW

Page 27: eResearch New Zealand Keynote

04/11/2023 27

What about metadata?

• Claim: Comprehensive metadata standards* represent a shared consensus about the world

• At the frontier of research, this shared consensus does not exist, by definition

• Any consensus that does emerge will change frequently, by definition

• Data found “in the wild” will typically not conform to any standard, by definition

So are we stuffed?

* ontology / schema / controlled vocabulary / etc.

Bill Howe, UW

Page 28: eResearch New Zealand Keynote

04/11/2023 28Bill Howe, UW

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Maslow’s Needs Hierarchy

Page 29: eResearch New Zealand Keynote

04/11/2023 29

A “Needs Hierarchy” of Science Data Management

storage

sharing

Bill Howe, UW

query

curation

analytics

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Page 30: eResearch New Zealand Keynote

04/11/2023 30

A “Needs Hierarchy” of Science Data Management

storage

sharing

Bill Howe, UW

curation

query

analytics

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Page 31: eResearch New Zealand Keynote

04/11/2023 31

Views

• A view is just a query with a name• We can use the view just like a real

table

Bill Howe, UW

Why can we do this?

Because we know that every query returns a relation:

We say that the language is “algebraically closed”

Page 32: eResearch New Zealand Keynote

04/11/2023 32

Scientific data management reduces to sharing views• Integrate data from multiple sources?

– joins and unions with views

• Standardize on units, apply naming conventions?– rename columns, apply functions with views

• Attach metadata?– add new tables with descriptive names, add new columns

with views

• Data cleaning, quality control?– hide bad values with views

• Maintain provenance?– inspect view dependencies

• Propagate updates? – view maintenance

• Protect sensitive data?– expose subsets with views (assuming views carry

permissions) Bill Howe, UW

Pay-as-you-go metadata

Page 33: eResearch New Zealand Keynote

Deeply nested hierarchies of viewsProvenance

Controlled SharingImplicit re-execution

Page 34: eResearch New Zealand Keynote

Bring the computation to the data

• Rich query services, not data cemetaries

• Avoid “Transloading:” Pointless data movement from one environment to another– Compute Vis– Server Client– Cloud Cloud

• “Share the soup” and curate incrementally as a side effect of usage

• Pay-as-you-go metadata

Page 35: eResearch New Zealand Keynote

04/11/2023 35

USE CASES

Bill Howe, UW

Page 36: eResearch New Zealand Keynote

Laser

Microscope Objective

Pine Hole Lens

Nozzle d1

d2

FSC (Forward scatter)

Orange fluo

Red fluo

SeaFlowFrancois Ribalet

Jarred Swalwell

Ginger Armbrust

Page 37: eResearch New Zealand Keynote

SeaFlowd1

/ F

SC

d2 / FSC

RE

D f

luor

esce

nce

FSC

Picoplankton

Nanoplankton

IS

Ultraplankton

Prochlorococcus

Continuous observations of various phytoplankton groups from 1-20 mm in size

Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton Based on ORANGE fluo: Synechococcus, Cryptophytes Based on FSC: Coccolithophores

Francois Ribalet

Jarred Swalwell

Ginger Armbrust

Page 38: eResearch New Zealand Keynote

SeaFlowFrancois Ribalet

Jarred Swalwell

Ginger Armbrust

Page 39: eResearch New Zealand Keynote

04/11/2023 39

Script-oriented computing was killing them• Scripts (typically in R) must be pre-shared with all

collaborators• When the data changes, everybody has to re-run all the

scripts• When the scripts change, everybody has to re-run all the

scripts. • Implicit assumption that all data fits in main memory• Terrible provenance, terrible reproducibility • Pipeline of scripts dependent on intricate file formats and

file naming schemes

Bill Howe, UW

Page 40: eResearch New Zealand Keynote

Howe, et al., CISE 2012

Page 41: eResearch New Zealand Keynote

04/11/2023 41

Workflow Systems

• Scripts (typically in R) must be pre-shared with all collaborators

• When the data changes, everybody has to re-run all the scripts

• When the scripts change, everybody has to re-run all the scripts.

• Implicit assumption that data fits in main memory

• No provenance• Pipeline of scripts dependent on intricate file

formats and/or file naming schemes

Bill Howe, UW

Page 42: eResearch New Zealand Keynote

Steven Roberts

SQL as a lab notebook:http://bit.ly/16Xj2JP

Popular service for Bioinformatics Workflows

Page 43: eResearch New Zealand Keynote

Halperin, Howe, et al. SSDBM 2013

Page 44: eResearch New Zealand Keynote

04/11/2023 44

Robin Kodner

Bill Howe, UW

“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”

Data management and statistics for biologists

Page 45: eResearch New Zealand Keynote

04/11/2023 45Bill Howe, UW

“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”-- Andrew D White

Andrew White, UW Chemistry

Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted

Page 46: eResearch New Zealand Keynote

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap

FROM [[email protected]].[hotspots_deserts.tab] x INNER JOIN [[email protected]].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of queries written by non-programmers

Page 47: eResearch New Zealand Keynote

04/11/2023 47

DATA SCIENCE

Bill Howe, UW

Education

Page 48: eResearch New Zealand Keynote

04/11/2023 48Bill Howe, UW

Page 49: eResearch New Zealand Keynote

04/11/2023 49

Drew Conway’s Data Science Venn Diagram

Bill Howe, UW

Page 50: eResearch New Zealand Keynote

04/11/2023 50Bill Howe, UW

“I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.”

-- Aaron Kimball, CTO Wibidata

Page 51: eResearch New Zealand Keynote

04/11/2023 51Bill Howe, UW

But what are the abstractions of data science?

tools abstr.

“Data Jujitsu”“Data Wrangling”“Data Munging”

Translation: “We have no idea what this is all about”

Claim: Relational Algebra is the universal formalism for data modeling and manipulation, and every data scientist should know it

Page 52: eResearch New Zealand Keynote

04/11/2023 52

UW Data Science Education Efforts

Bill Howe, UW

Students Non-StudentsCS/Informatics Non-Major professionals researchersundergrads grads undergrads grads

UWEO Data Science Certificate Graduate Certificate in Big Data CS Data Management Courses eScience workshops Intro to data programming eScience Masters (planned) Coursera Course: Intro to Data Science

Previous courses:Scientific Data Management, Graduate CS, Summer 2006, Portland State UniversityScientific Data Management, Graduate CS, Spring 2010, University of Washington

Page 53: eResearch New Zealand Keynote

04/11/2023 53Bill Howe, UW

Bill Howe

Page 54: eResearch New Zealand Keynote

“Very very interesting tht there is high correlation in the way the election results were being announced and the way the graph is shaped.”

“I was quite amazined that I was able to obtain this analysis with just 2 days of lecturing and practice!”

“Inspired by all my new learning, I thought about doing a little sentiment analysis myself for our national elections!”

Page 55: eResearch New Zealand Keynote

“Darth Grader”

Page 56: eResearch New Zealand Keynote

… I was frankly amazed that you were so fast in responding to my query, in spite of the class being so huge. I’d like to thank you so much for teaching this course. This has been one of the most useful courses I’ve ever taken and you’re an awesome instructor. With such great quality education freely available, I wonder why I joined a Master’s program haha. Thanks again and take care. I’ll try my best to finish another competition.

“With such great quality education freely available, I wonder why I

joined a Master’s program haha.”

Page 57: eResearch New Zealand Keynote

04/11/2023 57

INTELLECTUAL INFRASTRUCTURE

Institution

Bill Howe, UW

Page 58: eResearch New Zealand Keynote

Seek work in “Pasteur’s Quadrant”

Considerations of use

Que

st fo

r fun

dam

enta

l und

erst

andi

ng

Pasteur

Edison

Bohr

Page 59: eResearch New Zealand Keynote

04/11/2023 59Bill Howe, UW

Multiple modes of interaction, multiple time scales

1-2 years and up1-2 weeks and down 1-2 quarters

Incubation(projects)

Communication(events)

Collaboration(partnerships)

Page 60: eResearch New Zealand Keynote

2018

2008

Some local observations:• Big data work exposes common ground• Every job is becoming “data scientist”• More π-shaped people!• Democratization to the long tail is key• Industry and research aren’t too different

Incubator• Seed grants to students and postdocs• Rotating staff from science and industry• An evolving portfolio of reusable tools• Produce digital capital and human capital

Data Science Incubator

2013

Page 61: eResearch New Zealand Keynote

04/11/2023 61Bill Howe, UW

On NoSQL

Page 62: eResearch New Zealand Keynote

04/11/2023 62Bill Howe, UW

http://sqlshare.escience.washington.edu

[email protected]

http://escience.washington.edu

Page 63: eResearch New Zealand Keynote

04/11/2023 63Bill Howe, UW

Page 64: eResearch New Zealand Keynote

04/11/2023 64

WHY SQL?

Bill Howe, UW

Page 65: eResearch New Zealand Keynote

5/18/10 Garret Cole, eScience Institute

What’s the point?• Conventional wisdom says “Science data isn’t relational”

– This is nonsense

• Conventional wisdom says “Scientists won’t write SQL”– This is nonsense

• So why aren’t databases being used more often?– They’re a PITA

• We implicate difficulty in– installation, configuration– schema design, data loading– performance tuning– app-building (NoGUI?)

We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”

Page 66: eResearch New Zealand Keynote

04/11/2023 66Bill Howe, UW

God made the integers; all else is the work of man.

(Leopold Kronecker, 19th Century Mathematician)

slide src: Mike Franklin

Codd made relations;all else is the work of man.

(Raghu Ramakrishnan, DB text book author)

Page 67: eResearch New Zealand Keynote

04/11/2023 Bill Howe, UW 67

Key Idea: Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws: 1. (+) identity: x+0 = x2. (/) identity: x/1 = x3. (*) distributes: (n*x+n*y) = n*(x+y)4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:N = (2+3)*z

two operations instead of five, no division operator

Same idea works with the Relational Algebra!

Page 68: eResearch New Zealand Keynote

Data Curation

Data Management

Data Analytics

Cyberinfrastructure

Big

Dat

a

Open D

ata

Database & Systems

Researchers

Stats, ML, and Viz

Library Science

Researchers

DataONE

Hadoop

GraphLab

Vertica

Greenplum

Oracle/MS/IBM

Dataverse

R/SPSS/MATLAB/Stata

Dryad

ICPSR

Geodata.gov

Tableau

Weka

GenBank

Intellectual Infrastructure

Spark

Pig

HIVE

Shark

Dremel

Page 69: eResearch New Zealand Keynote

04/11/2023 69

SQLShare as a CS Research Platform• Automatic “Starter” Queries

– (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)

• VizDeck: Automatic Mashups and Visualization – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)

• Info Extraction from Spreadsheets– (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)

• Scalable Analytics-as-a-Service– (Dan Suciu, Magda Balazinska, Bill Howe)

• Optimizing Iterative Queries for Machine Learning– (Dan Suciu, Magda Balazinska, Bill Howe)

• Case Studies in Metagenomics, Chemistry, more

SSDBM 2011SIGMOD 2011 (demo)

SSDBM 2011CHI 2012SIGMOD 2012 (demo)

Bill Howe, UW

VLDB 2010Datalog2.0 2012CIDR 2013

Data engineering 2012CiSE 2012

Page 70: eResearch New Zealand Keynote

Two Failure Modes Serving the Long Tail

over-abstraction

“uber-system”

“neither fish nor fowl”

tries to address so many requirements, it addresses none

too reactive, ad hoc, one-off

Addresses exactly 1 application

no leverage; doesn’t scale

Page 71: eResearch New Zealand Keynote

A stripped-down version of Jim Gray’s “20 questions” methodology

Experimental Engagement Algorithm for the Long Tail

1. Get the data2. Load the data “as is” – no schema design3. Get ~20 questions (in English)4. Translate the questions into SQL (when possible)5. Provide these “starter queries” to the researchers

Q: Can researchers questions be expressed in SQL?Q: Are a few examples sufficient for novices to self-train with SQL?Q: Can we scale this process up?Q: If so, will the use of SQL reduce their data handling overhead?