A case for teaching SQL to scientists Daniel Halperin #w2tbac @SESYNC 2013-07-09

A quick, off-the-cuff talk about why I think SQL is good for scientists. Please send me notes correcting my Python, arguing, or asking for more information! And see the tutorial at: http://uwescience.github.io/sqlshare

Page 1: A case for teaching SQL to scientists

A case for teaching SQL to scientists

Daniel Halperin #w2tbac @SESYNC 2013-07-09

Page 2: A case for teaching SQL to scientists

SQL: think like data

• SQL is a Language for expressing Queries over Structured data.

• vs Python/R, SQL is

• strictly less powerful

• better for concisely, clearly, and efficiently expressing data manipulation

• ... and anecdotally, “many” scripts written by scientists just manipulate data

Page 3: A case for teaching SQL to scientists

Claim 1: SQL is Concise & Clear

• English questions often translate directly into SQL

• Scripting languages have a lot of language overhead: syntactic boilerplate for opening files, looping, and parsing

• Let’s see some (admittedly biased) examples

Page 4: A case for teaching SQL to scientists

with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)

What does this code do?

Page 5: A case for teaching SQL to scientists

with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)

What does this code do?

SELECT COUNT(*) AS cnt
FROM file

Page 6: A case for teaching SQL to scientists

with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')

What does this code do?

Page 7: A case for teaching SQL to scientists

with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')

What does this code do?

SELECT *
FROM file
WHERE value > 5

Page 8: A case for teaching SQL to scientists

What does this code do?

SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value

Page 9: A case for teaching SQL to scientists

What does this code do?

from collections import defaultdict

with open('file.txt') as input_file:
    tot_counts = defaultdict(int)
    for line in input_file:
        fields = line.split()
        tot_counts[fields[3]] += int(fields[4])
for value in tot_counts:
    print(value, tot_counts[value])

SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value
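The SQL on these slides assumes that file.txt has already been loaded as a table named file, with the fourth field called value and the fifth called counts. A minimal sketch of that step, using Python's built-in sqlite3 module, might look like the following. The sample lines and the column names a, b, and c are hypothetical, and the ORDER BY is added here only to make the printed output deterministic.

```python
import sqlite3

# Hypothetical sample standing in for file.txt: five whitespace-separated
# fields per line. Only the 4th (value) and 5th (counts) are used by the queries.
SAMPLE_LINES = [
    "s1 2013 x 3 10",
    "s2 2013 y 7 4",
    "s3 2013 z 7 6",
    "s4 2013 x 9 1",
]


def load_table(conn, lines):
    """Create a table named `file` and insert one row per input line."""
    conn.execute(
        "CREATE TABLE file (a TEXT, b TEXT, c TEXT, value INTEGER, counts INTEGER)"
    )
    rows = (line.split() for line in lines)
    conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load_table(conn, SAMPLE_LINES)

# The three queries from the slides, run against the loaded table.
row_count = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()[0]
big_values = conn.execute("SELECT * FROM file WHERE value > 5").fetchall()
totals = conn.execute(
    "SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value ORDER BY value"
).fetchall()

print(row_count)
print(big_values)
print(totals)
```

Once the table exists, each English question maps to one short query, and the parsing logic (split, index, convert to int) lives in a single place instead of being repeated in every script.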

Page 10: A case for teaching SQL to scientists

What does this code do?

SELECT census.county, electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county

Page 11: A case for teaching SQL to scientists

What does this code do?

SELECT census.county, electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county

<Complicated stuff with dictionaries>
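For comparison, here is one way the "complicated stuff with dictionaries" could look in Python. This is a sketch, not the code from the talk: the county names and numbers are made up, and it assumes each county appears at most once in each input. Note also that Python 3's / is true division, whereas several SQL engines truncate when both columns are integers, so the SQL version may need something like votes * 1.0 / population to return a fraction.

```python
# Hypothetical per-county records; in practice these would be parsed from files.
electoral = [("King", 900), ("Pierce", 300), ("Skagit", 50)]
census = [("King", 2000), ("Pierce", 800), ("Yakima", 250)]


def voting_rates(electoral_rows, census_rows):
    """Join the two inputs on county and compute votes / population."""
    # Build a lookup from county to population (one side of the join).
    population_by_county = {county: pop for county, pop in census_rows}
    rates = {}
    for county, votes in electoral_rows:
        # Keep only counties present in both inputs, as the WHERE clause does.
        if county in population_by_county:
            rates[county] = votes / population_by_county[county]
    return rates


rates = voting_rates(electoral, census)
for county, rate in sorted(rates.items()):
    print(county, rate)
```

Even in this small form, the join logic (build a lookup, probe it, skip unmatched keys) has to be written by hand, while the SQL states only which columns must match.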

Page 12: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

Scaling up your data

• What happens when Python/R data doesn’t fit in memory? The script crashes, or you rewrite it as much more complicated code

• Most database systems automatically and transparently spill to disk, and are heavily optimized for performance
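As a rough illustration of the second bullet, the sketch below keeps the data in an on-disk SQLite database and lets the database do the aggregation, so Python never holds the full dataset. It assumes Python's built-in sqlite3 module; the file location, table layout, and row generator are hypothetical, and 100,000 rows is of course far too small to stress memory.

```python
import os
import sqlite3
import tempfile


def generate_rows(n):
    """Yield (value, counts) pairs one at a time, never building a full list."""
    for i in range(n):
        yield (i % 10, 1)


# Hypothetical location for the database file.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE file (value INTEGER, counts INTEGER)")

# executemany pulls rows from the generator as it goes; they end up on disk.
conn.executemany("INSERT INTO file VALUES (?, ?)", generate_rows(100000))
conn.commit()

# The aggregation runs inside the database; Python receives ten result rows.
totals = conn.execute(
    "SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value ORDER BY value"
).fetchall()
conn.close()

print(totals)
```

The equivalent pure-Python script would need its own strategy for data larger than memory, which is the rewrite the first bullet warns about.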

Page 13: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

Say you inherit a really well-engineered Python script

./highly_optimized_code.py < TB.dataset > GB.result

Page 14: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

Say you inherit a really well-engineered Python script

./highly_optimized_code.py < TB.dataset > GB.result

But you are only interested in a small fraction of the result

./simple_data_filter.py < GB.result > MB.answer

Page 15: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

Say you inherit a really well-engineered Python script

./highly_optimized_code.py < TB.dataset > GB.result

But you are only interested in a small fraction of the result

./simple_data_filter.py < GB.result > MB.answer

1) Dive into the complex code and modify its internals to filter inside
2) Suffer the long running time of the first program

Page 16: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset

Gives their query a name, but doesn’t execute it!

Page 17: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset

Gives their query a name, but doesn’t execute it!

SELECT *
FROM their_query
WHERE <... your filter ...>

Combine both queries and optimize together!

Page 18: A case for teaching SQL to scientists

Claim 2: SQL is Efficient

CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset

Gives their query a name, but doesn’t execute it!

SELECT *
FROM their_query
WHERE <... your filter ...>

Combine both queries and optimize together!

Fast!
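To make the view idea concrete, here is a small runnable sketch using Python's built-in sqlite3 module. The table columns, the sample rows, and the aggregate standing in for "their code" are all hypothetical. How much of the filter a given engine pushes into the view's query depends on the engine and on the query itself, so treat this as an illustration of the workflow rather than a performance demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the large dataset.
conn.execute("CREATE TABLE terabyte_dataset (id INTEGER, site TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?, ?)",
    [(1, "A", 1.5), (2, "A", 2.5), (3, "B", 10.0), (4, "B", 20.0)],
)

# Defining the view stores the query under a name; no rows are computed yet.
conn.execute(
    """
    CREATE VIEW their_query AS
    SELECT site, AVG(reading) AS mean_reading
    FROM terabyte_dataset
    GROUP BY site
    """
)

# Your filter is written against the view, without touching the view's internals.
filtered = conn.execute(
    "SELECT * FROM their_query WHERE site = ?", ("B",)
).fetchall()

print(filtered)
```

The inherited query stays untouched and reusable, and the filter is a separate one-line statement, which avoids both options from the Python scenario: editing someone else's internals or paying for the full first stage.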

Page 19: A case for teaching SQL to scientists

SQL for Science

• UW’s SQLShare - open, view-oriented, web database service

• Easy data import, public & private sharing, permalinks (DOI support coming)

• Use a series of views instead of scripts for:

• data cleaning, transformation, integration

• simple stats, analytics, format conversion

• provenance and publishing

• mashups: integrated with R, Sage, etc.

Page 20: A case for teaching SQL to scientists

escience.washington.edu/sqlshare

“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”

- Andrew D White, grad student in UW Chem Eng

“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”

- Robin Kodner, as asst professor at Western Washington U

"That [SQL query that finished in 1 second] took me a week [manually in Excel]!"

- Robin Kodner, as postdoc at UW Oceanography

* yes, we need (and are interested in) more than anecdotes!!

Page 21: A case for teaching SQL to scientists

SQL can do more than you think (here vs R)