A quick, off-the-cuff talk about why I think SQL is good for scientists. Please send me notes correcting my Python, arguing, or asking for more information! And see the tutorial at: http://uwescience.github.io/sqlshare
A case for teaching SQL to scientists
Daniel Halperin
#w2tbac @SESYNC 2013-07-09
SQL: think like data
• SQL is a Language for expressing Queries over Structured data.
• vs Python/R, SQL is
• strictly less powerful
• better for concisely, clearly, and efficiently expressing data manipulation
• ... and anecdotally, “many” scripts written by scientists just manipulate data
Claim 1: SQL is Concise & Clear
• English questions often translate directly into SQL
• Scripting languages carry a lot of language overhead: file handling, loops, and parsing that say nothing about the question being asked
• Let’s see some (admittedly biased) examples
with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)
What does this code do?
SELECT COUNT(*) AS cnt
FROM file
with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')
What does this code do?
SELECT *
FROM file
WHERE value > 5
What does this code do?
from collections import defaultdict

with open('file.txt') as input_file:
    tot_counts = defaultdict(int)
    for line in input_file:
        fields = line.split()
        tot_counts[fields[3]] += int(fields[4])
for value in tot_counts:
    print(value, tot_counts[value])
SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value
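To try the three queries above without setting up a database server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes file.txt holds five whitespace-separated fields per line, with `value` in field 3 and `counts` in field 4 (0-indexed), matching the Python snippets. The sample rows and the column names a, b, c are made up for illustration, and ORDER BY is added only to make the printed output deterministic.

```python
import sqlite3

# Hypothetical sample rows standing in for file.txt (assumption: five
# whitespace-separated fields; field 3 is `value`, field 4 is `counts`).
SAMPLE = """\
x y z 3 10
x y z 7 20
x y z 7 5
x y z 9 1
"""

def load(conn, text):
    # Columns a, b, c are placeholder names for the fields the queries ignore.
    conn.execute(
        "CREATE TABLE file (a TEXT, b TEXT, c TEXT, value INTEGER, counts INTEGER)"
    )
    rows = [line.split() for line in text.splitlines() if line.strip()]
    conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, SAMPLE)

# Count the rows.
row_count = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()[0]

# Keep only the rows whose value exceeds 5.
big_rows = conn.execute("SELECT * FROM file WHERE value > 5").fetchall()

# Total the counts for each distinct value.
totals = conn.execute(
    "SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value ORDER BY value"
).fetchall()

print(row_count)
print(big_rows)
print(totals)
```

The loading step is the only part that knows about the file layout; each question after that is a single SQL statement.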
What does this code do?
SELECT census.county, electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county
<Complicated stuff with dictionaries>
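For comparison, here is one way the "complicated stuff with dictionaries" might look in Python. This is a sketch, not the code from the talk: the two lists of rows, the county names, and the numbers are all hypothetical, and it assumes each county appears at most once in census.

```python
# Hypothetical in-memory stand-ins for the two tables; the county names and
# numbers are made up for illustration.
electoral = [
    {"county": "Adams", "votes": 500},
    {"county": "Baker", "votes": 1200},
    {"county": "Clark", "votes": 300},  # no census row, so the join drops it
]
census = [
    {"county": "Adams", "population": 1000},
    {"county": "Baker", "population": 4000},
]

def voting_rates(electoral, census):
    # Build a lookup from county to population (assumes unique counties).
    population_by_county = {row["county"]: row["population"] for row in census}
    result = []
    for row in electoral:
        county = row["county"]
        # Keep only counties present in both tables, like the WHERE clause.
        if county in population_by_county:
            result.append((county, row["votes"] / population_by_county[county]))
    return result

rates = voting_rates(electoral, census)
print(rates)
```

Note that Python's `/` is true division here; in some databases, dividing two integer columns truncates, so the SQL version may need a cast to get a fractional rate.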
Claim 2: SQL is Efficient
Scaling up your data
• What happens when Python/R data doesn’t fit in memory? The script crashes, or you rewrite it as much more complicated code
• Databases are built for data larger than memory: they spill to disk automatically and transparently, and are heavily optimized for performance
Say you inherit a really well-engineered Python script:
    ./highly_optimized_code.py < TB.dataset > GB.result
But you are only interested in a small fraction of the result:
    ./simple_data_filter.py < GB.result > MB.answer
Your options:
1) Dive into the complex code and modify its internals to filter inside
2) Suffer the long running time of the first program
CREATE VIEW their_query AS
    SELECT <... their code ...>
    FROM terabyte_dataset
Gives their query a name, but doesn’t execute it!
SELECT *
FROM their_query
WHERE <... your filter ...>
Combine both queries and optimize together!
Fast!
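The view pattern above can be tried end to end with Python's built-in sqlite3 module. This is a small sketch under stated assumptions: the table contents, the column names sample_id and reading, and the stand-in for "their code" are all hypothetical. How aggressively a filter is pushed down into a view depends on the database's optimizer, so treat this as an illustration of the pattern rather than a performance measurement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for terabyte_dataset; the columns are illustrative.
conn.execute("CREATE TABLE terabyte_dataset (sample_id INTEGER, reading REAL)")
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],
)

# "Their code": creating the view stores the query definition only;
# no result rows are computed at this point.
conn.execute(
    """
    CREATE VIEW their_query AS
    SELECT sample_id, reading * 2 AS doubled
    FROM terabyte_dataset
    """
)

# "Your filter": the database sees the view definition and the filter
# together when it plans this query.
answer = conn.execute(
    "SELECT * FROM their_query WHERE sample_id < 3 ORDER BY sample_id"
).fetchall()
print(answer)
```

Nobody had to open up "their code" to add the filter; the view and the filter stay separate pieces that the database combines at query time.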
SQL for Science
• UW’s SQLShare - open, view-oriented, web database service
• Easy data import, public & private sharing, permalinks (DOI support coming)
• Use a series of views instead of scripts for:
• data cleaning, transformation, integration
• simple stats, analytics, format conversion
• provenance and publishing
• mashups: integrated with R, Sage, etc.
escience.washington.edu/sqlshare
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
- Andrew D White, grad student in UW Chem Eng
“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”
- Robin Kodner, as asst professor at Western Washington U
"That [SQL query that finished in 1 second] took me a week [manually in Excel]!"
- Robin Kodner, as postdoc at UW Oceanography
* yes, we need (and are interested in) more than anecdotes!!
SQL can do more than you think (here vs R)