Talk for Jim Frew's grad class at Bren School, UC Santa Barbara. Oct 31, 2013. All about things you can do wrong (and right) with spreadsheets.
Spooky Spreadsheets
Carly Strasser | California Digital Library UCSB/Bren Oct 2013
From Flickr by Jeff Golden
Roadmap
1. Background
2. Best practices
3. Toolbox
From Flickr by robertpaulyoung
Scientists are bad at data management.
Many tables
Embedded figures
my spreadsheet
No headings
my spreadsheet
my spreadsheet
?
Reproducibility Transparency Reuse NO
• Didn’t share the data
• Didn’t document the data (metadata)
• Didn’t document provenance/workflow
www.petshaming.net
From Flickr by johntrainor
Why should I care?
Because they care:
From Flickr by Redden-McAllister
data management
From Flickr by Big Swede Guy
Best Practices
From Flickr by Mark Sardella
Plan before data collection
• Create a key (data dictionary)
• Make sure names are unique
• Define codes
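A data dictionary can be as simple as a small lookup structure kept next to the data. The sketch below is illustrative only: the column names, codes, and the helper functions `undocumented_columns` and `duplicate_columns` are hypothetical and do not come from the talk.

```python
# Hypothetical data dictionary: one entry per column, with units and code definitions.
DATA_DICTIONARY = {
    "site_id": {
        "description": "Sampling site code",
        "codes": {"A": "first field site", "B": "second field site"},
    },
    "count": {"description": "Individuals counted per sample", "units": "individuals"},
    "temp_c": {"description": "Water temperature", "units": "degrees Celsius"},
}

def undocumented_columns(header, dictionary):
    """Return column names that appear in the data but not in the dictionary."""
    return [name for name in header if name not in dictionary]

def duplicate_columns(header):
    """Return column names that occur more than once in the header."""
    seen, dupes = set(), []
    for name in header:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes

header = ["site_id", "count", "temp_c", "count", "notes"]
print(undocumented_columns(header, DATA_DICTIONARY))  # columns that still need a definition
print(duplicate_columns(header))                      # names that are not unique
```

Running a check like this before data collection starts makes missing definitions and repeated names visible while they are still cheap to fix.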
From Flickr by zebbie
Planning Design sample naming scheme
PhDcomics.com
Planning Design file naming scheme
Use descriptive file names
• Unique
• Reflect contents
From R Cook, ESA Best Practices Workshop 2010
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls
• Eaffinis: study organism
• nanaimo: site name
• 2010: year
• counts: what was measured
*Not for everyone
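One way to keep a naming scheme consistent is to generate names rather than type them. This is a minimal sketch, assuming the four-part pattern from the example (organism, site, year, measurement); the function `build_file_name` is hypothetical.

```python
import re

def build_file_name(organism, site, year, measurement, extension="csv"):
    """Join the four descriptive parts with underscores.

    Parts containing spaces or punctuation are rejected so the
    underscore-separated pattern stays parseable.
    """
    parts = [organism, site, str(year), measurement]
    for part in parts:
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"use letters and digits only, got {part!r}")
    return "_".join(parts) + "." + extension

print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", "xls"))
```

With these inputs the function reproduces the "Better" file name from the slide.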
Planning Design file naming scheme
From S. Hampton
Planning Design file organization
Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland
From S. Hampton
Planning Design file organization
Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?
Workflows!
Planning
Design your spreadsheet
• Constrain entries
• Atomize
• Break down spreadsheets
From Flickr by Ulleskelf
A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables
An RDB provides
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Reduced redundancy & entry errors
From Mark Schildhauer
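To make the three ingredients concrete (tables, a relationship between them, and a query language), here is a small sketch using SQLite through Python's standard `sqlite3` module. The table names, columns, and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# A set of tables, with a relationship: each sample refers to one site.
conn.executescript("""
CREATE TABLE site (
    site_id   TEXT PRIMARY KEY,
    name      TEXT,
    latitude  REAL,
    longitude REAL
);
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    site_id   TEXT NOT NULL REFERENCES site(site_id),
    year      INTEGER,
    n_counted INTEGER
);
""")
conn.executemany("INSERT INTO site VALUES (?, ?, ?, ?)", [
    ("A", "Site A", 49.2, -123.9),
    ("B", "Site B", 34.4, -119.8),
])
conn.executemany("INSERT INTO sample (site_id, year, n_counted) VALUES (?, ?, ?)", [
    ("A", 2010, 12), ("A", 2011, 7), ("B", 2010, 30),
])

# A language to query the tables: total count per site, joined on site_id.
rows = conn.execute("""
    SELECT site.name, SUM(sample.n_counted)
    FROM sample JOIN site ON sample.site_id = site.site_id
    GROUP BY site.name
    ORDER BY site.name
""").fetchall()
print(rows)
```

Site coordinates are stored once in `site` instead of being retyped on every sample row, which is where the reduced redundancy and fewer entry errors come from.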
Planning Consider a database
You should invest time in learning databases if your data sets are large or complex
Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old
Planning Consider a database
From Mark Schildhauer
Store your data in a repository
Institutional archive
Discipline/specialty archive
Pick a data repository
From Flickr by torkildr
Ask a librarian
Repos of repos:
databib.org
re3data.org
Planning
From Flickr by sepasynod
From Flickr by taberandrew
From Flickr by withassociates
What software? What hardware? What personnel?
How often? Set up reminders!
Test system
Planning Decide on preservation/backup
…document that describes what you will do with your data throughout the research project
From Flickr by Barbies Land
Write a data management plan!
Planning
DMP components
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage
But they all have different requirements and express them in different ways
Planning
From Flickr by Barbies Land
Planning
dmptool.org
Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community
During Data Collection & Entry
From Flickr by Julia Manzerova
Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs
During collection Keep raw data raw
Raw data as .csv
R script for processing & analysis
During collection Keep raw data raw
Ideally:
• Use scripts to process data
• Save them with data
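The slides mention an R script; the same idea is sketched here in Python as a stand-in. The file names, the `temp_c` column, and the `process` function are hypothetical. The point is only that the raw file is read, never written, and all changes land in a separate file.

```python
import csv
import tempfile
from pathlib import Path

def process(raw_path, out_path):
    """Read the raw CSV (never written to) and save a cleaned copy elsewhere."""
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["temp_c"] == "":  # drop records with no temperature
                continue
            writer.writerow(row)

# Demo in a temporary folder with a tiny made-up raw file.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw_temperature.csv"
raw.write_text("site,temp_c\nA,12.1\nB,\nC,13.4\n")
raw.chmod(0o444)  # mark read-only to guard against accidental edits

clean = workdir / "clean_temperature.csv"
process(raw, clean)
print(clean.read_text())
```

Saving this script alongside the raw .csv means anyone can regenerate the cleaned file from scratch.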
During collection Document your workflow
Temperature data, Salinity data
→ Data import into Excel
→ Data in spreadsheet
→ Quality control & data cleaning
→ “Clean” T & S data
→ Analysis: mean, SD
→ Summary statistics, Graph production
Workflow: how you get from the raw data to the final products of your research
Simple workflow: flow chart
During collection Document your workflow
Workflow: how you get from the raw data to the final products of your research
Simple workflow: commented script
• R, SAS, MATLAB…
• Well-documented code is easier to review, easier to share, and easier to use for repeat analysis
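A commented script for a simple temperature and salinity workflow might look like the sketch below. It is written in Python rather than the languages named on the slide, and the readings and the -999 missing-value code are invented for illustration.

```python
import statistics

# Step 1: data import
# (hypothetical readings; in practice these would be read from the raw .csv)
temperature = [12.1, 12.4, -999.0, 12.9, 13.0]
salinity = [33.1, 33.4, 33.2, -999.0, 33.5]

# Step 2: quality control and data cleaning
# Assumption: -999 marks a missing reading and must not enter the statistics.
def clean(values, missing_code=-999.0):
    return [v for v in values if v != missing_code]

clean_t = clean(temperature)
clean_s = clean(salinity)

# Step 3: analysis (mean and sample standard deviation)
summary = {
    "temperature": (statistics.mean(clean_t), statistics.stdev(clean_t)),
    "salinity": (statistics.mean(clean_s), statistics.stdev(clean_s)),
}

# Step 4: summary statistics for the report
for name, (mean, sd) in summary.items():
    print(f"{name}: mean={mean:.2f}, SD={sd:.2f}")
```

Each comment names a step in the workflow, so the script doubles as its own documentation.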
Fancy schmancy workflows Resulting output
https://kepler-project.org
During collection Document your workflow
Workflows enable
• Reproducibility
• Transparency
• Reuse
From Flickr by merlinprincesse
During collection Document your workflow
Constrain data entries
• Excel lists
• Data validation
• Google docs forms
Modified from K. Vanderbilt
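The same constraints can be checked in a script after entry. This sketch assumes a hypothetical pick-list of site codes and a plausible temperature range; neither comes from the talk.

```python
ALLOWED_SITES = {"A", "B", "C"}  # hypothetical pick-list of site codes
TEMP_RANGE_C = (-2.0, 40.0)      # hypothetical plausible range, degrees Celsius

def validate_entry(site, temp_c):
    """Return a list of problems with one entry; an empty list means it passes."""
    problems = []
    if site not in ALLOWED_SITES:
        problems.append(f"unknown site code: {site!r}")
    try:
        value = float(temp_c)
    except (TypeError, ValueError):
        problems.append(f"temperature is not a number: {temp_c!r}")
    else:
        low, high = TEMP_RANGE_C
        if not low <= value <= high:
            problems.append(f"temperature out of range: {value}")
    return problems

print(validate_entry("A", "12.5"))  # passes both checks
print(validate_entry("a", "warm"))  # fails both checks
```

A pick-list check catches spelling and capitalization variants; a range check catches typos such as a misplaced decimal point.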
During collection
During collection Atomize
One piece of information per cell
Create parameter table
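As an illustration of atomizing, the sketch below splits one overloaded cell into separate fields. The cell layout ("site, date, value unit") and the function `atomize` are assumptions made for this example.

```python
def atomize(cell):
    """Split a combined 'site, date, value unit' cell into one field per item."""
    site, date, measurement = [part.strip() for part in cell.split(",")]
    value, unit = measurement.split()
    return {
        "site": site,
        "date": date,
        "temp_value": float(value),  # the number, stored as a number
        "temp_unit": unit,           # the unit, stored separately
    }

print(atomize("Site A, 2010-07-14, 12.1 C"))
```

Once each piece of information has its own column, sorting, filtering, and arithmetic work without any text parsing.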
From doi:10.3334/ORNLDAAC/777
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
During collection Break down spreadsheets
Fake a relational database
Create a site table
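Breaking a flat sheet into a site table plus an observations table can be sketched as follows. The rows and field names are invented; the idea is that site details are stored once and observations refer to them by site code.

```python
# One wide sheet where site coordinates are retyped on every row (made-up values).
flat_rows = [
    {"site": "A", "lat": 49.2, "lon": -123.9, "year": 2010, "count": 12},
    {"site": "A", "lat": 49.2, "lon": -123.9, "year": 2011, "count": 7},
    {"site": "B", "lat": 34.4, "lon": -119.8, "year": 2010, "count": 30},
]

site_table = {}    # one entry per site
observations = []  # one entry per measurement, linked by site code

for row in flat_rows:
    site_info = {"lat": row["lat"], "lon": row["lon"]}
    # setdefault stores the first coordinates seen; later rows must agree with them
    if site_table.setdefault(row["site"], site_info) != site_info:
        raise ValueError(f"conflicting coordinates for site {row['site']!r}")
    observations.append(
        {"site": row["site"], "year": row["year"], "count": row["count"]}
    )

print(site_table)
print(observations)
```

The consistency check is a side benefit: a mistyped coordinate on one row is reported instead of silently creating a second version of the site.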
Why are you promoting Excel?
During collection Create metadata
Metadata: data reporting
WHO created the data?
WHAT is the content of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?
From Flickr by /\/\ichael Patric|{
During collection Create metadata
Digital context
• Name of the data set
• The name(s) of the data file(s) in the data set
• Date the data set was last modified
• Example data file records for each data type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number) used to prepare/read the data set
• Data processing that was performed
Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
Scientific context
• Scientific reason why the data were collected
• What data were collected
• What instruments (including model & serial number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used
Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g. uncertainty, sampling problems)
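Even without a formal standard, a metadata record can be kept as a small structured file next to the data. The sketch below uses the who/what/when/where/how/why questions as required fields; the field values are placeholders, and the helper `missing_fields` is hypothetical.

```python
import json

REQUIRED = ("who", "what", "when", "where", "how", "why")

# Placeholder values only; replace with the real details of the data set.
metadata = {
    "who": "Name of data creator and contact person",
    "what": "Zooplankton counts per sample",
    "when": "2010",
    "where": "Name and coordinates of the field site",
    "how": "Instrument, method, units, and processing steps",
    "why": "",
}

def missing_fields(record):
    """List the required questions that are absent or left blank."""
    return [key for key in REQUIRED if not str(record.get(key, "")).strip()]

print(missing_fields(metadata))        # questions still unanswered
print(json.dumps(metadata, indent=2))  # text that can be saved beside the data
```

A check like this is a reminder to fill in the blanks before the details are forgotten, not a substitute for a community metadata standard.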
During collection Create metadata
What is metadata?
Metadata standards…
• Provide structure to describe data
  Common terms | definitions | language | structure
• Come in many flavors
  EML, FGDC, ISO19115, DarwinCore,…
• Can be met using software tools
  Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)
During collection Create metadata
Standard <
During collection Back up daily
From Flickr by lippo
From Flickr by see phar
Original
Near
Far
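The original/near/far idea can be sketched as a copy step followed by a checksum comparison. The folders below are temporary stand-ins; in practice "near" might be an external drive and "far" an off-site or cloud location. The function `back_up` is hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path):
    """Checksum of a file's contents, used to confirm a copy is identical."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def back_up(original, destinations):
    """Copy the original into each destination folder and verify every copy."""
    expected = sha256(original)
    copies = []
    for folder in destinations:
        Path(folder).mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder))  # copy2 also keeps timestamps
        if sha256(copy) != expected:
            raise OSError(f"backup in {folder} does not match the original")
        copies.append(copy)
    return copies

# Demo with temporary folders standing in for the three locations.
root = Path(tempfile.mkdtemp())
(root / "original").mkdir()
original = root / "original" / "data.csv"
original.write_text("site,count\nA,12\n")

copies = back_up(original, [root / "near", root / "far"])
print([str(c) for c in copies])
```

Verifying the checksum is the "test system" step: a backup that was never checked may not be a backup.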
During collection
From Flickr by Barbies Land
Remember that data management plan?
Revisit Review Revise
During collection
Schedule a time each week or month
Revisit Review Revise
From Flickr by purplemattfish
From Flickr by dipster1
Toolbox
Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community
dmptool.org
Write a DMP
databib.org
Where should I put my data?
Find a repository
• Help researchers manage, describe, and share tabular data
• Free
• Add-in for Excel & web application
Manage & share
Features
1. Best practices check
2. Generate metadata
3. Get identifier & citation
4. Post data to repository
Manage & share
Create metadata
Create metadata
Clean data
Open Refine = Google Refine
• Open source desktop application
• Used for data cleanup and transformation to other formats
• Works with spreadsheets but behaves like a database
• User can filter the rows to display using facets that define filtering criteria
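The facet idea can be imitated in a few lines of plain Python. This is not Open Refine's own interface, only a sketch of the concept: count the distinct values in a column, then filter rows by one of them. The sample rows are invented.

```python
from collections import Counter

rows = [
    {"sample": 1, "species": "E. affinis"},
    {"sample": 2, "species": "E affinis"},
    {"sample": 3, "species": "E. affinis"},
    {"sample": 4, "species": "e. affinis"},
]

def text_facet(rows, column):
    """Count each distinct value in a column, most common first."""
    return Counter(row[column] for row in rows).most_common()

def filter_by_facet(rows, column, value):
    """Keep only the rows whose column equals the chosen facet value."""
    return [row for row in rows if row[column] == value]

print(text_facet(rows, "species"))                    # spelling variants show up as separate values
print(filter_by_facet(rows, "species", "E affinis"))  # rows to inspect and correct
```

Listing distinct values is often the quickest way to spot inconsistent spellings that should be a single category.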
DCXL blog: dcxl.cdlib.org
Toolbox:
Get help
From Flickr by twm1340
Culture Shift Ahead
science source notebook content access data government knowledge
From Flickr by cdsessums
From Flickr by Andy Graulund
Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time
Website: carlystrasser.net
Email: carlystrasser@gmail.com
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser