Bren - UCSB - Spooky spreadsheets


DESCRIPTION

Talk for Jim Frew's grad class at Bren School, UC Santa Barbara. Oct 31, 2013. All about things you can do wrong (and right) with spreadsheets.


Spooky  Spreadsheets  

Carly  Strasser  |  California  Digital  Library  UCSB/Bren  Oct  2013  

From  Flickr  by  Jeff  Golden  

Roadmap

1. Background
2. Best practices
3. Toolbox

From  Flickr  by  robertpaulyoung  

Scientists  are  bad  at  data  management.  

[Screenshots of “my spreadsheet”: many tables, embedded figures, no headings]

?

Reproducibility, transparency, reuse: NO

• Didn’t share the data
• Didn’t document the data (metadata)
• Didn’t document provenance/workflow

www.petshaming.net

From  Flickr  by  johntrainor  

Why  should  I  care?  

Because  they  care:  

From Flickr by Redden-McAllister

data management

From Flickr by Big Swede Guy

Best  Practices  

From  Flickr  by  Mark  Sardella  

Plan  before  data  collection  

• Create a key (data dictionary)
• Make sure names are unique
• Define codes

From Flickr by zebbie

Planning  Design  sample  naming  scheme  

PhDcomics.com  

Planning Design file naming scheme

Use descriptive file names
• Unique
• Reflect contents

From  R  Cook,  ESA  Best  Practices  Workshop  2010  

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls
• Eaffinis: study organism
• nanaimo: site name
• 2010: year
• counts: what was measured

*Not for everyone
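A name like the one above can be assembled from its parts by a small helper. This is only a sketch: `build_filename` and its rule of allowing letters and digits only are assumptions made for illustration, not part of the talk.

```python
import re

def build_filename(organism, site, year, measurement, ext="csv"):
    """Join the descriptive parts as organism_site_year_measurement.ext."""
    parts = [organism, site, str(year), measurement]
    for part in parts:
        # Assumed rule: letters and digits only, so names contain no spaces.
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"unsafe file name part: {part!r}")
    return "_".join(parts) + "." + ext

print(build_filename("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
# prints Eaffinis_nanaimo_2010_counts.xls
```

A part such as "best version" is rejected because of the space, which is the kind of name the "Bad" examples warn about.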

Planning  Design  file  naming  scheme  

From  S.  Hampton  

Planning  Design  file  organization  

Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland

From  S.  Hampton  

Planning  Design  file  organization  

Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?

Workflows!

Planning Design your spreadsheet
• Constrain entries
• Atomize
• Break down spreadsheets

From  Flickr  by  Ulleskelf  

A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables

An RDB provides
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Reduced redundancy & entry errors

 

From  Mark  Schildhauer  
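To make "tables, relationships, and a query language" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, columns, and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE site (
        site_id TEXT PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE count_obs (
        obs_id  INTEGER PRIMARY KEY,
        site_id TEXT NOT NULL REFERENCES site(site_id),
        year    INTEGER NOT NULL,
        n       INTEGER NOT NULL CHECK (n >= 0)
    );
""")
# Made-up rows: site details are stored once, observations point to them.
conn.executemany("INSERT INTO site VALUES (?, ?)",
                 [("S1", "Lake A"), ("S2", "Lake B")])
conn.executemany("INSERT INTO count_obs (site_id, year, n) VALUES (?, ?, ?)",
                 [("S1", 2010, 12), ("S1", 2011, 7), ("S2", 2010, 30)])

# The query language (SQL) relates the two tables and summarizes per site.
rows = conn.execute("""
    SELECT site.name, SUM(count_obs.n)
    FROM count_obs JOIN site ON site.site_id = count_obs.site_id
    GROUP BY site.name
    ORDER BY site.name
""").fetchall()
print(rows)  # [('Lake A', 19), ('Lake B', 30)]
```

The `CHECK (n >= 0)` constraint is one small example of how a database can reduce entry errors: a negative count is refused at insert time.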

Planning  Consider  a  database  

You should invest time in learning databases if
• your data sets are large or complex

Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old

Planning  Consider  a  database  

From  Mark  Schildhauer  

Store  your  data  in  a  repository  

Institutional  archive  

Discipline/specialty  archive  

   

 

Pick  a  data  repository  

From  Flickr  by  torkildr  

Ask  a  librarian  

Repos  of  repos:  

databib.org  

re3data.org  

Planning  

From Flickr by sepa synod

From  Flickr  by    taberandrew  

From  Flickr    by  withassociates  

Planning Decide on preservation/backup
• What software? What hardware? What personnel?
• How often? Set up reminders!
• Test system

…document that describes what you will do with your data throughout the research project

From  Flickr  by  Barbies  Land  

Write  a  data  management  plan!  

Planning  

DMP components
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage

But they all have different requirements and express them in different ways

Planning  

From  Flickr  by  Barbies  Land  

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free & open to community

dmptool.org

Planning

During  Data  Collection  &  Entry  

From  Flickr  by  Julia  Manzerova  

Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs

During  collection  Keep  raw  data  raw  

Raw  data  as  .csv  

R  script  for  processing  &  analysis  

During  collection  Keep  raw  data  raw  

Ideally:
• Use scripts to process data
• Save them with data
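The talk suggests an R script for this step; the same idea is sketched here in Python. The file names, the `temp_c` column, and the "drop rows with a missing reading" rule are all assumptions for the demo. The raw file is only read; cleaned output goes to a separate file.

```python
import csv
import tempfile
from pathlib import Path

def process(raw_path, out_path):
    """Read the raw file (never written to) and save a cleaned copy elsewhere."""
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            if row["temp_c"].strip() == "":  # assumed rule: drop missing readings
                continue
            writer.writerow(row)
            kept += 1
    return kept

# Demo with a made-up raw file in a temporary folder.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw_temperature.csv"
raw.write_text("site,temp_c\nS1,12.5\nS2,\nS3,14.0\n")
raw_before = raw.read_text()

kept = process(raw, workdir / "clean_temperature.csv")
print(kept, "rows kept; raw file unchanged:", raw.read_text() == raw_before)
```

Saving a script like this next to the data records exactly how the working file was derived from the raw one.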

During  collection  Document  your  workflow  

Workflow: how you get from the raw data to the final products of your research

Simple workflow: flow chart

Temperature data, salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production

During  collection  Document  your  workflow  

Workflow:  how  you  get  from  the  raw  data  to  the  final  products  of  your  research  

 

Simple workflow: commented script
• R, SAS, MATLAB…
• Well-documented code is easier to review, easier to share, and easier to use for repeat analysis
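A commented script for a small temperature workflow might look like the sketch below. It is written in Python rather than the languages named on the slide, and the readings and the -999 missing-value code are invented for illustration.

```python
import statistics

# Step 1: raw readings (made-up values; -999 is an assumed missing-value code).
raw_temperature_c = [12.1, 12.4, -999, 11.9, 12.6]

# Step 2: quality control. Build a cleaned list and leave the raw list untouched.
clean_temperature_c = [t for t in raw_temperature_c if t != -999]

# Step 3: analysis. Mean and sample standard deviation of the clean data.
mean_c = statistics.mean(clean_temperature_c)
sd_c = statistics.stdev(clean_temperature_c)

# Step 4: report the summary statistics.
print(f"n={len(clean_temperature_c)} mean={mean_c:.2f} sd={sd_c:.2f}")
```

Each comment names one step of the workflow, so a reviewer can follow the path from raw data to summary statistics without running the code.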


Fancy schmancy workflows

Resulting output

https://kepler-project.org

During  collection  Document  your  workflow  

Workflows enable
• Reproducibility
• Transparency
• Reuse

From  Flickr  by  merlinprincesse  

During  collection  Document  your  workflow  

Constrain data entries
• Excel lists
• Data validation
• Google docs forms

Modified  from  K.  Vanderbilt    
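The same kind of constraint can also be checked in a script after entry. The sketch below is an assumption-laden illustration: the site codes, the `site` and `count` column names, and the `validate_row` helper are all hypothetical.

```python
ALLOWED_SITES = {"S1", "S2", "S3"}  # hypothetical list of valid site codes

def validate_row(row):
    """Return a list of problems found in one data-entry row (empty if none)."""
    problems = []
    if row.get("site") not in ALLOWED_SITES:
        problems.append(f"unknown site: {row.get('site')!r}")
    try:
        count = int(row.get("count", ""))
        if count < 0:
            problems.append("count must not be negative")
    except (TypeError, ValueError):
        problems.append(f"count is not a whole number: {row.get('count')!r}")
    return problems

print(validate_row({"site": "S1", "count": "14"}))     # no problems
print(validate_row({"site": "s1 ", "count": "lots"}))  # two problems
```

A drop-down list in the spreadsheet prevents these errors at entry time; a check like this catches whatever slips through.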

During  collection  

During collection Atomize

One  piece  of  information  per  cell  
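As a rough illustration of atomizing, a cell that mixes a value and its unit can be split into two fields. The cell format ("number, optional space, unit") and the `atomize_measurement` helper are assumptions for this sketch.

```python
import re

def atomize_measurement(cell):
    """Split a combined cell such as '12.5 C' into a numeric value and a unit."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*", cell)
    if match is None:
        raise ValueError(f"cannot split cell: {cell!r}")
    return float(match.group(1)), match.group(2)

print(atomize_measurement("12.5 C"))  # (12.5, 'C')
```

With the value and the unit in separate columns, the values can be sorted and averaged as numbers.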

 Create  parameter  table  

From  doi:10.3334/ORNLDAAC/777  


From  R  Cook,  ESA  Best  Practices  Workshop  2010  

During  collection  Break  down  spreadsheets  

Fake  a  relational  database  

Create  a  site  table  
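A minimal sketch of the site-table idea, using made-up rows: site details that repeat on every line of a flat sheet move into their own table, and the observations keep only the site key.

```python
# Flat sheet: coordinates are retyped on every row (all values are invented).
flat_rows = [
    {"site": "S1", "lat": 49.17, "lon": -123.94, "year": 2010, "count": 12},
    {"site": "S1", "lat": 49.17, "lon": -123.94, "year": 2011, "count": 7},
    {"site": "S2", "lat": 49.20, "lon": -123.95, "year": 2010, "count": 30},
]

# Site table: one row per site, so coordinates are recorded only once.
site_table = {}
for row in flat_rows:
    site_table.setdefault(row["site"], {"lat": row["lat"], "lon": row["lon"]})

# Observation table: the site key plus what was measured.
obs_table = [{"site": r["site"], "year": r["year"], "count": r["count"]}
             for r in flat_rows]

# "Join" the two tables again whenever the coordinates are needed.
joined = [{**obs, **site_table[obs["site"]]} for obs in obs_table]
print(len(site_table), len(obs_table), joined == flat_rows)  # 2 3 True
```

In a spreadsheet the two tables would be separate tabs or files linked by the site code.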

Why  are  you  promoting  Excel?  

During  collection  Create  metadata  

   Metadata:  data  reporting    

WHO  created  the  data?  

WHAT is the content of the data set?

WHEN  was  it  created?  

WHERE  was  it  collected?  

HOW  was  it  developed?  

WHY  was  it  developed?  

From Flickr by /\/\ichael Patric|{

During  collection  Create  metadata  

Digital  context  

•  Name  of  the  data  set  

•  The  name(s)  of  the  data  file(s)  in  the  data  set  

•  Date  the  data  set  was  last  modified  

•  Example  data  file  records  for  each  data  type  file  

•  Pertinent  companion  files  

•  List  of  related  or  ancillary  data  sets  

•  Software  (including  version  number)  used  to  prepare/read    the  data  set  

•  Data  processing  that  was  performed  

Personnel  &  stakeholders  

•  Who  collected    

•  Who  to  contact  with  questions  

•  Funders  

Scientific  context  

•  Scientific  reason  why  the  data  were  collected  

•  What  data  were  collected  

•  What  instruments  (including  model  &  serial  number)  were  used  

•  Environmental  conditions  during  collection  

•  Temporal  &  spatial  resolution    

•  Standards  or  calibrations  used  

Information  about  parameters  

•  How  each  was  measured  or  produced  

•  Units  of  measure  

•  Format  used  in  the  data  set  

•  Precision  &  accuracy  if  known  

Information  about  data  

•  Definitions  of  codes  used  

•  Quality  assurance  &  control  measures  

•  Known  problems  that  limit  data  use  (e.g.  uncertainty,  sampling  problems)    
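One lightweight way to capture items like these is a small structured record kept alongside the data files. The sketch below is not a metadata standard; the field names, the required-field list, and every value are placeholders chosen for illustration.

```python
import json

metadata = {
    "dataset_name": "Example plankton counts",       # placeholder
    "files": ["Eaffinis_nanaimo_2010_counts.csv"],   # placeholder file name
    "last_modified": "2013-10-31",
    "creator": "A. Researcher",
    "contact": "a.researcher@example.org",
    "why_collected": "Placeholder description of the scientific question.",
    "parameters": {
        "count": {"units": "individuals per litre", "method": "microscope count"},
    },
    "codes": {"-999": "missing value"},
    "known_problems": [],
}

# Assumed minimum set of fields; flag any that are absent or empty.
REQUIRED = ["dataset_name", "files", "last_modified", "creator", "contact", "parameters"]
missing = [key for key in REQUIRED if not metadata.get(key)]

text = json.dumps(metadata, indent=2)  # text that could be saved next to the data
print(missing)  # []
```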

During  collection  Create  metadata  

What is metadata?

Metadata standards…
• Provide structure to describe data: common terms | definitions | language | structure
• Come in many flavors: EML, FGDC, ISO19115, DarwinCore, …
• Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)

During  collection  Create  metadata  


During collection Back up daily

From  Flickr  by  lippo  

From  Flickr  by  see  phar  

Original | Near | Far
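A backup is only useful if the copies match the original. The sketch below copies one file to two folders and verifies each copy with a checksum. The `back_up` helper and the temporary folders standing in for the "near" and "far" locations are assumptions for the demo; a real "far" copy would live on a different machine or service.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path):
    """Checksum used to confirm a copy matches the original byte for byte."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def back_up(original, destinations):
    """Copy `original` into each destination folder and verify every copy."""
    verified = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder))
        if sha256(copy) != sha256(original):
            raise IOError(f"backup to {folder} does not match the original")
        verified.append(copy)
    return verified

# Demo: temporary folders stand in for the "near" and "far" locations.
root = Path(tempfile.mkdtemp())
original = root / "raw_data.csv"
original.write_text("site,count\nS1,12\n")
copies = back_up(original, [root / "near", root / "far"])
print([c.parent.name for c in copies])  # ['near', 'far']
```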

During  collection  

From  Flickr  by  Barbies  Land  

Remember  that  data  management  plan?  

Revisit  Review  Revise  

During  collection  

Schedule  a  time  each  week  or  month  

Revisit  Review  Revise  

From  Flickr  by  purplemattfish  

From Flickr by dipster1

Toolbox  

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free & open to community

dmptool.org

Write  a  DMP  

databib.org  

Where  should  I  put  my  data?  

Find  a  repository  

• Help researchers manage, describe, and share tabular data
• Free
• Add-in for Excel & web application

Manage  &  share  

Features
1. Best practices check
2. Generate metadata
3. Get identifier & citation
4. Post data to repository

Manage  &  share  

Create  metadata  

Create  metadata  

Clean  data  

Open Refine = Google Refine
• Open source desktop application
• Used for data cleanup and transformation to other formats
• Works with spreadsheets but behaves like a database
• User can filter the rows to display using facets that define filtering criteria


DCXL  blog:  dcxl.cdlib.org  

Toolbox:    

Get  help  

From Flickr by twm1340

Culture  Shift  Ahead  

science  source  notebook  content  access  data  government  knowledge  

From Flickr by cdsessums

From  Flickr  by  Andy  Graulund  

Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time

Website: carlystrasser.net
Email: carlystrasser@gmail.com
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser
