Digital Curation for Excel (DCXL)

DESCRIPTION

CDL has recently launched a new project dubbed Digital Curation for Excel (DCXL), funded by the Gordon and Betty Moore Foundation and Microsoft Research. The goal of the DCXL project is to facilitate data management, sharing, and archiving for earth, environmental, and ecological scientists. The main result from the project will be an open source add-in for Microsoft Excel that will assist scientists in preparing their Excel data for sharing.

DCXL: Digital Curation for Excel

Carly Strasser
UC3, California Digital Library
carly.strasser@ucop.edu

Funders: Gordon & Betty Moore Foundation, Microsoft Research

22 Sept 2011
UC3 Webinar Series
California Digital Library

Build  on  existing  cyberinfrastructure  

Create  new  cyberinfrastructure  

Support  communities  

Community  Engagement  

Roadmap

1. An overview: why is DCXL needed?
2. Goals of DCXL project
3. Progress & future plans
4. How to get involved in DCXL

Digital  data  +    

Complex workflows

Data  

Maximum  Likelihood  estimation  

Matrix  Models  

Images   Tables   Paper  

Models  

UGLY TRUTH

Most Earth | Environmental | Ecological scientists…

• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data

5shortessays.blogspot.com

From  Stephanie  Hampton  (2010)      ESA  Workshop  on  Best  Practices  

2  tables   Random  notes  

Wash  Cres  Lake  Dec  15  Dont_Use.xls  

Collaboration  and  Data  Sharing  

What  is  this?  

The  path  of  research  products  

Data  

Metadata  

Recreated  from  Klump  et  al.  2006  

noaa.gov  www.collectionconnection.alcts.ala.org  

www.flickr.com/photos/csessums

blog.seattlepi.com  blog.disorder2order.com  

Data  Sharing  

Data  Management  

Data  Reuse  

digital-servers.com

Barriers  

Cost  

Software, hardware

Personnel

Time  

cultblender.wordpress.com  

ttatteredntornprims.blogspot.com/  

Barriers  

Cost: time, personnel, software, hardware

Culture of Science

• Not the norm
• Lack of training
• Disparate data

free-photos.biz

Barriers  

Cost: time, personnel, software, hardware

Culture of Science

Loss of rights or benefits

• Conflict
• Missed opportunities
• Misuse of data

colouringbook.org  

wattsupwiththat.com  

Barriers  

Cost: time, personnel, software, hardware

Culture of Science

Loss of rights or benefits

Lack of incentives

• Reward structure
• Few requirements
• Time consuming & expensive

georgevanantwerp.com  

Roadmap

1. An overview: why is DCXL needed?
2. DCXL project overview
3. Progress & future plans
4. How to get involved in DCXL

DCXL  Project  Goals  

“A transformation in the conduct of a segment of scientific research by enabling and promoting publishing, sharing, and archiving of tabular data”

• Increase
  – interoperability = Sharing
  – publishability = Publishing
  – archivability = Archiving

• Focus on atmospheric, ecological, hydrological, and oceanographic data

DCXL  Project  Goals  

Open Source & Free Excel Add-in

Software program that extends the capabilities of larger programs

Complements basic Excel functionality

From www.webopedia.com

www.ablebits.com  

DCXL Add-in Goals

Archiving  

Sharing  

Publishing  

Easier  

Harder  

DCXL  Project  Deliverables  

• Excel add-in
• Publicly available source code
• Technical documentation
• End user documentation
• Publicly available requirements
• Community

storageplusgulfport.com  

DCXL  Project  Outcomes  

• Enable citation & allow credit
• Enable policy enactment
• Enable re-use by eliminating barriers
• Save time for researcher
• Encourage creation of extensions

Process

Assess needs

• Quantitative
  – Surveys
  – Quick poll
• Qualitative
  – Interviews

Process  

Assess needs

Gather requirements

Recruitment tools

• DCXL/data management seminars
• Listservs & email
• Blog, Facebook, Twitter
• Face-to-face interactions
• Flyers

Process  

Assess needs

Gather requirements

Locations

• Conferences
• UC campus visits
• Remote/web-based

Process  

Assess needs

Gather requirements

Stakeholders & contributors

• Libraries
• Scientists
• Repositories
• Experts: MSR, GBMF
• Personnel on related projects

Process  

Requirements (diagram)

Needs assessment: Quick poll, Survey, Interview

Recruitment tools: Email, Seminars, Flyers, Social media, campus visits

Stakeholders: Scientists, Data Centers, Libraries, Funders, Related projects, CDL

Implementation

Assess needs

Gather requirements

Build requirements document

Build community

• Libraries
• Scientists
• Repositories
• Programmers/Developers

Timeline

2011

26 Sept: DCXL Kickoff Meeting
7 Oct: Finalize Requirements Gathering Framework
9 Nov: 1st draft of Requirements to MSR
30 Nov: 2nd draft of Requirements to MSR
5-9 Dec: AGU Meeting, San Francisco
15 Dec: Final Requirements to MSR

2012

16 Jan: Receive Excel Add-in Version 1
23 Jan: Rollout Excel Add-in Version 1
16-19 Feb: AAAS meeting: Add-in user testing
20-24 Feb: Ocean Sciences meeting: Add-in user testing
26 Feb: 1st Draft of updated Requirements based on Version 1 to MSR
2 Apr: Deliver updated Requirements based on Version 1 to MSR
28 May: Receive Excel Add-in Version 2
29 May-24 Jun: User testing of Version 2
25 Jun: Rollout Excel Add-in Version 2
7-10 July: CSEE meeting: Add-in debut & demo
13 July: Final code, technical documentation, and requirements published
31 July: End user documentation published

Roadmap

1. An overview: why is DCXL needed?
2. DCXL project overview
3. Progress & future plans
4. How to get involved in DCXL

Ecological  Society  of  America  Summer  2011  Meeting  

ESA  Overview  

• Everyone uses Excel
  – Most use Excel for organizing raw data
  – Most import spreadsheets into other programs for analysis
  – ~75% are embarrassed about using Excel
• Excitement about open source
• Minimal knowledge about data management, organization, and archiving
• 55 surveys from diverse group

Operating System (chart): Mac, PC, Linux

Use Excel for... (chart; # respondents out of 55): Organization, Visualization, Statistics, Other Analyses, Sharing

How often do you use Excel? (chart; # respondents; axis labels: Never, Rarely, Every day)

What features are used in Excel? (chart; percent): Multiple Tables, Multiple Tabs, Pivot Tables, Headers, Embedded formulas, Macros, Cell shading, Comments

American  Fisheries  Society  Summer  2011  Meeting  

Ray  Troll  (trollart.com)  

AFS  Overview  

• Everyone uses Excel
• Most use it only for data organization and sharing
• 36 surveys from diverse group
• Heavy MS Access use
• 100% PC

How often do you use Excel? (chart; # respondents; scale from Rarely to Every day)

Tasks performed in Excel? (chart; % respondents, n = 36): Organizing data, Visualizing data, Statistics, Simple Calculations, Sharing data

What should the add-in help you do? (chart; % respondents): Organize my data for my own use, Organize my data for others to use more easily, Archive my data, Create metadata, Share my data publicly, No opinion

AFS  Overview  

• Everyone uses Excel
• Most use it only for data organization and sharing
• 36 surveys from diverse group
• Heavy MS Access use
• 100% PC
• Data hoarders

Myoverstuffedbookshelf.blogspot.com  

Roadmap

1. An overview: why is DCXL needed?
2. DCXL project overview
3. Progress & future plans
4. How to get involved in DCXL

Get  Involved  

dcxl.cdlib.org

Now: General info, Blog, Forum, Calendar

Later: Requirements, Documentation

Get  Involved  

@dcxlCDL  

www.facebook.com/DCXLatCDL  

Acknowledgements  

• CDL: Rachael Hu, Trisha Cruse, John Kunze, Tracy Seneca
• MSR: Lee Dirks
• GBMF: Chris Mentzel

Carly  Strasser  carly.strasser@ucop.edu  
