23
Johns Hopkins University Data Management Services Sayeed Choudhury and Barbara Pralle Digital Library Federation – October 31, 2011 Johns Hopkins University Sheridan Libraries

Dc sheridan dlf_2011_final

Embed Size (px)

DESCRIPTION

Digital Library Federation presentation of Data Conservancy and Johns Hopkins University Sheridan Libraries Data Management Services (JHU DMS).

Citation preview

Page 1: Dc sheridan dlf_2011_final

Johns  Hopkins  University  Data  Management  Services  

Sayeed  Choudhury  and  Barbara  Pralle  Digital  Library  Federation  –  October  31,  2011  

Johns  Hopkins  University  Sheridan  Libraries  

Page 2: Dc sheridan dlf_2011_final

Powered  by  Data  Conservancy  

•  JHU  Data  Management  Service  (DMS)  represents  the  culmination  of  two  years  of  research,  design,  development  and  implementation  of  Data  Conservancy  

•  Service  launched  in  July  2011  •  DC  instance  launched  in  October  2011  •  Important,  essential  foundations  in  place  •  There  remains  work  to  be  done  

Johns  Hopkins  University  Sheridan  Libraries  

Page 3: Dc sheridan dlf_2011_final

Data  Conservancy  

•  The  Data  Conservancy  has  developed  a  blueprint  for  institutions  that  view  scientific  data  curation  as  a  means  to  collect,  organize,  validate  and  preserve  data  so  that  scientists  can  find  new  ways  to  address  the  grand  research  challenges  that  face  society.    

Page 4: Dc sheridan dlf_2011_final

Technology-­‐related  Accomplishments    

•  Preservation-­‐based,  flexible  data  model  (inspired  by  PLANETS  project)  

•  Demonstration  of  modularity  through  Archival  Storage  framework  

•  Feature  Extraction  Framework  •  Demonstration  of  interoperability  with  NSIDC  glacier  photo  service,  arXiv,  IVOA  and  Sakai  

•  Development  of  DCS  instance  

Johns  Hopkins  University  Sheridan  Libraries  

Page 5: Dc sheridan dlf_2011_final

Notion  of  the  DCS  Instance  (1)  

•  An  instance  of  the  DCS  (Data  Conservancy  System)  software  stack  

•  A  deployment  infrastructure  (hardware,  servers,  etc.)  for  the  DCS  

•  A  defined  (sub)set  of  DCS  services  to  be  exposed/provided  

•  A  defined/understood/expected  “context”  to  be  addressed  (e.g.  science  domain,  institutional  domain,    etc.)  

5

Page 6: Dc sheridan dlf_2011_final

Notion  of  the  DCS  Instance  (2)  

•  A  local  policy  framework  in  place  for  the  System  Operation  

•  Personnel/staffing  to  setup,  manage  and  support  the  continued  operation  of  the  system  

•  A  sustainability/business  plan  in  place  to  operate  the  instance  post  NSF  funding  (cost  model,  user  base,  etc.)  

•  Key  integrators  such  as  spatial,  temporal  and  taxonomic  queries  or  data  replication  

6

Page 7: Dc sheridan dlf_2011_final

Definition  of  Data  Preservation  

•  “Data  preservation  involves  providing  enough  representation  information,  context,  metadata,  fixity,  etc.  such  that  someone  other  than  the  original  data  producer  can  use  and  interpret  the  data.”  -  Ruth  Duerr,  National  Snow  and  Ice  Data  Center  

7

Page 8: Dc sheridan dlf_2011_final

Architecture  mapped  to  OAIS  

Open  Archival  Information  System  Functional  Entities  

Data  Conservancy  Service  Architecture  Block  Diagram  

Johns  Hopkins  University  Sheridan  Libraries  

Page 9: Dc sheridan dlf_2011_final

Data  Model:  Research  

•  Versions  vs.  (format)  migration  -  Modification  of  significant  properties  versus  transformation  of  data  

•  Replication  and  provenance  (verifiable  snapshots)  

•  Layered  data  model  -  Preservation  view,  Persistence  view,  Information  view  

Page 10: Dc sheridan dlf_2011_final

Data  Model:  Application  

•  Multiple  Data  Models  •  Content  models  for  describing  the  contents  of  a  Manifestation  

•  General  Model  used  to  correlate  model  entities  across  heterogeneous  datasets  -  geo-­‐reference,  time  of  observation,  etc…  

Page 11: Dc sheridan dlf_2011_final

Feature  Extraction  Framework:  Design  

•  Must  accommodate  a  variety  of  data  formats  

•  No  assumption  made  regarding  the  form  of  data  input  or  output  

•  Not  coupled  to  a  specific  execution  model  

Page 12: Dc sheridan dlf_2011_final

Feature  Extraction  Framework:  Application  

•  Subsetting  -  Returning  a  portion  of  a  dataset  

•  Indexing  -  Output  suitable  for  indexing  by  the  Query  Framework  

•  Workflows  -  Process  Orchestration,  Meandre,  Taverna,  Kepler  

•  Execution  environment  for  analysis  -  Stateless  Mappings  basis  for  MapReduce  

Page 13: Dc sheridan dlf_2011_final

Implementation:  Archival  Storage  

•  Responsible  for  long  term  storage    of  content  and  metadata  (AIP)  -  Evolving  storage  technologies    

(including  cloud)  -  Policy  implementation  (e.g.  multiple  copies)  

•  Archival  Storage  API  abstracts  underlying  technology  -  Fedora  (object  based),  ELM  (file  based)  

•  Persistent  Storage  Framework  (Y2)  -  Instrumented  file  systems  -  File  semantics  over  block,  cloud,  content-­‐

addressed  storage  -  Storage  services  (e.g.  integrity  checking)  

Page 14: Dc sheridan dlf_2011_final

Data  Management  Layers  Layers   Examples   Implication  for  PI   Implication  relative  

to  NSF  

Curation   Future  JHU  Data  Archive  and  other  DCS  instances  

•  Feature  Extraction  •  New  query  

capabilities  •  Cross-­‐disciplinary  

•  Competitive  advantage  

•  New  opportunities  

Preservation   JHU  Data  Archive  Portico  ICPSR  

•  Ability  to  use  own  data  in  the  future  (e.g.  5  yrs)  

•  Data  sharing    

•  Satisfies  NSF  needs  across  directorates  

 

Archiving   CUAHSI  NEES  Dataverse  

•  Provides  identifiers  for  sharing,  references,  etc.  

•  Could  satisfy  most  NSF  requirements  

Storage   Server  in  Lab  Website  Amazon  S3  

•  Responsible  for:  •  Restore  •  Sharing  •  Staffing  

•  Could  be  enough  for  now  but  not  near-­‐term  future  

Johns  Hopkins  University  Sheridan  Libraries  

Page 15: Dc sheridan dlf_2011_final

Defining  Sustainability  

•  “Ensuring  that  valuable  digital  assets  will  be  available  for  future  use  is  not  simply  a  matter  of  finding  sufficient  funds.  It  is  about  mobilizing  resources—human,  technical,  and  financial—across  a  spectrum  of  stakeholders  diffuse  over  both  space  and  time.”  

Page 16: Dc sheridan dlf_2011_final

Establishing  the  JHU  DMS    

•  May  2010  NSF  announces  DMP  expectations  •  Services  incubated  and  scoped  summer/fall  2010  

-  Build  on  Data  Conservancy  expertise    

•  Proposed  in  January  and  launched  in  July  2011  -  Consultative  data  management  planning  services  to  support  NSF  proposals  

-  Post  award  data  management  services  

Johns  Hopkins  University  Sheridan  Libraries  

Page 17: Dc sheridan dlf_2011_final

Resources  Needed  for  Launch    

•  Talented  team  and  expertise  •  Flexible  process  •  Tools  that  support  service  provision  •  Systems  developed  by  the  DC  to  manage  data  •  Marketing  and  outreach  partners  •  Mechanism  for  supporting  financial  model  

Johns  Hopkins  University  Sheridan  Libraries  

Page 18: Dc sheridan dlf_2011_final

People  and  Expertise    

•  Create  data  management  consultant  position    •  Position  description  recognizes  diversity  of  experience  and  talent  that  can  support  service  

•  Recruited  people  with  skills  that  match  with  the  JHU  DMS  objectives  and  services  

•  Knowledge  transfer  through  hands  on  work  and  interactions  with  DC  partners  

Johns  Hopkins  University  Sheridan  Libraries  

Page 19: Dc sheridan dlf_2011_final

Day  in  the  Life  of  a  JHU  DM  Consultant  (1)  

•  Consultative  process  emulates  a  reference  interview  

•  Adapt  to  PI  timeframe  and  deadlines  •  Gather  of  information,  identify  gaps,  understand  options,  prepare  and  iterate  plan  

•  Support  better  data  management  by  encouraging  systematic  cultural  change  through  PI  interactions  

Johns  Hopkins  University  Sheridan  Libraries  

Page 20: Dc sheridan dlf_2011_final

Day  in  the  Life  of  a  JHU  DM  Consultant  (2)  

•  In-­‐depth  planning  after  award  received  •  Helping  PI  understand  service  levels  within  JHU  Data  Archive  

•  Planning  and  preparing  data  for  deposit  •  Reviewing  data  management  plan  and  identifying  solutions  for  next  phase  of  data  management  

Johns  Hopkins  University  Sheridan  Libraries  

Page 21: Dc sheridan dlf_2011_final

Challenges  

•  Timing  of  consultative  support,  deadlines,  and  marketing  of  service  

•  Building  awareness  of  data  management  value  •  Establishing  common  vocabulary  ex.  One  PIs  ‘storage’  is  another  PIs  ‘archiving’  

•  Condensing  key  information  into  two  pages  •  Navigating  different  data  retention  policies  •  Responding  to  widely  ranging  domains  

Johns  Hopkins  University  Sheridan  Libraries  

Page 22: Dc sheridan dlf_2011_final

Opportunities  

•  Grow  the  researcher/grad  student  understanding  of  data  management  -  Support  good  stewardship  of  data  across  institution  

•  Establish  an  archive  specifically  designed  for  data,  enabling  future  discovery  and  use    

•  Build  our  collective  expertise  in  data  management  

Johns  Hopkins  University  Sheridan  Libraries  

Page 23: Dc sheridan dlf_2011_final

Acknowledgements  and  Resources  

•  NSF  Award  OCI-­‐0830976  •  Sheridan  Libraries  financial  support  •  Johns  Hopkins  University  financial  support  •  Elliot  Metsger  for  infrastructure  slides  •  Tim  DiLauro  for  inspiration  about  layers  •  Data  Conservancy  colleagues  for  their  exceptional  work  

and  patience  •  http://dataconservancy.org  •  http://dmp.data.jhu.edu  

Johns  Hopkins  University  Sheridan  Libraries