Growing Open Data: Making the sharing of XXL-sized research data files online a reality, using...

GROWING OPEN DATA: MAKING THE SHARING OF XXL-SIZED RESEARCH DATA FILES ONLINE A REALITY, USING EDINBURGH DATASHARE
PAULINE WARD: PAULINE.WARD@ED.AC.UK @PAULINEDATAWARD
GEORGE HAMILTON

THE CHALLENGE

• Researchers are generating bigger files. At the University of Edinburgh all researchers are entitled to 500 GB of storage.

THE CHALLENGE

• Researchers need to be able to share their data online.
• For impact.
• For discoverability.
• For reproducibility.
• For compliance.

THE CHALLENGE

• DataShare is the Institutional Repository for research data for staff and students at the University of Edinburgh: datashare.is.ed.ac.uk.
• Previous file size limit of 2.1 GB.
• Largest file we’ve been asked to share: 20 GB – split into smaller files.
• Largest fileset we’ve been asked to share: 226 GB – split into smaller filesets.

THE CHALLENGE

• Some files had to be imported via a time-consuming batch import process because they were too big or too numerous for web deposit.
• Some files are still waiting to be shared because they are too big for users to be able to conveniently download them.
• These files are generated by a wide range of disciplines and a wide range of methods.

THE SOLUTION

• Getting the files from the depositors: address upload.
• Allowing users to get the files: address download.

THE SOLUTION: UPLOAD

• HTML5 resumable upload

THE SOLUTION: UPLOAD

• EDINA’s code for implementing HTML5 upload in DSpace is on GitHub: https://github.com/edina/DSpace/tree/xml-html5-upload
• Uses resumable.js.
• This was the XMLUI rewrite of functionality that was available for DSpace 5.0 JSPUI. See https://jira.duraspace.org/browse/DS-1562 for further details.
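The general idea behind a chunked, resumable upload can be sketched as follows. This is a minimal illustration only, not the EDINA code and not the exact chunking rule used by resumable.js: the names `plan_chunks` and `chunks_to_send` are hypothetical, and the 1 MiB chunk size is an assumed setting rather than a DataShare value. The file is cut into numbered byte ranges, and after an interruption only the chunks the server does not yet hold are sent again.

```python
CHUNK_SIZE = 1024 * 1024  # assumed chunk size (1 MiB); a tunable setting, not a DataShare value


def plan_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (chunk_number, start_byte, end_byte) tuples; numbers are 1-based, end is exclusive."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    # Integer ceiling division, so a final partial chunk is still counted.
    total = (file_size + chunk_size - 1) // chunk_size
    return [
        (n, (n - 1) * chunk_size, min(n * chunk_size, file_size))
        for n in range(1, total + 1)
    ]


def chunks_to_send(file_size, already_on_server, chunk_size=CHUNK_SIZE):
    """Skip every chunk whose number the server has already reported as received."""
    return [
        chunk
        for chunk in plan_chunks(file_size, chunk_size)
        if chunk[0] not in already_on_server
    ]
```

With these assumptions a 15 GB (15 × 1024³ bytes) file becomes 15,360 chunks, so an interrupted upload loses at most the chunks in flight rather than the whole file.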

THE SOLUTION: UPLOAD

• Testing shows files up to 15 GB upload successfully.
• (cf. figshare 5 GB file size limit, Zenodo 2 GB)
• A 20 GB file upload has been done in testing, but it generates an error message in the browser, and the user must find and Resume the submission from the Submissions page.
• Multiple files can be uploaded by drag’n’drop.

THE SOLUTION: DOWNLOAD

We wanted a mechanism, which DSpace doesn’t provide, of zipping up files for download.
• BitTorrent was one possible approach: could be added at a later date.
• Other approaches possible (Rsync, Secure Copy (SCP)).

THE SOLUTION: DOWNLOAD

• FTP download: agreed.
• Tried and tested technology that we are confident we can put in place and will work well.
• All files will be accessed from the FTP server anonymously.
• Users can still download files in the browser via FTP.
• Users who wish can use an FTP client, allowing them to resume a download.
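Resuming a download from an FTP client can be sketched with Python's standard `ftplib`. This is an illustration under assumptions, not part of the DataShare service: the function names are hypothetical, it covers plain FTP only, and it assumes the server supports the FTP `REST` (restart) command. The client checks how many bytes it already has on disk and asks the server to start sending from that offset.

```python
import os


def resume_offset(local_path):
    """Bytes already on disk from an earlier, interrupted download (0 if none)."""
    return os.path.getsize(local_path) if os.path.exists(local_path) else 0


def download_with_resume(ftp, remote_name, local_path):
    """Fetch remote_name over an open ftplib.FTP connection, continuing a partial local file.

    Returns the offset the transfer started from.
    """
    offset = resume_offset(local_path)
    # Append when resuming; otherwise start a fresh file.
    mode = "ab" if offset else "wb"
    with open(local_path, mode) as out:
        # rest=<offset> makes ftplib send a REST command before RETR,
        # so the server skips the bytes we already hold.
        ftp.retrbinary("RETR " + remote_name, out.write, rest=offset or None)
    return offset
```

A caller would open a connection with `ftplib.FTP(host)`, call `login()` with no arguments for anonymous access, and pass the connection to `download_with_resume`. Running the same call again after a dropped connection continues where the previous attempt stopped.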

THE SOLUTION: DOWNLOAD

• Specification:
• All files will still be required to have appropriate metadata stored in DSpace.
• All filesets will now be downloadable as a zip file (previous 5.2 GB limit).
• Move DSpace assetstore to a location where more storage is available.
• Statistics (i.e. numbers) of file downloads by SFTP will be added to DSpace statistics.

THE SOLUTION: DOWNLOAD

• This is a replacement for our current on-the-fly zip file creation of Item bitstreams.
• Will mitigate potential performance issues, because it will use fewer server resources (Java threads and RAM).
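One way to avoid building an archive on every request is to write the zip file once, ahead of time, and serve that stored file. The slides do not say how DataShare produces its zip files, so the sketch below is an assumption for illustration only, with a hypothetical function name. It uses Python's standard `zipfile` module, stores files without compression, and keeps Zip64 enabled so the archive can exceed 4 GiB.

```python
import os
import zipfile


def build_fileset_zip(file_paths, zip_path):
    """Write one archive containing file_paths and return the member names."""
    # ZIP_STORED skips compression (cheap on CPU, suits already-compressed data);
    # allowZip64=True permits archives and members larger than 4 GiB.
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_STORED, allowZip64=True
    ) as archive:
        for path in file_paths:
            # Store each file under its base name so the archive has no server paths.
            archive.write(path, arcname=os.path.basename(path))
        return archive.namelist()
```

Because the archive is created once per fileset, later downloads only read an existing file from disk, which is the kind of saving in per-request server work that the slide describes.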

SUMMARY

• We have implemented HTML5 upload in the DataShare (DSpace) web interface to allow depositors to easily and quickly deposit individual files up to 15 GB.
• We are working on integrating an SFTP server to allow users to retrieve filesets larger than our current 20 GB limit. Storage rather than network/browser timeout will become the limiting factor on fileset size. We anticipate making numerous filesets around 100 GB available in this way in the medium term.
