Institutional digital repositories: What role do they have in curation?
Steve Hitchcock, JISC KeepIt Project, ECS, University of Southampton
ICE Forum, London, 29 June 2011
How much digital data?
9.57ZB of data processed by 27M computers in 2008
1.2ZB of data in digital universe by year end 2010
196.5TB/year Twitter
41 391TB data generated by 6 MIT case studies
20 600TB data generated by 1 MIT physics case study
3.5TB documents in 298 European repositories
2000TB Internet Archive Wayback Machine
394TB Hathi Trust, 8.793M volumes
74TB LoC, 15.3 million digital items online

Scale
Mega MB 1 000 000
Giga GB 1 000 000 000
Tera TB 1 000 000 000 000
Peta PB 1 000 000 000 000 000
Exa EB 1 000 000 000 000 000 000
Zetta ZB 1 000 000 000 000 000 000 000
Yotta YB 1 000 000 000 000 000 000 000 000
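For readers who want to compare these figures directly, the scale above can be applied mechanically. The following is only a minimal sketch: the function names `to_bytes` and `convert` are illustrative, and it assumes the decimal (powers of 1000) prefixes shown on the slide rather than binary ones.

```python
# Decimal (SI) prefixes in the order listed on the slide; each step is a factor of 1000.
PREFIXES = ["M", "G", "T", "P", "E", "Z", "Y"]


def to_bytes(value, unit):
    """Convert a figure such as (3.5, "TB") into bytes using decimal prefixes."""
    exponent = 6 + 3 * PREFIXES.index(unit[0])  # MB = 10**6, GB = 10**9, ...
    return value * 10 ** exponent


def convert(value, from_unit, to_unit):
    """Express a figure given in one unit in another unit, e.g. ZB in TB."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)


# The digital universe estimate (1.2ZB) expressed in terabytes.
print(convert(1.2, "ZB", "TB"))
```

Expressing every figure in the same unit, here terabytes, is what makes the later comparison between layers possible.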
Data generation layer - worldwide

Moving data, data consumed
27M computers processed 9.57ZB in 2008
Americans consumed 3.6ZB in 2008
Bohn, Short, How Much Information? 2010 Report on Enterprise Server Information
http://hmi.ucsd.edu/howmuchinfo_research_report_consum_2010.php

Static data, original sources
Est. 1.2ZB of data in digital universe by year end 2010
IDC/EMC (2010)
http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm

User-generated data
Twitter 35MB/s, 155M tweets/day (ReadWriteWeb, May 25, 2011) = 196.5TB/year
http://www.readwriteweb.com/cloud/2011/05/gnip-ceo-on-the-challenges-of.php
The Rapid Growth in Unstructured Data, via http://wikibon.org/blog/unstructured-data/
Repository layer

DRIVER search (1 June 2011)
3,520,000 documents in 298 repositories from 38 countries
http://search.driver.research-infrastructures.eu/
Est. 1MB/doc = 3.5TB

Weibel (blog), March 2009, Are data repositories the new IRs?
http://weibel-lines.typepad.com/weibelines/2009/03/are-data-repositories-the-new-institutional-repositories.html

Madnick, Smith, How much Info? July 2009 UCSD Webinar
MIT, 6 case studies, 16 faculty workers
Total data generated 41 391TB (Physics 20 600TB)
5-10x more data than 5 years ago, expect similar growth rates in future
http://hmi.ucsd.edu/pdf/webinar_July22.pdf

Chronopolis data grid for replication: multiple copies of valued data collections
https://chronopolis.sdsc.edu/
cf. LOCKSS, Lots Of Copies Keep Stuff Safe
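The 3.5TB figure for European repositories is an estimate rather than a measurement, so it is worth making the arithmetic explicit. This is a minimal sketch under the slide's own assumption of 1MB per document; the alternative sizes in the loop are hypothetical values chosen only to show how sensitive the total is to that assumption.

```python
DOCUMENTS = 3_520_000        # DRIVER search result, 1 June 2011 (from the slide)
ASSUMED_MB_PER_DOC = 1       # the slide's rough estimate, not a measured average
MIT_CASE_STUDIES_TB = 41_391  # total for the 6 MIT case studies (from the slide)


def estimated_terabytes(documents, mb_per_doc):
    """Total size in TB for a number of documents of a given average size."""
    return documents * mb_per_doc / 1_000_000  # 1TB = 1 000 000MB


driver_tb = estimated_terabytes(DOCUMENTS, ASSUMED_MB_PER_DOC)
print(f"DRIVER estimate: {driver_tb:.2f}TB")

# Hypothetical alternative averages, to show the effect of the assumption.
for mb_per_doc in (0.5, 1, 5):
    total = estimated_terabytes(DOCUMENTS, mb_per_doc)
    ratio = MIT_CASE_STUDIES_TB / total
    print(f"{mb_per_doc}MB/doc -> {total:.2f}TB; MIT case studies {ratio:,.0f}x larger")
```

Even if the average document were several times larger than assumed, the research paper repositories would remain thousands of times smaller than the MIT data case studies.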
Archive layer

Internet Archive Wayback Machine contains c.2000TB, currently growing at a rate of 20TB/month
http://www.archive.org/about/faqs.php

Hathi Trust (beginning of June 2011, 8.793M volumes), 394TB
http://www.libraryjournal.com/lj/home/890917-264/unlocking_hathitrust_inside_the_librarians.html.csp

Library of Congress
15.3 million digital items online, 74TB
nearly 142M items in the Library's physical collections
Matt Raymond, February 11, 2009
http://blogs.loc.gov/loc/2009/02/how-big-is-the-library-of-congress/

LoC (start 2011) 147M items: 33M books + other print, 3M recordings, 12.5M photos, 5.4M maps, 6M sheet music, 64.5M manuscripts
http://www.loc.gov/about/facts.html
Visualising data ratios (larger scale)
Chart layers: Data generation, Repository layer, Archival layer
Data generation: Moving data (Bohn, Short, 2008), Static data (IDC 2010)
Visualising data ratios (smaller scale)
Data generation: Moving data (Bohn, Short, 2008) x 10^7, Static data (IDC 2010) x 10^7, Twitter/y
Repository layer: European IRs (DRIVER), MIT data case studies (2009), MIT physics case study (2009)
Archival layer: Internet Archive Wayback Machine, Hathi Trust (June 2011), LoC digital items (2009)
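The reason two charts are needed can be shown numerically. The sketch below takes the figures quoted on the earlier slides, all expressed in TB, and computes how wide each bar would be on a linear chart in which the largest value fills the full width. The chart width of 1000 pixels and the function name `bar_widths` are assumptions for illustration only.

```python
import math

# Figures in TB as quoted on the slides (1ZB = 10**9 TB).
FIGURES_TB = {
    "Moving data (Bohn, Short, 2008)": 9.57e9,
    "Static data (IDC 2010)": 1.2e9,
    "MIT data case studies (2009)": 41_391,
    "MIT physics case study (2009)": 20_600,
    "Internet Archive Wayback Machine": 2_000,
    "Hathi Trust (June 2011)": 394,
    "Twitter/y": 196.5,
    "LoC digital items (2009)": 74,
    "European IRs (DRIVER)": 3.5,
}


def bar_widths(figures, chart_width):
    """Linear bar width in pixels when the largest value fills the chart."""
    largest = max(figures.values())
    return {name: chart_width * value / largest for name, value in figures.items()}


widths = bar_widths(FIGURES_TB, chart_width=1000)  # 1000px is an assumed width
for name, value in FIGURES_TB.items():
    magnitude = math.log10(value)  # order of magnitude of the figure in TB
    print(f"{name:34s} 10^{magnitude:4.1f}TB  {widths[name]:12.6f}px")
```

On a linear scale everything below the two data generation totals occupies a small fraction of a single pixel, which is why the smaller-scale chart has to represent the largest cases numerically.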
Digital repositories diversifying: institution-wide outputs
Science, Teaching, Research, Arts
KeepIt exemplar preservation repositories
Summary of implications of the KeepIt project findings
Digital preservation starts with detailed knowledge and awareness of your own content
The issues raised by preservation are the same as those raised by content management
Data curation is likely to be a natural progression for a preservation-focussed repository
Provenance of data should be a key role for research institutions
Preservation tools are delivering specialist expertise directly to the user
JISC should promote its role in the development of digital preservation tools more loudly
Creating a sense of capability will assist those new to preservation practice
Converged multi-data type repositories are likely to increase complexity for preservation
Preservation should not be prioritized prematurely, especially among relatively new content repositories
Digital institutional repositories will not instantly become preservation repositories, and repository managers are not archivists, but they both have a role in preservation
Digital institutional repositories will not quickly become preservation repositories, and repository managers are not archivists, but they both have a role in preservation

As there are vastly more digital content repositories than 'preservation repositories', if we are to have preservation-ready content repositories then many more need to be allowed to navigate the path towards digital preservation without having all the requirements of specialists imposed on them. Should we view target content repositories as first-stage curators rather than archivists, i.e. as a process that informs and selects for preservation?

hackingtheacademy @chrisprom argues digital archival programs will be recreated by academies with trusted repository and OSS - that's KeepIt (Thu May 27 2010)
Digital preservation starts with detailed knowledge and awareness of your own content

.@bookfinch Shorter summary of DP: know what you have and value, assess risk, take action to avoid risk, repeat. Problem: people don't do it (Thu Jan 13 2011)

All the needs and requirements of preservation stem from this knowledge, enabling a repository manager, for example, to then select appropriate preservation tools and services. In essence, this is the problem that KeepIt set out to help the managers of different types of institutional repository to resolve.
Data curation is likely to be a natural progression for a preservation-focussed repository

The work of NECTAR at the University of Northampton indicates the growing prevalence of the idea that repositories could be used for data curation, even if content (e.g. open access) repositories and data repositories remain separate within institutions to serve different metadata, interoperability and author requirements.

If repositories are the new wave of scholarly communication, then data repositories in the cloud could be the next new wave.
Preservation tools are delivering specialist expertise directly to the user

Widely and freely available tools can support a full preservation programme for repositories, from policy-making to costings, technical content management, and risk analysis. Analysis showed that around 70% of these tools had been developed in JISC projects.
Creating a sense of capability will assist those new to preservation practice

Porter: 'create a sense of urgency'. No, create a sense of capability. That's what many JISC DP projects have done #brtf (Fri May 07 2010)

At a recent JISC end-of-programme event one keynote speaker questioned the impact of digital preservation on digital repositories. Once again, the situation was presented as urgent. Without reference to the range of tools now available for digital preservation, urgency unnecessarily detracts from creating a sense of capability.
What did the KeepIt exemplars do about preservation?
All see preservation as an ongoing practical commitment, providing it can be managed within the scope of existing work and resources.
We can expect to see progress where it fits with repository development and emerging requirements.
We cannot expect to see all repositories take the same path towards preservation at the same speed.
Progress will depend on type of repository content, but also on other factors including institutional issues, scale and growth of repository content.
Find out more about KeepIt
Web: http://preservation.eprints.org/keepit/
Blog: Diary of a Repository Preservation Project http://blogs.ecs.soton.ac.uk/keepit/
Papers and presentations, Repository: http://www.ecs.soton.ac.uk/research/projects/640
Presentations, Slideshare: http://www.slideshare.net/SteveHitchcock/presentations
Wiki: Training resources and bibliography http://wiki.eprints.org/w/Repository_Preservation_Exemplars
Twitter: @jisckeepit
Final report (June 2011): http://ie-repository.jisc.ac.uk/553/
One of the best-known advocates and practitioners of digital preservation is David Rosenthal. Rosenthal argues that preservation resolves to three main factors: scale, costs and rights. When we consider curation education we can see immediately that scale must have a major impact (probably costs and rights too, but we can leave those for another day).

What I want to focus on here is digital content repositories: the role they may have in filtering content for curation, and how they can help deal with the problem of scale. We tend to use the term digital repository quite generically, but my particular interest is institutional digital repositories, where the institutions are typically involved in research and higher education. These may not be the type of repositories you think of in terms of curation. The examples I will use come from the JISC KeepIt project, and I'll say more about that later.
*First, let's get some feel for the sort of numbers we are dealing with when considering the scale of digital information. I admit these numbers are somewhat random, and it's hard to grasp and compare them in purely numerical terms, so let's try to add a little structure.
There is a scale to assist those who have trouble telling terabytes from yottabytes.
*What was perhaps not apparent in the last slide is that I have structured the examples into what I will call data generation, repository and archive layers. First let's consider data generation. These are the places where content emerges. Taken as an aggregate, these are the kind of numbers that scale to our digital universe. We have to be careful about comparing numbers directly, and about qualifying these data, but as a first-hand indicator of scale this will help.
*Some are better at visualizing these data than I am, but I will seek to provide some simple visualization later.
*The repository layer I'm interested in has grown fastest, in terms of numbers of repositories, in the area of research paper repositories, but these are far from the largest in terms of data volumes. Are we witnessing the emergence of institutional data repositories? If we are, then the first examples are already orders of magnitude larger. Remember that a popular approach to preservation is to create multiple copies of data.
*When we come to the archive layer, these are examples you will be familiar with, among the major players.
*Now I want to take these layers and examples and see how they compare visually. The problem is that on the scale of the largest examples in the data generation layer, the other layers become invisible.
*Instead I've scaled down to a level where most of our examples are visible, even though we can now only represent the largest cases numerically. The reason I've structured it this way is to show the challenge of curation: to take the content in the top two layers here and provide the means of selection and filtering for the bottom layer.
*Miggie Pickton from the University of Northampton presenting for the JISC KeepIt Project at the Open Repositories Conference in Madrid, July 2010.
*From the KeepIt project we know something about the emerging types of repositories, from science data and the arts to teaching and learning content, as well as research papers. These are our exemplars, and we know something about how they are adapting to the needs of preservation and curation.
*Here is a summary of the implications of the findings of the project. As this is just a short talk I will only pick out some of the findings that may resonate here. These are real lessons we have learned from the project. We've just posted the final report from the project, so if you want to look into these findings further the URL is given at the end of the talk. Strictly our focus was on preservation, which I take to be a subset of curation, but I hope this is a useful contribution to the debate here.
*This is the case for the middle way. The trap we fall into is to suggest that only the two extremes are valid: (1) that repositories must be capable of implementing a full preservation solution, or (2) that repositories cannot be responsible for preservation. I heard the latter case at a meeting earlier this year and argued against it in discussion. After I had put my case someone said: "so you mean like the BP case". Exactly. If you recall the Gulf of Mexico oil spill in 2010, BP argued that it was not responsible for the spill as it had subcontracted the management of the rig. This line did not stand up to the scrutiny of politicians and public. Nor will it stand up for content repositories that outsource preservation.
*I doubt anyone here will take issue with this, but the problem is this: people don't start here.
*We saw this happen in KeepIt. The NECTAR repository at Northampton used the Data Asset Framework (DAF) tool to assess the scope for setting up a research data repository.
The results of that work reached higher levels of the management of the institution, which approved the recommendations. It was the involvement in KeepIt, and the new perspective it provided on repository preservation, that led to this initiative.
*It is possible to build a full repository preservation programme using widely and freely available tools. Not easy, or quick, but possible, and we ran a five-part course in KeepIt to show how. I can't emphasise enough the importance of this range of tools. Whenever I hear people worrying about digital preservation, invariably they are unaware that these tools exist. It's as if we're in a time warp, and outside the preservation community time has stood still. We have to redress this impression.
*I have long argued against scaremongering and calls for urgency to motivate digital preservation, as I believe they are counter-productive. Now they are no longer necessary, as we have the practical tools for assessing the real need for preservation by identifying and acting on risk. This capability is empowering and, as KeepIt has shown, motivates different types of repository to select appropriate tools to tackle preservation.
*Summary: There is nothing special about the exemplar repositories in KeepIt. They represent typical repositories for different types of content. Thus what we learned about how they respond to a structured approach to preservation is likely to apply to other repositories of the same types.
*I hope these few snippets from our project will help this meeting understand the role these repositories can play in helping meet the challenge of curation of the vast amounts of digital data now emerging.