Cultural Heritage Institutions and Big Data Collections Leslie Johnston Chief of Repository Development Library of Congress

Cultural Heritage Institutions and Big Data Collections

Leslie Johnston, Chief of Repository Development

Library of Congress

Cultural Heritage organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.”

Now it’s also data. 

Data is not just generated by satellites, identified during experiments, or collected during surveys.

Datasets are not just scientific and business tables and spreadsheets.

We have Big Data in our Libraries, Archives and Museums.

Like other cultural heritage organizations, the Library of Congress has as one of its mandates that it make its collections freely available, whether that is in person or on the web.

What are some Library of Congress examples of collecting and preserving large scale collections in many formats, and making them usable as collections and as data?

National Digital Newspaper Program

chroniclingamerica.loc.gov/

This collection was transformative for the Library of Congress: it was the first to be made available as a bulk download and exposed as a text and image dataset.

Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze the full collection.

The program has:
• Multiple producers (36 now, ultimately 54)
• Free and open public access APIs for machine access and automated processes, including access to RDF linked data
• Over 6.7 million newspaper pages ingested to date
• Over 250 TB of data
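As a rough illustration of what machine access for trend mining can look like, here is a minimal sketch in Python. The endpoint path and the parameter names (`andtext`, `dateFilterType`, `date1`, `date2`, `format`) are assumptions based on the site's public search interface and should be checked against the current API documentation. The `date` field in `YYYYMMDD` form is likewise an assumed response shape. The sketch only builds the request URL and tallies results; it does not perform any network calls.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumed page-search endpoint; verify against the current API documentation.
SEARCH_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, start_year=None, end_year=None, page=1):
    """Build a URL asking for OCR full-text matches as JSON (assumed parameters)."""
    params = {"andtext": terms, "format": "json", "page": page}
    if start_year is not None and end_year is not None:
        params.update(
            {"dateFilterType": "yearRange", "date1": start_year, "date2": end_year}
        )
    return SEARCH_URL + "?" + urlencode(params)


def pages_per_year(items):
    """Count result pages per year, assuming each item has a 'date' like '19181105'."""
    counts = Counter(item["date"][:4] for item in items if item.get("date"))
    return dict(sorted(counts.items()))
```

A researcher could fetch each results page with any HTTP client, pass the returned list of page records to `pages_per_year`, and plot the counts to see how often a term appears over time.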

Web Archives http://www.loc.gov/webarchiving/

lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records.

The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices
• Collections around an area of study, such as Legal “Blawgs”

We frequently receive requests for access to full collections for full-text data mining.

• Every format possible on the web
• Almost 8 billion files
• Over 425 TB

congress.gov

Congress.gov is still in its beta phase, transforming congressional information discovery.

It covers legislation from 1993 to the present, the Congressional Record from 1995 to the present, committee reports from 1995 to the present, and Member profiles from 1973 to the present (with some from 1947 to 1972).

The Twitter Archive

Every public tweet since Twitter’s launch in March 2006.

Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.

The collection comprises only a few TB, but 100s of billions of tweets.

A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/

[Figure labels: status, privacy, commercial, personal, events, social media, visualization, social science]

Research Datasets

Research datasets are created by faculty, curators, researchers, and federal and state agencies.

It is not enough to be collecting publications; we must collect the datasets that support the published work, to allow for replicability and re-use in research.

The Library is now planning to expand its collections to preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.

And the full breadth of the Library’s Collections

The American Memory collection, one of the oldest and most used digital collections on the web.

The oral histories of the Veterans History Project.

The audio and video collections of the American Folklife Center.

More than 1.2 million images from Prints and Photographs.

Digitized maps and GIS data from Geography and Maps.

More than 300,000 digitized audio and video files comprising over 5 PB at the Packard Campus.

And many, many, many more.

id.loc.gov

The Library of Congress is, in part, a standards agency for the rules used to create metadata records and for the controlled vocabularies (authorities) used to describe items.

The Library is gradually making its vocabularies available as serialized RDF datasets (SKOS and JSON).

In the library community, the LC authorities are one of the most common tools for building linked data relationships.
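To make the idea of serialized vocabularies concrete, here is a small Python sketch. It assumes that a serialization of an authority record can be requested by appending a suffix such as `.skos.json` to the record's URI, and that the SKOS JSON comes back as an expanded JSON-LD list of nodes; both are assumptions to verify against the service itself. The identifier `sh00000000` is a placeholder, not a real heading, and the function names are illustrative only.

```python
# Suffixes assumed to select a serialization of an id.loc.gov resource.
SERIALIZATIONS = {"skos-json": ".skos.json", "json": ".json", "skos-rdf": ".skos.rdf"}

# Standard SKOS property URI for a concept's preferred label.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def authority_url(scheme, identifier, serialization="skos-json"):
    """Build the (assumed) URL of one serialization of an authority record."""
    suffix = SERIALIZATIONS[serialization]
    return f"https://id.loc.gov/authorities/{scheme}/{identifier}{suffix}"


def pref_labels(expanded_jsonld, concept_uri):
    """Collect preferred labels for one concept from expanded JSON-LD nodes."""
    labels = []
    for node in expanded_jsonld:
        if node.get("@id") == concept_uri:
            for value in node.get(SKOS_PREF_LABEL, []):
                labels.append(value.get("@value"))
    return labels
```

A local catalog record that stores the concept URI can then display the current preferred label without copying the vocabulary, which is the kind of linked data relationship the authorities are used for.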

What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for use?

Sheer amount.

Huge variation in file formats.

Unclear and undocumented rights.

Security.

Missing metadata.

Data citation and identifier issues.

Discovery expectations: discovery across collections and institutions together.

Cost.

I will mention infrastructure only in passing.

There are scale issues related to:
• Storage
• Archiving
• Bandwidth
• Software development
• Staffing for processing

This Requires a Preservation Infrastructure

The Library developed the BagIt transfer specification for the movement of files between and within organizations.

http://www.digitalpreservation.gov/documents/bagitspec.pdf
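For readers unfamiliar with BagIt, the following Python sketch shows the general shape of a bag: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest that lets the receiver confirm nothing changed in transit. It is a simplified illustration using only the standard library, not the Library's own tooling. The declared version string is an assumption, and optional parts of the specification (such as `bag-info.txt` and tag manifests) are left out.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir, bag_dir, algorithm="sha256"):
    """Copy source_dir into bag_dir/data and write a declaration and manifest."""
    bag = Path(bag_dir)
    payload = bag / "data"
    shutil.copytree(Path(source_dir), payload)  # also creates bag_dir itself

    # Bag declaration; the version value here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<checksum>  <path relative to bag>".
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / f"manifest-{algorithm}.txt").write_text("".join(lines), encoding="utf-8")
    return bag


def validate_bag(bag_dir, algorithm="sha256"):
    """Return the payload paths that are missing or whose checksum differs."""
    bag = Path(bag_dir)
    manifest = (bag / f"manifest-{algorithm}.txt").read_text(encoding="utf-8")
    problems = []
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        target = bag / relative
        if not target.is_file():
            problems.append(relative)
        elif hashlib.new(algorithm, target.read_bytes()).hexdigest() != expected:
            problems.append(relative)
    return problems
```

The sender runs `make_bag` before transfer and the receiver runs `validate_bag` afterwards; an empty list means every payload file arrived intact. This version reads each file fully into memory, so very large files would call for chunked hashing instead.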

The Library inventories incoming files, and is gradually inventorying all digital content.

The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations.

The Library has documented sustainability factors for file formats. http://www.digitalpreservation.gov/formats/

For cases where we do have control over content we receive, we have a “Best Edition” Preferred Formats statement, which is currently being updated.

http://www.copyright.gov/circs/circ07b.pdf

New researcher uses and expectations bring many new activities that we must plan for.

We still have collections.  But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services.  Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. 

Now our collections are, more often than not, self-serve. Researchers are taking collections as data away to work with in their own computational environments. This is a shift away from recent service models where libraries built out and housed lab spaces for specialized activities such as text mining and geospatial modeling and provided staff to assist in acquiring and manipulating data.

More and more researchers want to use one or more collections as a whole, mining and organizing the information in novel ways.

Researchers now have what used to be unimaginable computing power on a desktop to mine this rich information, along with tools to create pictures that translate that information into knowledge.

Should collections be pre-processed before ingest to create a variety of derivatives that might be used in various forms of analysis? Or do we limit access to the native format? Or do we put on-the-fly format transformation services for downloads in place?

We are beginning to put into place the infrastructure needed to create full-text indexes for millions/billions of items to support full discovery for researchers.
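At its core, a full-text index maps each word to the items that contain it, so a query never has to scan the items themselves. The toy Python sketch below illustrates only that idea; production indexing at the scale of millions or billions of items relies on dedicated search infrastructure, and the function names and sample texts here are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of item identifiers whose text contains it."""
    index = defaultdict(set)
    for item_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(item_id)
    return index


def search(index, query):
    """Return the identifiers of items containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches
```

Because each lookup touches only the entries for the query terms, search cost depends on how many items match rather than on the size of the whole collection, which is what makes discovery across very large collections practical.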

We are only just starting the process of generating linked data representations of billions of items.

Cultural heritage institutions are increasingly looking towards self-service – researchers need not ask to download or tell us that they have. We may never know.

BUT … we do have collections that are limited to on-site only access due to licenses or gift agreements. In that case, libraries may have to consider providing high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them.

Both have policy implications and implications for public service staffing.

But the benefits outweigh the challenges.

Cultural heritage institutions are managing and preserving the datasets and big data necessary for re-use and replicability.

We are working to make the deposit and management of such data easier to accomplish.

This is an important new role for our organizations in enabling new research.

Discussion…

Leslie Johnston
