Advancing Library Cyberinfrastructure for Big Data Sharing and Reuse

2017 NFAIS Annual Conference, Feb 27, 2017

Zhiwu Xie

Big Data: How Big?

• Moving yardstick

• No longer unique to “big” science

• 1000 Genomes project: 200TB in 4 years

• Sloan Digital Sky Survey Phase I and II: 130TB in 8 years

• Today, a small lab can produce as much data in a shorter period of time
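
To compare the two projects on a common footing, the totals quoted above can be annualized. The following is a minimal sketch; the function name `tb_per_year` is illustrative and not part of the original slides, and the figures are simply those quoted on the slide.

```python
# Annualized data volumes from the figures quoted on this slide.
# (total terabytes, number of years) per project
projects = {
    "1000 Genomes": (200, 4),
    "Sloan Digital Sky Survey Phase I and II": (130, 8),
}


def tb_per_year(total_tb, years):
    """Average volume per year, assuming data accrued evenly (a simplification)."""
    return total_tb / years


for name, (total_tb, years) in projects.items():
    print(f"{name}: {tb_per_year(total_tb, years):.2f} TB/year")
```

Under the even-accrual assumption this gives 50 TB per year for 1000 Genomes and 16.25 TB per year for the Sloan survey.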

Library Big Data: Examples

• Library of Congress Twitter Archive

• Digital Preservation Network (DPN)

• HathiTrust Research Center (HTRC)

• Digital Public Library of America (DPLA)

• SHARE

Towards Use And Reuse Driven Big Data Management

Zhiwu Xie¹, Yinlin Chen¹, Julie Speer¹, Tyler Walters¹, Pablo A. Tarazaga², and Mary Kasarda²

¹University Libraries and ²Department of Mechanical Engineering

Virginia Polytechnic Institute and State University

Blacksburg, USA

June 23, 2015, JCDL 2015, Knoxville, TN

“…running water is never stale and a door hinge never gets worm-eaten…”

-- Lü's Annals, c. 239 BCE

Research Data Management

• What are the roles of the academic and research library?

• How can we help?

U.S. National Archives’ Local Identifier: 102-LH-1494

Chris 73 / Wikimedia Commons

Big Data: Institutional Context

Data Projects @ VT Libraries

• Inter- and cross-disciplinary

• Grow out of our capacity, beyond IR building

• Focus on reuse

• Require deep engagements

Goodwin Hall Living Lab

• A new 160,000-square-foot building wired with >240 different sensors

• Sensor mounts were welded directly to the structural steel during the building construction

• Sensors are strategically positioned and sufficiently sensitive to detect human movements

• Will be the most instrumented building for vibration

Goodwin Hall Living Lab

• Designed as a multi-purpose living laboratory

• Opportunities for multi- and cross-disciplinary exploration and discovery

• > 40 researchers and educators in various disciplines and institutes have expressed interest in using the data

• VT Libraries is tasked with building the digital libraries to manage the data and support these activities

• Data volume: > 30TB per year
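
For a sense of scale, the 30TB-per-year figure can be turned into an average sustained ingest rate. This is a rough sketch only: it assumes decimal terabytes (10^12 bytes) and a perfectly even data flow, neither of which the slides state, and the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years
BYTES_PER_TB = 10**12               # assumption: decimal terabytes


def average_rate_mb_per_s(tb_per_year):
    """Average ingest rate in MB/s if the yearly volume arrived evenly."""
    return tb_per_year * BYTES_PER_TB / SECONDS_PER_YEAR / 10**6


rate = average_rate_mb_per_s(30)
print(f"about {rate:.2f} MB/s, or {rate * 8:.1f} Mbit/s")
```

Under these assumptions the average works out to a little under 1 MB/s. Since 30TB is quoted as a lower bound and sensor traffic is unlikely to be perfectly even, actual peak rates may be higher.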

VT Event Digital Library & Archive

• Track and analyze live events such as earthquakes, political events, community activities, violence, and crime prevention

• Potentially used by researchers from many diverse disciplines

• Currently runs on the lab’s own 20-node Hadoop cluster

• 1 billion tweets & 11TB of webpages

• Through an MOU, the library invested in the data storage and became a partner

SHARE Notify

• Free, open data set about research and scholarly activities gathered from various sources

• Link publications to grants, receive real-time event notifications on mobile devices, etc.

• 149 aggregated sources, ~20 million events as of Feb 2017

Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse

• A 2-year IMLS National Leadership for Libraries grant, starting from June 2016

• Motivated by the 3 projects above

• A collaboration between VT Libraries, Mechanical Engineering, Computer Science, and UNT

• Emphasis is on leveraging shared infrastructure and on a widely applicable strategy

• Equip libraries with solid knowledge and techniques to balance their desires, needs, and constraints with a clear understanding of the tradeoffs

Key Research Questions

• What are the key technical challenges?

• What are the monetary and non-monetary (time, skill set, administrative, etc.) costs? Are there any cost patterns or correlations to the CI options?

• What are the knowledge and skill requirements for librarians?

• What are the key service and performance characteristics?

• How can the answers to the above questions be consolidated into an easy-to-adapt and effective library CI strategy?

Cyberinfrastructure Options

• Institutional high-performance computing (HPC), high-throughput computing (HTC), and storage facilities

• National HPC, HTC, and storage facilities, e.g., XSEDE resources

• National research clouds, e.g., Chameleon Cloud, CloudLab, Open Science Data Cloud

• Commercial clouds, e.g., Amazon Web Services (AWS), Rackspace

• No unified CI framework or strategy for picking CI for different library big data sharing and reuse situations

Library Big Data Reuse Patterns

[Figure: three reuse patterns drawn in terms of Compute and Storage, each paired with a project]

• Bridge: Goodwin Hall

• Network: Event DL

• Hub: SHARE Notify

Progress So Far

• Identified network bandwidth as a key bottleneck in the bridge pattern

• Analyzing data loading, its acceleration techniques, and tradeoffs in the network pattern

• Participated in building VT’s mass storage facility

• Participated in building VT’s 10G campus network
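
To illustrate why network bandwidth matters when data must be moved to remote compute, the sketch below estimates the ideal time to transfer one year of Goodwin Hall data (30TB, from the earlier slide) over a 10 Gbit/s link. The 1 Gbit/s comparison, the decimal-terabyte convention, and the `efficiency` parameter are assumptions added for illustration; they are not measurements from the project.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is a hypothetical fraction of the nominal link rate that is
    actually achieved; 1.0 means the ideal, never-reached best case.
    """
    bits = data_tb * 10**12 * 8                      # decimal terabytes assumed
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600


for gbps in (1, 10):
    print(f"30 TB over {gbps} Gbit/s: {transfer_hours(30, gbps):.1f} hours (ideal)")
```

Even in the ideal case, 30TB takes roughly 6.7 hours at 10 Gbit/s and nearly three days at 1 Gbit/s, and real transfers usually achieve well below the nominal rate.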

Questions?
