Advancing Library Cyberinfrastructure for Big Data Sharing and Reuse 2017 NFAIS Annual Conference, Feb 27, 2017 Zhiwu Xie


Advancing Library Cyberinfrastructure

for Big Data Sharing and Reuse

2017 NFAIS Annual Conference, Feb 27, 2017

Zhiwu Xie

Big Data: How Big?

• Moving yardstick

• No longer unique to “big” science

• 1000 Genomes Project: 200 TB in 4 years

• Sloan Digital Sky Survey (SDSS) Phases I and II: 130 TB in 8 years

• Today, a small lab can produce as much data in a shorter period of time
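
To put the two figures above on a common scale, the following minimal sketch converts them to average yearly volumes. The totals and durations come from the slide; the function and variable names are illustrative only.

```python
# Average yearly data volume for the two projects cited on the slide.
# Each entry is (total terabytes, duration in years), taken from the slide.
projects = {
    "1000 Genomes": (200, 4),
    "SDSS Phases I and II": (130, 8),
}


def tb_per_year(total_tb, years):
    """Return the average number of terabytes produced per year."""
    return total_tb / years


rates = {name: tb_per_year(tb, yrs) for name, (tb, yrs) in projects.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} TB/year")
```

These are simple averages; the actual output of either project was likely uneven over time.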

Library Big Data: Examples

• Library of Congress Twitter Archive

• Digital Preservation Network (DPN)

• HathiTrust Research Center (HTRC)

• Digital Public Library of America (DPLA)

• SHARE

Towards Use And Reuse Driven Big Data Management

Zhiwu Xie¹, Yinlin Chen¹, Julie Speer¹, Tyler Walters¹, Pablo A Tarazaga², and Mary Kasarda²

¹University Libraries and ²Department of Mechanical Engineering

Virginia Polytechnic Institute and State University

Blacksburg, USA

June 23, 2015, JCDL 2015, Knoxville, TN

“…running water is never stale and door hinges never get worm-eaten…”

-- Lü's Annals, c. 239 BCE

Research Data Management

• What are the roles of the academic and research library?

Research Data Management

• What are the roles of the academic and research library?

• How can we help?

U.S. National Archives’ Local Identifier: 102-LH-1494

Chris 73 / Wikimedia Commons

Big Data: Institutional Context

Data Projects @ VT Libraries

• Inter- and cross-disciplinary

• Grow out of our capacity, beyond IR building

• Focus on reuse

• Require deep engagements

Goodwin Hall Living Lab

• A new 160,000 sq ft building wired with >240 different sensors

• Sensor mounts were welded directly to the structural steel during the building construction

• Sensors are strategically positioned and sufficiently sensitive to detect human movements

• Will be the most instrumented building for vibration

Goodwin Hall Living Lab

• Designed as a multi-purpose living laboratory

• Opportunities for multi- and cross-disciplinary exploration and discovery

• > 40 researchers and educators in various disciplines and institutes have expressed interest in using the data

• VT Libraries is tasked with building the digital libraries to manage the data and support these activities

• Data volume: > 30 TB per year

VT Event Digital Library & Archive

• Track and analyze live events such as earthquakes, political events, community activities, and violence and crime prevention

• Potentially used by researchers from many diverse disciplines

• Currently runs on the lab’s own 20-node Hadoop cluster

• 1 billion tweets & 11 TB of webpages

• Through an MOU, the library invested in the data storage and became a partner

SHARE Notify

• Free, open data set about research and scholarly activities gathered from various sources

• Link publications to grants, receive real-time event notifications on mobile devices, etc.

• 149 aggregated sources, ~20 million events as of Feb 2017

Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse

• A 2-year IMLS National Leadership for Libraries grant, starting from June 2016

• Motivated by the 3 projects above

• A collaboration between VT Libraries, Mechanical Engineering, Computer Science, and UNT

• Emphasis is on

  • Leveraging shared infrastructure

  • Widely applicable strategy

• Equip libraries with solid knowledge and techniques to balance their desires, needs, and constraints with a clear understanding of the tradeoffs

Key Research Questions

• What are the key technical challenges?

• What are the monetary and non-monetary (time, skill set, administrative, etc.) costs? Are there any cost patterns or correlations to the cyberinfrastructure (CI) options?

• What are the knowledge and skill requirements for librarians?

• What are the key service and performance characteristics?

• How to consolidate the answers to the above questions to form an easy-to-adapt and effective library CI strategy?

Cyberinfrastructure Options

• Institutional high-performance computing (HPC), high-throughput computing (HTC) and storage facilities

• National HPC, HTC, and storage facilities, e.g., XSEDE resources

• National research clouds, e.g., Chameleon Cloud, CloudLab, Open Science Data Cloud, etc.

• Commercial clouds, e.g., Amazon Web Services (AWS), Rackspace, etc.

• No unified CI framework or strategy to pick CI for different library big data sharing and reuse situations

Library Big Data Reuse Patterns

[Diagram: three patterns, each relating Compute and Storage]

• Bridge: Goodwin Hall

• Network: Event DL

• Hub: SHARE Notify

Progress So Far

• Identified the network bandwidth as a key bottleneck in the bridge pattern

• Analyzing data loading, its acceleration techniques, and tradeoffs in the network pattern

• Participated in building VT’s mass storage facility

• Participated in building VT’s 10G campus network
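
A back-of-the-envelope sketch of why bandwidth can dominate when data must move between storage and compute. It estimates how long one year of Goodwin Hall data (> 30 TB, per the earlier slide) would take to transfer. The 10 Gbit/s figure assumes "10G" refers to link speed. The 1 Gbit/s link and the 50% efficiency factor are hypothetical values chosen only for comparison, not measurements from the project.

```python
def transfer_days(data_tb, link_gbps, efficiency=1.0):
    """Estimate the days needed to move data_tb terabytes over a link.

    data_tb    -- data size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of nominal speed actually achieved
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400  # seconds per day


# 30 TB over a 10 Gbit/s link at full nominal speed (idealized).
ideal = transfer_days(30, 10)

# Same data over a hypothetical 1 Gbit/s link at an assumed 50% efficiency.
slow = transfer_days(30, 1, efficiency=0.5)

print(f"10 Gbit/s, ideal: {ideal * 24:.1f} hours")
print(f"1 Gbit/s, 50% efficient: {slow:.1f} days")
```

The estimate ignores protocol overhead, disk throughput, and contention, so real transfers would likely take longer.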

Questions?