Advancing Library Cyberinfrastructure
for Big Data Sharing and Reuse
2017 NFAIS Annual Conference, Feb 27, 2017
Zhiwu Xie
Big Data: How Big?
• Moving yardstick
• No longer unique to “big” science
• 1000 Genomes project:
200TB in 4 years
• Sloan Digital Sky Survey Phase I and
II: 130TB in 8 years
• Today, a small lab can
produce as much data in
a shorter period of time
Library Big Data: Examples
• Library of Congress Twitter Archive
• Digital Preservation Network (DPN)
• HathiTrust Research Center (HTRC)
• Digital Public Library of America (DPLA)
• SHARE
Towards Use And Reuse Driven
Big Data Management
Zhiwu Xie1, Yinlin Chen1, Julie Speer1, Tyler Walters1, Pablo A Tarazaga2, and Mary Kasarda2
1University Libraries and 2Department of Mechanical Engineering
Virginia Polytechnic Institute and State University
Blacksburg, USA
June 23, 2015, JCDL 2015, Knoxville, TN
“…running water is never stale and a door hinge never gets worm-eaten…”
-- Lü's Annals, c. 239 BCE
Research Data Management
• What are the roles of the academic and
research library?
• How can we help?
U.S. National Archives’ Local Identifier: 102-LH-1494
Chris 73 / Wikimedia Commons
Big Data: Institutional Context
Data Projects @ VT Libraries
• Inter- and cross-disciplinary
• Grow out of our capacity, beyond IR building
• Focus on reuse
• Require deep engagements
Goodwin Hall Living Lab
• A 160,000-sf new building wired with
>240 different sensors
• Sensor mounts were welded directly
to the structural steel during
building construction
• Sensors are strategically positioned
and sufficiently sensitive to detect
human movements
• Will be the most instrumented
building for vibration
Goodwin Hall Living Lab
• Designed as a multi-purpose living
laboratory
• Opportunities for multi- and cross-
disciplinary exploration and discovery
• > 40 researchers and educators in
various disciplines and institutes
have expressed interest in using the data
• VT Libraries is tasked with building the
digital libraries to manage the data
and support these activities
• Data volume: > 30TB per year
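To put the Goodwin Hall volume next to the "big science" figures quoted earlier, the averages can be worked out directly. This is a minimal sketch that assumes data was produced evenly over each project's stated period, which the slides do not claim; the helper name `tb_per_year` is illustrative only.

```python
def tb_per_year(total_tb: float, years: float) -> float:
    """Average yearly volume, assuming uniform production (an assumption)."""
    return total_tb / years


# Figures as quoted in this deck; Goodwin Hall is a stated lower bound.
rates = {
    "1000 Genomes": tb_per_year(200, 4),
    "SDSS Phase I and II": tb_per_year(130, 8),
    "Goodwin Hall (lower bound)": 30.0,
}

# Print the projects from highest to lowest average yearly volume.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.2f} TB/year")
```

Under that assumption, a single instrumented building produces roughly twice the average yearly volume of the Sloan survey and more than half that of the 1000 Genomes project.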
VT Event Digital Library &
Archive
• Track and analyze live events such
as earthquakes, political events,
community activities, violence, and
crime prevention
• Potentially used by researchers from
many diverse disciplines
• Currently run on the lab’s own 20-
node Hadoop cluster
• 1 billion tweets & 11TB of webpages
• Through an MOU, the library invested in
the data storage and became a
partner
SHARE Notify
• Free, open data set about research
and scholarly activities gathered
from various sources
• Link publications to grants,
receive real-time event notifications
on mobile devices, etc.
• 149 aggregated sources, ~20 million
events as of Feb 2017
Developing Library Cyberinfrastructure
Strategy for Big Data Sharing and Reuse
• A 2-year IMLS National Leadership for Libraries grant,
starting from June 2016
• Motivated by the three projects above
• A collaboration between VT Libraries, Mechanical
Engineering, Computer Science, and UNT.
• Emphasis is on
  • Leveraging shared infrastructure
  • Widely applicable strategy
• Equip libraries with solid knowledge and techniques
to balance their desires, needs, and constraints with
a clear understanding of the tradeoffs
Key Research Questions
• What are the key technical challenges?
• What are the monetary and non-monetary (time,
skill set, administrative, etc.) costs? Are there any
cost patterns or correlations to the CI options?
• What are the knowledge and skill requirements
for librarians?
• What are the key service and performance
characteristics?
• How can the answers to the above
questions be consolidated into an easy-to-adapt
and effective library CI strategy?
Cyberinfrastructure Options
• Institutional high-performance computing (HPC),
high-throughput computing (HTC) and storage
facilities
• National HPC, HTC, and storage facilities, e.g.,
XSEDE resources
• National research clouds, e.g., Chameleon
Cloud, CloudLab, Open Science Data Cloud,
etc.
• Commercial clouds, e.g., Amazon Web Services
(AWS), Rackspace, etc.
• No unified CI framework or strategy to pick CI for
different library big data sharing and reuse
situations
Library Big Data Reuse Patterns
[Figure: three patterns connecting Compute and Storage]
• Bridge: Goodwin Hall
• Network: Event DL
• Hub: SHARE Notify
Progress So Far
• Identified the network bandwidth as a key
bottleneck in the bridge pattern
• Analyzing data loading, its acceleration
techniques, and tradeoffs in the network pattern
• Participated in building VT’s mass storage facility
• Participated in building VT’s 10G campus network
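A back-of-the-envelope estimate shows why network bandwidth matters when data has to move between storage and compute. The sketch below is illustrative only: it assumes decimal terabytes and a link fully dedicated to the transfer, and the function name `transfer_hours` and its `efficiency` parameter are hypothetical, not part of the project. The inputs are the Goodwin Hall volume (30TB per year) and a 10G link.

```python
def transfer_hours(terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `terabytes` over a `link_gbps` link.

    `efficiency` is an assumed fraction of line rate actually achieved
    (1.0 means the full nominal bandwidth, which is an idealization).
    """
    if terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("inputs must be positive and efficiency within (0, 1]")
    bits = terabytes * 1e12 * 8                      # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # bits / (bits per second)
    return seconds / 3600


# One year of Goodwin Hall data (lower bound) over two link speeds.
for gbps in (1, 10):
    print(f"30 TB over {gbps} Gbps: {transfer_hours(30, gbps):.1f} hours")
```

At full line rate, a year of sensor data takes under 7 hours to move over a 10 Gbps link and about 67 hours over a 1 Gbps link. Sustained throughput is usually lower than line rate, so these figures are best read as lower bounds.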
Questions?