View
105
Download
1
Category
Tags:
Preview:
DESCRIPTION
This presentation addresses the challenge of processing big data in a cloud-based data repository. Using the Hydra Project’s Hydra and Sufia ruby gems and working with the Hydra community, we created a special repository for the project, and set up background jobs. Our approach is to create the metadata with these jobs, which are distributed across multiple computing cores. This will allow us to scale our infrastructure out on an as-needed basis, and decouples automatic metadata creation from the response times seen by the user. While the metadata is not immediately available after ingestion, it does mean that the object is. By distributing the jobs, we can compute complex properties without impacting the repository server. Hydra and Sufia allowed us to get a head start by giving us a simple self deposit repository, complete with background jobs support via Redis and Resque.
Citation preview
BIG DATA PROCESSING IN THE CLOUD:A HYDRA/SUFIA EXPERIENCE
HelsinkiJune 2014
Collin BrittleZhiwu Xie
WHO?
WHAT?
WHY?
SENSORS
SMART INFRASTRUCTURE
DATA SHARING
• Encourage exploratory and multidisciplinary research
• Foster open and inclusive communities around
• modeling of dynamic systems• structural health monitoring and damage detection• occupancy studies• sensor evaluation• data fusion• energy reduction• evacuation management • …
CHARACTERIZATION
• Compute intensive
• Storage intensive
• Communication intensive
• On-demand
• Scalability challenge
COMPUTE INTENSIVE
• About 6GB raw data per hour
• Must be continuously processed,
ingested, and further processed
• User-generated computations
• Must not interfere with data retrieval
STORAGE INTENSIVE
• SEB will accumulate about 60TB of raw data per year
• To facilitate researchers, we must keep raw data for an extended period of time, e.g., >= 5 years
• VT currently does not have an affordable storage facility to hold this much data
• Within XSEDE, only TACC’s Ranch can allocate this much storage
COMMUNICATION INTENSIVE
• What if hundreds of researchers around the world each tried to download hundreds of TB of our data?
ON DEMAND
• Explorative and multidisciplinary research cannot predict the data usage beforehand
SCALABILITY
• How to deal with these challenges in a scalable manner?
BIG DATA + CLOUD
• Affordable
• Elastic
• Scalable
FRAMEWORK REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
FRAMEWORK REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
OBJECTS AND DATASTREAMS
Local Object
Meta Meta File
OBJECTS AND DATASTREAMS
Local Object
Meta Meta File
REMOTESTORAGE
LocalRepository
EC2 GlacierS3
Amazon
FRAMEWORK REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
Worker
Worker
Worker
Database
Public Server
Clients
Redis
BACKGROUNDPROCESSING
01000010
FROM QUEUES TO THE CLOUD
10100101
01010101
11000011
10100101
FROM QUEUES TO THE CLOUD
10100101
11000011
10100101
10100101
FROM QUEUES TO THE CLOUD
10100101
11000011
10100101
10100101
FROM QUEUES TO THE CLOUD
10100101
11000011
10100101
FROM QUEUES TO THE CLOUD
10100101
10100101
10100101
11000011
FROM QUEUES TO THE CLOUD
10100101
10100101
11000011
00111100
10100101
FROM QUEUES TO THE CLOUD
10100101
10100101
10100101
FROM QUEUES TO THE CLOUD
10100101
10100101
10100101
FROM QUEUES TO THE CLOUD
10100101
10100101
10100101
FROM QUEUES TO THE CLOUD
10100101
11110000
10100101
10100101
FROM QUEUES TO THE CLOUD
10100101
10100101
QUEUEING
QUEUEING
FRAMEWORK REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
01010101
01010101
FROM QUEUES TO THE CLOUD
00100100
00100100
00100100
10100101
10100101
10100101
11000011
11000011
11000011
11000011
FROM QUEUES TO THE CLOUD
10100101
11000011
00100100
00000010
Database
Public Server
Clients
RedisMaster
RedisSlave
Private Server
Private Server
Private Server
DISTRIBUTED PROCESSING
SCALE UP
SCALE UP
WE CHOSE SUFIA
WHAT IS SUFIA?
• Ruby on Rails framework…
• Based on Hydra…
• Using Fedora Commons…
• And Resque
FRAMEWORK REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
QUESTIONS?rotated8 (who works at) vt.edu
Recommended