BIG DATA PROCESSING IN THE CLOUD: A HYDRA/SUFIA EXPERIENCE Helsinki June 2014 Collin Brittle Zhiwu Xie

DESCRIPTION

This presentation addresses the challenge of processing big data in a cloud-based data repository. Using the Hydra Project’s Hydra and Sufia Ruby gems, and working with the Hydra community, we created a special repository for the project and set up background jobs. Our approach is to create the metadata with these jobs, which are distributed across multiple computing cores. This lets us scale out our infrastructure as needed and decouples automatic metadata creation from the response times seen by the user. The metadata is not available immediately after ingestion, but the object itself is. By distributing the jobs, we can compute complex properties without burdening the repository server. Hydra and Sufia gave us a head start by providing a simple self-deposit repository, complete with background job support via Redis and Resque.

BIG DATA PROCESSING IN THE CLOUD: A HYDRA/SUFIA EXPERIENCE

Helsinki, June 2014

Collin Brittle, Zhiwu Xie

WHO?

WHAT?

WHY?

SENSORS

SMART INFRASTRUCTURE

DATA SHARING

• Encourage exploratory and multidisciplinary research

• Foster open and inclusive communities around

  • modeling of dynamic systems
  • structural health monitoring and damage detection
  • occupancy studies
  • sensor evaluation
  • data fusion
  • energy reduction
  • evacuation management
  • …

CHARACTERIZATION

• Compute intensive

• Storage intensive

• Communication intensive

• On-demand

• Scalability challenge

COMPUTE INTENSIVE

• About 6GB raw data per hour

• Must be continuously processed, ingested, and further processed

• User-generated computations

• Must not interfere with data retrieval

STORAGE INTENSIVE

• SEB will accumulate about 60TB of raw data per year

• To support researchers, we must keep raw data for an extended period, e.g., at least 5 years

• VT currently does not have an affordable storage facility to hold this much data

• Within XSEDE, only TACC’s Ranch can allocate this much storage
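
The hourly and yearly figures on these two slides can be cross-checked with a little arithmetic. The sketch below assumes a constant 6 GB per hour, decimal units (1 TB = 1000 GB), and a 5-year retention period. The function names are illustrative and do not come from the talk.

```python
HOURS_PER_YEAR = 24 * 365   # ignoring leap years
GB_PER_TB = 1000            # assumption: decimal units


def yearly_volume_tb(gb_per_hour: float) -> float:
    """Raw data accumulated over one year, in TB."""
    return gb_per_hour * HOURS_PER_YEAR / GB_PER_TB


def retained_volume_tb(tb_per_year: float, years: int) -> float:
    """Total raw data held when every year is kept for `years` years."""
    return tb_per_year * years


per_year = yearly_volume_tb(6)
print(f"{per_year:.1f} TB per year at 6 GB/hour")
print(f"{retained_volume_tb(60, 5):.0f} TB after 5 years at 60 TB/year")
```

A steady 6 GB per hour works out to roughly 53 TB per year, the same order as the 60 TB quoted above. Five years of retention at 60 TB per year comes to about 300 TB.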

COMMUNICATION INTENSIVE

• What if hundreds of researchers around the world each tried to download hundreds of TB of our data?
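
A rough estimate shows why this question matters. The numbers below are hypothetical and chosen only to make the scale concrete: 100 researchers, 100 TB each, and a single 10 Gbit/s outbound link running at full speed with no protocol overhead.

```python
def transfer_days(total_tb: float, link_gbit_per_s: float) -> float:
    """Days needed to move `total_tb` terabytes over one saturated link."""
    total_bits = total_tb * 1e12 * 8              # decimal TB to bits
    seconds = total_bits / (link_gbit_per_s * 1e9)
    return seconds / 86_400


researchers = 100   # hypothetical
tb_each = 100       # hypothetical
days = transfer_days(researchers * tb_each, link_gbit_per_s=10)
print(f"{days:.0f} days to serve {researchers * tb_each} TB")
```

Under these assumptions the downloads would keep the link saturated for about three months, which is the kind of load a single institutional server cannot absorb.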

ON DEMAND

• Data usage in exploratory and multidisciplinary research cannot be predicted beforehand

SCALABILITY

• How to deal with these challenges in a scalable manner?

BIG DATA + CLOUD

• Affordable

• Elastic

• Scalable

FRAMEWORK REQUIREMENTS

• Mix local and remote content

• Support background processing

• Be distributable

OBJECTS AND DATASTREAMS

Local Object

[Diagram: an object with three datastreams labeled Meta, Meta, and File]

REMOTE STORAGE

[Diagram: Local Repository alongside Amazon services: EC2, S3, Glacier]
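
One way to picture an object that mixes local and remote content is a record whose datastreams either hold bytes locally or point at remote storage. This is a minimal sketch of that idea, not the Hydra or Fedora data model. The class names, the datastream names, and the `s3://` URL are all hypothetical, and which datastreams are remote is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Datastream:
    name: str
    content: Optional[bytes] = None     # set when the bytes are held locally
    remote_url: Optional[str] = None    # set when the bytes live in remote storage

    @property
    def is_remote(self) -> bool:
        return self.remote_url is not None


@dataclass
class RepositoryObject:
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, ds: Datastream) -> None:
        self.datastreams[ds.name] = ds

    def remote_names(self) -> List[str]:
        """Names of the datastreams stored outside the local repository."""
        return sorted(n for n, ds in self.datastreams.items() if ds.is_remote)


# Assumption: small metadata stays local, the large file goes to remote storage.
obj = RepositoryObject(pid="example:1")
obj.add(Datastream("descMetadata", content=b"<title>Sensor run</title>"))
obj.add(Datastream("content", remote_url="s3://example-bucket/raw/run-0001.bin"))
print(obj.remote_names())
```

Keeping only a pointer for the large file lets the local repository stay small while the raw data sits in whichever storage tier is affordable.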

FRAMEWORK REQUIREMENTS

• Mix local and remote content

• Support background processing

• Be distributable

BACKGROUND PROCESSING

[Diagram components: Clients, Public Server, Database, Redis, three Workers]
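
The pattern on this slide can be sketched with an in-process queue and three worker threads. This is only an illustration under stated assumptions: Python's `queue.Queue` stands in for the Redis-backed queue, a dictionary stands in for the repository, and the "metadata" is just a byte count. It does not use Resque or Sufia, and every name in it is hypothetical.

```python
import queue
import threading

repository = {}          # stand-in for the repository's object store
jobs = queue.Queue()     # stand-in for the Redis-backed job queue


def ingest(object_id, payload):
    """Store the object right away and defer metadata creation to a worker."""
    repository[object_id] = {"payload": payload, "metadata": None}
    jobs.put(object_id)


def worker():
    """Pull object ids off the queue and fill in their metadata."""
    while True:
        object_id = jobs.get()
        if object_id is None:        # sentinel: stop this worker
            jobs.task_done()
            break
        record = repository[object_id]
        record["metadata"] = {"size_bytes": len(record["payload"])}
        jobs.task_done()


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

ingest("obj-1", b"\x01\x02\x03")     # the object is retrievable from here on
ingest("obj-2", b"\x04\x05")
jobs.join()                          # wait until the queued jobs are done

for _ in threads:
    jobs.put(None)                   # one sentinel per worker
for t in threads:
    t.join()

print(repository["obj-1"]["metadata"])
```

The point of the design is that `ingest` returns as soon as the object is stored, so the time a user waits does not depend on how expensive the metadata computation is.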

FROM QUEUES TO THE CLOUD

[Animated figure across slides 22 to 32: 8-bit data blocks such as 10100101, 11000011, and 01010101; the layout is not recoverable from the extracted text]

QUEUEING

FRAMEWORK REQUIREMENTS

• Mix local and remote content

• Support background processing

• Be distributable

FROM QUEUES TO THE CLOUD

[Animated figure across slides 36 to 38: 8-bit data blocks such as 01010101, 00100100, 10100101, and 11000011; the layout is not recoverable from the extracted text]

DISTRIBUTED PROCESSING

[Diagram components: Clients, Public Server, Database, Redis Master, Redis Slave, three Private Servers]
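
To get a feel for what adding private servers buys, the sketch below simulates a first-in, first-out job queue drained by a configurable number of workers and reports when the last job finishes. It is a simplified model with hypothetical job lengths. It ignores network transfer, queue overhead, and job failures, and it is not a measurement from the project.

```python
import heapq


def makespan(job_durations, workers):
    """Time until a FIFO queue drains when `workers` servers pull from it."""
    if workers < 1:
        raise ValueError("need at least one worker")
    free_at = [0.0] * workers            # when each worker is next idle
    heapq.heapify(free_at)
    finish = 0.0
    for duration in job_durations:
        start = heapq.heappop(free_at)   # the earliest idle worker takes the job
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


durations = [4, 3, 2, 1]                 # hypothetical job lengths, in minutes
for n in (1, 2, 4):
    print(n, "worker(s):", makespan(durations, n), "minutes")
```

With these durations one worker needs 10 minutes, two need 5, and four need 4, which is the length of the longest job. Past that point extra workers add nothing for this batch, so scaling out helps most when many jobs are waiting.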

SCALE UP

WE CHOSE SUFIA

WHAT IS SUFIA?

• Ruby on Rails framework…

• Based on Hydra…

• Using Fedora Commons…

• And Resque

FRAMEWORK REQUIREMENTS

• Mix local and remote content

• Support background processing

• Be distributable

QUESTIONS?

rotated8 (who works at) vt.edu