GRNET eScience platform for Big Data management · 2016-02-01

GRNET eScience platform for Big Data management

Codename: orka

Monday, February 1, 2016

Project Vision

• Data-intensive science (store and process big data, at petabyte scale)
• Scientific workflows
• Virtual Research Environment
• Data streaming
• The problem: data deluge
• Solution:
– PaaS over ~okeanos (VM, processing) and Pithos+ (storage)

Big data

Hadoop project

• Most popular implementation of the MapReduce programming paradigm
• Open source, commodity hardware
• Hadoop core (MapReduce, Hadoop Distributed File System)
• Rich ecosystem (Pig, Hive, HBase, many more)
• The researcher focuses on the algorithm and not on installing, maintaining, or scaling the software
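To make the MapReduce paradigm concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It is plain Python rather than Hadoop code, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big science"]))
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; the researcher only supplies the two functions.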

Hadoop cluster with ~orka

• GUI, CLI, REST on top of ~okeanos to:
– Create a cluster (with configurable options) from a range of Hadoop distros (aka images)
– Transfer your data
– Submit, execute, monitor jobs
– Delete cluster
– Start/stop/format cluster
– Scale cluster, add/remove nodes
– Save cluster creation metadata for reproducibility


Add-ons to basic Hadoop

• Other components & runtimes
– Spark
• Apache Hadoop-based distros
– Cloudera
– Hue (HDFS explorer, Oozie web editor)
• Storage backend
– Pithos ⇔ HDFS connector (analogous to the Amazon S3 filesystem for Hadoop)

Scientific Workflows

• Orchestration of atomic jobs
• Apache Oozie
• Apache Pig
– Built into orka images
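Orchestration here means running atomic jobs in an order that respects their dependencies. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+). It is not Oozie's workflow format, and the job names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    jobs: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
) -> List[str]:
    """Run every job after all of its dependencies; return the run order.

    static_order() raises graphlib.CycleError if the dependencies are cyclic.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        jobs[name]()
    return order


# Hypothetical four-step pipeline: each job just records that it ran.
log: List[str] = []
jobs = {
    name: (lambda n=name: log.append(n))
    for name in ("ingest", "clean", "analyze", "report")
}
depends_on = {"clean": {"ingest"}, "analyze": {"clean"}, "report": {"analyze"}}
order = run_workflow(jobs, depends_on)
```

Declaring dependencies instead of a fixed script order lets the orchestrator decide what can run next, which is the property a workflow engine builds on.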

Collaborative scientific research

• Virtual Research Environment
• Complete system for teams and projects
• Components:
– Research/Project homepage (portal, wiki)
– Project management
– Teleconference
– Digital repositories
• Implemented as Docker images

Virtual Research Environment

Category               Software stack
Portal / CMS           Drupal (v7.37)
Wiki, blog, forum      MediaWiki (v1.2.4)
Project management     Redmine (v3.04)
Web conferencing       BigBlueButton (v0.81)
Digital repositories   DSpace (v5.3)

Reproducible Research

• Save your experiment's metadata as a bundle
• Domain Specific Language (DSL) that fully describes an experiment/job
• Text editor => simple YAML file
• Re-play, possibly with different parameters
• Save bundle to Pithos
• Share your bundle with other ~okeanos users
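The re-play idea can be sketched as: take the saved description, override selected parameters, and derive a new description without touching the original. The sketch assumes the YAML file has already been parsed into a dictionary (for example by a YAML library). Every key name and value below is hypothetical and not taken from the orka DSL.

```python
import copy
from typing import Any, Dict


def replay(experiment: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a copy of the experiment with some job parameters replaced."""
    unknown = set(overrides) - set(experiment["parameters"])
    if unknown:
        # Refuse parameters the saved experiment never declared.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    rerun = copy.deepcopy(experiment)  # the saved bundle stays unchanged
    rerun["parameters"].update(overrides)
    return rerun


# Hypothetical experiment description, as it might look after parsing.
saved = {
    "cluster": {"nodes": 4, "image": "hadoop-base"},
    "parameters": {"input": "dataset-a", "iterations": 10},
}
rerun = replay(saved, iterations=50)
```

Keeping the saved bundle immutable is what makes the original run reproducible even after many variations have been derived from it.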

Data streams into HDFS

• Apache Flume
• Integrated into the Hadoop ecosystem
• Focus on streaming data
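A Flume agent moves events from a source, through a buffering channel, to a sink such as HDFS. The following is a minimal in-memory sketch of that buffering-and-batching pattern. It does not use Flume itself, and the `Channel` class and `drain` function are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Deque, List


class Channel:
    """Buffer that decouples the event source from the sink."""

    def __init__(self) -> None:
        self._events: Deque[str] = deque()

    def put(self, event: str) -> None:
        """Source side: append one incoming event."""
        self._events.append(event)

    def take(self, max_events: int) -> List[str]:
        """Sink side: remove and return up to max_events events, oldest first."""
        batch: List[str] = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch


def drain(channel: Channel, batch_size: int) -> List[List[str]]:
    """Pull events in batches, as a sink writing files would."""
    batches: List[List[str]] = []
    while True:
        batch = channel.take(batch_size)
        if not batch:
            return batches
        batches.append(batch)


channel = Channel()
for i in range(5):
    channel.put(f"event-{i}")
batches = drain(channel, batch_size=2)
```

Batching matters for a file system like HDFS, which handles a few large writes better than many tiny ones.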

High-level Architecture

Technology Stack eScience

Subsystem 1 [Orka 0.1.1]

Orka SubSystem: Technologies Overview

Front-End
• Single Page Application (SPA): HTML5, CSS3, EmberJS, Bootstrap
• Command Line (CLI) API: Orka API (Python scripts)

Back-End
• Web server: nginx
• App server: uWSGI
• REST API: Django REST Framework
• External APIs/technologies: Synnefo/kamaki, authentication, Hadoop
• Also supported (in progress): RabbitMQ message broker, Celery task manager

Data
• Postgres DBMS

Current state

– github.com/grnet/e-science

– escience.grnet.gr

lambda on demand

λ

λ lambda.grnet.gr 2

Simplifying Computing

The lambda architecture

• A useful framework to think about designing big data applications
• A robust framework for ingesting real-time streams of data while providing efficient stream and batch analytics
• Fault-tolerant against both hardware failures and human errors
• Serves a wide range of use cases, including those in which low-latency reads and updates are required


λ: lambda architecture

The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.

• Batch layer: has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing arbitrary query functions, called batch views.
• Serving layer: indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
• Speed layer: compensates for the high latency of updates to the serving layer and deals with recent data only.


λ: lambda architecture, an example

[Figure: incoming data, the batch layer (master dataset), the serving layer (batch views), the speed layer (real-time views), and queries, with steps numbered 1 to 5.]

1. Data is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer precomputes the batch views.
3. The serving layer indexes the batch views.
4. The speed layer deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views.
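The five steps can be sketched in a few lines, assuming a simple page-view counting use case. The event shape, the function names, and the `last_batch_run` cut-off are all hypothetical. A real deployment would use separate batch and streaming engines rather than in-memory counters.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (timestamp, page): hypothetical event shape


def batch_view(master_dataset: Iterable[Event], up_to: int) -> Counter:
    """Batch layer: recompute the view from all raw data up to a cut-off."""
    return Counter(page for ts, page in master_dataset if ts <= up_to)


def realtime_view(recent: Iterable[Event], after: int) -> Counter:
    """Speed layer: count only events newer than the last batch run."""
    return Counter(page for ts, page in recent if ts > after)


def query(page: str, batch: Counter, realtime: Counter) -> int:
    """Answer a query by merging the batch view and the real-time view."""
    return batch[page] + realtime[page]


# Append-only master dataset; the batch layer last ran at timestamp 2.
events: List[Event] = [(1, "home"), (2, "docs"), (3, "home"), (4, "home")]
last_batch_run = 2
batch = batch_view(events, up_to=last_batch_run)
realtime = realtime_view(events, after=last_batch_run)
home_views = query("home", batch, realtime)
```

Because the two views cover disjoint time ranges, adding them gives the same answer as recomputing over all data, while only the small recent slice has to be processed with low latency.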


Lambda on demand service

[Figure: ~okeanos users, the Lambda on demand service, the λ API, λ instances, and their λ layers (speed, batch). Caption: Provisioning a λ instance.]


λambda UI: Dashboard, Instances, Applications and Help

• λ Instances (manage lambda instances): Create your lambda instances based on your needs. Manage them, deploy applications, and start your lambda instance.
• Applications (manage your applications): Upload your Java or Scala application for streaming and batch jobs. Your applications are stored on the Pithos+ storage service.
• Help (informational guides): Short guides on how to (i) deploy, run and manage your lambda instances, (ii) deploy, run and manage your applications, (iii) export and view your results.


Experienced User: use the λambda API

λ API
• lambda instance: create, manage, delete
• lambda applications: upload, manage, delete
• Well documented with Swagger and MkDocs


e-science vs λ

• Lambda (λ): focuses on analysing streaming data
• e-Science: focuses on existing data and offers pre-installed collaborative tools to handle data


Questions?
