GRNET eScience platform
for Big Data management
Codename: orka
Monday, February 1, 2016
Project Vision
• Data-Intensive Science (store and process big data, at Petabyte scale)
• Scientific workflows
• Virtual Research Environment
• Data streaming
• The problem: data deluge
• Solution:
  – PaaS over
    • ~okeanos (VM, processing)
    • Pithos+ (storage)
Big data
Hadoop project
• Most popular implementation for the MapReduce programming paradigm
• Open source, commodity hardware
• Hadoop core (MapReduce, Hadoop Distributed File System)
• Rich ecosystem (Pig, Hive, HBase, many more)
• Researcher focuses on the algorithm and not on software installation, maintenance, scaling, etc.
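To make the MapReduce paradigm concrete, here is a minimal single-process sketch of the classic word-count job in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real Hadoop job would distribute the same three phases across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big science", "data deluge"])
print(counts)
```

The researcher only writes the map and reduce functions; grouping, distribution and fault tolerance are what the framework provides.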
Hadoop cluster with ~orka
• GUI, CLI, REST on top of ~okeanos to:
  – Create cluster (with configurable options) from a range of Hadoop distros (aka images)
  – Transfer your data
  – Submit, execute, monitor jobs
  – Delete cluster
  – Start/stop/format cluster
  – Scale cluster, add/remove nodes
  – Save cluster creation metadata for reproducibility
Hadoop cluster with ~orka
Add-ons to basic Hadoop
• Other components & runtimes
  – Spark
• Apache Hadoop-based distros
  – Cloudera
  – Hue (HDFS explorer, Oozie web editor)
• Storage backend
  – Pithos ↔ HDFS connector (analogous to Amazon S3 Filesystem for Hadoop)
Scientific Workflows
• Orchestration of atomic jobs
• Apache Oozie
• Apache Pig
  – Built-in in orka images
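Orchestrating atomic jobs means running each job only after the jobs it depends on have finished. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+). It is not Oozie or Pig code; `run_workflow` and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, dependencies):
    """Run every job once, after all of its prerequisites.

    jobs:         job name -> callable taking the dict of earlier results
    dependencies: job name -> set of job names that must run first
    (Both structures are assumptions made for this sketch.)
    """
    graph = {name: dependencies.get(name, set()) for name in jobs}
    order, results = [], {}
    for name in TopologicalSorter(graph).static_order():
        results[name] = jobs[name](results)
        order.append(name)
    return order, results


# Three hypothetical atomic jobs chained into a workflow.
jobs = {
    "ingest": lambda r: [3, 1, 2],
    "clean": lambda r: sorted(r["ingest"]),
    "report": lambda r: sum(r["clean"]),
}
dependencies = {"clean": {"ingest"}, "report": {"clean"}}

order, results = run_workflow(jobs, dependencies)
print(order, results["report"])
```

A workflow engine adds scheduling, retries and cluster submission on top of this ordering step.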
Collaborative scientific research
• Virtual Research Environment
• Complete system for teams and projects
• Components:
  – Research/Project homepage (portal, wiki)
  – Project Management
  – Teleconference
  – Digital repositories
• Implemented as Docker images
Virtual Research Environment
Category               Software stack
Portal / CMS           Drupal (v7.37)
Wiki, blog, forum      Mediawiki (v1.2.4)
Project management     Redmine (v3.04)
Web conferencing       BigBlueButton (v0.81)
Digital repositories   DSpace (v5.3)
Reproducible Research
• Save your experiment's metadata as a bundle
• Domain Specific Language (DSL) that fully describes an experiment/job
• Text editor => simple YAML file
• Re-play, possibly with different parameters
• Save bundle to Pithos
• Share your bundle with other ~okeanos users
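The re-play idea can be sketched as follows: an experiment is a plain data structure, and a re-run is a copy of it with some parameters overridden. This is only an illustration. The field names (`cluster`, `job`, `parameters`), the values, and the `replay` helper are hypothetical and do not reflect the actual orka DSL; JSON stands in for YAML here to keep the sketch within the standard library.

```python
import copy
import json


def replay(bundle, overrides=None):
    """Return a new experiment description with some parameters replaced.

    The original bundle is left untouched so it stays reproducible.
    """
    new_bundle = copy.deepcopy(bundle)
    new_bundle["parameters"].update(overrides or {})
    return new_bundle


# Hypothetical experiment bundle; every key and value is an assumption.
bundle = {
    "cluster": {"nodes": 4},
    "job": "wordcount",
    "parameters": {"input": "input.txt", "reducers": 2},
}

rerun = replay(bundle, {"reducers": 8})

# A serialized bundle is what would be saved and shared.
print(json.dumps(rerun, indent=2))
```

Because the original bundle is never mutated, the first run can always be reproduced exactly, while variations are stored as separate bundles.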
Data streams into HDFS
• Apache Flume
• Integrated into the Hadoop ecosystem
• Focus on streaming data
High-level Architecture
Technology Stack: eScience
Subsystem 1 [Orka 0.1.1]
Orka Subsystem: Technologies Overview
• Front-End
  – Single Page Application (SPA): HTML5, CSS3, EmberJS, Bootstrap
  – Command Line (CLI) API: Orka API (Python scripts)
• Back-End
  – Web Server: nginx
  – REST API: Django REST Framework
  – App Server: uWSGI
  – Data: Postgres DBMS
  – Also supported (in progress): RabbitMQ message broker, Celery task manager
• External APIs/Technologies: Synnefo/kamaki, Authentication, Hadoop
Current state
– github.com/grnet/e-science
– escience.grnet.gr
lambda on demand
λ lambda.grnet.gr
Simplifying Computing
The lambda architecture
a) a useful framework to think about designing big data applications
b) a robust framework for ingesting real-time streams of data while providing efficient stream and batch analytics
c) fault-tolerant against both hardware failures and human errors
d) serves a wide range of use cases in which low-latency reads and updates are required
λ: lambda architecture
The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
• Batch layer: has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing arbitrary query functions, called batch views.
• Serving layer: indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
• Speed layer: compensates for the high latency of updates to the serving layer and deals with recent data only.
λ: lambda architecture, an example
[Diagram: incoming data flows to the batch layer (master dataset) and to the speed layer (real-time views); the serving layer holds the batch views; queries read from both the batch views and the real-time views.]
1. Data is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer precomputes the batch views.
3. The serving layer indexes the batch views.
4. The speed layer deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views.
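The five steps above can be sketched in a few lines of Python, using a per-key event count as the query function. This is a toy, single-process illustration, not the lambda.grnet.gr implementation: the class `LambdaSketch` and its method names are assumptions, and a real deployment would back each layer with separate batch, streaming and storage systems.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, serving and speed layers (hypothetical)."""

    def __init__(self):
        self.master_dataset = []        # append-only raw events (batch layer)
        self.batch_view = Counter()     # precomputed view (serving layer)
        self.realtime_view = Counter()  # recent data only (speed layer)

    def ingest(self, key):
        """Step 1: dispatch each event to both the batch and speed layers."""
        self.master_dataset.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Steps 2-3: recompute the batch view from the whole master dataset.

        Once the batch view covers all ingested events, the real-time
        view is cleared so it again holds recent data only (step 4).
        """
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        """Step 5: merge the batch view and the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


system = LambdaSketch()
for event in ["a", "a", "b"]:
    system.ingest(event)
before_batch = system.query("a")  # answered from the real-time view only

system.run_batch()
system.ingest("a")
after_batch = system.query("a")   # batch view plus one recent event
print(before_batch, after_batch)
```

Because the master dataset is never modified in place, a faulty batch computation can be corrected simply by recomputing the view, which is where the tolerance to human error comes from.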
Provisioning a λ instance
[Diagram: ~okeanos users reach the Lambda on demand service through the λ API; the service provisions λ instances, each made up of λ layers (speed, batch). A "Based on" panel of logos is not preserved.]
λambda UI
Dashboard, Instances, Applications and Help
• λ Instances (manage lambda instances): Create your lambda instances based on your needs. Manage and deploy applications, and start your lambda instance.
• Applications (manage your applications): Upload your Java or Scala application for streaming and batch jobs. Your applications are stored on the Pithos+ storage service.
• Help (informational guides): Short guides on how to (i) deploy, run and manage your lambda instances, (ii) deploy, run and manage your applications, (iii) export and view your results.
Experienced User
Use the λambda API
• lambda instance: create, manage, delete
• lambda applications: upload, manage, delete
• Well documented with Swagger and mkdocs
e-science vs λ
Use the λambda API
• Lambda (λ): focuses on analysing streaming data
• e-Science: focuses on existing data and offers pre-installed collaborative tools to handle data