Åke Edlund
KTH PDC-HPC Center for High Performance Computing
KTH HPCViz Data-Intensive Computing Group
KTH PDC-HPC Cloud
1
OpenNebula: Experiences at KTH
With a deeper dive into emerging data analytics stacks
Outline of this talk
Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for
2
Cloud computing and data-intensive computing at PDC - a brief overview!
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for
3
Cloud computing and data-intensive computing at PDC - a brief overview
• Cloud research since 2007 – cloud provider since 2009 – national and international users
• Spark user since May 2012 (more in the last section) – version 0.6 was released on October 15, 2012
• Research and Development – distributed and federated clouds and data analytics stacks – bioinformatics and life-science applications – scalable statistics – self-improving systems – strong and usable security factors to enable researchers to store sensitive data in the cloud
• Projects (many) – SNIC Cloud Infrastructure (co-Initiator and Coordinator) – the Swedish roll-out of cloud for eScience – NeIC Nordic Cloud (co-Initiator and Coordinator, Swedish part) – BioBankCloud (WP leader) – PaaS for biobanking – EGI Federated Cloud task force (development and resource provider) – VENUS-C (WP leader) (2010 – 2012) – …
4
Cloud Resources at PDC
PDC Cloud has been in production (with external users) since 2010 and is today an installation of 364 cores:
- 12 nodes, each consisting of 32 cores, 2 x 1 TB disk and 64 GB RAM
- 20 TB shared (through InfiniBand) by the 12 nodes using Ceph (RBD (block devices), S3 (object storage)); this is under reconstruction (from SAN to dedicated Ceph storage nodes -> 36 TB)
- Cloud middlewares used over the years range from Eucalyptus and OpenNebula to today's mix of OpenNebula and OpenStack
- Users access their resources using a web panel and/or CLI/API

Users (so far) are Nordic and European researchers. PDC Cloud is a leading partner in a number of Swedish, Nordic and European cloud projects, e.g. being one of the first certified cloud resource providers in the EGI Federated Cloud.
5
Data-Intensive Computing at PDC
The HPCViz Data-Intensive Computing Group (started 2012) is a research group building on the experiences from PDC:
- 9 group members (7 researchers, 2 developers)
- Collaborating mainly with Uppsala University (bioinformatics) and KI (SciLifeLab) on applying, and further extending, emerging novel techniques for iterative and interactive in-memory data analytics stacks (Spark, Stratosphere, H2O, …)
- Other areas of interest include anomaly detection in streaming data, with applications in performance improvement of distributed systems, and security (intrusion detection).
6
[1] "Practical Cloud Evaluation from a Nordic eScience User Perspective", VTDC'11, ACM conference San Jose (2011) by Åke Edlund and Maarten Koopman, Zeeshan Ali Shah, Ilja Livenson, Frederik Orellana, Jukka Kommeri, Miika Tuisku, Pekka Lehtovuori, Klaus Marius Hansen, Helmut Neukirchen, Ebba Þóra Hvannberg 7
Our Cloud Learning Curve
Timeline: 2001 – 2014
Nordic cloud project, NEON (2010). Practical evaluation [1], testing public vs. private cloud for eScience users (bioinformatics).
SNIC Cloud project (2011.6 – 2012.6+). Enabled cloud access (public and private) for SNIC users: 14 (some recurring) users of SNIC Cloud on Amazon (e.g. running Galaxy) and 54 on the private cloud (currently only PDC Cloud, partially from outside SNIC).
SNIC Galaxy project (2013.3 – 2014.3). The goal of the project is to deliver Galaxy as a service, using the Galaxy cloud management platform, CloudMan, on local cloud installations (private clouds).
SNIC Cloud Infrastructure (long-term, started Jan 2014). A (generic) IaaS on which communities/users can build their PaaS. Strong emphasis on user communities and their commitment.
Grid Computing projects (DataGrid, EGEE, EGI) – including EGI Federated Clouds TF
KTH PDC Cloud experimentation
Public IaaS / Private IaaS
Private PaaS / Public PaaS
PDC-HPC (since 1989)
[1] "Practical Cloud Evaluation from a Nordic eScience User Perspective", VTDC'11, ACM conference San Jose (2011) by Åke Edlund and Maarten Koopman, Zeeshan Ali Shah, Ilja Livenson, Frederik Orellana, Jukka Kommeri, Miika Tuisku, Pekka Lehtovuori, Klaus Marius Hansen, Helmut Neukirchen, Ebba Þóra Hvannberg 8
2001 2004 2007 2010 2011 2012 2013 2014
Nordic cloud project, NEON (2010) Practical evaluation [1], testing public vs private cloud for eScience users (bioinformatics)
SNIC Cloud project (2011.6-‐2012.6+) Enabled cloud access (public and private) to SNIC users. 14 (some recurring) users of SNIC Cloud for Amazon (e.g. running Galaxy) and 54 on the private cloud (currently only PDC Cloud, partially from outside SNIC)
SNIC Galaxy project (2013.3-‐2014.3). The goal of the project is to deliver Galaxy as a service, using the Galaxy cloud management platform, Cloudman, on local cloud installations (private clouds).
SNIC Cloud Infrastructure (long-‐term, started Jan 2014). A (generic) IaaS on which communities/users can build their PaaS. Strong emphasize on user communities and their commitment.
Grid Computing projects (DataGrid, EGEE, EGI) – including EGI Federated Clouds TF
KTH PDC Cloud experimentation
Public IaaSPrivate IaaS
Private PaaSPublic PaaS
PDC-HPC (since 1989)
IaaS → PaaS: security concerns; service to our users; easier to manage larger user groups.
Public IaaS → Private IaaS: large amounts of sensitive data, often too cumbersome for practical use of public clouds.
Federated Cloud Projects
Current Cloud Projects
- SNIC Cloud (co-Initiator and Coordinator) – the Swedish roll-out of cloud for eScience
- NeIC Nordic Cloud (co-Initiator and Coordinator, Swedish part)
- BioBankCloud (WP leader) – PaaS for biobanking
- EGI Federated Cloud (development and resource provider)
Earlier Cloud Projects
- SNIC Galaxy (PaaS) (co-Initiator and Coordinator) (2013)
- SNIC Cloud (Initiator and Coordinator) (2011-2012)
- SICS Startup Accelerator (co-Initiator and Coordinator) (2011)
- VENUS-C (WP leader) (2010-2012)
- NEON – Northern Europe cloud project (Initiator and Coordinator) (2010)
9
10
Main contribution to this section: from Zeeshan Ali Shah*
Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples!
Apache Spark at PDC - what I use our cloud for
Started with Eucalyptus
• Back in 2009.
• Federated between KTH centers across Stockholm.
• Then Eucalyptus chose a Red Hat-style licensing model.
• We selected OpenNebula instead, due to its openness and the easy access to its core team, which was located in the EU.
11
OpenNebula
• 2010 - selected during the technical kick-off of the VENUS-C project
• Based in the EU, with easy access to the developers
• Fully open source
• Started with OpenNebula 2.0
• An OVF (Open Virtualization Format) interface was developed within VENUS-C
• Federated with other VENUS-C sites such as BSC (Spain) and ENGINEERING (Italy)
12
User base
13
www.e-science.se
www.scilifelab.se
www.natmeg.se
Neurosciences, Karolinska Institute
And, yes, from EGI Fed cloud communities
Science for Life Laboratory (SciLifeLab) is a national center for molecular biosciences with focus on health and environmental research.
OpenNebula User Experience
• Served 100+ users, both Swedish and other EU researchers
• Interfaces:
– Open Nebula CLI
– Sunstone Dashboard
– SDK (not used by many, but the option was there)
• Conducted Hands-on Workshops for users
14
Federation with EGI
• Compute using OCCI (with OpenNebula as backend)
• Automatic injection of user keys from the VOMS server
• Federated identity with VOMS and X.509
• Information system
• Accounting service
15
From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)
Bioscience Users
Pre-configured apps with OpenNebula
• Galaxy - galaxyproject.org
• Cloudbio linux - cloudbiolinux.org
CloudBioLinux / Galaxy (AWS – for CloudMan)
16
Issue: PoC of CloudMan on OpenNebula (SARA, NL) - but it moved to OpenStack
Way forward
• Dedicated storage service, like S3 or Swift (OpenStack)
• Network service for versatile setups, like Neutron (OpenStack)
• Image caching on compute nodes:
– To minimize VM launch time – what we notice is that most of the launch time is spent copying the image to the designated host
– A shared FS is an option, but it has its own limitations.
17
“Wish list” from Zeeshan Ali Shah *
Big Data Analytics
• Apache Spark
• Hadoop
• Mesos -> YARN
• Orchestration of Spark clusters with OpenNebula
18
See next section ….
19
Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for
Sources of Big Data (1990 → 2010)
- Probing extreme phenomena in scientific fields with mature theories
- Increasingly exploratory research areas
- Making meaning of human activity on the Internet
- Sensing everything
20
21
Sthlm, May 2014
Research at the HPCViz Data-Intensive Computing Group
… building a DS curriculum for the group
- Brain images – Scabia project, MEG data
- PaaS for Life Science – BioBankCloud, Galaxy, …
- Privacy preservation in the cloud – BioBankCloud
- Federated clouds – EGI, Nordic Cloud, CDMI proxy
- Cloud environments – environment launching, streaming capabilities, workflows – including graph data capabilities
- Anomaly detection in performance data – intrusion detection, performance analysis, sensor data, IoT, …
- Next: scalable statistics
- Cloud and industry – esp. startups
- Chemoinformatics – MapReduce-based Parallel Virtual Screening
22
(Diagram labels: Applications – Technologies – Industry – Algorithms – Theory)
Federated Cloud Services
Federated IaaS and STaaS Cloud
- Tier 1: Reliable Infrastructure Cloud
- Tier 2: General-purpose platform services (PaaS, DB aaS, Hadoop aaS, VRE; secure storage: key mgmt, encryption, ACL mgmt)
- Tier 3: Platform as a Service
- Tier 4: Zero ICT Infrastructures
Virtual eLaboratory
23
From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)
DAaaS - What do We Need?
• Interactive queries: enable faster decisions
• Queries on streaming data: enable decisions on real-time data
• Sophisticated data processing: enable "better" decisions
• Need for statistical principles (that scale) to justify the inferential leap from data to knowledge:
– Need estimates of uncertainty in the outputs of algorithms ("error bars")
• Pipelines: the ability to run mixed analyses under one framework – for efficiency, and to be able to develop sophisticated algorithms
Support batch, streaming, and interactive computations … in a unified framework
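The unification asked for here can be sketched in plain Scala, with no Spark dependency (an illustration of the idea, not the Spark API; the log format and helper names are hypothetical): write the analysis once, then apply it both to a complete batch and to each arriving micro-batch.

```scala
object UnifiedPipeline {
  // One transformation, written once: keep ERROR lines and count them per
  // service (assuming, for illustration, that the first token is the service).
  def countErrors(lines: Iterator[String]): Map[String, Int] =
    lines
      .filter(_.contains("ERROR"))
      .map(_.split(" ")(0))
      .foldLeft(Map.empty[String, Int]) { (acc, svc) =>
        acc.updated(svc, acc.getOrElse(svc, 0) + 1)
      }

  def main(args: Array[String]): Unit = {
    val logs = List("db ERROR timeout", "web OK", "db ERROR disk full")
    // Batch: the whole dataset at once.
    println(countErrors(logs.iterator))          // Map(db -> 2)
    // "Streaming": the same function applied to each micro-batch as it arrives.
    logs.grouped(2).foreach(mb => println(countErrors(mb.iterator)))
  }
}
```

The point of the unified stack is exactly this reuse: in Spark the batch case runs over an RDD and the streaming case over a DStream, but the transformation code stays the same.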
25
Berkeley Data Analytics Stack (layers, top to bottom):
- Applications
- Data processing: Spark Streaming, GraphX, MLBase, BlinkDB, Shark, Pig, HIVE, Storm, MPI, …
- Engines: Spark, Hadoop MapReduce
- Data management: HDFS (Hadoop Distributed File System) – a distributed file system that provides high-throughput access to application data
- Resource management: Hadoop YARN ("Yet Another Resource Negotiator") – a framework for job scheduling and cluster resource management
- Infrastructure: e.g. public and private clouds
Berkeley Data Analytics Stack
26
Apache Hadoop
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
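The MapReduce model behind that last module can be sketched with plain Scala collections (illustrative only; a real Hadoop job implements Mapper/Reducer classes and runs on YARN): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```scala
object MiniMapReduce {
  // Map phase: each input record emits zero or more (word, 1) pairs.
  def mapPhase(record: String): Seq[(String, Int)] =
    record.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // Shuffle: group the pairs by key (Hadoop does this between map and reduce).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // Reduce phase: aggregate the values for each key.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  // The classic word count, expressed as map -> shuffle -> reduce.
  def wordCount(records: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(records.flatMap(mapPhase)))
}
```

In Hadoop the same three stages run distributed over HDFS blocks; the sketch only shows the data flow.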
Other Hadoop-related projects at Apache include:
• Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
• ZooKeeper™: A high-performance coordination service for distributed applications.
27
Berkeley Data Analytics Stack
• Shark - Hive and SQL on top of Spark
• MLbase - machine learning project on top of Spark
• BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
• GraphX - a graph processing & analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
• Apache Mesos - cluster management system that supports running Spark
• Tachyon - in-memory storage system that supports running Spark
• Apache MRQL - a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
• OpenDL - a deep learning algorithm library based on the Spark framework; just kicked off
• SparkR - R frontend for Spark
• Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster
28
• Unifies batch, streaming, and interactive computations
• Easy to build sophisticated applications
– Supports iterative, graph-parallel algorithms
– Powerful APIs in Scala, Python, Java
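"Iterative, graph-parallel algorithms" are exactly what in-memory frameworks make easy; a minimal plain-Scala sketch of iterative PageRank over a hypothetical three-page toy graph (illustrative only — in Spark the same loop runs over cached RDDs, and GraphX provides PageRank out of the box):

```scala
object ToyPageRank {
  // links: page -> outgoing neighbours. A tiny hypothetical graph.
  val links: Map[String, Seq[String]] =
    Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))

  def ranks(iterations: Int): Map[String, Double] = {
    var rank = links.keys.map(_ -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each page sends rank/outDegree contributions to its neighbours …
      val contribs = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dst => (dst, rank(page) / outs.size))
      }
      // … and each page's new rank is a damped sum of what it received.
      rank = contribs.groupBy(_._1).map { case (p, cs) =>
        (p, 0.15 + 0.85 * cs.map(_._2).sum)
      }
    }
    rank
  }
}
```

The iteration reuses the whole rank table every round, which is why keeping it in memory (rather than rereading from disk per iteration, as in plain MapReduce) pays off.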
Berkeley Data Analytics Stack
29
spark.apache.org
Turning Data into Value – Examples
• Unify real-time and historical data analysis
– Easier to build and maintain
– Cheaper to operate
– Easier to get insights, faster decisions
• Unify streaming and machine-learning
– Faster diagnosis, decisions (e.g., better ad targeting)
• Unify graph processing and ETLs
– Faster to get social network insights (e.g., improve user experience)
30
What it Means for Users
Separate frameworks: each stage is its own job against the file system –
HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → …
Spark: a single HDFS read, then ETL, train, and query all run in memory within one framework – enabling interactive analysis.
31
Advantages of a unified stack
• Explore data interactively to identify problems
• Use the same code in Spark for processing large logs
• Use similar code in Spark Streaming for real-time processing
$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
scala> val filtered = file.filter(_.contains("ERROR"))
scala> val mapped = filtered.map(...)

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(...)
    val stream = ssc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
32
Spark Integration
From Scala:

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
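What closestCenter and the windowed count do can be unpacked in plain Scala (hypothetical helper names, not the MLlib API): assign each point to its nearest centre by squared Euclidean distance, then count points per centre — which is what the map plus reduceByWindow pair computes over each 5-second window.

```scala
object NearestCenter {
  type Point = (Double, Double)

  // Squared Euclidean distance (no sqrt needed for comparing distances).
  def dist2(p: Point, q: Point): Double = {
    val (dx, dy) = (p._1 - q._1, p._2 - q._2)
    dx * dx + dy * dy
  }

  // Index of the nearest centre — the role played by model.closestCenter.
  def closestCenter(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => dist2(centers(i), p))

  // Count points per centre, as map(... -> 1).reduceByWindow(_ + _) would
  // do for the points arriving within one window.
  def countPerCenter(centers: Seq[Point], window: Seq[Point]): Map[Int, Int] =
    window.groupBy(p => closestCenter(centers, p)).map { case (c, ps) => (c, ps.size) }
}
```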
33
Summary – challenges and opportunities arising
• Data processing: from special to general - and back?
• Data locality: from detailed to general – and back? See e.g. Google's Omega efforts
• Infrastructure: from public to private to hybrid cloud
• Disk vs. in-memory: going back to earlier, more complex environments? Not yet.
• Workflows/pipelines: unification is crucial for performance and usability
• New areas evolving, in computer science as well as in statistics
– Quality: need for "error bars" around outcomes
• Need for new solutions to make this possible on large data sets
– Algorithmic weakening for statistical inference
• a new area in theoretical computer science?
• a new area in statistics?
34
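The "error bars" point can be made concrete with a bootstrap — a standard way to attach uncertainty estimates to an algorithm's output. A minimal plain-Scala sketch with a fixed seed (illustrative only; at large scale one uses variants such as the Bag of Little Bootstraps):

```scala
import scala.util.Random

object BootstrapErrorBars {
  // Resample the data with replacement B times, recomputing the statistic
  // each time; the spread of the replicates is an empirical "error bar".
  def bootstrapMeans(data: Vector[Double], b: Int, seed: Long): Vector[Double] = {
    val rng = new Random(seed)
    Vector.fill(b) {
      val resample = Vector.fill(data.size)(data(rng.nextInt(data.size)))
      resample.sum / resample.size
    }
  }

  // Percentile interval: the alpha/2 and 1 - alpha/2 quantiles of the replicates.
  def interval(means: Vector[Double], alpha: Double): (Double, Double) = {
    val s = means.sorted
    val lo = s(((alpha / 2) * s.size).toInt)
    val hi = s(math.min(((1 - alpha / 2) * s.size).toInt, s.size - 1))
    (lo, hi)
  }
}
```

Each resample-and-recompute is independent, which is why this pattern parallelizes naturally over a cluster.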
"Use Clouds running Data Analytics processing Big Data to solve problems in X-Informatics (or e-X)"

Need to excel in many areas, at the same time!
(Data Science Venn diagram: Computer Skills, Mathematics & Statistics Knowledge, Substantive Experience; overlaps: Machine Learning, Traditional Research, Danger Zone; center: Data Science)
References
• Geoffrey Fox, Indiana University
– http://www.soic.indiana.edu/people/profiles/fox-geoffrey-charles.shtml - great visionary researcher in distributed computing and its usage
• Frontiers in Massive Data Analysis
– http://www.nap.edu/catalog.php?record_id=18374 - foundation of the current state of the art
• The Fourth Paradigm: Data-Intensive Scientific Discovery
– http://research.microsoft.com/en-us/collaboration/fourthparadigm/ - a good starting point, esp. the visions from Jim Gray
• Spark-related slides from the Spark team
– Matei Zaharia, MIT and Databricks
– Ion Stoica, UC Berkeley and Databricks
– Patrick Wendell, Databricks
– Joseph Gonzalez (GraphX), UC Berkeley
36