
[IEEE 2010 IEEE International Conference on Cloud Computing (CLOUD) - Miami, FL, USA (2010.07.5-2010.07.10)] 2010 IEEE 3rd International Conference on Cloud Computing - Cloud Computing


Cloud Computing Infrastructure for Biological Echo-Systems Janaka Balasooriya

School of Computing, Informatics, and Decision Systems Eng. Arizona State University, Tempe, AZ

[email protected]

1. Introduction

Biological applications such as gene and protein analysis integrate and analyze biological data for research in bioinformatics and other bio-related fields. Such applications are part of many large-scale scientific applications: they compute, integrate data, execute analyses, and automate processes using information retrieved by different tasks and computational procedures, assisting scientists in scientific discovery and data distribution. Grid-based and/or web-based scientific workflow tools are used for complex bioinformatics research to make scientists’ and researchers’ work easier. On average, scientists spend about 80% of their time assembling data to prepare for analysis. This is largely because many of the resources required for data processing must be gathered from external sources. The best of these resources are scattered across the globe, hosted at universities, institutes, and laboratories throughout the world. Bringing all of these resources together while hiding system-, network-, and application-level heterogeneity is challenging.

1.1 Challenges in Biological Data and Tool Integration

Our recent comprehensive survey on this topic revealed that biological data and tool integration is a challenging task, mainly for the following reasons:

• Data sets are highly diverse (numerical, natural language, strings) and their representations are heterogeneous (text files, databases).

• Data sets to be analyzed can be large.

• Data sources and tools are distributed across the globe.

• Available tools and data sources have non-uniform interfaces for communication and data exchange.

  – Data sources and tools are autonomous and have different interfaces and querying capabilities.

• Some applications require large amounts of computing resources and can be long running.

1.2. Our Vision and Goals

We envision biologists and natural scientists finding the tools and data sources they need and configuring them “on-the-fly” over the web (Figure 1). This requires breaking several technological barriers.

(a) Tools and data sources need to have uniform interfaces for communication and data exchange.

(b) Data source and tool providers need a platform on which they can announce the availability of their resources.

(c) Natural scientists should be able to find the data sources and tools they need, and configure their applications using high-level constructs.

(d) Natural scientists should be able to find resources (such as computing and data storage) to run their applications, if needed, and execute them.

(e) Natural scientists should be able to perform the above tasks over the Web with little or no programming.
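Goals (a) through (c) can be illustrated with a minimal sketch. It is not part of the proposed architecture: the `Resource` and `Registry` classes, the dict-in/dict-out call convention, and the toy sequence data are all assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Resource:
    """A tool or data source wrapped behind one uniform call signature (goal a)."""
    name: str
    kind: str                        # "tool" or "data_source" (hypothetical labels)
    tags: List[str]
    invoke: Callable[[dict], dict]   # assumed uniform interface: dict in, dict out


class Registry:
    """Hypothetical platform where providers announce resources (goal b)."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def announce(self, resource: Resource) -> None:
        self._resources[resource.name] = resource

    def find(self, kind: str, tag: str) -> List[Resource]:
        """Discovery by kind and tag (goal c)."""
        return [r for r in self._resources.values()
                if r.kind == kind and tag in r.tags]


def run_pipeline(steps: List[Resource], payload: dict) -> dict:
    """Chain resources: each step's output becomes the next step's input."""
    for step in steps:
        payload = step.invoke(payload)
    return payload


registry = Registry()

# Toy data source returning a made-up DNA string (illustrative data only).
registry.announce(Resource(
    name="toy_sequence_store", kind="data_source", tags=["dna"],
    invoke=lambda request: {"sequence": "ATGCGC"}))

# Toy tool computing the fraction of G and C bases in the sequence.
registry.announce(Resource(
    name="toy_gc_content", kind="tool", tags=["dna"],
    invoke=lambda request: {
        "gc_fraction": sum(base in "GC" for base in request["sequence"])
        / len(request["sequence"])}))

source = registry.find("data_source", "dna")[0]
tool = registry.find("tool", "dna")[0]
result = run_pipeline([source, tool], {})
print(result)
```

Because every resource exposes the same call signature, the pipeline code does not need to know which provider supplied a step. In a real deployment the `invoke` callables would presumably wrap remote web service calls rather than local functions.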

Figure 1: On-the-Fly data and tool integration in a Cloud-based Bio-Ecosystem (figure labels: Data, Tools, Computing Resources)

2. Biological Data and Tool Integration in the Cloud

2.1 Cloud Capabilities

Cloud computing architecture provides an ideal platform to realize our vision. Even though there is no single agreed definition of Cloud computing, what Cloud computing can offer is widely understood. According to [2], Cloud computing can be seen as a “pool of easily usable and accessible resources (hardware and software). These resources can be dynamically re-configured. This pool of resources is typically exploited by a pay-per-use model.” This fits the vision and goals presented in Section 1.2. In the Cloud, resources are presented as services to its users. In general, the Cloud provides three different types of services (Figure 2). Each type of service can be provided separately, under a pay-per-use model if needed.

2010 IEEE 3rd International Conference on Cloud Computing

978-0-7695-4130-3/10 $26.00 © 2010 IEEE

DOI 10.1109/CLOUD.2010.80


Figure 2: Cloud Computing Resources

As explained in Section 1, the requirements of biological computing applications can vary from needing all three of the above types of services to needing only one. Thus, we can picture biological data sources and tools living in an ecosystem where biologists find the data sources and tools they need and integrate them as required. The following section presents our Bio-Ecosystem architecture on the Cloud.

2.2 Bio-Ecosystem Architecture on the Cloud

In [1], the authors present a Cloud Computing Open Architecture (CCOA). Our Bio-Ecosystem architecture is based on CCOA. Figure 3 illustrates our Bio-Ecosystem architecture. Usually, biological applications involve four types of resources: data sources, tools (software packages), application development platforms, and computing resources. In this architecture:

• Biological data sources and tools can be implemented as software as a service in the cloud.

• Application development platforms can be implemented as platform as a service in the cloud.

• Data storage, network, and hardware requirements such as processing can be implemented as infrastructure as a service in the cloud.
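The mapping above can be written down as a small lookup table. This is an illustrative sketch, not part of the architecture itself: the `ServiceModel` enum, the resource-kind strings, and `required_models` are names assumed for this example.

```python
from enum import Enum


class ServiceModel(Enum):
    SAAS = "Software as a Service"
    PAAS = "Platform as a Service"
    IAAS = "Infrastructure as a Service"


# The four resource types named in the text, mapped to the service model
# under which each can be offered. The key strings are hypothetical labels.
RESOURCE_TO_MODEL = {
    "data_source": ServiceModel.SAAS,
    "tool": ServiceModel.SAAS,
    "development_platform": ServiceModel.PAAS,
    "computing_resource": ServiceModel.IAAS,
}


def required_models(resource_kinds):
    """Return the set of service models an application needs."""
    return {RESOURCE_TO_MODEL[kind] for kind in resource_kinds}


# An application that only combines an existing tool with a data source
# needs a single service type.
needs = required_models(["data_source", "tool"])
print(sorted(model.name for model in needs))
```

An application that also needs a development platform and its own compute would resolve to all three models, which matches the range of requirements described earlier in this section.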

Here, we briefly explain each component of the Bio-Ecosystem architecture.

• User Interface Dashboard: End users (biologists) interact with the ecosystem through this interface to discover, compose, and execute their applications on-the-fly.

• Core Infrastructure Provider Dashboard: Providers of tools, data sources, and network and computing resources make their resources available to the Bio-Ecosystem through this interface, possibly in a plug-and-play manner.

• Third Party Partner Dashboard: Some core infrastructure providers may associate their resources with third-party components such as generic input validation. Such third-party components can be added to the ecosystem through this interface.

• Web Service Interface: Presents reusable components as web services.

• Virtualization of hardware and software: Maintains the availability of hardware and software resources in a dynamic, on-demand manner.

• Bio-Ecosystem Management: Allows tool, data source, and other resource providers, partners, and end users to manage their components in the ecosystem.

Figure 3: Bio-Ecosystem Architecture on the Cloud [1].

5. Conclusions and Future Work

On average, about 80% of the time invested goes into assembling the right data for analysis and gathering resources for data processing from various sources, owing to their diverse nature. Cloud computing provides an ideal platform on which one can develop an environment that allows biologists and natural scientists to find the tools and data sources they need and configure them “on-the-fly” over the web. In this paper we have proposed a Bio-Ecosystem architecture on the Cloud. We are currently designing and developing each component of this architecture.

6. References

[1] Liang-Jie Zhang, Qun Zhou. CCOA: Cloud Computing Open Architecture. IEEE International Conference on Web Services (ICWS), 2009.

[2] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, Maik Lindner. A break in the clouds: towards a cloud definition. ACM SIGCOMM Computer Communication Review, v.39 n.1, January 2009.