
CDCGM Track: Summary Report
Convergence of Distributed Clouds, Grids and their Management

Track Chair’s Report

Dr. Rao Mikkilineni, IEEE Member

C cubed DNA Inc., Cupertino, California, USA

[email protected]

Dr. Giovanni Morana, DIEEI

University of Catania, Catania, Italy

[email protected]

Abstract—The Convergence of Distributed Clouds, Grids and their Management (CDCGM) track focuses on virtualization and cloud computing as they gain wider acceptance. A recent IDC report predicts that by 2016, $1 of every $5 will be spent on cloud-based software and infrastructure. Three papers address key issues in cloud computing: resource optimization, scaling to address changing workloads, and energy management. In addition, the DIME network architecture proposed at WETICE 2010 is discussed in two papers in this conference, both showing its usefulness in addressing the fault, configuration, accounting, performance and security management of service transactions within a service-oriented architecture implementation, including transactions spanning multiple clouds. While virtualization has brought resource elasticity and application agility to services infrastructure management, the resulting layers of orchestration and the lack of end-to-end service visibility and control across multiple service provider infrastructures have added an alarming degree of complexity. Hopefully, reducing the complexity of next-generation datacenters will be a major research topic at this conference.

Keywords—Cloud Computing; Grid Computing; Distributed Intelligent Managed Element Networks; Distributed Services Management; Services Virtualization; Parallel Computing; Many-core Servers

I. INTRODUCTION

While virtualization and cloud computing have brought elasticity to computing resources and agility to applications in a distributed environment, they have also increased the complexity of managing the distributed applications that contribute to a distributed service transaction, by adding layers of orchestration and management systems. Three major factors contribute to this complexity:

1. Current IT datacenters have evolved from their server-centric, low-bandwidth origins to distributed, high-bandwidth environments where resources can be dynamically allocated to applications using computing, network and storage virtualization. While virtual machines improve resiliency and provide live migration to reduce recovery time objectives in case of service failures, the increased complexity of hypervisors, their orchestration, and virtual machine images and their movement and management adds an additional burden in the datacenter. A recent global survey commissioned by Symantec Corporation, involving 2,453 IT professionals at organizations in 32 countries, concludes [1] that the complexity introduced by virtualization, cloud computing and the proliferation of mobile devices is a major problem. The survey asked respondents to rate the level of complexity in each of five areas on a scale of 0 to 10, and the results show that datacenter complexity affects all aspects of computing, including security and infrastructure, disaster recovery, storage and compliance. Respondents on average rated all areas 6.56 or higher on the complexity scale, with security topping the list at 7.06; the average level of complexity across all areas for companies around the world was 6.69. Organizations in the Americas rated complexity highest, at 7.81 on average, and those in Asia-Pacific/Japan lowest, at 6.15.

2. As complexity increases, the typical response is to introduce more automation of resource administration and operational controls. However, the increasing complexity of service management may be more a fundamental architectural issue, related to Gödel’s prohibition of self-reflection in Turing machines [2], than a software design or operational execution issue. Cockshott et al. [3] conclude their book “Computation and its Limits” with the paragraph: “The key property of general-purpose computers is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Automation of dynamic resource administration at run time makes the computer itself a part of the model and also a part of the problem.

3. As services increasingly span multiple datacenters, often owned and operated by different service providers and operators, it is unrealistic to expect that yet more software coordinating the myriad resource management systems belonging to different owners will reduce complexity. A new approach is in order, one that decouples service management from the underlying distributed resource management systems, which are often non-communicative and cumbersome.

Figure 1 summarizes the evolution of current datacenter complexity with respect to three parameters: system resiliency, efficiency and scaling. Resiliency is measured with respect to a service’s tolerance of faults, fluctuations in contention for resources, performance fluctuations, security threats and changing business priorities. Efficiency is measured in terms of total cost of ownership and return on investment. Scaling addresses end-to-end resource provisioning and management as the number of computing elements required to meet service needs increases.

Figure 1: The resiliency, efficiency and scaling of information technology infrastructure. Grid and cloud computing management brings automation of physical and virtual resource management.

The current course becomes even more untenable with the advent of many-core servers containing tens, and even hundreds, of computing cores with high-bandwidth communication among them. It is hard to imagine replicating, inside the next generation of many-core servers, the current TCP/IP-based socket communication, the “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, chips, servers and geographies) without addressing their shortcomings. Many-core servers and processors constitute a network in which each node is itself a sub-network with different bandwidths and protocols: socket-based, lower-bandwidth communication between servers; InfiniBand or PCI Express based communication across processors in the same server; and shared-memory, low-latency communication across the cores inside a processor.

The tradition that started at WETICE 2009, “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity”, has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012 [3, 4]. Two papers in this conference continue the investigation of its usefulness. Hopefully, this tradition will lead to other novel approaches to the datacenter complexity issue, while incremental improvements continue, as is evident from the other three papers.

II. SUMMARY OF THE SESSION

The first paper, by Di Stefano, Morana and Zito, presents a scalable and configurable monitoring system for cloud environments. It proposes “a monitoring system that allows user to collect and aggregate data in a form flexible and adequate to the integration with the management policy that the user adopts. This monitoring system is supplied by a configurable and highly scalable version of publisher-subscriber pattern. It gives user the capability of selecting data to be obtained from the provider with the desired grain of triggering and is suitable to support distributed and cooperative management systems.” Using the principle of separating the management channel from the data channel proposed by the DIME network architecture, the paper shows how end-to-end service management can be decoupled from the underlying resource management systems while retaining visibility and control of the distributed applications through the local operating systems on which they execute. This is a first major step in reducing the complexity discussed earlier and paves the way for creating virtual private service transaction networks across multiple service provider infrastructures.
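To make the separation of the management channel from the data channel more concrete, the following is a minimal sketch (in Python) of a configurable publish-subscribe monitor of the kind the paper describes; all class names, topic names and the threshold-based triggering policy are illustrative assumptions, not the authors’ implementation.

from collections import defaultdict
from typing import Callable, Dict, List

# Minimal in-memory publish-subscribe broker. Topics prefixed "data/" carry
# monitoring samples; topics prefixed "mgmt/" carry control messages, keeping
# the management channel separate from the data channel.
class Broker:
    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[str, dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str, dict], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subs[topic]:
            handler(topic, message)

# A monitored resource publishes samples only when they cross a configurable
# trigger threshold (the "grain of triggering"); the threshold is changed
# through the management channel, not the data channel.
class ResourceMonitor:
    def __init__(self, broker: Broker, resource: str, threshold: float) -> None:
        self.broker, self.resource, self.threshold = broker, resource, threshold
        broker.subscribe(f"mgmt/{resource}", self._reconfigure)

    def _reconfigure(self, _topic: str, message: dict) -> None:
        self.threshold = message["threshold"]          # control-plane update

    def sample(self, value: float) -> None:
        if value >= self.threshold:                    # publish only triggering events
            self.broker.publish(f"data/{self.resource}",
                                {"resource": self.resource, "value": value})

if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("data/cpu", lambda t, m: print("alert:", m))
    cpu = ResourceMonitor(broker, "cpu", threshold=0.9)
    cpu.sample(0.5)                                    # below threshold: nothing published
    cpu.sample(0.95)                                   # above threshold: subscriber notified
    broker.publish("mgmt/cpu", {"threshold": 0.4})     # management message retunes the grain
    cpu.sample(0.5)                                    # now triggers under the new policy

Subscribers choose which metrics they receive and at what granularity, while reconfiguration commands travel on the separate management topics, mirroring the decoupling discussed above.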

The second paper, by Mohamed, Belaid and Tata, proposes a “framework that generates self-managed and scalable micro-containers. These micro-containers are enhanced with the resiliency of cellular organisms assuring the fault, configuration, accounting, performance and security constraints (FCAPS) described for each service. The proposed intelligent managed micro-container (IMMC) has a self-monitoring service that allows it to take decisions to enhance its scalability based on migration and replication transactions. These transactions are performed using a Mobility service offered by our IMMC.” This paper is an application of the DIME network architecture discussed in previous WETICE conferences, and the authors report that the results are encouraging. Each micro-container acts like a Turing machine executing a service component. When it is encapsulated in a DIME using a service regulator that describes the context, constraints, communications and control of the Turing machine’s behavior, the FCAPS management and the signaling communication overlay with other elements provide policy-based self-management of a composite service transaction. The paper not only shows how a resilient service-oriented architecture can be implemented with FCAPS-managed micro-containers, but also paves the way for designing autonomic distributed service transaction delivery and assurance systems.
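As a rough, hedged illustration of the micro-container idea, the sketch below pairs a worker that executes the service payload (the “Turing machine”) with simple FCAPS-style bookkeeping and a self-monitoring replication decision; the class names, policy thresholds and replication mechanics are invented for illustration and do not reproduce the IMMC or its Mobility service.

import queue
from typing import Callable, Dict, List

# Hypothetical managed element: the service callable does the computing, while
# the surrounding object records faults and accounting and applies a simple
# performance policy. Names are illustrative only, not the IMMC API.
class ManagedMicroContainer:
    def __init__(self, name: str, service: Callable[[str], str], max_queue: int = 4) -> None:
        self.name = name
        self.service = service
        self.requests: "queue.Queue[str]" = queue.Queue()
        self.max_queue = max_queue                       # performance policy: backlog bound
        self.faults = 0                                  # fault management counter
        self.accounting: Dict[str, int] = {"served": 0}  # accounting counter

    def submit(self, request: str) -> None:
        self.requests.put(request)

    def run_once(self) -> None:
        request = self.requests.get()
        try:
            result = self.service(request)
            self.accounting["served"] += 1
            print(f"{self.name}: {result}")
        except Exception:
            self.faults += 1                             # record the fault, keep running

    def needs_replica(self) -> bool:
        # Self-monitoring decision: replicate when the backlog exceeds the policy bound.
        return self.requests.qsize() > self.max_queue

def replicate(container: ManagedMicroContainer, pool: List[ManagedMicroContainer]) -> None:
    # Scaling action taken by the management overlay, not by the service itself.
    pool.append(ManagedMicroContainer(container.name + "-replica", container.service))

if __name__ == "__main__":
    pool = [ManagedMicroContainer("svc-1", lambda r: r.upper())]
    for i in range(8):
        pool[0].submit(f"request-{i}")
    if pool[0].needs_replica():
        replicate(pool[0], pool)
    print("containers in pool:", [c.name for c in pool])
    pool[0].run_once()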

The third paper, “Kaqudai: a Dependable Web Infrastructure made out of Existing Components” by Tramontana, Giunta, Messina, and Pappalardo, describes the introduction of a dependability layer addressing availability, reliability and data integrity concerns. Specifically, the authors have produced a set of aspects and other suitable components providing the required non-functional services, which can be smoothly integrated into an existing web server, endowing it with the ability to redirect service requests under excessive load. Redirection may target either cloud-powered lightweight replicas of the main web server or publicly exposed browsing caches on clients. Experiments were conducted on the W3C Jigsaw web server. This approach of adding a dependability layer is similar to the DIME network architecture, and the authors may wish to examine its relationship to their work.
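The redirection behaviour can be pictured with the minimal sketch below: a front end compares its current load against a threshold and either serves a request locally or returns a redirect to a lightweight replica; the threshold, replica URLs and the round-robin choice are assumptions for illustration and are not taken from the Kaqudai implementation.

import itertools
from typing import List

# Illustrative front end that redirects requests when local load exceeds a
# threshold. The real Kaqudai layer is woven into an existing web server with
# aspects; this sketch only shows the load-based redirect decision.
class DependableFrontEnd:
    def __init__(self, replicas: List[str], max_inflight: int = 10) -> None:
        self.replicas = itertools.cycle(replicas)   # round-robin over lightweight replicas
        self.max_inflight = max_inflight
        self.inflight = 0

    def handle(self, path: str) -> str:
        if self.inflight >= self.max_inflight:
            # Under excessive load: answer with a redirect to a replica
            # (or, in the paper's scheme, to a publicly exposed client cache).
            return f"307 Temporary Redirect -> {next(self.replicas)}{path}"
        self.inflight += 1
        try:
            return f"200 OK (served {path} locally)"
        finally:
            self.inflight -= 1

if __name__ == "__main__":
    front = DependableFrontEnd(["http://replica-1.example", "http://replica-2.example"],
                               max_inflight=2)
    front.inflight = 2                              # simulate saturation
    print(front.handle("/index.html"))              # redirected
    front.inflight = 0
    print(front.handle("/index.html"))              # served locally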

The fourth paper, “Cost/Performance Evaluation for Cloud Applications Using Simulation” by Rak and Villano, presents a technique to evaluate the trade-off between cost and performance of cloud applications through the use of benchmarks and simulation. The pay-per-use business model is one of the key factors in the success of the cloud computing paradigm: resources are acquired only when needed and are charged on the basis of their actual usage. Executing applications in the cloud therefore incurs costs that depend on the usage of the leased resources and on the pricing model adopted by the provider. When the performance indices and resource consumption requirements under generic workloads are known or specified by the developers, it is possible to choose the deployment on the provider’s resources that guarantees the desired performance levels while minimizing the cost of executing the application. This work paves the way for addressing cost optimization in cloud computing.
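A hedged sketch of the underlying decision follows: given per-deployment prices and predicted runtimes (obtained from benchmarks or simulation), select the cheapest deployment that still meets the performance target. The deployment names, prices and runtimes are made-up inputs, not results from the paper.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Deployment:
    name: str
    hourly_price: float     # provider price per VM-hour (illustrative)
    vms: int                # number of leased VMs
    runtime_hours: float    # predicted runtime from simulation or benchmark

    @property
    def cost(self) -> float:
        # Pay-per-use cost model: price * instances * time actually used.
        return self.hourly_price * self.vms * self.runtime_hours

def cheapest_meeting_target(options: List[Deployment], max_hours: float) -> Optional[Deployment]:
    # Keep only deployments that satisfy the performance requirement,
    # then minimize cost over the remaining candidates.
    feasible = [d for d in options if d.runtime_hours <= max_hours]
    return min(feasible, key=lambda d: d.cost, default=None)

if __name__ == "__main__":
    candidates = [
        Deployment("2 small VMs", hourly_price=0.10, vms=2, runtime_hours=6.0),
        Deployment("4 small VMs", hourly_price=0.10, vms=4, runtime_hours=3.2),
        Deployment("2 large VMs", hourly_price=0.40, vms=2, runtime_hours=2.5),
    ]
    best = cheapest_meeting_target(candidates, max_hours=4.0)
    print(best.name if best else "no deployment meets the target")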

The fifth paper, by Soares, Dantas, Bauer and Macedo, compares four high-performance distributed file systems employed to support a medical image server application in a private storage environment. Experimental results highlight the importance of choosing an appropriate distributed file system to obtain a differentiated level of performance for application-specific characteristics. The study underscores the importance of the distributed file system for both large-scale Internet services and cloud computing environments, where I/O latency and application buffer sizes play an important role in data-intensive management.

III. CONCLUSION

This year’s conference continues the tradition of WETICE and follows the objective set at its inception: “what sets WETICE apart from larger conferences is that the conference tracks are kept small enough to promote fruitful discussions on the latest technology developments, directions, problems, and requirements.” This year we selected five papers. The continued discussion of the new computing model is very timely for addressing some fundamental issues in distributed computing, showing a new path to self-managing systems that will, hopefully, reduce the complexity currently exploding in our datacenters. We believe that the DIME network architecture represents a major departure from current cloud approaches. In conventional computing, the physical server is the managed computing element, with node-level resource management performed by the operating system. In cloud computing, the virtual server is the managed computing element, again with an operating system managing the local resources. In the DIME computing model, a process in an operating system is the managed computing element. The non-von-Neumann management approach proposed to manage the Turing machine radically improves resource exploitation through parallelism and coordinated management at both the node and network level, even within cloud environments, hiding the complexity of FCAPS management from both the developers and the users of cloud services. While two papers in this conference show the promise of this approach, only time will tell its usefulness in large-scale mission-critical applications.

We hope that some of the ideas discussed here will be carried further and that new areas of research will blossom, to be presented at the next WETICE track on the convergence of distributed clouds, grids and their management. We conclude with a quote from John von Neumann [5]: “It is very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not last for a millisecond. Such a system is so integrated that it can operate across errors. An error in it does not in general indicate a degenerate tendency. The system is sufficiently flexible and well organized that as soon as an error shows up in any part of it, the system automatically senses whether this error matters or not. If it doesn't matter, the system continues to operate without paying any attention to it. If the error seems to the system to be important, the system blocks that region out, by-passes it, and proceeds along other channels. The system then analyzes the region separately at leisure and corrects what goes on there, and if correction is impossible the system just blocks the region off and by-passes it forever. The duration of operability of the automaton is determined by the time it takes until so many incurable errors have occurred, so many alterations and permanent by-passes have been made, that finally the operability is really impaired. This is completely different philosophy from the philosophy which proclaims that the end of the world is at hand as soon as the first error has occurred.”

ACKNOWLEDGMENT

The chairpersons thank the many reviewers and the WETICE organizing committee for their support.

REFERENCES

[1] Symantec, State of Data Center Survey, Global Results, http://www.symantec.com/, 2012.

[2] Copeland, B. J. (ed.) and Turing, A. M., The Essential Turing, Oxford University Press, Oxford, 2004; also Feferman, S., “Turing’s Thesis,” Notices of the AMS, 53(10), 2006.

[3] Mikkilineni, R., Comparini, A., and Morana, G., “The Turing O-Machine and the DIME Network Architecture: Injecting the Architectural Resiliency into Distributed Computing,” in Turing-100. The Alan Turing Centenary, A. Voronkov (ed.), EasyChair Proceedings in Computing, vol. 10, 2012.

[4] Mikkilineni, R., Designing a New Class of Distributed Systems, New York: Springer, ISBN: 1461419239, 2011.

[5] von Neumann, J., “General and Logical Theory of Automata,” edited and compiled by William Aspray and Arthur Burks, MIT Press, p. 408, 1987.
