Parallax A New Operating System for Scalable, Distributed… · · 2013-02-263. Storage/File Management - Parallax offers local storage per server (Shared with each DIME within

Parallax – A New Operating System for Scalable, Distributed, and Parallel Computing

Dr. Rao Mikkilineni

Kawa Objects Inc., USA

and

Ian Seyler

Return Infinity, Canada

[email protected], [email protected]

Abstract— Parallax, a new operating system, implements

scalable, distributed, and parallel computing to take advantage of

the new generation of 64-bit multi-core processors. Parallax uses

the Distributed Intelligent Managed Element (DIME) network

architecture, which incorporates a signaling network overlay and

allows parallelism in resource configuration, monitoring, analysis

and reconfiguration on-the-fly based on workload variations,

business priorities and latency constraints of the distributed

software components. A workflow is implemented as a set of

tasks, arranged or organized in a directed acyclic graph (DAG)

and executed by a managed network of DIMEs. These tasks,

depending on user requirements are programmed and executed

as loadable modules in each DIME.

Parallax is implemented using the assembler language at the

lowest level for efficiency and provides a C/C++ programming

API for higher level programming.

Keywords - Cloud Computing; Distributed Computing; Parallel

Computing; Distributed Services Management; Operating

System; Parallax; FCAPS (Fault, Configuration, Accounting,

Performance and Security) Management.

I. INTRODUCTION

Analyzing the evolution of current application architectures, we observe two key emerging trends [1] that point to the importance of the continual measurement of productivity while managing resources:

1. Application execution and application management

are becoming linked so that each application can be

matched in real-time with appropriate computing,

network, and storage resources to meet its business

priority, availability, performance, security, and cost

constraints.

2. Each application‟s utilization of CPU, memory,

bandwidth, storage capacity, throughput, and I/O rates

is being monitored in real-time to adjust the resources

available to it so that the resources may be adjusted to

meet changing workloads, business priorities, and

latency constraints.

Today, because of the silo nature of the evolution of computing as well as network and storage resource management, providing visibility and control of resources for each application is cumbersome, labor-intensive, and involves layers of management systems. Figure 1 shows current state-of-the-art virtual server environments.

Figure 1: Current Business Workflow Orchestration systems resulting from competing open systems approach

Current cloud computing advances attempt to compensate for the deficiencies of these silo management systems at the application level by creating resource utilization monitoring and management layers, thus increasing the level of complexity [1 and 2]. A closer analysis of both the Azure cloud from Microsoft and the Amazon cloud platform (along with its management support provided by companies such as RightScale and Sclr), makes it clear that the applications are characterized by a set of parameters to define the workloads and business priorities; the resulting application profile is used to monitor and adjust the resources as the workloads and priorities vary.

However, this approach at the application level does not eliminate or reduce the lower level management complexity that contributes today to about 60% to 70% of the Data center Total Cost of Ownership in a five-year time period, with or without virtualization [3]. Domain-specific network and

storage appliances with dedicated chips and management systems add to this complexity. Multiple hypervisor based virtualization layers are introducing yet another layer of orchestration.

On the other hand the new class of multi-core architectures, hardware-assisted virtualization, and software strategies that exploit parallelism as well as web-based service delivery, are transforming current server-centric services such as creation, delivery, and assurance to network-centric distributed architectures. The new generation of computing elements:

1. Is scalable from mobile hand-held devices to large-

scale servers with energy- and space-conscious

design;

2. Supports hardware-assisted virtualization that allows

the decoupling of software from the hardware on

which it resides, from mobile hand-held devices to

large-scale servers, and;

3. Provides manageability at the chip level.

The objective of operating systems and programming languages is to reduce the semantic gap between business workflow definitions and their executions in a von Neumann computing machine. The important consequences of the upheaval in hardware with multi-CPU and multi-core architectures on a monolithic OS that shares data structures across cores are well articulated by Baumann et al [4]. They also introduce the need for making the OS structure hardware-neutral. "The irony is that hardware is now changing faster than software, and the effort required to evolve such operating systems to perform well on new hardware is becoming prohibitive." They argue that single computers increasingly resemble networked systems, and should be programmed as such.

We agree with their argument. As the number of cores increase to hundreds and thousands in the next decade, current generation operating systems cease to scale and full-scale networking architecture has to be brought inside the server. In this paper, we present an implementation of a novel network-centric distributed computing model which enables the execution of distributed and managed workflows within a server or across multiple servers. Exploiting this computing model, we design a new operating system called Parallax which will leverage chip-level hardware assistance provided to virtualize, manage, secure, and optimize computing at the core. The new OS also exploits fully the parallelism and multithread execution capabilities offered in these computing elements.

What we propose is a novel approach to providing a mechanism that incorporates management workflow on par with computing workflow, taking advantage of the parallelism that is available in multi-core, multi-CPU computing through the use of addressable threads. We implement a new object computing model called Distributed Intelligent Managed Elements (DIMEs) that are programmed with self-management capabilities and allow workflow execution as a network of

DIMEs using out-of-band signaling [3]. Figure 2 describes the current computing model along with the DIME computing model.

Figure 2: Current Computing Model and the new DIME Computing Model

In their paper on “The Case for a Factored Operating System (FOS)” [5], Wentzlaff and Agarwal assert that “the fundamental design of operating systems and operating system data structures must be rethought.” They provide an excellent review of various attempts and propose their own solution. Klues et al, [6] propose ROS for many-core with direct support for parallel applications and a scalable kernel. Their design is based on the notion of a „many-core‟ process abstraction and the decoupling of protection domains from resource partitioning. Helios [7] is an operating system that provides a seamless, single image operating system abstraction across heterogeneous programmable devices.

The DIME computing model is different from these other approaches in that it introduces FCAPS management and signaling capabilities to each computing element. The DIME network computing model provides:

1. Overlay of business workflow and its management workflow

2. Service component encapsulation as an executable,

3. A service profile defining FCAPS characteristics and management associated with the encapsulated service,

4. FCAPS executable modules for the service,

5. Service composition using network and sub-network model with distributed service nodes,

6. End to end service network/sub-network management using the signaling network.

The out of band signaling allows dynamic configuration of each computing element and the composition of multiple elements in a network. Figure 3 shows a simple managed DIME network that is used to implement a “hello world” application.

Figure 3: A managed DIME network implementing a workflow. Service S1 is a hello world application controlled and managed by its associated control workflow

In this paper we present the implementation of the DIME computing model using a new operating system called Parallax to demonstrate its feasibility using Intel 64-bit multicore servers.

In Section 2, we present the Parallax implementation and in Section 3 we present our observations and conclusions.

II. PARALLAX IMPLEMENTATION:

Parallax is implemented to execute on 64-bit multi-core Intel processors. The DIMES shown in figure 5 are implemented in two servers with dual-core Intel processors. The implementation decouples hardware infrastructure from services management. In our implementation, to demonstrate the features of the OS that illustrate service scaling, distribution and parallel implementation, the service management console is used as a command console to control the workflow of services deployed using Parallax with Command Language Interface (CLI). Services S1, S2, S3 and S4 are compiled and dispatched to the server under consideration and the Parallax OS provides the routing and switching functions to execute the services at the right destination. In a real implementation of a business workflow, the sub-network management will also be implemented using a DIME sub-network.

Parallax provides memory management, CPU resource management, storage/file management, network routing and switching functions, and signaling management.

1. Memory Management - Memory is assigned to a DIME on an as-needed basis. Each DIME maps to different pages in the linear memory system and cannot access pages to which it is not assigned. Security is provided at the hardware level for this memory protection. Once a program has completed its execution, all memory that it had in use is returned to the system. Limits can be set on how much memory each DIME is able to allocate for itself.

2. CPU Resource Management - Conventional operating systems allow for multiple programs or threads to share available CPU resources. This kind of sharing leads to a more complicated operating system that needs to take care of time slice management as well as internal context switching. Parallax allows for one program per CPU core. In fact, the granularity is taken to an addressable thread level when multiple threads are supported by the core. Time slice management is not needed as each program running in a DIME gets its own dedicated CPU core or thread. Performance of the program also increases over that of running on a conventional OS, as context switching is no longer a factor. With dedicated resources, each DIME can be viewed as its own separate computing entity. If a DIME completes its task and is free, it is given back to the pool of available resources. The network management assures discovery and allocation of available DIMEs in the pool for new tasks. The signaling allows addressability at the thread level.

3. Storage/File Management - Parallax offers local storage per server (Shared with each DIME within the system) as well as centralized file storage shared via the Orchestrator between all servers. Booting the OS via the network is also a possibility for systems that do not need permanent storage or for cost saving measures.

4. Network Management - Under Parallax, all network communication is done over raw Ethernet frames. Conventional operating systems use TCP/IP as the main communication protocol. By using raw packets we have created a much simpler communication framework as well as removed the overhead of higher-level protocols, thereby increasing the maximum throughput. The use of raw Ethernet packets has already seen great success with the ATAoE protocol invented by CoRaid for use in their network storage devices.

5. Signaling Management (Inter-DIME and Intra-DIME routing/switching) - Under Parallax, each DIME is addressable as a separate entity via the signaling and data channels. With the signaling layer, program parameters can be adjusted during run-time. DIMEs have the ability to communicate with other DIMEs for co-operation as well as with the Orchestrator for the purpose of compiling results. Instruction types can be directly encoded into the 16-bit Ether-Type field of each Ethernet frame shown in figure 5.

By making use of the EtherType field for our own purposes we can streamline the way in which packets are routed within a system. Packets can be tagged as Signaling/Data packets as well as whether or not they are destined for the overall system or rather just a specific DIME.

A simple workflow is implemented using a DIME network. The workflow is executed as set of tasks assigned to each node as part of a computational network, and is orchestrated using the signaling network. The implementation exploits the new 64 bit multi-core processors with hardware-assisted virtualization. Multiple threads and multi-core features are managed by the Parallax OS.

The DIME Orchestrator (Figure 4 shows the screen display of commands executed in demonstrating the workflow) is used to manage the DIME computing network using the signaling channel, and to control the execution of computational workflow implemented utilizing the individual DIMEs. Scripting of the orchestrator is also possible.

Figure 4: Screen display of DIME Orchestrator and execution commands (Videos of the presentation and the demonstration of the proof-of-concept is available at www.youtube.com/kawaobjects )

The Parallax OS implements the signaling and computational networks using the parallel thread and multi-core features available in new Intel laptops and servers. The Parallax OS has two components:

1. Intra-DIME FCAPS management methods that allow

local management of fault, configuration, accounting,

configuration, performance, and security management

methods; and

2. Inter-DIME signaling methods that allow

communication, coordination, and orchestration using

the signaling abstractions (addressing, alerting,

mediation, and supervision) that define the DIME

network computing model.

The Parallax OS components are shown in Figure 5:

Figure 5: Parallax Components include inter-DIME messaging and Intra-DIME routing/switching mechanisms

A simple Proof of Concept implementation includes:

1. Fault management that provides a heartbeat of each

DIME at a specified time interval, over the signaling

channel, for implementing fault monitoring and

recovery processes;

2. Configuration management configures and controls

the computing element to load, execute, suspend, stop,

and recover tasks along with memory, storage, and

network resource management for the DIME;

3. Account management keeps track of who requested

the use of the computing resources as well as the

duration;

4. Performance management monitors the local DIME

CPU, memory, bandwidth, storage capacity,

throughput, and IOPs at set intervals and makes it

available over the signaling channel; and

5. The Security task specifies how the computing

resources are accessed as well as assures that FCAPS

tasks are executed only through authorized messages.

http://www.youtube.com/kawaobjects

A simple computational workflow scenario is as follows:

1. Configure DIME 1 on server 1 to read continuously

from the keyboard and route input to be displayed on

the Console in server 1.

2. Configure DIME 1 on server 2 to read continuously

from the keyboard and route input to be displayed on

the Console in server 2.

3. Dynamically change the configuration of DIME 1 in

server 1 using the signaling channel to change the

routing of input from DIME 1 to server 2 DIME 1.

4. Dynamically change the configuration of DIME 1 in

server 2 using the signaling channel to change the

routing of input from DIME 1 to server 1 DIME 2.

A. FCAPS Management:

A major feature of the DIME computing model is its ability to implement a dynamically managed workflow using a network of DIMEs. This involves instantiating a network of DIMEs, loading each DIME with appropriate computing tasks and assuring an end-to-end connection management to assure the availability, reliability, performance, security and accounting management.

1) Configuration of the DIME:

Figure 6 shows the DIME Configuration.

Figure 6: The anatomy of a DIME

The node OS part of the Parallax OS implements local DIME management including configuration of the MICE, the computing element that implements the computing workflow component assigned to it. The network OS part of the DIME manages Inter DIME communication, signaling and I/O switching and routing functions. Different FCAPS functions define local policies such as broadcasting heartbeat, alarms, local security policy, performance parameter broadcast and utilization account details (call record) etc., are implemented through DIME policy record which provides a meta-model for the DIME.

2) Workflow Configuration:

Figure 7 shows the configuration of a workflow that connects services from different DIMES.

Figure 7: Service Creation and Deployment Scenario

The components with their meta-models are stored in a service creation environment server. The workflow creation console uses the components and the meta-models to create a deployment workflow that loads and executes various services as a network of DIMES across multiple physical servers if necessary based on the business priorities, latency constraints and workload requirements. In figure 7, services S1 and S2 are deployed in two different servers and the services are executed independently (a simple display of hello world typed is displayed on the console associated with the DIME providing the service.)

3) Fault Management:

Each DIME is programmed to broadcast a heartbeat as its fault management function. A service management application can monitor the heartbeat and if it stops can initiate a service on a different DIME. Figure 8 shows the fault management workflow.

Figure 8: Fault Management Workflow

When the heartbeat stops from a DIME, the Service mediator instantiates the corresponding service in a different DIME based on policy specified. Figure 9 shows the result.

Figure 9: Recovery Scenario

4) Performance Management:

Each DIME is programmed to broadcast CPU % utilization, memory % allocation, I/O bandwidth % utilization, throughput % utilization, storage capacity % utilization as its Performance management function. A service management application can monitor the Performance parameters and can initiate a service on a different DIME (replication) or shut off service based on business priorities, workload variations, latency constraints and policy requirements. Figures 10 and 11 show the performance management workflows with service replication.

Figure 10: Performance Management Workflow.

Accounting Management keeps a call record of requests for MICE services. Security management allows various security schemes based on workflow requirements.

The meta-model based FCAPS management allows dynamic workflow management across the entire DIME network based on end-to-end workflow requirements. Thus the new computing model and the Parallax OS implement node and connection management to deploy managed workflows as executable directed acyclic graphs.

Figure 11: Service Replication to improve performance

III. CONCLUSIONS:

In this paper, we have presented a new approach to distributed computing with self-management of nodes and signaling-enabled dynamic network monitoring and management of a set of such nodes to execute managed computational flows. The DIME computing model does not replace any of the computational models such as the grid, agent, or autonomous computing that utilize the von Neumann SPC computing model. In fact, it is closer to the self-replicating model von Neumann was seeking to duplicate the characteristics of fault tolerance, self-healing and other such attributes observed in living organisms. "There is a good deal in formal logic which indicates that when an automaton is not very complicated the description of the function of the automaton is simpler than the description of the automaton itself but that situation is reversed with respect to complicated automata. It is a theorem of Gödel that the description of an object is one class type higher than the object and is therefore asymptotically infinitely longer to describe." This statement shows that the limitations of the SPC computing architecture were clearly on his mind

1 when he gave his lecture at the Hixon

symposium in 1948 in Pasadena, California [8]. "The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately." Comparing the computing machines and living organisms, he points out that the computing machines are not as fault tolerant as the living organisms. He goes on to say "It's very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond."

1 "You can perform within the logical type that's involved

everything that's feasible, but the question of whether

something is feasible in a type belongs to a higher logical type"

--- von Neumann, (1948)

The DIME network computing model defines a method to implement a set of distributed computing tasks arranged or organized in a directed acyclic graph to be executed by a managed network of distributed computing elements. Each distributed computing element (the DIME) is endowed with its own computing resources to execute a task (CPU, memory, network bandwidth, storage capacity, IOPs and throughput) and a management infrastructure to monitor and control fault, configuration, accounting, performance and security of these resources. A signaling network overlay over the computing network allows parallelism in resource monitoring, analysis and reconfiguration based on workload variations, business priorities and latency constraints of the distributed software components at the network or sub-network level. Self-management capabilities at the DIME node level, signaling with network awareness and management capabilities at the DIME network level allow the ability to create autonomic distributed systems very similar to how cellular signaling and DNA encoding allows self-managing living organisms. By encapsulating each core as a DIME, we have implemented Parallax in the current implementation. By implementing DIME network and its management using an orchestrator, we have demonstrated both fault and performance management automation. However, the model also lends itself to be implemented in current generation servers using features available in current operating systems such as Linux using the thread library to implement the DIME self-management of FCAPS and inter DIME signaling capabilities

2.

This paper is a first exercise in demonstrating a proof of concept of the DIME network architecture (DNA). It has to evolve by incorporating many local and global switching and routing abstractions (including TCP/IP) that have been well-tested and proven to be successful in POTS

3 (Plain Old

Telephone Service), PANS (Pretty Amazing New Services based on Internet) and SANs (Storage Area Network) in the past [9, 10, 11 and 12]. We envision, the evolution of service component creation, service workflow orchestration, and service delivery and assurance platforms to deliver executable

2 Giovanni Morana, and Rao Mikkilineni, " Scaling and

Self-repair of Linux Based Services Using a Novel Distributed

Computing Model Exploiting Parallelism", Accepted by IEEE

WETICE 2011, Convergence of distributed clouds, grids and

their management track, Paris June 2011. 3

The simplest way to appreciate the DIME network

computing model is to compare it with how telecommunication

network and its management are implemented. As soon as the

telephone number is dialed, the signaling network assures right

resources are connected end to end based on source and

destination profiles and service management infrastructure

assures the connection is billed, maintained with right quality

of service, and secured. The network components are designed

to support signaling and connection management. Each DIME

is a network element with profile based control of its

computing element. On instantiation, the service control

profile is used to load, execute and manage the service that

DIME is expected to provide.

managed workflows. The self-management and signaling capabilities of service components allow their design to be modular and plug and play at run time just as we design network components in telecommunications.

If this architecture is successful, it is easy to see how the current virtual server and associated management complexity will be reduced while addressing end to end business workflow FCAPS management. Service management can be decoupled from underlying hardware architecture by utilizing the dynamic configuration management of DIME network architecture. The service component profile specifies the environment in which it will execute and the DIME management infrastructure will assure right resources are available before the service is loaded and executed.

This paper is not about a new and complete OS. This paper is about getting research community‟s feedback on some ideas borrowed from cellular biology, human networks and telecommunication networks to create a similar self-manageable architecture in IT. As mentioned earlier, the future work will evolve the service component creation, workflow orchestration, service delivery and assurance infrastructure learning from the lessons of the past [9].

We conclude this paper with this observation by George Dyson [13]. “The analog of software in the living world is not a self-reproducing organism, but a self-replicating molecule of DNA. Self-replication and self-reproduction have often been confused. Biological organisms, even single-celled organisms, do not replicate themselves; they host the replication of genetic sequences that assist in reproducing an approximate likeness of themselves. For all but the lowest organisms, there is a lengthy, recursive sequence of nested programs to unfold. An elaborate self-extracting process restores entire directories of compressed genetic programs and reconstructs increasingly complicated levels of hardware on which the operating system runs." The managed execution of a workflow as a directed acyclic graph using the DIME Network Architecture (DNA) provides at least a mechanism for a lengthy recursive sequence of nested programs to unfold in the von Neumann computing world.

REFERENCES

[1] Rajkumar Buyyaa, Chee Shin Yeoa, , Srikumar Venugopala, James Broberga, and Ivona Brandicc, “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility”, Future Generation Computer Systems, Volume 25, Issue 6, June 2009, Pages 599-616

[2] Vijay Sarathy, Purnendu Narayan, Rao Mikkilineni, "Next Generation Cloud Computing Architecture: Enabling Real-Time Dynamism for Shared Distributed Physical Infrastructure," wetice, pp.48-53, 2010 19th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises, 2010

[3] Rao Mikkilineni, and Giovanni Morana “Is the Network-centric Computing Paradigm for Multicore, the Next Big Thing?” http://computingclouds.wordpress.com

[4] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schupbach, and Akhilesh Singhania, "The Multikernel: A new OS architecture for scalable multicore systems", In Proceedings of the 22nd ACM Symposium on OS Principles, Big Sky, MT, USA, October 2009

[5] D. Wentzlaff and A. Agarwal. Factored operating systems (fos): the case for a scalable operating system for multicores. SIGOPS Oper. Syst. Rev., 43(2):76–85, 2009.

[6] K. Klues et al. Processes and Resource Management in a Scalable Many-core OS. In HotPar10, Berkeley, CA, June 2010.

[7] Edmund B. Nightingale, Orion Hodson, Ross McIlroy, Chris Hawblitzel, and Galen Hunt, "Helios: Heterogeneous Multiprocessing with Satellite Kernels", ACM, SOSP‟09, October 11–14, 2009, Big Sky, Montana, USA

[8] Neumann, J. v. “The General and Logical Theory of Automata” In E. b. Taub, John von Neumann Collected Works (pp. Vol 5, p259). Chicago: University of Illinois Press (1951)

[9] Rao Mikkilineni, Vijay Sarathy ”Cloud Computing and Lessons from the Past”, Proceedings of IEEE WETICE 2009, First International Workshop on Collaboration & Cloud Computing, June 2009 (http://www.kawaobjects.com/resources/ieee-ccc-2009.pdf)

[10] P. Goyal, “The Virtual Business Services Fabric: an integrated abstraction of Services and Computing Infrastructure,” in Proceedings of WETICE 2009: 18th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises, 33-38 (2009).

[11] Pankaj Goyal, Rao Mikkilineni, Murthy Ganti “FCAPS in the Business Services Fabric Model”, Proceedings of IEEE WETICE 2009, First International Workshop on Collaboration & Cloud Computing, June 2009

[12] Pankaj Goyal, Rao Mikkilineni, Murthy Ganti, “Manageability and Operability in the Business Services Fabric”, Proceedings of IEEE WETICE 2009, First International Workshop on Collaboration & Cloud Computing, June 2009

[13] George B. Dyson, “Darwin among the Machines, the evolution of global intelligence”, Helix Books, Addition Wesley Publishing Company, Inc., Reading, MA, 1997, p123.

Documents

Parallax A New Operating System for Scalable, Distributed… · · 2013-02-263. Storage/File Management - Parallax offers local storage per server (Shared with each DIME within