
CX: A scalable, robust network for parallel computing

Peter Cappello and Dimitrios Mourloukos
Computer Science Department, University of California, Santa Barbara, CA 93106, USA
E-mail: {cappello, mourlouk}@cs.ucsb.edu

CX, a network-based computational exchange, is presented. The system's design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, compact, and cleanly separates application logic from the logic that supports interprocess communication and fault tolerance. Computations, of course, run to completion in the presence of computational hosts that join and leave the ongoing computation. Such hosts, or producers, use task caching and prefetching to overlap computation with interprocessor communication. To break a potential task server bottleneck, a network of task servers is presented. Even though task servers are envisioned as reliable, the self-organizing, scalable network of n servers, described as a sibling-connected height-balanced fat tree, tolerates a sequence of n − 1 server failures. Tasks are distributed throughout the server network via a simple "diffusion" process.

CX is intended as a test bed for research on automated silent auctions, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallel computation.

1. Introduction

The ocean contains many tons of gold. But the gold atoms are too diffuse to extract usefully. Idle cycles on the Internet, like gold atoms in the ocean, seem too diffuse to extract usefully. If we could effectively harness the vast quantities of idle cycles, we could greatly accelerate our acquisition of scientific knowledge, successfully undertake grand challenge computations, and reap the rewards in physics, chemistry, bioinformatics, and medicine, among other fields of knowledge.

Several trends, when combined, point to an opportunity:

– The number of networked computing devices is increasing
– Computation is getting faster and cheaper
– The number of unused cycles per second is growing rapidly

– Bandwidth is increasing and getting cheaper
– Communication latency is not decreasing
– Humans are getting neither faster nor cheaper.

These trends and other technological advances lead to opportunities whose surface we have barely scratched. It now is technically feasible to undertake "Internet computations" that are technically infeasible for a network of supercomputers in the same time frame. The maximum feasible problem size for "Internet computations" is growing more rapidly than that for supercomputer networks. The SETI@home project discloses an emerging global computational organism, bringing "life" to Sun Microsystems' phrase "The network is the computer". The underlying concept holds the promise of a huge computational capacity, in which users pay only for the computational capacity actually used, increasing the utilization of existing computers.

1.1. Project goals

In the CX project, we are designing an open, extensible Computation eXchange that can be instantiated privately, within a single organization (e.g., a university, distributed set of researchers, or corporation), or publicly as part of a market in computation, including charitable computations (e.g., AIDS or cancer research, SETI). Application-specific computation services constitute one kind of extension, in which computational consumers directly contact specialized computational producers, which provide computational support for particular applications.

The system must enable application programmers to design, implement, and deploy large computations, using computers on the Internet. It must reduce human administrative costs, such as costs associated with:

– downloading and executing a program on heterogeneous sets of machines and operating systems

Scientific Programming 10 (2002) 159–171
ISSN 1058-9244 / $8.00 © 2002, IOS Press. All rights reserved

160 P. Cappello and D. Mourloukos / CX: A scalable, robust network for parallel computing

– distributing software component upgrades.

It should reduce application design costs by:

– giving the application programmer a simple but general programming abstraction

– freeing the application programmer from concerns of interprocessor communication and fault tolerance.

System performance must scale both up and down, despite communication latency, to a set of computation producers whose size varies widely, even within the execution of a single computation. It must serve several consumers concurrently, associating different consumers with different priorities. It should support computations of widely varying lifetimes, from a few minutes to several months. Producers must be secure from the code they execute. Discriminating among consumers is supported, both for security and privacy, and for prioritizing the allocation of resources, such as compute producers.

After initial installation of system software, no human intervention is required to upgrade those components. The computational model must enable general task decomposition and composition. The API must be simple but general. Communication and fault tolerance must be transparent to the user. Producers' interests must be aligned with their consumers' interests: computations are completed according to how highly they are valued.

1.2. Some fundamental issues

It is a challenge to achieve the goals of this system with respect to performance, interoperability [1], correctness, ease of use, incentive to participate, security, and privacy. Although this paper does not focus on security and privacy, the Java security model [17] and the network security features of Jini's "Davis" release [27] (covering authentication, confidentiality, and integrity) clearly are intended to support such concerns. Our choice of the Java programming system and Jini reflects these benefits implicitly.

In this paper, we present the Production Network service subsystem of CX, focusing on its design with respect to application programming complexity, administrative complexity, and performance. Application programming complexity is managed by presenting the programmer with a simple, compact, general API, briefly presented in the next section. Administrative complexity is managed by using the Java programming system: its virtual machine provides a homogeneous platform on top of otherwise heterogeneous sets of machines and operating systems. The Production Network is a service that interfaces with every other CX client and service. In this paper, however, we focus on the Task Server, the Producer, and the Consumer.

Performance issues can be decomposed into several sub-issues.

Heterogeneity of machines/OS: The goal is to overcome the administrative complexity associated with multiple hardware platforms and operating systems, incurring an acceptable loss of execution performance. The tradeoff is between the efficiency of native machine code vs. the universality of virtual machine code. For the applications targeted (not, e.g., real-time applications), the benefits of Java JITs reduce the benefits of native machine code: Java wins by reducing application programming complexity and administrative complexity, whose costs are not declining as fast as execution times.

Communication latency: There is little reason to believe that technological advances will significantly decrease communication latency. Hiding latency, to the extent that it is possible, thus is central to our design.

Scalability: The architecture must scale to a higher degree than existing multiprocessor architectures, such as workstation clusters. Login privileges must not be required for the consumer to use a machine; such an administrative requirement limits scalability.

Robustness: An architecture that scales to thousands of computational producers must tolerate faults, particularly when participating machines, in addition to failing, can disengage from an ongoing computation.

1.2.1. Ease of use
The computation consumer distributes code/data to a heterogeneous set of machines/OSs. This motivates using a virtual machine, in particular, the JVM. Computational producers must download/install/upgrade system software (not just application code). Use of a screensaver/daemon obviates the need for human administration beyond the one-time installation of producer software. The screensaver/daemon is a wrapper for a client that downloads a "task server" service proxy every time it starts, automatically distributing system software upgrades.

1.3. Paper organization

In the next section, we discuss related work, particularly noting those ideas of others that we have incorporated into CX. In Section 3, we introduce the API. In Section 4, we describe CX's architecture. In Section 5, we present results from preliminary experiments. The Conclusion summarizes our contributions and some directions for future work.

2. Related work

Legion [18] and Condor [13] were early successes in network computing. They predate Java, hence are not Java-centric, and indeed do not use a virtual machine to overcome the portability/interoperability problem associated with heterogeneous machines and OSs. The use of a virtual machine is a significant difference between Java-centric and previous systems. Java-centric systems, among other differences, do not require computational consumers to have login privileges on host machines. Indeed, administration even of clusters is a challenge [20]. Charlotte [5] was the first research project, to our knowledge, that was Java-centric. Charlotte used eager scheduling, introduced by the Charlotte team, and implemented a full distributed shared memory. Cilk-NOW [7], based on Cilk 2, provides for "well-structured" computations (a strict subset of dag-structured computations, where dag means directed acyclic graph). It uses work-stealing and checkpointing (to a shared filesystem, such as NFS) for adaptively parallel computations (i.e., computations hosted by machines that may join/retreat from the computation dynamically [10]). Nibhanupudi et al. [24,25] present work on adaptive BSP, an efficient, programmer-friendly model of parallel computation suitable for the harvesting of idle cycles. Atlas [4], a version of Cilk-NOW intended for the Internet setting, put three important concepts into a Java-centric package: a computational model that supports well-structured computations, work-stealing, and the Internet. It was an attempt to create a Java-centric parallel processor with machines on the Internet. As a master's project, it terminated abruptly, and, in our opinion, without reaching its full potential. CX shares these three properties. However, we have discovered that the dag-structured task model, eager scheduling (instead of Atlas's checkpointing), work-stealing, and space-based coordination integrate so elegantly as to be "made for each other" when an adaptively parallel computation is deployed on a network. Globus [15] is a metacomputing, or umbrella, project. It consequently is not Java-centric, and indeed must be language-neutral. CX is intended to fit under Globus's umbrella via a portal [30]. Javelin [11,22,23] is Java-centric, implements work stealing and eager scheduling, and has a host/broker/client architecture. Javelin's implementation of eager scheduling is centralized on the client process. Manta [29] elegantly shows that wide-area networks can efficiently support large, coarse-grain parallel computation. Manta, however, does not provide for adaptive parallelism (the situation where processors join and retreat during the computation). Systems that make use of idle processors must be adaptive (i.e., permit processors to join and retreat from a computation dynamically). Adaptivity, unfortunately, materially complicates certain parallel computations.

Recently, several systems have emerged for distributed computations on the Internet. Wendelborn et al. [32] describe an ongoing project to develop a geographical information system (PAGIS) for defining and implementing processing networks on diverse computational and data resources. Hawick et al. [19] describe an environment for service-based metacomputing (DISCWorld). Fink et al. [14] describe Amica, a metacomputing system that supports the development of coarse-grained, location-transparent applications for distributed systems on the Internet, and includes a memory subsystem. Bakker et al. [3] take the view of distributed objects as their unifying paradigm for building large-scale wide-area distributed systems. They appear to intend to do for objects what the World Wide Web did for documents. Objects can differ in their scheme, if any, for partitioning, replication, consistency, and fault tolerance, in a way that is opaque to clients.

Huberman et al. [2] relate anonymity to incentives, in their application of the "tragedy of the commons" to anonymous peer-to-peer networks.

Securing the infrastructure is not the focus of this project; commercial efforts are under way to secure Jini, for example.

Recent commercial ventures attest to the perception that unused cycles can be made available in a computationally meaningful way. Such ventures, while still in their infancy, include EnFuzion (targeted at intranets), Applied Metacomputing (the commercialization of Legion), Distributed Science (aka the ProcessTree), Entropia, Parabon Computation, Popular Power, and United Devices.

The setting for CX is the Internet (or an intranet). It comprises a set of interrelated services and clients implemented in Java. From a performance point of view, its goal is somewhat different from both the commercial ventures and the early systems such as Legion and Condor. These systems are intended primarily to increase system throughput or utilization of idle cycles. CX is intended to push the limits of parallel computing in a network setting, despite long communication latencies. Its architecture incorporates ideas from a variety of sources, integrating them in a unique way. Briefly, it uses thread programming model ideas from Cilk [6]; scheduling ideas from Enterprise [21], Spawn [31], and Cilk; classic decoupled communication ideas from Linda [10] (and JavaSpaces [16], its Java incarnation); eager scheduling ideas for fault tolerance from Charlotte; and the host/broker/client architectural ideas from Javelin. To match supply with demand [8,9] in time and space, the system incorporates the concept of auctions [12] via a market maker.

This article outlines the rationale for these choices, as they pertain to the design of CX's Production Network subsystem.

3. API

Computational model

The computational model reflects the dominating physical constraint on networked computation among compute producers whose availability may be short-lived: long communication latency relative to execution speed. Computation is modeled with a dag of non-blocking tasks, analogous to Cilk threads. Such a dag is illustrated in Fig. 1. Producer cycles are too precious and volatile to waste in a blocked state.
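The dag of non-blocking tasks can be sketched concretely. The following toy, single-JVM sketch is our own illustration, not CX code: a Fib task either sends a base value to its successor or spawns two subtasks plus a Sum continuation, and no task ever blocks waiting for a child to finish.

```java
import java.util.*;

// Minimal sketch (ours, not CX code) of the non-blocking task dag:
// Fib(n) either sends a base value or spawns Fib(n-1), Fib(n-2) and a
// Sum task; the Sum becomes ready only after both arguments arrive.
public class DagDemo {
    static Integer result;                     // the sink task's "argument"

    abstract static class Task {
        final Object[] inputs;
        int missing;                           // arguments not yet received
        Task successor;                        // where our result goes
        int argNo;
        Task(int arity) { inputs = new Object[arity]; missing = arity; }
        abstract void execute(Deque<Task> ready);
        void send(Object value, Deque<Task> ready) {
            if (successor == null) { result = (Integer) value; return; }
            successor.inputs[argNo] = value;
            if (--successor.missing == 0) ready.add(successor);  // now ready
        }
    }

    static class Sum extends Task {
        Sum() { super(2); }
        void execute(Deque<Task> ready) {
            send((Integer) inputs[0] + (Integer) inputs[1], ready);
        }
    }

    static class Fib extends Task {
        final int n;
        Fib(int n) { super(0); this.n = n; }
        void execute(Deque<Task> ready) {
            if (n < 2) { send(n, ready); return; }       // base case
            Sum sum = new Sum();                         // continuation task
            sum.successor = successor; sum.argNo = argNo;
            Fib a = new Fib(n - 1); a.successor = sum; a.argNo = 0;
            Fib b = new Fib(n - 2); b.successor = sum; b.argNo = 1;
            ready.add(a); ready.add(b);                  // spawn, don't block
        }
    }

    public static int fib(int n) {
        result = null;
        Deque<Task> ready = new ArrayDeque<>();
        ready.add(new Fib(n));
        while (!ready.isEmpty()) ready.poll().execute(ready);
        return result;
    }
}
```

Running `fib(4)` exercises exactly the dag of Fig. 1.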

Programming model

In the programming model, the "task server" is the single abstraction through which applications communicate with the system. To minimize communication, the application programmer chooses where [de]composition occurs: the consumer, the producer, even the task server, or some combination thereof. For communication efficiency, an application can batch the communication of tasks and computed arguments.

The programmer's view is that of a single task server, despite its implementation as a network of servers. The consumer stores a computational task into "the" task server, and receives a callback (processResult(Object o)) when the result becomes available. Producers repeatedly take tasks from "the" task server and compute them. See Fig. 2. Such computation results in either the creation of new subtasks and/or arguments that are sent to successor tasks.

The application programming methods for communicating with "the" task server include:

storeTask(Task t): store a task on the task server
storeResult(Task t, int argNo, Object value): store an argument of a successor task on the task server (pseudocode: t.inputs[argNo] = value)

The method processResult(Object result) is invoked when a result is available. In the JavaSpaces specification, clients cannot compute within the space. This is to prevent a client from grabbing the space's computational capacity, which would reduce its responsiveness to other clients. In CX, a production network (i.e., a particular set of task servers and their associated producers) executes one computation at a time. Consequently, the application can execute tasks on a task server (by setting the Task's boolean executeOnServer member to true). (This is in the spirit of the original tuple space design of the Linda system.) Computed arguments are stored on the server, using storeResult. Tasks are ready for execution only after receiving all their arguments, if any. For communication efficiency, the above methods have a variant where a set of tasks/arguments is stored.
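The API semantics above can be sketched in a hedged, in-memory form. The names storeTask, storeResult, and processResult come from the paper; the TaskServer class below and its internals are our own single-JVM illustration of the stated semantics, not the real (distributed, replicated) implementation.

```java
import java.util.*;
import java.util.function.Consumer;

// Toy stand-in for "the" task server: tasks become ready only once all
// their arguments have arrived; producers poll ready tasks.
public class ApiSketch {
    public static class Task {
        public final Object[] inputs;
        public boolean executeOnServer = false;   // opt-in server-side execution
        public Task(int arity) { inputs = new Object[arity]; }
        public boolean isReady() {
            for (Object in : inputs) if (in == null) return false;
            return true;
        }
    }

    public static class TaskServer {
        private final Deque<Task> ready = new ArrayDeque<>();
        private final Set<Task> unready = new LinkedHashSet<>();
        private Consumer<Object> processResult = r -> {};   // consumer callback

        public void storeTask(Task t) {           // deposit a task
            if (t.isReady()) ready.add(t); else unready.add(t);
        }
        public void storeResult(Task t, int argNo, Object value) {
            t.inputs[argNo] = value;              // pseudocode: t.inputs[argNo] = value
            if (unready.contains(t) && t.isReady()) {
                unready.remove(t);
                ready.add(t);                     // all arguments in: now ready
            }
        }
        // Batched variant, for communication efficiency.
        public void storeTasks(Collection<Task> ts) { ts.forEach(this::storeTask); }

        public Task takeTask() { return ready.poll(); }     // producers poll tasks
        public void onResult(Consumer<Object> cb) { processResult = cb; }
        public void deliverResult(Object r) { processResult.accept(r); }
    }
}
```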

4. Architecture

First, we note some performance constraints. The scheduling mechanisms must be general, subject to the constraint that scheduling operations are of low time complexity: O(1) in the number of tasks and producers. The system must be scalable, high-performance, and tolerate any single component failure. Failure of compute producers must be transparent to the progress of the computation. Recovering from a failed server must require no human intervention and complete in a few seconds. After a server failure, restoring the system's ability to tolerate another server failure requires no human intervention, and completes in less than one minute.

The basic entities relevant to the focus of this paper are:

Consumer (C): a process seeking computing resources.

Fig. 1. Task dag for computing the 4th Fibonacci number.

Fig. 2. Process communication abstraction. Illustrates the first few tasks of the Fib(5) computation.

Producer (P): a process offering or hosting computing resources. It is wrapped in a screen saver or Unix daemon, depending on its operating system.

Task Server (S): a process that coordinates task distribution among a set of producers. Servers decouple communication: consumers and producers do not need to know each other or be active at the same time.

Producer Network (N): a robust network of task servers and their associated producers, which negotiates as a single entity with consumers. Networks solve the dynamic discovery problem between active consumers and available producers.

Technological trends imply that network computation must decompose into tasks of sufficient computational complexity to hide communication latency: CX thus is not suitable for computations with short-latency feedback loops. Also, we must avoid human operations (e.g., a system requiring a human to restart a crashed server). They are too slow, too expensive, and unreliable.

Why use Java? Since computation time is becoming less expensive and human labor is becoming more expensive, it makes sense to use a virtual machine (VM). Each computational "cell" in the global computer speaks the same language. One might argue that the increased complexity associated with generating and distributing binaries for each machine type and OS is an up-front, one-time cost, whereas the increased runtime of a virtual machine is incurred for the entire computation, every time it executes. JITs tend to negate this argument. For some applications, machine- and OS-dependent binaries make sense. The cost derivatives (human vs. computation) suggest that the percentage of such applications is declining with time. Of the possible VMs, it also makes sense to leverage the industrial-strength Java VM and its just-in-time (JIT) compiler technology, which continues to improve. The increase in programmer productivity from Java technology justifies its use. Finally, many programmers like to program in Java, a feature that should be elevated to the set of fundamental considerations, given the economics of software development.

There are a few relevant design principles that we adhere to. The first principle concerns scalability: each system component consumes resources (e.g., bandwidth and memory) at a rate that must be independent of the number of system components, consumers, jobs, and tasks. Any component that violates this principle will become a bottleneck when the number of components gets sufficiently large. Secondly, tasks are pre-fetched in order to hide communication latency. This implies multi-threaded Producers and Task Servers. Finally, we batch objects to be communicated, when possible.

There also is a requirement that is needed to achieve high performance: to focus producers on job completion, producer networks must complete their consumer's job before becoming "free agents" again.

The design of the computational part of the system is briefly elaborated in two steps: 1) the isolated cluster: a task server with its associated producers, and 2) a producer network (of clusters). The producer network is used to make the design scale and be fault tolerant.

4.1. The isolated cluster

An isolated cluster (see Fig. 3) supports the task graph model of computation, and tolerates producer failure, both node and link.

A consumer starts a computation by putting the "root" task of its computation into a task server. When a producer registers with a server, it downloads the server's proxy. The main proxy method repeatedly gets a task, computes it, and, when successfully completed, removes the task from the server. Since the task is not removed from the server until completion notification is given, transactions are unnecessary: a task is reassigned until some producer successfully completes it. (The priority rules for assignment are given in the next paragraph.) When a producer computes a task, it either creates subtasks and puts them into the server, and/or computes arguments needed by successor subtasks (the "argument" computed by the sink task is the final result). Putting intermediate results into the server forms a checkpoint that occurs as a natural byproduct of the computation's decomposition into subtasks. Application logic thus is cleanly separated from fault tolerance logic. Once the consumer deposits the root task into the server, it can deactivate until it retrieves the final result. Task server fault tolerance derives from their replication, provided in the network discussed below.
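The reassign-until-complete rule can be sketched as follows. All names here are our own; the point is only that a task is not removed when assigned, but on completion notification, so work abandoned by a failed or departed producer is simply handed out again, with no transactions needed.

```java
import java.util.*;

// Sketch of eager scheduling: tasks stay on the server after assignment
// and are removed only when some producer reports completion.
public class EagerScheduling {
    private static class Entry {
        final String id;
        int timesAssigned;
        Entry(String id) { this.id = id; }
    }
    private final Map<String, Entry> live = new LinkedHashMap<>();

    public void storeTask(String id) { live.put(id, new Entry(id)); }

    // Hand out the task assigned the fewest times; do NOT remove it.
    public String assign() {
        Entry best = null;
        for (Entry e : live.values())
            if (best == null || e.timesAssigned < best.timesAssigned) best = e;
        if (best == null) return null;
        best.timesAssigned++;
        return best.id;
    }

    public void complete(String id) { live.remove(id); }  // removal on completion only
    public boolean pending(String id) { return live.containsKey(id); }
}
```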

We now discuss task caching. It increases performance by hiding communication latency between producers and their server. Each producer's server proxy has a task cache. Besides caching tasks, proxies copy forward arguments and tasks to the server, which maintains a ready task heap. The ordering of ready tasks within the heap is based on two components. The dominant component is how many times a task has been assigned: if task A has been assigned fewer times than task B, then task A is higher in the heap than task B. Within that, tasks are ordered by dag level (see [6]). This minor ordering mechanism is exposed to the application programmer: dag level is the default implementation of the Task's boolean isHigherPriority method. For example, it makes sense to give a Fibonacci task that computes a bigger Fibonacci number a higher priority than a Fibonacci task that computes a smaller number (because the task that computes the smaller number ultimately spawns fewer tasks). In this case, the application programmer can implement the Fibonacci decomposition task's isHigherPriority method accordingly. This is a simple application-level scheduling [28] mechanism.
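The two-component heap ordering can be sketched as a comparator. Class and field names below are illustrative, not CX's; only the ordering rule (assignment count dominant, application-overridable isHigherPriority as tie-breaker) comes from the text.

```java
import java.util.*;

// Sketch of the ready-task heap ordering: fewer assignments first;
// among equals, isHigherPriority (default: lower dag level) wins.
public class ReadyHeap {
    public static class Task {
        final int timesAssigned, dagLevel;
        public Task(int timesAssigned, int dagLevel) {
            this.timesAssigned = timesAssigned;
            this.dagLevel = dagLevel;
        }
        // Default minor ordering: earlier dag level first. A Fibonacci
        // application could override this to rank larger-n tasks higher.
        public boolean isHigherPriority(Task other) {
            return dagLevel < other.dagLevel;
        }
    }

    public static PriorityQueue<Task> newHeap() {
        return new PriorityQueue<>((a, b) -> {
            if (a.timesAssigned != b.timesAssigned)           // dominant component
                return Integer.compare(a.timesAssigned, b.timesAssigned);
            if (a.isHigherPriority(b)) return -1;             // minor component
            if (b.isHigherPriority(a)) return 1;
            return 0;
        });
    }
}
```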

When the number of tasks in a proxy's task cache falls below a watermark (see [16]), it pre-fetches a copy of a task (or tasks) from the server. For each task, the server maintains the names of the producers whose proxies have a copy of the task. A pre-fetch request returns the task with the lowest level (i.e., earliest in the task dag) among those that have been assigned the fewest times. After the task is complete, the proxy notifies the server, which removes the task from its task heap and from all proxy caches containing it.
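The watermark-driven refill can be sketched as below. The watermark values and the Supplier-based server stub are our own assumptions; the mechanism (drop below a low watermark, batch-refill toward a high one) is what the text describes.

```java
import java.util.*;
import java.util.function.Supplier;

// Sketch of proxy-side pre-fetching: refill the task cache in a batch
// whenever it falls below the low watermark, so producers (almost)
// never wait for work.
public class PrefetchCache {
    static final int LOW = 2, HIGH = 4;          // hypothetical watermarks
    private final Deque<String> cache = new ArrayDeque<>();
    private final Supplier<String> server;       // fetches one task copy, or null

    public PrefetchCache(Supplier<String> server) { this.server = server; }

    public String nextTask() {
        refill();
        return cache.poll();                     // usually served from the cache
    }

    private void refill() {
        if (cache.size() >= LOW) return;         // still above the low watermark
        while (cache.size() < HIGH) {            // batch-fetch up to the high one
            String t = server.get();
            if (t == null) break;                // server has no ready tasks
            cache.add(t);
        }
    }
}
```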

The task server also maintains an unready task collection (of tasks that have not yet received all their input arguments). When a task in this collection receives all its arguments, and hence becomes ready, it is inserted into the ready task heap, and becomes available for pre-fetching. The producer's task cache is organized similarly, with a ready task heap and unready task collection.

Although the task graph can be a dag, the spawn graph is a tree. In Fig. 1, the subgraph of solid edges is the spawn tree. Hence, there is a unique path from the root task to any subtask. This path is the basis of a unique task identifier. Using this identifier, the server discards duplicate tasks. Duplicate computed arguments also are discarded.
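A path-based identifier can be sketched as follows. Because the spawn graph is a tree, the sequence of child indices from the root uniquely names a task; the "/"-separated string format below is our own choice, not necessarily CX's.

```java
// Sketch of unique task identifiers derived from the spawn-tree path:
// equal paths mean equal tasks, so a set of seen ids suffices to
// discard duplicates on the server.
public class TaskId {
    public static String root() { return "root"; }

    public static String child(String parentId, int childIndex) {
        return parentId + "/" + childIndex;
    }
}
```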

Fig. 3. A task server and its associated set of Producers.

The server, in concert with its proxies, balances the task load among its producers: a task may be concurrently assigned to many producers (particularly at the end of a computation, when there are fewer tasks than producers). This reduces completion time, in the presence of aggressive task pre-fetching: producers should not be idle while other, possibly slower, producers have tasks in their cache. Via pre-fetching, when producers deplete their task cache, they steal tasks spawned by other producers. Each producer thus is kept supplied with tasks, regardless of differences in producer computation rates. Our design goal: producers experience no communication delay when they request tasks; there always is a cached copy of a task waiting for them (exception: the producer just completed the last task).

4.2. The production network of clusters

The server can service only a bounded number of producers before becoming a bottleneck. Server networks break this bottleneck. Each server (and proxy) retains the functionality of the isolated cluster. Additionally, servers balance the task load (“concentration”) among themselves via a diffusion process: As with producers, diffusion of tasks throughout the server network is based on a system of low/high water marks for efficient inter-server communication. Only ready tasks move via this diffusion process. Similarly, a task that has been downloaded from some task server to one of its producers no longer moves to other task servers. However, other producers associated with the same task server can download it. This policy facilitates task removal upon completion. Task diffusion among task servers is a “background” pre-fetch process: Producers are oblivious to it. One design goal: producers endure no communication delays from their task server beyond the basic request/receive latency: Each server has tasks for its producers, provided the server network has more ready tasks than servers.

Fig. 4. A sibling-connected fat tree.

We now impose a special topology that tolerates a sequence of server failures. Servers should have the same mean time between failures as mission-critical commercial web servers. However, even these are not available 100% of the time. We want computation to progress without re-computation in the presence of a sequence of single server failures. To tolerate a server failure, its state (tasks and shared variables) must be recoverable. This information could be recovered from a transaction log (i.e., logging transactions against the object store, for example, using a persistent implementation of JavaSpaces). It also could be recovered if it is replicated on other servers (see [24,25] for a discussion of automatic state replication and recovery in a virtual ring). The first case suffers from a long recovery time, often requiring human intervention. Since humans are getting neither faster nor cheaper, we omit human-mediated computer/network administration. The second option can be fully automatic and faster, at the cost of increased design complexity.

We enhance the design via replication of task state, by organizing the server network as a sibling-connected fat tree (see Fig. 4). We can define such a tree operationally:

– start with a height-balanced tree;


– add another “root”;
– add edges between siblings;
– add edges so that each node is adjacent to its parent’s siblings.
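The four steps above can be made concrete in code. The sketch below is one plausible reading of the construction, for illustration only: servers 1..n form a height-balanced binary tree in heap order (the parent of i is i/2), and server 0 is the added second root, treated as the sibling of server 1. The class and method names, and the heap-order encoding, are assumptions, not part of CX.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** An illustrative construction of a sibling-connected fat tree. */
public class FatTree {
    /** Build the adjacency sets for n tree nodes plus the extra root, node 0. */
    static Map<Integer, Set<Integer>> build(int n) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int v = 0; v <= n; v++) adj.put(v, new TreeSet<>());
        // step 1: start with a height-balanced tree (heap-order edges)
        for (int v = 2; v <= n; v++) connect(adj, v, v / 2);
        // step 2: add another "root", a sibling of the original root
        connect(adj, 0, 1);
        // steps 3 and 4: sibling edges, and edges to the parent's siblings
        for (int v = 2; v <= n; v++) {
            for (int s : siblings(v, n)) connect(adj, v, s);
            for (int s : siblings(v / 2, n)) connect(adj, v, s);
        }
        return adj;
    }

    /** The mirror group of v: its siblings (the two roots mirror each other). */
    static List<Integer> siblings(int v, int n) {
        if (v <= 1) return List.of(1 - v);          // roots 0 and 1
        int other = (v % 2 == 0) ? v + 1 : v - 1;   // other child of v/2
        // boundary condition: an only child uses its parent as its mirror
        return other <= n ? List.of(other) : List.of(v / 2);
    }

    private static void connect(Map<Integer, Set<Integer>> adj, int a, int b) {
        adj.get(a).add(b);
        adj.get(b).add(a);
    }
}
```

Under this reading, each node’s degree stays bounded independent of n (parent, children, sibling, parent’s sibling, and the sibling’s children), which is the port-consumption property claimed below.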

Each server has a mirror group: its siblings in the fat tree. (Since the tree does not need to be complete, there may exist a parent that has only one child. That child uses its parent as its mirror. This is a boundary condition.) Every state change to a server is mirrored: A server’s task state is updated if and only if its siblings’ task states are identically updated. When the task state update transaction fails:

– The server (or server proxy) that detects the failure notifies the primary root server (the secondary root, if the primary root is the failed server).

– Each proxy of the failed server, upon receiving RemoteExceptions, contacts a randomly selected member of the mirror group of the failed server.

– The root directs the most recently added leaf server to migrate (with its associated Producers) to the failed server’s position in the network. Its former and new mirror groups are updated to reflect this change.

Automatically reconfiguring the network after a server failure requires O(B) time, where B is the maximum degree of any server, which is O(1) in the size of the network. When a server joins a network, it becomes the new rightmost leaf in the network. Insertion thus requires O(B) time, independent of the network size.

This design scales in the sense that each server is connected to a bounded number of servers, independent of the total number of servers: Port consumption is bounded. The diameter of the network of n servers (the maximum distance between any task and any producer) is O(log n). Most importantly, the network repairs itself: the above properties hold after the failure of a server. Hence, the network can recover from a sequence of such failures.

The consumer submits the “root” task of a computation to the primary root task server. The computation begins when a producer associated with this task server executes this root task, which undoubtedly spawns other tasks. Diffusion then begins.

4.3. Code distribution via the ClassLoader

Omitted from the discussion thus far is our strategy for distributing code. If we make no special provision, task class files are downloaded from the Consumer’s codebase. This clearly is a bottleneck, given the degree of parallelism we seek from CX. To scale, the code distribution scheme must have only a bounded number of producers downloading code from any one location, independent of the total number of producers. This implies that the number of download points must increase linearly with the number of producers. There is a natural way to provide for this: Each task server becomes a download location for task class files. The CX class loader downloads task class files to the primary root task server via the consumer’s class loader. From there the classes are loaded down through the task server tree. Each producer loads the class files from its task server. This scheme achieves our primary objective: code distribution scales to an arbitrarily large number of producers without a bottleneck emerging.

5. Preliminary experiments

All experiments were run on our departmental Linux cluster. Each machine has 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards, and runs Red Hat Linux 6.0 and JDK 1.2.2RC3. These machines are all connected to a 100-port Lucent P550 Cajun Gigabit Switch.

Let us first define the sequence of Fibonacci num-bers [26] as:

F(0) = 1;
F(1) = 1;
F(n) = F(n − 1) + F(n − 2), n > 1.

We tested a CX TaskServer cluster on a doubly recursive computation of the nth Fibonacci number, F(n), augmented with a synthetic workload. Neither the value of F(n) is of interest here, nor is the algorithm used efficient, since there is a formula for F(n) that can be computed in O(1) time, given the RAM computational model. Rather, this computation is of interest precisely because it is computationally simple, yet requires a lot of synchronization: By contributing essentially no computational complexity of its own, it clearly discloses the CX system overhead associated with task synchronization.

Indeed, we augment this trivial computation with a parameterized synthetic workload. Using the parameter, we vary the computational load in order to establish the multiprocessor speedup efficiency as a function of the size of the computational load. Let N(n) denote the number of tasks spawned by computing F(n). Clearly,

N(n) = N(n − 1) + N(n − 2) + 2,


Table 1
A table of efficiency (TSEQ/T1) as a function of the workload for computing F(8). Times are given in seconds.

Workload   TSEQ      T1        Efficiency
4522       497.420   518.816   0.96
3740       415.140   436.897   0.95
2504       280.448   297.474   0.94
1576       179.664   199.423   0.90
914        106.024   120.807   0.88
468        56.160    65.767    0.85
198        24.750    29.553    0.84
58         8.120     11.386    0.71

with initial conditions N(0) = N(1) = 1. By inspecting the dags associated with the doubly recursive Fibonacci computation, we see that N(n) = 3F(n) − 2. Thus,

N(n) = (3/√5) [ ((1 + √5)/2)^(n+1) − ((1 − √5)/2)^(n+1) ] − 2.

This is the total number of tasks for computing F(n) recursively. The critical path length for F(n) is 2n − 1.
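The identities above are easy to check mechanically: count the spawned tasks directly from the recurrence N(n) = N(n − 1) + N(n − 2) + 2, and compare against both N(n) = 3F(n) − 2 and the closed form. The class and method names below are illustrative only.

```java
/**
 * A check of the task-count identities: N(n) from its recurrence,
 * against N(n) = 3F(n) - 2 and the Binet-style closed form.
 */
public class TaskCount {
    /** Fibonacci with F(0) = F(1) = 1, computed iteratively. */
    static long fib(int n) {
        long a = 1, b = 1;
        for (int i = 2; i <= n; i++) { long t = a + b; a = b; b = t; }
        return n == 0 ? 1 : b;
    }

    /** Task count from the recurrence N(n) = N(n-1) + N(n-2) + 2, N(0) = N(1) = 1. */
    static long tasks(int n) {
        long a = 1, b = 1;
        for (int i = 2; i <= n; i++) { long t = a + b + 2; a = b; b = t; }
        return n == 0 ? 1 : b;
    }

    /** Closed form: (3/sqrt5)(phi^(n+1) - psi^(n+1)) - 2, rounded to an integer. */
    static long tasksClosedForm(int n) {
        double s = Math.sqrt(5.0);
        double phi = (1 + s) / 2, psi = (1 - s) / 2;
        return Math.round(3.0 / s * (Math.pow(phi, n + 1) - Math.pow(psi, n + 1)) - 2);
    }
}
```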

TSEQ denotes the time to compute F(n) with a doubly recursive sequential Java program. T1 denotes the time to compute F(n) with a doubly recursive Java program for CX that has exactly one producer: each recursive method invocation translates into two subtasks plus a composition task to sum their results. Table 1 presents the times for each workload, TSEQ, T1, and their ratio, which is referred to as the efficiency of the CX application. The times given suggest the intuitive conclusion: As the workload increases, the efficiency of CX increases. The efficiencies given represent a “best” case, since both the producer and its task server were running on the same machine.

TP denotes the time for the computation using P producers. T∞ denotes the time to complete the computation’s critical path of tasks. Thus, as has been reported in the Cilk project:

TP ≥ max{T∞, T1/P}.

To ensure that TP is dominated by the total work and not the critical path, we thus must have T1/P > T∞:

P < [ (3/√5) ( ((1 + √5)/2)^(n+1) − ((1 − √5)/2)^(n+1) ) − 2 ] / (2n − 1).

For P = 60, this inequality holds for n ≥ 14. Our experiments compute F(n), for n = 13, …, 18. For larger values of n, the total workload would more clearly dominate the time to complete F(n)’s critical path.
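The threshold n ≥ 14 for P = 60 can be verified numerically: the condition T1/P > T∞ is equivalent to P < N(n)/(2n − 1), with N(n) computed from its recurrence. The names below are illustrative.

```java
/**
 * Verifies the work-domination bound: the smallest n for which
 * P producers satisfy P < N(n)/(2n - 1).
 */
public class WorkBound {
    /** N(n) = N(n-1) + N(n-2) + 2, with N(0) = N(1) = 1. */
    static long tasks(int n) {
        long a = 1, b = 1;
        for (int i = 2; i <= n; i++) { long t = a + b + 2; a = b; b = t; }
        return n == 0 ? 1 : b;
    }

    /** Smallest n for which total work dominates the critical path for p producers. */
    static int minN(int p) {
        int n = 1;
        while (tasks(n) <= (long) p * (2 * n - 1)) n++;
        return n;
    }
}
```

For example, at n = 13 we have N(13) = 1129 ≤ 60 · 25 = 1500, while at n = 14, N(14) = 1828 > 60 · 27 = 1620, confirming the text’s claim.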

Traditionally, speedup is measured on a dedicated multiprocessor, where all processors are homogeneous in hardware and software configuration. Thus, speedup is well defined as T1/Tp, where T1 is the time a program takes on one processor and Tp is the time the same program takes on p processors. The fraction of speedup obtained is the ratio of ideal parallel time over the actual parallel time: (T1/p)/Tp.

We now generalize this formula to a vector p = [p1 p2 · · · pd]^T of d different processor types, where there are p1 processors of type 1, p2 processors of type 2, etc. The basic idea is simply that work = work rate × time. Let:

– w denote the amount of work;
– ri denote the work rate for 1 processor of type i;
– T^i_p(w) denote the time for p processors of type i to complete work w.

Clearly,

ri = w/T^i_1(w).

Ideally, work rates are additive: The work rate for p1 machines of type 1 plus p2 machines of type 2 plus · · · plus pd machines of type d is just p^T r, where r = [r1 r2 · · · rd]^T. Let τp(w) denote the ideal parallel time to complete work w with a vector p = [p1 p2 · · · pd]^T of processors. We have

τp(w) = w/(p^T r) = ( p1/T^1_1(w) + p2/T^2_1(w) + · · · + pd/T^d_1(w) )^(−1).

When there is only 1 processor type, the formula above for using p of them reduces to the familiar T1/p. Let Tp(w) denote the actual time to complete work w with a vector p of processors. The general formula for the fraction of speedup obtained thus is

τp(w)/Tp(w).

While this definition does not incorporate machine and network load factors, it does reflect the heterogeneous nature of the set of machines.
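The formula above transcribes directly into code. The sketch below is illustrative (names and sample times are not measurements from the paper): it sums the per-type work rates p_i / T^i_1(w) and inverts the total.

```java
/**
 * Ideal parallel time tau_p(w) = (sum_i p_i / T1_i(w))^(-1) for a
 * heterogeneous processor vector, and the resulting fraction of ideal
 * speedup achieved.
 */
public class IdealSpeedup {
    /** Ideal parallel time, given counts p[i] and one-processor times t1[i]. */
    static double idealTime(int[] p, double[] t1) {
        double rate = 0.0;                      // aggregate work rate p^T r, in w per second
        for (int i = 0; i < p.length; i++) rate += p[i] / t1[i];
        return 1.0 / rate;
    }

    /** Fraction of ideal speedup achieved: tau_p(w) / T_p(w). */
    static double fractionOfIdeal(int[] p, double[] t1, double actualTime) {
        return idealTime(p, t1) / actualTime;
    }
}
```

With a single processor type, `idealTime` reduces to T1/p, as the text notes: 8 processors that each need 100 s alone give an ideal time of 12.5 s.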

Table 2
The number and types of processors for Producers and TaskServers

               Type        Count
Producers      Dual 512    34
               Dual 1024   22
               Quad         4
TaskServers    Quad         2

The virtue of having a formula for N(n), the number of tasks to compute F(n), now comes into play. Clearly, the experiments that take the longest are those that involve only 1 processor: computing T^i_1 for various machine types i. Let W(n) denote the computational work associated with computing F(n) (with an augmented workload). Let T^i_1(W(n)) denote the time to complete W(n) on 1 processor of type i. We model the computation time of W(n) on 1 machine of type i as the sum of:

– a portion of time that is independent of n, to start and stop the program, denoted αi;
– an amount of time that depends on n: βiN(n).

That is,

T^i_1(W(n)) = αi + βiN(n).

We take actual measurements of T^i_1(W(n)) for 2 values of n chosen such that they result in a system of two independent linear equations. We then solve for αi and βi. For example, say T^i_1(W(5)) = 27 seconds and T^i_1(W(7)) = 66 seconds. Then,

27 = αi + 22βi,  (1)
66 = αi + 61βi.  (2)

Solving, we obtain αi = 5 seconds and βi = 1 second on machine type i. We now estimate T^i_1(W(n)) = 5 + 1·N(n), for any natural number n. Thus, 2 small experiments suffice for producing a good estimate of a very large sequential execution time. We used this technique to compute the base cases used in the following speedup calculations. This technique obviates the need for extremely large sequential executions that otherwise would be needed to calculate speedups. Large multiprocessor runs require large problem instances. Computing times for the base cases of such runs (e.g., 1000-processor experiments) can, in principle, require many days of processor time. Thus, using this technique, we avoid the most computationally extended experiments, which are consequently quite precarious (e.g., a momentary power loss requires restarting from the beginning).
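The two-measurement estimate can be sketched as follows, using the worked example from the text (27 s at n = 5, 66 s at n = 7). The class and method names are illustrative.

```java
/**
 * Fits T1_i(W(n)) = alpha_i + beta_i * N(n) from two measured runs,
 * then extrapolates the sequential time for any n.
 */
public class SeqTimeModel {
    final double alpha, beta;

    /** Fit from two measurements: time t1 at problem size n1, t2 at n2. */
    SeqTimeModel(int n1, double t1, int n2, double t2) {
        long x1 = tasks(n1), x2 = tasks(n2);
        beta = (t2 - t1) / (x2 - x1);   // per-task time (slope over task counts)
        alpha = t1 - beta * x1;         // fixed start/stop cost
    }

    /** Estimated sequential time for computing F(n) with the synthetic load. */
    double estimate(int n) { return alpha + beta * tasks(n); }

    /** N(n) = N(n-1) + N(n-2) + 2, with N(0) = N(1) = 1. */
    static long tasks(int n) {
        long a = 1, b = 1;
        for (int i = 2; i <= n; i++) { long t = a + b + 2; a = b; b = t; }
        return n == 0 ? 1 : b;
    }
}
```

With the example measurements, the fit recovers αi = 5 and βi = 1, and extrapolates to any n without running the long sequential experiment.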

Table 2 presents the number of processors of eachtype that were used in our experiments.

Table 3 gives the actual times for 2 synthetic workloads on the processor types used in the experiments. We have 3 task types: decomposition (D), boundary (B), and composition (C).

Table 3
Task times, for the 2 processor types. Each had 2 workloads. The 3 task types are decomposition (D), boundary (B), and composition (C). Times are in milliseconds.

Dual 512     D    B     C
Workload 1   41   1720  41
Workload 2   41   3650  41
Quad         D    B     C
Workload 1   32   1377  32
Workload 2   32   2925  32

The ratio of actual speedup to ideal speedup is less than or equal to 1. Figure 5 shows this ratio for Fibonacci computations varying from F(13) to F(18). For F(14), the ratio of actual to ideal speedup is 0.87. For F(18), it is 0.99. CX achieves essentially 0.99 of ideal speedup using 60 processors on a complex dag-structured computation with small tasks (average task time is 1.8 seconds for Workload 1 and 3.7 seconds for Workload 2). This is encouraging: The tasks do not need to be too coarse for respectable speedups. For these preliminary performance experiments, the task servers did not mirror their state changes.

Figure 6 shows what percentage of idle time was spent during the transient parts of the computation: The initial transient is when the computation begins, and most processors are starving for tasks; the termination transient is when the computation is winding down, and most processors again are starving for tasks. These inevitable transients account for 25% of idle cycles when the system is achieving 0.99 of optimal speedup. In particular, the idleness due to the initial transient in that case is 0.1% of idle cycles. This suggests that tasks are distributed to the 60 processors rapidly.

We also performed experiments (on 16 processors) to measure the effect of pre-fetching. For small computations (few tasks and/or short tasks) and fast communication, the performance gain from pre-fetching is minimal. As the number of tasks increases and/or the task time increases and/or the communication times increase, pre-fetching helps more and more. Since our cluster has fast communication, we did not obtain data for the case of communications with relatively long latencies. Specifically, for F(11), speedup with pre-fetching was 0.51 of optimal, whereas without pre-fetching, speedup was 0.54. However, for F(15), speedup with pre-fetching was 0.93 of optimal, whereas without pre-fetching, speedup was 0.80. We believe that as the number of tasks increases and/or the task sizes increase and/or communication latencies increase, the benefits of pre-fetching increase commensurately.


Fig. 5. Fraction of ideal speedup for computing F(n), n = 13, 14, 15, 16, 17, 18 under 2 workloads, using 60 processors.

Fig. 6. Percentage of idle cycles that are due to start and stop transients, for F(n), n = 13, 14, 15, 16, 17, 18 under 2 workloads.

6. Conclusion

CX is a network-based computational exchange. It can be used in a variety of environments, from a small laboratory within a single department of a university, to a corporate producer network, to millions of independent producers spontaneously organized into a giant producer network.

We have chosen Java for CX because Java increases application programmer productivity (e.g., it is object-oriented, yet serializes objects for communication), reduces application portability and interoperability problems, enables mobile code, will support a high-level security API (RMI), and does all this with an acceptable and decreasing penalty vis-à-vis native machine execution.

We believe that our contributions to network-based, object-oriented parallel computing include:

– The novel combination of variations on ideas by other researchers, including work stealing of non-blocking tasks, eager task scheduling, and space-based coordination.

– A simple, compact API that enables the expression of object-oriented, task-level parallelism. It cleanly separates application logic from the logic that supports interprocess communication and fault tolerance.

– The sibling-connected fat tree of servers, a recursive, short-diameter, scalable network of task servers that self-repairs in the face of a sequence of faults: The network gracefully degrades from n servers to one server, provided that the failures occur sequentially.

– A simple diffusion process for distributing tasks among the network of task servers. Since the diameter of the network is O(log n), the number of edges between any task and any producer is no more than 2 log n: Using only local information, task “concentrations” rapidly diffuse into the network.

– The use of task caching/replication and two levels of pre-fetching (including inter-server task diffusion) to hide the large communication latency that is intrinsic to networks.

– A simple, general expression for ideal speedup, τp(w), when performing work w on a vector p = [p1 p2 · · · pd]^T of processors:

  τp(w) = w/(p^T r) = ( p1/T^1_1(w) + p2/T^2_1(w) + · · · + pd/T^d_1(w) )^(−1).

– A load generator, using the F(n) computation, that strenuously exercises the dag model of computation: It spawns many tasks that require synchronization of predecessor tasks. This load generator is versatile because it augments the F(n) computation with a parameterized synthetic load.

– A technique for accurately estimating long sequential execution times, based on 2 short executions, that obviates the need for the most time-consuming experiments, potentially saving days of experimental work.

– A test bed for a variety of research topics, such as automated trading, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallel computation.

The API can serve as a target for a higher-level notation for the object-oriented expression of parallel algorithms. As future work, we may work on an extension to Java, an object-oriented analog to Cilk’s extensions to C. The extensions (which, when elided, leave a valid Java program) could be preprocessed into another Java program – one that exploits the algorithm’s task-level parallelism when run on CX’s network computing system. We would like to analyze more deeply and experiment with diffusion, modelling task servers and producers as adaptive controllers.

We also would like to experiment with various trading strategies, and to program applications for CX that have value to the scientific community.

References

[1] S. Adabala, N.H. Kapadia and J.A.B. Fortes, Performance and Interoperability Issues in Incorporating Cluster Management Systems within a Wide-Area Network Computing Environment, in Supercomputing: High Performance Networking and Computing, Dallas, TX, November 2000.

[2] E. Adar and B.A. Huberman, Free Riding on Gnutella, First Monday 5(10) (Oct. 2000), http://www.firstmonday.dk/issues/issue510/adar.

[3] A. Bakker, E. Amade, G. Ballintijn, I. Kuz, P. Verkaik, I. van der Wijk, M. van Steen and A. Tanenbaum, The Globe Distribution Network, in Proc. 2000 USENIX Annual Conf. (FREENIX Track), San Diego, June 2000, pp. 141–152.

[4] J.E. Baldeschwieler, R.D. Blumofe and E.A. Brewer, ATLAS: An Infrastructure for Global Computing, in Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.

[5] A. Baratloo, M. Karaul, Z. Kedem and P. Wyckoff, Charlotte: Metacomputing on the Web, in Proceedings of the 9th Conference on Parallel and Distributed Computing Systems, 1996.

[6] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall and Y. Zhou, Cilk: An Efficient Multithreaded Runtime System, in 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP ’95), Santa Barbara, CA, July 1995, pp. 207–216.

[7] R.D. Blumofe and P.A. Lisiecki, Adaptive and Reliable Parallel Computing on Networks of Workstations, in Proc. USENIX Ann. Technical Symposium, Anaheim, Jan. 1997.

[8] R. Buyya, D. Abramson and J. Giddy, An Economy Driven Resource Management Architecture for Global Computational Power Grids, in Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000), Las Vegas, USA, June 2000, pp. 26–29.

[9] P. Cappello, B. Christiansen, M.O. Neary and K.E. Schauser, Market-Based Massively Parallel Internet Computing, in Third Working Conf. on Massively Parallel Programming Models, London, Nov. 1997, pp. 118–129.

[10] N. Carriero, D. Gelernter, D. Kaminsky and J. Westbrook, Adaptive Parallelism with Piranha, Technical Report YALEU/DCS/TR-954, Department of Computer Science, Yale University, New Haven, Connecticut, 1993.

[11] B.O. Christiansen, P. Cappello, M.F. Ionescu, M.O. Neary, K.E. Schauser and D. Wu, Javelin: Internet-Based Parallel Computing Using Java, Concurrency: Practice and Experience 9(11) (Nov. 1997), 1139–1160.

[12] E. Drexler and M. Miller, Incentive Engineering for Computational Resource Management, in: The Ecology of Computation, B. Huberman, ed., Elsevier Science Publishers B.V., North-Holland, 1988.

[13] D.H.J. Epema, M. Livny, R. van Dantzig, X. Evers and J. Pruyne, A Worldwide Flock of Condors: Load Sharing among Workstation Clusters, Future Generation Computer Systems 12 (1996), 53–65.

[14] T. Fink and S. Kindermann, First Steps in Metacomputing with Amica, in Proceedings of the 8th Euromicro Workshop on Parallel and Distributed Processing, 1998.

[15] I. Foster and C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, International Journal of Supercomputer Applications (1997).

[16] E. Freeman, S. Hupfer and K. Arnold, JavaSpaces Principles, Patterns, and Practice, Addison-Wesley, 1999.

[17] L. Gong, Inside Java 2 Platform Security, Addison-Wesley, 1999.

[18] A.S. Grimshaw, W.A. Wulf and the Legion team, The Legion Vision of a Worldwide Virtual Computer, Communications of the ACM 40(1) (Jan. 1997), 39–45.

[19] K.A. Hawick, H.A. James, A.J. Silis, D.A. Grove, K.E. Kerry, J.A. Mathew, P.D. Coddington, C.J. Patten, J.F. Hercus and F.A. Vaughan, DISCWorld: An Environment for Service-Based Metacomputing, Technical Report DHPC-042, 1998.

[20] A. Keller and A. Krawinkel, Lessons Learned While Operating Two Large SCI Clusters, in Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (CC-GRID), Brisbane, Australia, 2001, pp. 303–310.

[21] T. Malone, R.E. Fikes, K.R. Grant and M.T. Howard, Enterprise Computation, in: The Ecology of Computation, B. Huberman, ed., Elsevier Science Publishers B.V., North-Holland, 1988.

[22] M.O. Neary, S.P. Brydon, P. Kmiec, S. Rollins and P. Cappello, Javelin++: Scalability Issues in Global Computing, Concurrency: Practice and Experience 12 (2001), 727–753.

[23] M.O. Neary, B.O. Christiansen, P. Cappello and K.E. Schauser, Javelin: Parallel Computing on the Internet, Future Generation Computer Systems 15(5–6) (Oct. 1999), 659–674.

[24] M. Nibhanupudi and B. Szymanski, Runtime Support for Virtual BSP Computer, in Parallel and Distributed Computing, Workshop on Runtime Systems for Parallel Programming (RTSPP ’98), 12th Int. Parallel Processing Symp. (IPPS/SPDP), March 1998.

[25] M. Nibhanupudi and B. Szymanski, BSP-based Adaptive Parallel Processing, in: High Performance Cluster Computing, R. Buyya, ed., Prentice-Hall, 1999, pp. 702–721.

[26] F.S. Roberts, Applied Combinatorics, Prentice-Hall, 1984.

[27] R. Scheifler, RMI Security, http://java.sun.com/aboutJava/communityprocess/jsr/jsr076 rmisecurity.html, August 2000.

[28] N. Spring and R. Wolski, Application Level Scheduling of Gene Sequence Comparison on Metacomputers, in Proceedings of the 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 1998.

[29] R. van Nieupoort, J. Maassen, H.E. Bal, T. Kielmann and R. Veldema, Wide-Area Parallel Computing in Java, in ACM 1999 Java Grande Conference, San Francisco, June 1999, pp. 8–14.

[30] G. von Laszewski, I. Foster, J. Gawor, W. Smith and S. Tuecke, CoG Kits: A Bridge between Commodity Distributed Computing and High-Performance Grids, in ACM Java Grande Conference, June 2000.

[31] C.A. Waldspurger, T. Hogg, B.A. Huberman, J.O. Kephart and W.S. Stornetta, Spawn: A Distributed Computational Economy, IEEE Transactions on Software Engineering 18(2) (Feb. 1992).

[32] A. Wendelborn and D. Webb, Distributed process networks project: Progress and directions, citeseer.nj.nec.com/wendelborn99distributed.html.
