
    JAWS: A Java Work Stealing Scheduler Over

    a Network of Workstations

Computer Science Division
The University of California at Berkeley
Z. Morley Mao, Hoi-Sheung W. So, Alec Woo
{zmao, so, awoo}@cs.berkeley.edu

    Abstract

In this paper, we present the design and implementation of a parallel programming environment called JAWS (Java Work Stealer). JAWS is implemented as a user-level Java library which schedules user threads over a network of workstations using a work stealing algorithm. The goal of JAWS is to enable programmers to write cross-platform parallel programs that run seamlessly on a network of workstations and adapt automatically to the number of available workstations.

    1.0 Introduction

Today, machine speed doubles every 18 months while hardware prices continue to drop. It would be very desirable to be able to buy newer, faster, and cheaper workstations, connect them to an existing cluster, and effortlessly increase the speed of the cluster. The Network of Workstations (NOW) project [1] proves that building a fast computer out of slower workstations is both feasible and economical. However, writing software for a cluster of heterogeneous machines is difficult because, for performance reasons, parallel programs are mostly written in machine-dependent languages [7, 10]. Existing programs must be recompiled or even modified before running on a new platform. Moreover, maintaining different versions of the same program for different platforms increases administrative costs, thereby reducing the cost-effectiveness of running a heterogeneous network.

Java [2, 12, 13] is a type-safe, cross-platform language which promises that programmers can now "write once, run everywhere". If we can combine the low-cost, high-availability nature of NOW with the cross-platform feature of Java to create a new parallel programming environment, the new environment can potentially deliver on the following promise: "write once, run everywhere, at the same time." It is an extremely powerful concept to write a parallel program just once and have all existing programs continue to run unmodified, just faster, whenever new computers are added to the cluster. Furthermore, as new computers are dynamically added to the cluster, a work stealing scheduler can immediately make use of the new computing resources, while static scheduling algorithms [9, 15] fail to adapt to this dynamically changing cluster environment.

The hard question is: can we turn Java into a parallel programming environment that is scalable and adaptive?


    2.0 Design Overview

JAWS allows programmers to write parallel programs in pure Java that can run on a network of workstations. JAWS consists of two parts: a runtime system, which does work scheduling and load balancing using the work stealing algorithm, and a library that acts as an API for applications to access the JAWS runtime facilities. The runtime system, also known as a ComputeEngine, runs on every node of the NOW cluster all the time, with the nodes constantly contacting each other to discover if there are any jobs to be shared. The API is packaged as a superclass, WorkerClass, which each application subclasses. WorkerClass provides a gateway to spawn parallel jobs using the JAWS runtime system.

3.0 Design Rationale

Throughout the design process, four factors heavily influenced our decisions.

    1. Ease-of-programming

    We hope to make writing parallel programs in JAWS as simple as writing multi-threaded programs.

(a) The parallel programming model of JAWS should be intuitive and easy to learn.
(b) The application programming interface (API) of JAWS should be simple.
(c) The amount of code that the application programmer has to write in order to spawn a parallel job and collect results should be minimized.

    2. Portability:

We chose Java as the implementation language because it is highly portable. A Java program, once compiled, can run on any Java Virtual Machine (JVM) running on top of any hardware architecture. In order to preserve the portability of Java programs, we try very hard to avoid modifying the JVM or using the Java Native Interface (JNI). At present, we require neither the use of JNI nor modifications to the JVM. However, it is conceivable that we may want to use C/C++ to implement some of the performance-critical components of JAWS later. Nevertheless, platform-dependent code should only be used as an aid to improve performance, not to ensure correctness or completeness of JAWS.

    3. Expressive Power

JAWS should allow programmers to expose as much parallelism in their programs as possible. The job of the programmer is to find and expose the parallelism in the program structure. The job of JAWS is to provide powerful mechanisms that help programmers easily express this parallelism. Ideally, if two statements can be executed in parallel, the programmer should be allowed to express this fact. However, given that we want to preserve the syntax of Java, it might not be possible to do so in some cases.

    4. Performance

    For JAWS to have good performance, there are two metrics that we can use to guide the design and

    optimization efforts:

    (a) How well does the speedup of a JAWS program scale with the number of nodes in the cluster?


    4.0 Detailed Design & Implementation

JAWS is inspired by and strongly influenced by the Cilk project [7], but two major differences exist.

1.) Cilk adds a parallel extension to the C language. Its runtime system can gain access to the stack of a running thread through the use of setjmp() and longjmp() calls. JAWS is implemented in Java without modifying the JVM; therefore, it has no way to save and restore a stack frame as is done in Cilk.

2.) The current Cilk implementation assumes that the underlying hardware is a Symmetric Multiprocessor (SMP) machine with shared memory. JAWS targets a network of workstations, which has no shared memory. As a result, communication between Java threads requires message passing.

    4.1 JAWS Runtime System

Each JAWS node has a resident runtime system called the Compute Engine. At the center of each Compute Engine is a double-ended queue (4.1.2) which stores the jobs waiting to be executed. Two threads operate concurrently on the queue: the Worker Thread (4.1.3) and the Daemon Thread (4.1.4). The Worker Thread pops jobs out of the queue and starts executing them. But if the queue is empty, the Worker Thread becomes a thief and contacts other nodes to try to steal a job. The Daemon Thread helps thieves from other machines steal jobs stored in the local queue. After a job has been stolen and executed on a remote node, the results of execution are returned to the original node where the job was spawned. The thief contacts the Collector Thread (4.1.5) of the original node to pass back any execution results. Figure 1 illustrates the internal structure of each Compute Engine and the interactions between different engines.

Figure 1: Internal structure of a Compute Engine (Collector, Worker Thread, Daemon Thread, Deque, and a pool of DoJob Threads, some active and some blocked) and the interaction between two Compute Engines: the Worker Thread pushes and pops jobs on its local Deque, steal requests travel through RMI to the remote Daemon, and returned results wake up the blocked DoJob Thread.


    4.1.1 Job

In JAWS, the smallest unit of work that can be executed in parallel is a method invocation of an object -- a job. A job is a tuple of the form < Class Name, Method_Name_and_Signature, Arguments, Return_Result_Info >. Each job is self-contained, so a job can be migrated from the machine where it was spawned to any other machine in the cluster. Return_Result_Info (4.1.5) stores enough information to allow the execution results to be returned to the machine where the job was spawned.

JAWS assumes each JAWS program has an entry point method called main. To start a program, a job specifying the main() method of the application together with all the parameters is spawned and handed to the JAWS runtime system for execution.
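To make the structure concrete, the following is a minimal sketch of such a self-contained job descriptor; the class and field names are illustrative and not taken from the JAWS source.

import java.io.Serializable;

// Sketch of the job tuple described above (names are illustrative).  A job
// carries everything needed to invoke the method on any node in the cluster.
class Job implements Serializable {
    String   className;         // e.g. "Fib"
    String   methodName;        // e.g. "DoFib"
    Class[]  parameterTypes;    // the method signature
    Object[] arguments;         // the actual arguments (must be serializable)
    Object   instance;          // the object instance to invoke the method on
    Object   returnResultInfo;  // routing information for the result (see 4.1.5)
}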

    4.1.2 Deque

The double-ended queue, or Deque, is the center of the JAWS runtime system. It stores jobs that have been spawned but not yet executed. New jobs are pushed into the queue from one end. Jobs are also popped from the same end for execution on the local host. However, when other machines ask for jobs from the local host, a job is popped from the opposite end of the Deque.
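This two-ended discipline can be sketched as follows; this is a simplified illustration, not the actual JAWS source, and the method names pushBottom(), popBottom(), and popTop() are our own.

import java.util.Vector;

// The local Worker Thread pushes and pops at the "bottom" end; steal
// requests from other machines take the oldest job from the "top" end.
class JobDeque {
    private final Vector jobs = new Vector();

    // Local spawn: push onto the bottom end.
    public synchronized void pushBottom(Object job) {
        jobs.addElement(job);
    }

    // Local execution: pop the most recently pushed job from the same end.
    public synchronized Object popBottom() {
        if (jobs.isEmpty()) return null;
        Object job = jobs.lastElement();
        jobs.removeElementAt(jobs.size() - 1);
        return job;
    }

    // Steal: pop the oldest job from the opposite end.
    public synchronized Object popTop() {
        if (jobs.isEmpty()) return null;
        Object job = jobs.firstElement();
        jobs.removeElementAt(0);
        return job;
    }
}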

    4.1.3 Worker Thread

The job of a Worker Thread is to keep the CPU busy at all times by feeding it with a steady supply of jobs. Jobs are popped from the local Deque if it is non-empty. If all the jobs in the local Deque have been completed, the Worker Thread becomes a thief and attempts to steal a job from a randomly picked node--a victim--in the cluster. If the victim also has no jobs, the Worker Thread will randomly choose another node to steal from. As long as there are no jobs in the local Deque, the stealing process repeats until a job is found.

Once a job has been found, the Worker Thread assigns the job to a DoJob Thread. A DoJob Thread can be thought of as a robot that executes a job. The robot, by itself, cannot do anything useful; the robot needs a program, i.e., a job. A DoJob Thread uses the Java Reflection [17, 8] feature to invoke the method of the object specified in the job. While the DoJob Thread is executing the job, new child jobs may be spawned. It may go to sleep while waiting for results from its child jobs. When all the child jobs are done, the waiting DoJob Thread is awakened. After finishing the job, the DoJob Thread is returned to a pool, ready for reuse. The reason for saving these threads is to minimize the number of threads created. When the Worker Thread runs out of threads, it can create more on demand to execute new jobs.
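Building on the JobDeque sketch above, the Worker Thread's main loop might look roughly like the following; stealFromRandomVictim() and runInDoJobThread() are illustrative placeholders rather than the actual JAWS API.

// Sketch of the Worker Thread's main loop (names are illustrative).
class WorkerThread extends Thread {
    private final JobDeque deque;

    WorkerThread(JobDeque deque) { this.deque = deque; }

    public void run() {
        while (true) {
            Object job = deque.popBottom();     // prefer local work
            while (job == null) {
                job = stealFromRandomVictim();  // become a thief and retry
            }
            runInDoJobThread(job);              // hand the job to a DoJob Thread
        }
    }

    private Object stealFromRandomVictim() {
        // Contact the Daemon of a randomly chosen node (see 4.1.4) and ask
        // for a job; returns null if the victim's Deque is also empty.
        return null;
    }

    private void runInDoJobThread(Object job) {
        // Take a DoJob Thread from the pool (creating one on demand if the
        // pool is empty) and let it invoke the job's method via reflection.
    }
}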

    4.1.4 Daemon Thread

The job of a Daemon Thread is to serve the steal requests from Worker Threads on other nodes. The Daemon is implemented as a remote object which exposes a Remote Method Invocation (RMI) interface [16, 8] to allow other nodes to submit steal requests. When the Daemon receives a steal request, it checks whether the local Deque is empty. If it is not, the Daemon pops the oldest job from the Deque and returns it to the thief. If the local Deque is empty, the Daemon Thread notifies the thief immediately without blocking.
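A remote interface of roughly the following shape would suffice for the Daemon; the interface and method names are assumptions, not the actual JAWS API, and the implementation class would extend UnicastRemoteObject.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Sketch of the steal interface a Daemon could expose over RMI.
interface StealService extends Remote {
    // Called by a thief on another node: returns the oldest job popped from
    // the local Deque, or null immediately if the Deque is empty.
    Object steal() throws RemoteException;
}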


4.1.5 Collector Thread

The Collector has a hash table that keeps track of all the jobs that have been spawned and the number of results that have been received so far. The hash table structure is illustrated in Table 1.

Thread ID       Number of Spawns   Number of Results Arrived   Result Vector (indexed by Result ID)
ID of Fib(5)           2                      1                [5, null]
ID of Fib(20)          2                      0                [null, null]
ID of Fib(2)           2                      2                [1, 1]

Table 1: Structure of the Collector's hash table

When a thief returns the execution result of a stolen job, it calls the Collector of the victim (the node where the job was spawned) and returns the result wrapped in a ReturnResultInfo object. Recall from section 4.1.1 that this object provides the information the thief needs to route the execution results back to the victim, and the information the victim's Collector needs to return the result to the specific spawn of the thread. The ReturnResultInfo object contains the following information:

1. A collector pointer, which points to the Collector residing on the victim machine.
2. A Thread ID, which identifies the thread that spawned the job. It is also used as an index into the hash table shown in Table 1.
3. A Result ID, which maps the result to the correct spawn in the program.
4. The actual object containing the result.
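A minimal sketch of this object, with illustrative field names and types, might be:

import java.io.Serializable;
import java.rmi.Remote;

// Travels with a stolen job: tells the thief where to send the result and
// tells the receiving Collector which spawn the result belongs to.
class ReturnResultInfo implements Serializable {
    Remote collector;  // remote reference to the Collector on the victim machine
    int    threadId;   // identifies the spawning thread; index into Table 1
    int    resultId;   // maps the result to the correct spawn() in the program
    Object result;     // the actual result object (filled in by the thief)
}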

    4.2 JAWS Application Programming Interface

The programming interface of JAWS is packaged as a superclass called WorkerClass which all JAWS programs must subclass. It provides methods to spawn parallel jobs and collect results using the JAWS runtime system. The API specifies two methods: spawn() and sync().

1.) int spawn(String className, String methodName, Class[] parameterList, Object[] argumentList, Object instance);
2.) Vector sync();

JAWS allows job spawning at method granularity. That is, calling spawn() is like performing a method invocation and requires information such as the object instance, the class name, the method name, the method signature (parameterList), and the arguments. Spawn returns a Result ID which is used to look up the correct result from sync(). The method sync() takes no arguments. Its primary purpose is to block the thread and wait for results. When results for all the jobs spawned before sync() have arrived, the Collector will notify the thread and sync() will return with a Vector containing the results. The Result ID is used to index the correct result within the Vector. Note that the Result ID returned by spawn() is the same Result ID stored in the ReturnResultInfo object.

    4.3 Use of the JAWS System

A simple example for calculating Fibonacci numbers is shown in figure 2. Spawn() is used to create new Fibonacci jobs. Sync() is called to collect the results of all outstanding Fibonacci spawns issued up to that point.


class Fib extends WorkerClass {
    Object DoFib(Integer d) {
        Object[] argListX, argListY;
        Class[] paramList;
        Integer x, y;
        if (d.intValue() > 1) {
            argListX = new Object[1];
            argListY = new Object[1];
            x = new Integer(d.intValue() - 1);
            y = new Integer(d.intValue() - 2);
            argListX[0] = x;
            argListY[0] = y;
            paramList = new Class[1];
            paramList[0] = x.getClass();
            int rid1 = spawn("Fib", "DoFib", paramList, argListX, new Fib());
            int rid2 = spawn("Fib", "DoFib", paramList, argListY, new Fib());
            Vector resultVector = sync();
            x = (Integer) resultVector.elementAt(rid1);
            y = (Integer) resultVector.elementAt(rid2);
            return (Object) new Integer(x.intValue() + y.intValue());
        }
        // base case: fib(0) or fib(1)
        else {
            return (Object) new Integer(1);
        }
    }
}

    Figure 2: A simple JAWS program for calculating Fibonacci using double recursion

    5.0 Status and Evaluation

    5.1 Speedup

We have implemented three programs using the JAWS infrastructure: calculation of Knary [7], Fibonacci (shown in figure 2), and RC5 key cracking [14]. The recursive version of Fibonacci that we implemented is not optimized for execution on parallel processors. This is to stress test the worst-case performance of the JAWS scheduler, because the amount of work done per method invocation is minimized. Most of the execution is spent spawning jobs and collecting results. The base case of each job that performs no further spawns is Fib of 0 or 1. The number of jobs to be done for a given argument n is exponential with respect to n. As shown in figure 3 below, the speedup of calculating Fib(17) is quite close to linear.


    Figure 3: Speedup of calculating Fib(17)

RC5 key cracking is quite different from Fibonacci in the sense that the job is finished as soon as the correct key is found. The initial job has a given number of unknown bits to be guessed. Each additional job spawned has a range of key values to be tried. The base case is testing 256 guessed key values. By increasing the number of processors, there is a higher probability that the correct key will be found much earlier. This occurs because the job that contains the correct key within its range is likely to be executed earlier. The graph that demonstrates the increase in performance is shown in figure 4.

Figure 4: Speedup of RC5 key cracking with 24 unknown key bits, plotted against the number of processors together with the ideal speedup.


The execution time can vary greatly from trial to trial depending on how early the job containing the correct key in its range is scheduled to execute. However, on average, the entire job is finished faster given a larger number of processors, as seen in figure 4. The graph has a slope greater than 1. The program has a nondeterministic running time, which is dependent on the execution order.

Theoretically, the execution time on P processors is given by the following formula [5]:

    Tp = O(T1/P + T∞)

where

    Tp: the expected time to execute a fully strict computation on P processors using the work stealing algorithm;
    T1: the minimum serial execution time of the multithreaded computation;
    T∞: the minimum execution time assuming an infinite number of processors, which can also be seen as the critical path length;
    P: the number of processors.
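For illustration, with assumed numbers rather than JAWS measurements: a computation with T1 = 100 seconds of total work and a critical path of T∞ = 2 seconds running on P = 8 processors would need roughly

    Tp ≈ T1/P + T∞ = 100/8 + 2 = 14.5 seconds,

a speedup of about 6.9 rather than the ideal 8. As P grows, the T∞ term eventually dominates and caps the achievable speedup.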

In practice, the speedup achieved is not quite linear and is affected by the following factors: the granularity of jobs, the depth and degree of parallelism of each job, the number of processors available to do execution, the critical path of the program, and the inherent parallelism available in the program. The base-case job size needs to be sufficiently large; otherwise, the overhead of communication (stealing and returning results) would dominate and reduce the speedup achieved. The number of processors available is certainly also a factor, for it directly determines the execution time of the initial job. If the number of machines P exceeds the degree of parallelism inherent in the program, then increasing P will not show any improvement in speedup. This is because the number of jobs to be distributed is limited and less than the number of machines available to do work. The critical path is important because the execution time on P processors is always greater than the larger of the critical path (T∞) and T1/P. To find out the actual speedup of JAWS, T∞ needs to be small relative to T1/P.

The speedup of Fibonacci in figure 3 reveals that T∞ dominates Tp, mostly due to the RMI overhead of stealing, when P is greater than four, and therefore limits the speedup achieved. This is best illustrated by figure 1 in Appendix A. It shows that machine A and machine C are stealing for roughly 50% of the execution time, because the jobs they steal are both fine-grained and spawn very few child jobs. In fact, as the number of machines increases, the amount of stealing also increases.

To further investigate this issue, we use the synthetic benchmark Knary, referenced in the Cilk manual [7]. Knary allows the programmer to tune these parameters: the granularity of the job, the degree of parallelism, and the critical path (depth) of the program. Figure 5 shows the speedup of Knary with fine granularity, a small degree of parallelism, and small depth.

Figure 5: Speedup for knary(20,100,3), compared with the ideal speedup.


Note that for a small degree of parallelism and a small depth, there will be numerous stealing activities and T∞ will be large. Therefore, speedup is limited. We then tune the granularity, degree of parallelism, and depth individually to understand the actual effects of these parameters in JAWS. Note that although the three cases are independent of each other, the total amount of work T1 is the same in all three cases. The following graph shows the result.

Figure 6: Speedup for the three specially tuned Knary programs (coarser grain than in Figure 5, more parallelism, and more depth), plotted against the number of machines together with the ideal speedup

The above figure shows that by increasing the number of child jobs, either with more depth or more parallelism, better speedup is achieved. This occurs because the number of steals decreases with increased depth and parallelism. Consequently, T∞ decreases and speedup improves. This explanation is further supported by figure 2 in Appendix A, which shows the average deque size when running the more-depth case with four machines. Contrasted with the Fib(24) case, this graph clearly shows that each machine is busy for most of the execution time with far fewer steals. Therefore, with four machines, Knary with sufficient parallelism or depth can achieve a better speedup than Fib(24).

Figure 6 also shows that the effect of increasing the granularity is not as significant as that of increasing either parallelism or depth. However, if the granularity is not coarse enough, it is not worthwhile to spawn jobs and execute them on other machines, because the cost of communication is much larger than the job execution time. With increased granularity, stealing activities can be reduced; therefore, better speedup results due to a decreased T∞.



    5.2.1 Experimental Setup / Data Collection:

The test bed is a Sun UltraSPARC 1 running Solaris 2.6 with JDK 1.1.3 using native threads. Unless otherwise stated, the JAWS program being profiled is Fib(19). We obtained profiling data through two mechanisms. The first mechanism is the built-in profiler in the JDK, which gives output of the following form:

    Frequency   Callee           Caller            Time
    500         FunctionChild1   FunctionParentA   220ms
    1500        FunctionChild2   FunctionParentB   530ms
    ...

Each method invocation is logged: a counter keeps the total number of calls, and a timer accumulates the total time spent in the callee. However, turning on the profiler logs every single call and inflates the execution time of Fib(19) by about 50%. In addition, when a thread switch happens, the timer keeps running, so the time reported by the profiler can only be treated as an upper bound on the actual execution time. The frequencies of calls, however, are extremely helpful; they give us hints as to where optimizations might help.

The second profiling mechanism is to print out the differences reported by the machine timer (System.currentTimeMillis()) before and after the function call to be investigated. This method limits the logging overhead because only calls of interest are traced. But it also has limited use because the resolution of the timer provided by Java is 1 ms. We timed the execution of System.currentTimeMillis() itself and found it to be only about 1.3 us, which is low enough to be ignored for our purposes.
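A sketch of this second mechanism follows; timedCall() is a stand-in for whatever JAWS method is being investigated, not part of JAWS itself.

// Bracket the call of interest with the machine timer.
class TimingExample {
    static void timedCall() {
        // ... the call being profiled, e.g. a single spawn() ...
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        timedCall();
        long elapsed = System.currentTimeMillis() - start;
        // The timer's resolution is only 1 ms, so very short calls should be
        // executed in a loop and the total divided by the iteration count.
        System.out.println("elapsed: " + elapsed + " ms");
    }
}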

    5.2.2 Experimental Results:

We first run Fib(19) seven times, then discard the fastest and slowest runs. We find the average of the remaining running times to be 20.20 seconds. As a reminder, we are running Fib(19) on a single machine throughout this set of benchmarks, and hence stealing never occurs. After analyzing the profiler output and our own traces, we identified two main problem areas with high overhead.

    1. Reflection:

Each job in JAWS is stored as the tuple described in section 4.1.1. To execute a job, reflection (Class.forName()) is first used to load the class (code) into the JVM. Next, reflection (Class.getDeclaredMethod()) is used to look up the appropriate method, based on the method name and the argument types. Finally, the method returned by getDeclaredMethod() is invoked on the object instance. Each of these methods is invoked once for each job. For Fib(19), each of these methods is invoked 13529 times. It turns out that forName() takes about 166 us per call and getDeclaredMethod() about 250 us. We also measured invoke() on a dummy function which returns a fixed integer and found it takes about 12.6 us. Together, the overhead of reflection accounts for about (166 us + 250 us + 12.6 us) * 13529 = 5.799 seconds, or 28.7% of the total execution time!

In the next version of JAWS, the results of forName() and getDeclaredMethod() are cached, reducing the total execution time by about 28%.
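A minimal sketch of such a cache, keyed on the class and method name, might look like the following; this is illustrative, not the actual JAWS source.

import java.lang.reflect.Method;
import java.util.Hashtable;

// The Class and Method objects are looked up once per (class, method) pair
// and reused for every subsequent job, leaving only the cheap invoke() call
// on the critical path.
class MethodCache {
    private final Hashtable cache = new Hashtable(); // key -> Method

    public Method lookup(String className, String methodName, Class[] paramTypes)
            throws Exception {
        // A real cache would also key on the parameter types to handle overloading.
        String key = className + "#" + methodName;
        Method m = (Method) cache.get(key);
        if (m == null) {
            Class c = Class.forName(className);              // ~166 us, paid once
            m = c.getDeclaredMethod(methodName, paramTypes); // ~250 us, paid once
            cache.put(key, m);
        }
        return m;
    }
}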


2. Synchronization:

To roughly estimate the overhead of the explicit wait() and notify() calls, we set up an experiment in which two threads ping-pong each other: in a loop, each thread notifies the other and then waits its turn.
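A minimal sketch of such a ping-pong experiment is shown below; the class name, loop structure, and iteration count are assumptions rather than the original listing.

// Two threads alternately wake each other up through a shared lock; dividing
// the total time by the number of hand-offs estimates the wait()/notify()
// overhead per hand-off.
class PingPong {
    static final int ROUNDS = 10000;
    static boolean pingTurn = true;
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread ping = new Thread(new Runnable() {
            public void run() { play(true); }
        });
        Thread pong = new Thread(new Runnable() {
            public void run() { play(false); }
        });
        long start = System.currentTimeMillis();
        ping.start();
        pong.start();
        ping.join();
        pong.join();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("us per hand-off: " + (elapsed * 1000.0) / (2.0 * ROUNDS));
    }

    static void play(boolean isPing) {
        synchronized (lock) {
            for (int i = 0; i < ROUNDS; i++) {
                while (pingTurn != isPing) {
                    try { lock.wait(); } catch (InterruptedException e) { return; }
                }
                pingTurn = !isPing;   // hand the turn to the other thread
                lock.notify();
            }
        }
    }
}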


    8.0 References

[1] Tom Anderson, David Culler, and David Patterson. A Case for Networks of Workstations: NOW. IEEE Micro, February 1995.

[2] Ken Arnold and James Gosling. The Java Programming Language. Addison-Wesley, 1996.

[3] J. Eric Baldeschwieler, Robert D. Blumofe, and Eric A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proceedings of the Seventh ACM SIGOPS European Workshop: Systems Support for Worldwide Applications, September 9-11, 1996, Connemara, Ireland.

[4] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.

[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS '94), Santa Fe, New Mexico, November 20-22, 1994.

[6] Bernd O. Christiansen, Peter Cappello, Mihai F. Ionescu, Michael O. Neary, Klaus E. Schauser, and Daniel Wu. Javelin: Internet-Based Parallel Computing Using Java.

[7] Cilk 5.2 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science.

[8] Gary Cornell and Cay Horstmann. Core Java 1.1, Volume 2: Advanced Features. Prentice Hall, 1998.

[9] Mark Crovella, Prakash Das, Czarek Dubnicki, Thomas LeBlanc, and Evangelos Markatos. Multiprogramming on Multiprocessors. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, December 1991.

[10] David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel Programming in Split-C. http://www.cs.berkeley.edu/projects/parallel/castle/split-c/split-c.tr.html

[11] GLUnix: Global Layer Unix. http://now.CS.Berkeley.EDU/Glunix/glunix.html

[12] Java: http://www.java.sun.com

[13] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, 1996. Sun Microsystems, Inc. http://java.sun.com

[14] RC5 key cracking information. Available from the World Wide Web: http://userzweb.lightspeed.net/~gregh/rc5

[15] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S.


Appendix A

Figure 1: Deque sizes of the four machines (A, B, C, and D) running Fib(24), plotted against time (ms). A circle indicates a steal by that machine.


Figure 2: Deque sizes of the four machines (A, B, C, and D) running Knary with parallel=20, work=100, and depth=4, plotted against time (ms). A circle indicates a steal by that machine.