7/31/2019 10.1.1.28.7264.ps
1/14
JAWS: A Java Work Stealing Scheduler Over
a Network of Workstations
Z. Morley Mao, Hoi-Sheung W. So, Alec Woo
Computer Science Division
The University of California at Berkeley
{zmao, so, awoo}@cs.berkeley.edu
Abstract
In this paper, we present the design and implementation of a parallel programming environment called
JAWS (Java Work Stealer). JAWS is implemented as a user-level Java library which schedules user
threads over a network of workstations using a Work Stealer algorithm. The goal of JAWS is to enable
programmers to write cross-platform parallel programs that run seamlessly on a network of workstations, and adapt automatically to the number of available workstations.
1.0 Introduction
Today, machine speed doubles every 18 months while hardware prices continue to drop. It would be very
desirable to be able to buy newer, faster, and cheaper workstations, connect them to an existing cluster, and
effortlessly increase the speed of the cluster. The Network of Workstations (NOW) project [1] proves that building
a fast computer out of slower workstations is both feasible and economical. However, writing software for
a cluster of heterogeneous machines is difficult because, for performance reasons, parallel programs are mostly written in machine-dependent languages [7, 10]. Existing programs must be recompiled or even
modified before running on a new platform. Moreover, maintaining different versions of the same program for different platforms increases administrative costs, thereby reducing the cost-effectiveness of
running a heterogeneous network.
Java [2,12,13] is a type-safe, cross-platform language which promises that programmers can now "write
once, run everywhere". If we can combine the low-cost, high-availability nature of NOW with the cross-
platform feature of Java to create a new parallel programming environment, the new environment can
potentially deliver the following promise: "write once, run everywhere, at the same time." It is an extremely
powerful concept to write a parallel program just once, and when new computers are added to the cluster, all existing programs will continue to run unmodified, just faster. Furthermore, as new computers are
dynamically added to the cluster, a work stealing scheduler can immediately make use of the new
computing resources while static scheduling algorithms [9, 15] fail to adapt to this dynamically changing
cluster environment.
The hard question is: can we turn Java into a parallel programming environment that is scalable, adaptive,
2.0 Design Overview
JAWS allows programmers to write parallel programs in pure Java that can run on a network of workstations. JAWS consists of two parts: a runtime system, which does work scheduling and load-balancing using the work stealing algorithm, and a library that acts as an API for applications to access the JAWS runtime facilities. The runtime system, also known as a ComputeEngine, runs on every node of the NOW cluster at all times, with nodes constantly contacting each other to discover whether there are any jobs to be shared. The API is packaged as a superclass, WorkerClass, which each application subclasses. WorkerClass provides a gateway for spawning parallel jobs using the JAWS runtime system.
3.0 Design Rationale
Throughout the design process, four factors heavily influenced our decisions.
1. Ease-of-programming
We hope to make writing parallel programs in JAWS as simple as writing multi-threaded programs.
(a) The parallel programming model of JAWS should be intuitive and easy to learn.
(b) The application programming interface (API) of JAWS should be simple.
(c) The amount of code that the application programmer has to write in order to spawn a parallel job and collect results should be minimized.
2. Portability
The reason we chose Java as the implementation language is that it is highly portable. A Java program, once compiled, can run on any Java Virtual Machine (JVM) running on top of any hardware architecture. In order to preserve the portability of Java programs, we try very hard to avoid modifying the JVM or using the Java Native Interface (JNI). At present, we require neither the use of JNI nor modifications to the JVM. However, it is conceivable that we may later want to use C/C++ to implement some of the performance-critical components of JAWS. Nevertheless, platform-dependent code should only be used as an aid to improve performance, not to ensure the correctness or completeness of JAWS.
3. Expressive Power
JAWS should allow programmers to expose as much parallelism in their program as possible. The job of
the programmer is to find and expose the parallelism in the program structure. The job of JAWS is to
provide powerful mechanisms to help programmers to easily express the parallelism in the program
structure. Ideally, if two statements can be executed in parallel, the programmer should be allowed to express this fact. However, given that we want to preserve the syntax of Java, it might not be possible to do so in some cases.
4. Performance
For JAWS to have good performance, there are two metrics that we can use to guide the design and
optimization efforts:
(a) How well does the speedup of a JAWS program scale with the number of nodes in the cluster?
4.0 Detailed Design & Implementation
JAWS is inspired by and strongly influenced by the Cilk project [7], but two major differences exist.
1.) Cilk adds a parallel extension to the C language. Its runtime system can access the stack of a running thread through the use of setjmp() and longjmp() calls. JAWS is implemented in Java without modifying the JVM; therefore, it has no way to save and restore a stack frame as Cilk does.
2.) The current Cilk implementation assumes that the underlying hardware is a Symmetric Multiprocessor (SMP) machine with shared memory. JAWS targets networks of workstations, which have no shared memory. As a result, communication between Java threads requires message passing.
4.1 JAWS Runtime System
Each JAWS node has a resident runtime system called a Compute Engine. At the center of each Compute Engine is a double-ended queue (4.1.2) which stores the jobs waiting to be executed. Two threads operate concurrently on the queue: the Worker Thread (4.1.3) and the Daemon Thread (4.1.4). The Worker Thread pops jobs off the queue and starts executing them. If the queue is empty, the Worker Thread becomes a thief and contacts other nodes to try to steal a job. The Daemon Thread helps thieves from other machines steal jobs stored in the local queue. After a stolen job has been executed on a remote node, the thief contacts the Collector Thread (4.1.5) of the original node where the job was spawned to pass back the execution results. Figure 1 illustrates the internal structure of each Compute Engine and the interactions between different engines.
Figure 1: Internal structure of two Compute Engines. Each engine contains a Collector, a Worker Thread, a Daemon, a Deque, an active DoJob Thread, and pools of idle and blocked DoJob Threads. Jobs are pushed onto and popped off the Deque locally; steal requests travel between engines through RMI, and when results arrive, the Collector wakes up the blocked DoJob Thread.
4.1.1 Job
In JAWS, the smallest unit of work that can be executed in parallel is a method invocation of an object -- a job. A job is a tuple of the form <Class_Name, Method_Name_and_Signature, Arguments, Return_Result_Info>. Each job is self-contained, so a job can be migrated from the machine where it was spawned to any other machine in the cluster. Return_Result_Info (4.1.5) stores enough information to allow the execution results to be returned to the machine where the job was spawned.
JAWS assumes each JAWS program has an entry point method called main. To start a program, a job
specifying the main() method of the application together with all the parameters is spawned and handed to
the JAWS runtime system for execution.
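The job tuple described above can be sketched as a small serializable class. The field names and types here are illustrative assumptions, not the actual JAWS source:

```java
import java.io.Serializable;

// A sketch of the JAWS job tuple (names are assumptions, not the real source).
// Serializable, so a job can be migrated to any other machine in the cluster.
class Job implements Serializable {
    final String className;          // class whose method will be invoked
    final String methodSignature;    // method name and signature
    final Object[] arguments;        // arguments for the invocation
    final Object returnResultInfo;   // routing info for the result (see 4.1.5)

    Job(String className, String methodSignature,
        Object[] arguments, Object returnResultInfo) {
        this.className = className;
        this.methodSignature = methodSignature;
        this.arguments = arguments;
        this.returnResultInfo = returnResultInfo;
    }
}
```

Because every field travels with the job, a Compute Engine receiving it needs no other state from the spawning machine.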
4.1.2 Deque
The double-ended queue or Deque is the center of the JAWS runtime system. It stores jobs that have been
spawned but not yet executed. New jobs are pushed into the queue from one end. Jobs are also popped
from the same end for execution on the local host. However, when another machine asks for a job from the local host, a job is popped from the opposite end of the Deque.
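The push/pop discipline above can be sketched with the modern java.util.ArrayDeque (the original JAWS predates this class, so the code below is an illustration, not the actual implementation):

```java
import java.util.ArrayDeque;

// Sketch of the JAWS Deque discipline: local spawns and local pops share
// one end; remote thieves take from the opposite end.
class JobDeque {
    private final ArrayDeque<Runnable> deque = new ArrayDeque<>();

    // A newly spawned job is pushed onto the "local" end.
    synchronized void pushJob(Runnable job) { deque.addFirst(job); }

    // The local Worker Thread pops from the same end (newest job first).
    synchronized Runnable popJob() { return deque.pollFirst(); }

    // A remote thief steals from the opposite end (the oldest job).
    synchronized Runnable stealJob() { return deque.pollLast(); }
}
```

Taking the oldest job for a steal tends to hand the thief a large subtree of work, amortizing the communication cost of the steal.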
4.1.3 Worker Thread
The job of a Worker Thread is to keep the CPU busy at all times by feeding it a steady supply of jobs. Jobs are popped from the local Deque if it is non-empty. If all the jobs in the local Deque have been completed, the Worker Thread becomes a thief and attempts to steal a job from a randomly picked node--a victim--in the cluster. If the victim also has no jobs, the Worker Thread will randomly choose another node to steal from. As long as there are no jobs in the local Deque, the stealing process repeats until a job is found.
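The scheduling loop just described can be sketched as follows; the Node interface and method names are assumptions for illustration, not the actual JAWS API:

```java
import java.util.Deque;
import java.util.Random;

// Sketch of the Worker Thread's scheduling loop: pop locally if possible,
// otherwise steal from randomly chosen victims until a job turns up.
class Worker {
    interface Node { Runnable trySteal(); } // a potential victim

    static Runnable nextJob(Deque<Runnable> local, Node[] cluster, Random rng) {
        Runnable job = local.pollFirst();  // first try the local deque
        while (job == null) {              // otherwise become a thief:
            Node victim = cluster[rng.nextInt(cluster.length)];
            job = victim.trySteal();       // repeat until a job is found
        }
        return job;
    }
}
```

As in the paper's description, the loop spins until some victim yields a job; a real implementation would likely back off between failed steal attempts.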
Once a job has been found, the Worker Thread assigns the job to a DoJob Thread. A DoJob Thread can be thought of as a robot that executes a job. The robot, by itself, cannot do anything useful; the robot needs a program: a job. A DoJob Thread uses the Java Reflection feature [17,8] to invoke the method of the object specified in the job. While the DoJob Thread is executing the job, new child jobs may be spawned. It may go to sleep while waiting for results from its child jobs. When all the child jobs are done, the waiting DoJob Thread is awakened. After finishing the job, the DoJob Thread is returned to a pool, ready for reuse. The reason for saving these threads is to minimize the number of threads created. When the Worker Thread runs out of threads, it can create more on demand to execute new jobs.
4.1.4 Daemon Thread
The job of a Daemon Thread is to serve steal requests from Worker Threads on other nodes. The Daemon is implemented as a remote object which exposes a Remote Method Invocation (RMI) interface [16,8] to allow other nodes to submit steal requests. When the Daemon receives a steal request, it checks whether the local Deque is empty. If it is not, the Daemon pops the oldest job in the Deque and returns it to the thief. If the local Deque is empty, the Daemon Thread notifies the thief immediately without blocking.
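A minimal sketch of the Daemon's steal-serving logic is shown below. The interface and method names are illustrative assumptions; the real JAWS Daemon is exported through RMI, and the stolen job would be a serializable job object rather than a bare Runnable:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayDeque;

// Sketch of the Daemon's remote steal interface (names are assumptions).
interface StealService extends Remote {
    // Returns the oldest local job, or null immediately if none is available.
    Runnable steal() throws RemoteException;
}

class Daemon implements StealService {
    private final ArrayDeque<Runnable> deque;

    Daemon(ArrayDeque<Runnable> deque) { this.deque = deque; }

    // Pops the oldest job; returning null lets the thief move on without
    // blocking, matching the behavior described above.
    @Override public synchronized Runnable steal() {
        return deque.pollLast();
    }
}
```

In a running system, the Daemon object would be exported with UnicastRemoteObject and looked up by thieves through the RMI registry.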
4.1.5 Collector Thread
The Collector has a hash table that keeps track of all the jobs that have been spawned and the number of results that have been received so far. The hash table structure is illustrated in Table 1.
Thread ID       Number of Spawns   Number of Results Arrived   Result Vector (indexed by Result ID)
ID of Fib(5)           2                      1                [5, null]
ID of Fib(20)          2                      0                [null, null]
ID of Fib(2)           2                      2                [1, 1]
Table 1: Structure of the Collector's hash table
When a thief returns the execution result of a stolen job, it calls the collector of the victim and returns the result wrapped in a ReturnResultInfo object. Recall from section 4.1.1 that this object provides information for the thief to route the execution results back to the victim, and for the collector of the victim to return the result to the specific spawn of the thread. The ReturnResultInfo object contains the following information:
1. A collector pointer, which points to the collector residing on the victim machine.
2. A Thread ID, used to identify the thread that spawned the job. It is also used as an index into the hash table shown in Table 1.
3. A Result ID, used to map the result to the correct spawn in the program.
4. The actual object containing the result.
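The four fields enumerated above can be sketched as a small class. The types are illustrative assumptions (in the real system the collector reference would be an RMI stub):

```java
import java.io.Serializable;

// Sketch of the ReturnResultInfo tuple; field types are assumptions.
class ReturnResultInfo implements Serializable {
    final Object collector;  // 1. reference to the collector on the victim
    final long threadId;     // 2. identifies the spawning thread (hash key)
    final int resultId;      // 3. maps the result to the correct spawn
    final Object result;     // 4. the actual execution result

    ReturnResultInfo(Object collector, long threadId, int resultId, Object result) {
        this.collector = collector;
        this.threadId = threadId;
        this.resultId = resultId;
        this.result = result;
    }
}
```

The thread ID selects the row of Table 1 and the result ID selects the slot in that row's result vector.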
4.2 JAWS Application Programming Interface
The programming interface of JAWS is packaged as a superclass called WorkerClass which all JAWS programs must subclass. It provides methods to spawn parallel jobs and collect results using the JAWS runtime system. The API specifies two methods: spawn() and sync().
1.) int spawn(String className, String methodName, Class[] parameterList, Object[] argumentList, Object instance);
2.) Vector sync();
JAWS allows job spawning at method granularity. That is, calling spawn() is like performing a method invocation and requires information such as the object instance, the class name, the method name, the method signature (parameterList), and the arguments. Spawn() returns a Result ID which is used to look up the correct result from sync(). The method sync() takes no arguments. Its primary purpose is to block the thread and wait for results. When the results for all the jobs spawned before sync() have arrived, the Collector notifies the thread and sync() returns a Vector containing the results. The Result ID is used to index the correct result within the Vector. Note that the Result ID returned by spawn() is the same Result ID stored in the ReturnResultInfo object.
4.3 Use of the JAWS System
A simple example for calculating Fibonacci numbers is shown in figure 2. Spawn() is used to create new Fibonacci jobs. Sync() is called to collect the results of previous outstanding Fibonacci spawns up to the
class Fib extends WorkerClass {
    Object DoFib(Integer d) {
        Object[] argListX, argListY;
        Class[] paramList;
        Integer x, y;
        if (d.intValue() > 1) {
            argListX = new Object[1];
            argListY = new Object[1];
            x = new Integer(d.intValue() - 1);
            y = new Integer(d.intValue() - 2);
            argListX[0] = x;
            argListY[0] = y;
            paramList = new Class[1];
            paramList[0] = x.getClass();
            int rid1 = spawn("Fib", "DoFib", paramList, argListX, new Fib());
            int rid2 = spawn("Fib", "DoFib", paramList, argListY, new Fib());
            Vector resultVector = sync();
            x = (Integer) resultVector.elementAt(rid1);
            y = (Integer) resultVector.elementAt(rid2);
            return (Object) new Integer(x.intValue() + y.intValue());
        }
        // base case: fib(0) or fib(1)
        else {
            return (Object) new Integer(1);
        }
    }
}
Figure 2: A simple JAWS program for calculating Fibonacci using double recursion
5.0 Status and Evaluation
5.1 Speedup
We have implemented three programs using the JAWS infrastructure: calculation of Knary [7], Fibonacci (shown in figure 2), and RC5 key cracking [14]. The recursive version of Fibonacci that we implemented is not optimized for execution on parallel processors. This is to stress-test the worst-case performance of the JAWS scheduler, because the amount of work done per method invocation is minimized. Most of the execution is spent spawning jobs and collecting results. The base case of each job that performs no further spawns is Fib of 0 or 1. The number of jobs to be done for a given argument n is exponential in n. As shown in figure 3 below, the speedup of calculating Fib(17) is quite close to linear.
Figure 3: Speedup of calculating Fib(17)
RC5 key cracking is quite different from Fibonacci in the sense that the job is finished as soon as the correct key is found. The initial job has a given number of unknown bits to be guessed. Each additional job spawned has a range of key values to be tried. The base case is testing 256 guessed key values. By increasing the number of processors, there is a higher probability that the correct key will be found much earlier. This occurs because the job that has the correct key within its range is executed earlier. The graph that demonstrates the increase in performance is shown in figure 4.
Figure 4: Speedup of RC5 key cracking with 24 unknown key bits, compared to the ideal speedup.
The execution time can vary greatly from trial to trial depending on how early the job containing the correct key in its range is scheduled to execute. However, on average, the entire job finishes faster given a larger number of processors, as seen in figure 4. The graph has a slope greater than 1 because the program has a nondeterministic running time, which depends on the execution order.
Theoretically, the execution time on P processors is given by the following formula [5]:

Tp = O(T1/P + T∞)

Tp: the expected time to execute a fully strict computation on P processors using the work stealing algorithm.
T1: the minimum serial execution time of the multithreaded computation.
T∞: the minimum execution time assuming an infinite number of processors; this can also be seen as the critical path length.
P: the number of processors.
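To make the bound concrete, here is a worked example with illustrative numbers (not measured values): with T1 = 100 s of total work and a critical path T∞ = 2 s, the bound on P = 8 processors is on the order of T1/P + T∞:

```java
// Worked example of the work-stealing time bound Tp = O(T1/P + Tinf),
// ignoring the constant hidden by the O(...). Numbers are illustrative.
class SpeedupBound {
    static double expectedTime(double t1, double tinf, int p) {
        return t1 / p + tinf;
    }
}
```

With these numbers, expectedTime(100, 2, 8) = 100/8 + 2 = 14.5 s, a speedup of about 6.9 rather than the ideal 8; once T1/P shrinks toward T∞, adding processors stops helping.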
In practice, the speedup achieved is not quite linear and is affected by the following factors: the granularity of jobs, the depth and degree of parallelism of each job, the number of processors available to do the execution, the critical path of the program, and the inherent parallelism available in the program. The base-case job size needs to be sufficiently large; otherwise, the overhead of communication (stealing and returning results) would dominate and reduce the speedup achieved. The number of processors available is certainly also a factor, for it directly determines the execution time of the initial job. If the number of machines P exceeds the degree of parallelism inherent in the program, then increasing P will not show any improvement in speedup. This is because the number of jobs to be distributed is limited and smaller than the number of machines available to do work. The critical path is important because the execution time on P processors is always greater than the larger of the critical path (T∞) and T1/P. To find out the actual speedup of JAWS, T∞ needs to be small relative to T1/P.
The speedup of Fibonacci in figure 3 reveals that T∞ dominates Tp, mostly due to the RMI overhead of stealing, when P is greater than four, and this limits the speedup achieved. This is best illustrated by figure 1 in Appendix A, which shows that machines A and C are constantly stealing for 50% of the execution time, because the jobs they steal are both fine-grained and spawn very few child jobs. In fact, as the number of machines increases, the amount of stealing also increases.
To further investigate this issue, we use the synthetic benchmark Knary, referenced in the Cilk manual [7]. Knary allows the programmer to tune three parameters: the granularity of the job, the degree of parallelism, and the critical path (depth) of the program. Figure 5 shows the speedup of Knary with fine granularity, a small degree of parallelism, and a small depth.
Figure 5: Speedup of Knary(20,100,3), compared to the ideal speedup.
Note that for a small degree of parallelism and depth, there will be numerous stealing activities and T∞ will be large. Therefore, speedup is limited. We then tune the granularity, degree of parallelism, and depth individually to understand the actual effects of these parameters in JAWS. Note that although the three cases are independent of each other, the total amount of work T1 is the same in all three cases. The following graph shows the result.
Figure 6: Speedup for the three specially tuned Knary Programs
The above figure shows that by increasing the number of child jobs, either with more depth or more parallelism, better speedup is achieved. This occurs because the number of steals decreases with increased depth and parallelism. Consequently, T∞ decreases and speedup improves. This explanation is further supported by figure 2 in Appendix A, which shows the average deque size when running the more-depth case with four machines. Contrasted with the Fib(24) case, this graph clearly shows that each machine is busy for most of the execution time with far fewer steals. Therefore, with four machines, Knary with sufficient parallelism or depth can achieve a better speedup than Fib(24).
Figure 6 also shows that the effect of increasing the granularity is not as significant as increasing either parallelism or depth. However, if the granularity is not coarse enough, it is not worthwhile to spawn jobs and execute them on other machines, because the cost of communication is much larger than the job execution time. With increased granularity, stealing activities can be reduced; therefore, better speedup results from the decreased T∞.
Figure 6 (chart): Comparing the speedup for the 3 cases (coarser grain compared to Figure 5, more parallelism, and more depth) against the ideal speedup, for 0 to 20 machines.
5.2.1 Experimental Setup / Data Collection:
The test bed is a Sun UltraSPARC 1, running Solaris 2.6 with JDK 1.1.3 using native threads. Unless otherwise stated, the JAWS program being profiled is Fib(19). We obtained profiling data through two mechanisms. The first mechanism is the built-in profiler in the JDK, which gives output of the following form:

Frequency   Callee           Caller            Time
500         FunctionChild1   FunctionParentA   220ms
1500        FunctionChild2   FunctionParentB   530ms
...

Each method invocation is logged: a counter keeps the total number of calls, and a timer records the total time spent in the callee.
However, turning on the profiler, which logs every single call, inflates the execution time of Fib(19) by about 50%. In addition, when a thread switch happens, the timer keeps running, so the time reported by the profiler can only be regarded as an upper bound on the actual execution time. However, the frequencies of calls are extremely helpful; they give us hints as to where optimizations might help.
The second profiling mechanism is to print out the differences reported by the machine timer (System.currentTimeMillis()) before and after the function call being investigated. This method limits the logging overhead because only the calls of interest are traced. But it is also of limited use because the resolution of the timer provided by Java is 1 ms. We timed the execution of System.currentTimeMillis() itself and found it to be only about 1.3 us, which is low enough to be ignored for our purposes.
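The second mechanism can be sketched as a small helper; because the timer's resolution is 1 ms, a short call must be repeated in a loop and the elapsed time averaged (a sketch, not the paper's actual harness):

```java
// Sketch of timing with System.currentTimeMillis(): bracket a loop of
// calls and average, since a single short call is below the 1 ms resolution.
class TimeIt {
    static double averageMillis(Runnable call, int iterations) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) call.run();
        long elapsed = System.currentTimeMillis() - start;
        return (double) elapsed / iterations;
    }
}
```

The loop overhead and the cost of the timer call itself (about 1.3 us, per the measurement above) are small enough to ignore at this granularity.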
5.2.2 Experimental Results:
We first run Fib(19) seven times, then discard the fastest and slowest runs. We find the average of the remaining running times to be 20.20 seconds. As a reminder, we are running Fib(19) on a single machine throughout this set of benchmarks, and hence stealing never occurs. After analyzing the profiler output and our own traces, we identified two main problem areas with high overhead.
1. Reflection:
Each job in JAWS is stored as a <Class_Name, Method_Name_and_Signature, Arguments, Return_Result_Info> tuple. To execute a job, reflection (Class.forName()) is first used to load the class (code) into the JVM. Next, reflection (Class.getDeclaredMethod()) is used to look up the appropriate method, based on the method name and the argument types. Finally, the method returned by getDeclaredMethod() is invoked on the object instance.
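The three reflection steps can be sketched as a single helper; the example job below (invoking String.valueOf(int)) is illustrative, not from JAWS:

```java
import java.lang.reflect.Method;

// Sketch of the three reflection steps used to execute a job.
class ReflectiveInvoke {
    static Object execute(String className, String methodName,
                          Class<?>[] paramTypes, Object instance, Object[] args)
            throws Exception {
        Class<?> cls = Class.forName(className);                  // 1. load class
        Method m = cls.getDeclaredMethod(methodName, paramTypes); // 2. look up method
        return m.invoke(instance, args);                          // 3. invoke it
    }
}
```

For a static method such as String.valueOf(int), the instance argument is null.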
Each of these methods is invoked once per job. For Fib(19), each of these methods is invoked 13529 times. It turns out that forName() takes about 166 us per call and getDeclaredMethod() takes about 250 us. We also measured the time to invoke() a dummy method that returns a fixed integer and found it takes about 12.6 us. Together, the overhead of reflection accounts for about (166 us + 250 us + 12.6 us) * 13529 = 5.799 seconds, or 28.7% of the total execution time!
In the next version of JAWS, the results of forName() and getDeclaredMethod() will be cached to reduce the total execution time by about 28%.
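The planned caching could look like the sketch below: memoize the Method object so each (class, method) pair pays the forName()/getDeclaredMethod() cost only once. The class and key scheme are assumptions (for brevity, the key ignores overloaded methods):

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Sketch of caching reflection lookups; not the actual JAWS implementation.
class ReflectionCache {
    private final Map<String, Method> cache = new HashMap<>();

    synchronized Method lookup(String className, String methodName,
                               Class<?>[] paramTypes) throws Exception {
        String key = className + "#" + methodName; // ignores overloads for brevity
        Method m = cache.get(key);
        if (m == null) {                            // first lookup: pay full cost
            Class<?> cls = Class.forName(className);
            m = cls.getDeclaredMethod(methodName, paramTypes);
            cache.put(key, m);                      // later lookups are a map hit
        }
        return m;
    }
}
```

Since Fib(19) repeatedly executes the same method, nearly all of the 13529 lookups become cache hits, which is where the estimated 28% saving comes from.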
To roughly estimate the overhead of the explicit wait() and notify() calls, we set up an experiment in which two threads ping-pong each other using wait() and notify() on a shared lock.
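The original listing for this experiment did not survive extraction, so the following is a hedged reconstruction of what such a ping-pong benchmark might look like: two threads alternately wait() and notify() on a shared lock, and the total elapsed time divided by the number of handoffs estimates the per-handoff cost.

```java
// Reconstruction (assumed, not the paper's code) of a wait()/notify()
// ping-pong benchmark between two threads sharing one lock.
class PingPong {
    private final Object lock = new Object();
    private boolean ping = true; // whose turn it is

    double averageHandoffMillis(int rounds) throws InterruptedException {
        Thread other = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    synchronized (lock) {
                        while (ping) lock.wait(); // wait for our turn
                        ping = true;
                        lock.notify();            // hand control back
                    }
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        long start = System.currentTimeMillis();
        other.start();
        synchronized (lock) {
            for (int i = 0; i < rounds; i++) {
                ping = false;
                lock.notify();                    // hand control to the other thread
                while (!ping) lock.wait();        // wait until it hands back
            }
        }
        other.join();
        return (System.currentTimeMillis() - start) / (double) (2 * rounds);
    }
}
```

Each round performs two handoffs, so dividing by 2 * rounds yields the average cost of a single wait()/notify() exchange.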
8.0 References
[1] Tom Anderson, David Culler, David Patterson. A Case for Networks of Workstations: NOW. IEEE Micro, February 1995.
[2] Ken Arnold and James Gosling. The Java Programming Language. Addison-Wesley, 1996.
[3] J. Eric Baldeschwieler, Robert D. Blumofe, and Eric A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proceedings of the Seventh ACM SIGOPS European Workshop: Systems Support for Worldwide Applications, September 9-11, 1996, Connemara, Ireland.
[4] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS '94), Santa Fe, New Mexico, November 20-22, 1994.
[6] Bernd O. Christiansen, Peter Cappello, Mihai F. Ionescu, Michael O. Neary, Klaus E. Schauser, and Daniel Wu. Javelin: Internet-Based Parallel Computing Using Java.
[7] Cilk 5.2 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science.
[8] Gary Cornell, Cay Horstmann. Core Java 1.1, Volume II: Advanced Features. Prentice Hall, 1998.
[9] Mark Crovella, Prakash Das, Czarek Dubnicki, Thomas LeBlanc, and Evangelos Markatos. Multiprogramming on Multiprocessors. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, December 1991.
[10] David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel Programming in Split-C. http://www.cs.berkeley.edu/projects/parallel/castle/split-c/split-c.tr.html
[11] GLUnix: Global Layer Unix. http://now.CS.Berkeley.EDU/Glunix/glunix.html
[12] Java: http://www.java.sun.com
[13] James Gosling, Bill Joy, Guy Steele. The Java Language Specification. Addison-Wesley, 1996. Sun Microsystems, Inc. http://java.sun.com
[14] RC5 key cracking information. Available from the World Wide Web: http://userzweb.lightspeed.net/~gregh/rc5
[15] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S.
Appendix A
Figure 1: Deque sizes across 4 different machines (A, B, C, D) running Fib(24), plotted over time in ms. A circle indicates a steal by that machine.
Figure 2: Deque sizes across 4 different machines (A, B, C, D) running Knary with parallel=20, work=100, and depth=4, plotted over time in ms. A circle indicates a steal by that machine.