
    JAWS: A Java Work Stealing Scheduler Over

    a Network of Workstations

Computer Science Division
The University of California at Berkeley
Z. Morley Mao, Hoi-Sheung W. So, Alec Woo
{zmao, so, awoo}@cs.berkeley.edu

    Abstract

In this paper, we present the design and implementation of a parallel programming environment called JAWS (Java Work Stealer). JAWS is implemented as a user-level Java library which schedules user threads over a network of workstations using a work stealing algorithm. The goal of JAWS is to enable programmers to write cross-platform parallel programs that run seamlessly on a network of workstations and adapt automatically to the number of available workstations.

    1.0 Introduction

Today, machine speed doubles every 18 months while hardware prices continue to drop. It would be very desirable to be able to buy newer, faster, and cheaper workstations, connect them to an existing cluster, and effortlessly increase the speed of the cluster. The Network of Workstations (NOW) project [1] proves that building a fast computer out of slower workstations is both feasible and economical. However, writing software for a cluster of heterogeneous machines is difficult because, for performance reasons, parallel programs are mostly written in machine-dependent languages [7, 10]. Existing programs must be recompiled or even modified before running on a new platform. Moreover, maintaining different versions of the same program for different platforms increases administrative costs, thereby reducing the cost-effectiveness of running a heterogeneous network.

Java [2, 12, 13] is a type-safe, cross-platform language which promises that programmers can now "write once, run everywhere". If we can combine the low-cost, high-availability nature of NOW with the cross-platform feature of Java to create a new parallel programming environment, the new environment can potentially deliver on the following promise: "write once, run everywhere, at the same time." It is an extremely powerful concept to write a parallel program just once and have all existing programs continue to run unmodified, just faster, whenever new computers are added to the cluster. Furthermore, as new computers are dynamically added to the cluster, a work stealing scheduler can immediately make use of the new computing resources, while static scheduling algorithms [9, 15] fail to adapt to this dynamically changing cluster environment.

The hard question is: can we turn Java into a parallel programming environment that is scalable and adaptive?


    2.0 Design Overview

JAWS allows programmers to write parallel programs in pure Java that can run on a network of workstations. JAWS consists of two parts: a runtime system, which does work scheduling and load balancing using the work stealing algorithm, and a library that acts as an API for applications to access the JAWS runtime facilities. The runtime system, also known as a ComputeEngine, runs on every node of the NOW cluster all the time, with the nodes constantly contacting each other to discover if there are any jobs to be shared. The API is packaged as a superclass, WorkerClass, which each application subclasses. WorkerClass provides a gateway to spawn parallel jobs using the JAWS runtime system.

3.0 Design Rationale

Throughout the design process, four factors heavily influenced our decisions.

    1. Ease-of-programming

    We hope to make writing parallel programs in JAWS as simple as writing multi-threaded programs.

(a) The parallel programming model of JAWS should be intuitive and easy to learn.
(b) The application programming interface (API) of JAWS should be simple.
(c) The amount of code that the application programmer has to write in order to spawn a parallel job and collect results should be minimized.

    2. Portability:

We chose Java as the implementation language because it is highly portable. A Java program, once compiled, can run on any Java Virtual Machine (JVM) running on top of any hardware architecture. In order to preserve the portability of Java programs, we try very hard to avoid modifying the JVM or using the Java Native Interface (JNI). At present, we require neither the use of JNI nor modifications to the JVM. However, it is conceivable that we may want to use C/C++ to implement some of the performance-critical components of JAWS later. Nevertheless, platform-dependent code should only be used as an aid to improve performance, not to ensure correctness or completeness of JAWS.

    3. Expressive Power

JAWS should allow programmers to expose as much parallelism in their programs as possible. The job of the programmer is to find and expose the parallelism in the program structure. The job of JAWS is to provide powerful mechanisms that help programmers easily express this parallelism. Ideally, if two statements can be executed in parallel, the programmer should be allowed to express this fact. However, given that we want to preserve the syntax of Java, it might not be possible to do so in some cases.

    4. Performance

    For JAWS to have good performance, there are two metrics that we can use to guide the design and

    optimization efforts:

    (a) How well does the speedup of a JAWS program scale with the number of nodes in the cluster?


    4.0 Detailed Design & Implementation

JAWS is inspired by and strongly influenced by the Cilk project [7], but two major differences exist.

1.) Cilk adds a parallel extension to the C language. Its runtime system can gain access to the stack of a running thread through the use of setjmp() and longjmp() calls. JAWS is implemented in Java without modifying the JVM; therefore, it has no way to save and restore a stack frame as is done in Cilk.

2.) The current Cilk implementation assumes that the underlying hardware is a Symmetric Multiprocessor (SMP) machine with shared memory. JAWS targets a network of workstations, which has no shared memory. As a result, communication between Java threads requires message passing.

    4.1 JAWS Runtime System

Each JAWS node has a resident runtime system called the Compute Engine. At the center of each Compute Engine is a double-ended queue (4.1.2) which stores the jobs waiting to be executed. Two threads operate concurrently on the queue: the Worker Thread (4.1.3) and the Daemon Thread (4.1.4). The Worker Thread pops jobs out of the queue and starts executing them. But if the queue is empty, the Worker Thread becomes a thief and contacts other nodes to try to steal a job. The Daemon Thread helps thieves from other machines steal jobs stored in the local queue. After a job has been stolen and executed on a remote node, the results of execution are returned to the original node where the job was spawned. The thief contacts the Collector Thread (4.1.5) of the original node to pass back any execution results. Figure 1 illustrates the internal structure of each Compute Engine and the interactions between different engines.

Figure 1: Internal structure of a Compute Engine (Collector, Worker Thread, Daemon Thread, Deque, and a pool of DoJob Threads, some active and some blocked) and the interaction between two Compute Engines: the Worker Thread pushes and pops jobs on its local Deque, steal requests travel through RMI to the remote Daemon, and returned results wake up the blocked DoJob Thread.


    4.1.1 Job

In JAWS, the smallest unit of work that can be executed in parallel is a method invocation of an object -- a job. A job is a tuple of the form < Class Name, Method_Name_and_Signature, Arguments, Return_Result_Info >. Each job is self-contained, so a job can be migrated from the machine where it was spawned to any other machine in the cluster. Return_Result_Info (4.1.5) stores enough information to allow the execution results to be returned to the machine where the job was spawned.

JAWS assumes each JAWS program has an entry point method called main. To start a program, a job specifying the main() method of the application together with all the parameters is spawned and handed to the JAWS runtime system for execution.
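To make the structure concrete, the following is a minimal sketch of such a self-contained job descriptor; the class and field names are illustrative and not taken from the JAWS source.

import java.io.Serializable;

// Sketch of the job tuple described above (names are illustrative).  A job
// carries everything needed to invoke the method on any node in the cluster.
class Job implements Serializable {
    String   className;         // e.g. "Fib"
    String   methodName;        // e.g. "DoFib"
    Class[]  parameterTypes;    // the method signature
    Object[] arguments;         // the actual arguments (must be serializable)
    Object   instance;          // the object instance to invoke the method on
    Object   returnResultInfo;  // routing information for the result (see 4.1.5)
}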

    4.1.2 Deque

The double-ended queue, or Deque, is the center of the JAWS runtime system. It stores jobs that have been spawned but not yet executed. New jobs are pushed into the queue from one end. Jobs are also popped from the same end for execution on the local host. However, when other machines ask for jobs from the local host, a job is popped from the opposite end of the Deque.
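This two-ended discipline can be sketched as follows; this is a simplified illustration, not the actual JAWS source, and the method names pushBottom(), popBottom(), and popTop() are our own.

import java.util.Vector;

// The local Worker Thread pushes and pops at the "bottom" end; steal
// requests from other machines take the oldest job from the "top" end.
class JobDeque {
    private final Vector jobs = new Vector();

    // Local spawn: push onto the bottom end.
    public synchronized void pushBottom(Object job) {
        jobs.addElement(job);
    }

    // Local execution: pop the most recently pushed job from the same end.
    public synchronized Object popBottom() {
        if (jobs.isEmpty()) return null;
        Object job = jobs.lastElement();
        jobs.removeElementAt(jobs.size() - 1);
        return job;
    }

    // Steal: pop the oldest job from the opposite end.
    public synchronized Object popTop() {
        if (jobs.isEmpty()) return null;
        Object job = jobs.firstElement();
        jobs.removeElementAt(0);
        return job;
    }
}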

    4.1.3 Worker Thread

The job of a Worker Thread is to keep the CPU busy at all times by feeding it with a steady supply of jobs. Jobs are popped from the local Deque if it is non-empty. If all the jobs in the local Deque have been completed, the Worker Thread becomes a thief and attempts to steal a job from a randomly picked node--a victim--in the cluster. If the victim also has no jobs, the Worker Thread will randomly choose another node to steal from. As long as there are no jobs in the local Deque, the stealing process repeats until a job is found.

Once a job has been found, the Worker Thread assigns the job to a DoJob Thread. A DoJob Thread can be thought of as a robot that executes a job. The robot, by itself, cannot do anything useful; the robot needs a program, i.e., a job. A DoJob Thread uses the Java Reflection [17, 8] feature to invoke the method of the object specified in the job. While the DoJob Thread is executing the job, new child jobs may be spawned. It may go to sleep while waiting for results from its child jobs. When all the child jobs are done, the waiting DoJob Thread is awakened. After finishing the job, the DoJob Thread is returned to a pool, ready for reuse. The reason for saving these threads is to minimize the number of threads created. When the Worker Thread runs out of threads, it can create more on demand to execute new jobs.
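Building on the JobDeque sketch above, the Worker Thread's main loop might look roughly like the following; stealFromRandomVictim() and runInDoJobThread() are illustrative placeholders rather than the actual JAWS API.

// Sketch of the Worker Thread's main loop (names are illustrative).
class WorkerThread extends Thread {
    private final JobDeque deque;

    WorkerThread(JobDeque deque) { this.deque = deque; }

    public void run() {
        while (true) {
            Object job = deque.popBottom();     // prefer local work
            while (job == null) {
                job = stealFromRandomVictim();  // become a thief and retry
            }
            runInDoJobThread(job);              // hand the job to a DoJob Thread
        }
    }

    private Object stealFromRandomVictim() {
        // Contact the Daemon of a randomly chosen node (see 4.1.4) and ask
        // for a job; returns null if the victim's Deque is also empty.
        return null;
    }

    private void runInDoJobThread(Object job) {
        // Take a DoJob Thread from the pool (creating one on demand if the
        // pool is empty) and let it invoke the job's method via reflection.
    }
}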

    4.1.4 Daemon Thread

The job of a Daemon Thread is to serve the steal requests from Worker Threads on other nodes. The Daemon is implemented as a remote object which exposes a Remote Method Invocation (RMI) interface [16, 8] to allow other nodes to submit steal requests. When the Daemon receives a steal request, it checks whether the local Deque is empty. If it is not, the Daemon pops the oldest job from the Deque and returns it to the thief. If the local Deque is empty, the Daemon Thread notifies the thief immediately without blocking.
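A remote interface of roughly the following shape would suffice for the Daemon; the interface and method names are assumptions, not the actual JAWS API, and the implementation class would extend UnicastRemoteObject.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Sketch of the steal interface a Daemon could expose over RMI.
interface StealService extends Remote {
    // Called by a thief on another node: returns the oldest job popped from
    // the local Deque, or null immediately if the Deque is empty.
    Object steal() throws RemoteException;
}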


4.1.5 Collector Thread

The Collector has a hash table that keeps track of all the jobs that have been spawned and the number of results that have been received so far. The hash table structure is illustrated in Table 1.

Thread ID       Number of Spawns   Number of Results Arrived   Result Vector (indexed by Result ID)
ID of Fib(5)           2                      1                [5, null]
ID of Fib(20)          2                      0                [null, null]
ID of Fib(2)           2                      2                [1, 1]

Table 1: Structure of the Collector's hash table

When a thief returns the execution result of a stolen job, it calls the Collector of the victim (the node where the job was spawned) and returns the result wrapped in a ReturnResultInfo object. Recall from section 4.1.1 that this object provides the information the thief needs to route the execution results back to the victim, and the information the victim's Collector needs to return the result to the specific spawn of the thread. The ReturnResultInfo object contains the following information:

1. A collector pointer, which points to the Collector residing on the victim machine.
2. A Thread ID, which identifies the thread that spawned the job. It is also used as an index into the hash table shown in Table 1.
3. A Result ID, which maps the result to the correct spawn in the program.
4. The actual object containing the result.
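A minimal sketch of this object, with illustrative field names and types, might be:

import java.io.Serializable;
import java.rmi.Remote;

// Travels with a stolen job: tells the thief where to send the result and
// tells the receiving Collector which spawn the result belongs to.
class ReturnResultInfo implements Serializable {
    Remote collector;  // remote reference to the Collector on the victim machine
    int    threadId;   // identifies the spawning thread; index into Table 1
    int    resultId;   // maps the result to the correct spawn() in the program
    Object result;     // the actual result object (filled in by the thief)
}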

    4.2 JAWS Application Programming Interface

The programming interface of JAWS is packaged as a superclass called WorkerClass which all JAWS programs must subclass. It provides methods to spawn parallel jobs and collect results using the JAWS runtime system. The API specifies two methods: spawn() and sync().

1.) int spawn(String className, String methodName, Class[] parameterList, Object[] argumentList, Object instance);
2.) Vector sync();

JAWS allows job spawning at method granularity. That is, calling spawn() is like performing a method invocation and requires information such as the object instance, the class name, the method name, the method signature (parameterList), and the arguments. Spawn returns a Result ID which is used to look up the correct result from sync(). The method sync() takes no arguments. Its primary purpose is to block the thread and wait for results. When results for all the jobs spawned before sync() have arrived, the Collector will notify the thread and sync() will return with a Vector containing the results. The Result ID is used to index the correct result within the Vector. Note that the Result ID returned by spawn() is the same Result ID stored in the ReturnResultInfo object.

    4.3 Use of the JAWS System

A simple example for calculating Fibonacci numbers is shown in figure 2. Spawn() is used to create new Fibonacci jobs. Sync() is called to collect the results of all outstanding Fibonacci spawns issued up to that point.


class Fib extends WorkerClass {
    Object DoFib(Integer d) {
        Object[] argListX, argListY;
        Class[] paramList;
        Integer x, y;
        if (d.intValue() > 1) {
            argListX = new Object[1];
            argListY = new Object[1];
            x = new Integer(d.intValue() - 1);
            y = new Integer(d.intValue() - 2);
            argListX[0] = x;
            argListY[0] = y;
            paramList = new Class[1];
            paramList[0] = x.getClass();
            int rid1 = spawn("Fib", "DoFib", paramList, argListX, new Fib());
            int rid2 = spawn("Fib", "DoFib", paramList, argListY, new Fib());
            Vector resultVector = sync();
            x = (Integer) resultVector.elementAt(rid1);
            y = (Integer) resultVector.elementAt(rid2);
            return (Object) new Integer(x.intValue() + y.intValue());
        }
        // base case: fib(0) or fib(1)
        else {
            return (Object) new Integer(1);
        }
    }
}

    Figure 2: A simple JAWS program for calculating Fibonacci using double recursion

    5.0 Status and Evaluation

    5.1 Speedup

We have implemented three programs using the JAWS infrastructure: calculation of Knary [7], Fibonacci (shown in figure 2), and RC5 key cracking [14]. The recursive version of Fibonacci that we implemented is not optimized for execution on parallel processors. This is to stress test the worst-case performance of the JAWS scheduler, because the amount of work done per method invocation is minimized. Most of the execution is spent spawning jobs and collecting results. The base case of each job that performs no further spawns is Fib of 0 or 1. The number of jobs to be done for a given argument n is exponential with respect to n. As shown in figure 3 below, the speedup of calculating Fib(17) is quite close to linear.


    Figure 3: Speedup of calculating Fib(17)

RC5 key cracking is quite different from Fibonacci in the sense that the job is finished as soon as the correct key is found. The initial job has a given number of unknown bits to be guessed. Each additional job spawned has a range of key values to be tried. The base case is testing 256 guessed key values. By increasing the number of processors, there is a higher probability that the correct key will be found much earlier. This occurs because the job that contains the correct key within its range is likely to be executed earlier. The graph that demonstrates the increase in performance is shown in figure 4.

Figure 4: Speedup of RC5 key cracking with 24 unknown key bits, plotted against the number of processors together with the ideal speedup.


The execution time can vary greatly from trial to trial depending on how early the job containing the correct key in its range is scheduled to execute. However, on average, the entire job is finished faster given a larger number of processors, as seen in figure 4. The graph has a slope greater than 1. The program has a nondeterministic running time, which is dependent on the execution order.

Theoretically, the execution time on P processors is given by the following formula [5]:

    Tp = O(T1/P + T∞)

where

    Tp: the expected time to execute a fully strict computation on P processors using the work stealing algorithm;
    T1: the minimum serial execution time of the multithreaded computation;
    T∞: the minimum execution time assuming an infinite number of processors, which can also be seen as the critical path length;
    P: the number of processors.
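For illustration, with assumed numbers rather than JAWS measurements: a computation with T1 = 100 seconds of total work and a critical path of T∞ = 2 seconds running on P = 8 processors would need roughly

    Tp ≈ T1/P + T∞ = 100/8 + 2 = 14.5 seconds,

a speedup of about 6.9 rather than the ideal 8. As P grows, the T∞ term eventually dominates and caps the achievable speedup.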

In practice, the speedup achieved is not quite linear and is affected by the following factors: the granularity of jobs, the depth and degree of parallelism of each job, the number of processors available to do execution, the critical path of the program, and the inherent parallelism available in the program. The base-case job size needs to be sufficiently large; otherwise, the overhead of communication (stealing and returning results) would dominate and reduce the speedup achieved. The number of processors available is certainly also a factor, for it directly determines the execution time of the initial job. If the number of machines P exceeds the degree of parallelism inherent in the program, then increasing P will not show any improvement in speedup. This is because the number of jobs to be distributed is limited and less than the number of machines available to do work. The critical path is important because the execution time on P processors is always greater than the larger of the critical path (T∞) and T1/P. To find out the actual speedup of JAWS, T∞ needs to be small relative to T1/P.

The speedup of Fibonacci in figure 3 reveals that T∞ dominates Tp, mostly due to the RMI overhead of stealing, when P is greater than four, and therefore limits the speedup achieved. This is best illustrated by figure 1 in Appendix A. It shows that machine A and machine C are stealing for roughly 50% of the execution time, because the jobs they steal are both fine-grained and spawn very few child jobs. In fact, as the number of machines increases, the amount of stealing also increases.

To further investigate this issue, we use the synthetic benchmark Knary, referenced in the Cilk manual [7]. Knary allows the programmer to tune these parameters: the granularity of the job, the degree of parallelism, and the critical path (depth) of the program. Figure 5 shows the speedup of Knary with fine granularity, a small degree of parallelism, and small depth.

Figure 5: Speedup for knary(20,100,3), compared with the ideal speedup.


Note that for a small degree of parallelism and a small depth, there will be numerous stealing activities and T∞ will be large. Therefore, speedup is limited. We then tune the granularity, degree of parallelism, and depth individually to understand the actual effects of these parameters in JAWS. Note that although the three cases are independent of each other, the total amount of work T1 is the same in all three cases. The following graph shows the result.

Figure 6: Speedup for the three specially tuned Knary programs (coarser grain than in Figure 5, more parallelism, and more depth), plotted against the number of machines together with the ideal speedup

The above figure shows that by increasing the number of child jobs, either with more depth or more parallelism, better speedup is achieved. This occurs because the number of steals decreases with increased depth and parallelism. Consequently, T∞ decreases and speedup improves. This explanation is further supported by figure 2 in Appendix A, which shows the average deque size when running the more-depth case with four machines. Contrasted with the Fib(24) case, this graph clearly shows that each machine is busy for most of the execution time with far fewer steals. Therefore, with four machines, Knary with sufficient parallelism or depth can achieve a better speedup than Fib(24).

Figure 6 also shows that the effect of increasing the granularity is not as significant as that of increasing either parallelism or depth. However, if the granularity is not coarse enough, it is not worthwhile to spawn jobs and execute them on other machines, because the cost of communication is much larger than the job execution time. With increased granularity, stealing activities can be reduced; therefore, better speedup results due to a decreased T∞.



    5.2.1 Experimental Setup / Data Collection:

The test bed is a Sun UltraSPARC 1 running Solaris 2.6 with JDK 1.1.3 using native threads. Unless otherwise stated, the JAWS program being profiled is Fib(19). We obtained profiling data through two mechanisms. The first mechanism is the built-in profiler in the JDK, which gives output of the following form:

    Frequency   Callee           Caller            Time
    500         FunctionChild1   FunctionParentA   220ms
    1500        FunctionChild2   FunctionParentB   530ms
    ...

Each method invocation is logged: a counter keeps the total number of calls, and a timer accumulates the total time spent in the callee. However, turning on the profiler logs every single call and inflates the execution time of Fib(19) by about 50%. In addition, when a thread switch happens, the timer keeps running, so the time reported by the profiler can only be treated as an upper bound on the actual execution time. The frequencies of calls, however, are extremely helpful; they give us hints as to where optimizations might help.

The second profiling mechanism is to print out the differences reported by the machine timer (System.currentTimeMillis()) before and after the function call to be investigated. This method limits the logging overhead because only calls of interest are traced. But it also has limited use because the resolution of the timer provided by Java is 1 ms. We timed the execution of System.currentTimeMillis() itself and found it to be only about 1.3 us, which is low enough to be ignored for our purposes.
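A sketch of this second mechanism follows; timedCall() is a stand-in for whatever JAWS method is being investigated, not part of JAWS itself.

// Bracket the call of interest with the machine timer.
class TimingExample {
    static void timedCall() {
        // ... the call being profiled, e.g. a single spawn() ...
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        timedCall();
        long elapsed = System.currentTimeMillis() - start;
        // The timer's resolution is only 1 ms, so very short calls should be
        // executed in a loop and the total divided by the iteration count.
        System.out.println("elapsed: " + elapsed + " ms");
    }
}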

    5.2.2 Experimental Results:

We first run Fib(19) seven times, then discard the fastest and slowest runs. We find the average of the remaining running times to be 20.20 seconds. As a reminder, we are running Fib(19) on a single machine throughout this set of benchmarks, and hence stealing never occurs. After analyzing the profiler output and our own traces, we identified two main problem areas with high overhead.

    1. Reflection:

Each job in JAWS is stored as the tuple described in section 4.1.1. To execute a job, reflection (Class.forName()) is first used to load the class (code) into the JVM. Next, reflection (Class.getDeclaredMethod()) is used to look up the appropriate method, based on the method name and the argument types. Finally, the method returned by getDeclaredMethod() is invoked on the object instance. Each of these methods is invoked once for each job. For Fib(19), each of these methods is invoked 13529 times. It turns out that forName() takes about 166 us per call and getDeclaredMethod() about 250 us. We also measured invoke() on a dummy function which returns a fixed integer and found it takes about 12.6 us. Together, the overhead of reflection accounts for about (166 us + 250 us + 12.6 us) * 13529 = 5.799 seconds, or 28.7% of the total execution time!

In the next version of JAWS, the results of forName() and getDeclaredMethod() are cached, reducing the total execution time by about 28%.
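A minimal sketch of such a cache, keyed on the class and method name, might look like the following; this is illustrative, not the actual JAWS source.

import java.lang.reflect.Method;
import java.util.Hashtable;

// The Class and Method objects are looked up once per (class, method) pair
// and reused for every subsequent job, leaving only the cheap invoke() call
// on the critical path.
class MethodCache {
    private final Hashtable cache = new Hashtable(); // key -> Method

    public Method lookup(String className, String methodName, Class[] paramTypes)
            throws Exception {
        // A real cache would also key on the parameter types to handle overloading.
        String key = className + "#" + methodName;
        Method m = (Method) cache.get(key);
        if (m == null) {
            Class c = Class.forName(className);              // ~166 us, paid once
            m = c.getDeclaredMethod(methodName, paramTypes); // ~250 us, paid once
            cache.put(key, m);
        }
        return m;
    }
}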


2. Synchronization:

To roughly estimate the overhead of the explicit wait() and notify() calls, we set up an experiment in which two threads ping-pong each other: in a loop, each thread notifies the other and then waits its turn.
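A minimal sketch of such a ping-pong experiment is shown below; the class name, loop structure, and iteration count are assumptions rather than the original listing.

// Two threads alternately wake each other up through a shared lock; dividing
// the total time by the number of hand-offs estimates the wait()/notify()
// overhead per hand-off.
class PingPong {
    static final int ROUNDS = 10000;
    static boolean pingTurn = true;
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread ping = new Thread(new Runnable() {
            public void run() { play(true); }
        });
        Thread pong = new Thread(new Runnable() {
            public void run() { play(false); }
        });
        long start = System.currentTimeMillis();
        ping.start();
        pong.start();
        ping.join();
        pong.join();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("us per hand-off: " + (elapsed * 1000.0) / (2.0 * ROUNDS));
    }

    static void play(boolean isPing) {
        synchronized (lock) {
            for (int i = 0; i < ROUNDS; i++) {
                while (pingTurn != isPing) {
                    try { lock.wait(); } catch (InterruptedException e) { return; }
                }
                pingTurn = !isPing;   // hand the turn to the other thread
                lock.notify();
            }
        }
    }
}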


    8.0 References

[1] Tom Anderson, David Culler, and David Patterson. A Case for Networks of Workstations: NOW. IEEE Micro, February 1995.

[2] Ken Arnold and James Gosling. The Java Programming Language. Addison-Wesley, 1996.

[3] J. Eric Baldeschwieler, Robert D. Blumofe, and Eric A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proceedings of the Seventh ACM SIGOPS European Workshop: Systems Support for Worldwide Applications, September 9-11, 1996, Connemara, Ireland.

[4] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.

[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS '94), Santa Fe, New Mexico, November 20-22, 1994.

[6] Bernd O. Christiansen, Peter Cappello, Mihai F. Ionescu, Michael O. Neary, Klaus E. Schauser, and Daniel Wu. Javelin: Internet-Based Parallel Computing Using Java.

[7] Cilk 5.2 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science.

[8] Gary Cornell and Cay Horstmann. Core Java 1.1, Volume 2: Advanced Features. Prentice Hall, 1998.

[9] Mark Crovella, Prakash Das, Czarek Dubnicki, Thomas LeBlanc, and Evangelos Markatos. Multiprogramming on Multiprocessors. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, December 1991.

[10] David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel Programming in Split-C. http://www.cs.berkeley.edu/projects/parallel/castle/split-c/split-c.tr.html

[11] GLUnix: Global Layer Unix. http://now.CS.Berkeley.EDU/Glunix/glunix.html

[12] Java: http://www.java.sun.com

[13] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, 1996. Sun Microsystems, Inc. http://java.sun.com

[14] RC5 key cracking information. Available from the World Wide Web: http://userzweb.lightspeed.net/~gregh/rc5

[15] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S.


Appendix A

Figure 1: Deque sizes of the four machines (A, B, C, and D) running Fib(24), plotted against time (ms). A circle indicates a steal by that machine.


Figure 2: Deque sizes of the four machines (A, B, C, and D) running Knary with parallel=20, work=100, and depth=4, plotted against time (ms). A circle indicates a steal by that machine.