

Evolution of a Java just-in-time compiler for IA-32 platforms

T. Suganuma, T. Ogasawara, K. Kawachiya, M. Takeuchi, K. Ishizaki, A. Koseki, T. Inagaki, T. Yasue, M. Kawahito, T. Onodera, H. Komatsu, and T. Nakatani

Java has gained widespread popularity in the industry, and an efficient Java virtual machine (JVM) and just-in-time (JIT) compiler are crucial in providing high performance for Java applications. This paper describes the design and implementation of our JIT compiler for IA-32 platforms by focusing on the recent advances achieved in the past several years. We first present the dynamic optimization framework, which focuses the expensive optimization efforts only on performance-critical methods, thus helping to manage the total compilation overhead. We then describe the platform-independent features, which include the conversion from the stack-semantic Java bytecode into our register-based intermediate representation (IR) and a variety of aggressive optimizations applied to the IR. We also present some techniques specific to the IA-32 used to improve code quality, especially for the efficient use of the small number of registers on that platform. Using several industry-standard benchmark programs, the experimental results show that our approach offers high performance with low compilation overhead. Most of the techniques presented here are included in the IBM JIT compiler product, integrated into the IBM Development Kit for Microsoft Windows, Java Technology Edition Version 1.4.0.

Introduction

Java** has gained widespread popularity, and an efficient Java virtual machine (JVM**) and just-in-time (JIT) compiler are crucial in providing high performance for Java applications. Since the compilation time overhead of a JIT compiler, in contrast to that of a conventional static compiler, is included in the program execution time, JIT compilers should be designed to manage the conflicting requirements between fast compilation speed and fast execution performance. That is, we would like the system to generate highly efficient code for good performance, but at the same time, the system should be lightweight enough to avoid any startup delays or intermittent execution pauses caused by the runtime overhead of the dynamic compilation.

We previously described the design of version 3.0 of our JIT compiler [1] that was integrated into the IBM Development Kit (DK) 1.1.7. It was designed to apply only relatively lightweight optimizations equally to all of the methods so that compilation overhead would be minimized but competitive overall performance could still be achieved at the time of product release. We used an intermediate representation (IR) called extended bytecode (EBC), a compact and stack-based representation similar to the original Java bytecode. The conversion from the bytecode to the IR is relatively straightforward. Using the IR, we performed both classic optimizations, such as constant propagation, dead code elimination, and common subexpression elimination, and Java-specific optimizations, such as exception check elimination. Without imposing a large impact on compilation overhead, we applied method inlining only to small target methods to alleviate the performance problem caused by frequent calls of relatively small methods. The devirtualization of virtual method call sites was performed using the guard code for testing the target method.



In the previous version, we also introduced a mixed-mode interpreter (MMI) to allow efficient mixed execution between interpreted mode and compiled code execution mode. We begin the program execution using the MMI as the first execution mode. When the frequently executed methods or critical hot spots of the program (a hot spot is a method invocation within a loop or an indirect method invocation from a loop) are identified using method invocation and loop iteration counters, we invoke the JIT compiler to obtain better performance for those selected methods. The MMI handles a large number of performance-insensitive methods, typically 80% or more of the executed methods in the program, and thus provides an efficient execution mode with no compilation overhead.
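As a rough illustration of how such counters can drive compilation (a minimal sketch only; the counter representation, the threshold value, and the compiler interface are assumptions for illustration, not our product code):

    // Minimal sketch of counter-based promotion from the MMI to the JIT compiler.
    // The names, the threshold, and the JitCompiler interface are illustrative assumptions.
    final class MmiCounters {
        private static final int HOTNESS_THRESHOLD = 1000;   // assumed value
        private final java.util.Map<String, Integer> counters = new java.util.HashMap<>();

        interface JitCompiler { void compile(String methodId); }

        // Called by the interpreter on every method invocation and on every loop back edge.
        void tick(String methodId, JitCompiler jit) {
            int count = counters.merge(methodId, 1, Integer::sum);
            if (count == HOTNESS_THRESHOLD) {
                jit.compile(methodId);   // the method has become hot; hand it to the compiler
            }
        }
    }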

To achieve higher performance, we needed to apply more advanced optimization techniques, including aggressive method inlining, dataflow-based optimizations, loop optimizations, and more sophisticated register allocation. However, from the viewpoint of JIT compilers, these are all relatively expensive optimizations, and simply putting these optimizations in the existing JIT framework might have caused unacceptably high compilation overhead, typically large delays in application startup time. Using a two-level execution model with the MMI and a single-level, highly optimizing JIT compiler, the system would not be able to manage the balance between optimization effectiveness and compilation overhead because of the large gap in the cost/benefit tradeoff between the two execution modes. It was therefore imperative to provide multiple, reasonable steps in the compilation levels with well-balanced tradeoffs between cost and expected performance, from which an adequate level of optimization could be selected that would correspond to the execution context.

In this paper, we describe the design and implementation of version 4.5 of our Java JIT compiler, specifically developed for IA-32 platforms, which has been integrated into the IBM DK 1.4.0. To manage the tradeoff, we did several things:

● In addition to the EBC, we introduced two register-based intermediate representations to perform more effective and advanced optimizations. We evaluated existing and new optimizations by considering the compilation costs (both space and time) and applied each of them on an appropriate IR.

● We constructed a dynamic optimization framework using a multilevel recompilation system. This enabled us to focus expensive optimization efforts on only performance-critical methods. We classified all optimizations into several different levels based on the cost and the benefit of each optimization. The profiling system continuously monitors the hot spots of a program and provides information for method promotion.

● We evaluated our approach using industry-standard benchmarks and obtained significant performance improvement for each optimization, while keeping the compilation overhead (in terms of compilation time, compiled code size, and compilation peak memory usage) at a low level.

Since this paper focuses on the evolution of our JIT compiler from the previous version 3.0 in terms of its internal architecture and a variety of added optimizations, we do not address profile-directed optimizations, which are more advanced techniques. The goal of this paper is to describe what we have done in the design of our JIT compiler to balance both the performance improvement and compilation overhead without using the profile-directed optimizations.

This paper is organized as follows. We first give an overview of our JIT compilation system, covering both the overall compiler structure with three different IRs and a dynamic optimization framework using the interpreter and multilevel optimizations. We then describe the platform-independent optimizations performed on each IR, followed by IA-32-specific optimizations, especially for the efficient use of the small number of registers. We next present the experimental results on both performance and compilation overhead. Finally, we summarize the related work and then conclude.

System architecture

This section provides an overview of our system. We present the overall structure of the JIT compiler by describing the three different IRs and list the major optimizations performed on each IR. We then describe the dynamic optimization framework, in which each optimization performed in the three IRs is mapped to an appropriate optimization level based on the compilation cost and the performance benefit.

Structure of the JIT compiler

Figure 1 shows the overall structure of our JIT compiler. We employ three different IRs to perform a variety of optimizations. At the beginning, the given bytecode sequence is converted to the first IR, called the extended bytecode (EBC), which was used in the previous version of our compiler as the sole IR. The EBC is stack-based and very similar to the original bytecode, but it annotates additional type information for the destination operand in each instruction. Most EBC instructions have a one-to-one mapping with the corresponding bytecode instructions, but there are additional operators to explicitly express some operations resulting from method inlining and devirtualization.



For example, if we inline a synchronized method, we insert opc_syncenter and opc_syncexit instructions at the entry and exit of the inlined code, respectively. This allows us to easily identify any opportunity to optimize redundant synchronization operations exposed by several stages of inlining synchronized methods. The instruction opc_cha_patch is used to indicate the code location for the runtime patch when the unguarded devirtualized code is invalidated due to the dynamic class loading.
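As a source-level illustration (the Java code below is only an example; the opc_syncenter/opc_syncexit operators are inserted by the compiler, not written by the programmer):

    // When get() is inlined into sum(), the monitor enter/exit of the synchronized callee is
    // expressed as opc_syncenter/opc_syncexit around the inlined body. The inlined call on
    // "this" then locks a monitor that is already held, a redundancy visible on the EBC.
    class Counter {
        private int value;

        synchronized int get() {
            return value;
        }

        synchronized int sum(Counter other) {
            return get() + other.get();
        }
    }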

Method devirtualization and method inlining are the two most important optimizations applied on this IR. The class hierarchy is constructed and used, together with the result of the object typeflow analysis, to identify the compile-time monomorphic virtual call sites. The devirtualized call sites and static/nonvirtual call sites are then considered for inlining. We have two separate budgets for performing inlining, one for tiny methods and the other for all other types of methods. Inlining a tiny method is always considered beneficial without imposing any harmful effects on compilation time and code size growth. In the next section we give a more detailed description of our inlining policy. Other optimizations performed on EBC include switch statement optimization, typeflow analysis, and exception check elimination.

The EBC is then translated to the second IR, called quadruples. This is a register-semantic IR. The quadruples are n-tuple representations with an operator and zero or more operands. We have a set of fine-grain operators in quadruples for subsequent optimizations. For example, the exception checking operations implicitly assumed in some bytecode instructions are explicitly expressed in this IR for optimizers to easily and effectively identify redundant exception checking operations. Similarly, the class initialization checking operation is also explicitly expressed if the target class is resolved but not yet initialized for the relevant bytecode instructions. Another example is the virtual method invocation. The single bytecode instruction is separated into several operations: the operation for setting each argument, the null pointer check of the receiver object, the method table load, the method block load, and finally the method invocation itself. This increases the opportunities for performing commoning optimization for some of these operations between successive virtual method invocations.
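The following Java fragment illustrates why the decomposition helps (illustrative only; the comments describe the fine-grained quadruples conceptually):

    // Two virtual calls on the same receiver: once each invokevirtual is broken into
    // argument setting, null check, method table load, method block load, and call,
    // the null check of s and the load of s's method table are identical for both calls
    // and can be commoned; only the method block load and the call itself differ.
    interface Shape { void moveTo(int x, int y); void lineTo(int x, int y); }

    class Renderer {
        void draw(Shape s) {
            s.moveTo(1, 2);
            s.lineTo(3, 4);
        }
    }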

The translation from EBC to quadruples is based on an abstract interpretation of the stack operations. In this process, we treat both local variables and stack variables in the same way to convert them to symbolic registers. Figure 2 shows a simple example of the translation. The direct translation of stack operations produces many redundant copy operations, as shown in Figure 2(b).

We apply copy propagation and dead code elimination immediately after the translation, and most of the redundancies that result from the direct translation of the stack operations can be eliminated, as shown in Figure 2(c).

We apply a variety of dataflow-based optimizations on the quadruples. Some dataflow analyses, exception check elimination, common subexpression elimination, and privatization of memory accesses are iterated to take advantage of the fact that the application of one optimization creates new opportunities for the other optimizations. The maximum number of iterations is limited in consideration of its impact on the compilation overhead (with different threshold values used between the methods containing loops and those without loops). The other optimizations on this IR include escape analysis, synchronization optimization, and athrow inlining (all described in the section on optimizations on quadruples).

The third IR is called a directed acyclic graph (DAG). (We ignore the back edges of the loop when we call the IR acyclic.) This is also a register-based representation and is in the form of static single assignment (SSA) [3].

Figure 1  Structure of the JIT compiler. (DAG: directed acyclic graph. Athrow inlining: see the subsection on athrow inlining and Reference [2].) The compilation pipeline is: bytecode (BC) → BC-to-EBC conversion → EBC optimizations (method devirtualization, method inlining, switch statement optimization, typeflow analysis, exception check elimination) → EBC-to-quadruple conversion → quadruple optimizations (constant/copy propagation, dead code elimination, privatization/commoning, escape analysis, exception check elimination, athrow inlining) → quadruple-to-DAG conversion → DAG-based optimizations (loop versioning, loop simplification, prepass code scheduling) → DAG-to-quadruple conversion → code generation (architecture mapping, mean time to next use analysis, register constraint propagation, native code generation, code scheduling).


It consists of nodes corresponding to quadruple instructions and edges indicating both data dependencies and exception dependencies. Figure 3 shows an example of constructing the DAG for a method containing a loop. This simply indicates how the DAG representation looks and ignores all of the exception checks necessary for the array accesses within the loop. The actual DAG includes the nodes for exception checking instructions and the edges representing exception dependencies.

This IR is designed to perform optimizations, such as loop versioning and prepass code scheduling, that are more expensive but sometimes quite effective. The IR is converted back to quadruples before entering the code generation phase. Thus, the conversion to the DAG and the DAG-based optimizations form a completely optional pass, and we need to apply this set of optimizations judiciously and only for those methods that can benefit from the transformation.

All of the instructions of these three IRs are grouped into basic blocks (BBs). Our BBs are extended in the sense that they are not terminated by instructions that constitute potential exception points, as in the factored control flow graph [4]. The BBs are placed using the branch frequencies collected during the MMI execution. Since the branch history information is limited to the first few executions in the MMI, we do not use code positioning guided solely by profile information [5]. Instead, we combine the use of this information with some heuristics so that we can place backup blocks generated by versioning optimizations at the bottom of the code. We also use the profile information to select the depth-first order for if-then-else blocks; this separates the BBs of frequently and infrequently executed paths.

The compiler is designed to be very flexible so that each optimization can be enabled or disabled for a given method. At a minimum, we need to perform bytecode-to-EBC conversion, EBC-to-quadruple translation, and native code generation from the quadruples, even if all optimizations are skipped.

Figure 2  An example of translation from EBC to quadruples: (a) extended bytecode; (b) conversion to quadruples; (c) after copy propagation.

(a) Extended bytecode:
    ALOAD 1
    IGETFIELD
    ILOAD 2
    IDIV
    ISTORE 2

(b) Conversion to quadruples:
    AMOVE     LA3 = LA1
    NULLCHECK LA3
    IGETFIELD LI3 = LA3, offset
    IMOVE     LI4 = LI2
    IDIV      LI3 = LI3, LI4
    IMOVE     LI2 = LI3

(c) After copy propagation:
    NULLCHECK LA1
    IGETFIELD LI3 = LA1, offset
    IDIV      LI2 = LI3, LI2

Figure 3  Example of constructing a DAG for a loop-containing method: (a) original code; (b) quadruples for the loop body; (c) DAG for the loop body. For the sake of simplicity, this ignores all of the exception checks necessary for the array accesses within the loop.

(a) Original code:
    int i, n;
    float a[], b[], c[], d[];
    float f1, f2, f3, f4;
    ...
    for (i = 0; i < n; i++) {
        f1 += a[i]*d[i];
        f2 += b[i]*c[i];
        f3 += a[i]*c[i];
        f4 += b[i]*d[i];
    }
    ... = f1;
    ... = f2;
    ... = f3;
    ... = f4;

(b) Quadruples for the loop body:
    a: faload f5 = a, i
    b: faload f6 = d, i
    c: fmul   f7 = f5, f6
    d: fadd   f1 = f1, f7
    e: faload f8 = b, i
    f: faload f9 = c, i
    g: fmul   f10 = f8, f9
    h: fadd   f2 = f2, f10
    i: fmul   f11 = f5, f9
    j: fadd   f3 = f3, f11
    k: fmul   f12 = f8, f6
    l: fadd   f4 = f4, f12
    m: iadd   i = i, 1
    n: iflt   i, n

(c) DAG for the loop body: the four faload nodes (a, b, e, f) feed the fmul nodes (c, g, i, k), each of which feeds an fadd node (d, h, j, l) accumulating into f1 through f4 via phi nodes; the iadd (m) and iflt (n) nodes form the loop control. (The graph diagram itself is not reproducible in text form.)


Method inlining can be applied either for tiny methods only or based on more aggressive static heuristics. The number of iterations on the dataflow analyses can be adjusted. The generation of a DAG representation and the optimizations on the DAG are optional. We exploit this capability in the design of the dynamic optimization framework described in the next section.

Dynamic optimization framework

Figure 4 depicts the architecture of our dynamic optimization framework [6]. (©ACM 2001. This subsection is from the work published in [6]; used with permission.) This is a multilevel compilation system with an MMI and three compilation levels (level 1 to level 3). All of the optimizations described in the section above on the structure of the compiler are classified into these three levels. Basically, the lightweight optimizations, such as those whose cost is linear in the size of the target code, are applied in the earlier optimization levels, and those with higher costs in compilation time or greater code size expansion are delayed to the later optimization levels. The classification is based on empirical measurements of both the compilation cost and the performance benefit of each optimization [7].

The three optimization levels in our JIT compiler are the following:

1. Level 1 (L1) optimization employs only a very limited set of optimizations for minimizing the compilation overhead. For the EBC, it performs devirtualization of the dynamically dispatched call sites based on class hierarchy analysis (CHA) [8]. For the resulting devirtualized call sites and static/nonvirtual call sites, it performs inlining only when the target method is tiny. Switch statement optimization is also performed. For quadruples, it enables dataflow optimizations only for very basic copy propagation and dead code elimination that eliminate the redundant operations resulting from the EBC-to-quadruple translation. No other dataflow-based or DAG-based optimizations are performed.

2. Level 2 (L2) optimization enhances level 1 by employing additional optimizations. For the EBC, method inlining is performed not only for tiny methods, but also for all types of methods based on static heuristics. Preexistence analysis using the results of object typeflow analysis is performed on both the EBC and the quadruples to safely remove guard code and backup paths that resulted from devirtualization. For quadruples, it performs a set of dataflow optimizations iterated several times, but the maximum number of iterations is still limited to a low value. The other dataflow-based optimizations, such as synchronization optimization and athrow inlining, are also enabled at this level. DAG-based optimizations are not yet performed.

3. Level 3 (L3) optimization is augmented with all remaining optimizations available in our system. New optimizations enabled at this level include escape analysis (including stack object allocation, scalar replacement, and synchronization elimination), code scheduling, and DAG-based optimizations. It also increases the maximum iteration count for the dataflow-based optimizations on quadruples. There is no difference between L2 and L3 optimizations in terms of the scope and aggressiveness of inlining.

The compilation for L1 optimization is invoked from the MMI and is executed within the application thread, while the compilation for L2 and L3 is performed by a separate compilation thread in the background. The upgrade recompilation from L1-compiled code to higher-level optimized code is triggered on the basis of the hotness level of the compiled method as detected by a timer-based sampling profiler [9]. Depending on the relative hotness level, the method can be promoted from L1-compiled code either to L2 or directly to L3 optimized code.


Figure 4  System architecture of our dynamic optimization framework. (Components shown: bytecode, mixed-mode interpreter, level 1 compile requests, dynamic compiler (level 1 to level 3), code installation into level 1/level 2/level 3 compiled code, compiled method promotion, sampling profiler feeding raw data to hot method monitoring and the hot method list, and a recompilation controller issuing level 2 and level 3 compile requests through a compile plan and queue.)


This decision is made on the basis of different threshold values on the hotness level for each L2 and L3 method promotion.

The sampling profiler periodically monitors the program counters of application threads; it keeps track of methods in threads that are using the most central processing unit time by incrementing a hotness count associated with each method. The profiler keeps the current hot methods in a linked list, sorted by the hotness count, and then groups them together and gives them to the recompilation controller at every fixed interval for upgrade recompilation. The sampling profiler operates continuously during the entire period of program execution to effectively adapt to the behavioral changes of the program.
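A minimal sketch of such a sampling loop follows (the names, the sampling interval, and the hand-off interface are assumptions for illustration; the actual profiler samples thread program counters inside the JVM):

    // Illustrative sketch of a timer-based sampling profiler.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Consumer;
    import java.util.function.Supplier;

    final class SamplingProfiler implements Runnable {
        private final ConcurrentHashMap<String, Integer> hotness = new ConcurrentHashMap<>();
        private final Supplier<List<String>> methodsOnCpu;           // methods currently executing
        private final Consumer<List<String>> recompilationController;

        SamplingProfiler(Supplier<List<String>> methodsOnCpu,
                         Consumer<List<String>> recompilationController) {
            this.methodsOnCpu = methodsOnCpu;
            this.recompilationController = recompilationController;
        }

        @Override public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                for (String m : methodsOnCpu.get()) {
                    hotness.merge(m, 1, Integer::sum);               // bump the hotness count
                }
                List<String> hot = new ArrayList<>(hotness.keySet());
                hot.sort((a, b) -> hotness.get(b) - hotness.get(a)); // sorted by hotness
                recompilationController.accept(hot);                 // hand over at a fixed interval
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
        }
    }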

Platform-independent optimizations

This section provides a detailed description of the platform-independent optimization features performed on each of the three IRs.

Optimizations on EBC

Figure 5 shows the sequence of optimizations performed on EBC. Since EBC is a compact representation, method inlining is performed on this IR. It expands the target procedure body at the corresponding call sites and defines the scope of the compilation target. All of the optimizations on EBC following the method inlining are then performed to simplify the control flow or to eliminate redundant code so that we can reduce the target code size before converting to the second IR, quadruples.

Method inlining

Our dynamic compiler is able to inline any method, regardless of the context of the call sites or the types of caller or callee methods. For example, there is no restriction against inlining synchronized methods or methods with try-catch blocks (exception tables), nor against inlining methods into call sites within a synchronized method or a synchronized block. For runtime handling of exceptions and synchronization, we create a data structure indicating the inline context tree within the method and map each instruction or call site address that may constitute an exception with the corresponding node in the tree structure. Using this information, the runtime system can identify the associated try region identification or necessary object-unlocking actions based on the current program counter. Thus, the inlining decision can be made purely from the cost-benefit estimate for the method being compiled.

Many methods in Java, such as accessor methods, simply result in a small number of instructions of code when compiled; we call these tiny methods. A tiny method is one whose estimated compiled code is equal to or less than the corresponding call site code sequence (argument setting, volatile register saving, and the call itself). That is, the entire body of the method is expected to fit into the space required for the method invocation. Our implementation identifies these methods on the basis of the estimated compiled code size. (This estimate excludes the prologue and epilogue code. The compiled code size is estimated on the basis of the sequence of bytecodes, each of which is assigned an approximate number of instructions generated.)

Since invocation and frame allocation costs outweigh the execution costs of the method bodies for these methods, inlining them is considered completely beneficial without causing any harmful effects in either compilation time or code size expansion. In fact, an empirical study showed that the tiny-only inlining policy has very little effect in increasing compilation time, while it produces significant performance improvements over the no-inline case [10]. Thus, tiny methods are always inlined at all compilation levels. For non-tiny methods, we employ static heuristics to perform method inlining in an aggressive way while keeping the code expansion within a reasonable limit.
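A typical tiny method is a field accessor (illustrative example):

    // getX() and getY() compile to little more than a single field load, which is no larger than
    // the argument setup, call, and return they would replace, so they are inlined at every level.
    class Point {
        private int x;
        private int y;

        int getX() { return x; }
        int getY() { return y; }

        int manhattanDistance(Point other) {
            return Math.abs(getX() - other.getX()) + Math.abs(getY() - other.getY());
        }
    }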

The inliner first builds a (possibly large) call tree of inlined scopes based on allowable call tree depths and callee method sizes; then, to come up with a final tree, it looks at the total cost by checking each individual decision. Whenever it performs inlining on a method, the inliner updates the call tree to encompass the new possible inlining targets within the inlined code. When looking at the total cost, we manage two separate budgets proportional to the original size of the method: one for tiny methods and the other for any type of method. The inliner tries to greedily incorporate as many methods as possible from the given tree using static heuristics until the predetermined budget is used up (a sketch of this greedy selection follows the list of rules below). Currently the static heuristics consist of the following rules:

● If the total number of local variables and stack variables for the method being compiled (both caller and callees) exceeds a threshold, reject the method for inlining.


Figure 5  Sequence of optimizations on EBC. (Phases: method inlining (including devirtualization), idiomatic translation, switch statement optimization, typeflow analysis and optimization, and exception check elimination.)


● If the total estimated size of the compiled code for the methods being compiled (both caller and callees) exceeds a threshold, reject the method for inlining.

● If the estimated size of the compiled code for the target method being inlined (callee only) exceeds a threshold, reject the method for inlining. This is to prevent wasting the total inlining budget on a single excessively large method.

● If the call site is within a basic block that has not yet been executed at the time of the compilation, it is considered a cold block of the method, and the inlining is not performed. On the other hand, if the call site is within a loop, it is considered a hot block, and the inlining is tried for a deeper nest of call chains than for a call site outside a loop.
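The selection itself can be pictured as a greedy walk over the call tree under the two budgets; the sketch below is only a rough approximation (the CallSite fields, the budget factors, and the ordering heuristic are assumptions, not the actual implementation):

    // Hedged sketch of greedy inlining under two budgets (one for tiny methods, one for all methods).
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    final class GreedyInliner {
        static final class CallSite {
            final int estimatedSize;   // estimated compiled-code size of the callee
            final boolean tiny;
            final boolean inLoop;      // call sites inside loops are preferred
            CallSite(int estimatedSize, boolean tiny, boolean inLoop) {
                this.estimatedSize = estimatedSize;
                this.tiny = tiny;
                this.inLoop = inLoop;
            }
        }

        List<CallSite> select(List<CallSite> callTree, int callerSize) {
            int tinyBudget = callerSize;        // both budgets are proportional to the caller's size;
            int generalBudget = 4 * callerSize; // the factors here are assumed, not the product values
            List<CallSite> accepted = new ArrayList<>();
            List<CallSite> ordered = new ArrayList<>(callTree);
            ordered.sort(Comparator.comparing((CallSite c) -> !c.inLoop)   // loop call sites first
                                   .thenComparingInt(c -> c.estimatedSize));
            for (CallSite c : ordered) {
                if (c.tiny && tinyBudget >= c.estimatedSize) {
                    accepted.add(c);
                    tinyBudget -= c.estimatedSize;
                } else if (!c.tiny && generalBudget >= c.estimatedSize) {
                    accepted.add(c);
                    generalBudget -= c.estimatedSize;
                }
            }
            return accepted;
        }
    }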

Throughout the process, the inliner devirtualizes dynamically dispatched call sites using both the CHA and the typeflow analysis. It produces either guarded code (via a class test or method test [11]) or unguarded code (via code patching on invalidation [12]), depending on whether the call site can be assumed to be monomorphic. When more than two target methods are found, it performs devirtualization only when the profile is available and the most beneficial target can be selected. A backup path is generated for both guarded and unguarded devirtualization cases to execute when the compile-time assumption is invalidated at runtime. Thus, it produces a diamond-shaped control flow, with the nonvirtual call on the fast path and the backup path on the other. At this point, the nonvirtual call sites in the fast path may or may not be inlined according to the static rules described above.
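Conceptually, the guarded form corresponds to source like the following (the real guard is a class test or a method test emitted by the compiler, shown here with instanceof purely for readability):

    // Conceptual rendering of a guarded devirtualized call with its backup path.
    interface Shape { int area(); }

    final class Rectangle implements Shape {
        final int w, h;
        Rectangle(int w, int h) { this.w = w; this.h = h; }
        public int area() { return w * h; }
    }

    final class DevirtExample {
        static int area(Shape s) {
            if (s instanceof Rectangle) {                       // guard: call site assumed monomorphic
                return ((Rectangle) s).w * ((Rectangle) s).h;   // fast path: inlined Rectangle.area()
            }
            return s.area();                                    // backup path: original virtual dispatch
        }
    }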

Idiomatic translation

This component recognizes sequences of bytecode instructions or special method calls (primarily the Math class library calls) and replaces those bytecode instructions with simpler operators, mostly binary or unary operators, expressed in the EBC. For example, if the compiler finds a bytecode sequence that results from a max or min operation expressed with explicit control flow (if-then-else) or an arithmetic if operator, this is converted to a single EBC instruction with the corresponding operator. This code is generated using the cmov instruction for efficient execution on IA-32.
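For instance, a max computation written with explicit control flow is collapsed into a single EBC operator; a call to Math.max is a plausible candidate for the same treatment, although which library calls are matched is an implementation detail (illustrative source):

    // The if-then-else form compiles to a compare-and-branch bytecode pattern that the
    // idiomatic translation reduces to one "max" operator, later emitted with cmov on IA-32.
    class IdiomExample {
        static int maxExplicit(int a, int b) {
            int m;
            if (a > b) { m = a; } else { m = b; }
            return m;
        }

        static int maxLibrary(int a, int b) {
            return Math.max(a, b);   // library-call form; assumed to be recognized similarly
        }
    }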

For the Math class library calls, such as the square root, ceil/floor functions, and some transcendental functions, we can exploit the IA-32 native instructions or specially prepared library calls, instead of invoking standard native methods using the Java Native Interface (JNI), if they are not specified as strictfp operations.

Thus, we replace the sequence of bytecode instructions for these method invocations with single binary or unary operations. This significantly simplifies the control flow at compile time and also reduces the path length at execution time. In addition, we can eliminate both F2D and D2F conversions produced for single-precision operations before and after the original math library calls [13]. This is because the math libraries are provided with only double-precision versions, and thus the inefficient conversion operations are necessary when calling the library functions.

Switch statement optimization

The switch statement sometimes has a long list of case labels, and, in general, this makes the generated code inefficient because of the complicated control flow and many compare-and-branch instructions that have to be executed sequentially. In some cases, however, we can optimize the switch statement to simplify the control flow or to decrease the number of compare-and-branch instructions executed before finding the appropriate target statement. We consider two such optimizations here: transformation to a table lookup operation and the extraction of frequently executed case statements.

The transformation to a table lookup operation is made when the switch statement is used as an assignment of a different value to a single local variable according to the input parameter. Figure 6 shows a simple example of this transformation. We first examine the structure of the switch statement to see whether it is applicable for this transformation by checking each case statement. The check includes the number of case labels and the density of case key values, as described below. If the conditions are satisfied, we introduce a compiler-generated table that holds the values as its elements corresponding to each assignment statement and convert the switch and the set of case statements into a simple table lookup operation.

As shown in the figure, the new code consists of the index variable normalization, the range checking of the index, the loading from the compiler-generated table, and the default value setting code for an index outside the table range. If the default setting is not explicitly specified in the original switch statement, the conversion is in a slightly different form. That is, when the index is within the table range but turns out to be a value not specified in the case statements (detected through a dummy value stored in the table), we need to set the variable back to its original value before entering the switch statement.

Because of the base overhead of the index normalization and range checking, the transformation is not actually effective for a small number of case statements. Also, if the value density between the lowest and the highest keys in the case labels is not sufficiently high, the compiler-generated table may simply be a waste of memory.


Thus, we check the total number of case labels and the density of the values in the case labels before applying this transformation.

The second optimization is the extraction of frequently executed case statements. Using the execution frequencies for individual cases and default statements collected during the MMI execution, we extract some number of frequently executed cases and put explicit compare and branch instructions at the beginning of the switch statement. We compute the expected number of comparison instructions to be executed before reaching each body statement by assuming either linear search or binary search based on the number of case labels, as performed in the code generation. We then try to minimize this expected number by the case extraction. When the number of case statements is small and the frequency between the cases is clearly ordered, we replace the whole switch statement with that control flow; otherwise, the switch statement still remains after removing some number of extracted cases. For the profile collection, we limit the number of profiles both by the number of executions at the switch statement entry and by the number of executions at individual case statements in order to save the profile buffer space. We collect several different profiles for the default case: one for input values above the highest key within the case labels, one for values below the lowest key, and several others between two adjacent keys within the range. On the basis of these profiles, we apply the same transformation for the default case.
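For example, if the MMI profile shows that one key dominates, an explicit test for that key is hoisted in front of the remaining switch (illustrative; the hot key and handler names are assumed):

    // Conceptual result of extracting a frequently executed case: the hot key is tested first,
    // so the common path executes one comparison instead of a linear or binary search.
    class SwitchExtraction {
        static int dispatch(int tag) {
            if (tag == 7) {                 // extracted hot case (assumed hot in the MMI profile)
                return handleSeven();
            }
            switch (tag) {                  // remaining, infrequently executed cases
                case 1:  return handleOne();
                case 3:  return handleThree();
                default: return handleDefault();
            }
        }

        static int handleSeven()   { return 7; }
        static int handleOne()     { return 1; }
        static int handleThree()   { return 3; }
        static int handleDefault() { return 0; }
    }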

Exception check elimination

The redundant exception check elimination (both null check and array bound check) at the EBC level is based on simple forward dataflow. The dataflow analysis propagates both non-null and range expressions to determine where they must be checked and where they can be skipped.

Since an exception is not explicitly represented in the EBC, all results of the exception check redundancy analysis are marked as attributes on the relevant instructions. When the EBC is converted to quadruples, the explicit exception checking code is generated only for those checks without the attribute, and thus the size of the quadruples can be reduced.

Typeflow analysis

We perform typeflow analysis at various phases in the optimization process to infer the static type of each object reference. In the method inlining, we used the results of typeflow analysis for determining the types of receiver objects of virtual method invocations to increase the opportunities for devirtualization. At this phase, we use the results of typeflow analysis for three optimizations: elimination of backup paths of devirtualized code, elimination of redundant type checking code, and conversion of the native method call System.arraycopy() to a special operator instruction.

We exploit "preexistence" [11] to safely eliminate devirtualized-call-site backup paths without requiring a complex on-stack replacement technique. Preexistence is a property of virtual call receiver objects in which the receiver is preallocated before the calling method is invoked. If the receiver for a devirtualized call site has this property and the devirtualization is made for a target method with a single implementation, we can eliminate the backup path without causing incorrect behavior, because we can guarantee that the single-implementation assumption is true on entry to the calling method. Even if another implementation of the target method is loaded because of dynamic class loading later, we can modify the affected method to be recompiled at its entry before making the loaded class available.

Figure 6  An example of the transformation of a switch statement into a simple operation of loading from a compiler-generated table: (a) original switch statement; (b) pseudocode after transformation.

(a) Original switch statement:
    switch (var) {
    case 100: t = 'A'; break;
    case 101: t = 'B'; break;
    case 103: t = 'C'; break;
    case 104: t = 'D'; break;
    case 107: t = 'E'; break;
    default:  t = 'Z'; break;
    }

(b) Pseudocode after transformation:
    #I = var - 100;                          // normalize index
    if ((#I < 0) || (7 < #I)) goto default;  // range checking
    t = LOADTABLE #T[#I];                    // table lookup
    goto done;
    default: t = 'Z';                        // default value setting
    done:
    // #I: compiler-generated table index variable
    // #T: compiler-generated table [A,B,Z,C,D,Z,Z,E]


To prove preexistence, we perform an additional argument analysis to check whether the receiver object of each devirtualized call site is directly reachable from an argument of the calling method. This optimization results in eliminating the merge points from virtual call sites in the control flow and thus can improve the precision of dataflow analyses in the later phases.
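For example, a receiver that arrives as an argument of the compiled method preexists the method's entry, so its devirtualized call needs no backup path (illustrative source):

    // The receiver "list" is an argument of total(), so it was allocated before total() was
    // entered (it preexists). If MyList.sum() has a single implementation at compile time, the
    // devirtualized (and possibly inlined) call needs neither a guard nor a backup path; loading
    // a subclass later triggers recompilation of total() at its entry.
    class PreexistenceExample {
        static class MyList {
            int[] data = new int[0];
            int sum() { int s = 0; for (int v : data) s += v; return s; }
        }

        static int total(MyList list) {
            return list.sum();
        }
    }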

We also perform the elimination of redundant type checking code using the typeflow analysis. The type checking code examines whether two object references are related by a subclass relationship and is required for the checkcast, instanceof, and aastore bytecode instructions. By propagating the type information along the forward dataflow from several seed instructions, such as object allocation, argument specification, and instance checking code, we can identify and eliminate redundant type checking operations in some cases. For example, the following statement is quite common in practice, and the second instance checking for the checkcast instruction is eliminated by propagating the type information already checked:

    if (anObj instanceof aClass) {
        aClass ao = (aClass) anObj;
    }

Another optimization is for the special method call System.arraycopy(). This is one of the most frequently called native methods and takes five arguments: the source array and its start index, the destination array and its start index, and the number of elements to be copied. If we know from the results of the typeflow analysis that an array store exception cannot be raised for the given pair of arrays, we convert the method call to a single EBC operator. The typeflow analysis also provides the information that the source and destination arrays are not aliased on the basis of the corresponding seed instruction.

Additional attributes for this operator, such as the length of the array, the start index, and the number of copy elements, may be added by the optimizations of constant propagation and copy propagation performed in the quadruple optimization phase. At the code generation phase, depending on the information available for the operator, we inline the code by using the special instruction movs with the repeat prefix for IA-32. In the best case, the inlined code is quite simple because we know that the range of the copy operation is guaranteed to have no overlap and the array bound checking code is not necessary.
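A call such as the following is a candidate for this conversion, since typeflow analysis sees that both arrays are int[] allocated in the method, so no array store exception is possible and the arrays cannot alias (illustrative):

    // Both arrays are int[] created here, so the call can become the internal copy operator
    // and, at code generation time, a rep movs sequence without bound checks in the best case.
    class ArraycopyExample {
        static int[] duplicate(int[] src) {
            int[] dst = new int[src.length];
            System.arraycopy(src, 0, dst, 0, src.length);
            return dst;
        }
    }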

Optimizations on quadruples

We perform a variety of dataflow-based optimizations on quadruples. Figure 7 shows the sequence of optimizations performed on quadruples. In the following, we describe each optimization in detail.

Copy propagation and dead code elimination

These optimizations are designed to eliminate the redundant register-to-register copy operations in the code directly translated from the stack-based EBC, as described above in the section on the structure of the JIT compiler. The optimizations are applied globally across basic blocks.

Class initialization elimination

Since we separate the resolution of a class from the execution of its class initializer, we can try to resolve a class at compile time without violating the language specification. For a resolved but not yet initialized class access, we explicitly generate a quadruple operator for class initialization, because we have to execute the class initializer (i.e., <clinit>) once at the first access to each class.

Figure 7  Sequence of optimizations performed on quadruples: copy propagation/dead code elimination, class initialization elimination, escape analysis, dataflow iterations (null check elimination (platform-independent), array bound check elimination, and common subexpression elimination and memory access privatization), null check elimination (platform-dependent), typeflow analysis and optimization, synchronization optimization, athrow inlining, and a final copy propagation/dead code elimination pass.


The class initialization is a method invocation and thus has to be treated as a barrier in the following exception check elimination and memory access privatization phases. We attempt to eliminate the redundant class initializations at this stage using simple forward dataflow analysis.
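For example, two consecutive accesses to static fields of the same resolved-but-uninitialized class need only one initialization check (illustrative; the class and field names are made up):

    // Conceptually, the first access to Config generates a class-initialization check quadruple;
    // simple forward dataflow proves the check before the second access redundant and removes it.
    class ClassInitExample {
        static class Config {
            static final int WIDTH  = readSetting("width");
            static final int HEIGHT = readSetting("height");
            static int readSetting(String key) { return key.length(); }  // placeholder initializer
        }

        static int widthPlusHeight() {
            int w = Config.WIDTH;    // initialization check for Config needed here
            int h = Config.HEIGHT;   // check is redundant: Config is already initialized
            return w + h;
        }
    }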

Escape analysis

Escape analysis examines each dynamically allocated object and deduces the bounds on the lifetimes of those objects to perform optimizations. If all uses of the allocated object are within the current method, we can allocate the object on the stack to improve both the allocation and reclamation costs of memory management. If the object is proved to be nonescaping from the current thread, we can eliminate any synchronization operations on this object.

Our escape analysis is based on the algorithm described in [14]. This is a compositional analysis designed to analyze each method independently and to produce summary information that can be used at every call site that may invoke the method. Without the summary information, the analysis treats all arguments as escaping at the given call sites. Hence, the analysis result becomes more precise and complete as more of the invoked methods are analyzed. However, if backup paths for devirtualized call sites have been eliminated by the preexistence analysis, we suppress generation of the summary information. This is because the escape analysis would be based on an optimistic assumption that does not consider those potentially escaping paths eliminated by the preexistence analysis. In that case, we might incorrectly conclude in some of its caller methods that an object passed as a parameter is nonescaping by using the summary information. When dynamic class loading invalidates the preexistence assumption, we would then have to recompile not only the current method, but all caller methods that have used the summary information as well.

The stack allocation of an object basically holds the space of the whole object, including the object header, within the stack frame. Since it may increase memory consumption by extending the lifetime of objects until the method returns and the stack is rolled back, we perform the stack allocation only when the allocation site is not within a loop and the size of the allocated objects (both individually and collectively) is within a preset threshold value.

When the only uses of an object are to read and write fields of the object, we can perform scalar replacement for those field variables instead of allocating the entire object. For example, if an object is passed as an argument to other methods or is used in type checking instructions, it is not eligible for scalar replacement. For this transformation, we introduce a new local variable for each of the object fields and simply replace the allocation instruction with a set of initialization instructions for those new locals. Each reference to the object is also replaced with a simple copy instruction from or into the new local variable. The computational redundancies with this transformation can be eliminated in the subsequent phase of copy propagation and dead code elimination.
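For example, a temporary object that never leaves the method and is used only through its fields can be replaced by locals (illustrative):

    // The Pair never escapes manhattan(): it is not passed to another method, not stored into the
    // heap, and not used in a type check, so escape analysis can replace it with two locals and
    // remove the allocation; the leftover copies are cleaned up by the subsequent copy propagation
    // and dead code elimination.
    class ScalarReplacementExample {
        static final class Pair { int x; int y; }

        static int manhattan(int ax, int ay, int bx, int by) {
            Pair p = new Pair();
            p.x = ax - bx;
            p.y = ay - by;
            return Math.abs(p.x) + Math.abs(p.y);
        }
    }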

Dataflow iterations

We perform exception check elimination, common subexpression elimination, and memory access privatization iteratively using the partial redundancy elimination (PRE) technique [15, 16]. Within each iteration, the null check elimination first moves null checks backward and eliminates redundant null checks within a loop. This optimization increases the opportunities for the following array bound check elimination, since the bound checking code is now allowed to move in a wider range of the program beyond the original place of the null checking code. After the exception checks are eliminated from a loop, we apply common subexpression elimination and privatize memory (instance, class, and array variable) access operations. Again, these optimizations should be effective here because many of the barriers for code motion were eliminated in the previous steps. The common subexpression elimination and the memory access privatization can, in turn, expose new opportunities for the exception elimination optimization, so we iterate these dataflow optimizations several times.

Figure 8 shows an example in which these optimizations in the dataflow iteration can affect each other successively to produce better code. The original program (a) is converted to the quadruple representation (b). With a simple approach of eliminating redundant exceptions using forward dataflow, the first null check for the array a in (b) cannot be lifted out of the loop, since the exception has not been checked in the code reaching the entry of the loop. Our algorithm uses PRE and moves this null check out of the loop, as shown in (c). This creates an opportunity in (d) to move the bound checking code out of the inner loop, which in turn enables privatization to be applied for the loop-invariant array access in the first dimension, as shown in (e). This again creates another opportunity (f) for elevating the null checking code out of the inner loop.

We assign different values for the maximum number of iterations of these dataflow optimizations depending on the platform (i.e., the number of available registers), the optimization level, and whether or not the method contains a loop. For IA-32, we chose four iterations as the maximum when using the highest (L3) optimization level for a method containing a loop. We adjust this number for the lower optimization levels and for a method without a loop.


In any case, the specified number is a limit for the iterations. We also stop the iterations if an iteration produces no new transformations.

Null check elimination (platform-dependent)

This is the platform-dependent part of our null check elimination algorithm [15]. On IA-32 Windows and Linux**, we can utilize the hardware trap mechanism for accessing the zero page. We call the null check implicit when we can rely on the hardware trap without generating actual checking code, and explicit otherwise. We need to use an implicit null check wherever possible to minimize the overhead of null exception checks.

Our algorithm treats all null checks as explicit at first. We then perform forward dataflow analysis to compute the latest point in the region at which each null check can be moved forward. We then insert either an explicit or implicit null check at that point, depending on whether the next instruction following the insertion point will raise a hardware trap for the object access. Finally, we perform backward dataflow analysis to eliminate redundant explicit null checks. This optimization substitutes explicit null checks wherever possible by finding and using the memory accessing instructions that raise the hardware traps.

Typeflow analysis

The typeflow analysis and optimizations performed on quadruples are basically the same as those performed on the EBC. That is, the results of the analysis are used for the elimination of devirtualized code backup paths, the elimination of the redundant type checking code, and the conversion of calls to the native method System.arraycopy() into the internal operator. We perform the typeflow analysis and the same set of optimizations here again to identify the types of variables more precisely than for the EBC. This is because other optimizations, in particular copy propagation and memory access privatization, allow better dataflow propagation and thus expose more opportunities for the three optimizations using the result of the typeflow analysis.

Synchronization optimization

Since we explicitly express the synchronization enter and exit operations (syncenter and syncexit) in our IR with their target object, it is easy to identify the opportunities for eliminating redundant synchronization operations.

Figure 8  An example of successive optimizations with dataflow iterations. Note that for simplicity the code for initializing and incrementing loop indices in the original program is not shown in (b) to (f). This is a minor revision of the figure published in [15]. ©ACM 2000; used with permission.

(a) Original program:
    i = 0;
    do {
        j = 0;
        do {
            sum += a[i][j];
            j++;
        } while (expr1);
        i++;
    } while (expr2);

(b) Quadruple representation:
    do {
        do {
            nullcheck a;
            boundcheck(a,i);
            t1 = a[i];
            nullcheck t1;
            boundcheck(t1,j);
            t2 = t1[j];
            sum += t2;
        } while (expr1)
    } while (expr2)

(c) After first nullcheck elimination:
    nullcheck a;
    do {
        do {
            boundcheck(a,i);
            t1 = a[i];
            nullcheck t1;
            boundcheck(t1,j);
            t2 = t1[j];
            sum += t2;
        } while (expr1)
    } while (expr2)

(d) After array bound check elimination:
    nullcheck a;
    do {
        boundcheck(a,i);
        do {
            t1 = a[i];
            nullcheck t1;
            boundcheck(t1,j);
            t2 = t1[j];
            sum += t2;
        } while (expr1)
    } while (expr2)

(e) After scalar replacement:
    nullcheck a;
    do {
        boundcheck(a,i);
        t1 = a[i];
        do {
            nullcheck t1;
            boundcheck(t1,j);
            t2 = t1[j];
            sum += t2;
        } while (expr1)
    } while (expr2)

(f) After second nullcheck elimination:
    nullcheck a;
    do {
        boundcheck(a,i);
        t1 = a[i];
        nullcheck t1;
        do {
            boundcheck(t1,j);
            t2 = t1[j];
            sum += t2;
        } while (expr1)
    } while (expr2)


There are two scenarios in which we can perform the optimization. One is nested synchronization, that is, a syncenter operation on the same object twice followed by two syncexit operations, where we can remove the inner pair of synchronization operations. The other is two separate synchronization regions, such as a pair of syncenter and syncexit operations on one object followed by another pair of syncenter and syncexit operations on the same object. There may be some other code between the two regions. In this case, we have an opportunity to reduce the number of synchronization operations by extending the scope of synchronization and creating a single larger synchronization region.
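Both scenarios are visible in source such as the following (illustrative; the optimization actually operates on the syncenter/syncexit quadruples, typically after inlining has exposed the nesting):

    class SyncOptimizationExample {
        private final Object lock = new Object();
        private int a, b;

        // Nested synchronization on the same object: the inner syncenter/syncexit pair is
        // redundant because the monitor is already held, and can be removed.
        void nested() {
            synchronized (lock) {
                synchronized (lock) {
                    a++;
                }
            }
        }

        // Two adjacent regions on the same object with no side-effecting code in between:
        // they can be merged into one larger region, halving the lock/unlock operations.
        void adjacent() {
            synchronized (lock) { a++; }
            synchronized (lock) { b++; }
        }
    }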

This second optimization requires somewhat careful consideration to guarantee correct program behavior. Suppose there is an object field access in the code between the two synchronization regions. If we perform this optimization, the code of the field access can be moved within the new, wider synchronization region and may be reordered with other memory operations on different objects that were placed within either of the two original synchronization regions. If the program assumes the order of computation forced by the two original synchronization regions, this optimization can lead to unintended behavior. Thus, we perform this optimization only if there is no quadruple instruction causing any side effect, such as an object write instruction or a method invocation, between the two synchronization regions.

Another possibility for optimization is to lift the synchronization operation out of a loop. This situation occurs when a synchronized method is inlined within a loop. This optimization is efficient and probably legal in Java, but the resulting behavior is not what most programmers would expect: if the loop is long-running, other threads trying to access that method will be starved, whereas most programmers writing such a loop would expect the lock to be released and reacquired on each iteration to give other threads a chance to execute. Thus, we currently do not perform this optimization on synchronization.

Athrow inlining
This is part of exception-directed optimization (EDO) [2], a technique to identify and optimize frequently raised exception paths. It first collects exception path profiles. An exception path is a stack trace from the method that throws an exception through the method that catches the exception. It is expressed as a sequence of pairs of a method and the program counter of its call site. Whenever the runtime library traverses the stack to find the method that catches a given exception, it registers the exception path and increments its count. After identifying a particular exception path that is executed frequently enough, EDO drives recompilation of the exception-catching method with a request to inline all of the methods along the call sequence corresponding to the exception path, down to the exception-throwing method.

After exception paths are inlined, the compiler examines each athrow instruction that can be eliminated by linking the source and destination of the exception paths within the same compilation scope. This consists of three steps. For the first step, we perform exception class resolution, in which we identify the class of every possible exception object that can be thrown at each athrow instruction using the typeflow analysis. In the simplest case, the exception object is explicitly created immediately before the athrow, and thus the class can easily be identified. In a more challenging case, the exception object is created and saved in a class field and is used repeatedly by loading the class variable before the athrow. Even in this case, we can determine the set of possible classes using the current class hierarchy structure.

For the second step, we perform the exception-handling analysis within the current method. When we identify the exact class of the exception object, we can find the corresponding catch block. When we determined multiple possible exception classes under the class hierarchy in the previous step, we find the corresponding catch block for each of the possible exception classes.

For the third step, we perform the conversion of the athrow instruction to either an unconditional or a conditional branch instruction, depending on whether or not the exact class of the exception object has been identified. If necessary, to create a separate entry point for the branch instruction, the entry basic block for the target catch block is divided into two: one for loading the exception object and the other for the body of the catch block. The athrow instruction itself may remain in a branch of a conditional statement if we cannot find the catch block for all possible exception classes in the current method. After the conversion, the exception object may no longer be accessible if there is no reference to it within the catch block. In this case, the explicit object allocation before raising the exception will be eliminated in a subsequent phase of dead code elimination.
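The following Java example (our own, hypothetical) shows the kind of code this transformation targets: once parsePositive() has been inlined into sumPositive() along the hot exception path, the athrow for NumberRangeException can be converted into a direct branch to the catch block, and, because the handler never touches the exception object, the allocation can later be removed as dead code.

// Hypothetical example of a frequently raised exception path optimized by EDO.
public class AthrowExample {
    static class NumberRangeException extends RuntimeException { }

    private static int parsePositive(int v) {
        if (v < 0) {
            throw new NumberRangeException(); // exact exception class known here
        }
        return v;
    }

    public static int sumPositive(int[] values) {
        int sum = 0;
        for (int v : values) {
            try {
                sum += parsePositive(v);       // inlined by EDO on the hot path
            } catch (NumberRangeException e) { // handler does not reference 'e'
                // frequently taken when many inputs are negative; becomes a branch
            }
        }
        return sum;
    }
}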

Optimizations on DAGs
The sequence of optimizations performed on DAGs is shown in Figure 1.

Loop optimizations
We perform two kinds of loop optimizations: loop versioning and loop simplification. Loop versioning [1, 17] is a technique to eliminate the exception-checking code within a loop by creating two versions of the loop: an optimized loop and an unoptimized loop. Guard code is provided at the loop entry for checking the possibility of raising an exception within the loop.


Depending on the result of the entry check, either the optimized loop or the unoptimized loop is selected at runtime. The guard code is provided for both null pointer checks and array bound checks covering the entire range of the loop index. The exception-checking code is eliminated in the optimized loop, while the unoptimized loop retains it. Figure 9(a) shows an example of this transformation. We can expect the optimized loop to be executed most of the time, since exceptions are rarely thrown in practice.

For nested loops, we apply loop versioning to both outer and inner loops by creating only two versions of the loop: the optimized version for both outer and inner loops and the unoptimized version for both loops. The unoptimized version is executed upon any failure of the entry tests for either the outer loop or the inner loop. Otherwise, the optimized loop is executed.

In the optimized loop, this optimization not only eliminates the runtime overhead of the exception-checking code, but also eliminates barriers to code motion and thus creates more optimization opportunities in the later phases. On the other hand, code growth is the major drawback of this transformation. In order to avoid an unacceptable level of code growth and to limit the compilation overhead, we check the number of basic blocks and instructions within the loop body and proceed only if it is within a predetermined threshold.

Loop simplification is another optimization we perform in this phase. As shown in Figure 9(b), this transformation fully unrolls the loop when both the start and the end of the loop index are known at compile time. Again, code growth is the main limiting factor of this transformation, and we check the size of the loop body before applying it.

Prepass code scheduling
Finally, we perform prepass code scheduling within each basic block using a simple list-scheduling algorithm [18]. We integrated register minimization into list scheduling in order to avoid the pass-ordering problem between register allocation and scheduling. The prepass scheduler traverses the DAG in topological order. It selects the node that has the highest scheduling priority among the ready nodes and appends the selected node to the end of the sequence of scheduled nodes. During scheduling, the scheduler also maintains a set of currently used registers.

Our algorithm uses two different scheduling policies: maximization of instruction-level parallelism (ILP) and minimization of the number of used registers. When there are plenty of available registers on the target architecture, the scheduler prefers a node on the critical path of the DAG to maximize the parallelism. When the number of available registers falls below a certain threshold, the scheduler instead invokes the second list-scheduling routine to minimize register usage; that is, it prefers the node that will release the largest number of registers. In the final phase of the scheduling, we reduce the number of local variables by assigning two variables to the same variable when their lifetimes do not overlap.
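The sketch below, written in Java, is a heavily simplified rendering of this two-policy selection loop; the Node fields, the register-pressure threshold, and the registerDelta bookkeeping are our own illustrative assumptions rather than the compiler's actual data structures.

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of a list scheduler that switches between an ILP-maximizing
// policy and a register-minimizing policy depending on register availability.
final class PrepassScheduler {
    static final class Node {
        final String name;
        final int criticalPathLength; // longest path from this node to a DAG root
        final int registerDelta;      // net change in live registers when scheduled
        final List<Node> successors = new ArrayList<>();
        int unscheduledPreds;         // predecessors not yet scheduled
        Node(String name, int cpl, int delta) {
            this.name = name; this.criticalPathLength = cpl; this.registerDelta = delta;
        }
    }

    private static final int LOW_REGISTER_THRESHOLD = 2; // assumed tuning value

    static List<Node> schedule(Collection<Node> dag, int totalRegisters) {
        List<Node> ready = new ArrayList<>();
        for (Node n : dag) if (n.unscheduledPreds == 0) ready.add(n);

        List<Node> sequence = new ArrayList<>();
        int freeRegisters = totalRegisters;
        while (!ready.isEmpty()) {
            final boolean minimizeRegisters = freeRegisters <= LOW_REGISTER_THRESHOLD;
            // ILP policy: prefer the deepest critical path.
            // Register policy: prefer the node that releases the most registers.
            Node pick = Collections.max(ready, Comparator.comparingInt((Node n) ->
                    minimizeRegisters ? -n.registerDelta : n.criticalPathLength));
            ready.remove(pick);
            sequence.add(pick);
            freeRegisters -= pick.registerDelta;
            for (Node succ : pick.successors) {
                if (--succ.unscheduledPreds == 0) ready.add(succ);
            }
        }
        return sequence;
    }
}

The same loop serves both policies; only the comparison key changes, which keeps the scheduler cheap enough for a dynamic compilation setting.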

Figure 10 shows the quadruples after we assigned pseudoregisters in the example shown in Figure 3.

Figure 9
An example of (a) loop versioning and (b) loop simplification.

(a) Original method:
public int foo(int[] a, int[] b, int start, int end) {
    for (int i = start; i < end; i++) {
        a[i] = a[i] + b[i];
    }
}

After loop versioning:
if ((a != null) && (a.length >= end) && (b != null) && (b.length >= end) && (start >= 0)) {
    /* optimized loop, without exception-checking code */
    for (int i = start; i < end; i++) {
        a[i] = a[i] + b[i];
    }
} else {
    /* unoptimized (original) loop, with exception-checking code */
    for (int i = start; i < end; i++) {
        a[i] = a[i] + b[i];
    }
}

(b) Original method:
static final int start = 0;
static final int end = 3;
public int foo(int[] a, int[] b) {
    for (int i = start; i < end; i++) {
        a[i] = a[i] + b[i];
    }
}

After loop simplification:
a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];


We denote an integer register as Rn and a floating-point register as Fn. Figure 10(a) shows the result with the original instruction ordering. This ordering maximizes ILP and requires nine floating-point registers. Figure 10(b) shows the result when we applied prepass scheduling with the minimal register usage policy. After node e is scheduled, the ready nodes are node f and node k. While node f requires another floating-point register to hold the loaded value, node k does not increase the number of used registers. Thus, the scheduler selects node k after node e. In this case, the resulting instruction ordering requires only seven floating-point registers.

Platform-specific optimizations
This section describes IA-32-specific optimizations implemented in the code generation phase. The sequence of optimizations is shown in Figure 1. First, the architecture mapping takes the platform-independent IR (quadruples) and converts it to a form suitable for the target architecture. For IA-32, we have two phases. One is to convert the IR from a three-operand format to a two-operand format as required by the architecture, except for the cases in which we can exploit the lea instruction for three-operand addition and subtraction operations. We insert an additional copy instruction for each statement to be converted. The other phase is to specify where we can use memory operand instructions in the code generation. For each memory load instruction, we examine three conditions:

1. Whether the loaded value has a single use.

2. Whether the instruction using the value allows the memory operand format.

3. Whether the source operands of the load instruction are all still live at the corresponding use.

If these conditions are met, we mark both the definition and the use of the variable with an attribute indicating that the load instruction can be skipped by using the memory operand form of the instruction at the corresponding use statement. A typical example is an array bound check, which usually consists of loading the array size followed by comparing it with a given index. If the size of the array is not used after the bound check, both the definition and the use of the array size variable are marked as a memory operand.
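The following Java sketch (our own schematic, with an assumed Quad interface) restates the three-condition test as code; it only marks the definition and the use, leaving the actual folding to the code generator.

import java.util.List;

// Schematic marking pass for loads that can be folded into memory operands.
final class MemoryOperandMarking {
    static void markFoldableLoads(Iterable<Quad> quads) {
        for (Quad load : quads) {
            if (!load.isMemoryLoad()) continue;
            List<Quad> uses = load.usesOfResult();
            if (uses.size() != 1) continue;                // 1. exactly one use
            Quad use = uses.get(0);
            if (!use.allowsMemoryOperand()) continue;      // 2. instruction form permits it
            if (!load.sourceOperandsLiveAt(use)) continue; // 3. address operands live at the use
            load.markSkippable();        // the explicit load can be skipped ...
            use.markMemoryOperand(load); // ... and replaced by a memory operand here
        }
    }

    // Minimal IR interface assumed by this sketch.
    interface Quad {
        boolean isMemoryLoad();
        List<Quad> usesOfResult();
        boolean allowsMemoryOperand();
        boolean sourceOperandsLiveAt(Quad use);
        void markSkippable();
        void markMemoryOperand(Quad load);
    }
}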

After the architecture mapping, we perform a set of optimizations for efficient usage of the registers. Since we do on-the-fly local register allocation during the code generation phase for IA-32 (see footnote 4), it is important to provide the register manager with a global view of how registers are used in order to avoid inefficient register shuffling or spill code. The next sections show how the register manager chooses spill candidates, describe how register preferences are used for allocating registers, and describe our techniques in the code generation phase for increasing the number of available registers.

Mean-time-to-next-use analysis
When no register is available, the register manager has to decide which register should be spilled out to memory. With the stack-based IR (EBC), as in the previous version of the compiler, we first searched for an appropriate candidate among the local variable registers and then among the stack variable registers, on the basis of the assumption that stack variables are generally short-lived and thus more likely to have a shorter interval to the next reference. With the register-based IR (quadruples), the register manager has to know how soon each of the local variables is next accessed from the current instruction in order to make the right spill decision. We call this information the mean time to next use.

We compute the mean time value as the number of quadruple instructions between each definition and reference of the local variables by using the backward dataflow analysis. In the local analysis, the local propagator initializes a counter whenever it encounters a reference to a new local variable. It increments all of the active counters as it scans each instruction in the quadruples. When it reaches a local variable reference with an active counter, it updates the mean time value of the operand with the counter value.

Footnote 4: We use different register allocation schemes depending on the target platform: a register manager for IA-32, a linear scan allocator [19] for PPC, and a graph-coloring allocator [20, 21] for IA-64. This is based on the consideration of both the number of available registers and the restrictions on register usage for each platform.

Figure 10
An example of prepass code scheduling. (a) Instruction ordering with maximum instruction-level parallelism. (b) Instruction ordering with minimal register usage.

(a)
a: faload F5 = R1, R2
b: faload F6 = R3, R2
c: fmul   F7 = F5, F6
d: fadd   F1 = F1, F7
e: faload F7 = R4, R2
f: faload F8 = R5, R2
g: fmul   F9 = F7, F8
h: fadd   F2 = F2, F9
i: fmul   F5 = F5, F8
j: fadd   F3 = F3, F5
k: fmul   F6 = F7, F6
l: fadd   F4 = F4, F6
m: iadd   R2 = R2, 1
n: iflt   R2, R6

(b)
a: faload F5 = R1, R2
b: faload F6 = R3, R2
c: fmul   F7 = F5, F6
d: fadd   F1 = F1, F7
e: faload F7 = R4, R2
k: fmul   F6 = F7, F6
l: fadd   F4 = F4, F6
f: faload F6 = R5, R2
g: fmul   F7 = F7, F6
h: fadd   F2 = F2, F7
i: fmul   F5 = F5, F6
j: fadd   F3 = F3, F5
m: iadd   R2 = R2, 1
n: iflt   R2, R6


The counter is cleared at its definition point. In the global analysis, the results from the local analysis are combined by selecting the smallest mean time value. In the case of loop back edges or edges with biased branch frequencies, the local result on a particular single basic block is selected. We iterate the local and global dataflow analysis several times with a certain limit on the mean time value.

In the code generation phase, the register manager uses the mean time value by updating its internal register table. It copies the value from the operand for the variable defined or referenced in the current IR instruction, and otherwise decrements the value in the table for variables not currently referenced. Thus, the register manager keeps the mean time value up to date in the register table throughout the code generation and uses the information for making spill decisions. In addition, when a local variable reference is its last use, the mean time value in the operand is zero, and the register manager can invalidate the corresponding register immediately after its code generation and put it in the list of available registers.
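A minimal sketch of the local, backward pass is given below in Java; the Quad and Operand shapes, and the convention that a value of zero marks the last use, follow the description above, but the details are our own simplification.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Backward scan over one basic block that attaches a distance-to-next-use value
// to each operand; the global pass (not shown) would combine blocks by taking
// the smallest value along successor edges.
final class MeanTimeToNextUse {
    static final int UNUSED = Integer.MAX_VALUE; // defined value never used in this block

    static void annotateBlock(List<Quad> block) {
        Map<String, Integer> distance = new HashMap<>(); // active counters per variable
        for (int i = block.size() - 1; i >= 0; i--) {
            Quad q = block.get(i);
            distance.replaceAll((var, d) -> d + 1); // one more instruction to the next use
            for (Operand use : q.uses()) {
                Integer d = distance.get(use.variable());
                use.setMeanTimeToNextUse(d == null ? 0 : d); // zero marks the last use
                distance.put(use.variable(), 0);             // start a new counter here
            }
            for (Operand def : q.defs()) {
                def.setMeanTimeToNextUse(distance.getOrDefault(def.variable(), UNUSED));
                distance.remove(def.variable());             // counter cleared at definition
            }
        }
    }

    interface Quad { List<Operand> uses(); List<Operand> defs(); }

    interface Operand {
        String variable();
        void setMeanTimeToNextUse(int value);
    }
}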

Register constraint propagation
There are constraints for the use of registers for some IA-32 instructions; that is, a particular operand must reside in a designated register. For example, the shift instructions require ecx to be the second operand. The integer division instructions are another example. Also, only four of the eight general-purpose registers can hold 8-bit values. We have to satisfy these requirements during code generation by moving values between registers or between registers and memory, if necessary. Traditionally, the register-coalescing technique [22] in graph-coloring register allocation addressed this problem by merging nodes representing local variables and physical registers in order to eliminate redundant moves. The register constraint problem was also discussed in the linear scan register allocation, where register-preallocated live ranges [19] were used. We use the register constraint propagation to help the register manager assign the optimal registers [23].

Since the register manager works from the top to the bottom of the program, registers are assigned to local variables at their definition and released at their last use. To track the register constraint information for the register manager, we first add the register constraint seed information to the source operands of the given quadruples and then propagate the information backward to the corresponding definitions. We currently use the following as the seeds of the register usage constraints:

● Calling convention: We follow the calling convention to use eax, edx, and ecx for the first three integer parameters (long parameters are treated as two integer parameters) for method invocations. The return of integer-type values is through eax, and the return of long-type values is through eax and edx.

● IA-32 architectural requirements: The integer division, integer multiplication, and shift instructions require specific registers to be used for some operands. The compare-and-swap instruction, which we use for generating code for the synchronization operation, requires that its operands reside in specific registers. Instructions dealing with 8-bit values for byte-type operations (e.g., baload) allow use of only four of the eight general-purpose registers.

● Preference for variables referenced across method calls: For local variables whose live ranges cross a method call or a C-runtime library call, it is preferable to assign nonvolatile registers to reduce the cost of saving and restoring registers over the call. In contrast to the above two requirements, which we have to satisfy in the code generation, this is just a preference.

We use a bit vector representation, with one bit per available register, to express the register constraint information. The constraint expression can be either positive or negative; that is, we can specify that a local variable must be assigned a certain specific register, or must not be assigned specific registers. This negative specification is important to avoid a situation in which a specified register is occupied by another variable whose live range overlaps the live range of the target variable.

Second, we propagate the seed information within each basic block. The local propagator manipulates the array of the register constraint bit vectors for local variables and a temporary called the currently occupied register bits (CORB) as it scans the code from the bottom to the top. If it encounters a local variable reference with a positive register constraint, it updates the array for the local variable with the bits indicating the specified registers and then clears the bits for all of the other local variables. The CORB status also reflects the corresponding register bits. Likewise, if it encounters a local variable reference with a negative constraint, it updates the array for the local variable with the inverse bits of the CORB. If it reaches a local variable definition, it clears all of the bits in the array for the local variable and the corresponding bits in the CORB. Figure 11 shows an example of the local constraint propagation. The register constraints, both positive and negative, are expressed in the bit vectors as shown in the figure. Each bar represents the live range of a local variable within a basic block, and two of them have seed information at the reference points. After the local propagation, each live range has the propagated register constraint information at its definition point.

Finally, the local information is propagated across the entire method using backward dataflow.


The aggregation function in the dataflow equation is a bitwise operation between the constraints of the successors for each local variable, as follows (a small sketch of this merge is given after the list):

● If the constraints of the successors are all positive, perform a bitwise OR operation.

● If the constraints of the successors are all negative, perform a bitwise AND operation.

● If the constraints of the successors are mixed positive and negative, perform a bitwise AND operation and set the positive flag.

We iterate the local and global dataflow analysis several times. In our current implementation, we iterate as many times as the maximum depth of the loop nest existing in the method.
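The sketch below (our own, in Java) shows just that aggregation step: a constraint is a register bit vector plus a flag indicating whether it is a positive (allowed) or negative (forbidden) set, and the merge applies the three rules above.

import java.util.List;

// Aggregation of successor constraints for one local variable.
final class RegisterConstraint {
    final int bits;          // one bit per general-purpose register
    final boolean positive;  // true: must use one of these; false: must avoid these

    RegisterConstraint(int bits, boolean positive) {
        this.bits = bits;
        this.positive = positive;
    }

    static RegisterConstraint mergeSuccessors(List<RegisterConstraint> succs) {
        if (succs.isEmpty()) return null; // no information from successors
        boolean allPositive = true, allNegative = true;
        for (RegisterConstraint c : succs) {
            allPositive &= c.positive;
            allNegative &= !c.positive;
        }
        int merged = succs.get(0).bits;
        for (int i = 1; i < succs.size(); i++) {
            merged = allPositive ? (merged | succs.get(i).bits)  // all positive: OR
                                 : (merged & succs.get(i).bits); // otherwise: AND
        }
        // The result stays negative only if every incoming constraint was negative;
        // the mixed case ANDs the bits and sets the positive flag.
        return new RegisterConstraint(merged, !allNegative);
    }
}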

Code generation
Since the number of registers is limited on IA-32, we must use these scarce resources efficiently to achieve high performance. We use three techniques: reclamation of the Frame Pointer register, fast access to the thread local storage without using a dedicated register, and management of two different sets of floating-point registers.

Frame Pointer register
We reclaimed the Frame Pointer register and freed up ebp to be used as a general-purpose register. All of the local variables in memory are accessed through the stack pointer. The method prologue and epilogue code becomes simpler and more efficient, consisting of just increasing or decreasing the stack pointer by the current frame size. If there is no stack memory operation within the method, we can even skip creating the stack frame, saving the two instructions for sliding the stack pointer. We reorder local variables that require spill code within the stack frame on the basis of their access counts, so that frequently accessed variables can be reached with a 1-byte displacement from the stack pointer.

When a hardware trap or a software exception is raised at runtime, the exception handler and the stack walker must know the frame base address in order to unwind the chain of stack frames. We create a map for the runtime to indicate the current stack offset from the base for each instruction. Throughout code generation, we maintain the current offset value of the stack pointer, and at the end of the compilation we create the mapping table for every instruction in the method that potentially raises an exception and where the stack offset value is changed. Each entry in the map includes a pair of the stack offset value and the corresponding instruction address. Since our frame size is basically fixed and varies only when arguments are pushed for method calls or C-runtime calls, the map can be kept reasonably small. With an appropriate compression technique, the map size is kept within 10% of the generated code size for most methods.
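As a toy illustration (ours, with an assumed sign convention), the map can be thought of as an ordered table from instruction address to the stack-pointer offset in effect at that point, which the stack walker consults to recover the frame base:

import java.util.Map;
import java.util.TreeMap;

// Toy model of the per-method stack-offset map used to unwind frames without ebp.
final class StackOffsetMap {
    // instruction address -> stack-pointer offset from the frame base at that point
    private final TreeMap<Long, Integer> entries = new TreeMap<>();

    void record(long instructionAddress, int offsetFromBase) {
        entries.put(instructionAddress, offsetFromBase);
    }

    // The entry at or before the faulting address applies; the frame base is the
    // current stack pointer adjusted by the recorded offset (sign convention assumed).
    long frameBase(long faultingAddress, long currentStackPointer) {
        Map.Entry<Long, Integer> e = entries.floorEntry(faultingAddress);
        int offset = (e == null) ? 0 : e.getValue();
        return currentStackPointer + offset;
    }
}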

The MMI- and JIT-generated code share the same execution stack in our system. Since the MMI uses ebp as a frame pointer, the exception handler first identifies which type of frame, MMI or JIT, is currently on the stack, and then extracts the frame base address accordingly.

Thread local storage pointer
Instead of having a dedicated register as a pointer to the JVM/JIT thread local storage, we use the FS segment register (one of the four segment registers for pointing to data segments in the IA-32 architecture). A segment register is usually reserved for use by the operating system for its own purposes and cannot be used arbitrarily by applications. Microsoft Windows** maintains a thread information block (TIB) structure for every thread, accessible through the FS segment register. The first entry in this structure is a pointer to the head of the exception-handler chain. We create an exception-handler record when starting the execution of each thread and register it with the system by storing a pointer to it at offset 0 of the segment pointed to by the FS register. Our JIT compiler then uses one entry in this structure as the pointer to our own thread local storage. Thus, we can obtain the thread local storage address through one indirection of memory access.

On Linux, the FS segment register is not reserved for any particular purpose, and thus any application can use the segment register to improve its own performance, which means that using the segment register in our JIT compiler could cause an unexpected conflict with other applications.

Figure 11
An example of local constraint propagation (a) before and (b) after local propagation. Each bar represents the live range of a local variable (A through E) within a basic block, associated with a register constraint value expressed as a register bit vector (one bit each for eax, edx, ecx, ebx, ebp, esi, and edi, plus a positive flag). Two of the live ranges carry seed constraints (0x81 and 0x82) at their reference points, and after local propagation each live range has a propagated constraint value (e.g., 0x7C, 0x7E) at its definition point.


Instead, we use the topmost entry of the stack space for the pointer to the thread local storage. The entry can be accessed by masking the stack pointer, so that we can still reach the thread local storage through one indirection, as in the Windows implementation.

Floating-point registers
We use streaming single-instruction multiple-data (SIMD) extensions (SSE and SSE2) when they are supported by the underlying machine and the operating system. We can generate more efficient code using these new instructions, first because the associated XMM registers are designed to be flat, in contrast to the existing x87 floating-point stack registers, and second because the instructions are divided between single-precision and double-precision operations, so we do not have to do extra precision control operations. On the other hand, we have other kinds of challenges for register allocation and code generation, as follows:

● SSE/SSE2 instructions do not support all of the existing floating-point functions (such as the remainder operation and transcendental functions) available with the x87 registers, and we have to move values from one register set to the other for those operations.

● In some operations, such as the conversion from a floating-point number to an integer number, SSE/SSE2 instructions are much faster than x87 instructions. We want to generate the faster code, even if we pay extra for memory operations that transfer values between registers.

● When there are more than eight live floating-point variables, it is better to use both register sets rather than sticking to one register set and generating memory operations for the extra variables.

Thus, we have to deal with two kinds of instructions and registers for efficient floating-point code generation.

We follow a lazy migration policy for transferring values between the two sets of registers. When the source operands are in only one of the register sets, we use the instructions available for those registers, except for operations having constraints or special preferences, as mentioned above. When the source operands are mixed between the two register sets, we basically move values to XMM registers, if available, and generate SSE/SSE2 instructions.
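The decision itself can be summarized by a small selection function such as the Java sketch below (our own condensation of the policy, not the actual implementation):

// Register-file selection for a floating-point operation under the lazy
// migration policy: operations restricted to x87 stay there, operations whose
// operands already agree stay where they are, and mixed operands migrate to
// XMM when SSE/SSE2 is available.
final class FloatRegisterPolicy {
    enum RegFile { X87, XMM }

    static RegFile chooseFile(RegFile left, RegFile right,
                              boolean needsX87, boolean sse2Available) {
        if (needsX87 || !sse2Available) return RegFile.X87; // remainder, transcendentals, or no SSE2
        if (left == right) return left;                     // no migration needed
        return RegFile.XMM;                                 // mixed operands: move to XMM
    }
}

Preference cases, such as the floating-point-to-integer conversion that favors SSE2, would add further rules on top of this basic selection.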

It is possible to apply the register constraint back propagation, as we did for integer registers. We can assign seed registers for those operations with register constraints and preferences, and if we choose to pass floating-point parameters via XMM registers, we can include parameter-setting operations as well. Currently we do not do this, but the constraint propagation would provide useful information for avoiding inefficient register move operations, since we cannot directly transfer values between the two sets of registers, but have to use expensive memory operations.

Experimental results
This section presents some experimental results. We first outline our experimental methodology and then present and discuss our results for the performance improvement and compilation overhead of individual optimizations and for the effectiveness of our dynamic optimization framework. The last section below presents the history of performance improvement with each release of our JIT compiler.

Benchmarking methodology
All of the performance results presented in this section were obtained on an IBM IntelliStation* M Pro 6850 (Intel Xeon** 2.8-GHz uniprocessor with 1024-MB memory) running Windows XP SP1. We used the JVM of the IBM DK for Windows, Java Technology Edition Version 1.4.0 prototype build, except for the results reported in the section on the evolution of performance improvement, where we used each version of the JVM/JIT combination. We used SPECjvm**98 and SPECjbb**2000 [24] as common benchmarks for evaluating our system in all of the experiments. SPECjvm98 was run in the test mode with the default large input size, and in the autorun mode with ten executions for each test. Each distinct test was run with a separate JVM with the initial and maximum heap size of 128 MB. SPECjbb2000 was run in the fully compliant mode with one to eight warehouses, with the initial and maximum heap size of 256 MB, and we reported the best scores among these runs. To evaluate our dynamic optimization framework in more detail, we also present results for larger benchmarks (Java Grande [25], XSLTMark [26], and SPECjAppServer**2002 [24]) in the section on evaluation of the dynamic optimization framework. The system configuration and parameter settings for measuring these benchmarks are described in that section.

When measuring the compilation overhead, we use a separate thread for compiling all methods. The compiler is instrumented with several hooks for each part of the optimization process to record the processor timestamp counter value. The priority of the compile thread is set higher than the priority of the normal application threads. We measure the time difference between the beginning and the end of the compilation from the timestamp counter.

Evaluation of individual optimizations
We first evaluate how much benefit each optimization brings to the system and at what compilation cost.


We disable each optimization individually and examine how much impact it has on both performance and compilation overhead. We use a single optimization level for this measurement, instead of using the three-optimization-level recompilation system, to get a clear picture of the cost and benefit of each optimization before classifying them into the three optimization levels. We evaluated the eight variations shown in Table 1. The baseline of the comparison is the configuration with all optimizations enabled within the system.

Figure 12 shows the changes in both performance and compilation overhead when each optimization is disabled. We use three metrics for compilation overhead: compilation time, compiled code size, and peak work memory usage during compilation. A smaller bar indicates a larger impact of that optimization on the system. The peak memory usage is the maximum amount of memory allocated for compiling methods. Our memory management routine allocates and frees memory in 1-MB blocks.

There are several observations from these measurement results. First, the static-heuristics-based inlining (leftmost bar) contributes an average of 10% performance improvement but causes a large compilation overhead. Without the aggressive inlining, the overhead is halved or reduced to less than half for all three metrics. In particular, the compilation time in jess and db, the code size in jess, and the work memory usage in db, javac, and SPECjbb are all reduced by more than 60%. The performance impact for jack is larger than that for the others because EDO cannot effectively optimize frequently raised exception paths without performing aggressive inlining.

Second, the DAG-based optimization (second bar from the right) shows the next largest compilation overhead. This occupies an average of 20 to 30% of the total compilation overhead, depending on the metric. In particular, this creates large overheads for db and SPECjbb in both compilation time and work memory usage. Furthermore, the performance contribution of this optimization is limited to only a few programs, such as mpegaudio. We have to apply this expensive optimization very carefully.

Third, the largest performance contributor is the dataflow iteration (fourth bar from the right). It improves the performance by an average of 20% and up to 50% for mpegaudio. The kernel of this benchmark is a heavily iterated, doubly nested loop that accesses several two-dimensional arrays. The null check elimination, array bound check elimination, common subexpression elimination, and memory access privatization are all effective for optimizing this kernel loop. The compilation time overhead with this optimization is around 10 to 20%, except for mpegaudio, for which the compilation time is significantly increased with the dataflow iterations turned off. This anomaly is due to the DAG-based optimization, by which compilation time is increased more than fourfold. This is because the implementation of DAG-based common subexpression elimination is not very efficient, but it is invoked when the quadruple-based common subexpression elimination within the dataflow iterations is disabled.

Fourth, the typeflow analysis (third bar from the right) contributes to both performance improvement and reduction of compilation overhead.

Table 1  Variations for evaluating the cost and benefit of individual optimizations. These correspond to the bars in Figure 12 from left to right.

No-AggrInline: This disables all of the static heuristics for aggressive inlining. Tiny methods are always inlined.

No-IdiomSwitch: This disables both the idiomatic translation and the switch optimizations.

No-ExceptionCheck: This disables the exception check (null check and array bound check) elimination at the EBC level.

No-Escape: This disables escape analysis. No scalar replacement or stack object allocation is performed.

No-DataflowIter: This disables the iteration of dataflow analyses performed at the quadruple level. It also disables the platform-dependent nullcheck elimination.

No-Typeflow: This disables the typeflow analysis performed at both the EBC and quadruple levels. The optimizations using this analysis result (backup path elimination, redundant type checking elimination, and arraycopy JNI call conversions) are not performed.

No-DAGOpt: This disables the DAG-based optimizations (loop optimizations and prepass code scheduling). The IR conversions from quadruples to DAGs and then back to quadruples are not performed.

No-RegOpt: This disables the optimizations performed in the code generation phase, that is, mean-time-to-next-use analysis, register constraint propagation, the reclamation of the frame pointer register, and the use of XMM registers.


Figure 12
Evaluation of individual optimizations for both performance impact in (a) and compilation overhead in (b) to (d). Each bar indicates the relative number against the baseline, the configuration with all optimizations enabled. SPECjvm**98 consists of seven benchmarks: mtrt, jess, compress, db, mpegaudio, jack, and javac. G.M. = geometric mean. Panels: (a) performance ratio over full-opt (taller is better); (b) compile time ratio over full-opt (smaller is better); (c) code size ratio over full-opt (smaller is better); (d) work memory ratio over full-opt (smaller is better). For each benchmark and for the geometric mean, the bars correspond, from left to right, to No-AggrInline, No-IdiomSwitch, No-ExceptionCheck, No-Escape, No-DataflowIter, No-Typeflow, No-DAGOpt, and No-RegOpt.


In particular, it has a large impact on mtrt, nearly 10% for performance and 20 to 40% for compilation overhead, primarily by enabling the elimination of devirtualized code backup paths. This optimization consistently contributes to many programs other than mtrt without causing any extra compilation cost.

The two EBC-level optimizations, the idiom/switch optimizations and the exception check elimination (second and third bars from the left), cause little increase in compilation overhead, but they improve performance in some benchmarks. In particular, the idiom/switch optimization contributes to performance improvements in mtrt and javac. The register usage optimization in the code generation phase (rightmost bar) also contributes a significant performance improvement, more than 20% for mtrt and mpegaudio, while it causes little additional compilation overhead. The escape analysis improves performance only for mtrt, by about 10%, but causes around 10% of compilation time overhead for all benchmarks.

Evaluation of dynamic optimization framework
In this section, we evaluate our dynamic optimization framework. In the measurements, the timer interval for the sampling profiler for detecting hot methods was set to five milliseconds, and the list of hot methods was examined every 200 sampling ticks.

Figure 13 compares both performance and compilation overhead with our dynamic optimization framework (MMI-L1/L2/L3) against those of three single-level compilation configurations: MMI to L1 only, MMI to L2 only, and MMI to L3 only. The performance with MMI-L1/L2/L3 is slightly less than that of the best-performing configuration, MMI-L3. The difference is up to 4% and, on average, 2%. For the compilation overhead, MMI-L1/L2/L3 shows a significant advantage over both the MMI-L2 and MMI-L3 configurations. In the best case, the overhead is almost equal to the baseline (MMI-L1). In the worst case, with SPECjbb, it is around 3.5 times the baseline for the compilation time. On average, the overhead is less than twice the baseline in all three metrics, and is around 30 to 70% of that with the MMI-L3 configuration.

Figure 14 shows the corresponding results for the larger benchmarks: Java Grande section 3, XSLTMark Version 2.1.0, and SPECjAppServer2002. Java Grande section 3 consists of five benchmarks: Euler, MolDyn, MonteCarlo, RayTracer, and Search. We ran each of these benchmarks separately with the "Size B" problem (large dataset) and with the initial and maximum heap sizes of 512 MB. Unlike SPECjvm98, this benchmark includes the JIT compilation time in the best execution time. XSLTMark consists of 40 test cases designed to assess a variety of functional areas of extensible stylesheet language transformation (XSLT) processors. We used Xalan-Java Version 2.5.2 [27] as the XSLT processor and measured the performance using the aggregated result of all of the test cases. SPECjAppServer2002 models the representative business process used at a Fortune 500 company, and its workload emulates heavyweight manufacturing, supply chain management, and order/inventory systems. We configured a three-tier system (Figure 15) by providing a separate machine for the database tier (using DB2* 8.1.4) and the application server tier (using WebSphere* Application Server Version 5.0.2). We used version 1.3.1 of the IBM DK, not version 1.4.0, for this benchmark, because of the product restrictions of this version of WebSphere, but we used the same code base for the JIT compiler as that used in the other benchmarks. We ran the application server with the initial and maximum heap sizes of 1200 MB and compared the self-reported score (transactions per second) for each of the four configurations. The ramp-up time, steady-state time, and ramp-down time were set to 300, 300, and 150 seconds, respectively.

Figure 14 shows the same overall characteristics as Figure 13, but there are some interesting differences. First, MMI-L1/L2/L3 is the best performer for four benchmarks, and among them MMI-L1 is the second best for MolDyn and SPECjAppServer. Although we have not investigated the reasons in detail, we believe that the aggressive method inlining performed in L2 and L3 may decrease code locality. In fact, by disabling the aggressive method inlining in the MMI-L2 and MMI-L3 configurations, performance is restored to a level higher than that with MMI-L1 for both of these benchmarks. Second, MMI-L1/L2/L3 shows a compilation overhead similar to or larger than that of MMI-L3 in all three metrics for Euler and Search. For these benchmarks, the recompilation system correctly picked up performance-critical methods to promote to higher optimization levels, but their code size happens to be quite large after inlining. In some cases, those methods were compiled through all optimization levels, L1, L2, and then L3, and thus the overhead of repetitive compilation for those selected methods exceeded the overhead caused by compiling all methods with L3 optimization under the MMI-L3 configuration. For the other benchmarks, MMI-L1/L2/L3 works effectively to reduce the total compilation overhead.

Overall, our dynamic optimization framework performs best when both performance and compilation overhead are considered. These results were obtained even without enabling any profile-directed optimizations. MMI-L1/L2/L3 is the only configuration that can enable those additional optimizations; thus, it has the potential to outperform the other configurations by an even larger margin without causing any significant increase in compilation overhead.


Figure 13
Evaluation of the dynamic optimization framework for both performance (a) and compilation overhead (b) to (d). The baseline of the comparison is MMI-L1. Panels: (a) performance ratio over MMI-L1 (taller is better); (b) compile time ratio over MMI-L1 (smaller is better); (c) code size ratio over MMI-L1 (smaller is better); (d) work memory ratio over MMI-L1 (smaller is better). For each of mtrt, jess, compress, db, mpegaudio, jack, javac, and SPECjbb, and for the geometric mean, the bars compare MMI-L1, MMI-L2, MMI-L3, and MMI-L1/L2/L3.


Figure 14
Evaluation of the dynamic optimization framework for the Java Grande, XSLTMark, and SPECjAppServer2002 benchmarks. The baseline of the comparison is MMI-L1. Panels: (a) performance ratio over MMI-L1 (taller is better); (b) compile time ratio over MMI-L1 (smaller is better); (c) code size ratio over MMI-L1 (smaller is better); (d) work memory ratio over MMI-L1 (smaller is better). For each of Euler, MolDyn, MonteCarlo, RayTracer, Search, XSLT, and jAppServer, and for the geometric mean, the bars compare MMI-L1, MMI-L2, MMI-L3, and MMI-L1/L2/L3.


Evolution of performance improvement
Figure 16 shows the history of the performance improvement with each release of our JIT compiler after version 3.0. We have integrated the techniques described in this paper over several versions of the JIT compiler and have steadily improved the performance with every new release. With the latest version of the JVM and JIT compiler, we achieved on average more than a 30% performance improvement. In particular, the performance of SPECjbb nearly doubles. There have been significant contributions to this achievement from our efforts in enhancing the JVM, especially in the areas of synchronization, object allocation, and class libraries, as well as from those in enhancing the JIT compiler. Roughly speaking, half of the performance improvement can be attributed to the JIT compiler, because the optimizations in JIT3.0 correspond approximately to the current L1 optimizations, and thus the performance of JIT4.5 is estimated to be about 15% better than that of JIT3.0, based on Figure 14(a), excluding the effect of the JVM improvements. This is only a rough estimate based on the average performance improvements, and the actual contribution differs greatly for each benchmark.

Figure 16 also shows an interesting trend: our overall performance improvement now gains less from improvements in the JIT compiler. In fact, the performance gain from JIT4.5 is much smaller than that achieved by each of the previous two releases. This is because we integrated most of the techniques described here in JIT3.5 and JIT4.0, including the two register-based IRs and a variety of optimizations on those IRs, and JIT4.5 added only a few remaining items. We currently believe that most of the optimization techniques that do not use profile information and are suitable for dynamic compilers have already been included in our JIT compiler, though there may be some additional opportunities we can find by investigating a wider range of applications or with the introduction of new architectural features. In the future, we need to explore profile-directed optimizations to improve the performance further, as we have attempted in prototypes [6, 10, 28].

Related work
Besides the IBM DK, there are four other major Java virtual machines publicly available today. Sun Microsystems** HotSpot and BEA Systems** JRockit** are the two major production systems, whereas the IBM Jikes* Research Virtual Machine (RVM) and the Intel Open Runtime Platform are the two major research virtual machines. All of these systems employ dynamic recompilation frameworks, by using an interpreter and an optimizing compiler or by using two different compilers, in order to focus the expensive optimization efforts only on the hot methods of a program.

Figure 15
Experimental environment for running the SPECjAppServer2002 benchmark. The system under test (SUT) consists of a database server, an IBM xSeries* 440 (two 2.4-GHz Xeon processors, 2 GB memory) running Windows 2003 Server and DB2* Version 8.1.4, and an application server, an IntelliStation* M Pro (2.8-GHz Xeon, 2 GB memory) running Windows XP Professional, WebSphere* Application Server Version 5.0.2, and SPECjAppServer2002. The SPECjAppServer2002 driver runs on a separate IntelliStation M Pro (3.06-GHz Pentium 4, 1 GB memory) with Windows XP Professional and WebSphere Application Server Version 5.0.2. The machines are connected by 1000-Mb/s links.

Figure 16
Performance improvements for each release of the JIT compiler. The baseline of the comparison is JIT3.0 [1]. The figure plots the improvement over JIT3.0 (%), on a scale from 0 to 90%, for JIT3.5 (JVM 1.2.2), JIT4.0 (JVM 1.3.1), and JIT4.5 (JVM 1.4.0) on mtrt, jess, compress, db, mpegaudio, jack, javac, SPECjbb, and the geometric mean.


The HotSpot server [29] is a JVM product that uses both an interpreter and an optimizing compiler to allow a mixed execution environment, as in our system. It uses an SSA-based IR for optimizations. The optimizer performs all of the classic optimizations, including common subexpression elimination, dead code elimination, constant propagation, and loop invariant hoisting. In addition, it employs more Java-specific optimizations, such as null check and range check elimination. It also collects profiles of the receiver type distribution online in the interpreted execution mode. The profile information, together with CHA, is used when optimizing the code for virtual and interface calls.

The JRockit [30] is another JVM product designed for server applications. It takes a compile-only approach and relies upon a fast JIT compiler for compiling methods quickly at their first invocation, as opposed to interpretive bytecode execution. If the system determines that a compiled method is causing a performance bottleneck, it reoptimizes the method with a secondary compilation with full optimizations. While the details of the compiler structure and the contents of the optimizations are not available, it seems to use an SSA-based IR for performing most of the optimizations.

The Jikes RVM [31], formerly called the Jalapeño JVM, is the most popular and widely used platform in the Java virtual machine research community and is released under an open source license. The system is implemented with Java itself and supports a multilevel recompilation framework using two different compilers: a baseline compiler and an optimizing compiler. The optimizing compiler has three IRs, all register-based, and performs a variety of optimizations on each IR. The bytecode is first translated to the high-level IR (HIR). The optimizations performed on the HIR are method inlining, some simple local optimizations within an extended BB, and flow-insensitive optimizations. The HIR is then translated to the low-level IR (LIR) by expanding some complex operators, where intraprocedural flow-sensitive optimizations based on the SSA form are performed by exploiting new opportunities exposed in the lower-level representation. The instruction selection then translates the platform-independent LIR into the machine-specific IR (MIR) using a bottom-up rewriting system (BURS). The MIR has a one-to-one mapping between IR operators and the target instruction set architecture. Using the MIR, the system performs live analysis and linear scan register allocation and then generates the native code. The optimizations on both HIR and LIR are divided into three levels in the optimizing compiler. The adaptive optimization system drives recompilation based on the estimated cost and benefit of compiling methods at each optimization level [32].

The Open Runtime Platform (ORP) [33] is another well-known RVM released as open source. It uses a compile-only approach and implements dynamic optimization with two execution modes by providing two different compilers: a fast code generator [34] and an optimizing compiler [35, 36]. While the fast code generator produces code directly from bytecode with only limited and lightweight optimizations, the dynamic compiler uses an IR to apply aggressive optimizations, such as method inlining, global dataflow-based optimizations, and loop transformations. The dynamic compiler also uses the profile information collected in the first execution mode to guide optimization decisions such as determining the inlining policy, deciding where to apply expensive optimizations, and deciding upon the code layout in the final code emission.

The SELF system [37] was pioneering work on dynamic optimization, and some of its techniques are the basis of those used today in many sophisticated JVMs and JITs. Examples include code splitting [38], receiver class profiling and the polymorphic inline cache [39], an adaptive optimization system using online profile information [40], and dynamic deoptimization and on-stack replacement [41]. In particular, the adaptive recompilation system is now considered a standard feature of dynamic optimization systems for reconciling high performance with low compilation overhead, and it is adopted in our system and in all of the above Java virtual machines. It also allows the system to exploit various kinds of profile information at the higher optimization levels to better guide optimizations using the dynamic behavior of a program.

Devirtualization of dynamically dispatched method calls has been extensively studied in static compilation environments. CHA [8] determines a set of possible targets of a dynamic method call by using the statically declared type of a receiver object and the class hierarchy of the whole program. If a target method is proved to be a single implementation, it can be inlined or the call site can be replaced with a fast direct call. Rapid type analysis [42] and variable type analysis [43] were proposed to be more precise than CHA by tightening the static type constraints on the receiver using object instantiation information. When a call site is not monomorphic but turns out to be almost monomorphic from receiver class profile information, a class test is provided to guard the devirtualized code [44]. All of these techniques were developed to statically analyze dynamic method calls under a closed-world assumption.

In the dynamic environment, dynamic class loading may invalidate the result of the compile-time analysis; thus, we need to provide guard code for all devirtualized call sites (unless the site is provably monomorphic) and/or an invalidation mechanism to ensure safety when class loading occurs.


On-stack replacement [41] is a technique to effectively invalidate currently executed unsafe code and to transfer the control to another version of the same method. HotSpot [29] uses this technique to eliminate backup paths for devirtualized call sites. The technique can be implemented with or without guard code and eliminates the merge points from virtual call sites in the control flow, improving the effectiveness of dataflow-based optimizations. Preexistence analysis [11] is a simple technique for invalidating unsafe code without requiring on-stack replacement, but it is effective for only a small portion of devirtualized call sites. Since methods are guaranteed to be safe until the next invocation, the guard code is not required with this technique. Our approach to code patching [12] can completely remove the overhead of executing guard code from any of the monomorphic call sites, but the diamond-shaped control flow with the backup virtual dispatch still remains. Thin guards [45] reduce both the cost of guard code and the penalty of optimization restrictions by identifying regions of code for which speculative optimization can be performed and by mapping those multiple optimistic assumptions to a single guard.

Conclusions
We have described the design and implementation of version 4.5 of our Java JIT compiler, specifically for IA-32 platforms. Since the previous report [1], we have introduced new register-based IRs (quadruples and DAGs) and implemented more advanced optimizations to further improve the performance. At the same time, these optimizations were expected to cause high compilation overhead, and we needed to apply these optimizations more selectively and carefully than in the previous system. We designed a dynamic optimization framework with the interpreter and the system with three levels of recompilation to keep the total compilation overhead within an acceptable level.

We evaluated our approach using industry-standard benchmarks. The results, measured in compilation time, compiled code size, and peak work memory usage during compilation, showed both the effectiveness of the optimizations on performance and the low compilation overhead of our dynamic optimization framework.

As we add more advanced and sophisticated optimizations to our JIT compiler, we have increasing numbers of heuristics for controlling these optimizations, such as method inlining decisions, loop versioning decisions, the number of iterations for the dataflow, and triggers for recompilation. This is more or less inevitable, especially for dynamic compilers, since we need to balance two conflicting requirements: generating good-quality code and minimizing compilation overhead. The tuning of these heuristics is a difficult problem. We currently determine these various parameters from empirical results using a variety of industry-standard benchmarks, but we will soon need to include more customer applications to determine them.

In the future, we will further study practical and effective ways of performing dynamic compilation. Instead of the predetermined optimization classification currently implemented, it is desirable to dynamically select a set of suitable optimizations according to the characteristics of the target method, so that we apply only those optimizations known to be effective for that method. In a dynamic compilation environment, we should avoid applying optimizations that do not contribute to performance improvements and therefore waste compilation resources. A possible way of doing this is to estimate the impact of each costly optimization on the basis of both the profile information and a structural analysis of the given method. The notion of program metrics [46] is one such attempt to achieve better optimization decisions.

Profile-directed optimizations are another potential route for future JIT evolution. Although we did not address any of those techniques in this paper, we have experimented with some of them [6, 10, 28] in our framework and see a potential for significant improvements. On the other hand, we believe that introducing profile-directed optimizations may pose additional challenges from the reliability, availability, and serviceability (RAS) point of view, especially in product quality and code maintenance. When a problem arises around a profile-directed optimization, it is in general quite difficult to reproduce the problem and trace it back to the root cause. Future JIT compilers should address this issue while exploiting more of the profile information for advanced optimizations.

Acknowledgments
We thank the IBM Toronto Laboratory for continuing the development and servicing of the IBM Java JIT compiler. We also thank the IBM Java performance team in Austin, Texas, particularly Bill Alexander, Mike Collins, and Weiming Gu, for their helpful discussion and analysis in identifying opportunities to improve overall performance. We are also grateful to the former members of our group, Hiroshi Dohji, Masashi Doi, Shusaku Matsuse, Hiroyuki Momose, Arvin Shepherd, Janice Shepherd, Kunio Tabata, Akihiko Togami, and John Whaley, for their respective contributions to improving the JIT compiler. Finally, we thank the anonymous reviewers for their valuable comments and suggestions for improving this paper.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Sun Microsystems, Incorporated, Microsoft Corporation, Linus Torvalds, Intel Corporation, Standard Performance Evaluation Corporation, or BEA Systems, Incorporated, in the United States, other countries, or both.

References
1. T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani, "Overview of the IBM Java Just-in-Time Compiler," IBM Syst. J. 39, No. 1, 175–193 (2000); see http://www.research.ibm.com/journal/sj/391/suganuma.pdf.

2. T. Ogasawara, H. Komatsu, and T. Nakatani, "A Study of Exception Handling and Its Dynamic Optimization in Java," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2001, pp. 83–95.

3. R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, "Efficiently Computing Static Single Assignment Form and the Control Dependence Graph," ACM Trans. Programming Languages & Syst. 13, No. 4, 451–490 (October 1991).

4. J.-D. Choi, D. Grove, M. Hind, and V. Sarkar, "Efficient and Precise Modeling of Exceptions for the Analysis of Java Programs," Proceedings of the ACM Workshop on Program Analysis for Software Tools and Engineering, September 1999, pp. 21–31.

5. K. Pettis and R. C. Hansen, "Profile-Guided Code Positioning," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1990, pp. 16–27.

6. T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani, "A Dynamic Optimization Framework for a Java Just-in-Time Compiler," Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2001, pp. 180–194.

7. K. Ishizaki, M. Takeuchi, K. Kawachiya, T. Suganuma, O. Gohda, T. Inagaki, A. Koseki, K. Ogata, M. Kawahito, T. Yasue, T. Ogasawara, T. Onodera, H. Komatsu, and T. Nakatani, "Effectiveness of Cross-Platform Optimizations for a Java Just-in-Time Compiler," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2003, pp. 187–204.

8. J. Dean, D. Grove, and C. Chambers, "Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis," Proceedings of the 9th European Conference on Object-Oriented Programming, August 1995; published in Lecture Notes in Computer Science 952, 77–101 (1995).

9. J. Whaley, "A Portable Sampling-Based Profiler for Java Virtual Machines," Proceedings of the ACM SIGPLAN Java Grande Conference, June 2000, pp. 78–87.

10. T. Suganuma, T. Yasue, and T. Nakatani, "An Empirical Study of Method Inlining for a Java Just-in-Time Compiler," Proceedings of the USENIX 2nd Java Virtual Machine Research and Technology Symposium, August 2002, pp. 91–104.

11. D. Detlefs and O. Agesen, "Inlining of Virtual Methods," Proceedings of the 13th European Conference on Object-Oriented Programming, June 1999; published in Lecture Notes in Computer Science 1628, 258–277 (1999).

12. K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani, "A Study of Devirtualization Techniques for a Java Just-in-Time Compiler," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2000, pp. 294–310.

13. T. Ogasawara, H. Komatsu, and T. Nakatani, "Optimizing Precision Overhead for x86 Processors," Proceedings of the USENIX Java Virtual Machine Research and Technology Symposium, August 2002, pp. 41–50.

14. J. Whaley and M. Rinard, "Compositional Pointer and Escape Analysis for Java Programs," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, November 1999, pp. 187–206.

15. M. Kawahito, H. Komatsu, and T. Nakatani, "Effective Null Pointer Check Elimination Utilizing Hardware Trap," Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000, pp. 139–149.

16. M. Kawahito, H. Komatsu, and T. Nakatani, "Eliminating Exception Checks and Partial Redundancies for Java Just-in-Time Compilers," Research Report RT-0350, April 2000; IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan.

17. V. V. Mikheev, S. A. Fedoseev, V. V. Sukharev, and N. V. Lipsky, "Effective Enhancement of Loop Versioning in Java," Proceedings of the 11th International Conference on Compiler Construction, published in Lecture Notes in Computer Science 2304, 293–306 (April 2002).

18. J. R. Goodman and W. C. Hsu, "Code Scheduling and Register Allocation in Large Basic Blocks," Proceedings of the International Conference on Supercomputing, July 1988, pp. 442–452.

19. M. Poletto and V. Sarkar, "Linear Scan Register Allocation," ACM Trans. Programming Languages & Syst. 21, No. 5, 895–913 (September 1999).

20. P. Briggs, K. D. Cooper, and L. Torczon, "Improvements to Graph Coloring Register Allocation," ACM Trans. Programming Languages & Syst. 16, No. 3, 428–455 (May 1994).

21. A. Koseki, H. Komatsu, and T. Nakatani, "Preference-Directed Graph Coloring," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2002, pp. 33–44.

22. L. George and A. W. Appel, "Iterated Register Coalescing," ACM Trans. Programming Languages & Syst. 18, No. 3, 300–324 (May 1996).

23. A. Koseki and H. Komatsu, "Register Constraint Back Propagation Working in Collaboration with a Register Manager," Research Report RT-0551, October 2003; IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan.

24. SPECjvm98, SPECjbb2000, and SPECjAppServer2002 Benchmarks, available at http://www.spec.org/osg/; Standard Performance Evaluation Corporation (SPEC), 6585 Merchant Place, Suite 100, Warrenton, VA 20187.

25. Java Grande Forum Sequential Benchmarks, available at http://www.epcc.ed.ac.uk/javagrande/sequential.html; The Java Grande Forum; see http://www.javagrande.org/.

26. XSLT Performance Benchmark, available at http://www.datapower.com/xml_community/xsltmark.html; DataPower Technology Inc., One Alewife Center, 4th Floor, Cambridge, MA 02140.

27. Apache XML Project Xalan-Java Version 2.5.2, available at http://xml.apache.org/xalan-j/index.html; The Apache Software Foundation; see http://xml.apache.org/.

28. T. Suganuma, T. Yasue, and T. Nakatani, "A Region-Based Compilation Technique for a Java Just-in-Time Compiler," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2003, pp. 312–323.

29. M. Paleczny, C. Vick, and C. Click, "The Java HotSpot Server Compiler," Proceedings of the USENIX Symposium on Java Virtual Machine Research and Technology, April 2001, pp. 1–12.

30. K. Shiv, R. Iyer, C. Newburn, J. Dahlstedt, M. Lagergren, and O. Lindholm, "Impact of JIT/JVM Optimizations on Java Application Performance," Proceedings of the 7th Annual Workshop on Interaction Between Compilers and Computer Architecture, February 2003, pp. 5–13.

31. B. Alpern, C. R. Attanasio, J. J. Barton, M. G. Burke, P. Cheng, J.-D. Choi, A. Cocchi, S. J. Fink, D. Grove, M. Hind, S. F. Hummel, D. Lieber, V. Litvinov, M. F. Mergen, T. Ngo, J. R. Russell, V. Sarkar, M. J. Serrano, J. C. Shepherd, S. E. Smith, V. C. Sreedhar, H. Srinivasan, and J. Whaley, "The Jalapeño Virtual Machine," IBM Syst. J. 39, No. 1, 211–238 (2000).

32. M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney, "Adaptive Optimizations in the Jalapeño JVM," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2002, pp. 47–65.

33. M. Cierniak, B. T. Lewis, and J. M. Stichnoth, "Open Runtime Platform: Flexibility with Performance Using Interfaces," Proceedings of the Joint ACM Java Grande–ISCOPE 2002 Conference, November 2002, pp. 156–164.

34. A. R. Adl-Tabatabai, M. Cierniak, G. Y. Lueh, V. M. Parakh, and J. M. Stichnoth, "Fast, Effective Code Generation in a Just-in-Time Java Compiler," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1998, pp. 280–290.

35. M. Cierniak, G. Y. Lueh, and J. M. Stichnoth, "Practicing JUDO: Java Under Dynamic Optimizations," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2000, pp. 13–26.

36. A. R. Adl-Tabatabai, J. Bharadwaj, D. Y. Chen, A. Ghuloum, V. Menon, B. Murphy, M. Serrano, and T. Shpeisman, "The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments," Intel Technol. J. 7, No. 1, 19–31 (February 2003).

37. C. Chambers, "The Design and Implementation of the SELF Compiler, an Optimizing Compiler for Object-Oriented Programming Languages," Ph.D. Thesis, Stanford University, CS-TR-92-1420, March 1992.

38. C. Chambers and D. Ungar, "Making Pure Object-Oriented Languages Practical," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 1991, pp. 1–15.

39. U. Hölzle, C. Chambers, and D. Ungar, "Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches," Proceedings of the 5th European Conference on Object-Oriented Programming, July 1991; published in Lecture Notes in Computer Science 512, 21–38 (1991).

40. U. Hölzle and D. Ungar, "Reconciling Responsiveness with Performance in Pure Object-Oriented Languages," ACM Trans. Programming Languages & Syst. 18, No. 4, 355–400 (July 1996).

41. U. Hölzle, C. Chambers, and D. Ungar, "Debugging Optimized Code with Dynamic Deoptimization," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1992, pp. 32–43.

42. D. F. Bacon and P. F. Sweeney, "Fast Static Analysis of C++ Virtual Function Calls," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 1996, pp. 324–341.

43. V. Sundaresan, L. Hendren, C. Razafimahefa, R. Vallée-Rai, P. Lam, E. Gagnon, and C. Godin, "Practical Virtual Method Call Resolution for Java," Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2000, pp. 264–280.

44. G. Aigner and U. Hölzle, "Eliminating Virtual Function Calls in C++ Programs," Proceedings of the 10th European Conference on Object-Oriented Programming, July 1996; published in Lecture Notes in Computer Science 1098, 142–166 (1996).

45. M. Arnold and B. G. Ryder, "Thin Guards: A Simple and Effective Technique for Reducing the Penalty of Dynamic Class Loading," Proceedings of the 16th European Conference on Object-Oriented Programming, June 2002; published in Lecture Notes in Computer Science 2374, 498–524 (2002).

46. T. Kistler and M. Franz, "Continuous Program Optimization: Design and Evaluation," IEEE Trans. Computers 50, No. 6, 549–566 (June 2001).

Received October 15, 2003; accepted for publication December 17, 2003; Internet publication September 17, 2004

Toshio Suganuma IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Mr. Suganuma is a Senior Advisory Researcher. He received B.E. and M.E. degrees in applied mathematics and physics from Kyoto University in 1980 and 1982, respectively, and received an M.S. degree in computer science from Harvard University in 1992. Since joining IBM in 1992, he has worked on the high-performance Fortran and Java JIT compiler projects. Mr. Suganuma's primary research interest is in the area of dynamic compiler optimizations.

Takeshi Ogasawara IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Mr. Ogasawara works in the areas of optimizing compilers. He joined IBM in 1991 at the Tokyo Research Laboratory after receiving B.S. and M.S. degrees in computer science from the University of Tokyo. He designed and implemented Exception-Directed Optimization for the IBM Java JIT compiler and contributed to the runtime support of method inlining. Mr. Ogasawara's research interests include optimizations by compilers and their runtime systems.

Kiyokuni Kawachiya IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Mr. Kawachiya is a Senior Advisory Researcher. He received B.S. and M.S. degrees in information science from the University of Tokyo in 1985 and 1987, respectively. Since joining IBM in 1987, he has been working on operating systems, multimedia systems, and mobile information devices. He joined the Java JIT compiler project in 1997, contributing primarily to its Linux version, IA-64 runtime, and synchronization optimization. Mr. Kawachiya's research interests include systems software and computer architecture.

Mikio Takeuchi IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Mr. Takeuchi is an Advisory Researcher. He joined IBM in 1993 after receiving B.S. and M.S. degrees in mathematical engineering and information physics from the University of Tokyo. Until March 2003 he was a member of the Java JIT compiler project, where he worked on a lightweight register manager for the Intel IA-32 architecture and the compiler back end for the Intel IA-64 architecture. Mr. Takeuchi's research interests include compilers, runtimes, and tools for dynamic programming languages.

Kazuaki Ishizaki IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Dr. Ishizaki received B.S., M.S., and Ph.D. degrees in computer science from Waseda University in 1990, 1992, and 2002, respectively. He joined IBM in the Research Division in 1992 and has worked on the high-performance Fortran compiler. Dr. Ishizaki is currently involved with the Java JIT compiler; his research interests include optimizing compilers and processor architectures.

Akira Koseki IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Dr. Koseki received M.S. and Ph.D. degrees in computer engineering from Waseda University in 1994 and 1998, respectively. He joined the IBM Tokyo Research Laboratory in 1998. Dr. Koseki has been working primarily on code generation and register allocation.

Tatsushi Inagaki IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Mr. Inagaki received B.S. and M.S. degrees from the University of Tokyo in 1993 and 1995, respectively. Since joining the IBM Tokyo Research Laboratory in 1998, he has been working on DAG-based optimizations in the Java JIT compiler project. His research interests include compiler optimizations and parallel processing.

Toshiaki Yasue IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Mr. Yasue received B.S. and M.S. degrees from the School of Science and Engineering, Waseda University, in 1989 and 1991, respectively. Since joining IBM Japan in 1995, he has been working on the Java JIT compiler in the Network Computing Platform Group. Mr. Yasue's primary research interests include compiler optimization and parallel processing.

Motohiro Kawahito IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Dr. Kawahito received B.S. and Ph.D. degrees in computer science from Waseda University in 1991 and 2004, respectively. He joined IBM in 1991 as a member of the AIX Systems Group at the IBM Yamato Laboratory. He joined the Tokyo Research Laboratory in 1998 and is currently an Advisory Researcher. Dr. Kawahito worked on partial redundancy elimination, exception check elimination, constant propagation, dead store elimination, and class variable privatization for the Java JIT compiler project.

Tamiya Onodera IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Dr. Onodera is a Senior Technical Staff Member. He received B.S., M.S., and Ph.D. degrees in computer science from the University of Tokyo in 1983, 1985, and 1988, respectively. He joined IBM in 1988 and initially worked on the design and implementation of a new object-oriented language called COB. For the past seven years, he has been working on IBM Java virtual machines and JIT compilers. Dr. Onodera's primary research interest is efficient implementations of object-oriented programming languages.

Hideaki Komatsu IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Dr. Komatsu is a Senior Technical Staff Member. He received B.S. and M.S. degrees in electrical engineering from Waseda University in 1983 and 1985, respectively, and a Ph.D. degree in computer science from Waseda University in 1998. Since joining the IBM Tokyo Research Laboratory in 1985, he has carried out research activities in the areas of optimizing compilers, Prolog compilers, dataflow compilers, the Fortran 90 compiler, the high-performance Fortran compiler, and the Java JIT compiler. Dr. Komatsu's research interests include compiler optimization techniques (register allocation, code scheduling, and loop optimizations) for instruction-level parallel architectures and loop optimizations for massively parallel computers.

Toshio Nakatani IBM Research Division, IBM Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan ([email protected]). Dr. Nakatani received a B.S. degree in mathematics from Waseda University in 1975, and M.S.E., M.A., and Ph.D. degrees in computer science from Princeton University in 1985, 1985, and 1987, respectively. He joined the IBM Tokyo Research Laboratory in 1987 and is currently an IBM Distinguished Engineer and manager of the Network Computing Platform Group. Dr. Nakatani's research interests include computer architectures, optimizing compilers, and algorithms for parallel computer systems.
