INLINING JAVA NATIVE CALLS AT RUNTIME
by
Levon S. Stepanian
A thesis submitted in conformity with the requirements
for the degree of Masters of Science
Graduate Department of Computer Science
University of Toronto
Copyright © 2005 by Levon S. Stepanian
Abstract
Inlining Java Native Calls at Runtime
Levon S. Stepanian
Masters of Science
Graduate Department of Computer Science
University of Toronto
2005
Despite the overheads associated with the Java Native Interface (JNI), its opaque and binary-
compatible nature makes it the preferred interoperability mechanism for Java applications that
use legacy, high-performance and architecture-dependent native code.
This thesis addresses the performance issues associated with the JNI by providing a strategy
that transforms JNI callbacks into semantically equivalent but significantly cheaper operations
at runtime. In order to do so, the strategy first inlines native functions into Java applications us-
ing a Just-in-time (JIT) compiler. Native function inlining is performed by leveraging the abil-
ity to store statically-generated intermediate language alongside native binaries. Once inlined,
transformed native code can be further optimized due to the availability of runtime information
to the JIT compiler.
Preliminary evaluations on a prototype implementation of our strategy show that it can sub-
stantially reduce the overhead of performing native calls and JNI callbacks, while preserving
the opaque and binary-compatible characteristics of the JNI.
Dedication
Jack, Maggy and Hovan,
Rose and Eugenie,
Saro, Christian, Krikor, Sabine, Vicky, Paul and Corine
Paul Gries,
and all those who have helped me along the way
your names go unmentioned but unforgotten
Acknowledgements
I should start off by thanking Angela Demke Brown for taking me under her wing and
being my mentor and an incredible supervisor. Thank you for providing me with support and
guidance, but most importantly, for teaching me how to become a more analytical and efficient
researcher.
Allan Kielstra, just as Angela did, made me dive head-first into my research question. I
wouldn’t be here if it wasn’t for the W-Code conversion mechanism he masterminded, and of
course, the patience he showed with my sometimes inept if not aloof modes of inquiry.
I’d like to thank Kevin Stoodley for the idea that inspired this work, his unrelenting support
through it all, and his help in speeding up the otherwise nauseating patent filing process.
Many thanks go not only to Kelly Lyons and Marin Litoiu and IBM’s Centers For Advanced
Studies for providing me with an impeccable working environment, but to the countless TR JIT
and J9 engineers that steered me in the right direction.
To the guys and gals in syslab - despite being away most of the time, I’ll never forget the
times we shared. Best of luck in the future and hopefully our paths cross again.
And last but certainly not least, my parents. You are the reason why I am here today,
standing proud, fearless and humble in this beautiful world.
Contents
1 Introduction
  1.1 Java and the JNI
  1.2 Motivation
    1.2.1 JNI Performance Issues
    1.2.2 Pervasiveness of the JNI
  1.3 Approach and Challenges
  1.4 Contributions

2 Design
  2.1 Assumptions
  2.2 Requirements of an IL Conversion Mechanism
  2.3 Inlining Native Calls
    2.3.1 Enhancements to a Java JIT Compiler’s Inliner
  2.4 Optimizing JNI Callbacks
    2.4.1 Identifying Inlined JNI Callbacks
    2.4.2 JNI Argument Use/Def Analysis
    2.4.3 Callback Transformations
  2.5 Other Callback Transformations
  2.6 Design Concerns
    2.6.1 Synthesizing Opaque Calls
    2.6.2 Shared Data
  2.7 Design Summary

3 Tools
  3.1 The TR JIT Compiler and J9 virtual machine
    3.1.1 Inlining in the TR JIT compiler
  3.2 TR Intermediate Language
  3.3 W-Code and The IL Conversion Mechanism

4 Implementation
  4.1 General Modifications to the TR JIT Compiler
  4.2 Modifications to the TR JIT Compiler’s Inliner
  4.3 Introducing the Inlined CallHandlers
    4.3.1 The JNICallHandler
    4.3.2 The ExternalCallHandler
  4.4 Changes to the TR JIT Code Generator
  4.5 Current Status

5 Results and Analysis
  5.1 Experimental Platform
  5.2 W-Code Conversion Costs
  5.3 Native Inlining Benefits
  5.4 Callback Transformation Benefits
  5.5 Eliminating Data-Copy Costs
  5.6 Optimizing Inlined Native Code
  5.7 Synthesis Decisions

6 Related Work
  6.1 Alternative Language Interoperability Frameworks
  6.2 Programmer-Based Optimizations
  6.3 Restricting Functionality in Native Code
  6.4 Proprietary Native Interfaces
  6.5 Unmanaged Memory
  6.6 Optimizing the JNI
  6.7 Compiler IL as Runtime Program Data

7 Conclusions
  7.1 Engineering Issues
  7.2 Performance Issues
  7.3 Future Directions
  7.4 Conclusion

Appendix

Bibliography
List of Tables
4.1 Current support for callbacks and external function calls
5.1 Microbenchmark runtimes and improvements with native inlining
5.2 Microbenchmark runtimes and improvements with native inlining and callback transformations
5.3 Moving data from Java to C - improvements with native inlining and callback transformations
5.4 hash: Performance improvements with other JIT compiler optimizations
5.5 GetArrayLength: Improvements with native inlining and callback transformations and other JIT compiler optimizations
5.6 Synthesizing calls to opaque functions
A.1 Cost of W-Code to TR-IL conversion for SPEC CINT2000 benchmarks
A.2 Raw timing measurements for Table 5.1
A.3 Raw timing measurements for Table 5.2
A.4 Raw timing measurements for Table 5.3 and Figure 5.2
A.5 Raw timing measurements for Table 5.4 and Figure 5.3
A.6 Raw timing measurements for Table 5.5
A.7 Raw timing measurements for Table 5.6
List of Figures
1.1 Interactions between Java and non-Java (native) code
1.2 The JNIEnv pointer (JNIEnv *)
2.1 The native function inlining process
2.2 Sample inlined native code before callback transformations
2.3 Sample inlined native code after callback transformations
2.4 Synthesizing opaque function calls
3.1 The TR JIT compiler’s architecture
3.2 Sample TR-IL
3.3 The IL conversion process
4.1 The TR inliner: Handling native functions
4.2 The Inlined Call Handler class hierarchy
4.3 Pseudocode for JNICallHandler::transformCalls
4.4 Pseudocode for JNICallHandler::synthesize
5.1 Cost of W-Code to TR-IL conversion for SPEC CINT2000 benchmarks
5.2 Moving data from Java to C - a graphical representation of the improvements with native inlining and callback transformations
5.3 Exposing inlined native code to other JIT optimizations
Chapter 1
Introduction
Currently, there is no single programming language that is universally suitable for all tasks, nor
is one likely to emerge. Rather than focusing on a one-size-fits-all approach to programming
language design, support for language interoperability is the preferred solution. In addition to
allowing programmers to choose the right tool for the job, interoperability allows the reuse of
legacy applications and libraries that may have been written in a different language.
Most high-level languages support interoperability by providing some mechanism for call-
ing code written in a low-level language (such as C). These mechanisms typically impose both
time and space overheads at each cross-language function invocation because arguments and
results must be packaged carefully to bridge the boundary between the languages involved.
In this thesis, we focus on optimizing the Java Native Interface (JNI) [28], the interoper-
ability interface used by the Java™ programming language [21]. Our goal is to reduce the
space and time overheads involved in crossings between Java and non-Java (native) program-
ming languages. We do this by providing a Just-in-time (JIT) compiler optimization that inlines
compiled native code into compiled Java code at runtime. Our strategy also performs optimiz-
ing transformations on inlined JNI function calls, and allows the JIT compiler to perform other
optimizations on inlined native code.
The rest of this chapter provides background information and the motivation behind our
work. It also summarizes our approach and the challenges we encounter, and concludes with a
description of our research contributions.
1.1 Java and the JNI
Java is a high-level programming language that features object-orientation, platform indepen-
dence and type safety, as well as automatic memory management via garbage collection. These
and other characteristics have led to its widespread adoption in many settings.
The JNI is Java’s interoperability mechanism. It is a two-way application programming
interface (API) that provides interoperability at the function-level, permitting Java programs
to invoke functions written in native languages (which we refer to as callouts), at the same
time allowing native programs to access and modify data and services from an executing Java
virtual machine (JVM) (performed via functions which we refer to as callbacks).
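For readers unfamiliar with the mechanics, the following sketch (using a hypothetical Counter
class and increment method, not drawn from this thesis) shows both directions at once: the C
function is the target of a callout, and each call through the env pointer inside it is a callback:

    #include <jni.h>

    /* Callout target for a hypothetical Java class Counter that declares
     *     public native int increment();
     * and loads this library via System.loadLibrary. */
    JNIEXPORT jint JNICALL
    Java_Counter_increment(JNIEnv *env, jobject obj) {
        /* Callbacks: each call through env re-enters the JVM. */
        jclass cls = (*env)->GetObjectClass(env, obj);
        jfieldID fid = (*env)->GetFieldID(env, cls, "value", "I");
        jint v = (*env)->GetIntField(env, obj, fid);
        (*env)->SetIntField(env, obj, fid, v + 1);
        return v + 1;
    }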
Generally speaking, callouts provide Java applications with the ability to leverage legacy,
high-performance and architecture-dependent native code. Callbacks, on the other hand, pro-
vide native code with access to JVM managed objects (i.e., strings and arrays) and perform a
host of other operations including reference management, exception handling, synchronization
and reflection, as well as JVM instantiation and invocation. The latter feature can be used to
embed a JVM implementation into a native application in order to execute software written in
Java. Figure 1.1 demonstrates the interactions between a Java program, a JVM and its JNI im-
plementation, native code and a host architecture. Host architecture refers to a host operating
system, a set of native libraries and the host CPU instruction set.
Native code performs callbacks by calling JNI functions, which are made accessible by the
JNIEnv pointer that is passed as the first argument to every Java native function. As depicted
in Figure 1.2, each Java thread that invokes a native function receives its own JNIEnv pointer
containing thread-local data, as well as a pointer to a table of JNI function pointers. The two
levels of pointer indirection allow native code to link to any JVM implementation, and provide

Figure 1.1: Interactions between Java and non-Java (native) code

the JVM implementor with the flexibility of choosing between different function tables, as well
as different function implementations at runtime.
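Spelled out in C (a sketch using the standard jni.h C binding; the helper function itself is
purely illustrative, and the C++ binding hides one of the loads behind inline member functions),
the two indirections behind a single callback look like this:

    #include <jni.h>

    /* What a callback such as (*env)->GetVersion(env) does, step by step. */
    jint callback_get_version(JNIEnv *env) {
        const struct JNINativeInterface_ *table = *env;   /* 1st load: the function table   */
        jint (JNICALL *fn)(JNIEnv *) = table->GetVersion; /* 2nd load: the function pointer */
        return fn(env);                                   /* indirect call                  */
    }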
The JNI is designed to provide opaque access to JVM internals, by hiding JVM data struc-
tures and the binary layouts of heap-allocated objects. It is also binary-compatible, allowing
programmers to address interoperability issues once, and expect their software solutions to
function with all implementations of the Java platform (for a particular host environment).
We recognize that there exist alternative ways to introduce native code to a Java application.
Some of the alternatives separate Java applications and native code into separate processes
thereby exacerbating interoperability overheads. Others tightly couple JVMs and native code,
reducing overheads, but breaking the binary compatibility property essential to the JNI. A full
discussion of such alternatives is provided in Chapter 6.
We also acknowledge the fact that JNI-dependent Java code breaks the compile-once, run-
anywhere paradigm that has made Java the programming language of choice. Furthermore,
we recognize that JNI-dependent Java programs are limited by the type-unsafe nature of the
very native code they incorporate. Our work, however, mitigates this by implicitly performing

Figure 1.2: The JNIEnv pointer (JNIEnv *)

runtime checks on inlined native code.
Having said this, we now examine the performance aspects of the JNI which our work
addresses.
1.2 Motivation
Our work is motivated by interoperability space and time overheads that afflict Java applica-
tions containing native function calls. We are also motivated by the JNI’s pervasive nature.
Since the JNI is used in a large number of applications, we believe any runtime benefit that
results from minimizing Java interoperability overheads will have a widespread effect.
1.2.1 JNI Performance Issues
The JNI’s strength lies in decoupling native code from a specific JVM implementation by pro-
viding opaque access to JVM internals, data, and services. The cost of this property is lost
efficiency, namely large runtime overheads during callouts to native functions, and even larger
ones during callbacks to access Java code and data. Furthermore, JIT compilers are not able
to perform aggressive optimizations on Java code containing native function calls because they
are forced to make pessimistic assumptions about the side-effects of these opaque calls.
Callout Overheads
Generally speaking, the traditional costs associated with a function call include setting up an
activation record, branching to the callee, branching on return and restoring the call stack. Java
callouts are very similar to traditional function calls, but bear a number of unique overheads:
• A native library containing the function called by a Java application must be loaded on
or before the function’s first invocation. The class containing the native call, as well
as the call itself must be resolved (resolution may require multiple passes over the ex-
ported functions of a native library). These are one-time costs that can be amortized if a
particular native function is invoked repeatedly.
• During each individual native function invocation, the JVM must also set up the native
stack (and possibly registers) to copy primitive-typed arguments, and add a layer of indi-
rection to passed reference arguments. There also exist JVM handshaking requirements
that must be met by each Java thread leaving the JVM’s context and entering a native
context, including handshaking for garbage collection and synchronization.
• Upon returning from native code, the return value must be pushed onto the Java stack
and the native stack must be restored. In addition, handshaking requirements for Java
threads re-entering the JVM context, which might include checking exception statuses
and garbage collecting local references created by native code, must be met. JVMs
with Just-in-time (JIT) compilers, however, may reduce these overheads by generating
specialized code segments to perform the required work at native call sites, as is done
in the Intel™ Open Runtime Platform [12] and in the IBM JIT compiler which we use
for our work. These code segments are further discussed in Section 4.4 and Chapter 6
respectively.
Sunderam and Kurzinyec [41] have studied the performance of different types of native calls
using different JVM implementations. The slowdowns they report when using native functions
range from a factor of 1.05 to a factor of 16 in the worst case. Similar results are produced in
overhead-measuring experiments performed by Murray et al. [32]. Liang [28] also reports a
factor of three slowdown when comparing native function calls to regular Java function calls.
Callback Overheads
Although callouts are reasonably expensive, the more significant source of overhead occurs
when native code invokes JNI callbacks. As described earlier, JNI functions are only callable
through a reference to the JNIEnv pointer. A callback thus pays an immediate performance
penalty because two levels of indirection¶ are used - one to obtain the appropriate function
pointer through the JNIEnv pointer, and one to invoke the function using that pointer. Other,
more specific callback overheads depend on the JNI function being called:
• Heap-allocated native function parameters - To make use of certain JVM heap-allocated
objects that are passed to native code as arguments (i.e., strings and arrays), native code
must first acquire access to them. Unfortunately, JVMs with garbage collectors that do
not support object pinning must perform expensive runtime copy operations to first pro-
vide native code with its own copy of the objects, and then to later update the JVM heap
with modifications to the copied objects. The JNI, however, also provides callbacks that
claim to increase the chances of receiving direct references to heap-allocated data, but
supporting such callbacks is left to the JVM’s discretion and also places certain restric-
tions on the programmer’s freedom. Because JVMs may implement these callbacks in
any way they choose, there is no guarantee that better performance will actually result
from their use. Sunderam and Kurzinyec [41] demonstrate that the achieved performance
for these types of callbacks varies widely across different JVM implementations (both kinds
of overhead are illustrated in the sketch after this list).
• Fields and methods - Using Java data types, modifying object data, calling methods
and accessing JVM services from native code are also performed via callbacks. Reading
or modifying an instance or static field, as well as calling an instance or static function
first requires retrieving a handle to it and then performing the required operation via an-
other callback. Handle retrieval is commonly implemented as a traversal on the JVM’s
reflective data structures, in addition to expensive string-based signature comparison op-
erations performed at runtime [9]. Results in [41] highlight these callback overheads.
For example, field accesses in Java are orders of magnitude faster than those via the JNI.
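A sketch (hypothetical function and field names) that exhibits both kinds of overhead described
in the bullets above: the copy-in/copy-out array callbacks, and the handle lookup followed by a
second callback for field access:

    #include <jni.h>

    JNIEXPORT jint JNICALL
    Java_Example_sum(JNIEnv *env, jobject obj, jintArray arr) {
        /* May force a full copy of the array if the collector cannot
         * pin objects; isCopy reports which case occurred. */
        jboolean isCopy;
        jint *elems = (*env)->GetIntArrayElements(env, arr, &isCopy);
        jsize len = (*env)->GetArrayLength(env, arr);
        jint sum = 0;
        for (jsize i = 0; i < len; i++) sum += elems[i];
        /* JNI_ABORT: free the copy without writing changes back. */
        (*env)->ReleaseIntArrayElements(env, arr, elems, JNI_ABORT);

        /* Field access: a reflective handle lookup (string-based
         * signature comparison) followed by another callback. */
        jclass cls = (*env)->GetObjectClass(env, obj);
        jfieldID fid = (*env)->GetFieldID(env, cls, "bias", "I");
        return sum + (*env)->GetIntField(env, obj, fid);
    }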
Other JNI callbacks (such as those dealing with reference management, exception handling,
synchronization, reflection and JVM instantiation and invocation) share costs similar to those
of field and method access callbacks, but have their own unique set of overheads. Liang [28]
recognizes the inability of JVMs to optimize callbacks, and hypothesizes that the overhead of
any callback can be as much as an order of magnitude greater than a normal Java function call.
Furthermore, callbacks also perform JVM handshaking (since callbacks require Java threads
to re-enter and then exit the JVM’s context), and may sometimes block if the JVM is in the
midst of performing a blocking task (such as garbage collection). The latter is dependent on
the specific JVM implementation and lies outside the scope of the JNI specification.

¶Only one level of indirection is required for C++ native functions.
1.2.2 Pervasiveness of the JNI
What makes JNI overheads more troubling is the fact that there is widespread adoption and
usage of Java applications that depend on the JNI for functionality. We are concerned by the
high overheads associated with the JNI, especially since a prime motivator for using it in Java
applications is to access high-performance native code. Performance critical routines can be
written in low-level native code, and incorporated into Java applications by wrapping them
as Java native functions. More specifically, the JNI has been used in I/O implementations
to improve the performance of object serialization for distributed computing [11], to provide
bindings for low-level parallel computing libraries [4, 19], and to implement high-speed net-
work interfaces [42].
Bik and Gannon [7] make strong arguments in favor of implementing numerical routines
as Java native functions, and despite improvements in pure Java numerical libraries, interfaces
to widely-used but platform-dependent optimized native linear algebra packages are still being
developed [23].
Besides performance-critical routines, the JNI is also used to implement features that are
not available in Java. The graphical components of Java-based user interface libraries, in-
cluding the Standard Widgets Toolkit [34] and the Abstract Window Toolkit [39], as well as
other Java-based multimedia APIs [31] rely on the JNI to make use of underlying architecture
functionality. Native code is also used to compensate for other functionality unavailable in
Java, including low-level hardware performance measuring tools [37], and accurate timers and
system resource monitors [5].
The JNI has also been used to implement various JVM frameworks and APIs, such as the
Java 5.0 class libraries, the reflective, Java-based OpenJIT compiler from Ogawa et al. [35] and
the Microsoft™ Marmot JVM’s class libraries [17].
Java applications also use the JNI as a software engineering tool to leverage and utilize
functionality provided by large sets of legacy code. Without the JNI, this code would have
to be re-written and re-engineered using the Java programming language if it were to be
incorporated into Java applications. Liang [28] devotes an entire chapter of the JNI specifica-
tion to techniques one might use when “wrapping” native functions for integration with Java
applications.
Having provided the background and motivation behind our work, we provide a description
of a strategy that reduces the callout and callback overheads experienced by JNI-dependent Java
applications.
1.3 Approach and Challenges
This thesis describes the design, implementation and evaluation of a strategy that inlines com-
piled native code into compiled Java code, thereby removing function call and return overheads.
More specifically, our strategy inlines native calls performed by Java applications at runtime,
and also performs optimizing transformations on inlined JNI callbacks, further improving the
performance of inlined native code.
We have implemented a prototype of this native function inlining strategy inside a production-
quality Java JIT optimizing compiler. Our proof-of-concept implementation operates as part
of the JIT compiler’s runtime optimization strategy, and utilizes an intermediate language (IL)
conversion mechanism to translate native code to JIT compiler IL during inlining. Our strategy
reduces JNI overheads while maintaining the JNI’s opaque and binary compatible nature.
Once native code has been inlined, the JIT compiler can also remove pessimistic assump-
tions it may have maintained about opaque native function calls, and performs aggressive run-
time optimizations on inlined native code.
As with any runtime optimization, we wish to amortize the cost of performing the opti-
mization by obtaining significant benefits from native function inlining and callback transfor-
mations. Our inlining strategy, however, must also deal with native code that is non-inlineable
and callbacks that are non-transformable. Furthermore, it must provide correct linkages for
program data that is shared between Java and native code as a byproduct of inlining, and most
importantly, it should enforce both native language and JNI semantics on all inlined and opti-
mized code.
1.4 Contributions
Our research contribution is a JIT compiler based native function inlining and optimizing call-
back transformation framework that reduces JNI overheads, and makes Java a more attractive
solution for cross-language application and system development.
Our implementation shows significant performance increases from inlining native code and
optimizing JNI callbacks for simple microbenchmarks. In spite of the prototypical-nature of
our work, we expect these benefits to translate into performance improvements in real applica-
tions that make extensive use of the JNI.
To be more specific, our contribution includes methods that:
• identify native function calls in a JIT compiler’s IL
• convert the statically generated and optimized IL of a native function to a JIT compiler’s
IL at runtime (while preserving the semantics of the native programming language)
• inline native function calls at runtime
• identify JNI function calls in a JIT compiler’s IL
• transform JNI function calls to JIT compile-time constants
• transform JNI function calls to cheaper but semantically equivalent operations
• handle non-inlineable function calls found in inlined native code
• share data between inlined and non-inlined native code
In essence, our work can be viewed as a first step toward providing a framework that transplants
statically compiled code to dynamic environments where runtime information guides optimiza-
tions that are otherwise not profitable or possible to perform statically.
The rest of this thesis is organized as follows: Chapter 2 details a complete JIT compiler
native inlining and callback transformation design. Chapter 3 describes the software tools we
use for our implementation. Chapter 4 describes our implementation. We verify our hypothesis
and contributions by showcasing our experimental results in Chapter 5. Chapter 6 compares
our work to others in the field. Finally, we conclude by discussing limitations and future work
in Chapter 7.
Chapter 2
Design
Given the extensive use of the JNI in existing applications, we believe JNI performance penal-
ties must be addressed directly, rather than by introducing changes to the interface, or intro-
ducing a new interoperability mechanism for Java. Furthermore, since high-performance JVMs
include JIT compilers, we believe it is appropriate to leverage JIT optimizations to reduce the
overheads incurred as a result of using the JNI for interoperability. Instead of simply generat-
ing efficient code to perform the extra work required at JNI invocation points, however, we aim
to eliminate this extra work entirely.
Our approach is to extend a JIT compiler’s function inlining optimization to handle native
function calls. Once native code has been inlined at its callsite in a Java program, it is no longer
necessary to set up and tear down a native stack, or perform other expensive operations to pass
arguments. More importantly, the callbacks, designed to gain access to internal JVM state, can
now be transformed into JIT compile-time constants or lightweight Bytecodes that preserve the
semantics of the original source program and the JNI.
Our native inlining design consists of three phases: The first phase requires the inliner to
convert native code to a representation understood by the JIT compiler. This permits the in-
lining of native code and the elimination of the overheads associated with making callouts.
The second phase performs optimizing transformations on inlined JNI callbacks, thereby elim-
inating much of the overhead associated with performing JNI function calls in native code.
The final phase processes and fixes up inlined function calls that are not amenable to inlining,
thereby making our design robust. The following sections describe our design assumptions
and each design phase in more detail.
2.1 Assumptions
Our design assumes the existence of:
1. an optimizing Java JIT compiler that can perform Java function inlining
2. an intermediate language (IL) conversion mechanism that can perform a one-way map-
ping from statements in the compiler IL of a native language to the IL of the Java JIT
compiler mentioned in (1)
The availability of a JIT compiler is a reasonable assumption to make since there are many
open-source JIT compilers available for academic research purposes. Our assumption of the
existence of an IL conversion mechanism might seem unusual at first, but is also reasonable
because it is part of the tool-set we have decided to use for implementation. Chapter 3 provides
a description of the actual JIT compiler and IL conversion mechanism used in our proof-of-
concept implementation.
2.2 Requirements of an IL Conversion Mechanism
Instead of using source-code text or low-level assembly instructions as a representation of
native code, our design uses the IL generated by a compiler for the native language. Using
source code text would require translating from a source language to a target language, making
sure that the semantics of the source language are captured in the target language. Due to
significant syntactic and semantic differences between most programming languages, this might
require substantial additions or modifications to the target language. Representing native code
in low-level assembly has the advantage of being a small, tightly-packed representation, but
suffers from being architecture-dependent.
Our choice in using the IL generated by a compiler as a representation of native code pro-
vides us with the right amount of abstraction between a high-level and low-level representation.
Furthermore, the IL representation encodes static optimizations that are performed by a native
compiler. Our only requirement of the IL conversion mechanism is that it perform a
mapping of statements from native IL to the JIT compiler’s IL, and maintain correct semantic
information about the native code.
2.3 Inlining Native Calls
Our design for native function inlining consists of two major components, the first of which
is the already described IL conversion. The second component is an extension to the assumed
Java JIT compiler’s function inliner, permitting it to inline native functions.
2.3.1 Enhancements to a Java JIT Compiler’s Inliner
Upon successful native to JIT compiler IL conversion for a Java-callable native function, the
JIT inliner inlines the converted IL and recursively inlines the function calls it contains. The
inliner considers non-Java-callable native functions as potential inlining candidates as well
(i.e., native calls made by native code).
Figure 2.1 depicts this process for a terminal native function (i.e., one containing no other
function calls): Upon (1) encountering a native callsite, the JIT inliner (2) feeds the native IL
to the IL conversion mechanism which then (3) generates JIT IL for the native method. The
inliner finishes (4) by producing an inlined native callsite using the converted IL. Inlined native
methods clearly execute in a Java context; therefore, the code must be conditioned to interact
with all appropriate JVM requirements. In particular, instructions to perform handshaking with

Figure 2.1: The native function inlining process

JVM components such as garbage collection, as well as exception handling are inserted at the
appropriate locations in the inlined native code.
The IL for a native method cannot, in most cases, proceed directly through the rest of JIT
compiler processing. This is because it may contain “opaque” calls to non-Java-callable native
functions. Such “opaque” calls occur in two situations: (1) calls through function pointers, and
(2) calls to functions in binaries where native IL is unavailable. Both situations require special
handling and we defer discussing them until Section 2.6.1.
The inliner recursively inlines functions called by a Java-callable native method until it
either encounters a call to an “opaque” function or a termination condition (e.g., a maximum
inlined code size limit) is reached. The inliner then continues with normal JIT compiler pro-
cessing only after it has performed optimizing callback transformations and satisfied require-
ments for inlined “opaques”, as described in the following sections.
2.4 Optimizing JNI Callbacks
The native inlining process is extended by callback transformations that optimize inlined JNI
function calls. Inlined native code executes in the JVM’s context, thus there is no need for the
JNIEnv pointer and the JNI function pointer table to obtain access to internal JVM services
and data. Once the native inlining technique has converted native IL to the JIT compiler’s IL, it
performs two tasks to transform callbacks. These tasks include identifying callbacks and per-
forming JNI argument use/def analysis. Whenever possible, inlined callbacks are transformed
into compile-time constants or new semantically equivalent JIT compiler IL that represents
faster, more direct access to JVM services and data.
2.4.1 Identifying Inlined JNI Callbacks
The first step in transforming inlined callbacks is to identify them in the converted IL. The
JIT compiler’s IL makes it hard to distinguish a JNI callback from an arbitrary function call
via a pointer. For this reason, our technique scans the generated IL for sequences of IL state-
ments that constitute function calls via pointers, and then attempts to match them against a
well known set of pre-constructed IL shapes (i.e., sequences of IL statements) that represent
JNI callbacks. In order to construct the set of well known IL shapes, we require a preliminary
step that renders each callback defined by the JNI API in terms of the JIT compiler’s IL.
Building JNI Callback Shapes
As part of identifying JNI callbacks, the JIT compiler needs to understand the expected “shape”
of each JNI callback as it scans the inlined IL looking for statements representing JNI callbacks.
The shape encodes how each callback uses the JNIEnv pointer and other arguments, thereby
uniquely identifying a callback. The JIT compiler uses a set of well known pre-constructed
shapes for subsequent analysis (and avoids recursively inlining callbacks).
Pre-constructing these shapes can be performed in a variety of ways, including:
• dynamically performed at the start of a Java program execution
• performed as part of the process of building the JIT compiler itself
• statically performed by encoding each callback’s shape in the compiler
The only requirement is that the IL used to pre-construct IL shapes must be correct for both the
current version of the JIT compiler (where the IL definition may change over time) and for the
JVM being targeted (because each JVM is free to define how the JNI specification is actually
implemented).
Pattern Matching JNI Callbacks
When a native function is inlined, care is taken to record uses of the JNIEnv pointer within
the IL. Recursive inlining is expected and the JNIEnv pointer may be passed to recursively in-
lined functions. However, before recursive inlining, the IL representing the callsite is examined
as follows:
1. If the JNIEnv pointer is used in the same position in the IL as it appears in any of
the pre-constructed shapes, the inliner proceeds to Step 2. Otherwise, inlining continues
normally.
2. For each pre-constructed shape in which the JNIEnv pointer appears in the same posi-
tion as it does in the IL for an inlined callsite under consideration, the inliner attempts to
match the entire shape to the IL for the callsite. A match occurs if the shape and the IL
share the same number and compatible types of arguments. If there is a match, the call-
site is not eligible for inlining but might be transformable. Otherwise, inlining continues
normally.
As part of Step 2, the inliner records the callsites that it has determined do correspond to
JNI callbacks and revisits them later when performing optimizing transformations on them.
Thus, the result of the transformation of the first call can be used in the transformation of later
calls.
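The following is a minimal sketch, in C with deliberately simplified (hypothetical) IL types,
of the two-step match just described; the actual TR-IL shapes are tree-structured and
considerably richer:

    #include <stddef.h>
    #include <stdbool.h>

    typedef enum { ARG_JNIENV, ARG_INT, ARG_REF } ArgKind;

    typedef struct {
        int     callback_id;  /* which JNI function this shape denotes */
        size_t  nargs;
        ArgKind args[8];      /* expected argument kinds, in order */
    } Shape;

    typedef struct {
        size_t  nargs;
        ArgKind args[8];      /* argument kinds observed at the callsite */
    } Callsite;

    /* Step 1: JNIEnv must appear where the shape expects it.
     * Step 2: argument counts and kinds must all be compatible. */
    static bool matches(const Shape *s, const Callsite *c) {
        if (s->nargs != c->nargs) return false;
        for (size_t i = 0; i < s->nargs; i++)
            if (s->args[i] != c->args[i]) return false;
        return true;
    }

    /* Returns the callback id of the first matching shape, or -1 if the
     * callsite is an ordinary call through a pointer (inline normally). */
    int identify_callback(const Shape *shapes, size_t n, const Callsite *c) {
        for (size_t i = 0; i < n; i++)
            if (matches(&shapes[i], c)) return shapes[i].callback_id;
        return -1;
    }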
2.4.2 JNI Argument Use/Def Analysis
Once inlined callbacks have been identified, the values and types of variables passed to them
as arguments must be deciphered in order to replace callbacks with JIT compile-time constants
or cheaper but semantically equivalent operations. A pass of JNI argument use/def analysis is
performed to track the definitions of variables to the points where they are used as arguments
to callbacks.
In general, each callback argument is represented by a set of possible objects as dictated
by the control flow in the native method. Definitions of variables include incoming argu-
ments to the native function (i.e., passed in from Java code), or results of other callbacks. For
example, the values returned by the JNI callbacks GetObjectClass, GetSuperclass,
FindClass, Get[Static]MethodID‖ and Get[Static]FieldID are treated as def-
initions.

‖[Static] is shorthand notation that allows us to encode the static version of the JNI callback
function as well as the instance version using only one identifier.
When the analysis cannot conclusively determine the class that an object must be an in-
stance of, it produces sufficient information to allow the transformation phase to consider all
possible classes that the object may be an instance of. It is possible, however, that the analy-
sis is unable to compute even conditional results if, for example, arguments to a callback are
fetched from storage.
The use/def analysis also tracks literal or constant arguments to FindClass,
Get[Static]MethodID and Get[Static]FieldID and by doing so, the JIT compiler
may positively resolve some of these calls while a more naive implementation would be unable
to do so.
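As a small illustration of the kind of question this analysis answers (hypothetical types, not
the thesis’s implementation), the transformation phase needs to know whether every definition
reaching a callback argument agrees:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { int class_id; } Def;   /* one reaching definition */

    /* True iff every definition reaching the argument names the same class,
     * in which case, e.g., a GetObjectClass on it can fold to a constant. */
    static bool single_class(const Def *defs, size_t n, int *class_out) {
        if (n == 0) return false;  /* e.g. fetched from storage: give up */
        for (size_t i = 1; i < n; i++)
            if (defs[i].class_id != defs[0].class_id) return false;
        *class_out = defs[0].class_id;
        return true;
    }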
2.4.3 Callback Transformations
Once the identification of callbacks and JNI argument use/def analysis is complete, the proce-
dure continues by iterating over all of the identified callbacks and attempting to transform them to
compile-time constant values or new JIT compiler IL that is semantically equivalent and results
in the generation of a smaller number of CPU instructions at code generation time. Generally
speaking, callbacks that result in definitions are transformed to constants, whereas callbacks
using such definitions are transformed to cheaper IL. Using some of the same callbacks men-
tioned in the previous section, the following transformation outcomes are possible:
• If all of the possible argument definitions reaching a GetObjectClass are of the same
class, the call is replaced by an appropriate constant.
• If all possible classes reaching a Get[Static]FieldID or Get[Static]MethodID
are compatible and the string arguments can be uniquely determined, the call is replaced
by an appropriate constant.
• If all possible field ids reaching a Get[Static]<Type>Field‡ or a
Set[Static]<Type>Field are the same and all possible objects reaching the call
are of compatible class types, the call is replaced by a new, simpler sequence of JIT
compiler IL. More generally, if the offset of the data member from the beginning of the
object is the same for all possible types that can reach the call, then the same code can
be used for all the objects, allowing the callback to be “strength reduced”.
• Similar transformations are performed for the various Call[Static]<Type>Method
callbacks by replacing the existing IL with new IL that makes a more direct call to the
method.
We display the complete callback transformation process using the annotated inlined native
code in Figure 2.2, which is transformed to the code in Figure 2.3.
If use/def analysis produces known but inconclusive information for any arguments to a
callback, conditional logic may be inserted along with the appropriate IL that represents the
semantics of the callback being transformed. When the transformed callback is executed, ap-
propriate behavior can be selected based on actual values. Furthermore, all transformed IL
defers throwing exceptions in accordance with the Java rules for executing native methods.
Theoretically, an optimization such as the one described here should follow uses of data
defined in terms of the JNIEnv pointer to track all possible callsites that may correspond
to JNI function calls. However, it is harmless to perform this optimization on some callsites
and decline to perform it on others. Any inlined callbacks that are not handled by the steps
above are treated as ordinary calls to an appropriate VM service routine, as described in
Section 2.6.1.
‡<Type> represents any one of Void, Object, Boolean, Byte, Char, Short, Int, Long, Float or
Double

    void inlined_native_function(JNIEnv *env, jobject obj) {
            /* inlined callbacks look like: */
    (A)     jclass cls = (*env)->GetObjectClass(env, obj);
    (B)     jmethodID mid = (*env)->GetMethodID(env, cls, "power", "(II)I");
            if (mid == NULL) return;
    (C)     jint ret = (*env)->CallIntMethod(env, obj, mid, 2, 2);
    (D)     if ((*env)->ExceptionCheck(env)) return;

            /* use ret in rest of inlined function */
            ...
    }

Figure 2.2: Sample inlined native code before callback transformations

    void inlined_native_function() {
            /* (A) is transformed to a compile-time constant */

            /*
             * (B) is transformed to a compile-time constant and return
             * semantics are preserved by generating IL for the case where
             * the constant can not be generated (i.e. an invalid argument
             * to the callback)
             */

            /*
             * (C) is replaced with IL that performs an invocation of the
             * power function on the object with the given constant
             * arguments
             */

            /*
             * (D) IL is generated to check for pending exceptions, as well
             * as the required return statement
             */

            /* use ret in rest of inlined function */
            ...
    }

Figure 2.3: Sample inlined native code after callback transformations

2.5 Other Callback Transformations
In certain situations, it might be possible to do away with transforming inlined callbacks
altogether. For example, if the character conventions used by the JVM and the host architecture
are the same, there is no need to transform an inlined GetStringUTFChars. Its resulting
definition can be replaced by the original Java String it was destined to copy and convert.
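For instance, in native code of the following (hypothetical) shape, the copy made by
GetStringUTFChars and the matching release both become unnecessary when the character
conventions agree, and uses of s can be redirected to the original String’s characters:

    #include <jni.h>

    JNIEXPORT jint JNICALL
    Java_Example_utfLength(JNIEnv *env, jobject obj, jstring jstr) {
        const char *s = (*env)->GetStringUTFChars(env, jstr, NULL); /* copy + convert */
        jint n = 0;
        while (s[n]) n++;
        (*env)->ReleaseStringUTFChars(env, jstr, s);                /* free the copy  */
        return n;
    }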
A favorable side-effect of inlining native code that declares and uses local references is
that inlined callbacks can potentially be eliminated by implicitly shifting responsibilities to the
JVM’s garbage collector. These include NewLocalRef, DeleteLocalRef,
PushLocalFrame and PopLocalFrame callbacks.
We exclude an exhaustive analysis of all the JNI API functions, but recognize similarities
that might exist with other callbacks.
2.6 Design Concerns
Before describing an implementation of this proposed design, we bring to light two issues
that surface when inlining native code into Java programs. The first of these are “opaque”
function calls that can not, under any circumstance, be inlined and must be dealt with in a
special manner. The second one concerns data that is accessed or modified by both inlined and
non-inlined native code.
2.6.1 Synthesizing Opaque Calls
As mentioned earlier, “opaque” calls occur in two situations:
1. calls through function pointers, and
2. calls to functions in binaries where native IL is unavailable
For example, inlined but non-transformable JNI callbacks are opaque calls through function
pointers, whereas inlined system calls are opaque calls to functions in binaries that do not have
IL available.

Figure 2.4: Synthesizing opaque function calls

We solve this problem by performing calls to “synthesized” functions whose purpose is to
call the opaque function after having set up the proper linkages and context to make the
call. The effect is to bridge the Java application to the previously-buried native function. This
situation is depicted in Figure 2.4. Inlining a single Java-callable native function may require
the synthesis of multiple calls to opaque functions. Inlining, however, creates the opportunity
to remove the much-higher overhead of callbacks, and reduces the need for conservative as-
sumptions about the behavior of native code in the JIT optimizer. We expect that it will often
be profitable to synthesize multiple callouts to opaques, provided callbacks can be transformed
into cheaper operations as previously discussed in Section 2.4.3.
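A sketch of what a synthesized function amounts to, with hypothetical helper names standing
in for the JVM’s actual context-transition code:

    /* Hypothetical stand-ins for the JVM's context-transition work. */
    static void leave_jvm_context(void *vm_thread)   { (void)vm_thread; /* stop mutating the heap, set up native linkage */ }
    static void reenter_jvm_context(void *vm_thread) { (void)vm_thread; /* handshake: exception status, local refs, GC   */ }

    /* An opaque target: a function for which no IL is available. */
    extern void opaque_target(int arg);

    /* The synthesized function the JIT emits a call to in place of the
     * (un-inlineable) original call inside the inlined native code. */
    void synthesized_opaque_target(void *vm_thread, int arg) {
        leave_jvm_context(vm_thread);
        opaque_target(arg);
        reenter_jvm_context(vm_thread);
    }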
2.6.2 Shared Data
A second design concern is the situation that arises when inlining results in data that is accessed
or modified by both inlined and non-inlined native code. We define shared data as data shared
between an inlined native function and any other native function (e.g., a synthesized native
function, another function in the same library, or a function defined somewhere else), static
data, as well as addresses of variables (automatics or parameters) that may be passed by an
inlined native to one of its synthesized calls.
In such cases, our strategy ensures that correct linkage is used and the inlined native code
is able to read and write to the same block of memory as non-inlined functions. Furthermore,
because the resolution of addresses is performed at JIT compile-time, and the original native
function is now inlined rather than called explicitly, additional care is taken to ensure that the
dynamic loading of new libraries is handled correctly.
2.7 Design Summary
To summarize, our strategy for improving the performance of JNI-dependent Java applications
is based on inlining native function calls at runtime, and performing a number of steps that
allow for the transformation of inlined JNI callbacks to cheaper but semantically-equivalent
operations. The following chapter describes the tools we use to implement our proposed de-
sign.
Chapter 3
Tools
Our strategy for native function inlining has been prototyped and evaluated in the context of
a high-performance production JVM and JIT compiler from IBM. In this chapter, we describe
the IBM Java JIT compiler and JVM that provide the starting point of our implementation. We
also describe JNI overheads specific to the JIT compiler and JVM implementation. Following
this is a description of the JIT compiler’s IL, as well as a concrete native to JIT compiler IL
conversion mechanism that will be used in the realization of our design.
3.1 The TR JIT Compiler and J9 virtual machine
The IBM® TR JIT compiler is a high-quality, high-performance optimizing compiler, con-
ceived and developed at the IBM Toronto Software Lab. Designed with a high level of config-
urability in mind, it supports multiple Java Virtual Machines and class library implementations,
targets many architectures, can achieve various memory footprint goals and has a wide range
of optimizations and optimization strategies.
A single pass of the TR JIT compiler consists of phases for IL generation, optimization
and code generation as depicted in Figure 3.1. When compiling a method, the IL Generator
walks the method’s Bytecodes and generates tree-based JIT compiler IL (known as TR-IL)
that also encodes the control flow graph (see Section 3.2 for sample TR-IL).

Figure 3.1: The TR JIT compiler’s architecture

The Optimization phase is a pipeline through which the TR-IL flows and may be modified and reordered by
architecture-independent/dependent, speculative and profile-based adaptive optimizations. The
Code Generation phase lowers the TR-IL to a specific machine instruction set, performs regis-
ter allocation and schedules the instructions before emitting a final binary encoding. Auxiliary
data structures and meta-data are also generated at the end of code generation.
The TR JIT compiler is currently used by the IBM J9 Java virtual machine†. J9 is a clean-
room Java Virtual Machine implementation targeting numerous different processors and oper-
ating systems, supporting ahead-of-time compilation, and method hot-swapping as well as a
host of other features. The TR JIT compiler can query J9 for information regarding classes and
invoke various J9 service routines via a tightly defined but publicly exposed interface. This
interface to the virtual machine will be used by the TR JIT compiler in order to transform and
synthesize callbacks at runtime.
†performance results for a TR JIT compiler enabled J9 virtual machine can be found on www.spec.org
3.1.1 Inlining in the TR JIT compiler
The TR JIT compiler optimization we are interested in is the function inlining optimization.
This optimization reduces the overhead of function invocations by inlining entire functions at
their callsites. The primary purpose of this inlining, however, is to expose more TR-IL to the
optimizer and to eliminate pessimistic assumptions that must be made about the behaviour of
function calls. Like most inlining strategies, it uses a variety of heuristics to determine if a
given function call should be inlined. Once the decision has been made to inline a function,
the inliner generates TR-IL for the callee, and completes the process by performing all the
required transformations on both the caller and callee functions, including mapping parameters
to arguments, generating temporaries, as well as merging caller and callee IL and control flow
graphs.
TR currently handles native function invocation Bytecodes by generating code that transfers
the native call setup and tear-down work to J9, or by using a proprietary mechanism known
as Direct2JNI. Direct2JNI uses compile-time signature parsing to produce compiled glue code
tailored to perform the native call to each unique native target. We describe Direct2JNI in more
detail in Chapter 4. Independent of the type of dispatch mechanism used, Java threads leaving
the JVM context must indicate they are no longer mutators of the JVM heap. Conversely, Java
threads re-entering the JVM context must indicate they are mutators once again. Besides the
generic JNI callout and callback overheads and the JVM handshaking requirements mentioned
in Section 1.2.1, the notification mechanism consists of expensive CPU instructions that
can be eliminated if the originating native call is inlined.
Our implementation extends the inlining strategy in the TR JIT compiler to native function
calls. Our focus is on providing this novel functionality, rather than exploring new heuristics
that might be more suitable for native code. We thus use the existing heuristics to decide when
a native call should be inlined.
3.2 TR Intermediate Language
As mentioned earlier, the IL generated and used by the TR JIT compiler is tree-based and
encodes the control flow graph for the function being compiled. More specifically, TR-IL is a
linked list of tree-tops, where each tree-top represents an instruction and each child of a tree-
top represents an argument to the instruction. Aliasing information is explicit in TR-IL, which
facilitates the native to JIT IL conversion mechanism.
Figure 3.2 is an example of the TR-IL for a function adding two local variables together
and storing them in a third:

    previous tree top
          |
    store auto 1
          └── integer add
                  ├── load auto 2
                  └── load auto 3
          |
    next tree top

Figure 3.2: Sample TR-IL
3.3 W-Code and The IL Conversion Mechanism
The first phase of our native inlining design requires the conversion of native code into the same
IL used by the JIT compiler. To do this efficiently, we exploit the ability to store IL alongside
native executable code in the same binary object file or library.
CHAPTER 3. TOOLS 29

Figure 3.3: The IL conversion process (W-Code producing front-ends compile native code
such as "a = b + c;" into W-Code: LOAD b, LOAD c, ADD, STORE a; the W-Code to TR-IL
conversion mechanism then converts this W-Code into TR-IL as in Figure 3.2)

In our case, the native IL is W-Code, a mature stack-based representation generated by IBM
compiler front-ends for C, C++, FORTRAN, COBOL, PL/I and other programming languages.
Because W-Code is designed to support a large number of languages, aliasing is made explicit
in the IL itself. As mentioned earlier, aliasing is also explicit in TR-IL, making it possible to
preserve alias information from the W-Code of native functions when they are converted to
TR-IL.
As depicted in Figure 3.3, the W-Code to TR-IL conversion mechanism operates by iterat-
ing through the W-Code opcodes of a native function, and generating TR-IL for each encoun-
tered statement. Once W-Code opcodes have been processed, the TR JIT compiler can treat the
generated TR-IL as if it were derived from Java Bytecodes. Care, however, must be taken to
provide appropriate linkages and preserve the semantics of the original language with respect
to opaque function calls and shared data as discussed in Section 2.6.
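The flavor of this single-pass conversion can be shown with a self-contained C sketch
(hypothetical types and a toy three-opcode W-Code; the real mechanism handles many more
opcodes, aliasing information and control flow). Run on the opcodes for a = b + c, it prints a
tree matching Figure 3.2:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { WC_LOAD, WC_ADD, WC_STORE } WCodeOp;
    typedef struct { WCodeOp op; int sym; } WCode;  /* one stack-based opcode */

    typedef struct Node {                           /* a TR-IL-like tree node */
        const char *op;
        int sym;                 /* symbol number for loads/stores, -1 otherwise */
        struct Node *kids[2];
    } Node;

    static Node *mk(const char *op, int sym, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->op = op; n->sym = sym; n->kids[0] = l; n->kids[1] = r;
        return n;
    }

    static void dump(const Node *n, int depth) {
        printf("%*s%s", depth * 2, "", n->op);
        if (n->sym >= 0) printf(" auto %d", n->sym);
        printf("\n");
        for (int k = 0; k < 2; k++)
            if (n->kids[k]) dump(n->kids[k], depth + 1);
    }

    /* One pass over the opcodes: loads push leaves onto an operand stack,
     * ADD pops two operands into a subtree, STORE completes a tree-top. */
    static void convert(const WCode *code, size_t n) {
        Node *stack[16]; size_t sp = 0;
        for (size_t i = 0; i < n; i++) {
            switch (code[i].op) {
            case WC_LOAD:  stack[sp++] = mk("load", code[i].sym, NULL, NULL); break;
            case WC_ADD: { Node *r = stack[--sp], *l = stack[--sp];
                           stack[sp++] = mk("add", -1, l, r); break; }
            case WC_STORE: dump(mk("store", code[i].sym, stack[--sp], NULL), 0); break;
            }
        }
    }

    int main(void) {  /* a = b + c, with b = auto 2, c = auto 3, a = auto 1 */
        WCode prog[] = { {WC_LOAD, 2}, {WC_LOAD, 3}, {WC_ADD, 0}, {WC_STORE, 1} };
        convert(prog, sizeof prog / sizeof prog[0]);
        return 0;
    }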
As will be described in Chapter 4, our implementation converts IL at runtime, when the
inliner decides to inline a particular native function. In principle, the conversion could also be
done offline, storing TR-IL along with the native executable. TR-IL, however, is an in-memory
IL and is not suitable for efficient serialization to disk. In contrast, W-Code (like Bytecode)
is a suitable disk format by design, and the conversion to TR-IL is a single-pass, lightweight
operation. The alternative of storing the tree-based IL directly would take more space to store
and would still require a similar amount of work to reconstruct an in-memory representation.
In essence, we are interfacing the TR JIT compiler with a new virtual machine. This new
W-Code virtual machine is an oracle providing answers to queries made by the TR JIT compiler
regarding native symbols, but most importantly, it functions as an IL generator, generating TR-
IL from W-Code opcodes (instead of Java Bytecodes).
Having described the software framework available to us, including a JVM, JIT compiler
and IL conversion mechanism, we proceed to describe the details of our implementation in the
chapter that follows.
Chapter 4
Implementation
In this chapter, we present details of a prototype implementation of the design described in
Chapter 2. Our implementation targets the POWER4™ line of IBM processors, and is com-
posed of general modifications to the TR JIT compiler, changes to the TR JIT compiler’s in-
liner, as well as modifications to the TR JIT code generator to support synthesis. We conclude
this chapter by summarizing the current status of our prototype.
4.1 General Modifications to the TR JIT Compiler
Two significant changes were made to the TR JIT compiler to support the compilation of W-
Code-based languages. The first was to extend its data type set to include unsigned types (since
it was originally designed for Java, which does not define unsigned types). The second was
to modify its optimizations that depend on alias analysis (e.g., copy and value propagation),
since aliasing in Java is much simpler than in C. As noted in Section 3.3, alias information
for the native code is explicit in the W-Code IL, and is preserved during the transformation to
TR-IL.
4.2 Modifications to the TR JIT Compiler’s Inliner
The TR JIT inliner was modified to permit function inlining of a small subset of native func-
tions. If the inliner encounters a native callsite during its heuristic analysis stage, it proceeds
via two steps to process the callsite:
1. it instantiates a W-Code virtual machine and associates with it the W-Code file containing
the IL for the native function under consideration
2. it requests the native function’s TR-IL (this initiates a W-Code to TR-IL conversion)
Once TR-IL for the native function is made available, the inliner instantiates two callback han-
dler objects that process the generated IL, transforming JNI callbacks and synthesizing opaque
function calls, respectively. Once transformations are complete and synthesis requirements
have been met, the inliner continues and completes the inlining process. Figure 4.1 displays
this entire native inlining process. We now describe the implementation of these two callback
handlers in detail.
4.3 Introducing the Inlined CallHandlers
We have implemented two callback handler classes that analyze and process inlined function
calls in the TR-IL generated from W-Code. Figure 4.2 represents a class diagram for the JNI-
CallHandler and ExternalCallHandler. The JNICallHandler is in charge of
transforming and synthesizing JNI callbacks, whereas the ExternalCallHandler syn-
thesizes all other opaque calls (i.e., function calls with no accompanying W-Code), and both
implement the interface defined by the InlinedCallHandler.
4.3.1 The JNICallHandler
As mentioned in Chapter 2, transforming inlined JNI callbacks requires callback identification
and JNI argument use/def analysis. Synthesis is also required for inlined callbacks that are
opaque. We now describe how the transformation and synthesis of callbacks are realized.
[Figure 4.1: The TR inliner: Handling native functions. Between the previous and next optimizations, the TR inliner detects a native callsite, retrieves the TR-IL converted from W-Code from the W-Code VM, instantiates inlined call handlers for processing of the inlined native TR-IL, and then proceeds to complete native inlining.]
    InlinedCallHandler (abstract)
      public:    virtual void identifyCalls() = 0;
                 virtual void transformCalls() = 0;
      protected: virtual bool isCallOfInterest(CallNode) = 0;
                 void synthesize(CallNode);

    JNICallHandler : InlinedCallHandler
      public:    void identifyCalls();
                 void transformCalls();
      protected: bool isCallOfInterest(CallNode);
      private:   void handle*(...);
                 void redo*(...);

    ExternalCallHandler : InlinedCallHandler
      public:    void identifyCalls();
                 void transformCalls();
      protected: bool isCallOfInterest();

Figure 4.2: The Inlined Call Handler class hierarchy
Identifying JNI Callbacks
The JNICallHandler relies on the W-Code VM to identify and flag callback symbols dur-
ing IL conversion. The W-Code VM does so by checking to see if a function call by pointer
targets a legal function table offset in the JNI interface. These function table offsets are doc-
umented in the JNI specification [28]. Furthermore, the W-Code VM maintains a listing of
transformable callbacks, thereby allowing the handler to differentiate between transformable
and synthesizable callbacks.
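As a concrete illustration of why such callbacks are recognizable, consider the following native function; every JNI callback it makes is a call through the JNIEnv function table, at a fixed offset documented by the JNI specification. The Java class and field names are hypothetical:

    #include <jni.h>

    JNIEXPORT jint JNICALL
    Java_Example_getCount(JNIEnv *env, jobject self)
    {
        /* Each call below is a call by pointer through a fixed JNIEnv
           function table slot, which is what the W-Code VM flags. */
        jclass   cls = (*env)->GetObjectClass(env, self);
        jfieldID fid = (*env)->GetFieldID(env, cls, "count", "I");
        return (*env)->GetIntField(env, self, fid);
    }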
Once the JNICallHandler has been instantiated, it iterates through the generated IL,
probing for callbacks and adding them to a list for later processing. The same is done if the
handler encounters any callbacks that are deemed non-transformable (i.e., requiring synthesis).
This represents the work performed in the JNICallHandler::identifyCalls routine.
By flagging known callback symbols, our implementation differs from the design in Chapter 2
which detailed an approach that scans the generated JIT compiler IL, pattern matching IL state-
ments against a set of well-known JNI callback IL shapes.
JNI Argument Use/Def Analysis
Our implementation avoids a detailed JNI argument use/def analysis by assuming straight-line
control flow in native code. The use/def analysis is similar to that required for other optimiza-
tions, but building one specifically for JNI arguments is an engineering issue we believe can
be addressed by future work, and one that does not diminish the novelty of our idea. Instead,
the set of argument definitions is restricted to values passed in from Java code (i.e., arguments
to the native function call) as well as values returned by any of the JNI functions presented in
Section 2.4.2.
Callback Transformation
Having identified all transformable calls, the JNICallHandler::transformCalls rou-
tine proceeds by iterating through them and transforming each constant-generating callback
into a compile-time constant by querying the J9 virtual machine. For example, JNI functions
that return field or method ids are converted to JIT compile-time constant addresses and field
offsets. Transformable JNI calls that use the results of these constant-generating callbacks as
arguments (i.e., Get<Type>Field or CallStatic<Type>Method) are transformed
into cheaper but semantically equivalent TR-IL. Get<Type>Field, for example, is trans-
formed to TR-IL representing a direct field access, whereas CallStatic<Type>Method
is transformed to a direct function call.
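Conceptually, the transformation replaces the table-based callback with a direct memory access. The following C sketch contrasts the two forms; the field offset and object layout shown are hypothetical and are not the J9 virtual machine's actual layout:

    #include <jni.h>

    /* Before: an opaque JNI callback through the function table. */
    jint read_count_via_jni(JNIEnv *env, jobject obj, jfieldID fid)
    {
        return (*env)->GetIntField(env, obj, fid);
    }

    /* After: GetFieldID has been folded to a compile-time constant offset,
       so the read becomes a direct field access (hypothetical layout). */
    enum { HYPOTHETICAL_COUNT_OFFSET = 16 };

    jint read_count_direct(jobject obj)
    {
        return *(jint *)((char *)obj + HYPOTHETICAL_COUNT_OFFSET);
    }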
Figure 4.3 shows pseudocode for the JNICallHandler::transformCalls function.
Each transformable callback is provided a "handle" method that takes in the TR-IL representing
the callback, along with required "definitions" from previous transformations, and returns the
TR-IL result of the transformation. A natural side-effect of querying the J9 virtual machine
for data at JIT compile-time is the ability to filter native code that performs illegal operations
on JVM data (i.e., querying the field id of a non-existent field). An appropriate response to
such an unchecked error would be to halt the JVM, thereby preventing it from entering an
indeterminate state, or possibly avoiding a crash.
    /**
     * Attack the JNI callback tree tops as they first appear in the
     * inlined native code's IL
     */
    JNICallHandler::transformCalls() {
      for (each tree top X representing an inlined JNI callback) {
        switch (callType(X)) {
          case FindClass:
            constClass = handle_FindClass(X);       // transform to a constant
            break;
          case GetObjectClass:
            constClass = handle_GetObjectClass(X);  // transform to a constant
            break;
          case GetFieldID:
            redoPool.add(X);  // add to the redo pool in case the
                              // transformation needs to be undone
            offset = handle_GetFieldID(X, constClass);  // transform to an offset
            break;
          case GetIntField:
            handle_GetIntField(X, offset);  // transform to a direct read
            break;
          case ...:
            ...
          default:
            synthesize(X);  // synthesize a call to the opaque JNI callback
            break;
        }
      }
    }

Figure 4.3: Pseudocode for JNICallHandler::transformCalls
    JNICallHandler::synthesize(X) {
      ...
      transformed_call = redoPool.getDependentTransformations(X);
      correct_data = redo(transformed_call);
      handle_callback(X, correct_data);
      ...
    }

Figure 4.4: Pseudocode for JNICallHandler::synthesize
Synthesizing Opaque Callbacks
Since both transformable and non-transformable callbacks are stored in the same list, there
exists the possibility of encountering an opaque callback during the transformation stage. If a
callback cannot be transformed, but takes arguments that were defined by a previously trans-
formed callback, special care must be taken to ensure the definitions of the arguments to the
opaque callback are of the correct type. For example, if the virtual function table offset constant
generated from transforming a non-opaque GetMethodID callback is then passed as an argu-
ment to an opaque CallStaticObjectMethod callback, the GetMethodID transformation must
be "redone" to produce the expected type of data, which is then passed as an argument to Call-
StaticObjectMethod. The expected type of data in this case is a jmethodID rather than a virtual
function table offset.
Once dependent transformations are redone, the opaque call is synthesized by adding a
layer of indirection to all the reference arguments that originate from the argument list of the
inlined native function (a semantic enforced by the JNI to support copying garbage collectors).
Figure 4.4 gives pseudocode for synthesizing opaque callbacks.
4.3.2 The ExternalCallHandler
The ExternalCallHandler relies on the W-Code VM to identify and flag external sym-
bols during IL conversion. The W-Code VM does so by checking to see if the function call
targets an externally defined symbol (i.e., a symbol outside the module being processed).
Once the ExternalCallHandler has been instantiated, it scans the generated IL for
external calls, and adds them to a list that will be processed by the code generator.
4.4 Changes to the TR JIT Code Generator
Once the native function inlining and callback transformation optimizations have taken place,
the code generator must handle any side-effects that result from changes to the inlined native
function's TR-IL. Since we are targeting the POWER4™ line of processors, the POWER4 code
generator needs to be able to generate specialized dispatch code for synthesized callbacks and
external calls, as well as code for accesses to shared data residing in native libraries. The code
generator evaluates synthesized function calls by adapting and modifying the Direct2JNI call-
out mechanism described in Section 3.1.1. Direct2JNI is a specialized snippet of high-speed
assembly that sets up the correct linkages and context when making a native call from Java. It
enforces the linkage conventions specified by the AIX [13] Application Binary Interface [14]
(ABI). It also encodes the handshaking required between native code and the J9 virtual ma-
chine.
Furthermore, when evaluating TR-IL load trees, the code generator must handle loads of
shared data accessible to both inlined and non-inlined native functions. When generating CPU
instructions for such loads, the code generator uses the AIX dlopen and dlsym functions [15]
to load and resolve the runtime addresses of intra-module defined symbols.
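For illustration, the following sketch shows how a shared global in a native library can be loaded and resolved at runtime with dlopen and dlsym; the library and symbol names are hypothetical:

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open the native library that holds the shared data. */
        void *handle = dlopen("libnative.so", RTLD_NOW);
        if (handle == NULL) {
            fprintf(stderr, "%s\n", dlerror());
            return 1;
        }

        /* Resolve the runtime address of a global shared between inlined
           and non-inlined native functions. */
        int *shared_array = (int *) dlsym(handle, "global_int_array");
        if (shared_array != NULL)
            printf("first element: %d\n", shared_array[0]);

        dlclose(handle);
        return 0;
    }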
Transformable Callbacks     Synthesizable Functions
JNI API Functions           JNI API Functions           libc Functions
FindClass                   FindClass                   malloc
GetObjectClass              GetObjectClass              free
Get[Static]FieldID          Get[Static]FieldID          printf
Get[Static]<Type>Field      Get[Static]<Type>Field      sprintf
Set[Static]<Type>Field      Set[Static]<Type>Field      atoi
Get[Static]MethodID         Get[Static]MethodID         strlen
Call[Static]<Type>Method    Call[Static]<Type>Method    strncasecmp
New<Type>Array              New<Type>Array              fopen, fclose
GetArrayLength              GetArrayLength              fwrite, fread
Get<Type>ArrayRegion        Get<Type>ArrayRegion        fseek, rewind

Table 4.1: Current support for callbacks and external function calls
4.5 Current Status
In summary, we have produced a fully-functional Java JIT compiler that can be substituted
as a back-end for various W-Code generating static front-ends. The correctness of our imple-
mentation has been verified by successfully compiling all of the C benchmarks from SPEC
CPU2000 [40], as well as standard C conformance tests. We have also compiled these bench-
mark programs with native-side inlining enabled and have observed the expected performance
increases.
Our native inlining modifications to the TR JIT compiler and the inliner allow it to handle
a large set of native code containing transformable JNI callbacks, as well as non-transformable
JNI callbacks and opaque external calls. Table 4.1 lists the set of JNI callbacks our imple-
mentation successfully transforms, as well as the callbacks and external functions the inliner
synthesizes.
Chapter 5
Results and Analysis
Native inlining is an optimization that interacts with the performance dynamics of the TR JIT
compiler, as well as with the running Java program performing native function calls. As with
any JIT optimization, the runtime costs of performing native inlining and callback transforma-
tions must be balanced against the expected benefits of removing overhead and exposing more
IL to the JIT optimizer. Ultimately, we believe the true power of our approach lies in the ability
to treat native and Java code together during JIT compilation, particularly since we have the
opportunity to eliminate pessimistic assumptions that the optimizer must make in the presence
of native function calls.
Our results and experiments focus on the costs and benefits of inlining callouts and trans-
forming callbacks. First, we examine the cost of converting native functions from native W-
Code IL into TR-IL. We also demonstrate the benefits of eliminating native call and return
overhead and record performance gains from transforming heavyweight callbacks into sub-
stantially cheaper operations. Furthermore, we measure and validate runtime improvements
in the performance of native code as the result of exposing them to additional JIT optimiza-
tions once inlined. To conclude, we quantify the performance results of synthesis when inlined
native code contains calls to opaque native functions.
To confirm the applicability of native inlining and callback transformations on real-world
code, we profiled a run of SPECjAppServer2004 using the IBM WebSphere® Application Server
6.0. We found that 4.07% of all function calls made during the run were calls to 71 unique
native functions, accounting for roughly 23% of the running time. Of these, 19 unique native
functions were called at least 5000 times, and of those, six were called at least 50,000
times. A single native function, Object.hashCode(), was called more than 300,000 times. This
suggests that the runtime cost of inlining can be amortized over a large number of uses for
important native functions. If the native function is well-understood by the compiler, semantic
expansion [44] or a related inlining technique could be used to create a special version. These
approaches, however, are less general than our solution.
5.1 Experimental Platform
Due to the prototypical nature of our implementation, we are limited to evaluating critical aspects
of our proposed system using microbenchmarks. Although this prevents us from investigating the
impact of our work on a large-scale system, the microbenchmark results provide us with a
realistic sense of the costs and benefits of our implementation. All our timing measurements
are performed on an IBM 7038-6M2 with eight 1.4 GHz POWER4™ CPUs. We use the
following legend when describing our microbenchmark results:
• NoOpt - unless otherwise mentioned, no optimizations are performed on microbench-
mark tests calling native functions
• N-inlining - native functions called by microbenchmark tests are only inlined
• N-inlining+ - native functions called by microbenchmark tests are inlined and contained
JNI callbacks are transformed
Detailed descriptions of our microbenchmark tests, as well as the raw data used to generate the
results in this chapter, can be found in Appendix A.
[Figure 5.1: Cost of W-Code to TR-IL conversion for SPEC CINT2000 benchmarks. Time per opcode (microseconds): bzip2 5.09, crafty 5.51, gap 5.34, gcc 4.97, gzip 5.47, mcf 4.53, parser 5.30, perlbmk 5.72, twolf 5.21, vortex 5.64, vpr 5.46.]
5.2 W-Code Conversion Costs
To evaluate the cost of converting from W-Code to TR-IL for C functions, we measured the
time to convert the SPEC CINT2000 benchmarks (eon is omitted because it is written in C++).
Figure 5.1 shows the average conversion time per W-Code opcode for each benchmark. Overall,
we find that the cost per opcode converted is small (averaging just 5.3 microseconds), and
relatively constant across benchmarks. These results are encouraging, as they suggest a simple
heuristic should be able to estimate the cost of converting a given native function at runtime
based on its size in W-Code opcodes. Such a heuristic would then be able to guide native
inlining decisions in the JIT compiler. Furthermore, the cost of conversion only needs to be
paid once, when the function is inlined, whereas the benefits of removing callout overhead will
be obtained on every subsequent use of the inlined code.
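A minimal sketch of such a heuristic, using the measured average cost per opcode, might look as follows; the helper name and the expected-calls estimate are hypothetical:

    #include <stdbool.h>

    #define AVG_US_PER_OPCODE 5.3  /* measured average from Figure 5.1 */

    /* Inline only if the projected per-call savings amortize the one-time
       conversion cost over the expected number of calls. */
    static bool worth_converting(long wcode_opcodes,
                                 double saved_us_per_call,
                                 long expected_calls)
    {
        double one_time_cost = wcode_opcodes * AVG_US_PER_OPCODE;
        return expected_calls * saved_us_per_call > one_time_cost;
    }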
5.3 Native Inlining Benefits
We then implemented a set of microbenchmark tests to evaluate the benefits of inlining native
functions.
A portion of the tests included calls to empty native functions. The motivation behind this
was to confirm the complete removal of native call and return overheads as a result of perform-
ing native inlining. Empty-bodied instance and static native functions were implemented with
varying numbers of parameters (0, 1, 3 and 5) and JIT compiled at the NoOpt and N-
inlining optimization levels. As shown in Table 5.1, the speedup that resulted from performing
native inlining for each test was infinite, because inlining completely removed the overhead of
performing the native call and returning from it. In general, the NoOpt results
also show the incremental cost of passing arguments to native functions. They also show that
the static microbenchmark tests ran faster than their corresponding instance versions because
Direct2JNI was used to create compiled glue code for the native calls.
We also implemented a microbenchmark test that contains real code; it demonstrates that
the benefits of inlining depend on the amount of time spent executing native code in the
function. To see the benefits that occur in realistic uses of native code, we wrote a native
hash function based on Wang's 32-bit mix function (http://www.concentric.net/~Ttwang/tech/inthash.htm).
Inlining our hash function gives a speedup of 3.6.
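For reference, one published variant of Wang's 32-bit mix function is shown below; the prototype's exact hash code is not reproduced in this thesis, so the version here is illustrative only:

    #include <stdint.h>

    uint32_t hash32shift(uint32_t key)
    {
        key = ~key + (key << 15);  /* key = (key << 15) - key - 1 */
        key = key ^ (key >> 12);
        key = key + (key << 2);
        key = key ^ (key >> 4);
        key = key * 2057;          /* key = (key + (key << 3)) + (key << 11) */
        key = key ^ (key >> 16);
        return key;
    }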
In summary, our microbenchmark tests show that native inlining can easily result in speedups
that range from effectively infinite (as for the empty-bodied native function calls) to effectively
none (for very long-running native functions, where call and return overhead is a negligible
fraction of the runtime).
Microbenchmark Test      NoOpt (ns)    N-inlining (ns)    Speedup (X)
instance
  0 args                        423                  0              ∞
  1 arg                         458                  0              ∞
  3 args                        490                  0              ∞
  5 args                        579                  0              ∞
static
  0 args                        128                  0              ∞
  1 arg                         137                  0              ∞
  3 args                        138                  0              ∞
  5 args                        143                  0              ∞

Table 5.1: Microbenchmark runtimes and improvements with native inlining
The primary motivation for inlining native code, however, is to create the opportunity to
transform inlined JNI callbacks, which are much more expensive to perform. We consider the
effect of these transformations in the following section.
5.4 Callback Transformation Benefits
To measure the overheads involved with performing callbacks in native code, we implemented
a series of microbenchmark tests and ran them at NoOpt and N-inlining+ optimization levels.
Table 5.2 contains a complete listing of the microbenchmark tests and our experimental results.
Our microbenchmark test names are encoded as follows:
• CVMethod - Call a void instance Java function from native code
• CSVMethod - Call a void static Java function from native code
• CIMethod - Call an integer returning instance Java function from native code
• CSIMethod - Call an integer returning static Java function from native code
• GIField - Read an integer field from a Java object
• GSIField - Read a static integer field from a Java class
• SIField - Write to an integer field in a Java object
• SSIField - Write to an integer field in a Java class
• E - the Java function being called is empty (i.e., empty in the context of native code calling Java functions)
Because callbacks are more expensive than callouts, we see that the benefit of transforming
them is correspondingly greater, with a minimum achieved speedup of nearly 12X in our mi-
crobenchmark tests.
For example, the native code in the CVMethodE test ultimately calls an empty Java
method, but does so by calling the GetObjectClass, GetMethodID and
CallVoidMethod JNI functions. By performing native inlining and then transforming each
of these callbacks (i.e., running at N-inlining+), our strategy is able to reclaim the order-of-mag-
nitude slowdown experienced when running the same microbenchmark test at NoOpt. More
specifically, the three callbacks are transformed to two compile-time constants and a JNI-
independent virtual function call (using the constants from the previous two transformations),
respectively.
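The native side of a CVMethodE-style test looks roughly like the following sketch (the function and method names are hypothetical); the comments indicate what each callback becomes at N-inlining+:

    #include <jni.h>

    JNIEXPORT void JNICALL
    Java_Bench_callVoid(JNIEnv *env, jobject self)
    {
        jclass    cls = (*env)->GetObjectClass(env, self);  /* -> constant */
        jmethodID mid = (*env)->GetMethodID(env, cls,
                            "emptyMethod", "()V");          /* -> constant */
        (*env)->CallVoidMethod(env, self, mid);  /* -> JNI-independent
                                                    virtual call */
    }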
Our results also indicate what at first appear to be anomalous infinite speedups for the four
microbenchmark tests that perform reads or writes to instance or static fields. For example,
inlining the native call in GIntField inlines a JNI callback which reads an instance field. Even
after being transformed (to a more direct read operation), however, the callback is still present
and contributes to work performed at runtime. In other words, inlining and transforming the
callbacks does not result in the complete removal of work. The infinite speedup results from the
superscalar, pipelined CPU (like the POWER4) finding a slot in which to schedule the single
read instruction alongside the instructions representing the loop-control portion of the micro-
benchmark test (i.e., a compare and a predicted branch). The read instruction therefore appears
to take no time to execute.
Microbenchmark Test      NoOpt (ns)    N-inlining+ (ns)    Speedup (X)
CVMethodE                      2627                 204           12.9
CSVMethodE                     2523                 214           11.8
CIMethodE                      2652                 217           12.2
CSIMethodE                     2554                 220           11.6
GIField                        2560                   0              ∞
GSIField                       2194                   0              ∞
SIField                        2308                   0              ∞
SSIField                       2144                   0              ∞

Table 5.2: Microbenchmark runtimes and improvements with native inlining and callback transformations
In summary, inlining native code and transforming JNI callbacks regains the order of
magnitude lost to callout and callback overheads. The next set of microbenchmark tests
we present includes native code that performs more work and has more realistic applications.
5.5 Eliminating Data-Copy Costs
We also designed a microbenchmark that passes integer array data from Java to native code.
The purpose of this microbenchmark is twofold. First, it demonstrates a working solution to
the shared data concern mentioned in Section 2.6.2: each of the microbenchmark tests accesses
a globally-declared C integer array. Second, it displays performance benefits that may be
available to native code used by JDBC™ [43] drivers, namely, data transfers between Java
and C. In these tests, a single callback is used to obtain the contents of a Java array. This
approach is similar to the "coarse-grained" data transfer strategy used in JNIbench [1], a
microbenchmark that measures the throughput of passing integer and byte data from Java to
native code.
Array Length    NoOpt (ns)    N-inlining+ (ns)    Speedup (X)
1                      586                  2.4          244.2
10                     599                 20.7           28.9
100                   1012                 85.5           11.9
1000                  4537                  600            7.6
5000                 20460                 6302            3.2
10000                41443                13884            3.0

Table 5.3: Moving data from Java to C - improvements with native inlining and callback transformations
Table 5.3 displays the range of speedups obtained by transforming inlined
GetIntArrayRegion callbacks for each of the varying array lengths (Figure 5.2 provides
a graphical representation of the same data). These speedups range from a factor of 244 for
a single-element array to a factor of 3 for a 10000-element array. The reason behind these
impressive speedups is the transformation that the inlined GetIntArrayRegion callback
undergoes. At JIT compile-time, the array-region-copying callback is transformed to TR-IL
representing a call to a high-speed data-copy routine (similar to, but more efficient than,
memcpy) provided by the JIT compiler's runtime.
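The native side of these tests follows the coarse-grained pattern sketched below (the names and fixed buffer size are hypothetical); it is the single GetIntArrayRegion callback that is transformed into the high-speed copy:

    #include <jni.h>

    JNIEXPORT jint JNICALL
    Java_Bench_sum(JNIEnv *env, jclass cls, jintArray arr, jint len)
    {
        jint buf[10000];  /* assumes len <= 10000, the largest test size */
        jint sum = 0;

        /* One coarse-grained callback copies the whole Java array. */
        (*env)->GetIntArrayRegion(env, arr, 0, len, buf);
        for (jint i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }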
As expected, these speedups decrease for larger array sizes because the overhead in per-
forming the callout and callback shrinks relative to the actual work done in copying the array.
There is evidence, however, that the large speedups for copying short arrays are ap-
plicable to real-world code. In particular, work by Bernecky [6] finds that 76% of all
operations in APL occur on arrays with fewer than eight elements, and that about half of all
operations are performed on zero- or one-element arrays.
[Figure 5.2: Moving data from Java to C - a graphical representation of the improvements with native inlining and callback transformations. The figure plots time (nanoseconds) against array length (0 to 10000) for the NoOpt and N-inlining+ optimization strategies.]
Microbenchmark Test    N-inlining+ Speedup (X)    HighOpt Speedup (X)    HigherOpt Speedup (X)
hash                                      3.58                  36.49                    23.91

Table 5.4: hash: Performance improvements with other JIT compiler optimizations
5.6 Optimizing Inlined Native Code
Our microbenchmark tests have thus far exposed the overheads of performing callouts and
callbacks. We have (except for hash) completely ignored the effects of native inlining and
callback transformations on native code that performs useful work. In this section, we present
the speedups obtained by exposing inlined and transformed native code to other optimizations
in the TR JIT compiler. Most notably, we inline native functions, and JIT-compile them at
increasingly higher optimization levels (HighOpt, HigherOpt) which perform a large set of
inter-procedural optimizations. We avoid a detailed explanation of these optimization strate-
gies, but focus on their effect on inlined native code.
For example, in Table 5.4, exposing hash to more JIT compiler optimizations improves
its runtime performance by a factor of almost 37. Exposing the same test to a still higher
level of optimization, however, dampens the runtime benefits obtained from compiling at
HighOpt, because of very aggressive and experimental optimizations that do not pay off.
In Figure 5.3, four microbenchmark tests contain native code that uses the JNI to ultimately
call Java functions that perform lookups on Java HashMap objects. These microbenchmark
tests are labeled:
• CVMethod - Call a void instance Java method
• CSVMethod - Call a void static Java method
• CIMethod - Call an integer returning instance Java method
• CSIMethod - Call an integer returning static Java method
Of interest here are the relatively small N-inlining+ baseline speedups (relative to the NoOpt
optimization strategy) these tests experienced, compared to the infinite speedups observed be-
fore with other microbenchmark tests (Table 5.1). This is because the amount of work per-
formed by the native code (or the Java functions called from native code) dominates the overhead
of performing the callout and callbacks. This is especially apparent in the N-inlining++ strategy,
which recursively inlines both the Java-callable native function and the Java function
called via a JNI Call[Static]<Type>Method callback: there is no significant speedup
even though the entire callout-callback path is eliminated.
Modest performance gains, however, are attributable to exposing recursively inlined func-
tions to other JIT compiler optimizations, as can be seen when compiling with the HigherOpt
optimization strategy. Our initial expectations were for larger gains, given the Java-centric
nature of the JIT compiler's optimization strategy and the fact that the inlined code contains
operations on Java objects. We do not, however, believe this is generalizable to other mi-
crobenchmark tests. As long as the additional optimizations keep the performance of the gen-
erated code at par, and the overhead of performing additional aggressive optimizations is not
excessive, we see no harm in performing them.
Another interesting result comes from the GArrayLength microbenchmark test. This test
creates an array of characters using NewCharArray and returns its length using
GetArrayLength. As reported in Table 5.5, the 94.7 speedup factor is due to the trans-
formation and further optimization of two callbacks. The first is the call to NewCharArray,
which is ultimately transformed to a call to a specialized JIT compiler object allocation rou-
tine that makes use of runtime information unavailable to the JVM. The second is the call to
GetArrayLength, which is ultimately transformed from an expensive JNI callback to the
JVM into a fast lookup of the array object's header field.
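A sketch of the GArrayLength test's native code follows (the names are hypothetical); after inlining, the first callback becomes a call to the specialized allocation routine and the second becomes a header-field load:

    #include <jni.h>

    JNIEXPORT jint JNICALL
    Java_Bench_arrayLength(JNIEnv *env, jclass cls, jint len)
    {
        jcharArray arr = (*env)->NewCharArray(env, len);  /* -> specialized
                                                             allocation */
        return (*env)->GetArrayLength(env, arr);  /* -> header field load */
    }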
In summary, our results indicate that it will often be profitable for a JIT compiler to perform
aggressive runtime optimizations on inlined native code.
[Figure 5.3: Exposing inlined native code to other JIT optimizations. Speedup by microbenchmark test and optimization level:

Microbenchmark Test    N-inlining+    N-inlining++    HighOpt    HigherOpt
CVMethod                      1.16            1.17       1.17         1.19
CSVMethod                     1.12            1.13       1.12         1.13
CIMethod                      1.12            1.14       1.13         1.15
CSIMethod                     1.09            1.13       1.13         1.14]
Microbenchmark Test    NoOpt (ns)    N-inlining+ (ns)    Speedup (X)
GArrayLength                 5641                  60           94.7

Table 5.5: GetArrayLength: Improvements with native inlining and callback transformations and other JIT compiler optimizations
5.7 Synthesis Decisions
The last set of results we present quantifies the benefits and costs of performing synthesis on in-
lined opaque function calls. We created two sets of microbenchmark tests (shown in Table 5.6)
to demonstrate the need for an inlining heuristic that can estimate performance degradation
when native inlining generates synthesized opaque function calls.
The first set of tests (Fully Transformed) consists of native code that contains only opaque
function calls. All the tests except File I/O contain a single synthesizable call to the libc
function encoded in the test's name (i.e., printfS calls printf, atoiS calls atoi). File I/O
performs a sequence of file-system calls (fopen, fread, fwrite, fseek, rewind and
fclose). The difference between the S and L Fully Transformed microbenchmark tests lies
in the length of the data passed to the libc function: S tests pass shorter data, whereas L tests
pass longer data (and therefore have longer runtimes).
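The printfS flavour, for example, reduces to native code of roughly the following shape (the names are hypothetical), leaving a single opaque libc call to synthesize:

    #include <jni.h>
    #include <stdio.h>

    JNIEXPORT void JNICALL
    Java_Bench_printShort(JNIEnv *env, jclass cls, jint value)
    {
        printf("value = %d\n", value);  /* the single synthesized call */
    }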
Inlining a single callout for each of these tests results in the synthesis of a single call and
a respectable speedup. For the printf*, atoi* and strlen* tests, a synthesis heuristic should guide
the inliner into performing the optimization on the native functions. In contrast, inlining the
native callsite in File I/O results in the generation of 13 synthesized calls, and no significant
increase in performance. A synthesis heuristic should advise the inliner to refrain from inlining
this native call because the runtime benefit from performing the optimization would not offset
its runtime cost.
The second set (Partially Transformed) contains the exact same native code from before
(Section 5.4) except that some of the callbacks are synthesized instead of transformed. We
conclude that a synthesis heuristic should approve inlining Partially Transformed native code.
In the four tests (G[Static]IntField, S[Static]IntField) the inliner is able to transform all the
JNI functions that provide definitions (in the JNI argument use/def sense) to the synthesized
call that reads or writes to the instance or static variable. Although the speedups obtained
in these four tests are lower than those recorded in Table 5.2, they still represent substantial
speedups when compared to compiling with the NoOpt optimization strategy.
Microbenchmark Test      NoOpt (ns)    N-inlining+ (ns)    Speedup (X)
Fully Transformed
  printfS                      1141                 560           2.0
  printfL                      3286                2951           1.1
  atoiS                         245                 128           1.9
  atoiL                         487                 212           2.3
  strlenS                       222                 129           1.7
  strlenL                       437                 173           2.5
  File I/O                     3652                3685          0.99
Partially Transformed
  GIntField                    2560                 320           8.0
  GStaticIntField              2194                 335           6.6
  SIntField                    2308                 335           6.9
  SStaticIntField              2144                 354           6.1

Table 5.6: Synthesizing calls to opaque functions
In general, our experiments have focused on the costs and benefits of inlining callouts
and transforming callbacks. The relatively small overheads and impressive gains attributable
to native inlining and callback transformations provide us with motivation to see through a
complete implementation of our strategy. In Chapter 6, we compare our strategy with other
approaches to minimizing JNI-related overheads, and Chapter 7 outlines other performance
and engineering issues that need to be taken into consideration when moving forward.
Chapter 6
Related Work
This chapter provides context for our strategy by describing research on general and Java-centric
approaches to language interoperability, as well as optimizations of Java native functions and
callbacks.
6.1 Alternative Language Interoperability Frameworks
Examples of language interoperability frameworks that operate across languages, processes
and machine boundaries include CORBA [38], Remote Procedure Calls [8] and the Compo-
nent Object Model [30]. These frameworks use interface definition languages (IDLs) to specify
common types, and depend on proxy stubs to help clients translate between machine architec-
tures, execution models and programming languages.
A more recent advance in language interoperability is Microsoft .NET [20], which claims
complete language interoperability between the family of .NET-labeled languages. .NET com-
pilers for each language transform source code to a common IL, the Microsoft Intermediate Lan-
guage (MSIL). The framework also requires additional language features, for example, inheri-
tance, overloading and exception handling in Visual Basic (VB) .NET. This, unfortunately, may
require revisions and modifications to existing VB 6.0 applications if these programs wish to take
advantage of the power of the CLR and CLS [22]. Furthermore, C and C++, two widely-used
languages are not fully supported within the framework.
The JNI is not the only way to bridge Java programs with native code. Liang [28] mentions
a number of ways to utilize inter-process communication to marshal data between processes
independently hosting the Java application and native code. These solutions, however, have
unacceptable performance characteristics (e.g., large memory footprints).
6.2 Programmer-Based Optimizations
Programmer-based optimizations to the JNI put the onus on the application programmer to
practice efficient coding techniques when writing native code that uses the JNI. A former IBM
developerWorks® article [25] advised batching native calls and passing as much data as pos-
sible per native call, and made a number of other recommendations for amortizing overhead,
including using the ExceptionCheck call instead of ExceptionOccurred because it is
less computationally expensive. Although this article is no longer available through develop-
erWorks, the recommended JNI programming practices are still valid.
Similarly, the JNI specification [28] suggests ways to avoid making JNI callbacks, such as
caching field and method ids during the static initialization of classes.
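The caching idiom looks roughly as follows (the class and field names are hypothetical): the ids are looked up once, in a native method invoked from the class's static initializer, and are reused on every subsequent call:

    #include <jni.h>

    static jfieldID cached_count_fid;  /* filled in once, reused forever */

    JNIEXPORT void JNICALL
    Java_Example_cacheIds(JNIEnv *env, jclass cls)  /* called from <clinit> */
    {
        cached_count_fid = (*env)->GetFieldID(env, cls, "count", "I");
    }

    JNIEXPORT jint JNICALL
    Java_Example_getCountCached(JNIEnv *env, jobject self)
    {
        /* No GetObjectClass/GetFieldID callbacks on the hot path. */
        return (*env)->GetIntField(env, self, cached_count_fid);
    }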
By reducing the overhead of the JNI automatically, our approach obviates these program-
ming practices, removing the added burden from the application programmer.
6.3 Restricting Functionality in Native Code
Restricting native code functionality is another way to reduce overhead and minimize the de-
pendence on JNI callbacks. The Intel ORP [12] gives JIT compilers the freedom to lay out
stack frames and use registers in any manner they want, thereby making them responsible
for unwinding the stack and enumerating roots for garbage collection (GC). Such unmanaged
code, however, requires the core virtual machine to generate special wrapper code for native
methods which provide support for unwinding past native frames, and enumerating JNI ref-
erences during GC. The wrappers also include synchronization code for synchronized native
methods. In order to avoid the work performed by these wrappers, the ORP supports a “direct
call” mechanism that bypasses the construction of the wrappers. The speedup that results from
not having to perform maintenance work in the wrappers comes at the expense of not being
able to unwind the stack. Therefore, direct calls can only be used for native methods that are
guaranteed not to require garbage collection, exception handling, synchronization or any type
of security support.
The JNI specification [28] provides a set of “critical” functions that may return direct ref-
erences to JVM data and objects at the cost of limiting the programmer’s freedom (code that
lies within the associated GetCritical and ReleaseCritical callbacks cannot invoke any
functions that might potentially block the running JVM thread and result in indeterminate JVM
behaviour).
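A sketch of this fast path, using the specification's GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical functions, is shown below (the array name is hypothetical): between the Get and Release calls the code must not invoke other JNI functions or block, in exchange for a possibly copy-free pointer into the JVM's array storage:

    #include <jni.h>

    JNIEXPORT jlong JNICALL
    Java_Example_sumCritical(JNIEnv *env, jclass cls, jintArray arr)
    {
        jint  len = (*env)->GetArrayLength(env, arr);
        jlong sum = 0;
        jint *p = (*env)->GetPrimitiveArrayCritical(env, arr, NULL);
        if (p != NULL) {
            for (jint i = 0; i < len; i++)  /* no JNI calls, no blocking */
                sum += p[i];
            (*env)->ReleasePrimitiveArrayCritical(env, arr, p, JNI_ABORT);
        }
        return sum;
    }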
Bacon [3] has implemented a JVM-specific JNI “trap-door” which simplifies reference
management for garbage collection, based on the observation that his native code only accesses
primitive parameters and never performs any JNI callbacks.
Although these strategies might improve performance in certain circumstances, they are
not general solutions, severely restrict native code functionality and cannot be used for most
existing JNI code.
6.4 Proprietary Native Interfaces
Proprietary native interfaces that are coupled with JVMs take advantage of knowing the inter-
nals of the JVM in order to mitigate the overheads of native calls and callbacks.
The PERC Native Interface [33] (PNI) is a native code calling interface that ships with the
PERC VM. Although PERC claims that it runs three times faster than the JNI, the PNI does not
provide some of the key safety features of the JNI.
Microsoft’s now-supplanted Raw Native Interface [16] (RNI) exposed the binary layout of
Java objects, and returned pointers to underlying JVM data, providing efficient access to JVM
data at the cost of being JVM dependent.
More recently, gcj [10] from Cygnus Solutions compiles Java source code ahead of time
to native binary code. The Cygnus Native Interface (CNI) can be used to call C++ code from
Java source. This solution, however, restricts native code to CNI-supporting compilers.
JNIWrapper [29] is a commercially available API that addresses the programmer-unfriendly
nature of writing JNI code, but does little else. It does not provide a solution to the overheads
involved in using the native interface.
According to [32], the original Native Method Interface was deprecated because it coupled
native code to a specific JVM implementation too closely. As a side-effect, this restricted the
JVM to using conservative garbage collection algorithms.
In summary, all of these approaches closely couple the native interface with the specific
virtual machine, and thus seriously restrict portability. The JNI, in contrast, is a cleaner and
more portable solution because all JVM internals are represented in an opaque manner and can
only be accessed via JNI callbacks. This, ironically, lies at the heart of JNI-related overheads.
Our work is independent of the JVM being used and our technique can be utilized by those
who wish to support it.
6.5 Unmanaged Memory
A different approach to native code involves extending the JVM to support features for which
native functions are commonly used. One example is incorporating unmanaged memory into
the JVM. The provision of high-speed access to unmanaged memory can be used to implement
shared memory segments, memory-mapped files, communication and I/O buffers, and even
memory-mapped hardware devices.
Jaguar [42] implements Bytecode-to-assembly code mappings in a JIT compiler to generate
inlined assembly code for limited sequences of Bytecode. The main use of the mappings
is to map object fields to memory outside the managed Java heap. The benefit of this approach
is that the performance obtained for various Java-based latency and bandwidth simulations ap-
proaches that recorded by native implementations. However, there are a number of limitations
with this approach, including the inability to recognize and map long, complex sequences of
Bytecodes, as well as the inability to apply mappings to virtual method invocations, since code
mappings cannot handle runtime class loading.
Buffers in the Java new I/O libraries [24] improve performance in the areas of primitive
type buffer management, scalable network and file I/O, character-set support, and regular-
expression matching. Buffers are allocated outside the garbage-collected heap, and can be
accessed by the JVM without having to perform any time-consuming copy operations. The
general recommended usage of direct buffers is for large, long-lived buffers that are subject to
the underlying system’s native I/O operations.
The PERC Native Interface (PNI) [33] also provides an alternative solution to unmanaged
memory through DirectMemory, which consists of static methods for reading and writing
data in external memory. The methods look just like regular Java functions, but are treated
separately by PERC's ahead-of-time and JIT compilers. These methods can be further
optimized by inlining them, disabling any array-bounds checking, and performing processor-
specific optimizations, resulting in machine instructions generated by a native compiler.
Generally speaking, our thesis is orthogonal to work on unmanaged memory.
6.6 Optimizing the JNI
Optimizations that target native functions and the JNI specifically include IBM’s enhancements
to the Java 2 Platform mentioned in [25] and inlining of helper code that sets up JNI stack
frames as mentioned in [12]. IBM also reuses existing Java stack frames to reduce the native
stack frame setup overhead.
Andrews’ [1] optimizations include native memory mirroring, which effectively registers
static Java memory with native code, making writes to these statics immediately visible to
native code. Andrews also suggests provisioning the JNI for lightweight calls similar to what
Bacon [3] suggests.
The Intel ORP also speeds up the performance of native calls by using an inlined sequence
of instructions to allocate and initialize outgoing JNI reference handles directly. This is very
similar to IBM’s Direct2JNI mechanism (Section 3.1.1).
From the work described in [32], code generated by SGI’s IRIX Java JIT compiler uses
the same calling convention as native code. This minimizes the overhead in making transitions
from a JIT compiler’s calling convention to a native calling convention.
Work on efficient object serialization [36] also proposes changes to the JNI. It argues that
some benchmarks clearly advocate extending the JNI to provide a routine that can copy all
primitive type instance variables of an object into a buffer at once.
No known implementations of native function inlining exist, but it has been referred to
by Andrews [1] and by Liang [28] as a powerful yet difficult-to-implement optimization.
Our solution demonstrates that native function inlining is feasible with a JIT compiler and
a mature native IL, and that the benefits of removing overhead alone may make it worthwhile.
Furthermore, native inlining enables more aggressive optimizations, similar to traditional in-
lining techniques, as we have shown in Section 5.6.
6.7 Compiler IL as Runtime Program Data
Our strategy of retaining the IL generated by a traditional static compiler to support future opti-
mizations is similar to strategies used to support link-time cross-module inlining optimizations
in several commercial compilers, including those from HP [2] and IBM [26]. It is also remi-
niscent of the strategy for supporting “life-long” program optimization used in LLVM [27].
Storing the more compact representation (i.e., W-Code instead of TR-IL) and converting at
runtime is in keeping with the “slim binaries” strategy proposed by Franz and Kistler [18].
Chapter 7
Conclusions
This chapter describes the engineering issues that make our design harder to generalize, per-
formance issues that need to be accounted for, and directions that can be taken to further our
prototype implementation.
7.1 Engineering Issues
Some of the language features of Java that our implementation does not support include native
callsite polymorphism (i.e., overriding and overloading of native functions, including dynam-
ically changing the target of a native function call by using the RegisterNatives JNI
callback). These can be addressed by using virtual guards or other runtime assumption-based
techniques that require validation and ensure the execution of the correct version of native code.
Synchronized native functions also fall outside the scope of this thesis but can be handled
via simple synchronization handshaking with the JVM before, inside and after the inlined
callsite.
For the purposes of generating a proof-of-concept prototype, we have ignored handling
JNI callbacks that deal with string parameter access, reference creation, exception handling,
the Reflection-related functions, monitors and the invocation interface. Although these are
important features that must be handled in a full implementation, we believe this is a matter
of engineering, and one that will not substantially alter the applicability of our native function
inlining optimization.
We have not yet built the generalized shape-matching or the control-flow-dependent use/def
analysis required to automatically detect and transform all callbacks. Although the pattern-
matching nature of our callback identification mechanism is restrictive, we believe that it
handles a sufficient number of native functions, especially those written in the style suggested by
Liang [28]. Certain J9 virtual machine data structures might also need to be altered in order to
identify arguments to JNI callbacks that originate from cached storage.
Our implementation ignores inlining native functions with variable-length parameter lists
and parameters whose addresses are taken; our inliner was not able to guarantee the preser-
vation of the stack semantics of these C language features. We also do not handle any
of the Call[Static]<Type>Method callbacks that accept arguments in array or variable-
list form.
Finally, there is a need for a robust infrastructure that maps native functions to their ap-
propriate libraries (containing W-Code data). As it stands, the paths and names of many native
libraries are hard-coded into our copy of the JIT compiler source. A side-effect of this is our
inability to recursively inline non-opaque native functions declared in external modules. The
simplest solution is a lookup-table based approach for the mapping. A more elegant solution
would see this be part of a dynamic library loading scheme that would invalidate JIT compiled
code dependent on unloaded native modules.
7.2 Performance Issues
Although a JIT compiler is unlikely to be able to compete with a static native code optimizer,
the W-Code IL stored alongside our native binaries is the output of a sophisticated interpro-
cedural optimizer and loop transformer. This provides the TR JIT compiler with some of the
benefits of static analysis that could not be contained in the compile-time budget of a dynamic
compiler. We have also demonstrated (in Section 5.6) that some of the runtime information
unavailable to static optimizers helps further improve the quality of the inlined native code. At
this time, however, we do not have results that allow us to compare the performance of our
inlined native code against that of its statically compiled counterparts.
As with any JIT optimization, we wish to amortize the cost of performing the optimization
by obtaining significant runtime benefits from native function inlining and optimizing call-
back transformation. Our inlining strategy, however, must also deal with native code that is
non-inlineable and callbacks that are non-transformable. Furthermore, it must provide correct
linkages for program data that becomes shared between Java and native code as a byproduct
of inlining, and most importantly, it should enforce both native language and JNI semantics on
any inlined and transformed code. By focusing on microbenchmark results, we have omitted a
comprehensive analysis of the costs of these runtime decisions and interactions. Furthermore,
the use of microbenchmarks implies a lower than expected level of stress on the JVM and JIT
compiler, making our results potentially more favorable.
We should also note that we have completely ignored the eight-way nature of the test system
we used to run our microbenchmark tests, as well as the level of stress on the system during
our runs. Even though our microbenchmark tests are all single-threaded, the JVM we use in
our implementation is multi-threaded. Our results therefore ignore any processor scheduling
effects on JVM performance. We have also run our tests multiple times to eliminate any noise
that may result from varied stress levels on our system. Our raw results are consistent across
all runs.
7.3 Future Directions
Besides the previously mentioned engineering and performance issues, our implementation
was not overly concerned with modifying the TR JIT inliner's heuristics, except for some fine-
tuning that recognized a number of differences between Java and native functions (e.g., function
size).
Explorable heuristics include the IL-conversion cost heuristic mentioned in Chapter 5 as
well as a heuristic that uses the number of opaque calls contained in a potentially-inlineable
native function to guide runtime native inlining decisions. Developing and examining the out-
come of using heuristics such as these makes for interesting future work.
Another heuristic that could be used during the callback transformation phase is one that
examines the context of the callback with respect to the host architecture. For example, if the
character conventions used by the JVM and the host architecture are different (i.e., Unicode
vs. Unicode Transformation Format-8), an inlined GetStringUTFChars callback which
provides a value that is used in a synthesized call to printf will still require an expensive
copy and string format conversion. In such a situation, it may be more profitable to leave the
original Java-callable native callsite alone.
One might also want to extend our work to cover a larger set of native languages (i.e., those
that have W-Code-emitting front-ends) and derive a larger infrastructure for cross-language
interoperability. Looking at it from another perspective, this would mean an infrastructure that
makes static languages more “dynamic”.
7.4 Conclusion
In this thesis we presented a novel strategy that reduces the performance penalties incurred
by Java applications invoking native functions, as well as native code performing JNI call-
backs. By using an optimizing JIT compiler to inline native function calls at runtime, the cost
of calling and returning from native code is completely eliminated. Our strategy also per-
forms optimizing transformations on expensive JNI callbacks, transforming them to cheaper
but semantically-equivalent operations.
Our solution preserves the semantics of the native language when inlining by converting
native code IL to the JIT compiler’s internal representation. This is done by extracting native
IL that is stored alongside statically optimized native binaries. When performing inlining, the
JIT compiler is also able to remove pessimistic assumptions on the side-effects of Java code
containing opaque native function calls, and performs an aggressive suite of optimizations to
further increase the performance of inlined native code.
Microbenchmark tests measuring performance indicate our prototypical implementation
significantly speeds up Java applications containing native function calls. In most cases, our
strategy is able to reclaim the overheads attributed to native calls and JNI callbacks.
We have also identified opportunities to extend our work to a full implementation. Besides
a number of engineering issues and performance concerns, we have highlighted heuristics that
may guide the JIT compiler in making better runtime inlining decisions.
Appendix A
This appendix describes each microbenchmark test and presents the raw data used to derive
our experimental results. The following tables represent the time to perform 100 million calls
for each microbenchmark test. The calls were made from a loop, which was timed using
System.currentTimeMillis(). The loop overhead, measured as the time taken by an
empty loop iterating 100 million times, was then subtracted from the recorded time.
Each test was run three times for accuracy. The results reported in Chapter 5 are the average of
each set of three runs converted to nanoseconds and divided by 100 million.
W-Code Conversion Costs - Measurements
Table A.1 contains the raw data used to derive the results of Figure 5.1 in Chapter 5. We report
the total number of opcodes converted, the total time for the conversion, and the average time
per opcode for each benchmark.
Inlining Callouts - Measurements
Table A.2 contains the raw data used to derive the results of Section 5.3. The microbenchmark
tests that were run include:
• 0 args, 1 arg, 3 args and 5 args are empty native method calls with the indicated number
of parameters
Benchmark    Total W-Code Opcodes    Total Time (ms)    Time per Opcode (µs)
bzip2                       15383             78.277                    5.09
crafty                      84693            466.952                    5.51
gap                        336466           1797.185                    5.34
gcc                        133506            663.246                    4.97
gzip                        25469            139.263                    5.47
mcf                          5615             25.431                    4.53
parser                      48411            256.472                    5.30
perlbmk                    279196           1596.122                    5.72
twolf                      105027            547.702                    5.21
vortex                     193413           1091.121                    5.64
vpr                         56756            310.426                    5.46

Table A.1: Cost of W-Code to TR-IL conversion for SPEC CINT2000 benchmarks
                              NoOpt (ms)                 N-inlining (ms)
Microbenchmark Test    run1     run2     run3        run1    run2    run3
instance
  0 args              42126    42534    42536         -81     -82     -70
  1 arg               46636    45219    45637        -160    -132    -173
  3 args              48906    49199    48888        -142    -150    -154
  5 args              59295    57491    57011        -141    -168    -152
static
  0 args              12768    12665    12881         -25     -15     -10
  1 arg               13727    13717    13737        -185    -188    -187
  3 args              13756    13860    13652        -190    -183    -166
  5 args              14253    14200    14303        -168    -186    -172
hash                  30699    31617    30258        8589    8665    8592

Table A.2: Raw timing measurements for Table 5.1
• hash is a static native method call implementing Wang's 32-bit mix hash function
(http://www.concentric.net/~Ttwang/tech/inthash.htm) using its parameter as the key
The negative timings in Table A.2 might be attributable to side-effects from CPU scheduling,
combined with the poor granularity of our timing mechanism. Theoretically, these values
should all be 0.
Callback Transformations - Measurements
Table A.3 contains the raw data used to derive the results of Table 5.2 in Chapter 5. The
microbenchmark tests that were run include:
• SIntField and GIntField are native functions that contain GetObjectClass,
GetFieldID, SetIntField and GetIntField callbacks, respectively, which are
transformed to compile-time constants and direct field reads and writes.
• SStaticIntField and GStaticIntField contain GetStaticFieldID,
SetStaticIntField and GetStaticIntField callbacks, respectively, which
are transformed to compile-time constants and direct field reads and writes.
• CVoidMethodE contains GetObjectClass, GetMethodID and CallVoidMethod
callbacks which are transformed to compile-time constants and a virtual function call.
The void method being called assigns its arguments to local variables.

• CStaticVoidMethodE contains GetMethodID and CallStaticVoidMethod callbacks
which are transformed to a compile-time constant and a direct function call. The void
method being called assigns its arguments to local variables.

• CIntMethodE contains GetObjectClass, GetMethodID and CallIntMethod
callbacks which are transformed to compile-time constants and a virtual function call.
The integer-returning method being called performs a simple algebraic operation on its
arguments and returns its result.

• CStaticIntMethodE contains GetMethodID and CallStaticIntMethod callbacks
which are transformed to a compile-time constant and a direct function call. The
integer-returning method being called performs a simple algebraic operation on its
arguments and returns its result.
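The following C sketch, with hypothetical class, field and method names, shows the shape of
two of these natives; the inline comments mark what each JNI callback is transformed into.

    #include <jni.h>

    /* GIntField-style test: assumes a hypothetical Java class Bench with an
     * instance field "int value". */
    JNIEXPORT jint JNICALL Java_Bench_gIntField(JNIEnv *env, jobject obj) {
        jclass cls = (*env)->GetObjectClass(env, obj);             /* -> constant */
        jfieldID fid = (*env)->GetFieldID(env, cls, "value", "I"); /* -> constant */
        return (*env)->GetIntField(env, obj, fid);            /* -> direct read */
    }

    /* CVoidMethodE-style test: assumes a hypothetical Java method
     * "void consume(int)" on the same class. */
    JNIEXPORT void JNICALL Java_Bench_cVoidMethodE(JNIEnv *env, jobject obj,
                                                   jint arg) {
        jclass cls = (*env)->GetObjectClass(env, obj);             /* -> constant */
        jmethodID mid = (*env)->GetMethodID(env, cls,
                                            "consume", "(I)V");    /* -> constant */
        (*env)->CallVoidMethod(env, obj, mid, arg);           /* -> virtual call */
    }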
As with Table A.2, the negative timings in Table A.3 might be attributable to side effects
of CPU scheduling combined with the coarse granularity of our timing mechanism; theoretically,
these values should all be 0.
                         NoOpt (ms)                  N-inlining (ms)
Microbenchmark Test    run1    run2    run3        run1    run2    run3
CVoidMethodE         262850  264079  261280       20354   20518   20405
CStaticVoidMethodE   254540  253224  249077       21259   21322   22305
CIntMethodE          263590  266906  265090       21598   21322   22305
CStaticIntMethodE    258982  255088  252200       22098   21120   22705
GIntField            251690  253070  263123         -61     -59     -47
GStaticIntField      218980  218963  220209         -75     -77     -77
SIntField            232588  232407  227453         -74     -62     -69
SStaticIntField      211302  213480  218301         -15     -35     -25

Table A.3: Raw timing measurements for Table 5.2
Data-Copy Transformations - Measurements
Table A.4 contains the raw data used to derive the results of Table 5.3 and Figure 5.2 in Chap-
ter 5. The native code in the microbenchmark tests differed in the length of the array being
processed and called the GetIntArrayRegion callback. We were able to transform this
callback to a high-speed array copy function call supplied by the JIT compiler’s runtime envi-
ronment.
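The sketch below is a minimal C rendering of these tests; the function name is hypothetical,
and LEN stands in for the array lengths varied across the tests (1 to 10000).

    #include <jni.h>

    #define LEN 1000 /* varied from 1 to 10000 across the tests */

    /* Copies LEN ints from the Java array into a native buffer. The
     * GetIntArrayRegion callback is what our transformation replaces with the
     * runtime's high-speed array copy. */
    JNIEXPORT void JNICALL Java_Bench_copyRegion(JNIEnv *env, jclass clazz,
                                                 jintArray arr) {
        jint buf[LEN];
        (*env)->GetIntArrayRegion(env, arr, 0, LEN, buf);
    }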
                NoOpt (ms)                      N-inlining+ (ms)
Array Length    run1      run2      run3       run1      run2      run3
1              62798     56502     56447        203       211       306
10             57954     59522     62417       2057      2089      2075
100           105188     95462    102947       8419      8495      8724
1000          454158    454669    452364      60147     60207     59787
5000         2039887   2041478   2056517     630476    629410    630673
10000        4034440   4057517   4341044    1567593   1298281   1299391

Table A.4: Raw timing measurements for Table 5.3 and Figure 5.2

Optimizing Inlined Native Code - Measurements

Table A.5 contains the raw data used to derive the results of Table 5.4 and Figure 5.3 in
Chapter 5. The hash test in the microbenchmark is identical to the one mentioned earlier in
this appendix, whereas the other four tests call back into Java to perform a lookup on a
HashTable object (a Java sketch of such a lookup target follows this list):

• the native code in the CVoidMethod test contains GetObjectClass, GetMethodID
and CallVoidMethod callbacks which are transformed to compile-time constants and a
virtual function call.
• the native code in the CStaticVoidMethod test contains GetMethodID and
CallStaticVoidMethod callbacks which are transformed to a compile-time constant
and a direct function call.

• the native code in the CIntMethod test contains GetObjectClass, GetMethodID
and CallIntMethod callbacks which are transformed to compile-time constants and
a virtual function call.

• the native code in the CStaticIntMethod test contains GetMethodID and
CallStaticIntMethod callbacks which are transformed to a compile-time constant
and a direct function call.
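For reference, a Java sketch of the kind of lookup target these four tests call back into;
the class and method names are hypothetical assumptions, not the thesis sources.

    import java.util.Hashtable;

    // Hypothetical Java-side callee for the C[Static]VoidMethod and
    // C[Static]IntMethod tests: each callback lands in a Hashtable lookup.
    public class LookupTarget {
        private final Hashtable table = new Hashtable();

        // Target of CallVoidMethod: performs the lookup, discards the result.
        public void lookupVoid(Object key) {
            table.get(key);
        }

        // Target of CallIntMethod: performs the lookup and returns an int.
        public int lookupInt(Object key) {
            Object v = table.get(key);
            return (v == null) ? 0 : v.hashCode();
        }
    }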
We observe the effects of increasing the optimization level of the JIT compiler on inlined
native code. In all but the NoOpt, N-inlining+ and N-inlining++ columns, native inlining is
the first optimization in a suite of other optimizations, as dictated by the TR JIT compiler's
optimization strategies and policies. NoOpt is the case with no optimizations; N-inlining+
enables native function inlining and callback transformations; and N-inlining++ does the same,
but also enables recursive inlining (thereby allowing a callout-and-callback sequence to be
inlined). The values contained in the table are in milliseconds (ms).
Table A.6 contains the raw data used to derive the GArrayLength results of Table 5.5
in Chapter 5. GArrayLength contains FindClass, GetMethodID, NewCharArray and
GetArrayLength callbacks, used to instantiate a new character array and return its length;
these calls are transformed to compile-time constants as well as a more direct array-length
function call (a sketch follows).
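A reduced C sketch of this test follows; the function name and array length are hypothetical,
and the FindClass/GetMethodID lookups of the original test are elided here.

    #include <jni.h>

    /* GArrayLength-style test: instantiates a new char array and returns its
     * length. The NewCharArray and GetArrayLength callbacks shown are among
     * those transformed to compile-time constants and a more direct
     * array-length call. */
    JNIEXPORT jint JNICALL Java_Bench_gArrayLength(JNIEnv *env, jclass clazz) {
        jcharArray arr = (*env)->NewCharArray(env, 16); /* length assumed */
        return (*env)->GetArrayLength(env, arr);
    }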
Synthesis Benefits - Measurements
Table A.7 contains the raw data used to derive the results of Table 5.6 in Chapter 5. The
microbenchmark tests include:
• printfS contains a function call to printf passing an empty string
• printfL contains a function call to printf passing the string “Hello World”
• atoiS contains a function call to atoi passing the string “123”
• atoiL contains a function call to atoi passing the string “1234567890”
• strlenS contains a function call to strlen passing the string “I”
• strlenL contains a function call to strlen passing the string “IEEEEEEEEEEEEEE”
• File I/O contains a sequence of fopen, fread, fwrite, fseek, rewind, fclose
and printf function calls
The four other tests (G[Static]IntField and S[Static]IntField) are identical to the ones
mentioned earlier, except that the JNI function call involving the Java function is
synthesized instead of transformed. A C sketch of two of the library-call tests follows.
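A minimal C sketch of two of the library-call tests, under the assumption of hypothetical
JNI function names; the string literals are those named in the test descriptions above.

    #include <jni.h>
    #include <stdio.h>
    #include <string.h>

    /* printfS: calls printf with an empty string. Once the native is inlined,
     * the library call becomes a candidate for synthesis by the JIT. */
    JNIEXPORT void JNICALL Java_Bench_printfS(JNIEnv *env, jclass clazz) {
        printf("");
    }

    /* strlenL: calls strlen on the longer test string. */
    JNIEXPORT jint JNICALL Java_Bench_strlenL(JNIEnv *env, jclass clazz) {
        return (jint)strlen("IEEEEEEEEEEEEEE");
    }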
Microbenchmark                 NoOpt  N-inlining+  N-inlining++  HighOpt  HigherOpt
Test               Run          (ms)         (ms)          (ms)     (ms)       (ms)
hash                 1         30699         8589          8589      849       1293
                     2         31617         8665          8665      847       1290
                     3         30258         8592          8592      841       1289
CVoidMethod          1        273190       234110        231279   228777        N/A
                     2        270134       243252        235593   227897        N/A
                     3        271053       223452        230982   227968        N/A
CIntMethod           1        274116       245482        241609   239786        N/A
                     2        274230       246100        241802   238900        N/A
                     3        274120       244019        240230   239100        N/A
CStaticVoidMethod    1        266320       232409        237927   235655        N/A
                     2        267789       231323        235879   235732        N/A
                     3        263098       249090        234012   235125        N/A
CStaticIntMethod     1        273262       250052        241721   241022        N/A
                     2        273651       251101        242200   240019        N/A
                     3        274234       250534        241983   241323        N/A

Table A.5: Raw timing measurements for Table 5.4 and Figure 5.3
                         NoOpt (ms)                  N-inlining (ms)
Microbenchmark Test    run1    run2    run3        run1    run2    run3
GArrayLength         568696  563905  559744        6031    5967    6128

Table A.6: Raw timing measurements for Table 5.5
                          NoOpt (ms)                 N-inlining+ (ms)
Microbenchmark Test     run1    run2    run3       run1    run2    run3
Fully Transformed
  printfS             113663  115530  113204      55018   56487   56454
  printfL             320257  336235  329253     292413  299745  293179
  atoiS                23434   22131   27788      12870   12716   12719
  atoiL                48190   48650   49137      21226   21383   20864
  strlenS              21811   22001   22757      13009   12712   13070
  strlenL              42690   43980   44561      17008   17555   17316
  File I/O            363776  366626  365381     367589  368423  369641
Partially Transformed
  GIntField           251690  253070  263123      31239   33250   31429
  GStaticIntField     218980  218963  220209      33788   33510   33117
  SIntField           232588  232407  227453      33554   33608   33243
  SStaticIntField     211302  213480  218301      36288   35254   34502

Table A.7: Raw timing measurements for Table 5.6