Sunday, May 20, 2012

Understand the overhead of JNI

JNI [1][2] allows Java code to call native code written in C/C++. Inside the native implementation, it is also possible to call back into the JVM, for example to access a class's field or to invoke another Java method. JNI's overhead comes from three parts:

  • Native code prevents optimizations that the JVM could otherwise make
  • Setting up the environment to start a JNI call
  • Data copies between the JVM and native code, and the indirection needed to access fields and methods from native code

JNI hurts JVM Optimization


  • The JVM can't inline the native method, no matter how simple it is.
  • The JVM doesn't know enough about the method to make the optimizations it could make when compiling a regular Java method (for example, it has to assume that all of the parameters passed in are always used).
  • The JVM can't make other optimizations that it could make if it were dynamically compiling the code (e.g., compiling a constant parameter into a constant operand of a machine instruction rather than placing it on the stack and reading it off again).


The pure cost of making a JNI call


In order to make the native call into the DLL or library, the JVM may have to perform extra work, such as rearranging items on the stack.

Suppose we have a dummy native function that does nothing; on a 32-bit JVM (OpenJDK 6), a single call takes about 10 ns on a test machine.

private static native void noopJni();

extern "C" JNIEXPORT void JNICALL
Java_com_foo_test_JniPerfTest_noopJni(
    JNIEnv* env, jclass) {
}

Interaction between JVM and native code


If the native code only needs to access the data passed as parameters and return a value, there is no additional cost involved. In practice, however, the native code usually has to interact more with the JVM, e.g., reading data owned by the JVM or calling other Java methods. Take the following native code for example, which computes the sum of an integer array:

private static native int sum(int[] src);

extern "C" JNIEXPORT jint JNICALL
Java_com_foo_test_JniPerfTest_sum(
JNIEnv* env, jclass, jintArray src) {
  const jint size = env->GetArrayLength(src);
  jint* data = env->GetIntArrayElements(src, 0);
  jint sum = 0;
  for (int i = 0; i < size; ++i) {
     sum += data[i];   
  }
  env->ReleaseIntArrayElements(src, data, 0);
  return sum;
}

It's invalid to access the array (src) directly, so the native code has to call GetIntArrayElements, which either pins the Java array or copies it into a native array. The reason we have to call GetIntArrayElements is that a GC thread may move the data around during the native call. The virtual machine guarantees that the pointer returned by GetIntArrayElements points to a non-movable array of integers: the JVM either "pins" the array in place or copies it into non-movable memory. When the native code has finished using the array, it must call ReleaseIntArrayElements, which copies the data back and frees the buffer if it was a copy of the original Java array, or "unpins" the Java array if it was pinned. Forgetting to do so causes a memory leak.
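
Which strategy the VM chose can be observed through the optional jboolean out-parameter of GetIntArrayElements. A minimal fragment from inside a native method body (variable names are illustrative):

```cpp
// env and src are the JNIEnv* and jintArray arguments of the native method.
jboolean isCopy;
jint* data = env->GetIntArrayElements(src, &isCopy);
// isCopy == JNI_TRUE  -> the VM copied the array into non-movable memory
// isCopy == JNI_FALSE -> the VM pinned the original array in place
env->ReleaseIntArrayElements(src, data, 0);
```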

If the array is large and the code only needs to access part of it, it can call the Get/Set<type>ArrayRegion functions, which copy just the given region instead of the whole array.
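
For example, here is a sketch of a region-based sum that copies the array in fixed-size chunks (the function name sumRegion and the chunk size of 256 are illustrative, not part of the original benchmark):

```cpp
#include <jni.h>

// Hypothetical variant: sums a Java int[] by copying 256-element chunks
// into a stack buffer with GetIntArrayRegion, so at most 1 KB of data
// is materialized on the native side at any time.
extern "C" JNIEXPORT jint JNICALL
Java_com_foo_test_JniPerfTest_sumRegion(
    JNIEnv* env, jclass, jintArray src) {
  const jint size = env->GetArrayLength(src);
  jint buf[256];
  jint sum = 0;
  for (jint off = 0; off < size; off += 256) {
    const jint n = (size - off < 256) ? (size - off) : 256;
    env->GetIntArrayRegion(src, off, n, buf);  // copies only [off, off + n)
    for (jint i = 0; i < n; ++i) {
      sum += buf[i];
    }
  }
  return sum;
}
```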

Another choice is GetPrimitiveArrayCritical, which has been available since Java 1.2. It temporarily disables garbage collection for that array and provides direct access to the array's memory most of the time. The drawback of GetPrimitiveArrayCritical is that no blocking operation may run between GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical, otherwise a deadlock can occur, so the code between the two calls should be treated as a critical section. Here is the code snippet:


static native int sumCritical(int[] src);
extern "C" JNIEXPORT jint JNICALL
Java_com_foo_test_JniPerfTest_sumCritical(
  JNIEnv* env, jclass, jintArray src) {
  const jint size = env->GetArrayLength(src);
  jint* data = (jint*) env->GetPrimitiveArrayCritical(src, 0);
  jint sum = 0;
  for (int i = 0; i < size; ++i) {
     sum += data[i];
  }
  env->ReleasePrimitiveArrayCritical(src, data, 0);
  return sum;
}

Compare that with a pure Java implementation:
static int javaSum(int[] src) {
  int result = 0;
  for (int i = 0; i < src.length; ++i) {
    result += src[i];
  }
  return result;
}

Here are the benchmark results on a test machine with a 32-bit VM (int array size 1024). The time shown is the average time for a single sum call.

Pure Java                          493 ns
JNI GetIntArrayElements           1675 ns
JNI GetPrimitiveArrayCritical      703 ns


Although GetPrimitiveArrayCritical can significantly reduce the overhead of copying the array, the JNI implementation is still slower than the pure Java implementation. Also, in some use cases GetPrimitiveArrayCritical is not practical, because the code may need to perform blocking operations while holding the array or other data.

There are other drawbacks of JNI mentioned in various documents: it exposes raw pointer access and makes the program more vulnerable to invalid memory accesses, it is hard to handle signals properly across the Java/native boundary, etc.


Summary


The usage of JNI should be limited because of its performance overhead and other drawbacks such as unsafe memory access.

If JNI is necessary, here are a few tips to make it more efficient:

  • Reduce the number of JNI calls.
  • Limit the interaction between the JVM and native code; in particular, limit the data passed between Java code and native code (e.g., use GetPrimitiveArrayCritical where possible).
  • If the function's performance is critical and the interaction between native code and Java code is rare, consider making it an intrinsic (letting the JVM replace it with machine code directly at run time; a typical example is System.arraycopy, and a real example can be found here).

Sunday, April 1, 2012

Real time profiling with PProf

Profiling Go Programs introduced how to profile a Go program in a generic way. With the release of Go 1, profiling has become much easier: you can register various profile handlers with the running HTTP server and profile the program while it is running. The trick is very simple:

Link in the pprof HTTP handlers by adding the following import to your program (usually the main file):

      import _ "net/http/pprof"

Then you can easily grab a heap profile (to understand how the memory is consumed) by using the command below:

      go tool pprof http://localhost:6060/debug/pprof/heap

Or to look at a 30-second CPU profile:

      go tool pprof http://localhost:6060/debug/pprof/profile

Or to view all available profiles:

      go tool pprof http://localhost:6060/debug/pprof/

For more information on how to understand the pprof output, you can go to Profiling Go Programs or the homepage of google perf tools.

BTW: since these pprof handlers are so convenient, it is generally a good practice to start a debug HTTP server with pprof enabled even if the program is not going to serve any HTTP traffic.