OpenCL is a new technology that lets programmers write small programs (“kernels”) that are executed in a massively parallel way on modern graphics card processors (GPUs), which have become dormant powerhouses for most non-gamers. It also allows these kernels to run on (multicore) CPUs, effectively providing a unified way to tap into most of the processing power of a modern computer.

OpenCL comes with a runtime (integrated into the recent Mac OS X 10.6, and soon to be provided by all major graphics card vendors on other operating systems), a C API for host programs, and a dedicated C-derived language for writing the parallel kernels being hosted (see OpenCL’s specification).

So how can we use this promising technology from Java?

The odd news is that for once, it will be an easier Java affair on Mac OS X than on other operating systems, since Apple’s OpenCL implementation is the only officially released one in the wild so far (who said that Java on the Mac was in bad shape?).

So first, we need OpenCL bindings for Java. These are trivial to create thanks to JNAerator (see the “Building” section on NativeLibs4Java’s OpenCL page).

These auto-JNAerated C-style wrappings are indeed easy to create, but they are a bit painful to use, for a few reasons:

  • JNAerator creates multiple options for each bound function: it might not be obvious which one is best suited to a given use.
  • The OpenCL API is very C-oriented and plain unfriendly to OO-brainwashed Java developers.
  • One has to know JNA’s runtime classes a bit to know how to call the methods (do you all know what an IntByReference is?)
  • There are few OpenCL examples on the net.
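To give a feel for why the raw bindings are awkward, here is a hedged sketch of the C-style out-parameter pattern that JNA mirrors. The `IntByReference` class below is a simplified stand-in for JNA’s real `com.sun.jna.ptr.IntByReference`, and `clGetPlatformIDsStyle` is a purely hypothetical function shaped like the C API’s `clGetPlatformIDs`: results come back through pointer-like references, and the return value is just an error code.

```java
// Simplified stand-in for JNA's com.sun.jna.ptr.IntByReference,
// shown only to illustrate the C-style out-parameter pattern.
class IntByReference {
    private int value;
    public void setValue(int value) { this.value = value; }
    public int getValue() { return value; }
}

public class OutParamDemo {
    // Hypothetical binding shaped like OpenCL's clGetPlatformIDs:
    // the platform count comes back through the reference argument,
    // and the return value is a C-style error code (0 = success).
    static int clGetPlatformIDsStyle(int maxEntries, IntByReference numPlatforms) {
        numPlatforms.setValue(1); // pretend exactly one platform was found
        return 0;
    }

    public static void main(String[] args) {
        IntByReference count = new IntByReference();
        int err = clGetPlatformIDsStyle(0, count);
        System.out.println("err=" + err + ", platforms=" + count.getValue());
    }
}
```

Compare that with an OO wrapper where you would simply call a method returning an array of platforms: no error codes to check by hand, no by-reference plumbing.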

Ok, so we just need a simple OO layer around that C API, right ? Here it is !

If this still looks cryptic to you, here’s the long-awaited simple Java example that uses the OpenCL bindings to run simple parallel operations (it runs fine on Mac OS X 10.6):

CLPlatform platform = OpenCL4Java.listPlatforms()[0]; // take first platform available
CLDevice[] devices = platform.listAllDevices();
//CLDevice[] devices = platform.listCPUDevices();
//CLDevice[] devices = platform.listGPUDevices();
CLContext context = CLContext.createContext(devices);
int dataSize = 10000;
String src = "\n" +
        "__kernel void aSinB(                                                       \n" +
        "   __global const float* a,                                       \n" +
        "   __global const float* b,                                       \n" +
        "   __global float* output)                                        \n" +
        "{                                                                             \n" +
        "   int i = get_global_id(0);                                      \n" +
        "   output[i] = a[i] * sin(b[i]) + 1;                            \n" +
        "}                                                                             \n";
CLProgram program = context.createProgram(src).build();
CLKernel kernel = program.createKernel("aSinB");
CLQueue queue = context.createDefaultQueue();
// Allocate OpenCL-hosted memory for inputs and output
CLMem memIn1 = context.createInput(dataSize * 4);
CLMem memIn2 = context.createInput(dataSize * 4);
CLMem memOut = context.createOutput(dataSize * 4);
// Bind these memory objects to the arguments of the kernel
kernel.setArgs(memIn1, memIn2, memOut);
/// Map input buffers to populate them with some data
FloatBuffer a = memIn1.blockingMapWrite(queue).asFloatBuffer();
FloatBuffer b = memIn2.blockingMapWrite(queue).asFloatBuffer();
// Fill the mapped input buffers with data
for (int i = 0; i < dataSize; i++) {
	a.put(i, i);
	b.put(i, i);
}

/// Unmap input buffers
memIn1.unmap(queue, a);
memIn2.unmap(queue, b);

// Ask for execution of the kernel with global size = dataSize
//   and workgroup size = 1
kernel.enqueueNDRange(queue, new int[]{dataSize}, new int[]{1});

// Wait for all operations to be performed
queue.finish();

// Copy the OpenCL-hosted array back to RAM
FloatBuffer output = NIOUtils.directFloats(dataSize);
memOut.read(output, queue, true);

// Compute absolute and relative average errors wrt Java implem
double totalAbsoluteError = 0, totalRelativeError = 0;
for (int i = 0; i < dataSize; i++) {
    float expected = i * (float)Math.sin(i) + 1;
    float result = output.get(i);

    double d = result - expected;
    if (expected != 0)
        totalRelativeError += Math.abs(d / expected); // use the magnitude so errors don't cancel out

    totalAbsoluteError += Math.abs(d);
}
double avgAbsoluteError = totalAbsoluteError / dataSize;
double avgRelativeError = totalRelativeError / dataSize;
System.out.println("Average absolute error = " + avgAbsoluteError);
System.out.println("Average relative error = " + avgRelativeError);

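For reference, the `aSinB` kernel above is just the parallel version of a plain sequential loop. The sketch below (using ordinary `float` arrays instead of OpenCL buffers; the class and method names are mine, not part of OpenCL4Java) computes the same `a[i] * sin(b[i]) + 1` that the error-checking code compares against:

```java
public class ASinBReference {
    // Sequential Java equivalent of the aSinB OpenCL kernel:
    // output[i] = a[i] * sin(b[i]) + 1
    static float[] aSinB(float[] a, float[] b) {
        float[] output = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            output[i] = a[i] * (float) Math.sin(b[i]) + 1;
        }
        return output;
    }

    public static void main(String[] args) {
        float[] a = {0, 1, 2}, b = {0, 1, 2};
        float[] out = aSinB(a, b);
        for (float v : out)
            System.out.println(v);
    }
}
```

The OpenCL version differs only in that each loop iteration becomes one work-item, identified by `get_global_id(0)`, so the iterations can run concurrently across GPU cores.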
This is reasonably easy to read, but it is possible to make OpenCL even more approachable for casual programmers who don’t want to be bothered with learning its syntax. For that, please wait for my next post: ScalaCL: Reap OpenCL’s benefits without learning its syntax (Scala DSL for transparently parallel computations) (edited to match actual title)

In the meantime, you can find all the sources and binaries to play with on OpenCL4Java’s homepage.

This has only been tested on Mac OS X, but you might succeed in adapting it to Windows using the ATI Stream 2 beta or NVidia’s early-access OpenCL drivers (please report any success in the comments).

As usual, comments are highly welcome 🙂

Edit (Sept 30th 2009): Updated code snippets to match recent changes in the API.