Write your first OpenCL Discrete Fourier Transform with JavaCL in 15 minutes

This article is intended for OpenCL beginners who want to use Java to tap into the vast power of their GPUs.

It will show you how to write a simple Discrete Fourier Transform (DFT), walking you through the various steps needed (install of an OpenCL implementation, writing of OpenCL and Java / JavaCL hosting code, compilation) and will show you an unique feature of the JavaCL library : the JavaCL Generator.

You can type the example developed here by yourself, or simply download JavaCLTutorial.zip, unpack it and just run Maven (or open the project with any Maven-aware IDE, such as Netbeans, IntelliJ IDEA Community Edition or Eclipse+Maven).

Prequisites

First, you’ll need to make sure you have a working OpenCL implementation on your computer.

It’s okay if you’re on MacOS X 10.6 Snow Leopard, otherwise JavaCL’s Get / Install page will guide you through this.

To build the project (available for download, you’ll probably be better off installing Maven.

Discrete Fourier Transform

The Discrete Fourier Transforms (DFTs) are extremely useful as they reveal periodicities in input signals as well as the relative strengths of any periodic components. In addition, the complex modulus of a properly scaled DFT is commonly known as the power spectrum of the input data.

Think of a DFT as the slow and easy twin of the Fast Fourier Transform (FFT). Both compute the same transform, but the former is dumb-easy while the latter is wicked-fast ($latex O(n * log(n)$) vs. $latex O(n^2)$).

You can skip the rest of this paragraph without harm if you don’t care about the maths behind this 🙂

Considering a sequence of $latex {N}$ complex numbers denoted by $latex {x_0,\dots, x_{N-1}}$, its dual representation in the Fourier orthonormal basis (exponentials) $latex {X_0,\dots, X_{N-1}}$ is performed using the following Discrete Fourier Transform formula

$latex \displaystyle X_k = \sum_{n=0}^{N-1} x_n e^{-2i\pi\frac{kn}{N}} \ \ \ k=0,\dots, N-1 \ \ \ \ \ (1)$

where $latex {i}$ is the imaginary unit. Then, the Inverse Discrete Fourier Transform (IDFT) is given by

$latex \displaystyle x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{2i\pi\frac{kn}{N}} \ \ \ k=0,\dots, N-1 \ \ \ \ \ (2)$

These formulations aren’t unique but remain the most widespread.

In particular, if $latex {x_n}$ is a real signal, then $latex {X_k}$ and $latex {X_{N-k}}$ are such that $latex {\bar{X}_k=X_{N-k}}$ where $latex {\bar{z}}$ denotes the complex conjugate. Therefore, the DFT output for real inputs is half redundant, and it is sufficient to analyze the signal by roughly looking at half of the DFT coefficients.

The original C code

The DFT is pretty straightforward to code in C :

void dft(
	const double *in, // complex values input (packed real and imaginary)
	double *out,      // complex values output
	int length,       // number of input and output values
	int sign)         // sign modifier in the exponential :
					 // 1 for forward transform, -1 for backward.
{
	for (int i = 0; i < length; i++)
	{
		// Initialize sum and inner arguments
		double totReal = 0, totImag = 0;
		double param = (-2 * sign * i) * M_PI / (double)length;
		
		for (int k = 0; k < length; k++) {
			double valueReal = in[k * 2], valueImag = in[k * 2 + 1];
			double arg = k * param;
			double c = cos(arg), sin(arg);
			
			totReal += valueReal * c - valueImag * s;
			totImag += valueReal * s + valueImag * c;
		}
		
		if (sign == 1) {
			// forward transform (space -> frequential)
			out[i * 2] = totReal;
			out[i * 2 + 1] = totImag;
		} else {
			// backward transform (frequential -> space)
			out[i * 2] = totReal / (double)length;
			out[i * 2 + 1] = totImag / (double)length;
		}
	}
}

The OpenCL code

OpenCL makes it easy to call the same function many times in parallel.

As our original DFT C code simply consists in a double loop, we can make the outer loop parallel and keep the inner loop as is. The following OpenCL function (kernel) corresponds to the outer loop body, we’ll have it called many times in parallel by the OpenCL runtime with a varying parameter (here retrieved with the get_global_id function and stored in the ‘i’ variable) :

// Enable double-precision floating point numbers support.
// Not all platforms / devices support this, so you may have to switch to floats.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void dft(
	__global const double2 *in, // complex values input
	__global double2 *out,      // complex values output
	int length,                 // number of input and output values
	int sign)                   // sign modifier in the exponential :
	                            // 1 for forward transform, -1 for backward.
{
	// Get the varying parameter of the parallel execution :
	int i = get_global_id(0);
	
	// In case we're executed "too much", check bounds :
	if (i >= length)
		return;
	
	// Initialize sum and inner arguments
	double2 tot = 0;
	double param = (-2 * sign * i) * M_PI / (double)length;
	
	for (int k = 0; k < length; k++) {
		double2 value = in[k];
		
		// Compute sin and cos in a single call : 
		double c;
		double s = sincos(k * param, &c);
		
		// This adds (value.x * c - value.y * s, value.x * s + value.y * c) to the sum :
		tot += (double2)(
			dot(value, (double2)(c, -s)), 
			dot(value, (double2)(s, c))
		);
	}
	
	if (sign == 1) {
		// forward transform (space -> frequential)
		out[i] = tot;
	} else {
		// backward transform (frequential -> space)
		out[i] = tot / (double)length;
	}
}

Here are a few remarks concerning this code :

OpenCL has vector data types, such as the double2 that we’ve used. It supports efficient vector-vector and vector-scalar operations (+, -, *, /…), with a few extras for vector-vector operations (dot product…).
We’re using the sincos function, which is presumably faster than using sin and cos separately
Double-precision numbers are not supported by default in OpenCL (some GPUs just don’t support them), so we have to enable the OpenCL double extension at the first line with the #pragma OpenCL mechanism. The exercice of using floats instead (so that it works on all devices) is left to the reader (you just need to dumb-replace ‘double’ by ‘float’ and ‘Double’ by ‘Float’ ;-))

The JavaCL host code

The previous OpenCL source code needs to be presented to the OpenCL API so that it can compile it and call it with the correct parameters.

In general, the job of a OpenCL host program (which can be written in C / C++, Java or Python) is to :

select some devices (CPUs and / or GPUs) to create an OpenCL context
create input and output buffers / images bound to that context
provide the source code of some OpenCL programs (also called kernels) and have it compiled / linked by OpenCL
bind the arguments of some OpenCL kernels to values (primitives or OpenCL buffers)
ask for execution of these kernels in single-shot mode (called task) or in parallel mode (called N-dimensional range). The parallel mode is of course our primary target here 🙂
do something with the results (maybe feed them to some other kernels, or simply read them back)

An important feature of OpenCL is its deferred execution model : most operations are asynchronous, yielding a completion event which can be waited for (in a blocking way) or given as a reference to other asynchronous tasks, so that they can wait for the initial task to complete before proceeding. As a result, almost every OpenCL operation must be bound to a command queue (as of this writing, most command queues execute commands sequentially or in order, but this will hopefully change over time !).

OpenCL can be summarized as a cross-platform build and run infrastructure that uses reflection-like APIs and is focused on asynchronous and parallel programs.

Now that you’ve got a general idea of what OpenCL is about, here’s the JavaCL host code that corresponds to the OpenCL source code seen previously :

package tutorial;

import com.nativelibs4java.opencl.*;
import com.nativelibs4java.opencl.CLPlatform.DeviceFeature;
import com.nativelibs4java.util.*;
import java.io.IOException;
import java.nio.DoubleBuffer;

public class DFT {

    final CLQueue queue;
    final CLContext context;
    final CLProgram program;
    final CLKernel kernel;

    public DFT(CLQueue queue) throws IOException, CLBuildException {
        this.queue = queue;
        this.context = queue.getContext();

        String source = IOUtils.readText(DFT.class.getResource("DiscreteFourierTransformProgram.cl"));
        program = context.createProgram(source);
        kernel = program.createKernel("dft");
    }

    /**
     * Method that takes complex values in input (sequence of pairs of real and imaginary values) and 
     * returns the Discrete Fourier Transform of these values if forward == true or the inverse
     * transform if forward == false.
     */
    public synchronized DoubleBuffer dft(DoubleBuffer in, boolean forward) {
        assert in.capacity() % 2 == 0;
        int length = in.capacity() / 2;

        // Create an input CLBuffer that will be a copy of the NIO buffer :
        CLDoubleBuffer inBuf = context.createDoubleBuffer(CLMem.Usage.Input, in, true); // true = copy
        
        // Create an output CLBuffer :
        CLDoubleBuffer outBuf = context.createDoubleBuffer(CLMem.Usage.Output, length * 2);

        // Set the args of the kernel :
        kernel.setArgs(inBuf, outBuf, length, forward ? 1 : -1);
        
        // Ask for `length` parallel executions of the kernel in 1 dimension :
        CLEvent dftEvt = kernel.enqueueNDRange(queue, new int[]{ length });

        // Return an NIO buffer read from the output CLBuffer :
        return outBuf.read(queue, dftEvt);
    }

    /// Wrapper method that takes and returns double arrays
    public double[] dft(double[] complexValues, boolean forward) {
        DoubleBuffer outBuffer = dft(DoubleBuffer.wrap(complexValues), forward);
        double[] out = new double[complexValues.length];
        outBuffer.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException, CLBuildException {
    		// Create a context with the best double numbers support possible :
    		// (try using DeviceFeature.GPU, DeviceFeature.CPU...)
        CLContext context = JavaCL.createBestContext(DeviceFeature.DoubleSupport);
        
        // Create a command queue, if possible able to execute multiple jobs in parallel
        // (out-of-order queues will still respect the CLEvent chaining)
        CLQueue queue = context.createDefaultOutOfOrderQueueIfPossible();

        DFT dft = new DFT(queue);
        //DFT2 dft = new DFT2(queue);

        // Create some fake test data :
        double[] in = createTestDoubleData();

        // Transform the data (spatial -> frequency transform) :
        double[] transformed = dft.dft(in, true);
        
        for (int i = 0; i < transformed.length / 2; i++) {
            // Print the transformed complex values (real + i * imaginary)
            System.out.println(transformed[i * 2] + "\t + \ti * " + transformed[i * 2 + 1]);
        }
        
        // Reverse-transform the transformed data (frequency -> spatial transform) :
        double[] backTransformed = dft.dft(transformed, false);

        // Check the transform + inverse transform give the original data back :
        double precision = 1e-5;
        for (int i = 0; i < in.length; i++) {
            if (Math.abs(in[i] - backTransformed[i]) > precision)
                throw new RuntimeException("Different values in back-transformed array than in original array !");
        }
    }

    static double[] createTestDoubleData() {
        int n = 32;
        double[] in = new double[2 * n];

        for (int i = 0; i < n; i++) {
            in[i * 2] = 1 / (double) (i + 1);
            in[i * 2 + 1] = 0;
        }
        return in;
    }
}

Compiling and running

Assuming the OpenCL and Java source files are in the same directory as javacl-1.0-beta-6-shaded.jar, you can build your code with the following command :

javac -cp javacl-1.0-beta-6-shaded.jar tutorial/DFT.java

And run it with that one (replace the ‘;’ by a ‘:’ on Unix systems) :

java -cp javacl-1.0-beta-6-shaded.jar;. tutorial.DFT

Alternatively, you can use the following Maven pom.xml :

<project xmlns="http://maven.apache.org/POM/4.0.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
 <modelVersion>4.0.0</modelVersion>
  <groupId>com.nativelibs4java</groupId>
  <artifactId>javacl-tutorial</artifactId>
  <name>JavaCL Tutorial</name>
  <url>http://code.google.com/p/javacl/</url>
  <version>1.0-beta-6</version>
  <packaging>jar</packaging>
    
  <properties>
  	<scala.version>2.8.1</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>nativelibs4java</id>
      <name>nativelibs4java Maven2 Repository</name>
      <url>http://nativelibs4java.sourceforge.net/maven</url>
    </repository>
  </repositories>
  
  <dependencies>
	<dependency>
	  <groupId>com.nativelibs4java</groupId>
	  <artifactId>javacl</artifactId>
	  <version>1.0-beta-6</version>
	  <scope>compile</scope>
	</dependency>
	<!--dependency>
		<groupId>org.scala-lang</groupId>
		<artifactId>scala-library</artifactId>
		<version>${scala.version}</version>
	</dependency-->
  </dependencies>

  <build>
  	<plugins>
		<plugin>
			<groupId>com.nativelibs4java</groupId>
			<artifactId>javacl-generator</artifactId>
			<version>1.0-beta-6</version>
			<!--configuration>
				<javaOutputDirectory>${project.build.directory}/../src/main/java</javaOutputDirectory>
			</configuration-->
			<executions>
			  <execution>
				<phase>generate-sources</phase>
				<goals>
				  <goal>compile</goal>
				</goals>
			  </execution>
			</executions>
		</plugin>
		<plugin>
			<groupId>org.apache.maven.plugins</groupId>
			<artifactId>maven-compiler-plugin</artifactId>
			<version>2.3.1</version>
			<configuration>
				<source>1.6</source>
				<target>1.6</target>
			</configuration>
		</plugin>
		<!--plugin>
			<groupId>org.scala-tools</groupId>
			<artifactId>maven-scala-plugin</artifactId>
			<executions>
				<execution>
					<goals>
						<goal>compile</goal>
						<goal>testCompile</goal>
					</goals>
				</execution>
			</executions>
		</plugin-->
		<plugin>
			<groupId>org.apache.maven.plugins</groupId>
			<artifactId>maven-shade-plugin</artifactId>
			<version>1.3.3</version>
			<executions>
			  <execution>
				<phase>package</phase>
				<goals>
				  <goal>shade</goal>
				</goals>
				<configuration>
					  <filters>
						<filter>
						  <artifact>*:*</artifact>
						  <excludes>
							<exclude>META-INF/*.SF</exclude>
							<exclude>META-INF/*.DSA</exclude>
							<exclude>META-INF/*.RSA</exclude>
							<exclude>META-INF/maven/**</exclude>
						  </excludes>
						</filter>
					</filters>
				</configuration>
			  </execution>
			</executions>
		</plugin>
	</plugins>
  </build>
</project>

Of course, it will be easier to just download the .zip of this tutorial and run Maven in its root directory (if it’s the first time you use Maven, go have a coffee while it downloads the myriad of dependencies it needs to build stuff).

mvn package
  
java -cp target/javacl-tutorial-1.0-beta-6-shaded.jar tutorial.DFT

Maven Magic : the JavaCL Generator

To any Java programmer, setting the OpenCL kernel arguments with untyped interfaces must seem a bit odd. If you forget an argument or give an argument of the wrong type, you’ll only discover it during you program’s execution, not at compile time.

To avoid such issues, JavaCL has a very nice toy : the JavaCL Generator.

The JavaCL Generator simply parses your OpenCL headers and outputs Java code with one typed method per kernel, with the correct Javadoc and argument names (seems familiar ? it’s just a derivative of JNAerator !)

Given the previous OpenCL kernel code, the JavaCL Generator creates the following Java code :

package tutorial;
import com.nativelibs4java.opencl.*;
import java.io.IOException;

/// Auto-generated wrapper around the OpenCL program DiscreteFourierTransformProgram.cl
public class DiscreteFourierTransformProgram extends CLAbstractUserProgram {
	public DiscreteFourierTransformProgram(CLContext context) throws IOException {
		super(context, readRawSourceForClass(DiscreteFourierTransformProgram.class));
	}
	public DiscreteFourierTransformProgram(CLProgram program) throws IOException {
		super(program, readRawSourceForClass(DiscreteFourierTransformProgram.class));
	}
	CLKernel dft_kernel;
	public synchronized CLEvent dft(CLQueue commandQueue, CLDoubleBuffer in, CLDoubleBuffer out, int length, int sign, int globalWorkSizes[], int localWorkSizes[], CLEvent... eventsToWaitFor) throws CLBuildException {
		if (dft_kernel == null) 
			dft_kernel = createKernel("dft");
		dft_kernel.setArgs(in, out, length, sign);
		return dft_kernel.enqueueNDRange(commandQueue, globalWorkSizes, localWorkSizes, eventsToWaitFor);
	}
}

Which lets us rewrite the DFT class as follows, in a type-safe way (arguments count and types are checked by the compiler) :

package tutorial;

import com.nativelibs4java.opencl.*;
import java.io.IOException;
import java.nio.DoubleBuffer;

public class DFT2 {

    final CLQueue queue;
    final CLContext context;
    final DiscreteFourierTransformProgram program;

    public DFT2(CLQueue queue) throws IOException, CLBuildException {
        this.queue = queue;
        this.context = queue.getContext();
        this.program = new DiscreteFourierTransformProgram(context);
    }

    public synchronized DoubleBuffer dft(DoubleBuffer in, boolean forward) throws CLBuildException {
        assert in.capacity() % 2 == 0;
        int length = in.capacity() / 2;

        CLDoubleBuffer inBuf = context.createDoubleBuffer(CLMem.Usage.Input, in, true); // true = copy
        CLDoubleBuffer outBuf = context.createDoubleBuffer(CLMem.Usage.Output, length * 2);

        // The following call is type-safe, thanks to the JavaCL Maven generator :
        // (if the OpenCL function signature changes, the generated Java definition will be updated and compilation will fail) 
        CLEvent dftEvt = program.dft(queue, inBuf, outBuf, length, forward ? 1 : -1, new int[]{length}, null);
        return outBuf.read(queue, dftEvt);
    }
    
	public double[] dft(double[] complexValues, boolean forward) throws CLBuildException {
        DoubleBuffer outBuffer = dft(DoubleBuffer.wrap(complexValues), forward);
        double[] out = new double[complexValues.length];
        outBuffer.get(out);
        return out;
    }
 }

To make this work, you currently need to use Maven with the pom.xml file shown above. The JavaCL Generator will convert each .cl file in src/main/opencl into a wrapper class in target/generated-sources/java (and will do the same for test OpenCL files : src/test/opencl/…/*.cl -> target/generated-test-sources/java/…/*.java). The opencl directory will be added to the classpath so that resources are resolved and packaged by Maven correctly 🙂

Going further…

JavaCL has a few unique features that make it the best choice amongst OpenCL bindings :

automatic and transparent caching of program binaries (to skip compilation times !)
kernels can include files from the Java classpath (!)
Maven generator that provides safe Java interfaces for your OpenCL kernels (as demonstrated in this article)
goodies such as parallel random number generator, reduction utilities…
image kernel transform editor
Scala library with parallel collections and automatic translation from Scala to OpenCL (!)

Go try JavaCL’s demos, read its wiki and join its active user community.