In a hurry? Skip this lengthy post and go straight to Scalaxy/Loops!

Update: edited description of what “-inline” does in the last-but-one section.

Recap on an interesting performance issue

A couple of years back, I created a startup with a friend and got to write all our code in Scala, which included some Monte Carlo simulations for logistics planning optimization.

I was horrified to see that my seemingly barebones code didn’t run nearly as fast as I expected:

for (iteration <- 0 until iterations) {
  for (i <- 0 until n) {
    ...
  }
}

The reason this code is much slower than the equivalent Java for loop is that it is desugared as…

(0 until iterations).foreach(iteration => {
  (0 until n).foreach(i => {
    ...
  })
})

… which creates a couple of objects at runtime and performs a truckload of method calls on them: each of these closures is an object, so the outer closure is created once, the inner closure is created once per outer iteration, and on top of that there is the cost of calling each closure on every inner or outer loop iteration.
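To make that cost concrete, here is a hedged sketch of roughly what the runtime ends up seeing, with the closures pulled out as explicit function values (the object name `DesugaredSketch` and the summation body are mine for illustration; the actual code the compiler generates differs in details):

```scala
// Illustrative sketch: the closures from the desugared for-comprehension,
// written out as explicit function values.
object DesugaredSketch {
  def run(iterations: Int, n: Int): Long = {
    var sum = 0L
    // The outer closure: a Function1 object, created once.
    val outerBody: Int => Unit = (iteration: Int) => {
      // The inner closure: a fresh Function1 object per outer iteration,
      // because it captures `iteration`.
      val innerBody: Int => Unit = (i: Int) => {
        sum += iteration.toLong * i
      }
      (0 until n).foreach(innerBody)
    }
    (0 until iterations).foreach(outerBody)
    sum
  }
}
```

So for `iterations` outer iterations you get `iterations + 1` closure allocations plus `iterations * (n + 1)` megamorphic `apply` calls, versus zero allocations for a pair of plain while loops.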

When I discovered this, I found myself starting to rewrite my code using while loops, like this:

{
  var iteration = 0
  while (iteration < iterations) {
    var i = 0
    while (i < n) {
      ...
      i += 1
    }
    iteration += 1
  }
}

But then my code became harder to read (I actually had nested loops on arrays, which looks even worse than ranges), and I wondered why I was using Scala in the first place. Something had to be done.

Update: also see this great post on Loop Performance and Local Variables in Scala by Mark Lewis.

A first solution: compiler plugins (and ScalaCL was born!)

When I decided I couldn’t keep writing while loops by hand, I quite naturally turned to Scala compiler plugins.

Steep learning curve, quite a few rough edges, but I soon released an optimizing compiler plugin that covered all of my needs: Range, Array, List with foreach, map, reduce, fold, scan… you name it!

Despite my repeated red warnings that this was experimental-ware, people started using ScalaCL for production purposes. I presented ScalaCL at Scalathon 2011 and Devoxx France 2012, and the sky was blue. (Oh, and it was also capable of running Scala on GPUs in a very experimental way, which may have confused a few people…)

During the last months when I had time to work on that project, I started optimizing chains of collection calls, ending up with quite a complex piece of engineering that I never quite stabilized (version 0.3 was almost ready… and now it’s quite stale 🙁 ).

The catch with ScalaCL was:

  • Compiler plugins are a pain to write, and debug.
  • The internal APIs can be broken at every new release, making maintenance quite risky (and since I already had a few open-source projects to maintain, I had hardly enough hobby time left: JNAerator, BridJ and JavaCL).
  • Using the compiler plugin requires some special arguments for the compiler, and this scares some people off.

A new approach: macros

Now Scala 2.10 ships a shiny new macros API that exposes the compiler’s internals in a clean and (hopefully) future-proof way (although it’s still an experimental feature), and it requires no compiler-plugin setup at all.

Moreover, it brings an extremely powerful “reification” mechanism which allows for painless AST construction from macros.

Playing around with Scalaxy/Beans and Scalaxy/Fx last week just wasn’t enough: I had to restart my work on loops with macros:

Scalaxy/Loops

This concludes a year of catching up with Scala 2.10 milestones, with various rewrites of ScalaCL and Scalaxy (+ an experiment on Compilets). The scope is deliberately smaller than before: Scalaxy/Loops optimizes simple loops (and does it well), period.

How to make your loops faster in 30 seconds

If you’ve started using Scala 2.10 and use sbt, you’re in luck: just edit your build.sbt file as follows:

// Only works with 2.10.0+
scalaVersion := "2.10.0"

// Dependency at compilation-time only (not at runtime).
libraryDependencies += "com.nativelibs4java" %% "scalaxy-loops" % "0.3-SNAPSHOT" % "provided"

// Scalaxy/Loops snapshots are published on the Sonatype repository.
resolvers += Resolver.sonatypeRepo("snapshots")

Now you just need to import scalaxy.loops._ and append optimized to all the ranges whose foreach loops you want to optimize:

import scalaxy.loops._

for (iteration <- 0 until iterations optimized) {
  for (i <- 0 until n optimized) {
    ...
  }
}

That’s it!

(so far it’s limited to Range.foreach, but Range.map and Array.map / .foreach might follow soon)

Cool, but (why) is it (still) faster?

Oh yeah, it’s faster!

I’m uncomfortable announcing speedups because they depend on your JVM, your code, etc. But see for yourself (expect 3x+ speedups for tight loops):

Scalaxy/Loops naive Benchmark
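If you want to eyeball the difference on your own JVM, here is a hedged micro-benchmark sketch along the same lines (illustrative only: no proper JIT warmup or statistical rigor, so treat the numbers as rough; the object and method names are mine, not from the benchmark linked above):

```scala
// Naive timing of a Range-foreach loop vs. a hand-written while loop.
// Both compute the same sum; only the loop mechanics differ.
object NaiveLoopBench {
  def sumForeach(n: Int): Long = {
    var sum = 0L
    for (i <- 0 until n) sum += i  // desugars to Range.foreach(closure)
    sum
  }

  def sumWhile(n: Int): Long = {
    var sum = 0L
    var i = 0
    while (i < n) { sum += i; i += 1 }  // no closures, no Range calls
    sum
  }

  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-8s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val n = 10000000
    val a = timed("foreach")(sumForeach(n))
    val b = timed("while")(sumWhile(n))
    assert(a == b)  // same result, (very) different timings
  }
}
```

Run it a few times in a row in the same JVM before trusting any ratio, and expect the gap to shrink or grow depending on what the JIT manages to do with the closure call.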

Now for the “why”, it’s a bit more complicated. My understanding is that the Scala team didn’t want to invest resources into such specific optimizations and instead chose to focus on more general ones, like inlining.

Looking at code compiled with “-optimise -Yinline -Yclosure-elim” in 2.10, you can see that Range’s foreach method is indeed inlined… but not its closure (and calling that closure is what takes all the time 🙁 ).

Update: there’s some amazing work on a new Scalac optimizer going on, hope it magically solves all our issues in a future version!

Anyway, hope it’s useful to some people, and stay tuned for a stable release soon.

Keep in touch

Whether you file bugs in Scalaxy’s tracker, join the NativeLibs4Java mailing list, or ping me on Twitter (@ochafik), any feedback or help will be greatly appreciated!