An application I’m writing requires a lot of linear interpolation of a function. There’s so much of it, in fact, that it was the biggest single contributor to the running time. In this article I show how I used Template Haskell to make it much faster and cut the overall running time of the application by about a third.
To be clear about the problem, we have a list of points (which I’ll call the “table”) like this:
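The original table isn't reproduced here, so the sketch below is a hypothetical stand-in: only a couple of the pairs are taken from values quoted in this article, and the rest are made up purely for illustration.

```haskell
-- A hypothetical stand-in for the article's table: increasing x values
-- paired with f(x).  Only a few of these values are quoted in the text;
-- the rest are invented for illustration.
table :: [(Double, Double)]
table =
  [ (0.0,  0.0)
  , (0.5,  0.48)
  , (1.0,  0.84)
  , (1.5,  1.0)
  , (2.0,  0.91)
  , (3.0,  0.14)
  , (4.0, -0.76)
  , (5.0, -0.96)
  , (6.0, -0.54)
  ]
```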
These points are input-output pairs for some function
f, and we want
to define the points in the middle by joining the dots with straight
lines. For example, since
f(0.5) = 0.48 and
f(1.0) = 0.84, we want
f(0.75) = 0.66 because it’s halfway between those two points.
So, for a given input value
x, we have to find the two neighbouring points
(a,av) and
(b,bv) such that
x is between
a and
b, and the result should
be calculated like this:

\[ f(x) = av + (x - a) \cdot \frac{bv - av}{b - a} \]
All we need to do is find those two neighbouring points. For any
values that are outside of the range of the table, we just use the
nearest value; e.g. for the sample table above,
f(100) = -0.54.
A lot of the time, you might write a simple linear traversal of the table like this and call it a day:
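The article's actual code isn't shown here, but a direct-recursion traversal might look something like this (a sketch; `linearLookup` and its helper are my own names):

```haskell
-- Hypothetical sketch of a simple linear traversal of the table.
linearLookup :: [(Double, Double)] -> Double -> Double
linearLookup [] _ = error "empty table"
linearLookup ((a, av) : rest) x
  | x <= a    = av              -- before the table: clamp to the first value
  | otherwise = go a av rest
  where
    go _ lv []                 = lv   -- past the table: clamp to the last value
    go l lv ((b, bv) : more)
      | x <= b    = lv + (x - l) * (bv - lv) / (b - l)
      | otherwise = go b bv more
```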
When I was writing the application in question, I was struck by the feeling that this was not “Haskelly” enough – there’s direct recursion, and it doesn’t even have any monoids. So I wrote this instead:
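The original code isn't shown here; reconstructing from the description, a `First`-monoid version might have looked roughly like this (names and details are my guesses):

```haskell
import Data.Maybe (fromMaybe)
import Data.Monoid (First (..))

-- Hypothetical reconstruction of a First-monoid lookup over adjacent
-- pairs of table entries.
monoidLookup :: [(Double, Double)] -> Double -> Double
monoidLookup table x =
    fromMaybe (snd (last table))            -- past the end: clamp
      . getFirst
      . foldMap (First . f)
      $ zip table (tail table)
  where
    -- The "selection" part: does this pair of neighbours contain x?
    f ((a, av), (b, bv))
      | x <= a    = Just av                 -- before the table: clamp
      | x <= b    = Just (av + (x - a) * (bv - av) / (b - a))
      | otherwise = Nothing
```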
It’s the same linear-traversal strategy, but written so that it’s harder to understand. (It occurred to me in writing this article that it might be clearer to use a fold directly instead of the First monoid, but I didn’t try this.) The only thing you could say in its defense is that the “selection” part (called “f” there) is separated from the rest of the algorithm.
A linear traversal is simple, but has the drawback that to get to
values near the end of the table we have to traverse the whole table.
Profiling my application suggested that this monoidal lookup was
spectacularly slow. Maybe we can use a Map instead:
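A sketch of what a `Data.Map`-based lookup might look like, using `lookupLE` and `lookupGE` to find the two neighbouring entries (a reconstruction; `mapLookup`'s internals here are my guesses):

```haskell
import qualified Data.Map as M

-- Hypothetical sketch of a Map-based interpolating lookup.
mapLookup :: M.Map Double Double -> Double -> Double
mapLookup m x =
  case (M.lookupLE x m, M.lookupGE x m) of
    (Just (a, av), Just (b, bv))
      | a == b    -> av                               -- exact hit
      | otherwise -> av + (x - a) * (bv - av) / (b - a)
    (Just (_, av), Nothing) -> av                     -- past the end: clamp
    (Nothing, Just (_, bv)) -> bv                     -- before the start: clamp
    (Nothing, Nothing)      -> error "empty table"
```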
The first argument to
mapLookup is just the table converted to a Map with
Data.Map.fromList. This approach reduces the
worst case from \(O(N)\) to \(O(\log N)\), but hurts the best case and
introduces some overhead in general. It also seems a bit inefficient
to do two lookups (the
lookupLE and the
lookupGE) when they are both
really looking for the same spot in the tree.
In my application, I used this
mapLookup function for a long time,
until improvements in other parts of the program meant that
interpolation was again the bottleneck.
My next approach was to encode the lookup as a series of
comparisons, essentially the same as how a
Map would work
internally. The table is encoded as a tree, where each entry in the
table becomes a node with two children, or a leaf.
The
makeTree function turns a table into a tree, evenly distributing
the table entries to the left and right. Doing the lookup is similar
to a standard binary search, with the addition of keeping track of the
most recent times we took left and right branches.
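In code, a tree along the lines described above might be sketched like this (a variant reconstruction, not the article's exact definitions; here a `Tip` marks a missing child rather than using separate leaf entries):

```haskell
-- Hypothetical sketch: each table entry lives in a Node.
data Tree = Tip | Node Tree Double Double Tree

-- Split the table evenly between the left and right subtrees.
makeTree :: [(Double, Double)] -> Tree
makeTree [] = Tip
makeTree ps = Node (makeTree ls) b bv (makeTree rs)
  where (ls, (b, bv) : rs) = splitAt (length ps `div` 2) ps

-- Binary search that remembers the nearest entries below (lo) and
-- above (hi) the input seen so far; at the bottom, interpolate.
treeLookup :: Tree -> Double -> Double
treeLookup t x = go Nothing Nothing t
  where
    go lo hi (Node l b bv r)
      | x <= b    = go lo (Just (b, bv)) l
      | otherwise = go (Just (b, bv)) hi r
    go lo hi Tip = case (lo, hi) of
      (Just (a, av), Just (b, bv)) -> av + (x - a) * (bv - av) / (b - a)
      (Just (_, av), Nothing)      -> av   -- past the end: clamp
      (Nothing, Just (_, bv))      -> bv   -- before the start: clamp
      (Nothing, Nothing)           -> error "empty table"
```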
The point of the
treeLookup function above is that we can easily
“unroll” the loop, turning it into a series of nested comparisons.
For example, for the simple table
[(1,1), (2,4), (3,9)], we’d like
to get as a result a function like this:
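Written by hand to show the shape (the generated code would differ in detail):

```haskell
-- What the unrolled lookup for [(1,1), (2,4), (3,9)] might look like:
-- nothing but nested comparisons and inlined arithmetic.
unrolled :: Double -> Double
unrolled x =
  if x <= 2
    then if x <= 1
           then 1
           else 1 + (x - 1) * (4 - 1) / (2 - 1)
    else if x <= 3
           then 4 + (x - 2) * (9 - 4) / (3 - 2)
           else 9
```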
The nice thing is that via Template Haskell we can do this with only
some small modification of the
treeLookup function above. The
structure of the function stays essentially the same; the main work is
in annotating which parts should be evaluated statically, and which
parts are in the resulting “unrolled” function.
One hiccup is that Template Haskell won’t automatically lift a
Double value to an expression — apparently, only integers and
rationals are allowed. As a consequence I used the little function
ld to turn a
Double into a rational literal.
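Putting the pieces together, the compile-time unrolling could be sketched as follows. This is my reconstruction, not the article's code: it uses a tree type along the lines described above, and the names are mine. Walking the tree happens entirely at compile time; only the comparisons and arithmetic survive into the generated expression.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH

-- Hypothetical tree, as described in the text.
data Tree = Tip | Node Tree Double Double Tree

makeTree :: [(Double, Double)] -> Tree
makeTree [] = Tip
makeTree ps = Node (makeTree ls) b bv (makeTree rs)
  where (ls, (b, bv) : rs) = splitAt (length ps `div` 2) ps

-- TH will not lift a Double directly, so embed it as a rational literal.
ld :: Double -> Q Exp
ld d = return (LitE (RationalL (toRational d)))

-- Turn a table into a lambda made of nested comparisons.
unrollTree :: [(Double, Double)] -> Q Exp
unrollTree table = do
  x <- newName "x"
  body <- go x Nothing Nothing (makeTree table)
  return (LamE [VarP x] body)
  where
    go x lo hi (Node l b bv r) =
      [| if $(varE x) <= $(ld b)
           then $(go x lo (Just (b, bv)) l)
           else $(go x (Just (b, bv)) hi r) |]
    go x lo hi Tip = case (lo, hi) of
      (Just (a, av), Just (b, bv)) ->
        -- The slope is precomputed at compile time.
        [| $(ld av) + ($(varE x) - $(ld a)) * $(ld ((bv - av) / (b - a))) |]
      (Just (_, av), Nothing) -> ld av   -- past the end: clamp
      (Nothing, Just (_, bv)) -> ld bv   -- before the start: clamp
      (Nothing, Nothing)      -> fail "empty table"
```

Because of the stage restriction, the splice `$(unrollTree someTable) :: Double -> Double` has to live in a different module from these definitions.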
You can also unroll the linear lookup version:
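Again a hypothetical sketch rather than the article's code: the generated expression is one left-to-right chain of comparisons instead of a balanced tree.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH

-- As before, embed a Double as a rational literal.
ld :: Double -> Q Exp
ld d = return (LitE (RationalL (toRational d)))

-- Unroll the linear traversal into a chain of comparisons.
unrollLinear :: [(Double, Double)] -> Q Exp
unrollLinear [] = fail "empty table"
unrollLinear table@((a0, av0) : _) = do
  x <- newName "x"
  body <- [| if $(varE x) <= $(ld a0)
               then $(ld av0)                  -- before the table: clamp
               else $(go x table) |]
  return (LamE [VarP x] body)
  where
    go _ [(_, bv)] = ld bv                     -- past the table: clamp
    go x ((a, av) : ps@((b, bv) : _)) =
      [| if $(varE x) <= $(ld b)
           then $(ld av) + ($(varE x) - $(ld a)) * $(ld ((bv - av) / (b - a)))
           else $(go x ps) |]
    go _ [] = fail "unreachable"
```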
We could debate which of these methods is the most elegant, but at least we can easily address the question of which one is the fastest.
(If you want to see the full source code to reproduce these results, or just for fun, see this repository on GitHub.)
I measured each method, looking up a value at the start, middle and end of the table; the table has 21 entries. Here are the results, courtesy of the criterion library:
(Click on the image above to get the full report and all the exciting details.) As we expected, linear lookups are fast when the target is at the front, but get slower as the target gets towards the end. Also, the binary lookups take about the same time no matter where the target is.
The monoidal lookup is impressively inefficient. The statically unrolled lookups are noticeably faster than their regular versions, so it’s a worthwhile transformation if the table is known at compile time.
For a bigger table (101 entries) the results are basically the same, but with the end-of-the-table-linear-lookup effect magnified. Note that the scale on the horizontal axis has changed.
Finally, I measured the effect that actually matters: how much faster
my application runs. Here is a short sample run using the original
lookup function, with some GHC and timing statistics:
2,743,929,608 bytes allocated in the heap
[...]
  MUT     time    2.55s  (  2.64s elapsed)
  GC      time    0.33s  (  0.26s elapsed)
[...]
  Total   time    2.88s  (  2.91s elapsed)
[...]
./SimpleMain newp02-bin +RTS -s  2.89s user 0.02s system 99% cpu 2.912 total
And using the statically unrolled tree lookup:
974,367,480 bytes allocated in the heap
[...]
  MUT     time    1.70s  (  1.72s elapsed)
  GC      time    0.20s  (  0.20s elapsed)
[...]
  Total   time    1.90s  (  1.92s elapsed)
[...]
./SimpleMain newp02-bin +RTS -s  1.90s user 0.02s system 99% cpu 1.922 total
There’s a reduction in allocation, but more importantly the running time of the program as a whole is reduced by about one third.
In the end this was a worthwhile exercise: I got a faster program and learnt about Template Haskell along the way.
Whether the same trick is useful in other places depends on the circumstances. The static compilation of the lookup was only useful because it is called so very often – maybe the algorithm could be changed at a higher level to avoid making so many lookups.