I recently saw the 1 billion row challenge (1brc) on GitHub. I had some fun going through it and trying different solutions guided by profiling and flame graphs.
Earlier this year, I was talking with someone who wanted the system I was working on to be more performant, even though we were still in the prototype / MVP phase. I pushed back a little with the observation, “If we optimize for performance now, our code will become less general and more specific. That change will make it more brittle and harder to change if we need to pivot the software design or architecture later.” The person I was talking to seemed to dismiss this idea, so I thought I’d use the 1brc to illustrate my point.
Just to be clear, what I’m talking about here is not the easy, low-hanging-fruit type of stuff. If we have N+1 queries and O(n²) loops slowing our program down, we should clean that up as we find it.
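To be concrete about what I mean by low-hanging fruit, here is a small, hypothetical sketch (the Order type, the helper names, and the data are all made up for illustration): replacing a List.contains call inside a loop with a HashSet lookup turns an O(n²) scan into an O(n) pass.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MembershipCheckExample {
    // Hypothetical types for illustration only.
    record Order(String customerId) {}

    static void process(Order order) { /* placeholder */ }

    static void processActiveOrders(List<Order> orders, List<String> activeCustomerIds) {
        // Before: activeCustomerIds.contains(...) scans the whole list for every order, O(n^2) overall.
        // After: build the lookup structure once; each contains() check is O(1).
        Set<String> active = new HashSet<>(activeCustomerIds);
        for (Order order : orders) {
            if (active.contains(order.customerId())) {
                process(order);
            }
        }
    }
}

That kind of cleanup doesn’t make the code any less general, so there’s no reason to defer it.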
Back to the 1brc. Here is a sample of what the input file looks like, in order to make the code snippets below a little easier to understand. Imagine this file has a billion lines, and you need to compute the min/mean/max values for each city:
Kinshasa;31.6
Bilbao;6.4
Tegucigalpa;30.7
Gabès;29.8
Astana;-2.0
Sochi;18.0
Vaduz;6.3
Kampala;24.4
Nuuk;-2.8
Singapore;27.3
Here is the original algorithm.
public static void main(String[] args) throws IOException {
    Collector<Measurement, MeasurementAggregator, ResultRow> collector = Collector.of(
            MeasurementAggregator::new,
            (a, m) -> {
                a.min = Math.min(a.min, m.value);
                a.max = Math.max(a.max, m.value);
                a.sum += m.value;
                a.count++;
            },
            (agg1, agg2) -> {
                var res = new MeasurementAggregator();
                res.min = Math.min(agg1.min, agg2.min);
                res.max = Math.max(agg1.max, agg2.max);
                res.sum = agg1.sum + agg2.sum;
                res.count = agg1.count + agg2.count;
                return res;
            },
            agg -> {
                return new ResultRow(agg.min, agg.sum / agg.count, agg.max);
            });

    Map<String, ResultRow> measurements =
            new TreeMap<>(Files.lines(Paths.get(FILE))
                    .map(l -> new Measurement(l.split(";")))
                    .collect(groupingBy(m -> m.station(), collector)));

    System.out.println(measurements);
}
I want to point out the following about the original algorithm:
- It takes advantage of high-level, abstract APIs like Files.lines and java.util.stream.Collector. This is productive for the programmer, and easy to change. We also don’t have to care whether the input file uses UNIX or Windows line endings; the high-level APIs take care of that for us.
- These high-level APIs are composable. We could swap out the current collector for a different implementation without disturbing much of the existing program.
- Parsing each line is very simple: just line.split(";"). If we needed to change the field delimiter, we could do so easily.
- I do not have to write tests for Files.lines or line.split used to split up the records, or Double.parseDouble used to parse the numbers. Those are well covered in Java’s test suite.
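For context, the main method above leans on a few small types that aren’t shown in the snippet. Roughly, they look like the sketch below; the exact definitions (and the output formatting in ResultRow) may differ from the original repository, and in the real baseline they are nested inside the same class, which is why the collector can reach fields like m.value directly.

record Measurement(String station, double value) {
    Measurement(String[] parts) {
        this(parts[0], Double.parseDouble(parts[1]));
    }
}

record ResultRow(double min, double mean, double max) {
    @Override
    public String toString() {
        // The challenge rounds each value to one decimal place; that detail is elided here.
        return min + "/" + mean + "/" + max;
    }
}

class MeasurementAggregator {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum;
    long count;
}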
Now, a middle ground between here and there would be to ask, “Can we improve the performance of this code, within its current design, and maintain our flexibility in the short term, while we figure out whether we’re building the right thing or not?”
Because we’re using Java’s Stream and Collector high-level abstractions, the answer is yes, we can get a quick win: we can process the stream in parallel. This small change improves our performance by about 70% on my development machine.
Map<String, ResultRow> measurements = new TreeMap<>(Files.lines(Paths.get(FILE))
        // Add this one line.
        .parallel()
        .map(l -> new Measurement(l.split(";")))
        .collect(groupingBy(m -> m.station(), collector)));
So, where does this rabbit hole lead? Well, let’s take a look at implementation #5 in the challenge (as of the time of writing), by Peter Lawrey. At no point am I criticizing Peter’s code here; I’m only using it to illustrate the general principle I’m trying to highlight. I enjoyed reading and learning from Peter’s solution. (Peter, if you’re reading: I can get your solution to run in 5.5 seconds on my Apple Silicon M2 Pro by adding JAVA_OPTS="-XX:+UseParallelGC -Xmx20G" to your program runner, calculate_average_lawrey.sh, which would move you up the list as of 2024-01-05. 🙂)
When running the original algorithm under a profiler, you notice that a significant portion of the runtime is being spent on:
- Reading the input file from disk.
- Parsing the data into objects in memory, e.g. Double.parseDouble(), String.split(), String.indexOf().
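You don’t need a profiler to build an intuition for why those methods dominate. A toy loop like the one below (the class name and iteration counts are made up; for real measurements you’d reach for a profiler or JMH) makes the per-line cost of String.split and Double.parseDouble visible, and hints at why a billion lines adds up.

import java.util.List;

public class ParseCostSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("Kinshasa;31.6", "Bilbao;6.4", "Astana;-2.0");
        long start = System.nanoTime();
        double checksum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            for (String line : lines) {
                String[] parts = line.split(";");          // allocates an array per line
                checksum += Double.parseDouble(parts[1]);  // general-purpose number parsing
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum=" + checksum + ", elapsed=" + elapsedMs + "ms");
    }
}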
Now let’s take a look at Peter’s implementation:
public static void main(String[] args) throws IOException {
    // Open the file for reading.
    File file = new File(FILE);
    long length = file.length();
    long chunk = 1 << 28; // Size of the chunk to be processed.
    RandomAccessFile raf = new RandomAccessFile(file, "r");

    // Process the file in chunks and merge the results.
    Map<Key, Measurement> allMeasurementsMap = LongStream.range(0, length / chunk + 1)
            .parallel()
            .mapToObj(i -> extractMeasurementsFromChunk(i, chunk, length, raf))
            .reduce((a, b) -> mergeMeasurementMaps(a, b))
            .orElseGet(Collections::emptyMap);

    // Sort the measurements and print them.
    Map<String, Measurement> sortedMeasurementsMap = new TreeMap<>();
    allMeasurementsMap.forEach((k, m) -> sortedMeasurementsMap.put(k.toString(), m));
    System.out.println(sortedMeasurementsMap);
}
Note we are no longer using the high-level Files.lines API, but instead using RandomAccessFile to access specific chunks within the file. We partition the file into chunks, create a thread pool to work on each chunk, and then merge the results into a final map. This uses a Map/Reduce style of processing.
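The merge step referenced above, mergeMeasurementMaps, isn’t shown in this post. A rough sketch of what it might look like follows; this is my own approximation rather than Peter’s code, and it assumes Measurement exposes min, max, sum, and count.

private static Map<Key, Measurement> mergeMeasurementMaps(Map<Key, Measurement> a, Map<Key, Measurement> b) {
    // Fold one chunk's per-station results into the other, combining the running statistics.
    b.forEach((key, m) -> a.merge(key, m, (m1, m2) -> {
        m1.min = Math.min(m1.min, m2.min);
        m1.max = Math.max(m1.max, m2.max);
        m1.sum += m2.sum;
        m1.count += m2.count;
        return m1;
    }));
    return a;
}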
Let’s take a look at extractMeasurementsFromChunk(). You don’t need to understand all of it, but I want to highlight the following:
- We now have to manage reading chunks from within the file manually. We must ensure that we are calculating our chunk offsets and lengths correctly. We also need to ensure a chunk does not start in the middle of a line.
- The file is being scanned and processed byte by byte. Higher-level APIs like Files.lines and line.split() are gone. We are taking ownership of parsing every byte of the file.
- We are now parsing numbers by hand within readTemperatureFromBuffer, instead of using Double.parseDouble().
- Parsing numbers by hand in the most machine-efficient way has us doing tricks like temp = 10 * temp + (b - '0') to parse a number from a string. I don’t think it’s controversial to argue that Double.parseDouble() is easier for teammates to read than this technique.
- Rather than leaning on tested standard library functions, we need to write tests for all of this new work we are doing by hand: chunk offset calculations, edge cases for numerical inputs, and so on. (A sketch of one such test follows Peter’s code below.)
private static Map<Key, Measurement> extractMeasurementsFromChunk(long i, long chunk, long length, RandomAccessFile raf) {
    long start = i * chunk;
    long size = Math.min(length, start + chunk + 64 * 1024) - start;
    Map<Key, Measurement> map = new HashMap<>(1024);
    try {
        MappedByteBuffer mbb = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, start, size);
        if (i > 0)
            skipToFirstLine(mbb);
        Key key = new Key();
        while (mbb.remaining() > 0 && mbb.position() <= chunk) {
            key.clear();
            readKey(mbb, key);
            int temp = readTemperatureFromBuffer(mbb);
            Measurement m = map.computeIfAbsent(key, n -> new Measurement());
            if (m.count == 0)
                key = new Key();
            m.sample(temp / 10.0);
        }
    }
    catch (IOException ioe) {
        throw new RuntimeException(ioe);
    }
    return map;
}
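// Not shown in this excerpt: skipToFirstLine, which handles the case where a chunk
// starts in the middle of a line. This is my own minimal sketch of the idea, not
// necessarily Peter's exact implementation: discard bytes until we pass a newline,
// so parsing begins on a complete record.
private static void skipToFirstLine(MappedByteBuffer mbb) {
    while (mbb.remaining() > 0 && mbb.get() != '\n') {
        // keep scanning until the start of the next line
    }
}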
// Reads a temperature value from the buffer.
private static int readTemperatureFromBuffer(MappedByteBuffer mbb) {
    int temp = 0;
    boolean negative = false;
    outer: while (mbb.remaining() > 0) {
        byte b = mbb.get();
        switch (b) {
            case '-':
                negative = true;
                break;
            case '.':
                break;
            case '\r':
            case '\n':
                break outer;
            default:
                temp = 10 * temp + (b - '0');
                break;
        }
    }
    if (mbb.remaining() > 0) {
        byte b = mbb.get(mbb.position());
        if (b == '\n')
            mbb.get();
    }
    if (negative)
        temp = -temp;
    return temp;
}
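And here is the kind of test that now lands on our plate, per the last bullet point above. This is a sketch under a few assumptions: JUnit 5 is available, the parser is made package-visible for testing (it’s private in Peter’s code), and the enclosing class name used here (CalculateAverage) is a placeholder.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.junit.jupiter.api.Test;

class ReadTemperatureFromBufferTest {

    // Helper: the parser takes a MappedByteBuffer, so we write the sample text to a
    // temp file and memory-map it rather than wrapping a plain byte array.
    private static MappedByteBuffer mapOf(String text) throws Exception {
        Path tmp = Files.createTempFile("onebrc", ".txt");
        Files.writeString(tmp, text, StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    @Test
    void parsesNegativeTemperature() throws Exception {
        // "-2.0" should come back as -20, since the parser returns tenths of a degree.
        assertEquals(-20, CalculateAverage.readTemperatureFromBuffer(mapOf("-2.0\n")));
    }

    @Test
    void handlesWindowsLineEndings() throws Exception {
        // The hand-rolled parser has to deal with \r\n itself; Files.lines used to do this for us.
        assertEquals(316, CalculateAverage.readTemperatureFromBuffer(mapOf("31.6\r\n")));
    }
}

With Double.parseDouble and Files.lines, none of these cases were ours to cover.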
The point of this exercise is to illustrate that when code is more abstract and generic, it is often easier to understand and change when new requirements come in. As it becomes more specific to the problem as it is today, the code becomes more concrete and harder to change.
Put another way, the first algorithm can be written using high-level (more abstract) declarative APIs, whereas the second is much more concrete (more specific). It deals with the problem imperatively, spelling out how to execute the program step by step in exchange for more efficient execution of our specific workload. We are taking ownership of those low-level details. They need to be tested and understood by the rest of the team now.
Performance optimization isn’t evil. But when folks say “premature optimization is the root of all evil”, I think this is a big reason why: the tension between abstraction and concretion, and knowing when to favor one over the other.
I think it helps to know when the best time to apply optimization is, and what trade-offs you’re willing to make. Projects at different points in their lifecycle are going to be on a spectrum of what’s best for them. One thing to always keep in mind: what is fast enough for right now?
Is this a project that’s less than a year old? Are we sure it does the right thing yet? Maybe hold off on optimization work and see if we can cut some scope. Or perhaps we need a deeper architectural change in how we are approaching the problem and need to make time for that. For the project I mentioned at the beginning of the post, this turned out to be the right thing to do. Things ended up relatively happy in the end.
Is this a project that is five years old which works well in its domain? Then maybe performance improvements can drive efficiency and save operational cost for the business, and/or the user. Especially if the system is stable and has a low rate of churn in the code.
Is this a project that is 10-20 years old and having a tough time adding new features because of the viscosity of the code base? It’s likely that the code base is too concrete. Some subsystems should become less concrete, and more abstract to accommodate changes people want to make. Then, once that subsystem settles down again, you can re-evaluate it for optimizations.
This is easier said than done. Each team is going to have to figure out their own trade-offs. But I thought this point was worth illustrating after talking with someone who seemed to dismiss these trade-offs. I assure you, they exist.
This insight isn’t new. But I’m just putting this illustrative example somewhere I can easily find and point people to.
Featured Image: “concrete 8” by bittbox is licensed under CC BY 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/2.0/.