Branchless Quicksort faster than std::sort and pdqsort with C and C++ API (tiki.li)

93 points by birdculture 2 days ago | 14 comments

orlp 3 hours ago [-]

Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).

So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)

On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:

    Sorting 50 million doubles:
    ipnsort             0.79s
    blqs                0.90s
    driftsort           1.13s   (stable)
    std::sort           1.22s
    std::stable_sort    4.64s   (stable)

    Sorting 50 million (i32, i32) structs:
    ipnsort             0.82s
    blqs                0.89s
    driftsort           1.07s   (stable)
    std::sort           3.09s
    std::stable_sort    3.15s   (stable)

And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random:

    driftsort           0.29s   (stable)
    ipnsort             0.81s
    std::sort           1.15s
    std::stable_sort    1.63s   (stable)
    blqs                1.89s
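
A minimal sketch of how a "first 90% sorted, last 10% random" input could be built and timed. The names `make_mostly_sorted` and `time_sort_seconds`, the seed, and the value range are assumptions for illustration; they are not taken from the benchmark in the repository.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n random doubles in [0, 1) whose leading
// `sorted_fraction` of elements is sorted; the tail stays random.
std::vector<double> make_mostly_sorted(std::size_t n, double sorted_fraction,
                                       unsigned seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> v(n);
    for (double& x : v) x = dist(rng);

    std::size_t prefix =
        static_cast<std::size_t>(static_cast<double>(n) * sorted_fraction);
    prefix = std::min(prefix, n);  // guard against fractions above 1.0
    std::sort(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(prefix));
    return v;
}

// Hypothetical helper: wall-clock seconds `sort_fn` needs on a private copy
// of `input` (taken by value so every candidate sees the same data).
template <class SortFn>
double time_sort_seconds(std::vector<double> input, SortFn sort_fn) {
    const auto start = std::chrono::steady_clock::now();
    sort_fn(input.begin(), input.end());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Passing a lambda that wraps `std::sort` or `std::stable_sort` as `sort_fn`, with `make_mostly_sorted(50000000, 0.9)` as the input, would mirror the shape of the experiment above; the timings themselves will of course differ per machine and compiler.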

quuxplusone 31 minutes ago [-]

It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have

    T k;                       // default-construct
    if (i > 0) k = left[--i];  // copy-assign

This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know. Again, in `partition_small`, we have

    T swbuf[SMALLPART];

which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.
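
To make the concern concrete in general terms (this is not the library's code): a scratch buffer can be declared as raw, suitably aligned storage, so that no `T` is constructed until an element is actually moved in. `ScratchBuffer` is a hypothetical name, and the sketch assumes C++17 for `std::launder`.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical scratch buffer: raw storage for up to N objects of type T.
// Nothing is constructed up front, so T need not be default-constructible.
template <class T, std::size_t N>
class ScratchBuffer {
public:
    ScratchBuffer() = default;
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;
    ~ScratchBuffer() { clear(); }

    // Move-construct a new element into the next free slot.
    void push(T&& value) {
        assert(size_ < N);
        ::new (static_cast<void*>(storage_ + size_ * sizeof(T)))
            T(std::move(value));
        ++size_;
    }

    // Access an element that was previously constructed by push().
    T& operator[](std::size_t i) {
        assert(i < size_);
        return *std::launder(reinterpret_cast<T*>(storage_ + i * sizeof(T)));
    }

    std::size_t size() const { return size_; }

    // Destroy only the elements that were actually constructed.
    void clear() {
        while (size_ > 0) {
            (*this)[size_ - 1].~T();
            --size_;
        }
    }

private:
    alignas(T) unsigned char storage_[N * sizeof(T)];
    std::size_t size_ = 0;
};
```

Whether `partition_small` could use something like this depends on whether it ever reads a slot before writing it, which is exactly the open question in the comment above.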

All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.

None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".
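
A small sketch of why the distinction matters for non-trivial types: an instrumented type that counts which special member functions run for "default-construct then copy-assign", "copy-construct", and "move-construct". `Tracked`, `Counts`, and the three `via_*` functions are hypothetical names for illustration, not part of the library.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tally of special member function calls.
struct Counts {
    int default_ctor = 0;
    int copy_ctor = 0;
    int move_ctor = 0;
    int copy_assign = 0;
};

// Hypothetical string-like type that records how it is created.
struct Tracked {
    std::string payload;

    static Counts& counts() {
        static Counts c;
        return c;
    }

    Tracked() { ++counts().default_ctor; }
    explicit Tracked(std::string s) : payload(std::move(s)) {}
    Tracked(const Tracked& o) : payload(o.payload) { ++counts().copy_ctor; }
    Tracked(Tracked&& o) noexcept : payload(std::move(o.payload)) {
        ++counts().move_ctor;
    }
    Tracked& operator=(const Tracked& o) {
        payload = o.payload;
        ++counts().copy_assign;
        return *this;
    }
};

// The quoted pattern: one default construction plus one copy assignment.
std::size_t via_default_then_assign(const Tracked& src) {
    Tracked k;
    k = src;
    return k.payload.size();
}

// A single copy construction; no default constructor required.
std::size_t via_copy_construct(const Tracked& src) {
    Tracked k = src;
    return k.payload.size();
}

// A single move construction; src is left in a valid but unspecified state,
// so this is only correct if the caller never reads src's old value again.
std::size_t via_move_construct(Tracked& src) {
    Tracked k = std::move(src);
    return k.payload.size();
}
```

For `int` all three compile to the same thing; for a string-like `T` the first pays for an extra construction and the first two pay for a deep copy.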

kvuj 52 minutes ago [-]

> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

>     for (int i = 0; i < 1000; i++) {
>         small_numbers[smlen] = numbers[i];
>         smlen += (numbers[i] < 500);
>     }

Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?

josephg 16 minutes ago [-]

Nah. (numbers[i] < 500) is an expression which evaluates to true (1) or false (0). Evaluating this doesn't require a branch. There are instructions on modern CPUs to turn this expression into a number without a conditional jump. On x86, for example: cmp (compare), setle (set if the comparison was less than or equal; a compiler can test x < 500 as x <= 499), then add.
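
For reference, a sketch of the two shapes being compared, written as standalone functions. The function names and the `limit` parameter are made up for illustration, and the "branchy" version is only presumably what the README means by the naive code. Note that the source form does not force the machine code: a compiler is free to emit a branch for either one.

```cpp
#include <cstddef>

// Branchy form: the store and the increment happen only when the test passes.
std::size_t filter_branchy(const int* numbers, std::size_t n, int limit,
                           int* out) {
    std::size_t len = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (numbers[i] < limit) {
            out[len] = numbers[i];
            ++len;
        }
    }
    return len;
}

// Branchless form: always store, then advance the write position by 0 or 1.
// A rejected value is simply overwritten by the next store.
// `out` must have room for n elements, because len never exceeds i.
std::size_t filter_branchless(const int* numbers, std::size_t n, int limit,
                              int* out) {
    std::size_t len = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[len] = numbers[i];
        len += static_cast<std::size_t>(numbers[i] < limit);
    }
    return len;
}
```

Both return the number of kept elements and leave the same values in `out[0..len)`; only the control flow inside the loop body differs.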

> then wouldn't they also optimize the naive piece of code?

Great question. They do sometimes!

In general, the problem for compilers is that it's not obvious which method would be better in some given piece of code. Most branches are highly predictable. Like, imagine a for loop which counts to 1000. At the end of the loop body, the code branches to see whether we should stay in the loop, or exit the loop. The first 999 times through the loop we keep going - so 99.9% of the time, the branch ends up taking the same path. It's very predictable! CPU designers optimise heavily for this, via branch prediction logic. Highly predictable branches run fast. (This is also why array bounds checking doesn't really hurt performance at all.)

But the branch predictor really struggles when the condition is unpredictable, i.e., when a conditional branch is taken about 50% of the time, as is the case in a sorting algorithm.

The compiler has no idea whether any condition in your code is predictable or not. There are hints you can use, but it often defaults to just doing whatever you ask it to do.

Here's what the compiler actually does with the code you quoted. You can see the extra branch + jump for the second version of the code:

https://c.godbolt.org/z/zv7Tcd49f

I clicked around - for some reason even using __builtin_expect_with_probability, none of the compilers I tried will convert from one version of this code into the other.
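
For anyone unfamiliar with the hint mentioned above, this is roughly how it is spelled. `__builtin_expect_with_probability` is a GCC/Clang extension; the `COIN_FLIP` macro and `count_below` are hypothetical names, and the sketch falls back to the plain condition where the builtin is not detected. As the comment above reports, the hint may not change the generated code at all.

```cpp
#include <cstddef>

// Use the builtin only where the compiler reports support for it.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_expect_with_probability)
#    define COIN_FLIP(cond) \
         __builtin_expect_with_probability(!!(cond), 1, 0.5)
#  endif
#endif
#ifndef COIN_FLIP
#  define COIN_FLIP(cond) (cond)  // fallback: no hint
#endif

// Counts elements below `limit`, telling the compiler that the test is
// expected to be true about half of the time.
std::size_t count_below(const int* numbers, std::size_t n, int limit) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (COIN_FLIP(numbers[i] < limit)) {
            ++count;
        }
    }
    return count;
}
```

The hint only describes the expected probability; it does not change what the function computes.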

4k0hz 42 minutes ago [-]

There's no branch in that code either way. The comparison operator outputs a value (which is arithmetic, not a branch), and that value is added unconditionally.

achandlerwhite 36 minutes ago [-]

Isn’t there an implicit check to exit the loop?

Tiddles-the2nd 27 minutes ago [-]

The check isn't important; what's important is being predictable so the CPU can guess which way the check will go. I don't know exactly how it works, but after the first couple of loops, the predictor will assume it's always going to end up in the loop and make that the fast path. It may guess wrong the first couple of loops, and the last check wrong, but the other 997 will be correct.
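
A toy model makes those numbers concrete. The sketch below simulates a single 2-bit saturating counter, a textbook predictor scheme; real CPUs use far more elaborate predictors, so treat it as an illustration only. All function names are hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Toy predictor: one 2-bit saturating counter for a single branch.
// States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
std::size_t count_mispredictions(const std::vector<bool>& outcomes,
                                 int state = 0) {
    std::size_t misses = 0;
    for (bool taken : outcomes) {
        const bool predicted_taken = state >= 2;
        if (predicted_taken != taken) ++misses;
        if (taken) {
            if (state < 3) ++state;  // saturate at "strongly taken"
        } else {
            if (state > 0) --state;  // saturate at "strongly not taken"
        }
    }
    return misses;
}

// Loop-back branch whose check runs `checks` times:
// taken on every check except the final one, which exits the loop.
std::vector<bool> loop_branch_outcomes(std::size_t checks) {
    std::vector<bool> v(checks, true);
    if (!v.empty()) v.back() = false;
    return v;
}

// Data-dependent branch on random input: each outcome is a coin flip.
std::vector<bool> coin_flip_outcomes(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<bool> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = (rng() & 1u) != 0;
    return v;
}
```

Starting from state 0, `count_mispredictions(loop_branch_outcomes(1000))` gives 3: two misses while the counter warms up and one at loop exit, leaving 997 correct guesses. The same predictor on `coin_flip_outcomes(1000, seed)` misses roughly half the time, since no counter state can anticipate a coin flip.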

cstrahan 24 minutes ago [-]

“that code” refers to the body of the loop.

Unless the loop is unrolled, yes, there is a branch to exit the loop. But then that doesn’t matter because the whole goal at the beginning was to avoid branch misprediction (which is not the same thing as avoiding branches entirely).

mgaunard 4 hours ago [-]

Aren't there several bitonic sort network implementations that are vectorized, Intel's in particular?

Why not compare against that?

mswphd 3 hours ago [-]

Funny: you can Ctrl-F "sorting network" and see that they even use them within their own design.
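
For readers who have not met the term: a sorting network is a fixed sequence of compare-exchange steps that does not depend on the data. The sketch below is a generic 5-comparator network for 4 integers, not a bitonic network, not Intel's vectorized implementation, and not the code used in the project under discussion.

```cpp
#include <algorithm>
#include <array>

// Compare-exchange: afterwards a <= b. On scalars, std::min/std::max
// usually compile to min/max or conditional-move instructions rather
// than a conditional jump, though that is up to the compiler.
inline void compare_exchange(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// 5-comparator sorting network for 4 elements.
// The same five steps run in the same order for every input.
inline void sort4(std::array<int, 4>& v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);  // v[0] now holds the minimum
    compare_exchange(v[1], v[3]);  // v[3] now holds the maximum
    compare_exchange(v[1], v[2]);  // order the two middle elements
}
```

Because the sequence of comparisons is fixed, networks like this map well onto SIMD lanes, which is what the vectorized implementations exploit.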

jeffbee 4 hours ago [-]

Great question. It would also be fair to ask how this behaves with non-random inputs. The benchmarks in the repo only use random values.
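
A sketch of a few non-random inputs one might feed to such a benchmark. The `Pattern` enum, `make_pattern`, and the specific patterns chosen are assumptions for illustration; none of them come from the repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical set of structured inputs for a sort benchmark.
enum class Pattern { Ascending, Descending, AllEqual, Sawtooth };

// Builds n ints following the requested pattern.
std::vector<int> make_pattern(Pattern p, std::size_t n) {
    std::vector<int> v(n);
    switch (p) {
    case Pattern::Ascending:   // 0, 1, 2, ...
        std::iota(v.begin(), v.end(), 0);
        break;
    case Pattern::Descending:  // ..., 2, 1, 0
        std::iota(v.rbegin(), v.rend(), 0);
        break;
    case Pattern::AllEqual:    // many duplicates of a single key
        std::fill(v.begin(), v.end(), 7);
        break;
    case Pattern::Sawtooth:    // short ascending runs: 0..15, 0..15, ...
        for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(i % 16);
        break;
    }
    return v;
}
```

Already-sorted, reversed, and duplicate-heavy inputs are the classic cases where quicksort variants behave very differently from their random-input numbers.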

davidkwast 4 hours ago [-]

It is so simple that I had to look very slowly to understand. Nicely done.

NuclearPM 4 hours ago [-]

If it wasn’t simple you could look fast and understand?

hyperhello 3 hours ago [-]

If it wasn't simple, there would be more lines of code to implement the same idea. As it is, he might have had to spend an hour understanding one line to understand that idea (1 line/hr, slow), as opposed to spending an hour reading a hundred lines of code (100 lines/hr, fast) for the same result.
