NHacker Next
A not so fast implementation of cosine similarity in C++ and SIMD (joseprupi.github.io)
mkristiansen 54 minutes ago [-]
This is really interesting. I have a couple of questions, mainly stemming from the fact that your C++ code is about 2x slower than NumPy.

I had a look at the assembly generated, both in your repo, and from https://godbolt.org/z/76K1eacsG

If you look at the assembly generated:

        vmovups ymm3, ymmword ptr [rdi + 4*rcx]
        vmovups ymm4, ymmword ptr [rsi + 4*rcx]
        add     rcx, 8
        vfmadd231ps     ymm2, ymm3, ymm4
        vfmadd231ps     ymm1, ymm3, ymm3
        vfmadd231ps     ymm0, ymm4, ymm4
        cmp     rcx, rax
        jb      .LBB0_10
        jmp     .LBB0_2
you are only using 5 of the AVX registers (ymm0 -- ymm4) before creating a dependency on one of the accumulators (ymm0 -- ymm2) used for the results.

I wonder if widening your step size to cover more than one 256-bit register might get you the speedup. Something like this (https://godbolt.org/z/GKExaoqcf) would get more of the AVX registers in your CPU doing work.

        vmovups ymm6, ymmword ptr [rdi + 4*rcx]
        vmovups ymm8, ymmword ptr [rsi + 4*rcx]
        vmovups ymm7, ymmword ptr [rdi + 4*rcx + 32]
        vmovups ymm9, ymmword ptr [rsi + 4*rcx + 32]
        add     rcx, 16
        vfmadd231ps     ymm5, ymm6, ymm8
        vfmadd231ps     ymm4, ymm7, ymm9
        vfmadd231ps     ymm3, ymm6, ymm6
        vfmadd231ps     ymm2, ymm7, ymm7
        vfmadd231ps     ymm1, ymm8, ymm8
        vfmadd231ps     ymm0, ymm9, ymm9
        cmp     rcx, rax
        jb      .LBB0_10
        jmp     .LBB0_2

which ends up using 10 of the registers, allowing for 6 fused multiply-adds, rather than 3, before creating a dependency on a previous result -- and you might be able to unroll even wider.
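In C++ intrinsics, that wider step might look roughly like the sketch below (this is not OP's actual code; names are invented, and it assumes an AVX2+FMA machine). The point is the six independent accumulators: each vfmadd231ps writes a different register, so the FMAs can overlap in the pipeline instead of stalling on the previous iteration's result.

```cpp
#include <immintrin.h>
#include <cmath>
#include <cstddef>

// Horizontal sum of one 256-bit accumulator (done scalar for clarity).
__attribute__((target("avx2,fma")))
static float hsum256(__m256 v) {
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    float s = 0.0f;
    for (int k = 0; k < 8; ++k) s += tmp[k];
    return s;
}

__attribute__((target("avx2,fma")))
float cosine_similarity_unrolled(const float* a, const float* b, std::size_t n) {
    __m256 dot0 = _mm256_setzero_ps(), dot1 = _mm256_setzero_ps();
    __m256 na0  = _mm256_setzero_ps(), na1  = _mm256_setzero_ps();
    __m256 nb0  = _mm256_setzero_ps(), nb1  = _mm256_setzero_ps();

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {             // step of 16 floats = 2 ymm loads per input
        __m256 va0 = _mm256_loadu_ps(a + i);
        __m256 vb0 = _mm256_loadu_ps(b + i);
        __m256 va1 = _mm256_loadu_ps(a + i + 8);
        __m256 vb1 = _mm256_loadu_ps(b + i + 8);
        dot0 = _mm256_fmadd_ps(va0, vb0, dot0); // six FMAs, six distinct
        dot1 = _mm256_fmadd_ps(va1, vb1, dot1); // destination registers
        na0  = _mm256_fmadd_ps(va0, va0, na0);
        na1  = _mm256_fmadd_ps(va1, va1, na1);
        nb0  = _mm256_fmadd_ps(vb0, vb0, nb0);
        nb1  = _mm256_fmadd_ps(vb1, vb1, nb1);
    }

    float dot = hsum256(_mm256_add_ps(dot0, dot1));
    float na  = hsum256(_mm256_add_ps(na0, na1));
    float nb  = hsum256(_mm256_add_ps(nb0, nb1));

    for (; i < n; ++i) {                       // scalar tail for leftover elements
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

The function-level target attribute is just so the sketch compiles without global -mavx2 -mfma flags; in a real build you would set those on the command line.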

Again -- This was a really interesting writeup :)

neonsunset 52 minutes ago [-]
> but unless you opt to implement a processor-specific calculation in C++

Not necessarily true if you use C# (or Swift or Mojo) instead:

    static float CosineSimilarity(
        ref float a,
        ref float b,
        nuint length
    ) {
        var sdot = Vector256<float>.Zero;
        var sa = Vector256<float>.Zero;
        var sb = Vector256<float>.Zero;

        for (nuint i = 0; i < length; i += 8) {
            var bufa = Vector256.LoadUnsafe(ref a, i);
            var bufb = Vector256.LoadUnsafe(ref b, i);

            sdot = Vector256.FusedMultiplyAdd(bufa, bufb, sdot);
            sa = Vector256.FusedMultiplyAdd(bufa, bufa, sa);
            sb = Vector256.FusedMultiplyAdd(bufb, bufb, sb);
        }

        var fdot = Vector256.Sum(sdot);
        var fanorm = Vector256.Sum(sa);
        var fbnorm = Vector256.Sum(sb);

        return fdot / (MathF.Sqrt(fanorm) * MathF.Sqrt(fbnorm));
    }
Compiles to appropriate codegen quality: https://godbolt.org/z/hh16974Gd, on ARM64 it's correctly unrolled to 128x2

Edit: as a sibling comment mentioned, this benefits from unrolling, which would require swapping 256 with 512 and += 8 with += 16 in the snippet above. Although, depending on your luck, Godbolt runs this on a CPU with AVX512, in which case you won't see the unrolling because it just picks the ZMM registers supported by the hardware instead :)

QuadmasterXLII 23 minutes ago [-]
I’m really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly, this task is so common that I would half expect compilers to have a hard-coded special case specifically for fast dot products.

Edit: Yeah, when I compile the “plain c++” with clang, the main loop is 8 vmovups, 16 vfmadd231ps, and an add/cmp/jne. OP forgot some flags.
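For reference, a plain loop like the one below is the kind of code clang will autovectorize into that vmovups/vfmadd231ps pattern, but only with the right flags. Which flags OP was missing is an assumption, but typically it takes -O3 plus -march=native (or -mavx2 -mfma) to enable FMA, and -ffast-math because vectorizing the reduction reorders floating-point adds, which the compiler won't do under strict IEEE semantics.

```cpp
// Plain C++ cosine similarity -- no intrinsics.
// Hypothetical compile line that lets clang vectorize the loop:
//   clang++ -O3 -march=native -ffast-math cosine.cpp
#include <cmath>
#include <cstddef>

float cosine_similarity(const float* a, const float* b, std::size_t n) {
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        dot    += a[i] * b[i];   // each accumulator becomes a vector
        norm_a += a[i] * a[i];   // register once the loop is vectorized
        norm_b += b[i] * b[i];
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```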
