I had a look at the assembly generated, both in your repo, and from https://godbolt.org/z/76K1eacsG
You are only using 5 of the AVX registers (ymm0 -- ymm4) before creating a dependency on one of the three (ymm0 -- ymm2) holding the results.
I wonder if widening your step size to span more than one 256-bit register might get you the speed-up. Something like this (https://godbolt.org/z/GKExaoqcf) would put more of your CPU's AVX registers to work.
That version ends up using 10 of the registers, allowing for 6 fused multiplies, rather than 3, before creating a dependency on a previous result -- you might be able to stretch the chain even further.
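For concreteness, here's a plain C++ sketch of the multiple-accumulator idea (my own illustration, not the code from either godbolt link; the function name and unroll factor are arbitrary choices):

```cpp
#include <cassert>
#include <cstddef>

// Sketch only: keep several independent partial sums so consecutive fused
// multiply-adds don't all serialize on a single accumulator. With -O3 and
// AVX2 enabled, a compiler will typically map each 8-float accumulator to
// its own ymm register.
float dot_unrolled(const float* a, const float* b, std::size_t n) {
    constexpr std::size_t kLanes = 8;  // floats per 256-bit register
    constexpr std::size_t kAccum = 4;  // independent accumulator streams
    float acc[kAccum][kLanes] = {};
    std::size_t i = 0;
    for (; i + kAccum * kLanes <= n; i += kAccum * kLanes)
        for (std::size_t s = 0; s < kAccum; ++s)
            for (std::size_t l = 0; l < kLanes; ++l)
                acc[s][l] += a[i + s * kLanes + l] * b[i + s * kLanes + l];
    // fold the independent streams back into one scalar
    float sum = 0.0f;
    for (std::size_t s = 0; s < kAccum; ++s)
        for (std::size_t l = 0; l < kLanes; ++l)
            sum += acc[s][l];
    for (; i < n; ++i)  // scalar tail for leftover elements
        sum += a[i] * b[i];
    return sum;
}
```

The point is only the dependency structure: the 32 multiplies in each outer iteration feed four separate chains, so the FMA units aren't stalled waiting on one running total.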
Again -- This was a really interesting writeup :)
neonsunset 52 minutes ago
> but unless you opt to implement a processor-specific calculation in C++
Not necessarily true if you use C# (or Swift or Mojo) instead:
    static float CosineSimilarity(
        ref float a,
        ref float b,
        nuint length
    ) {
        var sdot = Vector256<float>.Zero;
        var sa = Vector256<float>.Zero;
        var sb = Vector256<float>.Zero;
        for (nuint i = 0; i < length; i += 8) {
            var bufa = Vector256.LoadUnsafe(ref a, i);
            var bufb = Vector256.LoadUnsafe(ref b, i);
            sdot = Vector256.FusedMultiplyAdd(bufa, bufb, sdot);
            sa = Vector256.FusedMultiplyAdd(bufa, bufa, sa);
            sb = Vector256.FusedMultiplyAdd(bufb, bufb, sb);
        }
        var fdot = Vector256.Sum(sdot);
        var fanorm = Vector256.Sum(sa);
        var fbnorm = Vector256.Sum(sb);
        // parenthesized so the dot product is divided by the whole denominator
        return fdot / (MathF.Sqrt(fanorm) * MathF.Sqrt(fbnorm));
    }
Compiles to appropriate codegen quality: https://godbolt.org/z/hh16974Gd, and on ARM64 it's correctly unrolled to 128x2.
Edit: as the sibling comment mentioned, this benefits from unrolling, which would require swapping 256 with 512 and += 8 with += 16 in the snippet above. Depending on your luck, though, Godbolt runs this on a CPU with AVX512, so you don't see the unrolling: it just picks the ZMM registers the hardware supports instead :)
QuadmasterXLII 23 minutes ago
I’m really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly, this task is so common that I would half expect compilers to have a hard-coded special case specifically for fast dot products.
Edit: Yeah, when I compile the “plain c++” with clang the main loop is 8 vmovups, 16 vfmadd231ps, and an add cmp jne. OP forgot some flags.
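For reference, a plausible invocation (dot.cpp stands in for the "plain c++" source; exact flags are my assumption): clang won't turn a float reduction into packed FMA instructions unless the target ISA is enabled and floating-point reassociation is allowed.

```shell
# -mavx2 -mfma enable 256-bit FMA codegen; -ffast-math permits the
# reassociation clang needs to vectorize the float reduction loop.
clang++ -O3 -mavx2 -mfma -ffast-math -S -o dot.s dot.cpp

# or simply target the host CPU:
clang++ -O3 -march=native -ffast-math -S -o dot.s dot.cpp
```

If -ffast-math is too blunt, -funsafe-math-optimizations (or a per-loop `#pragma clang loop vectorize(enable)` plus reassociation flags) gets the same reduction vectorized with a smaller blast radius.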