Intel needs to see what has happened to their AVX instructions and why NVidia has taken over.
If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs rather than being forced to write and rewrite in SSE vs AVX vs AVX512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up btw. But software advantage and long term benefits of rewriting into CUDA are heavily apparent.
Intel ISPC is a great project btw if you need high level code that targets SSE, AVX, AVX512 and even ARM NEON all with one codebase + auto compiling across all the architectures.
-------
Intel's AVX-512 is pretty good at the hardware level. But a software methodology to interact with SIMD using GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
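To make "too hard" concrete, here is a rough sketch (my own illustration, not code from the article) of a plain float sum written directly against AVX2 intrinsics - most of it is bookkeeping rather than math:

    #include <immintrin.h>
    #include <cstddef>

    // Sum an array of floats with raw AVX2 intrinsics. Most of the code is
    // bookkeeping (horizontal reduction, scalar tail), not the actual math.
    float sum_avx2(const float* x, size_t n) {
      __m256 acc = _mm256_setzero_ps();
      size_t i = 0;
      for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
      // Reduce 8 lanes to 1 scalar.
      __m128 lo = _mm256_castps256_ps128(acc);
      __m128 hi = _mm256_extractf128_ps(acc, 1);
      __m128 s = _mm_add_ps(lo, hi);
      s = _mm_hadd_ps(s, s);
      s = _mm_hadd_ps(s, s);
      float total = _mm_cvtss_f32(s);
      for (; i < n; ++i) total += x[i];  // leftover elements
      return total;
    }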
jsheard 4 days ago [-]
> Intel ISPC is a great project btw if you need high level code that targets SSE, AVX, AVX512 and even ARM NEON
It's pretty funny how NEON ended up in there. A former Intel employee decided to implement it for fun and submitted it as a pull request, which Intel quietly ignored for obvious reasons, but then another former Intel employee who still had commit rights merged the PR, and the optics of publicly reverting it would be even worse than stonewalling so Intel begrudgingly let it stand (but they did revoke that dev's commit rights).
https://pharr.org/matt/blog/2018/04/29/ispc-retrospective
While there is some truth in what you say, it makes it seem like writing in the CUDA style is something new and revolutionary invented by NVIDIA, which it is not.
The CUDA style of writing parallel programs is nothing else than the use of the so-called "parallel do" a.k.a. "parallel for" program structure, which has been already discussed in 1963. Notable later evolutions of this concept have been present in "Communicating Sequential Processes" by C.A.R. Hoare (1978-08: "arrays of processes"), then in the programming language Occam, which was designed based on what Hoare had described, then in the OpenMP extension of Fortran (1997-10), then in the OpenMP extension of C and C++ (1998-10).
Programming in CUDA does not bring anything new, except that in comparison e.g. with OpenMP some keywords are implicit and others are different, so the equivalence is not immediately obvious.
Programming for CPUs in the much older OpenMP is equivalent with programming in CUDA for GPUs.
The real innovation of NVIDIA has been the high quality of the NVIDIA CUDA compiler and CUDA runtime GPU driver, which are able to distribute the work that must be done on the elements of an array over all the available cores, threads and SIMD lanes, in a manner that is transparent for the programmer, so in many cases the programmer is free to ignore which is the actual structure of the GPU that will run the program.
Previous compilers for OpenMP or for other such programming language extensions for parallel programming have been much less capable of producing efficient parallel programs without being tuned by the programmer for each hardware variant.
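For the record, a minimal OpenMP sketch of the style in question (the CUDA analogue is only described in the comments, since the exact kernel syntax differs):

    #include <cstddef>

    // OpenMP "parallel for": the compiler/runtime splits the iteration space
    // over the available cores. In CUDA the same loop body would become a
    // kernel, with i recovered from blockIdx/threadIdx instead of the loop.
    // Build with -fopenmp.
    void saxpy(float a, const float* x, float* y, std::ptrdiff_t n) {
    #pragma omp parallel for
      for (std::ptrdiff_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }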
saagarjha 3 days ago [-]
I'm not sure what you mean. All CUDA code needs to be aware of the programming model the GPU imposes on them, splitting their code manually into threads and warps and blocks and kernels to match. This isn't really transparent at all.
dragontamer 3 days ago [-]
Oh all of this was being done in the 1980s by Lisp* programmers.
I'm not calling it new. I'm just saying that the intrinsics style is much much harder than what Lisp*, DirectX HLSL, CUDA, OpenCL (etc. etc) does.
A specialized SIMD language makes writing SIMD easier compared to intrinsic style. Look at any CUDA code today and compare it to the AVX that is in the above article and it becomes readily apparent.
pjmlp 4 days ago [-]
It is worse than that, given that AVX is the survivor of Larrabee's great plan to kill GPUs.
Larrabee was going to take over it all, as I enjoyed its presentation at GDCE 2009.
Earw0rm 4 days ago [-]
And a few years later, Intel said we'd get AVX512 on everything by 2016, and that the instruction encoding supported a future extension to 1024.
And then the Skylake and Cannon Lake debacle..
First they pulled it from the consumer chips a fairly short time before launch. Then the server chips it was present in would downclock aggressively when you did use it, so you could get at best maybe 40% more performance, certainly far from the 2x+ it promised.
Ten years on and the AMD 9950X does a pretty good job with it, however.
Earw0rm 3 days ago [-]
Oh, and I neglected to mention the protracted development, and short, miserable life, of Cannon Lake itself.
First announced in 2013, it eventually shipped five years later in only a single, crippled dual-core mobile SKU, which lasted just a year in the market before they killed it off.
"Let's put our only consumer implementation of our highest performing vector architecture on a lame-duck NUC chip.", good move guys.
dragontamer 4 days ago [-]
I mean, 288 E-core Xeons are about to ship. Xeon 6900 series, right? (Estimated to ship in Q1 2025)
So Larrabee lives on for... some reason. These E cores are well known to be modified Intel Atom cores and those were modified Xeon Phi cores which were Larrabee based.
Just with.... AVX512 being disabled. (Lost when Xeon Phi turned into Intel Atoms).
Intel's technical strategy is completely bonkers. In a bad way. Intel invented all this tech 10 to 20 years ago but fails to have a cohesive strategy to bring it to market. There are clearly smart people there, but somehow all the top level decisions are just awful.
ashvardanian 4 days ago [-]
Yes, a lot of weird decisions were made at Intel.
Ironically, AMD waited so long to implement AVX-512, but now has it on both server and mobile chips (natively and 256-bit emulation, respectively). Intel started the whole thing, has a very fragmented stack and is now preparing those E cores with even more new extensions.
Most importantly for Search and AI, it adds AVX_VNNI, which can be used for faster 8-bit integer dot-products: https://github.com/ashvardanian/SimSIMD/blob/75c426fb190a9d4...
Would be interesting to see how matrix multiplication throughput will differ between AVX-512-capable P cores and a larger quantity of AVX_VNNI-capable E cores!
alfiedotwtf 4 days ago [-]
A former Intel CEO even wrote a book where every product was planned 20+ years in advance.
Imagine planning 20 years in advance where Moore’s Law is still going strong. Come to think of it, Moore was also CEO of Intel lol
david-gpu 3 days ago [-]
> If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs rather than being forced to write and rewrite in SSE vs AVX vs AVX512
NVidia compilers would have compiled your code into something functional, but if you want to approach peak performance you need to at least tweak your kernels, and sometimes rewrite them from scratch. See for example the various MMA instructions that were introduced over time.
Edit: I see somebody made a similar comment and you addressed it. Sorry for the churn.
dist-epoch 4 days ago [-]
> If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs
That's not true. For maximum performance you need to tweak the code to a particular GPU model/architecture.
Intel has SSE/AVX/AVX2/AVX512, but CUDA has like 10 iterations of this (increasing capabilities). Code written 15 years ago would not use modern capabilities, like more flexible memory access, atomics.
dragontamer 4 days ago [-]
Maximum performance? Okay, you'll have to upgrade to ballot instructions or whatever and rearchitect your algorithms. (Or other wavefront / voting / etc. etc. new instructions that have been invented. Especially those 4x4 matrix multiplication AI instructions).
But CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc etc)
It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes so dramatically from generation to generation but the PTX intermediate code has stayed highly competitive.
dist-epoch 4 days ago [-]
Intel code from 15 years ago also runs today. But it will not use AVX512.
Which is the same with PTX, right? If you didn't use the tensor core instructions or wavefront voting in the CUDA code, the PTX generated from it will not either, and NVIDIA will not magically add those capabilities in when compiling to SASS.
Maybe it remains competitive because the code is inherently parallel anyway, so it will naturally scale to fill the extra execution units of the GPU, which is where most of the improvement is generation to generation.
While AVX code can't automatically scale to use the AVX512 units.
dragontamer 4 days ago [-]
It's not the same. AVX2 instructions haven't changed and never will change.
In contrast, NVidia can go from 64-bit instruction bundles to 128-bit machine code (96-bit instruction + 32-bit control information) between Pascal (compute capability 6) and Volta (compute capability 7), and all the old PTX code just autocompiles to the new assembly instruction format and takes advantage of all the new memory barriers added in Volta.
Having a PTX translation layer is a MAJOR advantage for the NVidia workflow.
ashvardanian 4 days ago [-]
There is still a lot of similarity between CPU and GPU programming - between AVX and PTX. Different generations of CPU cores handle the same AVX2 instructions differently. The microcode changes and the schedulers change, but the process is transparent for the user, similar to PTX.
mmoskal 3 days ago [-]
I imagine there is an order of magnitude of difference between how much you can translate in software, with a large memory and significant time budget to work with, compared to microcode.
dragontamer 3 days ago [-]
Most CPU instructions are 1-to-1 with their microcode. I dare say that microcode is nearly irrelevant, any high-performance instruction (ex: multiply, add, XOR, etc. etc.) is but a single instruction anyway.
Load/Store are memory dependent in all architectures. So that's just a different story as CPUs and GPUs have completely different ideas of how caches should work. (CPUs aim for latency, GPUs for bandwidth + incredibly large register spaces with substantial hiding of latency thanks to large occupancies).
-------------
That being said: reorder buffers on CPUs are well over 400-instructions these days, with super-large cores (like Apple's M4) apparently on the order of 600 to 800 instructions.
Reorder buffers are _NOT_ translation. They're Tomasulo's algorithm (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm). If you want to know how CPUs do out-of-order, study that.
I'd say CPUs have small register spaces (16 architectural registers, maybe 32), but large register files of maybe 300 or 400+. Tomasulo's algorithm is used to out-of-order access registers.
You should think of instructions like "mov rax, [memory]" as closer to "rax = malloc(register); delayed-load(rax, memory); Out-of-order execute all instructions that don't use RAX ahead of us in instruction stream".
Tomasulo's algorithm means using a ~300-register file to _pretend_ to be just 16 architectural registers. The 300 registers keep the data out-of-order and allow you to execute. Registers in modern CPUs are closer to unique_ptr<int> in C++: assigning them frees (aka: reorder buffer) and also mallocs a new register off the register-file.
janwas 4 days ago [-]
I hope people aren't writing directly to AVX2. When using a wrapper such as Highway, you get exactly this kind of update after a recompile, or even just running your code on a CPU that supports newer instructions.
The cost is that the binary carries around both AVX2 and AVX-512 codepaths, but that is not an issue IMO.
jandrewrogers 3 days ago [-]
Many use cases for SIMD aren't trivially expressible through wrappers and abstractions. It is sometimes cleaner, easier, and produces more optimized codegen to write the intrinsics directly. It isn't ideal but it often produces the best result for the effort involved.
An issue with the abstractions that does not go away is that the optimal code architecture -- well above the level of the SIMD wrappers -- is dependent on the capabilities of the silicon. The wrappers can't solve for that. And if you optimize the code architecture for the silicon architecture, it quickly approximates writing architecture-specific intrinsics with an additional layer of indirection, which significantly reduces any notional benefit from the abstractions.
The wrappers can't abstract enough, and higher level abstractions (written with architecture aware intrinsics) are often too use case specific to reuse widely.
janwas 3 days ago [-]
Wrappers can be zero-overhead, so any claim of better codegen vs the underlying intrinsics sounds dubious. "best result for the [higher] effort involved" also contradicts my experience, so I ask for evidence.
One counterexample: our portable vqsort [1] outperforms AVX-512-specific intrinsics [2].
I agree that high-level design may differ. You seem aware that Highway, and probably also other wrappers, supports specializing code for some target(s), but possibly misunderstand how, given the "additional layer of indirection" claim. Wrappers give you a portable baseline, and remove some of the potholes and ugly syntax, but boil down to inlined wrapper functions.
If you want to specialize, that is supported. And what is the downside? Even if you say the benefit of a wrapper is reduced vs manually written intrinsics (and reinventing all the workarounds for their missing instructions), do you not agree that the benefit is still nonzero?
[1]: https://github.com/google/highway/tree/master/hwy/contrib/so... [2]: https://github.com/Voultapher/sort-research-rs/blob/38f37eef...
The downside is that you write an implementation in Highway, find that it doesn't perform how you want, and then you have to rewrite it.
janwas 2 days ago [-]
Curious - how is/was performance helped by rewriting? Why not reach out to us, to see if it can be fixed in the library - wouldn't that be cheaper than rewriting?
saagarjha 2 days ago [-]
I’ve moved on to other things so I can’t really give details anymore. I understand this is annoying to hear as someone who works on that library but I also want to say that your comment is also annoying for different reasons, which mostly answer your question so I’ll explain anyway.
Highway is (I feel not very controversially) kind of like a compiler but worse at its job. It’s not meant to be as general and it only targets a limited set of code, namely code that is annotated to vectorize well. But looking at it as a compiler is kind of useful: it’s supposed to make writing faster code easier and more automatic. Sometimes compilers are not able to do this, just as Highway can’t either. Maybe its design lacks the expressiveness to represent the algorithm people want. Perhaps it doesn’t quite lower to the optimal code. Maybe it turns out that so little of the operation maps to the constructs that a huge amount needs to go through the escape hatch that you offer, at which point it’s not really worth using the library anyway. In that situation, given an existing and friendly relationship, I would be happy to reach out. But this is a cost to me, because I need to simplify and generalize the thing I want. Then I hand it to you and you decide how you want to tackle it, if at all. All the while I’m waiting and I have code that needs to be written. This is a cost, and something that as an engineer I weigh against just using the intrinsics directly, which I know do exactly what I need but with higher upfront and maintenance costs. When you see someone write their own assembly instead of letting the compiler do it for them, they’re making their version of the same tradeoff.
janwas 2 days ago [-]
Thank you for sharing your thoughts!
> it’s supposed to make writing faster code easier and more automatic
Agree with this viewpoint. I suppose that makes it compiler-like in spirit, though much simpler.
I also agree that waiting for input/updates is a cost. What still surprises me, is that you seem to be able to do something differently with intrinsics, while believing this is not possible as a user of Highway.
It is indeed possible to call _mm_fixupimm_pd(v1.raw, v2.raw, v3.raw, imm), and the rest of your code can be portable.
I would be surprised if heavy usage were made of such escape hatches, but it's certainly interesting to discuss any cases that arise.
I do respect your decision, and that you make clear that raw intrinsics have higher upfront and maintenance costs. I suppose it's a matter of preference and estimating the return on the investment of learning the Highway vocabulary (=searching x86_128-inl.h for the intrinsic you know).
Personally, I find the proliferation of ISAs makes a clear case against hand-written kernels. But perhaps in your use case, x86 will continue to be the only target of interest. Fair enough.
imtringued 4 days ago [-]
Most video encoders and decoders consist of kernels with hand written SIMD instructions/intrinsics.
janwas 4 days ago [-]
Agreed. FWIW we demonstrated with JPEG XL (image codec, though also with animation 'video' support) that it is possible to write such kernels using the portable Highway intrinsics.
Remnant44 3 days ago [-]
I would wager that most real world SIMD use is with direct intrinsics.
dragontamer 3 days ago [-]
> I hope people aren't writing directly to AVX2.
Did you not read the article? It's using AVX intrinsics and NEON intrinsics.
janwas 3 days ago [-]
I did, and I truly do not understand why some people do this.
As shown in the reddit comments on this article [1], the initial intrinsics version was quite suboptimal and clearly worse than portable code [2].
When not busy unnecessarily rewriting everything for each ISA, it is easier to see and have time for vital optimizations such as unrolling :)
[1]: https://www.reddit.com/r/cpp/comments/1gzob1g/understanding_... [2]: https://github.com/google/highway/blob/master/hwy/contrib/do...
This is not really fair or true. Nvidia changes the meaning of PTX when they want to. For example, warp thread divergence is something they implemented in an architecture revision, technically breaking existing code. With SM90 (Hopper) they have even started including unstable features in PTX for which they make even weaker promises. And of course everyone who cares about performance is rewriting their kernels (or using someone else's rewritten kernels) for each new architecture. I honestly do not think it is fair to compare this to the CPU landscape, which has much stronger backwards compatibility guarantees.
variadix 4 days ago [-]
How much of this is because CUDA is designed for GPU execution and because the GPU ISA isn’t a stable interface? E.g. new GPU instructions can be utilized by new CUDA compilers for new hardware because the code wasn’t written to a specific ISA? Also, don’t people fine tune GPU kernels per architecture manually (either by hand or via automated optimizers that test combinations in the configuration space)?
dragontamer 4 days ago [-]
NVidia PTX is a very stable interface.
And the PTX to SASS compiler DOES a degree of automatic fine tuning between architectures. Nothing amazing or anything, but it's a minor speed boost that has made PTX an easier 'assembly-like language' to build on top of.
janwas 4 days ago [-]
My understanding is that there is a lot of hand-writing (not just fine-tuning) going on. AFAIK CuDNN and TensorRT are written directly as SASS, not CUDA. And the presence of FP8 in H100, but not A100, would likely require a complete rewrite.
dragontamer 3 days ago [-]
Cub, thrust and many other libraries that make those kernels possible don't need to be rewritten.
When you write a merge sort in CUDA, you can keep it across all versions. Maybe the new instructions can improve a few corner cases, but it's not like AVX to AVX512 where you need to rewrite everything.
Ex: https://github.com/NVIDIA/cub/blob/main/cub/device/device_me...
I agree not everything needs to be rewritten.
And neither does code using an abstraction such as Highway, so we can stop beating that dead horse.
synack 3 days ago [-]
I’ve been playing with a new Lunar Lake laptop and they’ve complicated things even further with the Neural Processing Unit (NPU)
Now if your vectors are INT8/FP8 you’re supposed to shovel them into this accelerator via PCIe, rather than packing into registers for AVX512.
I wish they’d just pick an interface for vector ops and stick with it.
janwas 4 days ago [-]
Max performance is a stretch - recompilation would not utilize tensor cores, right?
"too hard for mainstream programmers" seems overly pessimistic. I've run several workshops where devs have written dot-product kernels using Highway after 30 minutes of introduction.
kristianp 3 days ago [-]
They said intrinsics. Highway is an abstraction on top of intrinsics.
janwas 3 days ago [-]
OK :) A thin abstraction, though.
If comparing with alternative categories such as domain-specific language or autovectorization, I'd still classify Highway as intrinsics, just portable and easier to use.
What's your opinion on sycl?
https://www.khronos.org/sycl/
snihalani 4 days ago [-]
> software methodology to interact with SIMD using GPU-like languages should be a priority.
Sometimes I wonder about an alternative history scenario where CPU ISA's would have chosen a SIMT style model instead of SIMD. "Just" have something like fork/join instructions to start/stop vector mode, otherwise use the standard scalar instructions in both scalar and vector mode. Would have avoided a lot of combinatorial explosion in instructions. (of course you'd have to do something for cross-lane operations, and later tensor instructions etc.)
janwas 4 days ago [-]
Not sure why SIMT would help, it requires more compiler transforms than if the code is written for packets/vectors or whatever we want to call them. As you note, cross-lane is a key part of a good SIMD abstraction. Vulkan calls it "subgroups", but from where I sit it's still SIMD.
ip26 4 days ago [-]
Is CUDA not more analogous to using MKL, rather than AVX?
Joker_vD 4 days ago [-]
> SIMD instructions are complex, and even Arm is starting to look more “CISCy” than x86!
Thank you for saying it out loud. XLAT/XLATB of x86 is positively tame compared to e.g. vrgatherei16.vv/vrgather.vv.
saagarjha 3 days ago [-]
That's RISC-V, no?
TinkersW 4 days ago [-]
You can simplify the 2x sqrts as sqrt(a*b): overall fewer operations, so perhaps more accurate. It would also let you get rid of the funky lane swizzles.
As this would only use 1 lane, perhaps if you have multiple of these to normalize, you could vectorize it.
a_gopher 4 days ago [-]
my thoughts exactly - crazy to know all these arcane SIMD opcodes but not know basic maths!!
ashvardanian 4 days ago [-]
Square root computation can be tricky, often relying on approximations. These approximations tend to perform best for mid-range values, while accuracy can degrade for very large or very small values. With this in mind, a product of roots is generally more accurate than a root of products.
From a SIMD perspective, it’s worth noting that on most platforms, the cost of computing one square root or two is the same. On modern x86 server CPUs, for instance, you can calculate up to 8 double-precision roots in parallel with identical latency. So there’s no additional cost in terms of performance.
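For instance, a minimal sketch assuming an AVX-512-capable CPU:

    #include <immintrin.h>

    // One instruction, eight double-precision roots: the "extra" sqrt in
    // sqrt(a) * sqrt(b) is free if a and b sit in different lanes.
    __m512d roots(__m512d x) {
      return _mm512_sqrt_pd(x);
    }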
I hope this sheds some light on the design of my code.
PS: In a previous life, I did research in Astro- and Plasma Physics. While I don’t claim to remember all the Math, it’s usually more productive to ask for clarification than to assume ignorance ;)
harry8 4 days ago [-]
> it’s usually more productive to ask for clarification than to assume ignorance ;)
Good reminder for me and anyone else right there, nicely put.
nine_k 3 days ago [-]
Moments like that are enlightening. When you see something really improbable (knowing advanced SIMD while appearing to ignore basic algebra), it's likely the moment you see a gap in your picture of the world. So it's time to learn something new and likely unexpected (otherwise you could have guessed it).
EVa5I7bHFq9mnYK 4 days ago [-]
C# vectors do a great job of simplifying those intrinsics in a safe and portable manner.
ashvardanian 4 days ago [-]
There are dozens of libraries, frameworks, and compiler toolchains that try to abstract away SIMD capabilities, but I don't think it's a great approach.
The only 2 approaches that still make sense to me:
A. Writing serial vectorization-aware code in a native compiled language, hoping your compiler will auto-vectorize.
B. Implementing natively for every hardware platform, as the ISA differences are too big to efficiently abstract away anything beyond 128-bit register float multiplication and addition.
This article is, in a way, an attempt to show how big the differences are even for simple data-parallel floating-point tasks.
neonsunset 4 days ago [-]
Numerics in .NET are not a high-level abstraction and do out of the box what many mature vectorized libraries end up doing themselves - there is significant overlap between NEON, SSE* and, if we overlook vector width, AVX2/512 and WASM's PackedSIMD.
.NET has roughly three vector APIs:
- Vector<T> which is platform-defined width vector that exposes common set of operations
- Vector64/128/256/512<T> which has wider API than the previous one
- Platform intrinsics - basically immintrin.h
Notably, platform intrinsics use the respective VectorXXX<T> types, which allows writing common parts of the algorithm in a portable way and applying platform intrinsics in specific areas where it makes sense. Also, some methods have 'Unsafe' and 'Native' variants that allow a vector to exhibit platform-specific behavior, like shuffles, since in many situations this is still the desired output for the common case.
The .NET compiler produces codegen for these that is competitive with GCC and sometimes Clang. It's gotten particularly good at lowering AVX512.
exDM69 4 days ago [-]
I will respectfully disagree with your statement, with the caveat that I mostly dabble in arithmetic with 128b/256b float and int vectors.
Using C or C++ with vector extensions (Gcc/Clang) or Rust (nightly) std::simd is very easy and you get code that is portable to different CPUs and ISAs.
But most importantly they have a zero cost fallback option to CPU-specific intrinsics when you need them. An f32x8 can be passed at zero cost as __m256 to any core::arch::x86_64::__mm_intrinsic (or xmmintrin.h in C++ land).
You gain portable arithmetic and swizzles and SIMD vector types, but lose nothing. Not having to write everything for x86_64 and aarch64 is a huge win even if doesn't quite cover everything.
Additionally you can use wider vectors than your hardware supports, the compiler is able to split your f64x64 to 128, 256 or 512 bit registers as needed depending on the compile target.
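A minimal C++ sketch of what I mean, assuming x86-64 with AVX (the Rust std::simd version looks much the same):

    #include <immintrin.h>

    // GCC/Clang builtin vector type: 8 floats, 32 bytes.
    typedef float f32x8 __attribute__((vector_size(32)));

    f32x8 mul_add_clamp(f32x8 a, f32x8 b, f32x8 c, f32x8 hi) {
      f32x8 d = a * b + c;  // portable arithmetic, no intrinsics needed
      // Zero-cost escape hatch: same-size vector types can be cast directly.
      return (f32x8)_mm256_min_ps((__m256)d, (__m256)hi);
    }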
Earw0rm 4 days ago [-]
The library approach does a pretty good job in conjunction with a good compiler, and sensible algorithm design.
You're writing C++ code but as if it was shader code.
I've seen impressive results with clang doing this sort of thing.
dzaima 4 days ago [-]
There's the middle-ground approach of having primarily target-specific operations but with intersecting ones named the same, and allowing easily building custom abstractions on top of such to paper over the differences how best it makes sense for the given application. That's the approach https://github.com/mlochbaum/Singeli takes.
There's a good amount of stuff that can clearly utilize SIMD without much platform-specificness, but doesn't easily autovectorize - early-exit checks in a loop, packed bit boolean stuff, some data rearranging, probing hashmap checks, some very-short-variable-length-loop things. And while there might often be some parts that do just need to be entirely target-specific, they'll usually be surrounded by stuff that doesn't (the loop, trip count calculation, loads/stores, probably some arithmetic).
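As a sketch of the early-exit case, assuming AVX2 and a GCC/Clang builtin for the trailing-zero count:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Find the first occurrence of `value`, scanning 32 bytes per iteration
    // and exiting early on the first hit - the kind of loop that rarely
    // autovectorizes.
    ptrdiff_t find_byte(const uint8_t* data, size_t n, uint8_t value) {
      const __m256i needle = _mm256_set1_epi8((char)value);
      size_t i = 0;
      for (; i + 32 <= n; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i*)(data + i));
        uint32_t mask =
            (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask) return (ptrdiff_t)(i + __builtin_ctz(mask));
      }
      for (; i < n; ++i)
        if (data[i] == value) return (ptrdiff_t)i;
      return -1;
    }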
kolbe 4 days ago [-]
I still like ispc, but that's not going to catch on.
marmaduke 3 days ago [-]
My approach to this is to write a bunch of tiny “kernels” which are obvious candidates for SIMD and then inline them all, and it does a pretty good job on x86 and arm
> Let's explore these challenges and how Mojo helps address them
You've not linked to or explained what Mojo is. There's also a lot going on with different products mentioned: Modular, Unum cloud, SimSIMD that are not contextualised either. While I'm at it, where do the others come in (Ovadia, Lemire, Lattner), you all worked on SimSIMD, I guess?
Mojo is a programming language that aims to target CPUs, GPUs and custom accelerators, created by Lattner, the same person behind LLVM and Clang.
It's based on a newer compiler framework that has been added to the LLVM umbrella of projects.
> MLIR is a newer compiler framework that allows Mojo to exploit higher level compiler passes unavailable in LLVM alone... It can often more effectively use certain types of CPU optimizations directly, like SIMD, with no direct intervention by a developer
The main problem is that there are no good abstractions in popular programming languages to take advantage of SIMD extensions.
Also, the feature set being all over the place (e.g. integer support is fairly recent) doesn't help either.
ISPC is a good idea, but the execution is meh... it's hard to set up and integrate.
Ideally you would want to be able to easily use this from other popular languages, like Java, Python, Javascript, without having to resort to linking a library written in C/C++.
Granted, language extensions may be required to approach something like that in an ergonomic way, but most somehow end up just mimicking what C++ does and expose a pseudo assembler.
pjmlp 4 days ago [-]
The best is the GPU programming approach, with specialised languages
Just like using SQL is much more sane than low level C APIs to handle BTree nodes.
The language extensions help, but code still requires too much low level expertise, with algorithms and data structures having to take SIMD/MIMD capabilities into account anyway.
markstock 2 days ago [-]
I'm surprised no one has mentioned Vc. I found ispc clunky and not as performant, and std::simd didn't support some useful math ops like rsqrt. Vc has been around for years, I have no trouble including it in my codes, it has masking and many of the most useful math ops, and I can get over 1 TF/s on a consumer-grade Ryzen and at least 3 TF/s on the big Epyc CPUs.
I think the EVE library for C++ is a great abstraction. It's got an unusual syntax using subscript operator overloading, but that winds up being a very ergonomic and flexible way to program with masked-SIMD.
secondcoming 4 days ago [-]
I’m not sure about EVE. I trialled it by trying to uppercase a string and even though I got it working in the end it was quite unpleasant. Their docs need to be better.
colejohnson66 3 days ago [-]
C#’s Vector<T> does a pretty great job.
Earw0rm 4 days ago [-]
std::experimental::simd is happening. It should be part of c++26.
janwas 4 days ago [-]
Unfortunately a bit late :) Highway reached v1.0 about 2.5 years ago.
How long would it take until Clang/GCC/MSVC are ready, and all users' distros have updated?
Not to mention that the number of ops provided by std::experimental::simd is extremely limited - basically only math operators, and zero support for shuffling/crypto/rounding/interleaving/table lookups which seem indispensable for many applications.
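A sketch of about the extent of what the TS interface covers, assuming libstdc++'s <experimental/simd>:

    #include <cstddef>
    #include <experimental/simd>

    namespace stdx = std::experimental;

    // Loads, element-wise math, and reductions are about all the TS gives you;
    // shuffles, interleaving and table lookups are not part of the interface.
    float dot(const float* a, const float* b, std::size_t n) {
      using V = stdx::native_simd<float>;
      V acc = 0.0f;
      std::size_t i = 0;
      for (; i + V::size() <= n; i += V::size()) {
        V va, vb;
        va.copy_from(a + i, stdx::element_aligned);
        vb.copy_from(b + i, stdx::element_aligned);
        acc += va * vb;
      }
      float total = stdx::reduce(acc);
      for (; i < n; ++i) total += a[i] * b[i];
      return total;
    }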
rishi_devan 4 days ago [-]
Interesting article. The article mentions "...the NumPy implementation illustrates a marked improvement over the naive algorithm...", but I couldn't find a NumPy implementation in the article.
andix 4 days ago [-]
Yes, they are really great at abstracting the SIMD operations, but the abstraction has only very few common methods. I'm not sure how much real world benefits those abstractions have.
Once you need more complex operations, you need to use the specific operations from System.Runtime.Intrinsics.(X86|ARM) based on the current architecture. And you need to adjust your implementation on the CPUs capabilities. There are still a lot of older x64 CPUs around that don't have AVX512 for example.
remram 3 days ago [-]
Did they write bfloat16 and bfloat32 when they meant float16 and float32?
Yeah I was really confused at first, pretty sure they messed up the labels.
ashvardanian 3 days ago [-]
Patched ;)
Agingcoder 4 days ago [-]
This is the first time I hear ‘hyperscalar’. Is this generally accepted ? ( I’ve been using SIMD since the MMX days so am a bit surprised )
dragontamer 4 days ago [-]
I don't think so.
Superscalar is a real term (multiple operations in one clock tick due to parallel pipelines within a core). But hyperscalar is cringe to me. There are tons of words describing SIMD already, it seems unclear why someone would make up a new word to describe an already existing concept.
Especially when a similar word (superscalar) already is defined and likely gets confused for this new word.
ashvardanian 4 days ago [-]
That may have been my mistake. I use super & hyper interchangeably and don't always notice :)
PS: Should be an easy patch, will update!
dragontamer 4 days ago [-]
Maybe not.
Superscalar is when say... Think of the following assembly code.
Add r1, r2
Sub r3, r4
And the add and subtract both happen on the same clock tick. The important thing is that a modern CPU core (and even GPU core) have multiple parallel ALU pipelines inside of them.
Because r1, r2, r3 and r4 are fully independent, a modern CPU can detect the potential parallelism here and act in parallel. After CPUs mastered this trick, out-of-order processors were invented next (which not only allowed for superscalar operations, but also allowed the subtract to execute first if for some reason the CPU core were waiting on r1 or r2).
There are a ton of ways that modern CPUs and GPUs extract parallelism from seemingly nothingness. And because all the techniques are independent, we can have superscalar out-of-order SIMD (like what happens in AVX512 in practice).
SIMD is... SIMD. It's one instruction applied to lots of data in parallel. It's totally different.
You really need to use the correct word for the specific kind of parallelism that you are trying to highlight. I expect that the only word that makes sense in this article is SIMD.
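A small C++ illustration of the distinction, assuming AVX2 for the SIMD half:

    #include <immintrin.h>

    // Superscalar / out-of-order: two independent scalar ops that the core's
    // parallel pipelines may issue in the same cycle, without the code saying so.
    int ilp_example(int r1, int r2, int r3, int r4) {
      int x = r1 + r2;  // add
      int y = r3 - r4;  // sub: no dependency on x
      return x ^ y;
    }

    // SIMD: one instruction explicitly applied to many data elements at once.
    __m256i simd_example(__m256i a, __m256i b) {
      return _mm256_add_epi32(a, b);  // eight 32-bit additions in one instruction
    }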
bjourne 4 days ago [-]
Agree with this. Calling SIMD superscalar is a misnomer since it is single instruction (multiple data) with very wide data paths. Superscalar implies multiple different instructions in parallel, such as adding a pair of numbers, while subtracting another pair (or even dividing).
pyrolistical 4 days ago [-]
I wish hardware exposed an API that allowed us to submit a tree of instructions so the hardware doesn't need to figure out which instructions are independent.
Lots of this kind of work can be done during compilation but cannot be communicated to hardware due to code being linear
dragontamer 4 days ago [-]
That's called VLIW and Intel Itanium is considered one of the biggest chip failures of all time.
There is an argument that today's compilers are finally good enough for VLIW to go mainstream, but good luck convincing anyone in today's market to go for it.
------
A big problem with VLIW is that it's impossible to predict L1, L2, L3 or DRAM access. Meaning all loads/stores are impossible to schedule by the compiler.
NVidia has interesting barriers that get compiled into its SASS (a level lower than PTX assembly). These barriers seem to allow the compiler to assist in the dependency management process, but ultimately they still require a decoder in the NVidia core at the final level before execution.
neerajsi 4 days ago [-]
Vliw is kind of the dual of what pyrolistical was asking for. Vliw lets you bundle instructions that are known to be independent rather than encode instructions to mark known dependencies.
VLIW is still in use in multiple DSP products on the market today, and they are good successful products in their niche.
They work very well if your code can be written as a loop without branches (or very limited branches) in the body, and a lot of instruction level parallelism in the body.
Unfortunately for Intel, most code doesn't look like that. But for most workloads that happen to also be a good case for SIMD, it is (can be) great.
IE our simd implementation allows you to scale across different architectures/ CPU revisions without having to rewrite assembly for each CPU processor?
Edit:
Rereading, that does not make much sense...
big-chungus4 4 days ago [-]
can the authors please share the numpy code too
ashvardanian 3 days ago [-]
There are several ways to implement it in NumPy, often resulting in 20% variance. I've added a reference implementation to my mirror of the blogpost and the Modular team will soon update the original posting as well: https://ashvardanian.com/posts/understanding-simd-complexity...
bob1029 4 days ago [-]
I see a lot of "just use the GPU" and you'd often be right.
SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate this information.
For most use cases, the GPU will win the trade off. But, there is a reason you don't hear much about systems like order matching engines using them.
dragontamer 4 days ago [-]
Despite my 'Use a GPU' post below, you are absolutely correct.
Maximizing performance on a CPU today requires all the steps in the above article, and the article is actually very well written with regards to the 'mindset' needed to tackle a problem such as this.
It's a great article for people aiming to maximize the performance on Intel or AMD systems.
------
CPUs have the memory capacity advantage and will continue to hold said advantage for the foreseeable future (despite NVidias NVLink and other techs to try to bridge the gap).
And CPU code remains far easier than learning CUDA, despite how hard these AVX intrinsics are in comparison to CUDA.
marmaduke 3 days ago [-]
> CPUs have the memory capacity advantage
perhaps more precisely, they also have quite an advantage on anything that needs, and plays nicely with, caches? When I sliced my problem to maximize cache usage, I saw pretty clear scalability with cores: L1/L2 cache bandwidth is ~30GB/s, so e.g. a 32-core system starts to compete with the big consumer GPUs.
dragontamer 3 days ago [-]
Caches between CPU and GPU are extremely complex.
Not only because caching is complex in CPU land, but because GPU has a completely different set of caches (and registers) that depends entirely on architecture.
Case in point: GPUs often have access to 256 architectural registers aka 1024-bytes of register space. Now this depends on how much occupancy your GPU code is targeting (maybe 4-occupancy?), but there's a lot you can do with even 64-registers (aka: 4-occupancy and 256-bytes of register space), and is key to making those blazing fast FP16 matrix-multiplication kernels "for AI" everyone's so hot about right now.
For something like FP16 matrix multiply (a very cache-friendly problem), the entire SIMD (all 32-lanes) of the GPU symmetric multiprocessor work together on the problem. So we're talking about an effective 32kB of __register space__ (let alone cache or other memory in the hierarchy).
Even before FP16 matrix-multiply instructions, this absurd register space advantage is why GPUs were the king of matrix multiplication.
--------
GPUs are worse at larger cache sizes, say 1MB or 2MB. At 1MB+, a modern CPU core's L2 cache can hold all of that (either Intel P-core, AMD Zen5, or even Intel E-core can hold many MB in L2).
GPUs have a secret though: a crossbar at the __shared__ memory level. Rearranging memory and data across your lanes can be done through this crossbar (including many-to-one reductions for atomics, or one-to-many broadcasts in just a single clock tick). So your GPU-lanes have incredible communication available to them (and this crossbar is the key to the ballot/voting-based horizontal compute of modern GPU styles). This is only ~64kB of space and shared between all GPU-lanes, but with 1024 lanes supporting communication it's an important element of GPU memory.
CPU L3 cache is very nice from a latency perspective, but bandwidth wise L3 cache is on the order of GPU's GDDR6x or HBM. 500GB/s to 2000GB/s, depending on the technology.
Finally, CPU DDR5 RAM may be the slowest in this discussion, but it's the biggest. 2TB+ Xeon servers aren't even that expensive and can be assumed for any serious tech firm these days (ie: all-RAM databases and whatnot).
---------
So at different sizes and different use-scenarios, GPUs and CPUs will trade places. I'd expect CPUs to win most red-black tree races, but GPUs will win matrix multiplication. Both take advantage of "cache" but in very different ways.
pclmulqdq 4 days ago [-]
You would be surprised. The GPU often loses even for small neural nets given the large latency. Anything that needs high throughput or is sized like an HPC problem should use a GPU, but a lot of code benefits from SIMD on small problems.
gmueckl 4 days ago [-]
If you run many small tasks on the GPU, you can increase throughput by overlapping transfers and computation. There may also be other ways to batch problems together, but that depends on the algorithms.
The one truly unfixable issue is round-trip latency.
gopalv 4 days ago [-]
> The GPU often loses even for small neural nets given the large latency
Apple's neural engine shows that you can live in between those two worlds.
As you said, the trouble is the latency, the programming model is still great.
moldavi 4 days ago [-]
Do Apple's chips (M1 etc) change this at all, since they share memory with the GPU?
one_even_prime 4 days ago [-]
Apple chips share the same physical memory between the GPU and the CPU. Still, they don't have USM/UVM (Unified Shared Memory/Unified Virtual Memory), that is, the GPU and the CPU can't access the same data concurrently and easily. Programs must map/unmap pages to control which device accesses it, and that's a very expensive operation.
tubs 4 days ago [-]
They don't need to be unmapped just for the other one to use it. source: I wrote GPU drivers for over 10 years.
Earw0rm 4 days ago [-]
Not much. Synchronisation of tasks is still a big overhead.
If you've got tens to hundreds of microseconds worth of workload, sure, get the GPU to do it.
But bear in mind 1000 clocks at 4GHz is 250ns, there's still a sizeable region where tight CPU/GPU integration isn't tight enough.
bob1029 4 days ago [-]
I think an argument could be made depending on the real world timings. How much closer in time is the Apple GPU vs one on a PCIe bus?
a1o 4 days ago [-]
[flagged]
ashvardanian 4 days ago [-]
May be a false positive. There were multiple passes of human writers working on the post - first, preparing the meat, and later reorganizing it for more casual readers.
benchmarkist 4 days ago [-]
Looks like a great use case for AI. Set up the logical specification and constraints and let the AI find the optimal sequence of SIMD operations to fulfill the requirements.
fooblaster 4 days ago [-]
No, there are decades of compiler literature for solving this problem.
benchmarkist 4 days ago [-]
That's even better then. Just let the AI read the literature and write the optimal compiler.
fooblaster 4 days ago [-]
It would probably be easier to clone the existing repository than get an llm to regurgitate llvm.
benchmarkist 4 days ago [-]
The AI would learn from llvm as well.
saagarjha 3 days ago [-]
I think your comments would be improved if you learned from LLVM first.
benchmarkist 2 days ago [-]
Why would I learn anything if we're going to have AGI in less than 3 years according to silicon valley luminaries like Sam Altman? He's rich because he's super smart and everything he says is correct so you sir should get with the program and start thinking how to logically specify tasks so that Sam Altman's AGI can solve it for you instead of telling me to learn LLVM.
saagarjha 2 days ago [-]
Well I want to be like Sam and he seems to know a lot of things so I figure I should learn more until I am as smart as him
benchmarkist 2 days ago [-]
Then you better stop wasting time on places like HN.
almostgotcaught 4 days ago [-]
lol so says every person that has no clue how hard (NP-hard) combinatorial optimization is.
benchmarkist 4 days ago [-]
For humans it's very hard but it will be a breeze for the AI. I thought HN was a community of builders. This is an obvious startup opportunity.
stouset 4 days ago [-]
All we have to do is ascribe magical properties to AI and we can solve anything as if P=NP!
benchmarkist 4 days ago [-]
Those distinctions are irrelevant for an AI because it is a pure form of intelligence that simply computes answers without worrying about P or NP complexity classes.
ConspiracyFact 4 days ago [-]
You had me going.
B-
imtringued 4 days ago [-]
That's the same as being confidently wrong.
bartekpacia 3 days ago [-]
high quality bait
Rendered at 00:59:17 GMT+0000 (Coordinated Universal Time) with Vercel.
If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs rather than being forced to write and rewrite in SSE vs AVX vs AVX512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up btw. But software advantage and long term benefits of rewriting into CUDA are heavily apparent.
Intel ISPC is a great project btw if you need high level code that targets SSE, AVX, AVX512 and even ARM NEON all with one codebase + auto compiling across all the architectures.
-------
Intels AVX512 is pretty good at a hardware level. But software methodology to interact with SIMD using GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
It's pretty funny how NEON ended up in there. A former Intel employee decided to implement it for fun and submitted it as a pull request, which Intel quietly ignored for obvious reasons, but then another former Intel employee who still had commit rights merged the PR, and the optics of publicly reverting it would be even worse than stonewalling so Intel begrudgingly let it stand (but they did revoke that devs commit rights).
https://pharr.org/matt/blog/2018/04/29/ispc-retrospective
The CUDA style of writing parallel programs is nothing else than the use of the so-called "parrallel do" a.k.a. "parrallel for" program structure, which has been already discussed in 1963. Notable later evolutions of this concept have been present in "Communicating Sequential Processes" by C.A.R. Hoare (1978-08: "arrays of processes"), then in the programming language Occam, which was designed based on what Hoare had described, then in the OpenMP extension of Fortran (1997-10), then in the OpenMP extension of C and C++ (1998-10).
Programming in CUDA does not bring anything new, except that in comparison e.g. with OpenMP some keywords are implicit and others are different, so the equivalence is not immediately obvious.
Programming for CPUs in the much older OpenMP is equivalent with programming in CUDA for GPUs.
The real innovation of NVIDIA has been the high quality of the NVIDIA CUDA compiler and CUDA runtime GPU driver, which are able to distribute the work that must be done on the elements of an array over all the available cores, threads and SIMD lanes, in a manner that is transparent for the programmer, so in many cases the programmer is free to ignore which is the actual structure of the GPU that will run the program.
Previous compilers for OpenMP or for other such programming language extensions for parallel programming have been much less capable to produce efficient parallel programs without being tuned by the programmer for each hardware variant.
I'm not calling it new. I'm just saying that the intrinsics style is much much harder than what Lisp*, DirectX HLSL, CUDA, OpenCL (etc. etc) does.
A specialized SIMD language makes writing SIMD easier compared to intrinsic style. Look at any CUDA code today and compare it to the AVX that is in the above article and it becomes readily apparent.
Larrabee was going to take over it all, as I enjoyed its presentation at GDCE 2009.
And then the Skylake and Cannon Lake debacle..
First they pulled it from the consumer chips a fairly short time before launch. Then the server chips it was present in would downclock aggressively when you did use it, so you could get at best maybe 40% more performance, certainly far from the 2x+ it promised.
Ten years on and the AMD 9950X does a pretty good job with it, however.
First announced in 2013, it eventually shipped five years later in only a single, crippled dual-core mobile SKU, which lasted just a year in the market before they killed it off.
"Let's put our only consumer implementation of our highest performing vector architecture on a lame-duck NUC chip.", good move guys.
So Larrabee lives on for... some reason. These E cores are well known to be modified Intel Atom cores and those were modified Xeon Phi cores which were Larrabee based.
Just with.... AVX512 being disabled. (Lost when Xeon Phi turned into Intel Atoms).
Intels technical strategy is completely bonkers. In a bad way. Intel invented all this tech 10 to 20 years ago but fails to have a cohesive strategy to bring it to market. There's clearly smart people there but somehow all the top level decisions are just awful
Ironically, AMD waited so long to implement AVX-512, but now has it on both server and mobile chips (natively and 256 bit emulation, respectively). Intel started the whole thing, has a very fragmented stack and is now preparing those E cores with even more new extensions.
Most importantly for Search and AI, it adds AVX_VNNI, which can be used for faster 8-bit integer dot-products: https://github.com/ashvardanian/SimSIMD/blob/75c426fb190a9d4...
Would be interesting to see how matrix multiplication throughput will differ between AVX-512-capable P cores and a larger quantity of AVX_VNNI-capable E cores!
Imagine planning 20 years in advance where Moore’s Law is still going strong. Come to think of it, Moore was also CEO of Intel lol
NVidia compilers would have compiled your code into something functional, but if you want to approach peak performance you need to at least tweak your kernels, and sometimes rewrite them from scratch. See for example the various MMA instructions that were introduced over time.
Edit: I see somebody made a similar comment and you addressed it. Sorry for the churn.
That's not true. For maximum performance you need to tweak the code to a particular GPU model/architecture.
Intel has SSE/AVX/AVX2/AVX512, but CUDA has like 10 iterations of this (increasing capabilities). Code written 15 years ago would not use modern capabilities, like more flexible memory access, atomics.
But CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc etc)
It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes so dramatically from generation to generation but the PTX intermediate code has stayed highly competitive.
Which is the same with PTX, right? If you didn't use the tensor core instructions or wavefront voting in the CUDA code, the PTX generated from it will not either, and NVIDIA will not magically add those capabilities in when compiling to SASS.
Maybe it remains competitive because the code is inherently parallel anyway, so it will naturally scale to fill the extra execution units of the GPU, which is where most of the improvement is generation to generation.
While AVX code can't automatically scale to use the AVX512 units.
In contrast, NVidia can go from 64-bit instruction bundles to 128-bit machine code (96-bit instruction + 32-bit control information) between Pascal (aka PTX Compute Capacity 5) and Voltage (aka PTX Compute Capacity 7) and all the old PTX code just autocompiles to the new assembly instruction format and takes advantage of all the new memory barriers added in Volta.
Having a PTX translation later is a MAJOR advantage for the NVidia workflow.
Load/Store are memory dependent in all architectures. So that's just a different story as CPUs and GPUs have completely different ideas of how caches should work. (CPUs aim for latency, GPUs for bandwidth + incredibly large register spaces with substantial hiding of latency thanks to large occupancies).
-------------
That being said: reorder buffers on CPUs are well over 400-instructions these days, with super-large cores (like Apple's M4) is apparently on the order of 600 to 800 instructions.
Reorder buffers are _NOT_ translation. They're Tomasulo's algorithm (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm). If you want to know how CPUs do out-of-order, study that.
I'd say CPUs have small register spaces (16 architectural registers, maybe 32), but large register files of maybe 300 or 400+. Tomasulo's algorithm is used to out-of-order access registers.
You should think of instructions like "mov rax, [memory]" as closer to "rax = malloc(register); delayed-load(rax, memory); Out-of-order execute all instructions that don't use RAX ahead of us in instruction stream".
Tomasulo's algorithm means using ~300-register file to _pretend_ to be just 16 architectural registers. The 300 registers keeps the data out-of-order and allows you to execute. Registers in modern CPUs are closer to unique_ptr<int> in C++, assigning them frees (aka: reorder buffer) and also mallocs a new register off the register-file.
The cost is that the binary carries around both AVX2 and AVX-512 codepaths, but that is not an issue IMO.
An issue with the abstractions that does not go away is that the optimal code architecture -- well above the level of the SIMD wrappers -- is dependent on the capabilities of the silicon. The wrappers can't solve for that. And if you optimize the code architecture for the silicon architecture, it quickly approximates writing architecture-specific intrinsics with an additional layer of indirection, which significantly reduces any notional benefit from the abstractions.
The wrappers can't abstract enough, and higher level abstractions (written with architecture aware intrinsics) are often too use case specific to reuse widely.
One counterexample: our portable vqsort [1] outperforms AVX-512-specific intrinsics [2].
I agree that high-level design may differ. You seem aware that Highway, and probably also other wrappers, supports specializing code for some target(s), but possibly misunderstand how, given the "additional layer of indirection" claim. Wrappers give you a portable baseline, and remove some of the potholes and ugly syntax, but boil down to inlined wrapper functions.
If you want to specialize, that is supported. And what is the downside? Even if you say the benefit of a wrapper is reduced vs manually written intrinsics (and reinventing all the workarounds for their missing instructions), do you not agree that the benefit is still nonzero?
[1]: https://github.com/google/highway/tree/master/hwy/contrib/so... [2]: https://github.com/Voultapher/sort-research-rs/blob/38f37eef...
Highway is (I feel not very controversially) kind of like a compiler but worse at its job. It’s not meant to be as general and it only targets a limited set of code, namely code that is annotated to vectorize well. But looking at it as a compiler is kind of useful: it’s supposed to make writing faster code easier and more automatic. Sometimes compilers are not able to do this, just as Highway can’t either. Maybe its design lacks the expressiveness to represent the algorithm people want. Perhaps it doesn’t quite lower to the optimal code. Maybe it turns out that so little of the operation maps to the constructs that a huge amount needs to go through the escape hatch that you offer, at which point it’s not really worth using the library anyway. In that situation, given an existing and friendly relationship, I would be happy to reach out. But this is a cost to me, because I need to simplify and generalize the thing I want. Then I hand it to you and you decide how you want to tackle it, if at all. All the while I’m waiting and I have code that needs to be written. This is a cost, and something that as an engineer I weigh against just using the intrinsics directly, which I know do exactly what I need but with higher upfront and maintenance costs. When you see someone write their own assembly instead of letting the compiler do it for them, they’re making their version of the same tradeoff.
> it’s supposed to make writing faster code easier and more automatic Agree with this viewpoint. I suppose that makes it compiler-like in spirit, though much simpler.
I also agree that waiting for input/updates is a cost. What still surprises me is that you seem to be able to do something differently with intrinsics, while believing this is not possible as a user of Highway. It is indeed possible to call _mm_fixupimm_pd(v1.raw, v2.raw, v3.raw, imm), and the rest of your code can be portable. I would be surprised if heavy usage were made of such escape hatches, but it's certainly interesting to discuss any cases that arise.
I do respect your decision, and that you make clear that raw intrinsics have higher upfront and maintenance costs. I suppose it's a matter of preference and estimating the return on the investment of learning the Highway vocabulary (=searching x86_128-inl.h for the intrinsic you know).
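To sketch what that escape hatch looks like on an x86 build (simplified; the specific intrinsic here is just a stand-in, and .raw is the underlying register type per x86_128-inl.h):

    #include <immintrin.h>
    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    // Mostly-portable Highway code that drops to one raw SSE intrinsic in the middle.
    void FloorTwoDoubles(double* p) {
        const hn::Full128<double> d;   // fixed 128-bit descriptor
        auto v = hn::Load(d, p);
        __m128d raw = v.raw;           // the underlying __m128d on x86 targets
        raw = _mm_floor_pd(raw);       // stand-in for an intrinsic the wrapper might lack
        _mm_storeu_pd(p, raw);
    }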
Personally, I find the proliferation of ISAs makes a clear case against hand-written kernels. But perhaps in your use case, x86 will continue to be the only target of interest. Fair enough.
Did you not read the article? It's using AVX intrinsics and NEON intrinsics.
When not busy unnecessarily rewriting everything for each ISA, it is easier to see and have time for vital optimizations such as unrolling :)
[1]: https://www.reddit.com/r/cpp/comments/1gzob1g/understanding_... [2]: https://github.com/google/highway/blob/master/hwy/contrib/do...
And the PTX-to-SASS compiler DOES do a degree of automatic fine-tuning between architectures. Nothing amazing or anything, but it's a minor speed boost that has made PTX an easier 'assembly-like language' to build on top of.
When you write a merge sort in CUDA, you can keep it across all versions. Maybe the new instructions can improve a few corner cases, but it's not like AVX to AVX512 where you need to rewrite everything.
Ex: https://github.com/NVIDIA/cub/blob/main/cub/device/device_me...
Now if your vectors are INT8/FP8 you’re supposed to shovel them into this accelerator via PCIe, rather than packing into registers for AVX512.
I wish they’d just pick an interface for vector ops and stick with it.
"too hard for mainstream programmers" seems overly pessimistic. I've run several workshops where devs have written dot-product kernels using Highway after 30 minutes of introduction.
What's your opinion on sycl?
https://www.khronos.org/sycl/
Thank you for saying it out loud. XLAT/XLATB of x86 is positively tame compared to e.g. vrgatherei16.vv/vrgather.vv.
As this would only use one lane, you could perhaps vectorize it if you have multiple of these to normalize.
From a SIMD perspective, it’s worth noting that on most platforms, the cost of computing one square root or two is the same. On modern x86 server CPUs, for instance, you can calculate up to 8 double-precision roots in parallel with identical latency. So there’s no additional cost in terms of performance.
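To illustrate with AVX-512 intrinsics (a minimal sketch; the helper name is made up):

    #include <immintrin.h>

    // Eight double-precision square roots from a single vsqrtpd instruction.
    void sqrt8(const double* in, double* out) {
        const __m512d v = _mm512_loadu_pd(in);
        _mm512_storeu_pd(out, _mm512_sqrt_pd(v));
    }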
I hope this sheds some light on the design of my code.
PS: In a previous life, I did research in Astro- and Plasma Physics. While I don’t claim to remember all the Math, it’s usually more productive to ask for clarification than to assume ignorance ;)
Good reminder for me and anyone else right there, nicely put.
The only 2 approaches that still make sense to me:
A. Writing serial vectorization-aware code in a native compiled language, hoping your compiler will auto-vectorize.
B. Implementing natively for every hardware platform, as the ISA differences are too big to efficiently abstract away anything beyond 128-bit float multiplication and addition.
This article is, in a way, an attempt to show how big the differences are, even for simple data-parallel floating-point tasks.
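For approach A, a minimal sketch of the kind of serial code one hopes the compiler will auto-vectorize (nothing guaranteed; it depends on the compiler, flags and target, and the saxpy name is just illustrative):

    #include <cstddef>

    // Independent iterations plus __restrict give the auto-vectorizer its best chance.
    void saxpy(float a, const float* __restrict x, float* __restrict y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }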
.NET has roughly three vector APIs:
- Vector<T>, a platform-defined-width vector that exposes a common set of operations
- Vector64/128/256/512<T>, which have a wider API than the previous one
- Platform intrinsics - basically immintrin.h
Notably, the platform intrinsics use the respective VectorXXX<T> types, which lets you write the common parts of an algorithm portably and apply platform intrinsics only in the specific areas where it makes sense. Also, some methods have 'Unsafe' and 'Native' variants that allow a vector to exhibit platform-specific behavior (like shuffles), since in many situations that is still the desired output for the common case.
.NET's compiler produces codegen for these that is competitive with GCC and sometimes Clang. It's gotten particularly good at lowering AVX512.
Using C or C++ with vector extensions (GCC/Clang) or Rust (nightly) std::simd is very easy, and you get code that is portable to different CPUs and ISAs.
But most importantly, they have a zero-cost fallback to CPU-specific intrinsics when you need them. An f32x8 can be passed at zero cost as an __m256 to any core::arch::x86_64 _mm intrinsic (or to the immintrin.h ones in C++ land).
You gain portable arithmetic, swizzles and SIMD vector types, but lose nothing. Not having to write everything for x86_64 and aarch64 is a huge win even if it doesn't quite cover everything.
Additionally, you can use wider vectors than your hardware supports; the compiler is able to split your f64x64 into 128-, 256- or 512-bit registers as needed, depending on the compile target.
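A rough C++ (GCC/Clang vector extension) analogue of that zero-cost fallback; since __m256 is itself a 32-byte float vector type, the cast is a reinterpretation rather than a copy (the type and function names here are illustrative):

    #include <immintrin.h>

    typedef float f32x8 __attribute__((vector_size(32)));

    // Portable arithmetic on the extension type, then one raw AVX intrinsic.
    f32x8 rsqrt_of_sum(f32x8 a, f32x8 b) {
        const f32x8 sum = a + b;                      // plain operators work on vector types
        return (f32x8)_mm256_rsqrt_ps((__m256)sum);   // zero-cost casts to/from __m256
    }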
You're writing C++ code but as if it was shader code.
I've seen impressive results with clang doing this sort of thing.
There's a good amount of stuff that can clearly utilize SIMD without much platform-specificness, but doesn't easily autovectorize - early-exit checks in a loop, packed bit boolean stuff, some data rearranging, probing hashmap checks, some very-short-variable-length-loop things. And while there might often be some parts that do just need to be entirely target-specific, they'll usually be surrounded by stuff that doesn't (the loop, trip count calculation, loads/stores, probably some arithmetic).
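For example, an early-exit byte search is exactly one of those patterns: trivial with intrinsics, yet most auto-vectorizers give up because of the exit inside the loop. A rough AVX2 sketch (helper name made up, tail handling omitted, __builtin_ctz assumes GCC/Clang):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Returns the index of the first `needle` byte, or -1. Assumes n is a multiple of 32.
    std::ptrdiff_t find_byte(const std::uint8_t* p, std::size_t n, std::uint8_t needle) {
        const __m256i vneedle = _mm256_set1_epi8((char)needle);
        for (std::size_t i = 0; i < n; i += 32) {
            const __m256i chunk = _mm256_loadu_si256((const __m256i*)(p + i));
            const int mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, vneedle));
            if (mask != 0) {                                   // the early exit
                return (std::ptrdiff_t)(i + (unsigned)__builtin_ctz((unsigned)mask));
            }
        }
        return -1;
    }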
https://github.com/maedoc/tvbk/blob/nb-again/src/util.h
You've not linked to or explained what Mojo is. There's also a lot going on with different products mentioned (Modular, Unum Cloud, SimSIMD) that are not contextualised either. While I'm at it, where do the others (Ovadia, Lemire, Lattner) come in? You all worked on SimSIMD, I guess?
That said, this is a great article, thanks.
Edit: Mojo is a programming language with python-like syntax, and is a product by Modular: https://github.com/modularml/mojo
It's based on a newer compiler framework that has been added to the LLVM umbrella of projects.
> MLIR is a newer compiler framework that allows Mojo to exploit higher level compiler passes unavailable in LLVM alone... It can often more effectively use certain types of CPU optimizations directly, like SIMD, with no direct intervention by a developer
https://www.wikipedia.org/wiki/Mojo_(programming_language)
Also, the feature set being all over the place (e.g. integer support is fairly recent) doesn't help either.
ISPC is a good idea, but the execution is meh... it's hard to set up and integrate.
Ideally you would want to be able to easily use this from other popular languages, like Java, Python, Javascript, without having to resort to linking a library written in C/C++.
Granted, language extensions may be required to approach something like that in an ergonomic way, but most somehow end up just mimicking what C++ does and exposing a pseudo-assembler.
Just like using SQL is much more sane than low level C APIs to handle BTree nodes.
The language extensions help, but code still requires too much low level expertise, with algorithms and data structures having to take SIMD/MIMD capabilities into account anyway.
https://github.com/VcDevel/Vc https://github.com/Applied-Scientific-Research/nvortexVc
Once you need more complex operations, you need to use the specific operations from System.Runtime.Intrinsics.(X86|ARM) based on the current architecture. And you need to adjust your implementation to the CPU's capabilities. There are still a lot of older x64 CPUs around that don't have AVX512, for example.
On the image: https://www.modular.com/blog/understanding-simd-infinite-com...
Superscalar is a real term (multiple operations in one clock tick due to parallel pipelines within a core). But hyperscalar is cringe to me. There are tons of words describing SIMD already; it's unclear why someone would make up a new word to describe an already existing concept.
Especially when a similar word (superscalar) already is defined and likely gets confused for this new word.
PS: Should be an easy patch, will update!
Superscalar is when, say... think of assembly along the lines of "add r1, r2" followed by "sub r3, r4",
and the add and the subtract both happen on the same clock tick. The important thing is that a modern CPU core (and even a GPU core) has multiple parallel ALU pipelines inside of it. Because r1, r2, r3 and r4 are fully independent, a modern CPU can detect the potential parallelism here and act on it in parallel. After CPUs mastered this trick, out-of-order processors were invented (which not only allow for superscalar operation, but also allow the subtract to execute first if for some reason the CPU core is waiting on r1 or r2).
There are a ton of ways that modern CPUs and GPUs extract parallelism from seemingly nothing. And because all these techniques are independent, we can have superscalar out-of-order SIMD (which is what happens with AVX512 in practice). But SIMD is... SIMD: one instruction applied to lots of data in parallel. It's a totally different thing.
You really need to use the correct word for the specific kind of parallelism that you are trying to highlight. I expect that the only word that makes sense in this article is SIMD.
Lots of this kind of work can be done during compilation, but it cannot be communicated to the hardware because the code is linear.
There is an argument that today's compilers are finally good enough for VLIW to go mainstream, but good luck convincing anyone in today's market to go for it.
------
A big problem with VLIW is that it's impossible to predict whether an access will hit L1, L2, L3 or DRAM, meaning loads/stores are impossible for the compiler to schedule.
NVidia has interesting barriers that get compiled into its SASS (a level lower than PTX assembly). These barriers seem to allow the compiler to assist in the dependency-management process, but ultimately a decoder in the NVidia core is still required as the final level before execution.
The idea pyrolistical mentioned is closer to explicit data graph execution: https://en.m.wikipedia.org/wiki/Explicit_data_graph_executio....
They work very well if your code can be written as a loop without branches (or very limited branches) in the body, and a lot of instruction level parallelism in the body.
Unfortunately for Intel, most code doesn't look like that. But for most workloads that happen to also be a good case for SIMD, it is (can be) great.
https://en.m.wikipedia.org/wiki/Hyperscale_computing
I.e. our SIMD implementation allows you to scale across different architectures / CPU revisions without having to rewrite assembly for each CPU?
Edit: Rereading, that does not make much sense...
SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate this information.
For most use cases, the GPU will win the trade off. But, there is a reason you don't hear much about systems like order matching engines using them.
Maximizing performance on a CPU today requires all the steps in the above article, and the article is actually very well written with regards to the 'mindset' needed to tackle a problem such as this.
It's a great article for people aiming to maximize the performance on Intel or AMD systems.
------
CPUs have the memory-capacity advantage and will continue to hold said advantage for the foreseeable future (despite NVidia's NVLink and other techs that try to bridge the gap).
And ordinary CPU code remains far easier than learning CUDA, even if these AVX intrinsics are harder than CUDA.
Perhaps, more precisely, they also have quite an advantage on anything that needs and plays nicely with caches? When I sliced my problem to maximize cache usage, I saw pretty clear scalability with cores: per-core L1/L2 cache bandwidth is ~30GB/s, so e.g. a 32-core system starts to compete with the big consumer GPUs.
Not only because caching is complex in CPU land, but because the GPU has a completely different set of caches (and registers) that depends entirely on the architecture.
Case in point: GPUs often have access to 256 architectural registers per lane, aka 1024 bytes of register space. Now this depends on how much occupancy your GPU code is targeting (maybe 4-occupancy?), but there's a lot you can do with even 64 registers (aka 4-occupancy and 256 bytes of register space), and this is key to making those blazing-fast FP16 matrix-multiplication kernels "for AI" everyone's so hot about right now.
For something like FP16 matrix multiply (a very cache-friendly problem), the entire SIMD unit (all 32 lanes) of the GPU streaming multiprocessor works together on the problem. So we're talking about an effective 32kB of __register space__ (let alone cache or other memory in the hierarchy).
Even before FP16 matrix-multiply instructions, this absurd register space advantage is why GPUs were the king of matrix multiplication.
--------
GPUs are worse at larger cache sizes, say 1MB or 2MB. At 1MB+, a modern CPU core's L2 cache can hold all of that (either Intel P-core, AMD Zen5, or even Intel E-core can hold many MB in L2).
GPUs have a secret though: a crossbar at the __shared__ memory level. Rearranging memory and data across your lanes can be done through this crossbar (including many-to-one reductions for atomics, or one-to-many broadcasts, in just a single clock tick). So your GPU lanes have incredible communication available to them (and this crossbar is the key to the ballot/voting-based horizontal compute of modern GPU styles). It's only ~64kB of space shared between all GPU lanes, but with 1024 lanes communicating through it, it's an important element of GPU memory.
CPU L3 cache is very nice from a latency perspective, but bandwidth-wise L3 is on the order of a GPU's GDDR6X or HBM: 500GB/s to 2000GB/s, depending on the technology.
Finally, CPU DDR5 RAM may be the slowest in this discussion, but it's the biggest. 2TB+ Xeon servers aren't even that expensive and can be assumed for any serious tech firm these days (i.e. all-RAM databases and whatnot).
---------
So at different sizes and in different use scenarios, GPUs and CPUs will trade places. I'd expect CPUs to win most red-black tree races, but GPUs will win matrix multiplication. Both take advantage of "cache", but in very different ways.
The one truly unfixable issue is round-trip latency.
Apple's neural engine shows that you can live in between those two worlds.
As you said, the trouble is the latency, the programming model is still great.
If you've got tens to hundreds of microseconds worth of workload, sure, get the GPU to do it.
But bear in mind 1000 clocks at 4GHz is 250ns; there's still a sizeable region where tight CPU/GPU integration isn't tight enough.