I've been playing around with some low-level GPU stuff lately and trying to port some of my CPU-bound matrix multiplication code. The challenge has been that in the space I work in (quantum), the matrices are of complex numbers, and for a certain size of problem floats don't cut it and you need doubles. Nearly everything I find for GPUs is targeted at matrices of real floats.
Any pointers to examples? I'd be fine sticking with floats as a first step, but would love to see some (reasonably optimized) low-level GPU code for working with matrices of complex numbers (preferably for Metal or WGPU, which is what I'm using).
pythomancer 17 days ago [-]
In BLAS terminology this is usually called CGEMM (for single precision) or ZGEMM (for double precision). Both cuBLAS and rocBLAS support ZGEMM, and the latter is open source. rocBLAS is also pretty complicated, though, and perhaps not such a good learning resource. Here is a more readable library that implements CGEMM, or at least a similar operation: https://git.astron.nl/RD/recruit/ccglib.
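To make the semantics concrete, here is a naive CPU reference for what ZGEMM computes (my own minimal types and a row-major layout, not BLAS's actual API); it's also the loop you'd start from when porting to a GPU:

    // Naive reference for C = alpha*A*B + beta*C over complex doubles.
    // Row-major, no blocking; a real BLAS kernel adds tiling, vectorization, etc.
    #[derive(Clone, Copy)]
    struct C64 { re: f64, im: f64 }

    impl C64 {
        fn mul(self, o: C64) -> C64 {
            C64 { re: self.re * o.re - self.im * o.im,
                  im: self.re * o.im + self.im * o.re }
        }
        fn add(self, o: C64) -> C64 {
            C64 { re: self.re + o.re, im: self.im + o.im }
        }
    }

    fn zgemm_ref(m: usize, n: usize, k: usize,
                 alpha: C64, a: &[C64], b: &[C64],
                 beta: C64, c: &mut [C64]) {
        for i in 0..m {
            for j in 0..n {
                let mut acc = C64 { re: 0.0, im: 0.0 };
                for p in 0..k {
                    acc = acc.add(a[i * k + p].mul(b[p * n + j]));
                }
                c[i * n + j] = alpha.mul(acc).add(beta.mul(c[i * n + j]));
            }
        }
    }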
The main issue is that double precision is not so interesting for AI and graphics, so silicon is spent on those features rather than on double precision. Not so for HPC, though, and GPUs specialized for it usually have much better fp64 throughput. For example, the AMD MI210 has the same performance in single- and double-precision (matrix) operations, while graphics GPUs either run fp64 at something like 1/2, 1/4, or 1/16 the fp32 rate, or have no fp64 support at all.
For Rust GPU, nothing is built in, but there are libraries like https://github.com/rust-num/num-complex that support `no_std` and should work on the GPU. I've never used them, so I don't know what (if any) the perf hit would be.
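As a rough sketch of what using it would look like (untested on the GPU side; `no_std` needs num-complex built with `default-features = false`):

    use num_complex::Complex32;

    // Inner loop of a complex matmul: one complex dot product.
    // Each complex multiply is 4 real multiplies + 2 adds, so expect
    // roughly 4x the flops of the real-valued equivalent.
    fn cdot(a: &[Complex32], b: &[Complex32]) -> Complex32 {
        let mut acc = Complex32::new(0.0, 0.0);
        for i in 0..a.len().min(b.len()) {
            acc += a[i] * b[i];
        }
        acc
    }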
taminka 17 days ago [-]
i’m always confused wrt how shader languages that target a bunch of other framework-specific shader languages manage to map all the available features correctly?
like tensor core support, non-uniform thread groups, different data type support, a bunch of simd group functions, stuff like that…
i couldn’t find info wrt this in rust gpu, so i’m assuming it just tries to target the narrowest available feature set that’s compatible across all shading languages?
LegNeato 17 days ago [-]
The goal is to handle things similarly to how Rust on the CPU does (shared traits with different implementations, intrinsics, platform-specific modules, user-replaceable traits like alloc or the panic handler, and `asm!` as a last-ditch case).
We already have a lot (many intrinsics, an `arch` module like `std::arch`, `asm!` to include raw SPIR-V, etc). For example, here are the intrinsics: https://rust-gpu.github.io/rust-gpu/api/spirv_std/arch/index... and here is support for ray tracing, which obviously is not on every card: https://rust-gpu.github.io/rust-gpu/api/spirv_std/ray_tracin...
Vulkan has a way to query and specify different GPU capabilities, and Rust-GPU uses that.
Rust and Vulkan have many of the tools we need for progressive enhancement; we are not focused on the lowest common denominator.
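As a sketch of the trait-and-capability pattern (my own hypothetical names here, not rust-gpu's actual API): a portable implementation everyone gets, plus a specialized one behind a capability check, something like:

    // Progressive enhancement via traits: the host queries the device
    // (e.g. through Vulkan) and picks the best available implementation.
    trait Reduce {
        fn sum(&self, xs: &[f32]) -> f32;
    }

    struct Portable;      // lowest common denominator, works everywhere
    struct SubgroupAccel; // only when the device reports subgroup support

    impl Reduce for Portable {
        fn sum(&self, xs: &[f32]) -> f32 {
            xs.iter().copied().sum() // plain sequential loop
        }
    }

    impl Reduce for SubgroupAccel {
        fn sum(&self, xs: &[f32]) -> f32 {
            // on the GPU this would lower to a subgroup/wave intrinsic;
            // shown as the same loop here to keep the sketch runnable
            xs.iter().copied().sum()
        }
    }

    fn pick(device_has_subgroups: bool) -> &'static dyn Reduce {
        if device_has_subgroups { &SubgroupAccel } else { &Portable }
    }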
cosmic_quanta 17 days ago [-]
I'm not familiar with GPUs specifically, but I have seen this for ORMs that support multiple SQL dialects (e.g. [0]).
A great technique is called 'tagless final encoding' [1]. Using this technique, you can specify capabilities of an embedded domain-specific language (eDSL) such that you can have a shared (but narrow) common set of features, while allowing specializations of this eDSL to support extra features.
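In Rust terms the core idea looks roughly like this (my own toy names, not taken from the article):

    // Tagless final: the "narrow" language is one trait, extra
    // capabilities are extra traits, and programs are generic over
    // whichever interpreter implements what they actually use.
    trait CoreSym {
        type Repr;
        fn lit(&self, x: f64) -> Self::Repr;
        fn add(&self, a: Self::Repr, b: Self::Repr) -> Self::Repr;
    }

    // Optional capability; only some interpreters provide it.
    trait MulSym: CoreSym {
        fn mul(&self, a: Self::Repr, b: Self::Repr) -> Self::Repr;
    }

    struct Eval; // direct evaluator: Repr is just f64

    impl CoreSym for Eval {
        type Repr = f64;
        fn lit(&self, x: f64) -> f64 { x }
        fn add(&self, a: f64, b: f64) -> f64 { a + b }
    }

    impl MulSym for Eval {
        fn mul(&self, a: f64, b: f64) -> f64 { a * b }
    }

    // Requires the `mul` capability; an interpreter without MulSym
    // can't run this program, and that's checked at compile time.
    fn square_plus_one<S: MulSym>(s: &S, x: S::Repr) -> S::Repr
    where S::Repr: Copy {
        s.add(s.mul(x, x), s.lit(1.0))
    }

A pretty-printer or a backend-specific code generator would just be another interpreter implementing the same traits.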
It's mentioned near the beginning of the linked article:
> These Rust GPU programs are then compiled into SPIR-V, a low-level format that most GPUs understand
[0]: https://github.com/haskell-beam/beam
[1]: https://nrinaudo.github.io/articles/tagless_final.html
taminka 17 days ago [-]
so it’s just a matter of rust gpu not yet supporting these features?
fulafel 17 days ago [-]
The GPU programming language tech landscape is generally pretty low-tech on the compilers side.
Tensor cores are not supported.
(Culturally, of course, the big one is the fragmentation and the proprietary nature of everything, which is why so little gets done on GPUs and why attempting multiplatform software there is such a horror.)
creata 17 days ago [-]
> Tensor cores are not supported.
Correct me if I'm wrong or misunderstanding you, but doesn't SPIR-V support tensor cores via SPV_KHR_cooperative_matrix?
fulafel 16 days ago [-]
I don't know, do share if there's something you can link.
At first blush it just sounds like something to allow multiple shader compute elements to work more efficiently together ("cooperate") on a single bigger matrix computation.
creata 15 days ago [-]
That is what it is afaict, but NVIDIA says this in their 2019 post "Machine Learning Acceleration in Vulkan with Cooperative Matrices"[0]:
> Additionally, if the GPU includes dedicated hardware for high-speed matrix operations, such as the Tensor Cores on Turing GPUs, then the Cooperative Matrix extension can tap into the power of this acceleration with no application changes.
The benchmark graph doesn't look too great, though - around half the "theoretical peak tensor core performance".
[0]: https://developer.nvidia.com/blog/machine-learning-accelerat...
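If you want to check at runtime whether a device exposes it, something like this with the ash crate should work (the extension name is from the Vulkan registry; I haven't tested this exact snippet):

    use ash::vk;

    // VK_KHR_cooperative_matrix is the Vulkan extension that exposes
    // SPV_KHR_cooperative_matrix to SPIR-V modules; availability
    // depends on the GPU and driver.
    fn has_coop_matrix(instance: &ash::Instance, pdev: vk::PhysicalDevice) -> bool {
        let exts = unsafe {
            instance
                .enumerate_device_extension_properties(pdev)
                .unwrap_or_default()
        };
        exts.iter().any(|e| {
            // extension_name is a fixed-size, NUL-terminated C string
            let name = unsafe { std::ffi::CStr::from_ptr(e.extension_name.as_ptr()) };
            name.to_str() == Ok("VK_KHR_cooperative_matrix")
        })
    }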