Writing high-performance matrix multiplication kernels for Blackwell (docs.jax.dev)
m00x 123 days ago [-]
This is essentially the same as https://siboehm.com/articles/22/CUDA-MMM

I wonder if the same person wrote it.

arjvik 124 days ago [-]
The interesting part is this is done in Pallas!

Seems like the Pallas of old has been completely upgraded
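
For anyone who hasn't played with it, a minimal Pallas matmul sketch looks roughly like this. The block sizes and names are mine, and this is the portable Pallas API from the quickstart, not the Mosaic GPU / Blackwell-specific path the article uses:

    import jax
    import jax.numpy as jnp
    from jax.experimental import pallas as pl

    def matmul_kernel(x_ref, y_ref, o_ref):
        # Each grid step sees one block of x, y, and o, selected by the
        # BlockSpec index maps below; the dot runs on that block alone.
        o_ref[...] = jnp.dot(x_ref[...], y_ref[...])

    def matmul(x, y, bm=128, bn=128):
        m, k = x.shape
        _, n = y.shape
        return pl.pallas_call(
            matmul_kernel,
            out_shape=jax.ShapeDtypeStruct((m, n), x.dtype),
            grid=(m // bm, n // bn),  # one program per (bm, bn) output tile
            in_specs=[
                pl.BlockSpec((bm, k), lambda i, j: (i, 0)),  # row block of x
                pl.BlockSpec((k, bn), lambda i, j: (0, j)),  # column block of y
            ],
            out_specs=pl.BlockSpec((bm, bn), lambda i, j: (i, j)),
            interpret=True,  # drop this to compile for a real TPU/GPU
        )(x, y)

    x = jnp.ones((256, 256), jnp.float32)
    print(matmul(x, x).shape)  # (256, 256)

The article's Blackwell kernels build on the same grid/BlockSpec idea but go through the Mosaic GPU backend to get at SMEM/TMEM control, collective MMA, and the other hardware features mentioned downthread.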

reasonableklout 124 days ago [-]
Pallas has a couple of backends; this is the new-ish Mosaic GPU one. AFAIU it provides a bunch of low-level APIs for interacting directly with NVIDIA-specific and new Blackwell features like SMEM, TMEM, collective MMA, etc.

What's interesting is that the MGPU team has achieved SOTA Blackwell GEMM performance before Triton (which IIUC is trying to bring up Gluon to reach the same level). All the big players are coming up with their own block-based low-level-ish DSLs for CUDA: OpenAI, NVIDIA, and now Google.

flakiness 123 days ago [-]
So OpenAI has Triton and Google has Pallas. What's the NVIDIA counterpart?
saagarjha 123 days ago [-]
Tilus/CUTLASS I assume
flakiness 123 days ago [-]
Interesting: https://github.com/NVIDIA/tilus Thanks for the pointer!