Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲ThunderKittens: Simple, fast, and adorable AI kernels (hazyresearch.stanford.edu)

87 points by lnyan 243 days ago | 18 comments

danielhanchen 243 days ago [-]

This is super cool! Especially matrix mult getting similar or better perf than cuBLAS! If anyone is interested on other kernels like swiglu, geglu, RMS layernorm, I coded some at https://github.com/unslothai/unsloth/tree/main/unsloth/kerne...

zaptrem 242 days ago [-]

Neat! Are these compatible with torch compile?

convexstrictly 243 days ago [-]

CUDA + ThunderKittens 4.5 hour tutorial

https://www.youtube.com/watch?v=xcpEl0cGCC4

mynameismon 243 days ago [-]

How easy is it to run on older GPUs (think 1080Tis)? The reason I ask this is because torch.compile refuses to support that, and that alone makes things much slower.

danielhanchen 243 days ago [-]

The other issue is Pascal cards don't have tensor cores, so there much slower than those with them. You could try Unsloth for 2x faster llama fine-tuning - someone made P40s and P100s work. Although I would suggest upgrading to at least RTX 20x series.

formalsystem 242 days ago [-]

The project is very much focused on maxing out tensor cores and since older GPUs don’t have them it’s not where the project shines best

almostgotcaught 243 days ago [-]

> torch.compile

torch.compile is a pt2.0 feature and has nothing to do with handwritten cuda kernels

> How easy is it to run on older GPUs

this is a torch cpp extension

https://github.com/HazyResearch/ThunderKittens/blob/8daffc9c...

so you're going to have the same exact issue (whatever issue you're having)

zackangelo 242 days ago [-]

I’m working on an inference platform that allows for tokens to be appended to the context after some tokens have been generated. If there’s other sequences in the batch, it means they’ll have to be padded. Currently this means I can’t use FlashAttention because it doesn’t support arbitrary masks/padding masks… can ThunderKittens help me?

boywitharupee 242 days ago [-]

so, these are hand optimized primitives for specific model of nvidia gpus? do you still have to make launch/scheduling decisions to maximize occupancy? how does this approach scale to other target devices with specialized instruction sets and different architecture?

quikoa 242 days ago [-]

"Coming soon -- ThunderKittens on AMD hardware!"

Any update on this?

simarora777 242 days ago [-]

hi! We're the devs - we're planning the livestream for 1pm and we'll post the link here, twitter, and in the discord tonight

Archit3ch 243 days ago [-]

I hate to be that guy, but Metal support?

simarora777 242 days ago [-]

coming!

6gvONxR4sf7o 241 days ago [-]

I'm late to the party, but also wondering about Metal, and I have a question for you all. Do you happen to know how energy use relates to utilization? If I run a super duper duper fast thunder kitten kernel on an iphone (pending Metal, blah blah), would I expect it to also cost less battery? My naive guess is yes, but without logic to back it up.

It would be wild if some of these time-efficiency boosts you're getting with TK turned out to be energy-efficiency boosts too!

pama 243 days ago [-]

I dont want to use the Platform Formerly Known as Twitter, but does anyone have a way to get the link to their livestream tomorrow?

convexstrictly 243 days ago [-]

Simran Arora: "Join us for a livestream this Thursday, Halloween/Diwali, and join our channel on the GPU Mode Discord server to hang out with us/get involved:"

https://discord.com/login?redirect_to=%2Fchannels%2F11894982...

simarora777 242 days ago [-]

Livestream link: https://youtube.com/live/IAwLzkldxUk?feature=share! Come ask questions!

pama 242 days ago [-]

Thanks!

Rendered at 10:48:25 GMT+0000 (Coordinated Universal Time) with Vercel.