FlashAttention-T: Towards Tensorized Attention (dl.acm.org)
simianwords 8 minutes ago [-]
OT, but instead of quadratic attention, can we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine that there might be some questions that are worth paying a higher cost for.

I feel like n^10 attention could capture patterns that lower-complexity attention may not. So it seems arbitrary that we have n^2 attention.
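For reference, the n^2 in standard attention comes from the pairwise score matrix: every query attends to every key, so the scores form an n-by-n array. A minimal NumPy sketch of scaled dot-product attention makes the quadratic term visible (illustrative only, not any particular library's implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention over n tokens of dim d."""
    n, d = Q.shape
    # The n x n score matrix is the source of the quadratic cost.
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```

Raising the exponent would mean scoring larger tuples of tokens jointly (e.g. all triples is already n^3), which is why cost grows so quickly beyond pairwise interactions.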

sigbottle 13 minutes ago [-]
Oh wow, there's still work being done on Ampere?

I was wondering - I've been thinking about switching to AI systems programming (I know, easy task), but from what I understand, industry cloud GPUs are the main target, right? Nobody's going to pay me (assuming I even had the skills) to optimize for consumer GPUs?

From what I understand, it's not just core count + memory capacity + raw performance; it's the literal core primitives. I don't think any of the consumer "Blackwell" chips like the Grace one or the RTX 5090 have, for example, SM pairs in their ISA? And there are similar fundamental differences between consumer and cloud Hopper (where the majority of the perf lives in the cloud part's ISA?)

So I guess I'm wondering: should I buy a GPU myself, or just rent on the cloud, if I want to start getting some experience in this field? How do you even get experience in this normally, anyway - do you get into really good schools and into their AI labs, which have a lot of funding?

semiinfinitely 1 hour ago [-]
Tri Dao isn't on the paper; is it even allowed to call it "FlashAttention"???
saagarjha 1 hour ago [-]
Less annoying link directly to the paper: https://dl.acm.org/doi/pdf/10.1145/3774934.3786425?download=...
SpaceManNabs 60 minutes ago [-]
link if you don't want to automatically download files

https://dl.acm.org/doi/pdf/10.1145/3774934.3786425

measurablefunc 2 hours ago [-]
[flagged]
dheera 1 hour ago [-]
"Most people" didn't figure this out either; the top 0.01% did.
E-Reverance 1 hour ago [-]
I also wouldn't be surprised if they used AI to assist them in small ways
measurablefunc 1 hour ago [-]
You're just moving the goalposts & not addressing the question I asked. Why isn't AI optimizing the kernels in its own code the way people have been optimizing them, as in the posted paper?
phkahler 1 hour ago [-]
It will, right after it reads the paper.
measurablefunc 1 hour ago [-]
I read the paper. All the prerequisites are already available in the existing literature, & they basically profiled & optimized around the bottlenecks to avoid pipeline stalls with instructions that utilize the available tensor & CUDA cores. Seems like something these super duper AIs that don't get tired should be able to do pretty easily.
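For context on what such a kernel computes: the FlashAttention family tiles the key/value sequence into blocks and uses an online softmax so the full n-by-n score matrix is never materialized. A NumPy sketch of that tiling idea (an illustration of the algorithm, not the paper's GPU kernel, and the block size here is arbitrary):

```python
import numpy as np

def tiled_attention(Q, K, V, block=16):
    """Attention via blockwise online softmax (FlashAttention-style tiling)."""
    n, d = Q.shape
    out = np.zeros((n, d))          # unnormalized running output
    m = np.full(n, -np.inf)         # running row-wise max of scores
    l = np.zeros(n)                 # running softmax normalizer
    for j in range(0, n, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)   # (n, block) partial score tile
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        scale = np.exp(m - m_new)   # rescale old accumulators to the new max
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vj
        m = m_new
    return out / l[:, None]
```

Each tile's scores are folded into running max/normalizer accumulators, which is exactly the trick that lets the real kernels keep everything in fast on-chip memory.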