Oh hey, I wrote this. Been a long time. I had the lucky of break of working in machine translation / parsing when the most important invention of the century happened in my niche field.
I'm pretty interested in the intersection of code / ML. If that's your thing here are some other writing you might be interested in.
Recently moved back to industry, so haven't had a chance to write in a while.
srush 1 days ago [-]
Actually realize this is to the modern version not the original. So props to Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman who rewrote this one.
zabi_rauf 1 days ago [-]
I loved the GPU puzzles, after completing all of them, I wished there were more. Learnt a bunch in process.
tough 1 days ago [-]
This is aawesome, thanks for the links and the write ups!
internetguy 2 days ago [-]
wow - this is really well made! i've been doing research w/ Transformer-based audio/speech models and this is made with incredible detail. Attention as a concept itself is already quite unintuitive for beginners due to is non-linearity, so this also explains it very well
roadside_picnic 2 days ago [-]
> Attention as a concept itself is already quite unintuitive
Once you realize that Attention is really just a re-framing of Kernel Smoothing it becomes wildly more intuitive [0]. It also allows you to view Transformers as basically learning a bunch of stacked Kernels which leaves them in a surprisingly close neighborhood to Gaussian Processes.
> I'd be grateful for any pointers to an example where system developers (or someone else in a position to know) have verified the success of a prompt extraction.
You can try this yourself with any open source llm setup that lets you provide a system prompt no? Just give it a prompt, ask the model the prompt ,and see if it matches.
gpt-oss is trained to refuse so it wont share (you can provide system prompt on lmstudio)
adityamwagh 2 days ago [-]
It’s a very popular article that has been around for a long time!
gdiamos 2 days ago [-]
It's so good it is worth revisiting often
ActorNightly 1 days ago [-]
When getting to the attention part, I really wish people stop describing it as Key Query Value. There is nothing special about Key or Query or Value in the sense of their implied function in the transformer. The KQV matrices themselves are computed by multiplying the input vector by learned weights, which are arbitrarily random matrices that come together in the end to the correct result. I.e it doesn't matter if you have 26 or 34 for a final result of 12.
The thing that makes transformers work is multi dimensionality in the sense that you are multiplying matricies by matricies instead of computing dot products on vectors. And because matrix multiplication is effectively sums of dot products, you can represent all of the transformer as wide single layer perceptron sequences (albeit with a lot of zeros), but mathematically they would do the same thing.
gchadwick 1 days ago [-]
I'd disagree as the K, Q and V have distinct functions within the attention calculation. In particular when you're considering decode (next token calculation during inference which follows the initial prefill stage that processes the prompt). For decode you have a single Q vector (relating to the in progress token) and multiple K and V vectors (your context, i.e. all tokens that have already been computed).
> you can represent all of the transformer as wide single layer perceptron sequences
This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input. The same mathematical function can be used to compute attention however there are no static weights. You've got your attention scores on one side and the V matrix on the other side.
Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism given they inherently have static weights and they can't directly multiply two inputs (or directly multiply two internal activations). Given an MLP is a general function approximater I guess a sufficiently large number of them could get close enough?
ActorNightly 21 hours ago [-]
>For decode you have a single Q vector (relating to the in progress token) and multiple K and V vectors (your context, i.e. all tokens that have already been computed).
Sure, but the K/V matrices are pretty much arbitrary weights, and so is the Q vector since its derived from the multiplication of the input vector by a learned matrix.
The thing im trying to convey is that the nomeclature of Key/Query/Value doesn't mean anything, so when people learn about transformers, they don't need to understand that those matricies correspond to some predefined structure that maps to the data in a specific way. You can have 2 identical model initialized with random values and trained on same datasets, and end up with different KQV matricies for the same input.
>This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input.
K/Q/V are all derived by multiplying the input vector by static learned weights, then those are all multiplied together in the attention calculation. Basically just a whole bunch of dot products. You would just have flattened matricies with intermediate layers acting as accumulators.
>Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism
It is. It won't be attention in a classical sense, it would just be extra connections on a wider dimension single layer stacks. The learning process would put the right values in the correct place.
Rendered at 14:00:36 GMT+0000 (Coordinated Universal Time) with Vercel.
I'm pretty interested in the intersection of code / ML. If that's your thing here are some other writing you might be interested in.
* Thinking about cuda: http://github.com/srush/gpu-puzzles
* Tensors considered harmful: https://nlp.seas.harvard.edu/NamedTensor
* Differentiating SVG: https://srush.github.io/DiffRast/
* Annotated S4: https://srush.github.io/annotated-s4/
Recently moved back to industry, so haven't had a chance to write in a while.
Once you realize that Attention is really just a re-framing of Kernel Smoothing it becomes wildly more intuitive [0]. It also allows you to view Transformers as basically learning a bunch of stacked Kernels which leaves them in a surprisingly close neighborhood to Gaussian Processes.
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
> I'd be grateful for any pointers to an example where system developers (or someone else in a position to know) have verified the success of a prompt extraction.
You can try this yourself with any open source llm setup that lets you provide a system prompt no? Just give it a prompt, ask the model the prompt ,and see if it matches.
gpt-oss is trained to refuse so it wont share (you can provide system prompt on lmstudio)
The thing that makes transformers work is multi dimensionality in the sense that you are multiplying matricies by matricies instead of computing dot products on vectors. And because matrix multiplication is effectively sums of dot products, you can represent all of the transformer as wide single layer perceptron sequences (albeit with a lot of zeros), but mathematically they would do the same thing.
> you can represent all of the transformer as wide single layer perceptron sequences
This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input. The same mathematical function can be used to compute attention however there are no static weights. You've got your attention scores on one side and the V matrix on the other side.
Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism given they inherently have static weights and they can't directly multiply two inputs (or directly multiply two internal activations). Given an MLP is a general function approximater I guess a sufficiently large number of them could get close enough?
Sure, but the K/V matrices are pretty much arbitrary weights, and so is the Q vector since its derived from the multiplication of the input vector by a learned matrix.
The thing im trying to convey is that the nomeclature of Key/Query/Value doesn't mean anything, so when people learn about transformers, they don't need to understand that those matricies correspond to some predefined structure that maps to the data in a specific way. You can have 2 identical model initialized with random values and trained on same datasets, and end up with different KQV matricies for the same input.
>This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input.
K/Q/V are all derived by multiplying the input vector by static learned weights, then those are all multiplied together in the attention calculation. Basically just a whole bunch of dot products. You would just have flattened matricies with intermediate layers acting as accumulators.
>Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism
It is. It won't be attention in a classical sense, it would just be extra connections on a wider dimension single layer stacks. The learning process would put the right values in the correct place.