> We kept a “lab notebook” of all our experiments in Notion
Couldn't find a link to this, is this public?
schopra909 17 hours ago
Not public yet — we’re going to clean it up so it’s readable and release it as blog posts. The first one will be everything you need to know about building a VAE for image and video. It should be out in a few weeks. We’re figuring out the right balance between time spent writing and all the work we have on our plate for the next model.
If you’re interested in this stuff, keep an eye on field notes (our blog).
1) Free up the T5 as soon as the text is encoded, so you reclaim GPU RAM
2) Manual layer offloading: move layers off the GPU once they're done being used, to free up space for the remaining layers + activations (see the sketch after this list)
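Roughly, both tricks look like this in PyTorch. A minimal sketch, assuming placeholder names (text_encoder, input_ids, blocks) rather than the actual identifiers in the repo:

    import gc
    import torch

    # 1) Encode the prompt, then drop every reference to the T5 so its
    # weights can be garbage-collected off the GPU.
    prompt_embeds = text_encoder(input_ids).last_hidden_state
    del text_encoder
    gc.collect()
    torch.cuda.empty_cache()

    # 2) Manual layer offloading: weights live on CPU; move each block
    # to the GPU only for its own forward pass, then send it back.
    def offloaded_forward(blocks, hidden_states):
        for block in blocks:
            block.to("cuda")
            hidden_states = block(hidden_states)
            block.to("cpu")
        torch.cuda.empty_cache()
        return hidden_states

The offloading trades speed for memory: every block crosses the PCIe bus once per forward pass.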
dsrtslnd23 1 day ago
Any idea on the minimum VRAM footprint with those tweaks? 20GB seems high for a 2B model. I guess the T5 encoder is responsible for that.
schopra909 1 day ago
The T5 encoder is ~5B parameters, so back of the envelope that's ~10 GB of VRAM (it's in bfloat16). So 360p should take ~15 GB of VRAM (+/- a few GB based on the duration of the video generated).
We can update the code over the next day or two to provide an option to delete the T5 after the text encoding is computed (to save on VRAM). Then we'll report back the GB consumed for 360p and 720p, 2-5 second videos, on GitHub so there are more accurate numbers.
Beyond the 10 GB from the T5, there's just a lot of VRAM taken up by the context window of 720p video (even though the model itself is 2B parameters).
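For the arithmetic behind those numbers (bfloat16 is 2 bytes per parameter):

    t5_params  = 5e9   # ~5B text encoder
    dit_params = 2e9   # ~2B video model
    print(t5_params  * 2 / 1e9)   # 10.0 GB for the T5
    print(dit_params * 2 / 1e9)   #  4.0 GB for the video model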
storystarling 23 hours ago
The 5B text encoder feels disproportionate for a 2B video model. If the text portion is dominating your VRAM usage, it really hurts the inference economics.
Have you tried quantizing the T5? In my experience you can usually run these encoders in 8-bit or even 4-bit with negligible quality loss. Dropping that memory footprint would make this much more viable for consumer hardware.
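For reference, a minimal sketch of how that usually looks with transformers + bitsandbytes; the checkpoint name is a guess at a ~5B T5 variant, not necessarily what Linum ships:

    from transformers import BitsAndBytesConfig, T5EncoderModel

    # 8-bit roughly halves the ~10 GB bf16 footprint; 4-bit quarters it.
    quant_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True
    text_encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl",             # assumed checkpoint, ~5B params
        quantization_config=quant_config,
        device_map="auto",
    )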
schopra909 16 hours ago
All that being said, you can just delete the T5 from memory after encoding the text to save on memory.
The 2B parameters will take up 4 GB of memory, but activations will be a lot more given the size of the context window for video.
A 720p, 5-second video is roughly 100K tokens of context.
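To see where a number like that comes from, here's a back-of-the-envelope count; the downsampling and patch factors are typical video-VAE/DiT settings, assumed rather than Linum's actual config:

    h, w = 720, 1280
    frames = 24 * 5                     # 24 fps for 5 seconds (assumed)
    spatial_ds, temporal_ds = 8, 4      # assumed VAE downsampling
    patch = 2                           # assumed DiT patchify size

    latent_h, latent_w = h // spatial_ds, w // spatial_ds   # 90 x 160
    latent_t = frames // temporal_ds                        # 30 latent frames
    tokens = (latent_h // patch) * (latent_w // patch) * latent_t
    print(tokens)   # 108000 -- roughly the "100K tokens" above

Attention over that many tokens is why the activations, not the 4 GB of weights, dominate the memory bill.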
schopra909 17 hours ago
Great idea! We haven’t tried it but def interested to see if that works as well.
When we started down this path, T5 was the standard (back in 2024).
It likely won’t be the text encoder for subsequent models, given its size (per your point) and age.
Also, I’m super curious about how you’re attempting to get more realistic physics with post-training.
Awesome to see more small teams making impressive leaps.
We’re going to write up going 0->1 on a video model (all the steps) over the coming months. But it likely won’t be a class or anything like that.
https://www.linum.ai/field-notes
We want to share our learnings with folks who are curious about the space - but don’t have time to make it a full class experience.
Hopefully karpathy does that with his courses in the future!
Sorry, it might sound like a cliché, but try that as a prompt to a deep-thinking/reasoning model and see what comes out.
An expensive option: Look at Project #5 at https://bytebyteai.com/
In the meantime, here are the individual links to the models:
https://huggingface.co/Linum-AI/linum-v2-720p
https://huggingface.co/Linum-AI/linum-v2-360p
https://github.com/Linum-AI/linum-v2/blob/298b1bb9186b5b9ff6...