yobbo 22 days ago
Skimmed little bits: "on-policy" RL means the model has generated the output itself and received feedback from some sort of dynamic environment, which might not be scalable. Value-based, off-policy RL means the model is trained on data that wasn't generated by the model itself exploring a dynamic environment; it can instead be recordings. They then ask the question: how does that scale?
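To make the data-regime difference concrete, here's a toy sketch (a made-up two-armed bandit with illustrative step sizes and update rules, not anything from the paper): on-policy learns only from fresh samples drawn by the current policy with live feedback, while off-policy/value-based learns from whatever recordings you hand it.

    # Minimal sketch of the distinction on a made-up 2-armed bandit.
    import math, random

    REWARD_MEANS = [0.2, 0.8]            # hidden payoff probability of each arm

    def pull(arm):
        return 1.0 if random.random() < REWARD_MEANS[arm] else 0.0

    # On-policy: the current policy generates its own data and gets live feedback.
    logits = [0.0, 0.0]
    for _ in range(2000):
        z = sum(math.exp(l) for l in logits)
        probs = [math.exp(l) / z for l in logits]
        arm = random.choices([0, 1], weights=probs)[0]    # fresh sample from the current policy
        r = pull(arm)                                     # live environment feedback
        for a in (0, 1):
            grad = (1.0 if a == arm else 0.0) - probs[a]  # REINFORCE-style softmax gradient
            logits[a] += 0.1 * r * grad                   # this sample is stale after the update

    # Off-policy / value-based: learn from recordings made by some other behavior policy.
    logged_arms = [random.choice([0, 1]) for _ in range(2000)]   # arbitrary logging policy
    recordings = [(arm, pull(arm)) for arm in logged_arms]       # (action, reward) records
    q = [0.0, 0.0]
    for arm, r in recordings:
        q[arm] += 0.1 * (r - q[arm])     # incremental value estimate; recordings are reusable

    print("on-policy prefers arm", logits.index(max(logits)))
    print("off-policy prefers arm", q.index(max(q)))

The point of the toy: the on-policy samples are only valid for the policy that produced them, whereas the logged (action, reward) pairs can be replayed arbitrarily, which is why the two regimes scale so differently with data.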
currymj 21 days ago
RL is unbelievably finicky, sensitive to hyperparameters, and hard to make work. If you are pursuing a research project and decide to use RL, you are making your life a lot more difficult and stressful.
It's exciting to see any progress in making accurate predictions about what settings will work for RL training. I hope that this research direction can be expanded in scope and that ultimately, people who want to do research in RL can become confident in their training recipes.
I am more excited about that than about the dream of scaling compute per se.
For a different perspective (error vs. compute), see https://youtu.be/5eqRuVp65eY and the comments there.
(I particularly liked the one about string theorists rediscovering a fundamental theorem in GR decades too late-- rediscovering how to integrate happens in every field, it's nothing to be ashamed of :)
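In case it helps to make "error vs. compute" concrete, here's a toy least-squares fit of a power law error ≈ a·C^(-b) in log-log space; the data points are invented placeholders purely to show the arithmetic, not numbers from the video.

    # Toy fit of error ~ a * C**(-b) by ordinary least squares in log-log space.
    # The data points below are invented placeholders, just to show the arithmetic.
    import math

    compute = [1e18, 1e19, 1e20, 1e21]   # hypothetical training FLOPs
    error   = [0.52, 0.41, 0.33, 0.26]   # hypothetical eval error at each budget

    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a, b = math.exp(my - slope * mx), -slope

    print(f"fit: error ~ {a:.3g} * C^(-{b:.3g})")
    print(f"extrapolated error at 1e22 FLOPs: {a * 1e22 ** -b:.3f}")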