Value-Based Deep RL Scales Predictably (arxiv.org)
gsf_emergency_2 22 days ago
My attempt at a summary: the authors characterize the data-compute Pareto front (aka how bitter is the lesson, exactly?); a rough sketch of what fitting such a frontier might look like is at the end of this comment.

For a different perspective, error vs compute, see

https://youtu.be/5eqRuVp65eY

and comments

(I particularly liked the one about string theorists rediscovering a fundamental theorem in GR decades too late -- rediscovering how to integrate happens in every field; it's nothing to be ashamed of :)
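
Rough sketch of what fitting such a data-compute frontier could look like, with made-up numbers and a simple power-law form (the paper's actual functional forms and fitting procedure may differ):

    # A minimal sketch (not the paper's actual fit): given hypothetical
    # (compute, data) pairs at which runs first reach a fixed return
    # threshold, fit a power law D ~ a * C^b in log-log space and
    # extrapolate the data requirement at a larger compute budget.
    import numpy as np

    # Hypothetical measurements: FLOPs spent vs. environment steps needed
    # to hit the target return (made-up numbers for illustration only).
    compute = np.array([1e15, 3e15, 1e16, 3e16, 1e17])
    data    = np.array([2e6, 1.2e6, 8e5, 5.5e5, 4e5])

    # Linear fit in log space: log D = log a + b * log C
    b, log_a = np.polyfit(np.log(compute), np.log(data), deg=1)
    a = np.exp(log_a)

    # Extrapolate the frontier to a 10x larger compute budget.
    c_big = 1e18
    print(f"fitted exponent b = {b:.3f}")
    print(f"predicted data requirement at C={c_big:.0e}: {a * c_big**b:.3e} steps")

The point of a fit like this is extrapolation: predicting roughly how much data (or compute) a larger run will need before paying for it.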

yobbo 22 days ago
Skimmed little bits: "on-policy" RL means the model has generated its own output and received feedback from some sort of dynamic environment, which might not be scalable. Value-based off-policy RL means the model is trained on data that wasn't generated by the model itself exploring a dynamic environment; instead it can be recordings. They then ask the question: how does that scale?
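
A toy illustration of that distinction, assuming a made-up tabular problem (not the paper's setup): the value-based update below consumes only recorded transitions from a fixed buffer, so training never has to interact with a live environment.

    # Off-policy, value-based learning from recordings: tabular Q-learning
    # needs only stored (s, a, r, s_next, done) transitions, however and
    # whenever they were collected.
    import random
    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    gamma, lr = 0.99, 0.1

    # Hypothetical recorded transitions, e.g. logged by some other policy.
    replay_buffer = [
        (0, 1, 0.0, 1, False),
        (1, 0, 0.0, 2, False),
        (2, 1, 1.0, 3, False),
        (3, 0, 0.0, 4, True),
    ] * 250  # pretend we have many such recordings

    for _ in range(10_000):
        s, a, r, s_next, done = random.choice(replay_buffer)
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])  # standard Q-learning update

    print(Q)  # greedy policy w.r.t. Q, learned without touching the environment

An on-policy method, by contrast, would need to regenerate its rollouts with the current policy after every update, which is where the scalability concern comes from.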
currymj 21 days ago
RL is unbelievably finicky, sensitive to hyperparameters, and hard to make work. If you are pursuing a research project and decide to use RL, you are making your life a lot more difficult and stressful.

It's exciting to see any progress in making accurate predictions about what settings will work for RL training. I hope that this research direction can be expanded in scope and that ultimately, people who want to do research in RL can become confident in their training recipes.

I am more excited about that than about the dream of scaling compute per se.
