NHacker Next
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR (arxiv.org)
radarsat1 6 days ago [-]
> By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss.

Isn't this how the Decision Transformer works? I don't see it in the references, so I'll be curious to compare the papers in more depth.

https://arxiv.org/abs/2106.01345

> By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return.
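The conditioning that quote describes is, in caricature, just prepending the desired return-to-go to the sequence the model autoregresses over. A minimal sketch of that context layout — the flat token list and the modality tags here are made up for illustration; real DTs embed returns, states, and actions separately and add timestep embeddings:

```python
def dt_context(desired_return, states, actions):
    """Build the (return-to-go, state, action, state, ...) context a
    Decision Transformer conditions on. Hypothetical flat-token layout;
    a real DT embeds each modality with its own learned projection."""
    tokens = [("return", desired_return)]
    for i, s in enumerate(states):
        tokens.append(("state", s))
        if i < len(actions):
            tokens.append(("action", actions[i]))
    return tokens

# The model would predict the next action given everything before it.
print(dt_context(10.0, ["s0", "s1"], ["a0"]))
```

At inference time you set `desired_return` to the return you want and let the model generate the actions, which is what makes it "supervised learning on trajectories" rather than classical RL.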

It has crossed my mind recently that I haven't seen DT brought up much; it seemed really interesting when it was first published, but I haven't read much follow-up work.
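Read naively, the reformulation quoted at the top of this comment amounts to binary cross-entropy against the verifiable outcome label. A toy sketch of that shape — the scalar score and the labels here are my guess at the idea, not the paper's actual score function or objective:

```python
import math

def bce_on_outcome(score, outcome):
    """Treat the verifiable outcome (1 = correct, 0 = incorrect) as a
    supervised label, and score in (0, 1) as a policy-parameterized
    prediction of it. Hypothetical shapes, not the paper's loss."""
    eps = 1e-12  # guard against log(0)
    return -(outcome * math.log(score + eps)
             + (1 - outcome) * math.log(1.0 - score + eps))

# A confident prediction is cheap when right and expensive when wrong.
print(bce_on_outcome(0.9, 1) < bce_on_outcome(0.9, 0))  # → True
```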

impossiblefork 6 days ago [-]
59.76% on AIME is really appealing. I haven't had time to understand the method or determine whether it's useful, but I read that number as a sign this could be a stepping stone in something like the o1-to-DeepSeek-R1 progression for thinking, where open-source models eventually figured out how o1 worked. Only here the target is less definite than o1: it's whatever Google achieved, and OpenAI may have achieved, on the 2025 IMO problems.
getnormality 6 days ago [-]
I stumbled across this AI paper just now. It sounds intimidatingly technical, but if you read the abstract and look at Figures 1 and 2 and Equation 6, I think it's got some neat and accessible conceptual ideas.

Supervised learning is a much more mature technology than reinforcement learning, so it seems like a good thing to leverage that.

lostmsu 21 hours ago [-]
Can somebody explain "Base" model in the charts? Are they saying that the original model (e.g. before they applied either their or comparable training methods) has better or similar performance on all benchmarks vs their own result?
anfego 6 days ago [-]
Is this DPO?
getnormality 6 days ago [-]
I have no idea. My understanding of this entire field is extremely superficial. I only posted this because I was able to sort of understand the paper despite that.

I can tell you that they cite the DPO paper right before Equation 8.
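For reference, the standard DPO objective from that cited paper is a logistic loss on the difference of policy-versus-reference log-ratios for a preferred and a dispreferred completion. A minimal sketch with toy log-probabilities (this is textbook DPO, not necessarily what the paper's Equation 8 does with it):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w, logp_l         : policy log-probs of the preferred / dispreferred response
    ref_logp_w, ref_logp_l : the same quantities under the frozen reference model
    beta                   : scale controlling deviation from the reference
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers the winner
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is 0 and the loss is log 2.
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))  # → 0.6931
```

The interesting question is whether the paper's cross-entropy-on-outcome-labels objective reduces to something like this pairwise form, or stays pointwise per response.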

yorwba 6 days ago [-]
I think you meant to link to

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR https://arxiv.org/abs/2509.02522

not

Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline https://arxiv.org/abs/2507.15855

dang 6 days ago [-]
We've changed the top link to that from https://arxiv.org/abs/2507.15855. Thanks!
impossiblefork 6 days ago [-]
That paper is really cool too, though. I'm glad your comment sort of records the old link, because by the time I looked I only ever saw the right paper.
getnormality 6 days ago [-]
Ack, thank you.