CS234: Reinforcement Learning Winter 2025 (web.stanford.edu)

207 points by jonbaer 73 days ago | 60 comments

pedrolins 73 days ago [-]

I was excited to check out lecture videos thinking they were public, but quickly saw that they were closed.

One of the things I miss most about the pandemic was how all of these institutions opened up to the world. Lately they have been closing down, not only restricting newer course offerings but also making old videos private. Even MIT OCW falls apart once you get into some advanced graduate courses.

I understand that universities should prioritize their alumni, but there’s literally no cost in making the underlying material (especially lectures!) available on the internet. It delivers immense value to the world.

moosedev 73 days ago [-]

2024 lecture videos are on YouTube: https://youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEb...

rllearner 73 days ago [-]

One of my favorite parts of the 2024 series on Youtube was when Prof B explained her excitement just before introducing UCB algorithms (Lecture 11): "So now we're going to see one of my favorite ideas in the course, which is optimism under uncertainty... I think it's a lovely principle because it shows why it's provably optimal to be optimistic about things. Which is kind of beautiful."

Those moments are the best part of classroom education: a super knowledgeable person spends a few weeks helping you get to the point where you can finally understand something cool, and you can sense their excitement to tell you about it. I still remember learning Gauss-Bonnet, Stokes' Theorem, and the Central Limit Theorem. I think optimism under uncertainty falls in that group.
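
For anyone who wants to see "optimism under uncertainty" in code: below is a minimal sketch of the classic UCB1 rule on Bernoulli bandit arms. It is not taken from the lecture; the arm means, horizon, and function names are made up for illustration. Each arm's score is its empirical mean plus a bonus that shrinks as the arm gets pulled more, so under-explored arms get the benefit of the doubt.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total reward per arm

    def index(arm, t):
        # Optimistic score: empirical mean plus a confidence bonus.
        return sums[arm] / counts[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once so the index is defined
        else:
            arm = max(range(n_arms), key=lambda a: index(a, t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical arms with success probabilities 0.2, 0.5 and 0.8.
counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(counts)
```

Most pulls should end up on the last arm, while the weaker arms still get sampled occasionally as their bonus grows with log(t).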

storus 73 days ago [-]

Those don't have DPO/GRPO which arguably made some parts of RL obsolete.

nafizh 73 days ago [-]

Check out Stanford's CS336; it covers DPO/GRPO and the relevant parts needed to train LLMs.

storus 72 days ago [-]

It's also covered by CS329H.

upbeat_general 73 days ago [-]

I can assure you that lacking knowledge of DPO (and especially GRPO, which is just stripped-down PPO) is not a dealbreaker.
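
On the "stripped-down PPO" point: as far as I understand it, the main thing GRPO drops is PPO's learned value baseline, replacing it with a baseline computed from a group of responses sampled for the same prompt. A rough sketch of just that advantage computation follows. The function name and the example rewards are hypothetical, and using the population standard deviation with a small epsilon is my assumption, not something stated in the thread.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1 = correct).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)
```

The resulting advantages then plug into a PPO-style clipped objective, which this sketch leaves out.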

TomasBM 73 days ago [-]

I've seen arguments that opening up fresh material makes it easy for less honest institutions to plagiarize your work. I've even heard professors say they don't want to share their slides or record their lectures, because it's their copyright.

I personally don't like this, because it makes a place more exclusive through legal moats, not genuine prestige. If you're a professor, this also makes your work less known, not more. IMO the only beneficiaries are those who paid a lot to be there, lecturers who don't want to adapt, and university admins.

levocardia 73 days ago [-]

>I've even heard professors say they don't want to share their slides or record their lectures, because it's their copyright.

No, it's because they don't want people to find out they've been reusing the same slide deck since 2004

outside1234 73 days ago [-]

I wish we would speed run this to where these super star profs open their classes to 20,000 people at a lower price point (but where this yields them more profit)

ibrahima 73 days ago [-]

That's basically MOOCs, but those kinda fizzled out. It's tough to actually stay focused for a full-length university-level course outside of a university environment IMO, especially if you're working and have a family, etc.

(I mean, I have no idea how Coursera/edX/etc are doing behind the scenes, but it doesn't seem like people talk about them the way they used to ~10 years ago.)

TomasBM 73 days ago [-]

They're still around and offering new online courses. I hope they have no trouble staying afloat, because they do offer useful material at the very least.

I agree it's hard, but I think it's because initially the lecturers were involved in the online community, which can be tiring and unrewarding even if you don't have other obligations.

I think the courses should have purely standalone material that lecturers can publish, earn extra money, and refresh the content when it makes sense. Maybe platform moderators could help with some questions or grading, but it's even easier to have chatbot support for that nowadays. Also, platforms really need to improve.

So, I think the problem with MOOCs has been the execution, not the concept itself.

geodel 73 days ago [-]

Most MOOCs are venture-funded companies, not lifestyle businesses, so they are unlikely to do sensible, user-friendly things. They just need to somehow show investors that hypergrowth will happen. (It doesn't seem like it did, though.)

ndriscoll 73 days ago [-]

Most of the MOOCs were also watered down versions of a real course to attempt to make them accessible to a larger audience (e.g. the Stanford Coursera Machine Learning course that didn't want to assume any calculus or linear algebra background), which made them into more of a pointless brand advertisement than an actual learning resource.

TomasBM 72 days ago [-]

> pointless brand advertisement

I understand what you mean, but I disagree it's mostly or pure branding.

I'd argue that even watered down versions can be useful as a bridge to more advanced courses and material, provided you have access to both.

Personally, I benefited from that ML course by Andrew Ng, because I got the vocabulary and introductory math knowledge to proceed to courses and textbooks on linear algebra. It wasn't the only thing that helped, sure, but it helped.

There were also other STEM and non-STEM MOOCs which brought me free knowledge I probably would've never pursued or paid for otherwise.

geodel 73 days ago [-]

They are mostly used for professional courses: learning Python, Java, GitLab runners, microservices with Node.js, project management, and things like that.

TomasBM 73 days ago [-]

I'd definitely support that.

On the flip side, that'd require many professors and other participants in universities to rethink the role of a university degree, which proves to be much more difficult.

pkoird 73 days ago [-]

Reminds me of something I wrote a year ago https://praveshkoirala.com/2024/11/21/the-democratization-of...

sillysaurusx 73 days ago [-]

It’s been said that RL is the worst way to train a model, except for all the others. Many prominent scientists seem to doubt that this is how we’ll be training cutting edge models in a decade. I agree, and I encourage you to try to think of alternative paradigms as you go through this course.

If that seems unlikely, remember that image generation didn’t take off till diffusion models, and GPTs didn’t take off till RLHF. If you’ve been around long enough it’ll seem obvious that this isn’t the final step. The challenge for you is, find the one that’s better.

PaulRobinson 73 days ago [-]

You're assuming that people are only interested in image and text generation.

RL excels at learning control problems. It is mathematically guaranteed to provide an optimal solution for the state and controls you provide it, given enough runtime. For some problems (playing computer games), that runtime is surprisingly short.

There is a reason self-driving cars use RL, and don't use GPTs.

bchasknga 73 days ago [-]

> self-driving cars use RL

Some part of it, but I would argue with a lot of guardrails in place, and not as commonly as you think. I don't think the majority of the planner/control stack out there in SDCs is RL-based. I also don't think any production SDCs are RL-based.

rangestransform 72 days ago [-]

Based on the Zoox ICCV talk, it sounds like their main planner is RL.

noobcoder 73 days ago [-]

I have been using it to train an agent on my game, hotlapdaily.

Apparently the AI sets the best times, even better than the pros. It is really useful for optimization in controlled environments.

srean 73 days ago [-]

You are exactly right.

Control theory and reinforcement learning are different ways of looking at the same problem. They traditionally and culturally focussed on different aspects.

whatshisface 73 days ago [-]

RL is barely even a training method; it's more of a dataset generation method.

theOGognf 73 days ago [-]

I feel like both this comment and the parent comment highlight how RL has recently been going through another cycle of misunderstanding, brought on by its latest popularity boom from being used to train LLMs.

mistercheph 73 days ago [-]

care to correct the misunderstanding?

mountainriver 73 days ago [-]

For one, DPO, PPO, and GRPO all use losses that differ from the one used in SFT.

They also force exploration as a part of the algorithm.

They can be used for synthetic data generation once the reward model is good enough.
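
To make the difference in losses concrete, here is a rough single-example sketch contrasting an SFT-style negative log-likelihood with the DPO loss. It assumes you already have summed log-probabilities of each response under the policy and under a frozen reference model. The function and argument names are hypothetical, and beta=0.1 is an arbitrary illustrative value.

```python
import math


def sft_loss(logp_target):
    """SFT: negative log-likelihood of the single reference answer."""
    return -logp_target


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: -log(sigmoid(margin)), where the margin compares how much the
    policy has shifted towards the chosen answer versus the rejected one,
    relative to the reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has moved towards the chosen answer and away from the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```

The SFT loss only looks at one target sequence; the DPO loss only looks at the relative preference between a pair, which is why the two are not interchangeable.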

phyalow 73 days ago [-]

It's reductive, but also roughly correct.

singularity2001 73 days ago [-]

While collecting data according to policy is part of RL, 'reductive' is an understatement. It's like saying algebra is all about scalar products. Well yes, 1%

poorman 73 days ago [-]

RL is still widely used in the advertising industry. Don't let anyone tell you otherwise. When you have millions to billions of visits and you are trying to optimize an outcome RL is very good at that. Add in context with contextual multi-armed bandits and you have something very good at driving people towards purchasing.

paswut 73 days ago [-]

What about combinatorial optimization? When you have a simulation of the world, what other paradigms fit?

whatever1 73 days ago [-]

More likely we will develop general super intelligent AI before we (together with our super intelligent friends) solve the problem of combinatorial optimization.

hyperbovine 73 days ago [-]

There's nothing to solve. The curse of dimensionality (CoD) kills you no matter what. P=NP or maybe quantum computing is the only hope of making serious progress on large-scale combinatorial optimization.

rishabhaiover 73 days ago [-]

I like to think of RLHF as a technique that I, as a student, used to apply to score good marks in my exams. As soon as I started working, I realized that out-of-distribution generalization can't be achieved only by practicing in an environment with verifiable rewards.

charcircuit 73 days ago [-]

GPT wouldn't have even been possible, let alone taken off, without self-supervised learning.

mountainriver 73 days ago [-]

RLHF is what gave us the ChatGPT moment. Self supervised learning was the base for this.

SSL creates all the connections and RL learns to walk the paths

charcircuit 73 days ago [-]

The easy-to-use web interface gave us the ChatGPT moment. Take a look at AI Dungeon for GPT-2. It went viral because it made GPT-2 accessible.

mountainriver 72 days ago [-]

No, RLHF did. We already had interfaces to GPT, like Jasper.

kgarten 73 days ago [-]

Are the videos available somewhere?

spring course is on YouTube https://m.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpT...

mlmonkey 73 days ago [-]

As a "traditional" ML guy who missed out on learning about RL in school, I'm confused about how to use RL in "traditional" problems.

Take, for example, a typical binary classifier with a BCE loss. Suppose I wanted to shoehorn RL onto this: how would I do that?

Or, for example, the House Value problem (given a set of features about a house for sale, predict its expected sale value). How would I slap RL onto that?

I guess my confusion comes from how the losses are hooked up. Traditional losses (BCE, RMSE, etc.) I know about; but how do you bring RL loss into problems?
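
One way to picture how an RL loss would be hooked up here (a sketch under assumptions, not a recommendation): treat the classifier as a one-step policy that samples a label, hide the true label from the update, and only feed back a reward of +1 or -1. The REINFORCE update then pushes up the log-probability of sampled actions in proportion to the reward. The toy 1-D data, the +1/-1 reward scheme, and all names below are made up for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reinforce(data, steps=5000, lr=0.1, seed=0):
    """Logistic 'policy' trained from reward only, using REINFORCE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        p = sigmoid(w * x + b)                   # policy: P(action = 1 | x)
        action = 1 if rng.random() < p else 0    # sample a label instead of using y
        reward = 1.0 if action == y else -1.0    # the only feedback the update sees
        # For a Bernoulli policy with a logit, d log pi(action|x) / d logit = action - p.
        grad_logit = reward * (action - p)
        w += lr * grad_logit * x                 # gradient ascent on expected reward
        b += lr * grad_logit
    return w, b


# Hypothetical toy data: negative x means class 0, positive x means class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_reinforce(data)
print(w, b)
```

With BCE the gradient uses the label directly as (y - p); here the update only learns whether the sampled guess was right, which is a weaker signal for the same problem.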

nonameiguess 73 days ago [-]

RL is a technique for finding an optimal policy for Markov decision processes. If you can define state spaces and action spaces for a sequential decision problem with uncertain outcomes, then reinforcement learning is typically a pretty good way of finding a function mapping states to actions, assuming it isn't a sufficiently small problem that an exact solution exists.

I don't really see why you would want to use it for binary classification or continuous predictive modeling. It's why it excels in game play and operational control. You need to make decisions now that constrain possible decisions in the future, but you cannot know the outcome until that future comes and you cannot attribute causality to the outcome even when you learn what it is. This isn't "hot dog/not a hot dog", which generally has an unambiguously correct answer and where the classification itself is directly either correct or incorrect. In RL, a decision made early in a game probably leads causally to a particular outcome somewhere down the line, but the exact extent to which any single action contributes is unknown and probably unknowable in many cases.
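
For the "sufficiently small problem" case mentioned above, where the transition model is known and an exact solution is feasible, a minimal sketch of value iteration on a toy MDP looks like this. The two-state MDP, its rewards, and the names are hypothetical; `P[s][a]` is assumed to be a list of `(probability, next_state, reward)` tuples.

```python
def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical MDP: staying in B pays 1 per step; from A, "go" reaches B 80% of the time.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)], "go": [(1.0, "A", 0.0)]},
}
V, policy = value_iteration(P)
print(V, policy)
```

RL methods come in when you cannot write `P` down or the state space is too large to sweep like this, so the agent has to learn from sampled transitions instead.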

egl2020 73 days ago [-]

Three considerations that come into play in deciding about using RL: 1) how informative is the loss on each example, 2) can you see how to adjust the model based on the loss signal, and 3) how complex is the feature space?

For the house value problem, you can quantify how far the prediction is from the true value, there are lots of regression models with proven methods of adjusting the model parameters (e.g. gradient descent), and the feature space comprises mostly monotone, weakly interacting features like quality of neighborhood schools and square footage. It's a "traditional" problem and can be solved as well as possible by the traditional methods we know and love. RL is unnecessary, might require more data than you have, and might produce an inferior result.

In contrast, for a sequential decision problem like playing Go, the binary won-lost signal doesn't tell us much about how well or poorly the game was played, it's not clear how to improve the strategy, and there are a large number of moves at each turn with no evident ranking. In this setting RL is a difficult but possible approach.

robrenaud 73 days ago [-]

I just wouldn't.

RL is nice in that it handles messy cases where you don't have per-example labels.

How do you build a learned chess playing bot? Essentially the state of the art is to find a clever way of turning the problem of playing chess into a sequence of supervised learning problems.

mlmonkey 73 days ago [-]

So IIUC RL is applicable only when the outcome is not immediately available.

Let's say I do have a problem in that setting; say the chess problem, where I have a chess board with the positions of chess pieces and some features like turn number, my color, time left on the clock, etc. are available.

Would I train a DNN with these features? Are there some libraries where I can try out some toy problems?

I guess coming from a classical ML background I am quite clueless about RL but want to learn more. I tried reading the Sutton and Barto book, but got lost in the terminology. I'm a more hands-on person.
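
For a hands-on starting point that needs no libraries at all, here is a minimal sketch of tabular Q-learning on a made-up five-state corridor: the agent starts at the left end and gets a reward of 1 only when it reaches the right end. The environment, hyperparameters, and names are all hypothetical, chosen just to show the update rule with delayed reward.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4, episode ends on reaching state 4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic corridor: move left (bounded at 0) or right."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action choice, with random tie-breaking.
            if rng.random() < epsilon or Q[s][LEFT] == Q[s][RIGHT]:
                a = rng.randrange(2)
            else:
                a = LEFT if Q[s][LEFT] > Q[s][RIGHT] else RIGHT
            s2, r, done = step(s, a)
            # Bootstrapped target: reward now plus discounted best value later.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
greedy = [LEFT if q[LEFT] > q[RIGHT] else RIGHT for q in Q[:GOAL]]
print(greedy)
```

The reward only arrives at the last step, yet the values propagate backwards through the bootstrapped target until every state prefers moving right. A DNN replaces the table `Q` once the state space (such as a chess board) is too large to enumerate.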

jebarker 73 days ago [-]

OpenAI has an excellent interactive course on Deep RL: https://spinningup.openai.com/en/latest/

egl2020 73 days ago [-]

The AlphaGo paper might be what you need. It requires some work to understand, but is clearly written. I read it when it came out and was confident enough to give a talk on it. (I don't have the slides any more; I did this when I was at a FAANG and left them behind.)

zerosizedweasle 73 days ago [-]

Given Ilya's podcast this is an interesting title.

actionfromafar 73 days ago [-]

So, basically AI Winter? :-)

airspresso 73 days ago [-]

That's how I read it XD "oh no, RL is dead too"

TNWin 73 days ago [-]

I didn't get the reference. Please elaborate.

egl2020 73 days ago [-]

Karpathy colorfully described RL as "sucking supervision bits through a straw".

apwell23 73 days ago [-]

He said RL sucks because it narrowly optimizes to solve a certain set of problems under certain sets of conditions.

He compared it to students who win at math competitions but can't do anything practical.

kgarten 72 days ago [-]

Which podcast?

ontouchstart 72 days ago [-]

https://www.dwarkesh.com/p/ilya-sutskever-2

storus 73 days ago [-]

RL is extremely brittle; it's often difficult to make it converge. Even Stanford folks admit that. Are there any solutions for this?

mountainriver 73 days ago [-]

FlowRL is one; it learns the full distribution of rewards rather than just optimizing toward a single maximum.

storus 73 days ago [-]

Thanks, that looks very promising!

_giorgio_ 73 days ago [-]

Kindly suggest some books about RL?

I've already studied a lot of deep learning.

Please confirm if these resources are good, or suggest yours:

Sutton et al. - Reinforcement Learning

Kevin Patrick Murphy - Reinforcement Learning, an overview https://arxiv.org/abs/2412.05265

Sebastian Raschka (upcoming book)

...

i_don_t_know 73 days ago [-]

I believe Kochenderfer et al.'s book "Algorithms for Decision Making" is also about reinforcement learning and related approaches. Free PDFs are available at https://algorithmsbook.com
