
Plenty of real world situations have clear objectives with obvious rewards.


Examples:


Fold clothes -> clothes are folded.

Take children to school -> they safely arrive on time.

Autonomous driving -> arrive at destination without crashing.

Call centre -> customers are happy.


Those don't look like rewards, or at least don't get processed as such for many people (myself included).

Or maybe there is some art to finding happiness in simple things like having folded clothes or surviving the commute?


In RL rewards can be anything you want. They don't have to be things that humans like.


Fair enough!

I guess you can always find some well-specified, measurable goal/reward, but then that choice limits the performance of your model. It's fine when you're building a very specialized system; it gets more difficult the more general you're trying to be.

For a general system meant to operate in a human environment, the goal ends up approaching "things that humans like". Case in point: that's what the overall LLM goal function is - continuations that make sense to humans, in the fully general meaning of that.


>> Fold clothes -> clothes are folded.

>> Take children to school -> they safely arrive on time.

>> Autonomous driving -> arrive at destination without crashing.

>> Call centre -> customers are happy.

Define a) "folded", b) "safely", c) "destination", d) "happy".

Also define the reward functions for each of the four objectives above.


Safely -> no crashes

Destination -> Like, close to the destination? I don't see how that's hard.

Happy -> you can use customer feedback for this

Folded -> this is indeed the trickiest one, but I think well within the capabilities of modern vision models.
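For concreteness, here is a toy sketch of what reward functions along these lines might look like. Everything here is hypothetical: the thresholds, the crash flag, the feedback scale, and the vision-model confidence signal are all illustrative stand-ins, not a real training setup.

```python
# Toy reward functions for the objectives discussed above.
# All signals (crash sensor, feedback score, vision-model
# confidence) are hypothetical stand-ins for illustration.

def driving_reward(crashed: bool, distance_to_destination_m: float) -> float:
    """Safely -> no crashes; destination -> close to the destination."""
    if crashed:
        return -100.0  # large penalty for any crash
    if distance_to_destination_m < 5.0:
        return 10.0    # arrived within 5 m of the goal
    return -0.01       # small per-step cost to encourage progress

def call_centre_reward(feedback_score: int) -> float:
    """Happy -> customer feedback on a 1-5 scale, centred at 3."""
    return float(feedback_score - 3)

def folding_reward(folded_confidence: float) -> float:
    """Folded -> confidence from a (hypothetical) vision classifier."""
    return 1.0 if folded_confidence > 0.9 else 0.0
```

Of course, as the reply below points out, collapsing "safely" into a single crash flag leaves a lot out, and an agent will optimize exactly what the function measures, not what you meant by it.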


>> Safely -> no crashes

Really? What about fires? Falling off cliffs? Causing others to crash?

Your "examples" are all hand-wavy and vague and no good to train an RL agent. You've also not provided a reward function.


Work a job, receive money


That's a weak example in the context of at least salaried jobs, especially in the context of RL, as the "receive money" part is usually both significantly delayed from the "work a job" part and only loosely affected by it.


The delay between action and reward is a pretty fundamental problem with RL in general. I don't think they've come up with a really good solution yet.

Of course the delay is much bigger with working a job than in most RL games, but fundamentally it's the same problem.
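To make the delay point concrete: with the standard discounted-return formulation, a reward T steps in the future contributes gamma**T to the return at the time the action was taken, so long delays shrink the learning signal toward zero. The step counts below (a reward 10 steps away vs. 720 "hours" away) are purely illustrative.

```python
# Minimal sketch of why delayed rewards are hard for RL:
# the discounted return G_0 = sum_t gamma**t * r_t weights a
# reward T steps away by gamma**T, so a long delay between
# action and reward leaves almost no signal to learn from.

def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = sum over t of gamma**t * rewards[t]."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Same unit reward, very different delays (step counts are made up):
game_like = [0.0] * 10 + [1.0]    # reward 10 steps after the action
job_like = [0.0] * 720 + [1.0]    # reward 720 steps after the action

print(discounted_return(game_like))  # ~0.904
print(discounted_return(job_like))   # ~0.0007
```

This is the credit-assignment problem the comment alludes to: the further the payoff sits from the action, the harder it is to tell which action earned it.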



