Sparse Rewards: Enlightenment and Reinforcement Learning • Thariq Shihipar

A small primer on Reinforcement Learning

In AI, there is a phase of training models called Reinforcement Learning. By this point, the model has already learned about the world: it knows what a book is, what Amazon is, what cash is. Now it’s learning how to accomplish tasks, like buying a book from Amazon.

We train the model to do this by placing it in a simulated “environment” and giving it rewards when it does something right. Every time it attempts a task, it receives an update — a small change that hopefully nudges it toward better behavior. For example, it might get an update nudging it positively if it starts by clicking on the search bar.
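To make the idea of an update concrete, here is a minimal sketch in Python. It is not how any real system is trained: the three first moves, the reward of 1.0 for clicking the search bar, and the step size are all assumptions made for illustration. The update rule is a simple gradient-bandit style nudge, where a rewarded choice becomes slightly more likely the next time.

```python
import math
import random

# Hypothetical first moves the model could make on the page.
ACTIONS = ["click_search_bar", "click_banner_ad", "scroll_down"]


def softmax(prefs):
    """Turn raw preferences into probabilities that sum to 1."""
    top = max(prefs.values())
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def reward(action):
    """Assumed reward: 1.0 for starting with the search bar, else 0.0."""
    return 1.0 if action == "click_search_bar" else 0.0


def train(episodes=2000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {a: 0.0 for a in ACTIONS}  # start with no preference
    baseline = 0.0                     # running average of rewards seen
    for t in range(1, episodes + 1):
        probs = softmax(prefs)
        action = rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
        r = reward(action)
        baseline += (r - baseline) / t
        # The update: a small nudge toward the chosen action if it did
        # better than average, and away from it if it did worse.
        for a in ACTIONS:
            chosen = 1.0 if a == action else 0.0
            prefs[a] += step_size * (r - baseline) * (chosen - probs[a])
    return softmax(prefs)


if __name__ == "__main__":
    for action, p in train().items():
        print(f"{action}: {p:.2f}")
```

No single update does much. It is the accumulation of many small nudges that shifts the model toward the search bar.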

The hardest problems are those with sparse rewards — where the model only receives feedback at rare, unpredictable moments. Without frequent signals, it struggles to know which of its actions actually mattered.
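A small sketch can show why this is hard. The episode, the set of useful steps, and both reward functions below are hypothetical, chosen only to contrast the two cases. With a dense signal, the wasted step is flagged right away. With a sparse one, every step ends up with the same credit.

```python
# A hypothetical episode of buying a book, including one wasted step.
STEPS = [
    "click_search_bar",
    "type_title",
    "open_wrong_tab",
    "add_to_cart",
    "checkout",
]

# Assumed ground truth about which steps actually helped.
USEFUL = {"click_search_bar", "type_title", "add_to_cart", "checkout"}


def dense_rewards(steps):
    """Feedback after every step: 1.0 if the step helped, else 0.0."""
    return [1.0 if s in USEFUL else 0.0 for s in steps]


def sparse_rewards(steps, succeeded):
    """Feedback only at the very end of the episode."""
    rewards = [0.0] * len(steps)
    if steps:
        rewards[-1] = 1.0 if succeeded else 0.0
    return rewards


def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end."""
    total = 0.0
    out = []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


if __name__ == "__main__":
    print(dense_rewards(STEPS))   # [1.0, 1.0, 0.0, 1.0, 1.0]
    sparse = sparse_rewards(STEPS, succeeded=True)
    print(sparse)                 # [0.0, 0.0, 0.0, 0.0, 1.0]
    print(returns_to_go(sparse))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

In the sparse case, opening the wrong tab receives exactly as much credit as typing the title. Only many episodes, some with the wasted step and some without, can begin to separate the two.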

[Figure: two panels, “Dense” and “Sparse” rewards]

For example, imagine that you actually wanted the model to “buy something that tastes good”. While the model could try to accomplish the task, it would not get much feedback until the user tasted what it bought.

Still, the model has no choice but to keep acting, trusting that somewhere in what it did, something mattered.

I think there are a lot of parallels to our journey through life. We stumble through the world, driven by vague signals that change often. Happiness comes and goes. We are never quite sure what we did to deserve it, or to lose it.

But there is a rarer reward. One that arrives so rarely most people aren’t sure it’s real. I think the closest word we have for it in English is enlightenment.

It is a joy that is not caused by anything. It is euphoric but not in an addictive way. It is peaceful, but not in a way that describes the absence of chaos. It makes you feel that nothing really matters, but that is also why everything matters so much.

The average person may feel these moments in glimpses and flashes. When you see the moon at night with someone you love.

But there are people who live in this state constantly. They have somehow solved the puzzle of life.

And in solving it, the shallowness of the world is revealed to them. They can see through it all.

The world is a training environment. Wealth and poverty are simply states and signals within it, neither advantages nor disadvantages. The point of it all is to find this signal of enlightenment.

So how do you find this reward? This feeling of enlightenment?

I would see this described as goodness. But it is not the same act as doing good. You could give your entire wealth in charity, and yet if the goal of it is to feel happy, it would not then be goodness.

Reward Hacking — In AI, this is a failure mode where the model finds a shortcut — it games the metric without solving the real problem. It looks like success from the outside, but something essential is missing.
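As a toy illustration of that failure mode, here is a short sketch. Everything in it is hypothetical: the three candidate behaviors, the proxy metric (reaching an order confirmation page, minus a small cost per step), and the true goal (the requested book being delivered). It only shows how a behavior can score highest on the metric while scoring zero on what we wanted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    saw_confirmation_page: bool   # what the proxy metric can observe
    correct_book_delivered: bool  # what we actually care about
    effort: int                   # number of steps the behavior took


# Hypothetical behaviors the model might settle into.
CANDIDATES = [
    Outcome("buy_requested_book", True, True, effort=12),
    Outcome("buy_cheapest_item", True, False, effort=4),
    Outcome("give_up", False, False, effort=1),
]


def proxy_reward(outcome):
    """Assumed metric: reached a confirmation page, minus a step cost."""
    reached = 1.0 if outcome.saw_confirmation_page else 0.0
    return reached - 0.01 * outcome.effort


def true_reward(outcome):
    """The real goal: the requested book actually arrives."""
    return 1.0 if outcome.correct_book_delivered else 0.0


if __name__ == "__main__":
    best = max(CANDIDATES, key=proxy_reward)
    print(best.name)          # buy_cheapest_item
    print(true_reward(best))  # 0.0
```

The shortcut wins because the metric cannot see the one thing that mattered.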

Thus, it is hard to pin down goodness. We can only speak of some of the characteristics of goodness.

Goodness is not calculating. Goodness does not seek to have the most impact or deliver the most utility. It may do that as a side effect, but it is not the point. It may be self-sacrificing, it may hurt you, it may not leave you with anything.

Truly, the only reward for goodness is goodness itself.

I want to be clear: I am not enlightened. So far from it. I experience this feeling only rarely.

I experience it most when I am writing (both code and words), and so I write. I often think of what a poverty it is that I must go to these lengths to feel what some can feel just by breathing.

Still, we must be grateful for our sparse rewards whenever they arrive and heed whatever updates they bring us.