Reinforcement Learning vs Machine Learning

When I started with reinforcement learning (RL), I often wondered whether RL was even needed. After all, both RL and machine learning (ML) solve a prediction problem. In this post, we discuss how the two compare. Throughout, ML refers to the standard supervised learning setup. ML techniques are useful in more structured learning settings where labeled data is provided and samples are independent, while RL is often used for teaching agents to operate in an environment.

We will start by noting the basic similarities between the RL and ML setups. Next, we will set up the RL problem of playing the Super Mario video game and highlight the challenges of learning it via ML techniques. Finally, we will formalize the RL setup.

Input-Output Characteristics

The goal in both ML and RL is to learn a probability distribution over the actions one can take given a data point (called a state in RL). In typical ML settings, the state can be an image, a piece of text, or any entity represented as a vector of numbers. The action space tends to be a list of classes, or it can be real numbers (like prices) over which you want to learn a distribution. RL is similar in this respect: you want to learn a distribution over actions, which can be the moves you can make in a video game, or a continuous distribution over how much to turn the steering wheel of an autonomous car. The states in RL tend to be, for instance, the current chess board, the positions of the player and enemies in a video game, or information about a car’s surroundings, all encoded as vectors of numbers. In this regard, the goals and inputs of RL and ML match.

The policy in RL (the model in ML) is defined by a hypothesis class that represents the distribution of actions given the state, often a neural network. In this respect as well, ML and RL are similar.

ML Training Data

We need labelled (state, action) pairs to learn a model. In ML, we have training data of $(X,y)$ tuples, where $X$ is a numerical representation of the state and $y$ is the true action. The true action is the one that, if taken (that is, if you predict this class), gives the minimum loss (or the maximum reward). In this sense, the training data contains the optimal action for each example state. Using this data, we can learn the distribution $p(a|s)$.
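To make the idea of learning $p(a|s)$ from labelled pairs concrete, here is a minimal sketch using a linear-softmax model trained by gradient descent on the cross-entropy loss. The toy data, the function names (`softmax`, `predict_proba`, `train`) and the hyperparameters are all made up for illustration; any classifier that outputs class probabilities would play the same role.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, state):
    """Return p(a|s) for a linear-softmax model; weights has one row per action."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return softmax(logits)

def train(data, n_actions, n_features, lr=0.5, epochs=200):
    """Fit the model on (state, true_action) pairs with plain SGD."""
    weights = [[0.0] * n_features for _ in range(n_actions)]
    for _ in range(epochs):
        for state, action in data:
            probs = predict_proba(weights, state)
            for a in range(n_actions):
                # Gradient of the cross-entropy loss w.r.t. the logit of action a.
                grad = probs[a] - (1.0 if a == action else 0.0)
                for j in range(n_features):
                    weights[a][j] -= lr * grad * state[j]
    return weights

# Toy data (an assumption): states are (bias, feature) vectors,
# and the true action is 1 when the feature is positive, else 0.
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
weights = train(data, n_actions=2, n_features=2)
probs = predict_proba(weights, [1.0, 1.5])  # p(a|s) for an unseen state
```

Each training pair tells the model directly which action was right, which is exactly the luxury we lose later in the RL setting.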

This data is provided to us or can be generated one state at a time. Each state is independent of the others, so it does not matter in what order you see your data. For instance, in image recognition each image is a fresh start. The implications are that you can shuffle the data, break it into batches, and do bagging. In the ML setting, the states are also identically distributed: the data generating process is such that if you sample enough states, you will end up with representative data to train the model. If you like to think from a time series perspective, such IID data is always stationary: the joint distribution of any sequence of data points is the same wherever the sequence starts. Independence and identical distribution are nice properties to have, and they give you a world of advantages while learning:

  1. You don’t care in what sequence you see the data. Each point is independent.
  2. You don’t care in what sequence you update the weights of your model. On average, each point makes the same update.
  3. You don’t need a world of data. Since all data points come from the same distribution, a large enough sample is enough to learn from.
  4. You don’t need a complicated functional form for the policy (or model). Ultimately, you are trying to learn decisions on states from a single distribution.
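Because order does not matter for IID data, shuffling, mini-batching and bagging are all legitimate operations. A small sketch of the first and last is shown here; the helper names `minibatches` and `bootstrap_sample` are hypothetical, and the integers stand in for real training examples.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that together cover every example exactly once."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # any order is as good as any other for IID data
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement, as done in bagging."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)           # seeded so the run is repeatable
data = list(range(10))           # stand-in for ten training examples
batches = list(minibatches(data, batch_size=4, rng=rng))
resampled = bootstrap_sample(data, rng)
```

Neither operation would be safe on a game trajectory, where each frame depends on the frames and actions before it.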

Sidenote: The identical distribution assumption is a bit shaky in my opinion. If you are in a nice happy setting where you are learning from a simple distribution, you need less data and a simpler hypothesis. If you have a complex distribution, you will need more data and a more complex hypothesis. Data from multiple distributions can still be represented by a single, arbitrarily complex distribution by introducing more variables. The problem seems to be the need for more data rather than the distributions being non-identical.

Another notable simplification in the ML setting is that the loss or reward you get depends only on the state you see and the decision you make on it. You are rewarded for the decision on the image given to you; it doesn’t matter whether you can identify the next image or not. There is no extra reward for getting all images in a sequence correct. This is a big simplification, and it makes the data generating process simple.

Training ML to play Super Mario

Now, consider trying to learn to play Super Mario using the same ML setting. The state is a screenshot of where Mario is currently positioned, let’s say $s_i$. The action is a press of buttons on the controller, with the aim of winning the game. Whether this aim is fulfilled is known only at the end of the game, when you are rewarded. In our current state, the reward for any action is 0, since no action leads to immediately winning the game. There is some final state, though, where Mario rescues the princess and the reward for a specific action is finally 1. At our current state, judged by instantaneous reward alone, all actions look equally good. A better way to assign optimality to an action would be to play through to the end of the game and see if we win. This is the problem of delayed reward. Learning with ML methods seems hard, since getting even a single data point requires unraveling the whole game. How do we learn when we can’t even curate the training data in the first place?

Learning from an expert player: Instead of exploring all possible actions, why don’t we play Mario ourselves for some time and record all the data? We collect screenshots as states and the buttons we press as actions. We throw away all the runs where we die and keep only those where we rescue the princess. Now we have a good action to take for each state, one that will get us the reward at the end. Already we can see that the journey is much more complicated: just to get the optimal action for one state, we had to play the whole game. At this point, we can change the reward to match the ML setting, with a binary reward of 1 if the model picks the recorded action and 0 otherwise. Using this data, we can learn a model. But is this a good model, or is it just memorizing our gameplay? There are a few problems:

  1. Compounding errors: What the model sees depends on the actions it takes, so if the model takes a slightly different action than its human teacher, it ends up in a new situation it has never seen. Over a sequence of images, the differences pile up. Models that imitate an expert player don’t learn about the mistakes the expert didn’t make, and thus don’t know how to recover from them.
  2. Harder generalization: What if there is some randomization in the game, say in the way enemies move, so that they move differently in every run? The model then sees images it has not seen in training, and given the point above its generalization capability is not great. It has just memorized our gameplay, and it will fail miserably when presented with an out-of-distribution sample.
  3. Non-identical distribution: The learning problem seems harder. In an image recognition task, images that look similar have similar labels. When we are learning Mario, consecutive images may look very similar, yet the actions we should take may be very different.
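The first point can be illustrated with a back-of-the-envelope sketch. Assume, purely for illustration, that the imitating model deviates from the expert with a small fixed probability at every step, and that once it deviates it never finds its way back to states the expert visited. Under those simplifying assumptions, the chance of staying on the expert’s path shrinks geometrically with the length of the game. The function names are hypothetical.

```python
import random

def on_path_probability(per_step_error, horizon):
    """Chance of never deviating from the expert over `horizon` steps."""
    return (1.0 - per_step_error) ** horizon

def simulate_clone(per_step_error, horizon, episodes=5000, seed=0):
    """Fraction of simulated episodes in which the clone never leaves the expert's states."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        # all(...) stops at the first deviation, mirroring the "no recovery" assumption.
        if all(rng.random() >= per_step_error for _ in range(horizon)):
            successes += 1
    return successes / episodes

analytic = on_path_probability(0.01, 100)   # roughly 0.37
simulated = simulate_clone(0.01, 100)       # Monte Carlo check of the same quantity
```

Even a model that copies the expert correctly 99% of the time stays on familiar ground for a full 100-step game only about 37% of the time in this toy calculation. Real games are more forgiving than the no-recovery assumption, but the direction of the effect is the same.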

Based on the above, we may conclude that RL is a different learning paradigm. In ML we are solving a standalone prediction problem. In RL we are training an agent to predict the best decision and then bear its consequences, which may come further down the line. We want the model to learn more holistically. The agent needs to learn not only from the instantaneous reward but from all future rewards. It also needs to learn not just from the expert’s experience but from novel experiences the expert didn’t encounter. We solve this by letting the agent learn from its own experience. There is no human-generated training data. The agent is placed in the environment and interacts with it by making decisions based on the current state and its policy. It is rewarded based on its decisions, and its policy is updated to give higher weight to decisions that improve the total future reward. The agent needs to balance exploration of the environment against exploitation of what it has learnt.
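One simple way to see the exploration versus exploitation trade-off is an epsilon-greedy rule on a two-armed bandit, sketched here. This is a swapped-in toy problem, not Mario, and not necessarily the method later posts will use; the arm payout probabilities, `epsilon`, and the function name are assumptions chosen for illustration.

```python
import random

def epsilon_greedy_bandit(arm_probs, epsilon=0.1, steps=2000, seed=0):
    """Play a Bernoulli bandit, exploring with probability epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)    # how often each arm was pulled
    values = [0.0] * len(arm_probs)  # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))                        # explore
        else:
            arm = max(range(len(arm_probs)), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

# Arm 1 pays off more often, but the agent only finds that out by exploring.
counts, values = epsilon_greedy_bandit([0.2, 0.8])
```

With `epsilon` set to 0, the agent here would keep pulling the first arm it tried and never learn that the other one is better.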

RL Setup

Figure: Reinforcement Learning Setup

The figure above shows the RL setup. The agent gets the current state of the environment $s_t$ as input, uses its policy to take action $a_t$, and gets an immediate reward $r(s_t,a_t)$; the environment then transitions to state $s_{t+1}$ as a result of the agent’s action.
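The interaction loop can be sketched in a few lines. `CorridorEnv` is a hypothetical toy environment written for this post (a short corridor with a reward only at the far end), and its `reset`/`step` methods are an assumed interface, not the API of any particular library.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell of a corridor."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (next_state, reward, done)."""
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # delayed reward: paid only at the goal
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Roll out one trajectory of (s_t, a_t, r_t) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent picks a_t from s_t
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state                           # s_{t+1} becomes the new input
        if done:
            break
    return trajectory

def always_right(state):
    """A fixed policy used only to exercise the loop."""
    return 1

trajectory = run_episode(CorridorEnv(length=5), always_right)
```

Note that every reward along the way is 0 except the last one, which is the delayed reward problem from the Mario discussion in miniature.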

The agent’s objective is to learn a policy that maximizes the total future reward. Let $\theta$ denote the parameters of the policy. A trajectory $\tau = (s_0, a_0,\dots, s_T,a_T)$ is the sequence of random variables $s_t$ and $a_t$ produced as the decision process unrolls. The aim is to maximize the expected total reward over trajectories sampled under the policy.

\[J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)} \left[\sum_{t=0}^{T} r(s_t,a_t)\right]\]
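Since $J(\theta)$ is an expectation over trajectories, one way to get a feel for it is to estimate it by sampling: roll out many trajectories under the policy and average their total rewards. The sketch below does this for a toy corridor task in which $\theta$ is a single number, the probability of stepping right. The task, the function names and the sample sizes are all assumptions made for illustration.

```python
import random

def sample_return(theta, rng, length=5, horizon=10):
    """Roll out one trajectory and return its total reward, sum_t r(s_t, a_t)."""
    state, total = 0, 0.0
    for _ in range(horizon):
        # Stochastic policy: step right with probability theta, else left.
        action = 1 if rng.random() < theta else -1
        state = min(max(state + action, 0), length - 1)
        if state == length - 1:  # reaching the end of the corridor pays 1
            total += 1.0
            break
    return total

def estimate_J(theta, n_trajectories=2000, seed=0):
    """Monte Carlo estimate of J(theta): the average return over sampled trajectories."""
    rng = random.Random(seed)
    returns = [sample_return(theta, rng) for _ in range(n_trajectories)]
    return sum(returns) / len(returns)

J_half = estimate_J(0.5)  # a policy that moves at random
J_good = estimate_J(0.9)  # a policy that mostly moves right
```

In this toy task the estimate for the policy that mostly moves right comes out higher, so a learning algorithm that climbs $J(\theta)$ would be pushed toward larger $\theta$.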

In the next few posts, we will explore the details of how this optimization problem is solved.