Reinforcement Learning Cornerstones
This section presents the key mathematical foundations and fundamental concepts that underpin RL. It covers specialized techniques and equations essential for understanding how agents learn, make decisions, and optimize their behavior. These principles will serve as the theoretical basis for the following sections.
It is completely normal to need multiple readings to grasp these concepts fully. If you find them a bit confusing at first, that’s perfectly okay—everyone experiences this. Don’t get discouraged! With time and practice, these ideas will become much clearer. Keep going!
State Space
In RL, the state space is a fundamental concept: it defines the information the agent relies on to understand its environment and make informed decisions. There are two common types of state representation:
- Low-dimensional representation: It typically takes the form of a vector. Each element in the vector corresponds to a specific feature relevant to the task at hand, such as the positions of joints in a robotic arm, the location of an end-effector, or a desired goal state. This type of representation is preferred for its simplicity, computational efficiency, and straightforward implementation, making it ideal for tasks where the key information can be effectively captured by a relatively small number of parameters.
- High-dimensional representation: This often comes from a camera or other visual input sensor, allowing the agent to “see” the environment in the form of an image. This approach is chosen because it can capture a much richer set of information about the environment, which is especially valuable for tasks that are complex or involve many interacting factors. While high-dimensional representations require more computational resources, they provide the agent with a deeper understanding of its surroundings. A minimal sketch contrasting the two follows this list.
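To make the contrast concrete, here is a minimal sketch in Python. The specific feature layout and image size are illustrative assumptions, not values taken from any particular environment:

```python
import numpy as np

# Low-dimensional representation: a compact feature vector.
# Here, an illustrative 7-element state for a robotic arm:
# three joint angles, a 3-D end-effector position, and a gripper flag.
low_dim_state = np.array([0.12, -0.87, 1.43, 0.30, 0.15, 0.72, 1.0], dtype=np.float32)
print(low_dim_state.shape)   # (7,)

# High-dimensional representation: a raw camera image.
# An illustrative 84x84 RGB observation, as used by many vision-based agents.
high_dim_state = np.zeros((84, 84, 3), dtype=np.uint8)
print(high_dim_state.shape)  # (84, 84, 3) -> 21,168 values instead of 7
```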
Action Space
In RL, the action space defines all the possible actions an agent can take within a given environment. It is divided into two main types: continuous and discrete. In a continuous action space, the agent can select actions from a continuous range, meaning there are potentially infinite choices available (e.g., the position of a steering wheel). On the other hand, a discrete action space consists of a finite set of distinct actions, where the agent must choose from a predefined list at each time step (e.g., pressing a button in a video game).
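As an illustration, many environments declare their action spaces with Gymnasium's `spaces` module. The sketch below, which assumes the `gymnasium` package is installed, shows one discrete and one continuous space:

```python
import numpy as np
from gymnasium import spaces  # assumes the gymnasium package is installed

# Discrete action space: a finite set of choices, e.g. 4 buttons in a video game.
discrete_actions = spaces.Discrete(4)
print(discrete_actions.sample())    # an integer in {0, 1, 2, 3}

# Continuous action space: any value in a range, e.g. a steering angle in [-1, 1].
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
print(continuous_actions.sample())  # a float array such as [0.37]
```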
Reward Types
Depending on the task, the environment, and the agent’s situation, the reward function can be dense or sparse. In a dense reward system, the agent receives feedback after nearly every action, which guides the learning process in small steps. Conversely, in a sparse reward system, the agent only receives rewards occasionally, often only at the end of an episode, making it more challenging for the agent to associate specific actions with outcomes. An episode refers to a single cycle of interaction between the agent and the environment, starting from the initial state and ending in a terminal state.
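As a rough sketch, consider a goal-reaching task. The goal position, distance threshold, and reward values below are illustrative assumptions:

```python
import numpy as np

GOAL = np.array([1.0, 1.0])  # illustrative 2-D goal position

def dense_reward(position: np.ndarray) -> float:
    """Feedback after every step: the closer to the goal, the higher the reward."""
    return -float(np.linalg.norm(position - GOAL))

def sparse_reward(position: np.ndarray, threshold: float = 0.05) -> float:
    """Feedback only on success: 1.0 when the goal is reached, 0.0 otherwise."""
    return 1.0 if np.linalg.norm(position - GOAL) < threshold else 0.0

position = np.array([0.8, 0.9])
print(dense_reward(position))   # about -0.224: informative at every step
print(sparse_reward(position))  # 0.0: no signal until the goal is actually reached
```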
Policy Types
A policy is essentially the strategy an agent uses to decide what actions to take in different situations. It tells the agent what to do when it’s in a specific state. Policies are typically divided into two types:
- Stochastic Policy: This type of policy involves randomness. Instead of choosing a single action, the agent picks from a set of actions based on probabilities. For example, the agent might have a 70% chance of choosing action A and a 30% chance of choosing action B.
- Deterministic Policy: With this policy, the agent always chooses the same action for a given state. There’s no randomness: each state is mapped to exactly one action. A minimal sketch contrasting the two appears below.
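Here is a minimal sketch contrasting the two kinds of policy; the probabilities and value estimates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_policy(action_probs: np.ndarray) -> int:
    """Sample an action according to a probability distribution over actions."""
    return int(rng.choice(len(action_probs), p=action_probs))

def deterministic_policy(action_values: np.ndarray) -> int:
    """Always pick the single action with the highest estimated value."""
    return int(np.argmax(action_values))

probs = np.array([0.7, 0.3])         # 70% action A (index 0), 30% action B (index 1)
values = np.array([1.2, 0.4])        # estimated values of actions A and B in this state

print(stochastic_policy(probs))      # 0 or 1, varying between calls
print(deterministic_policy(values))  # always 0 for this state
```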
Value Functions
In simple terms, value functions help estimate how good it is for an agent to be in a certain state or to take a specific action from that state. They estimate the expected total return, commonly denoted $G_t$, that an agent can receive over time. The return is the cumulative sum of rewards the agent collects over an episode, while a reward is the immediate feedback received after a single action.
There are two types of value functions:
- State-Value Function $V^{\pi}(s)$: This estimates the expected return the agent will receive if it starts in state $s$ and follows a policy $\pi$.
- Action-Value Function (or Q-value) $Q^{\pi}(s, a)$: This estimates the expected return if the agent takes a specific action $a$ in state $s$ and then follows a policy $\pi$. Both are written out formally just after this list.
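In standard notation (one conventional formulation rather than the only one), the return and the two value functions read as follows, where $r_{t+1}$ is the reward received at step $t+1$ and $\gamma$ is the discount factor introduced in the next subsection:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s \right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s,\ a_t = a \right]
```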
Bellman Equation
The Bellman Equation describes how the value of a state depends on the values of other states. In simple terms, it defines the value of being in a particular state based on the expected future returns, which are obtained by taking actions from that state and moving to new states. These future returns are discounted by a factor $\gamma$ (a number between 0 and 1), which prioritizes immediate rewards over future ones, similar to how humans tend to favor quick rewards. For the state-value function, the equation reads:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \mid s_t = s \right]$$
Or, for the action-value function:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right]$$
Don’t worry about the equations—they are simply the mathematical representation of the Bellman Equation. Focus on understanding the underlying idea, and the math will make sense in time!
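To see the state-value form in action, here is a minimal sketch of iterative policy evaluation on a tiny, made-up MDP. The transition probabilities, rewards, and the uniform-random policy are all illustrative assumptions:

```python
import numpy as np

# A tiny, made-up MDP used purely for illustration: 3 states, 2 actions.
n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a, s'] = probability of landing in state s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.6, 0.4], [0.0, 0.1, 0.9]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])
# A fixed stochastic policy pi[s, a]: here, uniformly random over the two actions.
pi = np.full((n_states, n_actions), 0.5)

V = np.zeros(n_states)
for _ in range(200):
    # Bellman backup: expected immediate reward plus the discounted value of the
    # successor state, averaged over the policy's action probabilities.
    Q = R + gamma * P @ V        # estimate of Q^pi, shape (n_states, n_actions)
    V = (pi * Q).sum(axis=1)     # V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)

print(np.round(V, 3))  # approximate V^pi for the three states
```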
On-policy vs. Off-policy Methods
In RL, updating a policy can follow different methodologies. Broadly, these approaches fall into two categories: on-policy and off-policy learning.
An on-policy method improves the same policy that is currently being used to make decisions. The agent gathers data by interacting with the environment according to its active policy and then refines that policy based on its experiences. This keeps learning closely tied to the agent's actual behavior, but it can be sample-inefficient: once the policy changes, data collected under the old policy can no longer be reused directly, so fresh interaction with the environment is continually required.
In contrast, off-policy learning allows an agent to improve a policy using data collected from a different policy, including past experiences or external demonstrations. This flexibility makes off-policy methods more data-efficient and capable of leveraging diverse sources of information, such as replay buffers or expert demonstrations. However, learning from off-policy data introduces additional complexity in ensuring stable and reliable updates.
Both approaches have their strengths and trade-offs, with on-policy methods often preferred for stability in environments with rapidly changing dynamics, while off-policy methods excel in leveraging past data for efficient learning.
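As a concrete illustration of the off-policy ingredients mentioned above, here is a minimal replay-buffer sketch. The class name, capacity, and dummy transitions are illustrative rather than taken from any specific library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so an off-policy learner can reuse them later."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Sampled transitions may come from old versions of the policy (or from
        # another policy entirely); off-policy methods can learn from such data.
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
for step in range(100):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=8)
print(len(batch))  # 8 transitions drawn from past experience
```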
Online vs. Offline Reinforcement Learning
Imagine training a self-driving car. One approach is to let the car drive around in a controlled environment, learning from every turn, stop, and obstacle it encounters—constantly refining its driving strategy based on real-time experiences. Another approach is to train the car using a vast dataset of recorded driving scenarios, allowing it to learn without actually being on the road. These two strategies correspond to online and offline RL, respectively.
In online RL, the agent continuously interacts with the environment, making decisions in real time and adjusting its policy based on immediate feedback. This dynamic approach ensures that learning is closely tied to exploration, allowing the agent to refine its strategy based on the latest experiences. However, online RL can be resource-intensive, requiring consistent interaction with the environment, which may not always be feasible in real-world applications.
On the other hand, offline RL operates differently by learning from a fixed dataset of past experiences, collected either through prior interactions or simulations. Instead of relying on real-time feedback, the agent optimizes its policy based on pre-existing data, making offline RL particularly valuable when direct interaction is costly, risky, or impractical. However, since the agent cannot explore beyond the dataset, ensuring generalization to unseen situations becomes a key challenge.
On-policy approaches, which rely on data gathered in real time by the current policy, are ill-suited to offline RL, since no new data can be collected. In contrast, off-policy methods, which can learn from previously collected experiences, offer the flexibility to be applied in both online and offline settings.
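The structural difference can be sketched as follows; the dataset and the `update_policy` stub are placeholders for illustration only:

```python
import random

def update_policy(transition):
    """Stand-in for a real learning update (e.g., a Q-learning or actor-critic step)."""
    pass

# Offline RL: learn only from a fixed dataset of previously collected transitions.
# The dataset below is a dummy placeholder; in practice it would come from logs,
# prior interactions, or demonstrations.
offline_dataset = [
    {"state": s, "action": random.randint(0, 1), "reward": 0.0, "next_state": s + 1}
    for s in range(1000)
]
for epoch in range(5):
    for transition in offline_dataset:
        update_policy(transition)  # no environment interaction at any point

# Online RL (for contrast) would instead interleave acting and learning, roughly:
#   obs = env.reset()
#   action = policy(obs); obs, reward, done = env.step(action); update_policy(...)
```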
Exploration vs. Exploitation
Image source: https://huggingface.co/learn/deep-rl-course/unit1/exp-exp-tradeoff
How close the learned behavior comes to optimal depends strongly on how the agent explores its environment. If an RL agent does not explore enough, it may never discover the best possible actions and its learning can stagnate. Conversely, if it explores too much without leveraging what it has already learned, it wastes valuable time and resources.
This tension between exploration and exploitation presents one of the most persistent challenges in RL. Inadequate exploration may prevent the agent from solving the task due to a lack of valuable experience or cause it to become trapped in local maxima, resulting in suboptimal behavior. However, fully exploring an environment can be time-consuming or even infeasible without leveraging insights from prior experiences.
A balanced strategy is essential: one that allows the agent to explore adequately within a reasonable time frame while making use of the knowledge it has accumulated. Well-designed exploration strategies aim to accelerate learning and maximize cumulative reward, rather than relying on purely heuristic trial and error.
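One widely used way to strike this balance is epsilon-greedy action selection, sketched below; the Q-value estimates and the value of epsilon are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon explore a random action; otherwise exploit the best one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

q_values = np.array([0.5, 1.8, -0.2])  # illustrative action-value estimates for one state
epsilon = 0.1                          # often decayed over training: explore early, exploit later

actions = [epsilon_greedy(q_values, epsilon) for _ in range(1000)]
print(actions.count(1) / len(actions))  # mostly the greedy action, with occasional exploration
```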
In the ever-evolving landscape of RL, mastering the dance between exploration and exploitation is the key to unlocking truly intelligent decision-making.
In this section, we’ve covered some of the key cornerstones of Reinforcement Learning, including the state and action spaces, as well as reward systems. We explored how these foundational concepts shape the agent’s learning process, whether it’s through continuous or discrete actions, dense or sparse rewards, or low- and high-dimensional state spaces. Understanding these elements is essential for grasping how RL works in practice. We hope this has provided you with a solid foundation to build upon as we now move into the RL Taxonomy, where we’ll delve deeper into different types of RL algorithms and how they are categorized based on their approaches and applications.