Unsupervised RL Pretraining: Toward Foundation Models for Behavior

Learning without reward, acting without limits.

Published: April 2026  |  Author: Abhranil Chandra


Why Should You Care?

Large language models learned to write poetry, debate philosophy, and pass bar exams — all by predicting the next token on mountains of unlabeled text. No one hand-labeled those billions of sentences with “this is good writing.” The data itself, processed through a simple self-supervised objective, was enough.

Now imagine doing the same thing for behavior. Instead of predicting the next word, an agent learns to predict — and enact — the next action, the next movement, the next decision, all from raw, unlabeled interaction data. No reward signal. No task specification. Just an agent immersed in an environment, extracting the deep structure of how the world works and what it can do in it.

This is the promise of Unsupervised Reinforcement Learning (URL) pretraining — and I believe it is the most important open frontier in building truly autonomous, general-purpose agents.

Label-free learning has been the engine behind recent progress in learning from large amounts of data. The breakthroughs in Large Language Models showcase just how effective it is to learn the generative process from massive corpora of text data. This success has catalyzed a parallel movement in reinforcement learning: learning without labels, learning from auxiliary objectives, and learning from the structure of interaction data itself. This blog provides a comprehensive review of this movement — the methods, the mathematics, the unifying theory, and the vision for where it all leads.

If you work in RL, this post will give you a unified technical map of the landscape: from exploration-based methods to successor features to bisimulation metrics to contrastive objectives, and a recent theoretical lens that reveals they are all optimizing a shared underlying quantity. If you don’t work in RL, this post will explain why this matters: why the path to robots that help in your home, agents that manage complex supply chains, and AI systems that learn open-endedly in the real world runs directly through the ideas described here.

The Core Problem: RL is Expensive, Narrow, and Fragile

Reinforcement Learning has produced stunning results — defeating world champions at Go [47], controlling nuclear fusion plasma [48], and discovering faster matrix multiplication algorithms [49]. But each of these is a single-task achievement. RL still requires a huge number of environment interactions to learn each policy, which can make it prohibitively expensive [33].

In many settings, such as robotics, the agent needs to solve a variety of tasks, described by different reward functions, in a single environment. Learning a new policy for each new task is prohibitively expensive. And most critically, it requires a reward function — a precise mathematical specification of what “good” means — which is notoriously hard to design well and easy to hack.

The availability of data, however, has made its use inevitable. Not all of it is good: much of the data available for training RL agents comes from unknown tasks, suboptimal demonstrations, and incomplete interactions. Methods that pretrain on this data are attractive because solving every new task from scratch is simply not efficient.

What if, instead of training from scratch, we could pretrain an agent on a massive corpus of reward-free interaction data — just states, actions, and transitions — and then, at test time, instantly produce a near-optimal policy for any reward function? No further training. No further environment interaction. Just inference. This is the paradigm of unsupervised RL pretraining, and the agents that achieve it are increasingly being called Behavioral Foundation Models (BFMs) [40][41].

Setting the Stage: The MDP Framework

All problem settings presented here operate within the Markov Decision Process (MDP) framework [1]. The environment is described as an MDP $M$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P(s'|s,a)$, and discount factor $\gamma \in [0,1)$. A policy $\pi(a|s)$ describes what action to take given a state $s$. Each task is defined via a reward function $r$ that maps state-action pairs to real-valued rewards.

In the unsupervised pretraining setting, we have access to a dataset $\mathcal{D}$ of $N$ trajectories $\tau_i = (s_0^i, a_0^i, s_1^i, \ldots, s_{T_i}^i)$, collected using any policy (optimal or suboptimal), with no reward labels. The goal is to learn a representation $\mathcal{R}$ during a reward-free phase that enables efficient policy inference during a subsequent reward-based phase for any downstream task [33].

We define a URL algorithm as one consisting of an unsupervised Reward-Free phase that learns $\mathcal{R}$, followed by a supervised Reward-Based phase that efficiently uses $\mathcal{R}$ to solve the RL problem for a wide variety of downstream tasks. The reusability of $\mathcal{R}$ and the improved efficiency of the Reward-Based phase make URL algorithms well suited to settings where the agent is expected to solve many tasks in the same environment.
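To make the two-phase structure concrete, here is a minimal sketch of the interface a URL algorithm exposes. The class and method names are illustrative, not taken from any specific paper.

```python
import numpy as np
from typing import Callable, List

class Trajectory:
    """A reward-free trajectory: states of shape (T+1, state_dim), actions of shape (T, action_dim)."""
    def __init__(self, states: np.ndarray, actions: np.ndarray):
        self.states, self.actions = states, actions

class URLAlgorithm:
    """Illustrative two-phase interface; names are hypothetical."""

    def pretrain(self, dataset: List[Trajectory]) -> None:
        # Reward-Free phase: learn the reusable representation R
        # (an encoder, successor features, skill-conditioned policies, ...).
        raise NotImplementedError

    def infer_policy(self, reward_fn: Callable[[np.ndarray], float]):
        # Reward-Based phase: cheaply map a reward function to a policy,
        # reusing R instead of training from scratch.
        raise NotImplementedError
```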

Part I: Methods — Curiosity and Structure

The field has converged on major families of pretraining objectives organized around two driving forces: curiosity (the drive to explore and cover the space of possible behaviors) and structure (the drive to model and exploit the mathematical fabric of the environment’s dynamics).

1. Exploration-Based Representations

Exploration objectives can be used to learn diverse skills from data. Exploration-based methods aim for invariance to task rewards, either by exploiting offline data that is already available or by gathering new data through online interaction.

1.1 Entropy Maximization for Zero-Shot Generalization

Zisselman et al. [28] argue that exploring a domain is a fundamentally harder problem than training a policy to maximize reward in a single environment. An exploratory policy pretrained to maximize the entropy of the marginal state-visitation distribution over a finite horizon $T$ can then be used for zero-shot RL:

$$\max_\pi H\big(\rho^\pi(s)\big)$$

The entropy objective pushes the agent towards states it has not yet encountered in the episode. The objective is optimized over MDPs sampled from a distribution, yielding a policy that visits all states roughly uniformly within an episode. Entropy is estimated with a particle-based $k$-nearest-neighbor approximation [28][24]. In practice, Zisselman et al. treat the objective as a reward and optimize it with PPO.
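For illustration, here is a minimal sketch of a particle-based $k$-NN entropy estimate used as an intrinsic reward, in the spirit of [24][28]. The function name, constants, and normalization are assumptions; the papers differ in these details and in the embedding space used.

```python
import numpy as np

def knn_entropy_reward(states: np.ndarray, k: int = 3) -> np.ndarray:
    """Particle-based entropy proxy: reward each state by the log distance to its
    k-th nearest neighbor in the batch. Larger distance means a more novel state."""
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # (N, N) pairwise distances
    np.fill_diagonal(dists, np.inf)               # exclude self-distance
    knn_dist = np.sort(dists, axis=1)[:, k - 1]   # distance to the k-th neighbor
    return np.log(1.0 + knn_dist)                 # per-state intrinsic reward

# Usage: label a batch of (encoded) states, then optimize the reward with PPO or SAC.
r_int = knn_entropy_reward(np.random.randn(256, 8), k=3)
```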

But learning an exploratory policy is not enough, and disentangling exploration from the underlying dynamics is important — we still need a way to steer behavior. One option is to train an ensemble of task policies and fall back on the exploratory policy whenever the ensemble fails to agree on an action [28]. The entropy-maximizing policy shows a small generalization gap on unseen tasks; generalization is validated by training agents on 200, 500, and 1000 levels of Procgen's Maze, Jumper, and Miner, demonstrating that exploration learns reward invariance.

1.2 Prototypical Representations with Intrinsic Reward

A more sample-efficient approach with the same guiding motivation is to learn representations over the explored data. The core idea is to learn a visual representation without task rewards in a self-supervised way [24]. The pretraining phase has two parts: first, learn a set of prototypical embeddings that form the basis of a low-dimensional latent space; second, use these prototypes to maximize an entropy-based intrinsic reward, encouraging exploration of the environment. The intrinsic reward is similar to the one in [28], but it is computed over encoded observations rather than raw states. Yarats et al. [24] demonstrate strong downstream performance by adding the intrinsic reward to the task reward when training a SAC agent on DeepMind Control.

1.3 Diverse Temporal Behavior from Offline Data

Not all methods require interaction with the environment during pretraining. Diverse behavior from offline data can also be learned in a task-agnostic way. Park et al. [17] — Foundation Policies with Hilbert Representations (HILP) — optimize for diverse temporal behavior in the pre-training phase. They learn a state representation $\phi$ that captures the temporal structure from unlabeled trajectories by using the equivalence between temporal distance and the optimal goal-conditioned value function. This representation is used to learn a policy that spans the latent space and captures diverse skills in the offline data. Few-shot adaptation to downstream tasks is achieved by learning only the task-dependent information.
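As a rough sketch of how a latent direction can define a skill reward, in the spirit of HILP [17]: the representation $\phi$ is assumed already learned, and the normalization and exact form are simplified.

```python
import numpy as np

def latent_direction_reward(phi_s: np.ndarray, phi_s_next: np.ndarray,
                            z: np.ndarray) -> float:
    """Reward movement of the representation phi along a latent direction z.
    A policy conditioned on z that maximizes this reward traverses the latent
    space in direction z, so different directions yield different skills."""
    z = z / (np.linalg.norm(z) + 1e-8)        # unit-norm skill direction
    return float((phi_s_next - phi_s) @ z)    # latent displacement projected onto z
```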

1.4 Curiosity-Driven Online Exploration

Curiosity-driven objectives can be used to improve online data collection and exploration for new tasks, and they have been effective at solving sparse-reward tasks at test time [4][14]. The unlabeled data can be labeled with optimistic rewards using Random Network Distillation (RND) [4] to guide exploration towards unknown states. An RL agent trained with this data can then collect data through online interactions and subsequently update the reward model using this online experience. Li et al. [14] evaluate this iterative approach for rapid exploration in challenging sparse-reward domains such as AntMaze, Adroit hand manipulation, and a vision-based simulated robotic manipulation domain.
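A minimal sketch of an RND-style novelty bonus [4]. Network sizes, the learning rate, and function names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

obs_dim, emb_dim = 17, 64
target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
for p in target.parameters():
    p.requires_grad_(False)          # the target network stays fixed and random

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus(obs: torch.Tensor) -> torch.Tensor:
    """Prediction error against the random target: high on rarely visited states."""
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1)

def rnd_update(obs: torch.Tensor) -> None:
    """Fit the predictor on visited states so that familiar states lose their bonus."""
    loss = rnd_bonus(obs).mean()     # gradient flows through the predictor only
    opt.zero_grad(); loss.backward(); opt.step()
```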

1.5 Skill Extraction for Online Exploration

Offline data can also be combined with such objectives to improve online exploration. Wilcoxson et al. [23] extract low-level skills from the offline dataset using a variational encoder, with a learned prior that keeps the skill posterior close to the offline data. Essentially, the method learns a low-level policy that acts around the dataset trajectories and a high-level policy through online interaction. This is effective for tasks that fundamentally require exploration to be solved. Both this approach and [14] improve online exploration by using unlabeled data and implicitly capturing invariance.

Key Takeaway Curiosity-based approaches capture a notion of diversity and learn to disentangle this diversity for downstream tasks. They demonstrate that reward invariance is both achievable and valuable. How exploration relates to representation learning for downstream performance deserves additional study.

2. Structured Representations: Successor Features, Measures, and Beyond

Several approaches learn representations that explicitly exploit structure in the environment: when the dynamics are shared, this structure stays the same across tasks. This section focuses on different ways to learn such invariance.

2.1 Successor Features (SF)

Successor Features, introduced by Barreto et al. [2] building on Dayan's successor representation [36], enable transfer in RL by exploiting the shared structure of the dynamics. If the expected one-step reward of a transition $(s, a, s')$ decomposes into features $\phi$ and task weights $w$ (i.e., $r \approx \phi^\top w$), then the Q-function factors into the task-specific weights and a task-independent cumulant $\psi$, called the Successor Feature:

$$Q^\pi(s,a) = \psi^\pi(s,a)^\top w \quad\text{where}\quad \psi^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t \geq 0} \gamma^t \phi(s_t) \mid s, a\right]$$

Once successor features are learned for a set of policies, Generalized Policy Improvement (GPI) constructs a policy for a new task that performs at least as well as any policy in the set [2]. Importantly, with SF and GPI no task-specific information needs to be shared to obtain these policies [3].
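A minimal numpy sketch of SF + GPI action selection for a new task vector $w$; shapes and names are illustrative.

```python
import numpy as np

def gpi_action(psi: np.ndarray, w: np.ndarray) -> int:
    """psi: successor features of pretrained policies at the current state,
            shape (n_policies, n_actions, feat_dim).
       w:   task weight vector, shape (feat_dim,), from r ≈ phi^T w.
       GPI acts greedily w.r.t. the best Q-value across all pretrained policies."""
    q = psi @ w                          # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())   # max over policies, argmax over actions

# Usage with random placeholders: 5 policies, 4 actions, 16-dim features.
a = gpi_action(np.random.randn(5, 4, 16), np.random.randn(16))
```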

SF and GPI can be combined with Universal Value Function Approximators (UVFAs) [20] to learn value functions over task encodings. Borsa et al. [3] introduced Universal Successor Feature Approximators (USFAs) that learn a span over the space of policies. While this offers good generalization, the performance on a new task depends on how the policies used for learning the span are sampled.

Extensions include attention-based SF learning [5], distributional returns for more robust SF approximation [6], first-occupancy representations for non-Markovian rewards [16], and Active Pretraining with Successor Features (APS) [13] which combines variational successor features with nonparametric entropy maximization to address limitations of both mutual information and entropy-based unsupervised RL.

2.2 Successor Measures (SM)

The successor feature decomposition can be extended to state-visitation distributions. Touati & Ollivier [22] introduced the idea of decomposing the Q-function into a successor measure and the reward. The successor measure encodes the discounted distribution of future states visited under the policy:

$$M^\pi(s, a, X) = \mathbb{E}_\pi\!\left[\sum_{t \geq 0} \gamma^t \mathbf{1}(s_{t+1} \in X) \mid s, a\right] \quad \forall X \subset \mathcal{S}$$

This is fundamental because for any reward function $r$ and policy $\pi$: $Q^\pi(s,a) = \sum_{s'} M^\pi(s,a,s') \cdot r(s')$. The successor measure completely disentangles dynamics from reward. Successor features and measures are related: $\psi^\pi(s,a) = \sum_{s'} M^\pi(s,a,s') \phi(s')$ [32].

Learning the successor measure representation for all policies allows for adaptation to any reward function. Forward-Backward (FB) representations [22][30] provide a practical instantiation: $M^{\pi_z}(s,a,s^+) = F(s,a,z)^\top B(s^+)$, where $F$ encodes future reachability and $B$ encodes backward state features. Policy inference reduces to $z^* = Br$ — a simple matrix-vector product.
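A minimal sketch of FB-style zero-shot inference on a finite state set, following the spirit of [22][30]. In the real method $F$ is conditioned on $z$ and expectations are taken under a sampling distribution $\rho$; here $F$ is shown for a single fixed $z$, $\rho$ is uniform, and all array names are illustrative.

```python
import numpy as np

d, n_states, n_actions = 32, 100, 4
B = np.random.randn(n_states, d)              # backward embeddings B(s+)
F = np.random.randn(n_states, n_actions, d)   # forward embeddings (fixed z, simplified)

def infer_task_embedding(reward: np.ndarray) -> np.ndarray:
    """z = E_{s+ ~ rho}[ r(s+) B(s+) ], here a uniform average over states."""
    return (reward[:, None] * B).mean(axis=0)          # (d,)

def q_values(state: int, z: np.ndarray) -> np.ndarray:
    """Q^{pi_z}(s, a) = F(s, a, z)^T z, up to a normalization constant."""
    return F[state] @ z                                 # (n_actions,)

reward = np.random.rand(n_states)        # any test-time reward over states
z = infer_task_embedding(reward)         # "z* = Br": a single matrix-vector product
greedy_action = int(q_values(state=0, z=z).argmax())
```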

2.3 Proto-Successor Measures (PSM)

The successor measure decomposition uses the linearity of Bellman equations. Agarwal et al. [29] exploit this to represent the successor measure using a policy-independent affine basis:

$$M^\pi(s,a,s^+) = \sum_i \phi_i(s,a,s^+) w_i^\pi + b(s,a,s^+)$$

where $\phi$ are the policy-independent basis functions, $b$ is the policy-independent bias, and $w^\pi$ is a linear weight that depends on the policy. This enables an affine representation space containing the successor measures for all policies. Unlike successor features, PSM does not directly link the policy to its corresponding reward. Given any reward function, a simple constrained Linear Program needs to be solved to obtain $w^*$ [29].

Learning the span of policies does not have to be restricted to optimal policies. PSM can represent the entire space of valid policies through the affine set. For identical dimensionalities, PSM provably represents a larger class of value functions than PVF-style approaches [29].

Key Takeaway Successor Features and Successor Measures offer a concise way to represent the underlying structure explicitly. Their ability to transfer to downstream tasks depends on how the space of policies is encoded, which makes that encoding an important area for closer examination. Structure in the environment dictates structure in the solution space, but this interdependence needs further investigation.

3. Bisimulation-Based Representations

Bisimulation metrics capture invariance to task-irrelevant features. They group functionally similar states in a latent space, and capturing this similarity in an MDP can be useful for generalization over large state spaces [8]. In theory, bisimulation defines an equivalence relation over the state space: two states are bisimilar if they have the same immediate reward and transition to states that are themselves bisimilar. Partitioning the state space exactly under this relation is impractical [8], so a semi-metric relaxation is defined instead to quantify how similar two states are.

Castro et al. [7] introduced MICo, which replaces the expensive coupling over transition distributions with a sampling-based approximation, making the metric feasible to estimate in large state spaces. A common problem is representation collapse, where dissimilar states receive identical representations; this is particularly acute in sparse-reward settings. Zang et al. [26] mitigate it by measuring the distance between latent states with cosine distance (SimSR).
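To make the learning signal concrete, here is a rough sketch of a sampled bisimulation-style distance target in latent space, using cosine distance in the spirit of SimSR [26]. The function names are illustrative, and the actual losses and sampling schemes differ across papers.

```python
import numpy as np

def cosine_dist(x: np.ndarray, y: np.ndarray) -> float:
    return 1.0 - float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def bisim_target(r_i: float, r_j: float,
                 z_next_i: np.ndarray, z_next_j: np.ndarray,
                 gamma: float = 0.99) -> float:
    """Target distance between two sampled states: reward difference plus the
    discounted distance between their sampled next-state latents. The encoder
    is trained so that dist(z_i, z_j) regresses towards this target."""
    return abs(r_i - r_j) + gamma * cosine_dist(z_next_i, z_next_j)
```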

3.1 Goal-Conditioned Bisimulation

Bisimulation can result in effective transfer of skills across analogous tasks [10]. Hansen-Estruch et al. [10] define a goal-conditioned bisimulation relation in which, for equivalent state-goal pairs, the reward and transition distributions must match across all actions. In practice, an on-policy version of this relation yields a paired-state metric used to learn state-goal representations and train an offline goal-conditioned policy. A bisimulation objective can also be combined with an intrinsic reward based on forward-model error to improve exploration [11].

3.2 Bisimulation from Offline Data

Learning from an offline dataset is notoriously difficult with function approximation. Pavse et al. [18] show that representations learned with the bisimulation metric yield a value function that does not diverge: these representations stabilize TD-learning and are Bellman complete. Zang et al. [27] further reduce estimation error using an expectile operator. For real-world deployment, Yin et al. [25] show that naively training with adversarial states and actions does not transfer well; robust representations can be achieved by training on perturbed states and goals with a contrastive objective.

3.3 Action Bisimulation

Bisimulation-based metrics can also capture behavioral invariance by learning action representations. Shi et al. [21] define a state-conditional action-chunk bisimulation metric:

$$d(c_i, c_j \mid s_t) = |R_{s_t}^{c_i} - R_{s_t}^{c_j}| + \gamma \, W_2\big(P_{s_t}^{c_i}, P_{s_t}^{c_j}; d_c\big)$$

where $R^c_{s_t}$ is the cumulative discounted reward for executing chunk $c$ from $s_t$ and $W_2$ is the 2-Wasserstein distance. Such representations are effective for complex tasks and are validated on 7-DoF arm control [21]. Learning action representations from offline data is also achievable: Rudolph et al. [19] learn action-bisimulation representations using forward and inverse dynamics models, with a trainable out-of-distribution detector $\hat{I}_\beta(a|s)$ to mitigate distribution mismatch.

Key Takeaway Bisimulation effectively captures behavioral similarity and can be adapted to offline datasets, online exploration, and action representations. However, computing the metric, which requires comparing transition distributions, remains a bottleneck.

4. Contrastive Learning-Based Representations

Contrastive learning objectives have been widely used in unsupervised RL. The framework learns representations by exploiting structure between similar and dissimilar pairs of inputs, effectively performing a dictionary lookup in which positives and negatives serve as keys with respect to a query. The most commonly adopted objective is the InfoNCE loss:

$$\mathcal{L}_\text{contrastive} = -\log \frac{\exp\big(\text{sim}(q, k^+)/\tau\big)}{\sum_{k \in \mathcal{K}} \exp\big(\text{sim}(q, k)/\tau\big)}$$

where $q$ is the query embedding, $k^+$ is the positive key embedding, $\mathcal{K}$ is the set of all keys, $\text{sim}(\cdot)$ is cosine similarity, and $\tau$ is temperature.
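A minimal PyTorch sketch of the InfoNCE loss above with in-batch negatives. The temperature value is arbitrary; cosine similarity follows the formula, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """queries, keys: (batch, dim). keys[i] is the positive for queries[i];
    every other key in the batch serves as a negative."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / tau                        # cosine similarities / temperature
    labels = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, labels)          # -log softmax weight of the positive
```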

Contrastive learning can be used to train RL agents on representations of image-based observations [12], and it can improve the robustness of bisimulation representations by perturbing negative samples [25]. Ma et al. [15] provide a particularly elegant application: learn representations via a goal-conditioned offline pretraining objective that minimizes the distance between the policy's goal-conditioned state-occupancy distribution and the data distribution. The dual of this objective yields a contrastive RL objective, enabling zero-shot generalization in goal-conditioned RL.

Recent theoretical work has revealed deep connections between contrastive RL, goal-conditioned RL (GCRL), and mutual information skill learning (MISL) [46][45]. The variational lower bounds used in MISL reduce to InfoNCE-style objectives, linking skill learning to contrastive methods.

Key Takeaway Contrastive Learning provides a sample-efficient way to learn representations that cluster positive pairs and separate negative pairs. It serves as a natural bridge between representation learning and policy optimization.

Part II: The Unifying Perspective — It’s All About the Successor Measure

On the surface, the four families described above look very different. But a recent analysis [33] reveals something striking: all of these methods are approximating the same intractable objective — learning the successor measure $M^\pi$ — under different assumptions and with different state compressions.

Over the years, many URL algorithms have been proposed for pretraining in the reward-free setting [15][30][29][17][34][37]. Through these algorithms, structures as varied as state encoders [19], latent skills [34], successor representations [36], or goal-conditioned policies can be pretrained and used for rapid downstream inference. On the surface, these techniques appear to optimize very different objectives, even though they share the goal of rapid policy inference. With the proliferation of techniques, it has become challenging for researchers to identify unexplored areas and differentiate existing methods [33].

The Unified Objective

The successor measure $M^\pi(s,a,s^+)$ represents the discounted measure of ending in state $s^+$ starting from $s$, taking action $a$, and following $\pi$ thereafter. If we could represent it exactly for all policies, the Q-function for any reward would be immediate: $Q^\pi(s,a) = \sum_{s^+} M^\pi(s,a,s^+) \cdot r(s^+)$. This is intractable exactly, so each URL method makes assumptions about [33]:

  1. $\Pi$ — the class of policies for which $M^\pi$ is approximated
  2. $\mathcal{T}$ — the distribution of tasks/rewards for inference
  3. $\phi$ — the state compression / abstraction

These create a fundamental performance-efficiency tradeoff.

How Each Family Approximates $M^\pi$

Goal-Conditioned RL (GCRL): Restricts rewards to goal-reaching: $r_z(s_t, a_t) = (1 - \gamma)p(s_{t+1} = z \mid s_t, a_t)$. Under unification, $Q^{\pi_z}(s,a) \propto M^{\pi_z}(s,a,z)$ — the value function is proportional to the successor measure evaluated at the goal [33]. Approaches like VIP [15] and HILP [17] additionally parameterize this as a metric: $M^{\pi_z} \propto -\|\phi(s) - \phi(z)\|$. Contrastive RL [46] connects GCRL to density estimation. A deep connection exists between GCRL and MISL: when the goal space equals the state space, GCRL with a Gaussian reward becomes equivalent to MISL with a Gaussian variational distribution [45].

Mutual Information Skill Learning (MISL): Algorithms like DIAYN [34] and METRA [35] maximize mutual information $I(S; Z)$ between states and skills. This can be decomposed in reverse form $I = H(Z) - H(Z|S)$ or forward form $I = H(S) - H(S|Z)$, yielding different variational bounds [33]. DIAYN uses the reverse, resulting in intrinsic reward $r_\text{int}(s,z) = \log q_\phi(z|s)$. METRA [35] replaces standard MI with the Wasserstein Dependency Measure under temporal distance, discovering diverse locomotion in pixel-based Quadruped and Humanoid. Under unification, MISL learns $M^{\pi_z}(s, s^+) = \frac{q(z|s^+, s)\, p(s^+|s)}{p(z)}$ [33]. An important finding by Eysenbach et al. [43] is that MISL does not recover all optimal policies.
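A minimal sketch of the DIAYN-style reverse-form objective [34]: a discriminator $q_\phi(z|s)$ is trained to infer the skill from the visited state, and its log-probability is the intrinsic reward. A uniform skill prior is assumed here, so the constant $-\log p(z)$ term is omitted; network sizes and names are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_skills, state_dim = 16, 10
discriminator = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                              nn.Linear(128, n_skills))
opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

def intrinsic_reward(states: torch.Tensor, skills: torch.Tensor) -> torch.Tensor:
    """r_int(s, z) = log q_phi(z|s): high when the skill is identifiable from the state."""
    with torch.no_grad():
        log_probs = F.log_softmax(discriminator(states), dim=-1)
    return log_probs.gather(1, skills[:, None]).squeeze(1)   # skills: integer skill ids

def discriminator_update(states: torch.Tensor, skills: torch.Tensor) -> None:
    """Train q_phi to predict which skill produced each visited state."""
    loss = F.cross_entropy(discriminator(states), skills)
    opt.zero_grad(); loss.backward(); opt.step()
```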

Successor Features (SF): Under unification, $M^{\pi_z}(s,a,s^+) = \psi(s,a,z)(\Phi^\top \Phi)^{-1} \Phi^\top$ [33]. Policy inference reduces to linear regression: $z^* = \arg\min_z \|r - \Phi^\top z\|^2$. The FB representation [22] further makes $z^* = Br$. This is extremely efficient but assumes the reward lies in the linear span of $\phi$.

Proto-Successor Measures (PSM): $M^{\pi_z}(s,a,s^+) = \sum_i \phi_i(s,a,s^+) w_i^\pi + b(s,a,s^+)$ [29]. The task class is any reward function. $M^\pi$ is represented directly for a large policy set, but pretraining requires significant data coverage.

Proto-Value Functions (PVF): PVFs [38] represent value functions using the spectral decomposition of the graph Laplacian: $V^\pi = \phi w$. Under unification, the eigenvectors of PVFs are the same as those of $M^{\pi_U}(s,s^+)$ where $\pi_U$ is a uniform random policy [33]. Proto-Value Networks [37] and Adversarial Value Functions [39] have attempted to scale this idea.

Controllable Representations (CR): Methods like ACRO [50] learn an encoder $\phi: \mathcal{S} \to \mathcal{X}$ through multi-step inverse dynamics prediction, retaining only action-relevant information (the endogenous state of an Exo-MDP). Action-Bisimulation [19] extends this to infinite-horizon controllability. Under unification, multi-step inverse methods model $M^{\pi_\beta}_K(s,a,s^+) = \frac{f(a|s,s^+)\, p^{\pi_\beta}(s^+|s)}{\pi_\beta(a|s)}$ [33]. These provide efficient representations but do not directly admit a policy — downstream RL is still needed.

World Models: World models learn the dynamics of the environment — from single-step dynamics models [51] to latent dynamics models like the Dreamer family [52] to multi-step generative models. Under the unified perspective, world models represent $M^\pi$ as a generative model — they are the generator functions to the otherwise density-based estimation of the successor measure [33]. Density-based methods compute $\sum_{s^+} M^\pi(s,a,s^+) r(s^+)$ directly, enabling very quick inference. World models instead compute $\mathbb{E}_{s^+ \sim M^\pi}[r(s^+)]$ via sampling, which is time-consuming and requires significantly more computation. As a result, world model inference, while sample-efficient, is not computationally efficient [33]. Geometric Horizon Models ($\gamma$-models) are the generator functions of normalized successor measures.
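The computational difference is easy to see in a small tabular sketch: a density-style successor measure answers a reward query with one matrix-vector product, while a world model must roll out samples through its dynamics and average. The toy dynamics and all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 20, 0.95
P = rng.dirichlet(np.ones(n_states), size=n_states)   # toy policy-conditioned dynamics P(s'|s)
reward = rng.random(n_states)

# Density-based: M = sum_t gamma^t P^{t+1} has the closed form (I - gamma P)^{-1} P.
M = np.linalg.inv(np.eye(n_states) - gamma * P) @ P
q_density = M @ reward                                  # one matrix-vector product

# Generative (world-model) style: Monte Carlo rollouts through the learned dynamics.
def q_rollout(s0: int, n_rollouts: int = 200, horizon: int = 100) -> float:
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t in range(horizon):
            s = rng.choice(n_states, p=P[s])            # simulate one step
            ret += (gamma ** t) * reward[s]
        total += ret
    return total / n_rollouts

# Both estimate the same quantity, but the rollout version needs many simulated steps.
print(q_density[0], q_rollout(0))
```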

A key challenge for world models is that latent dynamics learning (minimizing only prediction error) is susceptible to representation collapse. Methods typically add regularizers such as reconstruction, inverse kinematics prediction, orthogonal regularization, or variational losses to prevent this [33].

In my own recent work, VideoAgent [53], we explored using generative video models as world models for embodied planning, demonstrating that while video generation models can simulate physics impressively, they suffer from compounding hallucinations over long horizons. We address this through self-conditioning consistency mechanisms and VLM-guided feedback, turning a generative video model into a robust planner.

The connection between world models and successor measures is one I elaborate on in detail in Chapter 2 of my Masters thesis [54], where I show how the spectrum of world model architectures — from single-step dynamics to latent dynamics (Dreamer), to multi-step generative models — can be understood as different strategies for representing the same successor measure, differing primarily in their generative assumptions and planning costs. This perspective clarifies why world models require planning for inference (they learn the generator, not the density) while successor feature methods achieve near-instant inference (they learn the density directly).

State Equivalence: The Universal Compression Principle

Making $M^\pi$ tractable requires compressing the state space. The unifying framework reveals that each URL method implicitly or explicitly learns a state abstraction $\phi: \mathcal{S} \to \mathcal{X}$ that groups states with similar future distributions under the successor measure [33]. An ideal abstraction would have $\phi(s_1) = \phi(s_2) \iff s_1 = s_2$, but this implies no compression. In practice, we want an abstraction preserving future predictability:

$$\phi(s_1) = \phi(s_2) \iff M^\pi(\phi(s_1), a, \phi(s^+)) = M^\pi(\phi(s_2), a, \phi(s^+))$$

Each method defines a different equivalence metric [33]: for example, bisimulation groups states with matching rewards and transitions, HILP-style methods group states by temporal distance, and SF-based methods group states with the same features $\phi$.

This is a profound observation: the diversity of URL methods is, at its core, a diversity of state equivalence metrics, all serving the same purpose of making successor measure estimation tractable.

The Pareto Frontier and Novel Algorithm Design

The practical consequence is a clear performance-efficiency tradeoff [33]. Methods representing $M^\pi$ for a larger class of policies (PSM, World Models) achieve higher performance over wider task distributions but require more computation at inference. Methods restricting $\Pi$ (GCRL, SF) are more efficient but may fail outside their representational assumptions. Gridworld experiments confirm this Pareto frontier: GCRL achieves near-perfect goal-reaching performance but fails on arbitrary rewards, while PSM achieves the highest overall URL performance with modest inference cost.

A direct implication is the ability to combine Reward-Free and Reward-Based phases of different methods. Cross-combinations like GCRL+SF [17], MISL+SF (CSF) [44], and PVF+SF achieve favorable intermediate points [33]. The unifying framework paves the way for novel algorithms centered around the unified objective.

Part III: Behavioral Foundation Models

A Behavioral Foundation Model (BFM) [40][41], for a given MDP, is an agent trained in unsupervised fashion on reward-free transitions that can produce approximately optimal policies for a large class of reward functions specified at test time, without additional learning or planning. State-of-the-art BFMs build upon successor features and forward-backward representations [22][30]. Given a reward function $r$, a policy is inferred by linear regression: $z_r = \arg\min_z \mathbb{E}_{s \sim \rho}[(r(s) - \phi(s)^\top z)^2]$, returning the pretrained policy $\pi_{z_r}$ [32].

Fast Adaptation: Beyond Zero-Shot

A critical insight from Sikchi et al. [32] is that zero-shot policies from BFMs are often suboptimal — but they are excellent initializations. The pretrained latent space contains policies more performant than those identified by the zero-shot inference procedure. ReLA (Residual Latent Adaptation) and LoLA (Lookahead Latent Adaptation) search the low-dimensional task-embedding space, achieving 10–40% improvement in a few tens of episodes. Crucially, these mitigate the initial “unlearning” phase common in fine-tuning pretrained RL models — because adaptation happens in the latent task space rather than full parameter space, the pretrained structure is preserved [32].
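A rough sketch of adapting in the latent task space rather than in parameter space. This is a generic hill-climbing stand-in, not the exact ReLA or LoLA procedures of [32]; `evaluate_return` is a hypothetical hook that runs or estimates the return of the pretrained policy $\pi_z$.

```python
import numpy as np
from typing import Callable

def adapt_latent(z_init: np.ndarray,
                 evaluate_return: Callable[[np.ndarray], float],
                 n_iters: int = 20, step: float = 0.1) -> np.ndarray:
    """Search around the zero-shot embedding z_init. Only the low-dimensional z
    changes, so the pretrained policy network itself is never unlearned."""
    z_best, best_ret = z_init, evaluate_return(z_init)
    for _ in range(n_iters):
        z_cand = z_best + step * np.random.randn(*z_best.shape)   # perturb the task embedding
        ret = evaluate_return(z_cand)
        if ret > best_ret:
            z_best, best_ret = z_cand, ret
    return z_best
```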

RLZero: From Language to Policy

The RLZero framework [31] demonstrates prompt-to-policy — paralleling the imagination capabilities of humans:

  1. Imagine: Given a language command, use a video generation model to produce imagined trajectories.
  2. Project: Map imagined frames to the agent’s observation space via semantic similarity search (using CLIP/SigLIP embeddings).
  3. Imitate: Use a pretrained BFM to output a policy matching the imagined state-visitation distribution — zero-shot, using the distribution-matching capabilities of unsupervised RL.

RLZero sidesteps reward functions entirely and frames language-to-skill inference as matching state-only distributions. It represents a first approach to zero-shot cross-embodiment transfer [31]. This opens the possibility of prompting to generate a policy, analogous to how we prompt language models to generate text.
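As an illustration of the Project step above, here is a rough numpy sketch that maps imagined frames to the agent's own observations by cosine similarity over precomputed embeddings (e.g., from CLIP or SigLIP). The function name and the assumption of precomputed embeddings are mine; this is not the exact RLZero pipeline [31].

```python
import numpy as np

def project_imagined_frames(imagined_emb: np.ndarray,
                            dataset_emb: np.ndarray) -> np.ndarray:
    """imagined_emb: (T, d) embeddings of generated video frames.
       dataset_emb: (N, d) embeddings of the agent's observations.
       Returns, for each imagined frame, the index of the most similar observation."""
    a = imagined_emb / np.linalg.norm(imagined_emb, axis=1, keepdims=True)
    b = dataset_emb / np.linalg.norm(dataset_emb, axis=1, keepdims=True)
    sims = a @ b.T                     # (T, N) cosine similarities
    return sims.argmax(axis=1)         # projected trajectory in the observation space

# The projected observation sequence defines a target state-visitation distribution
# that the pretrained BFM then imitates zero-shot.
```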

Part IV: The Path to Generalist Agents

The recipe mirrors language models: (1) Massive reward-free interaction data, (2) Self-supervised objective capturing environment structure, (3) Scale, (4) Condition on task specification (reward function, goal, language, demonstration) for instant behavioral inference.

The resulting Behavioral Foundation Models are the behavioral analog of GPT for text or CLIP for vision-language. Recent work has already demonstrated this for humanoid control: zero-shot whole-body humanoid control achieved through BFMs [41], where pretrained agents produce diverse locomotion and manipulation behaviors without any task-specific training.

The platonic representation hypothesis suggests that as we build sufficiently rich simulators and train over increasingly complex task curricula, learned representations should converge toward reality’s underlying structure. If this holds, agents trained in such settings could zero-shot transfer to the real world. Methods like successor measures, which model future state occupancy from every state-action pair, are perfectly positioned for this vision.

Open Questions

  1. How does exploration relate to representation learning, and how much of downstream performance does each account for?
  2. How should the space of policies be encoded in structured representations, and how does structure in the environment dictate structure in the solution space?
  3. Can the cost of computing bisimulation metrics, which requires comparing transition distributions, be reduced without losing their invariance properties?
  4. What is the right pretraining recipe (data, objective, scale, and task conditioning) for Behavioral Foundation Models that transfer to the real world?

Conclusion

This article has provided an integrated overview of recent research trends in unsupervised and self-supervised methods for representation learning in RL. It covers approaches based on exploration, successor features and measures, bisimulation, and contrastive learning, providing a unification of the core objectives through the lens of successor measure estimation.

Curiosity-based approaches capture diversity and learn to disentangle it for downstream tasks. Successor Features and Measures capture structure explicitly — the affine structure of successor measures provides a principled mathematical foundation for representing the entire behavior space. Bisimulation learns behavioral similarity but remains computationally bottlenecked. Contrastive learning provides sample-efficient representations with deep connections to GCRL and MISL. The unifying perspective reveals that all of these methods are approximating the successor measure under different assumptions and state compressions, creating a Pareto frontier that directly informs algorithm selection [33].

Most importantly, unsupervised RL pretraining represents a fundamental paradigm shift. The resulting Behavioral Foundation Models — capable of zero-shot policy inference, fast adaptation, and language-to-behavior translation — represent the most promising path I see toward building truly autonomous, interactive, open-ended foundation models for action, behavior, and generalist agency.

The successor measure is the backbone. The data is accumulating. The models are getting bigger. The question is not whether this paradigm will produce general-purpose behavioral agents, but when — and what the right pretraining recipe will turn out to be.

References

  1. Bellman, R. Dynamic programming and stochastic control processes. Information and Control, 1958.
  2. Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., van Hasselt, H.P., Silver, D. Successor features for transfer in RL. NeurIPS, 2017.
  3. Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., Van Hasselt, H., Silver, D., Schaul, T. Universal successor features approximators. arXiv:1812.07626, 2018.
  4. Burda, Y., Edwards, H., Storkey, A., Klimov, O. Exploration by random network distillation. ICLR, 2019.
  5. Carvalho, W., Filos, A., Lewis, R.L., Singh, S., et al. Composing task knowledge with modular SF approximators. arXiv:2301.12305, 2023.
  6. Carvalho, W., et al. Combining behaviors with the successor features keyboard. NeurIPS, 2023.
  7. Castro, P.S., Kastner, T., Panangaden, P., Rowland, M. MICo: Improved representations via sampling-based state similarity for MDPs. NeurIPS, 2021.
  8. Ferns, N., Panangaden, P., Precup, D. Metrics for finite Markov decision processes. UAI, 2004.
  9. Gu, P., Zhao, M., Chen, C., Li, D., Hao, J., An, B. Learning pseudometric-based action representations for offline RL.
  10. Hansen-Estruch, P., Zhang, A., Nair, A., Yin, P., Levine, S. Bisimulation makes analogies in goal-conditioned RL. ICML, 2022.
  11. Kemertas, M., Aumentado-Armstrong, T. Towards robust bisimulation metric learning. NeurIPS, 2021.
  12. Laskin, M., Srinivas, A., Abbeel, P. CURL: Contrastive unsupervised representations for RL. ICML, 2020.
  13. Liu, H., Abbeel, P. APS: Active pretraining with successor features. ICML, 2021.
  14. Li, Q., Zhang, J., Ghosh, D., Zhang, A., Levine, S. Accelerating exploration with unlabeled prior data. NeurIPS, 2023.
  15. Ma, Y.J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V., Zhang, A. VIP: Towards universal visual reward and representation via value-implicit pre-training. arXiv:2210.00030, 2022.
  16. Moskovitz, T., Wilson, S.R., Sahani, M. A first-occupancy representation for RL. arXiv:2109.13863, 2021.
  17. Park, S., Kreiman, T., Levine, S. Foundation policies with Hilbert representations. ICML, 2024.
  18. Pavse, B.S., Chen, Y., Xie, Q., Hanna, J.P. Stable offline value function learning with bisimulation-based representations. arXiv:2410.01643, 2024.
  19. Rudolph, M., Chuck, C., Black, K., Lvovsky, M., Niekum, S., Zhang, A. Learning action-based representations using invariance. arXiv:2403.16369, 2024.
  20. Schaul, T., Horgan, D., Gregor, K., Silver, D. Universal value function approximators. ICML, 2015.
  21. Shi, L., Hao, J., Tang, H., Dong, Z., Zheng, Y. Self-supervised bisimulation action chunk representation for efficient RL. NeurIPS Safe Generative AI Workshop, 2024.
  22. Touati, A., Ollivier, Y. Learning one representation to optimize all rewards. NeurIPS, 2021.
  23. Wilcoxson, M., Li, Q., Frans, K., Levine, S. Leveraging skills from unlabeled prior data for efficient online exploration. arXiv:2410.18076, 2024.
  24. Yarats, D., Fergus, R., Lazaric, A., Pinto, L. RL with prototypical representations. ICML, 2021.
  25. Yin, X., Wu, S., Liu, J., Fang, M., Zhao, X., Huang, X., Ruan, W. Representation-based robustness in goal-conditioned RL. AAAI, 2024.
  26. Zang, H., Li, X., Wang, M. SimSR: Simple distance-based state representations for deep RL. AAAI, 2022.
  27. Zang, H., et al. Understanding and addressing pitfalls of bisimulation representations in offline RL. NeurIPS, 2023.
  28. Zisselman, E., Lavie, I., Soudry, D., Tamar, A. Explore to generalize in zero-shot RL. NeurIPS, 2023.
  29. Agarwal, S., Sikchi, H., Stone, P., Zhang, A. Proto successor measure: Representing the behavior space of an RL agent. arXiv:2411.19418, 2025.
  30. Touati, A., Rapin, J., Ollivier, Y. Does zero-shot reinforcement learning exist? ICLR, 2023.
  31. Sikchi, H., Agarwal, S., Jajoo, P., Parajuli, S., Chuck, C., Rudolph, M., Stone, P., Zhang, A., Niekum, S. RLZero: Zero-shot language to behaviors without any supervision. arXiv:2412.05718, 2024.
  32. Sikchi, H., et al. Fast adaptation with behavioral foundation models. 2025.
  33. Agarwal, S., Chuck, C., Sikchi, H., Hu, J., Rudolph, M., Niekum, S., Stone, P., Zhang, A. A unifying perspective on unsupervised reinforcement learning algorithms. Under review, ICLR, 2026.
  34. Eysenbach, B., Gupta, A., Ibarz, J., Levine, S. Diversity is all you need: Learning skills without a reward function (DIAYN). ICLR, 2019.
  35. Park, S., Rybkin, O., Levine, S. METRA: Scalable unsupervised RL with metric-aware abstraction. ICLR, 2024.
  36. Dayan, P. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
  37. Farebrother, J., et al. Proto-value networks: Scaling representation learning with auxiliary tasks. ICLR, 2023.
  38. Mahadevan, S. Proto-value functions: Developmental reinforcement learning. ICML, 2005.
  39. Bellemare, M., Dabney, W., Dadashi, R., et al. A geometric perspective on optimal representations for RL. NeurIPS, 2019.
  40. Pirotta, M., Tirinzoni, A., Touati, A., Lazaric, A., Ollivier, Y. Fast imitation via behavior foundation models. 2024.
  41. Tirinzoni, A., et al. Zero-shot whole-body humanoid control via behavioral foundation models. ICLR, 2025.
  42. Gregor, K., Rezende, D.J., Wierstra, D. Variational intrinsic control. arXiv:1611.07507, 2016.
  43. Eysenbach, B., Salakhutdinov, R., Levine, S. The information geometry of unsupervised RL. ICLR, 2022.
  44. Zheng, C., Tuyls, J., Peng, J., Eysenbach, B. Can a MISL fly? Analysis and ingredients for mutual information skill learning. ICLR, 2025.
  45. Choi, J., et al. Variational empowerment as representation learning for goal-conditioned RL. ICML, 2021.
  46. Eysenbach, B., et al. Contrastive learning as goal-conditioned RL. NeurIPS, 2022.
  47. Silver, D., et al. Mastering the game of Go without human knowledge. Nature, 2017.
  48. Degrave, J., et al. Magnetic control of tokamak plasmas through deep RL. Nature, 2022.
  49. Fawzi, A., et al. Discovering faster matrix multiplication algorithms with RL. Nature, 2022.
  50. Islam, R., et al. ACRO: Agent-centric representations for controllability in RL. 2023.
  51. Nagabandi, A., Konolige, K., Levine, S., Kumar, V. Deep dynamics models for learning dexterous manipulation. 2019.
  52. Hafner, D., Lillicrap, T., Ba, J., Norouzi, M. Dream to control: Learning behaviors by latent imagination. ICLR, 2020.
  53. Chandra, A., et al. VideoAgent: Video models as self-improving generative simulators and world models for embodied planning. Under review at TMLR; Spotlight Oral at the RL Beyond Rewards Workshop, RLC, and the Language Agent World Models (LAW) Workshop, NeurIPS, 2025.
  54. Chandra, A. Unifying Foundation Models with Decision Making: A Path Towards Autonomous Self-Improving Agents beyond Rewards and Human Supervision. Master's Thesis, University of Waterloo, 2025.
  55. Machado, M.C., Barreto, A., Precup, D. Temporal abstraction in RL with the successor representation. JMLR, 2023.

Citation

If you found this useful, please cite:

Chandra, Abhranil. Unsupervised RL Pretraining: Toward Foundation Models for Behavior. https://abhranilchandra.github.io/blogs/unsupervised-rl.html. April 2026.