Improving Visual Generalization in Model-Based Reinforcement Learning
Under review, 2025

Abstract

Learning reinforcement learning (RL) agents that generalize to unseen visual observations in a zero-shot manner enables further deployment of deep RL in the real world. The field has witnessed significant progress in the prior literature by leveraging data augmentation and auxiliary representation learning techniques. However, simultaneously achieving superior sample efficiency and generalization ability remains challenging for visual RL agents. In this work, we devise Visual Generalization in MOdel-Based RL (ViGMO), a novel model-based RL method that encourages visual generalization with superior sample efficiency by blending a popular model-based RL architecture with proven recipes from the prior literature on model-free RL. Our key idea is to constrain the model to make consistent predictions regardless of visual perturbations during training. We provide extensive empirical results on the sample efficiency and generalization ability of visual RL agents in diverse environments and tasks.

Overview

We propose ViGMO, a model-based RL method for visual generalization with superior sample efficiency that employs recipes from model-free RL: (1) applying weak and strong image augmentations, (2) predicting consistent representations from mixed representations over the horizon, and (3) regularizing the visual encoder with weak and strong augmentations.
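The weak and strong augmentations in recipe (1) can be illustrated with a minimal sketch. The function names and parameters below are hypothetical; we assume a DrQ-style random shift as the weak augmentation and a random-convolution distortion as the strong one, operating on HWC float images with plain NumPy:

```python
import numpy as np

def random_shift(img, pad=4, rng=None):
    """Weak augmentation (assumed DrQ-style): pad the image and crop
    a random window of the original size."""
    rng = rng or np.random.default_rng()
    h, w, c = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

def random_conv(img, rng=None):
    """Strong augmentation (assumed): convolve every channel with a
    shared random 3x3 kernel, distorting colors and textures."""
    rng = rng or np.random.default_rng()
    kernel = rng.normal(size=(3, 3))
    kernel /= np.abs(kernel).sum()  # keep intensities bounded
    h, w, c = img.shape
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out
```

Both transforms preserve the image shape, so clean, weakly augmented, and strongly augmented views can be fed through the same encoder in one batch.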


Motivation

  • Typical visual model-based RL generates synthetic latent samples by rolling a learned transition model forward from latent representations, which enables sample-efficient value learning.
  • However, when the image is perturbed with distracting factors, the encoder may extract representations that fall outside the observed latent distribution $Z$.
  • The latent transition model would then be conditioned on out-of-distribution latent representations, leading to poor performance at test time.

If the transition model predicts consistent representations regardless of perturbations, model-based RL exhibits superior sample efficiency and visual generalization over model-free RL!
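The consistency constraint above can be sketched as follows. This is a toy illustration, not the paper's actual architecture: we assume a linear latent transition model z' = Az + Ba and measure the mean squared distance between rollouts started from the clean latent and from the latent of a perturbed view of the same observation; minimizing this distance pushes the model toward perturbation-invariant predictions over the horizon:

```python
import numpy as np

def rollout(z0, A, B, actions):
    """Roll a (toy) linear latent transition model z' = A z + B a
    forward from an initial latent z0."""
    zs = [z0]
    for a in actions:
        zs.append(A @ zs[-1] + B @ a)
    return np.stack(zs)

def consistency_loss(z_clean0, z_aug0, A, B, actions):
    """Mean squared distance between the latent rollouts of the clean
    and perturbed views; zero iff the model predicts the same
    trajectory regardless of the visual perturbation."""
    zc = rollout(z_clean0, A, B, actions)
    za = rollout(z_aug0, A, B, actions)
    return float(np.mean((zc - za) ** 2))
```

In a real agent, the latents would come from the visual encoder applied to the augmented views from the previous section, and the loss would be backpropagated through both the encoder and the transition model.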


Experimental Results

DM Control and Robosuite training tasks

Example evaluation tasks on DM Control

Example evaluation tasks on Robosuite

Inspired by RL-ViGen, we consider diverse visual generalization tasks during evaluation, including 15 tasks for DM Control and 4 tasks for Robosuite.

DM Control training & evaluation results

Robosuite training & evaluation results

Our method demonstrates strong performance across diverse visual RL tasks, with superior sample efficiency and generalization.

Full evaluation results

Full training results


Embedding Visualization

TD-MPC2 on cheetah-run

ViGMO on cheetah-run

TD-MPC2 on cartpole-swingup

ViGMO on cartpole-swingup

To validate our motivation, we visualize the latent representation predictions of TD-MPC2, our backbone model-based RL method, and ViGMO over the horizon. While TD-MPC2 struggles to predict consistent representations under different perturbations, ViGMO demonstrates a consistent prediction ability over the horizon regardless of perturbations (representations from perturbed images (× and ♦) track the representations extracted from original images (☆)).
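Embedding plots like the ones above are typically produced by projecting the high-dimensional latents to 2D. A minimal sketch, assuming PCA via SVD (the figures could equally use t-SNE or another projection):

```python
import numpy as np

def pca_2d(latents):
    """Project a (n, d) array of latent vectors onto their top two
    principal components for 2D plotting."""
    X = latents - latents.mean(axis=0, keepdims=True)  # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt = PCs
    return X @ Vt[:2].T                                # shape (n, 2)
```

Plotting the projected rollouts of clean and perturbed views on shared axes makes it visible whether the perturbed trajectories (× and ♦) track the clean ones (☆).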


Template is borrowed from here.