Improving Visual Generalization in Model-Based Reinforcement Learning
Under review, 2025
- Mingyu Park
- Donghwan Lee KAIST (Korea Advanced Institute of Science and Technology)
Abstract
Learning an RL agent that generalizes zero-shot to unseen visual observations would enable broader deployment of deep RL in the real world. The field has made significant progress by leveraging data augmentation and auxiliary representation learning techniques. However, simultaneously achieving superior sample efficiency and strong generalization remains challenging for visual RL agents. In this work, we devise Visual Generalization in MOdel-Based RL (ViGMO), a novel model-based RL method that encourages visual generalization with superior sample efficiency by blending a popular model-based RL architecture with proven recipes from the model-free RL literature. Our key idea is to constrain the model to make consistent predictions regardless of visual perturbations during training. We provide extensive empirical results on the sample efficiency and generalization ability of visual RL agents across diverse environments and tasks.
Overview
We propose ViGMO, a model-based RL method for visual generalization with superior sample efficiency that employs recipes from model-free RL: (1) applying weak and strong image augmentations, (2) predicting consistent representations over the horizon from mixed latent representations, (3) regularizing the visual encoder with weak and strong augmentations.
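The augmentation-based encoder regularization in recipes (1) and (3) can be sketched as follows. This is a toy illustration with hypothetical names (`random_shift`, `strong_aug`, `encode`) and a linear encoder standing in for the real network; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shift(img, pad=4):
    """Weak augmentation: pad the image, then randomly crop back to size
    (the common random-shift augmentation from visual RL)."""
    h, w = img.shape
    padded = np.pad(img, pad, mode="edge")
    x, y = rng.integers(0, 2 * pad + 1, size=2)
    return padded[y:y + h, x:x + w]

def strong_aug(img):
    """Strong augmentation: a random shift blended with noise
    (a stand-in for overlay/jitter-style augmentations)."""
    alpha = rng.uniform(0.5, 1.0)
    return alpha * random_shift(img) + (1 - alpha) * rng.normal(size=img.shape)

def encode(img, W):
    """Toy linear encoder mapping a flattened image to a latent vector."""
    return np.tanh(W @ img.ravel())

def consistency_loss(img, W):
    """Regularizer: latents of the weakly and strongly augmented views of
    the same image should match."""
    z_weak = encode(random_shift(img), W)
    z_strong = encode(strong_aug(img), W)
    return float(np.mean((z_weak - z_strong) ** 2))

img = rng.normal(size=(16, 16))          # placeholder observation
W = rng.normal(scale=0.1, size=(8, 16 * 16))
loss = consistency_loss(img, W)
```

In a real training loop this penalty would be minimized jointly with the usual model-based RL objectives, pulling the encoder toward augmentation-invariant latents.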
Motivation
- Typical visual model-based RL generates synthetic latent samples by rolling a learned transition model from latent representations, which enables sample-efficient value learning.
- However, when the image is perturbed with distracting factors, the encoder may extract representations distinct from the observed latent distribution $Z$.
- The latent transition model would then be conditioned on out-of-distribution latent representations, leading to poor performance at test time.
If the transition model predicts consistent representations regardless of perturbations, model-based RL can exhibit superior sample efficiency and visual generalization over model-free RL!
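The motivation above can be illustrated with a toy latent rollout: starting from the latents of an original and a perturbed view, roll a learned transition model forward and penalize divergence at every step of the horizon. All names and the linear transition model here are hypothetical, a minimal sketch rather than the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, action_dim, horizon = 8, 2, 5

# Toy learned transition model: z' = tanh(A z + B a)
A = rng.normal(scale=0.3, size=(latent_dim, latent_dim))
B = rng.normal(scale=0.3, size=(latent_dim, action_dim))

def rollout(z0, actions):
    """Roll the latent transition model forward along an action sequence."""
    zs, z = [z0], z0
    for a in actions:
        z = np.tanh(A @ z + B @ a)
        zs.append(z)
    return np.stack(zs)

actions = rng.normal(size=(horizon, action_dim))
z_clean = rng.normal(size=latent_dim)                       # latent of the original image
z_pert = z_clean + rng.normal(scale=0.1, size=latent_dim)   # latent of a perturbed view

traj_clean = rollout(z_clean, actions)
traj_pert = rollout(z_pert, actions)

# Horizon-wise consistency penalty: predictions from the perturbed view
# should track those from the original view at every rollout step.
penalty = float(np.mean(np.sum((traj_clean - traj_pert) ** 2, axis=1)))
```

Driving this penalty down during training is what keeps the model's imagined rollouts, and hence the values learned from them, stable under visual perturbations.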
Experimental Results
DM Control and Robosuite training tasks
Example evaluation tasks on DM Control
Example evaluation tasks on Robosuite
Inspired by RL-ViGen, we consider diverse visual generalization tasks during evaluation, including 15 tasks for DM Control and 4 tasks for Robosuite.
DM Control training & evaluation results
Robosuite training & evaluation results
Our method demonstrates strong performance on diverse visual RL tasks, with superior sample efficiency and generalization.
Full evaluation results
Full training results
Embedding Visualization
TD-MPC2 on cheetah-run
ViGMO on cheetah-run
TD-MPC2 on cartpole-swingup
ViGMO on cartpole-swingup
To validate our motivation, we visualize the latent representation predictions of TD-MPC2 (our backbone model-based RL method) and ViGMO over the horizon. While TD-MPC2 struggles to predict consistent representations under different perturbations, ViGMO demonstrates consistent prediction over the horizon regardless of perturbations: representations from perturbed images (× and ♦) track the representations extracted from the original images (☆).