Pretraining a Shared $Q$-Network for Data-Efficient Offline Reinforcement Learning

Under review, 2025
- Jongchan Park
- Mingyu Park
- Donghwan Lee

KAIST (Korea Advanced Institute of Science and Technology)
Abstract
Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interaction with the environment. Collecting sufficiently large datasets for offline RL is laborious, since it requires a colossal number of environment interactions and becomes difficult when interaction with the environment is restricted. Hence, how an agent learns the best policy from a minimal static dataset is a crucial issue in offline RL, analogous to the sample-efficiency problem in online RL. In this paper, we propose a simple yet effective plug-and-play pretraining method that initializes the features of a $Q$-network to enhance data efficiency in offline RL. Specifically, we introduce a shared $Q$-network structure that outputs both a prediction of the next state and a $Q$-value. We first pretrain the shared $Q$-network on a supervised regression task that predicts the next state, and then train it with diverse offline RL methods. Through extensive experiments, we empirically demonstrate that the proposed method enhances the performance of existing popular offline RL methods on the D4RL and Robomimic benchmarks, with an average improvement of 135.94% on D4RL. Furthermore, we show that the proposed method significantly boosts data-efficient offline RL across various data qualities and data distributions. Notably, our method trained with only 10% of the dataset outperforms standard algorithms trained on the full dataset.
Overview
Our method splits the original $Q$-network into two core components: (i) a shared backbone network that extracts a feature $z$ from the concatenated input $\textbf{concat}(s,a)$, and (ii) separate shallow heads for the transition-model network and the $Q$-network, respectively.
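The architecture above can be sketched as a small PyTorch module. This is a minimal illustration under assumed layer sizes, not the paper's exact architecture: a shared backbone ($h_\phi$) maps $\textbf{concat}(s,a)$ to a feature $z$, and two shallow heads produce the next-state prediction ($g_\psi$) and the $Q$-value ($f_\theta$).

```python
import torch
import torch.nn as nn

class SharedQNetwork(nn.Module):
    """Sketch of the shared Q-network. Hidden sizes are illustrative
    assumptions; the heads are deliberately shallow (single linear layers)
    so most capacity sits in the shared backbone."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        # (i) shared backbone h_phi: concat(s, a) -> feature z
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # (ii) shallow heads: transition model g_psi and Q-head f_theta
        self.transition_head = nn.Linear(hidden_dim, state_dim)
        self.q_head = nn.Linear(hidden_dim, 1)

    def forward(self, state, action):
        z = self.backbone(torch.cat([state, action], dim=-1))
        return self.transition_head(z), self.q_head(z)
```

Because both heads read the same feature $z$, gradients from the transition-prediction task shape the representation that the $Q$-head later consumes.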
Two-Phase Offline Learning Scheme
Our method uses a two-phase training scheme during offline learning: pretraining and RL training. During the pretraining phase, the shared backbone ($h_\phi$), with a shallow transition head ($g_\psi$) attached, is trained on the transition-dynamics prediction task. Subsequently, the pretrained backbone is connected to a randomly initialized $Q$-head ($f_\theta$) and trained with the remaining offline RL value learning. We define the pretraining time-step ratio as the fraction of pretraining steps over the total number of training gradient steps.
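The two-phase scheme above can be sketched as follows. The `SharedNet` class, the `sample_batch` interface, and the `rl_update` callback are illustrative assumptions standing in for the paper's actual network and the chosen offline RL method's update (e.g. CQL or TD3+BC):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedNet(nn.Module):
    """Minimal shared network: backbone h_phi + transition head g_psi
    + Q-head f_theta. Dimensions are illustrative assumptions."""
    def __init__(self, s_dim=3, a_dim=2, hid=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(s_dim + a_dim, hid), nn.ReLU())
        self.g_psi = nn.Linear(hid, s_dim)   # transition head
        self.f_theta = nn.Linear(hid, 1)     # Q-head

    def forward(self, s, a):
        z = self.backbone(torch.cat([s, a], dim=-1))
        return self.g_psi(z), self.f_theta(z)

def two_phase_training(net, sample_batch, rl_update,
                       total_steps=100, pretrain_ratio=0.5):
    """pretrain_ratio is the pretraining time-step ratio: the fraction
    of all gradient steps spent on next-state prediction before
    switching to offline RL training."""
    pretrain_steps = int(pretrain_ratio * total_steps)
    opt = torch.optim.Adam(net.parameters(), lr=3e-4)
    # Phase 1: supervised next-state regression trains backbone + g_psi.
    for _ in range(pretrain_steps):
        s, a, next_s = sample_batch()
        pred_next_s, _ = net(s, a)
        loss = F.mse_loss(pred_next_s, next_s)
        opt.zero_grad(); loss.backward(); opt.step()
    # Phase 2: keep the pretrained backbone (f_theta stays randomly
    # initialized) and run the offline RL method's update, stubbed here.
    for _ in range(total_steps - pretrain_steps):
        rl_update(net)
    return pretrain_steps
```

This is the plug-and-play aspect: phase 2 is whatever offline RL algorithm one already uses, only its network is replaced by the pretrained shared backbone plus a fresh $Q$-head.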
Empirical D4RL Results
[Figure: Average normalized scores on the D4RL benchmark (HalfCheetah, Hopper, Walker2d).]
When combined with popular offline RL methods (e.g., CQL), our method yields strong empirical improvements across diverse datasets and environments; scores in blue indicate improvements over baselines without our method.
Learning Curves
[Figure: Average normalized scores of TD3+BC variants with our method on the D4RL benchmark.]
Regardless of the pretraining time-step ratio, our method accelerates offline RL with only a few lines of modification. As shown in the figures, offline RL agents with our method (red, orange, green) achieve rapid and strong gains over the baselines (blue) after the pretraining phase.
Robomimic Results
[Figure: Average success rates of baselines with and without our method on the Robomimic benchmark (Lift, Can).]
Performance gains are not limited to the popular D4RL offline RL benchmark: our method also demonstrates superior performance over baselines on Robomimic robotic manipulation tasks.
Template is borrowed from here.