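Offline reinforcement learning (offline RL), also known as batch RL, is the problem of learning a policy, ideally an optimal one, exclusively from a fixed, previously collected dataset of experience, without any further online interaction with the environment. This paradigm is critical for applications such as robotics and healthcare, where online trial-and-error is prohibitively expensive, dangerous, or impractical. The core challenge is distributional shift: the states and actions visited by the learned policy can differ from those in the dataset. Because the value of actions that rarely or never appear in the data cannot be checked against real outcomes, estimates for them can be badly wrong (often overestimated), and a policy that exploits those errors may perform unpredictably and poorly when deployed. The learned policy must therefore avoid taking actions that differ significantly from those supported by the dataset.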
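To make the distributional-shift problem concrete, the following is a minimal sketch on a hypothetical two-state toy dataset; the dataset, rewards, `INIT_Q`, and all function names are illustrative assumptions. It runs tabular fitted Q-iteration on the fixed data twice: once bootstrapping from every action, and once restricting both the bootstrap target and the final greedy choice to actions that actually appear with each state in the dataset. The large initial Q-value stands in for the arbitrary extrapolation error a function approximator can produce on unseen actions, and the support restriction is a simple illustrative constraint, not a specific published algorithm.

```python
from collections import defaultdict

# Hypothetical toy dataset of (state, action, reward, next_state, done) tuples,
# logged by some unknown behavior policy. Action 2 never appears in it.
DATASET = [
    (0, 0, 0.0, 1, False),
    (0, 1, 0.5, 1, False),
    (1, 0, 1.0, None, True),
    (1, 1, 0.0, None, True),
]
N_ACTIONS = 3
GAMMA = 0.9
# Assumption: a large initial value stands in for the arbitrary (often
# overestimated) values a function approximator may assign to unseen actions.
INIT_Q = 10.0


def seen_actions(dataset):
    """Map each state to the set of actions that appear with it in the data."""
    seen = defaultdict(set)
    for s, a, _, _, _ in dataset:
        seen[s].add(a)
    return seen


def fitted_q(dataset, constrain, sweeps=10):
    """Tabular fitted Q-iteration that only ever reads the fixed dataset."""
    seen = seen_actions(dataset)
    q = defaultdict(lambda: [INIT_Q] * N_ACTIONS)
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                # Unconstrained: bootstrap from every action, including unseen ones.
                # Constrained: bootstrap only from actions observed in s_next.
                candidates = seen[s_next] if constrain else range(N_ACTIONS)
                target = r + GAMMA * max(q[s_next][b] for b in candidates)
            q[s][a] = target
    return q, seen


def greedy_policy(q, seen, constrain):
    """Pick the highest-valued action per state, optionally within the data's support."""
    policy = {}
    for s in sorted(seen):
        candidates = seen[s] if constrain else range(N_ACTIONS)
        policy[s] = max(candidates, key=lambda b: q[s][b])
    return policy


if __name__ == "__main__":
    for constrain in (False, True):
        q, seen = fitted_q(DATASET, constrain)
        print(constrain, greedy_policy(q, seen, constrain))
    # False {0: 2, 1: 2}
    # True {0: 1, 1: 0}
```

Without the constraint, the greedy policy chooses action 2 in both states even though the dataset contains no evidence about it; with the constraint, it chooses the best logged action in each state. Published offline RL methods typically address this more softly, for example by keeping the learned policy close to the behavior policy or by penalizing the values of out-of-distribution actions; the sketch only demonstrates the failure mode.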




