Also known as batch reinforcement learning, this paradigm addresses the core limitation of traditional online RL: the need for costly, dangerous, or impractical active exploration. The agent learns from a static dataset of transitions (state, action, reward, next state), which may have been collected by arbitrary and potentially unknown behavior policies. The primary technical challenge is distributional shift: the agent must learn a policy that performs well in the real environment while avoiding actions that were not well represented in the offline data, a problem addressed by algorithms that incorporate conservative or constrained policy updates.
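A minimal tabular sketch can make the conservatism idea concrete. The example below is illustrative only: the toy MDP, the dataset, and the penalty coefficient `alpha` are all invented for this sketch, and the penalty (pushing down Q-values of state-action pairs absent from the dataset, loosely in the spirit of conservative Q-learning) is one simple instance of the broader family of constrained updates, not a specific published algorithm.

```python
import numpy as np

# Hypothetical toy setup: 4 states, 2 actions (all values are illustrative).
n_states, n_actions = 4, 2

# Static dataset of (state, action, reward, next_state) transitions,
# as if logged by an unknown behavior policy. Note that many
# (state, action) pairs never appear here.
dataset = [
    (0, 0, 0.0, 1),
    (1, 0, 0.0, 2),
    (2, 1, 1.0, 3),
    (3, 0, 0.0, 3),
]

gamma, lr, alpha = 0.9, 0.5, 1.0  # alpha scales the conservative penalty
Q = np.zeros((n_states, n_actions))

# Count which (state, action) pairs are supported by the data.
counts = np.zeros((n_states, n_actions))
for s, a, _, _ in dataset:
    counts[s, a] += 1

for _ in range(200):
    # Standard Q-learning backup, but only over the fixed offline batch.
    for s, a, r, s2 in dataset:
        target = r + gamma * Q[s2].max()
        Q[s, a] += lr * (target - Q[s, a])
    # Conservative penalty: push down the Q-values of actions never
    # observed in the dataset, so the greedy policy is steered away
    # from out-of-distribution actions.
    Q[counts == 0] -= lr * alpha

# Greedy policy extracted from the conservatively trained Q-table.
policy = Q.argmax(axis=1)
```

Without the penalty, unseen actions keep their initial value of zero and can be selected greedily despite the data saying nothing about them; with it, the learned policy only chooses actions the dataset actually supports.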
