A Decentralized Partially Observable Markov Decision Process (Dec-POMDP) is a formal mathematical framework that extends the single-agent Partially Observable Markov Decision Process (POMDP) to sequential decision-making problems with multiple cooperative agents. Each agent receives a partial, potentially unique observation of the global state, and the agents must coordinate their actions to maximize a shared long-term reward without centralized control or communication.
Formally, a Dec-POMDP is defined by the tuple <I, S, {A_i}, P, {Ω_i}, O, R, γ>, where:
- I is a finite set of agents.
- S is a set of global states.
- {A_i} is the collection of individual action sets, one per agent; the joint action space is the Cartesian product A_1 × … × A_n.
- P(s' | s, a) is the state transition probability function.
- {Ω_i} is the collection of individual observation sets, one per agent; a joint observation assigns one local observation to each agent.
- O(o | a, s') is the observation probability function: the probability of the joint observation o when the team takes joint action a and the system transitions to state s'.
- R(s, a) is the immediate shared reward function.
- γ ∈ [0, 1) is the discount factor.
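The tuple above can be made concrete with a small sketch in Python. The toy problem below (two agents, two states, noisy observations) is purely illustrative and not a standard benchmark; all names and probabilities are assumptions chosen for the example.

```python
import random

# Hypothetical two-agent, two-state toy Dec-POMDP (illustrative only).
AGENTS = [0, 1]                 # I: finite set of agents
STATES = ["s0", "s1"]           # S: global states
ACTIONS = ["left", "right"]     # A_i: each agent's individual action set
OBS = ["o0", "o1"]              # Omega_i: each agent's observation set
GAMMA = 0.95                    # discount factor

def P(s, joint_a):
    """Transition function: the state flips only when both agents pick 'right'."""
    if all(a == "right" for a in joint_a):
        return "s1" if s == "s0" else "s0"
    return s

def O(joint_a, s_next):
    """Observation function: each agent independently sees the true state
    with probability 0.85, and the wrong one otherwise."""
    joint_obs = []
    for _ in AGENTS:
        true_o = "o0" if s_next == "s0" else "o1"
        noise_o = "o1" if true_o == "o0" else "o0"
        joint_obs.append(true_o if random.random() < 0.85 else noise_o)
    return tuple(joint_obs)

def R(s, joint_a):
    """Shared reward: +1 when the team matches actions while in s1, else 0."""
    return 1.0 if s == "s1" and joint_a[0] == joint_a[1] else 0.0
```

Note that P, O, and R all take the joint action as input: no single agent's choice determines the outcome, which is what makes the coordination problem hard.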
At each time step, the system is in a hidden global state s. Each agent i receives a local observation o_i correlated with s, selects an action a_i based only on its own local action-observation history, and the team receives a single shared reward. The goal is to find a joint policy—a set of decentralized controllers mapping local histories to actions—that maximizes the expected cumulative discounted reward.
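The execution loop described above can be sketched as a self-contained rollout. The model and the memoryless reactive policy below are assumptions for illustration; real Dec-POMDP solvers search over richer history-dependent controllers.

```python
import random

random.seed(0)
GAMMA = 0.95
HORIZON = 20

# Hypothetical two-agent toy model (illustrative, not a standard benchmark):
# the state flips when both agents act 'right'; each agent observes the true
# state with probability 0.85; the reward is shared by the whole team.
def step(s, joint_a):
    flip = all(a == "right" for a in joint_a)
    s_next = ("s1" if s == "s0" else "s0") if flip else s
    joint_obs = tuple(
        ("o0" if s_next == "s0" else "o1") if random.random() < 0.85
        else ("o1" if s_next == "s0" else "o0")
        for _ in range(2)
    )
    r = 1.0 if s == "s1" and joint_a[0] == joint_a[1] else 0.0
    return s_next, joint_obs, r

# Decentralized controller: each agent maps only its *local* history to
# an action; here it simply reacts to its most recent private observation.
def local_policy(history):
    if not history or history[-1] == "o0":
        return "right"   # try to push the state toward s1
    return "left"        # believe we are in s1: stay and collect reward

s = "s0"
histories = [[], []]     # one local action-observation history per agent
ret = 0.0
for t in range(HORIZON):
    joint_a = tuple(local_policy(h) for h in histories)
    s, joint_obs, r = step(s, joint_a)
    ret += (GAMMA ** t) * r              # shared, discounted team reward
    for i in range(2):                   # each agent sees only its own data
        histories[i] += [joint_a[i], joint_obs[i]]
print(f"discounted return over {HORIZON} steps: {ret:.3f}")
```

The key structural point is in the loop: each agent's action is computed from `histories[i]` alone, never from the hidden state s or the other agent's observations, mirroring the decentralized-execution constraint of the framework.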