In MARL, each agent faces a Partially Observable Markov Decision Process (POMDP): it receives local observations and selects actions to maximize its own cumulative reward. The core challenge is non-stationarity: from any single agent's perspective, the environment's dynamics shift because the other agents are simultaneously adapting their policies. This interdependence calls for specialized algorithms that address stability, convergence, and credit assignment across agents. The standard settings are cooperative, competitive, and mixed-motive, formalized as stochastic (Markov) games.
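To make this concrete, here is one common formalization from the MARL literature, sketched with assumed notation (the symbols $\mathcal{N}$, $P$, $R_i$, $\gamma$ are introduced here for illustration and are not fixed by the text above):

```latex
% A partially observable stochastic game, as commonly defined in the
% MARL literature (notation assumed for illustration):
%   N        -- finite set of agents
%   S        -- state space
%   A_i, O_i -- action and observation spaces of agent i
%   P        -- state-transition kernel over joint actions
%   R_i      -- per-agent reward function
%   gamma    -- discount factor
\[
  G = \big( \mathcal{N},\, \mathcal{S},\,
            \{\mathcal{A}_i\}_{i \in \mathcal{N}},\,
            \{\mathcal{O}_i\}_{i \in \mathcal{N}},\,
            P,\, \{R_i\}_{i \in \mathcal{N}},\, \gamma \big)
\]
% Each agent i maximizes its own expected discounted return, which
% depends on the other agents' policies pi_{-i}; this dependence is
% the formal source of the non-stationarity described above:
\[
  J_i(\pi_i, \pi_{-i})
    = \mathbb{E}_{a_t \sim (\pi_i,\, \pi_{-i})}
      \left[ \sum_{t=0}^{\infty} \gamma^{t}\, R_i(s_t, a_t) \right].
\]
```

The key point is that $J_i$ is a function of $\pi_{-i}$: whenever the other agents update their policies, agent $i$'s effective environment changes, so the convergence guarantees of single-agent reinforcement learning no longer apply directly.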
