Multi-Agent Reinforcement Learning (MARL) extends the single-agent Reinforcement Learning (RL) paradigm to environments with multiple interacting decision-makers. Each agent, typically modeled as a Partially Observable Markov Decision Process (POMDP), learns a policy by maximizing its own cumulative reward signal. The core challenge is the non-stationarity of the learning environment, as each agent's policy changes concurrently, breaking the foundational assumption of stationary dynamics present in single-agent RL.




