Trust Region Policy Optimization (TRPO) is a policy gradient method that directly optimizes a parameterized policy to maximize expected cumulative reward. Its core innovation is the trust region constraint, which limits the size of each policy update by enforcing a maximum Kullback-Leibler (KL) divergence between the new and old policy. This constraint prevents overly large, destructive updates that can collapse performance, a common failure mode in vanilla policy gradient methods. The algorithm solves a constrained optimization problem at each step, guaranteeing monotonic improvement under theoretical assumptions.




