Temporal Difference (TD) learning is a model-free reinforcement learning method in which an agent updates its estimate of the value of a state or state-action pair based on the difference between its current prediction and a more informed, bootstrapped target. This target combines the immediate reward with the discounted value estimate of the next state, mirroring the recursive form of the Bellman equation. Unlike Monte Carlo methods, which wait until the end of an episode to update, TD learning updates its estimates after every time step, which enables online, incremental learning and often yields faster convergence in practice.
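
To make the update concrete, here is a minimal sketch of the simplest variant, usually called TD(0), for state-value prediction. The target is the reward plus `gamma` times the current estimate of the next state, and the estimate moves a fraction `alpha` of the way toward it. The function names, the five-state random-walk environment, and the parameter defaults are illustrative assumptions, not part of the definition above.

```python
import random


def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """Apply one TD(0) update to `values` in place and return the TD error."""
    # Bootstrapped target: immediate reward plus the discounted estimate of
    # the next state. A terminal transition contributes no future value.
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    # Move the current estimate a fraction alpha toward the target.
    values[state] += alpha * td_error
    return td_error


def run_random_walk(num_episodes=1000, num_states=5, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values on a hypothetical random-walk chain with TD(0).

    Assumed environment: the agent starts in the middle state and steps left
    or right with equal probability. Leaving the chain on the right gives
    reward 1, leaving on the left gives reward 0, and both end the episode.
    """
    rng = random.Random(seed)
    values = [0.5] * num_states  # arbitrary initial estimates
    for _ in range(num_episodes):
        state = num_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            done = next_state < 0 or next_state >= num_states
            reward = 1.0 if next_state >= num_states else 0.0
            # The update happens after every step, not at the end of the episode.
            td0_update(values, state, reward, next_state, alpha, gamma, done)
            if done:
                break
            state = next_state
    return values


if __name__ == "__main__":
    print([round(v, 2) for v in run_random_walk()])
```

Under these assumptions the estimates should drift toward higher values for states nearer the rewarding right-hand exit. A Monte Carlo version of the same experiment would instead have to store each episode and update only once the final return is known.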




