A Process Reward Model (PRM) is a discriminative model, trained on human feedback, that assigns a scalar reward or correctness score to each intermediate step in a Chain-of-Thought reasoning trace. Unlike outcome supervision, which evaluates only the final answer, process supervision provides dense, stepwise feedback. This granular signal is crucial for training Large Language Models (LLMs) to reason more reliably and transparently, because it directly reinforces correct logical and factual progression rather than rewarding a right answer reached through flawed steps.
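
To make the mechanics concrete, here is a minimal PyTorch sketch of the training setup, under some simplifying assumptions: the names (`ProcessRewardModel`, `scorer`) and the dummy tensors are illustrative, and each reasoning step is assumed to already be represented as a fixed-size embedding (e.g., the backbone LLM's hidden state at the step's final token). The head produces one correctness probability per step and is trained with binary cross-entropy against per-step human labels.

```python
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Scores each reasoning step with a correctness probability.

    Assumes `step_embeddings` holds one fixed-size vector per step,
    e.g. the backbone LLM's hidden state at each step's final token.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),  # one logit per step
        )

    def forward(self, step_embeddings: torch.Tensor) -> torch.Tensor:
        # step_embeddings: (batch, num_steps, hidden_dim)
        logits = self.scorer(step_embeddings).squeeze(-1)  # (batch, num_steps)
        return torch.sigmoid(logits)  # per-step correctness probability

# Illustrative training step with dummy data:
prm = ProcessRewardModel(hidden_dim=768)
steps = torch.randn(2, 5, 768)                 # 2 traces, 5 steps each
labels = torch.randint(0, 2, (2, 5)).float()   # 1 = correct step, 0 = flawed
loss = nn.functional.binary_cross_entropy(prm(steps), labels)
loss.backward()
```

In practice the scoring head typically sits on top of the LLM backbone itself rather than on frozen embeddings, and a whole-trace score can be derived by aggregating the per-step probabilities, for example by taking their product or their minimum.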
