Knowledge distillation is a model compression technique in which a smaller, more efficient 'student' model is trained to replicate the behavior of a larger, more complex 'teacher' model. Instead of learning only from hard, one-hot labels, the student is trained to match the teacher's softened output probabilities, known as soft labels, typically produced by applying a softmax with an elevated temperature to the teacher's logits. This process transfers the teacher's learned 'dark knowledge'—its nuanced understanding of class relationships and decision boundaries—enabling the student to achieve comparable performance with significantly fewer parameters and lower inference latency.
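The combined objective described above can be sketched as a weighted sum of a soft-label term (KL divergence between temperature-softened teacher and student distributions, conventionally scaled by T²) and a standard cross-entropy term on the hard labels. This is a minimal NumPy illustration, not a reference implementation; the temperature `T`, mixing weight `alpha`, and the function names are illustrative choices, not fixed by the text.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=4.0, alpha=0.5):
    # Soft-label term: KL(teacher || student) at temperature T,
    # scaled by T^2 so its gradient magnitude stays comparable.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    kd = kd.mean() * T * T

    # Hard-label term: ordinary cross-entropy at T = 1.
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(hard_labels)), hard_labels] + 1e-12).mean()

    return alpha * kd + (1 - alpha) * ce
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains, which is why `alpha` controls how strongly the student is pulled toward the teacher's soft labels.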
