Knowledge distillation is a model compression technique in which a compact 'student' model is trained to mimic the predictive behavior, and sometimes the internal representations, of a larger, pre-trained 'teacher' model. The core mechanism uses the teacher's softened output probabilities, obtained by applying a temperature-scaled softmax to its logits, as training targets; these soft targets carry richer 'dark knowledge' about inter-class relationships than hard one-hot labels. This process transfers much of the teacher's generalization ability, often allowing the student to approach, and in some cases match, the teacher's performance despite having far fewer parameters.
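The mechanism above can be sketched as a loss function. This is a minimal NumPy illustration, not a reference implementation: the temperature `T`, mixing weight `alpha`, and the `T**2` gradient-rescaling convention follow the common formulation of distillation, and the specific values are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution,
    # exposing the relative probabilities of non-target classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.7):
    """Weighted sum of a soft-target term and a hard-label term.

    T and alpha are hypothetical hyperparameters chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) on the softened distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures.
    soft = T * T * np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    # Standard cross-entropy against the ground-truth one-hot label.
    hard = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard
```

When the student's logits equal the teacher's, the soft term vanishes, so minimizing this loss pulls the student's full output distribution, not just its argmax, toward the teacher's.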
