Knowledge distillation is a model compression technique where a smaller, more efficient student model is trained to replicate the outputs and internal representations of a larger, more complex teacher model. The process transfers the teacher's learned generalization capabilities—often referred to as its 'dark knowledge'—enabling the student to achieve comparable performance with significantly reduced computational and memory footprints. This is particularly valuable for deploying high-quality models, such as sentence transformers, in resource-constrained environments like edge devices or high-throughput embedding serving pipelines.
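The classic form of this transfer trains the student on the teacher's temperature-softened output distribution rather than hard labels. The sketch below is a minimal, framework-free illustration of that soft-target objective (temperature-scaled softmax plus KL divergence, scaled by T²); the function names and the temperature value are illustrative choices, not a reference implementation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher T "softens" the distribution,
    # exposing the teacher's relative preferences among wrong classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# When student logits match the teacher's, the loss is zero;
# any mismatch yields a positive penalty.
teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, teacher))            # ~0.0
print(distillation_loss(teacher, [0.0, 2.0, 1.0]))    # > 0
```

In practice this soft-target loss is usually combined with a standard cross-entropy term on ground-truth labels; for sentence transformers, an MSE loss between teacher and student embeddings plays an analogous role.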
