Low-rank factorization is a model compression technique that approximates a large, dense weight matrix as the product of two or more smaller, low-rank matrices. Concretely, an m×n weight matrix W is replaced by W ≈ AB, where A is m×r and B is r×n with r ≪ min(m, n), cutting the parameter count from mn to r(m + n). This exploits the observation that many learned weight matrices in neural networks are close to low-rank, carrying redundant information that can be represented more compactly. The technique directly reduces the total number of parameters, decreasing the model's storage size and computational cost during inference, which is critical for on-device deployment and agentic memory systems.
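As a minimal sketch of the idea, the snippet below uses truncated SVD (which yields the best rank-r approximation in Frobenius norm) to factor a hypothetical 1024×1024 weight matrix at rank 64; the matrix here is random and the sizes are illustrative, not drawn from any particular model.

```python
import numpy as np

# Hypothetical sizes for illustration: a 1024x1024 weight matrix, target rank 64.
rng = np.random.default_rng(0)
m, n, r = 1024, 1024, 64
W = rng.standard_normal((m, n))

# Truncated SVD: keep only the top-r singular values/vectors.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]   # (m, r) factor, singular values folded in
B = Vt[:r, :]          # (r, n) factor
W_approx = A @ B       # rank-r approximation of W

# Parameter savings: mn for the dense matrix vs r(m + n) for the factors.
original_params = m * n          # 1,048,576
factored_params = r * (m + n)    # 131,072 (8x fewer)
print(original_params, factored_params)
```

At inference time, the dense multiply x @ W is replaced by (x @ A) @ B, so the compute cost drops in proportion to the parameter count; for a trained network, the factors are usually fine-tuned briefly after factorization to recover accuracy.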
