Mel-Frequency Cepstral Coefficients (MFCCs) are a compact, perceptually motivated feature vector that represents the short-term power spectrum of a sound, derived by applying a non-linear Mel-scale filterbank and a discrete cosine transform to the log power spectrum of an audio frame.
MFCCs have long been a standard feature for speech recognition and audio classification. They approximate the human ear's non-linear frequency perception (the Mel scale) and give a much more compact, less correlated representation than a raw Fast Fourier Transform (FFT) spectrum. The process involves:
- Pre-emphasis & Framing: Boosting high frequencies and splitting the audio signal into short, overlapping frames (e.g., 20-40 ms).
- Windowing: Applying a window function (like a Hamming window) to each frame to reduce spectral leakage.
- FFT & Power Spectrum: Computing the FFT of each windowed frame and squaring its magnitude to obtain the power spectrum.
- Mel Filterbank: Passing the power spectrum through a set of triangular filters spaced according to the Mel scale, which gives finer frequency resolution at low frequencies and coarser resolution at high frequencies.
- Logarithm: Taking the log of the filterbank energies to compress the dynamic range.
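- Discrete Cosine Transform (DCT): Applying a DCT to decorrelate the log filterbank energies, producing the final cepstral coefficients. Typically only the lowest 12-13 coefficients are kept as the MFCC feature vector; the 0th coefficient, which reflects the overall log energy of the frame, is sometimes dropped or replaced by a separate energy term.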
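
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names (`hz_to_mel`, `mel_to_hz`, `mel_filterbank`, `mfcc`) and all parameter defaults (25 ms frames, 10 ms hop, 512-point FFT, 26 filters, 13 coefficients, pre-emphasis factor 0.97) are assumptions chosen for the example. It also assumes one common Mel formula, `2595 * log10(1 + f / 700)`, an unnormalized DCT-II, and a frame length no longer than the FFT size. Libraries differ on each of these choices.

```python
import numpy as np


def hz_to_mel(hz):
    # One common Mel formula (assumed here); other variants exist.
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(n_filters):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


def mfcc(signal, sample_rate, frame_ms=25.0, hop_ms=10.0, n_fft=512,
         n_filters=26, n_ceps=13, preemph=0.97):
    signal = np.asarray(signal, dtype=float)

    # Pre-emphasis: boost high frequencies with a first-order filter.
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # Framing: split into short, overlapping frames.
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = (np.arange(frame_len)[None, :]
           + hop_len * np.arange(n_frames)[:, None])
    frames = emphasized[idx]

    # Windowing: taper each frame with a Hamming window.
    frames = frames * np.hamming(frame_len)

    # FFT and power spectrum (frames are zero-padded to n_fft samples).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft

    # Mel filterbank energies, then log compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)  # small offset avoids log(0)

    # DCT-II (unnormalized) over the filter axis; keep the first n_ceps.
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_energies @ dct_basis.T


# Usage: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
features = mfcc(tone, sr)
print(features.shape)  # (98, 13): 98 frames, 13 coefficients each
```

Each row of `features` is the coefficient vector for one frame, and column 0 is the 0th coefficient. Slicing with `features[:, 1:]` drops it if a separate energy term is preferred.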