Temporal pooling is a dimensionality reduction operation that aggregates feature representations along the temporal dimension. Applied over the whole sequence, it converts a variable-length input into a single fixed-size vector; applied over a fixed or sliding time window, it yields a shorter sequence with one pooled vector per window. In both cases an aggregation function, such as max, average, or an attention-weighted sum, is applied across the feature vectors of the timesteps in the window. The resulting summary is insensitive to the exact timing of features within the window (max and average pooling, for example, do not depend on the order of timesteps at all), which makes it useful for tasks like video classification, audio event detection, and time-series summarization, where the overall pattern matters more than precise temporal localization.
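
The aggregation functions named above can be illustrated with a minimal sketch in plain Python. It assumes a sequence is stored as a list of equal-length feature vectors, one per timestep. The function names (`max_pool`, `mean_pool`, `attention_pool`, `windowed_pool`) and the `query` vector used to score timesteps are hypothetical choices made for this illustration, not part of any particular library; in a trained model the attention scores would typically come from learned parameters.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def max_pool(frames: Sequence[Vector]) -> Vector:
    """Element-wise maximum over the time axis."""
    return [max(column) for column in zip(*frames)]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Element-wise average over the time axis."""
    return [sum(column) / len(column) for column in zip(*frames)]


def attention_pool(frames: Sequence[Vector], query: Vector) -> Vector:
    """Attention-weighted sum over the time axis.

    Each timestep is scored by its dot product with `query` (an assumed,
    illustrative scoring rule); the scores are turned into weights with a
    softmax, and the weighted feature vectors are summed.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*frames)]


def windowed_pool(
    frames: Sequence[Vector],
    size: int,
    stride: int,
    pool: Callable[[Sequence[Vector]], Vector],
) -> List[Vector]:
    """Apply `pool` to each full window of `size` timesteps, moving by `stride`."""
    return [
        pool(frames[start:start + size])
        for start in range(0, len(frames) - size + 1, stride)
    ]


# Three timesteps, two features per timestep.
frames = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]

global_max = max_pool(frames)                          # one vector for the whole sequence
global_mean = mean_pool(frames)
sliding_max = windowed_pool(frames, 2, 1, max_pool)    # one vector per window
attended = attention_pool(frames, [0.0, 0.0])          # zero query gives uniform weights
```

Pooling over the whole sequence returns a vector whose length equals the feature dimension, regardless of how many timesteps the input has, while the windowed variant returns one pooled vector per window. With a zero `query`, every timestep receives the same weight, so the attention-weighted sum reduces to the average.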
