Stacked generalization (stacking) is a two-layer ensemble machine learning technique where a meta-model (or blender) is trained to optimally combine the predictions of several diverse base models (level-0 models). Unlike simple averaging or voting, stacking learns the most effective way to integrate the base learners' outputs, often using a hold-out validation set to generate the training data for the meta-model. This architecture allows the ensemble to correct for individual model biases and capture complex interactions between their predictions, typically yielding superior generalization performance on unseen data.

What is Stacked Generalization (Stacking)?
Stacked generalization, commonly called stacking, is a meta-learning ensemble technique for improving predictive performance by combining multiple base models with a meta-model.
The technique is closely related to self-consistency mechanisms in agentic systems, where aggregating multiple reasoning paths improves reliability. Implementation requires careful cross-validation to prevent data leakage and overfitting. Common meta-models include linear or logistic regression and small neural networks. Stacking is distinct from bagging and boosting: it is typically a heterogeneous ensemble method that can integrate fundamentally different algorithms (e.g., a decision tree, a support vector machine, and a neural network) under a learned combination rule.
Core Characteristics of Stacking
Stacked generalization, or stacking, is a meta-learning ensemble technique where a meta-model is trained to optimally combine the predictions of several base models to improve overall performance.
Two-Level Learning Architecture
Stacking employs a hierarchical structure distinct from other ensembles. At the base level, multiple heterogeneous base learners (e.g., a decision tree, a support vector machine, a neural network) are trained on the original dataset. Their predictions on a hold-out validation set (or via cross-validation) become the meta-features. At the meta-level, a meta-learner (or blender) is trained on these meta-features to learn the optimal combination of the base models' outputs. This separation allows the meta-model to correct the systematic biases of the individual base models.
Heterogeneous Base Models
A key strength of stacking is its ability to leverage diverse, uncorrelated base models. Unlike bagging or boosting, which typically use homogeneous weak learners, stacking thrives on model heterogeneity.
- Purpose: Different algorithms make different assumptions and capture different patterns in the data (e.g., linear relationships, tree-based splits, distance metrics).
- Benefit: This diversity creates a richer set of meta-features for the meta-learner, increasing the chance that their collective errors will be uncorrelated and thus correctable. Using similar models often yields diminishing returns.
Meta-Learner as a Combiner
The meta-learner is not a simple averager. It is a model trained to discover the relationship, linear or non-linear, between the base model predictions and the true target.
- Common Choices: Often a relatively simple, interpretable model like linear regression (for regression) or logistic regression (for classification) is used to prevent overfitting. More complex models like gradient boosting can be used with sufficient data.
- Function: It learns weights (and potentially interactions) for each base model's output. For example, it might learn to heavily trust Model A on certain data types and Model B on others, creating a context-aware fusion.
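The weight-learning step can be sketched for the simplest case: two base models and a linear blender with no intercept, fitted by ordinary least squares. The function name `fit_linear_blender` and the toy numbers are assumptions made for illustration. The target is constructed as exactly 0.7 times the first model plus 0.3 times the second so that the recovered weights are easy to check.

```python
def fit_linear_blender(p1, p2, y):
    """Least-squares weights (w1, w2) minimising sum((w1*p1 + w2*p2 - y)**2)."""
    # Entries of the 2x2 normal equations.
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    c = sum(v * v for v in p2)
    d = sum(u * t for u, t in zip(p1, y))
    e = sum(v * t for v, t in zip(p2, y))
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear; weights are not unique")
    # Closed-form solution via Cramer's rule.
    return (d * c - b * e) / det, (a * e - b * d) / det


# Hypothetical out-of-fold predictions from two base models.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [2.0, 1.0, 4.0, 3.0]
# Toy target built as 0.7 * model_a + 0.3 * model_b.
target = [0.7 * u + 0.3 * v for u, v in zip(model_a, model_b)]

w_a, w_b = fit_linear_blender(model_a, model_b, target)
```

A real meta-learner would usually add an intercept and regularization, and would be fitted on out-of-fold predictions rather than hand-written numbers.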
Out-of-Fold Predictions for Training
To prevent data leakage and overfitting, the meta-feature for each training instance must come from base models that did not see that instance (or its label) during their own training. This is achieved using k-fold cross-validation on the training data:
- For each fold, the base models are trained on the other k-1 folds and used to predict the held-out fold.
- The predictions for all folds are concatenated to form the out-of-fold (OOF) meta-feature dataset.
- The final base models are then retrained on the entire training set to generate predictions for the true test set.
This process ensures the meta-learner learns from predictions that reflect how the base models behave on unseen data.
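The out-of-fold procedure above can be sketched in plain Python. Everything here is illustrative: `kfold_indices`, `MeanModel`, `NearestNeighbour`, and `out_of_fold_predictions` are hypothetical names, the base models are deliberately trivial, and the folds are contiguous without shuffling.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class MeanModel:
    """Toy base model: always predicts the mean of its training targets."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighbour:
    """Toy base model: 1-nearest-neighbour on a single numeric feature."""
    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        return [
            self.y_[min(range(len(self.X_)), key=lambda j: abs(self.X_[j] - x))]
            for x in X
        ]


def out_of_fold_predictions(model_factories, X, y, k=3):
    """Return one row of base-model predictions per training instance.

    Each row is produced by models that never saw that instance."""
    n = len(X)
    oof = [[None] * len(model_factories) for _ in range(n)]
    for held_out in kfold_indices(n, k):
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for m, make_model in enumerate(model_factories):
            model = make_model().fit(X_tr, y_tr)
            preds = model.predict([X[i] for i in held_out])
            for i, p in zip(held_out, preds):
                oof[i][m] = p
    return oof


X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
oof = out_of_fold_predictions([MeanModel, NearestNeighbour], X, y, k=3)
```

A 1-nearest-neighbour model scores perfectly on its own training data, so in-sample predictions would tell the meta-learner to trust it completely. In the OOF rows it never reproduces the held-out label, which is the honest signal the meta-learner needs.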
Performance vs. Complexity Trade-off
Stacking often achieves state-of-the-art results in machine learning competitions, but it introduces significant complexity and, with a flexible meta-learner or limited data, an added risk of overfitting.
- Advantages: Can outperform any single base model and often beats simpler averaging ensembles by learning an optimal combination. It is highly flexible and can integrate any model type.
- Disadvantages: Increases computational cost and training time substantially. Requires careful tuning to avoid overfitting the meta-layer. Model interpretability is reduced, as the final prediction is the result of a stacked pipeline.
Relationship to Other Ensemble Methods
Stacking is distinguished from other ensemble methods by its learned combination rule.
- vs. Bagging/Averaging: Bagging reduces variance by averaging predictions from models trained on bootstrap samples. Stacking uses a learned model instead of a fixed average.
- vs. Boosting: Boosting builds models sequentially, each correcting its predecessor. Stacking trains base models in parallel and then learns to blend them.
- vs. Mixture of Experts: Both learn how to combine several models. A Mixture of Experts typically uses a softmax gating network whose weights depend on the input, while stacking's meta-learner usually sees only the base predictions and combines them after all base models have run.
How Stacked Generalization Works: A Technical Breakdown
Stacked generalization, commonly called stacking, is a meta-learning ensemble method that uses a second-level model, the meta-learner, to optimally combine the predictions of multiple base models.
Stacked generalization is a two-stage ensemble technique. In the first stage, diverse base models (e.g., decision trees, neural networks, SVMs) are trained on the original dataset. Their predictions on a hold-out validation set, or generated via cross-validation, become the input features for the second stage. This creates a new dataset where each instance is represented by the vector of base model predictions, not the original raw data.
A meta-model (or blender) is then trained on this new dataset to learn the optimal combination of the base learners' outputs. This meta-learner, often a simple linear model such as logistic regression or linear regression, learns to correct the biases and errors of the individual base models. The final prediction is the meta-model's output, and its performance often matches or exceeds that of the best single base learner and of simple averaging.
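The two stages can also be written compactly. The notation is an assumption introduced here and does not appear in the original text: f_1, ..., f_M are the base models, f_m^{(-k(i))} is base model m trained without the fold that contains instance i, g is the meta-model, and L is a loss function.

```latex
\begin{aligned}
z_i &= \bigl(f_1^{(-k(i))}(x_i),\ \dots,\ f_M^{(-k(i))}(x_i)\bigr), \qquad i = 1,\dots,n \\
\hat{g} &= \arg\min_{g} \sum_{i=1}^{n} L\bigl(y_i,\ g(z_i)\bigr) \\
\hat{y}(x) &= \hat{g}\bigl(f_1(x),\ \dots,\ f_M(x)\bigr)
\end{aligned}
```

The first line builds the meta-features from out-of-fold predictions, the second fits the meta-model on them, and the third applies base models refitted on all training data to a new input x.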
Frequently Asked Questions
This FAQ addresses common technical questions about Stacked Generalization (Stacking), a meta-learning ensemble technique used to improve prediction reliability by combining multiple base models with a meta-model.
Stacked generalization, or stacking, is a meta-learning ensemble technique where a meta-model (or blender) is trained to optimally combine the predictions of several diverse base models (level-0 models) to produce a final prediction that is often more accurate and robust than that of any individual model. The process works in two distinct training phases. First, multiple base models (e.g., a decision tree, a support vector machine, and a neural network) are trained on the original training data. Their predictions on a hold-out validation set (or via cross-validation) become the new input features, known as meta-features. Second, a separate meta-model (often a simple linear model such as logistic regression or ridge regression) is trained on these meta-features, with the true target labels, to learn the optimal way to weight or combine the base models' outputs. Crucially, to prevent data leakage and overfitting, the meta-features used to train the meta-model must be out-of-fold predictions, so that every prediction the meta-model learns from was made by base models that did not see that instance during their own training.
Related Terms
Stacked generalization is a powerful meta-ensemble technique. These related concepts represent alternative or foundational methods for combining multiple models or reasoning paths to achieve more robust, accurate, and reliable outputs.
Ensemble Averaging
Ensemble averaging, or soft voting, is a foundational aggregation technique where the final prediction is the arithmetic mean of the continuous-valued outputs (e.g., probabilities, regression values) from multiple base models. This reduces variance and often yields a more stable and accurate result than any single model.
- Key Mechanism: Computes the mean of predictions.
- Contrast with Stacking: Averaging uses a simple, fixed rule, whereas stacking trains a meta-model to learn the optimal, potentially non-linear, combination of base model outputs.
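As a minimal sketch of the fixed rule, the snippet below averages class-probability vectors from three models. The function name `soft_vote` and the probabilities are made up for illustration.

```python
def soft_vote(prob_rows):
    """Average the class-probability vectors produced by several models."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]


# Hypothetical probabilities for classes [0, 1] from three base models.
probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
avg = soft_vote(probs)
predicted_class = max(range(len(avg)), key=avg.__getitem__)
```

Every model receives the same weight of 1/3 regardless of how reliable it is, which is exactly the part a stacking meta-model would replace with learned weights.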
Bootstrap Aggregating (Bagging)
Bagging is an ensemble method designed to reduce model variance and prevent overfitting. It creates multiple versions of a base model, each trained on a different bootstrap sample (random sample with replacement) of the training data. Predictions are aggregated, typically by voting for classification or averaging for regression.
- Primary Goal: Variance reduction.
- Relation to Stacking: Bagged models (for example, a random forest) can serve as base learners whose outputs are fed into a stacking meta-learner, although stacking usually gains the most from combining different algorithm families.
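The bootstrap-and-aggregate loop can be sketched with a deliberately trivial base model that predicts the mean of its sample. The names `bootstrap_sample` and `bagged_mean_predictor`, the seed, and the data are assumptions for illustration only.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean_predictor(targets, n_models=25, seed=0):
    """Fit one mean-predictor per bootstrap sample and average their outputs."""
    rng = random.Random(seed)
    member_predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(targets, rng)
        member_predictions.append(sum(sample) / len(sample))
    return sum(member_predictions) / n_models, member_predictions


targets = [2.0, 4.0, 4.0, 5.0, 9.0, 12.0]
bagged, members = bagged_mean_predictor(targets)
```

In practice the base model would be something with high variance, such as a deep decision tree, which is where averaging over bootstrap replicates pays off.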
Mixture of Experts
A Mixture of Experts (MoE) is a conditional computation architecture where a gating network dynamically routes each input to one or a few specialized 'expert' neural networks. The final output is a weighted sum of the expert outputs, with weights determined by the gater.
- Dynamic Specialization: Experts learn to handle different regions of the input space.
- Comparison: Like stacking, MoE learns to combine models. However, MoE typically trains the gating and expert networks jointly and end-to-end, whereas stacking often uses a separate, sequential training process for the meta-learner.
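The input-dependent weighting can be illustrated with a softmax gate over two hand-written experts. This is only a forward-pass sketch: the experts, the linear gate parameters, and the name `moe_output` are assumptions, and nothing is trained here, whereas a real MoE learns the gate and the experts jointly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_output(x, experts, gate_params):
    """Weighted sum of expert outputs; gate score for expert i is slope_i * x + bias_i."""
    scores = [slope * x + bias for slope, bias in gate_params]
    gates = softmax(scores)
    return sum(g * expert(x) for g, expert in zip(gates, experts)), gates


# Two hypothetical experts and hand-picked gate parameters.
experts = [lambda x: 2.0 * x, lambda x: -x]
gate_params = [(1.0, 0.0), (-1.0, 0.0)]

out_zero, gates_zero = moe_output(0.0, experts, gate_params)
out_five, gates_five = moe_output(5.0, experts, gate_params)
```

At x = 0 the gate splits evenly between the experts, while at x = 5 almost all weight goes to the first expert. A stacking meta-learner fed only base predictions has no direct access to x and so cannot route in this way.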
Bayesian Model Averaging (BMA)
Bayesian Model Averaging is a rigorous probabilistic framework for ensemble learning. Instead of selecting a single 'best' model, BMA averages the predictions of all candidate models, weighted by their posterior model probabilities given the observed data. This accounts for model uncertainty and typically provides better predictive performance and more honest uncertainty estimates.
- Theoretical Foundation: Rooted in Bayesian inference.
- Contrast: BMA uses a theoretically derived weighting scheme based on model evidence, while stacking uses a data-driven meta-learner to find combination weights, which can be more flexible and predictive in practice.
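The weighting step can be sketched as follows, assuming the log marginal likelihood (log evidence) of each candidate model is already available, which is usually the hard part in practice. The function names and the numbers are illustrative assumptions.

```python
import math


def posterior_model_weights(log_evidences, log_priors=None):
    """Posterior model probabilities, proportional to evidence times prior."""
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)  # uniform prior, unnormalised
    log_post = [le + lp for le, lp in zip(log_evidences, log_priors)]
    m = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - m) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def bma_predict(predictions, weights):
    """Posterior-weighted average of the candidate models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical evidences for three candidate models, with a uniform prior.
weights = posterior_model_weights([math.log(0.2), math.log(0.6), math.log(0.2)])
prediction = bma_predict([1.0, 2.0, 3.0], weights)
```

The weights here come entirely from the evidences and the prior, with no held-out predictions involved, which is the contrast with stacking drawn above.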
Weighted Consensus
Weighted consensus is a broad class of aggregation techniques where the final decision is a weighted combination of individual model or agent outputs. The weights can be static (e.g., based on historical accuracy) or dynamic (e.g., based on per-instance confidence).
- General Principle: Final Output = Σ (weight_i * output_i).
- Stacking as Advanced Weighted Consensus: Stacking can be viewed as a sophisticated form of weighted consensus where the weights are not pre-defined but are learned by a meta-model that can discover complex, context-dependent combinations of the base learners.
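The general principle can be sketched with static and dynamic weights side by side. The function name `weighted_consensus` and all numbers are hypothetical. Normalizing the weights to sum to 1 is an assumption added here and is not part of the formula as stated.

```python
def weighted_consensus(outputs, weights):
    """Return sum_i weight_i * output_i after normalising the weights to sum to 1."""
    if len(outputs) != len(weights):
        raise ValueError("need exactly one weight per output")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w / total * o for w, o in zip(weights, outputs))


# Three agents vote on a binary decision (1.0 = yes, 0.0 = no).
outputs = [1.0, 0.0, 0.0]

# Static weights: each agent's historical accuracy (hypothetical).
static_result = weighted_consensus(outputs, [0.9, 0.6, 0.5])

# Dynamic weights: each agent's confidence on this particular instance (hypothetical).
dynamic_result = weighted_consensus(outputs, [0.2, 0.7, 0.1])
```

The same votes yield 0.45 under the static weights and 0.2 under the dynamic ones, so the weighting scheme alone can move the consensus.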
Boosting
Boosting is a sequential ensemble technique that builds a strong learner by combining many weak learners (e.g., shallow trees). It works by training each new model to correct the errors of the current ensemble, typically by giving more weight to misclassified training instances. Models are combined through a weighted sum.
- Primary Goal: Bias reduction, building a strong model from weak ones.
- Architectural Difference: Boosting is a sequential, dependent process where each new model is influenced by the ensemble's current performance. Stacking is typically a parallel, independent process where base models are trained separately, and their outputs are blended by a meta-learner.
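The sequential dependence can be seen in a small sketch. Note the swap: instead of the instance-reweighting scheme described above, this uses the residual-fitting variant (gradient boosting for squared error) on a single feature, because it is shorter to write. The names `fit_stump` and `boost`, the learning rate, and the data are assumptions for illustration.

```python
def fit_stump(x, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the current ensemble."""
    base = sum(y) / len(y)
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
predict = boost(x, y)

mean_y = sum(y) / len(y)
mse_mean = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boosted = sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

Each stump depends on the residuals left by all earlier stumps, so the rounds cannot be trained in parallel the way stacking's base models can.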

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.