# Meta-Model Stacking User Guide ## Overview Meta-model stacking (stacked generalization) is an ensemble technique that combines predictions from multiple base models using a second-level "meta-learner". This approach often improves prediction accuracy by leveraging the complementary strengths of different models. In nirs4all, the `MetaModel` operator provides a flexible, robust implementation of stacking that: - **Prevents data leakage** through out-of-fold (OOF) predictions - **Supports flexible source selection** (all, explicit, top-K, diversity) - **Handles edge cases** with configurable coverage strategies - **Integrates with branches** for multi-preprocessing pipelines - **Persists and reloads** seamlessly for production use ## Quick Start ### Basic Stacking Pipeline ```python from sklearn.cross_decomposition import PLSRegression from sklearn.ensemble import RandomForestRegressor from sklearn.linear_model import Ridge from sklearn.model_selection import KFold from sklearn.preprocessing import MinMaxScaler from nirs4all.data import DatasetConfigs from nirs4all.pipeline import PipelineRunner, PipelineConfigs from nirs4all.operators.models import MetaModel # Load dataset dataset = DatasetConfigs("path/to/data/") # Pipeline with base models and meta-learner pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), # Required for OOF PLSRegression(n_components=5), # Base model 1 RandomForestRegressor(n_estimators=50), # Base model 2 {"model": MetaModel(model=Ridge(alpha=1.0))}, # Meta-learner ] runner = PipelineRunner() predictions, _ = runner.run(PipelineConfigs(pipeline, "Stacking"), dataset) ``` ## Core Concepts ### Out-of-Fold (OOF) Predictions Stacking requires predictions that were **not** made on the training data to avoid leakage. During cross-validation: 1. Each fold's model predicts on its validation set 2. These validation predictions become training features for the meta-model 3. The meta-model sees the same samples as the base models, but through their predictions ``` Fold 1: Train on [2,3,4,5], Predict on [1] → OOF for samples in fold 1 Fold 2: Train on [1,3,4,5], Predict on [2] → OOF for samples in fold 2 ... Result: Complete OOF predictions for all training samples ``` ### Source Model Selection The `source_models` parameter controls which base models contribute to the meta-learner: | Mode | Syntax | Description | |------|--------|-------------| | All Previous | `source_models="all"` (default) | Use all models before the MetaModel | | Explicit | `source_models=["Model1", "Model2"]` | Use specific named models | | Top-K | `source_models={"top_k": 3, "metric": "r2"}` | Best N models by metric | | Diversity | `source_models={"diversity": True, "max_models": 5}` | Diverse model selection | ### Coverage Strategies When some samples lack OOF predictions (e.g., excluded samples), coverage strategies determine behavior: | Strategy | Enum | Behavior | |----------|------|----------| | Strict | `CoverageStrategy.STRICT` | Error if any sample missing (default) | | Drop | `CoverageStrategy.DROP_INCOMPLETE` | Mask incomplete samples | | Impute Zero | `CoverageStrategy.IMPUTE_ZERO` | Fill missing with 0 | | Impute Mean | `CoverageStrategy.IMPUTE_MEAN` | Fill missing with column mean | ### Test Aggregation Multiple folds produce multiple test predictions. Aggregation strategies combine them: | Strategy | Enum | Behavior | |----------|------|----------| | Mean | `TestAggregation.MEAN` | Simple average (default) | | Weighted | `TestAggregation.WEIGHTED_MEAN` | Weight by validation scores | | Best | `TestAggregation.BEST_FOLD` | Use only best-scoring fold | ## Configuration Reference ### MetaModel Parameters ```python MetaModel( model, # Required: sklearn-compatible meta-learner source_models="all", # Source selection mode use_proba=False, # Use probabilities (classification) stacking_config=None, # StackingConfig instance ) ``` ### StackingConfig Parameters ```python from nirs4all.operators.models import StackingConfig, CoverageStrategy, TestAggregation, BranchScope config = StackingConfig( coverage_strategy=CoverageStrategy.STRICT, # How to handle missing OOF test_aggregation=TestAggregation.MEAN, # How to aggregate test preds branch_scope=BranchScope.CURRENT_ONLY, # Which branches to use min_coverage_ratio=1.0, # Minimum required coverage allow_no_cv=False, # Allow non-CV pipelines ) ``` ## Usage Patterns ### Pattern 1: Named Source Selection Select specific models by name: ```python pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), {"model": PLSRegression(n_components=3), "name": "PLS_3"}, {"model": PLSRegression(n_components=5), "name": "PLS_5"}, {"model": PLSRegression(n_components=10), "name": "PLS_10"}, RandomForestRegressor(n_estimators=100), # Not selected # Only use PLS models {"model": MetaModel( model=Ridge(), source_models=["PLS_3", "PLS_5", "PLS_10"], )}, ] ``` ### Pattern 2: Top-K Selection Automatically select best models: ```python pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), # Many base models... PLSRegression(n_components=3), PLSRegression(n_components=5), PLSRegression(n_components=10), RandomForestRegressor(n_estimators=50), GradientBoostingRegressor(n_estimators=50), # Select top 3 by validation R² {"model": MetaModel( model=Ridge(), source_models={"top_k": 3, "metric": "r2"}, )}, ] ``` ### Pattern 3: Robust Configuration Handle missing predictions gracefully: ```python from nirs4all.operators.models import StackingConfig, CoverageStrategy, TestAggregation config = StackingConfig( coverage_strategy=CoverageStrategy.IMPUTE_MEAN, # Fill gaps test_aggregation=TestAggregation.WEIGHTED_MEAN, # Weight by performance min_coverage_ratio=0.8, # Allow up to 20% missing ) pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), PLSRegression(n_components=5), {"model": MetaModel(model=Ridge(), stacking_config=config)}, ] ``` ### Pattern 4: Branch Stacking Stack models from preprocessing branches: ```python from nirs4all.operators.transforms import FirstDerivative, SecondDerivative from nirs4all.operators.models import MetaModel, StackingConfig, BranchScope pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), {"branch": [ [PLSRegression(n_components=5)], # Raw [FirstDerivative(), PLSRegression(n_components=5)], # D1 [SecondDerivative(), PLSRegression(n_components=5)], # D2 ]}, {"merge": "predictions"}, # Stack all branch models {"model": MetaModel( model=Ridge(), stacking_config=StackingConfig( branch_scope=BranchScope.ALL_BRANCHES, ), )}, ] ``` ### Pattern 5: Multi-Level Stacking Create hierarchical stacking: ```python pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), # Level 0: Base models {"model": PLSRegression(n_components=3), "name": "PLS_L0"}, {"model": PLSRegression(n_components=10), "name": "PLS10_L0"}, RandomForestRegressor(n_estimators=50), # Level 1: Stack PLS models only {"model": MetaModel( model=Ridge(), source_models=["PLS_L0", "PLS10_L0"], ), "name": "Meta_L1"}, # Level 2: Final meta-model {"model": MetaModel( model=Lasso(alpha=0.1), ), "name": "Meta_L2"}, ] ``` ### Pattern 6: Classification Stacking Stack classification models: ```python from sklearn.ensemble import RandomForestClassifier from sklearn.linear_model import LogisticRegression pipeline = [ MinMaxScaler(), KFold(n_splits=5, shuffle=True, random_state=42), RandomForestClassifier(n_estimators=50), LinearDiscriminantAnalysis(), # Stack with probabilities {"model": MetaModel( model=LogisticRegression(), use_proba=True, # Use class probabilities as features )}, ] ``` ## Best Practices ### ✅ DO - **Always use cross-validation** - Stacking requires OOF predictions - **Set random_state** - Ensure reproducibility - **Start simple** - Begin with default settings, then tune - **Use diverse base models** - Mix linear and non-linear models - **Name your models** - Makes source selection clearer - **Test on held-out data** - Validate improvement on unseen data ### ❌ DON'T - **Stack too many models** - Diminishing returns, consider top-K - **Ignore base model quality** - Bad base models hurt stacking - **Use complex meta-learner** - Simple models (Ridge, Linear) often best - **Forget to check coverage** - Ensure OOF predictions are complete - **Over-engineer** - Sometimes a single good model is enough ## Troubleshooting ### Common Errors **"No source models found"** ``` Solution: Ensure base models are defined before MetaModel in pipeline ``` **"Incomplete OOF coverage"** ``` Solution: 1. Check that KFold or similar is in pipeline 2. Use CoverageStrategy.DROP_INCOMPLETE or IMPUTE_MEAN ``` **"Source model not found: ModelName"** ``` Solution: Verify model names match exactly (case-sensitive) ``` **"No fold data found"** ``` Solution: Ensure cross-validation splitter is before base models ``` ### Debugging Tips 1. **Check predictions store**: ```python predictions = runner.predictions for pred in predictions.filter(partition="val"): print(f"{pred['model_name']}: fold={pred.get('fold_id')}") ``` 2. **Verify source models exist**: ```python # After run all_models = predictions.filter(partition="val") model_names = set(p['model_name'] for p in all_models) print(f"Available models: {model_names}") ``` 3. **Check coverage**: ```python meta_preds = predictions.filter(model_name_contains="MetaModel") if meta_preds: print(f"Coverage: {meta_preds[0].get('coverage_ratio', 'N/A')}") ``` ## API Reference ### MetaModel ```python class MetaModel: """ Meta-model operator for stacked generalization. Parameters ---------- model : estimator Sklearn-compatible meta-learner (e.g., Ridge, LogisticRegression) source_models : str, list, or dict Source model selection: - "all": Use all previous models (default) - ["name1", "name2"]: Use specific named models - {"top_k": N, "metric": "r2"}: Use top N by metric - {"diversity": True}: Use diverse selection use_proba : bool For classification: use probabilities instead of predictions stacking_config : StackingConfig Configuration for coverage and aggregation strategies """ ``` ### StackingConfig ```python @dataclass class StackingConfig: """ Configuration for meta-model stacking behavior. Attributes ---------- coverage_strategy : CoverageStrategy How to handle missing OOF predictions test_aggregation : TestAggregation How to combine fold predictions for test set branch_scope : BranchScope Which branches contribute source models min_coverage_ratio : float Minimum required sample coverage (0.0-1.0) allow_no_cv : bool Allow stacking without cross-validation """ ``` ### Enums ```python class CoverageStrategy(Enum): STRICT = "strict" # Error if incomplete DROP_INCOMPLETE = "drop" # Mask incomplete samples IMPUTE_ZERO = "impute_zero" # Fill with 0 IMPUTE_MEAN = "impute_mean" # Fill with column mean class TestAggregation(Enum): MEAN = "mean" # Simple average WEIGHTED_MEAN = "weighted" # Weight by val scores BEST_FOLD = "best" # Use best fold only class BranchScope(Enum): CURRENT_ONLY = "current" # Only current branch ALL_BRANCHES = "all" # All branches SPECIFIED = "specified" # Explicitly listed ``` ## See Also - {doc}`branching` - Pipeline branching for preprocessing variations - {doc}`writing_pipelines` - General pipeline usage - {doc}`/reference/pipeline_syntax` - Complete pipeline syntax reference