Branch Merging
After creating parallel branches with branch or source_branch, use merge operators to combine outputs and continue with a single pipeline path.
Overview
nirs4all provides three merge-related keywords:
Keyword |
Purpose |
Use Case |
|---|---|---|
|
Combine outputs from pipeline branches |
Feature fusion, stacking, ensemble building |
|
Combine features from data sources |
Multi-instrument fusion, multi-modal data |
|
Apply per-source preprocessing (auto-merges) |
Source-specific preprocessing before fusion |
Branch Merging (merge)
The merge keyword combines outputs from branches created with {"branch": [...]}.
Basic Syntax
# Simple string syntax
{"merge": "features"} # Concatenate X matrices from all branches
{"merge": "predictions"} # Collect OOF predictions (for stacking)
{"merge": "all"} # Collect both features and predictions
Feature Merging
Concatenate feature matrices (X) horizontally from all branches:
pipeline = [
MinMaxScaler(),
ShuffleSplit(n_splits=5, random_state=42),
{"branch": {
"snv": [SNV()],
"msc": [MSC()],
"derivative": [FirstDerivative()],
}},
{"merge": "features"}, # Concatenate X from all branches
PLSRegression(n_components=10),
]
Before merge:
Branch 0 (snv): shape (n, p)
Branch 1 (msc): shape (n, p)
Branch 2 (derivative): shape (n, p)
After merge:
Merged X: shape (n, 3*p) # Horizontal concatenation
Prediction Merging (Stacking)
Collect out-of-fold (OOF) predictions from branches for meta-model training:
pipeline = [
MinMaxScaler(),
ShuffleSplit(n_splits=5, random_state=42),
{"branch": {
"pls": [SNV(), PLSRegression(n_components=5)],
"ridge": [MSC(), Ridge(alpha=1.0)],
"rf": [FirstDerivative(), RandomForestRegressor(n_estimators=50)],
}},
{"merge": "predictions"}, # Collect OOF predictions
Ridge(alpha=0.1), # Meta-model (Level 2)
]
Before merge:
Branch 0 (pls): predictions shape (n,)
Branch 1 (ridge): predictions shape (n,)
Branch 2 (rf): predictions shape (n,)
After merge:
Meta X: shape (n, 3) # Stacked predictions as features
OOF predictions prevent data leakage by using validation set predictions during training.
Mixed Merging
Select features from some branches and predictions from others:
pipeline = [
MinMaxScaler(),
ShuffleSplit(n_splits=5, random_state=42),
{"branch": [
[SNV()], # Branch 0: preprocessing only
[MSC()], # Branch 1: preprocessing only
[FirstDerivative(), PLSRegression(5)], # Branch 2: has model
]},
{"merge": {"features": [0, 1], "predictions": [2]}},
Ridge(alpha=0.1),
]
This combines:
Features from branches 0 and 1 (preprocessed X)
Predictions from branch 2 (model outputs)
Dict Syntax Options
The dict syntax provides fine-grained control over merging:
{"merge": {
# What to collect
"features": "all", # or [0, 2] for specific branches
"predictions": "all", # or [1, 3] for specific branches
# Feature options
"include_original": True, # Include pre-branch features
"aggregation": "mean", # mean | median | std | min | max
# Prediction options
"unsafe": False, # Default: use OOF reconstruction
# Output format
"output_as": "features", # features | sources | dict
"source_names": ["snv", "msc"], # Names for output_as="sources"
# Error handling
"on_missing": "error", # error | warn | skip
"on_shape_mismatch": "error", # error | allow | pad | truncate
}}
Branch Selection
Select specific branches by index or name:
# By index
{"merge": {"features": [0, 2]}} # Only branches 0 and 2
# By name (when using named branches)
{"merge": {"features": ["snv", "derivative"]}}
# All branches (default)
{"merge": {"features": "all"}}
Feature Aggregation
Instead of concatenating, aggregate features across branches:
pipeline = [
{"branch": {
"snv": [SNV()],
"msc": [MSC()],
"derivative": [FirstDerivative()],
}},
{"merge": {"features": "all", "aggregation": "mean"}},
PLSRegression(n_components=5),
]
Aggregation options:
"mean"- Average features across branches"median"- Median of features"std"- Standard deviation"min"- Minimum value"max"- Maximum value
Result shape: (n, p) instead of (n, branches*p)
Include Original Features
Preserve pre-branch features alongside merged branch features:
pipeline = [
MinMaxScaler(), # Original features after this
{"branch": {
"snv": [SNV()],
"derivative": [FirstDerivative()],
}},
{"merge": {"features": "all", "include_original": True}},
PLSRegression(n_components=10),
]
Result: [original_X | snv_X | derivative_X]
Output Formats
Control how merged features are structured:
# Default: single feature matrix
{"merge": {"features": "all", "output_as": "features"}}
# Result: shape (n, total_features)
# As separate sources (for multi-source models)
{"merge": {"features": "all", "output_as": "sources", "source_names": ["snv", "msc"]}}
# Result: SpectroDataset with named sources
# As dictionary (for multi-input models)
{"merge": {"features": "all", "output_as": "dict"}}
# Result: {"branch_0": X0, "branch_1": X1, ...}
Per-Branch Prediction Configuration
Fine control over which models and how predictions are collected:
{"merge": {
"predictions": [
# Branch 0: Use best model by RMSE
{"branch": 0, "select": "best", "metric": "rmse"},
# Branch 1: Top 2 models by R²
{"branch": 1, "select": {"top_k": 2}, "metric": "r2"},
# Branch 2: Specific models by name
{"branch": 2, "select": ["PLS", "Ridge"]},
# Branch 3: Average predictions instead of separate columns
{"branch": 3, "select": "all", "aggregate": "mean"},
]
}}
Selection strategies:
"all"- All models in branch (default)"best"- Single best model by metric{"top_k": N}- Top N models by metric["model1", "model2"]- Explicit model names
Aggregation strategies:
"separate"- Each model as a separate feature column (default)"mean"- Simple average of predictions"weighted_mean"- Weighted by validation scores"proba_mean"- For classification: average class probabilities
Unsafe Mode
By default, prediction merging uses OOF reconstruction to prevent data leakage. Disable this for special cases:
# Safe (default): OOF predictions reconstructed per fold
{"merge": "predictions"} # or {"merge": {"predictions": "all", "unsafe": False}}
# Unsafe: Direct predictions (data leakage risk!)
{"merge": {"predictions": "all", "unsafe": True}}
Use unsafe: True only when you understand the implications (e.g., for final model predictions on new data).
Nested Branching
Multiple branch-merge cycles enable hierarchical architectures:
pipeline = [
MinMaxScaler(),
ShuffleSplit(n_splits=5, random_state=42),
# Level 1: Preprocessing exploration
{"branch": {
"snv": [SNV()],
"msc": [MSC()],
}},
{"merge": "features"}, # Exit first branch level
# Level 2: Model comparison
{"branch": {
"pls": [PLSRegression(n_components=5)],
"ridge": [Ridge(alpha=1.0)],
}},
{"merge": "predictions"}, # Stack predictions
# Final meta-model
Ridge(alpha=0.1),
]
Each merge exits its branch level, enabling sequential branch-merge cycles.
Source Merging (merge_sources)
The merge_sources keyword combines features from different data sources (sensors, instruments, modalities).
Basic Syntax
# Simple string syntax
{"merge_sources": "concat"} # Horizontal concatenation
{"merge_sources": "stack"} # 3D stacking (for CNNs)
{"merge_sources": "average"} # Element-wise average
Concatenation (Default)
Horizontally concatenate all sources:
pipeline = [
{"source_branch": {
"NIR": [SNV(), FirstDerivative()],
"Raman": [MSC(), SavitzkyGolay()],
}},
{"merge_sources": "concat"},
PLSRegression(n_components=15),
]
Result: shape (n, nir_features + raman_features)
3D Stacking
Create 3D array for CNN models:
{"merge_sources": "stack"}
Result: shape (n, n_sources, n_features)
Requires sources to have the same number of features.
Averaging
Element-wise average of sources:
{"merge_sources": "average"}
Result: shape (n, n_features)
Requires identical dimensions across sources.
Dict Syntax Options
{"merge_sources": {
"strategy": "concat", # concat | stack | dict
"sources": "all", # or ["NIR", "markers"] for specific
"on_incompatible": "error", # error | flatten | pad | truncate
"output_name": "merged", # Name for merged source
"preserve_source_info": True, # Keep source metadata
}}
Weighted Merging
Scale source contributions before combining:
{"merge_sources": {
"mode": "concat",
"weights": {"NIR": 1.0, "Raman": 0.5} # Scale Raman features by 0.5
}}
Selective Merging
Include only specific sources:
{"merge_sources": {
"sources": ["NIR", "markers"], # Exclude Raman
"mode": "concat"
}}
Handling Shape Mismatches
When sources have different feature dimensions:
{"merge_sources": {
"strategy": "stack",
"on_incompatible": "error", # Default: raise error
}}
# Or fallback strategies:
{"merge_sources": {
"strategy": "stack",
"on_incompatible": "flatten", # Fall back to 2D concat
}}
{"merge_sources": {
"strategy": "stack",
"on_incompatible": "pad", # Zero-pad shorter sources
}}
{"merge_sources": {
"strategy": "stack",
"on_incompatible": "truncate", # Truncate longer sources
}}
Source Branching (source_branch)
Apply source-specific preprocessing pipelines. By default, sources are automatically merged after processing.
Basic Syntax
{"source_branch": {
"NIR": [SNV(), FirstDerivative()],
"Raman": [MSC(), SavitzkyGolay()],
"markers": [StandardScaler()],
}}
Auto Mode
Process each source independently with empty pipeline:
{"source_branch": "auto"}
Default Pipeline
Apply default preprocessing to unlisted sources:
{"source_branch": {
"NIR": [SNV()],
"_default_": [MinMaxScaler()], # Applied to other sources
}}
Controlling Auto-Merge
By default, sources merge after source_branch. Disable this:
{"source_branch": {
"NIR": [SNV()],
"Raman": [MSC()],
"_merge_after_": False, # Keep sources separate
"_merge_strategy_": "concat", # Merge strategy if merging (default)
}}
Indexed Sources
Reference sources by index instead of name:
{"source_branch": {
0: [SNV(), FirstDerivative()],
1: [MinMaxScaler()],
}}
Complete Examples
Example 1: Multi-Preprocessing Fusion
Combine multiple preprocessing strategies:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative
import nirs4all
pipeline = [
MinMaxScaler(),
KFold(n_splits=5, shuffle=True, random_state=42),
{"branch": {
"snv": [SNV()],
"msc": [MSC()],
"derivative": [FirstDerivative()],
}},
{"merge": "features"},
PLSRegression(n_components=15),
]
result = nirs4all.run(pipeline=pipeline, dataset="path/to/data")
Example 2: Two-Level Stacking
Build a meta-model from diverse base models:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
pipeline = [
MinMaxScaler(),
KFold(n_splits=5, shuffle=True, random_state=42),
{"branch": {
"pls": [SNV(), PLSRegression(n_components=10)],
"rf": [MSC(), RandomForestRegressor(n_estimators=100)],
"ridge": [FirstDerivative(), Ridge(alpha=1.0)],
}},
{"merge": "predictions"},
Ridge(alpha=0.1), # Meta-model
]
Example 3: Multi-Instrument Fusion
Combine data from different spectrometers:
pipeline = [
KFold(n_splits=5, shuffle=True, random_state=42),
{"source_branch": {
"portable": [
SNV(),
SavitzkyGolay(window_length=15, polyorder=2),
FirstDerivative(),
],
"benchtop": [
SNV(),
FirstDerivative(),
],
}},
{"merge_sources": {
"mode": "concat",
"weights": {"portable": 0.7, "benchtop": 1.0}
}},
PLSRegression(n_components=10),
]
Example 4: Hybrid Source and Pipeline Branching
Combine source-level and algorithm-level branching:
pipeline = [
KFold(n_splits=5, shuffle=True, random_state=42),
# Step 1: Per-source preprocessing
{"source_branch": {
"NIR": [SNV()],
"Raman": [MSC()],
}},
{"merge_sources": "concat"},
# Step 2: Feature scaling
MinMaxScaler(),
# Step 3: Model comparison
{"branch": {
"pls": [PLSRegression(n_components=10)],
"rf": [RandomForestRegressor(n_estimators=100)],
}},
]
Example 5: Advanced Prediction Selection
Fine-grained control over stacking:
pipeline = [
MinMaxScaler(),
KFold(n_splits=5, shuffle=True, random_state=42),
{"branch": {
"diverse_pls": [SNV(),
{"model": PLSRegression(3), "name": "PLS_3"},
{"model": PLSRegression(5), "name": "PLS_5"},
{"model": PLSRegression(10), "name": "PLS_10"},
],
"ensemble": [MSC(), RandomForestRegressor(n_estimators=100)],
}},
{"merge": {
"predictions": [
# Top 2 PLS models by R²
{"branch": "diverse_pls", "select": {"top_k": 2}, "metric": "r2"},
# All from ensemble branch, averaged
{"branch": "ensemble", "select": "all", "aggregate": "mean"},
]
}},
Ridge(alpha=0.1),
]
Key Behaviors
merge ALWAYS exits branch mode - After merge, the pipeline returns to a single path
OOF safety by default - Prediction merging uses out-of-fold reconstruction
source_branch auto-merges - Unless
_merge_after_: FalseFeature snapshots - Pre-branch features are preserved for
include_originalPrediction mode support - Merge configurations are saved and restored for inference
Troubleshooting
“No feature snapshot found”
The merge requires features from before the branch. Ensure preprocessing steps exist before branching.
“Feature dimension mismatch”
Branches produced different feature counts. Use aggregation or check preprocessing consistency.
“No models found in branch”
For prediction merge, at least one branch must contain a trained model.
“Incomplete OOF coverage”
Some samples lack validation predictions. Check cross-validation setup or use on_missing: "warn".
“merge_sources requires multiple sources”
Your dataset has only one source. Use branch instead for single-source data.
See Also
Pipeline Branching - Creating parallel pipeline branches
Meta-Model Stacking User Guide - MetaModel stacking patterns
Multi-Source Pipelines - Loading and working with multi-source data
Writing a Pipeline in nirs4all - Complete pipeline syntax reference
D03_merge_basics.py - Feature and prediction merge examples
D04_merge_sources.py - Multi-source merge examples