# Branch Merging

After creating parallel branches with `branch` or `source_branch`, use **merge operators** to combine outputs and continue with a single pipeline path.

## Overview

nirs4all provides three merge-related keywords:

| Keyword | Purpose | Use Case |
|---------|---------|----------|
| `merge` | Combine outputs from pipeline branches | Feature fusion, stacking, ensemble building |
| `merge_sources` | Combine features from data sources | Multi-instrument fusion, multi-modal data |
| `source_branch` | Apply per-source preprocessing (auto-merges) | Source-specific preprocessing before fusion |

## Branch Merging (`merge`)

The `merge` keyword combines outputs from branches created with `{"branch": [...]}`.

### Basic Syntax

```python
# Simple string syntax
{"merge": "features"}     # Concatenate X matrices from all branches
{"merge": "predictions"}  # Collect OOF predictions (for stacking)
{"merge": "all"}          # Collect both features and predictions
```

### Feature Merging

Concatenate feature matrices (X) horizontally from all branches:

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": "features"},  # Concatenate X from all branches
    PLSRegression(n_components=10),
]
```

**Before merge:**

```
Branch 0 (snv): shape (n, p)
Branch 1 (msc): shape (n, p)
Branch 2 (derivative): shape (n, p)
```

**After merge:**

```
Merged X: shape (n, 3*p)  # Horizontal concatenation
```

### Prediction Merging (Stacking)

Collect out-of-fold (OOF) predictions from branches for meta-model training:

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),
    {"branch": {
        "pls": [SNV(), PLSRegression(n_components=5)],
        "ridge": [MSC(), Ridge(alpha=1.0)],
        "rf": [FirstDerivative(), RandomForestRegressor(n_estimators=50)],
    }},
    {"merge": "predictions"},  # Collect OOF predictions
    Ridge(alpha=0.1),          # Meta-model (Level 2)
]
```

**Before merge:**

```
Branch 0 (pls): predictions shape (n,)
Branch 1 (ridge): predictions shape (n,)
Branch 2 (rf): predictions shape (n,)
```

**After merge:**

```
Meta X: shape (n, 3)  # Stacked predictions as features
```

OOF predictions prevent data leakage by using validation set predictions during training.

### Mixed Merging

Select features from some branches and predictions from others:

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),
    {"branch": [
        [SNV()],                                # Branch 0: preprocessing only
        [MSC()],                                # Branch 1: preprocessing only
        [FirstDerivative(), PLSRegression(5)],  # Branch 2: has model
    ]},
    {"merge": {"features": [0, 1], "predictions": [2]}},
    Ridge(alpha=0.1),
]
```

This combines:

- Features from branches 0 and 1 (preprocessed X)
- Predictions from branch 2 (model outputs)

### Dict Syntax Options

The dict syntax provides fine-grained control over merging:

```python
{"merge": {
    # What to collect
    "features": "all",               # or [0, 2] for specific branches
    "predictions": "all",            # or [1, 3] for specific branches

    # Feature options
    "include_original": True,        # Include pre-branch features
    "aggregation": "mean",           # mean | median | std | min | max

    # Prediction options
    "unsafe": False,                 # Default: use OOF reconstruction

    # Output format
    "output_as": "features",         # features | sources | dict
    "source_names": ["snv", "msc"],  # Names for output_as="sources"

    # Error handling
    "on_missing": "error",           # error | warn | skip
    "on_shape_mismatch": "error",    # error | allow | pad | truncate
}}
```

### Branch Selection

Select specific branches by index or name:

```python
# By index
{"merge": {"features": [0, 2]}}  # Only branches 0 and 2

# By name (when using named branches)
{"merge": {"features": ["snv", "derivative"]}}

# All branches (default)
{"merge": {"features": "all"}}
```

### Feature Aggregation

Instead of concatenating, aggregate features across branches:

```python
pipeline = [
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": {"features": "all", "aggregation": "mean"}},
    PLSRegression(n_components=5),
]
```

**Aggregation options:**

- `"mean"` - Average features across branches
- `"median"` - Median of features
- `"std"` - Standard deviation
- `"min"` - Minimum value
- `"max"` - Maximum value

Result shape: `(n, p)` instead of `(n, branches*p)`

### Include Original Features

Preserve pre-branch features alongside merged branch features:

```python
pipeline = [
    MinMaxScaler(),  # Original features after this
    {"branch": {
        "snv": [SNV()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": {"features": "all", "include_original": True}},
    PLSRegression(n_components=10),
]
```

Result: `[original_X | snv_X | derivative_X]`

### Output Formats

Control how merged features are structured:

```python
# Default: single feature matrix
{"merge": {"features": "all", "output_as": "features"}}
# Result: shape (n, total_features)

# As separate sources (for multi-source models)
{"merge": {"features": "all", "output_as": "sources", "source_names": ["snv", "msc"]}}
# Result: SpectroDataset with named sources

# As dictionary (for multi-input models)
{"merge": {"features": "all", "output_as": "dict"}}
# Result: {"branch_0": X0, "branch_1": X1, ...}
```

### Per-Branch Prediction Configuration

Fine control over which models and how predictions are collected:

```python
{"merge": {
    "predictions": [
        # Branch 0: Use best model by RMSE
        {"branch": 0, "select": "best", "metric": "rmse"},

        # Branch 1: Top 2 models by R²
        {"branch": 1, "select": {"top_k": 2}, "metric": "r2"},

        # Branch 2: Specific models by name
        {"branch": 2, "select": ["PLS", "Ridge"]},

        # Branch 3: Average predictions instead of separate columns
        {"branch": 3, "select": "all", "aggregate": "mean"},
    ]
}}
```

**Selection strategies:**

- `"all"` - All models in branch (default)
- `"best"` - Single best model by metric
- `{"top_k": N}` - Top N models by metric
- `["model1", "model2"]` - Explicit model names

**Aggregation strategies:**

- `"separate"` - Each model as a separate feature column (default)
- `"mean"` - Simple average of predictions
- `"weighted_mean"` - Weighted by validation scores
- `"proba_mean"` - For classification: average class probabilities

### Unsafe Mode

By default, prediction merging uses OOF reconstruction to prevent data leakage. Disable this for special cases:

```python
# Safe (default): OOF predictions reconstructed per fold
{"merge": "predictions"}
# or
{"merge": {"predictions": "all", "unsafe": False}}

# Unsafe: Direct predictions (data leakage risk!)
{"merge": {"predictions": "all", "unsafe": True}}
```

Use `unsafe: True` only when you understand the implications (e.g., for final model predictions on new data).

### Nested Branching

Multiple branch-merge cycles enable hierarchical architectures:

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),

    # Level 1: Preprocessing exploration
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
    }},
    {"merge": "features"},  # Exit first branch level

    # Level 2: Model comparison
    {"branch": {
        "pls": [PLSRegression(n_components=5)],
        "ridge": [Ridge(alpha=1.0)],
    }},
    {"merge": "predictions"},  # Stack predictions

    # Final meta-model
    Ridge(alpha=0.1),
]
```

Each `merge` exits its branch level, enabling sequential branch-merge cycles.

## Source Merging (`merge_sources`)

The `merge_sources` keyword combines features from different data sources (sensors, instruments, modalities).

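
The concatenation, stacking, and averaging strategies documented for `merge_sources` correspond to familiar array operations. The NumPy sketch below only illustrates the output shapes stated in this guide; it is not nirs4all's implementation, and the array names (`nir`, `raman`) and sizes are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical sources: same samples, same number of features.
n_samples, n_features = 6, 4
nir = rng.normal(size=(n_samples, n_features))
raman = rng.normal(size=(n_samples, n_features))

# "concat": place the sources side by side along the feature axis.
concat = np.concatenate([nir, raman], axis=1)

# "stack": insert a source axis, giving (n, n_sources, n_features).
stacked = np.stack([nir, raman], axis=1)

# "average": element-wise mean across the sources.
averaged = np.mean([nir, raman], axis=0)

print(concat.shape)    # (6, 8)
print(stacked.shape)   # (6, 2, 4)
print(averaged.shape)  # (6, 4)
```

Stacking and averaging only make sense when every source has the same number of features, which is why those two strategies carry a dimension requirement while concatenation does not.
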
### Basic Syntax

```python
# Simple string syntax
{"merge_sources": "concat"}   # Horizontal concatenation
{"merge_sources": "stack"}    # 3D stacking (for CNNs)
{"merge_sources": "average"}  # Element-wise average
```

### Concatenation (Default)

Horizontally concatenate all sources:

```python
pipeline = [
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), SavitzkyGolay()],
    }},
    {"merge_sources": "concat"},
    PLSRegression(n_components=15),
]
```

**Result:** `shape (n, nir_features + raman_features)`

### 3D Stacking

Create 3D array for CNN models:

```python
{"merge_sources": "stack"}
```

**Result:** `shape (n, n_sources, n_features)`

Requires sources to have the same number of features.

### Averaging

Element-wise average of sources:

```python
{"merge_sources": "average"}
```

**Result:** `shape (n, n_features)`

Requires identical dimensions across sources.

### Dict Syntax Options

```python
{"merge_sources": {
    "strategy": "concat",          # concat | stack | dict
    "sources": "all",              # or ["NIR", "markers"] for specific
    "on_incompatible": "error",    # error | flatten | pad | truncate
    "output_name": "merged",       # Name for merged source
    "preserve_source_info": True,  # Keep source metadata
}}
```

### Weighted Merging

Scale source contributions before combining:

```python
{"merge_sources": {
    "mode": "concat",
    "weights": {"NIR": 1.0, "Raman": 0.5}  # Scale Raman features by 0.5
}}
```

### Selective Merging

Include only specific sources:

```python
{"merge_sources": {
    "sources": ["NIR", "markers"],  # Exclude Raman
    "mode": "concat"
}}
```

### Handling Shape Mismatches

When sources have different feature dimensions:

```python
{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "error",  # Default: raise error
}}

# Or fallback strategies:
{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "flatten",  # Fall back to 2D concat
}}

{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "pad",  # Zero-pad shorter sources
}}

{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "truncate",  # Truncate longer sources
}}
```

## Source Branching (`source_branch`)

Apply source-specific preprocessing pipelines. By default, sources are automatically merged after processing.

### Basic Syntax

```python
{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "Raman": [MSC(), SavitzkyGolay()],
    "markers": [StandardScaler()],
}}
```

### Auto Mode

Process each source independently with empty pipeline:

```python
{"source_branch": "auto"}
```

### Default Pipeline

Apply default preprocessing to unlisted sources:

```python
{"source_branch": {
    "NIR": [SNV()],
    "_default_": [MinMaxScaler()],  # Applied to other sources
}}
```

### Controlling Auto-Merge

By default, sources merge after `source_branch`. Disable this:

```python
{"source_branch": {
    "NIR": [SNV()],
    "Raman": [MSC()],
    "_merge_after_": False,        # Keep sources separate
    "_merge_strategy_": "concat",  # Merge strategy if merging (default)
}}
```

### Indexed Sources

Reference sources by index instead of name:

```python
{"source_branch": {
    0: [SNV(), FirstDerivative()],
    1: [MinMaxScaler()],
}}
```

## Complete Examples

### Example 1: Multi-Preprocessing Fusion

Combine multiple preprocessing strategies:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative
import nirs4all

pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": "features"},
    PLSRegression(n_components=15),
]

result = nirs4all.run(pipeline=pipeline, dataset="path/to/data")
```

### Example 2: Two-Level Stacking

Build a meta-model from diverse base models:

```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"branch": {
        "pls": [SNV(), PLSRegression(n_components=10)],
        "rf": [MSC(), RandomForestRegressor(n_estimators=100)],
        "ridge": [FirstDerivative(), Ridge(alpha=1.0)],
    }},
    {"merge": "predictions"},
    Ridge(alpha=0.1),  # Meta-model
]
```

### Example 3: Multi-Instrument Fusion

Combine data from different spectrometers:

```python
pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"source_branch": {
        "portable": [
            SNV(),
            SavitzkyGolay(window_length=15, polyorder=2),
            FirstDerivative(),
        ],
        "benchtop": [
            SNV(),
            FirstDerivative(),
        ],
    }},
    {"merge_sources": {
        "mode": "concat",
        "weights": {"portable": 0.7, "benchtop": 1.0}
    }},
    PLSRegression(n_components=10),
]
```

### Example 4: Hybrid Source and Pipeline Branching

Combine source-level and algorithm-level branching:

```python
pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),

    # Step 1: Per-source preprocessing
    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},
    {"merge_sources": "concat"},

    # Step 2: Feature scaling
    MinMaxScaler(),

    # Step 3: Model comparison
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=100)],
    }},
]
```

### Example 5: Advanced Prediction Selection

Fine-grained control over stacking:

```python
pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"branch": {
        "diverse_pls": [
            SNV(),
            {"model": PLSRegression(3), "name": "PLS_3"},
            {"model": PLSRegression(5), "name": "PLS_5"},
            {"model": PLSRegression(10), "name": "PLS_10"},
        ],
        "ensemble": [MSC(), RandomForestRegressor(n_estimators=100)],
    }},
    {"merge": {
        "predictions": [
            # Top 2 PLS models by R²
            {"branch": "diverse_pls", "select": {"top_k": 2}, "metric": "r2"},
            # All from ensemble branch, averaged
            {"branch": "ensemble", "select": "all", "aggregate": "mean"},
        ]
    }},
    Ridge(alpha=0.1),
]
```

## Key Behaviors

1. **merge ALWAYS exits branch mode** - After merge, the pipeline returns to a single path
2. **OOF safety by default** - Prediction merging uses out-of-fold reconstruction
3. **source_branch auto-merges** - Unless `_merge_after_: False`
4. **Feature snapshots** - Pre-branch features are preserved for `include_original`
5. **Prediction mode support** - Merge configurations are saved and restored for inference

## Troubleshooting

### "No feature snapshot found"

The merge requires features from before the branch. Ensure preprocessing steps exist before branching.

### "Feature dimension mismatch"

Branches produced different feature counts. Use `aggregation` or check preprocessing consistency.

### "No models found in branch"

For prediction merge, at least one branch must contain a trained model.

### "Incomplete OOF coverage"

Some samples lack validation predictions. Check cross-validation setup or use `on_missing: "warn"`.

### "merge_sources requires multiple sources"

Your dataset has only one source. Use `branch` instead for single-source data.

## See Also

- {doc}`branching` - Creating parallel pipeline branches
- {doc}`stacking` - MetaModel stacking patterns
- {doc}`multi_source` - Loading and working with multi-source data
- {doc}`/reference/pipeline_syntax` - Complete pipeline syntax reference
- [D03_merge_basics.py](https://github.com/GBeurier/nirs4all/blob/main/examples/developer/01_advanced_pipelines/D03_merge_basics.py) - Feature and prediction merge examples
- [D04_merge_sources.py](https://github.com/GBeurier/nirs4all/blob/main/examples/developer/01_advanced_pipelines/D04_merge_sources.py) - Multi-source merge examples