Branch Merging

After creating parallel branches with branch or source_branch, use merge operators to combine outputs and continue with a single pipeline path.

Overview

nirs4all provides three merge-related keywords:

| Keyword | Purpose | Use Case |
|---|---|---|
| `merge` | Combine outputs from pipeline branches | Feature fusion, stacking, ensemble building |
| `merge_sources` | Combine features from data sources | Multi-instrument fusion, multi-modal data |
| `source_branch` | Apply per-source preprocessing (auto-merges) | Source-specific preprocessing before fusion |

Branch Merging (`merge`)

The merge keyword combines outputs from branches created with {"branch": [...]}.

Basic Syntax

# Simple string syntax
{"merge": "features"}      # Concatenate X matrices from all branches
{"merge": "predictions"}   # Collect OOF predictions (for stacking)
{"merge": "all"}           # Collect both features and predictions
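
One way to read the string shorthands is as abbreviations for the dict syntax described later in this guide. The sketch below uses a hypothetical helper, `normalize_merge`, which is not part of nirs4all; it only illustrates the assumed equivalence between the two forms.

```python
def normalize_merge(spec):
    """Expand a string shorthand into an explicit dict (hypothetical helper)."""
    shorthands = {
        "features": {"features": "all"},
        "predictions": {"predictions": "all"},
        "all": {"features": "all", "predictions": "all"},
    }
    if isinstance(spec, dict):
        return dict(spec)  # already explicit; return a shallow copy
    if not isinstance(spec, str) or spec not in shorthands:
        raise ValueError(f"unknown merge shorthand: {spec!r}")
    return dict(shorthands[spec])


print(normalize_merge("all"))  # {'features': 'all', 'predictions': 'all'}
```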

Feature Merging

Concatenate feature matrices (X) horizontally from all branches:

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": "features"},  # Concatenate X from all branches
    PLSRegression(n_components=10),
]

Before merge:

Branch 0 (snv):       shape (n, p)
Branch 1 (msc):       shape (n, p)
Branch 2 (derivative): shape (n, p)

After merge:

Merged X: shape (n, 3*p)  # Horizontal concatenation
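
The shape arithmetic above can be reproduced with plain NumPy. This is a conceptual sketch only: the random arrays stand in for the branch outputs, and it does not show how nirs4all performs the merge internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 5  # 8 samples, 5 wavelengths (arbitrary illustration sizes)

# Stand-ins for the X matrix produced by each branch
branch_outputs = {
    "snv": rng.normal(size=(n, p)),
    "msc": rng.normal(size=(n, p)),
    "derivative": rng.normal(size=(n, p)),
}

# Horizontal concatenation: rows (samples) stay aligned, columns are appended
merged = np.hstack(list(branch_outputs.values()))
print(merged.shape)  # (8, 15)
```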

Prediction Merging (Stacking)

Collect out-of-fold (OOF) predictions from branches for meta-model training:

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),
    {"branch": {
        "pls": [SNV(), PLSRegression(n_components=5)],
        "ridge": [MSC(), Ridge(alpha=1.0)],
        "rf": [FirstDerivative(), RandomForestRegressor(n_estimators=50)],
    }},
    {"merge": "predictions"},  # Collect OOF predictions
    Ridge(alpha=0.1),          # Meta-model (Level 2)
]

Before merge:

Branch 0 (pls):   predictions shape (n,)
Branch 1 (ridge): predictions shape (n,)
Branch 2 (rf):    predictions shape (n,)

After merge:

Meta X: shape (n, 3)  # Stacked predictions as features

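OOF predictions prevent data leakage: each sample's prediction comes from a model fitted on folds that exclude that sample, so the meta-model is never trained on predictions the base models made for their own training data.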
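
To make the OOF idea concrete, here is a minimal NumPy sketch. It assumes a K-fold partition (so every sample is held out exactly once) and uses an ordinary least-squares model as a stand-in base learner. The helper `oof_predictions` and the column-subset "branches" are illustrative assumptions, not nirs4all code.

```python
import numpy as np


def oof_predictions(X, y, n_splits=5, seed=42):
    """Return one out-of-fold prediction per sample from a least-squares model."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    oof = np.full(n, np.nan)
    for val_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, val_idx)
        # Fit on the training folds only (intercept via a column of ones)
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Predict only the held-out samples
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        oof[val_idx] = A_val @ coef
    return oof


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=40)

# Three "branches" see different column subsets, standing in for different preprocessing
branches = [X[:, :2], X[:, 2:4], X[:, 4:]]
meta_X = np.column_stack([oof_predictions(Xb, y) for Xb in branches])
print(meta_X.shape)  # (40, 3)
```

Each column of `meta_X` plays the role of one branch's predictions, and the meta-model is then fitted on `meta_X` and `y`.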

Mixed Merging

Select features from some branches and predictions from others:

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),
    {"branch": [
        [SNV()],                                      # Branch 0: preprocessing only
        [MSC()],                                      # Branch 1: preprocessing only
        [FirstDerivative(), PLSRegression(5)],       # Branch 2: has model
    ]},
    {"merge": {"features": [0, 1], "predictions": [2]}},
    Ridge(alpha=0.1),
]

This combines:

Features from branches 0 and 1 (preprocessed X)
Predictions from branch 2 (model outputs)

Dict Syntax Options

The dict syntax provides fine-grained control over merging:

{"merge": {
    # What to collect
    "features": "all",           # or [0, 2] for specific branches
    "predictions": "all",        # or [1, 3] for specific branches

    # Feature options
    "include_original": True,    # Include pre-branch features
    "aggregation": "mean",       # mean | median | std | min | max

    # Prediction options
    "unsafe": False,             # Default: use OOF reconstruction

    # Output format
    "output_as": "features",     # features | sources | dict
    "source_names": ["snv", "msc"],  # Names for output_as="sources"

    # Error handling
    "on_missing": "error",       # error | warn | skip
    "on_shape_mismatch": "error", # error | allow | pad | truncate
}}

Branch Selection

Select specific branches by index or name:

# By index
{"merge": {"features": [0, 2]}}     # Only branches 0 and 2

# By name (when using named branches)
{"merge": {"features": ["snv", "derivative"]}}

# All branches (default)
{"merge": {"features": "all"}}

Feature Aggregation

Instead of concatenating, aggregate features across branches:

pipeline = [
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": {"features": "all", "aggregation": "mean"}},
    PLSRegression(n_components=5),
]

Aggregation options:

"mean" - Average features across branches
"median" - Median of features
"std" - Standard deviation
"min" - Minimum value
"max" - Maximum value

Result shape: (n, p) instead of (n, branches*p)
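
Conceptually, aggregation stacks the branch outputs along a new axis and reduces over it. The NumPy sketch below illustrates this with random stand-in data; whether nirs4all uses the population or sample standard deviation for "std" is not stated here, so `np.std` with its default settings is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
branches = [rng.normal(size=(n, p)) for _ in range(3)]

# Stack along a new leading axis: shape (n_branches, n, p)
stacked = np.stack(branches, axis=0)

aggregators = {
    "mean": np.mean,
    "median": np.median,
    "std": np.std,
    "min": np.min,
    "max": np.max,
}

# Reduce over the branch axis; every result has shape (n, p)
results = {name: fn(stacked, axis=0) for name, fn in aggregators.items()}
for name, X in results.items():
    print(name, X.shape)
```

Aggregation requires every branch to produce the same (n, p) shape, which is why it suits branches that differ only in preprocessing.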

Include Original Features

Preserve pre-branch features alongside merged branch features:

pipeline = [
    MinMaxScaler(),  # Original features after this
    {"branch": {
        "snv": [SNV()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": {"features": "all", "include_original": True}},
    PLSRegression(n_components=10),
]

Result: [original_X | snv_X | derivative_X]

Output Formats

Control how merged features are structured:

# Default: single feature matrix
{"merge": {"features": "all", "output_as": "features"}}
# Result: shape (n, total_features)

# As separate sources (for multi-source models)
{"merge": {"features": "all", "output_as": "sources", "source_names": ["snv", "msc"]}}
# Result: SpectroDataset with named sources

# As dictionary (for multi-input models)
{"merge": {"features": "all", "output_as": "dict"}}
# Result: {"branch_0": X0, "branch_1": X1, ...}

Per-Branch Prediction Configuration

Control which models are selected from each branch and how their predictions are collected:

{"merge": {
    "predictions": [
        # Branch 0: Use best model by RMSE
        {"branch": 0, "select": "best", "metric": "rmse"},

        # Branch 1: Top 2 models by R²
        {"branch": 1, "select": {"top_k": 2}, "metric": "r2"},

        # Branch 2: Specific models by name
        {"branch": 2, "select": ["PLS", "Ridge"]},

        # Branch 3: Average predictions instead of separate columns
        {"branch": 3, "select": "all", "aggregate": "mean"},
    ]
}}

Selection strategies:

"all" - All models in branch (default)
"best" - Single best model by metric
{"top_k": N} - Top N models by metric
["model1", "model2"] - Explicit model names
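
The four selection strategies can be sketched as a small pure-Python function. `select_models` and the scores below are hypothetical and only illustrate the ranking logic; the `higher_is_better` flag is an assumption needed because RMSE should be minimized while R² should be maximized.

```python
def select_models(scores, select="all", higher_is_better=False):
    """Pick model names from a {name: validation score} mapping."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    if select == "all":
        return list(scores)
    if select == "best":
        return ranked[:1]
    if isinstance(select, dict) and "top_k" in select:
        return ranked[: select["top_k"]]
    if isinstance(select, list):
        missing = [name for name in select if name not in scores]
        if missing:
            raise KeyError(f"unknown models: {missing}")
        return list(select)
    raise ValueError(f"unsupported select value: {select!r}")


# Made-up RMSE values (lower is better)
rmse = {"PLS_3": 0.42, "PLS_5": 0.31, "PLS_10": 0.35}
print(select_models(rmse, "best"))        # ['PLS_5']
print(select_models(rmse, {"top_k": 2}))  # ['PLS_5', 'PLS_10']
```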

Aggregation strategies:

"separate" - Each model as a separate feature column (default)
"mean" - Simple average of predictions
"weighted_mean" - Weighted by validation scores
"proba_mean" - For classification: average class probabilities
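
The aggregation strategies differ in how many meta-feature columns a branch contributes. The NumPy sketch below uses made-up predictions and scores. The exact weighting scheme behind "weighted_mean" is not specified in this guide, so score-proportional weights are an assumption.

```python
import numpy as np

# OOF predictions from three models in one branch, one column per model
preds = np.array([
    [1.0, 1.2, 0.8],
    [2.0, 2.1, 1.9],
    [3.0, 3.3, 3.0],
])
r2_scores = np.array([0.9, 0.6, 0.5])  # made-up validation scores

separate = preds                                # (n, 3): one column per model
mean = preds.mean(axis=1, keepdims=True)        # (n, 1): simple average
weights = r2_scores / r2_scores.sum()           # assumption: score-proportional
weighted_mean = (preds @ weights)[:, None]      # (n, 1): weighted average

# Classification: average class probabilities over models
probas = np.array([
    [[0.8, 0.2], [0.4, 0.6]],   # model A, shape (n, n_classes)
    [[0.6, 0.4], [0.2, 0.8]],   # model B
])
proba_mean = probas.mean(axis=0)                # (n, n_classes)

print(separate.shape, mean.shape, weighted_mean.shape, proba_mean.shape)
```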

Unsafe Mode

By default, prediction merging uses OOF reconstruction to prevent data leakage. Disable this for special cases:

# Safe (default): OOF predictions reconstructed per fold
{"merge": "predictions"}  # or {"merge": {"predictions": "all", "unsafe": False}}

# Unsafe: Direct predictions (data leakage risk!)
{"merge": {"predictions": "all", "unsafe": True}}

Use unsafe: True only when you understand the implications (e.g., for final model predictions on new data).
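
The leakage risk is easiest to see with a model that memorizes its training data. In this sketch (plain NumPy, not nirs4all code) a 1-nearest-neighbour predictor is applied to pure-noise targets: in-sample predictions look perfect, while out-of-fold predictions reveal that nothing was learned. A meta-model trained on the in-sample column would trust it far too much.

```python
import numpy as np


def one_nn_predict(X_train, y_train, X_query):
    """Predict with the target of the nearest training sample (1-NN)."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[dists.argmin(axis=1)]


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)  # pure noise: there is no signal to learn

# "Unsafe": the model predicts the same samples it was fitted on
in_sample = one_nn_predict(X, y, X)

# "Safe": each sample is predicted by a model that never saw it
oof = np.empty_like(y)
all_idx = np.arange(len(y))
for val_idx in np.array_split(all_idx, 5):
    train_idx = np.setdiff1d(all_idx, val_idx)
    oof[val_idx] = one_nn_predict(X[train_idx], y[train_idx], X[val_idx])

mse_in = np.mean((in_sample - y) ** 2)
mse_oof = np.mean((oof - y) ** 2)
print("in-sample MSE:", mse_in)  # 0.0, every sample is its own nearest neighbour
print("OOF MSE:", mse_oof)
```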

Nested Branching

Multiple branch-merge cycles enable hierarchical architectures:

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, random_state=42),

    # Level 1: Preprocessing exploration
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
    }},
    {"merge": "features"},  # Exit first branch level

    # Level 2: Model comparison
    {"branch": {
        "pls": [PLSRegression(n_components=5)],
        "ridge": [Ridge(alpha=1.0)],
    }},
    {"merge": "predictions"},  # Stack predictions

    # Final meta-model
    Ridge(alpha=0.1),
]

Each merge exits its branch level, enabling sequential branch-merge cycles.

Source Merging (`merge_sources`)

The merge_sources keyword combines features from different data sources (sensors, instruments, modalities).

Basic Syntax

# Simple string syntax
{"merge_sources": "concat"}   # Horizontal concatenation
{"merge_sources": "stack"}    # 3D stacking (for CNNs)
{"merge_sources": "average"}  # Element-wise average

Concatenation (Default)

Horizontally concatenate all sources:

pipeline = [
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), SavitzkyGolay()],
    }},
    {"merge_sources": "concat"},
    PLSRegression(n_components=15),
]

Result: shape (n, nir_features + raman_features)

3D Stacking

Create a 3D array for CNN models:

{"merge_sources": "stack"}

Result: shape (n, n_sources, n_features)

Requires sources to have the same number of features.

Averaging

Element-wise average of sources:

{"merge_sources": "average"}

Result: shape (n, n_features)

Requires identical dimensions across sources.
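
The three strategies differ only in how the per-source arrays are combined. The NumPy sketch below reproduces the result shapes listed above, with random arrays as stand-ins for the sources; it is an illustration, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 7
sources = {
    "NIR": rng.normal(size=(n, p)),
    "Raman": rng.normal(size=(n, p)),
}
arrays = list(sources.values())

concat = np.hstack(arrays)          # (n, 2 * p): features side by side
stacked = np.stack(arrays, axis=1)  # (n, n_sources, p): one "channel" per source
average = np.mean(arrays, axis=0)   # (n, p): element-wise mean

print(concat.shape, stacked.shape, average.shape)  # (10, 14) (10, 2, 7) (10, 7)
```

Only concatenation tolerates sources with different feature counts; stacking and averaging need matching shapes.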

Dict Syntax Options

{"merge_sources": {
    "strategy": "concat",           # concat | stack | dict
    "sources": "all",               # or ["NIR", "markers"] for specific
    "on_incompatible": "error",     # error | flatten | pad | truncate
    "output_name": "merged",        # Name for merged source
    "preserve_source_info": True,   # Keep source metadata
}}

Weighted Merging

Scale source contributions before combining:

{"merge_sources": {
    "strategy": "concat",
    "weights": {"NIR": 1.0, "Raman": 0.5}  # Scale Raman features by 0.5
}}
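
A weight can be pictured as a scalar multiplier applied to a source's feature block before concatenation. The sketch below assumes that order of operations and a default weight of 1.0 for unlisted sources; both are illustrative assumptions rather than documented behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
sources = {
    "NIR": rng.normal(size=(5, 4)),
    "Raman": rng.normal(size=(5, 3)),
}
weights = {"NIR": 1.0, "Raman": 0.5}

# Scale each block by its weight (1.0 if unlisted), then concatenate
merged = np.hstack([X * weights.get(name, 1.0) for name, X in sources.items()])
print(merged.shape)  # (5, 7)
```

Scaling matters for scale-sensitive models such as PLS, where a block with larger variance has more influence on the latent components.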

Selective Merging

Include only specific sources:

{"merge_sources": {
    "sources": ["NIR", "markers"],  # Exclude Raman
    "strategy": "concat"
}}

Handling Shape Mismatches

When sources have different feature dimensions:

{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "error",     # Default: raise error
}}

# Or fallback strategies:
{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "flatten",   # Fall back to 2D concat
}}

{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "pad",       # Zero-pad shorter sources
}}

{"merge_sources": {
    "strategy": "stack",
    "on_incompatible": "truncate",  # Truncate longer sources
}}
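
The four `on_incompatible` behaviors can be sketched in NumPy as follows. The function `stack_sources` is hypothetical, and padding or truncating at the trailing end of the feature axis is an assumption; the library may align sources differently.

```python
import numpy as np


def stack_sources(arrays, on_incompatible="error"):
    """Stack 2D source arrays into (n, n_sources, p), handling width mismatches."""
    widths = [a.shape[1] for a in arrays]
    if len(set(widths)) == 1:
        return np.stack(arrays, axis=1)
    if on_incompatible == "error":
        raise ValueError(f"sources have different feature counts: {widths}")
    if on_incompatible == "flatten":
        return np.hstack(arrays)  # fall back to 2D concatenation
    if on_incompatible == "pad":
        target = max(widths)
        # Zero-pad the end of the feature axis of shorter sources
        arrays = [np.pad(a, ((0, 0), (0, target - a.shape[1]))) for a in arrays]
    elif on_incompatible == "truncate":
        target = min(widths)
        # Keep only the first `target` features of longer sources
        arrays = [a[:, :target] for a in arrays]
    else:
        raise ValueError(f"unknown on_incompatible value: {on_incompatible!r}")
    return np.stack(arrays, axis=1)


a, b = np.ones((4, 5)), np.ones((4, 3))
print(stack_sources([a, b], "flatten").shape)   # (4, 8)
print(stack_sources([a, b], "pad").shape)       # (4, 2, 5)
print(stack_sources([a, b], "truncate").shape)  # (4, 2, 3)
```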

Source Branching (`source_branch`)

Apply source-specific preprocessing pipelines. By default, sources are automatically merged after processing.

Basic Syntax

{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "Raman": [MSC(), SavitzkyGolay()],
    "markers": [StandardScaler()],
}}

Auto Mode

Process each source independently with an empty pipeline:

{"source_branch": "auto"}

Default Pipeline

Apply default preprocessing to unlisted sources:

{"source_branch": {
    "NIR": [SNV()],
    "_default_": [MinMaxScaler()],  # Applied to other sources
}}
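
The fallback behaves like a dictionary lookup with a default. In this pure-Python sketch, `resolve_pipelines` is a hypothetical helper and strings stand in for transformer instances. Treating a source with neither an entry nor a `_default_` as an empty pipeline is an assumption.

```python
def resolve_pipelines(config, source_names):
    """Map each source name to its pipeline, falling back to `_default_`."""
    default = config.get("_default_", [])
    return {name: config.get(name, default) for name in source_names}


config = {
    "NIR": ["SNV"],
    "_default_": ["MinMaxScaler"],
}
resolved = resolve_pipelines(config, ["NIR", "Raman", "markers"])
print(resolved)
# {'NIR': ['SNV'], 'Raman': ['MinMaxScaler'], 'markers': ['MinMaxScaler']}
```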

Controlling Auto-Merge

By default, sources merge after source_branch. Disable this:

{"source_branch": {
    "NIR": [SNV()],
    "Raman": [MSC()],
    "_merge_after_": False,         # Keep sources separate
    "_merge_strategy_": "concat",   # Merge strategy if merging (default)
}}

Indexed Sources

Reference sources by index instead of name:

{"source_branch": {
    0: [SNV(), FirstDerivative()],
    1: [MinMaxScaler()],
}}

Complete Examples

Example 1: Multi-Preprocessing Fusion

Combine multiple preprocessing strategies:

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative
import nirs4all

pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"branch": {
        "snv": [SNV()],
        "msc": [MSC()],
        "derivative": [FirstDerivative()],
    }},
    {"merge": "features"},
    PLSRegression(n_components=15),
]

result = nirs4all.run(pipeline=pipeline, dataset="path/to/data")

Example 2: Two-Level Stacking

Build a meta-model from diverse base models:

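from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative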

pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"branch": {
        "pls": [SNV(), PLSRegression(n_components=10)],
        "rf": [MSC(), RandomForestRegressor(n_estimators=100)],
        "ridge": [FirstDerivative(), Ridge(alpha=1.0)],
    }},
    {"merge": "predictions"},
    Ridge(alpha=0.1),  # Meta-model
]

Example 3: Multi-Instrument Fusion

Combine data from different spectrometers:

pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"source_branch": {
        "portable": [
            SNV(),
            SavitzkyGolay(window_length=15, polyorder=2),
            FirstDerivative(),
        ],
        "benchtop": [
            SNV(),
            FirstDerivative(),
        ],
    }},
    {"merge_sources": {
        "strategy": "concat",
        "weights": {"portable": 0.7, "benchtop": 1.0}
    }},
    PLSRegression(n_components=10),
]

Example 4: Hybrid Source and Pipeline Branching

Combine source-level and algorithm-level branching:

pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),

    # Step 1: Per-source preprocessing
    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},
    {"merge_sources": "concat"},

    # Step 2: Feature scaling
    MinMaxScaler(),

    # Step 3: Model comparison
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=100)],
    }},
]

Example 5: Advanced Prediction Selection

Fine-grained control over stacking:

pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"branch": {
        "diverse_pls": [SNV(),
            {"model": PLSRegression(3), "name": "PLS_3"},
            {"model": PLSRegression(5), "name": "PLS_5"},
            {"model": PLSRegression(10), "name": "PLS_10"},
        ],
        "ensemble": [MSC(), RandomForestRegressor(n_estimators=100)],
    }},
    {"merge": {
        "predictions": [
            # Top 2 PLS models by R²
            {"branch": "diverse_pls", "select": {"top_k": 2}, "metric": "r2"},
            # All from ensemble branch, averaged
            {"branch": "ensemble", "select": "all", "aggregate": "mean"},
        ]
    }},
    Ridge(alpha=0.1),
]

Key Behaviors

merge ALWAYS exits branch mode - After merge, the pipeline returns to a single path
OOF safety by default - Prediction merging uses out-of-fold reconstruction
source_branch auto-merges - Unless _merge_after_: False
Feature snapshots - Pre-branch features are preserved for include_original
Prediction mode support - Merge configurations are saved and restored for inference

Troubleshooting

“No feature snapshot found”

The merge requires features from before the branch. Ensure preprocessing steps exist before branching.

“Feature dimension mismatch”

Branches produced different feature counts. Use aggregation or check preprocessing consistency.

“No models found in branch”

For prediction merge, at least one branch must contain a trained model.

“Incomplete OOF coverage”

Some samples lack validation predictions. Check cross-validation setup or use on_missing: "warn".

“merge_sources requires multiple sources”

Your dataset has only one source. Use branch instead for single-source data.
