Multi-Source Pipelines

Work with multiple data sources (sensors, instruments, modalities) in a single pipeline.

Overview

Multi-source datasets combine data from different origins:

NIR + Raman: Complementary spectroscopy techniques
Portable + Benchtop: Same sensor type, different instruments
Spectra + Metadata: Spectral data with chemical markers or sensor readings

nirs4all provides two key constructs for multi-source workflows:

source_branch: Apply different preprocessing to each source
merge_sources: Combine source features into a unified representation

Loading Multi-Source Data

From Multiple Files

from nirs4all.data import DatasetConfigs

dataset = DatasetConfigs([
    {"path": "nir_spectra.csv", "source_name": "NIR"},
    {"path": "raman_spectra.csv", "source_name": "Raman"},
    {"path": "markers.csv", "source_name": "markers"},
])

Source Properties

Each source can have its own:

Number of features (wavelengths, channels)
Headers (wavelength values)
Preprocessing requirements

Source Branching

Apply source-specific preprocessing pipelines:

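from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler
# MSC and SavitzkyGolay are assumed to be exported alongside SNV and FirstDerivative
from nirs4all.operators.transforms import SNV, FirstDerivative, MSC, SavitzkyGolay

pipeline = [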
    ShuffleSplit(n_splits=5, random_state=42),

    # Different preprocessing per source
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), SavitzkyGolay(window_length=11, polyorder=2)],
        "markers": [VarianceThreshold(), StandardScaler()],
    }},

    # Sources are merged automatically after source_branch
    PLSRegression(n_components=15),
]

How source_branch Works

Isolation: Each source is processed independently
Conceptual parallelism: Source pipelines do not depend on one another, so they can be thought of as running side by side
Type-specific steps: Each source gets its own transformer chain
Auto-merge: By default, sources are concatenated after processing

Input Data
    │
    ├── NIR ──────► SNV → FirstDerivative ────┐
    │                                          │
    ├── Raman ────► MSC → SavitzkyGolay ──────├──► Merged Features
    │                                          │
    └── markers ──► VarianceThreshold ────────┘
                    → StandardScaler
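
The flow above can be sketched outside nirs4all with plain NumPy. This is only an illustration of the idea (independent per-source chains followed by column-wise concatenation), not the library's implementation. The helpers `snv`, `min_max`, and `run_source_branch` are hypothetical, and the random arrays stand in for real spectra.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def min_max(X):
    """Scale each feature (column) to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def run_source_branch(sources, chains):
    """Apply each source's own chain, then concatenate the results column-wise."""
    processed = []
    for name, X in sources.items():
        for step in chains[name]:
            X = step(X)
        processed.append(X)
    return np.hstack(processed)

rng = np.random.default_rng(0)
sources = {
    "NIR": rng.normal(size=(6, 50)),     # 6 samples, 50 wavelengths
    "markers": rng.normal(size=(6, 3)),  # the same 6 samples, 3 markers
}
chains = {"NIR": [snv], "markers": [min_max]}

merged = run_source_branch(sources, chains)
print(merged.shape)  # (6, 53)
```

Each chain only ever sees its own source, which is why a row-wise transform such as SNV on the spectra cannot be affected by the scale of the marker columns.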

source_branch Syntax Variants

Named Sources (Recommended)

{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "markers": [MinMaxScaler()],
}}

Indexed Sources

{"source_branch": {
    0: [SNV(), FirstDerivative()],
    1: [MinMaxScaler()],
}}

Auto Mode (Same Processing Per Source)

{"source_branch": "auto"}  # Every source is handled independently, each with an empty pipeline

Default Pipeline for Unlisted Sources

{"source_branch": {
    "NIR": [SNV()],
    "_default_": [MinMaxScaler()],  # Applied to other sources
}}
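
One way to read the `_default_` key is as a fallback lookup. The sketch below is an assumption about the semantics, not nirs4all code. `resolve_chains` is a hypothetical helper, strings stand in for transformer instances, and a source with neither an explicit entry nor a default is assumed to get an empty chain.

```python
def resolve_chains(spec, source_names, default_key="_default_"):
    """Return the step list each source would receive under a source_branch spec."""
    default = spec.get(default_key, [])  # assumption: no default means no steps
    return {name: spec.get(name, default) for name in source_names}

spec = {"NIR": ["SNV"], "_default_": ["MinMaxScaler"]}
resolved = resolve_chains(spec, ["NIR", "Raman", "markers"])
print(resolved)
# {'NIR': ['SNV'], 'Raman': ['MinMaxScaler'], 'markers': ['MinMaxScaler']}
```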

Disabling Auto-Merge

By default, sources are merged after source_branch. To keep them separate, set "_merge_after_" to False:

{"source_branch": {
    "NIR": [SNV()],
    "markers": [StandardScaler()],
    "_merge_after_": False  # Keep sources separate
}}

Merging Sources

Explicitly combine features from multiple sources:

pipeline = [
    ShuffleSplit(n_splits=5, random_state=42),

    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},

    # Explicit merge (the string form selects the merge mode)
    {"merge_sources": "concat"},  # Horizontal concatenation

    PLSRegression(n_components=15),
]

Merge Strategies

Concatenation (Default)

{"merge_sources": "concat"}

Horizontally concatenates all sources: [NIR_features | Raman_features | ...]

Stacking

{"merge_sources": "stack"}

Creates a 3D array of shape (samples, sources, features). Every source must have the same number of features.

Averaging

{"merge_sources": "average"}

Computes the element-wise average across sources. All sources must have identical shapes.
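
The three strategies differ mainly in the shape they produce. The following NumPy sketch shows the equivalent array operations on toy data. It illustrates the shapes only and makes no claim about how nirs4all implements the merge internally.

```python
import numpy as np

nir = np.arange(12, dtype=float).reshape(3, 4)  # 3 samples, 4 features
raman = np.ones((3, 4))                         # 3 samples, 4 features

concat = np.hstack([nir, raman])          # (3, 8): features side by side
stacked = np.stack([nir, raman], axis=1)  # (3, 2, 4): (samples, sources, features)
average = np.mean([nir, raman], axis=0)   # (3, 4): element-wise mean
print(concat.shape, stacked.shape, average.shape)

markers = np.zeros((3, 2))                # a source with a different width
print(np.hstack([nir, markers]).shape)    # (3, 6): concatenation still works
```

With the mismatched `markers` array, `np.stack` would raise a `ValueError`, which mirrors the requirement that stacking and averaging need sources of equal width.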

Advanced Merge Options

Weighted Merging

{"merge_sources": {
    "mode": "concat",
    "weights": {"NIR": 1.0, "Raman": 0.5}  # Scale Raman features by 0.5
}}
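
Conceptually, a weighted concat multiplies each source block by its weight before joining the columns. The sketch below shows that arithmetic in NumPy. `weighted_concat` is a hypothetical helper, and treating a missing weight as 1.0 is an assumption rather than documented behavior.

```python
import numpy as np

def weighted_concat(sources, weights):
    """Scale each source block by its weight (assumed default 1.0), then concatenate."""
    return np.hstack([weights.get(name, 1.0) * X for name, X in sources.items()])

sources = {"NIR": np.ones((2, 3)), "Raman": np.full((2, 2), 4.0)}
merged = weighted_concat(sources, {"NIR": 1.0, "Raman": 0.5})
print(merged)
# [[1. 1. 1. 2. 2.]
#  [1. 1. 1. 2. 2.]]
```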

Selective Merging

{"merge_sources": {
    "sources": ["NIR", "markers"],  # Exclude Raman
    "mode": "concat"
}}
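
Selective merging amounts to picking a subset of source blocks before concatenating. A minimal NumPy sketch, assuming the merged column order follows the order of the `sources` list (the helper `select_and_concat` is hypothetical):

```python
import numpy as np

def select_and_concat(sources, keep):
    """Concatenate only the named sources, in the order given by `keep`."""
    missing = [name for name in keep if name not in sources]
    if missing:
        raise KeyError(f"Unknown sources: {missing}")
    return np.hstack([sources[name] for name in keep])

sources = {
    "NIR": np.zeros((4, 10)),
    "Raman": np.zeros((4, 7)),
    "markers": np.zeros((4, 2)),
}
selected = select_and_concat(sources, ["NIR", "markers"])  # Raman is left out
print(selected.shape)  # (4, 12)
```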

Complete Examples

Example 1: NIR + Chemical Markers

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from nirs4all.operators.transforms import SNV, FirstDerivative
from nirs4all.data import DatasetConfigs
import nirs4all

# Multi-source dataset
dataset = DatasetConfigs([
    {"path": "nir_spectra.csv", "source_name": "NIR"},
    {"path": "chemical_markers.csv", "source_name": "markers"},
])

pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),

    # Source-specific preprocessing
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "markers": [VarianceThreshold(threshold=0.01), StandardScaler()],
    }},

    # Merge and model
    {"merge_sources": "concat"},
    PLSRegression(n_components=15),
]

result = nirs4all.run(pipeline=pipeline, dataset=dataset)
print(f"Multi-source RMSE: {result.best_score:.4f}")

Example 2: Portable vs Benchtop Instruments

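# Reuses the imports from Example 1; SavitzkyGolay is assumed to be exported by the same module
from nirs4all.operators.transforms import SavitzkyGolay

pipeline = [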
    KFold(n_splits=5, shuffle=True, random_state=42),

    # Instrument-specific preprocessing
    {"source_branch": {
        "portable": [
            # Portable needs more aggressive preprocessing
            SNV(),
            SavitzkyGolay(window_length=15, polyorder=2),
            FirstDerivative(),
        ],
        "benchtop": [
            # Benchtop is more stable
            SNV(),
        ],
    }},

    # Weighted merge (trust benchtop more)
    {"merge_sources": {
        "mode": "concat",
        "weights": {"portable": 0.7, "benchtop": 1.0}
    }},

    PLSRegression(n_components=10),
]

Example 3: Hybrid Branching (Sources + Preprocessing Variants)

Combine source branching with regular pipeline branching:

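from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
# MSC is assumed to be exported alongside SNV
from nirs4all.operators.transforms import MSC

# Reuses the remaining imports from Example 1; `dataset` must define "NIR" and "Raman" sources
pipeline = [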
    KFold(n_splits=3, shuffle=True, random_state=42),

    # Step 1: Source-level preprocessing
    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},

    # Step 2: Feature scaling (applied to merged features)
    MinMaxScaler(),

    # Step 3: Compare models via regular branching
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=100)],
    }},
]

result = nirs4all.run(pipeline=pipeline, dataset=dataset)
print(f"Branches: {result.predictions.get_unique_values('branch_name')}")

Sources vs Branches

Sources describe where the data comes from; branches describe how it is processed:

Concept      Sources                    Branches
Dimension    Data provenance            Processing strategy
Origin       Different sensors/files    Same data, different pipelines
Created by   `DatasetConfigs`           `branch` keyword
Merged by    `merge_sources`            `merge` keyword
Use case     Multi-instrument fusion    Algorithm comparison

Multi-Source Data                    Pipeline Branching

NIR data ──────┐                     Input ─────┬── SNV → PLS
               ├──► source_branch                │
Raman data ────┘                                 ├── MSC → RF
                                                 │
                                                 └── Detrend → SVR

Best Practices

Name your sources: Use descriptive names like "NIR", "Raman" instead of indices
Match preprocessing to source: Each source type has different noise characteristics
Consider feature scaling: Sources may have very different scales, and an unscaled source can dominate the merged features
Test weighted merging: Down-weighting a noisier source sometimes improves performance
Use variance thresholding: Remove uninformative features from metadata sources
Monitor source contributions: Use SHAP or feature importance to understand each source’s contribution
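
To attribute importances back to sources after a concat merge, it helps to know which columns belong to which source. The sketch below assumes the merged columns follow the source order and that the feature counts are taken after preprocessing (a step like VarianceThreshold can drop columns). `source_slices` and `importance_by_source` are hypothetical helpers, and the importance values are made up for illustration.

```python
import numpy as np

def source_slices(feature_counts):
    """Map each source name to its column slice in the concatenated matrix."""
    slices, start = {}, 0
    for name, n in feature_counts.items():
        slices[name] = slice(start, start + n)
        start += n
    return slices

def importance_by_source(importances, feature_counts):
    """Sum absolute per-feature importances within each source's columns."""
    importances = np.abs(np.asarray(importances, dtype=float))
    return {
        name: float(importances[cols].sum())
        for name, cols in source_slices(feature_counts).items()
    }

# Made-up importances: 3 NIR columns followed by 2 marker columns
scores = importance_by_source([0.1, -0.2, 0.3, 0.5, 0.5], {"NIR": 3, "markers": 2})
print(scores)
```

The same slices can be applied to model coefficients or SHAP values, as long as they are indexed by merged column.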

Troubleshooting

“merge_sources requires a dataset with feature sources”

Your dataset has only one source, so there is nothing to merge. Check that your DatasetConfigs definition lists every source.

Sources have different sample counts

All sources must have the same number of samples. Ensure your files are aligned by sample ID.
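
A quick check before loading can catch misaligned files early. The sketch below compares per-source sample ID lists in plain Python. `check_alignment` is a hypothetical helper, and it assumes you have already read the ID column of each file into a list.

```python
def check_alignment(sample_ids):
    """Compare every source's sample IDs against those of the first source."""
    names = list(sample_ids)
    reference = sample_ids[names[0]]
    problems = {}
    for name in names[1:]:
        ids = sample_ids[name]
        if ids != reference:
            problems[name] = {
                "missing": sorted(set(reference) - set(ids)),
                "extra": sorted(set(ids) - set(reference)),
                # True when the IDs match but appear in a different order
                "reordered": sorted(ids) == sorted(reference),
            }
    return problems

ids = {
    "NIR": ["s1", "s2", "s3"],
    "markers": ["s1", "s3"],
}
print(check_alignment(ids))
# {'markers': {'missing': ['s2'], 'extra': [], 'reordered': False}}
```

An empty result means every source lists the same samples in the same order.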

Feature dimension mismatch for stacking

"stack" mode requires all sources to have the same number of features. Use "concat" for heterogeneous sources.
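
If the source widths are not known in advance, the choice of mode can be derived from the data. A small sketch with NumPy arrays standing in for the processed sources (`can_stack` is a hypothetical helper, not part of nirs4all):

```python
import numpy as np

def can_stack(sources):
    """True when every source has the same number of feature columns."""
    widths = {X.shape[1] for X in sources.values()}
    return len(widths) == 1

sources = {"NIR": np.zeros((5, 100)), "markers": np.zeros((5, 4))}
mode = "stack" if can_stack(sources) else "concat"
print(mode)  # concat
```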