Multi-Source Pipelines

Work with multiple data sources (sensors, instruments, modalities) in a single pipeline.

Overview

Multi-source datasets combine data from different origins:

  • NIR + Raman: Complementary spectroscopy techniques

  • Portable + Benchtop: Same sensor type, different instruments

  • Spectra + Metadata: Spectral data with chemical markers or sensor readings

nirs4all provides two key constructs for multi-source workflows:

  • source_branch: Apply different preprocessing to each source

  • merge_sources: Combine source features into a unified representation

Loading Multi-Source Data

From Multiple Files

from nirs4all.data import DatasetConfigs

dataset = DatasetConfigs([
    {"path": "nir_spectra.csv", "source_name": "NIR"},
    {"path": "raman_spectra.csv", "source_name": "Raman"},
    {"path": "markers.csv", "source_name": "markers"},
])

Source Properties

Each source can have different:

  • Number of features (wavelengths, channels)

  • Headers (wavelength values)

  • Preprocessing requirements

Source Branching

Apply source-specific preprocessing pipelines:

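from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler
# MSC and SavitzkyGolay are assumed to be exported alongside SNV and FirstDerivative
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative, SavitzkyGolay

pipeline = [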
    ShuffleSplit(n_splits=5, random_state=42),

    # Different preprocessing per source
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), SavitzkyGolay(window_length=11, polyorder=2)],
        "markers": [VarianceThreshold(), StandardScaler()],
    }},

    # Sources are merged automatically after source_branch
    PLSRegression(n_components=15),
]

How source_branch Works

  1. Isolation: Each source is processed independently

  2. Parallel execution: Source pipelines are conceptually parallel; no source's steps depend on another source's output

  3. Type-specific steps: Each source gets its own transformer chain

  4. Auto-merge: By default, sources are concatenated after processing

Input Data
    │
    ├── NIR ──────► SNV → FirstDerivative ────┐
    │                                          │
    ├── Raman ────► MSC → SavitzkyGolay ──────├──► Merged Features
    │                                          │
    └── markers ──► VarianceThreshold ────────┘
                    → StandardScaler
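
The data flow described above can be sketched in plain NumPy. This is an illustration only, not nirs4all's implementation: `snv`, `first_derivative`, `standardize`, and `run_source_branch` are hypothetical helpers, and the random arrays stand in for real spectra and markers.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def first_derivative(X):
    """Finite-difference derivative along the wavelength axis (shape preserved)."""
    return np.gradient(X, axis=1)

def standardize(X):
    """Column-wise z-scoring, a simple stand-in for StandardScaler."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def run_source_branch(sources, chains):
    """Run each source through its own chain, then concatenate column-wise."""
    processed = {}
    for name, X in sources.items():
        for step in chains[name]:      # steps only ever see their own source
            X = step(X)
        processed[name] = X
    merged = np.hstack([processed[name] for name in sources])
    return processed, merged

rng = np.random.default_rng(0)
sources = {
    "NIR": rng.normal(size=(6, 50)),     # 6 samples, 50 wavelengths
    "markers": rng.normal(size=(6, 4)),  # 6 samples, 4 markers
}
chains = {
    "NIR": [snv, first_derivative],
    "markers": [standardize],
}

processed, merged = run_source_branch(sources, chains)
print(merged.shape)  # (6, 54)
```

Each chain receives only its own source's array, which is what keeps the sources isolated until the final concatenation.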

source_branch Syntax Variants

Indexed Sources

{"source_branch": {
    0: [SNV(), FirstDerivative()],
    1: [MinMaxScaler()],
}}

Auto Mode (Same Processing Per Source)

{"source_branch": "auto"}  # Each source processed independently with empty pipeline

Default Pipeline for Unlisted Sources

{"source_branch": {
    "NIR": [SNV()],
    "_default_": [MinMaxScaler()],  # Applied to other sources
}}
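
One way to picture the `_default_` key is as a dictionary lookup with a fallback. The sketch below is a hypothetical helper (`resolve_chains` is not part of nirs4all), and it assumes that a source with neither its own entry nor a default gets an empty chain:

```python
def resolve_chains(branch_spec, source_names, default_key="_default_"):
    """Map every source name to its step list, falling back to the default chain."""
    default = branch_spec.get(default_key, [])  # assumption: no default means no steps
    return {name: branch_spec.get(name, default) for name in source_names}

# Step names are plain strings here to keep the sketch dependency-free
spec = {"NIR": ["SNV"], "_default_": ["MinMaxScaler"]}
resolved = resolve_chains(spec, ["NIR", "Raman", "markers"])

for name, steps in resolved.items():
    print(name, steps)
```

Here "NIR" keeps its own chain, while "Raman" and "markers" both pick up the default.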

Disabling Auto-Merge

By default, sources are merged after source_branch. To keep them separate:

{"source_branch": {
    "NIR": [SNV()],
    "markers": [StandardScaler()],
    "_merge_after_": False  # Keep sources separate
}}

Merging Sources

Explicitly combine features from multiple sources:

pipeline = [
    ShuffleSplit(n_splits=5, random_state=42),

    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},

    # Explicit merge with options
    {"merge_sources": "concat"},  # Horizontal concatenation

    PLSRegression(n_components=15),
]

Merge Strategies

Concatenation (Default)

{"merge_sources": "concat"}

Horizontally concatenates all sources: [NIR_features | Raman_features | ...]

Stacking

{"merge_sources": "stack"}

Creates a 3D array of shape (samples, sources, features). Requires every source to have the same number of features.

Averaging

{"merge_sources": "average"}

Element-wise average across sources. Requires every source to have the same number of features.
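
The shape each strategy produces is easiest to see on small arrays. The following NumPy sketch mirrors the three modes described above; `merge_blocks` is a hypothetical helper written for illustration, not the library's code.

```python
import numpy as np

def merge_blocks(blocks, mode="concat"):
    """Combine per-source arrays of shape (samples, features_i)."""
    if mode not in ("concat", "stack", "average"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "concat":
        return np.hstack(blocks)              # (samples, sum of features_i)
    widths = {b.shape[1] for b in blocks}
    if len(widths) != 1:
        raise ValueError(f"{mode!r} needs equal feature counts, got {sorted(widths)}")
    stacked = np.stack(blocks, axis=1)        # (samples, sources, features)
    return stacked if mode == "stack" else stacked.mean(axis=1)

nir = np.ones((4, 10))
raman = np.full((4, 10), 3.0)
markers = np.zeros((4, 3))

print(merge_blocks([nir, raman, markers]).shape)      # (4, 23)
print(merge_blocks([nir, raman], "stack").shape)      # (4, 2, 10)
print(merge_blocks([nir, raman], "average").shape)    # (4, 10)
print(merge_blocks([nir, raman], "average")[0, 0])    # 2.0
```

Passing `markers` to the "stack" or "average" mode raises a `ValueError`, because its 3 features do not match the 10 of the other sources.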

Advanced Merge Options

Weighted Merging

{"merge_sources": {
    "mode": "concat",
    "weights": {"NIR": 1.0, "Raman": 0.5}  # Scale Raman features by 0.5
}}

Selective Merging

{"merge_sources": {
    "sources": ["NIR", "markers"],  # Exclude Raman
    "mode": "concat"
}}
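
As a rough mental model of the two options above, weighting scales a source's feature block before concatenation, and selection restricts which blocks are concatenated. The sketch below assumes that unlisted sources keep a weight of 1.0; `weighted_concat` is a hypothetical helper, not nirs4all's implementation.

```python
import numpy as np

def weighted_concat(blocks, weights=None, sources=None):
    """Concatenate selected source blocks, each scaled by its weight."""
    names = list(sources) if sources is not None else list(blocks)
    weights = weights or {}
    # Assumption: a source without an explicit weight is left unscaled
    return np.hstack([blocks[name] * weights.get(name, 1.0) for name in names])

blocks = {
    "NIR": np.ones((3, 4)),
    "Raman": np.full((3, 2), 2.0),
    "markers": np.full((3, 1), 5.0),
}

weighted = weighted_concat(blocks, weights={"NIR": 1.0, "Raman": 0.5})
selected = weighted_concat(blocks, sources=["NIR", "markers"])

print(weighted.shape)   # (3, 7)
print(weighted[0])      # Raman columns are halved from 2.0 to 1.0
print(selected.shape)   # (3, 5)
```

Scaling a block changes how strongly variance-sensitive models such as PLS respond to it, so the weights are worth tuning rather than fixing in advance.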

Complete Examples

Example 1: NIR + Chemical Markers

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from nirs4all.operators.transforms import SNV, FirstDerivative
from nirs4all.data import DatasetConfigs
import nirs4all

# Multi-source dataset
dataset = DatasetConfigs([
    {"path": "nir_spectra.csv", "source_name": "NIR"},
    {"path": "chemical_markers.csv", "source_name": "markers"},
])

pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),

    # Source-specific preprocessing
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "markers": [VarianceThreshold(threshold=0.01), StandardScaler()],
    }},

    # Merge and model
    {"merge_sources": "concat"},
    PLSRegression(n_components=15),
]

result = nirs4all.run(pipeline=pipeline, dataset=dataset)
print(f"Multi-source RMSE: {result.best_score:.4f}")

Example 2: Portable vs Benchtop Instruments

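from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
# SavitzkyGolay is assumed to be exported alongside SNV and FirstDerivative
from nirs4all.operators.transforms import SNV, FirstDerivative, SavitzkyGolay

pipeline = [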
    KFold(n_splits=5, shuffle=True, random_state=42),

    # Instrument-specific calibration
    {"source_branch": {
        "portable": [
            # Portable needs more aggressive preprocessing
            SNV(),
            SavitzkyGolay(window_length=15, polyorder=2),
            FirstDerivative(),
        ],
        "benchtop": [
            # Benchtop is more stable
            SNV(),
        ],
    }},

    # Weighted merge (trust benchtop more)
    {"merge_sources": {
        "mode": "concat",
        "weights": {"portable": 0.7, "benchtop": 1.0}
    }},

    PLSRegression(n_components=10),
]

Example 3: Hybrid Branching (Sources + Preprocessing Variants)

Combine source branching with regular pipeline branching:

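from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
# MSC is assumed to be exported alongside SNV
from nirs4all.operators.transforms import SNV, MSC
import nirs4all

# `dataset` is a DatasetConfigs with "NIR" and "Raman" sources
# (see "Loading Multi-Source Data")

pipeline = [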
    KFold(n_splits=3, shuffle=True, random_state=42),

    # Step 1: Source-level preprocessing
    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},

    # Step 2: Feature scaling (applied to merged features)
    MinMaxScaler(),

    # Step 3: Compare models via regular branching
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=100)],
    }},
]

result = nirs4all.run(pipeline=pipeline, dataset=dataset)
print(f"Branches: {result.predictions.get_unique_values('branch_name')}")

Sources vs Branches

Sources and branches address different dimensions of a pipeline:

Concept      Sources                    Branches
-----------  -------------------------  ------------------------------
Dimension    Data provenance            Processing strategy
Origin       Different sensors/files    Same data, different pipelines
Created by   DatasetConfigs             branch keyword
Merged by    merge_sources              merge keyword
Use case     Multi-instrument fusion    Algorithm comparison

Multi-Source Data                    Pipeline Branching

NIR data ──────┐                     Input ─────┬── SNV → PLS
               ├──► source_branch                │
Raman data ────┘                                 ├── MSC → RF
                                                 │
                                                 └── Detrend → SVR

Best Practices

  1. Name your sources: Use descriptive names like "NIR", "Raman" instead of indices

  2. Match preprocessing to source: Each source type has different noise characteristics

  3. Consider feature scaling: Sources may have very different scales

  4. Test weighted merging: Sometimes weighting sources improves performance

  5. Use variance thresholding: Remove uninformative features from metadata sources

  6. Monitor source contributions: Use SHAP or feature importance to understand each source’s contribution
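
To illustrate point 3, the NumPy sketch below shows how a source with large raw values can dominate the variance of a concatenated matrix, and how per-source block scaling evens this out. Block scaling is a general multiblock technique used here for illustration; `block_scale` and `variance_share` are hypothetical helpers and the data is synthetic.

```python
import numpy as np

def block_scale(X):
    """Z-score each column, then divide by sqrt(n_features) so the block's total variance is 1."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z / np.sqrt(X.shape[1])

def variance_share(X, n_first):
    """Fraction of total column variance carried by the first n_first columns."""
    var = X.var(axis=0)
    return var[:n_first].sum() / var.sum()

rng = np.random.default_rng(0)
nir = rng.normal(scale=0.01, size=(20, 100))     # small, absorbance-like values
markers = rng.normal(scale=50.0, size=(20, 3))   # large, concentration-like values

raw = np.hstack([nir, markers])
scaled = np.hstack([block_scale(nir), block_scale(markers)])

print(f"NIR variance share, raw:    {variance_share(raw, 100):.6f}")
print(f"NIR variance share, scaled: {variance_share(scaled, 100):.6f}")
```

In the raw matrix the three marker columns carry almost all of the variance; after block scaling each source carries half.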

Troubleshooting

“merge_sources requires a dataset with feature sources”

This error is raised when the dataset has only one source. Check that your DatasetConfigs definition lists every source file.

Sources have different sample counts

All sources must have the same number of samples. Ensure your files are aligned by sample ID.
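
A quick pre-flight check can locate the misaligned file before the pipeline runs. The sketch below assumes you can read a list of sample IDs from each file; `check_alignment` is a hypothetical helper, not part of nirs4all.

```python
def check_alignment(sample_ids):
    """Compare every source's sample IDs against the first source.

    sample_ids: {source_name: list of sample IDs in file order}
    Returns a list of human-readable problems (empty if all sources align).
    """
    names = list(sample_ids)
    ref_name, reference = names[0], sample_ids[names[0]]
    problems = []
    for name in names[1:]:
        ids = sample_ids[name]
        if len(ids) != len(reference):
            missing = sorted(set(reference) - set(ids))
            extra = sorted(set(ids) - set(reference))
            problems.append(
                f"{name}: {len(ids)} rows vs {len(reference)} in {ref_name}; "
                f"missing={missing}, extra={extra}"
            )
        elif ids != reference:
            problems.append(f"{name}: same row count as {ref_name} but different IDs or order")
    return problems

sample_ids = {
    "NIR": ["s1", "s2", "s3"],
    "Raman": ["s1", "s3"],            # one sample missing
    "markers": ["s2", "s1", "s3"],    # same samples, different order
}
problems = check_alignment(sample_ids)
for line in problems:
    print(line)
```

The second case matters as much as the first: files with matching row counts but a different row order load without error and silently pair the wrong samples.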

Feature dimension mismatch for stacking

"stack" mode requires all sources to have the same number of features. Use "concat" for heterogeneous sources.

See Also