Multi-Source Pipelines
Work with multiple data sources (sensors, instruments, modalities) in a single pipeline.
Overview
Multi-source datasets combine data from different origins:
NIR + Raman: Complementary spectroscopy techniques
Portable + Benchtop: Same sensor type, different instruments
Spectra + Metadata: Spectral data with chemical markers or sensor readings
nirs4all provides two key constructs for multi-source workflows:
source_branch: Apply different preprocessing to each source
merge_sources: Combine source features into a unified representation
Loading Multi-Source Data
From Multiple Files
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs([
    {"path": "nir_spectra.csv", "source_name": "NIR"},
    {"path": "raman_spectra.csv", "source_name": "Raman"},
    {"path": "markers.csv", "source_name": "markers"},
])
Source Properties
Each source can have different:
Number of features (wavelengths, channels)
Headers (wavelength values)
Preprocessing requirements
Source Branching
Apply source-specific preprocessing pipelines:
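from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler
# MSC and SavitzkyGolay are assumed to be exported alongside SNV and FirstDerivative
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative, SavitzkyGolay

pipeline = [
    ShuffleSplit(n_splits=5, random_state=42),
    # Different preprocessing per source
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), SavitzkyGolay(window_length=11, polyorder=2)],
        "markers": [VarianceThreshold(), StandardScaler()],
    }},
    # Sources are merged automatically after source_branch
    PLSRegression(n_components=15),
]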
How source_branch Works
Isolation: Each source is processed independently
Parallel execution: Source pipelines run in parallel (conceptually)
Type-specific steps: Each source gets its own transformer chain
Auto-merge: By default, sources are concatenated after processing
Input Data
    │
    ├── NIR ──────► SNV → FirstDerivative ────┐
    │                                         │
    ├── Raman ────► MSC → SavitzkyGolay ──────├──► Merged Features
    │                                         │
    └── markers ──► VarianceThreshold ────────┘
                    → StandardScaler
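The isolation and auto-merge behaviour described above can be sketched with plain NumPy. This is an illustration only, not nirs4all's implementation: `snv`, `first_derivative`, `standardize` and `run_source_branch` are hypothetical stand-ins written for this example, and the real operators may differ in detail.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def first_derivative(X):
    """Finite-difference derivative along the feature axis (stand-in)."""
    return np.gradient(X, axis=1)

def standardize(X):
    """Column-wise z-scoring of a feature block."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def run_source_branch(sources, branches):
    """Apply each source's chain in isolation, then concatenate column-wise."""
    processed = {}
    for name, X in sources.items():
        for step in branches.get(name, []):  # unlisted sources pass through
            X = step(X)
        processed[name] = X
    return np.hstack([processed[name] for name in sources])

rng = np.random.default_rng(0)
sources = {
    "NIR": rng.normal(size=(6, 50)),     # 6 samples, 50 wavelengths
    "markers": rng.normal(size=(6, 3)),  # 6 samples, 3 chemical markers
}
branches = {
    "NIR": [snv, first_derivative],
    "markers": [standardize],
}
merged = run_source_branch(sources, branches)
print(merged.shape)  # (6, 53)
```

No step ever sees another source's columns, which is why each chain can be chosen for its own data type.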
source_branch Syntax Variants
Named Sources (Recommended)
{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "markers": [MinMaxScaler()],
}}
Indexed Sources
{"source_branch": {
    0: [SNV(), FirstDerivative()],
    1: [MinMaxScaler()],
}}
Auto Mode (Same Processing Per Source)
{"source_branch": "auto"} # Each source processed independently with empty pipeline
Default Pipeline for Unlisted Sources
{"source_branch": {
    "NIR": [SNV()],
    "_default_": [MinMaxScaler()],  # Applied to other sources
}}
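To make the fallback rule concrete, here is a small sketch of how a branch specification could be resolved per source. `resolve_steps` is a hypothetical helper, the strings stand in for transformer instances, and the precedence shown (name, then index, then `_default_`) is an assumption rather than documented nirs4all behaviour.

```python
def resolve_steps(branch_spec, source_names):
    """Return {source_name: steps}, falling back to "_default_" when unlisted."""
    default = branch_spec.get("_default_", [])
    resolved = {}
    for index, name in enumerate(source_names):
        if name in branch_spec:        # named key
            resolved[name] = branch_spec[name]
        elif index in branch_spec:     # positional key
            resolved[name] = branch_spec[index]
        else:                          # anything not listed explicitly
            resolved[name] = default
    return resolved

spec = {"NIR": ["SNV"], "_default_": ["MinMaxScaler"]}
resolved = resolve_steps(spec, ["NIR", "Raman", "markers"])
print(resolved)
# {'NIR': ['SNV'], 'Raman': ['MinMaxScaler'], 'markers': ['MinMaxScaler']}
```

With real transformers, each source would need its own copy of the default steps so that fitted state is not shared between sources.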
Disabling Auto-Merge
By default, sources are merged after source_branch. To keep them separate:
{"source_branch": {
    "NIR": [SNV()],
    "markers": [StandardScaler()],
    "_merge_after_": False,  # Keep sources separate
}}
Merging Sources
Explicitly combine features from multiple sources:
pipeline = [
    ShuffleSplit(n_splits=5, random_state=42),
    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},
    # Explicit merge with options
    {"merge_sources": "concat"},  # Horizontal concatenation
    PLSRegression(n_components=15),
]
Merge Strategies
Concatenation (Default)
{"merge_sources": "concat"}
Horizontally concatenates all sources: [NIR_features | Raman_features | ...]
Stacking
{"merge_sources": "stack"}
Creates 3D array (samples, sources, features). Requires uniform feature dimensions.
Averaging
{"merge_sources": "average"}
Element-wise average of sources. Requires identical dimensions.
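The three strategies differ mainly in the shape they produce. The sketch below reproduces them with NumPy; `merge_blocks` is a hypothetical helper written for this illustration and does not reflect how nirs4all implements `merge_sources`.

```python
import numpy as np

def merge_blocks(blocks, mode="concat"):
    """Merge a list of (n_samples, n_features_i) arrays."""
    if mode == "concat":
        return np.hstack(blocks)             # (samples, sum of features)
    shapes = {b.shape for b in blocks}
    if len(shapes) != 1:
        raise ValueError(f"'{mode}' needs identical shapes, got {sorted(shapes)}")
    if mode == "stack":
        return np.stack(blocks, axis=1)      # (samples, sources, features)
    if mode == "average":
        return np.mean(blocks, axis=0)       # (samples, features)
    raise ValueError(f"unknown mode: {mode!r}")

nir = np.ones((4, 10))
raman = np.full((4, 10), 3.0)

print(merge_blocks([nir, raman], "concat").shape)   # (4, 20)
print(merge_blocks([nir, raman], "stack").shape)    # (4, 2, 10)
print(merge_blocks([nir, raman], "average")[0, 0])  # 2.0
```

Only concatenation tolerates sources with different feature counts. A stacked 3D array also cannot be passed directly to estimators that expect a 2D matrix, such as `PLSRegression`.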
Advanced Merge Options
Weighted Merging
{"merge_sources": {
    "mode": "concat",
    "weights": {"NIR": 1.0, "Raman": 0.5},  # Scale Raman features by 0.5
}}
Selective Merging
{"merge_sources": {
    "sources": ["NIR", "markers"],  # Exclude Raman
    "mode": "concat",
}}
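Both advanced options amount to simple array operations before concatenation. The following sketch assumes that a weight multiplies every feature of its source and that unlisted sources keep a weight of 1.0; `weighted_concat` is a hypothetical helper, not part of nirs4all.

```python
import numpy as np

def weighted_concat(sources, weights=None, include=None):
    """Scale each selected source by its weight, then concatenate column-wise."""
    names = list(sources) if include is None else list(include)
    weights = weights or {}
    blocks = [sources[name] * weights.get(name, 1.0) for name in names]
    return np.hstack(blocks)

sources = {
    "NIR": np.ones((3, 4)),
    "Raman": np.ones((3, 2)),
    "markers": np.ones((3, 1)),
}

weighted = weighted_concat(sources, weights={"NIR": 1.0, "Raman": 0.5})
selected = weighted_concat(sources, include=["NIR", "markers"])

print(weighted[0].tolist())  # [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0]
print(selected.shape)        # (3, 5)
```

Weights of this kind only influence scale-sensitive models. A per-feature scaler applied after the merge undoes them, and tree-based models are unaffected by positive rescaling of a feature.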
Complete Examples
Example 1: NIR + Chemical Markers
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from nirs4all.operators.transforms import SNV, FirstDerivative
from nirs4all.data import DatasetConfigs
import nirs4all
# Multi-source dataset
dataset = DatasetConfigs([
    {"path": "nir_spectra.csv", "source_name": "NIR"},
    {"path": "chemical_markers.csv", "source_name": "markers"},
])
pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),
    # Source-specific preprocessing
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "markers": [VarianceThreshold(threshold=0.01), StandardScaler()],
    }},
    # Merge and model
    {"merge_sources": "concat"},
    PLSRegression(n_components=15),
]
result = nirs4all.run(pipeline=pipeline, dataset=dataset)
print(f"Multi-source RMSE: {result.best_score:.4f}")
Example 2: Portable vs Benchtop Instruments
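from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
# SavitzkyGolay is assumed to be exported alongside SNV and FirstDerivative
from nirs4all.operators.transforms import SNV, FirstDerivative, SavitzkyGolay

pipeline = [
    KFold(n_splits=5, shuffle=True, random_state=42),
    # Instrument-specific preprocessing
    {"source_branch": {
        "portable": [
            # Portable needs more aggressive preprocessing
            SNV(),
            SavitzkyGolay(window_length=15, polyorder=2),
            FirstDerivative(),
        ],
        "benchtop": [
            # Benchtop is more stable
            SNV(),
        ],
    }},
    # Weighted merge (trust benchtop more)
    {"merge_sources": {
        "mode": "concat",
        "weights": {"portable": 0.7, "benchtop": 1.0},
    }},
    PLSRegression(n_components=10),
]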
Example 3: Hybrid Branching (Sources + Preprocessing Variants)
Combine source branching with regular pipeline branching:
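from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
# MSC is assumed to be exported alongside SNV
from nirs4all.operators.transforms import SNV, MSC
import nirs4all

# `dataset` is a multi-source DatasetConfigs with "NIR" and "Raman" sources
pipeline = [
    KFold(n_splits=3, shuffle=True, random_state=42),
    # Step 1: Source-level preprocessing
    {"source_branch": {
        "NIR": [SNV()],
        "Raman": [MSC()],
    }},
    # Step 2: Feature scaling (applied to merged features)
    MinMaxScaler(),
    # Step 3: Compare models via regular branching
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=100)],
    }},
]
result = nirs4all.run(pipeline=pipeline, dataset=dataset)
print(f"Branches: {result.predictions.get_unique_values('branch_name')}")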
Sources vs Branches
Understanding the difference between sources and branches:
| Concept | Sources | Branches |
|---|---|---|
| Dimension | Data provenance | Processing strategy |
| Origin | Different sensors/files | Same data, different pipelines |
| Created by | | |
| Merged by | | |
| Use case | Multi-instrument fusion | Algorithm comparison |
Multi-Source Data                    Pipeline Branching
NIR data ──────┐                     Input ─────┬── SNV → PLS
               ├──► source_branch               │
Raman data ────┘                                ├── MSC → RF
                                                │
                                                └── Detrend → SVR
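The distinction can also be shown in a few lines of NumPy. This is a conceptual sketch only; the two branch functions are arbitrary placeholders chosen for the illustration, not nirs4all operators.

```python
import numpy as np

rng = np.random.default_rng(1)
nir = rng.normal(size=(5, 8))    # 5 samples measured by one instrument
raman = rng.normal(size=(5, 6))  # the same 5 samples, another instrument

# Sources: different data for the same samples, fused into ONE feature matrix.
fused = np.hstack([nir, raman])

# Branches: the SAME matrix sent through alternative pipelines, one result each.
branches = {
    "centered": lambda X: X - X.mean(axis=0),
    "row_scaled": lambda X: X / np.linalg.norm(X, axis=1, keepdims=True),
}
results = {name: fn(fused) for name, fn in branches.items()}

print(fused.shape)  # (5, 14)
print({name: out.shape for name, out in results.items()})
# {'centered': (5, 14), 'row_scaled': (5, 14)}
```

Sources widen the feature matrix, while branches multiply the number of candidate results computed from it.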
Best Practices
Name your sources: Use descriptive names like "NIR" and "Raman" instead of indices
Match preprocessing to source: Each source type has different noise characteristics
Consider feature scaling: Sources may have very different scales
Test weighted merging: Sometimes weighting sources improves performance
Use variance thresholding: Remove uninformative features from metadata sources
Monitor source contributions: Use SHAP or feature importance to understand each source’s contribution
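One common way to act on the feature-scaling advice is block scaling: autoscale each source, then divide by the square root of its feature count so that a 200-wavelength spectrum does not outweigh three markers simply by having more columns. The sketch below is a generic chemometrics technique, not a nirs4all feature, and `block_scale` is a hypothetical helper.

```python
import numpy as np

def block_scale(X):
    """Autoscale columns, then shrink the block so its total variance is 1."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return Xs / np.sqrt(X.shape[1])

rng = np.random.default_rng(2)
nir = rng.normal(scale=0.01, size=(20, 200))   # many small-valued features
markers = rng.normal(scale=50.0, size=(20, 3)) # few large-valued features

fused = np.hstack([block_scale(nir), block_scale(markers)])

print(round(fused[:, :200].var(axis=0).sum(), 6))  # 1.0
print(round(fused[:, 200:].var(axis=0).sum(), 6))  # 1.0
```

In a cross-validated pipeline the means and standard deviations should be estimated on the training fold only, which this sketch does not do.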
Troubleshooting
“merge_sources requires a dataset with feature sources”
Your dataset has only one source. Check your DatasetConfigs definition.
Sources have different sample counts
All sources must have the same number of samples. Ensure your files are aligned by sample ID.
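If the files carry a sample ID column, the rows can be aligned before they are handed to `DatasetConfigs`. The sketch below uses NumPy and assumes IDs are unique within each source; `align_by_id` is a hypothetical helper, and writing the aligned arrays back to disk is left out.

```python
import numpy as np

def align_by_id(ids_a, X_a, ids_b, X_b):
    """Keep only samples present in both sources, in sorted-ID order."""
    common, idx_a, idx_b = np.intersect1d(ids_a, ids_b, return_indices=True)
    return common, X_a[idx_a], X_b[idx_b]

nir_ids = np.array(["s1", "s2", "s3", "s4"])
nir = np.arange(8.0).reshape(4, 2)
marker_ids = np.array(["s3", "s1", "s5"])
markers = np.array([[30.0], [10.0], [50.0]])

ids, nir_aligned, markers_aligned = align_by_id(nir_ids, nir, marker_ids, markers)

print(ids.tolist())              # ['s1', 's3']
print(nir_aligned.tolist())      # [[0.0, 1.0], [4.0, 5.0]]
print(markers_aligned.tolist())  # [[10.0], [30.0]]
```

Samples missing from either source are dropped, so it is worth logging how many rows survive the intersection.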
Feature dimension mismatch for stacking
"stack" mode requires all sources to have the same number of features. Use "concat" for heterogeneous sources.
See Also
Loading Data - Loading multi-source datasets
Pipeline Branching - Regular pipeline branching
Writing a Pipeline in nirs4all - Complete pipeline syntax reference
D04_merge_sources.py - Full example