Preprocessing Overview
This guide covers spectral preprocessing techniques for NIRS data. Preprocessing is critical for NIRS analysis as it removes artifacts, reduces noise, and enhances spectral features relevant to the target property.
Why Preprocessing Matters
Near-infrared spectroscopy (NIRS) data requires specialized preprocessing to:
Remove scatter effects from particle size, surface roughness, or path length variations (MSC, SNV)
Eliminate baseline drift and offset (detrending, derivatives)
Reduce noise while preserving spectral features (Savitzky–Golay, Gaussian smoothing)
Normalize intensity variations across instruments or sessions
Enhance absorption bands through derivatives or wavelet transforms
Common NIRS Preprocessing Workflow
A typical preprocessing workflow follows this order:
1. SCATTER CORRECTION (MSC or SNV)
↓
2. SMOOTHING (Savitzky–Golay or Gaussian)
↓
3. BASELINE CORRECTION (detrending or derivatives)
↓
4. SCALING (StandardScaler or normalization)
Tip
The order matters! Apply scatter correction before derivatives, and smoothing before derivatives to reduce noise amplification.
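The noise-amplification point in the tip can be checked numerically. The sketch below uses plain NumPy and SciPy rather than NIRS4ALL operators; the synthetic spectrum, noise level, and window settings are illustrative assumptions, not recommended values.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
wavelengths = np.linspace(1100, 2500, 700)            # nm, hypothetical grid
clean = np.exp(-((wavelengths - 1940) / 60.0) ** 2)   # one synthetic absorption band
noisy = clean + rng.normal(scale=0.01, size=clean.shape)

step = wavelengths[1] - wavelengths[0]
true_deriv = np.gradient(clean, step)

# Derivative of the raw noisy spectrum: finite differences amplify the noise.
raw_deriv = np.gradient(noisy, step)

# Savitzky-Golay smoothing and differentiation in a single pass.
sg_deriv = savgol_filter(noisy, window_length=15, polyorder=2, deriv=1, delta=step)

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

raw_error = rmse(raw_deriv, true_deriv)
sg_error = rmse(sg_deriv, true_deriv)
print(f"raw derivative RMSE: {raw_error:.5f}")
print(f"SG derivative RMSE:  {sg_error:.5f}")
```

On this synthetic signal the smoothed derivative should sit much closer to the noise-free derivative than the raw finite difference does.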
Quick Example
from nirs4all.operators.transforms import SNV, SavitzkyGolay
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

pipeline = [
    SNV(),                                      # Scatter correction
    SavitzkyGolay(window_length=15, deriv=1),   # First derivative with smoothing
    StandardScaler(),                           # Feature scaling
    {"model": PLSRegression(n_components=10)},
]
Available Operators
NIRS4ALL Transformers
All NIRS4ALL transformers follow the sklearn TransformerMixin pattern and can be used in pipelines.
Scatter Correction
| Operator | Description | Typical Use |
|---|---|---|
| SNV | Row-wise normalization: subtract mean, divide by std | General scatter correction |
| MSC | Reference-based correction for multiplicative effects | When reference spectrum available |
| | Outlier-resistant SNV variant | Noisy/heterogeneous samples |
| | Window-based local SNV | Heterogeneous materials |
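To make the first two rows concrete, here is a minimal NumPy sketch of SNV and MSC. It illustrates the textbook formulas only and is not the NIRS4ALL implementation; the function names `snv` and `msc` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / std

def msc(X, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each spectrum is regressed on the reference (x ~ a + b * ref) and
    corrected as (x - a) / b. The mean spectrum is used if no reference is given.
    """
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected

# Synthetic check: the same band shape with different offsets and gains.
base = np.sin(np.linspace(0, 3, 200)) + 2.0
X = np.vstack([0.5 + 1.0 * base, 0.1 + 1.3 * base, -0.2 + 0.8 * base])

X_snv = snv(X)                      # rows collapse onto one standardized shape
X_msc = msc(X, reference=base)      # rows collapse onto the reference itself
```

Both corrections remove the additive and multiplicative differences between the three rows; SNV needs no reference, while MSC maps every spectrum back onto the chosen reference.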
Derivatives
| Operator | Description | Typical Use |
|---|---|---|
| FirstDerivative | Numerical 1st derivative along wavelengths | Remove baseline offset |
| SecondDerivative | Numerical 2nd derivative along wavelengths | Resolve overlapping peaks |
| SavitzkyGolay | Smoothing + optional derivative | Most common derivative method |
Warning
Axis convention: FirstDerivative and SecondDerivative operate along axis=1 (wavelengths), which is correct for NIRS. The legacy Derivate class uses axis=0 (samples) and should be avoided.
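The difference between the two axes is easy to see on a tiny array. This sketch uses `np.diff` purely as an illustration of the axis convention; it does not claim to reproduce how FirstDerivative or Derivate compute their output.

```python
import numpy as np

# Two spectra (rows) sampled at five wavelengths (columns).
X = np.array([
    [1.0, 2.0, 4.0, 7.0, 11.0],
    [5.0, 5.0, 5.0, 5.0, 5.0],
])

# axis=1: differences between neighbouring wavelengths within each spectrum.
along_wavelengths = np.diff(X, axis=1)

# axis=0: differences between consecutive samples at each wavelength,
# which mixes unrelated spectra and is rarely meaningful for NIRS.
along_samples = np.diff(X, axis=0)

print(along_wavelengths.shape)  # (2, 4)
print(along_samples.shape)      # (1, 5)
```

With `axis=1` the flat second spectrum yields an all-zero derivative, as expected; with `axis=0` the result depends on the order of the samples in the dataset.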
Smoothing
| Operator | Description | Parameters |
|---|---|---|
| SavitzkyGolay | Polynomial smoothing filter | |
| | Gaussian filter | |
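For readers who want to see the two filters side by side, the following sketch calls the underlying SciPy routines directly. The window length, polynomial order, and sigma are arbitrary illustrative choices, and the data are synthetic.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 300)
clean = np.exp(-((x - 0.5) / 0.08) ** 2)
X = clean + rng.normal(scale=0.05, size=(4, x.size))   # 4 noisy spectra as rows

# Savitzky-Golay: fit a local polynomial inside each sliding window.
X_sg = savgol_filter(X, window_length=15, polyorder=3, axis=1)

# Gaussian: weighted moving average; sigma is expressed in data points.
X_gauss = gaussian_filter1d(X, sigma=2.0, axis=1)

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(f"noisy:    {rmse(X, clean):.4f}")
print(f"SG:       {rmse(X_sg, clean):.4f}")
print(f"Gaussian: {rmse(X_gauss, clean):.4f}")
```

Both filters operate along `axis=1`, so each spectrum is smoothed independently of the others.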
Baseline & Normalization
| Operator | Description | Typical Use |
|---|---|---|
| Detrend | Remove linear/polynomial baseline | Baseline slope removal |
| | Subtract per-feature mean | Simple offset removal |
| | Scale to range or L2 norm | Intensity standardization |
| | Min-max scaling per feature | Bounded [0,1] range |
| | Logarithmic transformation | Convert reflectance to pseudo-absorbance |
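Two of the operations in this table can be sketched with NumPy and SciPy alone. The example assumes a purely linear baseline and the common log10(1/R) convention for pseudo-absorbance; the NIRS4ALL operators may use different defaults.

```python
import numpy as np
from scipy.signal import detrend

wavelengths = np.linspace(1100, 2500, 200)
band = np.exp(-((wavelengths - 1940) / 50.0) ** 2)

# Add a different linear baseline (offset + slope) to each spectrum.
slopes = np.array([[1e-4], [3e-4]])
offsets = np.array([[0.2], [0.5]])
X = band + offsets + slopes * (wavelengths - wavelengths[0])

# Remove the least-squares linear trend from every row (axis=1 = wavelengths).
X_detrended = detrend(X, axis=1, type="linear")

# Reflectance R in (0, 1] converted to pseudo-absorbance A = log10(1 / R).
reflectance = np.array([[0.9, 0.5, 0.1]])
absorbance = -np.log10(reflectance)
```

After detrending, the two spectra coincide because they differed only by their linear baselines.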
Wavelets & Advanced
| Operator | Description | Typical Use |
|---|---|---|
| Wavelet | Discrete wavelet transform | Denoising, multi-resolution features |
| | Haar wavelet (shortcut) | Quick wavelet denoising |
| | Select wavelength range | Region selection |
| | Resample to fixed size | Standardize different instruments |
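As a rough illustration of what a Haar transform and a resampling step do to a spectra matrix, here is a NumPy-only sketch. The helper names (`haar_dwt`, `haar_idwt`, `resample`) are hypothetical and written for this example; they are not the NIRS4ALL operators, which may handle boundaries, decomposition levels, and interpolation differently.

```python
import numpy as np

def haar_dwt(X):
    """Single-level Haar DWT along the wavelength axis (even length required)."""
    X = np.asarray(X, dtype=float)
    even, odd = X[:, 0::2], X[:, 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: local averages
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: local differences
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even/odd columns."""
    X = np.empty((approx.shape[0], approx.shape[1] * 2))
    X[:, 0::2] = (approx + detail) / np.sqrt(2.0)
    X[:, 1::2] = (approx - detail) / np.sqrt(2.0)
    return X

def resample(X, wavelengths, new_wavelengths):
    """Linearly interpolate every spectrum onto a new wavelength grid."""
    return np.vstack([np.interp(new_wavelengths, wavelengths, row) for row in X])

wavelengths = np.linspace(1000, 2500, 16)
X = np.vstack([np.linspace(0.0, 1.5, 16), np.linspace(1.0, 2.5, 16)])

approx, detail = haar_dwt(X)
X_back = haar_idwt(approx, detail)                            # lossless round trip
X_res = resample(X, wavelengths, np.linspace(1000, 2500, 31))  # 16 -> 31 points
```

Denoising would typically shrink or zero small `detail` coefficients before inverting; resampling onto a shared grid is what lets spectra from instruments with different sampling be stacked in one matrix.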
Sklearn-Compatible Transformers
These standard sklearn transformers work seamlessly in NIRS4ALL pipelines:
| Transformer | Purpose | NIRS Context |
|---|---|---|
| StandardScaler | Zero mean, unit variance | Use after scatter correction |
| RobustScaler | Median/IQR scaling | Robust to outliers |
| MinMaxScaler | Scale to [0, 1] | Bounded models |
| PCA | Dimensionality reduction | NIRS spectra are highly collinear |
| FunctionTransformer | Wrap custom functions | Quick custom preprocessing |
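The sketch below chains three of these transformers with scikit-learn's own `make_pipeline`, which is plain scikit-learn rather than the NIRS4ALL list syntax. The `snv` helper and the random-walk "spectra" are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def snv(X):
    """Row-wise standardization (Standard Normal Variate)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 100)).cumsum(axis=1)   # synthetic, smooth-ish "spectra"

preprocess = make_pipeline(
    FunctionTransformer(snv),      # stateless custom step
    StandardScaler(),              # per-wavelength centering and scaling
    PCA(n_components=5),           # compress collinear wavelengths
)
scores = preprocess.fit_transform(X)
print(scores.shape)  # (30, 5)
```

`FunctionTransformer` suits stateless, row-wise operations such as SNV; anything that learns statistics from the training set needs a proper transformer with `fit`.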
Recommended Preprocessing by Model Type
Classical ML (sklearn)
| Model | Recommended Preprocessing | Avoid |
|---|---|---|
| PLS | Mean-centering, SNV/MSC, 1st derivative | Over-aggressive 2nd derivative |
| PCR | Center + autoscale, SG smoothing | Raw uncorrected scatter |
| SVM/SVR | Standardization, SNV/MSC, SG + 1st deriv | No scaling |
| Random Forest | SG smoothing, SNV/MSC | Per-feature standardization |
| k-NN | Per-feature scaling or SNV, band selection | Raw unscaled spectra |
Neural Networks
| Model | Recommended Preprocessing | Avoid |
|---|---|---|
| MLP | Standardization, mean-centering/SNV, SG smoothing | Raw unscaled spectra |
| 1D CNN | Input scaling, SNV/MSC, 1st derivative optional | Over-smoothed spectra |
| Transformers | Standardization, SNV/MSC, patch/bin tokens | Very long sequences without reduction |
Preprocessing Order Rules
Correct Order
# ✅ CORRECT: scatter → smooth → derivative
pipeline = [
    MSC(),                             # 1. Scatter correction first
    SavitzkyGolay(window_length=15),   # 2. Smoothing
    FirstDerivative(),                 # 3. Derivative last
]

# ✅ CORRECT: SavGol with built-in derivative
pipeline = [
    SNV(),
    SavitzkyGolay(window_length=15, deriv=1),   # Smoothing + derivative
]

# ✅ CORRECT: Detrend before scatter correction
pipeline = [
    Detrend(),
    SNV(),
    SavitzkyGolay(window_length=15),
]
Incorrect Orders (Avoid)
# ❌ WRONG: Don't combine two scatter corrections
pipeline = [SNV(), MSC()]   # Redundant

# ❌ WRONG: Don't smooth after SavGol derivative
pipeline = [
    SavitzkyGolay(deriv=1),
    SavitzkyGolay(),   # SG already includes smoothing
]

# ❌ WRONG: Derivative before scatter correction
pipeline = [
    FirstDerivative(),   # Amplifies scatter artifacts
    MSC(),
]
Task-Specific Preprocessing
Protein/Nitrogen Prediction (N-H bonds, 2000-2200 nm)
pipeline = [
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),   # Critical for protein
    StandardScaler(),
]
Moisture Prediction (O-H bonds, 1400-1500 nm, 1900-2000 nm)
pipeline = [
    MSC(),
    SavitzkyGolay(window_length=15),   # Smooth only; avoid derivatives in high-absorption regions
    RSNV(),                            # Local scatter for heterogeneous moisture
]
Fat/Oil Prediction (C-H bonds, 1700-1800 nm, 2300-2400 nm)
pipeline = [
    MSC(),   # Important for fat scatter; a second scatter correction (SNV) would be redundant
    SavitzkyGolay(window_length=21, deriv=1),
]
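The band ranges quoted in the headings above can also be used to restrict the spectra to the informative regions. The following NumPy sketch shows one way to do that; the `BANDS` dictionary, the `band_mask` helper, and the 2 nm wavelength grid are hypothetical and only serve the illustration.

```python
import numpy as np

# Band ranges (nm) taken from the task headings in this section.
BANDS = {
    "protein": [(2000, 2200)],
    "moisture": [(1400, 1500), (1900, 2000)],
    "fat": [(1700, 1800), (2300, 2400)],
}

def band_mask(wavelengths, ranges):
    """Boolean mask that is True inside any of the (low, high) ranges."""
    wavelengths = np.asarray(wavelengths)
    mask = np.zeros(wavelengths.shape, dtype=bool)
    for low, high in ranges:
        mask |= (wavelengths >= low) & (wavelengths <= high)
    return mask

wavelengths = np.arange(1100, 2501, 2)          # hypothetical 2 nm grid
X = np.random.default_rng(0).random((5, wavelengths.size))

moisture = band_mask(wavelengths, BANDS["moisture"])
X_moisture = X[:, moisture]
print(X_moisture.shape)  # (5, 102)
```

Whether band selection helps is dataset-dependent, so it is worth comparing against the full-spectrum pipeline under the same cross-validation scheme.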
Multi-Layer Preprocessing for Deep Learning
When training neural networks, multiple preprocessing “views” can be stacked as channels:
# Create multiple preprocessing branches
pipeline = [
    {"branch": [
        [SNV(), SavitzkyGolay()],      # Channel 1: SNV + smooth
        [MSC(), FirstDerivative()],    # Channel 2: MSC + 1st deriv
        [SNV(), SecondDerivative()],   # Channel 3: SNV + 2nd deriv
        [Wavelet('db4')],              # Channel 4: Wavelet features
    ]},
    {"merge": "features"},             # Stack as multi-channel input
    {"model": CNN1D()},
]
Note
After stacking the preprocessing branches as channels, scale each channel separately rather than applying a single global scaler.
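A short NumPy sketch shows why the scaling should be per channel. The three "views" here are random arrays with deliberately different magnitudes standing in for real preprocessing outputs, and the (samples, channels, wavelengths) layout is an assumption about how the stacked input is arranged.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 20, 50

# Three hypothetical preprocessing "views" with very different magnitudes.
views = [
    rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_wavelengths)),    # e.g. SNV
    rng.normal(loc=0.0, scale=1e-2, size=(n_samples, n_wavelengths)),   # e.g. 1st derivative
    rng.normal(loc=0.0, scale=1e-4, size=(n_samples, n_wavelengths)),   # e.g. 2nd derivative
]

# Stack as channels: (n_samples, n_channels, n_wavelengths).
stacked = np.stack(views, axis=1)

# Per-channel statistics, computed over samples and wavelengths.
# In practice these would be estimated on the training fold only.
mean = stacked.mean(axis=(0, 2), keepdims=True)
std = stacked.std(axis=(0, 2), keepdims=True)
per_channel = (stacked - mean) / std

# A single global scaler leaves the small-magnitude channels almost flat.
global_scaled = (stacked - stacked.mean()) / stacked.std()

print(per_channel.std(axis=(0, 2)))     # every channel close to 1
print(global_scaled.std(axis=(0, 2)))   # channels remain on different scales
```

With global scaling, the derivative channels contribute almost nothing to the early layers of a network, which defeats the purpose of adding them.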
SciPy Functions Reference
These SciPy functions are used internally by NIRS4ALL transformers:
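| Function | Purpose | Used By |
|---|---|---|
| scipy.signal.savgol_filter | Smoothing + derivatives | SavitzkyGolay |
| scipy.signal.detrend | Linear trend removal | Detrend |
| scipy.ndimage.gaussian_filter1d | Gaussian smoothing | |
| scipy.interpolate.interp1d | 1D interpolation | |
| | Spline smoothing | Spline augmenters |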
Best Practices
Tip
Golden Rules for NIRS Preprocessing
Always fit preprocessing inside CV to avoid data leakage
Start simple: SNV + SavGol(deriv=1) is a strong baseline
Don’t over-preprocess: More steps ≠ better results
Validate your choices: Compare RMSE across preprocessing variants
Match preprocessing to model: Trees don’t need scaling; SVMs do
See Also
Preprocessing Cheatsheet - Quick reference by model type
Preprocessing Handbook - In-depth theory and advanced techniques
Operator Catalog - Complete operator reference
Standard Normal Variate (SNV) Transformation - Detailed SNV documentation