Preprocessing Overview

This guide covers spectral preprocessing techniques for NIRS data. Preprocessing is critical for NIRS analysis as it removes artifacts, reduces noise, and enhances spectral features relevant to the target property.

Why Preprocessing Matters

Near-infrared spectroscopy (NIRS) data requires specialized preprocessing to:

Remove scatter effects from particle size, surface roughness, or path length variations (MSC, SNV)
Eliminate baseline drift and offset (detrending, derivatives)
Reduce noise while preserving spectral features (Savitzky–Golay, Gaussian smoothing)
Normalize intensity variations across instruments or sessions
Enhance absorption bands through derivatives or wavelet transforms

Common NIRS Preprocessing Workflow

A typical preprocessing workflow follows this order:

1. SCATTER CORRECTION (MSC or SNV)
   ↓
2. SMOOTHING (Savitzky–Golay or Gaussian)
   ↓
3. BASELINE CORRECTION (detrending or derivatives)
   ↓
4. SCALING (StandardScaler or normalization)

Tip

The order matters! Apply scatter correction before derivatives, and smoothing before derivatives to reduce noise amplification.

Quick Example

from nirs4all.operators.transforms import SNV, SavitzkyGolay
from sklearn.preprocessing import StandardScaler

pipeline = [
    SNV(),                                    # Scatter correction
    SavitzkyGolay(window_length=15, deriv=1), # First derivative with smoothing
    StandardScaler(),                          # Feature scaling
    {"model": PLSRegression(n_components=10)}
]

Available Operators

NIRS4ALL Transformers

All NIRS4ALL transformers follow the sklearn TransformerMixin pattern and can be used in pipelines.

Scatter Correction

Operator	Description	Typical Use
`SNV` (StandardNormalVariate)	Row-wise normalization: subtract mean, divide by std	General scatter correction
`MSC` (MultiplicativeScatterCorrection)	Reference-based correction for multiplicative effects	When reference spectrum available
`RSNV` (RobustStandardNormalVariate)	Outlier-resistant SNV variant	Noisy/heterogeneous samples
`LSNV` (LocalStandardNormalVariate)	Window-based local SNV	Heterogeneous materials

Derivatives

Operator	Description	Typical Use
`FirstDerivative`	Numerical 1st derivative along wavelengths	Remove baseline offset
`SecondDerivative`	Numerical 2nd derivative along wavelengths	Resolve overlapping peaks
`SavitzkyGolay`	Smoothing + optional derivative	Most common derivative method

Warning

Axis convention: FirstDerivative and SecondDerivative operate along axis=1 (wavelengths), which is correct for NIRS. The legacy Derivate class uses axis=0 (samples) and should be avoided.

Smoothing

Operator	Description	Parameters
`SavitzkyGolay`	Polynomial smoothing filter	`window_length=11, polyorder=3, deriv=0`
`Gaussian`	Gaussian filter	`sigma=1, order=2`

Baseline & Normalization

Operator	Description	Typical Use
`Detrend`	Remove linear/polynomial baseline	Baseline slope removal
`Baseline`	Subtract per-feature mean	Simple offset removal
`Normalize`	Scale to range or L2 norm	Intensity standardization
`SimpleScale`	Min-max scaling per feature	Bounded [0,1] range
`LogTransform`	Logarithmic transformation	Convert reflectance to pseudo-absorbance

Wavelets & Advanced

Operator	Description	Typical Use
`Wavelet`	Discrete wavelet transform	Denoising, multi-resolution features
`Haar`	Haar wavelet (shortcut)	Quick wavelet denoising
`CropTransformer`	Select wavelength range	Region selection
`ResampleTransformer`	Resample to fixed size	Standardize different instruments

Sklearn-Compatible Transformers

These standard sklearn transformers work seamlessly in NIRS4ALL pipelines:

Transformer	Purpose	NIRS Context
`StandardScaler`	Zero mean, unit variance	Use after scatter correction
`RobustScaler`	Median/IQR scaling	Robust to outliers
`MinMaxScaler`	Scale to [0, 1]	Bounded models
`PCA`	Dimensionality reduction	NIRS spectra are highly collinear
`FunctionTransformer`	Wrap custom functions	Quick custom preprocessing

Recommended Preprocessing by Model Type

Classical ML (sklearn)

Model	Recommended Preprocessing	Avoid
PLS	Mean-centering, SNV/MSC, 1st derivative	Over-aggressive 2nd derivative
PCR	Center + autoscale, SG smoothing	Raw uncorrected scatter
SVM/SVR	Standardization, SNV/MSC, SG + 1st deriv	No scaling
Random Forest	SG smoothing, SNV/MSC	Per-feature standardization
k-NN	Per-feature scaling or SNV, band selection	Raw unscaled spectra

Neural Networks

Model	Recommended Preprocessing	Avoid
MLP	Standardization, mean-centering/SNV, SG smoothing	Raw unscaled spectra
1D CNN	Input scaling, SNV/MSC, 1st derivative optional	Over-smoothed spectra
Transformers	Standardization, SNV/MSC, patch/bin tokens	Very long sequences without reduction

Preprocessing Order Rules

Correct Order

# ✅ CORRECT: scatter → smooth → derivative
pipeline = [
    MSC(),                           # 1. Scatter correction first
    SavitzkyGolay(window_length=15), # 2. Smoothing
    FirstDerivative(),               # 3. Derivative last
]

# ✅ CORRECT: SavGol with built-in derivative
pipeline = [
    SNV(),
    SavitzkyGolay(window_length=15, deriv=1),  # Smoothing + derivative
]

# ✅ CORRECT: Detrend before scatter correction
pipeline = [
    Detrend(),
    SNV(),
    SavitzkyGolay(window_length=15),
]

Incorrect Orders (Avoid)

# ❌ WRONG: Don't combine two scatter corrections
pipeline = [SNV(), MSC()]  # Redundant

# ❌ WRONG: Don't smooth after SavGol derivative
pipeline = [
    SavitzkyGolay(deriv=1),
    SavitzkyGolay(),  # SG already includes smoothing
]

# ❌ WRONG: Derivative before scatter correction
pipeline = [
    FirstDerivative(),  # Amplifies scatter artifacts
    MSC(),
]

Task-Specific Preprocessing

Protein/Nitrogen Prediction (N-H bonds, 2000-2200nm)

pipeline = [
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),  # Critical for protein
    StandardScaler(),
]

Moisture Prediction (O-H bonds, 1400-1500nm, 1900-2000nm)

pipeline = [
    MSC(),
    SavitzkyGolay(window_length=15),  # Smooth, avoid derivatives in high absorption
    RSNV(),  # Local scatter for heterogeneous moisture
]

Fat/Oil Prediction (C-H bonds, 1700-1800nm, 2300-2400nm)

pipeline = [
    MSC(),  # Important for fat scatter
    SNV(),
    SavitzkyGolay(window_length=21, deriv=1),
]

Multi-Layer Preprocessing for Deep Learning

When training neural networks, multiple preprocessing “views” can be stacked as channels:

# Create multiple preprocessing branches
pipeline = [
    {"branch": [
        [SNV(), SavitzkyGolay()],           # Channel 1: SNV + smooth
        [MSC(), FirstDerivative()],          # Channel 2: MSC + 1st deriv
        [SNV(), SecondDerivative()],         # Channel 3: SNV + 2nd deriv
        [Wavelet('db4')],                    # Channel 4: Wavelet features
    ]},
    {"merge": "features"},  # Stack as multi-channel input
    {"model": CNN1D()}
]

Note

After concatenating multi-layer preprocessing, apply per-channel scaling, not a single global scaler.

SciPy Functions Reference

These SciPy functions are used internally by NIRS4ALL transformers:

Function	Purpose	Used By
`scipy.signal.savgol_filter`	Smoothing + derivatives	`SavitzkyGolay`
`scipy.signal.detrend`	Linear trend removal	`Detrend`
`scipy.ndimage.gaussian_filter1d`	Gaussian smoothing	`Gaussian`
`scipy.interpolate.interp1d`	1D interpolation	`ResampleTransformer`
`scipy.interpolate.UnivariateSpline`	Spline smoothing	Spline augmenters

Best Practices

Tip

Golden Rules for NIRS Preprocessing

Always fit preprocessing inside CV to avoid data leakage
Start simple: SNV + SavGol(deriv=1) is a strong baseline
Don’t over-preprocess: More steps ≠ better results
Validate your choices: Compare RMSE across preprocessing variants
Match preprocessing to model: Trees don’t need scaling; SVMs do