Preprocessing Overview

This guide covers spectral preprocessing techniques for NIRS data. Preprocessing is critical for NIRS analysis as it removes artifacts, reduces noise, and enhances spectral features relevant to the target property.

Why Preprocessing Matters

Near-infrared spectroscopy (NIRS) data requires specialized preprocessing to:

  • Remove scatter effects from particle size, surface roughness, or path length variations (MSC, SNV)

  • Eliminate baseline drift and offset (detrending, derivatives)

  • Reduce noise while preserving spectral features (Savitzky–Golay, Gaussian smoothing)

  • Normalize intensity variations across instruments or sessions

  • Enhance absorption bands through derivatives or wavelet transforms

Common NIRS Preprocessing Workflow

A typical preprocessing workflow follows this order:

1. SCATTER CORRECTION (MSC or SNV)
   ↓
2. SMOOTHING (Savitzky–Golay or Gaussian)
   ↓
3. BASELINE CORRECTION (detrending or derivatives)
   ↓
4. SCALING (StandardScaler or normalization)

Tip

The order matters! Apply scatter correction before derivatives, and smoothing before derivatives to reduce noise amplification.

Quick Example

from nirs4all.operators.transforms import SNV, SavitzkyGolay
from sklearn.preprocessing import StandardScaler

pipeline = [
    SNV(),                                    # Scatter correction
    SavitzkyGolay(window_length=15, deriv=1), # First derivative with smoothing
    StandardScaler(),                          # Feature scaling
    {"model": PLSRegression(n_components=10)}
]

Available Operators

NIRS4ALL Transformers

All NIRS4ALL transformers follow the sklearn TransformerMixin pattern and can be used in pipelines.

Scatter Correction

Operator

Description

Typical Use

SNV (StandardNormalVariate)

Row-wise normalization: subtract mean, divide by std

General scatter correction

MSC (MultiplicativeScatterCorrection)

Reference-based correction for multiplicative effects

When reference spectrum available

RSNV (RobustStandardNormalVariate)

Outlier-resistant SNV variant

Noisy/heterogeneous samples

LSNV (LocalStandardNormalVariate)

Window-based local SNV

Heterogeneous materials

Derivatives

Operator

Description

Typical Use

FirstDerivative

Numerical 1st derivative along wavelengths

Remove baseline offset

SecondDerivative

Numerical 2nd derivative along wavelengths

Resolve overlapping peaks

SavitzkyGolay

Smoothing + optional derivative

Most common derivative method

Warning

Axis convention: FirstDerivative and SecondDerivative operate along axis=1 (wavelengths), which is correct for NIRS. The legacy Derivate class uses axis=0 (samples) and should be avoided.

Smoothing

Operator

Description

Parameters

SavitzkyGolay

Polynomial smoothing filter

window_length=11, polyorder=3, deriv=0

Gaussian

Gaussian filter

sigma=1, order=2

Baseline & Normalization

Operator

Description

Typical Use

Detrend

Remove linear/polynomial baseline

Baseline slope removal

Baseline

Subtract per-feature mean

Simple offset removal

Normalize

Scale to range or L2 norm

Intensity standardization

SimpleScale

Min-max scaling per feature

Bounded [0,1] range

LogTransform

Logarithmic transformation

Convert reflectance to pseudo-absorbance

Wavelets & Advanced

Operator

Description

Typical Use

Wavelet

Discrete wavelet transform

Denoising, multi-resolution features

Haar

Haar wavelet (shortcut)

Quick wavelet denoising

CropTransformer

Select wavelength range

Region selection

ResampleTransformer

Resample to fixed size

Standardize different instruments

Sklearn-Compatible Transformers

These standard sklearn transformers work seamlessly in NIRS4ALL pipelines:

Transformer

Purpose

NIRS Context

StandardScaler

Zero mean, unit variance

Use after scatter correction

RobustScaler

Median/IQR scaling

Robust to outliers

MinMaxScaler

Scale to [0, 1]

Bounded models

PCA

Dimensionality reduction

NIRS spectra are highly collinear

FunctionTransformer

Wrap custom functions

Quick custom preprocessing

Preprocessing Order Rules

Correct Order

# ✅ CORRECT: scatter → smooth → derivative
pipeline = [
    MSC(),                           # 1. Scatter correction first
    SavitzkyGolay(window_length=15), # 2. Smoothing
    FirstDerivative(),               # 3. Derivative last
]

# ✅ CORRECT: SavGol with built-in derivative
pipeline = [
    SNV(),
    SavitzkyGolay(window_length=15, deriv=1),  # Smoothing + derivative
]

# ✅ CORRECT: Detrend before scatter correction
pipeline = [
    Detrend(),
    SNV(),
    SavitzkyGolay(window_length=15),
]

Incorrect Orders (Avoid)

# ❌ WRONG: Don't combine two scatter corrections
pipeline = [SNV(), MSC()]  # Redundant

# ❌ WRONG: Don't smooth after SavGol derivative
pipeline = [
    SavitzkyGolay(deriv=1),
    SavitzkyGolay(),  # SG already includes smoothing
]

# ❌ WRONG: Derivative before scatter correction
pipeline = [
    FirstDerivative(),  # Amplifies scatter artifacts
    MSC(),
]

Task-Specific Preprocessing

Protein/Nitrogen Prediction (N-H bonds, 2000-2200nm)

pipeline = [
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),  # Critical for protein
    StandardScaler(),
]

Moisture Prediction (O-H bonds, 1400-1500nm, 1900-2000nm)

pipeline = [
    MSC(),
    SavitzkyGolay(window_length=15),  # Smooth, avoid derivatives in high absorption
    RSNV(),  # Local scatter for heterogeneous moisture
]

Fat/Oil Prediction (C-H bonds, 1700-1800nm, 2300-2400nm)

pipeline = [
    MSC(),  # Important for fat scatter
    SNV(),
    SavitzkyGolay(window_length=21, deriv=1),
]

Multi-Layer Preprocessing for Deep Learning

When training neural networks, multiple preprocessing “views” can be stacked as channels:

# Create multiple preprocessing branches
pipeline = [
    {"branch": [
        [SNV(), SavitzkyGolay()],           # Channel 1: SNV + smooth
        [MSC(), FirstDerivative()],          # Channel 2: MSC + 1st deriv
        [SNV(), SecondDerivative()],         # Channel 3: SNV + 2nd deriv
        [Wavelet('db4')],                    # Channel 4: Wavelet features
    ]},
    {"merge": "features"},  # Stack as multi-channel input
    {"model": CNN1D()}
]

Note

After concatenating multi-layer preprocessing, apply per-channel scaling, not a single global scaler.

SciPy Functions Reference

These SciPy functions are used internally by NIRS4ALL transformers:

Function

Purpose

Used By

scipy.signal.savgol_filter

Smoothing + derivatives

SavitzkyGolay

scipy.signal.detrend

Linear trend removal

Detrend

scipy.ndimage.gaussian_filter1d

Gaussian smoothing

Gaussian

scipy.interpolate.interp1d

1D interpolation

ResampleTransformer

scipy.interpolate.UnivariateSpline

Spline smoothing

Spline augmenters

Best Practices

Tip

Golden Rules for NIRS Preprocessing

  1. Always fit preprocessing inside CV to avoid data leakage

  2. Start simple: SNV + SavGol(deriv=1) is a strong baseline

  3. Don’t over-preprocess: More steps ≠ better results

  4. Validate your choices: Compare RMSE across preprocessing variants

  5. Match preprocessing to model: Trees don’t need scaling; SVMs do

See Also