Preprocessing Handbook

In-depth guide to NIRS preprocessing methods with theory, best practices, and advanced techniques.

Philosophy: Multi-Layer Preprocessing

Modern NIRS analysis benefits from providing multiple “views” of the spectral data. Each preprocessing layer should provide a different view of the chemical information:

  • Minimize redundancy between layers

  • Maximize complementary information

  • Order matters in sequential preprocessing

Preprocessing Categories

1. Scatter Correction

Scatter effects arise from particle size variations, surface roughness, and path length differences. They add baseline offsets and multiplicative scaling to the spectra and should be corrected before most analyses.

Standard Normal Variate (SNV)

from nirs4all.operators.transforms import SNV

# Row-wise normalization
snv = SNV(copy=True)
X_corrected = snv.fit_transform(X)

How it works: Each spectrum is centered on its own mean and divided by its own standard deviation.

$$X_{SNV}^{(i)} = \frac{X^{(i)} - \bar{X}^{(i)}}{\sigma^{(i)}}$$

Pros: Simple, no reference needed, fast

Cons: Sensitive to outliers, may distort relative peak heights
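The SNV formula above can be reproduced in a few lines of NumPy. This is a minimal sketch, not the `nirs4all` implementation: the helper name `snv` and the use of the population standard deviation (`ddof=0`) are assumptions made for illustration.

```python
import numpy as np

def snv(X):
    """Row-wise SNV: centre and scale each spectrum by its own statistics."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)  # population std (ddof=0), an assumption
    return (X - mean) / std

# Three synthetic "spectra": one shape with different offsets and gains
shape = np.sin(np.linspace(0.0, 3.0, 50))
X_demo = np.stack([0.1 + 1.0 * shape, 0.5 + 2.0 * shape, -0.2 + 0.5 * shape])
X_demo_snv = snv(X_demo)
```

Because the three rows differ only by an offset and a positive gain, they collapse onto the same curve after SNV, which is the scatter-removal behaviour the method is used for.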

Multiplicative Scatter Correction (MSC)

from nirs4all.operators.transforms import MSC

msc = MSC(scale=True, copy=True)
X_corrected = msc.fit_transform(X)

How it works: Regresses each spectrum against a reference (typically the mean spectrum).

Pros: Better preserves absolute intensities, reference-based

Cons: Requires representative reference, computationally heavier
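The regression step can be sketched as follows, assuming the mean spectrum as reference and an ordinary least-squares line per spectrum. The helper name `msc` and its return values are hypothetical and are not the library's API.

```python
import numpy as np

def msc(X, reference=None):
    """Fit x_i = intercept + slope * reference, then invert that fit."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected, ref

# Synthetic spectra that differ only by offset and gain
band = np.sin(np.linspace(0.0, 3.0, 50))
X_scatter = np.stack([0.1 + 1.0 * band, 0.5 + 2.0 * band, -0.2 + 0.5 * band])
X_msc, ref_spectrum = msc(X_scatter)
```

In this idealized case every corrected spectrum lands exactly on the reference. With real data the residual around the fitted line carries the chemical information.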

Robust/Local Variants

| Variant | Use Case | Advantage |
|---------|----------|-----------|
| RSNV | Noisy data | Outlier-resistant (uses robust statistics) |
| LSNV | Heterogeneous samples | Window-based local correction |
| EMSC | Complex scatter | Extended MSC with polynomial terms |

2. Smoothing

Smoothing reduces high-frequency noise while preserving spectral features.

Savitzky-Golay Filter

The workhorse of NIRS preprocessing:

from nirs4all.operators.transforms import SavitzkyGolay

# Smoothing only
sg_smooth = SavitzkyGolay(window_length=15, polyorder=3, deriv=0)

# First derivative with smoothing
sg_d1 = SavitzkyGolay(window_length=15, polyorder=3, deriv=1)

# Second derivative
sg_d2 = SavitzkyGolay(window_length=21, polyorder=3, deriv=2)

Parameter Guidelines:

| Parameter | Typical Range | Effect of Increase |
|-----------|---------------|--------------------|
| window_length | 11-25 | More smoothing, less noise |
| polyorder | 2-4 | Better peak preservation |
| deriv | 0-2 | Higher order derivatives |

Tip

For derivatives, use longer windows to reduce noise amplification:

  • 1st derivative: window 11-17

  • 2nd derivative: window 21-31
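The parameters above map directly onto `scipy.signal.savgol_filter`, which is used here as a stand-in. Whether the `SavitzkyGolay` operator wraps that function is an assumption. The sketch uses a cubic test signal, which a `polyorder=3` filter reproduces exactly, so both the smoothed output and the derivative can be checked analytically.

```python
import numpy as np
from scipy.signal import savgol_filter

wavelengths = np.linspace(1100.0, 2500.0, 141)  # synthetic axis, 10 nm spacing
step = wavelengths[1] - wavelengths[0]
u = (wavelengths - 1100.0) / 1400.0             # rescaled axis in [0, 1]
X_sg = (u ** 3)[np.newaxis, :]                  # one cubic "spectrum", shape (1, 141)

# deriv=0 smooths only; deriv=1 returns the derivative per unit of `delta`
smoothed = savgol_filter(X_sg, window_length=15, polyorder=3, deriv=0, axis=1)
d1 = savgol_filter(X_sg, window_length=15, polyorder=3, deriv=1, delta=step, axis=1)
```

Passing `delta` equal to the wavelength spacing expresses the derivative per nanometre. With the default `delta=1.0` it is expressed per index.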

Gaussian Smoothing

from nirs4all.operators.transforms import Gaussian

gauss = Gaussian(sigma=2, order=0)
X_smooth = gauss.fit_transform(X)

Simpler than SG, good for quick smoothing without derivatives.
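For comparison, the same kind of smoothing can be tried with `scipy.ndimage.gaussian_filter1d`. This is a sketch on synthetic data. That the `Gaussian` operator's `sigma` and `order` carry the same meaning as in SciPy is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
clean = np.exp(-0.5 * ((np.arange(200) - 100) / 15.0) ** 2)  # one broad band
noisy = clean + rng.normal(scale=0.05, size=(4, 200))        # four noisy copies

# order=0 is plain smoothing; sigma is given in index units along the wavelength axis
smooth = gaussian_filter1d(noisy, sigma=2, axis=1, order=0)
```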

3. Baseline Correction

Detrending

Removes linear or polynomial baselines:

from nirs4all.operators.transforms import Detrend

# Linear detrend
detrend = Detrend(bp=0)

# Piecewise with breakpoints
detrend_pw = Detrend(bp=[100, 200])  # Breakpoints at indices
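The `bp` argument follows the convention of `scipy.signal.detrend`, which is used below as a stand-in (an assumption about what `Detrend` does internally). With breakpoints, SciPy fits a separate line in each segment.

```python
import numpy as np
from scipy.signal import detrend

idx = np.arange(300, dtype=float)
band = np.exp(-0.5 * ((idx - 150.0) / 10.0) ** 2)
baseline = 0.2 + 0.003 * idx
X_tilted = np.stack([baseline, band + baseline])  # row 0 is pure baseline

X_lin = detrend(X_tilted, axis=1, type="linear")                # one line per spectrum
X_pw = detrend(X_tilted, axis=1, type="linear", bp=[100, 200])  # one line per segment
```

Row 0 contains only the baseline, so it is reduced to zero in both cases. In row 1 the fitted line is also pulled slightly by the absorption band, which is why detrending wide bands deserves a visual check.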

First and Second Derivatives

Derivatives inherently remove baseline components:

  • 1st derivative: Removes constant offset (a linear slope becomes a constant offset)

  • 2nd derivative: Removes constant offset and linear slope (a quadratic baseline becomes a constant offset)

from nirs4all.operators.transforms import FirstDerivative, SecondDerivative

d1 = FirstDerivative(delta=1.0, edge_order=2)
d2 = SecondDerivative(delta=1.0, edge_order=2)
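What each derivative order removes can be checked numerically. The sketch below uses `np.gradient` as a stand-in. That `delta` and `edge_order` mean the same thing in `FirstDerivative` and `SecondDerivative` is an assumption, and the helper `derivative` is hypothetical.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)
dx = x[1] - x[0]
peak = np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)  # synthetic absorption band

def derivative(y, order):
    """Apply finite differences `order` times along the spectrum."""
    for _ in range(order):
        y = np.gradient(y, dx, edge_order=2)
    return y

with_offset = peak + 0.3            # constant offset added
with_slope = peak + 0.3 + 0.8 * x   # offset plus linear slope added

offset_gone = np.allclose(derivative(with_offset, 1), derivative(peak, 1))
slope_left_by_d1 = derivative(with_slope, 1) - derivative(peak, 1)  # constant 0.8
slope_gone = np.allclose(derivative(with_slope, 2), derivative(peak, 2))
```

The first derivative cancels the offset but turns the slope into a constant shift of 0.8. The second derivative cancels both.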

4. Wavelets

Wavelet transforms provide multi-resolution analysis:

from nirs4all.operators.transforms import Wavelet

# Common wavelets for NIRS
wavelet_db4 = Wavelet(wavelet='db4')    # Good for peaks
wavelet_sym5 = Wavelet(wavelet='sym5')  # Good for baselines
wavelet_haar = Wavelet(wavelet='haar')  # Sharp features

When to use wavelets:

  • Denoising with thresholding

  • Multi-resolution feature extraction

  • Capturing both local and global features
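Denoising with thresholding can be illustrated with a hand-rolled one-level Haar transform in NumPy. This is a teaching sketch only: the helpers `haar_dwt`, `haar_idwt`, and `soft_threshold` are hypothetical, require an even number of points, and are not the `Wavelet` operator.

```python
import numpy as np

def haar_dwt(x):
    """One-level orthonormal Haar transform (even-length input)."""
    x = np.asarray(x, dtype=float)
    even, odd = x[..., 0::2], x[..., 1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even/odd samples."""
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
    out[..., 0::2] = (approx + detail) / np.sqrt(2)
    out[..., 1::2] = (approx - detail) / np.sqrt(2)
    return out

def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; small ones become exactly zero."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

rng = np.random.default_rng(1)
signal = np.repeat([0.0, 1.0, 0.3, 0.8], 16)  # piecewise-constant, length 64
noisy_signal = signal + rng.normal(scale=0.05, size=signal.size)

approx, detail = haar_dwt(noisy_signal)
denoised = haar_idwt(approx, soft_threshold(detail, t=0.15))
```

The approximation coefficients hold the low-frequency view and the detail coefficients the high-frequency view, which is the multi-resolution split the list above refers to.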

5. Scaling

Feature-wise Scaling (per wavelength)

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# Zero mean, unit variance
scaler = StandardScaler()

# Robust to outliers
scaler = RobustScaler()

# Bounded range
scaler = MinMaxScaler(feature_range=(0, 1))

Sample-wise Scaling

SNV already scales each sample; Normalize is an alternative:

from nirs4all.operators.transforms import Normalize

# Vector normalization (L2 norm)
norm = Normalize(feature_range=(-1, 1))
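The difference between the two scaling directions comes down to the axis along which statistics are computed. A minimal NumPy sketch on synthetic data. It does not claim to reproduce what `Normalize` computes internally.

```python
import numpy as np

rng = np.random.default_rng(2)
X_raw = rng.normal(loc=1.0, scale=0.2, size=(6, 40))  # (samples, wavelengths)

# Feature-wise: one mean/std per wavelength column (axis=0), as StandardScaler does
X_feat = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Sample-wise: one L2 norm per spectrum (axis=1)
X_samp = X_raw / np.linalg.norm(X_raw, axis=1, keepdims=True)
```

Feature-wise statistics must be learned on training data and reused on new data. Sample-wise scaling depends only on the spectrum itself.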

Preprocessing Order

General Order

1. Convert to absorbance (if reflectance)
2. Detrend (if needed before scatter correction)
3. MSC/SNV (scatter correction)
4. Smoothing (SavGol, Gaussian)
5. Derivatives (if using)
6. Scaling (StandardScaler, RobustScaler)

Common Correct Chains

# Chain 1: Standard NIRS preprocessing
[MSC(), SavitzkyGolay(deriv=1), StandardScaler()]

# Chain 2: SNV-Detrend combo
[SNV(), Detrend()]  # Detrend AFTER SNV

# Chain 3: Smooth then derivative
[SavitzkyGolay(deriv=0), FirstDerivative()]

# Chain 4: Full pipeline
[Detrend(), MSC(), SavitzkyGolay(deriv=1), RobustScaler()]
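One way to run such a chain is a scikit-learn `Pipeline`. The sketch below rebuilds an SNV → SG first derivative → StandardScaler chain from SciPy and scikit-learn pieces only. The wrappers `snv_rows` and `sg_first_derivative` are hypothetical stand-ins for the `nirs4all` operators, and the data are synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def snv_rows(X):
    """Stand-in scatter correction: row-wise centring and scaling."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def sg_first_derivative(X):
    """Stand-in for SavitzkyGolay(deriv=1)."""
    return savgol_filter(X, window_length=15, polyorder=3, deriv=1, axis=1)

chain = make_pipeline(
    FunctionTransformer(snv_rows),             # 1. scatter correction
    FunctionTransformer(sg_first_derivative),  # 2. smoothing + derivative
    StandardScaler(),                          # 3. feature-wise scaling, fitted on train only
)

rng = np.random.default_rng(3)
X_train = rng.normal(size=(20, 100)).cumsum(axis=1)  # synthetic smooth-ish spectra
X_new = rng.normal(size=(5, 100)).cumsum(axis=1)

chain.fit(X_train)
X_new_pp = chain.transform(X_new)
```

Only the last step learns anything from the data, so fitting on the training set and transforming new samples keeps the scaling statistics from leaking.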

Order Violations to Avoid

# ❌ Two scatter corrections
[SNV(), MSC()]  # Redundant!

# ❌ Smooth after SG derivative
[SavitzkyGolay(deriv=1), SavitzkyGolay(deriv=0)]  # SG already smooths

# ❌ Derivative before scatter correction
[FirstDerivative(), MSC()]  # Amplifies scatter artifacts

# ❌ Detrend after SG with derivative
[SavitzkyGolay(deriv=1), Detrend()]  # Derivative already removes the offset and flattens a linear slope to a constant

Multi-Layer Preprocessing for Deep Learning

Optimal 8-10 Channels

For neural networks, stack multiple preprocessing views as channels:

preprocessing_layers = [
    # Layer 1: Raw baseline information
    None,  # Pass-through

    # Layer 2: Scatter-corrected baseline
    [MSC()],

    # Layer 3: Normalized + smoothed
    [SNV(), SavitzkyGolay()],

    # Layer 4: First derivative (critical!)
    [MSC(), FirstDerivative()],

    # Layer 5: Second derivative
    [SNV(), SecondDerivative()],

    # Layer 6: SNV-detrend combo
    [SNV(), Detrend()],

    # Layer 7: Wavelet high-frequency
    [Wavelet('db4')],

    # Layer 8: Wavelet low-frequency
    [Wavelet('sym5')],
]

Minimal 5 Channels

minimal_layers = [
    None,                               # Raw
    [MSC(), SavitzkyGolay()],           # Standard
    [SNV(), FirstDerivative()],         # 1st derivative
    [MSC(), SecondDerivative()],        # 2nd derivative
    [Wavelet('db6')],                   # Wavelet
]
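How such a list of views turns into a network input can be sketched with NumPy and SciPy stand-ins. The function `build_channels`, its choice of four views, and the helper `snv_rows` are illustrative assumptions, not the library's stacking mechanism.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv_rows(X):
    """Stand-in scatter correction: row-wise centring and scaling."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def build_channels(X):
    """Stack several preprocessing views along a trailing channel axis."""
    X = np.asarray(X, dtype=float)
    X_snv = snv_rows(X)
    views = [
        X,                                                # raw pass-through
        X_snv,                                            # scatter-corrected
        savgol_filter(X_snv, 15, 3, deriv=1, axis=1),     # first derivative
        savgol_filter(X_snv, 21, 3, deriv=2, axis=1),     # second derivative
    ]
    return np.stack(views, axis=-1)                       # (samples, wavelengths, channels)

rng = np.random.default_rng(6)
X_spectra = rng.normal(size=(8, 60)).cumsum(axis=1)       # synthetic spectra
tensor = build_channels(X_spectra)
```

Every view keeps the original number of wavelengths, which is what allows stacking along a new axis. Transforms that change the length would need resampling or padding first.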

Per-Channel Scaling

Warning

After concatenating multi-layer preprocessing, apply per-channel scaling:

# ❌ WRONG: Global scaler on all channels
all_channels = np.concatenate([ch1, ch2, ch3], axis=1)
scaler.fit_transform(all_channels)

# ✅ CORRECT: Scale each channel separately
for i, channel in enumerate(channels):
    channels[i] = StandardScaler().fit_transform(channel)
final = np.stack(channels, axis=-1)  # (samples, wavelengths, channels)
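`StandardScaler` standardizes every wavelength column on its own. If the aim is instead to bring channels to a common scale while keeping each channel's spectral shape, one option is a single mean and standard deviation per channel, estimated on the training tensor only. The helpers below are a hypothetical sketch of that variant.

```python
import numpy as np

def fit_channel_stats(T_train):
    """One mean/std per channel, pooled over samples and wavelengths."""
    mean = T_train.mean(axis=(0, 1), keepdims=True)
    std = T_train.std(axis=(0, 1), keepdims=True)
    return mean, std

def apply_channel_stats(T, mean, std):
    """Scale any tensor with statistics learned on the training tensor."""
    return (T - mean) / std

rng = np.random.default_rng(7)
channel_scales = np.array([1.0, 0.01, 100.0])  # e.g. raw, derivative, wavelet
T_train = rng.normal(size=(8, 50, 3)) * channel_scales
T_test = rng.normal(size=(4, 50, 3)) * channel_scales

ch_mean, ch_std = fit_channel_stats(T_train)
T_train_scaled = apply_channel_stats(T_train, ch_mean, ch_std)
T_test_scaled = apply_channel_stats(T_test, ch_mean, ch_std)
```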

Task-Specific Preprocessing

Protein/Nitrogen (N-H bonds: 2000-2200nm)

protein_preprocessing = [
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),  # Critical for protein
    StandardScaler(),
]

Key considerations:

  • 1st derivative enhances amide bands

  • 2nd derivative resolves overlapping peaks

  • Consider Wavelet('coif3') for additional features

Moisture (O-H bonds: 1400-1500nm, 1900-2000nm)

moisture_preprocessing = [
    MSC(),
    SavitzkyGolay(window_length=15),  # Smooth only in high absorption
    RSNV(),  # Robust SNV (outlier-resistant); stacks on MSC, so check for redundancy
]

Key considerations:

  • Water peaks are strong, so derivatives may not be needed

  • Use caution near saturation regions

  • Haar wavelet captures sharp water absorption edges

Fat/Oil (C-H bonds: 1700-1800nm, 2300-2400nm)

fat_preprocessing = [
    MSC(),  # Important for fat scatter
    SNV(),  # Second scatter correction; redundant with MSC unless validation shows a gain
    SavitzkyGolay(window_length=21, deriv=1),
]

Key considerations:

  • Fat scatter is significant, so MSC helps

  • Area normalization can be useful

  • Wavelet('db8') captures smooth fat peaks

Cellulose/Lignin (Plant matrices)

cellulose_preprocessing = [
    Detrend(),
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),
    # Consider region-specific: 1600-1800nm, 2100-2350nm
]

Advanced Techniques

Region-Specific Preprocessing

Different spectral regions may benefit from different preprocessing:

def region_specific_preprocessing(X, wavelengths):
    """Apply different preprocessing to different regions."""

    # Region 1: 1100-1400nm (C-H, good SNR)
    mask1 = wavelengths < 1400
    X1 = SNV().fit_transform(X[:, mask1])
    X1 = FirstDerivative().fit_transform(X1)

    # Region 2: 1400-1600nm (water, high absorption)
    mask2 = (wavelengths >= 1400) & (wavelengths < 1600)
    X2 = MSC().fit_transform(X[:, mask2])
    X2 = SavitzkyGolay().fit_transform(X2)  # Smooth only

    # Region 3: 1600-2400nm (protein, fat, lower SNR)
    mask3 = wavelengths >= 1600
    X3 = RSNV().fit_transform(X[:, mask3])
    X3 = SecondDerivative().fit_transform(X3)

    return np.concatenate([X1, X2, X3], axis=1)

Instrument Transfer

When transferring calibrations between instruments:

transfer_preprocessing = [
    # Option A: EMSC with reference from new instrument
    EMSC(reference_spectrum=new_instrument_reference),
    # Option B (use instead of EMSC): PDS (Piecewise Direct Standardization)
    # PDS(transfer_samples=transfer_set),
    # Standard preprocessing
    SavitzkyGolay(deriv=1),
    StandardScaler(),
]
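To show what a standardization step learns from a transfer set, here is a sketch of plain Direct Standardization (DS), a simpler relative of PDS that fits one global linear map instead of local windows. It is not PDS itself, and the function names are hypothetical.

```python
import numpy as np

def fit_direct_standardization(X_secondary, X_primary):
    """Least-squares map (with intercept) from secondary to primary spectra."""
    A = np.hstack([X_secondary, np.ones((X_secondary.shape[0], 1))])
    F, *_ = np.linalg.lstsq(A, X_primary, rcond=None)
    return F

def apply_direct_standardization(X_secondary, F):
    """Project secondary-instrument spectra into the primary instrument's space."""
    A = np.hstack([X_secondary, np.ones((X_secondary.shape[0], 1))])
    return A @ F

# Synthetic transfer set: the secondary instrument has a gain and an offset
rng = np.random.default_rng(4)
X_primary = rng.normal(size=(30, 5))
X_secondary = 1.1 * X_primary + 0.02

F = fit_direct_standardization(X_secondary, X_primary)
X_mapped = apply_direct_standardization(X_secondary, F)
```

The toy case has more transfer samples than wavelengths, so the map is fully determined. Real spectra usually have far more wavelengths than transfer samples, which is the situation the piecewise (windowed) formulation of PDS addresses.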

Validation & Quality Control

Check Your Preprocessing

Use this function to validate preprocessing choices:

def validate_preprocessing_layers(layers):
    """Check for common preprocessing mistakes."""
    issues = []

    # Check for scatter correction redundancy
    scatter_methods = ['SNV', 'MSC', 'LSNV', 'RSNV']
    for layer in layers:
        if isinstance(layer, list):
            scatter_count = sum(
                any(s in str(p) for s in scatter_methods)
                for p in layer
            )
            if scatter_count > 1:
                issues.append("⚠️ Multiple scatter corrections in same pipeline")

    # Check for derivative coverage
    all_layers_str = str(layers)
    if 'FirstDeriv' not in all_layers_str and 'deriv=1' not in all_layers_str:
        issues.append("❌ Missing first derivative (critical for NIRS)")

    return issues
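A variant of the scatter check that matches on exact class names rather than substrings of `str(p)` is sketched below. The empty classes are placeholders standing in for the real operators so that the example runs on its own.

```python
# Placeholder classes standing in for the real operators (assumption)
class SNV: ...
class MSC: ...
class FirstDerivative: ...

SCATTER_NAMES = {"SNV", "MSC", "LSNV", "RSNV"}

def count_scatter_steps(layer):
    """Count steps in one layer whose class name is a scatter correction."""
    return sum(type(step).__name__ in SCATTER_NAMES for step in (layer or []))

layers = [
    None,                        # raw pass-through
    [SNV(), MSC()],              # two scatter corrections: should be flagged
    [MSC(), FirstDerivative()],  # fine
]
flagged = [i for i, layer in enumerate(layers) if count_scatter_steps(layer) > 1]
```

Reporting the index of the offending layer makes the warning easier to act on when many layers are defined.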

Metrics to Monitor

| Metric | What it Tells You |
|--------|-------------------|
| RMSECV | Cross-validation error (primary metric) |
| R² | Explained variance |
| Bias | Systematic offset |
| RPD | Ratio of Performance to Deviation |
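These metrics can be computed from a vector of reference values and predictions. In the sketch below, bias is taken as the mean of (prediction − reference) and RPD as the sample standard deviation of the reference divided by the RMSE. Both conventions vary between sources, so treat them as assumptions. RMSECV is the same RMSE computed on cross-validated predictions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, R², bias, and RPD for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "rmse": rmse,
        "r2": 1.0 - ss_res / ss_tot,
        "bias": residuals.mean(),
        "rpd": y_true.std(ddof=1) / rmse,
    }

# Predictions that are systematically 0.5 too high
metrics = regression_metrics([1, 2, 3, 4, 5], [1.5, 2.5, 3.5, 4.5, 5.5])
```

In this toy case the whole error is bias (RMSE and bias are both 0.5), a pattern that points to a calibration offset rather than noise.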

Summary

Key Takeaways

  1. Order matters: Scatter → Smooth → Derivative → Scale

  2. Don’t over-process: Start simple, add complexity only if needed

  3. Match to model: Trees don’t need scaling; neural networks need careful normalization

  4. Validate inside CV: Never fit preprocessing on full dataset

  5. First derivative is critical: Almost always improves PLS/SVM/CNN performance

Quick Decision Tree

Is there scatter/baseline?
  └─ Yes → Apply SNV or MSC
  └─ No → Skip

Is data noisy?
  └─ Yes → Apply SavitzkyGolay smoothing
  └─ No → Skip or light smoothing

Need baseline removal?
  └─ Yes → Apply 1st derivative (removes offset)
  └─ Need peak resolution? → Apply 2nd derivative (removes offset + slope; with caution)

Using a scale-sensitive model (SVM, k-NN, MLP)?
  └─ Yes → Apply StandardScaler or RobustScaler
  └─ No (trees) → Skip scaling
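The decision tree can also be written as a small helper that returns an ordered list of suggested steps. This is one possible reading of the tree: the function name, the flags, and the choice to let the 2nd derivative replace the 1st are all assumptions.

```python
def suggest_steps(has_scatter, is_noisy, needs_baseline_removal,
                  needs_peak_resolution, scale_sensitive_model):
    """Translate the yes/no questions into an ordered list of step names."""
    steps = []
    if has_scatter:
        steps.append("SNV or MSC")
    if is_noisy:
        steps.append("SavitzkyGolay smoothing")
    if needs_peak_resolution:
        steps.append("2nd derivative")
    elif needs_baseline_removal:
        steps.append("1st derivative")
    if scale_sensitive_model:
        steps.append("StandardScaler or RobustScaler")
    return steps

plan = suggest_steps(has_scatter=True, is_noisy=False,
                     needs_baseline_removal=True,
                     needs_peak_resolution=False,
                     scale_sensitive_model=True)
```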
