Preprocessing Cheatsheet

Quick reference for NIR preprocessing selection by model type.

Classical ML (scikit-learn + libraries)

Model

Task

Works Well

Avoid

PLS (PLS-R, PLS-DA)

R/C

Mean-centering; SNV/MSC/EMSC; mild SG smoothing; 1st derivative; detrend; wavelength masking

Over-aggressive 2nd deriv with short windows; no scatter correction; leaking preprocessing

OPLS / Kernel PLS / Local PLS

R/C

Same as PLS; for Local/LW: distance-aware scaling (SNV); for Kernel: global standardization

Using raw unscaled spectra for neighbor search; noisy high-order derivatives

PCR

R

Center + (often) autoscale; SG smoothing; baseline removal before PCA; retain PCs via CV

PCA on raw uncorrected scatter; keeping too many PCs; PCA fitted outside CV

Linear / Ridge / Lasso / ElasticNet

R/C

Center + scale; SNV/MSC; mild SG; band selection to reduce collinearity

No scaling; strong noise amplification via derivatives

LDA / QDA

C

Dimension reduction first (PCA/PLS scores); center + scale; SNV/MSC; outlier control

Training directly on thousands of wavelengths; uncorrected batch effects

k-NN

R/C

Per-feature scaling or SNV; baseline/scatter correction; smoothed 1st deriv; band selection

Raw unscaled spectra; high-order noisy derivatives; very high-D

SVM / SVR

R/C

Standardization; SNV/MSC; SG + 1st deriv; band selection or PCA/PLS scores

No scaling; feeding entire noisy spectrum; aggressive 2nd deriv

Decision Tree

R/C

SG smoothing; SNV/MSC if strong scatter; band/bin selection

Per-feature standardization; high-order noisy derivatives

Random Forest

R/C

SG smoothing; SNV/MSC helpful; band/bin selection; remove obvious noise regions

Standardization per wavelength; over-derivation amplifying noise

Gradient Boosting

R/C

As RF; plus outlier trimming; modest feature reduction; early stopping

Per-feature standardization; noisy 2nd deriv; no denoising

XGBoost / LightGBM / CatBoost

R/C

SG smoothing; SNV/MSC; band/bin selection; remove artifacts; tune regularization

Standardization per wavelength; noisy derivatives

TabPFN

R/C

Minimal scaling needed (internal z-score); mask noisy/irrelevant bands; band/bin reduction

Manual re-scaling; feeding artifact regions; extreme dimensionality

Legend: R = Regression, C = Classification

Neural Networks

Model

Task

Works Well

Avoid

MLP

R/C

Standardization or min-max; mean-centering/SNV; baseline removal; SG smoothing; band/PCA/PLS scores

Raw unscaled spectra; high-D collinearity with small N; noisy derivatives

1D CNN

R/C

Input scaling to stable range; SNV/MSC; SG smoothing; 1st deriv optional; data augmentation

No normalization; over-smoothed spectra; pure 2nd deriv without smoothing

RNN (LSTM/GRU)

R/C

Standardization; mean-centering; baseline removal; moderate smoothing; downsampling/binning

Very long raw sequences with noise; unscaled inputs that saturate gates

Transformers

R/C

Standardization; positional encoding; SNV/MSC; denoising or patch/bin tokens

Raw baselines/scatter; very long token sequences; no normalization

Vision backbones (transfer)

C/R

Match pretrained input normalization; consistent encoding; SNV/MSC before encoding

Mismatch of expected scale; noisy encodings

Quick Rules of Thumb

When to Apply What

Condition

Action

Scatter/baseline present

SNV/MSC/EMSC + detrend

Noisy spectra

SG smoothing; conservative derivatives

High-D with small N

Band/bin selection or PCA/PLS scores

Model needs scaling

SVM, k-NN, linear, MLP, RNN, Transformers

Model dislikes scaling

Trees, RF, Boosting

Derivative Guidelines

Derivative

Best For

Caution

1st derivative

PLS, SVM, k-NN, CNN

Use with smoothing

2nd derivative

Overlapping peaks

Only with adequate SNR

SG derivative

Most applications

Window 11-21, polyorder 2-3

Preprocessing Chains

Minimal Robust (3 steps)

[SNV(), SavitzkyGolay(window_length=15, deriv=1), StandardScaler()]

Standard (4 steps)

[MSC(), SavitzkyGolay(window_length=17), FirstDerivative(), RobustScaler()]

Multi-view (for deep learning)

[
    {"branch": [
        [SNV(), SavitzkyGolay()],
        [MSC(), FirstDerivative()],
        [SNV(), SecondDerivative()],
    ]},
    {"merge": "features"}
]

Parameter Recommendations

Operator

Parameter

Default Range

Notes

SavitzkyGolay

window_length

11-21

Must be odd

polyorder

2-3

Higher = less smoothing

deriv

0-2

0=smooth, 1=1st deriv

FirstDerivative

delta

1.0

Wavelength spacing

Gaussian

sigma

1-3

Higher = more smoothing

RSNV/LSNV

window

25-75 points

Depends on resolution

Wavelet

level

3-5

Decomposition levels

Warning Signs

Warning

🚨 Preprocessing Anti-patterns

  • SNV + MSC together - Redundant scatter correction

  • SavGol then Detrend - SG already removes trends

  • Derivative before scatter correction - Amplifies artifacts

  • Very short SG window (< 7) - Insufficient smoothing for derivatives

  • Global scaler on SNV-normalized channels - Double normalization

  • Fitting preprocessing on full dataset - Data leakage!

See Also