# Preprocessing Cheatsheet Quick reference for NIR preprocessing selection by model type. ## Classical ML (scikit-learn + libraries) | Model | Task | Works Well | Avoid | |-------|------|-----------|-------| | **PLS (PLS-R, PLS-DA)** | R/C | Mean-centering; SNV/MSC/EMSC; mild SG smoothing; **1st derivative**; detrend; wavelength masking | Over-aggressive **2nd deriv** with short windows; no scatter correction; leaking preprocessing | | **OPLS / Kernel PLS / Local PLS** | R/C | Same as PLS; for Local/LW: **distance-aware scaling** (SNV); for Kernel: global **standardization** | Using raw unscaled spectra for neighbor search; noisy high-order derivatives | | **PCR** | R | Center + (often) autoscale; SG smoothing; baseline removal before PCA; retain PCs via CV | PCA on raw uncorrected scatter; keeping too many PCs; PCA fitted outside CV | | **Linear / Ridge / Lasso / ElasticNet** | R/C | Center + scale; SNV/MSC; mild SG; band selection to reduce collinearity | No scaling; strong noise amplification via derivatives | | **LDA / QDA** | C | Dimension reduction first (PCA/PLS scores); center + scale; SNV/MSC; outlier control | Training directly on thousands of wavelengths; uncorrected batch effects | | **k-NN** | R/C | **Per-feature scaling** or SNV; baseline/scatter correction; smoothed **1st deriv**; band selection | Raw unscaled spectra; high-order noisy derivatives; very high-D | | **SVM / SVR** | R/C | **Standardization**; SNV/MSC; SG + **1st deriv**; band selection or PCA/PLS scores | No scaling; feeding entire noisy spectrum; aggressive 2nd deriv | | **Decision Tree** | R/C | SG smoothing; SNV/MSC if strong scatter; **band/bin selection** | Per-feature standardization; high-order noisy derivatives | | **Random Forest** | R/C | SG smoothing; SNV/MSC helpful; **band/bin selection**; remove obvious noise regions | Standardization per wavelength; over-derivation amplifying noise | | **Gradient Boosting** | R/C | As RF; plus outlier trimming; modest feature reduction; early stopping | Per-feature standardization; noisy 2nd deriv; no denoising | | **XGBoost / LightGBM / CatBoost** | R/C | SG smoothing; SNV/MSC; band/bin selection; remove artifacts; tune regularization | Standardization per wavelength; noisy derivatives | | **TabPFN** | R/C | Minimal scaling needed (internal z-score); **mask noisy/irrelevant bands**; band/bin reduction | Manual re-scaling; feeding artifact regions; extreme dimensionality | **Legend**: R = Regression, C = Classification ## Neural Networks | Model | Task | Works Well | Avoid | |-------|------|-----------|-------| | **MLP** | R/C | **Standardization** or min-max; mean-centering/SNV; baseline removal; SG smoothing; band/PCA/PLS scores | Raw unscaled spectra; high-D collinearity with small N; noisy derivatives | | **1D CNN** | R/C | Input scaling to stable range; SNV/MSC; SG smoothing; **1st deriv** optional; data augmentation | No normalization; over-smoothed spectra; pure 2nd deriv without smoothing | | **RNN (LSTM/GRU)** | R/C | Standardization; mean-centering; baseline removal; moderate smoothing; **downsampling/binning** | Very long raw sequences with noise; unscaled inputs that saturate gates | | **Transformers** | R/C | Standardization; positional encoding; SNV/MSC; denoising or **patch/bin tokens** | Raw baselines/scatter; very long token sequences; no normalization | | **Vision backbones (transfer)** | C/R | Match pretrained **input normalization**; consistent encoding; SNV/MSC before encoding | Mismatch of expected scale; noisy encodings | ## Quick Rules of Thumb ::::{grid} 2 :gutter: 3 :::{grid-item} ### When to Apply What | Condition | Action | |-----------|--------| | Scatter/baseline present | **SNV/MSC/EMSC + detrend** | | Noisy spectra | **SG smoothing**; conservative derivatives | | High-D with small N | **Band/bin selection** or **PCA/PLS scores** | | Model needs scaling | SVM, k-NN, linear, MLP, RNN, Transformers | | Model dislikes scaling | Trees, RF, Boosting | ::: :::{grid-item} ### Derivative Guidelines | Derivative | Best For | Caution | |------------|----------|---------| | **1st derivative** | PLS, SVM, k-NN, CNN | Use with smoothing | | **2nd derivative** | Overlapping peaks | Only with adequate SNR | | **SG derivative** | Most applications | Window 11-21, polyorder 2-3 | ::: :::: ## Preprocessing Chains ### Minimal Robust (3 steps) ```python [SNV(), SavitzkyGolay(window_length=15, deriv=1), StandardScaler()] ``` ### Standard (4 steps) ```python [MSC(), SavitzkyGolay(window_length=17), FirstDerivative(), RobustScaler()] ``` ### Multi-view (for deep learning) ```python [ {"branch": [ [SNV(), SavitzkyGolay()], [MSC(), FirstDerivative()], [SNV(), SecondDerivative()], ]}, {"merge": "features"} ] ``` ## Parameter Recommendations | Operator | Parameter | Default Range | Notes | |----------|-----------|---------------|-------| | **SavitzkyGolay** | `window_length` | 11-21 | Must be odd | | | `polyorder` | 2-3 | Higher = less smoothing | | | `deriv` | 0-2 | 0=smooth, 1=1st deriv | | **FirstDerivative** | `delta` | 1.0 | Wavelength spacing | | **Gaussian** | `sigma` | 1-3 | Higher = more smoothing | | **RSNV/LSNV** | window | 25-75 points | Depends on resolution | | **Wavelet** | level | 3-5 | Decomposition levels | ## Warning Signs :::{warning} **🚨 Preprocessing Anti-patterns** - ❌ **SNV + MSC together** - Redundant scatter correction - ❌ **SavGol then Detrend** - SG already removes trends - ❌ **Derivative before scatter correction** - Amplifies artifacts - ❌ **Very short SG window (< 7)** - Insufficient smoothing for derivatives - ❌ **Global scaler on SNV-normalized channels** - Double normalization - ❌ **Fitting preprocessing on full dataset** - Data leakage! ::: ## See Also - {doc}`overview` - Comprehensive preprocessing guide - {doc}`handbook` - In-depth theory and advanced techniques - {doc}`/reference/operator_catalog` - Complete operator reference