# Preprocessing Overview

This guide covers spectral preprocessing techniques for NIRS data. Preprocessing is critical for NIRS analysis because it removes artifacts, reduces noise, and enhances the spectral features relevant to the target property.

## Why Preprocessing Matters

Near-infrared spectroscopy (NIRS) data requires specialized preprocessing to:

- Remove scatter effects from particle size, surface roughness, or path length variations (MSC, SNV)
- Eliminate baseline drift and offset (detrending, derivatives)
- Reduce noise while preserving spectral features (Savitzky–Golay, Gaussian smoothing)
- Normalize intensity variations across instruments or sessions
- Enhance absorption bands through derivatives or wavelet transforms

## Common NIRS Preprocessing Workflow

A typical preprocessing workflow follows this order:

```
1. SCATTER CORRECTION (MSC or SNV)
        ↓
2. SMOOTHING (Savitzky–Golay or Gaussian)
        ↓
3. BASELINE CORRECTION (detrending or derivatives)
        ↓
4. SCALING (StandardScaler or normalization)
```

:::{tip}
The order matters! Apply scatter correction and smoothing before derivatives, because differentiation amplifies noise.
:::

## Quick Example

```python
from nirs4all.operators.transforms import SNV, SavitzkyGolay
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

pipeline = [
    SNV(),                                     # Scatter correction
    SavitzkyGolay(window_length=15, deriv=1),  # First derivative with smoothing
    StandardScaler(),                          # Feature scaling
    {"model": PLSRegression(n_components=10)},
]
```

## Available Operators

### NIRS4ALL Transformers

All NIRS4ALL transformers follow the sklearn TransformerMixin pattern and can be used in pipelines.

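As a minimal sketch of that pattern, the class below mirrors the `fit`/`transform`/`fit_transform` interface with a row-wise standardization in the style of SNV. It uses only NumPy, the name `RowStandardizer` is hypothetical, and it is not the NIRS4ALL implementation.

```python
import numpy as np


class RowStandardizer:
    """Hypothetical SNV-style transformer exposing fit/transform/fit_transform."""

    def fit(self, X, y=None):
        # Row-wise standardization is stateless: nothing is learned from X.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        mean = X.mean(axis=1, keepdims=True)  # one mean per spectrum
        std = X.std(axis=1, keepdims=True)    # one std per spectrum
        std[std == 0] = 1.0                   # avoid dividing flat spectra by zero
        return (X - mean) / std

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Two synthetic spectra: the second is the first with a gain and an offset.
base = np.array([0.10, 0.30, 0.50, 0.20])
spectra = np.vstack([base, 2.0 * base + 0.4])
corrected = RowStandardizer().fit_transform(spectra)
```

Because the two synthetic spectra differ only by a per-row gain and offset, both rows of `corrected` come out identical, which is the kind of multiplicative and additive effect that row-wise scatter correction is meant to remove.
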
#### Scatter Correction

| Operator | Description | Typical Use |
|----------|-------------|-------------|
| `SNV` (StandardNormalVariate) | Row-wise normalization: subtract mean, divide by std | General scatter correction |
| `MSC` (MultiplicativeScatterCorrection) | Reference-based correction for multiplicative effects | When reference spectrum available |
| `RSNV` (RobustStandardNormalVariate) | Outlier-resistant SNV variant | Noisy/heterogeneous samples |
| `LSNV` (LocalStandardNormalVariate) | Window-based local SNV | Heterogeneous materials |

#### Derivatives

| Operator | Description | Typical Use |
|----------|-------------|-------------|
| `FirstDerivative` | Numerical 1st derivative along wavelengths | Remove baseline offset |
| `SecondDerivative` | Numerical 2nd derivative along wavelengths | Resolve overlapping peaks |
| `SavitzkyGolay` | Smoothing + optional derivative | Most common derivative method |

:::{warning}
**Axis convention**: `FirstDerivative` and `SecondDerivative` operate along **axis=1** (wavelengths), which is correct for NIRS. The legacy `Derivate` class uses axis=0 (samples) and should be avoided.
:::

#### Smoothing

| Operator | Description | Parameters |
|----------|-------------|------------|
| `SavitzkyGolay` | Polynomial smoothing filter | `window_length=11, polyorder=3, deriv=0` |
| `Gaussian` | Gaussian filter | `sigma=1, order=2` |

#### Baseline & Normalization

| Operator | Description | Typical Use |
|----------|-------------|-------------|
| `Detrend` | Remove linear/polynomial baseline | Baseline slope removal |
| `Baseline` | Subtract per-feature mean | Simple offset removal |
| `Normalize` | Scale to range or L2 norm | Intensity standardization |
| `SimpleScale` | Min-max scaling per feature | Bounded [0,1] range |
| `LogTransform` | Logarithmic transformation | Convert reflectance to pseudo-absorbance |

#### Wavelets & Advanced

| Operator | Description | Typical Use |
|----------|-------------|-------------|
| `Wavelet` | Discrete wavelet transform | Denoising, multi-resolution features |
| `Haar` | Haar wavelet (shortcut) | Quick wavelet denoising |
| `CropTransformer` | Select wavelength range | Region selection |
| `ResampleTransformer` | Resample to fixed size | Standardize different instruments |

### Sklearn-Compatible Transformers

These standard sklearn transformers work seamlessly in NIRS4ALL pipelines:

| Transformer | Purpose | NIRS Context |
|-------------|---------|--------------|
| `StandardScaler` | Zero mean, unit variance | Use after scatter correction |
| `RobustScaler` | Median/IQR scaling | Robust to outliers |
| `MinMaxScaler` | Scale to [0, 1] | Bounded models |
| `PCA` | Dimensionality reduction | NIRS spectra are highly collinear |
| `FunctionTransformer` | Wrap custom functions | Quick custom preprocessing |

## Recommended Preprocessing by Model Type

### Classical ML (sklearn)

| Model | Recommended Preprocessing | Avoid |
|-------|--------------------------|-------|
| **PLS** | Mean-centering, SNV/MSC, 1st derivative | Over-aggressive 2nd derivative |
| **PCR** | Center + autoscale, SG smoothing | Raw uncorrected scatter |
| **SVM/SVR** | Standardization, SNV/MSC, SG + 1st deriv | No scaling |
| **Random Forest** | SG smoothing, SNV/MSC | Per-feature standardization |
| **k-NN** | Per-feature scaling or SNV, band selection | Raw unscaled spectra |

### Neural Networks

| Model | Recommended Preprocessing | Avoid |
|-------|--------------------------|-------|
| **MLP** | Standardization, mean-centering/SNV, SG smoothing | Raw unscaled spectra |
| **1D CNN** | Input scaling, SNV/MSC, 1st derivative optional | Over-smoothed spectra |
| **Transformers** | Standardization, SNV/MSC, patch/bin tokens | Very long sequences without reduction |

## Preprocessing Order Rules

### Correct Order

```python
# Assumes these operators are exported from the same module as in the Quick Example
from nirs4all.operators.transforms import (
    MSC, SNV, Detrend, FirstDerivative, SavitzkyGolay,
)

# ✅ CORRECT: scatter → smooth → derivative
pipeline = [
    MSC(),                            # 1. Scatter correction first
    SavitzkyGolay(window_length=15),  # 2. Smoothing
    FirstDerivative(),                # 3. Derivative last
]

# ✅ CORRECT: SavGol with built-in derivative
pipeline = [
    SNV(),
    SavitzkyGolay(window_length=15, deriv=1),  # Smoothing + derivative
]

# ✅ CORRECT: Detrend before scatter correction
pipeline = [
    Detrend(),
    SNV(),
    SavitzkyGolay(window_length=15),
]
```

### Incorrect Orders (Avoid)

```python
# ❌ WRONG: Don't combine two scatter corrections
pipeline = [SNV(), MSC()]  # Redundant

# ❌ WRONG: Don't smooth after SavGol derivative
pipeline = [
    SavitzkyGolay(deriv=1),
    SavitzkyGolay(),  # SG already includes smoothing
]

# ❌ WRONG: Derivative before scatter correction
pipeline = [
    FirstDerivative(),  # Amplifies scatter artifacts
    MSC(),
]
```

## Task-Specific Preprocessing

### Protein/Nitrogen Prediction (N-H bonds, 2000-2200 nm)

```python
pipeline = [
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),  # Critical for protein
    StandardScaler(),
]
```

### Moisture Prediction (O-H bonds, 1400-1500 nm, 1900-2000 nm)

```python
pipeline = [
    MSC(),
    SavitzkyGolay(window_length=15),  # Smooth, avoid derivatives in high absorption
    RSNV(),                           # Robust SNV for heterogeneous moisture samples
]
```

### Fat/Oil Prediction (C-H bonds, 1700-1800 nm, 2300-2400 nm)

```python
pipeline = [
    MSC(),  # Important for fat scatter
    SNV(),
    SavitzkyGolay(window_length=21, deriv=1),
]
```

## Multi-Layer Preprocessing for Deep Learning

When training neural networks, multiple preprocessing "views" can be stacked as channels:

```python
# Create multiple preprocessing branches
pipeline = [
    {"branch": [
        [SNV(), SavitzkyGolay()],     # Channel 1: SNV + smooth
        [MSC(), FirstDerivative()],   # Channel 2: MSC + 1st deriv
        [SNV(), SecondDerivative()],  # Channel 3: SNV + 2nd deriv
        [Wavelet('db4')],             # Channel 4: Wavelet features
    ]},
    {"merge": "features"},            # Stack as multi-channel input
    {"model": CNN1D()}
]
```

:::{note}
After concatenating multi-layer preprocessing, apply **per-channel** scaling, not a single global scaler.
:::

## SciPy Functions Reference

These SciPy functions are used internally by NIRS4ALL transformers:

| Function | Purpose | Used By |
|----------|---------|---------|
| `scipy.signal.savgol_filter` | Smoothing + derivatives | `SavitzkyGolay` |
| `scipy.signal.detrend` | Linear trend removal | `Detrend` |
| `scipy.ndimage.gaussian_filter1d` | Gaussian smoothing | `Gaussian` |
| `scipy.interpolate.interp1d` | 1D interpolation | `ResampleTransformer` |
| `scipy.interpolate.UnivariateSpline` | Spline smoothing | Spline augmenters |

## Best Practices

:::{tip}
**Golden Rules for NIRS Preprocessing**

1. **Always fit preprocessing inside CV** to avoid data leakage
2. **Start simple**: SNV + SavGol(deriv=1) is a strong baseline
3. **Don't over-preprocess**: More steps ≠ better results
4. **Validate your choices**: Compare RMSE across preprocessing variants
5. **Match preprocessing to model**: Trees don't need scaling; SVMs do
:::

## See Also

- {doc}`cheatsheet` - Quick reference by model type
- {doc}`handbook` - In-depth theory and advanced techniques
- {doc}`/reference/operator_catalog` - Complete operator reference
- {doc}`snv` - Detailed SNV documentation
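
The first golden rule above, fitting preprocessing inside CV, can be sketched with a plain NumPy version of MSC whose reference spectrum is the mean of the training fold. The helper names `fit_msc_reference` and `apply_msc` are hypothetical, the data is synthetic, and this is an illustration under those assumptions rather than the NIRS4ALL implementation.

```python
import numpy as np


def fit_msc_reference(X_train):
    """Learn the MSC reference spectrum from training spectra only."""
    return np.asarray(X_train, dtype=float).mean(axis=0)


def apply_msc(X, reference):
    """Regress each spectrum on the reference, then remove offset and slope."""
    X = np.asarray(X, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        slope, intercept = np.polyfit(reference, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected


# Synthetic data: 20 copies of one spectrum with random gain and offset.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0.0, 3.0, 50)) + 1.5
gains = rng.uniform(0.8, 1.2, size=(20, 1))
offsets = rng.uniform(-0.1, 0.1, size=(20, 1))
X = gains * base + offsets

folds = np.array_split(np.arange(len(X)), 4)  # 4 contiguous folds
for k, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    reference = fit_msc_reference(X[train_idx])  # fitted on the training fold only
    X_train = apply_msc(X[train_idx], reference)
    X_val = apply_msc(X[val_idx], reference)     # validation reuses that reference
```

The validation spectra never contribute to `reference`, so the fold's score cannot benefit from information leaked out of the held-out samples.
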