Preprocessing Handbook
In-depth guide to NIRS preprocessing methods with theory, best practices, and advanced techniques.
Philosophy: Multi-Layer Preprocessing
Modern NIRS analysis benefits from providing multiple “views” of the spectral data. Each preprocessing layer should provide a different view of the chemical information:
Minimize redundancy between layers
Maximize complementary information
Order matters in sequential preprocessing
Preprocessing Categories
1. Scatter Correction
Scatter effects arise from particle size variations, surface roughness, and path length differences. They appear as additive and multiplicative distortions of the spectrum and should be corrected before most analyses.
Standard Normal Variate (SNV)
from nirs4all.operators.transforms import SNV
# Row-wise normalization
snv = SNV(copy=True)
X_corrected = snv.fit_transform(X)
How it works: For each spectrum, subtract its mean and divide by its standard deviation.
$$X_{SNV}^{(i)} = \frac{X^{(i)} - \bar{X}^{(i)}}{\sigma^{(i)}}$$
Pros: Simple, no reference needed, fast
Cons: Sensitive to outliers, may distort relative peak heights
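The formula above can be sketched in plain NumPy. This is an illustrative stand-in, not the library's `SNV` operator; the helper name `snv` and the `ddof` default are assumptions made for the example.

```python
import numpy as np

def snv(X, ddof=0):
    """Row-wise SNV: center and scale each spectrum by its own statistics."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=ddof, keepdims=True)
    return (X - mean) / std

# Two copies of one synthetic spectrum with different scatter:
# the second has a multiplicative gain and an additive offset.
base = np.sin(np.linspace(0.0, 3.0, 50))
X = np.vstack([base, 1.8 * base + 0.3])
X_snv = snv(X)  # both rows become identical after correction
```

Because the gain and offset are removed per spectrum, no reference spectrum is needed, which is why SNV can be applied to a single sample.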
Multiplicative Scatter Correction (MSC)
from nirs4all.operators.transforms import MSC
msc = MSC(scale=True, copy=True)
X_corrected = msc.fit_transform(X)
How it works: Regresses each spectrum against a reference (typically the mean spectrum).
Pros: Better preserves absolute intensities, reference-based
Cons: Requires representative reference, computationally heavier
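The regression step can be illustrated with a minimal NumPy sketch. It assumes the mean spectrum as reference and a simple per-spectrum least-squares fit; the helper name `msc` is hypothetical and the library's `MSC` operator may differ in details such as the `scale` option.

```python
import numpy as np

def msc(X, reference=None):
    """Regress each spectrum on a reference, then undo offset and gain."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        # Least-squares fit: spectrum ≈ intercept + slope * ref
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected, ref

# Three scattered versions of the same synthetic spectrum.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
X = np.vstack([0.8 * base - 0.1, 1.0 * base, 1.3 * base + 0.2])
X_msc, ref = msc(X)  # every corrected row collapses onto the reference
```

In this idealized case each spectrum is an exact affine function of the reference, so the correction is perfect. With real data the residual of the fit carries the chemical information.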
Robust/Local Variants
| Variant | Use Case | Advantage |
|---|---|---|
| RSNV | Noisy data | Outlier-resistant (uses robust statistics) |
| LSNV | Heterogeneous samples | Window-based local correction |
| EMSC | Complex scatter | Extended MSC with polynomial terms |
2. Smoothing
Smoothing reduces high-frequency noise while preserving spectral features.
Savitzky-Golay Filter
The workhorse of NIRS preprocessing:
from nirs4all.operators.transforms import SavitzkyGolay
# Smoothing only
sg_smooth = SavitzkyGolay(window_length=15, polyorder=3, deriv=0)
# First derivative with smoothing
sg_d1 = SavitzkyGolay(window_length=15, polyorder=3, deriv=1)
# Second derivative
sg_d2 = SavitzkyGolay(window_length=21, polyorder=3, deriv=2)
Parameter Guidelines:
| Parameter | Typical Range | Effect of Increase |
|---|---|---|
| window_length | 11-25 | More smoothing, less noise |
| polyorder | 2-4 | Better peak preservation |
| deriv | 0-2 | Higher order derivatives |
Tip
For derivatives, use longer windows to reduce noise amplification:
1st derivative: window 11-17
2nd derivative: window 21-31
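The effect of window length on derivative noise can be checked numerically. The sketch below uses `scipy.signal.savgol_filter` directly rather than the library's `SavitzkyGolay` operator, on a synthetic sine signal with assumed noise level; the helper `d1_error` is hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 400)
dx = x[1] - x[0]
noisy = np.sin(x) + rng.normal(scale=0.01, size=x.size)
true_d1 = np.cos(x)  # analytic first derivative of the clean signal

def d1_error(window_length):
    """RMS error of the SG first derivative, ignoring the edges."""
    est = savgol_filter(noisy, window_length=window_length,
                        polyorder=3, deriv=1, delta=dx)
    interior = slice(window_length, -window_length)
    return float(np.sqrt(np.mean((est[interior] - true_d1[interior]) ** 2)))

err_short = d1_error(7)   # short window: noise dominates the derivative
err_long = d1_error(21)   # longer window: noise is averaged out
```

On a smooth signal like this one the longer window gives the lower error. On spectra with narrow bands the trade-off reverses at some point, because wide windows start to flatten real peaks.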
Gaussian Smoothing
from nirs4all.operators.transforms import Gaussian
gauss = Gaussian(sigma=2, order=0)
X_smooth = gauss.fit_transform(X)
Simpler than SG, good for quick smoothing without derivatives.
3. Baseline Correction
Detrending
Removes linear or polynomial baselines:
from nirs4all.operators.transforms import Detrend
# Linear detrend
detrend = Detrend(bp=0)
# Piecewise with breakpoints
detrend_pw = Detrend(bp=[100, 200]) # Breakpoints at indices
First and Second Derivatives
Derivatives inherently remove baseline components:
1st derivative: Removes a constant offset (a linear slope is reduced to a constant offset)
2nd derivative: Removes constant offset and linear slope (a quadratic baseline is reduced to a constant offset)
from nirs4all.operators.transforms import FirstDerivative, SecondDerivative
d1 = FirstDerivative(delta=1.0, edge_order=2)
d2 = SecondDerivative(delta=1.0, edge_order=2)
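A quick numerical check of how differencing treats an offset-plus-slope baseline, using `np.diff` on a synthetic band. This is only an illustration with made-up baseline coefficients, not the library's derivative operators.

```python
import numpy as np

t = np.arange(100, dtype=float)
signal = np.exp(-0.5 * ((t - 50.0) / 5.0) ** 2)  # synthetic absorption band
baseline = 0.4 + 0.002 * t                        # constant offset + linear slope
spectrum = signal + baseline

# What the baseline leaves behind after differencing.
d1_shift = np.diff(spectrum) - np.diff(signal)            # constant, equal to the slope
d2_shift = np.diff(spectrum, n=2) - np.diff(signal, n=2)  # zero up to rounding
```

The offset disappears after one difference and the slope survives as a constant shift. A second difference removes that shift as well.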
4. Wavelets
Wavelet transforms provide multi-resolution analysis:
from nirs4all.operators.transforms import Wavelet
# Common wavelets for NIRS
wavelet_db4 = Wavelet(wavelet='db4') # Good for peaks
wavelet_sym5 = Wavelet(wavelet='sym5') # Good for baselines
wavelet_haar = Wavelet(wavelet='haar') # Sharp features
When to use wavelets:
Denoising with thresholding
Multi-resolution feature extraction
Capturing both local and global features
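Denoising with thresholding can be sketched with a single-level Haar transform written in plain NumPy. This is a deliberately minimal stand-in, not the library's `Wavelet` operator: the function name `haar_denoise`, the single decomposition level, and the soft-threshold rule are all assumptions for illustration.

```python
import numpy as np

def haar_denoise(x, threshold):
    """One-level Haar transform, soft-threshold the details, invert."""
    x = np.asarray(x, dtype=float)
    if x.size % 2:
        raise ValueError("this sketch expects an even number of points")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency part
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency part
    # Soft thresholding: shrink small detail coefficients toward zero.
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    out = np.empty_like(x)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 3.0, 64)) + rng.normal(scale=0.02, size=64)
unchanged = haar_denoise(x, threshold=0.0)   # no shrinkage: exact reconstruction
flattened = haar_denoise(x, threshold=1e9)   # all detail removed: pairwise averages
```

Practical denoising uses several decomposition levels and a data-driven threshold. The sketch only shows the decompose, shrink, reconstruct pattern.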
5. Scaling
Feature-wise Scaling (per wavelength)
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
# Zero mean, unit variance
scaler = StandardScaler()
# Robust to outliers
scaler = RobustScaler()
# Bounded range
scaler = MinMaxScaler(feature_range=(0, 1))
Sample-wise Scaling
Sample-wise scaling is already handled by SNV, but you can also use:
from nirs4all.operators.transforms import Normalize
# Vector normalization (L2 norm)
norm = Normalize(feature_range=(-1, 1))
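The difference between the two scaling directions comes down to the axis the statistics are computed over. A small NumPy sketch on random data (not the library or scikit-learn classes themselves) makes this explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.2, size=(6, 5))  # 6 samples x 5 wavelengths

# Feature-wise: statistics computed down each column (per wavelength).
# These statistics must be learned on training data and reused on new data.
X_feat = (X - X.mean(axis=0)) / X.std(axis=0)

# Sample-wise: each row (spectrum) divided by its own L2 norm.
# Nothing is learned from other samples.
X_samp = X / np.linalg.norm(X, axis=1, keepdims=True)
```

Feature-wise scaling depends on the training set, so it has to be fitted inside cross-validation. Sample-wise scaling is stateless and can be applied to any spectrum on its own.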
Preprocessing Order
General Order
1. Convert to absorbance (if reflectance)
2. Detrend (if needed before scatter correction)
3. MSC/SNV (scatter correction)
4. Smoothing (SavGol, Gaussian)
5. Derivatives (if using)
6. Scaling (StandardScaler, RobustScaler)
Common Correct Chains
# Chain 1: Standard NIRS preprocessing
[MSC(), SavitzkyGolay(deriv=1), StandardScaler()]
# Chain 2: SNV-Detrend combo
[SNV(), Detrend()] # Detrend AFTER SNV
# Chain 3: Smooth then derivative
[SavitzkyGolay(deriv=0), FirstDerivative()]
# Chain 4: Full pipeline
[Detrend(), MSC(), SavitzkyGolay(deriv=1), RobustScaler()]
Order Violations to Avoid
# ❌ Two scatter corrections
[SNV(), MSC()] # Redundant!
# ❌ Smooth after SG derivative
[SavitzkyGolay(deriv=1), SavitzkyGolay(deriv=0)] # SG already smooths
# ❌ Derivative before scatter correction
[FirstDerivative(), MSC()] # Amplifies scatter artifacts
# ❌ Detrend after SG with derivative
[SavitzkyGolay(deriv=1), Detrend()] # Derivative already reduces a linear baseline to a constant offset
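One way to see why stacking scatter corrections adds little: once a spectrum has zero mean and unit standard deviation, a second row-wise standardization has nothing left to remove. The sketch below applies a plain NumPy SNV twice (rather than SNV followed by MSC) because that case is exact; the helper `snv` is a hypothetical stand-in for the library operator.

```python
import numpy as np

def snv(X):
    """Row-wise standardization of each spectrum."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.1, size=(4, 30))
once = snv(X)
twice = snv(once)  # same as `once` up to floating-point rounding
```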
Multi-Layer Preprocessing for Deep Learning
Optimal 8-10 Channels
For neural networks, stack multiple preprocessing views as channels:
preprocessing_layers = [
    # Layer 1: Raw baseline information
    None, # Pass-through
    # Layer 2: Scatter-corrected baseline
    [MSC()],
    # Layer 3: Normalized + smoothed
    [SNV(), SavitzkyGolay()],
    # Layer 4: First derivative (critical!)
    [MSC(), FirstDerivative()],
    # Layer 5: Second derivative
    [SNV(), SecondDerivative()],
    # Layer 6: SNV-detrend combo
    [SNV(), Detrend()],
    # Layer 7: Wavelet high-frequency
    [Wavelet('db4')],
    # Layer 8: Wavelet low-frequency
    [Wavelet('sym5')],
]
Minimal 5 Channels
minimal_layers = [
    None, # Raw
    [MSC(), SavitzkyGolay()], # Standard
    [SNV(), FirstDerivative()], # 1st derivative
    [MSC(), SecondDerivative()], # 2nd derivative
    [Wavelet('db6')], # Wavelet
]
Per-Channel Scaling
Warning
After concatenating multi-layer preprocessing, apply per-channel scaling:
import numpy as np
from sklearn.preprocessing import StandardScaler
# ❌ WRONG: Global scaler on all channels
all_channels = np.concatenate([ch1, ch2, ch3], axis=1)
scaler.fit_transform(all_channels)
# ✅ CORRECT: Scale each channel separately
channels = [ch1, ch2, ch3] # each of shape (samples, wavelengths)
for i, channel in enumerate(channels):
    channels[i] = StandardScaler().fit_transform(channel)
final = np.stack(channels, axis=-1) # (samples, wavelengths, channels)
Task-Specific Preprocessing
Protein/Nitrogen (N-H bonds: 2000-2200nm)
protein_preprocessing = [
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1), # Critical for protein
    StandardScaler(),
]
Key considerations:
1st derivative enhances amide bands
2nd derivative resolves overlapping peaks
Consider Wavelet('coif3') for additional features
Moisture (O-H bonds: 1400-1500nm, 1900-2000nm)
moisture_preprocessing = [
    MSC(),
    SavitzkyGolay(window_length=15), # Smooth only in high absorption
    RSNV(), # Robust scatter correction for heterogeneous moisture
]
Key considerations:
Water peaks are strong, so derivatives may not be needed
Use caution near saturation regions
Haar wavelet captures sharp water absorption edges
Fat/Oil (C-H bonds: 1700-1800nm, 2300-2400nm)
fat_preprocessing = [
    MSC(), # Important for fat scatter
    SNV(),
    SavitzkyGolay(window_length=21, deriv=1),
]
Key considerations:
Fat scatter is significant, so MSC helps
Area normalization can be useful
Wavelet('db8') captures smooth fat peaks
Cellulose/Lignin (Plant matrices)
cellulose_preprocessing = [
    Detrend(),
    MSC(),
    SavitzkyGolay(window_length=17, deriv=1),
    # Consider region-specific: 1600-1800nm, 2100-2350nm
]
Advanced Techniques
Region-Specific Preprocessing
Different spectral regions may benefit from different preprocessing:
import numpy as np
def region_specific_preprocessing(X, wavelengths):
    """Apply different preprocessing to different regions."""
    # wavelengths: 1-D NumPy array in nm, one entry per column of X
    # Region 1: 1100-1400nm (C-H, good SNR)
    mask1 = wavelengths < 1400
    X1 = SNV().fit_transform(X[:, mask1])
    X1 = FirstDerivative().fit_transform(X1)
    # Region 2: 1400-1600nm (water, high absorption)
    mask2 = (wavelengths >= 1400) & (wavelengths < 1600)
    X2 = MSC().fit_transform(X[:, mask2])
    X2 = SavitzkyGolay().fit_transform(X2) # Smooth only
    # Region 3: 1600-2400nm (protein, fat, lower SNR)
    mask3 = wavelengths >= 1600
    X3 = RSNV().fit_transform(X[:, mask3])
    X3 = SecondDerivative().fit_transform(X3)
    # Reassemble the regions in wavelength order
    return np.concatenate([X1, X2, X3], axis=1)
Instrument Transfer
When transferring calibrations between instruments:
transfer_preprocessing = [
    # EMSC with reference from new instrument
    EMSC(reference_spectrum=new_instrument_reference),
    # Or PDS (Piecewise Direct Standardization)
    PDS(transfer_samples=transfer_set),
    # Standard preprocessing
    SavitzkyGolay(deriv=1),
    StandardScaler(),
]
Validation & Quality Control
Check Your Preprocessing
Use this function to validate preprocessing choices:
def validate_preprocessing_layers(layers):
    """Check for common preprocessing mistakes."""
    issues = []
    # Check for scatter correction redundancy
    scatter_methods = ['SNV', 'MSC', 'LSNV', 'RSNV']
    for layer in layers:
        if isinstance(layer, list):
            # Count the steps in this layer whose repr names a scatter method
            scatter_count = sum(
                any(s in str(p) for s in scatter_methods)
                for p in layer
            )
            if scatter_count > 1:
                issues.append("⚠️ Multiple scatter corrections in same pipeline")
    # Check for derivative coverage
    all_layers_str = str(layers)
    if 'FirstDeriv' not in all_layers_str and 'deriv=1' not in all_layers_str:
        issues.append("❌ Missing first derivative (critical for NIRS)")
    return issues
Metrics to Monitor
| Metric | What it Tells You |
|---|---|
| RMSECV | Cross-validation error (primary metric) |
| R² | Explained variance |
| Bias | Systematic offset |
| RPD | Ratio of Performance to Deviation |
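These metrics can be computed from reference and predicted values with a few lines of NumPy. The sketch below assumes common conventions that the handbook does not spell out: bias as mean(predicted - reference), and RPD as the sample standard deviation of the reference values divided by the RMSE. The function name `regression_metrics` is hypothetical. Feeding it cross-validated predictions gives RMSECV.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, R², bias and RPD under the conventions stated above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": rmse,
        "r2": 1.0 - ss_res / ss_tot,
        "bias": float(np.mean(residuals)),
        "rpd": float(np.std(y_true, ddof=1)) / rmse,
    }

# Predictions that are uniformly 0.1 too high: the error is pure bias.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1])
```

When RMSE and the absolute bias are nearly equal, as in this toy case, most of the error is a systematic offset rather than random scatter.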
Summary
Key Takeaways
Order matters: Scatter → Smooth → Derivative → Scale
Don’t over-process: Start simple, add complexity only if needed
Match to model: Trees don’t need scaling; neural networks need careful normalization
Validate inside CV: Never fit preprocessing on full dataset
First derivative is critical: Almost always improves PLS/SVM/CNN performance
Quick Decision Tree
Is there scatter/baseline?
└─ Yes → Apply SNV or MSC
└─ No → Skip
Is data noisy?
└─ Yes → Apply SavitzkyGolay smoothing
└─ No → Skip or light smoothing
Need baseline removal?
└─ Yes → Apply 1st derivative (removes offset; a linear slope becomes a constant)
└─ Need peak resolution? → Apply 2nd derivative (with caution)
Using a scale-sensitive model (SVM, k-NN, MLP)?
└─ Yes → Apply StandardScaler or RobustScaler
└─ No (trees) → Skip scaling
See Also
Preprocessing Overview - Quick introduction to preprocessing
Preprocessing Cheatsheet - Quick reference by model type
Operator Catalog - Complete operator reference