Preprocessing Examples
This section covers NIRS-specific preprocessing techniques, from basic transformations to automated exploration of preprocessing combinations.
Overview
Example |
Topic |
Difficulty |
Duration |
|---|---|---|---|
Preprocessing Basics |
★★☆☆☆ |
~3 min |
|
Feature Augmentation |
★★☆☆☆ |
~3 min |
|
Sample Augmentation |
★★☆☆☆ |
~3 min |
|
Signal Conversion |
★★☆☆☆ |
~2 min |
U01: Preprocessing Basics
Overview of standard NIRS preprocessing techniques.
What You’ll Learn
Scatter correction: SNV, MSC
Baseline correction: Detrend
Derivatives: First, Second, Savitzky-Golay
Smoothing: Gaussian, Savitzky-Golay
Wavelet transforms: Haar
Preprocessing Categories
NIRS preprocessing addresses common spectral issues:
📊 Scatter Correction
Corrects for variations in light scattering due to sample structure:
from nirs4all.operators.transforms import (
StandardNormalVariate,
MultiplicativeScatterCorrection
)
# SNV: Per-sample mean-centering and scaling
StandardNormalVariate()
# MSC: Regression-based correction using reference spectrum
MultiplicativeScatterCorrection()
Method |
How it Works |
When to Use |
|---|---|---|
SNV |
Centers and scales each spectrum individually |
Path length variations, quick scatter correction |
MSC |
Regresses each spectrum against a reference (mean) |
More robust to baseline variations |
📈 Baseline Correction
Removes baseline drift from spectra:
from nirs4all.operators.transforms import Detrend
# Remove polynomial baseline drift
Detrend() # Default: linear detrending
📉 Derivatives
Enhance peaks and remove baselines:
from nirs4all.operators.transforms import (
FirstDerivative,
SecondDerivative,
SavitzkyGolay
)
# Simple derivatives
FirstDerivative() # Removes constant baseline
SecondDerivative() # Removes linear baseline
# Smoothed derivative (recommended for noisy data)
SavitzkyGolay(window_length=11, polyorder=2, deriv=1)
Derivative |
Effect |
Use Case |
|---|---|---|
First |
Enhances peaks, removes constant baseline |
General baseline issues |
Second |
Stronger enhancement, removes linear baseline |
Complex baselines |
Savitzky-Golay |
Smoothed derivatives |
Noisy spectra |
🔊 Smoothing
Reduce noise while preserving spectral features:
from nirs4all.operators.transforms import Gaussian, SavitzkyGolay
# Gaussian convolution
Gaussian(sigma=2)
# Polynomial smoothing (no derivative)
SavitzkyGolay(window_length=11, polyorder=2, deriv=0)
🌊 Wavelet Transforms
Multi-resolution analysis:
from nirs4all.operators.transforms import Haar
Haar() # Haar wavelet transform
Combining Preprocessing Steps
Common combinations for NIRS data:
# Combination 1: Scatter + Derivative
pipeline = [
StandardNormalVariate(),
FirstDerivative(),
PLSRegression(n_components=10)
]
# Combination 2: Full preprocessing chain
pipeline = [
Detrend(),
MultiplicativeScatterCorrection(),
SavitzkyGolay(window_length=11, polyorder=2, deriv=1),
PLSRegression(n_components=10)
]
Comparing Methods
# Run pipelines with different preprocessing
methods = {
'SNV': StandardNormalVariate(),
'MSC': MultiplicativeScatterCorrection(),
'D1': FirstDerivative(),
'SG': SavitzkyGolay(deriv=1),
}
for name, method in methods.items():
pipeline = [method, ShuffleSplit(n_splits=3), PLSRegression(n_components=10)]
result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")
print(f"{name}: RMSE = {result.best_rmse:.4f}")
U02: Feature Augmentation
Automatically explore preprocessing combinations.
What You’ll Learn
Using
feature_augmentationto generate variantsThe
_or_generator syntaxPick, count, and combination controls
Actions: extend vs add vs replace
The Feature Augmentation Step
Instead of manually testing every preprocessing combination:
# Manual approach (tedious!)
pipeline_1 = [SNV(), FirstDerivative(), ...]
pipeline_2 = [MSC(), FirstDerivative(), ...]
pipeline_3 = [SNV(), SavitzkyGolay(), ...]
# ... many more
Use feature augmentation:
pipeline = [
MinMaxScaler(),
# Automatically generate preprocessing variants
{
"feature_augmentation": {
"_or_": [SNV, MSC, FirstDerivative, SavitzkyGolay, Gaussian],
"pick": 2, # Pick 2 methods at a time
"count": 5 # Generate 5 random combinations
}
},
PLSRegression(n_components=10)
]
Generator Syntax Options
_or_ - Alternatives
{"_or_": [A, B, C]} # Generates: A, B, C (3 variants)
pick - Combinations
{"_or_": [A, B, C, D], "pick": 2}
# Generates: [A,B], [A,C], [A,D], [B,C], [B,D], [C,D] (6 variants)
count - Limit
{"_or_": [A, B, C, D], "pick": 2, "count": 3}
# Generates: 3 random combinations (from the 6 possible)
Augmentation Actions
Action |
Behavior |
|---|---|
|
Add generated variants to existing features |
|
Stack the new transform on top of previous |
|
Replace current features with augmented versions |
# Extend: try each option separately
{"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"}
# Add: stack a derivative on top of current preprocessing
{"feature_augmentation": [FirstDerivative], "action": "add"}
Practical Example
pipeline = [
# Base scaling
MinMaxScaler(),
# Explore scatter correction options
{"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"},
# Add derivative on top
{"feature_augmentation": [FirstDerivative], "action": "add"},
# Cross-validation and model
ShuffleSplit(n_splits=3),
PLSRegression(n_components=10)
]
U03: Sample Augmentation
Data augmentation techniques for increasing sample diversity.
What You’ll Learn
Noise injection for robustness
Spectral transformations
Sample mixing strategies
Augmentation during training
Sample Augmentation Techniques
While feature augmentation creates different preprocessing pipelines, sample augmentation creates synthetic training samples.
{
"sample_augmentation": {
"noise_injection": 0.01, # Add Gaussian noise (1% std)
"shift": 2, # Shift spectra by ±2 wavelengths
"scale": 0.05, # Scale intensity by ±5%
"mixup_alpha": 0.2, # Mixup with alpha=0.2
"augmentation_factor": 3 # Triple training set size
}
}
When to Use Sample Augmentation
Technique |
Purpose |
Best For |
|---|---|---|
Noise injection |
Robustness to measurement noise |
Small datasets |
Spectral shift |
Robustness to wavelength calibration |
Instrument transfer |
Intensity scaling |
Robustness to concentration variations |
Variable samples |
Mixup |
Regularization, interpolation |
Deep learning |
U04: Signal Conversion
Convert between signal representations (absorbance, reflectance, etc.).
What You’ll Learn
Converting between absorbance and reflectance
Log transformations
Standard signal formats
Common Conversions
from nirs4all.operators.transforms import (
AbsorbanceToReflectance,
ReflectanceToAbsorbance,
Log1p,
Log10
)
# Convert representations
pipeline = [
ReflectanceToAbsorbance(), # If your data is in reflectance
SNV(), # Preprocessing expects absorbance
PLSRegression(n_components=10)
]
Signal Representation Guidelines
Representation |
Formula |
Typical Range |
|---|---|---|
Reflectance (R) |
I/I₀ |
0-1 |
Absorbance (A) |
-log₁₀(R) |
0-3+ |
Transmittance (T) |
I/I₀ |
0-1 |
Most NIRS preprocessing methods expect absorbance data.
Preprocessing Best Practices
1. Order Matters
# Recommended order:
pipeline = [
# 1. Signal conversion (if needed)
ReflectanceToAbsorbance(),
# 2. Scatter correction
StandardNormalVariate(),
# 3. Baseline correction (optional)
Detrend(),
# 4. Derivatives
FirstDerivative(),
# 5. Smoothing (if noisy)
Gaussian(sigma=1),
# 6. Feature scaling (before model)
MinMaxScaler(),
# 7. Model
PLSRegression(n_components=10)
]
2. Don’t Over-Process
More preprocessing isn’t always better. Common mistakes:
❌ Applying SNV after derivatives (destroys derivative information)
❌ Multiple smoothing steps (over-smooths, loses peaks)
❌ Second derivative on noisy data (amplifies noise)
3. Use Visualization
pipeline = [
"chart_2d", # Visualize raw spectra
SNV(),
"chart_2d", # Visualize after SNV
FirstDerivative(),
"chart_2d", # Visualize after derivative
PLSRegression(n_components=10)
]
4. Let the Data Decide
Use feature augmentation to find the best combination:
pipeline = [
{
"feature_augmentation": {
"_or_": [SNV, MSC, Detrend, FirstDerivative, SavitzkyGolay],
"pick": [1, 2, 3], # Try 1, 2, or 3 methods
"count": 10 # Generate 10 random combinations
}
},
PLSRegression(n_components=10)
]
Running These Examples
cd examples
# Run all preprocessing examples
./run.sh -n "U0*.py" -c user
# Run with visualization
python user/03_preprocessing/U01_preprocessing_basics.py --plots --show
Next Steps
After mastering preprocessing:
Models: Compare different model architectures
Cross-Validation: Proper model evaluation
Explainability: Understand which wavelengths matter