# Preprocessing Examples This section covers NIRS-specific preprocessing techniques, from basic transformations to automated exploration of preprocessing combinations. ```{contents} On this page :local: :depth: 2 ``` ## Overview | Example | Topic | Difficulty | Duration | |---------|-------|------------|----------| | [U01](#u01-preprocessing-basics) | Preprocessing Basics | ★★☆☆☆ | ~3 min | | [U02](#u02-feature-augmentation) | Feature Augmentation | ★★☆☆☆ | ~3 min | | [U03](#u03-sample-augmentation) | Sample Augmentation | ★★☆☆☆ | ~3 min | | [U04](#u04-signal-conversion) | Signal Conversion | ★★☆☆☆ | ~2 min | --- ## U01: Preprocessing Basics **Overview of standard NIRS preprocessing techniques.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/03_preprocessing/U01_preprocessing_basics.py) ### What You'll Learn - Scatter correction: SNV, MSC - Baseline correction: Detrend - Derivatives: First, Second, Savitzky-Golay - Smoothing: Gaussian, Savitzky-Golay - Wavelet transforms: Haar ### Preprocessing Categories NIRS preprocessing addresses common spectral issues: #### 📊 Scatter Correction Corrects for variations in light scattering due to sample structure: ```python from nirs4all.operators.transforms import ( StandardNormalVariate, MultiplicativeScatterCorrection ) # SNV: Per-sample mean-centering and scaling StandardNormalVariate() # MSC: Regression-based correction using reference spectrum MultiplicativeScatterCorrection() ``` | Method | How it Works | When to Use | |--------|--------------|-------------| | **SNV** | Centers and scales each spectrum individually | Path length variations, quick scatter correction | | **MSC** | Regresses each spectrum against a reference (mean) | More robust to baseline variations | #### 📈 Baseline Correction Removes baseline drift from spectra: ```python from nirs4all.operators.transforms import Detrend # Remove polynomial baseline drift Detrend() # Default: linear detrending ``` #### 📉 Derivatives Enhance peaks and remove baselines: ```python from nirs4all.operators.transforms import ( FirstDerivative, SecondDerivative, SavitzkyGolay ) # Simple derivatives FirstDerivative() # Removes constant baseline SecondDerivative() # Removes linear baseline # Smoothed derivative (recommended for noisy data) SavitzkyGolay(window_length=11, polyorder=2, deriv=1) ``` | Derivative | Effect | Use Case | |------------|--------|----------| | **First** | Enhances peaks, removes constant baseline | General baseline issues | | **Second** | Stronger enhancement, removes linear baseline | Complex baselines | | **Savitzky-Golay** | Smoothed derivatives | Noisy spectra | #### 🔊 Smoothing Reduce noise while preserving spectral features: ```python from nirs4all.operators.transforms import Gaussian, SavitzkyGolay # Gaussian convolution Gaussian(sigma=2) # Polynomial smoothing (no derivative) SavitzkyGolay(window_length=11, polyorder=2, deriv=0) ``` #### 🌊 Wavelet Transforms Multi-resolution analysis: ```python from nirs4all.operators.transforms import Haar Haar() # Haar wavelet transform ``` ### Combining Preprocessing Steps Common combinations for NIRS data: ```python # Combination 1: Scatter + Derivative pipeline = [ StandardNormalVariate(), FirstDerivative(), PLSRegression(n_components=10) ] # Combination 2: Full preprocessing chain pipeline = [ Detrend(), MultiplicativeScatterCorrection(), SavitzkyGolay(window_length=11, polyorder=2, deriv=1), PLSRegression(n_components=10) ] ``` ### Comparing Methods ```python # Run pipelines with different preprocessing methods = { 'SNV': StandardNormalVariate(), 'MSC': MultiplicativeScatterCorrection(), 'D1': FirstDerivative(), 'SG': SavitzkyGolay(deriv=1), } for name, method in methods.items(): pipeline = [method, ShuffleSplit(n_splits=3), PLSRegression(n_components=10)] result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression") print(f"{name}: RMSE = {result.best_rmse:.4f}") ``` --- ## U02: Feature Augmentation **Automatically explore preprocessing combinations.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/03_preprocessing/U02_feature_augmentation.py) ### What You'll Learn - Using `feature_augmentation` to generate variants - The `_or_` generator syntax - Pick, count, and combination controls - Actions: extend vs add vs replace ### The Feature Augmentation Step Instead of manually testing every preprocessing combination: ```python # Manual approach (tedious!) pipeline_1 = [SNV(), FirstDerivative(), ...] pipeline_2 = [MSC(), FirstDerivative(), ...] pipeline_3 = [SNV(), SavitzkyGolay(), ...] # ... many more ``` Use feature augmentation: ```python pipeline = [ MinMaxScaler(), # Automatically generate preprocessing variants { "feature_augmentation": { "_or_": [SNV, MSC, FirstDerivative, SavitzkyGolay, Gaussian], "pick": 2, # Pick 2 methods at a time "count": 5 # Generate 5 random combinations } }, PLSRegression(n_components=10) ] ``` ### Generator Syntax Options #### `_or_` - Alternatives ```python {"_or_": [A, B, C]} # Generates: A, B, C (3 variants) ``` #### `pick` - Combinations ```python {"_or_": [A, B, C, D], "pick": 2} # Generates: [A,B], [A,C], [A,D], [B,C], [B,D], [C,D] (6 variants) ``` #### `count` - Limit ```python {"_or_": [A, B, C, D], "pick": 2, "count": 3} # Generates: 3 random combinations (from the 6 possible) ``` ### Augmentation Actions | Action | Behavior | |--------|----------| | `"extend"` | Add generated variants to existing features | | `"add"` | Stack the new transform on top of previous | | `"replace"` | Replace current features with augmented versions | ```python # Extend: try each option separately {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"} # Add: stack a derivative on top of current preprocessing {"feature_augmentation": [FirstDerivative], "action": "add"} ``` ### Practical Example ```python pipeline = [ # Base scaling MinMaxScaler(), # Explore scatter correction options {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"}, # Add derivative on top {"feature_augmentation": [FirstDerivative], "action": "add"}, # Cross-validation and model ShuffleSplit(n_splits=3), PLSRegression(n_components=10) ] ``` --- ## U03: Sample Augmentation **Data augmentation techniques for increasing sample diversity.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/03_preprocessing/U03_sample_augmentation.py) ### What You'll Learn - Noise injection for robustness - Spectral transformations - Sample mixing strategies - Augmentation during training ### Sample Augmentation Techniques While feature augmentation creates different preprocessing pipelines, **sample augmentation** creates synthetic training samples. ```python { "sample_augmentation": { "noise_injection": 0.01, # Add Gaussian noise (1% std) "shift": 2, # Shift spectra by ±2 wavelengths "scale": 0.05, # Scale intensity by ±5% "mixup_alpha": 0.2, # Mixup with alpha=0.2 "augmentation_factor": 3 # Triple training set size } } ``` ### When to Use Sample Augmentation | Technique | Purpose | Best For | |-----------|---------|----------| | **Noise injection** | Robustness to measurement noise | Small datasets | | **Spectral shift** | Robustness to wavelength calibration | Instrument transfer | | **Intensity scaling** | Robustness to concentration variations | Variable samples | | **Mixup** | Regularization, interpolation | Deep learning | --- ## U04: Signal Conversion **Convert between signal representations (absorbance, reflectance, etc.).** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/03_preprocessing/U04_signal_conversion.py) ### What You'll Learn - Converting between absorbance and reflectance - Log transformations - Standard signal formats ### Common Conversions ```python from nirs4all.operators.transforms import ( AbsorbanceToReflectance, ReflectanceToAbsorbance, Log1p, Log10 ) # Convert representations pipeline = [ ReflectanceToAbsorbance(), # If your data is in reflectance SNV(), # Preprocessing expects absorbance PLSRegression(n_components=10) ] ``` ### Signal Representation Guidelines | Representation | Formula | Typical Range | |----------------|---------|---------------| | Reflectance (R) | I/I₀ | 0-1 | | Absorbance (A) | -log₁₀(R) | 0-3+ | | Transmittance (T) | I/I₀ | 0-1 | Most NIRS preprocessing methods expect **absorbance** data. --- ## Preprocessing Best Practices ### 1. Order Matters ```python # Recommended order: pipeline = [ # 1. Signal conversion (if needed) ReflectanceToAbsorbance(), # 2. Scatter correction StandardNormalVariate(), # 3. Baseline correction (optional) Detrend(), # 4. Derivatives FirstDerivative(), # 5. Smoothing (if noisy) Gaussian(sigma=1), # 6. Feature scaling (before model) MinMaxScaler(), # 7. Model PLSRegression(n_components=10) ] ``` ### 2. Don't Over-Process More preprocessing isn't always better. Common mistakes: - ❌ Applying SNV after derivatives (destroys derivative information) - ❌ Multiple smoothing steps (over-smooths, loses peaks) - ❌ Second derivative on noisy data (amplifies noise) ### 3. Use Visualization ```python pipeline = [ "chart_2d", # Visualize raw spectra SNV(), "chart_2d", # Visualize after SNV FirstDerivative(), "chart_2d", # Visualize after derivative PLSRegression(n_components=10) ] ``` ### 4. Let the Data Decide Use feature augmentation to find the best combination: ```python pipeline = [ { "feature_augmentation": { "_or_": [SNV, MSC, Detrend, FirstDerivative, SavitzkyGolay], "pick": [1, 2, 3], # Try 1, 2, or 3 methods "count": 10 # Generate 10 random combinations } }, PLSRegression(n_components=10) ] ``` --- ## Running These Examples ```bash cd examples # Run all preprocessing examples ./run.sh -n "U0*.py" -c user # Run with visualization python user/03_preprocessing/U01_preprocessing_basics.py --plots --show ``` ## Next Steps After mastering preprocessing: - **Models**: Compare different model architectures - **Cross-Validation**: Proper model evaluation - **Explainability**: Understand which wavelengths matter