Preprocessing Examples

This section covers NIRS-specific preprocessing techniques, from basic transformations to automated exploration of preprocessing combinations.

Overview

| Example | Topic | Difficulty | Duration |
|---------|-------|------------|----------|
| U01 | Preprocessing Basics | ★★☆☆☆ | ~3 min |
| U02 | Feature Augmentation | ★★☆☆☆ | ~3 min |
| U03 | Sample Augmentation | ★★☆☆☆ | ~3 min |
| U04 | Signal Conversion | ★★☆☆☆ | ~2 min |


U01: Preprocessing Basics

Overview of standard NIRS preprocessing techniques.

📄 View source code

What You’ll Learn

  • Scatter correction: SNV, MSC

  • Baseline correction: Detrend

  • Derivatives: First, Second, Savitzky-Golay

  • Smoothing: Gaussian, Savitzky-Golay

  • Wavelet transforms: Haar

Preprocessing Categories

NIRS preprocessing addresses common spectral issues:

📊 Scatter Correction

Corrects for variations in light scattering due to sample structure:

from nirs4all.operators.transforms import (
    StandardNormalVariate,
    MultiplicativeScatterCorrection
)

# SNV: Per-sample mean-centering and scaling
StandardNormalVariate()

# MSC: Regression-based correction using reference spectrum
MultiplicativeScatterCorrection()

| Method | How it Works | When to Use |
|--------|--------------|-------------|
| SNV | Centers and scales each spectrum individually | Path length variations, quick scatter correction |
| MSC | Regresses each spectrum against a reference (mean) | More robust to baseline variations |
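To make the two corrections concrete, here is a minimal NumPy sketch of the textbook SNV and MSC formulas. The function names `snv` and `msc` are illustrative only; this is not the nirs4all implementation, which may handle references and edge cases differently.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, spectrum in enumerate(spectra):
        # Fit spectrum ~ slope * ref + intercept, then undo both effects.
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected

# Synthetic data: one base shape distorted by a per-sample gain and offset.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
gains = np.array([0.8, 1.0, 1.3])[:, None]
offsets = np.array([0.1, 0.0, -0.2])[:, None]
spectra = gains * base + offsets

snv_out = snv(spectra)   # every row now has mean 0 and standard deviation 1
msc_out = msc(spectra)   # every row collapses onto the mean spectrum
```

On this synthetic data both methods remove the gain and offset entirely, because the distortion is purely multiplicative plus additive. Real spectra also carry chemical variation, so the rows stay distinct after correction.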

📈 Baseline Correction

Removes baseline drift from spectra:

from nirs4all.operators.transforms import Detrend

# Remove polynomial baseline drift
Detrend()  # Default: linear detrending
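
As an illustration of what linear detrending does, the sketch below fits a polynomial to each spectrum against the wavelength index and subtracts it. The `detrend` function and its `degree` parameter are assumptions made for this example and do not describe the signature of nirs4all's `Detrend`.

```python
import numpy as np

def detrend(spectra, degree=1):
    """Subtract a per-spectrum polynomial fitted against the wavelength index."""
    x = np.arange(spectra.shape[1])
    # polyfit accepts a 2-D y with one column per series, hence the transpose.
    coeffs = np.polyfit(x, spectra.T, deg=degree)   # shape (degree + 1, n_samples)
    trend = np.vander(x, degree + 1) @ coeffs       # shape (n_wavelengths, n_samples)
    return spectra - trend.T

# Two spectra sharing the same peak but carrying different linear drifts.
x = np.arange(120)
peak = np.exp(-((x - 60) ** 2) / 50.0)
drift = 0.002 * x + 0.5
spectra = np.vstack([peak + drift, peak + 2 * drift])

flattened = detrend(spectra)   # both rows become identical once the drift is removed
```

Note that the fit is made on the whole spectrum, so the peak itself slightly influences the estimated baseline. This is a known limitation of simple polynomial detrending.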

📉 Derivatives

Enhance peaks and remove baselines:

from nirs4all.operators.transforms import (
    FirstDerivative,
    SecondDerivative,
    SavitzkyGolay
)

# Simple derivatives
FirstDerivative()   # Removes constant baseline
SecondDerivative()  # Removes linear baseline

# Smoothed derivative (recommended for noisy data)
SavitzkyGolay(window_length=11, polyorder=2, deriv=1)

| Derivative | Effect | Use Case |
|------------|--------|----------|
| First | Enhances peaks, removes constant baseline | General baseline issues |
| Second | Stronger enhancement, removes linear baseline | Complex baselines |
| Savitzky-Golay | Smoothed derivatives | Noisy spectra |
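
The baseline claims in the table can be checked numerically. The sketch below uses plain finite differences (`np.diff`) as a stand-in for the derivative operators; it is an illustration of the principle and assumes nothing about how `FirstDerivative` or `SecondDerivative` are implemented.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)
signal = np.exp(-((x - 0.5) ** 2) / 0.005)   # a single absorption-like peak

with_offset = signal + 0.7                   # peak plus a constant baseline
with_slope = signal + 0.7 + 2.0 * x          # peak plus a linear baseline

# A first difference cancels the constant offset...
first_clean = np.diff(signal, n=1)
first_offset = np.diff(with_offset, n=1)

# ...but a linear baseline survives it as a constant shift.
first_slope = np.diff(with_slope, n=1)

# A second difference removes the linear baseline as well.
second_clean = np.diff(signal, n=2)
second_slope = np.diff(with_slope, n=2)
```

Here `first_offset` matches `first_clean` and `second_slope` matches `second_clean`, while `first_slope` still differs from `first_clean` by a constant.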

🔊 Smoothing

Reduce noise while preserving spectral features:

from nirs4all.operators.transforms import Gaussian, SavitzkyGolay

# Gaussian convolution
Gaussian(sigma=2)

# Polynomial smoothing (no derivative)
SavitzkyGolay(window_length=11, polyorder=2, deriv=0)
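
For intuition, Gaussian smoothing can be sketched as a convolution with a normalized Gaussian kernel. The `gaussian_smooth` helper, the three-sigma kernel radius, and the edge padding are choices made for this example; the `Gaussian` operator above may truncate the kernel or treat borders differently.

```python
import numpy as np

def gaussian_smooth(spectrum, sigma=2.0):
    """Smooth a 1-D spectrum with a Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()                      # weights sum to 1, so flat regions are preserved
    padded = np.pad(spectrum, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# A broad peak with additive noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
clean = np.exp(-((x - 0.5) ** 2) / 0.02)
noisy = clean + rng.normal(scale=0.05, size=x.size)

smoothed = gaussian_smooth(noisy, sigma=2.0)
```

A larger `sigma` removes more noise but also broadens narrow peaks, which is the trade-off to keep in mind when tuning it.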

🌊 Wavelet Transforms

Multi-resolution analysis:

from nirs4all.operators.transforms import Haar

Haar()  # Haar wavelet transform
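
To show what a Haar transform computes, here is a sketch of a single decomposition level using the orthonormal convention. The helpers `haar_step` and `haar_inverse` are hypothetical; nirs4all's `Haar` operator may use a different normalization, several levels, or a different output layout.

```python
import numpy as np

def haar_step(signal):
    """One level of the orthonormal Haar transform (input length must be even)."""
    pairs = np.asarray(signal, dtype=float).reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)   # scaled local averages
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)   # scaled local differences
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the original signal from one level of Haar coefficients."""
    out = np.empty(approx.size * 2)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_step(signal)
restored = haar_inverse(approx, detail)
```

The approximation coefficients carry the coarse shape at half the resolution, while the detail coefficients are zero wherever neighboring points are equal (the last pair here).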

Combining Preprocessing Steps

Common combinations for NIRS data:

from sklearn.cross_decomposition import PLSRegression

# Combination 1: Scatter + Derivative
pipeline = [
    StandardNormalVariate(),
    FirstDerivative(),
    PLSRegression(n_components=10)
]

# Combination 2: Full preprocessing chain
pipeline = [
    Detrend(),
    MultiplicativeScatterCorrection(),
    SavitzkyGolay(window_length=11, polyorder=2, deriv=1),
    PLSRegression(n_components=10)
]

Comparing Methods

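import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit

# Run pipelines with different preprocessing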
methods = {
    'SNV': StandardNormalVariate(),
    'MSC': MultiplicativeScatterCorrection(),
    'D1': FirstDerivative(),
    'SG': SavitzkyGolay(deriv=1),
}

for name, method in methods.items():
    pipeline = [method, ShuffleSplit(n_splits=3), PLSRegression(n_components=10)]
    result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")
    print(f"{name}: RMSE = {result.best_rmse:.4f}")

U02: Feature Augmentation

Automatically explore preprocessing combinations.

📄 View source code

What You’ll Learn

  • Using feature_augmentation to generate variants

  • The _or_ generator syntax

  • Pick, count, and combination controls

  • Actions: extend vs add vs replace

The Feature Augmentation Step

Instead of manually testing every preprocessing combination:

# Manual approach (tedious!)
pipeline_1 = [SNV(), FirstDerivative(), ...]
pipeline_2 = [MSC(), FirstDerivative(), ...]
pipeline_3 = [SNV(), SavitzkyGolay(), ...]
# ... many more

Use feature augmentation:

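from sklearn.preprocessing import MinMaxScaler
from nirs4all.operators.transforms import (
    StandardNormalVariate as SNV,
    MultiplicativeScatterCorrection as MSC,
)

pipeline = [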
    MinMaxScaler(),

    # Automatically generate preprocessing variants
    {
        "feature_augmentation": {
            "_or_": [SNV, MSC, FirstDerivative, SavitzkyGolay, Gaussian],
            "pick": 2,      # Pick 2 methods at a time
            "count": 5      # Generate 5 random combinations
        }
    },

    PLSRegression(n_components=10)
]

Generator Syntax Options

_or_ - Alternatives

{"_or_": [A, B, C]}  # Generates: A, B, C (3 variants)

pick - Combinations

{"_or_": [A, B, C, D], "pick": 2}
# Generates: [A,B], [A,C], [A,D], [B,C], [B,D], [C,D] (6 variants)

count - Limit

{"_or_": [A, B, C, D], "pick": 2, "count": 3}
# Generates: 3 random combinations (from the 6 possible)
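
The variant counts above follow from ordinary combinatorics. The sketch below reproduces them with `itertools`; the `expand` helper and its `seed` argument are hypothetical and only mimic the documented behavior, not nirs4all's actual generator.

```python
import itertools
import random

def expand(options, pick=None, count=None, seed=None):
    """Enumerate variants the way the _or_/pick/count keys are described above."""
    if pick is None:
        # Plain alternatives: one variant per option.
        variants = [[option] for option in options]
    else:
        # pick may be a single size or a list of sizes, e.g. [1, 2, 3].
        sizes = [pick] if isinstance(pick, int) else list(pick)
        variants = [
            list(combo)
            for size in sizes
            for combo in itertools.combinations(options, size)
        ]
    if count is not None and count < len(variants):
        # count keeps a random subset of the enumerated variants.
        variants = random.Random(seed).sample(variants, count)
    return variants

alternatives = expand(["A", "B", "C"])                  # 3 variants
pairs = expand(["A", "B", "C", "D"], pick=2)            # 6 variants
limited = expand(["A", "B", "C", "D"], pick=2, count=3, seed=0)  # 3 of the 6
```

The number of variants grows quickly with the number of options and pick sizes, which is why `count` is useful for capping the search.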

Augmentation Actions

| Action | Behavior |
|--------|----------|
| "extend" | Add generated variants to existing features |
| "add" | Stack the new transform on top of previous |
| "replace" | Replace current features with augmented versions |

# Extend: try each option separately
{"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"}

# Add: stack a derivative on top of current preprocessing
{"feature_augmentation": [FirstDerivative], "action": "add"}

Practical Example

pipeline = [
    # Base scaling
    MinMaxScaler(),

    # Explore scatter correction options
    {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"},

    # Add derivative on top
    {"feature_augmentation": [FirstDerivative], "action": "add"},

    # Cross-validation and model
    ShuffleSplit(n_splits=3),
    PLSRegression(n_components=10)
]

U03: Sample Augmentation

Data augmentation techniques for increasing sample diversity.

📄 View source code

What You’ll Learn

  • Noise injection for robustness

  • Spectral transformations

  • Sample mixing strategies

  • Augmentation during training

Sample Augmentation Techniques

While feature augmentation creates different preprocessing pipelines, sample augmentation creates synthetic training samples.

{
    "sample_augmentation": {
        "noise_injection": 0.01,      # Add Gaussian noise (1% std)
        "shift": 2,                    # Shift spectra by ±2 wavelengths
        "scale": 0.05,                 # Scale intensity by ±5%
        "mixup_alpha": 0.2,           # Mixup with alpha=0.2
        "augmentation_factor": 3       # Triple training set size
    }
}

When to Use Sample Augmentation

| Technique | Purpose | Best For |
|-----------|---------|----------|
| Noise injection | Robustness to measurement noise | Small datasets |
| Spectral shift | Robustness to wavelength calibration | Instrument transfer |
| Intensity scaling | Robustness to concentration variations | Variable samples |
| Mixup | Regularization, interpolation | Deep learning |
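
The four techniques in the table can be sketched with NumPy as follows. All function names, parameters, and the way the steps are combined are assumptions made for illustration; in particular, the noise level is treated as an absolute standard deviation and the shift as a number of wavelength indices, which may not match how nirs4all interprets the configuration keys above.

```python
import numpy as np

def add_noise(X, rng, std=0.01):
    """Add zero-mean Gaussian noise with an absolute standard deviation."""
    return X + rng.normal(scale=std, size=X.shape)

def shift(X, rng, max_shift=2):
    """Shift each spectrum by a random number of wavelength indices."""
    n_wl = X.shape[1]
    idx = np.arange(n_wl)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        k = rng.integers(-max_shift, max_shift + 1)
        # Clip indices so the edge values repeat instead of wrapping around.
        out[i] = row[np.clip(idx - k, 0, n_wl - 1)]
    return out

def scale(X, rng, amount=0.05):
    """Multiply each spectrum by a random factor in [1 - amount, 1 + amount]."""
    return X * rng.uniform(1 - amount, 1 + amount, size=(X.shape[0], 1))

def mixup(X, y, rng, alpha=0.2):
    """Blend each sample with a randomly chosen partner (mixup)."""
    lam = rng.beta(alpha, alpha, size=(X.shape[0], 1))
    perm = rng.permutation(X.shape[0])
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam.ravel() * y + (1 - lam.ravel()) * y[perm]
    return X_mix, y_mix

rng = np.random.default_rng(42)
X = rng.random((10, 100))   # 10 synthetic spectra, 100 wavelengths
y = rng.random(10)

# Triple the training set: the originals plus two perturbed copies.
factor = 3
X_parts, y_parts = [X], [y]
for _ in range(factor - 1):
    X_parts.append(scale(shift(add_noise(X, rng), rng), rng))
    y_parts.append(y)
X_aug = np.vstack(X_parts)
y_aug = np.concatenate(y_parts)

X_mix, y_mix = mixup(X, y, rng, alpha=0.2)
```

Augmentation of this kind should be applied to training folds only, so that validation scores reflect performance on unmodified spectra.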


U04: Signal Conversion

Convert between signal representations (absorbance, reflectance, etc.).

📄 View source code

What You’ll Learn

  • Converting between absorbance and reflectance

  • Log transformations

  • Standard signal formats

Common Conversions

from sklearn.cross_decomposition import PLSRegression
from nirs4all.operators.transforms import (
    AbsorbanceToReflectance,
    ReflectanceToAbsorbance,
    Log1p,
    Log10,
    StandardNormalVariate as SNV,
)

# Convert representations
pipeline = [
    ReflectanceToAbsorbance(),  # If your data is in reflectance
    SNV(),                       # Preprocessing expects absorbance
    PLSRegression(n_components=10)
]

Signal Representation Guidelines

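| Representation | Formula | Typical Range |
|----------------|---------|---------------|
| Reflectance (R) | I/I₀ (reflected / incident intensity) | 0-1 |
| Absorbance (A) | -log₁₀(R) or -log₁₀(T) | 0-3+ |
| Transmittance (T) | I/I₀ (transmitted / incident intensity) | 0-1 |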

Most NIRS preprocessing methods expect absorbance data.
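
The formulas in the table translate directly into code. The sketch below is a plain NumPy version with hypothetical function names and an assumed `eps` floor to avoid taking the logarithm of zero; it is not the implementation behind `ReflectanceToAbsorbance`.

```python
import numpy as np

def reflectance_to_absorbance(R, eps=1e-12):
    """A = -log10(R), with R clipped away from zero to keep the log finite."""
    return -np.log10(np.clip(R, eps, None))

def absorbance_to_reflectance(A):
    """R = 10 ** (-A), the inverse of the conversion above."""
    return 10.0 ** (-np.asarray(A, dtype=float))

R = np.array([1.0, 0.5, 0.1, 0.01])
A = reflectance_to_absorbance(R)          # approximately [0.0, 0.301, 1.0, 2.0]
R_back = absorbance_to_reflectance(A)     # recovers the original reflectance
```

The same pair of functions applies to transmittance, since absorbance is defined by the same logarithm of the intensity ratio.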


Preprocessing Best Practices

1. Order Matters

# Recommended order:
pipeline = [
    # 1. Signal conversion (if needed)
    ReflectanceToAbsorbance(),

    # 2. Scatter correction
    StandardNormalVariate(),

    # 3. Baseline correction (optional)
    Detrend(),

    # 4. Derivatives
    FirstDerivative(),

    # 5. Smoothing (if noisy)
    Gaussian(sigma=1),

    # 6. Feature scaling (before model)
    MinMaxScaler(),

    # 7. Model
    PLSRegression(n_components=10)
]

2. Don’t Over-Process

More preprocessing isn’t always better. Common mistakes:

  • ❌ Applying SNV after derivatives (destroys derivative information)

  • ❌ Multiple smoothing steps (over-smooths, loses peaks)

  • ❌ Second derivative on noisy data (amplifies noise)
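
The noise amplification mentioned in the last point can be quantified with a short experiment. The sketch uses plain finite differences on white noise as a stand-in for unsmoothed derivative operators; for independent noise the standard deviation grows by a factor of about √2 after one difference and about √6 after two.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(scale=1.0, size=100_000)   # white measurement noise

# Ratio of noise standard deviation after and before differencing.
ratio_first = np.diff(noise, n=1).std() / noise.std()
ratio_second = np.diff(noise, n=2).std() / noise.std()

print(f"first difference:  x{ratio_first:.2f}")
print(f"second difference: x{ratio_second:.2f}")
```

Because spectral peaks typically shrink under differencing while noise grows, the signal-to-noise ratio drops quickly, which is why a smoothed derivative such as Savitzky-Golay is usually preferred on noisy spectra.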

3. Use Visualization

pipeline = [
    "chart_2d",           # Visualize raw spectra
    SNV(),
    "chart_2d",           # Visualize after SNV
    FirstDerivative(),
    "chart_2d",           # Visualize after derivative
    PLSRegression(n_components=10)
]

4. Let the Data Decide

Use feature augmentation to find the best combination:

pipeline = [
    {
        "feature_augmentation": {
            "_or_": [SNV, MSC, Detrend, FirstDerivative, SavitzkyGolay],
            "pick": [1, 2, 3],  # Try 1, 2, or 3 methods
            "count": 10         # Generate 10 random combinations
        }
    },
    PLSRegression(n_components=10)
]

Running These Examples

cd examples

# Run all preprocessing examples
./run.sh -n "U0*.py" -c user

# Run with visualization
python user/03_preprocessing/U01_preprocessing_basics.py --plots --show

Next Steps

After mastering preprocessing:

  • Models: Compare different model architectures

  • Cross-Validation: Proper model evaluation

  • Explainability: Understand which wavelengths matter