Preprocessing Examples

This section covers NIRS-specific preprocessing techniques, from basic transformations to automated exploration of preprocessing combinations.

Overview 

Example	Topic	Difficulty	Duration
U01	Preprocessing Basics	★★☆☆☆	~3 min
U02	Feature Augmentation	★★☆☆☆	~3 min
U03	Sample Augmentation	★★☆☆☆	~3 min
U04	Signal Conversion	★★☆☆☆	~2 min

U01: Preprocessing Basics 

Overview of standard NIRS preprocessing techniques.

📄 View source code

What You’ll Learn 

Scatter correction: SNV, MSC
Baseline correction: Detrend
Derivatives: First, Second, Savitzky-Golay
Smoothing: Gaussian, Savitzky-Golay
Wavelet transforms: Haar

Preprocessing Categories 

NIRS preprocessing addresses common spectral issues:

📊 Scatter Correction

Corrects for variations in light scattering due to sample structure:

from nirs4all.operators.transforms import (
    StandardNormalVariate,
    MultiplicativeScatterCorrection
)

# SNV: Per-sample mean-centering and scaling
StandardNormalVariate()

# MSC: Regression-based correction using reference spectrum
MultiplicativeScatterCorrection()

Method	How it Works	When to Use
SNV	Centers and scales each spectrum individually	Path length variations, quick scatter correction
MSC	Regresses each spectrum against a reference (mean)	More robust to baseline variations
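
To make the table concrete, here is a minimal NumPy sketch of what the two corrections compute. The functions `snv` and `msc` are hypothetical stand-ins written for this illustration, not the nirs4all implementations, which may differ in details such as how the reference spectrum is chosen.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        # Least-squares fit: row ≈ intercept + slope * ref
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected

# Toy data: one underlying spectrum observed with different gains and offsets
base = np.sin(np.linspace(0, 3, 50)) + 2.0
X = np.vstack([1.0 * base + 0.0, 1.5 * base + 0.3, 0.7 * base - 0.2])

X_snv = snv(X)  # every row collapses onto the same standardized shape
X_msc = msc(X)  # every row collapses onto the mean spectrum
```

On this toy data both methods remove the artificial gain and offset, so all corrected rows coincide. Real spectra will only be brought closer together.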

📈 Baseline Correction

Removes baseline drift from spectra:

from nirs4all.operators.transforms import Detrend

# Remove polynomial baseline drift
Detrend()  # Default: linear detrending
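
As a sketch of the idea behind detrending, the snippet below fits a polynomial to each spectrum and subtracts it. The `detrend` function here is a hypothetical helper for illustration and assumes evenly spaced channels; it is not the library's `Detrend` class.

```python
import numpy as np

def detrend(spectra, degree=1):
    """Subtract a least-squares polynomial fit from each spectrum (row)."""
    x = np.arange(spectra.shape[1])
    out = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        coeffs = np.polyfit(x, row, deg=degree)
        out[i] = row - np.polyval(coeffs, x)
    return out

# A Gaussian peak sitting on a sloped, offset baseline
x = np.arange(100)
peak = np.exp(-0.5 * ((x - 50) / 5.0) ** 2)
drifting = peak + 0.01 * x + 0.5

flat = detrend(drifting[np.newaxis, :])[0]
```

Because the fit is linear in the data, the added line `0.01 * x + 0.5` is removed exactly, and the result equals the detrended peak alone.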

📉 Derivatives

Enhance peaks and remove baselines:

from nirs4all.operators.transforms import (
    FirstDerivative,
    SecondDerivative,
    SavitzkyGolay
)

# Simple derivatives
FirstDerivative()   # Removes constant baseline
SecondDerivative()  # Removes linear baseline

# Smoothed derivative (recommended for noisy data)
SavitzkyGolay(window_length=11, polyorder=2, deriv=1)

Derivative	Effect	Use Case
First	Enhances peaks, removes constant baseline	General baseline issues
Second	Stronger enhancement, removes linear baseline	Complex baselines
Savitzky-Golay	Smoothed derivatives	Noisy spectra
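
The baseline claims in the table can be checked with plain finite differences. This is a sketch using `np.diff`, which shortens the signal by one point per derivative order; the nirs4all transforms may handle edges differently.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)
signal = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)

with_offset = signal + 0.3            # constant baseline
with_slope = signal + 0.3 + 2.0 * x   # linear baseline

d1 = np.diff(signal, n=1)
d2 = np.diff(signal, n=2)

# First difference cancels the constant offset
offset_removed = np.allclose(np.diff(with_offset, n=1), d1)
# Second difference cancels the linear baseline as well
slope_removed = np.allclose(np.diff(with_slope, n=2), d2)
# First difference alone leaves a constant shift from the slope
slope_survives_d1 = not np.allclose(np.diff(with_slope, n=1), d1)
```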

🔊 Smoothing

Reduce noise while preserving spectral features:

from nirs4all.operators.transforms import Gaussian, SavitzkyGolay

# Gaussian convolution
Gaussian(sigma=2)

# Polynomial smoothing (no derivative)
SavitzkyGolay(window_length=11, polyorder=2, deriv=0)
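
For intuition, Gaussian smoothing can be sketched as a convolution with a normalized bell-shaped kernel. The helper `gaussian_smooth` is hypothetical, and the 3-sigma kernel radius and edge padding are assumptions of this example rather than documented behavior of the `Gaussian` transform.

```python
import numpy as np

def gaussian_smooth(spectrum, sigma=2.0):
    """Smooth a 1-D spectrum with a Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # weights sum to 1, so flat regions are preserved
    padded = np.pad(spectrum, radius, mode="edge")  # repeat edge values
    return np.convolve(padded, kernel, mode="valid")

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
clean = np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)
noisy = clean + rng.normal(scale=0.05, size=x.size)

smoothed = gaussian_smooth(noisy, sigma=2.0)
```

A larger `sigma` removes more noise but also flattens narrow peaks, which is the trade-off to watch when tuning it.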

🌊 Wavelet Transforms

Multi-resolution analysis:

from nirs4all.operators.transforms import Haar

Haar()  # Haar wavelet transform
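
To show what a Haar decomposition does, here is a single-level sketch in NumPy. The functions `haar_step` and `haar_step_inverse` are hypothetical helpers; how the library's `Haar` transform arranges its output, and how many levels it applies, is not covered here.

```python
import numpy as np

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length signal."""
    signal = np.asarray(signal, dtype=float)
    if signal.size % 2:
        raise ValueError("signal length must be even")
    even, odd = signal[0::2], signal[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # coarse trend (pairwise averages)
    detail = (even - odd) / np.sqrt(2.0)  # local change (pairwise differences)
    return approx, detail

def haar_step_inverse(approx, detail):
    """Rebuild the original signal from one level of Haar coefficients."""
    out = np.empty(approx.size * 2)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

spectrum = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_step(spectrum)
restored = haar_step_inverse(approx, detail)
```

Applying `haar_step` again to `approx` gives the next, coarser resolution level, which is where the "multi-resolution" name comes from.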

Combining Preprocessing Steps 

Common combinations for NIRS data:

# Combination 1: Scatter + Derivative
pipeline = [
    StandardNormalVariate(),
    FirstDerivative(),
    PLSRegression(n_components=10)
]

# Combination 2: Full preprocessing chain
pipeline = [
    Detrend(),
    MultiplicativeScatterCorrection(),
    SavitzkyGolay(window_length=11, polyorder=2, deriv=1),
    PLSRegression(n_components=10)
]

Comparing Methods 

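# Run pipelines with different preprocessing
import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from nirs4all.operators.transforms import (
    StandardNormalVariate,
    MultiplicativeScatterCorrection,
    FirstDerivative,
    SavitzkyGolay
)

methods = {
    'SNV': StandardNormalVariate(),
    'MSC': MultiplicativeScatterCorrection(),
    'D1': FirstDerivative(),
    'SG': SavitzkyGolay(deriv=1),
}

# Same split and model for every method, so only the preprocessing differs
for name, method in methods.items():
    pipeline = [method, ShuffleSplit(n_splits=3), PLSRegression(n_components=10)]
    result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")
    print(f"{name}: RMSE = {result.best_rmse:.4f}")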

U02: Feature Augmentation 

Automatically explore preprocessing combinations.

📄 View source code

What You’ll Learn 

Using feature_augmentation to generate variants
The _or_ generator syntax
Pick, count, and combination controls
Actions: extend vs add vs replace

The Feature Augmentation Step 

Instead of manually testing every preprocessing combination:

# Manual approach (tedious!)
pipeline_1 = [SNV(), FirstDerivative(), ...]
pipeline_2 = [MSC(), FirstDerivative(), ...]
pipeline_3 = [SNV(), SavitzkyGolay(), ...]
# ... many more

Use feature augmentation:

pipeline = [
    MinMaxScaler(),

    # Automatically generate preprocessing variants
    {
        "feature_augmentation": {
            "_or_": [SNV, MSC, FirstDerivative, SavitzkyGolay, Gaussian],
            "pick": 2,      # Pick 2 methods at a time
            "count": 5      # Generate 5 random combinations
        }
    },

    PLSRegression(n_components=10)
]

Generator Syntax Options 

`_or_` - Alternatives

{"_or_": [A, B, C]}  # Generates: A, B, C (3 variants)

`pick` - Combinations

{"_or_": [A, B, C, D], "pick": 2}
# Generates: [A,B], [A,C], [A,D], [B,C], [B,D], [C,D] (6 variants)

`count` - Limit

{"_or_": [A, B, C, D], "pick": 2, "count": 3}
# Generates: 3 random combinations (from the 6 possible)
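
The expansion rules above can be mimicked with the standard library. This is only a sketch of the counting logic; `expand` is a hypothetical function, and the real generator may order or sample variants differently.

```python
from itertools import combinations
import random

def expand(options, pick=None, count=None, seed=0):
    """Enumerate variants the way `_or_`, `pick`, and `count` are described above."""
    if pick is None:
        # `_or_` alone: one variant per alternative
        variants = [[option] for option in options]
    else:
        # `pick`: every unordered combination of that size
        variants = [list(combo) for combo in combinations(options, pick)]
    if count is not None and count < len(variants):
        # `count`: keep a random subset of the variants
        variants = random.Random(seed).sample(variants, count)
    return variants

options = ["A", "B", "C", "D"]
alternatives = expand(options[:3])           # 3 variants
pairs = expand(options, pick=2)              # 6 variants
limited = expand(options, pick=2, count=3)   # 3 of the 6 pairs
```

The number of variants grows quickly with the option list: picking 2 out of 10 methods already yields 45 combinations, which is why `count` is useful.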

Augmentation Actions 

Action	Behavior
`"extend"`	Add generated variants to existing features
`"add"`	Stack the new transform on top of previous
`"replace"`	Replace current features with augmented versions

# Extend: try each option separately
{"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"}

# Add: stack a derivative on top of current preprocessing
{"feature_augmentation": [FirstDerivative], "action": "add"}

Practical Example 

pipeline = [
    # Base scaling
    MinMaxScaler(),

    # Explore scatter correction options
    {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"},

    # Add derivative on top
    {"feature_augmentation": [FirstDerivative], "action": "add"},

    # Cross-validation and model
    ShuffleSplit(n_splits=3),
    PLSRegression(n_components=10)
]

U03: Sample Augmentation 

Data augmentation techniques for increasing sample diversity.

📄 View source code

What You’ll Learn 

Noise injection for robustness
Spectral transformations
Sample mixing strategies
Augmentation during training

Sample Augmentation Techniques 

While feature augmentation creates different preprocessing pipelines, sample augmentation creates synthetic training samples.

{
    "sample_augmentation": {
        "noise_injection": 0.01,      # Add Gaussian noise (1% std)
        "shift": 2,                    # Shift spectra by ±2 wavelengths
        "scale": 0.05,                 # Scale intensity by ±5%
        "mixup_alpha": 0.2,           # Mixup with alpha=0.2
        "augmentation_factor": 3       # Triple training set size
    }
}

When to Use Sample Augmentation 

Technique	Purpose	Best For
Noise injection	Robustness to measurement noise	Small datasets
Spectral shift	Robustness to wavelength calibration	Instrument transfer
Intensity scaling	Robustness to concentration variations	Variable samples
Mixup	Regularization, interpolation	Deep learning
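
The four techniques in the table can be sketched in NumPy as follows. All function names are hypothetical, and the details (absolute noise standard deviation, whole-channel shifts with repeated edge values, Beta-distributed mixing weights) are assumptions of this example, not a description of how the `sample_augmentation` step is implemented.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(spectra, std=0.01):
    """Add independent Gaussian noise to every channel."""
    return spectra + rng.normal(scale=std, size=spectra.shape)

def shift(spectra, max_shift=2):
    """Shift each spectrum by a random whole number of channels."""
    out = np.empty_like(spectra)
    n = spectra.shape[1]
    for i, row in enumerate(spectra):
        k = int(rng.integers(-max_shift, max_shift + 1))
        idx = np.clip(np.arange(n) - k, 0, n - 1)  # repeat edge values
        out[i] = row[idx]
    return out

def scale(spectra, max_change=0.05):
    """Multiply each spectrum by a random factor close to 1."""
    factors = rng.uniform(1 - max_change, 1 + max_change, size=(spectra.shape[0], 1))
    return spectra * factors

def mixup(spectra, targets, alpha=0.2):
    """Blend each sample with a random partner; targets are blended the same way."""
    lam = rng.beta(alpha, alpha, size=(spectra.shape[0], 1))
    partner = rng.permutation(spectra.shape[0])
    mixed_x = lam * spectra + (1 - lam) * spectra[partner]
    mixed_y = lam[:, 0] * targets + (1 - lam[:, 0]) * targets[partner]
    return mixed_x, mixed_y

X = rng.random((10, 50))   # 10 toy spectra, 50 channels
y = rng.random(10)

# Original samples plus two augmented copies: three times the training set
X_aug = np.vstack([X, add_noise(X), scale(shift(X))])
y_aug = np.concatenate([y, y, y])
X_mix, y_mix = mixup(X, y)
```

Augmented samples should only be added to the training folds. Augmenting validation or test data would make the reported scores hard to interpret.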

U04: Signal Conversion 

Convert between signal representations (absorbance, reflectance, etc.).

📄 View source code

What You’ll Learn 

Converting between absorbance and reflectance
Log transformations
Standard signal formats

Common Conversions 

from nirs4all.operators.transforms import (
    AbsorbanceToReflectance,
    ReflectanceToAbsorbance,
    Log1p,
    Log10
)

# Convert representations
pipeline = [
    ReflectanceToAbsorbance(),  # If your data is in reflectance
    SNV(),                       # Preprocessing expects absorbance
    PLSRegression(n_components=10)
]

Signal Representation Guidelines 

Representation	Formula	Typical Range
Reflectance (R)	I_reflected/I₀	0-1
Absorbance (A)	-log₁₀(R) or -log₁₀(T)	0-3+
Transmittance (T)	I_transmitted/I₀	0-1

Most NIRS preprocessing methods expect absorbance data.
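
The conversion between the two representations follows directly from the formulas above. The sketch below uses hypothetical helper functions and clips very small reflectance values, which is an assumption of this example to avoid taking the logarithm of zero.

```python
import numpy as np

def reflectance_to_absorbance(reflectance, eps=1e-12):
    """A = -log10(R), with R clipped away from zero."""
    return -np.log10(np.clip(reflectance, eps, None))

def absorbance_to_reflectance(absorbance):
    """R = 10 ** (-A)."""
    return 10.0 ** (-np.asarray(absorbance, dtype=float))

R = np.array([1.0, 0.5, 0.1, 0.01])
A = reflectance_to_absorbance(R)        # approximately [0, 0.301, 1, 2]
R_back = absorbance_to_reflectance(A)   # recovers the original reflectance
```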

Preprocessing Best Practices 

1. Order Matters 

# Recommended order:
pipeline = [
    # 1. Signal conversion (if needed)
    ReflectanceToAbsorbance(),

    # 2. Scatter correction
    StandardNormalVariate(),

    # 3. Baseline correction (optional)
    Detrend(),

    # 4. Derivatives
    FirstDerivative(),

    # 5. Smoothing (if noisy)
    Gaussian(sigma=1),

    # 6. Feature scaling (before model)
    MinMaxScaler(),

    # 7. Model
    PLSRegression(n_components=10)
]

2. Don’t Over-Process 

More preprocessing isn’t always better. Common mistakes:

❌ Applying SNV after derivatives (rescaling each spectrum can distort derivative information)
❌ Multiple smoothing steps (over-smooths, loses peaks)
❌ Second derivative on noisy data (amplifies noise)
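
The noise amplification in the last point can be quantified on pure white noise. This sketch uses plain finite differences, for which the noise standard deviation grows by a factor of about √2 for the first derivative and √6 for the second; smoothed derivatives such as Savitzky-Golay reduce this effect.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(scale=1.0, size=100_000)  # white noise, unit variance

std_raw = noise.std()
std_d1 = np.diff(noise, n=1).std()  # weights (1, -1): variance x 2
std_d2 = np.diff(noise, n=2).std()  # weights (1, -2, 1): variance x 6

print(f"first difference amplifies noise by  {std_d1 / std_raw:.2f}x")
print(f"second difference amplifies noise by {std_d2 / std_raw:.2f}x")
```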

3. Use Visualization 

pipeline = [
    "chart_2d",           # Visualize raw spectra
    SNV(),
    "chart_2d",           # Visualize after SNV
    FirstDerivative(),
    "chart_2d",           # Visualize after derivative
    PLSRegression(n_components=10)
]

4. Let the Data Decide 

Use feature augmentation to find the best combination:

pipeline = [
    {
        "feature_augmentation": {
            "_or_": [SNV, MSC, Detrend, FirstDerivative, SavitzkyGolay],
            "pick": [1, 2, 3],  # Try 1, 2, or 3 methods
            "count": 10         # Generate 10 random combinations
        }
    },
    PLSRegression(n_components=10)
]

Running These Examples 

cd examples

# Run all preprocessing examples
./run.sh -n "U0*.py" -c user

# Run with visualization
python user/03_preprocessing/U01_preprocessing_basics.py --plots --show

Next Steps 

After mastering preprocessing:

Models: Compare different model architectures
Cross-Validation: Proper model evaluation
Explainability: Understand which wavelengths matter
