Sample Augmentation Guide
Sample augmentation creates synthetic variations of your training samples to improve model robustness and generalization. This guide provides both a quick reference and detailed usage instructions.
Overview
Sample augmentation is useful for:
Increasing dataset size when training samples are limited
Balancing class distributions in imbalanced datasets
Improving model generalization through data diversity
Preventing overfitting by introducing controlled variations
Important
Leak Prevention: nirs4all automatically ensures augmented samples never appear in validation folds during cross-validation.
Quick Start
Standard Mode (Count-Based)
sample_augmentation:
  transformers:
    - StandardScaler: {}
    - MinMaxScaler: {}
  count: 2 # Augmentations per sample
  selection: "random" # or "all"
  random_state: 42
Balanced Mode (Class-Aware)
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y" # Balance on targets
  target_size: 100 # Samples per class
  random_state: 42
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| transformers | List | Required | List of sklearn transformers to apply |
| count | int | 1 | Number of augmentations per sample (standard mode) |
| balance | str | None | Column to balance on: "y" or a metadata column (balanced mode) |
| target_size | int | None | Target samples per class (balanced mode) |
| max_factor | float | None | Maximum augmentation multiplier (balanced mode) |
| ref_percentage | float | None | Target as a fraction of the majority class size, e.g. 0.8 for 80% (balanced mode) |
| selection | str | "random" | "random" or "all": how to assign transformers |
| random_state | int | None | Random seed for reproducibility |
Usage Modes
Standard Mode (Count-Based)
Creates a fixed number of augmented samples per base sample:
sample_augmentation:
  transformers:
    - StandardScaler: {}
    - MinMaxScaler: {}
  count: 3 # 3 augmentations per sample
  selection: "random" # or "all"
  random_state: 42
Selection options:
"random": Randomly assign transformers (default)
"all": Cycle through transformers systematically
Example with 100 samples, count=3:
Original: 100 samples
After augmentation: 400 samples (100 base + 300 augmented)
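The count arithmetic and the two selection modes can be sketched in plain Python. This is a minimal illustration, not nirs4all's implementation: `plan_augmentations` is a hypothetical helper, and the exact order in which "all" cycles through the transformers is an assumption based on the description above.

```python
import random

def plan_augmentations(n_samples, transformers, count, selection="random", random_state=None):
    """Return (sample_index, transformer_name) pairs, one per augmented sample.

    Hypothetical helper for illustration; not part of nirs4all.
    """
    rng = random.Random(random_state)
    plan = []
    for i in range(n_samples):
        for k in range(count):
            if selection == "random":
                name = rng.choice(transformers)
            elif selection == "all":
                # Assumed behavior: cycle through the transformers in order.
                name = transformers[k % len(transformers)]
            else:
                raise ValueError(f"unknown selection: {selection!r}")
            plan.append((i, name))
    return plan

plan = plan_augmentations(100, ["StandardScaler", "MinMaxScaler"], count=3, selection="all")
total = 100 + len(plan)  # base samples + augmented samples
print(total)  # 400
```

With a fixed `random_state`, the "random" plan is identical from run to run, which is what makes results reproducible.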
Balanced Mode Strategies
Three balancing strategies are available. Choose ONE:
Strategy 1: Fixed Target Size
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  target_size: 100 # Raise each smaller class to 100 samples
Example:
Initial: Class 0: 150, Class 1: 30, Class 2: 50
Result: Class 0: 150 (unchanged), Class 1: 100 (+70), Class 2: 100 (+50)
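The example above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; `augmentations_for_target_size` is a hypothetical helper, not a nirs4all function.

```python
def augmentations_for_target_size(class_counts, target_size):
    """Augmented samples to add per class so that each class reaches target_size.

    Classes already at or above target_size are left unchanged.
    Hypothetical helper for illustration; not part of nirs4all.
    """
    return {label: max(0, target_size - n) for label, n in class_counts.items()}

needed = augmentations_for_target_size({0: 150, 1: 30, 2: 50}, target_size=100)
print(needed)  # {0: 0, 1: 70, 2: 50}
```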
Strategy 2: Multiplier Factor
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  max_factor: 3.0 # Multiply each class by 3, capped at majority
Example:
Initial: Class 0: 100 (majority), Class 1: 20, Class 2: 50
Result: Class 0: 100, Class 1: 60 (20×3), Class 2: 100 (50×3 capped)
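A minimal sketch of the multiplier arithmetic follows. `augmentations_for_max_factor` is a hypothetical helper, and truncating fractional targets with `int()` is an assumption; the library may round differently.

```python
def augmentations_for_max_factor(class_counts, max_factor):
    """Augmented samples to add per class when multiplying by max_factor.

    Each class target is capped at the majority class size.
    Hypothetical helper for illustration; not part of nirs4all.
    """
    majority = max(class_counts.values())
    needed = {}
    for label, n in class_counts.items():
        target = min(int(n * max_factor), majority)  # cap at the majority class size
        needed[label] = max(0, target - n)
    return needed

needed = augmentations_for_max_factor({0: 100, 1: 20, 2: 50}, max_factor=3.0)
print(needed)  # {0: 0, 1: 40, 2: 50}
```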
Strategy 3: Reference Percentage
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  ref_percentage: 0.8 # Target 80% of majority class
Example:
Initial: Class 0: 100 (majority), Class 1: 30, Class 2: 20
Result: Class 0: 100, Class 1: 80, Class 2: 80
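The same result can be derived with a short sketch. `augmentations_for_ref_percentage` is a hypothetical helper, and rounding the target to the nearest integer is an assumption.

```python
def augmentations_for_ref_percentage(class_counts, ref_percentage):
    """Augmented samples to add per class to reach a fraction of the majority size.

    Hypothetical helper for illustration; not part of nirs4all.
    """
    majority = max(class_counts.values())
    target = round(majority * ref_percentage)
    # Classes already at or above the target (including the majority) get nothing.
    return {label: max(0, target - n) for label, n in class_counts.items()}

needed = augmentations_for_ref_percentage({0: 100, 1: 30, 2: 20}, ref_percentage=0.8)
print(needed)  # {0: 0, 1: 50, 2: 60}
```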
Binning for Regression
When balancing continuous targets, nirs4all automatically bins values:
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  bins: 10 # Create 10 virtual classes
  binning_strategy: "equal_width" # or "quantile"
  max_factor: 2.0
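To see how the two `binning_strategy` values turn a continuous target into virtual classes, here is a simplified pure-Python sketch. Both functions are hypothetical and only illustrate the general techniques; tie handling and edge conventions in nirs4all may differ.

```python
def equal_width_bins(values, bins):
    """Assign each value to one of `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]

def quantile_bins(values, bins):
    """Assign bins by rank so that each bin holds roughly the same number of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels

y = [1.0, 1.2, 1.4, 2.0, 9.0, 10.0]
print(equal_width_bins(y, 3))  # [0, 0, 0, 0, 2, 2]
print(quantile_bins(y, 3))     # [0, 0, 1, 1, 2, 2]
```

On this skewed target, equal-width binning leaves the middle bin empty, while quantile binning spreads the samples evenly. That difference determines which virtual classes the balancing step treats as minorities.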
Binning strategies:
"equal_width": Uniform bin spacing (default)
"quantile": Equal samples per bin
Pipeline Integration
Full Pipeline Example
pipeline:
  # 1. Preprocessing
  - preprocessing:
      - SNV: {}
      - SavitzkyGolay: {window_length: 11}
  # 2. Sample Augmentation (before splitting)
  - sample_augmentation:
      transformers:
        - StandardScaler: {}
      count: 2
      random_state: 42
  # 3. Cross-Validation (leak-free)
  - split:
      - StratifiedKFold:
          n_splits: 5
          shuffle: true
          random_state: 42
  # 4. Model Training
  - model:
      - PLSRegression:
          n_components: 10
Sequential Augmentation
When several augmentation steps run in sequence, each one targets only the base samples, so the totals add up rather than compound:
pipeline:
  - sample_augmentation:
      transformers:
        - StandardScaler: {}
      count: 1
  - sample_augmentation:
      transformers:
        - MinMaxScaler: {}
      count: 1
# Result: 100 base → 300 total (100 + 100 + 100)
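The additive behavior can be checked with a small sketch. Both helpers are hypothetical; the second shows, for contrast only, what the total would be if later steps also re-augmented earlier augmented samples.

```python
def total_after_sequential(n_base, counts):
    """Total samples when every step augments only the base samples."""
    return n_base + sum(n_base * c for c in counts)

def total_if_compounding(n_base, counts):
    """For contrast: total if each step also augmented earlier augmented samples."""
    total = n_base
    for c in counts:
        total += total * c
    return total

print(total_after_sequential(100, [1, 1]))  # 300
print(total_if_compounding(100, [1, 1]))    # 400
```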
Recommended Transformers
| Transformer | Use Case | Pros | Cons |
|---|---|---|---|
| StandardScaler | General purpose | Stable, well-tested | Assumes normality |
| MinMaxScaler | Bounded features | [0,1] range | Sensitive to outliers |
| RobustScaler | Noisy data | Outlier-resistant | Slower |
| MaxAbsScaler | Sparse data | Preserves sparsity | Less common |
Best Practices
✅ DO
Set random_state for reproducibility
Start with count=1 or 2
Use balanced mode for imbalanced data
Test different transformers
Monitor validation performance
❌ DON’T
Over-augment (count > 5 usually unnecessary)
Mix incompatible transformers
Forget to validate results
Augment test/validation data
Ignore computation cost
Augmentation Count Guidelines
| Dataset Size | Recommended Count |
|---|---|
| Small (<100 samples) | 2-5 |
| Medium (100-1000) | 1-3 |
| Large (>1000) | 1-2 or balanced mode |
Dataset API
Python API
# Augment samples manually
dataset.augment_samples(
    data=transformed_data,
    processings=["proc_name"],
    augmentation_id="unique_id",
    selector={"partition": "train"},  # Optional
    count=2
)
# Get data with/without augmented
X_all = dataset.x({}, include_augmented=True)
X_base = dataset.x({}, include_augmented=False)
# Get augmented samples for origins
aug_indices = dataset._indexer.get_augmented_for_origins([0, 1, 2])
# Get origin for augmented sample
origin_idx = dataset._indexer.get_origin_for_sample(10)
How It Works
Architecture
Base Samples (n samples)
↓
SampleAugmentationController
↓ (delegates to)
TransformerMixinController (applies transformations)
↓
Dataset.augment_samples() (stores with origin tracking)
↓
Augmented Dataset (n base + m augmented samples)
↓
Split Controller (uses only base samples for splitting)
↓
Training Folds (can access augmented samples)
Validation Folds (only base samples, leak-free!)
Leak Prevention
The system prevents augmented samples from leaking into validation:
Origin Tracking: Every augmented sample stores its origin sample index
Two-Phase Selection: CV splits use include_augmented=False
Metadata Inheritance: Augmented samples inherit all metadata from origins
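The idea behind origin tracking and two-phase selection can be sketched without the library. This is an illustration under stated assumptions, not nirs4all's code: `leak_free_folds` and the `origin_of` mapping are hypothetical, and the sketch assumes that an augmented sample joins a training fold only when its origin is also in that training fold.

```python
def leak_free_folds(base_indices, origin_of, n_splits):
    """Yield (train, val) index lists; val contains base samples only.

    origin_of maps each augmented sample index to its base (origin) index.
    Hypothetical helper for illustration; not part of nirs4all.
    """
    for k in range(n_splits):
        # Phase 1: split the base samples only.
        val = [i for pos, i in enumerate(base_indices) if pos % n_splits == k]
        val_set = set(val)
        train_base = [i for i in base_indices if i not in val_set]
        # Phase 2: add augmented samples whose origin is in the training fold.
        train_aug = [a for a, o in origin_of.items() if o not in val_set]
        yield train_base + train_aug, val

base = [0, 1, 2, 3]
origin_of = {4: 0, 5: 1, 6: 2, 7: 3}  # augmented index -> origin index
folds = list(leak_free_folds(base, origin_of, n_splits=2))
print(folds[0])  # ([1, 3, 5, 7], [0, 2])
```

Neither an augmented sample nor any variation of a validation sample ends up on the wrong side of the split, which is the property the three mechanisms above are meant to guarantee.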
Memory Formula
Memory ≈ base_samples_size + (augmentation_count × features_size)
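As a rough back-of-the-envelope sketch of this formula, the helper below assumes that `features_size` is the size of one sample's feature vector (number of features × bytes per value) and that `augmentation_count` is the total number of augmented samples. Both readings are assumptions, and `estimate_memory_bytes` is a hypothetical helper.

```python
def estimate_memory_bytes(n_base, n_augmented, n_features, bytes_per_value=8):
    """Approximate memory for the feature matrix of base plus augmented samples.

    Assumes dense storage with 8-byte (float64) values by default.
    """
    features_size = n_features * bytes_per_value  # one sample's feature vector
    base_samples_size = n_base * features_size
    return base_samples_size + n_augmented * features_size

estimate = estimate_memory_bytes(n_base=100, n_augmented=300, n_features=2000)
print(f"{estimate / 1e6:.1f} MB")  # 6.4 MB
```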
Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| "Processing 'X' not found" | Wrong processing name | Check transformer output names |
| High memory usage | Too many augmented samples | Reduce count or use max_factor |
| Poor performance | Over-augmentation | Reduce count, try different transformers |
| Inconsistent results | No random_state | Set random_state parameter |
| Validation too optimistic | Data leakage | Verify CV splits use include_augmented=False |
See Also
NIRS Data Augmentation in nirs4all - Overview of augmentation methods
Synthetic NIRS Spectra Generator - Scientific Documentation - Generate synthetic NIRS data
Operator Catalog - Complete operator reference