# Sample Augmentation Guide

Sample augmentation creates synthetic variations of your training samples to improve model robustness and generalization. This guide provides a quick reference followed by detailed usage.

## Overview

Sample augmentation is useful for:
- **Increasing dataset size** when training samples are limited
- **Balancing class distributions** in imbalanced datasets
- **Improving model generalization** through data diversity
- **Preventing overfitting** by introducing controlled variations

:::{important}
**Leak Prevention**: nirs4all automatically ensures augmented samples never appear in validation folds during cross-validation.
:::

## Quick Start

### Standard Mode (Count-Based)

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
    - MinMaxScaler: {}
  count: 2  # Augmentations per sample
  selection: "random"  # or "all"
  random_state: 42
```

### Balanced Mode (Class-Aware)

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"  # Balance on targets
  target_size: 100  # Samples per class
  random_state: 42
```

## Parameter Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `transformers` | List | Required | List of sklearn transformers to apply |
| `count` | int | 1 | Number of augmentations per sample (standard mode) |
| `balance` | str | None | Column to balance on: "y" or metadata column (balanced mode) |
| `target_size` | int | None | Target samples per class (balanced mode) |
| `max_factor` | float | None | Maximum growth multiplier per class, capped at the majority class size (balanced mode) |
| `ref_percentage` | float | None | Target class size as a fraction of the majority class, e.g. `0.8` for 80% (balanced mode) |
| `selection` | str | "random" | "random" or "all" - how to assign transformers |
| `random_state` | int | None | Random seed for reproducibility |

## Usage Modes

### Standard Mode (Count-Based)

Creates a fixed number of augmented samples per base sample:

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
    - MinMaxScaler: {}
  count: 3  # 3 augmentations per sample
  selection: "random"  # or "all"
  random_state: 42
```

**Selection options:**
- `"random"`: Pick a transformer at random for each augmented sample (default)
- `"all"`: Cycle through the transformers in order

**Example with 100 samples, count=3:**
- Original: 100 samples
- After augmentation: 400 samples (100 base + 300 augmented)
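
The two selection options and the resulting sample count can be sketched in plain Python. This is a minimal illustration, not nirs4all's implementation: `assign_transformers` is a hypothetical helper, and the exact order in which `"all"` cycles through transformers is an assumption.

```python
import random
from itertools import cycle, islice


def assign_transformers(n_samples, transformers, count, selection="random", random_state=None):
    """Return (origin_index, transformer_name) pairs, one per augmented sample."""
    n_augmented = n_samples * count
    if selection == "random":
        rng = random.Random(random_state)  # seeded for reproducibility
        names = [rng.choice(transformers) for _ in range(n_augmented)]
    elif selection == "all":
        # Assumed behaviour: walk through the transformer list in order, repeatedly.
        names = list(islice(cycle(transformers), n_augmented))
    else:
        raise ValueError(f"unknown selection: {selection!r}")
    origins = [origin for origin in range(n_samples) for _ in range(count)]
    return list(zip(origins, names))


plan = assign_transformers(100, ["StandardScaler", "MinMaxScaler"], count=3, selection="all")
total = 100 + len(plan)
print(total)  # 400 (100 base + 300 augmented)
```

With `selection="random"`, passing the same `random_state` yields the same assignment on every run.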

### Balanced Mode Strategies

Three balancing strategies are available. Set exactly one of `target_size`, `max_factor`, or `ref_percentage`:

#### Strategy 1: Fixed Target Size

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  target_size: 100  # Raise each class to at least 100 samples
```

**Example:**
```
Initial: Class 0: 150, Class 1: 30, Class 2: 50
Result:  Class 0: 150 (unchanged), Class 1: 100 (+70), Class 2: 100 (+50)
```

#### Strategy 2: Multiplier Factor

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  max_factor: 3.0  # Multiply each class by 3, capped at majority
```

**Example:**
```
Initial: Class 0: 100 (majority), Class 1: 20, Class 2: 50
Result:  Class 0: 100, Class 1: 60 (20×3), Class 2: 100 (50×3 capped)
```

#### Strategy 3: Reference Percentage

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  ref_percentage: 0.8  # Target 80% of majority class
```

**Example:**
```
Initial: Class 0: 100 (majority), Class 1: 30, Class 2: 20
Result:  Class 0: 100, Class 1: 80, Class 2: 80
```
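
The arithmetic behind the three strategies can be reproduced with a short sketch. `balanced_targets` is a hypothetical helper written for this guide, not a nirs4all function. It assumes that classes are never reduced (the examples above leave the majority class unchanged) and that fractional targets are rounded to the nearest integer.

```python
def balanced_targets(counts, target_size=None, max_factor=None, ref_percentage=None):
    """Return the per-class sample count after balancing."""
    chosen = [p for p in (target_size, max_factor, ref_percentage) if p is not None]
    if len(chosen) != 1:
        raise ValueError("set exactly one of target_size, max_factor, ref_percentage")
    majority = max(counts.values())
    result = {}
    for label, n in counts.items():
        if target_size is not None:
            goal = target_size
        elif max_factor is not None:
            goal = min(int(n * max_factor), majority)  # capped at the majority class
        else:
            goal = round(majority * ref_percentage)
        result[label] = max(n, goal)  # augmentation only adds samples
    return result


print(balanced_targets({0: 150, 1: 30, 2: 50}, target_size=100))
print(balanced_targets({0: 100, 1: 20, 2: 50}, max_factor=3.0))
print(balanced_targets({0: 100, 1: 30, 2: 20}, ref_percentage=0.8))
```

Each call reproduces one of the three examples above; passing two strategy parameters at once raises a `ValueError`.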

### Binning for Regression

When `balance` refers to a continuous target, nirs4all bins the values into virtual classes and then balances those bins:

```yaml
sample_augmentation:
  transformers:
    - StandardScaler: {}
  balance: "y"
  bins: 10  # Create 10 virtual classes
  binning_strategy: "equal_width"  # or "quantile"
  max_factor: 2.0
```

**Binning strategies:**
- `"equal_width"`: Uniform bin spacing (default)
- `"quantile"`: Approximately equal number of samples per bin
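
The difference between the two strategies is easiest to see on skewed data. The sketch below uses only the standard library; both functions are hypothetical illustrations rather than nirs4all internals, and the rank-based quantile version ignores ties for simplicity.

```python
def equal_width_bins(y, bins):
    """Assign each value to one of `bins` intervals of equal width."""
    lo, hi = min(y), max(y)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), bins - 1) for v in y]


def quantile_bins(y, bins):
    """Assign values to bins by rank so each bin holds about the same number of samples."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    labels = [0] * len(y)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(y)
    return labels


y = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 9.9]  # one outlier stretches the range
print(equal_width_bins(y, 2))  # [0, 0, 0, 0, 0, 0, 0, 1]
print(quantile_bins(y, 2))     # [0, 0, 0, 0, 1, 1, 1, 1]
```

With equal-width bins the outlier ends up alone in its bin, so balancing would augment it heavily. Quantile bins keep the virtual classes the same size.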

## Pipeline Integration

### Full Pipeline Example

```yaml
pipeline:
  # 1. Preprocessing
  - preprocessing:
      - SNV: {}
      - SavitzkyGolay: {window_length: 11}

  # 2. Sample Augmentation (before splitting)
  - sample_augmentation:
      transformers:
        - StandardScaler: {}
      count: 2
      random_state: 42

  # 3. Cross-Validation (leak-free). StratifiedKFold expects discrete class labels;
  #    consider KFold when the target is continuous.
  - split:
      - StratifiedKFold:
          n_splits: 5
          shuffle: true
          random_state: 42

  # 4. Model Training
  - model:
      - PLSRegression:
          n_components: 10
```

### Sequential Augmentation

Each augmentation step operates on base samples only, so samples created by an earlier step are never augmented again:

```yaml
pipeline:
  - sample_augmentation:
      transformers:
        - StandardScaler: {}
      count: 1

  - sample_augmentation:
      transformers:
        - MinMaxScaler: {}
      count: 1
# Result: 100 base → 300 total (100 + 100 + 100)
```
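
The count in the comment above follows from each step selecting only samples without an origin. A minimal sketch, assuming a hypothetical list-of-dicts representation in which base samples have `origin` set to `None`:

```python
def augment_round(samples, count, tag):
    """Append `count` augmented copies for every base sample; augmented ones are skipped."""
    base = [s for s in samples if s["origin"] is None]
    new = [
        {"id": f"{b['id']}-{tag}{k}", "origin": b["id"]}
        for b in base
        for k in range(count)
    ]
    return samples + new


samples = [{"id": i, "origin": None} for i in range(100)]
samples = augment_round(samples, count=1, tag="std")     # 100 base + 100 augmented
samples = augment_round(samples, count=1, tag="minmax")  # adds another 100, not 200
print(len(samples))  # 300
```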

## Recommended Transformers

| Transformer | Use Case | Pros | Cons |
|-------------|----------|------|------|
| `StandardScaler` | General purpose | Stable, well-tested | Works best on roughly Gaussian features |
| `MinMaxScaler` | Bounded features | Output in [0, 1] by default | Sensitive to outliers |
| `RobustScaler` | Noisy data | Outlier-resistant | Slower |
| `MaxAbsScaler` | Sparse data | Preserves sparsity | Less common |

## Best Practices

### ✅ DO

- Set `random_state` for reproducibility
- Start with count=1 or 2
- Use balanced mode for imbalanced data
- Test different transformers
- Monitor validation performance

### ❌ DON'T

- Over-augment (count > 5 usually unnecessary)
- Mix incompatible transformers
- Forget to validate results
- Augment test/validation data
- Ignore computation cost

### Augmentation Count Guidelines

| Dataset Size | Recommended Count |
|--------------|-------------------|
| Small (<100 samples) | 2-5 |
| Medium (100-1000) | 1-3 |
| Large (>1000) | 1-2 or balanced mode |

## Dataset API

### Python API

```python
# Assumes `dataset` (a nirs4all dataset) and `transformed_data` are already defined.
# Augment samples manually
dataset.augment_samples(
    data=transformed_data,
    processings=["proc_name"],
    augmentation_id="unique_id",
    selector={"partition": "train"},  # Optional
    count=2
)

# Get data with/without augmented
X_all = dataset.x({}, include_augmented=True)
X_base = dataset.x({}, include_augmented=False)

# `_indexer` is a private attribute, so these helpers may change between versions.
# Get augmented samples for origins
aug_indices = dataset._indexer.get_augmented_for_origins([0, 1, 2])

# Get origin for augmented sample
origin_idx = dataset._indexer.get_origin_for_sample(10)
```

## How It Works

### Architecture

```
Base Samples (n samples)
    ↓
SampleAugmentationController
    ↓ (delegates to)
TransformerMixinController (applies transformations)
    ↓
Dataset.augment_samples() (stores with origin tracking)
    ↓
Augmented Dataset (n base + m augmented samples)
    ↓
Split Controller (uses only base samples for splitting)
    ↓
Training Folds (can access augmented samples)
Validation Folds (only base samples, leak-free!)
```

### Leak Prevention

The system prevents augmented samples from leaking into validation:

1. **Origin Tracking**: Every augmented sample stores its origin sample index
2. **Two-Phase Selection**: CV splits are computed on base samples only (`include_augmented=False`); augmented samples are then added to training folds only
3. **Metadata Inheritance**: Augmented samples inherit all metadata from origins

### Memory Formula

```
Memory ≈ base_samples_size + (n_augmented_samples × size_of_one_sample)
```
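
For a rough number, the formula above can be evaluated as follows. This is a back-of-the-envelope sketch: `estimate_memory_bytes` is a hypothetical helper that assumes standard mode, dense float64 storage (8 bytes per value), a single processing, and no indexing overhead.

```python
def estimate_memory_bytes(n_base, n_features, count, bytes_per_value=8):
    """Rough dense-array footprint of base plus augmented samples."""
    n_augmented = n_base * count  # standard mode: `count` copies per base sample
    return (n_base + n_augmented) * n_features * bytes_per_value


size = estimate_memory_bytes(n_base=100, n_features=2000, count=3)
print(f"{size / 1e6:.1f} MB")  # 6.4 MB
```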

## Troubleshooting

| Symptom | Cause | Solution |
|---------|-------|----------|
| "Processing 'X' not found" | Wrong processing name | Check transformer output names |
| High memory usage | Too many augmented samples | Reduce `count` or use `max_factor` |
| Poor performance | Over-augmentation | Reduce `count`, try different transformers |
| Inconsistent results | No `random_state` | Set the `random_state` parameter |
| Validation too optimistic | Data leakage | Verify CV splits use `include_augmented=False` |

## See Also

- {doc}`augmentations` - Overview of augmentation methods
- {doc}`synthetic_nirs_generator` - Generate synthetic NIRS data
- {doc}`/reference/operator_catalog` - Complete operator reference