nirs4all.controllers.data.balancing module

Balancing utilities for sample augmentation.

This module provides utilities to calculate augmentation counts for balanced datasets and to apply random transformer selection strategies.

class nirs4all.controllers.data.balancing.BalancingCalculator

Bases: object

Calculate augmentation counts for balanced datasets.

static apply_random_transformer_selection(transformers: List, augmentation_counts: Dict[int, int], random_state: int | None = None) → Dict[int, List[int]]

Randomly select transformers for each augmentation.

This method assigns transformer indices to each sample’s augmentations, supporting reproducible randomization via random_state.

Parameters:
  • transformers – List of transformer instances (e.g., [SavGol(), Gaussian(), SNV()])

  • augmentation_counts – sample_id → number of augmentations to create

  • random_state – Random seed for reproducibility; None means the selection is non-deterministic

Returns:

Dictionary mapping sample_id → list of transformer indices. For each sample, the list has length augmentation_counts[sample_id] and contains randomly selected transformer indices.

Examples

>>> transformers = [SavGol(), Gaussian(), SNV()]  # 3 transformers
>>> counts = {10: 2, 11: 3, 12: 0}  # Sample 10 needs 2 augs, 11 needs 3, 12 needs 0
>>> selection = BalancingCalculator.apply_random_transformer_selection(
...     transformers, counts, random_state=42
... )
>>> len(selection[10])  # 2 (two transformer indices)
>>> len(selection[11])  # 3 (three transformer indices)
>>> selection[12]  # [] (no augmentations)
>>> all(0 <= idx < 3 for idx in selection[10])  # True (valid indices)
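The behaviour described above can be approximated in a few lines. The following is a minimal sketch, not the library's implementation: the helper name select_transformers is hypothetical, and it assumes indices are drawn uniformly with replacement from a NumPy generator seeded by random_state.

```python
import numpy as np


def select_transformers(n_transformers, augmentation_counts, random_state=None):
    """Hypothetical stand-in: pick one transformer index per augmentation."""
    # Assumption: a seeded NumPy Generator provides the reproducibility.
    rng = np.random.default_rng(random_state)
    selection = {}
    for sample_id, count in augmentation_counts.items():
        # Draw `count` indices in [0, n_transformers); count == 0 yields [].
        selection[sample_id] = rng.integers(0, n_transformers, size=count).tolist()
    return selection


counts = {10: 2, 11: 3, 12: 0}
first = select_transformers(3, counts, random_state=42)
second = select_transformers(3, counts, random_state=42)
```

With the same seed, first and second are identical; with random_state=None, repeated calls would generally differ.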
static calculate_balanced_counts(base_labels: ndarray, base_sample_indices: ndarray, all_labels: ndarray, all_sample_indices: ndarray, target_size: int | None = None, max_factor: float | None = None, ref_percentage: float | None = None, random_state: int | None = None) → Dict[int, int]

Calculate the number of augmentations per BASE sample, using ALL samples (base + augmented) to determine each class's current size and target.

Three balancing modes are supported (use exactly one):

  1. target_size mode: Augment each class to a fixed target sample count.
     - target_size: int, the desired number of samples per class
     - Example: target_size=100 means each class will have 100 samples
     - No cap: classes can exceed the majority class size if target_size > majority

  2. max_factor mode: Augment each class by a multiplier, capped at the majority class size.
     - max_factor: float, the multiplier applied to each class’s current size
     - The target is capped at the majority class size (the majority class is never augmented)
     - Example: max_factor=3 with majority=100, class=20 → target=min(60, 100)=60
     - Example: max_factor=2 with majority=100, class=100 → target=100 (no augmentation)

  3. ref_percentage mode: Augment each class to a percentage of the majority class.
     - ref_percentage: float, any positive value (e.g., 0.5 or 2.0)
     - If < 1.0: the target is below the majority size (e.g., 0.8 with majority=100 → target=80)
     - If > 1.0: the target is above the majority size, acting as a multiplier of the majority class
     - Example: ref_percentage=1.5 with majority=100 → target=150
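The arithmetic of the three modes can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: class_target and augmentations_needed are hypothetical helpers, and rounding non-integer targets with round() is an assumption the documentation does not specify.

```python
def class_target(current, majority, target_size=None, max_factor=None,
                 ref_percentage=None):
    """Hypothetical helper: per-class target size for exactly one mode."""
    given = [m for m in (target_size, max_factor, ref_percentage) if m is not None]
    if len(given) != 1:
        raise ValueError(
            "Specify exactly one of target_size, max_factor, ref_percentage"
        )
    if target_size is not None:
        # Mode 1: fixed target, no cap at the majority size.
        return int(target_size)
    if max_factor is not None:
        # Mode 2: multiply the current size, capped at the majority size.
        return min(round(current * max_factor), majority)
    # Mode 3: fraction or multiple of the majority size (rounding is assumed).
    return round(majority * ref_percentage)


def augmentations_needed(current, target):
    """Augmentations to add; never negative when a class already meets the target."""
    return max(0, target - current)


minority_target = class_target(current=20, majority=100, max_factor=3)
majority_target = class_target(current=100, majority=100, max_factor=2)
below_majority = class_target(current=20, majority=100, ref_percentage=0.8)
above_majority = class_target(current=20, majority=100, ref_percentage=1.5)
```

These calls reproduce the examples listed above: targets of 60, 100, 80, and 150 respectively.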

Parameters:
  • base_labels – Class labels for BASE samples only

  • base_sample_indices – BASE sample IDs (these have data to augment)

  • all_labels – Class labels for ALL samples (base + augmented)

  • all_sample_indices – ALL sample IDs (for calculating target size)

  • target_size – Fixed target samples per class (mode 1)

  • max_factor – Multiplier for augmentation, capped at majority (mode 2)

  • ref_percentage – Target as a fraction or multiple of the majority class size (mode 3, can be > 1.0)

  • random_state – Random seed for reproducible remainder distribution

Returns:

Dict mapping base_sample_id → augmentation_count

Raises:

ValueError – If zero or multiple modes are specified, or if a parameter value is invalid
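The random_state parameter refers to a "remainder distribution", which suggests that a class's required augmentations are split evenly over its base samples and any leftover is assigned at random. A minimal sketch of that idea, assuming this interpretation is correct (distribute_evenly is a hypothetical helper, not part of the module):

```python
import numpy as np


def distribute_evenly(sample_ids, total, random_state=None):
    """Hypothetical helper: split `total` augmentations over base samples."""
    sample_ids = list(sample_ids)
    if not sample_ids:
        return {}
    rng = np.random.default_rng(random_state)
    # Every sample gets the same base share.
    base, remainder = divmod(total, len(sample_ids))
    counts = {sid: base for sid in sample_ids}
    # The leftover augmentations go to randomly chosen samples, one each.
    for sid in rng.choice(sample_ids, size=remainder, replace=False):
        counts[int(sid)] += 1
    return counts


# A class with 3 base samples that needs 8 augmentations in total.
counts_a = distribute_evenly([10, 11, 12], 8, random_state=0)
counts_b = distribute_evenly([10, 11, 12], 8, random_state=0)
```

Each sample receives 2 augmentations and two randomly chosen samples receive one extra; a fixed seed makes that choice repeatable.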

static calculate_balanced_counts_value_aware(base_labels: ndarray, base_sample_indices: ndarray, base_values: ndarray, all_labels: ndarray, all_sample_indices: ndarray, target_size: int | None = None, max_factor: float | None = None, ref_percentage: float | None = None, random_state: int | None = None) → Dict[int, int]

Calculate augmentations per BASE sample with value-aware distribution.

This method first distributes augmentations fairly across unique values, then distributes within each value group across its samples.

Useful for binned regression where samples with the same bin value should be treated fairly together, not competing individually.

Parameters:
  • base_labels – Class labels for BASE samples only

  • base_sample_indices – BASE sample IDs

  • base_values – Actual values (y or bin values) for BASE samples; used to group samples with the same value

  • all_labels – Class labels for ALL samples (base + augmented)

  • all_sample_indices – ALL sample IDs

  • target_size – Fixed target samples per class (mode 1)

  • max_factor – Multiplier for augmentation (mode 2)

  • ref_percentage – Target as a fraction or multiple of the majority class size (mode 3)

  • random_state – Random seed for reproducibility

Returns:

Dict mapping base_sample_id → augmentation_count
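The two-level distribution described for this method can be illustrated with a short sketch. It is written under assumptions and is not the library's implementation: split_evenly and distribute_value_aware are hypothetical names, and the even split with a randomly assigned remainder is an assumed interpretation of "fairly".

```python
from collections import defaultdict

import numpy as np


def split_evenly(keys, total, rng):
    """Hypothetical helper: even split of `total`, leftover assigned at random."""
    base, remainder = divmod(total, len(keys))
    shares = {key: base for key in keys}
    for i in rng.permutation(len(keys))[:remainder]:
        shares[keys[i]] += 1
    return shares


def distribute_value_aware(sample_ids, values, total, random_state=None):
    """Split `total` across unique values first, then within each value group."""
    rng = np.random.default_rng(random_state)
    groups = defaultdict(list)
    for sid, val in zip(sample_ids, values):
        groups[val].append(sid)
    # Level 1: each unique value receives a fair share of the total.
    per_value = split_evenly(list(groups), total, rng)
    # Level 2: each value's share is spread over the samples holding that value.
    counts = {}
    for val, members in groups.items():
        counts.update(split_evenly(members, per_value[val], rng))
    return counts


# Three samples share the value 1.0; one sample has the value 2.0.
value_aware = distribute_value_aware([0, 1, 2, 3], [1.0, 1.0, 1.0, 2.0], total=6)
```

Here each value group receives 3 augmentations, so the lone sample with value 2.0 gets all 3 of its group's share while the three samples with value 1.0 get 1 each. A per-sample split would instead favour the value that has more samples.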