nirs4all.controllers.data.repetition module

Repetition Transformation Controllers.

This module provides controllers for transforming spectral repetitions (multiple spectra per sample) into either separate sources or additional preprocessings.

These transformations physically reshape the dataset structure:

rep_to_sources: Each repetition becomes a separate data source.
    Input: 1 source × (120 samples, 1 pp, 500 features)
    Output: 4 sources × (30 samples, 1 pp, 500 features)
rep_to_pp: Repetitions become additional preprocessing slots.
    Input: 1 source × (120 samples, 1 pp, 500 features)
    Output: 1 source × (30 samples, 4 pp, 500 features)

These are typically used early in the pipeline, before cross-validation, as they change the fundamental dataset structure.
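
The two reshapes described above can be illustrated with plain NumPy. This is a minimal sketch rather than the controllers' actual implementation: the toy data, the `sample_ids` array standing in for a "Sample_ID" metadata column, and the repetition-major ordering of the stacked preprocessing axis are all assumptions.

```python
import numpy as np

# Hypothetical toy data matching the shapes above: 30 samples x 4 repetitions.
n_unique, n_reps, n_pp, n_features = 30, 4, 1, 500
rng = np.random.default_rng(0)
X = rng.normal(size=(n_unique * n_reps, n_pp, n_features))  # (120, 1, 500)
sample_ids = np.repeat(np.arange(n_unique), n_reps)          # stands in for "Sample_ID"

# Shuffle the rows: repetitions of one sample need not be stored contiguously.
perm = rng.permutation(len(sample_ids))
X, sample_ids = X[perm], sample_ids[perm]

# Group rows by sample ID; a stable sort preserves the order of repetitions.
order = np.argsort(sample_ids, kind="stable")
grouped = X[order].reshape(n_unique, n_reps, n_pp, n_features)

# rep_to_sources: one array per repetition index.
sources = [grouped[:, r] for r in range(n_reps)]             # 4 x (30, 1, 500)

# rep_to_pp: repetitions folded into the preprocessing axis (assumed ordering).
stacked = grouped.reshape(n_unique, n_reps * n_pp, n_features)  # (30, 4, 500)
```

Both results keep one row per unique sample, which is why a splitter placed after either step sees 30 samples instead of 120.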

Example

>>> # Transform 4 repetitions per sample into 4 sources
>>> pipeline = [
...     {"rep_to_sources": "Sample_ID"},
...     ShuffleSplit(n_splits=3),
...     PLSRegression(n_components=10)
... ]
>>>
>>> # Use dataset's aggregate column (from DatasetConfigs)
>>> pipeline = [
...     {"rep_to_sources": True},  # Uses aggregate column
...     {"source_branch": {...}},   # Per-source preprocessing
...     {"merge_sources": "concat"},
...     PLSRegression()
... ]
>>>
>>> # Transform to preprocessings for multi-PP models
>>> pipeline = [
...     {"rep_to_pp": "Sample_ID"},
...     ShuffleSplit(n_splits=3),
...     {"model": NiConNet()}  # Handles multi-PP input
... ]

Keywords: “rep_to_sources”, “rep_to_pp”

Priority: 3 (early in pipeline, before CV)

class nirs4all.controllers.data.repetition.RepToPPController[source]

Bases: OperatorController

Controller for transforming repetitions into additional preprocessings.

This controller handles the rep_to_pp pipeline keyword, which groups samples by a metadata column and reshapes each repetition into a preprocessing dimension.

Before: n_sources × (n_samples, n_pp, n_features)
After: n_sources × (n_unique_samples, n_pp × n_reps, n_features)

This enables:

Multi-preprocessing input for models like NiConNet
Repetition-as-preprocessing fusion strategies
Consistent sample count for cross-validation

priority

Controller priority (3 = early, before CV).

Type: int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, StepOutput][source]

Execute rep_to_pp transformation.

Reshapes the dataset by grouping samples by the specified column and stacking repetitions into the preprocessing dimension.

Parameters:

step_info – Parsed step containing rep_to_pp configuration
dataset – Dataset to transform
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index (not used, operates on all sources)
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binary objects (not used)
prediction_store – External prediction store (not used)

Returns:

Tuple of (context, StepOutput with transformation info)

Raises:

ValueError – If the column is not found, or if groups have unequal sizes and on_unequal="error".
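
The unequal-size condition mentioned under Raises could be checked along the following lines. The helper name `check_equal_repetitions` is hypothetical and not part of nirs4all; the sketch covers only the error case, since other `on_unequal` values are not described here.

```python
from collections import Counter


def check_equal_repetitions(sample_ids):
    """Return the common repetition count, or raise ValueError if groups differ."""
    counts = Counter(sample_ids)
    if not counts:
        raise ValueError("no samples to group")
    sizes = set(counts.values())
    if len(sizes) > 1:
        # Report every group that is smaller than the largest one.
        largest = max(sizes)
        offenders = {k: v for k, v in counts.items() if v != largest}
        raise ValueError(f"unequal repetition counts: {offenders}")
    return sizes.pop()


print(check_equal_repetitions(["a", "a", "b", "b"]))  # 2
```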

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Check if the step matches the rep_to_pp controller.

Parameters:

step – Original step configuration
operator – Deserialized operator
keyword – Step keyword

Returns:

True if keyword is “rep_to_pp”

priority: int = 3

classmethod supports_prediction_mode() → bool[source]

Repetition transformation should NOT run in prediction mode.

The transformation happens once during training. During prediction, the model expects the same structure that was used during training.

classmethod use_multi_source() → bool[source]

This controller operates on the whole dataset, not per-source.
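
To show how the classmethods above could fit together, here is a hypothetical dispatcher sketch. `SketchRepToPP` and `select_controller` are illustrative stand-ins, not nirs4all classes or functions, and two behaviours are assumptions: that the lowest priority value wins among matching controllers, and that `supports_prediction_mode()` returns False for this controller.

```python
class SketchRepToPP:
    """Stand-in mirroring the documented classmethods (not the real controller)."""

    priority = 3

    @classmethod
    def matches(cls, step, operator, keyword):
        # Documented behaviour: True if the keyword is "rep_to_pp".
        return keyword == "rep_to_pp"

    @classmethod
    def supports_prediction_mode(cls):
        # Assumed return value, based on "should NOT run in prediction mode".
        return False


def select_controller(controllers, step, keyword, mode="train"):
    """Pick the matching controller with the lowest priority value, or None."""
    candidates = [c for c in controllers if c.matches(step, None, keyword)]
    if mode == "predict":
        # Controllers that do not support prediction mode are skipped.
        candidates = [c for c in candidates if c.supports_prediction_mode()]
    return min(candidates, key=lambda c: c.priority, default=None)


step = {"rep_to_pp": "Sample_ID"}
chosen = select_controller([SketchRepToPP], step, "rep_to_pp")
skipped = select_controller([SketchRepToPP], step, "rep_to_pp", mode="predict")
```

Under these assumptions `chosen` is the stand-in class during training and `skipped` is None during prediction, matching the description that the step is not re-run at prediction time.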

class nirs4all.controllers.data.repetition.RepToSourcesController[source]

Bases: OperatorController

Controller for transforming repetitions into separate data sources.

This controller handles the rep_to_sources pipeline keyword, which groups samples by a metadata column (typically sample ID) and reshapes each repetition index into a separate data source.

Before: 1 source × (n_samples, n_pp, n_features)
After: n_reps sources × (n_unique_samples, n_pp, n_features)

This enables:

Per-repetition preprocessing via source_branch
Multi-source modeling strategies
Repetition-aware feature fusion

priority

Controller priority (3 = early, before CV).

Type: int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, StepOutput][source]

Execute rep_to_sources transformation.

Reshapes the dataset by grouping samples by the specified column and creating one source per repetition index.

Parameters:

step_info – Parsed step containing rep_to_sources configuration
dataset – Dataset to transform
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index (not used, operates on all sources)
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binary objects (not used)
prediction_store – External prediction store (not used)

Returns:

Tuple of (context, StepOutput with transformation info)

Raises:

ValueError – If the column is not found, or if groups have unequal sizes and on_unequal="error".

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Check if the step matches the rep_to_sources controller.

Parameters:

step – Original step configuration
operator – Deserialized operator
keyword – Step keyword

Returns:

True if keyword is “rep_to_sources”

priority: int = 3

classmethod supports_prediction_mode() → bool[source]

Repetition transformation should NOT run in prediction mode.

The transformation happens once during training. During prediction, the model expects the same structure that was used during training. The controller is therefore skipped in prediction mode, and the user must ensure that prediction data already has the same structure as the training data had after transformation.
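
Because the step does not run at prediction time, a quick structural check before predicting can catch mismatches early. The sketch below is an assumption-laden illustration: `same_structure` is a hypothetical helper, and each source is represented as a plain NumPy array of shape (n_samples, n_pp, n_features) rather than a SpectroDataset.

```python
import numpy as np


def same_structure(train_sources, predict_sources):
    """True if both have the same number of sources and the same
    (n_pp, n_features) per source; sample counts are allowed to differ."""
    if len(train_sources) != len(predict_sources):
        return False
    return all(
        t.shape[1:] == p.shape[1:]
        for t, p in zip(train_sources, predict_sources)
    )


# Training data after rep_to_sources: 4 sources x (30, 1, 500).
train = [np.zeros((30, 1, 500)) for _ in range(4)]
# Prediction data already split into 4 sources: compatible.
ready = [np.zeros((8, 1, 500)) for _ in range(4)]
# Prediction data still holding raw repetitions in one source: incompatible.
raw = [np.zeros((32, 1, 500))]

print(same_structure(train, ready))  # True
print(same_structure(train, raw))    # False
```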

classmethod use_multi_source() → bool[source]

This controller operates on the whole dataset, not per-source.