nirs4all.operators.data.repetition module

Repetition transformation operator configuration.

This module provides configuration dataclasses for transforming spectral repetitions (multiple spectra per sample) into either separate sources or additional preprocessings.

When samples have multiple repetitions (e.g., 4 spectra per leaf sample), these operators reshape the dataset structure:

rep_to_sources: Each repetition becomes a separate data source.
    Input: 1 source × (120 samples, 1 pp, 500 features)
    Output: 4 sources × (30 samples, 1 pp, 500 features)
rep_to_pp: Repetitions become additional preprocessing slots.
    Input: 1 source × (120 samples, 1 pp, 500 features)
    Output: 1 source × (30 samples, 4 pp, 500 features)
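
The two layouts can be illustrated with plain Python lists. This is a minimal sketch under stated assumptions, not the library's implementation: the helper `group_repetitions` and the toy data are hypothetical, and every sample is assumed to have the same number of repetitions.

```python
from collections import defaultdict

def group_repetitions(sample_ids, spectra):
    """Group spectra by sample ID, keeping row order within each group (hypothetical helper)."""
    groups = defaultdict(list)
    for sid, spectrum in zip(sample_ids, spectra):
        groups[sid].append(spectrum)
    return dict(groups)

# Toy data: 3 samples x 2 repetitions, 4 features each (6 rows in total).
sample_ids = ["A", "A", "B", "B", "C", "C"]
spectra = [[float(row)] * 4 for row in range(6)]

groups = group_repetitions(sample_ids, spectra)
n_reps = 2

# rep_to_sources layout: one source per repetition, each holding one row per sample.
sources = [[reps[i] for reps in groups.values()] for i in range(n_reps)]

# rep_to_pp layout: a single source, one row per sample, repetitions on a new axis.
stacked = [reps for reps in groups.values()]

print(len(sources), len(sources[0]), len(sources[0][0]))  # sources, samples, features
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # samples, pp slots, features
```

In both layouts the number of rows per source shrinks from the number of spectra to the number of physical samples; only the axis that absorbs the repetitions differs.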

Example

>>> # Transform 4 repetitions per sample into 4 sources
>>> {"rep_to_sources": "Sample_ID"}
>>>
>>> # Transform repetitions into preprocessing dimension
>>> {"rep_to_pp": "Sample_ID"}
>>>
>>> # Group by target value instead of metadata column
>>> {"rep_to_sources": "y"}
>>>
>>> # Advanced configuration with options
>>> {"rep_to_sources": {
...     "column": "Sample_ID",
...     "on_unequal": "drop",
...     "source_names": "rep_{i}"
... }}

class nirs4all.operators.data.repetition.RepetitionConfig(column: str | None = None, on_unequal: str = 'error', expected_reps: int | None = None, source_names: str | List[str] | None = None, pp_names: str | List[str] | None = None, preserve_order: bool = True, aggregate_metadata: str = 'first')

Bases: object

Configuration for repetition transformation operations.

This dataclass provides configuration for the rep_to_sources and rep_to_pp keywords, which reshape datasets based on sample repetitions.

Repetitions are identified by a metadata column (e.g., “Sample_ID”) that groups multiple spectra belonging to the same physical sample.

column

Metadata column identifying sample groups, or one of the special values:

- None (default): Use dataset’s aggregate column from DatasetConfigs
- “y”: Group by target values
- str: Explicit metadata column name

Type: str | None

on_unequal

Strategy when samples have different repetition counts.

- “error” (default): Raise error if counts differ
- “pad”: Pad shorter groups with NaN to match longest
- “drop”: Drop samples without expected repetition count
- “truncate”: Use minimum count across all samples

Type: str
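
The effect of each strategy on the per-sample repetition counts can be sketched as follows. This is an illustration only: `reconcile_counts` is a hypothetical helper, it works on counts rather than on spectra, and it assumes the expected count for “drop” is passed in explicitly.

```python
def reconcile_counts(group_sizes, strategy="error", expected=None):
    """Return the repetition count kept per sample (hypothetical helper).

    group_sizes maps sample ID -> number of repetitions found.
    """
    sizes = set(group_sizes.values())
    if len(sizes) <= 1:
        return dict(group_sizes)  # all groups already agree
    if strategy == "error":
        raise ValueError(f"Unequal repetition counts: {sorted(sizes)}")
    if strategy == "pad":
        # Shorter groups would be filled with NaN rows up to the longest group.
        return {sid: max(sizes) for sid in group_sizes}
    if strategy == "truncate":
        # Every group is cut down to the size of the shortest group.
        return {sid: min(sizes) for sid in group_sizes}
    if strategy == "drop":
        # Assumption: the caller supplies the expected count explicitly.
        if expected is None:
            raise ValueError("this sketch needs `expected` for 'drop'")
        return {sid: n for sid, n in group_sizes.items() if n == expected}
    raise ValueError(f"Unknown strategy: {strategy!r}")

sizes = {"A": 4, "B": 4, "C": 3}
print(reconcile_counts(sizes, "pad"))
print(reconcile_counts(sizes, "truncate"))
print(reconcile_counts(sizes, "drop", expected=4))
```

“pad” and “truncate” keep every sample but change how many repetitions it contributes, whereas “drop” keeps the repetition count and removes samples.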

expected_reps

Expected number of repetitions per sample. If None (default), inferred from data (mode of group sizes). If specified, validates all groups match this count.

Type: int | None
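
Inferring the count as the mode of the group sizes can be sketched with `collections.Counter`. The function name `infer_expected_reps` is hypothetical, and the tie-breaking shown here (first size encountered wins) is an assumption, not documented behavior.

```python
from collections import Counter

def infer_expected_reps(sample_ids):
    """Return the most common group size among the samples (hypothetical helper)."""
    group_sizes = Counter(sample_ids)             # sample ID -> number of rows
    size_counts = Counter(group_sizes.values())   # group size -> number of samples
    # most_common(1) returns [(size, how_many_samples_have_it)].
    return size_counts.most_common(1)[0][0]

# Two samples with 4 repetitions, one sample with only 3.
ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 3
print(infer_expected_reps(ids))
```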

source_names

Naming template for new sources (rep_to_sources only).

- None (default): Uses “rep_0”, “rep_1”, etc.
- str with {i}: Template like “rep_{i}” or “spectrum_{i}”
- List[str]: Explicit names for each repetition

Type: str | List[str] | None
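
The three accepted forms could be resolved to a concrete name roughly as follows. This is a sketch, assuming the template is filled with `str.format`; `resolve_source_name` is a hypothetical stand-in and not the body of `get_source_name`.

```python
def resolve_source_name(source_names, rep_index):
    """Map a repetition index to a source name (hypothetical helper)."""
    if source_names is None:
        return f"rep_{rep_index}"                 # documented default naming
    if isinstance(source_names, str):
        return source_names.format(i=rep_index)   # assumption: str.format-style template
    return source_names[rep_index]                # explicit list, one name per repetition

print(resolve_source_name(None, 2))
print(resolve_source_name("spectrum_{i}", 0))
print(resolve_source_name(["top", "bottom"], 1))
```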

pp_names

Naming template for new preprocessings (rep_to_pp only).

- None (default): Uses “{original}_rep{i}” format
- str with {i} and {pp}: Template like “{pp}_r{i}”
- List[str]: Explicit names (length = n_reps * n_existing_pp)

Type: str | List[str] | None
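
The length requirement for an explicit list follows from every existing preprocessing being duplicated once per repetition. The sketch below expands a template into the full list of names; `expand_pp_names` is hypothetical, and the repetition-major ordering of the result is an assumption that the source does not specify.

```python
def expand_pp_names(pp_names, n_reps, existing_pps):
    """Build every new preprocessing name (hypothetical helper)."""
    expected_len = n_reps * len(existing_pps)
    if isinstance(pp_names, list):
        if len(pp_names) != expected_len:
            raise ValueError(f"Expected {expected_len} names, got {len(pp_names)}")
        return list(pp_names)
    # The documented default appends "_rep{i}" to the original preprocessing name.
    template = "{pp}_rep{i}" if pp_names is None else pp_names
    # Assumption: names are ordered repetition by repetition.
    return [template.format(pp=pp, i=i) for i in range(n_reps) for pp in existing_pps]

print(expand_pp_names(None, 2, ["raw", "snv"]))
print(expand_pp_names("{pp}_r{i}", 2, ["raw"]))
```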

preserve_order

Whether to preserve sample order within groups. If True (default), repetitions are ordered by their row position. If False, order within groups is undefined.

Type: bool

aggregate_metadata

How to handle metadata after grouping.

- “first” (default): Keep metadata from first repetition
- “validate”: Ensure all reps have identical metadata, error if not
- “drop”: Remove metadata columns that differ across repetitions

Type: str
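
For a single group of repetitions, the three options behave roughly as sketched below. `aggregate_group_metadata` is a hypothetical helper that treats metadata as one dict per repetition; it handles one group at a time, whereas the real operator presumably decides which columns to drop across the whole dataset.

```python
def aggregate_group_metadata(rows, strategy="first"):
    """Collapse the metadata rows of one sample into a single row (hypothetical helper)."""
    first = rows[0]
    # Columns whose value is not identical across all repetitions of the sample.
    differing = {key for key in first if any(row[key] != first[key] for row in rows[1:])}
    if strategy == "first":
        return dict(first)
    if strategy == "validate":
        if differing:
            raise ValueError(f"Metadata differs across repetitions: {sorted(differing)}")
        return dict(first)
    if strategy == "drop":
        return {key: value for key, value in first.items() if key not in differing}
    raise ValueError(f"Unknown strategy: {strategy!r}")

rows = [
    {"Sample_ID": "A", "operator": "x", "scan": 1},
    {"Sample_ID": "A", "operator": "x", "scan": 2},
]
print(aggregate_group_metadata(rows, "first"))
print(aggregate_group_metadata(rows, "drop"))
```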

Example

>>> # Use dataset's aggregate column (simplest)
>>> RepetitionConfig()
>>>
>>> # Simple column-based grouping
>>> RepetitionConfig(column="Sample_ID")
>>>
>>> # Group by target value with padding
>>> RepetitionConfig(column="y", on_unequal="pad")
>>>
>>> # Explicit repetition count validation
>>> RepetitionConfig(
...     column="Leaf_ID",
...     expected_reps=4,
...     on_unequal="error"
... )
>>>
>>> # Custom source naming
>>> RepetitionConfig(
...     column="Sample_ID",
...     source_names="measurement_{i}"
... )

__post_init__()

Validate configuration after initialization.

aggregate_metadata: str = 'first'

column: str | None = None

expected_reps: int | None = None

classmethod from_dict(data: Dict[str, Any]) → RepetitionConfig

Create config from dictionary.

Parameters: data – Dictionary representation. If ‘column’ is missing, it defaults to None (the dataset’s aggregate column).
Returns: RepetitionConfig instance.

classmethod from_step_value(value: str | bool | Dict[str, Any] | None) → RepetitionConfig

Create config from step value (string, bool, or dict).

Handles multiple syntax styles:

- None or True: Use dataset’s aggregate column
- str: Explicit column name (or “y” for target grouping)
- dict: Full configuration with options

Parameters: value – Step value - column name, True/None for aggregate, or config dict.
Returns: RepetitionConfig instance.

Example

>>> # Use dataset aggregate (simplest)
>>> RepetitionConfig.from_step_value(True)
>>> RepetitionConfig.from_step_value(None)
>>>
>>> # Explicit column
>>> RepetitionConfig.from_step_value("Sample_ID")
>>>
>>> # Advanced syntax
>>> RepetitionConfig.from_step_value({
...     "column": "Sample_ID",
...     "on_unequal": "drop"
... })
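
The dispatch over the accepted syntax styles can be sketched without the library. `normalize_step_value` below is a hypothetical function that returns a plain dict of keyword arguments rather than a RepetitionConfig; how the real method treats False or other types is not documented here, so the sketch simply rejects them.

```python
def normalize_step_value(value):
    """Turn a step value into config keyword arguments (hypothetical helper)."""
    if value is None or value is True:
        return {"column": None}            # defer to the dataset's aggregate column
    if isinstance(value, str):
        return {"column": value}           # explicit column name, or "y"
    if isinstance(value, dict):
        return {"column": None, **value}   # a missing 'column' falls back to None
    # Assumption: anything else (including False) is treated as invalid.
    raise TypeError(f"Unsupported step value: {value!r}")

print(normalize_step_value(True))
print(normalize_step_value("Sample_ID"))
print(normalize_step_value({"on_unequal": "drop"}))
```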

get_pp_name(rep_index: int, original_pp: str) → str

Generate preprocessing name for a given repetition and original preprocessing.

Parameters:

rep_index – Zero-based repetition index.
original_pp – Original preprocessing name (e.g., “raw”, “snv”).

Returns:

New preprocessing name string.

get_source_name(rep_index: int) → str

Generate source name for a given repetition index.

Parameters: rep_index – Zero-based repetition index.
Returns: Source name string.

get_unequal_strategy() → UnequelRepsStrategy

Get the unequal handling strategy as an enum.

Returns: UnequelRepsStrategy enum value.

property is_y_grouping: bool

Check if grouping by target values.

Returns: True if column is “y” (case-insensitive).

on_unequal: str = 'error'

pp_names: str | List[str] | None = None

preserve_order: bool = True

resolve_column(dataset_aggregate: str | None) → str

Resolve the actual column to use for grouping.

Parameters: dataset_aggregate – The aggregate value from dataset (column name, “y”, or None).
Returns: The resolved column name to use.
Raises: ValueError – If no column specified and dataset has no aggregate setting.
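
The fallback order described above (explicit column first, then the dataset aggregate, otherwise an error) can be sketched as a standalone function. This version takes the configured column as an argument instead of reading it from a config instance, and the error message is illustrative.

```python
def resolve_column(column, dataset_aggregate):
    """Pick the grouping column: explicit setting first, dataset aggregate second."""
    if column is not None:
        return column
    if dataset_aggregate is None:
        raise ValueError("No column specified and dataset has no aggregate setting")
    return dataset_aggregate

print(resolve_column("Sample_ID", "Leaf_ID"))  # explicit column wins
print(resolve_column(None, "Leaf_ID"))         # falls back to the dataset aggregate
```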

source_names: str | List[str] | None = None

to_dict() → Dict[str, Any]

Serialize configuration to dictionary.

Returns: Dictionary representation for manifest storage.

property uses_dataset_aggregate: bool

Check if using dataset’s aggregate column.

Returns: True if column is None (will use dataset.aggregate at runtime).

class nirs4all.operators.data.repetition.UnequelRepsStrategy(value)

Bases: Enum

Strategy for handling samples with unequal repetition counts.

When samples have different numbers of repetitions, this controls how the transformation handles the mismatch.

ERROR: Raise an error if repetition counts differ (default, strictest).

PAD: Pad shorter groups with NaN/zeros to match the longest.

DROP: Drop samples that don’t have the expected repetition count.

TRUNCATE: Truncate all groups to the minimum repetition count.

DROP = 'drop'

ERROR = 'error'

PAD = 'pad'

TRUNCATE = 'truncate'
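
Because each member's value is the same string accepted by on_unequal, a string can be converted to a member by calling the enum with that value. The sketch below uses a local stand-in enum, `UnequalStrategySketch`, with the member values listed above; `parse_strategy` and its error message are hypothetical.

```python
from enum import Enum

class UnequalStrategySketch(Enum):
    """Local stand-in mirroring the documented member values."""
    ERROR = "error"
    PAD = "pad"
    DROP = "drop"
    TRUNCATE = "truncate"

def parse_strategy(value):
    """Look up a member by value, with a readable error for unknown strings."""
    try:
        return UnequalStrategySketch(value)
    except ValueError:
        valid = ", ".join(member.value for member in UnequalStrategySketch)
        raise ValueError(f"Invalid on_unequal {value!r}; expected one of: {valid}") from None

print(parse_strategy("pad"))
print(parse_strategy("truncate").value)
```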