nirs4all.operators.data.repetition module

Repetition transformation operator configuration.

This module provides configuration dataclasses for transforming spectral repetitions (multiple spectra per sample) into either separate sources or additional preprocessings.

When samples have multiple repetitions (e.g., 4 spectra per leaf sample), these operators reshape the dataset structure:

  • rep_to_sources: Each repetition becomes a separate data source.
    Input: 1 source × (120 samples, 1 pp, 500 features)
    Output: 4 sources × (30 samples, 1 pp, 500 features)

  • rep_to_pp: Repetitions become additional preprocessing slots.
    Input: 1 source × (120 samples, 1 pp, 500 features)
    Output: 1 source × (30 samples, 4 pp, 500 features)
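
The two reshapes can be illustrated on a toy dataset. This is a minimal sketch in plain Python, not the library's implementation: the data, the variable names, and the grouping loop are all assumptions made for illustration, using 3 samples × 2 repetitions instead of 30 × 4.

```python
# Toy stand-in data (hypothetical): 3 samples x 2 repetitions, 4 features each.
sample_ids = ["A", "A", "B", "B", "C", "C"]
spectra = [[float(row * 10 + f) for f in range(4)] for row in range(6)]

# Group row indices by sample ID, keeping row order within each group.
groups = {}
for row, sid in enumerate(sample_ids):
    groups.setdefault(sid, []).append(row)

n_reps = 2

# rep_to_sources: one (n_samples, n_features) table per repetition.
sources = [[spectra[rows[i]] for rows in groups.values()] for i in range(n_reps)]

# rep_to_pp: a single (n_samples, n_reps, n_features) table.
stacked = [[spectra[r] for r in rows] for rows in groups.values()]

print(len(sources), len(sources[0]), len(sources[0][0]))
print(len(stacked), len(stacked[0]), len(stacked[0][0]))
```

In both cases the number of rows drops from 6 to 3; the repetitions either fan out into separate tables or move onto an extra axis.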

Example

>>> # Transform 4 repetitions per sample into 4 sources
>>> {"rep_to_sources": "Sample_ID"}
>>>
>>> # Transform repetitions into preprocessing dimension
>>> {"rep_to_pp": "Sample_ID"}
>>>
>>> # Group by target value instead of metadata column
>>> {"rep_to_sources": "y"}
>>>
>>> # Advanced configuration with options
>>> {"rep_to_sources": {
...     "column": "Sample_ID",
...     "on_unequal": "drop",
...     "source_names": "rep_{i}"
... }}
class nirs4all.operators.data.repetition.RepetitionConfig(column: str | None = None, on_unequal: str = 'error', expected_reps: int | None = None, source_names: str | List[str] | None = None, pp_names: str | List[str] | None = None, preserve_order: bool = True, aggregate_metadata: str = 'first')

Bases: object

Configuration for repetition transformation operations.

This dataclass provides configuration for rep_to_sources and rep_to_pp keywords, which reshape datasets based on sample repetitions.

Repetitions are identified by a metadata column (e.g., “Sample_ID”) that groups multiple spectra belonging to the same physical sample.

column

Metadata column identifying sample groups, or one of the special values:

  • None (default): Use the dataset’s aggregate column from DatasetConfigs

  • “y”: Group by target values

  • str: Explicit metadata column name

Type:

str | None

on_unequal

Strategy when samples have different repetition counts:

  • “error” (default): Raise an error if counts differ

  • “pad”: Pad shorter groups with NaN to match the longest

  • “drop”: Drop samples that do not have the expected repetition count

  • “truncate”: Use the minimum count across all samples

Type:

str
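
The four strategies can be sketched on a mapping from sample ID to row indices. This is an illustrative stand-in, not the library's code: the function name `reconcile`, its signature, and the use of `None` as a placeholder for a NaN-filled row are all assumptions.

```python
def reconcile(groups, strategy, expected):
    """Hypothetical helper. groups maps sample ID -> list of row indices."""
    sizes = [len(rows) for rows in groups.values()]
    if strategy == "error":
        if any(size != expected for size in sizes):
            raise ValueError(f"expected {expected} repetitions, got sizes {sizes}")
        return dict(groups)
    if strategy == "drop":
        # Keep only the samples that have exactly the expected count.
        return {sid: rows for sid, rows in groups.items() if len(rows) == expected}
    if strategy == "truncate":
        # Cut every group down to the smallest group size.
        shortest = min(sizes)
        return {sid: rows[:shortest] for sid, rows in groups.items()}
    if strategy == "pad":
        # Extend every group to the largest size; None marks a NaN-filled row.
        longest = max(sizes)
        return {sid: rows + [None] * (longest - len(rows)) for sid, rows in groups.items()}
    raise ValueError(f"unknown strategy: {strategy!r}")


groups = {"A": [0, 1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9, 10]}
print(reconcile(groups, "drop", expected=4))
print(reconcile(groups, "truncate", expected=4))
print(reconcile(groups, "pad", expected=4))
```

With sample "B" one repetition short, "drop" loses a sample, "truncate" loses one spectrum from every other sample, and "pad" keeps everything at the cost of introducing missing values.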

expected_reps

Expected number of repetitions per sample. If None (default), inferred from data (mode of group sizes). If specified, validates all groups match this count.

Type:

int | None
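
Inferring the count as the mode of group sizes can be sketched with `collections.Counter`. The helper name `infer_expected_reps` is hypothetical, and how the library breaks ties between equally common sizes is not documented here, so the tie behaviour of this sketch is an assumption.

```python
from collections import Counter


def infer_expected_reps(sample_ids):
    """Hypothetical helper: most common number of rows per sample ID."""
    group_sizes = Counter(sample_ids)             # rows per sample ID
    size_counts = Counter(group_sizes.values())   # number of groups per size
    return size_counts.most_common(1)[0][0]


ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 3
print(infer_expected_reps(ids))
```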

source_names

Naming template for new sources (rep_to_sources only):

  • None (default): Uses “rep_0”, “rep_1”, etc.

  • str with {i}: Template like “rep_{i}” or “spectrum_{i}”

  • List[str]: Explicit names for each repetition

Type:

str | List[str] | None

pp_names

Naming template for new preprocessings (rep_to_pp only):

  • None (default): Uses “{original}_rep{i}” format

  • str with {i} and {pp}: Template like “{pp}_r{i}”

  • List[str]: Explicit names (length = n_reps * n_existing_pp)

Type:

str | List[str] | None

preserve_order

Whether to preserve sample order within groups. If True (default), repetitions are ordered by their row position. If False, order within groups is undefined.

Type:

bool

aggregate_metadata

How to handle metadata after grouping:

  • “first” (default): Keep metadata from the first repetition

  • “validate”: Ensure all repetitions have identical metadata; raise an error if not

  • “drop”: Remove metadata columns that differ across repetitions

Type:

str
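
The three metadata modes can be sketched for a single sample's repetitions. This is a simplified illustration under stated assumptions: `aggregate` is a hypothetical helper, metadata rows are plain dicts, and "drop" is decided per group here, whereas the library may decide it across the whole dataset.

```python
def aggregate(rows, mode="first"):
    """Hypothetical helper. rows: metadata dicts for one sample's repetitions."""
    first = rows[0]
    if mode == "first":
        return dict(first)
    # Columns whose value is not the same in every repetition.
    differing = {key for key in first if any(row[key] != first[key] for row in rows)}
    if mode == "validate":
        if differing:
            raise ValueError(f"metadata differs across repetitions: {sorted(differing)}")
        return dict(first)
    if mode == "drop":
        return {key: value for key, value in first.items() if key not in differing}
    raise ValueError(f"unknown mode: {mode!r}")


rows = [
    {"Sample_ID": "A", "operator": "kim", "scan": 1},
    {"Sample_ID": "A", "operator": "kim", "scan": 2},
]
print(aggregate(rows, "first"))
print(aggregate(rows, "drop"))
```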

Example

>>> # Use dataset's aggregate column (simplest)
>>> RepetitionConfig()
>>>
>>> # Simple column-based grouping
>>> RepetitionConfig(column="Sample_ID")
>>>
>>> # Group by target value with padding
>>> RepetitionConfig(column="y", on_unequal="pad")
>>>
>>> # Explicit repetition count validation
>>> RepetitionConfig(
...     column="Leaf_ID",
...     expected_reps=4,
...     on_unequal="error"
... )
>>>
>>> # Custom source naming
>>> RepetitionConfig(
...     column="Sample_ID",
...     source_names="measurement_{i}"
... )
__post_init__()

Validate configuration after initialization.

aggregate_metadata: str = 'first'
column: str | None = None
expected_reps: int | None = None
classmethod from_dict(data: Dict[str, Any]) → RepetitionConfig

Create config from dictionary.

Parameters:

data – Dictionary representation. If the ‘column’ key is missing, column defaults to None (the dataset’s aggregate column is used).

Returns:

RepetitionConfig instance.

classmethod from_step_value(value: str | bool | Dict[str, Any] | None) → RepetitionConfig

Create config from step value (string, bool, or dict).

Handles multiple syntax styles:

  • None or True: Use the dataset’s aggregate column

  • str: Explicit column name (or “y” for target grouping)

  • dict: Full configuration with options

Parameters:

value – Step value - column name, True/None for aggregate, or config dict.

Returns:

RepetitionConfig instance.

Example

>>> # Use dataset aggregate (simplest)
>>> RepetitionConfig.from_step_value(True)
>>> RepetitionConfig.from_step_value(None)
>>>
>>> # Explicit column
>>> RepetitionConfig.from_step_value("Sample_ID")
>>>
>>> # Advanced syntax
>>> RepetitionConfig.from_step_value({
...     "column": "Sample_ID",
...     "on_unequal": "drop"
... })
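
The dispatch over the accepted step values can be sketched as a normalisation into a plain dict. `normalize_step_value` is a hypothetical stand-in, not part of the library, and rejecting other inputs (including False) with a TypeError is an assumption of this sketch.

```python
def normalize_step_value(value):
    """Hypothetical helper: map a step value to keyword arguments."""
    if value is None or value is True:
        return {"column": None}           # fall back to the dataset aggregate
    if isinstance(value, str):
        return {"column": value}          # explicit column name, or "y"
    if isinstance(value, dict):
        return {"column": None, **value}  # full configuration; column is optional
    raise TypeError(f"unsupported step value: {value!r}")


print(normalize_step_value(True))
print(normalize_step_value("Sample_ID"))
print(normalize_step_value({"on_unequal": "drop"}))
```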
get_pp_name(rep_index: int, original_pp: str) → str

Generate the preprocessing name for a given repetition and original preprocessing.

Parameters:
  • rep_index – Zero-based repetition index.

  • original_pp – Original preprocessing name (e.g., “raw”, “snv”).

Returns:

New preprocessing name string.
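
The default and string-template cases of pp_names can be sketched with `str.format`. `pp_name` is a hypothetical helper; the list case is left out because the order in which explicit names map to (repetition, preprocessing) pairs is not stated here.

```python
def pp_name(rep_index, original_pp, template=None):
    """Hypothetical helper covering the None and str cases of pp_names."""
    if template is None:
        return f"{original_pp}_rep{rep_index}"   # documented default format
    return template.format(i=rep_index, pp=original_pp)


print(pp_name(0, "raw"))
print(pp_name(2, "snv", "{pp}_r{i}"))
```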

get_source_name(rep_index: int) → str

Generate source name for a given repetition index.

Parameters:

rep_index – Zero-based repetition index.

Returns:

Source name string.
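
The three forms of source_names can be sketched as follows. `source_name` is a hypothetical helper written from the attribute description, not the library's method.

```python
def source_name(rep_index, names=None):
    """Hypothetical helper covering None, a str template, or a list of names."""
    if names is None:
        return f"rep_{rep_index}"            # documented default
    if isinstance(names, str):
        return names.format(i=rep_index)     # template containing {i}
    return names[rep_index]                  # explicit list, one name per repetition


print(source_name(1))
print(source_name(1, "measurement_{i}"))
print(source_name(1, ["front", "back"]))
```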

get_unequal_strategy() → UnequelRepsStrategy

Get the unequal handling strategy as an enum.

Returns:

UnequelRepsStrategy enum value.
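
Converting the on_unequal string to an enum member can be sketched with a lookup by value. `Strategy` below is a stand-in that mirrors the documented members; it is not the library's class, and the error message is an assumption.

```python
from enum import Enum


class Strategy(Enum):
    """Stand-in enum mirroring the documented member values."""
    ERROR = "error"
    PAD = "pad"
    DROP = "drop"
    TRUNCATE = "truncate"


def to_strategy(name):
    """Hypothetical helper: look up a member by its string value."""
    try:
        return Strategy(name)
    except ValueError:
        valid = ", ".join(member.value for member in Strategy)
        raise ValueError(f"on_unequal must be one of: {valid}; got {name!r}") from None


print(to_strategy("pad"))
```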

property is_y_grouping: bool

Check if grouping by target values.

Returns:

True if column is “y” (case-insensitive).

on_unequal: str = 'error'
pp_names: str | List[str] | None = None
preserve_order: bool = True
resolve_column(dataset_aggregate: str | None) → str

Resolve the actual column to use for grouping.

Parameters:

dataset_aggregate – The aggregate value from dataset (column name, “y”, or None).

Returns:

The resolved column name to use.

Raises:

ValueError – If no column specified and dataset has no aggregate setting.
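
The resolution order described above can be sketched as a free function. `resolve` is hypothetical; it assumes an explicit column always takes precedence over the dataset's aggregate setting.

```python
def resolve(column, dataset_aggregate):
    """Hypothetical helper: explicit column first, then the dataset aggregate."""
    if column is not None:
        return column
    if dataset_aggregate is not None:
        return dataset_aggregate
    raise ValueError("no column specified and dataset has no aggregate setting")


print(resolve("Leaf_ID", "Sample_ID"))
print(resolve(None, "Sample_ID"))
```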

source_names: str | List[str] | None = None
to_dict() → Dict[str, Any]

Serialize configuration to dictionary.

Returns:

Dictionary representation for manifest storage.
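
A dictionary round trip of this kind can be sketched with `dataclasses.asdict`. `MiniConfig` is a hypothetical stand-in carrying only two of the documented fields; whether the library uses `asdict` internally is not stated here.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class MiniConfig:
    """Hypothetical stand-in with two of the documented fields."""
    column: Optional[str] = None
    on_unequal: str = "error"

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiniConfig":
        # A missing 'column' key falls back to None (dataset aggregate).
        return cls(
            column=data.get("column"),
            on_unequal=data.get("on_unequal", "error"),
        )


config = MiniConfig(column="Sample_ID", on_unequal="drop")
restored = MiniConfig.from_dict(config.to_dict())
print(restored == config)
```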

property uses_dataset_aggregate: bool

Check if using dataset’s aggregate column.

Returns:

True if column is None (will use dataset.aggregate at runtime).

class nirs4all.operators.data.repetition.UnequelRepsStrategy(value)

Bases: Enum

Strategy for handling samples with unequal repetition counts.

When samples have different numbers of repetitions, this controls how the transformation handles the mismatch.

ERROR

Raise an error if repetition counts differ (default, strictest).

PAD

Pad shorter groups with NaN/zeros to match the longest.

DROP

Drop samples that don’t have the expected repetition count.

TRUNCATE

Truncate all groups to the minimum repetition count.

DROP = 'drop'
ERROR = 'error'
PAD = 'pad'
TRUNCATE = 'truncate'