nirs4all.operators.data.repetition module
Repetition transformation operator configuration.
This module provides configuration dataclasses for transforming spectral repetitions (multiple spectra per sample) into either separate sources or additional preprocessings.
When samples have multiple repetitions (e.g., 4 spectra per leaf sample), these operators reshape the dataset structure:
- rep_to_sources: Each repetition becomes a separate data source.
  Input: 1 source × (120 samples, 1 pp, 500 features)
  Output: 4 sources × (30 samples, 1 pp, 500 features)
- rep_to_pp: Repetitions become additional preprocessing slots.
  Input: 1 source × (120 samples, 1 pp, 500 features)
  Output: 1 source × (30 samples, 4 pp, 500 features)
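The two layouts can be illustrated without the library. The following is a minimal sketch in plain Python, not nirs4all's implementation: the variable names are hypothetical, spectra are simple lists, and the singleton preprocessing axis of the input is left out for brevity.

```python
from collections import defaultdict

n_samples, n_reps, n_features = 30, 4, 500

# 120 rows of (sample_id, spectrum); the ids play the role of a "Sample_ID" column.
rows = [
    (sid, [float(sid * n_reps + rep)] * n_features)
    for sid in range(n_samples)
    for rep in range(n_reps)
]

# Group spectra by sample id, keeping row order inside each group.
groups = defaultdict(list)
for sid, spectrum in rows:
    groups[sid].append(spectrum)

# rep_to_sources layout: sources[rep][sample] -> spectrum
sources = [[groups[sid][rep] for sid in groups] for rep in range(n_reps)]

# rep_to_pp layout: as_pp[sample][rep] -> spectrum
as_pp = [[groups[sid][rep] for rep in range(n_reps)] for sid in groups]

print(len(sources), len(sources[0]), len(sources[0][0]))  # 4 30 500
print(len(as_pp), len(as_pp[0]), len(as_pp[0][0]))        # 30 4 500
```

Both layouts hold the same 120 spectra; only the axis along which the repetitions are placed differs.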
Example
>>> # Transform 4 repetitions per sample into 4 sources
>>> {"rep_to_sources": "Sample_ID"}
>>>
>>> # Transform repetitions into preprocessing dimension
>>> {"rep_to_pp": "Sample_ID"}
>>>
>>> # Group by target value instead of metadata column
>>> {"rep_to_sources": "y"}
>>>
>>> # Advanced configuration with options
>>> {"rep_to_sources": {
...     "column": "Sample_ID",
...     "on_unequal": "drop",
...     "source_names": "rep_{i}"
... }}
- class nirs4all.operators.data.repetition.RepetitionConfig(column: str | None = None, on_unequal: str = 'error', expected_reps: int | None = None, source_names: str | List[str] | None = None, pp_names: str | List[str] | None = None, preserve_order: bool = True, aggregate_metadata: str = 'first')[source]
Bases: object
Configuration for repetition transformation operations.
This dataclass provides configuration for rep_to_sources and rep_to_pp keywords, which reshape datasets based on sample repetitions.
Repetitions are identified by a metadata column (e.g., “Sample_ID”) that groups multiple spectra belonging to the same physical sample.
- column
Metadata column identifying sample groups, or one of the special values:
  - None (default): Use the dataset’s aggregate column from DatasetConfigs
  - “y”: Group by target values
  - str: Explicit metadata column name
- Type:
str | None
- on_unequal
Strategy when samples have different repetition counts.
  - “error” (default): Raise an error if counts differ
  - “pad”: Pad shorter groups with NaN to match the longest
  - “drop”: Drop samples that do not have the expected repetition count
  - “truncate”: Use the minimum count across all samples
- Type:
str
- expected_reps
Expected number of repetitions per sample. If None (default), inferred from data (mode of group sizes). If specified, validates all groups match this count.
- Type:
int | None
- source_names
Naming template for new sources (rep_to_sources only).
  - None (default): Uses “rep_0”, “rep_1”, etc.
  - str with {i}: Template like “rep_{i}” or “spectrum_{i}”
  - List[str]: Explicit names for each repetition
- pp_names
Naming template for new preprocessings (rep_to_pp only).
  - None (default): Uses “{original}_rep{i}” format
  - str with {i} and {pp}: Template like “{pp}_r{i}”
  - List[str]: Explicit names (length = n_reps * n_existing_pp)
- preserve_order
Whether to preserve sample order within groups. If True (default), repetitions are ordered by their row position. If False, order within groups is undefined.
- Type:
bool
- aggregate_metadata
How to handle metadata after grouping.
  - “first” (default): Keep metadata from the first repetition
  - “validate”: Ensure all repetitions have identical metadata; raise an error if not
  - “drop”: Remove metadata columns that differ across repetitions
- Type:
str
Example
>>> # Use dataset's aggregate column (simplest)
>>> RepetitionConfig()
>>>
>>> # Simple column-based grouping
>>> RepetitionConfig(column="Sample_ID")
>>>
>>> # Group by target value with padding
>>> RepetitionConfig(column="y", on_unequal="pad")
>>>
>>> # Explicit repetition count validation
>>> RepetitionConfig(
...     column="Leaf_ID",
...     expected_reps=4,
...     on_unequal="error"
... )
>>>
>>> # Custom source naming
>>> RepetitionConfig(
...     column="Sample_ID",
...     source_names="measurement_{i}"
... )
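The expected_reps attribute is described as defaulting to the mode of the group sizes. A minimal sketch of that inference, using only the standard library and hypothetical sample ids (the library's own code may differ, for instance in how ties are broken):

```python
from collections import Counter

# Hypothetical "Sample_ID" values: sample C has only 3 repetitions.
sample_ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 3 + ["D"] * 4

# Number of repetitions per sample.
group_sizes = Counter(sample_ids)

# Mode of the group sizes: the most common repetition count.
expected_reps = Counter(group_sizes.values()).most_common(1)[0][0]

# Samples whose count deviates; these are what on_unequal has to deal with.
mismatched = sorted(sid for sid, n in group_sizes.items() if n != expected_reps)

print(expected_reps, mismatched)  # 4 ['C']
```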
- classmethod from_dict(data: Dict[str, Any]) RepetitionConfig[source]
Create config from dictionary.
- Parameters:
data – Dictionary representation. If ‘column’ is missing, uses None (aggregate).
- Returns:
RepetitionConfig instance.
- classmethod from_step_value(value: str | bool | Dict[str, Any] | None) RepetitionConfig[source]
Create config from step value (string, bool, or dict).
Handles multiple syntax styles:
  - None or True: Use the dataset’s aggregate column
  - str: Explicit column name (or “y” for target grouping)
  - dict: Full configuration with options
- Parameters:
value – Step value - column name, True/None for aggregate, or config dict.
- Returns:
RepetitionConfig instance.
Example
>>> # Use dataset aggregate (simplest)
>>> RepetitionConfig.from_step_value(True)
>>> RepetitionConfig.from_step_value(None)
>>>
>>> # Explicit column
>>> RepetitionConfig.from_step_value("Sample_ID")
>>>
>>> # Advanced syntax
>>> RepetitionConfig.from_step_value({
...     "column": "Sample_ID",
...     "on_unequal": "drop"
... })
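The dispatch on the step value's type can be sketched as follows. SketchConfig and sketch_from_step_value are hypothetical stand-ins written for illustration, not the library's classes, and they cover only two of the documented fields. How False is treated is not stated in this reference, so rejecting it here is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class SketchConfig:
    """Hypothetical stand-in with two of the documented fields."""
    column: Optional[str] = None
    on_unequal: str = "error"


def sketch_from_step_value(
    value: Union[str, bool, Dict[str, Any], None],
) -> SketchConfig:
    # None or True: fall back to the dataset's aggregate column (column=None).
    if value is None or value is True:
        return SketchConfig()
    # A string is an explicit column name, or "y" for target grouping.
    if isinstance(value, str):
        return SketchConfig(column=value)
    # A dict carries the full set of options.
    if isinstance(value, dict):
        return SketchConfig(**value)
    # Assumption: anything else (including False) is rejected.
    raise TypeError(f"unsupported step value: {value!r}")


print(sketch_from_step_value(True))
print(sketch_from_step_value("Sample_ID"))
print(sketch_from_step_value({"column": "Sample_ID", "on_unequal": "drop"}))
```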
- get_pp_name(rep_index: int, original_pp: str) str[source]
Generate preprocessing name for a given repetition and original processing.
- Parameters:
rep_index – Zero-based repetition index.
original_pp – Original preprocessing name (e.g., “raw”, “snv”).
- Returns:
New preprocessing name string.
- get_source_name(rep_index: int) str[source]
Generate source name for a given repetition index.
- Parameters:
rep_index – Zero-based repetition index.
- Returns:
Source name string.
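The naming rules documented for source_names and pp_names can be sketched with str.format. The two functions are hypothetical illustrations, not the methods themselves. The list form of pp_names is left out because its index order is not specified here.

```python
from typing import List, Optional, Union


def sketch_source_name(template: Union[str, List[str], None], rep_index: int) -> str:
    if template is None:
        return f"rep_{rep_index}"            # documented default: rep_0, rep_1, ...
    if isinstance(template, str):
        return template.format(i=rep_index)  # e.g. "measurement_{i}"
    return template[rep_index]               # explicit list of names


def sketch_pp_name(template: Optional[str], rep_index: int, original_pp: str) -> str:
    if template is None:
        return f"{original_pp}_rep{rep_index}"          # documented default format
    return template.format(i=rep_index, pp=original_pp)  # e.g. "{pp}_r{i}"


print(sketch_source_name(None, 2))               # rep_2
print(sketch_source_name("measurement_{i}", 0))  # measurement_0
print(sketch_pp_name(None, 1, "snv"))            # snv_rep1
print(sketch_pp_name("{pp}_r{i}", 3, "raw"))     # raw_r3
```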
- get_unequal_strategy() UnequelRepsStrategy[source]
Get the unequal handling strategy as an enum.
- Returns:
UnequelRepsStrategy enum value.
- property is_y_grouping: bool
Check if grouping by target values.
- Returns:
True if column is “y” (case-insensitive).
- resolve_column(dataset_aggregate: str | None) str[source]
Resolve the actual column to use for grouping.
- Parameters:
dataset_aggregate – The aggregate value from dataset (column name, “y”, or None).
- Returns:
The resolved column name to use.
- Raises:
ValueError – If no column specified and dataset has no aggregate setting.
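The fallback order described for resolve_column, together with the case-insensitive check behind is_y_grouping, can be sketched as two free functions. Both names are hypothetical and the error message is illustrative.

```python
from typing import Optional


def sketch_resolve_column(column: Optional[str], dataset_aggregate: Optional[str]) -> str:
    # An explicit column on the config takes precedence.
    if column is not None:
        return column
    # Otherwise fall back to the dataset's aggregate setting.
    if dataset_aggregate is not None:
        return dataset_aggregate
    raise ValueError("no column specified and the dataset has no aggregate setting")


def sketch_is_y_grouping(column: Optional[str]) -> bool:
    # "y" in any letter case means grouping by target values.
    return column is not None and column.lower() == "y"


print(sketch_resolve_column("Sample_ID", "Leaf_ID"))  # Sample_ID
print(sketch_resolve_column(None, "Leaf_ID"))         # Leaf_ID
print(sketch_is_y_grouping("Y"))                      # True
```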
- class nirs4all.operators.data.repetition.UnequelRepsStrategy(value)[source]
Bases: Enum
Strategy for handling samples with unequal repetition counts.
When samples have different numbers of repetitions, this controls how the transformation handles the mismatch.
- ERROR
Raise an error if repetition counts differ (default, strictest).
- PAD
Pad shorter groups with NaN/zeros to match the longest.
- DROP
Drop samples that don’t have the expected repetition count.
- TRUNCATE
Truncate all groups to the minimum repetition count.
- DROP = 'drop'
- ERROR = 'error'
- PAD = 'pad'
- TRUNCATE = 'truncate'
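To make the four strategies concrete, here is a minimal sketch that applies each one to small groups of scalar values standing in for spectra. Strategy mirrors the documented member values but is a hypothetical class, and reconcile is an illustrative helper, not part of the library. Padding uses NaN only.

```python
import math
from enum import Enum
from typing import Dict, List


class Strategy(Enum):
    """Hypothetical mirror of the four documented values."""
    ERROR = "error"
    PAD = "pad"
    DROP = "drop"
    TRUNCATE = "truncate"


def reconcile(groups: Dict[str, List[float]], strategy: Strategy,
              expected: int) -> Dict[str, List[float]]:
    sizes = {len(values) for values in groups.values()}
    if strategy is Strategy.ERROR:
        if len(sizes) > 1:
            raise ValueError(f"unequal repetition counts: {sorted(sizes)}")
        return dict(groups)
    if strategy is Strategy.DROP:
        # Keep only samples with the expected number of repetitions.
        return {k: v for k, v in groups.items() if len(v) == expected}
    if strategy is Strategy.TRUNCATE:
        # Cut every group down to the smallest repetition count.
        shortest = min(sizes)
        return {k: v[:shortest] for k, v in groups.items()}
    # PAD: extend shorter groups with NaN up to the longest group.
    longest = max(sizes)
    return {k: v + [math.nan] * (longest - len(v)) for k, v in groups.items()}


groups = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0], "C": [6.0, 7.0, 8.0]}

# Strategy("drop") looks a member up by value, as an Enum lookup does.
dropped = reconcile(groups, Strategy("drop"), expected=3)
truncated = reconcile(groups, Strategy("truncate"), expected=3)
padded = reconcile(groups, Strategy("pad"), expected=3)

print(sorted(dropped))  # ['A', 'C']
print(truncated["A"])   # [1.0, 2.0]
print(padded["B"])      # [4.0, 5.0, nan]
```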