nirs4all.controllers.splitters.fold_file_loader module

Controller for loading pre-computed fold indices from files.

This module provides the FoldFileLoaderController, which loads fold definitions either from previously saved fold files (generated by splitters such as KFold or ShuffleSplit) or from user-provided fold files.

Supported file formats:

  • CSV: nirs4all standard format (fold_0, fold_1, … columns with sample IDs)

  • CSV: assignment format (one sample_id column and one fold column)

  • JSON: List of fold objects with train/val keys

  • YAML: Same structure as JSON

  • TXT: Simple index lists (one per line)
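As an illustration of the JSON layout listed above, the following sketch writes a small fold file by hand using only the standard library. The sample indices, the file name my_folds.json and the temporary directory are made up for this example; only the train/val keys come from the format description, and the exact schema the loader accepts should be checked against the parser itself.

```python
import json
import tempfile
from pathlib import Path

# Two illustrative folds over six samples (indices 0..5).
# The "train"/"val" keys follow the JSON layout described above;
# the sample indices themselves are invented for this example.
folds = [
    {"train": [0, 1, 2, 3], "val": [4, 5]},
    {"train": [2, 3, 4, 5], "val": [0, 1]},
]

# Hypothetical output location; any path ending in .json would do.
fold_path = Path(tempfile.mkdtemp()) / "my_folds.json"
fold_path.write_text(json.dumps(folds, indent=2), encoding="utf-8")

# Read the file back to confirm the structure round-trips unchanged.
loaded = json.loads(fold_path.read_text(encoding="utf-8"))
```

A file written this way would then presumably be referenced from a pipeline step in the same way as the saved fold files, for example {"split": str(fold_path)}.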

Example pipeline usage:

pipeline = [
    MinMaxScaler(),
    {"split": "workspace/runs/my_run/folds_KFold_seed42.csv"},
    {"model": PLSRegression()}
]

class nirs4all.controllers.splitters.fold_file_loader.FoldFileLoaderController

Bases: OperatorController

Controller for loading pre-computed fold indices from files.

This controller matches pipeline steps where the ‘split’ keyword is used with a file path (string ending in a supported extension) instead of a splitter object.

Examples

>>> # In pipeline
>>> {"split": "path/to/folds.csv"}
>>> {"split": "workspace/runs/my_run/folds_KFold_seed42.csv"}

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any]

Load folds from file and set them on the dataset.

Parameters:
  • step_info – Parsed step containing the file path.

  • dataset – Dataset to set folds on.

  • context – Current execution context.

  • runtime_context – Runtime context with global settings.

  • source – Source index (unused).

  • mode – Execution mode (“train” or “predict”).

  • loaded_binaries – Pre-loaded binaries (unused).

  • prediction_store – Prediction store (unused).

Returns:

Tuple of (context, StepOutput).

classmethod matches(step: Any, operator: Any, keyword: str) → bool

Match steps with ‘split’ keyword and file path value.

Returns True if all of the following hold:

  • keyword is ‘split’

  • operator is a string (file path)

  • the path has a supported extension (.csv, .json, .yaml, .yml, .txt)
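The three conditions above can be sketched as a standalone function. This is not the controller's actual implementation: the function name looks_like_fold_file_step is hypothetical, and the case-insensitive extension comparison is an assumption of the sketch.

```python
from pathlib import Path
from typing import Any

# Extensions listed in the documentation above.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".yaml", ".yml", ".txt"}


def looks_like_fold_file_step(operator: Any, keyword: str) -> bool:
    """Hypothetical stand-in for the three documented matching conditions."""
    # Condition 1: the step must use the 'split' keyword.
    if keyword != "split":
        return False
    # Condition 2: the operator must be a string (a file path),
    # not a splitter object.
    if not isinstance(operator, str):
        return False
    # Condition 3: the path must end in a supported extension.
    # Lower-casing the suffix is an assumption of this sketch.
    return Path(operator).suffix.lower() in SUPPORTED_EXTENSIONS
```

Under these assumptions, a step such as {"split": "path/to/folds.csv"} would match, while a step whose operator is a splitter object or a path with an unlisted extension would not.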

priority: int = 9

classmethod supports_prediction_mode() → bool

Fold files should be loaded in prediction mode to set up fold structure.

classmethod use_multi_source() → bool

Fold loading is a single-source operation.

class nirs4all.controllers.splitters.fold_file_loader.FoldFileParser

Bases: object

Utility class for parsing fold files in various formats.

Supports multiple fold file formats:

  • nirs4all CSV: columns fold_0, fold_1, etc. with sample IDs as rows

  • Assignment CSV: columns sample_id, fold assigning each sample to a fold

  • JSON: List of dicts with train and val (or test) keys

  • YAML: Same structure as JSON

  • TXT: Simple format with fold indices
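To make the assignment CSV layout concrete, here is a minimal sketch that turns sample_id/fold rows into (train_ids, val_ids) tuples. It is not the parser's own code: the helper name folds_from_assignment_rows is hypothetical, the data is invented, and the sketch assumes that the fold a sample is assigned to is the fold in which it is held out for validation.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def folds_from_assignment_rows(
    rows: Iterable[Dict[str, str]],
) -> List[Tuple[List[int], List[int]]]:
    """Turn (sample_id, fold) assignments into (train_ids, val_ids) tuples.

    Assumption of this sketch: a sample's assigned fold is the fold in which
    it is used for validation; it is a training sample in every other fold.
    """
    assignments = [(int(r["sample_id"]), int(r["fold"])) for r in rows]
    fold_ids = sorted({fold for _, fold in assignments})
    folds = []
    for fold_id in fold_ids:
        val_ids = [s for s, f in assignments if f == fold_id]
        train_ids = [s for s, f in assignments if f != fold_id]
        folds.append((train_ids, val_ids))
    return folds


# Made-up assignment data: four samples spread over two folds.
csv_text = "sample_id,fold\n0,0\n1,1\n2,0\n3,1\n"
example_folds = folds_from_assignment_rows(csv.DictReader(io.StringIO(csv_text)))
```

The result has the same shape as the list of (train_ids, val_ids) tuples that parse is documented to return, which is why this layout can be treated interchangeably with the per-fold formats.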

Examples

>>> parser = FoldFileParser()
>>> folds = parser.parse("folds_KFold.csv")
>>> # Returns: [(train_ids, val_ids), (train_ids, val_ids), ...]

SUPPORTED_EXTENSIONS = {'.csv', '.json', '.txt', '.yaml', '.yml'}

parse(file_path: str | Path, format: str | None = None) → List[Tuple[List[int], List[int]]]

Parse a fold file and return fold definitions.

Parameters:
  • file_path – Path to the fold file.

  • format – Optional format hint (‘csv’, ‘json’, ‘yaml’, ‘txt’). If None, format is auto-detected from extension.

Returns:

List of (train_indices, val_indices) tuples.

Raises: