nirs4all.controllers.splitters.fold_file_loader module

Controller for loading pre-computed fold indices from files.

This module provides the FoldFileLoaderController which loads fold definitions from previously saved fold files (generated by splitters like KFold, ShuffleSplit, etc.) or from user-provided fold files.

Supported file formats:

- CSV: nirs4all standard format (fold_0, fold_1, … columns with sample IDs)
- CSV: Single fold-column format (sample_id and fold columns, one row per sample)
- JSON: List of fold objects with train/val keys
- YAML: Same structure as JSON
- TXT: Simple index lists (one per line)
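As a rough illustration of the JSON layout listed above, the following sketch writes a two-fold file using only the standard library. The file name `my_folds.json` and the index values are invented for the example, and the exact schema the parser accepts (for instance, whether extra keys are tolerated) is an assumption to verify against the implementation.

```python
import json
import tempfile
from pathlib import Path

# Two hypothetical folds over six samples (indices 0..5).
# Each fold is an object with "train" and "val" keys, as described above.
folds = [
    {"train": [0, 1, 2], "val": [3, 4, 5]},
    {"train": [3, 4, 5], "val": [0, 1, 2]},
]

# Write into a temporary directory so the sketch leaves no files behind
# in the working directory.
out_path = Path(tempfile.mkdtemp()) / "my_folds.json"
out_path.write_text(json.dumps(folds, indent=2), encoding="utf-8")

# Read the file back to confirm the structure round-trips unchanged.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```

A file produced this way could then be referenced from a pipeline step of the form {"split": "<path to the file>"}.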

Example pipeline usage:

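from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler

pipeline = [
    MinMaxScaler(),
    # A string path under "split" loads pre-computed folds from that file.
    {"split": "workspace/runs/my_run/folds_KFold_seed42.csv"},
    {"model": PLSRegression()}
]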

class nirs4all.controllers.splitters.fold_file_loader.FoldFileLoaderController

Bases: OperatorController

Controller for loading pre-computed fold indices from files.

This controller matches pipeline steps where the ‘split’ keyword is used with a file path (string ending in a supported extension) instead of a splitter object.

Examples

>>> # In pipeline
>>> {"split": "path/to/folds.csv"}
>>> {"split": "workspace/runs/my_run/folds_KFold_seed42.csv"}

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any]

Load folds from file and set them on the dataset.

Parameters:

step_info – Parsed step containing the file path.
dataset – Dataset to set folds on.
context – Current execution context.
runtime_context – Runtime context with global settings.
source – Source index (unused).
mode – Execution mode (“train” or “predict”).
loaded_binaries – Pre-loaded binaries (unused).
prediction_store – Prediction store (unused).

Returns:

Tuple of (context, StepOutput).

classmethod matches(step: Any, operator: Any, keyword: str) → bool

Match steps with ‘split’ keyword and file path value.

Returns True if:

- keyword is ‘split’, AND
- operator is a string (file path), AND
- path has a supported extension (.csv, .json, .yaml, .yml, .txt)
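The three conditions above can be sketched as a standalone predicate. This is not the controller's actual code: the function name `looks_like_fold_file_step` is hypothetical, and the case-insensitive extension comparison is an assumption of the sketch.

```python
from pathlib import Path
from typing import Any

# Extensions listed in the documentation above.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".yaml", ".yml", ".txt"}


def looks_like_fold_file_step(operator: Any, keyword: str) -> bool:
    """Hypothetical stand-in mirroring the three documented conditions."""
    # Condition 1: the step must use the 'split' keyword.
    if keyword != "split":
        return False
    # Condition 2: the operator must be a string (interpreted as a file path).
    if not isinstance(operator, str):
        return False
    # Condition 3: the path must end in a supported extension.
    # Lower-casing the suffix is an assumption, not documented behaviour.
    return Path(operator).suffix.lower() in SUPPORTED_EXTENSIONS
```

Under these assumptions, a splitter object passed as the operator fails the second condition, so it would be left for another controller to handle.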

priority: int = 9

classmethod supports_prediction_mode() → bool

Fold files should be loaded in prediction mode to set up fold structure.

classmethod use_multi_source() → bool

Fold loading is a single-source operation.

class nirs4all.controllers.splitters.fold_file_loader.FoldFileParser

Bases: object

Utility class for parsing fold files in various formats.

Supports multiple fold file formats:

- nirs4all CSV: columns fold_0, fold_1, etc. with sample IDs as rows
- Assignment CSV: columns sample_id, fold assigning each sample to a fold
- JSON: List of dicts with train and val (or test) keys
- YAML: Same structure as JSON
- TXT: Simple format with fold indices

Examples

>>> parser = FoldFileParser()
>>> folds = parser.parse("folds_KFold.csv")
>>> # Returns: [(train_ids, val_ids), (train_ids, val_ids), ...]
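To make the returned shape concrete, here is a minimal sketch of how a sample-to-fold assignment (as in the assignment CSV format) could be expanded into a list of (train_ids, val_ids) tuples. The helper `assignments_to_folds` is hypothetical, and it assumes that a sample assigned to fold k serves as validation data in fold k and as training data in every other fold; the real parser may interpret the column differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assignments_to_folds(
    assignment: Dict[int, int],
) -> List[Tuple[List[int], List[int]]]:
    """Expand {sample_id: fold} into [(train_ids, val_ids), ...].

    Hypothetical helper: assumes a sample's fold number marks the fold
    in which it is held out for validation.
    """
    # Group sample IDs by the fold they are assigned to.
    by_fold = defaultdict(list)
    for sample_id, fold in assignment.items():
        by_fold[fold].append(sample_id)

    all_ids = sorted(assignment)
    folds = []
    for fold in sorted(by_fold):
        val_ids = sorted(by_fold[fold])
        held_out = set(val_ids)
        # Every sample not held out in this fold is used for training.
        train_ids = [s for s in all_ids if s not in held_out]
        folds.append((train_ids, val_ids))
    return folds


# Four samples split across two folds.
example = assignments_to_folds({0: 0, 1: 1, 2: 0, 3: 1})
```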

SUPPORTED_EXTENSIONS = {'.csv', '.json', '.txt', '.yaml', '.yml'}

parse(file_path: str | Path, format: str | None = None) → List[Tuple[List[int], List[int]]]

Parse a fold file and return fold definitions.

Parameters:

file_path – Path to the fold file.
format – Optional format hint (‘csv’, ‘json’, ‘yaml’, ‘txt’). If None, format is auto-detected from extension.

Returns:

List of (train_indices, val_indices) tuples.

Raises:

FileNotFoundError – If file doesn’t exist.
ValueError – If file format is unsupported or content is invalid.
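The interaction between the format hint, extension-based auto-detection, and the two documented exceptions might look roughly like the sketch below. The function `resolve_format` and the mapping `_EXT_TO_FORMAT` are hypothetical names, and the order of the checks (file existence first, then the hint, then the extension) is an assumption rather than the documented behaviour of parse().

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical extension-to-format mapping; '.yml' is treated as YAML.
_EXT_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".yaml": "yaml",
    ".yml": "yaml",
    ".txt": "txt",
}


def resolve_format(file_path: Union[str, Path], format: Optional[str] = None) -> str:
    """Return the format name to use for a fold file (illustrative only)."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Fold file not found: {path}")
    # An explicit hint takes precedence over the extension.
    if format is not None:
        if format not in _EXT_TO_FORMAT.values():
            raise ValueError(f"Unsupported format hint: {format!r}")
        return format
    # Otherwise fall back to auto-detection from the file extension.
    suffix = path.suffix.lower()
    if suffix not in _EXT_TO_FORMAT:
        raise ValueError(f"Unsupported fold file extension: {suffix!r}")
    return _EXT_TO_FORMAT[suffix]
```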