nirs4all.controllers package

Subpackages

nirs4all.controllers.charts package
- Submodules
- Module contents
  - ExclusionChartController
nirs4all.controllers.data package
- Submodules
- Module contents
nirs4all.controllers.flow package
- Submodules
- Module contents
  - DummyController
nirs4all.controllers.models package
nirs4all.controllers.shared package
- Submodules
  - nirs4all.controllers.shared.model_selector module
    - ModelSelector
  - nirs4all.controllers.shared.prediction_aggregator module
    - PredictionAggregator
- Module contents
  - ModelSelector
  - PredictionAggregator
nirs4all.controllers.splitters package
- Submodules
  - nirs4all.controllers.splitters.fold_file_loader module
    - FoldFileLoaderController
    - FoldFileParser
  - nirs4all.controllers.splitters.split module
    - CrossValidatorController
- Module contents
  - FoldFileLoaderController
  - FoldFileParser
    - FoldFileParser.SUPPORTED_EXTENSIONS
    - FoldFileParser.parse()
nirs4all.controllers.transforms package
- Submodules
- Module contents

Submodules

Module contents

Controllers module for nirs4all package.

This module contains all controller classes for pipeline operator execution. Controllers implement the execution logic for different operator types following the operator-controller pattern.

class nirs4all.controllers.AugmentationChartController[source]

Bases: OperatorController

Controller for visualizing augmentation effects on spectra.

Supports two visualization modes:

1. augment_chart: Shows original vs augmented samples overlaid with different colors
2. augment_details_chart: Shows a grid with raw data and each augmentation type separately

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: Any, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any][source]

Execute augmentation visualization.

Returns: Tuple of (context, StepOutput)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Chart controllers should skip execution during prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.AutoTransferPreprocessingController[source]

Bases: OperatorController

Controller for automatic transfer-optimized preprocessing selection.

This controller analyzes the distributional distance between source and target datasets and automatically selects preprocessing that best aligns them while preserving predictive information.

Configuration options:

preset: Preset configuration for the selector.

“fast” (default): Quick evaluation of single preprocessings only
“balanced”: Includes stacking evaluation
“thorough”: Includes stacking and augmentation
“full”: All stages including supervised validation
“exhaustive”: Deep analysis for research/benchmarking

source_partition: Partition to use as source data (“train” or “test”). Default is “train”.

target_partition: Partition to use as target data (“train” or “test”). Default is “test”.

apply_recommendation: Whether to apply the best preprocessing to the dataset. If False, only stores the recommendation in context. Default is True.

top_k: Number of top recommendations to apply if using augmentation. Default is 1 (best single preprocessing).

use_augmentation: If top_k > 1, whether to use feature augmentation to concatenate outputs. Default is False.

n_components: Number of PCA components for metric computation. Default is 10.

verbose: Verbosity level (0=silent, 1=progress, 2=detailed). Default is 1.

Stage-specific options (override preset):

run_stage2: Enable stacking evaluation.
stage2_top_k: Number of top candidates for stacking.
run_stage3: Enable augmentation evaluation.
run_stage4: Enable supervised validation.
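How stage-specific options might override a preset can be sketched as follows. This is an illustration only: `PRESET_STAGES` and `resolve_stages` are hypothetical names, the stage flags are inferred from the preset descriptions above, and “exhaustive” is left out because the text does not say which stages it enables.

```python
# Hypothetical mapping inferred from the preset descriptions; not the library's code.
PRESET_STAGES = {
    "fast": {"run_stage2": False, "run_stage3": False, "run_stage4": False},
    "balanced": {"run_stage2": True, "run_stage3": False, "run_stage4": False},
    "thorough": {"run_stage2": True, "run_stage3": True, "run_stage4": False},
    "full": {"run_stage2": True, "run_stage3": True, "run_stage4": True},
}


def resolve_stages(config: dict) -> dict:
    """Start from the preset's stage flags, then let explicit options win."""
    preset = config.get("preset", "fast")
    if preset not in PRESET_STAGES:
        raise ValueError(f"preset not covered by this sketch: {preset!r}")
    stages = dict(PRESET_STAGES[preset])
    for key in stages:
        if key in config:
            stages[key] = bool(config[key])
    return stages


default_stages = resolve_stages({})
overridden = resolve_stages({"preset": "fast", "run_stage2": True})
```

With an empty configuration every optional stage is off; adding `run_stage2` to a “fast” preset switches stacking evaluation on without touching the other stages.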

Example pipeline configurations:

# Simple - use defaults
{"auto_transfer_preproc": {}}

# With preset
{"auto_transfer_preproc": {"preset": "balanced"}}

# Full configuration
{
    "auto_transfer_preproc": {
        "preset": "thorough",
        "source_partition": "train",
        "target_partition": "test",
        "apply_recommendation": True,
        "top_k": 1,
        "verbose": 2,
    }
}

# Multi-source with augmentation
{
    "auto_transfer_preproc": {
        "preset": "balanced",
        "top_k": 3,
        "use_augmentation": True,
    }
}

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List[Tuple[str, Any]]][source]

Execute auto transfer preprocessing selection.

In train mode:

Extract source and target data from the dataset
Run TransferPreprocessingSelector to find best preprocessing
Apply the recommended preprocessing if configured
Store the recommendation as an artifact

In predict mode:

Load the saved preprocessing recommendation
Apply it to the incoming data

Parameters:

step_info – Parsed step containing the auto_transfer_preproc config
dataset – SpectroDataset to operate on
context – Execution context with selector and metadata
runtime_context – Runtime infrastructure (saver, step_number, etc.)
source – Source index (-1 for all sources)
mode – Execution mode (“train”, “predict”, “explain”)
loaded_binaries – Pre-loaded artifacts for predict/explain mode
prediction_store – Not used by this controller

Returns:

Tuple of (updated_context, list_of_artifacts)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if step is an auto_transfer_preproc operation.

priority: int = 9

classmethod supports_prediction_mode() → bool[source]

Supports prediction mode for applying saved recommendations.

In prediction mode, the controller loads the previously computed preprocessing recommendation and applies it to the new data.

classmethod use_multi_source() → bool[source]: Supports multi-source datasets.

class nirs4all.controllers.BaseController[source]

Bases: ABC

Abstract base class for all controllers.

Controllers are responsible for executing operators within a pipeline context. They handle framework-specific logic, state management, and validation.

abstractmethod can_handle(operator: Any) → bool[source]

Check if this controller can handle the given operator.

Parameters: operator (Any) – The operator to check.
Returns: True if this controller can handle the operator.
Return type: bool

cleanup(operator: Any, context: ExecutionContext) → None[source]

Clean up after operator execution.

This method can be overridden to perform cleanup tasks after execution.

Parameters:

operator (Any) – The operator that was executed.
context (ExecutionContext) – Pipeline execution context.

abstractmethod execute(operator: Any, context: ExecutionContext) → Any[source]

Execute the operator within the pipeline context.

Parameters:

operator (Any) – The operator to execute.
context (ExecutionContext) – Pipeline execution context including data, state, etc.

Returns:

Result of operator execution.

Return type:

Any

prepare(operator: Any, context: ExecutionContext) → None[source]

Prepare the operator for execution.

This method can be overridden to perform setup tasks before execution.

Parameters:

operator (Any) – The operator to prepare.
context (ExecutionContext) – Pipeline execution context.

validate(operator: Any) → None[source]

Validate the operator before execution.

Parameters: operator (Any) – The operator to validate.
Raises: ValueError – If operator is invalid.
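A minimal sketch of how a subclass might fill in this interface. To stay self-contained it defines a stand-in `BaseControllerSketch` instead of importing nirs4all. The `ScaleController` class, the dict-based context, and the validate, prepare, execute, cleanup call order in `run` are illustrative assumptions, not documented behavior.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseControllerSketch(ABC):
    """Stand-in mirroring the documented method names (hypothetical)."""

    @abstractmethod
    def can_handle(self, operator: Any) -> bool: ...

    @abstractmethod
    def execute(self, operator: Any, context: dict) -> Any: ...

    def validate(self, operator: Any) -> None:
        if operator is None:
            raise ValueError("operator must not be None")

    def prepare(self, operator: Any, context: dict) -> None:
        pass

    def cleanup(self, operator: Any, context: dict) -> None:
        pass


class ScaleController(BaseControllerSketch):
    """Hypothetical controller that handles numeric scale factors."""

    def can_handle(self, operator: Any) -> bool:
        return isinstance(operator, (int, float))

    def prepare(self, operator: Any, context: dict) -> None:
        context["log"] = ["prepare"]

    def execute(self, operator: Any, context: dict) -> Any:
        context["log"].append("execute")
        return [value * operator for value in context["data"]]

    def cleanup(self, operator: Any, context: dict) -> None:
        context["log"].append("cleanup")


def run(controller: BaseControllerSketch, operator: Any, context: dict) -> Any:
    """Assumed call order: validate, prepare, execute, then cleanup."""
    controller.validate(operator)
    controller.prepare(operator, context)
    try:
        return controller.execute(operator, context)
    finally:
        controller.cleanup(operator, context)


context = {"data": [1.0, 2.0, 3.0]}
result = run(ScaleController(), 2, context)
```

Placing `cleanup` in a `finally` clause is a design choice of this sketch so that cleanup still runs when `execute` raises.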

class nirs4all.controllers.BranchController[source]

Bases: OperatorController

Controller for pipeline branching.

Implements the branching mechanism that allows multiple preprocessing chains to be evaluated independently within a single pipeline execution.

Key behaviors:

Creates independent context copies for each branch
Executes branch steps sequentially within each branch
Stores branch contexts in context.custom["branch_contexts"]
Post-branch steps iterate over all branch contexts

priority

Controller priority (lower = higher priority). Set to 5 to execute before most other controllers.

Type: int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, StepOutput][source]

Execute the branch step with V3 chain tracking.

Creates independent contexts for each branch, executes branch-specific steps, and stores branch contexts for post-branch iteration.

In predict/explain mode, only executes the target branch specified in runtime_context.target_model.branch_id for efficiency.

V3 improvements:

- Uses trace_recorder.enter_branch() / exit_branch() for branch path tracking
- Records each substep individually for complete trace fidelity
- Builds proper operator chains for artifact identification

Parameters:

step_info – Parsed step containing branch definitions
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binary objects for prediction mode
prediction_store – External prediction store for model predictions

Returns:

Tuple of (updated_context, StepOutput with collected artifacts)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Check if the step matches the branch controller.

Parameters:

step – Original step configuration
operator – Deserialized operator (may be list of branch definitions)
keyword – Step keyword

Returns:

True if keyword is “branch”

priority: int = 5

classmethod supports_prediction_mode() → bool[source]: Branch controller should execute in prediction mode to reconstruct branches.

classmethod use_multi_source() → bool[source]: Branch controller supports multi-source datasets.
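The branching behavior described for this controller (independent context copies, sequential steps per branch, post-branch steps applied to every branch) can be illustrated with plain dictionaries. The sketch below does not use nirs4all types: `run_branches`, `run_post_branch_step`, and the `"processing"` key are hypothetical, and only the `custom["branch_contexts"]` location is taken from the text above.

```python
import copy


def run_branches(context: dict, branches: list) -> dict:
    """Run each branch on its own deep copy of the context."""
    branch_contexts = []
    for branch_id, steps in enumerate(branches):
        branch_context = copy.deepcopy(context)  # independent copy per branch
        for step in steps:  # steps run sequentially within the branch
            branch_context["processing"].append(step)
        branch_contexts.append({"branch_id": branch_id, "context": branch_context})
    context.setdefault("custom", {})["branch_contexts"] = branch_contexts
    return context


def run_post_branch_step(context: dict, step: str) -> None:
    """A post-branch step is applied once per stored branch context."""
    for entry in context["custom"]["branch_contexts"]:
        entry["context"]["processing"].append(step)


ctx = run_branches({"processing": ["raw"]}, [["snv"], ["msc", "savgol"]])
run_post_branch_step(ctx, "pls")
chains = [entry["context"]["processing"] for entry in ctx["custom"]["branch_contexts"]]
```

Because each branch works on a deep copy, the parent context's processing list stays `["raw"]` while the two branches accumulate their own chains.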

class nirs4all.controllers.ConcatAugmentationController[source]

Bases: OperatorController

Controller that concatenates multiple transformer outputs.

Semantics:

- Top-level (add_feature=False): REPLACES each processing with concatenated version
- Inside feature_augmentation (add_feature=True): ADDS one new processing

Supports:

- Single transformers: PCA(50)
- Chained transformers: [Wavelet(), PCA(50)] → sequential application
- Mixed: [PCA(50), [Wavelet(), SVD(30)], LocalStats()]

Examples

Top-level replacement:

>>> pipeline = [{"concat_transform": [PCA(50), SVD(50)]}]
# Before: (500, 3, 500) with ["raw", "snv", "savgol"]
# After: (500, 3, 100) with ["raw_concat_PCA_SVD", "snv_concat_PCA_SVD", …]

Nested inside feature_augmentation:

>>> pipeline = [{
...     "feature_augmentation": [
...         SNV(),
...         {"concat_transform": [PCA(50), SVD(50)]}
...     ]
... }]
# Before: (500, 1, 500) with ["raw"]
# After: (500, 3, 500) with ["raw", "snv", "concat_PCA_SVD"] (padded)
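The shape change in the top-level case can be reproduced with NumPy alone. The sketch below is not the controller's implementation: `reduce_svd` and `concat_transform` are hypothetical helpers, and the two reducers are simple SVD projections standing in for PCA and truncated SVD. It only shows that concatenating two k-feature outputs per processing turns (samples, processings, features) into (samples, processings, 2k).

```python
import numpy as np


def reduce_svd(X: np.ndarray, k: int, center: bool) -> np.ndarray:
    """Project X (samples, features) onto its top-k right singular vectors."""
    Xc = X - X.mean(axis=0) if center else X
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T


def concat_transform(X3d: np.ndarray, reducers: list) -> np.ndarray:
    """Replace each processing by the concatenation of all reducer outputs."""
    out = []
    for p in range(X3d.shape[1]):
        block = X3d[:, p, :]
        out.append(np.concatenate([reducer(block) for reducer in reducers], axis=1))
    return np.stack(out, axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3, 60))  # (samples, processings, features)
reducers = [
    lambda b: reduce_svd(b, 5, center=True),   # PCA-like stand-in
    lambda b: reduce_svd(b, 5, center=False),  # truncated-SVD-like stand-in
]
X_concat = concat_transform(X, reducers)
```

Here 60 features per processing become 10 (5 from each reducer), which mirrors the 500 to 100 reduction in the example above.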

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]

Execute concat augmentation.

Parameters:

step_info – Parsed step containing the concat_transform config
dataset – SpectroDataset to operate on
context – Execution context with selector and metadata
runtime_context – Runtime infrastructure (saver, step_number, etc.)
source – Source index (-1 for all sources)
mode – Execution mode (“train”, “predict”, “explain”)
loaded_binaries – Pre-fitted transformers for predict/explain mode
prediction_store – Not used by this controller

Returns:

Tuple of (updated_context, list_of_artifacts)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if step is a concat_transform operation.

static normalize_generator_spec(spec: Any) → Any[source]

Normalize generator spec for concat_transform context.

In concat_transform context, multi-selection should use combinations by default since the order of concatenated features doesn’t matter. Translates legacy ‘size’ to ‘pick’ for explicit semantics.

Parameters: spec – Generator specification (may contain _or_, size, pick, arrange).
Returns: Normalized spec with ‘size’ converted to ‘pick’ if needed.
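A rough sketch of the translation described here, under stated assumptions: `normalize_spec_sketch` and `expand_pick` are hypothetical helpers, ‘pick’ is assumed to mean an unordered selection (combinations), and leaving the spec untouched when ‘pick’ or ‘arrange’ is already present is a guess about the "if needed" condition.

```python
from itertools import combinations
from typing import Any


def normalize_spec_sketch(spec: Any) -> Any:
    """Rename legacy 'size' to 'pick' when no explicit 'pick'/'arrange' is given."""
    if not isinstance(spec, dict) or "_or_" not in spec:
        return spec
    normalized = dict(spec)  # leave the caller's dict untouched
    if "size" in normalized and "pick" not in normalized and "arrange" not in normalized:
        normalized["pick"] = normalized.pop("size")
    return normalized


def expand_pick(spec: dict) -> list:
    """Enumerate unordered selections, as 'pick' is assumed to mean here."""
    return [list(c) for c in combinations(spec["_or_"], spec["pick"])]


legacy = {"_or_": ["PCA", "SVD", "Wavelet"], "size": 2}
normalized = normalize_spec_sketch(legacy)
choices = expand_pick(normalized)
```

Using combinations means `["PCA", "SVD"]` and `["SVD", "PCA"]` count as the same choice, which matches the stated reason that the order of concatenated features does not matter.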

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Supports prediction mode for applying saved transformers.

classmethod use_multi_source() → bool[source]: Supports multi-source datasets.

class nirs4all.controllers.CrossValidatorController[source]

Bases: OperatorController

Controller for any sklearn‑compatible splitter (native or custom).

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None)[source]

Run operator.split and store the resulting folds on dataset.

Smartly supplies y / groups only if required.
Extracts groups from metadata if specified.
Supports force_group parameter to wrap any splitter with group-awareness.
Maps local indices back to the global index space.
Stores the list of folds into the dataset for subsequent steps.

Parameters:

step_info (ParsedStep) – Parsed step containing the operator and original step configuration.
dataset (SpectroDataset) – The dataset to split.
context (ExecutionContext) – Current execution context.
runtime_context (RuntimeContext) – Runtime context with global settings.
source (int) – Source index (-1 for combined sources).
mode (str) – Execution mode (“train”, “predict”, or “explain”).
loaded_binaries (Any) – Pre-loaded binary data (not used).
prediction_store (Any) – Store for predictions (not used).

Notes

The force_group parameter enables any sklearn-compatible splitter to work with grouped samples by wrapping it with GroupedSplitterWrapper. This aggregates samples by group, passes “virtual samples” to the splitter, and expands fold indices back to the original dataset.

Example usage:

{"split": KFold(n_splits=5), "force_group": "Sample_ID"}
{"split": ShuffleSplit(test_size=0.2), "force_group": "ID", "aggregation": "median"}
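The aggregate, split, expand idea behind force_group can be sketched without any dependencies. This is not GroupedSplitterWrapper itself: `kfold_indices` is a toy stand-in for an sklearn splitter, `grouped_split` is a hypothetical helper, and mean aggregation is chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def kfold_indices(n: int, n_splits: int) -> list:
    """Tiny unshuffled k-fold, standing in for an sklearn splitter."""
    folds = [list(range(i, n, n_splits)) for i in range(n_splits)]
    return [
        (sorted(j for k, f in enumerate(folds) if k != i for j in f), folds[i])
        for i in range(n_splits)
    ]


def grouped_split(X: list, groups: list, n_splits: int) -> list:
    """Split on one virtual sample per group, then expand to row indices."""
    members = defaultdict(list)
    for row, group in enumerate(groups):
        members[group].append(row)
    group_ids = list(members)
    # One "virtual sample" per group (mean aggregation; unused by this toy splitter).
    virtual_X = [mean(X[r] for r in members[g]) for g in group_ids]
    folds = []
    for train_g, test_g in kfold_indices(len(virtual_X), n_splits):
        train = sorted(r for g in train_g for r in members[group_ids[g]])
        test = sorted(r for g in test_g for r in members[group_ids[g]])
        folds.append((train, test))
    return folds


X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
groups = ["a", "a", "b", "b", "c", "c"]
folds = grouped_split(X, groups, n_splits=3)
```

Because the splitter only ever sees one row per group, all rows of a group land on the same side of every fold.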

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Return True if operator behaves like a splitter.

Criteria – must expose a callable split whose first positional argument is named X. Optional presence of get_n_splits is a plus but not mandatory, so user‑defined simple splitters are still accepted.

Also matches on the ‘split’ keyword for group-aware splitting syntax.
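The duck-typing criterion can be approximated with the standard `inspect` module. The helper below is a hedged sketch of such a check, not the controller's code; `looks_like_splitter` and `HalfSplitter` are hypothetical names.

```python
import inspect
from typing import Any


def looks_like_splitter(operator: Any) -> bool:
    """True if operator.split is callable and its first positional argument is X."""
    split = getattr(operator, "split", None)
    if not callable(split):
        return False
    try:
        params = list(inspect.signature(split).parameters.values())
    except (TypeError, ValueError):
        return False  # some builtins expose no introspectable signature
    positional = [
        p for p in params
        if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]
    return bool(positional) and positional[0].name == "X"


class HalfSplitter:
    """Hypothetical user-defined splitter without get_n_splits."""

    def split(self, X, y=None, groups=None):
        half = len(X) // 2
        yield list(range(half)), list(range(half, len(X)))
```

Checking the argument name rather than only the presence of `split` keeps unrelated objects such as strings, which also have a `split` method, from matching.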

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Cross-validators should not execute during prediction mode.

classmethod use_multi_source() → bool[source]: Cross‑validators themselves are single‑source operators.

class nirs4all.controllers.DummyController[source]

Bases: OperatorController

Catch-all controller for operators not handled by other controllers.

This controller has the lowest priority and will catch any operators that don’t match other controllers, providing detailed debugging information about why they weren’t handled elsewhere.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]: Handle unmatched operators and provide detailed debugging information.

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Always match as a last resort.

This controller should only be reached if no other controller with higher priority has matched the step/operator/keyword combination.

priority: int = 1000

classmethod supports_prediction_mode() → bool[source]: Dummy controller supports prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.FeatureAugmentationController[source]

Bases: OperatorController

Controller for feature augmentation with multiple action modes.

The feature_augmentation controller supports three action modes that control how preprocessing operations interact with existing processings:

extend (default): Add new processings to the set. Each operation runs independently on the base processing. If a processing already exists, it is not duplicated. Growth pattern is linear.
add: Chain each operation on top of ALL existing processings. Keep original processings alongside new chained versions. Growth pattern is multiplicative with originals (n + n×m).
replace: Chain each operation on top of ALL existing processings. Discard original processings, keeping only the chained versions. Growth pattern is multiplicative without originals (n×m).
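The three growth patterns can be checked at the level of processing names. The function below is a name-only sketch, not the controller's logic: `augment_processings` is hypothetical, and the naming scheme (base name "raw", underscore-joined suffixes) is an assumption chosen to match this section's examples.

```python
def augment_processings(existing: list, operations: list, action: str = "extend",
                        base: str = "raw") -> list:
    """Return processing names after one feature_augmentation step (name-level sketch)."""
    if action == "extend":
        result = list(existing)
        for op in operations:
            name = f"{base}_{op}"
            if name not in result:  # existing processings are not duplicated
                result.append(name)
        return result
    chained = [f"{proc}_{op}" for proc in existing for op in operations]
    if action == "add":
        return list(existing) + chained  # n + n*m processings
    if action == "replace":
        return chained  # n*m processings
    raise ValueError(f"invalid action: {action!r}")


ops = ["SNV", "Gaussian"]
extended = augment_processings(["raw_A"], ops, "extend")
added = augment_processings(["raw_A", "raw_B"], ops, "add")
replaced = augment_processings(["raw_A", "raw_B"], ops, "replace")
```

With two existing processings and two operations, "add" yields 2 + 2×2 = 6 names and "replace" yields 2×2 = 4, while "extend" grows by at most one name per operation.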

Example

>>> # Extend mode (default) - linear growth
>>> {"feature_augmentation": [SNV, Gaussian], "action": "extend"}
>>> # With raw_A already present: raw_A, raw_SNV, raw_Gaussian

>>> # Add mode - multiplicative with originals
>>> {"feature_augmentation": [SNV, Gaussian], "action": "add"}
>>> # With raw_A present: raw_A, raw_A_SNV, raw_A_Gaussian

>>> # Replace mode - multiplicative, discards originals
>>> {"feature_augmentation": [SNV, Gaussian], "action": "replace"}
>>> # With raw_A present: raw_A_SNV, raw_A_Gaussian (raw_A discarded)

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]

Execute feature augmentation with specified action mode.

Parameters:

step_info – Parsed step information containing the operation list and action mode.
dataset – The spectroscopic dataset to process.
context – Current execution context with processing state.
runtime_context – Runtime infrastructure for step execution.
source – Source index (-1 for all sources).
mode – Execution mode (“train”, “predict”, etc.).
loaded_binaries – Pre-loaded binary artifacts for prediction mode.
prediction_store – Store for prediction-time state.

Returns:

Tuple of (updated_context, artifacts_list).

Raises:

ValueError – If action mode is invalid.

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

static normalize_generator_spec(spec: Any) → Any[source]

Normalize generator spec for feature_augmentation context.

In feature_augmentation context, multi-selection should use combinations by default since the order of parallel feature channels doesn’t matter. Translates legacy ‘size’ to ‘pick’ for explicit semantics.

Parameters: spec – Generator specification (may contain _or_, size, pick, arrange).
Returns: Normalized spec with ‘size’ converted to ‘pick’ if needed.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Feature augmentation should NOT execute during prediction mode - transformations are already applied and saved.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.FoldChartController[source]

Bases: OperatorController

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: Any, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any][source]

Execute fold visualization showing train/test splits with y-value color coding. Skips execution in prediction mode.

Returns: Tuple of (context, StepOutput)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Chart controllers should skip execution during prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.JaxModelController[source]

Bases: BaseModelController

Controller for JAX/Flax models.

Uses lazy loading pattern - JAX is only imported when training or prediction is actually performed.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, bytes]] | None = None, prediction_store: Predictions = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]: Execute JAX model controller.

get_preferred_layout() → str[source]

Return the preferred data layout for JAX models.

Flax Dense layers expect (batch, features). Flax Conv layers expect (batch, length, features) i.e. (N, L, C). So ‘3d_transpose’ is suitable for Conv1D.
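The layouts mentioned here differ only in axis order. The NumPy sketch below shows the reshaping involved; that the ‘3d_transpose’ layout corresponds exactly to this axis swap is an assumption based on the (N, L, C) description, and the array names are illustrative.

```python
import numpy as np

# Toy batch in (samples, processings, features) order, i.e. (N, C, L).
x_ncl = np.arange(2 * 3 * 5, dtype=np.float32).reshape(2, 3, 5)

# Swapping the last two axes gives (N, L, C), the layout Flax Conv layers expect.
x_nlc = np.transpose(x_ncl, (0, 2, 1))

# A 2D layout for Dense layers: flatten processings and features together.
x_2d = x_ncl.reshape(x_ncl.shape[0], -1)
```

The transpose moves no values between samples; element `[n, c, l]` of the original is element `[n, l, c]` of the transposed array.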

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Match JAX models and model configurations.

priority: int = 4

process_hyperparameters(params: Dict[str, Any]) → Dict[str, Any][source]: Process hyperparameters for JAX model tuning.

class nirs4all.controllers.OperatorController[source]

Bases: ABC

Base class for pipeline operators.

abstractmethod execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, Any][source]

Run the operator with the given parameters and context.

Parameters:

step_info – Parsed step containing operator, keyword, and metadata
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binary objects for prediction mode
prediction_store – External prediction store for model predictions

Returns:

Tuple of (updated_context, StepOutput)

abstractmethod classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

priority: int = 100

classmethod supports_prediction_mode() → bool[source]

Check if the controller should execute during prediction mode.

Returns: True if the controller should execute in prediction mode, False if it should be skipped (e.g., chart controllers)

abstractmethod classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.
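How `matches` and `priority` might combine to pick a controller for a step can be sketched as below. The classes and the `select_controller` function are hypothetical stand-ins, not the library's registry; the sketch assumes that candidates are tried in ascending `priority` order (lower value means higher priority) and that a catch-all with priority 1000 sits at the end.

```python
from typing import Any


class ControllerSketch:
    """Stand-in for OperatorController with only the dispatch-related members."""
    priority = 100

    @classmethod
    def matches(cls, step: Any, operator: Any, keyword: str) -> bool:
        raise NotImplementedError


class BranchSketch(ControllerSketch):
    priority = 5

    @classmethod
    def matches(cls, step, operator, keyword):
        return keyword == "branch"


class CatchAllSketch(ControllerSketch):
    priority = 1000

    @classmethod
    def matches(cls, step, operator, keyword):
        return True  # last resort, in the spirit of DummyController


def select_controller(registry: list, step: Any, operator: Any, keyword: str):
    """Return the matching controller with the smallest priority value."""
    for controller in sorted(registry, key=lambda c: c.priority):
        if controller.matches(step, operator, keyword):
            return controller
    raise LookupError("no controller matched")


registry = [CatchAllSketch, BranchSketch]
chosen = select_controller(registry, {"branch": []}, None, "branch")
fallback = select_controller(registry, {"unknown": 1}, None, "unknown")
```

Sorting by priority before matching means the registration order of controllers does not affect which one is chosen.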

class nirs4all.controllers.PyTorchModelController[source]

Bases: BaseModelController

Controller for PyTorch models.

Uses lazy loading pattern - PyTorch is only imported when training or prediction is actually performed.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, bytes]] | None = None, prediction_store: Predictions = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]: Execute PyTorch model controller.

get_preferred_layout() → str[source]

Return the preferred data layout for PyTorch models.

PyTorch typically expects (samples, channels, features) for 1D convs. We use ‘3d’ which gives (samples, processings, features) -> (N, C, L).

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Match PyTorch models and model configurations.

priority: int = 4

process_hyperparameters(params: Dict[str, Any]) → Dict[str, Any][source]: Process hyperparameters for PyTorch model tuning.

class nirs4all.controllers.ResamplerController[source]

Bases: OperatorController

Controller for Resampler operators.

This controller:

1. Extracts wavelengths from dataset headers
2. Validates that headers are convertible to float (wavelengths in cm-1)
3. Fits the resampler with original wavelengths
4. Transforms all data to the target wavelength grid
5. Updates dataset with new features and headers
6. Supports multi-source datasets with per-source or shared parameters
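Steps 1 to 5 can be illustrated with NumPy. The sketch below uses linear interpolation (`np.interp`) as a stand-in, since the interpolation method of the actual Resampler is not stated here; `resample` is a hypothetical helper operating on a plain array and a list of header strings.

```python
import numpy as np


def resample(X: np.ndarray, headers: list, target: np.ndarray):
    """Interpolate each spectrum from its header wavelengths onto `target`."""
    try:
        wavelengths = np.array([float(h) for h in headers])
    except ValueError as exc:
        raise ValueError("headers must be convertible to float") from exc
    order = np.argsort(wavelengths)  # np.interp needs increasing x values
    wl, Xs = wavelengths[order], X[:, order]
    resampled = np.vstack([np.interp(target, wl, row) for row in Xs])
    new_headers = [str(float(t)) for t in target]
    return resampled, new_headers


headers = ["4000", "4002", "4004", "4006"]
X = np.array([[0.0, 2.0, 4.0, 6.0],
              [1.0, 1.0, 1.0, 1.0]])
target = np.array([4001.0, 4003.0, 4005.0])
X_new, new_headers = resample(X, headers, target)
```

Sorting by wavelength first keeps the sketch valid for spectra stored in descending order, which `np.interp` would otherwise handle incorrectly.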

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List][source]

Execute resampling operation.

Parameters:

step_info – Pipeline step configuration
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime context
source – Data source index (-1 for all sources)
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binary objects for prediction mode
prediction_store – External prediction store (unused)

Returns:

Tuple of (updated_context, fitted_resamplers)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Match Resampler objects.

priority: int = 5

classmethod supports_prediction_mode() → bool[source]: Resampler supports prediction mode.

classmethod use_multi_source() → bool[source]: Resampler supports multi-source datasets.

class nirs4all.controllers.SampleAugmentationController[source]

Bases: OperatorController

Sample Augmentation Controller with delegation pattern.

This controller orchestrates sample augmentation by:

1. Calculating augmentation distribution (standard or balanced mode)
2. Creating transformer→samples mapping
3. Emitting ONE run_step per transformer with target samples

The actual augmentation work is delegated to TransformerMixinController.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: Any | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List][source]

Execute sample augmentation with standard or balanced mode.

Step format for standard mode:

{
    "sample_augmentation": {
        "transformers": [transformer1, transformer2, …],
        "count": int,
        "selection": "random" or "all",  # Default "random"
        "random_state": int  # Optional
    }
}

Step format for balanced mode (choose one balancing strategy):

Mode 1 - Fixed target size per class:

{
    "sample_augmentation": {
        "transformers": […],
        "balance": "y" or "metadata_column",  # Default "y"
        "target_size": int,  # Fixed target samples per class
        "selection": "random" or "all",
        "random_state": int
    }
}

Mode 2 - Multiplier for augmentation:

{
    "sample_augmentation": {
        "transformers": […],
        "balance": "y" or "metadata_column",
        "max_factor": float,  # Multiplier (e.g., 3 means class grows 3x)
        "selection": "random" or "all",
        "random_state": int
    }
}

Mode 3 - Percentage of majority class:

{
    "sample_augmentation": {
        "transformers": […],
        "balance": "y" or "metadata_column",
        "ref_percentage": float,  # Target as % of majority (0.0-1.0)
        "selection": "random" or "all",
        "random_state": int
    }
}

Binning for regression (automatic when balance="y" and task is regression):

{
    "sample_augmentation": {
        "transformers": […],
        "balance": "y",
        "bins": int,  # Number of virtual classes (default: 10)
        "binning_strategy": "equal_width" or "quantile",  # Default: "equal_width"
        "max_factor": float,  # Choose one balancing mode
        "selection": "random" or "all",
        "random_state": int
    }
}
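For intuition, the per-class augmentation counts for two of the balancing strategies can be worked out as below. This is a sketch under assumptions, not the controller's algorithm: `augmentation_counts` is a hypothetical helper, rounding of the percentage target is a guess, and `max_factor` is left out because its interaction with the majority class size is not specified here.

```python
from collections import Counter


def augmentation_counts(labels: list, target_size=None, ref_percentage=None) -> dict:
    """Number of augmented samples to add per class (assumed semantics)."""
    if (target_size is None) == (ref_percentage is None):
        raise ValueError("choose exactly one of target_size or ref_percentage")
    sizes = Counter(labels)
    if target_size is not None:
        target = target_size  # fixed target samples per class
    else:
        target = int(round(ref_percentage * max(sizes.values())))  # share of majority
    # Classes already at or above the target receive no augmentation.
    return {cls: max(0, target - n) for cls, n in sizes.items()}


labels = ["a"] * 10 + ["b"] * 4 + ["c"] * 2
by_target = augmentation_counts(labels, target_size=8)
by_percentage = augmentation_counts(labels, ref_percentage=0.5)
```

With class sizes 10, 4, and 2, a fixed target of 8 asks for 0, 4, and 6 new samples, and a 50% reference (target 5) asks for 0, 1, and 3.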

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

static normalize_generator_spec(spec: Any) → Any[source]

Normalize generator spec for sample_augmentation context.

In sample_augmentation context, multi-selection should use combinations by default since the order of transformers doesn’t matter. Translates legacy ‘size’ to ‘pick’ for explicit semantics.

Parameters: spec – Generator specification (may contain _or_, size, pick, arrange).
Returns: Normalized spec with ‘size’ converted to ‘pick’ if needed.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Sample augmentation only runs during training.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.SampleFilterController[source]

Bases: OperatorController

Controller for sample filtering operations.

This controller orchestrates sample filtering by:

1. Retrieving train samples (base only, no augmented) and their X/y values
2. Applying each filter’s get_mask() method to identify outliers
3. Combining masks according to the specified mode (any/all)
4. Marking excluded samples in the dataset’s indexer
5. Generating filtering report (optional)

Sample filters are non-destructive - they mark samples as excluded in the indexer rather than removing data. Excluded samples can be re-included using dataset._indexer.mark_included().

Pipeline syntax:

{
    "sample_filter": {
        "filters": [
            YOutlierFilter(method="iqr", threshold=1.5),
            XOutlierFilter(method="mahalanobis"),
        ],
        "mode": "any",  # "any" = exclude if ANY filter flags
        "report": True,  # Generate filtering report
        "cascade_to_augmented": True,  # Also exclude augmented samples
    }
}

Note

Filtering only runs during training mode - in prediction mode, this controller does nothing to avoid excluding prediction samples.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List][source]

Execute sample filtering operation.

This method:

1. Retrieves training data (base samples only)
2. Fits and applies each filter to identify outliers
3. Combines filter masks using the specified mode
4. Marks excluded samples in the dataset’s indexer
5. Optionally prints a filtering report
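Steps 3 and 4 can be sketched in a few lines. The code below is illustrative only: `combine_masks` and `IndexerSketch` are hypothetical, masks are assumed to use True for "flagged as outlier", and `mark_excluded` is an invented counterpart to the `mark_included` method mentioned in the class description.

```python
def combine_masks(masks: list, mode: str = "any") -> list:
    """Combine per-filter outlier flags (True = flagged) into one exclusion mask."""
    if not masks:
        raise ValueError("no filters specified")
    if mode not in ("any", "all"):
        raise ValueError(f"invalid mode: {mode!r}")
    combine = any if mode == "any" else all
    return [combine(flags) for flags in zip(*masks)]


class IndexerSketch:
    """Hypothetical indexer: exclusion is a flag, the data itself is untouched."""

    def __init__(self, n_samples: int):
        self.excluded = [False] * n_samples

    def mark_excluded(self, indices):
        for i in indices:
            self.excluded[i] = True

    def mark_included(self, indices):
        for i in indices:
            self.excluded[i] = False


y_flags = [False, True, False, True]   # e.g. from a y-outlier filter
x_flags = [False, True, True, False]   # e.g. from an X-outlier filter
exclude_any = combine_masks([y_flags, x_flags], mode="any")
exclude_all = combine_masks([y_flags, x_flags], mode="all")

indexer = IndexerSketch(4)
indexer.mark_excluded([i for i, flag in enumerate(exclude_all) if flag])
```

Keeping exclusion as a flag on the indexer is what makes the operation reversible: calling `mark_included` on the same indices restores the samples without reloading any data.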

Parameters:

step_info – Parsed step containing operator and configuration
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index (unused, filtering is dataset-level)
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binaries (filters may be persisted)
prediction_store – External prediction store (unused)

Returns:

Tuple of (updated_context, persisted_artifacts)

Raises:

ValueError – If no filters are specified
ValueError – If invalid mode is specified

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Match sample_filter keyword in pipeline.

priority: int = 5

classmethod supports_prediction_mode() → bool[source]

Sample filtering only runs during training.

Prediction samples should never be filtered/excluded - we want to predict on all provided samples. Filters were fitted during training and their thresholds don’t apply to new data.

classmethod use_multi_source() → bool[source]: Sample filtering operates on the dataset level, not per-source.

class nirs4all.controllers.SklearnModelController[source]

Bases: BaseModelController

Controller for scikit-learn models.

This controller handles sklearn models with support for training on 2D data, cross-validation, hyperparameter tuning with Optuna, model persistence, and integration with the nirs4all pipeline.

priority

Controller priority (6) - higher than TransformerMixin to prioritize supervised models over transformers.

Type: int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, bytes]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]

Execute sklearn model controller with score management.

Main entry point for sklearn model execution in the pipeline. Sets the preferred data layout to ‘2d’ and delegates to parent execute method.

Parameters:

step_info – Parsed step containing model configuration and operator.
dataset (SpectroDataset) – Dataset containing features and targets.
context (ExecutionContext) – Pipeline execution context with state info.
runtime_context (RuntimeContext) – Runtime context managing execution state.
source (int) – Source index for multi-source pipelines. Defaults to -1.
mode (str) – Execution mode (‘train’ or ‘predict’). Defaults to ‘train’.
loaded_binaries (Optional[List[Tuple[str, bytes]]]) – Pre-loaded model binaries for prediction mode. Defaults to None.
prediction_store (Optional[Any]) – Store for managing predictions. Defaults to None.

Returns:

Updated context and list of model binaries (name, serialized_model) for persistence.

Return type:

Tuple[ExecutionContext, List[Tuple[str, bytes]]]

Note

Automatically sets context[‘layout’] = ‘2d’ for sklearn compatibility
Inherits full training, evaluation, and prediction logic from BaseModelController
Respects force_layout if specified in step configuration

get_preferred_layout() → str[source]

Return the preferred data layout for sklearn models.

Returns:

Data layout preference, always ‘2d’ for sklearn models, which expect (n_samples, n_features) input format.

Return type:

str

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Match sklearn estimators and model dictionaries with sklearn models.

Prioritizes supervised models (regressors and classifiers) over transformers by checking for predict methods and using sklearn’s is_regressor/is_classifier.

Parameters:

step (Any) – Pipeline step to check, can be a dict with ‘model’ key or BaseEstimator instance.
operator (Any) – Optional operator object to check if it’s a BaseEstimator.
keyword (str) – Pipeline keyword (unused in this implementation).

Returns:

True if the step matches a sklearn estimator (regressor, classifier, or has predict method), False otherwise.

Return type:

bool
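
A simplified, dependency-free sketch of the matching rule described above. It reproduces only the duck-typed predict check; the BaseEstimator test and sklearn's is_regressor/is_classifier are left out. The function and toy classes are hypothetical and exist only for illustration.

```python
from typing import Any


def looks_like_supervised_model(step: Any, operator: Any = None) -> bool:
    """Simplified stand-in for the rule above (predict-method check only)."""
    # A dict step may carry the estimator under the 'model' key.
    candidate = step.get("model") if isinstance(step, dict) else step
    if candidate is None:
        # Fall back to the operator extracted from the step.
        candidate = operator
    return callable(getattr(candidate, "predict", None))


class ToyRegressor:
    """Hypothetical estimator exposing predict()."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0.0 for _ in X]


class ToyScaler:
    """Hypothetical transformer without predict()."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


dict_step_matches = looks_like_supervised_model({"model": ToyRegressor()})
transformer_matches = looks_like_supervised_model(ToyScaler())
operator_matches = looks_like_supervised_model({}, operator=ToyRegressor())
```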

priority: int = 6

class nirs4all.controllers.SpectraChartController[source]

Bases: OperatorController

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: Any, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any][source]

Execute spectra visualization for both 2D and 3D plots. Skips execution in prediction mode.

Supports optional parameters via dict syntax:: {“chart_2d”: {“include_excluded”: True, “highlight_excluded”: True}}

Parameters:

include_excluded – If True, include excluded samples in visualization
highlight_excluded – If True, highlight excluded samples with different style

Returns:

Tuple of (context, StepOutput)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Chart controllers should skip execution during prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.SpectralDistributionController[source]

Bases: OperatorController

Controller for spectral distribution envelope visualization.

Shows envelope (min/max/mean/IQR) for train vs test partitions, with optional per-fold visualization when cross-validation folds exist.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: Any, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any][source]

Execute spectral distribution envelope visualization.

Creates envelope plots showing min/max/mean/IQR for train vs test. If CV folds exist, creates a grid showing each fold.

Returns:: Tuple of (context, StepOutput)
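
A minimal sketch of the per-wavelength envelope statistics named above (min, max, mean, interquartile bounds), using only the standard library. The function name and the choice of quantile method (the default of statistics.quantiles) are assumptions; the controller's actual computation is not shown in this reference.

```python
import statistics
from typing import Dict, List, Sequence


def spectral_envelope(spectra: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Per-wavelength min, max, mean and first/third quartiles.

    `spectra` holds one row per sample and one column per wavelength.
    At least two samples are needed for the quartiles.
    """
    columns = list(zip(*spectra))  # one tuple of values per wavelength
    quartiles = [statistics.quantiles(col, n=4) for col in columns]
    return {
        "min": [min(col) for col in columns],
        "max": [max(col) for col in columns],
        "mean": [statistics.fmean(col) for col in columns],
        "q1": [q[0] for q in quartiles],
        "q3": [q[2] for q in quartiles],
    }


# Four toy spectra with two wavelengths each.
train = [[0.10, 0.20], [0.30, 0.40], [0.50, 0.60], [0.70, 0.80]]
envelope = spectral_envelope(train)
```

Computing the same dictionary for the test partition (or for each fold) gives the bands that a train-versus-test envelope plot would overlay.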

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Chart controllers should skip execution during prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.TensorFlowModelController[source]

Bases: BaseModelController

Controller for TensorFlow/Keras models.

This controller manages the complete lifecycle of TensorFlow/Keras models including:
- Model instantiation from various configuration formats
- Data preparation with proper tensor formatting (2D/3D)
- Model compilation with task-appropriate loss functions and metrics
- Training with callbacks (early stopping, model checkpointing)
- Hyperparameter tuning via Optuna integration
- Model evaluation and prediction
- Binary serialization for model persistence

The controller automatically detects TensorFlow models and functions decorated with @framework(‘tensorflow’). It uses lazy loading to avoid importing TensorFlow until actually needed.

priority

Controller priority for matching (4).

Type:: int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, bytes]] | None = None, prediction_store: Predictions = None) → Tuple[ExecutionContext, List[Tuple[str, bytes]]][source]

Execute TensorFlow model training, finetuning, or prediction.

Sets the preferred data layout to ‘3d_transpose’ for TensorFlow Conv1D models, then delegates to the base class execute method.

Parameters:

step_info – Parsed step containing model configuration and operator.
dataset – SpectroDataset with features, targets, and fold information.
context – Execution context with step_id, processing history, partition info.
runtime_context – Runtime context managing execution state.
source – Data source index (default: -1 for primary source).
mode – Execution mode - ‘train’, ‘finetune’, ‘predict’, or ‘explain’.
loaded_binaries – Optional list of (name, bytes) tuples for prediction mode, containing serialized model and preprocessing artifacts.
prediction_store – External Predictions storage instance for managing prediction results across pipeline steps.

Returns:

Tuple of (updated_context, artifact_metadata), where:
updated_context: Context dict with added model information
artifact_metadata: List of serialized binary artifacts for persistence

Return type:

Tuple[ExecutionContext, List[Tuple[str, bytes]]]

Raises:

ImportError – If TensorFlow is not installed.

get_preferred_layout() → str[source]

Return the preferred data layout for TensorFlow models.

TensorFlow Conv1D expects input shape (features, channels) where:
- features = number of wavelengths/spectral points (timesteps for convolution)
- channels = number of preprocessing methods

The ‘3d_transpose’ layout returns (samples, features, processings) which is correct for Conv1D.
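
The axis swap can be illustrated with plain nested lists. This sketch assumes the untransposed 3D layout is (samples, processings, features); that ordering is inferred from the layout name and is not stated in this reference.

```python
from typing import List

Array3D = List[List[List[float]]]


def transpose_last_two_axes(x: Array3D) -> Array3D:
    """Swap axes 1 and 2: (samples, processings, features) -> (samples, features, processings)."""
    return [[list(row) for row in zip(*sample)] for sample in x]


# 2 samples, 3 preprocessing variants, 4 wavelengths; each value encodes its indices.
x = [
    [[float(s * 100 + p * 10 + f) for f in range(4)] for p in range(3)]
    for s in range(2)
]
x_t = transpose_last_two_axes(x)
shape = (len(x_t), len(x_t[0]), len(x_t[0][0]))
```

After the swap, each sample is a sequence of wavelengths with one channel per preprocessing, which is the (steps, channels) arrangement a Conv1D layer consumes per sample.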

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Determine if this controller should handle the given step.

Matches TensorFlow/Keras models, functions decorated with @framework(‘tensorflow’), and serialized model configurations containing TensorFlow components.

Parameters:

step – Pipeline step configuration (dict, model instance, or function).
operator – Optional operator instance extracted from step.
keyword – Optional keyword identifier for the step.

Returns:

True if this controller should handle the step, False otherwise. Returns False immediately if TensorFlow is not installed.
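
One way to return False early without importing TensorFlow is to probe for the package with importlib. The sketch below is an assumption about how such a check could look, not the controller's actual code; in particular, the framework attribute read from the decorated function is hypothetical.

```python
import importlib.util
from typing import Any


def tensorflow_available() -> bool:
    """Return True if the tensorflow package can be found, without importing it."""
    return importlib.util.find_spec("tensorflow") is not None


def matches_tensorflow(step: Any, operator: Any = None, keyword: str = "") -> bool:
    """Hypothetical, simplified matcher."""
    if not tensorflow_available():
        # Nothing can match when TensorFlow is not installed.
        return False
    candidate = operator if operator is not None else step
    # Assumed convention: a decorator tags model factories with a `framework` attribute.
    return getattr(candidate, "framework", None) == "tensorflow"


def build_model():
    """Placeholder model factory."""


build_model.framework = "tensorflow"  # stands in for the decorator's effect

plain_object_matches = matches_tensorflow(object())
tagged_function_matches = matches_tensorflow(build_model)
```

The result for the tagged function depends on whether TensorFlow is installed in the running environment.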

priority: int = 4

process_hyperparameters(params: Dict[str, Any]) → Dict[str, Any][source]

Process hyperparameters for TensorFlow model tuning.

Supports TensorFlow-specific parameter organization:
- Parameters prefixed with ‘compile_’ are grouped under ‘compile’ key (e.g., ‘compile_learning_rate’ → compile[‘learning_rate’])
- Parameters prefixed with ‘fit_’ are grouped under ‘fit’ key (e.g., ‘fit_batch_size’ → fit[‘batch_size’])
- Other parameters are treated as model architecture parameters

Parameters:: params – Dictionary of sampled parameters.
Returns:: Dictionary of processed hyperparameters with proper nesting for TensorFlow compilation and fitting.
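
The prefix-based grouping described above can be sketched as follows. The function name is hypothetical, and leaving the remaining parameters at the top level of the result is an assumption, since this reference does not say where architecture parameters are placed.

```python
from typing import Any, Dict


def group_hyperparameters(params: Dict[str, Any]) -> Dict[str, Any]:
    """Nest 'compile_*' and 'fit_*' entries under 'compile' and 'fit' keys."""
    processed: Dict[str, Any] = {}
    for name, value in params.items():
        if name.startswith("compile_"):
            processed.setdefault("compile", {})[name[len("compile_"):]] = value
        elif name.startswith("fit_"):
            processed.setdefault("fit", {})[name[len("fit_"):]] = value
        else:
            # Assumed: architecture parameters stay at the top level.
            processed[name] = value
    return processed


sampled = {"compile_learning_rate": 0.001, "fit_batch_size": 32, "filters": 16}
grouped = group_hyperparameters(sampled)
```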

class nirs4all.controllers.TransformerMixinController[source]

Bases: OperatorController

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None)[source]

Execute transformer - handles normal, feature augmentation, and sample augmentation modes.

Supports optional fit_on_all parameter in step configuration to fit the transformer on all data instead of just training data. This is useful for unsupervised preprocessing where you want the transformation to capture the full data distribution.

Step format:

Standard (fit on train, transform all): StandardScaler()

Fit on ALL data (unsupervised preprocessing): {“preprocessing”: StandardScaler(), “fit_on_all”: True}
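
The effect of fit_on_all can be illustrated with a toy transformer. MeanCenterer and apply_transformer are hypothetical stand-ins written for this sketch; they only show how the choice of fitting data changes the transformed values.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


class MeanCenterer:
    """Toy transformer following the fit/transform convention."""

    def fit(self, values: Sequence[float]) -> "MeanCenterer":
        self.mean_ = fmean(values)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v - self.mean_ for v in values]


def apply_transformer(
    transformer: MeanCenterer,
    train: Sequence[float],
    test: Sequence[float],
    fit_on_all: bool = False,
) -> Tuple[List[float], List[float]]:
    """Fit on train only (default) or on train + test, then transform both."""
    fit_data = list(train) + list(test) if fit_on_all else list(train)
    transformer.fit(fit_data)
    return transformer.transform(train), transformer.transform(test)


train, test = [1.0, 3.0], [8.0]
_, test_default = apply_transformer(MeanCenterer(), train, test)
_, test_all = apply_transformer(MeanCenterer(), train, test, fit_on_all=True)
```

With the default, the test value is centred on the training mean only; with fit_on_all it is centred on the mean of all samples, so test data influences the fitted statistics.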

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Match TransformerMixin objects.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: TransformerMixin controllers support prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.YChartController[source]

Bases: OperatorController

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: Any, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, Any][source]

Execute y values histogram visualization.

If cross-validation folds exist (more than 1 fold), displays a grid showing:
- One histogram per fold validation set
- One histogram for the test partition (if available)

Otherwise, displays a simple train vs test histogram.

Supports optional parameters via dict syntax:: {“chart_y”: {“include_excluded”: True, “highlight_excluded”: True}}

Parameters:

include_excluded – If True, include excluded samples in visualization
highlight_excluded – If True, show excluded samples as separate histogram

Returns:

Tuple of (context, StepOutput)

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]: Check if the operator matches the step and keyword.

priority: int = 10

classmethod supports_prediction_mode() → bool[source]: Chart controllers should skip execution during prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

class nirs4all.controllers.YTransformerMixinController[source]

Bases: OperatorController

Controller for applying sklearn TransformerMixin operators to targets (y) instead of features (X).

Triggered by the “y_processing” keyword, it applies transformations to target data, fitting on train targets and transforming all target data.

Supports both single transformers and chained transformers (list syntax):

Single: {“y_processing”: StandardScaler()}
Chained: {“y_processing”: [StandardScaler, QuantileTransformer(n_quantiles=30)]}

When using chained transformers, each transformer is applied sequentially, with proper ancestry tracking and individual artifact persistence for prediction mode.

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: Any = None, prediction_store: Any = None) → Tuple[ExecutionContext, List[Any]][source]

Execute transformer(s) on dataset targets, fitting on train targets and transforming all targets.

Supports both single transformers and chained transformers (list). Each transformer is applied sequentially, with proper ancestry tracking.

Parameters:

step_info – Parsed step containing operator and metadata
dataset – Dataset containing targets to transform
context – Pipeline context with partition information
runtime_context – Runtime context containing infrastructure components
source – Source index (not used for target processing)
mode – Execution mode (“train”, “predict”, or “explain”)
loaded_binaries – Pre-loaded fitted transformers for predict/explain mode
prediction_store – Not used for y_processing

Returns:

Tuple of (updated_context, fitted_transformers_list)
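
A minimal sketch of the chained behaviour described above: each transformer is fitted on the (already transformed) train targets, then applied to all targets. The toy transformers and the helper fit_transform_chain are hypothetical; ancestry tracking and artifact persistence are omitted.

```python
from typing import Any, List, Sequence, Tuple


class ShiftToZero:
    """Toy transformer: subtract the minimum seen during fit."""

    def fit(self, y: Sequence[float]) -> "ShiftToZero":
        self.min_ = min(y)
        return self

    def transform(self, y: Sequence[float]) -> List[float]:
        return [v - self.min_ for v in y]


class ScaleByMax:
    """Toy transformer: divide by the maximum seen during fit."""

    def fit(self, y: Sequence[float]) -> "ScaleByMax":
        self.max_ = max(y)
        return self

    def transform(self, y: Sequence[float]) -> List[float]:
        return [v / self.max_ for v in y]


def fit_transform_chain(
    transformers: Sequence[Any], y_train: Sequence[float], y_all: Sequence[float]
) -> Tuple[List[float], List[Any]]:
    """Fit each transformer on train targets, apply it to all targets, in order."""
    fitted: List[Any] = []
    y_train, y_all = list(y_train), list(y_all)
    for transformer in transformers:
        transformer.fit(y_train)
        y_train = transformer.transform(y_train)
        y_all = transformer.transform(y_all)
        fitted.append(transformer)
    return y_all, fitted


y_train = [2.0, 4.0, 6.0]
y_all = [2.0, 4.0, 6.0, 10.0]  # the last value belongs to a non-train partition
y_out, fitted = fit_transform_chain([ShiftToZero(), ScaleByMax()], y_train, y_all)
```

Because fitting uses train targets only, the non-train value can fall outside the fitted range, as the last element of y_out does here.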

classmethod matches(step: Any, operator: Any, keyword: str) → bool[source]

Match if keyword is ‘y_processing’ and operator is TransformerMixin or list thereof.

Parameters:

step – Original step configuration
operator – Parsed operator (TransformerMixin instance, class, or list)
keyword – Step keyword

Returns:

True if this controller should handle the step

priority: int = 5

classmethod supports_prediction_mode() → bool[source]: Y transformers should not execute during prediction mode.

classmethod use_multi_source() → bool[source]: Check if the operator supports multi-source datasets.

nirs4all.controllers.register_controller(operator_cls: Type[OperatorController])[source]: Decorator to register a controller class.