nirs4all.controllers.data.source_branch module

Source Branch Controller for per-source pipeline execution.

This controller enables per-source pipeline execution for multi-source datasets. Each data source (e.g., NIR, markers, Raman) can have its own independent preprocessing pipeline.

Unlike regular branching (branch), which creates parallel paths that all process the same data, source branching assigns each source to a specific processing pipeline based on its name or index.
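The distinction can be illustrated with a small, library-free sketch. It does not use nirs4all: `run_steps`, `regular_branch`, and `source_branch` below are hypothetical helpers, and plain lists of floats stand in for source data.

```python
from typing import Callable, Dict, List

Step = Callable[[List[float]], List[float]]


def run_steps(steps: List[Step], values: List[float]) -> List[float]:
    """Apply each step in order to one block of values."""
    for step in steps:
        values = step(values)
    return values


def regular_branch(
    branches: List[List[Step]], sources: Dict[str, List[float]]
) -> List[Dict[str, List[float]]]:
    """Every branch processes every source: one full result set per branch."""
    return [
        {name: run_steps(steps, values) for name, values in sources.items()}
        for steps in branches
    ]


def source_branch(
    pipelines: Dict[str, List[Step]], sources: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Each source is processed only by the pipeline assigned to its name."""
    return {
        name: run_steps(pipelines.get(name, []), values)
        for name, values in sources.items()
    }


def double(xs: List[float]) -> List[float]:
    return [2 * x for x in xs]


def shift(xs: List[float]) -> List[float]:
    return [x + 1 for x in xs]


sources = {"NIR": [1.0, 2.0], "markers": [10.0, 20.0]}

# Two parallel paths, each seeing both sources.
branched = regular_branch([[double], [shift]], sources)

# One path per source, chosen by source name.
routed = source_branch({"NIR": [double], "markers": [shift]}, sources)
```

Here `branched` holds two complete result sets (one per branch), whereas `routed` holds a single result set in which each source has gone through its own steps only.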

Phase 10 Implementation:
  • Parse source_branch configurations

  • Create per-source execution contexts

  • Execute source-specific pipelines

  • Support prediction mode

  • Integration with merge_sources

Example

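>>> from sklearn.cross_decomposition import PLSRegression
>>> from sklearn.feature_selection import VarianceThreshold
>>> from sklearn.preprocessing import MinMaxScaler
>>> # SNV and SavitzkyGolay are nirs4all transform operators (import not shown here)
>>>
>>> # Different preprocessing per source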
>>> pipeline = [
...     {"source_branch": {
...         "NIR": [SNV(), SavitzkyGolay()],
...         "markers": [VarianceThreshold(), MinMaxScaler()],
...     }},
...     {"merge_sources": "concat"},  # Combine sources after
...     PLSRegression(n_components=10)
... ]
>>>
>>> # Automatic source branching (same empty pipeline per source - isolation only)
>>> pipeline = [
...     {"source_branch": "auto"},
...     {"merge_sources": "concat"},
...     PLSRegression(n_components=10)
... ]

Keywords: "source_branch"

Priority: 5 (same as BranchController)

class nirs4all.controllers.data.source_branch.SourceBranchConfigParser

Bases: object

Parser for source_branch step configurations.

Handles multiple syntax formats for source branching and normalizes them to SourceBranchConfig.

Supported syntaxes:
  • Simple string: "auto" (isolate each source)

  • Dict with source names: {"NIR": [steps], "markers": [steps]}

  • Dict with indices: {0: [steps], 1: [steps]}

  • Dict with special keys: {"_default_": [steps], "_merge_after_": False}
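As a rough illustration of what normalizing these syntaxes could look like, here is a minimal sketch. It is not the library's parser: `NormalizedSourceBranch` and `normalize` are hypothetical names, the field layout is invented for the example, and both the `merge_after=True` default and the reading of `"_default_"` as a fallback pipeline are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class NormalizedSourceBranch:
    """Hypothetical stand-in for a normalized config (not SourceBranchConfig)."""

    pipelines: Dict[Union[str, int], List[Any]] = field(default_factory=dict)
    default: List[Any] = field(default_factory=list)
    merge_after: bool = True  # assumed default, not confirmed by the source
    auto: bool = False


def normalize(raw_config: Any) -> NormalizedSourceBranch:
    """Map the documented syntaxes onto one structure, or raise ValueError."""
    if isinstance(raw_config, str):
        if raw_config != "auto":
            raise ValueError(f"Unknown source_branch string: {raw_config!r}")
        return NormalizedSourceBranch(auto=True)

    if not isinstance(raw_config, dict):
        raise ValueError(
            f"Unsupported source_branch type: {type(raw_config).__name__}"
        )

    config = NormalizedSourceBranch()
    for key, value in raw_config.items():
        if key == "_default_":
            config.default = list(value)
        elif key == "_merge_after_":
            config.merge_after = bool(value)
        elif isinstance(key, (str, int)):
            # Source addressed by name (str) or by index (int).
            config.pipelines[key] = list(value)
        else:
            raise ValueError(f"Invalid source key: {key!r}")
    return config


cfg = normalize(
    {"NIR": ["snv"], 1: ["scale"], "_default_": [], "_merge_after_": False}
)
auto_cfg = normalize("auto")
```

Strings stand in for pipeline steps here; in a real pipeline the lists would contain operator instances.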

classmethod parse(raw_config: Any) → SourceBranchConfig

Parse raw source_branch configuration into SourceBranchConfig.

Parameters:

raw_config – The value from {"source_branch": raw_config}

Returns:

Normalized SourceBranchConfig instance.

Raises:

ValueError – If configuration format is invalid.

class nirs4all.controllers.data.source_branch.SourceBranchController

Bases: OperatorController

Controller for per-source pipeline execution.

This controller enables per-source pipeline execution for multi-source datasets. Each data source gets its own independent processing pipeline.

Key behaviors:
  • Creates per-source execution contexts

  • Executes source-specific pipelines

  • Stores source contexts for subsequent steps or auto-merge

  • Optionally auto-merges sources after processing

Unlike regular BranchController:
  • Operates on the data provenance dimension (sources), not execution paths

  • Each source’s data is isolated during its pipeline execution

  • Sources can have completely different preprocessing chains

  • Designed for multi-modal data (NIR, markers, Raman, etc.)

priority

Controller priority (5 = same as BranchController).

Type:

int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, StepOutput]

Execute source branch step.

For each source, runs a specific sub-pipeline (if defined) and updates the processing context. Uses existing infrastructure:

  1. Get source names and current processing chains

  2. For each source with a defined pipeline:
       • Create a context with processing limited to that source
       • Run the sub-pipeline steps
       • Collect artifacts

  3. Update context with new processing chains

  4. Optionally auto-merge sources

The TransformerController will naturally apply transforms only to the source whose processing is in the context.
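The four steps above can be sketched in plain Python. This is a simplified model rather than the controller's implementation: `run_source_branch` is a hypothetical function, sources are lists of floats keyed by name, steps are plain callables, and the artifact naming scheme and the concatenation used for merging are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Callable[[List[float]], List[float]]


def run_source_branch(
    sources: Dict[str, List[float]],
    pipelines: Dict[str, List[Step]],
    merge_after: bool = False,
) -> Tuple[Dict[str, List[float]], List[Tuple[str, Any]]]:
    """Run each source through its own steps and collect (name, object) artifacts."""
    if len(sources) < 2:
        # Mirrors the documented ValueError for single-source datasets.
        raise ValueError("source_branch requires a multi-source dataset")

    processed: Dict[str, List[float]] = {}
    artifacts: List[Tuple[str, Any]] = []

    # Steps 1-2: visit each source and run only the pipeline defined for it.
    for name, values in sources.items():
        for index, step in enumerate(pipelines.get(name, [])):
            values = step(values)
            artifacts.append((f"{name}_step{index}", step))  # hypothetical naming
        # Step 3: record the updated state for this source.
        processed[name] = values

    # Step 4: optional merge, modelled here as simple concatenation.
    if merge_after:
        merged = [v for values in processed.values() for v in values]
        processed = {"merged": merged}

    return processed, artifacts


def center(xs: List[float]) -> List[float]:
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def scale(xs: List[float]) -> List[float]:
    return [x / 10 for x in xs]


sources = {"NIR": [1.0, 3.0], "markers": [10.0, 20.0]}
pipelines = {"NIR": [center], "markers": [scale]}

processed, artifacts = run_source_branch(sources, pipelines)
merged, _ = run_source_branch(sources, pipelines, merge_after=True)
```

A source with no entry in `pipelines` passes through unchanged, which corresponds to the isolation-only behavior described for "auto".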

Parameters:
  • step_info – Parsed step containing source_branch configuration

  • dataset – Dataset to operate on (must have multiple sources)

  • context – Pipeline execution context

  • runtime_context – Runtime infrastructure context

  • source – Data source index

  • mode – Execution mode ("train" or "predict")

  • loaded_binaries – Pre-loaded binary objects for prediction mode

  • prediction_store – External prediction store

Returns:

Tuple of (updated_context, StepOutput with artifacts)

Raises:

ValueError – If dataset has only one source.

classmethod matches(step: Any, operator: Any, keyword: str) → bool

Check if the step matches the source_branch controller.

Parameters:
  • step – Original step configuration

  • operator – Deserialized operator

  • keyword – Step keyword

Returns:

True if keyword is "source_branch"
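Keyword-based matching of this kind can be sketched as follows. The classes and the `select_controller` helper are hypothetical and only model the idea of picking a controller by step keyword; they do not reproduce the library's registry or its priority handling.

```python
from typing import Any, List, Optional, Type


class KeywordController:
    """Hypothetical base class: a controller claims steps by keyword."""

    keyword = ""

    @classmethod
    def matches(cls, step: Any, operator: Any, keyword: str) -> bool:
        return keyword == cls.keyword


class SourceBranchLike(KeywordController):
    keyword = "source_branch"


class MergeSourcesLike(KeywordController):
    keyword = "merge_sources"


def select_controller(
    step: Any,
    operator: Any,
    keyword: str,
    registry: List[Type[KeywordController]],
) -> Optional[Type[KeywordController]]:
    """Return the first registered controller whose matches() accepts the step."""
    for controller in registry:
        if controller.matches(step, operator, keyword):
            return controller
    return None


registry = [MergeSourcesLike, SourceBranchLike]
chosen = select_controller(
    {"source_branch": "auto"}, None, "source_branch", registry
)
unknown = select_controller({"other": 1}, None, "other", registry)
```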

priority: int = 5
classmethod supports_prediction_mode() → bool

Source branch controller should execute in prediction mode.

classmethod use_multi_source() → bool

Source branch controller supports multi-source datasets.