nirs4all.controllers.data.metadata_partitioner module

Metadata Partitioner Controller for metadata-based branching.

This controller partitions the dataset into multiple branches based on a metadata column. Unlike copy branches (where all branches see all samples), this controller creates non-overlapping sample sets - each sample exists in exactly ONE branch.

For example, with column="site":

Branch "site_A": Contains ONLY samples where metadata["site"] == "A"
Branch "site_B": Contains ONLY samples where metadata["site"] == "B"
Branch "site_C": Contains ONLY samples where metadata["site"] == "C"
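The partitioning rule above can be illustrated with a minimal, library-independent sketch. The helper `partition_by_column` and the list-of-dicts metadata layout are assumptions made for illustration; they are not part of the nirs4all API, and the real controller works on a SpectroDataset.

```python
from collections import defaultdict


def partition_by_column(metadata, column):
    """Map each branch name to the sample indices whose `column` value matches.

    Hypothetical helper: branch names follow the "<column>_<value>" pattern
    used in the example above.
    """
    branches = defaultdict(list)
    for index, row in enumerate(metadata):
        branches[f"{column}_{row[column]}"].append(index)
    return dict(branches)


# Toy metadata: one dict per sample.
metadata = [{"site": "A"}, {"site": "B"}, {"site": "A"}, {"site": "C"}]
branches = partition_by_column(metadata, "site")

# Every sample index lands in exactly one branch, so the branches are disjoint.
all_indices = sorted(i for indices in branches.values() for i in indices)
```

Because each index is appended to a single branch, the union of the branches covers the dataset with no overlap, which is the property that distinguishes partitioning from copy branches.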

This enables training separate models for different data subsets (e.g., per-site, per-variety, per-instrument models) and combining their predictions via stacking.

Example

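>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import ShuffleSplit
>>> from sklearn.preprocessing import MinMaxScaler
>>> # PLS, RF and XGB are shorthand for model instances and are assumed
>>> # to be defined elsewhere; they are not imported here.
>>> pipeline = [
...     MinMaxScaler(),
...     {
...         "branch": [PLS(5), RF(100), XGB()],
...         "by": "metadata_partitioner",
...         "column": "site",
...         "cv": ShuffleSplit(n_splits=3),
...         "min_samples": 20,  # Skip branches with fewer than 20 samples
...     },
...     {"merge": "predictions"},
...     Ridge(),
... ]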

class nirs4all.controllers.data.metadata_partitioner.MetadataPartitionConfig(column: str, branch_steps: List[Any], cv: Any | None = None, min_samples: int = 1, group_values: Dict[str, List[Any]] | None = None)

Bases: object

Configuration for metadata partitioning.

column

Metadata column name to partition by.

Type: str

branch_steps

Pipeline steps to execute in each branch.

Type: List[Any]

cv

Cross-validation splitter for per-branch CV.

Type: Any | None

min_samples

Minimum samples required per branch. Branches with fewer samples are skipped.

Type: int
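As an illustration of the min_samples rule, the sketch below drops branches that hold fewer samples than the threshold. The function `drop_small_branches` and the dict-of-index-lists layout are hypothetical; the actual controller may track skipped branches differently.

```python
def drop_small_branches(branches, min_samples=1):
    """Keep branches with at least `min_samples` samples; report the rest.

    Hypothetical helper, not part of nirs4all.
    """
    kept = {
        name: indices
        for name, indices in branches.items()
        if len(indices) >= min_samples
    }
    skipped = sorted(set(branches) - set(kept))
    return kept, skipped


# site_A has 25 samples, site_B only 12.
branches = {"site_A": list(range(25)), "site_B": list(range(25, 37))}
kept, skipped = drop_small_branches(branches, min_samples=20)
```

With min_samples=20, only site_A is kept; site_B falls below the threshold and is skipped.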

group_values

Optional dict mapping branch names to lists of values to group together. E.g., {"others": ["C", "D", "E"]} groups values C, D, E into a single "others" branch.

Type: Dict[str, List[Any]] | None
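One way to picture group_values is as a lookup from a raw metadata value to a branch name. The sketch below is an assumption-laden illustration: `resolve_branch_name` is a hypothetical helper, and the "<column>_<value>" fallback for ungrouped values is inferred from the branch names shown in the module example rather than confirmed by the source.

```python
def resolve_branch_name(value, column, group_values=None):
    """Return the branch a metadata value belongs to.

    Hypothetical helper: grouped values share the group's name, and any
    other value gets its own "<column>_<value>" branch (assumed naming).
    """
    for branch_name, members in (group_values or {}).items():
        if value in members:
            return branch_name
    return f"{column}_{value}"


group_values = {"others": ["C", "D", "E"]}
names = [resolve_branch_name(v, "site", group_values) for v in ["A", "C", "E", "B"]]
```

Values C and E resolve to the shared "others" branch, while A and B keep their own branches.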

__post_init__(): Validate configuration after initialization.

branch_steps: List[Any]

column: str

cv: Any | None = None

group_values: Dict[str, List[Any]] | None = None

min_samples: int = 1
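The fields and the `__post_init__` hook listed above follow the standard dataclass pattern. The sketch below mirrors the documented fields in a stand-in class, `PartitionConfigSketch`; the specific checks are assumptions, since the source does not say what the real class validates.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PartitionConfigSketch:
    """Stand-in for MetadataPartitionConfig with the same documented fields."""

    column: str
    branch_steps: List[Any]
    cv: Optional[Any] = None
    min_samples: int = 1
    group_values: Optional[Dict[str, List[Any]]] = None

    def __post_init__(self):
        # Assumed checks; the real class may validate differently.
        if not self.column:
            raise ValueError("column must be a non-empty string")
        if self.min_samples < 1:
            raise ValueError("min_samples must be >= 1")


config = PartitionConfigSketch(column="site", branch_steps=["step"], min_samples=20)

# An empty column name is rejected at construction time.
try:
    PartitionConfigSketch(column="", branch_steps=[])
    rejected = False
except ValueError:
    rejected = True
```

Validating in `__post_init__` means an invalid configuration fails when it is built rather than later, during pipeline execution.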

class nirs4all.controllers.data.metadata_partitioner.MetadataPartitionerController

Bases: OperatorController

Controller for metadata-based branching via partitioning.

This controller creates branches by partitioning samples based on a metadata column. Each branch contains a disjoint subset of samples where the metadata column equals specific value(s).

Key behaviors:

Each branch contains a disjoint subset of samples
Per-branch cross-validation is supported
Branches with too few samples can be skipped (min_samples)
Values can be grouped into combined branches (group_values)
Models train and predict only on their partition
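To make the per-branch cross-validation behavior concrete, the sketch below generates train/test splits separately inside each partition. The generator `shuffle_split` is a simplified, standard-library stand-in for a splitter such as scikit-learn's ShuffleSplit; it is not the controller's implementation, and the branch layout is a toy assumption.

```python
import random


def shuffle_split(indices, n_splits=3, test_fraction=0.25, seed=0):
    """Yield (train, test) index lists drawn only from `indices`.

    Simplified stand-in for a shuffle-split cross-validator.
    """
    rng = random.Random(seed)
    n_test = max(1, int(len(indices) * test_fraction))
    for _ in range(n_splits):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


# Two disjoint partitions of sample indices.
branches = {"site_A": [0, 2, 4, 6, 8, 10, 12, 14], "site_B": [1, 3, 5, 7]}

# Each branch gets its own folds, built from its own samples only.
folds = {name: list(shuffle_split(indices)) for name, indices in branches.items()}
```

Since every fold is drawn from a single partition, a model trained in one branch never sees samples that belong to another.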

priority

Controller priority (set to 3 to run before other controllers).

Type: int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, StepOutput]

Execute the metadata partitioner branch step.

Creates branches based on metadata column values, with each branch containing only samples matching specific value(s).

In prediction mode, each sample is routed to, and processed by, the branch that matches its metadata value.

Parameters:

step_info – Parsed step containing branch definitions
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index
mode – Execution mode ("train" or "predict")
loaded_binaries – Pre-loaded binary objects for prediction mode
prediction_store – External prediction store for model predictions

Returns:

Tuple of (updated_context, StepOutput with collected artifacts)
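The prediction-mode routing described for execute can be sketched without the library. Everything here is hypothetical: `route_and_predict`, the callables standing in for trained branch models, and the choice to return None for a metadata value with no matching branch are assumptions, not documented behavior.

```python
def route_and_predict(samples, metadata, column, branch_models):
    """Send each sample to the model of the branch matching its metadata value.

    Hypothetical helper: `branch_models` maps a metadata value to a callable.
    Samples whose value has no branch get None (an assumed fallback).
    """
    predictions = [None] * len(samples)
    for i, (sample, row) in enumerate(zip(samples, metadata)):
        model = branch_models.get(row[column])
        if model is not None:
            predictions[i] = model(sample)
    return predictions


# Toy "models": plain functions standing in for per-branch trained models.
branch_models = {"A": lambda x: x + 1.0, "B": lambda x: x * 10.0}
samples = [1.0, 2.0, 3.0]
metadata = [{"site": "A"}, {"site": "B"}, {"site": "A"}]
predictions = route_and_predict(samples, metadata, "site", branch_models)
```

The predictions come back in the original sample order, even though consecutive samples may be handled by different branches.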

classmethod matches(step: Any, operator: Any, keyword: str) → bool

Check if the step matches the metadata_partitioner branch pattern.

Matches: {"branch": […], "by": "metadata_partitioner", "column": "…"}

Parameters:

step – Original step configuration
operator – Deserialized operator
keyword – Step keyword

Returns:

True if this is a metadata_partitioner branch definition.
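The pattern check can be approximated by a small predicate on the step dict. The function `looks_like_metadata_partitioner` is a hypothetical stand-in; whether the real matches method also requires the "column" key or inspects operator and keyword is not stated in the source.

```python
def looks_like_metadata_partitioner(step):
    """Return True for a dict step that follows the documented pattern.

    Hypothetical check: requiring "column" is an assumption.
    """
    return (
        isinstance(step, dict)
        and "branch" in step
        and step.get("by") == "metadata_partitioner"
        and "column" in step
    )


partition_step = {"branch": ["model"], "by": "metadata_partitioner", "column": "site"}
copy_branch_step = {"branch": ["model"]}
merge_step = {"merge": "predictions"}

results = [
    looks_like_metadata_partitioner(partition_step),
    looks_like_metadata_partitioner(copy_branch_step),
    looks_like_metadata_partitioner(merge_step),
]
```

Only the first step carries the "by": "metadata_partitioner" marker, so a plain branch step or a merge step would be left to other controllers.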

priority: int = 3

classmethod supports_prediction_mode() → bool

Metadata partitioner should execute in prediction mode.

In prediction mode, we need to route samples to the correct branch based on their metadata value.

classmethod use_multi_source() → bool: Metadata partitioner operates at the dataset level.