nirs4all.controllers.data.metadata_partitioner module

Metadata Partitioner Controller for metadata-based branching.

This controller partitions the dataset into multiple branches based on a metadata column. Unlike copy branches (where all branches see all samples), this controller creates non-overlapping sample sets - each sample exists in exactly ONE branch.

For example, with column="site":
  • Branch "site_A": Contains ONLY samples where metadata["site"] == "A"

  • Branch "site_B": Contains ONLY samples where metadata["site"] == "B"

  • Branch "site_C": Contains ONLY samples where metadata["site"] == "C"

This enables training separate models for different data subsets (e.g., per-site, per-variety, per-instrument models) and combining their predictions via stacking.
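The partitioning idea can be illustrated with a minimal, library-independent sketch. The helper `partition_by_column` and the list-of-dicts metadata layout are assumptions made for illustration only; they are not part of the nirs4all API.

```python
from collections import defaultdict


def partition_by_column(metadata, column):
    """Group sample indices by the value of one metadata column (hypothetical helper)."""
    branches = defaultdict(list)
    for index, row in enumerate(metadata):
        # Branch names follow the "<column>_<value>" pattern used in the text above.
        branches[f"{column}_{row[column]}"].append(index)
    return dict(branches)


metadata = [
    {"site": "A"}, {"site": "B"}, {"site": "A"}, {"site": "C"}, {"site": "B"},
]
branches = partition_by_column(metadata, "site")

# Every sample index appears in exactly one branch.
all_indices = sorted(i for indices in branches.values() for i in indices)
```

Here `all_indices` covers every sample exactly once, which is the non-overlapping property that distinguishes partition branches from copy branches.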

Example

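>>> # Imports assumed for this example; PLS, RF and XGB are taken to be
>>> # aliases of the scikit-learn and XGBoost regressors.
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn.model_selection import ShuffleSplit
>>> from sklearn.linear_model import Ridge
>>> from sklearn.cross_decomposition import PLSRegression as PLS
>>> from sklearn.ensemble import RandomForestRegressor as RF
>>> from xgboost import XGBRegressor as XGB
>>> pipeline = [
...     MinMaxScaler(),
...     {
...         "branch": [PLS(5), RF(100), XGB()],
...         "by": "metadata_partitioner",
...         "column": "site",
...         "cv": ShuffleSplit(n_splits=3),
...         "min_samples": 20,  # Skip branches with fewer than 20 samples
...     },
...     {"merge": "predictions"},
...     Ridge(),
... ]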
class nirs4all.controllers.data.metadata_partitioner.MetadataPartitionConfig(column: str, branch_steps: List[Any], cv: Any | None = None, min_samples: int = 1, group_values: Dict[str, List[Any]] | None = None)[source]

Bases: object

Configuration for metadata partitioning.

column

Metadata column name to partition by.

Type:

str

branch_steps

Pipeline steps to execute in each branch.

Type:

List[Any]

cv

Cross-validation splitter for per-branch CV.

Type:

Any | None

min_samples

Minimum samples required per branch. Branches with fewer samples are skipped.

Type:

int

group_values

Optional dict mapping branch names to lists of values to group together. E.g., {"others": ["C", "D", "E"]} groups values C, D, E into a single "others" branch.

Type:

Dict[str, List[Any]] | None

__post_init__()[source]

Validate configuration after initialization.

branch_steps: List[Any]
column: str
cv: Any | None = None
group_values: Dict[str, List[Any]] | None = None
min_samples: int = 1
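
The interaction between `group_values` and `min_samples` can be sketched as follows. This is an illustrative approximation, not the controller's actual code: the function `build_branches`, the naming of grouped branches, and the order in which grouping and filtering are applied are all assumptions.

```python
def build_branches(values, column, min_samples=1, group_values=None):
    """Map branch names to sample indices (hypothetical helper)."""
    # Invert {"others": ["C", "D"]} into {"C": "others", "D": "others"}.
    value_to_group = {
        value: name
        for name, members in (group_values or {}).items()
        for value in members
    }
    branches = {}
    for index, value in enumerate(values):
        # Grouped values share one branch; other values get "<column>_<value>".
        name = value_to_group.get(value, f"{column}_{value}")
        branches.setdefault(name, []).append(index)
    # Assumed behavior: branches below the sample floor are dropped.
    return {name: idx for name, idx in branches.items() if len(idx) >= min_samples}


sites = ["A", "A", "B", "C", "D", "E", "A", "B"]
kept = build_branches(
    sites, "site", min_samples=3, group_values={"others": ["C", "D", "E"]}
)
```

In this sketch, values C, D and E would each be too small on their own but survive as the combined "others" branch, while the two-sample "site_B" branch is skipped.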
class nirs4all.controllers.data.metadata_partitioner.MetadataPartitionerController[source]

Bases: OperatorController

Controller for metadata-based branching via partitioning.

This controller creates branches by partitioning samples based on a metadata column. Each branch contains a disjoint subset of samples where the metadata column equals specific value(s).

Key behaviors:
  • Each branch contains a disjoint subset of samples

  • Per-branch cross-validation is supported

  • Branches with too few samples can be skipped (min_samples)

  • Values can be grouped into combined branches (group_values)

  • Models train and predict only on their partition
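
As a rough illustration of the last behavior in the list, the sketch below fits a trivial mean predictor per branch and predicts only for that branch's samples. The function `fit_predict_per_branch` and the mean "model" are stand-ins chosen for brevity; the real controller runs the configured pipeline steps in each branch.

```python
from statistics import mean


def fit_predict_per_branch(branches, y):
    """Fit one mean predictor per branch and predict only that branch's samples."""
    predictions = [None] * len(y)
    for name, indices in branches.items():
        # The stand-in "model" sees targets from this partition only.
        branch_mean = mean(y[i] for i in indices)
        for i in indices:
            predictions[i] = branch_mean
    return predictions


branches = {"site_A": [0, 2], "site_B": [1, 3]}
y = [1.0, 10.0, 3.0, 20.0]
predictions = fit_predict_per_branch(branches, y)
```

Because the partitions are disjoint, each sample receives exactly one prediction, from the model trained on its own branch.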

priority

Controller priority (set to 3 to run before other controllers).

Type:

int

execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) Tuple[ExecutionContext, StepOutput][source]

Execute the metadata partitioner branch step.

Creates branches based on metadata column values, with each branch containing only samples matching specific value(s).

In prediction mode, each sample is routed to, and processed by, the branch that matches its metadata value.

Parameters:
  • step_info – Parsed step containing branch definitions

  • dataset – Dataset to operate on

  • context – Pipeline execution context

  • runtime_context – Runtime infrastructure context

  • source – Data source index

  • mode – Execution mode ("train" or "predict")

  • loaded_binaries – Pre-loaded binary objects for prediction mode

  • prediction_store – External prediction store for model predictions

Returns:

Tuple of (updated_context, StepOutput with collected artifacts)
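
The prediction-mode routing described for `execute` can be sketched in plain Python. The helper `route_samples` is hypothetical, and collecting samples with unseen metadata values in a separate list is an assumption: the source does not say how the controller treats them.

```python
def route_samples(values, trained_branches, column):
    """Send each sample index to the branch matching its metadata value."""
    routed = {name: [] for name in trained_branches}
    unmatched = []
    for index, value in enumerate(values):
        name = f"{column}_{value}"
        if name in routed:
            routed[name].append(index)
        else:
            # Assumption: values with no trained branch are set aside.
            unmatched.append(index)
    return routed, unmatched


routed, unmatched = route_samples(
    ["B", "A", "Z", "A"], ["site_A", "site_B"], "site"
)
```

A caller could then pass each branch's indices to the artifacts loaded for that branch and decide separately what to do with the unmatched samples.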

classmethod matches(step: Any, operator: Any, keyword: str) bool[source]

Check if the step matches the metadata_partitioner branch pattern.

Matches:

{"branch": […], "by": "metadata_partitioner", "column": "…"}

Parameters:
  • step – Original step configuration

  • operator – Deserialized operator

  • keyword – Step keyword

Returns:

True if this is a metadata_partitioner branch definition.
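
A check of this shape might look like the sketch below. The function name `looks_like_metadata_partitioner` is hypothetical, and the exact conditions tested by the real `matches` method (for instance, whether "column" is required at match time) are assumptions based on the pattern shown above.

```python
def looks_like_metadata_partitioner(step):
    """Return True for dict steps shaped like a metadata_partitioner branch."""
    return (
        isinstance(step, dict)
        and "branch" in step
        and step.get("by") == "metadata_partitioner"
        and "column" in step
    )


partition_step = {"branch": ["model"], "by": "metadata_partitioner", "column": "site"}
copy_step = {"branch": ["model"]}

is_partition = looks_like_metadata_partitioner(partition_step)
is_copy_partition = looks_like_metadata_partitioner(copy_step)
is_scalar_partition = looks_like_metadata_partitioner("not a dict")
```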

priority: int = 3
classmethod supports_prediction_mode() bool[source]

Metadata partitioner should execute in prediction mode.

In prediction mode, we need to route samples to the correct branch based on their metadata value.

classmethod use_multi_source() bool[source]

The metadata partitioner operates at the dataset level.