nirs4all.controllers.data.metadata_partitioner module
Metadata Partitioner Controller for metadata-based branching.
This controller partitions the dataset into multiple branches based on a metadata column. Unlike copy branches (where all branches see all samples), this controller creates non-overlapping sample sets: each sample exists in exactly ONE branch.
For example, with column="site":
- Branch "site_A": contains ONLY samples where metadata["site"] == "A"
- Branch "site_B": contains ONLY samples where metadata["site"] == "B"
- Branch "site_C": contains ONLY samples where metadata["site"] == "C"
This enables training separate models for different data subsets (e.g., per-site, per-variety, per-instrument models) and combining their predictions via stacking.
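The partitioning idea can be illustrated with a minimal, standalone sketch. It does not use nirs4all; the function name `partition_by_column` and the `<column>_<value>` branch naming are assumptions taken from the "site_A" example above, not the controller's actual implementation.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def partition_by_column(values: Sequence[Any], prefix: str) -> Dict[str, List[int]]:
    """Map each branch name to the indices of the samples holding that value.

    Hypothetical helper for illustration only (not part of nirs4all).
    """
    branches: Dict[str, List[int]] = defaultdict(list)
    for index, value in enumerate(values):
        # Each sample index is appended to exactly one branch.
        branches[f"{prefix}_{value}"].append(index)
    return dict(branches)


# Toy metadata column: one "site" value per sample.
site = ["A", "B", "A", "C", "B", "A"]
partitions = partition_by_column(site, prefix="site")
```

Because every index is appended exactly once, the resulting index lists are disjoint and together cover the whole dataset, which is the property that distinguishes partition branches from copy branches.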
Example
>>> pipeline = [
...     MinMaxScaler(),
...     {
...         "branch": [PLS(5), RF(100), XGB()],
...         "by": "metadata_partitioner",
...         "column": "site",
...         "cv": ShuffleSplit(n_splits=3),
...         "min_samples": 20,  # Skip branches with < 20 samples
...     },
...     {"merge": "predictions"},
...     Ridge(),
... ]
- class nirs4all.controllers.data.metadata_partitioner.MetadataPartitionConfig(column: str, branch_steps: List[Any], cv: Any | None = None, min_samples: int = 1, group_values: Dict[str, List[Any]] | None = None)
Bases: object
Configuration for metadata partitioning.
- branch_steps
Pipeline steps to execute in each branch.
- Type:
List[Any]
- cv
Cross-validation splitter for per-branch CV.
- Type:
Any | None
- min_samples
Minimum samples required per branch. Branches with fewer samples are skipped.
- Type:
int
- class nirs4all.controllers.data.metadata_partitioner.MetadataPartitionerController
Bases: OperatorController
Controller for metadata-based branching via partitioning.
This controller creates branches by partitioning samples based on a metadata column. Each branch contains a disjoint subset of samples where the metadata column equals specific value(s).
Key behaviors:
- Each branch contains a disjoint subset of samples
- Per-branch cross-validation is supported
- Branches with too few samples can be skipped (min_samples)
- Values can be grouped into combined branches (group_values)
- Models train and predict only on their partition
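How `group_values` and `min_samples` might interact can be sketched in plain Python. This is an illustration under assumptions, not the controller's code: it assumes `group_values` maps a group name to the list of raw values it absorbs (as the `Dict[str, List[Any]]` type suggests), that ungrouped values each get their own branch, and that a branch is kept when it has at least `min_samples` samples. The helper name `build_branches` is hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_branches(
    values: Sequence[Any],
    column: str,
    min_samples: int = 1,
    group_values: Optional[Dict[str, List[Any]]] = None,
) -> Dict[str, List[int]]:
    """Group sample indices into branches, then drop undersized branches.

    Hypothetical helper for illustration only (not part of nirs4all).
    """
    # Reverse lookup: raw metadata value -> name of the group that absorbs it.
    value_to_group: Dict[Any, str] = {}
    for group_name, members in (group_values or {}).items():
        for member in members:
            value_to_group[member] = group_name

    branches: Dict[str, List[int]] = {}
    for index, value in enumerate(values):
        label = value_to_group.get(value, str(value))
        branches.setdefault(f"{column}_{label}", []).append(index)

    # Keep only branches that reach the minimum sample count.
    return {name: idx for name, idx in branches.items() if len(idx) >= min_samples}


site = ["A", "B", "C", "A", "D", "B", "A"]
grouped = build_branches(site, "site", min_samples=2, group_values={"others": ["C", "D"]})
strict = build_branches(site, "site", min_samples=3, group_values={"others": ["C", "D"]})
```

In this toy run, sites "C" and "D" have one sample each and would be skipped on their own with `min_samples=2`; grouping them into a combined branch lets them reach the threshold.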
- execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) -> Tuple[ExecutionContext, StepOutput]
Execute the metadata partitioner branch step.
Creates branches based on metadata column values, with each branch containing only samples matching specific value(s).
In prediction mode, each sample is routed to, and processed by, the branch that matches its metadata value.
- Parameters:
step_info – Parsed step containing branch definitions
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index
mode – Execution mode ("train" or "predict")
loaded_binaries – Pre-loaded binary objects for prediction mode
prediction_store – External prediction store for model predictions
- Returns:
Tuple of (updated_context, StepOutput with collected artifacts)
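The prediction-mode routing described for execute can be pictured with a small standalone sketch. It is not the controller's implementation: `route_and_predict` is a hypothetical name, branch models are stand-in callables, and returning `None` for a metadata value with no matching branch is an assumption of this sketch (the documentation above does not say how such samples are handled).

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def route_and_predict(
    samples: Sequence[float],
    metadata_values: Sequence[Any],
    branch_models: Dict[Any, Callable[[float], float]],
) -> List[Optional[float]]:
    """Send each sample to the model of the branch matching its metadata value.

    Hypothetical helper for illustration only (not part of nirs4all).
    """
    predictions: List[Optional[float]] = [None] * len(samples)
    for index, (sample, value) in enumerate(zip(samples, metadata_values)):
        model = branch_models.get(value)
        if model is not None:
            # Predictions are written back at the sample's original position.
            predictions[index] = model(sample)
    return predictions


# Stand-in "models": one callable per metadata value seen during training.
branch_models = {"A": lambda x: x + 1.0, "B": lambda x: x * 10.0}
preds = route_and_predict([1.0, 2.0, 3.0], ["A", "B", "Z"], branch_models)
```

Writing each prediction back at its original index keeps the output aligned with the input samples even though the branches process disjoint subsets.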
- classmethod matches(step: Any, operator: Any, keyword: str) -> bool
Check if the step matches the metadata_partitioner branch pattern.
- Matches:
{"branch": [...], "by": "metadata_partitioner", "column": "..."}
- Parameters:
step – Original step configuration
operator – Deserialized operator
keyword – Step keyword
- Returns:
True if this is a metadata_partitioner branch definition.
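A check of this kind might look roughly like the following sketch. It only mirrors the dictionary pattern shown under "Matches" and is an assumption about the shape of the test, not the actual body of `matches`; in particular, whether the real method also inspects `operator` or `keyword`, or requires the "column" key, is not stated here. The function name is hypothetical.

```python
from typing import Any


def looks_like_metadata_partitioner(step: Any) -> bool:
    """Return True when a step dict follows the documented branch pattern.

    Hypothetical helper for illustration only (not part of nirs4all).
    """
    return (
        isinstance(step, dict)
        and "branch" in step
        and step.get("by") == "metadata_partitioner"
        and "column" in step
    )


partition_step = {"branch": ["model_a", "model_b"], "by": "metadata_partitioner", "column": "site"}
merge_step = {"merge": "predictions"}
is_partition = looks_like_metadata_partitioner(partition_step)
is_merge = looks_like_metadata_partitioner(merge_step)
```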