nirs4all.controllers.data.sample_filter module
Controller for sample filtering operations.
This controller handles sample filtering operators, identifying and marking samples for exclusion from training datasets based on various criteria (outliers, quality issues, etc.).
- class nirs4all.controllers.data.sample_filter.SampleFilterController
Bases: OperatorController
Controller for sample filtering operations.
This controller orchestrates sample filtering by:
1. Retrieving train samples (base only, no augmented) and their X/y values
2. Applying each filter's get_mask() method to identify outliers
3. Combining masks according to the specified mode (any/all)
4. Marking excluded samples in the dataset's indexer
5. Generating a filtering report (optional)
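The mask-combination step can be sketched in plain Python. This is a minimal illustration, not the controller's actual code: `combine_flags` is a hypothetical helper, and it assumes each mask marks flagged samples with `True`. The polarity of the real `get_mask()` return value is not stated here and may be the reverse.

```python
from typing import List, Sequence


def combine_flags(masks: Sequence[Sequence[bool]], mode: str = "any") -> List[bool]:
    """Combine per-filter flag masks into one exclusion mask.

    Assumption: True means "this filter flagged the sample for exclusion".
    """
    if not masks:
        raise ValueError("at least one mask is required")
    if mode not in ("any", "all"):
        raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")
    n_samples = len(masks[0])
    if any(len(mask) != n_samples for mask in masks):
        raise ValueError("all masks must have the same length")

    # "any": exclude when at least one filter flags the sample.
    # "all": exclude only when every filter flags the sample.
    reducer = any if mode == "any" else all

    # zip(*masks) yields one tuple per sample holding every filter's verdict.
    return [reducer(verdicts) for verdicts in zip(*masks)]


# Two illustrative masks over four samples (made-up values).
y_flags = [False, True, False, True]
x_flags = [False, False, True, True]

flag_any = combine_flags([y_flags, x_flags], mode="any")
flag_all = combine_flags([y_flags, x_flags], mode="all")
```

With these inputs, "any" excludes every sample flagged by either filter, while "all" excludes only the last sample, which both filters flag.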
Sample filters are non-destructive - they mark samples as excluded in the indexer rather than removing data. Excluded samples can be re-included using dataset._indexer.mark_included().
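To illustrate what non-destructive marking means, here is a small sketch with a toy indexer. `ToyIndexer`, `mark_excluded()` and `active_indices()` are hypothetical stand-ins; only `mark_included()` is named above, and its real signature may differ.

```python
from typing import Iterable, List, Set


class ToyIndexer:
    """Hypothetical stand-in for a dataset indexer; not the nirs4all class."""

    def __init__(self, n_samples: int) -> None:
        self.n_samples = n_samples
        self._excluded: Set[int] = set()

    def mark_excluded(self, indices: Iterable[int]) -> None:
        # Record the exclusion; the underlying data is left untouched.
        self._excluded.update(indices)

    def mark_included(self, indices: Iterable[int]) -> None:
        # Undo an earlier exclusion.
        self._excluded.difference_update(indices)

    def active_indices(self) -> List[int]:
        return [i for i in range(self.n_samples) if i not in self._excluded]


# Made-up spectra: the second row plays the role of an outlier.
spectra = [[0.1, 0.2], [9.9, 9.8], [0.3, 0.1]]
indexer = ToyIndexer(len(spectra))

indexer.mark_excluded([1])
after_exclusion = indexer.active_indices()

indexer.mark_included([1])
after_restore = indexer.active_indices()
```

The `spectra` list keeps all three rows throughout; only the set of active indices changes, which is why an exclusion can be reversed later.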
- Pipeline syntax:
    {
        "sample_filter": {
            "filters": [
                YOutlierFilter(method="iqr", threshold=1.5),
                XOutlierFilter(method="mahalanobis"),
            ],
            "mode": "any",  # "any" = exclude if ANY filter flags
            "report": True,  # Generate filtering report
            "cascade_to_augmented": True,  # Also exclude augmented samples
        }
    }
Note
Filtering runs only in training mode. In prediction mode this controller is a no-op, so prediction samples are never excluded.
- execute(step_info: ParsedStep, dataset: SpectroDataset, context: ExecutionContext, runtime_context: RuntimeContext, source: int = -1, mode: str = 'train', loaded_binaries: List[Tuple[str, Any]] | None = None, prediction_store: Any | None = None) → Tuple[ExecutionContext, List]
Execute sample filtering operation.
This method:
1. Retrieves training data (base samples only)
2. Fits and applies each filter to identify outliers
3. Combines filter masks using the specified mode
4. Marks excluded samples in the dataset's indexer
5. Optionally prints a filtering report
- Parameters:
step_info – Parsed step containing operator and configuration
dataset – Dataset to operate on
context – Pipeline execution context
runtime_context – Runtime infrastructure context
source – Data source index (unused, filtering is dataset-level)
mode – Execution mode (“train” or “predict”)
loaded_binaries – Pre-loaded binaries (filters may be persisted)
prediction_store – External prediction store (unused)
- Returns:
Tuple of (updated_context, persisted_artifacts)
- Raises:
ValueError – If no filters are specified
ValueError – If invalid mode is specified
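The two error conditions can be illustrated with a small validation sketch. `validate_filter_config` is a hypothetical helper, the error messages are invented, and defaulting `mode` to "any" is an assumption; the controller's actual checks may differ.

```python
from typing import Any, Dict, List, Tuple

VALID_MODES = ("any", "all")


def validate_filter_config(config: Dict[str, Any]) -> Tuple[List[Any], str]:
    """Check a sample_filter configuration dict and return (filters, mode)."""
    filters = config.get("filters") or []
    if not filters:
        raise ValueError("sample_filter requires at least one filter")

    # Assumption: "any" is used when no mode is given.
    mode = config.get("mode", "any")
    if mode not in VALID_MODES:
        raise ValueError(f"invalid mode {mode!r}; expected one of {VALID_MODES}")

    return list(filters), mode


# A placeholder string stands in for a real filter object.
filters, mode = validate_filter_config({"filters": ["y_outlier"], "mode": "all"})
```

Validating before any data is touched means a misconfigured step fails early, before any sample is marked in the indexer.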
- classmethod matches(step: Any, operator: Any, keyword: str) → bool
Match the sample_filter keyword in a pipeline.