nirs4all.operators.filters.metadata module
Metadata-based filter for sample filtering.
This module provides the MetadataFilter class for filtering samples based on metadata column values using custom conditions.
- class nirs4all.operators.filters.metadata.MetadataFilter(column: str, condition: Callable[[Any], bool] | None = None, values_to_exclude: List[Any] | None = None, values_to_keep: List[Any] | None = None, exclude_missing: bool = True, reason: str | None = None)
Bases: SampleFilter
Filter samples based on metadata column values.
This filter allows excluding samples based on external metadata (not X or y) using custom condition functions. It’s useful for filtering based on:
- Sample quality flags
- Acquisition conditions
- Sample categories to exclude
- Date/time-based filtering
- Any other metadata criteria
The filter works with metadata passed to the get_mask() call, because metadata is not part of the standard sklearn X, y interface.
- condition
Function returning True for samples to KEEP
- Type:
Callable
- values_to_exclude
List of values that should be excluded
- Type:
List
- values_to_keep
List of values that should be kept
- Type:
List
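The three criteria above (condition, values_to_exclude, values_to_keep) all reduce to a boolean keep-mask over one metadata column. The following is a minimal standalone sketch of that idea, not the library's implementation: keep_mask is a hypothetical helper, and it assumes that "missing" means None or NaN and that all supplied criteria must pass together.

```python
import numpy as np


def keep_mask(values, condition=None, values_to_exclude=None,
              values_to_keep=None, exclude_missing=True):
    """Hypothetical stand-in for the documented criteria; True means KEEP."""
    mask = np.ones(len(values), dtype=bool)
    for i, value in enumerate(values):
        # Assumption: a value is "missing" when it is None or a float NaN.
        if value is None or (isinstance(value, float) and np.isnan(value)):
            mask[i] = not exclude_missing
            continue
        # Assumption: every supplied criterion must pass (logical AND).
        if condition is not None and not condition(value):
            mask[i] = False
        if values_to_exclude is not None and value in values_to_exclude:
            mask[i] = False
        if values_to_keep is not None and value not in values_to_keep:
            mask[i] = False
    return mask


quality = ["good", "bad", None, "good", "corrupted"]
temperature = [18.5, 25.0, float("nan"), 30.0]

# Drop flagged samples; the missing flag is dropped too (exclude_missing=True).
excluded_bad = keep_mask(quality, values_to_exclude=["bad", "corrupted"])
# Keep only temperatures in the 20-30 range.
in_range = keep_mask(temperature, condition=lambda t: 20 <= t <= 30)
# Keep "good" samples and tolerate missing flags.
good_or_missing = keep_mask(quality, values_to_keep=["good"], exclude_missing=False)
```

How MetadataFilter itself treats missing values or combined criteria may differ, so check the behavior against the real class before relying on it.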
Example
>>> from nirs4all.operators.filters import MetadataFilter
>>>
>>> # Exclude specific values
>>> filter_obj = MetadataFilter(
...     column="quality_flag",
...     values_to_exclude=["bad", "corrupted"]
... )
>>>
>>> # Keep only specific values
>>> filter_keep = MetadataFilter(
...     column="sample_type",
...     values_to_keep=["control", "treatment"]
... )
>>>
>>> # Custom condition
>>> filter_custom = MetadataFilter(
...     column="temperature",
...     condition=lambda x: 20 <= x <= 30  # Keep 20-30°C
... )
>>>
>>> # Get mask (metadata must be provided)
>>> mask = filter_obj.get_mask(X, metadata=metadata_df)
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [
...                 MetadataFilter(
...                     column="quality",
...                     values_to_exclude=["bad"]
...                 )
...             ],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) -> MetadataFilter
Fit the filter (no-op for metadata filter).
Metadata filtering uses fixed criteria, so no fitting is required.
- Parameters:
X – Feature array (not used).
y – Target array (not used).
- Returns:
The filter instance (unchanged).
- Return type:
MetadataFilter
- get_filter_stats(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | None = None) -> Dict[str, Any]
Get statistics about filter application.
- Parameters:
X – Feature array.
y – Target array (unused).
metadata – Metadata dictionary.
- Returns:
Dict containing:
  - Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
  - column: Filtered column name
  - filtering_type: Type of filtering applied
  - value_counts: Count of unique values (if available)
- Return type:
Dict[str, Any]
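The base stats named above can all be derived from a keep-mask. The sketch below shows one plausible way to compute them with NumPy. The helpers base_stats and value_counts are hypothetical, and treating exclusion_rate as the fraction n_excluded / n_samples (rather than a percentage) is an assumption, not something this page confirms.

```python
import numpy as np


def base_stats(mask):
    """Hypothetical helper: summarise a boolean keep-mask (True means KEEP)."""
    mask = np.asarray(mask, dtype=bool)
    n_samples = int(mask.size)
    n_kept = int(mask.sum())
    n_excluded = n_samples - n_kept
    # Assumption: the rate is a fraction in [0, 1]; guard against empty input.
    exclusion_rate = n_excluded / n_samples if n_samples else 0.0
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        "exclusion_rate": exclusion_rate,
    }


def value_counts(values):
    """Hypothetical helper: count occurrences of each unique metadata value."""
    uniques, counts = np.unique(np.asarray(values), return_counts=True)
    return {str(u): int(c) for u, c in zip(uniques, counts)}


stats = base_stats([True, False, True, True])
counts = value_counts(["good", "bad", "good", "good"])
```

Here stats reports 4 samples with 1 excluded (rate 0.25), and counts maps "good" to 3 and "bad" to 1.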
- get_mask(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | Any | None = None) -> ndarray
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features). Used only to determine the number of samples if metadata is not provided.
y – Target array (not used).
metadata – Metadata dictionary, DataFrame, or object with column access. Must contain the specified column. Can be:
- Dict[str, np.ndarray]: metadata[column] returns array
- pd.DataFrame: metadata[column] returns series
- Any object with __getitem__ that returns array-like
- Returns:
Boolean array of shape (n_samples,) where:
  - True means KEEP the sample
  - False means EXCLUDE the sample
- Return type:
np.ndarray
- Raises:
ValueError – If metadata is None and filtering requires it.
KeyError – If the specified column is not in metadata.
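To illustrate the contract described above, the following sketch looks up a column in dict-style metadata, raises the two documented exception types, and applies the resulting keep-mask to X and y. It does not call nirs4all: column_values is a hypothetical helper, and the error messages are invented for the example.

```python
import numpy as np


def column_values(metadata, column):
    """Hypothetical helper mirroring the documented error behaviour."""
    if metadata is None:
        raise ValueError("metadata is required for metadata-based filtering")
    try:
        # Works for any container whose __getitem__ returns an array-like.
        return np.asarray(metadata[column])
    except KeyError:
        raise KeyError(f"column {column!r} not found in metadata") from None


X = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([1.0, 2.0, 3.0, 4.0])
metadata = {"quality": np.array(["good", "bad", "good", "bad"])}

# True means KEEP, so boolean indexing drops the excluded rows.
mask = column_values(metadata, "quality") != "bad"
X_kept, y_kept = X[mask], y[mask]
```

Because the mask has shape (n_samples,), the same array can index X, y, and any other per-sample array, which keeps them aligned after filtering.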