nirs4all.operators.filters.metadata module

Metadata-based filter for sample filtering.

This module provides the MetadataFilter class for filtering samples based on metadata column values using custom conditions.

class nirs4all.operators.filters.metadata.MetadataFilter(column: str, condition: Callable[[Any], bool] | None = None, values_to_exclude: List[Any] | None = None, values_to_keep: List[Any] | None = None, exclude_missing: bool = True, reason: str | None = None)

Bases: SampleFilter

Filter samples based on metadata column values.

This filter allows excluding samples based on external metadata (not X or y) using custom condition functions. It’s useful for filtering based on:

- Sample quality flags
- Acquisition conditions
- Sample categories to exclude
- Date/time-based filtering
- Any other metadata criteria

The filter works with metadata passed to the get_mask() call, because metadata is not part of the standard sklearn X, y interface.

column

Metadata column name to filter on

Type: str

condition

Function returning True for samples to KEEP

Type: Callable

values_to_exclude

List of values that should be excluded

Type: List

values_to_keep

List of values that should be kept

Type: List
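The attributes above can be pictured as three ways of producing the same kind of keep-mask. The sketch below is a plain NumPy illustration and not the library's implementation: sketch_keep_mask is a hypothetical helper. How the criteria combine when several are given, and what counts as a "missing" value (None or NaN here), are assumptions made for the example.

```python
import numpy as np


def sketch_keep_mask(values, condition=None, values_to_exclude=None,
                     values_to_keep=None, exclude_missing=True):
    """Hypothetical helper: build a keep-mask from one metadata column.

    Assumption: a sample is kept only if it passes every criterion given.
    Assumption: None and NaN count as missing values.
    """
    values = np.asarray(values, dtype=object)
    mask = np.ones(len(values), dtype=bool)
    for i, v in enumerate(values):
        missing = v is None or (isinstance(v, float) and np.isnan(v))
        if missing:
            # Missing values never reach the criteria below.
            mask[i] = not exclude_missing
            continue
        if condition is not None and not condition(v):
            mask[i] = False
        if values_to_exclude is not None and v in values_to_exclude:
            mask[i] = False
        if values_to_keep is not None and v not in values_to_keep:
            mask[i] = False
    return mask


flags = ["good", "bad", None, "corrupted"]
mask_default = sketch_keep_mask(flags, values_to_exclude=["bad", "corrupted"])
mask_keep_missing = sketch_keep_mask(
    flags, values_to_exclude=["bad", "corrupted"], exclude_missing=False
)
```

In this sketch True always means "keep", matching the convention documented for condition and get_mask().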

Example

>>> from nirs4all.operators.filters import MetadataFilter
>>>
>>> # Exclude specific values
>>> filter_obj = MetadataFilter(
...     column="quality_flag",
...     values_to_exclude=["bad", "corrupted"]
... )
>>>
>>> # Keep only specific values
>>> filter_keep = MetadataFilter(
...     column="sample_type",
...     values_to_keep=["control", "treatment"]
... )
>>>
>>> # Custom condition
>>> filter_custom = MetadataFilter(
...     column="temperature",
...     condition=lambda x: 20 <= x <= 30  # Keep 20-30°C
... )
>>>
>>> # Get mask (metadata must be provided)
>>> mask = filter_obj.get_mask(X, metadata=metadata_df)

In a pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [
...                 MetadataFilter(
...                     column="quality",
...                     values_to_exclude=["bad"]
...                 )
...             ],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → MetadataFilter

Fit the filter (no-op for metadata filter).

Metadata filtering uses fixed criteria, so no fitting is required.

Parameters:

X – Feature array (not used).
y – Target array (not used).

Returns:

The filter instance (unchanged).

Return type:

self

get_filter_stats(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | None = None) → Dict[str, Any]

Get statistics about filter application.

Parameters:

X – Feature array.
y – Target array (unused).
metadata – Metadata dictionary.

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
column: Filtered column name
filtering_type: Type of filtering applied
value_counts: Count of unique values (if available)

Return type:

Dict[str, Any]
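To make the base stats concrete, here is a minimal sketch of how such a dictionary could be derived from a keep-mask. The key names come from the list above; the function sketch_filter_stats, the way exclusion_rate is computed (excluded divided by total), and the shape of value_counts are assumptions for illustration. filtering_type is left out because its possible values are not documented here.

```python
import numpy as np


def sketch_filter_stats(mask, column, values):
    """Hypothetical helper: summarise a keep-mask for one metadata column."""
    mask = np.asarray(mask, dtype=bool)
    n_samples = int(mask.size)
    n_kept = int(mask.sum())
    n_excluded = n_samples - n_kept
    # Count how often each distinct metadata value occurs.
    unique, counts = np.unique(np.asarray(values), return_counts=True)
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        # Assumption: rate is the excluded fraction of all samples.
        "exclusion_rate": n_excluded / n_samples if n_samples else 0.0,
        "column": column,
        "value_counts": {str(u): int(c) for u, c in zip(unique, counts)},
    }


quality = np.array(["good", "bad", "good", "corrupted"])
keep = ~np.isin(quality, ["bad", "corrupted"])  # True means KEEP
stats = sketch_filter_stats(keep, "quality", quality)
```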

get_mask(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | Any | None = None) → ndarray

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features). Used only to determine number of samples if metadata is not provided.
y – Target array (not used).
metadata – Metadata dictionary, DataFrame, or object with column access. Must contain the specified column. Can be:

- Dict[str, np.ndarray]: metadata[column] returns array
- pd.DataFrame: metadata[column] returns series
- Any object with __getitem__ that returns array-like

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample
False means EXCLUDE the sample

Return type:

np.ndarray

Raises:

ValueError – If metadata is None and filtering requires it.
KeyError – If the specified column is not in metadata.
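The column lookup and the use of the returned mask can be sketched in plain NumPy as follows. This does not call nirs4all: sketch_column_values is a hypothetical helper, and the exact conditions under which the real method raises are only assumed to match the Raises list above.

```python
import numpy as np


def sketch_column_values(metadata, column):
    """Hypothetical helper: fetch one metadata column as an object array."""
    if metadata is None:
        raise ValueError("metadata is required to filter on a column")
    try:
        # Works for a dict of arrays or any object supporting metadata[column].
        values = metadata[column]
    except (KeyError, IndexError) as exc:
        raise KeyError(column) from exc
    return np.asarray(values, dtype=object)


X = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features
y = np.array([1.0, 2.0, 3.0, 4.0])
metadata = {"temperature": np.array([18.5, 22.0, 31.0, 25.0])}

temps = sketch_column_values(metadata, "temperature")
# True means KEEP: retain samples measured between 20 and 30 degrees.
mask = np.array([20 <= t <= 30 for t in temps], dtype=bool)

# Boolean indexing drops the excluded rows from X and y together.
X_kept, y_kept = X[mask], y[mask]
```

Because the mask has one entry per sample, the same array can index X, y, and any per-sample metadata column, which keeps them aligned after filtering.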