nirs4all.operators.filters.base module

Base class for sample filtering operators.

Sample filters identify samples to be excluded from training based on various criteria. They follow the sklearn TransformerMixin pattern for consistency with the existing augmentation and transformation operators in nirs4all.

class nirs4all.operators.filters.base.CompositeFilter(filters: List[SampleFilter] | None = None, mode: str = 'any', reason: str | None = None)[source]

Bases: SampleFilter

Combine multiple filters with AND/OR logic.

This filter aggregates the results of multiple sub-filters using either “any” or “all” mode:

“any” (default): Exclude if ANY filter flags the sample
“all”: Exclude only if ALL filters flag the sample

filters

List of filter instances to combine

Type: List[SampleFilter]

mode

Combination mode - “any” or “all”

Type: str

Example

>>> from nirs4all.operators.filters import YOutlierFilter, CompositeFilter
>>>
>>> # Exclude if either filter flags
>>> combined = CompositeFilter(
...     filters=[
...         YOutlierFilter(method="iqr", threshold=1.5),
...         YOutlierFilter(method="zscore", threshold=3.0),
...     ],
...     mode="any"
... )

property exclusion_reason: str: Get combined exclusion reason from all filters.

fit(X: ndarray, y: ndarray | None = None) → CompositeFilter[source]

Fit all sub-filters to the training data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

The fitted composite filter.

Return type:

self

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics including per-filter breakdown.

Parameters:

X – Feature array.
y – Target array.

Returns:

Dict with overall stats and per-filter breakdown.

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute combined mask from all sub-filters.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Boolean array where True = keep, False = exclude:

For “any” mode: keep if ALL filters say keep
For “all” mode: keep if ANY filter says keep

Return type:

np.ndarray
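The inversion between exclusion logic and keep-masks is easy to misread, so here is a minimal NumPy sketch of it. The helper name `combine_keep_masks` is hypothetical and is not part of nirs4all; it only illustrates the rule stated above, assuming every sub-filter returns a keep-mask of the same length.

```python
import numpy as np


def combine_keep_masks(keep_masks, mode="any"):
    """Combine per-filter keep-masks (True = keep) into one keep-mask.

    Hypothetical helper for illustration only; not the nirs4all implementation.
    """
    stacked = np.stack([np.asarray(m, dtype=bool) for m in keep_masks])
    if mode == "any":
        # Exclude if ANY filter flags the sample -> keep only if ALL filters keep it.
        return stacked.all(axis=0)
    if mode == "all":
        # Exclude only if ALL filters flag the sample -> keep if ANY filter keeps it.
        return stacked.any(axis=0)
    raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")


# Keep-masks from two imaginary sub-filters over four samples.
iqr_keep = np.array([True, True, False, False])
zscore_keep = np.array([True, False, True, False])

keep_any = combine_keep_masks([iqr_keep, zscore_keep], mode="any")
keep_all = combine_keep_masks([iqr_keep, zscore_keep], mode="all")
```

In this sketch, "any" mode keeps only the first sample, while "all" mode drops only the last one, which both sub-filters flagged.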

class nirs4all.operators.filters.base.SampleFilter(reason: str | None = None)[source]

Bases: TransformerMixin, BaseEstimator, ABC

Base class for sample filtering operators.

Sample filters identify samples that should be excluded from training datasets. Unlike transformers that modify data, filters mark samples for exclusion without altering the underlying data.

The filtering pattern works as follows:

1. fit(): Learn filter criteria from training data (e.g., compute thresholds)
2. get_mask(): Return boolean mask indicating which samples to KEEP
3. transform(): No-op (filtering happens at indexer level, not data level)

All concrete filter implementations must override the get_mask() method.
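The three steps can be sketched without any nirs4all or sklearn dependency. The class below, `ToyMadFilter`, is a hypothetical stand-in that does not inherit from SampleFilter; it only mirrors the method contract described above, using a median/MAD rule on y chosen purely for illustration.

```python
import numpy as np


class ToyMadFilter:
    """Hypothetical stand-in (not a nirs4all class) for the three-step pattern."""

    def __init__(self, n_mads=3.0):
        self.n_mads = n_mads

    def fit(self, X, y=None):
        # Step 1: learn criteria from training data (median and MAD of y).
        y = np.asarray(y, dtype=float)
        self.median_ = np.median(y)
        self.mad_ = np.median(np.abs(y - self.median_))
        return self

    def get_mask(self, X, y=None):
        # Step 2: boolean mask, True = keep, False = exclude.
        y = np.asarray(y, dtype=float)
        return np.abs(y - self.median_) <= self.n_mads * self.mad_

    def transform(self, X):
        # Step 3: no-op; the data array is returned untouched.
        return X


X = np.arange(30, dtype=float).reshape(10, 3)
y = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 100], dtype=float)

toy = ToyMadFilter().fit(X, y)
mask = toy.get_mask(X, y)   # only the last sample (y = 100) is flagged
X_out = toy.transform(X)    # same array object, nothing removed
```

Note that `transform` never drops rows: the mask is the only output that carries the filtering decision.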

reason

Identifier for this filter type, used to track exclusion reasons in the indexer. Default is the class name.

Type: str

Example

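>>> import numpy as np
>>> from nirs4all.operators.filters.base import SampleFilter
>>>
>>> class MyFilter(SampleFilter):
...     def __init__(self, threshold: float = 1.0):
...         super().__init__()
...         self.threshold = threshold
...
...     def fit(self, X, y=None):
...         # Learn the target mean and spread from the training data.
...         self.mean_ = np.mean(y)
...         self.std_ = np.std(y)
...         return self
...
...     def get_mask(self, X, y=None) -> np.ndarray:
...         # Keep samples whose absolute z-score is within the threshold.
...         z_scores = np.abs((y - self.mean_) / self.std_)
...         return z_scores <= self.threshold  # True = keep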

property exclusion_reason: str

Get the exclusion reason identifier for this filter.

Returns: Reason string to be stored in indexer’s exclusion_reason column.
Return type: str

fit(X: ndarray, y: ndarray | None = None) → SampleFilter[source]

Compute filter criteria from training data.

This method should learn any thresholds, statistics, or models needed to identify outliers/bad samples. Override in subclasses for filters that need to learn from data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.

Returns:

The fitted filter instance.

Return type:

self

fit_transform(X: ndarray, y: ndarray | None = None, **fit_params) → ndarray[source]

Fit to the data and return X unchanged (transform is a no-op).

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
**fit_params – Additional fitting parameters (unused).

Returns:

The unchanged input array.

Return type:

np.ndarray

get_excluded_indices(X: ndarray, y: ndarray | None = None) → ndarray[source]

Get indices of samples to be excluded.

Convenience method that inverts get_mask() to return indices of samples marked for exclusion.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Integer array of indices for samples to exclude.

Return type:

np.ndarray

Example

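>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> y_filter = YOutlierFilter(method="iqr")  # avoids shadowing the built-in filter()
>>> y_filter.fit(X_train, y_train)
>>> excluded_idx = y_filter.get_excluded_indices(X_train, y_train)
>>> print(f"Excluding {len(excluded_idx)} samples")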

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application.

Override in subclasses to provide filter-specific statistics (e.g., thresholds used, distribution of values, etc.).

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Dictionary containing filter statistics:

n_samples: Total number of samples
n_excluded: Number of samples to exclude
n_kept: Number of samples to keep
exclusion_rate: Ratio of excluded to total
reason: Exclusion reason string

Return type:

Dict[str, Any]
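As a rough illustration of how the documented keys relate to a keep-mask, the sketch below derives them with NumPy. The function `stats_from_mask` is a hypothetical helper, and returning 0.0 for an empty input is an assumption, not documented behavior.

```python
import numpy as np


def stats_from_mask(keep_mask, reason):
    """Build the documented stats keys from a keep-mask (True = keep).

    Hypothetical helper for illustration; not the nirs4all implementation.
    """
    keep_mask = np.asarray(keep_mask, dtype=bool)
    n_samples = int(keep_mask.size)
    n_kept = int(keep_mask.sum())
    n_excluded = n_samples - n_kept
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        # Assumption: report 0.0 rather than dividing by zero on empty input.
        "exclusion_rate": n_excluded / n_samples if n_samples else 0.0,
        "reason": reason,
    }


stats = stats_from_mask([True, True, False, True], reason="MyFilter")
```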

get_kept_indices(X: ndarray, y: ndarray | None = None) → ndarray[source]

Get indices of samples to be kept.

Convenience method that returns indices of samples NOT marked for exclusion.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Integer array of indices for samples to keep.

Return type:

np.ndarray
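The kept and excluded index arrays are complementary views of the same keep-mask. A minimal NumPy sketch of that relationship, using a hand-written mask rather than a real filter:

```python
import numpy as np

# Keep-mask as returned by get_mask(): True = keep, False = exclude.
keep_mask = np.array([True, False, True, True, False])

kept_idx = np.flatnonzero(keep_mask)       # positions where the mask is True
excluded_idx = np.flatnonzero(~keep_mask)  # positions where the mask is False

# Together the two arrays cover every sample exactly once.
all_idx = np.sort(np.concatenate([kept_idx, excluded_idx]))
```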

abstractmethod get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

This is the core method that must be implemented by all concrete filters. Returns True for samples that should be kept, False for samples to exclude.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample
False means EXCLUDE the sample

Return type:

np.ndarray

Raises:

NotImplementedError – If the subclass doesn’t implement this method.
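Because y may have shape (n_samples, n_targets) while the mask must have shape (n_samples,), a multi-target filter has to reduce across targets. The sketch below shows one possible policy, keeping a sample only if every target is within the threshold; that policy and the helper name `zscore_keep_mask` are assumptions made for illustration.

```python
import numpy as np


def zscore_keep_mask(y, mean, std, threshold=3.0):
    """Return a (n_samples,) keep-mask for 1-D or 2-D targets.

    Hypothetical helper; the "all targets must pass" rule is an assumption.
    """
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        y = y[:, None]  # treat a single target as one column
    z = np.abs((y - mean) / std)
    # Reduce across targets so the mask has one entry per sample.
    return (z <= threshold).all(axis=1)


y_multi = np.array([[1.0, 10.0], [2.0, 50.0], [9.0, 11.0]])
mask_multi = zscore_keep_mask(
    y_multi, mean=np.array([2.0, 10.0]), std=np.array([1.0, 2.0])
)

y_single = np.array([1.0, 2.0, 9.0])
mask_single = zscore_keep_mask(y_single, mean=2.0, std=1.0)
```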

transform(X: ndarray) → ndarray[source]

Transform is a no-op for filters.

Filtering happens at the indexer level, not by modifying the data array. This method returns the input unchanged to maintain sklearn compatibility.

Parameters: X – Feature array of shape (n_samples, n_features).
Returns: The unchanged input array.
Return type: np.ndarray
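To make "filtering at the indexer level" concrete, the sketch below records exclusions in a small index table and leaves the data array untouched. The dictionary-of-arrays table is a hypothetical simplification and does not reflect the actual nirs4all indexer; only the exclusion_reason column name comes from the documentation above.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)
keep_mask = np.array([True, False, True, True])  # e.g. from get_mask()

# Hypothetical index table with one entry per sample (not the nirs4all indexer).
index = {
    "sample": np.arange(len(X)),
    "excluded": np.zeros(len(X), dtype=bool),
    "exclusion_reason": np.full(len(X), None, dtype=object),
}

# Mark flagged samples in the index instead of deleting rows from X.
index["excluded"][~keep_mask] = True
index["exclusion_reason"][~keep_mask] = "MyFilter"

# A training view is selected through the index; X itself keeps all rows.
X_train = X[~index["excluded"]]
```

Keeping exclusions in the index means they can be inspected or reverted later without reloading the data.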