nirs4all.operators.filters.base module
Base class for sample filtering operators.
Sample filters identify samples to be excluded from training based on various criteria. They follow the sklearn TransformerMixin pattern for consistency with the existing augmentation and transformation operators in nirs4all.
- class nirs4all.operators.filters.base.CompositeFilter(filters: List[SampleFilter] | None = None, mode: str = 'any', reason: str | None = None)
Bases: SampleFilter
Combine multiple filters with AND/OR logic.
This filter aggregates the results of multiple sub-filters using either “any” or “all” mode:
- “any” (default): Exclude if ANY filter flags the sample
- “all”: Exclude only if ALL filters flag the sample
- filters
List of filter instances to combine
- Type:
List[SampleFilter]
Example
>>> from nirs4all.operators.filters import YOutlierFilter, CompositeFilter
>>>
>>> # Exclude if either filter flags
>>> combined = CompositeFilter(
...     filters=[
...         YOutlierFilter(method="iqr", threshold=1.5),
...         YOutlierFilter(method="zscore", threshold=3.0),
...     ],
...     mode="any"
... )
- fit(X: ndarray, y: ndarray | None = None) → CompositeFilter
Fit all sub-filters to the training data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
The fitted composite filter.
- Return type:
self
- get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any]
Get statistics including per-filter breakdown.
- Parameters:
X – Feature array.
y – Target array.
- Returns:
Dict with overall stats and per-filter breakdown.
- get_mask(X: ndarray, y: ndarray | None = None) → ndarray
Compute combined mask from all sub-filters.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
- Boolean array where True = keep, False = exclude.
For “any” mode: keep if ALL filters say keep
For “all” mode: keep if ANY filter says keep
- Return type:
np.ndarray
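The two combination modes reduce to element-wise logic over the sub-filters' keep-masks. The following is a minimal NumPy sketch of that rule, not the library's implementation; `combine_keep_masks` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np


def combine_keep_masks(masks, mode="any"):
    """Hypothetical helper: combine keep-masks (True = keep) from several filters."""
    stacked = np.vstack(masks)  # shape (n_filters, n_samples)
    if mode == "any":
        # Exclude if ANY filter flags the sample, so keep only if ALL say keep.
        return stacked.all(axis=0)
    if mode == "all":
        # Exclude only if ALL filters flag the sample, so keep if ANY says keep.
        return stacked.any(axis=0)
    raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")


# Two illustrative keep-masks for four samples (made-up values).
mask_a = np.array([True, False, True, False])
mask_b = np.array([True, True, False, False])

keep_any = combine_keep_masks([mask_a, mask_b], mode="any")
keep_all = combine_keep_masks([mask_a, mask_b], mode="all")
```

Note the inversion: because masks mark samples to keep, the “any” exclusion mode corresponds to a logical AND over keep-masks, and the “all” exclusion mode to a logical OR.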
- class nirs4all.operators.filters.base.SampleFilter(reason: str | None = None)
Bases: TransformerMixin, BaseEstimator, ABC
Base class for sample filtering operators.
Sample filters identify samples that should be excluded from training datasets. Unlike transformers that modify data, filters mark samples for exclusion without altering the underlying data.
The filtering pattern works as follows:
1. fit(): Learn filter criteria from training data (e.g., compute thresholds)
2. get_mask(): Return boolean mask indicating which samples to KEEP
3. transform(): No-op (filtering happens at indexer level, not data level)
All concrete filter implementations must override the get_mask() method.
- reason
Identifier for this filter type, used to track exclusion reasons in the indexer. Default is the class name.
- Type:
str
Example
>>> import numpy as np
>>> from nirs4all.operators.filters.base import SampleFilter
>>>
>>> class MyFilter(SampleFilter):
...     def __init__(self, threshold: float = 1.0):
...         super().__init__()
...         self.threshold = threshold
...
...     def fit(self, X, y=None):
...         self.mean_ = np.mean(y)
...         self.std_ = np.std(y)
...         return self
...
...     def get_mask(self, X, y=None) -> np.ndarray:
...         z_scores = np.abs((y - self.mean_) / self.std_)
...         return z_scores <= self.threshold  # True = keep
- property exclusion_reason: str
Get the exclusion reason identifier for this filter.
- Returns:
Reason string to be stored in indexer’s exclusion_reason column.
- Return type:
str
- fit(X: ndarray, y: ndarray | None = None) → SampleFilter
Compute filter criteria from training data.
This method should learn any thresholds, statistics, or models needed to identify outliers/bad samples. Override in subclasses for filters that need to learn from data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.
- Returns:
The fitted filter instance.
- Return type:
self
- fit_transform(X: ndarray, y: ndarray | None = None, **fit_params) → ndarray
Fit to data and return unchanged (transform is no-op).
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
**fit_params – Additional fitting parameters (unused).
- Returns:
The unchanged input array.
- Return type:
np.ndarray
- get_excluded_indices(X: ndarray, y: ndarray | None = None) → ndarray
Get indices of samples to be excluded.
Convenience method that inverts get_mask() to return indices of samples marked for exclusion.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
Integer array of indices for samples to exclude.
- Return type:
np.ndarray
Example
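>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> # X_train and y_train are the caller's training arrays
>>> outlier_filter = YOutlierFilter(method="iqr")
>>> outlier_filter.fit(X_train, y_train)
>>> excluded_idx = outlier_filter.get_excluded_indices(X_train, y_train)
>>> print(f"Excluding {len(excluded_idx)} samples")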
- get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any]
Get statistics about filter application.
Override in subclasses to provide filter-specific statistics (e.g., thresholds used, distribution of values, etc.).
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
- Dictionary containing filter statistics:
n_samples: Total number of samples
n_excluded: Number of samples to exclude
n_kept: Number of samples to keep
exclusion_rate: Ratio of excluded to total
reason: Exclusion reason string
- Return type:
Dict[str, Any]
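The documented keys can all be derived from a single keep-mask. A minimal sketch under that assumption; `basic_filter_stats` is a hypothetical helper, and the zero-sample guard is an illustrative choice, not confirmed library behavior.

```python
import numpy as np


def basic_filter_stats(keep_mask, reason):
    """Hypothetical helper: build the documented stats keys from a keep-mask."""
    n_samples = int(keep_mask.size)
    n_kept = int(keep_mask.sum())  # True counts as 1
    n_excluded = n_samples - n_kept
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        # Assumption: report 0.0 instead of dividing by zero on empty input.
        "exclusion_rate": n_excluded / n_samples if n_samples else 0.0,
        "reason": reason,
    }


# Made-up mask: one of four samples is excluded.
stats = basic_filter_stats(np.array([True, False, True, True]), reason="MyFilter")
```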
- get_kept_indices(X: ndarray, y: ndarray | None = None) → ndarray
Get indices of samples to be kept.
Convenience method that returns indices of samples NOT marked for exclusion.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
Integer array of indices for samples to keep.
- Return type:
np.ndarray
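Both index helpers are views of the same keep-mask: kept indices are where the mask is True, excluded indices are where it is False. A small NumPy sketch of that relationship, using a made-up mask rather than a real filter:

```python
import numpy as np

# Illustrative keep-mask for five samples (True = keep).
keep_mask = np.array([True, False, True, True, False])

kept_idx = np.flatnonzero(keep_mask)       # indices of samples to keep
excluded_idx = np.flatnonzero(~keep_mask)  # indices of samples to exclude
```

The two index arrays are disjoint and together cover every sample position.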
- abstractmethod get_mask(X: ndarray, y: ndarray | None = None) → ndarray
Compute boolean mask indicating which samples to KEEP.
This is the core method that must be implemented by all concrete filters. Returns True for samples that should be kept, False for samples to exclude.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample
False means EXCLUDE the sample
- Return type:
np.ndarray
- Raises:
NotImplementedError – If the subclass doesn’t implement this method.
- transform(X: ndarray) → ndarray
Transform is a no-op for filters.
Filtering happens at the indexer level, not by modifying the data array. This method returns the input unchanged to maintain sklearn compatibility.
- Parameters:
X – Feature array of shape (n_samples, n_features).
- Returns:
The unchanged input array.
- Return type:
np.ndarray
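Because transform() leaves the array untouched, any row removal has to be done by whoever consumes the mask. A minimal sketch of that division of labor, assuming a stand-in class (`ZScoreKeepFilter`, hypothetical and independent of nirs4all) that mimics the fit / get_mask / transform contract described above:

```python
import numpy as np


class ZScoreKeepFilter:
    """Hypothetical stand-in mimicking the fit / get_mask / transform contract."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Learn the statistics needed to score samples.
        self.mean_ = float(np.mean(y))
        self.std_ = float(np.std(y))
        return self

    def get_mask(self, X, y=None):
        # True = keep, False = exclude.
        return np.abs((y - self.mean_) / self.std_) <= self.threshold

    def transform(self, X):
        # No-op: the data array is returned unchanged.
        return X


X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 10.0])  # last target is an obvious outlier

flt = ZScoreKeepFilter(threshold=2.0).fit(X, y)
X_out = flt.transform(X)   # identical to X, no rows removed
mask = flt.get_mask(X, y)  # the caller applies the mask explicitly
X_kept, y_kept = X[mask], y[mask]
```

In nirs4all the mask is consumed by the indexer rather than by direct array indexing as done here; the indexing step only illustrates where the exclusion actually takes effect.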