nirs4all.operators.filters.y_outlier module

Y-based outlier filter for sample filtering.

This module provides the YOutlierFilter class for detecting and excluding samples with outlier target (y) values using various statistical methods.

class nirs4all.operators.filters.y_outlier.YOutlierFilter(method: Literal['iqr', 'zscore', 'percentile', 'mad'] = 'iqr', threshold: float = 1.5, lower_percentile: float = 1.0, upper_percentile: float = 99.0, reason: str | None = None)

Bases: SampleFilter

Filter samples with outlier target values.

This filter identifies samples whose y-values are statistical outliers using one of several detection methods. It’s commonly used to remove samples with extreme or erroneous target values before training.

Supported methods:
  • “iqr”: Interquartile Range method (default)

  • “zscore”: Z-score (standard deviations from mean)

  • “percentile”: Direct percentile cutoffs

  • “mad”: Median Absolute Deviation (robust to outliers)
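The four methods differ only in how the lower and upper bounds are derived from y. The sketch below illustrates the textbook definitions with NumPy. `compute_bounds` is a hypothetical helper, not part of nirs4all, and the exact formulas used by YOutlierFilter (for example whether MAD is rescaled by 1.4826, or which standard deviation estimator is used) are assumptions.

```python
import numpy as np


def compute_bounds(y, method="iqr", threshold=1.5,
                   lower_percentile=1.0, upper_percentile=99.0):
    """Return (lower, upper) bounds for y using textbook definitions.

    Hypothetical helper for illustration; not the nirs4all implementation.
    """
    y = np.asarray(y, dtype=float).ravel()
    y = y[~np.isnan(y)]  # ignore NaN targets when estimating bounds
    if y.size == 0:
        raise ValueError("y has no valid (non-NaN) values")

    if method == "iqr":
        q1, q3 = np.percentile(y, [25, 75])
        iqr = q3 - q1
        return q1 - threshold * iqr, q3 + threshold * iqr
    if method == "zscore":
        mean, std = y.mean(), y.std()
        return mean - threshold * std, mean + threshold * std
    if method == "percentile":
        lower, upper = np.percentile(y, [lower_percentile, upper_percentile])
        return lower, upper
    if method == "mad":
        median = np.median(y)
        # 1.4826 makes MAD comparable to a standard deviation for
        # normally distributed data (assumed scaling).
        mad = 1.4826 * np.median(np.abs(y - median))
        return median - threshold * mad, median + threshold * mad
    raise ValueError(f"Unknown method: {method!r}")


y = np.array([9.0, 10.0, 10.0, 11.0, 12.0, 50.0])
lower, upper = compute_bounds(y, method="iqr", threshold=1.5)
keep = (y >= lower) & (y <= upper)  # True = keep, matching the get_mask convention
```

With these six values the IQR bounds exclude only the sample at 50.0. The mean and standard deviation used by the z-score method are themselves pulled toward extreme values, which is why the median-based "mad" method is described as robust to outliers.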

method

Outlier detection method

Type:

str

threshold

Method-specific threshold

Type:

float

lower_percentile

Lower cutoff for percentile method

Type:

float

upper_percentile

Upper cutoff for percentile method

Type:

float

Example

>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> # IQR method (default, threshold=1.5 is standard)
>>> filter_iqr = YOutlierFilter(method="iqr", threshold=1.5)
>>>
>>> # Z-score method (threshold=3.0 is common)
>>> filter_zscore = YOutlierFilter(method="zscore", threshold=3.0)
>>>
>>> # Percentile method
>>> filter_pct = YOutlierFilter(
...     method="percentile",
...     lower_percentile=1.0,
...     upper_percentile=99.0
... )
>>>
>>> # Fit and get mask (X_train and y_train are assumed to be existing arrays)
>>> filter_iqr.fit(X_train, y_train)
>>> mask = filter_iqr.get_mask(X_train, y_train)  # True = keep

In a pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [YOutlierFilter(method="iqr", threshold=1.5)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
__repr__() → str

Return string representation.

property exclusion_reason: str

Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → YOutlierFilter

Compute outlier detection bounds from training data.

Parameters:
  • X – Feature array of shape (n_samples, n_features). Not used but required for sklearn compatibility.

  • y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.

Returns:

The fitted filter instance (self).

Return type:

YOutlierFilter

Raises:
  • ValueError – If y is None (required for Y-based filtering).

  • ValueError – If y has no valid (non-NaN) values.
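The two error conditions above can be sketched as a small validation step. `validate_targets` is a hypothetical helper written for illustration; the real fit method may check its input differently, use different messages, or treat multi-target y in another way than the flattening assumed here.

```python
import numpy as np


def validate_targets(y):
    """Return the non-NaN target values as a flat float array.

    Hypothetical helper; it mirrors the documented error conditions only.
    """
    if y is None:
        raise ValueError("y is required for Y-based filtering")
    # Assumption: multi-target y is flattened before checking for NaN.
    y = np.asarray(y, dtype=float).ravel()
    valid = y[~np.isnan(y)]
    if valid.size == 0:
        raise ValueError("y has no valid (non-NaN) values")
    return valid


valid = validate_targets([1.0, np.nan, 3.0])

# Both documented failure modes raise ValueError.
errors = []
for bad in (None, [np.nan, np.nan]):
    try:
        validate_targets(bad)
    except ValueError as exc:
        errors.append(str(exc))
```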

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any]

Get statistics about filter application including method-specific details.

Parameters:
  • X – Feature array.

  • y – Target array.

Returns:

Dict containing:
  • Base stats (n_samples, n_excluded, n_kept, exclusion_rate)

  • method: Detection method used

  • threshold: Threshold value

  • lower_bound: Computed lower bound

  • upper_bound: Computed upper bound

  • center: Central value (mean/median)

  • scale: Scale measure (std/IQR/MAD)

  • y_range: (min, max) of input y values

Return type:

Dict[str, Any]
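As an illustration of the base statistics, the sketch below derives them from a keep-mask. `base_stats` is a hypothetical helper; the key names follow the list above, but the value types and the definition of exclusion_rate as n_excluded / n_samples are assumptions.

```python
import numpy as np


def base_stats(mask):
    """Summarise a boolean keep-mask (True = keep)."""
    mask = np.asarray(mask, dtype=bool)
    n_samples = int(mask.size)
    n_kept = int(mask.sum())
    n_excluded = n_samples - n_kept
    # Assumed definition: fraction of samples excluded (0.0 for empty input).
    rate = n_excluded / n_samples if n_samples else 0.0
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        "exclusion_rate": rate,
    }


stats = base_stats([True, True, False, True])
```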

get_mask(X: ndarray, y: ndarray | None = None) → ndarray

Compute boolean mask indicating which samples to KEEP.

Parameters:
  • X – Feature array of shape (n_samples, n_features). Not used but required for API consistency.

  • y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.

Returns:

Boolean array of shape (n_samples,) where:
  • True means KEEP the sample (within bounds)

  • False means EXCLUDE the sample (outside bounds)

Return type:

np.ndarray

Raises: