nirs4all.operators.filters.x_outlier module

X-based outlier filter for sample filtering.

This module provides the XOutlierFilter class for detecting and excluding samples with outlier feature (X) values using various statistical and machine learning methods commonly used in spectroscopy and chemometrics.

class nirs4all.operators.filters.x_outlier.XOutlierFilter(method: Literal['mahalanobis', 'robust_mahalanobis', 'pca_residual', 'pca_leverage', 'isolation_forest', 'lof'] = 'mahalanobis', threshold: float | None = None, n_components: int | None = None, contamination: float = 0.1, random_state: int | None = None, support_fraction: float | None = None, reason: str | None = None)

Bases: SampleFilter

Filter samples with outlier spectral features.

This filter identifies samples whose X-values (spectra) are statistical outliers using various detection methods. It’s commonly used to remove samples with corrupted, unusual, or non-representative spectra.

Supported methods:

- “mahalanobis”: Mahalanobis distance from center (default)
- “robust_mahalanobis”: Robust Mahalanobis using MinCovDet (resistant to outliers)
- “pca_residual”: Q-statistic (residual) from PCA reconstruction
- “pca_leverage”: T² (Hotelling’s T-squared) in PCA score space
- “isolation_forest”: Isolation Forest anomaly detection
- “lof”: Local Outlier Factor
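To make the default method concrete, the following is a minimal NumPy sketch of Mahalanobis-distance screening. It is not taken from the nirs4all source: the helper name mahalanobis_distances, the use of a pseudo-inverse, and the reading of the threshold as a cutoff on the distance itself are all assumptions made for illustration.

```python
import numpy as np

def mahalanobis_distances(X_ref, X):
    """Mahalanobis distance of each row of X from the mean of X_ref (hypothetical helper)."""
    center = X_ref.mean(axis=0)
    cov = np.cov(X_ref, rowvar=False)
    # Pseudo-inverse tolerates near-singular covariance matrices.
    cov_inv = np.linalg.pinv(cov)
    diff = X - center
    # Quadratic form diff_i^T @ cov_inv @ diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.clip(d2, 0.0, None))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0] += 25.0  # plant one obvious outlier

distances = mahalanobis_distances(X, X)
# Assumption: samples at or below the cutoff are kept (True = keep).
keep = distances <= 3.0
```

Because the planted outlier also inflates the mean and covariance it is judged against, its distance is smaller than it would be under a clean reference set. That masking effect is the usual motivation for a robust estimator such as the MinCovDet-based variant listed above.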

method

Outlier detection method

Type: str

threshold

Detection threshold (method-specific)

Type: float

n_components

Number of PCA components for PCA-based methods

Type: int

contamination

Expected proportion of outliers for sklearn methods

Type: float
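For the PCA-based methods, n_components determines how much of each spectrum the model explains and how much is left as residual. The sketch below computes a Q residual and a Hotelling's T² per sample with plain NumPy. It illustrates the two statistics only; the function name pca_q_and_t2 and the synthetic data are hypothetical, and the library's actual computation and thresholds may differ.

```python
import numpy as np

def pca_q_and_t2(X_ref, X, n_components):
    """Return (Q residuals, Hotelling's T^2) of X under a PCA fitted on X_ref."""
    mean = X_ref.mean(axis=0)
    # SVD of the centred reference data gives the principal axes in Vt.
    _, s, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    loadings = Vt[:n_components].T                        # (n_features, k)
    score_var = s[:n_components] ** 2 / (len(X_ref) - 1)  # variance per component

    centred = X - mean
    scores = centred @ loadings                   # projection onto the PCA axes
    residual = centred - scores @ loadings.T      # part the PCA cannot reconstruct
    q = np.sum(residual ** 2, axis=1)             # Q-statistic
    t2 = np.sum(scores ** 2 / score_var, axis=1)  # Hotelling's T^2
    return q, t2

# Synthetic two-factor data standing in for spectra.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 20))
X[0] += rng.normal(size=20)            # off-model sample: large Q expected
X[1] = np.array([6.0, 6.0]) @ mixing   # extreme but on-model sample: large T^2 expected

q, t2 = pca_q_and_t2(X, X, n_components=2)
```

The two statistics are complementary: Q flags samples that lie off the PCA subspace, while T² flags samples that lie within it but far from the center.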

Example

>>> from nirs4all.operators.filters import XOutlierFilter
>>>
>>> # Mahalanobis distance (default)
>>> filter_maha = XOutlierFilter(method="mahalanobis", threshold=3.0)
>>>
>>> # Robust Mahalanobis (better with outliers in training data)
>>> filter_robust = XOutlierFilter(method="robust_mahalanobis", threshold=3.0)
>>>
>>> # PCA-based residual (Q-statistic)
>>> filter_pca = XOutlierFilter(method="pca_residual", n_components=10)
>>>
>>> # Fit and get mask (X_train: array of shape (n_samples, n_features))
>>> filter_maha.fit(X_train)
>>> mask = filter_maha.get_mask(X_train)  # True = keep

In a pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [XOutlierFilter(method="mahalanobis", threshold=3.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → XOutlierFilter

Compute outlier detection model from training data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used for X-based filtering, but kept for API consistency).

Returns:

The fitted filter instance (self).

Return type:

XOutlierFilter

Raises:

ValueError – If X has insufficient samples for the chosen method.

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any]

Get statistics about filter application.

Parameters:

X – Feature array.
y – Target array (unused).

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value (if applicable)
n_components: PCA components (if applicable)
distance_stats: Statistics on computed distances/scores

Return type:

Dict[str, Any]
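The base statistics can all be derived from a keep-mask. The sketch below shows one way to compute them. The key names come from the list above, but the helper base_filter_stats is hypothetical, and treating exclusion_rate as a fraction between 0 and 1 (rather than a percentage) is an assumption.

```python
import numpy as np

def base_filter_stats(mask):
    """Summarise a keep-mask using the base stat names (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    n_samples = int(mask.size)
    n_kept = int(mask.sum())          # True entries are kept
    n_excluded = n_samples - n_kept   # False entries are excluded
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        # Guard against an empty input to avoid dividing by zero.
        "exclusion_rate": n_excluded / n_samples if n_samples else 0.0,
    }

stats = base_filter_stats([True, True, False, True])
```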

get_mask(X: ndarray, y: ndarray | None = None) → ndarray

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used, kept for API consistency).

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample (not an outlier)
False means EXCLUDE the sample (outlier detected)

Return type:

np.ndarray

Raises:

ValueError – If filter has not been fitted.
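Since True means keep, the returned mask can be used directly for NumPy boolean indexing. The sketch below uses a hand-written mask as a stand-in for the output of get_mask, so it runs without nirs4all. The array values are made up for illustration.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features
y = np.array([0.1, 0.2, 0.3, 0.4])
mask = np.array([True, False, True, True])    # stand-in for filter.get_mask(X)

X_kept, y_kept = X[mask], y[mask]             # rows flagged True are kept
excluded_idx = np.flatnonzero(~mask)          # indices of the excluded samples
```

Applying the same mask to X and y keeps features and targets aligned after filtering.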