nirs4all.operators.filters.x_outlier module
X-based outlier filter for sample filtering.
This module provides the XOutlierFilter class for detecting and excluding samples with outlier feature (X) values using various statistical and machine learning methods commonly used in spectroscopy and chemometrics.
- class nirs4all.operators.filters.x_outlier.XOutlierFilter(method: Literal['mahalanobis', 'robust_mahalanobis', 'pca_residual', 'pca_leverage', 'isolation_forest', 'lof'] = 'mahalanobis', threshold: float | None = None, n_components: int | None = None, contamination: float = 0.1, random_state: int | None = None, support_fraction: float | None = None, reason: str | None = None)
Bases: SampleFilter
Filter samples with outlier spectral features.
This filter identifies samples whose X-values (spectra) are statistical outliers using various detection methods. It’s commonly used to remove samples with corrupted, unusual, or non-representative spectra.
Supported methods:
- "mahalanobis": Mahalanobis distance from center (default)
- "robust_mahalanobis": Robust Mahalanobis using MinCovDet (resistant to outliers)
- "pca_residual": Q-statistic (residual) from PCA reconstruction
- "pca_leverage": T² (Hotelling's T-squared) in PCA score space
- "isolation_forest": Isolation Forest anomaly detection
- "lof": Local Outlier Factor
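As a rough illustration of the idea behind the default "mahalanobis" method, the sketch below computes Mahalanobis distances from the sample mean for a two-feature toy dataset in plain Python. It is not the XOutlierFilter implementation: the helper name mahalanobis_2d, the toy data, and the cutoff of 1.5 are assumptions made for this example only.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-feature point from the sample mean (toy helper)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]], using n - 1 in the denominator
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = v^T S^-1 v, with S^-1 = (1 / det) * [[syy, -sxy], [-sxy, sxx]]
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances

# Five points that follow a common trend and one that clearly does not
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (2.5, 2.5), (10.0, -5.0)]
distances = mahalanobis_2d(points)

threshold = 1.5  # hypothetical cutoff chosen for this toy data
keep_mask = [d <= threshold for d in distances]  # True = keep
```

Because the outlier is included when the mean and covariance are estimated, it inflates the covariance and shrinks its own distance. With n samples, no distance computed this way can exceed (n - 1) / sqrt(n), which is about 2.04 for the six points above, so a cutoff of 3.0 would flag nothing here. This masking effect is the usual motivation for robust estimators such as the one behind "robust_mahalanobis".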
Example
>>> from nirs4all.operators.filters import XOutlierFilter
>>>
>>> # Mahalanobis distance (default)
>>> filter_maha = XOutlierFilter(method="mahalanobis", threshold=3.0)
>>>
>>> # Robust Mahalanobis (better with outliers in training data)
>>> filter_robust = XOutlierFilter(method="robust_mahalanobis", threshold=3.0)
>>>
>>> # PCA-based residual (Q-statistic)
>>> filter_pca = XOutlierFilter(method="pca_residual", n_components=10)
>>>
>>> # Fit and get mask
>>> filter_maha.fit(X_train)
>>> mask = filter_maha.get_mask(X_train)  # True = keep
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [XOutlierFilter(method="mahalanobis", threshold=3.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) -> XOutlierFilter
Compute outlier detection model from training data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used for X-based filtering, but kept for API consistency).
- Returns:
The fitted filter instance.
- Return type:
self
- Raises:
ValueError – If X has insufficient samples for the chosen method.
- get_filter_stats(X: ndarray, y: ndarray | None = None) -> Dict[str, Any]
Get statistics about filter application.
- Parameters:
X – Feature array.
y – Target array (unused).
- Returns:
Dict containing:
- Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
- method: Detection method used
- threshold: Threshold value (if applicable)
- n_components: PCA components (if applicable)
- distance_stats: Statistics on computed distances/scores
- Return type:
Dict[str, Any]
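The base statistics listed above can all be derived from a keep-mask. The following is a minimal sketch of that bookkeeping, assuming that exclusion_rate is the fraction of samples excluded. The helper name base_stats is hypothetical and is not part of nirs4all.

```python
def base_stats(mask):
    """Summarize a keep-mask (True = keep) into counts and an exclusion rate."""
    n_samples = len(mask)
    n_kept = sum(1 for keep in mask if keep)
    n_excluded = n_samples - n_kept
    # Assumption: exclusion_rate is excluded / total, reported as 0.0 for empty input
    exclusion_rate = n_excluded / n_samples if n_samples else 0.0
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        "exclusion_rate": exclusion_rate,
    }

stats = base_stats([True, True, False, True])
```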
- get_mask(X: ndarray, y: ndarray | None = None) -> ndarray
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used, kept for API consistency).
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample (not an outlier)
False means EXCLUDE the sample (outlier detected)
- Return type:
np.ndarray
- Raises:
ValueError – If filter has not been fitted.
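To make the True-means-keep convention concrete, here is a small sketch of applying such a mask to samples and their targets together, so the two stay aligned. The helper apply_keep_mask and the toy values are illustrative assumptions, written with plain lists to stay dependency-free.

```python
def apply_keep_mask(samples, targets, mask):
    """Return only the samples and targets whose mask entry is True (keep)."""
    if not (len(samples) == len(targets) == len(mask)):
        raise ValueError("samples, targets and mask must have the same length")
    kept_samples = [s for s, keep in zip(samples, mask) if keep]
    kept_targets = [t for t, keep in zip(targets, mask) if keep]
    return kept_samples, kept_targets

# Toy spectra (two features each), targets, and a hypothetical mask
X_toy = [[0.1, 0.2], [0.3, 0.1], [9.0, -4.0], [0.2, 0.2]]
y_toy = [1.0, 1.2, 7.5, 1.1]
mask_toy = [True, True, False, True]  # third sample flagged as an outlier

X_kept, y_kept = apply_keep_mask(X_toy, y_toy, mask_toy)
```

With NumPy arrays, the same selection is plain boolean indexing: X[mask] and y[mask].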