nirs4all.operators.filters.y_outlier module
Y-based outlier filter for sample filtering.
This module provides the YOutlierFilter class for detecting and excluding samples with outlier target (y) values using various statistical methods.
- class nirs4all.operators.filters.y_outlier.YOutlierFilter(method: Literal['iqr', 'zscore', 'percentile', 'mad'] = 'iqr', threshold: float = 1.5, lower_percentile: float = 1.0, upper_percentile: float = 99.0, reason: str | None = None)[source]
Bases: SampleFilter
Filter samples with outlier target values.
This filter identifies samples whose y-values are statistical outliers using one of several detection methods. It’s commonly used to remove samples with extreme or erroneous target values before training.
Supported methods:
- “iqr”: Interquartile Range method (default)
- “zscore”: Z-score (standard deviations from mean)
- “percentile”: Direct percentile cutoffs
- “mad”: Median Absolute Deviation (robust to outliers)
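To make the four methods concrete, the sketch below computes keep-bounds in plain Python. It is an illustration under stated assumptions, not the library's implementation: `percentile` and `outlier_bounds` are hypothetical helpers, and the use of the population standard deviation, an unscaled MAD, and linear percentile interpolation are guesses that YOutlierFilter may not share.

```python
import statistics


def percentile(sorted_y, p):
    """Linearly interpolated percentile (p in 0..100) of a pre-sorted list."""
    pos = (len(sorted_y) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_y) - 1)
    return sorted_y[lo] + (sorted_y[hi] - sorted_y[lo]) * (pos - lo)


def outlier_bounds(y, method="iqr", threshold=1.5,
                   lower_percentile=1.0, upper_percentile=99.0):
    """Return (lower, upper) bounds. Hypothetical helper, not part of nirs4all."""
    ys = sorted(y)
    if method == "iqr":
        # Bounds are the quartiles widened by threshold * IQR.
        q1, q3 = percentile(ys, 25.0), percentile(ys, 75.0)
        spread = q3 - q1
        return q1 - threshold * spread, q3 + threshold * spread
    if method == "zscore":
        center = statistics.fmean(ys)
        spread = statistics.pstdev(ys)  # population std: an assumption
        return center - threshold * spread, center + threshold * spread
    if method == "mad":
        center = statistics.median(ys)
        # Unscaled median absolute deviation: an assumption
        spread = statistics.median([abs(v - center) for v in ys])
        return center - threshold * spread, center + threshold * spread
    if method == "percentile":
        # threshold is ignored here; the cutoffs are the bounds themselves.
        return percentile(ys, lower_percentile), percentile(ys, upper_percentile)
    raise ValueError(f"unknown method: {method!r}")


y = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
iqr_bounds = outlier_bounds(y, method="iqr", threshold=1.5)
zscore_bounds = outlier_bounds(y, method="zscore", threshold=3.0)
mad_bounds = outlier_bounds(y, method="mad", threshold=3.0)
```

On this toy data the IQR and MAD bounds both exclude 100.0, while the z-score bounds at threshold 3.0 do not, because the extreme value inflates the standard deviation. That is the sense in which the median-based methods are robust to outliers.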
Example
>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> # IQR method (default, threshold=1.5 is standard)
>>> filter_iqr = YOutlierFilter(method="iqr", threshold=1.5)
>>>
>>> # Z-score method (threshold=3.0 is common)
>>> filter_zscore = YOutlierFilter(method="zscore", threshold=3.0)
>>>
>>> # Percentile method
>>> filter_pct = YOutlierFilter(
...     method="percentile",
...     lower_percentile=1.0,
...     upper_percentile=99.0
... )
>>>
>>> # Fit and get mask
>>> filter_iqr.fit(X_train, y_train)
>>> mask = filter_iqr.get_mask(X_train, y_train)  # True = keep
- In Pipeline:
>
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [YOutlierFilter(method="iqr", threshold=1.5)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) YOutlierFilter[source]
Compute outlier detection bounds from training data.
- Parameters:
X – Feature array of shape (n_samples, n_features). Not used but required for sklearn compatibility.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.
- Returns:
The fitted filter instance.
- Return type:
self
- Raises:
ValueError – If y is None (required for Y-based filtering).
ValueError – If y has no valid (non-NaN) values.
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics about filter application including method-specific details.
- Parameters:
X – Feature array.
y – Target array.
- Returns:
- Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value
lower_bound: Computed lower bound
upper_bound: Computed upper bound
center: Central value (mean/median)
scale: Scale measure (std/IQR/MAD)
y_range: (min, max) of input y values
- Return type:
Dict[str, Any]
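As a rough illustration of how the base stats relate to a keep-mask, the sketch below derives them from a list of booleans. `base_stats` is a hypothetical helper, and treating `exclusion_rate` as the fraction `n_excluded / n_samples` is an assumption; the library may define or scale it differently.

```python
def base_stats(mask):
    """Summarize a keep-mask (True = keep). Hypothetical helper, not part of nirs4all."""
    n_samples = len(mask)
    n_kept = sum(1 for keep in mask if keep)
    n_excluded = n_samples - n_kept
    # Assumed definition: fraction of samples excluded (0.0 for an empty mask).
    exclusion_rate = n_excluded / n_samples if n_samples else 0.0
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        "exclusion_rate": exclusion_rate,
    }


stats = base_stats([True, True, False, True])
```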
- get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features). Not used but required for API consistency.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample (within bounds)
False means EXCLUDE the sample (outside bounds)
- Return type:
np.ndarray
- Raises:
ValueError – If y is None.
ValueError – If filter has not been fitted (bounds not set).
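The mask semantics above can be sketched in plain Python as follows. `keep_mask` and `apply_mask` are hypothetical helpers; treating the bounds as inclusive is an assumption, and the sketch covers only a one-dimensional y, not the (n_samples, n_targets) case.

```python
def keep_mask(y, lower, upper):
    """True where a value lies within [lower, upper]; inclusive bounds are an assumption."""
    return [lower <= v <= upper for v in y]


def apply_mask(rows, mask):
    """Keep only the entries whose mask value is True."""
    return [row for row, keep in zip(rows, mask) if keep]


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
y = [2.0, 3.0, 100.0, 4.0]

# Bounds as they might be stored after fitting (illustrative values).
mask = keep_mask(y, lower=-1.5, upper=8.5)

# The same mask filters features and targets so they stay aligned.
X_kept = apply_mask(X, mask)
y_kept = apply_mask(y, mask)
```

With NumPy arrays, the equivalent selection is boolean indexing: `X[mask]` and `y[mask]`.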