nirs4all.operators.filters package
Submodules
- nirs4all.operators.filters.base module
- nirs4all.operators.filters.high_leverage module
- nirs4all.operators.filters.metadata module
- nirs4all.operators.filters.report module
  - FilterResult
    - FilterResult.filter_name
    - FilterResult.reason
    - FilterResult.n_samples
    - FilterResult.n_excluded
    - FilterResult.n_kept
    - FilterResult.exclusion_rate
    - FilterResult.excluded_indices
    - FilterResult.stats
    - FilterResult.to_dict()
  - FilteringReport
    - FilteringReport.dataset_name
    - FilteringReport.partition
    - FilteringReport.timestamp
    - FilteringReport.filter_results
    - FilteringReport.combined_mode
    - FilteringReport.n_total_samples
    - FilteringReport.n_final_excluded
    - FilteringReport.n_final_kept
    - FilteringReport.cascade_to_augmented
    - FilteringReport.n_augmented_excluded
    - FilteringReport.final_exclusion_rate
    - FilteringReport.add_filter_result()
    - FilteringReport.print_report()
    - FilteringReport.summary()
    - FilteringReport.to_dict()
    - FilteringReport.to_json()
  - FilteringReportGenerator
- nirs4all.operators.filters.spectral_quality module
  - SpectralQualityFilter
    - SpectralQualityFilter.max_nan_ratio
    - SpectralQualityFilter.max_zero_ratio
    - SpectralQualityFilter.min_variance
    - SpectralQualityFilter.max_value
    - SpectralQualityFilter.min_value
    - SpectralQualityFilter.__repr__()
    - SpectralQualityFilter.exclusion_reason
    - SpectralQualityFilter.fit()
    - SpectralQualityFilter.get_filter_stats()
    - SpectralQualityFilter.get_mask()
    - SpectralQualityFilter.get_quality_breakdown()
- nirs4all.operators.filters.x_outlier module
- nirs4all.operators.filters.y_outlier module
Module contents
Sample filtering operators for nirs4all.
This module provides operators for filtering (excluding) samples from training datasets. Filters are non-destructive: they mark samples as excluded in the indexer rather than removing data.
- Classes:
SampleFilter: Base class for all sample filtering operators
CompositeFilter: Combine multiple filters with AND/OR logic
YOutlierFilter: Filter samples with outlier target values (IQR, zscore, percentile, MAD)
XOutlierFilter: Filter samples with outlier spectral features (Mahalanobis, PCA, LOF, etc.)
SpectralQualityFilter: Filter samples with poor spectral quality (NaN, zeros, variance)
HighLeverageFilter: Filter high-leverage samples that may unduly influence models
MetadataFilter: Filter samples based on metadata column values
FilteringReport: Comprehensive report of sample filtering operations
FilteringReportGenerator: Generator for creating filtering reports
Example
>>> from nirs4all.operators.filters import YOutlierFilter, XOutlierFilter
>>>
>>> # In a pipeline
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [
...                 YOutlierFilter(method="iqr", threshold=1.5),
...                 XOutlierFilter(method="mahalanobis", threshold=3.0),
...             ],
...             "report": True,
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- class nirs4all.operators.filters.CompositeFilter(filters: List[SampleFilter] | None = None, mode: str = 'any', reason: str | None = None)[source]
Bases: SampleFilter
Combine multiple filters with AND/OR logic.
This filter aggregates the results of multiple sub-filters using either “any” or “all” mode:
- “any” (default): Exclude if ANY filter flags the sample
- “all”: Exclude only if ALL filters flag the sample
- filters
List of filter instances to combine
- Type:
List[SampleFilter]
Example
>>> from nirs4all.operators.filters import YOutlierFilter, CompositeFilter
>>>
>>> # Exclude if either filter flags
>>> combined = CompositeFilter(
...     filters=[
...         YOutlierFilter(method="iqr", threshold=1.5),
...         YOutlierFilter(method="zscore", threshold=3.0),
...     ],
...     mode="any"
... )
- fit(X: ndarray, y: ndarray | None = None) CompositeFilter[source]
Fit all sub-filters to the training data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
The fitted composite filter.
- Return type:
self
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics including per-filter breakdown.
- Parameters:
X – Feature array.
y – Target array.
- Returns:
Dict with overall stats and per-filter breakdown.
- get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute combined mask from all sub-filters.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
Boolean array where True = keep, False = exclude.
For “any” mode: keep if ALL filters say keep
For “all” mode: keep if ANY filter says keep
- Return type:
np.ndarray
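The inversion between "exclude if ANY flags" and "keep if ALL say keep" is easy to misread, so a minimal sketch of the combination rule may help. It uses plain Python lists rather than NumPy arrays, and `combine_keep_masks` is a hypothetical helper written for illustration, not a function of nirs4all.

```python
from typing import List


def combine_keep_masks(masks: List[List[bool]], mode: str = "any") -> List[bool]:
    """Combine per-filter keep masks (True = keep) into one keep mask.

    Hypothetical helper mirroring the documented "any"/"all" semantics.
    """
    if mode not in ("any", "all"):
        raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")
    if not masks:
        return []
    per_sample_votes = zip(*masks)  # one tuple of keep-votes per sample
    if mode == "any":
        # Exclude if ANY filter flags the sample, so keep only if ALL say keep.
        return [all(votes) for votes in per_sample_votes]
    # mode == "all": exclude only if ALL filters flag, so keep if ANY says keep.
    return [any(votes) for votes in per_sample_votes]


mask_a = [True, False, True, False]
mask_b = [True, True, False, False]
keep_any = combine_keep_masks([mask_a, mask_b], mode="any")
keep_all = combine_keep_masks([mask_a, mask_b], mode="all")
```

With these two masks, "any" mode keeps only the first sample, while "all" mode drops only the last one, which both filters flagged.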
- class nirs4all.operators.filters.FilterResult(filter_name: str, reason: str, n_samples: int, n_excluded: int, n_kept: int, exclusion_rate: float, excluded_indices: ~typing.List[int] = <factory>, stats: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Result of applying a single filter.
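To make the relationship between the fields concrete, here is a small stand-in record built with `dataclasses`. The field names and defaults follow the signature above; the class name `FilterResultSketch` and the `from_mask` constructor are hypothetical and only illustrate how the counts could be derived from a keep mask.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class FilterResultSketch:
    """Illustrative stand-in for FilterResult (not the library class)."""

    filter_name: str
    reason: str
    n_samples: int
    n_excluded: int
    n_kept: int
    exclusion_rate: float
    excluded_indices: List[int] = field(default_factory=list)
    stats: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_mask(cls, filter_name: str, reason: str,
                  keep_mask: List[bool]) -> "FilterResultSketch":
        # Hypothetical convenience constructor: True in keep_mask means keep.
        excluded = [i for i, keep in enumerate(keep_mask) if not keep]
        n_samples = len(keep_mask)
        rate = len(excluded) / n_samples if n_samples else 0.0
        return cls(filter_name, reason, n_samples, len(excluded),
                   n_samples - len(excluded), rate, excluded)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


result = FilterResultSketch.from_mask("YOutlierFilter", "y_outlier",
                                      [True, False, True, True])
record = result.to_dict()
```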
- class nirs4all.operators.filters.FilteringReport(dataset_name: str, partition: str, timestamp: str = <factory>, filter_results: ~typing.List[~nirs4all.operators.filters.report.FilterResult] = <factory>, combined_mode: str = 'any', n_total_samples: int = 0, n_final_excluded: int = 0, n_final_kept: int = 0, cascade_to_augmented: bool = True, n_augmented_excluded: int = 0)[source]
Bases: object
Comprehensive report of sample filtering operations.
This class aggregates results from multiple filters and provides methods for analysis, visualization, and export.
- filter_results
List of individual filter results
- Type:
List[FilterResult]
- add_filter_result(result: FilterResult) None[source]
Add a filter result to the report.
- filter_results: List[FilterResult]
- print_report(verbose: int = 1) None[source]
Print the filtering report to console.
- Parameters:
verbose – Verbosity level (0=minimal, 1=normal, 2=detailed)
- class nirs4all.operators.filters.FilteringReportGenerator(dataset: SpectroDataset)[source]
Bases: object
Generator for creating comprehensive filtering reports.
This class provides utilities for collecting filter statistics, generating reports, and exporting results.
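Example
>>> # X_train, y_train and sample_indices are assumed to be defined already
>>> generator = FilteringReportGenerator(dataset)
>>> report = generator.create_report(
...     filters=[YOutlierFilter(method="iqr")],
...     X=X_train,
...     y=y_train,
...     sample_indices=sample_indices,
...     mode="any",
...     partition="train"
... )
>>> report.print_report()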
- compare_filters(filters: List[SampleFilter], X: ndarray, y: ndarray) Dict[str, Any][source]
Compare multiple filters on the same data without applying them.
Useful for understanding which filter is more aggressive or to find the overlap between filter decisions.
- Parameters:
filters – List of filters to compare
X – Feature array
y – Target array
- Returns:
Dictionary with comparison statistics:
individual: Per-filter stats
overlap: Samples flagged by multiple filters
unique: Samples flagged by only one filter
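The overlap and unique categories can be pictured with plain set arithmetic. The sketch below assumes each filter's decisions are already available as a set of excluded sample indices; `compare_exclusions` and the exact shape of its return value are illustrative assumptions, since the documentation above names the keys but not their structure.

```python
from collections import Counter
from typing import Dict, List, Set


def compare_exclusions(excluded: Dict[str, Set[int]]) -> Dict[str, object]:
    """Summarise how several filters' exclusion sets relate (illustrative)."""
    # How many filters flagged each sample index.
    counts = Counter(i for indices in excluded.values() for i in indices)
    individual = {name: len(indices) for name, indices in excluded.items()}
    overlap: List[int] = sorted(i for i, c in counts.items() if c > 1)
    unique = {
        name: sorted(i for i in indices if counts[i] == 1)
        for name, indices in excluded.items()
    }
    return {"individual": individual, "overlap": overlap, "unique": unique}


comparison = compare_exclusions({
    "y_iqr": {1, 4, 7},
    "x_mahalanobis": {4, 9},
})
```

Here sample 4 is flagged by both filters, while samples 1 and 7 are specific to the y-based filter and sample 9 to the X-based one.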
- create_report(filters: List[SampleFilter], X: ndarray, y: ndarray, sample_indices: ndarray, mode: str = 'any', partition: str = 'train', cascade_to_augmented: bool = True, dry_run: bool = True) FilteringReport[source]
Create a filtering report by applying filters to data.
- Parameters:
filters – List of SampleFilter instances to apply
X – Feature array (n_samples, n_features)
y – Target array (n_samples,) or (n_samples, n_targets)
sample_indices – Array of sample indices corresponding to X/y
mode – Filter combination mode (“any” or “all”)
partition – Which partition is being filtered
cascade_to_augmented – Whether augmented samples will be cascaded
dry_run – If True, don’t actually mark samples as excluded
- Returns:
FilteringReport with all statistics and results
- generate_from_indexer(partition: str | None = 'train') FilteringReport[source]
Generate a report from current indexer exclusion state.
This method creates a report based on samples already marked as excluded in the indexer, rather than applying filters.
- Parameters:
partition – Partition to report on (None for all partitions)
- Returns:
FilteringReport based on current exclusion state
- class nirs4all.operators.filters.HighLeverageFilter(method: Literal['hat', 'pca'] = 'hat', threshold_multiplier: float = 2.0, absolute_threshold: float | None = None, n_components: int | None = None, center: bool = True, reason: str | None = None)[source]
Bases: SampleFilter
Filter high-leverage samples that may unduly influence the model.
High-leverage points are samples that are far from the center of the predictor space and can have a disproportionate effect on regression models. This filter identifies and excludes such samples.
The leverage of a sample is computed from the hat matrix H = X(X’X)^(-1)X’. The diagonal elements h_ii represent the leverage of each sample.
Supported methods:
- “hat”: Direct hat matrix diagonal computation
- “pca”: PCA-based leverage (for high-dimensional data)
Common threshold guidelines:
- 2 * p / n (where p = number of parameters, n = samples)
- 3 * average leverage
- Absolute threshold (e.g., 0.5)
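The link between the hat matrix and the 2 * p / n guideline can be written out as follows. This is a sketch of the standard result for an n × p matrix X with full column rank; reading the multiplier m as the `threshold_multiplier` parameter is an assumption based on the parameter name, not something the documentation states.

```latex
\begin{aligned}
H &= X (X^\top X)^{-1} X^\top, \\
h_{ii} &= x_i^\top (X^\top X)^{-1} x_i, \qquad 0 \le h_{ii} \le 1, \\
\sum_{i=1}^{n} h_{ii} &= \operatorname{tr}(H) = p
  \quad\Longrightarrow\quad \bar{h} = \frac{p}{n}, \\
\text{flag sample } i \text{ when}\quad h_{ii} &> m \, \bar{h} = \frac{m \, p}{n}.
\end{aligned}
```

With m = 2 this reproduces the 2 * p / n guideline, and m = 3 corresponds to three times the average leverage.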
Example
>>> from nirs4all.operators.filters import HighLeverageFilter
>>>
>>> # Using multiplier of average leverage (default)
>>> filter_obj = HighLeverageFilter(threshold_multiplier=2.0)
>>>
>>> # Using absolute threshold
>>> filter_abs = HighLeverageFilter(absolute_threshold=0.5)
>>>
>>> # PCA-based for high-dimensional data
>>> filter_pca = HighLeverageFilter(method="pca", n_components=10)
>>>
>>> # Fit and get mask
>>> filter_obj.fit(X_train)
>>> mask = filter_obj.get_mask(X_train)  # True = keep
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [HighLeverageFilter(threshold_multiplier=2.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) HighLeverageFilter[source]
Compute leverage statistics from training data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used for leverage computation).
- Returns:
The fitted filter instance.
- Return type:
self
- Raises:
ValueError – If X has insufficient samples.
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics about filter application.
- Parameters:
X – Feature array.
y – Target array (unused).
- Returns:
Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Leverage computation method
threshold: Computed threshold
n_effective_features: Number of features/components used
leverage_stats: Statistics on leverage values
- get_leverages(X: ndarray) ndarray[source]
Compute leverage values for samples.
This method returns the raw leverage values for inspection or custom thresholding.
- Parameters:
X – Feature array of shape (n_samples, n_features).
- Returns:
Array of leverage values for each sample.
- Return type:
np.ndarray
- Raises:
ValueError – If filter has not been fitted.
- get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used).
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample (low leverage)
False means EXCLUDE the sample (high leverage)
- Return type:
np.ndarray
- Raises:
ValueError – If filter has not been fitted.
- class nirs4all.operators.filters.MetadataFilter(column: str, condition: Callable[[Any], bool] | None = None, values_to_exclude: List[Any] | None = None, values_to_keep: List[Any] | None = None, exclude_missing: bool = True, reason: str | None = None)[source]
Bases: SampleFilter
Filter samples based on metadata column values.
This filter allows excluding samples based on external metadata (not X or y) using custom condition functions. It’s useful for filtering based on:
- Sample quality flags
- Acquisition conditions
- Sample categories to exclude
- Date/time-based filtering
- Any other metadata criteria
The filter works with metadata passed during get_mask() call, as metadata is not part of the standard sklearn X, y interface.
- condition
Function returning True for samples to KEEP
- Type:
Callable
- values_to_exclude
List of values that should be excluded
- Type:
List
- values_to_keep
List of values that should be kept
- Type:
List
Example
>>> from nirs4all.operators.filters import MetadataFilter
>>>
>>> # Exclude specific values
>>> filter_obj = MetadataFilter(
...     column="quality_flag",
...     values_to_exclude=["bad", "corrupted"]
... )
>>>
>>> # Keep only specific values
>>> filter_keep = MetadataFilter(
...     column="sample_type",
...     values_to_keep=["control", "treatment"]
... )
>>>
>>> # Custom condition
>>> filter_custom = MetadataFilter(
...     column="temperature",
...     condition=lambda x: 20 <= x <= 30  # Keep 20-30°C
... )
>>>
>>> # Get mask (metadata must be provided)
>>> mask = filter_obj.get_mask(X, metadata=metadata_df)
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [
...                 MetadataFilter(
...                     column="quality",
...                     values_to_exclude=["bad"]
...                 )
...             ],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) MetadataFilter[source]
Fit the filter (no-op for metadata filter).
Metadata filtering uses fixed criteria, so no fitting is required.
- Parameters:
X – Feature array (not used).
y – Target array (not used).
- Returns:
The filter instance (unchanged).
- Return type:
self
- get_filter_stats(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | None = None) Dict[str, Any][source]
Get statistics about filter application.
- Parameters:
X – Feature array.
y – Target array (unused).
metadata – Metadata dictionary.
- Returns:
Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
column: Filtered column name
filtering_type: Type of filtering applied
value_counts: Count of unique values (if available)
- get_mask(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | Any | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features). Used only to determine number of samples if metadata is not provided.
y – Target array (not used).
metadata – Metadata dictionary, DataFrame, or object with column access. Must contain the specified column. Can be:
- Dict[str, np.ndarray]: metadata[column] returns array
- pd.DataFrame: metadata[column] returns series
- Any object with __getitem__ that returns array-like
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample
False means EXCLUDE the sample
- Return type:
np.ndarray
- Raises:
ValueError – If metadata is None and filtering requires it.
KeyError – If the specified column is not in metadata.
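As a rough picture of how a keep mask could be derived from one metadata column, the sketch below applies the three kinds of criteria to a plain list of values. The function `metadata_keep_mask` is hypothetical; in particular, combining several criteria with AND and treating `None` as the only missing marker are assumptions made for this illustration, not documented behaviour.

```python
from typing import Any, Callable, List, Optional, Sequence


def metadata_keep_mask(
    values: Sequence[Any],
    condition: Optional[Callable[[Any], bool]] = None,
    values_to_exclude: Optional[List[Any]] = None,
    values_to_keep: Optional[List[Any]] = None,
    exclude_missing: bool = True,
) -> List[bool]:
    """Return a keep mask (True = keep) for one metadata column (illustrative)."""
    mask = []
    for value in values:
        if value is None:  # assumption: None marks a missing entry
            mask.append(not exclude_missing)
            continue
        keep = True
        if condition is not None:
            keep = keep and bool(condition(value))  # condition returns True to KEEP
        if values_to_exclude is not None:
            keep = keep and value not in values_to_exclude
        if values_to_keep is not None:
            keep = keep and value in values_to_keep
        mask.append(keep)
    return mask


quality = ["good", "bad", None, "good", "corrupted"]
quality_mask = metadata_keep_mask(quality, values_to_exclude=["bad", "corrupted"])

temperature = [18.5, 22.0, 30.0, 31.2]
temperature_mask = metadata_keep_mask(temperature, condition=lambda t: 20 <= t <= 30)
```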
- class nirs4all.operators.filters.SampleFilter(reason: str | None = None)[source]
Bases: TransformerMixin, BaseEstimator, ABC
Base class for sample filtering operators.
Sample filters identify samples that should be excluded from training datasets. Unlike transformers that modify data, filters mark samples for exclusion without altering the underlying data.
The filtering pattern works as follows:
1. fit(): Learn filter criteria from training data (e.g., compute thresholds)
2. get_mask(): Return boolean mask indicating which samples to KEEP
3. transform(): No-op (filtering happens at indexer level, not data level)
All concrete filter implementations must override the get_mask() method.
- reason
Identifier for this filter type, used to track exclusion reasons in the indexer. Default is the class name.
- Type:
str
Example
>>> import numpy as np
>>> from nirs4all.operators.filters import SampleFilter
>>>
>>> class MyFilter(SampleFilter):
...     def __init__(self, threshold: float = 1.0):
...         super().__init__()
...         self.threshold = threshold
...
...     def fit(self, X, y=None):
...         self.mean_ = np.mean(y)
...         self.std_ = np.std(y)
...         return self
...
...     def get_mask(self, X, y=None) -> np.ndarray:
...         z_scores = np.abs((y - self.mean_) / self.std_)
...         return z_scores <= self.threshold  # True = keep
- property exclusion_reason: str
Get the exclusion reason identifier for this filter.
- Returns:
Reason string to be stored in indexer’s exclusion_reason column.
- Return type:
str
- fit(X: ndarray, y: ndarray | None = None) SampleFilter[source]
Compute filter criteria from training data.
This method should learn any thresholds, statistics, or models needed to identify outliers/bad samples. Override in subclasses for filters that need to learn from data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.
- Returns:
The fitted filter instance.
- Return type:
self
- fit_transform(X: ndarray, y: ndarray | None = None, **fit_params) ndarray[source]
Fit to data and return unchanged (transform is no-op).
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
**fit_params – Additional fitting parameters (unused).
- Returns:
The unchanged input array.
- Return type:
np.ndarray
- get_excluded_indices(X: ndarray, y: ndarray | None = None) ndarray[source]
Get indices of samples to be excluded.
Convenience method that inverts get_mask() to return indices of samples marked for exclusion.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
Integer array of indices for samples to exclude.
- Return type:
np.ndarray
Example
>>> y_filter = YOutlierFilter(method="iqr")
>>> y_filter.fit(X_train, y_train)
>>> excluded_idx = y_filter.get_excluded_indices(X_train, y_train)
>>> print(f"Excluding {len(excluded_idx)} samples")
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics about filter application.
Override in subclasses to provide filter-specific statistics (e.g., thresholds used, distribution of values, etc.).
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
- Dictionary containing filter statistics:
n_samples: Total number of samples
n_excluded: Number of samples to exclude
n_kept: Number of samples to keep
exclusion_rate: Ratio of excluded to total
reason: Exclusion reason string
- Return type:
Dict[str, Any]
- get_kept_indices(X: ndarray, y: ndarray | None = None) ndarray[source]
Get indices of samples to be kept.
Convenience method that returns indices of samples NOT marked for exclusion.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
- Returns:
Integer array of indices for samples to keep.
- Return type:
np.ndarray
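Both convenience methods are two views of the same keep mask. The sketch below shows that relationship on a plain Python list; `split_indices` is a hypothetical helper used only for illustration.

```python
from typing import List, Sequence, Tuple


def split_indices(keep_mask: Sequence[bool]) -> Tuple[List[int], List[int]]:
    """Return (kept, excluded) index lists for a keep mask (True = keep)."""
    kept = [i for i, keep in enumerate(keep_mask) if keep]
    excluded = [i for i, keep in enumerate(keep_mask) if not keep]
    return kept, excluded


keep_mask = [True, False, True, True, False]
kept_idx, excluded_idx = split_indices(keep_mask)
print(f"Excluding {len(excluded_idx)} of {len(keep_mask)} samples")
```

Every sample index appears in exactly one of the two lists, so the kept and excluded indices together always cover the full range of samples.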
- abstractmethod get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP.
This is the core method that must be implemented by all concrete filters. Returns True for samples that should be kept, False for samples to exclude.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample
False means EXCLUDE the sample
- Return type:
np.ndarray
- Raises:
NotImplementedError – If the subclass doesn’t implement this method.
- transform(X: ndarray) ndarray[source]
Transform is a no-op for filters.
Filtering happens at the indexer level, not by modifying the data array. This method returns the input unchanged to maintain sklearn compatibility.
- Parameters:
X – Feature array of shape (n_samples, n_features).
- Returns:
The unchanged input array.
- Return type:
np.ndarray
- class nirs4all.operators.filters.SpectralQualityFilter(max_nan_ratio: float = 0.1, max_zero_ratio: float = 0.5, min_variance: float = 1e-08, max_value: float | None = None, min_value: float | None = None, check_inf: bool = True, reason: str | None = None)[source]
Bases: SampleFilter
Filter samples with poor spectral quality.
This filter identifies samples whose spectra exhibit quality issues such as:
- High proportion of NaN or missing values
- High proportion of zero values (potentially corrupted)
- Very low variance (flat or constant spectra)
- Values outside expected range (saturation)
Example
>>> from nirs4all.operators.filters import SpectralQualityFilter
>>>
>>> # Default quality checks
>>> filter_obj = SpectralQualityFilter()
>>>
>>> # Strict quality requirements
>>> filter_strict = SpectralQualityFilter(
...     max_nan_ratio=0.01,
...     max_zero_ratio=0.1,
...     min_variance=1e-4
... )
>>>
>>> # Check for saturated spectra
>>> filter_sat = SpectralQualityFilter(max_value=4.0, min_value=-0.5)
>>>
>>> # Get mask
>>> mask = filter_obj.get_mask(X_train)  # True = keep
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [SpectralQualityFilter(max_nan_ratio=0.05)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) SpectralQualityFilter[source]
Fit the filter (no-op for quality filter as thresholds are fixed).
The SpectralQualityFilter uses fixed thresholds set at initialization, so no fitting is required. This method is provided for API consistency.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used).
- Returns:
The filter instance (unchanged).
- Return type:
self
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics about filter application including quality breakdown.
- Parameters:
X – Feature array.
y – Target array (unused).
- Returns:
Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
Quality thresholds
Per-check failure counts
Quality metric distributions
- get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP based on quality.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used for X-based quality checks).
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample (passes quality checks)
False means EXCLUDE the sample (fails quality checks)
- Return type:
np.ndarray
- get_quality_breakdown(X: ndarray, y: ndarray | None = None) Dict[str, ndarray][source]
Get detailed breakdown of which quality checks each sample fails.
This method provides per-check masks to understand why specific samples were excluded.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used).
- Returns:
Dict with boolean arrays for each quality check:
“passes_nan”: True if NaN ratio is acceptable
“passes_inf”: True if no Inf values
“passes_zero”: True if zero ratio is acceptable
“passes_variance”: True if variance is sufficient
“passes_max_value”: True if max value is within limit
“passes_min_value”: True if min value is within limit
“passes_all”: True if passes all checks
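To illustrate what a per-check breakdown looks like, the sketch below evaluates four of the checks on a single spectrum given as a list of floats. It is a simplified stand-in, not the library implementation: the function name, the use of `<=` and `>=` at the thresholds, and computing variance over finite values only are all assumptions.

```python
import math
from statistics import pvariance
from typing import Dict, Sequence


def quality_breakdown(
    spectrum: Sequence[float],
    max_nan_ratio: float = 0.1,
    max_zero_ratio: float = 0.5,
    min_variance: float = 1e-8,
) -> Dict[str, bool]:
    """Per-check pass/fail flags for one spectrum (illustrative sketch)."""
    n = len(spectrum)
    if n == 0:
        raise ValueError("spectrum must contain at least one value")
    n_nan = sum(1 for v in spectrum if math.isnan(v))
    finite = [v for v in spectrum if math.isfinite(v)]
    n_zero = sum(1 for v in finite if v == 0.0)
    # Assumption: variance is computed on finite values only.
    variance = pvariance(finite) if len(finite) > 1 else 0.0
    checks = {
        "passes_nan": n_nan / n <= max_nan_ratio,
        "passes_inf": not any(math.isinf(v) for v in spectrum),
        "passes_zero": n_zero / n <= max_zero_ratio,
        "passes_variance": variance >= min_variance,
    }
    checks["passes_all"] = all(checks.values())
    return checks


good = quality_breakdown([0.12, 0.35, 0.51, 0.48, 0.30])
flat = quality_breakdown([0.2, 0.2, 0.2, 0.2, 0.2])
gappy = quality_breakdown([float("nan"), float("nan"), 0.1, 0.4, 0.3])
```

The flat spectrum fails only the variance check and the gappy one fails only the NaN check, which is the kind of diagnosis a breakdown is meant to expose.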
- class nirs4all.operators.filters.XOutlierFilter(method: Literal['mahalanobis', 'robust_mahalanobis', 'pca_residual', 'pca_leverage', 'isolation_forest', 'lof'] = 'mahalanobis', threshold: float | None = None, n_components: int | None = None, contamination: float = 0.1, random_state: int | None = None, support_fraction: float | None = None, reason: str | None = None)[source]
Bases: SampleFilter
Filter samples with outlier spectral features.
This filter identifies samples whose X-values (spectra) are statistical outliers using various detection methods. It’s commonly used to remove samples with corrupted, unusual, or non-representative spectra.
Supported methods:
- “mahalanobis”: Mahalanobis distance from center (default)
- “robust_mahalanobis”: Robust Mahalanobis using MinCovDet (resistant to outliers)
- “pca_residual”: Q-statistic (residual) from PCA reconstruction
- “pca_leverage”: T² (Hotelling’s T-squared) in PCA score space
- “isolation_forest”: Isolation Forest anomaly detection
- “lof”: Local Outlier Factor
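For reference, the three distance-type statistics named above are usually defined as follows. These are the textbook forms, given here as a sketch; whether the library applies exactly this scaling, or how it converts `threshold` into a cutoff, is not stated in this documentation. Here μ and Σ are the training mean and covariance, P holds the first A PCA loadings as orthonormal columns, and λ_a is the variance of the a-th score.

```latex
\begin{aligned}
d_M(x) &= \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}, \\
t &= P^\top (x - \mu), \qquad \hat{x} = \mu + P\,t, \\
Q(x) &= \lVert x - \hat{x} \rVert^2, \\
T^2(x) &= \sum_{a=1}^{A} \frac{t_a^2}{\lambda_a}.
\end{aligned}
```

Q measures what the PCA model fails to reconstruct, while T² measures how extreme a sample is within the model's score space, so the two can flag different samples.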
Example
>>> from nirs4all.operators.filters import XOutlierFilter
>>>
>>> # Mahalanobis distance (default)
>>> filter_maha = XOutlierFilter(method="mahalanobis", threshold=3.0)
>>>
>>> # Robust Mahalanobis (better with outliers in training data)
>>> filter_robust = XOutlierFilter(method="robust_mahalanobis", threshold=3.0)
>>>
>>> # PCA-based residual (Q-statistic)
>>> filter_pca = XOutlierFilter(method="pca_residual", n_components=10)
>>>
>>> # Fit and get mask
>>> filter_maha.fit(X_train)
>>> mask = filter_maha.get_mask(X_train)  # True = keep
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [XOutlierFilter(method="mahalanobis", threshold=3.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) XOutlierFilter[source]
Compute outlier detection model from training data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used for X-based filtering, but kept for API consistency).
- Returns:
The fitted filter instance.
- Return type:
self
- Raises:
ValueError – If X has insufficient samples for the chosen method.
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics about filter application.
- Parameters:
X – Feature array.
y – Target array (unused).
- Returns:
Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value (if applicable)
n_components: PCA components (if applicable)
distance_stats: Statistics on computed distances/scores
- get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used, kept for API consistency).
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample (not an outlier)
False means EXCLUDE the sample (outlier detected)
- Return type:
np.ndarray
- Raises:
ValueError – If filter has not been fitted.
- class nirs4all.operators.filters.YOutlierFilter(method: Literal['iqr', 'zscore', 'percentile', 'mad'] = 'iqr', threshold: float = 1.5, lower_percentile: float = 1.0, upper_percentile: float = 99.0, reason: str | None = None)[source]
Bases: SampleFilter
Filter samples with outlier target values.
This filter identifies samples whose y-values are statistical outliers using one of several detection methods. It’s commonly used to remove samples with extreme or erroneous target values before training.
Supported methods:
- “iqr”: Interquartile Range method (default)
- “zscore”: Z-score (standard deviations from mean)
- “percentile”: Direct percentile cutoffs
- “mad”: Median Absolute Deviation (robust to outliers)
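As an illustration of the default method, the sketch below computes IQR-based bounds with the standard library and turns them into a keep mask. It shows the usual Tukey fence rule (Q1 - k·IQR, Q3 + k·IQR); the helper name `iqr_bounds` and the choice of quartile interpolation are assumptions, and the library's own bounds may differ slightly.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def iqr_bounds(y: Sequence[float], threshold: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences from the interquartile range (illustrative)."""
    # "inclusive" interpolates linearly between order statistics.
    q1, _, q3 = quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


y_train = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 50.0]
lower, upper = iqr_bounds(y_train, threshold=1.5)
keep: List[bool] = [lower <= value <= upper for value in y_train]  # True = keep
```

For this data the quartiles are 12 and 16, so the fences sit at 6 and 22 and only the value 50 falls outside them.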
Example
>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> # IQR method (default, threshold=1.5 is standard)
>>> filter_iqr = YOutlierFilter(method="iqr", threshold=1.5)
>>>
>>> # Z-score method (threshold=3.0 is common)
>>> filter_zscore = YOutlierFilter(method="zscore", threshold=3.0)
>>>
>>> # Percentile method
>>> filter_pct = YOutlierFilter(
...     method="percentile",
...     lower_percentile=1.0,
...     upper_percentile=99.0
... )
>>>
>>> # Fit and get mask
>>> filter_iqr.fit(X_train, y_train)
>>> mask = filter_iqr.get_mask(X_train, y_train)  # True = keep
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [YOutlierFilter(method="iqr", threshold=1.5)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) YOutlierFilter[source]
Compute outlier detection bounds from training data.
- Parameters:
X – Feature array of shape (n_samples, n_features). Not used but required for sklearn compatibility.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.
- Returns:
The fitted filter instance.
- Return type:
self
- Raises:
ValueError – If y is None (required for Y-based filtering).
ValueError – If y has no valid (non-NaN) values.
- get_filter_stats(X: ndarray, y: ndarray | None = None) Dict[str, Any][source]
Get statistics about filter application including method-specific details.
- Parameters:
X – Feature array.
y – Target array.
- Returns:
Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value
lower_bound: Computed lower bound
upper_bound: Computed upper bound
center: Central value (mean/median)
scale: Scale measure (std/IQR/MAD)
y_range: (min, max) of input y values
- get_mask(X: ndarray, y: ndarray | None = None) ndarray[source]
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features). Not used but required for API consistency.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.
- Returns:
- Boolean array of shape (n_samples,) where:
True means KEEP the sample (within bounds)
False means EXCLUDE the sample (outside bounds)
- Return type:
np.ndarray
- Raises:
ValueError – If y is None.
ValueError – If filter has not been fitted (bounds not set).