nirs4all.operators.filters package

Module contents

Sample filtering operators for nirs4all.

This module provides operators for filtering (excluding) samples from training datasets. Filters are non-destructive: they mark samples as excluded in the indexer rather than removing data.

Classes:

SampleFilter: Base class for all sample filtering operators
CompositeFilter: Combine multiple filters with AND/OR logic
YOutlierFilter: Filter samples with outlier target values (IQR, zscore, percentile, MAD)
XOutlierFilter: Filter samples with outlier spectral features (Mahalanobis, PCA, LOF, etc.)
SpectralQualityFilter: Filter samples with poor spectral quality (NaN, zeros, variance)
HighLeverageFilter: Filter high-leverage samples that may unduly influence models
MetadataFilter: Filter samples based on metadata column values
FilteringReport: Comprehensive report of sample filtering operations
FilteringReportGenerator: Generator for creating filtering reports

Example

>>> from nirs4all.operators.filters import YOutlierFilter, XOutlierFilter
>>>
>>> # In a pipeline
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [
...                 YOutlierFilter(method="iqr", threshold=1.5),
...                 XOutlierFilter(method="mahalanobis", threshold=3.0),
...             ],
...             "report": True,
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

class nirs4all.operators.filters.CompositeFilter(filters: List[SampleFilter] | None = None, mode: str = 'any', reason: str | None = None)[source]

Bases: SampleFilter

Combine multiple filters with AND/OR logic.

This filter aggregates the results of multiple sub-filters using either “any” or “all” mode:

- “any” (default): Exclude if ANY filter flags the sample
- “all”: Exclude only if ALL filters flag the sample

filters

List of filter instances to combine

Type:: List[SampleFilter]

mode

Combination mode - “any” or “all”

Type:: str

Example

>>> from nirs4all.operators.filters import YOutlierFilter, CompositeFilter
>>>
>>> # Exclude if either filter flags
>>> combined = CompositeFilter(
...     filters=[
...         YOutlierFilter(method="iqr", threshold=1.5),
...         YOutlierFilter(method="zscore", threshold=3.0),
...     ],
...     mode="any"
... )

property exclusion_reason: str: Get combined exclusion reason from all filters.

fit(X: ndarray, y: ndarray | None = None) → CompositeFilter[source]

Fit all sub-filters to the training data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

The fitted composite filter.

Return type:

self

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics including per-filter breakdown.

Parameters:

X – Feature array.
y – Target array.

Returns:

Dict with overall stats and per-filter breakdown.

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute combined mask from all sub-filters.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Boolean array where True = keep, False = exclude:

For “any” mode: keep if ALL filters say keep
For “all” mode: keep if ANY filter says keep

Return type:

np.ndarray
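The relationship between the exclusion mode and the keep-mask can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: `combine_keep_masks` is a hypothetical helper, and each input is assumed to be a keep-mask (True = keep) of the kind returned by `get_mask()`.

```python
import numpy as np

def combine_keep_masks(keep_masks, mode="any"):
    """Combine per-filter keep-masks (True = keep) into one keep-mask.

    Hypothetical helper for illustration only.
    """
    stacked = np.vstack(keep_masks)  # shape (n_filters, n_samples)
    if mode == "any":
        # Exclude if ANY filter flags the sample, so keep only if ALL filters keep it.
        return stacked.all(axis=0)
    if mode == "all":
        # Exclude only if ALL filters flag the sample, so keep if ANY filter keeps it.
        return stacked.any(axis=0)
    raise ValueError(f"Unknown mode: {mode!r}")

mask_a = np.array([True, False, True, False])
mask_b = np.array([True, True, False, False])

keep_any = combine_keep_masks([mask_a, mask_b], mode="any")
keep_all = combine_keep_masks([mask_a, mask_b], mode="all")
```

In “any” mode only the first sample survives both filters, while in “all” mode only the last sample, flagged by both filters, is excluded.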

class nirs4all.operators.filters.FilterResult(filter_name: str, reason: str, n_samples: int, n_excluded: int, n_kept: int, exclusion_rate: float, excluded_indices: List[int] = <factory>, stats: Dict[str, Any] = <factory>)[source]

Bases: object

Result of applying a single filter.

filter_name

Name/identifier of the filter

Type:: str

reason

Exclusion reason string

Type:: str

n_samples

Total samples evaluated

Type:: int

n_excluded

Number of samples excluded by this filter

Type:: int

n_kept

Number of samples kept

Type:: int

exclusion_rate

Ratio of excluded to total

Type:: float

excluded_indices

Indices of excluded samples

Type:: List[int]

stats

Additional filter-specific statistics

Type:: Dict[str, Any]

excluded_indices: List[int]

exclusion_rate: float

filter_name: str

n_excluded: int

n_kept: int

n_samples: int

reason: str

stats: Dict[str, Any]

to_dict() → Dict[str, Any][source]: Convert to dictionary representation.
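To make the relationship between these fields concrete, the sketch below builds a stand-in dataclass with the same field names. `MiniFilterResult` is hypothetical and only mirrors the documented attributes; the `reason` value is illustrative, and the real `to_dict()` may produce a different layout.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

@dataclass
class MiniFilterResult:
    """Hypothetical stand-in mirroring the documented FilterResult fields."""
    filter_name: str
    reason: str
    n_samples: int
    n_excluded: int
    n_kept: int
    exclusion_rate: float
    excluded_indices: List[int] = field(default_factory=list)
    stats: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # asdict() recursively converts the dataclass to plain Python containers
        return asdict(self)

n_samples = 50
excluded = [3, 17]

result = MiniFilterResult(
    filter_name="YOutlierFilter",
    reason="y_outlier",  # illustrative value, not taken from the library
    n_samples=n_samples,
    n_excluded=len(excluded),
    n_kept=n_samples - len(excluded),
    exclusion_rate=len(excluded) / n_samples,
    excluded_indices=excluded,
)

payload = json.dumps(result.to_dict(), indent=2)
```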

class nirs4all.operators.filters.FilteringReport(dataset_name: str, partition: str, timestamp: str = <factory>, filter_results: List[FilterResult] = <factory>, combined_mode: str = 'any', n_total_samples: int = 0, n_final_excluded: int = 0, n_final_kept: int = 0, cascade_to_augmented: bool = True, n_augmented_excluded: int = 0)[source]

Bases: object

Comprehensive report of sample filtering operations.

This class aggregates results from multiple filters and provides methods for analysis, visualization, and export.

dataset_name

Name of the filtered dataset

Type:: str

partition

Partition that was filtered (e.g., “train”)

Type:: str

timestamp

When the filtering was performed

Type:: str

filter_results

List of individual filter results

Type:: List[nirs4all.operators.filters.report.FilterResult]

combined_mode

How filters were combined (“any” or “all”)

Type:: str

n_total_samples

Total samples before filtering

Type:: int

n_final_excluded

Final number of excluded samples

Type:: int

n_final_kept

Final number of kept samples

Type:: int

cascade_to_augmented

Whether augmented samples were also excluded

Type:: bool

n_augmented_excluded

Number of augmented samples excluded via cascade

Type:: int

add_filter_result(result: FilterResult) → None[source]: Add a filter result to the report.

cascade_to_augmented: bool = True

combined_mode: str = 'any'

dataset_name: str

filter_results: List[FilterResult]

property final_exclusion_rate: float: Calculate final exclusion rate after combining filters.

n_augmented_excluded: int = 0

n_final_excluded: int = 0

n_final_kept: int = 0

n_total_samples: int = 0

partition: str

print_report(verbose: int = 1) → None[source]

Print the filtering report to console.

Parameters:: verbose – Verbosity level (0=minimal, 1=normal, 2=detailed)

summary() → Dict[str, Any][source]

Get a summary dictionary of the filtering report.

Returns:: Dict containing summary statistics

timestamp: str

to_dict() → Dict[str, Any][source]: Convert the full report to a dictionary.

to_json(indent: int = 2) → str[source]

Convert report to JSON string.

Parameters:: indent – JSON indentation level
Returns:: JSON string representation

class nirs4all.operators.filters.FilteringReportGenerator(dataset: SpectroDataset)[source]

Bases: object

Generator for creating comprehensive filtering reports.

This class provides utilities for collecting filter statistics, generating reports, and exporting results.

Example

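>>> from nirs4all.operators.filters import FilteringReportGenerator, YOutlierFilter
>>>
>>> generator = FilteringReportGenerator(dataset)
>>> # X, y and sample_indices are required by create_report();
>>> # X_train, y_train and train_indices are placeholders for your own arrays
>>> report = generator.create_report(
...     filters=[YOutlierFilter(method="iqr")],
...     X=X_train,
...     y=y_train,
...     sample_indices=train_indices,
...     mode="any",
...     partition="train"
... )
>>> report.print_report()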

compare_filters(filters: List[SampleFilter], X: ndarray, y: ndarray) → Dict[str, Any][source]

Compare multiple filters on the same data without applying them.

Useful for understanding which filter is more aggressive or to find the overlap between filter decisions.

Parameters:

filters – List of filters to compare
X – Feature array
y – Target array

Returns:

Dictionary with comparison statistics:

individual: Per-filter stats
overlap: Samples flagged by multiple filters
unique: Samples flagged by only one filter

Return type:

Dict[str, Any]
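A comparison of this kind can be sketched directly from the keep-masks of each filter. The helper `compare_keep_masks` below is hypothetical; its keys follow the names listed above, but the exact structure returned by `compare_filters()` is an assumption.

```python
import numpy as np

def compare_keep_masks(keep_masks):
    """Summarise agreement between filters.

    keep_masks: dict mapping a filter name to its keep-mask (True = keep).
    Hypothetical helper for illustration only.
    """
    names = list(keep_masks)
    # Invert each keep-mask so that True means "flagged for exclusion"
    flagged = np.vstack([~np.asarray(keep_masks[name], dtype=bool) for name in names])
    n_flags = flagged.sum(axis=0)  # number of filters flagging each sample
    return {
        "individual": {name: int(flagged[i].sum()) for i, name in enumerate(names)},
        "overlap": np.flatnonzero(n_flags >= 2).tolist(),
        "unique": np.flatnonzero(n_flags == 1).tolist(),
    }

comparison = compare_keep_masks({
    "iqr": np.array([True, False, False, True, True]),
    "zscore": np.array([True, True, False, False, True]),
})
```

Here sample 2 is flagged by both filters, while samples 1 and 3 are each flagged by only one.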

create_report(filters: List[SampleFilter], X: ndarray, y: ndarray, sample_indices: ndarray, mode: str = 'any', partition: str = 'train', cascade_to_augmented: bool = True, dry_run: bool = True) → FilteringReport[source]

Create a filtering report by applying filters to data.

Parameters:

filters – List of SampleFilter instances to apply
X – Feature array (n_samples, n_features)
y – Target array (n_samples,) or (n_samples, n_targets)
sample_indices – Array of sample indices corresponding to X/y
mode – Filter combination mode (“any” or “all”)
partition – Which partition is being filtered
cascade_to_augmented – Whether augmented samples will be cascaded
dry_run – If True, don’t actually mark samples as excluded

Returns:

FilteringReport with all statistics and results

generate_from_indexer(partition: str | None = 'train') → FilteringReport[source]

Generate a report from current indexer exclusion state.

This method creates a report based on samples already marked as excluded in the indexer, rather than applying filters.

Parameters:: partition – Partition to report on (None for all partitions)
Returns:: FilteringReport based on current exclusion state

class nirs4all.operators.filters.HighLeverageFilter(method: Literal['hat', 'pca'] = 'hat', threshold_multiplier: float = 2.0, absolute_threshold: float | None = None, n_components: int | None = None, center: bool = True, reason: str | None = None)[source]

Bases: SampleFilter

Filter high-leverage samples that may unduly influence the model.

High-leverage points are samples that are far from the center of the predictor space and can have a disproportionate effect on regression models. This filter identifies and excludes such samples.

The leverage of a sample is computed from the hat matrix H = X(X’X)^(-1)X’. The diagonal elements h_ii represent the leverage of each sample.

Supported methods:

- “hat”: Direct hat matrix diagonal computation
- “pca”: PCA-based leverage (for high-dimensional data)

Common threshold guidelines:

- 2 * p / n (where p = number of parameters, n = samples)
- 3 * average leverage
- Absolute threshold (e.g., 0.5)
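The hat-matrix diagonal and a multiplier-based threshold can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the filter's actual code: `hat_leverages` is a hypothetical helper, the pseudo-inverse is used so the computation also works when X'X is singular, and the inclusive comparison against the threshold is a guess.

```python
import numpy as np

def hat_leverages(X, center=True):
    """Return the diagonal of the hat matrix H = X (X'X)^(-1) X'.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    # diag(X @ pinv(X)) equals diag(H) and stays defined for rank-deficient X
    return np.einsum("ij,ji->i", X, np.linalg.pinv(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X[0] = [10.0, 10.0, 10.0]  # one sample far from the center of the predictor space

leverages = hat_leverages(X)
threshold = 2.0 * leverages.mean()  # multiplier of the average leverage
keep = leverages <= threshold       # True = keep (low leverage)
```

Because the leverages sum to the rank of the centered matrix (3 here), the average leverage is 3 / 30 = 0.1, so a multiplier of 2.0 gives a threshold of 0.2. The distant first sample lies well above it.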

method

Leverage computation method

Type:: str

threshold_multiplier

Multiple of average leverage to use as threshold

Type:: float

absolute_threshold

Absolute threshold (overrides multiplier if set)

Type:: float

Example

>>> from nirs4all.operators.filters import HighLeverageFilter
>>>
>>> # Using multiplier of average leverage (default)
>>> filter_obj = HighLeverageFilter(threshold_multiplier=2.0)
>>>
>>> # Using absolute threshold
>>> filter_abs = HighLeverageFilter(absolute_threshold=0.5)
>>>
>>> # PCA-based for high-dimensional data
>>> filter_pca = HighLeverageFilter(method="pca", n_components=10)
>>>
>>> # Fit and get mask
>>> filter_obj.fit(X_train)
>>> mask = filter_obj.get_mask(X_train)  # True = keep

In Pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [HighLeverageFilter(threshold_multiplier=2.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str[source]: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → HighLeverageFilter[source]

Compute leverage statistics from training data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used for leverage computation).

Returns:

The fitted filter instance.

Return type:

self

Raises:

ValueError – If X has insufficient samples.

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application.

Parameters:

X – Feature array.
y – Target array (unused).

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Leverage computation method
threshold: Computed threshold
n_effective_features: Number of features/components used
leverage_stats: Statistics on leverage values

Return type:

Dict[str, Any]

get_leverages(X: ndarray) → ndarray[source]

Compute leverage values for samples.

This method returns the raw leverage values for inspection or custom thresholding.

Parameters:: X – Feature array of shape (n_samples, n_features).
Returns:: Array of leverage values for each sample.
Return type:: np.ndarray
Raises:: ValueError – If filter has not been fitted.

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used).

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample (low leverage)
False means EXCLUDE the sample (high leverage)

Return type:

np.ndarray

Raises:

ValueError – If filter has not been fitted.

class nirs4all.operators.filters.MetadataFilter(column: str, condition: Callable[[Any], bool] | None = None, values_to_exclude: List[Any] | None = None, values_to_keep: List[Any] | None = None, exclude_missing: bool = True, reason: str | None = None)[source]

Bases: SampleFilter

Filter samples based on metadata column values.

This filter allows excluding samples based on external metadata (not X or y) using custom condition functions. It’s useful for filtering based on:

- Sample quality flags
- Acquisition conditions
- Sample categories to exclude
- Date/time-based filtering
- Any other metadata criteria

The filter works with metadata passed during get_mask() call, as metadata is not part of the standard sklearn X, y interface.

column

Metadata column name to filter on

Type:: str

condition

Function returning True for samples to KEEP

Type:: Callable

values_to_exclude

List of values that should be excluded

Type:: List

values_to_keep

List of values that should be kept

Type:: List

Example

>>> from nirs4all.operators.filters import MetadataFilter
>>>
>>> # Exclude specific values
>>> filter_obj = MetadataFilter(
...     column="quality_flag",
...     values_to_exclude=["bad", "corrupted"]
... )
>>>
>>> # Keep only specific values
>>> filter_keep = MetadataFilter(
...     column="sample_type",
...     values_to_keep=["control", "treatment"]
... )
>>>
>>> # Custom condition
>>> filter_custom = MetadataFilter(
...     column="temperature",
...     condition=lambda x: 20 <= x <= 30  # Keep 20-30°C
... )
>>>
>>> # Get mask (metadata must be provided)
>>> mask = filter_obj.get_mask(X, metadata=metadata_df)

In Pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [
...                 MetadataFilter(
...                     column="quality",
...                     values_to_exclude=["bad"]
...                 )
...             ],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str[source]: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → MetadataFilter[source]

Fit the filter (no-op for metadata filter).

Metadata filtering uses fixed criteria, so no fitting is required.

Parameters:

X – Feature array (not used).
y – Target array (not used).

Returns:

The filter instance (unchanged).

Return type:

self

get_filter_stats(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | None = None) → Dict[str, Any][source]

Get statistics about filter application.

Parameters:

X – Feature array.
y – Target array (unused).
metadata – Metadata dictionary.

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
column: Filtered column name
filtering_type: Type of filtering applied
value_counts: Count of unique values (if available)

Return type:

Dict[str, Any]

get_mask(X: ndarray, y: ndarray | None = None, metadata: Dict[str, ndarray] | Any | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features). Used only to determine number of samples if metadata is not provided.
y – Target array (not used).
metadata – Metadata dictionary, DataFrame, or object with column access. Must contain the specified column. Can be:

- Dict[str, np.ndarray]: metadata[column] returns array
- pd.DataFrame: metadata[column] returns series
- Any object with __getitem__ that returns array-like

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample
False means EXCLUDE the sample

Return type:

np.ndarray

Raises:

ValueError – If metadata is None and filtering requires it.
KeyError – If the specified column is not in metadata.
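How a keep-mask might be derived from a metadata column can be sketched as follows. `metadata_keep_mask` is a hypothetical helper that mimics the documented parameters (`values_to_exclude`, `values_to_keep`, `condition`). It does not model `exclude_missing`, and how the library combines several criteria is an assumption.

```python
import numpy as np

def metadata_keep_mask(metadata, column, values_to_exclude=None,
                       values_to_keep=None, condition=None):
    """Build a keep-mask (True = keep) from one metadata column.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(metadata[column])  # raises KeyError if the column is missing
    keep = np.ones(len(values), dtype=bool)
    if values_to_exclude is not None:
        keep &= ~np.isin(values, values_to_exclude)
    if values_to_keep is not None:
        keep &= np.isin(values, values_to_keep)
    if condition is not None:
        # condition returns True for samples to KEEP
        keep &= np.array([bool(condition(v)) for v in values])
    return keep

metadata = {
    "quality_flag": np.array(["ok", "bad", "ok", "corrupted"]),
    "temperature": np.array([18.5, 22.0, 25.0, 31.0]),
}

mask_flag = metadata_keep_mask(
    metadata, "quality_flag", values_to_exclude=["bad", "corrupted"]
)
mask_temp = metadata_keep_mask(
    metadata, "temperature", condition=lambda t: 20 <= t <= 30
)
```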

class nirs4all.operators.filters.SampleFilter(reason: str | None = None)[source]

Bases: TransformerMixin, BaseEstimator, ABC

Base class for sample filtering operators.

Sample filters identify samples that should be excluded from training datasets. Unlike transformers that modify data, filters mark samples for exclusion without altering the underlying data.

The filtering pattern works as follows:

1. fit(): Learn filter criteria from training data (e.g., compute thresholds)
2. get_mask(): Return boolean mask indicating which samples to KEEP
3. transform(): No-op (filtering happens at indexer level, not data level)

All concrete filter implementations must override the get_mask() method.

reason

Identifier for this filter type, used to track exclusion reasons in the indexer. Default is the class name.

Type:: str

Example

>>> import numpy as np
>>> from nirs4all.operators.filters import SampleFilter
>>>
>>> class MyFilter(SampleFilter):
...     def __init__(self, threshold: float = 1.0):
...         super().__init__()
...         self.threshold = threshold
...
...     def fit(self, X, y=None):
...         # Learn the center and spread of y from the training data
...         self.mean_ = np.mean(y)
...         self.std_ = np.std(y)
...         return self
...
...     def get_mask(self, X, y=None) -> np.ndarray:
...         z_scores = np.abs((y - self.mean_) / self.std_)
...         return z_scores <= self.threshold  # True = keep

property exclusion_reason: str

Get the exclusion reason identifier for this filter.

Returns:: Reason string to be stored in indexer’s exclusion_reason column.
Return type:: str

fit(X: ndarray, y: ndarray | None = None) → SampleFilter[source]

Compute filter criteria from training data.

This method should learn any thresholds, statistics, or models needed to identify outliers/bad samples. Override in subclasses for filters that need to learn from data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.

Returns:

The fitted filter instance.

Return type:

self

fit_transform(X: ndarray, y: ndarray | None = None, **fit_params) → ndarray[source]

Fit to data and return unchanged (transform is no-op).

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
**fit_params – Additional fitting parameters (unused).

Returns:

The unchanged input array.

Return type:

np.ndarray

get_excluded_indices(X: ndarray, y: ndarray | None = None) → ndarray[source]

Get indices of samples to be excluded.

Convenience method that inverts get_mask() to return indices of samples marked for exclusion.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Integer array of indices for samples to exclude.

Return type:

np.ndarray

Example

>>> y_filter = YOutlierFilter(method="iqr")  # avoid shadowing the built-in filter()
>>> y_filter.fit(X_train, y_train)
>>> excluded_idx = y_filter.get_excluded_indices(X_train, y_train)
>>> print(f"Excluding {len(excluded_idx)} samples")

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application.

Override in subclasses to provide filter-specific statistics (e.g., thresholds used, distribution of values, etc.).

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Dictionary containing filter statistics:

n_samples: Total number of samples
n_excluded: Number of samples to exclude
n_kept: Number of samples to keep
exclusion_rate: Ratio of excluded to total
reason: Exclusion reason string

Return type:

Dict[str, Any]
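The base statistics listed above all follow from a keep-mask. A minimal sketch, assuming a hypothetical helper `base_filter_stats` and an illustrative reason string:

```python
import numpy as np

def base_filter_stats(keep_mask, reason):
    """Derive the base statistics from a keep-mask (True = keep).

    Hypothetical helper for illustration only.
    """
    keep_mask = np.asarray(keep_mask, dtype=bool)
    n_samples = int(keep_mask.size)
    n_kept = int(keep_mask.sum())
    n_excluded = n_samples - n_kept
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        # Guard against division by zero on empty input
        "exclusion_rate": n_excluded / n_samples if n_samples else 0.0,
        "reason": reason,
    }

stats = base_filter_stats([True, True, False, True], reason="MyFilter")
```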

get_kept_indices(X: ndarray, y: ndarray | None = None) → ndarray[source]

Get indices of samples to be kept.

Convenience method that returns indices of samples NOT marked for exclusion.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Integer array of indices for samples to keep.

Return type:

np.ndarray
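Both index helpers can be understood as views of the same keep-mask. The snippet below shows that relationship with NumPy only; whether the library uses `np.flatnonzero` internally is an assumption.

```python
import numpy as np

keep_mask = np.array([True, False, True, True, False])  # True = keep

kept_idx = np.flatnonzero(keep_mask)       # indices where the mask is True
excluded_idx = np.flatnonzero(~keep_mask)  # indices where the mask is False
```

Together the two arrays partition the sample indices, so their lengths always add up to the number of samples.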

abstractmethod get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

This is the core method that must be implemented by all concrete filters. Returns True for samples that should be kept, False for samples to exclude.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample
False means EXCLUDE the sample

Return type:

np.ndarray

Raises:

NotImplementedError – If the subclass doesn’t implement this method.

transform(X: ndarray) → ndarray[source]

Transform is a no-op for filters.

Filtering happens at the indexer level, not by modifying the data array. This method returns the input unchanged to maintain sklearn compatibility.

Parameters:: X – Feature array of shape (n_samples, n_features).
Returns:: The unchanged input array.
Return type:: np.ndarray

class nirs4all.operators.filters.SpectralQualityFilter(max_nan_ratio: float = 0.1, max_zero_ratio: float = 0.5, min_variance: float = 1e-08, max_value: float | None = None, min_value: float | None = None, check_inf: bool = True, reason: str | None = None)[source]

Bases: SampleFilter

Filter samples with poor spectral quality.

This filter identifies samples whose spectra exhibit quality issues such as:

- High proportion of NaN or missing values
- High proportion of zero values (potentially corrupted)
- Very low variance (flat or constant spectra)
- Values outside expected range (saturation)

max_nan_ratio

Maximum allowed NaN ratio per spectrum

Type:: float

max_zero_ratio

Maximum allowed zero ratio

Type:: float

min_variance

Minimum variance threshold

Type:: float

max_value

Maximum allowed value (saturation detection)

Type:: float

min_value

Minimum allowed value

Type:: float

Example

>>> from nirs4all.operators.filters import SpectralQualityFilter
>>>
>>> # Default quality checks
>>> filter_obj = SpectralQualityFilter()
>>>
>>> # Strict quality requirements
>>> filter_strict = SpectralQualityFilter(
...     max_nan_ratio=0.01,
...     max_zero_ratio=0.1,
...     min_variance=1e-4
... )
>>>
>>> # Check for saturated spectra
>>> filter_sat = SpectralQualityFilter(max_value=4.0, min_value=-0.5)
>>>
>>> # Get mask
>>> mask = filter_obj.get_mask(X_train)  # True = keep

In Pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [SpectralQualityFilter(max_nan_ratio=0.05)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str[source]: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → SpectralQualityFilter[source]

Fit the filter (no-op for quality filter as thresholds are fixed).

The SpectralQualityFilter uses fixed thresholds set at initialization, so no fitting is required. This method is provided for API consistency.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used).

Returns:

The filter instance (unchanged).

Return type:

self

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application including quality breakdown.

Parameters:

X – Feature array.
y – Target array (unused).

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
Quality thresholds
Per-check failure counts
Quality metric distributions

Return type:

Dict[str, Any]

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP based on quality.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used for X-based quality checks).

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample (passes quality checks)
False means EXCLUDE the sample (fails quality checks)

Return type:

np.ndarray

get_quality_breakdown(X: ndarray, y: ndarray | None = None) → Dict[str, ndarray][source]

Get detailed breakdown of which quality checks each sample fails.

This method provides per-check masks to understand why specific samples were excluded.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used).

Returns:

Dict with boolean arrays for each quality check:

“passes_nan”: True if NaN ratio is acceptable
“passes_inf”: True if no Inf values
“passes_zero”: True if zero ratio is acceptable
“passes_variance”: True if variance is sufficient
“passes_max_value”: True if max value is within limit
“passes_min_value”: True if min value is within limit
“passes_all”: True if passes all checks

Return type:

Dict[str, ndarray]
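A per-check breakdown of this kind can be sketched with row-wise NumPy reductions. The helper `quality_breakdown` is hypothetical and covers only the NaN, zero and variance checks; the inclusive comparisons and the use of `np.nanvar` are assumptions rather than documented behavior.

```python
import numpy as np

def quality_breakdown(X, max_nan_ratio=0.1, max_zero_ratio=0.5, min_variance=1e-8):
    """Return one boolean array per quality check (True = passes).

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    nan_ratio = np.isnan(X).mean(axis=1)   # fraction of NaN values per spectrum
    zero_ratio = (X == 0).mean(axis=1)     # fraction of exact zeros per spectrum
    variance = np.nanvar(X, axis=1)        # variance ignoring NaN values
    checks = {
        "passes_nan": nan_ratio <= max_nan_ratio,
        "passes_zero": zero_ratio <= max_zero_ratio,
        "passes_variance": variance >= min_variance,
    }
    checks["passes_all"] = np.logical_and.reduce(list(checks.values()))
    return checks

X = np.array([
    [0.1, 0.2, 0.3, 0.4],        # clean spectrum
    [np.nan, np.nan, 0.3, 0.4],  # half of the values are missing
    [0.0, 0.0, 0.0, 0.5],        # mostly zeros
    [1.0, 1.0, 1.0, 1.0],        # flat spectrum
])

breakdown = quality_breakdown(X)
```

Only the first spectrum passes every check; each of the others fails exactly one, which is what makes the per-check masks useful for diagnosing exclusions.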

class nirs4all.operators.filters.XOutlierFilter(method: Literal['mahalanobis', 'robust_mahalanobis', 'pca_residual', 'pca_leverage', 'isolation_forest', 'lof'] = 'mahalanobis', threshold: float | None = None, n_components: int | None = None, contamination: float = 0.1, random_state: int | None = None, support_fraction: float | None = None, reason: str | None = None)[source]

Bases: SampleFilter

Filter samples with outlier spectral features.

This filter identifies samples whose X-values (spectra) are statistical outliers using various detection methods. It’s commonly used to remove samples with corrupted, unusual, or non-representative spectra.

Supported methods:

- “mahalanobis”: Mahalanobis distance from center (default)
- “robust_mahalanobis”: Robust Mahalanobis using MinCovDet (resistant to outliers)
- “pca_residual”: Q-statistic (residual) from PCA reconstruction
- “pca_leverage”: T² (Hotelling’s T-squared) in PCA score space
- “isolation_forest”: Isolation Forest anomaly detection
- “lof”: Local Outlier Factor
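As an illustration of the default method, the sketch below computes Mahalanobis distances with NumPy. The helper names are hypothetical, the pseudo-inverse of the covariance is an assumption made to tolerate singular matrices, and treating the threshold as a cutoff on the distance itself is also an assumption.

```python
import numpy as np

def fit_mahalanobis(X):
    """Estimate the center and inverse covariance from training data."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Pseudo-inverse guards against a singular covariance matrix
    return center, np.linalg.pinv(cov)

def mahalanobis_distances(X, center, inv_cov):
    """Distance of each row of X from the fitted center."""
    diff = np.asarray(X, dtype=float) - center
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.clip(d2, 0.0, None))  # clip tiny negative rounding errors

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 2))
center, inv_cov = fit_mahalanobis(X_train)

X_new = np.array([[8.0, 8.0], [0.0, 0.0]])
distances = mahalanobis_distances(X_new, center, inv_cov)
keep = distances <= 3.0  # True = keep
```

The first new sample lies far outside the training cloud and is flagged, while the second sits near the center and is kept.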

method

Outlier detection method

Type:: str

threshold

Detection threshold (method-specific)

Type:: float

n_components

Number of PCA components for PCA-based methods

Type:: int

contamination

Expected proportion of outliers for sklearn methods

Type:: float

Example

>>> from nirs4all.operators.filters import XOutlierFilter
>>>
>>> # Mahalanobis distance (default)
>>> filter_maha = XOutlierFilter(method="mahalanobis", threshold=3.0)
>>>
>>> # Robust Mahalanobis (better with outliers in training data)
>>> filter_robust = XOutlierFilter(method="robust_mahalanobis", threshold=3.0)
>>>
>>> # PCA-based residual (Q-statistic)
>>> filter_pca = XOutlierFilter(method="pca_residual", n_components=10)
>>>
>>> # Fit and get mask
>>> filter_maha.fit(X_train)
>>> mask = filter_maha.get_mask(X_train)  # True = keep

In Pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [XOutlierFilter(method="mahalanobis", threshold=3.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str[source]: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → XOutlierFilter[source]

Compute outlier detection model from training data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used for X-based filtering, but kept for API consistency).

Returns:

The fitted filter instance.

Return type:

self

Raises:

ValueError – If X has insufficient samples for the chosen method.
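For the PCA-based methods, fitting might amount to estimating a low-rank subspace and then scoring samples by their residual (Q) and their Hotelling T² in score space. The sketch below uses an SVD in NumPy; `fit_pca` and `q_and_t2` are hypothetical helpers and the library's actual computation may differ.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a PCA subspace via SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]                    # (k, n_features)
    score_var = s[:n_components] ** 2 / (len(X) - 1)  # variance of each score
    return mean, components, score_var

def q_and_t2(X, mean, components, score_var):
    """Return the Q residual and Hotelling T² for each row of X."""
    Xc = np.asarray(X, dtype=float) - mean
    scores = Xc @ components.T
    residual = Xc - scores @ components  # part not explained by the subspace
    q = np.sum(residual ** 2, axis=1)
    t2 = np.sum(scores ** 2 / score_var, axis=1)
    return q, t2

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 6))  # exactly rank 2

mean, components, score_var = fit_pca(X_train, n_components=2)
q_train, t2_train = q_and_t2(X_train, mean, components, score_var)

# Build a direction orthogonal to the fitted subspace and push one sample along it
direction = rng.normal(size=6)
direction -= components.T @ (components @ direction)
direction /= np.linalg.norm(direction)
x_off = X_train[:1] + 3.0 * direction
q_off, _ = q_and_t2(x_off, mean, components, score_var)
```

Training samples lie in the fitted subspace, so their Q residual is numerically zero, whereas the shifted sample has a Q of about 9 (the squared length of the off-subspace shift).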

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application.

Parameters:

X – Feature array.
y – Target array (unused).

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value (if applicable)
n_components: PCA components (if applicable)
distance_stats: Statistics on computed distances/scores

Return type:

Dict[str, Any]

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array (not used, kept for API consistency).

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample (not an outlier)
False means EXCLUDE the sample (outlier detected)

Return type:

np.ndarray

Raises:

ValueError – If filter has not been fitted.

class nirs4all.operators.filters.YOutlierFilter(method: Literal['iqr', 'zscore', 'percentile', 'mad'] = 'iqr', threshold: float = 1.5, lower_percentile: float = 1.0, upper_percentile: float = 99.0, reason: str | None = None)[source]

Bases: SampleFilter

Filter samples with outlier target values.

This filter identifies samples whose y-values are statistical outliers using one of several detection methods. It’s commonly used to remove samples with extreme or erroneous target values before training.

Supported methods:

- “iqr”: Interquartile Range method (default)
- “zscore”: Z-score (standard deviations from mean)
- “percentile”: Direct percentile cutoffs
- “mad”: Median Absolute Deviation (robust to outliers)
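The default IQR rule can be sketched with NumPy percentiles. This is an illustration of the standard Tukey fences, assuming the threshold multiplies the interquartile range; `iqr_bounds` is a hypothetical helper, and the library may compute its quartiles differently.

```python
import numpy as np

def iqr_bounds(y, threshold=1.5):
    """Return (lower, upper) bounds as Q1 - t * IQR and Q3 + t * IQR."""
    y = np.asarray(y, dtype=float)
    q1, q3 = np.nanpercentile(y, [25, 75])  # ignore NaN targets
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

y_train = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 100], dtype=float)

lower, upper = iqr_bounds(y_train)  # lower = 5.5, upper = 23.5
keep = (y_train >= lower) & (y_train <= upper)  # True = keep
```

With Q1 = 12.25 and Q3 = 16.75 the IQR is 4.5, so only the value 100 falls outside the bounds and is excluded.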

method

Outlier detection method

Type:: str

threshold

Method-specific threshold

Type:: float

lower_percentile

Lower cutoff for percentile method

Type:: float

upper_percentile

Upper cutoff for percentile method

Type:: float

Example

>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> # IQR method (default, threshold=1.5 is standard)
>>> filter_iqr = YOutlierFilter(method="iqr", threshold=1.5)
>>>
>>> # Z-score method (threshold=3.0 is common)
>>> filter_zscore = YOutlierFilter(method="zscore", threshold=3.0)
>>>
>>> # Percentile method
>>> filter_pct = YOutlierFilter(
...     method="percentile",
...     lower_percentile=1.0,
...     upper_percentile=99.0
... )
>>>
>>> # Fit and get mask
>>> filter_iqr.fit(X_train, y_train)
>>> mask = filter_iqr.get_mask(X_train, y_train)  # True = keep

In Pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [YOutlierFilter(method="iqr", threshold=1.5)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str[source]: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → YOutlierFilter[source]

Compute outlier detection bounds from training data.

Parameters:

X – Feature array of shape (n_samples, n_features). Not used but required for sklearn compatibility.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.

Returns:

The fitted filter instance.

Return type:

self

Raises:

ValueError – If y is None (required for Y-based filtering).
ValueError – If y has no valid (non-NaN) values.

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application including method-specific details.

Parameters:

X – Feature array.
y – Target array.

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value
lower_bound: Computed lower bound
upper_bound: Computed upper bound
center: Central value (mean/median)
scale: Scale measure (std/IQR/MAD)
y_range: (min, max) of input y values

Return type:

Dict[str, Any]

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features). Not used but required for API consistency.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample (within bounds)
False means EXCLUDE the sample (outside bounds)

Return type:

np.ndarray

Raises:

ValueError – If y is None.
ValueError – If filter has not been fitted (bounds not set).