nirs4all.operators.filters.high_leverage module

High leverage filter for sample filtering.

This module provides the HighLeverageFilter class for detecting and excluding samples that have high leverage (influence) on model fitting.

class nirs4all.operators.filters.high_leverage.HighLeverageFilter(method: Literal['hat', 'pca'] = 'hat', threshold_multiplier: float = 2.0, absolute_threshold: float | None = None, n_components: int | None = None, center: bool = True, reason: str | None = None)

Bases: SampleFilter

Filter high-leverage samples that may unduly influence the model.

High-leverage points are samples that are far from the center of the predictor space and can have a disproportionate effect on regression models. This filter identifies and excludes such samples.

The leverage of a sample is computed from the hat matrix H = X (X^T X)^(-1) X^T. The diagonal elements h_ii give the leverage of each sample.
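The hat-matrix diagonal can be computed without forming the full n x n matrix. The following is a minimal NumPy sketch, not the library's implementation: the helper name hat_leverages is hypothetical, and both the column centering and the use of a pseudo-inverse are assumptions made for the illustration.

```python
import numpy as np

def hat_leverages(X, center=True):
    """Return the diagonal of H = X (X^T X)^(-1) X^T (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if center:
        # Assumption: centering means subtracting the column means.
        X = X - X.mean(axis=0)
    # The pseudo-inverse tolerates a rank-deficient X^T X.
    G = np.linalg.pinv(X.T @ X)
    # h_ii = x_i^T G x_i, computed row by row without building H.
    return np.einsum("ij,jk,ik->i", X, G, X)

# Four clustered samples and one sample far from the rest.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [20.0, 25.0]])
h = hat_leverages(X)
print(h)  # the last sample has the largest leverage
```

The leverages sum to the rank of the (centered) matrix, which is why the average leverage is p / n for a full-rank X with p columns.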

Supported methods:
  • "hat": Direct hat matrix diagonal computation

  • "pca": PCA-based leverage (for high-dimensional data)
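When there are more features than samples, the direct hat diagonal stops discriminating between samples, which is the usual motivation for a PCA-based variant. The sketch below shows one common definition, the leverage of the samples in the space of the first k principal component scores. It is an illustration under assumptions: the helper name pca_leverages is hypothetical, and the library's "pca" method may scale or select components differently.

```python
import numpy as np

def pca_leverages(X, n_components):
    """Leverage in the space of the first principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    # Keep at most n_components directions with non-negligible variance.
    k = min(n_components, int(np.sum(s > 1e-12 * s.max())))
    # Scores are T = U_k S_k, so T (T^T T)^(-1) T^T = U_k U_k^T.
    return np.sum(U[:, :k] ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))  # 8 samples, 200 features
h_pca = pca_leverages(X, n_components=3)
print(h_pca.sum())  # close to 3, the number of retained components
```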

Common threshold guidelines:
  • 2 * p / n (where p = number of parameters, n = number of samples)

  • 3 * average leverage

  • Absolute threshold (e.g., 0.5)
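These rules reduce to a single comparison once the leverages are known. Because the average leverage of a full-rank design is p / n, the 2 * p / n rule is the same as a multiplier of 2 on the average. The sketch below is illustrative only: keep_mask is a hypothetical helper, and treating a leverage equal to the threshold as "keep" is an assumption, not documented behavior.

```python
import numpy as np

def keep_mask(leverages, threshold_multiplier=2.0, absolute_threshold=None):
    """Return (mask, threshold); True in the mask means keep (hypothetical helper)."""
    h = np.asarray(leverages, dtype=float)
    if absolute_threshold is not None:
        # An absolute threshold takes precedence over the multiplier.
        threshold = float(absolute_threshold)
    else:
        threshold = threshold_multiplier * h.mean()
    # Assumption: samples exactly at the threshold are kept.
    return h <= threshold, threshold

h = np.array([0.10, 0.12, 0.08, 0.15, 0.55])
mask, thr = keep_mask(h)                               # threshold = 2 * mean
mask_abs, thr_abs = keep_mask(h, absolute_threshold=0.6)
print(mask, thr)
```

With the multiplier rule only the last sample is flagged here; with the looser absolute threshold of 0.6 every sample is kept.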

method

Leverage computation method

Type:

str

threshold_multiplier

Multiple of average leverage to use as threshold

Type:

float

absolute_threshold

Absolute threshold (overrides multiplier if set)

Type:

float | None

Example

>>> from nirs4all.operators.filters import HighLeverageFilter
>>>
>>> # Using multiplier of average leverage (default)
>>> filter_obj = HighLeverageFilter(threshold_multiplier=2.0)
>>>
>>> # Using absolute threshold
>>> filter_abs = HighLeverageFilter(absolute_threshold=0.5)
>>>
>>> # PCA-based for high-dimensional data
>>> filter_pca = HighLeverageFilter(method="pca", n_components=10)
>>>
>>> # Fit and get mask (X_train: array of shape (n_samples, n_features))
>>> filter_obj.fit(X_train)
>>> mask = filter_obj.get_mask(X_train)  # True = keep

In a pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [HighLeverageFilter(threshold_multiplier=2.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
__repr__() → str

Return string representation.

property exclusion_reason: str

Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → HighLeverageFilter

Compute leverage statistics from training data.

Parameters:
  • X – Feature array of shape (n_samples, n_features).

  • y – Target array (not used for leverage computation).

Returns:

The fitted filter instance.

Return type:

HighLeverageFilter

Raises:

ValueError – If X has insufficient samples.

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any]

Get statistics about filter application.

Parameters:
  • X – Feature array.

  • y – Target array (unused).

Returns:

Dict containing:
  • Base stats (n_samples, n_excluded, n_kept, exclusion_rate)

  • method: Leverage computation method

  • threshold: Computed threshold

  • n_effective_features: Number of features/components used

  • leverage_stats: Statistics on leverage values

Return type:

Dict[str, Any]
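The base statistics follow directly from a keep mask. The sketch below shows how such counts could be derived. It is not the library's code: base_stats is a hypothetical helper, and reporting exclusion_rate as a fraction between 0 and 1 (rather than a percentage) is an assumption.

```python
def base_stats(mask):
    """Derive count statistics from a keep mask (hypothetical helper)."""
    n_samples = len(mask)
    n_kept = sum(bool(m) for m in mask)
    n_excluded = n_samples - n_kept
    # Assumption: the rate is a fraction in [0, 1].
    rate = n_excluded / n_samples if n_samples else 0.0
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        "exclusion_rate": rate,
    }

stats = base_stats([True, True, False, True])
print(stats)
```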

get_leverages(X: ndarray) → ndarray

Compute leverage values for samples.

This method returns the raw leverage values for inspection or custom thresholding.

Parameters:

X – Feature array of shape (n_samples, n_features).

Returns:

Array of leverage values for each sample.

Return type:

np.ndarray

Raises:

ValueError – If filter has not been fitted.

get_mask(X: ndarray, y: ndarray | None = None) → ndarray

Compute boolean mask indicating which samples to KEEP.

Parameters:
  • X – Feature array of shape (n_samples, n_features).

  • y – Target array (not used).

Returns:

Boolean array of shape (n_samples,) where:
  • True means KEEP the sample (low leverage)

  • False means EXCLUDE the sample (high leverage)

Return type:

np.ndarray

Raises:

ValueError – If filter has not been fitted.
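Since True means keep, the returned mask can be used directly for NumPy boolean indexing. The sketch below uses a hand-written mask as a stand-in for the output of get_mask, so it runs without the library; the data values are made up for illustration.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)   # 4 samples, 3 features
y = np.array([0.1, 0.2, 0.3, 0.4])
mask = np.array([True, False, True, True])     # stand-in for get_mask(X)

# Apply the same mask to features and targets so rows stay aligned.
X_kept, y_kept = X[mask], y[mask]

# Indices of the excluded (high-leverage) samples, for logging or review.
excluded_idx = np.flatnonzero(~mask)
print(X_kept.shape, excluded_idx)
```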