nirs4all.operators.filters.high_leverage module
High leverage filter for sample filtering.
This module provides the HighLeverageFilter class for detecting and excluding samples with high leverage, that is, samples with the potential to disproportionately influence model fitting.
- class nirs4all.operators.filters.high_leverage.HighLeverageFilter(method: Literal['hat', 'pca'] = 'hat', threshold_multiplier: float = 2.0, absolute_threshold: float | None = None, n_components: int | None = None, center: bool = True, reason: str | None = None)
Bases: SampleFilter
Filter high-leverage samples that may unduly influence the model.
High-leverage points are samples that are far from the center of the predictor space and can have a disproportionate effect on regression models. This filter identifies and excludes such samples.
The leverage of a sample is computed from the hat matrix H = X(X’X)^(-1)X’. The diagonal elements h_ii represent the leverage of each sample.
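As an illustration of the formula above, the hat-matrix diagonal can be computed with NumPy without forming the full n × n matrix. This is a minimal sketch, not the library's implementation: the function name `hat_leverages` is hypothetical, and using an SVD with a rank tolerance in place of a direct inverse is an assumption made here for numerical stability.

```python
import numpy as np

def hat_leverages(X, center=True):
    """Diagonal of H = X (X'X)^+ X', computed via SVD (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    # Thin SVD: X = U S V'. Then H = U_r U_r', where U_r holds the left
    # singular vectors that belong to non-negligible singular values.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(float).eps
    U_r = U[:, s > tol]
    # h_ii is the squared norm of row i of U_r.
    return np.sum(U_r ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
h = hat_leverages(X)
```

The leverages sum to the rank of the (centered) matrix, so the average leverage is rank / n. This is the quantity that multiplier-based thresholds are scaled against.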
Supported methods:
- "hat": Direct hat matrix diagonal computation
- "pca": PCA-based leverage (for high-dimensional data)
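For the "pca" option, one common formulation computes leverage from the scores of the first few principal components. The sketch below assumes that formulation, since the exact computation is not spelled out here; `pca_leverages` is a hypothetical helper, not part of nirs4all.

```python
import numpy as np

def pca_leverages(X, n_components):
    """Leverage from the first n_components PCA scores (illustrative only)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # PCA operates on mean-centered data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]  # score matrix, shape (n, k)
    # h_i = sum over components a of t_ia^2 / (t_a' t_a)
    return np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

rng = np.random.default_rng(0)
X_wide = rng.normal(size=(30, 200))  # more features than samples
h_pca = pca_leverages(X_wide, n_components=5)
```

With more features than samples, X'X is singular and cannot be inverted directly. Restricting the computation to a few components keeps the leverage well defined, and the values then sum to the number of components retained.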
Common threshold guidelines:
- 2 * p / n (where p = number of parameters, n = number of samples)
- 3 * average leverage
- Absolute threshold (e.g., 0.5)
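The guidelines above can be sketched as a small thresholding function. This is an assumption-laden illustration: `leverage_keep_mask` is a hypothetical name, and whether the library treats a value exactly at the threshold as kept or excluded is not stated here (the sketch keeps it).

```python
import numpy as np

def leverage_keep_mask(leverages, threshold_multiplier=2.0, absolute_threshold=None):
    """Return (keep_mask, threshold) for an array of leverage values."""
    h = np.asarray(leverages, dtype=float)
    if absolute_threshold is not None:
        # A fixed cutoff takes precedence in this sketch (assumption).
        threshold = float(absolute_threshold)
    else:
        # Multiple of the average leverage, e.g. 2 * mean(h).
        threshold = threshold_multiplier * h.mean()
    return h <= threshold, threshold

h = np.array([0.05, 0.08, 0.06, 0.07, 0.60])
mask_rel, thr_rel = leverage_keep_mask(h)                          # relative cutoff
mask_abs, thr_abs = leverage_keep_mask(h, absolute_threshold=0.5)  # fixed cutoff
```

In this toy array the last sample is the only one flagged under both rules. On real data the two rules can disagree, because the relative cutoff moves with the average leverage while the absolute one does not.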
Example
>>> from nirs4all.operators.filters import HighLeverageFilter
>>>
>>> # Using multiplier of average leverage (default)
>>> filter_obj = HighLeverageFilter(threshold_multiplier=2.0)
>>>
>>> # Using absolute threshold
>>> filter_abs = HighLeverageFilter(absolute_threshold=0.5)
>>>
>>> # PCA-based for high-dimensional data
>>> filter_pca = HighLeverageFilter(method="pca", n_components=10)
>>>
>>> # Fit and get mask
>>> filter_obj.fit(X_train)
>>> mask = filter_obj.get_mask(X_train)  # True = keep
- In Pipeline:
>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [HighLeverageFilter(threshold_multiplier=2.0)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]
- fit(X: ndarray, y: ndarray | None = None) → HighLeverageFilter
Compute leverage statistics from training data.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used for leverage computation).
- Returns:
The fitted filter instance.
- Return type:
self
- Raises:
ValueError – If X has insufficient samples.
- get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any]
Get statistics about filter application.
- Parameters:
X – Feature array.
y – Target array (unused).
- Returns:
Dict containing:
Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Leverage computation method
threshold: Computed threshold
n_effective_features: Number of features/components used
leverage_stats: Statistics on leverage values
- Return type:
Dict[str, Any]
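To make the base statistics concrete, the sketch below derives them from a keep mask. It is only an illustration: `base_stats_from_mask` is a hypothetical helper, and defining `exclusion_rate` as `n_excluded / n_samples` is an assumption rather than something confirmed here.

```python
import numpy as np

def base_stats_from_mask(mask):
    """Summarize a boolean keep mask (True = keep) as simple counts."""
    mask = np.asarray(mask, dtype=bool)
    n_samples = int(mask.size)
    n_kept = int(mask.sum())
    n_excluded = n_samples - n_kept
    # Assumed definition: fraction of samples that were excluded.
    exclusion_rate = n_excluded / n_samples if n_samples else 0.0
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        "exclusion_rate": exclusion_rate,
    }

stats = base_stats_from_mask([True, True, False, True])
```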
- get_leverages(X: ndarray) → ndarray
Compute leverage values for samples.
This method returns the raw leverage values for inspection or custom thresholding.
- Parameters:
X – Feature array of shape (n_samples, n_features).
- Returns:
Array of leverage values for each sample.
- Return type:
np.ndarray
- Raises:
ValueError – If filter has not been fitted.
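As an example of custom thresholding on raw leverage values, the sketch below flags samples above a chosen percentile. The `leverages` array is made-up stand-in data for what `get_leverages(X)` would return, and the 90th-percentile cutoff is an arbitrary choice for illustration.

```python
import numpy as np

# Stand-in for the array returned by get_leverages(X); values are invented.
leverages = np.array([0.02, 0.03, 0.05, 0.04, 0.03, 0.02, 0.04, 0.03, 0.05, 0.69])

# Custom rule: keep samples at or below the 90th percentile of leverage.
cutoff = np.percentile(leverages, 90)
keep = leverages <= cutoff

# Indices of the samples that the custom rule would exclude.
flagged = np.flatnonzero(~keep)
```

A percentile rule always flags roughly the same share of samples, whatever their actual leverage, so it suits exploratory inspection better than automatic exclusion.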
- get_mask(X: ndarray, y: ndarray | None = None) → ndarray
Compute boolean mask indicating which samples to KEEP.
- Parameters:
X – Feature array of shape (n_samples, n_features).
y – Target array (not used).
- Returns:
Boolean array of shape (n_samples,) where:
True means KEEP the sample (low leverage)
False means EXCLUDE the sample (high leverage)
- Return type:
np.ndarray
- Raises:
ValueError – If filter has not been fitted.
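Because True means KEEP, the mask can be used directly for boolean indexing of both features and targets. A minimal sketch with a hand-written mask standing in for the result of `get_mask(X)`:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Stand-in for filter_obj.get_mask(X); True = keep, False = exclude.
mask = np.array([True, True, False, True, True, False])

# Apply the same mask to X and y so rows stay aligned.
X_kept, y_kept = X[mask], y[mask]

# Indices of excluded samples, useful for logging or inspection.
excluded_idx = np.flatnonzero(~mask)
```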