nirs4all.operators.transforms.feature_selection module
Feature selection operators for NIRS spectral data.
This module provides wavelength/variable selection methods commonly used in chemometrics for NIRS data:
- CARS (Competitive Adaptive Reweighted Sampling)
- MC-UVE (Monte-Carlo Uninformative Variable Elimination)
These selectors identify informative wavelengths and reduce feature dimensionality while preserving predictive performance.
- class nirs4all.operators.transforms.feature_selection.CARS(n_components: int = 10, n_sampling_runs: int = 50, n_variables_ratio_start: float = 1.0, n_variables_ratio_end: float = 0.1, cv_folds: int = 5, subset_ratio: float = 0.8, random_state: int | None = None)
Bases: TransformerMixin, BaseEstimator
Competitive Adaptive Reweighted Sampling (CARS) for wavelength selection.
CARS is a variable selection method that iteratively selects important wavelengths by:
1. Fitting PLS models on subsets of samples
2. Calculating variable importance weights from the regression coefficients
3. Using an exponentially decreasing function to reduce the variable count
4. Applying adaptive reweighted sampling based on importance
The method was introduced by Li et al. (2009) and is widely used for NIRS wavelength selection.
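Step 3 above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `exponential_schedule` is hypothetical, and the exact decay and rounding used inside CARS are assumptions. The sketch interpolates the kept-variable ratio exponentially from `n_variables_ratio_start` to `n_variables_ratio_end` over `n_sampling_runs` runs.

```python
import math


def exponential_schedule(n_features, n_runs, ratio_start=1.0, ratio_end=0.1):
    """Number of variables kept at each run under an exponential decay.

    Hypothetical helper: the ratio goes from ratio_start at the first run
    to ratio_end at the last run.
    """
    if n_runs < 2:
        return [max(1, round(n_features * ratio_end))]
    # Decay rate chosen so that ratio_start * exp(-k * (n_runs - 1)) == ratio_end.
    k = math.log(ratio_start / ratio_end) / (n_runs - 1)
    counts = []
    for i in range(n_runs):
        ratio = ratio_start * math.exp(-k * i)
        # Always keep at least one variable.
        counts.append(max(1, round(n_features * ratio)))
    return counts


# 200 wavelengths reduced over 50 runs with the default ratios.
schedule = exponential_schedule(n_features=200, n_runs=50)
```

The count drops quickly in the early runs and slowly in the later ones, so most of the fine-grained comparison happens among small subsets.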
- Parameters:
n_components (int, default=10) – Number of PLS components for the internal PLS model.
n_sampling_runs (int, default=50) – Number of Monte-Carlo sampling runs.
n_variables_ratio_start (float, default=1.0) – Starting ratio of variables to keep (1.0 = all variables).
n_variables_ratio_end (float, default=0.1) – Ending ratio of variables to keep.
cv_folds (int, default=5) – Number of cross-validation folds for RMSECV calculation.
subset_ratio (float, default=0.8) – Ratio of samples to use in each Monte-Carlo run.
random_state (int or None, default=None) – Random seed for reproducibility.
- selection_mask_
Boolean mask indicating selected features.
- Type:
ndarray of shape (n_features,)
- n_variables_history_
Number of variables at each iteration.
- Type:
ndarray of shape (n_sampling_runs,)
Examples
>>> from nirs4all.operators.transforms import CARS
>>> import numpy as np
>>>
>>> # Spectral data with 200 wavelengths
>>> X = np.random.randn(100, 200)
>>> y = np.random.randn(100)
>>>
>>> # Select informative wavelengths
>>> cars = CARS(n_components=10, n_sampling_runs=30)
>>> cars.fit(X, y)
>>> X_selected = cars.transform(X)
>>> print(f"Selected {X_selected.shape[1]} from {X.shape[1]} wavelengths")
References
Li, H., Liang, Y., Xu, Q., & Cao, D. (2009). Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Analytica Chimica Acta, 648(1), 77-84.
Notes
CARS works best with standardized/scaled data
The exponential decay function ensures smooth variable reduction
The final subset is the one with the lowest RMSECV across sampling runs
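The note on scaling can be illustrated with column-wise autoscaling applied before selection. This is a sketch using NumPy only; the helper names `fit_autoscaler` and `apply_autoscaler` are hypothetical, and scikit-learn's `StandardScaler` performs the same computation. The statistics are estimated on the training spectra and reused for new spectra.

```python
import numpy as np


def fit_autoscaler(X_train):
    """Return per-wavelength mean and standard deviation of the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant wavelengths, which would divide by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def apply_autoscaler(X, mean, std):
    """Center and scale each wavelength with the training statistics."""
    return (X - mean) / std


rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 200))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 200))

mean, std = fit_autoscaler(X_train)
X_train_scaled = apply_autoscaler(X_train, mean, std)
X_test_scaled = apply_autoscaler(X_test, mean, std)
```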
- fit(X, y=None, wavelengths: ndarray | None = None)
Fit the CARS selector to identify important wavelengths.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,) or (n_samples, 1)) – Target values. Required for CARS.
wavelengths (array-like of shape (n_features,), optional) – Original wavelength grid. Stored for reference but not required.
- Returns:
self – Fitted selector.
- Return type:
CARS
- get_feature_names_out(input_features=None)
Get output feature names (selected wavelengths as strings).
- get_support(indices: bool = False)
Get a mask or indices of selected features.
- Parameters:
indices (bool, default=False) – If True, return indices instead of boolean mask.
- Returns:
support – Boolean mask or indices of selected features.
- Return type:
ndarray
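The two return forms are related as in the NumPy sketch below. The mask and wavelength grid are made up for illustration (in practice the mask would come from a fitted selector), and the sketch assumes that `get_support` follows the usual scikit-learn selector convention: a boolean mask by default, integer positions when `indices=True`.

```python
import numpy as np

# Hypothetical selection over six wavelengths.
selection_mask = np.array([True, False, False, True, True, False])
wavelengths = np.array([1100.0, 1102.0, 1104.0, 1106.0, 1108.0, 1110.0])

# Integer positions of the selected features, equivalent to the mask.
selected_indices = np.flatnonzero(selection_mask)

# Both forms select the same columns of a (n_samples, n_features) array.
X = np.arange(12, dtype=float).reshape(2, 6)
X_by_mask = X[:, selection_mask]
X_by_index = X[:, selected_indices]

# The mask can also be applied to the wavelength grid itself.
selected_wavelengths = wavelengths[selection_mask]
```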
- set_fit_request(*, wavelengths: bool | None | str = '$UNCHANGED$') → CARS
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class nirs4all.operators.transforms.feature_selection.MCUVE(n_components: int = 10, n_iterations: int = 100, subset_ratio: float = 0.8, n_noise_variables: int | None = None, threshold_method: Literal['percentile', 'fixed', 'auto'] = 'auto', threshold_percentile: float = 99, threshold_value: float = 2.0, random_state: int | None = None)
Bases: TransformerMixin, BaseEstimator
Monte-Carlo Uninformative Variable Elimination (MC-UVE) for wavelength selection.
MC-UVE identifies uninformative variables by comparing the stability of regression coefficients between real variables and random noise variables. Variables with low stability (similar to noise) are eliminated.
The method works by:
1. Augmenting X with noise variables (same distribution as X)
2. Performing multiple PLS fits on bootstrap samples
3. Calculating stability (mean/std) of regression coefficients
4. Selecting variables with stability significantly higher than noise
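The four steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's code: the function name `coefficient_stability` is hypothetical, ordinary least squares is swapped in for the PLS fit to keep the sketch dependency-free, subsets are drawn without replacement, and the final rule (beat every noise column) is one simple choice among several.

```python
import numpy as np


def coefficient_stability(X, y, n_iterations=100, subset_ratio=0.8, seed=0):
    """Return stability (mean / std of coefficients) for real and noise variables."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape

    # Step 1: append noise columns matching the overall mean and spread of X
    # (assumption: one noise column per real variable).
    noise = rng.normal(X.mean(), X.std(), size=(n_samples, n_features))
    X_aug = np.hstack([X, noise])

    n_subset = int(subset_ratio * n_samples)
    coefs = np.empty((n_iterations, 2 * n_features))
    for i in range(n_iterations):
        # Step 2: fit on a random subset of samples (no replacement, an assumption).
        idx = rng.choice(n_samples, size=n_subset, replace=False)
        Xs = X_aug[idx] - X_aug[idx].mean(axis=0)
        ys = y[idx] - y[idx].mean()
        # Stand-in for the PLS fit: ordinary least squares on centered data.
        coefs[i] = np.linalg.lstsq(Xs, ys, rcond=None)[0]

    # Step 3: stability is the mean coefficient divided by its standard deviation.
    stability = coefs.mean(axis=0) / coefs.std(axis=0)
    return stability[:n_features], stability[n_features:]


# Synthetic data: only the first two variables carry signal.
rng = np.random.default_rng(42)
X = rng.standard_normal((80, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(80)

real_stability, noise_stability = coefficient_stability(X, y)

# Step 4: keep variables whose absolute stability exceeds every noise column.
selected = np.abs(real_stability) > np.abs(noise_stability).max()
```

The sign of the stability follows the sign of the coefficient, which is why the comparison uses absolute values.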
- Parameters:
n_components (int, default=10) – Number of PLS components for the internal PLS model.
n_iterations (int, default=100) – Number of Monte-Carlo iterations (bootstrap samples).
subset_ratio (float, default=0.8) – Ratio of samples to use in each bootstrap iteration.
n_noise_variables (int or None, default=None) – Number of noise variables to add. If None, uses n_features.
threshold_method ({'percentile', 'fixed', 'auto'}, default='auto') – Method to determine the selection threshold:
- 'percentile': Use a percentile of the noise stability as the threshold
- 'fixed': Use a fixed stability threshold
- 'auto': Automatically select based on the noise distribution
threshold_percentile (float, default=99) – Percentile of noise stability used as threshold (for ‘percentile’ method).
threshold_value (float, default=2.0) – Fixed stability threshold value (for ‘fixed’ method).
random_state (int or None, default=None) – Random seed for reproducibility.
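The 'percentile' and 'fixed' threshold methods described above might look like the sketch below. The function name `stability_threshold` is hypothetical, and comparing absolute stability values is an assumption. The 'auto' rule is not specified here, so the sketch deliberately leaves it out rather than guess.

```python
import numpy as np


def stability_threshold(noise_stability, method="percentile",
                        percentile=99, value=2.0):
    """Return a stability cutoff from the noise variables (illustrative only)."""
    if method == "percentile":
        # Cutoff at the given percentile of the absolute noise stability.
        return float(np.percentile(np.abs(noise_stability), percentile))
    if method == "fixed":
        # Cutoff given directly by the user.
        return float(value)
    raise ValueError(f"method {method!r} is not covered by this sketch")


# Made-up stability values for four noise and four real variables.
noise_stability = np.array([-0.5, 0.2, 1.0, -1.5])
real_stability = np.array([8.0, -0.3, -6.0, 1.2])

cutoff = stability_threshold(noise_stability, method="percentile", percentile=99)
selection_mask = np.abs(real_stability) > cutoff
```

With these values the first and third real variables are kept; the other two are indistinguishable from noise.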
- selection_mask_
Boolean mask indicating selected features.
- Type:
ndarray of shape (n_features,)
- stability_
Stability values for each real variable.
- Type:
ndarray of shape (n_features,)
- mean_coefs_
Mean regression coefficients across iterations.
- Type:
ndarray of shape (n_features,)
- std_coefs_
Standard deviation of coefficients across iterations.
- Type:
ndarray of shape (n_features,)
Examples
>>> from nirs4all.operators.transforms import MCUVE
>>> import numpy as np
>>>
>>> # Spectral data with 200 wavelengths
>>> X = np.random.randn(100, 200)
>>> y = np.random.randn(100)
>>>
>>> # Select informative wavelengths
>>> mcuve = MCUVE(n_components=10, n_iterations=100)
>>> mcuve.fit(X, y)
>>> X_selected = mcuve.transform(X)
>>> print(f"Selected {X_selected.shape[1]} from {X.shape[1]} wavelengths")
References
Cai, W., Li, Y., & Shao, X. (2008). A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometrics and Intelligent Laboratory Systems, 90(2), 188-194.
Notes
MC-UVE is robust against random noise
Higher stability indicates more informative variables
The noise comparison ensures a principled selection threshold
- fit(X, y=None, wavelengths: ndarray | None = None)
Fit the MC-UVE selector to identify important wavelengths.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,) or (n_samples, 1)) – Target values. Required for MC-UVE.
wavelengths (array-like of shape (n_features,), optional) – Original wavelength grid. Stored for reference but not required.
- Returns:
self – Fitted selector.
- Return type:
MCUVE
- get_feature_names_out(input_features=None)
Get output feature names (selected wavelengths as strings).
- get_support(indices: bool = False)
Get a mask or indices of selected features.
- Parameters:
indices (bool, default=False) – If True, return indices instead of boolean mask.
- Returns:
support – Boolean mask or indices of selected features.
- Return type:
ndarray
- set_fit_request(*, wavelengths: bool | None | str = '$UNCHANGED$') → MCUVE
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.