nirs4all.operators.transforms.feature_selection module

Feature selection operators for NIRS spectral data.

This module provides wavelength/variable selection methods commonly used in chemometrics for NIRS data:

- CARS (Competitive Adaptive Reweighted Sampling)
- MC-UVE (Monte-Carlo Uninformative Variable Elimination)

These selectors identify informative wavelengths and reduce feature dimensionality while preserving predictive performance.

class nirs4all.operators.transforms.feature_selection.CARS(n_components: int = 10, n_sampling_runs: int = 50, n_variables_ratio_start: float = 1.0, n_variables_ratio_end: float = 0.1, cv_folds: int = 5, subset_ratio: float = 0.8, random_state: int | None = None)

Bases: TransformerMixin, BaseEstimator

Competitive Adaptive Reweighted Sampling (CARS) for wavelength selection.

CARS is a variable selection method that iteratively selects important wavelengths by:

1. Fitting PLS models on subsets of samples
2. Calculating variable importance weights from the regression coefficients
3. Using an exponentially decreasing function to reduce the variable count
4. Applying adaptive reweighted sampling based on importance
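Steps 2 to 4 can be sketched with NumPy alone. This is a minimal illustration of the idea described by Li et al. (2009), not the code used by this class: the helper name `ars_select`, the toy coefficients, and the choice to draw exactly `n_keep` samples are all assumptions.

```python
import numpy as np

def ars_select(coefs, n_keep, rng):
    """Hypothetical helper: one CARS-style reduction step.

    coefs  : regression coefficients of the current model (1-D array)
    n_keep : number of variables allowed by the decay schedule
    """
    # Step 2: importance weights are normalized absolute coefficients.
    weights = np.abs(coefs) / np.abs(coefs).sum()
    # Step 3: enforced reduction, keep only the n_keep largest weights.
    top = np.argsort(weights)[::-1][:n_keep]
    # Step 4: adaptive reweighted sampling. Draw with replacement among the
    # survivors, with probability proportional to weight, and keep the
    # unique draws. Low-weight survivors may therefore still drop out.
    p = weights[top] / weights[top].sum()
    drawn = rng.choice(top, size=n_keep, replace=True, p=p)
    return np.unique(drawn)

rng = np.random.default_rng(0)
coefs = np.array([0.9, -0.05, 0.4, 0.01, -0.7, 0.02])  # toy values
kept = ars_select(coefs, n_keep=3, rng=rng)
print(kept)
```

Because sampling is done with replacement, the number of surviving variables can be smaller than `n_keep`, which is what makes the selection competitive.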

The method was introduced by Li et al. (2009) and is widely used for NIRS wavelength selection.

Parameters:

n_components (int, default=10) – Number of PLS components for the internal PLS model.
n_sampling_runs (int, default=50) – Number of Monte-Carlo sampling runs.
n_variables_ratio_start (float, default=1.0) – Starting ratio of variables to keep (1.0 = all variables).
n_variables_ratio_end (float, default=0.1) – Ending ratio of variables to keep.
cv_folds (int, default=5) – Number of cross-validation folds for RMSECV calculation.
subset_ratio (float, default=0.8) – Ratio of samples to use in each Monte-Carlo run.
random_state (int or None, default=None) – Random seed for reproducibility.
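How `n_variables_ratio_start` and `n_variables_ratio_end` translate into a per-run variable count is not spelled out here. One plausible reading, shown below as a sketch, is a geometric (exponential) interpolation between the two ratios over `n_sampling_runs` runs. The helper `variable_ratio_schedule` is hypothetical and the exact formula used by the class may differ.

```python
import math

def variable_ratio_schedule(n_runs, ratio_start=1.0, ratio_end=0.1):
    """Hypothetical helper: exponential decay from ratio_start to ratio_end."""
    if n_runs == 1:
        return [ratio_start]
    # Decay constant chosen so that the last run lands on ratio_end.
    k = math.log(ratio_start / ratio_end) / (n_runs - 1)
    return [ratio_start * math.exp(-k * i) for i in range(n_runs)]

ratios = variable_ratio_schedule(n_runs=50)

# Convert ratios into variable counts for a 200-wavelength spectrum.
n_features = 200
n_vars = [max(1, round(r * n_features)) for r in ratios]
print(n_vars[0], n_vars[-1])
```

With the defaults, the schedule starts from all variables and ends at one tenth of them, shrinking quickly in early runs and slowly in later ones.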

selected_indices_

Indices of selected features/wavelengths.

Type:: ndarray of shape (n_selected,)

selection_mask_

Boolean mask indicating selected features.

Type:: ndarray of shape (n_features,)

n_features_in_

Number of features in input data.

Type:: int

n_features_out_

Number of selected features.

Type:: int

rmsecv_history_

RMSECV values at each iteration.

Type:: ndarray of shape (n_sampling_runs,)

n_variables_history_

Number of variables at each iteration.

Type:: ndarray of shape (n_sampling_runs,)

optimal_run_idx_

Index of the run with minimum RMSECV.

Type:: int
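The relationship between the three history attributes can be illustrated with toy arrays. The values below are invented for illustration and do not come from a fitted selector; the sketch only assumes that `optimal_run_idx_` is the argmin of `rmsecv_history_`, as described above.

```python
import numpy as np

# Toy values standing in for a fitted selector's attributes.
rmsecv_history = np.array([0.52, 0.47, 0.41, 0.43, 0.50])
n_variables_history = np.array([200, 150, 110, 80, 60])

# The optimal run is the one with the smallest RMSECV.
optimal_run_idx = int(np.argmin(rmsecv_history))
print(optimal_run_idx, n_variables_history[optimal_run_idx])
```

Plotting `rmsecv_history_` against `n_variables_history_` is a common way to check that the minimum is not a noisy outlier.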

Examples

>>> from nirs4all.operators.transforms import CARS
>>> import numpy as np
>>>
>>> # Spectral data with 200 wavelengths
>>> X = np.random.randn(100, 200)
>>> y = np.random.randn(100)
>>>
>>> # Select informative wavelengths
>>> cars = CARS(n_components=10, n_sampling_runs=30)
>>> cars.fit(X, y)
>>> X_selected = cars.transform(X)
>>> print(f"Selected {X_selected.shape[1]} from {X.shape[1]} wavelengths")

References

Li, H., Liang, Y., Xu, Q., & Cao, D. (2009). Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Analytica Chimica Acta, 648(1), 77-84.

Notes

CARS works best with standardized/scaled data
The exponential decay function ensures smooth variable reduction
Final selection is based on the run with the minimum RMSECV
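For readers unfamiliar with RMSECV, the sketch below computes a k-fold cross-validated RMSE for a given variable subset. It is an assumption-laden illustration: ordinary least squares stands in for the PLS model, and the helper `rmsecv` and the toy data are hypothetical.

```python
import numpy as np

def rmsecv(X, y, cv_folds=5, seed=0):
    """Hypothetical helper: k-fold RMSECV with a least-squares model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), cv_folds)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Fit with an intercept column on the training folds only.
        A = np.column_stack([np.ones(train.sum()), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # Predict the held-out fold and accumulate squared errors.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        sq_err += np.sum((y[test_idx] - A_test @ beta) ** 2)
    return float(np.sqrt(sq_err / len(y)))

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = 2.0 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=60)

full = rmsecv(X, y)                 # all 8 variables
subset = rmsecv(X[:, [0, 3]], y)    # only the informative ones
print(full, subset)
```

CARS evaluates a score of this kind for the variable subset of each run and keeps the subset with the lowest value.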

__repr__(): String representation of the selector.

fit(X, y=None, wavelengths: ndarray | None = None)

Fit the CARS selector to identify important wavelengths.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,) or (n_samples, 1)) – Target values. Required for CARS.
wavelengths (array-like of shape (n_features,), optional) – Original wavelength grid. Stored for reference but not required.

Returns:

self – Fitted selector.

Return type:

CARS

get_feature_names_out(input_features=None)

Get output feature names (selected wavelengths as strings).

Parameters:: input_features (array-like of str or None, default=None) – Input feature names. If None, uses indices.
Returns:: feature_names_out – Selected feature names.
Return type:: ndarray of str
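The naming behaviour can be sketched as a mask applied to the input names. The function below is a hypothetical stand-in, not the method itself; in particular, rendering the fallback names as plain index strings is an assumption about the output format.

```python
import numpy as np

def feature_names_out(selection_mask, input_features=None):
    """Hypothetical stand-in for the selector's naming logic."""
    mask = np.asarray(selection_mask, dtype=bool)
    if input_features is None:
        # Assumption: fall back to the column indices rendered as strings.
        return np.flatnonzero(mask).astype(str)
    return np.asarray(input_features, dtype=str)[mask]

mask = [False, True, True, False]
wavelengths = ["1100", "1102", "1104", "1106"]  # toy wavelength labels (nm)

names = feature_names_out(mask, wavelengths)
fallback = feature_names_out(mask)
print(names, fallback)
```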

get_support(indices: bool = False)

Get a mask or indices of selected features.

Parameters:: indices (bool, default=False) – If True, return indices instead of boolean mask.
Returns:: support – Boolean mask or indices of selected features.
Return type:: ndarray
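The two return forms carry the same information. A short NumPy sketch with a toy mask (not taken from a fitted selector) shows how one converts to the other:

```python
import numpy as np

selection_mask = np.array([True, False, True, True, False])  # toy mask

# Boolean mask -> integer indices of the selected columns.
selected_indices = np.flatnonzero(selection_mask)

# Round trip: rebuild the mask from the indices.
rebuilt = np.zeros(selection_mask.size, dtype=bool)
rebuilt[selected_indices] = True
print(selected_indices, rebuilt)
```

The mask form is convenient for filtering arrays of length `n_features` (such as a wavelength grid), while the index form is convenient for reporting.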

set_fit_request(*, wavelengths: bool | None | str = '$UNCHANGED$') → CARS

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:: wavelengths (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for wavelengths parameter in fit.
Returns:: self – The updated object.
Return type:: object

transform(X)

Transform data by selecting only the important wavelengths.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Returns:: X_selected – Data with only selected wavelengths.
Return type:: ndarray of shape (n_samples, n_selected)
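Conceptually, the transform is a column selection. The sketch below uses a toy array and a toy index list to show the resulting shape; it assumes nothing about the class beyond the documented input and output shapes.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)   # 3 samples, 4 wavelengths
selected_indices = np.array([1, 3])            # toy selection

# Keep only the selected columns; rows (samples) are untouched.
X_selected = X[:, selected_indices]
print(X_selected.shape)
```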

class nirs4all.operators.transforms.feature_selection.MCUVE(n_components: int = 10, n_iterations: int = 100, subset_ratio: float = 0.8, n_noise_variables: int | None = None, threshold_method: Literal['percentile', 'fixed', 'auto'] = 'auto', threshold_percentile: float = 99, threshold_value: float = 2.0, random_state: int | None = None)

Bases: TransformerMixin, BaseEstimator

Monte-Carlo Uninformative Variable Elimination (MC-UVE) for wavelength selection.

MC-UVE identifies uninformative variables by comparing the stability of regression coefficients between real variables and random noise variables. Variables with low stability (similar to noise) are eliminated.

The method works by:

1. Augmenting X with noise variables (same distribution as X)
2. Performing multiple PLS fits on bootstrap samples
3. Calculating stability (mean/std) of regression coefficients
4. Selecting variables with stability significantly higher than noise
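The four steps can be sketched end to end on toy data. This is an illustration under stated assumptions, not the class's implementation: ordinary least squares replaces PLS, subsets are drawn without replacement, and the threshold is simply the largest absolute noise stability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_iterations, subset_ratio = 60, 5, 50, 0.8

# Toy data: only the first two variables carry signal.
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n_samples)

# Step 1: append noise variables drawn with the mean and spread of X.
noise = rng.normal(X.mean(), X.std(), size=(n_samples, n_features))
X_aug = np.hstack([X, noise])

# Step 2: refit on random subsets (least squares stands in for PLS).
n_sub = int(subset_ratio * n_samples)
coefs = np.empty((n_iterations, X_aug.shape[1]))
for i in range(n_iterations):
    idx = rng.choice(n_samples, size=n_sub, replace=False)
    beta, *_ = np.linalg.lstsq(X_aug[idx], y[idx], rcond=None)
    coefs[i] = beta

# Step 3: stability = mean / std of each coefficient across iterations.
stability = coefs.mean(axis=0) / coefs.std(axis=0)
real_stab, noise_stab = stability[:n_features], stability[n_features:]

# Step 4: keep variables whose |stability| exceeds every noise variable's.
threshold = np.abs(noise_stab).max()
selected = np.flatnonzero(np.abs(real_stab) > threshold)
print(selected)
```

The informative variables have coefficients that barely move between subsets, so their stability is far larger than that of the noise columns.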

Parameters:

n_components (int, default=10) – Number of PLS components for the internal PLS model.
n_iterations (int, default=100) – Number of Monte-Carlo iterations (bootstrap samples).
subset_ratio (float, default=0.8) – Ratio of samples to use in each bootstrap iteration.
n_noise_variables (int or None, default=None) – Number of noise variables to add. If None, uses n_features.
threshold_method ({'percentile', 'fixed', 'auto'}, default='auto') – Method used to determine the selection threshold:
    - 'percentile': use a percentile of the noise stability as the threshold
    - 'fixed': use a fixed stability threshold
    - 'auto': select the threshold automatically from the noise distribution
threshold_percentile (float, default=99) – Percentile of noise stability used as the threshold (for the 'percentile' method).
threshold_value (float, default=2.0) – Fixed stability threshold value (for the 'fixed' method).
random_state (int or None, default=None) – Random seed for reproducibility.
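The 'percentile' and 'fixed' threshold methods can be sketched as follows. The helper `stability_threshold` is hypothetical, and taking the percentile over absolute noise stabilities is an assumption. The 'auto' rule is not described in this reference, so the sketch deliberately does not implement it.

```python
import numpy as np

def stability_threshold(noise_stability, method="percentile",
                        threshold_percentile=99, threshold_value=2.0):
    """Hypothetical helper covering the 'percentile' and 'fixed' methods."""
    if method == "percentile":
        # Assumption: the percentile is taken over absolute stabilities.
        return float(np.percentile(np.abs(noise_stability),
                                   threshold_percentile))
    if method == "fixed":
        return float(threshold_value)
    raise ValueError(f"unsupported method in this sketch: {method!r}")

noise_stability = np.array([-1.2, 0.4, 0.9, -0.3, 1.6])  # toy values

t_pct = stability_threshold(noise_stability, "percentile",
                            threshold_percentile=100)
t_fix = stability_threshold(noise_stability, "fixed")
print(t_pct, t_fix)
```

A higher percentile gives a stricter threshold and therefore fewer selected wavelengths.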

selected_indices_

Indices of selected features/wavelengths.

Type:: ndarray of shape (n_selected,)

selection_mask_

Boolean mask indicating selected features.

Type:: ndarray of shape (n_features,)

n_features_in_

Number of features in input data.

Type:: int

n_features_out_

Number of selected features.

Type:: int

stability_

Stability values for each real variable.

Type:: ndarray of shape (n_features,)

noise_stability_

Stability values for noise variables.

Type:: ndarray of shape (n_noise_variables,)

threshold_

Threshold value used for selection.

Type:: float

mean_coefs_

Mean regression coefficients across iterations.

Type:: ndarray of shape (n_features,)

std_coefs_

Standard deviation of coefficients across iterations.

Type:: ndarray of shape (n_features,)

Examples

>>> from nirs4all.operators.transforms import MCUVE
>>> import numpy as np
>>>
>>> # Spectral data with 200 wavelengths
>>> X = np.random.randn(100, 200)
>>> y = np.random.randn(100)
>>>
>>> # Select informative wavelengths
>>> mcuve = MCUVE(n_components=10, n_iterations=100)
>>> mcuve.fit(X, y)
>>> X_selected = mcuve.transform(X)
>>> print(f"Selected {X_selected.shape[1]} from {X.shape[1]} wavelengths")

References

Cai, W., Li, Y., & Shao, X. (2008). A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometrics and Intelligent Laboratory Systems, 90(2), 188-194.

Notes

MC-UVE is robust against random noise
Higher stability indicates more informative variables
The noise comparison ensures a principled selection threshold

__repr__(): String representation of the selector.

fit(X, y=None, wavelengths: ndarray | None = None)

Fit the MC-UVE selector to identify important wavelengths.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,) or (n_samples, 1)) – Target values. Required for MC-UVE.
wavelengths (array-like of shape (n_features,), optional) – Original wavelength grid. Stored for reference but not required.

Returns:

self – Fitted selector.

Return type:

MCUVE

get_feature_names_out(input_features=None)

Get output feature names (selected wavelengths as strings).

Parameters:: input_features (array-like of str or None, default=None) – Input feature names. If None, uses indices.
Returns:: feature_names_out – Selected feature names.
Return type:: ndarray of str

get_support(indices: bool = False)

Get a mask or indices of selected features.

Parameters:: indices (bool, default=False) – If True, return indices instead of boolean mask.
Returns:: support – Boolean mask or indices of selected features.
Return type:: ndarray

set_fit_request(*, wavelengths: bool | None | str = '$UNCHANGED$') → MCUVE

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:: wavelengths (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for wavelengths parameter in fit.
Returns:: self – The updated object.
Return type:: object

transform(X)

Transform data by selecting only the important wavelengths.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Returns:: X_selected – Data with only selected wavelengths.
Return type:: ndarray of shape (n_samples, n_selected)