nirs4all.operators package

Subpackages

nirs4all.operators.data package
- Submodules
  - nirs4all.operators.data.merge module
  - nirs4all.operators.data.repetition module
    - RepetitionConfig
    - UnequelRepsStrategy
- Module contents
nirs4all.operators.filters package
- Submodules
- Module contents
nirs4all.operators.models package
nirs4all.operators.splitters package
- Submodules
  - nirs4all.operators.splitters.grouped_wrapper module
    - GroupedSplitterWrapper
  - nirs4all.operators.splitters.splitters module
- Module contents
nirs4all.operators.transforms package
- Submodules
- Module contents

Module contents

class nirs4all.operators.Augmenter(apply_on='samples', random_state=None, *, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Base class for data augmentation transformers.

abstractmethod augment(X, apply_on='samples')[source]

Perform data augmentation.

Parameters:

X (array-like) – Input data to augment.
apply_on (str) – The level at which augmentation is applied. Can be one of ‘samples’, ‘features’, ‘subsets’, or ‘global’. Defaults to ‘samples’.

Returns:

Augmented data.

Return type:

array-like

fit(X, y=None)[source]

Fit to data.

Parameters:

X (array-like) – Input data to fit.
y (array-like or None) – Target variable (unused).

Returns:

self – Returns the instance itself.

Return type:

object

fit_transform(X, y=None, **fit_params)[source]

Fit to data and transform it.

Parameters:

X (array-like) – Input data to fit and transform.
y (array-like or None) – Target variable (unused).
**fit_params (dict) – Additional fitting parameters (unused).

Returns:

Transformed data.

Return type:

array-like

transform(X)[source]

Transform the input data by applying data augmentation.

Parameters: X (array-like) – Input data to transform.
Returns: Transformed data after augmentation.
Return type: array-like

class nirs4all.operators.Baseline(*, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Removes baseline (mean) from each spectrum.

Parameters: copy (bool, optional) – Whether to make a copy of the input data. Default is True.

fit(X, y=None)[source]

Compute the baseline (mean) that transform later removes.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data used to compute the baseline.
y (None) – Ignored.

Returns:

self – Fitted Baseline object.

Return type:

object

inverse_transform(X, y=None)[source]

partial_fit(X, y=None)[source]

transform(X, y=None)[source]

class nirs4all.operators.CropTransformer(start: int = 0, end: int = None)[source]

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)[source]

transform(X)[source]

class nirs4all.operators.Derivate(order=1, delta=1, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)[source]

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → Derivate

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X, copy=None)[source]

class nirs4all.operators.Detrend(bp=0, *, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Perform spectral detrending to remove linear trend from data.

Parameters:

bp (int, optional) – Breakpoints for piecewise linear detrending. Default is 0.
copy (bool, optional) – Whether to make a copy of the input data. Default is True.

fit(X, y=None)[source]

Fit the transformer to the data.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data.
y (None) – Ignored.

Returns:

self – Returns self.

Return type:

object

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → Detrend

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X, copy=None)[source]

Transform the data by removing linear trend.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data.
copy (bool or None, optional) – Whether to make a copy of the input data. If None, self.copy is used. Default is None.

Returns:

The transformed data.

Return type:

numpy.ndarray
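
A minimal sketch of linear detrending for the default bp=0 case, assuming that setting means a single least-squares line is fitted to each spectrum and subtracted. The helper name linear_detrend is hypothetical and uses only NumPy; it illustrates the idea and is not the library's implementation.

```python
import numpy as np

def linear_detrend(X):
    """Subtract a least-squares straight line from each row of X."""
    X = np.asarray(X, dtype=float)
    t = np.arange(X.shape[1])
    # polyfit accepts a 2-D y of shape (n_points, n_series): one fit per spectrum.
    slope, intercept = np.polyfit(t, X.T, deg=1)
    trend = slope[:, None] * t[None, :] + intercept[:, None]
    return X - trend

# Two spectra: a pure line, and a line plus a sine wave.
t = np.arange(50)
X = np.vstack([0.5 * t + 2.0, 0.1 * t + np.sin(t / 5.0)])
X_detrended = linear_detrend(X)
```

The first row is a pure line, so its detrended values are zero up to floating-point error; the second keeps its oscillation with the slope removed.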

class nirs4all.operators.Gaussian(order=2, sigma=1, *, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)[source]

Fit the Gaussian filter.

Parameters:

X (numpy.ndarray) – Input data.
y (None) – Ignored.

Returns:

self – Returns the instance itself.

Return type:

object

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → Gaussian

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X, copy=None)[source]

Transform the input data using the Gaussian filter.

Parameters:

X (numpy.ndarray) – Input data.
copy (bool, default=None) – Whether to make a copy of the input data.

Returns:

Transformed data.

Return type:

numpy.ndarray

class nirs4all.operators.Haar(*, copy: bool = True)[source]

Bases: Wavelet

Shortcut to the Wavelet haar transform.

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → Haar

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

class nirs4all.operators.IdentityAugmenter(apply_on='samples', random_state=None, *, copy=True)[source]

Bases: Augmenter

An augmenter that returns the input data without any changes.

augment(X, _)[source]

Perform identity augmentation.

Parameters:

X (array-like) – Input data to augment.
_ (str) – Placeholder for unused parameter.

Returns:

Augmented data (same as input data).

Return type:

array-like

nirs4all.operators.IdentityTransformer: alias of FunctionTransformer

class nirs4all.operators.LocalStandardNormalVariate(window=11, pad_mode='reflect', constant_values=0.0, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Local Standard Normal Variate (LSNV).

Per-sample local normalization with a sliding window along features. For each sample and feature j:

mean_w = mean(X[…, j-w//2 : j+w//2+1])
std_w = std(X[…, j-w//2 : j+w//2+1])
X’[j] = (X[j] - mean_w) / std_w

Parameters:

window (int, default=11) – Odd positive window size along features.
pad_mode ({'reflect','edge','constant'}, default='reflect') – Padding mode at boundaries.
constant_values (float, default=0.0) – Used only if pad_mode=’constant’.
copy (bool, default=True) – If False, try in-place.

Notes

Operates row-wise (axis=1). Input must be (n_samples, n_features).
std_w==0 → divide by 1 to avoid NaN.
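
The sliding-window formula above can be sketched with NumPy alone. The function name local_snv is hypothetical, only the 'reflect' and 'edge' padding modes are covered, and the window is assumed to be odd and no wider than the spectrum allows for the chosen padding; treat it as an illustration of the formula rather than the class's actual code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_snv(X, window=11, pad_mode="reflect"):
    """Centre and scale each feature by the mean/std of its local window."""
    X = np.asarray(X, dtype=float)
    half = window // 2
    # Pad along the feature axis so every feature has a full window.
    X_pad = np.pad(X, ((0, 0), (half, half)), mode=pad_mode)
    # windows[i, j] holds the `window` values centred on feature j of sample i.
    windows = sliding_window_view(X_pad, window, axis=1)
    mean_w = windows.mean(axis=-1)
    std_w = windows.std(axis=-1)
    std_w = np.where(std_w == 0, 1.0, std_w)  # zero std: divide by 1 instead
    return (X - mean_w) / std_w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))
X[0] = 3.0  # a constant spectrum exercises the std_w == 0 rule
out = local_snv(X, window=11)
```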

fit(X, y=None)[source]

fit_transform(X, y=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(X)[source]

class nirs4all.operators.MultiplicativeScatterCorrection(scale=True, *, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)[source]

inverse_transform(X)[source]

partial_fit(X, y=None)[source]

transform(X)[source]

class nirs4all.operators.Normalize(feature_range=(-1, 1), *, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Normalize spectra using either a custom range or linalg normalization.

Parameters:

feature_range (tuple (min, max), default=(-1, 1)) – Desired range of transformed data. If both min and max equal -1, linalg normalization is applied; otherwise the data is scaled to the user-defined range.
copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

fit(X, y=None)[source]

Fit the Normalize transformer on the training data.

Parameters:

X (array-like of shape (n_samples, n_features)) – The training data.
y (None) – Ignored variable.

Returns:

self – Returns the instance itself.

Return type:

object

inverse_transform(X)[source]

Transform the normalized data back to the original representation.

Parameters: X (array-like of shape (n_samples, n_features)) – The normalized data to be transformed back.
Returns: X – The inverse transformed data.
Return type: ndarray of shape (n_samples, n_features)

partial_fit(X, y=None)[source]

Perform incremental fit on the training data.

Parameters:

X (array-like of shape (n_samples, n_features)) – The training data.
y (None) – Ignored variable.

Returns:

self – Returns the instance itself.

Return type:

object

transform(X)[source]

Transform the input data.

Parameters: X (array-like of shape (n_samples, n_features)) – The input data to be transformed.
Returns: X – The transformed data.
Return type: ndarray of shape (n_samples, n_features)

class nirs4all.operators.Random_X_Operation(apply_on='global', random_state=None, *, copy=True, operator_func=<built-in function mul>, operator_range=(0.97, 1.03))[source]

Bases: Augmenter

Augmenter that applies a random operation to the data.

Parameters:

apply_on (str, optional) – Level at which augmentation is applied, e.g. “features” or “samples”. Default is “global”.
random_state (int or None, optional) – Random seed for reproducibility. Default is None.
copy (bool, optional) – If True, creates a copy of the input data. Default is True.
operator_func (function, optional) – Operator function to be applied. Default is operator.mul.
operator_range (tuple, optional) – Range for generating random values for the operator. Default is (0.97, 1.03).

augment(X, apply_on='global')[source]

Augment the data by applying random operation.

Parameters:

X (ndarray) – Input data to be augmented.
apply_on (str, optional) – Level at which augmentation is applied, e.g. “features” or “samples”. Default is “global”.

Returns:

Augmented data.

Return type:

ndarray
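
A rough sketch of what a random operation with the default operator_func and operator_range might look like. It assumes one independent uniform value is drawn per element of X; how apply_on changes the shape of the random draw is not documented here, so that part is left out. The function name random_x_operation is hypothetical.

```python
import operator
import numpy as np

def random_x_operation(X, operator_func=operator.mul,
                       operator_range=(0.97, 1.03), random_state=None):
    """Combine X with random values drawn uniformly from operator_range."""
    rng = np.random.default_rng(random_state)
    low, high = operator_range
    # Assumption: one random value per element of X.
    values = rng.uniform(low, high, size=np.shape(X))
    return operator_func(np.asarray(X, dtype=float), values)

X = np.ones((3, 5))
X_mul = random_x_operation(X, random_state=0)                     # multiplicative jitter
X_add = random_x_operation(X, operator.add, (-0.1, 0.1), random_state=0)  # additive jitter
```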

class nirs4all.operators.ResampleTransformer(num_samples: int)[source]

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)[source]

transform(X)[source]

class nirs4all.operators.RobustStandardNormalVariate(axis=1, with_center=True, with_scale=True, k=1.4826, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Robust Standard Normal Variate (RSNV).

Per-sample robust centering and scaling using median and MAD:

med = median(X, axis=1, keepdims=True)
mad = median(|X - med|, axis=1, keepdims=True)
X’ = (X - med) / (k * mad)

Parameters:

axis (int, default=1) – 1 for row-wise (spectroscopy default). 0 for column-wise.
with_center (bool, default=True) – If True, subtract median.
with_scale (bool, default=True) – If True, divide by k * MAD.
k (float, default=1.4826) – Consistency constant to make MAD a robust estimator of std for Gaussian data.
copy (bool, default=True) – If False, try in-place.

Notes

MAD==0 → divide by 1 to avoid NaN.
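
The median/MAD formula and the zero-MAD rule can be written out directly in NumPy. This is an illustrative sketch for the row-wise case (axis=1) with centering and scaling both enabled; the function name robust_snv is hypothetical.

```python
import numpy as np

def robust_snv(X, k=1.4826):
    """Row-wise robust SNV: centre by the median, scale by k * MAD."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=1, keepdims=True)
    mad = np.median(np.abs(X - med), axis=1, keepdims=True)
    scale = np.where(mad == 0, 1.0, k * mad)  # MAD == 0: divide by 1 instead
    return (X - med) / scale

X = np.array([[1.0, 2.0, 3.0, 4.0, 100.0],   # one large outlier
              [5.0, 5.0, 5.0, 5.0, 5.0]])    # constant row, MAD == 0
X_rsnv = robust_snv(X)
```

In the first row the median is 3 and the MAD is 1, so the outlier at 100 does not inflate the scale the way it would inflate a standard deviation.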

fit(X, y=None)[source]

fit_transform(X, y=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(X)[source]

class nirs4all.operators.Rotate_Translate(apply_on='samples', random_state=None, *, copy=True, p_range=2, y_factor=3)[source]

Bases: Augmenter

Augmenter that rotates and translates the signal.

Vectorized implementation that processes all samples in batch.

Parameters:

apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.
random_state (int or None, optional) – Random seed for reproducibility. Default is None.
copy (bool, optional) – If True, creates a copy of the input data. Default is True.
p_range (int, optional) – Range for generating random slope values. Default is 2.
y_factor (int, optional) – Scaling factor for the initial value. Default is 3.

augment(X, apply_on='samples')[source]

Augment the data by rotating and translating the signal.

Vectorized implementation using NumPy broadcasting.

Parameters:

X (ndarray) – Input data to be augmented, shape (n_samples, n_features).
apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray

class nirs4all.operators.SampleFilter(reason: str | None = None)[source]

Bases: TransformerMixin, BaseEstimator, ABC

Base class for sample filtering operators.

Sample filters identify samples that should be excluded from training datasets. Unlike transformers that modify data, filters mark samples for exclusion without altering the underlying data.

The filtering pattern works as follows:

1. fit(): Learn filter criteria from training data (e.g., compute thresholds).
2. get_mask(): Return a boolean mask indicating which samples to KEEP.
3. transform(): No-op (filtering happens at the indexer level, not the data level).

All concrete filter implementations must override the get_mask() method.

reason

Identifier for this filter type, used to track exclusion reasons in the indexer. Default is the class name.

Type: str

Example

>>> class MyFilter(SampleFilter):
...     def __init__(self, threshold: float = 1.0):
...         super().__init__()
...         self.threshold = threshold
...
...     def fit(self, X, y=None):
...         self.mean_ = np.mean(y)
...         self.std_ = np.std(y)
...         return self
...
...     def get_mask(self, X, y=None) -> np.ndarray:
...         z_scores = np.abs((y - self.mean_) / self.std_)
...         return z_scores <= self.threshold  # True = keep

property exclusion_reason: str

Get the exclusion reason identifier for this filter.

Returns: Reason string to be stored in indexer’s exclusion_reason column.
Return type: str

fit(X: ndarray, y: ndarray | None = None) → SampleFilter[source]

Compute filter criteria from training data.

This method should learn any thresholds, statistics, or models needed to identify outliers/bad samples. Override in subclasses for filters that need to learn from data.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.

Returns:

The fitted filter instance.

Return type:

self

fit_transform(X: ndarray, y: ndarray | None = None, **fit_params) → ndarray[source]

Fit to data and return unchanged (transform is no-op).

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).
**fit_params – Additional fitting parameters (unused).

Returns:

The unchanged input array.

Return type:

np.ndarray

get_excluded_indices(X: ndarray, y: ndarray | None = None) → ndarray[source]

Get indices of samples to be excluded.

Convenience method that inverts get_mask() to return indices of samples marked for exclusion.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Integer array of indices for samples to exclude.

Return type:

np.ndarray

Example

>>> filter = YOutlierFilter(method="iqr")
>>> filter.fit(X_train, y_train)
>>> excluded_idx = filter.get_excluded_indices(X_train, y_train)
>>> print(f"Excluding {len(excluded_idx)} samples")
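
The relationship between the keep-mask and the two index helpers can be illustrated without any filter class. The mask below is a hypothetical stand-in for what get_mask() would return; the sketch only shows the inversion described above.

```python
import numpy as np

# Hypothetical keep-mask, as a filter's get_mask() might return it: True = keep.
mask = np.array([True, False, True, True, False])

kept_idx = np.flatnonzero(mask)       # indices of samples to keep
excluded_idx = np.flatnonzero(~mask)  # indices of samples to exclude

print(kept_idx.tolist())      # [0, 2, 3]
print(excluded_idx.tolist())  # [1, 4]
```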

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application.

Override in subclasses to provide filter-specific statistics (e.g., thresholds used, distribution of values, etc.).

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Dictionary containing filter statistics:

n_samples: Total number of samples
n_excluded: Number of samples to exclude
n_kept: Number of samples to keep
exclusion_rate: Ratio of excluded to total
reason: Exclusion reason string

Return type:

Dict[str, Any]
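
As an illustration, the dictionary keys listed above can all be derived from a keep-mask. The helper mask_stats is hypothetical and the handling of an empty mask is an assumption; subclasses may add further filter-specific entries.

```python
import numpy as np

def mask_stats(mask, reason):
    """Summarise a keep-mask using the keys listed for get_filter_stats()."""
    mask = np.asarray(mask, dtype=bool)
    n_samples = int(mask.size)
    n_kept = int(mask.sum())
    n_excluded = n_samples - n_kept
    return {
        "n_samples": n_samples,
        "n_excluded": n_excluded,
        "n_kept": n_kept,
        # Assumption: report 0.0 rather than dividing by zero on empty input.
        "exclusion_rate": n_excluded / n_samples if n_samples else 0.0,
        "reason": reason,
    }

stats = mask_stats([True, False, True, True, False], reason="MyFilter")
```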

get_kept_indices(X: ndarray, y: ndarray | None = None) → ndarray[source]

Get indices of samples to be kept.

Convenience method that returns indices of samples NOT marked for exclusion.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets).

Returns:

Integer array of indices for samples to keep.

Return type:

np.ndarray

abstractmethod get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

This is the core method that must be implemented by all concrete filters. Returns True for samples that should be kept, False for samples to exclude.

Parameters:

X – Feature array of shape (n_samples, n_features).
y – Target array of shape (n_samples,) or (n_samples, n_targets). May be None for X-only filters.

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample
False means EXCLUDE the sample

Return type:

np.ndarray

Raises:

NotImplementedError – If the subclass doesn’t implement this method.

transform(X: ndarray) → ndarray[source]

Transform is a no-op for filters.

Filtering happens at the indexer level, not by modifying the data array. This method returns the input unchanged to maintain sklearn compatibility.

Parameters: X – Feature array of shape (n_samples, n_features).
Returns: The unchanged input array.
Return type: np.ndarray

class nirs4all.operators.SavitzkyGolay(window_length: int = 11, polyorder: int = 3, deriv: int = 0, delta: float = 1.0, *, copy: bool = True)[source]

Bases: TransformerMixin, BaseEstimator

A class for smoothing and differentiating data using the Savitzky-Golay filter.

Parameters:

window_length (int, optional) – The length of the window used for smoothing. Default is 11.
polyorder (int, optional) – The order of the polynomial fitted to the samples within the window. Default is 3.
deriv (int, optional) – The order of the derivative to compute. Default is 0.
delta (float, optional) – The sampling distance of the data. Default is 1.0.
copy (bool, optional) – Whether to copy the input data. Default is True.

Methods:

fit(X, y=None): Fits the transformer to the data X.
transform(X, copy=None): Applies the Savitzky-Golay filter to the data X.

fit(X, y=None)[source]

Verify the X data compliance with Savitzky-Golay filter.

Parameters:

X (array-like) – The data to transform.
y (None) – Ignored.

Raises:

ValueError – If the input X is a sparse matrix.

Returns:

The fitted object.

Return type:

SavitzkyGolay

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → SavitzkyGolay

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X, copy=None)[source]

Apply the Savitzky-Golay filter to the data X.

Parameters:

X (array-like) – The data to transform.
copy (bool or None, optional) – Whether to copy the input data.

Returns:

The transformed data.

Return type:

numpy.ndarray
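
To show how window_length, polyorder, deriv and delta interact, here is a NumPy-only sketch of the Savitzky-Golay idea: fit a polynomial to each window by least squares and evaluate it, or its derivative, at the window centre. The helper names are hypothetical, edge points are simply dropped, and this is not presented as the transformer's implementation.

```python
import math
import numpy as np

def savgol_coeffs(window_length=11, polyorder=3, deriv=0, delta=1.0):
    """Weights giving the deriv-th derivative of the local fit at the window centre."""
    half = window_length // 2
    t = np.arange(-half, half + 1, dtype=float)
    A = np.vander(t, polyorder + 1, increasing=True)  # columns: t**0, t**1, ...
    # Row `deriv` of the pseudo-inverse maps window values to the coefficient of t**deriv.
    a_deriv = np.linalg.pinv(A)[deriv]
    return a_deriv * math.factorial(deriv) / delta**deriv

def savgol_interior(x, window_length=11, polyorder=3, deriv=0, delta=1.0):
    """Filter a 1-D signal, keeping only points that have a full window."""
    c = savgol_coeffs(window_length, polyorder, deriv, delta)
    # np.convolve flips its second argument, so reverse c to get a correlation.
    return np.convolve(x, c[::-1], mode="valid")

# A cubic sampled every 0.5 units is reproduced exactly by a cubic fit.
t = np.arange(40) * 0.5
y = 0.01 * t**3 - 0.2 * t**2 + t + 1.0
smoothed = savgol_interior(y)                      # aligns with y[5:-5]
first_deriv = savgol_interior(y, deriv=1, delta=0.5)
```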

class nirs4all.operators.SimpleScale(copy=True)[source]

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)[source]

inverse_transform(X)[source]

partial_fit(X, y=None)[source]

transform(X)[source]

class nirs4all.operators.Spline_Curve_Simplification(apply_on='samples', random_state=None, *, copy=True, spline_points=None, uniform=False)[source]

Bases: Augmenter

Class to simplify a 1D signal using B-spline interpolation along the curve.

Optimized implementation with pre-allocated output arrays.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).
spline_points (int, optional) – Number of spline points for simplification. Default is None: the length of the sample / 4.
uniform (bool, optional) – If True, the spline points are uniformly spaced. Default is False.

augment(X, apply_on='samples')[source]

Select regularly spaced points on the x-axis and adjust a spline.

Optimized with pre-allocated output array.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “features” (default: “samples”).

Returns:

Augmented data.

Return type:

ndarray

class nirs4all.operators.Spline_Smoothing(apply_on='samples', random_state=None, *, copy=True)[source]

Bases: Augmenter

Class to apply a smoothing spline to a 1D signal.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

augment(X, apply_on='samples')[source]

Apply a smoothing spline to the data.

Optimized implementation with pre-allocated output array.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

Returns:

Augmented data.

Return type:

ndarray

class nirs4all.operators.Spline_X_Perturbations(apply_on='samples', random_state=None, *, copy=True, spline_degree=3, perturbation_density=0.05, perturbation_range=(-10, 10))[source]

Bases: Augmenter

Class to apply a perturbation to a 1D signal using B-spline interpolation.

Optimized implementation with pre-generated random parameters.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).
spline_degree (int, optional) – Degree of the spline. Default is 3 (cubic).
perturbation_density (float, optional) – Density of perturbation points relative to data size. Default is 0.05.
perturbation_range (tuple, optional) – Range of perturbation values (min, max). Default is (-10, 10).

augment(X, apply_on='samples')[source]

Augment the data with a perturbation using B-spline interpolation.

Optimized with pre-allocated arrays and batch random generation.

Parameters:

X (ndarray) – Input data to be augmented.
apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray

class nirs4all.operators.Spline_X_Simplification(apply_on='samples', random_state=None, *, copy=True, spline_points=None, uniform=False)[source]

Bases: Augmenter

Class to simplify a 1D signal using B-spline interpolation along the x-axis.

Optimized implementation with pre-generated random parameters.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).
spline_points (int, optional) – Number of spline points for simplification. Default is None: the length of the sample / 4.
uniform (bool, optional) – If True, the spline points are uniformly spaced. Default is False.

augment(X, apply_on='samples')[source]

Select randomly spaced points along the x-axis and adjust a spline.

Optimized with pre-allocated arrays and batch random generation.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

Returns:

Augmented data.

Return type:

ndarray

class nirs4all.operators.Spline_Y_Perturbations(apply_on='samples', random_state=None, *, copy=True, spline_points=None, perturbation_intensity=0.005)[source]

Bases: Augmenter

Augment the data with a perturbation on the y-axis using B-spline interpolation.

Optimized implementation with pre-generated random parameters.

Parameters:

X (ndarray) – Input data.
apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).
spline_points (int, optional) – Number of spline points. Default is None (uses sample length / 2).
perturbation_intensity (float, optional) – Intensity of perturbation relative to max value. Default is 0.005.

augment(X, apply_on='samples')[source]

Augment the data with a perturbation on the y-axis using B-spline interpolation.

Optimized with pre-allocated arrays and batch random generation.

Parameters:

X (ndarray) – Input data to be augmented.
apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray

class nirs4all.operators.StandardNormalVariate(axis=1, with_mean=True, with_std=True, ddof=0, copy=True)[source]

Bases: TransformerMixin, BaseEstimator

Standard Normal Variate (SNV) transformation.

SNV is a row-wise normalization technique commonly used in spectroscopy to remove scatter effects. Each sample (row) is centered and scaled independently.

For each sample: SNV = (X - mean(X)) / std(X)

Parameters:

axis (int, default=1) – Axis along which to compute mean and standard deviation.
  - axis=1: Row-wise (default, standard SNV behavior for spectroscopy)
  - axis=0: Column-wise (equivalent to StandardScaler)
with_mean (bool, default=True) – If True, center the data before scaling.
with_std (bool, default=True) – If True, scale the data to unit variance.
ddof (int, default=0) – Delta Degrees of Freedom for standard deviation calculation.
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead.

Examples

>>> from nirs4all.operators.transforms import StandardNormalVariate
>>> import numpy as np
>>> X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
>>> snv = StandardNormalVariate()
>>> X_transformed = snv.fit_transform(X)
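
For reference, the row-wise formula can be computed by hand in NumPy on the same input. This sketch covers the default settings (axis=1, ddof=0) and assumes no row has zero standard deviation.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

mean = X.mean(axis=1, keepdims=True)
std = X.std(axis=1, ddof=0, keepdims=True)
X_snv = (X - mean) / std

# Every row has the same shape around its own mean, so all rows
# map to approximately [-1.2247, 0.0, 1.2247].
print(np.round(X_snv, 4))
```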

fit(X, y=None)[source]

Fit the StandardNormalVariate transformer.

For SNV, this is a no-op as the transformation is computed independently for each sample.

Parameters:

X (array-like of shape (n_samples, n_features)) – The training data.
y (None) – Ignored variable.

Returns:

self – Returns the instance itself.

Return type:

object

fit_transform(X, y=None)[source]

Fit to data, then transform it.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data.
y (None) – Ignored variable.

Returns:

X_transformed – The transformed data.

Return type:

ndarray of shape (n_samples, n_features)

transform(X)[source]

Perform SNV transformation.

Parameters: X (array-like of shape (n_samples, n_features)) – The input data to be transformed.
Returns: X_transformed – The transformed data.
Return type: ndarray of shape (n_samples, n_features)

class nirs4all.operators.Wavelet(wavelet: str = 'haar', mode: str = 'periodization', *, copy: bool = True)[source]

Bases: TransformerMixin, BaseEstimator

Single level Discrete Wavelet Transform.

Performs a discrete wavelet transform on data, using a wavelet function.

Parameters:

wavelet (Wavelet object or name, default='haar') – Wavelet to use. Families include Haar, Daubechies, Symlets, Coiflets, Biorthogonal, Reverse biorthogonal, and Discrete Meyer (FIR approximation), among others.
mode (str, optional, default='periodization') – Signal extension mode.
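To make "single level" concrete, here is a hedged NumPy sketch of one Haar decomposition step applied to each spectrum. It is not the class's implementation: the helper name `haar_dwt_sketch` is hypothetical, it requires an even number of wavelengths (so no signal extension is needed), and which coefficients the class actually returns is not stated here.

```python
import numpy as np


def haar_dwt_sketch(spectra):
    """One level of the Haar DWT along the wavelength axis (axis 1)."""
    spectra = np.asarray(spectra, dtype=float)
    if spectra.shape[1] % 2:
        raise ValueError("this sketch needs an even number of wavelengths")
    even, odd = spectra[:, 0::2], spectra[:, 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: scaled pairwise averages
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: scaled pairwise differences
    return approx, detail


spectra = np.array([[1.0, 2.0, 3.0, 4.0],
                    [4.0, 4.0, 6.0, 2.0]])
approx, detail = haar_dwt_sketch(spectra)
# Each output has half as many columns as the input, and the transform is
# orthonormal, so the energy (sum of squares) of every row is preserved.
```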

fit(X, y=None)[source]

Verify that X is compatible with the wavelet transform.

Parameters:

X (array-like, spectra) – The data to transform.
y (None) – Ignored.

Raises:

ValueError – If the input X is a sparse matrix.

Returns:

The fitted object.

Return type:

Wavelet

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → Wavelet

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X, copy=None)[source]

Apply wavelet transform to the data X.

Parameters:

X (array-like) – The data to transform.
copy (bool or None, optional) – Whether to copy the input data.

Returns:

The transformed data.

Return type:

numpy.ndarray

class nirs4all.operators.YOutlierFilter(method: Literal['iqr', 'zscore', 'percentile', 'mad'] = 'iqr', threshold: float = 1.5, lower_percentile: float = 1.0, upper_percentile: float = 99.0, reason: str | None = None)[source]

Bases: SampleFilter

Filter samples with outlier target values.

This filter identifies samples whose y-values are statistical outliers using one of several detection methods. It’s commonly used to remove samples with extreme or erroneous target values before training.

Supported methods:

“iqr”: Interquartile Range method (default)
“zscore”: Z-score (standard deviations from mean)
“percentile”: Direct percentile cutoffs
“mad”: Median Absolute Deviation (robust to outliers)
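The sketch below shows how lower and upper bounds are conventionally derived for each of these methods. It uses textbook definitions and is an assumption about the general approach, not the filter's actual code; the function name `outlier_bounds` is hypothetical, and whether the library rescales the MAD is not documented here.

```python
import numpy as np


def outlier_bounds(y, method="iqr", threshold=1.5,
                   lower_percentile=1.0, upper_percentile=99.0):
    """Return (lower, upper) bounds using conventional definitions."""
    y = np.asarray(y, dtype=float)
    y = y[~np.isnan(y)]  # ignore missing targets
    if method == "iqr":
        q1, q3 = np.percentile(y, [25, 75])
        spread = q3 - q1
        return q1 - threshold * spread, q3 + threshold * spread
    if method == "zscore":
        center, spread = y.mean(), y.std()
        return center - threshold * spread, center + threshold * spread
    if method == "percentile":
        lo, hi = np.percentile(y, [lower_percentile, upper_percentile])
        return lo, hi
    if method == "mad":
        center = np.median(y)
        spread = np.median(np.abs(y - center))  # unscaled MAD (assumption)
        return center - threshold * spread, center + threshold * spread
    raise ValueError(f"Unknown method: {method!r}")


y = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
lower, upper = outlier_bounds(y, method="iqr", threshold=1.5)
# Q1 = 11.25 and Q3 = 13.75, so the bounds are 7.5 and 17.5;
# the value 100.0 falls outside them.
```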

method

Outlier detection method

Type: str

threshold

Method-specific threshold

Type: float

lower_percentile

Lower cutoff for percentile method

Type: float

upper_percentile

Upper cutoff for percentile method

Type: float

Example

>>> from nirs4all.operators.filters import YOutlierFilter
>>>
>>> # IQR method (default, threshold=1.5 is standard)
>>> filter_iqr = YOutlierFilter(method="iqr", threshold=1.5)
>>>
>>> # Z-score method (threshold=3.0 is common)
>>> filter_zscore = YOutlierFilter(method="zscore", threshold=3.0)
>>>
>>> # Percentile method
>>> filter_pct = YOutlierFilter(
...     method="percentile",
...     lower_percentile=1.0,
...     upper_percentile=99.0
... )
>>>
>>> # Fit and get mask (X_train and y_train are assumed to be existing NumPy arrays)
>>> filter_iqr.fit(X_train, y_train)
>>> mask = filter_iqr.get_mask(X_train, y_train)  # True = keep

In a pipeline:

>>> pipeline = [
...     {
...         "sample_filter": {
...             "filters": [YOutlierFilter(method="iqr", threshold=1.5)],
...         }
...     },
...     "snv",
...     "model:PLSRegression",
... ]

__repr__() → str[source]: Return string representation.

property exclusion_reason: str: Get descriptive exclusion reason.

fit(X: ndarray, y: ndarray | None = None) → YOutlierFilter[source]

Compute outlier detection bounds from training data.

Parameters:

X – Feature array of shape (n_samples, n_features). Not used but required for sklearn compatibility.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.

Returns:

The fitted filter instance.

Return type:

self

Raises:

ValueError – If y is None (required for Y-based filtering).
ValueError – If y has no valid (non-NaN) values.

get_filter_stats(X: ndarray, y: ndarray | None = None) → Dict[str, Any][source]

Get statistics about filter application including method-specific details.

Parameters:

X – Feature array.
y – Target array.

Returns:

Dict containing:

Base stats (n_samples, n_excluded, n_kept, exclusion_rate)
method: Detection method used
threshold: Threshold value
lower_bound: Computed lower bound
upper_bound: Computed upper bound
center: Central value (mean/median)
scale: Scale measure (std/IQR/MAD)
y_range: (min, max) of input y values

Return type:

Dict[str, Any]

get_mask(X: ndarray, y: ndarray | None = None) → ndarray[source]

Compute boolean mask indicating which samples to KEEP.

Parameters:

X – Feature array of shape (n_samples, n_features). Not used but required for API consistency.
y – Target array of shape (n_samples,) or (n_samples, n_targets). Required for Y-based filtering.

Returns:

Boolean array of shape (n_samples,) where:

True means KEEP the sample (within bounds)
False means EXCLUDE the sample (outside bounds)

Return type:

np.ndarray

Raises:

ValueError – If y is None.
ValueError – If filter has not been fitted (bounds not set).
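The keep-mask convention described above can be illustrated without the library. In this sketch the bounds are hypothetical stand-ins for the values a fitted filter would hold, and the summary variables simply mirror the names of the base statistics listed under get_filter_stats.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(6, 2)          # 6 samples, 2 features
y = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])  # last target is extreme

# Hypothetical bounds; a fitted filter would compute these from training data.
lower_bound, upper_bound = 7.5, 17.5

mask = (y >= lower_bound) & (y <= upper_bound)  # True = keep, False = exclude
X_kept, y_kept = X[mask], y[mask]

n_samples = y.shape[0]
n_kept = int(mask.sum())
n_excluded = n_samples - n_kept
exclusion_rate = n_excluded / n_samples
```

Indexing both X and y with the same mask keeps features and targets aligned after filtering.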

nirs4all.operators.baseline(spectra)[source]

Removes baseline (mean) from each spectrum.

Parameters: spectra (numpy.ndarray) – NIRS data matrix.
Returns: Mean-centered NIRS data matrix.
Return type: numpy.ndarray

nirs4all.operators.derivate(spectra, order=1, delta=1)[source]

Computes Nth order derivatives with the desired spacing using numpy.gradient.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
order (float, optional) – Order of the derivative, by default 1.
delta (int, optional) – Spacing between points used for the derivative (in samples), by default 1.

Returns:

spectra – Differentiated NIR spectra.

Return type:

numpy.ndarray
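A minimal sketch of a higher-order derivative built from numpy.gradient follows. It assumes spectra are stored as rows, that the derivative runs along the wavelength axis, and that order N means applying the gradient N times; the helper name `derivate_sketch` is hypothetical and the library's function may differ in these details.

```python
import numpy as np


def derivate_sketch(spectra, order=1, delta=1):
    """Apply numpy.gradient along axis 1, `order` times, with spacing `delta`."""
    out = np.asarray(spectra, dtype=float)
    for _ in range(int(order)):
        out = np.gradient(out, delta, axis=1)
    return out


wavelengths = np.arange(5, dtype=float)
spectra = np.vstack([wavelengths ** 2, 3.0 * wavelengths])  # 2 spectra, 5 points
first = derivate_sketch(spectra, order=1)
second = derivate_sketch(spectra, order=2)
# first[0] is [1, 2, 4, 6, 7]: central differences inside, one-sided at the edges.
# first[1] is 3 everywhere, so second[1] is 0 everywhere.
```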

nirs4all.operators.detrend(spectra, bp=0)[source]

Perform spectral detrending to remove linear trend from data.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
bp (list, optional) – A sequence of break points. If given, an individual linear fit is performed for each part of data between two break points. Break points are specified as indices into data. Default is 0.

Returns:

Detrended NIR spectra.

Return type:

numpy.ndarray
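Linear detrending amounts to fitting a straight line to each spectrum and subtracting it. The NumPy sketch below covers only the case without break points and assumes spectra are stored as rows; the helper name `detrend_sketch` is hypothetical. The bp parameter resembles the argument of the same name in scipy.signal.detrend, but whether this function delegates to SciPy is an assumption.

```python
import numpy as np


def detrend_sketch(spectra):
    """Subtract a least-squares straight line from each spectrum (row)."""
    spectra = np.asarray(spectra, dtype=float)
    x = np.arange(spectra.shape[1], dtype=float)
    # With a 2-D y, polyfit fits one line per column, hence the transpose.
    slope, intercept = np.polyfit(x, spectra.T, deg=1)
    trend = np.outer(slope, x) + intercept[:, None]
    return spectra - trend


x = np.arange(50, dtype=float)
peak = np.exp(-0.5 * ((x - 25.0) / 3.0) ** 2)
spectra = np.vstack([peak + 0.1 * x + 2.0,   # rising baseline
                     peak - 0.05 * x])       # falling baseline
detrended = detrend_sketch(spectra)
# Both rows share the same peak and differ only by a straight line,
# so they become identical after detrending.
```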

nirs4all.operators.gaussian(spectra, order=2, sigma=1)[source]

Applies a 1D Gaussian filter using scipy.ndimage.gaussian_filter1d.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
order (float, optional) – Order of the derivative, by default 2.
sigma (int, optional) – Standard deviation of the Gaussian kernel, by default 1.

Returns:

Gaussian-filtered NIR spectra.

Return type:

numpy.ndarray
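For orientation, the following sketch calls scipy.ndimage.gaussian_filter1d directly. How this function maps its arguments onto the SciPy call (in particular the axis) is an assumption; the example filters along axis 1, treating rows as spectra.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.arange(101, dtype=float)
peak = np.exp(-0.5 * ((x - 50.0) / 5.0) ** 2)
spectra = np.vstack([peak, np.ones_like(x)])  # one peak, one flat spectrum

# order=0 smooths; order=2 convolves with the second derivative of a Gaussian.
smoothed = gaussian_filter1d(spectra, sigma=1, order=0, axis=1)
second_deriv = gaussian_filter1d(spectra, sigma=1, order=2, axis=1)
# The flat spectrum is unchanged by smoothing, and the second derivative
# is negative at the top of the peak (index 50), where the curve bends down.
```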

nirs4all.operators.msc(spectra, scaled=True)[source]

Performs multiplicative scatter correction (MSC) using the mean spectrum as the reference.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
scaled (bool) – Whether to scale the data. Defaults to True.

Returns:

Scatter-corrected NIR spectra.

Return type:

numpy.ndarray
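The classic MSC recipe regresses each spectrum on a reference (here the mean spectrum) and then removes the fitted offset and slope. The sketch below shows that recipe in NumPy under the assumption that rows are spectra; the helper name `msc_sketch` is hypothetical, and the effect of the scaled flag is not modeled because it is not described here.

```python
import numpy as np


def msc_sketch(spectra):
    """Correct each spectrum as (x - intercept) / slope, fitted against the mean."""
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)  # mean spectrum across samples
    # Fit spectra_i ~ intercept_i + slope_i * reference, one fit per spectrum.
    slope, intercept = np.polyfit(reference, spectra.T, deg=1)
    return (spectra - intercept[:, None]) / slope[:, None]


base = np.sin(np.linspace(0.0, 3.0, 40)) + 2.0
spectra = np.vstack([1.0 * base,          # unchanged
                     1.5 * base + 0.3,    # stronger scatter, positive offset
                     0.5 * base - 0.3])   # weaker scatter, negative offset
corrected = msc_sketch(spectra)
# All three rows are affine copies of the same shape, so after correction
# each one coincides with the mean spectrum.
```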

nirs4all.operators.norml(spectra, feature_range=(-1, 1))[source]

Perform spectral normalization with user-defined limits.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
feature_range (tuple (min, max), default=(-1, 1)) – Desired range of transformed data. If range min and max equals -1, linalg normalization is applied; otherwise, user bounds-defined normalization is applied.

Returns:

spectra – Normalized NIR spectra.

Return type:

numpy.ndarray

nirs4all.operators.savgol(spectra: ndarray, window_length: int = 11, polyorder: int = 3, deriv: int = 0, delta: float = 1.0) → ndarray[source]

Perform Savitzky–Golay filtering on the data (also calculates derivatives). This function is a wrapper for scipy.signal.savgol_filter.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
window_length (int) – Size of the filter window in samples (default 11).
polyorder (int) – Order of the polynomial estimation (default 3).
deriv (int) – Order of the derivative to compute (default 0, meaning smoothing only).
delta (float) – Sampling distance of the data (default 1.0). Only relevant when deriv is greater than 0.

Returns:

NIRS data smoothed with Savitzky-Golay filtering.

Return type:

numpy.ndarray
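Since this function is described as a wrapper around scipy.signal.savgol_filter, the underlying call can be sketched directly. The axis choice (rows as spectra) and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0.0, 1.0, 101)
step = x[1] - x[0]
rng = np.random.default_rng(0)
clean = np.vstack([np.sin(2 * np.pi * x), x ** 2])       # 2 spectra, 101 points
noisy = clean + rng.normal(scale=0.02, size=clean.shape)

# Smoothing only (deriv=0): fit a cubic in each 11-point window.
smoothed = savgol_filter(noisy, window_length=11, polyorder=3, axis=1)

# First derivative: delta must match the sampling distance of x.
first_deriv = savgol_filter(noisy, window_length=11, polyorder=3,
                            deriv=1, delta=step, axis=1)
```

window_length must be larger than polyorder, and a polynomial of degree up to polyorder passes through the filter unchanged, which is a convenient sanity check when choosing parameters.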

nirs4all.operators.spl_norml(spectra)[source]

Perform simple spectral normalization.

Parameters: spectra (numpy.ndarray) – NIRS data matrix.
Returns: spectra – Normalized NIR spectra.
Return type: numpy.ndarray

nirs4all.operators.wavelet_transform(spectra: ndarray, wavelet: str, mode: str = 'periodization') → ndarray[source]

Computes a wavelet transform using PyWavelets.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.
wavelet (str) – Name of the wavelet to use.
mode (str) – Signal extension mode.

Returns:

Wavelet-transformed and resampled spectra.

Return type:

numpy.ndarray