nirs4all.operators.splitters package

Module contents

Splitters module for presets.

This module contains data splitting presets and utilities.

class nirs4all.operators.splitters.BinnedStratifiedGroupKFold(n_splits=5, n_bins=10, strategy='quantile', shuffle=False, random_state=None)[source]

Bases: CustomSplitter

Stratified Group K-Fold cross-validator with binned continuous targets.

This splitter combines:
- KBinsDiscretizer to bin continuous y values into discrete categories
- StratifiedGroupKFold to ensure stratified splits while respecting groups

This is useful for regression tasks where you want stratified sampling (balanced target distribution across folds) while ensuring samples from the same group are never split across train and test sets.

Parameters:

n_splits (int, default=5) – Number of folds. Must be at least 2.
n_bins (int, default=10) – Number of bins for discretizing continuous y values. More bins give finer stratification, but splitting may fail on small datasets.
strategy ({'uniform', 'quantile', 'kmeans'}, default='quantile') – Strategy used to define the widths of the bins:
- ‘uniform’: All bins have identical widths.
- ‘quantile’: All bins have the same number of points (recommended for imbalanced distributions).
- ‘kmeans’: Values in each bin have the same nearest center of a 1D k-means cluster.
shuffle (bool, default=False) – Whether to shuffle each class’s samples before splitting.
random_state (int or None, default=None) – Random state for reproducibility when shuffle=True.
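
To make the difference between the ‘uniform’ and ‘quantile’ strategies concrete, here is a NumPy-only sketch on a right-skewed target. The function `bin_targets` is a hypothetical stand-in for the discretization step and does not cover ‘kmeans’.

```python
import numpy as np


def bin_targets(y, n_bins, strategy):
    """Return integer bin labels in [0, n_bins) for a 1D target array."""
    y = np.asarray(y, dtype=float)
    if strategy == "uniform":
        # Equal-width bins between the minimum and maximum of y.
        edges = np.linspace(y.min(), y.max(), n_bins + 1)
    elif strategy == "quantile":
        # Equal-population bins: edges sit at evenly spaced quantiles.
        edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        raise ValueError(f"unsupported strategy in this sketch: {strategy!r}")
    # Only the interior edges are needed to assign labels 0..n_bins-1.
    return np.digitize(y, edges[1:-1])


rng = np.random.default_rng(0)
y = rng.exponential(size=1000)  # right-skewed target

uniform_counts = np.bincount(bin_targets(y, 5, "uniform"), minlength=5)
quantile_counts = np.bincount(bin_targets(y, 5, "quantile"), minlength=5)
print("uniform :", uniform_counts)
print("quantile:", quantile_counts)
```

With a skewed target, equal-width bins put most samples in the first bin, while quantile bins stay balanced, which is why ‘quantile’ is the safer choice for stratification.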

Examples

Basic usage with regression targets and groups:

>>> from nirs4all.operators.splitters import BinnedStratifiedGroupKFold
>>> import numpy as np
>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)  # Continuous target
>>> groups = np.repeat(np.arange(20), 5)  # 20 groups, 5 samples each
>>> splitter = BinnedStratifiedGroupKFold(n_splits=5, n_bins=5)
>>> for train_idx, test_idx in splitter.split(X, y, groups):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")

With quantile binning for imbalanced targets:

>>> splitter = BinnedStratifiedGroupKFold(
...     n_splits=3,
...     n_bins=10,
...     strategy='quantile',
...     shuffle=True,
...     random_state=42
... )

Notes

The number of bins should be chosen based on the dataset size and the number of unique groups. Too many bins may cause stratification to fail.
Groups are never split across folds: all samples from a group are in either train or test, never both.
Stratification is approximate when groups have varying sizes.

See also

KBinsStratifiedSplitter: Single train/test split with binned stratification.
sklearn.model_selection.StratifiedGroupKFold: For categorical targets.

get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splitting iterations.

Parameters:

X (object) – Ignored, exists for compatibility.
y (object) – Ignored, exists for compatibility.
groups (object) – Ignored, exists for compatibility.

Returns:

n_splits – Number of folds.

Return type:

int

split(X, y=None, groups=None)[source]

Generate train/test indices for each fold.

Parameters:

X (array-like of shape (n_samples, n_features)) – Feature matrix.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Continuous target values to be binned for stratification.
groups (array-like of shape (n_samples,)) – Group labels for samples. Samples with the same group label will always be in the same fold.

Yields:

train (ndarray) – Training set indices for this fold.
test (ndarray) – Test set indices for this fold.

class nirs4all.operators.splitters.GroupedSplitterWrapper(splitter, aggregation='mean', y_aggregation=None)[source]

Bases: BaseCrossValidator

Wraps any sklearn-compatible splitter to add group-awareness.

This wrapper aggregates samples by group into “virtual samples”, passes them to the inner splitter, and expands the fold indices back to the original sample space. This ensures that all samples from the same group are always in the same fold (train or test), preventing data leakage.
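
The aggregate, split, expand sequence described above can be sketched as follows. This is an illustration under stated assumptions (mean aggregation for both X and y, a hypothetical function name `grouped_split`), not the wrapper's actual code.

```python
import numpy as np
from sklearn.model_selection import KFold


def grouped_split(splitter, X, y, groups):
    """Hypothetical sketch: aggregate by group, split, expand to samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # inverse[i] is the position of sample i's group in unique_groups.
    unique_groups, inverse = np.unique(groups, return_inverse=True)
    n_groups = len(unique_groups)
    # One "virtual sample" per group: the mean of its members.
    X_virtual = np.vstack([X[inverse == g].mean(axis=0) for g in range(n_groups)])
    y_virtual = np.array([y[inverse == g].mean() for g in range(n_groups)])
    for train_g, test_g in splitter.split(X_virtual, y_virtual):
        # Expand group-level folds back to original sample indices.
        train_idx = np.flatnonzero(np.isin(inverse, train_g))
        test_idx = np.flatnonzero(np.isin(inverse, test_g))
        yield train_idx, test_idx


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)
groups = np.repeat(np.arange(20), 5)  # 20 groups, 5 samples each
folds = list(grouped_split(KFold(n_splits=5), X, y, groups))
```

Because the inner splitter only ever sees one row per group, it cannot place members of the same group on both sides of a split.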

Parameters:

splitter (BaseCrossValidator) – Any sklearn-compatible cross-validator (e.g., KFold, ShuffleSplit, StratifiedKFold).
aggregation (str, default="mean") – Method for aggregating X features within groups:
- “mean”: Use group centroid (average of all samples)
- “median”: Use group median (robust to outliers)
- “first”: Use first sample in each group (fast, no aggregation)
y_aggregation (str or None, default=None) – Method for aggregating y values within groups. If None, inferred from splitter type:
- “mean”: For regression (continuous y)
- “mode”: For classification (categorical y)
- “first”: Use first y value in group
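
The three y aggregation modes listed above behave differently on the same data. The sketch below uses a hypothetical helper `aggregate_y`; breaking mode ties in favor of the smallest label is an assumption of this sketch, not documented behavior.

```python
import numpy as np


def aggregate_y(y, groups, method):
    """Collapse y to one value per group (hypothetical helper)."""
    y = np.asarray(y)
    unique_groups, inverse = np.unique(groups, return_inverse=True)
    out = []
    for g in range(len(unique_groups)):
        vals = y[inverse == g]
        if method == "mean":
            out.append(vals.mean())
        elif method == "mode":
            # Most frequent label; ties go to the smallest label here.
            labels, counts = np.unique(vals, return_counts=True)
            out.append(labels[np.argmax(counts)])
        elif method == "first":
            out.append(vals[0])
        else:
            raise ValueError(f"unknown method: {method!r}")
    return unique_groups, np.asarray(out)


groups = np.array([0, 0, 0, 1, 1, 1])
y = np.array([1, 1, 2, 3, 4, 4])

_, y_mean = aggregate_y(y, groups, "mean")
_, y_mode = aggregate_y(y, groups, "mode")
_, y_first = aggregate_y(y, groups, "first")
```

For class labels, “mean” can produce values that are not valid classes (here 4/3 for the first group), which is why “mode” is the natural choice for classification.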

Examples

>>> from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold
>>> from nirs4all.operators.splitters import GroupedSplitterWrapper
>>> import numpy as np
>>>
>>> # Basic usage with KFold
>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)
>>> groups = np.repeat(np.arange(20), 5)  # 20 groups, 5 samples each
>>>
>>> wrapper = GroupedSplitterWrapper(KFold(n_splits=5))
>>> for train_idx, test_idx in wrapper.split(X, y, groups=groups):
...     # train_idx and test_idx are original sample indices
...     # All samples from the same group are in the same fold
...     train_groups = set(groups[train_idx])
...     test_groups = set(groups[test_idx])
...     assert len(train_groups & test_groups) == 0  # No overlap
>>>
>>> # Usage with ShuffleSplit
>>> wrapper = GroupedSplitterWrapper(ShuffleSplit(n_splits=1, test_size=0.2))
>>> for train_idx, test_idx in wrapper.split(X, y, groups=groups):
...     pass  # Groups are respected
>>>
>>> # Usage with StratifiedKFold (stratifies on aggregated y)
>>> y_class = np.random.randint(0, 3, 100)
>>> wrapper = GroupedSplitterWrapper(
...     StratifiedKFold(n_splits=3),
...     y_aggregation="mode"
... )
>>> for train_idx, test_idx in wrapper.split(X, y_class, groups=groups):
...     pass  # Groups are respected, stratification on group mode

Notes

The wrapper is transparent when no groups are provided: it delegates to the inner splitter without any aggregation.

See also

sklearn.model_selection.GroupKFold: Native group-aware K-fold splitter.
sklearn.model_selection.GroupShuffleSplit: Native group-aware shuffle split.
nirs4all.operators.splitters.SPXYGFold: SPXY-based group-aware splitter.

__repr__()[source]: Return string representation of the wrapper.

get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splitting iterations.

Parameters:

X (object) – Ignored, exists for compatibility.
y (object) – Ignored, exists for compatibility.
groups (object) – Ignored, exists for compatibility.

Returns:

n_splits – Number of folds/iterations from the inner splitter.

Return type:

int

split(X, y=None, groups=None)[source]

Generate train/test indices with group-awareness.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,), default=None) – Target values.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples. If None, delegates to the inner splitter without any aggregation.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nirs4all.operators.splitters.KBinsStratifiedSplitter(test_size, random_state=None, n_bins=10, strategy='uniform', encode='ordinal')[source]

Bases: CustomSplitter

Implements stratified sampling using KBins discretization.
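
One way to picture stratified sampling over bins is to discretize y and then draw the same fraction of test samples from every bin. The sketch below does this with NumPy only, using equal-width bins to match the 'uniform' default; the function name and the per-bin rounding rule are assumptions, and the class itself may proceed differently.

```python
import numpy as np


def stratified_binned_split(y, test_size, n_bins=10, seed=None):
    """Single train/test split drawing test_size from each y bin (sketch)."""
    y = np.asarray(y, dtype=float)
    # Equal-width bins; interior edges give labels 0..n_bins-1.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    labels = np.digitize(y, edges[1:-1])
    rng = np.random.default_rng(seed)
    test = []
    for b in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == b))
        n_test = int(round(test_size * len(members)))
        test.extend(members[:n_test].tolist())
    test_idx = np.sort(np.array(test, dtype=int))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx, labels


y = np.linspace(0.0, 1.0, 100)
train_idx, test_idx, labels = stratified_binned_split(y, test_size=0.2, seed=0)
```

Every bin contributes to the test set in proportion to its size, so the test targets span the same range as the training targets.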

get_n_splits(X=None, y=None, groups=None)[source]: Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nirs4all.operators.splitters.KMeansSplitter(test_size, random_state=None, pca_components=None, metric='euclidean')[source]

Bases: CustomSplitter

Implements sampling using K-Means clustering.

get_n_splits(X=None, y=None, groups=None)[source]: Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nirs4all.operators.splitters.KennardStoneSplitter(test_size, random_state=None, pca_components=None, metric='euclidean')[source]

Bases: CustomSplitter

Implements the Kennard-Stone sampling method based on maximum minimum distance.
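
The maximum minimum distance rule can be sketched in a few lines of NumPy: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbor is farthest away. The function name is hypothetical, the metric is fixed to Euclidean, and treating the first selected samples as the training set follows the usual Kennard-Stone convention rather than anything stated here.

```python
import numpy as np


def kennard_stone_order(X):
    """Return all sample indices in Kennard-Stone selection order (sketch)."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Seed with the two most distant samples.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while remaining:
        # Distance from each remaining sample to its nearest selected one.
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        # Pick the sample that is farthest from everything selected so far.
        nxt = remaining[int(np.argmax(min_d))]
        selected.append(nxt)
        remaining.remove(nxt)
    return np.array(selected)


X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
order = kennard_stone_order(X)
n_train = 3
train_idx, test_idx = order[:n_train], order[n_train:]
```

On this toy data the extremes (0 and 10) are selected first, followed by the point that best fills the largest gap.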

get_n_splits(X=None, y=None, groups=None)[source]: Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nirs4all.operators.splitters.SPXYGFold(n_splits=5, test_size=None, metric='euclidean', y_metric='euclidean', aggregation='mean', pca_components=None, random_state=None)[source]

Bases: CustomSplitter

SPXY-based K-Fold splitter with group awareness.

Combines:
- SPXY (joint X-Y distance) or Kennard-Stone (X-only) selection
- Group constraints (samples in same group stay together)
- K-fold cross-validation

This splitter extends the SPXY algorithm to support:
1. Classification tasks (using appropriate distance metrics for categorical y)
2. Group-aware splitting (treating groups as atomic units)
3. K-fold cross-validation (not just single train/test split)

The algorithm ensures uniform coverage of the feature space (and optionally target space) across all folds, which is particularly useful for spectroscopy data where sample distribution matters for model generalization.

Parameters:

n_splits (int, default=5) – Number of folds. Use 1 for a single train/test split; cross-validation requires at least 2.
test_size (float, default=None) – Proportion of samples for test set. Only used when n_splits=1. If None with n_splits=1, defaults to 0.25.
metric (str, default="euclidean") – Distance metric for X-space. Any metric supported by scipy.spatial.distance.cdist: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
y_metric (str or None, default="euclidean") – Distance metric for Y-space:
- “euclidean”: For regression (continuous y); the default SPXY behavior
- “hamming”: For classification (categorical y); treats all class differences equally
- None: Ignore Y (pure Kennard-Stone, X-only selection)
aggregation (str, default="mean") – Method for group aggregation when groups are provided:
- “mean”: Use group centroid (mean of all samples in group)
- “median”: Use group median (robust to outliers)
pca_components (int or None, default=None) – If provided, apply PCA to reduce X dimensionality before distance computation. Useful for high-dimensional spectral data.
random_state (int or None, default=None) – Random state for reproducibility. Only used for tie-breaking when multiple samples have equal distances.
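
The effect of y_metric is easiest to see on the pairwise distance that drives selection. In SPXY, the X and Y distances are each divided by their maximum and then summed. The sketch below assumes a 1D y, a Euclidean X metric, and a hypothetical function name `joint_distance`; it illustrates the distance only, not how this class assigns samples to folds.

```python
import numpy as np


def joint_distance(X, y=None, y_metric="euclidean"):
    """Normalized X distance plus optional normalized y distance (sketch)."""
    X = np.asarray(X, dtype=float)
    dx = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    d = dx / dx.max()
    if y_metric is None:
        return d  # X-only distances, as in Kennard-Stone
    y = np.asarray(y)
    if y_metric == "euclidean":
        yf = y.astype(float)
        dy = np.abs(yf[:, None] - yf[None, :])
    elif y_metric == "hamming":
        # 0 when the labels match, 1 otherwise.
        dy = (y[:, None] != y[None, :]).astype(float)
    else:
        raise ValueError(f"unsupported y_metric in this sketch: {y_metric!r}")
    if dy.max() > 0:
        d = d + dy / dy.max()
    return d


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
y = np.array([0.0, 1.0, 3.0])   # regression target
y_class = np.array([0, 0, 1])   # class labels

D_reg = joint_distance(X, y, "euclidean")
D_cls = joint_distance(X, y_class, "hamming")
D_x = joint_distance(X, None, None)
```

With “hamming”, samples 0 and 1 share a class, so their joint distance reduces to the X term alone, while any pair from different classes gains the same fixed penalty.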

Examples

Basic K-Fold with SPXY:

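>>> import numpy as np
>>> from nirs4all.operators.splitters import SPXYGFold
>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)  # Continuous target
>>> splitter = SPXYGFold(n_splits=5)
>>> for train_idx, test_idx in splitter.split(X, y):
...     X_train, X_test = X[train_idx], X[test_idx]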

Single train/test split (backward compatible with SPXYSplitter):

>>> splitter = SPXYGFold(n_splits=1, test_size=0.25)
>>> train_idx, test_idx = next(splitter.split(X, y))

Classification with Hamming distance for y:

>>> y_class = np.random.randint(0, 3, 100)  # Categorical target
>>> splitter = SPXYGFold(n_splits=5, y_metric="hamming")
>>> for train_idx, test_idx in splitter.split(X, y_class):
...     pass

Group-aware splitting:

>>> sample_ids = np.repeat(np.arange(20), 5)  # 20 groups, 5 samples each
>>> splitter = SPXYGFold(n_splits=5)
>>> for train_idx, test_idx in splitter.split(X, y, groups=sample_ids):
...     pass  # Samples with same group stay together

Pure Kennard-Stone (X-only):

>>> splitter = SPXYGFold(n_splits=5, y_metric=None)
>>> for train_idx, test_idx in splitter.split(X):
...     pass

get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splitting iterations.

Parameters:

X (object) – Ignored, exists for compatibility.
y (object) – Ignored, exists for compatibility.
groups (object) – Ignored, exists for compatibility.

Returns:

n_splits – Number of folds.

Return type:

int

split(X, y=None, groups=None)[source]

Generate train/test indices for each fold.

Parameters:

X (array-like of shape (n_samples, n_features)) – Feature matrix.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values. Required if y_metric is not None.
groups (array-like of shape (n_samples,), default=None) – Group labels for samples. Samples with the same group label will always be in the same fold.

Yields:

train (ndarray) – Training set indices for this fold.
test (ndarray) – Test set indices for this fold.

class nirs4all.operators.splitters.SPXYSplitter(test_size, random_state=None, pca_components=None, metric='euclidean')[source]

Bases: CustomSplitter

Implements the SPXY sampling method.

get_n_splits(X=None, y=None, groups=None)[source]: Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nirs4all.operators.splitters.SPlitSplitter(test_size, random_state=None)[source]

Bases: CustomSplitter

Implements the SPlit sampling.

get_n_splits(X=None, y=None, groups=None)[source]: Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class nirs4all.operators.splitters.SystematicCircularSplitter(test_size, random_state=None)[source]

Bases: CustomSplitter

Implements the systematic circular sampling method.
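
In its textbook form, circular systematic sampling picks a start position and then takes every k-th unit, wrapping around the end of the list. The sketch below shows only that index pattern; the function name is hypothetical, the step uses floor division so that no index repeats, and how this class orders the samples beforehand or which side becomes the test set is not stated here.

```python
import numpy as np


def circular_systematic_sample(n_total, n_select, start):
    """Pick n_select indices at a fixed step, wrapping around the end."""
    if not 0 < n_select <= n_total:
        raise ValueError("n_select must be between 1 and n_total")
    step = n_total // n_select  # floor keeps all picks distinct
    return (start + step * np.arange(n_select)) % n_total


selected = circular_systematic_sample(n_total=10, n_select=4, start=7)
rest = np.setdiff1d(np.arange(10), selected)
```

Starting at index 7 with a step of 2, the selection wraps past the end of the list and continues from the beginning.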

get_n_splits(X=None, y=None, groups=None)[source]: Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.