nirs4all.analysis.selector module

Transfer Preprocessing Selector.

Main class for transfer-optimized preprocessing selection. Evaluates preprocessings and their combinations to find those that best align source and target datasets while preserving predictive information.

Supports two modes for preprocessing generation:

1. Combinatoric mode (default): Uses simple permutations from base preprocessings.
2. Generator mode: Uses nirs4all’s generator DSL for flexible, constraint-based specification.

class nirs4all.analysis.selector.TransferPreprocessingSelector(preset: str | None = 'fast', preprocessings: Dict[str, Any] | None = None, n_components: int = 10, k_neighbors: int = 10, run_stage2: bool = False, stage2_top_k: int | None = 5, stage2_max_depth: int = 2, stage2_exhaustive: bool = False, run_stage3: bool = False, stage3_top_k: int = 5, stage3_max_order: int = 2, run_stage4: bool = False, stage4_top_k: int = 10, stage4_cv_folds: int = 3, stage4_models: List[str] | None = None, metric_weights: Dict[str, float] | None = None, preprocessing_spec: Dict[str, Any] | None = None, use_generator: bool | None = None, n_jobs: int = -1, verbose: int = 1, random_state: int = 0)

Bases: object

Select preprocessing for optimal transfer between datasets.

This class evaluates preprocessing methods to find those that best minimize distributional distance between source and target datasets while preserving predictive information.

Supports two modes for preprocessing generation:

Combinatoric mode (default): Uses simple permutations from base preprocessings. Stage 1 evaluates all singles, Stage 2 generates permutations from top-K candidates.
Generator mode: Uses nirs4all’s generator DSL for flexible, constraint-based specification. Enable by providing preprocessing_spec.

Stages:

1. Single Preprocessing (required): Evaluate all base preprocessings.
1b. Generator Stacked (optional): If using generator with stacked specs.
2. Stacking (optional): Evaluate depth-2+ combinations of top-K.
2b. Generator Augmented (optional): If using generator with augmentation specs.
3. Augmentation (optional): Evaluate feature concatenation.
4. Validation (optional): Supervised validation with proxy models.
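The Stage 2 description (ordered stacks built from the top-K single preprocessings) can be illustrated with a small standalone sketch. It is not the library's code: the function name `stacked_candidates` and the input ranking are hypothetical, and the `>` separator is taken from the `'snv>d1'` example in this page.

```python
from itertools import permutations


def stacked_candidates(top_names, max_depth=2):
    """Build ordered stacks such as 'snv>d1' from top-ranked single names.

    Hypothetical helper: order matters, so 'snv>d1' and 'd1>snv'
    are treated as two distinct candidates.
    """
    stacks = []
    for depth in range(2, max_depth + 1):
        for combo in permutations(top_names, depth):
            stacks.append(">".join(combo))
    return stacks


# Hypothetical Stage 1 ranking, best first.
top_k = ["snv", "d1", "msc"]
print(stacked_candidates(top_k, max_depth=2))
```

With K names and depth 2 this yields K × (K − 1) stacks, which is why stage2_top_k and stage2_max_depth together control how much work Stage 2 does.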

Parameters:

preset – Preset configuration (‘fast’, ‘balanced’, ‘thorough’, ‘full’, ‘exhaustive’) or None for manual configuration. Default is ‘fast’.
preprocessings – Custom preprocessings dict or None for base set.
n_components – PCA components for metric computation.
k_neighbors – Neighbors for trustworthiness metric.
run_stage2 – Enable Stage 2 (stacking) evaluation.
stage2_top_k – Number of top candidates for stacking.
stage2_max_depth – Maximum stacking depth.
run_stage3 – Enable Stage 3 (augmentation) evaluation.
stage3_top_k – Number of top candidates for augmentation.
stage3_max_order – Maximum augmentation order (2 or 3).
run_stage4 – Enable Stage 4 (supervised validation).
stage4_top_k – Number of candidates to validate.
stage4_cv_folds – Cross-validation folds.
preprocessing_spec – Generator specification dict for flexible preprocessing definition. Uses nirs4all.pipeline.config.generator. Supports keywords like _or_, arrange, pick, _mutex_, etc.
use_generator – Enable generator mode. Auto-detected if preprocessing_spec is provided. Set to False to disable even with preprocessing_spec.
n_jobs – Number of parallel jobs for preprocessing evaluation. n_jobs=-1 uses all available CPU cores (default), n_jobs=1 runs sequentially (useful for debugging), and n_jobs=N uses N cores.
verbose – Verbosity level (0=silent, 1=progress, 2=detailed).
random_state – Random seed for reproducibility.
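To make the Stage 3 parameters concrete, here is a minimal sketch of how augmentation groups could be enumerated from the top candidates. It assumes that feature concatenation is order-insensitive, so unordered combinations are used. That is an assumption of this sketch, not a statement about the library's internals, and `augmentation_candidates` is a hypothetical name.

```python
from itertools import combinations


def augmentation_candidates(top_names, max_order=2):
    """Enumerate groups of preprocessings whose outputs would be concatenated.

    Hypothetical helper: assumes concatenating (a, b) is equivalent
    to concatenating (b, a), hence combinations, not permutations.
    """
    if max_order not in (2, 3):
        raise ValueError("max_order must be 2 or 3")
    groups = []
    for order in range(2, max_order + 1):
        groups.extend(combinations(top_names, order))
    return groups


# Hypothetical top candidates, best first.
print(augmentation_candidates(["snv", "d1", "msc"], max_order=3))
```

Under this assumption, raising the order from 2 to 3 adds every triple on top of every pair, so the candidate count grows quickly with the number of top candidates.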

Example

>>> from nirs4all.analysis.selector import TransferPreprocessingSelector
>>> # Quick usage with default fast preset
>>> selector = TransferPreprocessingSelector()
>>> results = selector.fit(X_source, X_target)
>>> print(results.best.name)
snv

>>> # With balanced preset for stacking
>>> selector = TransferPreprocessingSelector(preset='balanced')
>>> results = selector.fit(X_source, X_target)
>>> print(results.to_pipeline_spec())
snv>d1

>>> # Generator mode: constrained stacking
>>> selector = TransferPreprocessingSelector(
...     preprocessing_spec={
...         "_or_": ["snv", "msc", "d1", "d2", "savgol"],
...         "arrange": 2,
...         "_mutex_": [["d1", "d2"]],  # Don't stack derivatives
...     },
... )
>>> results = selector.fit(X_source, X_target)

>>> # Custom configuration
>>> selector = TransferPreprocessingSelector(
...     preset=None,
...     run_stage2=True,
...     stage2_top_k=10,
...     stage2_max_depth=2,
...     n_components=20,
... )
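The generator-mode example above can be read as "all ordered pairs from the listed names, except pairs that contain both derivatives". The toy expansion below illustrates that reading in plain Python. It is not nirs4all's generator, and it assumes that `arrange` means an ordered selection of the given length and that `_mutex_` forbids two members of the same group from appearing in one stack. `expand_spec` is a hypothetical name.

```python
from itertools import permutations


def expand_spec(spec):
    """Toy expansion of a spec with '_or_', 'arrange', and '_mutex_' keys.

    Hypothetical stand-in for a generator DSL; only these three
    keys are understood.
    """
    choices = spec["_or_"]
    length = spec.get("arrange", 1)
    mutex_groups = [set(group) for group in spec.get("_mutex_", [])]

    def allowed(stack):
        # Reject a stack holding two or more members of one mutex group.
        return all(len(group.intersection(stack)) < 2 for group in mutex_groups)

    return [
        ">".join(stack)
        for stack in permutations(choices, length)
        if allowed(stack)
    ]


spec = {
    "_or_": ["snv", "msc", "d1", "d2", "savgol"],
    "arrange": 2,
    "_mutex_": [["d1", "d2"]],  # Don't stack derivatives
}
candidates = expand_spec(spec)
print(len(candidates))
print("d1>d2" in candidates)
```

Under these assumptions, five names give 20 ordered pairs, and the mutex rule removes `d1>d2` and `d2>d1`, leaving 18 candidates.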

fit(X_source_or_config, X_target: ndarray | None = None, y_source: ndarray | None = None, y_target: ndarray | None = None) → TransferSelectionResults

Run transfer-optimized preprocessing selection.

Supports two calling conventions:

Raw arrays (original API):
selector.fit(X_source, X_target, y_source, y_target)
DatasetConfigs (nirs4all-native API):
selector.fit(dataset_config)
- Single dataset: Uses train as source, test as target
- Multiple datasets: Combines X(“all”) from all datasets

Parameters:

X_source_or_config – Either:
- np.ndarray: Source dataset (n_samples_src, n_features)
- DatasetConfigs: nirs4all dataset configuration
X_target – Target dataset (required if X_source_or_config is array).
y_source – Optional target values for the source dataset, used in supervised validation.
y_target – Optional target values for the target dataset, used in supervised validation.

Returns:

TransferSelectionResults with ranked recommendations.

Example

>>> # Using DatasetConfigs (recommended nirs4all way)
>>> from nirs4all.data.config import DatasetConfigs
>>> selector = TransferPreprocessingSelector(preset="balanced")
>>> results = selector.fit(DatasetConfigs(data_path))
>>> pp_list = results.to_preprocessing_list(top_k=10)

>>> # Using raw arrays
>>> results = selector.fit(X_train, X_test, y_train)

fit_from_configs(config_source, config_target, partition: str = 'train') → TransferSelectionResults

Fit from DatasetConfigs or SpectroDataset.

Parameters:

config_source – DatasetConfigs or SpectroDataset for source dataset.
config_target – DatasetConfigs or SpectroDataset for target dataset.
partition – Which partition to use (‘train’ or ‘test’).

Returns:

TransferSelectionResults with ranked recommendations.

Example

>>> from nirs4all.data.config import DatasetConfigs
>>> config_src = DatasetConfigs("path/to/source.json")
>>> config_tgt = DatasetConfigs("path/to/target.json")
>>> selector = TransferPreprocessingSelector()
>>> results = selector.fit_from_configs(config_src, config_tgt)

get_preprocessing_by_name(name: str) → Any

Get a preprocessing transform by name.

Parameters:

name – Preprocessing name (e.g., “snv”, “snv>d1”).

Returns:

Transformer or list of transformers for stacked pipelines.
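The "single transformer or list" return shape can be sketched as follows. This is an illustration only: `resolve_by_name` and the registry are hypothetical, plain strings stand in for transformer objects, and splitting on `>` is inferred from the “snv>d1” naming used in this page.

```python
def resolve_by_name(name, registry):
    """Map a name like 'snv' or 'snv>d1' to one entry or a list of entries.

    Hypothetical helper: a single name returns the registered object,
    a stacked name returns the objects in application order.
    """
    steps = [part.strip() for part in name.split(">")]
    missing = [step for step in steps if step not in registry]
    if missing:
        raise KeyError(f"Unknown preprocessing name(s): {missing}")
    transforms = [registry[step] for step in steps]
    return transforms[0] if len(transforms) == 1 else transforms


# Hypothetical registry: strings stand in for real transformer objects.
registry = {"snv": "snv-transformer", "d1": "d1-transformer"}

print(resolve_by_name("snv", registry))
print(resolve_by_name("snv>d1", registry))
```

Callers that handle both shapes typically need to check whether the result is a list before applying it, which is worth keeping in mind when feeding the result into a pipeline.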