TransferPreprocessingSelector Cheat Sheet

Purpose

Find preprocessings that:

  1. Minimize distribution gap between train and test sets

  2. Preserve predictive information (don’t destroy useful signal)


Quick Start

from nirs4all.analysis import TransferPreprocessingSelector

# Simple usage with preset
selector = TransferPreprocessingSelector(preset="balanced")
results = selector.fit(X_train, X_test)

# Get top recommendations
top_pp = results.to_preprocessing_list(top_k=10)

Stages Overview

| Stage | Name | What it does | Enabled by |
|-------|------|--------------|------------|
| 1 | Single Eval | Tests all base preprocessings | Always on |
| 1b | Generator Stacked | Tests stacked combos from preprocessing_spec | preprocessing_spec |
| 2 | Stacking | Combines top-K singles into depth-2/3 pipelines | run_stage2=True |
| 2b | Generator Augmented | Tests augmentation from preprocessing_spec | preprocessing_spec + pick |
| 3 | Augmentation | Concatenates features from multiple pipelines | run_stage3=True |
| 4 | Supervised Validation | Validates with proxy models (Ridge, PLS) | run_stage4=True + y_source |
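
The difference between stacking (Stages 1b/2) and augmentation (Stages 2b/3) can be illustrated with a minimal pure-Python sketch. The toy transforms `center` and `first_diff` and the helpers `apply_stacked`, `apply_augmented` and `stacked_candidates` are hypothetical names used for illustration only, not nirs4all functions. The sketch assumes that stacking chains transforms in sequence, that augmentation concatenates the outputs of separate pipelines feature-wise, and that Stage 2 enumerates ordered combinations of the top singles; the library may prune or order candidates differently.

```python
from itertools import permutations

def center(rows):
    """Toy transform: subtract each row's mean."""
    return [[v - sum(r) / len(r) for v in r] for r in rows]

def first_diff(rows):
    """Toy transform: first difference along each row."""
    return [[b - a for a, b in zip(r, r[1:])] for r in rows]

def apply_stacked(pipeline, rows):
    """Stacking: apply the transforms one after another."""
    for transform in pipeline:
        rows = transform(rows)
    return rows

def apply_augmented(pipelines, rows):
    """Augmentation: run each pipeline separately, then concatenate features."""
    outputs = [apply_stacked(p, rows) for p in pipelines]
    return [sum(parts, []) for parts in zip(*outputs)]

def stacked_candidates(top_singles, max_depth):
    """Enumerate ordered combinations of the top singles, depth 2..max_depth."""
    return [
        combo
        for depth in range(2, max_depth + 1)
        for combo in permutations(top_singles, depth)
    ]

X = [[1.0, 2.0, 4.0], [0.0, 3.0, 3.0]]
stacked = apply_stacked([center, first_diff], X)          # one chained pipeline
augmented = apply_augmented([[center], [first_diff]], X)  # two pipelines, concatenated
print(len(stacked[0]), len(augmented[0]))
print(len(stacked_candidates(["snv", "msc", "d1", "sg", "detrend"], 2)))
```

Under these assumptions, stacking the top 5 singles to depth 2 already yields 20 ordered pairs, which is why the deeper presets are slower.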


Presets

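| Preset | Stage 2 | Stage 3 | Stage 4 | Speed |
|--------|---------|---------|---------|-------|
| fast |  |  |  | ⚡ Fastest |
| balanced | ✅ (top-5, depth-2) |  |  | 🔄 Medium |
| thorough | ✅ (top-10, depth-3) |  |  | 🐢 Slow |
| full | ✅ (top-15, depth-3) |  |  | 🐌 Very slow |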


Generator Mode (preprocessing_spec)

Cartesian Product with None → Generates All Depths

preprocessing_spec = {
    "_cartesian_": [
        {"_or_": [None, SNV(), MSC()]},      # Slot 1: scatter correction
        {"_or_": [None, SavitzkyGolay()]},   # Slot 2: smoothing
        {"_or_": [None, FirstDerivative()]}, # Slot 3: derivative
    ]
}
# Generates singletons, pairs, AND triples (None entries are dropped):
# [None, None, FirstDerivative()] → [FirstDerivative()]  (singleton)
# [SNV(), None, FirstDerivative()] → [SNV(), FirstDerivative()]  (pair)
# [SNV(), SavitzkyGolay(), FirstDerivative()] → full pipeline  (triple)
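
The expansion described above can be sketched with `itertools.product`. This is a minimal illustration using string labels in place of transform objects; `expand_cartesian` is a hypothetical helper, not the library's generator, and dropping the all-None combination is an assumption (the library may instead treat it as the raw baseline).

```python
from itertools import product

def expand_cartesian(slots):
    """Expand a list of option lists into pipelines, dropping None entries."""
    pipelines = []
    for combo in product(*slots):
        pipeline = [step for step in combo if step is not None]
        if pipeline:  # assumption: skip the all-None combination
            pipelines.append(pipeline)
    return pipelines

slots = [
    [None, "SNV", "MSC"],        # scatter correction
    [None, "SavitzkyGolay"],     # smoothing
    [None, "FirstDerivative"],   # derivative
]
pipelines = expand_cartesian(slots)

# Group the generated pipelines by depth (number of steps).
by_depth = {}
for p in pipelines:
    by_depth.setdefault(len(p), []).append(p)

for depth in sorted(by_depth):
    print(depth, len(by_depth[depth]))
```

For this spec the sketch yields 4 singletons, 5 pairs and 2 triples, 11 pipelines in total.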

Important: An exhaustive cartesian spec already generates the stacked pipelines, which makes Stage 2 redundant!

# If preprocessing_spec already generates all combinations:
selector = TransferPreprocessingSelector(
    preprocessing_spec=preprocessing_spec,  # the cartesian spec defined above
    preset=None,         # Disable preset
    run_stage2=False,    # Already covered by cartesian
    run_stage4=True,     # Keep supervised validation
)

Key Parameters

TransferPreprocessingSelector(
    # Presets
    preset="balanced",           # "fast", "balanced", "thorough", "full", None

    # Generator mode
    preprocessing_spec=None,     # Dict with _cartesian_, _or_, etc.

    # Stage 2: Stacking
    run_stage2=False,
    stage2_top_k=5,              # How many singles to stack
    stage2_max_depth=2,          # Max pipeline depth (2 or 3)

    # Stage 3: Augmentation (feature concatenation)
    run_stage3=False,
    stage3_top_k=5,
    stage3_max_order=2,          # Concat 2 or 3 pipelines

    # Stage 4: Supervised validation
    run_stage4=False,
    stage4_top_k=10,

    # Metrics
    n_components=10,             # PCA components for metrics
    k_neighbors=10,              # For trustworthiness

    # Performance
    n_jobs=-1,                   # Parallel workers (-1 = all cores)
    verbose=1,
)

Common Patterns

Pattern 1: Quick Exploration

selector = TransferPreprocessingSelector(preset="fast")
results = selector.fit(dataset_config)

Pattern 2: Exhaustive with Generator

GLOBAL_PP = {
    "_cartesian_": [
        {"_or_": [None, MSC(), SNV(), EMSC()]},
        {"_or_": [None, SavitzkyGolay(), Gaussian()]},
        {"_or_": [None, FirstDerivative(), SecondDerivative()]},
    ]
}

selector = TransferPreprocessingSelector(
    preprocessing_spec=GLOBAL_PP,
    preset=None,
    run_stage2=False,  # Cartesian already covers this
    run_stage4=True,   # Validate top candidates
    stage4_top_k=20,
)
results = selector.fit(X_train, X_test, y_train)  # labels (y_source) are needed for Stage 4
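
Before running an exhaustive spec, it can help to estimate how many candidates it generates. The arithmetic below is a rough sketch for the GLOBAL_PP spec above. It assumes the all-None combination is not evaluated as a candidate and that stage4_top_k is applied to the overall ranking; neither is confirmed here.

```python
from math import prod

# Number of options per "_or_" slot in GLOBAL_PP, counting None.
options_per_slot = [4, 3, 3]

total_combinations = prod(options_per_slot)    # includes the all-None combination
non_empty_pipelines = total_combinations - 1   # assumption: all-None is skipped
stage4_top_k = 20

print(non_empty_pipelines)
print(f"{stage4_top_k / non_empty_pipelines:.0%} of candidates would reach Stage 4")
```

Adding one more option to any slot multiplies the total, so the candidate count grows quickly with the size of the spec.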

Pattern 3: Discovery Mode (let Stage 2 explore)

# Small base set, let Stage 2 find combinations
selector = TransferPreprocessingSelector(
    preset="thorough",  # Enables stage2 with depth-3
    # No preprocessing_spec → uses get_base_preprocessings()
)

Results API

results = selector.fit(X_train, X_test)

# Get rankings
results.best                    # Top TransferResult
results.top(k=10)               # Top 10 results
results.ranking                 # All results (sorted)

# Export
results.to_preprocessing_list(top_k=10)  # List of transforms
results.to_pipeline_spec()               # Pipeline-ready format
results.summary()                        # Text summary

# Metrics
results.raw_metrics             # Baseline (no preprocessing) metrics
results.timing                  # Stage timings

Metrics Explained

| Metric | What it measures | Better when |
|--------|------------------|-------------|
| mmd | Maximum Mean Discrepancy | Lower |
| wasserstein | Distribution distance | Lower |
| coral | Covariance alignment | Lower |
| trustworthiness | Local structure preservation | Higher |
| transfer_score | Weighted combination | Higher |
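
To make the "lower is better" distance metrics concrete, here is a minimal pure-Python sketch of two of them. It is not the library's implementation: it assumes a linear kernel for MMD and a per-feature 1-D Wasserstein distance on equal-size samples, whereas the selector may use other kernels or compute metrics in PCA space (see n_components). All function names are hypothetical.

```python
from statistics import fmean

def column_means(rows):
    """Mean of each feature (column)."""
    return [fmean(col) for col in zip(*rows)]

def linear_mmd(source, target):
    """Squared distance between feature means (MMD^2 with a linear kernel)."""
    return sum(
        (s - t) ** 2
        for s, t in zip(column_means(source), column_means(target))
    )

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance for two equal-size samples."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return fmean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

def mean_feature_wasserstein(source, target):
    """Average the 1-D Wasserstein distance over features."""
    return fmean(
        wasserstein_1d(s, t) for s, t in zip(zip(*source), zip(*target))
    )

def row_center(rows):
    """Toy preprocessing: remove each sample's mean (offset correction)."""
    return [[v - fmean(r) for v in r] for r in rows]

X_train = [[0.0, 1.0], [2.0, 3.0]]
X_test = [[1.0, 2.0], [3.0, 4.0]]   # same spectra with a constant offset

gap_raw = linear_mmd(X_train, X_test)
gap_centered = linear_mmd(row_center(X_train), row_center(X_test))
print(gap_raw, gap_centered)
print(mean_feature_wasserstein(X_train, X_test))
```

In this toy case row centering closes the gap completely, but it also makes every sample identical, which is exactly the loss of signal that trustworthiness and Stage 4 are meant to catch.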


⚠️ Gotchas

  1. Stage 2 only stacks singletons - It does not stack multi-step pipelines (e.g. your 4th-order ones) with each other

  2. Generator mode bypasses the preprocessings dict - The objects in preprocessing_spec are used directly

  3. None in a cartesian spec = optional step - Leaving a slot empty generates lower-order pipelines

  4. Stage 4 requires y_source - Pass labels so the proxy models can run supervised validation
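
Gotcha 4 follows from what supervised validation does: a proxy model can only be scored against labels. The sketch below shows the general idea with scikit-learn's Ridge and cross-validation on synthetic data. It is an illustration under stated assumptions, not the selector's Stage 4: `proxy_score`, the candidate names, the alpha value and the 5-fold scheme are all hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def proxy_score(X, y, alpha=1.0, cv=5):
    """Mean cross-validated R^2 of a Ridge proxy model."""
    return cross_val_score(Ridge(alpha=alpha), X, y, cv=cv, scoring="r2").mean()

# Synthetic data: only feature 0 carries predictive signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)

candidates = {
    "identity": X,              # keeps the informative feature
    "drops_signal": X[:, 1:],   # hypothetical preprocessing that removes it
}
scores = {name: proxy_score(Xp, y) for name, Xp in candidates.items()}
best = max(scores, key=scores.get)
print(best)
```

A preprocessing can look good on unsupervised distance metrics while scoring poorly here, which is why keeping Stage 4 on is useful whenever labels are available.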