TransferPreprocessingSelector Cheat Sheet
Purpose
Find preprocessings that:
Minimize distribution gap between train and test sets
Preserve predictive information (don’t destroy useful signal)
Quick Start
from nirs4all.analysis import TransferPreprocessingSelector
# Simple usage with preset
selector = TransferPreprocessingSelector(preset="balanced")
results = selector.fit(X_train, X_test)
# Get top recommendations
top_pp = results.to_preprocessing_list(top_k=10)
Stages Overview
Stage |
Name |
What it does |
Enabled by |
|---|---|---|---|
1 |
Single Eval |
Tests all base preprocessings |
Always on |
1b |
Generator Stacked |
Tests stacked combos from |
|
2 |
Stacking |
Combines top-K singles into depth-2/3 pipelines |
|
2b |
Generator Augmented |
Tests augmentation from |
|
3 |
Augmentation |
Concatenates features from multiple pipelines |
|
4 |
Supervised Validation |
Validates with proxy models (Ridge, PLS) |
|
Presets
Preset |
Stage 2 |
Stage 3 |
Stage 4 |
Speed |
|---|---|---|---|---|
|
❌ |
❌ |
❌ |
⚡ Fastest |
|
✅ (top-5, depth-2) |
❌ |
❌ |
🔄 Medium |
|
✅ (top-10, depth-3) |
✅ |
❌ |
🐢 Slow |
|
✅ (top-15, depth-3) |
✅ |
✅ |
🐌 Very slow |
Generator Mode (preprocessing_spec)
Cartesian Product with None → Generates All Depths
preprocessing_spec = {
"_cartesian_": [
{"_or_": [None, SNV(), MSC()]}, # Stage 1: scatter correction
{"_or_": [None, SavitzkyGolay()]}, # Stage 2: smoothing
{"_or_": [None, FirstDerivative()]}, # Stage 3: derivative
]
}
# Generates: singletons, pairs, AND triples
# [None, None, D1()] → [D1()] (singleton)
# [SNV(), None, D1()] → [SNV(), D1()] (pair)
# [SNV(), SavGol(), D1()] → full pipeline
Important: Exhaustive cartesian makes Stage 2 redundant!
# If preprocessing_spec already generates all combinations:
selector = TransferPreprocessingSelector(
preprocessing_spec=GLOBAL_PP,
preset=None, # Disable preset
run_stage2=False, # Already covered by cartesian
run_stage4=True, # Keep supervised validation
)
Key Parameters
TransferPreprocessingSelector(
# Presets
preset="balanced", # "fast", "balanced", "thorough", "full", None
# Generator mode
preprocessing_spec=None, # Dict with _cartesian_, _or_, etc.
# Stage 2: Stacking
run_stage2=False,
stage2_top_k=5, # How many singles to stack
stage2_max_depth=2, # Max pipeline depth (2 or 3)
# Stage 3: Augmentation (feature concatenation)
run_stage3=False,
stage3_top_k=5,
stage3_max_order=2, # Concat 2 or 3 pipelines
# Stage 4: Supervised validation
run_stage4=False,
stage4_top_k=10,
# Metrics
n_components=10, # PCA components for metrics
k_neighbors=10, # For trustworthiness
# Performance
n_jobs=-1, # Parallel workers (-1 = all cores)
verbose=1,
)
Common Patterns
Pattern 1: Quick Exploration
selector = TransferPreprocessingSelector(preset="fast")
results = selector.fit(dataset_config)
Pattern 2: Exhaustive with Generator
GLOBAL_PP = {
"_cartesian_": [
{"_or_": [None, MSC(), SNV(), EMSC()]},
{"_or_": [None, SavitzkyGolay(), Gaussian()]},
{"_or_": [None, FirstDerivative(), SecondDerivative()]},
]
}
selector = TransferPreprocessingSelector(
preprocessing_spec=GLOBAL_PP,
preset=None,
run_stage2=False, # Cartesian already covers this
run_stage4=True, # Validate top candidates
stage4_top_k=20,
)
results = selector.fit(X_train, X_test, y_train)
Pattern 3: Discovery Mode (let Stage 2 explore)
# Small base set, let Stage 2 find combinations
selector = TransferPreprocessingSelector(
preset="thorough", # Enables stage2 with depth-3
# No preprocessing_spec → uses get_base_preprocessings()
)
Results API
results = selector.fit(X_train, X_test)
# Get rankings
results.best # Top TransferResult
results.top(k=10) # Top 10 results
results.ranking # All results (sorted)
# Export
results.to_preprocessing_list(top_k=10) # List of transforms
results.to_pipeline_spec() # Pipeline-ready format
results.summary() # Text summary
# Metrics
results.raw_metrics # Baseline (no preprocessing) metrics
results.timing # Stage timings
Metrics Explained
Metric |
What it measures |
Better when |
|---|---|---|
|
Maximum Mean Discrepancy |
Lower |
|
Distribution distance |
Lower |
|
Covariance alignment |
Lower |
|
Local structure preservation |
Higher |
|
Weighted combination |
Higher |
⚠️ Gotchas
Stage 2 only stacks singletons - It doesn’t stack your 4th-order pipelines together
Generator mode bypasses
preprocessingsdict - Objects are used directlyNone in cartesian = optional stage - Generates lower-order pipelines
Stage 4 requires y_source - Pass labels for supervised validation