# TransferPreprocessingSelector Cheat Sheet

## Purpose

Find preprocessings that:

1. **Minimize distribution gap** between train and test sets
2. **Preserve predictive information** (don't destroy useful signal)

---

## Quick Start

```python
from nirs4all.analysis import TransferPreprocessingSelector

# Simple usage with preset
selector = TransferPreprocessingSelector(preset="balanced")
results = selector.fit(X_train, X_test)

# Get top recommendations
top_pp = results.to_preprocessing_list(top_k=10)
```

---

## Stages Overview

| Stage | Name | What it does | Enabled by |
|-------|------|--------------|------------|
| 1 | Single Eval | Tests all base preprocessings | Always on |
| 1b | Generator Stacked | Tests stacked combos from `preprocessing_spec` | `preprocessing_spec` |
| 2 | Stacking | Combines top-K singles into depth-2/3 pipelines | `run_stage2=True` |
| 2b | Generator Augmented | Tests augmentation from `preprocessing_spec` | `preprocessing_spec` + `pick` |
| 3 | Augmentation | Concatenates features from multiple pipelines | `run_stage3=True` |
| 4 | Supervised Validation | Validates with proxy models (Ridge, PLS) | `run_stage4=True` + `y_source` |

---

## Presets

| Preset | Stage 2 | Stage 3 | Stage 4 | Speed |
|--------|---------|---------|---------|-------|
| `fast` | ❌ | ❌ | ❌ | ⚡ Fastest |
| `balanced` | ✅ (top-5, depth-2) | ❌ | ❌ | 🔄 Medium |
| `thorough` | ✅ (top-10, depth-3) | ✅ | ❌ | 🐢 Slow |
| `full` | ✅ (top-15, depth-3) | ✅ | ✅ | 🐌 Very slow |

---

## Generator Mode (preprocessing_spec)

### Cartesian Product with None → Generates All Depths

```python
preprocessing_spec = {
    "_cartesian_": [
        {"_or_": [None, SNV(), MSC()]},        # Stage 1: scatter correction
        {"_or_": [None, SavitzkyGolay()]},     # Stage 2: smoothing
        {"_or_": [None, FirstDerivative()]},   # Stage 3: derivative
    ]
}

# Generates: singletons, pairs, AND triples
# [None, None, D1()]       → [D1()]          (singleton)
# [SNV(), None, D1()]      → [SNV(), D1()]   (pair)
# [SNV(), SavGol(), D1()]  → full pipeline
```

### Important: Exhaustive cartesian makes Stage 2 redundant!

```python
# If preprocessing_spec already generates all combinations:
selector = TransferPreprocessingSelector(
    preprocessing_spec=GLOBAL_PP,
    preset=None,        # Disable preset
    run_stage2=False,   # Already covered by cartesian
    run_stage4=True,    # Keep supervised validation
)
```

---

## Key Parameters

```python
TransferPreprocessingSelector(
    # Presets
    preset="balanced",        # "fast", "balanced", "thorough", "full", None

    # Generator mode
    preprocessing_spec=None,  # Dict with _cartesian_, _or_, etc.

    # Stage 2: Stacking
    run_stage2=False,
    stage2_top_k=5,           # How many singles to stack
    stage2_max_depth=2,       # Max pipeline depth (2 or 3)

    # Stage 3: Augmentation (feature concatenation)
    run_stage3=False,
    stage3_top_k=5,
    stage3_max_order=2,       # Concat 2 or 3 pipelines

    # Stage 4: Supervised validation
    run_stage4=False,
    stage4_top_k=10,

    # Metrics
    n_components=10,          # PCA components for metrics
    k_neighbors=10,           # For trustworthiness

    # Performance
    n_jobs=-1,                # Parallel workers (-1 = all cores)
    verbose=1,
)
```

---

## Common Patterns

### Pattern 1: Quick Exploration

```python
selector = TransferPreprocessingSelector(preset="fast")
results = selector.fit(dataset_config)
```

### Pattern 2: Exhaustive with Generator

```python
GLOBAL_PP = {
    "_cartesian_": [
        {"_or_": [None, MSC(), SNV(), EMSC()]},
        {"_or_": [None, SavitzkyGolay(), Gaussian()]},
        {"_or_": [None, FirstDerivative(), SecondDerivative()]},
    ]
}

selector = TransferPreprocessingSelector(
    preprocessing_spec=GLOBAL_PP,
    preset=None,
    run_stage2=False,   # Cartesian already covers this
    run_stage4=True,    # Validate top candidates
    stage4_top_k=20,
)
results = selector.fit(X_train, X_test, y_train)
```

### Pattern 3: Discovery Mode (let Stage 2 explore)

```python
# Small base set, let Stage 2 find combinations
selector = TransferPreprocessingSelector(
    preset="thorough",  # Enables stage2 with depth-3
    # No preprocessing_spec → uses get_base_preprocessings()
)
```

---

## Results API

```python
results = selector.fit(X_train, X_test)

# Get rankings
results.best                              # Top TransferResult
results.top(k=10)                         # Top 10 results
results.ranking                           # All results (sorted)

# Export
results.to_preprocessing_list(top_k=10)   # List of transforms
results.to_pipeline_spec()                # Pipeline-ready format
results.summary()                         # Text summary

# Metrics
results.raw_metrics                       # Baseline (no preprocessing) metrics
results.timing                            # Stage timings
```

---

## Metrics Explained

| Metric | What it measures | Better when |
|--------|------------------|-------------|
| `mmd` | Maximum Mean Discrepancy | Lower |
| `wasserstein` | Distribution distance | Lower |
| `coral` | Covariance alignment | Lower |
| `trustworthiness` | Local structure preservation | Higher |
| `transfer_score` | Weighted combination | **Higher** |

---

## ⚠️ Gotchas

1. **Stage 2 only stacks singletons** - It doesn't stack your 4th-order pipelines together
2. **Generator mode bypasses `preprocessings` dict** - Objects are used directly
3. **None in cartesian = optional stage** - Generates lower-order pipelines
4. **Stage 4 requires y_source** - Pass labels for supervised validation
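
To make gotcha 3 concrete, here is a minimal standalone sketch of how `None` entries in a cartesian spec can yield pipelines of every depth. It is not the library's generator: `expand_cartesian` is a hypothetical helper, plain strings stand in for transformer objects, and skipping the all-`None` combination (treated here as the unprocessed baseline) is an assumption.

```python
from itertools import product


def expand_cartesian(stages):
    """Expand per-stage option lists into unique pipelines.

    Hypothetical helper for illustration only. Each stage is a list of
    options; None means "skip this stage".
    """
    pipelines = []
    seen = set()
    for combo in product(*stages):
        # Dropping None entries turns a 3-slot combo into a shorter pipeline.
        pipeline = tuple(step for step in combo if step is not None)
        if not pipeline or pipeline in seen:
            continue  # assumption: skip the all-None combo and any duplicates
        seen.add(pipeline)
        pipelines.append(list(pipeline))
    return pipelines


# Strings stand in for transformer instances such as SNV() or MSC().
stages = [
    [None, "SNV", "MSC"],        # scatter correction
    [None, "SavitzkyGolay"],     # smoothing
    [None, "FirstDerivative"],   # derivative
]

pipelines = expand_cartesian(stages)

# Group the generated pipelines by depth (number of steps).
by_depth = {}
for p in pipelines:
    by_depth.setdefault(len(p), []).append(p)

for depth in sorted(by_depth):
    print(depth, by_depth[depth])
```

With this spec, the 3 × 2 × 2 = 12 raw combinations collapse to 11 pipelines once the all-`None` case is dropped: 4 singletons, 5 pairs, and 2 triples. This is why an exhaustive cartesian spec can make Stage 2 stacking redundant.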