# Retrain and Transfer Learning

This guide covers retraining trained pipelines on new data, including full retrain, transfer learning, and fine-tuning modes.

## Overview

The retrain feature allows you to:

- **Full retrain**: Train from scratch with the same pipeline structure
- **Transfer**: Reuse preprocessing artifacts while training a new model
- **Finetune**: Continue training an existing model with additional data
- **Extract & Modify**: Get pipeline structure for inspection and modification

## Retrain Modes

| Mode | Preprocessing | Model | Use Case |
|------|---------------|-------|----------|
| `full` | Train new | Train new | New calibration set |
| `transfer` | Use existing | Train new | Apply preprocessing to new domain |
| `finetune` | Use existing | Continue training | Add more data to existing model |

## Basic Usage

### Full Retrain

Train everything from scratch using the same pipeline structure:

```python
from nirs4all.pipeline import PipelineRunner
from nirs4all.data import DatasetConfigs

runner = PipelineRunner(save_artifacts=True, verbose=0)

# Train initial model
predictions, _ = runner.run(pipeline_config, dataset_config)
best_pred = predictions.top(n=1, rank_partition="test")[0]

# Full retrain on new data
new_data = DatasetConfigs(['path/to/new_calibration'])
retrained_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='full',
    dataset_name='new_calibration',
    verbose=0
)

print(f"Retrained RMSE: {retrained_preds.top(n=1)[0]['rmse']:.4f}")
```

### Transfer Mode

Reuse preprocessing artifacts (scalers, SNV, etc.) while training a new model:

```python
# Transfer: reuse preprocessing, train new model
transfer_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='transfer',
    dataset_name='transfer_test',
    verbose=0
)
```

This is useful when:

- Your preprocessing is well-optimized for the spectral domain
- You want to apply the same preprocessing to a different target variable
- You're doing machine/instrument transfer calibration

### Transfer with Different Model

Replace the model type during transfer:

```python
from sklearn.ensemble import GradientBoostingRegressor

new_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

transfer_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='transfer',
    new_model=new_model,
    dataset_name='transfer_new_model',
    verbose=0
)
```

### Finetune Mode

Continue training an existing model (most effective with neural networks):

```python
# Finetune: continue training with additional epochs
finetune_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='finetune',
    epochs=10,
    dataset_name='finetune_test',
    verbose=0
)
```

**Note**: Fine-tuning is most effective with neural network models. For sklearn models like PLSRegression, fine-tuning is equivalent to retraining since they don't support incremental learning.
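To make the note above concrete: scikit-learn estimators that support incremental learning expose a `partial_fit` method, and `PLSRegression` does not. The sketch below is a hypothetical helper (`supports_incremental_fit` is not part of nirs4all) that checks for that method. It uses stand-in classes so it runs without any dependencies, and it makes no claim about how nirs4all implements `finetune` internally. It also does not apply to neural network models, which continue training through their own framework.

```python
class BatchOnlyModel:
    """Stand-in for an estimator that can only be fit in a single pass."""

    def fit(self, X, y):
        return self


class IncrementalModel(BatchOnlyModel):
    """Stand-in for an estimator that also accepts further batches."""

    def partial_fit(self, X, y):
        return self


def supports_incremental_fit(estimator) -> bool:
    """Return True if the estimator exposes a callable `partial_fit` method.

    Hypothetical helper for illustration only; it checks the scikit-learn
    convention for incremental learning and nothing else.
    """
    return callable(getattr(estimator, "partial_fit", None))


print(supports_incremental_fit(BatchOnlyModel()))    # False
print(supports_incremental_fit(IncrementalModel()))  # True
```

Passing a real estimator works the same way: `supports_incremental_fit(PLSRegression())` returns `False` because `PLSRegression` has no `partial_fit`, which is why `transfer` or `full` is the more meaningful choice for it.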
## Retrain Sources

The `retrain()` method accepts several kinds of source:

```python
# From prediction dict
runner.retrain(best_prediction, new_data, mode='full')

# From folder path
runner.retrain("runs/2024-12-14_wheat/pipeline_abc123/", new_data, mode='transfer')

# From bundle file
runner.retrain("exports/wheat_model.n4a", new_data, mode='transfer')

# From model ID
runner.retrain(model_id, new_data, mode='full')
```

## Extract and Modify

Get the pipeline structure for inspection or modification:

```python
# Extract pipeline
extracted = runner.extract(best_pred)

# Inspect
print(f"Number of steps: {len(extracted)}")
print(f"Model step index: {extracted.model_step_index}")
print(f"Preprocessing chain: {extracted.preprocessing_chain}")

# View steps
for i, step in enumerate(extracted.steps):
    print(f"Step {i}: {step}")
```

### Modify and Run

```python
from sklearn.ensemble import RandomForestRegressor

# Replace model
extracted.set_model(RandomForestRegressor(n_estimators=100))

# Run modified pipeline
modified_preds, _ = runner.run(
    pipeline=extracted.steps,
    dataset=new_data,
    pipeline_name='modified_pipeline'
)
```

## Fine-grained Step Control

For advanced use cases, you can control each step individually:

```python
from nirs4all.pipeline import StepMode

# Define step modes
step_modes = [
    StepMode(step_index=1, mode='predict'),  # Use existing scaler
    StepMode(step_index=2, mode='predict'),  # Use existing y_processing
    StepMode(step_index=3, mode='train'),    # Retrain preprocessing
    # Model step will follow overall mode
]

controlled_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='full',
    step_modes=step_modes,
    dataset_name='controlled_retrain',
    verbose=0
)
```

### StepMode Options

| Mode | Description |
|------|-------------|
| `'train'` | Train this step from scratch |
| `'predict'` | Use existing artifact (no retraining) |
| `'skip'` | Skip this step entirely |

## Use Cases

### 1. Seasonal Recalibration

Update your model with the new season's samples:

```python
# Load previous best model
previous_model = predictions_db.get_best_for_dataset('wheat_2023')

# Retrain with new season's data
new_season = DatasetConfigs(['data/wheat_2024'])
updated_preds, _ = runner.retrain(
    source=previous_model,
    dataset=new_season,
    mode='full',
    dataset_name='wheat_2024'
)
```

### 2. Machine Transfer

Apply preprocessing from a reference instrument to a new instrument:

```python
# Model trained on Machine A
machine_a_model = best_prediction

# New data from Machine B
machine_b_data = DatasetConfigs(['data/machine_b'])

# Transfer preprocessing, train new model for Machine B
transfer_preds, _ = runner.retrain(
    source=machine_a_model,
    dataset=machine_b_data,
    mode='transfer',
    dataset_name='machine_b_calibration'
)
```

### 3. Multi-target Prediction

Use the same preprocessing for different target variables:

```python
# Original model for protein
protein_model = best_prediction

# Same spectra, different target (moisture)
moisture_data = DatasetConfigs({
    'X': 'spectra.csv',
    'Y': 'moisture_values.csv'
})

# Transfer preprocessing, train for moisture
moisture_preds, _ = runner.retrain(
    source=protein_model,
    dataset=moisture_data,
    mode='transfer',
    dataset_name='moisture_prediction'
)
```

### 4. A/B Testing Models

Compare different models with the same preprocessing:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

models = [
    PLSRegression(n_components=10),
    GradientBoostingRegressor(n_estimators=100),
    RandomForestRegressor(n_estimators=100),
]

results = {}
for model in models:
    preds, _ = runner.retrain(
        source=best_pred,
        dataset=test_data,
        mode='transfer',
        new_model=model,
        dataset_name=f'test_{model.__class__.__name__}'
    )
    best = preds.top(n=1, rank_partition="test")[0]
    results[model.__class__.__name__] = best['rmse']

print("Model Comparison:")
for name, rmse in sorted(results.items(), key=lambda x: x[1]):
    print(f"  {name}: RMSE = {rmse:.4f}")
```

## Neural Network Fine-tuning

For TensorFlow/PyTorch models, fine-tuning supports additional options:

```python
finetune_preds, _ = runner.retrain(
    source=best_nicon_model,
    dataset=new_data,
    mode='finetune',
    epochs=20,
    learning_rate=0.0001,  # Lower LR for fine-tuning
    freeze_layers=['conv1', 'conv2'],  # Freeze early layers
    dataset_name='finetune_nicon'
)
```

## Best Practices

1. **Validate preprocessing compatibility**: Ensure the new data covers the same wavelength range
2. **Check feature dimensions**: The new data must have the same number of features
3. **Use transfer mode wisely**: It works best when the preprocessing is already well-optimized
4. **Start with full retrain**: When in doubt, retrain everything
5. **Compare modes**: Test different modes to find what works best

## Troubleshooting

### Missing Artifacts

```python
# Error: Artifact not found
# Solution: Ensure original model was trained with save_artifacts=True
runner = PipelineRunner(save_artifacts=True, verbose=0)
```

### Feature Mismatch

```python
# Error: Feature dimension mismatch
# Solution: Verify new data has same number of features
print(f"Expected features: {best_pred['n_features']}")
print(f"New data features: {new_data.shape[1]}")
```

### Mode Not Suitable

```python
# Finetune with an sklearn model is equivalent to retraining
# Use transfer or full mode instead for sklearn models
runner.retrain(source, data, mode='transfer')  # Better for sklearn
```

## See Also

- {doc}`prediction_model_reuse` - Basic prediction workflows
- {doc}`export_bundles` - Export models for deployment
- {doc}`/user_guide/troubleshooting/migration` - Upgrade from older versions
- {doc}`/reference/pipeline_syntax` - Pipeline syntax reference