Retrain and Transfer Learning
This guide covers retraining an already trained pipeline on new data, using full retrain, transfer learning, or fine-tuning.
Overview
The retrain feature allows you to:
Full retrain: Train from scratch with the same pipeline structure
Transfer: Reuse preprocessing artifacts while training a new model
Finetune: Continue training an existing model with additional data
Extract & Modify: Get pipeline structure for inspection and modification
Retrain Modes
| Mode | Preprocessing | Model | Use Case |
|---|---|---|---|
| full | Train new | Train new | New calibration set |
| transfer | Use existing | Train new | Apply preprocessing to new domain |
| finetune | Use existing | Continue training | Add more data to existing model |
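Each of the three modes (full, transfer, finetune) can be read as a pair of decisions: one for the preprocessing steps and one for the model. The sketch below encodes that reading as a plain lookup. RETRAIN_PLAN and describe_mode are illustrative names introduced here, not part of nirs4all.

```python
# Illustrative only: restates the retrain modes, not nirs4all internals.
RETRAIN_PLAN = {
    "full": {"preprocessing": "train new", "model": "train new"},
    "transfer": {"preprocessing": "use existing", "model": "train new"},
    "finetune": {"preprocessing": "use existing", "model": "continue training"},
}


def describe_mode(mode: str) -> str:
    """Return a one-line summary of what a retrain mode refits."""
    if mode not in RETRAIN_PLAN:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(RETRAIN_PLAN)}")
    plan = RETRAIN_PLAN[mode]
    return f"{mode}: preprocessing={plan['preprocessing']}, model={plan['model']}"


for name in RETRAIN_PLAN:
    print(describe_mode(name))
```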
Basic Usage
Full Retrain
Train everything from scratch using the same pipeline structure:
from nirs4all.pipeline import PipelineRunner
from nirs4all.data import DatasetConfigs
runner = PipelineRunner(save_artifacts=True, verbose=0)
# Train initial model
predictions, _ = runner.run(pipeline_config, dataset_config)
best_pred = predictions.top(n=1, rank_partition="test")[0]
# Full retrain on new data
new_data = DatasetConfigs(['path/to/new_calibration'])
retrained_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='full',
    dataset_name='new_calibration',
    verbose=0
)
print(f"Retrained RMSE: {retrained_preds.top(n=1)[0]['rmse']:.4f}")
Transfer Mode
Reuse preprocessing artifacts (scalers, SNV, etc.) while training a new model:
# Transfer: reuse preprocessing, train new model
transfer_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='transfer',
    dataset_name='transfer_test',
    verbose=0
)
This is useful when:
Your preprocessing is well-optimized for the spectral domain
You want to apply the same preprocessing to a different target variable
You’re doing machine/instrument transfer calibration
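Conceptually, transfer mode freezes the fitted preprocessing and refits only the final estimator. The toy sketch below shows that idea on a single feature using only the standard library. FrozenScaler and fit_line are hypothetical helpers written for this illustration; they do not reflect how nirs4all stores or applies its artifacts.

```python
from statistics import mean, pstdev


class FrozenScaler:
    """Standardizer whose statistics are fitted once and then reused."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


# Original calibration: scaler and model are fitted together.
source_x, source_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
scaler = FrozenScaler().fit(source_x)
old_model = fit_line(scaler.transform(source_x), source_y)

# Transfer: the scaler keeps its statistics; only a new model is fitted.
new_x, new_y = [2.0, 3.0, 4.0, 5.0], [1.0, 1.5, 2.0, 2.5]
new_model = fit_line(scaler.transform(new_x), new_y)

print("scaler mean (unchanged):", scaler.mean_)
print("old model (slope, intercept):", old_model)
print("new model (slope, intercept):", new_model)
```

The scaler is never refitted on the new samples, which is why the new data must be comparable to the data the preprocessing was fitted on.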
Transfer with Different Model
Replace the model type during transfer:
from sklearn.ensemble import GradientBoostingRegressor
new_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
transfer_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='transfer',
    new_model=new_model,
    dataset_name='transfer_new_model',
    verbose=0
)
Finetune Mode
Continue training an existing model (most effective with neural networks):
# Finetune: continue training with additional epochs
finetune_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='finetune',
    epochs=10,
    dataset_name='finetune_test',
    verbose=0
)
Note: Fine-tuning is most effective with neural network models. scikit-learn estimators without incremental learning support, such as PLSRegression, are simply refit, so for them fine-tuning is equivalent to retraining.
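The difference between continuing training and starting over can be seen with a one-parameter model trained by gradient descent. This is a minimal sketch in plain Python. The train helper and its weight, epochs, and learning_rate arguments are assumptions made for the illustration and are not nirs4all's implementation.

```python
def train(x, y, weight=0.0, epochs=10, learning_rate=0.05):
    """Gradient descent on mean squared error for the model y = weight * x."""
    n = len(x)
    for _ in range(epochs):
        grad = sum(2 * (weight * a - b) * a for a, b in zip(x, y)) / n
        weight -= learning_rate * grad
    return weight


x = [1.0, 2.0, 3.0]
old_y = [2.0, 4.0, 6.0]  # original data, true slope 2.0
new_y = [2.2, 4.4, 6.6]  # new data, true slope 2.2

base = train(x, old_y, epochs=200)                  # initial training
finetuned = train(x, new_y, weight=base, epochs=3)  # continue from trained weight
scratch = train(x, new_y, weight=0.0, epochs=3)     # same budget, cold start

print(f"base={base:.3f} finetuned={finetuned:.3f} scratch={scratch:.3f}")
```

With the same small epoch budget, the run that starts from the trained weight ends closer to the new slope than the cold start. A model that can only be fitted in one shot has no equivalent of the weight argument, which is why fine-tuning collapses to a refit for it.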
Retrain Sources
The retrain() method accepts various sources:
# From prediction dict
runner.retrain(best_prediction, new_data, mode='full')
# From folder path
runner.retrain("runs/2024-12-14_wheat/pipeline_abc123/", new_data, mode='transfer')
# From bundle file
runner.retrain("exports/wheat_model.n4a", new_data, mode='transfer')
# From model ID
runner.retrain(model_id, new_data, mode='full')
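When logging or validating calls, it can help to label which kind of source is being passed. The helper below is a hypothetical sketch: classify_source and its rules (a dict is a prediction, a .n4a suffix is a bundle, a string containing a slash is a folder, anything else is a model ID) are assumptions made for illustration and are not how nirs4all resolves sources internally.

```python
from pathlib import PurePosixPath


def classify_source(source) -> str:
    """Guess the kind of retrain source (illustrative rules only)."""
    if isinstance(source, dict):
        return "prediction"
    if not isinstance(source, str):
        raise TypeError(f"unsupported source type: {type(source).__name__}")
    if PurePosixPath(source).suffix == ".n4a":
        return "bundle"
    if "/" in source:
        return "folder"
    return "model_id"


for example in ({"rmse": 0.1}, "runs/2024-12-14_wheat/pipeline_abc123/",
                "exports/wheat_model.n4a", "abc123"):
    print(classify_source(example))
```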
Extract and Modify
Get the pipeline structure for inspection or modification:
# Extract pipeline
extracted = runner.extract(best_pred)
# Inspect
print(f"Number of steps: {len(extracted)}")
print(f"Model step index: {extracted.model_step_index}")
print(f"Preprocessing chain: {extracted.preprocessing_chain}")
# View steps
for i, step in enumerate(extracted.steps):
    print(f"Step {i}: {step}")
Modify and Run
from sklearn.ensemble import RandomForestRegressor
# Replace model
extracted.set_model(RandomForestRegressor(n_estimators=100))
# Run modified pipeline
modified_preds, _ = runner.run(
    pipeline=extracted.steps,
    dataset=new_data,
    pipeline_name='modified_pipeline'
)
Fine-grained Step Control
For advanced use cases, you can control each step individually:
from nirs4all.pipeline import StepMode
# Define step modes
step_modes = [
    StepMode(step_index=1, mode='predict'),  # Use existing scaler
    StepMode(step_index=2, mode='predict'),  # Use existing y_processing
    StepMode(step_index=3, mode='train'),    # Retrain preprocessing
    # Model step will follow overall mode
]
controlled_preds, _ = runner.retrain(
    source=best_pred,
    dataset=new_data,
    mode='full',
    step_modes=step_modes,
    dataset_name='controlled_retrain',
    verbose=0
)
StepMode Options
| Mode | Description |
|---|---|
| train | Train this step from scratch |
| predict | Use existing artifact (no retraining) |
| skip | Skip this step entirely |
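Per-step overrides combine with the overall mode: steps that are listed use their own mode, and unlisted steps fall back to a default. The sketch below illustrates that resolution under stated assumptions. StepModeSketch and resolve_modes are stand-ins written for this example (not the nirs4all StepMode class), and the 1-based step numbering simply follows the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepModeSketch:
    """Stand-in for a per-step override; not the nirs4all StepMode class."""
    step_index: int
    mode: str  # e.g. 'train' or 'predict'


def resolve_modes(n_steps, default_mode, overrides):
    """Return the effective mode for steps 1..n_steps, applying overrides."""
    by_index = {o.step_index: o.mode for o in overrides}
    return {i: by_index.get(i, default_mode) for i in range(1, n_steps + 1)}


overrides = [
    StepModeSketch(step_index=1, mode="predict"),
    StepModeSketch(step_index=2, mode="predict"),
    StepModeSketch(step_index=3, mode="train"),
]
# Step 4 (the model step in this toy pipeline) has no override,
# so it follows the default.
effective = resolve_modes(4, "train", overrides)
print(effective)
```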
Use Cases
1. Seasonal Recalibration
Update your model with a new season’s samples:
# Load previous best model
previous_model = predictions_db.get_best_for_dataset('wheat_2023')
# Retrain with new season's data
new_season = DatasetConfigs(['data/wheat_2024'])
updated_preds, _ = runner.retrain(
    source=previous_model,
    dataset=new_season,
    mode='full',
    dataset_name='wheat_2024'
)
2. Machine Transfer
Apply the preprocessing fitted on a reference instrument to data from a new instrument:
# Model trained on Machine A
machine_a_model = best_prediction
# New data from Machine B
machine_b_data = DatasetConfigs(['data/machine_b'])
# Transfer preprocessing, train new model for Machine B
transfer_preds, _ = runner.retrain(
    source=machine_a_model,
    dataset=machine_b_data,
    mode='transfer',
    dataset_name='machine_b_calibration'
)
3. Multi-target Prediction
Use the same preprocessing for different target variables:
# Original model for protein
protein_model = best_prediction
# Same spectra, different target (moisture)
moisture_data = DatasetConfigs({
    'X': 'spectra.csv',
    'Y': 'moisture_values.csv'
})
# Transfer preprocessing, train for moisture
moisture_preds, _ = runner.retrain(
    source=protein_model,
    dataset=moisture_data,
    mode='transfer',
    dataset_name='moisture_prediction'
)
4. A/B Testing Models
Compare different models with the same preprocessing:
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
models = [
    PLSRegression(n_components=10),
    GradientBoostingRegressor(n_estimators=100),
    RandomForestRegressor(n_estimators=100),
]
results = {}
for model in models:
    preds, _ = runner.retrain(
        source=best_pred,
        dataset=test_data,
        mode='transfer',
        new_model=model,
        dataset_name=f'test_{model.__class__.__name__}'
    )
    best = preds.top(n=1, rank_partition="test")[0]
    results[model.__class__.__name__] = best['rmse']
print("Model Comparison:")
for name, rmse in sorted(results.items(), key=lambda x: x[1]):
    print(f"  {name}: RMSE = {rmse:.4f}")
Neural Network Fine-tuning
For TensorFlow/PyTorch models, fine-tuning supports additional options:
finetune_preds, _ = runner.retrain(
    source=best_nicon_model,
    dataset=new_data,
    mode='finetune',
    epochs=20,
    learning_rate=0.0001,              # Lower LR for fine-tuning
    freeze_layers=['conv1', 'conv2'],  # Freeze early layers
    dataset_name='finetune_nicon'
)
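Freezing a layer means its parameters are excluded from the update step while the remaining layers keep learning. The sketch below shows the idea with scalar stand-ins for layer weights. sgd_step is a hypothetical helper, the dense layer name is invented for the example, and none of this reflects how the TensorFlow or PyTorch backends apply freeze_layers.

```python
def sgd_step(params, grads, learning_rate, frozen=()):
    """Apply one gradient step, leaving parameters of frozen layers untouched."""
    frozen = set(frozen)
    return {
        name: value if name in frozen else value - learning_rate * grads[name]
        for name, value in params.items()
    }


# Scalar stand-ins for per-layer weights and their gradients.
params = {"conv1": 0.5, "conv2": -0.3, "dense": 1.0}
grads = {"conv1": 0.2, "conv2": 0.1, "dense": -0.4}

updated = sgd_step(params, grads, learning_rate=0.0001, frozen=["conv1", "conv2"])
print(updated)
```

Only the unfrozen layer moves, and the small learning rate keeps that move small, which is the usual motivation for combining both options when adapting a trained network to a modest amount of new data.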
Best Practices
Validate preprocessing compatibility: Ensure the new data covers the same wavelength range as the original data
Check feature dimensions: The new data must have the same number of features
Use transfer mode wisely: It works best when the preprocessing is already well-optimized
Start with full retrain: When in doubt, retrain everything
Compare modes: Test different modes to find what works best
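The first two checks can be run before calling retrain. The following is a minimal sketch that assumes the wavelength axes of the original and new data are available as plain lists of numbers. check_compatibility is a hypothetical helper, not a nirs4all function.

```python
def check_compatibility(expected_wavelengths, new_wavelengths, tolerance=1e-6):
    """Return a list of human-readable problems; an empty list means compatible."""
    problems = []
    if len(expected_wavelengths) != len(new_wavelengths):
        problems.append(
            f"feature count differs: expected {len(expected_wavelengths)}, "
            f"got {len(new_wavelengths)}"
        )
    if expected_wavelengths and new_wavelengths:
        # Compare both ends of the wavelength range.
        for label, pick in (("min", min), ("max", max)):
            exp, new = pick(expected_wavelengths), pick(new_wavelengths)
            if abs(exp - new) > tolerance:
                problems.append(f"{label} wavelength differs: expected {exp}, got {new}")
    return problems


reference = [1000.0, 1002.0, 1004.0, 1006.0]
candidate = [1000.0, 1002.0, 1004.0]
for problem in check_compatibility(reference, candidate):
    print("incompatible:", problem)
```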
Troubleshooting
Missing Artifacts
# Error: Artifact not found
# Solution: Ensure original model was trained with save_artifacts=True
runner = PipelineRunner(save_artifacts=True, verbose=0)
Feature Mismatch
# Error: Feature dimension mismatch
# Solution: Verify new data has same number of features
print(f"Expected features: {best_pred['n_features']}")
print(f"New data features: {new_data.shape[1]}")
Mode Not Suitable
# Finetune with sklearn model = just retraining
# Use transfer or full mode instead for sklearn models
runner.retrain(source, data, mode='transfer') # Better for sklearn
See Also
Prediction and Model Reuse - Basic prediction workflows
Export and Deployment - Export models for deployment
Migration Guide - Upgrade from older versions
Writing a Pipeline in nirs4all - Pipeline syntax reference