Prediction and Model Reuse
This guide covers how to make predictions using trained models in nirs4all, including loading saved models and applying them to new datasets.
Overview
After training a pipeline, you can use the trained models to make predictions on new data. nirs4all supports several prediction workflows:
Direct prediction: Use a trained model to predict new samples
Model persistence: Save and reload models for later use
Cross-validation ensembles: Combine predictions from multiple CV folds
Basic Prediction Workflow
Training a Model
First, train your pipeline:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import MinMaxScaler
from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineConfigs, PipelineRunner
# Define pipeline
pipeline = [
    MinMaxScaler(),
    RepeatedKFold(n_splits=3, n_repeats=1, random_state=42),
    {"model": PLSRegression(n_components=10), "name": "PLS_10"},
]
# Train
runner = PipelineRunner(save_artifacts=True, verbose=0)
predictions, _ = runner.run(
    PipelineConfigs(pipeline),
    DatasetConfigs(['path/to/training_data'])
)
# Get best model
best_prediction = predictions.top(n=1, rank_partition="test")[0]
print(f"Best model: {best_prediction['model_name']}")
print(f"RMSE: {best_prediction['rmse']:.4f}")
Making Predictions
Use the predict() method with the best prediction entry:
# Create predictor
predictor = PipelineRunner(save_artifacts=False, save_charts=False, verbose=0)
# Load new data
new_dataset = DatasetConfigs({
    'X_test': 'path/to/new_spectra.csv'
})
# Make predictions
y_pred, _ = predictor.predict(best_prediction, new_dataset, verbose=0)
print(f"Predictions: {y_pred[:5]}")
Prediction Sources
The predict() method accepts various sources:
1. Prediction Dictionary (Most Common)
# From Predictions object
best_prediction = predictions.top(n=1, rank_partition="test")[0]
y_pred, _ = runner.predict(best_prediction, new_data)
2. Model ID String
# Using the prediction ID directly
model_id = best_prediction['id']
y_pred, _ = runner.predict(model_id, new_data)
3. Folder Path
# From a pipeline folder
y_pred, _ = runner.predict("runs/2024-12-14_wheat/pipeline_abc123/", new_data)
4. Bundle File
# From an exported bundle (see Export section)
y_pred, _ = runner.predict("exports/wheat_model.n4a", new_data)
5. Direct Model File
You can load a model directly from its binary file. This is useful when you have a pre-trained model saved externally or want to use models trained outside nirs4all.
# From a sklearn/joblib model file
y_pred, _ = runner.predict("models/pls_wheat.joblib", new_data)
# From a pickle file
y_pred, _ = runner.predict("models/my_model.pkl", new_data)
# From a TensorFlow/Keras model
y_pred, _ = runner.predict("models/nn_model.h5", new_data)
y_pred, _ = runner.predict("models/nn_model.keras", new_data)
# From a PyTorch model
y_pred, _ = runner.predict("models/torch_model.pt", new_data)
y_pred, _ = runner.predict("models/checkpoint.pth", new_data)
# From a model folder (AutoGluon, TensorFlow SavedModel)
y_pred, _ = runner.predict("models/autogluon_model/", new_data)
y_pred, _ = runner.predict("models/tf_savedmodel/", new_data)
Supported formats:
| Extension | Framework | Notes |
|---|---|---|
| .joblib | sklearn, XGBoost, LightGBM | Recommended for sklearn models |
| .pkl | Any (cloudpickle) | General purpose |
| .h5 | TensorFlow/Keras | Legacy Keras format |
| .keras | TensorFlow/Keras | Modern Keras format |
| .pt | PyTorch | Full model or state dict |
| .pth | PyTorch | Checkpoint file |
| folder | AutoGluon, TensorFlow | SavedModel format |
Important: When using direct model files, no preprocessing artifacts are loaded. The input data should already be preprocessed appropriately for the model.
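The format table above can be read as a simple lookup from file extension to framework family. The sketch below illustrates that reading only. `FRAMEWORK_BY_EXTENSION` and `guess_framework` are hypothetical names introduced here, not part of nirs4all, and the library's own resolver may behave differently (for example, it may inspect file contents).

```python
from pathlib import PurePosixPath

# Hypothetical lookup mirroring the "Supported formats" table.
# This is an illustration, not nirs4all's internal resolver.
FRAMEWORK_BY_EXTENSION = {
    ".joblib": "sklearn, XGBoost, LightGBM",
    ".pkl": "Any (cloudpickle)",
    ".h5": "TensorFlow/Keras",
    ".keras": "TensorFlow/Keras",
    ".pt": "PyTorch",
    ".pth": "PyTorch",
}


def guess_framework(model_path: str) -> str:
    """Return the framework family suggested by a model path's extension."""
    suffix = PurePosixPath(model_path).suffix.lower()
    if suffix == "":
        # No extension: assume the path points to a model folder.
        return "AutoGluon, TensorFlow"
    if suffix not in FRAMEWORK_BY_EXTENSION:
        raise ValueError(f"Unsupported model extension: {suffix!r}")
    return FRAMEWORK_BY_EXTENSION[suffix]


print(guess_framework("models/pls_wheat.joblib"))
print(guess_framework("models/autogluon_model/"))
```

A check like this can be run before calling predict() to fail early on an unexpected file type.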
Prediction Output
The predict() method returns:
y_pred, predictions_obj = runner.predict(source, dataset)
y_pred: NumPy array of predictions (averaged across folds if CV)
predictions_obj: Predictions object with full metadata
Getting All Predictions
To get predictions from all models (not just the best):
all_preds, predictions_obj = runner.predict(
    best_prediction,
    new_data,
    all_predictions=True,
    verbose=0
)
# Iterate over all predictions
for pred in predictions_obj.to_dicts():
    print(f"{pred['model_name']}: {pred['rmse']:.4f}")
Cross-Validation Ensemble Predictions
When training with cross-validation, each fold produces a separate model. During prediction, nirs4all automatically:
Loads all fold models
Makes predictions with each
Combines predictions using weighted averaging
The weights are determined by validation performance:
# Fold weights are stored in the prediction
fold_weights = best_prediction.get('fold_weights', {})
print(f"Fold weights: {fold_weights}")
# e.g., {0: 0.34, 1: 0.33, 2: 0.33}
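To make the combination step concrete, here is a minimal NumPy sketch of a weighted average over per-fold predictions. It assumes the weights are applied as a plain weighted mean; the exact scheme nirs4all uses to derive the weights from validation performance is not specified here. `combine_fold_predictions` and the sample numbers are illustrative.

```python
import numpy as np


def combine_fold_predictions(fold_preds: dict, fold_weights: dict) -> np.ndarray:
    """Weighted average of per-fold predictions (illustrative helper)."""
    folds = sorted(fold_preds)
    # Shape (n_folds, n_samples): one row of predictions per fold model.
    stacked = np.stack([np.asarray(fold_preds[f], dtype=float) for f in folds])
    weights = np.array([fold_weights[f] for f in folds], dtype=float)
    # np.average divides by the sum of the weights, so they need not sum to 1.
    return np.average(stacked, axis=0, weights=weights)


# Made-up predictions from three fold models for two samples.
fold_preds = {
    0: [10.0, 20.0],
    1: [12.0, 22.0],
    2: [14.0, 24.0],
}
fold_weights = {0: 0.5, 1: 0.25, 2: 0.25}

y_ensemble = combine_fold_predictions(fold_preds, fold_weights)
print(y_ensemble)  # [11.5 21.5]
```

With equal weights this reduces to the ordinary mean across folds.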
Data Format for Prediction
The new data must have the same number of features as the training data:
# Check expected features
n_features = best_prediction['n_features']
print(f"Expected features: {n_features}")
# Supported formats
# 1. CSV file
new_data = DatasetConfigs({'X_test': 'spectra.csv'})
# 2. NumPy array
import numpy as np
X_new = np.random.randn(20, n_features)
new_data = DatasetConfigs({'test_x': X_new})
# 3. Dictionary
new_data = {'test_x': X_new}
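A feature-count check before calling predict() can surface a mismatch early with a clearer message. The helper below is a sketch; `check_feature_count` is a hypothetical name, and the expected count would come from best_prediction['n_features'] as shown above.

```python
import numpy as np


def check_feature_count(X_new, n_features_expected: int) -> np.ndarray:
    """Return X_new as a 2-D float array, or raise if its width is wrong."""
    X = np.asarray(X_new, dtype=float)
    if X.ndim == 1:
        # A single spectrum: treat it as one row.
        X = X.reshape(1, -1)
    if X.ndim != 2:
        raise ValueError(f"Expected a 2-D array, got {X.ndim} dimensions")
    if X.shape[1] != n_features_expected:
        raise ValueError(
            f"Expected {n_features_expected} features, got {X.shape[1]}"
        )
    return X


X_ok = check_feature_count(np.zeros((20, 8)), n_features_expected=8)
print(X_ok.shape)  # (20, 8)
```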
Preprocessing Replay
During prediction, nirs4all automatically replays the preprocessing steps:
Loads saved transformer artifacts (scalers, SNV, etc.)
Applies transforms in the same order as training
Feeds transformed data to the model
This ensures consistent preprocessing between training and prediction.
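The idea behind replay can be sketched with plain NumPy: parameters are fitted once on the training data, persisted, and later applied unchanged and in the same order to new data. This is a conceptual illustration only and does not reflect how nirs4all stores its artifacts. The two steps (min-max scaling, then mean-centering) and the in-memory buffer are assumptions chosen to keep the example self-contained.

```python
import io

import numpy as np

# --- Training time: fit each step in order and persist its parameters ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 5))

x_min = X_train.min(axis=0)
x_range = X_train.max(axis=0) - x_min
x_range[x_range == 0] = 1.0           # guard against constant features
X_scaled = (X_train - x_min) / x_range
col_mean = X_scaled.mean(axis=0)      # fitted on the scaled data (step 2)

artifacts = io.BytesIO()              # stands in for an artifact file on disk
np.savez(artifacts, x_min=x_min, x_range=x_range, col_mean=col_mean)

# --- Prediction time: reload and replay the same steps in the same order ---
artifacts.seek(0)
saved = np.load(artifacts)

X_new = rng.normal(size=(4, 5))
# Nothing is re-fitted on X_new; only the saved parameters are applied.
X_new_ready = (X_new - saved["x_min"]) / saved["x_range"] - saved["col_mean"]
print(X_new_ready.shape)
```

Re-fitting the scaler on the new data instead would shift the inputs relative to what the model saw during training, which is the inconsistency that replay avoids.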
Error Handling
Common prediction errors and solutions:
Missing Model
try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except FileNotFoundError as e:
    print(f"Model not found: {e}")
    # Re-train or check save_artifacts=True during training
Feature Mismatch
try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except ValueError as e:
    print(f"Feature mismatch: {e}")
    # Check new data has same number of features as training
Missing Preprocessing Artifacts
try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except KeyError as e:
    print(f"Missing artifact: {e}")
    # Ensure all preprocessing steps were saved during training
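The three cases above can be folded into one small wrapper that maps each exception type to its hint. This is a sketch; `try_predict` and `HINTS` are hypothetical names, and the wrapper takes any zero-argument callable so that it does not assume anything about the predict() signature.

```python
from typing import Any, Callable, Optional, Tuple

# Hints taken from the three cases above, keyed by exception type.
HINTS = {
    FileNotFoundError: "Model not found: re-train or check save_artifacts=True",
    ValueError: "Feature mismatch: check the feature count of the new data",
    KeyError: "Missing artifact: ensure preprocessing steps were saved",
}


def try_predict(predict_fn: Callable[[], Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Call predict_fn and return (result, None) or (None, hint)."""
    try:
        return predict_fn(), None
    except tuple(HINTS) as exc:
        # isinstance also matches subclasses of the listed exception types.
        hint = next(msg for cls, msg in HINTS.items() if isinstance(exc, cls))
        return None, f"{hint} ({exc})"


# Stand-in callable that fails the way a missing artifact would.
result, hint = try_predict(lambda: {}["scaler"])
print(result, hint)
```

With the runner from the earlier examples, the call would look like try_predict(lambda: runner.predict(best_prediction, new_data)). Any other exception type still propagates unchanged.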
Using Model Files in Pipelines
You can include pre-trained models directly in your pipeline configuration. This is useful for:
Transfer learning with pre-trained models
Ensemble with external models
Fine-tuning existing models
Loading a Model in Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold
from nirs4all.pipeline import PipelineRunner
# Use a pre-trained model file in pipeline
pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5),
    {"model": "models/pretrained_pls.joblib", "name": "pretrained_pls"}
]
runner = PipelineRunner()
predictions, _ = runner.run(pipeline, dataset)
Supported Model Formats in Pipelines
The model path is resolved automatically:
# sklearn/scikit-learn models
{"model": "models/pls.joblib"}
{"model": "models/ridge.pkl"}
# TensorFlow/Keras
{"model": "models/nn.h5"}
{"model": "models/nn.keras"}
{"model": "models/savedmodel_folder/"} # SavedModel format
# PyTorch
{"model": "models/torch_model.pt"}
{"model": "models/checkpoint.pth"}
{"model": "models/checkpoint.ckpt"}
# AutoGluon
{"model": "models/autogluon_predictor/"}
Exporting a Model to File
You can export just the model (not the full bundle) from a trained prediction using export_model(). This creates a lightweight model file that can be loaded later or shared:
from nirs4all.pipeline import PipelineRunner
runner = PipelineRunner(save_artifacts=True)
predictions, _ = runner.run(pipeline, dataset)
best_pred = predictions.top(n=1)[0]
# Export just the model to .joblib
runner.export_model(best_pred, "exports/pls_model.joblib")
# Export to different formats
runner.export_model(best_pred, "exports/model.pkl") # Pickle format
runner.export_model(best_pred, "exports/model.keras") # TensorFlow/Keras
# Export a specific fold's model
runner.export_model(best_pred, "exports/fold2_model.joblib", fold=2)
Difference between export() and export_model():
| Method | Output | Use Case |
|---|---|---|
| export() | Full bundle | Deployment, sharing complete pipelines |
| export_model() | Model file only | Lightweight sharing, external tools |
The full bundle includes preprocessing artifacts and metadata, while export_model() exports only the trained model binary.
Fine-tuning a Pre-trained Model
# Load and fine-tune a pre-trained model
runner = PipelineRunner()
predictions, _ = runner.retrain(
    source="models/pretrained_pls.joblib",
    dataset=new_training_data,
    mode='finetune'
)
Transfer Learning
# Use a model trained on one dataset for another
runner = PipelineRunner()
predictions, _ = runner.retrain(
    source="models/wheat_model.joblib",
    dataset=corn_dataset,
    mode='transfer'
)
Best Practices
Always use save_artifacts=True during training to persist models
Verify feature dimensions before prediction
Use the same preprocessing as training (automatic with nirs4all)
Store prediction entries for reproducibility
Test predictions against known validation data
Match preprocessing when using direct model files - no preprocessing is replayed
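For the practice of testing predictions against known validation data, a quick check is to compute the RMSE between predicted and reference values on held-out samples. The sketch below uses plain NumPy, and the numbers are made up for illustration. In practice `y_pred` would come from predict() and `y_known` from your reference measurements.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same length")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up reference values and predictions for four held-out samples.
y_known = [12.1, 13.4, 11.8, 14.0]
y_pred = [12.0, 13.0, 12.0, 14.5]
print(f"RMSE on held-out samples: {rmse(y_known, y_pred):.4f}")
```

A value far above the RMSE reported at training time suggests a problem with the new data or its preprocessing.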
See Also
Export and Bundles - Export models for deployment
Retrain and Transfer - Retrain models on new data
User Guide - Complete user guide