Prediction and Model Reuse

This guide covers how to make predictions using trained models in nirs4all, including loading saved models and applying them to new datasets.

Overview

After training a pipeline, you can use the trained models to make predictions on new data. nirs4all supports several prediction workflows:

Direct prediction: Use a trained model to predict new samples
Model persistence: Save and reload models for later use
Cross-validation ensembles: Combine predictions from multiple CV folds

Basic Prediction Workflow

Training a Model

First, train your pipeline:

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import MinMaxScaler

from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineConfigs, PipelineRunner

# Define pipeline
pipeline = [
    MinMaxScaler(),
    RepeatedKFold(n_splits=3, n_repeats=1, random_state=42),
    {"model": PLSRegression(n_components=10), "name": "PLS_10"},
]

# Train
runner = PipelineRunner(save_artifacts=True, verbose=0)
predictions, _ = runner.run(
    PipelineConfigs(pipeline),
    DatasetConfigs(['path/to/training_data'])
)

# Get best model
best_prediction = predictions.top(n=1, rank_partition="test")[0]
print(f"Best model: {best_prediction['model_name']}")
print(f"RMSE: {best_prediction['rmse']:.4f}")

Making Predictions

Use the predict() method with the best prediction entry:

# Create predictor
predictor = PipelineRunner(save_artifacts=False, save_charts=False, verbose=0)

# Load new data
new_dataset = DatasetConfigs({
    'X_test': 'path/to/new_spectra.csv'
})

# Make predictions
y_pred, _ = predictor.predict(best_prediction, new_dataset, verbose=0)
print(f"Predictions: {y_pred[:5]}")

Prediction Sources

The predict() method accepts various sources:

1. Prediction Dictionary (Most Common)

# From Predictions object
best_prediction = predictions.top(n=1, rank_partition="test")[0]
y_pred, _ = runner.predict(best_prediction, new_data)

2. Model ID String

# Using the prediction ID directly
model_id = best_prediction['id']
y_pred, _ = runner.predict(model_id, new_data)

3. Folder Path

# From a pipeline folder
y_pred, _ = runner.predict("runs/2024-12-14_wheat/pipeline_abc123/", new_data)

4. Bundle File

# From an exported bundle (see Export section)
y_pred, _ = runner.predict("exports/wheat_model.n4a", new_data)

5. Direct Model File

You can load a model directly from its binary file. This is useful when you have a pre-trained model saved externally or want to use models trained outside nirs4all.

# From a sklearn/joblib model file
y_pred, _ = runner.predict("models/pls_wheat.joblib", new_data)

# From a pickle file
y_pred, _ = runner.predict("models/my_model.pkl", new_data)

# From a TensorFlow/Keras model
y_pred, _ = runner.predict("models/nn_model.h5", new_data)
y_pred, _ = runner.predict("models/nn_model.keras", new_data)

# From a PyTorch model
y_pred, _ = runner.predict("models/torch_model.pt", new_data)
y_pred, _ = runner.predict("models/checkpoint.pth", new_data)

# From a model folder (AutoGluon, TensorFlow SavedModel)
y_pred, _ = runner.predict("models/autogluon_model/", new_data)
y_pred, _ = runner.predict("models/tf_savedmodel/", new_data)

Supported formats:

Extension	Framework	Notes
`.joblib`	sklearn, XGBoost, LightGBM	Recommended for sklearn models
`.pkl`	Any (cloudpickle)	General purpose
`.h5`, `.hdf5`	TensorFlow/Keras	Legacy Keras format
`.keras`	TensorFlow/Keras	Modern Keras format
`.pt`, `.pth`	PyTorch	Full model or state dict
`.ckpt`	PyTorch	Checkpoint file
folder	AutoGluon, TensorFlow	SavedModel format

Important: When using direct model files, no preprocessing artifacts are loaded. The input data should already be preprocessed appropriately for the model.

Prediction Output

The predict() method returns:

y_pred, predictions_obj = runner.predict(source, dataset)

y_pred: numpy array of predictions (averaged across folds if CV)
predictions_obj: Predictions object with full metadata

Getting All Predictions

To get predictions from all models (not just the best):

all_preds, predictions_obj = runner.predict(
    best_prediction,
    new_data,
    all_predictions=True,
    verbose=0
)

# Iterate over all predictions
for pred in predictions_obj.to_dicts():
    print(f"{pred['model_name']}: {pred['rmse']:.4f}")

Cross-Validation Ensemble Predictions

When training with cross-validation, each fold produces a separate model. During prediction, nirs4all automatically:

Loads all fold models
Makes predictions with each
Combines predictions using weighted averaging

The weights are determined by validation performance:

# Fold weights are stored in the prediction
fold_weights = best_prediction.get('fold_weights', {})
print(f"Fold weights: {fold_weights}")
# e.g., {0: 0.34, 1: 0.33, 2: 0.33}

Data Format for Prediction

The new data must have the same number of features as the training data:

# Check expected features
n_features = best_prediction['n_features']
print(f"Expected features: {n_features}")

# Supported formats
# 1. CSV file
new_data = DatasetConfigs({'X_test': 'spectra.csv'})

# 2. NumPy array
import numpy as np
X_new = np.random.randn(20, n_features)
new_data = DatasetConfigs({'test_x': X_new})

# 3. Dictionary
new_data = {'test_x': X_new}

Preprocessing Replay

During prediction, nirs4all automatically replays the preprocessing steps:

Loads saved transformer artifacts (scalers, SNV, etc.)
Applies transforms in the same order as training
Feeds transformed data to the model

This ensures consistent preprocessing between training and prediction.

Error Handling

Common prediction errors and solutions:

Missing Model

try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except FileNotFoundError as e:
    print(f"Model not found: {e}")
    # Re-train or check save_artifacts=True during training

Feature Mismatch

try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except ValueError as e:
    print(f"Feature mismatch: {e}")
    # Check new data has same number of features as training

Missing Preprocessing Artifacts

try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except KeyError as e:
    print(f"Missing artifact: {e}")
    # Ensure all preprocessing steps were saved during training

Using Model Files in Pipelines

You can include pre-trained models directly in your pipeline configuration. This is useful for:

Transfer learning with pre-trained models
Ensemble with external models
Fine-tuning existing models

Loading a Model in Pipeline

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold

# Use a pre-trained model file in pipeline
pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5),
    {"model": "models/pretrained_pls.joblib", "name": "pretrained_pls"}
]

runner = PipelineRunner()
predictions, _ = runner.run(pipeline, dataset)

Supported Model Formats in Pipelines

The model path is resolved automatically:

# sklearn/scikit-learn models
{"model": "models/pls.joblib"}
{"model": "models/ridge.pkl"}

# TensorFlow/Keras
{"model": "models/nn.h5"}
{"model": "models/nn.keras"}
{"model": "models/savedmodel_folder/"}  # SavedModel format

# PyTorch
{"model": "models/torch_model.pt"}
{"model": "models/checkpoint.pth"}
{"model": "models/checkpoint.ckpt"}

# AutoGluon
{"model": "models/autogluon_predictor/"}

Exporting a Model to File

You can export just the model (not the full bundle) from a trained prediction using export_model(). This creates a lightweight model file that can be loaded later or shared:

from nirs4all.pipeline import PipelineRunner

runner = PipelineRunner(save_artifacts=True)
predictions, _ = runner.run(pipeline, dataset)
best_pred = predictions.top(n=1)[0]

# Export just the model to .joblib
runner.export_model(best_pred, "exports/pls_model.joblib")

# Export to different formats
runner.export_model(best_pred, "exports/model.pkl")        # Pickle format
runner.export_model(best_pred, "exports/model.keras")      # TensorFlow/Keras

# Export a specific fold's model
runner.export_model(best_pred, "exports/fold2_model.joblib", fold=2)

Difference between export() and export_model():

Method	Output	Use Case
`export()`	Full `.n4a` bundle	Deployment, sharing complete pipelines
`export_model()`	Model file only	Lightweight sharing, external tools

The full bundle includes preprocessing artifacts and metadata, while export_model() exports only the trained model binary.

Fine-tuning a Pre-trained Model

# Load and fine-tune a pre-trained model
runner = PipelineRunner()
predictions, _ = runner.retrain(
    source="models/pretrained_pls.joblib",
    dataset=new_training_data,
    mode='finetune'
)

Transfer Learning

# Use a model trained on one dataset for another
runner = PipelineRunner()
predictions, _ = runner.retrain(
    source="models/wheat_model.joblib",
    dataset=corn_dataset,
    mode='transfer'
)

Best Practices

Always use save_artifacts=True during training to persist models
Verify feature dimensions before prediction
Use the same preprocessing as training (automatic with nirs4all)
Store prediction entries for reproducibility
Test predictions against known validation data
Match preprocessing when using direct model files - no preprocessing is replayed