Making Predictions

This guide covers all the ways to predict on new data with trained nirs4all models. Whether you have a RunResult from a training session, an exported .n4a bundle, or a chain ID in your workspace, the nirs4all.predict() function handles them all.

From a RunResult (Most Common)

After training a pipeline, the simplest path is to export the best model and predict from the bundle:

import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression

# Train
result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset="sample_data/regression",
)

# Export best model
result.export("best_model.n4a")

# Predict on new data
preds = nirs4all.predict(model="best_model.n4a", data=X_new)
print(preds.values)  # numpy array of predictions

You can also predict directly from a prediction dictionary (the output of result.best or result.top()):

# Predict using the best prediction entry
preds = nirs4all.predict(model=result.best, data=X_new)

From an Exported Bundle (.n4a)

A .n4a bundle is a self-contained ZIP file that includes the chain definition and all fitted artifacts. It can be shared between machines without requiring the original workspace.

import nirs4all

# Predict from a bundle file
preds = nirs4all.predict(model="exports/wheat_model.n4a", data=X_new)

print(f"Predictions shape: {preds.shape}")
print(f"Model: {preds.model_name}")
print(f"Values: {preds.values[:5]}")

Bundles are portable – copy the .n4a file to another machine and predict without any workspace setup.

From a Chain ID

If you are working within a workspace that has a DuckDB store, you can predict directly from a stored chain without exporting first:

import nirs4all

# Predict using a chain ID from the workspace store
preds = nirs4all.predict(
    chain_id="abc123-def456",
    data=X_new,
    workspace_path="workspace",
)

This path replays the chain directly from the store, loading artifacts from the artifacts/ directory. It avoids the overhead of exporting to a bundle first.

Note

The model and chain_id parameters are mutually exclusive. Provide one or the other, not both.

From a Standalone Python Script (.n4a.py)

You can export a model as a self-contained Python script that embeds all artifacts. This script runs without nirs4all installed:

# Export as standalone script
result.export("model.n4a.py", format="n4a.py")

Then run it from the command line:

python model.n4a.py input_spectra.csv

The script reads the input CSV, applies all preprocessing steps, runs the model, and prints predictions to stdout.

Data Format Requirements

The data parameter accepts several formats:

Format	Example	Notes
NumPy array	`X_new` (shape: n_samples x n_features)	Most direct; features must match training
Tuple	`(X,)` or `(X, y)`	y is optional, used for evaluation
Dict	`{"X": X, "metadata": meta}`	For chain-based prediction
Path (string)	`"new_data/"`	Folder parsed by dataset loaders
SpectroDataset	`SpectroDataset(...)`	Direct dataset object

Feature Count

The number of features (columns) in the new data must match the number of features seen during training. If the training data had 256 wavelengths, the new data must also have 256 columns.

# Check expected feature count from training
best = result.best
print(f"Expected features: {best.get('n_features')}")

Wavelength Alignment

For spectroscopic data, the wavelengths of the new data should align with the training data. If you are using wavelength-aware operators (e.g., CropTransformer, ResampleTransformer), the wavelength arrays must be compatible.

Cross-Validation Ensemble Averaging

When a model was trained with cross-validation (e.g., 5-fold CV), the chain contains 5 fitted model artifacts – one per fold. During prediction, nirs4all:

Loads all fold models from the chain
Applies the shared preprocessing steps to X_new
Runs model.predict(X_preprocessed) for each fold model
Averages the fold predictions element-wise

This ensemble averaging typically improves prediction stability compared to using a single fold’s model.

X_new --> Preprocessing (shared) --> fold_0 model --> y_pred_0
                                 --> fold_1 model --> y_pred_1  --> mean --> y_pred
                                 --> fold_2 model --> y_pred_2

Preprocessing Replay

During prediction, the chain is replayed step by step:

Each preprocessing step loads its saved artifact (fitted scaler, transformer, etc.)
The step’s transform() method is called on the data (never fit() – the artifacts are already fitted)
The transformed data is passed to the next step
At the model step, predict() is called instead of transform()

This guarantees that the same preprocessing is applied to new data as was applied during training. You do not need to manually apply preprocessing – it is handled automatically by the chain.

Error Handling

Feature Mismatch

If the new data has a different number of features than the training data:

try:
    preds = nirs4all.predict(model="model.n4a", data=X_wrong_shape)
except ValueError as e:
    print(f"Feature mismatch: {e}")
    # Fix: ensure X_new has the correct number of columns

Missing Bundle File

try:
    preds = nirs4all.predict(model="missing_model.n4a", data=X_new)
except FileNotFoundError as e:
    print(f"Bundle not found: {e}")
    # Fix: check the file path or re-export the model

Corrupt or Incomplete Bundle

If a bundle is missing artifact files:

try:
    preds = nirs4all.predict(model="corrupt.n4a", data=X_new)
except (KeyError, FileNotFoundError) as e:
    print(f"Missing artifact: {e}")
    # Fix: re-export the model from the workspace

Invalid Arguments

# Both model and chain_id provided
try:
    preds = nirs4all.predict(model="model.n4a", chain_id="abc", data=X_new)
except ValueError as e:
    print(f"Error: {e}")
    # Fix: provide either model or chain_id, not both

# Neither model nor chain_id provided
try:
    preds = nirs4all.predict(data=X_new)
except ValueError as e:
    print(f"Error: {e}")
    # Fix: provide either model or chain_id

PredictResult Output

nirs4all.predict() returns a PredictResult object:

preds = nirs4all.predict(model="model.n4a", data=X_new)

# Access predictions
preds.values          # numpy array (alias for y_pred)
preds.y_pred          # numpy array of predicted values
preds.shape           # shape of prediction array
preds.model_name      # name of the model used

# Convert to other formats
preds.to_numpy()      # numpy array
preds.to_list()       # Python list
preds.to_dataframe()  # pandas DataFrame

# Check properties
len(preds)            # number of predictions
preds.is_multioutput  # True if multi-output prediction
preds.flatten()       # flattened 1D array

PredictResult with Evaluation

When you provide both X and y (as a tuple), you can evaluate the predictions against ground truth:

preds = nirs4all.predict(model="model.n4a", data=(X_test, y_test))

# The predicted values
print(preds.values)

# Additional metadata
print(preds.metadata)

Complete Example

import nirs4all
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

# Define pipeline with cross-validation
pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"model": PLSRegression(n_components=10)},
]

# Train
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    verbose=1,
)

print(f"Best RMSE: {result.best_rmse:.4f}")
print(f"Best R2:   {result.best_r2:.4f}")

# Export
result.export("wheat_model.n4a")

# Predict on new data
X_new = np.random.randn(20, result.best.get("n_features", 100))
preds = nirs4all.predict(model="wheat_model.n4a", data=X_new)

print(f"Predictions: {preds.shape}")
print(f"First 5: {preds.values[:5]}")

# Convert to DataFrame
df = preds.to_dataframe()
print(df.head())