# Prediction and Model Reuse

This guide covers how to make predictions using trained models in nirs4all, including loading saved models and applying them to new datasets.

## Overview

After training a pipeline, you can use the trained models to make predictions on new data. nirs4all supports several prediction workflows:

1. **Direct prediction**: Use a trained model to predict new samples
2. **Model persistence**: Save and reload models for later use
3. **Cross-validation ensembles**: Combine predictions from multiple CV folds

## Basic Prediction Workflow

### Training a Model

First, train your pipeline:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import MinMaxScaler

from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineConfigs, PipelineRunner

# Define pipeline
pipeline = [
    MinMaxScaler(),
    RepeatedKFold(n_splits=3, n_repeats=1, random_state=42),
    {"model": PLSRegression(n_components=10), "name": "PLS_10"},
]

# Train
runner = PipelineRunner(save_artifacts=True, verbose=0)
predictions, _ = runner.run(
    PipelineConfigs(pipeline),
    DatasetConfigs(['path/to/training_data'])
)

# Get best model
best_prediction = predictions.top(n=1, rank_partition="test")[0]
print(f"Best model: {best_prediction['model_name']}")
print(f"RMSE: {best_prediction['rmse']:.4f}")
```

### Making Predictions

Use the `predict()` method with the best prediction entry:

```python
# Create predictor
predictor = PipelineRunner(save_artifacts=False, save_charts=False, verbose=0)

# Load new data
new_dataset = DatasetConfigs({
    'X_test': 'path/to/new_spectra.csv'
})

# Make predictions
y_pred, _ = predictor.predict(best_prediction, new_dataset, verbose=0)
print(f"Predictions: {y_pred[:5]}")
```

## Prediction Sources

The `predict()` method accepts various sources:

### 1. Prediction Dictionary (Most Common)

```python
# From Predictions object
best_prediction = predictions.top(n=1, rank_partition="test")[0]
y_pred, _ = runner.predict(best_prediction, new_data)
```

### 2. Model ID String

```python
# Using the prediction ID directly
model_id = best_prediction['id']
y_pred, _ = runner.predict(model_id, new_data)
```

### 3. Folder Path

```python
# From a pipeline folder
y_pred, _ = runner.predict("runs/2024-12-14_wheat/pipeline_abc123/", new_data)
```

### 4. Bundle File

```python
# From an exported bundle (see Export section)
y_pred, _ = runner.predict("exports/wheat_model.n4a", new_data)
```

### 5. Direct Model File

You can load a model directly from its binary file. This is useful when you have a pre-trained model saved externally or want to use models trained outside nirs4all.

```python
# From a sklearn/joblib model file
y_pred, _ = runner.predict("models/pls_wheat.joblib", new_data)

# From a pickle file
y_pred, _ = runner.predict("models/my_model.pkl", new_data)

# From a TensorFlow/Keras model
y_pred, _ = runner.predict("models/nn_model.h5", new_data)
y_pred, _ = runner.predict("models/nn_model.keras", new_data)

# From a PyTorch model
y_pred, _ = runner.predict("models/torch_model.pt", new_data)
y_pred, _ = runner.predict("models/checkpoint.pth", new_data)

# From a model folder (AutoGluon, TensorFlow SavedModel)
y_pred, _ = runner.predict("models/autogluon_model/", new_data)
y_pred, _ = runner.predict("models/tf_savedmodel/", new_data)
```

**Supported formats:**

| Extension | Framework | Notes |
|-----------|-----------|-------|
| `.joblib` | sklearn, XGBoost, LightGBM | Recommended for sklearn models |
| `.pkl` | Any (cloudpickle) | General purpose |
| `.h5`, `.hdf5` | TensorFlow/Keras | Legacy Keras format |
| `.keras` | TensorFlow/Keras | Modern Keras format |
| `.pt`, `.pth` | PyTorch | Full model or state dict |
| `.ckpt` | PyTorch | Checkpoint file |
| folder | AutoGluon, TensorFlow | SavedModel format |

**Important:** When using direct model files, no preprocessing artifacts are loaded. The input data must already be preprocessed the same way as the data the model was trained on.
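The extension-to-framework mapping in the table above can be sketched as a simple lookup. This is only an illustration of the table, not the resolver nirs4all actually uses; the `guess_framework` helper and the `FORMATS` dictionary are hypothetical names introduced here.

```python
from pathlib import Path

# Extension-to-framework lookup copied from the table above.
# FORMATS and guess_framework are illustrative names, not nirs4all API.
FORMATS = {
    ".joblib": "sklearn, XGBoost, LightGBM",
    ".pkl": "Any (cloudpickle)",
    ".h5": "TensorFlow/Keras",
    ".hdf5": "TensorFlow/Keras",
    ".keras": "TensorFlow/Keras",
    ".pt": "PyTorch",
    ".pth": "PyTorch",
    ".ckpt": "PyTorch",
}


def guess_framework(path: str) -> str:
    """Return the framework family suggested by a model path's extension."""
    suffix = Path(path).suffix.lower()
    if suffix == "":
        # No extension: assumed here to be a model folder.
        return "AutoGluon, TensorFlow"
    if suffix not in FORMATS:
        raise ValueError(f"Unsupported model file extension: {suffix!r}")
    return FORMATS[suffix]


for p in ["models/pls_wheat.joblib", "models/nn_model.keras", "models/autogluon_model/"]:
    print(p, "->", guess_framework(p))
```

A check like this can be run before calling `predict()` to catch an unsupported file type early.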

## Prediction Output

The `predict()` method returns:

```python
y_pred, predictions_obj = runner.predict(source, dataset)
```

- `y_pred`: NumPy array of predictions (weighted average across the fold models when the source was trained with cross-validation)
- `predictions_obj`: Predictions object with full metadata

### Getting All Predictions

To get predictions from all models (not just the best):

```python
all_preds, predictions_obj = runner.predict(
    best_prediction,
    new_data,
    all_predictions=True,
    verbose=0
)

# Iterate over all predictions
for pred in predictions_obj.to_dicts():
    print(f"{pred['model_name']}: {pred['rmse']:.4f}")
```

## Cross-Validation Ensemble Predictions

When training with cross-validation, each fold produces a separate model. During prediction, nirs4all automatically:

1. Loads all fold models
2. Makes predictions with each
3. Combines predictions using weighted averaging

The weights are determined by validation performance:

```python
# Fold weights are stored in the prediction
fold_weights = best_prediction.get('fold_weights', {})
print(f"Fold weights: {fold_weights}")
# e.g., {0: 0.34, 1: 0.33, 2: 0.33}
```
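The combination step can be sketched with NumPy as follows. This is a minimal illustration of weighted averaging, assuming one prediction vector per fold and weights keyed by fold index as in the example above; how nirs4all derives the weights from validation performance is not shown, and `combine_fold_predictions` is a hypothetical helper name.

```python
import numpy as np


def combine_fold_predictions(fold_preds, fold_weights):
    """Weighted average of per-fold predictions.

    fold_preds:   dict {fold_id: sequence of shape (n_samples,)}
    fold_weights: dict {fold_id: float}; weights need not sum to 1
    """
    fold_ids = sorted(fold_preds)
    stacked = np.stack([np.asarray(fold_preds[f], dtype=float) for f in fold_ids])
    weights = np.array([fold_weights[f] for f in fold_ids], dtype=float)
    # np.average divides by the sum of the weights, so they are normalized here.
    return np.average(stacked, axis=0, weights=weights)


# Toy values for three folds and two samples (made up for illustration)
fold_preds = {0: [10.0, 20.0], 1: [12.0, 22.0], 2: [14.0, 24.0]}
fold_weights = {0: 0.5, 1: 0.25, 2: 0.25}

y_pred = combine_fold_predictions(fold_preds, fold_weights)
print(y_pred)  # [11.5 21.5]
```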

## Data Format for Prediction

The new data must have the same number of features as the training data:

```python
import numpy as np

# Check expected features
n_features = best_prediction['n_features']
print(f"Expected features: {n_features}")

# Supported formats
# 1. CSV file
new_data = DatasetConfigs({'X_test': 'spectra.csv'})

# 2. NumPy array
X_new = np.random.randn(20, n_features)
new_data = DatasetConfigs({'test_x': X_new})

# 3. Dictionary
new_data = {'test_x': X_new}
```
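A feature-count check can be run on an array before it is handed to `predict()`. The sketch below is an assumption about how such a check might look; `check_feature_count` is a hypothetical helper, not part of nirs4all.

```python
import numpy as np


def check_feature_count(X, n_features):
    """Validate that X is 2D with the expected number of columns."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"Expected a 2D array (samples x features), got {X.ndim}D")
    if X.shape[1] != n_features:
        raise ValueError(
            f"Expected {n_features} features, got {X.shape[1]}"
        )
    return X


# 20 samples with 5 features each passes; 4 features would raise ValueError
X_ok = check_feature_count(np.zeros((20, 5)), n_features=5)
print(X_ok.shape)  # (20, 5)
```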

## Preprocessing Replay

During prediction, nirs4all automatically replays the preprocessing steps:

1. Loads saved transformer artifacts (scalers, SNV, etc.)
2. Applies transforms in the same order as training
3. Feeds transformed data to the model

This ensures consistent preprocessing between training and prediction.
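The replay idea can be illustrated without nirs4all: transformers are fitted once on training data, kept in order, and later applied to new data with `transform` only. The classes and helpers below (`MinMax`, `Center`, `fit_steps`, `replay`) are hypothetical stand-ins for saved artifacts, not the library's implementation.

```python
import numpy as np


class MinMax:
    """Scale each feature to [0, 1] using the training min and range."""

    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        self.range_[self.range_ == 0] = 1.0  # avoid division by zero
        return self

    def transform(self, X):
        return (X - self.min_) / self.range_


class Center:
    """Subtract the training mean of each feature."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_


def fit_steps(steps, X):
    """Fit each step on the output of the previous one; return them in order."""
    fitted = []
    for step in steps:
        X = step.fit(X).transform(X)
        fitted.append(step)
    return fitted


def replay(fitted_steps, X):
    """Apply already-fitted steps in training order, without refitting."""
    for step in fitted_steps:
        X = step.transform(X)
    return X


X_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
X_new = np.array([[1.0, 20.0]])

artifacts = fit_steps([MinMax(), Center()], X_train)
X_ready = replay(artifacts, X_new)
print(X_ready)  # [[-0.25 -0.25]]
```

Because the new sample is scaled with the training minimum and range rather than its own, it lands in the same feature space the model was fitted on.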

## Error Handling

Common prediction errors and solutions:

### Missing Model

```python
try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except FileNotFoundError as e:
    print(f"Model not found: {e}")
    # Retrain, or check that training was run with save_artifacts=True
```

### Feature Mismatch

```python
try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except ValueError as e:
    print(f"Feature mismatch: {e}")
    # Check new data has same number of features as training
```

### Missing Preprocessing Artifacts

```python
try:
    y_pred, _ = runner.predict(best_prediction, new_data)
except KeyError as e:
    print(f"Missing artifact: {e}")
    # Ensure all preprocessing steps were saved during training
```
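The three cases above can be folded into a single wrapper. This is a sketch only: `safe_predict` is a hypothetical helper, the mapping from exception type to cause follows this guide, and a stub class stands in for `PipelineRunner` so the example runs without nirs4all.

```python
def safe_predict(runner, source, dataset):
    """Return (y_pred, error_message); exactly one of the two is None."""
    try:
        y_pred, _ = runner.predict(source, dataset)
        return y_pred, None
    except FileNotFoundError as e:
        return None, f"Model not found: {e}"
    except ValueError as e:
        return None, f"Feature mismatch: {e}"
    except KeyError as e:
        return None, f"Missing artifact: {e}"


class StubRunner:
    """Stand-in for PipelineRunner, used only to exercise the wrapper."""

    def __init__(self, error=None):
        self.error = error

    def predict(self, source, dataset):
        if self.error is not None:
            raise self.error
        return [1.0, 2.0], None


print(safe_predict(StubRunner(), "some_id", {}))
print(safe_predict(StubRunner(FileNotFoundError("model.joblib")), "some_id", {}))
```

Returning a message instead of re-raising is a design choice suited to batch jobs; in interactive use, letting the exception propagate is often more informative.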

## Using Model Files in Pipelines

You can include pre-trained models directly in your pipeline configuration. This is useful for:

- Transfer learning with pre-trained models
- Ensembling with external models
- Fine-tuning existing models

### Loading a Model in Pipeline

```python
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler

from nirs4all.pipeline import PipelineRunner

# Use a pre-trained model file in pipeline
pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5),
    {"model": "models/pretrained_pls.joblib", "name": "pretrained_pls"}
]

# `dataset` is assumed to be defined already (e.g. a DatasetConfigs instance)
runner = PipelineRunner()
predictions, _ = runner.run(pipeline, dataset)
```

### Supported Model Formats in Pipelines

The model path is resolved automatically:

```python
# sklearn/scikit-learn models
{"model": "models/pls.joblib"}
{"model": "models/ridge.pkl"}

# TensorFlow/Keras
{"model": "models/nn.h5"}
{"model": "models/nn.keras"}
{"model": "models/savedmodel_folder/"}  # SavedModel format

# PyTorch
{"model": "models/torch_model.pt"}
{"model": "models/checkpoint.pth"}
{"model": "models/checkpoint.ckpt"}

# AutoGluon
{"model": "models/autogluon_predictor/"}
```

### Exporting a Model to File

You can export just the model (not the full bundle) from a trained prediction using `export_model()`. This creates a lightweight model file that can be loaded later or shared:

```python
from nirs4all.pipeline import PipelineRunner

runner = PipelineRunner(save_artifacts=True)
predictions, _ = runner.run(pipeline, dataset)
best_pred = predictions.top(n=1)[0]

# Export just the model to .joblib
runner.export_model(best_pred, "exports/pls_model.joblib")

# Export to different formats
runner.export_model(best_pred, "exports/model.pkl")        # Pickle format
runner.export_model(best_pred, "exports/model.keras")      # TensorFlow/Keras

# Export a specific fold's model
runner.export_model(best_pred, "exports/fold2_model.joblib", fold=2)
```

**Difference between `export()` and `export_model()`:**

| Method | Output | Use Case |
|--------|--------|----------|
| `export()` | Full `.n4a` bundle | Deployment, sharing complete pipelines |
| `export_model()` | Model file only | Lightweight sharing, external tools |

The full bundle includes preprocessing artifacts and metadata, while `export_model()` exports only the trained model binary.

### Fine-tuning a Pre-trained Model

```python
# Load and fine-tune a pre-trained model
runner = PipelineRunner()
predictions, _ = runner.retrain(
    source="models/pretrained_pls.joblib",
    dataset=new_training_data,
    mode='finetune'
)
```

### Transfer Learning

```python
# Use a model trained on one dataset for another
runner = PipelineRunner()
predictions, _ = runner.retrain(
    source="models/wheat_model.joblib",
    dataset=corn_dataset,
    mode='transfer'
)
```

## Best Practices

1. **Always use `save_artifacts=True`** during training to persist models
2. **Verify feature dimensions** before prediction
3. **Use the same preprocessing** as training (automatic with nirs4all)
4. **Store prediction entries** for reproducibility
5. **Test predictions** against known validation data
6. **Match preprocessing** when using direct model files, since no preprocessing is replayed for them
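Practice 5 can be sketched as an RMSE check of predictions against held-out reference values. The `rmse` helper and the numbers below are illustrative assumptions; in practice `y_hat` would come from `predict()` and `y_ref` from your validation set.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized 1D sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError(
            f"Shape mismatch: {y_true.shape} reference vs {y_pred.shape} predicted"
        )
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up reference values and predictions
y_ref = [10.0, 12.0, 14.0]
y_hat = [11.0, 12.0, 13.0]
print(f"RMSE: {rmse(y_ref, y_hat):.4f}")  # RMSE: 0.8165
```

Comparing this value with the RMSE reported at training time is a quick way to spot a preprocessing or feature mismatch.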

## See Also

- [Export and Bundles](export_bundles.md) - Export models for deployment
- [Retrain and Transfer](retrain_transfer.md) - Retrain models on new data
- {doc}`/user_guide/index` - Complete user guide