Migration Guide

This guide helps you migrate from older versions of nirs4all to the current version. It covers API changes, prediction format updates, and dataset configuration migrations.

Table of Contents

  1. API Migration (v0.5 → v0.6+)

  2. Dataset Configuration Migration

  3. Prediction Format Migration

  4. Troubleshooting


API Migration (v0.5 → v0.6+)

nirs4all v0.6 introduces a simplified module-level API that reduces boilerplate while maintaining full functionality. The classic API remains fully supported for backward compatibility.

What Changed

Aspect               Classic API                New API (v0.6+)
-------------------  -------------------------  --------------------
Entry point          PipelineRunner.run()       nirs4all.run()
Configuration        Explicit config objects    Inline parameters
Result access        predictions.top(n=1)[0]    result.best
Sessions             N/A                        nirs4all.session()
sklearn integration  Manual                     NIRSPipeline wrapper

Quick Comparison

Classic API (Still Supported)

from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression

# Create configuration objects
pipeline_config = PipelineConfigs(
    [MinMaxScaler(), PLSRegression(n_components=10)],
    name="MyPipeline"
)
dataset_config = DatasetConfigs("sample_data/regression")

# Create runner and execute
runner = PipelineRunner(
    verbose=1,
    save_artifacts=True,
    save_charts=False
)
predictions, per_dataset = runner.run(pipeline_config, dataset_config)

# Access results
best = predictions.top(n=1)[0]
print(f"Best RMSE: {best.get('rmse', 'N/A')}")

Migration Steps

1. Basic Training

Before:

from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs

runner = PipelineRunner(verbose=1, save_artifacts=True)
predictions, _ = runner.run(
    PipelineConfigs(pipeline, "name"),
    DatasetConfigs("path/to/data")
)
best = predictions.top(n=1)[0]

After:

import nirs4all

result = nirs4all.run(
    pipeline=pipeline,
    dataset="path/to/data",
    name="name",
    verbose=1,
    save_artifacts=True
)
best = result.best

2. Accessing Results

Before:

top_5 = predictions.top(n=5)
best = predictions.top(n=1)[0]
rmse = best.get('rmse', float('nan'))
r2 = best.get('r2', float('nan'))
pls_preds = predictions.filter_predictions(model_name='PLSRegression')

After:

top_5 = result.top(n=5)
rmse = result.best_rmse
r2 = result.best_r2
pls_preds = result.filter(model_name='PLSRegression')
print(result.num_predictions)
print(result.get_models())

3. Prediction

Before:

runner = PipelineRunner(verbose=0)
y_pred, metadata = runner.predict(source=best_prediction, dataset=new_data)

After:

predict_result = nirs4all.predict(
    source=result.best,
    dataset=new_data,
    verbose=0
)
y_pred = predict_result.values
df = predict_result.to_dataframe()

4. Model Export

Before:

runner = PipelineRunner(save_artifacts=True)
predictions, _ = runner.run(pipeline_config, dataset_config)
best = predictions.top(n=1)[0]
runner.export(source=best, output_path="exports/model.n4a")

After:

result = nirs4all.run(pipeline, dataset, save_artifacts=True)
result.export("exports/model.n4a")

New Features in v0.6+

Sessions for Multiple Runs

with nirs4all.session(verbose=1, save_artifacts=True) as s:
    result1 = nirs4all.run(pipeline1, data, name="PLS", session=s)
    result2 = nirs4all.run(pipeline2, data, name="RF", session=s)
    result3 = nirs4all.run(pipeline3, data, name="SVM", session=s)

sklearn Integration

from nirs4all.sklearn import NIRSPipeline

result = nirs4all.run(pipeline, dataset, save_artifacts=True)
pipe = NIRSPipeline.from_result(result)
y_pred = pipe.predict(X_test)
score = pipe.score(X_test, y_test)

Migration Checklist

  • Replace PipelineRunner(...).run(...) with nirs4all.run(...)

  • Remove explicit PipelineConfigs and DatasetConfigs wrappers

  • Update result access from predictions.top(n=1)[0] to result.best

  • Use result.best_rmse, result.best_r2 for quick access

  • Consider using nirs4all.session() for multiple related runs

  • Use NIRSPipeline.from_result() for sklearn/SHAP integration

  • Update exports from runner.export(source=best, ...) to result.export(...)
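
The checklist above can be partly automated with a text search. The following is a minimal sketch, not part of nirs4all: the helper name find_classic_usages and the regular expressions are assumptions written for this guide, and they only catch the spellings shown in the "Before" snippets.

```python
import re

# Patterns mirror the classic-API spellings used in this guide.
# The hint text paraphrases the checklist; none of this ships with nirs4all.
CLASSIC_PATTERNS = [
    (re.compile(r"\bPipelineRunner\s*\("), "replace the runner with nirs4all.run(...)"),
    (re.compile(r"\b(?:PipelineConfigs|DatasetConfigs)\s*\("), "pass pipeline and dataset inline"),
    (re.compile(r"\.top\(\s*n\s*=\s*1\s*\)\s*\[\s*0\s*\]"), "use result.best"),
    (re.compile(r"\.export\(\s*source\s*="), "use result.export(...)"),
]


def find_classic_usages(source_text):
    """Return (line_number, hint) pairs for lines that match a classic pattern."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        for pattern, hint in CLASSIC_PATTERNS:
            if pattern.search(line):
                hits.append((number, hint))
    return hits


example = """\
runner = PipelineRunner(verbose=1, save_artifacts=True)
predictions, _ = runner.run(PipelineConfigs(pipeline, "name"), DatasetConfigs("path/to/data"))
best = predictions.top(n=1)[0]
runner.export(source=best, output_path="exports/model.n4a")
"""

for number, hint in find_classic_usages(example):
    print(f"line {number}: {hint}")
```

A search like this only flags candidates. Aliased imports or differently formatted calls will be missed, so a manual review is still needed.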


Dataset Configuration Migration

The new configuration system provides:

  • Multiple file formats: CSV, NumPy, Parquet, Excel, MATLAB

  • Flexible column/row selection: Select data by name, index, or pattern

  • Multiple partition methods: Static, column-based, percentage, or index-based

  • Multi-source support: Sensor fusion with multiple feature sources

  • Feature variations: Pre-computed preprocessing variants

  • Cross-validation folds: Load pre-defined fold assignments

Note

The legacy format continues to work unchanged. You can migrate gradually.

Quick Comparison

Legacy Format (Still Supported)

train_x: data/Xcal.csv
train_y: data/Ycal.csv
test_x: data/Xval.csv
test_y: data/Yval.csv

global_params:
  delimiter: ";"
  has_header: true
  header_unit: cm-1

task_type: regression

New Sources Format

sources:
  - name: "NIR"
    train_x: data/NIR_train.csv
    test_x: data/NIR_test.csv
    params:
      header_unit: nm

  - name: "MIR"
    train_x: data/MIR_train.csv
    test_x: data/MIR_test.csv
    params:
      header_unit: cm-1

targets:
  path: data/targets.csv

task_type: regression

New Variations Format

variations:
  - name: raw
    train_x: data/X_raw_train.csv
    test_x: data/X_raw_test.csv

  - name: snv
    description: "SNV preprocessed"
    train_x: data/X_snv_train.csv
    test_x: data/X_snv_test.csv

variation_mode: compare
targets:
  path: data/Y.csv
task_type: regression

Converting Configurations

Multi-Source (Legacy → Sources Format)

Before:

train_x:
  - data/sensor1_train.csv
  - data/sensor2_train.csv
test_x:
  - data/sensor1_test.csv
  - data/sensor2_test.csv
train_y: data/Y_train.csv
test_y: data/Y_test.csv

After:

sources:
  - name: "sensor1"
    files:
      - path: data/sensor1_train.csv
        partition: train
      - path: data/sensor1_test.csv
        partition: test

  - name: "sensor2"
    files:
      - path: data/sensor2_train.csv
        partition: train
      - path: data/sensor2_test.csv
        partition: test

targets:
  path: data/Y.csv
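
For configurations with many sensors, the same rewrite can be scripted. The sketch below works on plain Python dictionaries (for example, the result of loading the YAML file) and assumes that train_x and test_x are lists of equal length. The helper name legacy_to_sources and the rule for deriving source names from file names are hypothetical. Targets are left out on purpose, because this guide does not show how separate train_y and test_y files map onto the targets entry.

```python
from pathlib import PurePosixPath


def legacy_to_sources(legacy):
    """Build a `sources` list from legacy list-valued train_x/test_x entries."""
    train_paths = legacy.get("train_x", [])
    test_paths = legacy.get("test_x", [])
    if isinstance(train_paths, str) or isinstance(test_paths, str):
        raise TypeError("expected lists of paths (multi-source legacy format)")
    if len(train_paths) != len(test_paths):
        raise ValueError("train_x and test_x must list the same number of files")

    sources = []
    for train_path, test_path in zip(train_paths, test_paths):
        # Assumption: name each source after the train file, minus "_train".
        stem = PurePosixPath(train_path).stem
        name = stem[: -len("_train")] if stem.endswith("_train") else stem
        sources.append({
            "name": name,
            "files": [
                {"path": train_path, "partition": "train"},
                {"path": test_path, "partition": "test"},
            ],
        })
    return {"sources": sources}


legacy = {
    "train_x": ["data/sensor1_train.csv", "data/sensor2_train.csv"],
    "test_x": ["data/sensor1_test.csv", "data/sensor2_test.csv"],
}
converted = legacy_to_sources(legacy)
print([source["name"] for source in converted["sources"]])
```

Run the validation command on the converted file before relying on it, since this sketch does not check the result against the real schema.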

Validation Commands

# Validate configuration
nirs4all dataset validate path/to/config.yaml

# Inspect configuration details
nirs4all dataset inspect new_config.yaml --detect

# Compare configurations
nirs4all dataset diff old_config.yaml new_config.yaml

Prediction Format Migration

New Fields in Predictions (v0.9+)

Field              Description
-----------------  -------------------------------------------
trace_id           Unique identifier for the execution trace
model_artifact_id  Reference to the saved model artifact
execution_hash     Hash of the exact execution path
step_artifacts     List of artifact IDs for each pipeline step

Impact

Old predictions without the new fields will:

  • ✅ Continue to work for basic operations

  • ✅ Work with predict() if the model folder still exists

  • ⚠️ Not support retrain() with mode='transfer' or mode='finetune'

  • ⚠️ Not support bundle export with the full artifact chain
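
To see which of the new fields a given prediction lacks, a small helper is enough. This is a sketch that assumes predictions behave like dictionaries, as the other snippets in this guide suggest. The helper name missing_new_fields and all example values are made up for illustration.

```python
# Field names come from the table of new prediction fields in this guide.
NEW_FIELDS = ("trace_id", "model_artifact_id", "execution_hash", "step_artifacts")


def missing_new_fields(prediction):
    """Return the new-format fields that are absent from a prediction entry."""
    return [field for field in NEW_FIELDS if field not in prediction]


# Illustrative entries only; real predictions carry many more keys.
old_prediction = {"model_name": "PLSRegression", "rmse": 0.42}
new_prediction = {
    "model_name": "PLSRegression",
    "rmse": 0.42,
    "trace_id": "example-trace",
    "model_artifact_id": "example-artifact",
    "execution_hash": "example-hash",
    "step_artifacts": ["example-step-1", "example-step-2"],
}

print(missing_new_fields(old_prediction))
print(missing_new_fields(new_prediction))
```

An empty list means the entry already carries every new field. Any other result points to the limitations listed above.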

Migration Methods

Migration During Retrain

Old predictions are automatically migrated when used:

runner = PipelineRunner(save_artifacts=True, verbose=0)
new_preds, _ = runner.retrain(
    source=old_prediction,  # Will be migrated automatically
    dataset=new_data,
    mode='full'
)

Checking Migration Status

from nirs4all.database import PredictionsDB

db = PredictionsDB('runs/')
predictions = list(db.all())  # Load once instead of querying twice

# Predictions without a trace_id predate the new format
old_format = sum(1 for p in predictions if 'trace_id' not in p)
new_format = len(predictions) - old_format

print(f"Old format: {old_format}")
print(f"New format: {new_format}")

Troubleshooting

Common API Migration Issues

Wrong Return Type

# ❌ Wrong - will fail
predictions, per_dataset = nirs4all.run(pipeline, data)

# ✅ Correct
result = nirs4all.run(pipeline, data)
predictions = result.predictions
per_dataset = result.per_dataset
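
During a gradual migration, code written for the classic tuple can keep working through a thin adapter. The sketch below is not part of nirs4all. The helper name as_classic_tuple is hypothetical, and the SimpleNamespace object only stands in for a real result so the example runs on its own. It relies solely on the predictions and per_dataset attributes shown above.

```python
from types import SimpleNamespace


def as_classic_tuple(result):
    """Return (predictions, per_dataset), the pair the classic runner.run() returned."""
    return result.predictions, result.per_dataset


# Stand-in for a real result object; the values are placeholders.
fake_result = SimpleNamespace(
    predictions=["prediction-1", "prediction-2"],
    per_dataset={"regression": ["prediction-1", "prediction-2"]},
)

predictions, per_dataset = as_classic_tuple(fake_result)
print(len(predictions), list(per_dataset))
```

With a real run, the call would look like as_classic_tuple(nirs4all.run(pipeline, data)), which keeps downstream code unchanged until it is updated to use the result object directly.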

NIRSPipeline is for Prediction Only

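import nirs4all
from nirs4all.sklearn import NIRSPipeline
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler

# ❌ NIRSPipeline doesn't train
pipe = NIRSPipeline(steps=[MinMaxScaler(), PLSRegression(n_components=10)])
pipe.fit(X, y)  # Raises NotImplementedError

# ✅ Train with nirs4all, then wrap
result = nirs4all.run(pipeline, dataset)
pipe = NIRSPipeline.from_result(result)
pipe.predict(X_new)  # Works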

Common Dataset Issues

“No data source specified”

Your configuration needs at least one of:

  • train_x or test_x (legacy)

  • sources (new multi-source)

  • variations (new variations)

  • folder (auto-scan)
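
A quick local check for this condition can be written against the loaded configuration dictionary. This is only a sketch of the rule stated above, not the library's own validation. The helper name data_source_keys is hypothetical, and treating empty values as missing is an assumption.

```python
# Keys listed in this guide as valid ways to declare a data source.
DATA_SOURCE_KEYS = ("train_x", "test_x", "sources", "variations", "folder")


def data_source_keys(config):
    """Return the data-source keys that are present and non-empty in config."""
    return [key for key in DATA_SOURCE_KEYS if config.get(key)]


broken_config = {"task_type": "regression", "global_params": {"delimiter": ";"}}
legacy_config = {"train_x": "data/Xcal.csv", "train_y": "data/Ycal.csv"}

print(data_source_keys(broken_config))
print(data_source_keys(legacy_config))
```

An empty list corresponds to the situation that triggers the error message.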

“Sample count mismatch across sources”

All sources must have the same number of samples. Check that your data files have consistent row counts.
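
Row counts can be compared before running a pipeline. The sketch below uses only the Python standard library and assumes delimited text files with a header row and a ";" delimiter, as in the legacy example in this guide. The helper names count_rows and mismatched_sources are hypothetical, and the temporary files exist only to make the example self-contained.

```python
import csv
import tempfile
from pathlib import Path


def count_rows(path, delimiter=";", has_header=True):
    """Count data rows in a delimited text file, skipping blank lines."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    return len(rows) - 1 if has_header and rows else len(rows)


def mismatched_sources(paths, **kwargs):
    """Return {path: row_count} when the counts differ, else an empty dict."""
    counts = {str(path): count_rows(path, **kwargs) for path in paths}
    return counts if len(set(counts.values())) > 1 else {}


with tempfile.TemporaryDirectory() as tmp:
    nir = Path(tmp) / "NIR_train.csv"
    mir = Path(tmp) / "MIR_train.csv"
    nir.write_text("1000;1001\n0.1;0.2\n0.3;0.4\n")  # header + 2 samples
    mir.write_text("4000;4001\n0.5;0.6\n")           # header + 1 sample
    report = mismatched_sources([nir, mir])
    print({Path(path).name: count for path, count in report.items()})
```

A non-empty report names the files whose sample counts disagree, which is usually the fastest way to find the source that needs fixing.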

Common Prediction Format Issues

Missing Model Folder

# Error: Model folder not found
# Old predictions without saved folders cannot be fully migrated
from pathlib import Path

# `pred` is an old prediction entry that records its model folder
folder = Path(pred['folder'])
if not folder.exists():
    print("Original model folder missing - limited functionality")

Hash Mismatch

ValueError: Content hash mismatch for artifact 0001:3:all

Cause: The artifact file was modified after saving.

Solution: Delete the corrupted artifact and re-run training.
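
The idea behind a content hash check can be illustrated with the standard library. This sketch assumes SHA-256 purely for illustration. The hashing scheme nirs4all actually uses is not described in this guide, and file_digest is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifact.bin"
    artifact.write_bytes(b"model weights")
    recorded = file_digest(artifact)  # what would be stored at save time

    unchanged = file_digest(artifact) == recorded
    artifact.write_bytes(b"model weights, edited")  # simulate a later edit
    still_matches = file_digest(artifact) == recorded

print(unchanged, still_matches)
```

Any change to the file's bytes alters the digest, which is why an edited or partially written artifact is rejected rather than loaded.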
