# Migration Guide

This guide helps you migrate from older versions of nirs4all to the current version. It covers API changes, prediction format updates, and dataset configuration migrations.

## Table of Contents

1. [API Migration (v0.5 → v0.6+)](#api-migration-v05--v06)
2. [Dataset Configuration Migration](#dataset-configuration-migration)
3. [Prediction Format Migration](#prediction-format-migration)
4. [Troubleshooting](#troubleshooting)

---

## API Migration (v0.5 → v0.6+)

nirs4all v0.6 introduces a simplified module-level API that reduces boilerplate while maintaining full functionality. The classic API remains fully supported for backward compatibility.

### What Changed

| Aspect | Classic API | New API (v0.6+) |
|--------|-------------|-----------------|
| Entry point | `PipelineRunner.run()` | `nirs4all.run()` |
| Configuration | Explicit config objects | Inline parameters |
| Result access | `predictions.top(n=1)[0]` | `result.best` |
| Sessions | N/A | `nirs4all.session()` |
| sklearn integration | Manual | `NIRSPipeline` wrapper |

### Quick Comparison

#### Classic API (Still Supported)

```python
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression

# Create configuration objects
pipeline_config = PipelineConfigs(
    [MinMaxScaler(), PLSRegression(n_components=10)],
    name="MyPipeline"
)
dataset_config = DatasetConfigs("sample_data/regression")

# Create runner and execute
runner = PipelineRunner(
    verbose=1,
    save_artifacts=True,
    save_charts=False
)
predictions, per_dataset = runner.run(pipeline_config, dataset_config)

# Access results
best = predictions.top(n=1)[0]
print(f"Best RMSE: {best.get('rmse', 'N/A')}")
```

#### New Module-Level API (Recommended)

```python
import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression

# Direct execution with inline configuration
result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(n_components=10)],
    dataset="sample_data/regression",
    name="MyPipeline",
    verbose=1,
    save_artifacts=True,
    save_charts=False
)

# Convenient result access
print(f"Best RMSE: {result.best_rmse:.4f}")
print(f"Best R²: {result.best_r2:.4f}")
```

### Migration Steps

#### 1. Basic Training

**Before:**

```python
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs

runner = PipelineRunner(verbose=1, save_artifacts=True)
predictions, _ = runner.run(
    PipelineConfigs(pipeline, "name"),
    DatasetConfigs("path/to/data")
)
best = predictions.top(n=1)[0]
```

**After:**

```python
import nirs4all

result = nirs4all.run(
    pipeline=pipeline,
    dataset="path/to/data",
    name="name",
    verbose=1,
    save_artifacts=True
)
best = result.best
```

#### 2. Accessing Results

**Before:**

```python
top_5 = predictions.top(n=5)
best = predictions.top(n=1)[0]
rmse = best.get('rmse', float('nan'))
r2 = best.get('r2', float('nan'))
pls_preds = predictions.filter_predictions(model_name='PLSRegression')
```

**After:**

```python
top_5 = result.top(n=5)
rmse = result.best_rmse
r2 = result.best_r2
pls_preds = result.filter(model_name='PLSRegression')
print(result.num_predictions)
print(result.get_models())
```

#### 3. Prediction

**Before:**

```python
runner = PipelineRunner(verbose=0)
y_pred, metadata = runner.predict(source=best_prediction, dataset=new_data)
```

**After:**

```python
predict_result = nirs4all.predict(
    source=result.best,
    dataset=new_data,
    verbose=0
)
y_pred = predict_result.values
df = predict_result.to_dataframe()
```

#### 4. Model Export

**Before:**

```python
runner = PipelineRunner(save_artifacts=True)
predictions, _ = runner.run(pipeline_config, dataset_config)
best = predictions.top(n=1)[0]
runner.export(source=best, output_path="exports/model.n4a")
```

**After:**

```python
result = nirs4all.run(pipeline, dataset, save_artifacts=True)
result.export("exports/model.n4a")
```

### New Features in v0.6+

#### Sessions for Multiple Runs

```python
with nirs4all.session(verbose=1, save_artifacts=True) as s:
    result1 = nirs4all.run(pipeline1, data, name="PLS", session=s)
    result2 = nirs4all.run(pipeline2, data, name="RF", session=s)
    result3 = nirs4all.run(pipeline3, data, name="SVM", session=s)
```

#### sklearn Integration

```python
from nirs4all.sklearn import NIRSPipeline

result = nirs4all.run(pipeline, dataset, save_artifacts=True)
pipe = NIRSPipeline.from_result(result)
y_pred = pipe.predict(X_test)
score = pipe.score(X_test, y_test)
```

### Migration Checklist

- [ ] Replace `PipelineRunner(...)` with `nirs4all.run(...)`
- [ ] Remove explicit `PipelineConfigs` and `DatasetConfigs` wrappers
- [ ] Update result access from `predictions.top(n=1)[0]` to `result.best`
- [ ] Use `result.best_rmse`, `result.best_r2` for quick access
- [ ] Consider using `nirs4all.session()` for multiple related runs
- [ ] Use `NIRSPipeline.from_result()` for sklearn/SHAP integration
- [ ] Update exports from `runner.export(source=best, ...)` to `result.export(...)`

---

## Dataset Configuration Migration

The new configuration system provides:

- **Multiple file formats**: CSV, NumPy, Parquet, Excel, MATLAB
- **Flexible column/row selection**: Select data by name, index, or pattern
- **Multiple partition methods**: Static, column-based, percentage, or index-based
- **Multi-source support**: Sensor fusion with multiple feature sources
- **Feature variations**: Pre-computed preprocessing variants
- **Cross-validation folds**: Load pre-defined fold assignments

:::{note}
The legacy format continues to work unchanged. You can migrate gradually.
:::

### Quick Comparison

#### Legacy Format (Still Supported)

```yaml
train_x: data/Xcal.csv
train_y: data/Ycal.csv
test_x: data/Xval.csv
test_y: data/Yval.csv

global_params:
  delimiter: ";"
  has_header: true
  header_unit: cm-1

task_type: regression
```

#### New Sources Format

```yaml
sources:
  - name: "NIR"
    train_x: data/NIR_train.csv
    test_x: data/NIR_test.csv
    params:
      header_unit: nm
  - name: "MIR"
    train_x: data/MIR_train.csv
    test_x: data/MIR_test.csv
    params:
      header_unit: cm-1

targets:
  path: data/targets.csv

task_type: regression
```

#### New Variations Format

```yaml
variations:
  - name: raw
    train_x: data/X_raw_train.csv
    test_x: data/X_raw_test.csv
  - name: snv
    description: "SNV preprocessed"
    train_x: data/X_snv_train.csv
    test_x: data/X_snv_test.csv

variation_mode: compare

targets:
  path: data/Y.csv

task_type: regression
```

### Converting Configurations

#### Multi-Source (Legacy → Sources Format)

**Before:**

```yaml
train_x:
  - data/sensor1_train.csv
  - data/sensor2_train.csv
test_x:
  - data/sensor1_test.csv
  - data/sensor2_test.csv
train_y: data/Y_train.csv
test_y: data/Y_test.csv
```

**After:**

```yaml
sources:
  - name: "sensor1"
    files:
      - path: data/sensor1_train.csv
        partition: train
      - path: data/sensor1_test.csv
        partition: test
  - name: "sensor2"
    files:
      - path: data/sensor2_train.csv
        partition: train
      - path: data/sensor2_test.csv
        partition: test

targets:
  path: data/Y.csv
```

### Validation Commands

```bash
# Validate configuration
nirs4all dataset validate path/to/config.yaml

# Inspect configuration details
nirs4all dataset inspect new_config.yaml --detect

# Compare configurations
nirs4all dataset diff old_config.yaml new_config.yaml
```

---

## Prediction Format Migration

### New Fields in Predictions (v0.9+)

| Field | Description |
|-------|-------------|
| `trace_id` | Unique identifier for the execution trace |
| `model_artifact_id` | Reference to the saved model artifact |
| `execution_hash` | Hash of the exact execution path |
| `step_artifacts` | List of artifact IDs for each pipeline step |

### Impact

Old predictions without the new fields will:

- ✅ Continue to work for basic operations
- ✅ Work with `predict()` if model folder still exists
- ⚠️ Not support `retrain()` with mode='transfer' or 'finetune'
- ⚠️ Not support bundle export with full artifact chain

### Migration Methods

#### Automatic Migration (Recommended)

```python
from nirs4all.database import PredictionsDB
from nirs4all.migration import migrate_predictions

db = PredictionsDB('runs/')
results = migrate_predictions(db, dry_run=False, verbose=1)
print(f"Migrated: {results['migrated']}")
```

#### Migration During Retrain

Old predictions are automatically migrated when used:

```python
runner = PipelineRunner(save_artifacts=True, verbose=0)
new_preds, _ = runner.retrain(
    source=old_prediction,  # Will be migrated automatically
    dataset=new_data,
    mode='full'
)
```

### Checking Migration Status

```python
from nirs4all.database import PredictionsDB

db = PredictionsDB('runs/')
old_format = sum(1 for p in db.all() if 'trace_id' not in p)
new_format = sum(1 for p in db.all() if 'trace_id' in p)
print(f"Old format: {old_format}")
print(f"New format: {new_format}")
```

---

## Troubleshooting

### Common API Migration Issues

#### Wrong Return Type

```python
# ❌ Wrong - will fail
predictions, per_dataset = nirs4all.run(pipeline, data)

# ✅ Correct
result = nirs4all.run(pipeline, data)
predictions = result.predictions
per_dataset = result.per_dataset
```

#### NIRSPipeline is for Prediction Only

```python
# ❌ NIRSPipeline doesn't train
pipe = NIRSPipeline(steps=[MinMaxScaler(), PLSRegression(10)])
pipe.fit(X, y)  # Raises NotImplementedError

# ✅ Train with nirs4all, then wrap
result = nirs4all.run(pipeline, dataset)
pipe = NIRSPipeline.from_result(result)
pipe.predict(X_new)  # Works
```

### Common Dataset Issues

#### "No data source specified"

Your configuration needs at least one of:

- `train_x` or `test_x` (legacy)
- `sources` (new multi-source)
- `variations` (new variations)
- `folder` (auto-scan)

#### "Sample count mismatch across sources"

All sources must have the same number of samples. Check that your data files have consistent row counts.

### Common Prediction Format Issues

#### Missing Model Folder

```python
# Error: Model folder not found
# Old predictions without saved folders cannot be fully migrated
from pathlib import Path

folder = Path(pred['folder'])
if not folder.exists():
    print("Original model folder missing - limited functionality")
```

#### Hash Mismatch

```
ValueError: Content hash mismatch for artifact 0001:3:all
```

**Cause**: Artifact file was modified after saving.

**Solution**: Delete the corrupted artifact and re-run training.

## See Also

- {doc}`dataset_troubleshooting` - Common data loading issues
- {doc}`/getting_started/index` - Installation guide
- {doc}`/reference/cli` - CLI command reference
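As a companion to the "No data source specified" entry under Troubleshooting, the key check described there can be sketched in plain Python before a configuration is handed to nirs4all. This is only an illustration, not part of the nirs4all API: `detect_config_format` and `FORMAT_KEYS` are hypothetical names, the helper assumes the configuration has already been parsed into a `dict`, and the precedence applied when several keys are present is an assumption rather than documented behaviour.

```python
from typing import Any, Mapping

# Hypothetical lookup table (not from nirs4all): top-level keys named in the
# "No data source specified" troubleshooting entry, mapped to a format label.
# The ordering is an assumed precedence for configs that mix several keys.
FORMAT_KEYS = (
    ("sources", "sources"),
    ("variations", "variations"),
    ("folder", "folder"),
    ("train_x", "legacy"),
    ("test_x", "legacy"),
)


def detect_config_format(config: Mapping[str, Any]) -> str:
    """Return the format label for the first non-empty data-source key."""
    for key, format_name in FORMAT_KEYS:
        if config.get(key):
            return format_name
    # Mirrors the wording of the error message discussed in this guide.
    raise ValueError("No data source specified")


legacy_config = {"train_x": "data/Xcal.csv", "train_y": "data/Ycal.csv"}
print(detect_config_format(legacy_config))  # prints "legacy"
```

A configuration that only sets `task_type`, for example, makes the helper raise `ValueError`, which is the situation the troubleshooting entry describes.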