# Model Training

This guide covers how to train models in NIRS4ALL, including cross-validation and accessing results.

## Overview

Model training in NIRS4ALL follows a simple pattern:

1. Add a cross-validation splitter to your pipeline
2. Add one or more model steps
3. Run the pipeline and access predictions

```python
pipeline = [
    # ... preprocessing steps ...
    ShuffleSplit(n_splits=5),                    # Cross-validation
    {"model": PLSRegression(n_components=10)}    # Model
]
```

## Basic Model Training

### Single Model

The simplest case - train one model:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
import nirs4all

pipeline = [
    ShuffleSplit(n_splits=3, test_size=0.25),
    {"model": PLSRegression(n_components=10)}
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    verbose=1
)

print(f"Best RMSE: {result.best_score:.4f}")
```

### Multiple Models

Compare multiple models in one run:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

pipeline = [
    ShuffleSplit(n_splits=3),

    # Each model is trained and evaluated
    {"model": PLSRegression(n_components=5)},
    {"model": PLSRegression(n_components=10)},
    {"model": PLSRegression(n_components=15)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=100)},
]

result = nirs4all.run(pipeline, "sample_data/regression", verbose=1)

# Compare results
for pred in result.top(n=5, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")
```

### Named Models

Give models custom names for easier identification:

```python
pipeline = [
    ShuffleSplit(n_splits=3),
    {"name": "PLS-5", "model": PLSRegression(n_components=5)},
    {"name": "PLS-10", "model": PLSRegression(n_components=10)},
    {"name": "RF-100", "model": RandomForestRegressor(n_estimators=100)},
]
```

## Cross-Validation

### Available Splitters

NIRS4ALL supports all sklearn splitters:

| Splitter | Use Case |
|----------|----------|
| `ShuffleSplit` | Random train/val splits (recommended for regression) |
| `KFold` | K consecutive folds |
| `StratifiedKFold` | Maintains class distribution (classification) |
| `GroupKFold` | Keeps groups together |
| `LeaveOneOut` | Leave one sample out (small datasets) |
| `TimeSeriesSplit` | Respects temporal order |

### ShuffleSplit (Recommended)

```python
from sklearn.model_selection import ShuffleSplit

pipeline = [
    # 5 random 75%/25% train/val splits
    ShuffleSplit(n_splits=5, test_size=0.25, random_state=42),
    {"model": PLSRegression(n_components=10)}
]
```

### KFold

```python
from sklearn.model_selection import KFold

pipeline = [
    # 5-fold cross-validation
    KFold(n_splits=5, shuffle=True, random_state=42),
    {"model": PLSRegression(n_components=10)}
]
```

### StratifiedKFold (Classification)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

pipeline = [
    # Maintains class balance in each fold
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    {"model": LogisticRegression()}
]
```

### GroupKFold (Group Data)

When samples belong to groups (e.g., multiple measurements per subject):

```python
from sklearn.model_selection import GroupKFold

pipeline = [
    # Keeps all samples from a group in the same fold
    # Groups come from a metadata column
    {"force_group_splitting": "subject_id"},
    GroupKFold(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]
```

## Accessing Results

### The Result Object

`nirs4all.run()` returns a `RunResult` with convenient accessors:

```python
result = nirs4all.run(pipeline, dataset)

# Quick access
result.best_score        # Best model's primary metric (RMSE for regression)
result.best              # Best prediction entry (dict)
result.num_predictions   # Total number of prediction entries

# Top performers
result.top(n=5)                                    # Top 5 by default metric
result.top(n=5, rank_metric='r2')                  # Top 5 by R²
result.top(n=5, display_metrics=['rmse', 'r2'])    # Include specific metrics

# Full predictions list
result.predictions       # PredictionResultsList object
```

### Prediction Entry Structure

Each prediction entry contains:

```python
pred = result.best

print(pred['model_name'])       # "PLSRegression"
print(pred['rmse'])             # 0.1234
print(pred['r2'])               # 0.9876
print(pred['fold_id'])          # 0
print(pred['preprocessings'])   # "MinMaxScaler | SNV"
print(pred['y_true'])           # numpy array
print(pred['y_pred'])           # numpy array
```

### Filtering Results

```python
# Get predictions for specific model
pls_preds = [p for p in result.predictions if 'PLS' in p['model_name']]

# Get predictions for specific fold
fold_0 = [p for p in result.predictions if p['fold_id'] == 0]

# Get test partition only
test_preds = [p for p in result.predictions if p['partition'] == 'test']
```

## Regression Models

### PLS Regression (Recommended for NIRS)

```python
from sklearn.cross_decomposition import PLSRegression

pipeline = [
    ShuffleSplit(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]
```

**Key parameters:**

- `n_components`: Number of latent variables (typically 1-30 for NIRS)

### Ridge Regression

```python
from sklearn.linear_model import Ridge

pipeline = [
    ShuffleSplit(n_splits=5),
    {"model": Ridge(alpha=1.0)}
]
```

**Key parameters:**

- `alpha`: Regularization strength (higher = more regularization)

### Random Forest

```python
from sklearn.ensemble import RandomForestRegressor

pipeline = [
    ShuffleSplit(n_splits=5),
    {"model": RandomForestRegressor(n_estimators=100, random_state=42)}
]
```

**Key parameters:**

- `n_estimators`: Number of trees (100-500 typical)
- `max_depth`: Maximum tree depth (None = unlimited)

### Support Vector Regression

```python
from sklearn.svm import SVR

pipeline = [
    ShuffleSplit(n_splits=5),
    {"model": SVR(kernel='rbf', C=1.0)}
]
```

## Classification Models

### Logistic Regression

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

pipeline = [
    StratifiedKFold(n_splits=5),
    {"model": LogisticRegression(max_iter=1000)}
]

result = nirs4all.run(
    pipeline,
    "sample_data/classification",
    verbose=1
)
```

### Random Forest Classifier

```python
from sklearn.ensemble import RandomForestClassifier

pipeline = [
    StratifiedKFold(n_splits=5),
    {"model": RandomForestClassifier(n_estimators=100)}
]
```

### Support Vector Classifier

```python
from sklearn.svm import SVC

pipeline = [
    StratifiedKFold(n_splits=5),
    {"model": SVC(kernel='rbf', probability=True)}
]
```

## Target Processing

Scale or transform targets:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipeline = [
    # Scale features
    MinMaxScaler(),

    # Scale targets (applied before training, inverted after)
    {"y_processing": MinMaxScaler()},
    # OR: {"y_processing": StandardScaler()},

    ShuffleSplit(n_splits=3),
    {"model": PLSRegression(n_components=10)}
]
```

:::{note}
Target scaling is automatically inverted when making predictions, so metrics are computed on the original scale.
:::

## Model Persistence

### Export for Production

```python
# Export best model
result.export("exports/my_model.n4a")
```

### Load and Use

```python
from nirs4all.pipeline import load_bundle

bundle = load_bundle("exports/my_model.n4a")
# X_new holds the new spectra to predict on
y_pred = bundle.predict(X_new)
```

## Tips and Best Practices

### Choosing n_components for PLS

```python
# Try a range of components
pipeline = [
    ShuffleSplit(n_splits=5),
    {"model": PLSRegression(n_components=5)},
    {"model": PLSRegression(n_components=10)},
    {"model": PLSRegression(n_components=15)},
    {"model": PLSRegression(n_components=20)},
]

result = nirs4all.run(pipeline, dataset)

# Find optimal
for pred in result.top(n=5, display_metrics=['rmse']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")
```

### Random State for Reproducibility

```python
# Set random_state for reproducible results
pipeline = [
    ShuffleSplit(n_splits=5, random_state=42),
    {"model": RandomForestRegressor(n_estimators=100, random_state=42)}
]
```

### Preprocessing Before Model

Always apply preprocessing before the splitter:

```python
from sklearn.preprocessing import MinMaxScaler
from nirs4all.operators.transforms import StandardNormalVariate

pipeline = [
    # Preprocessing first
    MinMaxScaler(),
    StandardNormalVariate(),

    # Then cross-validation
    ShuffleSplit(n_splits=5),

    # Then model
    {"model": PLSRegression(n_components=10)}
]
```

## Complete Example

```python
"""Complete model training example."""

import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit
from nirs4all.operators.transforms import StandardNormalVariate, FirstDerivative

# Define pipeline
pipeline = [
    # Preprocessing
    MinMaxScaler(),
    StandardNormalVariate(),
    FirstDerivative(),

    # Target scaling
    {"y_processing": MinMaxScaler()},

    # Cross-validation
    ShuffleSplit(n_splits=5, test_size=0.25, random_state=42),

    # Multiple models
    {"name": "PLS-5", "model": PLSRegression(n_components=5)},
    {"name": "PLS-10", "model": PLSRegression(n_components=10)},
    {"name": "PLS-15", "model": PLSRegression(n_components=15)},
    {"name": "RF-100", "model": RandomForestRegressor(n_estimators=100, random_state=42)},
]

# Run pipeline
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="ModelComparison",
    verbose=1,
    save_artifacts=True
)

# View results
print("\n📊 Results Summary:")
print(f"   Total predictions: {result.num_predictions}")
print(f"   Best RMSE: {result.best_score:.4f}")

print("\n🏆 Top 5 Models:")
for i, pred in enumerate(result.top(n=5, display_metrics=['rmse', 'r2']), 1):
    print(f"   {i}. {pred['model_name']}: RMSE={pred['rmse']:.4f}, R²={pred['r2']:.4f}")

# Export best model
result.export("exports/best_model.n4a")
print("\n✅ Best model exported to exports/best_model.n4a")
```

## See Also

- {doc}`/reference/pipeline_syntax` - Complete pipeline syntax
- {doc}`/reference/metrics` - Available evaluation metrics
- {doc}`/user_guide/pipelines/writing_pipelines` - Pipeline construction guide
- {doc}`/user_guide/models/hyperparameter_tuning` - Automated tuning with Optuna
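
## Appendix: Target Scaling Sketch

The note in the Target Processing section says that target scaling is inverted, so metrics are reported on the original scale. The sketch below illustrates that idea in plain Python, without NIRS4ALL or scikit-learn. It min-max scales a few made-up targets, adds a constant offset as a stand-in for model predictions, maps the predictions back to the original units, and computes RMSE there. Every function name and value here is hypothetical and chosen for illustration only; it is not a description of how NIRS4ALL implements `y_processing` internally.

```python
import math


def fit_min_max(values):
    """Return the (min, max) pair needed to scale values to [0, 1]."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values from the original range [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    """Invert scale(): map values from [0, 1] back to original units."""
    return [v * (hi - lo) + lo for v in values]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up reference values (illustrative only)
y_true = [10.0, 20.0, 30.0, 50.0]

# "Fit" the target scaler and transform the targets
lo, hi = fit_min_max(y_true)
y_scaled = scale(y_true, lo, hi)

# Stand-in for model output in scaled space: a constant error of 0.05
y_pred_scaled = [v + 0.05 for v in y_scaled]

# Invert the scaling before computing the metric
y_pred = unscale(y_pred_scaled, lo, hi)

rmse_scaled = rmse(y_scaled, y_pred_scaled)
rmse_original = rmse(y_true, y_pred)

print(f"RMSE in scaled space:   {rmse_scaled:.4f}")
print(f"RMSE in original units: {rmse_original:.4f}")
```

For min-max scaling, the RMSE in original units equals the scaled RMSE multiplied by the target range (`hi - lo`, which is 40 in this sketch). An error measured in scaled space is therefore not directly comparable to one measured in the units of the reference values, which is why the metric is computed after inversion.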