# Model Examples

This section covers model training, comparison, hyperparameter tuning, and ensemble methods in NIRS4ALL.

```{contents} On this page
:local:
:depth: 2
```

## Overview

| Example | Topic | Difficulty | Duration |
|---------|-------|------------|----------|
| [U01](#u01-multi-model-comparison) | Multi-Model Comparison | ★★☆☆☆ | ~4 min |
| [U02](#u02-hyperparameter-tuning) | Hyperparameter Tuning | ★★★☆☆ | ~5 min |
| [U03](#u03-stacking-ensembles) | Stacking Ensembles | ★★★☆☆ | ~4 min |
| [U04](#u04-pls-variants) | PLS Variants | ★★☆☆☆ | ~3 min |

---

## U01: Multi-Model Comparison

**Run and compare multiple models in a single pipeline.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/04_models/U01_multi_model.py)

### What You'll Learn

- Defining multiple models in one pipeline
- Using the `_or_` generator syntax
- Comparing model performance
- Model selection strategies

### Model Families for NIRS

Different models have different strengths:

| Family | Models | Strengths |
|--------|--------|-----------|
| **Linear** | PLS, Ridge, Lasso, ElasticNet | Handles collinearity, interpretable |
| **Tree-based** | RandomForest, GradientBoosting, ExtraTrees | Handles non-linearity, feature importance |
| **Other** | SVR, KNeighbors | Local patterns, non-parametric |

### Basic Multi-Model Pipeline

List multiple models in the pipeline; each one is trained and evaluated:

```python
import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler

# StandardNormalVariate and FirstDerivative are nirs4all transforms;
# import them as in the linked example source.

pipeline = [
    StandardNormalVariate(),
    FirstDerivative(),
    StandardScaler(),
    ShuffleSplit(n_splits=3, random_state=42),

    # Multiple models - each is evaluated
    {"model": PLSRegression(n_components=10)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=100)},
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="MultiModel"
)

print(f"Models tested: {result.num_predictions}")
print(f"Best RMSE: {result.best_score:.4f}")
```

### Using `_or_` Generator Syntax

The `_or_` generator offers a more compact syntax for model variants:

```python
from sklearn.linear_model import ElasticNet, Lasso  # not imported in the previous snippet

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Generate variants for each model
    {"_or_": [
        {"model": PLSRegression(n_components=10)},
        {"model": Ridge(alpha=1.0)},
        {"model": Lasso(alpha=0.1)},
        {"model": ElasticNet(alpha=0.1)},
    ]},
]
```

### Combined Preprocessing + Model Search

Find the best combination of preprocessing and model:

```python
pipeline = [
    # Explore preprocessing options
    {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"},

    # Add derivative
    {"feature_augmentation": [FirstDerivative], "action": "add"},

    ShuffleSplit(n_splits=3),

    # Multiple models
    {"model": PLSRegression(n_components=10)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=50)},
]
```

### Viewing Results

```python
# Top 10 configurations
for i, pred in enumerate(result.top(10, display_metrics=['rmse', 'r2']), 1):
    preproc = pred.get('preprocessings', 'N/A')
    model = pred.get('model_name', 'Unknown')
    print(f"{i}. {preproc} + {model}: RMSE={pred.get('rmse', 0):.4f}")
```

---

## U02: Hyperparameter Tuning

**Automated hyperparameter search with Optuna integration.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/04_models/U02_hyperparameter_tuning.py)

### What You'll Learn

- Grid search with `_range_` syntax
- Logarithmic ranges for regularization
- Optuna integration for smart search
- Early stopping and pruning

### Parameter Type Reference

NIRS4ALL supports all Optuna parameter types via a flexible syntax:

**Tuple Format (most common):**

```python
"model_params": {
    "n_components": ('int', 1, 20),          # Integer uniform [1, 20]
    "n_layers": ('int_log', 1, 100),         # Integer log-uniform [1, 100]
    "tol": ('float', 1e-6, 1e-4),            # Float uniform [1e-6, 1e-4]
    "lr": ('float_log', 1e-5, 1e-1),         # Float log-uniform [1e-5, 1e-1]
    "kernel": ['rbf', 'linear', 'poly'],     # Categorical
}
```

**Dict Format (most flexible):**

```python
"model_params": {
    # Integer with step (e.g., odd values only: 11, 13, 15, ...)
    "kernel_size": {'type': 'int', 'min': 11, 'max': 51, 'step': 2},

    # Integer log-scale
    "max_iter": {'type': 'int', 'min': 100, 'max': 10000, 'log': True},

    # Float with step (discrete)
    "learning_rate": {'type': 'float', 'min': 0.1, 'max': 1.0, 'step': 0.1},

    # Float log-scale (recommended for regularization, learning rates)
    "alpha": {'type': 'float', 'min': 1e-5, 'max': 1e-1, 'log': True},

    # Categorical
    "solver": {'type': 'categorical', 'choices': ['lbfgs', 'sgd', 'adam']},

    # Sorted tuple - generates N sorted values (useful for fractional derivative alphas)
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0},
}
```

**Sorted Tuple Format (for ordered sequences):**

The `sorted_tuple` type generates multiple values within a range and returns them as a sorted tuple.
This is useful for parameters like fractional derivative orders:

```python
"model_params": {
    # The "alphas" entries below are alternatives: a Python dict keeps only
    # the last duplicate key, so use one of them at a time.

    # Fixed length: 4 floats between 0.0 and 2.0, returned sorted
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0},

    # Dynamic length: 3-5 floats (length is also optimized)
    "alphas": {'type': 'sorted_tuple', 'length': ('int', 3, 5), 'min': 0.0, 'max': 2.0},

    # With step for discrete values
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0, 'step': 0.5},

    # Integer elements
    "orders": {'type': 'sorted_tuple', 'length': 3, 'min': 1, 'max': 10, 'element_type': 'int'},
}
```

### When to Use Log-Scale

Use `float_log` or `int_log` for parameters spanning multiple orders of magnitude:

- **Learning rates**: `('float_log', 1e-5, 1e-1)`
- **Regularization (alpha, lambda)**: `('float_log', 1e-6, 1.0)`
- **Number of iterations**: `('int_log', 100, 10000)`

Log-uniform sampling gives each order of magnitude the same probability of being explored.

### Common Parameter Patterns

**Odd integers only** (e.g., kernel sizes for convolutions):

```python
"kernel_size": {'type': 'int', 'min': 11, 'max': 51, 'step': 2}
# Samples: 11, 13, 15, ..., 49, 51
```

**Sorted tuple** (e.g., fractional derivative orders):

```python
"alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0}
# Samples 4 floats in [0, 2], returns them sorted as a tuple
```

### Grid Search with `_range_`

```python
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Sweep n_components from 5 to 25, step 5
    {"model": PLSRegression(), "n_components": {"_range_": [5, 25, 5]}}
]
# Generates: PLS(5), PLS(10), PLS(15), PLS(20), PLS(25)
```

### Logarithmic Ranges

For parameters like regularization strength:

```python
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Log-spaced alpha values: 0.001, 0.01, 0.1, 1.0
    {"model": Ridge(), "alpha": {"_log_range_": [0.001, 1.0, 4]}}
]
```

### Combined Grid Search

```python
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Grid over model type AND hyperparameters
    {"_grid_": {
        "model": [
            PLSRegression(n_components=5),
            PLSRegression(n_components=10),
            Ridge(alpha=0.1),
            Ridge(alpha=1.0),
        ]
    }}
]
```

### Optuna Integration

For a smarter search than a grid, use Optuna's TPE sampler (Bayesian optimization):

```python
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    {
        "model": Ridge(),
        "finetune_params": {
            "n_trials": 50,
            "sample": "tpe",  # Bayesian optimization
            "verbose": 1,
            "approach": "single",
            "model_params": {
                "alpha": ('float_log', 1e-4, 1e2),  # Log-uniform sampling
            }
        }
    }
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression"
)
```

---

## U03: Stacking Ensembles

**Combine multiple models for improved predictions.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/04_models/U03_stacking_ensembles.py)

### What You'll Learn

- Prediction merging for stacking
- Out-of-fold prediction collection
- Meta-learner training
- Two-level stacking

### Stacking Concept

Stacking feeds the predictions of several base models into a meta-learner, which produces the final prediction:

```
Level 0 (Base Models):
    PLS   → predictions_pls
    RF    → predictions_rf
    Ridge → predictions_ridge

Level 1 (Meta-Learner):
    [predictions_pls, predictions_rf, predictions_ridge] → final_prediction
```

### Basic Stacking Pipeline

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),

    # Create branches for base models
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=50)],
        "ridge": [Ridge(alpha=1.0)],
    }},

    # Merge predictions (OOF reconstruction)
    {"merge": "predictions"},

    # Meta-learner
    {"model": Ridge(alpha=0.1), "name": "MetaLearner"}
]
```

### Understanding Prediction Merging

When using `{"merge": "predictions"}`:

1. Each branch produces out-of-fold (OOF) predictions
2. OOF predictions are reconstructed to form features for the meta-learner
3. No data leakage: each sample's meta-features come from models that didn't see it

### Two-Level Stacking

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),

    # Level 0: Different preprocessing + model combinations
    {"branch": {
        "snv_pls": [SNV(), PLSRegression(n_components=10)],
        "msc_pls": [MSC(), PLSRegression(n_components=10)],
        "snv_rf": [SNV(), RandomForestRegressor(n_estimators=50)],
    }},
    {"merge": "predictions"},

    # Level 1: Meta-learner
    {"model": Ridge(alpha=0.1)}
]
```

---

## U04: PLS Variants

**Explore different Partial Least Squares implementations.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/04_models/U04_pls_variants.py)

### What You'll Learn

- Standard PLSRegression
- PLSCanonical
- Kernel PLS
- Selecting the right variant

### PLS Implementations

```python
from sklearn.cross_decomposition import (
    PLSRegression, PLSCanonical, CCA
)

# Standard PLS - most common for NIRS
PLSRegression(n_components=10)

# PLSCanonical - symmetric, multivariate Y
PLSCanonical(n_components=10)

# CCA - Canonical Correlation Analysis
CCA(n_components=10)
```

### When to Use Each

| Variant | Use Case |
|---------|----------|
| **PLSRegression** | Standard NIRS calibration, single target |
| **PLSCanonical** | Multi-output regression, balanced X/Y |
| **CCA** | Finding correlations, exploratory analysis |

### Comparing PLS Components

```python
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=5),

    # Compare different n_components
    {"model": PLSRegression(n_components=5), "name": "PLS-5"},
    {"model": PLSRegression(n_components=10), "name": "PLS-10"},
    {"model": PLSRegression(n_components=15), "name": "PLS-15"},
    {"model": PLSRegression(n_components=20), "name": "PLS-20"},
]

result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")

# Analyze component selection
analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_candlestick(variable="model_name", display_metric='rmse')
```

### Optimal Component Selection

The optimal `n_components` balances two failure modes:

- Too few: Underfitting, high bias
- Too many: Overfitting, high variance

Use cross-validation to find the sweet spot:

```python
pipeline = [
    SNV(),
    RepeatedKFold(n_splits=5, n_repeats=3),

    # Sweep components
    {"model": PLSRegression(), "n_components": {"_range_": [2, 30, 2]}}
]
```

---

## Model Selection Guidelines

### For NIRS Data

| Data Characteristic | Recommended Models |
|---------------------|--------------------|
| Linear relationships | PLS, Ridge |
| Non-linear relationships | Random Forest, Gradient Boosting |
| High collinearity | PLS (designed for this) |
| Small sample size | PLS, Ridge (regularized) |
| Large sample size | Random Forest, neural networks |
| Need interpretability | PLS, Ridge |

### Quick Comparison Strategy

```python
# Quick multi-model comparison
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    {"model": PLSRegression(n_components=10), "name": "PLS"},
    {"model": Ridge(alpha=1.0), "name": "Ridge"},
    {"model": RandomForestRegressor(n_estimators=100), "name": "RF"},
    {"model": GradientBoostingRegressor(n_estimators=50), "name": "GBR"},
]

result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")

# Visualize comparison
analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_candlestick(variable="model_name", display_metric='rmse')
```

---

## Running These Examples

```bash
cd examples

# Run all model examples
./run.sh -n "U0*.py" -c user

# Run hyperparameter tuning
python user/04_models/U02_hyperparameter_tuning.py --plots
```

## Next Steps

After mastering model training:

- **Cross-Validation**: Proper evaluation strategies
- **Deployment**: Save and deploy trained models
- **Explainability**: Understand model decisions with SHAP