Model Examples
This section covers model training, comparison, hyperparameter tuning, and ensemble methods in NIRS4ALL.
Overview
| Example | Topic | Difficulty | Duration |
|---|---|---|---|
| U01 | Multi-Model Comparison | ★★☆☆☆ | ~4 min |
| U02 | Hyperparameter Tuning | ★★★☆☆ | ~5 min |
| U03 | Stacking Ensembles | ★★★☆☆ | ~4 min |
| U04 | PLS Variants | ★★☆☆☆ | ~3 min |
U01: Multi-Model Comparison
Run and compare multiple models in a single pipeline.
What You’ll Learn
Defining multiple models in one pipeline
Using the _or_ generator syntax
Comparing model performance
Model selection strategies
Model Families for NIRS
Different models have different strengths:
| Family | Models | Strengths |
|---|---|---|
| Linear | PLS, Ridge, Lasso, ElasticNet | Handles collinearity, interpretable |
| Tree-based | RandomForest, GradientBoosting, ExtraTrees | Handles non-linearity, feature importance |
| Other | SVR, KNeighbors | Local patterns, non-parametric |
Basic Multi-Model Pipeline
List several models in the pipeline; each one is trained and evaluated on the same splits:
import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler
# NIRS4ALL transforms: import path assumed here, adjust it to your installed version
from nirs4all.operators.transforms import StandardNormalVariate, FirstDerivative
pipeline = [
    StandardNormalVariate(),
    FirstDerivative(),
    StandardScaler(),
    ShuffleSplit(n_splits=3, random_state=42),
    # Multiple models - each is evaluated
    {"model": PLSRegression(n_components=10)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=100)},
]
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="MultiModel"
)
print(f"Models tested: {result.num_predictions}")
print(f"Best RMSE: {result.best_score:.4f}")
Using _or_ Generator Syntax
The _or_ generator offers a more compact syntax for model variants:
from sklearn.linear_model import Lasso, ElasticNet
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    # Generate variants for each model
    {"_or_": [
        {"model": PLSRegression(n_components=10)},
        {"model": Ridge(alpha=1.0)},
        {"model": Lasso(alpha=0.1)},
        {"model": ElasticNet(alpha=0.1)},
    ]},
]
Combined Preprocessing + Model Search
Find the best preprocessing + model combination:
pipeline = [
    # Explore preprocessing options
    {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"},
    # Add derivative
    {"feature_augmentation": [FirstDerivative], "action": "add"},
    ShuffleSplit(n_splits=3),
    # Multiple models
    {"model": PLSRegression(n_components=10)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=50)},
]
Viewing Results
# Top 10 configurations
for i, pred in enumerate(result.top(10, display_metrics=['rmse', 'r2']), 1):
    preproc = pred.get('preprocessings', 'N/A')
    model = pred.get('model_name', 'Unknown')
    print(f"{i}. {preproc} + {model}: RMSE={pred.get('rmse', 0):.4f}")
U02: Hyperparameter Tuning
Automated hyperparameter search with Optuna integration.
What You’ll Learn
Grid search with _range_ syntax
Logarithmic ranges for regularization
Optuna integration for smart search
Early stopping and pruning
Parameter Type Reference
NIRS4ALL supports all Optuna parameter types via flexible syntax:
Tuple Format (most common):
"model_params": {
    "n_components": ('int', 1, 20),       # Integer uniform [1, 20]
    "n_layers": ('int_log', 1, 100),      # Integer log-uniform [1, 100]
    "tol": ('float', 1e-6, 1e-4),         # Float uniform [1e-6, 1e-4]
    "lr": ('float_log', 1e-5, 1e-1),      # Float log-uniform [1e-5, 1e-1]
    "kernel": ['rbf', 'linear', 'poly'],  # Categorical
}
Dict Format (most flexible):
"model_params": {
    # Integer with step (e.g., odd values only: 11, 13, 15, ...)
    "kernel_size": {'type': 'int', 'min': 11, 'max': 51, 'step': 2},
    # Integer log-scale
    "max_iter": {'type': 'int', 'min': 100, 'max': 10000, 'log': True},
    # Float with step (discrete)
    "learning_rate": {'type': 'float', 'min': 0.1, 'max': 1.0, 'step': 0.1},
    # Float log-scale (recommended for regularization, learning rates)
    "alpha": {'type': 'float', 'min': 1e-5, 'max': 1e-1, 'log': True},
    # Categorical
    "solver": {'type': 'categorical', 'choices': ['lbfgs', 'sgd', 'adam']},
    # Sorted tuple - generates N sorted values (useful for fractional derivative alphas)
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0},
}
Sorted Tuple Format (for ordered sequences):
The sorted_tuple type generates multiple values within a range and returns them as a sorted tuple.
This is useful for parameters such as fractional derivative orders, where the values must be ordered:
"model_params": {
    # The "alphas" entries below are alternatives: keep only one per configuration,
    # because a Python dict retains just the last value for a repeated key.
    # Fixed length: 4 floats between 0.0 and 2.0, returned sorted
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0},
    # Dynamic length: 3-5 floats (length is also optimized)
    "alphas": {'type': 'sorted_tuple', 'length': ('int', 3, 5), 'min': 0.0, 'max': 2.0},
    # With step for discrete values
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0, 'step': 0.5},
    # Integer elements
    "orders": {'type': 'sorted_tuple', 'length': 3, 'min': 1, 'max': 10, 'element_type': 'int'},
}
When to Use Log-Scale
Use float_log or int_log for parameters spanning multiple orders of magnitude:
Learning rates: ('float_log', 1e-5, 1e-1)
Regularization (alpha, lambda): ('float_log', 1e-6, 1.0)
Number of iterations: ('int_log', 100, 10000)
Log-uniform sampling ensures each order of magnitude gets equal exploration probability.
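To make the difference concrete, the sketch below compares uniform and log-uniform sampling over [1e-5, 1e-1] with NumPy. It illustrates the sampling idea only and is not NIRS4ALL or Optuna code; the helper `decade_fractions` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-5, 1e-1
n = 100_000

# Uniform sampling on the raw scale
uniform = rng.uniform(low, high, size=n)
# Log-uniform sampling: draw the exponent uniformly, then exponentiate
log_uniform = 10 ** rng.uniform(np.log10(low), np.log10(high), size=n)


def decade_fractions(samples):
    """Fraction of samples falling in each decade between 1e-5 and 1e-1."""
    edges = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
    counts, _ = np.histogram(samples, bins=edges)
    return counts / len(samples)


print("uniform:    ", decade_fractions(uniform).round(3))
print("log-uniform:", decade_fractions(log_uniform).round(3))
```

With uniform sampling, roughly nine draws in ten land in the top decade, so small values are almost never tried. With log-uniform sampling, each of the four decades receives about a quarter of the draws.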
Common Parameter Patterns
Odd integers only (e.g., kernel sizes for convolutions):
"kernel_size": {'type': 'int', 'min': 11, 'max': 51, 'step': 2}
# Samples: 11, 13, 15, ..., 49, 51
Sorted tuple (e.g., fractional derivative orders):
"alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0}
# Samples 4 floats in [0, 2], returns them sorted as a tuple
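The semantics of both patterns can be mimicked in a few lines of plain Python. This is a sketch of the behaviour as described here, not the NIRS4ALL sampler; `sample_stepped_int` and `sample_sorted_tuple` are hypothetical helpers.

```python
import random


def sample_stepped_int(rng, low, high, step):
    """Draw one integer from low, low + step, ..., up to high (inclusive)."""
    return rng.choice(range(low, high + 1, step))


def sample_sorted_tuple(rng, length, low, high):
    """Draw `length` floats in [low, high] and return them in ascending order."""
    return tuple(sorted(rng.uniform(low, high) for _ in range(length)))


rng = random.Random(0)

kernel_size = sample_stepped_int(rng, 11, 51, 2)
alphas = sample_sorted_tuple(rng, 4, 0.0, 2.0)

print(kernel_size)  # an odd value between 11 and 51
print(alphas)       # four ascending floats between 0.0 and 2.0
```

Starting the range at an odd number with a step of 2 is what guarantees odd values only; sorting after sampling is what guarantees the ordering.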
Grid Search with _range_
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    # Sweep n_components from 5 to 25, step 5
    {"model": PLSRegression(),
     "n_components": {"_range_": [5, 25, 5]}}
]
# Generates: PLS(5), PLS(10), PLS(15), PLS(20), PLS(25)
Logarithmic Ranges
For parameters like regularization strength:
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    # Log-spaced alpha values: 0.001, 0.01, 0.1, 1.0
    {"model": Ridge(),
     "alpha": {"_log_range_": [0.001, 1.0, 4]}}
]
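The four values listed in the comment are what a base-10 log-spaced grid between the two endpoints produces. A quick NumPy check, assuming `[0.001, 1.0, 4]` means start, stop, and number of points, as the comment implies:

```python
import numpy as np

start, stop, count = 0.001, 1.0, 4

# np.logspace works on exponents, so convert the endpoints with log10
alphas = np.logspace(np.log10(start), np.log10(stop), num=count)

print(alphas)
```

Each value is ten times the previous one, so every decade of regularization strength is tried exactly once.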
Combined Grid Search
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    # Grid over model type AND hyperparameters
    {"_grid_": {
        "model": [
            PLSRegression(n_components=5),
            PLSRegression(n_components=10),
            Ridge(alpha=0.1),
            Ridge(alpha=1.0),
        ]
    }}
]
Optuna Integration
For smarter search (Bayesian optimization):
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    {
        "model": Ridge(),
        "finetune_params": {
            "n_trials": 50,
            "sample": "tpe",  # Bayesian optimization
            "verbose": 1,
            "approach": "single",
            "model_params": {
                "alpha": ('float_log', 1e-4, 1e2),  # Log-uniform sampling
            }
        }
    }
]
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression"
)
U03: Stacking Ensembles
Combine multiple models for improved predictions.
What You’ll Learn
Prediction merging for stacking
Out-of-fold prediction collection
Meta-learner training
Two-level stacking
Stacking Concept
Stacking combines predictions from multiple base models:
Level 0 (Base Models):
    PLS → predictions_pls
    RF → predictions_rf
    Ridge → predictions_ridge
Level 1 (Meta-Learner):
    [predictions_pls, predictions_rf, predictions_ridge] → final_prediction
Basic Stacking Pipeline
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),
    # Create branches for base models
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=50)],
        "ridge": [Ridge(alpha=1.0)],
    }},
    # Merge predictions (OOF reconstruction)
    {"merge": "predictions"},
    # Meta-learner
    {"model": Ridge(alpha=0.1), "name": "MetaLearner"}
]
Understanding Prediction Merging
When using {"merge": "predictions"}:
Each branch produces out-of-fold (OOF) predictions
OOF predictions are reconstructed to form features for the meta-learner
No data leakage: each sample’s meta-features come from models that didn’t see it
Two-Level Stacking
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),
    # Level 0: Different preprocessing + model combinations
    {"branch": {
        "snv_pls": [SNV(), PLSRegression(n_components=10)],
        "msc_pls": [MSC(), PLSRegression(n_components=10)],
        "snv_rf": [SNV(), RandomForestRegressor(n_estimators=50)],
    }},
    {"merge": "predictions"},
    # Level 1: Meta-learner
    {"model": Ridge(alpha=0.1)}
]
U04: PLS Variants
Explore different Partial Least Squares implementations.
What You’ll Learn
Standard PLSRegression
PLSCanonical
Kernel PLS
Selecting the right variant
PLS Implementations
from sklearn.cross_decomposition import (
    PLSRegression,
    PLSCanonical,
    CCA
)
# Standard PLS - most common for NIRS
PLSRegression(n_components=10)
# PLSCanonical - symmetric, multivariate Y
PLSCanonical(n_components=10)
# CCA - Canonical Correlation Analysis
CCA(n_components=10)
When to Use Each
| Variant | Use Case |
|---|---|
| PLSRegression | Standard NIRS calibration, single target |
| PLSCanonical | Multi-output regression, balanced X/Y |
| CCA | Finding correlations, exploratory analysis |
Comparing PLS Components
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=5),
    # Compare different n_components
    {"model": PLSRegression(n_components=5), "name": "PLS-5"},
    {"model": PLSRegression(n_components=10), "name": "PLS-10"},
    {"model": PLSRegression(n_components=15), "name": "PLS-15"},
    {"model": PLSRegression(n_components=20), "name": "PLS-20"},
]
result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")
# Analyze component selection
analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_candlestick(variable="model_name", display_metric='rmse')
Optimal Component Selection
The best n_components balances bias against variance:
Too few: underfitting, high bias
Too many: overfitting, high variance
Use cross-validation to find the sweet spot:
pipeline = [
    SNV(),
    RepeatedKFold(n_splits=5, n_repeats=3),
    # Sweep components
    {"model": PLSRegression(),
     "n_components": {"_range_": [2, 30, 2]}}
]
Model Selection Guidelines
For NIRS Data
| Data Characteristic | Recommended Models |
|---|---|
| Linear relationships | PLS, Ridge |
| Non-linear relationships | Random Forest, Gradient Boosting |
| High collinearity | PLS (designed for this) |
| Small sample size | PLS, Ridge (regularized) |
| Large sample size | Random Forest, neural networks |
| Need interpretability | PLS, Ridge |
Quick Comparison Strategy
# Quick multi-model comparison
from sklearn.ensemble import GradientBoostingRegressor
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    {"model": PLSRegression(n_components=10), "name": "PLS"},
    {"model": Ridge(alpha=1.0), "name": "Ridge"},
    {"model": RandomForestRegressor(n_estimators=100), "name": "RF"},
    {"model": GradientBoostingRegressor(n_estimators=50), "name": "GBR"},
]
result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")
# Visualize comparison
analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_candlestick(variable="model_name", display_metric='rmse')
Running These Examples
cd examples
# Run all model examples
./run.sh -n "U0*.py" -c user
# Run hyperparameter tuning
python user/04_models/U02_hyperparameter_tuning.py --plots
Next Steps
After mastering model training:
Cross-Validation: Proper evaluation strategies
Deployment: Save and deploy trained models
Explainability: Understand model decisions with SHAP