Model Examples

This section covers model training, comparison, hyperparameter tuning, and ensemble methods in NIRS4ALL.

Overview 

Example	Topic	Difficulty	Duration
U01	Multi-Model Comparison	★★☆☆☆	~4 min
U02	Hyperparameter Tuning	★★★☆☆	~5 min
U03	Stacking Ensembles	★★★☆☆	~4 min
U04	PLS Variants	★★☆☆☆	~3 min

U01: Multi-Model Comparison 

Run and compare multiple models in a single pipeline.

What You’ll Learn 

Defining multiple models in one pipeline
Using the _or_ generator syntax
Comparing model performance
Model selection strategies

Model Families for NIRS 

Each model family has different strengths on spectral data:

Family	Models	Strengths
Linear	PLS, Ridge, Lasso, ElasticNet	Handles collinearity, interpretable
Tree-based	RandomForest, GradientBoosting, ExtraTrees	Handles non-linearity, feature importance
Other	SVR, KNeighbors	Local patterns, non-parametric

Basic Multi-Model Pipeline 

List several models in the pipeline; each one is trained and evaluated:

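import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler

# StandardNormalVariate and FirstDerivative are NIRS4ALL transforms
# and must be imported from the nirs4all package as well.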

pipeline = [
    StandardNormalVariate(),
    FirstDerivative(),
    StandardScaler(),

    ShuffleSplit(n_splits=3, random_state=42),

    # Multiple models - each is evaluated
    {"model": PLSRegression(n_components=10)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=100)},
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="MultiModel"
)

print(f"Models tested: {result.num_predictions}")
print(f"Best RMSE: {result.best_score:.4f}")

Using _or_ Generator Syntax

The _or_ generator is a more compact way to list model variants:

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Generate variants for each model
    {"_or_": [
        {"model": PLSRegression(n_components=10)},
        {"model": Ridge(alpha=1.0)},
        {"model": Lasso(alpha=0.1)},
        {"model": ElasticNet(alpha=0.1)},
    ]},
]
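
To make the effect of an _or_ step concrete, here is a minimal sketch of how such a step could be expanded into one pipeline per alternative. The helper `expand_or` is hypothetical and is not NIRS4ALL's implementation; plain strings stand in for the real transformer and model objects.

```python
from typing import Any


def expand_or(steps: list[Any]) -> list[list[Any]]:
    """Expand every {"_or_": [...]} step into one pipeline per alternative."""
    variants: list[list[Any]] = [[]]
    for step in steps:
        if isinstance(step, dict) and "_or_" in step:
            # Branching step: copy each variant once per alternative.
            variants = [v + [alt] for v in variants for alt in step["_or_"]]
        else:
            # Ordinary step: append it unchanged to every variant.
            variants = [v + [step] for v in variants]
    return variants


# Strings are placeholders for the real objects (assumption for illustration).
pipeline = [
    "SNV",
    "ShuffleSplit",
    {"_or_": [{"model": "PLS"}, {"model": "Ridge"}, {"model": "Lasso"}]},
]

variants = expand_or(pipeline)
for variant in variants:
    print(variant)
```

Each printed variant shares the same preprocessing and splitter and differs only in its final model step.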

Combined Preprocessing + Model Search 

Find the best preprocessing + model combination:

pipeline = [
    # Explore preprocessing options
    {"feature_augmentation": [SNV, MSC, Detrend], "action": "extend"},

    # Add derivative
    {"feature_augmentation": [FirstDerivative], "action": "add"},

    ShuffleSplit(n_splits=3),

    # Multiple models
    {"model": PLSRegression(n_components=10)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=50)},
]

Viewing Results 

# Top 10 configurations
for i, pred in enumerate(result.top(10, display_metrics=['rmse', 'r2']), 1):
    preproc = pred.get('preprocessings', 'N/A')
    model = pred.get('model_name', 'Unknown')
    print(f"{i}. {preproc} + {model}: RMSE={pred.get('rmse', 0):.4f}")

U02: Hyperparameter Tuning 

Automated hyperparameter search with Optuna integration.

What You’ll Learn 

Grid search with _range_ syntax
Logarithmic ranges for regularization
Optuna integration for smart search
Early stopping and pruning

Parameter Type Reference 

NIRS4ALL supports all Optuna parameter types via flexible syntax:

Tuple Format (most common):

"model_params": {
    "n_components": ('int', 1, 20),        # Integer uniform [1, 20]
    "n_layers": ('int_log', 1, 100),       # Integer log-uniform [1, 100]
    "tol": ('float', 1e-6, 1e-4),          # Float uniform [1e-6, 1e-4]
    "lr": ('float_log', 1e-5, 1e-1),       # Float log-uniform [1e-5, 1e-1]
    "kernel": ['rbf', 'linear', 'poly'],   # Categorical
}
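
The sketch below illustrates what each tuple type means in terms of sampling, using only the standard library. It is a conceptual stand-in under stated assumptions: `sample_param` is a hypothetical helper, and the real sampling is delegated to Optuna, whose samplers are more sophisticated than independent random draws.

```python
import math
import random


def sample_param(spec, rng: random.Random):
    """Draw one value from a tuple- or list-style search-space entry."""
    if isinstance(spec, list):
        # A plain list is treated as a categorical choice.
        return rng.choice(spec)
    kind, low, high = spec
    if kind == "int":
        return rng.randint(low, high)  # inclusive on both ends
    if kind == "float":
        return rng.uniform(low, high)
    if kind in ("float_log", "int_log"):
        # Sample uniformly in log space, then map back and clamp
        # to guard against floating-point overshoot at the bounds.
        value = math.exp(rng.uniform(math.log(low), math.log(high)))
        if kind == "int_log":
            value = round(value)
        return min(high, max(low, value))
    raise ValueError(f"unknown parameter type: {kind!r}")


space = {
    "n_components": ("int", 1, 20),
    "n_layers": ("int_log", 1, 100),
    "tol": ("float", 1e-6, 1e-4),
    "lr": ("float_log", 1e-5, 1e-1),
    "kernel": ["rbf", "linear", "poly"],
}

rng = random.Random(0)
draw = {name: sample_param(spec, rng) for name, spec in space.items()}
print(draw)
```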

Dict Format (most flexible):

"model_params": {
    # Integer with step (e.g., odd values only: 11, 13, 15, ...)
    "kernel_size": {'type': 'int', 'min': 11, 'max': 51, 'step': 2},
    # Integer log-scale
    "max_iter": {'type': 'int', 'min': 100, 'max': 10000, 'log': True},
    # Float with step (discrete)
    "learning_rate": {'type': 'float', 'min': 0.1, 'max': 1.0, 'step': 0.1},
    # Float log-scale (recommended for regularization, learning rates)
    "alpha": {'type': 'float', 'min': 1e-5, 'max': 1e-1, 'log': True},
    # Categorical
    "solver": {'type': 'categorical', 'choices': ['lbfgs', 'sgd', 'adam']},
    # Sorted tuple - generates N sorted values (useful for fractional derivative alphas)
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0},
}

Sorted Tuple Format (for ordered sequences):

The sorted_tuple type generates multiple values within a range and returns them as a sorted tuple. This is useful for parameters like fractional derivative orders:

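"model_params": {
    # The "alphas" entries below are alternatives: keep only one per config,
    # since a repeated dict key silently overrides the earlier ones.

    # Fixed length: 4 floats between 0.0 and 2.0, returned sorted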
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0},

    # Dynamic length: 3-5 floats (length is also optimized)
    "alphas": {'type': 'sorted_tuple', 'length': ('int', 3, 5), 'min': 0.0, 'max': 2.0},

    # With step for discrete values
    "alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0, 'step': 0.5},

    # Integer elements
    "orders": {'type': 'sorted_tuple', 'length': 3, 'min': 1, 'max': 10, 'element_type': 'int'},
}
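
As a rough illustration of the sorted_tuple semantics described above, the following sketch draws the elements independently and sorts them. `sample_sorted_tuple` is a hypothetical helper written for this page; how NIRS4ALL actually maps this type onto Optuna trials may differ.

```python
import math
import random


def sample_sorted_tuple(length, low, high, rng, step=None, element_type="float"):
    """Draw `length` values in [low, high] and return them in ascending order."""
    values = []
    for _ in range(length):
        if step is not None:
            # Discrete grid: low, low + step, ..., never exceeding high.
            n_steps = math.floor((high - low) / step + 1e-9)
            value = low + step * rng.randint(0, n_steps)
        elif element_type == "int":
            value = rng.randint(low, high)
        else:
            value = rng.uniform(low, high)
        values.append(value)
    return tuple(sorted(values))


rng = random.Random(0)

alphas = sample_sorted_tuple(4, 0.0, 2.0, rng)              # fixed length
stepped = sample_sorted_tuple(4, 0.0, 2.0, rng, step=0.5)   # discrete values
orders = sample_sorted_tuple(3, 1, 10, rng, element_type="int")

# Dynamic length: the length itself is drawn first, then the elements.
dynamic = sample_sorted_tuple(rng.randint(3, 5), 0.0, 2.0, rng)

print(alphas, stepped, orders, dynamic, sep="\n")
```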

When to Use Log-Scale 

Use float_log or int_log for parameters spanning multiple orders of magnitude:

Learning rates: ('float_log', 1e-5, 1e-1)
Regularization (alpha, lambda): ('float_log', 1e-6, 1.0)
Number of iterations: ('int_log', 100, 10000)

Log-uniform sampling gives every order of magnitude the same probability of being sampled, so small values are explored as thoroughly as large ones.
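
The difference is easy to check empirically. The sketch below (standard library only, independent of NIRS4ALL and Optuna) draws values in [1e-5, 1e-1] both ways and reports the share of draws that lands in each decade.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
low, high, n = 1e-5, 1e-1, 20_000


def decade(x):
    """Decade index of x: -5 for [1e-5, 1e-4), ..., -2 for [1e-2, 1e-1]."""
    return min(-2, max(-5, math.floor(math.log10(x))))


uniform_draws = [rng.uniform(low, high) for _ in range(n)]
log_draws = [10 ** rng.uniform(math.log10(low), math.log10(high)) for _ in range(n)]

uniform_share = Counter(decade(x) for x in uniform_draws)
log_share = Counter(decade(x) for x in log_draws)

for d in range(-5, -1):
    print(
        f"1e{d}..1e{d + 1}: "
        f"uniform={uniform_share[d] / n:.3f}  "
        f"log-uniform={log_share[d] / n:.3f}"
    )
```

With plain uniform sampling, roughly 90% of the draws fall in the top decade (1e-2 to 1e-1), because that decade covers about 90% of the interval's width. Log-uniform sampling spreads the draws evenly, about a quarter per decade.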

Common Parameter Patterns 

Odd integers only (e.g., kernel sizes for convolutions):

"kernel_size": {'type': 'int', 'min': 11, 'max': 51, 'step': 2}
# Samples: 11, 13, 15, ..., 49, 51

Sorted tuple (e.g., fractional derivative orders):

"alphas": {'type': 'sorted_tuple', 'length': 4, 'min': 0.0, 'max': 2.0}
# Samples 4 floats in [0, 2], returns them sorted as a tuple

Grid Search with _range_

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Sweep n_components from 5 to 25, step 5
    {"model": PLSRegression(),
     "n_components": {"_range_": [5, 25, 5]}}
]
# Generates: PLS(5), PLS(10), PLS(15), PLS(20), PLS(25)
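
Judging from the values listed in the comment above, the `[start, stop, step]` specification includes its end point. A minimal sketch of that expansion, with `expand_range` as a hypothetical helper rather than a NIRS4ALL function:

```python
def expand_range(spec):
    """Expand [start, stop, step] into a list of integers, including stop."""
    start, stop, step = spec
    # range() excludes its end point, so extend it by one to include stop.
    return list(range(start, stop + 1, step))


values = expand_range([5, 25, 5])
print(values)  # [5, 10, 15, 20, 25]
```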

Logarithmic Ranges 

For parameters like regularization strength:

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Log-spaced alpha values: 0.001, 0.01, 0.1, 1.0
    {"model": Ridge(),
     "alpha": {"_log_range_": [0.001, 1.0, 4]}}
]
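
A log-spaced range places its values at a constant ratio rather than a constant difference. The sketch below shows one way to compute such a sequence from `[low, high, count]`; `expand_log_range` is a hypothetical helper, and the assumption that the third element is a count (not a step) is inferred from the four values listed in the comment above.

```python
def expand_log_range(spec):
    """Expand [low, high, count] into `count` geometrically spaced values."""
    low, high, count = spec
    if count == 1:
        return [low]
    # Constant multiplicative ratio between consecutive values.
    ratio = (high / low) ** (1 / (count - 1))
    return [low * ratio**i for i in range(count)]


alphas = expand_log_range([0.001, 1.0, 4])
print([round(a, 6) for a in alphas])
```

Each value is ten times the previous one here, up to floating-point rounding.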

Combined Grid Search 

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    # Grid over model type AND hyperparameters
    {"_grid_": {
        "model": [
            PLSRegression(n_components=5),
            PLSRegression(n_components=10),
            Ridge(alpha=0.1),
            Ridge(alpha=1.0),
        ]
    }}
]
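
A grid is conventionally the Cartesian product of the values listed under each key. The sketch below illustrates that idea with `itertools.product`. It is an assumption about the general technique, not a description of how NIRS4ALL evaluates `_grid_`; the `scaler` key and the helper `expand_grid` are hypothetical and exist only to show a two-key product.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the values listed under each key."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Strings stand in for real estimator objects (assumption for illustration).
grid = {
    "model": ["PLS", "Ridge"],
    "scaler": ["StandardScaler", "MinMaxScaler"],
}

configs = expand_grid(grid)
for config in configs:
    print(config)
```

With a single key, as in the pipeline above, the product reduces to one configuration per listed model.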

Optuna Integration 

For a smarter search, use Optuna's TPE sampler (a Bayesian optimization method):

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),
    {
        "model": Ridge(),
        "finetune_params": {
            "n_trials": 50,
            "sample": "tpe",          # Bayesian optimization
            "verbose": 1,
            "approach": "single",
            "model_params": {
                "alpha": ('float_log', 1e-4, 1e2),  # Log-uniform sampling
            }
        }
    }
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression"
)

U03: Stacking Ensembles 

Combine multiple models for improved predictions.

What You’ll Learn 

Prediction merging for stacking
Out-of-fold prediction collection
Meta-learner training
Two-level stacking

Stacking Concept 

Stacking combines predictions from multiple base models:

Level 0 (Base Models):
  PLS  →  predictions_pls
  RF   →  predictions_rf
  Ridge → predictions_ridge

Level 1 (Meta-Learner):
  [predictions_pls, predictions_rf, predictions_ridge] → final_prediction

Basic Stacking Pipeline 

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),

    # Create branches for base models
    {"branch": {
        "pls": [PLSRegression(n_components=10)],
        "rf": [RandomForestRegressor(n_estimators=50)],
        "ridge": [Ridge(alpha=1.0)],
    }},

    # Merge predictions (OOF reconstruction)
    {"merge": "predictions"},

    # Meta-learner
    {"model": Ridge(alpha=0.1), "name": "MetaLearner"}
]

Understanding Prediction Merging 

When using {"merge": "predictions"}:

Each branch produces out-of-fold (OOF) predictions
OOF predictions are reconstructed to form features for the meta-learner
No data leakage: each sample’s meta-features come from models that didn’t see it
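
The three points above can be sketched in plain Python. This is a toy illustration, not NIRS4ALL's merge logic: it uses a K-fold-style partition (so every sample is held out exactly once), two deliberately trivial base models, and small made-up data. All function names are hypothetical.

```python
from statistics import mean


def fit_mean(X, y):
    return mean(y)  # the "model" is just the training-set mean


def predict_mean(model, x):
    return model


def fit_nearest(X, y):
    return list(zip(X, y))  # 1-nearest-neighbour memorises the training pairs


def predict_nearest(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


def oof_column(X, y, n_folds, fit, predict):
    """One out-of-fold prediction per sample, from a model that never saw it."""
    n = len(y)
    column = [None] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, n, n_folds):  # samples held out in this fold
            column[i] = predict(model, X[i])
    return column


# Made-up toy data, for illustration only.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]

base_models = {"mean": (fit_mean, predict_mean), "nearest": (fit_nearest, predict_nearest)}
columns = {
    name: oof_column(X, y, 3, fit, predict)
    for name, (fit, predict) in base_models.items()
}

# One row per sample, one column per base model: the meta-learner's inputs.
meta_features = [list(row) for row in zip(*columns.values())]
for row, target in zip(meta_features, y):
    print(row, "->", target)
```

Because the nearest-neighbour model would reproduce any training target exactly, the fact that its out-of-fold prediction never equals the sample's own target shows that no sample was predicted by a model trained on it.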

Two-Level Stacking 

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),

    # Level 0: Different preprocessing + model combinations
    {"branch": {
        "snv_pls": [SNV(), PLSRegression(n_components=10)],
        "msc_pls": [MSC(), PLSRegression(n_components=10)],
        "snv_rf": [SNV(), RandomForestRegressor(n_estimators=50)],
    }},

    {"merge": "predictions"},

    # Level 1: Meta-learner
    {"model": Ridge(alpha=0.1)}
]

U04: PLS Variants 

Explore different Partial Least Squares implementations.

What You’ll Learn 

Standard PLSRegression
PLSCanonical
Kernel PLS
Selecting the right variant

PLS Implementations 

from sklearn.cross_decomposition import (
    PLSRegression,
    PLSCanonical,
    CCA
)

# Standard PLS - most common for NIRS
PLSRegression(n_components=10)

# PLSCanonical - symmetric, multivariate Y
PLSCanonical(n_components=10)

# CCA - Canonical Correlation Analysis
CCA(n_components=10)

When to Use Each 

Variant	Use Case
PLSRegression	Standard NIRS calibration, single target
PLSCanonical	Multi-output regression, balanced X/Y
CCA	Finding correlations, exploratory analysis

Comparing PLS Components 

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=5),

    # Compare different n_components
    {"model": PLSRegression(n_components=5), "name": "PLS-5"},
    {"model": PLSRegression(n_components=10), "name": "PLS-10"},
    {"model": PLSRegression(n_components=15), "name": "PLS-15"},
    {"model": PLSRegression(n_components=20), "name": "PLS-20"},
]

result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")

# Analyze component selection
analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_candlestick(variable="model_name", display_metric='rmse')

Optimal Component Selection 

The optimal n_components balances two failure modes:

Too few: underfitting, high bias
Too many: overfitting, high variance

Use cross-validation to find the sweet spot:

pipeline = [
    SNV(),
    RepeatedKFold(n_splits=5, n_repeats=3),

    # Sweep components
    {"model": PLSRegression(),
     "n_components": {"_range_": [2, 30, 2]}}
]
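
Once a sweep like this has produced per-fold scores, the selection itself is a small computation: take the setting with the lowest mean cross-validated RMSE. The sketch below shows that step in isolation. The RMSE values are made up for illustration, and `best_n_components` is a hypothetical helper, not part of the NIRS4ALL result API.

```python
from statistics import mean


def best_n_components(fold_rmse):
    """Return the n_components whose mean cross-validated RMSE is lowest."""
    return min(fold_rmse, key=lambda n: mean(fold_rmse[n]))


# Made-up per-fold RMSE values, for illustration only.
fold_rmse = {
    2: [0.92, 0.88, 0.95],   # too few components: underfitting
    6: [0.61, 0.58, 0.64],
    10: [0.52, 0.55, 0.50],
    14: [0.54, 0.57, 0.53],
    20: [0.63, 0.66, 0.60],  # too many components: overfitting
}

best = best_n_components(fold_rmse)
print(f"Selected n_components = {best}")
```

When several settings score within noise of each other, choosing the smallest of them is a common way to favour the simpler model.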

Model Selection Guidelines 

For NIRS Data 

Data Characteristic	Recommended Models
Linear relationships	PLS, Ridge
Non-linear relationships	Random Forest, Gradient Boosting
High collinearity	PLS (designed for this)
Small sample size	PLS, Ridge (regularized)
Large sample size	Random Forest, neural networks
Need interpretability	PLS, Ridge

Quick Comparison Strategy 

# Quick multi-model comparison
pipeline = [
    SNV(),
    ShuffleSplit(n_splits=3),

    {"model": PLSRegression(n_components=10), "name": "PLS"},
    {"model": Ridge(alpha=1.0), "name": "Ridge"},
    {"model": RandomForestRegressor(n_estimators=100), "name": "RF"},
    {"model": GradientBoostingRegressor(n_estimators=50), "name": "GBR"},
]

result = nirs4all.run(pipeline=pipeline, dataset="sample_data/regression")

# Visualize comparison
analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_candlestick(variable="model_name", display_metric='rmse')

Running These Examples 

cd examples

# Run all model examples
./run.sh -n "U0*.py" -c user

# Run hyperparameter tuning
python user/04_models/U02_hyperparameter_tuning.py --plots

Next Steps 

After mastering model training:

Cross-Validation: Proper evaluation strategies
Deployment: Save and deploy trained models
Explainability: Understand model decisions with SHAP