Getting Started Examples

This section introduces the fundamentals of NIRS4ALL through a series of progressive examples. Each example builds upon the previous one, guiding you from your first pipeline to advanced visualization techniques.

Overview 

Example	Topic	Difficulty	Duration
U01	Hello World	★☆☆☆☆	~1 min
U02	Basic Regression	★★☆☆☆	~3 min
U03	Basic Classification	★★☆☆☆	~2 min
U04	Visualization	★★☆☆☆	~3 min

U01: Hello World 

Your first NIRS4ALL pipeline in about 20 lines of code.

📄 View source code

What You’ll Learn 

Using nirs4all.run() to train a pipeline
The structure of a minimal pipeline
Reading results from the RunResult object

Key Concepts 

A pipeline in NIRS4ALL is simply a list of processing steps:

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

import nirs4all

# Generate synthetic data (or use your own dataset path)
dataset = nirs4all.generate.regression(
    n_samples=200,
    target_component=0,
    random_state=42
)

# Define the pipeline as a list of steps
pipeline = [
    MinMaxScaler(),                              # Feature scaling
    {"y_processing": MinMaxScaler()},            # Target scaling
    ShuffleSplit(n_splits=3, test_size=0.25),    # Cross-validation
    {"model": PLSRegression(n_components=10)}    # Model
]

# Run with one simple call
result = nirs4all.run(
    pipeline=pipeline,
    dataset=dataset,
    name="HelloWorld",
    verbose=1
)

# Access results
print(f"Best RMSE: {result.best_rmse:.4f}")
print(f"Best R²: {result.best_r2:.4f}")

# Explore top predictions
for pred in result.top(n=3, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}, R²={pred['r2']:.4f}")
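
The ShuffleSplit step in the pipeline above draws three independent random train/test partitions. The sketch below imitates that behaviour in pure Python so you can see the resulting fold sizes. It is not scikit-learn's implementation, and the helper name shuffle_split is hypothetical.

```python
import math
import random


def shuffle_split(n_samples, n_splits=3, test_size=0.25, seed=0):
    """Yield (train_idx, test_idx) pairs, each from a fresh random shuffle.

    Hypothetical helper that mimics the idea behind ShuffleSplit.
    """
    rng = random.Random(seed)
    n_test = math.ceil(n_samples * test_size)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set, the rest train.
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


splits = list(shuffle_split(200))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test")
```

With 200 samples and test_size=0.25, every fold holds 150 training and 50 test samples. Because each fold is shuffled independently, a sample can land in the test set of more than one fold.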

The RunResult Object 

The result object provides convenient accessors:

Accessor	Description
`result.best_score`	Best model’s primary score (MSE by default)
`result.best_rmse`	Best model’s RMSE
`result.best_r2`	Best model’s R²
`result.best`	Best prediction entry as a dictionary
`result.top(n)`	Top N predictions ranked by score
`result.predictions`	Full Predictions object for analysis
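
To make the relationship between these scores concrete, here is a small hand computation of MSE, RMSE, and R² on toy values. It uses the standard textbook definitions and does not call NIRS4ALL; the numbers are invented for illustration.

```python
import math

# Toy targets and predictions (illustrative values only)
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]

n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares

mse = ss_res / n
rmse = math.sqrt(mse)       # RMSE is the square root of MSE
r2 = 1.0 - ss_res / ss_tot  # R² compares the model to predicting the mean

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

RMSE is expressed in the units of the target, which usually makes it easier to interpret than MSE. Lower is better for MSE and RMSE, while higher is better for R².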

Understanding Prediction Entries 

Each prediction returned by result.top() or result.best is a dictionary that holds identifiers, metrics, per-partition scores, and metadata:

# Get top predictions
for pred in result.top(n=3, display_metrics=['rmse', 'r2']):
    # Core identification
    print(f"Model: {pred['model_name']}")
    print(f"Dataset: {pred['dataset_name']}")
    print(f"Fold: {pred['fold_id']}")
    print(f"Preprocessing: {pred.get('preprocessings', 'none')}")

    # Metrics (available when using display_metrics)
    print(f"RMSE: {pred['rmse']:.4f}")
    print(f"R²: {pred['r2']:.4f}")

    # Scores by partition (primary metric)
    print(f"Train: {pred['train_score']:.6f}")
    print(f"Val: {pred['val_score']:.6f}")
    print(f"Test: {pred['test_score']:.6f}")

    # Additional metadata
    print(f"Samples: {pred['n_samples']}, Features: {pred['n_features']}")

Key fields in each prediction entry:

Field	Description
`model_name`	Name of the model (e.g., “PLSRegression”)
`model_classname`	Class name of the model
`dataset_name`	Dataset used for training
`fold_id`	Cross-validation fold index
`preprocessings`	Preprocessing steps applied
`train_score`	Training score (primary metric)
`val_score`	Validation score (primary metric)
`test_score`	Test score (primary metric)
`rmse`, `r2`	RMSE and R² (when using `display_metrics`)
`n_samples`, `n_features`	Data shape information
`task_type`	‘regression’ or ‘classification’
`metric`	Primary metric name (e.g., ‘mse’)
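
Because each entry is a plain dictionary, ordinary Python tools are enough to compare entries. The sketch below uses hand-written dictionaries that borrow a few field names from the table above. The values are invented, and a real entry contains more fields.

```python
# Hypothetical entries: field names follow the table above, values are made up.
entries = [
    {"model_name": "PLSRegression", "fold_id": 0, "val_score": 0.020, "test_score": 0.025},
    {"model_name": "PLSRegression", "fold_id": 1, "val_score": 0.010, "test_score": 0.030},
    {"model_name": "PLSRegression", "fold_id": 2, "val_score": 0.030, "test_score": 0.020},
]

# With an error metric such as MSE, the smallest validation score is the best.
best_by_val = min(entries, key=lambda e: e["val_score"])

# Average validation score across folds
mean_val = sum(e["val_score"] for e in entries) / len(entries)

# A large gap between test and validation scores can hint at overfitting.
for e in entries:
    gap = e["test_score"] - e["val_score"]
    print(f"fold {e['fold_id']}: val={e['val_score']:.3f} test={e['test_score']:.3f} gap={gap:+.3f}")

print(f"best fold by validation score: {best_by_val['fold_id']}")
print(f"mean validation score: {mean_val:.3f}")
```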

Tips for Beginners 

Start simple: Begin with a basic pipeline and add complexity gradually
Use verbose=1: See what’s happening during training
Check top models: Use result.top(n=5, display_metrics=['rmse', 'r2']) to compare performance
Explore predictions: Each prediction entry contains detailed metrics and metadata

U02: Basic Regression 

A complete regression pipeline with NIRS-specific preprocessing and visualization.

📄 View source code

What You’ll Learn 

NIRS-specific preprocessing (SNV, Detrend, Derivatives, Gaussian)
Feature augmentation to explore preprocessing combinations
Using PredictionAnalyzer for result visualization
Comparing models with different n_components

NIRS Preprocessing Options 

NIRS4ALL provides specialized transforms for spectral data:

Transform	Purpose	When to Use
`StandardNormalVariate` (SNV)	Scatter correction	Path length variations
`MultiplicativeScatterCorrection` (MSC)	Scatter correction	Reference-based correction
`Detrend`	Baseline correction	Polynomial drift removal
`FirstDerivative`	Enhance peaks, remove baseline	Constant baseline issues
`SavitzkyGolay`	Smoothing + derivatives	Noisy data
`Gaussian`	Smoothing	Noise reduction
`Haar`	Wavelet transform	Multi-resolution analysis
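
To give a feel for what two of these transforms do, the sketch below implements a standard normal variate and a first difference in pure Python. These are textbook formulations, not the NIRS4ALL classes, and the choice of the population standard deviation in snv is an assumption.

```python
import statistics


def snv(spectrum):
    """Center one spectrum on its own mean and scale by its own standard deviation."""
    mean = statistics.fmean(spectrum)
    std = statistics.pstdev(spectrum)  # population std: an assumption of this sketch
    return [(v - mean) / std for v in spectrum]


def first_difference(spectrum):
    """Discrete first derivative: the difference between neighbouring points."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


base = [1.0, 2.0, 4.0, 2.0, 1.0]       # toy spectrum with a single peak
scattered = [2 * v + 3 for v in base]  # same shape, different gain and offset
shifted = [v + 0.5 for v in base]      # same shape, constant baseline offset

# SNV removes both the gain and the offset, so the two spectra coincide.
print([round(v, 3) for v in snv(base)])
print([round(v, 3) for v in snv(scattered)])

# Differencing removes the constant baseline.
print(first_difference(base))
print(first_difference(shifted))
```

The first pair of printed lists matches, and so does the second pair. This is why scatter correction helps with path length variations and a derivative helps with a constant baseline.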

Feature Augmentation 

Instead of manually defining multiple pipelines, use feature augmentation to explore combinations:

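# Spectral transforms from NIRS4ALL (import path assumed: nirs4all.operators.transforms)
from nirs4all.operators.transforms import (
    Detrend, FirstDerivative, Gaussian, Haar, SavitzkyGolay
)

pipeline = [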
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},

    # Generate 3 preprocessing combinations from 5 options
    {
        "feature_augmentation": {
            "_or_": [Detrend, FirstDerivative, Gaussian, SavitzkyGolay, Haar],
            "pick": 2,      # Pick 2 at a time
            "count": 3      # Generate 3 combinations
        }
    },

    ShuffleSplit(n_splits=3, test_size=0.25),
    {"model": PLSRegression(n_components=10)}
]
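
The numbers in this configuration are easier to follow once you count the candidates. Picking 2 transforms out of 5 gives 10 possible pairs, and count keeps 3 of them. The sketch below reproduces that arithmetic with the standard library. Treating pick as unordered combinations and drawing the 3 at random are assumptions about how the generator behaves.

```python
import itertools
import random

# Transform names as plain strings, so the sketch runs without NIRS4ALL
options = ["Detrend", "FirstDerivative", "Gaussian", "SavitzkyGolay", "Haar"]

# "pick": 2 -> every unordered pair of distinct options
all_pairs = list(itertools.combinations(options, 2))
print(f"{len(all_pairs)} possible pairs")

# "count": 3 -> keep only 3 of them (random selection is an assumption here)
rng = random.Random(42)
chosen = rng.sample(all_pairs, 3)
for pair in chosen:
    print(" + ".join(pair))
```

Limiting count keeps the run time bounded: each retained combination produces its own preprocessing variant to fit and score.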

Visualization with PredictionAnalyzer 

from nirs4all.visualization.predictions import PredictionAnalyzer

analyzer = PredictionAnalyzer(result.predictions)

# Compare top K models
analyzer.plot_top_k(k=3, rank_metric='rmse')

# Heatmap: models vs preprocessing
analyzer.plot_heatmap(x_var="model_name", y_var="preprocessings")

# Performance distribution
analyzer.plot_candlestick(variable="model_name")

U03: Basic Classification 

Classification pipeline with Random Forest, XGBoost, and confusion matrix visualization.

📄 View source code

What You’ll Learn 

Setting up a classification pipeline
Using multiple classifiers (Random Forest, XGBoost)
Confusion matrix visualization
Classification metrics (accuracy, balanced recall)

Classification Pipeline Structure 

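from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler

# Spectral transforms from NIRS4ALL (import path assumed: nirs4all.operators.transforms)
from nirs4all.operators.transforms import (
    FirstDerivative,
    Haar,
    MultiplicativeScatterCorrection,
    StandardNormalVariate,
)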

pipeline = [
    # Feature augmentation with preprocessing options
    {"feature_augmentation": [
        FirstDerivative,
        StandardNormalVariate,
        Haar,
        MultiplicativeScatterCorrection
    ]},

    StandardScaler(),
    ShuffleSplit(n_splits=3, test_size=0.25),

    # Classifier
    {"model": RandomForestClassifier(n_estimators=50, max_depth=8)}
]

Classification Metrics 

Metric	Description	Use Case
`accuracy`	Overall correct predictions	Balanced classes
`balanced_recall`	Average recall per class	Imbalanced classes
`balanced_accuracy`	Average accuracy per class	Class imbalance
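
The difference between accuracy and a class-balanced metric shows up clearly on imbalanced data. The sketch below computes plain accuracy and the mean of per-class recall by hand on a toy example. It illustrates the general definitions only and is not how NIRS4ALL computes its metrics; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_per_class_recall(y_true, y_pred):
    """Average of the recall obtained on each class separately."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy data: 8 samples of class 0, 2 samples of class 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that always predicts the majority class

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"mean per-class recall: {mean_per_class_recall(y_true, y_pred):.2f}")
```

The majority-class predictor reaches 0.80 accuracy but only 0.50 mean per-class recall, which is why a balanced metric is the safer choice when classes are uneven.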

Confusion Matrix Visualization 

# Plot confusion matrices for top 4 classifiers
analyzer.plot_confusion_matrix(
    k=4,
    rank_metric='accuracy',
    rank_partition='val',
    display_partition='test'
)
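
If you have not worked with confusion matrices before, the sketch below builds one by hand from toy labels. Rows are true classes and columns are predicted classes, which is a common convention; the orientation used by the plot is not confirmed here. The helper name confusion_matrix is defined locally for illustration.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Toy labels (illustrative values only)
labels = ["a", "b", "c"]
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
print(f"{correct} of {len(y_true)} correct")
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy number cannot reveal.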

U04: Visualization 

A comprehensive tour of all visualization options in NIRS4ALL.

📄 View source code

What You’ll Learn 

All PredictionAnalyzer methods and options
Heatmaps, candlestick charts, histograms
Top-k comparison plots
Ranking vs display partition configuration

Available Visualizations 

Top-K Comparison

# Basic top-k plot
analyzer.plot_top_k(k=3, rank_metric='rmse')

# Rank by R² on the test partition
analyzer.plot_top_k(k=3, rank_metric='r2', rank_partition='test')

Heatmaps

Create 2D comparisons between any two variables:

# Model vs preprocessing
analyzer.plot_heatmap(x_var="model_name", y_var="preprocessings")

# Model vs dataset
analyzer.plot_heatmap(x_var="model_name", y_var="dataset_name", display_metric="r2")

# Model vs fold with counts
analyzer.plot_heatmap(x_var="model_name", y_var="fold_id", show_counts=True)

Candlestick Charts

Show performance distribution per category:

analyzer.plot_candlestick(variable="model_name", display_metric='rmse')
analyzer.plot_candlestick(variable="dataset_name", display_metric='r2')

Histograms

analyzer.plot_histogram(display_metric='rmse')
analyzer.plot_histogram(display_metric='r2')

Ranking vs Display: A Key Concept 

The metric and partition used to rank models can differ from the metric and partition shown in the plot:

Parameter	Purpose
`rank_metric` + `rank_partition`	Determines which models are “best”
`display_metric` + `display_partition`	What values to show

# Rank by validation RMSE, but display test R²
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="preprocessings",
    rank_metric='rmse',
    rank_partition='val',
    display_metric='r2',
    display_partition='test'
)
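
The idea of ranking on one partition and displaying another can be sketched with plain dictionaries. The nested layout, the function name, and the scores below are hypothetical and exist only for illustration; they do not mirror the internal data model of PredictionAnalyzer.

```python
# Hypothetical candidates with made-up scores per partition
candidates = [
    {"model_name": "A", "val": {"rmse": 0.30, "r2": 0.90}, "test": {"rmse": 0.35, "r2": 0.86}},
    {"model_name": "B", "val": {"rmse": 0.25, "r2": 0.92}, "test": {"rmse": 0.40, "r2": 0.80}},
    {"model_name": "C", "val": {"rmse": 0.28, "r2": 0.91}, "test": {"rmse": 0.37, "r2": 0.84}},
]


def select_and_display(candidates, rank_metric, rank_partition,
                       display_metric, display_partition, lower_is_better=True):
    """Pick the best candidate on one (metric, partition), report another."""
    pick = min if lower_is_better else max
    best = pick(candidates, key=lambda c: c[rank_partition][rank_metric])
    return best["model_name"], best[display_partition][display_metric]


name, shown = select_and_display(
    candidates,
    rank_metric="rmse", rank_partition="val",
    display_metric="r2", display_partition="test",
)
print(f"selected {name}, test R2 = {shown:.2f}")
```

Model B wins on validation RMSE, so its test R² is the value reported, even though model A scores higher on the test partition. Ranking on validation data keeps the test score an honest estimate, because the test partition played no part in the selection.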

Aggregation Options 

Option	Description
`'best'`	Show best score for each cell
`'mean'`	Show mean score
`'median'`	Show median score
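
When several predictions fall into the same cell, for example one per fold, they have to be reduced to a single number. The sketch below shows the three reductions on toy scores. Mapping 'best' to the minimum assumes a lower-is-better metric such as RMSE; for R² it would be the maximum.

```python
import statistics

# Toy RMSE values for one cell, e.g. one score per fold (illustrative only)
scores = [0.30, 0.20, 0.70]

aggregators = {
    "best": min,                  # assumes lower is better (RMSE)
    "mean": statistics.fmean,
    "median": statistics.median,
}

cell = {name: fn(scores) for name, fn in aggregators.items()}
for name, value in cell.items():
    print(f"{name}: {value:.2f}")
```

The mean is pulled up by the single poor fold, while the median and the best score are not. Comparing them is a quick way to see how stable a configuration is across folds.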

Running These Examples 

cd examples

# Run all getting started examples
./run.sh -n "U0*.py" -c user

# Run a specific example
python user/01_getting_started/U01_hello_world.py

# Enable plots
python user/01_getting_started/U02_basic_regression.py --plots --show

Next Steps 

After completing these examples:

Data Handling: Learn different input formats and multi-dataset analysis
Preprocessing: Deep dive into NIRS-specific transformations
Models: Compare multiple models and tune hyperparameters