Quickstart

Get up and running with NIRS4ALL in 5 minutes. This guide walks you through your first complete pipeline.

Prerequisites

Your First Pipeline

Step 1: Import Libraries

import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit

Step 2: Define a Pipeline

A pipeline is a list of processing steps:

pipeline = [
    MinMaxScaler(),                              # Scale features to [0, 1]
    {"y_processing": MinMaxScaler()},            # Scale targets
    ShuffleSplit(n_splits=3, test_size=0.25),    # 3 random splits, 25% held out in each
    {"model": PLSRegression(n_components=10)}    # PLS model
]
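
ShuffleSplit draws independent random splits rather than partitioning the data into folds, so the held-out sets of different splits may overlap. The sketch below uses only scikit-learn and NumPy (not nirs4all) to show the split sizes this configuration produces. The placeholder array of 200 samples and the random_state value are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Placeholder feature matrix: 200 samples x 50 features (illustrative shape)
X = np.zeros((200, 50))

splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(splitter.split(X))

for i, (train_idx, val_idx) in enumerate(splits):
    print(f"split {i}: {len(train_idx)} train, {len(val_idx)} held out")
```

Each of the three splits trains on 150 samples and holds out 50.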

Step 3: Run the Pipeline

Use nirs4all.run() to train with one function call:

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",   # Path to your data
    name="MyFirstPipeline",
    verbose=1
)

Step 4: View Results

# Check overall performance
print(f"Best RMSE: {result.best_rmse:.4f}")
print(f"Best R²: {result.best_r2:.4f}")
print(f"Number of predictions: {result.num_predictions}")

# Get top 3 models
for pred in result.top(n=3, display_metrics=['rmse', 'r2']):
    print(f"  {pred['model_name']}: RMSE={pred['rmse']:.4f}, R²={pred['r2']:.4f}")
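
For readers new to these metrics, here is a minimal pure-Python sketch of how RMSE and R² are defined. It is independent of nirs4all, the helper names rmse and r2 are chosen for this example, and the numbers are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2(y_true, y_pred):
    """Coefficient of determination: 1.0 is a perfect fit."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")  # RMSE: 0.1581
print(f"R²: {r2(y_true, y_pred):.4f}")      # R²: 0.9800
```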

Step 4b: Understand Prediction Entries

Each prediction returned by top(), like the entry in result.best, is a dictionary with detailed information:

# Get the best prediction
best = result.best

# Core identification
print(f"Model: {best['model_name']}")
print(f"Dataset: {best['dataset_name']}")
print(f"Fold: {best['fold_id']}")
print(f"Preprocessing: {best.get('preprocessings', 'none')}")

# Scores by partition (primary metric, always available)
print(f"Primary metric: {best['metric']}")
print(f"Train: {best['train_score']:.6f}")
print(f"Val: {best['val_score']:.6f}")
print(f"Test: {best['test_score']:.6f}")

# Additional metrics (present when requested via display_metrics);
# a missing metric prints as nan rather than a misleading 0
print(f"RMSE: {best.get('rmse', float('nan')):.4f}")
print(f"R²: {best.get('r2', float('nan')):.4f}")
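
One common use of the partition scores is a quick overfitting check: compare the training score with the validation score. The sketch below works on a hypothetical entry with made-up scores and assumes the primary metric is an error (such as mse) where lower is better. generalization_gap is a helper name invented for this example, not part of nirs4all.

```python
# Hypothetical prediction entry; the scores are made up for illustration
entry = {
    "model_name": "PLSRegression",
    "metric": "mse",
    "train_score": 0.010,
    "val_score": 0.025,
    "test_score": 0.027,
}

def generalization_gap(entry):
    """Validation score minus training score.

    For an error metric, a large positive gap suggests overfitting.
    """
    return entry["val_score"] - entry["train_score"]

gap = generalization_gap(entry)
print(f"{entry['model_name']}: val - train {entry['metric']} = {gap:.3f}")
```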

Key fields in each prediction entry:

Field                               Description
----------------------------------  -----------------------------------------
model_name                          Name of the model (e.g., “PLSRegression”)
model_classname                     Class name of the model
dataset_name                        Dataset name
fold_id                             Cross-validation fold index
preprocessings                      Preprocessing steps applied
metric                              Primary metric name (e.g., ‘mse’)
train_score, val_score, test_score  Scores by partition (primary metric)
rmse, r2, mse, mae                  Metrics (when using display_metrics)
n_samples, n_features               Data shape info
task_type                           ‘regression’ or ‘classification’

Step 5: Export for Production

# Export the best model for later use
result.export("exports/my_model.n4a")
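
This page does not say whether result.export() creates a missing exports/ directory. Creating it beforehand is harmless either way. A minimal sketch using only the standard library follows. ensure_parent_dir is a hypothetical helper written for this example.

```python
from pathlib import Path

def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing; return a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target

export_path = ensure_parent_dir("exports/my_model.n4a")
# result.export(str(export_path))  # `result` comes from nirs4all.run() above
```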

Complete Example

Here’s the complete code you can copy and run:

"""My first NIRS4ALL pipeline."""

import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit

# Generate synthetic NIRS data (or use your own dataset path)
dataset = nirs4all.generate.regression(
    n_samples=200,
    target_component=0,
    random_state=42
)

# Define pipeline
pipeline = [
    MinMaxScaler(),                              # Scale features
    {"y_processing": MinMaxScaler()},            # Scale targets
    ShuffleSplit(n_splits=3, test_size=0.25),    # Cross-validation
    {"model": PLSRegression(n_components=10)}    # Model
]

# Run pipeline
result = nirs4all.run(
    pipeline=pipeline,
    dataset=dataset,
    name="MyFirstPipeline",
    verbose=1
)

# View results
print("\n📊 Results:")
print(f"   Best RMSE: {result.best_rmse:.4f}")
print(f"   Best R²: {result.best_r2:.4f}")
print(f"   Total predictions: {result.num_predictions}")

# Top models with detailed metrics
print("\n🏆 Top 3 Models:")
for i, pred in enumerate(result.top(n=3, display_metrics=['rmse', 'r2']), 1):
    print(f"   {i}. {pred['model_name']}: RMSE={pred['rmse']:.4f}, R²={pred['r2']:.4f}")

# Explore the best prediction entry
print("\n📦 Best prediction details:")
best = result.best
print(f"   Model: {best['model_name']}")
print(f"   Dataset: {best['dataset_name']}")
print(f"   Fold: {best['fold_id']}")
print(f"   Metric: {best['metric']}")

# Access partition-specific scores (primary metric)
print(f"   Train: {best['train_score']:.6f}")
print(f"   Val: {best['val_score']:.6f}")
print(f"   Test: {best['test_score']:.6f}")

# Export best model
result.export("exports/my_model.n4a")
print("\n✅ Model exported to exports/my_model.n4a")

Add NIRS-Specific Preprocessing

NIRS data benefits from specialized preprocessing. Try this enhanced pipeline:

from nirs4all.operators.transforms import (
    StandardNormalVariate,
    FirstDerivative
)

pipeline = [
    MinMaxScaler(),                              # Feature scaling
    StandardNormalVariate(),                     # SNV: scatter correction
    FirstDerivative(),                           # Enhance spectral features
    {"y_processing": MinMaxScaler()},            # Target scaling
    ShuffleSplit(n_splits=3),                    # Cross-validation
    {"model": PLSRegression(n_components=10)}    # Model
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="NIRSPipeline",
    verbose=1
)
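
To make the two transforms concrete, the NumPy sketch below shows what SNV and a first derivative conventionally compute. It is an assumption-laden illustration, not nirs4all's implementation: the operators in nirs4all may differ in details such as smoothing or the standard-deviation convention. The function names and toy spectra are made up for this example.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row)."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def first_derivative(spectra):
    """Finite-difference derivative along the wavelength axis."""
    return np.gradient(spectra, axis=1)

# Toy spectra: 3 samples x 100 wavelengths, the same band with different
# baseline offsets and multiplicative scatter (made-up values)
wavelengths = np.linspace(0.0, 1.0, 100)
band = np.exp(-((wavelengths - 0.5) ** 2) / 0.01)
spectra = np.stack([1.0 * band + 0.2, 1.5 * band + 0.5, 0.8 * band + 0.1])

corrected = snv(spectra)
derived = first_derivative(corrected)
print(corrected.shape, derived.shape)  # (3, 100) (3, 100)
```

Because the three toy spectra differ only by an offset and a positive scale factor, SNV maps them onto the same curve, which is the scatter correction the pipeline comment refers to.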

Using Your Own Data

Replace the sample data with your own:

# From a CSV file
result = nirs4all.run(pipeline, dataset="path/to/your/data.csv")

# From a folder
result = nirs4all.run(pipeline, dataset="path/to/data_folder/")

# With explicit configuration
from nirs4all.data import DatasetConfigs

dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
})
result = nirs4all.run(pipeline, dataset=dataset)
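
When spectra and targets live in separate files, it can help to confirm that both describe the same number of samples before running a pipeline. The sketch below is a generic standard-library check, not a nirs4all feature. It assumes one sample per row and one header row in each file, and the file layout nirs4all expects is not described on this page. count_data_rows and check_row_counts are hypothetical helpers.

```python
import csv

def count_data_rows(path, has_header=True):
    """Count non-empty rows in a CSV file, optionally skipping one header row."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    return len(rows) - 1 if has_header and rows else len(rows)

def check_row_counts(x_path, y_path, has_header=True):
    """Raise ValueError if the two files hold different numbers of samples."""
    n_x = count_data_rows(x_path, has_header)
    n_y = count_data_rows(y_path, has_header)
    if n_x != n_y:
        raise ValueError(f"{x_path} has {n_x} rows but {y_path} has {n_y}")
    return n_x

# Usage (paths from the configuration above):
# check_row_counts("spectra.csv", "targets.csv")
```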

No Data? Generate Synthetic NIRS Spectra

Get started immediately with realistic synthetic data:

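import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit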

# Generate synthetic NIRS data with known ground truth
dataset = nirs4all.generate.regression(
    n_samples=500,
    components=["water", "protein", "lipid"],
    complexity="realistic",
    random_state=42
)

# Use directly in pipelines
result = nirs4all.run(
    pipeline=[
        MinMaxScaler(),
        ShuffleSplit(n_splits=3),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=dataset
)

print(f"RMSE: {result.best_rmse:.4f}")

Synthetic data is perfect for:

  • Learning and experimentation

  • Testing preprocessing pipelines

  • Prototyping before real data arrives

  • Reproducible unit tests

See Synthetic Data Generation for full documentation.
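
Synthetic data suits reproducible tests because a fixed seed yields the same data on every run. The NumPy sketch below illustrates that idea with a toy generator. It is not how nirs4all.generate builds its spectra. toy_spectra, its band positions, and its noise level are assumptions made up for this example.

```python
import numpy as np

def toy_spectra(n_samples, n_wavelengths=100, random_state=None):
    """Toy spectra: random concentrations times two Gaussian bands plus noise.

    Illustrative only; this is not how nirs4all.generate builds its data.
    """
    rng = np.random.default_rng(random_state)
    axis = np.linspace(0.0, 1.0, n_wavelengths)
    bands = np.stack([
        np.exp(-((axis - 0.3) ** 2) / 0.005),
        np.exp(-((axis - 0.7) ** 2) / 0.02),
    ])
    concentrations = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    noise = rng.normal(0.0, 0.01, size=(n_samples, n_wavelengths))
    X = concentrations @ bands + noise
    y = concentrations[:, 0]  # target: concentration of the first component
    return X, y

X1, y1 = toy_spectra(200, random_state=42)
X2, y2 = toy_spectra(200, random_state=42)
print(X1.shape, y1.shape)      # (200, 100) (200,)
print(np.array_equal(X1, X2))  # True: same seed, same data
```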

Compare Multiple Models

Run and compare different models in one pipeline:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=3),

    # Multiple models - each is evaluated
    {"model": PLSRegression(n_components=5)},
    {"model": PLSRegression(n_components=10)},
    {"model": PLSRegression(n_components=15)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=100)},
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="MultiModel",
    verbose=1
)

# See which model performed best
for pred in result.top(n=5, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")
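
With several models and several splits, each model contributes more than one prediction entry, so averaging per model can give a steadier comparison than looking at single entries. The sketch below groups hypothetical entries by model_name using only the standard library. The RMSE values are made up, and how nirs4all names several instances of the same model class is not covered here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical entries shaped like those returned by result.top();
# the RMSE values are made up for illustration
entries = [
    {"model_name": "PLSRegression", "fold_id": 0, "rmse": 0.20},
    {"model_name": "PLSRegression", "fold_id": 1, "rmse": 0.24},
    {"model_name": "Ridge", "fold_id": 0, "rmse": 0.30},
    {"model_name": "Ridge", "fold_id": 1, "rmse": 0.26},
]

by_model = defaultdict(list)
for entry in entries:
    by_model[entry["model_name"]].append(entry["rmse"])

# Mean RMSE per model, best (lowest) first
ranking = sorted(
    ((name, mean(values)) for name, values in by_model.items()),
    key=lambda item: item[1],
)
for name, avg in ranking:
    print(f"{name}: mean RMSE={avg:.4f}")
```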

Run Multiple Pipelines at Once

Pass a list of pipelines to execute them all independently:

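# StandardScaler is new here; RandomForestRegressor, Ridge, and
# FirstDerivative are imported in the sections above
from sklearn.preprocessing import StandardScaler

# Define different pipeline strategies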
pipeline_pls = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=3),
    {"model": PLSRegression(n_components=10)}
]

pipeline_rf = [
    StandardScaler(),
    ShuffleSplit(n_splits=3),
    {"model": RandomForestRegressor(n_estimators=100)}
]

pipeline_ridge = [
    MinMaxScaler(),
    FirstDerivative(),
    ShuffleSplit(n_splits=3),
    {"model": Ridge(alpha=1.0)}
]

# Run all three pipelines with one call
result = nirs4all.run(
    pipeline=[pipeline_pls, pipeline_rf, pipeline_ridge],  # List of pipelines
    dataset="sample_data/regression",
    verbose=1
)

print(f"Total configurations tested: {result.num_predictions}")
print(f"Best RMSE: {result.best_rmse:.4f}")

Run on Multiple Datasets

Test the same pipeline(s) across different datasets:

# Cartesian product: each pipeline × each dataset
result = nirs4all.run(
    pipeline=[pipeline_pls, pipeline_rf],   # 2 pipelines
    dataset=["data/wheat", "data/corn"],    # 2 datasets
    verbose=1
)
# Runs 4 combinations: PLS×wheat, PLS×corn, RF×wheat, RF×corn

print(f"Tested {result.num_predictions} configurations")
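
The pairing described above is an ordinary Cartesian product. The standard-library sketch below enumerates it with stand-in names in place of the real pipeline objects. The order in which nirs4all actually executes the combinations is not specified on this page.

```python
from itertools import product

pipelines = ["pipeline_pls", "pipeline_rf"]  # stand-in names
datasets = ["data/wheat", "data/corn"]

combinations = list(product(pipelines, datasets))
for pipeline_name, dataset_path in combinations:
    print(f"{pipeline_name} x {dataset_path}")

print(len(combinations))  # 4
```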

Visualize Results

Create publication-quality visualizations:

import matplotlib.pyplot as plt
from nirs4all.visualization.predictions import PredictionAnalyzer

analyzer = PredictionAnalyzer(result.predictions)

# Predicted vs actual plot for top models
fig1 = analyzer.plot_top_k(k=3, rank_metric='rmse')

# Compare models with candlestick chart
fig2 = analyzer.plot_candlestick(variable="model_name")

# Show all plots
plt.show()

What’s Next?

  • 📚 Core Concepts: Understand pipelines, datasets, and execution flow.

  • 📖 User Guide: Learn preprocessing, stacking, and deployment.

  • 📝 Examples: 50+ working examples organized by topic.

  • 📋 Pipeline Syntax: Complete pipeline syntax reference (see Writing a Pipeline in nirs4all).

Key Takeaways

  1. Pipelines are lists of processing steps

  2. One function (nirs4all.run()) handles everything

  3. Results are accessible via result.best_rmse, result.best_r2, result.top(), etc.

  4. Prediction entries are dicts with model_name, dataset_name, fold_id, scores, and more

  5. Detailed scores are available via pred['scores'][partition][metric], where partition is 'train', 'val', or 'test' and metric is, for example, 'rmse' or 'r2'

  6. Export models with result.export() for deployment

  7. NIRS preprocessing (SNV, derivatives) often improves models built on spectral data

See Also