# Core Concepts

This page explains the fundamental concepts behind NIRS4ALL. Understanding these will help you build effective pipelines.

## Overview

NIRS4ALL is built around three core concepts:

1. **SpectroDataset** - A container for spectral data, targets, and metadata
2. **Pipeline** - A sequence of processing steps
3. **Controllers** - The execution engine that runs each step

```text
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Dataset    │ -> │   Pipeline   │ -> │   Results    │
│  (your data) │    │   (steps)    │    │ (predictions)│
└──────────────┘    └──────────────┘    └──────────────┘
```

## SpectroDataset

`SpectroDataset` is the core data container. It holds:

| Component | Description |
|-----------|-------------|
| X (features) | Spectral data matrix (n_samples × n_wavelengths) |
| y (targets) | Target values for prediction (n_samples,) |
| metadata | Sample information (IDs, groups, dates, etc.) |
| fold indices | Cross-validation assignments |
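To make the shapes in the table concrete, here is a minimal plain-Python sketch. `ToyDataset` is a hypothetical stand-in written for this page, not the `SpectroDataset` API; it only checks that every component lines up on the sample axis.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical illustration only; not the SpectroDataset API."""
    X: list                                        # n_samples rows of n_wavelengths values
    y: list                                        # n_samples target values
    metadata: list = field(default_factory=list)   # one dict per sample
    folds: list = field(default_factory=list)      # (train_idx, val_idx) pairs

    def __post_init__(self):
        n_samples = len(self.X)
        if len(self.y) != n_samples:
            raise ValueError("X and y must have the same number of samples")
        if self.metadata and len(self.metadata) != n_samples:
            raise ValueError("metadata needs one entry per sample")
        if len({len(row) for row in self.X}) > 1:
            raise ValueError("every spectrum needs the same number of wavelengths")

    @property
    def shape(self):
        # (n_samples, n_wavelengths)
        return (len(self.X), len(self.X[0]) if self.X else 0)


toy = ToyDataset(
    X=[[0.12, 0.15, 0.19, 0.22],
       [0.10, 0.14, 0.18, 0.25],
       [0.11, 0.13, 0.20, 0.21]],
    y=[12.5, 14.1, 13.2],
    metadata=[{"id": "s1"}, {"id": "s2"}, {"id": "s3"}],
)
print(toy.shape)  # 3 samples, 4 wavelengths
```

The values are made up for illustration; the point is that X, y, and metadata share one sample axis.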

### Creating a Dataset

Most often, NIRS4ALL creates datasets automatically from your files:

```python
from nirs4all.data import DatasetConfigs

# From a folder (auto-detects files)
dataset = DatasetConfigs("path/to/data/")

# From explicit files
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_m": "metadata.csv"  # Optional metadata
})
```

You can also generate synthetic data for testing and prototyping:

```python
import nirs4all

# Generate realistic synthetic NIRS spectra
dataset = nirs4all.generate.regression(
    n_samples=500,
    components=["water", "protein", "lipid"],
    complexity="realistic"
)
```

See Synthetic Data Generation for details.

### Partitions

Data is organized into partitions:

| Partition | Purpose |
|-----------|---------|
| **train** | Used for model training and cross-validation |
| **test** | Held-out data for final evaluation |
| **val** | Validation set (often created from train via CV) |

:::{note}
During cross-validation, the train partition is automatically split into train/val folds. The test partition (if provided) remains untouched for final evaluation.
:::
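The note above can be sketched in plain Python. The `shuffle_split` helper below is a hypothetical, simplified stand-in for a shuffle-based splitter (it is not the scikit-learn or NIRS4ALL implementation); it shows that folds are drawn only from the train partition, so test indices never appear in them.

```python
import random


def shuffle_split(indices, n_splits=5, val_fraction=0.25, seed=0):
    """Yield (train_idx, val_idx) pairs drawn from `indices` only."""
    rng = random.Random(seed)
    n_val = max(1, int(len(indices) * val_fraction))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First n_val shuffled samples validate, the rest train
        yield sorted(shuffled[n_val:]), sorted(shuffled[:n_val])


train_partition = list(range(0, 8))   # samples 0..7
test_partition = list(range(8, 10))   # samples 8..9, never split

folds = list(shuffle_split(train_partition, n_splits=3))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold_id}: train={train_idx} val={val_idx}")
```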

## Pipeline

A **pipeline** is a list of processing steps applied sequentially:

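```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler
# Import path assumed; adjust to your nirs4all version
from nirs4all.operators.transforms import StandardNormalVariate

pipeline = [
    MinMaxScaler(),                              # Step 1: Scale features
    StandardNormalVariate(),                     # Step 2: SNV preprocessing
    {"y_processing": MinMaxScaler()},            # Step 3: Scale targets
    ShuffleSplit(n_splits=5),                    # Step 4: Cross-validation
    {"model": PLSRegression(n_components=10)}    # Step 5: Model
]
```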

### Step Types

| Step Type | Syntax | Purpose |
|-----------|--------|---------|
| Transformer | `MinMaxScaler()` | Modify features (X) |
| Y Processing | `{"y_processing": ...}` | Modify targets (y) |
| Splitter | `ShuffleSplit(n_splits=5)` | Define cross-validation |
| Model | `{"model": PLSRegression()}` | Train predictive model |
| Branch | `{"branch": [...]}` | Parallel processing paths |
| Merge | `{"merge": "features"}` | Combine branch outputs |
| Augmentation | `{"feature_augmentation": ...}` | Generate preprocessing variants |

### Execution Flow

```text
Input Data → [Preprocessing] → [CV Split] → [Training] → Predictions
     │              │               │            │
     ▼              ▼               ▼            ▼
SpectroDataset  Transformers    Splitter     Models
```

1. **Data Loading**: Your files are loaded into a `SpectroDataset`
2. **Preprocessing**: Transformers modify X (and optionally y)
3. **Cross-Validation**: Splitter defines train/val folds
4. **Training**: Each model is trained on each fold
5. **Prediction**: Out-of-fold predictions are collected
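The last three steps can be illustrated with a small plain-Python sketch. Everything here is hypothetical: `k_fold` is a simplified splitter and `MeanModel` a trivial model that predicts the training mean. It is not how NIRS4ALL trains models; it only shows how out-of-fold predictions are assembled, with each sample predicted by a model that never saw it.

```python
def k_fold(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each sample validates exactly once."""
    indices = list(range(n_samples))
    for k in range(n_splits):
        val_idx = indices[k::n_splits]
        train_idx = [i for i in indices if i not in val_idx]
        yield train_idx, val_idx


class MeanModel:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

oof_pred = [None] * len(y)
for fold_id, (train_idx, val_idx) in enumerate(k_fold(len(y), n_splits=3)):
    # Train on the fold's training samples only
    model = MeanModel().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    # Predict the held-out samples and store them at their original positions
    for i, pred in zip(val_idx, model.predict([X[i] for i in val_idx])):
        oof_pred[i] = pred

print(oof_pred)  # [4.0, 3.5, 3.0, 4.0, 3.5, 3.0]
```

With a shuffle-based splitter such as `ShuffleSplit`, a sample can land in several validation sets or in none, so real out-of-fold bookkeeping is per fold rather than one slot per sample.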

## Controllers

**Controllers** are the execution engine. They interpret each pipeline step and perform the appropriate action.

| Controller | Handles |
|------------|---------|
| `TransformController` | sklearn `TransformerMixin` (scalers, preprocessors) |
| `YProcessingController` | `{"y_processing": ...}` steps |
| `SplitterController` | Cross-validation splitters |
| `ModelController` | `{"model": ...}` steps |
| `BranchController` | `{"branch": ...}` parallel paths |
| `MergeController` | `{"merge": ...}` combining outputs |

:::{tip}
You rarely interact with controllers directly. They work behind the scenes to execute your pipeline.
:::
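One way to picture the controller idea is a registry in which each controller declares which steps it handles. The sketch below is an assumption made for illustration: the class names, the `matches`/`execute` methods, and `REGISTRY` are hypothetical and do not describe the actual NIRS4ALL internals.

```python
class KeywordController:
    """Hypothetical controller for dict steps such as {"model": ...}."""

    def __init__(self, keyword):
        self.keyword = keyword

    def matches(self, step):
        return isinstance(step, dict) and self.keyword in step

    def execute(self, step, log):
        log.append(f"{self.keyword}:{type(step[self.keyword]).__name__}")


class TransformLikeController:
    """Hypothetical controller for bare objects exposing fit() and transform()."""

    def matches(self, step):
        return hasattr(step, "fit") and hasattr(step, "transform")

    def execute(self, step, log):
        log.append(f"transform:{type(step).__name__}")


REGISTRY = [
    KeywordController("y_processing"),
    KeywordController("model"),
    TransformLikeController(),
]


def run_steps(steps):
    """Dispatch every step to the first controller that claims it."""
    log = []
    for step in steps:
        controller = next((c for c in REGISTRY if c.matches(step)), None)
        if controller is None:
            raise TypeError(f"no controller handles step {step!r}")
        controller.execute(step, log)
    return log


class Scaler:  # toy transformer
    def fit(self, X):
        return self

    def transform(self, X):
        return X


class ToyModel:  # toy model placeholder
    pass


log = run_steps([Scaler(), {"y_processing": Scaler()}, {"model": ToyModel()}])
print(log)  # ['transform:Scaler', 'y_processing:Scaler', 'model:ToyModel']
```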

## Predictions and Results

When you run a pipeline, you get a `RunResult` object:

```python
import nirs4all

result = nirs4all.run(pipeline, dataset)

# Access results
result.best_score        # Best model's primary score
result.best              # Best prediction entry (dict)
result.num_predictions   # Total prediction entries
result.predictions       # Full PredictionResultsList

# Get top performers
for pred in result.top(n=5, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")
```

### Prediction Entry Structure

Each prediction entry contains:

| Field | Description |
|-------|-------------|
| `model_name` | Name of the model |
| `dataset_name` | Name of the dataset |
| `fold_id` | Cross-validation fold index |
| `y_true` | True target values |
| `y_pred` | Predicted values |
| `rmse`, `r2`, etc. | Computed metrics |
| `preprocessings` | Applied preprocessing chain |
| `partition` | Data partition (train/val/test) |
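As a rough illustration of how the metric fields relate to `y_true` and `y_pred`, the sketch below builds a toy entry by hand and computes RMSE and R² with the standard formulas. The entry values are invented for the example, and the helper functions are not part of NIRS4ALL.

```python
import math

# Toy entry with made-up values; real entries are produced by the library
entry = {
    "model_name": "PLSRegression",
    "fold_id": 0,
    "partition": "val",
    "y_true": [1.0, 2.0, 3.0, 4.0],
    "y_pred": [1.5, 2.0, 2.5, 4.0],
}


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


entry["rmse"] = rmse(entry["y_true"], entry["y_pred"])
entry["r2"] = r2(entry["y_true"], entry["y_pred"])
print(f"{entry['model_name']}: RMSE={entry['rmse']:.4f}, R2={entry['r2']:.2f}")
# PLSRegression: RMSE=0.3536, R2=0.90
```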

## Key Terminology

| Term | Definition |
|------|------------|
| Spectral data | Features from spectroscopy (reflectance, absorbance, etc.) |
| Wavelength | Individual feature/column in spectral data |
| Fold | One train/validation split in cross-validation |
| OOF (Out-of-Fold) | Predictions made on validation data during CV |
| Operator | A preprocessing or transformation class |
| Transformer | sklearn-compatible operator with `fit()` and `transform()` |
| Pipeline variant | One specific configuration when using generators |
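To show what "operator with `fit()` and `transform()`" means in practice, here is a minimal plain-Python sketch of a standard normal variate (SNV) operator, which centers and scales each spectrum by its own mean and standard deviation. `ToySNV` is hypothetical and not the library's `StandardNormalVariate`; using the sample standard deviation (ddof=1) is an assumption, since implementations differ on this point.

```python
import statistics


class ToySNV:
    """Toy SNV operator following the fit/transform convention."""

    def fit(self, X, y=None):
        # SNV is stateless: each spectrum uses its own statistics
        return self

    def transform(self, X):
        out = []
        for spectrum in X:
            mean = statistics.fmean(spectrum)
            std = statistics.stdev(spectrum)  # sample std (ddof=1), an assumption
            # A perfectly flat spectrum has std == 0 and would need special handling
            out.append([(value - mean) / std for value in spectrum])
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
X_snv = ToySNV().fit_transform(X)
print(X_snv)  # both rows become [-1.0, 0.0, 1.0]
```

Both spectra map to the same values because SNV removes per-sample offset and scale, which is why it is commonly used to reduce scatter effects.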

## The `nirs4all.run()` Function

The simplest way to run a pipeline:

```python
result = nirs4all.run(
    pipeline=pipeline,           # List of steps (or list of pipelines)
    dataset=dataset,             # See below for supported formats
    name="MyPipeline",           # Pipeline name
    verbose=1,                   # 0=silent, 1=progress, 2=debug
    save_artifacts=True,         # Save models and results
    save_charts=True,            # Save generated plots
    plots_visible=False          # Show plots interactively
)
```

### Supported Pipeline Formats

The `pipeline` parameter accepts:

| Format | Example | Description |
|--------|---------|-------------|
| List of steps | `[MinMaxScaler(), PLSRegression()]` | Single pipeline |
| Dict config | `{"pipeline": [...]}` | Dict with steps |
| Path to config | `"config.yaml"` or `"config.json"` | Load from file |
| `PipelineConfigs` | `PipelineConfigs(steps)` | Direct config object |
| List of pipelines | `[pipeline1, pipeline2, ...]` | Run each independently |

### Supported Dataset Formats

The `dataset` parameter accepts:

| Format | Example | Description |
|--------|---------|-------------|
| Path to folder | `"sample_data/regression"` | Auto-load from folder |
| Numpy arrays | `(X, y)` or `X` alone | Direct arrays |
| Dict with arrays | `{"X": X, "y": y, "metadata": meta}` | Dict with data |
| `SpectroDataset` | Direct dataset instance | Pre-built dataset |
| `DatasetConfigs` | Full configuration object | Complete config |
| List of datasets | `[dataset1, dataset2, ...]` | Run on each dataset |

### Batch Execution: Pipelines × Datasets

When you provide multiple pipelines and/or multiple datasets, `nirs4all.run()` executes their Cartesian product, running every pipeline on every dataset:

```python
# 2 pipelines × 2 datasets = 4 runs
result = nirs4all.run(
    pipeline=[pipeline_a, pipeline_b],
    dataset=["data/wheat", "data/corn"],
    verbose=1
)
# Runs: pipeline_a×wheat, pipeline_a×corn, pipeline_b×wheat, pipeline_b×corn
```

All results are collected into a single RunResult for unified analysis.
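The run order listed above is what a Cartesian product yields. A small standalone sketch with `itertools.product` reproduces it; the pipeline names and dataset paths are placeholders, and no NIRS4ALL code is called.

```python
from itertools import product

# Placeholder pipelines (step lists elided) and dataset paths
pipelines = {"pipeline_a": ["..."], "pipeline_b": ["..."]}
datasets = ["data/wheat", "data/corn"]

# Every pipeline is paired with every dataset, pipelines varying slowest
runs = list(product(pipelines, datasets))
for pipeline_name, dataset_path in runs:
    print(f"{pipeline_name} x {dataset_path}")

print(len(runs))  # 4
```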

For more control, use `PipelineRunner` directly:

```python
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs

runner = PipelineRunner(
    verbose=1,
    save_artifacts=True,
    save_charts=True
)

predictions, per_dataset = runner.run(
    PipelineConfigs(pipeline, "MyPipeline"),
    DatasetConfigs("path/to/data")
)
```

## Architecture Diagram

```text
┌────────────────────────────────────────────────────────────────────────────────┐
│                                 nirs4all.run()                                 │
├────────────────────────────────────────────────────────────────────────────────┤
│  ┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐  │
│  │   PipelineRunner   │ --> │PipelineOrchestrator│ --> │    Controllers     │  │
│  │                    │     │                    │     │     (registry)     │  │
│  └────────────────────┘     └────────────────────┘     └────────────────────┘  │
│            │                          │                          │             │
│            ▼                          ▼                          ▼             │
│  ┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐  │
│  │  PipelineConfigs   │     │  ExecutionContext  │     │   SpectroDataset   │  │
│  │     (pipeline)     │     │      (state)       │     │       (data)       │  │
│  └────────────────────┘     └────────────────────┘     └────────────────────┘  │
├────────────────────────────────────────────────────────────────────────────────┤
│                                   RunResult                                    │
│  ┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐  │
│  │    Predictions     │     │      Metrics       │     │     Artifacts      │  │
│  │        List        │     │                    │     │       (.n4a)       │  │
│  └────────────────────┘     └────────────────────┘     └────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
```

## Next Steps

- 📖 **Loading Data**: Learn about `DatasetConfigs` and supported formats. See "Loading Data".
- 🔧 **Preprocessing**: NIRS-specific preprocessing techniques. See "Preprocessing Overview".
- 📋 **Pipeline Syntax**: Complete syntax reference. See "Writing a Pipeline in nirs4all".
- 📝 **Examples**: Working examples organized by topic. See "Examples".