Core Concepts

This page explains the fundamental concepts behind NIRS4ALL. Understanding these will help you build effective pipelines.

Overview

NIRS4ALL is built around three core concepts:

  1. SpectroDataset - A container for spectral data, targets, and metadata

  2. Pipeline - A sequence of processing steps

  3. Controllers - The execution engine that runs each step

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Dataset    │ -> │   Pipeline   │ -> │   Results    │
│  (your data) │    │   (steps)    │    │ (predictions)│
└──────────────┘    └──────────────┘    └──────────────┘

SpectroDataset

SpectroDataset is the core data container. It holds:

| Component | Description |
|-----------|-------------|
| X (features) | Spectral data matrix (n_samples × n_wavelengths) |
| y (targets) | Target values for prediction (n_samples,) |
| metadata | Sample information (IDs, groups, dates, etc.) |
| fold indices | Cross-validation assignments |
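
To make the four components concrete, here is a minimal sketch of how they line up row by row. `ToyDataset` is a hypothetical stand-in written for this page, not the `SpectroDataset` API; it only illustrates that X, y, and metadata describe the same samples and that fold indices refer to rows.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for illustration only; not the SpectroDataset API."""
    X: list          # n_samples rows, each holding n_wavelengths values
    y: list          # one target value per sample
    metadata: list   # one dict of sample information per sample
    folds: list = field(default_factory=list)  # (train_idx, val_idx) pairs

    def shape(self):
        """Return (n_samples, n_wavelengths) after checking row alignment."""
        n_samples = len(self.X)
        if len(self.y) != n_samples or len(self.metadata) != n_samples:
            raise ValueError("X, y and metadata must describe the same samples")
        n_wavelengths = len(self.X[0]) if n_samples else 0
        if any(len(row) != n_wavelengths for row in self.X):
            raise ValueError("every spectrum needs the same number of wavelengths")
        return n_samples, n_wavelengths


toy = ToyDataset(
    X=[[0.12, 0.15, 0.11], [0.22, 0.25, 0.21], [0.32, 0.35, 0.31], [0.42, 0.45, 0.41]],
    y=[10.5, 12.1, 13.8, 15.2],
    metadata=[{"id": f"S{i}", "group": "A" if i < 2 else "B"} for i in range(4)],
    folds=[([0, 1], [2, 3]), ([2, 3], [0, 1])],
)
print(toy.shape())  # (4, 3)
```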

Creating a Dataset

Most often, NIRS4ALL creates datasets automatically from your files:

from nirs4all.data import DatasetConfigs

# From a folder (auto-detects files)
dataset = DatasetConfigs("path/to/data/")

# From explicit files
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_m": "metadata.csv"  # Optional metadata
})

You can also generate synthetic data for testing and prototyping:

import nirs4all

# Generate realistic synthetic NIRS spectra
dataset = nirs4all.generate.regression(
    n_samples=500,
    components=["water", "protein", "lipid"],
    complexity="realistic"
)

See Synthetic Data Generation for details.


### Partitions

Data is organized into partitions:

| Partition | Purpose |
|-----------|---------|
| **train** | Used for model training and cross-validation |
| **test** | Held-out data for final evaluation |
| **val** | Validation set (often created from train via CV) |

:::{note}
During cross-validation, the train partition is automatically split into train/val folds. The test partition (if provided) remains untouched for final evaluation.
:::
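
The note above can be sketched with plain index arithmetic. This is an illustrative contiguous k-fold split written for this page (the helper name `kfold_indices` is hypothetical); the splitter you put in your pipeline decides the actual assignment, and a splitter such as `ShuffleSplit` draws random subsets instead.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs using contiguous, near-equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


train_partition = ["t0", "t1", "t2", "t3", "t4", "t5"]
test_partition = ["h0", "h1"]  # never indexed below: it stays held out

folds = list(kfold_indices(len(train_partition), n_splits=3))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    train_ids = [train_partition[i] for i in train_idx]
    val_ids = [train_partition[i] for i in val_idx]
    print(f"fold {fold_id}: train={train_ids} val={val_ids}")
```

Each sample of the train partition appears in exactly one validation fold, while the test partition is not touched by the split.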

## Pipeline

A **pipeline** is a list of processing steps applied sequentially:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

# Import path assumed; check it against your installed nirs4all version
from nirs4all.operators.transforms import StandardNormalVariate

pipeline = [
    MinMaxScaler(),                              # Step 1: Scale features
    StandardNormalVariate(),                     # Step 2: SNV preprocessing
    {"y_processing": MinMaxScaler()},            # Step 3: Scale targets
    ShuffleSplit(n_splits=5),                    # Step 4: Cross-validation
    {"model": PLSRegression(n_components=10)}    # Step 5: Model
]
```

Step Types

| Step Type | Syntax | Purpose |
|-----------|--------|---------|
| Transformer | `MinMaxScaler()` | Modify features (X) |
| Y Processing | `{"y_processing": ...}` | Modify targets (y) |
| Splitter | `ShuffleSplit(n_splits=5)` | Define cross-validation |
| Model | `{"model": PLSRegression()}` | Train predictive model |
| Branch | `{"branch": [...]}` | Parallel processing paths |
| Merge | `{"merge": "features"}` | Combine branch outputs |
| Augmentation | `{"feature_augmentation": ...}` | Generate preprocessing variants |

Execution Flow

Input Data → [Preprocessing] → [CV Split] → [Training] → Predictions
     │              │               │            │
     ▼              ▼               ▼            ▼
SpectroDataset  Transformers    Splitter     Models

  1. Data Loading: Your files are loaded into a SpectroDataset

  2. Preprocessing: Transformers modify X (and optionally y)

  3. Cross-Validation: Splitter defines train/val folds

  4. Training: Each model is trained on each fold

  5. Prediction: Out-of-fold predictions are collected
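
Steps 3 to 5 can be sketched in a few lines. The example below is a toy under stated assumptions: `MeanModel` is a hypothetical model that always predicts the mean of its training targets, and the folds are written by hand. It shows only how each sample receives a prediction from a model that never saw it during training.

```python
from statistics import mean


class MeanModel:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [[0.1], [0.2], [0.3], [0.4]]
y = [1.0, 2.0, 3.0, 4.0]
folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]  # (train_idx, val_idx) pairs

oof = [None] * len(y)  # one out-of-fold prediction slot per sample
for train_idx, val_idx in folds:
    # Train on the fold's training rows only
    model = MeanModel().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    # Predict the held-out rows and store them at their original positions
    preds = model.predict([X[i] for i in val_idx])
    for i, p in zip(val_idx, preds):
        oof[i] = p

print(oof)  # [3.5, 3.5, 1.5, 1.5]
```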

Controllers

Controllers are the execution engine. They interpret each pipeline step and perform the appropriate action.

| Controller | Handles |
|------------|---------|
| TransformController | sklearn `TransformerMixin` (scalers, preprocessors) |
| YProcessingController | `{"y_processing": ...}` steps |
| SplitterController | Cross-validation splitters |
| ModelController | `{"model": ...}` steps |
| BranchController | `{"branch": ...}` parallel paths |
| MergeController | `{"merge": ...}` combining outputs |

:::{tip}
You rarely interact with controllers directly. They work behind the scenes to execute your pipeline.
:::
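
As a rough mental model, step dispatch can be pictured as a registry of matchers tried in order. The sketch below is hypothetical: `REGISTRY`, `matches_keyword`, and `select_controller` are names invented for this page, and the real matching rules inside NIRS4ALL may differ. Only the controller names and the step syntaxes come from the table above.

```python
class ToyScaler:
    """Toy transformer-like object exposing fit() and transform()."""

    def fit(self, X):
        return self

    def transform(self, X):
        return X


class ToySplitter:
    """Toy splitter-like object exposing split()."""

    def split(self, X):
        return iter(())


def matches_keyword(key):
    """Build a matcher for dict steps such as {"model": ...}."""
    return lambda step: isinstance(step, dict) and key in step


# Hypothetical registry: the first matcher that accepts a step wins
REGISTRY = [
    ("YProcessingController", matches_keyword("y_processing")),
    ("ModelController", matches_keyword("model")),
    ("BranchController", matches_keyword("branch")),
    ("MergeController", matches_keyword("merge")),
    ("SplitterController", lambda step: hasattr(step, "split")),
    ("TransformController",
     lambda step: hasattr(step, "fit") and hasattr(step, "transform")),
]


def select_controller(step):
    """Return the name of the first controller whose matcher accepts the step."""
    for name, matches in REGISTRY:
        if matches(step):
            return name
    raise TypeError(f"no controller handles step of type {type(step).__name__}")


steps = [ToyScaler(), {"y_processing": ToyScaler()}, ToySplitter(), {"model": object()}]
for step in steps:
    print(select_controller(step))
```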

Predictions and Results

When you run a pipeline, you get a RunResult object:

result = nirs4all.run(pipeline, dataset)

# Access results
result.best_score        # Best model's primary score
result.best              # Best prediction entry (dict)
result.num_predictions   # Total prediction entries
result.predictions       # Full PredictionResultsList

# Get top performers
for pred in result.top(n=5, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")

Prediction Entry Structure

Each prediction entry contains:

| Field | Description |
|-------|-------------|
| `model_name` | Name of the model |
| `dataset_name` | Name of the dataset |
| `fold_id` | Cross-validation fold index |
| `y_true` | True target values |
| `y_pred` | Predicted values |
| `rmse`, `r2`, etc. | Computed metrics |
| `preprocessings` | Applied preprocessing chain |
| `partition` | Data partition (train/val/test) |
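
The metric fields are functions of `y_true` and `y_pred`. The sketch below recomputes RMSE and R² on two hand-written entries and ranks them, which is roughly what a "top performers" query does. The entries and their values are made up for illustration, and the helpers are plain textbook formulas rather than NIRS4ALL functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative entries using a subset of the fields listed above
entries = [
    {"model_name": "model_a", "fold_id": 0, "partition": "val",
     "y_true": [1.0, 2.0, 3.0], "y_pred": [1.0, 2.0, 4.0]},
    {"model_name": "model_b", "fold_id": 0, "partition": "val",
     "y_true": [1.0, 2.0, 3.0], "y_pred": [1.0, 2.0, 3.0]},
]
for entry in entries:
    entry["rmse"] = rmse(entry["y_true"], entry["y_pred"])
    entry["r2"] = r2(entry["y_true"], entry["y_pred"])

# Lower RMSE is better, so sort ascending
ranked = sorted(entries, key=lambda entry: entry["rmse"])
for entry in ranked:
    print(f"{entry['model_name']}: RMSE={entry['rmse']:.4f} R2={entry['r2']:.4f}")
```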

Key Terminology

| Term | Definition |
|------|------------|
| Spectral data | Features from spectroscopy (reflectance, absorbance, etc.) |
| Wavelength | Individual feature/column in spectral data |
| Fold | One train/validation split in cross-validation |
| OOF (Out-of-Fold) | Predictions made on validation data during CV |
| Operator | A preprocessing or transformation class |
| Transformer | sklearn-compatible operator with `fit()` and `transform()` |
| Pipeline variant | One specific configuration when using generators |
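
To illustrate the Transformer entry, here is a minimal sketch of an operator with `fit()` and `transform()`. `RowStandardizer` is a hypothetical class that centers and scales each spectrum by its own mean and standard deviation, in the spirit of SNV. It is not the NIRS4ALL implementation, and it uses the population standard deviation as an arbitrary choice.

```python
from statistics import mean, pstdev


class RowStandardizer:
    """Toy SNV-style transformer: standardize each spectrum (row) on its own."""

    def fit(self, X, y=None):
        # Row-wise scaling learns nothing from the training data,
        # so fit() only returns self to respect the fit/transform contract.
        return self

    def transform(self, X):
        out = []
        for row in X:
            mu = mean(row)
            sigma = pstdev(row)  # population standard deviation (assumption)
            if sigma == 0:
                raise ValueError("a constant spectrum cannot be standardized")
            out.append([(value - mu) / sigma for value in row])
        return out


spectra = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
scaled = RowStandardizer().fit(spectra).transform(spectra)
for row in scaled:
    print([round(value, 4) for value in row])
```

After the transform, every row has zero mean and unit standard deviation, whatever its original scale.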

The nirs4all.run() Function

The simplest way to run a pipeline:

result = nirs4all.run(
    pipeline=pipeline,           # List of steps (or list of pipelines)
    dataset=dataset,             # See below for supported formats
    name="MyPipeline",           # Pipeline name
    verbose=1,                   # 0=silent, 1=progress, 2=debug
    save_artifacts=True,         # Save models and results
    save_charts=True,            # Save generated plots
    plots_visible=False          # Show plots interactively
)

Supported Pipeline Formats

The pipeline parameter accepts:

| Format | Example | Description |
|--------|---------|-------------|
| List of steps | `[MinMaxScaler(), PLSRegression()]` | Single pipeline |
| Dict config | `{"pipeline": [...]}` | Dict with steps |
| Path to config | `"config.yaml"` or `"config.json"` | Load from file |
| PipelineConfigs | `PipelineConfigs(steps)` | Direct config object |
| List of pipelines | `[pipeline1, pipeline2, ...]` | Run each independently |

Supported Dataset Formats

The dataset parameter accepts:

| Format | Example | Description |
|--------|---------|-------------|
| Path to folder | `"sample_data/regression"` | Auto-load from folder |
| Numpy arrays | `(X, y)` or `X` alone | Direct arrays |
| Dict with arrays | `{"X": X, "y": y, "metadata": meta}` | Dict with data |
| SpectroDataset | Direct dataset instance | Pre-built dataset |
| DatasetConfigs | Full configuration object | Complete config |
| List of datasets | `[dataset1, dataset2, ...]` | Run on each dataset |

Batch Execution: Pipelines × Datasets

When you provide multiple pipelines and/or multiple datasets, nirs4all.run() executes the cartesian product:

# 2 pipelines × 2 datasets = 4 runs
result = nirs4all.run(
    pipeline=[pipeline_a, pipeline_b],
    dataset=["data/wheat", "data/corn"],
    verbose=1
)
# Runs: pipeline_a×wheat, pipeline_a×corn, pipeline_b×wheat, pipeline_b×corn

All results are collected into a single RunResult for unified analysis.
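
The pairing itself is an ordinary cartesian product. The sketch below reproduces the four combinations with `itertools.product`, using plain strings in place of real pipelines. The ordering shown (pipelines in the outer loop) follows the comment in the example above and is not a guarantee about the scheduling inside `nirs4all.run()`.

```python
from itertools import product

pipeline_names = ["pipeline_a", "pipeline_b"]
dataset_paths = ["data/wheat", "data/corn"]

# Every pipeline is paired with every dataset
runs = list(product(pipeline_names, dataset_paths))
for pipeline_name, dataset_path in runs:
    print(f"{pipeline_name} x {dataset_path}")

print(len(runs))  # 4
```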

For more control, use PipelineRunner directly:

from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs

runner = PipelineRunner(
    verbose=1,
    save_artifacts=True,
    save_charts=True
)

predictions, per_dataset = runner.run(
    PipelineConfigs(pipeline, "MyPipeline"),
    DatasetConfigs("path/to/data")
)

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        nirs4all.run()                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│  │PipelineRunner│ --> │PipelineOrches│ --> │ Controllers  │    │
│  │              │     │    -trator   │     │  (registry)  │    │
│  └──────────────┘     └──────────────┘     └──────────────┘    │
│         │                    │                    │              │
│         ▼                    ▼                    ▼              │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│  │PipelineConfigs│    │ExecutionContext│   │SpectroDataset│    │
│  │  (pipeline)   │    │   (state)      │   │   (data)     │    │
│  └──────────────┘     └──────────────┘     └──────────────┘    │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                         RunResult                                │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│  │ Predictions  │     │   Metrics    │     │  Artifacts   │    │
│  │    List      │     │              │     │   (.n4a)     │    │
│  └──────────────┘     └──────────────┘     └──────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Next Steps

  - 📖 Loading Data: Learn about DatasetConfigs and supported formats.

  - 🔧 Preprocessing: NIRS-specific preprocessing techniques. See Preprocessing Overview.

  - 📋 Pipeline Syntax: Complete syntax reference. See Writing a Pipeline in nirs4all.

  - 📝 Examples: Working examples organized by topic.

See Also