# Core Concepts
This page explains the fundamental concepts behind NIRS4ALL. Understanding these will help you build effective pipelines.
## Overview

NIRS4ALL is built around three core concepts:

1. **SpectroDataset** - A container for spectral data, targets, and metadata
2. **Pipeline** - A sequence of processing steps
3. **Controllers** - The execution engine that runs each step
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Dataset    │ -> │   Pipeline   │ -> │   Results    │
│ (your data)  │    │   (steps)    │    │ (predictions)│
└──────────────┘    └──────────────┘    └──────────────┘
```
## SpectroDataset

`SpectroDataset` is the core data container. It holds:

| Component | Description |
|-----------|-------------|
| X (features) | Spectral data matrix (n_samples × n_wavelengths) |
| y (targets) | Target values for prediction (n_samples,) |
| metadata | Sample information (IDs, groups, dates, etc.) |
| fold indices | Cross-validation assignments |
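
To make the shapes in the table concrete, here is a minimal sketch of a container with the same four components. `ToySpectroDataset` is a hypothetical stand-in written with the standard library only; it is not the real `SpectroDataset` class and does not mirror its API.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpectroDataset:
    """Hypothetical stand-in for illustration; not the nirs4all class."""

    X: list[list[float]]                          # n_samples rows, n_wavelengths columns
    y: list[float]                                # one target value per sample
    metadata: list[dict] = field(default_factory=list)   # one record per sample
    folds: list[tuple[list[int], list[int]]] = field(default_factory=list)  # (train_idx, val_idx)

    def __post_init__(self):
        # Every component is indexed by sample, so the lengths must agree.
        if len(self.X) != len(self.y):
            raise ValueError("X and y must have the same number of samples")
        if self.metadata and len(self.metadata) != len(self.X):
            raise ValueError("metadata must have one record per sample")

    @property
    def shape(self):
        """Return (n_samples, n_wavelengths)."""
        return (len(self.X), len(self.X[0]) if self.X else 0)


toy = ToySpectroDataset(
    X=[[0.12, 0.15, 0.19], [0.10, 0.14, 0.21]],   # 2 samples, 3 wavelengths
    y=[12.5, 13.1],
    metadata=[{"id": "s1"}, {"id": "s2"}],
)
print(toy.shape)
```

The length checks reflect the one constraint the table implies: features, targets, and metadata all describe the same samples, in the same order.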
### Creating a Dataset

Most often, NIRS4ALL creates datasets automatically from your files:

```python
from nirs4all.data import DatasetConfigs

# From a folder (auto-detects files)
dataset = DatasetConfigs("path/to/data/")

# From explicit files
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_m": "metadata.csv"  # Optional metadata
})
```
You can also generate synthetic data for testing and prototyping:

```python
import nirs4all

# Generate realistic synthetic NIRS spectra
dataset = nirs4all.generate.regression(
    n_samples=500,
    components=["water", "protein", "lipid"],
    complexity="realistic"
)
```
See Synthetic Data Generation for details.
### Partitions
Data is organized into partitions:
| Partition | Purpose |
|-----------|---------|
| **train** | Used for model training and cross-validation |
| **test** | Held-out data for final evaluation |
| **val** | Validation set (often created from train via CV) |
:::{note}
During cross-validation, the train partition is automatically split into train/val folds. The test partition (if provided) remains untouched for final evaluation.
:::
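
A minimal sketch of the behaviour described in the note, using only the standard library: the train indices are repeatedly shuffled and split into train/val folds, while the test indices never enter the splitter. The helper `shuffle_split`, the sample counts, and the 80/20 ratio are assumptions made for illustration; in a real pipeline a splitter step such as `ShuffleSplit` does this work.

```python
import random


def shuffle_split(indices, n_splits=5, val_fraction=0.2, seed=0):
    """Yield (train_idx, val_idx) pairs drawn from `indices` (hypothetical helper)."""
    rng = random.Random(seed)
    n_val = max(1, int(len(indices) * val_fraction))
    for _ in range(n_splits):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        # The first n_val shuffled indices become the validation fold.
        yield sorted(shuffled[n_val:]), sorted(shuffled[:n_val])


train_partition = list(range(0, 80))     # sample indices in the train partition
test_partition = list(range(80, 100))    # held out; never passed to the splitter

folds = list(shuffle_split(train_partition, n_splits=5))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(fold_id, len(train_idx), len(val_idx))
```

Because only `train_partition` is handed to the splitter, no test index can appear in any fold.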
## Pipeline
A **pipeline** is a list of processing steps applied sequentially:
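```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler
# Import path assumed; adjust to your nirs4all version
from nirs4all.operators.transforms import StandardNormalVariate

pipeline = [
    MinMaxScaler(),                             # Step 1: Scale features
    StandardNormalVariate(),                    # Step 2: SNV preprocessing
    {"y_processing": MinMaxScaler()},           # Step 3: Scale targets
    ShuffleSplit(n_splits=5),                   # Step 4: Cross-validation
    {"model": PLSRegression(n_components=10)}   # Step 5: Model
]
```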
### Step Types

| Step Type | Syntax | Purpose |
|-----------|--------|---------|
| Transformer | | Modify features (X) |
| Y Processing | | Modify targets (y) |
| Splitter | | Define cross-validation |
| Model | | Train predictive model |
| Branch | | Parallel processing paths |
| Merge | | Combine branch outputs |
| Augmentation | | Generate preprocessing variants |
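
Target processing has a consequence worth spelling out: a model trained on scaled targets predicts in the scaled space, so predictions have to be mapped back to the original units before they are reported. The sketch below shows that round trip with a hand-written min-max scaler. `ToyMinMax` is a hypothetical class for illustration only; whether and where NIRS4ALL applies the inverse step is not covered on this page.

```python
class ToyMinMax:
    """Hypothetical min-max scaler for 1-D targets (illustration only)."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0   # avoid dividing by zero for constant targets
        return [(v - self.lo) / span for v in values]

    def inverse_transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [v * span + self.lo for v in values]


y_train = [10.0, 12.0, 14.0, 20.0]
scaler = ToyMinMax().fit(y_train)
y_scaled = scaler.transform(y_train)          # values now lie in [0, 1]

scaled_prediction = [0.5]                     # what a model trained on y_scaled might output
restored = scaler.inverse_transform(scaled_prediction)
print(restored)                               # back in the original target units
```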
### Execution Flow

```
Input Data     → [Preprocessing] → [CV Split] → [Training] → Predictions
      │                 │               │            │
      ▼                 ▼               ▼            ▼
SpectroDataset    Transformers      Splitter      Models
```

1. **Data Loading**: Your files are loaded into a `SpectroDataset`
2. **Preprocessing**: Transformers modify X (and optionally y)
3. **Cross-Validation**: Splitter defines train/val folds
4. **Training**: Each model is trained on each fold
5. **Prediction**: Out-of-fold predictions are collected
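
The five stages above can be sketched end to end with toy components. Everything here is an assumption made for illustration: the mean-centering step, the contiguous k-fold splitter, and the `MeanModel` that predicts the training mean are stand-ins, not NIRS4ALL code. The point is the shape of the loop: each sample is predicted exactly once, by a model that did not see it during training.

```python
def center(rows):
    """Preprocessing stand-in: subtract each spectrum's own mean."""
    return [[v - sum(r) / len(r) for v in r] for r in rows]


def k_fold(n_samples, n_splits):
    """Splitter stand-in: contiguous folds, yielding (train_idx, val_idx)."""
    bounds = [round(i * n_samples / n_splits) for i in range(n_splits + 1)]
    for start, stop in zip(bounds, bounds[1:]):
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val


class MeanModel:
    """Model stand-in: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# 1. Data loading (hard-coded here)
X = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 5.0, 7.0], [0.0, 1.0, 2.0]]
y = [10.0, 20.0, 30.0, 40.0]

# 2. Preprocessing
X = center(X)

oof = [None] * len(y)
for train_idx, val_idx in k_fold(len(y), n_splits=2):          # 3. Cross-validation
    model = MeanModel().fit([X[i] for i in train_idx],          # 4. Training
                            [y[i] for i in train_idx])
    preds = model.predict([X[i] for i in val_idx])
    for i, pred in zip(val_idx, preds):                         # 5. Out-of-fold prediction
        oof[i] = pred

print(oof)
```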
## Controllers

Controllers are the execution engine. They interpret each pipeline step and perform the appropriate action.

| Controller | Handles |
|------------|---------|
| | sklearn `TransformerMixin` (scalers, preprocessors) |
| | Cross-validation splitters |
:::{tip}
You rarely interact with controllers directly. They work behind the scenes to execute your pipeline.
:::
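
One way to picture this dispatch is a registry of handlers, each of which declares which steps it can process. The sketch below is a simplified illustration under that assumption; the class names, the `matches` method, and the matching rules are hypothetical and are not the NIRS4ALL controller API.

```python
class ModelController:
    @staticmethod
    def matches(step):
        # Keyword-style steps such as {"model": ...} are plain dicts.
        return isinstance(step, dict) and "model" in step


class SplitterController:
    @staticmethod
    def matches(step):
        return hasattr(step, "split")


class TransformerController:
    @staticmethod
    def matches(step):
        # Anything exposing fit/transform is treated as a feature transformer.
        return hasattr(step, "fit") and hasattr(step, "transform")


# Checked in order; the first controller that matches wins.
REGISTRY = [ModelController, SplitterController, TransformerController]


def select_controller(step):
    """Return the first registered controller that can handle `step`."""
    for controller in REGISTRY:
        if controller.matches(step):
            return controller
    raise TypeError(f"no controller registered for step {step!r}")


class ToyScaler:
    def fit(self, X):
        return self

    def transform(self, X):
        return X


class ToySplitter:
    def split(self, X):
        yield [], []


steps = [ToyScaler(), ToySplitter(), {"model": object()}]
selected = [select_controller(step).__name__ for step in steps]
print(selected)
```

A registry like this keeps the pipeline syntax declarative: the user lists steps, and the choice of handler is made at run time from the shape of each step.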
## Predictions and Results

When you run a pipeline, you get a `RunResult` object:

```python
result = nirs4all.run(pipeline, dataset)

# Access results
result.best_score       # Best model's primary score
result.best             # Best prediction entry (dict)
result.num_predictions  # Total prediction entries
result.predictions      # Full PredictionResultsList

# Get top performers
for pred in result.top(n=5, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")
```
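
Conceptually, picking top performers comes down to scoring each prediction entry and sorting. The sketch below assumes entries are plain dicts holding `y_true` and `y_pred` lists; those two keys, the model names, and the helper functions are hypothetical choices for this example, and how `RunResult.top` is actually implemented is not shown on this page.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def top(entries, n):
    """Return the n entries with the lowest RMSE, each annotated with its score."""
    scored = [{**e, "rmse": rmse(e["y_true"], e["y_pred"])} for e in entries]
    return sorted(scored, key=lambda e: e["rmse"])[:n]


entries = [
    {"model_name": "PLS-5", "y_true": [1.0, 2.0, 3.0], "y_pred": [1.5, 2.5, 3.5]},
    {"model_name": "PLS-10", "y_true": [1.0, 2.0, 3.0], "y_pred": [1.1, 2.1, 2.9]},
    {"model_name": "PLS-15", "y_true": [1.0, 2.0, 3.0], "y_pred": [2.0, 3.0, 4.0]},
]

best_two = top(entries, n=2)
for entry in best_two:
    print(f"{entry['model_name']}: RMSE={entry['rmse']:.4f}")
```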
### Prediction Entry Structure

Each prediction entry contains:

| Field | Description |
|-------|-------------|
| | Name of the model |
| | Name of the dataset |
| | Cross-validation fold index |
| | True target values |
| | Predicted values |
| | Computed metrics |
| | Applied preprocessing chain |
| | Data partition (train/val/test) |
## Key Terminology

| Term | Definition |
|------|------------|
| Spectral data | Features from spectroscopy (reflectance, absorbance, etc.) |
| Wavelength | Individual feature/column in spectral data |
| Fold | One train/validation split in cross-validation |
| OOF (Out-of-Fold) | Predictions made on validation data during CV |
| Operator | A preprocessing or transformation class |
| Transformer | sklearn-compatible operator with `fit` and `transform` methods |
| Pipeline variant | One specific configuration when using generators |
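
As an illustration of the Operator and Transformer entries, here is a minimal sketch of a transformer that applies standard normal variate (SNV) scaling: each spectrum is centered on its own mean and divided by its own standard deviation. `ToySNV` is a hypothetical stand-in using only the standard library; it is not the library's `StandardNormalVariate` implementation, and details such as sample versus population standard deviation may differ.

```python
import statistics


class ToySNV:
    """Hypothetical row-wise SNV transformer (illustration only)."""

    def fit(self, X, y=None):
        # SNV works on each spectrum independently, so there is nothing to learn.
        return self

    def transform(self, X):
        out = []
        for spectrum in X:
            mean = statistics.fmean(spectrum)
            std = statistics.pstdev(spectrum) or 1.0   # guard against flat spectra
            out.append([(v - mean) / std for v in spectrum])
        return out


spectra = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],   # same shape as the first spectrum, ten times the intensity
]
snv = ToySNV().fit(spectra).transform(spectra)
print(snv)
```

The two input spectra differ only by a scale factor, so after SNV they coincide. That is the motivation for using it on NIRS data: per-sample offset and scale differences are removed before modelling.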
## The `nirs4all.run()` Function

The simplest way to run a pipeline:

```python
result = nirs4all.run(
    pipeline=pipeline,      # List of steps (or list of pipelines)
    dataset=dataset,        # See below for supported formats
    name="MyPipeline",      # Pipeline name
    verbose=1,              # 0=silent, 1=progress, 2=debug
    save_artifacts=True,    # Save models and results
    save_charts=True,       # Save generated plots
    plots_visible=False     # Show plots interactively
)
```
### Supported Pipeline Formats

The `pipeline` parameter accepts:

| Format | Example | Description |
|--------|---------|-------------|
| List of steps | | Single pipeline |
| Dict config | | Dict with steps |
| Path to config | | Load from file |
| | | Direct config object |
| List of pipelines | | Run each independently |
### Supported Dataset Formats

The `dataset` parameter accepts:

| Format | Example | Description |
|--------|---------|-------------|
| Path to folder | | Auto-load from folder |
| Numpy arrays | | Direct arrays |
| Dict with arrays | | Dict with data |
| | Direct dataset instance | Pre-built dataset |
| | Full configuration object | Complete config |
| List of datasets | | Run on each dataset |
### Batch Execution: Pipelines × Datasets

When you provide multiple pipelines and/or multiple datasets, `nirs4all.run()` executes their Cartesian product, running every pipeline on every dataset:

```python
# 2 pipelines × 2 datasets = 4 runs
result = nirs4all.run(
    pipeline=[pipeline_a, pipeline_b],
    dataset=["data/wheat", "data/corn"],
    verbose=1
)
# Runs: pipeline_a×wheat, pipeline_a×corn, pipeline_b×wheat, pipeline_b×corn
```

All results are collected into a single `RunResult` for unified analysis.
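
The run order listed in the comment above matches what `itertools.product` yields for the two lists. A minimal sketch of that enumeration follows; the string names are placeholders, and how `nirs4all.run()` iterates internally is not specified on this page.

```python
from itertools import product

pipelines = ["pipeline_a", "pipeline_b"]
datasets = ["wheat", "corn"]

# Every pipeline is paired with every dataset; the last list varies fastest.
runs = list(product(pipelines, datasets))
for pipeline_name, dataset_name in runs:
    print(f"{pipeline_name} x {dataset_name}")
```

The number of runs is the product of the two list lengths, so batch size grows quickly as pipelines or datasets are added.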
For more control, use `PipelineRunner` directly:

```python
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.data import DatasetConfigs

runner = PipelineRunner(
    verbose=1,
    save_artifacts=True,
    save_charts=True
)

predictions, per_dataset = runner.run(
    PipelineConfigs(pipeline, "MyPipeline"),
    DatasetConfigs("path/to/data")
)
```
## Architecture Diagram

```
┌────────────────────────────────────────────────────────────────────┐
│                           nirs4all.run()                           │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌────────────────┐     ┌────────────────┐     ┌────────────────┐  │
│  │ PipelineRunner │ --> │    Pipeline    │ --> │  Controllers   │  │
│  │                │     │  Orchestrator  │     │   (registry)   │  │
│  └────────────────┘     └────────────────┘     └────────────────┘  │
│           │                      │                      │          │
│           ▼                      ▼                      ▼          │
│  ┌────────────────┐     ┌────────────────┐     ┌────────────────┐  │
│  │PipelineConfigs │     │ExecutionContext│     │ SpectroDataset │  │
│  │   (pipeline)   │     │    (state)     │     │     (data)     │  │
│  └────────────────┘     └────────────────┘     └────────────────┘  │
│                                                                    │
├────────────────────────────────────────────────────────────────────┤
│                             RunResult                              │
│  ┌────────────────┐     ┌────────────────┐     ┌────────────────┐  │
│  │  Predictions   │     │    Metrics     │     │   Artifacts    │  │
│  │      List      │     │                │     │     (.n4a)     │  │
│  └────────────────┘     └────────────────┘     └────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
```
## Next Steps

- Learn about `DatasetConfigs` and supported formats.
- NIRS-specific preprocessing techniques.
- Complete syntax reference.
- Working examples organized by topic.
## See Also

- Quickstart - Run your first pipeline
- Writing a Pipeline in nirs4all - Complete pipeline syntax
- Architecture Overview - Detailed architecture for contributors