Quickstart
Get up and running with NIRS4ALL in 5 minutes. This guide walks you through your first complete pipeline.
Prerequisites
NIRS4ALL installed (see Installation)
Python 3.9+
Your First Pipeline
Step 1: Import Libraries
import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
Step 2: Define a Pipeline
A pipeline is a list of processing steps:
pipeline = [
    MinMaxScaler(),                             # Scale features to [0, 1]
    {"y_processing": MinMaxScaler()},           # Scale targets
    ShuffleSplit(n_splits=3, test_size=0.25),   # 3 random splits, 25% held out each time
    {"model": PLSRegression(n_components=10)}   # PLS model
]
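The ShuffleSplit step draws independent random splits rather than partitioning the data into folds. The sketch below reimplements that idea in plain Python to show what n_splits=3 and test_size=0.25 mean in practice. The helper name shuffle_split is hypothetical, and this is an illustration of the concept, not scikit-learn's implementation.

```python
import math
import random


def shuffle_split(n_samples, n_splits=3, test_size=0.25, seed=0):
    """Yield (train_idx, val_idx) pairs from independent random shuffles.

    Hypothetical helper for illustration; it mirrors the idea behind
    sklearn.model_selection.ShuffleSplit, not its actual code.
    """
    rng = random.Random(seed)
    n_val = math.ceil(test_size * n_samples)  # size of the held-out part
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh permutation for every split
        yield sorted(indices[n_val:]), sorted(indices[:n_val])


splits = list(shuffle_split(n_samples=100))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 75 25 on each of the 3 splits
```

Because each split is drawn independently, a sample can land in the held-out part of more than one split, which is what distinguishes this scheme from k-fold cross-validation.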
Step 3: Run the Pipeline
Use nirs4all.run() to train with one function call:
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",  # Path to your data
    name="MyFirstPipeline",
    verbose=1
)
Step 4: View Results
# Check overall performance
print(f"Best RMSE: {result.best_rmse:.4f}")
print(f"Best R²: {result.best_r2:.4f}")
print(f"Number of predictions: {result.num_predictions}")
# Get top 3 models
for pred in result.top(n=3, display_metrics=['rmse', 'r2']):
    print(f" {pred['model_name']}: RMSE={pred['rmse']:.4f}, R²={pred['r2']:.4f}")
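For readers unfamiliar with the two metrics printed above, here is a minimal plain-Python sketch of how RMSE and R² are conventionally defined. The function names and the toy values are illustrative only; nirs4all computes these for you, and its internals may differ.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, same unit as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1.0 is a perfect fit."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values chosen for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")  # RMSE: 0.3536
print(f"R²:   {r2(y_true, y_pred):.4f}")    # R²:   0.9000
```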
Step 4b: Understand Prediction Entries
Each prediction entry, whether returned by top() or by result.best, is a dictionary with detailed information:
# Get the best prediction
best = result.best
# Core identification
print(f"Model: {best['model_name']}")
print(f"Dataset: {best['dataset_name']}")
print(f"Fold: {best['fold_id']}")
print(f"Preprocessing: {best.get('preprocessings', 'none')}")
# Scores by partition (primary metric, always available)
print(f"Primary metric: {best['metric']}")
print(f"Train: {best['train_score']:.6f}")
print(f"Val: {best['val_score']:.6f}")
print(f"Test: {best['test_score']:.6f}")
# Additional metrics (when using display_metrics)
print(f"RMSE: {best.get('rmse', 0):.4f}")
print(f"R²: {best.get('r2', 0):.4f}")
Key fields in each prediction entry:
| Field | Description |
|---|---|
| model_name | Name of the model (e.g., “PLSRegression”) |
| | Class name of the model |
| dataset_name | Dataset name |
| fold_id | Cross-validation fold index |
| preprocessings | Preprocessing steps applied |
| metric | Primary metric name (e.g., ‘mse’) |
| train_score, val_score, test_score | Scores by partition (primary metric) |
| rmse, r2 | Metrics (when using display_metrics) |
| | Data shape info |
| | ‘regression’ or ‘classification’ |
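Since the additional metrics are only present when display_metrics is used, a default of 0 in best.get('rmse', 0) would print 0.0000 for a metric that was never computed. The sketch below shows one way to report a missing metric explicitly. The helper describe_entry is hypothetical, and the entry is a hand-written stand-in with made-up values that uses only field names shown above.

```python
def describe_entry(entry, optional_metrics=("rmse", "r2")):
    """Format one prediction entry, marking absent optional metrics as n/a."""
    lines = [
        f"{entry['model_name']} on {entry['dataset_name']} (fold {entry['fold_id']})",
        f"{entry['metric']}: train={entry['train_score']:.4f} "
        f"val={entry['val_score']:.4f} test={entry['test_score']:.4f}",
    ]
    for name in optional_metrics:
        value = entry.get(name)
        lines.append(f"{name}: {value:.4f}" if value is not None else f"{name}: n/a")
    return "\n".join(lines)


# Hand-written stand-in for a prediction entry; all values are made up
entry = {
    "model_name": "PLSRegression",
    "dataset_name": "regression",
    "fold_id": 0,
    "metric": "mse",
    "train_score": 0.0121,
    "val_score": 0.0158,
    "test_score": 0.0163,
    "rmse": 0.1277,  # present; 'r2' is deliberately left out
}

print(describe_entry(entry))
```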
Step 5: Export for Production
# Export the best model for later use
result.export("exports/my_model.n4a")
Complete Example
Here’s the complete code you can copy and run:
"""My first NIRS4ALL pipeline."""
import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
# Generate synthetic NIRS data (or use your own dataset path)
dataset = nirs4all.generate.regression(
    n_samples=200,
    target_component=0,
    random_state=42
)
# Define pipeline
pipeline = [
    MinMaxScaler(),                             # Scale features
    {"y_processing": MinMaxScaler()},           # Scale targets
    ShuffleSplit(n_splits=3, test_size=0.25),   # Cross-validation
    {"model": PLSRegression(n_components=10)}   # Model
]
# Run pipeline
result = nirs4all.run(
    pipeline=pipeline,
    dataset=dataset,
    name="MyFirstPipeline",
    verbose=1
)
# View results
print(f"\n📊 Results:")
print(f" Best RMSE: {result.best_rmse:.4f}")
print(f" Best R²: {result.best_r2:.4f}")
print(f" Total predictions: {result.num_predictions}")
# Top models with detailed metrics
print("\n🏆 Top 3 Models:")
for i, pred in enumerate(result.top(n=3, display_metrics=['rmse', 'r2']), 1):
    print(f" {i}. {pred['model_name']}: RMSE={pred['rmse']:.4f}, R²={pred['r2']:.4f}")
# Explore the best prediction entry
print("\n📦 Best prediction details:")
best = result.best
print(f" Model: {best['model_name']}")
print(f" Dataset: {best['dataset_name']}")
print(f" Fold: {best['fold_id']}")
print(f" Metric: {best['metric']}")
# Access partition-specific scores (primary metric)
print(f" Train: {best['train_score']:.6f}")
print(f" Val: {best['val_score']:.6f}")
print(f" Test: {best['test_score']:.6f}")
# Export best model
result.export("exports/my_model.n4a")
print("\n✅ Model exported to exports/my_model.n4a")
Add NIRS-Specific Preprocessing
NIRS data benefits from specialized preprocessing. Try this enhanced pipeline:
from nirs4all.operators.transforms import (
    StandardNormalVariate,
    FirstDerivative
)
pipeline = [
    MinMaxScaler(),                             # Feature scaling
    StandardNormalVariate(),                    # SNV: scatter correction
    FirstDerivative(),                          # Enhance spectral features
    {"y_processing": MinMaxScaler()},           # Target scaling
    ShuffleSplit(n_splits=3),                   # Cross-validation
    {"model": PLSRegression(n_components=10)}   # Model
]
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="NIRSPipeline",
    verbose=1
)
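SNV and the first derivative are used above without a definition. As a rough sketch, assuming SNV means centering and scaling each spectrum by its own mean and standard deviation, and approximating the derivative by differences between neighbouring points, the two transforms look like this in plain Python. The function names snv and first_difference are illustrative and not part of nirs4all, whose operators may well use a different scheme (for example a smoothed derivative).

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: center and scale one spectrum by its own
    mean and standard deviation (sample standard deviation assumed here)."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)
    return [(value - mean) / std for value in spectrum]


def first_difference(spectrum):
    """Simplest first-derivative estimate: difference between neighbours."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


# Two toy spectra with the same shape; the second has a gain and an offset,
# a crude stand-in for scatter effects
base = [0.10, 0.30, 0.80, 0.40, 0.20]
scattered = [2.0 * value + 0.5 for value in base]

snv_base = snv(base)
snv_scattered = snv(scattered)

# After SNV the two spectra coincide up to floating-point error
print(max(abs(a - b) for a, b in zip(snv_base, snv_scattered)))
print(first_difference(base))
```

The point of the toy example is that a multiplicative gain and an additive offset, which often come from light scattering rather than chemistry, disappear after SNV.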
Using Your Own Data
Replace the sample data with your own:
# From a CSV file
result = nirs4all.run(pipeline, dataset="path/to/your/data.csv")
# From a folder
result = nirs4all.run(pipeline, dataset="path/to/data_folder/")
# With explicit configuration
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
})
result = nirs4all.run(pipeline, dataset=dataset)
No Data? Generate Synthetic NIRS Spectra
Get started immediately with realistic synthetic data:
import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
# Generate synthetic NIRS data with known ground truth
dataset = nirs4all.generate.regression(
    n_samples=500,
    components=["water", "protein", "lipid"],
    complexity="realistic",
    random_state=42
)
# Use directly in pipelines
result = nirs4all.run(
    pipeline=[
        MinMaxScaler(),
        ShuffleSplit(n_splits=3),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=dataset
)
print(f"RMSE: {result.best_rmse:.4f}")
Synthetic data is perfect for:
Learning and experimentation
Testing preprocessing pipelines
Prototyping before real data arrives
Reproducible unit tests
See Synthetic Data Generation for full documentation.
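To give an intuition for what "synthetic spectra with known ground truth" means, here is a toy plain-Python sketch: each spectrum is a weighted sum of Gaussian absorption bands plus noise, and the weights are the known concentrations. This is an assumption-laden illustration, not how nirs4all.generate works; the function names, band positions, and noise level are all made up.

```python
import math
import random


def gaussian_band(wavelengths, center, width):
    """One absorption band modelled as a Gaussian peak."""
    return [math.exp(-0.5 * ((w - center) / width) ** 2) for w in wavelengths]


def make_spectra(n_samples, seed=42, noise=0.01):
    """Return (spectra, concentrations) for a toy two-component mixture."""
    rng = random.Random(seed)
    wavelengths = list(range(1000, 2500, 10))  # nm, 150 points
    # Band positions are arbitrary choices for illustration
    bands = [
        gaussian_band(wavelengths, 1450, 40),
        gaussian_band(wavelengths, 2100, 60),
    ]
    spectra, concentrations = [], []
    for _ in range(n_samples):
        conc = [rng.random() for _ in bands]  # known ground truth
        spectrum = [
            sum(c * band[i] for c, band in zip(conc, bands)) + rng.gauss(0.0, noise)
            for i in range(len(wavelengths))
        ]
        spectra.append(spectrum)
        concentrations.append(conc)
    return spectra, concentrations


spectra, concentrations = make_spectra(n_samples=5)
print(len(spectra), len(spectra[0]))  # 5 150
```

Fixing the seed makes the output repeatable, which is the property that makes synthetic data convenient for unit tests.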
Compare Multiple Models
Run and compare different models in one pipeline:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=3),
    # Multiple models - each is evaluated
    {"model": PLSRegression(n_components=5)},
    {"model": PLSRegression(n_components=10)},
    {"model": PLSRegression(n_components=15)},
    {"model": Ridge(alpha=1.0)},
    {"model": RandomForestRegressor(n_estimators=100)},
]
result = nirs4all.run(
    pipeline=pipeline,
    dataset="sample_data/regression",
    name="MultiModel",
    verbose=1
)
# See which model performed best
for pred in result.top(n=5, display_metrics=['rmse', 'r2']):
    print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}")
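When several models are scored on several splits, it can help to average the scores per model before comparing them. The sketch below does this over hand-written stand-ins for prediction entries; the scores are made up, and it assumes each model is reported once per split, which may not match how nirs4all organizes its results.

```python
from collections import defaultdict
from statistics import fmean

# Hand-written stand-ins for prediction entries; all scores are made up
entries = [
    {"model_name": "PLSRegression", "fold_id": 0, "val_score": 0.20},
    {"model_name": "PLSRegression", "fold_id": 1, "val_score": 0.30},
    {"model_name": "Ridge", "fold_id": 0, "val_score": 0.40},
    {"model_name": "Ridge", "fold_id": 1, "val_score": 0.20},
]

# Collect the validation scores of each model across splits
scores_by_model = defaultdict(list)
for entry in entries:
    scores_by_model[entry["model_name"]].append(entry["val_score"])

# Lower is better for error metrics such as RMSE or MSE
ranking = sorted(
    ((name, fmean(scores)) for name, scores in scores_by_model.items()),
    key=lambda item: item[1],
)
for name, mean_score in ranking:
    print(f"{name}: mean val score = {mean_score:.4f}")
```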
Run Multiple Pipelines at Once
Pass a list of pipelines to execute them all independently:
# StandardScaler is the only name here not imported in the earlier sections
from sklearn.preprocessing import StandardScaler
# Define different pipeline strategies
pipeline_pls = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=3),
    {"model": PLSRegression(n_components=10)}
]
pipeline_rf = [
    StandardScaler(),
    ShuffleSplit(n_splits=3),
    {"model": RandomForestRegressor(n_estimators=100)}
]
pipeline_ridge = [
    MinMaxScaler(),
    FirstDerivative(),
    ShuffleSplit(n_splits=3),
    {"model": Ridge(alpha=1.0)}
]
# Run all three pipelines with one call
result = nirs4all.run(
    pipeline=[pipeline_pls, pipeline_rf, pipeline_ridge],  # List of pipelines
    dataset="sample_data/regression",
    verbose=1
)
print(f"Total configurations tested: {result.num_predictions}")
print(f"Best RMSE: {result.best_rmse:.4f}")
Run on Multiple Datasets
Test the same pipeline(s) across different datasets:
# Cartesian product: each pipeline × each dataset
result = nirs4all.run(
    pipeline=[pipeline_pls, pipeline_rf],  # 2 pipelines
    dataset=["data/wheat", "data/corn"],   # 2 datasets
    verbose=1
)
# Runs 4 combinations: PLS×wheat, PLS×corn, RF×wheat, RF×corn
print(f"Tested {result.num_predictions} configurations")
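The Cartesian product mentioned above can be enumerated with the standard library. In this small sketch, plain string labels stand in for the pipeline and dataset objects; it only illustrates which combinations are formed, not how nirs4all schedules them.

```python
from itertools import product

# Labels standing in for the pipeline and dataset objects
pipelines = ["PLS", "RF"]
datasets = ["wheat", "corn"]

# Every pipeline is paired with every dataset
combinations = list(product(pipelines, datasets))
for pipeline_name, dataset_name in combinations:
    print(f"{pipeline_name} × {dataset_name}")

print(len(combinations))  # 4
```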
Visualize Results
Create publication-quality visualizations:
from nirs4all.visualization.predictions import PredictionAnalyzer
analyzer = PredictionAnalyzer(result.predictions)
# Predicted vs actual plot for top models
fig1 = analyzer.plot_top_k(k=3, rank_metric='rmse')
# Compare models with candlestick chart
fig2 = analyzer.plot_candlestick(variable="model_name")
# Show all plots
import matplotlib.pyplot as plt
plt.show()
What’s Next?
Understand pipelines, datasets, and execution flow.
Learn preprocessing, stacking, and deployment.
50+ working examples organized by topic.
Complete pipeline syntax reference.
Key Takeaways
Pipelines are lists of processing steps
One function (nirs4all.run()) handles everything
Results are accessible via result.best_rmse, result.best_r2, result.top(), etc.
Prediction entries are dicts with model_name, dataset_name, fold_id, scores, and more
Detailed scores are available via pred['scores']['train'/'val'/'test']['rmse'/'r2'/...]
Export models with result.export() for deployment
NIRS preprocessing (SNV, derivatives) improves spectral analysis
See Also
Core Concepts - Understanding SpectroDataset and pipelines
Preprocessing Overview - NIRS preprocessing techniques
Writing a Pipeline in nirs4all - Complete syntax reference