Configuration Reference
This page provides the complete specification for PipelineConfigs and DatasetConfigs.
PipelineConfigs
PipelineConfigs defines the processing pipeline: preprocessing steps, cross-validation, and models.
Constructor
from nirs4all.pipeline import PipelineConfigs
config = PipelineConfigs(
    definition,                  # Pipeline definition (list, dict, or path)
    name="",                     # Pipeline name
    description="",              # Optional description
    max_generation_count=10000,  # Maximum pipeline variants to generate
)
Parameters
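| Parameter | Type | Default | Description |
|---|---|---|---|
| `definition` | list, dict, str | required | Pipeline steps as a list, a dict with a `pipeline` key, or a path to a YAML or JSON file |
| `name` | str | `""` | Pipeline name (used in artifacts and results) |
| `description` | str | `""` | Human-readable description |
| `max_generation_count` | int | `10000` | Maximum pipeline variants from generators |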
Definition Formats
List of Steps (Recommended)
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
pipeline = PipelineConfigs([
    MinMaxScaler(),
    ShuffleSplit(n_splits=3),
    {"model": PLSRegression(n_components=10)}
], name="MyPipeline")
Dictionary with Pipeline Key
pipeline = PipelineConfigs({
    "pipeline": [
        MinMaxScaler(),
        ShuffleSplit(n_splits=3),
        {"model": PLSRegression(n_components=10)}
    ]
}, name="MyPipeline")
YAML File Path
pipeline = PipelineConfigs("config/pipeline.yaml", name="MyPipeline")
pipeline.yaml:
pipeline:
  - class: sklearn.preprocessing.MinMaxScaler
  - class: sklearn.model_selection.ShuffleSplit
    params:
      n_splits: 3
  - model:
      class: sklearn.cross_decomposition.PLSRegression
      params:
        n_components: 10
JSON File Path
pipeline = PipelineConfigs("config/pipeline.json", name="MyPipeline")
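This page does not show the contents of the JSON file. As a rough sketch, assuming the JSON format mirrors the `class`/`params` structure of the YAML example above (this page does not confirm that), the file could be produced like this:

```python
import json

# Assumed JSON mirror of the pipeline.yaml example; the exact schema
# accepted for JSON files is an assumption, not confirmed by this page.
pipeline_definition = {
    "pipeline": [
        {"class": "sklearn.preprocessing.MinMaxScaler"},
        {
            "class": "sklearn.model_selection.ShuffleSplit",
            "params": {"n_splits": 3},
        },
        {
            "model": {
                "class": "sklearn.cross_decomposition.PLSRegression",
                "params": {"n_components": 10},
            }
        },
    ]
}

# Text that would be saved as config/pipeline.json.
pipeline_json = json.dumps(pipeline_definition, indent=2)
print(pipeline_json)
```

Building the file from a Python dict keeps the YAML and JSON variants easy to compare step by step.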
Step Serialization
Steps are serialized to a canonical format:
Input |
Serialized Form |
|---|---|
|
|
|
|
|
|
Accessing Pipeline Configurations
pipeline = PipelineConfigs([...], name="MyPipeline")
# Access expanded configurations (list of step lists)
pipeline.steps # List of step configurations
# Access names (includes hash for uniqueness)
pipeline.names # ["MyPipeline_a1b2c3"]
# Check if generators were used
pipeline.has_configurations # True if _or_, _range_ expanded
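To give an intuition for how generators multiply into several step lists, and why `max_generation_count` exists, here is a minimal sketch. It assumes that an `_or_` entry lists alternative steps and that the cap simply truncates the result. The helper `expand_or` is hypothetical and is not nirs4all's expansion code; step names are plain strings for illustration.

```python
from itertools import islice, product


def expand_or(steps, max_generation_count=10000):
    """Expand {"_or_": [...]} entries into concrete step lists (illustrative only)."""
    # A generator step contributes its alternatives; a plain step contributes itself.
    choices = [
        step["_or_"] if isinstance(step, dict) and "_or_" in step else [step]
        for step in steps
    ]
    # Cartesian product of all alternatives, truncated at the cap (assumed behavior).
    combos = islice(product(*choices), max_generation_count)
    return [list(combo) for combo in combos]


template = [
    {"_or_": ["MinMaxScaler", "StandardScaler"]},
    "ShuffleSplit",
    {"_or_": ["PLS-5", "PLS-10", "PLS-15"]},
]
variants = expand_or(template)
print(len(variants))  # 2 alternatives x 3 alternatives
```

Because the number of variants is a product of the alternatives at each step, a few generators can produce a very large number of pipelines, which is what a generation cap guards against.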
DatasetConfigs
DatasetConfigs defines how to load and configure datasets.
Constructor
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs(
    configurations,                   # Path(s) or configuration dict(s)
    task_type="auto",                 # Force task type
    signal_type=None,                 # Override signal type
    aggregate=None,                   # Aggregation column or True
    aggregate_method=None,            # Aggregation method
    aggregate_exclude_outliers=None,  # Exclude outliers before aggregation
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `configurations` | str, dict, list | required | Path, config dict, or list of either |
| `task_type` | str, list | `"auto"` | Task type per dataset |
| `signal_type` | str, list | `None` | Signal type override |
| `aggregate` | str, bool, list | `None` | Aggregation setting |
| `aggregate_method` | str, list | `None` | Method: "mean", "median", "vote" |
| `aggregate_exclude_outliers` | bool, list | `None` | Exclude outliers via T² |
Configuration Dictionary Keys
Data File Keys
| Key | Description | Example |
|---|---|---|
| `train_x` | Training features | |
| `train_y` | Training targets | |
| `train_m` | Training metadata | |
| `test_x` | Test features | |
| `test_y` | Test targets | |
| `test_m` | Test metadata | |
Parameter Keys
Key |
Description |
|---|---|
|
Parameters for |
|
Parameters for |
|
Parameters for |
|
Parameters applied to all files |
File Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `delimiter` | str | | Column separator |
| | str | | Decimal point character |
| `has_header` | bool | | First row is header |
| `header_unit` | str | | Header interpretation |
| `signal_type` | str | | Spectral signal type |
| | str | | Missing value handling |
| | str | | Target column name (combined files) |
| | str | | Excel sheet name |
Header Unit Options
| Value | Description |
|---|---|
| `"nm"` | Wavelengths in nanometers |
| | Wavenumbers in cm⁻¹ |
| | No header row |
| `"text"` | Text labels (ignored) |
| | Numeric indices |
| | Automatic detection |
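The nanometer and cm⁻¹ header units describe the same spectral axis in two ways, related by wavenumber (cm⁻¹) = 10⁷ / wavelength (nm). The sketch below shows that conversion. The function names are illustrative and not part of nirs4all.

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nanometers to a wavenumber in cm^-1."""
    # 1 cm = 1e7 nm, so wavenumber = 1e7 / wavelength.
    return 1e7 / wavelength_nm


def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 back to a wavelength in nanometers."""
    return 1e7 / wavenumber_cm1


# Typical NIR axis limits, expressed in both units.
for nm in (1000.0, 2500.0):
    print(nm, "nm ->", nm_to_wavenumber(nm), "cm^-1")
```

Note that the conversion reverses the axis order: increasing wavelength corresponds to decreasing wavenumber, which is one reason declaring the correct header unit matters.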
Signal Type Options
| Value | Description |
|---|---|
| | Absorbance values |
| | Reflectance 0-1 |
| | Reflectance 0-100 |
| | Transmittance 0-1 |
| | Transmittance 0-100 |
| | Automatic detection |
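The signal types differ in scale and meaning, so declaring the right one matters for any later conversion. As an illustration, the conventional relation between a reflectance or transmittance fraction R and (apparent) absorbance is A = -log10(R), with percent-scaled values divided by 100 first. The helper below is a sketch of that relation only, and its name and `percent` flag are assumptions rather than nirs4all API.

```python
import math


def to_absorbance(values, percent=False):
    """Convert reflectance/transmittance readings to (apparent) absorbance."""
    # Percent-scaled signals (0-100) are first brought to the 0-1 range.
    scale = 100.0 if percent else 1.0
    # A = -log10(R); readings must be strictly positive.
    return [-math.log10(v / scale) for v in values]


print(to_absorbance([1.0, 0.5, 0.1]))
print(to_absorbance([100.0, 50.0, 10.0], percent=True))
```

Both calls describe the same physical readings, once on the 0-1 scale and once on the 0-100 scale, and yield the same absorbance values.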
NA Policy Options
| Value | Description |
|---|---|
| | Drop rows with missing values |
| | Fill with column mean |
| | Fill with column median |
| | Fill with zeros |
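The four behaviors listed above can be sketched on a small table where `None` marks a missing value. The policy labels used here (`"drop"`, `"mean"`, `"median"`, `"zero"`) are placeholders chosen for this example and may not match the actual option names; the helper is illustrative, not nirs4all code.

```python
from statistics import mean, median


def apply_na_policy(rows, policy):
    """Apply a missing-value policy to a list of rows (None = missing)."""
    if policy == "drop":
        # Keep only rows without any missing value.
        return [row for row in rows if None not in row]
    if policy == "zero":
        return [[0.0 if v is None else v for v in row] for row in rows]
    if policy in ("mean", "median"):
        stat = mean if policy == "mean" else median
        # One fill value per column, computed from the observed entries only.
        fills = [
            stat([row[j] for row in rows if row[j] is not None])
            for j in range(len(rows[0]))
        ]
        return [
            [fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows
        ]
    raise ValueError(f"Unknown policy: {policy!r}")


data = [[1.0, None], [3.0, 4.0], [5.0, 8.0]]
for name in ("drop", "mean", "median", "zero"):
    print(name, apply_na_policy(data, name))
```

Dropping rows changes the number of samples, while the fill policies keep every sample but alter the data, so the choice depends on how much missing data there is and how sensitive the model is to imputed values.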
Task Type Options
| Value | Description |
|---|---|
| `"auto"` | Auto-detect from targets |
| `"regression"` | Continuous target prediction |
| | Two-class classification |
| | Multi-class classification |
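This page does not describe how auto-detection from targets works. Purely as an illustration of the idea, one common heuristic treats targets with few distinct integer values as classes and everything else as regression. The function, the `max_classes` threshold, and the returned labels below are all assumptions for this sketch, not nirs4all's actual rules or option names.

```python
def infer_task_type(targets, max_classes=20):
    """Guess a task type from numeric targets (hypothetical heuristic)."""
    values = list(targets)
    # Class labels are assumed to be whole numbers; anything else is continuous.
    all_integral = all(float(v).is_integer() for v in values)
    n_unique = len(set(values))
    if not all_integral or n_unique > max_classes:
        return "regression"
    # Two or fewer distinct labels is treated as a two-class problem.
    return "binary" if n_unique <= 2 else "multiclass"


print(infer_task_type([12.4, 13.1, 9.8]))
print(infer_task_type([0, 1, 1, 0]))
print(infer_task_type([0, 1, 2, 1]))
```

Heuristics like this can misclassify integer-valued regression targets, which is a good reason to set `task_type` explicitly when the task is known.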
Configuration Examples
Simple Path
dataset = DatasetConfigs("path/to/data/")
Explicit Files
dataset = DatasetConfigs({
    "train_x": "spectra_train.csv",
    "train_y": "targets_train.csv",
    "test_x": "spectra_test.csv",
    "test_y": "targets_test.csv"
})
With Parameters
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_x_params": {
        "header_unit": "nm",
        "signal_type": "reflectance",
        "delimiter": ";"
    },
    "train_y_params": {
        "has_header": True
    }
})
Multi-Source Dataset
dataset = DatasetConfigs({
    "train_x": ["nir_spectra.csv", "markers.csv"],
    "train_y": "targets.csv",
    "train_x_params": [
        {"header_unit": "nm", "signal_type": "reflectance"},
        {"header_unit": "text"}
    ]
})
Multiple Datasets
dataset = DatasetConfigs([
    "dataset1/",
    "dataset2/",
    {"train_x": "custom/spectra.csv", "train_y": "custom/targets.csv"}
])
Using SpectroDataset Directly
For advanced use cases, nirs4all.run() also accepts SpectroDataset instances directly, bypassing DatasetConfigs:
from nirs4all.data import SpectroDataset
import nirs4all
# Single SpectroDataset
result = nirs4all.run(pipeline, my_spectro_dataset)
# Multiple SpectroDataset instances (multi-dataset run)
result = nirs4all.run(pipeline, [dataset1, dataset2, dataset3])
This is useful when:
Using synthetic data generators that return SpectroDataset
Programmatically constructing datasets
Chaining pipeline runs with transformed data
With Aggregation
dataset = DatasetConfigs(
    "path/to/data/",
    aggregate="sample_id",           # Column name in metadata
    aggregate_method="mean",         # "mean", "median", or "vote"
    aggregate_exclude_outliers=True  # Remove outliers before aggregating
)
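To make the three aggregation methods concrete, the sketch below groups values that share the same identifier (for example, repeated measurements of one sample) and reduces each group to a single value. It is a plain-Python illustration of the idea; the function name and data are hypothetical, and outlier exclusion is not modeled.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def majority_vote(values):
    """Return the most frequent value (first seen wins on ties)."""
    return Counter(values).most_common(1)[0][0]


def aggregate_by_id(sample_ids, values, method="mean"):
    """Reduce values sharing a sample id to one value per id."""
    reducers = {"mean": mean, "median": median, "vote": majority_vote}
    if method not in reducers:
        raise ValueError(f"Unknown method: {method!r}")
    grouped = defaultdict(list)
    for sample_id, value in zip(sample_ids, values):
        grouped[sample_id].append(value)
    return {sid: reducers[method](vals) for sid, vals in grouped.items()}


ids = ["s1", "s1", "s1", "s2"]
print(aggregate_by_id(ids, [1.0, 2.0, 9.0, 4.0], method="mean"))
print(aggregate_by_id(ids, [1.0, 2.0, 9.0, 4.0], method="median"))
print(aggregate_by_id(ids, ["a", "b", "a", "b"], method="vote"))
```

Mean and median suit continuous values, with the median being less sensitive to a single extreme replicate, while voting suits class labels.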
Accessing Dataset Data
dataset = DatasetConfigs("path/to/data/")
# Iterate over datasets
for ds in dataset.iter_datasets():
    print(f"Dataset: {ds.name}")
    print(f"  Samples: {len(ds)}")
    print(f"  Features: {ds.n_features}")
    print(f"  Task: {ds.task_type}")
# Get specific dataset by index
ds = dataset.get_dataset_at(0)
# Get all datasets as list
all_datasets = dataset.get_datasets()
Complete Examples
Full Pipeline Configuration
from nirs4all.pipeline import PipelineConfigs, PipelineRunner
from nirs4all.data import DatasetConfigs
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from nirs4all.operators.transforms import StandardNormalVariate
# Pipeline configuration
pipeline = PipelineConfigs([
    MinMaxScaler(),
    StandardNormalVariate(),
    {"y_processing": MinMaxScaler()},
    ShuffleSplit(n_splits=5, test_size=0.25, random_state=42),
    {"model": PLSRegression(n_components=10)}
], name="ProductionPipeline", description="NIR protein prediction model")
# Dataset configuration
dataset = DatasetConfigs({
    "train_x": "data/spectra.csv",
    "train_y": "data/protein.csv",
    "train_m": "data/samples.csv",
    "train_x_params": {
        "header_unit": "nm",
        "signal_type": "reflectance",
        "delimiter": ","
    }
}, task_type="regression", aggregate="sample_id")
# Run
runner = PipelineRunner(verbose=1, save_artifacts=True)
predictions, per_dataset = runner.run(pipeline, dataset)
YAML Configuration File
pipeline.yaml:
pipeline:
  # Preprocessing
  - class: sklearn.preprocessing.MinMaxScaler
  - class: nirs4all.operators.transforms.StandardNormalVariate
  # Target scaling
  - y_processing:
      class: sklearn.preprocessing.MinMaxScaler
  # Cross-validation
  - class: sklearn.model_selection.ShuffleSplit
    params:
      n_splits: 5
      test_size: 0.25
      random_state: 42
  # Model
  - model:
      class: sklearn.cross_decomposition.PLSRegression
      params:
        n_components: 10
dataset.yaml:
train_x: data/spectra.csv
train_y: data/targets.csv
train_x_params:
  header_unit: nm
  signal_type: reflectance
  delimiter: ","
task_type: regression
Python usage:
from nirs4all.pipeline import PipelineConfigs, PipelineRunner
from nirs4all.data import DatasetConfigs
pipeline = PipelineConfigs("config/pipeline.yaml", name="YAMLPipeline")
dataset = DatasetConfigs("config/dataset.yaml")
runner = PipelineRunner(verbose=1)
predictions, _ = runner.run(pipeline, dataset)
See Also
Writing a Pipeline in nirs4all - Complete pipeline syntax reference
Generator Keywords Reference - Generator syntax (_or_, _range_)
Loading Data - Data loading guide
Core Concepts - Core concepts overview