Loading Data
This guide covers how to load spectral data into NIRS4ALL using DatasetConfigs.
Overview
NIRS4ALL loads data through DatasetConfigs, which handles:
Multiple file formats (CSV, Excel, MATLAB, NumPy, Parquet)
Automatic file detection in folders
Multi-source datasets (e.g., NIR + markers)
Train/test splits
Metadata handling
Quick Start
From a Folder
The simplest approach is to pass a folder path; NIRS4ALL auto-detects the data files in it:
from nirs4all.data import DatasetConfigs
# Auto-detect files in folder
dataset = DatasetConfigs("path/to/data/")
Expected folder structure:
data/
├── train_x.csv # Training features (spectra)
├── train_y.csv # Training targets
├── train_m.csv # Training metadata (optional)
├── test_x.csv # Test features (optional)
├── test_y.csv # Test targets (optional)
└── test_m.csv # Test metadata (optional)
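Before handing a folder to DatasetConfigs, it can help to check which of the expected files are actually present. The sketch below is a hypothetical helper, not part of NIRS4ALL. It assumes the CSV naming shown above and treats `train_x` and `train_y` as required because they are the only entries not marked optional; the real auto-detection may accept other names and formats.

```python
from pathlib import Path
import tempfile

# File stems taken from the folder layout above. The helper is hypothetical
# and only mirrors that layout; it does not call NIRS4ALL.
REQUIRED = ("train_x", "train_y")
OPTIONAL = ("train_m", "test_x", "test_y", "test_m")


def summarize_folder(folder):
    """Return (found, missing_required) for the CSV layout shown above."""
    folder = Path(folder)
    found = [stem for stem in REQUIRED + OPTIONAL
             if (folder / f"{stem}.csv").is_file()]
    missing = [stem for stem in REQUIRED if stem not in found]
    return found, missing


# Demo on a temporary folder that contains only the two training files.
with tempfile.TemporaryDirectory() as tmp:
    for stem in ("train_x", "train_y"):
        (Path(tmp) / f"{stem}.csv").write_text("1100;1102\n0.12;0.13\n")
    demo_found, demo_missing = summarize_folder(tmp)

print("found:", demo_found, "missing:", demo_missing)
```

An empty `missing` list means the folder has at least the training features and targets.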
From a Single File
For a single file with features and targets combined:
# CSV with last column as target
dataset = DatasetConfigs("data.csv")
# Explicit target column
dataset = DatasetConfigs({
    "train_x": "data.csv",
    "global_params": {"target_column": "protein"}
})
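To make the two options concrete, the following sketch shows what "last column as target" and an explicit target column mean for a combined table. It uses only the standard library on an in-memory CSV. The column names and the `split_features_target` helper are illustrative assumptions, not the loader NIRS4ALL uses internally.

```python
import csv
import io

# Small in-memory CSV standing in for data.csv; the values are made up.
raw = "1100,1102,protein\n0.12,0.13,11.5\n0.22,0.23,12.1\n"

rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]


def split_features_target(header, body, target_column=None):
    """Split rows into X and y; default to the last column as the target."""
    if target_column is None:
        idx = len(header) - 1
    else:
        idx = header.index(target_column)
    X = [[float(v) for i, v in enumerate(row) if i != idx] for row in body]
    y = [float(row[idx]) for row in body]
    return X, y


X_last, y_last = split_features_target(header, body)
X_named, y_named = split_features_target(header, body, "protein")
```

Here both calls give the same split because `protein` happens to be the last column.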
From Explicit Files
Full control over file paths:
dataset = DatasetConfigs({
    "train_x": "spectra_train.csv",
    "train_y": "targets_train.csv",
    "test_x": "spectra_test.csv",
    "test_y": "targets_test.csv"
})
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| CSV | | Most common; configurable delimiter |
| Excel | | Single sheet or specify sheet name |
| MATLAB | | Reads first array variable |
| NumPy | | Binary format, fast loading |
| Parquet | | Columnar format, efficient for large data |
CSV Configuration
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {
        "delimiter": ";",          # Column separator
        "decimal_separator": ",",  # Decimal mark (comma instead of point)
        "has_header": True,        # First row is header
        "na_policy": "drop"        # Handle missing values
    }
})
Excel Configuration
dataset = DatasetConfigs({
    "train_x": "data.xlsx",
    "train_x_params": {
        "sheet_name": "Spectra",  # Specific sheet
        "header_row": 0           # Header row index
    }
})
File Keys Reference
| Key | Description |
|---|---|
| train_x | Training features (spectra) |
| train_y | Training targets |
| train_m | Training metadata |
| test_x | Test features |
| test_y | Test targets |
| test_m | Test metadata |
| *_params | Parameters for corresponding file (e.g., train_x_params) |
| global_params | Parameters applied to all files |
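A misspelled key in a configuration dictionary is easy to miss. The sketch below is a hypothetical pre-check, not a NIRS4ALL function. It only knows the key names that appear in this guide (the six data keys, their `_params` variants, `global_params`, and `task_type` from the complete example), so the allowed set may be incomplete for your version.

```python
# Key names as used in this guide; the allowed set is an assumption
# and may not cover every option DatasetConfigs accepts.
DATA_KEYS = {"train_x", "train_y", "train_m", "test_x", "test_y", "test_m"}
EXTRA_KEYS = {"global_params", "task_type"}


def unknown_keys(config):
    """Return the config keys that this guide does not mention."""
    allowed = DATA_KEYS | {f"{key}_params" for key in DATA_KEYS} | EXTRA_KEYS
    return sorted(key for key in config if key not in allowed)


config = {
    "train_x": "spectra.csv",
    "train_X_params": {"delimiter": ";"},  # typo: capital X
    "global_params": {"na_policy": "drop"},
}
typos = unknown_keys(config)
print(typos)
```

Running such a check before constructing DatasetConfigs surfaces typos early; extend `EXTRA_KEYS` if you rely on options not covered here.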
Wavelength Headers
NIRS4ALL understands wavelength information from column headers:
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {
        "header_unit": "nm"  # Headers are wavelengths in nm
        # Options: "nm", "cm-1", "none", "text", "index"
    }
})
| header_unit | Description | Example Headers |
|---|---|---|
| nm | Wavelengths in nanometers | |
| cm-1 | Wavenumbers in cm⁻¹ | |
| none | No header row | - |
| text | Text labels (ignored) | |
| index | Numeric indices | |
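The two numeric header units describe the same axis: a wavenumber in cm⁻¹ equals 10⁷ divided by the wavelength in nm. The helpers below are a standalone sketch of that conversion for checking headers by hand; the function names are illustrative and are not NIRS4ALL calls.

```python
def nm_to_wavenumber(nm):
    """Convert a wavelength in nanometers to a wavenumber in cm^-1."""
    return 1e7 / nm


def wavenumber_to_nm(cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nanometers."""
    return 1e7 / cm1


# Example header values in nm and their wavenumber equivalents.
headers_nm = [1000.0, 1250.0, 2500.0]
headers_cm1 = [nm_to_wavenumber(nm) for nm in headers_nm]
print(headers_cm1)
```

Note that an axis that increases in nm decreases in cm⁻¹, which is a quick way to tell which unit a file uses.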
Signal Type
Specify the type of spectral signal for proper handling:
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {
        "signal_type": "reflectance"
    }
})
# Or as constructor parameter
dataset = DatasetConfigs("spectra.csv", signal_type="reflectance")
| Signal Type | Description |
|---|---|
| | -log₁₀(R) |
| | Raw reflectance (0-1) |
| | Reflectance as percentage (0-100) |
| | Raw transmittance (0-1) |
| | Transmittance as percentage |
| | Automatic detection (default) |
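The signal types are related by simple formulas: a percentage is the raw fraction times 100, and the first row of the table is -log₁₀(R) of the raw reflectance. The sketch below works these relations through with the standard library. The function names are illustrative and say nothing about how NIRS4ALL converts signals internally.

```python
import math


def percent_to_fraction(percent):
    """Convert a percentage (0-100) to a raw fraction (0-1)."""
    return percent / 100.0


def reflectance_to_absorbance(reflectance):
    """Apparent absorbance A = -log10(R) for raw reflectance R in (0, 1]."""
    if not 0.0 < reflectance <= 1.0:
        raise ValueError("reflectance must be in (0, 1]")
    return -math.log10(reflectance)


# 10 % reflectance is a raw reflectance of 0.1.
raw = percent_to_fraction(10.0)
absorbance = reflectance_to_absorbance(raw)
print(raw, absorbance)
```

Values outside the stated range, such as reflectance above 1, usually indicate that the file is stored as a percentage rather than a fraction.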
Task Type
Force regression or classification mode:
# Force regression
dataset = DatasetConfigs("data/", task_type="regression")
# Force classification
dataset = DatasetConfigs("data/", task_type="binary_classification")
# Valid options:
# - "auto" (default)
# - "regression"
# - "binary_classification"
# - "multiclass_classification"
Multi-Source Datasets
Combine multiple data sources (e.g., NIR spectra + chemical markers):
dataset = DatasetConfigs({
    "train_x": ["nir_spectra.csv", "markers.csv"],
    "train_y": "targets.csv",
    "train_x_params": [
        {"header_unit": "nm", "signal_type": "reflectance"},
        {"header_unit": "text"}  # Markers have text headers
    ]
})
Processing Multi-Source Data
Use source_branch to apply different preprocessing to each source:
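from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler
# StandardNormalVariate and FirstDerivative are assumed to be NIRS4ALL
# transform operators; import them from your NIRS4ALL installation.
pipeline = [
    {"source_branch": {
        0: [StandardNormalVariate(), FirstDerivative()],  # NIR source
        1: [StandardScaler()]                             # Markers source
    }},
    {"merge_sources": "concat"},  # Combine sources
    {"model": PLSRegression(n_components=10)}
]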
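Conceptually, a "concat" merge places the features of each source side by side for every sample. The sketch below illustrates that idea on plain lists, under the assumption that "concat" means feature-wise concatenation. The `concat_sources` helper is hypothetical and is not the NIRS4ALL implementation.

```python
def concat_sources(*sources):
    """Join per-sample feature rows from several sources side by side."""
    n_samples = len(sources[0])
    if any(len(source) != n_samples for source in sources):
        raise ValueError("all sources must have the same number of samples")
    return [
        [value for source in sources for value in source[i]]
        for i in range(n_samples)
    ]


# Two samples: three NIR features and one marker feature each (made-up values).
nir = [[0.12, 0.13, 0.14], [0.22, 0.23, 0.24]]
markers = [[5.0], [7.0]]
merged = concat_sources(nir, markers)
print(merged)
```

Each merged row keeps the source order, so the NIR features come first and the marker columns follow.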
Sample Aggregation
Aggregate predictions from multiple measurements per sample:
# Aggregate by sample ID column
dataset = DatasetConfigs(
    "data/",
    aggregate="sample_id"  # Metadata column name
)
# Aggregate by target values
dataset = DatasetConfigs(
    "data/",
    aggregate=True  # Group by y values
)
# With custom method
dataset = DatasetConfigs(
    "data/",
    aggregate="sample_id",
    aggregate_method="median",       # "mean", "median", or "vote"
    aggregate_exclude_outliers=True  # Remove outliers before aggregating
)
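To show what aggregation does to repeated measurements, the sketch below groups predictions by sample ID and reduces each group with the three methods named above. It is a standalone illustration with made-up values. The `aggregate_predictions` helper is hypothetical, and outlier exclusion is not modeled.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def aggregate_predictions(sample_ids, predictions, method="mean"):
    """Reduce the predictions of each sample ID to a single value."""
    reducers = {
        "mean": mean,
        "median": median,
        # Majority vote: the most frequent predicted label wins.
        "vote": lambda values: Counter(values).most_common(1)[0][0],
    }
    reducer = reducers[method]
    groups = defaultdict(list)
    for sample_id, prediction in zip(sample_ids, predictions):
        groups[sample_id].append(prediction)
    return {sample_id: reducer(values) for sample_id, values in groups.items()}


ids = ["s1", "s1", "s1", "s2", "s2"]
preds = [10.0, 12.0, 20.0, 5.0, 7.0]
by_mean = aggregate_predictions(ids, preds, "mean")
by_median = aggregate_predictions(ids, preds, "median")
by_vote = aggregate_predictions(["s1", "s1", "s1"], ["a", "a", "b"], "vote")
```

The median is less sensitive than the mean to a single deviating measurement, as with the value 20.0 for `s1` here; "vote" only makes sense for class labels.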
Multiple Datasets
Run the same pipeline on multiple datasets:
dataset = DatasetConfigs([
    "dataset1/",
    "dataset2/",
    {"train_x": "custom/spectra.csv", "train_y": "custom/targets.csv"}
])
# Results will include predictions for all datasets
result = nirs4all.run(pipeline, dataset)
Using SpectroDataset Directly
For advanced use cases, you can pass SpectroDataset instances directly to nirs4all.run():
from nirs4all.data import SpectroDataset
import nirs4all
# Create a SpectroDataset manually
dataset = SpectroDataset(name="my_dataset")
dataset.add_samples(X_train, indexes={"partition": "train"})
dataset.add_targets(y_train)
# Use directly in run()
result = nirs4all.run(pipeline, dataset)
Multiple SpectroDataset Instances
You can also pass a list of SpectroDataset instances:
# Multiple SpectroDataset instances
datasets = [dataset1, dataset2, dataset3]
result = nirs4all.run(pipeline, datasets)
This is particularly useful when:
Working with synthetic data generators that return SpectroDataset
Programmatically creating datasets from different sources
Chaining multiple pipeline runs with transformed data
Complete Example
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

from nirs4all.data import DatasetConfigs
import nirs4all

# Comprehensive configuration
dataset = DatasetConfigs({
    # Training data
    "train_x": "data/train_spectra.csv",
    "train_y": "data/train_targets.csv",
    "train_m": "data/train_metadata.csv",
    # Test data
    "test_x": "data/test_spectra.csv",
    "test_y": "data/test_targets.csv",
    # Training file parameters
    "train_x_params": {
        "header_unit": "nm",
        "signal_type": "reflectance",
        "delimiter": ","
    },
    # Force regression task
    "task_type": "regression"
})

# Run pipeline
result = nirs4all.run(
    pipeline=[
        MinMaxScaler(),
        ShuffleSplit(n_splits=3),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=dataset,
    verbose=1
)
Common Patterns
Load with Metadata
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_m": "metadata.csv"  # Sample IDs, dates, groups, etc.
})
Specify Target Column
# When features and target are in the same file
dataset = DatasetConfigs({
    "train_x": "combined_data.csv",
    "global_params": {
        "target_column": "protein"  # Column name for target
    }
})
Handle Missing Values
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "global_params": {
        "na_policy": "drop"  # Drop rows with NaN
        # Options: "drop", "fill_mean", "fill_median", "fill_zero"
    }
})
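The option names suggest what each policy does, and the sketch below spells out one plausible reading on a small table. It assumes that "drop" removes rows containing any NaN and that the fill policies replace NaN with a per-column statistic. These semantics are assumptions made for illustration; the hypothetical `apply_na_policy` helper is not the NIRS4ALL implementation.

```python
import math
from statistics import mean, median


def apply_na_policy(rows, policy="drop"):
    """Apply an assumed reading of the na_policy options to rows of floats."""
    if policy == "drop":
        # Keep only rows without any NaN.
        return [row for row in rows if not any(math.isnan(v) for v in row)]
    columns = list(zip(*rows))
    if policy == "fill_zero":
        fills = [0.0] * len(columns)
    else:
        stat = {"fill_mean": mean, "fill_median": median}[policy]
        # Column statistic computed over the non-NaN values only.
        fills = [stat([v for v in col if not math.isnan(v)]) for col in columns]
    return [
        [fills[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


table = [[1.0, 2.0], [float("nan"), 4.0], [3.0, 6.0]]
dropped = apply_na_policy(table, "drop")
mean_filled = apply_na_policy(table, "fill_mean")
zero_filled = apply_na_policy(table, "fill_zero")
```

Dropping changes the number of samples, so targets and metadata must lose the same rows; filling keeps the row count intact. A column made up entirely of NaN would make the mean and median variants of this sketch fail.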
Troubleshooting
File Not Found
# Use absolute paths if relative paths fail
import os
path = os.path.abspath("data/spectra.csv")
dataset = DatasetConfigs(path)
Wrong Delimiter
# Check file manually, then specify
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {"delimiter": "\t"}  # Tab-separated
})
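Instead of opening the file by hand, the first lines can be passed to `csv.Sniffer` from the Python standard library to guess the separator. This is a sketch on in-memory samples with made-up values. The `guess_delimiter` helper and the candidate list are assumptions, and the guess should still be checked on unusual files.

```python
import csv


def guess_delimiter(sample, candidates=",;\t"):
    """Guess the column separator from the first lines of a text file."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Tab-separated sample with point decimals.
tab_sample = "1100\t1102\t1104\n0.12\t0.13\t0.14\n0.22\t0.23\t0.24\n"
# Semicolon-separated sample with comma decimals.
semi_sample = "1100;1102;1104\n0,12;0,13;0,14\n0,22;0,23;0,24\n"

tab_guess = guess_delimiter(tab_sample)
semi_guess = guess_delimiter(semi_sample)
print(repr(tab_guess), repr(semi_guess))
```

For a real file, read the first few kilobytes with `open(path).read(4096)` and pass the guessed character as the `delimiter` parameter shown above.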
Header Issues
# No header row
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {"has_header": False}
})
# Skip header row (the header is present but not used as wavelengths)
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {"header_unit": "none", "has_header": True}
})
See Also
Core Concepts - Understanding SpectroDataset
Configuration Reference - Full DatasetConfigs specification
Sample Filtering User Guide - Filter samples during loading
Sample Aggregation - Aggregate multiple measurements