Loading Data

This guide covers how to load spectral data into NIRS4ALL using DatasetConfigs.

Overview

NIRS4ALL loads data through DatasetConfigs, which handles:

  • Multiple file formats (CSV, Excel, MATLAB, NumPy, Parquet)

  • Automatic file detection in folders

  • Multi-source datasets (e.g., NIR + markers)

  • Train/test splits

  • Metadata handling

Quick Start

From a Folder

The simplest approach: point DatasetConfigs at a folder and NIRS4ALL auto-detects the data files in it:

from nirs4all.data import DatasetConfigs

# Auto-detect files in folder
dataset = DatasetConfigs("path/to/data/")

Expected folder structure:

data/
├── train_x.csv      # Training features (spectra)
├── train_y.csv      # Training targets
├── train_m.csv      # Training metadata (optional)
├── test_x.csv       # Test features (optional)
├── test_y.csv       # Test targets (optional)
└── test_m.csv       # Test metadata (optional)
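
To try the folder layout above without real data, the following sketch writes a small toy dataset with the expected file names into a temporary directory. The values, the "protein" column name, and the comma delimiter are illustrative assumptions; if your NIRS4ALL version expects another delimiter, set it through the *_params keys described later in this guide.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write a header row followed by data rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical toy data: spectra sampled at three wavelengths (nm).
wavelengths = [900, 902, 904]
train_spectra = [[0.12, 0.15, 0.11], [0.22, 0.25, 0.21], [0.32, 0.35, 0.31]]
train_targets = [[10.5], [11.2], [12.8]]
test_spectra = [[0.18, 0.20, 0.17], [0.28, 0.30, 0.27]]
test_targets = [[10.9], [12.1]]

data_dir = Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir()

write_csv(data_dir / "train_x.csv", wavelengths, train_spectra)
write_csv(data_dir / "train_y.csv", ["protein"], train_targets)
write_csv(data_dir / "test_x.csv", wavelengths, test_spectra)
write_csv(data_dir / "test_y.csv", ["protein"], test_targets)

created = sorted(p.name for p in data_dir.iterdir())
print(created)

# The folder can then be handed to NIRS4ALL as shown above:
# dataset = DatasetConfigs(str(data_dir))
```

The metadata files (train_m.csv, test_m.csv) are left out here because they are optional.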

From a Single File

For a single file with features and targets combined:

# CSV with last column as target
dataset = DatasetConfigs("data.csv")

# Explicit target column
dataset = DatasetConfigs({
    "train_x": "data.csv",
    "global_params": {"target_column": "protein"}
})
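
The sketch below illustrates what the two conventions above mean for a combined file: either the last column is the target, or a named column is. It uses only the standard library and is not NIRS4ALL's implementation; the function name split_features_target and the sample values are hypothetical.

```python
import csv
import io

# Hypothetical combined file: three spectral columns plus a "protein" target.
combined_csv = """900,902,904,protein
0.12,0.15,0.11,10.5
0.22,0.25,0.21,11.2
"""


def split_features_target(text, target_column=None):
    """Split CSV text into (feature_names, X, y).

    With target_column=None the last column is treated as the target.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    if target_column is None:
        target_idx = len(header) - 1
    else:
        target_idx = header.index(target_column)
    feature_names = [name for i, name in enumerate(header) if i != target_idx]
    X = [[float(v) for i, v in enumerate(row) if i != target_idx] for row in body]
    y = [float(row[target_idx]) for row in body]
    return feature_names, X, y


names, X, y = split_features_target(combined_csv, target_column="protein")
print(names, X, y)
```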

From Explicit Files

Full control over file paths:

dataset = DatasetConfigs({
    "train_x": "spectra_train.csv",
    "train_y": "targets_train.csv",
    "test_x": "spectra_test.csv",
    "test_y": "targets_test.csv"
})

Supported Formats

Format   Extensions    Notes
CSV      .csv          Most common; configurable delimiter
Excel    .xlsx, .xls   Single sheet or specify sheet name
MATLAB   .mat          Reads first array variable
NumPy    .npy, .npz    Binary format, fast loading
Parquet  .parquet      Columnar format, efficient for large data

CSV Configuration

dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {
        "delimiter": ";",           # Column separator
        "decimal_separator": ",",   # Decimal mark (here a comma, as in 0,15)
        "has_header": True,         # First row is a header
        "na_policy": "drop"         # Drop rows with missing values
    }
})
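
To make the delimiter and decimal_separator settings concrete, here is a small standard-library sketch that parses a semicolon-delimited export with decimal commas. It only illustrates what the options mean; parse_spectra is a hypothetical helper, not part of NIRS4ALL.

```python
import csv
import io

# Hypothetical European-style export: ";" separates columns, "," marks decimals.
raw = "900;902;904\n0,12;0,15;0,11\n0,22;0,25;0,21\n"


def parse_spectra(text, delimiter=";", decimal_separator=",", has_header=True):
    """Parse delimited text into (header, rows of floats)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = rows[0] if has_header else None
    body = rows[1:] if has_header else rows
    values = [
        [float(cell.replace(decimal_separator, ".")) for cell in row]
        for row in body
    ]
    return header, values


header, values = parse_spectra(raw)
print(header, values)
```

Reading the same text with the wrong delimiter would yield a single column per row, which is usually the first symptom of a misconfigured file.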

Excel Configuration

dataset = DatasetConfigs({
    "train_x": "data.xlsx",
    "train_x_params": {
        "sheet_name": "Spectra",    # Specific sheet
        "header_row": 0             # Header row index
    }
})

File Keys Reference

Key            Description
train_x        Training features (spectra)
train_y        Training targets
train_m        Training metadata
test_x         Test features
test_y         Test targets
test_m         Test metadata
*_params       Parameters for corresponding file (e.g., train_x_params)
global_params  Parameters applied to all files

Wavelength Headers

NIRS4ALL understands wavelength information from column headers:

dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {
        "header_unit": "nm"        # Headers are wavelengths in nm
        # Options: "nm", "cm-1", "none", "text", "index"
    }
})

header_unit  Description                Example Headers
"nm"         Wavelengths in nanometers  900, 902, 904, ...
"cm-1"       Wavenumbers in cm⁻¹        4000, 3998, 3996, ...
"none"       No header row              -
"text"       Text labels (ignored)      V1, V2, V3, ...
"index"      Numeric indices            1, 2, 3, ...
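
The "nm" and "cm-1" units describe the same axis in different ways: a wavenumber in cm⁻¹ equals 10⁷ divided by the wavelength in nm. The helper names below are hypothetical and the sketch is independent of NIRS4ALL; it is only meant to help check which unit a file's headers are in.

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nanometers to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm


def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nanometers."""
    return 1e7 / wavenumber_cm1


headers_nm = [1000.0, 1250.0, 2500.0]
headers_cm1 = [nm_to_wavenumber(w) for w in headers_nm]
print(headers_cm1)  # [10000.0, 8000.0, 4000.0]
```

Note that increasing wavelengths correspond to decreasing wavenumbers, which is why the "cm-1" example headers above run from high to low.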

Signal Type

Specify the type of spectral signal for proper handling:

dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {
        "signal_type": "reflectance"
    }
})

# Or as constructor parameter
dataset = DatasetConfigs("spectra.csv", signal_type="reflectance")

Signal Type        Description
"absorbance"       -log₁₀(R)
"reflectance"      Raw reflectance (0-1)
"reflectance%"     Reflectance as percentage (0-100)
"transmittance"    Raw transmittance (0-1)
"transmittance%"   Transmittance as percentage
"auto"             Automatic detection (default)
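
The relationships between these signal types can be sketched in a few lines: percentages are divided by 100, and absorbance is the negative base-10 logarithm of the resulting fraction. The function to_absorbance is a hypothetical illustration of that arithmetic and makes no claim about how NIRS4ALL converts signals internally.

```python
import math


def to_absorbance(values, signal_type):
    """Convert reflectance or transmittance readings to absorbance (-log10)."""
    if signal_type == "absorbance":
        return list(values)
    if signal_type in ("reflectance", "transmittance"):
        fractions = list(values)                  # already in the 0-1 range
    elif signal_type in ("reflectance%", "transmittance%"):
        fractions = [v / 100.0 for v in values]   # percentages -> 0-1 range
    else:
        raise ValueError(f"unsupported signal type: {signal_type!r}")
    return [-math.log10(f) for f in fractions]


print(to_absorbance([0.1, 0.5], "reflectance"))
print(to_absorbance([10.0, 100.0], "reflectance%"))
```

A reflectance of 0.1 and a reflectance of 10% both map to an absorbance of about 1, which is a quick sanity check when you are unsure whether a file stores fractions or percentages.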

Task Type

Force regression or classification mode:

# Force regression
dataset = DatasetConfigs("data/", task_type="regression")

# Force classification
dataset = DatasetConfigs("data/", task_type="binary_classification")

# Valid options:
# - "auto" (default)
# - "regression"
# - "binary_classification"
# - "multiclass_classification"

Multi-Source Datasets

Combine multiple data sources (e.g., NIR spectra + chemical markers):

dataset = DatasetConfigs({
    "train_x": ["nir_spectra.csv", "markers.csv"],
    "train_y": "targets.csv",
    "train_x_params": [
        {"header_unit": "nm", "signal_type": "reflectance"},
        {"header_unit": "text"}  # Markers have text headers
    ]
})
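
When train_x is a list, the example above pairs each file with its own entry in train_x_params. A quick pre-flight check such as the following sketch can catch a mismatch before loading. The helper count_sources is hypothetical and operates on the plain dictionary only; it does not reflect any validation NIRS4ALL may already perform.

```python
def count_sources(config, key="train_x"):
    """Return the number of sources under `key`.

    Raises ValueError if a list of per-source params is given but does not
    have exactly one entry per file.
    """
    files = config[key]
    if isinstance(files, str):
        files = [files]
    params = config.get(f"{key}_params")
    if isinstance(params, list) and len(params) != len(files):
        raise ValueError(
            f"{key} lists {len(files)} file(s) but "
            f"{key}_params has {len(params)} entries"
        )
    return len(files)


config = {
    "train_x": ["nir_spectra.csv", "markers.csv"],
    "train_y": "targets.csv",
    "train_x_params": [
        {"header_unit": "nm", "signal_type": "reflectance"},
        {"header_unit": "text"},
    ],
}
n_sources = count_sources(config)
print(n_sources)
```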

Processing Multi-Source Data

Use source_branch to apply different preprocessing to each source:

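from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler
# StandardNormalVariate and FirstDerivative are NIRS4ALL transforms; the
# import path below is an assumption, so check it against your installed version.
from nirs4all.operators.transforms import StandardNormalVariate, FirstDerivative

pipeline = [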
    {"source_branch": {
        0: [StandardNormalVariate(), FirstDerivative()],  # NIR source
        1: [StandardScaler()]                             # Markers source
    }},
    {"merge_sources": "concat"},  # Combine sources
    {"model": PLSRegression(n_components=10)}
]
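
Conceptually, the pipeline above processes each source separately and then joins the results column-wise. The sketch below shows that idea on toy data using only the standard library: a Standard Normal Variate step for the spectra (here assuming the population standard deviation) and a concatenation step. The markers pass through unchanged for brevity, and none of this is NIRS4ALL's actual implementation.

```python
import statistics
from itertools import chain


def snv(spectrum):
    """Standard Normal Variate: center and scale one spectrum by its own
    mean and (population) standard deviation."""
    mean = statistics.fmean(spectrum)
    std = statistics.pstdev(spectrum)
    return [(value - mean) / std for value in spectrum]


def concat_sources(*sources):
    """Join the feature rows of several sources, sample by sample."""
    return [list(chain.from_iterable(rows)) for rows in zip(*sources)]


# Hypothetical toy data: 2 samples, 3 spectral points and 2 markers each.
nir = [[0.1, 0.2, 0.3], [0.4, 0.6, 0.8]]
markers = [[1.0, 5.0], [2.0, 7.0]]

nir_processed = [snv(spectrum) for spectrum in nir]
merged = concat_sources(nir_processed, markers)
print(len(merged), len(merged[0]))
```

After merging, each sample has one feature vector whose length is the sum of the per-source feature counts, which is what the downstream model sees.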

Sample Aggregation

Aggregate predictions from multiple measurements per sample:

# Aggregate by sample ID column
dataset = DatasetConfigs(
    "data/",
    aggregate="sample_id"      # Metadata column name
)

# Aggregate by target values
dataset = DatasetConfigs(
    "data/",
    aggregate=True             # Group by y values
)

# With custom method
dataset = DatasetConfigs(
    "data/",
    aggregate="sample_id",
    aggregate_method="median",  # "mean", "median", or "vote"
    aggregate_exclude_outliers=True  # Remove outliers before aggregating
)
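
As an illustration of what aggregation by sample ID means, the following sketch collapses repeated measurements into one value per sample using the three methods named above. It is a simplified stand-in written with the standard library; aggregate_predictions is a hypothetical name, outlier exclusion is not shown, and the exact behavior of NIRS4ALL's aggregation may differ.

```python
import statistics
from collections import defaultdict


def aggregate_predictions(sample_ids, predictions, method="mean"):
    """Collapse repeated measurements into one prediction per sample ID."""
    reducers = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "vote": statistics.mode,  # most frequent label, for classification
    }
    reducer = reducers[method]
    groups = defaultdict(list)
    for sample_id, value in zip(sample_ids, predictions):
        groups[sample_id].append(value)
    return {sample_id: reducer(values) for sample_id, values in groups.items()}


# Hypothetical data: sample A was measured three times, sample B twice.
ids = ["A", "A", "A", "B", "B"]
preds = [10.0, 11.0, 15.0, 20.0, 22.0]
labels = ["x", "x", "y", "y", "y"]

by_mean = aggregate_predictions(ids, preds, method="mean")
by_median = aggregate_predictions(ids, preds, method="median")
by_vote = aggregate_predictions(ids, labels, method="vote")
print(by_mean, by_median, by_vote)
```

The median is less sensitive than the mean to a single off measurement (15.0 for sample A here), which is one reason to prefer it for noisy replicates.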

Multiple Datasets

Run the same pipeline on multiple datasets:

dataset = DatasetConfigs([
    "dataset1/",
    "dataset2/",
    {"train_x": "custom/spectra.csv", "train_y": "custom/targets.csv"}
])

# Results will include predictions for all datasets
result = nirs4all.run(pipeline, dataset)

Using SpectroDataset Directly

For advanced use cases, you can pass SpectroDataset instances directly to nirs4all.run():

from nirs4all.data import SpectroDataset
import nirs4all

# Create a SpectroDataset manually
# (X_train and y_train are arrays you have already prepared: spectra and targets)
dataset = SpectroDataset(name="my_dataset")
dataset.add_samples(X_train, indexes={"partition": "train"})
dataset.add_targets(y_train)

# Use directly in run()
result = nirs4all.run(pipeline, dataset)

Multiple SpectroDataset Instances

You can also pass a list of SpectroDataset instances:

# Multiple SpectroDataset instances
datasets = [dataset1, dataset2, dataset3]
result = nirs4all.run(pipeline, datasets)

This is particularly useful when:

  • Working with synthetic data generators that return SpectroDataset

  • Programmatically creating datasets from different sources

  • Chaining multiple pipeline runs with transformed data

Complete Example

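from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

from nirs4all.data import DatasetConfigs
import nirs4all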

# Comprehensive configuration
dataset = DatasetConfigs({
    # Training data
    "train_x": "data/train_spectra.csv",
    "train_y": "data/train_targets.csv",
    "train_m": "data/train_metadata.csv",

    # Test data
    "test_x": "data/test_spectra.csv",
    "test_y": "data/test_targets.csv",

    # Training file parameters
    "train_x_params": {
        "header_unit": "nm",
        "signal_type": "reflectance",
        "delimiter": ","
    },

    # Force regression task
    "task_type": "regression"
})

# Run pipeline
result = nirs4all.run(
    pipeline=[
        MinMaxScaler(),
        ShuffleSplit(n_splits=3),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=dataset,
    verbose=1
)

Common Patterns

Load with Metadata

dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_m": "metadata.csv"  # Sample IDs, dates, groups, etc.
})

Specify Target Column

# When features and target are in the same file
dataset = DatasetConfigs({
    "train_x": "combined_data.csv",
    "global_params": {
        "target_column": "protein"  # Column name for target
    }
})

Handle Missing Values

dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "global_params": {
        "na_policy": "drop"     # Drop rows with NaN
        # Options: "drop", "fill_mean", "fill_median", "fill_zero"
    }
})
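
The option names suggest the following semantics, sketched here on a small table of numbers. This is an interpretation, not NIRS4ALL's code: in particular, filling column by column with the mean or median of the non-missing values is an assumption, and apply_na_policy is a hypothetical helper.

```python
import math
import statistics


def apply_na_policy(rows, policy="drop"):
    """Handle NaN cells in a list of numeric rows according to `policy`."""
    if policy == "drop":
        return [row for row in rows if not any(math.isnan(v) for v in row)]
    if policy == "fill_zero":
        fill = [0.0] * len(rows[0])
    elif policy in ("fill_mean", "fill_median"):
        reducer = statistics.fmean if policy == "fill_mean" else statistics.median
        # One fill value per column, computed from the non-missing cells.
        fill = [
            reducer([v for v in column if not math.isnan(v)])
            for column in zip(*rows)
        ]
    else:
        raise ValueError(f"unknown na_policy: {policy!r}")
    return [[f if math.isnan(v) else v for v, f in zip(row, fill)] for row in rows]


nan = float("nan")
spectra = [[1.0, 2.0], [nan, 4.0], [3.0, 9.0], [8.0, 1.0]]

dropped = apply_na_policy(spectra, "drop")
mean_filled = apply_na_policy(spectra, "fill_mean")
median_filled = apply_na_policy(spectra, "fill_median")
zero_filled = apply_na_policy(spectra, "fill_zero")
print(dropped, mean_filled[1], median_filled[1], zero_filled[1])
```

Dropping rows shrinks the dataset, so the corresponding targets and metadata rows must be dropped too; the fill policies keep every sample at the cost of inventing values.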

Troubleshooting

File Not Found

# Use absolute paths if relative paths fail
import os
path = os.path.abspath("data/spectra.csv")
dataset = DatasetConfigs(path)

Wrong Delimiter

# Check file manually, then specify
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {"delimiter": "\t"}  # Tab-separated
})
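
Instead of checking the file by eye, the delimiter can often be guessed with Python's csv.Sniffer. The sketch below is a standard-library aid for diagnosing a file before configuring NIRS4ALL; detect_delimiter is a hypothetical helper, and sniffing can fail on unusual files, so treat its answer as a hint.

```python
import csv


def detect_delimiter(sample, candidates=",;\t"):
    """Guess the column delimiter from the first lines of a text file."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical tab-separated sample; for a real file, read its first few KB.
sample = (
    "900\t902\t904\n"
    "0.12\t0.15\t0.11\n"
    "0.22\t0.25\t0.21\n"
    "0.32\t0.35\t0.31\n"
)
delimiter = detect_delimiter(sample)
print(repr(delimiter))

# The detected value can then be passed on, for example:
# {"train_x": "spectra.csv", "train_x_params": {"delimiter": delimiter}}
```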

Header Issues

# No header row
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {"has_header": False}
})

# File has a header row, but its labels should be ignored
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_x_params": {"header_unit": "none", "has_header": True}
})

See Also