Data Handling Examples

This section covers the ways to load, configure, and work with data in NIRS4ALL, from simple NumPy arrays to multi-source datasets.

Overview 

Example	Topic	Difficulty	Duration
U01	Flexible Inputs	★☆☆☆☆	~2 min
U02	Multi-Datasets	★★☆☆☆	~3 min
U03	Multi-Source Data	★★★☆☆	~3 min
U04	Wavelength Handling	★★☆☆☆	~3 min
U05	Synthetic Data	★★☆☆☆	~2 min
U06	Advanced Synthetic Data	★★★☆☆	~5 min

U01: Flexible Inputs 

Demonstrates all possible input formats for datasets and pipelines.

What You’ll Learn 

Direct numpy array input with (X, y) tuples
Dictionary-based dataset configuration
Partition info specification
SpectroDataset object usage

Input Format Overview 

NIRS4ALL accepts datasets in multiple formats:

Format	Example	Best For
Folder path	`"sample_data/regression"`	File-based datasets
Tuple	`(X, y)` or `(X, y, partition_info)`	Quick experiments
Dictionary	`{"train_x": X_train, "test_x": X_test, ...}`	Explicit splits
DatasetConfigs	`DatasetConfigs("path")`	Full control
SpectroDataset	`SpectroDataset(name="my_data")`	Programmatic access

Simplest Approach: Direct Arrays 

import numpy as np
import nirs4all
from sklearn.linear_model import Ridge

# Generate or load your data
X = np.random.randn(200, 100)
y = np.random.randn(200)

# Partition info: first 160 samples for training
partition_info = {"train": 160}

# Run directly with tuple
result = nirs4all.run(
    pipeline=[Ridge(alpha=1.0)],
    dataset=(X, y, partition_info),
    name="DirectArrays"
)

Partition Info Options 

# Integer: first N samples = train
{"train": 160}

# Slice objects
{"train": slice(0, 150), "test": slice(150, 200)}

# Explicit indices
{"train": list(range(150)), "test": list(range(150, 200))}
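To make the three forms concrete, the sketch below resolves each one into explicit train and test index arrays. `resolve_partition` is a hypothetical helper written for illustration, not part of NIRS4ALL, and it assumes that samples not assigned to `"train"` fall into the test partition when `"test"` is omitted.

```python
import numpy as np

def resolve_partition(partition_info, n_samples):
    """Hypothetical helper: turn a partition spec into train/test index arrays."""
    def to_indices(spec):
        if isinstance(spec, int):            # integer: first N samples
            return np.arange(spec)
        if isinstance(spec, slice):          # slice over the sample axis
            return np.arange(n_samples)[spec]
        return np.asarray(spec, dtype=int)   # explicit list of indices

    train = to_indices(partition_info["train"])
    if "test" in partition_info:
        test = to_indices(partition_info["test"])
    else:
        # Assumption: everything not used for training becomes the test set
        test = np.setdiff1d(np.arange(n_samples), train)
    return train, test

train_idx, test_idx = resolve_partition({"train": 160}, n_samples=200)
print(len(train_idx), len(test_idx))  # 160 40
```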

Dictionary Configuration 

dataset_dict = {
    "name": "my_dataset",
    "train_x": X_train,
    "train_y": y_train,
    "test_x": X_test,
    "test_y": y_test
}

result = nirs4all.run(pipeline=pipeline, dataset=dataset_dict)
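The arrays in `dataset_dict` can come from any split. A minimal sketch, assuming a single `(X, y)` pair and an 80/20 random split done with NumPy (the ratio, the seed, and the random placeholder data are arbitrary choices, not NIRS4ALL defaults):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 100))   # 200 spectra, 100 wavelengths (placeholder data)
y = rng.normal(size=200)

# Shuffle the sample indices, then cut at 80 %
order = rng.permutation(len(X))
n_train = (len(X) * 8) // 10
train_idx, test_idx = order[:n_train], order[n_train:]

dataset_dict = {
    "name": "my_dataset",
    "train_x": X[train_idx],
    "train_y": y[train_idx],
    "test_x": X[test_idx],
    "test_y": y[test_idx],
}

print(dataset_dict["train_x"].shape, dataset_dict["test_x"].shape)  # (160, 100) (40, 100)
```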

Using SpectroDataset Directly 

For maximum control, create a SpectroDataset object:

from nirs4all.data import SpectroDataset

dataset = SpectroDataset(name="custom")
dataset.add_samples(X_train, indexes={"partition": "train"})
dataset.add_targets(y_train)
dataset.add_samples(X_test, indexes={"partition": "test"})
dataset.add_targets(y_test)

result = nirs4all.run(pipeline=pipeline, dataset=dataset)

U02: Multi-Datasets 

Run the same pipeline on multiple datasets and compare results.

What You’ll Learn 

Specifying multiple datasets as a list
Per-dataset result access
Cross-dataset comparison visualizations

Specifying Multiple Datasets 

Simply pass a list of dataset paths:

data_paths = [
    'sample_data/regression',
    'sample_data/regression_2',
    'sample_data/regression_3'
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset=data_paths,
    name="MultiDataset"
)

Accessing Per-Dataset Results 

for dataset_name, dataset_info in result.per_dataset.items():
    dataset_predictions = dataset_info.get('run_predictions')

    if dataset_predictions:
        top_models = dataset_predictions.top(n=3, rank_metric='rmse')
        print(f"Dataset: {dataset_name}")
        for model in top_models:
            print(f"  {model['model_name']}: RMSE={model.get('rmse', 0):.4f}")

Cross-Dataset Visualization 

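# Import path assumed here; adjust it if your nirs4all version exposes PredictionAnalyzer elsewhere
from nirs4all.visualization.predictions import PredictionAnalyzer

analyzer = PredictionAnalyzer(result.predictions)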

# Compare models across datasets
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="dataset_name",
    display_metric='rmse'
)

# Dataset difficulty comparison
analyzer.plot_candlestick(
    variable="dataset_name",
    display_metric='rmse'
)

Use Cases 

Generalization testing: Does a model work well on different samples?
Dataset comparison: Which datasets are most challenging?
Robust model selection: Find models that work everywhere
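The robust model selection use case can be made concrete with a small table of scores. The sketch below is independent of NIRS4ALL: the records are placeholder values invented for illustration (not real results), and the field names `dataset_name`, `model_name`, and `rmse` simply mirror those used in the snippets above. It pivots the records into a model-by-dataset grid, then picks the model whose worst-case RMSE is lowest and the dataset with the highest mean RMSE.

```python
from collections import defaultdict

# Placeholder scores, invented purely for illustration
records = [
    {"dataset_name": "regression",   "model_name": "PLS",   "rmse": 0.40},
    {"dataset_name": "regression",   "model_name": "Ridge", "rmse": 0.35},
    {"dataset_name": "regression_2", "model_name": "PLS",   "rmse": 0.50},
    {"dataset_name": "regression_2", "model_name": "Ridge", "rmse": 0.90},
    {"dataset_name": "regression_3", "model_name": "PLS",   "rmse": 0.45},
    {"dataset_name": "regression_3", "model_name": "Ridge", "rmse": 0.30},
]

# Pivot: model -> {dataset -> rmse}
grid = defaultdict(dict)
for rec in records:
    grid[rec["model_name"]][rec["dataset_name"]] = rec["rmse"]

# Worst-case (maximum) RMSE per model; lower is better
worst_case = {model: max(scores.values()) for model, scores in grid.items()}
robust_model = min(worst_case, key=worst_case.get)

# Hardest dataset: highest mean RMSE across models
datasets = sorted({rec["dataset_name"] for rec in records})
mean_rmse = {d: sum(grid[m][d] for m in grid) / len(grid) for d in datasets}
hardest_dataset = max(mean_rmse, key=mean_rmse.get)

print(robust_model, hardest_dataset)
```

With these placeholder numbers, Ridge wins on two of the three datasets, but PLS has the lower worst-case error, which is the kind of trade-off a cross-dataset comparison is meant to expose.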

U03: Multi-Source Data 

Work with datasets that have multiple feature sources (e.g., NIR + markers).

What You’ll Learn 

Loading multi-source datasets
Feature augmentation with generator syntax
Basic multi-source handling

Understanding Multi-Source Data 

Multi-source datasets contain features from different instruments or measurement types:

Example: NIR spectrometer + wet chemistry markers
├── Source 1: NIR spectra (1000 wavelengths)
└── Source 2: Lab markers (10 chemical values)

Data Structure 

Multi-source data has multiple X files per partition:

sample_data/multi/
├── Xtrain_1.csv  # Source 1 training features
├── Xtrain_2.csv  # Source 2 training features
├── Xval_1.csv    # Source 1 validation features
├── Xval_2.csv    # Source 2 validation features
└── Ytrain.csv    # Targets
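The naming pattern in this layout (`X<partition>_<source>.csv`) is enough to group files by partition and source. The sketch below shows one way to do that grouping with a regular expression. It illustrates the convention only and is not NIRS4ALL's actual loader; `group_sources` is a hypothetical name.

```python
import re
from collections import defaultdict

# Matches e.g. "Xtrain_1.csv" -> partition "train", source index 1
X_FILE = re.compile(r"^X(?P<partition>[A-Za-z]+)_(?P<source>\d+)\.csv$")

def group_sources(filenames):
    """Return {partition: [feature files ordered by source index]}."""
    found = defaultdict(list)
    for name in filenames:
        match = X_FILE.match(name)
        if match:  # target files such as Ytrain.csv do not match and are skipped
            found[match["partition"]].append((int(match["source"]), name))
    return {part: [n for _, n in sorted(items)] for part, items in found.items()}

files = ["Xval_2.csv", "Xtrain_1.csv", "Ytrain.csv", "Xtrain_2.csv", "Xval_1.csv"]
print(group_sources(files))
```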

Loading Multi-Source Data 

from nirs4all.data import DatasetConfigs

# Build a configuration for the folder; NIRS4ALL automatically detects and loads all sources
dataset_config = DatasetConfigs('sample_data/multi')

# The folder path 'sample_data/multi' can also be passed directly as the dataset
result = nirs4all.run(
    pipeline=pipeline,
    dataset=dataset_config,
    name="MultiSource"
)

Advanced: Source-Specific Processing 

For per-source preprocessing (covered in Developer examples):

# Source branching: different preprocessing per source
{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "markers": [VarianceThreshold()],
}}

# Merge sources
{"merge_sources": "concat"}  # Horizontal concatenation
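Horizontal concatenation means the per-source feature matrices are joined along the feature axis, so each sample keeps a single row. A NumPy sketch using the source sizes from the NIR + markers example (random placeholder data; this shows the operation itself, not NIRS4ALL's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50
X_nir = rng.normal(size=(n_samples, 1000))     # Source 1: NIR spectra
X_markers = rng.normal(size=(n_samples, 10))   # Source 2: lab markers

# Join along axis 1 (features); rows stay aligned sample by sample
X_merged = np.hstack([X_nir, X_markers])
print(X_merged.shape)  # (50, 1010)
```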

U04: Wavelength Handling 

Handle wavelength grids: interpolation, downsampling, and unit conversion.

What You’ll Learn 

Resampler operator for wavelength interpolation
Downsampling to fewer wavelengths
Focusing on specific spectral regions

Why Resample Wavelengths?

Instrument standardization: Different spectrometers have different wavelength grids
Transfer learning: Match wavelengths between training and inference instruments
Dimensionality reduction: Reduce features while preserving spectral shape
Region focus: Analyze specific spectral regions

The Resampler Operator 

from nirs4all.operators.transforms import Resampler
import numpy as np

# Target wavelengths (e.g., from a reference instrument)
target_wavelengths = np.linspace(1000, 2500, 100)

pipeline = [
    Resampler(target_wavelengths=target_wavelengths, method='linear'),
    # ... rest of pipeline
]
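Conceptually, linear resampling evaluates each spectrum at the new wavelengths by interpolating between its neighbouring measured points. The sketch below reproduces that idea with `numpy.interp`. It is a stand-in that shows what the operation does, not the `Resampler` implementation, and `resample_linear` is a hypothetical helper.

```python
import numpy as np

def resample_linear(X, source_wavelengths, target_wavelengths):
    """Linearly interpolate every row of X onto target_wavelengths.

    source_wavelengths must be increasing, as numpy.interp requires.
    """
    return np.vstack([
        np.interp(target_wavelengths, source_wavelengths, spectrum)
        for spectrum in X
    ])

source_wl = np.linspace(1000, 2500, 751)   # original 2 nm grid
target_wl = np.linspace(1000, 2500, 100)   # coarser target grid

# Two toy spectra that are linear in wavelength, so interpolation is exact
X = np.vstack([0.001 * source_wl, 2.0 - 0.0005 * source_wl])
X_resampled = resample_linear(X, source_wl, target_wl)
print(X_resampled.shape)  # (2, 100)
```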

Common Resampling Scenarios 

Match Another Dataset

from nirs4all.data import DatasetConfigs

# Get wavelengths from reference dataset
ref_config = DatasetConfigs("reference_data")
ref_dataset = list(ref_config.iter_datasets())[0]
target_wl = ref_dataset.float_headers(0)

# Resample to match
Resampler(target_wavelengths=target_wl)

Downsample

# Reduce to 50 evenly spaced points
# start_wl and end_wl are the first and last wavelengths of the current grid
target_wl = np.linspace(start_wl, end_wl, 50)
Resampler(target_wavelengths=target_wl)

Focus on Region

# Focus on a region of interest (e.g., 1400-1800 nm)
region_wl = np.linspace(1400, 1800, 100)
Resampler(target_wavelengths=region_wl)

Interpolation Methods 

Method	Description	Use Case
`'linear'`	Linear interpolation	Default, fast
`'cubic'`	Cubic spline	Smooth spectra
`'quadratic'`	Quadratic interpolation	Balance speed/smoothness
`'nearest'`	Nearest neighbor	Discrete features
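The difference between methods is easiest to see on a tiny grid. The sketch below compares linear and nearest-neighbour interpolation using NumPy only; the three-point spectrum is a made-up toy example. Cubic and quadratic interpolation need a spline routine (for example, `scipy.interpolate.interp1d` accepts `kind='quadratic'` and `kind='cubic'`); whether `Resampler` relies on SciPy internally is not stated here.

```python
import numpy as np

source_wl = np.array([1000.0, 1100.0, 1200.0])
spectrum = np.array([0.2, 0.6, 0.4])
target_wl = np.array([1000.0, 1030.0, 1080.0, 1200.0])

# Linear: weighted average of the two neighbouring source points
linear = np.interp(target_wl, source_wl, spectrum)

# Nearest: copy the value at the closest source wavelength
nearest_idx = np.abs(target_wl[:, None] - source_wl[None, :]).argmin(axis=1)
nearest = spectrum[nearest_idx]

print(linear)   # values: 0.2, 0.32, 0.52, 0.4
print(nearest)  # values: 0.2, 0.2, 0.6, 0.4
```

Nearest-neighbour output jumps between source values, while linear output changes gradually, which is why the former suits discrete features and the latter suits continuous spectra.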

U05: Synthetic Data 

Generate synthetic NIRS spectra for testing and prototyping.

What You’ll Learn 

Using nirs4all.generate() for quick dataset creation
Convenience functions for regression and classification
Configuring spectral complexity and components

Basic Generation 

import nirs4all

# Generate a SpectroDataset
dataset = nirs4all.generate(n_samples=500, random_state=42)

# Or get raw numpy arrays
X, y = nirs4all.generate(n_samples=300, as_dataset=False)

Regression Datasets 

dataset = nirs4all.generate.regression(
    n_samples=500,
    target_range=(0, 100),      # Scale targets
    target_component=0,          # Which component as target
    complexity="realistic",      # Noise level
    random_state=42
)

Classification Datasets 

# Binary classification
dataset = nirs4all.generate.classification(
    n_samples=400,
    n_classes=2,
    class_separation=2.0,  # Well-separated classes
    random_state=42
)

# Imbalanced multiclass
dataset = nirs4all.generate.classification(
    n_samples=600,
    n_classes=3,
    class_weights=[0.5, 0.3, 0.2],
    random_state=42
)
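`class_weights` describes the share of samples per class. As a rough illustration of what such weights imply (an assumption about the semantics, not the generator's actual code), the sketch below turns weights into per-class counts and a shuffled label vector. `labels_from_weights` is a hypothetical helper.

```python
import numpy as np

def labels_from_weights(n_samples, class_weights, random_state=None):
    """Build a shuffled label vector whose class shares follow class_weights."""
    weights = np.asarray(class_weights, dtype=float)
    weights = weights / weights.sum()                    # normalise defensively
    counts = np.round(weights * n_samples).astype(int)
    counts[0] += n_samples - counts.sum()                # absorb rounding drift in class 0
    labels = np.repeat(np.arange(len(weights)), counts)
    rng = np.random.default_rng(random_state)
    rng.shuffle(labels)                                  # in-place shuffle
    return labels

labels = labels_from_weights(600, [0.5, 0.3, 0.2], random_state=42)
print(np.bincount(labels))  # counts per class: 300, 180, 120
```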

Complexity Levels 

Level	Description	Use Case
`"simple"`	Minimal noise	Unit tests, fast prototyping
`"realistic"`	Typical NIR noise/scatter	Development, validation
`"complex"`	High noise, artifacts	Robustness testing

Specifying Chemical Components 

dataset = nirs4all.generate(
    n_samples=400,
    components=["water", "protein", "lipid", "starch"],
    complexity="realistic"
)

Available components: water, protein, lipid, starch, cellulose, chlorophyll, oil, nitrogen_compound
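One common way to think about component-based synthetic spectra is as a linear mixture: each component contributes an absorption profile scaled by its concentration, plus noise. The sketch below illustrates that idea with Gaussian bands. It is not the NIRS4ALL generator; the band centres and widths are rough, illustrative values, and `gaussian_band` is a hypothetical helper.

```python
import numpy as np

def gaussian_band(wavelengths, center, width):
    """Gaussian absorption band with unit height."""
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

rng = np.random.default_rng(42)
wavelengths = np.linspace(1000, 2500, 256)

# Illustrative pure-component profiles (approximate band positions, in nm)
profiles = np.vstack([
    gaussian_band(wavelengths, 1450, 40) + gaussian_band(wavelengths, 1940, 50),  # "water"
    gaussian_band(wavelengths, 2180, 45),                                          # "protein"
    gaussian_band(wavelengths, 1730, 35) + gaussian_band(wavelengths, 2310, 40),  # "lipid"
])

n_samples = 400
# Random concentrations per sample; each row sums to 1
concentrations = rng.dirichlet(alpha=np.ones(len(profiles)), size=n_samples)

# Linear mixing plus small additive noise
noise = rng.normal(scale=0.005, size=(n_samples, len(wavelengths)))
X = concentrations @ profiles + noise
y = concentrations[:, 0]   # use the first component as the target

print(X.shape, y.shape)  # (400, 256) (400,)
```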

Direct Pipeline Integration 

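import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

# Generate and train in one call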
result = nirs4all.run(
    pipeline=[StandardScaler(), PLSRegression(n_components=10)],
    dataset=nirs4all.generate.regression(n_samples=600, complexity="realistic"),
    name="SyntheticTest"
)

U06: Advanced Synthetic Data 

Master the full synthetic data generation API for complex scenarios.

What You’ll Learn 

Using SyntheticDatasetBuilder for full control
Metadata generation (groups, repetitions)
Multi-source datasets
Batch effects simulation
Non-linear target complexity for realistic benchmarks
Exporting to files
Matching real data characteristics

SyntheticDatasetBuilder 

For maximum control over synthetic data generation:

from nirs4all.synthesis import SyntheticDatasetBuilder

# Full control over generation
builder = SyntheticDatasetBuilder(
    n_samples=500,
    wavelength_range=(1000, 2500),
    n_wavelengths=256,
    components=["water", "protein", "lipid"],
)

# Add batch effects
builder.add_batch_effect(n_batches=3, intensity=0.1)

# Add metadata
builder.add_group_metadata(n_groups=5)

# Generate dataset
dataset = builder.build()
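A batch effect is a systematic shift shared by all samples measured in the same batch. As a rough mental model (not the builder's implementation; how `intensity` is applied internally is an assumption here), the sketch below gives each batch its own baseline offset and gain, both scaled by an intensity factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths, n_batches, intensity = 500, 256, 3, 0.1

# Placeholder spectra and a random batch assignment for each sample
X = rng.normal(loc=1.0, scale=0.05, size=(n_samples, n_wavelengths))
batch_ids = rng.integers(0, n_batches, size=n_samples)

# One additive offset and one multiplicative gain per batch
offsets = rng.normal(scale=intensity, size=n_batches)
gains = 1.0 + rng.normal(scale=intensity, size=n_batches)

# Broadcast each sample's batch parameters across its wavelengths
X_batched = X * gains[batch_ids, None] + offsets[batch_ids, None]

# Per-batch mean absorbance now differs between batches
for b in range(n_batches):
    print(b, round(float(X_batched[batch_ids == b].mean()), 3))
```

Keeping `batch_ids` alongside the spectra is what makes it possible to test batch-aware validation schemes later, since the split can then hold out whole batches.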

Multi-Source Synthetic Data 

# Create multi-source datasets
builder = SyntheticDatasetBuilder(n_samples=300)
builder.add_source("NIR", wavelength_range=(1000, 2500), n_wavelengths=256)
builder.add_source("markers", n_features=10, feature_type="numerical")

dataset = builder.build()

Exporting to Files 

# Export for loader testing
dataset.to_csv("synthetic_data/")
# Creates: Xtrain.csv, Ytrain.csv, Xtest.csv, Ytest.csv
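To check that exported files round-trip, you can write arrays to CSV and read them back. The sketch below does this with NumPy in a temporary directory, using the file names listed in the comment above. The `;` delimiter and the absence of a header row are assumptions, so adjust them to match what `to_csv` writes in your version.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
arrays = {
    "Xtrain.csv": rng.normal(size=(8, 5)),
    "Ytrain.csv": rng.normal(size=(8, 1)),
    "Xtest.csv": rng.normal(size=(2, 5)),
    "Ytest.csv": rng.normal(size=(2, 1)),
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "synthetic_data"
    out_dir.mkdir()
    for name, array in arrays.items():
        np.savetxt(out_dir / name, array, delimiter=";")

    # Read everything back before the temporary directory is removed
    reloaded = {
        name: np.loadtxt(out_dir / name, delimiter=";", ndmin=2)
        for name in arrays
    }

round_trip_ok = all(
    arrays[name].shape == reloaded[name].shape and np.allclose(arrays[name], reloaded[name])
    for name in arrays
)
print(round_trip_ok)
```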

Running These Examples 

cd examples

# Run all data handling examples
./run.sh -n "U0*.py" -c user

# Run with plots
python user/02_data_handling/U05_synthetic_data.py --plots --show

Next Steps 

After mastering data handling:

Preprocessing: Apply NIRS-specific transformations
Models: Compare different model architectures
Cross-Validation: Choose the right validation strategy