Data Handling Examples

This section covers the ways to load, configure, and work with data in NIRS4ALL, from plain NumPy arrays to complex multi-source datasets, and shows which input format suits which situation.

Overview

Example  Topic                    Difficulty  Duration
U01      Flexible Inputs          ★☆☆☆☆       ~2 min
U02      Multi-Datasets           ★★☆☆☆       ~3 min
U03      Multi-Source Data        ★★★☆☆       ~3 min
U04      Wavelength Handling      ★★☆☆☆       ~3 min
U05      Synthetic Data           ★★☆☆☆       ~2 min
U06      Advanced Synthetic Data  ★★★☆☆       ~5 min


U01: Flexible Inputs

Demonstrates all possible input formats for datasets and pipelines.

What You’ll Learn

  • Direct numpy array input with (X, y) tuples

  • Dictionary-based dataset configuration

  • Partition info specification

  • SpectroDataset object usage

Input Format Overview

NIRS4ALL accepts datasets in multiple formats:

Format          Example                                      Best For
Folder path     "sample_data/regression"                     File-based datasets
Tuple           (X, y) or (X, y, partition_info)             Quick experiments
Dictionary      {"train_x": X_train, "test_x": X_test, ...}  Explicit splits
DatasetConfigs  DatasetConfigs("path")                       Full control
SpectroDataset  SpectroDataset(name="my_data")               Programmatic access

Simplest Approach: Direct Arrays

import numpy as np
from sklearn.linear_model import Ridge

import nirs4all

# Generate or load your data
X = np.random.randn(200, 100)
y = np.random.randn(200)

# Partition info: first 160 samples for training
partition_info = {"train": 160}

# Run directly with tuple
result = nirs4all.run(
    pipeline=[Ridge(alpha=1.0)],
    dataset=(X, y, partition_info),
    name="DirectArrays"
)

Partition Info Options

# Integer: first N samples = train
{"train": 160}

# Slice objects
{"train": slice(0, 150), "test": slice(150, 200)}

# Explicit indices
{"train": list(range(150)), "test": list(range(150, 200))}
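
The three forms above all describe the same kind of split. As a rough illustration of how such specs can be normalized into explicit row indices, here is a minimal pure-Python sketch. `resolve_partition` and `resolve_partitions` are hypothetical helpers, not part of the NIRS4ALL API, and the rule that rows outside `train` fall into `test` when no `test` key is given is an assumption made for this example.

```python
def resolve_partition(spec, n_samples):
    """Turn one partition spec into a list of row indices (hypothetical helper)."""
    if isinstance(spec, int):
        # Integer: the first `spec` rows
        return list(range(spec))
    if isinstance(spec, slice):
        # Slice: expand against the dataset length
        return list(range(*spec.indices(n_samples)))
    # Otherwise assume an explicit iterable of indices
    return sorted(int(i) for i in spec)


def resolve_partitions(partition_info, n_samples):
    """Resolve train/test indices; unclaimed rows go to test (assumption)."""
    train = resolve_partition(partition_info["train"], n_samples)
    if "test" in partition_info:
        test = resolve_partition(partition_info["test"], n_samples)
    else:
        train_set = set(train)
        test = [i for i in range(n_samples) if i not in train_set]
    overlap = set(train) & set(test)
    if overlap:
        raise ValueError(f"{len(overlap)} rows appear in both train and test")
    return train, test


train_a, test_a = resolve_partitions({"train": 160}, 200)
train_b, test_b = resolve_partitions(
    {"train": slice(0, 150), "test": slice(150, 200)}, 200
)
train_c, test_c = resolve_partitions(
    {"train": list(range(150)), "test": list(range(150, 200))}, 200
)

print(len(train_a), len(test_a))              # 160 40
print(train_b == train_c, test_b == test_c)   # True True
```

The overlap check is the useful part in practice: explicit index lists make it easy to leak a test row into training by accident.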

Dictionary Configuration

dataset_dict = {
    "name": "my_dataset",
    "train_x": X_train,
    "train_y": y_train,
    "test_x": X_test,
    "test_y": y_test
}

result = nirs4all.run(pipeline=pipeline, dataset=dataset_dict)
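
The dictionary above assumes `X_train`, `X_test`, `y_train`, and `y_test` already exist. One way to produce them from a single array pair is a shuffled hold-out split. The sketch below uses only NumPy; the 80/20 ratio and the random data are illustrative choices, not NIRS4ALL defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # 200 spectra, 100 wavelengths
y = rng.normal(size=200)

# Shuffle row indices once, then hold out the last 20% as the test set
order = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = order[:n_train], order[n_train:]

dataset_dict = {
    "name": "my_dataset",
    "train_x": X[train_idx],
    "train_y": y[train_idx],
    "test_x": X[test_idx],
    "test_y": y[test_idx],
}

# Sanity checks worth running before handing the dict to a pipeline
assert dataset_dict["train_x"].shape[0] == dataset_dict["train_y"].shape[0]
assert dataset_dict["test_x"].shape[0] == dataset_dict["test_y"].shape[0]
assert dataset_dict["train_x"].shape[1] == dataset_dict["test_x"].shape[1]
```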

Using SpectroDataset Directly

For maximum control, create a SpectroDataset object:

from nirs4all.data import SpectroDataset

dataset = SpectroDataset(name="custom")
dataset.add_samples(X_train, indexes={"partition": "train"})
dataset.add_targets(y_train)
dataset.add_samples(X_test, indexes={"partition": "test"})
dataset.add_targets(y_test)

result = nirs4all.run(pipeline=pipeline, dataset=dataset)

U02: Multi-Datasets

Run the same pipeline on multiple datasets and compare results.

What You’ll Learn

  • Specifying multiple datasets as a list

  • Per-dataset result access

  • Cross-dataset comparison visualizations

Specifying Multiple Datasets

Simply pass a list of dataset paths:

data_paths = [
    'sample_data/regression',
    'sample_data/regression_2',
    'sample_data/regression_3'
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset=data_paths,
    name="MultiDataset"
)

Accessing Per-Dataset Results

for dataset_name, dataset_info in result.per_dataset.items():
    dataset_predictions = dataset_info.get('run_predictions')

    if dataset_predictions:
        top_models = dataset_predictions.top(n=3, rank_metric='rmse')
        print(f"Dataset: {dataset_name}")
        for model in top_models:
            print(f"  {model['model_name']}: RMSE={model.get('rmse', 0):.4f}")

Cross-Dataset Visualization

analyzer = PredictionAnalyzer(result.predictions)

# Compare models across datasets
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="dataset_name",
    display_metric='rmse'
)

# Dataset difficulty comparison
analyzer.plot_candlestick(
    variable="dataset_name",
    display_metric='rmse'
)

Use Cases

  • Generalization testing: Does a model work well on different samples?

  • Dataset comparison: Which datasets are most challenging?

  • Robust model selection: Find models that work everywhere
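
One simple way to make "works everywhere" concrete is to rank models within each dataset and average the ranks. The sketch below is independent of the NIRS4ALL result objects: the RMSE values are made up for illustration, and `mean_ranks` is a hypothetical helper.

```python
# Hypothetical per-dataset RMSE values (lower is better); not real results
rmse = {
    "regression":   {"PLS": 0.42, "Ridge": 0.47, "RandomForest": 0.55},
    "regression_2": {"PLS": 0.61, "Ridge": 0.58, "RandomForest": 0.70},
    "regression_3": {"PLS": 0.35, "Ridge": 0.39, "RandomForest": 0.33},
}


def mean_ranks(scores):
    """Average each model's rank (1 = best) over all datasets."""
    totals = {}
    for per_model in scores.values():
        ordered = sorted(per_model, key=per_model.get)  # ascending RMSE
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
    return {model: total / len(scores) for model, total in totals.items()}


ranks = mean_ranks(rmse)
best = min(ranks, key=ranks.get)
for model, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(f"{model}: mean rank {rank:.2f}")
```

Ranking within each dataset keeps an easy dataset with small errors from dominating the comparison, which a plain average of RMSE values would not.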


U03: Multi-Source Data

Work with datasets that have multiple feature sources (e.g., NIR + markers).

What You’ll Learn

  • Loading multi-source datasets

  • Feature augmentation with generator syntax

  • Basic multi-source handling

Understanding Multi-Source Data

Multi-source datasets contain features from different instruments or measurement types:

Example: NIR spectrometer + wet chemistry markers
├── Source 1: NIR spectra (1000 wavelengths)
└── Source 2: Lab markers (10 chemical values)

Data Structure

Multi-source data has multiple X files per partition:

sample_data/multi/
├── Xtrain_1.csv  # Source 1 training features
├── Xtrain_2.csv  # Source 2 training features
├── Xval_1.csv    # Source 1 validation features
├── Xval_2.csv    # Source 2 validation features
└── Ytrain.csv    # Targets

Loading Multi-Source Data

from nirs4all.data import DatasetConfigs

# Point DatasetConfigs at the folder; all sources found there are detected and loaded
dataset_config = DatasetConfigs('sample_data/multi')

# Passing the folder path string 'sample_data/multi' directly works as well
result = nirs4all.run(
    pipeline=pipeline,
    dataset=dataset_config,
    name="MultiSource"
)

Advanced: Source-Specific Processing

For per-source preprocessing (covered in Developer examples):

# Source branching: different preprocessing per source
{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "markers": [VarianceThreshold()],
}}

# Merge sources
{"merge_sources": "concat"}  # Horizontal concatenation
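
To make "horizontal concatenation" concrete, the NumPy sketch below processes two feature blocks separately and then joins them column-wise. It only illustrates the idea and does not use the NIRS4ALL operators: `standardize` is a stand-in for per-source preprocessing, and the shapes mirror the NIR + markers example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50
nir = rng.normal(size=(n_samples, 1000))     # Source 1: NIR spectra
markers = rng.normal(size=(n_samples, 10))   # Source 2: lab markers


def standardize(block):
    """Center and scale each column (stand-in for per-source preprocessing)."""
    std = block.std(axis=0)
    std[std == 0] = 1.0   # avoid dividing by zero on constant columns
    return (block - block.mean(axis=0)) / std


# Process each source on its own, then join the blocks column-wise
merged = np.hstack([standardize(nir), standardize(markers)])
print(merged.shape)  # (50, 1010)
```

Scaling each source before merging matters here because the 1000 spectral columns would otherwise outweigh the 10 marker columns in any distance-based or variance-based model.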

U04: Wavelength Handling

Handle wavelength grids: interpolation, downsampling, and unit conversion.

What You’ll Learn

  • Resampler operator for wavelength interpolation

  • Downsampling to fewer wavelengths

  • Focusing on specific spectral regions

Why Resample Wavelengths?

  • Instrument standardization: Different spectrometers have different wavelength grids

  • Transfer learning: Match wavelengths between training and inference instruments

  • Dimensionality reduction: Reduce features while preserving spectral shape

  • Region focus: Analyze specific spectral regions

The Resampler Operator

from nirs4all.operators.transforms import Resampler
import numpy as np

# Target wavelengths (e.g., from a reference instrument)
target_wavelengths = np.linspace(1000, 2500, 100)

pipeline = [
    Resampler(target_wavelengths=target_wavelengths, method='linear'),
    # ... rest of pipeline
]
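
For intuition about what linear resampling does to the data, here is a minimal NumPy sketch based on `np.interp`. It is not the `Resampler` implementation; `resample_linear` is a hypothetical helper, and refusing to extrapolate is a design choice made for this example.

```python
import numpy as np


def resample_linear(X, source_wavelengths, target_wavelengths):
    """Linearly interpolate each spectrum (row of X) onto a new wavelength grid."""
    source_wavelengths = np.asarray(source_wavelengths, dtype=float)
    target_wavelengths = np.asarray(target_wavelengths, dtype=float)
    if np.any(np.diff(source_wavelengths) <= 0):
        raise ValueError("source_wavelengths must be strictly increasing")
    # np.interp clamps to the edge values outside the source range,
    # so reject targets that would need extrapolation
    if (target_wavelengths.min() < source_wavelengths[0]
            or target_wavelengths.max() > source_wavelengths[-1]):
        raise ValueError("target wavelengths fall outside the measured range")
    return np.vstack(
        [np.interp(target_wavelengths, source_wavelengths, row) for row in X]
    )


source_wl = np.linspace(1000, 2500, 751)   # 2 nm spacing
target_wl = np.linspace(1000, 2500, 100)
X = np.vstack([0.001 * source_wl + 0.5, np.sin(source_wl / 200.0)])

X_resampled = resample_linear(X, source_wl, target_wl)
print(X_resampled.shape)  # (2, 100)
```

A straight-line spectrum is reproduced exactly by linear interpolation, while a curved one picks up a small error that shrinks as the source grid gets finer.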

Common Resampling Scenarios

Match Another Dataset

# Get wavelengths from reference dataset
ref_config = DatasetConfigs("reference_data")
ref_dataset = list(ref_config.iter_datasets())[0]
target_wl = ref_dataset.float_headers(0)

# Resample to match
Resampler(target_wavelengths=target_wl)

Downsample

# Reduce to 50 evenly spaced points
# start_wl / end_wl: first and last wavelength of the original grid
target_wl = np.linspace(start_wl, end_wl, 50)
Resampler(target_wavelengths=target_wl)

Focus on Region

# Focus on a spectral region of interest (e.g., 1400-1800 nm)
region_wl = np.linspace(1400, 1800, 100)
Resampler(target_wavelengths=region_wl)

Interpolation Methods

Method       Description              Use Case
'linear'     Linear interpolation     Default, fast
'cubic'      Cubic spline             Smooth spectra
'quadratic'  Quadratic interpolation  Balance speed/smoothness
'nearest'    Nearest neighbor         Discrete features
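
The difference between 'linear' and 'nearest' is easiest to see on a tiny spectrum. The sketch below computes both by hand with NumPy and does not call the `Resampler`; the wavelengths and absorbance values are invented for illustration.

```python
import numpy as np

source_wl = np.array([1000.0, 1010.0, 1020.0, 1030.0])
spectrum = np.array([0.10, 0.20, 0.40, 0.30])
target_wl = np.array([1004.0, 1016.0, 1027.0])

# Linear: weighted average of the two neighbouring source points
linear = np.interp(target_wl, source_wl, spectrum)

# Nearest: copy the value at the closest source wavelength
nearest_idx = np.abs(target_wl[:, None] - source_wl[None, :]).argmin(axis=1)
nearest = spectrum[nearest_idx]

print(linear)   # approximately [0.14 0.32 0.33]
print(nearest)  # [0.1 0.4 0.3]
```

Nearest-neighbor resampling never creates values that were not measured, which is why it suits discrete features, whereas linear interpolation follows the local slope of the spectrum.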


U05: Synthetic Data

Generate synthetic NIRS spectra for testing and prototyping.

What You’ll Learn

  • Using nirs4all.generate() for quick dataset creation

  • Convenience functions for regression and classification

  • Configuring spectral complexity and components

Basic Generation

import nirs4all

# Generate a SpectroDataset
dataset = nirs4all.generate(n_samples=500, random_state=42)

# Or get raw numpy arrays
X, y = nirs4all.generate(n_samples=300, as_dataset=False)

Regression Datasets

dataset = nirs4all.generate.regression(
    n_samples=500,
    target_range=(0, 100),      # Scale targets
    target_component=0,          # Which component as target
    complexity="realistic",      # Noise level
    random_state=42
)

Classification Datasets

# Binary classification
dataset = nirs4all.generate.classification(
    n_samples=400,
    n_classes=2,
    class_separation=2.0,  # Well-separated classes
    random_state=42
)

# Imbalanced multiclass
dataset = nirs4all.generate.classification(
    n_samples=600,
    n_classes=3,
    class_weights=[0.5, 0.3, 0.2],
    random_state=42
)
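
To see what `class_weights=[0.5, 0.3, 0.2]` implies for 600 samples, the sketch below converts weights into per-class counts with largest-remainder rounding. This is only an illustration of the arithmetic: `class_counts` is a hypothetical helper, and the generator may well draw labels at random rather than allocate exact counts.

```python
def class_counts(n_samples, class_weights):
    """Split n_samples across classes in proportion to class_weights."""
    total = sum(class_weights)
    exact = [n_samples * w / total for w in class_weights]
    counts = [int(x) for x in exact]   # floor of each share
    # Hand leftover samples to the classes with the largest remainders
    leftovers = n_samples - sum(counts)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftovers]:
        counts[i] += 1
    return counts


print(class_counts(600, [0.5, 0.3, 0.2]))  # [300, 180, 120]
print(class_counts(10, [1, 1, 1]))
```

With 120 samples in the smallest class, stratified splitting is still comfortable; much smaller minority classes would call for more samples or fewer folds.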

Complexity Levels

Level        Description                Use Case
"simple"     Minimal noise              Unit tests, fast prototyping
"realistic"  Typical NIR noise/scatter  Development, validation
"complex"    High noise, artifacts      Robustness testing

Specifying Chemical Components

dataset = nirs4all.generate(
    n_samples=400,
    components=["water", "protein", "lipid", "starch"],
    complexity="realistic"
)

Available components: water, protein, lipid, starch, cellulose, chlorophyll, oil, nitrogen_compound

Direct Pipeline Integration

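from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

import nirs4all

# Generate and train in one call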
result = nirs4all.run(
    pipeline=[StandardScaler(), PLSRegression(n_components=10)],
    dataset=nirs4all.generate.regression(n_samples=600, complexity="realistic"),
    name="SyntheticTest"
)

U06: Advanced Synthetic Data

Master the full synthetic data generation API for complex scenarios.

What You’ll Learn

  • Using SyntheticDatasetBuilder for full control

  • Metadata generation (groups, repetitions)

  • Multi-source datasets

  • Batch effects simulation

  • Non-linear target complexity for realistic benchmarks

  • Exporting to files

  • Matching real data characteristics

SyntheticDatasetBuilder

For maximum control over synthetic data generation:

from nirs4all.synthesis import SyntheticDatasetBuilder

# Full control over generation
builder = SyntheticDatasetBuilder(
    n_samples=500,
    wavelength_range=(1000, 2500),
    n_wavelengths=256,
    components=["water", "protein", "lipid"],
)

# Add batch effects
builder.add_batch_effect(n_batches=3, intensity=0.1)

# Add metadata
builder.add_group_metadata(n_groups=5)

# Generate dataset
dataset = builder.build()
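
A batch effect is a systematic shift shared by all samples measured in the same session or on the same instrument. The NumPy sketch below shows one simple model of it, a per-batch gain plus a baseline offset. It is an assumption about what such a simulation could look like, not the behavior of `add_batch_effect` in the builder; the function defined here is a standalone illustration.

```python
import numpy as np


def add_batch_effect(X, n_batches, intensity, rng):
    """Apply a per-batch gain and baseline offset to spectra (illustrative model)."""
    n_samples = X.shape[0]
    # Assign samples to batches in contiguous, nearly equal blocks
    batch_ids = np.arange(n_samples) * n_batches // n_samples
    gains = 1.0 + intensity * rng.normal(size=n_batches)   # multiplicative drift
    offsets = intensity * rng.normal(size=n_batches)       # additive baseline shift
    X_shifted = X * gains[batch_ids, None] + offsets[batch_ids, None]
    return X_shifted, batch_ids


rng = np.random.default_rng(42)
X = np.ones((9, 5))
X_batched, batch_ids = add_batch_effect(X, n_batches=3, intensity=0.1, rng=rng)
print(batch_ids)  # [0 0 0 1 1 1 2 2 2]
```

Keeping the batch labels alongside the spectra makes it possible to use them as groups in cross-validation, so that a model is always tested on a batch it has not seen.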

Multi-Source Synthetic Data

# Create multi-source datasets
builder = SyntheticDatasetBuilder(n_samples=300)
builder.add_source("NIR", wavelength_range=(1000, 2500), n_wavelengths=256)
builder.add_source("markers", n_features=10, feature_type="numerical")

dataset = builder.build()

Exporting to Files

# Export for loader testing
dataset.to_csv("synthetic_data/")
# Creates: Xtrain.csv, Ytrain.csv, Xtest.csv, Ytest.csv
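
When testing a loader against exported files, it helps to confirm the folder layout and that the values survive a round trip. The sketch below writes four arrays under the file names listed above using NumPy and reads one back. The comma delimiter and the absence of a header row are assumptions made for this example and may differ from what `to_csv` produces.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
parts = {
    "Xtrain": rng.normal(size=(8, 5)),
    "Ytrain": rng.normal(size=(8, 1)),
    "Xtest": rng.normal(size=(2, 5)),
    "Ytest": rng.normal(size=(2, 1)),
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "synthetic_data"
    out_dir.mkdir()
    # One CSV file per partition and role, no header row
    for name, array in parts.items():
        np.savetxt(out_dir / f"{name}.csv", array, delimiter=",")
    written = sorted(p.name for p in out_dir.glob("*.csv"))
    reloaded = np.loadtxt(out_dir / "Xtrain.csv", delimiter=",")

print(written)
print(np.allclose(reloaded, parts["Xtrain"]))
```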

Running These Examples

cd examples

# Run all data handling examples
./run.sh -n "U0*.py" -c user

# Run with plots
python user/02_data_handling/U05_synthetic_data.py --plots --show

Next Steps

After mastering data handling:

  • Preprocessing: Apply NIRS-specific transformations

  • Models: Compare different model architectures

  • Cross-Validation: Choose the right validation strategy