Data Handling Examples
This section covers all the ways to load, configure, and work with data in NIRS4ALL. From simple numpy arrays to complex multi-source datasets, you’ll learn the flexible input options available.
Overview
| Example | Topic | Difficulty | Duration |
|---|---|---|---|
| U01 | Flexible Inputs | ★☆☆☆☆ | ~2 min |
| U02 | Multi-Datasets | ★★☆☆☆ | ~3 min |
| U03 | Multi-Source Data | ★★★☆☆ | ~3 min |
| U04 | Wavelength Handling | ★★☆☆☆ | ~3 min |
| U05 | Synthetic Data | ★★☆☆☆ | ~2 min |
| U06 | Advanced Synthetic Data | ★★★☆☆ | ~5 min |
U01: Flexible Inputs
Demonstrates all possible input formats for datasets and pipelines.
What You’ll Learn
Direct numpy array input with (X, y) tuples
Dictionary-based dataset configuration
Partition info specification
SpectroDataset object usage
Input Format Overview
NIRS4ALL accepts datasets in multiple formats:
| Format | Example | Best For |
|---|---|---|
| Folder path |  | File-based datasets |
| Tuple |  | Quick experiments |
| Dictionary |  | Explicit splits |
| DatasetConfigs |  | Full control |
| SpectroDataset |  | Programmatic access |
Simplest Approach: Direct Arrays
import numpy as np
import nirs4all
from sklearn.linear_model import Ridge
# Generate or load your data
X = np.random.randn(200, 100)
y = np.random.randn(200)
# Partition info: first 160 samples for training
partition_info = {"train": 160}
# Run directly with tuple
result = nirs4all.run(
    pipeline=[Ridge(alpha=1.0)],
    dataset=(X, y, partition_info),
    name="DirectArrays"
)
Partition Info Options
# Integer: first N samples = train
{"train": 160}
# Slice objects
{"train": slice(0, 150), "test": slice(150, 200)}
# Explicit indices
{"train": list(range(150)), "test": list(range(150, 200))}
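The three forms above all describe which rows belong to each partition. The sketch below shows one way such specifications could be turned into index arrays with plain NumPy. `resolve_partition` is a hypothetical helper written for illustration, not a NIRS4ALL function, and the rule that samples left over after an integer `train` count become the test set is an assumption.

```python
import numpy as np


def resolve_partition(spec, n_samples):
    """Hypothetical helper: map a partition spec to {name: index array}."""
    all_idx = np.arange(n_samples)
    resolved = {}
    for name, value in spec.items():
        if isinstance(value, int):
            # Integer: take the first N samples
            resolved[name] = all_idx[:value]
        elif isinstance(value, slice):
            # Slice object: apply it to the full index range
            resolved[name] = all_idx[value]
        else:
            # Explicit indices: any sequence of integers
            resolved[name] = np.asarray(value, dtype=int)
    # Assumption: with only a "train" entry, the remaining rows form the test set
    if "test" not in resolved and "train" in resolved:
        resolved["test"] = np.setdiff1d(all_idx, resolved["train"])
    return resolved


by_count = resolve_partition({"train": 160}, n_samples=200)
by_slice = resolve_partition(
    {"train": slice(0, 150), "test": slice(150, 200)}, n_samples=200
)
by_index = resolve_partition(
    {"train": list(range(150)), "test": list(range(150, 200))}, n_samples=200
)
print(len(by_count["train"]), len(by_count["test"]))
```

Under these assumptions the slice and explicit-index forms resolve to identical partitions, so the choice between them is a matter of convenience.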
Dictionary Configuration
dataset_dict = {
    "name": "my_dataset",
    "train_x": X_train,
    "train_y": y_train,
    "test_x": X_test,
    "test_y": y_test
}
result = nirs4all.run(pipeline=pipeline, dataset=dataset_dict)
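The dictionary form expects the train and test arrays to exist already. A minimal sketch of producing them from a single `(X, y)` pair with a shuffled 80/20 split is shown below. The split ratio, the seed, and the random data are arbitrary choices for illustration; only the dictionary keys come from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # 200 spectra, 100 wavelengths
y = rng.normal(size=200)

# Shuffle the row indices, then cut at 80% for training
order = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = order[:n_train], order[n_train:]

dataset_dict = {
    "name": "my_dataset",
    "train_x": X[train_idx],
    "train_y": y[train_idx],
    "test_x": X[test_idx],
    "test_y": y[test_idx],
}
print({key: getattr(value, "shape", value) for key, value in dataset_dict.items()})
```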
Using SpectroDataset Directly
For maximum control, create a SpectroDataset object:
from nirs4all.data import SpectroDataset
dataset = SpectroDataset(name="custom")
dataset.add_samples(X_train, indexes={"partition": "train"})
dataset.add_targets(y_train)
dataset.add_samples(X_test, indexes={"partition": "test"})
dataset.add_targets(y_test)
result = nirs4all.run(pipeline=pipeline, dataset=dataset)
U02: Multi-Datasets
Run the same pipeline on multiple datasets and compare results.
What You’ll Learn
Specifying multiple datasets as a list
Per-dataset result access
Cross-dataset comparison visualizations
Specifying Multiple Datasets
Simply pass a list of dataset paths:
data_paths = [
    'sample_data/regression',
    'sample_data/regression_2',
    'sample_data/regression_3'
]
result = nirs4all.run(
    pipeline=pipeline,
    dataset=data_paths,
    name="MultiDataset"
)
Accessing Per-Dataset Results
for dataset_name, dataset_info in result.per_dataset.items():
    dataset_predictions = dataset_info.get('run_predictions')
    if dataset_predictions:
        top_models = dataset_predictions.top(n=3, rank_metric='rmse')
        print(f"Dataset: {dataset_name}")
        for model in top_models:
            print(f"  {model['model_name']}: RMSE={model.get('rmse', 0):.4f}")
Cross-Dataset Visualization
analyzer = PredictionAnalyzer(result.predictions)
# Compare models across datasets
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="dataset_name",
    display_metric='rmse'
)
# Dataset difficulty comparison
analyzer.plot_candlestick(
    variable="dataset_name",
    display_metric='rmse'
)
Use Cases
Generalization testing: Does a model work well on different samples?
Dataset comparison: Which datasets are most challenging?
Robust model selection: Find models that work everywhere
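The idea behind robust model selection can be sketched without NIRS4ALL: fit the same candidate models on several datasets, collect a models × datasets RMSE table, and prefer the candidate with the best worst-case score. The example uses closed-form ridge regression on random data with three noise levels. The datasets, the `alpha` grid, and the worst-case criterion are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def make_dataset(noise, seed, n=200, p=20):
    """Random linear dataset; `noise` controls its difficulty."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    w = rng.normal(size=p)
    y = X @ w + noise * rng.normal(size=n)
    # First 150 rows for training, the rest for testing
    return X[:150], y[:150], X[150:], y[150:]


datasets = {
    "easy": make_dataset(0.1, seed=0),
    "medium": make_dataset(0.5, seed=1),
    "hard": make_dataset(1.0, seed=2),
}
alphas = [0.1, 10.0]

# rmse[i, j] = test RMSE of alphas[i] on the j-th dataset
rmse = np.zeros((len(alphas), len(datasets)))
for i, alpha in enumerate(alphas):
    for j, (X_tr, y_tr, X_te, y_te) in enumerate(datasets.values()):
        w = ridge_fit(X_tr, y_tr, alpha)
        rmse[i, j] = np.sqrt(np.mean((X_te @ w - y_te) ** 2))

# Robust choice: the candidate whose worst dataset score is lowest
best_alpha = alphas[int(np.argmin(rmse.max(axis=1)))]
print(rmse.round(3), "best alpha:", best_alpha)
```

Other criteria, such as the mean RMSE or the mean rank across datasets, fit the same table and may suit cases where one dataset is known to be an outlier.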
U03: Multi-Source Data
Work with datasets that have multiple feature sources (e.g., NIR + markers).
What You’ll Learn
Loading multi-source datasets
Feature augmentation with generator syntax
Basic multi-source handling
Understanding Multi-Source Data
Multi-source datasets contain features from different instruments or measurement types:
Example: NIR spectrometer + wet chemistry markers
├── Source 1: NIR spectra (1000 wavelengths)
└── Source 2: Lab markers (10 chemical values)
Data Structure
Multi-source data has multiple X files per partition:
sample_data/multi/
├── Xtrain_1.csv # Source 1 training features
├── Xtrain_2.csv # Source 2 training features
├── Xval_1.csv # Source 1 validation features
├── Xval_2.csv # Source 2 validation features
└── Ytrain.csv # Targets
Loading Multi-Source Data
import nirs4all
from nirs4all.data import DatasetConfigs
# Explicit configuration object
dataset_config = DatasetConfigs('sample_data/multi')
# Or pass the folder path directly:
# NIRS4ALL automatically detects and loads all sources
result = nirs4all.run(
    pipeline=pipeline,
    dataset='sample_data/multi',
    name="MultiSource"
)
Advanced: Source-Specific Processing
For per-source preprocessing (covered in Developer examples):
# Source branching: different preprocessing per source
{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "markers": [VarianceThreshold()],
}}
# Merge sources
{"merge_sources": "concat"}  # Horizontal concatenation
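Conceptually, source branching applies a separate transform to each feature block, and `"concat"` then joins the blocks column-wise. The NumPy sketch below mimics that idea on random data: a hand-written SNV (each spectrum centred and scaled by its own mean and standard deviation) for the NIR block, and a simple variance filter for the markers. It illustrates the data flow only and is not how NIRS4ALL implements these steps; the shapes and the threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
nir = rng.normal(loc=1.0, scale=0.2, size=(50, 1000))  # Source 1: 1000 wavelengths
markers = rng.normal(size=(50, 10))                     # Source 2: 10 lab values
markers[:, 3] = 7.0                                     # a constant, uninformative marker


def snv(spectra):
    """Standard Normal Variate: standardize each row (spectrum) independently."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def drop_low_variance(features, threshold=0.0):
    """Keep only the columns whose variance exceeds the threshold."""
    return features[:, features.var(axis=0) > threshold]


# One branch per source, then horizontal concatenation
merged = np.hstack([snv(nir), drop_low_variance(markers)])
print(merged.shape)  # (50, 1009): 1000 NIR columns + 9 surviving markers
```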
U04: Wavelength Handling
Handle wavelength grids: interpolation, downsampling, and unit conversion.
What You’ll Learn
Resampler operator for wavelength interpolation
Downsampling to fewer wavelengths
Focusing on specific spectral regions
Why Resample Wavelengths?
Instrument standardization: Different spectrometers have different wavelength grids
Transfer learning: Match wavelengths between training and inference instruments
Dimensionality reduction: Reduce features while preserving spectral shape
Region focus: Analyze specific spectral regions
The Resampler Operator
from nirs4all.operators.transforms import Resampler
import numpy as np
# Target wavelengths (e.g., from a reference instrument)
target_wavelengths = np.linspace(1000, 2500, 100)
pipeline = [
    Resampler(target_wavelengths=target_wavelengths, method='linear'),
    # ... rest of pipeline
]
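Linear resampling is easy to picture: each spectrum is treated as a function of wavelength and evaluated on the new grid. The sketch below does this with `np.interp`, one row at a time. `resample_linear` is a hypothetical stand-in for illustration and makes no claim about how `Resampler` is implemented; it assumes the source wavelengths are sorted in increasing order.

```python
import numpy as np


def resample_linear(X, source_wl, target_wl):
    """Interpolate every spectrum (row of X) from source_wl onto target_wl."""
    return np.vstack([np.interp(target_wl, source_wl, row) for row in X])


source_wl = np.linspace(1000, 2500, 751)   # original grid, 2 nm step
target_wl = np.linspace(1000, 2500, 100)   # coarser target grid

# Two toy spectra: a straight line and a smooth absorption band
X = np.vstack([
    0.001 * source_wl,
    np.exp(-((source_wl - 1940) / 60.0) ** 2),
])

X_resampled = resample_linear(X, source_wl, target_wl)
print(X.shape, "->", X_resampled.shape)  # (2, 751) -> (2, 100)
```

Note that `np.interp` clamps target wavelengths outside the source range to the edge values, so a target grid should normally stay within the measured range.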
Common Resampling Scenarios
Match Another Dataset
from nirs4all.data import DatasetConfigs
from nirs4all.operators.transforms import Resampler
# Get wavelengths from reference dataset
ref_config = DatasetConfigs("reference_data")
ref_dataset = list(ref_config.iter_datasets())[0]
target_wl = ref_dataset.float_headers(0)
# Resample to match
Resampler(target_wavelengths=target_wl)
Downsample
# Reduce to 50 evenly spaced points
# (start_wl and end_wl are the bounds of the original wavelength grid)
target_wl = np.linspace(start_wl, end_wl, 50)
Resampler(target_wavelengths=target_wl)
Focus on Region
# Focus on a specific spectral region (e.g., 1400-1800 nm)
region_wl = np.linspace(1400, 1800, 100)
Resampler(target_wavelengths=region_wl)
Interpolation Methods
| Method | Description | Use Case |
|---|---|---|
| 'linear' | Linear interpolation | Default, fast |
|  | Cubic spline | Smooth spectra |
|  | Quadratic interpolation | Balance speed/smoothness |
|  | Nearest neighbor | Discrete features |
U05: Synthetic Data
Generate synthetic NIRS spectra for testing and prototyping.
What You’ll Learn
Using nirs4all.generate() for quick dataset creation
Convenience functions for regression and classification
Configuring spectral complexity and components
Basic Generation
import nirs4all
# Generate a SpectroDataset
dataset = nirs4all.generate(n_samples=500, random_state=42)
# Or get raw numpy arrays
X, y = nirs4all.generate(n_samples=300, as_dataset=False)
Regression Datasets
dataset = nirs4all.generate.regression(
    n_samples=500,
    target_range=(0, 100),    # Scale targets
    target_component=0,       # Which component as target
    complexity="realistic",   # Noise level
    random_state=42
)
Classification Datasets
# Binary classification
dataset = nirs4all.generate.classification(
n_samples=400,
n_classes=2,
class_separation=2.0, # Well-separated classes
random_state=42
)
# Imbalanced multiclass
dataset = nirs4all.generate.classification(
n_samples=600,
n_classes=3,
class_weights=[0.5, 0.3, 0.2],
random_state=42
)
Complexity Levels
| Level | Description | Use Case |
|---|---|---|
|  | Minimal noise | Unit tests, fast prototyping |
| "realistic" | Typical NIR noise/scatter | Development, validation |
|  | High noise, artifacts | Robustness testing |
Specifying Chemical Components
dataset = nirs4all.generate(
    n_samples=400,
    components=["water", "protein", "lipid", "starch"],
    complexity="realistic"
)
Available components: water, protein, lipid, starch, cellulose, chlorophyll, oil, nitrogen_compound
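A common way to think about component-based generation is a linear mixture: each component contributes absorption bands, and a sample's spectrum is the concentration-weighted sum of those bands plus noise. The sketch below follows that idea with Gaussian bands. It is an illustration only, not the NIRS4ALL generator; the band centres are rough textbook values, and the band widths, noise level, and target scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
wavelengths = np.linspace(1000, 2500, 256)

# Approximate band centres in nm (illustrative, not taken from NIRS4ALL)
band_centres = {"water": [1450, 1940], "protein": [2180], "lipid": [1725, 2310]}


def component_spectrum(centres, width=40.0):
    """Sum of Gaussian bands for one pure component."""
    return sum(np.exp(-((wavelengths - c) / width) ** 2) for c in centres)


pure = np.vstack([component_spectrum(c) for c in band_centres.values()])  # (3, 256)

# Random concentrations per sample, each row summing to 1
concentrations = rng.dirichlet(np.ones(len(band_centres)), size=400)      # (400, 3)

# Linear mixing plus a small amount of measurement noise
X = concentrations @ pure + rng.normal(scale=0.01, size=(400, 256))

# Target: concentration of the first component, rescaled to the range 0-100
c0 = concentrations[:, 0]
y = 100 * (c0 - c0.min()) / (c0.max() - c0.min())
print(X.shape, y.min(), y.max())
```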
Direct Pipeline Integration
# Generate and train in one call
result = nirs4all.run(
pipeline=[StandardScaler(), PLSRegression(n_components=10)],
dataset=nirs4all.generate.regression(n_samples=600, complexity="realistic"),
name="SyntheticTest"
)
U06: Synthetic Advanced
Master the full synthetic data generation API for complex scenarios.
What You’ll Learn
Using SyntheticDatasetBuilder for full control
Metadata generation (groups, repetitions)
Multi-source datasets
Batch effects simulation
Non-linear target complexity for realistic benchmarks
Exporting to files
Matching real data characteristics
SyntheticDatasetBuilder
For maximum control over synthetic data generation:
from nirs4all.synthesis import SyntheticDatasetBuilder
# Full control over generation
builder = SyntheticDatasetBuilder(
    n_samples=500,
    wavelength_range=(1000, 2500),
    n_wavelengths=256,
    components=["water", "protein", "lipid"],
)
# Add batch effects
builder.add_batch_effect(n_batches=3, intensity=0.1)
# Add metadata
builder.add_group_metadata(n_groups=5)
# Generate dataset
dataset = builder.build()
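Batch effects are usually modelled as systematic differences between measurement sessions, for example a baseline offset and a small gain change shared by all samples in a batch. The sketch below applies such an effect to random spectra with NumPy. It is a hypothetical illustration of the concept, not what `add_batch_effect` does internally, and reading `intensity` as the standard deviation of the per-batch shifts is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths, n_batches, intensity = 500, 256, 3, 0.1

X = rng.normal(loc=1.0, scale=0.02, size=(n_samples, n_wavelengths))

# Assign every sample to one batch
batch_ids = rng.integers(0, n_batches, size=n_samples)

# One additive offset and one multiplicative gain per batch
offsets = rng.normal(scale=intensity, size=n_batches)
gains = 1.0 + rng.normal(scale=intensity, size=n_batches)

# Apply each sample's batch parameters to its whole spectrum
X_batched = X * gains[batch_ids, None] + offsets[batch_ids, None]

# Per-batch mean absorbance now differs between batches
for b in range(n_batches):
    print(b, X_batched[batch_ids == b].mean().round(3))
```

Keeping `batch_ids` alongside the spectra makes it possible to use the batch as a grouping variable later, for instance in group-aware cross-validation.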
Multi-Source Synthetic Data
# Create multi-source datasets
builder = SyntheticDatasetBuilder(n_samples=300)
builder.add_source("NIR", wavelength_range=(1000, 2500), n_wavelengths=256)
builder.add_source("markers", n_features=10, feature_type="numerical")
dataset = builder.build()
Exporting to Files
# Export for loader testing
dataset.to_csv("synthetic_data/")
# Creates: Xtrain.csv, Ytrain.csv, Xtest.csv, Ytest.csv
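The resulting folder layout can be reproduced by hand, which is useful for checking what a loader will see. The sketch below writes four files with the names listed above into a temporary directory using `np.savetxt` and reads one back. The `;` delimiter and the absence of a header row are assumptions, not a statement about what `to_csv` writes.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
arrays = {
    "Xtrain": rng.normal(size=(80, 16)),
    "Ytrain": rng.normal(size=(80, 1)),
    "Xtest": rng.normal(size=(20, 16)),
    "Ytest": rng.normal(size=(20, 1)),
}

out_dir = Path(tempfile.mkdtemp()) / "synthetic_data"
out_dir.mkdir()

# One CSV per array, named after its role
for name, values in arrays.items():
    np.savetxt(out_dir / f"{name}.csv", values, delimiter=";")

# Read one file back to confirm the round trip
X_back = np.loadtxt(out_dir / "Xtrain.csv", delimiter=";")
file_names = sorted(p.name for p in out_dir.iterdir())
print(file_names, X_back.shape)
```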
Running These Examples
cd examples
# Run all data handling examples
./run.sh -n "U0*.py" -c user
# Run with plots
python user/02_data_handling/U05_synthetic_data.py --plots --show
Next Steps
After mastering data handling:
Preprocessing: Apply NIRS-specific transformations
Models: Compare different model architectures
Cross-Validation: Choose the right validation strategy