# Data Handling Examples

This section covers all the ways to load, configure, and work with data in NIRS4ALL. From simple NumPy arrays to complex multi-source datasets, you'll learn the flexible input options available.

```{contents} On this page
:local:
:depth: 2
```

## Overview

| Example | Topic | Difficulty | Duration |
|---------|-------|------------|----------|
| [U01](#u01-flexible-inputs) | Flexible Inputs | ★☆☆☆☆ | ~2 min |
| [U02](#u02-multi-datasets) | Multi-Datasets | ★★☆☆☆ | ~3 min |
| [U03](#u03-multi-source-data) | Multi-Source Data | ★★★☆☆ | ~3 min |
| [U04](#u04-wavelength-handling) | Wavelength Handling | ★★☆☆☆ | ~3 min |
| [U05](#u05-synthetic-data) | Synthetic Data | ★★☆☆☆ | ~2 min |
| [U06](#u06-synthetic-advanced) | Advanced Synthetic Data | ★★★☆☆ | ~5 min |

---

## U01: Flexible Inputs

**Demonstrates all possible input formats for datasets and pipelines.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/02_data_handling/U01_flexible_inputs.py)

### What You'll Learn

- Direct NumPy array input with `(X, y)` tuples
- Dictionary-based dataset configuration
- Partition info specification
- SpectroDataset object usage

### Input Format Overview

NIRS4ALL accepts datasets in multiple formats:

| Format | Example | Best For |
|--------|---------|----------|
| Folder path | `"sample_data/regression"` | File-based datasets |
| Tuple | `(X, y)` or `(X, y, partition_info)` | Quick experiments |
| Dictionary | `{"train_x": X_train, "test_x": X_test, ...}` | Explicit splits |
| DatasetConfigs | `DatasetConfigs("path")` | Full control |
| SpectroDataset | `SpectroDataset(name="my_data")` | Programmatic access |

### Simplest Approach: Direct Arrays

```python
import numpy as np
from sklearn.linear_model import Ridge

import nirs4all

# Generate or load your data
X = np.random.randn(200, 100)
y = np.random.randn(200)

# Partition info: first 160 samples for training
partition_info = {"train": 160}

# Run directly with tuple
result = nirs4all.run(
    pipeline=[Ridge(alpha=1.0)],
    dataset=(X, y, partition_info),
    name="DirectArrays"
)
```

### Partition Info Options

```python
# Integer: first N samples = train
{"train": 160}

# Slice objects
{"train": slice(0, 150), "test": slice(150, 200)}

# Explicit indices
{"train": list(range(150)), "test": list(range(150, 200))}
```

### Dictionary Configuration

```python
dataset_dict = {
    "name": "my_dataset",
    "train_x": X_train,
    "train_y": y_train,
    "test_x": X_test,
    "test_y": y_test
}

result = nirs4all.run(pipeline=pipeline, dataset=dataset_dict)
```

### Using SpectroDataset Directly

For maximum control, create a SpectroDataset object:

```python
from nirs4all.data import SpectroDataset

dataset = SpectroDataset(name="custom")
dataset.add_samples(X_train, indexes={"partition": "train"})
dataset.add_targets(y_train)
dataset.add_samples(X_test, indexes={"partition": "test"})
dataset.add_targets(y_test)

result = nirs4all.run(pipeline=pipeline, dataset=dataset)
```

---

## U02: Multi-Datasets

**Run the same pipeline on multiple datasets and compare results.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/02_data_handling/U02_multi_datasets.py)

### What You'll Learn

- Specifying multiple datasets as a list
- Per-dataset result access
- Cross-dataset comparison visualizations

### Specifying Multiple Datasets

Simply pass a list of dataset paths:

```python
data_paths = [
    'sample_data/regression',
    'sample_data/regression_2',
    'sample_data/regression_3'
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset=data_paths,
    name="MultiDataset"
)
```

### Accessing Per-Dataset Results

```python
for dataset_name, dataset_info in result.per_dataset.items():
    dataset_predictions = dataset_info.get('run_predictions')
    if dataset_predictions:
        top_models = dataset_predictions.top(n=3, rank_metric='rmse')
        print(f"Dataset: {dataset_name}")
        for model in top_models:
            print(f"  {model['model_name']}: RMSE={model.get('rmse', 0):.4f}")
```

### Cross-Dataset Visualization

```python
analyzer = PredictionAnalyzer(result.predictions)

# Compare models across datasets
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="dataset_name",
    display_metric='rmse'
)

# Dataset difficulty comparison
analyzer.plot_candlestick(
    variable="dataset_name",
    display_metric='rmse'
)
```

### Use Cases

- **Generalization testing**: Does a model work well on different samples?
- **Dataset comparison**: Which datasets are most challenging?
- **Robust model selection**: Find models that work everywhere

---

## U03: Multi-Source Data

**Work with datasets that have multiple feature sources (e.g., NIR + markers).**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/02_data_handling/U03_multi_source.py)

### What You'll Learn

- Loading multi-source datasets
- Feature augmentation with generator syntax
- Basic multi-source handling

### Understanding Multi-Source Data

Multi-source datasets contain features from different instruments or measurement types:

```
Example: NIR spectrometer + wet chemistry markers
├── Source 1: NIR spectra (1000 wavelengths)
└── Source 2: Lab markers (10 chemical values)
```

### Data Structure

Multi-source data has multiple X files per partition:

```
sample_data/multi/
├── Xtrain_1.csv    # Source 1 training features
├── Xtrain_2.csv    # Source 2 training features
├── Xval_1.csv      # Source 1 validation features
├── Xval_2.csv      # Source 2 validation features
└── Ytrain.csv      # Targets
```

### Loading Multi-Source Data

```python
from nirs4all.data import DatasetConfigs

# Automatic loading
dataset_config = DatasetConfigs('sample_data/multi')

# NIRS4ALL automatically detects and loads all sources
result = nirs4all.run(
    pipeline=pipeline,
    dataset='sample_data/multi',
    name="MultiSource"
)
```

### Advanced: Source-Specific Processing

For per-source preprocessing (covered in Developer examples):

```python
# Source branching: different preprocessing per source
{"source_branch": {
    "NIR": [SNV(), FirstDerivative()],
    "markers":
        [VarianceThreshold()],
}}

# Merge sources
{"merge_sources": "concat"}  # Horizontal concatenation
```

---

## U04: Wavelength Handling

**Handle wavelength grids: interpolation, downsampling, and unit conversion.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/02_data_handling/U04_wavelength_handling.py)

### What You'll Learn

- `Resampler` operator for wavelength interpolation
- Downsampling to fewer wavelengths
- Focusing on specific spectral regions

### Why Resample Wavelengths?

- **Instrument standardization**: Different spectrometers have different wavelength grids
- **Transfer learning**: Match wavelengths between training and inference instruments
- **Dimensionality reduction**: Reduce features while preserving spectral shape
- **Region focus**: Analyze specific spectral regions

### The Resampler Operator

```python
from nirs4all.operators.transforms import Resampler
import numpy as np

# Target wavelengths (e.g., from a reference instrument)
target_wavelengths = np.linspace(1000, 2500, 100)

pipeline = [
    Resampler(target_wavelengths=target_wavelengths, method='linear'),
    # ... rest of pipeline
]
```

### Common Resampling Scenarios

#### Match Another Dataset

```python
# Get wavelengths from reference dataset
ref_config = DatasetConfigs("reference_data")
ref_dataset = list(ref_config.iter_datasets())[0]
target_wl = ref_dataset.float_headers(0)

# Resample to match
Resampler(target_wavelengths=target_wl)
```

#### Downsample

```python
# Reduce to 50 evenly-spaced points
# (start_wl and end_wl are placeholders for the first and last wavelength to keep)
target_wl = np.linspace(start_wl, end_wl, 50)
Resampler(target_wavelengths=target_wl)
```

#### Focus on Region

```python
# Focus on a spectral region of interest (e.g., 1400-1800 nm)
region_wl = np.linspace(1400, 1800, 100)
Resampler(target_wavelengths=region_wl)
```

### Interpolation Methods

| Method | Description | Use Case |
|--------|-------------|----------|
| `'linear'` | Linear interpolation | Default, fast |
| `'cubic'` | Cubic spline | Smooth spectra |
| `'quadratic'` | Quadratic interpolation | Balance speed/smoothness |
| `'nearest'` | Nearest neighbor | Discrete features |

---

## U05: Synthetic Data

**Generate synthetic NIRS spectra for testing and prototyping.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/02_data_handling/U05_synthetic_data.py)

### What You'll Learn

- Using `nirs4all.generate()` for quick dataset creation
- Convenience functions for regression and classification
- Configuring spectral complexity and components

### Basic Generation

```python
import nirs4all

# Generate a SpectroDataset
dataset = nirs4all.generate(n_samples=500, random_state=42)

# Or get raw numpy arrays
X, y = nirs4all.generate(n_samples=300, as_dataset=False)
```

### Regression Datasets

```python
dataset = nirs4all.generate.regression(
    n_samples=500,
    target_range=(0, 100),       # Scale targets
    target_component=0,          # Which component as target
    complexity="realistic",      # Noise level
    random_state=42
)
```

### Classification Datasets

```python
# Binary classification
dataset = nirs4all.generate.classification(
    n_samples=400,
    n_classes=2,
    class_separation=2.0,
    # class_separation=2.0 gives well-separated classes
    random_state=42
)

# Imbalanced multiclass
dataset = nirs4all.generate.classification(
    n_samples=600,
    n_classes=3,
    class_weights=[0.5, 0.3, 0.2],
    random_state=42
)
```

### Complexity Levels

| Level | Description | Use Case |
|-------|-------------|----------|
| `"simple"` | Minimal noise | Unit tests, fast prototyping |
| `"realistic"` | Typical NIR noise/scatter | Development, validation |
| `"complex"` | High noise, artifacts | Robustness testing |

### Specifying Chemical Components

```python
dataset = nirs4all.generate(
    n_samples=400,
    components=["water", "protein", "lipid", "starch"],
    complexity="realistic"
)
```

Available components: `water`, `protein`, `lipid`, `starch`, `cellulose`, `chlorophyll`, `oil`, `nitrogen_compound`

### Direct Pipeline Integration

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

import nirs4all

# Generate and train in one call
result = nirs4all.run(
    pipeline=[StandardScaler(), PLSRegression(n_components=10)],
    dataset=nirs4all.generate.regression(n_samples=600, complexity="realistic"),
    name="SyntheticTest"
)
```

---

## U06: Synthetic Advanced

**Master the full synthetic data generation API for complex scenarios.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/02_data_handling/U06_synthetic_advanced.py)

### What You'll Learn

- Using `SyntheticDatasetBuilder` for full control
- Metadata generation (groups, repetitions)
- Multi-source datasets
- Batch effects simulation
- Non-linear target complexity for realistic benchmarks
- Exporting to files
- Matching real data characteristics

### SyntheticDatasetBuilder

For maximum control over synthetic data generation:

```python
from nirs4all.data.synthetic import SyntheticDatasetBuilder

# Full control over generation
builder = SyntheticDatasetBuilder(
    n_samples=500,
    wavelength_range=(1000, 2500),
    n_wavelengths=256,
    components=["water", "protein", "lipid"],
)

# Add batch effects
builder.add_batch_effect(n_batches=3, intensity=0.1)

# Add metadata
builder.add_group_metadata(n_groups=5)

# Generate dataset
dataset = builder.build()
```

### Multi-Source Synthetic Data

```python
# Create multi-source datasets
builder = SyntheticDatasetBuilder(n_samples=300)
builder.add_source("NIR", wavelength_range=(1000, 2500), n_wavelengths=256)
builder.add_source("markers", n_features=10, feature_type="numerical")
dataset = builder.build()
```

### Exporting to Files

```python
# Export for loader testing
dataset.to_csv("synthetic_data/")
# Creates: Xtrain.csv, Ytrain.csv, Xtest.csv, Ytest.csv
```

---

## Running These Examples

```bash
cd examples

# Run all data handling examples
./run.sh -n "U0*.py" -c user

# Run with plots
python user/02_data_handling/U05_synthetic_data.py --plots --show
```

## Next Steps

After mastering data handling:

- **Preprocessing**: Apply NIRS-specific transformations
- **Models**: Compare different model architectures
- **Cross-Validation**: Choose the right validation strategy
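
Before moving on, it can help to see what the wavelength resampling from U04 amounts to numerically. The sketch below uses only NumPy to linearly interpolate spectra onto a new wavelength grid. It illustrates the concept and is not the `Resampler` implementation: the helper name `resample_linear` and the example grids are assumptions made for this sketch.

```python
import numpy as np


def resample_linear(X, source_wl, target_wl):
    """Linearly interpolate each spectrum (row of X) onto target_wl.

    Hypothetical helper for illustration only; not part of NIRS4ALL.
    source_wl must be increasing and have one entry per column of X.
    """
    X = np.asarray(X, dtype=float)
    source_wl = np.asarray(source_wl, dtype=float)
    target_wl = np.asarray(target_wl, dtype=float)
    # np.interp handles 1-D data, so interpolate one spectrum at a time
    return np.vstack([np.interp(target_wl, source_wl, row) for row in X])


# Assumed example grids: 256 source points downsampled to 50 target points
source_wl = np.linspace(1000, 2500, 256)
target_wl = np.linspace(1000, 2500, 50)

# Three toy spectra that vary linearly with wavelength,
# so linear interpolation reproduces them exactly on the new grid
X = np.outer([1.0, 2.0, 3.0], source_wl)
X_resampled = resample_linear(X, source_wl, target_wl)

print(X_resampled.shape)
```

Note that `np.interp` holds the edge values constant for target wavelengths outside the source range, so in this sketch the target grid should stay within the bounds of the source grid.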