# Loading Data This guide covers how to load spectral data into NIRS4ALL using `DatasetConfigs`. ## Overview NIRS4ALL loads data through `DatasetConfigs`, which handles: - Multiple file formats (CSV, Excel, MATLAB, NumPy, Parquet) - Automatic file detection in folders - Multi-source datasets (e.g., NIR + markers) - Train/test splits - Metadata handling ## Quick Start ### From a Folder The simplest approach - NIRS4ALL auto-detects your data files: ```python from nirs4all.data import DatasetConfigs # Auto-detect files in folder dataset = DatasetConfigs("path/to/data/") ``` Expected folder structure: ``` data/ ├── train_x.csv # Training features (spectra) ├── train_y.csv # Training targets ├── train_m.csv # Training metadata (optional) ├── test_x.csv # Test features (optional) ├── test_y.csv # Test targets (optional) └── test_m.csv # Test metadata (optional) ``` ### From a Single File For a single file with features and targets combined: ```python # CSV with last column as target dataset = DatasetConfigs("data.csv") # Explicit target column dataset = DatasetConfigs({ "train_x": "data.csv", "global_params": {"target_column": "protein"} }) ``` ### From Explicit Files Full control over file paths: ```python dataset = DatasetConfigs({ "train_x": "spectra_train.csv", "train_y": "targets_train.csv", "test_x": "spectra_test.csv", "test_y": "targets_test.csv" }) ``` ## Supported Formats | Format | Extensions | Notes | |--------|-----------|-------| | CSV | `.csv` | Most common; configurable delimiter | | Excel | `.xlsx`, `.xls` | Single sheet or specify sheet name | | MATLAB | `.mat` | Reads first array variable | | NumPy | `.npy`, `.npz` | Binary format, fast loading | | Parquet | `.parquet` | Columnar format, efficient for large data | ### CSV Configuration ```python dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_x_params": { "delimiter": ";", # Column separator "decimal_separator": ",", # Decimal point "has_header": True, # First row is header "na_policy": "drop" # Handle missing values } }) ``` ### Excel Configuration ```python dataset = DatasetConfigs({ "train_x": "data.xlsx", "train_x_params": { "sheet_name": "Spectra", # Specific sheet "header_row": 0 # Header row index } }) ``` ## File Keys Reference | Key | Description | |-----|-------------| | `train_x` | Training features (spectra) | | `train_y` | Training targets | | `train_m` | Training metadata | | `test_x` | Test features | | `test_y` | Test targets | | `test_m` | Test metadata | | `*_params` | Parameters for corresponding file (e.g., `train_x_params`) | | `global_params` | Parameters applied to all files | ## Wavelength Headers NIRS4ALL understands wavelength information from column headers: ```python dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_x_params": { "header_unit": "nm" # Headers are wavelengths in nm # Options: "nm", "cm-1", "none", "text", "index" } }) ``` | `header_unit` | Description | Example Headers | |---------------|-------------|-----------------| | `"nm"` | Wavelengths in nanometers | `900, 902, 904, ...` | | `"cm-1"` | Wavenumbers in cm⁻¹ | `4000, 3998, 3996, ...` | | `"none"` | No header row | - | | `"text"` | Text labels (ignored) | `V1, V2, V3, ...` | | `"index"` | Numeric indices | `1, 2, 3, ...` | ## Signal Type Specify the type of spectral signal for proper handling: ```python dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_x_params": { "signal_type": "reflectance" } }) # Or as constructor parameter dataset = DatasetConfigs("spectra.csv", signal_type="reflectance") ``` | Signal Type | Description | |-------------|-------------| | `"absorbance"` | -log₁₀(R) | | `"reflectance"` | Raw reflectance (0-1) | | `"reflectance%"` | Reflectance as percentage (0-100) | | `"transmittance"` | Raw transmittance (0-1) | | `"transmittance%"` | Transmittance as percentage | | `"auto"` | Automatic detection (default) | ## Task Type Force regression or classification mode: ```python # Force regression dataset = DatasetConfigs("data/", task_type="regression") # Force classification dataset = DatasetConfigs("data/", task_type="binary_classification") # Valid options: # - "auto" (default) # - "regression" # - "binary_classification" # - "multiclass_classification" ``` ## Multi-Source Datasets Combine multiple data sources (e.g., NIR spectra + chemical markers): ```python dataset = DatasetConfigs({ "train_x": ["nir_spectra.csv", "markers.csv"], "train_y": "targets.csv", "train_x_params": [ {"header_unit": "nm", "signal_type": "reflectance"}, {"header_unit": "text"} # Markers have text headers ] }) ``` ### Processing Multi-Source Data Use `source_branch` to apply different preprocessing to each source: ```python pipeline = [ {"source_branch": { 0: [StandardNormalVariate(), FirstDerivative()], # NIR source 1: [StandardScaler()] # Markers source }}, {"merge_sources": "concat"}, # Combine sources {"model": PLSRegression(n_components=10)} ] ``` ## Sample Aggregation Aggregate predictions from multiple measurements per sample: ```python # Aggregate by sample ID column dataset = DatasetConfigs( "data/", aggregate="sample_id" # Metadata column name ) # Aggregate by target values dataset = DatasetConfigs( "data/", aggregate=True # Group by y values ) # With custom method dataset = DatasetConfigs( "data/", aggregate="sample_id", aggregate_method="median", # "mean", "median", or "vote" aggregate_exclude_outliers=True # Remove outliers before aggregating ) ``` ## Multiple Datasets Run the same pipeline on multiple datasets: ```python dataset = DatasetConfigs([ "dataset1/", "dataset2/", {"train_x": "custom/spectra.csv", "train_y": "custom/targets.csv"} ]) # Results will include predictions for all datasets result = nirs4all.run(pipeline, dataset) ``` ## Using SpectroDataset Directly For advanced use cases, you can pass `SpectroDataset` instances directly to `nirs4all.run()`: ```python from nirs4all.data import SpectroDataset import nirs4all # Create a SpectroDataset manually dataset = SpectroDataset(name="my_dataset") dataset.add_samples(X_train, indexes={"partition": "train"}) dataset.add_targets(y_train) # Use directly in run() result = nirs4all.run(pipeline, dataset) ``` ### Multiple SpectroDataset Instances You can also pass a list of `SpectroDataset` instances: ```python # Multiple SpectroDataset instances datasets = [dataset1, dataset2, dataset3] result = nirs4all.run(pipeline, datasets) ``` This is particularly useful when: - Working with synthetic data generators that return `SpectroDataset` - Programmatically creating datasets from different sources - Chaining multiple pipeline runs with transformed data ## Complete Example ```python from nirs4all.data import DatasetConfigs import nirs4all # Comprehensive configuration dataset = DatasetConfigs({ # Training data "train_x": "data/train_spectra.csv", "train_y": "data/train_targets.csv", "train_m": "data/train_metadata.csv", # Test data "test_x": "data/test_spectra.csv", "test_y": "data/test_targets.csv", # Training file parameters "train_x_params": { "header_unit": "nm", "signal_type": "reflectance", "delimiter": "," }, # Force regression task "task_type": "regression" }) # Run pipeline result = nirs4all.run( pipeline=[ MinMaxScaler(), ShuffleSplit(n_splits=3), {"model": PLSRegression(n_components=10)} ], dataset=dataset, verbose=1 ) ``` ## Common Patterns ### Load with Metadata ```python dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_y": "targets.csv", "train_m": "metadata.csv" # Sample IDs, dates, groups, etc. }) ``` ### Specify Target Column ```python # When features and target are in the same file dataset = DatasetConfigs({ "train_x": "combined_data.csv", "global_params": { "target_column": "protein" # Column name for target } }) ``` ### Handle Missing Values ```python dataset = DatasetConfigs({ "train_x": "spectra.csv", "global_params": { "na_policy": "drop" # Drop rows with NaN # Options: "drop", "fill_mean", "fill_median", "fill_zero" } }) ``` ## Troubleshooting ### File Not Found ```python # Use absolute paths if relative paths fail import os path = os.path.abspath("data/spectra.csv") dataset = DatasetConfigs(path) ``` ### Wrong Delimiter ```python # Check file manually, then specify dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_x_params": {"delimiter": "\t"} # Tab-separated }) ``` ### Header Issues ```python # No header row dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_x_params": {"has_header": False} }) # Skip header row dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_x_params": {"header_unit": "none", "has_header": True} }) ``` ## See Also - {doc}`/getting_started/concepts` - Understanding SpectroDataset - {doc}`/reference/configuration` - Full DatasetConfigs specification - {doc}`sample_filtering` - Filter samples during loading - {doc}`aggregation` - Aggregate multiple measurements