Dataset Configuration Troubleshooting Guide

This guide helps diagnose and resolve common issues when loading NIRS datasets.

Error Code Reference

Schema Errors (E1xx)

Code	Description	Solution
E101	Missing required field	Add the missing field to your configuration
E102	Invalid field type	Check expected type in schema documentation
E103	Invalid enum value	Use one of the allowed values listed in error
E104	Configuration validation failed	Check the detailed validation messages

Example - E101 Missing Required Field:

# ❌ Error: E101 - Missing required field 'train_x'
name: my_dataset

# ✅ Fixed: Add required train_x path
name: my_dataset
train_x: data/spectra.csv

File Errors (E2xx)

Code	Description	Solution
E201	File not found	Check file path exists and is accessible
E202	Permission denied	Check file permissions
E203	Invalid file format	Ensure file matches expected format (CSV, TSV, etc.)
E204	File is empty	Verify file contains data
E205	Encoding error	Specify correct encoding in params

Example - E201 File Not Found:

# Paths are either relative to the config file or absolute; set train_x once
train_x: ./data/spectra.csv  # Relative to config
# train_x: /home/user/project/data/spectra.csv  # Absolute alternative
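
To see which file a relative path actually points to, it can help to resolve it against the configuration file's directory by hand. The sketch below uses only the Python standard library. `resolve_data_path` is a hypothetical helper written for this guide, not a nirs4all function, and it assumes relative paths are resolved against the config file location as described above.

```python
from pathlib import Path


def resolve_data_path(config_file: str, data_path: str) -> Path:
    """Resolve data_path against the directory that contains config_file.

    Hypothetical debugging helper (not part of nirs4all).
    Absolute paths are returned unchanged.
    """
    candidate = Path(data_path)
    if candidate.is_absolute():
        return candidate
    return Path(config_file).parent / candidate


# Example: report whether the resolved file exists on this machine
resolved = resolve_data_path("configs/dataset.yaml", "./data/spectra.csv")
print(resolved, "exists" if resolved.exists() else "is missing")
```

Running this from the directory where the pipeline is launched shows quickly whether E201 comes from a wrong base directory or from a missing file.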

Data Errors (E3xx)

Code	Description	Solution
E301	Missing values detected	Handle NaN/missing values in preprocessing
E302	Shape mismatch	Ensure X and y have matching sample counts
E303	Invalid numeric data	Check for non-numeric values in spectral data
E304	Spectral range inconsistent	Verify wavelength headers match across files
E305	Duplicate samples	Use aggregation or remove duplicates

Example - E302 Shape Mismatch:

# train_x has 100 samples, train_y has 95 samples
# Check for:
# 1. Header row counted as data
# 2. Missing samples in target file
# 3. Different sample ID formats
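
The first cause in the list (a header row counted as data) can be checked by counting rows independently of the loader. This is a minimal sketch using the standard `csv` module. The helper name `count_samples` and the inline sample data are illustrative assumptions, not nirs4all code.

```python
import csv
import io


def count_samples(csv_text: str, has_header: bool = True, delimiter: str = ",") -> int:
    """Count non-empty data rows, optionally skipping one header row."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    rows = [row for row in reader if row]  # blank lines parse to [] and are skipped
    return len(rows) - 1 if has_header and rows else len(rows)


# Illustrative data: two spectra and two target values, each with a header row
x_text = "1100,1102,1104\n0.51,0.52,0.53\n0.61,0.62,0.63\n"
y_text = "protein\n12.1\n13.4\n"

n_x = count_samples(x_text)
n_y = count_samples(y_text)
print(f"X samples: {n_x}, y samples: {n_y}, match: {n_x == n_y}")

# Treating the y header as data inflates the count by one
print(count_samples(y_text, has_header=False))
```

If the two counts differ by exactly one, a header setting is the most likely culprit. Larger differences point to missing samples in one of the files.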

Loading Errors (E4xx)

Code	Description	Solution
E401	Delimiter detection failed	Explicitly set delimiter in params
E402	Header parsing failed	Check header format matches header_unit
E403	Data type conversion failed	Check data contains valid numbers
E404	Memory error	Use lazy_loading or process in chunks

Example - E401 Delimiter Detection:

# If auto-detection fails, specify one delimiter explicitly
params:
  delimiter: ","       # CSV
  # delimiter: "\t"    # TSV
  # delimiter: ";"     # European CSV
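
To find out which delimiter a file uses before setting it, a simple consistency check is often enough: the right delimiter appears the same number of times on every line, header included. The sketch below is a plain heuristic written for this guide, not the detection logic nirs4all uses. The `guess_delimiter` name and the candidate list are assumptions, and quoted fields are not handled.

```python
from typing import Optional

CANDIDATES = [",", ";", "\t", "|"]


def guess_delimiter(text: str, max_lines: int = 20) -> Optional[str]:
    """Return the candidate that occurs equally often on every sampled line.

    If several candidates are consistent, the most frequent one wins.
    Returns None when no candidate is both consistent and present.
    """
    lines = [line for line in text.splitlines() if line.strip()][:max_lines]
    best, best_count = None, 0
    for delim in CANDIDATES:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1:  # same count on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


# The header has no commas, so ',' is inconsistent and ';' is chosen
print(repr(guess_delimiter("1100;1102;1104\n0,51;0,52;0,53\n")))
```

A result of `None` usually means ragged rows or mixed delimiters, which is worth fixing in the file itself.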

Partition Errors (E5xx)

Code	Description	Solution
E501	Partition overlap	Ensure train/val/test don’t share samples
E502	Empty partition	Check partition indices are valid
E503	Invalid partition indices	Indices must be within sample count
E504	Partition sum mismatch	Partition sizes should account for all samples
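
The four partition checks can be reproduced outside the loader to find the offending indices. The following is a sketch only: `check_partitions` is a hypothetical helper, the mapping of each check to an error code is an interpretation of the table above, and 0-based indices are assumed.

```python
from typing import Dict, List


def check_partitions(partitions: Dict[str, List[int]], n_samples: int) -> List[str]:
    """Return human-readable problems found in partition index lists."""
    problems = []
    seen = {}  # sample index -> first partition that claimed it
    for name, indices in partitions.items():
        if not indices:
            problems.append(f"E502: partition '{name}' is empty")
        out_of_range = [i for i in indices if not 0 <= i < n_samples]
        if out_of_range:
            problems.append(f"E503: partition '{name}' has invalid indices {out_of_range}")
        for i in indices:
            if i in seen and seen[i] != name:
                problems.append(f"E501: sample {i} is in both '{seen[i]}' and '{name}'")
            seen.setdefault(i, name)
    missing = set(range(n_samples)) - set(seen)
    if missing:
        problems.append(f"E504: {len(missing)} samples are not assigned to any partition")
    return problems


# Illustrative input: sample 1 is shared, index 7 is out of range, 2 and 3 are unassigned
for problem in check_partitions({"train": [0, 1], "test": [1, 7]}, n_samples=4):
    print(problem)
```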

Aggregation Errors (E6xx)

Code	Description	Solution
E601	Group column not found	Check group_by column exists in metadata
E602	Aggregation method failed	Check method name is valid
E603	Custom aggregation error	Verify custom function signature
E604	Empty group after aggregation	Every measurement in a group was excluded as an outlier; raise outlier_threshold or set exclude_outliers: false

Example - E601 Group Column:

# Ensure the group column exists in your data
aggregation:
  group_by: sample_id  # Must match column name exactly (case-sensitive)

Variation Errors (E7xx)

Code	Description	Solution
E701	Variation definition error	Check variation syntax
E702	Invalid spectral range	Range must be within data bounds
E703	Resampling error	Check resample parameters
E704	Noise application error	Verify noise level is valid (0-1)

Fold Errors (E8xx)

Code	Description	Solution
E801	Fold definition error	Check fold indices syntax
E802	Fold overlap	Ensure folds don’t share test samples
E803	Invalid fold indices	Indices must be within sample count
E804	Inconsistent fold structure	All folds should have train and test

Runtime Errors (E9xx)

Code	Description	Solution
E901	Cache error	Clear cache and retry
E902	Lazy loading error	Try with lazy_loading: false
E903	Timeout error	Increase timeout or reduce data size
E904	Memory limit exceeded	Use lazy loading or reduce batch size

Common Scenarios

Scenario 1: European CSV Format

European CSV files use semicolons as delimiters and commas as decimal separators.

params:
  delimiter: ";"
  decimal: ","
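
To confirm that a file really follows this convention, the same two settings can be applied by hand with the standard library. This is a minimal sketch, assuming a single header row followed by purely numeric rows. `parse_european_csv` is an illustrative helper, not how nirs4all parses files.

```python
import csv
import io


def parse_european_csv(text: str, delimiter: str = ";", decimal: str = ","):
    """Split on the delimiter, then convert decimal commas to floats."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [[float(cell.replace(decimal, ".")) for cell in row] for row in reader if row]
    return header, rows


# Illustrative data: three wavelengths, two spectra
sample = "1100;1102;1104\n0,512;0,498;0,471\n0,530;0,505;0,480\n"
header, rows = parse_european_csv(sample)
print(header)
print(rows)
```

A `ValueError` from `float` here usually means the file mixes conventions or contains non-numeric cells, which corresponds to E303 or E403.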

Scenario 2: Wavelength Headers with Units

If headers contain units (e.g., “1100 nm”, “4000 cm-1”):

params:
  header_unit: nm       # or cm-1, um
  header_regex: null    # Use default pattern
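
When wavelengths look wrong after loading, parsing a few headers by hand shows what each unit implies. The sketch below converts headers to nanometres using the standard relations 1 um = 1000 nm and nm = 10^7 / cm-1. The regular expression and the `header_to_nm` helper are assumptions made for illustration. They are not the default pattern that `header_regex: null` selects.

```python
import re

# Assumed pattern: a non-negative number, optionally followed by a unit
HEADER_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(nm|cm-1|um)?\s*$")


def header_to_nm(header: str, default_unit: str = "nm") -> float:
    """Parse a header such as '1100 nm' or '4000 cm-1' and return nanometres."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"Cannot parse wavelength header: {header!r}")
    value = float(match.group(1))
    unit = match.group(2) or default_unit
    if unit == "nm":
        return value
    if unit == "um":
        return value * 1000.0
    if unit == "cm-1":
        if value == 0:
            raise ValueError("A wavenumber of 0 cm-1 has no finite wavelength")
        return 1e7 / value  # wavenumber (cm-1) to wavelength (nm)
    raise ValueError(f"Unsupported unit: {unit!r}")


for text in ["1100 nm", "4000 cm-1", "2.5 um", "1100"]:
    print(text, "->", header_to_nm(text), "nm")
```

If the converted values fall outside the instrument's known range, the configured `header_unit` probably does not match the file.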

Scenario 3: Large Dataset Memory Issues

For datasets that exceed available memory:

performance:
  lazy_loading: true
  cache_enabled: true
  cache_max_size_mb: 1024  # Limit cache size
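
The E404 entry also suggests processing in chunks. For a quick inspection of a file that does not fit in memory, rows can be read a fixed number at a time with the standard library. This sketch is independent of nirs4all and says nothing about how `lazy_loading` is implemented. `iter_chunks` is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunk_size: int, delimiter: str = ","):
    """Yield (header, rows) pairs with at most chunk_size rows each."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield header, chunk


# Illustrative in-memory file; pass an open file object for real data
demo = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
for header, rows in iter_chunks(demo, chunk_size=2):
    print(header, len(rows))
```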

Scenario 4: Sample Replicates

When each sample has multiple measurements:

aggregation:
  group_by: sample_id
  method: mean
  exclude_outliers: true
  outlier_threshold: 2.5

Scenario 5: Metadata Linking Issues

When sample IDs don’t match between files:

# Check for common issues:
# 1. Leading/trailing whitespace: "  sample1  " vs "sample1"
# 2. Case differences: "Sample1" vs "sample1"
# 3. Numeric formatting: "001" vs "1"

metadata:
  path: metadata.csv
  link_by: sample_id
  strip_whitespace: true  # Remove whitespace
  case_sensitive: false   # Ignore case
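
To find out which IDs still fail to link, the two normalization options can be mimicked in a few lines. This is a sketch under the assumption that `strip_whitespace` trims both ends and `case_sensitive: false` compares lowercased IDs. The helper names are hypothetical and the real matching logic may differ.

```python
def normalize_id(sample_id: str, strip_whitespace: bool = True, case_sensitive: bool = False) -> str:
    """Apply the assumed whitespace and case normalization to one ID."""
    if strip_whitespace:
        sample_id = sample_id.strip()
    if not case_sensitive:
        sample_id = sample_id.lower()
    return sample_id


def unmatched_ids(data_ids, metadata_ids, **options):
    """Return the data IDs that have no counterpart in the metadata."""
    known = {normalize_id(i, **options) for i in metadata_ids}
    return [i for i in data_ids if normalize_id(i, **options) not in known]


# Whitespace and case differences are resolved, but "001" vs "1" is not
print(unmatched_ids(["  sample1  ", "Sample2", "001"], ["sample1", "sample2", "1"]))
```

Numeric formatting differences such as "001" vs "1" survive both options, so they need to be fixed in the source files.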

Validation Workflow

Use the CLI to validate configurations before running pipelines:

# Validate configuration syntax
nirs4all dataset validate config.yaml

# Inspect data with auto-detection
nirs4all dataset inspect data.csv --detect

# Export normalized configuration
nirs4all dataset export config.yaml -o normalized.yaml

# Compare configurations
nirs4all dataset diff config1.yaml config2.yaml

Getting Diagnostic Reports

For detailed diagnostics, build a diagnostic report from Python when loading fails:

from nirs4all.data import DatasetConfigs
from nirs4all.data.schema.validation import DiagnosticBuilder

# Create diagnostic builder
diagnostics = DiagnosticBuilder()

# Load with diagnostics
try:
    config = DatasetConfigs.from_yaml("config.yaml")
except Exception as e:
    # Show the original error alongside the diagnostic report
    print(f"Failed to load configuration: {e}")
    report = diagnostics.build()
    print(report.to_text())

    # Or save as JSON for analysis
    report.save_json("diagnostics.json")

FAQ

Q: My file loads but wavelengths are wrong
A: Check that header_unit matches your data. Run nirs4all dataset inspect file.csv --detect to see the detected parameters.

Q: Aggregation removes too many samples
A: Raise outlier_threshold so fewer measurements are flagged, or set exclude_outliers: false.

Q: Cache isn’t being used
A: Ensure cache_enabled: true is set and check that cache_max_size_mb is large enough for your data.

Q: Getting OOM errors with large datasets
A: Enable lazy_loading: true in the performance settings.

Q: Configuration works locally but fails in CI
A: Use absolute paths, or paths relative to the config file location rather than to the working directory.