Dataset Configuration Troubleshooting Guide

This guide helps diagnose and resolve common issues when loading NIRS datasets.

Error Code Reference

Schema Errors (E1xx)

| Code | Description | Solution |
|------|-------------|----------|
| E101 | Missing required field | Add the missing field to your configuration |
| E102 | Invalid field type | Check expected type in schema documentation |
| E103 | Invalid enum value | Use one of the allowed values listed in error |
| E104 | Configuration validation failed | Check the detailed validation messages |

Example - E101 Missing Required Field:

# ❌ Error: E101 - Missing required field 'train_x'
name: my_dataset

# ✅ Fixed: Add required train_x path
name: my_dataset
train_x: data/spectra.csv
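
A pre-flight check along these lines can catch a missing field before the loader runs. This is a minimal sketch on a plain dictionary, not the library's validator: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the real list of required fields comes from the schema documentation.

```python
# Hypothetical required-field list; the schema documentation defines the real one.
REQUIRED_FIELDS = ("train_x",)


def missing_fields(config, required=REQUIRED_FIELDS):
    """Return the required keys that are absent or empty in a config mapping."""
    return [key for key in required if config.get(key) in (None, "")]


broken = {"name": "my_dataset"}
fixed = {"name": "my_dataset", "train_x": "data/spectra.csv"}

print(missing_fields(broken))  # ['train_x']
print(missing_fields(fixed))   # []
```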

File Errors (E2xx)

| Code | Description | Solution |
|------|-------------|----------|
| E201 | File not found | Check file path exists and is accessible |
| E202 | Permission denied | Check file permissions |
| E203 | Invalid file format | Ensure file matches expected format (CSV, TSV, etc.) |
| E204 | File is empty | Verify file contains data |
| E205 | Encoding error | Specify correct encoding in params |

Example - E201 File Not Found:

# Check that paths are relative to the config file, or use absolute paths.
# Set train_x once; the second form is shown as an alternative.
train_x: ./data/spectra.csv  # Relative to config
# train_x: /home/user/project/data/spectra.csv  # Absolute
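
To see which file a relative path actually points at, it can help to resolve it by hand. The sketch below assumes, as the comment above states, that relative paths are interpreted against the directory containing the config file; `resolve_data_path` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def resolve_data_path(config_file, data_path):
    """Resolve data_path against the directory that contains config_file."""
    path = Path(data_path)
    if path.is_absolute():
        return path
    # Assumption: relative paths are anchored at the config file's directory.
    return (Path(config_file).resolve().parent / path).resolve()


resolved = resolve_data_path("configs/dataset.yaml", "./data/spectra.csv")
print(resolved, "exists" if resolved.exists() else "is missing")
```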

Data Errors (E3xx)

| Code | Description | Solution |
|------|-------------|----------|
| E301 | Missing values detected | Handle NaN/missing values in preprocessing |
| E302 | Shape mismatch | Ensure X and y have matching sample counts |
| E303 | Invalid numeric data | Check for non-numeric values in spectral data |
| E304 | Spectral range inconsistent | Verify wavelength headers match across files |
| E305 | Duplicate samples | Use aggregation or remove duplicates |

Example - E302 Shape Mismatch:

# train_x has 100 samples, train_y has 95 samples
# Check for:
# 1. Header row counted as data
# 2. Missing samples in target file
# 3. Different sample ID formats
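
The first two causes can be checked by counting data rows with and without a header. This is a rough sketch using only the standard library; `count_samples` is a hypothetical helper and the inline CSV text is made-up illustration data.

```python
import csv
import io


def count_samples(text, has_header=True):
    """Count non-empty data rows in CSV text, optionally skipping one header row."""
    rows = [
        row for row in csv.reader(io.StringIO(text))
        if any(cell.strip() for cell in row)
    ]
    return len(rows) - 1 if has_header and rows else len(rows)


# Illustration data: three spectra but only two target values.
x_text = "1100,1102\n0.51,0.52\n0.48,0.49\n0.47,0.50\n"
y_text = "protein\n12.1\n11.8\n"

n_x, n_y = count_samples(x_text), count_samples(y_text)
if n_x != n_y:
    print(f"Shape mismatch: X has {n_x} samples, y has {n_y}")
```

Counting the target file again with `has_header=False` shows whether a header row counted as data would explain the difference.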

Loading Errors (E4xx)

| Code | Description | Solution |
|------|-------------|----------|
| E401 | Delimiter detection failed | Explicitly set delimiter in params |
| E402 | Header parsing failed | Check header format matches header_unit |
| E403 | Data type conversion failed | Check data contains valid numbers |
| E404 | Memory error | Use lazy_loading or process in chunks |

Example - E401 Delimiter Detection:

# If auto-detection fails, specify the delimiter explicitly (set exactly one)
params:
  delimiter: ","     # CSV
  # delimiter: "\t"  # TSV
  # delimiter: ";"   # European CSV
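
To find out which delimiter a file uses before setting it, Python's standard `csv.Sniffer` gives a quick guess. This sketch is independent of the library's own auto-detection, whose algorithm is not described here; `detect_delimiter` is a hypothetical helper.

```python
import csv


def detect_delimiter(sample, candidates=",;\t"):
    """Return the delimiter guessed by csv.Sniffer, or None if detection fails."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None


# Illustration data: semicolon-delimited rows with decimal commas.
sample = "1100;1102;1104\n0,51;0,52;0,53\n0,48;0,49;0,50\n"
print(repr(detect_delimiter(sample)))
```

The sniffer is heuristic, so passing the first few kilobytes of the real file and restricting `candidates` to plausible delimiters tends to give more stable results.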

Partition Errors (E5xx)

| Code | Description | Solution |
|------|-------------|----------|
| E501 | Partition overlap | Ensure train/val/test don’t share samples |
| E502 | Empty partition | Check partition indices are valid |
| E503 | Invalid partition indices | Indices must be within sample count |
| E504 | Partition sum mismatch | Partition sizes should account for all samples |
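
The four partition checks can be reproduced on plain index lists when debugging. The sketch below is illustrative only: `check_partitions` is a hypothetical helper, indices are assumed to be 0-based, and the mapping of each check to an error code is inferred from the descriptions above rather than taken from the library.

```python
def check_partitions(partitions, n_samples):
    """Return a list of problems found in a {name: indices} partition mapping."""
    problems = []
    seen = {}  # sample index -> name of the first partition that claimed it
    for name, indices in partitions.items():
        if not indices:
            problems.append(f"E502: partition '{name}' is empty")
        for idx in indices:
            if not 0 <= idx < n_samples:  # assumes 0-based indices
                problems.append(f"E503: index {idx} in '{name}' is outside 0..{n_samples - 1}")
            elif idx in seen and seen[idx] != name:
                problems.append(f"E501: sample {idx} is in both '{seen[idx]}' and '{name}'")
            else:
                seen[idx] = name
    if len(seen) != n_samples:
        problems.append(f"E504: {n_samples - len(seen)} samples are not in any partition")
    return problems


for problem in check_partitions({"train": [0, 1, 2], "val": [2, 3], "test": []}, 6):
    print(problem)
```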

Aggregation Errors (E6xx)

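| Code | Description | Solution |
|------|-------------|----------|
| E601 | Group column not found | Check group_by column exists in metadata |
| E602 | Aggregation method failed | Check method name is valid |
| E603 | Custom aggregation error | Verify custom function signature |
| E604 | Empty group after aggregation | Every sample in a group may have been flagged as an outlier; relax outlier_threshold or set exclude_outliers: false |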

Example - E601 Group Column:

# Ensure the group column exists in your data
aggregation:
  group_by: sample_id  # Must match column name exactly (case-sensitive)

Variation Errors (E7xx)

| Code | Description | Solution |
|------|-------------|----------|
| E701 | Variation definition error | Check variation syntax |
| E702 | Invalid spectral range | Range must be within data bounds |
| E703 | Resampling error | Check resample parameters |
| E704 | Noise application error | Verify noise level is valid (0-1) |

Fold Errors (E8xx)

| Code | Description | Solution |
|------|-------------|----------|
| E801 | Fold definition error | Check fold indices syntax |
| E802 | Fold overlap | Ensure folds don’t share test samples |
| E803 | Invalid fold indices | Indices must be within sample count |
| E804 | Inconsistent fold structure | All folds should have train and test |
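
Fold definitions can be sanity-checked in the same spirit. This sketch assumes folds are given as (train, test) pairs of 0-based indices, which may differ from the library's actual fold syntax; `check_folds` is a hypothetical helper.

```python
def check_folds(folds, n_samples):
    """Return problems found in a list of (train_indices, test_indices) folds."""
    problems = []
    tested = set()  # test indices used by earlier folds
    for i, (train, test) in enumerate(folds):
        if not train or not test:
            problems.append(f"fold {i}: needs both train and test indices")
        if any(not 0 <= idx < n_samples for idx in (*train, *test)):
            problems.append(f"fold {i}: index outside 0..{n_samples - 1}")
        shared = tested & set(test)
        if shared:
            problems.append(f"fold {i}: test samples {sorted(shared)} reused from an earlier fold")
        tested |= set(test)
    return problems


folds = [([0, 1], [2, 3]), ([2, 3], [3, 4]), ([0], [])]
for problem in check_folds(folds, n_samples=5):
    print(problem)
```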

Runtime Errors (E9xx)

| Code | Description | Solution |
|------|-------------|----------|
| E901 | Cache error | Clear cache and retry |
| E902 | Lazy loading error | Try with lazy_loading: false |
| E903 | Timeout error | Increase timeout or reduce data size |
| E904 | Memory limit exceeded | Use lazy loading or reduce batch size |


Common Scenarios

Scenario 1: European CSV Format

European CSV files use semicolons as delimiters and commas as decimal separators.

params:
  delimiter: ";"
  decimal: ","
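
What these two parameters mean can be shown with a few lines of standard-library Python. This is only an illustration of the format, not the library's loader; `read_european_csv` is a hypothetical helper and the sample text is made up.

```python
import csv
import io


def read_european_csv(text, delimiter=";", decimal=","):
    """Parse CSV text into a header list and rows of floats."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # Swap the decimal separator for a dot before converting each cell.
    rows = [[float(cell.replace(decimal, ".")) for cell in row] for row in reader if row]
    return header, rows


text = "1100;1102\n0,51;0,52\n0,48;0,49\n"
header, rows = read_european_csv(text)
print(header)  # ['1100', '1102']
print(rows)    # [[0.51, 0.52], [0.48, 0.49]]
```

Reading the same text with a comma delimiter would split every number in two, which is why the delimiter and decimal separator need to be set together.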

Scenario 2: Wavelength Headers with Units

If headers contain units (e.g., “1100 nm”, “4000 cm-1”):

params:
  header_unit: nm       # or cm-1, um
  header_regex: null    # Use default pattern
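
For checking headers by hand, the sketch below parses a value with an optional unit suffix and converts it to nanometres. The pattern and the `header_to_nm` helper are assumptions made for illustration; the library's default header pattern is not documented here and may differ.

```python
import re

# Hypothetical pattern: a number, optional whitespace, then an optional unit.
HEADER_PATTERN = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*(nm|cm-1|um)?\s*$")


def header_to_nm(header, default_unit="nm"):
    """Parse a header such as '1100 nm' or '4000 cm-1' and return the wavelength in nm."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"Unrecognized wavelength header: {header!r}")
    value = float(match.group(1))
    unit = match.group(2) or default_unit
    if unit == "nm":
        return value
    if unit == "um":
        return value * 1000.0  # micrometres to nanometres
    return 1e7 / value  # wavenumber in cm-1 to wavelength in nm


print(header_to_nm("1100 nm"))    # 1100.0
print(header_to_nm("4000 cm-1"))  # 2500.0
```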

Scenario 3: Large Dataset Memory Issues

For datasets that exceed available memory:

performance:
  lazy_loading: true
  cache_enabled: true
  cache_max_size_mb: 1024  # Limit cache size
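
When preprocessing outside the loader, reading a file in fixed-size chunks keeps memory use bounded in a similar way. This is a generic sketch of chunked reading, not how `lazy_loading` is implemented; `iter_chunks` is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of parsed float rows, at most chunk_size rows at a time."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row if there is one
    while True:
        chunk = [[float(cell) for cell in row] for row in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


# Illustration data; in practice pass an open file object instead.
source = io.StringIO("1100,1102\n0.1,0.2\n0.3,0.4\n0.5,0.6\n")
for chunk in iter_chunks(source, chunk_size=2):
    print(len(chunk), "rows")
```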

Scenario 4: Sample Replicates

When each sample has multiple measurements:

aggregation:
  group_by: sample_id
  method: mean
  exclude_outliers: true
  outlier_threshold: 2.5
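
The effect of these settings can be sketched on scalar values. The example assumes that `outlier_threshold` is a z-score cutoff applied within each group, which is an assumption about the library's behaviour rather than a documented fact; `aggregate_replicates` is a hypothetical helper, and real spectra would be vectors rather than single numbers.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_replicates(sample_ids, values, outlier_threshold=2.5):
    """Average replicate values per sample ID, dropping z-score outliers within each group."""
    groups = defaultdict(list)
    for sample_id, value in zip(sample_ids, values):
        groups[sample_id].append(value)

    result = {}
    for sample_id, group in groups.items():
        center, spread = mean(group), pstdev(group)
        # Assumption: a replicate is an outlier when its z-score exceeds the threshold.
        kept = [v for v in group if spread == 0 or abs(v - center) / spread <= outlier_threshold]
        if not kept:
            raise ValueError(f"group {sample_id!r} is empty after outlier exclusion")
        result[sample_id] = mean(kept)
    return result


ids = ["a"] * 5 + ["b"] * 2
values = [10, 10, 10, 10, 50, 1.0, 3.0]
print(aggregate_replicates(ids, values, outlier_threshold=1.5))
print(aggregate_replicates(ids, values, outlier_threshold=2.5))
```

With only a handful of replicates per sample, z-scores stay small, so under this assumption a cutoff of 2.5 rarely excludes anything in small groups.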

Scenario 5: Metadata Linking Issues

When sample IDs don’t match between files:

# Check for common issues:
# 1. Leading/trailing whitespace: "  sample1  " vs "sample1"
# 2. Case differences: "Sample1" vs "sample1"
# 3. Numeric formatting: "001" vs "1"

metadata:
  path: metadata.csv
  link_by: sample_id
  strip_whitespace: true  # Remove whitespace
  case_sensitive: false   # Ignore case
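
To find which IDs still fail to match after whitespace and case are handled, a comparison like the following can help. It is a sketch of the normalization those two options suggest, not the library's linking code; `normalize_id` and `unmatched_ids` are hypothetical helpers.

```python
def normalize_id(raw, strip_whitespace=True, case_sensitive=False):
    """Normalize a sample ID by trimming whitespace and folding case."""
    value = raw.strip() if strip_whitespace else raw
    return value if case_sensitive else value.casefold()


def unmatched_ids(spectra_ids, metadata_ids):
    """Return spectra IDs with no counterpart in the metadata after normalization."""
    known = {normalize_id(i) for i in metadata_ids}
    return [i for i in spectra_ids if normalize_id(i) not in known]


print(unmatched_ids(["  sample1  ", "Sample2", "001"], ["sample1", "sample2", "1"]))
# ['001']
```

Numeric formatting differences such as "001" vs "1" survive this normalization, so they need to be fixed in the source files.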

Validation Workflow

Use the CLI to validate configurations before running pipelines:

# Validate configuration syntax
nirs4all dataset validate config.yaml

# Inspect data with auto-detection
nirs4all dataset inspect data.csv --detect

# Export normalized configuration
nirs4all dataset export config.yaml -o normalized.yaml

# Compare configurations
nirs4all dataset diff config1.yaml config2.yaml

Getting Diagnostic Reports

For detailed diagnostics, build a diagnostic report when loading fails:

from nirs4all.data import DatasetConfigs
from nirs4all.data.schema.validation import DiagnosticBuilder

# Create diagnostic builder
diagnostics = DiagnosticBuilder()

# Load with diagnostics
try:
    config = DatasetConfigs.from_yaml("config.yaml")
except Exception as e:
    # Show the original error before the diagnostic report
    print(f"Loading failed: {e}")
    # Get diagnostic report
    report = diagnostics.build()
    print(report.to_text())

    # Or save as JSON for analysis
    report.save_json("diagnostics.json")

FAQ

Q: My file loads but wavelengths are wrong
A: Check header_unit matches your data. Use nirs4all dataset inspect file.csv --detect to see detected parameters.

Q: Aggregation removes too many samples
A: Relax outlier_threshold (with a z-score-style cutoff such as 2.5, that means raising it so fewer samples are flagged) or set exclude_outliers: false.

Q: Cache isn’t being used
A: Ensure cache_enabled: true and check cache size limits.

Q: Getting OOM errors with large datasets
A: Enable lazy_loading: true in performance settings.

Q: Configuration works locally but fails in CI
A: Use absolute paths or paths relative to config file location.