Dataset Configuration Troubleshooting Guide
This guide helps diagnose and resolve common issues when loading NIRS datasets.
Error Code Reference
Schema Errors (E1xx)
Code |
Description |
Solution |
|---|---|---|
E101 |
Missing required field |
Add the missing field to your configuration |
E102 |
Invalid field type |
Check expected type in schema documentation |
E103 |
Invalid enum value |
Use one of the allowed values listed in error |
E104 |
Configuration validation failed |
Check the detailed validation messages |
Example - E101 Missing Required Field:
# ❌ Error: E101 - Missing required field 'train_x'
name: my_dataset
# ✅ Fixed: Add required train_x path
name: my_dataset
train_x: data/spectra.csv
File Errors (E2xx)
Code |
Description |
Solution |
|---|---|---|
E201 |
File not found |
Check file path exists and is accessible |
E202 |
Permission denied |
Check file permissions |
E203 |
Invalid file format |
Ensure file matches expected format (CSV, TSV, etc.) |
E204 |
File is empty |
Verify file contains data |
E205 |
Encoding error |
Specify correct encoding in params |
Example - E201 File Not Found:
# Check paths are relative to config file or use absolute paths
train_x: ./data/spectra.csv # Relative to config
train_x: /home/user/project/data/spectra.csv # Absolute
Data Errors (E3xx)
Code |
Description |
Solution |
|---|---|---|
E301 |
Missing values detected |
Handle NaN/missing values in preprocessing |
E302 |
Shape mismatch |
Ensure X and y have matching sample counts |
E303 |
Invalid numeric data |
Check for non-numeric values in spectral data |
E304 |
Spectral range inconsistent |
Verify wavelength headers match across files |
E305 |
Duplicate samples |
Use aggregation or remove duplicates |
Example - E302 Shape Mismatch:
# train_x has 100 samples, train_y has 95 samples
# Check for:
# 1. Header row counted as data
# 2. Missing samples in target file
# 3. Different sample ID formats
Loading Errors (E4xx)
Code |
Description |
Solution |
|---|---|---|
E401 |
Delimiter detection failed |
Explicitly set delimiter in params |
E402 |
Header parsing failed |
Check header format matches header_unit |
E403 |
Data type conversion failed |
Check data contains valid numbers |
E404 |
Memory error |
Use lazy_loading or process in chunks |
Example - E401 Delimiter Detection:
# If auto-detection fails, specify delimiter explicitly
params:
delimiter: "," # CSV
delimiter: "\t" # TSV
delimiter: ";" # European CSV
Partition Errors (E5xx)
Code |
Description |
Solution |
|---|---|---|
E501 |
Partition overlap |
Ensure train/val/test don’t share samples |
E502 |
Empty partition |
Check partition indices are valid |
E503 |
Invalid partition indices |
Indices must be within sample count |
E504 |
Partition sum mismatch |
Partition sizes should account for all samples |
Aggregation Errors (E6xx)
Code |
Description |
Solution |
|---|---|---|
E601 |
Group column not found |
Check group_by column exists in metadata |
E602 |
Aggregation method failed |
Check method name is valid |
E603 |
Custom aggregation error |
Verify custom function signature |
E604 |
Empty group after aggregation |
Some groups may have all outliers |
Example - E601 Group Column:
# Ensure the group column exists in your data
aggregation:
group_by: sample_id # Must match column name exactly (case-sensitive)
Variation Errors (E7xx)
Code |
Description |
Solution |
|---|---|---|
E701 |
Variation definition error |
Check variation syntax |
E702 |
Invalid spectral range |
Range must be within data bounds |
E703 |
Resampling error |
Check resample parameters |
E704 |
Noise application error |
Verify noise level is valid (0-1) |
Fold Errors (E8xx)
Code |
Description |
Solution |
|---|---|---|
E801 |
Fold definition error |
Check fold indices syntax |
E802 |
Fold overlap |
Ensure folds don’t share test samples |
E803 |
Invalid fold indices |
Indices must be within sample count |
E804 |
Inconsistent fold structure |
All folds should have train and test |
Runtime Errors (E9xx)
Code |
Description |
Solution |
|---|---|---|
E901 |
Cache error |
Clear cache and retry |
E902 |
Lazy loading error |
Try with lazy_loading: false |
E903 |
Timeout error |
Increase timeout or reduce data size |
E904 |
Memory limit exceeded |
Use lazy loading or reduce batch size |
Common Scenarios
Scenario 1: European CSV Format
European CSV files use semicolons as delimiters and commas as decimal separators.
params:
delimiter: ";"
decimal: ","
Scenario 2: Wavelength Headers with Units
If headers contain units (e.g., “1100 nm”, “4000 cm-1”):
params:
header_unit: nm # or cm-1, um
header_regex: null # Use default pattern
Scenario 3: Large Dataset Memory Issues
For datasets that exceed available memory:
performance:
lazy_loading: true
cache_enabled: true
cache_max_size_mb: 1024 # Limit cache size
Scenario 4: Sample Replicates
When each sample has multiple measurements:
aggregation:
group_by: sample_id
method: mean
exclude_outliers: true
outlier_threshold: 2.5
Scenario 5: Metadata Linking Issues
When sample IDs don’t match between files:
# Check for common issues:
# 1. Leading/trailing whitespace: " sample1 " vs "sample1"
# 2. Case differences: "Sample1" vs "sample1"
# 3. Numeric formatting: "001" vs "1"
metadata:
path: metadata.csv
link_by: sample_id
strip_whitespace: true # Remove whitespace
case_sensitive: false # Ignore case
Validation Workflow
Use the CLI to validate configurations before running pipelines:
# Validate configuration syntax
nirs4all dataset validate config.yaml
# Inspect data with auto-detection
nirs4all dataset inspect data.csv --detect
# Export normalized configuration
nirs4all dataset export config.yaml -o normalized.yaml
# Compare configurations
nirs4all dataset diff config1.yaml config2.yaml
Getting Diagnostic Reports
For detailed diagnostics, enable verbose mode:
from nirs4all.data import DatasetConfigs
from nirs4all.data.schema.validation import DiagnosticBuilder
# Create diagnostic builder
diagnostics = DiagnosticBuilder()
# Load with diagnostics
try:
config = DatasetConfigs.from_yaml("config.yaml")
except Exception as e:
# Get diagnostic report
report = diagnostics.build()
print(report.to_text())
# Or save as JSON for analysis
report.save_json("diagnostics.json")
FAQ
Q: My file loads but wavelengths are wrong
A: Check header_unit matches your data. Use nirs4all dataset inspect file.csv --detect to see detected parameters.
Q: Aggregation removes too many samples
A: Lower outlier_threshold or set exclude_outliers: false.
Q: Cache isn’t being used
A: Ensure cache_enabled: true and check cache size limits.
Q: Getting OOM errors with large datasets
A: Enable lazy_loading: true in performance settings.
Q: Configuration works locally but fails in CI A: Use absolute paths or paths relative to config file location.