# Dataset Configuration Troubleshooting Guide

This guide helps diagnose and resolve common issues when loading NIRS datasets.

## Error Code Reference

### Schema Errors (E1xx)

| Code | Description | Solution |
|------|-------------|----------|
| E101 | Missing required field | Add the missing field to your configuration |
| E102 | Invalid field type | Check expected type in schema documentation |
| E103 | Invalid enum value | Use one of the allowed values listed in error |
| E104 | Configuration validation failed | Check the detailed validation messages |

**Example - E101 Missing Required Field:**
```yaml
# ❌ Error: E101 - Missing required field 'train_x'
name: my_dataset
---
# ✅ Fixed: Add required train_x path
name: my_dataset
train_x: data/spectra.csv
```

### File Errors (E2xx)

| Code | Description | Solution |
|------|-------------|----------|
| E201 | File not found | Check file path exists and is accessible |
| E202 | Permission denied | Check file permissions |
| E203 | Invalid file format | Ensure file matches expected format (CSV, TSV, etc.) |
| E204 | File is empty | Verify file contains data |
| E205 | Encoding error | Specify correct encoding in params |

**Example - E201 File Not Found:**
```yaml
# Use a path relative to the config file, or an absolute path (set only one)
train_x: ./data/spectra.csv  # Relative to config
# train_x: /home/user/project/data/spectra.csv  # Absolute
```
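
To see which path a relative entry actually points to, it can help to resolve it by hand. The sketch below is not part of nirs4all; `resolve_data_path` and `missing_paths` are hypothetical helpers that assume relative paths are anchored at the folder containing the config file, as the example above suggests.

```python
from pathlib import Path


def resolve_data_path(config_file: str, data_path: str) -> Path:
    """Return data_path as an absolute path, anchored at the config file's folder."""
    path = Path(data_path)
    if not path.is_absolute():
        # Assumption: relative entries are relative to the config file, not the CWD
        path = Path(config_file).resolve().parent / path
    return path.resolve()


def missing_paths(config_file: str, data_paths: list[str]) -> list[Path]:
    """List every resolved path that does not exist on disk."""
    resolved = [resolve_data_path(config_file, p) for p in data_paths]
    return [p for p in resolved if not p.exists()]
```

Printing the result of `missing_paths("config.yaml", ["./data/spectra.csv"])` shows the full path that was tried, which usually makes a wrong working directory obvious.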

### Data Errors (E3xx)

| Code | Description | Solution |
|------|-------------|----------|
| E301 | Missing values detected | Handle NaN/missing values in preprocessing |
| E302 | Shape mismatch | Ensure X and y have matching sample counts |
| E303 | Invalid numeric data | Check for non-numeric values in spectral data |
| E304 | Spectral range inconsistent | Verify wavelength headers match across files |
| E305 | Duplicate samples | Use aggregation or remove duplicates |

**Example - E302 Shape Mismatch:**
```yaml
# train_x has 100 samples, train_y has 95 samples
# Check for:
# 1. Header row counted as data
# 2. Missing samples in target file
# 3. Different sample ID formats
```
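
A quick way to confirm the first cause in the checklist above is to count data rows in both files yourself. This is a minimal standard-library sketch; `count_samples` is a hypothetical helper, and it assumes UTF-8 text files with one sample per row.

```python
import csv


def count_samples(path: str, delimiter: str = ",", has_header: bool = True) -> int:
    """Count non-empty data rows in a delimited text file."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [
            row
            for row in csv.reader(handle, delimiter=delimiter)
            if any(cell.strip() for cell in row)  # skip blank lines
        ]
    # Drop one row for the header, unless the file is empty
    return len(rows) - 1 if has_header and rows else len(rows)
```

If the counts differ by exactly one, a header row is most likely being read as data in one of the two files.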

### Loading Errors (E4xx)

| Code | Description | Solution |
|------|-------------|----------|
| E401 | Delimiter detection failed | Explicitly set delimiter in params |
| E402 | Header parsing failed | Check header format matches header_unit |
| E403 | Data type conversion failed | Check data contains valid numbers |
| E404 | Memory error | Use lazy_loading or process in chunks |

**Example - E401 Delimiter Detection:**
```yaml
# If auto-detection fails, specify the delimiter explicitly (set exactly one)
params:
  delimiter: ","     # CSV
  # delimiter: "\t"  # TSV
  # delimiter: ";"   # European CSV
```
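
If you are unsure which delimiter a file uses, Python's standard `csv.Sniffer` can make a guess from the first few lines. This is only a sketch and is unrelated to nirs4all's own auto-detection; `guess_delimiter` is a hypothetical helper that assumes a UTF-8 file.

```python
import csv
from typing import Optional


def guess_delimiter(path: str, candidates: str = ",;\t", sample_chars: int = 4096) -> Optional[str]:
    """Guess the delimiter from the start of a file; return None if undecidable."""
    with open(path, newline="", encoding="utf-8") as handle:
        sample = handle.read(sample_chars)
    try:
        # Restrict the search to the delimiters we consider plausible
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None
```

Whatever it returns can be copied into `params.delimiter`; a `None` result means the file should be opened and checked by hand.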

### Partition Errors (E5xx)

| Code | Description | Solution |
|------|-------------|----------|
| E501 | Partition overlap | Ensure train/val/test don't share samples |
| E502 | Empty partition | Check partition indices are valid |
| E503 | Invalid partition indices | Indices must be within sample count |
| E504 | Partition sum mismatch | Partition sizes should account for all samples |
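
The four partition checks can be reproduced outside the loader when you want to test an index split before using it. The sketch below is one reading of the table above, not the library's implementation; `check_partitions` is a hypothetical helper and it assumes 0-based sample indices.

```python
def check_partitions(partitions: dict[str, list[int]], n_samples: int) -> list[str]:
    """Return human-readable problems found in a train/val/test index split."""
    problems = []
    seen: dict[int, str] = {}  # index -> first partition that claimed it
    for name, indices in partitions.items():
        if not indices:
            problems.append(f"E502: partition '{name}' is empty")
        for idx in indices:
            if not 0 <= idx < n_samples:
                problems.append(f"E503: index {idx} in '{name}' is outside 0..{n_samples - 1}")
            elif idx in seen and seen[idx] != name:
                problems.append(f"E501: index {idx} is in both '{seen[idx]}' and '{name}'")
            else:
                seen[idx] = name
    if len(seen) != n_samples:
        problems.append(f"E504: partitions cover {len(seen)} of {n_samples} samples")
    return problems
```

An empty list means the split is disjoint, in range, and covers every sample.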

### Aggregation Errors (E6xx)

| Code | Description | Solution |
|------|-------------|----------|
| E601 | Group column not found | Check group_by column exists in metadata |
| E602 | Aggregation method failed | Check method name is valid |
| E603 | Custom aggregation error | Verify custom function signature |
| E604 | Empty group after aggregation | Every sample in a group was excluded as an outlier; raise `outlier_threshold` or set `exclude_outliers: false` |

**Example - E601 Group Column:**
```yaml
# Ensure the group column exists in your data
aggregation:
  group_by: sample_id  # Must match column name exactly (case-sensitive)
```

### Variation Errors (E7xx)

| Code | Description | Solution |
|------|-------------|----------|
| E701 | Variation definition error | Check variation syntax |
| E702 | Invalid spectral range | Range must be within data bounds |
| E703 | Resampling error | Check resample parameters |
| E704 | Noise application error | Verify noise level is valid (0-1) |

### Fold Errors (E8xx)

| Code | Description | Solution |
|------|-------------|----------|
| E801 | Fold definition error | Check fold indices syntax |
| E802 | Fold overlap | Ensure folds don't share test samples |
| E803 | Invalid fold indices | Indices must be within sample count |
| E804 | Inconsistent fold structure | All folds should have train and test |
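
Fold definitions can be sanity-checked before loading. The sketch below is an interpretation of the table above, not nirs4all code; `check_folds` is a hypothetical helper that assumes each fold is a mapping with `train` and `test` lists of 0-based indices.

```python
from itertools import combinations


def check_folds(folds: list[dict[str, list[int]]], n_samples: int) -> list[str]:
    """Return human-readable problems found in a list of train/test folds."""
    problems = []
    for i, fold in enumerate(folds):
        for part in ("train", "test"):
            if not fold.get(part):
                problems.append(f"E804: fold {i} has no '{part}' indices")
        bad = [idx for part in fold.values() for idx in part if not 0 <= idx < n_samples]
        if bad:
            problems.append(f"E803: fold {i} has out-of-range indices {sorted(set(bad))}")
    # Test sets of different folds should not share samples
    for (i, a), (j, b) in combinations(enumerate(folds), 2):
        shared = set(a.get("test", [])) & set(b.get("test", []))
        if shared:
            problems.append(f"E802: folds {i} and {j} share test samples {sorted(shared)}")
    return problems
```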

### Runtime Errors (E9xx)

| Code | Description | Solution |
|------|-------------|----------|
| E901 | Cache error | Clear cache and retry |
| E902 | Lazy loading error | Try with lazy_loading: false |
| E903 | Timeout error | Increase timeout or reduce data size |
| E904 | Memory limit exceeded | Use lazy loading or reduce batch size |

---

## Common Scenarios

### Scenario 1: European CSV Format

European CSV files use semicolons as delimiters and commas as decimal separators.

```yaml
params:
  delimiter: ";"
  decimal: ","
```
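
To illustrate what these two settings mean for parsing, here is a small standard-library sketch. It is not how nirs4all reads files; `read_european_csv` is a hypothetical helper that assumes a single header row followed by purely numeric rows.

```python
import csv
import io


def read_european_csv(text: str) -> tuple[list[str], list[list[float]]]:
    """Parse semicolon-delimited text whose numbers use a decimal comma."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    # Swap the decimal comma for a point before converting to float
    rows = [[float(cell.replace(",", ".")) for cell in row] for row in reader if row]
    return header, rows


header, rows = read_european_csv("1100;1102\n0,512;0,498\n")
print(header, rows)
```

Reading such a file with a comma delimiter instead would split every number at its decimal comma, which typically surfaces as E302 or E303.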

### Scenario 2: Wavelength Headers with Units

If headers contain units (e.g., "1100 nm", "4000 cm-1"):

```yaml
params:
  header_unit: nm       # or cm-1, um
  header_regex: null    # Use default pattern
```
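
The sketch below shows the kind of parsing involved: extract the number from a header and convert it to nanometres. The regular expression is a hypothetical stand-in, not the library's default pattern, and `header_to_nm` is an illustrative helper. The wavenumber conversion uses the standard relation nm = 10^7 / cm-1.

```python
import re

# Hypothetical pattern: a number, optional whitespace, optional unit suffix
HEADER_PATTERN = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*(nm|cm-1|um)?\s*$", re.IGNORECASE)


def header_to_nm(header: str, default_unit: str = "nm") -> float:
    """Convert a column header such as '4000 cm-1' to a wavelength in nanometres."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"Unrecognised wavelength header: {header!r}")
    value = float(match.group(1))
    unit = (match.group(2) or default_unit).lower()
    if unit == "nm":
        return value
    if unit == "um":
        return value * 1000.0  # 1 um = 1000 nm
    if unit == "cm-1":
        return 1e7 / value     # wavenumber to wavelength
    raise ValueError(f"Unsupported unit: {unit!r}")
```

A mismatch between `header_unit` and the file shows up clearly here: "4000" read as nm is far outside the NIR range, while 4000 cm-1 corresponds to 2500 nm.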

### Scenario 3: Large Dataset Memory Issues

For datasets that exceed available memory:

```yaml
performance:
  lazy_loading: true
  cache_enabled: true
  cache_max_size_mb: 1024  # Limit cache size
```

### Scenario 4: Sample Replicates

When each sample has multiple measurements:

```yaml
aggregation:
  group_by: sample_id
  method: mean
  exclude_outliers: true
  outlier_threshold: 2.5
```
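
Conceptually, `method: mean` collapses all rows that share a `group_by` value into one averaged spectrum. The sketch below shows only that step with plain Python lists. `mean_by_group` is a hypothetical helper, and outlier exclusion is left out on purpose because this guide does not specify how `outlier_threshold` is interpreted.

```python
from collections import defaultdict


def mean_by_group(sample_ids: list[str], spectra: list[list[float]]) -> dict[str, list[float]]:
    """Average replicate spectra that share a sample ID."""
    groups: dict[str, list[list[float]]] = defaultdict(list)
    for sample_id, spectrum in zip(sample_ids, spectra):
        groups[sample_id].append(spectrum)
    # zip(*replicates) walks the replicates wavelength by wavelength
    return {
        sample_id: [sum(column) / len(column) for column in zip(*replicates)]
        for sample_id, replicates in groups.items()
    }
```

After aggregation the number of samples equals the number of distinct IDs, so the target file must also contain one row per ID to avoid E302.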

### Scenario 5: Metadata Linking Issues

When sample IDs don't match between files:

```yaml
# Check for common issues:
# 1. Leading/trailing whitespace: "  sample1  " vs "sample1"
# 2. Case differences: "Sample1" vs "sample1"
# 3. Numeric formatting: "001" vs "1"

metadata:
  path: metadata.csv
  link_by: sample_id
  strip_whitespace: true  # Remove whitespace
  case_sensitive: false   # Ignore case
```
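
To find out which IDs still fail to match, you can apply the same normalization by hand and compare the two ID sets. This is a sketch of the idea, not the library's linking code; `normalize_id` and `unmatched_ids` are hypothetical helpers whose defaults mirror the two options above.

```python
def normalize_id(raw: str, strip_whitespace: bool = True, case_sensitive: bool = False) -> str:
    """Apply whitespace and case normalization to one sample ID."""
    value = raw.strip() if strip_whitespace else raw
    return value if case_sensitive else value.casefold()


def unmatched_ids(spectra_ids: list[str], metadata_ids: list[str]) -> tuple[set[str], set[str]]:
    """Return IDs found only in the spectra file and IDs found only in the metadata."""
    left = {normalize_id(i) for i in spectra_ids}
    right = {normalize_id(i) for i in metadata_ids}
    return left - right, right - left
```

Note that neither option fixes numeric formatting: "001" and "1" remain different strings and have to be reconciled in the source files.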

---

## Validation Workflow

Use the CLI to validate configurations before running pipelines:

```bash
# Validate configuration syntax
nirs4all dataset validate config.yaml

# Inspect data with auto-detection
nirs4all dataset inspect data.csv --detect

# Export normalized configuration
nirs4all dataset export config.yaml -o normalized.yaml

# Compare configurations
nirs4all dataset diff config1.yaml config2.yaml
```

---

## Getting Diagnostic Reports

For detailed diagnostics, build a diagnostic report when loading fails:

```python
from nirs4all.data import DatasetConfigs
from nirs4all.data.schema.validation import DiagnosticBuilder

# Create diagnostic builder
diagnostics = DiagnosticBuilder()

# Load with diagnostics
try:
    config = DatasetConfigs.from_yaml("config.yaml")
except Exception as e:
    print(f"Loading failed: {e}")
    # Get diagnostic report
    report = diagnostics.build()
    print(report.to_text())

    # Or save as JSON for analysis
    report.save_json("diagnostics.json")
```

---

## FAQ

**Q: My file loads but wavelengths are wrong**
A: Check `header_unit` matches your data. Use `nirs4all dataset inspect file.csv --detect` to see detected parameters.

**Q: Aggregation removes too many samples**
A: Raise `outlier_threshold` so fewer samples are flagged as outliers, or set `exclude_outliers: false`.

**Q: Cache isn't being used**
A: Ensure `cache_enabled: true` is set and that `cache_max_size_mb` is large enough for your dataset.

**Q: Getting OOM errors with large datasets**
A: Enable `lazy_loading: true` in performance settings.

**Q: Configuration works locally but fails in CI**
A: Use absolute paths or paths relative to config file location.