# Synthetic Data Generation

Generate realistic synthetic NIRS spectra for testing, prototyping, and research.

## Overview

NIRS4ALL includes a synthetic data generator that builds spectra from the **Beer-Lambert law** and then applies realistic instrumental effects. This enables:

- **Reproducible testing**: Generate deterministic datasets for CI/CD pipelines
- **Prototyping**: Quickly explore pipelines without needing real data
- **Research**: Create controlled experiments with known ground truth
- **Teaching**: Demonstrate NIRS concepts with configurable examples

```{note}
The synthetic generator produces physically motivated spectra with Voigt-profile peak shapes and realistic noise models, which makes them suitable for algorithm development and validation.
```
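
To make the physics concrete, here is a minimal sketch of Beer-Lambert mixing in plain Python. It is not the library's implementation: `gaussian_band` and `beer_lambert` are hypothetical helpers, Gaussian bands stand in for Voigt profiles, and the wavelength grid, band widths, amplitudes, and concentrations are illustrative assumptions.

```python
import math


def gaussian_band(wavelengths, center, sigma, amplitude):
    """Gaussian absorption band (a simplification of a Voigt profile)."""
    return [amplitude * math.exp(-0.5 * ((w - center) / sigma) ** 2) for w in wavelengths]


def beer_lambert(concentrations, component_spectra, path_length=1.0):
    """Absorbance A(lambda) = path_length * sum_k c_k * eps_k(lambda)."""
    n_wavelengths = len(component_spectra[0])
    return [
        path_length * sum(c * eps[i] for c, eps in zip(concentrations, component_spectra))
        for i in range(n_wavelengths)
    ]


# Assumed grid: 1000 to 2500 nm in 2 nm steps (751 points)
wavelengths = [1000 + 2 * i for i in range(751)]

# Band centers come from the component tables on this page;
# widths and amplitudes are made up for illustration.
water = gaussian_band(wavelengths, center=1450, sigma=30, amplitude=1.0)
protein = gaussian_band(wavelengths, center=2180, sigma=25, amplitude=0.6)

# One mixture spectrum: 70% water, 30% protein
spectrum = beer_lambert([0.7, 0.3], [water, protein])
```

Each spectrum is a concentration-weighted sum of pure-component spectra, which is why the concentrations can serve as known ground truth.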

## Quick Start

### Simple Generation

```python
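import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler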

# Generate a dataset with 1000 samples
dataset = nirs4all.generate(n_samples=1000, random_state=42)

# Use immediately in a pipeline
result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset=dataset
)
```

### Get Raw Arrays

```python
# Get numpy arrays for quick experiments
X, y = nirs4all.generate(n_samples=500, as_dataset=False, random_state=42)

print(f"Features shape: {X.shape}")  # (500, 751)
print(f"Targets shape: {y.shape}")    # (500,) or (500, n_components)
```

## Convenience Functions

### Regression Datasets

```python
# Basic regression dataset
dataset = nirs4all.generate.regression(n_samples=500)

# With specific target range and distribution
dataset = nirs4all.generate.regression(
    n_samples=1000,
    target_range=(0, 100),           # Scale targets to 0-100
    target_component="protein",       # Use protein concentration as target
    distribution="lognormal",         # Lognormal concentration distribution
    random_state=42
)
```

### Classification Datasets

```python
# Binary classification
dataset = nirs4all.generate.classification(
    n_samples=500,
    n_classes=2,
    random_state=42
)

# Multiclass with imbalanced classes
dataset = nirs4all.generate.classification(
    n_samples=1000,
    n_classes=3,
    class_weights=[0.5, 0.3, 0.2],    # 50%, 30%, 20% class distribution
    class_separation=2.0,              # Higher = more separable classes
    random_state=42
)
```

### Multi-Source Datasets

Combine multiple data types (e.g., NIR spectra + chemical markers):

```python
dataset = nirs4all.generate.multi_source(
    n_samples=500,
    sources=[
        {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
        {"name": "markers", "type": "aux", "n_features": 15}
    ],
    random_state=42
)
```

## Builder API

For full control, use the fluent builder interface:

```python
from nirs4all.data.synthetic import SyntheticDatasetBuilder

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    # Configure spectral features
    .with_features(
        wavelength_range=(1000, 2500),
        complexity="realistic",              # simple, realistic, or complex
        components=["water", "protein", "lipid", "starch"]
    )
    # Configure target generation
    .with_targets(
        distribution="lognormal",
        range=(5, 50),
        component="protein"                  # Use protein as primary target
    )
    # Add metadata
    .with_metadata(
        n_groups=5,                          # 5 sample groups
        n_repetitions=(2, 4)                 # 2-4 measurements per sample
    )
    # Configure partitioning
    .with_partitions(train_ratio=0.8)
    # Add batch effects (for domain adaptation research)
    .with_batch_effects(n_batches=3)
    .build()
)
```

## Non-Linear Target Complexity

By default, synthetic targets have a simple linear relationship with spectral features,
which makes them unrealistically easy to predict. The builder methods below create more realistic, challenging datasets:

### Non-Linear Interactions

Add polynomial, synergistic, or antagonistic relationships between concentrations and targets:

```python
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_nonlinear_targets(
        interactions="polynomial",     # polynomial, synergistic, antagonistic
        interaction_strength=0.6,      # 0=linear, 1=fully non-linear
        hidden_factors=2,              # Latent variables not in spectra
        polynomial_degree=2            # Quadratic terms
    )
    .build()
)
```

| Interaction Type | Description | Use Case |
|------------------|-------------|----------|
| `"polynomial"` | C₁², C₁×C₂, C₁×C₂×C₃ terms | General non-linearity |
| `"synergistic"` | Combinations enhance effect | Chemical synergies |
| `"antagonistic"` | Michaelis-Menten saturation | Enzyme kinetics, inhibition |
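
The three interaction types can be pictured with the following sketch. The function names and formulas are assumptions chosen to match the descriptions in the table, not the library's internal code; `strength` plays the role of `interaction_strength`.

```python
def polynomial_target(c, strength=0.6):
    """Blend a linear term with squared and pairwise cross terms of concentrations c."""
    linear = sum(c)
    squares = sum(x * x for x in c)
    cross = sum(c[i] * c[j] for i in range(len(c)) for j in range(i + 1, len(c)))
    return (1 - strength) * linear + strength * (squares + cross)


def synergistic_target(c1, c2, strength=0.6):
    """The product term boosts the response when both components are present."""
    return (c1 + c2) + strength * c1 * c2


def antagonistic_target(c, v_max=1.0, k_m=0.3):
    """Michaelis-Menten saturation: the response levels off toward v_max."""
    return v_max * c / (k_m + c)
```

With `strength=0` the polynomial variant reduces to the plain linear target, and the antagonistic variant reaches half of `v_max` when the concentration equals `k_m`.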

### Confounders and Partial Predictability

Introduce factors that make the target only partially predictable from spectra:

```python
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_target_complexity(
        signal_to_confound_ratio=0.7,  # 70% predictable, 30% irreducible error
        n_confounders=2,               # Variables affecting both spectra and target
        temporal_drift=True            # Relationship changes over samples
    )
    .build()
)
```
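
One way to read `signal_to_confound_ratio` is as the share of target variance that the spectra can explain. The sketch below assumes that interpretation (the library may define the ratio differently) and uses hypothetical helper names. It mixes a spectral signal with an independent confounder and measures the resulting R² ceiling.

```python
import random


def mix_signal_and_confound(signal, confound, ratio):
    """Weight unit-variance inputs so that `ratio` of the target variance is signal."""
    signal_weight = ratio ** 0.5
    confound_weight = (1 - ratio) ** 0.5
    return [signal_weight * s + confound_weight * c for s, c in zip(signal, confound)]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


rng = random.Random(42)
signal = [rng.gauss(0.0, 1.0) for _ in range(20000)]    # part visible in the spectra
confound = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # part no model can recover
target = mix_signal_and_confound(signal, confound, ratio=0.7)

# Best achievable R^2 for a model that recovers the signal perfectly (close to 0.7)
ceiling = r_squared(signal, target)
```

Under this reading, a model that reports an R² well above the ratio on held-out data is likely leaking information or overfitting.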

### Multi-Regime Target Landscapes

Create regions in feature space with different target-spectra relationships:

```python
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_complex_target_landscape(
        n_regimes=3,                      # 3 different relationship regimes
        regime_method="concentration",    # concentration, spectral, or random
        regime_overlap=0.2,               # Smooth transitions between regimes
        noise_heteroscedasticity=0.5      # Noise varies by regime
    )
    .build()
)
```
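
A multi-regime landscape can be sketched as a piecewise relationship. The example below is an illustration under stated assumptions, not the library's algorithm: regimes are split on a single concentration with hard boundaries (so `regime_overlap` is not modeled), and the slopes and noise scales are made-up values.

```python
import random


def regime_of(concentration, boundaries):
    """Index of the regime whose concentration interval contains the value."""
    return sum(concentration >= b for b in boundaries)


def regime_target(
    concentration,
    rng,
    boundaries=(1 / 3, 2 / 3),
    slopes=(1.0, 2.5, 0.5),
    noise_scales=(0.01, 0.03, 0.05),
):
    """Each regime has its own slope and its own noise level (heteroscedasticity)."""
    k = regime_of(concentration, boundaries)
    return slopes[k] * concentration + rng.gauss(0.0, noise_scales[k])


rng = random.Random(42)
concentrations = [rng.random() for _ in range(1000)]
targets = [regime_target(c, rng) for c in concentrations]
```

A single global linear model fits such data poorly, which is the behavior the multi-regime option is meant to expose.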

### Combining All Complexity Features

For realistic benchmarking, combine all complexity features:

```python
# Create a challenging benchmark dataset
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    # Non-linear interactions
    .with_nonlinear_targets(
        interactions="polynomial",
        interaction_strength=0.5,
        hidden_factors=2
    )
    # Confounders
    .with_target_complexity(
        signal_to_confound_ratio=0.7,
        n_confounders=2
    )
    # Multi-regime
    .with_complex_target_landscape(
        n_regimes=3,
        noise_heteroscedasticity=0.3
    )
    .build()
)
```

```{note}
These features help test whether your model can handle:
- Non-linear relationships (try tree-based models, neural networks)
- Irreducible error (avoid overfitting)
- Subpopulations with different behaviors (local models, mixture models)
```

## Environmental and Matrix Effects

Real NIR spectra are affected by environmental conditions and sample matrix properties. Use these features to generate more realistic synthetic data.

### Temperature Effects

Temperature variations cause peak shifts, intensity changes, and band broadening:

```python
from nirs4all.data.synthetic import (
    SyntheticDatasetBuilder,
    EnvironmentalEffectsConfig,
    TemperatureConfig,
)

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_environmental_effects(
        temperature=TemperatureConfig(
            sample_temperature=35.0,      # Sample temperature in °C (reference: 25°C)
            temperature_variation=5.0,    # Sample-to-sample variation
        ),
    )
    .build()
)
```

| Effect | Description | Impact on Spectra |
|--------|-------------|-------------------|
| Peak shift | O-H bands shift with temperature | ~0.3 nm/°C blue shift |
| Intensity change | H-bonding decreases with temperature | ~0.2%/°C intensity decrease |
| Broadening | Thermal motion widens peaks | ~0.1%/°C width increase |
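
As a worked example of the rates in the table, the sketch below applies them to a single O-H band. It assumes the effects scale linearly with the temperature difference from a 25°C reference; `adjust_band` and the constants are hypothetical names, and the starting band width and amplitude are illustrative.

```python
REFERENCE_TEMP_C = 25.0
SHIFT_NM_PER_C = -0.3      # blue shift: the band moves to shorter wavelengths
INTENSITY_PER_C = -0.002   # -0.2% per degree
WIDTH_PER_C = 0.001        # +0.1% per degree


def adjust_band(center, sigma, amplitude, temperature_c):
    """Return (center, sigma, amplitude) of a band after a temperature change."""
    delta_t = temperature_c - REFERENCE_TEMP_C
    return (
        center + SHIFT_NM_PER_C * delta_t,
        sigma * (1 + WIDTH_PER_C * delta_t),
        amplitude * (1 + INTENSITY_PER_C * delta_t),
    )


# Water O-H band near 1450 nm, measured at 35°C instead of 25°C
center, sigma, amplitude = adjust_band(1450.0, 30.0, 1.0, temperature_c=35.0)
```

For a 10°C increase this gives a 3 nm shift toward shorter wavelengths, a 2% lower peak, and a 1% wider band.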

### Moisture/Water Activity Effects

Water content and activity affect hydrogen bonding and water band shapes:

```python
from nirs4all.data.synthetic import MoistureConfig

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_environmental_effects(
        moisture=MoistureConfig(
            water_activity=0.65,          # Water activity (0-1)
            moisture_content=0.12,        # Fractional moisture
            free_water_fraction=0.4,      # Fraction of free water
        ),
    )
    .build()
)
```

### Particle Size Effects (Scattering)

Particle size affects scattering in diffuse reflectance measurements:

```python
from nirs4all.data.synthetic import (
    ScatteringEffectsConfig,
    ParticleSizeConfig,
    ParticleSizeDistribution,
)

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_scattering_effects(
        particle_size=ParticleSizeConfig(
            distribution=ParticleSizeDistribution(
                mean_size_um=50.0,        # Mean particle size (μm)
                std_size_um=15.0,         # Size variation
                distribution="lognormal", # Distribution type
            ),
        ),
    )
    .build()
)
```
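
When sizes follow a lognormal distribution, the mean and standard deviation must be converted to the parameters of the underlying normal distribution before sampling. The sketch below shows that conversion with the standard library. It assumes `mean_size_um` and `std_size_um` describe the arithmetic mean and standard deviation of the sizes themselves, which may differ from how the library interprets them; `lognormal_params` is a hypothetical helper.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert the mean/std of a lognormal variable to (mu, sigma) of its logarithm."""
    sigma_squared = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma_squared
    return mu, math.sqrt(sigma_squared)


mu, sigma = lognormal_params(mean=50.0, std=15.0)

rng = random.Random(42)
sizes_um = [rng.lognormvariate(mu, sigma) for _ in range(20000)]
sample_mean = sum(sizes_um) / len(sizes_um)  # close to 50 µm
```

Unlike a normal distribution, the lognormal never produces negative sizes, and its right tail models the occasional coarse particle.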

### Combined Effects

Combine environmental and scattering effects for maximum realism:

```python
from nirs4all.data.synthetic import (
    SyntheticNIRSGenerator,
    EnvironmentalEffectsConfig,
    ScatteringEffectsConfig,
    TemperatureConfig,
    MoistureConfig,
    ParticleSizeConfig,
    ParticleSizeDistribution,
    EMSCConfig,
)

# Create generator with all Phase 3 effects
generator = SyntheticNIRSGenerator(
    wavelength_start=1000,
    wavelength_end=2500,
    complexity="realistic",
    environmental_config=EnvironmentalEffectsConfig(
        temperature=TemperatureConfig(sample_temperature=30.0),
        moisture=MoistureConfig(water_activity=0.6),
    ),
    scattering_effects_config=ScatteringEffectsConfig(
        particle_size=ParticleSizeConfig(
            distribution=ParticleSizeDistribution(mean_size_um=60.0)
        ),
        emsc=EMSCConfig(multiplicative_range=(0.85, 1.15)),
    ),
    random_state=42,
)

# Generate with effects enabled
X, C, E = generator.generate(
    n_samples=500,
    include_environmental_effects=True,
    include_scattering_effects=True,
)
```

```{note}
Environmental and scattering effects are correctable by standard preprocessing (SNV, MSC, derivatives). Use them to test robustness of your preprocessing pipeline.
```
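
The following sketch illustrates why SNV handles the simplest scattering model. It uses a hand-written `snv` function rather than the NIRS4ALL operator, and a made-up five-point spectrum. The gain of 1.15 sits at the upper end of the `multiplicative_range` used above, and the additive offset is an assumption.

```python
import math


def snv(spectrum):
    """Standard Normal Variate: center and scale each spectrum by its own mean and std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


clean = [0.20, 0.35, 0.80, 0.45, 0.25]
scattered = [1.15 * v + 0.05 for v in clean]  # multiplicative gain plus additive offset

# After SNV the two spectra coincide up to floating-point error
max_diff = max(abs(a - b) for a, b in zip(snv(clean), snv(scattered)))
```

SNV removes only a constant gain and offset per spectrum. Wavelength-dependent scatter and temperature-induced band shifts need other tools, such as MSC/EMSC or derivatives.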

## Validation and Benchmarking

Phase 4 introduces tools to validate synthetic data quality and benchmark against standard datasets.

### Spectral Realism Scorecard

Evaluate how realistic your synthetic data is compared to real reference data using 6 quantitative metrics:

```python
from nirs4all.data.synthetic import compute_spectral_realism_scorecard

# Compare synthetic data to real reference data
# (X_real, X_synth, and wavelengths are assumed to be existing arrays)
score = compute_spectral_realism_scorecard(
    real_spectra=X_real,
    synthetic_spectra=X_synth,
    wavelengths=wavelengths,
    include_adversarial=True
)

print(f"Overall Pass: {score.overall_pass}")
print(f"Correlation Length Overlap: {score.correlation_length_overlap:.3f}")
print(f"Adversarial AUC: {score.adversarial_auc:.3f}")  # Closer to 0.5 is better (harder to distinguish)
```
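
To interpret the adversarial AUC, it helps to see how an AUC is computed from classifier scores. The sketch below is a generic pairwise (Mann-Whitney) AUC with made-up scores. It does not reproduce the scorecard's classifier, and `auc` is a hypothetical helper.

```python
def auc(real_scores, synthetic_scores):
    """Probability that a synthetic sample scores higher than a real one (ties count 0.5)."""
    wins = 0.0
    for s in synthetic_scores:
        for r in real_scores:
            if s > r:
                wins += 1.0
            elif s == r:
                wins += 0.5
    return wins / (len(real_scores) * len(synthetic_scores))


# Classifier scores = predicted probability that a spectrum is synthetic
separable = auc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])          # classifier tells them apart
indistinguishable = auc([0.2, 0.5, 0.8], [0.2, 0.5, 0.8])  # classifier is guessing
```

An AUC of 1.0 means a classifier separates real from synthetic spectra perfectly, while 0.5 means it does no better than chance, which is the desired outcome for realistic synthetic data.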

### Benchmark Datasets

Access metadata and properties for standard NIR benchmark datasets to create matching synthetic data:

```python
from nirs4all.data.synthetic import (
    list_benchmark_datasets,
    get_benchmark_info,
    create_synthetic_matching_benchmark
)

# List available benchmarks
print(list_benchmark_datasets())
# ['corn', 'tecator', 'shootout2002', 'wheat_kernels', ...]

# Get info about a dataset
info = get_benchmark_info("corn")
print(f"{info.full_name}: {info.n_samples} samples, {info.n_wavelengths} wavelengths")

# Create synthetic data matching the benchmark properties
X, C, E = create_synthetic_matching_benchmark("corn", n_samples=1000)
```

### Prior Sampling

Generate realistic configurations based on domain knowledge (hierarchical sampling):

```python
from nirs4all.data.synthetic import sample_prior

# Sample a random realistic configuration
config = sample_prior(random_state=42)
print(f"Domain: {config['domain']}")
print(f"Instrument: {config['instrument']}")

# Sample for a specific domain
food_config = sample_prior(domain="food")
```

### GPU Acceleration

Accelerate generation of large datasets using JAX or CuPy (automatically detected):

```python
from nirs4all.data.synthetic import AcceleratedGenerator

# Automatically uses GPU if available (JAX/CuPy)
gen = AcceleratedGenerator(random_state=42)

# Generate large batch efficiently
# (wavelengths, E = component spectra, and C = concentrations are assumed to be defined)
X = gen.generate_batch(
    n_samples=100000,
    wavelengths=wavelengths,
    component_spectra=E,
    concentrations=C
)
```

## Configuration Options

### Complexity Levels

| Level | Description | Use Case |
|-------|-------------|----------|
| `"simple"` | Minimal noise, no scatter | Unit tests, quick prototyping |
| `"realistic"` | Typical NIR noise and effects | Algorithm development, validation |
| `"complex"` | High noise, artifacts, outliers | Robustness testing |

```python
# Simple (fast, clean spectra)
dataset = nirs4all.generate(complexity="simple", n_samples=1000)

# Realistic (recommended for most use cases)
dataset = nirs4all.generate(complexity="realistic", n_samples=1000)

# Complex (challenging scenarios)
dataset = nirs4all.generate(complexity="complex", n_samples=1000)
```

### Predefined Components

The generator includes **48 predefined spectral components** with physically accurate NIR band assignments based on published spectroscopy literature. Use these directly or as building blocks for custom scenarios.

**Water & Moisture:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"water"` | 1450, 1940, 2500 | Free water O-H overtones |
| `"moisture"` | 1460, 1930 | Bound water in organic matrices |

**Proteins & Nitrogen Compounds:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"protein"` | 1510, 1680, 2050, 2180 | Amide N-H and aromatic C-H |
| `"nitrogen_compound"` | 1500, 2060, 2150 | Primary/secondary amines |
| `"urea"` | 1480, 1530, 2010, 2170 | Urea CO(NH₂)₂ |
| `"amino_acid"` | 1520, 2040, 2260 | Free amino acids |
| `"casein"` | 1510, 1680, 2050, 2180 | Milk protein |
| `"gluten"` | 1505, 1680, 2050, 2180 | Wheat protein complex |

**Lipids & Hydrocarbons:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"lipid"` | 1210, 1390, 1720, 2310 | Triglyceride C-H stretching |
| `"oil"` | 1165, 1725, 2305 | Vegetable/mineral oils |
| `"saturated_fat"` | 1195, 1730, 2315 | Saturated fatty acids |
| `"unsaturated_fat"` | 1160, 1720, 2145 | Mono/polyunsaturated fats (=C-H) |
| `"waxes"` | 1190, 1720, 2310 | Cuticular waxes |
| `"aromatic"` | 1145, 1685, 2150 | Benzene derivatives |
| `"alkane"` | 1190, 1715, 2310 | Saturated hydrocarbons |

**Carbohydrates:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"starch"` | 1460, 1580, 2100, 2270 | Amylose/amylopectin |
| `"cellulose"` | 1490, 1780, 2090, 2280 | β-1,4-glucan chains |
| `"glucose"` | 1440, 1690, 2080, 2270 | D-glucose monosaccharide |
| `"fructose"` | 1430, 1695, 2070 | D-fructose (fruit sugar) |
| `"sucrose"` | 1435, 1685, 2075 | Disaccharide (table sugar) |
| `"lactose"` | 1450, 1690, 1940, 2100 | Milk sugar |
| `"hemicellulose"` | 1470, 1760, 2085 | Xylan/glucomannan |
| `"lignin"` | 1140, 1420, 1670, 2130 | Aromatic plant polymer |
| `"dietary_fiber"` | 1490, 1770, 2090, 2275 | Plant cell wall material |

**Alcohols & Polyols:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"ethanol"` | 1410, 1580, 1695, 2050 | Ethanol C₂H₅OH |
| `"methanol"` | 1400, 1545, 1705, 2040 | Methanol CH₃OH |
| `"glycerol"` | 1450, 1580, 1700, 2060 | Polyol (fermentation) |

**Organic Acids:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"acetic_acid"` | 1420, 1700, 1940 | Acetic acid CH₃COOH |
| `"citric_acid"` | 1440, 1920, 2060 | Citric acid (fruit acids) |
| `"lactic_acid"` | 1430, 1485, 1700, 2020 | Lactic acid |
| `"malic_acid"` | 1440, 1920, 2050, 2255 | Fruit acid (apples) |
| `"tartaric_acid"` | 1435, 1910, 2040, 2260 | Grape/wine acid |

**Plant Pigments & Phenolics:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"chlorophyll"` | 1070, 1400, 1730, 2270 | Chlorophyll a/b |
| `"carotenoid"` | 1050, 1680, 2135 | β-carotene, xanthophylls |
| `"tannins"` | 1420, 1670, 2056, 2270 | Phenolic compounds |

**Pharmaceuticals:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"caffeine"` | 1130, 1665, 1695, 2010 | Caffeine C₈H₁₀N₄O₂ |
| `"aspirin"` | 1145, 1435, 1680, 2020 | Acetylsalicylic acid |
| `"paracetamol"` | 1140, 1390, 1510, 1670 | Acetaminophen |

**Fibers & Textiles:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"cotton"` | 1200, 1494, 1780, 2100 | Cotton cellulose fiber |
| `"polyester"` | 1140, 1660, 1720, 2015 | PET synthetic fiber |
| `"nylon"` | 1500, 1720, 2050, 2295 | Polyamide fiber |

**Polymers & Plastics:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"polyethylene"` | 1190, 1720, 2310, 2355 | HDPE/LDPE plastic |
| `"polystyrene"` | 1145, 1680, 1720, 2170 | Aromatic polymer |
| `"natural_rubber"` | 1160, 1720, 2130, 2250 | cis-1,4-polyisoprene |

**Solvents:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"acetone"` | 1690, 1710, 2100, 2300 | Ketone (propan-2-one) |

**Soil Minerals:**
| Component | Key Bands (nm) | Description |
|-----------|----------------|-------------|
| `"carbonates"` | 2330, 2525 | CaCO₃, MgCO₃ (calcite) |
| `"gypsum"` | 1740, 1900, 2200 | CaSO₄·2H₂O |
| `"kaolinite"` | 1400, 2160, 2200 | Clay mineral |

```python
# Use specific components
dataset = nirs4all.generate(
    components=["water", "protein", "lipid"],
    n_samples=1000
)

# List all available components
from nirs4all.data.synthetic import ComponentLibrary
library = ComponentLibrary.from_predefined()
print(library.component_names)  # All 48 component names
```

```{seealso}
For detailed band assignments with literature references, see {doc}`/developer/synthetic`.
```

### Concentration Distributions

| Distribution | Description | Use Case |
|--------------|-------------|----------|
| `"dirichlet"` | Sum-to-one constraint | Natural composition data |
| `"uniform"` | Independent uniform | Wide concentration range |
| `"lognormal"` | Skewed, realistic | Agricultural/biological data |
| `"correlated"` | With inter-component correlations | Complex mixtures |
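
The sum-to-one constraint of the Dirichlet option can be illustrated with the standard library: normalizing independent gamma draws yields a Dirichlet sample. This is a generic sketch, with `sample_dirichlet` and the `alphas` values chosen for illustration rather than taken from the library.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one composition vector; components are positive and sum to one."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
# Fractions for four components, e.g. water, protein, lipid, starch
composition = sample_dirichlet([2.0, 2.0, 2.0, 2.0], rng)
```

Because the fractions are forced to sum to one, raising one component necessarily lowers the others, which mimics real compositional data.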

### Target Configuration

```python
# Single component target with scaling
dataset = (
    nirs4all.generate.builder(n_samples=1000)
    .with_targets(
        component="protein",       # Use one component
        range=(0, 100),            # Scale to percentage
        distribution="lognormal"
    )
    .build()
)

# Multi-output regression (all components as targets)
dataset = (
    nirs4all.generate.builder(n_samples=1000)
    .with_targets(
        component=None,            # Use all components
        range=(0, 1),              # Normalize
    )
    .build()
)
```
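
The `range` argument rescales raw concentrations into the requested interval. A minimal sketch of such a rescaling, assuming plain min-max scaling (the library may scale differently) and a hypothetical `scale_to_range` helper:

```python
def scale_to_range(values, target_range):
    """Linearly map values so that min -> target_range[0] and max -> target_range[1]."""
    low, high = target_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Degenerate case: all values equal, so map everything to the lower bound
        return [float(low) for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


protein_fraction = [0.08, 0.12, 0.20, 0.16]
scaled = scale_to_range(protein_fraction, (0, 100))
```

The mapping is linear, so the ordering of samples and the shape of the distribution are preserved.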

## Exporting Synthetic Data

### To Folder (DatasetConfigs compatible)

```python
# Generate and save to folder
path = nirs4all.generate.to_folder(
    "data/synthetic",
    n_samples=1000,
    train_ratio=0.8,
    format="standard",          # Creates Xcal, Ycal, Xval, Yval files
    random_state=42
)

# Later, load with DatasetConfigs
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs(path)
```

### Export Formats

| Format | Description | Files Created |
|--------|-------------|---------------|
| `"standard"` | Separate train/test files | Xcal.csv, Ycal.csv, Xval.csv, Yval.csv |
| `"single"` | All data in one file | data.csv (with partition column) |
| `"fragmented"` | Multiple small files | Useful for loader testing |

### To Single CSV

```python
path = nirs4all.generate.to_csv(
    "data/synthetic.csv",
    n_samples=500,
    random_state=42
)
```

## Matching Real Data

Generate synthetic data that resembles a real dataset:

```python
# From a dataset path
dataset = nirs4all.generate.from_template(
    "sample_data/regression",
    n_samples=1000,
    random_state=42
)

# From numpy arrays
dataset = nirs4all.generate.from_template(
    X_real,
    n_samples=500,
    wavelengths=wavelengths,
    random_state=42
)
```

The fitter analyzes:
- Statistical properties (mean, std, range)
- Spectral shape (slope, curvature)
- Noise characteristics
- PCA structure
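
As a rough illustration of the first two items, the sketch below computes global statistics and the baseline slope of the mean spectrum. It is a simplified stand-in with a hypothetical `summarize_spectra` helper and toy data. The library's fitter is not shown here, and noise and PCA analysis are left out.

```python
def summarize_spectra(spectra, wavelengths):
    """Global mean/std/range plus the least-squares slope of the mean spectrum."""
    n_samples = len(spectra)
    mean_spectrum = [sum(column) / n_samples for column in zip(*spectra)]

    values = [v for row in spectra for v in row]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

    # Slope of absorbance against wavelength (absorbance units per nm)
    w_mean = sum(wavelengths) / len(wavelengths)
    s_mean = sum(mean_spectrum) / len(mean_spectrum)
    slope = sum(
        (w - w_mean) * (s - s_mean) for w, s in zip(wavelengths, mean_spectrum)
    ) / sum((w - w_mean) ** 2 for w in wavelengths)

    return {"mean": mean, "std": std, "range": (min(values), max(values)), "slope": slope}


wavelengths = [1000, 1500, 2000, 2500]
spectra = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.4, 0.5, 0.6]]  # two toy spectra
stats = summarize_spectra(spectra, wavelengths)
```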

## Advanced: Custom Component Library

Create custom spectral components for specific applications:

```python
from nirs4all.data.synthetic import (
    SyntheticNIRSGenerator,
    ComponentLibrary,
    SpectralComponent,
    NIRBand
)

# Define custom component
my_component = SpectralComponent(
    name="my_compound",
    bands=[
        NIRBand(center=1500, sigma=20, gamma=2, amplitude=0.6, name="C-H stretch"),
        NIRBand(center=2100, sigma=30, gamma=3, amplitude=0.8, name="O-H combination"),
    ]
)

# Create library with custom and predefined components
library = ComponentLibrary()
library.add_component(my_component)
library.add_from_predefined(["water", "protein"])

# Generate with custom library
generator = SyntheticNIRSGenerator(
    component_library=library,
    random_state=42
)
X, C, E = generator.generate(n_samples=1000)  # spectra, concentrations, component spectra
```
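
To show how `sigma` and `gamma` might shape a band, here is a sketch of a peak-normalized pseudo-Voigt profile, using the Thompson-Cox-Hastings approximation to the Voigt width. It assumes `sigma` is the Gaussian standard deviation and `gamma` the Lorentzian half-width at half maximum, which may not match the exact convention used by `NIRBand`. `voigt_fwhm` and `pseudo_voigt` are hypothetical helpers, not library functions.

```python
import math


def voigt_fwhm(sigma, gamma):
    """Approximate full width at half maximum of a Voigt profile."""
    f_g = 2.0 * sigma * math.sqrt(2.0 * math.log(2.0))  # Gaussian FWHM
    f_l = 2.0 * gamma                                   # Lorentzian FWHM
    return (
        f_g ** 5
        + 2.69269 * f_g ** 4 * f_l
        + 2.42843 * f_g ** 3 * f_l ** 2
        + 4.47163 * f_g ** 2 * f_l ** 3
        + 0.07842 * f_g * f_l ** 4
        + f_l ** 5
    ) ** 0.2


def pseudo_voigt(x, center, sigma, gamma, amplitude):
    """Weighted sum of a Lorentzian and a Gaussian sharing one FWHM; peak equals amplitude."""
    fwhm = voigt_fwhm(sigma, gamma)
    ratio = 2.0 * gamma / fwhm
    eta = 1.36603 * ratio - 0.47719 * ratio ** 2 + 0.11116 * ratio ** 3  # Lorentzian share
    d = x - center
    gaussian = math.exp(-4.0 * math.log(2.0) * (d / fwhm) ** 2)
    lorentzian = 1.0 / (1.0 + (2.0 * d / fwhm) ** 2)
    return amplitude * (eta * lorentzian + (1.0 - eta) * gaussian)


# Same parameters as the first band of `my_compound` above
peak = pseudo_voigt(1500.0, center=1500, sigma=20, gamma=2, amplitude=0.6)
```

A larger `gamma` relative to `sigma` gives the band heavier Lorentzian tails, while `gamma` near zero recovers an almost pure Gaussian.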

## Integration with Pipelines

### Direct Usage

```python
import nirs4all
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

# Generate and train in one workflow
result = nirs4all.run(
    pipeline=[
        StandardScaler(),
        KFold(n_splits=5),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=nirs4all.generate(n_samples=1000, random_state=42),
    name="synthetic_test",
    verbose=1
)

print(f"Best RMSE: {result.best_rmse:.4f}")
```

### Comparing Preprocessing Methods

```python
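import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler, StandardScaler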
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative

# Generate consistent dataset
dataset = nirs4all.generate(n_samples=500, complexity="realistic", random_state=42)

# Test different preprocessing
for preproc in [MinMaxScaler(), StandardScaler(), SNV(), MSC(), FirstDerivative()]:
    result = nirs4all.run(
        pipeline=[preproc, KFold(3), PLSRegression(10)],
        dataset=dataset,
        verbose=0
    )
    print(f"{preproc.__class__.__name__}: RMSE={result.best_rmse:.4f}")
```

## Performance

| Samples | Complexity | Approximate Time |
|---------|------------|-----------------|
| 1,000 | simple | ~0.05s |
| 1,000 | realistic | ~0.1s |
| 10,000 | realistic | ~0.5s |
| 100,000 | complex | ~5s |

## API Reference

### Top-Level Functions

| Function | Description |
|----------|-------------|
| `nirs4all.generate()` | Main generation function |
| `nirs4all.generate.regression()` | Regression dataset |
| `nirs4all.generate.classification()` | Classification dataset |
| `nirs4all.generate.multi_source()` | Multi-source dataset |
| `nirs4all.generate.builder()` | Get builder for full control |
| `nirs4all.generate.to_folder()` | Generate and export to folder |
| `nirs4all.generate.to_csv()` | Generate and export to CSV |
| `nirs4all.generate.from_template()` | Generate matching real data |

### Core Classes

| Class | Description |
|-------|-------------|
| `SyntheticNIRSGenerator` | Core generation engine |
| `SyntheticDatasetBuilder` | Fluent builder interface |
| `ComponentLibrary` | Collection of spectral components |
| `SpectralComponent` | Single chemical component definition |
| `NIRBand` | Single absorption band (Voigt profile) |
| `NonLinearTargetProcessor` | Non-linear target complexity |
| `NonLinearTargetConfig` | Configuration for target complexity |

### Environmental & Scattering Classes (Phase 3)

| Class | Description |
|-------|-------------|
| `TemperatureConfig` | Temperature effect configuration |
| `MoistureConfig` | Moisture/water activity configuration |
| `EnvironmentalEffectsConfig` | Combined environmental effects |
| `EnvironmentalEffectsSimulator` | Apply temperature and moisture effects |
| `ParticleSizeConfig` | Particle size distribution configuration |
| `ParticleSizeDistribution` | Sample particle size distributions |
| `EMSCConfig` | EMSC-style scattering configuration |
| `ScatteringCoefficientConfig` | Scattering coefficient generation |
| `ScatteringEffectsConfig` | Combined scattering effects |
| `ScatteringEffectsSimulator` | Apply particle size and scattering effects |

### Validation & Benchmarking Classes (Phase 4)

| Class | Description |
|-------|-------------|
| `SpectralRealismScore` | Scorecard results container |
| `RealismMetric` | Enum of validation metrics |
| `BenchmarkDatasetInfo` | Metadata for benchmark datasets |
| `NIRSPriorConfig` | Configuration for prior sampling |
| `PriorSampler` | Hierarchical configuration sampler |
| `AcceleratedGenerator` | GPU-accelerated generation engine |
| `AcceleratorBackend` | Enum of acceleration backends (JAX, CuPy, NumPy) |

### Builder Methods for Target Complexity

| Method | Description |
|--------|-------------|
| `.with_nonlinear_targets()` | Add polynomial, synergistic, or antagonistic interactions |
| `.with_target_complexity()` | Add confounders and partial predictability |
| `.with_complex_target_landscape()` | Create multi-regime target landscapes |

## See Also

- {doc}`/developer/synthetic` - Developer guide for extending the generator
- {doc}`/api/nirs4all.api.generate` - API reference
- {doc}`/api/nirs4all.data.synthetic` - Low-level classes reference
- {doc}`loading_data` - Loading real datasets
- {doc}`/getting_started/concepts` - Understanding SpectroDataset