Synthetic Data Generation

Generate realistic synthetic NIRS spectra for testing, prototyping, and research.

Overview

NIRS4ALL includes a powerful synthetic data generator based on Beer-Lambert law physics with realistic instrumental effects. This enables:

  • Reproducible testing: Generate deterministic datasets for CI/CD pipelines

  • Prototyping: Quickly explore pipelines without needing real data

  • Research: Create controlled experiments with known ground truth

  • Teaching: Demonstrate NIRS concepts with configurable examples

Note

The synthetic generator produces physically-motivated spectra with Voigt profile peak shapes and realistic noise models, making them suitable for algorithm development and validation.

Quick Start

Simple Generation

import nirs4all

# Generate a dataset with 1000 samples
dataset = nirs4all.generate(n_samples=1000, random_state=42)

# Use immediately in a pipeline
result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset=dataset
)

Get Raw Arrays

# Get numpy arrays for quick experiments
X, y = nirs4all.generate(n_samples=500, as_dataset=False, random_state=42)

print(f"Features shape: {X.shape}")  # (500, 751)
print(f"Targets shape: {y.shape}")    # (500,) or (500, n_components)

Convenience Functions

Regression Datasets

# Basic regression dataset
dataset = nirs4all.generate.regression(n_samples=500)

# With specific target range and distribution
dataset = nirs4all.generate.regression(
    n_samples=1000,
    target_range=(0, 100),           # Scale targets to 0-100
    target_component="protein",       # Use protein concentration as target
    distribution="lognormal",         # Lognormal concentration distribution
    random_state=42
)

Classification Datasets

# Binary classification
dataset = nirs4all.generate.classification(
    n_samples=500,
    n_classes=2,
    random_state=42
)

# Multiclass with imbalanced classes
dataset = nirs4all.generate.classification(
    n_samples=1000,
    n_classes=3,
    class_weights=[0.5, 0.3, 0.2],    # 50%, 30%, 20% class distribution
    class_separation=2.0,              # Higher = more separable classes
    random_state=42
)

Multi-Source Datasets

Combine multiple data types (e.g., NIR spectra + chemical markers):

dataset = nirs4all.generate.multi_source(
    n_samples=500,
    sources=[
        {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
        {"name": "markers", "type": "aux", "n_features": 15}
    ],
    random_state=42
)

Builder API

For full control, use the fluent builder interface:

from nirs4all.data.synthetic import SyntheticDatasetBuilder

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    # Configure spectral features
    .with_features(
        wavelength_range=(1000, 2500),
        complexity="realistic",              # simple, realistic, or complex
        components=["water", "protein", "lipid", "starch"]
    )
    # Configure target generation
    .with_targets(
        distribution="lognormal",
        range=(5, 50),
        component="protein"                  # Use protein as primary target
    )
    # Add metadata
    .with_metadata(
        n_groups=5,                          # 5 sample groups
        n_repetitions=(2, 4)                 # 2-4 measurements per sample
    )
    # Configure partitioning
    .with_partitions(train_ratio=0.8)
    # Add batch effects (for domain adaptation research)
    .with_batch_effects(n_batches=3)
    .build()
)

Non-Linear Target Complexity

By default, synthetic targets have a simple linear relationship with spectral features, making them too easy to predict. Use these methods to create more realistic, challenging datasets:

Non-Linear Interactions

Add polynomial, synergistic, or antagonistic relationships between concentrations and targets:

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_nonlinear_targets(
        interactions="polynomial",     # polynomial, synergistic, antagonistic
        interaction_strength=0.6,      # 0=linear, 1=fully non-linear
        hidden_factors=2,              # Latent variables not in spectra
        polynomial_degree=2            # Quadratic terms
    )
    .build()
)

Interaction Type

Description

Use Case

"polynomial"

C₁², C₁×C₂, C₁×C₂×C₃ terms

General non-linearity

"synergistic"

Combinations enhance effect

Chemical synergies

"antagonistic"

Michaelis-Menten saturation

Enzyme kinetics, inhibition

Confounders and Partial Predictability

Introduce factors that make the target only partially predictable from spectra:

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_target_complexity(
        signal_to_confound_ratio=0.7,  # 70% predictable, 30% irreducible error
        n_confounders=2,               # Variables affecting both spectra and target
        temporal_drift=True            # Relationship changes over samples
    )
    .build()
)

Multi-Regime Target Landscapes

Create regions in feature space with different target-spectra relationships:

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_complex_target_landscape(
        n_regimes=3,                      # 3 different relationship regimes
        regime_method="concentration",    # concentration, spectral, or random
        regime_overlap=0.2,               # Smooth transitions between regimes
        noise_heteroscedasticity=0.5      # Noise varies by regime
    )
    .build()
)

Combining All Complexity Features

For realistic benchmarking, combine all complexity features:

# Create a challenging benchmark dataset
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    # Non-linear interactions
    .with_nonlinear_targets(
        interactions="polynomial",
        interaction_strength=0.5,
        hidden_factors=2
    )
    # Confounders
    .with_target_complexity(
        signal_to_confound_ratio=0.7,
        n_confounders=2
    )
    # Multi-regime
    .with_complex_target_landscape(
        n_regimes=3,
        noise_heteroscedasticity=0.3
    )
    .build()
)

Note

These features help test whether your model can handle:

  • Non-linear relationships (try tree-based models, neural networks)

  • Irreducible error (avoid overfitting)

  • Subpopulations with different behaviors (local models, mixture models)

Environmental and Matrix Effects

Real NIR spectra are affected by environmental conditions and sample matrix properties. Use these features to generate more realistic synthetic data.

Temperature Effects

Temperature variations cause peak shifts, intensity changes, and band broadening:

from nirs4all.data.synthetic import (
    SyntheticDatasetBuilder,
    EnvironmentalEffectsConfig,
    TemperatureConfig,
)

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_environmental_effects(
        temperature=TemperatureConfig(
            sample_temperature=35.0,      # °C above reference (25°C)
            temperature_variation=5.0,    # Sample-to-sample variation
        ),
    )
    .build()
)

Effect

Description

Impact on Spectra

Peak shift

O-H bands shift with temperature

~0.3 nm/°C blue shift

Intensity change

H-bonding decreases with temperature

~0.2%/°C intensity decrease

Broadening

Thermal motion widens peaks

~0.1%/°C width increase

Moisture/Water Activity Effects

Water content and activity affect hydrogen bonding and water band shapes:

from nirs4all.data.synthetic import MoistureConfig

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_environmental_effects(
        moisture=MoistureConfig(
            water_activity=0.65,          # Water activity (0-1)
            moisture_content=0.12,        # Fractional moisture
            free_water_fraction=0.4,      # Fraction of free water
        ),
    )
    .build()
)

Particle Size Effects (Scattering)

Particle size affects scattering in diffuse reflectance measurements:

from nirs4all.data.synthetic import (
    ScatteringEffectsConfig,
    ParticleSizeConfig,
    ParticleSizeDistribution,
)

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_scattering_effects(
        particle_size=ParticleSizeConfig(
            distribution=ParticleSizeDistribution(
                mean_size_um=50.0,        # Mean particle size (μm)
                std_size_um=15.0,         # Size variation
                distribution="lognormal", # Distribution type
            ),
        ),
    )
    .build()
)

Combined Effects

Combine environmental and scattering effects for maximum realism:

from nirs4all.data.synthetic import (
    SyntheticNIRSGenerator,
    EnvironmentalEffectsConfig,
    ScatteringEffectsConfig,
    TemperatureConfig,
    MoistureConfig,
    ParticleSizeConfig,
    EMSCConfig,
)

# Create generator with all Phase 3 effects
generator = SyntheticNIRSGenerator(
    wavelength_start=1000,
    wavelength_end=2500,
    complexity="realistic",
    environmental_config=EnvironmentalEffectsConfig(
        temperature=TemperatureConfig(sample_temperature=30.0),
        moisture=MoistureConfig(water_activity=0.6),
    ),
    scattering_effects_config=ScatteringEffectsConfig(
        particle_size=ParticleSizeConfig(
            distribution=ParticleSizeDistribution(mean_size_um=60.0)
        ),
        emsc=EMSCConfig(multiplicative_range=(0.85, 1.15)),
    ),
    random_state=42,
)

# Generate with effects enabled
X, C, E = generator.generate(
    n_samples=500,
    include_environmental_effects=True,
    include_scattering_effects=True,
)

Note

Environmental and scattering effects are correctable by standard preprocessing (SNV, MSC, derivatives). Use them to test robustness of your preprocessing pipeline.

Validation and Benchmarking

Phase 4 introduces tools to validate synthetic data quality and benchmark against standard datasets.

Spectral Realism Scorecard

Evaluate how realistic your synthetic data is compared to real reference data using 6 quantitative metrics:

from nirs4all.data.synthetic import compute_spectral_realism_scorecard

# Compare synthetic data to real reference data
score = compute_spectral_realism_scorecard(
    real_spectra=X_real,
    synthetic_spectra=X_synth,
    wavelengths=wavelengths,
    include_adversarial=True
)

print(f"Overall Pass: {score.overall_pass}")
print(f"Correlation Length Overlap: {score.correlation_length_overlap:.3f}")
print(f"Adversarial AUC: {score.adversarial_auc:.3f}")  # Lower is better (harder to distinguish)

Benchmark Datasets

Access metadata and properties for standard NIR benchmark datasets to create matching synthetic data:

from nirs4all.data.synthetic import (
    list_benchmark_datasets,
    get_benchmark_info,
    create_synthetic_matching_benchmark
)

# List available benchmarks
print(list_benchmark_datasets())
# ['corn', 'tecator', 'shootout2002', 'wheat_kernels', ...]

# Get info about a dataset
info = get_benchmark_info("corn")
print(f"{info.full_name}: {info.n_samples} samples, {info.n_wavelengths} wavelengths")

# Create synthetic data matching the benchmark properties
X, C, E = create_synthetic_matching_benchmark("corn", n_samples=1000)

Prior Sampling

Generate realistic configurations based on domain knowledge (hierarchical sampling):

from nirs4all.data.synthetic import sample_prior

# Sample a random realistic configuration
config = sample_prior(random_state=42)
print(f"Domain: {config['domain']}")
print(f"Instrument: {config['instrument']}")

# Sample for a specific domain
food_config = sample_prior(domain="food")

GPU Acceleration

Accelerate generation of large datasets using JAX or CuPy (automatically detected):

from nirs4all.data.synthetic import AcceleratedGenerator

# Automatically uses GPU if available (JAX/CuPy)
gen = AcceleratedGenerator(random_state=42)

# Generate large batch efficiently
X = gen.generate_batch(
    n_samples=100000,
    wavelengths=wavelengths,
    component_spectra=E,
    concentrations=C
)

Configuration Options

Complexity Levels

Level

Description

Use Case

"simple"

Minimal noise, no scatter

Unit tests, quick prototyping

"realistic"

Typical NIR noise and effects

Algorithm development, validation

"complex"

High noise, artifacts, outliers

Robustness testing

# Simple (fast, clean spectra)
dataset = nirs4all.generate(complexity="simple", n_samples=1000)

# Realistic (recommended for most use cases)
dataset = nirs4all.generate(complexity="realistic", n_samples=1000)

# Complex (challenging scenarios)
dataset = nirs4all.generate(complexity="complex", n_samples=1000)

Predefined Components

The generator includes 48 predefined spectral components with physically-accurate NIR band assignments based on published spectroscopy literature. Use these directly or as building blocks for custom scenarios.

Water & Moisture:

Component

Key Bands (nm)

Description

"water"

1450, 1940, 2500

Free water O-H overtones

"moisture"

1460, 1930

Bound water in organic matrices

Proteins & Nitrogen Compounds:

Component

Key Bands (nm)

Description

"protein"

1510, 1680, 2050, 2180

Amide N-H and aromatic C-H

"nitrogen_compound"

1500, 2060, 2150

Primary/secondary amines

"urea"

1480, 1530, 2010, 2170

Urea CO(NH₂)₂

"amino_acid"

1520, 2040, 2260

Free amino acids

"casein"

1510, 1680, 2050, 2180

Milk protein

"gluten"

1505, 1680, 2050, 2180

Wheat protein complex

Lipids & Hydrocarbons:

Component

Key Bands (nm)

Description

"lipid"

1210, 1390, 1720, 2310

Triglyceride C-H stretching

"oil"

1165, 1725, 2305

Vegetable/mineral oils

"saturated_fat"

1195, 1730, 2315

Saturated fatty acids

"unsaturated_fat"

1160, 1720, 2145

Mono/polyunsaturated fats (=C-H)

"waxes"

1190, 1720, 2310

Cuticular waxes

"aromatic"

1145, 1685, 2150

Benzene derivatives

"alkane"

1190, 1715, 2310

Saturated hydrocarbons

Carbohydrates:

Component

Key Bands (nm)

Description

"starch"

1460, 1580, 2100, 2270

Amylose/amylopectin

"cellulose"

1490, 1780, 2090, 2280

β-1,4-glucan chains

"glucose"

1440, 1690, 2080, 2270

D-glucose monosaccharide

"fructose"

1430, 1695, 2070

D-fructose (fruit sugar)

"sucrose"

1435, 1685, 2075

Disaccharide (table sugar)

"lactose"

1450, 1690, 1940, 2100

Milk sugar

"hemicellulose"

1470, 1760, 2085

Xylan/glucomannan

"lignin"

1140, 1420, 1670, 2130

Aromatic plant polymer

"dietary_fiber"

1490, 1770, 2090, 2275

Plant cell wall material

Alcohols & Polyols:

Component

Key Bands (nm)

Description

"ethanol"

1410, 1580, 1695, 2050

Ethanol C₂H₅OH

"methanol"

1400, 1545, 1705, 2040

Methanol CH₃OH

"glycerol"

1450, 1580, 1700, 2060

Polyol (fermentation)

Organic Acids:

Component

Key Bands (nm)

Description

"acetic_acid"

1420, 1700, 1940

Acetic acid CH₃COOH

"citric_acid"

1440, 1920, 2060

Citric acid (fruit acids)

"lactic_acid"

1430, 1485, 1700, 2020

Lactic acid

"malic_acid"

1440, 1920, 2050, 2255

Fruit acid (apples)

"tartaric_acid"

1435, 1910, 2040, 2260

Grape/wine acid

Plant Pigments & Phenolics:

Component

Key Bands (nm)

Description

"chlorophyll"

1070, 1400, 1730, 2270

Chlorophyll a/b

"carotenoid"

1050, 1680, 2135

β-carotene, xanthophylls

"tannins"

1420, 1670, 2056, 2270

Phenolic compounds

Pharmaceuticals:

Component

Key Bands (nm)

Description

"caffeine"

1130, 1665, 1695, 2010

Caffeine C₈H₁₀N₄O₂

"aspirin"

1145, 1435, 1680, 2020

Acetylsalicylic acid

"paracetamol"

1140, 1390, 1510, 1670

Acetaminophen

Fibers & Textiles:

Component

Key Bands (nm)

Description

"cotton"

1200, 1494, 1780, 2100

Cotton cellulose fiber

"polyester"

1140, 1660, 1720, 2015

PET synthetic fiber

"nylon"

1500, 1720, 2050, 2295

Polyamide fiber

Polymers & Plastics:

Component

Key Bands (nm)

Description

"polyethylene"

1190, 1720, 2310, 2355

HDPE/LDPE plastic

"polystyrene"

1145, 1680, 1720, 2170

Aromatic polymer

"natural_rubber"

1160, 1720, 2130, 2250

cis-1,4-polyisoprene

Solvents:

Component

Key Bands (nm)

Description

"acetone"

1690, 1710, 2100, 2300

Ketone (propan-2-one)

Soil Minerals:

Component

Key Bands (nm)

Description

"carbonates"

2330, 2525

CaCO₃, MgCO₃ (calcite)

"gypsum"

1740, 1900, 2200

CaSO₄·2H₂O

"kaolinite"

1400, 2160, 2200

Clay mineral

# Use specific components
dataset = nirs4all.generate(
    components=["water", "protein", "lipid"],
    n_samples=1000
)

# List all available components
from nirs4all.data.synthetic import ComponentLibrary
library = ComponentLibrary.from_predefined()
print(library.component_names)  # All 48 component names

See also

For detailed band assignments with literature references, see Synthetic Data Generator.

Concentration Distributions

Distribution

Description

Use Case

"dirichlet"

Sum-to-one constraint

Natural composition data

"uniform"

Independent uniform

Wide concentration range

"lognormal"

Skewed, realistic

Agricultural/biological data

"correlated"

With inter-component correlations

Complex mixtures

Target Configuration

# Single component target with scaling
dataset = nirs4all.generate.builder(n_samples=1000)
    .with_targets(
        component="protein",       # Use one component
        range=(0, 100),            # Scale to percentage
        distribution="lognormal"
    )
    .build()

# Multi-output regression (all components as targets)
dataset = nirs4all.generate.builder(n_samples=1000)
    .with_targets(
        component=None,            # Use all components
        range=(0, 1),              # Normalize
    )
    .build()

Exporting Synthetic Data

To Folder (DatasetConfigs compatible)

# Generate and save to folder
path = nirs4all.generate.to_folder(
    "data/synthetic",
    n_samples=1000,
    train_ratio=0.8,
    format="standard",          # Creates Xcal, Ycal, Xval, Yval files
    random_state=42
)

# Later, load with DatasetConfigs
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs(path)

Export Formats

Format

Description

Files Created

"standard"

Separate train/test files

Xcal.csv, Ycal.csv, Xval.csv, Yval.csv

"single"

All data in one file

data.csv (with partition column)

"fragmented"

Multiple small files

Useful for loader testing

To Single CSV

path = nirs4all.generate.to_csv(
    "data/synthetic.csv",
    n_samples=500,
    random_state=42
)

Matching Real Data

Generate synthetic data that resembles a real dataset:

# From a dataset path
dataset = nirs4all.generate.from_template(
    "sample_data/regression",
    n_samples=1000,
    random_state=42
)

# From numpy arrays
dataset = nirs4all.generate.from_template(
    X_real,
    n_samples=500,
    wavelengths=wavelengths,
    random_state=42
)

The fitter analyzes:

  • Statistical properties (mean, std, range)

  • Spectral shape (slope, curvature)

  • Noise characteristics

  • PCA structure

Advanced: Custom Component Library

Create custom spectral components for specific applications:

from nirs4all.data.synthetic import (
    SyntheticNIRSGenerator,
    ComponentLibrary,
    SpectralComponent,
    NIRBand
)

# Define custom component
my_component = SpectralComponent(
    name="my_compound",
    bands=[
        NIRBand(center=1500, sigma=20, gamma=2, amplitude=0.6, name="C-H stretch"),
        NIRBand(center=2100, sigma=30, gamma=3, amplitude=0.8, name="O-H combination"),
    ]
)

# Create library with custom and predefined components
library = ComponentLibrary()
library.add_component(my_component)
library.add_from_predefined(["water", "protein"])

# Generate with custom library
generator = SyntheticNIRSGenerator(
    component_library=library,
    random_state=42
)
X, Y, E = generator.generate(n_samples=1000)

Integration with Pipelines

Direct Usage

import nirs4all
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

# Generate and train in one workflow
result = nirs4all.run(
    pipeline=[
        StandardScaler(),
        KFold(n_splits=5),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=nirs4all.generate(n_samples=1000, random_state=42),
    name="synthetic_test",
    verbose=1
)

print(f"Best RMSE: {result.best_rmse:.4f}")

Comparing Preprocessing Methods

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative

# Generate consistent dataset
dataset = nirs4all.generate(n_samples=500, complexity="realistic", random_state=42)

# Test different preprocessing
for preproc in [MinMaxScaler(), StandardScaler(), SNV(), MSC(), FirstDerivative()]:
    result = nirs4all.run(
        pipeline=[preproc, KFold(3), PLSRegression(10)],
        dataset=dataset,
        verbose=0
    )
    print(f"{preproc.__class__.__name__}: RMSE={result.best_rmse:.4f}")

Performance

Samples

Complexity

Approximate Time

1,000

simple

~0.05s

1,000

realistic

~0.1s

10,000

realistic

~0.5s

100,000

complex

~5s

API Reference

Top-Level Functions

Function

Description

nirs4all.generate()

Main generation function

nirs4all.generate.regression()

Regression dataset

nirs4all.generate.classification()

Classification dataset

nirs4all.generate.multi_source()

Multi-source dataset

nirs4all.generate.builder()

Get builder for full control

nirs4all.generate.to_folder()

Generate and export to folder

nirs4all.generate.to_csv()

Generate and export to CSV

nirs4all.generate.from_template()

Generate matching real data

Core Classes

Class

Description

SyntheticNIRSGenerator

Core generation engine

SyntheticDatasetBuilder

Fluent builder interface

ComponentLibrary

Collection of spectral components

SpectralComponent

Single chemical component definition

NIRBand

Single absorption band (Voigt profile)

NonLinearTargetProcessor

Non-linear target complexity

NonLinearTargetConfig

Configuration for target complexity

Environmental & Scattering Classes (Phase 3)

Class

Description

TemperatureConfig

Temperature effect configuration

MoistureConfig

Moisture/water activity configuration

EnvironmentalEffectsConfig

Combined environmental effects

EnvironmentalEffectsSimulator

Apply temperature and moisture effects

ParticleSizeConfig

Particle size distribution configuration

ParticleSizeDistribution

Sample particle size distributions

EMSCConfig

EMSC-style scattering configuration

ScatteringCoefficientConfig

Scattering coefficient generation

ScatteringEffectsConfig

Combined scattering effects

ScatteringEffectsSimulator

Apply particle size and scattering effects

Validation & Benchmarking Classes (Phase 4)

Class

Description

SpectralRealismScore

Scorecard results container

RealismMetric

Enum of validation metrics

BenchmarkDatasetInfo

Metadata for benchmark datasets

NIRSPriorConfig

Configuration for prior sampling

PriorSampler

Hierarchical configuration sampler

AcceleratedGenerator

GPU-accelerated generation engine

AcceleratorBackend

Enum of acceleration backends (JAX, CuPy, NumPy)

Builder Methods for Target Complexity

Method

Description

.with_nonlinear_targets()

Add polynomial, synergistic, or antagonistic interactions

.with_target_complexity()

Add confounders and partial predictability

.with_complex_target_landscape()

Create multi-regime target landscapes

See Also