Synthetic Data Generation
Generate realistic synthetic NIRS spectra for testing, prototyping, and research.
Overview
NIRS4ALL includes a powerful synthetic data generator based on Beer-Lambert law physics with realistic instrumental effects. This enables:
Reproducible testing: Generate deterministic datasets for CI/CD pipelines
Prototyping: Quickly explore pipelines without needing real data
Research: Create controlled experiments with known ground truth
Teaching: Demonstrate NIRS concepts with configurable examples
Note
The synthetic generator produces physically-motivated spectra with Voigt profile peak shapes and realistic noise models, making them suitable for algorithm development and validation.
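To make the physical model concrete, here is a minimal NumPy sketch of Beer-Lambert mixing: each spectrum is a concentration-weighted sum of pure-component spectra plus noise. It is not the library's implementation. The `pseudo_voigt` helper, its mixing weight, and the band positions are illustrative assumptions, and a pseudo-Voigt (weighted Gaussian plus Lorentzian) stands in for a true Voigt profile.

```python
import numpy as np

def pseudo_voigt(wl, center, sigma, gamma, amplitude=1.0):
    """Pseudo-Voigt band: weighted sum of a Gaussian and a Lorentzian.

    sigma is the Gaussian standard deviation and gamma the Lorentzian
    half-width, both in nm. The mixing weight eta is a simple illustrative
    choice, not a fitted approximation of the true Voigt profile.
    """
    gauss = np.exp(-0.5 * ((wl - center) / sigma) ** 2)
    lorentz = gamma**2 / ((wl - center) ** 2 + gamma**2)
    eta = gamma / (gamma + sigma)  # fraction of Lorentzian character
    return amplitude * ((1.0 - eta) * gauss + eta * lorentz)

rng = np.random.default_rng(42)
wavelengths = np.arange(1000, 2501, 2)  # 751 points covering 1000-2500 nm

# Two hypothetical pure-component spectra (rows of E)
E = np.vstack([
    pseudo_voigt(wavelengths, 1450, 25, 3) + pseudo_voigt(wavelengths, 1940, 30, 4, 1.5),
    pseudo_voigt(wavelengths, 1510, 20, 2, 0.6) + pseudo_voigt(wavelengths, 2180, 30, 3, 0.8),
])

# Concentrations (rows sum to 1) and Beer-Lambert mixing with additive noise
C = rng.dirichlet(alpha=[2.0, 2.0], size=5)
X = C @ E + rng.normal(scale=1e-3, size=(5, wavelengths.size))
print(X.shape)  # (5, 751)
```

Because the concentrations `C` and component spectra `E` are known exactly, any model fitted to `X` can be checked against ground truth.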
Quick Start
Simple Generation
import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler

# Generate a dataset with 1000 samples
dataset = nirs4all.generate(n_samples=1000, random_state=42)

# Use immediately in a pipeline
result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset=dataset
)
Get Raw Arrays
# Get numpy arrays for quick experiments
X, y = nirs4all.generate(n_samples=500, as_dataset=False, random_state=42)
print(f"Features shape: {X.shape}") # (500, 751)
print(f"Targets shape: {y.shape}") # (500,) or (500, n_components)
Convenience Functions
Regression Datasets
# Basic regression dataset
dataset = nirs4all.generate.regression(n_samples=500)
# With specific target range and distribution
dataset = nirs4all.generate.regression(
    n_samples=1000,
    target_range=(0, 100),       # Scale targets to 0-100
    target_component="protein",  # Use protein concentration as target
    distribution="lognormal",    # Lognormal concentration distribution
    random_state=42
)
Classification Datasets
# Binary classification
dataset = nirs4all.generate.classification(
    n_samples=500,
    n_classes=2,
    random_state=42
)

# Multiclass with imbalanced classes
dataset = nirs4all.generate.classification(
    n_samples=1000,
    n_classes=3,
    class_weights=[0.5, 0.3, 0.2],  # 50%, 30%, 20% class distribution
    class_separation=2.0,           # Higher = more separable classes
    random_state=42
)
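As an illustration of what the two classification knobs mean, the sketch below draws labels according to `class_weights` and spaces class centres by `class_separation` along a single latent axis. This is an assumed interpretation for intuition only. The generator's actual mechanism may differ, and the variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, class_weights, class_separation = 1000, [0.5, 0.3, 0.2], 2.0

# Draw labels so that class frequencies follow class_weights in expectation
labels = rng.choice(len(class_weights), size=n_samples, p=class_weights)

# Place class centres class_separation within-class standard deviations apart
# along one latent axis, then add unit-variance scatter around each centre
centres = class_separation * np.arange(len(class_weights))
latent = centres[labels] + rng.normal(size=n_samples)

print(np.bincount(labels) / n_samples)  # close to [0.5, 0.3, 0.2]
```

With a separation of 2.0 the neighbouring classes still overlap in their tails. Larger values pull the classes apart and make the problem easier.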
Multi-Source Datasets
Combine multiple data types (e.g., NIR spectra + chemical markers):
dataset = nirs4all.generate.multi_source(
    n_samples=500,
    sources=[
        {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
        {"name": "markers", "type": "aux", "n_features": 15}
    ],
    random_state=42
)
Builder API
For full control, use the fluent builder interface:
from nirs4all.data.synthetic import SyntheticDatasetBuilder
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    # Configure spectral features
    .with_features(
        wavelength_range=(1000, 2500),
        complexity="realistic",  # simple, realistic, or complex
        components=["water", "protein", "lipid", "starch"]
    )
    # Configure target generation
    .with_targets(
        distribution="lognormal",
        range=(5, 50),
        component="protein"  # Use protein as primary target
    )
    # Add metadata
    .with_metadata(
        n_groups=5,            # 5 sample groups
        n_repetitions=(2, 4)   # 2-4 measurements per sample
    )
    # Configure partitioning
    .with_partitions(train_ratio=0.8)
    # Add batch effects (for domain adaptation research)
    .with_batch_effects(n_batches=3)
    .build()
)
Non-Linear Target Complexity
By default, synthetic targets have a simple linear relationship with spectral features, making them too easy to predict. Use these methods to create more realistic, challenging datasets:
Non-Linear Interactions
Add polynomial, synergistic, or antagonistic relationships between concentrations and targets:
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_nonlinear_targets(
        interactions="polynomial",  # polynomial, synergistic, antagonistic
        interaction_strength=0.6,   # 0=linear, 1=fully non-linear
        hidden_factors=2,           # Latent variables not in spectra
        polynomial_degree=2         # Quadratic terms
    )
    .build()
)
| Interaction Type | Description | Use Case |
|---|---|---|
| polynomial | C₁², C₁×C₂, C₁×C₂×C₃ terms | General non-linearity |
| synergistic | Combinations enhance effect | Chemical synergies |
| antagonistic | Michaelis-Menten saturation | Enzyme kinetics, inhibition |
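The three interaction types can be pictured with a toy NumPy sketch. The functions `interaction_term` and `blend`, the saturation constant 0.5, and the way `strength` mixes linear and non-linear parts are all assumptions made for illustration. They are not taken from the nirs4all source.

```python
import numpy as np

def interaction_term(C, kind="polynomial"):
    """Toy non-linear term from a concentration matrix C (n_samples, >= 2 components)."""
    c1, c2 = C[:, 0], C[:, 1]
    if kind == "polynomial":
        return c1**2 + c1 * c2   # squared and cross terms
    if kind == "synergistic":
        return c1 * c2           # only large when both components are present
    if kind == "antagonistic":
        return c1 / (0.5 + c1)   # Michaelis-Menten style saturation
    raise ValueError(f"unknown interaction kind: {kind}")

def blend(C, weights, kind, strength):
    """Mix a linear target with a non-linear term; strength=0 is purely linear."""
    linear = C @ weights
    return (1.0 - strength) * linear + strength * interaction_term(C, kind)

rng = np.random.default_rng(42)
C = rng.uniform(0.0, 1.0, size=(1000, 3))
y = blend(C, weights=np.array([1.0, 0.5, 0.2]), kind="polynomial", strength=0.6)
```

A linear model such as PLS recovers `y` exactly when `strength` is 0 and degrades as it approaches 1, which is the behaviour the `interaction_strength` comment above describes.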
Confounders and Partial Predictability
Introduce factors that make the target only partially predictable from spectra:
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_target_complexity(
        signal_to_confound_ratio=0.7,  # 70% predictable, 30% irreducible error
        n_confounders=2,               # Variables affecting both spectra and target
        temporal_drift=True            # Relationship changes over samples
    )
    .build()
)
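One way to read the "70% predictable" comment is as a variance fraction: if 70% of the target variance comes from the spectra, no model can exceed an R² of about 0.7. The sketch below demonstrates that ceiling with plain NumPy. Treating `signal_to_confound_ratio` as a variance share is an assumption about the parameter, not a statement of how the library computes it.

```python
import numpy as np

rng = np.random.default_rng(42)
n, ratio = 20_000, 0.7

signal = rng.normal(size=n)    # part of the target explained by the spectra
confound = rng.normal(size=n)  # part driven by unobserved factors

# Weight by square roots so the variance shares are ratio and 1 - ratio
y = np.sqrt(ratio) * signal + np.sqrt(1.0 - ratio) * confound

# Even a perfect model of the signal cannot explain the confounded part
r2_ceiling = np.corrcoef(signal, y)[0, 1] ** 2
print(f"Best achievable R^2 is about {r2_ceiling:.2f}")
```

When benchmarking on such data, a validation R² well above the ceiling is a sign of leakage or overfitting rather than a better model.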
Multi-Regime Target Landscapes
Create regions in feature space with different target-spectra relationships:
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_complex_target_landscape(
        n_regimes=3,                    # 3 different relationship regimes
        regime_method="concentration",  # concentration, spectral, or random
        regime_overlap=0.2,             # Smooth transitions between regimes
        noise_heteroscedasticity=0.5    # Noise varies by regime
    )
    .build()
)
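For intuition, a multi-regime target can be sketched as a piecewise relationship whose slope and noise level depend on the regime. The example below assigns regimes from concentration terciles. The slopes and noise levels are hypothetical, and the sketch uses hard boundaries, so it does not model `regime_overlap`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_regimes = 1000, 3
conc = rng.uniform(0.0, 1.0, size=n_samples)  # driving concentration

# Assign each sample to a regime using concentration terciles
edges = np.quantile(conc, np.linspace(0, 1, n_regimes + 1)[1:-1])
regime = np.digitize(conc, edges)  # integer labels 0 .. n_regimes - 1

slopes = np.array([20.0, 60.0, -30.0])  # hypothetical per-regime slopes
noise_sd = np.array([0.5, 1.0, 2.0])    # heteroscedastic: noise depends on regime
y = slopes[regime] * conc + rng.normal(size=n_samples) * noise_sd[regime]
```

A single global linear model fits this poorly, while local or mixture models that recover the regimes do much better.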
Combining All Complexity Features
For realistic benchmarking, combine all complexity features:
# Create a challenging benchmark dataset
dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    # Non-linear interactions
    .with_nonlinear_targets(
        interactions="polynomial",
        interaction_strength=0.5,
        hidden_factors=2
    )
    # Confounders
    .with_target_complexity(
        signal_to_confound_ratio=0.7,
        n_confounders=2
    )
    # Multi-regime
    .with_complex_target_landscape(
        n_regimes=3,
        noise_heteroscedasticity=0.3
    )
    .build()
)
Note
These features help test whether your model can handle:
Non-linear relationships (try tree-based models, neural networks)
Irreducible error (avoid overfitting)
Subpopulations with different behaviors (local models, mixture models)
Environmental and Matrix Effects
Real NIR spectra are affected by environmental conditions and sample matrix properties. Use these features to generate more realistic synthetic data.
Temperature Effects
Temperature variations cause peak shifts, intensity changes, and band broadening:
from nirs4all.data.synthetic import (
    SyntheticDatasetBuilder,
    EnvironmentalEffectsConfig,
    TemperatureConfig,
)

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_targets(component=0, range=(0, 100))
    .with_environmental_effects(
        temperature=TemperatureConfig(
            sample_temperature=35.0,    # Sample temperature in °C (reference: 25°C)
            temperature_variation=5.0,  # Sample-to-sample variation
        ),
    )
    .build()
)
| Effect | Description | Impact on Spectra |
|---|---|---|
| Peak shift | O-H bands shift with temperature | ~0.3 nm/°C blue shift |
| Intensity change | H-bonding decreases with temperature | ~0.2%/°C intensity decrease |
| Broadening | Thermal motion widens peaks | ~0.1%/°C width increase |
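The coefficients in the table can be applied to a single band to see their combined effect. The sketch below perturbs an idealised Gaussian O-H band near 1450 nm. The `water_band` function and its default width are illustrative assumptions. Only the three per-degree coefficients come from the table.

```python
import numpy as np

def water_band(wl, delta_t=0.0, center=1450.0, width=30.0, amplitude=1.0):
    """Gaussian O-H band perturbed by a temperature offset delta_t (°C from reference)."""
    shifted_center = center - 0.3 * delta_t                  # ~0.3 nm/°C blue shift
    scaled_amplitude = amplitude * (1.0 - 0.002 * delta_t)   # ~0.2 %/°C weaker
    scaled_width = width * (1.0 + 0.001 * delta_t)           # ~0.1 %/°C broader
    return scaled_amplitude * np.exp(-0.5 * ((wl - shifted_center) / scaled_width) ** 2)

wl = np.linspace(1300.0, 1600.0, 3001)  # 0.1 nm grid
reference = water_band(wl)
warm = water_band(wl, delta_t=10.0)     # e.g. a 35 °C sample against a 25 °C reference

shift_nm = wl[warm.argmax()] - wl[reference.argmax()]
print(f"Peak moved by {shift_nm:.1f} nm, height ratio {warm.max() / reference.max():.3f}")
```

A 10 °C offset moves the peak by about 3 nm and lowers it by about 2%, small changes that are nonetheless enough to bias a calibration built at a single temperature.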
Moisture/Water Activity Effects
Water content and activity affect hydrogen bonding and water band shapes:
from nirs4all.data.synthetic import MoistureConfig

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_environmental_effects(
        moisture=MoistureConfig(
            water_activity=0.65,      # Water activity (0-1)
            moisture_content=0.12,    # Fractional moisture
            free_water_fraction=0.4,  # Fraction of free water
        ),
    )
    .build()
)
Particle Size Effects (Scattering)
Particle size affects scattering in diffuse reflectance measurements:
from nirs4all.data.synthetic import (
    ScatteringEffectsConfig,
    ParticleSizeConfig,
    ParticleSizeDistribution,
)

dataset = (
    SyntheticDatasetBuilder(n_samples=1000, random_state=42)
    .with_features(complexity="realistic")
    .with_scattering_effects(
        particle_size=ParticleSizeConfig(
            distribution=ParticleSizeDistribution(
                mean_size_um=50.0,         # Mean particle size (μm)
                std_size_um=15.0,          # Size variation
                distribution="lognormal",  # Distribution type
            ),
        ),
    )
    .build()
)
Combined Effects
Combine environmental and scattering effects for maximum realism:
from nirs4all.data.synthetic import (
    SyntheticNIRSGenerator,
    EnvironmentalEffectsConfig,
    ScatteringEffectsConfig,
    TemperatureConfig,
    MoistureConfig,
    ParticleSizeConfig,
    ParticleSizeDistribution,
    EMSCConfig,
)

# Create generator with all Phase 3 effects
generator = SyntheticNIRSGenerator(
    wavelength_start=1000,
    wavelength_end=2500,
    complexity="realistic",
    environmental_config=EnvironmentalEffectsConfig(
        temperature=TemperatureConfig(sample_temperature=30.0),
        moisture=MoistureConfig(water_activity=0.6),
    ),
    scattering_effects_config=ScatteringEffectsConfig(
        particle_size=ParticleSizeConfig(
            distribution=ParticleSizeDistribution(mean_size_um=60.0)
        ),
        emsc=EMSCConfig(multiplicative_range=(0.85, 1.15)),
    ),
    random_state=42,
)

# Generate with effects enabled
# Returns spectra X, concentrations C, and component spectra E
X, C, E = generator.generate(
    n_samples=500,
    include_environmental_effects=True,
    include_scattering_effects=True,
)
Note
Much of the variation introduced by environmental and scattering effects can be reduced by standard preprocessing (SNV, MSC, derivatives). Use these effects to test the robustness of your preprocessing pipeline.
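As a small check of that idea, the sketch below implements SNV directly in NumPy and shows that it removes a per-sample gain and offset of the kind an EMSC-style scatter model introduces. The `snv` function here is a local illustration, not the nirs4all operator, and the single Gaussian band is an idealised stand-in for a spectrum.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row) individually."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(42)
wl = np.linspace(1000, 2500, 751)
clean = np.exp(-0.5 * ((wl - 1940) / 40) ** 2)  # one idealised band

# Per-sample multiplicative and additive scatter
gain = rng.uniform(0.85, 1.15, size=(20, 1))
offset = rng.uniform(-0.1, 0.1, size=(20, 1))
scattered = gain * clean + offset

# After SNV every scattered spectrum collapses onto the same curve
print(np.allclose(snv(scattered), snv(clean[None, :])))  # True
```

SNV only cancels a constant gain and offset per spectrum. Wavelength-dependent scatter and temperature-induced peak shifts are not removed this way, which is why derivatives or MSC variants are worth comparing on the same synthetic data.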
Validation and Benchmarking
Phase 4 introduces tools to validate synthetic data quality and benchmark against standard datasets.
Spectral Realism Scorecard
Evaluate how realistic your synthetic data is compared to real reference data using 6 quantitative metrics:
from nirs4all.data.synthetic import compute_spectral_realism_scorecard

# Compare synthetic data to real reference data
score = compute_spectral_realism_scorecard(
    real_spectra=X_real,
    synthetic_spectra=X_synth,
    wavelengths=wavelengths,
    include_adversarial=True
)

print(f"Overall Pass: {score.overall_pass}")
print(f"Correlation Length Overlap: {score.correlation_length_overlap:.3f}")
print(f"Adversarial AUC: {score.adversarial_auc:.3f}")  # Lower is better (harder to distinguish)
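The adversarial AUC measures how well a classifier can tell real spectra from synthetic ones. The sketch below shows the idea with NumPy only, using a deliberately simple mean-difference classifier and a rank-based AUC. The classifier used inside the scorecard is not described here, so treat `adversarial_auc` as a hypothetical stand-in rather than the library's metric.

```python
import numpy as np

def adversarial_auc(real, synthetic, seed=0):
    """AUC of a mean-difference classifier separating real from synthetic spectra.

    Values near 0.5 mean the two sets are hard to tell apart; values near 1.0
    mean they are easy to distinguish.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([real, synthetic])
    y = np.r_[np.zeros(len(real)), np.ones(len(synthetic))]
    order = rng.permutation(len(X))
    train, test = order[: len(X) // 2], order[len(X) // 2:]

    # Fit: direction joining the two class means on the training half
    direction = X[train][y[train] == 1].mean(axis=0) - X[train][y[train] == 0].mean(axis=0)
    scores = X[test] @ direction

    # Evaluate: rank-based (Mann-Whitney) AUC on the held-out half
    ranks = scores.argsort().argsort() + 1
    pos = y[test] == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(42)
real = rng.normal(size=(800, 20))
similar = rng.normal(size=(800, 20))           # same distribution as "real"
shifted = rng.normal(loc=1.0, size=(800, 20))  # systematically offset
print(adversarial_auc(real, similar))  # near 0.5
print(adversarial_auc(real, shifted))  # near 1.0
```

Splitting into training and held-out halves matters: scoring the classifier on the data it was fitted to would inflate the AUC even when the two sets are statistically identical.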
Benchmark Datasets
Access metadata and properties for standard NIR benchmark datasets to create matching synthetic data:
from nirs4all.data.synthetic import (
    list_benchmark_datasets,
    get_benchmark_info,
    create_synthetic_matching_benchmark
)

# List available benchmarks
print(list_benchmark_datasets())
# ['corn', 'tecator', 'shootout2002', 'wheat_kernels', ...]

# Get info about a dataset
info = get_benchmark_info("corn")
print(f"{info.full_name}: {info.n_samples} samples, {info.n_wavelengths} wavelengths")

# Create synthetic data matching the benchmark properties
X, C, E = create_synthetic_matching_benchmark("corn", n_samples=1000)
Prior Sampling
Generate realistic configurations based on domain knowledge (hierarchical sampling):
from nirs4all.data.synthetic import sample_prior
# Sample a random realistic configuration
config = sample_prior(random_state=42)
print(f"Domain: {config['domain']}")
print(f"Instrument: {config['instrument']}")
# Sample for a specific domain
food_config = sample_prior(domain="food")
GPU Acceleration
Accelerate generation of large datasets using JAX or CuPy (automatically detected):
from nirs4all.data.synthetic import AcceleratedGenerator

# Automatically uses GPU if available (JAX/CuPy)
gen = AcceleratedGenerator(random_state=42)

# Generate large batch efficiently
X = gen.generate_batch(
    n_samples=100000,
    wavelengths=wavelengths,
    component_spectra=E,
    concentrations=C
)
Configuration Options
Complexity Levels
| Level | Description | Use Case |
|---|---|---|
| simple | Minimal noise, no scatter | Unit tests, quick prototyping |
| realistic | Typical NIR noise and effects | Algorithm development, validation |
| complex | High noise, artifacts, outliers | Robustness testing |
# Simple (fast, clean spectra)
dataset = nirs4all.generate(complexity="simple", n_samples=1000)
# Realistic (recommended for most use cases)
dataset = nirs4all.generate(complexity="realistic", n_samples=1000)
# Complex (challenging scenarios)
dataset = nirs4all.generate(complexity="complex", n_samples=1000)
Predefined Components
The generator includes 48 predefined spectral components with physically-accurate NIR band assignments based on published spectroscopy literature. Use these directly or as building blocks for custom scenarios.
Water & Moisture:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1450, 1940, 2500 | Free water O-H overtones |
| | 1460, 1930 | Bound water in organic matrices |
Proteins & Nitrogen Compounds:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1510, 1680, 2050, 2180 | Amide N-H and aromatic C-H |
| | 1500, 2060, 2150 | Primary/secondary amines |
| | 1480, 1530, 2010, 2170 | Urea CO(NH₂)₂ |
| | 1520, 2040, 2260 | Free amino acids |
| | 1510, 1680, 2050, 2180 | Milk protein |
| | 1505, 1680, 2050, 2180 | Wheat protein complex |
Lipids & Hydrocarbons:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1210, 1390, 1720, 2310 | Triglyceride C-H stretching |
| | 1165, 1725, 2305 | Vegetable/mineral oils |
| | 1195, 1730, 2315 | Saturated fatty acids |
| | 1160, 1720, 2145 | Mono/polyunsaturated fats (=C-H) |
| | 1190, 1720, 2310 | Cuticular waxes |
| | 1145, 1685, 2150 | Benzene derivatives |
| | 1190, 1715, 2310 | Saturated hydrocarbons |
Carbohydrates:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1460, 1580, 2100, 2270 | Amylose/amylopectin |
| | 1490, 1780, 2090, 2280 | β-1,4-glucan chains |
| | 1440, 1690, 2080, 2270 | D-glucose monosaccharide |
| | 1430, 1695, 2070 | D-fructose (fruit sugar) |
| | 1435, 1685, 2075 | Disaccharide (table sugar) |
| | 1450, 1690, 1940, 2100 | Milk sugar |
| | 1470, 1760, 2085 | Xylan/glucomannan |
| | 1140, 1420, 1670, 2130 | Aromatic plant polymer |
| | 1490, 1770, 2090, 2275 | Plant cell wall material |
Alcohols & Polyols:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1410, 1580, 1695, 2050 | Ethanol C₂H₅OH |
| | 1400, 1545, 1705, 2040 | Methanol CH₃OH |
| | 1450, 1580, 1700, 2060 | Polyol (fermentation) |
Organic Acids:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1420, 1700, 1940 | Acetic acid CH₃COOH |
| | 1440, 1920, 2060 | Citric acid (fruit acids) |
| | 1430, 1485, 1700, 2020 | Lactic acid |
| | 1440, 1920, 2050, 2255 | Fruit acid (apples) |
| | 1435, 1910, 2040, 2260 | Grape/wine acid |
Plant Pigments & Phenolics:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1070, 1400, 1730, 2270 | Chlorophyll a/b |
| | 1050, 1680, 2135 | β-carotene, xanthophylls |
| | 1420, 1670, 2056, 2270 | Phenolic compounds |
Pharmaceuticals:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1130, 1665, 1695, 2010 | Caffeine C₈H₁₀N₄O₂ |
| | 1145, 1435, 1680, 2020 | Acetylsalicylic acid |
| | 1140, 1390, 1510, 1670 | Acetaminophen |
Fibers & Textiles:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1200, 1494, 1780, 2100 | Cotton cellulose fiber |
| | 1140, 1660, 1720, 2015 | PET synthetic fiber |
| | 1500, 1720, 2050, 2295 | Polyamide fiber |
Polymers & Plastics:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1190, 1720, 2310, 2355 | HDPE/LDPE plastic |
| | 1145, 1680, 1720, 2170 | Aromatic polymer |
| | 1160, 1720, 2130, 2250 | cis-1,4-polyisoprene |
Solvents:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 1690, 1710, 2100, 2300 | Ketone (propan-2-one) |
Soil Minerals:
| Component | Key Bands (nm) | Description |
|---|---|---|
| | 2330, 2525 | CaCO₃, MgCO₃ (calcite) |
| | 1740, 1900, 2200 | CaSO₄·2H₂O |
| | 1400, 2160, 2200 | Clay mineral |
# Use specific components
dataset = nirs4all.generate(
    components=["water", "protein", "lipid"],
    n_samples=1000
)
# List all available components
from nirs4all.data.synthetic import ComponentLibrary
library = ComponentLibrary.from_predefined()
print(library.component_names) # All 48 component names
See also
For detailed band assignments with literature references, see Synthetic Data Generator.
Concentration Distributions
| Distribution | Description | Use Case |
|---|---|---|
| | Sum-to-one constraint | Natural composition data |
| | Independent uniform | Wide concentration range |
| | Skewed, realistic | Agricultural/biological data |
| | With inter-component correlations | Complex mixtures |
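The four behaviours described in the table can be reproduced with standard NumPy samplers, which helps when deciding which one suits a dataset. The mapping below (Dirichlet for the sum-to-one case, an exponentiated correlated Gaussian for the correlated case) and all parameter values are assumptions chosen for illustration. The generator's own samplers may be parameterised differently.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_components = 500, 3

# Compositional: each row sums to one
c_simplex = rng.dirichlet(alpha=np.ones(n_components), size=n_samples)

# Independent uniform concentrations
c_uniform = rng.uniform(0.0, 1.0, size=(n_samples, n_components))

# Skewed, strictly positive concentrations
c_lognormal = rng.lognormal(mean=0.0, sigma=0.5, size=(n_samples, n_components))

# Correlated components: exponentiate a correlated Gaussian draw
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
c_correlated = np.exp(
    rng.multivariate_normal(np.zeros(n_components), 0.25 * cov, size=n_samples)
)
```

The sum-to-one case forces negative correlations between components, so a model can appear to predict one component by measuring the others. The independent and correlated cases avoid or control that coupling.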
Target Configuration
# Single component target with scaling
dataset = (
    nirs4all.generate.builder(n_samples=1000)
    .with_targets(
        component="protein",  # Use one component
        range=(0, 100),       # Scale to percentage
        distribution="lognormal"
    )
    .build()
)

# Multi-output regression (all components as targets)
dataset = (
    nirs4all.generate.builder(n_samples=1000)
    .with_targets(
        component=None,  # Use all components
        range=(0, 1),    # Normalize
    )
    .build()
)
Exporting Synthetic Data
To Folder (DatasetConfigs compatible)
# Generate and save to folder
path = nirs4all.generate.to_folder(
    "data/synthetic",
    n_samples=1000,
    train_ratio=0.8,
    format="standard",  # Creates Xcal, Ycal, Xval, Yval files
    random_state=42
)
# Later, load with DatasetConfigs
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs(path)
Export Formats
| Format | Description | Files Created |
|---|---|---|
| standard | Separate train/test files | Xcal.csv, Ycal.csv, Xval.csv, Yval.csv |
| | All data in one file | data.csv (with partition column) |
| | Multiple small files | Useful for loader testing |
To Single CSV
path = nirs4all.generate.to_csv(
    "data/synthetic.csv",
    n_samples=500,
    random_state=42
)
Matching Real Data
Generate synthetic data that resembles a real dataset:
# From a dataset path
dataset = nirs4all.generate.from_template(
    "sample_data/regression",
    n_samples=1000,
    random_state=42
)

# From numpy arrays
dataset = nirs4all.generate.from_template(
    X_real,
    n_samples=500,
    wavelengths=wavelengths,
    random_state=42
)
The fitter analyzes:
Statistical properties (mean, std, range)
Spectral shape (slope, curvature)
Noise characteristics
PCA structure
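To give a sense of what such an analysis involves, the sketch below computes comparable descriptors with NumPy. The `summarize_spectra` helper and its specific estimators (quadratic fit for slope and curvature, second differences as a noise proxy, SVD for PCA structure) are assumptions for illustration and are not the fitter's actual code.

```python
import numpy as np

def summarize_spectra(X, wavelengths, n_components=3):
    """Collect simple descriptors of a spectra matrix X (n_samples, n_wavelengths)."""
    mean_spectrum = X.mean(axis=0)

    # Global shape: quadratic fit to the mean spectrum on a centred, scaled axis
    t = (wavelengths - wavelengths.mean()) / np.ptp(wavelengths)
    curvature, slope, _ = np.polyfit(t, mean_spectrum, deg=2)

    # Noise proxy: spread of second differences, which suppress smooth signal
    noise = np.diff(X, n=2, axis=1).std()

    # PCA structure: explained-variance ratios from the SVD of centred data
    s = np.linalg.svd(X - mean_spectrum, compute_uv=False)
    explained = (s**2 / np.sum(s**2))[:n_components]

    return {
        "mean": X.mean(), "std": X.std(), "range": (X.min(), X.max()),
        "slope": slope, "curvature": curvature,
        "noise": noise, "explained_variance_ratio": explained,
    }

# Example on a toy matrix: a sloped baseline plus small noise
rng = np.random.default_rng(42)
wl = np.linspace(1000, 2500, 100)
t = (wl - wl.mean()) / np.ptp(wl)
X_toy = 0.5 + 2.0 * t[None, :] + rng.normal(scale=0.01, size=(50, 100))
stats = summarize_spectra(X_toy, wl)
print(f"slope={stats['slope']:.2f}, curvature={stats['curvature']:.2f}")
```

Comparing these descriptors between a real dataset and its synthetic counterpart is a quick sanity check before running the full realism scorecard.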
Advanced: Custom Component Library
Create custom spectral components for specific applications:
from nirs4all.data.synthetic import (
    SyntheticNIRSGenerator,
    ComponentLibrary,
    SpectralComponent,
    NIRBand
)

# Define custom component
my_component = SpectralComponent(
    name="my_compound",
    bands=[
        NIRBand(center=1500, sigma=20, gamma=2, amplitude=0.6, name="C-H stretch"),
        NIRBand(center=2100, sigma=30, gamma=3, amplitude=0.8, name="O-H combination"),
    ]
)

# Create library with custom and predefined components
library = ComponentLibrary()
library.add_component(my_component)
library.add_from_predefined(["water", "protein"])

# Generate with custom library
generator = SyntheticNIRSGenerator(
    component_library=library,
    random_state=42
)
# Returns spectra X, concentrations C, and component spectra E
X, C, E = generator.generate(n_samples=1000)
Integration with Pipelines
Direct Usage
import nirs4all
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

# Generate and train in one workflow
result = nirs4all.run(
    pipeline=[
        StandardScaler(),
        KFold(n_splits=5),
        {"model": PLSRegression(n_components=10)}
    ],
    dataset=nirs4all.generate(n_samples=1000, random_state=42),
    name="synthetic_test",
    verbose=1
)

print(f"Best RMSE: {result.best_rmse:.4f}")
Comparing Preprocessing Methods
import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from nirs4all.operators.transforms import SNV, MSC, FirstDerivative

# Generate consistent dataset
dataset = nirs4all.generate(n_samples=500, complexity="realistic", random_state=42)

# Test different preprocessing
for preproc in [MinMaxScaler(), StandardScaler(), SNV(), MSC(), FirstDerivative()]:
    result = nirs4all.run(
        pipeline=[preproc, KFold(3), PLSRegression(10)],
        dataset=dataset,
        verbose=0
    )
    print(f"{preproc.__class__.__name__}: RMSE={result.best_rmse:.4f}")
Performance
| Samples | Complexity | Approximate Time |
|---|---|---|
| 1,000 | simple | ~0.05s |
| 1,000 | realistic | ~0.1s |
| 10,000 | realistic | ~0.5s |
| 100,000 | complex | ~5s |
API Reference
Top-Level Functions
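| Function | Description |
|---|---|
| nirs4all.generate() | Main generation function |
| nirs4all.generate.regression() | Regression dataset |
| nirs4all.generate.classification() | Classification dataset |
| nirs4all.generate.multi_source() | Multi-source dataset |
| nirs4all.generate.builder() | Get builder for full control |
| nirs4all.generate.to_folder() | Generate and export to folder |
| nirs4all.generate.to_csv() | Generate and export to CSV |
| nirs4all.generate.from_template() | Generate matching real data |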
Core Classes
| Class | Description |
|---|---|
| SyntheticNIRSGenerator | Core generation engine |
| SyntheticDatasetBuilder | Fluent builder interface |
| ComponentLibrary | Collection of spectral components |
| SpectralComponent | Single chemical component definition |
| NIRBand | Single absorption band (Voigt profile) |
| | Non-linear target complexity |
| | Configuration for target complexity |
Environmental & Scattering Classes (Phase 3)
| Class | Description |
|---|---|
| TemperatureConfig | Temperature effect configuration |
| MoistureConfig | Moisture/water activity configuration |
| EnvironmentalEffectsConfig | Combined environmental effects |
| | Apply temperature and moisture effects |
| | Particle size distribution configuration |
| | Sample particle size distributions |
| EMSCConfig | EMSC-style scattering configuration |
| | Scattering coefficient generation |
| ScatteringEffectsConfig | Combined scattering effects |
| | Apply particle size and scattering effects |
Validation & Benchmarking Classes (Phase 4)
| Class | Description |
|---|---|
| | Scorecard results container |
| | Enum of validation metrics |
| | Metadata for benchmark datasets |
| | Configuration for prior sampling |
| | Hierarchical configuration sampler |
| AcceleratedGenerator | GPU-accelerated generation engine |
| | Enum of acceleration backends (JAX, CuPy, NumPy) |
Builder Methods for Target Complexity
| Method | Description |
|---|---|
| with_nonlinear_targets() | Add polynomial, synergistic, or antagonistic interactions |
| with_target_complexity() | Add confounders and partial predictability |
| with_complex_target_landscape() | Create multi-regime target landscapes |
See Also
Synthetic Data Generator - Developer guide for extending the generator
nirs4all.api.generate module - API reference
nirs4all.data.synthetic package - Low-level classes reference
Loading Data - Loading real datasets
Core Concepts - Understanding SpectroDataset