nirs4all.data.synthetic.generator module
Synthetic NIRS Spectra Generator.
A physically motivated synthetic NIRS spectra generator based on the Beer-Lambert law, with realistic instrumental effects and noise models.
- Key features:
Voigt profile peak shapes (Gaussian + Lorentzian convolution)
Realistic NIR band positions from known spectroscopic databases
Configurable baseline, scattering, and instrumental effects
Batch/session effects for domain adaptation research
Controllable outlier/artifact generation
Instrument archetype simulation (Phase 2)
Measurement mode physics (transmittance, reflectance, ATR) (Phase 2)
Detector response and noise models (Phase 2)
Multi-sensor stitching (Phase 2)
Multi-scan averaging/denoising (Phase 2)
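The Voigt peak shape listed above is the convolution of a Gaussian and a Lorentzian. Evaluating it exactly requires the Faddeeva function, so the sketch below uses the common pseudo-Voigt approximation instead (a weighted sum of the two shapes). It is only an illustration of the idea, not the module's implementation; the function names and the `eta` mixing parameter are assumptions introduced here.

```python
import math

def gaussian(x, center, fwhm):
    """Unit-height Gaussian with the given full width at half maximum."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def lorentzian(x, center, fwhm):
    """Unit-height Lorentzian with the given full width at half maximum."""
    half = fwhm / 2.0
    return half ** 2 / ((x - center) ** 2 + half ** 2)

def pseudo_voigt(x, center, fwhm, eta):
    """Weighted sum of both shapes: eta=0 is pure Gaussian, eta=1 pure Lorentzian."""
    return eta * lorentzian(x, center, fwhm) + (1.0 - eta) * gaussian(x, center, fwhm)

# Hypothetical band at 1450 nm on the default 1000-2500 nm grid (2 nm step).
wavelengths = [1000.0 + 2.0 * i for i in range(751)]
band = [pseudo_voigt(w, center=1450.0, fwhm=40.0, eta=0.3) for w in wavelengths]
peak_index = max(range(len(band)), key=band.__getitem__)
```

Because both shapes share the same FWHM and unit height, the mixed band also has unit height at the centre and falls to one half at a distance of FWHM/2 from it.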
References
Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press.
Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis. CRC Press.
- class nirs4all.data.synthetic.generator.SyntheticNIRSGenerator(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, component_library: ComponentLibrary | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'realistic', instrument: str | InstrumentArchetype | None = None, measurement_mode: str | MeasurementMode | None = None, multi_sensor_config: MultiSensorConfig | None = None, multi_scan_config: MultiScanConfig | None = None, environmental_config: EnvironmentalEffectsConfig | None = None, scattering_effects_config: ScatteringEffectsConfig | None = None, random_state: int | None = None)[source]
Bases: object
Generator for synthetic NIRS spectra with realistic instrumental effects.
This generator implements a physically motivated model based on the Beer-Lambert law, with additional terms for baseline, scattering, instrumental response, and noise.
- Model:
A_i(λ) = L_i * Σ_k c_ik * ε_k(λ) + baseline_i(λ) + scatter_i(λ) + noise_i(λ)
- where:
c_ik: concentration of component k in sample i
ε_k(λ): molar absorptivity of component k (Voigt profiles)
L_i: optical path length factor
baseline: polynomial baseline drift
scatter: multiplicative/additive scattering effects
noise: wavelength-dependent Gaussian noise
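As a rough illustration of the model above, the following sketch builds one spectrum from concentrations, component spectra, a path length factor, a polynomial baseline, and Gaussian noise. It is a simplified stand-in rather than the generator's code: the scatter term is omitted, the noise level is constant across wavelengths, and `synth_spectrum` and its arguments are hypothetical names.

```python
import random

def synth_spectrum(conc, eps, path_length, baseline_coefs, noise_std, rng):
    """Return one absorbance spectrum A(λ) on a grid of len(eps[0]) points."""
    n_wl = len(eps[0])
    spectrum = []
    for j in range(n_wl):
        x = j / (n_wl - 1)  # grid position rescaled to [0, 1]
        # Beer-Lambert term: L_i * sum_k c_ik * eps_k(λ)
        absorbance = path_length * sum(c * e[j] for c, e in zip(conc, eps))
        # Polynomial baseline: b0 + b1*x + b2*x^2 + ...
        baseline = sum(b * x ** p for p, b in enumerate(baseline_coefs))
        spectrum.append(absorbance + baseline + rng.gauss(0.0, noise_std))
    return spectrum

rng = random.Random(42)
eps = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]  # two components, three wavelengths
clean = synth_spectrum([0.25, 0.75], eps, 2.0, [0.0], 0.0, rng)
tilted = synth_spectrum([0.25, 0.75], eps, 2.0, [0.1, 0.2], 0.0, rng)
```

With the noise switched off, `clean` is exactly the path length times the concentration-weighted sum of the component spectra, and `tilted` adds a linear baseline on top of it.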
- Phase 2 Features:
Instrument archetype simulation (FOSS, Bruker, etc.)
Measurement mode physics (transmittance, reflectance, ATR)
Detector response curves and noise models
Multi-sensor stitching (combining signals from different wavelength ranges)
Multi-scan averaging/denoising (simulating multiple scans per sample)
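The benefit of multi-scan averaging can be checked numerically: averaging n independent scans with the same noise level reduces the noise standard deviation by a factor of about √n. The sketch below assumes independent Gaussian noise on a single value; `average_scans` is a hypothetical helper, not part of the module.

```python
import random
import statistics

def average_scans(true_value, noise_std, n_scans, rng):
    """Average n_scans noisy readings of the same underlying value."""
    scans = [true_value + rng.gauss(0.0, noise_std) for _ in range(n_scans)]
    return sum(scans) / n_scans

rng = random.Random(0)
single = [average_scans(1.0, 0.1, 1, rng) for _ in range(2000)]
averaged = [average_scans(1.0, 0.1, 16, rng) for _ in range(2000)]

# Ratio of noise levels; close to sqrt(16) = 4 for independent noise.
ratio = statistics.stdev(single) / statistics.stdev(averaged)
```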
- Phase 3 Features:
Temperature effects on spectral bands (O-H, N-H, C-H shifts)
Moisture and water activity effects
Particle size effects (EMSC-style scattering)
Scattering coefficient generation (Kubelka-Munk)
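For reference, the Kubelka-Munk relation mentioned above links diffuse reflectance R to the ratio of absorption to scattering coefficients, F(R) = (1 − R)² / (2R) = K/S. The sketch below only illustrates this relation and its inverse; how the generator actually derives scattering coefficients is not documented here, and both function names are assumptions.

```python
import math

def kubelka_munk(reflectance):
    """Kubelka-Munk function F(R) = (1 - R)^2 / (2R) = K/S."""
    return (1.0 - reflectance) ** 2 / (2.0 * reflectance)

def reflectance_from_ks(k_over_s):
    """Invert F(R): the root of R^2 - 2(1 + K/S)R + 1 = 0 that lies in (0, 1]."""
    return 1.0 + k_over_s - math.sqrt(k_over_s ** 2 + 2.0 * k_over_s)

ks = kubelka_munk(0.5)            # (0.5)^2 / 1.0 = 0.25
r_back = reflectance_from_ks(ks)  # recovers 0.5
```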
- wavelengths
Array of wavelength values in nm.
- n_wavelengths
Number of wavelength points.
- library
ComponentLibrary containing spectral components.
- E
Precomputed component spectra matrix (n_components, n_wavelengths).
- params
Dictionary of effect parameters based on complexity level.
- instrument
Optional InstrumentArchetype for realistic simulation.
- measurement_mode_simulator
Optional measurement mode simulator.
- environmental_simulator
Optional Phase 3 environmental effects simulator.
- scattering_effects_simulator
Optional Phase 3 scattering effects simulator.
- Parameters:
wavelength_start – Start wavelength in nm.
wavelength_end – End wavelength in nm.
wavelength_step – Wavelength step in nm.
component_library – Optional ComponentLibrary. If None, predefined components are generated for realistic mode, or random components for simple mode.
complexity – Complexity level controlling noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.
instrument – Instrument archetype name or InstrumentArchetype object. If provided, uses instrument-specific wavelength range, detector, etc.
measurement_mode – Measurement mode (transmittance, reflectance, etc.).
multi_sensor_config – Configuration for multi-sensor stitching.
multi_scan_config – Configuration for multi-scan averaging.
environmental_config – Phase 3 configuration for temperature/moisture effects.
scattering_effects_config – Phase 3 configuration for particle size/scattering.
random_state – Random seed for reproducibility.
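The three wavelength parameters determine the number of spectral points. With the defaults (1000 to 2500 nm in 2 nm steps) an end-inclusive grid has 751 points, which matches the spectra shape shown in the example output. The helper below is a hypothetical sketch; that the end wavelength is included is inferred from that shape, not from the source code.

```python
def wavelength_grid(start=1000.0, end=2500.0, step=2.0):
    """Evenly spaced grid from start to end inclusive (an assumed convention)."""
    n_points = int(round((end - start) / step)) + 1
    return [start + i * step for i in range(n_points)]

grid = wavelength_grid()
n_wavelengths = len(grid)  # 751 for the default arguments
```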
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=1000)
>>> print(X.shape, Y.shape, E.shape)
(1000, 751) (1000, 5) (5, 751)

>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     measurement_mode="reflectance",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)

>>> # With environmental effects (Phase 3)
>>> from nirs4all.data.synthetic import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig(
...     enable_temperature=True,
...     enable_moisture=True
... )
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)

>>> # Create a SpectroDataset directly
>>> dataset = generator.create_dataset(n_train=800, n_test=200)
See also
ComponentLibrary: For managing spectral components.
InstrumentArchetype: For instrument-specific simulation.
MeasurementModeSimulator: For measurement mode physics.
EnvironmentalEffectsSimulator: For temperature/moisture effects (Phase 3).
ScatteringEffectsSimulator: For particle size/scattering effects (Phase 3).
- create_dataset(n_train: int = 800, n_test: int = 200, target_component: str | int | None = None, **generate_kwargs: Any) SpectroDataset[source]
Create a SpectroDataset from synthetic spectra.
This method generates synthetic spectra and wraps them in a SpectroDataset object ready for use with nirs4all pipelines.
- Parameters:
n_train – Number of training samples.
n_test – Number of test samples.
target_component – Which component to use as the target:
  - If None: uses all components as a multi-output target.
  - If str: uses the component with that name.
  - If int: uses the component at that index.
**generate_kwargs – Additional arguments passed to generate().
- Returns:
SpectroDataset with train/test partitions.
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> dataset = generator.create_dataset(
...     n_train=800,
...     n_test=200,
...     target_component="protein"
... )
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
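The three accepted forms of target_component can be pictured as a column selection on the concentration matrix. The sketch below is a hypothetical helper written for illustration, with made-up component names; it does not reproduce how create_dataset() resolves the target internally.

```python
def select_target(Y, component_names, target_component=None):
    """Pick target column(s) from a concentration matrix given as a list of rows."""
    if target_component is None:
        return Y  # multi-output: keep every component
    if isinstance(target_component, str):
        # list.index raises ValueError if the name is not present
        index = component_names.index(target_component)
    else:
        index = target_component
    return [row[index] for row in Y]

names = ["water", "protein", "lipid"]  # illustrative names only
Y = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.3]]
by_name = select_target(Y, names, "protein")
by_index = select_target(Y, names, 2)
all_targets = select_target(Y, names)
```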
- generate(n_samples: int = 1000, concentration_method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', include_batch_effects: bool = False, n_batches: int = 1, include_instrument_effects: bool = True, include_multi_sensor: bool = True, include_multi_scan: bool = True, include_environmental_effects: bool = True, include_scattering_effects: bool = True, temperatures: ndarray | None = None, return_metadata: bool = False) Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]][source]
Generate synthetic NIRS spectra.
This is the main generation method that creates synthetic spectra by applying all physical effects in sequence.
- Parameters:
n_samples – Number of spectra to generate.
concentration_method – Method for generating concentrations. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.
include_batch_effects – Whether to add batch/session effects.
n_batches – Number of batches (only used if include_batch_effects=True).
include_instrument_effects – Whether to apply instrument-specific effects (detector response, noise). Only applies if instrument was specified during initialization.
include_multi_sensor – Whether to apply multi-sensor stitching effects. Only applies if multi_sensor_config is set.
include_multi_scan – Whether to simulate multi-scan averaging. Only applies if multi_scan_config is set.
include_environmental_effects – Whether to apply Phase 3 temperature and moisture effects. Only applies if environmental_config is set.
include_scattering_effects – Whether to apply Phase 3 particle size and EMSC-style scattering effects. Only applies if scattering_effects_config is set.
temperatures – Optional array of temperatures (°C) for each sample. If None and environmental effects are enabled, random temperatures are generated based on the configuration. Shape: (n_samples,).
return_metadata – Whether to return additional metadata dictionary.
- Returns:
If return_metadata=False, a tuple of (X, Y, E):
X: Spectra matrix (n_samples, n_wavelengths)
Y: Concentration matrix (n_samples, n_components)
E: Component spectra (n_components, n_wavelengths)
If return_metadata=True, a tuple of (X, Y, E, metadata):
metadata: Dictionary with generation details
- Return type:
Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]]
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=500)
>>> print(f"Spectra: {X.shape}, Targets: {Y.shape}")
Spectra: (500, 751), Targets: (500, 5)

>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)

>>> # With environmental effects (Phase 3)
>>> from nirs4all.data.synthetic import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig()
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)

>>> # With metadata
>>> X, Y, E, meta = generator.generate(100, return_metadata=True)
>>> print(meta.keys())
- generate_batch_effects(n_batches: int, samples_per_batch: List[int]) Tuple[ndarray, ndarray][source]
Generate batch/session effects for domain adaptation research.
- Parameters:
n_batches – Number of measurement batches/sessions.
samples_per_batch – List of sample counts per batch.
- Returns:
Tuple of (batch_offsets, batch_gains):
batch_offsets: Wavelength-dependent offsets per batch.
batch_gains: Multiplicative gains per batch.
- Return type:
Tuple[ndarray, ndarray]
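One plausible way to use such offsets and gains is to scale each spectrum by its batch gain and then add the batch offset. The sketch below assumes one offset vector and one scalar gain per batch, applied to consecutive blocks of rows; the actual array shapes and the order of operations inside the generator are not documented here, and `apply_batch_effects` is a hypothetical helper.

```python
def apply_batch_effects(X, samples_per_batch, batch_offsets, batch_gains):
    """Apply per-batch gain and offset to consecutive blocks of rows in X."""
    if sum(samples_per_batch) != len(X):
        raise ValueError("samples_per_batch must sum to the number of spectra")
    out, row = [], 0
    for batch, count in enumerate(samples_per_batch):
        for spectrum in X[row:row + count]:
            out.append([batch_gains[batch] * a + o
                        for a, o in zip(spectrum, batch_offsets[batch])])
        row += count
    return out

# Three identical spectra: the first two in batch 0, the last in batch 1.
X = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
shifted = apply_batch_effects(X, [2, 1], [[0.0, 0.0], [0.5, 0.25]], [1.0, 2.0])
```

Batch 0 uses a unit gain and zero offset, so its spectra are unchanged, while the single spectrum in batch 1 is scaled and shifted.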
- generate_concentrations(n_samples: int, method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', alpha: ndarray | None = None, correlation_matrix: ndarray | None = None) ndarray[source]
Generate concentration matrix using specified distribution.
- Parameters:
n_samples – Number of samples to generate.
method – Concentration generation method:
  - ‘dirichlet’: Compositional data (concentrations sum to ~1).
  - ‘uniform’: Independent uniform [0, 1] values.
  - ‘lognormal’: Log-normal distributed, normalized.
  - ‘correlated’: Multivariate with specified correlations.
alpha – Dirichlet concentration parameters (only for ‘dirichlet’ method). Shape: (n_components,). Higher values = more uniform distribution.
correlation_matrix – Correlation structure for ‘correlated’ method. Shape: (n_components, n_components).
- Returns:
Concentration matrix of shape (n_samples, n_components).
- Raises:
ValueError – If method is unknown.
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> C = generator.generate_concentrations(100, method='dirichlet')
>>> print(C.shape, C.sum(axis=1).mean())  # Should sum to ~1
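To see why the 'dirichlet' method yields compositional data, note that a Dirichlet sample can be drawn by normalising independent Gamma variates, so each row sums to 1 by construction. The sketch below uses only the standard library and is not the generator's own sampler; `sample_dirichlet` and the chosen alpha values are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one composition by normalising independent Gamma(alpha_k, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
C = [sample_dirichlet([2.0, 2.0, 2.0], rng) for _ in range(100)]
row_sums = [sum(row) for row in C]  # each is 1 up to floating-point rounding
```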