nirs4all.data.synthetic.generator module

Synthetic NIRS Spectra Generator.

A physically motivated synthetic NIRS spectra generator based on the Beer-Lambert law, with realistic instrumental effects and noise models.

Key features:
  • Voigt profile peak shapes (convolution of a Gaussian and a Lorentzian)

  • Realistic NIR band positions from known spectroscopic databases

  • Configurable baseline, scattering, and instrumental effects

  • Batch/session effects for domain adaptation research

  • Controllable outlier/artifact generation

  • Instrument archetype simulation (Phase 2)

  • Measurement mode physics (transmittance, reflectance, ATR) (Phase 2)

  • Detector response and noise models (Phase 2)

  • Multi-sensor stitching (Phase 2)

  • Multi-scan averaging/denoising (Phase 2)
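A Voigt shape is the convolution of a Gaussian and a Lorentzian. As a rough illustration only, and not the generator's own implementation, the sketch below uses the common pseudo-Voigt approximation, which replaces the convolution with a weighted sum of the two shapes. The function name `pseudo_voigt`, the mixing weight `eta`, and the band centre and width are assumptions chosen for the example.

```python
import numpy as np

def pseudo_voigt(wavelengths, center, fwhm, eta=0.5, amplitude=1.0):
    """Pseudo-Voigt band: eta * Lorentzian + (1 - eta) * Gaussian, sharing one FWHM."""
    x = np.asarray(wavelengths, dtype=float) - center
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # Gaussian std derived from FWHM
    gamma = fwhm / 2.0                                   # Lorentzian half-width at half-maximum
    gaussian = np.exp(-0.5 * (x / sigma) ** 2)
    lorentzian = gamma**2 / (x**2 + gamma**2)
    return amplitude * (eta * lorentzian + (1.0 - eta) * gaussian)

# Default grid from the class signature: 1000 to 2500 nm in 2 nm steps.
wavelengths = np.arange(1000.0, 2500.0 + 1e-9, 2.0)
# Illustrative band near 1450 nm with a 60 nm full width at half maximum.
band = pseudo_voigt(wavelengths, center=1450.0, fwhm=60.0)
```

Because both shapes share the same FWHM, the band equals the amplitude at the centre and half the amplitude at 30 nm on either side, whatever value `eta` takes.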

References

  • Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press.

  • Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis. CRC Press.

class nirs4all.data.synthetic.generator.SyntheticNIRSGenerator(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, component_library: ComponentLibrary | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'realistic', instrument: str | InstrumentArchetype | None = None, measurement_mode: str | MeasurementMode | None = None, multi_sensor_config: MultiSensorConfig | None = None, multi_scan_config: MultiScanConfig | None = None, environmental_config: EnvironmentalEffectsConfig | None = None, scattering_effects_config: ScatteringEffectsConfig | None = None, random_state: int | None = None)

Bases: object

Generator for synthetic NIRS spectra with realistic instrumental effects.

This generator implements a physically motivated model based on the Beer-Lambert law, with additional terms for baseline, scattering, instrumental response, and noise.

Model:

A_i(λ) = L_i * Σ_k c_ik * ε_k(λ) + baseline_i(λ) + scatter_i(λ) + noise_i(λ)

where:
  • c_ik: concentration of component k in sample i

  • ε_k(λ): molar absorptivity of component k (Voigt profiles)

  • L_i: optical path length factor

  • baseline: polynomial baseline drift

  • scatter: multiplicative/additive scattering effects

  • noise: wavelength-dependent Gaussian noise
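The model equation can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the generator's internal code: the component bands are plain Gaussians rather than Voigt profiles, the scatter term is additive only, and every numeric value (band centres, widths, noise levels) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_components = 4, 3
wavelengths = np.arange(1000.0, 2500.0 + 1e-9, 2.0)

# Hypothetical component spectra E (n_components, n_wavelengths): one Gaussian band each.
centers = np.array([1200.0, 1450.0, 1940.0])
E = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 40.0) ** 2)

C = rng.dirichlet(np.ones(n_components), size=n_samples)  # c_ik, rows sum to 1
L = rng.normal(1.0, 0.05, size=(n_samples, 1))             # L_i, path length factor

# Scaled wavelength axis in [-0.5, 0.5] for a first-order polynomial baseline.
x = (wavelengths - wavelengths.mean()) / (wavelengths.max() - wavelengths.min())
baseline = rng.normal(0.0, 0.02, (n_samples, 1)) + rng.normal(0.0, 0.02, (n_samples, 1)) * x
scatter = rng.normal(0.0, 0.01, (n_samples, 1)) * np.ones_like(x)  # constant offset per sample
noise = rng.normal(0.0, 0.001, (n_samples, wavelengths.size))

# A_i(λ) = L_i * Σ_k c_ik * ε_k(λ) + baseline_i(λ) + scatter_i(λ) + noise_i(λ)
A = L * (C @ E) + baseline + scatter + noise
```

The matrix product `C @ E` evaluates the sum over components for every sample at once, so `A` has one row per sample and one column per wavelength.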

Phase 2 Features:
  • Instrument archetype simulation (FOSS, Bruker, etc.)

  • Measurement mode physics (transmittance, reflectance, ATR)

  • Detector response curves and noise models

  • Multi-sensor stitching (combining signals from different wavelength ranges)

  • Multi-scan averaging/denoising (simulating multiple scans per sample)
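Multi-scan averaging reduces independent noise by roughly the square root of the number of scans. The sketch below shows that effect on a made-up spectrum. It does not use `MultiScanConfig`; the scan count, noise level, and spectrum are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_scans, n_wavelengths, noise_std = 32, 751, 0.01

# Hypothetical noise-free spectrum: a simple ramp stands in for a real absorbance curve.
true_spectrum = np.linspace(0.2, 0.8, n_wavelengths)

# Each scan is the same spectrum plus independent Gaussian noise.
scans = true_spectrum + rng.normal(0.0, noise_std, size=(n_scans, n_wavelengths))
averaged = scans.mean(axis=0)

# Root-mean-square error of one scan versus the average of all scans.
single_scan_rmse = np.sqrt(np.mean((scans[0] - true_spectrum) ** 2))
averaged_rmse = np.sqrt(np.mean((averaged - true_spectrum) ** 2))
expected_rmse = noise_std / np.sqrt(n_scans)  # theoretical value for independent noise
```

With 32 scans the averaged error should land near `noise_std / sqrt(32)`, about 5.7 times smaller than the single-scan error.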

Phase 3 Features:
  • Temperature effects on spectral bands (O-H, N-H, C-H shifts)

  • Moisture and water activity effects

  • Particle size effects (EMSC-style scattering)

  • Scattering coefficient generation (Kubelka-Munk)
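For reference, the Kubelka-Munk function relates diffuse reflectance R of an optically thick sample to the ratio of absorption to scattering coefficients, F(R) = (1 − R)² / (2R) = K/S. The sketch below implements that textbook relation and its inverse. How the generator uses it internally is not documented here, so the helper names and input values are assumptions.

```python
import numpy as np

def kubelka_munk(reflectance):
    """Kubelka-Munk function F(R) = (1 - R)^2 / (2R), equal to K/S."""
    r = np.asarray(reflectance, dtype=float)
    return (1.0 - r) ** 2 / (2.0 * r)

def reflectance_from_ks(k_over_s):
    """Invert F(R): R = 1 + K/S - sqrt((K/S)^2 + 2 K/S)."""
    ks = np.asarray(k_over_s, dtype=float)
    return 1.0 + ks - np.sqrt(ks**2 + 2.0 * ks)

# Hypothetical K/S ratios: stronger absorption relative to scattering lowers reflectance.
k_over_s = np.array([0.05, 0.5, 2.0])
reflectance = reflectance_from_ks(k_over_s)
```

Applying `kubelka_munk` to `reflectance` recovers the original `k_over_s` values, which is a quick consistency check on both helpers.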

wavelengths

Array of wavelength values in nm.

n_wavelengths

Number of wavelength points.

library

ComponentLibrary containing spectral components.

E

Precomputed component spectra matrix (n_components, n_wavelengths).

params

Dictionary of effect parameters based on complexity level.

instrument

Optional InstrumentArchetype for realistic simulation.

measurement_mode_simulator

Optional measurement mode simulator.

environmental_simulator

Optional Phase 3 environmental effects simulator.

scattering_effects_simulator

Optional Phase 3 scattering effects simulator.

Parameters:
  • wavelength_start – Start wavelength in nm.

  • wavelength_end – End wavelength in nm.

  • wavelength_step – Wavelength step in nm.

  • component_library – Optional ComponentLibrary. If None, predefined components are generated in realistic mode and random components in simple mode.

  • complexity – Complexity level controlling noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.

  • instrument – Instrument archetype name or InstrumentArchetype object. If provided, uses instrument-specific wavelength range, detector, etc.

  • measurement_mode – Measurement mode (transmittance, reflectance, etc.).

  • multi_sensor_config – Configuration for multi-sensor stitching.

  • multi_scan_config – Configuration for multi-scan averaging.

  • environmental_config – Phase 3 configuration for temperature/moisture effects.

  • scattering_effects_config – Phase 3 configuration for particle size/scattering.

  • random_state – Random seed for reproducibility.
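The three wavelength parameters determine the number of spectral points. The sketch below assumes the end point is inclusive, which agrees with the shape `(1000, 751)` shown in the example for the default settings; the generator's actual grid construction may differ.

```python
import numpy as np

wavelength_start, wavelength_end, wavelength_step = 1000.0, 2500.0, 2.0

# Number of points on an inclusive grid: (end - start) / step + 1.
n_wavelengths = int(round((wavelength_end - wavelength_start) / wavelength_step)) + 1

# Build the grid from integer indices to avoid accumulating floating-point error.
wavelengths = wavelength_start + wavelength_step * np.arange(n_wavelengths)
```

For the defaults this gives 751 points running from 1000 nm to 2500 nm.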

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=1000)
>>> print(X.shape, Y.shape, E.shape)
(1000, 751) (1000, 5) (5, 751)
>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     measurement_mode="reflectance",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)
>>> # With environmental effects (Phase 3)
>>> from nirs4all.data.synthetic import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig(
...     enable_temperature=True,
...     enable_moisture=True
... )
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)
>>> # Create a SpectroDataset directly
>>> dataset = generator.create_dataset(n_train=800, n_test=200)

See also

  • ComponentLibrary: For managing spectral components.

  • InstrumentArchetype: For instrument-specific simulation.

  • MeasurementModeSimulator: For measurement mode physics.

  • EnvironmentalEffectsSimulator: For temperature/moisture effects (Phase 3).

  • ScatteringEffectsSimulator: For particle size/scattering effects (Phase 3).

__repr__() → str

Return string representation of the generator.

create_dataset(n_train: int = 800, n_test: int = 200, target_component: str | int | None = None, **generate_kwargs: Any) → SpectroDataset

Create a SpectroDataset from synthetic spectra.

This method generates synthetic spectra and wraps them in a SpectroDataset object ready for use with nirs4all pipelines.

Parameters:
  • n_train – Number of training samples.

  • n_test – Number of test samples.

  • target_component – Which component to use as target:
      - If None: uses all components as multi-output target.
      - If str: uses the component with that name.
      - If int: uses the component at that index.

  • **generate_kwargs – Additional arguments passed to generate().
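The three accepted forms of `target_component` can be pictured as a column selection on the concentration matrix. The helper below is a hypothetical sketch, not the library's code: `select_target`, the component names other than "protein", and the choice to return a 1-D array for a single target are all assumptions.

```python
import numpy as np

def select_target(Y, component_names, target_component=None):
    """Pick target column(s) from a concentration matrix Y (n_samples, n_components)."""
    if target_component is None:
        return Y  # multi-output target: keep every component
    if isinstance(target_component, str):
        if target_component not in component_names:
            raise ValueError(f"Unknown component: {target_component!r}")
        index = component_names.index(target_component)  # look up by name
    else:
        index = int(target_component)  # use the position directly
    return Y[:, index]

names = ["water", "protein", "starch"]  # hypothetical component names
Y = np.arange(12, dtype=float).reshape(4, 3)
y_protein = select_target(Y, names, "protein")
```

Passing a name or its index selects the same column, while `None` leaves the full matrix in place for multi-output regression.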

Returns:

SpectroDataset with train/test partitions.

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> dataset = generator.create_dataset(
...     n_train=800,
...     n_test=200,
...     target_component="protein"
... )
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")

generate(n_samples: int = 1000, concentration_method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', include_batch_effects: bool = False, n_batches: int = 1, include_instrument_effects: bool = True, include_multi_sensor: bool = True, include_multi_scan: bool = True, include_environmental_effects: bool = True, include_scattering_effects: bool = True, temperatures: ndarray | None = None, return_metadata: bool = False) → Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]]

Generate synthetic NIRS spectra.

This is the main generation method that creates synthetic spectra by applying all physical effects in sequence.

Parameters:
  • n_samples – Number of spectra to generate.

  • concentration_method – Method for generating concentrations. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.

  • include_batch_effects – Whether to add batch/session effects.

  • n_batches – Number of batches (only used if include_batch_effects=True).

  • include_instrument_effects – Whether to apply instrument-specific effects (detector response, noise). Only applies if instrument was specified during initialization.

  • include_multi_sensor – Whether to apply multi-sensor stitching effects. Only applies if multi_sensor_config is set.

  • include_multi_scan – Whether to simulate multi-scan averaging. Only applies if multi_scan_config is set.

  • include_environmental_effects – Whether to apply Phase 3 temperature and moisture effects. Only applies if environmental_config is set.

  • include_scattering_effects – Whether to apply Phase 3 particle size and EMSC-style scattering effects. Only applies if scattering_effects_config is set.

  • temperatures – Optional array of temperatures (°C) for each sample. If None and environmental effects are enabled, random temperatures are generated based on the configuration. Shape: (n_samples,).

  • return_metadata – Whether to return additional metadata dictionary.
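The `temperatures` argument expects one value per sample. A minimal sketch of building such an array is shown below; the mean of 25 °C and the 3 °C spread are hypothetical values, and the commented call only illustrates where the array would go.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 500

# Hypothetical lab conditions: mean 25 °C with a 3 °C spread, one value per sample.
temperatures = rng.normal(loc=25.0, scale=3.0, size=n_samples)

# The array would then be passed to a generator created with an environmental_config:
# X, Y, E = generator.generate(n_samples=n_samples, temperatures=temperatures)
```

The array length must match `n_samples`, since the documented shape is `(n_samples,)`.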

Returns:

If return_metadata=False, a tuple of (X, Y, E):
  • X: Spectra matrix (n_samples, n_wavelengths)

  • Y: Concentration matrix (n_samples, n_components)

  • E: Component spectra (n_components, n_wavelengths)

If return_metadata=True, a tuple of (X, Y, E, metadata):
  • metadata: Dictionary with generation details

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=500)
>>> print(f"Spectra: {X.shape}, Targets: {Y.shape}")
Spectra: (500, 751), Targets: (500, 5)
>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)
>>> # With environmental effects (Phase 3)
>>> from nirs4all.data.synthetic import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig()
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)
>>> # With metadata
>>> X, Y, E, meta = generator.generate(100, return_metadata=True)
>>> print(meta.keys())

generate_batch_effects(n_batches: int, samples_per_batch: List[int]) → Tuple[ndarray, ndarray]

Generate batch/session effects for domain adaptation research.

Parameters:
  • n_batches – Number of measurement batches/sessions.

  • samples_per_batch – List of sample counts per batch.

Returns:

Tuple of (batch_offsets, batch_gains):
  • batch_offsets: Wavelength-dependent offsets per batch.

  • batch_gains: Multiplicative gains per batch.
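To show how per-batch offsets and gains might act on spectra, the sketch below builds both and applies them to a random matrix. It does not call `generate_batch_effects`. The array shapes (one offset curve and one scalar gain per batch), the linear form of the offset, and the noise levels are assumptions; the method's actual return shapes may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_wavelengths = 751
samples_per_batch = [3, 2, 4]
n_batches = len(samples_per_batch)
X = rng.random((sum(samples_per_batch), n_wavelengths))  # stand-in spectra

# Assumed shapes: batch_offsets (n_batches, n_wavelengths), batch_gains (n_batches,).
x = np.linspace(-1.0, 1.0, n_wavelengths)
batch_offsets = (rng.normal(0.0, 0.01, (n_batches, 1))
                 + rng.normal(0.0, 0.005, (n_batches, 1)) * x)  # offset plus slope per batch
batch_gains = rng.normal(1.0, 0.02, n_batches)

# Map each sample to its batch, then apply that batch's gain and offset.
batch_ids = np.repeat(np.arange(n_batches), samples_per_batch)
X_batched = X * batch_gains[batch_ids, None] + batch_offsets[batch_ids]
```

Samples in the same batch share one gain and one offset curve, which is the kind of session-level shift that domain adaptation methods try to remove.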

generate_concentrations(n_samples: int, method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', alpha: ndarray | None = None, correlation_matrix: ndarray | None = None) → ndarray

Generate concentration matrix using specified distribution.

Parameters:
  • n_samples – Number of samples to generate.

  • method – Concentration generation method:
      - ‘dirichlet’: Compositional data (concentrations sum to ~1).
      - ‘uniform’: Independent uniform [0, 1] values.
      - ‘lognormal’: Log-normal distributed, normalized.
      - ‘correlated’: Multivariate with specified correlations.

  • alpha – Dirichlet concentration parameters (only for ‘dirichlet’ method). Shape: (n_components,). Higher values = more uniform distribution.

  • correlation_matrix – Correlation structure for ‘correlated’ method. Shape: (n_components, n_components).

Returns:

Concentration matrix of shape (n_samples, n_components).

Raises:

ValueError – If method is unknown.
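The four concentration methods can be approximated with NumPy's random generator. This is a sketch under assumptions, not the method's implementation: the log-normal values are normalized by row sum, the correlated case uses a clipped multivariate normal, and all distribution parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_components = 100, 5

# 'dirichlet': compositional data, each row sums to 1.
dirichlet = rng.dirichlet(np.ones(n_components), size=n_samples)

# 'uniform': independent values in [0, 1).
uniform = rng.uniform(0.0, 1.0, size=(n_samples, n_components))

# 'lognormal': positive draws, normalized here by row sum (an assumption).
lognormal = rng.lognormal(0.0, 0.5, size=(n_samples, n_components))
lognormal /= lognormal.sum(axis=1, keepdims=True)

# 'correlated': multivariate normal with a given correlation matrix, clipped to [0, 1].
correlation_matrix = np.full((n_components, n_components), 0.3)
np.fill_diagonal(correlation_matrix, 1.0)
covariance = 0.1**2 * correlation_matrix  # standard deviation of 0.1 per component
correlated = rng.multivariate_normal(np.full(n_components, 0.5), covariance, size=n_samples)
correlated = np.clip(correlated, 0.0, 1.0)
```

Only the Dirichlet and normalized log-normal rows sum to one; the uniform and correlated draws are bounded per component but not compositional.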

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> C = generator.generate_concentrations(100, method='dirichlet')
>>> print(C.shape, C.sum(axis=1).mean())  # Should sum to ~1