Synthetic NIRS Spectra Generator - Scientific Documentation

Overview

The SyntheticNIRSGenerator provides a physically motivated model for generating realistic synthetic near-infrared (NIR) spectra. It implements the Beer-Lambert law and adds the instrumental artifacts, noise, and inter-sample variability commonly observed in real-world spectroscopic measurements.

This generator is designed for:

Autoencoder training: Generate unlimited spectra, each with known concentrations, for unsupervised feature learning
Algorithm benchmarking: Test preprocessing and modeling algorithms under controlled conditions
Domain adaptation research: Simulate multi-instrument/multi-session variability
Data augmentation: Supplement real datasets with physically plausible synthetic samples

Theoretical Foundation

Beer-Lambert Law

The fundamental spectroscopic relationship underlying the generator is the Beer-Lambert-Bouguer Law:

$$A(\lambda) = \varepsilon(\lambda) \cdot c \cdot L$$

Where:

$A(\lambda)$ = Absorbance at wavelength $\lambda$
$\varepsilon(\lambda)$ = Molar absorptivity (extinction coefficient)
$c$ = Concentration of the absorbing species
$L$ = Optical path length

For mixtures with $K$ components, the total absorbance follows the additivity principle:

$$A(\lambda) = \sum_{k=1}^{K} \varepsilon_k(\lambda) \cdot c_k \cdot L$$

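In matrix notation, with the path length $L$ absorbed into the component spectra: $\mathbf{A} = \mathbf{C} \cdot \mathbf{E}$

where $\mathbf{A}$ is the absorbance matrix $(N \times P)$ for $N$ samples and $P$ wavelengths, $\mathbf{C}$ is the concentration matrix $(N \times K)$, and $\mathbf{E}$ is the pure component spectra matrix $(K \times P)$.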
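
As a minimal sketch of the additivity principle, the snippet below builds a small mixture with NumPy and checks that the matrix product matches the explicit sum over components. The sizes and the random placeholder values for `C` and `E` are illustrative assumptions, not the generator's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 4, 3, 6                        # samples, components, wavelengths (illustrative)
C = rng.dirichlet(np.ones(K), size=N)    # (N, K) concentrations
E = rng.uniform(0.0, 1.0, size=(K, P))   # (K, P) pure-component spectra (placeholder values)
L = 1.0                                  # optical path length

A = L * (C @ E)                          # (N, P) mixture absorbances

# Additivity check: explicit sum of eps_k(lambda) * c_k * L over the K components.
A_sum = L * sum(np.outer(C[:, k], E[k]) for k in range(K))
```

Each row of `A` is one simulated mixture spectrum before any instrumental effect is applied.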

References:

Beer, A. (1852). Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten. Annalen der Physik, 162(5), 78-88.
Swinehart, D. F. (1962). The Beer-Lambert Law. Journal of Chemical Education, 39(7), 333.

Peak Shape: Voigt Profile

Scientific Justification

Real absorption peaks in NIR spectroscopy are neither purely Gaussian nor purely Lorentzian. The observed peak shape is a Voigt profile: the convolution of Gaussian (Doppler/thermal) and Lorentzian (collision/pressure) broadening mechanisms.

$$V(x; \sigma, \gamma) = \int_{-\infty}^{\infty} G(x'; \sigma) \, L(x - x'; \gamma) \, dx'$$

Where:

$G(x; \sigma)$ = Gaussian profile with width $\sigma$
$L(x; \gamma)$ = Lorentzian profile with width $\gamma$

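The Voigt function is computed using `scipy.special.voigt_profile`, whose `sigma` argument is the standard deviation of the Gaussian component and whose `gamma` argument is the half width at half maximum (HWHM) of the Lorentzian component.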

Parameters

Parameter	Type	Effect	Typical Range
`center`	float	Peak position in nm	1000-2500 nm
`sigma`	float	Gaussian width (standard deviation) in nm	10-50 nm
`gamma`	float	Lorentzian width (HWHM) in nm	0-10 nm
`amplitude`	float	Peak height in absorbance units	0.1-1.0 AU

Effect of parameters:

sigma: Controls the overall peak width. Larger values = broader peaks
gamma: Controls “tails” of the peak. γ=0 gives pure Gaussian; increasing γ adds Lorentzian character with heavier tails
amplitude: Directly scales peak height; related to absorptivity × concentration
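
The parameter effects above can be sketched with SciPy's `voigt_profile`. The helper `voigt_peak` is hypothetical, and rescaling the unit-area profile so that its maximum equals `amplitude` is an assumption about how peak height is applied.

```python
import numpy as np
from scipy.special import voigt_profile

def voigt_peak(wavelengths, center, sigma, gamma, amplitude):
    """Voigt band whose maximum equals `amplitude` (assumed normalization)."""
    profile = voigt_profile(wavelengths - center, sigma, gamma)
    peak = voigt_profile(0.0, sigma, gamma)   # profile value at the band center
    return amplitude * profile / peak

wl = np.arange(1000.0, 2501.0, 2.0)           # 2 nm grid (illustrative)
gaussian_like = voigt_peak(wl, 1450.0, 20.0, 0.0, 0.5)   # gamma = 0: pure Gaussian
heavy_tails = voigt_peak(wl, 1450.0, 20.0, 10.0, 0.5)    # added Lorentzian character
```

Both bands reach the same height at 1450 nm, but far from the center the band with `gamma = 10` stays well above the Gaussian one.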

References:

Olivero, J. J., & Longbothum, R. L. (1977). Empirical fits to the Voigt line width. Journal of Quantitative Spectroscopy and Radiative Transfer, 17(2), 233-236.
Whiting, E. E. (1968). An empirical approximation to the Voigt profile. Journal of Quantitative Spectroscopy and Radiative Transfer, 8(6), 1379-1384.

NIR Band Assignments

Scientific Basis

NIR absorption bands arise from molecular overtones and combination bands of fundamental vibrational modes. The predefined components are based on established NIR band assignments from spectroscopic literature.

Functional Group	Band Type	Wavelength (nm)	Assignment
O-H (water)	1st overtone	1400-1500	2ν
O-H (water)	combination	1900-2000	ν + δ
N-H (protein)	1st overtone	1500-1560	2ν
N-H (protein)	combination	2000-2100	ν + δ
C-H (aliphatic)	2nd overtone	1100-1250	3ν
C-H (aliphatic)	1st overtone	1650-1800	2ν
C-H (aromatic)	combination	2200-2400	ν + δ

Where ν = stretching, δ = bending modes.

Predefined Components

water:        O-H bands at 1450, 1940, 2500 nm
protein:      N-H bands at 1510, 1680, 2050, 2180, 2300 nm
lipid:        C-H bands at 1210, 1390, 1720, 2310, 2350 nm
starch:       O-H/C-O bands at 1460, 1580, 2100, 2270 nm
cellulose:    O-H/C-O bands at 1490, 1780, 2090, 2280, 2340 nm
chlorophyll:  absorption at 1070, 1400, 2270 nm
oil:          C-H bands at 1165, 1215, 1410, 1725, 2140, 2305 nm

References:

Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press.
Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis (3rd ed.). CRC Press.
Shenk, J. S., Workman Jr, J. J., & Westerhaus, M. O. (2008). Application of NIR spectroscopy to agricultural products. In Handbook of Near-Infrared Analysis (3rd ed., pp. 347-386). CRC Press.

Concentration Generation Methods

Dirichlet Distribution (default)

The Dirichlet distribution generates compositional data that sums to 1.0, appropriate for relative proportions:

$$\mathbf{c} \sim \text{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_K)$$

Parameter	Effect
α = [1,1,…,1]	Uniform over simplex
α = [2,2,…,2]	Concentrated toward center
α < 1	Sparse (extreme values)
α > 1	Dense (moderate values)
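
The effect of α can be checked with a short NumPy sketch. The component count, sample size, and α values below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 5

sparse = rng.dirichlet(np.full(K, 0.2), size=1000)    # alpha < 1
uniform = rng.dirichlet(np.ones(K), size=1000)        # alpha = 1
dense = rng.dirichlet(np.full(K, 5.0), size=1000)     # alpha > 1

# Every row lies on the simplex; the spread around 1/K shrinks as alpha grows.
spread = {name: c.std(axis=0).mean()
          for name, c in [("sparse", sparse), ("uniform", uniform), ("dense", dense)]}
```

A large spread means samples dominated by one or two components; a small spread means compositions close to equal proportions.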

Other Methods

Method	Distribution	Use Case
`uniform`	U(0,1) independent	When components are independent
`lognormal`	LogN(0, 0.5) normalized	For positively skewed concentrations
`correlated`	Multivariate normal + Cholesky	When components have known correlations
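
A rough sketch of the three alternative methods follows. The correlation matrix, mean, and scale used for the `correlated` case are hypothetical, and any post-processing the generator applies (clipping, renormalization) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3

# uniform: independent U(0, 1) draws, not constrained to sum to 1
c_uniform = rng.uniform(0.0, 1.0, size=(n, K))

# lognormal: LogN(0, 0.5) draws, then row-normalized to sum to 1
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(n, K))
c_lognormal = raw / raw.sum(axis=1, keepdims=True)

# correlated: color standard-normal draws with a Cholesky factor.
# The target correlation matrix, mean, and scale are hypothetical.
corr = np.array([[1.0, 0.8, -0.3],
                 [0.8, 1.0, 0.0],
                 [-0.3, 0.0, 1.0]])
chol = np.linalg.cholesky(corr)               # corr = chol @ chol.T
z = rng.standard_normal((n, K))
c_correlated = 0.5 + 0.1 * (z @ chol.T)       # mean 0.5, std 0.1 per component
```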

References:

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.

Instrumental and Physical Effects

1. Path Length Variation

Scientific Basis: In diffuse reflectance/transmittance, the effective optical path length varies due to:

Sample packing density
Particle size distribution
Probe-sample contact pressure

$$A_i(\lambda) = L_i \cdot A_0(\lambda), \quad L_i \sim \mathcal{N}(1, \sigma_L)$$

Parameter	Effect	Typical Values
`path_length_std`	Sample-to-sample variation	0.02-0.08
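
The path length model amounts to one multiplicative factor per sample. A minimal sketch, using placeholder spectra and the `path_length_std` value from the "realistic" preset:

```python
import numpy as np

rng = np.random.default_rng(1)
path_length_std = 0.05                            # "realistic" preset value
A0 = np.tile(np.linspace(0.2, 0.8, 5), (4, 1))    # 4 identical placeholder spectra

L = rng.normal(1.0, path_length_std, size=A0.shape[0])   # one factor per sample
A = L[:, None] * A0                                       # scales every wavelength equally
```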

Reference: Martens, H., & Næs, T. (1989). Multivariate Calibration. John Wiley & Sons.

2. Baseline Drift

Scientific Basis: Baseline variations arise from:

Detector drift over time
Temperature effects on optical components
Sample holder variations
Reference spectrum mismatches

Modeled as a polynomial:

$$\text{baseline}_i(\lambda) = b_0 + b_1 \tilde{\lambda} + b_2 \tilde{\lambda}^2 + b_3 \tilde{\lambda}^3$$

where $\tilde{\lambda}$ is the centered, scaled wavelength.

Parameter	Effect	Typical Values
`baseline_amplitude`	Maximum drift magnitude (AU)	0.01-0.05
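
The polynomial baseline can be sketched as follows. Two details are assumptions rather than documented behavior: the scaling of $\tilde{\lambda}$ to $[-1, 1]$, and drawing the coefficients $b_0 \ldots b_3$ from a zero-mean Gaussian scaled by `baseline_amplitude`.

```python
import numpy as np

rng = np.random.default_rng(2)
wl = np.arange(1000.0, 2501.0, 2.0)
baseline_amplitude = 0.02                       # "realistic" preset value

# Center and scale wavelengths to [-1, 1] (assumed definition of lambda-tilde).
wl_tilde = (wl - wl.mean()) / (0.5 * (wl.max() - wl.min()))

# Assumption: zero-mean Gaussian coefficients scaled by the amplitude.
b = rng.normal(0.0, baseline_amplitude, size=4)

# polyval evaluates b[0] + b[1] x + b[2] x^2 + b[3] x^3
baseline = np.polynomial.polynomial.polyval(wl_tilde, b)
```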

3. Global Slope

Scientific Basis: NIR spectra commonly exhibit a global upward or downward trend caused by:

Rayleigh scattering: $I \propto \lambda^{-4}$ (small particles)
Mie scattering: wavelength-dependent for particles comparable to λ
Sample surface roughness: affects baseline slope
Detector sensitivity curve: not perfectly flat

$$A_i(\lambda) \leftarrow A_i(\lambda) + s_i \cdot \frac{\lambda - \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}}$$

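where $s_i \sim \mathcal{N}(\mu_s, \sigma_s)$ is the per-sample slope, with mean $\mu_s$ (`global_slope_mean`) and standard deviation $\sigma_s$ (`global_slope_std`). With this normalization, $s_i$ is the total absorbance change between $\lambda_{\min}$ and $\lambda_{\max}$; it equals a slope in absorbance per 1000 nm only when the range spans 1000 nm.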

Parameter	Effect	Typical Values
`global_slope_mean`	Average slope direction (AU/1000nm)	-0.1 to +0.15
`global_slope_std`	Sample-to-sample slope variation	0.02-0.05

Positive slope: Common in diffuse reflectance, where scattering weakens toward longer wavelengths, so light penetrates deeper and apparent absorbance rises with λ
Negative slope: Can occur with certain sample types or instrument configurations
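
The global slope equation translates directly into a normalized ramp scaled per sample. A minimal sketch with flat placeholder spectra and the "realistic" preset values:

```python
import numpy as np

rng = np.random.default_rng(3)
wl = np.arange(1000.0, 2501.0, 2.0)
A = np.zeros((3, wl.size))                           # flat placeholder spectra
global_slope_mean, global_slope_std = 0.05, 0.03     # "realistic" preset values

s = rng.normal(global_slope_mean, global_slope_std, size=A.shape[0])
ramp = (wl - wl.min()) / (wl.max() - wl.min())       # 0 at lambda_min, 1 at lambda_max
A_sloped = A + s[:, None] * ramp
```

With flat input spectra, the last column of `A_sloped` equals `s`, which makes the per-sample slope easy to inspect.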

References:

Rinnan, Å., Van Den Berg, F., & Engelsen, S. B. (2009). Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.

4. Scattering Effects (MSC/SNV-like)

Scientific Basis: Multiplicative Scatter Correction (MSC) and Standard Normal Variate (SNV) are designed to remove scatter-induced multiplicative and additive effects. The generator simulates the uncorrected effects that these methods target:

$$A_{\text{scatter}}(\lambda) = \alpha \cdot A(\lambda) + \beta + \gamma \cdot \tilde{\lambda}$$

Where:

$\alpha$ = multiplicative scatter effect (gain)
$\beta$ = additive offset
$\gamma$ = wavelength-dependent tilt

Parameter	Effect	Typical Values
`scatter_alpha_std`	Multiplicative variation	0.02-0.08
`scatter_beta_std`	Additive offset variation	0.005-0.02
`tilt_std`	Linear tilt variation	0.005-0.02
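
A sketch of the scatter model with one $(\alpha, \beta, \gamma)$ triple per sample. The means (1 for the gain, 0 for offset and tilt) and the scaling of $\tilde{\lambda}$ to $[-1, 1]$ are assumptions; the standard deviations are the "realistic" preset values.

```python
import numpy as np

rng = np.random.default_rng(4)
wl = np.arange(1000.0, 2501.0, 2.0)
wl_tilde = (wl - wl.mean()) / (0.5 * (wl.max() - wl.min()))   # assumed scaling to [-1, 1]
A = np.tile(0.5 + 0.2 * np.sin(wl / 150.0), (5, 1))           # placeholder spectra

n = A.shape[0]
alpha = rng.normal(1.0, 0.05, size=n)    # scatter_alpha_std; mean of 1 is assumed
beta = rng.normal(0.0, 0.01, size=n)     # scatter_beta_std; mean of 0 is assumed
gamma = rng.normal(0.0, 0.01, size=n)    # tilt_std; mean of 0 is assumed

A_scatter = alpha[:, None] * A + beta[:, None] + gamma[:, None] * wl_tilde
```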

References:

Geladi, P., MacDougall, D., & Martens, H. (1985). Linearization and scatter-correction for near-infrared reflectance spectra of meat. Applied Spectroscopy, 39(3), 491-500.
Barnes, R. J., Dhanoa, M. S., & Lister, S. J. (1989). Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Applied Spectroscopy, 43(5), 772-777.

5. Wavelength Calibration Errors

Scientific Basis: Wavelength calibration errors occur due to:

Spectrometer temperature changes (thermal expansion of grating)
Aging of optical components
Mechanical drift in monochromator
Different instruments having slightly different calibrations

Modeled as a shift and stretch:

$$\lambda_{\text{measured}} = s \cdot \lambda_{\text{true}} + \Delta\lambda$$

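Where the stretch factor $s \sim \mathcal{N}(1, \sigma_s)$ has standard deviation $\sigma_s$ = `stretch_std` and the shift $\Delta\lambda \sim \mathcal{N}(0, \sigma_\Delta)$ has standard deviation $\sigma_\Delta$ = `shift_std`. In this subsection $s$ denotes the stretch factor, not the global slope.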

Parameter	Effect	Typical Values
`shift_std`	Wavelength shift (nm)	0.2-1.0 nm
`stretch_std`	Wavelength scale factor std	0.0005-0.002
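
One way to apply the shift-and-stretch model on a fixed wavelength grid is to resample the spectrum by linear interpolation. The helper `apply_wavelength_error` is hypothetical, and linear interpolation is an assumption; the generator may resample differently.

```python
import numpy as np

def apply_wavelength_error(wl, spectrum, stretch, shift):
    """Resample so a feature at wavelength w appears at stretch * w + shift."""
    true_wl = (wl - shift) / stretch           # invert the calibration model
    return np.interp(true_wl, wl, spectrum)    # outside the grid, end values are held

rng = np.random.default_rng(5)
wl = np.arange(1000.0, 2501.0, 2.0)
spectrum = np.exp(-0.5 * ((wl - 1450.0) / 20.0) ** 2)   # single placeholder band

stretch = rng.normal(1.0, 0.001)    # stretch_std, "realistic" preset
shift = rng.normal(0.0, 0.5)        # shift_std in nm, "realistic" preset
distorted = apply_wavelength_error(wl, spectrum, stretch, shift)

# Deterministic check: a pure +4 nm shift moves the band maximum to 1454 nm.
shifted = apply_wavelength_error(wl, spectrum, 1.0, 4.0)
```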

Reference: Feudale, R. N., et al. (2002). Transfer of multivariate calibration models: a review. Chemometrics and Intelligent Laboratory Systems, 64(2), 181-192.

6. Instrumental Broadening

Scientific Basis: Finite spectral resolution of the instrument causes peak broadening. The instrument’s slit function is approximated as Gaussian.

$$A_{\text{meas}}(\lambda) = A_{\text{true}}(\lambda) * G(\lambda; \text{FWHM})$$

Where * denotes convolution and FWHM is the Full Width at Half Maximum of the instrumental line shape.

Parameter	Effect	Typical Values
`instrumental_fwhm`	Spectral resolution (nm)	4-12 nm

Lower FWHM: Higher resolution, sharper peaks, typically research-grade instruments
Higher FWHM: Lower resolution, broader peaks, typical of industrial/portable instruments
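
The convolution can be sketched with a discrete Gaussian kernel, using the standard relation $\sigma = \text{FWHM} / (2\sqrt{2 \ln 2})$. The helper `broaden`, the kernel truncation at ±4σ, and the edge padding are illustrative choices, not the generator's documented implementation.

```python
import numpy as np

def broaden(spectrum, step_nm, fwhm_nm):
    """Convolve a uniformly sampled spectrum with a unit-sum Gaussian kernel."""
    sigma = fwhm_nm / (2.0 * np.sqrt(2.0 * np.log(2.0)))   # FWHM -> std, in nm
    half = int(np.ceil(4.0 * sigma / step_nm))             # kernel covers +/- 4 sigma
    x = np.arange(-half, half + 1) * step_nm
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(spectrum, half, mode="edge")           # limits edge artifacts
    return np.convolve(padded, kernel, mode="valid")       # same length as the input

wl = np.arange(1000.0, 2501.0, 2.0)
sharp = np.exp(-0.5 * ((wl - 1720.0) / 5.0) ** 2)          # narrow placeholder band
measured = broaden(sharp, step_nm=2.0, fwhm_nm=8.0)
```

The broadened band is lower and wider than the input while its area is preserved, which is the behavior expected from a finite-resolution instrument.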

Reference: Griffiths, P. R., & De Haseth, J. A. (2007). Fourier Transform Infrared Spectrometry (2nd ed.). John Wiley & Sons.

7. Noise Model

Scientific Basis: NIR detector noise has multiple components:

Shot noise: Poisson-distributed, signal-dependent
Thermal noise: Johnson-Nyquist, independent of signal
Readout noise: Electronics, independent of signal

Approximated as heteroscedastic Gaussian:

$$\sigma(\lambda) = \sigma_{\text{base}} + \sigma_{\text{signal}} \cdot |A(\lambda)|$$

Parameter	Effect	Typical Values
`noise_base`	Signal-independent noise floor	0.002-0.008
`noise_signal_dep`	Signal-dependent noise factor	0.005-0.015

The signal-to-noise ratio (SNR) at a given absorbance is approximately SNR ≈ A / σ(A).
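
The heteroscedastic approximation can be sketched by scaling standard-normal draws with a per-point standard deviation. The absorbance ramp is a placeholder; the two noise parameters are the "realistic" preset values.

```python
import numpy as np

rng = np.random.default_rng(6)
noise_base, noise_signal_dep = 0.005, 0.01            # "realistic" preset values
A = np.tile(np.linspace(0.0, 2.0, 200), (2000, 1))    # placeholder: absorbance ramp 0..2

sigma = noise_base + noise_signal_dep * np.abs(A)     # per-point noise std
A_noisy = A + rng.normal(0.0, 1.0, size=A.shape) * sigma

# Empirical noise std across samples at the lowest and highest absorbance.
std_low = (A_noisy - A)[:, 0].std()      # sigma there is 0.005 (A = 0)
std_high = (A_noisy - A)[:, -1].std()    # sigma there is 0.025 (A = 2)
```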

Reference: Workman Jr, J. (2007). NIR spectroscopy instrumentation. In Handbook of Near-Infrared Analysis (3rd ed., pp. 91-112). CRC Press.

8. Artifacts

Scientific Basis: Real-world spectra may contain artifacts:

Artifact Type	Cause	Model
Spike	Cosmic rays, electrical interference	Random point additions
Dead band	Detector defects, atmospheric absorption	Localized noise increase
Saturation	Detector/ADC overflow	Clipping at high absorbance

Parameter	Effect	Typical Values
`artifact_prob`	Probability of artifact per sample	0.0-0.05
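
The three artifact models in the table might be applied along the following lines. Everything beyond `artifact_prob` is hypothetical: the equal choice between artifact types, the spike height, the dead-band width and noise level, and the saturation ceiling.

```python
import numpy as np

rng = np.random.default_rng(7)
artifact_prob = 0.02                                 # "realistic" preset value
A = np.tile(np.linspace(0.2, 1.8, 100), (500, 1))    # placeholder spectra
A_art = A.copy()

hit = rng.random(A.shape[0]) < artifact_prob         # samples that receive an artifact
for i in np.flatnonzero(hit):
    kind = rng.choice(["spike", "dead_band", "saturation"])
    if kind == "spike":                              # random point addition
        A_art[i, rng.integers(A.shape[1])] += 0.5    # hypothetical spike height
    elif kind == "dead_band":                        # localized noise increase
        start = rng.integers(A.shape[1] - 10)
        A_art[i, start:start + 10] += rng.normal(0.0, 0.05, size=10)
    else:                                            # clipping at high absorbance
        A_art[i] = np.minimum(A_art[i], 1.5)         # hypothetical ceiling
```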

9. Batch Effects

Scientific Basis: Multi-session/multi-instrument data exhibits systematic differences:

Lamp aging → intensity drift
Environmental changes → baseline shift
Recalibration → scale changes

$$A_{\text{batch}_j} = g_j \cdot A + \mathbf{o}_j$$

Where $g_j$ is batch-specific gain and $\mathbf{o}_j$ is batch-specific offset.
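
A sketch of the batch model with one gain and one offset vector per batch. The equal batch sizes and the spreads of the gain and offset are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n_samples, n_batches, P = 600, 3, 50
A = rng.uniform(0.2, 1.0, size=(n_samples, P))            # placeholder spectra

# Assign samples to equally sized batches: 0,0,...,1,1,...,2,2,...
batch_ids = np.repeat(np.arange(n_batches), n_samples // n_batches)
gain = rng.normal(1.0, 0.03, size=n_batches)              # g_j, hypothetical spread
offset = rng.normal(0.0, 0.01, size=(n_batches, P))       # o_j, one vector per batch

A_batch = gain[batch_ids][:, None] * A + offset[batch_ids]
```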

Used for:

Domain adaptation research
Transfer learning studies
Calibration maintenance testing

Reference: Feudale, R. N., et al. (2002). Transfer of multivariate calibration models: a review. Chemometrics and Intelligent Laboratory Systems, 64(2), 181-192.

Complexity Levels

The generator provides three preset complexity levels, each of which sets the parameters for a different use case:

Simple (Testing/Debugging)

```python
complexity = "simple"
```

Low noise, minimal artifacts
Small path length and scatter variation
No global slope (flat baseline trend)
Suitable for: algorithm debugging, unit testing

Realistic (Training/Benchmarking)

```python
complexity = "realistic"  # Default
```

Moderate noise levels
Typical instrument resolution (8 nm FWHM)
~2% artifact rate
Positive global slope (typical NIR behavior)
Suitable for: model training, algorithm comparison

Complex (Robustness Testing)

```python
complexity = "complex"
```

High noise levels
Large inter-sample variability
~5% artifact rate
Strong global slope variation
Lower resolution (12 nm FWHM)
Suitable for: stress testing, robustness evaluation

Complete Parameter Reference

Parameter	Simple	Realistic	Complex	Unit
`path_length_std`	0.02	0.05	0.08	fraction
`baseline_amplitude`	0.01	0.02	0.05	AU
`scatter_alpha_std`	0.02	0.05	0.08	fraction
`scatter_beta_std`	0.005	0.01	0.02	AU
`tilt_std`	0.005	0.01	0.02	AU
`global_slope_mean`	0.0	0.05	0.08	AU/1000nm
`global_slope_std`	0.02	0.03	0.05	AU/1000nm
`shift_std`	0.2	0.5	1.0	nm
`stretch_std`	0.0005	0.001	0.002	fraction
`instrumental_fwhm`	4	8	12	nm
`noise_base`	0.002	0.005	0.008	AU
`noise_signal_dep`	0.005	0.01	0.015	fraction
`artifact_prob`	0.0	0.02	0.05	probability

Usage Examples

Basic Usage

```python
from examples.synthetic import SyntheticNIRSGenerator

generator = SyntheticNIRSGenerator(
    wavelength_start=1000,
    wavelength_end=2500,
    complexity="realistic",
    random_state=42
)

X, Y, E = generator.generate(n_samples=1000)
# X: (1000, 751) spectra
# Y: (1000, 5) concentrations
# E: (5, 751) pure component spectra
```

Custom Parameters

```python
generator = SyntheticNIRSGenerator(complexity="realistic")
# Override specific parameters
generator.params["global_slope_mean"] = 0.02
generator.params["noise_base"] = 0.003

X, Y, E = generator.generate(n_samples=500)
```

With Batch Effects

```python
X, Y, E, metadata = generator.generate(
    n_samples=600,
    include_batch_effects=True,
    n_batches=3,
    return_metadata=True
)

batch_ids = metadata["batch_ids"]  # Sample-to-batch mapping
```

References

Core Spectroscopy

Beer, A. (1852). Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten. Annalen der Physik, 162(5), 78-88.
Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press. ISBN: 978-1439875254
Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis (3rd ed.). CRC Press. ISBN: 978-0849373930

Peak Shapes

Olivero, J. J., & Longbothum, R. L. (1977). Empirical fits to the Voigt line width. Journal of Quantitative Spectroscopy and Radiative Transfer, 17(2), 233-236.

Preprocessing and Scatter Correction

Rinnan, Å., Van Den Berg, F., & Engelsen, S. B. (2009). Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.
Barnes, R. J., Dhanoa, M. S., & Lister, S. J. (1989). Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Applied Spectroscopy, 43(5), 772-777.
Geladi, P., MacDougall, D., & Martens, H. (1985). Linearization and scatter-correction for near-infrared reflectance spectra of meat. Applied Spectroscopy, 39(3), 491-500.

Calibration Transfer

Feudale, R. N., Woody, N. A., Tan, H., Myles, A. J., Brown, S. D., & Ferré, J. (2002). Transfer of multivariate calibration models: a review. Chemometrics and Intelligent Laboratory Systems, 64(2), 181-192.

Multivariate Calibration

Martens, H., & Næs, T. (1989). Multivariate Calibration. John Wiley & Sons. ISBN: 978-0471930471

Compositional Data

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall. ISBN: 978-0412280603

Version History

v1.0.0 (2024): Initial implementation with Beer-Lambert model, Voigt profiles, complexity levels
v1.1.0 (2024): Added global slope effect, SyntheticRealComparator for real data comparison