# Synthetic NIRS Spectra Generator - Scientific Documentation

## Overview

The `SyntheticNIRSGenerator` provides a physically-motivated model for generating realistic synthetic Near-Infrared (NIR) spectra. It implements the Beer-Lambert law with additional effects for instrumental artifacts, noise, and inter-sample variability commonly observed in real-world spectroscopic measurements.

This generator is designed for:

- **Autoencoder training**: Generate unlimited labeled data for unsupervised feature learning
- **Algorithm benchmarking**: Test preprocessing and modeling algorithms under controlled conditions
- **Domain adaptation research**: Simulate multi-instrument/multi-session variability
- **Data augmentation**: Supplement real datasets with physically plausible synthetic samples

---

## Theoretical Foundation

### Beer-Lambert Law

The fundamental spectroscopic relationship underlying the generator is the **Beer-Lambert-Bouguer Law**:

$$A(\lambda) = \varepsilon(\lambda) \cdot c \cdot L$$

Where:

- $A(\lambda)$ = Absorbance at wavelength $\lambda$
- $\varepsilon(\lambda)$ = Molar absorptivity (extinction coefficient)
- $c$ = Concentration of the absorbing species
- $L$ = Optical path length

For mixtures with $K$ components, the total absorbance follows the **additivity principle**:

$$A(\lambda) = \sum_{k=1}^{K} \varepsilon_k(\lambda) \cdot c_k \cdot L$$

In matrix notation: $\mathbf{A} = \mathbf{C} \cdot \mathbf{E}$ where $\mathbf{C}$ is the concentration matrix $(N \times K)$ and $\mathbf{E}$ is the pure component spectra matrix $(K \times P)$.

**References:**

- Beer, A. (1852). Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten. *Annalen der Physik*, 162(5), 78-88.
- Swinehart, D. F. (1962). The Beer-Lambert Law. *Journal of Chemical Education*, 39(7), 333.

---

## Peak Shape: Voigt Profile

### Scientific Justification

Real absorption peaks in NIR spectroscopy are neither purely Gaussian nor purely Lorentzian.
The observed peak shape is a **Voigt profile** - the convolution of Gaussian (Doppler/thermal) and Lorentzian (collision/pressure) broadening mechanisms.

$$V(x; \sigma, \gamma) = \int_{-\infty}^{\infty} G(x'; \sigma) L(x - x'; \gamma) \, dx'$$

Where:

- $G(x; \sigma)$ = Gaussian profile with width $\sigma$
- $L(x; \gamma)$ = Lorentzian profile with width $\gamma$

The Voigt function is computed using `scipy.special.voigt_profile`.

### Parameters

| Parameter | Type | Effect | Typical Range |
|-----------|------|--------|---------------|
| `center` | float | Peak position in nm | 1000-2500 nm |
| `sigma` | float | Gaussian width (HWHM) in nm | 10-50 nm |
| `gamma` | float | Lorentzian width (HWHM) in nm | 0-10 nm |
| `amplitude` | float | Peak height in absorbance units | 0.1-1.0 AU |

**Effect of parameters:**

- **sigma**: Controls the overall peak width. Larger values = broader peaks
- **gamma**: Controls "tails" of the peak. γ=0 gives pure Gaussian; increasing γ adds Lorentzian character with heavier tails
- **amplitude**: Directly scales peak height; related to absorptivity × concentration

**References:**

- Olivero, J. J., & Longbothum, R. L. (1977). Empirical fits to the Voigt line width. *Journal of Quantitative Spectroscopy and Radiative Transfer*, 17(2), 233-236.
- Whiting, E. E. (1968). An empirical approximation to the Voigt profile. *Journal of Quantitative Spectroscopy and Radiative Transfer*, 8(6), 1379-1384.

---

## NIR Band Assignments

### Scientific Basis

NIR absorption bands arise from **molecular overtones and combination bands** of fundamental vibrational modes. The predefined components are based on established NIR band assignments from spectroscopic literature.

| Functional Group | Band Type | Wavelength (nm) | Assignment |
|-----------------|-----------|-----------------|------------|
| O-H (water) | 1st overtone | 1400-1500 | ν₁ + ν₃ (≈ 2ν) |
| O-H (water) | combination | 1900-2000 | ν + δ |
| N-H (protein) | 1st overtone | 1500-1560 | 2ν |
| N-H (protein) | combination | 2000-2100 | ν + δ |
| C-H (aliphatic) | 2nd overtone | 1100-1250 | 3ν |
| C-H (aliphatic) | 1st overtone | 1650-1800 | 2ν |
| C-H (aromatic) | combination | 2200-2400 | ν + δ |

Where ν = stretching, δ = bending modes.

### Predefined Components

```
water:       O-H bands at 1450, 1940, 2500 nm
protein:     N-H bands at 1510, 1680, 2050, 2180, 2300 nm
lipid:       C-H bands at 1210, 1390, 1720, 2310, 2350 nm
starch:      O-H/C-O bands at 1460, 1580, 2100, 2270 nm
cellulose:   O-H/C-O bands at 1490, 1780, 2090, 2280, 2340 nm
chlorophyll: absorption at 1070, 1400, 2270 nm
oil:         C-H bands at 1165, 1215, 1410, 1725, 2140, 2305 nm
```

**References:**

- Workman Jr, J., & Weyer, L. (2012). *Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy*. CRC Press.
- Burns, D. A., & Ciurczak, E. W. (2007). *Handbook of Near-Infrared Analysis* (3rd ed.). CRC Press.
- Shenk, J. S., Workman Jr, J. J., & Westerhaus, M. O. (2008). Application of NIR spectroscopy to agricultural products. In *Handbook of Near-Infrared Analysis* (3rd ed., pp. 347-386). CRC Press.

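
As a minimal sketch of how the band positions above could be turned into pure component spectra, the example below sums Voigt peaks at the listed centers using `scipy.special.voigt_profile`. The helper names (`voigt_peak`, `component_spectrum`), the shared `sigma`/`gamma`/`amplitude` values, the 2 nm wavelength grid, and the rescaling to a fixed peak height are all assumptions made for illustration; they are not taken from the generator's source. Note that SciPy interprets `sigma` as the Gaussian standard deviation and `gamma` as the Lorentzian half-width at half-maximum.

```python
import numpy as np
from scipy.special import voigt_profile

# Band centers (nm) copied from the predefined components listed above.
BAND_CENTERS = {
    "water": [1450, 1940, 2500],
    "protein": [1510, 1680, 2050, 2180, 2300],
    "lipid": [1210, 1390, 1720, 2310, 2350],
}


def voigt_peak(wavelengths, center, sigma, gamma, amplitude):
    """Voigt peak rescaled so that its maximum equals `amplitude` (hypothetical helper)."""
    profile = voigt_profile(wavelengths - center, sigma, gamma)
    peak_max = voigt_profile(0.0, sigma, gamma)  # value at the peak center
    return amplitude * profile / peak_max


def component_spectrum(wavelengths, centers, sigma=25.0, gamma=3.0, amplitude=0.5):
    """Sum Voigt peaks that share one width and height (an illustrative simplification)."""
    spectrum = np.zeros_like(wavelengths, dtype=float)
    for center in centers:
        spectrum += voigt_peak(wavelengths, center, sigma, gamma, amplitude)
    return spectrum


# Assumed grid: 1000-2500 nm in 2 nm steps, which gives 751 points.
wavelengths = np.arange(1000, 2501, 2, dtype=float)

# Pure component matrix E with shape (K, P), one row per component.
E = np.vstack([component_spectrum(wavelengths, c) for c in BAND_CENTERS.values()])
print(E.shape)
```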

---

## Concentration Generation Methods

### Dirichlet Distribution (default)

The Dirichlet distribution generates compositional data that sums to 1.0, appropriate for relative proportions:

$$\mathbf{c} \sim \text{Dir}(\alpha_1, \alpha_2, ..., \alpha_K)$$

| Parameter | Effect |
|-----------|--------|
| α = [1,1,...,1] | Uniform over simplex |
| α = [2,2,...,2] | Concentrated toward center |
| α < 1 | Sparse (extreme values) |
| α > 1 | Dense (moderate values) |

### Other Methods

| Method | Distribution | Use Case |
|--------|--------------|----------|
| `uniform` | U(0,1) independent | When components are independent |
| `lognormal` | LogN(0, 0.5) normalized | For positively skewed concentrations |
| `correlated` | Multivariate normal + Cholesky | When components have known correlations |

**References:**

- Aitchison, J. (1986). *The Statistical Analysis of Compositional Data*. Chapman & Hall.

---

## Instrumental and Physical Effects

### 1. Path Length Variation

**Scientific Basis:** In diffuse reflectance/transmittance, the effective optical path length varies due to:

- Sample packing density
- Particle size distribution
- Probe-sample contact pressure

$$A_i(\lambda) = L_i \cdot A_0(\lambda), \quad L_i \sim \mathcal{N}(1, \sigma_L)$$

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `path_length_std` | Sample-to-sample variation | 0.02-0.08 |

**Reference:** Martens, H., & Næs, T. (1989). *Multivariate Calibration*. John Wiley & Sons.

---

### 2. Baseline Drift

**Scientific Basis:** Baseline variations arise from:

- Detector drift over time
- Temperature effects on optical components
- Sample holder variations
- Reference spectrum mismatches

Modeled as a polynomial:

$$\text{baseline}_i(\lambda) = b_0 + b_1 \tilde{\lambda} + b_2 \tilde{\lambda}^2 + b_3 \tilde{\lambda}^3$$

where $\tilde{\lambda}$ is the centered, scaled wavelength.
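
The steps described so far (Dirichlet concentrations, Beer-Lambert mixing $\mathbf{A} = \mathbf{C} \cdot \mathbf{E}$, per-sample path length scaling, and the cubic baseline) can be sketched with NumPy alone. This is an illustration under stated assumptions, not the generator's code: the Gaussian stand-ins for $\mathbf{E}$, the scaling of $\tilde{\lambda}$ to $[-1, 1]$, and drawing each $b_k$ from a zero-mean normal with standard deviation `baseline_amplitude` are all choices made here for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed grid: 1000-2500 nm in 2 nm steps (751 points).
wavelengths = np.arange(1000, 2501, 2, dtype=float)
n_samples, K = 4, 3

# Stand-in pure component spectra E (K, P): one Gaussian band per component.
centers = np.array([1450.0, 1720.0, 2100.0])  # illustrative band centers
E = 0.5 * np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 40.0) ** 2)

# Dirichlet concentrations C (N, K); alpha = 1 is uniform over the simplex.
C = rng.dirichlet(np.ones(K), size=n_samples)

# Beer-Lambert mixing at unit path length: A0 = C . E
A0 = C @ E

# Path length variation: L_i ~ N(1, path_length_std), one factor per sample.
path_length_std = 0.05
L = rng.normal(1.0, path_length_std, size=(n_samples, 1))
A = L * A0

# Centered, scaled wavelength in [-1, 1] (assumed scaling).
half_range = 0.5 * (wavelengths.max() - wavelengths.min())
lam = (wavelengths - wavelengths.mean()) / half_range

# Baseline drift: b_0 + b_1*lam + b_2*lam^2 + b_3*lam^3 per sample.
baseline_amplitude = 0.02
coeffs = rng.normal(0.0, baseline_amplitude, size=(n_samples, 4))  # assumed draw
powers = np.vstack([lam**k for k in range(4)])  # shape (4, P)
baseline = coeffs @ powers  # shape (N, P)

X = A + baseline
print(X.shape)
```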

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `baseline_amplitude` | Maximum drift magnitude (AU) | 0.01-0.05 |

---

### 3. Global Slope

**Scientific Basis:** NIR spectra commonly exhibit a global upward or downward trend caused by:

- **Rayleigh scattering**: $I \propto \lambda^{-4}$ (small particles)
- **Mie scattering**: wavelength-dependent for particles comparable to λ
- **Sample surface roughness**: affects baseline slope
- **Detector sensitivity curve**: not perfectly flat

$$A_i(\lambda) \leftarrow A_i(\lambda) + s_i \cdot \frac{\lambda - \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}}$$

where $s_i \sim \mathcal{N}(\mu_s, \sigma_s)$ is the slope (absorbance per 1000nm).

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `global_slope_mean` | Average slope direction (AU/1000nm) | -0.1 to +0.15 |
| `global_slope_std` | Sample-to-sample slope variation | 0.02-0.05 |

**Positive slope:** Common in diffuse reflectance (scattering weakens at longer λ, so light penetrates deeper and apparent absorbance rises)

**Negative slope:** Can occur with certain sample types or instrument configurations

**References:**

- Rinnan, Å., Van Den Berg, F., & Engelsen, S. B. (2009). Review of the most common pre-processing techniques for near-infrared spectra. *TrAC Trends in Analytical Chemistry*, 28(10), 1201-1222.

---

### 4. Scattering Effects (MSC/SNV-like)

**Scientific Basis:** Multiplicative Scatter Correction (MSC) and Standard Normal Variate (SNV) are designed to remove scatter-induced baseline effects.
The generator simulates these effects *before* correction:

$$A_{\text{scatter}}(\lambda) = \alpha \cdot A(\lambda) + \beta + \gamma \cdot \tilde{\lambda}$$

Where:

- $\alpha$ = multiplicative scatter effect (gain)
- $\beta$ = additive offset
- $\gamma$ = wavelength-dependent tilt

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `scatter_alpha_std` | Multiplicative variation | 0.02-0.08 |
| `scatter_beta_std` | Additive offset variation | 0.005-0.02 |
| `tilt_std` | Linear tilt variation | 0.005-0.02 |

**References:**

- Geladi, P., MacDougall, D., & Martens, H. (1985). Linearization and scatter-correction for near-infrared reflectance spectra of meat. *Applied Spectroscopy*, 39(3), 491-500.
- Barnes, R. J., Dhanoa, M. S., & Lister, S. J. (1989). Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. *Applied Spectroscopy*, 43(5), 772-777.

---

### 5. Wavelength Calibration Errors

**Scientific Basis:** Wavelength calibration errors occur due to:

- Spectrometer temperature changes (thermal expansion of grating)
- Aging of optical components
- Mechanical drift in monochromator
- Different instruments having slightly different calibrations

Modeled as a shift and stretch:

$$\lambda_{\text{measured}} = s \cdot \lambda_{\text{true}} + \Delta\lambda$$

Where $s \sim \mathcal{N}(1, \sigma_s)$ and $\Delta\lambda \sim \mathcal{N}(0, \sigma_\Delta)$.

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `shift_std` | Wavelength shift (nm) | 0.2-1.0 nm |
| `stretch_std` | Wavelength scale factor std | 0.0005-0.002 |

**Reference:** Feudale, R. N., et al. (2002). Transfer of multivariate calibration models: a review. *Chemometrics and Intelligent Laboratory Systems*, 64(2), 181-192.

---

### 6. Instrumental Broadening

**Scientific Basis:** Finite spectral resolution of the instrument causes peak broadening. The instrument's slit function is approximated as Gaussian.

$$A_{\text{meas}}(\lambda) = A_{\text{true}}(\lambda) * G(\lambda; \text{FWHM})$$

Where $*$ denotes convolution and FWHM is the Full Width at Half Maximum of the instrumental line shape.

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `instrumental_fwhm` | Spectral resolution (nm) | 4-12 nm |

**Lower FWHM:** Higher resolution, sharper peaks, typical of research-grade instruments

**Higher FWHM:** Lower resolution, broader peaks, typical of industrial/portable instruments

**Reference:** Griffiths, P. R., & De Haseth, J. A. (2007). *Fourier Transform Infrared Spectrometry* (2nd ed.). John Wiley & Sons.

---

### 7. Noise Model

**Scientific Basis:** NIR detector noise has multiple components:

- **Shot noise**: Poisson-distributed, signal-dependent
- **Thermal noise**: Johnson-Nyquist, independent of signal
- **Readout noise**: Electronics, independent of signal

Approximated as heteroscedastic Gaussian:

$$\sigma(\lambda) = \sigma_{\text{base}} + \sigma_{\text{signal}} \cdot |A(\lambda)|$$

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `noise_base` | Signal-independent noise floor | 0.002-0.008 |
| `noise_signal_dep` | Signal-dependent noise factor | 0.005-0.015 |

The **Signal-to-Noise Ratio (SNR)** is approximately: SNR ≈ A / σ(A)

**Reference:** Workman Jr, J. (2007). NIR spectroscopy instrumentation. In *Handbook of Near-Infrared Analysis* (3rd ed., pp. 91-112). CRC Press.

---

### 8. Artifacts

**Scientific Basis:** Real-world spectra may contain artifacts:

| Artifact Type | Cause | Model |
|---------------|-------|-------|
| **Spike** | Cosmic rays, electrical interference | Random point additions |
| **Dead band** | Detector defects, atmospheric absorption | Localized noise increase |
| **Saturation** | Detector/ADC overflow | Clipping at high absorbance |

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `artifact_prob` | Probability of artifact per sample | 0.0-0.05 |

---

### 9. Batch Effects

**Scientific Basis:** Multi-session/multi-instrument data exhibits systematic differences:

- Lamp aging → intensity drift
- Environmental changes → baseline shift
- Recalibration → scale changes

$$A_{\text{batch}_j} = g_j \cdot A + \mathbf{o}_j$$

Where $g_j$ is batch-specific gain and $\mathbf{o}_j$ is batch-specific offset.

Used for:

- Domain adaptation research
- Transfer learning studies
- Calibration maintenance testing

**Reference:** Feudale, R. N., et al. (2002). Transfer of multivariate calibration models: a review. *Chemometrics and Intelligent Laboratory Systems*, 64(2), 181-192.

---

## Complexity Levels

The generator provides three preset complexity levels, each setting the parameters for a different use case:

### Simple (Testing/Debugging)

```python
complexity = "simple"
```

- Low noise, minimal artifacts
- Small path length and scatter variation
- No global slope (flat baseline trend)
- Suitable for: algorithm debugging, unit testing

### Realistic (Training/Benchmarking)

```python
complexity = "realistic"  # Default
```

- Moderate noise levels
- Typical instrument resolution (8 nm FWHM)
- ~2% artifact rate
- Positive global slope (typical NIR behavior)
- Suitable for: model training, algorithm comparison

### Complex (Robustness Testing)

```python
complexity = "complex"
```

- High noise levels
- Large inter-sample variability
- ~5% artifact rate
- Strong global slope variation
- Lower resolution (12 nm FWHM)
- Suitable for: stress testing, robustness evaluation

---

## Complete Parameter Reference

| Parameter | Simple | Realistic | Complex | Unit |
|-----------|--------|-----------|---------|------|
| `path_length_std` | 0.02 | 0.05 | 0.08 | fraction |
| `baseline_amplitude` | 0.01 | 0.02 | 0.05 | AU |
| `scatter_alpha_std` | 0.02 | 0.05 | 0.08 | fraction |
| `scatter_beta_std` | 0.005 | 0.01 | 0.02 | AU |
| `tilt_std` | 0.005 | 0.01 | 0.02 | AU |
| `global_slope_mean` | 0.0 | 0.05 | 0.08 | AU/1000nm |
| `global_slope_std` | 0.02 | 0.03 | 0.05 | AU/1000nm |
| `shift_std` | 0.2 | 0.5 | 1.0 | nm |
| `stretch_std` | 0.0005 | 0.001 | 0.002 | fraction |
| `instrumental_fwhm` | 4 | 8 | 12 | nm |
| `noise_base` | 0.002 | 0.005 | 0.008 | AU |
| `noise_signal_dep` | 0.005 | 0.01 | 0.015 | fraction |
| `artifact_prob` | 0.0 | 0.02 | 0.05 | probability |

---

## Usage Examples

### Basic Usage

```python
from examples.synthetic import SyntheticNIRSGenerator

generator = SyntheticNIRSGenerator(
    wavelength_start=1000,
    wavelength_end=2500,
    complexity="realistic",
    random_state=42
)

X, Y, E = generator.generate(n_samples=1000)
# X: (1000, 751) spectra
# Y: (1000, 5) concentrations
# E: (5, 751) pure component spectra
```

### Custom Parameters

```python
generator = SyntheticNIRSGenerator(complexity="realistic")

# Override specific parameters
generator.params["global_slope_mean"] = 0.02
generator.params["noise_base"] = 0.003

X, Y, E = generator.generate(n_samples=500)
```

### With Batch Effects

```python
X, Y, E, metadata = generator.generate(
    n_samples=600,
    include_batch_effects=True,
    n_batches=3,
    return_metadata=True
)

batch_ids = metadata["batch_ids"]  # Sample-to-batch mapping
```

---

## References

### Core Spectroscopy

1. **Beer, A.** (1852). Bestimmung der Absorption des rothen Lichts in farbigen Flüssigkeiten. *Annalen der Physik*, 162(5), 78-88.
2. **Workman Jr, J., & Weyer, L.** (2012). *Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy*. CRC Press. ISBN: 978-1439875254
3. **Burns, D. A., & Ciurczak, E. W.** (2007). *Handbook of Near-Infrared Analysis* (3rd ed.). CRC Press. ISBN: 978-0849373930

### Peak Shapes

4. **Olivero, J. J., & Longbothum, R. L.** (1977). Empirical fits to the Voigt line width. *Journal of Quantitative Spectroscopy and Radiative Transfer*, 17(2), 233-236.

### Preprocessing and Scatter Correction

5. **Rinnan, Å., Van Den Berg, F., & Engelsen, S. B.** (2009). Review of the most common pre-processing techniques for near-infrared spectra.
   *TrAC Trends in Analytical Chemistry*, 28(10), 1201-1222.
6. **Barnes, R. J., Dhanoa, M. S., & Lister, S. J.** (1989). Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. *Applied Spectroscopy*, 43(5), 772-777.
7. **Geladi, P., MacDougall, D., & Martens, H.** (1985). Linearization and scatter-correction for near-infrared reflectance spectra of meat. *Applied Spectroscopy*, 39(3), 491-500.

### Calibration Transfer

8. **Feudale, R. N., Woody, N. A., Tan, H., Myles, A. J., Brown, S. D., & Ferré, J.** (2002). Transfer of multivariate calibration models: a review. *Chemometrics and Intelligent Laboratory Systems*, 64(2), 181-192.

### Multivariate Calibration

9. **Martens, H., & Næs, T.** (1989). *Multivariate Calibration*. John Wiley & Sons. ISBN: 978-0471930471

### Compositional Data

10. **Aitchison, J.** (1986). *The Statistical Analysis of Compositional Data*. Chapman & Hall. ISBN: 978-0412280603

---

## Version History

- **v1.0.0** (2024): Initial implementation with Beer-Lambert model, Voigt profiles, complexity levels
- **v1.1.0** (2024): Added global slope effect, SyntheticRealComparator for real data comparison