Synthetic Data Generator
This guide covers the internals of the synthetic NIRS data generator for developers who want to extend or customize it.
Architecture Overview
nirs4all/data/synthetic/
├── __init__.py # Public API exports
├── generator.py # SyntheticNIRSGenerator (core engine)
├── builder.py # SyntheticDatasetBuilder (fluent API)
├── components.py # NIRBand, SpectralComponent, ComponentLibrary
├── config.py # Configuration dataclasses
├── _constants.py # Predefined components (111 total), defaults
├── metadata.py # MetadataGenerator
├── targets.py # TargetGenerator (regression/classification)
├── sources.py # MultiSourceGenerator
├── exporter.py # DatasetExporter, CSVVariationGenerator
├── fitter.py # RealDataFitter
├── validation.py # Data validation utilities
│
├── # === Phase 1: Enhanced Component Generation ===
├── wavenumber.py # Wavenumber utilities, overtone/combination bands
├── procedural.py # ProceduralComponentGenerator
├── domains.py # Application domain priors (20 domains)
│
├── # === Phase 2: Realistic Instrument Simulation ===
├── instruments.py # InstrumentArchetype, InstrumentSimulator, multi-sensor
├── measurement_modes.py # MeasurementMode, measurement mode simulators
├── detectors.py # DetectorSimulator, detector response curves
│
├── # === Phase 3: Environmental and Matrix Effects ===
├── environmental.py # Temperature and moisture/water activity effects
└── scattering.py # Particle size effects, EMSC-style scattering
Physical Model
The generator implements a Beer-Lambert law based model:
A_i(λ) = L_i * Σ_k c_ik * ε_k(λ) + baseline_i(λ) + scatter_i(λ) + noise_i(λ)
Where:
c_ik: Concentration of component k in sample iε_k(λ): Molar absorptivity of component k (Voigt profiles)L_i: Optical path length factorbaseline: Polynomial baseline driftscatter: Multiplicative/additive scattering effectsnoise: Wavelength-dependent Gaussian noise
Peak Shapes
Absorption bands are modeled using Voigt profiles (convolution of Gaussian and Lorentzian):
from nirs4all.data.synthetic import NIRBand
band = NIRBand(
center=1450, # Peak center in nm
sigma=25, # Gaussian width (FWHM)
gamma=3, # Lorentzian width
amplitude=0.8, # Peak height
name="O-H 1st overtone"
)
Core Components
SyntheticNIRSGenerator
The main generation engine:
from nirs4all.data.synthetic import SyntheticNIRSGenerator
generator = SyntheticNIRSGenerator(
wavelength_start=1000,
wavelength_end=2500,
wavelength_step=2,
component_library=library, # Optional custom components
complexity="realistic",
random_state=42,
)
# Generate raw data
X, C, E = generator.generate(n_samples=1000)
# X: spectra (n_samples, n_wavelengths)
# C: concentrations (n_samples, n_components)
# E: pure component spectra (n_components, n_wavelengths)
# Generate with metadata
X, C, E, metadata = generator.generate(
n_samples=1000,
return_metadata=True,
include_batch_effects=True,
n_batches=3
)
ComponentLibrary
Manages collections of spectral components:
from nirs4all.data.synthetic import (
ComponentLibrary,
SpectralComponent,
NIRBand
)
# Create from predefined components
library = ComponentLibrary.from_predefined(
["water", "protein", "lipid"],
random_state=42
)
# Create custom library
library = ComponentLibrary(random_state=42)
# Add custom component
my_component = SpectralComponent(
name="my_compound",
bands=[
NIRBand(center=1500, sigma=20, gamma=2, amplitude=0.6),
NIRBand(center=2100, sigma=30, gamma=3, amplitude=0.8),
],
correlation_group=1 # Components with same group are correlated
)
library.add_component(my_component)
# Add random component
library.add_random_component(
name="unknown",
n_bands=4,
wavelength_range=(1000, 2500)
)
# Compute spectra
wavelengths = np.linspace(1000, 2500, 751)
spectra_matrix = library.compute_all(wavelengths)
# Shape: (n_components, n_wavelengths)
Concentration Methods
Different distributions for component concentrations:
# Dirichlet: Sum-to-one constraint (natural compositions)
C = generator.generate_concentrations(
n_samples=1000,
method="dirichlet",
alpha=np.array([1.0, 2.0, 1.5]) # Concentration weights
)
# Uniform: Independent uniform distributions
C = generator.generate_concentrations(
n_samples=1000,
method="uniform"
)
# Lognormal: Skewed distributions (biological data)
C = generator.generate_concentrations(
n_samples=1000,
method="lognormal"
)
# Correlated: With inter-component correlations
correlation_matrix = np.array([
[1.0, 0.7, -0.3],
[0.7, 1.0, 0.2],
[-0.3, 0.2, 1.0]
])
C = generator.generate_concentrations(
n_samples=1000,
method="correlated",
correlation_matrix=correlation_matrix
)
Extending the Generator
Creating Custom Components
from nirs4all.data.synthetic import SpectralComponent, NIRBand
# Define a pharmaceutical compound
aspirin = SpectralComponent(
name="aspirin",
bands=[
NIRBand(center=1520, sigma=15, gamma=1.5, amplitude=0.4, name="C-H aromatic"),
NIRBand(center=1680, sigma=20, gamma=2.0, amplitude=0.5, name="C=O"),
NIRBand(center=2020, sigma=25, gamma=2.5, amplitude=0.35, name="C-H combination"),
],
correlation_group=10 # Unique group for independent variation
)
# Use in generator
library = ComponentLibrary.from_predefined(["starch", "water"])
library.add_component(aspirin)
generator = SyntheticNIRSGenerator(
component_library=library,
complexity="realistic",
random_state=42
)
Custom Noise Models
The generator supports configurable noise through complexity parameters:
from nirs4all.data.synthetic._constants import COMPLEXITY_PARAMS
# Default complexity parameters
print(COMPLEXITY_PARAMS["realistic"])
# {
# 'noise_scale': 0.005,
# 'baseline_degree': 2,
# 'baseline_amplitude': 0.1,
# 'scatter_intensity': 0.02,
# ...
# }
# For custom noise, subclass or modify generator parameters
class CustomNoiseGenerator(SyntheticNIRSGenerator):
def _set_complexity_params(self):
super()._set_complexity_params()
# Override specific parameters
self.params['noise_scale'] = 0.01
self.params['noise_wavelength_dependence'] = True
Custom Target Transformations
from nirs4all.data.synthetic import TargetGenerator
class CustomTargetGenerator(TargetGenerator):
def custom_transform(self, concentrations, **kwargs):
"""Apply custom transformation to concentrations."""
# Example: ratio of two components
ratio = concentrations[:, 0] / (concentrations[:, 1] + 1e-6)
return np.clip(ratio, 0, 10)
Wavenumber Utilities (Phase 1)
The wavenumber module provides utilities for physically-correct band placement based on wavenumber (cm⁻¹) instead of wavelength (nm). This is critical because harmonic relationships (overtones, combinations) are linear in wavenumber space, not wavelength space.
Mathematical Background
λ (nm) = 10⁷ / ν̃ (cm⁻¹)
ν̃ (cm⁻¹) = 10⁷ / λ (nm)
For overtones with anharmonicity correction:
ν̃ₙ = n × ν̃₀ × (1 - n × χ)
Where:
nis the quantum number (1 = fundamental, 2 = 1st overtone, etc.)ν̃₀is the fundamental frequencyχis the anharmonicity constant (typically 0.01-0.03)
NIR Spectral Zones
The module defines 7 NIR spectral zones in wavenumber space:
Zone |
Wavenumber (cm⁻¹) |
Wavelength (nm) |
Description |
|---|---|---|---|
3rd overtones |
9000-12500 |
800-1111 |
Electronic transitions |
2nd overtones |
7500-9000 |
1111-1333 |
C-H, N-H 2nd overtones |
1st O-H/N-H |
6000-7500 |
1333-1667 |
O-H, N-H 1st overtones |
1st C-H |
5500-6250 |
1600-1818 |
C-H 1st overtones |
O-H comb |
5000-5500 |
1818-2000 |
O-H combination bands |
N-H comb |
4500-5200 |
1923-2222 |
N-H, C-H combinations |
C-H comb |
4000-4545 |
2200-2500 |
C-H, C-O combinations |
Fundamental Vibrations
22 fundamental vibrations are defined (in cm⁻¹):
from nirs4all.data.synthetic import FUNDAMENTAL_VIBRATIONS
# O-H vibrations
print(FUNDAMENTAL_VIBRATIONS["O-H_stretch_free"]) # 3650 cm⁻¹
print(FUNDAMENTAL_VIBRATIONS["O-H_stretch_hbond"]) # 3400 cm⁻¹
# C-H vibrations
print(FUNDAMENTAL_VIBRATIONS["C-H_stretch_CH3_asym"]) # 2960 cm⁻¹
print(FUNDAMENTAL_VIBRATIONS["C-H_stretch_aromatic"]) # 3060 cm⁻¹
# N-H vibrations
print(FUNDAMENTAL_VIBRATIONS["N-H_stretch_primary"]) # 3400 cm⁻¹
Overtone Calculation
Calculate overtone positions with anharmonicity correction:
from nirs4all.data.synthetic import calculate_overtone_position
# Calculate O-H 1st overtone (n=2 in spectroscopic convention)
result = calculate_overtone_position("O-H_stretch_free", 2)
print(f"O-H 1st overtone: {result.wavelength_nm:.1f} nm") # ~1433 nm
print(f"Wavenumber: {result.wavenumber_cm:.1f} cm⁻¹")
print(f"Amplitude factor: {result.amplitude_factor:.3f}")
# C-H overtone series
for n in [2, 3, 4]: # 1st, 2nd, 3rd overtones
result = calculate_overtone_position("C-H_stretch_CH3_asym", n)
print(f"C-H {n-1}st overtone: {result.wavelength_nm:.1f} nm")
# Use numeric frequency directly
result = calculate_overtone_position(3400, 2, anharmonicity=0.022)
Combination Band Calculation
Calculate combination band positions:
from nirs4all.data.synthetic import calculate_combination_band
# O-H stretch + bend combination
result = calculate_combination_band(["O-H_stretch_free", "O-H_bend"])
print(f"O-H combination: {result.wavelength_nm:.1f} nm") # ~1890 nm
# Multiple modes
result = calculate_combination_band(["C-H_stretch_CH3_asym", "C-H_bend"])
print(f"C-H combination: {result.wavelength_nm:.1f} nm")
Hydrogen Bonding Shift
Model hydrogen bonding effects on band positions:
from nirs4all.data.synthetic import apply_hydrogen_bonding_shift
# Free O-H stretch
free_oh = 3650 # cm⁻¹
# Apply H-bonding (shifts to lower wavenumber = longer wavelength)
bonded_oh = apply_hydrogen_bonding_shift(free_oh, strength=0.5)
print(f"H-bonded O-H: {bonded_oh:.0f} cm⁻¹") # ~3400 cm⁻¹
Zone Classification
Classify wavelengths into NIR zones:
from nirs4all.data.synthetic import classify_wavelength_zone
zone = classify_wavelength_zone(1450) # O-H 1st overtone region
print(f"1450 nm is in: {zone}")
zone = classify_wavelength_zone(2100) # Combination region
print(f"2100 nm is in: {zone}")
Procedural Component Generator (Phase 1)
The procedural module enables programmatic generation of chemically-plausible spectral components based on functional group properties.
Functional Group Types
10 functional group types are defined:
from nirs4all.data.synthetic import FunctionalGroupType
print(list(FunctionalGroupType))
# [HYDROXYL, AMINE, METHYL, METHYLENE, AROMATIC_CH,
# VINYL, CARBONYL, CARBOXYL, THIOL, WATER]
Each functional group has physical properties defined:
from nirs4all.data.synthetic import FUNCTIONAL_GROUP_PROPERTIES
props = FUNCTIONAL_GROUP_PROPERTIES[FunctionalGroupType.HYDROXYL]
print(f"Fundamental: {props['fundamental_cm']} cm⁻¹")
print(f"Bandwidth: {props['bandwidth_cm']} cm⁻¹")
print(f"H-bond susceptibility: {props['h_bond_susceptibility']}")
print(f"Typical amplitude: {props['typical_amplitude']}")
Procedural Configuration
from nirs4all.data.synthetic import ProceduralComponentConfig
config = ProceduralComponentConfig(
# Band generation
n_fundamental_bands=3, # Number of fundamental vibrations
include_overtones=True, # Generate overtone bands
max_overtone_order=3, # Up to 2nd overtone (n=3)
include_combinations=True, # Generate combination bands
max_combinations=3, # Max combination bands
# Environmental effects
h_bond_strength=0.5, # Average H-bonding (0-1)
h_bond_variability=0.2, # Random variation
# Anharmonicity
anharmonicity=0.02, # Default anharmonicity constant
anharmonicity_variability=0.005, # Random variation
# Band property variation
amplitude_variability=0.3,
bandwidth_variability=0.2,
# Wavelength range
wavelength_range=(900, 2500), # NIR region
# Specific functional groups (optional)
functional_groups=[
FunctionalGroupType.HYDROXYL,
FunctionalGroupType.METHYL,
],
)
Generating Components
from nirs4all.data.synthetic import ProceduralComponentGenerator
generator = ProceduralComponentGenerator(random_state=42)
# Generate a single component
component = generator.generate_component(
name="ethanol_like",
config=config,
)
print(f"Generated: {component.name}")
print(f"Bands: {len(component.bands)}")
for band in component.bands:
print(f" - {band.name}: {band.center:.1f} nm")
Specifying Functional Groups
# Generate from specific functional groups
component = generator.generate_component(
name="amine_compound",
functional_groups=[
FunctionalGroupType.AMINE,
FunctionalGroupType.METHYLENE,
],
)
# Generate sugar-like compound
sugar = generator.generate_component(
name="carbohydrate",
functional_groups=[
FunctionalGroupType.HYDROXYL, # Multiple O-H groups
FunctionalGroupType.HYDROXYL,
FunctionalGroupType.METHYLENE,
],
config=ProceduralComponentConfig(
h_bond_strength=0.7, # Strong H-bonding
),
)
Generating Variants
Create perturbed versions of existing components:
# Generate base component
base = generator.generate_component("base_compound")
# Create variants with small perturbations
for i in range(5):
variant = generator.generate_variant(
base,
perturbation_scale=0.1, # 10% variation
)
print(f"Variant {i+1}: {len(variant.bands)} bands")
Integration with ComponentLibrary
from nirs4all.data.synthetic import ComponentLibrary
# Create library with procedural + predefined components
library = ComponentLibrary(random_state=42)
# Add predefined components
predefined = ComponentLibrary.from_predefined(["water", "starch"])
for comp in predefined.components.values():
library.add_component(comp)
# Add procedurally generated components
proc_gen = ProceduralComponentGenerator(random_state=42)
for i in range(5):
component = proc_gen.generate_component(
name=f"novel_compound_{i}",
config=ProceduralComponentConfig(n_fundamental_bands=3),
)
library.add_component(component)
print(f"Library has {len(library.component_names)} components")
Application Domains (Phase 1)
The domains module provides domain-aware configuration for synthetic data generation, with 20 predefined application domains across 8 categories.
Domain Categories
from nirs4all.data.synthetic import DomainCategory
print(list(DomainCategory))
# [AGRICULTURE, FOOD, PHARMACEUTICAL, PETROCHEMICAL,
# ENVIRONMENTAL, BIOMEDICAL, INDUSTRIAL, RESEARCH]
Available Domains
Category |
Domains |
|---|---|
Agriculture |
|
Food |
|
Pharmaceutical |
|
Petrochemical |
|
Environmental |
|
Biomedical |
|
Industrial |
|
Getting Domain Configuration
from nirs4all.data.synthetic import get_domain_config, list_domains
# List all domains
all_domains = list_domains()
print(f"Available domains: {len(all_domains)}")
# List domains by category
ag_domains = list_domains(category=DomainCategory.AGRICULTURE)
print(f"Agriculture domains: {ag_domains}")
# Get specific domain configuration
grain = get_domain_config("agriculture_grain")
print(f"Domain: {grain.name}")
print(f"Category: {grain.category}")
print(f"Components: {grain.typical_components}")
print(f"Wavelength range: {grain.wavelength_range}")
print(f"Measurement mode: {grain.measurement_mode}")
Domain Configuration Details
Each domain includes:
from nirs4all.data.synthetic import DomainConfig
# DomainConfig fields:
# - name: Display name
# - category: DomainCategory enum
# - description: Detailed description
# - typical_components: List of component names
# - component_weights: Relative importance weights
# - concentration_priors: Statistical priors for concentrations
# - wavelength_range: (start, end) in nm
# - n_components_range: (min, max) components per sample
# - noise_level: "low", "medium", "high"
# - measurement_mode: "reflectance", "transmission", etc.
# - typical_sample_types: Example sample types
# - complexity: "simple", "realistic", "complex"
Concentration Priors
Domains include realistic concentration priors:
from nirs4all.data.synthetic import ConcentrationPrior
import numpy as np
# Get grain domain priors
grain = get_domain_config("agriculture_grain")
# Starch typically 50-80% in grains
starch_prior = grain.concentration_priors.get("starch")
if starch_prior:
print(f"Starch prior: {starch_prior.distribution}")
print(f" Mean: ~65%, Range: 30-80%")
# Sample from priors
rng = np.random.default_rng(42)
concentrations = grain.sample_concentrations(n_samples=100, rng=rng)
print(f"Sampled starch mean: {concentrations.get('starch', []).mean():.2f}")
Domain-Aware Library Creation
The create_domain_aware_library function samples components and concentrations based on domain priors:
from nirs4all.data.synthetic import create_domain_aware_library
# Returns (component_names, concentration_matrix)
components, concentrations = create_domain_aware_library(
"agriculture_grain",
n_samples=100,
random_state=42
)
print(f"Components: {components}")
print(f"Concentrations shape: {concentrations.shape}")
Using Domains with Generator
To use domain configurations with SyntheticNIRSGenerator, create a ComponentLibrary from the domain’s typical components:
from nirs4all.data.synthetic import (
SyntheticNIRSGenerator,
ComponentLibrary,
get_domain_config,
)
# Get domain config
domain = get_domain_config("agriculture_grain")
# Create ComponentLibrary from domain's typical components
library = ComponentLibrary.from_predefined(domain.typical_components[:5])
# Configure generator with domain parameters
generator = SyntheticNIRSGenerator(
component_library=library,
wavelength_start=domain.wavelength_range[0],
wavelength_end=domain.wavelength_range[1],
complexity="realistic",
random_state=42,
)
# Generate domain-appropriate data
X, C, *rest = generator.generate(n_samples=500)
print(f"Generated {X.shape[0]} spectra for grain analysis")
Getting Domain Components
from nirs4all.data.synthetic import get_domain_components, get_domains_for_component
# Get typical components for a domain
components = get_domain_components("food_dairy")
print(f"Dairy components: {components}")
# Find which domains use a specific component
domains = get_domains_for_component("protein")
print(f"Domains using protein: {domains}")
Extended Component Library (Phase 1)
The predefined component library has been extended from 31 to 111 components covering diverse application areas.
Component Categories
Category |
Count |
Examples |
|---|---|---|
Carbohydrates |
15 |
starch, cellulose, glucose, maltose, raffinose, inulin |
Proteins |
10 |
protein, casein, gluten, albumin, collagen, keratin |
Lipids |
12 |
lipid, oil, oleic_acid, linoleic_acid, phospholipid |
Alcohols |
8 |
ethanol, methanol, glycerol, sorbitol, mannitol |
Organic Acids |
10 |
acetic_acid, citric_acid, ascorbic_acid, formic_acid |
Pharmaceuticals |
10 |
aspirin, paracetamol, ibuprofen, metformin |
Polymers |
8 |
polyethylene, polystyrene, pmma, pet, abs |
Minerals |
5 |
kaolinite, montmorillonite, silica, talc |
Pigments |
7 |
chlorophyll, carotenoid, anthocyanin, lycopene |
Solvents |
7 |
acetone, dmso, ethyl_acetate, toluene |
Other |
19 |
water, aromatic, alkane, etc. |
Listing Components
from nirs4all.data.synthetic._constants import get_predefined_components
components = get_predefined_components()
print(f"Total components: {len(components)}")
# List all component names
for name in sorted(components.keys()):
print(f" - {name}")
Using Extended Components
from nirs4all.data.synthetic import ComponentLibrary
# Pharmaceutical formulation
pharma = ComponentLibrary.from_predefined([
"ibuprofen", "microcrystalline_cellulose", "starch", "water"
])
# Food analysis
food = ComponentLibrary.from_predefined([
"protein", "lipid", "starch", "glucose", "fructose", "moisture"
])
# Polymer analysis
polymers = ComponentLibrary.from_predefined([
"polyethylene", "polypropylene", "pet", "abs", "pmma"
])
# Soil analysis
soil = ComponentLibrary.from_predefined([
"kaolinite", "montmorillonite", "silica", "cellulose", "lignin"
])
Instrument Simulation (Phase 2)
Phase 2 introduces realistic instrument simulation with 19 predefined instrument archetypes, multi-sensor stitching, multi-scan averaging, and detector response modeling.
Instrument Archetypes
The module provides pre-configured models of real NIR instruments:
from nirs4all.data.synthetic import (
get_instrument_archetype,
list_instrument_archetypes,
get_instruments_by_category,
InstrumentCategory,
)
# List all available instruments (19 total)
instruments = list_instrument_archetypes()
print(f"Available: {instruments}")
# Get by category
by_category = get_instruments_by_category()
for cat, names in by_category.items():
print(f"{cat}: {names}")
# Get specific instrument
foss_xds = get_instrument_archetype("foss_xds")
print(f"FOSS XDS range: {foss_xds.wavelength_range}")
print(f"Resolution: {foss_xds.spectral_resolution} nm")
print(f"SNR: {foss_xds.snr}")
Available Instruments
Category |
Instruments |
|---|---|
Benchtop |
|
FT-NIR |
|
Handheld |
|
Process |
|
Embedded |
|
Filter |
|
Diode Array |
|
InstrumentArchetype Properties
Each archetype includes complete optical and electronic specifications:
from nirs4all.data.synthetic import (
InstrumentArchetype,
InstrumentCategory,
DetectorType,
MonochromatorType,
)
# Create custom instrument
custom = InstrumentArchetype(
name="my_instrument",
category=InstrumentCategory.BENCHTOP,
detector_type=DetectorType.INGAAS,
monochromator_type=MonochromatorType.GRATING,
wavelength_range=(900, 1700),
spectral_resolution=4.0, # nm FWHM
wavelength_accuracy=0.1, # nm
photometric_noise=0.0001, # AU
photometric_range=(0.0, 3.0), # AU min/max
snr=20000, # Signal-to-noise ratio
stray_light=0.0001, # Fraction
warm_up_drift=0.1, # %/hour
temperature_sensitivity=0.01, # nm/°C
scan_speed=5.0, # Scans/second
integration_time_ms=50.0, # ms
description="Custom research instrument",
)
Multi-Sensor Configuration
Many NIR instruments use multiple detectors to cover wide wavelength ranges, then stitch the signals together:
from nirs4all.data.synthetic import (
MultiSensorConfig,
SensorConfig,
DetectorType,
)
# Configure multi-sensor system (e.g., FOSS XDS with Si + PbS)
multi_sensor = MultiSensorConfig(
enabled=True,
sensors=[
SensorConfig(
detector_type=DetectorType.SI,
wavelength_range=(400, 1100),
spectral_resolution=0.5,
noise_level=0.8,
gain=1.0,
overlap_range=20.0, # nm overlap for stitching
),
SensorConfig(
detector_type=DetectorType.PBS,
wavelength_range=(1100, 2500),
spectral_resolution=0.5,
noise_level=1.2,
),
],
stitch_method="weighted", # weighted, average, first, last, optimal
stitch_smoothing=10.0, # nm smoothing at boundaries
add_stitch_artifacts=True, # Simulate stitching artifacts
artifact_intensity=0.01, # Artifact strength (0-1)
)
Stitch Methods
Method |
Description |
|---|---|
|
Linear blend based on distance from boundary |
|
Simple average in overlap region |
|
Use first sensor’s data in overlap |
|
Use second sensor’s data in overlap |
|
SNR-weighted combination |
Multi-Scan Averaging
Real instruments take multiple scans and average them to reduce noise:
from nirs4all.data.synthetic import MultiScanConfig
multi_scan = MultiScanConfig(
enabled=True,
n_scans=32, # Number of scans to average
averaging_method="mean", # mean, median, weighted, savgol
scan_to_scan_noise=0.001, # Additional noise between scans
wavelength_jitter=0.05, # nm shift between scans
discard_outliers=True, # Remove outlier scans
outlier_threshold=2.5, # Z-score threshold
)
Averaging Methods
Method |
Description |
Use Case |
|---|---|---|
|
Arithmetic mean |
Standard averaging |
|
Median filter |
Robust to outliers |
|
SNR-weighted average |
Variable quality scans |
|
Savitzky-Golay smoothing |
Spectral smoothing |
Using Instruments with the Generator
from nirs4all.data.synthetic import SyntheticNIRSGenerator
# Generate with a predefined instrument
gen = SyntheticNIRSGenerator(
wavelength_start=1000,
wavelength_end=2500,
instrument="foss_xds", # Apply FOSS XDS effects
random_state=42,
)
X, C, pure, metadata = gen.generate(n_samples=100, return_metadata=True)
print(f"Multi-sensor applied: {metadata.get('multi_sensor')}")
print(f"Multi-scan applied: {metadata.get('multi_scan')}")
# Generate with custom multi-sensor/multi-scan config
gen = SyntheticNIRSGenerator(
wavelength_start=400,
wavelength_end=2500,
multi_sensor_config=multi_sensor,
multi_scan_config=multi_scan,
random_state=42,
)
InstrumentSimulator
For fine-grained control, use the simulator directly:
from nirs4all.data.synthetic import InstrumentSimulator, get_instrument_archetype
import numpy as np
# Get archetype and create simulator
archetype = get_instrument_archetype("bruker_mpa")
simulator = InstrumentSimulator(archetype, random_state=42)
# Apply to existing spectra
spectra = np.random.default_rng(42).normal(0.5, 0.1, (100, 500))
wavelengths = np.linspace(1000, 2500, 500)
result, output_wl = simulator.apply(
spectra,
wavelengths,
temperature_offset=2.0, # °C deviation from calibration
)
print(f"Output shape: {result.shape}")
print(f"Output range: {output_wl.min():.0f}-{output_wl.max():.0f} nm")
Detector Response Curves
The module includes realistic detector spectral response curves:
from nirs4all.data.synthetic import (
get_detector_response,
list_detector_types,
DetectorType,
DetectorSimulator,
)
import numpy as np
# Available detectors
print(f"Detector types: {list_detector_types()}")
# ['si', 'ingaas', 'ingaas_ext', 'pbs', 'pbse', 'mems', 'mct']
# Get response curve
wavelengths = np.linspace(800, 2500, 1000)
ingaas = get_detector_response(DetectorType.INGAAS)
print(f"InGaAs sensitivity: {ingaas.short_cutoff}-{ingaas.cutoff_wavelength} nm")
print(f"Peak wavelength: {ingaas.peak_wavelength} nm")
print(f"Peak QE: {ingaas.peak_qe}")
# Get responsivity at specific wavelengths
responsivity = ingaas.get_response_at(wavelengths)
Detector Specifications
Type |
Range (nm) |
Peak (nm) |
Peak QE |
Use Case |
|---|---|---|---|---|
Si |
350-1100 |
850 |
0.85 |
Visible/short-NIR |
InGaAs |
850-1700 |
1300 |
0.92 |
Standard NIR |
InGaAs Ext |
850-2500 |
1700 |
0.85 |
Extended NIR |
PbS |
1000-3000 |
2200 |
0.85 |
Long-NIR |
PbSe |
1500-5000 |
3500 |
0.75 |
MIR |
MEMS |
900-1700 |
1300 |
0.60 |
Miniature/embedded |
MCT |
1000-14000 |
5000 |
0.70 |
Cooled high-perf |
Detector Noise Models
Each detector type has characteristic noise properties:
from nirs4all.data.synthetic import (
DetectorSimulator,
DetectorConfig,
NoiseModelConfig,
DetectorType,
)
# Configure detector noise
noise_config = NoiseModelConfig(
shot_noise_factor=1.0, # Signal-dependent (√signal)
thermal_noise_factor=0.5, # Temperature-dependent
read_noise_factor=0.3, # Constant per-pixel
flicker_noise_factor=0.1, # 1/f noise
dark_current=0.001, # Counts/ms
integration_time_ms=50.0,
)
detector_config = DetectorConfig(
detector_type=DetectorType.INGAAS,
gain=1.0,
nonlinearity=0.01, # Nonlinearity coefficient
noise=noise_config,
)
# Apply detector effects
simulator = DetectorSimulator(detector_config, random_state=42)
processed = simulator.apply(spectra, wavelengths, add_noise=True)
Measurement Modes
Different sampling geometries produce different spectral characteristics:
from nirs4all.data.synthetic import (
MeasurementModeSimulator,
MeasurementModeConfig,
MeasurementMode,
TransmittanceConfig,
ReflectanceConfig,
ATRConfig,
)
# Transmittance (liquid samples)
trans_config = MeasurementModeConfig(
mode=MeasurementMode.TRANSMITTANCE,
transmittance=TransmittanceConfig(
path_length_mm=10.0,
path_length_variation=0.02,
cuvette_material="quartz",
),
)
sim = MeasurementModeSimulator(config=trans_config, random_state=42)
processed = sim.apply(spectra, wavelengths)
# Diffuse reflectance (powder samples)
refl_config = MeasurementModeConfig(
mode=MeasurementMode.REFLECTANCE,
reflectance=ReflectanceConfig(
geometry="integrating_sphere",
sample_presentation="powder",
illumination_angle=0.0,
collection_angle=45.0,
),
)
# ATR (solids, pastes)
atr_config = MeasurementModeConfig(
mode=MeasurementMode.ATR,
atr=ATRConfig(
crystal_material="diamond",
crystal_refractive_index=2.4,
incidence_angle=45.0,
n_reflections=1,
),
)
Available Measurement Modes
Mode |
Description |
Typical Samples |
|---|---|---|
|
Light passes through sample |
Liquids, thin films |
|
Diffuse or specular reflection |
Powders, solids |
|
Transmission + reflection |
Liquids with mirror |
|
Attenuated total reflectance |
Pastes, thick liquids |
|
Fiber probe immersed |
Suspensions |
|
Remote fiber measurement |
Process monitoring |
Environmental Effects (Phase 3)
Phase 3 introduces simulation of environmental and matrix effects that affect NIR spectra in real-world measurements. These effects are critical for generating realistic synthetic data that captures the variability seen in practical applications.
Temperature Effects
Temperature affects NIR spectra through peak shifts, intensity changes, and band broadening. The environmental.py module provides comprehensive temperature simulation:
from nirs4all.data.synthetic import (
TemperatureEffectSimulator,
TemperatureConfig,
TemperatureEffectParams,
SpectralRegion,
)
import numpy as np
# Configure temperature effects
temp_config = TemperatureConfig(
sample_temperature=35.0, # Sample temperature (°C)
reference_temperature=25.0, # Reference/calibration temperature
temperature_variation=2.0, # Sample-to-sample variation (σ)
enable_wavelength_shift=True, # Enable peak position shifts
enable_intensity_change=True, # Enable intensity changes
enable_broadening=True, # Enable thermal broadening
)
# Create simulator
simulator = TemperatureEffectSimulator(temp_config, random_state=42)
# Apply to spectra
wavelengths = np.linspace(1000, 2500, 751)
spectra = np.random.randn(100, 751) * 0.1 + 0.5 # Example spectra
# Generate per-sample temperatures
sample_temps = simulator.sample_temperatures(n_samples=100)
# Apply effects
modified_spectra = simulator.apply(spectra, wavelengths, sample_temperatures=sample_temps)
Spectral Regions
The module defines 8 spectral regions with distinct temperature responses:
Region |
Wavelength (nm) |
Shift (nm/°C) |
Intensity (%/°C) |
Reference |
|---|---|---|---|---|
O-H 1st overtone |
1400-1520 |
-0.30 |
-0.20 |
Maeda et al. (1995) |
O-H combination |
1900-2000 |
-0.40 |
-0.30 |
Segtnan et al. (2001) |
C-H 1st overtone |
1650-1780 |
-0.05 |
-0.05 |
Literature estimate |
C-H combination |
2200-2400 |
-0.08 |
-0.08 |
Literature estimate |
N-H 1st overtone |
1500-1550 |
-0.20 |
-0.15 |
Literature estimate |
N-H combination |
2000-2100 |
-0.25 |
-0.20 |
Literature estimate |
Free water |
N/A |
-0.35 |
-0.25 |
Luck (1998) |
Bound water |
N/A |
-0.20 |
-0.35 |
Büning-Pfaue (2003) |
Custom Temperature Parameters
from nirs4all.data.synthetic import TemperatureEffectParams, SpectralRegion
# Define custom parameters for a specific region
custom_params = TemperatureEffectParams(
wavelength_range=(1400, 1520),
shift_per_degree=-0.30, # nm/°C (negative = blue shift)
intensity_change_per_degree=-0.002, # fraction/°C
broadening_per_degree=0.001, # fraction/°C
reference="Custom application"
)
# Use in configuration
temp_config = TemperatureConfig(
sample_temperature=40.0,
custom_params={SpectralRegion.OH_FIRST_OVERTONE: custom_params}
)
Moisture/Water Activity Effects
Water content and activity affect hydrogen bonding, which in turn modifies water-related bands:
from nirs4all.data.synthetic import (
MoistureEffectSimulator,
MoistureConfig,
)
# Configure moisture effects
moisture_config = MoistureConfig(
water_activity=0.6, # Water activity (0-1)
moisture_content=0.15, # Fractional moisture content
free_water_fraction=0.3, # Fraction of water that is free
bound_water_shift=20.0, # nm shift for bound water
enable_band_shift=True,
enable_intensity_modulation=True,
)
# Create simulator
moisture_sim = MoistureEffectSimulator(moisture_config, random_state=42)
# Apply effects
modified_spectra = moisture_sim.apply(spectra, wavelengths)
Water Activity Effects
Water Activity |
Effect on Spectra |
|---|---|
Low (< 0.3) |
Bound water dominates, shifted O-H bands |
Medium (0.3-0.7) |
Mixed free/bound water, broadened bands |
High (> 0.7) |
Free water dominates, sharp O-H bands |
Combined Environmental Effects
The EnvironmentalEffectsSimulator combines all environmental effects:
from nirs4all.data.synthetic import (
EnvironmentalEffectsSimulator,
EnvironmentalEffectsConfig,
TemperatureConfig,
MoistureConfig,
)
# Combine temperature and moisture
env_config = EnvironmentalEffectsConfig(
temperature=TemperatureConfig(
sample_temperature=35.0,
temperature_variation=3.0,
),
moisture=MoistureConfig(
water_activity=0.65,
moisture_content=0.12,
),
)
# Create combined simulator
env_simulator = EnvironmentalEffectsSimulator(env_config, random_state=42)
# Apply all effects
modified_spectra = env_simulator.apply(
spectra,
wavelengths,
sample_temperatures=temps, # Optional: explicit temperatures
)
Scattering Effects (Phase 3)
The scattering.py module simulates light scattering effects, which are critical for diffuse reflectance spectra. Rather than implementing full Mie theory, this module uses empirical EMSC-style models that approximate real-world scattering distortions.
Particle Size Effects
Particle size strongly affects NIR spectra through scattering:
from nirs4all.data.synthetic import (
ParticleSizeSimulator,
ParticleSizeConfig,
ParticleSizeDistribution,
)
# Configure particle size distribution
psd = ParticleSizeDistribution(
mean_size_um=50.0, # Mean size in micrometers
std_size_um=15.0, # Standard deviation
min_size_um=5.0, # Minimum (truncation)
max_size_um=200.0, # Maximum (truncation)
distribution="lognormal", # lognormal, normal, or uniform
)
# Configure particle size effects
ps_config = ParticleSizeConfig(
distribution=psd,
reference_size_um=50.0, # Reference size for calibration
enable_baseline_shift=True,
enable_multiplicative_scatter=True,
enable_wavelength_dependence=True,
)
# Create simulator
ps_simulator = ParticleSizeSimulator(ps_config, random_state=42)
# Sample particle sizes for each sample
sizes = psd.sample(n_samples=100, rng=np.random.default_rng(42))
# Apply effects
modified_spectra = ps_simulator.apply(spectra, wavelengths, particle_sizes=sizes)
Particle Size Regimes
Size Regime |
Particle Size |
Scattering Behavior |
λ Dependence |
|---|---|---|---|
Small |
< 10 μm |
Rayleigh-like |
λ⁻⁴ |
Medium |
10-100 μm |
Mixed Mie |
λ⁻¹ to λ⁻² |
Large |
> 100 μm |
Geometrical |
Weak |
EMSC-Style Transformations
Extended Multiplicative Scatter Correction (EMSC) models are used to simulate realistic scattering distortions:
from nirs4all.data.synthetic import (
EMSCTransformSimulator,
EMSCConfig,
)
# Configure EMSC-style effects
emsc_config = EMSCConfig(
baseline_offset_range=(-0.1, 0.1), # Additive offset
multiplicative_range=(0.8, 1.2), # Multiplicative scaling
linear_slope_range=(-0.0001, 0.0001), # Linear baseline
quadratic_range=(-1e-7, 1e-7), # Quadratic baseline
inverse_lambda_range=(0.0, 0.01), # 1/λ scattering term
sample_variation=0.3, # Sample-to-sample variation
)
# Create simulator
emsc_simulator = EMSCTransformSimulator(emsc_config, random_state=42)
# Apply EMSC-style distortions
distorted_spectra = emsc_simulator.apply(spectra, wavelengths)
EMSC Model
The EMSC model transforms spectra as:
A_observed = a + b × A_pure + c × λ + d × λ² + e × (1/λ)
Where:
a: Baseline offset (scattering-induced)b: Multiplicative scaling (path length variation)c: Linear wavelength dependenced: Quadratic term (curvature)e: Inverse wavelength term (Rayleigh-like scattering)
Scattering Coefficient Generation
For Kubelka-Munk reflectance simulation, generate realistic scattering coefficients:
from nirs4all.data.synthetic import (
ScatteringCoefficientGenerator,
ScatteringCoefficientConfig,
)
# Configure scattering coefficient generation
scat_config = ScatteringCoefficientConfig(
baseline_scattering=1.0, # S₀ reference value
wavelength_reference_nm=1700.0, # Reference wavelength
wavelength_exponent=1.5, # λ dependence exponent
particle_size_um=50.0, # Affects wavelength exponent
)
# Generate scattering coefficients
generator = ScatteringCoefficientGenerator(scat_config, random_state=42)
S_lambda = generator.generate(wavelengths)
# Use with Kubelka-Munk
# R∞ = 1 + K/S - √((K/S)² + 2K/S)
Wavelength Exponent vs Particle Size
Particle Size |
Wavelength Exponent b |
Model |
|---|---|---|
< 10 μm |
2-4 |
Rayleigh (λ⁻⁴) |
10-50 μm |
1-2 |
Mixed |
50-100 μm |
0.5-1 |
Mie |
> 100 μm |
0-0.5 |
Geometrical |
Combined Scattering Effects
The ScatteringEffectsSimulator combines all scattering-related effects:
from nirs4all.data.synthetic import (
ScatteringEffectsSimulator,
ScatteringEffectsConfig,
ParticleSizeConfig,
EMSCConfig,
ScatteringCoefficientConfig,
)
# Combine all scattering effects
scatter_config = ScatteringEffectsConfig(
particle_size=ParticleSizeConfig(
distribution=ParticleSizeDistribution(mean_size_um=50.0),
),
emsc=EMSCConfig(
multiplicative_range=(0.85, 1.15),
),
scattering_coefficient=ScatteringCoefficientConfig(
baseline_scattering=1.0,
),
)
# Create combined simulator
scatter_simulator = ScatteringEffectsSimulator(scatter_config, random_state=42)
# Apply all scattering effects
modified_spectra = scatter_simulator.apply(
spectra,
wavelengths,
particle_sizes=sizes, # Optional: explicit sizes
)
Integration with Generator
Phase 3 effects are integrated into the main SyntheticNIRSGenerator:
from nirs4all.data.synthetic import (
SyntheticNIRSGenerator,
EnvironmentalEffectsConfig,
ScatteringEffectsConfig,
TemperatureConfig,
ParticleSizeConfig,
)
# Configure environmental and scattering effects
env_config = EnvironmentalEffectsConfig(
temperature=TemperatureConfig(
sample_temperature=30.0,
temperature_variation=5.0,
),
)
scatter_config = ScatteringEffectsConfig(
particle_size=ParticleSizeConfig(
distribution=ParticleSizeDistribution(mean_size_um=75.0),
),
)
# Create generator with Phase 3 effects
generator = SyntheticNIRSGenerator(
wavelength_start=1000,
wavelength_end=2500,
complexity="realistic",
environmental_config=env_config,
scattering_effects_config=scatter_config,
random_state=42,
)
# Generate with effects
X, C, E = generator.generate(
n_samples=500,
include_environmental_effects=True,
include_scattering_effects=True,
)
Metadata System
The MetadataGenerator creates realistic sample metadata:
from nirs4all.data.synthetic import MetadataGenerator
metadata_gen = MetadataGenerator(random_state=42)
result = metadata_gen.generate(
n_samples=100,
sample_id_prefix="wheat",
n_groups=5, # For GroupKFold
group_names=["Field_A", "Field_B", "Field_C", "Field_D", "Field_E"],
n_repetitions=(2, 4), # Variable repetitions per sample
)
# Access generated metadata
print(result.sample_ids) # ['wheat_001', 'wheat_002', ...]
print(result.groups) # ['Field_A', 'Field_A', 'Field_B', ...]
print(result.biological_samples) # Sample indices for repetitions
print(result.repetition_counts) # [3, 2, 4, ...]
Classification Generation
The TargetGenerator creates separable classes:
from nirs4all.data.synthetic import TargetGenerator, ClassSeparationConfig
target_gen = TargetGenerator(random_state=42)
# Generate classification targets
y = target_gen.classification(
n_samples=1000,
concentrations=C, # From generator
n_classes=3,
class_weights=[0.4, 0.4, 0.2], # Class proportions
separation=2.0, # Higher = more separable
separation_method="component" # How to create class differences
)
# Separation methods:
# - "component": Different concentration profiles per class
# - "threshold": Classes based on concentration thresholds
# - "cluster": K-means-like assignment
Non-Linear Target Complexity
The NonLinearTargetProcessor creates challenging, realistic targets that require
non-linear models to predict well. It implements three strategies:
Architecture
from nirs4all.data.synthetic.targets import (
NonLinearTargetProcessor,
NonLinearTargetConfig
)
config = NonLinearTargetConfig(
# Proposition 1: Non-linear interactions
nonlinear_interactions="polynomial", # none, polynomial, synergistic, antagonistic
interaction_strength=0.5, # 0=linear, 1=fully non-linear
hidden_factors=2, # Latent variables not in spectra
polynomial_degree=2, # 2 or 3
# Proposition 2: Confounders
signal_to_confound_ratio=0.7, # 1.0=fully predictable
n_confounders=2, # Confounding variables
spectral_masking=0.0, # Future: hide signal in noisy regions
temporal_drift=True, # Relationship changes over samples
# Proposition 3: Multi-regime
n_regimes=3, # Different relationship regimes
regime_method="concentration", # concentration, spectral, random
regime_overlap=0.2, # Transition zone smoothness
noise_heteroscedasticity=0.5, # Noise varies by regime
)
processor = NonLinearTargetProcessor(config, random_state=42)
y_complex = processor.process(
concentrations=C,
y_base=y_linear,
spectra=X # Optional, for spectral-based regime assignment
)
Interaction Types
Type |
Formula |
Use Case |
|---|---|---|
|
y = f(C₁², C₁×C₂, …) |
General non-linearity |
|
y = f(√(C₁×C₂) × (C₁+C₂)) |
Chemical synergies |
|
y = Vmax×C/(Km+C) × inhibition |
Michaelis-Menten kinetics |
Multi-Regime Landscapes
Different regions of the feature space have different target-spectra relationships:
# Regime 0: y ∝ C₁ (linear)
# Regime 1: y ∝ C₁² (quadratic)
# Regime 2: y ∝ max(y) - y (inverse)
# Regime 3+: y ∝ C₁/C₂ (ratio)
config = NonLinearTargetConfig(
n_regimes=4,
regime_method="concentration", # Partition by concentration space
regime_overlap=0.2, # Smooth transitions
noise_heteroscedasticity=0.5 # Different noise per regime
)
Extending the Processor
class CustomTargetProcessor(NonLinearTargetProcessor):
def _apply_nonlinear_interactions(self, C, y):
# Custom non-linear transformation
custom_term = np.sin(C[:, 0] * np.pi) * np.exp(-C[:, 1])
strength = self.config.interaction_strength
return (1 - strength) * y + strength * custom_term.reshape(-1, 1)
Multi-Source Generation
Generate datasets with multiple data types:
from nirs4all.data.synthetic import MultiSourceGenerator, SourceConfig
generator = MultiSourceGenerator(random_state=42)
result = generator.generate(
n_samples=500,
sources=[
SourceConfig(
name="NIR",
source_type="nir",
wavelength_range=(1000, 2500),
complexity="realistic",
components=["water", "protein", "lipid"]
),
SourceConfig(
name="VIS",
source_type="nir",
wavelength_range=(400, 1000),
complexity="simple"
),
SourceConfig(
name="markers",
source_type="aux",
n_features=10
)
]
)
# Access individual sources
X_nir = result.sources["NIR"]
X_vis = result.sources["VIS"]
X_markers = result.sources["markers"]
# Get combined features
X_all = result.get_combined_features()
Exporter System
Export synthetic data to various formats:
from nirs4all.data.synthetic import DatasetExporter, CSVVariationGenerator
exporter = DatasetExporter()
# Standard format (Xcal, Ycal, Xval, Yval)
path = exporter.to_folder(
"output/dataset",
X, y,
train_ratio=0.8,
wavelengths=wavelengths,
format="standard"
)
# Single file format
path = exporter.to_folder(
"output/dataset",
X, y,
format="single"
)
# For testing CSV loaders with format variations
csv_gen = CSVVariationGenerator()
variations = csv_gen.generate_variations(
X, y, wavelengths,
output_dir="test_data",
variations=[
"comma_delimiter",
"tab_delimiter",
"no_headers",
"european_decimals"
]
)
Real Data Fitting
Analyze real data and generate similar synthetic data:
from nirs4all.data.synthetic import RealDataFitter, compute_spectral_properties
# Compute properties of real data
props = compute_spectral_properties(
X_real,
wavelengths=wavelengths,
name="real_wheat"
)
print(f"Global mean: {props.global_mean:.4f}")
print(f"Noise estimate: {props.noise_estimate:.6f}")
print(f"PCA components for 95% var: {props.pca_n_components_95}")
# Fit generator parameters
fitter = RealDataFitter()
params = fitter.fit(X_real, wavelengths=wavelengths)
print(f"Fitted wavelength range: {params.wavelength_start}-{params.wavelength_end}")
print(f"Recommended complexity: {params.complexity}")
# Generate similar data
X_synth = fitter.generate(n_samples=1000, random_state=42)
Validation System
Validate generated data quality:
from nirs4all.data.synthetic import (
validate_spectra,
validate_wavelengths,
validate_concentrations,
ValidationError
)
try:
validate_spectra(X, wavelengths)
validate_concentrations(C)
validate_wavelengths(wavelengths)
except ValidationError as e:
print(f"Validation failed: {e}")
Integration with Test Fixtures
The synthetic module provides pytest fixtures for testing:
# In tests/conftest.py (already configured)
def test_my_feature(standard_regression_dataset):
"""Use session-scoped synthetic dataset."""
X = standard_regression_dataset.x({"partition": "train"})
y = standard_regression_dataset.y({"partition": "train"})
# ... test logic
def test_with_custom_data(synthetic_builder_factory):
"""Use factory to create custom dataset."""
dataset = (
synthetic_builder_factory(n_samples=100)
.with_features(complexity="simple")
.with_classification(n_classes=3)
.build()
)
# ... test logic
def test_loader(synthetic_dataset_folder):
"""Test with temporary CSV files."""
from nirs4all.data import DatasetConfigs
dataset = DatasetConfigs(str(synthetic_dataset_folder))
# ... test logic
Performance Considerations
Operation |
Samples |
Time |
|---|---|---|
Generation (simple) |
10,000 |
~0.3s |
Generation (realistic) |
10,000 |
~0.5s |
Generation (complex) |
10,000 |
~1.0s |
Metadata generation |
10,000 |
~0.05s |
Export to CSV |
10,000 |
~0.5s |
Real data fitting |
N/A |
~0.2s |
Tips:
Use
complexity="simple"for unit testsReuse generators with same random_state
Use session-scoped fixtures for expensive datasets
Generate once, clone for modifications
Predefined Spectral Components
The synthetic generator includes 48 predefined spectral components with band assignments based on published NIR spectroscopy literature. Each component’s absorption bands are modeled using Voigt profiles with accurate wavelength positions, widths, and relative intensities.
Available Components
Water and Moisture
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
Free water (H₂O) |
1450, 1940, 2500 |
[1] pp. 34-36 |
|
Bound water in matrices |
1460, 1930 |
[2] pp. 358-362 |
Proteins and Nitrogen Compounds
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
General protein (amide, N-H) |
1510, 1680, 2050, 2180, 2300 |
[1] pp. 48-52 |
|
Primary/secondary amines |
1500, 2060, 2150 |
[1] pp. 52-54 |
|
CO(NH₂)₂ |
1480, 1530, 2010, 2170 |
[9] p. 1125 |
|
Free amino acids |
1520, 2040, 2260 |
[3] pp. 215-220 |
|
Milk protein |
1510, 1680, 2050, 2180 |
[4] pp. 85-88 |
|
Wheat protein complex |
1505, 1680, 2050, 2180, 2290 |
[5] pp. 155-160 |
Lipids and Hydrocarbons
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
Triglycerides (C-H) |
1210, 1390, 1720, 2310, 2350 |
[1] pp. 44-48 |
|
Vegetable/mineral oils |
1165, 1215, 1410, 1725, 2140, 2305 |
[4] pp. 67-72 |
|
Saturated fatty acids |
1195, 1395, 1730, 2315, 2355 |
[7] pp. 15-20 |
|
Mono/polyunsaturated fats |
1160, 1400, 1720, 2145, 2175 |
[7] pp. 20-25 |
|
Cuticular waxes |
1190, 1720, 2310, 2350 |
[7] pp. 15-20 |
|
Benzene derivatives |
1145, 1685, 2150, 2440 |
[1] pp. 56-58 |
|
Saturated hydrocarbons |
1190, 1715, 2310, 2360 |
[7] pp. 10-15 |
Carbohydrates
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
Amylose/amylopectin |
1460, 1580, 2100, 2270 |
[5] pp. 155-160 |
|
β-1,4-glucan chains |
1490, 1780, 2090, 2280, 2340 |
[6] pp. 295-300 |
|
D-glucose |
1440, 1690, 2080, 2270 |
[2] pp. 368-370 |
|
D-fructose |
1430, 1695, 2070, 2260 |
[2] pp. 368-370 |
|
Disaccharide |
1435, 1685, 2075, 2265 |
[2] pp. 370-372 |
|
Milk sugar |
1450, 1690, 1940, 2100, 2270 |
[12] |
|
Xylan/glucomannan |
1470, 1760, 2085, 2250 |
[6] pp. 300-303 |
|
Aromatic polymer |
1140, 1420, 1670, 2130, 2270 |
[6] pp. 303-305 |
|
Plant cell wall material |
1490, 1770, 2090, 2275, 2340 |
[6], [5] |
Alcohols and Polyols
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
C₂H₅OH |
1410, 1580, 1695, 2050, 2290 |
[1] pp. 38-40 |
|
CH₃OH |
1400, 1545, 1705, 2040 |
[1] pp. 38-40 |
|
Polyol (fermentation) |
1450, 1580, 1700, 2060, 2280 |
[11] |
Organic Acids
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
CH₃COOH |
1420, 1700, 1940, 2240 |
[8] pp. 8-10 |
|
C₆H₈O₇ |
1440, 1920, 2060, 2260 |
[4] pp. 78-80 |
|
CH₃CH(OH)COOH |
1430, 1485, 1700, 2020, 2255 |
[9] pp. 1128-1130 |
|
Fruit acid (apples) |
1440, 1920, 2050, 2255 |
[4] pp. 78-80 |
|
Grape/wine acid |
1435, 1910, 2040, 2260 |
[11] |
Plant Pigments and Phenolics
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
Chlorophyll a/b |
1070, 1400, 1730, 2270 |
[2] pp. 375-378 |
|
β-carotene, xanthophylls |
1050, 1680, 2135, 2280 |
[2] pp. 378-380 |
|
Phenolic compounds |
1420, 1670, 2056, 2270 |
[6], [11] |
Pharmaceutical Compounds
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
C₈H₁₀N₄O₂ |
1130, 1665, 1695, 2010, 2280 |
[9] pp. 1130-1132 |
|
Acetylsalicylic acid |
1145, 1435, 1680, 2020, 2140 |
[9] pp. 1125-1128 |
|
Acetaminophen |
1140, 1390, 1510, 1670, 2055, 2260 |
[9] pp. 1132-1135 |
Fibers and Textiles
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
Cotton cellulose |
1200, 1494, 1780, 2100, 2280, 2345 |
[6] pp. 295-298 |
|
PET fiber |
1140, 1660, 1720, 2015, 2130, 2255 |
[1] pp. 60-62 |
Polymers and Plastics
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
HDPE/LDPE plastic |
1190, 1720, 2310, 2355 |
[15] |
|
Aromatic polymer |
1145, 1680, 1720, 2170, 2300 |
[15] |
|
cis-1,4-polyisoprene |
1160, 1720, 2130, 2250, 2350 |
[15] |
|
Polyamide fiber |
1500, 1720, 2050, 2295 |
[1] pp. 60-62 |
Solvents
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
Ketone (propan-2-one) |
1690, 1710, 2100, 2300 |
[1] pp. 42-44 |
Soil Minerals
Component |
Description |
Key Bands (nm) |
Reference |
|---|---|---|---|
|
CaCO₃, MgCO₃ (calcite) |
2330, 2525 |
[13] |
|
CaSO₄·2H₂O |
1740, 1900, 2200 |
[13] |
|
Clay mineral |
1400, 2160, 2200 |
[13] |
Component Correlation Groups
Components within the same correlation group can be configured to have correlated concentrations, reflecting real-world relationships:
Group |
Components |
Application Domain |
|---|---|---|
1 |
water, moisture |
Moisture analysis |
2 |
protein, nitrogen_compound, urea, amino_acid, casein, gluten |
Nitrogen/protein analysis |
3 |
lipid, oil, saturated_fat, unsaturated_fat, waxes |
Fat/oil analysis |
4 |
starch, cellulose, glucose, fructose, sucrose, hemicellulose, lactose, dietary_fiber, cotton |
Carbohydrate analysis |
5 |
chlorophyll, carotenoid, lignin, tannins |
Plant constituents |
6 |
aromatic, alkane |
Petrochemical |
7 |
ethanol, methanol, glycerol |
Alcohols/polyols |
8 |
acetic_acid, citric_acid, lactic_acid, malic_acid, tartaric_acid |
Organic acids |
9 |
caffeine, aspirin, paracetamol |
Pharmaceuticals |
10 |
polyester |
Synthetic fibers |
11 |
polyethylene, polystyrene, natural_rubber, nylon |
Polymers |
12 |
acetone |
Solvents |
13 |
carbonates, gypsum, kaolinite |
Soil minerals |
Usage Example
from nirs4all.data.synthetic import ComponentLibrary
# List all available components (48 total)
library = ComponentLibrary.from_predefined()
print(f"Available components: {library.component_names}")
# Create library with agricultural components
agri_library = ComponentLibrary.from_predefined([
"water", "protein", "starch", "cellulose", "oil", "chlorophyll"
])
# Create library with dairy components
dairy_library = ComponentLibrary.from_predefined([
"water", "lactose", "casein", "lipid"
])
# Create library with wine/beverage components
wine_library = ComponentLibrary.from_predefined([
"water", "ethanol", "glycerol", "glucose", "fructose",
"malic_acid", "tartaric_acid", "tannins"
])
# Create library with soil analysis components
soil_library = ComponentLibrary.from_predefined([
"water", "carbonates", "gypsum", "kaolinite", "cellulose", "lignin"
])
# Create library with polymer/plastic components
polymer_library = ComponentLibrary.from_predefined([
"polyethylene", "polystyrene", "nylon", "polyester"
])
References
The predefined spectral components are based on established NIR spectroscopy literature:
Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy (2nd ed.). CRC Press. — Comprehensive NIR band assignments and spectral atlas.
Burns, D. A., & Ciurczak, E. W. (Eds.). (2007). Handbook of Near-Infrared Analysis (3rd ed.). CRC Press. — Theory and applications of NIR spectroscopy.
Siesler, H. W., Ozaki, Y., Kawata, S., & Heise, H. M. (Eds.). (2002). Near-Infrared Spectroscopy: Principles, Instruments, Applications. Wiley-VCH. — Detailed overtone/combination band theory.
Osborne, B. G., Fearn, T., & Hindle, P. H. (1993). Practical NIR Spectroscopy with Applications in Food and Beverage Analysis (2nd ed.). Longman Scientific & Technical. — Food and agricultural applications.
Williams, P. C., & Norris, K. H. (Eds.). (2001). Near-Infrared Technology in the Agricultural and Food Industries (2nd ed.). AACC International. — Band assignments for agricultural commodities.
Schwanninger, M., Rodrigues, J. C., & Fackler, K. (2011). A Review of Band Assignments in Near Infrared Spectra of Wood and Wood Components. Journal of Near Infrared Spectroscopy, 19(5), 287-308. — Comprehensive wood/cellulose band assignments.
Murray, I. (1986). The NIR Spectra of Homologous Series of Organic Compounds. In P. C. Williams & K. H. Norris (Eds.), Near-Infrared Technology in the Agricultural and Food Industries. AACC. — Fundamental hydrocarbon and alcohol band positions.
Bokobza, L. (1998). Near Infrared Spectroscopy. Journal of Near Infrared Spectroscopy, 6(1), 3-17. — Review of NIR fundamentals and band assignments.
Reich, G. (2005). Near-Infrared Spectroscopy and Imaging: Basic Principles and Pharmaceutical Applications. Advanced Drug Delivery Reviews, 57(8), 1109-1143. — Pharmaceutical compound band assignments.
Blanco, M., & Villarroya, I. (2002). NIR Spectroscopy: A Rapid-Response Analytical Tool. TrAC Trends in Analytical Chemistry, 21(4), 240-250. — Review of NIR applications and typical band positions.
Martelo-Vidal, M. J., & Vázquez, M. (2014). Application of Artificial Neural Networks Coupled to UV–VIS–NIR Spectroscopy for the Rapid Quantification of Wine Compounds. CyTA – Journal of Food, 12(1), 32-39. — Wine fermentation compounds (glycerol, organic acids, tannins).
Luypaert, J., Zhang, M. H., & Massart, D. L. (2003). Feasibility Study for the Use of Near Infrared Spectroscopy in the Qualitative and Quantitative Analysis of Green Tea. Analytica Chimica Acta, 478(2), 303-312. — Tea and beverage NIR assignments.
Khayamim, F., et al. (2015). Visible and Near-Infrared Spectroscopy for the Prediction of Soil Properties. Spectroscopy Letters, 48(6), 405-418. — Soil mineral NIR characterization.
Shenk, J. S., Workman Jr, J. J., & Westerhaus, M. O. (2008). Application of NIR Spectroscopy to Agricultural Products. In D. A. Burns & E. W. Ciurczak (Eds.), Handbook of Near-Infrared Analysis (3rd ed., pp. 347-386). CRC Press. — Comprehensive agricultural NIR band assignments.
Lachenal, G. (1995). Dispersive and Fourier Transform Near-Infrared Spectroscopy of Polymeric Materials. Vibrational Spectroscopy, 9(1), 93-100. — Polymer NIR band assignments.
See Also
Synthetic Data Generation - User guide
nirs4all.data.synthetic package - API reference
Testing Guide - Testing guide with synthetic fixtures