nirs4all.synthesis package

Module contents

Synthetic NIRS Data Generation Module.

This module provides tools for generating realistic synthetic NIRS spectra for testing, examples, benchmarking, and ML research.

Key Features:
  • Physically-motivated generation based on Beer-Lambert law

  • Voigt profile peak shapes (the convolution of a Gaussian and a Lorentzian)

  • Realistic NIR band positions from known spectroscopic databases

  • Configurable complexity levels (simple, realistic, complex)

  • Batch/session effects for domain adaptation research

  • Direct SpectroDataset creation for pipeline integration

Quick Start:
>>> from nirs4all.synthesis import SyntheticNIRSGenerator
>>>
>>> # Simple generation
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=1000)
>>>
>>> # Create a SpectroDataset
>>> dataset = generator.create_dataset(n_train=800, n_test=200)
>>>
>>> # Use predefined components
>>> from nirs4all.synthesis import ComponentLibrary
>>> library = ComponentLibrary.from_predefined(["water", "protein", "lipid"])
>>> generator = SyntheticNIRSGenerator(component_library=library)

See also

  • nirs4all.generate: Top-level generation API

  • SyntheticDatasetBuilder: Fluent dataset construction

References

  • Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press.

  • Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis. CRC Press.

class nirs4all.synthesis.ATRConfig(crystal_material: str = 'diamond', crystal_refractive_index: float = 2.4, incidence_angle: float = 45.0, n_reflections: int = 1, sample_refractive_index: float = 1.5)[source]

Bases: object

Configuration for Attenuated Total Reflectance mode.

ATR uses internal reflection within a high-refractive-index crystal. The evanescent wave penetrates into the sample, with penetration depth depending on wavelength.
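
The wavelength dependence can be illustrated with the standard expression for evanescent-wave penetration depth, d_p = λ / (2π n₁ √(sin²θ − (n₂/n₁)²)), where n₁ is the crystal index, n₂ the sample index, and θ the incidence angle. The sketch below is not taken from nirs4all: the helper name `atr_penetration_depth` is hypothetical, and its defaults simply mirror the ATRConfig defaults.

```python
import math


def atr_penetration_depth(
    wavelength_nm: float,
    crystal_refractive_index: float = 2.4,
    sample_refractive_index: float = 1.5,
    incidence_angle: float = 45.0,
) -> float:
    """Evanescent-wave penetration depth in nm (hypothetical helper)."""
    theta = math.radians(incidence_angle)
    ratio = sample_refractive_index / crystal_refractive_index
    radicand = math.sin(theta) ** 2 - ratio ** 2
    if radicand <= 0:
        # Below the critical angle there is no total internal reflection.
        raise ValueError("incidence angle is below the critical angle")
    return wavelength_nm / (
        2 * math.pi * crystal_refractive_index * math.sqrt(radicand)
    )


depth_1000 = atr_penetration_depth(1000.0)
depth_2000 = atr_penetration_depth(2000.0)
print(f"{depth_1000:.1f} nm at 1000 nm, {depth_2000:.1f} nm at 2000 nm")
```

Under this expression the depth scales linearly with wavelength, so long-wavelength bands probe deeper into the sample than short-wavelength ones.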

crystal_material

ATR crystal material.

Type:

str

crystal_refractive_index

Refractive index of crystal.

Type:

float

incidence_angle

Angle of incidence (degrees).

Type:

float

n_reflections

Number of internal reflections.

Type:

int

sample_refractive_index

Approximate refractive index of sample.

Type:

float

crystal_material: str = 'diamond'
crystal_refractive_index: float = 2.4
incidence_angle: float = 45.0
n_reflections: int = 1
sample_refractive_index: float = 1.5
class nirs4all.synthesis.AcceleratedArrays(backend: AcceleratorBackend, zeros: Callable, ones: Callable, arange: Callable, linspace: Callable, array: Callable, exp: Callable, log: Callable, sqrt: Callable, sin: Callable, cos: Callable, sum: Callable, dot: Callable, matmul: Callable, random_normal: Callable, random_uniform: Callable, to_numpy: Callable)[source]

Bases: object

Container for accelerated array operations.

arange: Callable
array: Callable
backend: AcceleratorBackend
cos: Callable
dot: Callable
exp: Callable
linspace: Callable
log: Callable
matmul: Callable
ones: Callable
random_normal: Callable
random_uniform: Callable
sin: Callable
sqrt: Callable
sum: Callable
to_numpy: Callable
zeros: Callable
class nirs4all.synthesis.AcceleratedGenerator(backend: AcceleratorBackend | None = None, random_state: int | None = None)[source]

Bases: object

GPU-accelerated synthetic spectrum generator.

This class provides a high-level interface for generating large batches of synthetic spectra using GPU acceleration when available.

Parameters:
  • backend – Backend to use (auto-detect if None).

  • random_state – Random state for reproducibility.

Example

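>>> import numpy as np
>>> from nirs4all.synthesis import AcceleratedGenerator
>>>
>>> gen = AcceleratedGenerator(random_state=42)
>>> print(f"Using backend: {gen.backend}")
>>>
>>> # Generate 10000 spectra
>>> # E: component spectra, shape (n_components, 700)
>>> # C: concentrations, shape (10000, n_components)
>>> X = gen.generate_batch(
...     n_samples=10000,
...     wavelengths=np.linspace(1000, 2500, 700),
...     component_spectra=E,
...     concentrations=C,
... )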
generate_batch(n_samples: int, wavelengths: ndarray, component_spectra: ndarray, concentrations: ndarray, noise_level: float = 0.01) ndarray[source]

Generate a batch of spectra.

Parameters:
  • n_samples – Number of samples.

  • wavelengths – Wavelength array.

  • component_spectra – Component spectra (n_components, n_wavelengths).

  • concentrations – Concentrations (n_samples, n_components).

  • noise_level – Noise level.

Returns:

Generated spectra (n_samples, n_wavelengths).
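
The documented shapes are consistent with a Beer-Lambert style linear mixing model, in which each spectrum is a concentration-weighted sum of component spectra. The following NumPy-only sketch assumes additive Gaussian noise with standard deviation `noise_level`; the actual accelerated implementation may differ, and `mix_spectra` is a hypothetical name.

```python
import numpy as np


def mix_spectra(component_spectra, concentrations, noise_level=0.01, random_state=None):
    """Linear mixing: (n_samples, n_components) @ (n_components, n_wavelengths)."""
    rng = np.random.default_rng(random_state)
    clean = concentrations @ component_spectra
    # Assumed noise model: independent Gaussian noise on every point.
    noise = rng.normal(0.0, noise_level, size=clean.shape)
    return clean + noise


wavelengths = np.linspace(1000, 2500, 700)
# Two toy Gaussian bands standing in for real component spectra.
E = np.stack([
    np.exp(-0.5 * ((wavelengths - 1450) / 25.0) ** 2),
    np.exp(-0.5 * ((wavelengths - 1940) / 30.0) ** 2),
])
C = np.random.default_rng(0).uniform(0.0, 1.0, size=(100, 2))
X = mix_spectra(E, C, noise_level=0.01, random_state=42)
print(X.shape)
```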

generate_voigt_profiles(wavelengths: ndarray, centers: ndarray, amplitudes: ndarray, sigmas: ndarray, gammas: ndarray) ndarray[source]

Generate Voigt profiles for component spectra.

Parameters:
  • wavelengths – Wavelength array.

  • centers – Band centers.

  • amplitudes – Band amplitudes.

  • sigmas – Gaussian widths.

  • gammas – Lorentzian widths.

Returns:

Spectrum array.
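
To show how the Gaussian width (sigma) and Lorentzian width (gamma) shape a band, the sketch below evaluates a pseudo-Voigt profile, a common closed-form approximation to the true Voigt convolution (mixing coefficients from Thompson, Cox and Hastings). It is an approximation chosen for a dependency-free illustration, not the method nirs4all uses; `pseudo_voigt` is a hypothetical name, and at least one of sigma and gamma must be positive.

```python
import math


def pseudo_voigt(x, center, amplitude, sigma, gamma):
    """Pseudo-Voigt band, scaled so that its value at the center equals amplitude."""
    fwhm_g = 2.0 * sigma * math.sqrt(2.0 * math.log(2.0))  # Gaussian FWHM
    fwhm_l = 2.0 * gamma                                     # Lorentzian FWHM
    # Effective FWHM of the combined profile.
    fwhm = (
        fwhm_g**5
        + 2.69269 * fwhm_g**4 * fwhm_l
        + 2.42843 * fwhm_g**3 * fwhm_l**2
        + 4.47163 * fwhm_g**2 * fwhm_l**3
        + 0.07842 * fwhm_g * fwhm_l**4
        + fwhm_l**5
    ) ** 0.2
    ratio = fwhm_l / fwhm
    # Lorentzian mixing fraction (0 = pure Gaussian, 1 = pure Lorentzian).
    eta = 1.36603 * ratio - 0.47719 * ratio**2 + 0.11116 * ratio**3

    def area_normalized(dx):
        gauss = (2.0 / fwhm) * math.sqrt(math.log(2.0) / math.pi) * math.exp(
            -4.0 * math.log(2.0) * dx * dx / fwhm**2
        )
        lorentz = (fwhm / (2.0 * math.pi)) / (dx * dx + (fwhm / 2.0) ** 2)
        return eta * lorentz + (1.0 - eta) * gauss

    return amplitude * area_normalized(x - center) / area_normalized(0.0)


grid = range(1000, 2500, 2)
spectrum = [pseudo_voigt(w, 1450.0, 1.0, 25.0, 3.0) for w in grid]
print(max(spectrum))
```

With gamma set to 0 the profile reduces to a pure Gaussian, and with sigma set to 0 it reduces to a pure Lorentzian.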

class nirs4all.synthesis.AcceleratorBackend(value)[source]

Bases: str, Enum

Available acceleration backends.

CUPY = 'cupy'
JAX = 'jax'
NUMPY = 'numpy'
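
AcceleratedGenerator auto-detects a backend when none is given. One way such detection could work is sketched below. The preference order (CuPy, then JAX, then NumPy) and the helper name `detect_backend` are assumptions rather than documented behavior, and the enum is re-declared locally so that the block is self-contained.

```python
from enum import Enum
from importlib.util import find_spec


class AcceleratorBackend(str, Enum):
    """Local stand-in mirroring the documented enum values."""

    CUPY = "cupy"
    JAX = "jax"
    NUMPY = "numpy"


def detect_backend() -> AcceleratorBackend:
    """Return the first installed accelerated backend, else NumPy (assumed order)."""
    for candidate in (AcceleratorBackend.CUPY, AcceleratorBackend.JAX):
        # find_spec returns None when the top-level package is not installed.
        if find_spec(candidate.value) is not None:
            return candidate
    return AcceleratorBackend.NUMPY


backend = detect_backend()
print(backend.value)
```

Because the enum also derives from str, members compare equal to their plain string values, so `AcceleratorBackend.NUMPY == "numpy"` holds.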
class nirs4all.synthesis.AggregateComponent(name: str, components: ~typing.Dict[str, float], description: str, domain: str, category: str = '', spectral_category: str = '', variability: ~typing.Dict[str, ~typing.Tuple[float, float]] = <factory>, correlations: ~typing.List[~typing.Tuple[str, str, float]] = <factory>, tags: ~typing.List[str] = <factory>, references: ~typing.List[str] = <factory>)[source]

Bases: object

Predefined mixture of spectral components for common sample types.

Each aggregate defines a typical composition for a product type along with realistic variability ranges for generating diverse samples.

name

Unique identifier for the aggregate.

Type:

str

components

Base composition as {component_name: weight}. Weights should approximately sum to 1.0 (allowing for ash, etc.).

Type:

Dict[str, float]

description

Human-readable description of the aggregate.

Type:

str

domain

Application domain (e.g., “agriculture”, “food”, “pharmaceutical”).

Type:

str

category

Product category within domain (e.g., “grain”, “dairy”, “solid_dosage”).

Type:

str

variability

Optional weight ranges for components with natural variation. Format: {component_name: (min_weight, max_weight)}.

Type:

Dict[str, Tuple[float, float]]

correlations

Optional correlation constraints between components. Format: [(comp1, comp2, correlation_coefficient), …].

Type:

List[Tuple[str, str, float]]

tags

Classification tags for filtering (e.g., [“grain”, “cereal”]).

Type:

List[str]

references

Literature or database citations.

Type:

List[str]

Example

>>> wheat = AggregateComponent(
...     name="wheat_grain",
...     components={"starch": 0.65, "protein": 0.12, "moisture": 0.12},
...     description="Typical wheat grain composition",
...     domain="agriculture",
...     category="grain",
...     variability={"protein": (0.08, 0.18), "moisture": (0.08, 0.15)},
... )
category: str = ''
components: Dict[str, float]
correlations: List[Tuple[str, str, float]]
description: str
domain: str
info() str[source]

Return formatted information about the aggregate.

Returns:

Human-readable string with aggregate details.

name: str
references: List[str]
spectral_category: str = ''
tags: List[str]
validate() List[str][source]

Validate aggregate definition.

Returns:

List of validation issues (empty if all valid).

variability: Dict[str, Tuple[float, float]]
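
As an illustration of how the variability ranges could produce diverse samples, the sketch below draws each variable component uniformly from its (min_weight, max_weight) range and keeps the remaining components at their base weight. The uniform distribution and the helper name `sample_composition` are assumptions; the library may sample differently and may also apply the correlation constraints, which this sketch ignores.

```python
import random


def sample_composition(components, variability, rng):
    """Draw one composition (hypothetical sampling scheme)."""
    sampled = {}
    for name, base_weight in components.items():
        if name in variability:
            low, high = variability[name]
            sampled[name] = rng.uniform(low, high)
        else:
            # No range given: keep the base weight unchanged.
            sampled[name] = base_weight
    return sampled


components = {"starch": 0.65, "protein": 0.12, "moisture": 0.12}
variability = {"protein": (0.08, 0.18), "moisture": (0.08, 0.15)}
rng = random.Random(42)
samples = [sample_composition(components, variability, rng) for _ in range(5)]
for sample in samples:
    print({name: round(weight, 3) for name, weight in sample.items()})
```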
class nirs4all.synthesis.BandAssignment(center: float, functional_group: str, overtone_level: str, assignment: str = '', description: str = '', sigma_range: Tuple[float, float]=(15, 30), gamma_range: Tuple[float, float]=(0, 5), intensity: str = 'medium', chemical_context: str = '', affected_by: List[str] = <factory>, common_compounds: List[str] = <factory>, references: List[str] = <factory>, tags: List[str] = <factory>)[source]

Bases: object

Represents a single NIR band assignment.

This class stores the physical and chemical properties of an absorption band, including its position, width, and molecular origin.

center

Central wavelength in nm.

Type:

float

wavenumber

Central wavenumber in cm⁻¹ (calculated from center).

functional_group

Chemical functional group (e.g., “O-H”, “C-H”, “N-H”).

Type:

str

overtone_level

Overtone designation (“fundamental”, “1st”, “2nd”, “3rd”, “combination”).

Type:

str

assignment

Specific vibrational mode assignment (e.g., “2ν₁”, “ν₁+ν₃”).

Type:

str

description

Human-readable description of the band.

Type:

str

sigma_range

Typical Gaussian width range (min, max) in nm.

Type:

Tuple[float, float]

gamma_range

Typical Lorentzian width range (min, max) in nm.

Type:

Tuple[float, float]

intensity

Relative intensity category (“very_weak”, “weak”, “medium”, “strong”, “very_strong”).

Type:

str

chemical_context

Additional context (e.g., “free”, “H-bonded”, “aromatic”, “aliphatic”).

Type:

str

affected_by

Factors that shift/modify the band (e.g., [“H-bonding”, “temperature”]).

Type:

List[str]

common_compounds

Example compounds showing this band.

Type:

List[str]

references

Literature citations for the band assignment.

Type:

List[str]

tags

Classification tags for filtering (e.g., [“water”, “carbohydrate”]).

Type:

List[str]

Example

>>> band = BandAssignment(
...     center=1450,
...     functional_group="O-H",
...     overtone_level="1st",
...     assignment="2ν₁ (O-H stretch)",
...     description="O-H 1st overtone, free hydroxyl",
...     sigma_range=(20, 30),
...     intensity="strong",
... )
affected_by: List[str]
assignment: str = ''
center: float
chemical_context: str = ''
common_compounds: List[str]
description: str = ''
functional_group: str
gamma_range: Tuple[float, float] = (0, 5)
info() str[source]

Return formatted information about the band.

intensity: str = 'medium'
overtone_level: str
references: List[str]
sigma_range: Tuple[float, float] = (15, 30)
tags: List[str]
to_nir_band(amplitude: float = 1.0, sigma: float | None = None, gamma: float | None = None)[source]

Convert to NIRBand object for spectral generation.

Parameters:
  • amplitude – Peak amplitude (default 1.0).

  • sigma – Gaussian width in nm. If None, uses midpoint of sigma_range.

  • gamma – Lorentzian width in nm. If None, uses midpoint of gamma_range.

Returns:

NIRBand object configured with this band’s parameters.

property wavenumber: float

Convert wavelength (nm) to wavenumber (cm⁻¹).
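
The two conversions described above are simple enough to write out. The sketch below uses the standard relation ν̃ (cm⁻¹) = 10⁷ / λ (nm) and the midpoint rule that to_nir_band documents for unspecified widths; the function names are hypothetical.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm


def range_midpoint(value_range):
    """Midpoint of a (min, max) range, used when no explicit width is given."""
    low, high = value_range
    return (low + high) / 2.0


print(round(nm_to_wavenumber(1450), 1))  # 6896.6
print(range_midpoint((15, 30)))          # 22.5 (default sigma_range)
print(range_midpoint((0, 5)))            # 2.5 (default gamma_range)
```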

class nirs4all.synthesis.BatchEffectConfig(enabled: bool = False, n_batches: int = 3, offset_std: float = 0.02, gain_std: float = 0.03)[source]

Bases: object

Configuration for batch/session effects simulation.

enabled

Whether to add batch effects.

Type:

bool

n_batches

Number of measurement batches/sessions.

Type:

int

offset_std

Standard deviation of batch offset.

Type:

float

gain_std

Standard deviation of batch gain multiplier.

Type:

float

enabled: bool = False
gain_std: float = 0.03
n_batches: int = 3
offset_std: float = 0.02
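
A plausible reading of these parameters is that every batch receives one additive offset and one multiplicative gain, shared by all spectra in that batch. The sketch below assumes gain ~ N(1, gain_std), offset ~ N(0, offset_std), and round-robin batch assignment; none of these details are confirmed by the documentation, and `apply_batch_effects` is a hypothetical name.

```python
import random


def apply_batch_effects(spectra, n_batches=3, offset_std=0.02, gain_std=0.03, seed=None):
    """Return (adjusted_spectra, batch_ids) under the assumed model."""
    rng = random.Random(seed)
    # One (gain, offset) pair per batch.
    params = [
        (rng.gauss(1.0, gain_std), rng.gauss(0.0, offset_std))
        for _ in range(n_batches)
    ]
    # Assumed assignment: samples cycle through the batches in order.
    batch_ids = [i % n_batches for i in range(len(spectra))]
    adjusted = [
        [params[b][0] * value + params[b][1] for value in spectrum]
        for spectrum, b in zip(spectra, batch_ids)
    ]
    return adjusted, batch_ids


spectra = [[0.10, 0.20, 0.30], [0.15, 0.25, 0.35], [0.12, 0.22, 0.32], [0.11, 0.21, 0.31]]
adjusted, batch_ids = apply_batch_effects(spectra, seed=42)
print(batch_ids)
```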
class nirs4all.synthesis.BenchmarkDatasetInfo(name: str, full_name: str, domain: BenchmarkDomain, n_samples: int, n_wavelengths: int, wavelength_range: Tuple[float, float], targets: List[str], sample_type: str, measurement_mode: str, source_url: str, reference: str, license: str = 'Unknown', typical_snr: Tuple[float, float] = (50, 500), typical_peak_density: Tuple[float, float] = (1.0, 5.0), notes: str = '')[source]

Bases: object

Metadata for a benchmark dataset.

name

Dataset name/identifier.

Type:

str

full_name

Full descriptive name.

Type:

str

domain

Application domain.

Type:

nirs4all.synthesis.benchmarks.BenchmarkDomain

n_samples

Number of samples (approximate if variable).

Type:

int

n_wavelengths

Number of wavelength points.

Type:

int

wavelength_range

(min, max) wavelength in nm.

Type:

Tuple[float, float]

targets

List of target variable names.

Type:

List[str]

sample_type

Description of sample type.

Type:

str

measurement_mode

Typical measurement mode.

Type:

str

source_url

URL to obtain the dataset.

Type:

str

reference

Publication or source reference.

Type:

str

license

License information.

Type:

str

typical_snr

Typical signal-to-noise ratio range.

Type:

Tuple[float, float]

typical_peak_density

Typical peaks per 100 nm.

Type:

Tuple[float, float]

notes

Additional notes.

Type:

str

domain: BenchmarkDomain
full_name: str
license: str = 'Unknown'
measurement_mode: str
n_samples: int
n_wavelengths: int
name: str
notes: str = ''
reference: str
sample_type: str
source_url: str
summary() str[source]

Return a human-readable summary.

targets: List[str]
typical_peak_density: Tuple[float, float] = (1.0, 5.0)
typical_snr: Tuple[float, float] = (50, 500)
wavelength_range: Tuple[float, float]
class nirs4all.synthesis.BenchmarkDomain(value)[source]

Bases: str, Enum

Domains for benchmark datasets.

AGRICULTURE = 'agriculture'
ENVIRONMENTAL = 'environmental'
FOOD = 'food'
GENERAL = 'general'
PETROCHEMICAL = 'petrochemical'
PHARMACEUTICAL = 'pharmaceutical'
class nirs4all.synthesis.CSVVariationGenerator[source]

Bases: object

Generate CSV files with various format variations for loader testing.

This class creates CSV files with different delimiters, encodings, header formats, and other variations to test the robustness of CSV loaders.

base_exporter

DatasetExporter for actual file writing.

Example

>>> generator = CSVVariationGenerator()
>>>
>>> # Generate all variations
>>> paths = generator.generate_all_variations(
...     "test_data",
...     X, y,
...     wavelengths=wavelengths
... )
>>>
>>> # Generate specific variation
>>> path = generator.with_semicolon_delimiter(
...     "data_semicolon",
...     X, y
... )
as_fragmented(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create fragmented dataset with multiple small files.

as_single_file(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create single CSV file with all data and partition column.

generate_all_variations(base_path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Dict[str, Path][source]

Generate CSV files with all format variations.

Creates multiple versions of the dataset with different CSV format options for comprehensive loader testing.

Parameters:
  • base_path – Base output folder path.

  • X – Feature matrix.

  • y – Target values.

  • wavelengths – Optional wavelength values.

  • train_ratio – Train/test split ratio.

  • random_state – Random seed.

Returns:

Dictionary mapping variation name to created path.

Example

>>> paths = generator.generate_all_variations(
...     "test_variations",
...     X, y,
...     random_state=42
... )
>>> print(paths.keys())
with_comma_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create CSV with comma delimiter.

with_precision(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None, precision: int = 6) Path[source]

Create CSV with specified floating point precision.

with_row_index(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create CSV with row index column.

with_semicolon_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create CSV with semicolon delimiter (nirs4all default).

with_tab_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create CSV with tab delimiter.

without_headers(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, random_state: int | None = None) Path[source]

Create CSV without column headers.
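
The variations above differ mainly in delimiter, header presence, and numeric precision. The standard-library sketch below reproduces those three axes of variation in memory. It is not the DatasetExporter implementation: the single-table layout with a trailing `y` column and the helper name `write_csv_variant` are assumptions made for illustration.

```python
import csv
import io


def write_csv_variant(X, y, wavelengths=None, delimiter=";", with_header=True, precision=6):
    """Serialize X and y to CSV text with the chosen format options."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    if with_header:
        # Use wavelengths as column names when given, else positional indices.
        names = wavelengths if wavelengths is not None else range(len(X[0]))
        writer.writerow([str(name) for name in names] + ["y"])
    for row, target in zip(X, y):
        writer.writerow(
            [f"{value:.{precision}f}" for value in row] + [f"{target:.{precision}f}"]
        )
    return buffer.getvalue()


X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
y = [12.5, 13.1]
grid = [1000, 1002, 1004]
variants = {
    "semicolon": write_csv_variant(X, y, wavelengths=grid, delimiter=";"),
    "comma": write_csv_variant(X, y, wavelengths=grid, delimiter=","),
    "tab": write_csv_variant(X, y, wavelengths=grid, delimiter="\t"),
    "no_header": write_csv_variant(X, y, with_header=False),
    "low_precision": write_csv_variant(X, y, precision=2),
}
print(variants["semicolon"])
```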

class nirs4all.synthesis.CategoryGenerator(templates: List[str | ProductTemplate], random_state: int | None = None, **kwargs: Any)[source]

Bases: object

Generator combining multiple product templates for diverse datasets.

CategoryGenerator enables creation of training datasets that span multiple product types, useful for building robust models that generalize across categories.

templates

List of ProductTemplate objects.

generators

List of ProductGenerator objects for each template.

Parameters:
  • templates – List of template names or ProductTemplate objects.

  • random_state – Random seed for reproducibility.

  • **kwargs – Additional arguments passed to ProductGenerator.

Example

>>> # Combine dairy products
>>> gen = CategoryGenerator(["milk_variable_fat", "cheese_variable_moisture"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
>>>
>>> # Universal fat predictor training
>>> gen = CategoryGenerator([
...     "milk_variable_fat",
...     "cheese_variable_moisture",
...     "meat_variable_fat",
... ])
>>> dataset = gen.generate(n_samples=10000, target="lipid")
__repr__() str[source]

Return string representation.

generate(n_samples: int = 1000, target: str | None = None, samples_per_template: List[int] | None = None, train_ratio: float = 0.8, shuffle: bool = True, include_template_labels: bool = False) SpectroDataset[source]

Generate combined dataset from multiple templates.

Parameters:
  • n_samples – Total number of samples to generate.

  • target – Component to use as regression target. Must exist in all templates.

  • samples_per_template – Number of samples per template. If None, divides equally.

  • train_ratio – Proportion of samples for training partition.

  • shuffle – Whether to shuffle samples across templates.

  • include_template_labels – If True, adds template index as metadata.

Returns:

SpectroDataset combining samples from all templates.

Example

>>> gen = CategoryGenerator(["milk_variable_fat", "meat_variable_fat"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
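
When samples_per_template is None, the total is divided equally among the templates. A minimal sketch of such a split follows. How the library handles a remainder is not documented, so assigning the extra samples to the first templates is an assumption, and `split_samples` is a hypothetical name.

```python
def split_samples(n_samples: int, n_templates: int) -> list:
    """Split n_samples as evenly as possible across n_templates."""
    base, remainder = divmod(n_samples, n_templates)
    # Assumed rule: the first `remainder` templates get one extra sample.
    return [base + (1 if i < remainder else 0) for i in range(n_templates)]


print(split_samples(2000, 2))   # [1000, 1000]
print(split_samples(10000, 3))  # [3334, 3333, 3333]
```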
class nirs4all.synthesis.ClassSeparationConfig(separation: float = 1.5, method: Literal['component', 'shift', 'intensity'] = 'component', noise: float = 0.1)[source]

Bases: object

Configuration for class separation in classification tasks.

separation

Separation factor (higher = more separable). Values around 0.5-1.0 create overlapping classes. Values around 2.0-3.0 create well-separated classes.

Type:

float

method

How to create class differences:
  • “component”: Different component concentration profiles per class.

  • “shift”: Systematic spectral shifts between classes.

  • “intensity”: Different overall intensity levels.

Type:

Literal[‘component’, ‘shift’, ‘intensity’]

noise

Noise level to add to class boundaries.

Type:

float

method: Literal['component', 'shift', 'intensity'] = 'component'
noise: float = 0.1
separation: float = 1.5
class nirs4all.synthesis.CombinationBandResult(mode1_cm: float, mode2_cm: float, wavenumber_cm: float, wavelength_nm: float, amplitude_factor: float, band_type: str)[source]

Bases: object

Result of combination band calculation.

amplitude_factor: float
band_type: str
mode1_cm: float
mode2_cm: float
wavelength_nm: float
wavenumber_cm: float
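
The wavenumber and wavelength fields are related in a simple way. In the harmonic approximation a combination band sits at the sum of the two mode wavenumbers, and the wavelength follows from λ (nm) = 10⁷ / ν̃ (cm⁻¹). The sketch below shows only this relation; real bands are usually shifted by anharmonicity, the library's own calculation may include such corrections, and `combination_band` is a hypothetical name.

```python
def combination_band(mode1_cm: float, mode2_cm: float) -> tuple:
    """Harmonic estimate of a combination band: (wavenumber_cm, wavelength_nm)."""
    wavenumber_cm = mode1_cm + mode2_cm
    wavelength_nm = 1e7 / wavenumber_cm
    return wavenumber_cm, wavelength_nm


# Gas-phase water: bend near 1595 cm^-1, asymmetric stretch near 3756 cm^-1.
wavenumber_cm, wavelength_nm = combination_band(1595.0, 3756.0)
print(f"{wavenumber_cm:.0f} cm^-1 -> {wavelength_nm:.0f} nm")
```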
class nirs4all.synthesis.ComponentFitResult(component_names: List[str], concentrations: ndarray, baseline_coefficients: ndarray | None, fitted_spectrum: ndarray, residuals: ndarray, r_squared: float, rmse: float, wavelengths: ndarray | None = None)[source]

Bases: object

Result of fitting spectral components to an observed spectrum.

component_names

Names of components used in fitting.

Type:

List[str]

concentrations

Estimated concentration for each component.

Type:

numpy.ndarray

baseline_coefficients

Polynomial baseline coefficients (if fit_baseline=True).

Type:

numpy.ndarray | None

fitted_spectrum

Reconstructed spectrum from fit.

Type:

numpy.ndarray

residuals

Difference between observed and fitted spectra.

Type:

numpy.ndarray

r_squared

R² goodness-of-fit metric.

Type:

float

rmse

Root mean squared error of fit.

Type:

float

wavelengths

Wavelength grid used for fitting.

Type:

numpy.ndarray | None

baseline_coefficients: ndarray | None
component_names: List[str]
concentrations: ndarray
fitted_spectrum: ndarray
r_squared: float
residuals: ndarray
rmse: float
summary() str[source]

Return human-readable summary of fit results.

to_dict() Dict[str, float][source]

Return concentrations as a dictionary.

top_components(n: int = 5, threshold: float = 0.0) List[Tuple[str, float]][source]

Get top N components by concentration.

Parameters:
  • n – Maximum number of components to return.

  • threshold – Minimum concentration threshold.

Returns:

List of (component_name, concentration) tuples, sorted descending.

wavelengths: ndarray | None = None
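
For reference, the fit-quality fields can be computed from the observed and fitted spectra with the usual definitions of R² and RMSE, and top_components amounts to a filtered, sorted selection. The sketch below uses hypothetical function names, and treating the threshold as inclusive is an assumption.

```python
import math


def fit_metrics(observed, fitted):
    """Return (residuals, r_squared, rmse) using the standard definitions."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r_squared, rmse


def top_components(names, concentrations, n=5, threshold=0.0):
    """Top-n (name, concentration) pairs at or above threshold, sorted descending."""
    pairs = [(name, c) for name, c in zip(names, concentrations) if c >= threshold]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:n]


residuals, r_squared, rmse = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(f"R2 = {r_squared:.3f}, RMSE = {rmse:.3f}")
print(top_components(["water", "protein", "lipid"], [0.6, 0.05, 0.3], n=2))
```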
class nirs4all.synthesis.ComponentFitter(component_names: List[str] | None = None, wavelengths: ndarray | None = None, fit_baseline: bool = True, baseline_order: int = 2, preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False, sg_window_length: int = 15, sg_polyorder: int = 2)[source]

Bases: object

Fit linear combinations of spectral components to observed spectra.

Solves: spectrum ≈ Σ(c_i * component_i(λ)) + baseline

Uses non-negative least squares (NNLS) to ensure positive concentrations, which is physically meaningful for spectroscopic analysis.

Preprocessing Support: If your observed spectra are preprocessed (e.g., second derivative, SNV), use the preprocessing parameter to apply the same transformation to component spectra before fitting.

Auto-detection: Set auto_detect_preprocessing=True to automatically detect the preprocessing type from the data (recommended for derivative data).

Example

>>> import numpy as np
>>> from nirs4all.synthesis import ComponentFitter
>>>
>>> # Fit with all available components
>>> fitter = ComponentFitter(wavelengths=np.arange(1000, 2500, 2))
>>> result = fitter.fit(observed_spectrum)
>>> print(result.summary())
>>>
>>> # Fit preprocessed data (e.g., second derivative)
>>> fitter = ComponentFitter(
...     component_names=["water", "protein", "lipid"],
...     wavelengths=wavelengths,
...     preprocessing="second_derivative",  # Components will be transformed
... )
>>> result = fitter.fit(derivative_spectrum)
>>>
>>> # Auto-detect preprocessing (recommended for unknown data)
>>> fitter = ComponentFitter(
...     wavelengths=wavelengths,
...     auto_detect_preprocessing=True,  # Will detect derivative, SNV, etc.
... )
>>> result = fitter.fit(unknown_spectrum)
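
To make the non-negative least squares step concrete, the sketch below solves spectrum ≈ Σ(c_i * component_i(λ)) with c_i ≥ 0 by cyclic coordinate descent. It is a teaching example rather than the solver ComponentFitter uses: it omits the polynomial baseline, uses toy Gaussian bands instead of library components, and `nnls_coordinate_descent` is a hypothetical name.

```python
import numpy as np


def nnls_coordinate_descent(A, b, n_iter=200):
    """Minimize ||A x - b||^2 subject to x >= 0 by cyclic coordinate descent."""
    gram = A.T @ A
    atb = A.T @ b
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        for j in range(x.size):
            if gram[j, j] == 0.0:
                continue  # an all-zero column carries no information
            gradient_j = gram[j] @ x - atb[j]
            # Exact minimization along coordinate j, clipped at zero.
            x[j] = max(0.0, x[j] - gradient_j / gram[j, j])
    return x


wavelengths = np.arange(1000, 2500, 2)
centers_and_widths = [(1450, 25.0), (1940, 30.0), (2310, 20.0)]
components = np.stack([
    np.exp(-0.5 * ((wavelengths - c) / s) ** 2) for c, s in centers_and_widths
])
true_concentrations = np.array([0.7, 0.0, 0.25])
observed = true_concentrations @ components  # noise-free synthetic spectrum

estimated = nnls_coordinate_descent(components.T, observed)
fitted = estimated @ components
print(np.round(estimated, 3))
```

For production use, `scipy.optimize.nnls` solves the same constrained problem with an active-set method.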
component_names

List of component names to fit.

wavelengths

Wavelength grid for fitting.

fit_baseline

Whether to include polynomial baseline.

baseline_order

Polynomial order for baseline (default 2).

preprocessing

Preprocessing to apply to components before fitting.

auto_detect_preprocessing

If True, detect preprocessing from data.

detected_preprocessing

The detected preprocessing type (after first fit).

detected_preprocessing: PreprocessingType | None
fit(spectrum: ndarray, method: str = 'nnls') ComponentFitResult[source]

Fit components to a single spectrum.

Parameters:
  • spectrum – Observed spectrum, shape (n_wavelengths,).

  • method – Fitting method. One of:
      - “nnls”: Non-negative least squares (default, physically meaningful).
      - “lsq”: Unconstrained least squares (allows negative concentrations).

Returns:

ComponentFitResult with concentrations, residuals, and fit quality metrics.

Example

>>> result = fitter.fit(observed_spectrum)
>>> print(f"R² = {result.r_squared:.4f}")
>>> print(f"Top components: {result.top_components(3)}")
fit_batch(spectra: ndarray, method: str = 'nnls', n_jobs: int = -1) List[ComponentFitResult][source]

Fit components to multiple spectra in parallel.

Parameters:
  • spectra – Observed spectra, shape (n_samples, n_wavelengths).

  • method – Fitting method (“nnls” or “lsq”).

  • n_jobs – Number of parallel jobs (-1 = all cores, 1 = sequential).

Returns:

List of ComponentFitResult objects.

Example

>>> results = fitter.fit_batch(X_observed, n_jobs=4)
>>> mean_r2 = np.mean([r.r_squared for r in results])
>>> print(f"Mean R² = {mean_r2:.4f}")
get_concentration_matrix(spectra: ndarray, method: str = 'nnls', n_jobs: int = -1) Tuple[ndarray, List[str]][source]

Get concentration matrix for batch of spectra.

Convenience method that extracts just the concentrations.

Parameters:
  • spectra – Observed spectra, shape (n_samples, n_wavelengths).

  • method – Fitting method (“nnls” or “lsq”).

  • n_jobs – Number of parallel jobs.

Returns:

Tuple of (concentrations, component_names):

  • concentrations: Array of shape (n_samples, n_components)

  • component_names: List of component names

Example

>>> C, names = fitter.get_concentration_matrix(X_observed)
>>> water_idx = names.index("water")
>>> water_concentrations = C[:, water_idx]
suggest_components(spectrum: ndarray, top_n: int = 5, threshold: float = 0.01, method: str = 'nnls') List[Tuple[str, float]][source]

Suggest which components are likely present in a spectrum.

Performs a fit and returns the top components by concentration.

Parameters:
  • spectrum – Observed spectrum, shape (n_wavelengths,).

  • top_n – Maximum number of components to return.

  • threshold – Minimum concentration threshold.

  • method – Fitting method (“nnls” or “lsq”).

Returns:

List of (component_name, estimated_concentration) tuples, sorted by concentration descending.

Example

>>> suggestions = fitter.suggest_components(unknown_spectrum)
>>> print("Likely components:")
>>> for name, conc in suggestions:
...     print(f"  {name}: {conc:.3f}")
class nirs4all.synthesis.ComponentLibrary(random_state: int | None = None)[source]

Bases: object

Library of spectral components for synthetic NIRS generation.

Supports both predefined components (based on known NIR band assignments) and programmatically generated random components for research purposes.

rng

NumPy random generator for reproducibility.

Example

>>> # Create from predefined components
>>> library = ComponentLibrary.from_predefined(
...     ["water", "protein", "lipid"],
...     random_state=42
... )
>>>
>>> # Or generate random components
>>> library = ComponentLibrary(random_state=42)
>>> library.generate_random_library(n_components=5)
>>>
>>> # Compute all component spectra
>>> wavelengths = np.arange(1000, 2500, 2)
>>> E = library.compute_all(wavelengths)  # shape: (n_components, n_wavelengths)
__contains__(name: str) bool[source]

Check if component exists by name.

__getitem__(name: str) SpectralComponent[source]

Get component by name.

__iter__()[source]

Iterate over components.

__len__() int[source]

Return number of components.

add_boundary_component(name: str, measurement_range: Tuple[float, float] = (1000, 2500), edge: str = 'both', n_bands: int = 1, amplitude_range: Tuple[float, float] = (0.3, 1.0), width_range: Tuple[float, float] = (50, 200), offset_range: Tuple[float, float] = (0.3, 1.5)) SpectralComponent[source]

Generate a component with bands outside the measurement range.

This creates “boundary” or “truncated” peaks - absorption bands whose centers lie outside the measured wavelength range, resulting in partial peaks visible at the spectral edges. This is a common phenomenon in real NIR spectra where absorption bands extend beyond the instrument’s wavelength range.

Common causes include:
  • Strong water absorption bands at ~2500 nm affecting NIR edge

  • UV/visible absorption tails at the low wavelength end

  • Mid-IR fundamental bands tailing into NIR at the high end

Parameters:
  • name – Component name.

  • measurement_range – (min, max) wavelength range of the “measurement” (nm). Bands will be placed outside this range.

  • edge – Which edge(s) to add boundary bands:
      - “left”: Only below min wavelength
      - “right”: Only above max wavelength
      - “both”: Either edge (randomly selected)

  • n_bands – Number of boundary bands to generate.

  • amplitude_range – Range for peak amplitudes (0-1 scale).

  • width_range – Range for band widths (nm). Controls how much of the peak is visible in the measurement range.

  • offset_range – Range for how far outside the measurement range to place the band center, as a fraction of the band width. For example, 0.5 places the center 0.5 × width outside the range.

Returns:

The generated SpectralComponent with boundary bands.

Example

>>> library = ComponentLibrary(random_state=42)
>>> # Add water band tail at long wavelength edge
>>> boundary = library.add_boundary_component(
...     "water_tail",
...     measurement_range=(1000, 2400),
...     edge="right",
...     amplitude_range=(0.5, 1.0),
...     width_range=(100, 300)
... )
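
The placement rule described for offset_range can be written out directly: the band center sits offset × width beyond the chosen edge of the measurement range. The sketch below follows that description only; `boundary_band_center` is a hypothetical helper, not part of ComponentLibrary.

```python
import random


def boundary_band_center(measurement_range, edge, width, offset, rng):
    """Center (nm) of a band placed offset * width outside the measurement range."""
    low, high = measurement_range
    if edge == "both":
        # "both" means either edge, chosen at random for each band.
        edge = rng.choice(["left", "right"])
    if edge == "left":
        return low - offset * width
    if edge == "right":
        return high + offset * width
    raise ValueError(f"unknown edge: {edge!r}")


rng = random.Random(42)
center = boundary_band_center((1000, 2400), "right", width=200.0, offset=0.5, rng=rng)
print(center)  # 2500.0
```

A smaller offset leaves more of the peak visible inside the measured range, while a larger one leaves only the tail.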

References

  • Burns & Ciurczak (2007). Handbook of Near-Infrared Analysis. Discussion of wavelength range selection and edge effects.

add_boundary_components_from_known(measurement_range: Tuple[float, float] = (1000, 2500)) ComponentLibrary[source]

Add known boundary components that affect common NIR measurement ranges.

Based on literature, certain absorption bands commonly appear as truncated peaks at measurement boundaries:

  • Left edge (short wavelengths): Electronic transitions, UV tails

  • Right edge (long wavelengths): Strong water O-H bands, C-H fundamentals

Parameters:

measurement_range – (min, max) wavelength range of measurement (nm).

Returns:

Self for method chaining.

Example

>>> library = ComponentLibrary(random_state=42)
>>> library.add_boundary_components_from_known((1000, 2400))
add_component(component: SpectralComponent) ComponentLibrary[source]

Add a spectral component to the library.

Parameters:

component – SpectralComponent to add.

Returns:

Self for method chaining.

add_random_component(name: str, n_bands: int = 3, wavelength_range: Tuple[float, float] = (1000, 2500), zones: List[Tuple[float, float]] | None = None) SpectralComponent[source]

Generate and add a random spectral component.

Creates a component with randomly placed absorption bands within the specified wavelength range or zones.

Parameters:
  • name – Component name.

  • n_bands – Number of absorption bands to generate.

  • wavelength_range – Overall wavelength range for band placement.

  • zones – Optional list of (min, max) wavelength zones for band centers. If None, uses default NIR-relevant zones.

Returns:

The generated SpectralComponent.

Example

>>> library = ComponentLibrary(random_state=42)
>>> component = library.add_random_component(
...     "random_compound",
...     n_bands=4,
...     wavelength_range=(1000, 2500)
... )
property component_names: List[str]

Get list of component names in order.

property components: Dict[str, SpectralComponent]

Get all components in the library.

compute_all(wavelengths: ndarray) ndarray[source]

Compute spectra for all components at given wavelengths.

Parameters:

wavelengths – Array of wavelengths in nm.

Returns:

Array of shape (n_components, n_wavelengths) containing the spectrum of each component.

Example

>>> import numpy as np
>>> from nirs4all.synthesis import ComponentLibrary
>>>
>>> library = ComponentLibrary.from_predefined(["water", "protein"])
>>> wavelengths = np.arange(1000, 2500, 2)
>>> E = library.compute_all(wavelengths)
>>> print(E.shape)
(2, 750)
classmethod from_predefined(component_names: List[str] | None = None, random_state: int | None = None) ComponentLibrary[source]

Create a library from predefined spectral components.

Parameters:
  • component_names – List of component names to include. If None, includes all predefined components.

  • random_state – Random seed for reproducibility.

Returns:

ComponentLibrary instance populated with predefined components.

Raises:

ValueError – If an unknown component name is specified.

Example

>>> library = ComponentLibrary.from_predefined(
...     ["water", "protein", "lipid"]
... )
generate_random_library(n_components: int = 5, n_bands_range: Tuple[int, int] = (2, 6)) ComponentLibrary[source]

Generate a library of random spectral components.

Parameters:
  • n_components – Number of components to generate.

  • n_bands_range – Range (min, max) for number of bands per component.

Returns:

Self for method chaining.

Example

>>> library = ComponentLibrary(random_state=42)
>>> library.generate_random_library(n_components=5, n_bands_range=(2, 5))
property n_components: int

Number of components in the library.

class nirs4all.synthesis.ComponentVariation(component: str, variation_type: VariationType, value: float | None = None, min_value: float | None = None, max_value: float | None = None, mean: float | None = None, std: float | None = None, correlated_with: str | None = None, correlation: float | None = None, compute_as: str | None = None)[source]

Bases: object

Specification for how a component’s concentration varies.

component

Name of the spectral component (must exist in library).

Type:

str

variation_type

Type of variation (FIXED, UNIFORM, NORMAL, etc.).

Type:

nirs4all.synthesis.products.VariationType

value

For FIXED type, the exact value.

Type:

float | None

min_value

For UNIFORM/NORMAL, the minimum bound.

Type:

float | None

max_value

For UNIFORM/NORMAL, the maximum bound.

Type:

float | None

mean

For NORMAL/LOGNORMAL, the distribution mean.

Type:

float | None

std

For NORMAL/LOGNORMAL, the distribution standard deviation.

Type:

float | None

correlated_with

For CORRELATED, the source component name.

Type:

str | None

correlation

For CORRELATED, the correlation coefficient.

Type:

float | None

compute_as

For COMPUTED, a string describing the computation (currently supports “remainder” for 1 - sum(others)).

Type:

str | None

Example

>>> # Fixed moisture content
>>> moisture = ComponentVariation("moisture", VariationType.FIXED, value=0.12)
>>>
>>> # Variable protein with uniform distribution
>>> protein = ComponentVariation(
...     "protein", VariationType.UNIFORM,
...     min_value=0.08, max_value=0.18
... )
>>>
>>> # Starch negatively correlated with protein
>>> starch = ComponentVariation(
...     "starch", VariationType.CORRELATED,
...     correlated_with="protein", correlation=-0.85,
...     min_value=0.55, max_value=0.72
... )
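To make the CORRELATED and "remainder" ideas concrete, here is one way such concentrations could be drawn with plain NumPy. This is a sketch under stated assumptions, not the library's sampling code: the rescaling of the latent variable into the [min_value, max_value] window (midpoint plus or minus three sigma) is a choice made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 5000

# UNIFORM: protein drawn between its bounds.
protein = rng.uniform(0.08, 0.18, size=n_samples)

# CORRELATED: mix the standardized source with independent noise so the
# latent variable has (approximately) the requested correlation.
rho = -0.85
z_protein = (protein - protein.mean()) / protein.std()
z_noise = rng.standard_normal(n_samples)
z_starch = rho * z_protein + np.sqrt(1.0 - rho**2) * z_noise

# Rescale into the allowed window (assumption: midpoint +/- 3 sigma), then clip.
lo, hi = 0.55, 0.72
starch = np.clip((lo + hi) / 2 + z_starch * (hi - lo) / 6, lo, hi)

# FIXED, then COMPUTED as "remainder" = 1 - sum(others).
moisture = np.full(n_samples, 0.12)
other = 1.0 - (protein + starch + moisture)

achieved = float(np.corrcoef(protein, starch)[0, 1])
print(round(achieved, 2))
```

With these bounds the other three components can sum to slightly more than 1, so a real implementation would also need a policy for negative remainders.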
__post_init__() None[source]

Validate specification based on variation type.

component: str
compute_as: str | None = None
correlated_with: str | None = None
correlation: float | None = None
max_value: float | None = None
mean: float | None = None
min_value: float | None = None
std: float | None = None
value: float | None = None
variation_type: VariationType
class nirs4all.synthesis.ConcentrationPrior(distribution: str = 'uniform', params: Dict[str, float] = <factory>, min_value: float = 0.0, max_value: float = 1.0)[source]

Bases: object

Prior distribution for component concentrations.

distribution

Distribution type (‘uniform’, ‘normal’, ‘lognormal’, ‘beta’).

Type:

str

params

Parameters for the distribution (distribution-specific).

Type:

Dict[str, float]

min_value

Minimum allowed concentration.

Type:

float

max_value

Maximum allowed concentration.

Type:

float

distribution: str = 'uniform'
max_value: float = 1.0
min_value: float = 0.0
params: Dict[str, float]
sample(rng: Generator, n_samples: int = 1) ndarray[source]

Sample from the concentration prior.
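The sketch below illustrates how a bounded prior of this kind could be sampled with a NumPy Generator. It is an assumption-laden illustration, not the actual method body: sample_prior is a hypothetical free function, and the keys read from params ("mean", "std", "sigma", "a", "b") are invented for this example.

```python
import numpy as np

def sample_prior(rng, distribution="uniform", params=None,
                 min_value=0.0, max_value=1.0, n_samples=1):
    """Draw concentrations and clip them to [min_value, max_value].

    The keys read from `params` are hypothetical, chosen for this sketch.
    """
    params = params or {}
    if distribution == "uniform":
        draws = rng.uniform(min_value, max_value, size=n_samples)
    elif distribution == "normal":
        draws = rng.normal(params.get("mean", 0.5), params.get("std", 0.1),
                           size=n_samples)
    elif distribution == "lognormal":
        draws = rng.lognormal(params.get("mean", -1.0), params.get("sigma", 0.5),
                              size=n_samples)
    elif distribution == "beta":
        # Beta lives on [0, 1]; stretch it onto the allowed range.
        unit = rng.beta(params.get("a", 2.0), params.get("b", 5.0), size=n_samples)
        draws = min_value + unit * (max_value - min_value)
    else:
        raise ValueError(f"Unknown distribution: {distribution!r}")
    return np.clip(draws, min_value, max_value)

rng = np.random.default_rng(0)
moisture = sample_prior(rng, "normal", {"mean": 0.12, "std": 0.03},
                        min_value=0.05, max_value=0.20, n_samples=1000)
```

Clipping is the simplest way to honor the bounds; it piles probability mass onto the limits, which rejection sampling would avoid at the cost of extra draws.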

class nirs4all.synthesis.DatasetComparisonResult(dataset_name: str, n_real_samples: int, n_synthetic_samples: int, realism_score: SpectralRealismScore, tstr_r2: float | None = None, trts_r2: float | None = None)[source]

Bases: object

Result of comparing synthetic data against a benchmark dataset.

dataset_name

Name of the benchmark dataset.

Type:

str

n_real_samples

Number of samples in real dataset.

Type:

int

n_synthetic_samples

Number of synthetic samples used.

Type:

int

realism_score

The spectral realism score.

Type:

nirs4all.synthesis.validation.SpectralRealismScore

tstr_r2

Train-on-Synthetic, Test-on-Real R² (if applicable).

Type:

float | None

trts_r2

Train-on-Real, Test-on-Synthetic R² (if applicable).

Type:

float | None
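The two R² fields follow the usual cross-domain protocol: fit a model on one dataset and score it on the other. Below is a minimal sketch using ordinary least squares on toy data. The regressor, the helper names (fit_linear, r2_score, make_data) and the data are all assumptions for illustration; the library may well use a different model.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def r2_score(X, y, coef):
    """Coefficient of determination of a fitted linear model on (X, y)."""
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
true_coef = np.array([0.5, -1.0, 2.0])

def make_data(n, noise):
    """Toy stand-in for a dataset: 3 features, linear target plus noise."""
    X = rng.normal(size=(n, 3))
    y = X @ true_coef + noise * rng.normal(size=n)
    return X, y

X_real, y_real = make_data(200, noise=0.10)
X_syn, y_syn = make_data(500, noise=0.05)

# TSTR: train on synthetic, test on real. TRTS: the reverse.
tstr_r2 = r2_score(X_real, y_real, fit_linear(X_syn, y_syn))
trts_r2 = r2_score(X_syn, y_syn, fit_linear(X_real, y_real))
```

A large gap between the two scores suggests that one dataset covers variation the other lacks.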

dataset_name: str
n_real_samples: int
n_synthetic_samples: int
realism_score: SpectralRealismScore
summary() str[source]

Return a human-readable summary.

trts_r2: float | None = None
tstr_r2: float | None = None
class nirs4all.synthesis.DatasetExporter(config: ExportConfig | None = None)[source]

Bases: object

Export synthetic datasets to various file formats.

This class provides methods for exporting synthetic NIRS datasets to files and folders compatible with nirs4all’s data loaders.

config

Export configuration settings.

Parameters:

config – Optional ExportConfig. Uses defaults if None.

Example

>>> exporter = DatasetExporter()
>>>
>>> # Export to standard folder structure
>>> path = exporter.to_folder(
...     "output/data",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=wavelengths
... )
>>>
>>> # Export to single CSV
>>> path = exporter.to_csv(
...     "output/all_data.csv",
...     X, y,
...     wavelengths=wavelengths
... )
to_csv(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, metadata: Dict[str, ndarray] | None = None, include_targets: bool = True) Path[source]

Export dataset to a single CSV file.

Creates a CSV file with features (and optionally targets) combined.

Parameters:
  • path – Output file path.

  • X – Feature matrix (n_samples, n_features).

  • y – Target values (n_samples,) or (n_samples, n_targets).

  • wavelengths – Optional wavelength values for column headers.

  • metadata – Optional dict of metadata arrays.

  • include_targets – Whether to include target column(s).

Returns:

Path to created file.

Example

>>> exporter.to_csv("data.csv", X, y, wavelengths=wavelengths)
to_folder(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, wavelengths: ndarray | None = None, metadata: Dict[str, ndarray] | None = None, random_state: int | None = None, format: Literal['standard', 'single', 'fragmented'] | None = None) Path[source]

Export dataset to a folder structure.

Creates a folder with CSV files compatible with nirs4all’s DatasetConfigs loader.

Parameters:
  • path – Output folder path.

  • X – Feature matrix (n_samples, n_features).

  • y – Target values (n_samples,) or (n_samples, n_targets).

  • train_ratio – Proportion for training set.

  • wavelengths – Optional wavelength values for column headers.

  • metadata – Optional dict of metadata arrays (same length as X).

  • random_state – Random seed for train/test split.

  • format – Override config format for this export.

Returns:

Path to created folder.

Example

>>> exporter.to_folder(
...     "data/synthetic",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=np.arange(1000, 2500, 2)
... )
to_numpy(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, compressed: bool = False) Path[source]

Export dataset to numpy .npy or .npz format.

Parameters:
  • path – Output file path (without extension).

  • X – Feature matrix (n_samples, n_features).

  • y – Target values.

  • wavelengths – Optional wavelength values.

  • compressed – Whether to use compressed format (.npz).

Returns:

Path to created file.

Example

>>> exporter.to_numpy("data", X, y, compressed=True)
class nirs4all.synthesis.DerivativeAwareForwardModelFitter(components: List['SpectralComponent'], canonical_grid: np.ndarray, target_grid: np.ndarray, derivative_order: int = 1, sg_window: int = 15, sg_polyorder: int = 2, baseline_order: int = 6, wl_shift_bounds: Tuple[float, float] = (-5.0, 5.0), ils_sigma_bounds: Tuple[float, float] = (2.0, 15.0), path_length_bounds: Tuple[float, float] = (0.5, 2.0))[source]

Bases: object

Forward model fitter for derivative-preprocessed datasets.

Key principle: Never fit derivative spectra by adding narrow bands. Instead:

  1. Fit latent physical model (raw absorbance)

  2. Apply derivative preprocessing to model output

  3. Compare in derivative space

This ensures concentrations remain physically interpretable without oscillatory artifacts from narrow compensating peaks.
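The three steps above can be sketched in a few lines. This is an illustration only, with several assumptions: np.gradient stands in for the Savitzky-Golay derivative, plain least squares stands in for the fitter's solver, and the Gaussian bands are hypothetical components. Because differentiation is linear, the differentiated raw-absorbance basis can be fitted directly to the derivative spectrum.

```python
import numpy as np

wl = np.arange(1000.0, 2501.0, 2.0)

def band(center, width):
    """Hypothetical Gaussian component band on the wl grid."""
    return np.exp(-0.5 * ((wl - center) / width) ** 2)

# Raw-absorbance component spectra, one row per component.
E = np.vstack([band(1450, 40), band(1940, 35), band(2180, 30)])
true_conc = np.array([0.6, 0.3, 0.1])

def derivative(spectra):
    """First derivative along the wavelength axis.

    np.gradient is a stand-in for Savitzky-Golay filtering.
    """
    return np.gradient(spectra, wl, axis=-1)

# Observed data: a raw mixture with an unknown offset, already differentiated.
y_deriv = derivative(true_conc @ E + 0.25)

# Keep the model in raw absorbance, push it through the same preprocessing,
# and solve for the concentrations in derivative space.
E_deriv = derivative(E)
conc, *_ = np.linalg.lstsq(E_deriv.T, y_deriv, rcond=None)

# Reconstructed raw spectrum (the constant offset is not recoverable).
fitted_raw = conc @ E
```

The constant offset disappears under differentiation, so the concentrations are recovered without any narrow compensating bands.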

components

List of SpectralComponent objects.

Type:

List['SpectralComponent']

canonical_grid

High-resolution canonical wavelength grid.

Type:

np.ndarray

target_grid

Target wavelength grid (dataset grid).

Type:

np.ndarray

derivative_order

1 for first derivative, 2 for second.

Type:

int

sg_window

Savitzky-Golay window length.

Type:

int

sg_polyorder

Savitzky-Golay polynomial order.

Type:

int

baseline_order

Number of Chebyshev baseline terms.

Type:

int

Example

>>> fitter = DerivativeAwareForwardModelFitter(
...     components=components,
...     canonical_grid=canonical_wl,
...     target_grid=dataset_wl,
...     derivative_order=1,  # First derivative
... )
>>> result = fitter.fit(derivative_spectrum)
>>> print(f"R² = {result['r_squared']:.4f}")
__post_init__()[source]

Pre-compute component spectra on canonical grid.

baseline_order: int = 6
canonical_grid: np.ndarray
components: List['SpectralComponent']
derivative_order: int = 1
fit(y_deriv: ndarray, initial_guess: ndarray | None = None) Dict[str, Any][source]

Fit forward model to derivative spectrum.

Parameters:
  • y_deriv – Target spectrum (already derivative-preprocessed).

  • initial_guess – Initial [wl_shift, ils_sigma, path_length].

Returns:

Dict with fitted parameters:

  • r_squared: Coefficient of determination

  • fitted_deriv: Fitted derivative spectrum

  • fitted_raw: Reconstructed raw spectrum

  • residuals_deriv: Fitting residuals

  • concentrations: Fitted component concentrations

  • baseline_coeffs: Fitted baseline coefficients

  • wl_shift, ils_sigma, path_length: Instrument params

Return type:

Dict[str, Any]

ils_sigma_bounds: Tuple[float, float] = (2.0, 15.0)
path_length_bounds: Tuple[float, float] = (0.5, 2.0)
sg_polyorder: int = 2
sg_window: int = 15
target_grid: np.ndarray
wl_shift_bounds: Tuple[float, float] = (-5.0, 5.0)
class nirs4all.synthesis.DetectorConfig(detector_type: DetectorType = DetectorType.INGAAS, temperature_k: float = 293.0, integration_time_ms: float = 100.0, gain: float = 1.0, noise_model: NoiseModelConfig = <factory>, apply_response_curve: bool = True, apply_nonlinearity: bool = False, nonlinearity_coefficient: float = 0.02)[source]

Bases: object

Complete detector configuration.

detector_type

Type of detector.

Type:

nirs4all.synthesis.instruments.DetectorType

temperature_k

Operating temperature in Kelvin.

Type:

float

integration_time_ms

Integration time in milliseconds.

Type:

float

gain

Amplifier gain.

Type:

float

noise_model

Noise model configuration.

Type:

nirs4all.synthesis.detectors.NoiseModelConfig

apply_response_curve

Whether to apply spectral response.

Type:

bool

apply_nonlinearity

Whether to apply detector nonlinearity.

Type:

bool

nonlinearity_coefficient

Quadratic nonlinearity coefficient.

Type:

float

apply_nonlinearity: bool = False
apply_response_curve: bool = True
detector_type: DetectorType = 'ingaas'
gain: float = 1.0
integration_time_ms: float = 100.0
noise_model: NoiseModelConfig
nonlinearity_coefficient: float = 0.02
temperature_k: float = 293.0
class nirs4all.synthesis.DetectorSimulator(config: DetectorConfig | None = None, random_state: int | None = None)[source]

Bases: object

Simulate detector effects on NIR spectra.

Applies detector spectral response, noise models, and nonlinearity to synthetic spectra.

config

Detector configuration.

rng

Random number generator.

Example

>>> config = DetectorConfig(detector_type=DetectorType.INGAAS)
>>> simulator = DetectorSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
apply(spectra: ndarray, wavelengths: ndarray, base_signal_level: float = 1.0) ndarray[source]

Apply detector effects to spectra.

Parameters:
  • spectra – Input spectra (n_samples, n_wavelengths).

  • wavelengths – Wavelength array (nm).

  • base_signal_level – Reference signal level for noise scaling.

Returns:

Spectra with detector effects applied.

class nirs4all.synthesis.DetectorSpectralResponse(detector_type: DetectorType, wavelengths: ndarray, response: ndarray, peak_wavelength: float, cutoff_wavelength: float, short_cutoff: float, peak_qe: float = 0.7)[source]

Bases: object

Spectral response curve for a detector.

Defines the wavelength-dependent sensitivity (quantum efficiency) of the detector.

detector_type

Type of detector.

Type:

nirs4all.synthesis.instruments.DetectorType

wavelengths

Wavelength grid for response curve (nm).

Type:

numpy.ndarray

response

Relative response at each wavelength (0-1).

Type:

numpy.ndarray

peak_wavelength

Wavelength of peak response (nm).

Type:

float

cutoff_wavelength

Long-wavelength cutoff (nm).

Type:

float

short_cutoff

Short-wavelength cutoff (nm).

Type:

float

peak_qe

Peak quantum efficiency (0-1).

Type:

float

cutoff_wavelength: float
detector_type: DetectorType
get_response_at(wavelengths: ndarray) ndarray[source]

Get detector response at specified wavelengths.

Parameters:

wavelengths – Wavelengths to evaluate (nm).

Returns:

Detector response at each wavelength.
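A tabulated response curve usually has to be resampled onto the spectrum's own wavelength grid. A minimal sketch follows, assuming linear interpolation and zero response outside the tabulated range. Both are assumptions about behavior, and the curve values and the response_at helper are hypothetical.

```python
import numpy as np

# Hypothetical response curve on a coarse grid (nm, relative response 0-1).
curve_wl = np.array([900.0, 1000.0, 1300.0, 1550.0, 1650.0, 1700.0])
curve_resp = np.array([0.0, 0.6, 0.9, 1.0, 0.5, 0.0])

def response_at(wavelengths):
    """Linearly interpolate the tabulated curve; zero outside the cutoffs."""
    return np.interp(wavelengths, curve_wl, curve_resp, left=0.0, right=0.0)

query = np.array([800.0, 1150.0, 1550.0, 1800.0])
resp = response_at(query)   # values: 0.0, 0.75, 1.0, 0.0
```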

peak_qe: float = 0.7
peak_wavelength: float
response: ndarray
short_cutoff: float
wavelengths: ndarray
class nirs4all.synthesis.DetectorType(value)[source]

Bases: str, Enum

Types of NIR detectors.

INGAAS = 'ingaas'
INGAAS_EXTENDED = 'ingaas_ext'
MCT = 'mct'
MEMS = 'mems'
PBS = 'pbs'
PBSE = 'pbse'
SI = 'si'
class nirs4all.synthesis.DomainCategory(value)[source]

Bases: str, Enum

Top-level domain categories.

AGRICULTURE = 'agriculture'
BEVERAGE = 'beverage'
BIOMEDICAL = 'biomedical'
ENVIRONMENTAL = 'environmental'
FOOD = 'food'
PETROCHEMICAL = 'petrochemical'
PHARMACEUTICAL = 'pharmaceutical'
POLYMER = 'polymer'
TEXTILE = 'textile'
class nirs4all.synthesis.DomainConfig(name: str, category: DomainCategory, description: str = '', typical_components: List[str] = <factory>, component_weights: Dict[str, float] | None = None, concentration_priors: Dict[str, ConcentrationPrior] = <factory>, wavelength_range: Tuple[float, float] = (1000, 2500), n_components_range: Tuple[int, int] = (3, 8), noise_level: str = 'medium', measurement_mode: str = 'reflectance', typical_sample_types: List[str] = <factory>, complexity: str = 'realistic', additional_params: Dict[str, Any] = <factory>)[source]

Bases: object

Configuration for a specific application domain.

Encapsulates all domain-specific parameters needed for generating realistic synthetic NIRS data.

name

Human-readable domain name.

Type:

str

category

Domain category (agriculture, pharmaceutical, etc.).

Type:

nirs4all.synthesis.domains.DomainCategory

description

Brief description of the domain.

Type:

str

typical_components

List of predefined component names commonly found.

Type:

List[str]

component_weights

Relative importance of each component (for selection).

Type:

Dict[str, float] | None

concentration_priors

Per-component concentration distributions.

Type:

Dict[str, nirs4all.synthesis.domains.ConcentrationPrior]

wavelength_range

Typical measurement range (nm).

Type:

Tuple[float, float]

n_components_range

Range of number of components per sample.

Type:

Tuple[int, int]

noise_level

Typical noise level (‘low’, ‘medium’, ‘high’).

Type:

str

measurement_mode

Typical measurement geometry.

Type:

str

typical_sample_types

Examples of sample types in this domain.

Type:

List[str]

complexity

Overall complexity level for generation.

Type:

str

additional_params

Domain-specific additional parameters.

Type:

Dict[str, Any]

additional_params: Dict[str, Any]
category: DomainCategory
complexity: str = 'realistic'
component_weights: Dict[str, float] | None = None
concentration_priors: Dict[str, ConcentrationPrior]
description: str = ''
get_component_weights() Dict[str, float][source]

Get normalized component weights for selection.

measurement_mode: str = 'reflectance'
n_components_range: Tuple[int, int] = (3, 8)
name: str
noise_level: str = 'medium'
sample_components(rng: Generator, n_components: int | None = None) List[str][source]

Sample components for a sample based on domain priors.

Parameters:
  • rng – Random number generator.

  • n_components – Number of components. If None, samples from range.

Returns:

List of component names.
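Weight normalization and weighted selection, as described for get_component_weights() and sample_components(), could look like the sketch below. The free functions are hypothetical stand-ins for those methods. Defaulting missing weights to 1.0, treating the range as inclusive, and drawing without replacement are all assumptions of this example.

```python
import numpy as np

def normalized_weights(components, weights=None):
    """Return weights summing to 1; missing entries default to 1.0 (assumption)."""
    raw = np.array([(weights or {}).get(name, 1.0) for name in components],
                   dtype=float)
    return dict(zip(components, raw / raw.sum()))

def sample_components(rng, components, weights=None,
                      n_components_range=(3, 8), n_components=None):
    """Pick distinct component names, favoring heavier weights."""
    if n_components is None:
        low, high = n_components_range
        n_components = int(rng.integers(low, high + 1))   # inclusive upper bound
    n_components = min(n_components, len(components))
    probs = list(normalized_weights(components, weights).values())
    picked = rng.choice(components, size=n_components, replace=False, p=probs)
    return [str(name) for name in picked]

rng = np.random.default_rng(7)
typical = ["water", "protein", "lipid", "starch", "cellulose"]
chosen = sample_components(rng, typical, {"water": 3.0, "protein": 2.0},
                           n_components=3)
```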

sample_concentrations(rng: Generator, components: List[str], n_samples: int = 1) ndarray[source]

Sample concentrations for selected components.

Parameters:
  • rng – Random number generator.

  • components – List of component names.

  • n_samples – Number of samples.

Returns:

Concentration matrix (n_samples, n_components).

typical_components: List[str]
typical_sample_types: List[str]
wavelength_range: Tuple[float, float] = (1000, 2500)
class nirs4all.synthesis.DomainInference(domain_name: str = 'unknown', category: str = 'unknown', confidence: float = 0.0, detected_components: List[str] = <factory>, alternative_domains: Dict[str, float] = <factory>)[source]

Bases: object

Results of application domain inference.

domain_name

Best matching domain name.

Type:

str

category

Domain category.

Type:

str

confidence

Confidence score (0-1).

Type:

float

detected_components

Components detected from peak analysis.

Type:

List[str]

alternative_domains

Other possible domains with scores.

Type:

Dict[str, float]

alternative_domains: Dict[str, float]
category: str = 'unknown'
confidence: float = 0.0
detected_components: List[str]
domain_name: str = 'unknown'
class nirs4all.synthesis.EMSCConfig(polynomial_order: int = 2, multiplicative_scatter_std: float = 0.15, additive_scatter_std: float = 0.05, include_wavelength_terms: bool = True, wavelength_coef_std: float = 0.02, reference_spectrum: ndarray | None = None)[source]

Bases: object

Configuration for EMSC-style scattering transformation.

EMSC models scattering distortion as: x = a + b*x_ref + d*λ + e*λ² + …

where a is the additive scatter offset, b is the multiplicative scatter factor, and the wavelength terms (d*λ, e*λ², …) model baseline curvature due to scattering.
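The model equation can be applied directly with NumPy broadcasting. The sketch below is an illustration, not the library's transformation: the reference spectrum is a hypothetical Gaussian band, the wavelength axis is centered and scaled for numerical convenience, and b is assumed to be drawn around 1.0. The standard deviations reuse the class defaults.

```python
import numpy as np

rng = np.random.default_rng(1)
wl = np.arange(1000.0, 2501.0, 2.0)
lam = (wl - wl.mean()) / (wl.max() - wl.min())      # centered, scaled axis

x_ref = np.exp(-0.5 * ((wl - 1940.0) / 40.0) ** 2)  # hypothetical reference
n_samples = 4

# Per-sample coefficients as column vectors so they broadcast over wavelengths.
a = rng.normal(0.0, 0.05, size=(n_samples, 1))      # additive offset
b = rng.normal(1.0, 0.15, size=(n_samples, 1))      # multiplicative factor
d = rng.normal(0.0, 0.02, size=(n_samples, 1))      # linear wavelength term
e = rng.normal(0.0, 0.02, size=(n_samples, 1))      # quadratic wavelength term

# x = a + b*x_ref + d*lambda + e*lambda^2  (polynomial order 2)
X = a + b * x_ref + d * lam + e * lam**2
```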

polynomial_order

Order of polynomial for wavelength-dependent scatter.

Type:

int

multiplicative_scatter_std

Std dev of multiplicative scatter factor b.

Type:

float

additive_scatter_std

Std dev of additive scatter offset a.

Type:

float

include_wavelength_terms

Whether to include λ, λ² terms.

Type:

bool

wavelength_coef_std

Std dev of wavelength coefficient.

Type:

float

reference_spectrum

Optional reference spectrum for EMSC.

Type:

numpy.ndarray | None

additive_scatter_std: float = 0.05
include_wavelength_terms: bool = True
multiplicative_scatter_std: float = 0.15
polynomial_order: int = 2
reference_spectrum: ndarray | None = None
wavelength_coef_std: float = 0.02
class nirs4all.synthesis.EdgeArtifactsConfig(enable_detector_rolloff: bool = False, enable_stray_light: bool = False, enable_truncated_peaks: bool = False, enable_edge_curvature: bool = False, detector_model: str = 'generic_nir', rolloff_severity: float = 0.3, stray_fraction: float = 0.001, stray_wavelength_dependent: bool = True, left_peak_amplitude: float = 0.0, right_peak_amplitude: float = 0.0, curvature_type: str = 'concave', left_curvature_severity: float = 0.0, right_curvature_severity: float = 0.0)[source]

Bases: object

Configuration for edge artifact effects in synthetic NIRS spectra.

Edge artifacts are common in NIR spectra and arise from various sources:

  • Detector sensitivity roll-off at spectral extremes

  • Stray light contamination

  • Truncated absorption peaks at measurement boundaries

  • Baseline curvature/bending at spectrum edges

These artifacts are well-documented in the literature:

  • Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press. Chapters 4-5.

  • Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis. CRC Press. Chapters on instrumentation.

  • ASTM E1944-98(2017): Standard Practice for Describing and Measuring Performance of NIR Instruments.

enable_detector_rolloff

Enable detector sensitivity roll-off.

Type:

bool

enable_stray_light

Enable stray light effects.

Type:

bool

enable_truncated_peaks

Enable truncated absorption peaks.

Type:

bool

enable_edge_curvature

Enable baseline curvature at edges.

Type:

bool

detector_model

Detector model for roll-off (‘generic_nir’, ‘ingaas’, ‘pbs’, ‘silicon_ccd’). Defaults to ‘generic_nir’.

Type:

str

rolloff_severity

Severity of detector roll-off (0.0-1.0).

Type:

float

stray_fraction

Stray light fraction (0.0-0.02 typical).

Type:

float

stray_wavelength_dependent

Whether stray light varies with wavelength.

Type:

bool

left_peak_amplitude

Amplitude of truncated peak at low wavelength edge.

Type:

float

right_peak_amplitude

Amplitude of truncated peak at high wavelength edge.

Type:

float

curvature_type

Type of edge curvature (‘concave’, ‘convex’, ‘asymmetric’).

Type:

str

left_curvature_severity

Severity of left edge curvature (0.0-1.0).

Type:

float

right_curvature_severity

Severity of right edge curvature (0.0-1.0).

Type:

float
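As an illustration of why even a small stray_fraction matters, the sketch below applies the textbook stray-light relation T_obs = (T + s) / (1 + s) to a few absorbance values. This is a generic formula used here for illustration. Whether the library implements stray light this way is an assumption, and apply_stray_light is a hypothetical helper.

```python
import numpy as np

def apply_stray_light(absorbance, stray_fraction):
    """Mix a constant stray-light fraction into the detected signal.

    Uses the textbook relation T_obs = (T + s) / (1 + s).
    """
    transmittance = 10.0 ** (-absorbance)
    observed = (transmittance + stray_fraction) / (1.0 + stray_fraction)
    return -np.log10(observed)

true_abs = np.array([0.5, 1.0, 2.0, 3.0])
observed_abs = apply_stray_light(true_abs, 0.001)
print(np.round(true_abs - observed_abs, 3))   # absorbance lost per level
```

The loss grows with absorbance, so strongly absorbing bands are compressed far more than weak ones.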

curvature_type: str = 'concave'
detector_model: str = 'generic_nir'
enable_detector_rolloff: bool = False
enable_edge_curvature: bool = False
enable_stray_light: bool = False
enable_truncated_peaks: bool = False
left_curvature_severity: float = 0.0
left_peak_amplitude: float = 0.0
right_curvature_severity: float = 0.0
right_peak_amplitude: float = 0.0
rolloff_severity: float = 0.3
stray_fraction: float = 0.001
stray_wavelength_dependent: bool = True
class nirs4all.synthesis.EnvironmentalEffectsConfig(temperature: TemperatureConfig = <factory>, moisture: MoistureConfig = <factory>, enable_temperature: bool = True, enable_moisture: bool = True)[source]

Bases: object

Combined configuration for all environmental effects.

temperature

Temperature effect configuration.

Type:

nirs4all.synthesis.environmental.TemperatureConfig

moisture

Moisture effect configuration.

Type:

nirs4all.synthesis.environmental.MoistureConfig

enable_temperature

Whether to apply temperature effects.

Type:

bool

enable_moisture

Whether to apply moisture effects.

Type:

bool

enable_moisture: bool = True
enable_temperature: bool = True
moisture: MoistureConfig
temperature: TemperatureConfig
class nirs4all.synthesis.EnvironmentalInference(estimated_temperature_variation: float = 0.0, has_temperature_effects: bool = False, estimated_moisture_variation: float = 0.0, has_moisture_effects: bool = False, water_band_shift: float = 0.0)[source]

Bases: object

Results of environmental effects inference.

estimated_temperature_variation

Estimated temperature variation (°C).

Type:

float

has_temperature_effects

Whether temperature effects are detectable.

Type:

bool

estimated_moisture_variation

Estimated moisture variation.

Type:

float

has_moisture_effects

Whether moisture effects are detectable.

Type:

bool

water_band_shift

Detected shift in water bands (nm).

Type:

float

estimated_moisture_variation: float = 0.0
estimated_temperature_variation: float = 0.0
has_moisture_effects: bool = False
has_temperature_effects: bool = False
water_band_shift: float = 0.0
class nirs4all.synthesis.ExportConfig(format: Literal['standard', 'single', 'fragmented'] = 'standard', separator: str = ';', float_precision: int = 6, include_headers: bool = True, include_index: bool = False, compression: Literal['gzip', 'zip'] | None = None, file_extension: str = '.csv')[source]

Bases: object

Configuration for dataset export.

format

Export format (‘standard’, ‘single’, ‘fragmented’):

  • ‘standard’: Separate Xcal, Ycal, Xval, Yval files.

  • ‘single’: All data in one file with partition column.

  • ‘fragmented’: Multiple small files (for loader testing).

Type:

Literal['standard', 'single', 'fragmented']

separator

CSV delimiter character.

Type:

str

float_precision

Decimal precision for floating point values.

Type:

int

include_headers

Whether to include column headers in CSV.

Type:

bool

include_index

Whether to include row index.

Type:

bool

compression

Optional compression (‘gzip’, ‘zip’, None).

Type:

Literal['gzip', 'zip'] | None

file_extension

File extension to use.

Type:

str
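To show what the ‘standard’ layout amounts to, the sketch below writes a small split dataset with NumPy alone, using the default separator and float precision. It is an approximation under assumptions: the exact file names, the absence of headers, and the shuffle-then-cut split are choices made for this example, not confirmed exporter behavior.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
y = rng.normal(size=10)

# Shuffle once, then cut at the training ratio.
order = rng.permutation(len(X))
n_train = int(round(0.8 * len(X)))
train, test = order[:n_train], order[n_train:]

out_dir = tempfile.mkdtemp()
parts = {"Xcal": X[train], "Ycal": y[train], "Xval": X[test], "Yval": y[test]}
for name, array in parts.items():
    # delimiter=';' and six decimals mirror the ExportConfig defaults.
    np.savetxt(os.path.join(out_dir, name + ".csv"), array,
               delimiter=";", fmt="%.6f")

written = sorted(os.listdir(out_dir))
```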

compression: Literal['gzip', 'zip'] | None = None
file_extension: str = '.csv'
float_precision: int = 6
format: Literal['standard', 'single', 'fragmented'] = 'standard'
include_headers: bool = True
include_index: bool = False
separator: str = ';'
class nirs4all.synthesis.FeatureConfig(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', n_components: int | None = None, component_names: List[str] | None = None)[source]

Bases: object

Configuration for spectral feature generation.

wavelength_start

Start wavelength in nm.

Type:

float

wavelength_end

End wavelength in nm.

Type:

float

wavelength_step

Wavelength step in nm.

Type:

float

complexity

Complexity level affecting noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.

Type:

Literal['simple', 'realistic', 'complex']

n_components

Number of spectral components (auto if None).

Type:

int | None

component_names

Specific predefined components to use. If None, uses default components based on complexity.

Type:

List[str] | None

complexity: Literal['simple', 'realistic', 'complex'] = 'simple'
component_names: List[str] | None = None
n_components: int | None = None
wavelength_end: float = 2500.0
wavelength_start: float = 1000.0
wavelength_step: float = 2.0
class nirs4all.synthesis.FittedParameters(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, global_slope_mean: float = 0.0, global_slope_std: float = 0.02, noise_base: float = 0.001, noise_signal_dep: float = 0.005, path_length_std: float = 0.05, baseline_amplitude: float = 0.02, scatter_alpha_std: float = 0.05, scatter_beta_std: float = 0.01, tilt_std: float = 0.01, complexity: str = 'realistic', source_name: str = '', source_properties: SpectralProperties | None = None, inferred_instrument: str = 'unknown', instrument_inference: InstrumentInference | None = None, measurement_mode: str = 'transmittance', measurement_mode_confidence: float = 0.0, inferred_domain: str = 'unknown', domain_inference: DomainInference | None = None, environmental_inference: EnvironmentalInference | None = None, temperature_config: Dict[str, Any] = <factory>, moisture_config: Dict[str, Any] = <factory>, scattering_inference: ScatteringInference | None = None, particle_size_config: Dict[str, Any] = <factory>, emsc_config: Dict[str, Any] = <factory>, edge_artifact_inference: EdgeArtifactInference | None = None, edge_artifacts_config: Dict[str, Any] = <factory>, boundary_components_config: Dict[str, Any] = <factory>, preprocessing_inference: PreprocessingInference | None = None, preprocessing_type: str = 'raw_absorbance', is_preprocessed: bool = False, detected_components: List[str] = <factory>, suggested_n_components: int = 5)[source]

Bases: object

Parameters fitted from real data for synthetic generation.

This dataclass contains all parameters needed to configure a SyntheticNIRSGenerator to produce spectra similar to a real dataset, including Phase 1-4 enhanced features.

# Basic wavelength grid
wavelength_start

Start wavelength (nm).

Type:

float

wavelength_end

End wavelength (nm).

Type:

float

wavelength_step

Wavelength step (nm).

Type:

float

# Slope and baseline parameters
global_slope_mean

Mean global slope.

Type:

float

global_slope_std

Slope standard deviation.

Type:

float

baseline_amplitude

Baseline drift amplitude.

Type:

float

# Noise parameters
noise_base

Base noise level.

Type:

float

noise_signal_dep

Signal-dependent noise factor.

Type:

float

# Scatter parameters
path_length_std

Path length variation.

Type:

float

scatter_alpha_std

Multiplicative scatter std.

Type:

float

scatter_beta_std

Additive scatter std.

Type:

float

tilt_std

Spectral tilt standard deviation.

Type:

float

# Complexity
complexity

Suggested complexity level.

Type:

str

# Source metadata
source_name

Name of source dataset.

Type:

str

source_properties

Full SpectralProperties of source.

Type:

nirs4all.synthesis.fitter.SpectralProperties | None

# Phase 1-4 Enhanced Parameters
# Instrument inference
inferred_instrument

Inferred instrument archetype.

Type:

str

instrument_inference

Full instrument inference result.

Type:

nirs4all.synthesis.fitter.InstrumentInference | None

# Measurement mode
measurement_mode

Inferred measurement mode.

Type:

str

measurement_mode_confidence

Confidence of inference.

Type:

float

# Domain inference
inferred_domain

Inferred application domain.

Type:

str

domain_inference

Full domain inference result.

Type:

nirs4all.synthesis.fitter.DomainInference | None

# Environmental effects
environmental_inference

Environmental effects inference.

Type:

nirs4all.synthesis.fitter.EnvironmentalInference | None

temperature_config

Suggested temperature config parameters.

Type:

Dict[str, Any]

moisture_config

Suggested moisture config parameters.

Type:

Dict[str, Any]

# Scattering effects
scattering_inference

Scattering effects inference.

Type:

nirs4all.synthesis.fitter.ScatteringInference | None

particle_size_config

Suggested particle size config parameters.

Type:

Dict[str, Any]

emsc_config

Suggested EMSC config parameters.

Type:

Dict[str, Any]

# Detected components for procedural generation
detected_components

List of detected/inferred component names.

Type:

List[str]

suggested_n_components

Suggested number of components.

Type:

int

baseline_amplitude: float = 0.02
boundary_components_config: Dict[str, Any]
complexity: str = 'realistic'
detected_components: List[str]
domain_inference: DomainInference | None = None
edge_artifact_inference: EdgeArtifactInference | None = None
edge_artifacts_config: Dict[str, Any]
emsc_config: Dict[str, Any]
environmental_inference: EnvironmentalInference | None = None
classmethod from_dict(data: Dict[str, Any]) FittedParameters[source]

Create FittedParameters from a dictionary.

Parameters:

data – Dictionary with parameter values.

Returns:

FittedParameters instance.

global_slope_mean: float = 0.0
global_slope_std: float = 0.02
inferred_domain: str = 'unknown'
inferred_instrument: str = 'unknown'
instrument_inference: InstrumentInference | None = None
is_preprocessed: bool = False
classmethod load(path: str) FittedParameters[source]

Load parameters from JSON file.

Parameters:

path – Input file path.

Returns:

FittedParameters instance.

measurement_mode: str = 'transmittance'
measurement_mode_confidence: float = 0.0
moisture_config: Dict[str, Any]
noise_base: float = 0.001
noise_signal_dep: float = 0.005
particle_size_config: Dict[str, Any]
path_length_std: float = 0.05
preprocessing_inference: PreprocessingInference | None = None
preprocessing_type: str = 'raw_absorbance'
save(path: str) None[source]

Save parameters to JSON file.

Parameters:

path – Output file path.

scatter_alpha_std: float = 0.05
scatter_beta_std: float = 0.01
scattering_inference: ScatteringInference | None = None
source_name: str = ''
source_properties: SpectralProperties | None = None
suggested_n_components: int = 5
summary() str[source]

Generate a human-readable summary of fitted parameters.

Returns:

Multi-line summary string.

temperature_config: Dict[str, Any]
tilt_std: float = 0.01
to_dict() Dict[str, Any][source]

Convert all parameters to a dictionary.

Returns:

Dictionary with all parameter values.
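The to_dict(), from_dict(), save() and load() methods together form a JSON round trip. The pattern can be sketched with a cut-down dataclass. MiniParams below is a hypothetical stand-in with only a few fields, and ignoring unknown keys on load is an assumption of this sketch, not documented behavior.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

@dataclass
class MiniParams:
    """Hypothetical cut-down stand-in for FittedParameters."""
    wavelength_start: float = 1000.0
    wavelength_end: float = 2500.0
    complexity: str = "realistic"
    detected_components: List[str] = field(default_factory=list)
    emsc_config: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiniParams":
        # Drop unknown keys so files with extra fields still load.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.to_dict(), fh, indent=2)

    @classmethod
    def load(cls, path: str) -> "MiniParams":
        with open(path, encoding="utf-8") as fh:
            return cls.from_dict(json.load(fh))

params = MiniParams(detected_components=["water", "protein"],
                    emsc_config={"polynomial_order": 2})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fitted.json")
    params.save(path)
    restored = MiniParams.load(path)
```

Nested inference objects would need their own conversion to plain dictionaries before they can be written as JSON.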

to_full_config() Dict[str, Any][source]

Convert all fitted parameters to a comprehensive configuration.

This includes all Phase 1-4 parameters for complete synthetic data generation matching the source dataset.

Returns:

Dictionary with all configuration parameters.

Example

>>> params = fitter.fit(X_real)
>>> config = params.to_full_config()
>>> # Use with builder pattern or advanced configuration
to_generator_kwargs() Dict[str, Any][source]

Convert fitted parameters to kwargs for SyntheticNIRSGenerator.

Returns:

Dictionary of keyword arguments.

Example

>>> params = fitter.fit(X_real)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())
wavelength_end: float = 2500.0
wavelength_start: float = 1000.0
wavelength_step: float = 2.0
class nirs4all.synthesis.ForwardModelFitter(components: List['SpectralComponent'], canonical_grid: np.ndarray, target_grid: np.ndarray, baseline_order: int = 4, wl_shift_bounds: Tuple[float, float] = (-5.0, 5.0), ils_sigma_bounds: Tuple[float, float] = (2.0, 15.0), path_length_bounds: Tuple[float, float] = (0.5, 2.0))[source]

Bases: object

Variable projection fitter for physical forward model.

Fits a physical mixture model to observed spectra by separating:

  • Linear params: concentrations, baseline coefficients (solved via NNLS/lsq)

  • Nonlinear params: wl_shift, ils_sigma, path_length (solved via optimization)

This approach is numerically stable and physically interpretable.
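
The linear/nonlinear split can be illustrated on a toy problem. This is a minimal sketch, not the class's implementation: it assumes a single Gaussian band (the hypothetical `band` helper), one nonlinear parameter (a wavelength shift), a first-order baseline, ordinary least squares in place of NNLS, and a grid scan in place of a numerical optimizer.

```python
import numpy as np


def band(wl, center, sigma=20.0):
    """Gaussian absorption band (hypothetical single component)."""
    return np.exp(-0.5 * ((wl - center) / sigma) ** 2)


def scaled_axis(wl):
    """Wavelength axis centred and scaled, used for the baseline slope term."""
    return (wl - wl.mean()) / (wl.max() - wl.min())


def solve_linear(y, wl, shift):
    """Inner step: with the shift fixed, solve concentration and baseline."""
    design = np.column_stack([band(wl, 1450.0 + shift), np.ones_like(wl), scaled_axis(wl)])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coeffs
    return coeffs, float(residual @ residual)


def fit(y, wl, shifts):
    """Outer step: scan the nonlinear parameter and keep the best projection."""
    best = min(shifts, key=lambda s: solve_linear(y, wl, s)[1])
    coeffs, sse = solve_linear(y, wl, best)
    return {"wl_shift": float(best), "concentration": float(coeffs[0]), "sse": sse}


wl = np.arange(1300.0, 1600.0, 2.0)
y_obs = 0.7 * band(wl, 1453.0) + 0.05 + 0.02 * scaled_axis(wl)
result = fit(y_obs, wl, np.arange(-5.0, 5.25, 0.25))
```

Because the linear parameters are eliminated in closed form at every candidate shift, the outer search only has to explore the low-dimensional nonlinear space.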

components

List of SpectralComponent objects.

Type:

List['SpectralComponent']

canonical_grid

High-resolution canonical wavelength grid.

Type:

np.ndarray

target_grid

Target wavelength grid (dataset grid).

Type:

np.ndarray

baseline_order

Number of Chebyshev baseline terms.

Type:

int

wl_shift_bounds

Bounds for wavelength shift parameter.

Type:

Tuple[float, float]

ils_sigma_bounds

Bounds for ILS sigma parameter.

Type:

Tuple[float, float]

path_length_bounds

Bounds for path length parameter.

Type:

Tuple[float, float]

Example

>>> from nirs4all.synthesis._constants import get_predefined_components
>>> components = [get_predefined_components()[n] for n in ['water', 'protein']]
>>> fitter = ForwardModelFitter(
...     components=components,
...     canonical_grid=np.linspace(400, 2500, 4200),
...     target_grid=dataset_wavelengths,
... )
>>> result = fitter.fit(spectrum)
>>> print(f"R² = {result['r_squared']:.4f}")
__post_init__()[source]

Pre-compute component spectra on canonical grid.

baseline_order: int = 4
canonical_grid: np.ndarray
components: List['SpectralComponent']
fit(y: ndarray, initial_guess: ndarray | None = None) Dict[str, Any][source]

Fit forward model to target spectrum.

Parameters:
  • y – Target spectrum.

  • initial_guess – Initial [wl_shift, ils_sigma, path_length].

Returns:

Dict with fitted parameters:

  • r_squared: Coefficient of determination

  • fitted: Fitted spectrum

  • residuals: Fitting residuals

  • concentrations: Fitted component concentrations

  • baseline_coeffs: Fitted baseline coefficients

  • wl_shift, ils_sigma, path_length: Instrument params

Return type:

Dict[str, Any]

ils_sigma_bounds: Tuple[float, float] = (2.0, 15.0)
path_length_bounds: Tuple[float, float] = (0.5, 2.0)
target_grid: np.ndarray
wl_shift_bounds: Tuple[float, float] = (-5.0, 5.0)
class nirs4all.synthesis.FunctionalGroupType(value)[source]

Bases: str, Enum

Types of functional groups for component generation.

AMINE = 'amine'
AROMATIC_CH = 'aromatic_ch'
CARBONYL = 'carbonyl'
CARBOXYL = 'carboxyl'
HYDROXYL = 'hydroxyl'
METHYL = 'methyl'
METHYLENE = 'methylene'
THIOL = 'thiol'
VINYL = 'vinyl'
WATER = 'water'
class nirs4all.synthesis.InstrumentArchetype(name: str, category: ~nirs4all.synthesis.instruments.InstrumentCategory, detector_type: ~nirs4all.synthesis.instruments.DetectorType, monochromator_type: ~nirs4all.synthesis.instruments.MonochromatorType, wavelength_range: ~typing.Tuple[float, float], spectral_resolution: float = 8.0, wavelength_accuracy: float = 0.5, photometric_noise: float = 0.0001, photometric_range: ~typing.Tuple[float, float] = (0.0, 3.0), snr: float = 10000.0, stray_light: float = 0.0001, warm_up_drift: float = 0.1, temperature_sensitivity: float = 0.01, scan_speed: float = 1.0, integration_time_ms: float = 100.0, optical_path: str = 'transmission', multi_sensor: ~nirs4all.synthesis.instruments.MultiSensorConfig = <factory>, multi_scan: ~nirs4all.synthesis.instruments.MultiScanConfig = <factory>, description: str = '')[source]

Bases: object

Parameterized NIR instrument simulation.

Represents a complete instrument model with optical, electronic, and measurement characteristics. Can be used to generate realistic synthetic spectra that match specific instrument types.

name

Instrument archetype name.

Type:

str

category

Instrument category (benchtop, handheld, etc.).

Type:

nirs4all.synthesis.instruments.InstrumentCategory

detector_type

Primary detector type.

Type:

nirs4all.synthesis.instruments.DetectorType

monochromator_type

Wavelength selection mechanism.

Type:

nirs4all.synthesis.instruments.MonochromatorType

wavelength_range

Nominal wavelength range (nm).

Type:

Tuple[float, float]

spectral_resolution

Spectral resolution (FWHM in nm).

Type:

float

wavelength_accuracy

Wavelength accuracy (nm).

Type:

float

photometric_noise

Photometric noise level (AU).

Type:

float

photometric_range

Photometric range (min, max AU).

Type:

Tuple[float, float]

snr

Signal-to-noise ratio at 1 AU.

Type:

float

stray_light

Stray light level (fraction).

Type:

float

warm_up_drift

Intensity drift during warm-up (%/hour).

Type:

float

temperature_sensitivity

Wavelength shift per °C.

Type:

float

scan_speed

Scans per second.

Type:

float

integration_time_ms

Integration time in milliseconds.

Type:

float

optical_path

Optical path type (‘transmission’, ‘reflection’, etc.).

Type:

str

multi_sensor

Multi-sensor configuration.

Type:

nirs4all.synthesis.instruments.MultiSensorConfig

multi_scan

Multi-scan averaging configuration.

Type:

nirs4all.synthesis.instruments.MultiScanConfig

description

Human-readable description.

Type:

str

category: InstrumentCategory
description: str = ''
detector_type: DetectorType
get_noise_model_params() Dict[str, float][source]

Get noise model parameters based on detector type.

integration_time_ms: float = 100.0
monochromator_type: MonochromatorType
multi_scan: MultiScanConfig
multi_sensor: MultiSensorConfig
name: str
optical_path: str = 'transmission'
photometric_noise: float = 0.0001
photometric_range: Tuple[float, float] = (0.0, 3.0)
scan_speed: float = 1.0
snr: float = 10000.0
spectral_resolution: float = 8.0
stray_light: float = 0.0001
temperature_sensitivity: float = 0.01
warm_up_drift: float = 0.1
wavelength_accuracy: float = 0.5
wavelength_range: Tuple[float, float]
class nirs4all.synthesis.InstrumentCategory(value)[source]

Bases: str, Enum

Categories of NIR instruments.

BENCHTOP = 'benchtop'
DIODE_ARRAY = 'diode_array'
EMBEDDED = 'embedded'
FILTER = 'filter'
FT_NIR = 'ft_nir'
HANDHELD = 'handheld'
PROCESS = 'process'
class nirs4all.synthesis.InstrumentChain(wl_shift: float = 0.0, wl_stretch: float = 1.0, ils_sigma: float = 4.0, stray_light: float = 0.001, gain: float = 1.0, offset: float = 0.0)[source]

Bases: object

Forward instrument chain: canonical grid → dataset grid.

Applies the complete measurement chain to transform a high-resolution physical spectrum to the observed instrument grid.

Chain:
  1. Wavelength warp (shift + stretch)

  2. ILS convolution (Gaussian smoothing)

  3. Stray light / gain / offset

  4. Resample to target grid
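
The four steps above can be sketched with NumPy. The formulas here are assumptions made for illustration, not the library's code: the stretch is applied about the grid centre, the canonical grid is taken to be uniform, edges are padded before convolution, and stray light is modelled as a constant fraction added in the transmittance domain. `apply_chain` is a hypothetical helper name.

```python
import numpy as np


def apply_chain(spectrum, canonical_wl, target_wl, wl_shift=0.0, wl_stretch=1.0,
                ils_sigma=4.0, stray_light=0.001, gain=1.0, offset=0.0):
    """Illustrative forward chain on an absorbance spectrum (assumed formulas)."""
    # 1. Wavelength warp: stretch about the grid centre, then shift.
    centre = canonical_wl.mean()
    warped_wl = centre + wl_stretch * (canonical_wl - centre) + wl_shift

    # 2. ILS convolution with a unit-sum Gaussian kernel (sigma in nm).
    step = canonical_wl[1] - canonical_wl[0]
    half = int(np.ceil(4 * ils_sigma / step))
    kernel_x = np.arange(-half, half + 1) * step
    kernel = np.exp(-0.5 * (kernel_x / ils_sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(spectrum, half, mode="edge")  # avoids roll-off at the edges
    smoothed = np.convolve(padded, kernel, mode="valid")

    # 3. Stray light in the transmittance domain, then gain and offset.
    transmittance = 10.0 ** (-smoothed)
    observed_t = (transmittance + stray_light) / (1.0 + stray_light)
    observed = gain * (-np.log10(observed_t)) + offset

    # 4. Resample onto the instrument grid.
    return np.interp(target_wl, warped_wl, observed)


canonical_wl = np.arange(1000.0, 2000.0, 0.5)
spectrum = np.exp(-0.5 * ((canonical_wl - 1450.0) / 10.0) ** 2)
target_wl = np.arange(1100.0, 1900.0, 2.0)
observed = apply_chain(spectrum, canonical_wl, target_wl, wl_shift=2.0, ils_sigma=5.0)
```

In this sketch the band maximum moves from 1450 nm to 1452 nm and drops below 1 AU because the ILS broadening spreads the band area over a wider range.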

wl_shift

Wavelength shift in nm.

Type:

float

wl_stretch

Wavelength scale factor.

Type:

float

ils_sigma

Instrument line shape Gaussian sigma in nm.

Type:

float

stray_light

Stray light fraction.

Type:

float

gain

Photometric gain.

Type:

float

offset

Photometric offset.

Type:

float

Example

>>> chain = InstrumentChain(wl_shift=2.0, ils_sigma=5.0)
>>> spectrum_obs = chain.apply(spectrum_phys, canonical_wl, target_wl)
apply(spectrum: ndarray, canonical_wl: ndarray, target_wl: ndarray) ndarray[source]

Apply full instrument chain.

Parameters:
  • spectrum – Input spectrum on canonical grid.

  • canonical_wl – Canonical wavelength grid (nm).

  • target_wl – Target wavelength grid (nm).

Returns:

Transformed spectrum on target grid.

gain: float = 1.0
ils_sigma: float = 4.0
offset: float = 0.0
stray_light: float = 0.001
wl_shift: float = 0.0
wl_stretch: float = 1.0
class nirs4all.synthesis.InstrumentInference(archetype_name: str = 'unknown', detector_type: str = 'unknown', wavelength_range: Tuple[float, float]=(1000.0, 2500.0), estimated_resolution: float = 8.0, confidence: float = 0.0, alternative_archetypes: Dict[str, float]=<factory>)[source]

Bases: object

Results of instrument archetype inference.

archetype_name

Best matching instrument archetype name.

Type:

str

detector_type

Inferred detector type.

Type:

str

wavelength_range

Detected wavelength range.

Type:

Tuple[float, float]

estimated_resolution

Estimated spectral resolution (nm).

Type:

float

confidence

Confidence score (0-1).

Type:

float

alternative_archetypes

Other possible archetypes with scores.

Type:

Dict[str, float]

alternative_archetypes: Dict[str, float]
archetype_name: str = 'unknown'
confidence: float = 0.0
detector_type: str = 'unknown'
estimated_resolution: float = 8.0
wavelength_range: Tuple[float, float] = (1000.0, 2500.0)
class nirs4all.synthesis.InstrumentSimulator(archetype: InstrumentArchetype, random_state: int | None = None)[source]

Bases: object

Apply instrument-specific effects to synthetic spectra.

Simulates the complete instrument response including:

  • Spectral resolution (instrumental broadening)

  • Multi-sensor stitching

  • Multi-scan averaging

  • Detector noise (shot, thermal, 1/f)

  • Wavelength calibration errors

  • Stray light effects

  • Etalon/fringing interference

archetype

The instrument archetype being simulated.

rng

Random number generator for reproducibility.

Example

>>> archetype = get_instrument_archetype("viavi_micronir")
>>> simulator = InstrumentSimulator(archetype, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
apply(spectra: ndarray, wavelengths: ndarray, temperature_offset: float = 0.0) Tuple[ndarray, ndarray][source]

Apply all instrument effects to spectra.

Parameters:
  • spectra – Input spectra array (n_samples, n_wavelengths).

  • wavelengths – Wavelength array in nm.

  • temperature_offset – Temperature deviation from calibration (°C).

Returns:

Tuple of (modified_spectra, output_wavelengths). Output wavelengths may differ if resampled to instrument grid.

class nirs4all.synthesis.LoadedBenchmarkDataset(info: BenchmarkDatasetInfo, X: ndarray, y: ndarray, wavelengths: ndarray, sample_ids: ndarray | None = None, metadata: Dict[str, ~typing.Any]=<factory>)[source]

Bases: object

Container for a loaded benchmark dataset.

info

Dataset metadata.

Type:

nirs4all.synthesis.benchmarks.BenchmarkDatasetInfo

X

Spectral data (n_samples, n_wavelengths).

Type:

numpy.ndarray

y

Target values (n_samples, n_targets) or (n_samples,).

Type:

numpy.ndarray

wavelengths

Wavelength array.

Type:

numpy.ndarray

sample_ids

Optional sample identifiers.

Type:

numpy.ndarray | None

metadata

Optional additional metadata.

Type:

Dict[str, Any]

X: ndarray
info: BenchmarkDatasetInfo
metadata: Dict[str, Any]
sample_ids: ndarray | None = None
wavelengths: ndarray
y: ndarray
class nirs4all.synthesis.MatrixType(value)[source]

Bases: str, Enum

Physical matrix types that affect spectral properties.

EMULSION = 'emulsion'
FILM = 'film'
GEL = 'gel'
GRANULAR = 'granular'
LIQUID = 'liquid'
PASTE = 'paste'
POWDER = 'powder'
SLURRY = 'slurry'
SOLID = 'solid'
TISSUE = 'tissue'
class nirs4all.synthesis.MeasurementMode(value)[source]

Bases: str, Enum

Types of NIR measurement geometries.

ATR = 'atr'
FIBER_OPTIC = 'fiber_optic'
INTERACTANCE = 'interactance'
REFLECTANCE = 'reflectance'
TRANSFLECTANCE = 'transflectance'
TRANSMITTANCE = 'transmittance'
class nirs4all.synthesis.MeasurementModeInference(value)[source]

Bases: str, Enum

Inferred measurement mode from spectral analysis.

ATR = 'atr'
REFLECTANCE = 'reflectance'
TRANSFLECTANCE = 'transflectance'
TRANSMITTANCE = 'transmittance'
UNKNOWN = 'unknown'
class nirs4all.synthesis.MeasurementModeSimulator(config: MeasurementModeConfig | None = None, random_state: int | None = None)[source]

Bases: object

Simulate different NIR measurement modes.

Converts absorption coefficients to measured signal (absorbance, reflectance, etc.) based on the physics of different measurement geometries.

config

Measurement mode configuration.

rng

Random number generator for reproducibility.

Example

>>> config = MeasurementModeConfig(mode=MeasurementMode.REFLECTANCE)
>>> simulator = MeasurementModeSimulator(config, random_state=42)
>>> reflectance = simulator.apply(absorption_coefficients, wavelengths)
absorbance_to_reflectance(absorbance: ndarray) ndarray[source]

Convert apparent absorbance to reflectance.

R = 10^(-A)

Parameters:

absorbance – Apparent absorbance values.

Returns:

Reflectance values (0-1).

apply(absorption: ndarray, wavelengths: ndarray, scattering: ndarray | None = None) ndarray[source]

Apply measurement mode transformation.

Converts absorption coefficients (K) to measured signal based on the configured measurement mode.

Parameters:
  • absorption – Absorption coefficient array (n_samples, n_wavelengths).

  • wavelengths – Wavelength array in nm.

  • scattering – Optional scattering coefficient array (n_samples, n_wavelengths). If None and needed, will be generated automatically.

Returns:

Measured signal (absorbance, reflectance, etc.) depending on mode.

generate_scattering_coefficients(shape: Tuple[int, int], wavelengths: ndarray) ndarray[source]

Generate realistic scattering coefficients.

The scattering coefficient follows the approximate relationship S(λ) ∝ λ^(-α) * (particle_size)^β.

Parameters:
  • shape – Output shape (n_samples, n_wavelengths).

  • wavelengths – Wavelength array in nm.

Returns:

Scattering coefficient array.
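
A minimal sketch of the power-law relationship, assuming illustrative exponents. The default values of `alpha`, `beta`, `particle_size`, and `reference_wl` are placeholders chosen for the example, not the simulator's defaults, and no sample-to-sample variation is added.

```python
import numpy as np


def scattering_coefficients(shape, wavelengths, alpha=1.0, beta=-1.0,
                            particle_size=50.0, reference_wl=1500.0):
    """Power-law scattering S ∝ λ^(-alpha) * d^beta with assumed exponents."""
    n_samples, n_wavelengths = shape
    wavelengths = np.asarray(wavelengths, dtype=float)
    if wavelengths.shape != (n_wavelengths,):
        raise ValueError("wavelengths must have length shape[1]")
    spectral = (wavelengths / reference_wl) ** (-alpha)
    size_term = (particle_size / 50.0) ** beta  # relative to a 50 µm reference
    return np.tile(spectral * size_term, (n_samples, 1))


wl = np.linspace(1000.0, 2000.0, 5)
S = scattering_coefficients((3, 5), wl)
```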

inverse_kubelka_munk(ks_ratio: ndarray) ndarray[source]

Inverse Kubelka-Munk transformation.

Given K/S, solve for R∞: R∞ = 1 + K/S - sqrt((K/S)² + 2*K/S)

Parameters:

ks_ratio – K/S ratio values.

Returns:

Reflectance values.

kubelka_munk(reflectance: ndarray) ndarray[source]

Apply Kubelka-Munk transformation.

f(R) = (1 - R)² / (2R) = K/S

Parameters:

reflectance – Reflectance values (0-1).

Returns:

Kubelka-Munk function values (K/S ratio).

reflectance_to_absorbance(reflectance: ndarray) ndarray[source]

Convert reflectance to apparent absorbance.

A = log10(1/R) = -log10(R)

Parameters:

reflectance – Reflectance values (0-1).

Returns:

Apparent absorbance values.
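
The four conversion formulas quoted in this class can be checked as round trips. This is a standalone sketch using free functions with the same names as the methods; it does not call the simulator itself.

```python
import numpy as np


def kubelka_munk(reflectance):
    """f(R) = (1 - R)^2 / (2R) = K/S."""
    r = np.asarray(reflectance, dtype=float)
    return (1.0 - r) ** 2 / (2.0 * r)


def inverse_kubelka_munk(ks_ratio):
    """R_inf = 1 + K/S - sqrt((K/S)^2 + 2 K/S)."""
    ks = np.asarray(ks_ratio, dtype=float)
    return 1.0 + ks - np.sqrt(ks ** 2 + 2.0 * ks)


def reflectance_to_absorbance(reflectance):
    """A = -log10(R)."""
    return -np.log10(np.asarray(reflectance, dtype=float))


def absorbance_to_reflectance(absorbance):
    """R = 10^(-A)."""
    return 10.0 ** (-np.asarray(absorbance, dtype=float))


reflectance = np.array([0.05, 0.2, 0.5, 0.9])
ks = kubelka_munk(reflectance)
recovered_km = inverse_kubelka_munk(ks)
recovered_log = absorbance_to_reflectance(reflectance_to_absorbance(reflectance))
```

The inverse formula is the smaller root of R² - 2(1 + K/S)R + 1 = 0, which is the root that lies in the physical range 0 < R ≤ 1.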

class nirs4all.synthesis.MetadataConfig(generate_sample_ids: bool = True, sample_id_prefix: str = 'sample', n_groups: int | None = None, n_repetitions: int | Tuple[int, int] = 1, group_names: List[str] | None = None, additional_columns: Dict[str, Any] | None = None)[source]

Bases: object

Configuration for sample metadata generation.

generate_sample_ids

Whether to generate sample IDs.

Type:

bool

sample_id_prefix

Prefix for sample IDs.

Type:

str

n_groups

Number of sample groups (e.g., biological replicates).

Type:

int | None

n_repetitions

Repetitions per sample, either fixed int or (min, max) range.

Type:

int | Tuple[int, int]

group_names

Optional list of group names.

Type:

List[str] | None

additional_columns

Dict of column_name -> generator function or values.

Type:

Dict[str, Any] | None

additional_columns: Dict[str, Any] | None = None
generate_sample_ids: bool = True
group_names: List[str] | None = None
n_groups: int | None = None
n_repetitions: int | Tuple[int, int] = 1
sample_id_prefix: str = 'sample'
class nirs4all.synthesis.MetadataGenerationResult(sample_ids: ndarray, bio_sample_ids: ndarray | None = None, repetitions: ndarray | None = None, groups: ndarray | None = None, group_indices: ndarray | None = None, n_bio_samples: int = 0, additional_columns: Dict[str, ndarray] | None = None)[source]

Bases: object

Container for generated metadata.

sample_ids

Unique sample identifiers.

Type:

numpy.ndarray

bio_sample_ids

Biological sample identifiers (before repetitions).

Type:

numpy.ndarray | None

repetitions

Repetition number for each sample.

Type:

numpy.ndarray | None

groups

Group assignments.

Type:

numpy.ndarray | None

group_indices

Integer group indices (for stratification).

Type:

numpy.ndarray | None

n_bio_samples

Number of unique biological samples.

Type:

int

additional_columns

Any extra columns generated.

Type:

Dict[str, numpy.ndarray] | None

additional_columns: Dict[str, ndarray] | None = None
bio_sample_ids: ndarray | None = None
group_indices: ndarray | None = None
groups: ndarray | None = None
n_bio_samples: int = 0
repetitions: ndarray | None = None
sample_ids: ndarray
to_dict() Dict[str, ndarray][source]

Convert to dictionary format suitable for DataFrame or SpectroDataset.

Returns:

Dictionary with string keys and array values.

class nirs4all.synthesis.MetadataGenerator(random_state: int | None = None)[source]

Bases: object

Generate realistic metadata for synthetic NIRS datasets.

This class creates sample identifiers, biological sample groupings, repetition structures, and group assignments that mimic real spectroscopy datasets.

rng

NumPy random generator for reproducibility.

Parameters:

random_state – Random seed for reproducibility.

Example

>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Generate with repetitions and groups
>>> metadata = generator.generate(
...     n_samples=100,
...     sample_id_prefix="WHEAT",
...     n_groups=3,
...     group_names=["Field_A", "Field_B", "Field_C"],
...     n_repetitions=(2, 4)
... )
>>>
>>> # Result: Each biological sample has 2-4 spectral measurements
>>> print(f"Bio samples: {metadata.n_bio_samples}")
>>> print(f"Total samples: {len(metadata.sample_ids)}")
generate(n_samples: int, *, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1, bio_sample_prefix: str = 'B', additional_columns: Dict[str, Any] | None = None) MetadataGenerationResult[source]

Generate complete metadata for a synthetic dataset.

This method handles the complex logic of generating samples with repetitions while respecting group structures. When repetitions are requested, biological samples are created first, then each is replicated 1 or more times to create the final samples.

Parameters:
  • n_samples – Total number of samples (spectra) to generate.

  • sample_id_prefix – Prefix for sample ID strings.

  • n_groups – Number of groups (None for no grouping).

  • group_names – Optional list of group names. If None and n_groups > 0, generates names like “Group_0”, “Group_1”, etc.

  • n_repetitions – Number of repetitions per biological sample. If int: fixed number of repetitions. If tuple (min, max): random number in range [min, max].

  • bio_sample_prefix – Prefix for biological sample IDs.

  • additional_columns – Dictionary of additional columns to generate. Keys are column names; values can be:

      - Callable(n_samples, rng) -> np.ndarray
      - List of values to randomly sample from
      - Tuple (distribution, params) for numeric data

Returns:

MetadataGenerationResult containing all generated metadata.

Raises:

ValueError – If n_samples is less than 1 or if n_repetitions would make it impossible to generate the requested samples.

Example

>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Simple case: 100 samples, no repetitions
>>> result = generator.generate(100)
>>> assert len(result.sample_ids) == 100
>>>
>>> # With repetitions: 50 bio samples, each measured twice
>>> result = generator.generate(100, n_repetitions=2)
>>> assert result.n_bio_samples == 50
>>>
>>> # Variable repetitions
>>> result = generator.generate(100, n_repetitions=(1, 3))
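
The repetition logic described for this method can be sketched without the library. `plan_repetitions` is a hypothetical helper, and truncating the last biological sample so the total matches `n_samples` is an assumption of this sketch; the real generator may resolve the remainder differently or raise instead.

```python
import random
from typing import List, Tuple, Union


def plan_repetitions(n_samples: int, n_repetitions: Union[int, Tuple[int, int]],
                     seed: int = 42) -> List[int]:
    """Return the number of spectra per biological sample (sums to n_samples)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if isinstance(n_repetitions, int):
        low, high = n_repetitions, n_repetitions
    else:
        low, high = n_repetitions
    if low < 1 or high < low:
        raise ValueError("n_repetitions must satisfy 1 <= min <= max")
    rng = random.Random(seed)
    counts: List[int] = []
    remaining = n_samples
    while remaining > 0:
        reps = min(rng.randint(low, high), remaining)  # last sample may be truncated
        counts.append(reps)
        remaining -= reps
    return counts


counts = plan_repetitions(100, (2, 4))
# One ID per spectrum: biological sample index plus repetition index.
sample_ids = [f"B{b:03d}_rep{r}" for b, n in enumerate(counts) for r in range(n)]
fixed = plan_repetitions(100, 2)
```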
class nirs4all.synthesis.MetricResult(metric: RealismMetric, value: float, threshold: float, passed: bool, details: Dict[str, ~typing.Any]=<factory>)[source]

Bases: object

Result of a single realism metric evaluation.

metric

The metric type.

Type:

nirs4all.synthesis.validation.RealismMetric

value

The computed metric value.

Type:

float

threshold

The threshold for passing.

Type:

float

passed

Whether the metric passed the threshold.

Type:

bool

details

Additional details about the metric computation.

Type:

Dict[str, Any]

details: Dict[str, Any]
metric: RealismMetric
passed: bool
threshold: float
value: float
class nirs4all.synthesis.MoistureConfig(water_activity: float = 0.5, moisture_content: float = 0.1, free_water_fraction: float = 0.3, bound_water_shift: float = 25.0, temperature_interaction: bool = True, reference_aw: float = 0.5)[source]

Bases: object

Configuration for moisture/water activity effect simulation.

Moisture affects NIR spectra through:

  • Direct water absorption bands

  • Hydrogen bonding with sample matrix

  • Free vs. bound water ratio

water_activity

Water activity (a_w) value (0.0 to 1.0).

Type:

float

moisture_content

Moisture content as fraction (optional, for intensity).

Type:

float

free_water_fraction

Fraction of water that is “free” vs. bound (0-1).

Type:

float

bound_water_shift

Wavelength shift for bound water relative to free (nm).

Type:

float

temperature_interaction

Whether moisture effects interact with temperature.

Type:

bool

reference_aw

Reference water activity for baseline.

Type:

float

__post_init__()[source]

Validate water activity range.

bound_water_shift: float = 25.0
free_water_fraction: float = 0.3
moisture_content: float = 0.1
reference_aw: float = 0.5
temperature_interaction: bool = True
water_activity: float = 0.5
class nirs4all.synthesis.MonochromatorType(value)[source]

Bases: str, Enum

Types of wavelength selection mechanisms.

AOTF = 'aotf'
DMD = 'dmd'
FABRY_PEROT = 'fabry_perot'
FILTER_WHEEL = 'filter_wheel'
FT = 'fourier_transform'
GRATING = 'grating'
LVF = 'lvf'
class nirs4all.synthesis.MultiScanConfig(enabled: bool = False, n_scans: int = 16, averaging_method: str = 'mean', scan_to_scan_noise: float = 0.001, wavelength_jitter: float = 0.05, discard_outliers: bool = False, outlier_threshold: float = 3.0)[source]

Bases: object

Configuration for multi-scan averaging/accumulation.

Real instruments often acquire multiple scans per sample and average them to improve signal-to-noise ratio. This config simulates that process.
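
For independent noise, averaging n scans reduces the noise standard deviation by roughly √n. The sketch below checks this for 16 scans on a single absorbance value, assuming purely Gaussian scan noise and ignoring drift and wavelength jitter; `simulate_scans` is a hypothetical helper.

```python
import random
import statistics


def simulate_scans(true_value, n_scans, noise_std, rng):
    """Repeated noisy readings of one absorbance value."""
    return [true_value + rng.gauss(0.0, noise_std) for _ in range(n_scans)]


rng = random.Random(0)
n_trials = 2000
single = [simulate_scans(1.0, 1, 0.001, rng)[0] for _ in range(n_trials)]
averaged = [statistics.fmean(simulate_scans(1.0, 16, 0.001, rng)) for _ in range(n_trials)]

# Ratio of noise levels; close to sqrt(16) = 4 for independent scans.
noise_ratio = statistics.stdev(single) / statistics.stdev(averaged)
```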

enabled

Whether multi-scan mode is enabled.

Type:

bool

n_scans

Number of scans to simulate and average.

Type:

int

averaging_method

How to combine scans. Options: ‘mean’, ‘median’, ‘weighted’, ‘savgol’

Type:

str

scan_to_scan_noise

Additional noise between scans (simulates drift).

Type:

float

wavelength_jitter

Random wavelength shift between scans (nm).

Type:

float

discard_outliers

Whether to discard outlier scans.

Type:

bool

outlier_threshold

Z-score threshold for outlier detection.

Type:

float

averaging_method: str = 'mean'
discard_outliers: bool = False
enabled: bool = False
n_scans: int = 16
outlier_threshold: float = 3.0
scan_to_scan_noise: float = 0.001
wavelength_jitter: float = 0.05
class nirs4all.synthesis.MultiSensorConfig(enabled: bool = False, sensors: List[SensorConfig] = <factory>, stitch_method: str = 'weighted', stitch_smoothing: float = 10.0, add_stitch_artifacts: bool = True, artifact_intensity: float = 0.02)[source]

Bases: object

Configuration for multi-sensor spectral stitching.

Modern NIR instruments often use multiple sensors/detectors to cover wide wavelength ranges. This config controls how the signals are combined.
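
One plausible reading of a weighted stitch is a linear cross-fade across the overlap region. This is a sketch under that assumption, with a hypothetical `stitch_weighted` helper and two sensors that disagree by a small constant offset; it is not the library's stitching code.

```python
import numpy as np


def stitch_weighted(wl_a, sig_a, wl_b, sig_b, step=1.0):
    """Blend two overlapping sensor signals with a linear cross-fade."""
    low, high = wl_b[0], wl_a[-1]  # overlap region [low, high]
    if high <= low:
        raise ValueError("sensors must overlap")
    out_wl = np.arange(wl_a[0], wl_b[-1] + step / 2, step)
    a = np.interp(out_wl, wl_a, sig_a)
    b = np.interp(out_wl, wl_b, sig_b)
    # Weight of sensor B ramps from 0 to 1 across the overlap.
    weight_b = np.clip((out_wl - low) / (high - low), 0.0, 1.0)
    return out_wl, (1.0 - weight_b) * a + weight_b * b


wl_a = np.arange(900.0, 1101.0)   # first sensor, 900-1100 nm
wl_b = np.arange(1000.0, 2001.0)  # second sensor, 1000-2000 nm
sig_a = np.full(wl_a.shape, 1.00)
sig_b = np.full(wl_b.shape, 1.02)  # small inter-sensor offset
out_wl, stitched = stitch_weighted(wl_a, sig_a, wl_b, sig_b)
```

Outside the overlap only one sensor contributes; at the midpoint of the overlap (1050 nm here) the two signals are weighted equally.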

enabled

Whether multi-sensor mode is enabled.

Type:

bool

sensors

List of SensorConfig for each sensor.

Type:

List[nirs4all.synthesis.instruments.SensorConfig]

stitch_method

Method for combining overlapping regions. Options: ‘weighted’, ‘average’, ‘first’, ‘last’, ‘optimal’

Type:

str

stitch_smoothing

Smoothing window (nm) at stitch boundaries.

Type:

float

add_stitch_artifacts

Whether to simulate stitching artifacts.

Type:

bool

artifact_intensity

Intensity of stitching artifacts (0-1).

Type:

float

add_stitch_artifacts: bool = True
artifact_intensity: float = 0.02
enabled: bool = False
sensors: List[SensorConfig]
stitch_method: str = 'weighted'
stitch_smoothing: float = 10.0
class nirs4all.synthesis.MultiSourceGenerator(random_state: int | None = None)[source]

Bases: object

Generate synthetic multi-source NIRS datasets.

This class creates datasets combining multiple data sources, such as:

  • Multiple NIR spectral ranges (e.g., visible-NIR + shortwave-NIR)

  • NIR spectra + molecular markers

  • NIR spectra + auxiliary measurements

The generated sources share common underlying structure through component concentrations, creating realistic inter-source correlations.

rng

NumPy random generator.

Parameters:

random_state – Random seed for reproducibility.

Example

>>> generator = MultiSourceGenerator(random_state=42)
>>>
>>> result = generator.generate(
...     n_samples=500,
...     sources=[
...         {
...             "name": "NIR",
...             "type": "nir",
...             "wavelength_range": (1000, 2500),
...             "complexity": "realistic"
...         },
...         {
...             "name": "markers",
...             "type": "aux",
...             "n_features": 20,
...             "correlation_with_target": 0.7
...         }
...     ],
...     target_range=(0, 100)
... )
>>>
>>> print(result.source_names)
['NIR', 'markers']
create_dataset(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, train_ratio: float = 0.8, target_range: Tuple[float, float] | None = None, name: str = 'multi_source_synthetic') SpectroDataset[source]

Create a SpectroDataset from multi-source generation.

Parameters:
  • n_samples – Number of samples to generate.

  • sources – List of source configurations.

  • train_ratio – Proportion of samples for training.

  • target_range – Optional (min, max) for target scaling.

  • name – Dataset name.

Returns:

SpectroDataset with multiple sources configured.

Example

>>> dataset = generator.create_dataset(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 10}
...     ],
...     train_ratio=0.8
... )
generate(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, target_range: Tuple[float, float] | None = None, concentration_method: str = 'dirichlet', n_components: int = 5) MultiSourceResult[source]

Generate a multi-source dataset.

All sources share underlying component concentrations, which creates realistic correlations between sources. NIR sources generate spectra from these concentrations, while auxiliary sources create features correlated with the same underlying structure.
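
The shared-concentration idea can be sketched with NumPy alone. The mixing matrices, noise levels, and the choice of the first component as the target are all assumptions made for this illustration; the generator's own spectra come from its component library rather than random "pure spectra".

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_components = 300, 5

# Shared latent structure: each row is a composition summing to 1.
concentrations = rng.dirichlet(np.ones(n_components), size=n_samples)

# Hypothetical NIR source: linear mixing of random non-negative pure spectra.
pure_spectra = rng.random((n_components, 200))
nir = concentrations @ pure_spectra + rng.normal(0.0, 0.001, (n_samples, 200))

# Hypothetical auxiliary source: noisy linear readout of the same concentrations.
loadings = rng.normal(size=(n_components, 20))
aux = concentrations @ loadings + rng.normal(0.0, 0.05, (n_samples, 20))

# Target driven by the first component, rescaled to the range (0, 100).
raw = concentrations[:, 0]
targets = 100.0 * (raw - raw.min()) / (raw.max() - raw.min())

# Equivalent of concatenating all sources into one feature matrix.
combined = np.hstack([nir, aux])
```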

Parameters:
  • n_samples – Number of samples to generate.

  • sources – List of source configurations (SourceConfig or dict).

  • target_range – Optional (min, max) for scaling target values.

  • concentration_method – Method for generating component concentrations.

  • n_components – Number of underlying components.

Returns:

MultiSourceResult containing all generated data.

Example

>>> result = generator.generate(
...     n_samples=300,
...     sources=[
...         {"name": "VIS-NIR", "type": "nir", "wavelength_range": (400, 1100)},
...         {"name": "SWIR", "type": "nir", "wavelength_range": (1100, 2500)},
...     ]
... )
class nirs4all.synthesis.MultiSourceResult(sources: ~typing.Dict[str, ~numpy.ndarray], targets: ~numpy.ndarray, source_configs: ~typing.List[~nirs4all.synthesis.sources.SourceConfig], wavelengths: ~typing.Dict[str, ~numpy.ndarray] = <factory>, metadata: ~typing.Dict[str, ~typing.Any] | None = None)[source]

Bases: object

Container for multi-source generation results.

sources

Dictionary mapping source names to feature arrays.

Type:

Dict[str, numpy.ndarray]

targets

Target values.

Type:

numpy.ndarray

source_configs

Source configuration objects.

Type:

List[nirs4all.synthesis.sources.SourceConfig]

wavelengths

Dictionary mapping NIR source names to wavelength arrays.

Type:

Dict[str, numpy.ndarray]

metadata

Optional metadata dictionary.

Type:

Dict[str, Any] | None

get_combined_features() ndarray[source]

Concatenate all sources into single feature matrix.

metadata: Dict[str, Any] | None = None
property n_features_total: int

Get total number of features across all sources.

property n_samples: int

Get number of samples.

source_configs: List[SourceConfig]
property source_names: List[str]

Get list of source names.

sources: Dict[str, ndarray]
targets: ndarray
wavelengths: Dict[str, ndarray]
class nirs4all.synthesis.NIRBand(center: float, sigma: float, gamma: float = 0.0, amplitude: float = 1.0, name: str = '')[source]

Bases: object

Represents a single NIR absorption band.

This class models an absorption band using a Voigt profile, which is the convolution of Gaussian (thermal broadening) and Lorentzian (pressure broadening) line shapes.

center

Central wavelength in nm.

Type:

float

sigma

Gaussian width (standard deviation) in nm.

Type:

float

gamma

Lorentzian width (HWHM) in nm. Use 0 for pure Gaussian.

Type:

float

amplitude

Peak amplitude in absorbance units.

Type:

float

name

Descriptive name of the band (e.g., “O-H 1st overtone”).

Type:

str

Example

>>> band = NIRBand(center=1450, sigma=25, gamma=3, amplitude=0.8)
>>> wavelengths = np.arange(1400, 1500, 1)
>>> spectrum = band.compute(wavelengths)
amplitude: float = 1.0
center: float
compute(wavelengths: ndarray) ndarray[source]

Compute the band profile at given wavelengths using Voigt profile.

Parameters:

wavelengths – Array of wavelengths in nm at which to evaluate the band.

Returns:

Array of absorbance values at each wavelength.

Note

When gamma=0, a pure Gaussian profile is used for efficiency. Otherwise, the full Voigt profile (Gaussian ⊗ Lorentzian) is computed.
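
The two branches described in the note can be sketched as follows. This version builds the Voigt shape by numerically convolving a Gaussian with a Lorentzian on a fine grid, which is a swapped-in technique chosen to keep the example NumPy-only, and it normalizes the result so the peak equals `amplitude`; both choices are assumptions, and `band_profile` is a hypothetical helper rather than the method itself.

```python
import numpy as np


def band_profile(wavelengths, center, sigma, gamma=0.0, amplitude=1.0):
    """Peak-normalized band: Gaussian if gamma == 0, else a numerical Voigt."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    if gamma == 0.0:
        return amplitude * np.exp(-0.5 * ((wavelengths - center) / sigma) ** 2)
    # Symmetric fine grid of offsets, wide enough to hold the Lorentzian tails.
    half_width = 10.0 * (sigma + gamma)
    step = min(sigma, gamma) / 20.0
    n = int(np.ceil(half_width / step))
    x = np.arange(-n, n + 1) * step
    gauss = np.exp(-0.5 * (x / sigma) ** 2)
    lorentz = gamma / (x ** 2 + gamma ** 2)
    voigt = np.convolve(gauss, lorentz, mode="same")
    voigt /= voigt.max()
    # Offsets beyond the grid are clamped to the edge value by np.interp.
    return amplitude * np.interp(wavelengths - center, x, voigt)


wavelengths = np.arange(1400, 1500, 1)
voigt_band = band_profile(wavelengths, center=1450, sigma=25, gamma=3, amplitude=0.8)
gauss_band = band_profile(wavelengths, center=1450, sigma=25, gamma=0.0, amplitude=0.8)
```

With the same peak height, the Voigt band is slightly broader and has heavier wings than the pure Gaussian.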

gamma: float = 0.0
name: str = ''
sigma: float
class nirs4all.synthesis.NIRSPriorConfig(domain_weights: Dict[str, float]=<factory>, instrument_given_domain: Dict[str, ~typing.Dict[str, float]]=<factory>, mode_given_category: Dict[str, ~typing.Dict[str, float]]=<factory>, matrix_given_domain: Dict[str, ~typing.Dict[str, float]]=<factory>, temperature_range: Tuple[float, float]=(15.0, 40.0), particle_size_range: Tuple[float, float]=(5.0, 200.0), noise_level_range: Tuple[float, float]=(0.5, 2.0), n_samples_range: Tuple[int, int]=(100, 2000), target_type_weights: Dict[str, float]=<factory>, n_targets_range: Tuple[int, int]=(1, 5), n_classes_range: Tuple[int, int]=(2, 5))[source]

Bases: object

Configuration for NIRS data generation with conditional sampling.

This class defines the prior distributions and conditional dependencies for sampling complete generation configurations.
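
Conditional sampling through such a chain of priors can be sketched with the standard library. The domain, category, and mode names and all weights below are illustrative placeholders, and `sample_categorical` is a hypothetical helper; the real sampler's keys and output structure may differ.

```python
import random
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {key: value / total for key, value in weights.items()}


def sample_categorical(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one key with probability proportional to its weight."""
    probs = normalize_weights(weights)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical priors, for illustration only.
domain_weights = {"agriculture": 3.0, "pharma": 1.0}
instrument_given_domain = {
    "agriculture": {"benchtop": 0.5, "handheld": 0.5},
    "pharma": {"benchtop": 0.7, "ft_nir": 0.3},
}
mode_given_category = {
    "benchtop": {"reflectance": 0.6, "transmittance": 0.4},
    "handheld": {"reflectance": 1.0},
    "ft_nir": {"transmittance": 0.8, "reflectance": 0.2},
}

rng = random.Random(42)
domain = sample_categorical(domain_weights, rng)                  # P(domain)
category = sample_categorical(instrument_given_domain[domain], rng)  # P(category | domain)
mode = sample_categorical(mode_given_category[category], rng)     # P(mode | category)
sample = {"domain": domain, "instrument_category": category, "measurement_mode": mode}
```

Each draw conditions on the previous one, so impossible combinations (for example a mode the chosen category never supports) cannot be produced.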

domain_weights

Prior weights for each domain.

Type:

Dict[str, float]

instrument_given_domain

P(instrument_category | domain).

Type:

Dict[str, Dict[str, float]]

mode_given_category

P(measurement_mode | instrument_category).

Type:

Dict[str, Dict[str, float]]

matrix_given_domain

P(matrix_type | domain).

Type:

Dict[str, Dict[str, float]]

temperature_range

(min, max) temperature in Celsius.

Type:

Tuple[float, float]

particle_size_range

(min, max) particle size in microns.

Type:

Tuple[float, float]

noise_level_range

(min, max) noise level multiplier.

Type:

Tuple[float, float]

Example

>>> config = NIRSPriorConfig()
>>> sampler = PriorSampler(config, random_state=42)
>>> sample = sampler.sample()
>>> print(sample["domain"], sample["instrument"])
domain_weights: Dict[str, float]
get_domain_weight(domain: str) float[source]

Get prior weight for a domain.

instrument_given_domain: Dict[str, Dict[str, float]]
matrix_given_domain: Dict[str, Dict[str, float]]
mode_given_category: Dict[str, Dict[str, float]]
n_classes_range: Tuple[int, int] = (2, 5)
n_samples_range: Tuple[int, int] = (100, 2000)
n_targets_range: Tuple[int, int] = (1, 5)
noise_level_range: Tuple[float, float] = (0.5, 2.0)
normalize_weights(weights: Dict[str, float]) Dict[str, float][source]

Normalize weights to sum to 1.
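
As a minimal sketch of what normalizing a weight table means, the standalone function below rescales relative weights so they sum to 1. It is not the library's implementation: the domain names are placeholders, and raising an error for a non-positive total is an assumption made for this illustration.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so they sum to 1 (illustrative sketch only)."""
    total = sum(weights.values())
    if total <= 0:
        # Assumption: a table with no positive mass is treated as an error here.
        raise ValueError("weights must have a positive sum")
    return {name: value / total for name, value in weights.items()}


# Hypothetical domain weights given as relative frequencies.
normalized = normalize_weights({"food": 2.0, "agriculture": 1.0, "pharmaceutical": 1.0})
print(normalized)
```

Normalized weights can be passed directly to a categorical sampler as probabilities.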

particle_size_range: Tuple[float, float] = (5.0, 200.0)
target_type_weights: Dict[str, float]
temperature_range: Tuple[float, float] = (15.0, 40.0)
class nirs4all.synthesis.NoiseModelConfig(shot_noise_enabled: bool = True, thermal_noise_enabled: bool = True, read_noise_enabled: bool = True, flicker_noise_enabled: bool = False, quantization_noise_enabled: bool = False, shot_noise_factor: float = 1.0, thermal_noise_factor: float = 1.0, read_noise_electrons: float = 50.0, flicker_corner_freq: float = 100.0, adc_bits: int = 16, full_scale: float = 3.0)[source]

Bases: object

Configuration for detector noise model.

shot_noise_enabled

Enable shot (photon) noise.

Type:

bool

thermal_noise_enabled

Enable thermal (Johnson) noise.

Type:

bool

read_noise_enabled

Enable readout noise.

Type:

bool

flicker_noise_enabled

Enable 1/f (flicker) noise.

Type:

bool

quantization_noise_enabled

Enable ADC quantization noise.

Type:

bool

shot_noise_factor

Scaling factor for shot noise.

Type:

float

thermal_noise_factor

Scaling factor for thermal noise.

Type:

float

read_noise_electrons

Read noise in electrons.

Type:

float

flicker_corner_freq

1/f noise corner frequency (Hz).

Type:

float

adc_bits

ADC resolution in bits.

Type:

int

full_scale

Full-scale signal level.

Type:

float
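
To make the roles of adc_bits and full_scale concrete, the sketch below quantizes a signal onto a uniform ADC grid. It assumes the ADC spans the interval from 0 to full_scale in 2**adc_bits steps; how the library maps signals onto codes is not stated in this reference, so treat the helper names and the mapping as hypothetical.

```python
import math


def quantization_step(full_scale: float, adc_bits: int) -> float:
    """Size of one ADC code (LSB), assuming full_scale is divided into 2**adc_bits steps."""
    return full_scale / (2 ** adc_bits)


def quantize(signal, full_scale: float, adc_bits: int):
    """Round each value to the nearest ADC code and clip to the valid code range."""
    step = quantization_step(full_scale, adc_bits)
    max_code = 2 ** adc_bits - 1
    quantized = []
    for value in signal:
        code = min(max(round(value / step), 0), max_code)
        quantized.append(code * step)
    return quantized


step = quantization_step(full_scale=3.0, adc_bits=16)
# For an ideal uniform quantizer the error is uniform over one step, so its std is step / sqrt(12).
quantization_noise_std = step / math.sqrt(12)

signal = [0.0, 0.5, 1.234567, 2.999]
quantized = quantize(signal, full_scale=3.0, adc_bits=16)
print(step, quantization_noise_std)
```

With the default 16 bits the step is far smaller than typical detector noise, which is consistent with quantization noise being disabled by default.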

adc_bits: int = 16
flicker_corner_freq: float = 100.0
flicker_noise_enabled: bool = False
full_scale: float = 3.0
quantization_noise_enabled: bool = False
read_noise_electrons: float = 50.0
read_noise_enabled: bool = True
shot_noise_enabled: bool = True
shot_noise_factor: float = 1.0
thermal_noise_enabled: bool = True
thermal_noise_factor: float = 1.0
class nirs4all.synthesis.OperatorVarianceParams(noise_std: float = 0.001, offset_std: float = 0.01, slope_std: float = 0.001, curvature_std: float = 0.0001, mult_scatter_std: float = 0.05)[source]

Bases: object

Parameters for operator-based variance modeling.

Models spectral variation as independent physical sources:

  • High-frequency noise (detector noise)

  • Baseline offset/slope/curvature (instrumental drift, scattering)

  • Multiplicative scatter (sample thickness, optical path variation)

noise_std

Standard deviation of high-frequency noise.

Type:

float

offset_std

Standard deviation of baseline offset.

Type:

float

slope_std

Standard deviation of baseline slope (per 1000nm).

Type:

float

curvature_std

Standard deviation of baseline curvature.

Type:

float

mult_scatter_std

Standard deviation of multiplicative scatter.

Type:

float
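
The sketch below shows one way the five standard deviations could be combined into a perturbed spectrum. It is an assumption-laden illustration, not the library's code: the function name, the centering of the wavelength axis, and the order in which the operators are applied are all choices made here. Only the "per 1000nm" scaling of the slope comes from the attribute description.

```python
import random


def apply_operator_variance(wavelengths, spectrum, rng, noise_std=0.001, offset_std=0.01,
                            slope_std=0.001, curvature_std=0.0001, mult_scatter_std=0.05):
    """Draw one random realization of each operator and apply it to a spectrum."""
    center = (wavelengths[0] + wavelengths[-1]) / 2.0
    gain = 1.0 + rng.gauss(0.0, mult_scatter_std)   # multiplicative scatter
    offset = rng.gauss(0.0, offset_std)             # baseline offset
    slope = rng.gauss(0.0, slope_std)               # baseline slope per 1000 nm
    curvature = rng.gauss(0.0, curvature_std)       # baseline curvature
    perturbed = []
    for wavelength, absorbance in zip(wavelengths, spectrum):
        x = (wavelength - center) / 1000.0          # centered axis in units of 1000 nm
        baseline = offset + slope * x + curvature * x * x
        noise = rng.gauss(0.0, noise_std)           # independent high-frequency noise
        perturbed.append(gain * absorbance + baseline + noise)
    return perturbed


wavelengths = [1000.0 + 2.0 * i for i in range(5)]
spectrum = [0.40, 0.42, 0.45, 0.43, 0.41]
noisy = apply_operator_variance(wavelengths, spectrum, random.Random(0))
unchanged = apply_operator_variance(wavelengths, spectrum, random.Random(0), 0.0, 0.0, 0.0, 0.0, 0.0)
print(noisy)
```

Setting every standard deviation to zero returns the input spectrum, which is a convenient sanity check when tuning individual sources.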

curvature_std: float = 0.0001
mult_scatter_std: float = 0.05
noise_std: float = 0.001
offset_std: float = 0.01
slope_std: float = 0.001
to_dict() Dict[str, float][source]

Convert to dictionary.

class nirs4all.synthesis.OptimizedComponentFitter(wavelengths: ndarray | None = None, priority_categories: List[str] | None = None, max_components: int = 10, baseline_order: int = 4, preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False, sg_window_length: int = 15, sg_polyorder: int = 3, regularization: float = 1e-06, smooth_sigma_nm: float = 30.0, use_nnls: bool = False)[source]

Bases: object

Optimize component selection using greedy search with category prioritization.

Unlike ComponentFitter, which fits all components simultaneously with NNLS, this class uses a greedy forward selection approach that:

  1. Starts with baseline-only fit

  2. Greedily adds components from priority categories (low threshold)

  3. Fills remaining slots from other categories (higher threshold)

  4. Applies swap refinement to escape local optima

This approach produces much better fits for real-world data by:

  • Avoiding overfitting to spurious components

  • Respecting domain knowledge (e.g., protein for dairy, starch for grains)

  • Allowing both positive and negative coefficients (OLS, not NNLS)

Example

>>> from nirs4all.synthesis import OptimizedComponentFitter
>>>
>>> # Create fitter for grain analysis
>>> fitter = OptimizedComponentFitter(
...     wavelengths=wavelengths,
...     priority_categories=['carbohydrates', 'proteins', 'water_related'],
...     max_components=10,
... )
>>> result = fitter.fit(spectrum)
>>> print(result.summary())
wavelengths

Wavelength grid for fitting.

priority_categories

Categories to prioritize in component selection.

max_components

Maximum number of components to select.

baseline_order

Polynomial order for baseline (default 4).

preprocessing

Preprocessing to apply to components.

auto_detect_preprocessing

Auto-detect preprocessing from data.

detected_preprocessing: PreprocessingType | None
fit(spectrum: ndarray) OptimizedFitResult[source]

Fit components to a spectrum using greedy category-prioritized selection.

The algorithm:

  1. Starts with baseline-only fit

  2. Greedily adds components from priority categories (very low threshold: 0.0001)

  3. Fills remaining slots from other categories (higher threshold: 0.005)

  4. Applies swap refinement (prefers swapping in priority components)

Parameters:

spectrum – Observed spectrum, shape (n_wavelengths,).

Returns:

OptimizedFitResult with fit results.
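
The greedy selection described for fit() can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the library's algorithm: the thresholds are interpreted as the minimum R² gain needed to accept a component, the swap-refinement step is omitted, the component names and Gaussian "spectra" are invented, and a small pure-Python least-squares solver stands in for whatever solver the library uses.

```python
import math


def solve_least_squares(columns, y):
    """Least-squares coefficients via the normal equations (adequate for this tiny toy problem)."""
    k = len(columns)
    rows = [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(a * b for a, b in zip(columns[i], y))]
        for i in range(k)
    ]
    for col in range(k):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, k):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, k + 1):
                rows[r][c] -= factor * rows[col][c]
    coef = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        tail = sum(rows[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (rows[i][k] - tail) / rows[i][i]
    return coef


def r_squared(columns, y):
    """R² of the ordinary least-squares fit of y on the given columns."""
    coef = solve_least_squares(columns, y)
    fit = [sum(c * col[i] for c, col in zip(coef, columns)) for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fit))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def greedy_select(y, baseline, candidates, priority, max_components=10,
                  priority_threshold=0.0001, other_threshold=0.005):
    selected = []
    best_r2 = r_squared(baseline, y)  # step 1: baseline-only fit
    others = [name for name in candidates if name not in priority]
    # Steps 2 and 3: priority pool first (low threshold), then the rest (higher threshold).
    for pool, threshold in ((priority, priority_threshold), (others, other_threshold)):
        while len(selected) < max_components:
            best_name, best_gain = None, threshold
            for name in pool:
                if name in selected:
                    continue
                cols = baseline + [candidates[s] for s in selected] + [candidates[name]]
                gain = r_squared(cols, y) - best_r2
                if gain > best_gain:
                    best_name, best_gain = name, gain
            if best_name is None:
                break  # nothing left in this pool clears its threshold
            selected.append(best_name)
            best_r2 += best_gain
    return selected, best_r2


def bump(x, center, width=0.05):
    """Toy Gaussian band used as a stand-in for a component spectrum."""
    return [math.exp(-0.5 * ((v - center) / width) ** 2) for v in x]


x = [i / 49 for i in range(50)]
baseline = [[1.0] * len(x), x]  # constant and linear baseline terms
candidates = {"protein": bump(x, 0.3), "water": bump(x, 0.7), "starch": bump(x, 0.5)}
y = [0.1 + 0.2 * v + 0.8 * p + 0.3 * w
     for v, p, w in zip(x, candidates["protein"], candidates["water"])]
selected, r2 = greedy_select(y, baseline, candidates, priority=["protein"])
print(selected, round(r2, 4))
```

In this toy case the spurious "starch" candidate is rejected because adding it does not raise R² by more than the higher threshold, which is the overfitting guard the class description refers to.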

class nirs4all.synthesis.OptimizedFitResult(component_names: List[str], concentrations: ndarray, baseline_coefficients: ndarray | None, fitted_spectrum: ndarray, residuals: ndarray, r_squared: float, rmse: float, n_components: int, n_priority_components: int, baseline_r_squared: float, wavelengths: ndarray)[source]

Bases: object

Result from optimized greedy component fitting.

component_names

Names of selected components (in order of selection).

Type:

List[str]

concentrations

Fitted concentrations for each component.

Type:

numpy.ndarray

baseline_coefficients

Polynomial baseline coefficients.

Type:

numpy.ndarray | None

fitted_spectrum

Reconstructed spectrum from fit.

Type:

numpy.ndarray

residuals

Fit residuals.

Type:

numpy.ndarray

r_squared

Coefficient of determination.

Type:

float

rmse

Root mean squared error.

Type:

float

n_components

Number of components selected.

Type:

int

n_priority_components

Number of components from priority categories.

Type:

int

baseline_r_squared

R² from baseline-only fit (for comparison).

Type:

float

wavelengths

Wavelength grid used for fitting.

Type:

numpy.ndarray
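
For reference, the sketch below computes residuals, R², and RMSE from an observed and a fitted spectrum using their standard definitions. The sign convention for residuals (observed minus fitted) and the function name are assumptions; the reference does not state which convention the library uses.

```python
import math


def fit_metrics(observed, fitted):
    """Return (residuals, r_squared, rmse) for a fitted spectrum."""
    residuals = [o - f for o, f in zip(observed, fitted)]  # assumed sign convention
    n = len(observed)
    mean_observed = sum(observed) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    r_squared = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return residuals, r_squared, rmse


observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
residuals, r_squared, rmse = fit_metrics(observed, fitted)
print(r_squared, rmse)
```

Comparing r_squared with baseline_r_squared indicates how much of the fit quality is due to the selected components rather than to the polynomial baseline alone.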

baseline_coefficients: ndarray | None
baseline_r_squared: float
component_names: List[str]
concentrations: ndarray
fitted_spectrum: ndarray
n_components: int
n_priority_components: int
r_squared: float
residuals: ndarray
rmse: float
summary() str[source]

Return human-readable summary.

top_components(n: int = 5, threshold: float = 0.001) List[Tuple[str, float]][source]

Get top components by concentration.

wavelengths: ndarray
class nirs4all.synthesis.OutputConfig(as_dataset: bool = True, include_metadata: bool = False, include_wavelengths: bool = True)[source]

Bases: object

Configuration for output format.

as_dataset

Whether to return SpectroDataset (vs tuple).

Type:

bool

include_metadata

Whether to include generation metadata.

Type:

bool

include_wavelengths

Whether to include wavelength array in output.

Type:

bool

as_dataset: bool = True
include_metadata: bool = False
include_wavelengths: bool = True
class nirs4all.synthesis.OvertoneResult(order: int, wavenumber_cm: float, wavelength_nm: float, amplitude_factor: float, bandwidth_factor: float)[source]

Bases: object

Result of overtone calculation.

amplitude_factor: float
bandwidth_factor: float
order: int
wavelength_nm: float
wavenumber_cm: float
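
The relationship between the wavenumber_cm and wavelength_nm fields, and the effect of anharmonicity on overtone positions, can be illustrated with the textbook anharmonic-oscillator expression. This is a sketch under assumptions: the library's exact formula is not given in this reference, the function names are hypothetical, and order is taken to be the upper vibrational level (1 = fundamental, 2 = first overtone), in line with the max_overtone_order description.

```python
def wavenumber_to_wavelength_nm(wavenumber_cm: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in nm (1 cm = 1e7 nm)."""
    return 1e7 / wavenumber_cm


def overtone_wavenumber(harmonic_cm: float, order: int, anharmonicity: float) -> float:
    """Wavenumber of the 0 -> order transition: order * w * (1 - (order + 1) * x)."""
    return order * harmonic_cm * (1.0 - (order + 1) * anharmonicity)


# Hypothetical stretch with a harmonic wavenumber of 3000 cm^-1 and anharmonicity 0.02.
positions = {}
for order in (1, 2, 3):
    wavenumber = overtone_wavenumber(3000.0, order, 0.02)
    positions[order] = (wavenumber, wavenumber_to_wavelength_nm(wavenumber))
    print(order, round(wavenumber, 1), round(positions[order][1], 1))
```

Because of the anharmonic correction, each overtone lies at a slightly lower wavenumber (longer wavelength) than an exact integer multiple of the fundamental.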
class nirs4all.synthesis.PCAVarianceParams(n_components: int = 5, explained_variance_ratio: ndarray | None = None, score_means: ndarray | None = None, score_stds: ndarray | None = None, components: ndarray | None = None, mean_spectrum: ndarray | None = None)[source]

Bases: object

Parameters for PCA-based variance modeling.

Models spectral variation using principal component score distributions.

n_components

Number of PCA components.

Type:

int

explained_variance_ratio

Explained variance per component.

Type:

numpy.ndarray | None

score_means

Mean of PC scores.

Type:

numpy.ndarray | None

score_stds

Std of PC scores.

Type:

numpy.ndarray | None

components

PCA loading vectors (n_components, n_wavelengths).

Type:

numpy.ndarray | None

mean_spectrum

Mean spectrum from PCA.

Type:

numpy.ndarray | None
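
A minimal sketch of how these fields could be used to draw a synthetic spectrum: sample one score per component from a normal distribution and add the weighted loadings to the mean spectrum. The independent-normal score model and the function name are assumptions for illustration; plain lists stand in for the NumPy arrays the class stores.

```python
import random


def sample_spectrum_from_pca(mean_spectrum, components, score_means, score_stds, rng):
    """Reconstruct one spectrum as mean + sum_k score_k * loading_k."""
    scores = [rng.gauss(mean, std) for mean, std in zip(score_means, score_stds)]
    spectrum = list(mean_spectrum)
    for score, loading in zip(scores, components):
        for i, weight in enumerate(loading):
            spectrum[i] += score * weight
    return spectrum, scores


mean_spectrum = [1.0, 1.0, 1.0]
components = [[1.0, 0.0, -1.0]]  # one loading vector, shape (n_components, n_wavelengths)

# With zero score spread the result is deterministic: mean + 0.5 * loading.
fixed, _ = sample_spectrum_from_pca(mean_spectrum, components, [0.5], [0.0], random.Random(0))
varied, scores = sample_spectrum_from_pca(mean_spectrum, components, [0.0], [0.2], random.Random(0))
print(fixed, varied)
```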

components: ndarray | None = None
explained_variance_ratio: ndarray | None = None
mean_spectrum: ndarray | None = None
n_components: int = 5
score_means: ndarray | None = None
score_stds: ndarray | None = None
class nirs4all.synthesis.ParticleSizeConfig(distribution: ParticleSizeDistribution = <factory>, reference_size_um: float = 50.0, size_effect_strength: float = 1.0, wavelength_exponent: float = 1.5, include_path_length_effect: bool = True, path_length_sensitivity: float = 0.5)[source]

Bases: object

Configuration for particle size effects.

distribution

Particle size distribution parameters.

Type:

nirs4all.synthesis.scattering.ParticleSizeDistribution

reference_size_um

Reference particle size for baseline scattering.

Type:

float

size_effect_strength

How strongly size affects scattering (0-1).

Type:

float

wavelength_exponent

Exponent for wavelength dependence of scattering:

  • 4.0 = Rayleigh (particles << wavelength)

  • 0.0 = No wavelength dependence

  • 1.0-2.0 = Typical for NIR powder samples

Type:

float

include_path_length_effect

Whether particle size affects optical path.

Type:

bool

path_length_sensitivity

How strongly size affects path length.

Type:

float
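
The role of wavelength_exponent can be illustrated with a simple power law in which scattering scales as wavelength raised to the negative exponent. This is a sketch only: the reference wavelength of 1700 nm and the function name are assumptions, and the library may combine this term with the particle-size terms differently.

```python
def scatter_factor(wavelength_nm: float, exponent: float, reference_nm: float = 1700.0) -> float:
    """Relative scattering strength, equal to 1.0 at the (assumed) reference wavelength."""
    return (wavelength_nm / reference_nm) ** (-exponent)


# Compare the documented regimes at half the reference wavelength.
for exponent in (0.0, 1.5, 4.0):
    print(exponent, scatter_factor(850.0, exponent))
```

An exponent of 4.0 makes short wavelengths scatter far more strongly than long ones, while 0.0 removes the wavelength dependence entirely.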

distribution: ParticleSizeDistribution
include_path_length_effect: bool = True
path_length_sensitivity: float = 0.5
reference_size_um: float = 50.0
size_effect_strength: float = 1.0
wavelength_exponent: float = 1.5
class nirs4all.synthesis.ParticleSizeDistribution(mean_size_um: float = 50.0, std_size_um: float = 15.0, min_size_um: float = 5.0, max_size_um: float = 200.0, distribution: str = 'lognormal')[source]

Bases: object

Particle size distribution parameters.

Models particle size as a log-normal distribution, which is common for ground/milled samples in NIR analysis.

mean_size_um

Mean particle size in micrometers.

Type:

float

std_size_um

Standard deviation of particle size in micrometers.

Type:

float

min_size_um

Minimum particle size (lower truncation).

Type:

float

max_size_um

Maximum particle size (upper truncation).

Type:

float

distribution

Type of distribution (‘lognormal’, ‘normal’, ‘uniform’).

Type:

str

distribution: str = 'lognormal'
max_size_um: float = 200.0
mean_size_um: float = 50.0
min_size_um: float = 5.0
sample(n_samples: int, rng: Generator) ndarray[source]

Sample particle sizes from the distribution.
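
One plausible way to draw truncated log-normal sizes from these parameters is sketched below. It assumes mean_size_um and std_size_um are the arithmetic mean and standard deviation of the untruncated size, and that truncation is done by rejection; neither is confirmed by this reference. The standard library's random.Random is used in place of the NumPy Generator that the documented sample() method accepts.

```python
import math
import random


def lognormal_params(mean: float, std: float):
    """Convert an arithmetic mean/std into the mu/sigma of the underlying normal."""
    sigma_squared = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - sigma_squared / 2.0
    return mu, math.sqrt(sigma_squared)


def sample_sizes(n_samples, rng, mean=50.0, std=15.0, min_size=5.0, max_size=200.0):
    """Draw sizes in micrometers, rejecting values outside [min_size, max_size]."""
    mu, sigma = lognormal_params(mean, std)
    sizes = []
    while len(sizes) < n_samples:
        size = rng.lognormvariate(mu, sigma)
        if min_size <= size <= max_size:  # rejection keeps the truncated shape
            sizes.append(size)
    return sizes


mu, sigma = lognormal_params(50.0, 15.0)
sizes = sample_sizes(500, random.Random(42))
print(min(sizes), max(sizes))
```

Rejection sampling preserves the shape of the distribution inside the bounds, whereas clipping would pile probability mass onto the limits.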

std_size_um: float = 15.0
class nirs4all.synthesis.PartitionConfig(train_ratio: float = 0.8, stratify: bool = False, shuffle: bool = True, group_aware: bool = True)[source]

Bases: object

Configuration for data partitioning (train/test split).

train_ratio

Proportion of samples for training (0.0-1.0).

Type:

float

stratify

Whether to stratify by target (for classification).

Type:

bool

shuffle

Whether to shuffle before splitting.

Type:

bool

group_aware

Whether to keep groups together when splitting.

Type:

bool
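
To illustrate what keeping groups together means for a train/test split, the sketch below assigns whole groups to the training partition until the requested ratio would be exceeded. The function name, the greedy filling rule, and the group labels are assumptions made for this example rather than the library's behavior.

```python
import random


def group_aware_split(groups, train_ratio=0.8, shuffle=True, seed=0):
    """Split sample indices so that no group appears in both partitions."""
    unique_groups = sorted(set(groups))
    if shuffle:
        random.Random(seed).shuffle(unique_groups)
    target = train_ratio * len(groups)
    train_groups, n_train = set(), 0
    for group in unique_groups:
        size = groups.count(group)
        # Add the group if it fits; always accept the first one so train is never empty.
        if n_train + size <= target or not train_groups:
            train_groups.add(group)
            n_train += size
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    test_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, test_idx


groups = ["a"] * 4 + ["b"] * 3 + ["c"] * 3  # hypothetical batch labels
train_idx, test_idx = group_aware_split(groups, train_ratio=0.8)
print(len(train_idx), len(test_idx))
```

Because groups are indivisible, the realized training fraction can differ from train_ratio; here it is 0.6 or 0.7 rather than 0.8.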

group_aware: bool = True
shuffle: bool = True
stratify: bool = False
train_ratio: float = 0.8
class nirs4all.synthesis.PreprocessingInference(preprocessing_type: PreprocessingType = PreprocessingType.RAW_ABSORBANCE, confidence: float = 0.0, is_preprocessed: bool = False, global_mean: float = 0.0, global_range: Tuple[float, float] = (0.0, 1.0), zero_crossing_ratio: float = 0.0, per_sample_std_variation: float = 0.0, oscillation_frequency: float = 0.0, suggested_inverse: str | None = None)[source]

Bases: object

Results of preprocessing type inference.

Detects whether spectral data has been preprocessed (derivatives, normalization, centering, etc.) before being provided to the fitter.

This is crucial for generating synthetic data that matches the real data distribution - synthetic spectra should be generated as raw absorbance and then the same preprocessing applied.

preprocessing_type

Detected preprocessing type.

Type:

nirs4all.synthesis.fitter.PreprocessingType

confidence

Confidence score (0-1).

Type:

float

is_preprocessed

Whether data appears to be preprocessed.

Type:

bool

global_mean

Mean value (0 suggests centering/derivatives).

Type:

float

global_range

(min, max) value range.

Type:

Tuple[float, float]

zero_crossing_ratio

Ratio of zero crossings (high for derivatives).

Type:

float

per_sample_std_variation

Variation in per-sample std (low for SNV).

Type:

float

oscillation_frequency

Spectral oscillation frequency (high for 2nd deriv).

Type:

float

suggested_inverse

Suggested inverse operation to recover raw data.

Type:

str | None
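
As an illustration of one of these diagnostics, the sketch below computes a zero-crossing ratio as the fraction of adjacent point pairs whose signs differ. The exact definition used by the library is not given in this reference, so the function is a hypothetical stand-in that only shows why the statistic separates raw absorbance from derivative spectra.

```python
def zero_crossing_ratio(spectrum):
    """Fraction of consecutive non-zero points whose signs differ."""
    signs = [value > 0 for value in spectrum if value != 0]
    if len(signs) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return crossings / (len(signs) - 1)


raw_like = [0.50, 0.60, 0.70, 0.65]               # raw absorbance stays positive
derivative_like = [0.10, -0.10, 0.20, -0.20, 0.10]  # derivatives oscillate around zero
print(zero_crossing_ratio(raw_like), zero_crossing_ratio(derivative_like))
```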

confidence: float = 0.0
global_mean: float = 0.0
global_range: Tuple[float, float] = (0.0, 1.0)
is_preprocessed: bool = False
oscillation_frequency: float = 0.0
per_sample_std_variation: float = 0.0
preprocessing_type: PreprocessingType = 'raw_absorbance'
suggested_inverse: str | None = None
zero_crossing_ratio: float = 0.0
class nirs4all.synthesis.PreprocessingType(value)[source]

Bases: str, Enum

Detected preprocessing type of spectral data.

FIRST_DERIVATIVE = 'first_derivative'
MEAN_CENTERED = 'mean_centered'
MSC_CORRECTED = 'msc_corrected'
NORMALIZED = 'normalized'
RAW_ABSORBANCE = 'raw_absorbance'
RAW_REFLECTANCE = 'raw_reflectance'
SECOND_DERIVATIVE = 'second_derivative'
SNV_CORRECTED = 'snv_corrected'
UNKNOWN = 'unknown'
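
Because the enum derives from both str and Enum, its members compare equal to their plain string values. The sketch below demonstrates this standard Python behavior with a two-member stand-in class (PreprocessingKind is hypothetical, defined here only so the example runs without the library).

```python
from enum import Enum


class PreprocessingKind(str, Enum):
    """Stand-in mirroring two members of a str-based enum."""
    RAW_ABSORBANCE = "raw_absorbance"
    SECOND_DERIVATIVE = "second_derivative"


# Members compare equal to their string values.
matches_string = PreprocessingKind.RAW_ABSORBANCE == "raw_absorbance"

# Looking up by value returns the member itself.
looked_up = PreprocessingKind("second_derivative")
print(matches_string, looked_up.value)
```

This is why a default such as PreprocessingType.RAW_ABSORBANCE can also be rendered in the attribute listing as the string 'raw_absorbance'.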
class nirs4all.synthesis.PriorSampler(config: NIRSPriorConfig | None = None, random_state: int | None = None)[source]

Bases: object

Sample complete generation configurations from prior distributions.

This class implements hierarchical sampling where lower-level configurations are conditioned on higher-level choices.
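
The hierarchical idea can be sketched with plain dictionaries: draw the domain first, then the instrument category given the domain, then the measurement mode given the category. Everything below is illustrative: the category and mode names, the probabilities, and the keys of the returned dictionary are invented, and the code does not call the library.

```python
import random


def sample_categorical(weights, rng):
    """Draw one key with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical prior tables mirroring the structure of NIRSPriorConfig.
domain_weights = {"food": 0.5, "pharmaceutical": 0.5}
instrument_given_domain = {
    "food": {"benchtop": 0.7, "handheld": 0.3},
    "pharmaceutical": {"benchtop": 1.0},
}
mode_given_category = {
    "benchtop": {"reflectance": 0.6, "transmittance": 0.4},
    "handheld": {"reflectance": 1.0},
}


def sample_config(rng, temperature_range=(15.0, 40.0)):
    """Sample top-down so each choice is conditioned on the previous one."""
    domain = sample_categorical(domain_weights, rng)
    category = sample_categorical(instrument_given_domain[domain], rng)
    mode = sample_categorical(mode_given_category[category], rng)
    return {
        "domain": domain,
        "instrument_category": category,
        "measurement_mode": mode,
        "temperature": rng.uniform(*temperature_range),
    }


rng = random.Random(42)
configs = [sample_config(rng) for _ in range(200)]
print(configs[0])
```

Conditioning keeps impossible combinations out of the samples: in these tables a pharmaceutical configuration can never be paired with a handheld instrument.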

Parameters:
  • config – Prior configuration.

  • random_state – Random state for reproducibility.

Example

>>> from nirs4all.synthesis import NIRSPriorConfig, PriorSampler
>>>
>>> config = NIRSPriorConfig()
>>> sampler = PriorSampler(config, random_state=42)
>>>
>>> # Sample a single configuration
>>> sample = sampler.sample()
>>> print(sample)
>>>
>>> # Sample multiple configurations
>>> samples = sampler.sample_batch(10)
sample() Dict[str, Any][source]

Sample a complete dataset configuration from the prior.

Returns:

Dictionary with all configuration parameters.

Example

>>> sampler = PriorSampler(random_state=42)
>>> config = sampler.sample()
>>> print(config["domain"])
>>> print(config["instrument"])
sample_batch(n: int) List[Dict[str, Any]][source]

Sample multiple configurations from the prior.

Parameters:

n – Number of configurations to sample.

Returns:

List of configuration dictionaries.

sample_components(domain: str, n_components: int | None = None) List[str][source]

Sample component set based on domain.

sample_domain() str[source]

Sample a domain from the prior.

sample_for_domain(domain: str, n_samples: int | None = None) Dict[str, Any][source]

Sample a configuration constrained to a specific domain.

Parameters:
  • domain – Domain to sample for.

  • n_samples – Optional number of samples (uses prior if None).

Returns:

Configuration dictionary for the specified domain.

sample_for_instrument(instrument: str, n_samples: int | None = None) Dict[str, Any][source]

Sample a configuration constrained to a specific instrument.

Parameters:
  • instrument – Instrument name to use.

  • n_samples – Optional number of samples.

Returns:

Configuration dictionary for the specified instrument.

sample_instrument(category: str) str[source]

Sample a specific instrument given the category.

sample_instrument_category(domain: str) str[source]

Sample an instrument category given the domain.

sample_matrix_type(domain: str) str[source]

Sample a matrix type given the domain.

sample_measurement_mode(instrument_category: str) str[source]

Sample a measurement mode given the instrument category.

sample_n_samples() int[source]

Sample number of samples to generate.

sample_noise_level(instrument_category: str) float[source]

Sample noise level multiplier based on instrument category.

sample_particle_size(matrix_type: str) float[source]

Sample particle size based on matrix type.

sample_target_config() Dict[str, Any][source]

Sample target generation configuration.

sample_temperature() float[source]

Sample a temperature from the prior range.

class nirs4all.synthesis.ProceduralComponentConfig(n_fundamental_bands: int = 3, include_overtones: bool = True, max_overtone_order: int = 3, include_combinations: bool = True, max_combinations: int = 3, h_bond_strength: float = 0.3, h_bond_variability: float = 0.2, anharmonicity: float = 0.02, anharmonicity_variability: float = 0.005, amplitude_variability: float = 0.3, bandwidth_variability: float = 0.2, wavelength_range: Tuple[float, float] = (900, 2500), functional_groups: List[FunctionalGroupType] | None = None, combination_amplitude_factor: float = 0.2)[source]

Bases: object

Configuration for procedural component generation.

Controls the complexity and characteristics of generated spectral components including the number of bands, overtone generation, combination bands, and environmental effects.

n_fundamental_bands

Number of fundamental vibration bands to generate.

Type:

int

include_overtones

Whether to generate overtone bands (1st, 2nd, etc.).

Type:

bool

max_overtone_order

Maximum overtone order (2=1st overtone, 3=2nd, etc.).

Type:

int

include_combinations

Whether to generate combination bands.

Type:

bool

max_combinations

Maximum number of combination bands.

Type:

int

h_bond_strength

Average hydrogen bonding strength (0-1).

Type:

float

h_bond_variability

Variability in H-bond strength between samples.

Type:

float

anharmonicity

Anharmonicity constant for overtone calculations.

Type:

float

anharmonicity_variability

Variability in anharmonicity.

Type:

float

amplitude_variability

Random variation in band amplitudes.

Type:

float

bandwidth_variability

Random variation in band widths.

Type:

float

wavelength_range

NIR wavelength range for band placement (nm).

Type:

Tuple[float, float]

functional_groups

Optional list of specific functional groups to use.

Type:

List[nirs4all.synthesis.procedural.FunctionalGroupType] | None

combination_amplitude_factor

Amplitude reduction for combination bands.

Type:

float

Example

>>> from nirs4all.synthesis import ProceduralComponentConfig
>>>
>>> config = ProceduralComponentConfig(
...     n_fundamental_bands=3,
...     include_overtones=True,
...     include_combinations=True,
...     h_bond_strength=0.5
... )
amplitude_variability: float = 0.3
anharmonicity: float = 0.02
anharmonicity_variability: float = 0.005
bandwidth_variability: float = 0.2
combination_amplitude_factor: float = 0.2
functional_groups: List[FunctionalGroupType] | None = None
h_bond_strength: float = 0.3
h_bond_variability: float = 0.2
include_combinations: bool = True
include_overtones: bool = True
max_combinations: int = 3
max_overtone_order: int = 3
n_fundamental_bands: int = 3
wavelength_range: Tuple[float, float] = (900, 2500)
class nirs4all.synthesis.ProceduralComponentGenerator(random_state: int | None = None)[source]

Bases: object

Generator for procedurally-created spectral components.

Creates chemically-plausible spectral components with physically-motivated constraints. Uses wavenumber-space calculations for proper overtone and combination band placement.

rng

NumPy random generator for reproducibility.

Example

>>> from nirs4all.synthesis import ProceduralComponentConfig, ProceduralComponentGenerator
>>>
>>> generator = ProceduralComponentGenerator(random_state=42)
>>>
>>> # Generate a single component
>>> component = generator.generate_component("my_compound")
>>>
>>> # Generate a library of components
>>> library = generator.generate_library(n_components=10)
>>>
>>> # Generate with specific configuration
>>> config = ProceduralComponentConfig(n_fundamental_bands=4)
>>> component = generator.generate_component("complex_compound", config)
generate_component(name: str, config: ProceduralComponentConfig | None = None, functional_groups: List[FunctionalGroupType] | None = None, correlation_group: int | None = None) SpectralComponent[source]

Generate a single spectral component.

Creates a chemically-plausible component with bands following physical constraints (overtone relationships, combination bands, etc.).

Parameters:
  • name – Name for the component.

  • config – Generation configuration. If None, uses defaults.

  • functional_groups – Specific functional groups to use. If None, randomly selects based on config.

  • correlation_group – Optional correlation group ID.

Returns:

SpectralComponent with generated bands.

Example

>>> generator = ProceduralComponentGenerator(random_state=42)
>>> component = generator.generate_component("my_compound")
>>> print(f"Generated {len(component.bands)} bands")
generate_from_functional_groups(name: str, functional_groups: List[FunctionalGroupType | str], config: ProceduralComponentConfig | None = None) SpectralComponent[source]

Generate a component with specified functional groups.

Convenience method for creating components with known chemistry.

Parameters:
  • name – Component name.

  • functional_groups – List of functional groups (enum or string).

  • config – Optional generation configuration.

Returns:

SpectralComponent with bands from specified functional groups.

Example

>>> generator = ProceduralComponentGenerator(random_state=42)
>>> # Generate an alcohol
>>> alcohol = generator.generate_from_functional_groups(
...     "alcohol",
...     ["hydroxyl", "methyl", "methylene"]
... )
generate_library(n_components: int, config: ProceduralComponentConfig | None = None, name_prefix: str = 'component') ComponentLibrary[source]

Generate a library of procedural components.

Creates multiple unique components with varied characteristics.

Parameters:
  • n_components – Number of components to generate.

  • config – Generation configuration applied to all components.

  • name_prefix – Prefix for component names.

Returns:

ComponentLibrary populated with generated components.

Example

>>> generator = ProceduralComponentGenerator(random_state=42)
>>> library = generator.generate_library(10)
>>> print(f"Created library with {library.n_components} components")
generate_variant(base_component: SpectralComponent, variation_scale: float = 0.1, name: str | None = None) SpectralComponent[source]

Generate a variant of an existing component.

Creates a new component with similar characteristics but varied band positions, widths, and amplitudes. Useful for simulating batch effects or matrix variations.

Parameters:
  • base_component – Component to base the variant on.

  • variation_scale – Scale of random variations (0-1).

  • name – Name for the variant. If None, appends “_variant”.

Returns:

SpectralComponent variant.

Example

>>> generator = ProceduralComponentGenerator(random_state=42)
>>> base = generator.generate_component("base")
>>> variant = generator.generate_variant(base, variation_scale=0.15)
class nirs4all.synthesis.ProductGenerator(template: str | ProductTemplate, random_state: int | None = None, wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, wavelengths: ndarray | None = None, instrument_wavelength_grid: str | None = None, complexity: str = 'realistic')[source]

Bases: object

Generator for product-level synthetic NIRS spectra.

ProductGenerator creates realistic synthetic spectra based on predefined product templates with controlled composition variability. It handles correlation constraints, compositional bounds, and efficient batch generation for neural network training.

template

The ProductTemplate used for generation.

library

ComponentLibrary with the required spectral components.

rng

NumPy random generator for reproducibility.

Parameters:
  • template – Template name (str) or ProductTemplate object.

  • random_state – Random seed for reproducibility.

  • wavelength_start – Start wavelength in nm (default: 1000).

  • wavelength_end – End wavelength in nm (default: 2500).

  • wavelength_step – Wavelength step in nm (default: 2).

  • wavelengths – Custom wavelength array (overrides start/end/step).

  • instrument_wavelength_grid – Predefined instrument grid name.

  • complexity – Spectral complexity (‘simple’, ‘realistic’, ‘complex’).

Example

>>> from nirs4all.synthesis import ProductGenerator
>>>
>>> # Generate milk samples with variable fat
>>> generator = ProductGenerator("milk_variable_fat", random_state=42)
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>>
>>> # High-variability training data
>>> generator = ProductGenerator("universal_protein_predictor")
>>> dataset = generator.generate(n_samples=50000, target="protein")
>>>
>>> # Match specific instrument wavelengths
>>> generator = ProductGenerator(
...     "wheat_variable_protein",
...     instrument_wavelength_grid="foss_xds"
... )
__repr__() str[source]

Return string representation.

generate(n_samples: int = 1000, target: str | None = None, train_ratio: float = 0.8, include_batch_effects: bool = False, n_batches: int = 1, return_concentrations: bool = False) 'SpectroDataset' | Tuple['SpectroDataset', np.ndarray][source]

Generate synthetic product samples.

Parameters:
  • n_samples – Number of samples to generate.

  • target – Component to use as regression target. If None, uses template’s default_target.

  • train_ratio – Proportion of samples for training partition.

  • include_batch_effects – Whether to add batch/session effects.

  • n_batches – Number of batches (if include_batch_effects=True).

  • return_concentrations – If True, also return the full concentration matrix.

Returns:

SpectroDataset with train/test partitions. If return_concentrations=True, returns (dataset, concentrations).

Example

>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
generate_dataset_for_target(target: str, n_samples: int = 1000, target_range: Tuple[float, float] | None = None, **kwargs: Any) SpectroDataset[source]

Generate dataset optimized for a specific target component.

This is a convenience method that generates a dataset and optionally scales the target values to a specified range.

Parameters:
  • target – Component to use as regression target.

  • n_samples – Number of samples to generate.

  • target_range – Optional (min, max) to scale target values.

  • **kwargs – Additional arguments passed to generate().

Returns:

SpectroDataset ready for pipeline use.

Example

>>> generator = ProductGenerator("wheat_variable_protein")
>>> dataset = generator.generate_dataset_for_target(
...     target="protein",
...     n_samples=10000,
...     target_range=(0, 100)  # Scale to percentage
... )
class nirs4all.synthesis.ProductTemplate(name: str, description: str, category: str, domain: str, components: ~typing.List[~nirs4all.synthesis.products.ComponentVariation], default_target: str = '', tags: ~typing.List[str] = <factory>, references: ~typing.List[str] = <factory>)[source]

Bases: object

Template defining a product type with composition variability.

A ProductTemplate describes a realistic product type (e.g., wheat grain, milk, pharmaceutical tablet) along with specifications for how each component’s concentration can vary. This enables generation of diverse samples suitable for neural network training.

name

Unique identifier for the template.

Type:

str

description

Human-readable description.

Type:

str

category

Product category (e.g., “dairy”, “grain”, “pharma”).

Type:

str

domain

Application domain (e.g., “agriculture”, “food”, “pharmaceutical”).

Type:

str

components

List of ComponentVariation specifications.

Type:

List[nirs4all.synthesis.products.ComponentVariation]

default_target

Default component to use as regression target.

Type:

str

tags

Classification tags for filtering.

Type:

List[str]

references

Literature or data source citations.

Type:

List[str]

Example

>>> milk_template = ProductTemplate(
...     name="milk_variable_fat",
...     description="Milk with variable fat content (skim to whole)",
...     category="dairy",
...     domain="food",
...     components=[
...         ComponentVariation("water", VariationType.COMPUTED, compute_as="remainder"),
...         ComponentVariation("lipid", VariationType.UNIFORM, min_value=0.005, max_value=0.06),
...         ComponentVariation("casein", VariationType.NORMAL, mean=0.028, std=0.003),
...         ComponentVariation("whey", VariationType.FIXED, value=0.006),
...         ComponentVariation("lactose", VariationType.NORMAL, mean=0.05, std=0.003),
...     ],
...     default_target="lipid",
... )
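
To show how the variation types in the milk example could translate into sampled compositions, the standalone sketch below draws each component and computes water as the remainder so the fractions sum to 1. The numeric values are taken from the example template; the sampling code itself is an assumption and does not use the library (for instance, it does not clip normal draws, which a real implementation may do).

```python
import random


def sample_milk_composition(rng):
    """Draw one composition: uniform lipid, normal casein/lactose, fixed whey, water as remainder."""
    parts = {
        "lipid": rng.uniform(0.005, 0.06),   # UNIFORM between min_value and max_value
        "casein": rng.gauss(0.028, 0.003),   # NORMAL with mean and std
        "whey": 0.006,                       # FIXED value
        "lactose": rng.gauss(0.05, 0.003),   # NORMAL with mean and std
    }
    parts["water"] = 1.0 - sum(parts.values())  # COMPUTED as the remainder
    return parts


rng = random.Random(0)
compositions = [sample_milk_composition(rng) for _ in range(100)]
print(compositions[0])
```

The lipid fraction of each sample would then serve as the regression target, matching default_target="lipid" in the template.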
__post_init__() None[source]

Validate template consistency.

category: str
property component_names: List[str]

Return list of component names in this template.

components: List[ComponentVariation]
default_target: str = ''
description: str
domain: str
info() str[source]

Return formatted information about the template.

name: str
references: List[str]
tags: List[str]
class nirs4all.synthesis.RealBandFitResult(band_names: ~typing.List[str], band_centers: ~numpy.ndarray, amplitudes: ~numpy.ndarray, sigmas: ~numpy.ndarray, baseline_coefficients: ~numpy.ndarray, fitted_spectrum: ~numpy.ndarray, residuals: ~numpy.ndarray, r_squared: float, rmse: float, n_bands: int, wavelengths: ~numpy.ndarray, band_assignments: ~typing.List[~typing.Any] = <factory>)[source]

Bases: object

Result from real band fitting using known NIR band assignments.

band_names

Names of fitted bands (e.g., “O-H/1st”, “C-H/combination”).

Type:

List[str]

band_centers

Fixed center wavelengths from NIR_BANDS.

Type:

numpy.ndarray

amplitudes

Fitted amplitudes for each band.

Type:

numpy.ndarray

sigmas

Sigma values (within constrained ranges).

Type:

numpy.ndarray

baseline_coefficients

Polynomial baseline coefficients.

Type:

numpy.ndarray

fitted_spectrum

Reconstructed spectrum from fit.

Type:

numpy.ndarray

residuals

Fit residuals.

Type:

numpy.ndarray

r_squared

Coefficient of determination.

Type:

float

rmse

Root mean squared error.

Type:

float

n_bands

Number of bands used.

Type:

int

wavelengths

Wavelength grid used for fitting.

Type:

numpy.ndarray

band_assignments

Original BandAssignment objects.

Type:

List[Any]

amplitudes: ndarray
band_assignments: List[Any]
band_centers: ndarray
band_names: List[str]
baseline_coefficients: ndarray
fitted_spectrum: ndarray
n_bands: int
r_squared: float
residuals: ndarray
rmse: float
sigmas: ndarray
summary() str[source]

Return human-readable summary.

top_bands(n: int = 10, threshold: float = 0.001) List[Tuple[str, float, float]][source]

Get top bands by amplitude. Returns (name, center, amplitude).

wavelengths: ndarray
class nirs4all.synthesis.RealBandFitter(baseline_order: int = 4, max_bands: int = 50, target_r2: float = 0.98, allow_sigma_variation: bool = True, sigma_margin: float = 0.3, n_iterations: int = 3)[source]

Bases: object

Fit spectra using REAL NIR band assignments from the _bands.py dictionary.

Unlike pure Gaussian band fitting which optimizes band centers freely, this class uses:

  • Fixed band centers from known spectroscopic literature assignments

  • Constrained sigma values based on typical ranges for each band type

  • Only amplitude optimization (more physically interpretable)

This provides spectroscopically meaningful decomposition that can be linked back to functional groups (O-H, C-H, N-H, etc.) and overtone levels.

Example

>>> from nirs4all.synthesis import RealBandFitter
>>>
>>> fitter = RealBandFitter(baseline_order=4, max_bands=40)
>>> result = fitter.fit(spectrum, wavelengths)
>>> print(result.summary())
>>>
>>> # See which functional groups contribute
>>> for name, center, amp in result.top_bands(10):
...     print(f"{center:.0f} nm: {name} (amplitude={amp:.4f})")
baseline_order

Polynomial baseline order.

max_bands

Maximum number of bands to use.

target_r2

Target R² for iterative refinement.

allow_sigma_variation

Allow sigma to vary within literature ranges.

sigma_margin

How much sigma can vary from midpoint (0.3 = ±30%).

fit(spectrum: ndarray, wavelengths: ndarray) RealBandFitResult[source]

Fit spectrum using real NIR band positions.

Parameters:
  • spectrum – Target spectrum to fit, shape (n_wavelengths,).

  • wavelengths – Wavelengths in nm, shape (n_wavelengths,).

Returns:

RealBandFitResult with fit results and band assignments.
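
When band centers and widths are fixed, the amplitudes and the polynomial baseline enter the model linearly, so they can be estimated in a single least-squares solve. The sketch below illustrates that idea with plain Gaussians and NumPy. It is an assumption-laden illustration, not the RealBandFitter implementation: `gaussian_bands` and `fit_fixed_center_bands` are hypothetical names, and amplitudes are left unconstrained here, whereas the library may enforce additional constraints.

```python
import numpy as np

def gaussian_bands(wavelengths, centers, sigmas):
    """Design matrix with one unit-height Gaussian column per band."""
    z = (wavelengths[:, None] - centers[None, :]) / sigmas[None, :]
    return np.exp(-0.5 * z ** 2)

def fit_fixed_center_bands(spectrum, wavelengths, centers, sigmas, baseline_order=2):
    """Least-squares amplitudes and baseline for fixed centers and sigmas."""
    # Scale wavelengths to [-1, 1] so the polynomial columns stay well conditioned.
    x = 2.0 * (wavelengths - wavelengths.min()) / np.ptp(wavelengths) - 1.0
    bands = gaussian_bands(wavelengths, centers, sigmas)
    baseline = np.vander(x, baseline_order + 1, increasing=True)
    design = np.hstack([bands, baseline])
    coef, *_ = np.linalg.lstsq(design, spectrum, rcond=None)
    n_bands = len(centers)
    return coef[:n_bands], coef[n_bands:], design @ coef

# Synthetic check: two bands plus a constant offset of 0.1.
wavelengths = np.arange(1000.0, 2500.0, 2.0)
centers = np.array([1450.0, 1940.0])
sigmas = np.array([25.0, 30.0])
true_amplitudes = np.array([0.8, 1.0])
spectrum = gaussian_bands(wavelengths, centers, sigmas) @ true_amplitudes + 0.1

amplitudes, baseline_coefficients, fitted = fit_fixed_center_bands(
    spectrum, wavelengths, centers, sigmas
)
```

Because only linear parameters are estimated, the recovered amplitudes map directly onto the named bands, which is what makes this style of decomposition easy to interpret.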

class nirs4all.synthesis.RealDataFitter[source]

Bases: object

Fit generator parameters to match real dataset properties.

This class analyzes real NIRS data and estimates parameters for the SyntheticNIRSGenerator to produce similar spectra. Includes Phase 1-4 enhanced inference for instruments, domains, and effects.

source_properties

SpectralProperties of the analyzed data.

fitted_params

FittedParameters after fitting.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Access inferred characteristics
>>> print(f"Instrument: {params.inferred_instrument}")
>>> print(f"Domain: {params.inferred_domain}")
>>>
>>> # Create matched generator
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
apply_matching_preprocessing(X: ndarray, *, window_length: int = 15, polyorder: int = 2) ndarray[source]

Apply preprocessing to match the detected preprocessing of real data.

If the real data was detected as preprocessed (e.g., second derivative), this method applies the same preprocessing to synthetic raw absorbance spectra so they match the real data distribution.

Parameters:
  • X – Raw absorbance spectra from generator (n_samples, n_wavelengths).

  • window_length – Savitzky-Golay window length for derivatives.

  • polyorder – Polynomial order for Savitzky-Golay filter.

Returns:

Preprocessed spectra matching the real data type.

Raises:

RuntimeError – If fit() hasn’t been called.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl)
>>> generator = fitter.create_matched_generator()
>>> X_raw, _, _ = generator.generate(1000)
>>> X_matched = fitter.apply_matching_preprocessing(X_raw)
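
For the second-derivative case mentioned above, a Savitzky-Golay filter fits a low-order polynomial inside a sliding window and differentiates that polynomial. The NumPy-only sketch below shows the mechanics; it is not the library's code, `savgol_second_derivative` is a hypothetical helper, and it simply drops edge points that lack a full window. In practice `scipy.signal.savgol_filter` with `deriv=2` provides the same operation with edge handling.

```python
import numpy as np

def savgol_second_derivative(X, window_length=15, polyorder=2, delta=1.0):
    """Second derivative of each row of X via a Savitzky-Golay kernel.

    window_length must be odd and polyorder at least 2. Points without a
    full window are dropped, so each row shrinks by window_length - 1.
    """
    half = window_length // 2
    offsets = np.arange(-half, half + 1)
    # Least-squares polynomial fit inside the window: row 2 of the
    # pseudo-inverse yields the coefficient of t**2 from the window samples.
    vander = np.vander(offsets, polyorder + 1, increasing=True)
    kernel = 2.0 * np.linalg.pinv(vander)[2] / delta ** 2
    # np.convolve flips its second argument, so flip the kernel to correlate.
    return np.array([np.convolve(row, kernel[::-1], mode="valid") for row in X])

# Quadratic test spectra: the second derivatives are exactly 2 and 6.
wavelengths = np.arange(1000.0, 1100.0, 2.0)
X_raw = np.vstack([wavelengths ** 2, 3.0 * wavelengths ** 2])
X_d2 = savgol_second_derivative(X_raw, window_length=15, polyorder=2, delta=2.0)
```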
create_matched_generator(random_state: int | None = None) SyntheticNIRSGenerator[source]

Create a SyntheticNIRSGenerator configured to match the fitted data.

This method creates a generator with all fitted parameters including Phase 1-4 enhanced features (instrument, domain, effects).

Parameters:

random_state – Random seed for reproducibility.

Returns:

Configured SyntheticNIRSGenerator instance.

Raises:

RuntimeError – If fit() hasn’t been called.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator(random_state=42)
>>> X_synth, _, _ = generator.generate(1000)
evaluate_similarity(X_synthetic: ndarray, wavelengths: ndarray | None = None) Dict[str, Any][source]

Evaluate similarity between synthetic and source data.

Computes various metrics comparing synthetic spectra to the original real data.

Parameters:
  • X_synthetic – Synthetic spectra matrix.

  • wavelengths – Optional wavelength grid.

Returns:

Dictionary with similarity metrics.

Raises:

RuntimeError – If fit() hasn’t been called.

Example

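>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
>>> metrics = fitter.evaluate_similarity(X_synth)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")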
fit(X: np.ndarray | 'SpectroDataset', *, wavelengths: np.ndarray | None = None, name: str = 'source', infer_instrument: bool = True, infer_domain: bool = True, infer_measurement_mode: bool = True, infer_environmental: bool = True, infer_scattering: bool = True, infer_edge_artifacts: bool = True, infer_preprocessing: bool = True) FittedParameters[source]

Fit generator parameters to real data.

Analyzes the input data and estimates optimal parameters for generating synthetic spectra with similar properties. Includes Phase 1-6 enhanced inference.

Parameters:
  • X – Real spectra matrix (n_samples, n_wavelengths) or SpectroDataset.

  • wavelengths – Wavelength grid (required if X is ndarray).

  • name – Dataset name for reference.

  • infer_instrument – Whether to infer instrument archetype.

  • infer_domain – Whether to infer application domain.

  • infer_measurement_mode – Whether to infer measurement mode.

  • infer_environmental – Whether to infer environmental effects.

  • infer_scattering – Whether to infer scattering parameters.

  • infer_edge_artifacts – Whether to infer edge artifact effects.

  • infer_preprocessing – Whether to detect preprocessing type.

Returns:

FittedParameters object with estimated parameters.

Raises:

ValueError – If X is empty or has wrong shape.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl, name="wheat")
>>> print(params.summary())
fit_from_path(path: str, *, name: str | None = None) FittedParameters[source]

Fit parameters from a dataset path.

Loads data using DatasetConfigs and fits parameters.

Parameters:
  • path – Path to dataset folder.

  • name – Optional name override.

Returns:

FittedParameters object.

Example

>>> params = fitter.fit_from_path("sample_data/regression")
fitted_params: FittedParameters | None
get_tuning_recommendations() List[str][source]

Get recommendations for tuning generation parameters.

Based on the fitted parameters and source data, provides suggestions for manual tuning.

Returns:

List of recommendation strings.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> for rec in fitter.get_tuning_recommendations():
...     print(f"- {rec}")
source_properties: SpectralProperties | None
class nirs4all.synthesis.RealismMetric(value)[source]

Bases: str, Enum

Metrics used in the spectral realism scorecard.

ADVERSARIAL_AUC = 'adversarial_auc'
BASELINE_CURVATURE = 'baseline_curvature'
CORRELATION_LENGTH = 'correlation_length'
DERIVATIVE_STATISTICS = 'derivative_statistics'
PEAK_DENSITY = 'peak_density'
SNR_DISTRIBUTION = 'snr_distribution'
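
Because the enum derives from both `str` and `Enum`, its members can be looked up by value and compared directly with plain strings. The sketch below uses a local stand-in class that mirrors the documented members, so it runs without the package installed:

```python
from enum import Enum

class RealismMetric(str, Enum):
    """Local stand-in mirroring the documented members."""
    ADVERSARIAL_AUC = "adversarial_auc"
    BASELINE_CURVATURE = "baseline_curvature"
    CORRELATION_LENGTH = "correlation_length"
    DERIVATIVE_STATISTICS = "derivative_statistics"
    PEAK_DENSITY = "peak_density"
    SNR_DISTRIBUTION = "snr_distribution"

# Look a member up by its string value.
metric = RealismMetric("peak_density")
is_same_member = metric is RealismMetric.PEAK_DENSITY

# str-based enums compare equal to their plain string value.
equals_plain_string = metric == "peak_density"

all_values = [member.value for member in RealismMetric]
```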
class nirs4all.synthesis.ReflectanceConfig(geometry: str = 'integrating_sphere', reference_material: str = 'spectralon', reference_reflectance: float = 0.99, illumination_angle: float = 0.0, collection_angle: float = 45.0, sample_presentation: str = 'powder')[source]

Bases: object

Configuration for diffuse reflectance measurement mode.

Implements Kubelka-Munk theory: f(R) = (1-R)² / (2R) = K/S, where R is reflectance, K is the absorption coefficient, and S is the scattering coefficient.

geometry

Measurement geometry (integrating sphere, fiber probe, etc.).

Type:

str

reference_material

Reference standard material.

Type:

str

reference_reflectance

Reflectance of reference standard.

Type:

float

illumination_angle

Angle of illumination (degrees from normal).

Type:

float

collection_angle

Angle of collection (degrees from normal).

Type:

float

sample_presentation

How sample is presented (powder, solid, slurry).

Type:

str

collection_angle: float = 45.0
geometry: str = 'integrating_sphere'
illumination_angle: float = 0.0
reference_material: str = 'spectralon'
reference_reflectance: float = 0.99
sample_presentation: str = 'powder'
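
The Kubelka-Munk relation behind this configuration can be evaluated in both directions. The sketch below implements the forward function and its algebraic inverse for scalar values; the function names are hypothetical, and how the library combines this with the reference standard is not covered here.

```python
import math

def kubelka_munk(reflectance):
    """f(R) = (1 - R)^2 / (2 R), equal to K/S for 0 < R <= 1."""
    if not 0.0 < reflectance <= 1.0:
        raise ValueError("reflectance must be in (0, 1]")
    return (1.0 - reflectance) ** 2 / (2.0 * reflectance)

def reflectance_from_ks(k_over_s):
    """Inverse relation: R = 1 + K/S - sqrt((K/S)^2 + 2 K/S)."""
    return 1.0 + k_over_s - math.sqrt(k_over_s ** 2 + 2.0 * k_over_s)

ks = kubelka_munk(0.5)             # 0.25
r_back = reflectance_from_ks(ks)   # 0.5
```

The inverse follows from solving the quadratic R² - 2(1 + K/S)R + 1 = 0 and keeping the root that does not exceed 1.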
class nirs4all.synthesis.ScatteringCoefficientConfig(baseline_scattering: float = 1.0, wavelength_exponent: float = 1.0, particle_size_factor: float = 0.5, sample_variation: float = 0.15, wavelength_reference_nm: float = 1500.0)[source]

Bases: object

Configuration for scattering coefficient (S) generation.

For Kubelka-Munk reflectance, we need both absorption (K) and scattering (S) coefficients. This config controls S(λ) generation.

baseline_scattering

Base scattering coefficient value.

Type:

float

wavelength_exponent

Exponent for wavelength dependence. S(λ) ∝ λ^(-exponent)

Type:

float

particle_size_factor

How strongly particle size affects S.

Type:

float

sample_variation

Sample-to-sample variation in S.

Type:

float

wavelength_reference_nm

Reference wavelength for normalization.

Type:

float

baseline_scattering: float = 1.0
particle_size_factor: float = 0.5
sample_variation: float = 0.15
wavelength_exponent: float = 1.0
wavelength_reference_nm: float = 1500.0
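
The power-law wavelength dependence S(λ) ∝ λ^(-exponent) can be sketched as follows. Normalizing by the reference wavelength so that S equals the baseline value there is an assumption of this sketch; the documentation does not state how `particle_size_factor` and `sample_variation` enter, so both are left out. `scattering_coefficient` is a hypothetical helper.

```python
import numpy as np

def scattering_coefficient(wavelengths_nm, baseline_scattering=1.0,
                           wavelength_exponent=1.0,
                           wavelength_reference_nm=1500.0):
    """S(lambda) = baseline * (lambda / lambda_ref) ** (-exponent)."""
    ratio = np.asarray(wavelengths_nm, dtype=float) / wavelength_reference_nm
    return baseline_scattering * ratio ** (-wavelength_exponent)

wavelengths = np.array([750.0, 1500.0, 3000.0])
S = scattering_coefficient(wavelengths)   # shorter wavelengths scatter more
```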
class nirs4all.synthesis.ScatteringConfig(baseline_scattering: float = 1.0, wavelength_exponent: float = 1.0, particle_size_um: float = 50.0, particle_size_variation: float = 0.2, sample_to_sample_variation: float = 0.15)[source]

Bases: object

Configuration for scattering coefficient generation.

Controls how scattering coefficients are generated for samples, which is essential for Kubelka-Munk reflectance simulation.

baseline_scattering

Base scattering coefficient (arbitrary units).

Type:

float

wavelength_exponent

Exponent for wavelength dependence (Rayleigh-like). S(λ) ∝ λ^(-exponent), typically 0.5-2.0

Type:

float

particle_size_um

Mean particle size in micrometers.

Type:

float

particle_size_variation

Coefficient of variation in particle size.

Type:

float

sample_to_sample_variation

How much scattering varies between samples.

Type:

float

baseline_scattering: float = 1.0
particle_size_um: float = 50.0
particle_size_variation: float = 0.2
sample_to_sample_variation: float = 0.15
wavelength_exponent: float = 1.0
class nirs4all.synthesis.ScatteringEffectsConfig(model: ScatteringModel = ScatteringModel.EMSC, particle_size: ParticleSizeConfig = <factory>, emsc: EMSCConfig = <factory>, scattering_coefficient: ScatteringCoefficientConfig = <factory>, enable_particle_size: bool = True, enable_emsc: bool = True)[source]

Bases: object

Combined configuration for all scattering effects.

model

Which scattering model to use.

Type:

nirs4all.synthesis.scattering.ScatteringModel

particle_size

Particle size effect configuration.

Type:

nirs4all.synthesis.scattering.ParticleSizeConfig

emsc

EMSC-style transformation configuration.

Type:

nirs4all.synthesis.scattering.EMSCConfig

scattering_coefficient

Scattering coefficient generation config.

Type:

nirs4all.synthesis.scattering.ScatteringCoefficientConfig

enable_particle_size

Whether to apply particle size effects.

Type:

bool

enable_emsc

Whether to apply EMSC-style transformation.

Type:

bool

emsc: EMSCConfig
enable_emsc: bool = True
enable_particle_size: bool = True
model: ScatteringModel = 'emsc'
particle_size: ParticleSizeConfig
scattering_coefficient: ScatteringCoefficientConfig
class nirs4all.synthesis.ScatteringInference(has_scatter_effects: bool = False, estimated_particle_size_um: float = 50.0, multiplicative_scatter_std: float = 0.0, additive_scatter_std: float = 0.0, baseline_curvature: float = 0.0, snv_correctable: bool = False, msc_correctable: bool = False)[source]

Bases: object

Results of scattering effects inference.

has_scatter_effects

Whether significant scatter is detected.

Type:

bool

estimated_particle_size_um

Estimated mean particle size (μm).

Type:

float

multiplicative_scatter_std

Estimated MSC-style multiplicative scatter.

Type:

float

additive_scatter_std

Estimated SNV-style additive scatter.

Type:

float

baseline_curvature

Detected baseline curvature intensity.

Type:

float

snv_correctable

Whether SNV would improve spectra.

Type:

bool

msc_correctable

Whether MSC would improve spectra.

Type:

bool

additive_scatter_std: float = 0.0
baseline_curvature: float = 0.0
estimated_particle_size_um: float = 50.0
has_scatter_effects: bool = False
msc_correctable: bool = False
multiplicative_scatter_std: float = 0.0
snv_correctable: bool = False
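
To make the `snv_correctable` flag concrete, the sketch below applies Standard Normal Variate (SNV) to spectra that differ only by a per-sample scale and offset. It shows the standard SNV transform in NumPy; how the library decides whether SNV or MSC "would improve" spectra is not documented here, and `snv` is a local helper rather than a nirs4all function.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center and scale each spectrum (row)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / std

# One underlying spectrum seen with different multiplicative and additive scatter.
base = np.array([0.2, 0.5, 0.9, 0.4, 0.1])
X = np.vstack([base, 1.3 * base + 0.05, 0.7 * base - 0.02])
X_snv = snv(X)   # all three rows coincide after correction
```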
class nirs4all.synthesis.ScatteringModel(value)[source]

Bases: str, Enum

Available scattering models.

EMSC = 'emsc'
KUBELKA_MUNK = 'kubelka_munk'
MIE_APPROX = 'mie_approx'
POLYNOMIAL = 'polynomial'
RAYLEIGH = 'rayleigh'
class nirs4all.synthesis.SensorConfig(detector_type: DetectorType, wavelength_range: Tuple[float, float], spectral_resolution: float = 8.0, noise_level: float = 1.0, gain: float = 1.0, overlap_range: float = 20.0)[source]

Bases: object

Configuration for a single sensor/detector in a multi-sensor system.

Multi-sensor instruments use multiple detectors with different wavelength ranges, then stitch the signals together. This is common in extended-range instruments (e.g., 400-2500 nm coverage using Si + InGaAs detectors).

detector_type

Type of detector for this sensor.

Type:

nirs4all.synthesis.instruments.DetectorType

wavelength_range

(start, end) wavelength range in nm.

Type:

Tuple[float, float]

spectral_resolution

Resolution in nm (FWHM).

Type:

float

noise_level

Relative noise level (1.0 = standard).

Type:

float

gain

Detector gain multiplier.

Type:

float

overlap_range

Wavelength overlap with adjacent sensor for stitching (nm).

Type:

float

detector_type: DetectorType
gain: float = 1.0
noise_level: float = 1.0
overlap_range: float = 20.0
spectral_resolution: float = 8.0
wavelength_range: Tuple[float, float]
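
One common way to join two detector signals is a linear cross-fade over the overlap region. The sketch below illustrates that idea for two signals already sampled on a shared wavelength grid. The blending scheme, the helper name `stitch_two_sensors`, and the numeric values are assumptions chosen for illustration; the library's actual stitching method is not described in this reference.

```python
import numpy as np

def stitch_two_sensors(wavelengths, signal_a, signal_b, overlap_start, overlap_end):
    """Blend signal_a into signal_b with a linear ramp across the overlap."""
    ramp = (wavelengths - overlap_start) / (overlap_end - overlap_start)
    weight_b = np.clip(ramp, 0.0, 1.0)   # 0 before the overlap, 1 after it
    return (1.0 - weight_b) * signal_a + weight_b * signal_b

wavelengths = np.arange(900.0, 1201.0, 10.0)
signal_a = np.full_like(wavelengths, 1.00)   # first detector
signal_b = np.full_like(wavelengths, 1.04)   # second detector with a small offset
# A 20 nm overlap, matching the default overlap_range.
stitched = stitch_two_sensors(wavelengths, signal_a, signal_b, 1040.0, 1060.0)
```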
class nirs4all.synthesis.SourceConfig(name: str, source_type: Literal['nir', 'vis', 'aux', 'markers'] = 'nir', n_features: int | None = None, wavelength_start: float | None = None, wavelength_end: float | None = None, wavelength_step: float = 2.0, components: List[str] | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', distribution: Literal['normal', 'uniform', 'lognormal'] = 'normal', correlation_with_target: float = 0.5)[source]

Bases: object

Configuration for a single data source.

name

Unique identifier for the source.

Type:

str

source_type

Type of source (‘nir’, ‘vis’, ‘aux’, ‘markers’).

Type:

Literal[‘nir’, ‘vis’, ‘aux’, ‘markers’]

n_features

Number of features (auto-calculated for NIR sources).

Type:

int | None

# NIR-specific
wavelength_start

Start wavelength for NIR sources.

Type:

float | None

wavelength_end

End wavelength for NIR sources.

Type:

float | None

wavelength_step

Wavelength step for NIR sources.

Type:

float

components

Component names for NIR sources.

Type:

List[str] | None

complexity

Complexity level for NIR sources.

Type:

Literal[‘simple’, ‘realistic’, ‘complex’]

# Auxiliary-specific
distribution

Distribution for auxiliary features.

Type:

Literal[‘normal’, ‘uniform’, ‘lognormal’]

correlation_with_target

How correlated aux features are with target.

Type:

float

complexity: Literal['simple', 'realistic', 'complex'] = 'simple'
components: List[str] | None = None
correlation_with_target: float = 0.5
distribution: Literal['normal', 'uniform', 'lognormal'] = 'normal'
classmethod from_dict(config: Dict[str, Any]) SourceConfig[source]

Create SourceConfig from dictionary.

n_features: int | None = None
name: str
source_type: Literal['nir', 'vis', 'aux', 'markers'] = 'nir'
wavelength_end: float | None = None
wavelength_start: float | None = None
wavelength_step: float = 2.0
class nirs4all.synthesis.SpectralComponent(name: str, bands: List[NIRBand] = <factory>, correlation_group: int | None = None, category: str = '', subcategory: str = '', synonyms: List[str] = <factory>, formula: str = '', cas_number: str = '', references: List[str] = <factory>, tags: List[str] = <factory>)[source]

Bases: object

A spectral component representing a chemical compound or functional group.

Each component consists of multiple absorption bands that together define the characteristic NIR signature of the compound.

name

Component name (e.g., “water”, “protein”, “lipid”).

Type:

str

bands

List of NIRBand objects defining the spectral signature.

Type:

List[nirs4all.synthesis.components.NIRBand]

correlation_group

Optional group ID for components that should have correlated concentrations (e.g., protein and nitrogen compounds).

Type:

int | None

category

Primary category (e.g., “carbohydrates”, “proteins”, “lipids”).

Type:

str

subcategory

More specific classification (e.g., “monosaccharides”, “amino_acids”).

Type:

str

synonyms

Alternative names (e.g., [“vitamin C”] for ascorbic_acid).

Type:

List[str]

formula

Chemical formula (e.g., “C6H12O6” for glucose).

Type:

str

cas_number

CAS registry number for chemical identification.

Type:

str

references

Literature citations for band assignments.

Type:

List[str]

tags

Classification tags (e.g., [“food”, “pharma”, “agriculture”]).

Type:

List[str]

Example

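>>> import numpy as np
>>> from nirs4all.synthesis import SpectralComponent
>>> from nirs4all.synthesis.components import NIRBand
>>>
>>> water = SpectralComponent(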
...     name="water",
...     bands=[
...         NIRBand(center=1450, sigma=25, gamma=3, amplitude=0.8),
...         NIRBand(center=1940, sigma=30, gamma=4, amplitude=1.0),
...     ],
...     correlation_group=1,
...     category="water_related",
...     formula="H2O",
... )
>>> wavelengths = np.arange(1000, 2500, 2)
>>> spectrum = water.compute(wavelengths)
bands: List[NIRBand]
cas_number: str = ''
category: str = ''
compute(wavelengths: ndarray) ndarray[source]

Compute the full component spectrum by summing all bands.

Parameters:

wavelengths – Array of wavelengths in nm at which to evaluate.

Returns:

Array of absorbance values representing the combined spectrum.
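
The band summation can be sketched with plain Gaussians. This is a simplification: the module overview states that peaks use Voigt profiles, so the Lorentzian width (`gamma`) is ignored here, and treating `amplitude` as the peak height is an assumption. `compute_component` is a hypothetical stand-in, not the library method.

```python
import numpy as np

def compute_component(wavelengths, bands):
    """Sum Gaussian bands given as (center, sigma, amplitude) tuples."""
    spectrum = np.zeros_like(wavelengths, dtype=float)
    for center, sigma, amplitude in bands:
        spectrum += amplitude * np.exp(-0.5 * ((wavelengths - center) / sigma) ** 2)
    return spectrum

wavelengths = np.arange(1000, 2500, 2)
water_bands = [(1450, 25, 0.8), (1940, 30, 1.0)]   # values from the water example
spectrum = compute_component(wavelengths, water_bands)
```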

correlation_group: int | None = None
formula: str = ''
has_bands_in_range(wavelength_range: Tuple[float, float]) bool[source]

Check if component has any bands with centers in the given wavelength range.

Parameters:

wavelength_range – (min, max) wavelength in nm.

Returns:

True if at least one band center is within the range.

info() str[source]

Return formatted information about the component.

Returns:

Human-readable string with component details.

is_normalized(tolerance: float = 0.01) bool[source]

Check if the component’s band amplitudes are max-normalized (max amplitude = 1.0).

Parameters:

tolerance – Acceptable deviation from 1.0 for max amplitude.

Returns:

True if max amplitude is within tolerance of 1.0.

name: str
normalized(method: str = 'max') SpectralComponent[source]

Return a new SpectralComponent with normalized band amplitudes.

Parameters:

method – Normalization method:

  • “max”: Scale so max amplitude = 1.0 (default)

  • “sum”: Scale so sum of amplitudes = 1.0

Returns:

New SpectralComponent with normalized amplitudes.

Example

>>> component = SpectralComponent(name="test", bands=[
...     NIRBand(center=1450, sigma=25, amplitude=0.8),
...     NIRBand(center=1940, sigma=30, amplitude=2.0),
... ])
>>> normalized = component.normalized()
>>> print(max(b.amplitude for b in normalized.bands))  # 1.0
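
The two normalization modes reduce to dividing every amplitude by a single scale factor. A small sketch on bare amplitude lists, using a hypothetical helper rather than the library method:

```python
def normalize_amplitudes(amplitudes, method="max"):
    """Scale amplitudes so that their max (or their sum) equals 1.0."""
    if method == "max":
        scale = max(amplitudes)
    elif method == "sum":
        scale = sum(amplitudes)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return [a / scale for a in amplitudes]

max_normalized = normalize_amplitudes([0.8, 2.0])          # largest value becomes 1.0
sum_normalized = normalize_amplitudes([0.8, 2.0], "sum")   # values add up to 1.0
```

Both modes preserve the relative band intensities; only the overall scale changes.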
references: List[str]
subcategory: str = ''
synonyms: List[str]
tags: List[str]
validate() List[str][source]

Validate component parameters.

Returns:

List of validation issues (empty if all valid).

Example

>>> component = SpectralComponent(name="test", bands=[])
>>> issues = component.validate()
>>> if issues:
...     print("Issues found:", issues)
class nirs4all.synthesis.SpectralProperties(name: str = 'dataset', n_samples: int = 0, n_wavelengths: int = 0, wavelengths: ndarray | None = None, mean_spectrum: ndarray | None = None, std_spectrum: ndarray | None = None, global_mean: float = 0.0, global_std: float = 0.0, global_range: Tuple[float, float] = (0.0, 0.0), mean_slope: float = 0.0, slope_std: float = 0.0, slopes: ndarray | None = None, mean_curvature: float = 0.0, curvature_std: float = 0.0, skewness: float = 0.0, kurtosis: float = 0.0, noise_estimate: float = 0.0, snr_estimate: float = 0.0, pca_explained_variance: ndarray | None = None, pca_n_components_95: int = 0, n_peaks_mean: float = 0.0, peak_positions: ndarray | None = None, peak_wavenumbers: ndarray | None = None, effective_resolution: float = 8.0, noise_correlation_length: float = 1.0, wavelength_range: Tuple[float, float] = (1000.0, 2500.0), baseline_offset: float = 0.0, kubelka_munk_linearity: float = 0.0, baseline_convexity: float = 0.0, water_band_variation: float = 0.0, oh_band_positions: ndarray | None = None, temperature_sensitivity_score: float = 0.0, scatter_baseline_slope: float = 0.0, scatter_baseline_curvature: float = 0.0, sample_to_sample_offset_std: float = 0.0, sample_to_sample_slope_std: float = 0.0, protein_band_intensity: float = 0.0, carbohydrate_band_intensity: float = 0.0, lipid_band_intensity: float = 0.0, water_band_intensity: float = 0.0, left_edge_noise_std: float = 0.0, right_edge_noise_std: float = 0.0, center_noise_std: float = 0.0, left_edge_slope: float = 0.0, right_edge_slope: float = 0.0, edge_curvature_intensity: float = 0.0, edge_curvature_asymmetry: float = 0.0, has_boundary_rise_left: bool = False, has_boundary_rise_right: bool = False)[source]

Bases: object

Container for computed spectral properties of a dataset.

This dataclass holds various statistical and spectral properties computed from a NIRS dataset for comparison and fitting purposes.

name

Dataset identifier.

Type:

str

n_samples

Number of samples.

Type:

int

n_wavelengths

Number of wavelengths.

Type:

int

wavelengths

Wavelength grid.

Type:

numpy.ndarray | None

# Basic statistics
mean_spectrum

Mean spectrum across samples.

Type:

numpy.ndarray | None

std_spectrum

Standard deviation spectrum.

Type:

numpy.ndarray | None

global_mean

Overall mean absorbance.

Type:

float

global_std

Overall standard deviation.

Type:

float

global_range

(min, max) absorbance range.

Type:

Tuple[float, float]

# Shape properties
mean_slope

Average spectral slope (per 1000nm).

Type:

float

slope_std

Standard deviation of slopes.

Type:

float

mean_curvature

Average curvature (second derivative).

Type:

float

# Distribution statistics
skewness

Skewness of absorbance distribution.

Type:

float

kurtosis

Kurtosis of absorbance distribution.

Type:

float

# Noise characteristics
noise_estimate

Estimated noise level.

Type:

float

snr_estimate

Signal-to-noise ratio estimate.

Type:

float

# PCA properties
pca_explained_variance

Explained variance ratios.

Type:

numpy.ndarray | None

pca_n_components_95

Components for 95% variance.

Type:

int

# Peak analysis
n_peaks_mean

Mean number of peaks.

Type:

float

peak_positions

Wavelengths of detected peaks.

Type:

numpy.ndarray | None

peak_wavenumbers

Wavenumber positions of peaks.

Type:

numpy.ndarray | None

# Phase 1-4 Enhanced properties
# Instrument indicators
effective_resolution

Estimated spectral resolution from peak widths.

Type:

float

noise_correlation_length

Correlation length of noise (detector indicator).

Type:

float

wavelength_range

Actual wavelength range of data.

Type:

Tuple[float, float]

# Measurement mode indicators
baseline_offset

Mean baseline offset (transmittance indicator).

Type:

float

kubelka_munk_linearity

K-M linearity score (reflectance indicator).

Type:

float

baseline_convexity

Convexity of baseline (ATR indicator).

Type:

float

# Environmental indicators
water_band_variation

Variation in water band region.

Type:

float

oh_band_positions

Detected O-H band positions.

Type:

numpy.ndarray | None

temperature_sensitivity_score

Score for temperature effect detection.

Type:

float

# Scattering indicators
scatter_baseline_slope

Wavelength-dependent scatter slope.

Type:

float

scatter_baseline_curvature

Curvature from scattering.

Type:

float

sample_to_sample_offset_std

Sample-to-sample offset variation.

Type:

float

sample_to_sample_slope_std

Sample-to-sample slope variation.

Type:

float

# Domain indicators
protein_band_intensity

Intensity in protein band regions.

Type:

float

carbohydrate_band_intensity

Intensity in carbohydrate regions.

Type:

float

lipid_band_intensity

Intensity in lipid band regions.

Type:

float

water_band_intensity

Intensity in water band regions.

Type:

float

baseline_convexity: float = 0.0
baseline_offset: float = 0.0
carbohydrate_band_intensity: float = 0.0
center_noise_std: float = 0.0
curvature_std: float = 0.0
edge_curvature_asymmetry: float = 0.0
edge_curvature_intensity: float = 0.0
effective_resolution: float = 8.0
global_mean: float = 0.0
global_range: Tuple[float, float] = (0.0, 0.0)
global_std: float = 0.0
has_boundary_rise_left: bool = False
has_boundary_rise_right: bool = False
kubelka_munk_linearity: float = 0.0
kurtosis: float = 0.0
left_edge_noise_std: float = 0.0
left_edge_slope: float = 0.0
lipid_band_intensity: float = 0.0
mean_curvature: float = 0.0
mean_slope: float = 0.0
mean_spectrum: ndarray | None = None
n_peaks_mean: float = 0.0
n_samples: int = 0
n_wavelengths: int = 0
name: str = 'dataset'
noise_correlation_length: float = 1.0
noise_estimate: float = 0.0
oh_band_positions: ndarray | None = None
pca_explained_variance: ndarray | None = None
pca_n_components_95: int = 0
peak_positions: ndarray | None = None
peak_wavenumbers: ndarray | None = None
protein_band_intensity: float = 0.0
right_edge_noise_std: float = 0.0
right_edge_slope: float = 0.0
sample_to_sample_offset_std: float = 0.0
sample_to_sample_slope_std: float = 0.0
scatter_baseline_curvature: float = 0.0
scatter_baseline_slope: float = 0.0
skewness: float = 0.0
slope_std: float = 0.0
slopes: ndarray | None = None
snr_estimate: float = 0.0
std_spectrum: ndarray | None = None
temperature_sensitivity_score: float = 0.0
water_band_intensity: float = 0.0
water_band_variation: float = 0.0
wavelength_range: Tuple[float, float] = (1000.0, 2500.0)
wavelengths: ndarray | None = None
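
Several of the basic fields in this container can be derived from a spectra matrix with a few NumPy calls. The sketch below computes the basic statistics and the PCA-related fields; it is an illustration under the usual definitions (PCA via SVD of the mean-centered matrix), not the library's analysis code, and `basic_spectral_properties` is a hypothetical helper.

```python
import numpy as np

def basic_spectral_properties(X, variance_threshold=0.95):
    """Compute a subset of dataset-level properties from X (n_samples, n_wavelengths)."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    # First component count whose cumulative explained variance reaches the threshold.
    n_components = int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)
    return {
        "n_samples": X.shape[0],
        "n_wavelengths": X.shape[1],
        "mean_spectrum": X.mean(axis=0),
        "std_spectrum": X.std(axis=0),
        "global_mean": float(X.mean()),
        "global_std": float(X.std()),
        "global_range": (float(X.min()), float(X.max())),
        "pca_explained_variance": explained,
        "pca_n_components_95": n_components,
    }

# Rank-1 toy data: a single loading scaled per sample, so one component suffices.
rng = np.random.default_rng(0)
loading = np.linspace(0.1, 1.0, 20)
scores = rng.normal(1.0, 0.2, size=50)
X = np.outer(scores, loading)
props = basic_spectral_properties(X)
```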
class nirs4all.synthesis.SpectralRealismScore(correlation_length_overlap: float, derivative_ks_pvalue: float, peak_density_ratio: float, baseline_curvature_overlap: float, snr_magnitude_match: bool, adversarial_auc: float, overall_pass: bool, metric_results: List[MetricResult] = <factory>, warnings: List[str] = <factory>)[source]

Bases: object

Complete spectral realism assessment results.

This dataclass contains the results of comparing synthetic spectra against real spectra using multiple quantitative metrics.

correlation_length_overlap

Distribution overlap for autocorrelation decay [0-1].

Type:

float

derivative_ks_pvalue

p-value from KS test on derivative distributions.

Type:

float

peak_density_ratio

Ratio of synthetic to real peak densities.

Type:

float

baseline_curvature_overlap

Distribution overlap for baseline curvature [0-1].

Type:

float

snr_magnitude_match

Whether SNR is within one order of magnitude.

Type:

bool

adversarial_auc

AUC of classifier trying to distinguish real from synthetic.

Type:

float

overall_pass

Whether all critical metrics pass.

Type:

bool

metric_results

Individual metric results with details.

Type:

List[nirs4all.synthesis.validation.MetricResult]

warnings

Any warnings from the analysis.

Type:

List[str]

Example

>>> score = compute_spectral_realism_scorecard(real_spectra, synthetic_spectra, wavelengths)
>>> print(f"Overall pass: {score.overall_pass}")
>>> print(f"Adversarial AUC: {score.adversarial_auc:.3f}")
>>> for metric in score.metric_results:
...     print(metric)
adversarial_auc: float
baseline_curvature_overlap: float
correlation_length_overlap: float
derivative_ks_pvalue: float
metric_results: List[MetricResult]
overall_pass: bool
peak_density_ratio: float
snr_magnitude_match: bool
summary() str[source]

Return a human-readable summary of the score.

to_dict() Dict[str, Any][source]

Convert to dictionary for serialization.

warnings: List[str]
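
Two of the scorecard fields have simple numerical interpretations. The sketch below computes an AUC from classifier scores by pairwise comparison and checks whether two SNR values lie within one order of magnitude. The classifier used by the library and its pass thresholds are not documented here, so both helpers are hypothetical illustrations.

```python
import math
import numpy as np

def pairwise_auc(scores_real, scores_synthetic):
    """AUC as P(synthetic score > real score), counting ties as one half."""
    diff = np.asarray(scores_synthetic)[:, None] - np.asarray(scores_real)[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def snr_within_one_order(snr_real, snr_synthetic):
    """True if the two SNR values differ by at most a factor of 10."""
    return abs(math.log10(snr_synthetic / snr_real)) <= 1.0

# A classifier that separates the two sets perfectly gives AUC 1.0 ...
auc_separable = pairwise_auc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
# ... while identical score distributions give 0.5.
auc_indistinguishable = pairwise_auc([0.4, 0.6], [0.4, 0.6])
```

An adversarial AUC near 0.5 means the classifier cannot tell synthetic spectra from real ones, which is the desirable outcome for realism.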
class nirs4all.synthesis.SpectralRegion(value)[source]

Bases: str, Enum

NIR spectral regions with distinct temperature responses.

CH_COMBINATION = 'ch_combination'
CH_FIRST_OVERTONE = 'ch_1st_overtone'
NH_COMBINATION = 'nh_combination'
NH_FIRST_OVERTONE = 'nh_1st_overtone'
OH_COMBINATION = 'oh_combination'
OH_FIRST_OVERTONE = 'oh_1st_overtone'
WATER_BOUND = 'water_bound'
WATER_FREE = 'water_free'
class nirs4all.synthesis.SyntheticDatasetBuilder(n_samples: int = 1000, random_state: int | None = None, name: str = 'synthetic_nirs')[source]

Bases: object

Fluent builder for constructing synthetic NIRS datasets.

This builder provides a chainable interface for configuring all aspects of synthetic data generation, from spectral features to targets and metadata.

The builder accumulates configuration through method calls, then generates the dataset when build() is called.

state

Internal BuilderState containing all configuration.

Parameters:
  • n_samples – Number of samples to generate.

  • random_state – Random seed for reproducibility.

  • name – Dataset name.

Example

>>> from nirs4all.synthesis import SyntheticDatasetBuilder
>>>
>>> # Simple usage
>>> dataset = SyntheticDatasetBuilder(n_samples=500).build()
>>>
>>> # Full configuration
>>> dataset = (
...     SyntheticDatasetBuilder(n_samples=1000, random_state=42)
...     .with_features(
...         wavelength_range=(1000, 2500),
...         complexity="realistic",
...         components=["water", "protein", "lipid"]
...     )
...     .with_targets(
...         distribution="lognormal",
...         range=(5, 50),
...         component="protein"
...     )
...     .with_metadata(
...         n_groups=3,
...         n_repetitions=(2, 5)
...     )
...     .with_partitions(train_ratio=0.8)
...     .build()
... )

See also

  • nirs4all.generate: Top-level convenience function.

  • SyntheticNIRSGenerator: Core generation engine.

__repr__() str[source]

Return string representation of the builder.

build() SpectroDataset | Tuple[np.ndarray, np.ndarray][source]

Build the synthetic dataset with all configured options.

This method generates the data and returns it in the configured format.

Returns:

SpectroDataset instance if as_dataset=True (default); tuple of (X, y) numpy arrays if as_dataset=False.

Raises:

RuntimeError – If build() was already called on this builder.

Example

>>> dataset = builder.build()
>>> print(dataset.num_samples)
1000
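
The chaining behaviour and the single-use `build()` contract can be illustrated with a toy class. `ToyBuilder` is a hypothetical stand-in written for this sketch and shares no code with SyntheticDatasetBuilder; it only shows the pattern of returning `self` from configuration methods and refusing a second build.

```python
class ToyBuilder:
    """Hypothetical stand-in; not the nirs4all implementation."""

    def __init__(self, n_samples=1000):
        self._config = {"n_samples": n_samples}
        self._built = False

    def with_features(self, **options):
        self._config["features"] = options
        return self  # returning self is what makes the calls chainable

    def with_targets(self, **options):
        self._config["targets"] = options
        return self

    def build(self):
        if self._built:
            raise RuntimeError("build() was already called on this builder")
        self._built = True
        return dict(self._config)

builder = (
    ToyBuilder(n_samples=500)
    .with_features(complexity="realistic")
    .with_targets(component="protein")
)
config = builder.build()

# A second call is rejected, mirroring the documented RuntimeError.
try:
    builder.build()
    second_build_failed = False
except RuntimeError:
    second_build_failed = True
```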
build_arrays() Tuple[ndarray, ndarray][source]

Build and return raw numpy arrays.

This is a convenience method equivalent to calling with_output(as_dataset=False).build().

Returns:

Tuple of (X, y) numpy arrays.

Example

>>> X, y = builder.build_arrays()
build_dataset() SpectroDataset[source]

Build and return a SpectroDataset.

This is a convenience method equivalent to calling with_output(as_dataset=True).build().

Returns:

SpectroDataset instance.

Example

>>> dataset = builder.build_dataset()
export(path: str | 'Path', format: Literal['standard', 'single', 'fragmented'] = 'standard') Path[source]

Generate data and export to folder.

Generates the synthetic data and exports it to a folder structure compatible with nirs4all’s DatasetConfigs loader.

Parameters:
  • path – Output folder path.

  • format – Export format. One of ‘standard’ (Xcal, Ycal, Xval, Yval files), ‘single’ (all data in one file with a partition column), or ‘fragmented’ (multiple small files, for testing).

Returns:

Path to created folder.

Example

>>> builder = SyntheticDatasetBuilder(n_samples=1000)
>>> path = builder.export("data/synthetic", format="standard")
export_to_csv(path: str | 'Path', include_targets: bool = True) Path[source]

Generate data and export to a single CSV file.

Parameters:
  • path – Output file path.

  • include_targets – Whether to include target column(s).

Returns:

Path to created file.

Example

>>> path = builder.export_to_csv("data.csv")
fit_to(template: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, *, match_statistics: bool = True, match_structure: bool = True) SyntheticDatasetBuilder[source]

Configure builder to generate data similar to a template.

Analyzes the template data and adjusts generation parameters to produce synthetic data with similar properties.

Parameters:
  • template – Real data to mimic (array or SpectroDataset).

  • wavelengths – Wavelength grid (if template is array).

  • match_statistics – Match statistical properties (mean, std).

  • match_structure – Match PCA structure and complexity.

Returns:

Self for method chaining.

Example

>>> builder = SyntheticDatasetBuilder(n_samples=1000)
>>> builder.fit_to(X_real, wavelengths=wl)
>>> X_synth, y = builder.build_arrays()
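
The idea behind match_statistics can be pictured with a small standalone sketch. It is not the library's fitting procedure: it only assumes that "matching mean and std" means rescaling each wavelength column of the synthetic data to the template's column mean and standard deviation. The helper name match_mean_std is hypothetical, and plain Python lists are used instead of NumPy arrays to keep the sketch dependency-free.

```python
from statistics import mean, pstdev


def match_mean_std(synthetic, template):
    """Rescale each column of `synthetic` to the template's column mean and std.

    Hypothetical helper for illustration; not part of nirs4all.
    """
    n_cols = len(template[0])
    matched = [row[:] for row in synthetic]
    for j in range(n_cols):
        t_col = [row[j] for row in template]
        s_col = [row[j] for row in synthetic]
        t_mu, t_sd = mean(t_col), pstdev(t_col)
        s_mu, s_sd = mean(s_col), pstdev(s_col)
        # A constant synthetic column carries no spread to rescale.
        scale = t_sd / s_sd if s_sd > 0 else 0.0
        for i, row in enumerate(matched):
            row[j] = (synthetic[i][j] - s_mu) * scale + t_mu
    return matched


# Two "wavelengths", three samples each.
template = [[0.50, 0.80], [0.60, 0.90], [0.70, 1.00]]
synthetic = [[1.0, 2.0], [3.0, 2.5], [5.0, 3.0]]
matched = match_mean_std(synthetic, template)
```

Matching PCA structure (match_structure) is a separate and more involved step that this sketch does not attempt.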
classmethod from_config(config: SyntheticDatasetConfig) SyntheticDatasetBuilder[source]

Create a builder from a SyntheticDatasetConfig object.

Parameters:

config – Configuration object to use.

Returns:

Configured SyntheticDatasetBuilder instance.

Example

>>> config = SyntheticDatasetConfig(n_samples=500)
>>> builder = SyntheticDatasetBuilder.from_config(config)
>>> dataset = builder.build()
get_config() SyntheticDatasetConfig[source]

Get the current configuration as a SyntheticDatasetConfig object.

Returns:

SyntheticDatasetConfig with all current settings.

Example

>>> config = builder.get_config()
>>> print(config.n_samples)
1000
with_aggregate(name: str, *, variability: bool = False, target_component: str | None = None) SyntheticDatasetBuilder[source]

Configure generation from a predefined aggregate component.

Aggregates are predefined compositions representing common sample types (e.g., “wheat_grain”, “milk”, “tablet_excipient_base”). Using an aggregate automatically sets up the component library with realistic proportions.

Parameters:
  • name – Aggregate name (e.g., “wheat_grain”, “milk”, “cheese_cheddar”).

  • variability – If True, sample compositions from realistic variability ranges instead of using fixed base values. Useful for generating diverse training data.

  • target_component – Optional component to use as regression target. If not specified, uses the first component in the aggregate.

Returns:

Self for method chaining.

Raises:

ValueError – If aggregate name is not found.

Example

>>> # Generate wheat samples with protein as target
>>> dataset = (
...     SyntheticDatasetBuilder(n_samples=1000, random_state=42)
...     .with_aggregate("wheat_grain", variability=True)
...     .with_targets(component="protein", range=(8, 18))
...     .build()
... )

See also

  • nirs4all.synthesis.list_aggregates: List available aggregates.

  • nirs4all.synthesis.aggregate_info: Get aggregate details.

with_batch_effects(*, enabled: bool = True, n_batches: int = 3) SyntheticDatasetBuilder[source]

Configure batch/session effects simulation.

Batch effects introduce systematic variations between measurement sessions, useful for domain adaptation research.

Parameters:
  • enabled – Whether to enable batch effects.

  • n_batches – Number of measurement batches.

Returns:

Self for method chaining.

Example

>>> builder.with_batch_effects(n_batches=5)
with_classification(*, n_classes: int = 2, separation: float = 1.5, class_weights: List[float] | None = None, separation_method: Literal['component', 'threshold', 'cluster'] = 'component') SyntheticDatasetBuilder[source]

Configure target generation for classification tasks.

This creates discrete class labels with controllable separation between classes, enabling classification experiments with varying difficulty levels.

Parameters:
  • n_classes – Number of classes to generate.

  • separation – Class separation factor (higher = more separable):

      - Values around 0.5-1.0: overlapping classes (challenging).
      - Values around 1.5-2.0: moderate separation (realistic).
      - Values around 2.5+: well-separated classes (easy).

  • class_weights – Optional class weights for imbalanced datasets. Should sum to 1.0.

  • separation_method – How to create class differences:

      - “component”: Different component concentration profiles per class.
      - “threshold”: Classes based on concentration thresholds.
      - “cluster”: K-means-like cluster assignment.

Returns:

Self for method chaining.

Example

>>> builder.with_classification(
...     n_classes=3,
...     separation=2.0,
...     class_weights=[0.5, 0.3, 0.2]
... )
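
To get a feel for why larger separation values give easier problems, the following standalone sketch may help. It makes a strong simplifying assumption that is not taken from the library: two classes are drawn from unit-variance 1-D Gaussians whose means differ by `separation`, and a nearest-mean rule classifies them. The function name nearest_mean_accuracy is hypothetical.

```python
import random


def nearest_mean_accuracy(separation, n_per_class=2000, seed=0):
    """Accuracy of a midpoint threshold on two unit-variance Gaussian classes.

    Assumption for illustration only: `separation` is the distance between
    class means in units of the within-class standard deviation.
    """
    rng = random.Random(seed)
    correct = 0
    for label, mu in ((0, 0.0), (1, separation)):
        for _ in range(n_per_class):
            x = rng.gauss(mu, 1.0)
            predicted = 1 if x > separation / 2 else 0
            correct += predicted == label
    return correct / (2 * n_per_class)


# Accuracy rises as the classes move apart.
for s in (0.5, 1.5, 2.5):
    print(s, round(nearest_mean_accuracy(s), 3))
```

Under this toy model the three documented ranges correspond to heavily overlapping, moderately overlapping, and mostly separable classes.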
with_complex_target_landscape(*, n_regimes: int = 3, regime_method: Literal['concentration', 'spectral', 'random'] = 'concentration', regime_overlap: float = 0.2, noise_heteroscedasticity: float = 0.5) SyntheticDatasetBuilder[source]

Configure multi-regime target landscapes with spatially-varying relationships.

This creates regions in feature space where the target-spectra relationship differs, simulating subpopulations like ripe/unripe fruit or healthy/diseased.

Parameters:
  • n_regimes – Number of different relationship regimes. Default 3.

  • regime_method – How to partition samples into regimes:

      - “concentration”: Regimes based on concentration space clustering.
      - “spectral”: Regimes based on spectral feature patterns.
      - “random”: Random regime assignment (baseline difficulty).

  • regime_overlap – Overlap between regimes creating transition zones. 0 = hard boundaries, 0.5 = smooth transitions. Default 0.2.

  • noise_heteroscedasticity – How much prediction noise varies by regime. 0 = same noise everywhere, 1 = very different noise levels. Default 0.5.

Returns:

Self for method chaining.

Example

>>> # Create challenging multi-regime landscape
>>> builder.with_complex_target_landscape(
...     n_regimes=4,
...     regime_method="concentration",
...     regime_overlap=0.3,
...     noise_heteroscedasticity=0.7
... )
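
A multi-regime landscape can be sketched in a few lines. The version below is an illustration under stated assumptions, not the builder's algorithm: regimes are equal-width bins of a single concentration in [0, 1] (hard boundaries, i.e. no overlap), each regime has its own slope, and each regime has its own noise level to mimic heteroscedasticity. The name multi_regime_target is hypothetical.

```python
import random


def multi_regime_target(conc, slopes, noise_levels, seed=0):
    """Piecewise-linear target with one slope and one noise level per regime.

    Hypothetical helper: regimes are equal-width bins on [0, 1].
    """
    rng = random.Random(seed)
    n_regimes = len(slopes)
    targets, regimes = [], []
    for c in conc:
        k = min(int(c * n_regimes), n_regimes - 1)  # bin index of this sample
        regimes.append(k)
        targets.append(slopes[k] * c + rng.gauss(0.0, noise_levels[k]))
    return targets, regimes


conc = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
# Noise-free case: the relationship flips sign between regimes.
y_clean, regimes = multi_regime_target(
    conc, slopes=[1.0, -2.0, 4.0], noise_levels=[0.0, 0.0, 0.0]
)
# Heteroscedastic case: the last regime is much noisier than the first.
y_noisy, _ = multi_regime_target(
    conc, slopes=[1.0, -2.0, 4.0], noise_levels=[0.01, 0.05, 0.2]
)
```

A single global linear model cannot fit all three slopes at once, which is what makes such landscapes a useful stress test.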
with_features(*, wavelength_range: Tuple[float, float] | None = None, wavelength_step: float | None = None, complexity: Literal['simple', 'realistic', 'complex'] | None = None, components: List[str] | None = None, component_library: ComponentLibrary | None = None, path_length_std: float | None = None, baseline_amplitude: float | None = None, scatter_alpha_std: float | None = None, scatter_beta_std: float | None = None, tilt_std: float | None = None, global_slope_mean: float | None = None, global_slope_std: float | None = None, shift_std: float | None = None, stretch_std: float | None = None, instrumental_fwhm: float | None = None, noise_base: float | None = None, noise_signal_dep: float | None = None, artifact_prob: float | None = None, instrument: str | None = None, measurement_mode: str | None = None) SyntheticDatasetBuilder[source]

Configure spectral feature generation.

Parameters:
  • wavelength_range – Tuple of (start, end) wavelengths in nm.

  • wavelength_step – Wavelength sampling step in nm.

  • complexity – Complexity level affecting noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.

  • components – List of predefined component names to use.

  • component_library – Pre-configured ComponentLibrary instance.

  • path_length_std – Standard deviation of optical path length variation.

  • baseline_amplitude – Amplitude of polynomial baseline drift.

  • scatter_alpha_std – MSC-like multiplicative scattering coefficient variation.

  • scatter_beta_std – Additive scattering offset variation.

  • tilt_std – Standard deviation of linear spectral tilt.

  • global_slope_mean – Mean slope across all spectra.

  • global_slope_std – Standard deviation of global slope.

  • shift_std – Random wavelength axis shift (nm).

  • stretch_std – Wavelength axis stretching/compression factor.

  • instrumental_fwhm – Instrumental broadening FWHM (nm).

  • noise_base – Constant noise floor (detector noise).

  • noise_signal_dep – Noise proportional to signal intensity (shot noise).

  • artifact_prob – Probability of spectral artifacts.

  • instrument – Instrument archetype name (e.g., ‘foss_xds’, ‘bruker_mpa’).

  • measurement_mode – Measurement mode (‘transmittance’, ‘reflectance’, ‘atr’, etc.).

Returns:

Self for method chaining.

Raises:

ValueError – If both components and component_library are specified.

Example

>>> # Simple usage with preset
>>> builder.with_features(
...     wavelength_range=(1000, 2500),
...     complexity="realistic",
...     components=["water", "protein"]
... )
>>> # Advanced usage with custom physics parameters
>>> builder.with_features(
...     wavelength_range=(1000, 2500),
...     components=["water", "protein", "lipid"],
...     noise_base=0.003,
...     noise_signal_dep=0.008,
...     baseline_amplitude=0.015,
...     scatter_alpha_std=0.04,
...     instrument="foss_xds"
... )
with_metadata(*, sample_ids: bool = True, sample_id_prefix: str | None = None, n_groups: int | None = None, n_repetitions: int | Tuple[int, int] | None = None, group_names: List[str] | None = None) SyntheticDatasetBuilder[source]

Configure sample metadata generation.

Generates realistic metadata including sample IDs, biological sample groupings (with repetitions), and group assignments.

Parameters:
  • sample_ids – Whether to generate sample IDs.

  • sample_id_prefix – Prefix for sample ID strings.

  • n_groups – Number of sample groups (for grouped cross-validation).

  • n_repetitions – Repetitions per biological sample. Either a fixed int or a (min, max) tuple for random variation. When set, each “biological sample” gets multiple spectral measurements.

  • group_names – Optional list of group names. If None and n_groups > 0, generates names like “Group_0”, “Group_1”, etc.

Returns:

Self for method chaining.

Example

>>> builder.with_metadata(
...     n_groups=5,
...     n_repetitions=(2, 4),
...     group_names=["Field_A", "Field_B", "Field_C", "Field_D", "Field_E"]
... )
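
The relationship between biological samples, repetitions, and groups can be sketched as follows. This is a hypothetical illustration of the resulting metadata layout, not the library's generator: the field names, the ID format, and the round-robin group assignment are all assumptions.

```python
import random
from collections import Counter


def build_metadata(n_bio_samples, n_repetitions, group_names, prefix="S", seed=0):
    """Expand biological samples into repeated measurements with shared groups.

    Hypothetical helper; field names and ID format are illustrative only.
    """
    rng = random.Random(seed)
    rows = []
    for bio in range(n_bio_samples):
        group = group_names[bio % len(group_names)]  # round-robin assignment
        if isinstance(n_repetitions, tuple):
            # (min, max) tuple: draw the repetition count, bounds inclusive.
            reps = rng.randint(n_repetitions[0], n_repetitions[1])
        else:
            reps = n_repetitions
        for rep in range(reps):
            rows.append({
                "sample_id": f"{prefix}_{len(rows):04d}",
                "bio_sample": bio,
                "repetition": rep,
                "group": group,
            })
    return rows


rows = build_metadata(6, (2, 4), ["Field_A", "Field_B", "Field_C"])
reps_per_bio = Counter(r["bio_sample"] for r in rows)
groups_per_bio = {
    b: {r["group"] for r in rows if r["bio_sample"] == b} for b in reps_per_bio
}
```

Because all repetitions of a biological sample share one group, a grouped cross-validation split can keep them on the same side of the split.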
with_nonlinear_targets(*, interactions: Literal['none', 'polynomial', 'synergistic', 'antagonistic'] = 'polynomial', interaction_strength: float = 0.5, hidden_factors: int = 0, polynomial_degree: int = 2) SyntheticDatasetBuilder[source]

Configure non-linear relationships between concentrations and targets.

This introduces non-linear mixture effects that make targets harder to predict with simple linear models, simulating real chemical interactions.

Parameters:
  • interactions – Type of non-linear interaction:

      - “none”: Pure linear relationship (default behavior).
      - “polynomial”: Include terms like C1², C1×C2, etc.
      - “synergistic”: Non-additive effects where combinations enhance target.
      - “antagonistic”: Saturation/inhibition (Michaelis-Menten-like).

  • interaction_strength – Blend factor between linear and non-linear. 0 = purely linear, 1 = fully non-linear. Default 0.5.

  • hidden_factors – Number of latent variables that affect target but have NO spectral signature. Forces models to learn robust features.

  • polynomial_degree – Maximum degree for polynomial interactions (2 or 3).

Returns:

Self for method chaining.

Example

>>> # Make targets require non-linear models
>>> builder.with_nonlinear_targets(
...     interactions="polynomial",
...     interaction_strength=0.7,
...     hidden_factors=2
... )
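
The role of interaction_strength as a blend factor can be written out explicitly. The sketch below assumes a two-component system and a degree-2 polynomial term; the weights and the exact form of the non-linear part are illustrative choices, not the library's formula.

```python
def blended_target(c1, c2, interaction_strength, w1=2.0, w2=1.0):
    """Blend a linear target with a degree-2 polynomial one.

    interaction_strength = 0 gives the purely linear target,
    interaction_strength = 1 gives the purely polynomial target.
    The weights w1, w2 are arbitrary illustrative values.
    """
    linear = w1 * c1 + w2 * c2
    nonlinear = w1 * c1 ** 2 + w2 * c2 ** 2 + c1 * c2  # squares and cross term
    s = interaction_strength
    return (1.0 - s) * linear + s * nonlinear


y_linear = blended_target(0.5, 0.2, interaction_strength=0.0)
y_half = blended_target(0.5, 0.2, interaction_strength=0.5)
y_poly = blended_target(0.5, 0.2, interaction_strength=1.0)
```

Hidden factors would add further terms that depend on variables absent from the spectra, which no model can recover from X alone.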
with_output(*, as_dataset: bool | None = None, include_metadata: bool | None = None) SyntheticDatasetBuilder[source]

Configure output format.

Parameters:
  • as_dataset – If True, returns SpectroDataset. If False, returns tuple.

  • include_metadata – Whether to include generation metadata in output.

Returns:

Self for method chaining.

Example

>>> builder.with_output(as_dataset=False)  # build() will then return an (X, y) tuple
with_partitions(*, train_ratio: float | None = None, stratify: bool | None = None, shuffle: bool | None = None) SyntheticDatasetBuilder[source]

Configure data partitioning (train/test split).

Parameters:
  • train_ratio – Proportion of samples for training (0.0-1.0).

  • stratify – Whether to stratify by target (for classification).

  • shuffle – Whether to shuffle before splitting.

Returns:

Self for method chaining.

Example

>>> builder.with_partitions(train_ratio=0.75, shuffle=True)
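
A train/test split driven by train_ratio and shuffle can be sketched with the standard library. This is a generic illustration (stratification is left out), and split_indices is a hypothetical name rather than a nirs4all function.

```python
import random


def split_indices(n_samples, train_ratio=0.8, shuffle=True, seed=0):
    """Return (train_indices, test_indices) for a simple ratio-based split."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(1000, train_ratio=0.75, shuffle=True)
```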
with_sources(sources: List[Dict[str, Any] | Any]) SyntheticDatasetBuilder[source]

Configure multi-source generation.

Multi-source datasets combine different types of data, such as multiple NIR spectral ranges or NIR spectra with auxiliary measurements.

Parameters:

sources – List of source configurations. Each source is a dict with:

    - name: Unique source identifier (required).
    - type: Source type - “nir”, “vis”, “aux”, “markers” (default: “nir”).
    - wavelength_range: (start, end) for NIR sources.
    - n_features: Number of features for auxiliary sources.
    - complexity: Complexity level for NIR sources.
    - components: Component names for NIR sources.

Returns:

Self for method chaining.

Example

>>> builder.with_sources([
...     {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...     {"name": "markers", "type": "aux", "n_features": 15}
... ])
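
The documented shape of a source entry (a required unique name and a type that defaults to "nir") can be checked with a small validator. The function below is a hypothetical sketch of such a check, not the validation nirs4all performs internally.

```python
ALLOWED_TYPES = {"nir", "vis", "aux", "markers"}


def normalize_sources(sources):
    """Validate source dicts and fill in the default type.

    Hypothetical helper mirroring the documented keys only.
    """
    seen_names = set()
    normalized = []
    for src in sources:
        if "name" not in src:
            raise ValueError("each source needs a 'name'")
        if src["name"] in seen_names:
            raise ValueError(f"duplicate source name: {src['name']!r}")
        seen_names.add(src["name"])
        cfg = {"type": "nir", **src}  # explicit keys override the default
        if cfg["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown source type: {cfg['type']!r}")
        normalized.append(cfg)
    return normalized


sources = normalize_sources([
    {"name": "NIR", "wavelength_range": (1000, 2500)},
    {"name": "markers", "type": "aux", "n_features": 15},
])
```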
with_target_complexity(*, signal_to_confound_ratio: float = 0.7, n_confounders: int = 2, spectral_masking: float = 0.0, temporal_drift: bool = False) SyntheticDatasetBuilder[source]

Configure spectral-target decoupling and confounding effects.

This introduces factors that make the target only partially predictable from spectral features, simulating real-world irreducible error.

Parameters:
  • signal_to_confound_ratio – Proportion of target variance explainable from spectra. 1.0 = fully predictable, 0.5 = 50% unexplainable. Default 0.7 (70% predictable).

  • n_confounders – Number of confounding variables that affect both spectra and target in different ways. Default 2.

  • spectral_masking – Fraction of predictive signal hidden in high-noise wavelength regions (0.0-0.5). Default 0.0.

  • temporal_drift – If True, the target-spectra relationship gradually changes across samples, testing model robustness.

Returns:

Self for method chaining.

Example

>>> # Add realistic confounding
>>> builder.with_target_complexity(
...     signal_to_confound_ratio=0.6,
...     n_confounders=3,
...     temporal_drift=True
... )
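
One way to read signal_to_confound_ratio is as the fraction of target variance that the spectral signal explains. The sketch below builds such a target by mixing a standardized signal with independent unit-variance noise; this construction is an assumption chosen for illustration, and the library may decouple targets differently.

```python
import random
from statistics import mean, pstdev


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def confounded_target(signal, ratio, seed=0):
    """Mix signal and independent noise so that `ratio` of the variance is signal."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = random.Random(seed)
    z = standardize(signal)
    return [
        ratio ** 0.5 * zi + (1.0 - ratio) ** 0.5 * rng.gauss(0.0, 1.0)
        for zi in z
    ]


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    zx, zy = standardize(x), standardize(y)
    r = mean(a * b for a, b in zip(zx, zy))
    return r * r


signal_rng = random.Random(1)
signal = [signal_rng.gauss(0.0, 1.0) for _ in range(5000)]
target = confounded_target(signal, ratio=0.6)
explained = r_squared(signal, target)  # close to 0.6 for large samples
```

With ratio=0.6 even a perfect model of the signal leaves roughly 40% of the target variance unexplained.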
with_targets(*, distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] | None = None, range: Tuple[float, float] | None = None, component: str | int | None = None, transform: Literal['log', 'sqrt'] | None = None) SyntheticDatasetBuilder[source]

Configure target variable generation for regression tasks.

Parameters:
  • distribution – Concentration distribution method. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.

  • range – Target value range (min, max) for scaling.

  • component – Which component to use as target. If None, uses all components (multi-output). If str, uses the component with that name. If int, uses the component at that index.

  • transform – Optional transformation to apply (‘log’, ‘sqrt’).

Returns:

Self for method chaining.

Example

>>> builder.with_targets(
...     distribution="lognormal",
...     range=(5, 50),
...     component="protein"
... )
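
The effect of range and transform can be sketched as an optional transform followed by min-max scaling. The order of the two steps is an assumption made for this illustration, and scale_to_range is a hypothetical helper.

```python
import math


def scale_to_range(values, value_range, transform=None):
    """Apply an optional transform, then rescale linearly to (min, max)."""
    if transform == "log":
        values = [math.log(v) for v in values]  # requires strictly positive values
    elif transform == "sqrt":
        values = [math.sqrt(v) for v in values]  # requires non-negative values
    elif transform is not None:
        raise ValueError(f"unknown transform: {transform!r}")
    lo, hi = min(values), max(values)
    new_lo, new_hi = value_range
    if hi == lo:
        return [float(new_lo) for _ in values]  # degenerate constant input
    return [new_lo + (v - lo) * (new_hi - new_lo) / (hi - lo) for v in values]


# Concentrations of one component mapped to a 5-50 target range.
y = scale_to_range([0.1, 0.25, 0.4], (5, 50))
y_log = scale_to_range([1.0, 10.0, 100.0], (0, 1), transform="log")
```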
with_wavelengths(wavelengths: ndarray | None = None, *, instrument_grid: str | None = None) SyntheticDatasetBuilder[source]

Configure custom wavelength grid for spectrum generation.

This method allows generating spectra at specific wavelengths matching a real instrument’s wavelength grid, which is essential for transfer learning and domain adaptation experiments.

Priority: wavelengths > instrument_grid > wavelength_range (in with_features)

Parameters:
  • wavelengths – Custom wavelength array in nm. If provided, overrides the wavelength_range set in with_features().

  • instrument_grid – Name of predefined instrument wavelength grid. Available grids include: ‘micronir_onsite’, ‘foss_xds’, ‘scio’, ‘neospectra_micro’, ‘asd_fieldspec’, ‘bruker_mpa’, etc. See list_instrument_wavelength_grids() for all options.

Returns:

Self for method chaining.

Raises:

ValueError – If instrument_grid name is not recognized.

Example

>>> # Use predefined instrument wavelength grid
>>> builder.with_wavelengths(instrument_grid="micronir_onsite")
>>> # Use custom wavelength array
>>> import numpy as np
>>> custom_wl = np.linspace(1000, 2000, 100)
>>> builder.with_wavelengths(wavelengths=custom_wl)
>>> # Full example
>>> from nirs4all.synthesis import SyntheticDatasetBuilder
>>> dataset = (
...     SyntheticDatasetBuilder(n_samples=500)
...     .with_wavelengths(instrument_grid="micronir_onsite")
...     .with_features(complexity="realistic")
...     .build()
... )
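
The priority rule stated above (wavelengths > instrument_grid > wavelength_range) can be expressed as a small resolver. This is a sketch of the precedence logic only: resolve_wavelengths, the `grids` lookup table, the "toy_grid" name, and the default range and step are all hypothetical and do not reflect the library's internals.

```python
def resolve_wavelengths(
    wavelengths=None,
    instrument_grid=None,
    wavelength_range=(1000.0, 2500.0),
    wavelength_step=2.0,
    grids=None,
):
    """Pick the wavelength grid according to the documented priority order."""
    grids = grids or {}
    if wavelengths is not None:  # 1. an explicit array always wins
        return list(wavelengths)
    if instrument_grid is not None:  # 2. then a named instrument grid
        if instrument_grid not in grids:
            raise ValueError(f"unknown instrument grid: {instrument_grid!r}")
        return list(grids[instrument_grid])
    start, end = wavelength_range  # 3. otherwise a regular range
    n_points = int((end - start) / wavelength_step) + 1
    return [start + i * wavelength_step for i in range(n_points)]


toy_grids = {"toy_grid": [900.0, 950.0, 1000.0]}  # illustrative, not a real instrument
from_range = resolve_wavelengths()
from_grid = resolve_wavelengths(instrument_grid="toy_grid", grids=toy_grids)
from_array = resolve_wavelengths(
    wavelengths=[1.0, 2.0, 3.0], instrument_grid="toy_grid", grids=toy_grids
)
```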

See also

  • get_instrument_wavelengths: Get wavelengths for a specific instrument.

  • list_instrument_wavelength_grids: List all available instrument grids.

class nirs4all.synthesis.SyntheticDatasetConfig(n_samples: int = 1000, random_state: int | None = None, features: FeatureConfig = <factory>, targets: TargetConfig = <factory>, metadata: MetadataConfig = <factory>, partitions: PartitionConfig = <factory>, batch_effects: BatchEffectConfig = <factory>, nonlinear: NonLinearConfig = <factory>, confounders: ConfounderConfig = <factory>, multi_regime: MultiRegimeConfig = <factory>, output: OutputConfig = <factory>, name: str = 'synthetic_nirs')[source]

Bases: object

Complete configuration for synthetic dataset generation.

This is the main configuration object that combines all sub-configurations for generating synthetic NIRS datasets.

n_samples

Total number of samples to generate.

Type:

int

random_state

Random seed for reproducibility.

Type:

int | None

features

Feature generation configuration.

Type:

nirs4all.synthesis.config.FeatureConfig

targets

Target variable configuration.

Type:

nirs4all.synthesis.config.TargetConfig

metadata

Sample metadata configuration.

Type:

nirs4all.synthesis.config.MetadataConfig

partitions

Train/test split configuration.

Type:

nirs4all.synthesis.config.PartitionConfig

batch_effects

Batch effect configuration.

Type:

nirs4all.synthesis.config.BatchEffectConfig

output

Output format configuration.

Type:

nirs4all.synthesis.config.OutputConfig

name

Optional dataset name.

Type:

str

Example

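>>> from nirs4all.synthesis import SyntheticDatasetConfig, TargetConfig
>>> from nirs4all.synthesis.config import FeatureConfig
>>> config = SyntheticDatasetConfig(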
...     n_samples=1000,
...     random_state=42,
...     features=FeatureConfig(complexity="realistic"),
...     targets=TargetConfig(distribution="lognormal", range=(0, 100)),
... )
__post_init__() None[source]

Validate configuration after initialization.
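
The `<factory>` defaults in the signature indicate dataclass fields built with a default factory, and __post_init__ is the standard dataclass hook for validation. The stand-in classes below illustrate that pattern; the Toy names and the specific checks are assumptions, since the actual validation rules are not listed here.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ToyTargetConfig:
    """Simplified stand-in for a target sub-configuration."""
    distribution: str = "dirichlet"
    range: Optional[Tuple[float, float]] = None


@dataclass
class ToyDatasetConfig:
    """Simplified stand-in for a top-level configuration."""
    n_samples: int = 1000
    random_state: Optional[int] = None
    # default_factory gives every instance its own sub-config object.
    targets: ToyTargetConfig = field(default_factory=ToyTargetConfig)
    name: str = "synthetic_nirs"

    def __post_init__(self) -> None:
        # Illustrative checks only; the real rules may differ.
        if self.n_samples <= 0:
            raise ValueError("n_samples must be positive")
        rng = self.targets.range
        if rng is not None and rng[0] >= rng[1]:
            raise ValueError("targets.range must satisfy min < max")


config_a = ToyDatasetConfig(n_samples=500)
config_b = ToyDatasetConfig(targets=ToyTargetConfig(range=(0.0, 100.0)))
```

Using a factory rather than a shared default instance avoids one config's sub-settings leaking into another.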

batch_effects: BatchEffectConfig
confounders: ConfounderConfig
features: FeatureConfig
metadata: MetadataConfig
multi_regime: MultiRegimeConfig
n_samples: int = 1000
name: str = 'synthetic_nirs'
nonlinear: NonLinearConfig
output: OutputConfig
partitions: PartitionConfig
random_state: int | None = None
targets: TargetConfig
class nirs4all.synthesis.SyntheticNIRSGenerator(wavelength_start: float = 350.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, wavelengths: ndarray | None = None, instrument_wavelength_grid: str | None = None, component_library: ComponentLibrary | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'realistic', instrument: str | InstrumentArchetype | None = None, measurement_mode: str | MeasurementMode | None = None, multi_sensor_config: MultiSensorConfig | None = None, multi_scan_config: MultiScanConfig | None = None, environmental_config: EnvironmentalEffectsConfig | None = None, scattering_effects_config: ScatteringEffectsConfig | None = None, edge_artifacts_config: EdgeArtifactsConfig | None = None, custom_params: Dict[str, Any] | None = None, random_state: int | None = None)[source]

Bases: object

Generator for synthetic NIRS spectra with realistic instrumental effects.

This generator implements a physically-motivated model based on Beer-Lambert law with additional effects for baseline, scattering, instrumental response, and noise.

Model:

A_i(λ) = L_i * Σ_k c_ik * ε_k(λ) + baseline_i(λ) + scatter_i(λ) + noise_i(λ)

where:
  • c_ik: concentration of component k in sample i

  • ε_k(λ): molar absorptivity of component k (Voigt profiles)

  • L_i: optical path length factor

  • baseline: polynomial baseline drift

  • scatter: multiplicative/additive scattering effects

  • noise: wavelength-dependent Gaussian noise
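
The model above can be sketched in plain Python for a single sample. This is a deliberately reduced illustration, not the generator's code: Gaussian bands stand in for Voigt profiles, the baseline is a first-degree polynomial, the noise level is constant across wavelengths, and the scatter term is omitted. The band centres are placed near the well-known water absorptions at roughly 1450 and 1940 nm; the widths are arbitrary.

```python
import math
import random


def gaussian_band(wavelengths, center, width):
    """Unit-height Gaussian band (a stand-in for a Voigt profile)."""
    return [math.exp(-0.5 * ((w - center) / width) ** 2) for w in wavelengths]


def synth_spectrum(conc, E, path_length=1.0, baseline=(0.0, 0.0),
                   noise_sd=0.0, rng=None):
    """A(λ) = L * Σ_k c_k * ε_k(λ) + baseline(λ) + noise(λ) for one sample."""
    rng = rng or random.Random(0)
    n = len(E[0])
    spectrum = []
    for j in range(n):
        t = j / (n - 1)  # normalized wavelength position in [0, 1]
        absorbance = path_length * sum(c * e[j] for c, e in zip(conc, E))
        drift = baseline[0] + baseline[1] * t  # linear baseline
        spectrum.append(absorbance + drift + rng.gauss(0.0, noise_sd))
    return spectrum


wavelengths = [1000 + 2 * i for i in range(751)]  # 1000-2500 nm, 2 nm step
E = [gaussian_band(wavelengths, 1450, 40), gaussian_band(wavelengths, 1940, 50)]
clean = synth_spectrum([0.7, 0.3], E)
thick = synth_spectrum([0.7, 0.3], E, path_length=2.0)
noisy = synth_spectrum([0.7, 0.3], E, baseline=(0.02, 0.05), noise_sd=0.003)
```

In the noise-free case, doubling the path length doubles the absorbance at every wavelength, which is the Beer-Lambert behaviour the model is built on.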

Phase 2 Features:
  • Instrument archetype simulation (FOSS, Bruker, etc.)

  • Measurement mode physics (transmittance, reflectance, ATR)

  • Detector response curves and noise models

  • Multi-sensor stitching (combining signals from different wavelength ranges)

  • Multi-scan averaging/denoising (simulating multiple scans per sample)

Phase 3 Features:
  • Temperature effects on spectral bands (O-H, N-H, C-H shifts)

  • Moisture and water activity effects

  • Particle size effects (EMSC-style scattering)

wavelengths

Array of wavelength values in nm.

n_wavelengths

Number of wavelength points.

library

ComponentLibrary containing spectral components.

E

Precomputed component spectra matrix (n_components, n_wavelengths).

params

Dictionary of effect parameters based on complexity level.

instrument

Optional InstrumentArchetype for realistic simulation.

measurement_mode_simulator

Optional measurement mode simulator.

Parameters:
  • wavelength_start – Start wavelength in nm.

  • wavelength_end – End wavelength in nm.

  • wavelength_step – Wavelength step in nm.

  • component_library – Optional ComponentLibrary. If None, predefined components are used in realistic mode and random components in simple mode.

  • complexity – Complexity level controlling noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.

  • instrument – Instrument archetype name or InstrumentArchetype object. If provided, uses instrument-specific wavelength range, detector, etc.

  • measurement_mode – Measurement mode (transmittance, reflectance, etc.).

  • multi_sensor_config – Configuration for multi-sensor stitching.

  • multi_scan_config – Configuration for multi-scan averaging.

  • environmental_config – Phase 3 configuration for temperature/moisture effects.

  • scattering_effects_config – Phase 3 configuration for particle size/scattering.

  • random_state – Random seed for reproducibility.

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=1000)
>>> print(X.shape, Y.shape, E.shape)
(1000, 751) (1000, 5) (5, 751)
>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     measurement_mode="reflectance",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)
>>> # With environmental effects (Phase 3)
>>> from nirs4all.synthesis import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig(
...     enable_temperature=True,
...     enable_moisture=True
... )
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)
>>> # Create a SpectroDataset directly
>>> dataset = generator.create_dataset(n_train=800, n_test=200)

See also

  • ComponentLibrary: For managing spectral components.

  • InstrumentArchetype: For instrument-specific simulation.

  • MeasurementModeSimulator: For measurement mode physics.

  • nirs4all.operators.augmentation.TemperatureAugmenter: For temperature effects.

  • nirs4all.operators.augmentation.MoistureAugmenter: For moisture effects.

  • nirs4all.operators.augmentation.ParticleSizeAugmenter: For particle size effects.

  • nirs4all.operators.augmentation.EMSCDistortionAugmenter: For EMSC-style distortions.

__repr__() str[source]

Return string representation of the generator.

create_dataset(n_train: int = 800, n_test: int = 200, target_component: str | int | None = None, **generate_kwargs: Any) SpectroDataset[source]

Create a SpectroDataset from synthetic spectra.

This method generates synthetic spectra and wraps them in a SpectroDataset object ready for use with nirs4all pipelines.

Parameters:
  • n_train – Number of training samples.

  • n_test – Number of test samples.

  • target_component – Which component to use as target:

      - If None: uses all components as multi-output target.
      - If str: uses the component with that name.
      - If int: uses the component at that index.

  • **generate_kwargs – Additional arguments passed to generate().

Returns:

SpectroDataset with train/test partitions.

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> dataset = generator.create_dataset(
...     n_train=800,
...     n_test=200,
...     target_component="protein"
... )
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
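
The three accepted forms of target_component (None, str, int) amount to a column selection on the concentration matrix. The helper below sketches that dispatch on plain lists; select_target and the error message are hypothetical, and the component names are taken from the examples in this reference.

```python
def select_target(Y, component_names, target_component=None):
    """Select the target column(s) from a concentration matrix.

    None  -> all components (multi-output), returned as a copy.
    str   -> the component with that name.
    int   -> the component at that index.
    """
    if target_component is None:
        return [row[:] for row in Y]
    if isinstance(target_component, str):
        if target_component not in component_names:
            raise ValueError(f"unknown component: {target_component!r}")
        index = component_names.index(target_component)
    else:
        index = target_component
    return [row[index] for row in Y]


names = ["starch", "protein", "moisture"]
Y = [[0.6, 0.3, 0.1], [0.5, 0.35, 0.15]]
y_protein = select_target(Y, names, "protein")
y_first = select_target(Y, names, 0)
y_all = select_target(Y, names)
```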
generate(n_samples: int = 1000, concentration_method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', include_batch_effects: bool = False, n_batches: int = 1, include_instrument_effects: bool = True, include_multi_sensor: bool = True, include_multi_scan: bool = True, include_environmental_effects: bool = True, include_scattering_effects: bool = True, include_edge_artifacts: bool = True, temperatures: ndarray | None = None, return_metadata: bool = False) Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]][source]

Generate synthetic NIRS spectra.

This is the main generation method that creates synthetic spectra by applying all physical effects in sequence.

Parameters:
  • n_samples – Number of spectra to generate.

  • concentration_method – Method for generating concentrations. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.

  • include_batch_effects – Whether to add batch/session effects.

  • n_batches – Number of batches (only if include_batch_effects=True).

  • include_instrument_effects – Whether to apply instrument-specific effects (detector response, noise). Only applies if instrument was specified during initialization.

  • include_multi_sensor – Whether to apply multi-sensor stitching effects. Only applies if multi_sensor_config is set.

  • include_multi_scan – Whether to simulate multi-scan averaging. Only applies if multi_scan_config is set.

  • include_environmental_effects – Whether to apply Phase 3 temperature and moisture effects. Only applies if environmental_config is set.

  • include_scattering_effects – Whether to apply Phase 3 particle size and EMSC-style scattering effects. Only applies if scattering_effects_config is set.

  • include_edge_artifacts – Whether to apply edge artifact effects (detector roll-off, stray light, edge curvature, truncated peaks). Only applies if edge_artifacts_config is set.

  • temperatures – Optional array of temperatures (°C) for each sample. If None and environmental effects are enabled, random temperatures are generated based on the configuration. Shape: (n_samples,).

  • return_metadata – Whether to return additional metadata dictionary.

Returns:

If return_metadata=False, a tuple of (X, Y, E):
  • X: Spectra matrix (n_samples, n_wavelengths)

  • Y: Concentration matrix (n_samples, n_components)

  • E: Component spectra (n_components, n_wavelengths)

If return_metadata=True, a tuple of (X, Y, E, metadata):
  • metadata: Dictionary with generation details

Return type:

Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]]

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=500)
>>> print(f"Spectra: {X.shape}, Targets: {Y.shape}")
Spectra: (500, 751), Targets: (500, 5)
>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)
>>> # With environmental effects (Phase 3)
>>> from nirs4all.synthesis import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig()
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)
>>> # With metadata
>>> X, Y, E, meta = generator.generate(100, return_metadata=True)
>>> print(meta.keys())
generate_batch_effects(n_batches: int, samples_per_batch: List[int]) Tuple[ndarray, ndarray][source]

Generate batch/session effects for domain adaptation research.

Note: This is NOT delegated to BatchEffectAugmenter because the semantics differ. This method generates distinct per-batch offsets/gains for multiple batch groups (domain adaptation), while BatchEffectAugmenter applies a single batch-level or sample-level effect. The multi-batch group assignment logic in generate() depends on the per-batch return values.

Parameters:
  • n_batches – Number of measurement batches/sessions.

  • samples_per_batch – List of sample counts per batch.

Returns:

Tuple of (batch_offsets, batch_gains):
  • batch_offsets: Wavelength-dependent offsets per batch.

  • batch_gains: Multiplicative gains per batch.

Return type:

Tuple[ndarray, ndarray]
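
Per-batch offsets and gains can be illustrated with a short sketch. The distributions, their widths, and the linear shape of the offset curve are assumptions chosen for the example; only the general idea (one gain and one wavelength-dependent offset per batch, applied to that batch's samples) comes from the description above. Both function names are hypothetical.

```python
import random


def make_batch_effects(n_batches, n_wavelengths, seed=0):
    """Draw one gain and one linear-in-wavelength offset curve per batch."""
    rng = random.Random(seed)
    offsets, gains = [], []
    span = max(n_wavelengths - 1, 1)
    for _ in range(n_batches):
        gains.append(rng.gauss(1.0, 0.03))  # multiplicative, close to 1
        level = rng.gauss(0.0, 0.02)
        slope = rng.gauss(0.0, 0.01)
        offsets.append([level + slope * j / span for j in range(n_wavelengths)])
    return offsets, gains


def apply_batch_effects(spectra, samples_per_batch, offsets, gains):
    """Apply batch b's gain and offset to the next samples_per_batch[b] spectra."""
    if sum(samples_per_batch) != len(spectra):
        raise ValueError("samples_per_batch must sum to the number of spectra")
    shifted, batch_ids, start = [], [], 0
    for b, count in enumerate(samples_per_batch):
        for row in spectra[start:start + count]:
            shifted.append([gains[b] * a + o for a, o in zip(row, offsets[b])])
            batch_ids.append(b)
        start += count
    return shifted, batch_ids


spectra = [[0.5] * 4 for _ in range(5)]
offsets, gains = make_batch_effects(n_batches=2, n_wavelengths=4)
shifted, batch_ids = apply_batch_effects(spectra, [3, 2], offsets, gains)
```

Returning the batch index of every sample alongside the spectra makes it possible to evaluate models batch by batch in domain adaptation experiments.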

generate_concentrations(n_samples: int, method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', alpha: ndarray | None = None, correlation_matrix: ndarray | None = None) ndarray[source]

Generate concentration matrix using specified distribution.

Parameters:
  • n_samples – Number of samples to generate.

  • method – Concentration generation method:

      - ‘dirichlet’: Compositional data (concentrations sum to ~1).
      - ‘uniform’: Independent uniform [0, 1] values.
      - ‘lognormal’: Log-normal distributed, normalized.
      - ‘correlated’: Multivariate with specified correlations.

  • alpha – Dirichlet concentration parameters (only for ‘dirichlet’ method). Shape: (n_components,). Higher values = more uniform distribution.

  • correlation_matrix – Correlation structure for ‘correlated’ method. Shape: (n_components, n_components).

Returns:

Concentration matrix of shape (n_samples, n_components).

Raises:

ValueError – If method is unknown.

Example

>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> C = generator.generate_concentrations(100, method='dirichlet')
>>> print(C.shape, C.sum(axis=1).mean())  # Should sum to ~1
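
Why the 'dirichlet' method yields rows that sum to 1 can be seen from the standard construction of a Dirichlet draw: sample independent Gamma(alpha_k, 1) variables and divide by their sum. The sketch below uses only the standard library and is independent of how nirs4all samples internally; dirichlet_rows is a hypothetical name.

```python
import random


def dirichlet_rows(n_samples, alpha, seed=0):
    """Draw compositional rows via normalized Gamma(alpha_k, 1) variables."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(gammas)
        rows.append([g / total for g in gammas])
    return rows


C = dirichlet_rows(2000, alpha=[2.0, 1.0, 1.0])
row_sums = [sum(row) for row in C]
# The expected share of component k is alpha_k / sum(alpha), here 0.5 for k = 0.
mean_first = sum(row[0] for row in C) / len(C)
```

Scaling all alpha values up by a common factor keeps the expected shares but reduces the spread around them, which is the sense in which higher values give more uniform compositions.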
generate_from_concentrations(concentrations: ndarray, include_batch_effects: bool = False, n_batches: int = 1, include_instrument_effects: bool = True, include_environmental_effects: bool = True, include_scattering_effects: bool = True, include_edge_artifacts: bool = True, temperatures: ndarray | None = None) Tuple[ndarray, Dict[str, Any]][source]

Generate synthetic NIRS spectra from pre-defined concentrations.

This method allows generating spectra using externally-provided concentrations (e.g., from aggregate components) instead of random sampling.

Parameters:
  • concentrations – Concentration matrix (n_samples, n_components). Each row should sum to approximately 1.0.

  • include_batch_effects – Whether to add batch/session effects.

  • n_batches – Number of batches (only if include_batch_effects=True).

  • include_instrument_effects – Whether to apply instrument-specific effects (detector response, noise).

  • include_environmental_effects – Whether to apply temperature and moisture effects.

  • include_scattering_effects – Whether to apply particle size and EMSC-style scattering effects.

  • include_edge_artifacts – Whether to apply edge artifact effects.

  • temperatures – Optional array of temperatures (°C) for each sample.

Returns:

Tuple of (X, metadata):
  • X: Spectra matrix (n_samples, n_wavelengths)

  • metadata: Dictionary with generation details

Return type:

Tuple[ndarray, Dict[str, Any]]

Example

>>> import numpy as np
>>> from nirs4all.synthesis import ComponentLibrary, SyntheticNIRSGenerator
>>> # Generate from aggregate concentrations
>>> C = np.array([[0.6, 0.3, 0.1], [0.5, 0.35, 0.15]])
>>> generator = SyntheticNIRSGenerator(
...     component_library=ComponentLibrary.from_predefined(
...         ["starch", "protein", "moisture"]
...     )
... )
>>> X, meta = generator.generate_from_concentrations(C)
class nirs4all.synthesis.TargetConfig(distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', range: Tuple[float, float] | None = None, n_targets: int | None = None, component_indices: List[int] | None = None, transform: Literal['log', 'sqrt'] | None = None)[source]

Bases: object

Configuration for target variable generation.

distribution

Target value distribution method. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.

Type:

Literal[‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’]

range

Optional (min, max) range for scaling targets.

Type:

Tuple[float, float] | None

n_targets

Number of target variables (auto from components if None).

Type:

int | None

component_indices

Which components to use as targets (all if None).

Type:

List[int] | None

transform

Optional transformation to apply (‘log’, ‘sqrt’, None).

Type:

Literal[‘log’, ‘sqrt’] | None

component_indices: List[int] | None = None
distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet'
n_targets: int | None = None
range: Tuple[float, float] | None = None
transform: Literal['log', 'sqrt'] | None = None
class nirs4all.synthesis.TargetGenerator(random_state: int | None = None)[source]

Bases: object

Generate target variables for synthetic NIRS datasets.

This class creates both regression targets (continuous values correlated with component concentrations) and classification targets (discrete labels with controllable class separation).

rng

NumPy random generator for reproducibility.

Parameters:

random_state – Random seed for reproducibility.

Example

>>> import numpy as np
>>> from nirs4all.synthesis import TargetGenerator
>>>
>>> generator = TargetGenerator(random_state=42)
>>>
>>> # Placeholder concentrations (normally produced by SyntheticNIRSGenerator)
>>> C = np.random.rand(100, 5)  # 5 components
>>>
>>> # Regression targets scaled to percentage
>>> y = generator.regression(
...     n_samples=100,
...     concentrations=C,
...     component=0,  # Use first component
...     range=(0, 100)
... )
>>>
>>> # Multi-class classification
>>> y = generator.classification(
...     n_samples=100,
...     concentrations=C,
...     n_classes=4,
...     separation=2.0
... )
classification(n_samples: int, concentrations: ndarray | None = None, *, n_classes: int = 2, class_weights: List[float] | None = None, separation: float = 1.5, separation_method: Literal['component', 'threshold', 'cluster'] = 'component', class_names: List[str] | None = None, return_proba: bool = False) ndarray | Tuple[ndarray, ndarray][source]

Generate classification target labels with controllable class separation.

The separation parameter controls how distinguishable classes are in feature space. Higher values create more separable classes.

Parameters:
  • n_samples – Number of samples.

  • concentrations – Component concentration matrix.

  • n_classes – Number of classes to generate.

  • class_weights – Class proportions (should sum to 1.0). If None, uses balanced classes.

  • separation – Class separation factor:
      - 0.5-1.0: Overlapping classes (challenging)
      - 1.5-2.0: Moderate separation (realistic)
      - 2.5+: Well-separated classes (easy)

  • separation_method – How to create class differences:
      - “component”: Each class has distinct component profiles
      - “threshold”: Classes based on concentration thresholds
      - “cluster”: K-means-like cluster assignment

  • class_names – Optional string labels for classes.

  • return_proba – If True, also return class probabilities.

Returns:

If return_proba=False: Integer class labels (n_samples,). If return_proba=True: Tuple of (labels, probabilities).

Example

>>> # Binary classification with balanced classes
>>> y = generator.classification(100, C, n_classes=2)
>>>
>>> # 3-class with imbalanced weights
>>> y = generator.classification(
...     100, C,
...     n_classes=3,
...     class_weights=[0.5, 0.3, 0.2],
...     separation=2.0
... )
regression(n_samples: int, concentrations: ndarray | None = None, *, distribution: Literal['uniform', 'normal', 'lognormal', 'bimodal'] = 'uniform', range: Tuple[float, float] | None = None, component: int | str | List[int] | None = None, component_names: List[str] | None = None, correlation: float = 0.9, noise: float = 0.1, transform: Literal['log', 'sqrt'] | None = None) ndarray[source]

Generate regression target values.

Parameters:
  • n_samples – Number of samples.

  • concentrations – Component concentration matrix (n_samples, n_components). If None, generates random base values.

  • distribution – Target value distribution.

  • range – (min, max) for scaling targets.

  • component – Which component(s) to use as target:
      - None: Weighted combination of all components
      - int: Use component at that index
      - str: Use component with that name (requires component_names)
      - List[int]: Multi-output using specified component indices

  • component_names – Names of components (for string component selection).

  • correlation – Correlation between concentrations and targets (0-1).

  • noise – Noise level to add.

  • transform – Optional transformation (‘log’, ‘sqrt’).

Returns:

Target values array. Shape (n_samples,) for single target, or (n_samples, n_targets) for multi-output.

Example

>>> y = generator.regression(
...     100, C,
...     distribution="lognormal",
...     range=(5, 50),
...     component="protein",
...     component_names=["water", "protein", "lipid"]
... )
class nirs4all.synthesis.TemperatureConfig(reference_temperature: float = 25.0, sample_temperature: float = 25.0, temperature_variation: float = 0.0, enable_shift: bool = True, enable_intensity: bool = True, enable_broadening: bool = True, region_specific: bool = True, custom_regions: Dict[SpectralRegion, TemperatureEffectParams] | None = None)[source]

Bases: object

Configuration for temperature effect simulation.

reference_temperature

Reference temperature in °C (typically 25°C).

Type:

float

sample_temperature

Actual sample temperature in °C.

Type:

float

temperature_variation

Sample-to-sample temperature variation (std dev in °C).

Type:

float

enable_shift

Whether to apply wavelength shifts.

Type:

bool

enable_intensity

Whether to apply intensity changes.

Type:

bool

enable_broadening

Whether to apply band broadening.

Type:

bool

region_specific

Whether to use region-specific parameters.

Type:

bool

custom_regions

Optional custom region parameters to override defaults.

Type:

Dict[nirs4all.synthesis.environmental.SpectralRegion, nirs4all.synthesis.environmental.TemperatureEffectParams] | None

custom_regions: Dict[SpectralRegion, TemperatureEffectParams] | None = None
property delta_temperature: float

Temperature difference from reference.

enable_broadening: bool = True
enable_intensity: bool = True
enable_shift: bool = True
reference_temperature: float = 25.0
region_specific: bool = True
sample_temperature: float = 25.0
temperature_variation: float = 0.0
class nirs4all.synthesis.TemperatureEffectParams(wavelength_range: Tuple[float, float], shift_per_degree: float, intensity_change_per_degree: float, broadening_per_degree: float, reference: str = '')[source]

Bases: object

Temperature effect parameters for a spectral region.

Based on literature values for temperature-induced spectral changes in NIR.
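
A minimal sketch of how per-degree parameters like these could act on a single band, assuming each effect scales linearly with the temperature difference from the reference. The helper apply_temperature_to_band and all numeric values are hypothetical illustrations; they are not part of nirs4all and are not literature values.

```python
def apply_temperature_to_band(center_nm, amplitude, width_nm, delta_t,
                              shift_per_degree, intensity_change_per_degree,
                              broadening_per_degree):
    """Return (center, amplitude, width) after a linear temperature correction."""
    # Peak position moves by a fixed number of nm per degree (negative = blue shift).
    new_center = center_nm + shift_per_degree * delta_t
    # Intensity and bandwidth change by a fraction of their value per degree.
    new_amplitude = amplitude * (1.0 + intensity_change_per_degree * delta_t)
    new_width = width_nm * (1.0 + broadening_per_degree * delta_t)
    return new_center, new_amplitude, new_width


# Hypothetical band near 1450 nm, sample measured 10 °C above the reference.
center, amplitude, width = apply_temperature_to_band(
    center_nm=1450.0, amplitude=1.0, width_nm=50.0, delta_t=10.0,
    shift_per_degree=-0.3, intensity_change_per_degree=-0.002,
    broadening_per_degree=0.001,
)
print(center, amplitude, width)
```

Here delta_t plays the role of TemperatureConfig.delta_temperature, the sample temperature minus the reference temperature.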

wavelength_range

Affected wavelength range (nm).

Type:

Tuple[float, float]

shift_per_degree

Peak position shift per °C (nm). Negative = blue shift.

Type:

float

intensity_change_per_degree

Fractional intensity change per °C.

Type:

float

broadening_per_degree

Fractional bandwidth increase per °C.

Type:

float

reference

Literature reference for values.

Type:

str

broadening_per_degree: float
intensity_change_per_degree: float
reference: str = ''
shift_per_degree: float
wavelength_range: Tuple[float, float]
class nirs4all.synthesis.TransflectanceConfig(path_length_mm: float = 0.5, reflector_type: str = 'gold', reflector_reflectance: float = 0.95, spacer_thickness_mm: float = 0.5)[source]

Bases: object

Configuration for transflectance measurement mode.

Light passes through sample, reflects off a mirror/diffuser, and passes through sample again (double-pass).

path_length_mm

Single-pass path length in mm.

Type:

float

reflector_type

Type of backing reflector.

Type:

str

reflector_reflectance

Reflectance of backing material.

Type:

float

spacer_thickness_mm

Spacer thickness controlling path length.

Type:

float

path_length_mm: float = 0.5
reflector_reflectance: float = 0.95
reflector_type: str = 'gold'
spacer_thickness_mm: float = 0.5
class nirs4all.synthesis.TransmittanceConfig(path_length_mm: float = 1.0, path_length_variation: float = 0.02, cuvette_material: str = 'quartz', reference_type: str = 'air')[source]

Bases: object

Configuration for transmittance measurement mode.

Implements Beer-Lambert law: A = εcl where A is absorbance, ε is molar absorptivity, c is concentration, and l is path length.
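
As a rough illustration of the relation above, the sketch below evaluates A = εcl on a small wavelength grid. The function name beer_lambert_absorbance, the absorptivity values, and the treatment of path_length_variation as a relative Gaussian jitter are assumptions made for this example; the library may model them differently. Absorptivity is expressed per unit concentration per mm so that it matches path_length_mm.

```python
import numpy as np


def beer_lambert_absorbance(epsilon, concentration, path_length_mm,
                            path_length_variation=0.0, n_samples=1, seed=None):
    """Return an (n_samples, n_wavelengths) absorbance matrix, A = epsilon * c * l."""
    rng = np.random.default_rng(seed)
    # One path length per sample, jittered around the nominal value (assumed model).
    jitter = path_length_variation * rng.standard_normal(n_samples)
    lengths = path_length_mm * (1.0 + jitter)
    # Outer product: rows are samples, columns are wavelengths.
    return np.outer(lengths * concentration, np.asarray(epsilon, dtype=float))


epsilon = [0.1, 0.2, 0.4]  # hypothetical absorptivities at three wavelengths
A_nominal = beer_lambert_absorbance(epsilon, concentration=2.0, path_length_mm=1.0)
A_varied = beer_lambert_absorbance(epsilon, concentration=2.0, path_length_mm=1.0,
                                   path_length_variation=0.02, n_samples=5, seed=0)
print(A_nominal)        # single spectrum at the nominal path length
print(A_varied.shape)   # one row per simulated sample
```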

path_length_mm

Optical path length in mm.

Type:

float

path_length_variation

Sample-to-sample variation in path length.

Type:

float

cuvette_material

Material of sample holder (affects NIR absorption).

Type:

str

reference_type

Type of reference measurement.

Type:

str

cuvette_material: str = 'quartz'
path_length_mm: float = 1.0
path_length_variation: float = 0.02
reference_type: str = 'air'
exception nirs4all.synthesis.ValidationError[source]

Bases: Exception

Exception raised when synthetic data validation fails.

class nirs4all.synthesis.VarianceFitResult(operator_params: OperatorVarianceParams, pca_params: PCAVarianceParams, n_samples: int = 0, wavelengths: ndarray | None = None)[source]

Bases: object

Combined result from variance fitting.

operator_params

Operator-based variance parameters.

Type:

nirs4all.synthesis.fitter.OperatorVarianceParams

pca_params

PCA-based variance parameters.

Type:

nirs4all.synthesis.fitter.PCAVarianceParams

n_samples

Number of samples used for fitting.

Type:

int

wavelengths

Wavelength grid.

Type:

numpy.ndarray | None

n_samples: int = 0
operator_params: OperatorVarianceParams
pca_params: PCAVarianceParams
summary() str[source]

Return human-readable summary.

wavelengths: ndarray | None = None
class nirs4all.synthesis.VarianceFitter(n_pca_components: int = 10)[source]

Bases: object

Fit variance parameters from real spectra.

Provides two complementary methods for modeling spectral variation:

  • Operator-based: Independent physical sources (noise, scatter, baseline)

  • PCA-based: Correlated variations capturing the covariance structure
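
The PCA-based idea can be sketched with NumPy alone: estimate the mean and principal directions of the real spectra, then draw new scores with the same per-component spread. This is only an illustration under the assumption of independent Gaussian scores; fit_pca and sample_pca are hypothetical helpers, not the VarianceFitter implementation.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return mean spectrum, principal directions, and per-component score std."""
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    score_std = s[:n_components] / np.sqrt(X.shape[0] - 1)
    return mean, vt[:n_components], score_std


def sample_pca(mean, components, score_std, n_samples, seed=None):
    """Draw spectra whose covariance follows the fitted components."""
    rng = np.random.default_rng(seed)
    scores = rng.standard_normal((n_samples, len(score_std))) * score_std
    return mean + scores @ components


rng = np.random.default_rng(0)
# Placeholder "real" spectra with a rank-5 covariance structure.
X_real = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 200))
mean, components, score_std = fit_pca(X_real, n_components=5)
X_new = sample_pca(mean, components, score_std, n_samples=100, seed=42)
print(X_new.shape)
```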

Example

>>> from nirs4all.synthesis import VarianceFitter
>>>
>>> fitter = VarianceFitter()
>>> result = fitter.fit(X_real, wavelengths)
>>>
>>> # Use operator-based params for generation
>>> print(f"Noise level: {result.operator_params.noise_std:.6f}")
>>>
>>> # Generate synthetic variance using PCA
>>> X_variance = fitter.generate_pca_variance(n_samples=100, random_state=42)
fit(X: ndarray, wavelengths: ndarray | None = None) VarianceFitResult[source]

Fit variance parameters from real spectra.

Parameters:
  • X – Real spectra matrix (n_samples, n_wavelengths).

  • wavelengths – Wavelength array (nm).

Returns:

VarianceFitResult with both operator and PCA parameters.

generate_operator_variance(base_spectrum: ndarray, wavelengths: ndarray, n_samples: int = 100, random_state: int | None = None) ndarray[source]

Generate synthetic spectra using operator-based variance.

Parameters:
  • base_spectrum – Mean/fitted spectrum to add variance to.

  • wavelengths – Wavelength array.

  • n_samples – Number of samples to generate.

  • random_state – Random seed.

Returns:

Array of synthetic spectra (n_samples, n_wavelengths).

generate_pca_variance(n_samples: int = 100, n_components: int | None = None, random_state: int | None = None) ndarray[source]

Generate synthetic spectra using PCA-based variance.

Parameters:
  • n_samples – Number of samples to generate.

  • n_components – Number of PCA components to use (None = all).

  • random_state – Random seed.

Returns:

Array of synthetic spectra (n_samples, n_wavelengths).

class nirs4all.synthesis.VariationType(value)[source]

Bases: Enum

Type of variation for component concentrations.

FIXED

No variation, use exact specified value.

UNIFORM

Uniform distribution between min and max.

NORMAL

Normal (Gaussian) distribution with mean and std.

LOGNORMAL

Log-normal distribution for non-negative values.

CORRELATED

Value derived from correlation with another component.

COMPUTED

Value computed from other components (e.g., 1 - sum(others)).

COMPUTED = 6
CORRELATED = 5
FIXED = 1
LOGNORMAL = 4
NORMAL = 3
UNIFORM = 2
nirs4all.synthesis.aggregate_info(name: str) str[source]

Return formatted information about an aggregate.

Parameters:

name – Aggregate name.

Returns:

Human-readable string with aggregate details.

Example

>>> print(aggregate_info("wheat_grain"))
nirs4all.synthesis.apply_hydrogen_bonding_shift(wavenumber_cm: float, h_bond_strength: float = 0.5, is_donor: bool = True) float[source]

Apply hydrogen bonding shift to a wavenumber.

Hydrogen bonding weakens X-H bonds, shifting stretching frequencies to lower wavenumbers (red shift). The shift magnitude depends on the hydrogen bond strength.

Typical O-H stretch positions:

  • Free O-H: ~3650 cm⁻¹

  • Weak H-bond: ~3500 cm⁻¹

  • Strong H-bond: ~3200 cm⁻¹

Parameters:
  • wavenumber_cm – Original wavenumber in cm⁻¹.

  • h_bond_strength – Hydrogen bond strength (0 = none, 1 = very strong).

  • is_donor – Whether the group is a hydrogen bond donor.

Returns:

Shifted wavenumber in cm⁻¹.

Example

>>> apply_hydrogen_bonding_shift(3650, h_bond_strength=0.5)
3467.5  # Red-shifted by hydrogen bonding
nirs4all.synthesis.available_components() List[str][source]

Return list of all available predefined component names.

Returns:

Sorted list of component names.

Example

>>> names = available_components()
>>> print(f"Available: {len(names)} components")
>>> print(names[:5])
nirs4all.synthesis.band_info(functional_group: str, band_key: str) str[source]

Return formatted information about a specific band.

Parameters:
  • functional_group – Functional group name.

  • band_key – Band key within the group.

Returns:

Human-readable string with band details.

Example

>>> print(band_info("O-H", "1st_overtone_water"))
nirs4all.synthesis.band_summary() str

Return a summary of the band assignments dictionary.

Returns:

Human-readable summary string.

Example

>>> print(band_summary())
nirs4all.synthesis.benchmark_backends(n_samples: int = 1000, n_wavelengths: int = 700, n_components: int = 5, n_trials: int = 5) Dict[str, float][source]

Benchmark available backends.

Parameters:
  • n_samples – Number of samples to generate.

  • n_wavelengths – Number of wavelengths.

  • n_components – Number of components.

  • n_trials – Number of timing trials.

Returns:

Dictionary of backend name to mean time in seconds.

Example

>>> results = benchmark_backends()
>>> for backend, seconds in results.items():
...     print(f"{backend}: {seconds:.4f}s")
nirs4all.synthesis.calculate_combination_band(mode1: str | float | List[str | float], mode2: str | float | None = None, band_type: str = 'sum', coupling_factor: float = 1.0) CombinationBandResult[source]

Calculate combination band position.

Combination bands arise from simultaneous excitation of two vibrational modes:

  • Sum bands: ν̃_comb = ν̃₁ + ν̃₂ (most common in NIR)

  • Difference bands: ν̃_comb = |ν̃₁ - ν̃₂| (less common)
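
The sum and difference rules can be checked by hand with a few lines of Python. The helper combination_band_nm below is a hypothetical stand-in that covers only the position arithmetic and the wavenumber-to-wavelength conversion (λ in nm = 10^7 / ν̃ in cm⁻¹); it ignores coupling_factor and intensity.

```python
def combination_band_nm(nu1_cm, nu2_cm, band_type="sum"):
    """Return the combination band position in nm from two wavenumbers in cm^-1."""
    if band_type == "sum":
        nu_comb = nu1_cm + nu2_cm
    elif band_type == "difference":
        nu_comb = abs(nu1_cm - nu2_cm)
    else:
        raise ValueError("band_type must be 'sum' or 'difference'")
    # Convert wavenumber (cm^-1) to wavelength (nm).
    return 1e7 / nu_comb


sum_nm = combination_band_nm(3400, 1640)                 # 3400 + 1640 = 5040 cm^-1
diff_nm = combination_band_nm(3400, 1640, "difference")  # |3400 - 1640| = 1760 cm^-1
print(round(sum_nm))  # 1984
```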

Parameters:
  • mode1 – First vibration - either a vibration type string (e.g., ‘O-H_stretch’), a numeric wavenumber in cm⁻¹, or a list of two modes.

  • mode2 – Second vibration (same format as mode1). If mode1 is a list, this should be None.

  • band_type – ‘sum’ or ‘difference’.

  • coupling_factor – Mechanical coupling between modes (0-1, affects amplitude).

Returns:

CombinationBandResult with position and intensity information.

Example

>>> # O-H stretch + O-H bend combination (water) using strings
>>> result = calculate_combination_band("O-H_stretch", "O-H_bend")
>>> print(f"{result.wavelength_nm:.0f} nm")
1984 nm
>>> # Using a list of modes
>>> result = calculate_combination_band(["O-H_stretch", "O-H_bend"])
>>> # Using numeric values
>>> result = calculate_combination_band(3400, 1640)
nirs4all.synthesis.calculate_overtone_position(vibration_type_or_frequency: str | float, overtone_order: int, anharmonicity: float | None = None) OvertoneResult[source]

Calculate overtone band position with anharmonicity correction.

For a harmonic oscillator, overtones would be exactly at n × ν̃₀. However, real molecular vibrations are anharmonic, causing overtones to appear at slightly lower wavenumbers than the harmonic prediction.

The anharmonic wavenumber is: ν̃ₙ = n × ν̃₀ × (1 - n × χ) where χ is the anharmonicity constant (typically 0.01-0.03).
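
A worked instance of the formula above, written as a small standalone function. The name overtone_wavelength_nm and the input values (a 3650 cm⁻¹ fundamental with χ = 0.025) are illustrative assumptions chosen for this sketch, not values read from FUNDAMENTAL_VIBRATIONS.

```python
def overtone_wavelength_nm(nu0_cm, n, chi=0.02):
    """Overtone position in nm from nu_n = n * nu0 * (1 - n * chi)."""
    nu_n = n * nu0_cm * (1.0 - n * chi)
    return 1e7 / nu_n  # cm^-1 -> nm


harmonic_nm = overtone_wavelength_nm(3650.0, 2, chi=0.0)      # no anharmonicity
anharmonic_nm = overtone_wavelength_nm(3650.0, 2, chi=0.025)  # 7300 * 0.95 = 6935 cm^-1
print(f"{anharmonic_nm:.0f} nm")  # 1442 nm
```

The anharmonic band falls at a longer wavelength (lower wavenumber) than the harmonic prediction, as the description states.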

Parameters:
  • vibration_type_or_frequency – Either a vibration type string (e.g., ‘O-H_stretch’) from FUNDAMENTAL_VIBRATIONS, or a numeric fundamental frequency in cm⁻¹.

  • overtone_order – Order (1 = fundamental, 2 = 1st overtone, 3 = 2nd overtone).

  • anharmonicity – Anharmonicity constant χ. If None and vibration_type is a string, uses the default for that vibration type. Otherwise defaults to 0.02.

Returns:

OvertoneResult with position and intensity information.

Example

>>> result = calculate_overtone_position("O-H_stretch", 2)  # O-H 1st overtone
>>> print(f"{result.wavelength_nm:.0f} nm")
1442 nm  # (with anharmonicity)
>>> result = calculate_overtone_position(3400, 2)  # Numeric frequency
>>> print(f"{result.wavelength_nm:.0f} nm")
1503 nm
nirs4all.synthesis.classify_wavelength_extended(wavelength_nm: float) Tuple[str, str] | None[source]

Classify a wavelength into extended spectral zones (Vis-NIR: 350-2500 nm).

This function covers both visible (electronic transitions) and NIR (vibrational overtones/combinations) regions.

Parameters:

wavelength_nm – Wavelength in nm.

Returns:

Tuple of (zone_name, description), or None if outside defined zones.

Example

>>> classify_wavelength_extended(450)
('blue_absorption', 'Soret bands, carotenoid peak absorptions')
>>> classify_wavelength_extended(660)
('red_absorption', 'Chlorophyll Q bands, hemoglobin bands')
>>> classify_wavelength_extended(1450)
('1st_overtones_OH_NH', '1st overtones O-H, N-H')
nirs4all.synthesis.classify_wavelength_zone(wavelength_nm: float) str | None[source]

Classify a wavelength into its corresponding NIR zone.

Parameters:

wavelength_nm – Wavelength in nm.

Returns:

Zone name string, or None if outside defined zones.

Example

>>> classify_wavelength_zone(1450)
'1st_overtones_OH_NH'
>>> classify_wavelength_zone(2300)
'combination_CH'
nirs4all.synthesis.compare_datasets(X_synthetic: ndarray, X_real: ndarray, wavelengths: ndarray | None = None) Dict[str, Any][source]

Quick comparison between synthetic and real datasets.

Parameters:
  • X_synthetic – Synthetic spectra.

  • X_real – Real spectra.

  • wavelengths – Wavelength grid.

Returns:

Dictionary with comparison metrics.

Example

>>> metrics = compare_datasets(X_synth, X_real)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
nirs4all.synthesis.component_info(name: str) str[source]

Return formatted information about a component.

Parameters:

name – Component name.

Returns:

Human-readable string with component details.

Example

>>> print(component_info("water"))
nirs4all.synthesis.compute_adversarial_validation_auc(real_spectra: ndarray, synthetic_spectra: ndarray, cv_folds: int = 5, random_state: int | None = None) Tuple[float, float][source]

Train classifier to distinguish real vs. synthetic spectra.

A lower AUC indicates that synthetic data is more realistic (harder to distinguish from real data).

Parameters:
  • real_spectra – Real spectra array (n_real, n_wavelengths).

  • synthetic_spectra – Synthetic spectra array (n_synthetic, n_wavelengths).

  • cv_folds – Number of cross-validation folds.

  • random_state – Random state for reproducibility.

Returns:

Tuple of (mean_auc, std_auc) across folds.

Target:

  • AUC < 0.6: Excellent (nearly indistinguishable)

  • AUC < 0.7: Good (hard to distinguish)

  • AUC < 0.8: Acceptable (some differences)

  • AUC >= 0.8: Poor (clearly distinguishable)
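
The target bands above can be turned into a small lookup for reporting. interpret_auc is a hypothetical helper written for this illustration; it only encodes the thresholds listed under Target.

```python
def interpret_auc(mean_auc):
    """Map an adversarial-validation AUC to the qualitative bands listed above."""
    if mean_auc < 0.6:
        return "excellent"
    if mean_auc < 0.7:
        return "good"
    if mean_auc < 0.8:
        return "acceptable"
    return "poor"


for auc in (0.55, 0.65, 0.75, 0.92):
    print(f"AUC {auc:.2f}: {interpret_auc(auc)}")
```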

Example

>>> real = np.random.randn(100, 500)
>>> synthetic = np.random.randn(100, 500) + 0.1
>>> mean_auc, std_auc = compute_adversarial_validation_auc(real, synthetic)
>>> print(f"AUC: {mean_auc:.3f} ± {std_auc:.3f}")
nirs4all.synthesis.compute_baseline_curvature(spectra: ndarray, polynomial_degree: int = 3) ndarray[source]

Compute baseline curvature by fitting polynomials and measuring residuals.
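
One way to realize this idea with NumPy is sketched below: fit a low-order polynomial to each spectrum and report the standard deviation of what the polynomial cannot explain. The helper baseline_residual_std and the choice of a rescaled index axis are assumptions for this example, not necessarily what the library does.

```python
import numpy as np


def baseline_residual_std(spectra, polynomial_degree=3):
    """Per-spectrum std of residuals after a polynomial baseline fit."""
    n_wavelengths = spectra.shape[1]
    x = np.linspace(-1.0, 1.0, n_wavelengths)  # rescaled axis for conditioning
    # np.polyfit fits every column of a 2-D y at once: result is (degree + 1, n_samples).
    coeffs = np.polyfit(x, spectra.T, polynomial_degree)
    fitted = np.vander(x, polynomial_degree + 1) @ coeffs  # (n_wavelengths, n_samples)
    return (spectra.T - fitted).std(axis=0)


x = np.linspace(-1.0, 1.0, 200)
cubic = 1.0 + 2.0 * x - x ** 3                # fully explained by a cubic baseline
wiggly = cubic + 0.1 * np.sin(20.0 * x)       # adds structure a cubic cannot follow
residuals = baseline_residual_std(np.vstack([cubic, wiggly]))
print(residuals)
```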

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • polynomial_degree – Degree of polynomial to fit.

Returns:

Array of residual standard deviations for each spectrum.

Example

>>> X = np.random.randn(100, 500)
>>> curvatures = compute_baseline_curvature(X)
nirs4all.synthesis.compute_correlation_length(spectra: ndarray, max_lag: int = 50) ndarray[source]

Compute correlation lengths for a set of spectra.

The correlation length is the lag at which the autocorrelation function decays to 1/e of its initial value.
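
The 1/e definition can be sketched for a single spectrum as follows. This version returns the first integer lag at which the normalized autocorrelation falls below 1/e and caps the result at max_lag; both choices, and the helper name correlation_length, are assumptions of this example.

```python
import numpy as np


def correlation_length(spectrum, max_lag=50):
    """First lag where the normalized autocorrelation drops below 1/e."""
    x = spectrum - spectrum.mean()
    denom = np.dot(x, x)
    for lag in range(1, max_lag + 1):
        acf = np.dot(x[:-lag], x[lag:]) / denom
        if acf < 1.0 / np.e:
            return lag
    return max_lag  # never decayed within the window


rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)
# A 20-point moving average introduces correlation between neighbouring points.
smooth = np.convolve(noise, np.ones(20) / 20, mode="valid")
noise_len = correlation_length(noise)
smooth_len = correlation_length(smooth)
print(noise_len, smooth_len)
```

White noise decorrelates after a single step, while the smoothed signal keeps a longer correlation length.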

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • max_lag – Maximum lag to compute autocorrelation for.

Returns:

Array of correlation lengths for each spectrum.

Example

>>> X = np.random.randn(100, 500)
>>> lengths = compute_correlation_length(X)
>>> print(f"Mean correlation length: {lengths.mean():.2f}")
nirs4all.synthesis.compute_derivative_statistics(spectra: ndarray, wavelengths: ndarray | None = None, order: int = 1) Tuple[ndarray, ndarray][source]

Compute derivative statistics for spectra.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • wavelengths – Wavelength array for proper derivative scaling.

  • order – Derivative order (1 or 2).

Returns:

Tuple of (mean_derivatives, std_derivatives) per sample.

Example

>>> X = np.random.randn(100, 500)
>>> means, stds = compute_derivative_statistics(X, order=1)
nirs4all.synthesis.compute_distribution_overlap(dist1: ndarray, dist2: ndarray, n_bins: int = 50) float[source]

Compute overlap between two distributions using histogram intersection.
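
Histogram intersection can be sketched as below: bin both samples on a shared grid, normalize the counts to probabilities, and sum the bin-wise minimum. The helper histogram_overlap and its shared-range binning are assumptions of this illustration.

```python
import numpy as np


def histogram_overlap(dist1, dist2, n_bins=50):
    """Histogram intersection of two samples on a common set of bins."""
    lo = min(dist1.min(), dist2.min())
    hi = max(dist1.max(), dist2.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    h1, _ = np.histogram(dist1, bins=bins)
    h2, _ = np.histogram(dist2, bins=bins)
    # Normalize counts so each histogram sums to 1 before intersecting.
    p1 = h1 / h1.sum()
    p2 = h2 / h2.sum()
    return float(np.minimum(p1, p2).sum())


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
same = histogram_overlap(x, x)                                      # identical samples
shifted = histogram_overlap(x, rng.standard_normal(1000) + 0.5)     # partial overlap
far = histogram_overlap(x, x + 100.0)                               # disjoint supports
print(same, shifted, far)
```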

Parameters:
  • dist1 – First distribution samples.

  • dist2 – Second distribution samples.

  • n_bins – Number of histogram bins.

Returns:

Overlap coefficient in [0, 1], where 1 means identical distributions.

Example

>>> x1 = np.random.randn(1000)
>>> x2 = np.random.randn(1000) + 0.5
>>> overlap = compute_distribution_overlap(x1, x2)
nirs4all.synthesis.compute_peak_density(spectra: ndarray, wavelengths: ndarray, window_nm: float = 100.0, prominence_threshold: float = 0.01) ndarray[source]

Compute peak density (peaks per window_nm, 100 nm by default) for spectra.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • wavelengths – Wavelength array in nm.

  • window_nm – Window size for density calculation (default 100 nm).

  • prominence_threshold – Minimum peak prominence as fraction of spectrum range.

Returns:

Array of peak densities (peaks per window_nm) for each spectrum.

Example

>>> X = np.random.randn(100, 500)
>>> wl = np.linspace(1000, 2500, 500)
>>> densities = compute_peak_density(X, wl)
nirs4all.synthesis.compute_snr(spectra: ndarray, noise_region_fraction: float = 0.1) ndarray[source]

Estimate signal-to-noise ratio for spectra.

Uses the standard deviation of the highest-frequency components (via high-pass filtering) as noise estimate.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • noise_region_fraction – Fraction of spectrum to use for noise estimation.

Returns:

Array of SNR estimates for each spectrum.

Example

>>> X = np.random.randn(100, 500) + np.sin(np.linspace(0, 10, 500))
>>> snr = compute_snr(X)
nirs4all.synthesis.compute_spectral_properties(X: ndarray, wavelengths: ndarray | None = None, name: str = 'dataset', n_pca_components: int = 20) SpectralProperties[source]

Compute comprehensive spectral properties of a dataset.

Analyzes a matrix of spectra to extract statistical and spectral properties useful for fitting and comparison. Includes Phase 1-4 enhanced properties for instrument, mode, domain, and effect inference.

Parameters:
  • X – Spectra matrix (n_samples, n_wavelengths).

  • wavelengths – Optional wavelength grid.

  • name – Dataset identifier.

  • n_pca_components – Maximum PCA components to compute.

Returns:

SpectralProperties with computed metrics.

Example

>>> props = compute_spectral_properties(X_real, wavelengths)
>>> print(f"Mean slope: {props.mean_slope:.4f}")
>>> print(f"Inferred resolution: {props.effective_resolution:.1f} nm")
nirs4all.synthesis.compute_spectral_realism_scorecard(real_spectra: ndarray, synthetic_spectra: ndarray, wavelengths: ndarray | None = None, thresholds: Dict[str, float] | None = None, include_adversarial: bool = True, random_state: int | None = None) SpectralRealismScore[source]

Compute comprehensive spectral realism scorecard.

This function computes multiple quantitative metrics to assess whether synthetic spectra are realistic compared to real data.

Parameters:
  • real_spectra – Real spectra array (n_real, n_wavelengths).

  • synthetic_spectra – Synthetic spectra array (n_synthetic, n_wavelengths).

  • wavelengths – Wavelength array in nm. If None, uses indices.

  • thresholds – Custom thresholds for metrics. Defaults:
      - correlation_length_overlap: 0.7
      - derivative_ks_pvalue: 0.05
      - peak_density_ratio_min: 0.5
      - peak_density_ratio_max: 2.0
      - baseline_curvature_overlap: 0.6
      - snr_order_of_magnitude: 1.0 (log10 difference)
      - adversarial_auc: 0.7

  • include_adversarial – Whether to compute adversarial AUC (slower).

  • random_state – Random state for adversarial validation.

Returns:

SpectralRealismScore with all metrics and pass/fail status.

Example

>>> import numpy as np
>>> from nirs4all.synthesis import SyntheticNIRSGenerator, compute_spectral_realism_scorecard
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X_synth, _, _ = gen.generate(200)
>>> # X_real would be loaded from real data
>>> X_real = np.random.randn(200, X_synth.shape[1])  # Placeholder
>>> score = compute_spectral_realism_scorecard(X_real, X_synth, gen.wavelengths)
>>> print(score.summary())
nirs4all.synthesis.convert_bandwidth_to_wavelength(bandwidth_cm: float, center_nm: float) float[source]

Convert bandwidth from wavenumber to wavelength units.

Since the relationship between wavenumber and wavelength is non-linear, the bandwidth conversion depends on the center wavelength/wavenumber.

The approximation is: Δλ ≈ Δν̃ × λ² / 10^7

This is derived from the differential: dλ = -dν̃ × (10^7 / ν̃²) = -dν̃ × λ² / 10^7 (taking absolute value for bandwidth)
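
The approximation can be compared with an exact conversion of the two band edges. Both helpers below are written for this illustration (bandwidth_cm_to_nm mirrors the formula above; bandwidth_exact_nm is a hypothetical cross-check) and are not library functions.

```python
def bandwidth_cm_to_nm(bandwidth_cm, center_nm):
    """Differential approximation: delta_lambda = delta_nu * lambda^2 / 1e7."""
    return bandwidth_cm * center_nm ** 2 / 1e7


def bandwidth_exact_nm(bandwidth_cm, center_nm):
    """Exact width: convert both band edges to nm and subtract."""
    center_cm = 1e7 / center_nm
    low_edge_nm = 1e7 / (center_cm + bandwidth_cm / 2)
    high_edge_nm = 1e7 / (center_cm - bandwidth_cm / 2)
    return high_edge_nm - low_edge_nm


approx_1450 = bandwidth_cm_to_nm(100, 1450)
approx_2200 = bandwidth_cm_to_nm(100, 2200)
exact_1450 = bandwidth_exact_nm(100, 1450)
print(approx_1450)  # 21.025
print(approx_2200)  # 48.4
```

For bandwidths that are small compared with the center wavenumber, the two results agree closely; here they differ by about a thousandth of a nanometre at 1450 nm.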

Parameters:
  • bandwidth_cm – Bandwidth in cm⁻¹ (e.g., FWHM).

  • center_nm – Center wavelength in nm.

Returns:

Bandwidth in nm.

Example

>>> convert_bandwidth_to_wavelength(100, 1450)  # 100 cm⁻¹ at 1450 nm
21.025  # approximately
>>> convert_bandwidth_to_wavelength(100, 2200)  # Same bandwidth at 2200 nm
48.4    # Broader in nm due to non-linear relationship
nirs4all.synthesis.create_atr_simulator(crystal_material: str = 'diamond', incidence_angle: float = 45.0, n_reflections: int = 1, random_state: int | None = None) MeasurementModeSimulator[source]

Create an ATR mode simulator.

Parameters:
  • crystal_material – ATR crystal material.

  • incidence_angle – Incidence angle in degrees.

  • n_reflections – Number of internal reflections.

  • random_state – Random seed.

Returns:

Configured MeasurementModeSimulator.

nirs4all.synthesis.create_domain_aware_library(domain_name: str, n_samples: int = 100, random_state: int | None = None) Tuple[List[str], ndarray][source]

Create component selection and concentrations based on domain priors.

This function samples components and their concentrations according to domain-specific distributions.

Parameters:
  • domain_name – Name of the domain.

  • n_samples – Number of samples to generate concentrations for.

  • random_state – Random seed for reproducibility.

Returns:

Tuple of (component_names, concentration_matrix).

Example

>>> components, concentrations = create_domain_aware_library(
...     "food_dairy",
...     n_samples=50,
...     random_state=42
... )
>>> print(components)
['water', 'lactose', 'casein', 'lipid']
>>> print(concentrations.shape)
(50, 4)
nirs4all.synthesis.create_reflectance_simulator(geometry: str = 'integrating_sphere', particle_size_um: float = 50.0, random_state: int | None = None) MeasurementModeSimulator[source]

Create a diffuse reflectance mode simulator.

Parameters:
  • geometry – Measurement geometry.

  • particle_size_um – Mean particle size.

  • random_state – Random seed.

Returns:

Configured MeasurementModeSimulator.

nirs4all.synthesis.create_synthetic_matching_benchmark(benchmark_name: str, n_samples: int | None = None, random_state: int | None = None) Tuple[ndarray, ndarray, ndarray][source]

Create synthetic data matching benchmark dataset properties.

Parameters:
  • benchmark_name – Name of benchmark dataset to match.

  • n_samples – Number of samples (uses benchmark size if None).

  • random_state – Random state for reproducibility.

Returns:

Tuple of (spectra, concentrations, component_spectra).

Example

>>> X, C, E = create_synthetic_matching_benchmark("corn", random_state=42)
>>> print(X.shape)
nirs4all.synthesis.create_transmittance_simulator(path_length_mm: float = 1.0, random_state: int | None = None) MeasurementModeSimulator[source]

Create a transmittance mode simulator.

Parameters:
  • path_length_mm – Optical path length in mm.

  • random_state – Random seed.

Returns:

Configured MeasurementModeSimulator.

nirs4all.synthesis.detect_best_backend() AcceleratorBackend[source]

Detect the best available acceleration backend.

Returns:

AcceleratorBackend enum indicating best available option.

Example

>>> backend = detect_best_backend()
>>> print(f"Using backend: {backend}")
nirs4all.synthesis.expand_aggregate(name: str, variability: bool = False, random_state: int | None = None, renormalize: bool = True) Dict[str, float][source]

Expand an aggregate into component weights.

Parameters:
  • name – Aggregate name.

  • variability – If True, sample from variability ranges instead of using fixed base composition.

  • random_state – Random seed for variability sampling.

  • renormalize – If True, normalize weights to sum to 1.0.
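
The renormalize step amounts to dividing each weight by the total. The sketch below uses a hypothetical helper and made-up component weights to show the effect after variability sampling has pushed the sum away from 1.0.

```python
def renormalize_weights(weights):
    """Scale a {component_name: weight} mapping so the weights sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: value / total for name, value in weights.items()}


# Hypothetical sampled composition that sums to 1.02 rather than 1.0.
sampled = {"starch": 0.70, "protein": 0.15, "moisture": 0.13, "lipid": 0.04}
normalized = renormalize_weights(sampled)
print(sum(normalized.values()))
```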

Returns:

Dictionary of {component_name: weight}.

Example

>>> # Get fixed composition
>>> comp = expand_aggregate("wheat_grain")
>>> print(comp["protein"])
0.12
>>>
>>> # Sample with variability for training data
>>> comp = expand_aggregate("wheat_grain", variability=True, random_state=42)
>>> print(comp["protein"])  # Will vary between 0.08 and 0.18
nirs4all.synthesis.export_to_csv(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None) Path[source]

Quick function to export synthetic data to single CSV.

Parameters:
  • path – Output file path.

  • X – Feature matrix.

  • y – Target values.

  • wavelengths – Optional wavelength values.

Returns:

Path to created file.

Example

>>> path = export_to_csv("data.csv", X, y)
nirs4all.synthesis.export_to_folder(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, wavelengths: ndarray | None = None, format: Literal['standard', 'single', 'fragmented'] = 'standard', random_state: int | None = None) Path[source]

Quick function to export synthetic data to folder.

Convenience function for simple export use cases.

Parameters:
  • path – Output folder path.

  • X – Feature matrix.

  • y – Target values.

  • train_ratio – Train/test split ratio.

  • wavelengths – Optional wavelength values.

  • format – Export format.

  • random_state – Random seed.

Returns:

Path to created folder.

Example

>>> path = export_to_folder(
...     "data/synthetic",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=wavelengths
... )
nirs4all.synthesis.fit_components(spectrum: ndarray, wavelengths: ndarray, component_names: List[str] | None = None, fit_baseline: bool = True, baseline_order: int = 2, method: str = 'nnls', preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False) ComponentFitResult[source]

Convenience function to fit components to a spectrum.

Parameters:
  • spectrum – Observed spectrum.

  • wavelengths – Wavelength grid.

  • component_names – Components to fit (None = all available).

  • fit_baseline – Include polynomial baseline.

  • baseline_order – Polynomial order for baseline.

  • method – Fitting method (“nnls” or “lsq”).

  • preprocessing – Preprocessing to apply to components (e.g., “second_derivative”). Use this when fitting preprocessed data.

  • auto_detect_preprocessing – If True, automatically detect preprocessing type from the data. This is useful for derivative data where the preprocessing type is unknown. Takes precedence over preprocessing if set.
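
To make the roles of fit_baseline and baseline_order concrete, the sketch below fits component spectra plus a polynomial baseline by ordinary least squares with NumPy. It is an illustration under stated assumptions: fit_components_lsq and the Gaussian component bands are hypothetical, and a non-negative variant would replace the solver with something like scipy.optimize.nnls.

```python
import numpy as np


def fit_components_lsq(spectrum, wavelengths, component_spectra, baseline_order=2):
    """Least-squares fit of component spectra plus a polynomial baseline."""
    # Rescale wavelengths to roughly [-0.5, 0.5] so polynomial terms stay well conditioned.
    span = wavelengths.max() - wavelengths.min()
    x = (wavelengths - wavelengths.mean()) / span
    baseline = np.vander(x, baseline_order + 1, increasing=True)
    design = np.hstack([component_spectra.T, baseline])
    coeffs, *_ = np.linalg.lstsq(design, spectrum, rcond=None)
    fitted = design @ coeffs
    ss_res = np.sum((spectrum - fitted) ** 2)
    ss_tot = np.sum((spectrum - spectrum.mean()) ** 2)
    n_components = component_spectra.shape[0]
    return coeffs[:n_components], coeffs[n_components:], 1.0 - ss_res / ss_tot


wavelengths = np.linspace(1000.0, 2500.0, 300)
# Two hypothetical Gaussian "component" bands.
components = np.vstack([
    np.exp(-0.5 * ((wavelengths - 1450.0) / 40.0) ** 2),
    np.exp(-0.5 * ((wavelengths - 2100.0) / 60.0) ** 2),
])
x = (wavelengths - wavelengths.mean()) / (wavelengths.max() - wavelengths.min())
# Synthetic observation: known weights plus a linear baseline.
spectrum = 0.7 * components[0] + 0.3 * components[1] + 0.05 + 0.1 * x
weights, baseline_coeffs, r_squared = fit_components_lsq(spectrum, wavelengths, components)
print(weights, baseline_coeffs, r_squared)
```

Because the synthetic spectrum lies exactly in the span of the design matrix, the fit recovers the component weights and the baseline coefficients.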

Returns:

ComponentFitResult with fit results.

Example

>>> # Fit raw absorbance data
>>> result = fit_components(spectrum, wavelengths, ["water", "protein", "lipid"])
>>>
>>> # Fit second derivative data
>>> result = fit_components(
...     deriv_spectrum, wavelengths, ["water", "protein"],
...     preprocessing="second_derivative"
... )
>>>
>>> # Auto-detect preprocessing (recommended for unknown data)
>>> result = fit_components(
...     unknown_spectrum, wavelengths,
...     auto_detect_preprocessing=True
... )
nirs4all.synthesis.fit_components_optimized(spectrum: ndarray, wavelengths: ndarray, priority_categories: List[str] | None = None, max_components: int = 10, baseline_order: int = 4, preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False, smooth_sigma_nm: float = 30.0, use_nnls: bool = False) OptimizedFitResult[source]

Convenience function for optimized component fitting.

Uses greedy, category-prioritized component selection, which is intended to give better fits than a plain NNLS fit.

Parameters:
  • spectrum – Observed spectrum.

  • wavelengths – Wavelength grid.

  • priority_categories – Categories to prioritize (e.g., [‘carbohydrates’, ‘proteins’]).

  • max_components – Maximum components to select.

  • baseline_order – Polynomial baseline order.

  • preprocessing – Preprocessing type (‘first_derivative’, ‘second_derivative’, etc.).

  • auto_detect_preprocessing – Auto-detect preprocessing from data.

  • smooth_sigma_nm – Gaussian smoothing sigma in nm to broaden component spectra.

  • use_nnls – Use non-negative least squares instead of OLS.

Returns:

OptimizedFitResult with fit results.

Example

>>> result = fit_components_optimized(
...     spectrum, wavelengths,
...     priority_categories=['carbohydrates', 'proteins'],
...     auto_detect_preprocessing=True,
... )
>>> print(f"R² = {result.r_squared:.4f}")
nirs4all.synthesis.fit_real_bands(spectrum: ndarray, wavelengths: ndarray, baseline_order: int = 4, max_bands: int = 50, target_r2: float = 0.98, allow_sigma_variation: bool = True) RealBandFitResult[source]

Convenience function for fitting spectrum using real NIR band assignments.

Uses known band positions from the NIR_BANDS dictionary for physically meaningful spectral decomposition.

Parameters:
  • spectrum – Observed spectrum.

  • wavelengths – Wavelength grid in nm.

  • baseline_order – Polynomial baseline order.

  • max_bands – Maximum number of bands to use.

  • target_r2 – Target R² for early stopping.

  • allow_sigma_variation – Allow sigma to vary within constrained ranges.

Returns:

RealBandFitResult with fit results.

Example

>>> result = fit_real_bands(spectrum, wavelengths)
>>> print(f"R² = {result.r_squared:.4f}")
>>> for name, center, amp in result.top_bands(5):
...     print(f"{center:.0f} nm: {name}")
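The max_bands and target_r2 parameters describe a budgeted fit that stops early once the fit is good enough. The sketch below illustrates that control flow only. It swaps in plain matching pursuit with unit-height Gaussian bands and no baseline, which is an assumption and not the library's algorithm; the candidate band centers and widths are illustrative values, and greedy_band_fit is a hypothetical name.

```python
import math


def gaussian_band(wavelengths, center, sigma):
    """Unit-height Gaussian band sampled on the wavelength grid."""
    return [math.exp(-0.5 * ((w - center) / sigma) ** 2) for w in wavelengths]


def r_squared(observed, residual):
    """Coefficient of determination computed from the current residual."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((v - mean) ** 2 for v in observed)
    ss_res = sum(r * r for r in residual)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def greedy_band_fit(spectrum, wavelengths, candidates, max_bands=50, target_r2=0.98):
    """Matching pursuit over candidate bands: name -> (center_nm, sigma_nm).

    In this sketch max_bands caps the number of greedy steps.
    """
    atoms = {n: gaussian_band(wavelengths, c, s) for n, (c, s) in candidates.items()}
    residual = list(spectrum)
    amplitudes = {}
    r2 = r_squared(spectrum, residual)
    for _ in range(max_bands):
        if r2 >= target_r2:
            break  # early stopping: the fit is already good enough
        best = None
        for name, atom in atoms.items():
            norm2 = sum(a * a for a in atom)
            if norm2 == 0.0:
                continue  # band lies entirely outside the grid
            dot = sum(r * a for r, a in zip(residual, atom))
            gain = dot * dot / norm2  # residual energy removed by this band
            if best is None or gain > best[0]:
                best = (gain, name, dot / norm2)
        if best is None or best[0] <= 0.0:
            break
        _, name, coef = best
        amplitudes[name] = amplitudes.get(name, 0.0) + coef
        residual = [r - coef * a for r, a in zip(residual, atoms[name])]
        r2 = r_squared(spectrum, residual)
    return amplitudes, r2


wl = list(range(1300, 2101, 2))
candidates = {  # illustrative centers and widths only
    "O-H 1450": (1450.0, 25.0),
    "C-H 1725": (1725.0, 20.0),
    "O-H 1940": (1940.0, 30.0),
}
spectrum = [0.8 * a + 0.3 * b for a, b in zip(gaussian_band(wl, 1450.0, 25.0),
                                              gaussian_band(wl, 1940.0, 30.0))]
amps, r2 = greedy_band_fit(spectrum, wl, candidates)
print(sorted(amps), round(r2, 4))
```

On this toy spectrum the loop stops after two steps, leaving the unused C-H band out of the result.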
nirs4all.synthesis.fit_to_real_data(X: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, name: str = 'source') FittedParameters[source]

Quick function to fit parameters to real data.

Convenience function for simple fitting use cases.

Parameters:
  • X – Real spectra or SpectroDataset.

  • wavelengths – Wavelength grid.

  • name – Dataset name.

Returns:

FittedParameters object.

Example

>>> params = fit_to_real_data(X_real, wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())
nirs4all.synthesis.fit_variance(X: ndarray, wavelengths: ndarray | None = None, n_pca_components: int = 10) VarianceFitResult[source]

Convenience function to fit variance parameters from real spectra.

Parameters:
  • X – Real spectra matrix (n_samples, n_wavelengths).

  • wavelengths – Wavelength array (nm).

  • n_pca_components – Number of PCA components to fit.

Returns:

VarianceFitResult with fitted parameters.

Example

>>> result = fit_variance(X_real, wavelengths)
>>> print(f"Noise level: {result.operator_params.noise_std:.6f}")
nirs4all.synthesis.generate_band_spectrum(band: BandAssignment, wavelengths: ndarray, amplitude: float = 1.0, sigma: float | None = None, gamma: float | None = None) ndarray[source]

Generate a spectrum for a single band.

Parameters:
  • band – BandAssignment to generate spectrum for.

  • wavelengths – Array of wavelengths in nm.

  • amplitude – Peak amplitude.

  • sigma – Gaussian width (default: midpoint of sigma_range).

  • gamma – Lorentzian width (default: midpoint of gamma_range).

Returns:

Array of absorbance values at each wavelength.

Example

>>> band = get_band("O-H", "1st_overtone_water")
>>> wl = np.arange(1300, 1600, 1)
>>> spectrum = generate_band_spectrum(band, wl, amplitude=0.8)
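The sigma and gamma arguments are the Gaussian and Lorentzian widths of a Voigt-type peak. As a rough illustration of how the two widths combine, the sketch below uses the pseudo-Voigt approximation (a weighted sum of a Gaussian and a Lorentzian sharing one FWHM) rather than an exact convolution. It assumes sigma is a standard deviation and gamma a half width at half maximum; the library may define or compute these differently, and pseudo_voigt_band is a hypothetical name.

```python
import math


def pseudo_voigt_band(wavelengths, center, sigma, gamma, amplitude=1.0):
    """Pseudo-Voigt band scaled so that the peak height equals `amplitude`.

    sigma: Gaussian standard deviation (nm), assumed convention.
    gamma: Lorentzian half width at half maximum (nm), assumed convention.
    """
    if sigma <= 0 and gamma <= 0:
        raise ValueError("at least one of sigma or gamma must be positive")
    ln2 = math.log(2.0)
    f_g = 2.0 * sigma * math.sqrt(2.0 * ln2)  # Gaussian FWHM
    f_l = 2.0 * gamma                         # Lorentzian FWHM
    # Combined FWHM and mixing weight (Thompson-Cox-Hastings polynomials).
    f = (f_g**5 + 2.69269 * f_g**4 * f_l + 2.42843 * f_g**3 * f_l**2
         + 4.47163 * f_g**2 * f_l**3 + 0.07842 * f_g * f_l**4 + f_l**5) ** 0.2
    ratio = f_l / f
    eta = 1.36603 * ratio - 0.47719 * ratio**2 + 0.11116 * ratio**3

    def profile(x):
        # Area-normalized Gaussian and Lorentzian with the same FWHM f.
        gauss = (2.0 / f) * math.sqrt(ln2 / math.pi) * math.exp(-4.0 * ln2 * (x / f) ** 2)
        lorentz = (2.0 / (math.pi * f)) / (1.0 + 4.0 * (x / f) ** 2)
        return eta * lorentz + (1.0 - eta) * gauss

    peak = profile(0.0)
    return [amplitude * profile(w - center) / peak for w in wavelengths]


wl = list(range(1300, 1600))
band = pseudo_voigt_band(wl, center=1450.0, sigma=25.0, gamma=5.0, amplitude=0.8)
print(max(band), wl[band.index(max(band))])
```

With gamma set to 0 the mixing weight is 0 and the band reduces to a pure Gaussian; increasing gamma adds the heavier Lorentzian tails.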
nirs4all.synthesis.generate_classification_targets(n_samples: int, concentrations: ndarray | None = None, *, random_state: int | None = None, n_classes: int = 2, class_weights: List[float] | None = None, separation: float = 1.5) ndarray[source]

Convenience function for generating classification targets.

Parameters:
  • n_samples – Number of samples.

  • concentrations – Component concentrations (optional).

  • random_state – Random seed.

  • n_classes – Number of classes.

  • class_weights – Class proportions.

  • separation – Class separation factor.

Returns:

Integer class labels array.

nirs4all.synthesis.generate_multi_source(n_samples: int, sources: List[Dict[str, Any]] | None = None, *, random_state: int | None = None, target_range: Tuple[float, float] | None = None, as_dataset: bool = True, train_ratio: float = 0.8, name: str = 'multi_source_synthetic') SpectroDataset | MultiSourceResult[source]

Convenience function for generating multi-source datasets.

Parameters:
  • n_samples – Number of samples.

  • sources – List of source configurations. If None, uses default single NIR source with wavelength range (1000, 2500).

  • random_state – Random seed.

  • target_range – Target value range.

  • as_dataset – If True, returns SpectroDataset.

  • train_ratio – Training set proportion.

  • name – Dataset name.

Returns:

SpectroDataset or MultiSourceResult depending on as_dataset.

Example

>>> dataset = generate_multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15}
...     ],
...     random_state=42
... )
nirs4all.synthesis.generate_product_samples(template: str | ProductTemplate, n_samples: int = 1000, target: str | None = None, random_state: int | None = None, **kwargs: Any) SpectroDataset[source]

Generate synthetic product samples (convenience function).

This is a shorthand for creating a ProductGenerator and calling generate().

Parameters:
  • template – Template name or ProductTemplate object.

  • n_samples – Number of samples to generate.

  • target – Component to use as regression target.

  • random_state – Random seed for reproducibility.

  • **kwargs – Additional arguments passed to ProductGenerator.generate().

Returns:

SpectroDataset with synthetic samples.

Example

>>> from nirs4all.synthesis import generate_product_samples
>>>
>>> # Generate milk samples
>>> dataset = generate_product_samples(
...     "milk_variable_fat",
...     n_samples=1000,
...     target="lipid",
...     random_state=42
... )
nirs4all.synthesis.generate_regression_targets(n_samples: int, concentrations: ndarray | None = None, *, random_state: int | None = None, distribution: str = 'uniform', range: Tuple[float, float] | None = None) ndarray[source]

Convenience function for generating regression targets.

Parameters:
  • n_samples – Number of samples.

  • concentrations – Component concentrations (optional).

  • random_state – Random seed.

  • distribution – Target distribution type.

  • range – Value range (min, max).

Returns:

Target values array.

nirs4all.synthesis.generate_sample_metadata(n_samples: int, *, random_state: int | None = None, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1) Dict[str, ndarray][source]

Convenience function to generate sample metadata.

This is a simplified interface to MetadataGenerator for common use cases.

Parameters:
  • n_samples – Total number of samples to generate.

  • random_state – Random seed for reproducibility.

  • sample_id_prefix – Prefix for sample ID strings.

  • n_groups – Number of groups (None for no grouping).

  • group_names – Optional list of group names.

  • n_repetitions – Repetitions per biological sample.

Returns:

Dictionary with metadata arrays.

Example

>>> metadata = generate_sample_metadata(
...     n_samples=100,
...     random_state=42,
...     n_groups=3,
...     n_repetitions=(2, 4)
... )
>>> print(metadata.keys())
nirs4all.synthesis.get_acceleration_speedup_estimate(n_samples: int) float[source]

Estimate speedup from GPU acceleration.

Parameters:

n_samples – Number of samples to generate.

Returns:

Estimated speedup factor (1.0 for CPU).

nirs4all.synthesis.get_aggregate(name: str) AggregateComponent[source]

Get an aggregate component definition by name.

Parameters:

name – Aggregate name (e.g., “wheat_grain”, “milk”, “tablet_excipient_base”).

Returns:

AggregateComponent object.

Raises:

ValueError – If aggregate name is not found.

Example

>>> wheat = get_aggregate("wheat_grain")
>>> print(wheat.description)
Typical wheat grain composition
nirs4all.synthesis.get_all_zones_extended() List[Tuple[float, float, str, str]][source]

Get all extended spectral zones (Vis-NIR) converted to wavelength space.

Returns:

List of (min_wavelength, max_wavelength, zone_name, description) tuples in nm.

Example

>>> zones = get_all_zones_extended()
>>> for min_wl, max_wl, name, desc in zones:
...     print(f"{name}: {min_wl:.0f}-{max_wl:.0f} nm - {desc}")
nirs4all.synthesis.get_all_zones_wavelength() List[Tuple[float, float, str]][source]

Get all NIR zones converted to wavelength space.

Returns:

List of (min_wavelength, max_wavelength, zone_name) tuples in nm.

Example

>>> zones = get_all_zones_wavelength()
>>> for min_wl, max_wl, name in zones:
...     print(f"{name}: {min_wl:.0f}-{max_wl:.0f} nm")
nirs4all.synthesis.get_backend_info() Dict[str, Any][source]

Get detailed information about available backends.

Returns:

Dictionary with backend availability and details.

nirs4all.synthesis.get_band(functional_group: str, band_key: str) BandAssignment[source]

Get a specific band assignment by functional group and key.

Parameters:
  • functional_group – Functional group name (e.g., “O-H”, “C-H_aliphatic”).

  • band_key – Band key within the group (e.g., “1st_overtone_water”).

Returns:

BandAssignment object.

Raises:

KeyError – If functional group or band key not found.

Example

>>> band = get_band("O-H", "1st_overtone_water")
>>> print(f"{band.center} nm: {band.description}")
1450 nm: O-H 1st overtone, water
nirs4all.synthesis.get_bands_by_compound(compound: str) List[BandAssignment][source]

Get all bands commonly found in a specific compound.

Parameters:

compound – Compound name (e.g., “water”, “protein”, “cellulose”).

Returns:

List of BandAssignment objects where the compound appears in common_compounds.

Example

>>> # Get bands found in water
>>> water_bands = get_bands_by_compound("water")
>>> for b in water_bands:
...     print(f"{b.center} nm: {b.description}")
nirs4all.synthesis.get_bands_by_overtone(overtone_level: str) List[BandAssignment][source]

Get all bands of a specific overtone level.

Parameters:

overtone_level – Overtone level (“1st”, “2nd”, “3rd”, “combination”, “electronic”).

Returns:

List of BandAssignment objects of the specified overtone level.

Example

>>> # Get all 1st overtone bands
>>> first_overtones = get_bands_by_overtone("1st")
>>> print(f"Found {len(first_overtones)} 1st overtone bands")
nirs4all.synthesis.get_bands_by_tag(tag: str) List[BandAssignment][source]

Get all bands with a specific tag.

Parameters:

tag – Tag to filter by (e.g., “water”, “protein”, “diagnostic”).

Returns:

List of BandAssignment objects with the specified tag.

Example

>>> # Get all diagnostic bands
>>> diagnostic = get_bands_by_tag("diagnostic")
>>> print(len(diagnostic))
nirs4all.synthesis.get_bands_in_range(wavelength_min: float, wavelength_max: float, functional_groups: List[str] | None = None) List[BandAssignment][source]

Get all bands with centers in a wavelength range.

Parameters:
  • wavelength_min – Minimum wavelength (nm).

  • wavelength_max – Maximum wavelength (nm).

  • functional_groups – Optional filter for specific functional groups.

Returns:

List of BandAssignment objects sorted by center wavelength.

Example

>>> # Get all bands in the 1st overtone region
>>> bands = get_bands_in_range(1400, 1600)
>>> for b in bands[:3]:
...     print(f"{b.center} nm: {b.functional_group} {b.description}")
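The lookup described here amounts to a filter on band centers, an optional filter on functional group, and a sort. A small sketch of that logic follows. The Band class is a hypothetical stand-in for BandAssignment, the four entries are illustrative, and treating both range bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Hypothetical stand-in for BandAssignment (illustration only)."""
    functional_group: str
    center: float  # nm
    description: str


# Illustrative entries; the real registry is much larger.
BANDS = [
    Band("O-H", 1940.0, "O-H combination, water"),
    Band("C-H_aliphatic", 1725.0, "C-H 1st overtone"),
    Band("O-H", 1450.0, "O-H 1st overtone, water"),
    Band("N-H", 1510.0, "N-H 1st overtone"),
]


def bands_in_range(wavelength_min, wavelength_max, functional_groups=None):
    """Bands whose center lies in [wavelength_min, wavelength_max], sorted by center."""
    selected = [
        b for b in BANDS
        if wavelength_min <= b.center <= wavelength_max
        and (functional_groups is None or b.functional_group in functional_groups)
    ]
    return sorted(selected, key=lambda b: b.center)


for b in bands_in_range(1400, 1600):
    print(f"{b.center:.0f} nm: {b.functional_group} {b.description}")
```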
nirs4all.synthesis.get_benchmark_info(name: str) BenchmarkDatasetInfo[source]

Get information about a benchmark dataset.

Parameters:

name – Dataset name.

Returns:

BenchmarkDatasetInfo for the dataset.

Raises:

KeyError – If dataset not found.

Example

>>> info = get_benchmark_info("corn")
>>> print(info.summary())
nirs4all.synthesis.get_benchmark_spectral_properties(name: str) Dict[str, Any][source]

Get spectral properties to match when generating synthetic data.

Parameters:

name – Benchmark dataset name.

Returns:

Dictionary of properties suitable for synthetic generator.

Example

>>> props = get_benchmark_spectral_properties("corn")
>>> generator = SyntheticNIRSGenerator(**props)
nirs4all.synthesis.get_component(name: str) SpectralComponent[source]

Get a single predefined component by name or synonym.

Parameters:

name – Component name (e.g., “water”, “protein”, “lipid”) or synonym (e.g., “amylose” for “starch”).

Returns:

SpectralComponent object.

Raises:

ValueError – If component name is not found.

Example

>>> water = get_component("water")
>>> print(water.category)
>>> print(len(water.bands))
>>>
>>> # Using synonyms
>>> starch = get_component("amylose")  # Returns starch component
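Synonym lookup of this kind is typically a two-step resolution: map the synonym to its canonical name, then look that name up in the registry. The sketch below shows the pattern with a tiny hypothetical registry. Only the amylose-to-starch mapping comes from the description above; the registry contents and the resolve_component name are assumptions.

```python
# Hypothetical registry; values stand in for SpectralComponent objects.
COMPONENTS = {
    "water": {"name": "water"},
    "starch": {"name": "starch"},
}
SYNONYMS = {"amylose": "starch"}


def resolve_component(name):
    """Resolve a canonical name or synonym, raising ValueError if unknown."""
    canonical = SYNONYMS.get(name, name)
    if canonical not in COMPONENTS:
        raise ValueError(
            f"Unknown component '{name}'. Available: {sorted(COMPONENTS)}"
        )
    return COMPONENTS[canonical]


print(resolve_component("amylose")["name"])
```

Listing the available names in the error message makes typos easy to diagnose without a second call.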
nirs4all.synthesis.get_datasets_by_domain(domain: str | BenchmarkDomain) List[str][source]

Get benchmark datasets for a specific domain.

Parameters:

domain – Domain name or enum.

Returns:

List of dataset names in that domain.

Example

>>> pharma_datasets = get_datasets_by_domain("pharmaceutical")
>>> print(pharma_datasets)
nirs4all.synthesis.get_default_noise_config(detector_type: DetectorType) NoiseModelConfig[source]

Get default noise model configuration for a detector type.

Parameters:

detector_type – Type of detector.

Returns:

NoiseModelConfig with appropriate defaults.

nirs4all.synthesis.get_detector_response(detector_type: DetectorType) DetectorSpectralResponse[source]

Get spectral response curve for a detector type.

Parameters:

detector_type – Type of detector.

Returns:

DetectorSpectralResponse object.

nirs4all.synthesis.get_detector_wavelength_range(detector_type: DetectorType) Tuple[float, float][source]

Get the effective wavelength range for a detector type.

Parameters:

detector_type – Type of detector.

Returns:

Tuple of (min_wavelength, max_wavelength) in nm.

nirs4all.synthesis.get_domain_compatible_instruments(domain: str) List[str][source]

Get list of instruments commonly used with a domain.

Parameters:

domain – Domain name.

Returns:

List of instrument names.

Example

>>> instruments = get_domain_compatible_instruments("tablets")
>>> print(instruments)
nirs4all.synthesis.get_domain_components(domain_name: str) List[str][source]

Get typical components for a domain.

Parameters:

domain_name – Name of the domain.

Returns:

List of component names.

Example

>>> get_domain_components("food_dairy")
['water', 'lactose', 'casein', 'lipid', 'moisture', 'protein']
nirs4all.synthesis.get_domain_config(domain_name: str) DomainConfig[source]

Get configuration for a specific domain.

Parameters:

domain_name – Name of the domain (key in APPLICATION_DOMAINS).

Returns:

DomainConfig for the specified domain.

Raises:

ValueError – If domain is not found.

Example

>>> config = get_domain_config("agriculture_grain")
>>> print(config.name)
Grain and Cereals
nirs4all.synthesis.get_domains_for_component(component_name: str) List[str][source]

Find domains that typically contain a specific component.

Parameters:

component_name – Name of the component.

Returns:

List of domain names containing this component.

Example

>>> get_domains_for_component("protein")
['agriculture_grain', 'food_meat', 'biomedical_tissue', ...]
nirs4all.synthesis.get_instrument_archetype(name: str) InstrumentArchetype[source]

Get a predefined instrument archetype by name.

Parameters:

name – Instrument archetype name.

Returns:

InstrumentArchetype instance.

Raises:

KeyError – If archetype name not found.

Example

>>> archetype = get_instrument_archetype("foss_xds")
>>> print(archetype.wavelength_range)
(400, 2500)
nirs4all.synthesis.get_instrument_typical_modes(instrument: str) List[str][source]

Get typical measurement modes for an instrument.

Parameters:

instrument – Instrument name.

Returns:

List of measurement mode names.

Example

>>> modes = get_instrument_typical_modes("viavi_micronir")
>>> print(modes)
nirs4all.synthesis.get_instrument_wavelength_info() Dict[str, Dict[str, Any]][source]

Get detailed information about all instrument wavelength grids.

Returns:

Dictionary mapping instrument names to info dicts containing:

  • n_wavelengths: Number of wavelength points

  • wavelength_start: Start wavelength (nm)

  • wavelength_end: End wavelength (nm)

  • mean_step: Mean wavelength step (nm)

Example

>>> info = get_instrument_wavelength_info()
>>> print(info["micronir_onsite"])
{'n_wavelengths': 125, 'wavelength_start': 908.0, ...}
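The four summary fields can be derived from any wavelength grid. The sketch below computes them for an evenly spaced 125-point grid from 908 to 1676 nm. The even spacing and the rounded endpoints are assumptions made for illustration (a real instrument grid may be irregular), and summarize_grid is a hypothetical helper.

```python
def summarize_grid(wavelengths):
    """Summarize a wavelength grid (nm) using the four fields listed above."""
    steps = [b - a for a, b in zip(wavelengths, wavelengths[1:])]
    return {
        "n_wavelengths": len(wavelengths),
        "wavelength_start": float(wavelengths[0]),
        "wavelength_end": float(wavelengths[-1]),
        "mean_step": sum(steps) / len(steps),
    }


# Assumed evenly spaced grid; endpoints taken from the rounded example values.
n_points, start, end = 125, 908.0, 1676.0
grid = [start + (end - start) * i / (n_points - 1) for i in range(n_points)]
info = summarize_grid(grid)
print(info["n_wavelengths"], info["wavelength_start"], info["wavelength_end"])
```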
nirs4all.synthesis.get_instrument_wavelengths(instrument: str) ndarray[source]

Get the wavelength grid for a known instrument.

Returns a copy of the predefined wavelength array for the specified instrument, enabling generation of synthetic data that matches real instrument wavelength grids exactly.

Parameters:

instrument – Instrument identifier (e.g., “micronir_onsite”, “foss_xds”).

Returns:

NumPy array of wavelengths in nm.

Raises:

ValueError – If the instrument is not recognized.

Example

>>> wl = get_instrument_wavelengths("micronir_onsite")
>>> print(f"MicroNIR: {len(wl)} wavelengths from {wl[0]:.0f} to {wl[-1]:.0f} nm")
MicroNIR: 125 wavelengths from 908 to 1676 nm
>>> # Use with SyntheticNIRSGenerator
>>> from nirs4all.synthesis import SyntheticNIRSGenerator
>>> gen = SyntheticNIRSGenerator(wavelengths=wl)
nirs4all.synthesis.get_instruments_by_category() Dict[str, List[str]][source]

Get all instruments organized by category.

Returns:

Dictionary mapping category name to list of instrument names.

nirs4all.synthesis.get_nir_zone(wavelength_nm: float) str | None

Classify a wavelength into its corresponding NIR zone.

Parameters:

wavelength_nm – Wavelength in nm.

Returns:

Zone name string, or None if outside defined zones.

Example

>>> get_nir_zone(1450)
'1st_overtones_OH_NH'
>>> get_nir_zone(2300)
'combination_CH'
nirs4all.synthesis.get_predefined_components() Dict[str, SpectralComponent][source]

Get predefined spectral components based on NIR band assignments.

Returns a dictionary of SpectralComponent objects representing common chemical compounds and functional groups found in NIR spectroscopy applications (agricultural, food, pharmaceutical, petrochemical).

Each component’s band assignments are based on published spectroscopic literature. Key characteristics:

  • Band centers: Wavelength positions (nm) of absorption maxima

  • Sigma: Gaussian width contribution (thermal/inhomogeneous broadening)

  • Gamma: Lorentzian width contribution (pressure/collision broadening)

  • Amplitude: Relative absorption intensity (normalized within component)

Available Components (126 total):
Water-related (2):
  • water: H₂O fundamental O-H vibrations [1, pp. 34-36]

  • moisture: Bound water in organic matrices [2, pp. 358-362]

Proteins and Nitrogen (12):
  • protein: General protein (amide, N-H, C-H) [1, pp. 48-52]

  • nitrogen_compound: Primary/secondary amines [1, pp. 52-54]

  • urea: CO(NH₂)₂ bands [9, p. 1125]

  • amino_acid: Free amino acids [3, pp. 215-220]

  • casein: Milk protein [4, pp. 85-88]

  • gluten: Wheat protein complex [5, pp. 155-160]

  • albumin: Globular protein (egg white, serum)

  • collagen: Fibrous structural protein

  • keratin: Structural protein (hair, nails)

  • zein: Corn protein (prolamin)

  • gelatin: Denatured collagen

  • whey: Milk serum proteins

Lipids and Hydrocarbons (20):
  • lipid: Triglycerides (C-H stretching) [1, pp. 44-48]

  • oil: Vegetable/mineral oils [4, pp. 67-72]

  • saturated_fat: Saturated fatty acids [7, pp. 15-20]

  • unsaturated_fat: Mono/polyunsaturated fats [7, pp. 20-25]

  • aromatic: Benzene derivatives [1, pp. 56-58]

  • alkane: Saturated hydrocarbons [7, pp. 10-15]

  • waxes: Cuticular waxes [7, pp. 15-20]

  • crude_oil: Petroleum crude oil [17], [18]

  • diesel: Diesel fuel [19]

  • gasoline: Gasoline/petrol [17]

  • kerosene: Kerosene/jet fuel [17], [18]

  • pah: Polycyclic aromatic hydrocarbons [16]

  • oleic_acid: Monounsaturated fatty acid (C18:1)

  • linoleic_acid: Polyunsaturated fatty acid (C18:2)

  • linolenic_acid: Polyunsaturated fatty acid (C18:3)

  • palmitic_acid: Saturated fatty acid (C16:0)

  • stearic_acid: Saturated fatty acid (C18:0)

  • phospholipid: Lecithin-like membrane lipids

  • cholesterol: Sterol lipid

  • cocoa_butter: Triglyceride mix

Carbohydrates (18):
  • starch: Amylose/amylopectin [5, pp. 155-160]

  • cellulose: β-1,4-glucan chains [6, pp. 295-300]

  • glucose: D-glucose monosaccharide [2, pp. 368-370]

  • fructose: D-fructose monosaccharide [2, pp. 368-370]

  • sucrose: Disaccharide [2, pp. 370-372]

  • hemicellulose: Xylan/glucomannan [6, pp. 300-303]

  • lignin: Aromatic polymer [6, pp. 303-305]

  • lactose: Milk sugar [12], [4, pp. 85-88]

  • cotton: Cotton cellulose [6, pp. 295-298]

  • dietary_fiber: Plant cell wall [6], [5]

  • maltose: Malt sugar (glucose-glucose disaccharide)

  • raffinose: Trisaccharide (galactose-glucose-fructose)

  • inulin: Fructose polymer (dietary fiber)

  • xylose: Pentose monosaccharide

  • arabinose: Pentose monosaccharide

  • galactose: Hexose monosaccharide

  • mannose: Hexose monosaccharide

  • trehalose: Non-reducing disaccharide

Alcohols and Polyols (9):
  • ethanol: C₂H₅OH [1, pp. 38-40]

  • methanol: CH₃OH [1, pp. 38-40]

  • glycerol: Polyol from fermentation [11]

  • propanol: Propyl alcohol

  • butanol: Butyl alcohol

  • sorbitol: Sugar alcohol

  • mannitol: Sugar alcohol

  • xylitol: Sugar alcohol

  • isopropanol: Isopropyl alcohol

Organic Acids (12):
  • acetic_acid: CH₃COOH [8, pp. 8-10]

  • citric_acid: C₆H₈O₇ [4, pp. 78-80]

  • lactic_acid: CH₃CH(OH)COOH [9, pp. 1128-1130]

  • malic_acid: Fruit acid [4, pp. 78-80]

  • tartaric_acid: Grape/wine acid [11]

  • formic_acid: HCOOH

  • oxalic_acid: (COOH)₂

  • succinic_acid: Dicarboxylic acid

  • fumaric_acid: Unsaturated dicarboxylic acid

  • propionic_acid: CH₃CH₂COOH

  • butyric_acid: Short-chain fatty acid

  • ascorbic_acid: Vitamin C

Pigments (18):

Chlorophylls:
  • chlorophyll: Chlorophyll a/b combined [2, pp. 375-378]

  • chlorophyll_a: Primary photosynthetic pigment (Soret 430 nm, Q 662 nm)

  • chlorophyll_b: Accessory photosynthetic pigment (Soret 453 nm, Q 642 nm)

Carotenoids:
  • carotenoid: β-carotene and xanthophylls [2, pp. 378-380]

  • beta_carotene: Orange carotenoid (λmax 425, 450, 478 nm)

  • lycopene: Red carotenoid (tomatoes)

  • lutein: Yellow carotenoid (xanthophyll)

  • xanthophyll: General yellow pigments

Flavonoids:
  • anthocyanin: Red-purple plant pigment

  • anthocyanin_red: Red anthocyanin (520-540 nm)

  • anthocyanin_purple: Purple anthocyanin (560-580 nm, pH-dependent)

Hemoproteins (visible-region electronic transitions):
  • hemoglobin_oxy: Oxygenated hemoglobin (Soret 414 nm, Q 542/577 nm)

  • hemoglobin_deoxy: Deoxygenated hemoglobin (Soret 430 nm, Q 555 nm)

  • myoglobin: Muscle oxygen-binding protein

  • cytochrome_c: Electron transport hemoprotein

Other pigments:
  • bilirubin: Bile pigment (heme degradation product)

  • melanin: Brown-black biopolymer pigment

  • tannins: Polyphenolic compounds [6], [11]

Pharmaceutical (10):
  • caffeine: C₈H₁₀N₄O₂ [9, pp. 1130-1132]

  • aspirin: Acetylsalicylic acid [9, pp. 1125-1128]

  • paracetamol: Acetaminophen [9, pp. 1132-1135]

  • ibuprofen: Anti-inflammatory drug

  • naproxen: NSAID drug

  • diclofenac: NSAID drug

  • metformin: Diabetes drug

  • omeprazole: Proton pump inhibitor

  • amoxicillin: Antibiotic

  • microcrystalline_cellulose: Pharmaceutical excipient

Polymers (11):
  • polyethylene: HDPE/LDPE plastic [15], [1, pp. 58-60]

  • polystyrene: Aromatic polymer [15], [1, pp. 56-58]

  • polypropylene: PP plastic

  • pvc: Polyvinyl chloride

  • pet: Polyethylene terephthalate

  • polyester: PET fiber [1, pp. 60-62]

  • nylon: Polyamide fiber [1, pp. 60-62]

  • pmma: Polymethyl methacrylate (acrylic)

  • ptfe: Polytetrafluoroethylene (Teflon)

  • abs: Acrylonitrile butadiene styrene

  • natural_rubber: cis-1,4-polyisoprene [15]

Solvents (6):
  • acetone: Ketone solvent [1, pp. 42-44]

  • dmso: Dimethyl sulfoxide

  • ethyl_acetate: Ester solvent

  • toluene: Aromatic solvent

  • chloroform: Halogenated solvent

  • hexane: Alkane solvent

Minerals (8):
  • carbonates: CaCO₃, MgCO₃ [13]

  • gypsum: CaSO₄·2H₂O [13]

  • kaolinite: Clay mineral [13]

  • montmorillonite: Smectite clay

  • illite: Mica-like clay

  • goethite: Iron oxyhydroxide

  • talc: Magnesium silicate

  • silica: Silicon dioxide

Returns:

Dictionary mapping component names to SpectralComponent objects.

Note

This function uses lazy initialization to avoid circular imports. The components are created once and cached for subsequent calls.

References

See module docstring for full reference list.

nirs4all.synthesis.get_product_template(name: str) ProductTemplate[source]

Get a product template by name.

Parameters:

name – Template name.

Returns:

ProductTemplate object.

Raises:

ValueError – If template name is not found.

Example

>>> template = get_product_template("milk_variable_fat")
>>> print(template.description)
Milk with variable fat content (skim to whole)
nirs4all.synthesis.get_temperature_effect_regions() Dict[str, Tuple[float, float]][source]

Get the wavelength regions with significant temperature effects.

Returns:

Dictionary mapping region names to (start, end) wavelength tuples.

nirs4all.synthesis.get_zone_wavelength_range(zone_name: str) Tuple[float, float] | None[source]

Get the wavelength range (nm) for a named NIR zone.

Parameters:

zone_name – Name of the NIR zone (e.g., ‘1st_overtones_OH_NH’).

Returns:

Tuple of (min_wavelength, max_wavelength) in nm, or None if not found.

Example

>>> get_zone_wavelength_range('1st_overtones_CH')
(1600.0, 1818.18...)
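The example output is consistent with a zone defined in wavenumber space and converted with λ (nm) = 10⁷ / ν (cm⁻¹): 10⁷ / 6250 = 1600 and 10⁷ / 5500 ≈ 1818.18. The sketch below shows that conversion. The wavenumber bounds 5500 and 6250 cm⁻¹ are inferred from the example rather than taken from the library, and the function names are hypothetical.

```python
def wavenumber_to_wavelength(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1e7 / wavenumber_cm1


def zone_to_wavelength_range(wavenumber_low, wavenumber_high):
    """Convert a wavenumber interval to (min_wavelength, max_wavelength) in nm.

    The bounds swap because a higher wavenumber means a shorter wavelength.
    """
    return (wavenumber_to_wavelength(wavenumber_high),
            wavenumber_to_wavelength(wavenumber_low))


# Bounds inferred from the documented output for '1st_overtones_CH'.
print(zone_to_wavelength_range(5500.0, 6250.0))
```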
nirs4all.synthesis.is_gpu_available() bool[source]

Check if GPU acceleration is available.

Returns:

True if JAX with GPU or CuPy is available.

Example

>>> if is_gpu_available():
...     print("GPU acceleration enabled!")
nirs4all.synthesis.is_nir_region(wavelength_nm: float) bool[source]

Check if a wavelength is in the NIR region (700-2500 nm).

Parameters:

wavelength_nm – Wavelength in nm.

Returns:

True if wavelength is in NIR region.

Example

>>> is_nir_region(1450)
True
>>> is_nir_region(500)
False
nirs4all.synthesis.is_visible_region(wavelength_nm: float) bool[source]

Check if a wavelength is in the visible region (350-700 nm).

Parameters:

wavelength_nm – Wavelength in nm.

Returns:

True if wavelength is in visible region.

Example

>>> is_visible_region(500)
True
>>> is_visible_region(1450)
False
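Both region checks reduce to interval tests on the documented ranges (350-700 nm visible, 700-2500 nm NIR). A minimal sketch follows. How the shared 700 nm boundary is assigned is not stated above, so placing it in the NIR region here is an assumption, as are the function names.

```python
VISIBLE_RANGE_NM = (350.0, 700.0)
NIR_RANGE_NM = (700.0, 2500.0)


def in_visible_region(wavelength_nm):
    """True for 350 <= wavelength < 700 nm (upper bound exclusive by assumption)."""
    return VISIBLE_RANGE_NM[0] <= wavelength_nm < VISIBLE_RANGE_NM[1]


def in_nir_region(wavelength_nm):
    """True for 700 <= wavelength <= 2500 nm (700 nm counted as NIR by assumption)."""
    return NIR_RANGE_NM[0] <= wavelength_nm <= NIR_RANGE_NM[1]


for w in (500, 1450):
    print(w, in_visible_region(w), in_nir_region(w))
```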
nirs4all.synthesis.list_aggregate_categories(domain: str | None = None) Dict[str, List[str]]

List categories and their aggregates, optionally filtered by domain.

Parameters:

domain – Optional domain filter.

Returns:

Dictionary of {category: [aggregate_names]}.

Example

>>> cats = list_aggregate_categories(domain="food")
>>> for cat, aggs in cats.items():
...     print(f"{cat}: {aggs}")
nirs4all.synthesis.list_aggregate_domains() List[str]

List all unique domains across aggregates.

Returns:

Sorted list of domain names.

Example

>>> domains = list_aggregate_domains()
>>> print(domains)
['agriculture', 'environmental', 'food', 'industrial', 'pharmaceutical']
nirs4all.synthesis.list_aggregates(domain: str | None = None, category: str | None = None, tags: List[str] | None = None) List[str][source]

List available aggregate components with optional filtering.

Parameters:
  • domain – Filter by domain (e.g., “agriculture”, “food”, “pharmaceutical”).

  • category – Filter by category (e.g., “grain”, “dairy”, “solid_dosage”).

  • tags – Filter by tags (any match).

Returns:

Sorted list of aggregate names matching the criteria.

Example

>>> # List all aggregates
>>> all_aggs = list_aggregates()
>>>
>>> # List food aggregates
>>> food_aggs = list_aggregates(domain="food")
>>>
>>> # List grain aggregates
>>> grain_aggs = list_aggregates(category="grain")
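The three filters combine as follows: domain and category must match exactly when given, and tags match if any requested tag is present. The sketch below shows that logic on a small hypothetical registry. The aggregate names come from the examples in this reference, but the domain, category and tag assigned to each entry are assumptions, as is the filter_aggregates name.

```python
# Hypothetical registry; the metadata for each name is assumed for illustration.
AGGREGATES = {
    "wheat_grain": {"domain": "agriculture", "category": "grain", "tags": {"cereal"}},
    "milk": {"domain": "food", "category": "dairy", "tags": {"liquid"}},
    "tablet_excipient_base": {
        "domain": "pharmaceutical", "category": "solid_dosage", "tags": {"tablet"},
    },
}


def filter_aggregates(domain=None, category=None, tags=None):
    """Sorted aggregate names matching every filter that is not None."""
    wanted_tags = set(tags) if tags else None
    names = []
    for name, meta in AGGREGATES.items():
        if domain is not None and meta["domain"] != domain:
            continue
        if category is not None and meta["category"] != category:
            continue
        # "Any match": keep the entry if it shares at least one requested tag.
        if wanted_tags is not None and not (wanted_tags & meta["tags"]):
            continue
        names.append(name)
    return sorted(names)


print(filter_aggregates())
print(filter_aggregates(domain="food"))
print(filter_aggregates(tags=["cereal", "tablet"]))
```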
nirs4all.synthesis.list_all_tags() List[str][source]

List all unique tags used across all bands.

Returns:

Sorted list of unique tag names.

Example

>>> tags = list_all_tags()
>>> print(tags[:10])
nirs4all.synthesis.list_bands(functional_group: str | None = None) List[str][source]

List available bands, optionally filtered by functional group.

Parameters:

functional_group – If provided, list only bands for this group.

Returns:

List of band keys (if functional_group specified) or list of “group/key” strings (if no filter).

Example

>>> # List all O-H bands
>>> oh_bands = list_bands("O-H")
>>> print(oh_bands[:3])
['1st_overtone_bound', '1st_overtone_carbohydrate', '1st_overtone_carboxylic']
>>>
>>> # List all bands
>>> all_bands = list_bands()
nirs4all.synthesis.list_benchmark_datasets() List[str][source]

List all registered benchmark datasets.

Returns:

List of dataset names.

Example

>>> datasets = list_benchmark_datasets()
>>> print(datasets)
nirs4all.synthesis.list_categories() Dict[str, List[str]][source]

Return dictionary of categories to component names.

Returns:

Dictionary mapping category names to lists of component names.

Example

>>> categories = list_categories()
>>> for cat, components in categories.items():
...     print(f"{cat}: {len(components)} components")
nirs4all.synthesis.list_detector_types() List[str][source]

List available detector types.

Returns:

List of detector type names.

nirs4all.synthesis.list_domains(category: DomainCategory | None = None) List[str][source]

List available domain names.

Parameters:

category – Optional category filter.

Returns:

List of domain names.

Example

>>> list_domains(DomainCategory.AGRICULTURE)
['agriculture_grain', 'agriculture_forage', ...]
nirs4all.synthesis.list_functional_groups() List[str][source]

List all available functional groups.

Returns:

Sorted list of functional group names.

Example

>>> groups = list_functional_groups()
>>> print(groups[:5])
['Al-OH', 'Anthocyanin', 'C-Cl', 'C-F', 'C-H_aliphatic']
nirs4all.synthesis.list_instrument_archetypes(category: InstrumentCategory | None = None) List[str][source]

List available instrument archetype names.

Parameters:

category – Optional filter by category.

Returns:

List of archetype names.

Example

>>> list_instrument_archetypes(InstrumentCategory.HANDHELD)
['viavi_micronir', 'scio', 'tellspec', 'linksquare', 'siware_neoscanner']
nirs4all.synthesis.list_instrument_wavelength_grids() List[str][source]

List all available predefined instrument wavelength grids.

Returns:

List of instrument identifiers.

Example

>>> grids = list_instrument_wavelength_grids()
>>> print(grids[:3])
['micronir_onsite', 'scio', 'neospectra_micro']
nirs4all.synthesis.list_product_categories() List[str][source]

List all unique product categories.

Returns:

Sorted list of category names.

Example

>>> categories = list_product_categories()
>>> print(categories)
['dairy', 'fruit', 'grain', 'legume', 'meat', 'nn_training', 'solid_dosage']
nirs4all.synthesis.list_product_domains() List[str][source]

List all unique product domains.

Returns:

Sorted list of domain names.

Example

>>> domains = list_product_domains()
>>> print(domains)
['agriculture', 'food', 'pharmaceutical']
nirs4all.synthesis.list_product_templates(category: str | None = None, domain: str | None = None, tags: List[str] | None = None) List[str][source]

List available product templates with optional filtering.

Parameters:
  • category – Filter by category (e.g., “dairy”, “grain”, “pharma”).

  • domain – Filter by domain (e.g., “food”, “agriculture”, “pharmaceutical”).

  • tags – Filter by tags (any match).

Returns:

Sorted list of template names matching the criteria.

Example

>>> # List all templates
>>> all_templates = list_product_templates()
>>>
>>> # List dairy templates
>>> dairy = list_product_templates(category="dairy")
>>>
>>> # List NN training templates
>>> nn_templates = list_product_templates(tags=["nn_training"])
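
The filter semantics described above can be sketched without the library. This is a minimal illustration that assumes the three filters combine with AND and that tags match when at least one requested tag is present; the registry, its template names, and the helper filter_templates are hypothetical stand-ins, not the shipped templates or API.

```python
# Hypothetical registry: names and metadata are illustrative only.
TEMPLATES = {
    "milk_demo": {"category": "dairy", "domain": "food", "tags": ["liquid"]},
    "cheese_demo": {"category": "dairy", "domain": "food", "tags": ["solid", "nn_training"]},
    "wheat_demo": {"category": "grain", "domain": "agriculture", "tags": ["nn_training"]},
}


def filter_templates(templates, category=None, domain=None, tags=None):
    """Return sorted names passing every supplied filter (assumed AND semantics)."""
    selected = []
    for name, meta in templates.items():
        if category is not None and meta["category"] != category:
            continue
        if domain is not None and meta["domain"] != domain:
            continue
        # "Any match": keep the template if it shares at least one requested tag.
        if tags is not None and not set(tags) & set(meta["tags"]):
            continue
        selected.append(name)
    return sorted(selected)


dairy = filter_templates(TEMPLATES, category="dairy")
nn_food = filter_templates(TEMPLATES, domain="food", tags=["nn_training", "powder"])
print(dairy, nn_food)
```

Passing None for a filter leaves that dimension unconstrained, which mirrors the optional parameters in the signature.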
nirs4all.synthesis.load_benchmark_dataset(name: str, data_dir: str | Path | None = None, format: str = 'auto') LoadedBenchmarkDataset[source]

Load a benchmark dataset from disk.

Parameters:
  • name – Dataset name from registry.

  • data_dir – Directory containing dataset files.

  • format – File format (“auto”, “csv”, “mat”, “jdx”).

Returns:

LoadedBenchmarkDataset containing the loaded data (accessed as X and y in the example).

Raises:

Example

>>> dataset = load_benchmark_dataset("corn", data_dir="./datasets/")
>>> print(dataset.X.shape, dataset.y.shape)

Note

Dataset files must be obtained separately from their sources. This function provides standardized loading once files are available.

nirs4all.synthesis.multiscale_derivative_fit(fitter: DerivativeAwareForwardModelFitter, y_deriv: ndarray, scales: List[float] | None = None) Dict[str, Any][source]

Multiscale fitting curriculum for derivative spectra.

Fits coarse features first by smoothing the derivative target, then progressively reduces smoothing. This is particularly important for derivative data, which can carry high-frequency noise.

Parameters:
  • fitter – DerivativeAwareForwardModelFitter instance.

  • y_deriv – Target derivative spectrum.

  • scales – List of Gaussian sigma values. Default: [15, 8, 4, 0].

Returns:

Final fit result dict.

Example

>>> result = multiscale_derivative_fit(fitter, deriv_spectrum)
nirs4all.synthesis.multiscale_fit(fitter: ForwardModelFitter, y: ndarray, scales: List[float] | None = None) Dict[str, Any][source]

Multiscale fitting curriculum for raw spectra.

Fits coarse features first by smoothing the target, then progressively reduces smoothing to capture finer details. This improves optimization stability and avoids local minima.

Parameters:
  • fitter – ForwardModelFitter instance.

  • y – Target spectrum.

  • scales – List of Gaussian sigma values for progressive smoothing. Default: [20, 10, 5, 0].

Returns:

Final fit result dict.

Example

>>> result = multiscale_fit(fitter, spectrum, scales=[20, 10, 5, 0])
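
The coarse-to-fine idea can be illustrated with a toy problem. The sketch below is not the ForwardModelFitter: it assumes a one-parameter model (the center of a single Gaussian band), a greedy integer search, and a hand-written Gaussian smoother, all of which are hypothetical stand-ins. It shows how fitting against a smoothed target first gives the optimizer a usable error slope when the initial guess does not overlap the band at all.

```python
import math


def peak(center, n=300, width=4.0):
    """Toy forward model: one unit-height Gaussian band on an integer grid."""
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n)]


def gaussian_smooth(y, sigma):
    """Smooth with a Gaussian kernel truncated at 4*sigma; sigma == 0 is a no-op."""
    if sigma <= 0:
        return list(y)
    radius = int(4 * sigma)
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    out = []
    for i in range(len(y)):
        pairs = [(w, y[i + k]) for k, w in zip(offsets, weights) if 0 <= i + k < len(y)]
        out.append(sum(w * v for w, v in pairs) / sum(w for w, _ in pairs))
    return out


def fit_center(target, start, tol=1e-12):
    """Greedy search: move the center by +/-1 while the squared error drops by more than tol."""
    def sse(c):
        return sum((m - t) ** 2 for m, t in zip(peak(c, len(target)), target))

    center, best = start, sse(start)
    while True:
        err, nxt = min((sse(center + d), center + d) for d in (-1, 1))
        if err > best - tol:
            return center
        center, best = nxt, err


def multiscale_center_fit(target, start, scales=(20, 10, 5, 0)):
    """Fit against progressively less-smoothed targets, warm-starting each stage."""
    center = start
    for sigma in scales:
        center = fit_center(gaussian_smooth(target, sigma), center)
    return center


target = peak(150)
direct = fit_center(target, start=90)               # flat error surface: search stalls
coarse_to_fine = multiscale_center_fit(target, 90)  # smoothing widens the basin
print(direct, coarse_to_fine)
```

In this toy case the direct fit never leaves its starting point because the model and target barely overlap, while the schedule [20, 10, 5, 0] walks the center to the true position and the final unsmoothed stage confirms it.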
nirs4all.synthesis.normalize_component_amplitudes(component: SpectralComponent, method: str = 'max') SpectralComponent[source]

Normalize band amplitudes for a component.

This is a convenience wrapper around SpectralComponent.normalized().

Parameters:
  • component – SpectralComponent to normalize.

  • method – Normalization method (“max” or “sum”).

Returns:

New SpectralComponent with normalized amplitudes.

Example

>>> comp = get_component("water")
>>> normalized = normalize_component_amplitudes(comp)
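
As a rough illustration of the two methods, the sketch below assumes that "max" rescales so the largest amplitude becomes 1 and "sum" rescales so the amplitudes add up to 1. The helper normalize_amplitudes operates on a plain list and is hypothetical; it does not reproduce SpectralComponent.normalized().

```python
def normalize_amplitudes(amplitudes, method="max"):
    """Return a new list of amplitudes rescaled by their max or their sum."""
    if method == "max":
        scale = max(amplitudes)
    elif method == "sum":
        scale = sum(amplitudes)
    else:
        raise ValueError(f"unknown method: {method!r}")
    if scale <= 0:
        raise ValueError("amplitudes must have a positive max/sum")
    # Build a new list so the input is left untouched.
    return [a / scale for a in amplitudes]


bands = [0.5, 1.0, 2.5]
by_max = normalize_amplitudes(bands, "max")
by_sum = normalize_amplitudes(bands, "sum")
print(by_max, by_sum)
```

Both methods preserve the relative band intensities; only the overall scale differs.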
nirs4all.synthesis.product_template_info(name: str) str[source]

Return formatted information about a product template.

Parameters:

name – Template name.

Returns:

Human-readable string with template details.

Example

>>> print(product_template_info("wheat_variable_protein"))
nirs4all.synthesis.quick_realism_check(synthetic_spectra: ndarray, wavelengths: ndarray | None = None, expected_snr_range: Tuple[float, float] = (10, 1000), expected_peak_density: Tuple[float, float] = (0.5, 10.0)) Tuple[bool, List[str]][source]

Perform quick realism checks on synthetic spectra without real data.

This function checks basic properties that realistic spectra should have, without requiring a reference real dataset.

Parameters:
  • synthetic_spectra – Synthetic spectra to check.

  • wavelengths – Wavelength array.

  • expected_snr_range – Expected SNR range (min, max).

  • expected_peak_density – Expected peak density range (peaks per 100 nm).

Returns:

Tuple of (passed, list_of_issues).

Example

>>> X = generator.generate(100)[0]
>>> passed, issues = quick_realism_check(X, wavelengths)
>>> if not passed:
...     print("Issues:", issues)
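
To make the "peaks per 100 nm" unit concrete, here is a minimal sketch of one way such a density could be computed. It assumes peaks are strict local maxima and that the range bounds are inclusive; the helper peak_density_per_100nm is hypothetical, and the library's actual peak detection may differ (for example, by applying prominence thresholds so that noise is not counted).

```python
import math


def peak_density_per_100nm(spectrum, wavelengths):
    """Count strict local maxima and divide by the spectral span in units of 100 nm."""
    n_peaks = sum(
        1
        for i in range(1, len(spectrum) - 1)
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]
    )
    span_nm = wavelengths[-1] - wavelengths[0]
    return n_peaks / (span_nm / 100.0)


# Two well-separated Gaussian bands on a 1000-1400 nm grid with 2 nm steps.
wavelengths = [1000.0 + 2.0 * i for i in range(201)]
spectrum = [
    math.exp(-0.5 * ((w - 1100.0) / 20.0) ** 2)
    + math.exp(-0.5 * ((w - 1300.0) / 20.0) ** 2)
    for w in wavelengths
]

density = peak_density_per_100nm(spectrum, wavelengths)
low, high = (0.5, 10.0)  # default expected_peak_density from the signature
in_range = low <= density <= high
print(density, in_range)
```

Two peaks over a 400 nm span give 0.5 peaks per 100 nm, which sits on the lower bound of the default range.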
nirs4all.synthesis.sample_prior(domain: str | None = None, instrument: str | None = None, random_state: int | None = None) Dict[str, Any][source]

Convenience function that samples a single configuration from the default prior.

Parameters:
  • domain – Optional domain constraint.

  • instrument – Optional instrument constraint.

  • random_state – Random state for reproducibility.

Returns:

Configuration dictionary.

Example

>>> config = sample_prior(domain="food", random_state=42)
>>> print(config["domain"], config["instrument"])
nirs4all.synthesis.sample_prior_batch(n: int, random_state: int | None = None) List[Dict[str, Any]][source]

Convenience function that samples multiple configurations from the default prior.

Parameters:
  • n – Number of configurations to sample.

  • random_state – Random state for reproducibility.

Returns:

List of configuration dictionaries.

Example

>>> configs = sample_prior_batch(10, random_state=42)
>>> for c in configs:
...     print(c["domain"], c["instrument"])
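
The role of random_state can be sketched with the standard library. The example below assumes a deliberately simple prior that draws domain and instrument independently and uniformly; the real default prior is likely richer, and sample_configs is a hypothetical helper. The option lists reuse names that appear in the examples of list_product_domains and list_instrument_archetypes.

```python
import random

DOMAINS = ["agriculture", "food", "pharmaceutical"]
INSTRUMENTS = ["viavi_micronir", "scio", "tellspec"]


def sample_configs(n, random_state=None):
    """Draw n configuration dicts from a private, seeded generator."""
    rng = random.Random(random_state)  # isolated from the global random state
    return [
        {"domain": rng.choice(DOMAINS), "instrument": rng.choice(INSTRUMENTS)}
        for _ in range(n)
    ]


first = sample_configs(10, random_state=42)
second = sample_configs(10, random_state=42)
print(first == second)
```

Using a dedicated generator per call means that the same seed reproduces the same batch regardless of any other random draws made elsewhere in the program.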
nirs4all.synthesis.search_components(query: str | None = None, category: str | None = None, subcategory: str | None = None, tags: List[str] | None = None, wavelength_range: Tuple[float, float] | None = None) List[str][source]

Search components by various criteria.

Parameters:
  • query – Fuzzy match on name or synonyms.

  • category – Filter by category (e.g., “proteins”, “carbohydrates”).

  • subcategory – Filter by subcategory (e.g., “monosaccharides”).

  • tags – Filter by tags (any match).

  • wavelength_range – Keep only components that have bands inside the (min, max) range.

Returns:

List of matching component names.

Example

>>> # Find all protein-related components
>>> proteins = search_components(category="proteins")
>>>
>>> # Find components with bands in visible-NIR region
>>> vis_nir = search_components(wavelength_range=(400, 1000))
>>>
>>> # Find components tagged for pharmaceutical use
>>> pharma = search_components(tags=["pharma"])
nirs4all.synthesis.simulate_detector_effects(spectra: ndarray, wavelengths: ndarray, detector_type: DetectorType = DetectorType.INGAAS, include_response: bool = True, include_noise: bool = True, random_state: int | None = None) ndarray[source]

Apply detector effects to spectra through a simple API.

Parameters:
  • spectra – Input spectra (n_samples, n_wavelengths).

  • wavelengths – Wavelength array (nm).

  • detector_type – Type of detector to simulate.

  • include_response – Whether to apply spectral response.

  • include_noise – Whether to apply noise.

  • random_state – Random seed.

Returns:

Spectra with detector effects applied.

Example

>>> spectra_out = simulate_detector_effects(
...     spectra, wavelengths,
...     detector_type=DetectorType.PBS
... )
nirs4all.synthesis.validate_against_benchmark(synthetic_spectra: ndarray, benchmark_spectra: ndarray, benchmark_name: str, wavelengths: ndarray | None = None, synthetic_targets: ndarray | None = None, benchmark_targets: ndarray | None = None, random_state: int | None = None) DatasetComparisonResult[source]

Validate synthetic data against a benchmark dataset.

Parameters:
  • synthetic_spectra – Synthetic spectra (n_synth, n_wavelengths).

  • benchmark_spectra – Real benchmark spectra (n_bench, n_wavelengths).

  • benchmark_name – Name of the benchmark dataset.

  • wavelengths – Wavelength array.

  • synthetic_targets – Optional targets for TSTR/TRTS evaluation.

  • benchmark_targets – Optional targets for TSTR/TRTS evaluation.

  • random_state – Random state for reproducibility.

Returns:

DatasetComparisonResult with a realism score and, when targets are supplied, TSTR/TRTS results (train on synthetic, test on real; train on real, test on synthetic).

Example

>>> result = validate_against_benchmark(
...     synthetic_spectra=X_synth,
...     benchmark_spectra=X_real,
...     benchmark_name="Corn",
... )
>>> print(result.summary())
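
The TSTR/TRTS idea can be shown on a toy regression. This sketch assumes a single scalar feature in place of a full spectrum, an ordinary least-squares line as the model, and R² as the score; fit_line and r2_score are hypothetical helpers, and the metrics reported by DatasetComparisonResult may be computed differently.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def r2_score(model, xs, ys):
    """Coefficient of determination of a fitted line on held-out data."""
    slope, intercept = model
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data: the "real" set has a small offset that the synthetic set lacks.
x_synth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_synth = [2.0 * x + 1.0 for x in x_synth]
x_real = [0.0, 1.0, 2.0, 3.0, 4.0]
y_real = [2.0 * x + 1.2 for x in x_real]

tstr = r2_score(fit_line(x_synth, y_synth), x_real, y_real)  # train synthetic, test real
trts = r2_score(fit_line(x_real, y_real), x_synth, y_synth)  # train real, test synthetic
print(round(tstr, 4), round(trts, 4))
```

Scores close to those obtained when training and testing on real data suggest that the synthetic set captures the feature-target relationship; a large gap in either direction points to a mismatch.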
nirs4all.synthesis.validate_aggregates() List[str][source]

Validate all predefined aggregates.

Returns:

List of validation issues (empty if all valid).

Example

>>> issues = validate_aggregates()
>>> if issues:
...     for issue in issues:
...         print(f"Warning: {issue}")
nirs4all.synthesis.validate_bands() List[str][source]

Validate all band assignments.

Returns:

List of validation issues (empty if all valid).

Example

>>> issues = validate_bands()
>>> if issues:
...     for issue in issues:
...         print(f"Warning: {issue}")
nirs4all.synthesis.validate_component_coverage(wavelength_range: Tuple[float, float] = (350, 2500)) Dict[str, List[str]][source]

Check which components have bands in the given wavelength range.

Parameters:

wavelength_range – (min, max) wavelength in nm.

Returns:

Dictionary with ‘covered’ and ‘not_covered’ component lists.

Example

>>> coverage = validate_component_coverage((1000, 2500))
>>> print(f"Covered: {len(coverage['covered'])}")
>>> print(f"Not covered: {coverage['not_covered']}")
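
A coverage check of this kind reduces to an interval test on band centers. The sketch below assumes a component counts as covered when at least one of its band centers lies inside the inclusive range; the band table and the helper coverage are illustrative, not the library's component data.

```python
# Illustrative band centers in nm; not the library's component definitions.
BAND_CENTERS_NM = {
    "water": [1450.0, 1940.0],
    "chlorophyll": [670.0],
}


def coverage(band_centers_nm, wavelength_range):
    """Split component names by whether any band center falls inside the range."""
    low, high = wavelength_range
    result = {"covered": [], "not_covered": []}
    for name, centers in sorted(band_centers_nm.items()):
        key = "covered" if any(low <= c <= high for c in centers) else "not_covered"
        result[key].append(name)
    return result


nir_only = coverage(BAND_CENTERS_NM, (1000, 2500))
full_range = coverage(BAND_CENTERS_NM, (350, 2500))
print(nir_only, full_range)
```

Narrowing the range to the NIR region drops the component whose only band sits in the visible.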
nirs4all.synthesis.validate_concentrations(C: ndarray, n_samples: int | None = None, n_components: int | None = None, check_normalized: bool = False, tolerance: float = 0.01) List[str][source]

Validate concentration matrix.

Parameters:
  • C – Concentration matrix to validate.

  • n_samples – Expected number of samples.

  • n_components – Expected number of components.

  • check_normalized – Whether concentrations should sum to 1.

  • tolerance – Tolerance for normalization check.

Returns:

List of validation warning messages.

Raises:

ValidationError – If critical validation fails.
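
The normalization check and its tolerance can be sketched as follows. This is a simplified stand-in that works on nested lists and only returns warnings; the negativity check is an added assumption, and check_concentrations is a hypothetical helper that does not raise ValidationError the way the library function can.

```python
def check_concentrations(C, n_components=None, check_normalized=False, tolerance=0.01):
    """Return warning strings for a row-per-sample concentration matrix."""
    issues = []
    for row_index, row in enumerate(C):
        if n_components is not None and len(row) != n_components:
            issues.append(
                f"row {row_index}: expected {n_components} components, got {len(row)}"
            )
        if any(value < 0 for value in row):  # assumption: negatives are suspicious
            issues.append(f"row {row_index}: negative concentration")
        # A row passes when its sum lies within `tolerance` of 1.
        if check_normalized and abs(sum(row) - 1.0) > tolerance:
            issues.append(f"row {row_index}: sums to {sum(row):.3f}, expected 1")
    return issues


C = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.2]]
issues = check_concentrations(C, n_components=3, check_normalized=True)
print(issues)
```

With the default tolerance of 0.01, a row summing to 1.005 would pass while the second row above, which sums to 1.1, is flagged.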

nirs4all.synthesis.validate_predefined_components() List[str][source]

Validate all predefined components.

Returns:

List of validation warnings/errors (empty if all valid).

Example

>>> issues = validate_predefined_components()
>>> if issues:
...     for issue in issues:
...         print(issue)
... else:
...     print("All components valid!")
nirs4all.synthesis.validate_spectra(X: ndarray, expected_shape: Tuple[int, int] | None = None, check_finite: bool = True, check_positive: bool = False, value_range: Tuple[float, float] | None = None) List[str][source]

Validate generated spectra matrix.

Parameters:
  • X – Spectra matrix to validate.

  • expected_shape – Expected (n_samples, n_wavelengths) shape.

  • check_finite – Whether to check for NaN/Inf values.

  • check_positive – Whether to require all positive values.

  • value_range – Optional (min, max) expected range.

Returns:

List of validation warning messages (empty if all OK).

Raises:

ValidationError – If critical validation fails.

Example

>>> import numpy as np
>>> X = np.random.randn(100, 500)
>>> warnings = validate_spectra(X, expected_shape=(100, 500))
>>> if warnings:
...     print("Warnings:", warnings)
nirs4all.synthesis.validate_synthetic_output(X: ndarray, C: ndarray, E: ndarray, wavelengths: ndarray | None = None) List[str][source]

Validate complete synthetic generation output.

Parameters:
  • X – Generated spectra (n_samples, n_wavelengths).

  • C – Concentration matrix (n_samples, n_components).

  • E – Component spectra (n_components, n_wavelengths).

  • wavelengths – Optional wavelength array.

Returns:

List of all validation warnings.

Raises:

ValidationError – If critical validation fails.

Example

>>> from nirs4all.synthesis import SyntheticNIRSGenerator
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X, C, E = gen.generate(100)
>>> warnings = validate_synthetic_output(X, C, E, gen.wavelengths)
nirs4all.synthesis.validate_wavelengths(wavelengths: ndarray, expected_range: Tuple[float, float] | None = None, check_monotonic: bool = True, check_uniform: bool = True) List[str][source]

Validate wavelength array.

Parameters:
  • wavelengths – Wavelength array to validate.

  • expected_range – Optional (min, max) expected range in nm.

  • check_monotonic – Whether to check for monotonically increasing values.

  • check_uniform – Whether to check for uniform spacing.

Returns:

List of validation warning messages.

Raises:

ValidationError – If critical validation fails.
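
The monotonicity and uniform-spacing checks can be illustrated with plain Python. The sketch assumes "monotonic" means strictly increasing and compares each step to the first one with a relative tolerance; check_wavelengths and its rel_tol parameter are hypothetical, and the sketch only returns warnings rather than raising ValidationError.

```python
import math


def check_wavelengths(wavelengths, expected_range=None, rel_tol=1e-6):
    """Return warning strings; an empty list means no issue was found."""
    issues = []
    steps = [b - a for a, b in zip(wavelengths, wavelengths[1:])]
    if any(step <= 0 for step in steps):
        issues.append("wavelengths are not strictly increasing")
    elif any(not math.isclose(step, steps[0], rel_tol=rel_tol) for step in steps):
        issues.append("wavelength spacing is not uniform")
    if expected_range is not None:
        low, high = expected_range
        if min(wavelengths) < low or max(wavelengths) > high:
            issues.append(f"wavelengths fall outside the expected range {low}-{high} nm")
    return issues


uniform = check_wavelengths([1000.0, 1002.0, 1004.0, 1006.0], expected_range=(350, 2500))
shuffled = check_wavelengths([1000.0, 1004.0, 1002.0])
uneven = check_wavelengths([1000.0, 1002.0, 1005.0])
print(uniform, shuffled, uneven)
```

The uniformity test is skipped when the grid is not increasing, since step sizes are not meaningful in that case.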

nirs4all.synthesis.wavelength_to_wavenumber(lambda_nm: float | ndarray) float | ndarray[source]

Convert wavelength (nm) to wavenumber (cm⁻¹).

The conversion follows the relationship: ν̃ = 10^7 / λ

Parameters:

lambda_nm – Wavelength in nm. Can be a scalar or numpy array.

Returns:

Wavenumber in cm⁻¹ (same shape as input).

Raises:

ValueError – If wavelength is zero or negative.

Example

>>> import numpy as np
>>> wavelength_to_wavenumber(1450)  # O-H 1st overtone region
6896.55...
>>> wavelength_to_wavenumber(np.array([1450, 1940]))
array([6896.55..., 5154.63...])
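
The factor 10^7 follows from the unit conversion 1 cm = 10^7 nm. Written out, with the first doctest value as a worked case:

```latex
\tilde{\nu}\,[\mathrm{cm^{-1}}]
  = \frac{1}{\lambda\,[\mathrm{cm}]}
  = \frac{1}{\lambda\,[\mathrm{nm}] \times 10^{-7}}
  = \frac{10^{7}}{\lambda\,[\mathrm{nm}]},
\qquad
\lambda = 1450\ \mathrm{nm}
  \;\Rightarrow\;
  \tilde{\nu} = \frac{10^{7}}{1450} \approx 6896.55\ \mathrm{cm^{-1}}
```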
nirs4all.synthesis.wavenumber_to_wavelength(nu_cm: float | ndarray) float | ndarray[source]

Convert wavenumber (cm⁻¹) to wavelength (nm).

The conversion follows the relationship: λ = 10^7 / ν̃

Parameters:

nu_cm – Wavenumber in cm⁻¹. Can be a scalar or numpy array.

Returns:

Wavelength in nm (same shape as input).

Raises:

ValueError – If wavenumber is zero or negative.

Example

>>> import numpy as np
>>> wavenumber_to_wavelength(6896)  # 1st overtone O-H
1450.11...
>>> wavenumber_to_wavelength(np.array([6896, 5155]))
array([1450.11..., 1939.86...])
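
Because both conversions use the same reciprocal relationship, applying one after the other returns the original value. The scalar-only helpers below, nm_to_cm1 and cm1_to_nm, are hypothetical stand-ins written for illustration; they mirror the documented formula and the ValueError on non-positive input but do not handle arrays.

```python
import math


def nm_to_cm1(lambda_nm):
    """Wavelength in nm to wavenumber in cm^-1 (scalar sketch)."""
    if lambda_nm <= 0:
        raise ValueError("wavelength must be positive")
    return 1e7 / lambda_nm


def cm1_to_nm(nu_cm):
    """Wavenumber in cm^-1 to wavelength in nm (scalar sketch)."""
    if nu_cm <= 0:
        raise ValueError("wavenumber must be positive")
    return 1e7 / nu_cm


round_trip = cm1_to_nm(nm_to_cm1(1450.0))
print(math.isclose(round_trip, 1450.0), round(cm1_to_nm(6896), 2))
```

Note that rounding a wavenumber before converting back shifts the result slightly: 6896 cm⁻¹ maps to about 1450.12 nm rather than exactly 1450 nm.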