nirs4all.synthesis.products module

Product-level synthetic NIRS generator for neural network training.

This module provides high-level APIs for generating diverse, realistic product samples with controlled variability for training neural networks. Unlike the base SyntheticNIRSGenerator, which operates at the component level, ProductGenerator works with predefined product templates that specify realistic composition variability, component correlations, and bounds.

Key Features:
  • Predefined product templates with realistic composition ranges

  • Controlled variability types (FIXED, UNIFORM, NORMAL, LOGNORMAL, CORRELATED, COMPUTED)

  • Composition constraints (sum to 1.0, realistic bounds)

  • Correlation preservation between components

  • Target flexibility (any component as regression target)

  • Efficient batch generation for NN training (10k-100k samples)

  • Integration with custom wavelength grids

Example

>>> from nirs4all.synthesis import ProductGenerator, list_product_templates
>>>
>>> # List available templates
>>> print(list_product_templates(category="dairy"))
['cheese_variable_moisture', 'milk_variable_fat']
>>>
>>> # Generate dairy product samples
>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=10000, target="lipid")
>>>
>>> # High-variability dataset for NN training
>>> generator = ProductGenerator("food_cholesterol_variable")
>>> dataset = generator.generate(n_samples=50000, target="cholesterol")

References

[1] USDA FoodData Central (https://fdc.nal.usda.gov/)
[2] Osborne, B. G., Fearn, T., & Hindle, P. H. (1993). Practical NIR Spectroscopy.
[3] Williams, P. (2001). Implementation of Near-Infrared Technology.

class nirs4all.synthesis.products.CategoryGenerator(templates: List[str | ProductTemplate], random_state: int | None = None, **kwargs: Any)

Bases: object

Generator combining multiple product templates for diverse datasets.

CategoryGenerator enables creation of training datasets that span multiple product types, useful for building robust models that generalize across categories.

templates

List of ProductTemplate objects.

Type:

List[nirs4all.synthesis.products.ProductTemplate]

generators

List of ProductGenerator objects for each template.

Type:

List[nirs4all.synthesis.products.ProductGenerator]

Parameters:
  • templates – List of template names or ProductTemplate objects.

  • random_state – Random seed for reproducibility.

  • **kwargs – Additional arguments passed to ProductGenerator.

Example

>>> # Combine dairy products
>>> gen = CategoryGenerator(["milk_variable_fat", "cheese_variable_moisture"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
>>>
>>> # Universal fat predictor training
>>> gen = CategoryGenerator([
...     "milk_variable_fat",
...     "cheese_variable_moisture",
...     "meat_variable_fat",
... ])
>>> dataset = gen.generate(n_samples=10000, target="lipid")
__repr__() -> str

Return string representation.

generate(n_samples: int = 1000, target: str | None = None, samples_per_template: List[int] | None = None, train_ratio: float = 0.8, shuffle: bool = True, include_template_labels: bool = False) -> SpectroDataset

Generate combined dataset from multiple templates.

Parameters:
  • n_samples – Total number of samples to generate.

  • target – Component to use as regression target. Must exist in all templates.

  • samples_per_template – Number of samples per template. If None, divides equally.

  • train_ratio – Proportion of samples for training partition.

  • shuffle – Whether to shuffle samples across templates.

  • include_template_labels – If True, adds template index as metadata.

Returns:

SpectroDataset combining samples from all templates.

Example

>>> gen = CategoryGenerator(["milk_variable_fat", "meat_variable_fat"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
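This page only states that samples are divided equally when samples_per_template is None. A minimal sketch of one way to do that split, assuming leftover samples go to the first templates (an assumption; `split_samples` is a hypothetical helper, not part of nirs4all):

```python
from typing import List, Optional


def split_samples(
    n_samples: int,
    n_templates: int,
    samples_per_template: Optional[List[int]] = None,
) -> List[int]:
    """Return one sample count per template (illustrative helper only)."""
    if samples_per_template is not None:
        if len(samples_per_template) != n_templates:
            raise ValueError("samples_per_template needs one entry per template")
        return list(samples_per_template)
    base, extra = divmod(n_samples, n_templates)
    # Assumption: the first `extra` templates each receive one additional sample.
    return [base + (1 if i < extra else 0) for i in range(n_templates)]


counts = split_samples(2000, 3)
print(counts, sum(counts))
```

Whatever rule the library applies, the counts should add up to n_samples; passing samples_per_template explicitly removes the ambiguity.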
generators: List[ProductGenerator]
templates: List[ProductTemplate]
class nirs4all.synthesis.products.ComponentVariation(component: str, variation_type: VariationType, value: float | None = None, min_value: float | None = None, max_value: float | None = None, mean: float | None = None, std: float | None = None, correlated_with: str | None = None, correlation: float | None = None, compute_as: str | None = None)

Bases: object

Specification for how a component’s concentration varies.

component

Name of the spectral component (must exist in library).

Type:

str

variation_type

Type of variation (FIXED, UNIFORM, NORMAL, etc.).

Type:

nirs4all.synthesis.products.VariationType

value

For FIXED type, the exact value.

Type:

float | None

min_value

For UNIFORM/NORMAL (and CORRELATED, as in the example below), the minimum bound.

Type:

float | None

max_value

For UNIFORM/NORMAL (and CORRELATED, as in the example below), the maximum bound.

Type:

float | None

mean

For NORMAL/LOGNORMAL, the distribution mean.

Type:

float | None

std

For NORMAL/LOGNORMAL, the distribution standard deviation.

Type:

float | None

correlated_with

For CORRELATED, the source component name.

Type:

str | None

correlation

For CORRELATED, the correlation coefficient.

Type:

float | None

compute_as

For COMPUTED, a string describing the computation (currently supports “remainder” for 1 - sum(others)).

Type:

str | None

Example

>>> # Fixed moisture content
>>> moisture = ComponentVariation("moisture", VariationType.FIXED, value=0.12)
>>>
>>> # Variable protein with uniform distribution
>>> protein = ComponentVariation(
...     "protein", VariationType.UNIFORM,
...     min_value=0.08, max_value=0.18
... )
>>>
>>> # Starch negatively correlated with protein
>>> starch = ComponentVariation(
...     "starch", VariationType.CORRELATED,
...     correlated_with="protein", correlation=-0.85,
...     min_value=0.55, max_value=0.72
... )
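How CORRELATED values are drawn is not described on this page. A minimal sketch of one standard construction, mixing the standardised source with independent Gaussian noise and then mapping into the bounds. The mapping (range covers about ±3 standard deviations) and the helper names `sample_correlated` and `pearson` are assumptions, not nirs4all code:

```python
import math
import random


def sample_correlated(source, rho, min_value, max_value, rng):
    """Draw values with correlation close to `rho` to `source`, clipped to bounds."""
    n = len(source)
    mean = sum(source) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in source) / n)
    mid = 0.5 * (min_value + max_value)
    # Assumption: the allowed range spans roughly +/- 3 standard deviations.
    scale = (max_value - min_value) / 6.0
    out = []
    for x in source:
        z = (x - mean) / std                 # standardised source value
        noise = rng.gauss(0.0, 1.0)          # independent component
        y = rho * z + math.sqrt(1.0 - rho ** 2) * noise
        out.append(min(max(mid + scale * y, min_value), max_value))
    return out


def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
protein = [rng.uniform(0.08, 0.18) for _ in range(5000)]
starch = sample_correlated(protein, -0.85, 0.55, 0.72, rng)
r = pearson(protein, starch)
print(f"sample correlation: {r:.3f}")
```

Clipping to the bounds pulls the realised correlation slightly toward zero, so the requested coefficient is a target rather than an exact value.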
__post_init__() -> None

Validate specification based on variation type.

component: str
compute_as: str | None = None
correlated_with: str | None = None
correlation: float | None = None
max_value: float | None = None
mean: float | None = None
min_value: float | None = None
std: float | None = None
value: float | None = None
variation_type: VariationType
class nirs4all.synthesis.products.ProductGenerator(template: str | ProductTemplate, random_state: int | None = None, wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, wavelengths: ndarray | None = None, instrument_wavelength_grid: str | None = None, complexity: str = 'realistic')

Bases: object

Generator for product-level synthetic NIRS spectra.

ProductGenerator creates realistic synthetic spectra based on predefined product templates with controlled composition variability. It handles correlation constraints, compositional bounds, and efficient batch generation for neural network training.

template

The ProductTemplate used for generation.

library

ComponentLibrary with the required spectral components.

rng

NumPy random generator for reproducibility.

Parameters:
  • template – Template name (str) or ProductTemplate object.

  • random_state – Random seed for reproducibility.

  • wavelength_start – Start wavelength in nm (default: 1000).

  • wavelength_end – End wavelength in nm (default: 2500).

  • wavelength_step – Wavelength step in nm (default: 2).

  • wavelengths – Custom wavelength array (overrides start/end/step).

  • instrument_wavelength_grid – Predefined instrument grid name.

  • complexity – Spectral complexity (‘simple’, ‘realistic’, ‘complex’).

Example

>>> # Generate milk samples with variable fat
>>> generator = ProductGenerator("milk_variable_fat", random_state=42)
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>>
>>> # High-variability training data
>>> generator = ProductGenerator("universal_protein_predictor")
>>> dataset = generator.generate(n_samples=50000, target="protein")
>>>
>>> # Match specific instrument wavelengths
>>> generator = ProductGenerator(
...     "wheat_variable_protein",
...     instrument_wavelength_grid="foss_xds"
... )
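For orientation, the default wavelength_start, wavelength_end and wavelength_step values describe an evenly spaced grid. A minimal sketch of such a grid, assuming the end point is included when it falls on the grid (an assumption; `wavelength_grid` is a hypothetical helper, and the library may build its grid differently):

```python
import math


def wavelength_grid(start=1000.0, end=2500.0, step=2.0):
    """Evenly spaced wavelengths in nm, including `end` when it lies on the grid."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    # Small epsilon guards against floating-point error in the division.
    n_points = int(math.floor((end - start) / step + 1e-9)) + 1
    return [start + i * step for i in range(n_points)]


grid = wavelength_grid()
print(len(grid), grid[0], grid[-1])
```

Under that assumption the defaults give 751 points from 1000 nm to 2500 nm. A custom wavelengths array or an instrument_wavelength_grid name replaces this grid, as described in the parameters above.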
__repr__() -> str

Return string representation.

generate(n_samples: int = 1000, target: str | None = None, train_ratio: float = 0.8, include_batch_effects: bool = False, n_batches: int = 1, return_concentrations: bool = False) -> SpectroDataset | Tuple[SpectroDataset, np.ndarray]

Generate synthetic product samples.

Parameters:
  • n_samples – Number of samples to generate.

  • target – Component to use as regression target. If None, uses template’s default_target.

  • train_ratio – Proportion of samples for training partition.

  • include_batch_effects – Whether to add batch/session effects.

  • n_batches – Number of batches (if include_batch_effects=True).

  • return_concentrations – If True, also return the full concentration matrix.

Returns:

SpectroDataset with train/test partitions. If return_concentrations=True, returns (dataset, concentrations).

Example

>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
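How train_ratio maps to partition sizes is not specified on this page. A minimal sketch, assuming the training count is rounded to the nearest integer and samples are shuffled before the split (both assumptions; `train_test_indices` is a hypothetical helper, not part of nirs4all):

```python
import random


def train_test_indices(n_samples, train_ratio=0.8, seed=None):
    """Return (train_indices, test_indices) for a shuffled split."""
    if not 0.0 <= train_ratio <= 1.0:
        raise ValueError("train_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Assumption: round to the nearest integer number of training samples.
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_indices(1000, train_ratio=0.8, seed=42)
print(len(train_idx), len(test_idx))
```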
generate_dataset_for_target(target: str, n_samples: int = 1000, target_range: Tuple[float, float] | None = None, **kwargs: Any) -> SpectroDataset

Generate dataset optimized for a specific target component.

This is a convenience method that generates a dataset and optionally scales the target values to a specified range.

Parameters:
  • target – Component to use as regression target.

  • n_samples – Number of samples to generate.

  • target_range – Optional (min, max) to scale target values.

  • **kwargs – Additional arguments passed to generate().

Returns:

SpectroDataset ready for pipeline use.

Example

>>> generator = ProductGenerator("wheat_variable_protein")
>>> dataset = generator.generate_dataset_for_target(
...     target="protein",
...     n_samples=10000,
...     target_range=(0, 100)  # Scale to percentage
... )
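The page does not say how target_range rescales the target. A minimal sketch of one plausible reading, linear min-max scaling of the observed values onto (min, max). This is an assumption; the library may instead apply a fixed factor, and `scale_to_range` is a hypothetical helper:

```python
def scale_to_range(values, target_range):
    """Linearly map `values` so their minimum and maximum hit `target_range`."""
    lo, hi = target_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Degenerate case: all values equal, map everything to the lower bound.
        return [float(lo) for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


protein_fraction = [0.08, 0.13, 0.18]
scaled = scale_to_range(protein_fraction, (0, 100))
print(scaled)
```

Note that min-max scaling depends on the values present in the generated batch, so two datasets scaled this way are only comparable if they share the same underlying range.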
class nirs4all.synthesis.products.ProductTemplate(name: str, description: str, category: str, domain: str, components: List[ComponentVariation], default_target: str = '', tags: List[str] = <factory>, references: List[str] = <factory>)

Bases: object

Template defining a product type with composition variability.

A ProductTemplate describes a realistic product type (e.g., wheat grain, milk, pharmaceutical tablet) along with specifications for how each component’s concentration can vary. This enables generation of diverse samples suitable for neural network training.

name

Unique identifier for the template.

Type:

str

description

Human-readable description.

Type:

str

category

Product category (e.g., “dairy”, “grain”, “pharma”).

Type:

str

domain

Application domain (e.g., “agriculture”, “food”, “pharmaceutical”).

Type:

str

components

List of ComponentVariation specifications.

Type:

List[nirs4all.synthesis.products.ComponentVariation]

default_target

Default component to use as regression target.

Type:

str

tags

Classification tags for filtering.

Type:

List[str]

references

Literature or data source citations.

Type:

List[str]

Example

>>> milk_template = ProductTemplate(
...     name="milk_variable_fat",
...     description="Milk with variable fat content (skim to whole)",
...     category="dairy",
...     domain="food",
...     components=[
...         ComponentVariation("water", VariationType.COMPUTED, compute_as="remainder"),
...         ComponentVariation("lipid", VariationType.UNIFORM, min_value=0.005, max_value=0.06),
...         ComponentVariation("casein", VariationType.NORMAL, mean=0.028, std=0.003),
...         ComponentVariation("whey", VariationType.FIXED, value=0.006),
...         ComponentVariation("lactose", VariationType.NORMAL, mean=0.05, std=0.003),
...     ],
...     default_target="lipid",
... )
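The COMPUTED component with compute_as="remainder" is defined as 1 - sum(others), which keeps each composition summing to 1.0. A minimal sketch of that arithmetic for one sample drawn from the ranges in the milk template (`fill_remainder` is a hypothetical helper, and the error on overshoot is an assumption about how an invalid draw might be handled):

```python
def fill_remainder(concentrations, remainder_key):
    """Set `remainder_key` to 1 - sum(others) and return a new dict."""
    others = sum(v for k, v in concentrations.items() if k != remainder_key)
    if others > 1.0:
        # Assumption: a draw whose other components exceed 1.0 is rejected.
        raise ValueError(f"other components sum to {others:.3f} > 1.0")
    result = dict(concentrations)
    result[remainder_key] = 1.0 - others
    return result


sample = fill_remainder(
    {"lipid": 0.035, "casein": 0.028, "whey": 0.006, "lactose": 0.05},
    remainder_key="water",
)
print(sample["water"], sum(sample.values()))
```

Here the four sampled components total 0.119, so water takes the remaining 0.881.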
__post_init__() -> None

Validate template consistency.

category: str
property component_names: List[str]

Return list of component names in this template.

components: List[ComponentVariation]
default_target: str = ''
description: str
domain: str
info() -> str

Return formatted information about the template.

name: str
references: List[str]
tags: List[str]
class nirs4all.synthesis.products.VariationType(value)

Bases: Enum

Type of variation for component concentrations.

FIXED

No variation, use exact specified value.

UNIFORM

Uniform distribution between min and max.

NORMAL

Normal (Gaussian) distribution with mean and std.

LOGNORMAL

Log-normal distribution for non-negative values.

CORRELATED

Value derived from correlation with another component.

COMPUTED

Value computed from other components (e.g., 1 - sum(others)).

COMPUTED = 6
CORRELATED = 5
FIXED = 1
LOGNORMAL = 4
NORMAL = 3
UNIFORM = 2
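To make the distribution-based members concrete, here is a minimal sketch of drawing one concentration per variation type with the standard library. `Variation` is a local stand-in enum and `draw` a hypothetical helper, not nirs4all code. Two assumptions are labeled in the comments: mean and std are read as moments of the concentration itself for LOGNORMAL, and bounds are enforced by clipping:

```python
import enum
import math
import random


class Variation(enum.Enum):
    """Local stand-in mirroring part of VariationType (illustration only)."""
    FIXED = 1
    UNIFORM = 2
    NORMAL = 3
    LOGNORMAL = 4


def draw(kind, rng, value=None, min_value=None, max_value=None,
         mean=None, std=None):
    """Draw one concentration for the given variation type."""
    if kind is Variation.FIXED:
        return value
    if kind is Variation.UNIFORM:
        return rng.uniform(min_value, max_value)
    if kind is Variation.NORMAL:
        x = rng.gauss(mean, std)
    elif kind is Variation.LOGNORMAL:
        # Assumption: mean/std describe the variable itself, so convert them
        # to the parameters of the underlying normal distribution.
        sigma2 = math.log(1.0 + (std / mean) ** 2)
        x = rng.lognormvariate(math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2))
    else:
        raise ValueError(f"unsupported variation type: {kind}")
    # Assumption: optional bounds are enforced by clipping.
    if min_value is not None:
        x = max(x, min_value)
    if max_value is not None:
        x = min(x, max_value)
    return x


rng = random.Random(1)
fixed = draw(Variation.FIXED, rng, value=0.12)
uniform = draw(Variation.UNIFORM, rng, min_value=0.08, max_value=0.18)
normal = draw(Variation.NORMAL, rng, mean=0.028, std=0.003, min_value=0.0)
lognormal = draw(Variation.LOGNORMAL, rng, mean=0.01, std=0.005)
print(fixed, uniform, normal, lognormal)
```

CORRELATED and COMPUTED are left out of this sketch because they depend on other components rather than on a standalone distribution.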
nirs4all.synthesis.products.generate_product_samples(template: str | ProductTemplate, n_samples: int = 1000, target: str | None = None, random_state: int | None = None, **kwargs: Any) -> SpectroDataset

Generate synthetic product samples (convenience function).

This is a shorthand for creating a ProductGenerator and calling generate().

Parameters:
  • template – Template name or ProductTemplate object.

  • n_samples – Number of samples to generate.

  • target – Component to use as regression target.

  • random_state – Random seed for reproducibility.

  • **kwargs – Additional arguments passed to ProductGenerator.generate().

Returns:

SpectroDataset with synthetic samples.

Example

>>> from nirs4all.synthesis import generate_product_samples
>>>
>>> # Generate milk samples
>>> dataset = generate_product_samples(
...     "milk_variable_fat",
...     n_samples=1000,
...     target="lipid",
...     random_state=42
... )
nirs4all.synthesis.products.get_product_template(name: str) -> ProductTemplate

Get a product template by name.

Parameters:

name – Template name.

Returns:

ProductTemplate object.

Raises:

ValueError – If template name is not found.

Example

>>> template = get_product_template("milk_variable_fat")
>>> print(template.description)
Milk with variable fat content (skim to whole)
nirs4all.synthesis.products.list_product_categories() -> List[str]

List all unique product categories.

Returns:

Sorted list of category names.

Example

>>> categories = list_product_categories()
>>> print(categories)
['dairy', 'fruit', 'grain', 'legume', 'meat', 'nn_training', 'solid_dosage']
nirs4all.synthesis.products.list_product_domains() -> List[str]

List all unique product domains.

Returns:

Sorted list of domain names.

Example

>>> domains = list_product_domains()
>>> print(domains)
['agriculture', 'food', 'pharmaceutical']
nirs4all.synthesis.products.list_product_templates(category: str | None = None, domain: str | None = None, tags: List[str] | None = None) -> List[str]

List available product templates with optional filtering.

Parameters:
  • category – Filter by category (e.g., “dairy”, “grain”, “pharma”).

  • domain – Filter by domain (e.g., “food”, “agriculture”, “pharmaceutical”).

  • tags – Filter by tags (any match).

Returns:

Sorted list of template names matching the criteria.

Example

>>> # List all templates
>>> all_templates = list_product_templates()
>>>
>>> # List dairy templates
>>> dairy = list_product_templates(category="dairy")
>>>
>>> # List NN training templates
>>> nn_templates = list_product_templates(tags=["nn_training"])
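The filter semantics described above (exact match on category and domain, any-match on tags, sorted result) can be sketched on a toy registry. The registry contents below are illustrative: template names appear on this page, but the tags and the wheat entry's category and domain are invented, and `filter_templates` is a hypothetical helper rather than the library's implementation:

```python
from typing import Dict, List, Optional

# Hypothetical registry; tags and the wheat metadata are made up for the demo.
REGISTRY: Dict[str, Dict[str, object]] = {
    "milk_variable_fat": {
        "category": "dairy", "domain": "food", "tags": ["fat"],
    },
    "cheese_variable_moisture": {
        "category": "dairy", "domain": "food", "tags": ["moisture"],
    },
    "wheat_variable_protein": {
        "category": "grain", "domain": "agriculture",
        "tags": ["protein", "nn_training"],
    },
}


def filter_templates(
    category: Optional[str] = None,
    domain: Optional[str] = None,
    tags: Optional[List[str]] = None,
) -> List[str]:
    """Return sorted template names matching every filter that is given."""
    names = []
    for name, meta in REGISTRY.items():
        if category is not None and meta["category"] != category:
            continue
        if domain is not None and meta["domain"] != domain:
            continue
        # "Any match": keep the template if it shares at least one tag.
        if tags and not set(tags) & set(meta["tags"]):
            continue
        names.append(name)
    return sorted(names)


print(filter_templates(category="dairy"))
print(filter_templates(tags=["nn_training", "fat"]))
```

Filters combine with AND across parameters, while the tags list itself is OR-ed, which is one reading of "any match".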
nirs4all.synthesis.products.product_template_info(name: str) -> str

Return formatted information about a product template.

Parameters:

name – Template name.

Returns:

Human-readable string with template details.

Example

>>> print(product_template_info("wheat_variable_protein"))