nirs4all.synthesis.products module

Product-level synthetic NIRS generator for neural network training.

This module provides high-level APIs for generating diverse, realistic product samples with controlled variability for neural network training. Unlike the base SyntheticNIRSGenerator, which operates at the component level, ProductGenerator works with predefined product templates that specify realistic composition variability, component correlations, and bounds.

Key Features:

Predefined product templates with realistic composition ranges
Controlled variability types (FIXED, UNIFORM, NORMAL, LOGNORMAL, CORRELATED, COMPUTED)
Composition constraints (sum to 1.0, realistic bounds)
Correlation preservation between components
Target flexibility (any component as regression target)
Efficient batch generation for NN training (10k-100k samples)
Integration with custom wavelength grids
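
The sum-to-1.0 constraint listed above can be pictured with a small standalone sketch. It does not call nirs4all: the helper name sample_composition and the zero floor on the normal draws are assumptions made for illustration, and the numeric ranges are borrowed from the milk_variable_fat ProductTemplate example in this module.

```python
import random


def sample_composition(rng):
    """Draw one milk-like composition whose mass fractions sum to 1.0."""
    parts = {
        "lipid": rng.uniform(0.005, 0.06),             # UNIFORM between bounds
        "casein": max(0.0, rng.gauss(0.028, 0.003)),   # NORMAL, floored at zero
        "whey": 0.006,                                 # FIXED
        "lactose": max(0.0, rng.gauss(0.05, 0.003)),   # NORMAL, floored at zero
    }
    # COMPUTED "remainder": water takes whatever fraction is left over.
    parts["water"] = 1.0 - sum(parts.values())
    return parts


rng = random.Random(42)
compositions = [sample_composition(rng) for _ in range(1000)]
```

Computing one component as the remainder is what keeps every sample on the sum-to-1.0 constraint without rescaling the other components.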

Example

>>> from nirs4all.synthesis import ProductGenerator, list_product_templates
>>>
>>> # List available templates
>>> print(list_product_templates(category="dairy"))
['cheese_variable_moisture', 'milk_variable_fat']
>>>
>>> # Generate dairy product samples
>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=10000, target="lipid")
>>>
>>> # High-variability dataset for NN training
>>> generator = ProductGenerator("food_cholesterol_variable")
>>> dataset = generator.generate(n_samples=50000, target="cholesterol")

References

[1] USDA FoodData Central (https://fdc.nal.usda.gov/)
[2] Osborne, B. G., Fearn, T., & Hindle, P. H. (1993). Practical NIR Spectroscopy.
[3] Williams, P. (2001). Implementation of Near-Infrared Technology.

class nirs4all.synthesis.products.CategoryGenerator(templates: List[str | ProductTemplate], random_state: int | None = None, **kwargs: Any)[source]

Bases: object

Generator combining multiple product templates for diverse datasets.

CategoryGenerator enables creation of training datasets that span multiple product types, useful for building robust models that generalize across categories.

templates

List of ProductTemplate objects.

Type:: List[nirs4all.synthesis.products.ProductTemplate]

generators

List of ProductGenerator objects for each template.

Type:: List[nirs4all.synthesis.products.ProductGenerator]

Parameters:

templates – List of template names or ProductTemplate objects.
random_state – Random seed for reproducibility.
**kwargs – Additional arguments passed to ProductGenerator.

Example

>>> # Combine dairy products
>>> gen = CategoryGenerator(["milk_variable_fat", "cheese_variable_moisture"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
>>>
>>> # Universal fat predictor training
>>> gen = CategoryGenerator([
...     "milk_variable_fat",
...     "cheese_variable_moisture",
...     "meat_variable_fat",
... ])
>>> dataset = gen.generate(n_samples=10000, target="lipid")

__repr__() → str[source]: Return string representation.

generate(n_samples: int = 1000, target: str | None = None, samples_per_template: List[int] | None = None, train_ratio: float = 0.8, shuffle: bool = True, include_template_labels: bool = False) → SpectroDataset[source]

Generate combined dataset from multiple templates.

Parameters:

n_samples – Total number of samples to generate.
target – Component to use as regression target. Must exist in all templates.
samples_per_template – Number of samples to draw from each template, in template order. If None, n_samples is divided equally among the templates.
train_ratio – Proportion of samples for training partition.
shuffle – Whether to shuffle samples across templates.
include_template_labels – If True, adds template index as metadata.

Returns:

SpectroDataset combining samples from all templates.

Example

>>> gen = CategoryGenerator(["milk_variable_fat", "meat_variable_fat"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
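
The equal division used when samples_per_template is None can be sketched as follows. This is a standalone illustration, not the library's code: the helper name split_equally is hypothetical, and giving any leftover samples to the first templates is an assumption, since the documentation does not say how a remainder is handled.

```python
def split_equally(n_samples, n_templates):
    """Divide n_samples across templates; the first templates absorb any remainder."""
    base, extra = divmod(n_samples, n_templates)
    return [base + (1 if i < extra else 0) for i in range(n_templates)]


counts = split_equally(2000, 2)    # two templates, even split
uneven = split_equally(10000, 3)   # three templates, one leftover sample

# Template index per sample, similar in spirit to include_template_labels=True.
labels = [idx for idx, count in enumerate(counts) for _ in range(count)]
```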

generators: List[ProductGenerator]

templates: List[ProductTemplate]

class nirs4all.synthesis.products.ComponentVariation

Bases: object

Specification for how a component’s concentration varies.

component

Name of the spectral component (must exist in library).

Type:: str

variation_type

Type of variation (FIXED, UNIFORM, NORMAL, etc.).

Type:: nirs4all.synthesis.products.VariationType

value

For FIXED type, the exact value.

Type:: float | None

min_value

For UNIFORM/NORMAL, the minimum bound.

Type:: float | None

max_value

For UNIFORM/NORMAL, the maximum bound.

Type:: float | None

mean

For NORMAL/LOGNORMAL, the distribution mean.

Type:: float | None

std

For NORMAL/LOGNORMAL, the distribution standard deviation.

Type:: float | None

correlated_with

For CORRELATED, the source component name.

Type:: str | None

correlation

For CORRELATED, the correlation coefficient.

Type:: float | None

compute_as

For COMPUTED, a string describing the computation (currently supports “remainder” for 1 - sum(others)).

Type:: str | None

Example

>>> # Fixed moisture content
>>> moisture = ComponentVariation("moisture", VariationType.FIXED, value=0.12)
>>>
>>> # Variable protein with uniform distribution
>>> protein = ComponentVariation(
...     "protein", VariationType.UNIFORM,
...     min_value=0.08, max_value=0.18
... )
>>>
>>> # Starch negatively correlated with protein
>>> starch = ComponentVariation(
...     "starch", VariationType.CORRELATED,
...     correlated_with="protein", correlation=-0.85,
...     min_value=0.55, max_value=0.72
... )
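
One common way to obtain a component with a prescribed correlation to another is to mix the standardized source values with independent noise. The sketch below shows that technique with the standard library only. It is not taken from nirs4all: the function names, the mapping of roughly three standard deviations onto the bounds, and the final clipping are all assumptions.

```python
import math
import random


def sample_correlated(source, rho, lo, hi, rng):
    """Draw values with Pearson correlation close to rho against source, within [lo, hi]."""
    n = len(source)
    mean = sum(source) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in source) / n)
    center, half_width = (lo + hi) / 2.0, (hi - lo) / 2.0
    out = []
    for x in source:
        z = (x - mean) / std                         # standardized source value
        noise = rng.gauss(0.0, 1.0)                  # independent standard normal
        z_new = rho * z + math.sqrt(1.0 - rho ** 2) * noise
        value = center + z_new * half_width / 3.0    # about 3 sigma spans the bounds
        out.append(min(hi, max(lo, value)))          # enforce the bounds
    return out


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
protein = [rng.uniform(0.08, 0.18) for _ in range(5000)]
starch = sample_correlated(protein, -0.85, 0.55, 0.72, rng)
r = pearson(protein, starch)
```

Clipping to the bounds slightly weakens the realized correlation, so r only approximates the requested coefficient.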

__post_init__() → None[source]: Validate specification based on variation type.

component: str

compute_as: str | None = None

correlated_with: str | None = None

correlation: float | None = None

max_value: float | None = None

mean: float | None = None

min_value: float | None = None

std: float | None = None

value: float | None = None

variation_type: VariationType

class nirs4all.synthesis.products.ProductGenerator(template: str | ProductTemplate, random_state: int | None = None, wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, wavelengths: ndarray | None = None, instrument_wavelength_grid: str | None = None, complexity: str = 'realistic')[source]

Bases: object

Generator for product-level synthetic NIRS spectra.

ProductGenerator creates realistic synthetic spectra based on predefined product templates with controlled composition variability. It handles correlation constraints, compositional bounds, and efficient batch generation for neural network training.

template: The ProductTemplate used for generation.

library: ComponentLibrary with the required spectral components.

rng: NumPy random generator for reproducibility.

Parameters:

template – Template name (str) or ProductTemplate object.
random_state – Random seed for reproducibility.
wavelength_start – Start wavelength in nm (default: 1000).
wavelength_end – End wavelength in nm (default: 2500).
wavelength_step – Wavelength step in nm (default: 2).
wavelengths – Custom wavelength array (overrides start/end/step).
instrument_wavelength_grid – Predefined instrument grid name.
complexity – Spectral complexity ('simple', 'realistic', or 'complex').

Example

>>> # Generate milk samples with variable fat
>>> generator = ProductGenerator("milk_variable_fat", random_state=42)
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>>
>>> # High-variability training data
>>> generator = ProductGenerator("universal_protein_predictor")
>>> dataset = generator.generate(n_samples=50000, target="protein")
>>>
>>> # Match specific instrument wavelengths
>>> generator = ProductGenerator(
...     "wheat_variable_protein",
...     instrument_wavelength_grid="foss_xds"
... )
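
For the default wavelength_start, wavelength_end, and wavelength_step values, the resulting grid can be sketched as below. Whether the library includes the end point is not stated in this reference, so the inclusive end (and the helper name wavelength_grid) is an assumption.

```python
def wavelength_grid(start=1000.0, end=2500.0, step=2.0):
    """Evenly spaced wavelengths in nm, end point included (an assumption)."""
    n_points = int(round((end - start) / step)) + 1
    return [start + i * step for i in range(n_points)]


grid = wavelength_grid()
n_wavelengths = len(grid)   # number of spectral variables per sample
```

Under this assumption the defaults give 751 wavelengths from 1000 nm to 2500 nm.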

__repr__() → str[source]: Return string representation.

generate(n_samples: int = 1000, target: str | None = None, train_ratio: float = 0.8, include_batch_effects: bool = False, n_batches: int = 1, return_concentrations: bool = False) → SpectroDataset | Tuple[SpectroDataset, ndarray][source]

Generate synthetic product samples.

Parameters:

n_samples – Number of samples to generate.
target – Component to use as regression target. If None, uses template’s default_target.
train_ratio – Proportion of samples for training partition.
include_batch_effects – Whether to add batch/session effects.
n_batches – Number of batches (if include_batch_effects=True).
return_concentrations – If True, also return the full concentration matrix.

Returns:

SpectroDataset with train/test partitions. If return_concentrations=True, returns (dataset, concentrations).

Example

>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
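
How train_ratio translates into partition sizes can be sketched without the library. The helper split_indices is hypothetical, and both the shuffle before splitting and the rounding rule are assumptions rather than documented behaviour.

```python
import random


def split_indices(n_samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices and cut them into train and test partitions."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(1000, train_ratio=0.8)
```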

generate_dataset_for_target(target: str, n_samples: int = 1000, target_range: Tuple[float, float] | None = None, **kwargs: Any) → SpectroDataset[source]

Generate dataset optimized for a specific target component.

This is a convenience method that generates a dataset and optionally scales the target values to a specified range.

Parameters:

target – Component to use as regression target.
n_samples – Number of samples to generate.
target_range – Optional (min, max) to scale target values.
**kwargs – Additional arguments passed to generate().

Returns:

SpectroDataset ready for pipeline use.

Example

>>> generator = ProductGenerator("wheat_variable_protein")
>>> dataset = generator.generate_dataset_for_target(
...     target="protein",
...     n_samples=10000,
...     target_range=(0, 100)  # Scale to percentage
... )
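
The reference does not say which scaling rule target_range applies. A minimal sketch, assuming linear min-max scaling, is shown below. The helper name scale_to_range and the sample values are illustrative only.

```python
def scale_to_range(values, target_range):
    """Linearly map values so the observed min and max land on target_range."""
    lo, hi = target_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Degenerate case: every value is identical, so return the lower bound.
        return [float(lo) for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


protein = [0.08, 0.10, 0.13, 0.18]          # mass fractions
scaled = scale_to_range(protein, (0, 100))
```

Under min-max scaling the smallest observed value maps to 0, which is not the same as multiplying a mass fraction by 100. Check the library's actual behaviour before interpreting scaled targets as percentages.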

class nirs4all.synthesis.products.ProductTemplate(name: str, description: str, category: str, domain: str, components: List[ComponentVariation], default_target: str = '', tags: List[str] = <factory>, references: List[str] = <factory>)[source]

Bases: object

Template defining a product type with composition variability.

A ProductTemplate describes a realistic product type (e.g., wheat grain, milk, pharmaceutical tablet) along with specifications for how each component’s concentration can vary. This enables generation of diverse samples suitable for neural network training.

name

Unique identifier for the template.

Type:: str

description

Human-readable description.

Type:: str

category

Product category (e.g., “dairy”, “grain”, “pharma”).

Type:: str

domain

Application domain (e.g., “agriculture”, “food”, “pharmaceutical”).

Type:: str

components

List of ComponentVariation specifications.

Type:: List[nirs4all.synthesis.products.ComponentVariation]

default_target

Default component to use as regression target.

Type:: str

tags

Classification tags for filtering.

Type:: List[str]

references

Literature or data source citations.

Type:: List[str]

Example

>>> milk_template = ProductTemplate(
...     name="milk_variable_fat",
...     description="Milk with variable fat content (skim to whole)",
...     category="dairy",
...     domain="food",
...     components=[
...         ComponentVariation("water", VariationType.COMPUTED, compute_as="remainder"),
...         ComponentVariation("lipid", VariationType.UNIFORM, min_value=0.005, max_value=0.06),
...         ComponentVariation("casein", VariationType.NORMAL, mean=0.028, std=0.003),
...         ComponentVariation("whey", VariationType.FIXED, value=0.006),
...         ComponentVariation("lactose", VariationType.NORMAL, mean=0.05, std=0.003),
...     ],
...     default_target="lipid",
... )

__post_init__() → None[source]: Validate template consistency.

category: str

property component_names: List[str]: Return list of component names in this template.

components: List[ComponentVariation]

default_target: str = ''

description: str

domain: str

info() → str[source]: Return formatted information about the template.

name: str

references: List[str]

tags: List[str]

class nirs4all.synthesis.products.VariationType(value)[source]

Bases: Enum

Type of variation for component concentrations.

FIXED: No variation, use exact specified value.

UNIFORM: Uniform distribution between min and max.

NORMAL: Normal (Gaussian) distribution with mean and std.

LOGNORMAL: Log-normal distribution for non-negative values.

CORRELATED: Value derived from correlation with another component.

COMPUTED: Value computed from other components (e.g., 1 - sum(others)).

COMPUTED = 6

CORRELATED = 5

FIXED = 1

LOGNORMAL = 4

NORMAL = 3

UNIFORM = 2
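
The independent variation types can be pictured as a dispatch over an enum. The sketch below uses a hypothetical SketchVariation enum and draw function rather than the library's classes. For LOGNORMAL it assumes mean and std describe the distribution itself (not the underlying normal) and converts them to the parameters expected by random.lognormvariate.

```python
import math
import random
from enum import Enum


class SketchVariation(Enum):
    FIXED = 1
    UNIFORM = 2
    NORMAL = 3
    LOGNORMAL = 4


def draw(kind, rng, value=None, min_value=None, max_value=None, mean=None, std=None):
    """Draw one concentration for an independent variation type."""
    if kind is SketchVariation.FIXED:
        return value
    if kind is SketchVariation.UNIFORM:
        return rng.uniform(min_value, max_value)
    if kind is SketchVariation.NORMAL:
        return rng.gauss(mean, std)
    if kind is SketchVariation.LOGNORMAL:
        # Convert the distribution's mean/std to the underlying normal's mu/sigma.
        sigma2 = math.log(1.0 + (std / mean) ** 2)
        mu = math.log(mean) - sigma2 / 2.0
        return rng.lognormvariate(mu, math.sqrt(sigma2))
    raise ValueError(f"unsupported variation: {kind}")


rng = random.Random(1)
fixed_draw = draw(SketchVariation.FIXED, rng, value=0.006)
uniform_draws = [draw(SketchVariation.UNIFORM, rng, min_value=0.005, max_value=0.06)
                 for _ in range(1000)]
lognormal_draws = [draw(SketchVariation.LOGNORMAL, rng, mean=0.05, std=0.01)
                   for _ in range(20000)]
```

CORRELATED and COMPUTED are left out because they depend on other components and cannot be drawn in isolation.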

nirs4all.synthesis.products.generate_product_samples(template: str | ProductTemplate, n_samples: int = 1000, target: str | None = None, random_state: int | None = None, **kwargs: Any) → SpectroDataset[source]

Generate synthetic product samples (convenience function).

This is a shorthand for creating a ProductGenerator and calling generate().

Parameters:

template – Template name or ProductTemplate object.
n_samples – Number of samples to generate.
target – Component to use as regression target.
random_state – Random seed for reproducibility.
**kwargs – Additional arguments passed to ProductGenerator.generate().

Returns:

SpectroDataset with synthetic samples.

Example

>>> from nirs4all.synthesis import generate_product_samples
>>>
>>> # Generate milk samples
>>> dataset = generate_product_samples(
...     "milk_variable_fat",
...     n_samples=1000,
...     target="lipid",
...     random_state=42
... )

nirs4all.synthesis.products.get_product_template(name: str) → ProductTemplate[source]

Get a product template by name.

Parameters:

name – Template name.

Returns:

ProductTemplate object.

Raises:

ValueError – If the template name is not found.

Example

>>> template = get_product_template("milk_variable_fat")
>>> print(template.description)
Milk with variable fat content (skim to whole)

nirs4all.synthesis.products.list_product_categories() → List[str][source]

List all unique product categories.

Returns:

Sorted list of category names.

Example

>>> categories = list_product_categories()
>>> print(categories)
['dairy', 'fruit', 'grain', 'legume', 'meat', 'nn_training', 'solid_dosage']

nirs4all.synthesis.products.list_product_domains() → List[str][source]

List all unique product domains.

Returns:

Sorted list of domain names.

Example

>>> domains = list_product_domains()
>>> print(domains)
['agriculture', 'food', 'pharmaceutical']

nirs4all.synthesis.products.list_product_templates(category: str | None = None, domain: str | None = None, tags: List[str] | None = None) → List[str][source]

List available product templates with optional filtering.

Parameters:

category – Filter by category (e.g., “dairy”, “grain”, “pharma”).
domain – Filter by domain (e.g., “food”, “agriculture”, “pharmaceutical”).
tags – Filter by tags (any match).

Returns:

Sorted list of template names matching the criteria.

Example

>>> # List all templates
>>> all_templates = list_product_templates()
>>>
>>> # List dairy templates
>>> dairy = list_product_templates(category="dairy")
>>>
>>> # List NN training templates
>>> nn_templates = list_product_templates(tags=["nn_training"])
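
The filtering semantics described above (exact match on category and domain, any-match on tags, sorted result) can be sketched against a toy registry. The REGISTRY contents, including the domains and tags assigned to each name, are invented for illustration and do not reflect the library's real template metadata.

```python
# Hypothetical registry: names come from this reference, metadata is made up.
REGISTRY = {
    "milk_variable_fat": {"category": "dairy", "domain": "food", "tags": ["liquid"]},
    "cheese_variable_moisture": {"category": "dairy", "domain": "food", "tags": ["solid"]},
    "wheat_variable_protein": {"category": "grain", "domain": "agriculture", "tags": ["solid"]},
    "universal_protein_predictor": {"category": "nn_training", "domain": "food",
                                    "tags": ["nn_training"]},
}


def list_templates(registry, category=None, domain=None, tags=None):
    """Return sorted template names matching every filter that is not None."""
    names = []
    for name, meta in registry.items():
        if category is not None and meta["category"] != category:
            continue
        if domain is not None and meta["domain"] != domain:
            continue
        # Tags use any-match: a single shared tag is enough.
        if tags is not None and not set(tags) & set(meta["tags"]):
            continue
        names.append(name)
    return sorted(names)


dairy = list_templates(REGISTRY, category="dairy")
solid_or_nn = list_templates(REGISTRY, tags=["solid", "nn_training"])
```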

nirs4all.synthesis.products.product_template_info(name: str) → str[source]

Return formatted information about a product template.

Parameters:

name – Template name.

Returns:

Human-readable string with template details.

Example

>>> print(product_template_info("wheat_variable_protein"))