nirs4all.synthesis.products module
Product-level synthetic NIRS generator for neural network training.
This module provides high-level APIs for generating diverse, realistic product samples with controlled variability for training neural networks. Unlike the base SyntheticNIRSGenerator, which operates at the component level, ProductGenerator works with predefined product templates that specify realistic composition variability, component correlations, and bounds.
Key Features:
- Predefined product templates with realistic composition ranges
- Controlled variability types (FIXED, UNIFORM, NORMAL, LOGNORMAL, CORRELATED, COMPUTED)
- Composition constraints (sum to 1.0, realistic bounds)
- Correlation preservation between components
- Target flexibility (any component as regression target)
- Efficient batch generation for NN training (10k-100k samples)
- Integration with custom wavelength grids
Example
>>> from nirs4all.synthesis import ProductGenerator, list_product_templates
>>>
>>> # List available templates
>>> print(list_product_templates(category="dairy"))
['cheese_variable_moisture', 'milk_variable_fat']
>>>
>>> # Generate dairy product samples
>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=10000, target="lipid")
>>>
>>> # High-variability dataset for NN training
>>> generator = ProductGenerator("food_cholesterol_variable")
>>> dataset = generator.generate(n_samples=50000, target="cholesterol")
References
[1] USDA FoodData Central (https://fdc.nal.usda.gov/)
[2] Osborne, B. G., Fearn, T., & Hindle, P. H. (1993). Practical NIR Spectroscopy.
[3] Williams, P. (2001). Implementation of Near-Infrared Technology.
- class nirs4all.synthesis.products.CategoryGenerator(templates: List[str | ProductTemplate], random_state: int | None = None, **kwargs: Any)
Bases: object
Generator combining multiple product templates for diverse datasets.
CategoryGenerator enables creation of training datasets that span multiple product types, useful for building robust models that generalize across categories.
- templates
List of ProductTemplate objects.
- Type:
List[ProductTemplate]
- generators
List of ProductGenerator objects, one per template.
- Type:
List[ProductGenerator]
- Parameters:
templates – List of template names or ProductTemplate objects.
random_state – Random seed for reproducibility.
**kwargs – Additional arguments passed to ProductGenerator.
Example
>>> # Combine dairy products
>>> gen = CategoryGenerator(["milk_variable_fat", "cheese_variable_moisture"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
>>>
>>> # Universal fat predictor training
>>> gen = CategoryGenerator([
...     "milk_variable_fat",
...     "cheese_variable_moisture",
...     "meat_variable_fat",
... ])
>>> dataset = gen.generate(n_samples=10000, target="lipid")
- generate(n_samples: int = 1000, target: str | None = None, samples_per_template: List[int] | None = None, train_ratio: float = 0.8, shuffle: bool = True, include_template_labels: bool = False) → SpectroDataset
Generate combined dataset from multiple templates.
- Parameters:
n_samples – Total number of samples to generate.
target – Component to use as regression target. Must exist in all templates.
samples_per_template – Number of samples per template. If None, divides equally.
train_ratio – Proportion of samples for training partition.
shuffle – Whether to shuffle samples across templates.
include_template_labels – If True, adds template index as metadata.
- Returns:
SpectroDataset combining samples from all templates.
Example
>>> gen = CategoryGenerator(["milk_variable_fat", "meat_variable_fat"])
>>> dataset = gen.generate(n_samples=2000, target="lipid")
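The default behaviour of samples_per_template ("divides equally") can be pictured with a small standalone sketch. It does not call nirs4all; `split_equally` is a hypothetical helper, and how the library distributes a remainder when n_samples is not divisible by the number of templates is not stated here, so giving the extra samples to the first templates is an assumption.

```python
def split_equally(n_samples, n_templates):
    """Divide n_samples across templates as evenly as possible.

    Assumption: any remainder goes to the first templates, one extra each.
    """
    base, extra = divmod(n_samples, n_templates)
    return [base + 1 if i < extra else base for i in range(n_templates)]


# Two templates, as in the example above.
counts = split_equally(2000, 2)

# Three templates with a total that does not divide evenly.
uneven = split_equally(10000, 3)
```

Passing an explicit samples_per_template list is the way to control the balance between product types instead of relying on the equal split.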
- generators: List[ProductGenerator]
- templates: List[ProductTemplate]
- class nirs4all.synthesis.products.ComponentVariation(component: str, variation_type: VariationType, value: float | None = None, min_value: float | None = None, max_value: float | None = None, mean: float | None = None, std: float | None = None, correlated_with: str | None = None, correlation: float | None = None, compute_as: str | None = None)
Bases: object
Specification for how a component’s concentration varies.
- variation_type
Type of variation (FIXED, UNIFORM, NORMAL, etc.).
- Type:
VariationType
- correlated_with
For CORRELATED, the source component name.
- Type:
str | None
- compute_as
For COMPUTED, a string describing the computation (currently supports “remainder” for 1 - sum(others)).
- Type:
str | None
Example
>>> # Fixed moisture content
>>> moisture = ComponentVariation("moisture", VariationType.FIXED, value=0.12)
>>>
>>> # Variable protein with uniform distribution
>>> protein = ComponentVariation(
...     "protein", VariationType.UNIFORM,
...     min_value=0.08, max_value=0.18
... )
>>>
>>> # Starch negatively correlated with protein
>>> starch = ComponentVariation(
...     "starch", VariationType.CORRELATED,
...     correlated_with="protein", correlation=-0.85,
...     min_value=0.55, max_value=0.72
... )
- variation_type: VariationType
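To make the CORRELATED specification more concrete, the following standalone sketch shows one common way to draw values that have a chosen correlation with a source component. It is not the library's implementation: `correlated_values` and `pearson` are hypothetical helpers, and mapping the latent variable so that the bounds sit at about three standard deviations (then clipping) is an assumption made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_values(source, correlation, min_value, max_value, rng):
    """Draw values correlated with `source` and confined to [min_value, max_value]."""
    n = len(source)
    mean = sum(source) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in source) / n)
    noise_weight = math.sqrt(1.0 - correlation ** 2)
    center = (min_value + max_value) / 2.0
    spread = (max_value - min_value) / 6.0  # assumption: bounds at about +/-3 sd
    values = []
    for x in source:
        z = (x - mean) / std  # standardized source value
        # Unit-variance latent variable with the requested correlation to z.
        latent = correlation * z + noise_weight * rng.gauss(0.0, 1.0)
        values.append(min(max_value, max(min_value, center + spread * latent)))
    return values


rng = random.Random(0)
protein = [rng.uniform(0.08, 0.18) for _ in range(2000)]
starch = correlated_values(protein, -0.85, 0.55, 0.72, rng)
r = pearson(protein, starch)  # close to the requested -0.85
```

Clipping to the bounds slightly weakens the achieved correlation, which is why the result is only approximately equal to the requested value.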
- class nirs4all.synthesis.products.ProductGenerator(template: str | ProductTemplate, random_state: int | None = None, wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, wavelengths: ndarray | None = None, instrument_wavelength_grid: str | None = None, complexity: str = 'realistic')
Bases: object
Generator for product-level synthetic NIRS spectra.
ProductGenerator creates realistic synthetic spectra based on predefined product templates with controlled composition variability. It handles correlation constraints, compositional bounds, and efficient batch generation for neural network training.
- template
The ProductTemplate used for generation.
- library
ComponentLibrary with the required spectral components.
- rng
NumPy random generator for reproducibility.
- Parameters:
template – Template name (str) or ProductTemplate object.
random_state – Random seed for reproducibility.
wavelength_start – Start wavelength in nm (default: 1000).
wavelength_end – End wavelength in nm (default: 2500).
wavelength_step – Wavelength step in nm (default: 2).
wavelengths – Custom wavelength array (overrides start/end/step).
instrument_wavelength_grid – Predefined instrument grid name.
complexity – Spectral complexity (‘simple’, ‘realistic’, ‘complex’).
Example
>>> # Generate milk samples with variable fat
>>> generator = ProductGenerator("milk_variable_fat", random_state=42)
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>>
>>> # High-variability training data
>>> generator = ProductGenerator("universal_protein_predictor")
>>> dataset = generator.generate(n_samples=50000, target="protein")
>>>
>>> # Match specific instrument wavelengths
>>> generator = ProductGenerator(
...     "wheat_variable_protein",
...     instrument_wavelength_grid="foss_xds"
... )
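As a rough guide to the size of the spectra produced with the default wavelength settings, the grid can be sketched in plain Python. `wavelength_grid` is a hypothetical helper, and treating both wavelength_start and wavelength_end as included in the grid is an assumption; the library may build its grid differently.

```python
def wavelength_grid(start=1000.0, end=2500.0, step=2.0):
    """Evenly spaced wavelengths in nm, assuming both endpoints are included."""
    n_points = int(round((end - start) / step)) + 1
    return [start + i * step for i in range(n_points)]


grid = wavelength_grid()
n_features = len(grid)  # number of spectral variables per sample
```

Under this assumption the defaults give 751 wavelengths per spectrum, which is worth keeping in mind when sizing the input layer of a network or estimating memory for 50k-sample datasets.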
- generate(n_samples: int = 1000, target: str | None = None, train_ratio: float = 0.8, include_batch_effects: bool = False, n_batches: int = 1, return_concentrations: bool = False) → SpectroDataset | Tuple[SpectroDataset, np.ndarray]
Generate synthetic product samples.
- Parameters:
n_samples – Number of samples to generate.
target – Component to use as regression target. If None, uses template’s default_target.
train_ratio – Proportion of samples for training partition.
include_batch_effects – Whether to add batch/session effects.
n_batches – Number of batches (if include_batch_effects=True).
return_concentrations – If True, also return the full concentration matrix.
- Returns:
SpectroDataset with train/test partitions. If return_concentrations=True, returns (dataset, concentrations).
Example
>>> generator = ProductGenerator("milk_variable_fat")
>>> dataset = generator.generate(n_samples=1000, target="lipid")
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
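The effect of train_ratio can be illustrated without the library. The sketch below splits sample indices into train and test partitions; `partition_indices` is a hypothetical helper, and both the shuffling and the rounding rule are assumptions rather than documented behaviour of generate().

```python
import random


def partition_indices(n_samples, train_ratio=0.8, seed=None):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the train partition size is rounded to the nearest integer.
    n_train = round(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = partition_indices(1000, train_ratio=0.8, seed=42)
```

With the defaults used above, 1000 samples would yield 800 training and 200 test samples under this rounding assumption.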
- generate_dataset_for_target(target: str, n_samples: int = 1000, target_range: Tuple[float, float] | None = None, **kwargs: Any) → SpectroDataset
Generate dataset optimized for a specific target component.
This is a convenience method that generates a dataset and optionally scales the target values to a specified range.
- Parameters:
target – Component to use as regression target.
n_samples – Number of samples to generate.
target_range – Optional (min, max) to scale target values.
**kwargs – Additional arguments passed to generate().
- Returns:
SpectroDataset ready for pipeline use.
Example
>>> generator = ProductGenerator("wheat_variable_protein")
>>> dataset = generator.generate_dataset_for_target(
...     target="protein",
...     n_samples=10000,
...     target_range=(0, 100)  # Scale to percentage
... )
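The exact transform applied by target_range is not spelled out here. One plausible reading is a linear min-max rescale of the target values onto the requested interval, sketched below with a hypothetical `rescale` helper; the library may instead use a different mapping, so treat this only as an illustration of the idea.

```python
def rescale(values, target_range):
    """Linearly map values so that min -> target_range[0] and max -> target_range[1]."""
    lo, hi = min(values), max(values)
    new_lo, new_hi = target_range
    if hi == lo:
        # Degenerate case: all values identical, map everything to the lower bound.
        return [float(new_lo) for _ in values]
    scale = (new_hi - new_lo) / (hi - lo)
    return [new_lo + (v - lo) * scale for v in values]


protein = [0.08, 0.10, 0.13, 0.18]  # mass fractions
scaled = rescale(protein, (0, 100))
```

Note that under a min-max reading the scaled values are relative positions within the observed range, not mass percentages, so it is worth checking the resulting targets before interpreting model errors.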
- class nirs4all.synthesis.products.ProductTemplate(name: str, description: str, category: str, domain: str, components: List[ComponentVariation], default_target: str = '', tags: List[str] = <factory>, references: List[str] = <factory>)
Bases: object
Template defining a product type with composition variability.
A ProductTemplate describes a realistic product type (e.g., wheat grain, milk, pharmaceutical tablet) along with specifications for how each component’s concentration can vary. This enables generation of diverse samples suitable for neural network training.
- components
List of ComponentVariation specifications.
- Type:
List[ComponentVariation]
Example
>>> milk_template = ProductTemplate(
...     name="milk_variable_fat",
...     description="Milk with variable fat content (skim to whole)",
...     category="dairy",
...     domain="food",
...     components=[
...         ComponentVariation("water", VariationType.COMPUTED, compute_as="remainder"),
...         ComponentVariation("lipid", VariationType.UNIFORM, min_value=0.005, max_value=0.06),
...         ComponentVariation("casein", VariationType.NORMAL, mean=0.028, std=0.003),
...         ComponentVariation("whey", VariationType.FIXED, value=0.006),
...         ComponentVariation("lactose", VariationType.NORMAL, mean=0.05, std=0.003),
...     ],
...     default_target="lipid",
... )
- components: List[ComponentVariation]
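The "remainder" computation used for water in the milk template can be illustrated with a standalone sketch. It reuses the ranges from the example above but does not call nirs4all; `sample_milk` is a hypothetical helper, and the order in which the library samples and constrains components is an assumption.

```python
import math
import random


def sample_milk(rng):
    """Draw one milk composition; water closes the mass balance to 1.0."""
    sample = {
        "lipid": rng.uniform(0.005, 0.06),    # UNIFORM
        "casein": rng.gauss(0.028, 0.003),    # NORMAL
        "whey": 0.006,                        # FIXED
        "lactose": rng.gauss(0.05, 0.003),    # NORMAL
    }
    # COMPUTED with compute_as="remainder": 1 - sum(others).
    sample["water"] = 1.0 - sum(sample.values())
    return sample


rng = random.Random(7)
samples = [sample_milk(rng) for _ in range(1000)]
totals = [sum(s.values()) for s in samples]
```

Because water absorbs whatever the other components leave over, every sampled composition sums to 1.0 without renormalizing the explicitly specified components.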
- class nirs4all.synthesis.products.VariationType(value)
Bases: Enum
Type of variation for component concentrations.
- FIXED
No variation, use exact specified value.
- UNIFORM
Uniform distribution between min and max.
- NORMAL
Normal (Gaussian) distribution with mean and std.
- LOGNORMAL
Log-normal distribution for non-negative values.
- CORRELATED
Value derived from correlation with another component.
- COMPUTED
Value computed from other components (e.g., 1 - sum(others)).
- COMPUTED = 6
- CORRELATED = 5
- FIXED = 1
- LOGNORMAL = 4
- NORMAL = 3
- UNIFORM = 2
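The independent variation types map naturally onto standard random draws. The sketch below shows that mapping with Python's random module; it defines a local stand-in enum and a hypothetical `sample_concentration` helper rather than using the library classes, and interpreting mean and std for LOGNORMAL as parameters of the underlying normal in log space is an assumption.

```python
import math
import random
from enum import Enum


class Variation(Enum):
    """Local stand-in mirroring four of the documented member values."""
    FIXED = 1
    UNIFORM = 2
    NORMAL = 3
    LOGNORMAL = 4


def sample_concentration(kind, rng, value=None, min_value=None,
                         max_value=None, mean=None, std=None):
    """Draw one concentration for an independent variation type."""
    if kind is Variation.FIXED:
        return value
    if kind is Variation.UNIFORM:
        return rng.uniform(min_value, max_value)
    if kind is Variation.NORMAL:
        return rng.gauss(mean, std)
    if kind is Variation.LOGNORMAL:
        # Assumption: mean and std describe the normal distribution in log space.
        return rng.lognormvariate(mean, std)
    raise ValueError(f"unsupported variation type: {kind}")


rng = random.Random(42)
fixed = sample_concentration(Variation.FIXED, rng, value=0.12)
uniform = sample_concentration(Variation.UNIFORM, rng, min_value=0.08, max_value=0.18)
lognormal = sample_concentration(Variation.LOGNORMAL, rng, mean=math.log(0.02), std=0.3)
```

CORRELATED and COMPUTED are left out because they depend on other components and therefore have to be resolved after the independent draws.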
- nirs4all.synthesis.products.generate_product_samples(template: str | ProductTemplate, n_samples: int = 1000, target: str | None = None, random_state: int | None = None, **kwargs: Any) → SpectroDataset
Generate synthetic product samples (convenience function).
This is a shorthand for creating a ProductGenerator and calling generate().
- Parameters:
template – Template name or ProductTemplate object.
n_samples – Number of samples to generate.
target – Component to use as regression target.
random_state – Random seed for reproducibility.
**kwargs – Additional arguments passed to ProductGenerator.generate().
- Returns:
SpectroDataset with synthetic samples.
Example
>>> from nirs4all.synthesis import generate_product_samples
>>>
>>> # Generate milk samples
>>> dataset = generate_product_samples(
...     "milk_variable_fat",
...     n_samples=1000,
...     target="lipid",
...     random_state=42
... )
- nirs4all.synthesis.products.get_product_template(name: str) → ProductTemplate
Get a product template by name.
- Parameters:
name – Template name.
- Returns:
ProductTemplate object.
- Raises:
ValueError – If template name is not found.
Example
>>> template = get_product_template("milk_variable_fat")
>>> print(template.description)
Milk with variable fat content (skim to whole)
- nirs4all.synthesis.products.list_product_categories() → List[str]
List all unique product categories.
- Returns:
Sorted list of category names.
Example
>>> categories = list_product_categories()
>>> print(categories)
['dairy', 'fruit', 'grain', 'legume', 'meat', 'nn_training', 'solid_dosage']
- nirs4all.synthesis.products.list_product_domains() → List[str]
List all unique product domains.
- Returns:
Sorted list of domain names.
Example
>>> domains = list_product_domains()
>>> print(domains)
['agriculture', 'food', 'pharmaceutical']
- nirs4all.synthesis.products.list_product_templates(category: str | None = None, domain: str | None = None, tags: List[str] | None = None) → List[str]
List available product templates with optional filtering.
- Parameters:
category – Filter by category (e.g., “dairy”, “grain”, “pharma”).
domain – Filter by domain (e.g., “food”, “agriculture”, “pharmaceutical”).
tags – Filter by tags (any match).
- Returns:
Sorted list of template names matching the criteria.
Example
>>> # List all templates
>>> all_templates = list_product_templates()
>>>
>>> # List dairy templates
>>> dairy = list_product_templates(category="dairy")
>>>
>>> # List NN training templates
>>> nn_templates = list_product_templates(tags=["nn_training"])
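The filtering semantics described above (exact match on category and domain, any-match on tags, sorted result) can be sketched against a small in-memory registry. Everything here is illustrative: `REGISTRY` and `list_templates` are hypothetical, most of the metadata is made up for the example, and combining the three filters with a logical AND is an assumption.

```python
# Hypothetical registry; only the milk entry's category and domain come from
# the ProductTemplate example, the rest is invented for illustration.
REGISTRY = {
    "milk_variable_fat": {"category": "dairy", "domain": "food", "tags": ["nn_training"]},
    "cheese_variable_moisture": {"category": "dairy", "domain": "food", "tags": []},
    "wheat_variable_protein": {"category": "grain", "domain": "agriculture", "tags": ["nn_training"]},
}


def list_templates(category=None, domain=None, tags=None):
    """Return sorted template names matching every filter that is given."""
    names = []
    for name, meta in REGISTRY.items():
        if category is not None and meta["category"] != category:
            continue
        if domain is not None and meta["domain"] != domain:
            continue
        # Tags use any-match: one shared tag is enough.
        if tags is not None and not set(tags) & set(meta["tags"]):
            continue
        names.append(name)
    return sorted(names)


dairy = list_templates(category="dairy")
nn_ready = list_templates(tags=["nn_training"])
dairy_nn = list_templates(category="dairy", tags=["nn_training"])
```

An empty result from a combined filter usually means the filters are individually valid but no single template satisfies all of them.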