nirs4all.synthesis.sources module
Multi-source dataset generation for synthetic NIRS data.
This module provides tools for generating synthetic datasets with multiple data sources, such as combining NIR spectra with molecular markers or auxiliary measurements.
Example
>>> from nirs4all.synthesis.sources import MultiSourceGenerator
>>>
>>> generator = MultiSourceGenerator(random_state=42)
>>>
>>> dataset = generator.generate(
...     n_samples=500,
...     sources=[
...         {"name": "NIR_low", "type": "nir", "wavelength_range": (1000, 1700)},
...         {"name": "NIR_high", "type": "nir", "wavelength_range": (1700, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15},
...     ]
... )
- class nirs4all.synthesis.sources.MultiSourceGenerator(random_state: int | None = None)
Bases: object
Generate synthetic multi-source NIRS datasets.
This class creates datasets that combine multiple data sources, such as:
  - Multiple NIR spectral ranges (e.g., visible-NIR + shortwave-NIR)
  - NIR spectra + molecular markers
  - NIR spectra + auxiliary measurements
The generated sources share a common underlying structure through component concentrations, which creates realistic inter-source correlations.
- rng
NumPy random generator.
- Parameters:
random_state – Random seed for reproducibility.
Example
>>> generator = MultiSourceGenerator(random_state=42)
>>>
>>> result = generator.generate(
...     n_samples=500,
...     sources=[
...         {
...             "name": "NIR",
...             "type": "nir",
...             "wavelength_range": (1000, 2500),
...             "complexity": "realistic"
...         },
...         {
...             "name": "markers",
...             "type": "aux",
...             "n_features": 20,
...             "correlation_with_target": 0.7
...         }
...     ],
...     target_range=(0, 100)
... )
>>>
>>> print(result.source_names)
['NIR', 'markers']
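The `correlation_with_target` setting in the example above can be read as a requested Pearson correlation between an auxiliary feature and the target. The following is a minimal sketch of one standard way to build such a feature, using only the Python standard library. It is an assumption for illustration, not the library's confirmed implementation; the helper names `pearson` and `correlated_feature` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_feature(target, r, rng):
    """Return a feature whose expected correlation with `target` is `r`.

    Hypothetical helper: mixes the standardized target with independent
    Gaussian noise, weighted so that the result has unit variance.
    """
    n = len(target)
    mean = sum(target) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in target) / n)
    z = [(t - mean) / std for t in target]
    noise_weight = math.sqrt(1.0 - r * r)
    return [r * zi + noise_weight * rng.gauss(0.0, 1.0) for zi in z]


rng = random.Random(42)
target = [rng.uniform(0.0, 100.0) for _ in range(5000)]
marker = correlated_feature(target, r=0.7, rng=rng)
observed_r = pearson(target, marker)
```

With a few thousand samples, `observed_r` should land close to the requested 0.7; smaller sample sizes scatter more widely around it.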
- create_dataset(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, train_ratio: float = 0.8, target_range: Tuple[float, float] | None = None, name: str = 'multi_source_synthetic') SpectroDataset
Create a SpectroDataset from multi-source generation.
- Parameters:
n_samples – Number of samples to generate.
sources – List of source configurations.
train_ratio – Proportion of samples for training.
target_range – Optional (min, max) for target scaling.
name – Dataset name.
- Returns:
SpectroDataset with multiple sources configured.
Example
>>> dataset = generator.create_dataset(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 10}
...     ],
...     train_ratio=0.8
... )
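To make the `target_range` and `train_ratio` parameters concrete, here is a minimal standard-library sketch of what they plausibly control. It assumes min-max scaling for the target and a shuffled index split for the train/test partition; neither is confirmed by this page, and `scale_to_range` and `split_indices` are hypothetical helper names.

```python
import random


def scale_to_range(values, target_range):
    """Min-max scale `values` so they span exactly (low, high)."""
    low, high = target_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant input: place every value at the lower bound.
        return [float(low) for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


def split_indices(n_samples, train_ratio, rng):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]


rng = random.Random(42)
raw_targets = [rng.gauss(0.0, 1.0) for _ in range(500)]
targets = scale_to_range(raw_targets, target_range=(0, 100))
train_idx, test_idx = split_indices(500, train_ratio=0.8, rng=rng)
```

Under these assumptions, 500 samples with `train_ratio=0.8` give 400 training and 100 test indices, and the scaled targets cover the full interval from 0 to 100.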
- generate(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, target_range: Tuple[float, float] | None = None, concentration_method: str = 'dirichlet', n_components: int = 5) MultiSourceResult
Generate a multi-source dataset.
All sources share underlying component concentrations, which creates realistic correlations between sources. NIR sources generate spectra from these concentrations, while auxiliary sources create features correlated with the same underlying structure.
- Parameters:
n_samples – Number of samples to generate.
sources – List of source configurations (SourceConfig or dict).
target_range – Optional (min, max) for scaling target values.
concentration_method – Method for generating component concentrations.
n_components – Number of underlying components.
- Returns:
MultiSourceResult containing all generated data.
Example
>>> result = generator.generate(
...     n_samples=300,
...     sources=[
...         {"name": "VIS-NIR", "type": "nir", "wavelength_range": (400, 1100)},
...         {"name": "SWIR", "type": "nir", "wavelength_range": (1100, 2500)},
...     ]
... )
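The idea that sources become correlated because they share component concentrations can be sketched without the library. The example below draws Dirichlet-distributed concentrations (matching the default `concentration_method='dirichlet'`) and feeds the same concentrations into two sources through a simple linear mixing model. The mixing model, the signature values, and the helper names `dirichlet_row`, `mix`, and `pearson` are all assumptions made for illustration.

```python
import math
import random


def dirichlet_row(n_components, alpha, rng):
    """Draw one concentration vector that is non-negative and sums to 1."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_components)]
    total = sum(draws)
    return [d / total for d in draws]


def mix(concentrations, signatures, noise_std, rng):
    """Linear mixing: each feature is a concentration-weighted sum of
    component signatures plus independent Gaussian noise."""
    n_features = len(signatures[0])
    return [
        [
            sum(c * sig[j] for c, sig in zip(conc, signatures))
            + rng.gauss(0.0, noise_std)
            for j in range(n_features)
        ]
        for conc in concentrations
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(42)
n_samples, n_components = 1000, 3
concentrations = [dirichlet_row(n_components, 1.0, rng) for _ in range(n_samples)]

# Hypothetical pure-component responses: one row per component,
# one column per feature of the source.
signatures_a = [[1.0, 0.5], [0.2, 0.9], [0.1, 0.3]]
signatures_b = [[0.9], [0.1], [0.3]]

# Both sources are generated from the SAME concentrations.
source_a = mix(concentrations, signatures_a, noise_std=0.01, rng=rng)
source_b = mix(concentrations, signatures_b, noise_std=0.01, rng=rng)

cross_r = pearson([row[0] for row in source_a], [row[0] for row in source_b])
```

Because the first feature of each source is dominated by the first component, `cross_r` comes out strongly positive even though the two sources never see each other's values.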
- class nirs4all.synthesis.sources.MultiSourceResult(sources: Dict[str, numpy.ndarray], targets: numpy.ndarray, source_configs: List[SourceConfig], wavelengths: Dict[str, numpy.ndarray] = <factory>, metadata: Dict[str, Any] | None = None)
Bases: object
Container for multi-source generation results.
- sources
Dictionary mapping source names to feature arrays.
- Type:
Dict[str, numpy.ndarray]
- targets
Target values.
- Type:
numpy.ndarray
- source_configs
Source configuration objects.
- Type:
List[SourceConfig]
- wavelengths
Dictionary mapping NIR source names to wavelength arrays.
- Type:
Dict[str, numpy.ndarray]
- source_configs: List[SourceConfig]
- class nirs4all.synthesis.sources.SourceConfig(name: str, source_type: Literal['nir', 'vis', 'aux', 'markers'] = 'nir', n_features: int | None = None, wavelength_start: float | None = None, wavelength_end: float | None = None, wavelength_step: float = 2.0, components: List[str] | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', distribution: Literal['normal', 'uniform', 'lognormal'] = 'normal', correlation_with_target: float = 0.5)
Bases: object
Configuration for a single data source.
- source_type
Type of source ('nir', 'vis', 'aux', 'markers').
- Type:
Literal['nir', 'vis', 'aux', 'markers']
- # NIR-specific
- complexity
Complexity level for NIR sources.
- Type:
Literal['simple', 'realistic', 'complex']
- # Auxiliary-specific
- distribution
Distribution for auxiliary features.
- Type:
Literal['normal', 'uniform', 'lognormal']
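The examples on this page pass sources as dicts with `"type"` and `"wavelength_range"` keys, while the `SourceConfig` signature uses `source_type`, `wavelength_start`, and `wavelength_end`. How the library maps one form to the other is not documented here. The sketch below shows one plausible mapping using a stand-in dataclass, plus a wavelength grid built from start, end, and step. `SourceConfigSketch`, `config_from_dict`, and `wavelength_grid` are hypothetical names, and including the end wavelength in the grid is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SourceConfigSketch:
    """Stand-in mirroring a subset of the documented SourceConfig fields."""
    name: str
    source_type: str = "nir"
    n_features: Optional[int] = None
    wavelength_start: Optional[float] = None
    wavelength_end: Optional[float] = None
    wavelength_step: float = 2.0


def config_from_dict(spec: Dict[str, Any]) -> SourceConfigSketch:
    """Hypothetical mapping from the dict form used in the examples."""
    start, end = spec.get("wavelength_range", (None, None))
    return SourceConfigSketch(
        name=spec["name"],
        source_type=spec.get("type", "nir"),
        n_features=spec.get("n_features"),
        wavelength_start=start,
        wavelength_end=end,
    )


def wavelength_grid(config: SourceConfigSketch) -> List[float]:
    """Evenly spaced wavelengths from start to end (end included: assumed)."""
    if config.wavelength_start is None or config.wavelength_end is None:
        # Sources without a wavelength axis (e.g. auxiliary features).
        return []
    span = config.wavelength_end - config.wavelength_start
    n_points = int(span // config.wavelength_step) + 1
    return [
        config.wavelength_start + i * config.wavelength_step
        for i in range(n_points)
    ]


nir = config_from_dict(
    {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)}
)
aux = config_from_dict({"name": "markers", "type": "aux", "n_features": 15})
grid = wavelength_grid(nir)
```

Under these assumptions, the default 2.0 step over (1000, 2500) yields 751 wavelength points, and the auxiliary source has an empty grid.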
- nirs4all.synthesis.sources.generate_multi_source(n_samples: int, sources: List[Dict[str, Any]] | None = None, *, random_state: int | None = None, target_range: Tuple[float, float] | None = None, as_dataset: bool = True, train_ratio: float = 0.8, name: str = 'multi_source_synthetic') SpectroDataset | MultiSourceResult
Convenience function for generating multi-source datasets.
- Parameters:
n_samples – Number of samples.
sources – List of source configurations. If None, a single default NIR source with wavelength range (1000, 2500) is used.
random_state – Random seed.
target_range – Target value range.
as_dataset – If True, returns a SpectroDataset; if False, returns a MultiSourceResult.
train_ratio – Training set proportion.
name – Dataset name.
- Returns:
SpectroDataset or MultiSourceResult depending on as_dataset.
Example
>>> dataset = generate_multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15}
...     ],
...     random_state=42
... )