nirs4all.data.synthetic.sources module

Multi-source dataset generation for synthetic NIRS data.

This module provides tools for generating synthetic datasets that contain multiple data sources, for example NIR spectra combined with molecular markers or auxiliary measurements.

Example

>>> from nirs4all.data.synthetic.sources import MultiSourceGenerator
>>>
>>> generator = MultiSourceGenerator(random_state=42)
>>>
>>> dataset = generator.generate(
...     n_samples=500,
...     sources=[
...         {"name": "NIR_low", "type": "nir", "wavelength_range": (1000, 1700)},
...         {"name": "NIR_high", "type": "nir", "wavelength_range": (1700, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15},
...     ]
... )
class nirs4all.data.synthetic.sources.MultiSourceGenerator(random_state: int | None = None)

Bases: object

Generate synthetic multi-source NIRS datasets.

This class creates datasets combining multiple data sources, such as:
  • Multiple NIR spectral ranges (e.g., visible-NIR + shortwave-NIR)

  • NIR spectra + molecular markers

  • NIR spectra + auxiliary measurements

The generated sources share common underlying structure through component concentrations, creating realistic inter-source correlations.

rng

NumPy random generator.

Parameters:

random_state – Random seed for reproducibility.

Example

>>> generator = MultiSourceGenerator(random_state=42)
>>>
>>> result = generator.generate(
...     n_samples=500,
...     sources=[
...         {
...             "name": "NIR",
...             "type": "nir",
...             "wavelength_range": (1000, 2500),
...             "complexity": "realistic"
...         },
...         {
...             "name": "markers",
...             "type": "aux",
...             "n_features": 20,
...             "correlation_with_target": 0.7
...         }
...     ],
...     target_range=(0, 100)
... )
>>>
>>> print(result.source_names)
['NIR', 'markers']
create_dataset(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, train_ratio: float = 0.8, target_range: Tuple[float, float] | None = None, name: str = 'multi_source_synthetic') -> SpectroDataset

Create a SpectroDataset from multi-source generation.

Parameters:
  • n_samples – Number of samples to generate.

  • sources – List of source configurations.

  • train_ratio – Proportion of samples for training.

  • target_range – Optional (min, max) for target scaling.

  • name – Dataset name.

Returns:

SpectroDataset with multiple sources configured.

Example

>>> dataset = generator.create_dataset(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 10}
...     ],
...     train_ratio=0.8
... )
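
The effect of train_ratio on partition sizes can be sketched as follows. This is only an illustration: the documentation does not say how create_dataset assigns samples to partitions, so the shuffled split and the helper name split_indices are assumptions.

```python
import random


def split_indices(n_samples, train_ratio=0.8, seed=42):
    """Hypothetical helper: shuffle sample indices and split them by train_ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    indices = list(range(n_samples))
    # A seeded local generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = split_indices(500, train_ratio=0.8)
```

With n_samples=500 and train_ratio=0.8 this gives 400 training and 100 test indices.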
generate(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, target_range: Tuple[float, float] | None = None, concentration_method: str = 'dirichlet', n_components: int = 5) -> MultiSourceResult

Generate a multi-source dataset.

All sources share underlying component concentrations, which creates realistic correlations between sources. NIR sources generate spectra from these concentrations, while auxiliary sources create features correlated with the same underlying structure.
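
The shared-structure idea can be illustrated with a minimal NumPy sketch. It is not the library's implementation: it assumes Dirichlet-distributed concentrations and uses hypothetical random linear loadings (loadings_a, loadings_b) in place of the real spectral and auxiliary generators.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_components = 300, 5

# Shared latent structure: each row holds one sample's component concentrations (sums to 1).
concentrations = rng.dirichlet(np.ones(n_components), size=n_samples)

# Hypothetical linear generators: each source mixes the same concentrations differently.
loadings_a = rng.normal(size=(n_components, 50))  # stand-in for an NIR source
loadings_b = rng.normal(size=(n_components, 8))   # stand-in for an auxiliary source

source_a = concentrations @ loadings_a + 0.01 * rng.normal(size=(n_samples, 50))
source_b = concentrations @ loadings_b + 0.01 * rng.normal(size=(n_samples, 8))

# Because both sources depend on the same concentrations, source_a predicts source_b well.
design = np.column_stack([source_a, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(design, source_b, rcond=None)
residual = source_b - design @ coef
centered = source_b - source_b.mean(axis=0)
r_squared = 1.0 - (residual ** 2).sum() / (centered ** 2).sum()
```

A high r_squared here reflects only the common latent concentrations; the two loading matrices are drawn independently.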

Parameters:
  • n_samples – Number of samples to generate.

  • sources – List of source configurations (SourceConfig or dict).

  • target_range – Optional (min, max) for scaling target values.

  • concentration_method – Method for generating component concentrations.

  • n_components – Number of underlying components.

Returns:

MultiSourceResult containing all generated data.

Example

>>> result = generator.generate(
...     n_samples=300,
...     sources=[
...         {"name": "VIS-NIR", "type": "nir", "wavelength_range": (400, 1100)},
...         {"name": "SWIR", "type": "nir", "wavelength_range": (1100, 2500)},
...     ]
... )
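
The target_range parameter takes a (min, max) pair for scaling target values. The documentation does not state the scaling rule, so the sketch below assumes plain min-max rescaling; the helper name scale_to_range is hypothetical.

```python
def scale_to_range(values, target_range):
    """Hypothetical helper: linearly rescale values to span (low, high)."""
    low, high = target_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant input: map every value to the midpoint of the range.
        return [(low + high) / 2.0 for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


raw_targets = [0.2, 0.5, 0.35, 0.9]
scaled = scale_to_range(raw_targets, (0, 100))
```

Under this assumption the smallest raw target maps to the lower bound and the largest to the upper bound, with the ordering of samples preserved.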
class nirs4all.data.synthetic.sources.MultiSourceResult(sources: Dict[str, ndarray], targets: ndarray, source_configs: List[SourceConfig], wavelengths: Dict[str, ndarray] = <factory>, metadata: Dict[str, Any] | None = None)

Bases: object

Container for multi-source generation results.

sources

Dictionary mapping source names to feature arrays.

Type:

Dict[str, numpy.ndarray]

targets

Target values.

Type:

numpy.ndarray

source_configs

Source configuration objects.

Type:

List[nirs4all.data.synthetic.sources.SourceConfig]

wavelengths

Dictionary mapping NIR source names to wavelength arrays.

Type:

Dict[str, numpy.ndarray]

metadata

Optional metadata dictionary.

Type:

Dict[str, Any] | None

get_combined_features() -> ndarray

Concatenate all sources into a single feature matrix.
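
As a rough illustration of what such a concatenation might look like, the sketch below stacks per-source arrays column-wise with NumPy. The column order (dictionary insertion order) and the offsets bookkeeping are assumptions; the sources dictionary is a hypothetical stand-in for MultiSourceResult.sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for MultiSourceResult.sources: name -> (n_samples, n_features) array.
sources = {
    "NIR": rng.normal(size=(4, 6)),
    "markers": rng.normal(size=(4, 3)),
}

# Assumed behaviour: column-wise concatenation in dictionary (insertion) order.
combined = np.hstack([sources[name] for name in sources])

# Column offsets record which columns of the combined matrix belong to which source.
offsets = {}
start = 0
for name, block in sources.items():
    offsets[name] = (start, start + block.shape[1])
    start += block.shape[1]
```

The combined matrix has as many columns as the sum of the per-source feature counts, which is the quantity n_features_total reports.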

metadata: Dict[str, Any] | None = None
property n_features_total: int

Get total number of features across all sources.

property n_samples: int

Get number of samples.

source_configs: List[SourceConfig]
property source_names: List[str]

Get list of source names.

sources: Dict[str, ndarray]
targets: ndarray
wavelengths: Dict[str, ndarray]
class nirs4all.data.synthetic.sources.SourceConfig(name: str, source_type: Literal['nir', 'vis', 'aux', 'markers'] = 'nir', n_features: int | None = None, wavelength_start: float | None = None, wavelength_end: float | None = None, wavelength_step: float = 2.0, components: List[str] | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', distribution: Literal['normal', 'uniform', 'lognormal'] = 'normal', correlation_with_target: float = 0.5)

Bases: object

Configuration for a single data source.

name

Unique identifier for the source.

Type:

str

source_type

Type of source ('nir', 'vis', 'aux', 'markers').

Type:

Literal['nir', 'vis', 'aux', 'markers']

n_features

Number of features (auto-calculated for NIR sources).

Type:

int | None

# NIR-specific
wavelength_start

Start wavelength for NIR sources.

Type:

float | None

wavelength_end

End wavelength for NIR sources.

Type:

float | None

wavelength_step

Wavelength step for NIR sources.

Type:

float
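
Since n_features is auto-calculated for NIR sources, the feature count presumably follows from wavelength_start, wavelength_end and wavelength_step. The sketch below assumes an evenly spaced grid that includes both endpoints; the library may treat the end point differently, and nir_feature_count is a hypothetical helper.

```python
import math


def nir_feature_count(start, end, step=2.0):
    """Hypothetical helper: grid points from start to end (inclusive) at the given step."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    # The small epsilon guards against floating-point error in the division.
    return int(math.floor((end - start) / step + 1e-9)) + 1


n = nir_feature_count(1000, 2500, 2.0)
```

Under the inclusive-endpoint assumption, the range (1000, 2500) at the default step of 2.0 yields 751 wavelengths.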

components

Component names for NIR sources.

Type:

List[str] | None

complexity

Complexity level for NIR sources.

Type:

Literal['simple', 'realistic', 'complex']

# Auxiliary-specific
distribution

Distribution for auxiliary features.

Type:

Literal['normal', 'uniform', 'lognormal']

correlation_with_target

Strength of the correlation between the auxiliary features and the target.

Type:

float
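
One common way to obtain a feature with a chosen correlation to a target is to mix the standardized target with independent noise. The sketch below shows that construction for illustration only; it is an assumption, not a description of how the library generates auxiliary features, and correlated_feature is a hypothetical helper.

```python
import numpy as np


def correlated_feature(target, rho, rng):
    """Hypothetical helper: feature whose expected correlation with target is rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    z = (target - target.mean()) / target.std()  # standardized target
    noise = rng.normal(size=target.shape)        # independent unit-variance noise
    # Weights rho and sqrt(1 - rho^2) give unit variance and correlation rho in expectation.
    return rho * z + np.sqrt(1.0 - rho ** 2) * noise


rng = np.random.default_rng(42)
target = rng.uniform(0, 100, size=5000)
feature = correlated_feature(target, 0.7, rng)
observed = np.corrcoef(target, feature)[0, 1]
```

The observed sample correlation fluctuates around the requested value and gets closer to it as the number of samples grows.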

complexity: Literal['simple', 'realistic', 'complex'] = 'simple'
components: List[str] | None = None
correlation_with_target: float = 0.5
distribution: Literal['normal', 'uniform', 'lognormal'] = 'normal'
classmethod from_dict(config: Dict[str, Any]) -> SourceConfig

Create a SourceConfig from a dictionary.
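
The dictionary examples in this module use the keys "type" and "wavelength_range", while the class fields are source_type, wavelength_start and wavelength_end. The sketch below shows one way such a mapping could work. The class SourceConfigSketch is a hypothetical, trimmed-down stand-in, and the key aliases are assumptions inferred from the examples rather than confirmed behaviour.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class SourceConfigSketch:
    """Hypothetical, trimmed-down stand-in for SourceConfig."""

    name: str
    source_type: str = "nir"
    n_features: Optional[int] = None
    wavelength_start: Optional[float] = None
    wavelength_end: Optional[float] = None
    correlation_with_target: float = 0.5

    @classmethod
    def from_dict(cls, config: Dict[str, Any]) -> "SourceConfigSketch":
        data = dict(config)  # copy so the caller's dict is not mutated
        # Assumed alias: the examples use "type" where the field is source_type.
        if "type" in data:
            data["source_type"] = data.pop("type")
        # Assumed alias: "wavelength_range" expands to the start/end fields.
        if "wavelength_range" in data:
            data["wavelength_start"], data["wavelength_end"] = data.pop("wavelength_range")
        unknown = set(data) - {f.name for f in fields(cls)}
        if unknown:
            raise ValueError(f"unknown keys: {sorted(unknown)}")
        return cls(**data)


cfg = SourceConfigSketch.from_dict(
    {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)}
)
```

Rejecting unknown keys is a design choice of this sketch; it surfaces typos in a configuration early instead of silently ignoring them.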

n_features: int | None = None
name: str
source_type: Literal['nir', 'vis', 'aux', 'markers'] = 'nir'
wavelength_end: float | None = None
wavelength_start: float | None = None
wavelength_step: float = 2.0
nirs4all.data.synthetic.sources.generate_multi_source(n_samples: int, sources: List[Dict[str, Any]] | None = None, *, random_state: int | None = None, target_range: Tuple[float, float] | None = None, as_dataset: bool = True, train_ratio: float = 0.8, name: str = 'multi_source_synthetic') -> SpectroDataset | MultiSourceResult

Convenience function for generating multi-source datasets.

Parameters:
  • n_samples – Number of samples.

  • sources – List of source configurations. If None, a single default NIR source with wavelength range (1000, 2500) is used.

  • random_state – Random seed.

  • target_range – Optional (min, max) range for the target values.

  • as_dataset – If True, returns a SpectroDataset; otherwise returns a MultiSourceResult.

  • train_ratio – Training set proportion.

  • name – Dataset name.

Returns:

SpectroDataset or MultiSourceResult depending on as_dataset.

Example

>>> dataset = generate_multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15}
...     ],
...     random_state=42
... )