nirs4all.synthesis.config module

Configuration dataclasses for synthetic NIRS data generation.

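This module provides structured configuration objects that control each stage of synthetic spectra generation: spectral features, targets, sample metadata, train/test partitioning, batch effects, non-linear and confounded target relationships, multi-regime landscapes, and output format.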

class nirs4all.synthesis.config.BatchEffectConfig(enabled: bool = False, n_batches: int = 3, offset_std: float = 0.02, gain_std: float = 0.03)

Bases: object

Configuration for batch/session effects simulation.

enabled

Whether to add batch effects.

Type:

bool

n_batches

Number of measurement batches/sessions.

Type:

int

offset_std

Standard deviation of batch offset.

Type:

float

gain_std

Standard deviation of batch gain multiplier.

Type:

float

enabled: bool = False
gain_std: float = 0.03
n_batches: int = 3
offset_std: float = 0.02
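The following is a minimal sketch of how per-batch offsets and gains of this kind could be applied to spectra. It is not the library's implementation: the stand-in class `BatchEffectSketch`, the helper `apply_batch_effects`, the round-robin batch assignment, and the model `x * gain + offset` are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class BatchEffectSketch:
    """Local stand-in mirroring the documented fields (not the library class)."""
    enabled: bool = False
    n_batches: int = 3
    offset_std: float = 0.02
    gain_std: float = 0.03


def apply_batch_effects(spectra, cfg, seed=0):
    """Return (new_spectra, batch_ids) under the assumed model x * gain + offset."""
    if not cfg.enabled:
        # Effects disabled: copy the spectra unchanged, all in batch 0.
        return [row[:] for row in spectra], [0] * len(spectra)
    rng = random.Random(seed)
    # Draw one additive offset and one multiplicative gain per batch.
    offsets = [rng.gauss(0.0, cfg.offset_std) for _ in range(cfg.n_batches)]
    gains = [rng.gauss(1.0, cfg.gain_std) for _ in range(cfg.n_batches)]
    # Assumption: samples are assigned to batches round-robin.
    batch_ids = [i % cfg.n_batches for i in range(len(spectra))]
    shifted = [
        [x * gains[b] + offsets[b] for x in row]
        for row, b in zip(spectra, batch_ids)
    ]
    return shifted, batch_ids
```

Samples that share a batch receive the same offset and gain, so identical input spectra in the same batch stay identical after the effect is applied.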
class nirs4all.synthesis.config.ConfounderConfig(signal_to_confound_ratio: float = 1.0, n_confounders: int = 0, spectral_masking: float = 0.0, temporal_drift: bool = False)

Bases: object

Configuration for spectral-target decoupling and confounding effects.

Introduces factors that make the target only partially predictable from spectral features, simulating real-world irreducible error.

signal_to_confound_ratio

Proportion of target variance explainable from spectra. 1.0 = fully predictable, 0.5 = 50% unexplainable.

Type:

float

n_confounders

Number of confounding variables that affect both spectra and target in different ways.

Type:

int

spectral_masking

Fraction of predictive signal hidden in high-noise wavelength regions (0.0-0.5).

Type:

float

temporal_drift

If True, the target-spectra relationship gradually changes across samples.

Type:

bool

n_confounders: int = 0
signal_to_confound_ratio: float = 1.0
spectral_masking: float = 0.0
temporal_drift: bool = False
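One way to read signal_to_confound_ratio is as the share of target variance that comes from the spectra-derived signal. The sketch below illustrates that reading with a hypothetical helper, `mix_target`, which blends a unit-variance signal with independent Gaussian noise. How the library actually combines the two is not documented here, so the weighting scheme is an assumption.

```python
import math
import random


def mix_target(signal, ratio, seed=0):
    """Blend a spectra-derived signal with independent noise.

    Assumes `signal` has roughly unit variance, so that about `ratio` of the
    variance of the result is explained by the signal.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = random.Random(seed)
    w_signal = math.sqrt(ratio)        # weight of the predictable part
    w_noise = math.sqrt(1.0 - ratio)   # weight of the unexplainable part
    return [w_signal * s + w_noise * rng.gauss(0.0, 1.0) for s in signal]
```

With `ratio=1.0` the target equals the signal; with `ratio=0.5` the squared correlation between signal and target is close to 0.5 for large samples.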
class nirs4all.synthesis.config.FeatureConfig(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', n_components: int | None = None, component_names: List[str] | None = None)

Bases: object

Configuration for spectral feature generation.

wavelength_start

Start wavelength in nm.

Type:

float

wavelength_end

End wavelength in nm.

Type:

float

wavelength_step

Wavelength step in nm.

Type:

float

complexity

Complexity level affecting noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.

Type:

Literal['simple', 'realistic', 'complex']

n_components

Number of spectral components (auto if None).

Type:

int | None

component_names

Specific predefined components to use. If None, uses default components based on complexity.

Type:

List[str] | None

complexity: Literal['simple', 'realistic', 'complex'] = 'simple'
component_names: List[str] | None = None
n_components: int | None = None
wavelength_end: float = 2500.0
wavelength_start: float = 1000.0
wavelength_step: float = 2.0
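The three wavelength fields together define a sampling grid. The sketch below shows how such a grid could be built from them. The helper `wavelength_grid` is hypothetical, and treating wavelength_end as inclusive is an assumption; the library may handle the endpoint differently.

```python
def wavelength_grid(start=1000.0, end=2500.0, step=2.0):
    """Evenly spaced wavelengths in nm; including `end` is an assumption."""
    if step <= 0 or end <= start:
        raise ValueError("need step > 0 and end > start")
    # The small epsilon guards against float round-off in the division.
    n_points = int((end - start) / step + 1e-9) + 1
    return [start + i * step for i in range(n_points)]


grid = wavelength_grid()
print(len(grid), grid[0], grid[-1])
```

Under the inclusive-endpoint assumption, the default values give 751 points running from 1000.0 to 2500.0 nm.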
class nirs4all.synthesis.config.MetadataConfig(generate_sample_ids: bool = True, sample_id_prefix: str = 'sample', n_groups: int | None = None, n_repetitions: int | Tuple[int, int] = 1, group_names: List[str] | None = None, additional_columns: Dict[str, Any] | None = None)

Bases: object

Configuration for sample metadata generation.

generate_sample_ids

Whether to generate sample IDs.

Type:

bool

sample_id_prefix

Prefix for sample IDs.

Type:

str

n_groups

Number of sample groups (e.g., biological replicates).

Type:

int | None

n_repetitions

Repetitions per sample: either a fixed int or a (min, max) range.

Type:

int | Tuple[int, int]

group_names

Optional list of group names.

Type:

List[str] | None

additional_columns

Mapping of column name to a generator function or to values.

Type:

Dict[str, Any] | None

additional_columns: Dict[str, Any] | None = None
generate_sample_ids: bool = True
group_names: List[str] | None = None
n_groups: int | None = None
n_repetitions: int | Tuple[int, int] = 1
sample_id_prefix: str = 'sample'
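To illustrate how sample_id_prefix and n_repetitions could interact, here is a small sketch that expands samples into one metadata row per measurement. The helper `build_sample_rows`, the ID format `<prefix>_<zero-padded index>`, and the inclusive sampling of the (min, max) range are assumptions, not documented library behaviour.

```python
import random


def build_sample_rows(n_samples, prefix="sample", n_repetitions=1, seed=0):
    """Return one metadata row per measurement (sample x repetition)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_samples):
        if isinstance(n_repetitions, tuple):
            low, high = n_repetitions
            reps = rng.randint(low, high)  # inclusive on both ends
        else:
            reps = n_repetitions
        for rep in range(reps):
            # Assumed ID format; repetitions of a sample share its ID.
            rows.append({"sample_id": f"{prefix}_{i:04d}", "repetition": rep})
    return rows
```

Because repetitions share a sample ID, a later group-aware split can keep all measurements of a sample on the same side.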
class nirs4all.synthesis.config.MultiRegimeConfig(n_regimes: int = 1, regime_method: Literal['concentration', 'spectral', 'random'] = 'concentration', regime_overlap: float = 0.2, noise_heteroscedasticity: float = 0.0)

Bases: object

Configuration for multi-regime target landscapes.

Creates regions in feature space where the target-spectra relationship differs, simulating subpopulations.

n_regimes

Number of different relationship regimes.

Type:

int

regime_method

How to partition samples into regimes: ‘concentration’, ‘spectral’, or ‘random’.

Type:

Literal['concentration', 'spectral', 'random']

regime_overlap

Overlap between regimes creating transition zones. 0 = hard boundaries, 0.5 = smooth transitions.

Type:

float

noise_heteroscedasticity

How much prediction noise varies by regime. 0 = same noise everywhere, 1 = very different noise levels.

Type:

float

n_regimes: int = 1
noise_heteroscedasticity: float = 0.0
regime_method: Literal['concentration', 'spectral', 'random'] = 'concentration'
regime_overlap: float = 0.2
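The effect of regime_overlap can be pictured as soft membership weights around regime boundaries. The sketch below is one possible interpretation, not the library's algorithm: it assumes a concentration value normalized to [0, 1], equal-width regimes, and linear transition zones whose half-width is `overlap` times the regime width. The names `_ramp` and `regime_weights` are hypothetical.

```python
def _ramp(v, center, half):
    """Rise linearly from 0 to 1 across [center - half, center + half]."""
    if half == 0.0:
        return 1.0 if v >= center else 0.0  # hard boundary
    return min(1.0, max(0.0, (v - (center - half)) / (2.0 * half)))


def regime_weights(value, n_regimes=2, overlap=0.2):
    """Membership weight of `value` in each regime; weights sum to 1."""
    if n_regimes < 1 or not 0.0 <= overlap <= 0.5:
        raise ValueError("need n_regimes >= 1 and 0 <= overlap <= 0.5")
    width = 1.0 / n_regimes
    half = overlap * width  # half-width of each transition zone
    weights = []
    for k in range(n_regimes):
        # The first regime has no lower boundary, the last no upper one.
        rise = 1.0 if k == 0 else _ramp(value, k * width, half)
        fall = 1.0 if k == n_regimes - 1 else 1.0 - _ramp(value, (k + 1) * width, half)
        weights.append(min(rise, fall))
    return weights
```

With `overlap=0.0` the weights are one-hot (hard boundaries); at `overlap=0.5` the transition zones of neighbouring boundaries meet, so membership changes smoothly everywhere.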
class nirs4all.synthesis.config.NonLinearConfig(interactions: Literal['none', 'polynomial', 'synergistic', 'antagonistic'] = 'none', interaction_strength: float = 0.5, hidden_factors: int = 0, polynomial_degree: int = 2)

Bases: object

Configuration for non-linear target relationships.

Enables polynomial, synergistic, or antagonistic interactions between component concentrations and targets, making prediction harder.

interactions

Type of non-linear interaction. Options: ‘none’, ‘polynomial’, ‘synergistic’, ‘antagonistic’.

Type:

Literal['none', 'polynomial', 'synergistic', 'antagonistic']

interaction_strength

Blend factor (0 = linear, 1 = fully non-linear).

Type:

float

hidden_factors

Number of latent variables affecting target but not spectra.

Type:

int

polynomial_degree

Degree for polynomial interactions (2 or 3).

Type:

int

hidden_factors: int = 0
interaction_strength: float = 0.5
interactions: Literal['none', 'polynomial', 'synergistic', 'antagonistic'] = 'none'
polynomial_degree: int = 2
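The description of interaction_strength as a blend factor suggests a convex combination of a linear and a non-linear target. The sketch below shows that idea for the polynomial case. The helper `blend_target` and the specific non-linear term (the linear combination raised to `degree`) are assumptions; the library's polynomial interaction may be defined differently.

```python
def blend_target(concentrations, weights, strength=0.5, degree=2):
    """Return (1 - strength) * linear + strength * polynomial."""
    if degree not in (2, 3):
        raise ValueError("degree must be 2 or 3")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    # Linear target: weighted sum of component concentrations.
    linear = sum(w * c for w, c in zip(weights, concentrations))
    # Assumed form of the polynomial interaction term.
    nonlinear = linear ** degree
    return (1.0 - strength) * linear + strength * nonlinear
```

Setting `strength=0.0` recovers the purely linear target, and `strength=1.0` uses only the non-linear term.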
class nirs4all.synthesis.config.OutputConfig(as_dataset: bool = True, include_metadata: bool = False, include_wavelengths: bool = True)

Bases: object

Configuration for output format.

as_dataset

Whether to return a SpectroDataset instead of a tuple.

Type:

bool

include_metadata

Whether to include generation metadata.

Type:

bool

include_wavelengths

Whether to include wavelength array in output.

Type:

bool

as_dataset: bool = True
include_metadata: bool = False
include_wavelengths: bool = True
class nirs4all.synthesis.config.PartitionConfig(train_ratio: float = 0.8, stratify: bool = False, shuffle: bool = True, group_aware: bool = True)

Bases: object

Configuration for data partitioning (train/test split).

train_ratio

Proportion of samples for training (0.0-1.0).

Type:

float

stratify

Whether to stratify by target (for classification).

Type:

bool

shuffle

Whether to shuffle before splitting.

Type:

bool

group_aware

Whether to keep groups together when splitting.

Type:

bool

group_aware: bool = True
shuffle: bool = True
stratify: bool = False
train_ratio: float = 0.8
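A group-aware split keeps every member of a group on the same side of the train/test boundary. The sketch below illustrates the idea with a hypothetical helper, `group_aware_split`. It applies train_ratio to the number of groups rather than the number of samples, which is an assumption and may differ from what the library does.

```python
import random


def group_aware_split(groups, train_ratio=0.8, shuffle=True, seed=0):
    """Return (train_idx, test_idx) with every group kept on one side."""
    unique = sorted(set(groups))
    if shuffle:
        random.Random(seed).shuffle(unique)
    # Assumption: the ratio is applied to groups, with at least one in train.
    n_train_groups = max(1, round(train_ratio * len(unique)))
    train_groups = set(unique[:n_train_groups])
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    test_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, test_idx
```

When groups have unequal sizes, the resulting sample-level ratio only approximates `train_ratio`.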
class nirs4all.synthesis.config.SyntheticDatasetConfig(n_samples: int = 1000, random_state: int | None = None, features: FeatureConfig = <factory>, targets: TargetConfig = <factory>, metadata: MetadataConfig = <factory>, partitions: PartitionConfig = <factory>, batch_effects: BatchEffectConfig = <factory>, nonlinear: NonLinearConfig = <factory>, confounders: ConfounderConfig = <factory>, multi_regime: MultiRegimeConfig = <factory>, output: OutputConfig = <factory>, name: str = 'synthetic_nirs')

Bases: object

Complete configuration for synthetic dataset generation.

This is the main configuration object that combines all sub-configurations for generating synthetic NIRS datasets.

n_samples

Total number of samples to generate.

Type:

int

random_state

Random seed for reproducibility.

Type:

int | None

features

Feature generation configuration.

Type:

nirs4all.synthesis.config.FeatureConfig

targets

Target variable configuration.

Type:

nirs4all.synthesis.config.TargetConfig

metadata

Sample metadata configuration.

Type:

nirs4all.synthesis.config.MetadataConfig

partitions

Train/test split configuration.

Type:

nirs4all.synthesis.config.PartitionConfig

batch_effects

Batch effect configuration.

Type:

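nirs4all.synthesis.config.BatchEffectConfig

nonlinear

Non-linear target relationship configuration.

Type:

nirs4all.synthesis.config.NonLinearConfig

confounders

Confounding effect configuration.

Type:

nirs4all.synthesis.config.ConfounderConfig

multi_regime

Multi-regime target landscape configuration.

Type:

nirs4all.synthesis.config.MultiRegimeConfig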

output

Output format configuration.

Type:

nirs4all.synthesis.config.OutputConfig

name

Dataset name (defaults to 'synthetic_nirs').

Type:

str

Example

>>> from nirs4all.synthesis.config import FeatureConfig, SyntheticDatasetConfig, TargetConfig
>>> config = SyntheticDatasetConfig(
...     n_samples=1000,
...     random_state=42,
...     features=FeatureConfig(complexity="realistic"),
...     targets=TargetConfig(distribution="lognormal", range=(0, 100)),
... )
__post_init__() → None

Validate configuration after initialization.

batch_effects: BatchEffectConfig
confounders: ConfounderConfig
features: FeatureConfig
metadata: MetadataConfig
multi_regime: MultiRegimeConfig
n_samples: int = 1000
name: str = 'synthetic_nirs'
nonlinear: NonLinearConfig
output: OutputConfig
partitions: PartitionConfig
random_state: int | None = None
targets: TargetConfig
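The `<factory>` defaults in the signature indicate dataclass fields created with a default factory, so each configuration gets its own sub-config instances. The sketch below shows that pattern together with a `__post_init__` check, using local stand-in classes (`PartitionSketch`, `DatasetSketch`). The validation rules are hypothetical; the checks the library performs are not documented here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartitionSketch:
    """Stand-in for a sub-configuration (not the library class)."""
    train_ratio: float = 0.8


@dataclass
class DatasetSketch:
    """Stand-in for a top-level configuration with nested defaults."""
    n_samples: int = 1000
    random_state: Optional[int] = None
    # default_factory builds a fresh sub-config for every instance.
    partitions: PartitionSketch = field(default_factory=PartitionSketch)
    name: str = "synthetic_nirs"

    def __post_init__(self) -> None:
        # Hypothetical checks, shown only to illustrate the hook.
        if self.n_samples < 1:
            raise ValueError("n_samples must be positive")
        if not 0.0 < self.partitions.train_ratio < 1.0:
            raise ValueError("train_ratio must lie strictly between 0 and 1")
```

Because the sub-configs are created per instance, mutating `partitions` on one configuration does not affect another.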
class nirs4all.synthesis.config.TargetConfig(distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', range: Tuple[float, float] | None = None, n_targets: int | None = None, component_indices: List[int] | None = None, transform: Literal['log', 'sqrt'] | None = None)

Bases: object

Configuration for target variable generation.

distribution

Target value distribution method. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.

Type:

Literal['dirichlet', 'uniform', 'lognormal', 'correlated']

range

Optional (min, max) range for scaling targets.

Type:

Tuple[float, float] | None

n_targets

Number of target variables (auto from components if None).

Type:

int | None

component_indices

Which components to use as targets (all if None).

Type:

List[int] | None

transform

Optional transformation to apply (‘log’, ‘sqrt’, None).

Type:

Literal['log', 'sqrt'] | None

component_indices: List[int] | None = None
distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet'
n_targets: int | None = None
range: Tuple[float, float] | None = None
transform: Literal['log', 'sqrt'] | None = None
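The range and transform options can be illustrated with two small helpers. This is a sketch under stated assumptions rather than the library's code: `scale_to_range` uses min-max rescaling, and `apply_transform` uses `log1p` for 'log' so that zero values are tolerated. Both helper names and both choices are hypothetical.

```python
import math


def scale_to_range(values, value_range):
    """Min-max rescale `values` into (min, max); constant input maps to min."""
    low, high = value_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        return [float(low) for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


def apply_transform(values, transform=None):
    """Apply 'log' (as log1p, an assumption) or 'sqrt'; None leaves values as-is."""
    if transform is None:
        return list(values)
    if transform == "log":
        return [math.log1p(v) for v in values]
    if transform == "sqrt":
        return [math.sqrt(v) for v in values]
    raise ValueError(f"unknown transform: {transform!r}")
```

For example, `scale_to_range([1.0, 2.0, 3.0], (0, 100))` maps the smallest value to 0 and the largest to 100.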