nirs4all.api.generate module

Top-level generate() API for synthetic NIRS data generation.

This module provides the primary entry points for generating synthetic NIRS datasets within nirs4all.

Example

>>> import nirs4all
>>>
>>> # Simple generation
>>> dataset = nirs4all.generate(n_samples=1000, random_state=42)
>>>
>>> # Convenience functions
>>> dataset = nirs4all.generate.regression(n_samples=500)
>>> dataset = nirs4all.generate.classification(n_samples=300, n_classes=3)
>>>
>>> # Builder access
>>> builder = nirs4all.generate.builder(n_samples=1000)

nirs4all.api.generate.builder(n_samples: int = 1000, random_state: int | None = None, name: str = 'synthetic_nirs') → SyntheticDatasetBuilder

Create a SyntheticDatasetBuilder for fine-grained control.

Use this when you need full control over all generation parameters via the fluent builder interface.

Parameters:

n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
name – Dataset name.

Returns:

SyntheticDatasetBuilder instance for method chaining.

Example

>>> import nirs4all
>>>
>>> dataset = (
...     nirs4all.generate.builder(n_samples=1000, random_state=42)
...     .with_features(
...         wavelength_range=(1000, 2500),
...         complexity="realistic",
...         components=["water", "protein", "lipid"]
...     )
...     .with_targets(
...         distribution="lognormal",
...         range=(5, 50),
...         component="protein"
...     )
...     .with_metadata(n_groups=3)
...     .with_partitions(train_ratio=0.8)
...     .with_batch_effects(n_batches=3)
...     .build()
... )
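
For readers unfamiliar with fluent interfaces, the chaining above works because each `with_*` method returns the builder itself. The following is a minimal sketch of that pattern only; `SketchBuilder` and its stored `config` dict are hypothetical stand-ins and do not reflect how SyntheticDatasetBuilder is implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchBuilder:
    """Hypothetical stand-in for a fluent builder; not the nirs4all class."""

    n_samples: int = 1000
    random_state: Optional[int] = None
    config: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def with_features(self, **options: Any) -> "SketchBuilder":
        # Record the options and return self so calls can be chained.
        self.config["features"] = options
        return self

    def with_targets(self, **options: Any) -> "SketchBuilder":
        self.config["targets"] = options
        return self

    def build(self) -> Dict[str, Any]:
        # A real builder would generate data here; this sketch only
        # returns the accumulated configuration.
        return {
            "n_samples": self.n_samples,
            "random_state": self.random_state,
            **self.config,
        }


spec = (
    SketchBuilder(n_samples=100, random_state=42)
    .with_features(wavelength_range=(1000, 2500), complexity="realistic")
    .with_targets(distribution="lognormal", range=(5, 50))
    .build()
)
```

Because nothing is generated until `build()` is called, steps can be added or omitted independently of one another.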

nirs4all.api.generate.classification(n_samples: int = 1000, *, n_classes: int = 2, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', class_separation: float = 1.0, class_weights: List[float] | None = None, train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'synthetic_classification') → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]

Generate a synthetic NIRS dataset for classification tasks.

This convenience function creates datasets with discrete class labels, suitable for classification experiments.

Parameters:

n_samples – Number of samples to generate.
n_classes – Number of classes (2 for binary, >2 for multiclass).
random_state – Random seed for reproducibility.
complexity – Complexity level (‘simple’, ‘realistic’, ‘complex’).
class_separation – Separation factor between classes. Higher values make classes more distinguishable.
class_weights – Optional class proportions for imbalanced datasets. Should sum to 1.0.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
name – Dataset name.

Returns:

If as_dataset=True: SpectroDataset ready for pipeline use.
If as_dataset=False: tuple of (X, y) NumPy arrays, where y holds integer labels.

Example

>>> import nirs4all
>>>
>>> # Binary classification
>>> dataset = nirs4all.generate.classification(n_samples=500, n_classes=2)
>>>
>>> # Multiclass with imbalanced classes
>>> dataset = nirs4all.generate.classification(
...     n_samples=1000,
...     n_classes=3,
...     class_weights=[0.5, 0.3, 0.2],
...     random_state=42
... )
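
To make the `class_weights` constraint concrete, here is a minimal sketch of how class proportions could be checked and turned into per-class sample counts. The helper `class_counts` is hypothetical, and the largest-remainder rounding is an assumption; the library may allocate samples differently.

```python
import math
from typing import List, Optional


def class_counts(
    n_samples: int,
    n_classes: int,
    class_weights: Optional[List[float]] = None,
) -> List[int]:
    """Split n_samples across classes according to class_weights."""
    if class_weights is None:
        # Balanced classes when no weights are given.
        class_weights = [1.0 / n_classes] * n_classes
    if len(class_weights) != n_classes:
        raise ValueError("class_weights needs one entry per class")
    if not math.isclose(sum(class_weights), 1.0, abs_tol=1e-9):
        raise ValueError("class_weights should sum to 1.0")
    counts = [int(n_samples * w) for w in class_weights]
    # Hand out samples lost to truncation, largest fractional part first.
    remainders = [n_samples * w - c for w, c in zip(class_weights, counts)]
    order = sorted(range(n_classes), key=lambda i: remainders[i], reverse=True)
    for i in order[: n_samples - sum(counts)]:
        counts[i] += 1
    return counts


counts = class_counts(1000, 3, [0.5, 0.3, 0.2])
```

Checking the weights up front gives a clear error before any data is generated.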

nirs4all.api.generate.from_template(template: str | np.ndarray | 'SpectroDataset', n_samples: int = 1000, *, random_state: int | None = None, wavelengths: np.ndarray | None = None, as_dataset: bool = True) → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]

Generate synthetic data mimicking a real dataset template.

Analyzes the template data and generates synthetic spectra with similar statistical and spectral properties.

Parameters:

template – Real data to mimic. Can be:
    - Path to a dataset folder (str).
    - NumPy array of shape (n_samples, n_wavelengths).
    - SpectroDataset object.
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
wavelengths – Wavelength grid (required if template is an array).
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).

Returns:

Synthetic dataset or arrays with properties similar to template.

Example

>>> import nirs4all
>>>
>>> # From a dataset path
>>> dataset = nirs4all.generate.from_template(
...     "sample_data/regression",
...     n_samples=1000
... )
>>>
>>> # From numpy array
>>> dataset = nirs4all.generate.from_template(
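...     X_real,                    # real spectra, shape (n_samples, n_wavelengths)
...     n_samples=500,
...     wavelengths=wavelengths    # 1-D wavelength grid, one entry per column of X_real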
... )
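
The idea of generating spectra with similar statistical properties can be illustrated with a deliberately simplified sketch: estimate the mean and standard deviation at each wavelength, then draw new values from independent Gaussians. The helper `mimic_template` is hypothetical, and this is an assumption-laden toy, not the analysis `from_template` performs. In particular, it ignores the correlations between neighboring wavelengths that real spectra show.

```python
import random
import statistics
from typing import List, Optional, Sequence


def mimic_template(
    template: Sequence[Sequence[float]],
    n_samples: int,
    random_state: Optional[int] = None,
) -> List[List[float]]:
    """Draw synthetic rows matching the per-wavelength mean and spread."""
    rng = random.Random(random_state)
    columns = list(zip(*template))  # one tuple of values per wavelength
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(m, s) for m, s in zip(means, stds)]
        for _ in range(n_samples)
    ]


# Tiny made-up template: 3 spectra measured at 3 wavelengths.
X_real = [
    [0.10, 0.20, 0.30],
    [0.12, 0.22, 0.28],
    [0.11, 0.19, 0.31],
]
synthetic = mimic_template(X_real, n_samples=5, random_state=0)
```

Passing the same `random_state` reproduces the same synthetic rows, which mirrors the role of the seed parameter documented above.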

nirs4all.api.generate.generate(n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', wavelength_range: Tuple[float, float] | None = None, components: List[str] | None = None, target_range: Tuple[float, float] | None = None, train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'synthetic_nirs', **kwargs: Any) → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]

Generate a synthetic NIRS dataset.

This is the primary function for creating synthetic spectroscopic data. It provides a simple interface for common use cases while allowing full customization through keyword arguments.

Parameters:

n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level affecting noise, scatter, etc. Options: ‘simple’ (fast, minimal noise), ‘realistic’ (typical NIR), ‘complex’ (challenging scenarios).
wavelength_range – Tuple of (start, end) wavelengths in nm. Defaults to (1000, 2500), which covers most of the NIR region.
components – List of predefined component names to use. Options: ‘water’, ‘protein’, ‘lipid’, ‘starch’, ‘cellulose’, ‘chlorophyll’, ‘oil’, ‘nitrogen_compound’.
target_range – Optional (min, max) range for scaling targets.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y) tuple.
name – Dataset name.
**kwargs – Additional arguments passed to SyntheticDatasetBuilder.

Returns:

If as_dataset=True: SpectroDataset ready for pipeline use.
If as_dataset=False: tuple of (X, y) NumPy arrays.

Example

>>> import nirs4all
>>>
>>> # Basic usage
>>> dataset = nirs4all.generate(n_samples=1000, random_state=42)
>>>
>>> # Quick arrays for prototyping
>>> X, y = nirs4all.generate(n_samples=500, as_dataset=False)
>>>
>>> # Realistic spectra
>>> dataset = nirs4all.generate(
...     n_samples=1000,
...     complexity="realistic",
...     components=["water", "protein", "lipid"],
...     target_range=(0, 100),
...     random_state=42
... )
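
When arguments come from user input or a config file, it can help to check them against the documented options before calling generate(). The sketch below is a hypothetical pre-flight check built only from the parameter descriptions above. `check_generate_args` is not part of nirs4all, and the accepted `train_ratio` interval is an assumption.

```python
from typing import List, Optional, Tuple

# Values listed in the parameter descriptions above.
COMPLEXITY_LEVELS = {"simple", "realistic", "complex"}
KNOWN_COMPONENTS = {
    "water", "protein", "lipid", "starch",
    "cellulose", "chlorophyll", "oil", "nitrogen_compound",
}


def check_generate_args(
    complexity: str = "simple",
    wavelength_range: Optional[Tuple[float, float]] = None,
    components: Optional[List[str]] = None,
    train_ratio: float = 0.8,
) -> List[str]:
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    if complexity not in COMPLEXITY_LEVELS:
        problems.append(f"unknown complexity: {complexity!r}")
    if wavelength_range is not None:
        start, end = wavelength_range
        if start >= end:
            problems.append("wavelength_range must be (start, end) with start < end")
    for name in components or []:
        if name not in KNOWN_COMPONENTS:
            problems.append(f"unknown component: {name!r}")
    # Assumption: a usable split keeps the ratio within (0, 1].
    if not 0.0 < train_ratio <= 1.0:
        problems.append("train_ratio must be in (0, 1]")
    return problems


problems = check_generate_args(
    complexity="realistic",
    wavelength_range=(2500, 1000),
    components=["water", "sugar"],
)
```

Here `problems` reports the reversed wavelength range and the unlisted component "sugar"; an empty list means the arguments match the documented options.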

See also

generate.regression: Convenience function for regression datasets.
generate.classification: Convenience function for classification datasets.
generate.builder: Access the full builder API.

nirs4all.api.generate.multi_source(n_samples: int = 1000, sources: List[Dict[str, Any]] = None, *, random_state: int | None = None, target_range: Tuple[float, float] | None = None, train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'multi_source_synthetic') → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]

Generate a synthetic multi-source NIRS dataset.

Multi-source datasets combine different types of data, such as multiple NIR spectral ranges or NIR spectra with auxiliary measurements.

Parameters:

n_samples – Number of samples to generate.
sources – List of source configurations. Each source is a dict with:
    - name: Unique source identifier (required).
    - type: Source type, one of “nir”, “vis”, “aux”, “markers” (default: “nir”).
    - wavelength_range: (start, end) for NIR sources.
    - n_features: Number of features for auxiliary sources.
    - complexity: Complexity level for NIR sources.
    - components: Component names for NIR sources.
random_state – Random seed for reproducibility.
target_range – Optional (min, max) for scaling targets.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
name – Dataset name.

Returns:

If as_dataset=True: SpectroDataset with multiple sources.
If as_dataset=False: tuple of (X, y), where X holds the concatenated features of all sources.

Example

>>> import nirs4all
>>>
>>> # NIR + markers
>>> dataset = nirs4all.generate.multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15}
...     ],
...     random_state=42
... )
>>>
>>> # Multiple NIR ranges
>>> dataset = nirs4all.generate.multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "VIS-NIR", "type": "nir", "wavelength_range": (400, 1100)},
...         {"name": "SWIR", "type": "nir", "wavelength_range": (1100, 2500)}
...     ]
... )
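
The rules for source dicts (a required, unique `name` and a `type` that defaults to "nir") can be sketched as a small normalization step. `normalize_sources` is a hypothetical helper written from the parameter description above; the library's own validation may differ.

```python
from typing import Any, Dict, List

# Source types named in the `sources` parameter description.
SOURCE_TYPES = {"nir", "vis", "aux", "markers"}


def normalize_sources(sources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fill in the default type and reject malformed source configs."""
    seen = set()
    normalized = []
    for source in sources:
        if "name" not in source:
            raise ValueError("each source needs a 'name'")
        if source["name"] in seen:
            raise ValueError(f"duplicate source name: {source['name']!r}")
        seen.add(source["name"])
        # Keys from the input override the default type.
        entry = {"type": "nir", **source}
        if entry["type"] not in SOURCE_TYPES:
            raise ValueError(f"unknown source type: {entry['type']!r}")
        normalized.append(entry)
    return normalized


sources = normalize_sources([
    {"name": "NIR", "wavelength_range": (1000, 2500)},
    {"name": "markers", "type": "aux", "n_features": 15},
])
```

The first source omits `type` and therefore falls back to "nir", matching the documented default.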

nirs4all.api.generate.regression(n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', target_range: Tuple[float, float] | None = None, target_component: str | int | None = None, distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'synthetic_regression') → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]

Generate a synthetic NIRS dataset for regression tasks.

This convenience function is optimized for regression scenarios, with sensible defaults for target distribution and scaling.

Parameters:

n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level (‘simple’, ‘realistic’, ‘complex’).
target_range – Target value range (min, max) for scaling.
target_component – Which component to use as target. If None, uses all components (multi-output regression).
distribution – Concentration distribution method (‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’).
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
name – Dataset name.

Returns:

If as_dataset=True: SpectroDataset ready for pipeline use.
If as_dataset=False: tuple of (X, y) NumPy arrays.

Example

>>> import nirs4all
>>>
>>> # Simple regression dataset
>>> dataset = nirs4all.generate.regression(n_samples=500)
>>>
>>> # Single target with scaling
>>> dataset = nirs4all.generate.regression(
...     n_samples=1000,
...     target_range=(0, 100),
...     target_component="protein",
...     random_state=42
... )
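
One plausible reading of `target_range` is a linear min-max rescaling of the raw target values onto (min, max). The sketch below shows that interpretation; `scale_to_range` is a hypothetical helper, and the assumption that scaling is linear min-max is not confirmed by the documentation above.

```python
from typing import List, Sequence, Tuple


def scale_to_range(
    values: Sequence[float],
    target_range: Tuple[float, float],
) -> List[float]:
    """Linearly map values so their min and max hit target_range."""
    low, high = target_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant input: place every value at the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]


# Made-up raw concentrations rescaled to a 0-100 target range.
scaled = scale_to_range([0.2, 0.5, 0.8], target_range=(0, 100))
```

Under this interpretation the smallest raw value maps to the lower bound and the largest to the upper bound, with everything else interpolated in between.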

nirs4all.api.generate.to_csv(path: str | 'Path', n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', wavelength_range: Tuple[float, float] | None = None, target_range: Tuple[float, float] | None = None) → Path

Generate synthetic data and export to a single CSV file.

Parameters:

path – Output file path.
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level (‘simple’, ‘realistic’, ‘complex’).
wavelength_range – Optional (start, end) wavelengths.
target_range – Optional (min, max) for target scaling.

Returns:

Path to created file.

Example

>>> import nirs4all
>>> path = nirs4all.generate.to_csv("data.csv", n_samples=500)

nirs4all.api.generate.to_folder(path: str | 'Path', n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', train_ratio: float = 0.8, format: Literal['standard', 'single', 'fragmented'] = 'standard', wavelength_range: Tuple[float, float] | None = None, components: List[str] | None = None, target_range: Tuple[float, float] | None = None) → Path

Generate synthetic data and export to a folder.

Creates a folder with CSV files compatible with nirs4all’s DatasetConfigs loader.

Parameters:

path – Output folder path.
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level (‘simple’, ‘realistic’, ‘complex’).
train_ratio – Train/test split ratio.
format – Export format (‘standard’, ‘single’, ‘fragmented’).
wavelength_range – Optional (start, end) wavelengths.
components – Optional list of component names.
target_range – Optional (min, max) for target scaling.

Returns:

Path to created folder.

Example

>>> import nirs4all
>>> path = nirs4all.generate.to_folder(
...     "data/synthetic",
...     n_samples=1000,
...     train_ratio=0.8,
...     random_state=42
... )
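
To illustrate what `train_ratio` and `random_state` control together, here is a minimal sketch of a seeded train/test split over sample indices. `split_indices` is a hypothetical helper; the shuffling and rounding choices are assumptions and may not match how nirs4all assigns partitions.

```python
import random
from typing import List, Optional, Tuple


def split_indices(
    n_samples: int,
    train_ratio: float = 0.8,
    random_state: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices with a seed and cut them at train_ratio."""
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_ratio)
    # Sort each partition so the output is easy to inspect.
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = split_indices(1000, train_ratio=0.8, random_state=42)
```

With 1000 samples and a ratio of 0.8, this yields 800 training and 200 test indices, and the same seed always reproduces the same split.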