nirs4all.api.generate module
Top-level generate() API for synthetic NIRS data generation.
This module provides the primary entry points for generating synthetic NIRS datasets within nirs4all.
Example
>>> import nirs4all
>>>
>>> # Simple generation
>>> dataset = nirs4all.generate(n_samples=1000, random_state=42)
>>>
>>> # Convenience functions
>>> dataset = nirs4all.generate.regression(n_samples=500)
>>> dataset = nirs4all.generate.classification(n_samples=300, n_classes=3)
>>>
>>> # Builder access
>>> builder = nirs4all.generate.builder(n_samples=1000)
- nirs4all.api.generate.builder(n_samples: int = 1000, random_state: int | None = None, name: str = 'synthetic_nirs') → SyntheticDatasetBuilder
Create a SyntheticDatasetBuilder for fine-grained control.
Use this when you need full control over all generation parameters via the fluent builder interface.
- Parameters:
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
name – Dataset name.
- Returns:
SyntheticDatasetBuilder instance for method chaining.
Example
>>> import nirs4all
>>>
>>> dataset = (
...     nirs4all.generate.builder(n_samples=1000, random_state=42)
...     .with_features(
...         wavelength_range=(1000, 2500),
...         complexity="realistic",
...         components=["water", "protein", "lipid"]
...     )
...     .with_targets(
...         distribution="lognormal",
...         range=(5, 50),
...         component="protein"
...     )
...     .with_metadata(n_groups=3)
...     .with_partitions(train_ratio=0.8)
...     .with_batch_effects(n_batches=3)
...     .build()
... )
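A fluent builder interface such as the one described for builder() generally works by having each configuration method return the builder itself, so that calls can be chained and finished with build(). The following is a minimal, library-independent sketch of that pattern. `TinyBuilder` and everything in it are hypothetical names used for illustration; this is not the implementation of SyntheticDatasetBuilder.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TinyBuilder:
    """Hypothetical stand-in illustrating method chaining; not nirs4all code."""

    n_samples: int = 1000
    config: Dict[str, Any] = field(default_factory=dict)

    def with_features(self, **options: Any) -> "TinyBuilder":
        # Store the options and return self so the next call can be chained.
        self.config["features"] = options
        return self

    def with_partitions(self, train_ratio: float = 0.8) -> "TinyBuilder":
        if not 0.0 < train_ratio < 1.0:
            raise ValueError("train_ratio must be strictly between 0 and 1")
        self.config["partitions"] = {"train_ratio": train_ratio}
        return self

    def build(self) -> Dict[str, Any]:
        # A real builder would generate data here; this sketch only returns
        # the accumulated configuration.
        return {"n_samples": self.n_samples, **self.config}


spec = (
    TinyBuilder(n_samples=100)
    .with_features(wavelength_range=(1000, 2500), complexity="realistic")
    .with_partitions(train_ratio=0.8)
    .build()
)
```

Because every `with_*` method returns the same object, the order of the chained calls only matters if a later call overwrites an earlier setting.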
- nirs4all.api.generate.classification(n_samples: int = 1000, *, n_classes: int = 2, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', class_separation: float = 1.0, class_weights: List[float] | None = None, train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'synthetic_classification') → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]
Generate a synthetic NIRS dataset for classification tasks.
This convenience function creates datasets with discrete class labels, suitable for classification experiments.
- Parameters:
n_samples – Number of samples to generate.
n_classes – Number of classes (2 for binary, >2 for multiclass).
random_state – Random seed for reproducibility.
complexity – Complexity level (‘simple’, ‘realistic’, ‘complex’).
class_separation – Separation factor between classes. Higher values make classes more distinguishable.
class_weights – Optional class proportions for imbalanced datasets. Should sum to 1.0.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
name – Dataset name.
- Returns:
If as_dataset=True: SpectroDataset ready for pipeline use.
If as_dataset=False: Tuple of (X, y) numpy arrays where y is integer labels.
Example
>>> import nirs4all
>>>
>>> # Binary classification
>>> dataset = nirs4all.generate.classification(n_samples=500, n_classes=2)
>>>
>>> # Multiclass with imbalanced classes
>>> dataset = nirs4all.generate.classification(
...     n_samples=1000,
...     n_classes=3,
...     class_weights=[0.5, 0.3, 0.2],
...     random_state=42
... )
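The class_weights parameter is documented as a list of class proportions that should sum to 1.0. How nirs4all turns those proportions into per-class sample counts is not stated here, so the sketch below shows one common approach (largest-remainder rounding) under that assumption. `class_counts` is a hypothetical helper, not a nirs4all function.

```python
from typing import List, Optional


def class_counts(n_samples: int, n_classes: int,
                 class_weights: Optional[List[float]] = None) -> List[int]:
    """Hypothetical helper: split n_samples into per-class counts."""
    if class_weights is None:
        # Balanced classes when no weights are given.
        class_weights = [1.0 / n_classes] * n_classes
    if len(class_weights) != n_classes:
        raise ValueError("class_weights must have one entry per class")
    if abs(sum(class_weights) - 1.0) > 1e-8:
        raise ValueError("class_weights should sum to 1.0")

    # Floor each share, then give the leftover samples to the classes
    # with the largest fractional remainders (largest-remainder rounding).
    exact = [w * n_samples for w in class_weights]
    counts = [int(e) for e in exact]
    leftover = n_samples - sum(counts)
    by_remainder = sorted(
        range(n_classes), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


counts = class_counts(1000, 3, [0.5, 0.3, 0.2])  # 500, 300 and 200 samples
```

Checking the weights before calling the generator is a cheap way to catch a list whose length does not match n_classes or whose entries do not sum to 1.0.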
- nirs4all.api.generate.from_template(template: str | np.ndarray | 'SpectroDataset', n_samples: int = 1000, *, random_state: int | None = None, wavelengths: np.ndarray | None = None, as_dataset: bool = True) → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]
Generate synthetic data mimicking a real dataset template.
Analyzes the template data and generates synthetic spectra with similar statistical and spectral properties.
- Parameters:
template – Real data to mimic. Can be:
  - Path to dataset folder (str).
  - Numpy array (n_samples, n_wavelengths).
  - SpectroDataset object.
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
wavelengths – Wavelength grid (required if template is array).
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
- Returns:
Synthetic dataset or arrays with properties similar to template.
Example
>>> import nirs4all
>>>
>>> # From a dataset path
>>> dataset = nirs4all.generate.from_template(
...     "sample_data/regression",
...     n_samples=1000
... )
>>>
>>> # From numpy array
>>> dataset = nirs4all.generate.from_template(
...     X_real,
...     n_samples=500,
...     wavelengths=wavelengths
... )
- nirs4all.api.generate.generate(n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', wavelength_range: Tuple[float, float] | None = None, components: List[str] | None = None, target_range: Tuple[float, float] | None = None, train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'synthetic_nirs', **kwargs: Any) → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]
Generate a synthetic NIRS dataset.
This is the primary function for creating synthetic spectroscopic data. It provides a simple interface for common use cases while allowing full customization through keyword arguments.
- Parameters:
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level affecting noise, scatter, etc. Options: ‘simple’ (fast, minimal noise), ‘realistic’ (typical NIR), ‘complex’ (challenging scenarios).
wavelength_range – Tuple of (start, end) wavelengths in nm. Defaults to (1000, 2500), which covers most of the NIR region.
components – List of predefined component names to use. Options: ‘water’, ‘protein’, ‘lipid’, ‘starch’, ‘cellulose’, ‘chlorophyll’, ‘oil’, ‘nitrogen_compound’.
target_range – Optional (min, max) range for scaling targets.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y) tuple.
name – Dataset name.
**kwargs – Additional arguments passed to SyntheticDatasetBuilder.
- Returns:
If as_dataset=True: SpectroDataset ready for pipeline use.
If as_dataset=False: Tuple of (X, y) numpy arrays.
Example
>>> import nirs4all
>>>
>>> # Basic usage
>>> dataset = nirs4all.generate(n_samples=1000, random_state=42)
>>>
>>> # Quick arrays for prototyping
>>> X, y = nirs4all.generate(n_samples=500, as_dataset=False)
>>>
>>> # Realistic spectra
>>> dataset = nirs4all.generate(
...     n_samples=1000,
...     complexity="realistic",
...     components=["water", "protein", "lipid"],
...     target_range=(0, 100),
...     random_state=42
... )
See also
generate.regression: Convenience function for regression datasets.
generate.classification: Convenience function for classification datasets.
generate.builder: Access the full builder API.
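The train_ratio and random_state parameters of generate() describe a seeded train/test partition. This page does not say whether nirs4all shuffles the samples or how it rounds the training count, so the sketch below is only one plausible reading: shuffle the indices with a seeded generator and cut at the requested ratio. `split_indices` is a hypothetical helper, not part of the library.

```python
import random
from typing import List, Optional, Tuple


def split_indices(n_samples: int, train_ratio: float = 0.8,
                  random_state: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shuffle sample indices and cut at train_ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    # A local RNG keeps the split reproducible without touching global state.
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(1000, train_ratio=0.8, random_state=42)
```

With a fixed random_state, repeated calls return the same partition, which is the property the "Random seed for reproducibility" wording refers to.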
- nirs4all.api.generate.multi_source(n_samples: int = 1000, sources: List[Dict[str, Any]] = None, *, random_state: int | None = None, target_range: Tuple[float, float] | None = None, train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'multi_source_synthetic') → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]
Generate a synthetic multi-source NIRS dataset.
Multi-source datasets combine different types of data, such as multiple NIR spectral ranges or NIR spectra with auxiliary measurements.
- Parameters:
n_samples – Number of samples to generate.
sources – List of source configurations. Each source is a dict with:
  - name: Unique source identifier (required).
  - type: Source type, one of “nir”, “vis”, “aux”, “markers” (default: “nir”).
  - wavelength_range: (start, end) for NIR sources.
  - n_features: Number of features for auxiliary sources.
  - complexity: Complexity level for NIR sources.
  - components: Component names for NIR sources.
random_state – Random seed for reproducibility.
target_range – Optional (min, max) for scaling targets.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
name – Dataset name.
- Returns:
If as_dataset=True: SpectroDataset with multiple sources.
If as_dataset=False: Tuple of (X, y) where X is concatenated features.
Example
>>> import nirs4all
>>>
>>> # NIR + markers
>>> dataset = nirs4all.generate.multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15}
...     ],
...     random_state=42
... )
>>>
>>> # Multiple NIR ranges
>>> dataset = nirs4all.generate.multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "VIS-NIR", "type": "nir", "wavelength_range": (400, 1100)},
...         {"name": "SWIR", "type": "nir", "wavelength_range": (1100, 2500)}
...     ]
... )
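The sources argument of multi_source() is a list of plain dicts, so mistakes such as a missing name or a misspelled type are easy to make. The sketch below shows a pre-check built only from the rules listed in the parameter description (required unique name, type defaulting to "nir", four allowed type values). `normalize_sources` and `ALLOWED_TYPES` are hypothetical names; whether nirs4all performs an equivalent validation internally is not stated here.

```python
from typing import Any, Dict, List

# Source types listed in the parameter description.
ALLOWED_TYPES = {"nir", "vis", "aux", "markers"}


def normalize_sources(sources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical pre-check for a `sources` list; not part of nirs4all."""
    seen = set()
    normalized = []
    for source in sources:
        if "name" not in source:
            raise ValueError("each source needs a 'name'")
        if source["name"] in seen:
            raise ValueError(f"duplicate source name: {source['name']!r}")
        seen.add(source["name"])
        # Apply the documented default type without mutating the input dict.
        entry = {"type": "nir", **source}
        if entry["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown source type: {entry['type']!r}")
        normalized.append(entry)
    return normalized


sources = normalize_sources([
    {"name": "NIR", "wavelength_range": (1000, 2500)},
    {"name": "markers", "type": "aux", "n_features": 15},
])
```

The returned list can then be passed as the sources argument, with every entry carrying an explicit type.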
- nirs4all.api.generate.regression(n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', target_range: Tuple[float, float] | None = None, target_component: str | int | None = None, distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', train_ratio: float = 0.8, as_dataset: bool = True, name: str = 'synthetic_regression') → 'SpectroDataset' | Tuple[np.ndarray, np.ndarray]
Generate a synthetic NIRS dataset for regression tasks.
This convenience function is optimized for regression scenarios, with sensible defaults for target distribution and scaling.
- Parameters:
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level (‘simple’, ‘realistic’, ‘complex’).
target_range – Target value range (min, max) for scaling.
target_component – Which component to use as target. If None, uses all components (multi-output regression).
distribution – Concentration distribution method.
train_ratio – Proportion of samples for training partition.
as_dataset – If True, returns SpectroDataset. If False, returns (X, y).
name – Dataset name.
- Returns:
If as_dataset=True: SpectroDataset ready for pipeline use.
If as_dataset=False: Tuple of (X, y) numpy arrays.
Example
>>> import nirs4all
>>>
>>> # Simple regression dataset
>>> dataset = nirs4all.generate.regression(n_samples=500)
>>>
>>> # Single target with scaling
>>> dataset = nirs4all.generate.regression(
...     n_samples=1000,
...     target_range=(0, 100),
...     target_component="protein",
...     random_state=42
... )
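The target_range parameter is described only as a (min, max) range for scaling targets. Assuming this means a linear min-max rescaling (an assumption, since the exact transform is not documented on this page), the idea can be sketched as follows. `scale_to_range` is a hypothetical helper, not a nirs4all function.

```python
from typing import List, Tuple


def scale_to_range(values: List[float],
                   target_range: Tuple[float, float]) -> List[float]:
    """Hypothetical min-max rescaling of targets into [low, high]."""
    low, high = target_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant input: map everything to the midpoint of the range.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    # The smallest value maps to `low`, the largest to `high`.
    return [low + (v - v_min) / span * (high - low) for v in values]


# Concentration-like values in [0, 1] rescaled to roughly 0, 50 and 100.
scaled = scale_to_range([0.2, 0.5, 0.8], target_range=(0, 100))
```

Under this reading, the relative spacing of the targets is preserved and only their units change.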
- nirs4all.api.generate.to_csv(path: str | 'Path', n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', wavelength_range: Tuple[float, float] | None = None, target_range: Tuple[float, float] | None = None) → Path
Generate synthetic data and export to a single CSV file.
- Parameters:
path – Output file path.
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level.
wavelength_range – Optional (start, end) wavelengths.
target_range – Optional (min, max) for target scaling.
- Returns:
Path to created file.
Example
>>> import nirs4all
>>> path = nirs4all.generate.to_csv("data.csv", n_samples=500)
- nirs4all.api.generate.to_folder(path: str | 'Path', n_samples: int = 1000, *, random_state: int | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', train_ratio: float = 0.8, format: Literal['standard', 'single', 'fragmented'] = 'standard', wavelength_range: Tuple[float, float] | None = None, components: List[str] | None = None, target_range: Tuple[float, float] | None = None) → Path
Generate synthetic data and export to a folder.
Creates a folder with CSV files compatible with nirs4all’s DatasetConfigs loader.
- Parameters:
path – Output folder path.
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
complexity – Complexity level.
train_ratio – Train/test split ratio.
format – Export format (‘standard’, ‘single’, ‘fragmented’).
wavelength_range – Optional (start, end) wavelengths.
components – Optional list of component names.
target_range – Optional (min, max) for target scaling.
- Returns:
Path to created folder.
Example
>>> import nirs4all
>>> path = nirs4all.generate.to_folder(
...     "data/synthetic",
...     n_samples=1000,
...     train_ratio=0.8,
...     random_state=42
... )