nirs4all.data.synthetic.benchmarks module
Benchmark dataset utilities for synthetic data validation.
This module provides information about standard NIR benchmark datasets that can be used to validate synthetic data quality.
Phase 4 features:
- Benchmark dataset registry with metadata
- Dataset characteristic summaries
- Reference spectral properties
- Loader utilities for common formats
Note
This module provides metadata and loading utilities for benchmark datasets. The actual dataset files need to be obtained from their respective sources due to licensing restrictions.
References
- Corn (Cargill): M5spec competition dataset
- Tecator (meat): StatLib - meat protein/fat/moisture
- Shootout 2002: IDRC shootout pharmaceutical tablets
- Wheat: Hard red wheat kernels dataset
- class nirs4all.data.synthetic.benchmarks.BenchmarkDatasetInfo(name: str, full_name: str, domain: BenchmarkDomain, n_samples: int, n_wavelengths: int, wavelength_range: Tuple[float, float], targets: List[str], sample_type: str, measurement_mode: str, source_url: str, reference: str, license: str = 'Unknown', typical_snr: Tuple[float, float] = (50, 500), typical_peak_density: Tuple[float, float] = (1.0, 5.0), notes: str = '')
Bases: object
Metadata for a benchmark dataset.
- domain
Application domain.
- domain: BenchmarkDomain
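The constructor signature above maps naturally onto a dataclass. The following is a minimal sketch, not the library's actual source: it mirrors the documented fields and defaults so the shape of a registry entry is easy to see. `DatasetInfoSketch` and every value in the example instance are hypothetical placeholders, and `domain` is simplified to a plain string.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DatasetInfoSketch:
    """Hypothetical stand-in mirroring the documented BenchmarkDatasetInfo fields."""

    name: str
    full_name: str
    domain: str  # simplified: the documented field is a BenchmarkDomain enum
    n_samples: int
    n_wavelengths: int
    wavelength_range: Tuple[float, float]
    targets: List[str]
    sample_type: str
    measurement_mode: str
    source_url: str
    reference: str
    # Defaults copied from the documented signature.
    license: str = "Unknown"
    typical_snr: Tuple[float, float] = (50, 500)
    typical_peak_density: Tuple[float, float] = (1.0, 5.0)
    notes: str = ""


# Placeholder values for illustration only; they describe no real dataset.
demo = DatasetInfoSketch(
    name="demo",
    full_name="Demo placeholder dataset",
    domain="general",
    n_samples=10,
    n_wavelengths=100,
    wavelength_range=(1000.0, 2500.0),
    targets=["analyte_a"],
    sample_type="placeholder",
    measurement_mode="placeholder",
    source_url="",
    reference="",
)
print(demo.license, demo.typical_snr)
```

Only the eleven fields without defaults must be supplied; `license`, `typical_snr`, `typical_peak_density`, and `notes` fall back to the documented defaults.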
- class nirs4all.data.synthetic.benchmarks.BenchmarkDomain(value)
Domains for benchmark datasets.
- AGRICULTURE = 'agriculture'
- ENVIRONMENTAL = 'environmental'
- FOOD = 'food'
- GENERAL = 'general'
- PETROCHEMICAL = 'petrochemical'
- PHARMACEUTICAL = 'pharmaceutical'
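Functions in this module that take a domain accept either a string or a `BenchmarkDomain` member. One way such coercion can work is sketched below with a local copy of the documented members. `DomainSketch` and `coerce_domain` are hypothetical names used for illustration; how the library itself handles unknown strings is not documented here.

```python
from enum import Enum
from typing import Union


class DomainSketch(Enum):
    """Hypothetical copy of the documented BenchmarkDomain members."""

    AGRICULTURE = "agriculture"
    ENVIRONMENTAL = "environmental"
    FOOD = "food"
    GENERAL = "general"
    PETROCHEMICAL = "petrochemical"
    PHARMACEUTICAL = "pharmaceutical"


def coerce_domain(domain: Union[str, DomainSketch]) -> DomainSketch:
    """Accept either an enum member or its string value."""
    if isinstance(domain, DomainSketch):
        return domain
    # Enum lookup by value raises ValueError for unknown strings.
    return DomainSketch(domain)


print(coerce_domain("food"))
print(coerce_domain(DomainSketch.FOOD) is DomainSketch.FOOD)
```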
- class nirs4all.data.synthetic.benchmarks.LoadedBenchmarkDataset(info: BenchmarkDatasetInfo, X: ndarray, y: ndarray, wavelengths: ndarray, sample_ids: ndarray | None = None, metadata: Dict[str, Any] = <factory>)
Bases: object
Container for a loaded benchmark dataset.
- info
Dataset metadata.
- X
Spectral data (n_samples, n_wavelengths).
- Type:
numpy.ndarray
- y
Target values (n_samples, n_targets) or (n_samples,).
- Type:
numpy.ndarray
- wavelengths
Wavelength array.
- Type:
numpy.ndarray
- sample_ids
Optional sample identifiers.
- Type:
numpy.ndarray | None
- info: BenchmarkDatasetInfo
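The attribute descriptions above imply a few shape relationships: `X` is 2-D, its second axis matches the wavelength array, and `y` shares the sample axis. A small consistency check along those lines could look like the sketch below. `check_shapes` is a hypothetical helper, not part of the module, and it works on plain shape tuples so it needs no NumPy import.

```python
from typing import Tuple


def check_shapes(
    x_shape: Tuple[int, ...],
    y_shape: Tuple[int, ...],
    n_wavelengths: int,
) -> None:
    """Raise ValueError if the shapes break the documented conventions."""
    if len(x_shape) != 2:
        raise ValueError(f"X must be 2-D, got shape {x_shape}")
    n_samples, n_features = x_shape
    if n_features != n_wavelengths:
        raise ValueError(
            f"X has {n_features} columns but there are {n_wavelengths} wavelengths"
        )
    # y may be (n_samples,) or (n_samples, n_targets).
    if len(y_shape) not in (1, 2) or y_shape[0] != n_samples:
        raise ValueError(f"y shape {y_shape} does not match {n_samples} samples")


# Placeholder shapes for illustration.
check_shapes((5, 3), (5, 2), 3)  # multi-target y
check_shapes((5, 3), (5,), 3)    # single-target y
```

With a loaded dataset, the call would presumably be `check_shapes(dataset.X.shape, dataset.y.shape, len(dataset.wavelengths))`.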
- nirs4all.data.synthetic.benchmarks.create_synthetic_matching_benchmark(benchmark_name: str, n_samples: int | None = None, random_state: int | None = None) → Tuple[ndarray, ndarray, ndarray]
Create synthetic data matching benchmark dataset properties.
- Parameters:
benchmark_name – Name of benchmark dataset to match.
n_samples – Number of samples (uses benchmark size if None).
random_state – Random state for reproducibility.
- Returns:
Tuple of (spectra, concentrations, component_spectra).
Example
>>> X, C, E = create_synthetic_matching_benchmark("corn", random_state=42)
>>> print(X.shape)
- nirs4all.data.synthetic.benchmarks.get_benchmark_info(name: str) → BenchmarkDatasetInfo
Get information about a benchmark dataset.
- Parameters:
name – Dataset name.
- Returns:
BenchmarkDatasetInfo for the dataset.
- Raises:
KeyError – If dataset not found.
Example
>>> info = get_benchmark_info("corn")
>>> print(info.summary())
- nirs4all.data.synthetic.benchmarks.get_benchmark_spectral_properties(name: str) → Dict[str, Any]
Get spectral properties to match when generating synthetic data.
- Parameters:
name – Benchmark dataset name.
- Returns:
Dictionary of properties suitable for synthetic generator.
Example
>>> props = get_benchmark_spectral_properties("corn")
>>> generator = SyntheticNIRSGenerator(**props)
- nirs4all.data.synthetic.benchmarks.get_datasets_by_domain(domain: str | BenchmarkDomain) → List[str]
Get benchmark datasets for a specific domain.
- Parameters:
domain – Domain name or enum.
- Returns:
List of dataset names in that domain.
Example
>>> pharma_datasets = get_datasets_by_domain("pharmaceutical")
>>> print(pharma_datasets)
- nirs4all.data.synthetic.benchmarks.list_benchmark_datasets() → List[str]
List all registered benchmark datasets.
- Returns:
List of dataset names.
Example
>>> datasets = list_benchmark_datasets()
>>> print(datasets)
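The three registry functions (`get_benchmark_info`, `get_datasets_by_domain`, `list_benchmark_datasets`) behave like lookups over a name-keyed mapping. The sketch below illustrates that pattern with a toy registry; it is an assumption about the general approach, not the module's implementation. `REGISTRY_SKETCH`, its entries, and the helper names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical registry: dataset name -> domain value. Entries are placeholders.
REGISTRY_SKETCH: Dict[str, str] = {
    "demo_grain": "agriculture",
    "demo_tablet": "pharmaceutical",
    "demo_meat": "food",
}


def list_names() -> List[str]:
    """Return every registered name in a stable order."""
    return sorted(REGISTRY_SKETCH)


def names_by_domain(domain: str) -> List[str]:
    """Return the names whose domain matches the given value."""
    return sorted(n for n, d in REGISTRY_SKETCH.items() if d == domain)


def lookup(name: str) -> str:
    """Return the domain for a name, raising KeyError with the valid names."""
    try:
        return REGISTRY_SKETCH[name]
    except KeyError:
        raise KeyError(
            f"Unknown dataset {name!r}; available: {list_names()}"
        ) from None


print(list_names())
print(names_by_domain("pharmaceutical"))
```

Listing the valid names in the `KeyError` message is a design choice of this sketch; it makes a mistyped dataset name easier to diagnose.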
- nirs4all.data.synthetic.benchmarks.load_benchmark_dataset(name: str, data_dir: str | Path | None = None, format: str = 'auto') → LoadedBenchmarkDataset
Load a benchmark dataset from disk.
- Parameters:
name – Dataset name from registry.
data_dir – Directory containing dataset files.
format – File format ("auto", "csv", "mat", "jdx").
- Returns:
LoadedBenchmarkDataset with data.
- Raises:
FileNotFoundError – If dataset files not found.
KeyError – If dataset name not in registry.
Example
>>> dataset = load_benchmark_dataset("corn", data_dir="./datasets/")
>>> print(dataset.X.shape, dataset.y.shape)
Note
Dataset files must be obtained separately from their sources. This function provides standardized loading once files are available.
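How `format="auto"` picks a loader is not spelled out above. A plausible approach is to infer the format from the file extension, as in the sketch below. `resolve_format` and the extension mapping are assumptions made for illustration; only the format names "auto", "csv", "mat", and "jdx" come from the documentation.

```python
from pathlib import Path
from typing import Union

# Assumed extension-to-format mapping, built from the documented format names.
_EXT_TO_FORMAT = {".csv": "csv", ".mat": "mat", ".jdx": "jdx"}


def resolve_format(path: Union[str, Path], format: str = "auto") -> str:
    """Return an explicit format name, inferring it from the suffix when 'auto'."""
    if format != "auto":
        if format not in _EXT_TO_FORMAT.values():
            raise ValueError(f"Unsupported format {format!r}")
        return format
    # Compare suffixes case-insensitively so 'DATA.CSV' is handled too.
    suffix = Path(path).suffix.lower()
    try:
        return _EXT_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer format from {str(path)!r}") from None


print(resolve_format("datasets/demo.CSV"))
print(resolve_format("datasets/demo.bin", format="mat"))
```

An explicit `format` takes precedence over the extension in this sketch, which lets a caller load a file whose suffix is missing or misleading.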