nirs4all.data.synthetic.benchmarks module

Benchmark dataset utilities for synthetic data validation.

This module describes standard NIR benchmark datasets that can serve as references when validating the quality of synthetic data.

Phase 4 Features:

Benchmark dataset registry with metadata
Dataset characteristic summaries
Reference spectral properties
Loader utilities for common formats

Note

This module provides metadata and loading utilities only. The dataset files themselves are not distributed with it because of licensing restrictions and must be obtained from their respective sources.

References

Corn (Cargill): M5spec competition dataset
Tecator (meat): StatLib - meat protein/fat/moisture
Shootout 2002: IDRC shootout pharmaceutical tablets
Wheat: Hard red wheat kernels dataset

class nirs4all.data.synthetic.benchmarks.BenchmarkDatasetInfo(name: str, full_name: str, domain: BenchmarkDomain, n_samples: int, n_wavelengths: int, wavelength_range: Tuple[float, float], targets: List[str], sample_type: str, measurement_mode: str, source_url: str, reference: str, license: str = 'Unknown', typical_snr: Tuple[float, float] = (50, 500), typical_peak_density: Tuple[float, float] = (1.0, 5.0), notes: str = '')

Bases: object

Metadata for a benchmark dataset.

name

Dataset name/identifier.

Type:: str

full_name

Full descriptive name.

Type:: str

domain

Application domain.

Type:: nirs4all.data.synthetic.benchmarks.BenchmarkDomain

n_samples

Number of samples (approximate if variable).

Type:: int

n_wavelengths

Number of wavelength points.

Type:: int

wavelength_range

(min, max) wavelength in nm.

Type:: Tuple[float, float]

targets

List of target variable names.

Type:: List[str]

sample_type

Description of sample type.

Type:: str

measurement_mode

Typical measurement mode.

Type:: str

source_url

URL to obtain the dataset.

Type:: str

reference

Publication or source reference.

Type:: str

license

License information.

Type:: str

typical_snr

Typical signal-to-noise ratio range.

Type:: Tuple[float, float]

typical_peak_density

Typical peaks per 100 nm.

Type:: Tuple[float, float]

notes

Additional notes.

Type:: str

domain: BenchmarkDomain

full_name: str

license: str = 'Unknown'

measurement_mode: str

n_samples: int

n_wavelengths: int

name: str

notes: str = ''

reference: str

sample_type: str

source_url: str

summary() → str: Return a human-readable summary.

targets: List[str]

typical_peak_density: Tuple[float, float] = (1.0, 5.0)

typical_snr: Tuple[float, float] = (50, 500)

wavelength_range: Tuple[float, float]
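
The field list above maps naturally onto a dataclass with required fields followed by defaulted ones. The following is a minimal stand-in sketch, not the library's implementation: `DatasetInfoSketch`, the layout of its `summary()` string, and all example values are hypothetical and only illustrate how required fields and the documented defaults combine.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DatasetInfoSketch:
    """Hypothetical stand-in mirroring a subset of the documented fields."""

    name: str
    n_samples: int
    n_wavelengths: int
    wavelength_range: Tuple[float, float]
    targets: List[str]
    license: str = "Unknown"                      # default documented above
    typical_snr: Tuple[float, float] = (50, 500)  # default documented above

    def summary(self) -> str:
        # The real summary() format is not documented; this layout is assumed.
        lo, hi = self.wavelength_range
        return (
            f"{self.name}: {self.n_samples} samples x "
            f"{self.n_wavelengths} wavelengths, {lo:g}-{hi:g} nm, "
            f"targets={', '.join(self.targets)}, license={self.license}"
        )


# Placeholder values for illustration only, not taken from the registry.
info = DatasetInfoSketch(
    name="demo",
    n_samples=10,
    n_wavelengths=100,
    wavelength_range=(1000.0, 2000.0),
    targets=["a", "b"],
)
print(info.summary())
```

Fields without defaults must be supplied at construction time, while `license` and `typical_snr` fall back to the documented defaults when omitted.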

class nirs4all.data.synthetic.benchmarks.BenchmarkDomain(value)

Bases: str, Enum

Domains for benchmark datasets.

AGRICULTURE = 'agriculture'

ENVIRONMENTAL = 'environmental'

FOOD = 'food'

GENERAL = 'general'

PETROCHEMICAL = 'petrochemical'

PHARMACEUTICAL = 'pharmaceutical'
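
Because the enum derives from both `str` and `Enum`, its members behave like plain strings in comparisons and can be looked up by value. The sketch below demonstrates this standard Python behaviour on `DomainSketch`, a hypothetical stand-in that copies the members listed above so the example runs without the library installed.

```python
from enum import Enum


class DomainSketch(str, Enum):
    """Hypothetical stand-in with the same members as BenchmarkDomain."""

    AGRICULTURE = "agriculture"
    ENVIRONMENTAL = "environmental"
    FOOD = "food"
    GENERAL = "general"
    PETROCHEMICAL = "petrochemical"
    PHARMACEUTICAL = "pharmaceutical"


# Lookup by value returns the existing member.
food = DomainSketch("food")
print(food is DomainSketch.FOOD)

# Because the enum also subclasses str, members compare equal to plain strings.
print(DomainSketch.FOOD == "food")

# An unknown value raises ValueError.
try:
    DomainSketch("textiles")
except ValueError as exc:
    print(f"rejected: {exc}")
```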

class nirs4all.data.synthetic.benchmarks.LoadedBenchmarkDataset(info: BenchmarkDatasetInfo, X: ndarray, y: ndarray, wavelengths: ndarray, sample_ids: ndarray | None = None, metadata: Dict[str, Any] = <factory>)

Bases: object

Container for a loaded benchmark dataset.

info

Dataset metadata.

Type:: nirs4all.data.synthetic.benchmarks.BenchmarkDatasetInfo

X

Spectral data (n_samples, n_wavelengths).

Type:: numpy.ndarray

y

Target values (n_samples, n_targets) or (n_samples,).

Type:: numpy.ndarray

wavelengths

Wavelength array.

Type:: numpy.ndarray

sample_ids

Optional sample identifiers.

Type:: numpy.ndarray | None

metadata

Optional additional metadata.

Type:: Dict[str, Any]

X: ndarray

info: BenchmarkDatasetInfo

metadata: Dict[str, Any]

sample_ids: ndarray | None = None

wavelengths: ndarray

y: ndarray

nirs4all.data.synthetic.benchmarks.create_synthetic_matching_benchmark(benchmark_name: str, n_samples: int | None = None, random_state: int | None = None) → Tuple[ndarray, ndarray, ndarray]

Create synthetic data matching benchmark dataset properties.

Parameters:

benchmark_name – Name of benchmark dataset to match.
n_samples – Number of samples (uses benchmark size if None).
random_state – Random state for reproducibility.

Returns:

Tuple of (spectra, concentrations, component_spectra).
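
A triple of spectra, concentrations, and component spectra is commonly related by a linear mixing (Beer-Lambert) model, in which each spectrum is a concentration-weighted sum of component spectra. Whether and how this function applies that model is an assumption here. The pure-Python sketch below uses a hypothetical `mix_spectra` helper only to show how the shapes of the three arrays fit together and how a seed gives reproducible output.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def mix_spectra(
    n_samples: int,
    n_components: int,
    n_wavelengths: int,
    random_state: Optional[int] = None,
) -> Tuple[Matrix, Matrix, Matrix]:
    """Hypothetical helper: X = C @ E with random non-negative C and E."""
    rng = random.Random(random_state)
    # C: one row of component concentrations per sample.
    C = [[rng.random() for _ in range(n_components)] for _ in range(n_samples)]
    # E: one pure-component spectrum per row.
    E = [[rng.random() for _ in range(n_wavelengths)] for _ in range(n_components)]
    # X[i][j] = sum over k of C[i][k] * E[k][j]
    X = [
        [sum(C[i][k] * E[k][j] for k in range(n_components))
         for j in range(n_wavelengths)]
        for i in range(n_samples)
    ]
    return X, C, E


X, C, E = mix_spectra(n_samples=5, n_components=3, n_wavelengths=8, random_state=42)
print(len(X), len(X[0]))  # samples x wavelengths
print(len(C), len(C[0]))  # samples x components
print(len(E), len(E[0]))  # components x wavelengths
```

Calling the helper twice with the same `random_state` yields identical arrays, which is the behaviour the `random_state` parameter is documented to provide.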

Example

>>> from nirs4all.data.synthetic.benchmarks import create_synthetic_matching_benchmark
>>> X, C, E = create_synthetic_matching_benchmark("corn", random_state=42)
>>> print(X.shape)

nirs4all.data.synthetic.benchmarks.get_benchmark_info(name: str) → BenchmarkDatasetInfo

Get information about a benchmark dataset.

Parameters:: name – Dataset name.
Returns:: BenchmarkDatasetInfo for the dataset.
Raises:: KeyError – If dataset not found.
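
A caller that accepts dataset names from user input may want to turn the `KeyError` into a more helpful message. The sketch below shows one way to do that with `difflib` from the standard library. The `REGISTRY` dictionary, its keys, and the `lookup` helper are hypothetical stand-ins; the module's real registry is not documented here.

```python
import difflib
from typing import Dict

# Hypothetical registry; names and values are placeholders.
REGISTRY: Dict[str, str] = {
    "corn": "placeholder description",
    "tecator": "placeholder description",
}


def lookup(name: str) -> str:
    """Return the registry entry, raising KeyError with a hint if missing."""
    try:
        return REGISTRY[name]
    except KeyError:
        # Suggest the closest registered name, if any is similar enough.
        close = difflib.get_close_matches(name, list(REGISTRY), n=1)
        hint = f" Did you mean {close[0]!r}?" if close else ""
        raise KeyError(f"Unknown benchmark dataset {name!r}.{hint}") from None


try:
    lookup("corm")
except KeyError as exc:
    message = exc.args[0]
    print(message)
```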

Example

>>> from nirs4all.data.synthetic.benchmarks import get_benchmark_info
>>> info = get_benchmark_info("corn")
>>> print(info.summary())

nirs4all.data.synthetic.benchmarks.get_benchmark_spectral_properties(name: str) → Dict[str, Any]

Get spectral properties to match when generating synthetic data.

Parameters:: name – Benchmark dataset name.
Returns:: Dictionary of properties suitable for synthetic generator.

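Example

>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator  # import path assumed
>>> from nirs4all.data.synthetic.benchmarks import get_benchmark_spectral_properties
>>> props = get_benchmark_spectral_properties("corn")
>>> generator = SyntheticNIRSGenerator(**props)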

nirs4all.data.synthetic.benchmarks.get_datasets_by_domain(domain: str | BenchmarkDomain) → List[str]

Get benchmark datasets for a specific domain.

Parameters:: domain – Domain name or enum.
Returns:: List of dataset names in that domain.
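
Accepting either a string or an enum member usually comes down to normalizing the argument to the enum first and then filtering. The sketch below illustrates that pattern under stated assumptions: `DomainSketch`, the `DOMAINS` mapping, its placeholder dataset names, and `datasets_by_domain` are all hypothetical, and the real function may work differently.

```python
from enum import Enum
from typing import Dict, List, Union


class DomainSketch(str, Enum):
    """Hypothetical stand-in for two of the documented domains."""

    FOOD = "food"
    PHARMACEUTICAL = "pharmaceutical"


# Hypothetical name -> domain mapping; entries are placeholders.
DOMAINS: Dict[str, DomainSketch] = {
    "dataset_a": DomainSketch.FOOD,
    "dataset_b": DomainSketch.PHARMACEUTICAL,
    "dataset_c": DomainSketch.FOOD,
}


def datasets_by_domain(domain: Union[str, DomainSketch]) -> List[str]:
    """Normalize the argument to an enum member, then filter the mapping."""
    wanted = DomainSketch(domain)  # accepts both "food" and DomainSketch.FOOD
    return sorted(name for name, d in DOMAINS.items() if d is wanted)


print(datasets_by_domain("food"))
print(datasets_by_domain(DomainSketch.PHARMACEUTICAL))
```

In this sketch an unrecognized domain string raises `ValueError` during normalization; how the real function treats unknown domains is not documented here.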

Example

>>> from nirs4all.data.synthetic.benchmarks import get_datasets_by_domain
>>> pharma_datasets = get_datasets_by_domain("pharmaceutical")
>>> print(pharma_datasets)

nirs4all.data.synthetic.benchmarks.list_benchmark_datasets() → List[str]

List all registered benchmark datasets.

Returns:: List of dataset names.

Example

>>> from nirs4all.data.synthetic.benchmarks import list_benchmark_datasets
>>> datasets = list_benchmark_datasets()
>>> print(datasets)

nirs4all.data.synthetic.benchmarks.load_benchmark_dataset(name: str, data_dir: str | Path | None = None, format: str = 'auto') → LoadedBenchmarkDataset

Load a benchmark dataset from disk.

Parameters:

name – Dataset name from registry.
data_dir – Directory containing dataset files.
format – File format ("auto", "csv", "mat", "jdx").

Returns:

LoadedBenchmarkDataset with data.

Raises:

FileNotFoundError – If dataset files not found.
KeyError – If dataset name not in registry.

Example

>>> from nirs4all.data.synthetic.benchmarks import load_benchmark_dataset
>>> dataset = load_benchmark_dataset("corn", data_dir="./datasets/")
>>> print(dataset.X.shape, dataset.y.shape)

Note

Dataset files must be obtained separately from their sources. This function provides standardized loading once files are available.
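
How the "auto" format is resolved is not specified above. One plausible approach is to infer it from the file extension. The sketch below assumes exactly that: `resolve_format`, the `EXTENSION_FORMATS` mapping, and the example file names are hypothetical and only illustrate the idea for the documented format names.

```python
from pathlib import Path
from typing import Union

# Assumed extension-to-format mapping for the documented format names.
EXTENSION_FORMATS = {".csv": "csv", ".mat": "mat", ".jdx": "jdx"}


def resolve_format(path: Union[str, Path], format: str = "auto") -> str:
    """Hypothetical helper: pick a loader format from the file extension."""
    if format != "auto":
        # An explicit format wins, as long as it is one of the known names.
        if format not in EXTENSION_FORMATS.values():
            raise ValueError(f"Unsupported format: {format!r}")
        return format
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer format from {str(path)!r}") from None


print(resolve_format("datasets/corn.mat"))
print(resolve_format("datasets/corn.CSV"))
print(resolve_format("datasets/corn.txt", format="csv"))
```

Passing an explicit format skips the extension check, which is useful when a file carries a non-standard suffix.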