nirs4all.synthesis.benchmarks module

Benchmark dataset utilities for synthetic data validation.

This module provides information about standard NIR benchmark datasets that can be used to validate synthetic data quality.

Phase 4 Features:
  • Benchmark dataset registry with metadata

  • Dataset characteristic summaries

  • Reference spectral properties

  • Loader utilities for common formats

Note

This module provides metadata and loading utilities for benchmark datasets. The actual dataset files need to be obtained from their respective sources due to licensing restrictions.

References

  • Corn (Cargill): M5spec competition dataset

  • Tecator (meat): StatLib - meat protein/fat/moisture

  • Shootout 2002: IDRC shootout pharmaceutical tablets

  • Wheat: Hard red wheat kernels dataset

class nirs4all.synthesis.benchmarks.BenchmarkDatasetInfo(name: str, full_name: str, domain: BenchmarkDomain, n_samples: int, n_wavelengths: int, wavelength_range: Tuple[float, float], targets: List[str], sample_type: str, measurement_mode: str, source_url: str, reference: str, license: str = 'Unknown', typical_snr: Tuple[float, float] = (50, 500), typical_peak_density: Tuple[float, float] = (1.0, 5.0), notes: str = '')

Bases: object

Metadata for a benchmark dataset.

name

Dataset name/identifier.

Type:

str

full_name

Full descriptive name.

Type:

str

domain

Application domain.

Type:

nirs4all.synthesis.benchmarks.BenchmarkDomain

n_samples

Number of samples (approximate if variable).

Type:

int

n_wavelengths

Number of wavelength points.

Type:

int

wavelength_range

(min, max) wavelength in nm.

Type:

Tuple[float, float]

targets

List of target variable names.

Type:

List[str]

sample_type

Description of sample type.

Type:

str

measurement_mode

Typical measurement mode.

Type:

str

source_url

URL to obtain the dataset.

Type:

str

reference

Publication or source reference.

Type:

str

license

License information.

Type:

str

typical_snr

Typical signal-to-noise ratio range.

Type:

Tuple[float, float]

typical_peak_density

Typical peaks per 100 nm.

Type:

Tuple[float, float]

notes

Additional notes.

Type:

str

domain: BenchmarkDomain
full_name: str
license: str = 'Unknown'
measurement_mode: str
n_samples: int
n_wavelengths: int
name: str
notes: str = ''
reference: str
sample_type: str
source_url: str
summary() → str

Return a human-readable summary.

targets: List[str]
typical_peak_density: Tuple[float, float] = (1.0, 5.0)
typical_snr: Tuple[float, float] = (50, 500)
wavelength_range: Tuple[float, float]
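
The field layout above can be sketched as a plain dataclass. This is a hypothetical stand-in (`DatasetInfoSketch`), not the library's class: it only mirrors the documented field names and defaults, and every value in the `demo` instance is a made-up placeholder rather than registry data. The `expected_peak_count` helper is also an assumption, added to show how a "peaks per 100 nm" range scales to the full wavelength span.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class BenchmarkDomain(str, Enum):
    """Stand-in for the documented enum (one member is enough here)."""
    GENERAL = "general"


@dataclass
class DatasetInfoSketch:
    """Hypothetical mirror of the documented fields and defaults."""
    name: str
    full_name: str
    domain: BenchmarkDomain
    n_samples: int
    n_wavelengths: int
    wavelength_range: Tuple[float, float]
    targets: List[str]
    sample_type: str
    measurement_mode: str
    source_url: str
    reference: str
    license: str = "Unknown"
    typical_snr: Tuple[float, float] = (50, 500)
    typical_peak_density: Tuple[float, float] = (1.0, 5.0)
    notes: str = ""

    def expected_peak_count(self) -> Tuple[float, float]:
        """Scale the peaks-per-100-nm range to the whole wavelength span."""
        lo_nm, hi_nm = self.wavelength_range
        span_in_100nm = (hi_nm - lo_nm) / 100.0
        lo_density, hi_density = self.typical_peak_density
        return (lo_density * span_in_100nm, hi_density * span_in_100nm)


# Placeholder values only; none of these come from the real registry.
demo = DatasetInfoSketch(
    name="demo",
    full_name="Made-up demonstration dataset",
    domain=BenchmarkDomain.GENERAL,
    n_samples=100,
    n_wavelengths=500,
    wavelength_range=(1000.0, 2000.0),
    targets=["analyte"],
    sample_type="placeholder",
    measurement_mode="placeholder",
    source_url="",
    reference="",
)
peak_range = demo.expected_peak_count()
```

With the default density of 1.0 to 5.0 peaks per 100 nm, a 1000 nm span corresponds to roughly 10 to 50 peaks.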

class nirs4all.synthesis.benchmarks.BenchmarkDomain(value)

Bases: str, Enum

Domains for benchmark datasets.

AGRICULTURE = 'agriculture'
ENVIRONMENTAL = 'environmental'
FOOD = 'food'
GENERAL = 'general'
PETROCHEMICAL = 'petrochemical'
PHARMACEUTICAL = 'pharmaceutical'
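
Because the enum derives from both str and Enum, members can be looked up by value and compared directly with plain strings. The sketch below re-declares the enum locally from the member list above so that it runs without the library installed; in real code you would import it from nirs4all.synthesis.benchmarks instead.

```python
from enum import Enum


class BenchmarkDomain(str, Enum):
    """Local copy of the documented members, for illustration only."""
    AGRICULTURE = "agriculture"
    ENVIRONMENTAL = "environmental"
    FOOD = "food"
    GENERAL = "general"
    PETROCHEMICAL = "petrochemical"
    PHARMACEUTICAL = "pharmaceutical"


# Lookup by value returns the existing member, not a new object.
food = BenchmarkDomain("food")
same_member = food is BenchmarkDomain.FOOD

# The str mixin makes members compare equal to their string value.
equals_plain_string = BenchmarkDomain.FOOD == "food"

# Iteration follows definition order.
all_values = [d.value for d in BenchmarkDomain]

# Values outside the enum are rejected with ValueError.
try:
    BenchmarkDomain("textiles")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```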

class nirs4all.synthesis.benchmarks.LoadedBenchmarkDataset(info: BenchmarkDatasetInfo, X: ndarray, y: ndarray, wavelengths: ndarray, sample_ids: ndarray | None = None, metadata: Dict[str, Any] = <factory>)

Bases: object

Container for a loaded benchmark dataset.

info

Dataset metadata.

Type:

nirs4all.synthesis.benchmarks.BenchmarkDatasetInfo

X

Spectral data (n_samples, n_wavelengths).

Type:

numpy.ndarray

y

Target values (n_samples, n_targets) or (n_samples,).

Type:

numpy.ndarray

wavelengths

Wavelength array.

Type:

numpy.ndarray

sample_ids

Optional sample identifiers.

Type:

numpy.ndarray | None

metadata

Optional additional metadata.

Type:

Dict[str, Any]

X: ndarray
info: BenchmarkDatasetInfo
metadata: Dict[str, Any]
sample_ids: ndarray | None = None
wavelengths: ndarray
y: ndarray

nirs4all.synthesis.benchmarks.create_synthetic_matching_benchmark(benchmark_name: str, n_samples: int | None = None, random_state: int | None = None) → Tuple[ndarray, ndarray, ndarray]

Create synthetic data matching benchmark dataset properties.

Parameters:
  • benchmark_name – Name of benchmark dataset to match.

  • n_samples – Number of samples (uses benchmark size if None).

  • random_state – Random state for reproducibility.

Returns:

Tuple of (spectra, concentrations, component_spectra).

Example

>>> X, C, E = create_synthetic_matching_benchmark("corn", random_state=42)
>>> print(X.shape)

nirs4all.synthesis.benchmarks.get_benchmark_info(name: str) → BenchmarkDatasetInfo

Get information about a benchmark dataset.

Parameters:

name – Dataset name.

Returns:

BenchmarkDatasetInfo for the dataset.

Raises:

KeyError – If the dataset is not found.

Example

>>> info = get_benchmark_info("corn")
>>> print(info.summary())
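
Since an unknown name raises KeyError, callers may want to catch it and suggest a close match. The following is a minimal sketch of that pattern under stated assumptions: `_REGISTRY`, `get_info_sketch`, and `lookup_with_hint` are hypothetical stand-ins, the registry holds placeholder strings instead of BenchmarkDatasetInfo objects, and "tecator" as a registry key is a guess based on the References list.

```python
import difflib
from typing import Dict

# Hypothetical stand-in registry with placeholder values.
_REGISTRY: Dict[str, str] = {
    "corn": "placeholder info for corn",
    "tecator": "placeholder info for tecator",
}


def get_info_sketch(name: str) -> str:
    """Mimic the documented contract: return info or raise KeyError."""
    return _REGISTRY[name]


def lookup_with_hint(name: str) -> str:
    """Wrap the lookup and add a 'did you mean' hint to the KeyError."""
    try:
        return get_info_sketch(name)
    except KeyError:
        close = difflib.get_close_matches(name, list(_REGISTRY), n=1)
        hint = f" Did you mean {close[0]!r}?" if close else ""
        raise KeyError(f"Unknown benchmark dataset {name!r}.{hint}") from None
```

With the real library, the same wrapper could call get_benchmark_info and draw its candidates from list_benchmark_datasets().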

nirs4all.synthesis.benchmarks.get_benchmark_spectral_properties(name: str) → Dict[str, Any]

Get spectral properties to match when generating synthetic data.

Parameters:

name – Benchmark dataset name.

Returns:

Dictionary of properties suitable for synthetic generator.

Example

>>> props = get_benchmark_spectral_properties("corn")
>>> generator = SyntheticNIRSGenerator(**props)

nirs4all.synthesis.benchmarks.get_datasets_by_domain(domain: str | BenchmarkDomain) → List[str]

Get benchmark datasets for a specific domain.

Parameters:

domain – Domain name or enum.

Returns:

List of dataset names in that domain.

Example

>>> pharma_datasets = get_datasets_by_domain("pharmaceutical")
>>> print(pharma_datasets)
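
One way a function can accept either a string or an enum member is to normalise the argument through the enum constructor before filtering. The sketch below illustrates that idea only; it is not the library's implementation. The `_DOMAINS` mapping, the dataset keys other than "corn", and the domain assigned to each dataset are assumptions inferred from the References list.

```python
from enum import Enum
from typing import Dict, List, Union


class BenchmarkDomain(str, Enum):
    """Local subset of the documented enum, for illustration only."""
    AGRICULTURE = "agriculture"
    FOOD = "food"
    PHARMACEUTICAL = "pharmaceutical"


# Hypothetical name -> domain mapping; the real registry may differ.
_DOMAINS: Dict[str, BenchmarkDomain] = {
    "corn": BenchmarkDomain.AGRICULTURE,
    "tecator": BenchmarkDomain.FOOD,
    "shootout2002": BenchmarkDomain.PHARMACEUTICAL,
    "wheat": BenchmarkDomain.AGRICULTURE,
}


def datasets_by_domain_sketch(domain: Union[str, BenchmarkDomain]) -> List[str]:
    """Return dataset names whose domain matches, in registry order."""
    # The constructor accepts both "food" and BenchmarkDomain.FOOD,
    # and raises ValueError for anything that is not a known domain.
    wanted = BenchmarkDomain(domain)
    return [name for name, d in _DOMAINS.items() if d == wanted]


by_string = datasets_by_domain_sketch("agriculture")
by_enum = datasets_by_domain_sketch(BenchmarkDomain.AGRICULTURE)
```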

nirs4all.synthesis.benchmarks.list_benchmark_datasets() → List[str]

List all registered benchmark datasets.

Returns:

List of dataset names.

Example

>>> datasets = list_benchmark_datasets()
>>> print(datasets)

nirs4all.synthesis.benchmarks.load_benchmark_dataset(name: str, data_dir: str | Path | None = None, format: str = 'auto') → LoadedBenchmarkDataset

Load a benchmark dataset from disk.

Parameters:
  • name – Dataset name from registry.

  • data_dir – Directory containing dataset files.

  • format – File format: one of "auto" (the default), "csv", "mat", or "jdx".

Returns:

LoadedBenchmarkDataset with data.

Example

>>> dataset = load_benchmark_dataset("corn", data_dir="./datasets/")
>>> print(dataset.X.shape, dataset.y.shape)

Note

Dataset files must be obtained separately from their sources. This function provides standardized loading once files are available.
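
The documentation does not say how format="auto" is resolved. A plausible approach, sketched here purely as an assumption, is to infer the format from the file extension. `resolve_format_sketch`, the extension mapping, and the file names are all hypothetical and are not part of the library.

```python
from pathlib import Path
from typing import Union

# Assumed mapping from file extension to the documented format names.
_EXTENSION_TO_FORMAT = {".csv": "csv", ".mat": "mat", ".jdx": "jdx"}


def resolve_format_sketch(path: Union[str, Path], format: str = "auto") -> str:
    """Return an explicit format name, inferring it when format is 'auto'."""
    if format != "auto":
        # An explicit format wins, as long as it is one of the known names.
        if format not in _EXTENSION_TO_FORMAT.values():
            raise ValueError(f"Unsupported format {format!r}")
        return format
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"Cannot infer format from extension {suffix!r}"
        ) from None


inferred = resolve_format_sketch("datasets/corn.MAT")
explicit = resolve_format_sketch("datasets/corn.txt", format="csv")
```

Passing an explicit format is the safer choice when file names do not carry a conventional extension.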