nirs4all.data.synthetic.validation module

Validation utilities for synthetic data generation.

This module provides functions to validate generated synthetic data for correctness and expected properties, including spectral realism scoring for comparing synthetic data against real NIRS spectra.

Phase 4 Features:
  • Spectral realism scorecard with quantitative metrics

  • Correlation length analysis

  • Derivative statistics comparison

  • Peak density analysis

  • Baseline curvature metrics

  • SNR distribution analysis

  • Adversarial validation (classifier distinguishability)

References

  • Engel, J., et al. (2013). Breaking with trends in pre-processing? TrAC Trends in Analytical Chemistry, 50, 96-106.

  • Rinnan, Å., et al. (2009). Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10), 1201-1222.

class nirs4all.data.synthetic.validation.DatasetComparisonResult(dataset_name: str, n_real_samples: int, n_synthetic_samples: int, realism_score: SpectralRealismScore, tstr_r2: float | None = None, trts_r2: float | None = None)

Bases: object

Result of comparing synthetic data against a benchmark dataset.

dataset_name

Name of the benchmark dataset.

Type:

str

n_real_samples

Number of samples in real dataset.

Type:

int

n_synthetic_samples

Number of synthetic samples used.

Type:

int

realism_score

The spectral realism score.

Type:

nirs4all.data.synthetic.validation.SpectralRealismScore

tstr_r2

Train-on-Synthetic, Test-on-Real R² (if applicable).

Type:

float | None

trts_r2

Train-on-Real, Test-on-Synthetic R² (if applicable).

Type:

float | None

dataset_name: str
n_real_samples: int
n_synthetic_samples: int
realism_score: SpectralRealismScore
summary() → str

Return a human-readable summary.

trts_r2: float | None = None
tstr_r2: float | None = None
class nirs4all.data.synthetic.validation.MetricResult(metric: RealismMetric, value: float, threshold: float, passed: bool, details: Dict[str, Any] = <factory>)

Bases: object

Result of a single realism metric evaluation.

metric

The metric type.

Type:

nirs4all.data.synthetic.validation.RealismMetric

value

The computed metric value.

Type:

float

threshold

The threshold for passing.

Type:

float

passed

Whether the metric passed the threshold.

Type:

bool

details

Additional details about the metric computation.

Type:

Dict[str, Any]

details: Dict[str, Any]
metric: RealismMetric
passed: bool
threshold: float
value: float
class nirs4all.data.synthetic.validation.RealismMetric(value)

Bases: str, Enum

Metrics used in the spectral realism scorecard.

ADVERSARIAL_AUC = 'adversarial_auc'
BASELINE_CURVATURE = 'baseline_curvature'
CORRELATION_LENGTH = 'correlation_length'
DERIVATIVE_STATISTICS = 'derivative_statistics'
PEAK_DENSITY = 'peak_density'
SNR_DISTRIBUTION = 'snr_distribution'
class nirs4all.data.synthetic.validation.SpectralRealismScore(correlation_length_overlap: float, derivative_ks_pvalue: float, peak_density_ratio: float, baseline_curvature_overlap: float, snr_magnitude_match: bool, adversarial_auc: float, overall_pass: bool, metric_results: List[MetricResult] = <factory>, warnings: List[str] = <factory>)

Bases: object

Complete spectral realism assessment results.

This dataclass contains the results of comparing synthetic spectra against real spectra using multiple quantitative metrics.

correlation_length_overlap

Distribution overlap of the correlation lengths (autocorrelation decay) of real and synthetic spectra, in [0, 1].

Type:

float

derivative_ks_pvalue

p-value from the Kolmogorov-Smirnov (KS) test on the derivative distributions.

Type:

float

peak_density_ratio

Ratio of synthetic to real peak densities.

Type:

float

baseline_curvature_overlap

Distribution overlap of the baseline curvature of real and synthetic spectra, in [0, 1].

Type:

float

snr_magnitude_match

Whether the synthetic SNR is within one order of magnitude of the real SNR.

Type:

bool

adversarial_auc

AUC of classifier trying to distinguish real from synthetic.

Type:

float

overall_pass

Whether all critical metrics pass.

Type:

bool

metric_results

Individual metric results with details.

Type:

List[nirs4all.data.synthetic.validation.MetricResult]

warnings

Any warnings from the analysis.

Type:

List[str]

Example

>>> score = compute_spectral_realism_scorecard(real_spectra, synthetic_spectra, wavelengths)
>>> print(f"Overall pass: {score.overall_pass}")
>>> print(f"Adversarial AUC: {score.adversarial_auc:.3f}")
>>> for metric in score.metric_results:
...     print(metric)
adversarial_auc: float
baseline_curvature_overlap: float
correlation_length_overlap: float
derivative_ks_pvalue: float
metric_results: List[MetricResult]
overall_pass: bool
peak_density_ratio: float
snr_magnitude_match: bool
summary() → str

Return a human-readable summary of the score.

to_dict() → Dict[str, Any]

Convert to dictionary for serialization.

warnings: List[str]
exception nirs4all.data.synthetic.validation.ValidationError

Bases: Exception

Exception raised when synthetic data validation fails.

nirs4all.data.synthetic.validation.compute_adversarial_validation_auc(real_spectra: ndarray, synthetic_spectra: ndarray, cv_folds: int = 5, random_state: int | None = None) → Tuple[float, float]

Train a classifier to distinguish real from synthetic spectra.

A lower AUC indicates more realistic synthetic data (harder to distinguish from real data). An AUC near 0.5 means the classifier does no better than chance.

Parameters:
  • real_spectra – Real spectra array (n_real, n_wavelengths).

  • synthetic_spectra – Synthetic spectra array (n_synthetic, n_wavelengths).

  • cv_folds – Number of cross-validation folds.

  • random_state – Random state for reproducibility.

Returns:

Tuple of (mean_auc, std_auc) across folds.

Target:

  • AUC < 0.6: Excellent (nearly indistinguishable)

  • AUC < 0.7: Good (hard to distinguish)

  • AUC < 0.8: Acceptable (some differences)

  • AUC >= 0.8: Poor (clearly distinguishable)

Example

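>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_adversarial_validation_auc
>>> real = np.random.randn(100, 500)
>>> synthetic = np.random.randn(100, 500) + 0.1
>>> mean_auc, std_auc = compute_adversarial_validation_auc(real, synthetic)
>>> print(f"AUC: {mean_auc:.3f} ± {std_auc:.3f}")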
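
The bands listed under Target can be turned into a small lookup when reporting results. The helper below is a hypothetical illustration (`interpret_adversarial_auc` is not part of nirs4all); it only encodes the cut-offs given in this entry.

```python
def interpret_adversarial_auc(auc: float) -> str:
    """Map a mean adversarial AUC to the qualitative bands listed under Target."""
    if auc < 0.6:
        return "excellent"   # nearly indistinguishable
    if auc < 0.7:
        return "good"        # hard to distinguish
    if auc < 0.8:
        return "acceptable"  # some differences
    return "poor"            # clearly distinguishable


labels = [interpret_adversarial_auc(a) for a in (0.52, 0.65, 0.75, 0.91)]
print(labels)
```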
nirs4all.data.synthetic.validation.compute_baseline_curvature(spectra: ndarray, polynomial_degree: int = 3) → ndarray

Compute baseline curvature by fitting polynomials and measuring residuals.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • polynomial_degree – Degree of polynomial to fit.

Returns:

Array of residual standard deviations for each spectrum.

Example

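>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_baseline_curvature
>>> X = np.random.randn(100, 500)
>>> curvatures = compute_baseline_curvature(X)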
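
As a rough sketch of the idea described in this entry, and not the library's actual implementation, the residual standard deviation after a per-spectrum polynomial fit could be computed with NumPy alone. The name `baseline_residual_std` and the normalised x-axis are assumptions made for illustration.

```python
import numpy as np


def baseline_residual_std(spectra: np.ndarray, polynomial_degree: int = 3) -> np.ndarray:
    """Residual std after a per-spectrum polynomial fit (illustrative only)."""
    n_wavelengths = spectra.shape[1]
    # A normalised x-axis keeps the least-squares fit well conditioned.
    x = np.linspace(-1.0, 1.0, n_wavelengths)
    # np.polyfit fits every column of a 2-D y at once, so pass the spectra transposed.
    coeffs = np.polyfit(x, spectra.T, polynomial_degree)   # (degree + 1, n_samples)
    # Evaluate the fitted polynomials through a Vandermonde matrix (highest power first).
    fitted = np.vander(x, polynomial_degree + 1) @ coeffs  # (n_wavelengths, n_samples)
    return (spectra.T - fitted).std(axis=0)


x = np.linspace(-1.0, 1.0, 200)
smooth = np.vstack([1.0 + 0.5 * x + 0.2 * x**3, 2.0 - x**2])
rng = np.random.default_rng(0)
noisy = smooth + rng.normal(scale=0.05, size=smooth.shape)

clean_resid = baseline_residual_std(smooth)  # close to zero: a cubic fits exactly
noisy_resid = baseline_residual_std(noisy)   # roughly the injected noise level
```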
nirs4all.data.synthetic.validation.compute_correlation_length(spectra: ndarray, max_lag: int = 50) → ndarray

Compute correlation lengths for a set of spectra.

The correlation length is the lag at which the autocorrelation function decays to 1/e of its initial value.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • max_lag – Maximum lag to compute autocorrelation for.

Returns:

Array of correlation lengths for each spectrum.

Example

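>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_correlation_length
>>> X = np.random.randn(100, 500)
>>> lengths = compute_correlation_length(X)
>>> print(f"Mean correlation length: {lengths.mean():.2f}")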
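
The 1/e definition can be illustrated for a single spectrum as follows. This is a minimal sketch under stated assumptions: `correlation_length_sketch` is a hypothetical name, and returning `max_lag` when no decay is observed is an assumed convention that may differ from what the library does.

```python
import numpy as np


def correlation_length_sketch(spectrum: np.ndarray, max_lag: int = 50) -> int:
    """First lag at which the normalised autocorrelation drops below 1/e."""
    centered = spectrum - spectrum.mean()
    variance = float(np.dot(centered, centered))
    if variance == 0.0:
        return max_lag  # constant spectrum: no decay observed (assumed convention)
    for lag in range(1, max_lag + 1):
        acf = float(np.dot(centered[:-lag], centered[lag:])) / variance
        if acf < 1.0 / np.e:
            return lag
    return max_lag  # the autocorrelation never decayed within max_lag


rng = np.random.default_rng(0)
white = rng.normal(size=500)
# A 20-point moving average introduces correlation between neighbouring points.
smooth = np.convolve(white, np.ones(20) / 20, mode="same")

white_length = correlation_length_sketch(white)
smooth_length = correlation_length_sketch(smooth)
```

Uncorrelated noise decays at the first lag, while the smoothed series needs several lags, which is the contrast the metric is meant to capture.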
nirs4all.data.synthetic.validation.compute_derivative_statistics(spectra: ndarray, wavelengths: ndarray | None = None, order: int = 1) → Tuple[ndarray, ndarray]

Compute derivative statistics for spectra.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • wavelengths – Wavelength array for proper derivative scaling.

  • order – Derivative order (1 or 2).

Returns:

Tuple of (mean_derivatives, std_derivatives) per sample.

Example

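>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_derivative_statistics
>>> X = np.random.randn(100, 500)
>>> means, stds = compute_derivative_statistics(X, order=1)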
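
To show what per-sample derivative statistics look like, here is a hedged sketch based on `numpy.gradient`. Whether the library uses finite differences or another scheme is not stated in this entry, so `derivative_stats_sketch` should be read as an assumption and not as the actual implementation.

```python
import numpy as np


def derivative_stats_sketch(spectra, wavelengths=None, order=1):
    """Per-sample mean and std of the order-th finite-difference derivative."""
    deriv = np.asarray(spectra, dtype=float)
    for _ in range(order):
        if wavelengths is None:
            deriv = np.gradient(deriv, axis=1)               # unit index spacing
        else:
            deriv = np.gradient(deriv, wavelengths, axis=1)  # scaled per nm
    return deriv.mean(axis=1), deriv.std(axis=1)


wl = np.linspace(1000.0, 2500.0, 301)   # 5 nm step
line = 0.002 * wl + 0.1                 # constant slope of 0.002 per nm
parabola = 1e-6 * (wl - 1750.0) ** 2    # slope changes sign at 1750 nm
X = np.vstack([line, parabola])

means, stds = derivative_stats_sketch(X, wl, order=1)
```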
nirs4all.data.synthetic.validation.compute_distribution_overlap(dist1: ndarray, dist2: ndarray, n_bins: int = 50) → float

Compute overlap between two distributions using histogram intersection.

Parameters:
  • dist1 – First distribution samples.

  • dist2 – Second distribution samples.

  • n_bins – Number of histogram bins.

Returns:

Overlap coefficient in [0, 1], where 1 means identical distributions.

Example

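>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_distribution_overlap
>>> x1 = np.random.randn(1000)
>>> x2 = np.random.randn(1000) + 0.5
>>> overlap = compute_distribution_overlap(x1, x2)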
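
Histogram intersection itself is short enough to sketch. The version below bins both samples on shared edges and sums the bin-wise minimum of the normalised counts; the binning details (shared range, handling of degenerate input) are assumptions and may differ from the library.

```python
import numpy as np


def histogram_intersection(dist1, dist2, n_bins=50):
    """Overlap coefficient in [0, 1] from histograms on shared bin edges."""
    lo = min(np.min(dist1), np.min(dist2))
    hi = max(np.max(dist1), np.max(dist2))
    if lo == hi:
        return 1.0  # both samples collapse to one value (assumed convention)
    edges = np.linspace(lo, hi, n_bins + 1)
    h1, _ = np.histogram(dist1, bins=edges)
    h2, _ = np.histogram(dist2, bins=edges)
    p1 = h1 / h1.sum()
    p2 = h2 / h2.sum()
    return float(np.minimum(p1, p2).sum())


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
same = histogram_intersection(a, a)
shifted = histogram_intersection(a, a + 0.5)
disjoint = histogram_intersection(rng.uniform(0, 1, 500), rng.uniform(5, 6, 500))
```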
nirs4all.data.synthetic.validation.compute_peak_density(spectra: ndarray, wavelengths: ndarray, window_nm: float = 100.0, prominence_threshold: float = 0.01) → ndarray

Compute peak density for spectra, expressed as peaks per window_nm (100 nm by default).

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • wavelengths – Wavelength array in nm.

  • window_nm – Window size for density calculation (default 100 nm).

  • prominence_threshold – Minimum peak prominence as fraction of spectrum range.

Returns:

Array of peak densities (peaks per window_nm) for each spectrum.

Example

>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_peak_density
>>> X = np.random.randn(100, 500)
>>> wl = np.linspace(1000, 2500, 500)
>>> densities = compute_peak_density(X, wl)
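
The unit of the result (peaks per window_nm) can be illustrated with a simplified count of strict local maxima. This sketch deliberately omits the prominence filter described above, so it is not equivalent to the library function; `peak_density_sketch` is a hypothetical name.

```python
import numpy as np


def peak_density_sketch(spectra, wavelengths, window_nm=100.0):
    """Strict local maxima per window_nm; prominence filtering is omitted here."""
    interior = spectra[:, 1:-1]
    is_peak = (interior > spectra[:, :-2]) & (interior > spectra[:, 2:])
    span_nm = float(wavelengths[-1] - wavelengths[0])
    return is_peak.sum(axis=1) / (span_nm / window_nm)


wl = np.linspace(1000.0, 2000.0, 1001)  # 1 nm step over a 1000 nm span
centers_nm = (1200.0, 1500.0, 1800.0)   # three Gaussian bands
spectrum = np.sum([np.exp(-0.5 * ((wl - c) / 20.0) ** 2) for c in centers_nm], axis=0)

densities = peak_density_sketch(spectrum[np.newaxis, :], wl)
```

Three bands over 1000 nm correspond to 0.3 peaks per 100 nm. On noisy spectra a plain local-maximum count would be dominated by noise, which is why a prominence threshold matters in practice.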
nirs4all.data.synthetic.validation.compute_snr(spectra: ndarray, noise_region_fraction: float = 0.1) → ndarray

Estimate signal-to-noise ratio for spectra.

Uses the standard deviation of the highest-frequency components (obtained via high-pass filtering) as the noise estimate.

Parameters:
  • spectra – Array of shape (n_samples, n_wavelengths).

  • noise_region_fraction – Fraction of spectrum to use for noise estimation.

Returns:

Array of SNR estimates for each spectrum.

Example

>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import compute_snr
>>> X = np.random.randn(100, 500) + np.sin(np.linspace(0, 10, 500))
>>> snr = compute_snr(X)
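
One simple way to realise a high-pass noise estimate is a second difference along the wavelength axis. The sketch below uses that filter and the spectrum range as the signal term; both choices are assumptions for illustration, it ignores `noise_region_fraction`, and it should not be taken as the filter the library applies.

```python
import numpy as np


def snr_sketch(spectra: np.ndarray) -> np.ndarray:
    """Range-over-noise SNR using a second-difference high-pass filter (assumed)."""
    # The second difference suppresses smooth signal and keeps point-to-point noise.
    second_diff = np.diff(spectra, n=2, axis=1)
    # For independent noise of std sigma, the second difference has std sqrt(6) * sigma.
    noise = second_diff.std(axis=1) / np.sqrt(6.0)
    signal = spectra.max(axis=1) - spectra.min(axis=1)
    return signal / np.maximum(noise, 1e-12)


rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 10, 500))
low_noise = base + rng.normal(scale=0.01, size=(20, 500))
high_noise = base + rng.normal(scale=0.10, size=(20, 500))

snr_low = snr_sketch(low_noise)
snr_high = snr_sketch(high_noise)
```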
nirs4all.data.synthetic.validation.compute_spectral_realism_scorecard(real_spectra: ndarray, synthetic_spectra: ndarray, wavelengths: ndarray | None = None, thresholds: Dict[str, float] | None = None, include_adversarial: bool = True, random_state: int | None = None) → SpectralRealismScore

Compute a comprehensive spectral realism scorecard.

This function computes multiple quantitative metrics to assess whether synthetic spectra are realistic compared to real data.

Parameters:
  • real_spectra – Real spectra array (n_real, n_wavelengths).

  • synthetic_spectra – Synthetic spectra array (n_synthetic, n_wavelengths).

  • wavelengths – Wavelength array in nm. If None, uses indices.

  • thresholds – Custom thresholds for metrics. Defaults:

      - correlation_length_overlap: 0.7
      - derivative_ks_pvalue: 0.05
      - peak_density_ratio_min: 0.5
      - peak_density_ratio_max: 2.0
      - baseline_curvature_overlap: 0.6
      - snr_order_of_magnitude: 1.0 (log10 difference)
      - adversarial_auc: 0.7

  • include_adversarial – Whether to compute adversarial AUC (slower).

  • random_state – Random state for adversarial validation.

Returns:

SpectralRealismScore with all metrics and pass/fail status.

Example

>>> import numpy as np
>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator
>>> from nirs4all.data.synthetic.validation import compute_spectral_realism_scorecard
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X_synth, _, _ = gen.generate(200)
>>> # X_real would be loaded from real data
>>> X_real = np.random.randn(200, X_synth.shape[1])  # Placeholder
>>> score = compute_spectral_realism_scorecard(X_real, X_synth, gen.wavelengths)
>>> print(score.summary())
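
The default thresholds listed under Parameters suggest how individual metrics might be turned into pass/fail flags. The sketch below is an assumption about the direction of each comparison (overlaps and p-value at or above the threshold, ratio inside its band, SNR difference and AUC at or below the threshold); `apply_thresholds` and the `snr_log10_difference` key are hypothetical names, not part of the library.

```python
from typing import Dict, Optional

# Default values copied from the thresholds parameter description.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "correlation_length_overlap": 0.7,
    "derivative_ks_pvalue": 0.05,
    "peak_density_ratio_min": 0.5,
    "peak_density_ratio_max": 2.0,
    "baseline_curvature_overlap": 0.6,
    "snr_order_of_magnitude": 1.0,
    "adversarial_auc": 0.7,
}


def apply_thresholds(
    values: Dict[str, float], thresholds: Optional[Dict[str, float]] = None
) -> Dict[str, bool]:
    """Return a pass/fail flag per metric; custom thresholds override the defaults."""
    t = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    return {
        "correlation_length": values["correlation_length_overlap"] >= t["correlation_length_overlap"],
        "derivative_statistics": values["derivative_ks_pvalue"] >= t["derivative_ks_pvalue"],
        "peak_density": t["peak_density_ratio_min"] <= values["peak_density_ratio"] <= t["peak_density_ratio_max"],
        "baseline_curvature": values["baseline_curvature_overlap"] >= t["baseline_curvature_overlap"],
        "snr_distribution": values["snr_log10_difference"] <= t["snr_order_of_magnitude"],
        "adversarial_auc": values["adversarial_auc"] <= t["adversarial_auc"],
    }


# Made-up metric values, used only to exercise the helper.
example_values = {
    "correlation_length_overlap": 0.82,
    "derivative_ks_pvalue": 0.21,
    "peak_density_ratio": 1.3,
    "baseline_curvature_overlap": 0.55,
    "snr_log10_difference": 0.4,
    "adversarial_auc": 0.66,
}
passed = apply_thresholds(example_values)
relaxed = apply_thresholds(example_values, {"baseline_curvature_overlap": 0.5})
```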
nirs4all.data.synthetic.validation.quick_realism_check(synthetic_spectra: ndarray, wavelengths: ndarray | None = None, expected_snr_range: Tuple[float, float] = (10, 1000), expected_peak_density: Tuple[float, float] = (0.5, 10.0)) → Tuple[bool, List[str]]

Perform quick realism checks on synthetic spectra without real data.

This function checks basic properties that realistic spectra should have, without requiring a reference real dataset.

Parameters:
  • synthetic_spectra – Synthetic spectra to check.

  • wavelengths – Wavelength array.

  • expected_snr_range – Expected SNR range (min, max).

  • expected_peak_density – Expected peak density range (peaks per 100 nm).

Returns:

Tuple of (passed, list_of_issues).

Example

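>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator
>>> from nirs4all.data.synthetic.validation import quick_realism_check
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X = gen.generate(100)[0]
>>> passed, issues = quick_realism_check(X, gen.wavelengths)
>>> if not passed:
...     print("Issues:", issues)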
nirs4all.data.synthetic.validation.validate_against_benchmark(synthetic_spectra: ndarray, benchmark_spectra: ndarray, benchmark_name: str, wavelengths: ndarray | None = None, synthetic_targets: ndarray | None = None, benchmark_targets: ndarray | None = None, random_state: int | None = None) → DatasetComparisonResult

Validate synthetic data against a benchmark dataset.

Parameters:
  • synthetic_spectra – Synthetic spectra (n_synth, n_wavelengths).

  • benchmark_spectra – Real benchmark spectra (n_bench, n_wavelengths).

  • benchmark_name – Name of the benchmark dataset.

  • wavelengths – Wavelength array.

  • synthetic_targets – Optional targets for TSTR/TRTS evaluation.

  • benchmark_targets – Optional targets for TSTR/TRTS evaluation.

  • random_state – Random state for reproducibility.

Returns:

DatasetComparisonResult with realism score and optional TSTR/TRTS.

Example

>>> result = validate_against_benchmark(
...     synthetic_spectra=X_synth,
...     benchmark_spectra=X_real,
...     benchmark_name="Corn",
... )
>>> print(result.summary())
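
When targets are supplied, TSTR and TRTS amount to fitting a regressor on one dataset and scoring R² on the other. This entry does not say which model the library uses, so the sketch below substitutes ordinary least squares as a stand-in; `cross_domain_r2` and the toy data are assumptions for illustration only.

```python
import numpy as np


def cross_domain_r2(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept on one set and report R² on the other."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    y_pred = A_test @ coef
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5, 3.0, -1.5])  # shared linear relation (toy data)
X_synth = rng.normal(size=(200, 5))
y_synth = X_synth @ w + rng.normal(scale=0.05, size=200)
X_real = rng.normal(size=(100, 5))
y_real = X_real @ w + rng.normal(scale=0.05, size=100)

tstr_r2 = cross_domain_r2(X_synth, y_synth, X_real, y_real)  # train synthetic, test real
trts_r2 = cross_domain_r2(X_real, y_real, X_synth, y_synth)  # train real, test synthetic
```

Both scores are close to 1 here because the toy datasets share the same underlying relation; a large gap between them would hint that one domain does not cover the other.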
nirs4all.data.synthetic.validation.validate_concentrations(C: ndarray, n_samples: int | None = None, n_components: int | None = None, check_normalized: bool = False, tolerance: float = 0.01) → List[str]

Validate concentration matrix.

Parameters:
  • C – Concentration matrix to validate.

  • n_samples – Expected number of samples.

  • n_components – Expected number of components.

  • check_normalized – Whether concentrations should sum to 1.

  • tolerance – Tolerance for normalization check.

Returns:

List of validation warning messages.

Raises:

ValidationError – If critical validation fails.
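
The normalization check controlled by check_normalized and tolerance can be pictured as a row-sum test. The sketch below is an assumption about how such a check might be written; the negative-value warning and the message wording are invented for illustration, and the split between warnings and `ValidationError` in the library is not specified in this entry.

```python
import numpy as np


def concentration_warnings_sketch(C, check_normalized=False, tolerance=0.01):
    """Collect warning strings for a concentration matrix (illustrative only)."""
    warnings = []
    if np.any(C < 0):
        warnings.append("negative concentrations found")
    if check_normalized:
        row_sums = C.sum(axis=1)
        n_off = int(np.sum(np.abs(row_sums - 1.0) > tolerance))
        if n_off:
            warnings.append(f"{n_off} row(s) do not sum to 1 within {tolerance}")
    return warnings


rng = np.random.default_rng(0)
C = rng.dirichlet(np.ones(3), size=50)  # every row sums to 1
C_bad = C.copy()
C_bad[0] *= 1.2                         # first row now sums to 1.2

ok = concentration_warnings_sketch(C, check_normalized=True)
bad = concentration_warnings_sketch(C_bad, check_normalized=True)
```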

nirs4all.data.synthetic.validation.validate_spectra(X: ndarray, expected_shape: Tuple[int, int] | None = None, check_finite: bool = True, check_positive: bool = False, value_range: Tuple[float, float] | None = None) → List[str]

Validate generated spectra matrix.

Parameters:
  • X – Spectra matrix to validate.

  • expected_shape – Expected (n_samples, n_wavelengths) shape.

  • check_finite – Whether to check for NaN/Inf values.

  • check_positive – Whether to require all positive values.

  • value_range – Optional (min, max) expected range.

Returns:

List of validation warning messages (empty if all OK).

Raises:

ValidationError – If critical validation fails.

Example

>>> import numpy as np
>>> from nirs4all.data.synthetic.validation import validate_spectra
>>> X = np.random.randn(100, 500)
>>> warnings = validate_spectra(X, expected_shape=(100, 500))
>>> if warnings:
...     print("Warnings:", warnings)
nirs4all.data.synthetic.validation.validate_synthetic_output(X: ndarray, C: ndarray, E: ndarray, wavelengths: ndarray | None = None) → List[str]

Validate complete synthetic generation output.

Parameters:
  • X – Generated spectra (n_samples, n_wavelengths).

  • C – Concentration matrix (n_samples, n_components).

  • E – Component spectra (n_components, n_wavelengths).

  • wavelengths – Optional wavelength array.

Returns:

List of all validation warnings.

Raises:

ValidationError – If critical validation fails.

Example

>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X, C, E = gen.generate(100)
>>> warnings = validate_synthetic_output(X, C, E, gen.wavelengths)
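
The shapes given under Parameters imply a set of consistency conditions between X, C, E and the wavelength array. A minimal sketch of those cross-checks, assuming nothing beyond the documented shapes (`shape_issues_sketch` is a hypothetical helper, not the library function):

```python
import numpy as np


def shape_issues_sketch(X, C, E, wavelengths=None):
    """List mismatches between X (n, w), C (n, k), E (k, w) and wavelengths (w,)."""
    issues = []
    n_samples, n_wavelengths = X.shape
    if C.shape[0] != n_samples:
        issues.append(f"C has {C.shape[0]} rows, expected {n_samples}")
    if E.shape[0] != C.shape[1]:
        issues.append(f"E has {E.shape[0]} components, C has {C.shape[1]}")
    if E.shape[1] != n_wavelengths:
        issues.append(f"E has {E.shape[1]} wavelengths, X has {n_wavelengths}")
    if wavelengths is not None and len(wavelengths) != n_wavelengths:
        issues.append(f"wavelengths has length {len(wavelengths)}, expected {n_wavelengths}")
    return issues


X = np.zeros((100, 500))
C = np.zeros((100, 4))
E = np.zeros((4, 500))
wl = np.linspace(1000.0, 2500.0, 500)

consistent = shape_issues_sketch(X, C, E, wl)
mismatched = shape_issues_sketch(X, C[:90], E[:, :400], wl[:499])
```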
nirs4all.data.synthetic.validation.validate_wavelengths(wavelengths: ndarray, expected_range: Tuple[float, float] | None = None, check_monotonic: bool = True, check_uniform: bool = True) → List[str]

Validate wavelength array.

Parameters:
  • wavelengths – Wavelength array to validate.

  • expected_range – Optional (min, max) expected range in nm.

  • check_monotonic – Whether to check for monotonically increasing values.

  • check_uniform – Whether to check for uniform spacing.

Returns:

List of validation warning messages.

Raises:

ValidationError – If critical validation fails.
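
The monotonicity and uniform-spacing checks both reduce to inspecting the differences between consecutive wavelengths. The sketch below illustrates this with `numpy.diff`; the relative tolerance, the message wording and the name `wavelength_issues_sketch` are assumptions, and the library may treat some of these conditions as errors instead of warnings.

```python
import numpy as np


def wavelength_issues_sketch(wavelengths, expected_range=None, rel_tol=1e-6):
    """Return a list of issues found in a wavelength axis (illustrative only)."""
    issues = []
    steps = np.diff(wavelengths)
    if not np.all(steps > 0):
        issues.append("wavelengths are not strictly increasing")
    elif not np.allclose(steps, steps.mean(), rtol=rel_tol, atol=0.0):
        issues.append("wavelength spacing is not uniform")
    if expected_range is not None:
        lo, hi = expected_range
        if wavelengths.min() < lo or wavelengths.max() > hi:
            issues.append(f"wavelengths fall outside [{lo}, {hi}] nm")
    return issues


uniform = np.linspace(1000.0, 2500.0, 500)
warped = uniform + 0.5 * np.sin(uniform / 100.0)  # still increasing, uneven steps

uniform_issues = wavelength_issues_sketch(uniform, expected_range=(900.0, 2600.0))
warped_issues = wavelength_issues_sketch(warped)
reversed_issues = wavelength_issues_sketch(uniform[::-1])
range_issues = wavelength_issues_sketch(uniform, expected_range=(1100.0, 2500.0))
```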