nirs4all.data.synthetic.fitter module

Real data fitting utilities for synthetic NIRS spectra generation.

This module provides tools to analyze real NIRS datasets and fit generator parameters to match their statistical and spectral properties.

Key Features:
  • Statistical property analysis (mean, std, skewness, kurtosis)

  • Spectral shape analysis (slope, curvature, noise)

  • PCA structure analysis

  • Parameter estimation for SyntheticNIRSGenerator

  • Comparison between synthetic and real data

  • Phase 1-4 Enhanced Features:
    • Instrument archetype inference (InGaAs, PbS, MEMS, etc.)

    • Measurement mode detection (transmittance, reflectance, ATR)

    • Application domain suggestion (agriculture, pharmaceutical, etc.)

    • Environmental effects estimation (temperature, moisture)

    • Scattering parameter estimation (particle size, EMSC)

    • Wavenumber-based peak analysis for component identification

Example

>>> from nirs4all.data.synthetic import RealDataFitter, SyntheticNIRSGenerator
>>>
>>> # Analyze real data
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Create generator with fitted parameters (includes all Phase 1-4 features)
>>> generator = fitter.create_matched_generator()
>>> X_synthetic, _, _ = generator.generate(n_samples=1000)
>>>
>>> # Or get all inferred characteristics
>>> print(f"Inferred instrument: {params.inferred_instrument}")
>>> print(f"Inferred domain: {params.inferred_domain}")
>>> print(f"Measurement mode: {params.measurement_mode}")

References

  • Based on comparator.py from bench/synthetic/

  • Enhanced with Phase 1-4 synthetic generator features

class nirs4all.data.synthetic.fitter.DomainInference(domain_name: str = 'unknown', category: str = 'unknown', confidence: float = 0.0, detected_components: List[str] = <factory>, alternative_domains: Dict[str, float] = <factory>)

Bases: object

Results of application domain inference.

domain_name

Best matching domain name.

Type:

str

category

Domain category.

Type:

str

confidence

Confidence score (0-1).

Type:

float

detected_components

Components detected from peak analysis.

Type:

List[str]

alternative_domains

Other possible domains with scores.

Type:

Dict[str, float]

alternative_domains: Dict[str, float]
category: str = 'unknown'
confidence: float = 0.0
detected_components: List[str]
domain_name: str = 'unknown'
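As a rough illustration of how `domain_name`, `confidence`, and `alternative_domains` relate to one another, the sketch below splits a dictionary of candidate scores into a best match and the remaining alternatives. The scores, the `split_best` helper, and the ranking rule are all hypothetical; the module's actual inference logic is not described in this reference.

```python
from typing import Dict, Tuple


def split_best(scores: Dict[str, float]) -> Tuple[str, float, Dict[str, float]]:
    """Return (best_name, best_score, alternatives) from candidate scores."""
    if not scores:
        # Mirror the documented defaults when nothing could be scored.
        return "unknown", 0.0, {}
    best = max(scores, key=scores.get)
    alternatives = {k: v for k, v in scores.items() if k != best}
    return best, scores[best], alternatives


# Hypothetical candidate scores in the 0-1 range.
candidate_scores = {"agriculture": 0.72, "pharmaceutical": 0.18, "food": 0.41}
best_name, best_score, others = split_best(candidate_scores)
print(best_name, best_score, sorted(others))
```

The same pattern would apply to `InstrumentInference.archetype_name` and `alternative_archetypes`, again only as an assumption about how the fields fit together.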
class nirs4all.data.synthetic.fitter.EnvironmentalInference(estimated_temperature_variation: float = 0.0, has_temperature_effects: bool = False, estimated_moisture_variation: float = 0.0, has_moisture_effects: bool = False, water_band_shift: float = 0.0)

Bases: object

Results of environmental effects inference.

estimated_temperature_variation

Estimated temperature variation (°C).

Type:

float

has_temperature_effects

Whether temperature effects are detectable.

Type:

bool

estimated_moisture_variation

Estimated moisture variation.

Type:

float

has_moisture_effects

Whether moisture effects are detectable.

Type:

bool

water_band_shift

Detected shift in water bands (nm).

Type:

float

estimated_moisture_variation: float = 0.0
estimated_temperature_variation: float = 0.0
has_moisture_effects: bool = False
has_temperature_effects: bool = False
water_band_shift: float = 0.0
class nirs4all.data.synthetic.fitter.FittedParameters(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, global_slope_mean: float = 0.0, global_slope_std: float = 0.02, noise_base: float = 0.001, noise_signal_dep: float = 0.005, path_length_std: float = 0.05, baseline_amplitude: float = 0.02, scatter_alpha_std: float = 0.05, scatter_beta_std: float = 0.01, tilt_std: float = 0.01, complexity: str = 'realistic', source_name: str = '', source_properties: SpectralProperties | None = None, inferred_instrument: str = 'unknown', instrument_inference: InstrumentInference | None = None, measurement_mode: str = 'transmittance', measurement_mode_confidence: float = 0.0, inferred_domain: str = 'unknown', domain_inference: DomainInference | None = None, environmental_inference: EnvironmentalInference | None = None, temperature_config: Dict[str, Any] = <factory>, moisture_config: Dict[str, Any] = <factory>, scattering_inference: ScatteringInference | None = None, particle_size_config: Dict[str, Any] = <factory>, emsc_config: Dict[str, Any] = <factory>, detected_components: List[str] = <factory>, suggested_n_components: int = 5)

Bases: object

Parameters fitted from real data for synthetic generation.

This dataclass contains all parameters needed to configure a SyntheticNIRSGenerator to produce spectra similar to a real dataset, including Phase 1-4 enhanced features.

# Basic wavelength grid
wavelength_start

Start wavelength (nm).

Type:

float

wavelength_end

End wavelength (nm).

Type:

float

wavelength_step

Wavelength step (nm).

Type:

float

# Slope and baseline parameters
global_slope_mean

Mean global slope.

Type:

float

global_slope_std

Slope standard deviation.

Type:

float

baseline_amplitude

Baseline drift amplitude.

Type:

float

# Noise parameters
noise_base

Base noise level.

Type:

float

noise_signal_dep

Signal-dependent noise factor.

Type:

float

# Scatter parameters
path_length_std

Path length variation.

Type:

float

scatter_alpha_std

Multiplicative scatter std.

Type:

float

scatter_beta_std

Additive scatter std.

Type:

float

tilt_std

Spectral tilt standard deviation.

Type:

float

# Complexity
complexity

Suggested complexity level.

Type:

str

# Source metadata
source_name

Name of source dataset.

Type:

str

source_properties

Full SpectralProperties of source.

Type:

nirs4all.data.synthetic.fitter.SpectralProperties | None

# Phase 1-4 Enhanced Parameters
# Instrument inference
inferred_instrument

Inferred instrument archetype.

Type:

str

instrument_inference

Full instrument inference result.

Type:

nirs4all.data.synthetic.fitter.InstrumentInference | None

# Measurement mode
measurement_mode

Inferred measurement mode.

Type:

str

measurement_mode_confidence

Confidence of the measurement mode inference.

Type:

float

# Domain inference
inferred_domain

Inferred application domain.

Type:

str

domain_inference

Full domain inference result.

Type:

nirs4all.data.synthetic.fitter.DomainInference | None

# Environmental effects
environmental_inference

Environmental effects inference.

Type:

nirs4all.data.synthetic.fitter.EnvironmentalInference | None

temperature_config

Suggested temperature config parameters.

Type:

Dict[str, Any]

moisture_config

Suggested moisture config parameters.

Type:

Dict[str, Any]

# Scattering effects
scattering_inference

Scattering effects inference.

Type:

nirs4all.data.synthetic.fitter.ScatteringInference | None

particle_size_config

Suggested particle size config parameters.

Type:

Dict[str, Any]

emsc_config

Suggested EMSC config parameters.

Type:

Dict[str, Any]

# Detected components for procedural generation
detected_components

List of detected/inferred component names.

Type:

List[str]

suggested_n_components

Suggested number of components.

Type:

int

baseline_amplitude: float = 0.02
complexity: str = 'realistic'
detected_components: List[str]
domain_inference: DomainInference | None = None
emsc_config: Dict[str, Any]
environmental_inference: EnvironmentalInference | None = None
classmethod from_dict(data: Dict[str, Any]) → FittedParameters

Create FittedParameters from a dictionary.

Parameters:

data – Dictionary with parameter values.

Returns:

FittedParameters instance.

global_slope_mean: float = 0.0
global_slope_std: float = 0.02
inferred_domain: str = 'unknown'
inferred_instrument: str = 'unknown'
instrument_inference: InstrumentInference | None = None
classmethod load(path: str) → FittedParameters

Load parameters from JSON file.

Parameters:

path – Input file path.

Returns:

FittedParameters instance.

measurement_mode: str = 'transmittance'
measurement_mode_confidence: float = 0.0
moisture_config: Dict[str, Any]
noise_base: float = 0.001
noise_signal_dep: float = 0.005
particle_size_config: Dict[str, Any]
path_length_std: float = 0.05
save(path: str) → None

Save parameters to JSON file.

Parameters:

path – Output file path.

scatter_alpha_std: float = 0.05
scatter_beta_std: float = 0.01
scattering_inference: ScatteringInference | None = None
source_name: str = ''
source_properties: SpectralProperties | None = None
suggested_n_components: int = 5
summary() → str

Generate a human-readable summary of fitted parameters.

Returns:

Multi-line summary string.

temperature_config: Dict[str, Any]
tilt_std: float = 0.01
to_dict() → Dict[str, Any]

Convert all parameters to a dictionary.

Returns:

Dictionary with all parameter values.
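The `to_dict`/`from_dict` and `save`/`load` pairs describe a dictionary and JSON round trip. The sketch below shows that pattern on a small stand-in dataclass; `MiniParams` is hypothetical and is not the library's implementation. The real `FittedParameters` also carries nested inference objects and possibly arrays, which would need extra conversion that this sketch leaves out.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class MiniParams:
    """Hypothetical stand-in with a few FittedParameters-style fields."""

    wavelength_start: float = 1000.0
    wavelength_end: float = 2500.0
    complexity: str = "realistic"
    detected_components: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiniParams":
        # Ignore unknown keys so files with extra entries still load.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in data.items() if k in known})

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.to_dict(), fh, indent=2)

    @classmethod
    def load(cls, path: str) -> "MiniParams":
        with open(path, "r", encoding="utf-8") as fh:
            return cls.from_dict(json.load(fh))


original = MiniParams(wavelength_start=1100.0, detected_components=["water", "protein"])
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "params.json")
    original.save(json_path)
    restored = MiniParams.load(json_path)
print(restored == original)
```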

to_full_config() → Dict[str, Any]

Convert all fitted parameters to a comprehensive configuration.

This includes all Phase 1-4 parameters for complete synthetic data generation matching the source dataset.

Returns:

Dictionary with all configuration parameters.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> config = params.to_full_config()
>>> # Use with builder pattern or advanced configuration
to_generator_kwargs() → Dict[str, Any]

Convert fitted parameters to kwargs for SyntheticNIRSGenerator.

Returns:

Dictionary of keyword arguments.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())
wavelength_end: float = 2500.0
wavelength_start: float = 1000.0
wavelength_step: float = 2.0
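The three grid parameters (`wavelength_start`, `wavelength_end`, `wavelength_step`) describe an evenly spaced wavelength axis. A minimal sketch of rebuilding that axis follows, assuming the end point is inclusive; the generator's own convention is not stated in this reference, and `build_grid` is a hypothetical helper.

```python
import math
from typing import List


def build_grid(start: float, end: float, step: float) -> List[float]:
    """Evenly spaced wavelengths in nm from start up to end (inclusive)."""
    if step <= 0 or end < start:
        raise ValueError("step must be positive and end must be >= start")
    # Derive the point count first so float error does not accumulate;
    # the small tolerance keeps an exactly reachable end point in the grid.
    n_points = math.floor((end - start) / step + 1e-9) + 1
    return [start + i * step for i in range(n_points)]


# Defaults documented for FittedParameters: 1000-2500 nm in 2 nm steps.
grid = build_grid(1000.0, 2500.0, 2.0)
print(len(grid), grid[0], grid[-1])
```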
class nirs4all.data.synthetic.fitter.InstrumentInference(archetype_name: str = 'unknown', detector_type: str = 'unknown', wavelength_range: Tuple[float, float] = (1000.0, 2500.0), estimated_resolution: float = 8.0, confidence: float = 0.0, alternative_archetypes: Dict[str, float] = <factory>)

Bases: object

Results of instrument archetype inference.

archetype_name

Best matching instrument archetype name.

Type:

str

detector_type

Inferred detector type.

Type:

str

wavelength_range

Detected wavelength range.

Type:

Tuple[float, float]

estimated_resolution

Estimated spectral resolution (nm).

Type:

float

confidence

Confidence score (0-1).

Type:

float

alternative_archetypes

Other possible archetypes with scores.

Type:

Dict[str, float]

alternative_archetypes: Dict[str, float]
archetype_name: str = 'unknown'
confidence: float = 0.0
detector_type: str = 'unknown'
estimated_resolution: float = 8.0
wavelength_range: Tuple[float, float] = (1000.0, 2500.0)
class nirs4all.data.synthetic.fitter.MeasurementModeInference(value)

Bases: str, Enum

Inferred measurement mode from spectral analysis.

ATR = 'atr'
REFLECTANCE = 'reflectance'
TRANSFLECTANCE = 'transflectance'
TRANSMITTANCE = 'transmittance'
UNKNOWN = 'unknown'
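Because the class derives from both `str` and `Enum`, its members compare equal to plain strings, which is consistent with `FittedParameters.measurement_mode` being typed as `str`. The sketch below demonstrates that behaviour on a local stand-in with the same members; `ModeSketch` and `parse_mode` are hypothetical names used only for illustration.

```python
from enum import Enum


class ModeSketch(str, Enum):
    """Local stand-in mirroring the documented members."""

    ATR = "atr"
    REFLECTANCE = "reflectance"
    TRANSFLECTANCE = "transflectance"
    TRANSMITTANCE = "transmittance"
    UNKNOWN = "unknown"


def parse_mode(text: str) -> ModeSketch:
    """Map free text to a member, falling back to UNKNOWN."""
    try:
        return ModeSketch(text.strip().lower())
    except ValueError:
        return ModeSketch.UNKNOWN


mode = parse_mode(" Reflectance ")
print(mode is ModeSketch.REFLECTANCE)  # lookup by value returns the member
print(mode == "reflectance")           # str mixin: equal to the plain string
print(parse_mode("raman").value)       # unrecognised text falls back
```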
class nirs4all.data.synthetic.fitter.RealDataFitter

Bases: object

Fit generator parameters to match real dataset properties.

This class analyzes real NIRS data and estimates parameters for the SyntheticNIRSGenerator to produce similar spectra. Includes Phase 1-4 enhanced inference for instruments, domains, and effects.

source_properties

SpectralProperties of the analyzed data.

fitted_params

FittedParameters after fitting.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Access inferred characteristics
>>> print(f"Instrument: {params.inferred_instrument}")
>>> print(f"Domain: {params.inferred_domain}")
>>>
>>> # Create matched generator
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
create_matched_generator(random_state: int | None = None) → SyntheticNIRSGenerator

Create a SyntheticNIRSGenerator configured to match the fitted data.

This method creates a generator with all fitted parameters including Phase 1-4 enhanced features (instrument, domain, effects).

Parameters:

random_state – Random seed for reproducibility.

Returns:

Configured SyntheticNIRSGenerator instance.

Raises:

RuntimeError – If fit() hasn’t been called.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator(random_state=42)
>>> X_synth, _, _ = generator.generate(1000)
evaluate_similarity(X_synthetic: ndarray, wavelengths: ndarray | None = None) → Dict[str, Any]

Evaluate similarity between synthetic and source data.

Computes various metrics comparing synthetic spectra to the original real data.

Parameters:
  • X_synthetic – Synthetic spectra matrix.

  • wavelengths – Optional wavelength grid.

Returns:

Dictionary with similarity metrics.

Raises:

RuntimeError – If fit() hasn’t been called.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
>>> metrics = fitter.evaluate_similarity(X_synth)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
fit(X: np.ndarray | 'SpectroDataset', *, wavelengths: np.ndarray | None = None, name: str = 'source', infer_instrument: bool = True, infer_domain: bool = True, infer_measurement_mode: bool = True, infer_environmental: bool = True, infer_scattering: bool = True) → FittedParameters

Fit generator parameters to real data.

Analyzes the input data and estimates optimal parameters for generating synthetic spectra with similar properties. Includes Phase 1-4 enhanced inference.

Parameters:
  • X – Real spectra matrix (n_samples, n_wavelengths) or SpectroDataset.

  • wavelengths – Wavelength grid (required if X is ndarray).

  • name – Dataset name for reference.

  • infer_instrument – Whether to infer instrument archetype.

  • infer_domain – Whether to infer application domain.

  • infer_measurement_mode – Whether to infer measurement mode.

  • infer_environmental – Whether to infer environmental effects.

  • infer_scattering – Whether to infer scattering parameters.

Returns:

FittedParameters object with estimated parameters.

Raises:

ValueError – If X is empty or has wrong shape.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl, name="wheat")
>>> print(params.summary())
fit_from_path(path: str, *, name: str | None = None) → FittedParameters

Fit parameters from a dataset path.

Loads data using DatasetConfigs and fits parameters.

Parameters:
  • path – Path to dataset folder.

  • name – Optional name override.

Returns:

FittedParameters object.

Example

>>> params = fitter.fit_from_path("sample_data/regression")
get_tuning_recommendations() → List[str]

Get recommendations for tuning generation parameters.

Based on the fitted parameters and source data, provides suggestions for manual tuning.

Returns:

List of recommendation strings.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> for rec in fitter.get_tuning_recommendations():
...     print(f"- {rec}")
class nirs4all.data.synthetic.fitter.ScatteringInference(has_scatter_effects: bool = False, estimated_particle_size_um: float = 50.0, multiplicative_scatter_std: float = 0.0, additive_scatter_std: float = 0.0, baseline_curvature: float = 0.0, snv_correctable: bool = False, msc_correctable: bool = False)

Bases: object

Results of scattering effects inference.

has_scatter_effects

Whether significant scatter is detected.

Type:

bool

estimated_particle_size_um

Estimated mean particle size (μm).

Type:

float

multiplicative_scatter_std

Estimated MSC-style multiplicative scatter.

Type:

float

additive_scatter_std

Estimated SNV-style additive scatter.

Type:

float

baseline_curvature

Detected baseline curvature intensity.

Type:

float

snv_correctable

Whether SNV would improve spectra.

Type:

bool

msc_correctable

Whether MSC would improve spectra.

Type:

bool

additive_scatter_std: float = 0.0
baseline_curvature: float = 0.0
estimated_particle_size_um: float = 50.0
has_scatter_effects: bool = False
msc_correctable: bool = False
multiplicative_scatter_std: float = 0.0
snv_correctable: bool = False
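The `snv_correctable` flag refers to standard normal variate (SNV) correction, which centres and scales each spectrum individually. The sketch below shows the standard SNV transform and why it cancels a per-sample additive offset and multiplicative gain. How the module decides that SNV "would improve spectra" is not documented here, and the example values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def snv(spectrum: List[float]) -> List[float]:
    """Standard normal variate: centre and scale one spectrum."""
    mu = mean(spectrum)
    sigma = pstdev(spectrum)
    if sigma == 0:
        raise ValueError("SNV is undefined for a constant spectrum")
    return [(value - mu) / sigma for value in spectrum]


base = [0.10, 0.35, 0.80, 0.55, 0.20]
# Same spectrum with a hypothetical multiplicative (1.3) and additive (0.05) scatter.
scattered = [1.3 * value + 0.05 for value in base]

# After SNV the two spectra coincide up to floating-point error.
max_diff = max(abs(a - b) for a, b in zip(snv(base), snv(scattered)))
print(max_diff < 1e-9)
```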
class nirs4all.data.synthetic.fitter.SpectralProperties(name: str = 'dataset', n_samples: int = 0, n_wavelengths: int = 0, wavelengths: ndarray | None = None, mean_spectrum: ndarray | None = None, std_spectrum: ndarray | None = None, global_mean: float = 0.0, global_std: float = 0.0, global_range: Tuple[float, float] = (0.0, 0.0), mean_slope: float = 0.0, slope_std: float = 0.0, slopes: ndarray | None = None, mean_curvature: float = 0.0, curvature_std: float = 0.0, skewness: float = 0.0, kurtosis: float = 0.0, noise_estimate: float = 0.0, snr_estimate: float = 0.0, pca_explained_variance: ndarray | None = None, pca_n_components_95: int = 0, n_peaks_mean: float = 0.0, peak_positions: ndarray | None = None, peak_wavenumbers: ndarray | None = None, effective_resolution: float = 8.0, noise_correlation_length: float = 1.0, wavelength_range: Tuple[float, float] = (1000.0, 2500.0), baseline_offset: float = 0.0, kubelka_munk_linearity: float = 0.0, baseline_convexity: float = 0.0, water_band_variation: float = 0.0, oh_band_positions: ndarray | None = None, temperature_sensitivity_score: float = 0.0, scatter_baseline_slope: float = 0.0, scatter_baseline_curvature: float = 0.0, sample_to_sample_offset_std: float = 0.0, sample_to_sample_slope_std: float = 0.0, protein_band_intensity: float = 0.0, carbohydrate_band_intensity: float = 0.0, lipid_band_intensity: float = 0.0, water_band_intensity: float = 0.0)

Bases: object

Container for computed spectral properties of a dataset.

This dataclass holds various statistical and spectral properties computed from a NIRS dataset for comparison and fitting purposes.

name

Dataset identifier.

Type:

str

n_samples

Number of samples.

Type:

int

n_wavelengths

Number of wavelengths.

Type:

int

wavelengths

Wavelength grid.

Type:

numpy.ndarray | None

# Basic statistics
mean_spectrum

Mean spectrum across samples.

Type:

numpy.ndarray | None

std_spectrum

Standard deviation spectrum.

Type:

numpy.ndarray | None

global_mean

Overall mean absorbance.

Type:

float

global_std

Overall standard deviation.

Type:

float

global_range

(min, max) absorbance range.

Type:

Tuple[float, float]

# Shape properties
mean_slope

Average spectral slope (per 1000 nm).

Type:

float

slope_std

Standard deviation of slopes.

Type:

float

mean_curvature

Average curvature (second derivative).

Type:

float
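To make the "per 1000 nm" unit of `mean_slope` concrete, the sketch below fits an ordinary least-squares line to each spectrum and rescales the slope. The choice of estimator is an assumption, since the module's exact slope computation is not given in this reference, and `slope_per_1000nm` is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence


def slope_per_1000nm(wavelengths: Sequence[float], spectrum: Sequence[float]) -> float:
    """Ordinary least-squares slope of absorbance vs wavelength, per 1000 nm."""
    if len(wavelengths) != len(spectrum) or len(wavelengths) < 2:
        raise ValueError("need two or more matching points")
    wl_mean = mean(wavelengths)
    ab_mean = mean(spectrum)
    covariance = sum((w - wl_mean) * (a - ab_mean) for w, a in zip(wavelengths, spectrum))
    variance = sum((w - wl_mean) ** 2 for w in wavelengths)
    return 1000.0 * covariance / variance


wl = [1000.0 + 2.0 * i for i in range(751)]
# Two hypothetical spectra: straight lines rising 0.2 and 0.4 absorbance per 1000 nm.
spectra = [[0.5 + 0.0002 * w for w in wl], [0.3 + 0.0004 * w for w in wl]]
slopes = [slope_per_1000nm(wl, s) for s in spectra]
print(round(mean(slopes), 6))
```

Under this reading, `slope_std` would be the spread of the per-spectrum values in `slopes`.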

# Distribution statistics
skewness

Skewness of absorbance distribution.

Type:

float

kurtosis

Kurtosis of absorbance distribution.

Type:

float
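For reference, the two distribution statistics can be computed from central moments as sketched below. This uses the population form and reports excess kurtosis (zero for a normal distribution). Whether the module reports excess or raw kurtosis is not stated here, so treat that convention as an assumption.

```python
from statistics import mean
from typing import Sequence, Tuple


def skew_and_excess_kurtosis(values: Sequence[float]) -> Tuple[float, float]:
    """Moment-based skewness and excess kurtosis (population form)."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("moments are undefined for constant data")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical absorbance values: one symmetric set, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 1.0, 6.0]
sym_skew, sym_kurt = skew_and_excess_kurtosis(symmetric)
tail_skew, _ = skew_and_excess_kurtosis(right_tailed)
print(sym_skew, round(sym_kurt, 3), tail_skew > 0)
```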

# Noise characteristics
noise_estimate

Estimated noise level.

Type:

float

snr_estimate

Signal-to-noise ratio estimate.

Type:

float

# PCA properties
pca_explained_variance

Explained variance ratio of each PCA component.

Type:

numpy.ndarray | None

pca_n_components_95

Number of PCA components needed to explain 95% of the variance.

Type:

int
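The relationship between `pca_explained_variance` and `pca_n_components_95` can be sketched as a cumulative sum over the ratios. The ratios below are hypothetical, and the fallback used when the threshold is never reached is an assumption.

```python
from itertools import accumulate
from typing import Sequence


def n_components_for(ratios: Sequence[float], threshold: float = 0.95) -> int:
    """Smallest number of leading components whose ratios reach threshold."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    # Threshold not reached with the components available.
    return len(ratios)


# Hypothetical explained-variance ratios, largest first.
ratios = [0.70, 0.20, 0.06, 0.03, 0.01]
print(n_components_for(ratios))
```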

# Peak analysis
n_peaks_mean

Mean number of peaks.

Type:

float

peak_positions

Wavelengths of detected peaks.

Type:

numpy.ndarray | None

peak_wavenumbers

Wavenumber positions of peaks.

Type:

numpy.ndarray | None
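`peak_positions` is expressed in wavelength (nm) and `peak_wavenumbers` in wavenumber. The two are related by the standard conversion, wavenumber in cm⁻¹ = 10⁷ / wavelength in nm, sketched here with hypothetical peak positions.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return 1.0e7 / wavelength_nm


# Hypothetical peak positions in nm.
peak_positions_nm = [1450.0, 1940.0, 2000.0, 2500.0]
peak_wavenumbers = [nm_to_wavenumber(p) for p in peak_positions_nm]
print([round(w, 1) for w in peak_wavenumbers])
```

Longer wavelengths map to smaller wavenumbers, so a peak list sorted by wavelength comes out in descending wavenumber order.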

# Phase 1-4 Enhanced properties
# Instrument indicators
effective_resolution

Estimated spectral resolution from peak widths.

Type:

float

noise_correlation_length

Correlation length of noise (detector indicator).

Type:

float

wavelength_range

Actual wavelength range of data.

Type:

Tuple[float, float]

# Measurement mode indicators
baseline_offset

Mean baseline offset (transmittance indicator).

Type:

float

kubelka_munk_linearity

K-M linearity score (reflectance indicator).

Type:

float

baseline_convexity

Convexity of baseline (ATR indicator).

Type:

float
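`kubelka_munk_linearity` is described only as a reflectance indicator, and its scoring is not documented here. As background, the standard Kubelka-Munk transform for diffuse reflectance is F(R) = (1 - R)² / (2R), sketched below with illustrative values.

```python
from typing import List, Sequence


def kubelka_munk(reflectance: Sequence[float]) -> List[float]:
    """Kubelka-Munk transform F(R) = (1 - R)^2 / (2 R) for 0 < R <= 1."""
    if any(r <= 0 or r > 1 for r in reflectance):
        raise ValueError("reflectance values must lie in (0, 1]")
    return [(1.0 - r) ** 2 / (2.0 * r) for r in reflectance]


# Illustrative reflectance values; F(R) grows as reflectance drops.
km_values = kubelka_munk([1.0, 0.5, 0.25])
print(km_values)
```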

# Environmental indicators
water_band_variation

Variation in water band region.

Type:

float

oh_band_positions

Detected O-H band positions.

Type:

numpy.ndarray | None

temperature_sensitivity_score

Score for temperature effect detection.

Type:

float

# Scattering indicators
scatter_baseline_slope

Wavelength-dependent scatter slope.

Type:

float

scatter_baseline_curvature

Curvature from scattering.

Type:

float

sample_to_sample_offset_std

Sample-to-sample offset variation.

Type:

float

sample_to_sample_slope_std

Sample-to-sample slope variation.

Type:

float

# Domain indicators
protein_band_intensity

Intensity in protein band regions.

Type:

float

carbohydrate_band_intensity

Intensity in carbohydrate regions.

Type:

float

lipid_band_intensity

Intensity in lipid band regions.

Type:

float

water_band_intensity

Intensity in water band regions.

Type:

float

baseline_convexity: float = 0.0
baseline_offset: float = 0.0
carbohydrate_band_intensity: float = 0.0
curvature_std: float = 0.0
effective_resolution: float = 8.0
global_mean: float = 0.0
global_range: Tuple[float, float] = (0.0, 0.0)
global_std: float = 0.0
kubelka_munk_linearity: float = 0.0
kurtosis: float = 0.0
lipid_band_intensity: float = 0.0
mean_curvature: float = 0.0
mean_slope: float = 0.0
mean_spectrum: ndarray | None = None
n_peaks_mean: float = 0.0
n_samples: int = 0
n_wavelengths: int = 0
name: str = 'dataset'
noise_correlation_length: float = 1.0
noise_estimate: float = 0.0
oh_band_positions: ndarray | None = None
pca_explained_variance: ndarray | None = None
pca_n_components_95: int = 0
peak_positions: ndarray | None = None
peak_wavenumbers: ndarray | None = None
protein_band_intensity: float = 0.0
sample_to_sample_offset_std: float = 0.0
sample_to_sample_slope_std: float = 0.0
scatter_baseline_curvature: float = 0.0
scatter_baseline_slope: float = 0.0
skewness: float = 0.0
slope_std: float = 0.0
slopes: ndarray | None = None
snr_estimate: float = 0.0
std_spectrum: ndarray | None = None
temperature_sensitivity_score: float = 0.0
water_band_intensity: float = 0.0
water_band_variation: float = 0.0
wavelength_range: Tuple[float, float] = (1000.0, 2500.0)
wavelengths: ndarray | None = None
nirs4all.data.synthetic.fitter.compare_datasets(X_synthetic: ndarray, X_real: ndarray, wavelengths: ndarray | None = None) → Dict[str, Any]

Quick comparison between synthetic and real datasets.

Parameters:
  • X_synthetic – Synthetic spectra.

  • X_real – Real spectra.

  • wavelengths – Wavelength grid.

Returns:

Dictionary with comparison metrics.

Example

>>> metrics = compare_datasets(X_synth, X_real)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
nirs4all.data.synthetic.fitter.compute_spectral_properties(X: ndarray, wavelengths: ndarray | None = None, name: str = 'dataset', n_pca_components: int = 20) → SpectralProperties

Compute comprehensive spectral properties of a dataset.

Analyzes a matrix of spectra to extract statistical and spectral properties useful for fitting and comparison. Includes Phase 1-4 enhanced properties for instrument, mode, domain, and effect inference.

Parameters:
  • X – Spectra matrix (n_samples, n_wavelengths).

  • wavelengths – Optional wavelength grid.

  • name – Dataset identifier.

  • n_pca_components – Maximum PCA components to compute.

Returns:

SpectralProperties with computed metrics.

Example

>>> props = compute_spectral_properties(X_real, wavelengths)
>>> print(f"Mean slope: {props.mean_slope:.4f}")
>>> print(f"Inferred resolution: {props.effective_resolution:.1f} nm")
nirs4all.data.synthetic.fitter.fit_to_real_data(X: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, name: str = 'source') → FittedParameters

Quick function to fit parameters to real data.

Convenience function for simple fitting use cases.

Parameters:
  • X – Real spectra or SpectroDataset.

  • wavelengths – Wavelength grid.

  • name – Dataset name.

Returns:

FittedParameters object.

Example

>>> params = fit_to_real_data(X_real, wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())