nirs4all.data.synthetic.fitter module

Real data fitting utilities for synthetic NIRS spectra generation.

This module provides tools to analyze real NIRS datasets and fit generator parameters to match their statistical and spectral properties.

Key Features:

- Statistical property analysis (mean, std, skewness, kurtosis)
- Spectral shape analysis (slope, curvature, noise)
- PCA structure analysis
- Parameter estimation for SyntheticNIRSGenerator
- Comparison between synthetic and real data

Phase 1-4 Enhanced Features:

- Instrument archetype inference (InGaAs, PbS, MEMS, etc.)
- Measurement mode detection (transmittance, reflectance, ATR)
- Application domain suggestion (agriculture, pharmaceutical, etc.)
- Environmental effects estimation (temperature, moisture)
- Scattering parameter estimation (particle size, EMSC)
- Wavenumber-based peak analysis for component identification

Example

>>> from nirs4all.data.synthetic import RealDataFitter, SyntheticNIRSGenerator
>>>
>>> # Analyze real data
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Create generator with fitted parameters (includes all Phase 1-4 features)
>>> generator = fitter.create_matched_generator()
>>> X_synthetic, _, _ = generator.generate(n_samples=1000)
>>>
>>> # Or get all inferred characteristics
>>> print(f"Inferred instrument: {params.inferred_instrument}")
>>> print(f"Inferred domain: {params.inferred_domain}")
>>> print(f"Measurement mode: {params.measurement_mode}")

References

- Based on comparator.py from bench/synthetic/
- Enhanced with Phase 1-4 synthetic generator features

class nirs4all.data.synthetic.fitter.DomainInference(domain_name: str = 'unknown', category: str = 'unknown', confidence: float = 0.0, detected_components: ~typing.List[str] = <factory>, alternative_domains: ~typing.Dict[str, float] = <factory>)[source]

Bases: object

Results of application domain inference.

domain_name

Best matching domain name.

Type:: str

category

Domain category.

Type:: str

confidence

Confidence score (0-1).

Type:: float

detected_components

Components detected from peak analysis.

Type:: List[str]

alternative_domains

Other possible domains with scores.

Type:: Dict[str, float]

alternative_domains: Dict[str, float]

category: str = 'unknown'

confidence: float = 0.0

detected_components: List[str]

domain_name: str = 'unknown'
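
The `<factory>` defaults in the signature indicate that `detected_components` and `alternative_domains` are created per instance rather than shared. The sketch below illustrates this with a hypothetical stand-in, `DomainInferenceSketch`, that mirrors the documented fields. The helper `ranked_domains` is an assumption for illustration and is not part of nirs4all; the scores are made-up values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DomainInferenceSketch:
    """Hypothetical stand-in mirroring the documented DomainInference fields."""

    domain_name: str = "unknown"
    category: str = "unknown"
    confidence: float = 0.0
    detected_components: List[str] = field(default_factory=list)
    alternative_domains: Dict[str, float] = field(default_factory=dict)


def ranked_domains(inference: DomainInferenceSketch) -> List[Tuple[str, float]]:
    """Return the best domain first, then the alternatives by descending score."""
    alternatives = sorted(
        inference.alternative_domains.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(inference.domain_name, inference.confidence)] + alternatives


# Illustrative values only; real scores would come from RealDataFitter.fit().
example = DomainInferenceSketch(
    domain_name="agriculture",
    category="example_category",
    confidence=0.72,
    detected_components=["water", "protein"],
    alternative_domains={"pharmaceutical": 0.15, "hypothetical_domain": 0.40},
)
ranking = ranked_domains(example)
print(ranking)

# Each instance owns its list, so mutating one does not affect another.
first, second = DomainInferenceSketch(), DomainInferenceSketch()
first.detected_components.append("water")
print(second.detected_components)
```

A ranking like this can be useful when `confidence` is low and the alternatives deserve a manual look.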

class nirs4all.data.synthetic.fitter.EnvironmentalInference(estimated_temperature_variation: float = 0.0, has_temperature_effects: bool = False, estimated_moisture_variation: float = 0.0, has_moisture_effects: bool = False, water_band_shift: float = 0.0)[source]

Bases: object

Results of environmental effects inference.

estimated_temperature_variation

Estimated temperature variation (°C).

Type:: float

has_temperature_effects

Whether temperature effects are detectable.

Type:: bool

estimated_moisture_variation

Estimated moisture variation.

Type:: float

has_moisture_effects

Whether moisture effects are detectable.

Type:: bool

water_band_shift

Detected shift in water bands (nm).

Type:: float

estimated_moisture_variation: float = 0.0

estimated_temperature_variation: float = 0.0

has_moisture_effects: bool = False

has_temperature_effects: bool = False

water_band_shift: float = 0.0

class nirs4all.data.synthetic.fitter.FittedParameters(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, global_slope_mean: float = 0.0, global_slope_std: float = 0.02, noise_base: float = 0.001, noise_signal_dep: float = 0.005, path_length_std: float = 0.05, baseline_amplitude: float = 0.02, scatter_alpha_std: float = 0.05, scatter_beta_std: float = 0.01, tilt_std: float = 0.01, complexity: str = 'realistic', source_name: str = '', source_properties: ~nirs4all.data.synthetic.fitter.SpectralProperties | None = None, inferred_instrument: str = 'unknown', instrument_inference: ~nirs4all.data.synthetic.fitter.InstrumentInference | None = None, measurement_mode: str = 'transmittance', measurement_mode_confidence: float = 0.0, inferred_domain: str = 'unknown', domain_inference: ~nirs4all.data.synthetic.fitter.DomainInference | None = None, environmental_inference: ~nirs4all.data.synthetic.fitter.EnvironmentalInference | None = None, temperature_config: ~typing.Dict[str, ~typing.Any] = <factory>, moisture_config: ~typing.Dict[str, ~typing.Any] = <factory>, scattering_inference: ~nirs4all.data.synthetic.fitter.ScatteringInference | None = None, particle_size_config: ~typing.Dict[str, ~typing.Any] = <factory>, emsc_config: ~typing.Dict[str, ~typing.Any] = <factory>, detected_components: ~typing.List[str] = <factory>, suggested_n_components: int = 5)[source]

Bases: object

Parameters fitted from real data for synthetic generation.

This dataclass contains all parameters needed to configure a SyntheticNIRSGenerator to produce spectra similar to a real dataset, including Phase 1-4 enhanced features.

# Basic wavelength grid

wavelength_start

Start wavelength (nm).

Type:: float

wavelength_end

End wavelength (nm).

Type:: float

wavelength_step

Wavelength step (nm).

Type:: float
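
As a rough illustration of how the three grid fields relate, the sketch below builds an evenly spaced grid from the documented defaults. Whether the generator includes the end point is an assumption here, and `wavelength_grid` is a hypothetical helper, not a nirs4all function.

```python
from typing import List


def wavelength_grid(
    start: float = 1000.0, end: float = 2500.0, step: float = 2.0
) -> List[float]:
    """Return evenly spaced wavelengths in nm, end point included (assumed)."""
    if step <= 0:
        raise ValueError("step must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    # Round before truncating so floating-point error cannot drop the last point.
    n_points = int(round((end - start) / step)) + 1
    return [start + i * step for i in range(n_points)]


grid = wavelength_grid()
print(len(grid), grid[0], grid[-1])  # 751 1000.0 2500.0
```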

# Slope and baseline parameters

global_slope_mean

Mean global slope.

Type:: float

global_slope_std

Slope standard deviation.

Type:: float

baseline_amplitude

Baseline drift amplitude.

Type:: float

# Noise parameters

noise_base

Base noise level.

Type:: float

noise_signal_dep

Signal-dependent noise factor.

Type:: float
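
This reference does not spell out how the generator combines the two noise parameters. A minimal sketch, assuming Gaussian noise whose standard deviation is `noise_base + noise_signal_dep * |A|` at each absorbance value `A`; the function `add_noise` is hypothetical and only meant to show how a constant floor and a signal-dependent term can interact.

```python
import random
from typing import List, Optional, Sequence


def add_noise(
    spectrum: Sequence[float],
    noise_base: float = 0.001,
    noise_signal_dep: float = 0.005,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add Gaussian noise with a constant floor plus a signal-dependent term.

    Assumed model (not confirmed by the library documentation):
        sigma(A) = noise_base + noise_signal_dep * |A|
    """
    rng = rng or random.Random()
    noisy = []
    for absorbance in spectrum:
        sigma = noise_base + noise_signal_dep * abs(absorbance)
        noisy.append(absorbance + rng.gauss(0.0, sigma))
    return noisy


clean = [0.2, 0.5, 1.0]
noisy = add_noise(clean, rng=random.Random(0))  # seeded for reproducibility
print(noisy)
```

Under this assumption, stronger absorbance bands receive proportionally more noise, while `noise_base` sets the noise level where the signal is close to zero.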

# Scatter parameters

path_length_std

Path length variation.

Type:: float

scatter_alpha_std

Multiplicative scatter std.

Type:: float

scatter_beta_std

Additive scatter std.

Type:: float

tilt_std

Spectral tilt standard deviation.

Type:: float

# Complexity

complexity

Suggested complexity level.

Type:: str

# Source metadata

source_name

Name of source dataset.

Type:: str

source_properties

Full SpectralProperties of source.

Type:: nirs4all.data.synthetic.fitter.SpectralProperties | None

# Phase 1-4 Enhanced Parameters

# Instrument inference

inferred_instrument

Inferred instrument archetype.

Type:: str

instrument_inference

Full instrument inference result.

Type:: nirs4all.data.synthetic.fitter.InstrumentInference | None

# Measurement mode

measurement_mode

Inferred measurement mode.

Type:: str

measurement_mode_confidence

Confidence of the measurement mode inference.

Type:: float

# Domain inference

inferred_domain

Inferred application domain.

Type:: str

domain_inference

Full domain inference result.

Type:: nirs4all.data.synthetic.fitter.DomainInference | None

# Environmental effects

environmental_inference

Environmental effects inference.

Type:: nirs4all.data.synthetic.fitter.EnvironmentalInference | None

temperature_config

Suggested temperature config parameters.

Type:: Dict[str, Any]

moisture_config

Suggested moisture config parameters.

Type:: Dict[str, Any]

# Scattering effects

scattering_inference

Scattering effects inference.

Type:: nirs4all.data.synthetic.fitter.ScatteringInference | None

particle_size_config

Suggested particle size config parameters.

Type:: Dict[str, Any]

emsc_config

Suggested EMSC config parameters.

Type:: Dict[str, Any]

# Detected components for procedural generation

detected_components

List of detected/inferred component names.

Type:: List[str]

suggested_n_components

Suggested number of components.

Type:: int

baseline_amplitude: float = 0.02

complexity: str = 'realistic'

detected_components: List[str]

domain_inference: DomainInference | None = None

emsc_config: Dict[str, Any]

environmental_inference: EnvironmentalInference | None = None

classmethod from_dict(data: Dict[str, Any]) → FittedParameters[source]

Create FittedParameters from a dictionary.

Parameters:: data – Dictionary with parameter values.
Returns:: FittedParameters instance.

global_slope_mean: float = 0.0

global_slope_std: float = 0.02

inferred_domain: str = 'unknown'

inferred_instrument: str = 'unknown'

instrument_inference: InstrumentInference | None = None

classmethod load(path: str) → FittedParameters[source]

Load parameters from JSON file.

Parameters:: path – Input file path.
Returns:: FittedParameters instance.

measurement_mode: str = 'transmittance'

measurement_mode_confidence: float = 0.0

moisture_config: Dict[str, Any]

noise_base: float = 0.001

noise_signal_dep: float = 0.005

particle_size_config: Dict[str, Any]

path_length_std: float = 0.05

save(path: str) → None[source]

Save parameters to JSON file.

Parameters:: path – Output file path.
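
The JSON round trip offered by `save()` and `load()` can be pictured with a small stand-in dataclass. `MiniParams` is hypothetical and carries only a few of the documented fields; the actual file layout written by `FittedParameters.save()` is not specified in this reference, so treat this as a sketch of the pattern rather than the library's format.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MiniParams:
    """Hypothetical subset of FittedParameters, for illustration only."""

    wavelength_start: float = 1000.0
    wavelength_end: float = 2500.0
    inferred_instrument: str = "unknown"
    detected_components: List[str] = field(default_factory=list)

    def save(self, path: str) -> None:
        """Write all fields to a JSON file."""
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(asdict(self), handle, indent=2)

    @classmethod
    def load(cls, path: str) -> "MiniParams":
        """Rebuild an instance from a JSON file written by save()."""
        with open(path, encoding="utf-8") as handle:
            return cls(**json.load(handle))


original = MiniParams(
    inferred_instrument="hypothetical_archetype",
    detected_components=["water", "protein"],
)
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "params.json")
    original.save(json_path)
    restored = MiniParams.load(json_path)

print(restored == original)  # True
```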

scatter_alpha_std: float = 0.05

scatter_beta_std: float = 0.01

scattering_inference: ScatteringInference | None = None

source_name: str = ''

source_properties: SpectralProperties | None = None

suggested_n_components: int = 5

summary() → str[source]

Generate a human-readable summary of fitted parameters.

Returns:: Multi-line summary string.

temperature_config: Dict[str, Any]

tilt_std: float = 0.01

to_dict() → Dict[str, Any][source]

Convert all parameters to a dictionary.

Returns:: Dictionary with all parameter values.

to_full_config() → Dict[str, Any][source]

Convert all fitted parameters to a comprehensive configuration.

This includes all Phase 1-4 parameters for complete synthetic data generation matching the source dataset.

Returns:: Dictionary with all configuration parameters.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> config = params.to_full_config()
>>> # Use with builder pattern or advanced configuration

to_generator_kwargs() → Dict[str, Any][source]

Convert fitted parameters to kwargs for SyntheticNIRSGenerator.

Returns:: Dictionary of keyword arguments.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())

wavelength_end: float = 2500.0

wavelength_start: float = 1000.0

wavelength_step: float = 2.0

class nirs4all.data.synthetic.fitter.InstrumentInference(archetype_name: str = 'unknown', detector_type: str = 'unknown', wavelength_range: ~typing.Tuple[float, float] = (1000.0, 2500.0), estimated_resolution: float = 8.0, confidence: float = 0.0, alternative_archetypes: ~typing.Dict[str, float] = <factory>)[source]

Bases: object

Results of instrument archetype inference.

archetype_name

Best matching instrument archetype name.

Type:: str

detector_type

Inferred detector type.

Type:: str

wavelength_range

Detected wavelength range.

Type:: Tuple[float, float]

estimated_resolution

Estimated spectral resolution (nm).

Type:: float

confidence

Confidence score (0-1).

Type:: float

alternative_archetypes

Other possible archetypes with scores.

Type:: Dict[str, float]

alternative_archetypes: Dict[str, float]

archetype_name: str = 'unknown'

confidence: float = 0.0

detector_type: str = 'unknown'

estimated_resolution: float = 8.0

wavelength_range: Tuple[float, float] = (1000.0, 2500.0)

class nirs4all.data.synthetic.fitter.MeasurementModeInference(value)[source]

Bases: str, Enum

Inferred measurement mode from spectral analysis.

ATR = 'atr'

REFLECTANCE = 'reflectance'

TRANSFLECTANCE = 'transflectance'

TRANSMITTANCE = 'transmittance'

UNKNOWN = 'unknown'
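
Because the enum derives from both `str` and `Enum`, its members compare equal to their plain string values, which is convenient when `FittedParameters.measurement_mode` is stored as a `str`. The sketch below redefines the enum locally from the values listed above so that it runs without nirs4all; `parse_mode` is a hypothetical helper, not a library function.

```python
from enum import Enum


class MeasurementModeInference(str, Enum):
    """Local mirror of the documented enum, for illustration."""

    ATR = "atr"
    REFLECTANCE = "reflectance"
    TRANSFLECTANCE = "transflectance"
    TRANSMITTANCE = "transmittance"
    UNKNOWN = "unknown"


def parse_mode(text: str) -> MeasurementModeInference:
    """Hypothetical helper: map free text to a member, else UNKNOWN."""
    try:
        return MeasurementModeInference(text.strip().lower())
    except ValueError:
        return MeasurementModeInference.UNKNOWN


mode = parse_mode("Reflectance")
print(mode is MeasurementModeInference.REFLECTANCE)  # True
print(mode == "reflectance")  # True, thanks to the str mixin
print(parse_mode("fluorescence").value)  # unknown
```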

class nirs4all.data.synthetic.fitter.RealDataFitter[source]

Bases: object

Fit generator parameters to match real dataset properties.

This class analyzes real NIRS data and estimates parameters for the SyntheticNIRSGenerator to produce similar spectra. Includes Phase 1-4 enhanced inference for instruments, domains, and effects.

source_properties: SpectralProperties of the analyzed data.

fitted_params: FittedParameters after fitting.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Access inferred characteristics
>>> print(f"Instrument: {params.inferred_instrument}")
>>> print(f"Domain: {params.inferred_domain}")
>>>
>>> # Create matched generator
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)

create_matched_generator(random_state: int | None = None) → SyntheticNIRSGenerator[source]

Create a SyntheticNIRSGenerator configured to match the fitted data.

This method creates a generator with all fitted parameters including Phase 1-4 enhanced features (instrument, domain, effects).

Parameters:: random_state – Random seed for reproducibility.
Returns:: Configured SyntheticNIRSGenerator instance.
Raises:: RuntimeError – If fit() hasn’t been called.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator(random_state=42)
>>> X_synth, _, _ = generator.generate(1000)

evaluate_similarity(X_synthetic: ndarray, wavelengths: ndarray | None = None) → Dict[str, Any][source]

Evaluate similarity between synthetic and source data.

Computes various metrics comparing synthetic spectra to the original real data.

Parameters:

X_synthetic – Synthetic spectra matrix.
wavelengths – Optional wavelength grid.

Returns:

Dictionary with similarity metrics.

Raises:

RuntimeError – If fit() hasn’t been called.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
>>> metrics = fitter.evaluate_similarity(X_synth)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")

fit(X: np.ndarray | 'SpectroDataset', *, wavelengths: np.ndarray | None = None, name: str = 'source', infer_instrument: bool = True, infer_domain: bool = True, infer_measurement_mode: bool = True, infer_environmental: bool = True, infer_scattering: bool = True) → FittedParameters[source]

Fit generator parameters to real data.

Analyzes the input data and estimates optimal parameters for generating synthetic spectra with similar properties. Includes Phase 1-4 enhanced inference.

Parameters:

X – Real spectra matrix (n_samples, n_wavelengths) or SpectroDataset.
wavelengths – Wavelength grid (required if X is ndarray).
name – Dataset name for reference.
infer_instrument – Whether to infer instrument archetype.
infer_domain – Whether to infer application domain.
infer_measurement_mode – Whether to infer measurement mode.
infer_environmental – Whether to infer environmental effects.
infer_scattering – Whether to infer scattering parameters.

Returns:

FittedParameters object with estimated parameters.

Raises:

ValueError – If X is empty or has wrong shape.

Example

>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl, name="wheat")
>>> print(params.summary())

fit_from_path(path: str, *, name: str | None = None) → FittedParameters[source]

Fit parameters from a dataset path.

Loads data using DatasetConfigs and fits parameters.

Parameters:

path – Path to dataset folder.
name – Optional name override.

Returns:

FittedParameters object.

Example

>>> params = fitter.fit_from_path("sample_data/regression")

get_tuning_recommendations() → List[str][source]

Get recommendations for tuning generation parameters.

Based on the fitted parameters and source data, provides suggestions for manual tuning.

Returns:: List of recommendation strings.

Example

>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> for rec in fitter.get_tuning_recommendations():
...     print(f"- {rec}")

class nirs4all.data.synthetic.fitter.ScatteringInference(has_scatter_effects: bool = False, estimated_particle_size_um: float = 50.0, multiplicative_scatter_std: float = 0.0, additive_scatter_std: float = 0.0, baseline_curvature: float = 0.0, snv_correctable: bool = False, msc_correctable: bool = False)[source]

Bases: object

Results of scattering effects inference.

has_scatter_effects

Whether significant scatter is detected.

Type:: bool

estimated_particle_size_um

Estimated mean particle size (μm).

Type:: float

multiplicative_scatter_std

Estimated MSC-style multiplicative scatter.

Type:: float

additive_scatter_std

Estimated SNV-style additive scatter.

Type:: float

baseline_curvature

Detected baseline curvature intensity.

Type:: float

snv_correctable

Whether SNV would improve spectra.

Type:: bool

msc_correctable

Whether MSC would improve spectra.

Type:: bool

additive_scatter_std: float = 0.0

baseline_curvature: float = 0.0

estimated_particle_size_um: float = 50.0

has_scatter_effects: bool = False

msc_correctable: bool = False

multiplicative_scatter_std: float = 0.0

snv_correctable: bool = False
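
To make the `snv_correctable` idea concrete, the sketch below applies Standard Normal Variate (SNV) to a spectrum and to a copy distorted by a multiplicative factor and an additive offset. SNV centers and scales each spectrum individually, so both versions collapse onto the same curve. The distortion values are illustrative, and how nirs4all decides the `snv_correctable` flag is not described here.

```python
from statistics import mean, stdev
from typing import List, Sequence


def snv(spectrum: Sequence[float]) -> List[float]:
    """Standard Normal Variate: center and scale a single spectrum."""
    center = mean(spectrum)
    spread = stdev(spectrum)  # sample standard deviation (ddof = 1)
    if spread == 0:
        raise ValueError("SNV is undefined for a constant spectrum")
    return [(value - center) / spread for value in spectrum]


base = [0.10, 0.30, 0.60, 0.40, 0.20]
# Simulated scatter: multiplicative factor 1.2 and additive offset 0.05.
scattered = [1.2 * value + 0.05 for value in base]

max_difference = max(abs(a - b) for a, b in zip(snv(base), snv(scattered)))
print(max_difference < 1e-9)  # True: SNV removes both distortions
```

Some implementations use the population standard deviation instead; the choice only rescales the result by a constant factor.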

class nirs4all.data.synthetic.fitter.SpectralProperties(name: str = 'dataset', n_samples: int = 0, n_wavelengths: int = 0, wavelengths: ndarray | None = None, mean_spectrum: ndarray | None = None, std_spectrum: ndarray | None = None, global_mean: float = 0.0, global_std: float = 0.0, global_range: Tuple[float, float] = (0.0, 0.0), mean_slope: float = 0.0, slope_std: float = 0.0, slopes: ndarray | None = None, mean_curvature: float = 0.0, curvature_std: float = 0.0, skewness: float = 0.0, kurtosis: float = 0.0, noise_estimate: float = 0.0, snr_estimate: float = 0.0, pca_explained_variance: ndarray | None = None, pca_n_components_95: int = 0, n_peaks_mean: float = 0.0, peak_positions: ndarray | None = None, peak_wavenumbers: ndarray | None = None, effective_resolution: float = 8.0, noise_correlation_length: float = 1.0, wavelength_range: Tuple[float, float] = (1000.0, 2500.0), baseline_offset: float = 0.0, kubelka_munk_linearity: float = 0.0, baseline_convexity: float = 0.0, water_band_variation: float = 0.0, oh_band_positions: ndarray | None = None, temperature_sensitivity_score: float = 0.0, scatter_baseline_slope: float = 0.0, scatter_baseline_curvature: float = 0.0, sample_to_sample_offset_std: float = 0.0, sample_to_sample_slope_std: float = 0.0, protein_band_intensity: float = 0.0, carbohydrate_band_intensity: float = 0.0, lipid_band_intensity: float = 0.0, water_band_intensity: float = 0.0)[source]

Bases: object

Container for computed spectral properties of a dataset.

This dataclass holds various statistical and spectral properties computed from a NIRS dataset for comparison and fitting purposes.

name

Dataset identifier.

Type:: str

n_samples

Number of samples.

Type:: int

n_wavelengths

Number of wavelengths.

Type:: int

wavelengths

Wavelength grid.

Type:: numpy.ndarray | None

# Basic statistics

mean_spectrum

Mean spectrum across samples.

Type:: numpy.ndarray | None

std_spectrum

Standard deviation spectrum.

Type:: numpy.ndarray | None

global_mean

Overall mean absorbance.

Type:: float

global_std

Overall standard deviation.

Type:: float

global_range

(min, max) absorbance range.

Type:: Tuple[float, float]

# Shape properties

mean_slope

Average spectral slope (per 1000 nm).

Type:: float
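
One plausible way to obtain a slope in these units is an ordinary least-squares line fit of absorbance against wavelength, scaled to a 1000 nm span. This is an assumption about the method, not a description of the library's code, and `slope_per_1000nm` is a hypothetical helper.

```python
from typing import Sequence


def slope_per_1000nm(wavelengths: Sequence[float], spectrum: Sequence[float]) -> float:
    """Least-squares slope of absorbance vs. wavelength, per 1000 nm."""
    if len(wavelengths) != len(spectrum) or len(wavelengths) < 2:
        raise ValueError("need two or more matching points")
    n = len(wavelengths)
    mean_x = sum(wavelengths) / n
    mean_y = sum(spectrum) / n
    s_xx = sum((x - mean_x) ** 2 for x in wavelengths)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(wavelengths, spectrum))
    if s_xx == 0:
        raise ValueError("wavelengths must not all be equal")
    return 1000.0 * s_xy / s_xx


wl = [1000.0, 1500.0, 2000.0, 2500.0]
# A perfectly linear spectrum rising by 0.0001 absorbance units per nm.
linear = [0.2 + 0.0001 * (w - 1000.0) for w in wl]
slope = slope_per_1000nm(wl, linear)
print(round(slope, 6))  # 0.1
```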

slope_std

Standard deviation of slopes.

Type:: float

mean_curvature

Average curvature (second derivative).

Type:: float

# Distribution statistics

skewness

Skewness of absorbance distribution.

Type:: float

kurtosis

Kurtosis of absorbance distribution.

Type:: float

# Noise characteristics

noise_estimate

Estimated noise level.

Type:: float

snr_estimate

Signal-to-noise ratio estimate.

Type:: float

# PCA properties

pca_explained_variance

Explained variance ratios.

Type:: numpy.ndarray | None

pca_n_components_95

Number of PCA components needed to explain 95% of the variance.

Type:: int
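
Given a list of explained variance ratios such as `pca_explained_variance`, the component count for a variance threshold follows from a cumulative sum. The sketch below shows that calculation with illustrative ratios; `n_components_for_variance` is a hypothetical helper, and the library may compute the value differently.

```python
from typing import Sequence


def n_components_for_variance(ratios: Sequence[float], threshold: float = 0.95) -> int:
    """Smallest number of leading components whose ratios reach the threshold."""
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return count
    # Threshold never reached: all available components are needed.
    return len(ratios)


example_ratios = [0.70, 0.20, 0.06, 0.03, 0.01]  # illustrative values
n_95 = n_components_for_variance(example_ratios)
print(n_95)  # 3
```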

# Peak analysis

n_peaks_mean

Mean number of peaks.

Type:: float

peak_positions

Wavelengths of detected peaks.

Type:: numpy.ndarray | None

peak_wavenumbers

Wavenumber positions of peaks.

Type:: numpy.ndarray | None
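
Wavelength in nm and wavenumber in cm^-1 are related by wavenumber = 10^7 / wavelength. A short sketch of that conversion, using a hypothetical helper name and illustrative peak positions:

```python
from typing import List, Sequence


def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return 1.0e7 / wavelength_nm


def peaks_to_wavenumbers(peak_positions_nm: Sequence[float]) -> List[float]:
    """Convert a list of peak wavelengths (nm) to wavenumbers (cm^-1)."""
    return [nm_to_wavenumber(position) for position in peak_positions_nm]


# Illustrative peak positions spanning the default 1000-2500 nm range.
peaks_cm = peaks_to_wavenumbers([1000.0, 2000.0, 2500.0])
print(peaks_cm)  # [10000.0, 5000.0, 4000.0]
```

Note that the ordering reverses: longer wavelengths map to smaller wavenumbers.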

# Phase 1-4 Enhanced properties

# Instrument indicators

effective_resolution

Estimated spectral resolution from peak widths.

Type:: float

noise_correlation_length

Correlation length of noise (detector indicator).

Type:: float

wavelength_range

Actual wavelength range of data.

Type:: Tuple[float, float]

# Measurement mode indicators

baseline_offset

Mean baseline offset (transmittance indicator).

Type:: float

kubelka_munk_linearity

K-M linearity score (reflectance indicator).

Type:: float
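
For background, the Kubelka-Munk function maps diffuse reflectance R to F(R) = (1 - R)^2 / (2R). The sketch below implements that transform and, assuming the input is apparent absorbance A = -log10(R), converts absorbance back to reflectance first. How `kubelka_munk_linearity` is actually scored is not documented here, so this shows only the transform, with hypothetical helper names.

```python
def absorbance_to_reflectance(absorbance: float) -> float:
    """Invert A = -log10(R); assumes apparent absorbance as input."""
    return 10.0 ** (-absorbance)


def kubelka_munk(reflectance: float) -> float:
    """Kubelka-Munk function F(R) = (1 - R)^2 / (2 R) for 0 < R <= 1."""
    if not 0.0 < reflectance <= 1.0:
        raise ValueError("reflectance must lie in (0, 1]")
    return (1.0 - reflectance) ** 2 / (2.0 * reflectance)


print(kubelka_munk(0.5))  # 0.25
print(kubelka_munk(1.0))  # 0.0, a perfect reflector has no K-M absorption
km_value = kubelka_munk(absorbance_to_reflectance(1.0))  # R = 0.1
```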

baseline_convexity

Convexity of baseline (ATR indicator).

Type:: float

# Environmental indicators

water_band_variation

Variation in water band region.

Type:: float

oh_band_positions

Detected O-H band positions.

Type:: numpy.ndarray | None

temperature_sensitivity_score

Score for temperature effect detection.

Type:: float

# Scattering indicators

scatter_baseline_slope

Wavelength-dependent scatter slope.

Type:: float

scatter_baseline_curvature

Curvature from scattering.

Type:: float

sample_to_sample_offset_std

Sample-to-sample offset variation.

Type:: float

sample_to_sample_slope_std

Sample-to-sample slope variation.

Type:: float

# Domain indicators

protein_band_intensity

Intensity in protein band regions.

Type:: float

carbohydrate_band_intensity

Intensity in carbohydrate regions.

Type:: float

lipid_band_intensity

Intensity in lipid band regions.

Type:: float

water_band_intensity

Intensity in water band regions.

Type:: float

baseline_convexity: float = 0.0

baseline_offset: float = 0.0

carbohydrate_band_intensity: float = 0.0

curvature_std: float = 0.0

effective_resolution: float = 8.0

global_mean: float = 0.0

global_range: Tuple[float, float] = (0.0, 0.0)

global_std: float = 0.0

kubelka_munk_linearity: float = 0.0

kurtosis: float = 0.0

lipid_band_intensity: float = 0.0

mean_curvature: float = 0.0

mean_slope: float = 0.0

mean_spectrum: ndarray | None = None

n_peaks_mean: float = 0.0

n_samples: int = 0

n_wavelengths: int = 0

name: str = 'dataset'

noise_correlation_length: float = 1.0

noise_estimate: float = 0.0

oh_band_positions: ndarray | None = None

pca_explained_variance: ndarray | None = None

pca_n_components_95: int = 0

peak_positions: ndarray | None = None

peak_wavenumbers: ndarray | None = None

protein_band_intensity: float = 0.0

sample_to_sample_offset_std: float = 0.0

sample_to_sample_slope_std: float = 0.0

scatter_baseline_curvature: float = 0.0

scatter_baseline_slope: float = 0.0

skewness: float = 0.0

slope_std: float = 0.0

slopes: ndarray | None = None

snr_estimate: float = 0.0

std_spectrum: ndarray | None = None

temperature_sensitivity_score: float = 0.0

water_band_intensity: float = 0.0

water_band_variation: float = 0.0

wavelength_range: Tuple[float, float] = (1000.0, 2500.0)

wavelengths: ndarray | None = None

nirs4all.data.synthetic.fitter.compare_datasets(X_synthetic: ndarray, X_real: ndarray, wavelengths: ndarray | None = None) → Dict[str, Any][source]

Quick comparison between synthetic and real datasets.

Parameters:

X_synthetic – Synthetic spectra.
X_real – Real spectra.
wavelengths – Wavelength grid.

Returns:

Dictionary with comparison metrics.

Example

>>> metrics = compare_datasets(X_synth, X_real)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")

nirs4all.data.synthetic.fitter.compute_spectral_properties(X: ndarray, wavelengths: ndarray | None = None, name: str = 'dataset', n_pca_components: int = 20) → SpectralProperties[source]

Compute comprehensive spectral properties of a dataset.

Analyzes a matrix of spectra to extract statistical and spectral properties useful for fitting and comparison. Includes Phase 1-4 enhanced properties for instrument, mode, domain, and effect inference.

Parameters:

X – Spectra matrix (n_samples, n_wavelengths).
wavelengths – Optional wavelength grid.
name – Dataset identifier.
n_pca_components – Maximum PCA components to compute.

Returns:

SpectralProperties with computed metrics.

Example

>>> props = compute_spectral_properties(X_real, wavelengths)
>>> print(f"Mean slope: {props.mean_slope:.4f}")
>>> print(f"Inferred resolution: {props.effective_resolution:.1f} nm")

nirs4all.data.synthetic.fitter.fit_to_real_data(X: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, name: str = 'source') → FittedParameters[source]

Quick function to fit parameters to real data.

Convenience function for simple fitting use cases.

Parameters:

X – Real spectra or SpectroDataset.
wavelengths – Wavelength grid.
name – Dataset name.

Returns:

FittedParameters object.

Example

>>> params = fit_to_real_data(X_real, wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())