nirs4all.synthesis.fitter module
Real data fitting utilities for synthetic NIRS spectra generation.
This module provides tools to analyze real NIRS datasets and fit generator parameters to match their statistical and spectral properties.
- Key Features:
Statistical property analysis (mean, std, skewness, kurtosis)
Spectral shape analysis (slope, curvature, noise)
PCA structure analysis
Parameter estimation for SyntheticNIRSGenerator
Comparison between synthetic and real data
- Phase 1-4 Enhanced Features:
Instrument archetype inference (InGaAs, PbS, MEMS, etc.)
Measurement mode detection (transmittance, reflectance, ATR)
Application domain suggestion (agriculture, pharmaceutical, etc.)
Environmental effects estimation (temperature, moisture)
Scattering parameter estimation (particle size, EMSC)
Wavenumber-based peak analysis for component identification
Example
>>> from nirs4all.synthesis import RealDataFitter, SyntheticNIRSGenerator
>>>
>>> # Analyze real data
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Create generator with fitted parameters (includes all Phase 1-4 features)
>>> generator = fitter.create_matched_generator()
>>> X_synthetic, _, _ = generator.generate(n_samples=1000)
>>>
>>> # Or get all inferred characteristics
>>> print(f"Inferred instrument: {params.inferred_instrument}")
>>> print(f"Inferred domain: {params.inferred_domain}")
>>> print(f"Measurement mode: {params.measurement_mode}")
References
Based on comparator.py from bench/synthetic/
Enhanced with Phase 1-4 synthetic generator features
- class nirs4all.synthesis.fitter.ComponentFitResult(component_names: List[str], concentrations: ndarray, baseline_coefficients: ndarray | None, fitted_spectrum: ndarray, residuals: ndarray, r_squared: float, rmse: float, wavelengths: ndarray | None = None)[source]
Bases:
object
Result of fitting spectral components to an observed spectrum.
- concentrations
Estimated concentration for each component.
- Type:
numpy.ndarray
- baseline_coefficients
Polynomial baseline coefficients (if fit_baseline=True).
- Type:
numpy.ndarray | None
- fitted_spectrum
Reconstructed spectrum from fit.
- Type:
numpy.ndarray
- residuals
Difference between observed and fitted spectra.
- Type:
numpy.ndarray
- wavelengths
Wavelength grid used for fitting.
- Type:
numpy.ndarray | None
- class nirs4all.synthesis.fitter.ComponentFitter(component_names: List[str] | None = None, wavelengths: ndarray | None = None, fit_baseline: bool = True, baseline_order: int = 2, preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False, sg_window_length: int = 15, sg_polyorder: int = 2)[source]
Bases:
object
Fit linear combinations of spectral components to observed spectra.
Solves: spectrum ≈ Σ(c_i * component_i(λ)) + baseline
Uses non-negative least squares (NNLS) to ensure positive concentrations, which is physically meaningful for spectroscopic analysis.
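The NNLS formulation above can be sketched with SciPy alone. This is a minimal illustration, not the library's implementation: the Gaussian "components", the grid `wl`, and the trick of entering each baseline term twice (with opposite signs, so that the baseline is not forced to be non-negative) are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical wavelength grid and Gaussian "components" (illustration only).
wl = np.arange(1000.0, 2500.0, 2.0)

def gaussian(center, sigma):
    return np.exp(-0.5 * ((wl - center) / sigma) ** 2)

components = np.column_stack([
    gaussian(1450.0, 40.0),
    gaussian(1940.0, 50.0),
    gaussian(2100.0, 60.0),
])

# Quadratic baseline on a scaled axis. NNLS constrains every coefficient to be
# non-negative, so each baseline term is entered twice (+ and -) to let its
# net coefficient take either sign. This is an assumption of the sketch.
x = (wl - wl.mean()) / (wl.max() - wl.min())
poly = np.column_stack([np.ones_like(x), x, x ** 2])
design = np.hstack([components, poly, -poly])

# Synthetic observation: two components present, one absent, sloped baseline.
true_conc = np.array([0.8, 0.0, 0.3])
observed = components @ true_conc + 0.05 - 0.02 * x

coeffs, _ = nnls(design, observed, maxiter=500)
n_comp, n_poly = components.shape[1], poly.shape[1]
concentrations = coeffs[:n_comp]
baseline = coeffs[n_comp:n_comp + n_poly] - coeffs[n_comp + n_poly:]

# Fit quality, computed the usual way.
fitted = components @ concentrations + poly @ baseline
ss_res = np.sum((observed - fitted) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

The concentrations come out non-negative by construction, while the baseline slope is free to be negative.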
Preprocessing Support: If your observed spectra are preprocessed (e.g., second derivative, SNV), use the preprocessing parameter to apply the same transformation to component spectra before fitting.
Auto-detection: Set auto_detect_preprocessing=True to automatically detect the preprocessing type from the data (recommended for derivative data).
Example
>>> import numpy as np
>>> from nirs4all.synthesis import ComponentFitter
>>>
>>> # Fit with all available components
>>> fitter = ComponentFitter(wavelengths=np.arange(1000, 2500, 2))
>>> result = fitter.fit(observed_spectrum)
>>> print(result.summary())
>>>
>>> # Fit preprocessed data (e.g., second derivative)
>>> fitter = ComponentFitter(
...     component_names=["water", "protein", "lipid"],
...     wavelengths=wavelengths,
...     preprocessing="second_derivative",  # Components will be transformed
... )
>>> result = fitter.fit(derivative_spectrum)
>>>
>>> # Auto-detect preprocessing (recommended for unknown data)
>>> fitter = ComponentFitter(
...     wavelengths=wavelengths,
...     auto_detect_preprocessing=True,  # Will detect derivative, SNV, etc.
... )
>>> result = fitter.fit(unknown_spectrum)
- component_names
List of component names to fit.
- wavelengths
Wavelength grid for fitting.
- fit_baseline
Whether to include polynomial baseline.
- baseline_order
Polynomial order for baseline (default 2).
- preprocessing
Preprocessing to apply to components before fitting.
- auto_detect_preprocessing
If True, detect preprocessing from data.
- detected_preprocessing
The detected preprocessing type (after first fit).
- Type:
PreprocessingType | None
- detected_preprocessing: PreprocessingType | None
- fit(spectrum: ndarray, method: str = 'nnls') ComponentFitResult[source]
Fit components to a single spectrum.
- Parameters:
spectrum – Observed spectrum, shape (n_wavelengths,).
method – Fitting method:
- “nnls”: Non-negative least squares (default, physically meaningful).
- “lsq”: Unconstrained least squares (allows negative concentrations).
- Returns:
ComponentFitResult with concentrations, residuals, and fit quality metrics.
Example
>>> result = fitter.fit(observed_spectrum)
>>> print(f"R² = {result.r_squared:.4f}")
>>> print(f"Top components: {result.top_components(3)}")
- fit_batch(spectra: ndarray, method: str = 'nnls', n_jobs: int = -1) List[ComponentFitResult][source]
Fit components to multiple spectra in parallel.
- Parameters:
spectra – Observed spectra, shape (n_samples, n_wavelengths).
method – Fitting method (“nnls” or “lsq”).
n_jobs – Number of parallel jobs (-1 = all cores, 1 = sequential).
- Returns:
List of ComponentFitResult objects.
Example
>>> results = fitter.fit_batch(X_observed, n_jobs=4)
>>> mean_r2 = np.mean([r.r_squared for r in results])
>>> print(f"Mean R² = {mean_r2:.4f}")
- get_concentration_matrix(spectra: ndarray, method: str = 'nnls', n_jobs: int = -1) Tuple[ndarray, List[str]][source]
Get concentration matrix for batch of spectra.
Convenience method that extracts just the concentrations.
- Parameters:
spectra – Observed spectra, shape (n_samples, n_wavelengths).
method – Fitting method (“nnls” or “lsq”).
n_jobs – Number of parallel jobs.
- Returns:
Tuple of (concentrations, component_names):
concentrations – Array of shape (n_samples, n_components).
component_names – List of component names.
- Return type:
Tuple[ndarray, List[str]]
Example
>>> C, names = fitter.get_concentration_matrix(X_observed)
>>> water_idx = names.index("water")
>>> water_concentrations = C[:, water_idx]
- suggest_components(spectrum: ndarray, top_n: int = 5, threshold: float = 0.01, method: str = 'nnls') List[Tuple[str, float]][source]
Suggest which components are likely present in a spectrum.
Performs a fit and returns the top components by concentration.
- Parameters:
spectrum – Observed spectrum, shape (n_wavelengths,).
top_n – Maximum number of components to return.
threshold – Minimum concentration threshold.
method – Fitting method (“nnls” or “lsq”).
- Returns:
List of (component_name, estimated_concentration) tuples, sorted by concentration descending.
Example
>>> suggestions = fitter.suggest_components(unknown_spectrum)
>>> print("Likely components:")
>>> for name, conc in suggestions:
...     print(f" {name}: {conc:.3f}")
- class nirs4all.synthesis.fitter.DerivativeAwareForwardModelFitter(components: List['SpectralComponent'], canonical_grid: np.ndarray, target_grid: np.ndarray, derivative_order: int = 1, sg_window: int = 15, sg_polyorder: int = 2, baseline_order: int = 6, wl_shift_bounds: Tuple[float, float] = (-5.0, 5.0), ils_sigma_bounds: Tuple[float, float] = (2.0, 15.0), path_length_bounds: Tuple[float, float] = (0.5, 2.0))[source]
Bases:
object
Forward model fitter for derivative-preprocessed datasets.
Key principle: Never fit derivative spectra by adding narrow bands. Instead:
1. Fit latent physical model (raw absorbance)
2. Apply derivative preprocessing to model output
3. Compare in derivative space
This ensures concentrations remain physically interpretable without oscillatory artifacts from narrow compensating peaks.
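The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: the two Gaussian bands, the helper `first_derivative`, and the use of plain NNLS (with no instrument parameters or baseline) are all hypothetical simplifications.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.signal import savgol_filter

wl = np.arange(1000.0, 2500.0, 2.0)

def band(center, sigma):
    # Hypothetical raw-absorbance component (a single Gaussian band).
    return np.exp(-0.5 * ((wl - center) / sigma) ** 2)

raw_components = np.column_stack([band(1450.0, 40.0), band(1940.0, 50.0)])

def first_derivative(spectra):
    # Savitzky-Golay first derivative along the wavelength axis (axis 0).
    return savgol_filter(spectra, window_length=15, polyorder=2, deriv=1, axis=0)

# The observed data arrive already derivative-preprocessed; a constant offset
# in the raw spectrum vanishes under the derivative.
true_conc = np.array([0.6, 0.9])
y_deriv = first_derivative(raw_components @ true_conc + 0.1)

# The unknowns are raw-absorbance concentrations; the model output is pushed
# through the same derivative, and the comparison happens in derivative space.
deriv_components = first_derivative(raw_components)
concentrations, _ = nnls(deriv_components, y_deriv)

fitted_raw = raw_components @ concentrations
fitted_deriv = first_derivative(fitted_raw)
```

Because the unknowns stay in raw-absorbance units, the recovered concentrations keep their physical meaning even though the residual is evaluated on derivative data.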
- components
List of SpectralComponent objects.
- Type:
List['SpectralComponent']
- canonical_grid
High-resolution canonical wavelength grid.
- Type:
np.ndarray
- target_grid
Target wavelength grid (dataset grid).
- Type:
np.ndarray
Example
>>> fitter = DerivativeAwareForwardModelFitter(
...     components=components,
...     canonical_grid=canonical_wl,
...     target_grid=dataset_wl,
...     derivative_order=1,  # First derivative
... )
>>> result = fitter.fit(derivative_spectrum)
>>> print(f"R² = {result['r_squared']:.4f}")
- canonical_grid: np.ndarray
- components: List['SpectralComponent']
- fit(y_deriv: ndarray, initial_guess: ndarray | None = None) Dict[str, Any][source]
Fit forward model to derivative spectrum.
- Parameters:
y_deriv – Target spectrum (already derivative-preprocessed).
initial_guess – Initial [wl_shift, ils_sigma, path_length].
- Returns:
Dict with fitted parameters:
r_squared – Coefficient of determination.
fitted_deriv – Fitted derivative spectrum.
fitted_raw – Reconstructed raw spectrum.
residuals_deriv – Fitting residuals.
concentrations – Fitted component concentrations.
baseline_coeffs – Fitted baseline coefficients.
wl_shift, ils_sigma, path_length – Instrument parameters.
- Return type:
Dict[str, Any]
- target_grid: np.ndarray
- class nirs4all.synthesis.fitter.DomainInference(domain_name: str = 'unknown', category: str = 'unknown', confidence: float = 0.0, detected_components: List[str] = <factory>, alternative_domains: Dict[str, float]=<factory>)[source]
Bases:
object
Results of application domain inference.
- class nirs4all.synthesis.fitter.EdgeArtifactInference(has_edge_artifacts: bool = False, has_detector_rolloff: bool = False, has_stray_light: bool = False, has_truncated_peaks: bool = False, has_edge_curvature: bool = False, left_edge_intensity: float = 0.0, right_edge_intensity: float = 0.0, edge_noise_ratio: float = 1.0, detector_model: str = 'generic_nir', stray_light_fraction: float = 0.0, curvature_type: str = 'none', boundary_peak_amplitudes: Tuple[float, float] = (0.0, 0.0))[source]
Bases:
object
Results of edge artifact inference.
Detects edge deformation effects in NIR spectra caused by:
- Detector sensitivity roll-off at wavelength boundaries
- Stray light effects (more pronounced at edges)
- Truncated absorption bands outside measurement range
- Baseline curvature concentrated at edges
References
JASCO (2020). Advantages of high-sensitivity InGaAs detector.
Applied Optics (1975). Resolution and stray light in NIR spectroscopy.
Burns & Ciurczak (2007). Handbook of Near-Infrared Analysis.
- class nirs4all.synthesis.fitter.EnvironmentalInference(estimated_temperature_variation: float = 0.0, has_temperature_effects: bool = False, estimated_moisture_variation: float = 0.0, has_moisture_effects: bool = False, water_band_shift: float = 0.0)[source]
Bases:
object
Results of environmental effects inference.
- class nirs4all.synthesis.fitter.FittedParameters(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, global_slope_mean: float = 0.0, global_slope_std: float = 0.02, noise_base: float = 0.001, noise_signal_dep: float = 0.005, path_length_std: float = 0.05, baseline_amplitude: float = 0.02, scatter_alpha_std: float = 0.05, scatter_beta_std: float = 0.01, tilt_std: float = 0.01, complexity: str = 'realistic', source_name: str = '', source_properties: SpectralProperties | None = None, inferred_instrument: str = 'unknown', instrument_inference: InstrumentInference | None = None, measurement_mode: str = 'transmittance', measurement_mode_confidence: float = 0.0, inferred_domain: str = 'unknown', domain_inference: DomainInference | None = None, environmental_inference: EnvironmentalInference | None = None, temperature_config: Dict[str, ~typing.Any]=<factory>, moisture_config: Dict[str, ~typing.Any]=<factory>, scattering_inference: ScatteringInference | None = None, particle_size_config: Dict[str, ~typing.Any]=<factory>, emsc_config: Dict[str, ~typing.Any]=<factory>, edge_artifact_inference: EdgeArtifactInference | None = None, edge_artifacts_config: Dict[str, ~typing.Any]=<factory>, boundary_components_config: Dict[str, ~typing.Any]=<factory>, preprocessing_inference: PreprocessingInference | None = None, preprocessing_type: str = 'raw_absorbance', is_preprocessed: bool = False, detected_components: List[str] = <factory>, suggested_n_components: int = 5)[source]
Bases:
object
Parameters fitted from real data for synthetic generation.
This dataclass contains all parameters needed to configure a SyntheticNIRSGenerator to produce spectra similar to a real dataset, including Phase 1-4 enhanced features.
- # Basic wavelength grid
- # Slope and baseline parameters
- # Noise parameters
- # Scatter parameters
- # Complexity
- # Source metadata
- source_properties
Full SpectralProperties of source.
- Type:
SpectralProperties | None
- # Phase 1-4 Enhanced Parameters
- # Instrument inference
- instrument_inference
Full instrument inference result.
- Type:
InstrumentInference | None
- # Measurement mode
- # Domain inference
- domain_inference
Full domain inference result.
- Type:
DomainInference | None
- # Environmental effects
- environmental_inference
Environmental effects inference.
- Type:
EnvironmentalInference | None
- # Scattering effects
- scattering_inference
Scattering effects inference.
- Type:
ScatteringInference | None
- # Detected components for procedural generation
- domain_inference: DomainInference | None = None
- edge_artifact_inference: EdgeArtifactInference | None = None
- environmental_inference: EnvironmentalInference | None = None
- classmethod from_dict(data: Dict[str, Any]) FittedParameters[source]
Create FittedParameters from a dictionary.
- Parameters:
data – Dictionary with parameter values.
- Returns:
FittedParameters instance.
- instrument_inference: InstrumentInference | None = None
- classmethod load(path: str) FittedParameters[source]
Load parameters from JSON file.
- Parameters:
path – Input file path.
- Returns:
FittedParameters instance.
- preprocessing_inference: PreprocessingInference | None = None
- scattering_inference: ScatteringInference | None = None
- source_properties: SpectralProperties | None = None
- summary() str[source]
Generate a human-readable summary of fitted parameters.
- Returns:
Multi-line summary string.
- to_dict() Dict[str, Any][source]
Convert all parameters to a dictionary.
- Returns:
Dictionary with all parameter values.
- to_full_config() Dict[str, Any][source]
Convert all fitted parameters to a comprehensive configuration.
This includes all Phase 1-4 parameters for complete synthetic data generation matching the source dataset.
- Returns:
Dictionary with all configuration parameters.
Example
>>> params = fitter.fit(X_real)
>>> config = params.to_full_config()
>>> # Use with builder pattern or advanced configuration
- class nirs4all.synthesis.fitter.ForwardModelFitter(components: List['SpectralComponent'], canonical_grid: np.ndarray, target_grid: np.ndarray, baseline_order: int = 4, wl_shift_bounds: Tuple[float, float] = (-5.0, 5.0), ils_sigma_bounds: Tuple[float, float] = (2.0, 15.0), path_length_bounds: Tuple[float, float] = (0.5, 2.0))[source]
Bases:
object
Variable projection fitter for physical forward model.
Fits a physical mixture model to observed spectra by separating:
- Linear params: concentrations, baseline coefficients (solved via NNLS/lsq)
- Nonlinear params: wl_shift, ils_sigma, path_length (solved via optimization)
This approach is numerically stable and physically interpretable.
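The linear/nonlinear split described above can be sketched in a few lines. This is a reduced illustration, not the class's implementation: it keeps a single nonlinear parameter (`wl_shift`) instead of three, uses hypothetical Gaussian bands, and the helper names `design` and `projected_residual` are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar, nnls

wl = np.arange(1000.0, 2500.0, 2.0)

def design(wl_shift):
    # Two hypothetical Gaussian bands shifted by wl_shift (nm), plus a
    # constant baseline column.
    centers = np.array([1450.0, 1940.0]) + wl_shift
    bands = np.exp(-0.5 * ((wl[:, None] - centers) / 45.0) ** 2)
    return np.column_stack([bands, np.ones_like(wl)])

def projected_residual(wl_shift, y):
    # Inner step: for a fixed nonlinear parameter the linear parameters have a
    # direct NNLS solution, so only the squared residual norm is returned to
    # the outer search.
    _, rnorm = nnls(design(wl_shift), y)
    return rnorm ** 2

# Synthetic target generated with a known shift and known linear parameters.
y = design(2.5) @ np.array([0.7, 0.4, 0.05])

# Outer step: bounded 1-D search over the nonlinear parameter only.
outer = minimize_scalar(projected_residual, bounds=(-5.0, 5.0), args=(y,),
                        method="bounded")
wl_shift_hat = outer.x

# Recover the linear parameters at the optimum.
linear_hat, _ = nnls(design(wl_shift_hat), y)
```

The outer optimizer only ever sees the nonlinear parameters, which keeps the search low-dimensional no matter how many components are in the mixture.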
- components
List of SpectralComponent objects.
- Type:
List['SpectralComponent']
- canonical_grid
High-resolution canonical wavelength grid.
- Type:
np.ndarray
- target_grid
Target wavelength grid (dataset grid).
- Type:
np.ndarray
Example
>>> import numpy as np
>>> from nirs4all.synthesis._constants import get_predefined_components
>>> components = [get_predefined_components()[n] for n in ['water', 'protein']]
>>> fitter = ForwardModelFitter(
...     components=components,
...     canonical_grid=np.linspace(400, 2500, 4200),
...     target_grid=dataset_wavelengths,
... )
>>> result = fitter.fit(spectrum)
>>> print(f"R² = {result['r_squared']:.4f}")
- canonical_grid: np.ndarray
- components: List['SpectralComponent']
- fit(y: ndarray, initial_guess: ndarray | None = None) Dict[str, Any][source]
Fit forward model to target spectrum.
- Parameters:
y – Target spectrum.
initial_guess – Initial [wl_shift, ils_sigma, path_length].
- Returns:
Dict with fitted parameters:
r_squared – Coefficient of determination.
fitted – Fitted spectrum.
residuals – Fitting residuals.
concentrations – Fitted component concentrations.
baseline_coeffs – Fitted baseline coefficients.
wl_shift, ils_sigma, path_length – Instrument parameters.
- Return type:
Dict[str, Any]
- target_grid: np.ndarray
- class nirs4all.synthesis.fitter.InstrumentChain(wl_shift: float = 0.0, wl_stretch: float = 1.0, ils_sigma: float = 4.0, stray_light: float = 0.001, gain: float = 1.0, offset: float = 0.0)[source]
Bases:
object
Forward instrument chain: canonical grid → dataset grid.
Applies the complete measurement chain to transform a high-resolution physical spectrum to the observed instrument grid.
- Chain:
1. Wavelength warp (shift + stretch)
2. ILS convolution (Gaussian smoothing)
3. Stray light / gain / offset
4. Resample to target grid
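The four stages of the chain can be sketched as a standalone function. Everything here is an assumption made for illustration: the function name `apply_chain`, stretching about the grid center, interpreting `ils_sigma` in nanometres, and the stray-light formula (a constant fraction mixed into the transmitted intensity) are plausible textbook choices, not necessarily what `InstrumentChain.apply` does.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def apply_chain(spectrum, canonical_wl, target_wl, wl_shift=0.0, wl_stretch=1.0,
                ils_sigma=4.0, stray_light=0.001, gain=1.0, offset=0.0):
    # 1. Wavelength warp: stretch about the grid center, then shift
    #    (assumed order).
    center = canonical_wl.mean()
    warped_wl = center + wl_stretch * (canonical_wl - center) + wl_shift
    # 2. ILS convolution: Gaussian of width ils_sigma (assumed to be in nm),
    #    converted to a width in grid samples.
    step = float(np.mean(np.diff(canonical_wl)))
    smoothed = gaussian_filter1d(spectrum, sigma=ils_sigma / step, mode="nearest")
    # 3. Stray light mixes a constant fraction into the transmitted intensity
    #    (assumed form), followed by a linear gain and offset.
    transmitted = 10.0 ** (-smoothed)
    with_stray = -np.log10((transmitted + stray_light) / (1.0 + stray_light))
    observed = gain * with_stray + offset
    # 4. Resample from the warped canonical grid onto the target grid.
    return np.interp(target_wl, warped_wl, observed)

canonical_wl = np.linspace(900.0, 2600.0, 3401)  # 0.5 nm step
target_wl = np.arange(1000.0, 2500.0, 2.0)
spectrum_phys = np.exp(-0.5 * ((canonical_wl - 1500.0) / 20.0) ** 2)
spectrum_obs = apply_chain(spectrum_phys, canonical_wl, target_wl,
                           wl_shift=2.0, ils_sigma=5.0)
```

With `wl_shift=2.0`, the band that sits at 1500 nm on the canonical grid appears near 1502 nm on the target grid, slightly broadened by the ILS step.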
Example
>>> chain = InstrumentChain(wl_shift=2.0, ils_sigma=5.0)
>>> spectrum_obs = chain.apply(spectrum_phys, canonical_wl, target_wl)
- apply(spectrum: ndarray, canonical_wl: ndarray, target_wl: ndarray) ndarray[source]
Apply full instrument chain.
- Parameters:
spectrum – Input spectrum on canonical grid.
canonical_wl – Canonical wavelength grid (nm).
target_wl – Target wavelength grid (nm).
- Returns:
Transformed spectrum on target grid.
- class nirs4all.synthesis.fitter.InstrumentInference(archetype_name: str = 'unknown', detector_type: str = 'unknown', wavelength_range: Tuple[float, float]=(1000.0, 2500.0), estimated_resolution: float = 8.0, confidence: float = 0.0, alternative_archetypes: Dict[str, float]=<factory>)[source]
Bases:
object
Results of instrument archetype inference.
- class nirs4all.synthesis.fitter.MeasurementModeInference(value)[source]
Inferred measurement mode from spectral analysis.
- ATR = 'atr'
- REFLECTANCE = 'reflectance'
- TRANSFLECTANCE = 'transflectance'
- TRANSMITTANCE = 'transmittance'
- UNKNOWN = 'unknown'
- class nirs4all.synthesis.fitter.OperatorVarianceParams(noise_std: float = 0.001, offset_std: float = 0.01, slope_std: float = 0.001, curvature_std: float = 0.0001, mult_scatter_std: float = 0.05)[source]
Bases:
object
Parameters for operator-based variance modeling.
Models spectral variation as independent physical sources:
- High-frequency noise (detector noise)
- Baseline offset/slope/curvature (instrumental drift, scattering)
- Multiplicative scatter (sample thickness, optical path variation)
- class nirs4all.synthesis.fitter.OptimizedComponentFitter(wavelengths: ndarray | None = None, priority_categories: List[str] | None = None, max_components: int = 10, baseline_order: int = 4, preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False, sg_window_length: int = 15, sg_polyorder: int = 3, regularization: float = 1e-06, smooth_sigma_nm: float = 30.0, use_nnls: bool = False)[source]
Bases:
object
Optimize component selection using greedy search with category prioritization.
Unlike ComponentFitter, which fits all components simultaneously with NNLS, this class uses a greedy forward selection approach that:
1. Starts with a baseline-only fit
2. Greedily adds components from priority categories (low threshold)
3. Fills remaining slots from other categories (higher threshold)
4. Applies swap refinement to escape local optima
This approach is intended to give better fits for real-world data by:
- Avoiding overfitting to spurious components
- Respecting domain knowledge (e.g., protein for dairy, starch for grains)
- Allowing both positive and negative coefficients (OLS, not NNLS)
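The core of greedy forward selection with ordinary least squares can be sketched as below. This is a simplified illustration: it uses a single gain threshold (`min_gain`) and omits the category prioritization and swap refinement described above. The function `greedy_select` and the Gaussian library are hypothetical.

```python
import numpy as np

def greedy_select(y, library, baseline, max_components=3, min_gain=1e-4):
    """Add the library column that raises R² most; stop when the gain is small."""
    def r2(cols):
        # OLS fit of baseline columns plus the currently selected components.
        A = np.column_stack([baseline] + [library[:, j] for j in cols])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

    selected, best = [], r2([])  # start from the baseline-only fit
    while len(selected) < max_components:
        candidates = [j for j in range(library.shape[1]) if j not in selected]
        if not candidates:
            break
        scores = {j: r2(selected + [j]) for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

wl = np.arange(1000.0, 2500.0, 2.0)
centers = [1200.0, 1450.0, 1700.0, 1940.0, 2100.0, 2300.0]
library = np.column_stack([np.exp(-0.5 * ((wl - c) / 30.0) ** 2) for c in centers])
x = (wl - wl.mean()) / (wl.max() - wl.min())
baseline = np.column_stack([np.ones_like(x), x])

# Target uses library columns 1 and 4, one of them with a negative coefficient.
y = 0.5 * library[:, 1] - 0.2 * library[:, 4] + 0.1
selected, r2_final = greedy_select(y, library, baseline)
```

Because the inner fit is unconstrained OLS, the negative contribution of column 4 is recovered, which a strict NNLS fit would not allow.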
Example
>>> from nirs4all.synthesis import OptimizedComponentFitter
>>>
>>> # Create fitter for grain analysis
>>> fitter = OptimizedComponentFitter(
...     wavelengths=wavelengths,
...     priority_categories=['carbohydrates', 'proteins', 'water_related'],
...     max_components=10,
... )
>>> result = fitter.fit(spectrum)
>>> print(result.summary())
- wavelengths
Wavelength grid for fitting.
- priority_categories
Categories to prioritize in component selection.
- max_components
Maximum number of components to select.
- baseline_order
Polynomial order for baseline (default 4).
- preprocessing
Preprocessing to apply to components.
- auto_detect_preprocessing
Auto-detect preprocessing from data.
- detected_preprocessing: PreprocessingType | None
- fit(spectrum: ndarray) OptimizedFitResult[source]
Fit components to a spectrum using greedy category-prioritized selection.
The algorithm:
1. Starts with baseline-only fit
2. Greedily adds components from priority categories (very low threshold: 0.0001)
3. Fills remaining slots from other categories (higher threshold: 0.005)
4. Applies swap refinement (prefers swapping in priority components)
- Parameters:
spectrum – Observed spectrum, shape (n_wavelengths,).
- Returns:
OptimizedFitResult with fit results.
- class nirs4all.synthesis.fitter.OptimizedFitResult(component_names: List[str], concentrations: ndarray, baseline_coefficients: ndarray | None, fitted_spectrum: ndarray, residuals: ndarray, r_squared: float, rmse: float, n_components: int, n_priority_components: int, baseline_r_squared: float, wavelengths: ndarray)[source]
Bases:
object
Result from optimized greedy component fitting.
- concentrations
Fitted concentrations for each component.
- Type:
numpy.ndarray
- baseline_coefficients
Polynomial baseline coefficients.
- Type:
numpy.ndarray | None
- fitted_spectrum
Reconstructed spectrum from fit.
- Type:
numpy.ndarray
- residuals
Fit residuals.
- Type:
numpy.ndarray
- wavelengths
Wavelength grid used for fitting.
- Type:
numpy.ndarray
- class nirs4all.synthesis.fitter.PCAVarianceParams(n_components: int = 5, explained_variance_ratio: ndarray | None = None, score_means: ndarray | None = None, score_stds: ndarray | None = None, components: ndarray | None = None, mean_spectrum: ndarray | None = None)[source]
Bases:
object
Parameters for PCA-based variance modeling.
Models spectral variation using principal component score distributions.
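A minimal sketch of this idea, assuming independent Gaussian score distributions per component: estimate loadings and score statistics with an SVD, then draw new scores and project them back. The variable names mirror the fields of this dataclass, but the synthetic `X_real` and the sampling scheme are assumptions of the example, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" dataset: 60 spectra on a 200-point grid.
wl = np.linspace(1000.0, 2500.0, 200)
base = 0.5 + 0.3 * np.exp(-0.5 * ((wl - 1940.0) / 60.0) ** 2)
X_real = (base * (1.0 + 0.05 * rng.standard_normal((60, 1)))
          + 0.01 * rng.standard_normal((60, 1))
          + 0.001 * rng.standard_normal((60, 200)))

# PCA via SVD of the mean-centered data.
n_components = 5
mean_spectrum = X_real.mean(axis=0)
centered = X_real - mean_spectrum
_, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:n_components]                     # (n_components, n_wavelengths)
explained_variance_ratio = S[:n_components] ** 2 / np.sum(S ** 2)

# Score statistics per component.
scores = centered @ components.T
score_means = scores.mean(axis=0)
score_stds = scores.std(axis=0, ddof=1)

# Sample new scores (independent Gaussians, an assumption) and project back.
new_scores = rng.normal(score_means, score_stds, size=(100, n_components))
X_variance = new_scores @ components
X_synth = mean_spectrum + X_variance
```

Sampling in score space preserves the correlations between wavelengths that the loadings encode, which independent per-wavelength noise cannot do.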
- explained_variance_ratio
Explained variance per component.
- Type:
numpy.ndarray | None
- score_means
Mean of PC scores.
- Type:
numpy.ndarray | None
- score_stds
Std of PC scores.
- Type:
numpy.ndarray | None
- components
PCA loading vectors (n_components, n_wavelengths).
- Type:
numpy.ndarray | None
- mean_spectrum
Mean spectrum from PCA.
- Type:
numpy.ndarray | None
- class nirs4all.synthesis.fitter.PreprocessingInference(preprocessing_type: PreprocessingType = PreprocessingType.RAW_ABSORBANCE, confidence: float = 0.0, is_preprocessed: bool = False, global_mean: float = 0.0, global_range: Tuple[float, float] = (0.0, 1.0), zero_crossing_ratio: float = 0.0, per_sample_std_variation: float = 0.0, oscillation_frequency: float = 0.0, suggested_inverse: str | None = None)[source]
Bases:
object
Results of preprocessing type inference.
Detects whether spectral data has been preprocessed (derivatives, normalization, centering, etc.) before being provided to the fitter.
This is crucial for generating synthetic data that matches the real data distribution: synthetic spectra should be generated as raw absorbance, and the same preprocessing should then be applied to them.
- preprocessing_type
Detected preprocessing type.
- preprocessing_type: PreprocessingType = 'raw_absorbance'
- class nirs4all.synthesis.fitter.PreprocessingType(value)[source]
Detected preprocessing type of spectral data.
- FIRST_DERIVATIVE = 'first_derivative'
- MEAN_CENTERED = 'mean_centered'
- MSC_CORRECTED = 'msc_corrected'
- NORMALIZED = 'normalized'
- RAW_ABSORBANCE = 'raw_absorbance'
- RAW_REFLECTANCE = 'raw_reflectance'
- SECOND_DERIVATIVE = 'second_derivative'
- SNV_CORRECTED = 'snv_corrected'
- UNKNOWN = 'unknown'
- class nirs4all.synthesis.fitter.RealBandFitResult(band_names: ~typing.List[str], band_centers: ~numpy.ndarray, amplitudes: ~numpy.ndarray, sigmas: ~numpy.ndarray, baseline_coefficients: ~numpy.ndarray, fitted_spectrum: ~numpy.ndarray, residuals: ~numpy.ndarray, r_squared: float, rmse: float, n_bands: int, wavelengths: ~numpy.ndarray, band_assignments: ~typing.List[~typing.Any] = <factory>)[source]
Bases:
object
Result from real band fitting using known NIR band assignments.
- band_centers
Fixed center wavelengths from NIR_BANDS.
- Type:
numpy.ndarray
- amplitudes
Fitted amplitudes for each band.
- Type:
numpy.ndarray
- sigmas
Sigma values (within constrained ranges).
- Type:
numpy.ndarray
- baseline_coefficients
Polynomial baseline coefficients.
- Type:
numpy.ndarray
- fitted_spectrum
Reconstructed spectrum from fit.
- Type:
numpy.ndarray
- residuals
Fit residuals.
- Type:
numpy.ndarray
- wavelengths
Wavelength grid used for fitting.
- Type:
numpy.ndarray
- band_assignments
Original BandAssignment objects.
- Type:
List[Any]
- class nirs4all.synthesis.fitter.RealBandFitter(baseline_order: int = 4, max_bands: int = 50, target_r2: float = 0.98, allow_sigma_variation: bool = True, sigma_margin: float = 0.3, n_iterations: int = 3)[source]
Bases:
object
Fit spectra using REAL NIR band assignments from the _bands.py dictionary.
Unlike pure Gaussian band fitting, which optimizes band centers freely, this class uses:
- Fixed band centers from known spectroscopic literature assignments
- Constrained sigma values based on typical ranges for each band type
- Only amplitude optimization (more physically interpretable)
This provides spectroscopically meaningful decomposition that can be linked back to functional groups (O-H, C-H, N-H, etc.) and overtone levels.
Example
>>> from nirs4all.synthesis import RealBandFitter
>>>
>>> fitter = RealBandFitter(baseline_order=4, max_bands=40)
>>> result = fitter.fit(spectrum, wavelengths)
>>> print(result.summary())
>>>
>>> # See which functional groups contribute
>>> for name, center, amp in result.top_bands(10):
...     print(f"{center:.0f} nm: {name} (amplitude={amp:.4f})")
- baseline_order
Polynomial baseline order.
- max_bands
Maximum number of bands to use.
- target_r2
Target R² for iterative refinement.
- allow_sigma_variation
Allow sigma to vary within literature ranges.
- sigma_margin
How much sigma can vary from midpoint (0.3 = ±30%).
- fit(spectrum: ndarray, wavelengths: ndarray) RealBandFitResult[source]
Fit spectrum using real NIR band positions.
- Parameters:
spectrum – Target spectrum to fit, shape (n_wavelengths,).
wavelengths – Wavelengths in nm, shape (n_wavelengths,).
- Returns:
RealBandFitResult with fit results and band assignments.
- class nirs4all.synthesis.fitter.RealDataFitter[source]
Bases:
object
Fit generator parameters to match real dataset properties.
This class analyzes real NIRS data and estimates parameters for the SyntheticNIRSGenerator to produce similar spectra. Includes Phase 1-4 enhanced inference for instruments, domains, and effects.
- source_properties
SpectralProperties of the analyzed data.
- Type:
SpectralProperties | None
- fitted_params
FittedParameters after fitting.
- Type:
FittedParameters | None
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Access inferred characteristics
>>> print(f"Instrument: {params.inferred_instrument}")
>>> print(f"Domain: {params.inferred_domain}")
>>>
>>> # Create matched generator
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
- apply_matching_preprocessing(X: ndarray, *, window_length: int = 15, polyorder: int = 2) ndarray[source]
Apply preprocessing to match the detected preprocessing of real data.
If the real data was detected as preprocessed (e.g., second derivative), this method applies the same preprocessing to synthetic raw absorbance spectra so they match the real data distribution.
- Parameters:
X – Raw absorbance spectra from generator (n_samples, n_wavelengths).
window_length – Savitzky-Golay window length for derivatives.
polyorder – Polynomial order for Savitzky-Golay filter.
- Returns:
Preprocessed spectra matching the real data type.
- Raises:
RuntimeError – If fit() hasn’t been called.
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl)
>>> generator = fitter.create_matched_generator()
>>> X_raw, _, _ = generator.generate(1000)
>>> X_matched = fitter.apply_matching_preprocessing(X_raw)
- create_matched_generator(random_state: int | None = None) SyntheticNIRSGenerator[source]
Create a SyntheticNIRSGenerator configured to match the fitted data.
This method creates a generator with all fitted parameters including Phase 1-4 enhanced features (instrument, domain, effects).
- Parameters:
random_state – Random seed for reproducibility.
- Returns:
Configured SyntheticNIRSGenerator instance.
- Raises:
RuntimeError – If fit() hasn’t been called.
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator(random_state=42)
>>> X_synth, _, _ = generator.generate(1000)
- evaluate_similarity(X_synthetic: ndarray, wavelengths: ndarray | None = None) Dict[str, Any][source]
Evaluate similarity between synthetic and source data.
Computes various metrics comparing synthetic spectra to the original real data.
- Parameters:
X_synthetic – Synthetic spectra matrix.
wavelengths – Optional wavelength grid.
- Returns:
Dictionary with similarity metrics.
- Raises:
RuntimeError – If fit() hasn’t been called.
Example
>>> params = fitter.fit(X_real)
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
>>> metrics = fitter.evaluate_similarity(X_synth)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
- fit(X: np.ndarray | 'SpectroDataset', *, wavelengths: np.ndarray | None = None, name: str = 'source', infer_instrument: bool = True, infer_domain: bool = True, infer_measurement_mode: bool = True, infer_environmental: bool = True, infer_scattering: bool = True, infer_edge_artifacts: bool = True, infer_preprocessing: bool = True) FittedParameters[source]
Fit generator parameters to real data.
Analyzes the input data and estimates optimal parameters for generating synthetic spectra with similar properties. Includes Phase 1-6 enhanced inference.
- Parameters:
X – Real spectra matrix (n_samples, n_wavelengths) or SpectroDataset.
wavelengths – Wavelength grid (required if X is ndarray).
name – Dataset name for reference.
infer_instrument – Whether to infer instrument archetype.
infer_domain – Whether to infer application domain.
infer_measurement_mode – Whether to infer measurement mode.
infer_environmental – Whether to infer environmental effects.
infer_scattering – Whether to infer scattering parameters.
infer_edge_artifacts – Whether to infer edge artifact effects.
infer_preprocessing – Whether to detect preprocessing type.
- Returns:
FittedParameters object with estimated parameters.
- Raises:
ValueError – If X is empty or has wrong shape.
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl, name="wheat")
>>> print(params.summary())
- fit_from_path(path: str, *, name: str | None = None) FittedParameters[source]
Fit parameters from a dataset path.
Loads data using DatasetConfigs and fits parameters.
- Parameters:
path – Path to dataset folder.
name – Optional name override.
- Returns:
FittedParameters object.
Example
>>> params = fitter.fit_from_path("sample_data/regression")
- fitted_params: FittedParameters | None
- get_tuning_recommendations() List[str][source]
Get recommendations for tuning generation parameters.
Based on the fitted parameters and source data, provides suggestions for manual tuning.
- Returns:
List of recommendation strings.
Example
>>> params = fitter.fit(X_real)
>>> for rec in fitter.get_tuning_recommendations():
...     print(f"- {rec}")
- source_properties: SpectralProperties | None
- class nirs4all.synthesis.fitter.ScatteringInference(has_scatter_effects: bool = False, estimated_particle_size_um: float = 50.0, multiplicative_scatter_std: float = 0.0, additive_scatter_std: float = 0.0, baseline_curvature: float = 0.0, snv_correctable: bool = False, msc_correctable: bool = False)[source]
Bases:
object
Results of scattering effects inference.
- class nirs4all.synthesis.fitter.SpectralProperties(name: str = 'dataset', n_samples: int = 0, n_wavelengths: int = 0, wavelengths: ndarray | None = None, mean_spectrum: ndarray | None = None, std_spectrum: ndarray | None = None, global_mean: float = 0.0, global_std: float = 0.0, global_range: Tuple[float, float] = (0.0, 0.0), mean_slope: float = 0.0, slope_std: float = 0.0, slopes: ndarray | None = None, mean_curvature: float = 0.0, curvature_std: float = 0.0, skewness: float = 0.0, kurtosis: float = 0.0, noise_estimate: float = 0.0, snr_estimate: float = 0.0, pca_explained_variance: ndarray | None = None, pca_n_components_95: int = 0, n_peaks_mean: float = 0.0, peak_positions: ndarray | None = None, peak_wavenumbers: ndarray | None = None, effective_resolution: float = 8.0, noise_correlation_length: float = 1.0, wavelength_range: Tuple[float, float] = (1000.0, 2500.0), baseline_offset: float = 0.0, kubelka_munk_linearity: float = 0.0, baseline_convexity: float = 0.0, water_band_variation: float = 0.0, oh_band_positions: ndarray | None = None, temperature_sensitivity_score: float = 0.0, scatter_baseline_slope: float = 0.0, scatter_baseline_curvature: float = 0.0, sample_to_sample_offset_std: float = 0.0, sample_to_sample_slope_std: float = 0.0, protein_band_intensity: float = 0.0, carbohydrate_band_intensity: float = 0.0, lipid_band_intensity: float = 0.0, water_band_intensity: float = 0.0, left_edge_noise_std: float = 0.0, right_edge_noise_std: float = 0.0, center_noise_std: float = 0.0, left_edge_slope: float = 0.0, right_edge_slope: float = 0.0, edge_curvature_intensity: float = 0.0, edge_curvature_asymmetry: float = 0.0, has_boundary_rise_left: bool = False, has_boundary_rise_right: bool = False)[source]
Bases:
object
Container for computed spectral properties of a dataset.
This dataclass holds various statistical and spectral properties computed from a NIRS dataset for comparison and fitting purposes.
- wavelengths
Wavelength grid.
- Type:
numpy.ndarray | None
- # Basic statistics
- mean_spectrum
Mean spectrum across samples.
- Type:
numpy.ndarray | None
- std_spectrum
Standard deviation spectrum.
- Type:
numpy.ndarray | None
- # Shape properties
- # Distribution statistics
- # Noise characteristics
- # PCA properties
- pca_explained_variance
Explained variance ratios.
- Type:
numpy.ndarray | None
- # Peak analysis
- peak_positions
Wavelengths of detected peaks.
- Type:
numpy.ndarray | None
- peak_wavenumbers
Wavenumber positions of peaks.
- Type:
numpy.ndarray | None
- # Phase 1-4 Enhanced properties
- # Instrument indicators
- # Measurement mode indicators
- # Environmental indicators
- oh_band_positions
Detected O-H band positions.
- Type:
numpy.ndarray | None
- # Scattering indicators
- # Domain indicators
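The `peak_positions` and `peak_wavenumbers` fields describe the same peaks in two units. The sketch below shows one plausible way to relate them, assuming wavelengths in nm and wavenumbers in cm⁻¹. The helper names `nm_to_wavenumber` and `find_local_maxima` are illustrative and are not part of nirs4all, and the peak detector is deliberately naive.

```python
import numpy as np

def nm_to_wavenumber(wavelengths_nm):
    """Convert wavelengths in nm to wavenumbers in cm^-1 (1e7 / nm)."""
    return 1e7 / np.asarray(wavelengths_nm, dtype=float)

def find_local_maxima(spectrum, wavelengths_nm):
    """Return wavelengths where a point exceeds both neighbours (illustrative only)."""
    y = np.asarray(spectrum, dtype=float)
    interior = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    return np.asarray(wavelengths_nm, dtype=float)[1:-1][interior]

# Synthetic two-band spectrum (not real data): bands near 1450 nm and 1940 nm.
wavelengths = np.arange(1000.0, 2501.0, 2.0)
spectrum = (np.exp(-0.5 * ((wavelengths - 1450.0) / 30.0) ** 2)
            + 0.8 * np.exp(-0.5 * ((wavelengths - 1940.0) / 40.0) ** 2))

peak_nm = find_local_maxima(spectrum, wavelengths)
peak_cm1 = nm_to_wavenumber(peak_nm)
```

On noisy spectra a detector like this would need smoothing or a prominence threshold before the positions are meaningful.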
- class nirs4all.synthesis.fitter.VarianceFitResult(operator_params: OperatorVarianceParams, pca_params: PCAVarianceParams, n_samples: int = 0, wavelengths: ndarray | None = None)[source]
Bases:
object
Combined result from variance fitting.
- operator_params
Operator-based variance parameters.
- pca_params
PCA-based variance parameters.
- wavelengths
Wavelength grid.
- Type:
numpy.ndarray | None
- operator_params: OperatorVarianceParams
- pca_params: PCAVarianceParams
- class nirs4all.synthesis.fitter.VarianceFitter(n_pca_components: int = 10)[source]
Bases:
object
Fit variance parameters from real spectra.
Provides two complementary methods for modeling spectral variation:
- Operator-based: independent physical sources (noise, scatter, baseline)
- PCA-based: correlated variations capturing the covariance structure
Example
>>> from nirs4all.synthesis import VarianceFitter
>>>
>>> fitter = VarianceFitter()
>>> result = fitter.fit(X_real, wavelengths)
>>>
>>> # Use operator-based params for generation
>>> print(f"Noise level: {result.operator_params.noise_std:.6f}")
>>>
>>> # Generate synthetic variance using PCA
>>> X_variance = fitter.generate_pca_variance(n_samples=100, random_state=42)
- fit(X: ndarray, wavelengths: ndarray | None = None) VarianceFitResult[source]
Fit variance parameters from real spectra.
- Parameters:
X – Real spectra matrix (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
- Returns:
VarianceFitResult with both operator and PCA parameters.
- generate_operator_variance(base_spectrum: ndarray, wavelengths: ndarray, n_samples: int = 100, random_state: int | None = None) ndarray[source]
Generate synthetic spectra using operator-based variance.
- Parameters:
base_spectrum – Mean/fitted spectrum to add variance to.
wavelengths – Wavelength array.
n_samples – Number of samples to generate.
random_state – Random seed.
- Returns:
Array of synthetic spectra (n_samples, n_wavelengths).
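To make "operator-based variance" concrete, here is a minimal sketch that perturbs a base spectrum with independent gain, offset, slope, and white-noise terms. It is an assumption about what such operators could look like, not the implementation behind `generate_operator_variance`. The function name and every parameter except `noise_std` (which appears on `operator_params` in the example above) are hypothetical.

```python
import numpy as np

def sketch_operator_variance(base_spectrum, wavelengths, n_samples=100,
                             noise_std=0.001, scatter_std=0.02,
                             offset_std=0.01, slope_std=0.005,
                             random_state=None):
    """Apply independent variance operators to one base spectrum (illustrative)."""
    rng = np.random.default_rng(random_state)
    base = np.asarray(base_spectrum, dtype=float)
    wl = np.asarray(wavelengths, dtype=float)
    # Normalised wavelength axis in [-1, 1] so the slope term is scale-free.
    t = 2.0 * (wl - wl.min()) / (wl.max() - wl.min()) - 1.0
    gain = 1.0 + rng.normal(0.0, scatter_std, size=(n_samples, 1))    # multiplicative scatter
    offset = rng.normal(0.0, offset_std, size=(n_samples, 1))         # additive baseline shift
    slope = rng.normal(0.0, slope_std, size=(n_samples, 1))           # baseline tilt
    noise = rng.normal(0.0, noise_std, size=(n_samples, base.size))   # white detector noise
    return gain * base + offset + slope * t + noise

wavelengths = np.linspace(1000.0, 2500.0, 300)
base = 0.5 + 0.3 * np.exp(-0.5 * ((wavelengths - 1940.0) / 50.0) ** 2)
X_a = sketch_operator_variance(base, wavelengths, n_samples=50, random_state=0)
X_b = sketch_operator_variance(base, wavelengths, n_samples=50, random_state=0)
```

Because each operator draws from its own distribution, the sources stay statistically independent, which is the property that distinguishes this approach from the PCA-based one.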
- generate_pca_variance(n_samples: int = 100, n_components: int | None = None, random_state: int | None = None) ndarray[source]
Generate synthetic spectra using PCA-based variance.
- Parameters:
n_samples – Number of samples to generate.
n_components – Number of PCA components to use (None = all).
random_state – Random seed.
- Returns:
Array of synthetic spectra (n_samples, n_wavelengths).
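The PCA-based route can be sketched as: fit principal components to the real spectra, then draw random scores and project them back. The version below assumes Gaussian, uncorrelated scores scaled by each component's standard deviation. Whether `generate_pca_variance` samples scores this way is not stated in this reference, and `fit_pca` and `sample_pca_variance` are hypothetical names.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the mean, principal axes, and per-component score std via SVD."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    k = min(n_components, s.size)
    score_std = s[:k] / np.sqrt(max(X.shape[0] - 1, 1))
    return mean, Vt[:k], score_std

def sample_pca_variance(mean, components, score_std, n_samples=100, random_state=None):
    """Draw Gaussian scores and map them back to spectral space."""
    rng = np.random.default_rng(random_state)
    scores = rng.normal(size=(n_samples, score_std.size)) * score_std
    return mean + scores @ components

# Synthetic stand-in for real spectra: three latent factors plus an offset.
rng = np.random.default_rng(0)
X_real = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50)) + 0.5

mean, components, score_std = fit_pca(X_real, n_components=3)
X_synth = sample_pca_variance(mean, components, score_std, n_samples=100, random_state=42)
X_again = sample_pca_variance(mean, components, score_std, n_samples=100, random_state=42)
```

Samples produced this way preserve the covariance captured by the retained components, but they discard any non-Gaussian structure in the original scores.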
- nirs4all.synthesis.fitter.compare_datasets(X_synthetic: ndarray, X_real: ndarray, wavelengths: ndarray | None = None) Dict[str, Any][source]
Quick comparison between synthetic and real datasets.
- Parameters:
X_synthetic – Synthetic spectra.
X_real – Real spectra.
wavelengths – Wavelength grid.
- Returns:
Dictionary with comparison metrics.
Example
>>> metrics = compare_datasets(X_synth, X_real)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
- nirs4all.synthesis.fitter.compute_spectral_properties(X: ndarray, wavelengths: ndarray | None = None, name: str = 'dataset', n_pca_components: int = 20) SpectralProperties[source]
Compute comprehensive spectral properties of a dataset.
Analyzes a matrix of spectra to extract statistical and spectral properties useful for fitting and comparison. Includes Phase 1-4 enhanced properties for instrument, mode, domain, and effect inference.
- Parameters:
X – Spectra matrix (n_samples, n_wavelengths).
wavelengths – Optional wavelength grid.
name – Dataset identifier.
n_pca_components – Maximum PCA components to compute.
- Returns:
SpectralProperties with computed metrics.
Example
>>> props = compute_spectral_properties(X_real, wavelengths)
>>> print(f"Mean slope: {props.mean_slope:.4f}")
>>> print(f"Inferred resolution: {props.effective_resolution:.1f} nm")
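Two of the returned fields, `pca_explained_variance` and `pca_n_components_95`, are related by a cumulative sum. The sketch below shows one common way to derive both from the singular values of the centered data. It is an illustration under that assumption, and `n_components_for_variance` is a hypothetical helper rather than a nirs4all function.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Return explained-variance ratios and the component count reaching `threshold`."""
    X = np.asarray(X, dtype=float)
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    ratios = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(ratios)
    # First index whose cumulative ratio reaches the threshold, converted to a count.
    n = int(np.searchsorted(cumulative, threshold) + 1)
    return ratios, min(n, ratios.size)

# Tiny hand-built matrix: 90% of the variance on one axis, 10% on another.
X_demo = (np.outer([3.0, -3.0, 3.0, -3.0], [1, 0, 0, 0, 0])
          + np.outer([1.0, 1.0, -1.0, -1.0], [0, 1, 0, 0, 0]))
ratios, n95 = n_components_for_variance(X_demo)
_, n85 = n_components_for_variance(X_demo, threshold=0.85)
```

In the demo the first component alone explains 90% of the variance, so two components are needed to pass 95% while one is enough for 85%.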
- nirs4all.synthesis.fitter.fit_components(spectrum: ndarray, wavelengths: ndarray, component_names: List[str] | None = None, fit_baseline: bool = True, baseline_order: int = 2, method: str = 'nnls', preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False) ComponentFitResult[source]
Convenience function to fit components to a spectrum.
- Parameters:
spectrum – Observed spectrum.
wavelengths – Wavelength grid.
component_names – Components to fit (None = all available).
fit_baseline – Include polynomial baseline.
baseline_order – Polynomial order for baseline.
method – Fitting method ("nnls" or "lsq").
preprocessing – Preprocessing to apply to components (e.g., "second_derivative"). Use this when fitting preprocessed data.
auto_detect_preprocessing – If True, automatically detect preprocessing type from the data. This is useful for derivative data where the preprocessing type is unknown. Takes precedence over preprocessing if set.
- Returns:
ComponentFitResult with fit results.
Example
>>> # Fit raw absorbance data
>>> result = fit_components(spectrum, wavelengths, ["water", "protein", "lipid"])
>>>
>>> # Fit second derivative data
>>> result = fit_components(
...     deriv_spectrum, wavelengths, ["water", "protein"],
...     preprocessing="second_derivative"
... )
>>>
>>> # Auto-detect preprocessing (recommended for unknown data)
>>> result = fit_components(
...     unknown_spectrum, wavelengths,
...     auto_detect_preprocessing=True
... )
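The idea behind fitting components plus a polynomial baseline can be sketched as a single linear least-squares problem whose design matrix stacks the component spectra next to polynomial terms. The code below illustrates the unconstrained "lsq" flavour only. It does not enforce non-negativity as "nnls" would, it uses synthetic Gaussian bands in place of real component spectra, and `lsq_component_fit` is a hypothetical name. The returned keys mirror the `ComponentFitResult` fields for readability.

```python
import numpy as np

def lsq_component_fit(spectrum, wavelengths, components, baseline_order=2):
    """Ordinary least-squares fit of components plus a polynomial baseline."""
    y = np.asarray(spectrum, dtype=float)
    wl = np.asarray(wavelengths, dtype=float)
    C = np.asarray(components, dtype=float)            # (n_components, n_wavelengths)
    # Normalise wavelengths to [-1, 1] to keep the polynomial well conditioned.
    t = 2.0 * (wl - wl.min()) / (wl.max() - wl.min()) - 1.0
    baseline = np.vander(t, baseline_order + 1, increasing=True)
    A = np.hstack([C.T, baseline])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    residuals = y - fitted
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return {
        "concentrations": coef[:C.shape[0]],
        "baseline_coefficients": coef[C.shape[0]:],
        "fitted_spectrum": fitted,
        "residuals": residuals,
        "r_squared": 1.0 - float(np.sum(residuals ** 2)) / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
    }

# Synthetic stand-ins for component spectra (not real reference spectra).
wl = np.linspace(1000.0, 2500.0, 400)
water = np.exp(-0.5 * ((wl - 1940.0) / 40.0) ** 2)
protein = np.exp(-0.5 * ((wl - 2180.0) / 30.0) ** 2)
t_demo = np.linspace(-1.0, 1.0, wl.size)
observed = 0.7 * water + 0.3 * protein + 0.05 + 0.02 * t_demo

fit = lsq_component_fit(observed, wl, [water, protein], baseline_order=2)
```

With noiseless input the fit recovers the mixing weights 0.7 and 0.3. On real data an unconstrained fit can return negative concentrations, which is the usual motivation for a non-negative solver.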
- nirs4all.synthesis.fitter.fit_components_optimized(spectrum: ndarray, wavelengths: ndarray, priority_categories: List[str] | None = None, max_components: int = 10, baseline_order: int = 4, preprocessing: str | PreprocessingType | None = None, auto_detect_preprocessing: bool = False, smooth_sigma_nm: float = 30.0, use_nnls: bool = False) OptimizedFitResult[source]
Convenience function for optimized component fitting.
Uses greedy category-prioritized selection for better fits than NNLS.
- Parameters:
spectrum – Observed spectrum.
wavelengths – Wavelength grid.
priority_categories – Categories to prioritize (e.g., ['carbohydrates', 'proteins']).
max_components – Maximum components to select.
baseline_order – Polynomial baseline order.
preprocessing – Preprocessing type ('first_derivative', 'second_derivative', etc.).
auto_detect_preprocessing – Auto-detect preprocessing from data.
smooth_sigma_nm – Gaussian smoothing sigma in nm to broaden component spectra.
use_nnls – Use non-negative least squares instead of OLS.
- Returns:
OptimizedFitResult with fit results.
Example
>>> result = fit_components_optimized(
...     spectrum, wavelengths,
...     priority_categories=['carbohydrates', 'proteins'],
...     auto_detect_preprocessing=True,
... )
>>> print(f"R² = {result.r_squared:.4f}")
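Greedy selection can be illustrated with plain forward selection: at each step, add the candidate that lowers the residual sum of squares the most, and stop when the gain becomes negligible. The sketch below omits the category prioritization, smoothing, and polynomial baseline described above, since this reference does not specify how they work. `greedy_select`, `min_r2_gain`, and the band names are hypothetical.

```python
import numpy as np

def _sse(y, columns):
    """Residual sum of squares of an OLS fit of y on the given columns."""
    A = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def greedy_select(spectrum, library, max_components=3, min_r2_gain=1e-4):
    """Forward selection of library spectra (dict of name -> array), with an intercept."""
    y = np.asarray(spectrum, dtype=float)
    intercept = np.ones_like(y)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    selected, current_sse = [], ss_tot
    while len(selected) < max_components:
        trials = {
            name: _sse(y, [intercept] + [library[n] for n in selected + [name]])
            for name in library if name not in selected
        }
        if not trials:
            break
        best = min(trials, key=trials.get)
        # Stop when the best candidate no longer improves R^2 meaningfully.
        if (current_sse - trials[best]) / ss_tot < min_r2_gain:
            break
        selected.append(best)
        current_sse = trials[best]
    return selected, 1.0 - current_sse / ss_tot

wl = np.linspace(1000.0, 2500.0, 400)
library = {
    f"band_{c}": np.exp(-0.5 * ((wl - c) / 40.0) ** 2)
    for c in (1200, 1450, 1700, 1940)
}
observed = 0.6 * library["band_1450"] + 0.4 * library["band_1940"] + 0.1
chosen, r2 = greedy_select(observed, library, max_components=3)
```

Here only the two bands that are actually present get selected, even though three were allowed, because the third candidate adds no measurable improvement.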
- nirs4all.synthesis.fitter.fit_real_bands(spectrum: ndarray, wavelengths: ndarray, baseline_order: int = 4, max_bands: int = 50, target_r2: float = 0.98, allow_sigma_variation: bool = True) RealBandFitResult[source]
Convenience function for fitting spectrum using real NIR band assignments.
Uses known band positions from the NIR_BANDS dictionary for physically meaningful spectral decomposition.
- Parameters:
spectrum – Observed spectrum.
wavelengths – Wavelength grid in nm.
baseline_order – Polynomial baseline order.
max_bands – Maximum number of bands to use.
target_r2 – Target R² for early stopping.
allow_sigma_variation – Allow sigma to vary within constrained ranges.
- Returns:
RealBandFitResult with fit results.
Example
>>> result = fit_real_bands(spectrum, wavelengths)
>>> print(f"R² = {result.r_squared:.4f}")
>>> for name, center, amp in result.top_bands(5):
...     print(f"{center:.0f} nm: {name}")
- nirs4all.synthesis.fitter.fit_to_real_data(X: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, name: str = 'source') FittedParameters[source]
Quick function to fit parameters to real data.
Convenience function for simple fitting use cases.
- Parameters:
X – Real spectra or SpectroDataset.
wavelengths – Wavelength grid.
name – Dataset name.
- Returns:
FittedParameters object.
Example
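>>> from nirs4all.synthesis import SyntheticNIRSGenerator
>>> from nirs4all.synthesis.fitter import fit_to_real_data
>>>
>>> params = fit_to_real_data(X_real, wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())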
- nirs4all.synthesis.fitter.fit_variance(X: ndarray, wavelengths: ndarray | None = None, n_pca_components: int = 10) VarianceFitResult[source]
Convenience function to fit variance parameters from real spectra.
- Parameters:
X – Real spectra matrix (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
n_pca_components – Number of PCA components to fit.
- Returns:
VarianceFitResult with fitted parameters.
Example
>>> result = fit_variance(X_real, wavelengths)
>>> print(f"Noise level: {result.operator_params.noise_std:.6f}")
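A noise level such as `noise_std` can be estimated without a reference spectrum. One standard approach, sketched below, takes second differences along the wavelength axis so that the smooth signal largely cancels and white noise remains. This is an assumed technique for illustration and may differ from what `fit_variance` computes. `estimate_noise_std` is a hypothetical helper.

```python
import numpy as np

def estimate_noise_std(X):
    """Estimate white-noise std from second differences along the wavelength axis."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    d2 = X[:, :-2] - 2.0 * X[:, 1:-1] + X[:, 2:]
    # For white noise of std s, x[i-1] - 2*x[i] + x[i+1] has variance (1 + 4 + 1) * s**2.
    return float(np.std(d2) / np.sqrt(6.0))

# Synthetic check: a smooth band plus white noise with a known std of 0.01.
rng = np.random.default_rng(0)
wl = np.arange(1000.0, 2500.0, 2.0)
smooth = np.exp(-0.5 * ((wl - 1450.0) / 60.0) ** 2)
X_noisy = smooth + rng.normal(0.0, 0.01, size=(40, wl.size))
noise_hat = estimate_noise_std(X_noisy)
```

The estimate is biased upward when the spectra contain features that are sharp relative to the sampling interval, and it does not account for correlated noise.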
- nirs4all.synthesis.fitter.multiscale_derivative_fit(fitter: DerivativeAwareForwardModelFitter, y_deriv: ndarray, scales: List[float] | None = None) Dict[str, Any][source]
Multiscale fitting curriculum for derivative spectra.
Fits coarse features first by smoothing the derivative target, then progressively reduces smoothing. Particularly important for derivative data which can have high-frequency noise.
- Parameters:
fitter – DerivativeAwareForwardModelFitter instance.
y_deriv – Target derivative spectrum.
scales – List of Gaussian sigma values. Default: [15, 8, 4, 0].
- Returns:
Final fit result dict.
Example
>>> result = multiscale_derivative_fit(fitter, deriv_spectrum)
- nirs4all.synthesis.fitter.multiscale_fit(fitter: ForwardModelFitter, y: ndarray, scales: List[float] | None = None) Dict[str, Any][source]
Multiscale fitting curriculum for raw spectra.
Fits coarse features first by smoothing the target, then progressively reduces smoothing to capture finer details. This improves optimization stability and avoids local minima.
- Parameters:
fitter – ForwardModelFitter instance.
y – Target spectrum.
scales – List of Gaussian sigma values for progressive smoothing. Default: [20, 10, 5, 0].
- Returns:
Final fit result dict.
Example
>>> result = multiscale_fit(fitter, spectrum, scales=[20, 10, 5, 0])
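The coarse-to-fine curriculum can be sketched independently of any particular fitter: smooth the target with a decreasing Gaussian sigma and warm-start each stage from the previous result. In the sketch below the fit step is a hypothetical local search for a single peak center, standing in for `ForwardModelFitter`, whose interface is not documented here. Sigma is assumed to be in grid points.

```python
import numpy as np

def gaussian_smooth(y, sigma):
    """Smooth with a normalised Gaussian kernel; sigma in grid points, 0 means none."""
    y = np.asarray(y, dtype=float)
    if sigma <= 0:
        return y.copy()
    radius = int(4 * sigma) + 1
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(y, kernel, mode="same")

def local_center_search(x, target, center, width=10.0, reach=30):
    """Hypothetical fit step: best peak centre within +/- reach of the starting value."""
    candidates = center + np.arange(-reach, reach + 1, dtype=float)
    sse = [np.sum((target - np.exp(-0.5 * ((x - c) / width) ** 2)) ** 2)
           for c in candidates]
    return float(candidates[int(np.argmin(sse))])

def multiscale_curriculum(x, y, init_center, scales=(20, 10, 5, 0)):
    """Refit against progressively less smoothed targets, warm-starting each stage."""
    center = init_center
    for sigma in scales:
        center = local_center_search(x, gaussian_smooth(y, sigma), center)
    return center

# Synthetic target: one band at 1700 nm on a 1 nm grid, with mild noise.
rng = np.random.default_rng(0)
x = np.arange(1500.0, 1901.0)
y = np.exp(-0.5 * ((x - 1700.0) / 10.0) ** 2) + rng.normal(0.0, 0.005, x.size)
center = multiscale_curriculum(x, y, init_center=1650.0)
```

Smoothing widens the band so that a start 50 nm away still sees a usable error gradient; later stages then sharpen the estimate on the less smoothed target.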