nirs4all.data.synthetic package
Submodules
- nirs4all.data.synthetic.accelerated module
  - AcceleratedArrays
    - AcceleratedArrays.arange
    - AcceleratedArrays.array
    - AcceleratedArrays.backend
    - AcceleratedArrays.cos
    - AcceleratedArrays.dot
    - AcceleratedArrays.exp
    - AcceleratedArrays.linspace
    - AcceleratedArrays.log
    - AcceleratedArrays.matmul
    - AcceleratedArrays.ones
    - AcceleratedArrays.random_normal
    - AcceleratedArrays.random_uniform
    - AcceleratedArrays.sin
    - AcceleratedArrays.sqrt
    - AcceleratedArrays.sum
    - AcceleratedArrays.to_numpy
    - AcceleratedArrays.zeros
  - AcceleratedGenerator
  - AcceleratorBackend
  - benchmark_backends()
  - create_accelerated_arrays()
  - detect_best_backend()
  - generate_spectra_batch_accelerated()
  - generate_voigt_profiles_accelerated()
  - get_acceleration_speedup_estimate()
  - get_backend_info()
  - is_gpu_available()
- nirs4all.data.synthetic.benchmarks module
  - BenchmarkDatasetInfo
    - BenchmarkDatasetInfo.domain
    - BenchmarkDatasetInfo.full_name
    - BenchmarkDatasetInfo.license
    - BenchmarkDatasetInfo.measurement_mode
    - BenchmarkDatasetInfo.n_samples
    - BenchmarkDatasetInfo.n_wavelengths
    - BenchmarkDatasetInfo.name
    - BenchmarkDatasetInfo.notes
    - BenchmarkDatasetInfo.reference
    - BenchmarkDatasetInfo.sample_type
    - BenchmarkDatasetInfo.source_url
    - BenchmarkDatasetInfo.summary()
    - BenchmarkDatasetInfo.targets
    - BenchmarkDatasetInfo.typical_peak_density
    - BenchmarkDatasetInfo.typical_snr
    - BenchmarkDatasetInfo.wavelength_range
  - BenchmarkDomain
  - LoadedBenchmarkDataset
    - LoadedBenchmarkDataset.X
    - LoadedBenchmarkDataset.info
    - LoadedBenchmarkDataset.metadata
    - LoadedBenchmarkDataset.sample_ids
    - LoadedBenchmarkDataset.wavelengths
    - LoadedBenchmarkDataset.y
  - create_synthetic_matching_benchmark()
  - get_benchmark_info()
  - get_benchmark_spectral_properties()
  - get_datasets_by_domain()
  - list_benchmark_datasets()
  - load_benchmark_dataset()
- nirs4all.data.synthetic.builder module
  - BuilderState
    - BuilderState.as_dataset
    - BuilderState.batch_effects_enabled
    - BuilderState.class_separation
    - BuilderState.class_separation_method
    - BuilderState.class_weights
    - BuilderState.complexity
    - BuilderState.component_library
    - BuilderState.component_names
    - BuilderState.concentration_method
    - BuilderState.generate_sample_ids
    - BuilderState.group_names
    - BuilderState.hidden_factors
    - BuilderState.include_metadata
    - BuilderState.interaction_strength
    - BuilderState.n_batches
    - BuilderState.n_classes
    - BuilderState.n_confounders
    - BuilderState.n_groups
    - BuilderState.n_regimes
    - BuilderState.n_repetitions
    - BuilderState.n_samples
    - BuilderState.name
    - BuilderState.noise_heteroscedasticity
    - BuilderState.nonlinear_interactions
    - BuilderState.polynomial_degree
    - BuilderState.random_state
    - BuilderState.regime_method
    - BuilderState.regime_overlap
    - BuilderState.sample_id_prefix
    - BuilderState.shuffle
    - BuilderState.signal_to_confound_ratio
    - BuilderState.sources
    - BuilderState.spectral_masking
    - BuilderState.stratify
    - BuilderState.target_component
    - BuilderState.target_range
    - BuilderState.target_transform
    - BuilderState.temporal_drift
    - BuilderState.train_ratio
    - BuilderState.wavelength_end
    - BuilderState.wavelength_start
    - BuilderState.wavelength_step
  - SyntheticDatasetBuilder
    - SyntheticDatasetBuilder.state
    - SyntheticDatasetBuilder.__repr__()
    - SyntheticDatasetBuilder.build()
    - SyntheticDatasetBuilder.build_arrays()
    - SyntheticDatasetBuilder.build_dataset()
    - SyntheticDatasetBuilder.export()
    - SyntheticDatasetBuilder.export_to_csv()
    - SyntheticDatasetBuilder.fit_to()
    - SyntheticDatasetBuilder.from_config()
    - SyntheticDatasetBuilder.get_config()
    - SyntheticDatasetBuilder.with_batch_effects()
    - SyntheticDatasetBuilder.with_classification()
    - SyntheticDatasetBuilder.with_complex_target_landscape()
    - SyntheticDatasetBuilder.with_features()
    - SyntheticDatasetBuilder.with_metadata()
    - SyntheticDatasetBuilder.with_nonlinear_targets()
    - SyntheticDatasetBuilder.with_output()
    - SyntheticDatasetBuilder.with_partitions()
    - SyntheticDatasetBuilder.with_sources()
    - SyntheticDatasetBuilder.with_target_complexity()
    - SyntheticDatasetBuilder.with_targets()
- nirs4all.data.synthetic.components module
  - ComponentLibrary
    - ComponentLibrary.rng
    - ComponentLibrary.__contains__()
    - ComponentLibrary.__getitem__()
    - ComponentLibrary.__iter__()
    - ComponentLibrary.__len__()
    - ComponentLibrary.add_component()
    - ComponentLibrary.add_random_component()
    - ComponentLibrary.component_names
    - ComponentLibrary.components
    - ComponentLibrary.compute_all()
    - ComponentLibrary.from_predefined()
    - ComponentLibrary.generate_random_library()
    - ComponentLibrary.n_components
  - NIRBand
  - SpectralComponent
- nirs4all.data.synthetic.config module
  - BatchEffectConfig
  - ConfounderConfig
  - FeatureConfig
    - FeatureConfig.complexity
    - FeatureConfig.component_names
    - FeatureConfig.n_components
    - FeatureConfig.wavelength_end
    - FeatureConfig.wavelength_start
    - FeatureConfig.wavelength_step
  - MetadataConfig
    - MetadataConfig.additional_columns
    - MetadataConfig.generate_sample_ids
    - MetadataConfig.group_names
    - MetadataConfig.n_groups
    - MetadataConfig.n_repetitions
    - MetadataConfig.sample_id_prefix
  - MultiRegimeConfig
  - NonLinearConfig
  - OutputConfig
  - PartitionConfig
  - SyntheticDatasetConfig
    - SyntheticDatasetConfig.__post_init__()
    - SyntheticDatasetConfig.batch_effects
    - SyntheticDatasetConfig.confounders
    - SyntheticDatasetConfig.features
    - SyntheticDatasetConfig.metadata
    - SyntheticDatasetConfig.multi_regime
    - SyntheticDatasetConfig.n_samples
    - SyntheticDatasetConfig.name
    - SyntheticDatasetConfig.nonlinear
    - SyntheticDatasetConfig.output
    - SyntheticDatasetConfig.partitions
    - SyntheticDatasetConfig.random_state
    - SyntheticDatasetConfig.targets
  - TargetConfig
- nirs4all.data.synthetic.detectors module
  - DetectorConfig
    - DetectorConfig.apply_nonlinearity
    - DetectorConfig.apply_response_curve
    - DetectorConfig.detector_type
    - DetectorConfig.gain
    - DetectorConfig.integration_time_ms
    - DetectorConfig.noise_model
    - DetectorConfig.nonlinearity_coefficient
    - DetectorConfig.temperature_k
  - DetectorSimulator
  - DetectorSpectralResponse
    - DetectorSpectralResponse.cutoff_wavelength
    - DetectorSpectralResponse.detector_type
    - DetectorSpectralResponse.get_response_at()
    - DetectorSpectralResponse.peak_qe
    - DetectorSpectralResponse.peak_wavelength
    - DetectorSpectralResponse.response
    - DetectorSpectralResponse.short_cutoff
    - DetectorSpectralResponse.wavelengths
  - NoiseModelConfig
    - NoiseModelConfig.adc_bits
    - NoiseModelConfig.flicker_corner_freq
    - NoiseModelConfig.flicker_noise_enabled
    - NoiseModelConfig.full_scale
    - NoiseModelConfig.quantization_noise_enabled
    - NoiseModelConfig.read_noise_electrons
    - NoiseModelConfig.read_noise_enabled
    - NoiseModelConfig.shot_noise_enabled
    - NoiseModelConfig.shot_noise_factor
    - NoiseModelConfig.thermal_noise_enabled
    - NoiseModelConfig.thermal_noise_factor
  - get_default_noise_config()
  - get_detector_response()
  - get_detector_wavelength_range()
  - list_detector_types()
  - simulate_detector_effects()
- nirs4all.data.synthetic.domains module
  - ConcentrationPrior
  - DomainCategory
  - DomainConfig
    - DomainConfig.additional_params
    - DomainConfig.category
    - DomainConfig.complexity
    - DomainConfig.component_weights
    - DomainConfig.concentration_priors
    - DomainConfig.description
    - DomainConfig.get_component_weights()
    - DomainConfig.measurement_mode
    - DomainConfig.n_components_range
    - DomainConfig.name
    - DomainConfig.noise_level
    - DomainConfig.sample_components()
    - DomainConfig.sample_concentrations()
    - DomainConfig.typical_components
    - DomainConfig.typical_sample_types
    - DomainConfig.wavelength_range
  - create_domain_aware_library()
  - get_domain_components()
  - get_domain_config()
  - get_domains_for_component()
  - list_domains()
- nirs4all.data.synthetic.environmental module
  - EnvironmentalEffectsConfig
    - EnvironmentalEffectsConfig.enable_moisture
    - EnvironmentalEffectsConfig.enable_temperature
    - EnvironmentalEffectsConfig.moisture
    - EnvironmentalEffectsConfig.temperature
  - EnvironmentalEffectsSimulator
  - MoistureConfig
    - MoistureConfig.__post_init__()
    - MoistureConfig.bound_water_shift
    - MoistureConfig.free_water_fraction
    - MoistureConfig.moisture_content
    - MoistureConfig.reference_aw
    - MoistureConfig.temperature_interaction
    - MoistureConfig.water_activity
  - MoistureEffectSimulator
  - SpectralRegion
  - TemperatureConfig
    - TemperatureConfig.custom_regions
    - TemperatureConfig.delta_temperature
    - TemperatureConfig.enable_broadening
    - TemperatureConfig.enable_intensity
    - TemperatureConfig.enable_shift
    - TemperatureConfig.reference_temperature
    - TemperatureConfig.region_specific
    - TemperatureConfig.sample_temperature
    - TemperatureConfig.temperature_variation
  - TemperatureEffectParams
    - TemperatureEffectParams.broadening_per_degree
    - TemperatureEffectParams.intensity_change_per_degree
    - TemperatureEffectParams.reference
    - TemperatureEffectParams.shift_per_degree
    - TemperatureEffectParams.wavelength_range
  - TemperatureEffectSimulator
  - apply_moisture_effects()
  - apply_temperature_effects()
  - get_temperature_effect_regions()
  - simulate_temperature_series()
- nirs4all.data.synthetic.exporter module
  - CSVVariationGenerator
    - CSVVariationGenerator.base_exporter
    - CSVVariationGenerator.as_fragmented()
    - CSVVariationGenerator.as_single_file()
    - CSVVariationGenerator.generate_all_variations()
    - CSVVariationGenerator.with_comma_delimiter()
    - CSVVariationGenerator.with_precision()
    - CSVVariationGenerator.with_row_index()
    - CSVVariationGenerator.with_semicolon_delimiter()
    - CSVVariationGenerator.with_tab_delimiter()
    - CSVVariationGenerator.without_headers()
  - DatasetExporter
  - ExportConfig
    - ExportConfig.compression
    - ExportConfig.file_extension
    - ExportConfig.float_precision
    - ExportConfig.format
    - ExportConfig.include_headers
    - ExportConfig.include_index
    - ExportConfig.separator
  - export_to_csv()
  - export_to_folder()
- nirs4all.data.synthetic.fitter module
  - DomainInference
    - DomainInference.alternative_domains
    - DomainInference.category
    - DomainInference.confidence
    - DomainInference.detected_components
    - DomainInference.domain_name
  - EnvironmentalInference
    - EnvironmentalInference.estimated_moisture_variation
    - EnvironmentalInference.estimated_temperature_variation
    - EnvironmentalInference.has_moisture_effects
    - EnvironmentalInference.has_temperature_effects
    - EnvironmentalInference.water_band_shift
  - FittedParameters
    - FittedParameters.baseline_amplitude
    - FittedParameters.complexity
    - FittedParameters.detected_components
    - FittedParameters.domain_inference
    - FittedParameters.emsc_config
    - FittedParameters.environmental_inference
    - FittedParameters.from_dict()
    - FittedParameters.global_slope_mean
    - FittedParameters.global_slope_std
    - FittedParameters.inferred_domain
    - FittedParameters.inferred_instrument
    - FittedParameters.instrument_inference
    - FittedParameters.load()
    - FittedParameters.measurement_mode
    - FittedParameters.measurement_mode_confidence
    - FittedParameters.moisture_config
    - FittedParameters.noise_base
    - FittedParameters.noise_signal_dep
    - FittedParameters.particle_size_config
    - FittedParameters.path_length_std
    - FittedParameters.save()
    - FittedParameters.scatter_alpha_std
    - FittedParameters.scatter_beta_std
    - FittedParameters.scattering_inference
    - FittedParameters.source_name
    - FittedParameters.source_properties
    - FittedParameters.suggested_n_components
    - FittedParameters.summary()
    - FittedParameters.temperature_config
    - FittedParameters.tilt_std
    - FittedParameters.to_dict()
    - FittedParameters.to_full_config()
    - FittedParameters.to_generator_kwargs()
    - FittedParameters.wavelength_end
    - FittedParameters.wavelength_start
    - FittedParameters.wavelength_step
  - InstrumentInference
    - InstrumentInference.alternative_archetypes
    - InstrumentInference.archetype_name
    - InstrumentInference.confidence
    - InstrumentInference.detector_type
    - InstrumentInference.estimated_resolution
    - InstrumentInference.wavelength_range
  - MeasurementModeInference
  - RealDataFitter
  - ScatteringInference
    - ScatteringInference.additive_scatter_std
    - ScatteringInference.baseline_curvature
    - ScatteringInference.estimated_particle_size_um
    - ScatteringInference.has_scatter_effects
    - ScatteringInference.msc_correctable
    - ScatteringInference.multiplicative_scatter_std
    - ScatteringInference.snv_correctable
  - SpectralProperties
    - SpectralProperties.baseline_convexity
    - SpectralProperties.baseline_offset
    - SpectralProperties.carbohydrate_band_intensity
    - SpectralProperties.curvature_std
    - SpectralProperties.effective_resolution
    - SpectralProperties.global_mean
    - SpectralProperties.global_range
    - SpectralProperties.global_std
    - SpectralProperties.kubelka_munk_linearity
    - SpectralProperties.kurtosis
    - SpectralProperties.lipid_band_intensity
    - SpectralProperties.mean_curvature
    - SpectralProperties.mean_slope
    - SpectralProperties.mean_spectrum
    - SpectralProperties.n_peaks_mean
    - SpectralProperties.n_samples
    - SpectralProperties.n_wavelengths
    - SpectralProperties.name
    - SpectralProperties.noise_correlation_length
    - SpectralProperties.noise_estimate
    - SpectralProperties.oh_band_positions
    - SpectralProperties.pca_explained_variance
    - SpectralProperties.pca_n_components_95
    - SpectralProperties.peak_positions
    - SpectralProperties.peak_wavenumbers
    - SpectralProperties.protein_band_intensity
    - SpectralProperties.sample_to_sample_offset_std
    - SpectralProperties.sample_to_sample_slope_std
    - SpectralProperties.scatter_baseline_curvature
    - SpectralProperties.scatter_baseline_slope
    - SpectralProperties.skewness
    - SpectralProperties.slope_std
    - SpectralProperties.slopes
    - SpectralProperties.snr_estimate
    - SpectralProperties.std_spectrum
    - SpectralProperties.temperature_sensitivity_score
    - SpectralProperties.water_band_intensity
    - SpectralProperties.water_band_variation
    - SpectralProperties.wavelength_range
    - SpectralProperties.wavelengths
  - compare_datasets()
  - compute_spectral_properties()
  - fit_to_real_data()
- nirs4all.data.synthetic.generator module
  - SyntheticNIRSGenerator
    - SyntheticNIRSGenerator.wavelengths
    - SyntheticNIRSGenerator.n_wavelengths
    - SyntheticNIRSGenerator.library
    - SyntheticNIRSGenerator.E
    - SyntheticNIRSGenerator.params
    - SyntheticNIRSGenerator.instrument
    - SyntheticNIRSGenerator.measurement_mode_simulator
    - SyntheticNIRSGenerator.environmental_simulator
    - SyntheticNIRSGenerator.scattering_effects_simulator
    - SyntheticNIRSGenerator.__repr__()
    - SyntheticNIRSGenerator.create_dataset()
    - SyntheticNIRSGenerator.generate()
    - SyntheticNIRSGenerator.generate_batch_effects()
    - SyntheticNIRSGenerator.generate_concentrations()
- nirs4all.data.synthetic.instruments module
  - DetectorType
  - InstrumentArchetype
    - InstrumentArchetype.category
    - InstrumentArchetype.description
    - InstrumentArchetype.detector_type
    - InstrumentArchetype.get_noise_model_params()
    - InstrumentArchetype.integration_time_ms
    - InstrumentArchetype.monochromator_type
    - InstrumentArchetype.multi_scan
    - InstrumentArchetype.multi_sensor
    - InstrumentArchetype.name
    - InstrumentArchetype.optical_path
    - InstrumentArchetype.photometric_noise
    - InstrumentArchetype.photometric_range
    - InstrumentArchetype.scan_speed
    - InstrumentArchetype.snr
    - InstrumentArchetype.spectral_resolution
    - InstrumentArchetype.stray_light
    - InstrumentArchetype.temperature_sensitivity
    - InstrumentArchetype.warm_up_drift
    - InstrumentArchetype.wavelength_accuracy
    - InstrumentArchetype.wavelength_range
  - InstrumentCategory
  - InstrumentSimulator
  - MonochromatorType
  - MultiScanConfig
    - MultiScanConfig.averaging_method
    - MultiScanConfig.discard_outliers
    - MultiScanConfig.enabled
    - MultiScanConfig.n_scans
    - MultiScanConfig.outlier_threshold
    - MultiScanConfig.scan_to_scan_noise
    - MultiScanConfig.wavelength_jitter
  - MultiSensorConfig
    - MultiSensorConfig.add_stitch_artifacts
    - MultiSensorConfig.artifact_intensity
    - MultiSensorConfig.enabled
    - MultiSensorConfig.sensors
    - MultiSensorConfig.stitch_method
    - MultiSensorConfig.stitch_smoothing
  - SensorConfig
    - SensorConfig.detector_type
    - SensorConfig.gain
    - SensorConfig.noise_level
    - SensorConfig.overlap_range
    - SensorConfig.spectral_resolution
    - SensorConfig.wavelength_range
  - get_instrument_archetype()
  - get_instruments_by_category()
  - list_instrument_archetypes()
- nirs4all.data.synthetic.measurement_modes module
  - ATRConfig
  - MeasurementMode
  - MeasurementModeConfig
    - MeasurementModeConfig.add_specular
    - MeasurementModeConfig.atr
    - MeasurementModeConfig.mode
    - MeasurementModeConfig.reflectance
    - MeasurementModeConfig.scattering
    - MeasurementModeConfig.specular_fraction
    - MeasurementModeConfig.transflectance
    - MeasurementModeConfig.transmittance
  - MeasurementModeSimulator
    - MeasurementModeSimulator.config
    - MeasurementModeSimulator.rng
    - MeasurementModeSimulator.absorbance_to_reflectance()
    - MeasurementModeSimulator.apply()
    - MeasurementModeSimulator.generate_scattering_coefficients()
    - MeasurementModeSimulator.inverse_kubelka_munk()
    - MeasurementModeSimulator.kubelka_munk()
    - MeasurementModeSimulator.reflectance_to_absorbance()
  - ReflectanceConfig
    - ReflectanceConfig.collection_angle
    - ReflectanceConfig.geometry
    - ReflectanceConfig.illumination_angle
    - ReflectanceConfig.reference_material
    - ReflectanceConfig.reference_reflectance
    - ReflectanceConfig.sample_presentation
  - ScatteringConfig
    - ScatteringConfig.baseline_scattering
    - ScatteringConfig.particle_size_um
    - ScatteringConfig.particle_size_variation
    - ScatteringConfig.sample_to_sample_variation
    - ScatteringConfig.wavelength_exponent
  - TransflectanceConfig
    - TransflectanceConfig.path_length_mm
    - TransflectanceConfig.reflector_reflectance
    - TransflectanceConfig.reflector_type
    - TransflectanceConfig.spacer_thickness_mm
  - TransmittanceConfig
  - create_atr_simulator()
  - create_reflectance_simulator()
  - create_transmittance_simulator()
- nirs4all.data.synthetic.metadata module
  - MetadataGenerationResult
    - MetadataGenerationResult.additional_columns
    - MetadataGenerationResult.bio_sample_ids
    - MetadataGenerationResult.group_indices
    - MetadataGenerationResult.groups
    - MetadataGenerationResult.n_bio_samples
    - MetadataGenerationResult.repetitions
    - MetadataGenerationResult.sample_ids
    - MetadataGenerationResult.to_dict()
  - MetadataGenerator
  - generate_sample_metadata()
- nirs4all.data.synthetic.prior module
  - MatrixType
  - NIRSPriorConfig
    - NIRSPriorConfig.domain_weights
    - NIRSPriorConfig.get_domain_weight()
    - NIRSPriorConfig.instrument_given_domain
    - NIRSPriorConfig.matrix_given_domain
    - NIRSPriorConfig.mode_given_category
    - NIRSPriorConfig.n_classes_range
    - NIRSPriorConfig.n_samples_range
    - NIRSPriorConfig.n_targets_range
    - NIRSPriorConfig.noise_level_range
    - NIRSPriorConfig.normalize_weights()
    - NIRSPriorConfig.particle_size_range
    - NIRSPriorConfig.target_type_weights
    - NIRSPriorConfig.temperature_range
  - PriorSampler
    - PriorSampler.sample()
    - PriorSampler.sample_batch()
    - PriorSampler.sample_components()
    - PriorSampler.sample_domain()
    - PriorSampler.sample_for_domain()
    - PriorSampler.sample_for_instrument()
    - PriorSampler.sample_instrument()
    - PriorSampler.sample_instrument_category()
    - PriorSampler.sample_matrix_type()
    - PriorSampler.sample_measurement_mode()
    - PriorSampler.sample_n_samples()
    - PriorSampler.sample_noise_level()
    - PriorSampler.sample_particle_size()
    - PriorSampler.sample_target_config()
    - PriorSampler.sample_temperature()
  - get_domain_compatible_instruments()
  - get_instrument_typical_modes()
  - sample_prior()
  - sample_prior_batch()
- nirs4all.data.synthetic.procedural module
  - FunctionalGroupType
  - ProceduralComponentConfig
    - ProceduralComponentConfig.amplitude_variability
    - ProceduralComponentConfig.anharmonicity
    - ProceduralComponentConfig.anharmonicity_variability
    - ProceduralComponentConfig.bandwidth_variability
    - ProceduralComponentConfig.combination_amplitude_factor
    - ProceduralComponentConfig.functional_groups
    - ProceduralComponentConfig.h_bond_strength
    - ProceduralComponentConfig.h_bond_variability
    - ProceduralComponentConfig.include_combinations
    - ProceduralComponentConfig.include_overtones
    - ProceduralComponentConfig.max_combinations
    - ProceduralComponentConfig.max_overtone_order
    - ProceduralComponentConfig.n_fundamental_bands
    - ProceduralComponentConfig.wavelength_range
  - ProceduralComponentGenerator
- nirs4all.data.synthetic.scattering module
  - EMSCConfig
    - EMSCConfig.additive_scatter_std
    - EMSCConfig.include_wavelength_terms
    - EMSCConfig.multiplicative_scatter_std
    - EMSCConfig.polynomial_order
    - EMSCConfig.reference_spectrum
    - EMSCConfig.wavelength_coef_std
  - EMSCTransformSimulator
  - ParticleSizeConfig
    - ParticleSizeConfig.distribution
    - ParticleSizeConfig.include_path_length_effect
    - ParticleSizeConfig.path_length_sensitivity
    - ParticleSizeConfig.reference_size_um
    - ParticleSizeConfig.size_effect_strength
    - ParticleSizeConfig.wavelength_exponent
  - ParticleSizeDistribution
    - ParticleSizeDistribution.distribution
    - ParticleSizeDistribution.max_size_um
    - ParticleSizeDistribution.mean_size_um
    - ParticleSizeDistribution.min_size_um
    - ParticleSizeDistribution.sample()
    - ParticleSizeDistribution.std_size_um
  - ParticleSizeSimulator
  - ScatteringCoefficientConfig
    - ScatteringCoefficientConfig.baseline_scattering
    - ScatteringCoefficientConfig.particle_size_factor
    - ScatteringCoefficientConfig.sample_variation
    - ScatteringCoefficientConfig.wavelength_exponent
    - ScatteringCoefficientConfig.wavelength_reference_nm
  - ScatteringCoefficientGenerator
  - ScatteringEffectsConfig
    - ScatteringEffectsConfig.emsc
    - ScatteringEffectsConfig.enable_emsc
    - ScatteringEffectsConfig.enable_particle_size
    - ScatteringEffectsConfig.model
    - ScatteringEffectsConfig.particle_size
    - ScatteringEffectsConfig.scattering_coefficient
  - ScatteringEffectsSimulator
  - ScatteringModel
  - apply_emsc_distortion()
  - apply_particle_size_effects()
  - generate_scattering_coefficients()
  - simulate_msc_correctable_scatter()
  - simulate_snv_correctable_scatter()
- nirs4all.data.synthetic.sources module
  - MultiSourceGenerator
  - MultiSourceResult
    - MultiSourceResult.get_combined_features()
    - MultiSourceResult.metadata
    - MultiSourceResult.n_features_total
    - MultiSourceResult.n_samples
    - MultiSourceResult.source_configs
    - MultiSourceResult.source_names
    - MultiSourceResult.sources
    - MultiSourceResult.targets
    - MultiSourceResult.wavelengths
  - SourceConfig
    - SourceConfig.complexity
    - SourceConfig.components
    - SourceConfig.correlation_with_target
    - SourceConfig.distribution
    - SourceConfig.from_dict()
    - SourceConfig.n_features
    - SourceConfig.name
    - SourceConfig.source_type
    - SourceConfig.wavelength_end
    - SourceConfig.wavelength_start
    - SourceConfig.wavelength_step
  - generate_multi_source()
- nirs4all.data.synthetic.targets module
  - ClassSeparationConfig
  - NonLinearTargetConfig
    - NonLinearTargetConfig.hidden_factors
    - NonLinearTargetConfig.interaction_strength
    - NonLinearTargetConfig.n_confounders
    - NonLinearTargetConfig.n_regimes
    - NonLinearTargetConfig.noise_heteroscedasticity
    - NonLinearTargetConfig.nonlinear_interactions
    - NonLinearTargetConfig.polynomial_degree
    - NonLinearTargetConfig.regime_method
    - NonLinearTargetConfig.regime_overlap
    - NonLinearTargetConfig.signal_to_confound_ratio
    - NonLinearTargetConfig.spectral_masking
    - NonLinearTargetConfig.temporal_drift
  - NonLinearTargetProcessor
  - TargetGenerator
  - generate_classification_targets()
  - generate_regression_targets()
- nirs4all.data.synthetic.validation module
DatasetComparisonResult: dataset_name, n_real_samples, n_synthetic_samples, realism_score, tstr_r2, trts_r2, summary()
MetricResult, RealismMetric
SpectralRealismScore: correlation_length_overlap, derivative_ks_pvalue, peak_density_ratio, baseline_curvature_overlap, snr_magnitude_match, adversarial_auc, overall_pass, metric_results, warnings, summary(), to_dict()
ValidationError, compute_adversarial_validation_auc(), compute_baseline_curvature(), compute_correlation_length(), compute_derivative_statistics(), compute_distribution_overlap(), compute_peak_density(), compute_snr(), compute_spectral_realism_scorecard(), quick_realism_check(), validate_against_benchmark(), validate_concentrations(), validate_spectra(), validate_synthetic_output(), validate_wavelengths()
- nirs4all.data.synthetic.wavenumber module
CombinationBandResult, OvertoneResult, apply_hydrogen_bonding_shift(), calculate_combination_band(), calculate_overtone_position(), classify_wavelength_zone(), convert_bandwidth_to_wavelength(), convert_bandwidth_to_wavenumber(), estimate_bandwidth_broadening(), get_all_zones_wavelength(), get_nir_overtones_for_fundamental(), get_zone_wavelength_range(), wavelength_to_wavenumber(), wavenumber_to_wavelength()
Module contents
Synthetic NIRS Data Generation Module.
This module provides tools for generating realistic synthetic NIRS spectra for testing, examples, benchmarking, and ML research.
- Key Features:
Physically-motivated generation based on Beer-Lambert law
Voigt profile peak shapes (convolution of a Gaussian and a Lorentzian)
Realistic NIR band positions from known spectroscopic databases
Configurable complexity levels (simple, realistic, complex)
Batch/session effects for domain adaptation research
Direct SpectroDataset creation for pipeline integration
- Quick Start:
>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator
>>>
>>> # Simple generation
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=1000)
>>>
>>> # Create a SpectroDataset
>>> dataset = generator.create_dataset(n_train=800, n_test=200)

>>> # Use predefined components
>>> from nirs4all.data.synthetic import ComponentLibrary
>>> library = ComponentLibrary.from_predefined(["water", "protein", "lipid"])
>>> generator = SyntheticNIRSGenerator(component_library=library)
See also
nirs4all.generate: Top-level generation API
SyntheticDatasetBuilder: Fluent dataset construction
References
Workman Jr, J., & Weyer, L. (2012). Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. CRC Press.
Burns, D. A., & Ciurczak, E. W. (2007). Handbook of Near-Infrared Analysis. CRC Press.
- class nirs4all.data.synthetic.ATRConfig(crystal_material: str = 'diamond', crystal_refractive_index: float = 2.4, incidence_angle: float = 45.0, n_reflections: int = 1, sample_refractive_index: float = 1.5)[source]
Bases: object
Configuration for Attenuated Total Reflectance mode.
ATR uses internal reflection within a high-refractive-index crystal. The evanescent wave penetrates into the sample, with penetration depth depending on wavelength.
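The wavelength dependence mentioned above can be illustrated with the standard textbook expression for the evanescent-wave penetration depth, d_p = λ / (2π·n₁·√(sin²θ − (n₂/n₁)²)). The sketch below plugs in the ATRConfig defaults; the helper name is hypothetical, and whether the library evaluates exactly this formula is an assumption.

```python
import math

def penetration_depth_nm(wavelength_nm, n_crystal=2.4, n_sample=1.5, angle_deg=45.0):
    """Evanescent-wave penetration depth (nm) from the textbook ATR formula.

    Hypothetical helper, not part of nirs4all. Defaults mirror ATRConfig.
    """
    theta = math.radians(angle_deg)
    term = math.sin(theta) ** 2 - (n_sample / n_crystal) ** 2
    if term <= 0:
        # Below the critical angle there is no total internal reflection.
        raise ValueError("incidence angle is below the critical angle")
    return wavelength_nm / (2 * math.pi * n_crystal * math.sqrt(term))

d_1000 = penetration_depth_nm(1000.0)
d_2500 = penetration_depth_nm(2500.0)
```

Because the depth is proportional to wavelength, bands at the long-wavelength end of the NIR range probe more of the sample than bands near 1000 nm.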
- class nirs4all.data.synthetic.AcceleratedArrays(backend: AcceleratorBackend, zeros: Callable, ones: Callable, arange: Callable, linspace: Callable, array: Callable, exp: Callable, log: Callable, sqrt: Callable, sin: Callable, cos: Callable, sum: Callable, dot: Callable, matmul: Callable, random_normal: Callable, random_uniform: Callable, to_numpy: Callable)[source]
Bases: object
Container for accelerated array operations.
- backend: AcceleratorBackend
- class nirs4all.data.synthetic.AcceleratedGenerator(backend: AcceleratorBackend | None = None, random_state: int | None = None)[source]
Bases: object
GPU-accelerated synthetic spectrum generator.
This class provides a high-level interface for generating large batches of synthetic spectra using GPU acceleration when available.
- Parameters:
backend – Backend to use (auto-detect if None).
random_state – Random state for reproducibility.
Example
>>> import numpy as np
>>> gen = AcceleratedGenerator(random_state=42)
>>> print(f"Using backend: {gen.backend}")
>>>
>>> # Generate 10000 spectra
>>> # E: component spectra (n_components, 700); C: concentrations (10000, n_components)
>>> X = gen.generate_batch(
...     n_samples=10000,
...     wavelengths=np.linspace(1000, 2500, 700),
...     component_spectra=E,
...     concentrations=C,
... )
- generate_batch(n_samples: int, wavelengths: ndarray, component_spectra: ndarray, concentrations: ndarray, noise_level: float = 0.01) ndarray[source]
Generate a batch of spectra.
- Parameters:
n_samples – Number of samples.
wavelengths – Wavelength array.
component_spectra – Component spectra (n_components, n_wavelengths).
concentrations – Concentrations (n_samples, n_components).
noise_level – Noise level.
- Returns:
Generated spectra (n_samples, n_wavelengths).
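Given the documented shapes and the Beer-Lambert basis of the module, a batch can be pictured as a linear mixture of component spectra plus noise. The following is a minimal pure-Python sketch of that idea, assuming additive Gaussian noise with standard deviation `noise_level`; `mix_spectra` is a hypothetical name, and the library's actual implementation may differ.

```python
import random

def mix_spectra(concentrations, component_spectra, noise_level=0.01, seed=None):
    """Return spectra[i][k] = sum_j C[i][j] * E[j][k] + Gaussian noise."""
    rng = random.Random(seed)
    n_wavelengths = len(component_spectra[0])
    spectra = []
    for row in concentrations:
        spectrum = []
        for k in range(n_wavelengths):
            # Beer-Lambert: absorbance is linear in each concentration.
            absorbance = sum(c * e[k] for c, e in zip(row, component_spectra))
            spectrum.append(absorbance + rng.gauss(0.0, noise_level))
        spectra.append(spectrum)
    return spectra

E = [[1.0, 0.5, 0.0], [0.0, 0.2, 0.8]]  # 2 components x 3 wavelengths
C = [[0.5, 0.5], [1.0, 0.0]]            # 2 samples x 2 components
X_clean = mix_spectra(C, E, noise_level=0.0)
```

With NumPy arrays the noise-free part reduces to the single matrix product `C @ E`, which is the kind of operation that benefits from a GPU backend.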
- generate_voigt_profiles(wavelengths: ndarray, centers: ndarray, amplitudes: ndarray, sigmas: ndarray, gammas: ndarray) ndarray[source]
Generate Voigt profiles for component spectra.
- Parameters:
wavelengths – Wavelength array.
centers – Band centers.
amplitudes – Band amplitudes.
sigmas – Gaussian widths.
gammas – Lorentzian widths.
- Returns:
Spectrum array.
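An exact Voigt profile requires the Faddeeva function (for example via `scipy.special.voigt_profile`). As a dependency-free illustration, the sketch below uses the Thompson-Cox-Hastings pseudo-Voigt approximation instead, which is a swapped-in technique and not necessarily what the library computes. It assumes `sigmas` are Gaussian standard deviations and `gammas` are Lorentzian half-widths at half-maximum; both function names are hypothetical.

```python
import math

def pseudo_voigt(x, center, amplitude, sigma, gamma):
    """Unit-height pseudo-Voigt scaled by `amplitude`.

    sigma: Gaussian standard deviation; gamma: Lorentzian HWHM.
    At least one of the two widths must be positive.
    """
    f_g = 2.0 * sigma * math.sqrt(2.0 * math.log(2.0))  # Gaussian FWHM
    f_l = 2.0 * gamma                                    # Lorentzian FWHM
    # Thompson-Cox-Hastings estimate of the combined FWHM.
    f = (f_g**5 + 2.69269 * f_g**4 * f_l + 2.42843 * f_g**3 * f_l**2
         + 4.47163 * f_g**2 * f_l**3 + 0.07842 * f_g * f_l**4 + f_l**5) ** 0.2
    r = f_l / f
    eta = 1.36603 * r - 0.47719 * r**2 + 0.11116 * r**3  # Lorentzian fraction
    d = x - center
    gauss = math.exp(-4.0 * math.log(2.0) * d * d / (f * f))
    lorentz = 1.0 / (1.0 + 4.0 * d * d / (f * f))
    return amplitude * ((1.0 - eta) * gauss + eta * lorentz)

def band_spectrum(wavelengths, centers, amplitudes, sigmas, gammas):
    """Sum one pseudo-Voigt band per (center, amplitude, sigma, gamma)."""
    bands = list(zip(centers, amplitudes, sigmas, gammas))
    return [sum(pseudo_voigt(w, c, a, s, g) for c, a, s, g in bands)
            for w in wavelengths]

grid = [1400.0 + 2.0 * i for i in range(51)]  # 1400 to 1500 nm
profile = band_spectrum(grid, [1450.0], [1.0], [10.0], [5.0])
```

Setting `gamma` to zero recovers a pure Gaussian, and setting `sigma` to zero recovers a pure Lorentzian, which makes the approximation easy to sanity-check.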
- class nirs4all.data.synthetic.AcceleratorBackend(value)[source]
Available acceleration backends.
- CUPY = 'cupy'
- JAX = 'jax'
- NUMPY = 'numpy'
- class nirs4all.data.synthetic.BatchEffectConfig(enabled: bool = False, n_batches: int = 3, offset_std: float = 0.02, gain_std: float = 0.03)[source]
Bases: object
Configuration for batch/session effects simulation.
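The parameter names suggest a per-batch additive offset and multiplicative gain. One way to picture this is sketched below, assuming each batch draws a gain from N(1, gain_std) and an offset from N(0, offset_std), and that samples are assigned to batches round-robin. These choices and the function name are illustrative assumptions, not the library's documented behaviour.

```python
import random

def apply_batch_effects(spectra, n_batches=3, offset_std=0.02, gain_std=0.03, seed=None):
    """Apply one (gain, offset) pair per batch: x' = gain * x + offset."""
    rng = random.Random(seed)
    params = [(rng.gauss(1.0, gain_std), rng.gauss(0.0, offset_std))
              for _ in range(n_batches)]
    # Round-robin batch assignment (an assumption made for this sketch).
    batch_ids = [i % n_batches for i in range(len(spectra))]
    shifted = [[params[b][0] * v + params[b][1] for v in row]
               for row, b in zip(spectra, batch_ids)]
    return shifted, batch_ids

spectra = [[0.1, 0.2, 0.3] for _ in range(6)]
shifted, batch_ids = apply_batch_effects(spectra, seed=0)
unchanged, _ = apply_batch_effects(spectra, offset_std=0.0, gain_std=0.0)
```

Samples in the same batch share the same distortion, which is what makes such data useful for domain adaptation experiments.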
- class nirs4all.data.synthetic.BenchmarkDatasetInfo(name: str, full_name: str, domain: BenchmarkDomain, n_samples: int, n_wavelengths: int, wavelength_range: Tuple[float, float], targets: List[str], sample_type: str, measurement_mode: str, source_url: str, reference: str, license: str = 'Unknown', typical_snr: Tuple[float, float] = (50, 500), typical_peak_density: Tuple[float, float] = (1.0, 5.0), notes: str = '')[source]
Bases: object
Metadata for a benchmark dataset.
- domain
Application domain.
- domain: BenchmarkDomain
- class nirs4all.data.synthetic.BenchmarkDomain(value)[source]
Domains for benchmark datasets.
- AGRICULTURE = 'agriculture'
- ENVIRONMENTAL = 'environmental'
- FOOD = 'food'
- GENERAL = 'general'
- PETROCHEMICAL = 'petrochemical'
- PHARMACEUTICAL = 'pharmaceutical'
- class nirs4all.data.synthetic.CSVVariationGenerator[source]
Bases: object
Generate CSV files with various format variations for loader testing.
This class creates CSV files with different delimiters, encodings, header formats, and other variations to test the robustness of CSV loaders.
- base_exporter
DatasetExporter for actual file writing.
Example
>>> generator = CSVVariationGenerator()
>>>
>>> # Generate all variations
>>> paths = generator.generate_all_variations(
...     "test_data",
...     X, y,
...     wavelengths=wavelengths
... )
>>>
>>> # Generate specific variation
>>> path = generator.with_semicolon_delimiter(
...     "data_semicolon",
...     X, y
... )
- as_fragmented(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]
Create fragmented dataset with multiple small files.
- as_single_file(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]
Create single CSV file with all data and partition column.
- generate_all_variations(base_path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Dict[str, Path][source]
Generate CSV files with all format variations.
Creates multiple versions of the dataset with different CSV format options for comprehensive loader testing.
- Parameters:
base_path – Base output folder path.
X – Feature matrix.
y – Target values.
wavelengths – Optional wavelength values.
train_ratio – Train/test split ratio.
random_state – Random seed.
- Returns:
Dictionary mapping variation name to created path.
Example
>>> paths = generator.generate_all_variations(
...     "test_variations",
...     X, y,
...     random_state=42
... )
>>> print(paths.keys())
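To show what "format variations" means in practice, here is a small standard-library sketch that renders the same data with different delimiters, precision, and an optional row index. The function, the column names (`index`, `f0`, `y`), and the variation keys are hypothetical; the files actually produced by `generate_all_variations` may be laid out differently.

```python
import csv
import io

def render_csv(X, y, wavelengths=None, delimiter=";", precision=6, row_index=False):
    """Render features and a single target column as CSV text."""
    n_features = len(X[0])
    if wavelengths is not None:
        names = [str(w) for w in wavelengths]
    else:
        names = [f"f{i}" for i in range(n_features)]
    header = (["index"] if row_index else []) + names + ["y"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    for i, (row, target) in enumerate(zip(X, y)):
        cells = [f"{v:.{precision}f}" for v in row] + [f"{target:.{precision}f}"]
        writer.writerow(([i] if row_index else []) + cells)
    return buffer.getvalue()

X = [[0.123456789, 0.5], [0.25, 0.75]]
y = [1.0, 2.0]
variations = {
    "semicolon": render_csv(X, y, wavelengths=[1000, 1002]),
    "comma": render_csv(X, y, wavelengths=[1000, 1002], delimiter=","),
    "precision_3": render_csv(X, y, precision=3),
    "row_index": render_csv(X, y, row_index=True),
}
```

A loader that handles every entry in such a dictionary identically is robust to the most common CSV dialect differences.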
- with_comma_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]
Create CSV with comma delimiter.
- with_precision(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None, precision: int = 6) Path[source]
Create CSV with specified floating point precision.
- with_row_index(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]
Create CSV with row index column.
- with_semicolon_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) Path[source]
Create CSV with semicolon delimiter (nirs4all default).
- class nirs4all.data.synthetic.ClassSeparationConfig(separation: float = 1.5, method: Literal['component', 'shift', 'intensity'] = 'component', noise: float = 0.1)[source]
Bases: object
Configuration for class separation in classification tasks.
- separation
Separation factor (higher = more separable). Values around 0.5-1.0 create overlapping classes. Values around 2.0-3.0 create well-separated classes.
- Type:
float
- method
How to create class differences:
  - “component”: Different component concentration profiles per class.
  - “shift”: Systematic spectral shifts between classes.
  - “intensity”: Different overall intensity levels.
- Type:
Literal[‘component’, ‘shift’, ‘intensity’]
- class nirs4all.data.synthetic.CombinationBandResult(mode1_cm: float, mode2_cm: float, wavenumber_cm: float, wavelength_nm: float, amplitude_factor: float, band_type: str)[source]
Bases: object
Result of combination band calculation.
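The field names pair two mode wavenumbers with a resulting wavenumber and wavelength. In the simplest harmonic picture the combination band sits at the sum of the two wavenumbers, and the wavelength follows from λ(nm) = 10⁷ / ν̃(cm⁻¹). The sketch below shows only that arithmetic; the function is hypothetical and ignores anharmonic corrections, which a real calculation would normally include.

```python
def combination_band(mode1_cm, mode2_cm):
    """Harmonic estimate of a combination band position.

    Returns (wavenumber in cm^-1, wavelength in nm).
    """
    wavenumber_cm = mode1_cm + mode2_cm
    wavelength_nm = 1e7 / wavenumber_cm  # 1 cm = 1e7 nm
    return wavenumber_cm, wavelength_nm

# Illustrative inputs: roughly the O-H stretch and H-O-H bend of liquid water.
wavenumber_cm, wavelength_nm = combination_band(3400.0, 1640.0)
```

The harmonic estimate lands near 1984 nm, somewhat above where the water combination band is usually observed, which shows why anharmonicity matters for realistic band placement.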
- class nirs4all.data.synthetic.ComponentLibrary(random_state: int | None = None)[source]
Bases: object
Library of spectral components for synthetic NIRS generation.
Supports both predefined components (based on known NIR band assignments) and programmatically generated random components for research purposes.
- rng
NumPy random generator for reproducibility.
Example
>>> import numpy as np
>>> # Create from predefined components
>>> library = ComponentLibrary.from_predefined(
...     ["water", "protein", "lipid"],
...     random_state=42
... )
>>>
>>> # Or generate random components
>>> library = ComponentLibrary(random_state=42)
>>> library.generate_random_library(n_components=5)
>>>
>>> # Compute all component spectra
>>> wavelengths = np.arange(1000, 2500, 2)
>>> E = library.compute_all(wavelengths)  # shape: (n_components, n_wavelengths)
- __getitem__(name: str) SpectralComponent[source]
Get component by name.
- add_component(component: SpectralComponent) ComponentLibrary[source]
Add a spectral component to the library.
- Parameters:
component – SpectralComponent to add.
- Returns:
Self for method chaining.
- add_random_component(name: str, n_bands: int = 3, wavelength_range: Tuple[float, float] = (1000, 2500), zones: List[Tuple[float, float]] | None = None) SpectralComponent[source]
Generate and add a random spectral component.
Creates a component with randomly placed absorption bands within the specified wavelength range or zones.
- Parameters:
name – Component name.
n_bands – Number of absorption bands to generate.
wavelength_range – Overall wavelength range for band placement.
zones – Optional list of (min, max) wavelength zones for band centers. If None, uses default NIR-relevant zones.
- Returns:
The generated SpectralComponent.
Example
>>> library = ComponentLibrary(random_state=42)
>>> component = library.add_random_component(
...     "random_compound",
...     n_bands=4,
...     wavelength_range=(1000, 2500)
... )
- property components: Dict[str, SpectralComponent]
Get all components in the library.
- compute_all(wavelengths: ndarray) ndarray[source]
Compute spectra for all components at given wavelengths.
- Parameters:
wavelengths – Array of wavelengths in nm.
- Returns:
Array of shape (n_components, n_wavelengths) containing the spectrum of each component.
Example
>>> import numpy as np
>>> library = ComponentLibrary.from_predefined(["water", "protein"])
>>> wavelengths = np.arange(1000, 2500, 2)  # 750 points: 1000, 1002, ..., 2498
>>> E = library.compute_all(wavelengths)
>>> print(E.shape)
(2, 750)
- classmethod from_predefined(component_names: List[str] | None = None, random_state: int | None = None) ComponentLibrary[source]
Create a library from predefined spectral components.
- Parameters:
component_names – List of component names to include. If None, includes all predefined components.
random_state – Random seed for reproducibility.
- Returns:
ComponentLibrary instance populated with predefined components.
- Raises:
ValueError – If an unknown component name is specified.
Example
>>> library = ComponentLibrary.from_predefined(
...     ["water", "protein", "lipid"]
... )
- generate_random_library(n_components: int = 5, n_bands_range: Tuple[int, int] = (2, 6)) ComponentLibrary[source]
Generate a library of random spectral components.
- Parameters:
n_components – Number of components to generate.
n_bands_range – Range (min, max) for number of bands per component.
- Returns:
Self for method chaining.
Example
>>> library = ComponentLibrary(random_state=42)
>>> library.generate_random_library(n_components=5, n_bands_range=(2, 5))
- class nirs4all.data.synthetic.ConcentrationPrior(distribution: str = 'uniform', params: ~typing.Dict[str, float] = <factory>, min_value: float = 0.0, max_value: float = 1.0)[source]
Bases: object
Prior distribution for component concentrations.
- class nirs4all.data.synthetic.DatasetComparisonResult(dataset_name: str, n_real_samples: int, n_synthetic_samples: int, realism_score: SpectralRealismScore, tstr_r2: float | None = None, trts_r2: float | None = None)[source]
Bases: object
Result of comparing synthetic data against a benchmark dataset.
- realism_score
The spectral realism score.
- realism_score: SpectralRealismScore
- class nirs4all.data.synthetic.DatasetExporter(config: ExportConfig | None = None)[source]
Bases: object
Export synthetic datasets to various file formats.
This class provides methods for exporting synthetic NIRS datasets to files and folders compatible with nirs4all’s data loaders.
- config
Export configuration settings.
- Parameters:
config – Optional ExportConfig. Uses defaults if None.
Example
>>> exporter = DatasetExporter()
>>>
>>> # Export to standard folder structure
>>> path = exporter.to_folder(
...     "output/data",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=wavelengths
... )
>>>
>>> # Export to single CSV
>>> path = exporter.to_csv(
...     "output/all_data.csv",
...     X, y,
...     wavelengths=wavelengths
... )
- to_csv(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, metadata: Dict[str, ndarray] | None = None, include_targets: bool = True) Path[source]
Export dataset to a single CSV file.
Creates a CSV file with features (and optionally targets) combined.
- Parameters:
path – Output file path.
X – Feature matrix (n_samples, n_features).
y – Target values (n_samples,) or (n_samples, n_targets).
wavelengths – Optional wavelength values for column headers.
metadata – Optional dict of metadata arrays.
include_targets – Whether to include target column(s).
- Returns:
Path to created file.
Example
>>> exporter.to_csv("data.csv", X, y, wavelengths=wavelengths)
- to_folder(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, wavelengths: ndarray | None = None, metadata: Dict[str, ndarray] | None = None, random_state: int | None = None, format: Literal['standard', 'single', 'fragmented'] | None = None) Path[source]
Export dataset to a folder structure.
Creates a folder with CSV files compatible with nirs4all’s DatasetConfigs loader.
- Parameters:
path – Output folder path.
X – Feature matrix (n_samples, n_features).
y – Target values (n_samples,) or (n_samples, n_targets).
train_ratio – Proportion for training set.
wavelengths – Optional wavelength values for column headers.
metadata – Optional dict of metadata arrays (same length as X).
random_state – Random seed for train/test split.
format – Override config format for this export.
- Returns:
Path to created folder.
- Raises:
ValueError – If X and y have incompatible shapes.
ImportError – If pandas is not available.
Example
>>> import numpy as np
>>> exporter.to_folder(
...     "data/synthetic",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=np.arange(1000, 2500, 2)
... )
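The `train_ratio` and `random_state` parameters imply a reproducible shuffled split before the files are written. A minimal sketch of such a split is shown below; it is an assumption about the general approach, and `split_indices` is a hypothetical helper, not a nirs4all function.

```python
import random

def split_indices(n_samples, train_ratio=0.8, seed=None):
    """Shuffle sample indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(10, train_ratio=0.8, seed=42)
```

Using the same seed yields the same partition, so exports made with a fixed `random_state` stay comparable across runs.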
- to_numpy(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, compressed: bool = False) Path[source]
Export dataset to numpy .npy or .npz format.
- Parameters:
path – Output file path (without extension).
X – Feature matrix (n_samples, n_features).
y – Target values.
wavelengths – Optional wavelength values.
compressed – Whether to use compressed format (.npz).
- Returns:
Path to created file.
Example
>>> exporter.to_numpy("data", X, y, compressed=True)
- class nirs4all.data.synthetic.DetectorConfig(detector_type: ~nirs4all.data.synthetic.instruments.DetectorType = DetectorType.INGAAS, temperature_k: float = 293.0, integration_time_ms: float = 100.0, gain: float = 1.0, noise_model: ~nirs4all.data.synthetic.detectors.NoiseModelConfig = <factory>, apply_response_curve: bool = True, apply_nonlinearity: bool = False, nonlinearity_coefficient: float = 0.02)[source]
Bases: object
Complete detector configuration.
- detector_type
Type of detector.
- noise_model
Noise model configuration.
- detector_type: DetectorType = 'ingaas'
- noise_model: NoiseModelConfig
- class nirs4all.data.synthetic.DetectorSimulator(config: DetectorConfig | None = None, random_state: int | None = None)[source]
Bases: object
Simulate detector effects on NIR spectra.
Applies detector spectral response, noise models, and nonlinearity to synthetic spectra.
- config
Detector configuration.
- rng
Random number generator.
Example
>>> config = DetectorConfig(detector_type=DetectorType.INGAAS)
>>> simulator = DetectorSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, base_signal_level: float = 1.0) ndarray[source]
Apply detector effects to spectra.
- Parameters:
spectra – Input spectra (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
base_signal_level – Reference signal level for noise scaling.
- Returns:
Spectra with detector effects applied.
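As a rough picture of signal-dependent detector noise, the sketch below combines a constant read-noise term with a shot-noise term whose variance grows linearly with the signal. The parameter names `read_noise` and `shot_coeff`, their values, and the way `base_signal_level` scales the signal are all assumptions made for illustration; the library's noise model is configured through NoiseModelConfig and may differ.

```python
import math
import random

def noise_sigma(signal, read_noise=1e-4, shot_coeff=1e-3):
    """Standard deviation of read noise plus shot noise at a given signal."""
    # Variances add: a constant read-noise term plus a term linear in signal.
    return math.sqrt(read_noise**2 + shot_coeff**2 * max(signal, 0.0))

def add_detector_noise(spectrum, base_signal_level=1.0, seed=None):
    """Add zero-mean Gaussian noise whose width depends on the local signal."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_sigma(v * base_signal_level))
            for v in spectrum]

spectrum = [0.0, 0.5, 1.0]
noisy = add_detector_noise(spectrum, seed=42)
```

At zero signal only the read-noise floor remains, while stronger signals carry proportionally more shot noise.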
- class nirs4all.data.synthetic.DetectorSpectralResponse(detector_type: DetectorType, wavelengths: ndarray, response: ndarray, peak_wavelength: float, cutoff_wavelength: float, short_cutoff: float, peak_qe: float = 0.7)[source]
Bases: object
Spectral response curve for a detector.
Defines the wavelength-dependent sensitivity (quantum efficiency) of the detector.
- detector_type
Type of detector.
- wavelengths
Wavelength grid for response curve (nm).
- Type:
numpy.ndarray
- response
Relative response at each wavelength (0-1).
- Type:
numpy.ndarray
- detector_type: DetectorType
- class nirs4all.data.synthetic.DetectorType(value)[source]
Types of NIR detectors.
- INGAAS = 'ingaas'
- INGAAS_EXTENDED = 'ingaas_ext'
- MCT = 'mct'
- MEMS = 'mems'
- PBS = 'pbs'
- PBSE = 'pbse'
- SI = 'si'
- class nirs4all.data.synthetic.DomainCategory(value)[source]
Top-level domain categories.
- AGRICULTURE = 'agriculture'
- BEVERAGE = 'beverage'
- BIOMEDICAL = 'biomedical'
- ENVIRONMENTAL = 'environmental'
- FOOD = 'food'
- PETROCHEMICAL = 'petrochemical'
- PHARMACEUTICAL = 'pharmaceutical'
- POLYMER = 'polymer'
- TEXTILE = 'textile'
- class nirs4all.data.synthetic.DomainConfig(name: str, category: ~nirs4all.data.synthetic.domains.DomainCategory, description: str = '', typical_components: ~typing.List[str] = <factory>, component_weights: ~typing.Dict[str, float] | None = None, concentration_priors: ~typing.Dict[str, ~nirs4all.data.synthetic.domains.ConcentrationPrior] = <factory>, wavelength_range: ~typing.Tuple[float, float] = (1000, 2500), n_components_range: ~typing.Tuple[int, int] = (3, 8), noise_level: str = 'medium', measurement_mode: str = 'reflectance', typical_sample_types: ~typing.List[str] = <factory>, complexity: str = 'realistic', additional_params: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Configuration for a specific application domain.
Encapsulates all domain-specific parameters needed for generating realistic synthetic NIRS data.
- category
Domain category (agriculture, pharmaceutical, etc.).
- component_weights
Relative importance of each component (for selection).
- concentration_priors
Per-component concentration distributions.
- Type:
Dict[str, nirs4all.data.synthetic.domains.ConcentrationPrior]
- category: DomainCategory
- concentration_priors: Dict[str, ConcentrationPrior]
- sample_components(rng: Generator, n_components: int | None = None) List[str][source]
Sample components for a sample based on domain priors.
- Parameters:
rng – Random number generator.
n_components – Number of components. If None, samples from range.
- Returns:
List of component names.
- sample_concentrations(rng: Generator, components: List[str], n_samples: int = 1) ndarray[source]
Sample concentrations for selected components.
- Parameters:
rng – Random number generator.
components – List of component names.
n_samples – Number of samples.
- Returns:
Concentration matrix (n_samples, n_components).
- class nirs4all.data.synthetic.DomainInference(domain_name: str = 'unknown', category: str = 'unknown', confidence: float = 0.0, detected_components: ~typing.List[str] = <factory>, alternative_domains: ~typing.Dict[str, float] = <factory>)[source]
Bases: object
Results of application domain inference.
- class nirs4all.data.synthetic.EMSCConfig(polynomial_order: int = 2, multiplicative_scatter_std: float = 0.15, additive_scatter_std: float = 0.05, include_wavelength_terms: bool = True, wavelength_coef_std: float = 0.02, reference_spectrum: ndarray | None = None)[source]
Bases: object
Configuration for EMSC-style scattering transformation.
EMSC models scattering distortion as: x = a + b*x_ref + d*λ + e*λ² + …
where a is the additive scatter offset, b is the multiplicative scatter factor, and the higher-order terms (d, e, …) model baseline curvature due to scattering.
- reference_spectrum
Optional reference spectrum for EMSC.
- Type:
numpy.ndarray | None
- class nirs4all.data.synthetic.EMSCTransformSimulator(config: EMSCConfig | None = None, random_state: int | None = None)[source]
Bases: object
Simulate EMSC-style scattering distortions.
Applies the inverse of Extended Multiplicative Scatter Correction, generating realistic scatter distortions that EMSC would correct.
EMSC models spectra as: x = a + b*m + d*λ + e*λ² + … where m is a reference spectrum.
This simulator generates a, b, d, e, … to create scatter distortions.
- config
EMSC configuration.
- rng
Random number generator.
Example
>>> config = EMSCConfig(polynomial_order=2)
>>> simulator = EMSCTransformSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, reference_spectrum: ndarray | None = None) ndarray[source]
Apply EMSC-style scattering distortions.
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
reference_spectrum – Optional reference spectrum. If None, uses mean of input spectra or config reference.
- Returns:
Modified spectra with scatter distortions applied.
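The model x = a + b*m + d*λ + e*λ² can be sketched directly. In the code below, a is the additive offset and b the multiplicative factor on the reference spectrum m. Rescaling λ to [-1, 1] and drawing the coefficients from normal distributions parameterised by the EMSCConfig defaults are assumptions of this sketch, and both function names are hypothetical.

```python
import random

def emsc_distort(reference, wavelengths, a=0.0, b=1.0, d=0.0, e=0.0):
    """Return a + b*m + d*t + e*t**2, with wavelength rescaled to t in [-1, 1]."""
    lo, hi = min(wavelengths), max(wavelengths)
    t_values = [2.0 * (w - lo) / (hi - lo) - 1.0 for w in wavelengths]
    return [a + b * m + d * t + e * t * t for m, t in zip(reference, t_values)]

def sample_coefficients(rng, additive_std=0.05, multiplicative_std=0.15,
                        wavelength_coef_std=0.02):
    """Draw one set of scatter coefficients (assumed normal distributions)."""
    return {
        "a": rng.gauss(0.0, additive_std),
        "b": rng.gauss(1.0, multiplicative_std),
        "d": rng.gauss(0.0, wavelength_coef_std),
        "e": rng.gauss(0.0, wavelength_coef_std),
    }

wavelengths = [1000.0, 1500.0, 2000.0, 2500.0]
reference = [0.2, 0.4, 0.3, 0.5]
identity = emsc_distort(reference, wavelengths)
tilted = emsc_distort(reference, wavelengths, d=0.1)
distorted = emsc_distort(reference, wavelengths,
                         **sample_coefficients(random.Random(42)))
```

With the default coefficients the reference passes through unchanged, which is the state an EMSC correction would aim to recover from a distorted spectrum.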
- class nirs4all.data.synthetic.EnvironmentalEffectsConfig(temperature: ~nirs4all.data.synthetic.environmental.TemperatureConfig = <factory>, moisture: ~nirs4all.data.synthetic.environmental.MoistureConfig = <factory>, enable_temperature: bool = True, enable_moisture: bool = True)[source]
Bases: object
Combined configuration for all environmental effects.
- temperature
Temperature effect configuration.
- moisture
Moisture effect configuration.
- moisture: MoistureConfig
- temperature: TemperatureConfig
- class nirs4all.data.synthetic.EnvironmentalEffectsSimulator(config: EnvironmentalEffectsConfig | None = None, random_state: int | None = None)[source]
Bases: object
Combined simulator for all environmental effects.
Applies temperature and moisture effects in the correct order with proper interactions.
- config
Environmental effects configuration.
- temperature_sim
Temperature effect simulator.
- moisture_sim
Moisture effect simulator.
- rng
Random number generator.
Example
>>> config = EnvironmentalEffectsConfig(
...     temperature=TemperatureConfig(sample_temperature=40.0),
...     moisture=MoistureConfig(water_activity=0.7)
... )
>>> simulator = EnvironmentalEffectsSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, sample_temperatures: ndarray | None = None, water_activities: ndarray | None = None) ndarray[source]
Apply all environmental effects to spectra.
Effects are applied in order:
  1. Moisture effects (with temperature interaction)
  2. Temperature effects
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
sample_temperatures – Optional per-sample temperatures.
water_activities – Optional per-sample water activities.
- Returns:
Modified spectra with all environmental effects applied.
- class nirs4all.data.synthetic.EnvironmentalInference(estimated_temperature_variation: float = 0.0, has_temperature_effects: bool = False, estimated_moisture_variation: float = 0.0, has_moisture_effects: bool = False, water_band_shift: float = 0.0)[source]
Bases: object
Results of environmental effects inference.
- class nirs4all.data.synthetic.ExportConfig(format: Literal['standard', 'single', 'fragmented'] = 'standard', separator: str = ';', float_precision: int = 6, include_headers: bool = True, include_index: bool = False, compression: Literal['gzip', 'zip'] | None = None, file_extension: str = '.csv')[source]
Bases: object
Configuration for dataset export.
- format
Export format (‘standard’, ‘single’, ‘fragmented’).
  - ‘standard’: Separate Xcal, Ycal, Xval, Yval files.
  - ‘single’: All data in one file with partition column.
  - ‘fragmented’: Multiple small files (for loader testing).
- Type:
Literal[‘standard’, ‘single’, ‘fragmented’]
- compression
Optional compression (‘gzip’, ‘zip’, None).
- Type:
Literal[‘gzip’, ‘zip’] | None
- class nirs4all.data.synthetic.FeatureConfig(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', n_components: int | None = None, component_names: List[str] | None = None)[source]
Bases: object
Configuration for spectral feature generation.
- complexity
Complexity level affecting noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.
- Type:
Literal[‘simple’, ‘realistic’, ‘complex’]
- class nirs4all.data.synthetic.FittedParameters(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, global_slope_mean: float = 0.0, global_slope_std: float = 0.02, noise_base: float = 0.001, noise_signal_dep: float = 0.005, path_length_std: float = 0.05, baseline_amplitude: float = 0.02, scatter_alpha_std: float = 0.05, scatter_beta_std: float = 0.01, tilt_std: float = 0.01, complexity: str = 'realistic', source_name: str = '', source_properties: ~nirs4all.data.synthetic.fitter.SpectralProperties | None = None, inferred_instrument: str = 'unknown', instrument_inference: ~nirs4all.data.synthetic.fitter.InstrumentInference | None = None, measurement_mode: str = 'transmittance', measurement_mode_confidence: float = 0.0, inferred_domain: str = 'unknown', domain_inference: ~nirs4all.data.synthetic.fitter.DomainInference | None = None, environmental_inference: ~nirs4all.data.synthetic.fitter.EnvironmentalInference | None = None, temperature_config: ~typing.Dict[str, ~typing.Any] = <factory>, moisture_config: ~typing.Dict[str, ~typing.Any] = <factory>, scattering_inference: ~nirs4all.data.synthetic.fitter.ScatteringInference | None = None, particle_size_config: ~typing.Dict[str, ~typing.Any] = <factory>, emsc_config: ~typing.Dict[str, ~typing.Any] = <factory>, detected_components: ~typing.List[str] = <factory>, suggested_n_components: int = 5)[source]
Bases: object
Parameters fitted from real data for synthetic generation.
This dataclass contains all parameters needed to configure a SyntheticNIRSGenerator to produce spectra similar to a real dataset, including Phase 1-4 enhanced features.
- # Basic wavelength grid
- # Slope and baseline parameters
- # Noise parameters
- # Scatter parameters
- # Complexity
- # Source metadata
- source_properties
Full SpectralProperties of source.
- Type:
nirs4all.data.synthetic.fitter.SpectralProperties | None
- # Phase 1-4 Enhanced Parameters
- # Instrument inference
- instrument_inference
Full instrument inference result.
- Type:
nirs4all.data.synthetic.fitter.InstrumentInference | None
- # Measurement mode
- # Domain inference
- domain_inference
Full domain inference result.
- Type:
nirs4all.data.synthetic.fitter.DomainInference | None
- # Environmental effects
- environmental_inference
Environmental effects inference.
- # Scattering effects
- scattering_inference
Scattering effects inference.
- Type:
nirs4all.data.synthetic.fitter.ScatteringInference | None
- # Detected components for procedural generation
- domain_inference: DomainInference | None = None
- environmental_inference: EnvironmentalInference | None = None
- classmethod from_dict(data: Dict[str, Any]) FittedParameters[source]
Create FittedParameters from a dictionary.
- Parameters:
data – Dictionary with parameter values.
- Returns:
FittedParameters instance.
- instrument_inference: InstrumentInference | None = None
- classmethod load(path: str) FittedParameters[source]
Load parameters from JSON file.
- Parameters:
path – Input file path.
- Returns:
FittedParameters instance.
- scattering_inference: ScatteringInference | None = None
- source_properties: SpectralProperties | None = None
- summary() str[source]
Generate a human-readable summary of fitted parameters.
- Returns:
Multi-line summary string.
- to_dict() Dict[str, Any][source]
Convert all parameters to a dictionary.
- Returns:
Dictionary with all parameter values.
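The `to_dict` / `from_dict` pair, together with JSON loading, amounts to a dictionary round trip. The sketch below shows the pattern on a hypothetical three-field stand-in class; it is not the FittedParameters implementation, and ignoring unknown keys in `from_dict` is a design choice of this example only.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class MiniParams:
    """Hypothetical stand-in with three of the documented fields."""
    wavelength_start: float = 1000.0
    wavelength_end: float = 2500.0
    complexity: str = "realistic"

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        # Ignore unknown keys so files with extra fields still load.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)

params = MiniParams(wavelength_start=900.0)
text = json.dumps(params.to_dict())            # serialise to JSON text
restored = MiniParams.from_dict(json.loads(text))
```

Nested inference results would need their own conversion to plain dictionaries before they can be written as JSON.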
- to_full_config() Dict[str, Any][source]
Convert all fitted parameters to a comprehensive configuration.
This includes all Phase 1-4 parameters for complete synthetic data generation matching the source dataset.
- Returns:
Dictionary with all configuration parameters.
Example
>>> params = fitter.fit(X_real)
>>> config = params.to_full_config()
>>> # Use with builder pattern or advanced configuration
- class nirs4all.data.synthetic.FunctionalGroupType(value)[source]
Types of functional groups for component generation.
- AMINE = 'amine'
- AROMATIC_CH = 'aromatic_ch'
- CARBONYL = 'carbonyl'
- CARBOXYL = 'carboxyl'
- HYDROXYL = 'hydroxyl'
- METHYL = 'methyl'
- METHYLENE = 'methylene'
- THIOL = 'thiol'
- VINYL = 'vinyl'
- WATER = 'water'
- class nirs4all.data.synthetic.InstrumentArchetype(name: str, category: ~nirs4all.data.synthetic.instruments.InstrumentCategory, detector_type: ~nirs4all.data.synthetic.instruments.DetectorType, monochromator_type: ~nirs4all.data.synthetic.instruments.MonochromatorType, wavelength_range: ~typing.Tuple[float, float], spectral_resolution: float = 8.0, wavelength_accuracy: float = 0.5, photometric_noise: float = 0.0001, photometric_range: ~typing.Tuple[float, float] = (0.0, 3.0), snr: float = 10000.0, stray_light: float = 0.0001, warm_up_drift: float = 0.1, temperature_sensitivity: float = 0.01, scan_speed: float = 1.0, integration_time_ms: float = 100.0, optical_path: str = 'transmission', multi_sensor: ~nirs4all.data.synthetic.instruments.MultiSensorConfig = <factory>, multi_scan: ~nirs4all.data.synthetic.instruments.MultiScanConfig = <factory>, description: str = '')[source]
Bases: object
Parameterized NIR instrument simulation.
Represents a complete instrument model with optical, electronic, and measurement characteristics. Can be used to generate realistic synthetic spectra that match specific instrument types.
- category
Instrument category (benchtop, handheld, etc.).
- detector_type
Primary detector type.
- monochromator_type
Wavelength selection mechanism.
- multi_sensor
Multi-sensor configuration.
- multi_scan
Multi-scan averaging configuration.
- category: InstrumentCategory
- detector_type: DetectorType
- get_noise_model_params() Dict[str, float][source]
Get noise model parameters based on detector type.
- monochromator_type: MonochromatorType
- multi_scan: MultiScanConfig
- multi_sensor: MultiSensorConfig
- class nirs4all.data.synthetic.InstrumentCategory(value)[source]
Categories of NIR instruments.
- BENCHTOP = 'benchtop'
- DIODE_ARRAY = 'diode_array'
- EMBEDDED = 'embedded'
- FILTER = 'filter'
- FT_NIR = 'ft_nir'
- HANDHELD = 'handheld'
- PROCESS = 'process'
- class nirs4all.data.synthetic.InstrumentInference(archetype_name: str = 'unknown', detector_type: str = 'unknown', wavelength_range: ~typing.Tuple[float, float] = (1000.0, 2500.0), estimated_resolution: float = 8.0, confidence: float = 0.0, alternative_archetypes: ~typing.Dict[str, float] = <factory>)[source]
Bases: object
Results of instrument archetype inference.
- class nirs4all.data.synthetic.InstrumentSimulator(archetype: InstrumentArchetype, random_state: int | None = None)[source]
Bases: object
Apply instrument-specific effects to synthetic spectra.
Simulates the complete instrument response, including:
- Spectral resolution (instrumental broadening)
- Multi-sensor stitching
- Multi-scan averaging
- Detector noise (shot, thermal, 1/f)
- Wavelength calibration errors
- Stray light effects
- Etalon/fringing interference
- archetype
The instrument archetype being simulated.
- rng
Random number generator for reproducibility.
Example
>>> archetype = get_instrument_archetype("viavi_micronir")
>>> simulator = InstrumentSimulator(archetype, random_state=42)
>>> spectra_out, wavelengths_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, temperature_offset: float = 0.0) Tuple[ndarray, ndarray][source]
Apply all instrument effects to spectra.
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
temperature_offset – Temperature deviation from calibration (°C).
- Returns:
Tuple of (modified_spectra, output_wavelengths). Output wavelengths may differ if resampled to instrument grid.
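The "spectral resolution (instrumental broadening)" effect listed for this class can be pictured as a convolution of each spectrum with a Gaussian line-spread function whose FWHM equals the instrument resolution. The following is a minimal NumPy sketch under that assumption; `broaden` is a hypothetical helper, not part of nirs4all, and the actual `apply` implementation may differ.

```python
import numpy as np

def broaden(spectra, wavelengths, fwhm_nm):
    """Convolve each row with a unit-sum Gaussian kernel (hypothetical helper)."""
    step = float(wavelengths[1] - wavelengths[0])           # assumes a uniform grid
    sigma = fwhm_nm / (2.0 * np.sqrt(2.0 * np.log(2.0)))    # FWHM -> standard deviation
    half = int(np.ceil(4.0 * sigma / step))                 # kernel support of +/- 4 sigma
    x = np.arange(-half, half + 1) * step
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()                                  # preserve the total signal
    # Pad with edge values so the output keeps the input length.
    padded = np.pad(spectra, ((0, 0), (half, half)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

wavelengths = np.arange(1000.0, 1200.0, 1.0)
sharp = np.exp(-0.5 * ((wavelengths - 1100.0) / 2.0) ** 2)[None, :]  # one narrow band
smooth = broaden(sharp, wavelengths, fwhm_nm=8.0)
```

The broadened band keeps its position and area but loses peak height, which is the qualitative behaviour expected from a finite spectral resolution.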
- class nirs4all.data.synthetic.LoadedBenchmarkDataset(info: ~nirs4all.data.synthetic.benchmarks.BenchmarkDatasetInfo, X: ~numpy.ndarray, y: ~numpy.ndarray, wavelengths: ~numpy.ndarray, sample_ids: ~numpy.ndarray | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Container for a loaded benchmark dataset.
- info
Dataset metadata.
- X
Spectral data (n_samples, n_wavelengths).
- Type:
numpy.ndarray
- y
Target values (n_samples, n_targets) or (n_samples,).
- Type:
numpy.ndarray
- wavelengths
Wavelength array.
- Type:
numpy.ndarray
- sample_ids
Optional sample identifiers.
- Type:
numpy.ndarray | None
- info: BenchmarkDatasetInfo
- class nirs4all.data.synthetic.MatrixType(value)[source]
Physical matrix types that affect spectral properties.
- EMULSION = 'emulsion'
- FILM = 'film'
- GEL = 'gel'
- GRANULAR = 'granular'
- LIQUID = 'liquid'
- PASTE = 'paste'
- POWDER = 'powder'
- SLURRY = 'slurry'
- SOLID = 'solid'
- TISSUE = 'tissue'
- class nirs4all.data.synthetic.MeasurementMode(value)[source]
Types of NIR measurement geometries.
- ATR = 'atr'
- FIBER_OPTIC = 'fiber_optic'
- INTERACTANCE = 'interactance'
- REFLECTANCE = 'reflectance'
- TRANSFLECTANCE = 'transflectance'
- TRANSMITTANCE = 'transmittance'
- class nirs4all.data.synthetic.MeasurementModeInference(value)[source]
Inferred measurement mode from spectral analysis.
- ATR = 'atr'
- REFLECTANCE = 'reflectance'
- TRANSFLECTANCE = 'transflectance'
- TRANSMITTANCE = 'transmittance'
- UNKNOWN = 'unknown'
- class nirs4all.data.synthetic.MeasurementModeSimulator(config: MeasurementModeConfig | None = None, random_state: int | None = None)[source]
Bases: object
Simulate different NIR measurement modes.
Converts absorption coefficients to measured signal (absorbance, reflectance, etc.) based on the physics of different measurement geometries.
- config
Measurement mode configuration.
- rng
Random number generator for reproducibility.
Example
>>> config = MeasurementModeConfig(mode=MeasurementMode.REFLECTANCE)
>>> simulator = MeasurementModeSimulator(config, random_state=42)
>>> reflectance = simulator.apply(absorption_coefficients, wavelengths)
- absorbance_to_reflectance(absorbance: ndarray) ndarray[source]
Convert apparent absorbance to reflectance.
R = 10^(-A)
- Parameters:
absorbance – Apparent absorbance values.
- Returns:
Reflectance values (0-1).
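As an illustration of the relation R = 10^(-A), here is a small standalone sketch. The two functions are plain NumPy stand-ins written for this example, not the library's methods, and the inverse direction (A = -log10(R)) is added only to show the round trip.

```python
import numpy as np

def absorbance_to_reflectance(absorbance):
    """R = 10^(-A), evaluated element-wise."""
    return 10.0 ** (-np.asarray(absorbance, dtype=float))

def reflectance_to_absorbance(reflectance):
    """A = -log10(R), the inverse of the relation above."""
    return -np.log10(np.asarray(reflectance, dtype=float))

absorbance = np.array([0.0, 0.5, 1.0, 2.0])
reflectance = absorbance_to_reflectance(absorbance)
```

An absorbance of 0 maps to a reflectance of 1, and each additional absorbance unit divides the reflectance by ten.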
- apply(absorption: ndarray, wavelengths: ndarray, scattering: ndarray | None = None) ndarray[source]
Apply measurement mode transformation.
Converts absorption coefficients (K) to measured signal based on the configured measurement mode.
- Parameters:
absorption – Absorption coefficient array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
scattering – Optional scattering coefficient array (n_samples, n_wavelengths). If None and needed, will be generated automatically.
- Returns:
Measured signal (absorbance, reflectance, etc.) depending on mode.
- generate_scattering_coefficients(shape: Tuple[int, int], wavelengths: ndarray) ndarray[source]
Generate realistic scattering coefficients.
The scattering coefficient approximately follows the relationship S(λ) ∝ λ^(-α) * (particle_size)^β.
- Parameters:
shape – Output shape (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
- Returns:
Scattering coefficient array.
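The power-law relationship above can be sketched as follows. The exponent values, the reference wavelength, and the reference particle size are assumptions chosen for illustration (a negative β makes smaller particles scatter more); the function is hypothetical and does not reproduce the library's random variation.

```python
import numpy as np

def scattering_coefficients(wavelengths, particle_sizes_um,
                            alpha=1.5, beta=-1.0,
                            ref_nm=1500.0, ref_um=50.0):
    """S(lambda) ~ lambda^(-alpha) * size^beta, normalised to 1 at the reference point."""
    lam_term = (np.asarray(wavelengths, dtype=float) / ref_nm) ** (-alpha)
    size_term = (np.asarray(particle_sizes_um, dtype=float) / ref_um) ** beta
    # Outer product gives an (n_samples, n_wavelengths) array.
    return size_term[:, None] * lam_term[None, :]

wavelengths = np.linspace(1000.0, 2500.0, 151)
sizes = np.array([20.0, 50.0, 100.0])
S = scattering_coefficients(wavelengths, sizes)
```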
- inverse_kubelka_munk(ks_ratio: ndarray) ndarray[source]
Inverse Kubelka-Munk transformation.
Given K/S, solve for R∞: R∞ = 1 + K/S - sqrt((K/S)² + 2*K/S)
- Parameters:
ks_ratio – K/S ratio values.
- Returns:
Reflectance values.
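To check the inverse formula, the sketch below pairs it with the standard forward Kubelka-Munk function K/S = (1 - R∞)² / (2 R∞) and verifies the round trip. Both functions are standalone illustrations, not the library's methods.

```python
import numpy as np

def kubelka_munk(r_inf):
    """Forward transform: K/S = (1 - R)^2 / (2 R)."""
    r = np.asarray(r_inf, dtype=float)
    return (1.0 - r) ** 2 / (2.0 * r)

def inverse_kubelka_munk(ks_ratio):
    """Inverse transform: R = 1 + K/S - sqrt((K/S)^2 + 2 K/S)."""
    ks = np.asarray(ks_ratio, dtype=float)
    return 1.0 + ks - np.sqrt(ks ** 2 + 2.0 * ks)

r = np.array([0.1, 0.5, 0.9])
ks = kubelka_munk(r)
r_back = inverse_kubelka_munk(ks)
```

For example, R∞ = 0.5 gives K/S = 0.25, and the inverse returns 1.25 - sqrt(0.5625) = 0.5.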
- class nirs4all.data.synthetic.MetadataConfig(generate_sample_ids: bool = True, sample_id_prefix: str = 'sample', n_groups: int | None = None, n_repetitions: int | Tuple[int, int] = 1, group_names: List[str] | None = None, additional_columns: Dict[str, Any] | None = None)[source]
Bases: object
Configuration for sample metadata generation.
- n_repetitions
Repetitions per sample, either fixed int or (min, max) range.
- class nirs4all.data.synthetic.MetadataGenerationResult(sample_ids: ndarray, bio_sample_ids: ndarray | None = None, repetitions: ndarray | None = None, groups: ndarray | None = None, group_indices: ndarray | None = None, n_bio_samples: int = 0, additional_columns: Dict[str, ndarray] | None = None)[source]
Bases: object
Container for generated metadata.
- sample_ids
Unique sample identifiers.
- Type:
numpy.ndarray
- bio_sample_ids
Biological sample identifiers (before repetitions).
- Type:
numpy.ndarray | None
- repetitions
Repetition number for each sample.
- Type:
numpy.ndarray | None
- groups
Group assignments.
- Type:
numpy.ndarray | None
- group_indices
Integer group indices (for stratification).
- Type:
numpy.ndarray | None
- additional_columns
Any extra columns generated.
- Type:
Dict[str, numpy.ndarray] | None
- class nirs4all.data.synthetic.MetadataGenerator(random_state: int | None = None)[source]
Bases: object
Generate realistic metadata for synthetic NIRS datasets.
This class creates sample identifiers, biological sample groupings, repetition structures, and group assignments that mimic real spectroscopy datasets.
- rng
NumPy random generator for reproducibility.
- Parameters:
random_state – Random seed for reproducibility.
Example
>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Generate with repetitions and groups
>>> metadata = generator.generate(
...     n_samples=100,
...     sample_id_prefix="WHEAT",
...     n_groups=3,
...     group_names=["Field_A", "Field_B", "Field_C"],
...     n_repetitions=(2, 4)
... )
>>>
>>> # Result: Each biological sample has 2-4 spectral measurements
>>> print(f"Bio samples: {metadata.n_bio_samples}")
>>> print(f"Total samples: {len(metadata.sample_ids)}")
- generate(n_samples: int, *, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1, bio_sample_prefix: str = 'B', additional_columns: Dict[str, Any] | None = None) MetadataGenerationResult[source]
Generate complete metadata for a synthetic dataset.
This method handles the complex logic of generating samples with repetitions while respecting group structures. When repetitions are requested, biological samples are created first, then each is replicated 1 or more times to create the final samples.
- Parameters:
n_samples – Total number of samples (spectra) to generate.
sample_id_prefix – Prefix for sample ID strings.
n_groups – Number of groups (None for no grouping).
group_names – Optional list of group names. If None and n_groups > 0, generates names like “Group_0”, “Group_1”, etc.
n_repetitions – Number of repetitions per biological sample. If int: fixed number of repetitions. If tuple (min, max): random number in range [min, max].
bio_sample_prefix – Prefix for biological sample IDs.
additional_columns – Dictionary of additional columns to generate. Keys are column names; values can be:
- Callable(n_samples, rng) -> np.ndarray
- List of values to randomly sample from
- Tuple (distribution, params) for numeric data
- Returns:
MetadataGenerationResult containing all generated metadata.
- Raises:
ValueError – If n_samples is less than 1 or if n_repetitions would make it impossible to generate the requested samples.
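The repetition logic described for this method (biological samples first, then one or more measurements of each) might look roughly like the sketch below. It uses only the standard library; the ID format, the truncation of the last biological sample, and the function name are assumptions made for illustration and may not match the library's behaviour.

```python
import random

def assign_repetitions(n_samples, n_repetitions, seed=None):
    """Return (bio_sample_ids, repetition_numbers), one entry per spectrum."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if isinstance(n_repetitions, int):
        lo, hi = n_repetitions, n_repetitions
    else:
        lo, hi = n_repetitions
    if lo < 1 or hi < lo:
        raise ValueError("n_repetitions must describe a range starting at 1 or more")
    rng = random.Random(seed)
    bio_ids, rep_numbers = [], []
    bio_index = 0
    while len(bio_ids) < n_samples:
        # Draw the repetition count, truncating the last sample if needed.
        reps = min(rng.randint(lo, hi), n_samples - len(bio_ids))
        for rep in range(1, reps + 1):
            bio_ids.append(f"B{bio_index:04d}")   # hypothetical ID format
            rep_numbers.append(rep)
        bio_index += 1
    return bio_ids, rep_numbers

bio_ids, reps = assign_repetitions(100, 2, seed=42)
var_ids, var_reps = assign_repetitions(100, (1, 3), seed=42)
```

With a fixed count of 2, the 100 spectra map onto exactly 50 biological samples; with a (1, 3) range the number of biological samples varies with the random draws.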
Example
>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Simple case: 100 samples, no repetitions
>>> result = generator.generate(100)
>>> assert len(result.sample_ids) == 100
>>>
>>> # With repetitions: ~50 bio samples, each measured 2 times
>>> result = generator.generate(100, n_repetitions=2)
>>> assert result.n_bio_samples == 50
>>>
>>> # Variable repetitions
>>> result = generator.generate(100, n_repetitions=(1, 3))
- class nirs4all.data.synthetic.MetricResult(metric: ~nirs4all.data.synthetic.validation.RealismMetric, value: float, threshold: float, passed: bool, details: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Result of a single realism metric evaluation.
- metric
The metric type.
- metric: RealismMetric
- class nirs4all.data.synthetic.MoistureConfig(water_activity: float = 0.5, moisture_content: float = 0.1, free_water_fraction: float = 0.3, bound_water_shift: float = 25.0, temperature_interaction: bool = True, reference_aw: float = 0.5)[source]
Bases: object
Configuration for moisture/water activity effect simulation.
Moisture affects NIR spectra through:
- Direct water absorption bands
- Hydrogen bonding with sample matrix
- Free vs. bound water ratio
- class nirs4all.data.synthetic.MoistureEffectSimulator(config: MoistureConfig | None = None, random_state: int | None = None)[source]
Bases: object
Simulate moisture and water activity effects on NIR spectra.
Water in samples exists in different states:
- Free water: bulk water with normal O-H bands
- Bound water: hydrogen-bonded to matrix, shifted peaks
The ratio of free to bound water depends on water activity (a_w), temperature, and sample matrix composition.
- config
Moisture effect configuration.
- rng
Random number generator.
Example
>>> config = MoistureConfig(
...     water_activity=0.7,
...     moisture_content=0.15,
...     free_water_fraction=0.4
... )
>>> simulator = MoistureEffectSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, water_activities: ndarray | None = None, temperature_offset: float = 0.0) ndarray[source]
Apply moisture effects to spectra.
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
water_activities – Optional per-sample water activity values.
temperature_offset – Temperature offset from reference (affects water state).
- Returns:
Modified spectra with moisture effects applied.
- class nirs4all.data.synthetic.MonochromatorType(value)[source]
Types of wavelength selection mechanisms.
- AOTF = 'aotf'
- DMD = 'dmd'
- FABRY_PEROT = 'fabry_perot'
- FILTER_WHEEL = 'filter_wheel'
- FT = 'fourier_transform'
- GRATING = 'grating'
- LVF = 'lvf'
- class nirs4all.data.synthetic.MultiScanConfig(enabled: bool = False, n_scans: int = 16, averaging_method: str = 'mean', scan_to_scan_noise: float = 0.001, wavelength_jitter: float = 0.05, discard_outliers: bool = False, outlier_threshold: float = 3.0)[source]
Bases: object
Configuration for multi-scan averaging/accumulation.
Real instruments often acquire multiple scans per sample and average them to improve signal-to-noise ratio. This config simulates that process.
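The signal-to-noise benefit of averaging can be demonstrated with a short NumPy sketch. It assumes independent Gaussian noise per scan and a plain mean, using the default values `n_scans=16` and `scan_to_scan_noise=0.001` from the signature above; it is an illustration of the principle, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_spectrum = np.linspace(0.2, 0.8, 500)   # noise-free reference signal
scan_noise = 0.001                           # per-scan noise standard deviation
n_scans = 16

# Each row is one noisy scan of the same sample.
scans = true_spectrum + rng.normal(0.0, scan_noise, size=(n_scans, true_spectrum.size))
averaged = scans.mean(axis=0)

single_rmse = np.sqrt(np.mean((scans[0] - true_spectrum) ** 2))
averaged_rmse = np.sqrt(np.mean((averaged - true_spectrum) ** 2))
```

For independent noise the error of the mean shrinks by about sqrt(n_scans), so averaging 16 scans reduces the noise roughly fourfold.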
- class nirs4all.data.synthetic.MultiSensorConfig(enabled: bool = False, sensors: ~typing.List[~nirs4all.data.synthetic.instruments.SensorConfig] = <factory>, stitch_method: str = 'weighted', stitch_smoothing: float = 10.0, add_stitch_artifacts: bool = True, artifact_intensity: float = 0.02)[source]
Bases: object
Configuration for multi-sensor spectral stitching.
Modern NIR instruments often use multiple sensors/detectors to cover wide wavelength ranges. This config controls how the signals are combined.
- sensors
List of SensorConfig for each sensor.
- Type:
List[nirs4all.data.synthetic.instruments.SensorConfig]
- stitch_method
Method for combining overlapping regions. Options: ‘weighted’, ‘average’, ‘first’, ‘last’, ‘optimal’
- Type:
str
- sensors: List[SensorConfig]
- class nirs4all.data.synthetic.MultiSourceGenerator(random_state: int | None = None)[source]
Bases: object
Generate synthetic multi-source NIRS datasets.
This class creates datasets combining multiple data sources, such as:
- Multiple NIR spectral ranges (e.g., visible-NIR + shortwave-NIR)
- NIR spectra + molecular markers
- NIR spectra + auxiliary measurements
The generated sources share common underlying structure through component concentrations, creating realistic inter-source correlations.
- rng
NumPy random generator.
- Parameters:
random_state – Random seed for reproducibility.
Example
>>> generator = MultiSourceGenerator(random_state=42)
>>>
>>> result = generator.generate(
...     n_samples=500,
...     sources=[
...         {
...             "name": "NIR",
...             "type": "nir",
...             "wavelength_range": (1000, 2500),
...             "complexity": "realistic"
...         },
...         {
...             "name": "markers",
...             "type": "aux",
...             "n_features": 20,
...             "correlation_with_target": 0.7
...         }
...     ],
...     target_range=(0, 100)
... )
>>>
>>> print(result.source_names)
['NIR', 'markers']
- create_dataset(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, train_ratio: float = 0.8, target_range: Tuple[float, float] | None = None, name: str = 'multi_source_synthetic') SpectroDataset[source]
Create a SpectroDataset from multi-source generation.
- Parameters:
n_samples – Number of samples to generate.
sources – List of source configurations.
train_ratio – Proportion of samples for training.
target_range – Optional (min, max) for target scaling.
name – Dataset name.
- Returns:
SpectroDataset with multiple sources configured.
Example
>>> dataset = generator.create_dataset(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 10}
...     ],
...     train_ratio=0.8
... )
- generate(n_samples: int, sources: List[SourceConfig | Dict[str, Any]], *, target_range: Tuple[float, float] | None = None, concentration_method: str = 'dirichlet', n_components: int = 5) MultiSourceResult[source]
Generate a multi-source dataset.
All sources share underlying component concentrations, which creates realistic correlations between sources. NIR sources generate spectra from these concentrations, while auxiliary sources create features correlated with the same underlying structure.
- Parameters:
n_samples – Number of samples to generate.
sources – List of source configurations (SourceConfig or dict).
target_range – Optional (min, max) for scaling target values.
concentration_method – Method for generating component concentrations.
n_components – Number of underlying components.
- Returns:
MultiSourceResult containing all generated data.
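The idea that all sources share one set of component concentrations can be sketched in a few lines of NumPy. Here the concentrations are drawn from a Dirichlet distribution (matching the default `concentration_method='dirichlet'`), while the random linear loadings, the noise level, and the choice of the first component as target are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_components = 300, 5

# Shared latent structure: each row sums to 1.
concentrations = rng.dirichlet(np.ones(n_components), size=n_samples)

def random_source(n_features):
    """Hypothetical source: a linear mix of the shared concentrations plus noise."""
    loadings = rng.normal(size=(n_components, n_features))
    noise = rng.normal(0.0, 0.01, size=(n_samples, n_features))
    return concentrations @ loadings + noise

vis_nir = random_source(70)    # stands in for a VIS-NIR block
swir = random_source(140)      # stands in for a SWIR block
target = concentrations[:, 0]  # assumed target: the first component
```

Because both blocks are driven by the same `concentrations` matrix, they are correlated with each other and with the target even though their feature spaces differ.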
Example
>>> result = generator.generate(
...     n_samples=300,
...     sources=[
...         {"name": "VIS-NIR", "type": "nir", "wavelength_range": (400, 1100)},
...         {"name": "SWIR", "type": "nir", "wavelength_range": (1100, 2500)},
...     ]
... )
- class nirs4all.data.synthetic.MultiSourceResult(sources: ~typing.Dict[str, ~numpy.ndarray], targets: ~numpy.ndarray, source_configs: ~typing.List[~nirs4all.data.synthetic.sources.SourceConfig], wavelengths: ~typing.Dict[str, ~numpy.ndarray] = <factory>, metadata: ~typing.Dict[str, ~typing.Any] | None = None)[source]
Bases: object
Container for multi-source generation results.
- sources
Dictionary mapping source names to feature arrays.
- Type:
Dict[str, numpy.ndarray]
- targets
Target values.
- Type:
numpy.ndarray
- source_configs
Source configuration objects.
- Type:
List[nirs4all.data.synthetic.sources.SourceConfig]
- wavelengths
Dictionary mapping NIR source names to wavelength arrays.
- Type:
Dict[str, numpy.ndarray]
- source_configs: List[SourceConfig]
- class nirs4all.data.synthetic.NIRBand(center: float, sigma: float, gamma: float = 0.0, amplitude: float = 1.0, name: str = '')[source]
Bases: object
Represents a single NIR absorption band.
This class models an absorption band using a Voigt profile, which is the convolution of Gaussian (thermal broadening) and Lorentzian (pressure broadening) line shapes.
Example
>>> band = NIRBand(center=1450, sigma=25, gamma=3, amplitude=0.8)
>>> wavelengths = np.arange(1400, 1500, 1)
>>> spectrum = band.compute(wavelengths)
- compute(wavelengths: ndarray) ndarray[source]
Compute the band profile at given wavelengths using Voigt profile.
- Parameters:
wavelengths – Array of wavelengths in nm at which to evaluate the band.
- Returns:
Array of absorbance values at each wavelength.
Note
When gamma=0, a pure Gaussian profile is used for efficiency. Otherwise, the full Voigt profile (Gaussian ⊗ Lorentzian) is computed.
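The Gaussian/Voigt switch described in the note can be illustrated with a direct numerical convolution. This is a sketch, not the library's implementation: `band_profile` is a hypothetical stand-in for `compute`, the Voigt integral is approximated by a Riemann sum, and scaling the result so that its maximum on the grid equals `amplitude` is an assumption.

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def lorentzian(x, gamma):
    return gamma / (np.pi * (x ** 2 + gamma ** 2))

def band_profile(wavelengths, center, sigma, gamma=0.0, amplitude=1.0):
    """Gaussian when gamma == 0, otherwise a numerically convolved Voigt profile."""
    x = np.asarray(wavelengths, dtype=float) - center
    if gamma == 0.0:
        profile = gaussian(x, sigma)
    else:
        # V(x) = integral of G(t) * L(x - t) dt, approximated on a fine grid.
        t = np.linspace(-6.0 * sigma, 6.0 * sigma, 2001)
        dt = t[1] - t[0]
        integrand = gaussian(t, sigma)[None, :] * lorentzian(x[:, None] - t[None, :], gamma)
        profile = integrand.sum(axis=1) * dt
    return amplitude * profile / profile.max()   # assumed peak normalisation

wavelengths = np.arange(1400.0, 1500.0, 1.0)
pure = band_profile(wavelengths, center=1450.0, sigma=25.0, gamma=0.0, amplitude=0.8)
voigt = band_profile(wavelengths, center=1450.0, sigma=25.0, gamma=3.0, amplitude=0.8)
```

Both profiles peak at 1450 nm; the Lorentzian contribution shows up as heavier wings in `voigt` than in `pure`.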
- class nirs4all.data.synthetic.NIRSPriorConfig(domain_weights: ~typing.Dict[str, float] = <factory>, instrument_given_domain: ~typing.Dict[str, ~typing.Dict[str, float]] = <factory>, mode_given_category: ~typing.Dict[str, ~typing.Dict[str, float]] = <factory>, matrix_given_domain: ~typing.Dict[str, ~typing.Dict[str, float]] = <factory>, temperature_range: ~typing.Tuple[float, float] = (15.0, 40.0), particle_size_range: ~typing.Tuple[float, float] = (5.0, 200.0), noise_level_range: ~typing.Tuple[float, float] = (0.5, 2.0), n_samples_range: ~typing.Tuple[int, int] = (100, 2000), target_type_weights: ~typing.Dict[str, float] = <factory>, n_targets_range: ~typing.Tuple[int, int] = (1, 5), n_classes_range: ~typing.Tuple[int, int] = (2, 5))[source]
Bases: object
Configuration for NIRS data generation with conditional sampling.
This class defines the prior distributions and conditional dependencies for sampling complete generation configurations.
Example
>>> config = NIRSPriorConfig()
>>> sampler = PriorSampler(config, random_state=42)
>>> sample = sampler.sample()
>>> print(sample["domain"], sample["instrument"])
- class nirs4all.data.synthetic.NoiseModelConfig(shot_noise_enabled: bool = True, thermal_noise_enabled: bool = True, read_noise_enabled: bool = True, flicker_noise_enabled: bool = False, quantization_noise_enabled: bool = False, shot_noise_factor: float = 1.0, thermal_noise_factor: float = 1.0, read_noise_electrons: float = 50.0, flicker_corner_freq: float = 100.0, adc_bits: int = 16, full_scale: float = 3.0)[source]
Bases: object
Configuration for detector noise model.
- class nirs4all.data.synthetic.OutputConfig(as_dataset: bool = True, include_metadata: bool = False, include_wavelengths: bool = True)[source]
Bases: object
Configuration for output format.
- class nirs4all.data.synthetic.OvertoneResult(order: int, wavenumber_cm: float, wavelength_nm: float, amplitude_factor: float, bandwidth_factor: float)[source]
Bases: object
Result of overtone calculation.
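As a rough illustration of where the `wavenumber_cm` and `wavelength_nm` fields could come from, the sketch below uses the textbook anharmonic-oscillator expression, in which the 0 → n transition lies at n · ω · (1 - (n + 1) · χ) for harmonic wavenumber ω and anharmonicity χ. Whether the library uses exactly this formula, and whether its `order` counts the fundamental as 1, are assumptions; the function name is hypothetical and the amplitude and bandwidth factors are not modelled.

```python
def overtone(fundamental_cm, order, anharmonicity=0.02):
    """Return (wavenumber in cm^-1, wavelength in nm); order 1 is the fundamental."""
    # Recover the harmonic wavenumber from the observed fundamental:
    # fundamental = omega * (1 - 2 * chi)
    harmonic_cm = fundamental_cm / (1.0 - 2.0 * anharmonicity)
    wavenumber_cm = order * harmonic_cm * (1.0 - (order + 1) * anharmonicity)
    wavelength_nm = 1.0e7 / wavenumber_cm       # 1 cm = 1e7 nm
    return wavenumber_cm, wavelength_nm

# Overtone series of a hypothetical O-H stretch with a fundamental at 3400 cm^-1.
series = {order: overtone(3400.0, order) for order in (1, 2, 3)}
```

Because of anharmonicity, each overtone falls slightly below the integer multiple of the fundamental, which shifts the bands to longer wavelengths than a purely harmonic model would predict.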
- class nirs4all.data.synthetic.ParticleSizeConfig(distribution: ~nirs4all.data.synthetic.scattering.ParticleSizeDistribution = <factory>, reference_size_um: float = 50.0, size_effect_strength: float = 1.0, wavelength_exponent: float = 1.5, include_path_length_effect: bool = True, path_length_sensitivity: float = 0.5)[source]
Bases: object
Configuration for particle size effects.
- distribution
Particle size distribution parameters.
- wavelength_exponent
Exponent for wavelength dependence of scattering:
- 4.0 = Rayleigh (particles << wavelength)
- 0.0 = No wavelength dependence
- 1.0-2.0 = Typical for NIR powder samples
- Type:
float
- distribution: ParticleSizeDistribution
- class nirs4all.data.synthetic.ParticleSizeDistribution(mean_size_um: float = 50.0, std_size_um: float = 15.0, min_size_um: float = 5.0, max_size_um: float = 200.0, distribution: str = 'lognormal')[source]
Bases: object
Particle size distribution parameters.
Models particle size as a log-normal distribution, which is common for ground/milled samples in NIR analysis.
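Drawing sizes from such a log-normal model requires converting the desired mean and standard deviation into the parameters of the underlying normal distribution. The sketch below shows one way to do this with NumPy, using the defaults from the signature above. Treating `mean_size_um` and `std_size_um` as the arithmetic mean and standard deviation, and enforcing the bounds by clipping, are assumptions; the function is hypothetical.

```python
import numpy as np

def sample_particle_sizes(n, mean_um=50.0, std_um=15.0,
                          min_um=5.0, max_um=200.0, seed=None):
    """Draw n particle sizes (in micrometres) from a clipped log-normal distribution."""
    rng = np.random.default_rng(seed)
    # Moment matching: parameters of log(size) from the arithmetic mean and std.
    sigma2 = np.log(1.0 + (std_um / mean_um) ** 2)
    mu = np.log(mean_um) - 0.5 * sigma2
    sizes = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=n)
    return np.clip(sizes, min_um, max_um)

sizes = sample_particle_sizes(10_000, seed=0)
```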
- class nirs4all.data.synthetic.ParticleSizeSimulator(config: ParticleSizeConfig | None = None, random_state: int | None = None)[source]
Bases: object
Simulate particle size effects on NIR spectra.
Particle size affects NIR spectra through:
- Scattering baseline (smaller particles = more scattering)
- Path length through sample (affects Beer-Lambert)
- Wavelength dependence of scattering
Uses EMSC-style approach: applies distortions that chemometric preprocessing (SNV, MSC) would correct.
- config
Particle size configuration.
- rng
Random number generator.
Example
>>> config = ParticleSizeConfig(
...     distribution=ParticleSizeDistribution(mean_size_um=30.0)
... )
>>> simulator = ParticleSizeSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, particle_sizes: ndarray | None = None) ndarray[source]
Apply particle size effects to spectra.
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
particle_sizes – Optional per-sample particle sizes (μm). If None, samples from configured distribution.
- Returns:
Modified spectra with particle size effects applied.
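The statement that these distortions are of the kind SNV or MSC would correct can be illustrated with a simplified model in which each spectrum receives an additive offset and a multiplicative gain. This is an assumption made for the example: the simulator may also add wavelength-dependent terms, which plain SNV does not remove. The `snv` function below is a standard standalone implementation, not a nirs4all call.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

wavelengths = np.linspace(1000.0, 2500.0, 300)
clean = np.exp(-0.5 * ((wavelengths - 1450.0) / 40.0) ** 2)[None, :]

# Two samples of the same material with different offset and gain.
offsets = np.array([[0.05], [0.20]])
gains = np.array([[0.9], [1.3]])
distorted = offsets + gains * clean
corrected = snv(distorted)
```

After SNV both distorted spectra collapse onto the same curve, which is also the SNV of the clean spectrum.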
- class nirs4all.data.synthetic.PartitionConfig(train_ratio: float = 0.8, stratify: bool = False, shuffle: bool = True, group_aware: bool = True)[source]
Bases: object
Configuration for data partitioning (train/test split).
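One plausible reading of `group_aware=True` is that all repetitions of a biological sample end up on the same side of the split. The standard-library sketch below illustrates that idea; the function and the way `train_ratio` is applied to groups rather than to individual spectra are assumptions, not the library's documented behaviour.

```python
import random

def group_aware_split(groups, train_ratio=0.8, shuffle=True, seed=None):
    """Split sample indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    if shuffle:
        random.Random(seed).shuffle(unique)
    n_train_groups = max(1, round(train_ratio * len(unique)))
    train_groups = set(unique[:n_train_groups])
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    test_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, test_idx

# 50 biological samples, each measured twice.
groups = [f"B{i // 2:03d}" for i in range(100)]
train_idx, test_idx = group_aware_split(groups, train_ratio=0.8, seed=0)
```

Splitting by group avoids leaking near-identical repeated measurements from the training set into the test set.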
- class nirs4all.data.synthetic.PriorSampler(config: NIRSPriorConfig | None = None, random_state: int | None = None)[source]
Bases: object
Sample complete generation configurations from prior distributions.
This class implements hierarchical sampling where lower-level configurations are conditioned on higher-level choices.
- Parameters:
config – Prior configuration.
random_state – Random state for reproducibility.
Example
>>> config = NIRSPriorConfig()
>>> sampler = PriorSampler(config, random_state=42)
>>>
>>> # Sample a single configuration
>>> sample = sampler.sample()
>>> print(sample)
>>>
>>> # Sample multiple configurations
>>> samples = sampler.sample_batch(10)
- sample() Dict[str, Any][source]
Sample a complete dataset configuration from the prior.
- Returns:
Dictionary with all configuration parameters.
Example
>>> sampler = PriorSampler(random_state=42)
>>> config = sampler.sample()
>>> print(config["domain"])
>>> print(config["instrument"])
- sample_batch(n: int) List[Dict[str, Any]][source]
Sample multiple configurations from the prior.
- Parameters:
n – Number of configurations to sample.
- Returns:
List of configuration dictionaries.
- sample_components(domain: str, n_components: int | None = None) List[str][source]
Sample component set based on domain.
- sample_for_domain(domain: str, n_samples: int | None = None) Dict[str, Any][source]
Sample a configuration constrained to a specific domain.
- Parameters:
domain – Domain to sample for.
n_samples – Optional number of samples (uses prior if None).
- Returns:
Configuration dictionary for the specified domain.
- sample_for_instrument(instrument: str, n_samples: int | None = None) Dict[str, Any][source]
Sample a configuration constrained to a specific instrument.
- Parameters:
instrument – Instrument name to use.
n_samples – Optional number of samples.
- Returns:
Configuration dictionary for the specified instrument.
- sample_instrument_category(domain: str) str[source]
Sample an instrument category given the domain.
- sample_measurement_mode(instrument_category: str) str[source]
Sample a measurement mode given the instrument category.
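The hierarchical scheme (domain first, then instrument category given the domain, then measurement mode given the category) can be sketched with the standard library alone. The probability tables, the domain names, and the dictionary keys below are invented for illustration and are not the library's actual priors; only the category and mode values are taken from the enums documented on this page.

```python
import random

# Hypothetical prior tables; each inner dict holds conditional probabilities.
DOMAIN_WEIGHTS = {"agriculture": 0.6, "pharmaceutical": 0.4}
CATEGORY_GIVEN_DOMAIN = {
    "agriculture": {"benchtop": 0.5, "handheld": 0.5},
    "pharmaceutical": {"benchtop": 0.7, "process": 0.3},
}
MODE_GIVEN_CATEGORY = {
    "benchtop": {"reflectance": 0.6, "transmittance": 0.4},
    "handheld": {"reflectance": 1.0},
    "process": {"transmittance": 0.5, "reflectance": 0.5},
}

def draw(rng, weights):
    """Draw one key from a {name: probability} mapping."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def sample_config(rng):
    """Each level is conditioned on the choice made at the level above."""
    domain = draw(rng, DOMAIN_WEIGHTS)
    category = draw(rng, CATEGORY_GIVEN_DOMAIN[domain])
    mode = draw(rng, MODE_GIVEN_CATEGORY[category])
    return {"domain": domain, "instrument_category": category, "measurement_mode": mode}

rng = random.Random(42)
configs = [sample_config(rng) for _ in range(5)]
```

Conditioning each level on the previous one keeps the sampled combinations plausible, for example by never pairing a domain with an instrument category that has zero prior weight for it.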
- class nirs4all.data.synthetic.ProceduralComponentConfig(n_fundamental_bands: int = 3, include_overtones: bool = True, max_overtone_order: int = 3, include_combinations: bool = True, max_combinations: int = 3, h_bond_strength: float = 0.3, h_bond_variability: float = 0.2, anharmonicity: float = 0.02, anharmonicity_variability: float = 0.005, amplitude_variability: float = 0.3, bandwidth_variability: float = 0.2, wavelength_range: Tuple[float, float] = (900, 2500), functional_groups: List[FunctionalGroupType] | None = None, combination_amplitude_factor: float = 0.2)[source]
Bases: object
Configuration for procedural component generation.
Controls the complexity and characteristics of generated spectral components including the number of bands, overtone generation, combination bands, and environmental effects.
- functional_groups
Optional list of specific functional groups to use.
- Type:
List[nirs4all.data.synthetic.procedural.FunctionalGroupType] | None
Example
>>> config = ProceduralComponentConfig(
...     n_fundamental_bands=3,
...     include_overtones=True,
...     include_combinations=True,
...     h_bond_strength=0.5
... )
- functional_groups: List[FunctionalGroupType] | None = None
- class nirs4all.data.synthetic.ProceduralComponentGenerator(random_state: int | None = None)[source]
Bases: object
Generator for procedurally-created spectral components.
Creates chemically-plausible spectral components with physically-motivated constraints. Uses wavenumber-space calculations for proper overtone and combination band placement.
- rng
NumPy random generator for reproducibility.
Example
>>> generator = ProceduralComponentGenerator(random_state=42)
>>>
>>> # Generate a single component
>>> component = generator.generate_component("my_compound")
>>>
>>> # Generate a library of components
>>> library = generator.generate_library(n_components=10)
>>>
>>> # Generate with specific configuration
>>> config = ProceduralComponentConfig(n_fundamental_bands=4)
>>> component = generator.generate_component("complex_compound", config)
- generate_component(name: str, config: ProceduralComponentConfig | None = None, functional_groups: List[FunctionalGroupType] | None = None, correlation_group: int | None = None) SpectralComponent[source]
Generate a single spectral component.
Creates a chemically-plausible component with bands following physical constraints (overtone relationships, combination bands, etc.).
- Parameters:
name – Name for the component.
config – Generation configuration. If None, uses defaults.
functional_groups – Specific functional groups to use. If None, randomly selects based on config.
correlation_group – Optional correlation group ID.
- Returns:
SpectralComponent with generated bands.
Example
>>> generator = ProceduralComponentGenerator(random_state=42)
>>> component = generator.generate_component("my_compound")
>>> print(f"Generated {len(component.bands)} bands")
- generate_from_functional_groups(name: str, functional_groups: List[FunctionalGroupType | str], config: ProceduralComponentConfig | None = None) SpectralComponent[source]
Generate a component with specified functional groups.
Convenience method for creating components with known chemistry.
- Parameters:
name – Component name.
functional_groups – List of functional groups (enum or string).
config – Optional generation configuration.
- Returns:
SpectralComponent with bands from specified functional groups.
Example
>>> generator = ProceduralComponentGenerator(random_state=42)
>>> # Generate an alcohol
>>> alcohol = generator.generate_from_functional_groups(
...     "alcohol",
...     ["hydroxyl", "methyl", "methylene"]
... )
- generate_library(n_components: int, config: ProceduralComponentConfig | None = None, name_prefix: str = 'component') ComponentLibrary[source]
Generate a library of procedural components.
Creates multiple unique components with varied characteristics.
- Parameters:
n_components – Number of components to generate.
config – Generation configuration applied to all components.
name_prefix – Prefix for component names.
- Returns:
ComponentLibrary populated with generated components.
Example
>>> generator = ProceduralComponentGenerator(random_state=42)
>>> library = generator.generate_library(10)
>>> print(f"Created library with {library.n_components} components")
- generate_variant(base_component: SpectralComponent, variation_scale: float = 0.1, name: str | None = None) SpectralComponent[source]
Generate a variant of an existing component.
Creates a new component with similar characteristics but varied band positions, widths, and amplitudes. Useful for simulating batch effects or matrix variations.
- Parameters:
base_component – Component to base the variant on.
variation_scale – Scale of random variations (0-1).
name – Name for the variant. If None, appends “_variant”.
- Returns:
SpectralComponent variant.
Example
>>> generator = ProceduralComponentGenerator(random_state=42)
>>> base = generator.generate_component("base")
>>> variant = generator.generate_variant(base, variation_scale=0.15)
- class nirs4all.data.synthetic.RealDataFitter[source]
Bases: object
Fit generator parameters to match real dataset properties.
This class analyzes real NIRS data and estimates parameters for the SyntheticNIRSGenerator to produce similar spectra. Includes Phase 1-4 enhanced inference for instruments, domains, and effects.
- source_properties
SpectralProperties of the analyzed data.
- fitted_params
FittedParameters after fitting.
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>>
>>> # Access inferred characteristics
>>> print(f"Instrument: {params.inferred_instrument}")
>>> print(f"Domain: {params.inferred_domain}")
>>>
>>> # Create matched generator
>>> generator = fitter.create_matched_generator()
>>> X_synth, _, _ = generator.generate(1000)
- create_matched_generator(random_state: int | None = None) SyntheticNIRSGenerator[source]
Create a SyntheticNIRSGenerator configured to match the fitted data.
This method creates a generator with all fitted parameters including Phase 1-4 enhanced features (instrument, domain, effects).
- Parameters:
random_state – Random seed for reproducibility.
- Returns:
Configured SyntheticNIRSGenerator instance.
- Raises:
RuntimeError – If fit() hasn’t been called.
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wavelengths)
>>> generator = fitter.create_matched_generator(random_state=42)
>>> X_synth, _, _ = generator.generate(1000)
- evaluate_similarity(X_synthetic: ndarray, wavelengths: ndarray | None = None) Dict[str, Any][source]
Evaluate similarity between synthetic and source data.
Computes various metrics comparing synthetic spectra to the original real data.
- Parameters:
X_synthetic – Synthetic spectra matrix.
wavelengths – Optional wavelength grid.
- Returns:
Dictionary with similarity metrics.
- Raises:
RuntimeError – If fit() hasn’t been called.
Example
>>> params = fitter.fit(X_real)
>>> X_synth, _, _ = generator.generate(1000)
>>> metrics = fitter.evaluate_similarity(X_synth)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
- fit(X: np.ndarray | 'SpectroDataset', *, wavelengths: np.ndarray | None = None, name: str = 'source', infer_instrument: bool = True, infer_domain: bool = True, infer_measurement_mode: bool = True, infer_environmental: bool = True, infer_scattering: bool = True) FittedParameters[source]
Fit generator parameters to real data.
Analyzes the input data and estimates optimal parameters for generating synthetic spectra with similar properties. Includes Phase 1-4 enhanced inference.
- Parameters:
X – Real spectra matrix (n_samples, n_wavelengths) or SpectroDataset.
wavelengths – Wavelength grid (required if X is ndarray).
name – Dataset name for reference.
infer_instrument – Whether to infer instrument archetype.
infer_domain – Whether to infer application domain.
infer_measurement_mode – Whether to infer measurement mode.
infer_environmental – Whether to infer environmental effects.
infer_scattering – Whether to infer scattering parameters.
- Returns:
FittedParameters object with estimated parameters.
- Raises:
ValueError – If X is empty or has wrong shape.
Example
>>> fitter = RealDataFitter()
>>> params = fitter.fit(X_real, wavelengths=wl, name="wheat")
>>> print(params.summary())
- fit_from_path(path: str, *, name: str | None = None) FittedParameters[source]
Fit parameters from a dataset path.
Loads data using DatasetConfigs and fits parameters.
- Parameters:
path – Path to dataset folder.
name – Optional name override.
- Returns:
FittedParameters object.
Example
>>> params = fitter.fit_from_path("sample_data/regression")
- get_tuning_recommendations() List[str][source]
Get recommendations for tuning generation parameters.
Based on the fitted parameters and source data, provides suggestions for manual tuning.
- Returns:
List of recommendation strings.
Example
>>> params = fitter.fit(X_real)
>>> for rec in fitter.get_tuning_recommendations():
...     print(f"- {rec}")
- class nirs4all.data.synthetic.RealismMetric(value)[source]
Metrics used in the spectral realism scorecard.
- ADVERSARIAL_AUC = 'adversarial_auc'
- BASELINE_CURVATURE = 'baseline_curvature'
- CORRELATION_LENGTH = 'correlation_length'
- DERIVATIVE_STATISTICS = 'derivative_statistics'
- PEAK_DENSITY = 'peak_density'
- SNR_DISTRIBUTION = 'snr_distribution'
- class nirs4all.data.synthetic.ReflectanceConfig(geometry: str = 'integrating_sphere', reference_material: str = 'spectralon', reference_reflectance: float = 0.99, illumination_angle: float = 0.0, collection_angle: float = 45.0, sample_presentation: str = 'powder')[source]
Bases: object
Configuration for diffuse reflectance measurement mode.
Implements Kubelka-Munk theory: f(R) = (1-R)²/(2R) = K/S, where R is the reflectance, K is the absorption coefficient, and S is the scattering coefficient.
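The Kubelka-Munk relation above can be checked numerically. The following is a minimal standalone sketch, not the library's implementation; the function names `kubelka_munk` and `reflectance_from_k_over_s` are hypothetical and chosen for illustration only.

```python
import math


def kubelka_munk(reflectance: float) -> float:
    """Return f(R) = (1 - R)^2 / (2R), which equals K/S."""
    if not 0.0 < reflectance <= 1.0:
        raise ValueError("reflectance must lie in (0, 1]")
    return (1.0 - reflectance) ** 2 / (2.0 * reflectance)


def reflectance_from_k_over_s(k_over_s: float) -> float:
    """Invert f(R) = K/S, taking the root of the quadratic with R <= 1."""
    if k_over_s < 0.0:
        raise ValueError("K/S must be non-negative")
    return 1.0 + k_over_s - math.sqrt(k_over_s**2 + 2.0 * k_over_s)


# A sample that reflects 50% of the light has K/S = 0.25.
ratio = kubelka_munk(0.5)
recovered = reflectance_from_k_over_s(ratio)
```

The inverse follows from rewriting f(R) = K/S as R² - 2(1 + K/S)R + 1 = 0 and keeping the physically meaningful root.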
- class nirs4all.data.synthetic.ScatteringCoefficientConfig(baseline_scattering: float = 1.0, wavelength_exponent: float = 1.0, particle_size_factor: float = 0.5, sample_variation: float = 0.15, wavelength_reference_nm: float = 1500.0)[source]
Bases: object
Configuration for scattering coefficient (S) generation.
For Kubelka-Munk reflectance, we need both absorption (K) and scattering (S) coefficients. This config controls S(λ) generation.
- class nirs4all.data.synthetic.ScatteringCoefficientGenerator(config: ScatteringCoefficientConfig | None = None, random_state: int | None = None)[source]
Bases: object
Generate scattering coefficients S(λ) for Kubelka-Munk simulation.
The Kubelka-Munk equation relates reflectance R to absorption K and scattering S: f(R) = (1-R)²/(2R) = K/S
This generator produces realistic S(λ) values for different sample types.
- config
Scattering coefficient configuration.
- rng
Random number generator.
Example
>>> config = ScatteringCoefficientConfig(
...     baseline_scattering=1.5,
...     wavelength_exponent=1.2
... )
>>> generator = ScatteringCoefficientGenerator(config, random_state=42)
>>> S = generator.generate(n_samples=100, wavelengths=wavelengths)
- generate(n_samples: int, wavelengths: ndarray, particle_sizes: ndarray | None = None) ndarray[source]
Generate scattering coefficients for samples.
- Parameters:
n_samples – Number of samples.
wavelengths – Wavelength array in nm.
particle_sizes – Optional per-sample particle sizes (μm).
- Returns:
Scattering coefficient array (n_samples, n_wavelengths).
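To make the shape of the output concrete, here is a rough sketch of how a power-law S(λ) could be built from the configuration fields listed above. It assumes S(λ) = baseline × (λ / λ_ref)^(-exponent) with a per-sample multiplicative factor; that formula, the function name `scattering_coefficients`, and the clipping of the factor are assumptions for illustration, and particle-size handling is omitted. The library's actual computation may differ.

```python
import random


def scattering_coefficients(n_samples, wavelengths, baseline_scattering=1.0,
                            wavelength_exponent=1.0,
                            wavelength_reference_nm=1500.0,
                            sample_variation=0.15, seed=None):
    """Return a list of n_samples rows, each with one S value per wavelength."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Assumed per-sample scale factor, clipped so S stays positive.
        scale = max(rng.gauss(1.0, sample_variation), 0.05)
        rows.append([
            baseline_scattering * scale
            * (wl / wavelength_reference_nm) ** (-wavelength_exponent)
            for wl in wavelengths
        ])
    return rows


S = scattering_coefficients(3, [1000.0, 1500.0, 2000.0],
                            sample_variation=0.0, seed=0)
```

With `sample_variation=0.0`, every row equals the baseline at the reference wavelength and decreases toward longer wavelengths.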
- class nirs4all.data.synthetic.ScatteringConfig(baseline_scattering: float = 1.0, wavelength_exponent: float = 1.0, particle_size_um: float = 50.0, particle_size_variation: float = 0.2, sample_to_sample_variation: float = 0.15)[source]
Bases: object
Configuration for scattering coefficient generation.
Controls how scattering coefficients are generated for samples, which is essential for Kubelka-Munk reflectance simulation.
- wavelength_exponent
Exponent for wavelength dependence (Rayleigh-like). S(λ) ∝ λ^(-exponent), typically 0.5-2.0
- Type:
float
- class nirs4all.data.synthetic.ScatteringEffectsConfig(model: ScatteringModel = ScatteringModel.EMSC, particle_size: ParticleSizeConfig = <factory>, emsc: EMSCConfig = <factory>, scattering_coefficient: ScatteringCoefficientConfig = <factory>, enable_particle_size: bool = True, enable_emsc: bool = True)[source]
Bases: object
Combined configuration for all scattering effects.
- model
Which scattering model to use.
- particle_size
Particle size effect configuration.
- emsc
EMSC-style transformation configuration.
- scattering_coefficient
Scattering coefficient generation config.
- emsc: EMSCConfig
- model: ScatteringModel = 'emsc'
- particle_size: ParticleSizeConfig
- scattering_coefficient: ScatteringCoefficientConfig
- class nirs4all.data.synthetic.ScatteringEffectsSimulator(config: ScatteringEffectsConfig | None = None, random_state: int | None = None)[source]
Bases: object
Combined simulator for all scattering effects.
Applies particle size effects and EMSC-style transformations in the correct order.
- config
Scattering effects configuration.
- particle_sim
Particle size simulator.
- emsc_sim
EMSC transformation simulator.
- scatter_gen
Scattering coefficient generator.
- rng
Random number generator.
Example
>>> config = ScatteringEffectsConfig(
...     model=ScatteringModel.EMSC,
...     particle_size=ParticleSizeConfig(
...         distribution=ParticleSizeDistribution(mean_size_um=30.0)
...     )
... )
>>> simulator = ScatteringEffectsSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, particle_sizes: ndarray | None = None) ndarray[source]
Apply all scattering effects to spectra.
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
particle_sizes – Optional per-sample particle sizes.
- Returns:
Modified spectra with scattering effects applied.
- generate_scattering_coefficients(n_samples: int, wavelengths: ndarray, particle_sizes: ndarray | None = None) ndarray[source]
Generate scattering coefficients for Kubelka-Munk.
- Parameters:
n_samples – Number of samples.
wavelengths – Wavelength array.
particle_sizes – Optional particle sizes.
- Returns:
Scattering coefficient array (n_samples, n_wavelengths).
- class nirs4all.data.synthetic.ScatteringInference(has_scatter_effects: bool = False, estimated_particle_size_um: float = 50.0, multiplicative_scatter_std: float = 0.0, additive_scatter_std: float = 0.0, baseline_curvature: float = 0.0, snv_correctable: bool = False, msc_correctable: bool = False)[source]
Bases: object
Results of scattering effects inference.
- class nirs4all.data.synthetic.ScatteringModel(value)[source]
Available scattering models.
- EMSC = 'emsc'
- KUBELKA_MUNK = 'kubelka_munk'
- MIE_APPROX = 'mie_approx'
- POLYNOMIAL = 'polynomial'
- RAYLEIGH = 'rayleigh'
- class nirs4all.data.synthetic.SensorConfig(detector_type: DetectorType, wavelength_range: Tuple[float, float], spectral_resolution: float = 8.0, noise_level: float = 1.0, gain: float = 1.0, overlap_range: float = 20.0)[source]
Bases: object
Configuration for a single sensor/detector in a multi-sensor system.
Multi-sensor instruments use multiple detectors with different wavelength ranges, then stitch the signals together. This is common in extended-range instruments (e.g., 400-2500 nm coverage using Si + InGaAs detectors).
- detector_type
Type of detector for this sensor.
- detector_type: DetectorType
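The stitching idea described for multi-sensor instruments can be illustrated with a linear crossfade inside the overlap region. This is a sketch under stated assumptions: both detector signals are already resampled onto one shared wavelength grid, and the blend is linear. The function name `stitch` and its parameters are hypothetical; the library may stitch differently.

```python
def stitch(wavelengths, signal_a, signal_b, overlap_start, overlap_end):
    """Blend two detector signals sampled on the same wavelength grid.

    Below overlap_start only detector A is used, above overlap_end only
    detector B, and in between the weight moves linearly from A to B.
    """
    stitched = []
    for wl, a, b in zip(wavelengths, signal_a, signal_b):
        if wl <= overlap_start:
            weight_b = 0.0
        elif wl >= overlap_end:
            weight_b = 1.0
        else:
            weight_b = (wl - overlap_start) / (overlap_end - overlap_start)
        stitched.append((1.0 - weight_b) * a + weight_b * b)
    return stitched


# 20 nm overlap, matching the default overlap_range of SensorConfig.
grid = [1080.0, 1090.0, 1100.0, 1110.0, 1120.0]
blended = stitch(grid, [1.0] * 5, [2.0] * 5, 1090.0, 1110.0)
```

In this toy case the output steps smoothly from the first detector's level to the second's across the overlap, which avoids a visible discontinuity at the junction.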
- class nirs4all.data.synthetic.SourceConfig(name: str, source_type: Literal['nir', 'vis', 'aux', 'markers'] = 'nir', n_features: int | None = None, wavelength_start: float | None = None, wavelength_end: float | None = None, wavelength_step: float = 2.0, components: List[str] | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'simple', distribution: Literal['normal', 'uniform', 'lognormal'] = 'normal', correlation_with_target: float = 0.5)[source]
Bases: object
Configuration for a single data source.
- source_type
Type of source (‘nir’, ‘vis’, ‘aux’, ‘markers’).
- Type:
Literal[‘nir’, ‘vis’, ‘aux’, ‘markers’]
- # NIR-specific
- complexity
Complexity level for NIR sources.
- Type:
Literal[‘simple’, ‘realistic’, ‘complex’]
- # Auxiliary-specific
- distribution
Distribution for auxiliary features.
- Type:
Literal[‘normal’, ‘uniform’, ‘lognormal’]
- class nirs4all.data.synthetic.SpectralComponent(name: str, bands: List[NIRBand] = <factory>, correlation_group: int | None = None)[source]
Bases: object
A spectral component representing a chemical compound or functional group.
Each component consists of multiple absorption bands that together define the characteristic NIR signature of the compound.
- bands
List of NIRBand objects defining the spectral signature.
- Type:
List[NIRBand]
- correlation_group
Optional group ID for components that should have correlated concentrations (e.g., protein and nitrogen compounds).
- Type:
int | None
Example
>>> water = SpectralComponent(
...     name="water",
...     bands=[
...         NIRBand(center=1450, sigma=25, gamma=3, amplitude=0.8),
...         NIRBand(center=1940, sigma=30, gamma=4, amplitude=1.0),
...     ],
...     correlation_group=1
... )
>>> wavelengths = np.arange(1000, 2500, 2)
>>> spectrum = water.compute(wavelengths)
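To show how several bands can add up to one component signature, the sketch below sums peak-normalized pseudo-Voigt profiles using the same band parameters as the example above. It is an approximation written for illustration: the pseudo-Voigt form (Thompson-Cox-Hastings mixing), the peak normalization, and the helper names `pseudo_voigt` and `component_spectrum` are assumptions and not the library's `NIRBand` or `compute` implementation.

```python
import math


def pseudo_voigt(wavelength, center, sigma, gamma, amplitude):
    """Peak-normalized pseudo-Voigt band; equals `amplitude` at `center`."""
    fwhm_g = 2.0 * sigma * math.sqrt(2.0 * math.log(2.0))  # Gaussian FWHM
    fwhm_l = 2.0 * gamma                                    # Lorentzian FWHM
    # Thompson-Cox-Hastings approximation of the combined width and mixing.
    fwhm = (
        fwhm_g**5
        + 2.69269 * fwhm_g**4 * fwhm_l
        + 2.42843 * fwhm_g**3 * fwhm_l**2
        + 4.47163 * fwhm_g**2 * fwhm_l**3
        + 0.07842 * fwhm_g * fwhm_l**4
        + fwhm_l**5
    ) ** 0.2
    ratio = fwhm_l / fwhm
    eta = 1.36603 * ratio - 0.47719 * ratio**2 + 0.11116 * ratio**3
    x = (wavelength - center) / fwhm
    gaussian = math.exp(-4.0 * math.log(2.0) * x * x)
    lorentzian = 1.0 / (1.0 + 4.0 * x * x)
    return amplitude * (eta * lorentzian + (1.0 - eta) * gaussian)


def component_spectrum(wavelengths, bands):
    """Sum all bands of one component at every wavelength."""
    return [sum(pseudo_voigt(wl, **band) for band in bands)
            for wl in wavelengths]


water_bands = [
    {"center": 1450, "sigma": 25, "gamma": 3, "amplitude": 0.8},
    {"center": 1940, "sigma": 30, "gamma": 4, "amplitude": 1.0},
]
wavelengths = list(range(1000, 2500, 2))
spectrum = component_spectrum(wavelengths, water_bands)
```

Note that `range(1000, 2500, 2)`, like `np.arange(1000, 2500, 2)`, excludes the end point and therefore yields 750 wavelengths.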
- class nirs4all.data.synthetic.SpectralProperties(name: str = 'dataset', n_samples: int = 0, n_wavelengths: int = 0, wavelengths: ndarray | None = None, mean_spectrum: ndarray | None = None, std_spectrum: ndarray | None = None, global_mean: float = 0.0, global_std: float = 0.0, global_range: Tuple[float, float] = (0.0, 0.0), mean_slope: float = 0.0, slope_std: float = 0.0, slopes: ndarray | None = None, mean_curvature: float = 0.0, curvature_std: float = 0.0, skewness: float = 0.0, kurtosis: float = 0.0, noise_estimate: float = 0.0, snr_estimate: float = 0.0, pca_explained_variance: ndarray | None = None, pca_n_components_95: int = 0, n_peaks_mean: float = 0.0, peak_positions: ndarray | None = None, peak_wavenumbers: ndarray | None = None, effective_resolution: float = 8.0, noise_correlation_length: float = 1.0, wavelength_range: Tuple[float, float] = (1000.0, 2500.0), baseline_offset: float = 0.0, kubelka_munk_linearity: float = 0.0, baseline_convexity: float = 0.0, water_band_variation: float = 0.0, oh_band_positions: ndarray | None = None, temperature_sensitivity_score: float = 0.0, scatter_baseline_slope: float = 0.0, scatter_baseline_curvature: float = 0.0, sample_to_sample_offset_std: float = 0.0, sample_to_sample_slope_std: float = 0.0, protein_band_intensity: float = 0.0, carbohydrate_band_intensity: float = 0.0, lipid_band_intensity: float = 0.0, water_band_intensity: float = 0.0)[source]
Bases: object
Container for computed spectral properties of a dataset.
This dataclass holds various statistical and spectral properties computed from a NIRS dataset for comparison and fitting purposes.
- wavelengths
Wavelength grid.
- Type:
numpy.ndarray | None
- # Basic statistics
- mean_spectrum
Mean spectrum across samples.
- Type:
numpy.ndarray | None
- std_spectrum
Standard deviation spectrum.
- Type:
numpy.ndarray | None
- # Shape properties
- # Distribution statistics
- # Noise characteristics
- # PCA properties
- pca_explained_variance
Explained variance ratios.
- Type:
numpy.ndarray | None
- # Peak analysis
- peak_positions
Wavelengths of detected peaks.
- Type:
numpy.ndarray | None
- peak_wavenumbers
Wavenumber positions of peaks.
- Type:
numpy.ndarray | None
- # Phase 1-4 Enhanced properties
- # Instrument indicators
- # Measurement mode indicators
- # Environmental indicators
- oh_band_positions
Detected O-H band positions.
- Type:
numpy.ndarray | None
- # Scattering indicators
- # Domain indicators
- class nirs4all.data.synthetic.SpectralRealismScore(correlation_length_overlap: float, derivative_ks_pvalue: float, peak_density_ratio: float, baseline_curvature_overlap: float, snr_magnitude_match: bool, adversarial_auc: float, overall_pass: bool, metric_results: List[MetricResult] = <factory>, warnings: List[str] = <factory>)[source]
Bases: object
Complete spectral realism assessment results.
This dataclass contains the results of comparing synthetic spectra against real spectra using multiple quantitative metrics.
- metric_results
Individual metric results with details.
- Type:
List[MetricResult]
Example
>>> score = compute_spectral_realism_scorecard(real_spectra, synthetic_spectra, wavelengths)
>>> print(f"Overall pass: {score.overall_pass}")
>>> print(f"Adversarial AUC: {score.adversarial_auc:.3f}")
>>> for metric in score.metric_results:
...     print(metric)
- metric_results: List[MetricResult]
- class nirs4all.data.synthetic.SpectralRegion(value)[source]
NIR spectral regions with distinct temperature responses.
- CH_COMBINATION = 'ch_combination'
- CH_FIRST_OVERTONE = 'ch_1st_overtone'
- NH_COMBINATION = 'nh_combination'
- NH_FIRST_OVERTONE = 'nh_1st_overtone'
- OH_COMBINATION = 'oh_combination'
- OH_FIRST_OVERTONE = 'oh_1st_overtone'
- WATER_BOUND = 'water_bound'
- WATER_FREE = 'water_free'
- class nirs4all.data.synthetic.SyntheticDatasetBuilder(n_samples: int = 1000, random_state: int | None = None, name: str = 'synthetic_nirs')[source]
Bases: object
Fluent builder for constructing synthetic NIRS datasets.
This builder provides a chainable interface for configuring all aspects of synthetic data generation, from spectral features to targets and metadata.
The builder accumulates configuration through method calls, then generates the dataset when build() is called.
- state
Internal BuilderState containing all configuration.
- Parameters:
n_samples – Number of samples to generate.
random_state – Random seed for reproducibility.
name – Dataset name.
Example
>>> from nirs4all.data.synthetic import SyntheticDatasetBuilder
>>>
>>> # Simple usage
>>> dataset = SyntheticDatasetBuilder(n_samples=500).build()
>>>
>>> # Full configuration
>>> dataset = (
...     SyntheticDatasetBuilder(n_samples=1000, random_state=42)
...     .with_features(
...         wavelength_range=(1000, 2500),
...         complexity="realistic",
...         components=["water", "protein", "lipid"]
...     )
...     .with_targets(
...         distribution="lognormal",
...         range=(5, 50),
...         component="protein"
...     )
...     .with_metadata(
...         n_groups=3,
...         n_repetitions=(2, 5)
...     )
...     .with_partitions(train_ratio=0.8)
...     .build()
... )
See also
nirs4all.generate: Top-level convenience function.
SyntheticNIRSGenerator: Core generation engine.
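The chainable interface described for this class follows the common fluent-builder pattern: each `with_*` method stores settings and returns `self`, and `build()` may only be called once. The toy class below illustrates that pattern only; `MiniBuilder` and its internals are hypothetical and do not reproduce the real builder's state or output.

```python
class MiniBuilder:
    """Toy fluent builder: accumulate configuration, then build once."""

    def __init__(self, n_samples=1000, random_state=None):
        self._config = {"n_samples": n_samples, "random_state": random_state}
        self._built = False

    def with_features(self, *, wavelength_range=None, complexity=None):
        # Only overwrite settings that were explicitly provided.
        if wavelength_range is not None:
            self._config["wavelength_range"] = wavelength_range
        if complexity is not None:
            self._config["complexity"] = complexity
        return self  # returning self is what enables chaining

    def with_partitions(self, *, train_ratio=None):
        if train_ratio is not None:
            self._config["train_ratio"] = train_ratio
        return self

    def build(self):
        if self._built:
            raise RuntimeError("build() was already called on this builder")
        self._built = True
        return dict(self._config)


result = (
    MiniBuilder(n_samples=10, random_state=42)
    .with_features(complexity="realistic")
    .with_partitions(train_ratio=0.8)
    .build()
)
```

Keyword-only arguments (the bare `*`) mirror the documented signatures and keep long chains readable.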
- build() SpectroDataset | Tuple[np.ndarray, np.ndarray][source]
Build the synthetic dataset with all configured options.
This method generates the data and returns it in the configured format.
- Returns:
If as_dataset=True (default): SpectroDataset instance. If as_dataset=False: Tuple of (X, y) numpy arrays.
- Raises:
RuntimeError – If build() was already called on this builder.
Example
>>> dataset = builder.build()
>>> print(dataset.num_samples)
1000
- build_arrays() Tuple[ndarray, ndarray][source]
Build and return raw numpy arrays.
This is a convenience method equivalent to calling with_output(as_dataset=False).build().
- Returns:
Tuple of (X, y) numpy arrays.
Example
>>> X, y = builder.build_arrays()
- build_dataset() SpectroDataset[source]
Build and return a SpectroDataset.
This is a convenience method equivalent to calling with_output(as_dataset=True).build().
- Returns:
SpectroDataset instance.
Example
>>> dataset = builder.build_dataset()
- export(path: str | 'Path', format: Literal['standard', 'single', 'fragmented'] = 'standard') Path[source]
Generate data and export to folder.
Generates the synthetic data and exports it to a folder structure compatible with nirs4all’s DatasetConfigs loader.
- Parameters:
path – Output folder path.
format – Export format:
    - ‘standard’: Xcal, Ycal, Xval, Yval files.
    - ‘single’: All data in one file with partition column.
    - ‘fragmented’: Multiple small files (for testing).
- Returns:
Path to created folder.
Example
>>> builder = SyntheticDatasetBuilder(n_samples=1000)
>>> path = builder.export("data/synthetic", format="standard")
- export_to_csv(path: str | 'Path', include_targets: bool = True) Path[source]
Generate data and export to a single CSV file.
- Parameters:
path – Output file path.
include_targets – Whether to include target column(s).
- Returns:
Path to created file.
Example
>>> path = builder.export_to_csv("data.csv")
- fit_to(template: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, *, match_statistics: bool = True, match_structure: bool = True) SyntheticDatasetBuilder[source]
Configure builder to generate data similar to a template.
Analyzes the template data and adjusts generation parameters to produce synthetic data with similar properties.
- Parameters:
template – Real data to mimic (array or SpectroDataset).
wavelengths – Wavelength grid (if template is array).
match_statistics – Match statistical properties (mean, std).
match_structure – Match PCA structure and complexity.
- Returns:
Self for method chaining.
Example
>>> builder = SyntheticDatasetBuilder(n_samples=1000)
>>> builder.fit_to(X_real, wavelengths=wl)
>>> X_synth, y = builder.build_arrays()
- classmethod from_config(config: SyntheticDatasetConfig) SyntheticDatasetBuilder[source]
Create a builder from a SyntheticDatasetConfig object.
- Parameters:
config – Configuration object to use.
- Returns:
Configured SyntheticDatasetBuilder instance.
Example
>>> config = SyntheticDatasetConfig(n_samples=500)
>>> builder = SyntheticDatasetBuilder.from_config(config)
>>> dataset = builder.build()
- get_config() SyntheticDatasetConfig[source]
Get the current configuration as a SyntheticDatasetConfig object.
- Returns:
SyntheticDatasetConfig with all current settings.
Example
>>> config = builder.get_config()
>>> print(config.n_samples)
1000
- with_batch_effects(*, enabled: bool = True, n_batches: int = 3) SyntheticDatasetBuilder[source]
Configure batch/session effects simulation.
Batch effects introduce systematic variations between measurement sessions, useful for domain adaptation research.
- Parameters:
enabled – Whether to enable batch effects.
n_batches – Number of measurement batches.
- Returns:
Self for method chaining.
Example
>>> builder.with_batch_effects(n_batches=5)
- with_classification(*, n_classes: int = 2, separation: float = 1.5, class_weights: List[float] | None = None, separation_method: Literal['component', 'threshold', 'cluster'] = 'component') SyntheticDatasetBuilder[source]
Configure target generation for classification tasks.
This creates discrete class labels with controllable separation between classes, enabling classification experiments with varying difficulty levels.
- Parameters:
n_classes – Number of classes to generate.
separation – Class separation factor (higher = more separable):
    - Values around 0.5-1.0: overlapping classes (challenging).
    - Values around 1.5-2.0: moderate separation (realistic).
    - Values around 2.5+: well-separated classes (easy).
class_weights – Optional class weights for imbalanced datasets. Should sum to 1.0.
separation_method – How to create class differences:
    - “component”: Different component concentration profiles per class.
    - “threshold”: Classes based on concentration thresholds.
    - “cluster”: K-means-like cluster assignment.
- Returns:
Self for method chaining.
Example
>>> builder.with_classification(
...     n_classes=3,
...     separation=2.0,
...     class_weights=[0.5, 0.3, 0.2]
... )
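As a rough illustration of how class weights and a threshold-style separation could interact, the sketch below assigns labels by cutting a continuous value at the cumulative class weights. This is only one plausible reading of the “threshold” method; the function name `threshold_labels` and the rank-based cut are assumptions, not the library's algorithm.

```python
def threshold_labels(values, class_weights):
    """Assign class labels by cutting sorted values at cumulative weights."""
    if abs(sum(class_weights) - 1.0) > 1e-9:
        raise ValueError("class_weights should sum to 1.0")
    n = len(values)
    # Rank boundaries: the first weight covers the lowest values, and so on.
    boundaries = []
    cumulative = 0.0
    for weight in class_weights[:-1]:
        cumulative += weight
        boundaries.append(round(cumulative * n))
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, index in enumerate(order):
        labels[index] = sum(rank >= b for b in boundaries)
    return labels


concentrations = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
labels = threshold_labels(concentrations, [0.5, 0.3, 0.2])
```

With ten samples and weights 0.5, 0.3, and 0.2, the lowest five values fall into class 0, the next three into class 1, and the top two into class 2.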
- with_complex_target_landscape(*, n_regimes: int = 3, regime_method: Literal['concentration', 'spectral', 'random'] = 'concentration', regime_overlap: float = 0.2, noise_heteroscedasticity: float = 0.5) SyntheticDatasetBuilder[source]
Configure multi-regime target landscapes with spatially-varying relationships.
This creates regions in feature space where the target-spectra relationship differs, simulating subpopulations like ripe/unripe fruit or healthy/diseased.
- Parameters:
n_regimes – Number of different relationship regimes. Default 3.
regime_method – How to partition samples into regimes:
    - “concentration”: Regimes based on concentration space clustering.
    - “spectral”: Regimes based on spectral feature patterns.
    - “random”: Random regime assignment (baseline difficulty).
regime_overlap – Overlap between regimes creating transition zones. 0 = hard boundaries, 0.5 = smooth transitions. Default 0.2.
noise_heteroscedasticity – How much prediction noise varies by regime. 0 = same noise everywhere, 1 = very different noise levels. Default 0.5.
- Returns:
Self for method chaining.
Example
>>> # Create challenging multi-regime landscape
>>> builder.with_complex_target_landscape(
...     n_regimes=4,
...     regime_method="concentration",
...     regime_overlap=0.3,
...     noise_heteroscedasticity=0.7
... )
- with_features(*, wavelength_range: Tuple[float, float] | None = None, wavelength_step: float | None = None, complexity: Literal['simple', 'realistic', 'complex'] | None = None, components: List[str] | None = None, component_library: ComponentLibrary | None = None) SyntheticDatasetBuilder[source]
Configure spectral feature generation.
- Parameters:
wavelength_range – Tuple of (start, end) wavelengths in nm.
wavelength_step – Wavelength sampling step in nm.
complexity – Complexity level affecting noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.
components – List of predefined component names to use.
component_library – Pre-configured ComponentLibrary instance.
- Returns:
Self for method chaining.
- Raises:
ValueError – If both components and component_library are specified.
Example
>>> builder.with_features(
...     wavelength_range=(1000, 2500),
...     complexity="realistic",
...     components=["water", "protein"]
... )
- with_metadata(*, sample_ids: bool = True, sample_id_prefix: str | None = None, n_groups: int | None = None, n_repetitions: int | Tuple[int, int] | None = None, group_names: List[str] | None = None) SyntheticDatasetBuilder[source]
Configure sample metadata generation.
Generates realistic metadata including sample IDs, biological sample groupings (with repetitions), and group assignments.
- Parameters:
sample_ids – Whether to generate sample IDs.
sample_id_prefix – Prefix for sample ID strings.
n_groups – Number of sample groups (for grouped cross-validation).
n_repetitions – Repetitions per biological sample. Either a fixed int or a (min, max) tuple for random variation. When set, each “biological sample” gets multiple spectral measurements.
group_names – Optional list of group names. If None and n_groups > 0, generates names like “Group_0”, “Group_1”, etc.
- Returns:
Self for method chaining.
Example
>>> builder.with_metadata(
...     n_groups=5,
...     n_repetitions=(2, 4),
...     group_names=["Field_A", "Field_B", "Field_C", "Field_D", "Field_E"]
... )
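The effect of `n_repetitions` can be pictured as assigning one biological sample ID to several consecutive measurements. The sketch below is hypothetical (the name `assign_repetitions`, the ID format, and the truncation of the last sample are all assumptions) and only illustrates the fixed-int versus (min, max) behavior described above.

```python
import random


def assign_repetitions(n_samples, n_repetitions, prefix="S", seed=None):
    """Return one biological-sample ID per spectral measurement."""
    rng = random.Random(seed)
    ids = []
    bio_index = 0
    while len(ids) < n_samples:
        if isinstance(n_repetitions, tuple):
            # (min, max) tuple: draw the repetition count per sample.
            reps = rng.randint(n_repetitions[0], n_repetitions[1])
        else:
            reps = n_repetitions
        reps = min(reps, n_samples - len(ids))  # do not overshoot n_samples
        ids.extend([f"{prefix}{bio_index:04d}"] * reps)
        bio_index += 1
    return ids


fixed = assign_repetitions(10, 3)
varied = assign_repetitions(20, (2, 4), seed=1)
```

IDs like these are what a grouped cross-validation split would key on, so that repeated measurements of one sample never land in both train and test folds.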
- with_nonlinear_targets(*, interactions: Literal['none', 'polynomial', 'synergistic', 'antagonistic'] = 'polynomial', interaction_strength: float = 0.5, hidden_factors: int = 0, polynomial_degree: int = 2) SyntheticDatasetBuilder[source]
Configure non-linear relationships between concentrations and targets.
This introduces non-linear mixture effects that make targets harder to predict with simple linear models, simulating real chemical interactions.
- Parameters:
interactions – Type of non-linear interaction:
    - “none”: Pure linear relationship (default behavior).
    - “polynomial”: Include terms like C1², C1×C2, etc.
    - “synergistic”: Non-additive effects where combinations enhance target.
    - “antagonistic”: Saturation/inhibition (Michaelis-Menten-like).
interaction_strength – Blend factor between linear and non-linear. 0 = purely linear, 1 = fully non-linear. Default 0.5.
hidden_factors – Number of latent variables that affect target but have NO spectral signature. Forces models to learn robust features.
polynomial_degree – Maximum degree for polynomial interactions (2 or 3).
- Returns:
Self for method chaining.
Example
>>> # Make targets require non-linear models
>>> builder.with_nonlinear_targets(
...     interactions="polynomial",
...     interaction_strength=0.7,
...     hidden_factors=2
... )
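The role of `interaction_strength` as a blend factor can be sketched for two concentrations. The specific non-linear terms below (second-order polynomial, a product term for synergy, and a saturation curve with half-saturation constant 1 for antagonism) are illustrative assumptions, as is the function name `blend_target`; the library's formulas are not shown in this reference.

```python
def blend_target(c1, c2, interactions="polynomial", interaction_strength=0.5):
    """Blend a linear target with an assumed non-linear variant."""
    linear = c1 + c2
    if interactions == "none":
        nonlinear = linear
    elif interactions == "polynomial":
        nonlinear = c1**2 + c1 * c2 + c2**2  # terms like C1², C1×C2
    elif interactions == "synergistic":
        nonlinear = linear + c1 * c2         # combination enhances target
    elif interactions == "antagonistic":
        nonlinear = linear / (1.0 + linear)  # Michaelis-Menten-like saturation
    else:
        raise ValueError(f"unknown interaction type: {interactions}")
    s = interaction_strength
    # s = 0 gives the purely linear target, s = 1 the fully non-linear one.
    return (1.0 - s) * linear + s * nonlinear


purely_linear = blend_target(0.2, 0.3, "polynomial", 0.0)
fully_polynomial = blend_target(0.2, 0.3, "polynomial", 1.0)
saturating = blend_target(0.2, 0.3, "antagonistic", 1.0)
```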
- with_output(*, as_dataset: bool | None = None, include_metadata: bool | None = None) SyntheticDatasetBuilder[source]
Configure output format.
- Parameters:
as_dataset – If True, returns SpectroDataset. If False, returns tuple.
include_metadata – Whether to include generation metadata in output.
- Returns:
Self for method chaining.
Example
>>> builder.with_output(as_dataset=False) # Returns (X, y) tuple
- with_partitions(*, train_ratio: float | None = None, stratify: bool | None = None, shuffle: bool | None = None) SyntheticDatasetBuilder[source]
Configure data partitioning (train/test split).
- Parameters:
train_ratio – Proportion of samples for training (0.0-1.0).
stratify – Whether to stratify by target (for classification).
shuffle – Whether to shuffle before splitting.
- Returns:
Self for method chaining.
Example
>>> builder.with_partitions(train_ratio=0.75, shuffle=True)
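For orientation, a train/test split driven by `train_ratio` and `shuffle` can be as simple as the index split below. This is a generic sketch with a hypothetical helper name (`split_indices`); stratification is left out, and the builder's own partitioning logic may differ.

```python
import random


def split_indices(n_samples, train_ratio=0.8, shuffle=True, seed=None):
    """Return (train_indices, test_indices) for a simple ratio split."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(100, train_ratio=0.75, shuffle=True, seed=0)
```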
- with_sources(sources: List[Dict[str, Any] | Any]) SyntheticDatasetBuilder[source]
Configure multi-source generation.
Multi-source datasets combine different types of data, such as multiple NIR spectral ranges or NIR spectra with auxiliary measurements.
- Parameters:
sources – List of source configurations. Each source is a dict with:
    - name: Unique source identifier (required).
    - type: Source type, one of “nir”, “vis”, “aux”, “markers” (default: “nir”).
    - wavelength_range: (start, end) for NIR sources.
    - n_features: Number of features for auxiliary sources.
    - complexity: Complexity level for NIR sources.
    - components: Component names for NIR sources.
- Returns:
Self for method chaining.
Example
>>> builder.with_sources([
...     {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...     {"name": "markers", "type": "aux", "n_features": 15}
... ])
- with_target_complexity(*, signal_to_confound_ratio: float = 0.7, n_confounders: int = 2, spectral_masking: float = 0.0, temporal_drift: bool = False) SyntheticDatasetBuilder[source]
Configure spectral-target decoupling and confounding effects.
This introduces factors that make the target only partially predictable from spectral features, simulating real-world irreducible error.
- Parameters:
signal_to_confound_ratio – Proportion of target variance explainable from spectra. 1.0 = fully predictable, 0.5 = 50% unexplainable. Default 0.7 (70% predictable).
n_confounders – Number of confounding variables that affect both spectra and target in different ways. Default 2.
spectral_masking – Fraction of predictive signal hidden in high-noise wavelength regions (0.0-0.5). Default 0.0.
temporal_drift – If True, the target-spectra relationship gradually changes across samples, testing model robustness.
- Returns:
Self for method chaining.
Example
>>> # Add realistic confounding
>>> builder.with_target_complexity(
...     signal_to_confound_ratio=0.6,
...     n_confounders=3,
...     temporal_drift=True
... )
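One way to read `signal_to_confound_ratio` is as the fraction of target variance carried by the spectrally visible signal. The sketch below mixes a unit-variance signal with an uncorrelated unit-variance confounder so that the squared correlation between target and signal equals the ratio. The mixing rule and the helper names `mix_target` and `explained_fraction` are assumptions made for illustration.

```python
import math


def mix_target(signal, confound, signal_to_confound_ratio=0.7):
    """Combine signal and confounder so the signal explains `ratio` of variance."""
    r = signal_to_confound_ratio
    if not 0.0 <= r <= 1.0:
        raise ValueError("signal_to_confound_ratio must be in [0, 1]")
    a, b = math.sqrt(r), math.sqrt(1.0 - r)
    return [a * s + b * c for s, c in zip(signal, confound)]


def explained_fraction(target, signal):
    """Squared Pearson correlation between target and signal."""
    n = len(target)
    mean_t = sum(target) / n
    mean_s = sum(signal) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(target, signal)) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    var_s = sum((s - mean_s) ** 2 for s in signal) / n
    return cov * cov / (var_t * var_s)


# Zero-mean, unit-variance, mutually orthogonal toy sequences.
signal = [1.0, -1.0, 1.0, -1.0]
confound = [1.0, 1.0, -1.0, -1.0]
target = mix_target(signal, confound, 0.7)
r_squared = explained_fraction(target, signal)
```

In this toy setup even a perfect model of the signal cannot exceed an R² of 0.7, which is the irreducible-error behavior the parameter description refers to.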
- with_targets(*, distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] | None = None, range: Tuple[float, float] | None = None, component: str | int | None = None, transform: Literal['log', 'sqrt'] | None = None) SyntheticDatasetBuilder[source]
Configure target variable generation for regression tasks.
- Parameters:
distribution – Concentration distribution method. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.
range – Target value range (min, max) for scaling.
component – Which component to use as target. If None, uses all components (multi-output). If str, uses the component with that name. If int, uses the component at that index.
transform – Optional transformation to apply (‘log’, ‘sqrt’).
- Returns:
Self for method chaining.
Example
>>> builder.with_targets(
...     distribution="lognormal",
...     range=(5, 50),
...     component="protein"
... )
- class nirs4all.data.synthetic.SyntheticDatasetConfig(n_samples: int = 1000, random_state: int | None = None, features: FeatureConfig = <factory>, targets: TargetConfig = <factory>, metadata: MetadataConfig = <factory>, partitions: PartitionConfig = <factory>, batch_effects: BatchEffectConfig = <factory>, nonlinear: NonLinearConfig = <factory>, confounders: ConfounderConfig = <factory>, multi_regime: MultiRegimeConfig = <factory>, output: OutputConfig = <factory>, name: str = 'synthetic_nirs')[source]
Bases: object
Complete configuration for synthetic dataset generation.
This is the main configuration object that combines all sub-configurations for generating synthetic NIRS datasets.
- features
Feature generation configuration.
- targets
Target variable configuration.
- metadata
Sample metadata configuration.
- partitions
Train/test split configuration.
- batch_effects
Batch effect configuration.
- output
Output format configuration.
Example
>>> config = SyntheticDatasetConfig(
...     n_samples=1000,
...     random_state=42,
...     features=FeatureConfig(complexity="realistic"),
...     targets=TargetConfig(distribution="lognormal", range=(0, 100)),
... )
- batch_effects: BatchEffectConfig
- confounders: ConfounderConfig
- features: FeatureConfig
- metadata: MetadataConfig
- multi_regime: MultiRegimeConfig
- nonlinear: NonLinearConfig
- output: OutputConfig
- partitions: PartitionConfig
- targets: TargetConfig
- class nirs4all.data.synthetic.SyntheticNIRSGenerator(wavelength_start: float = 1000.0, wavelength_end: float = 2500.0, wavelength_step: float = 2.0, component_library: ComponentLibrary | None = None, complexity: Literal['simple', 'realistic', 'complex'] = 'realistic', instrument: str | InstrumentArchetype | None = None, measurement_mode: str | MeasurementMode | None = None, multi_sensor_config: MultiSensorConfig | None = None, multi_scan_config: MultiScanConfig | None = None, environmental_config: EnvironmentalEffectsConfig | None = None, scattering_effects_config: ScatteringEffectsConfig | None = None, random_state: int | None = None)[source]
Bases: object
Generator for synthetic NIRS spectra with realistic instrumental effects.
This generator implements a physically-motivated model based on Beer-Lambert law with additional effects for baseline, scattering, instrumental response, and noise.
- Model:
A_i(λ) = L_i * Σ_k c_ik * ε_k(λ) + baseline_i(λ) + scatter_i(λ) + noise_i(λ)
- where:
c_ik: concentration of component k in sample i
ε_k(λ): molar absorptivity of component k (Voigt profiles)
L_i: optical path length factor
baseline: polynomial baseline drift
scatter: multiplicative/additive scattering effects
noise: wavelength-dependent Gaussian noise
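The model equation above can be evaluated directly for a single sample. The sketch below is a plain-Python illustration of that formula, not the generator's code: the function name `synth_absorbance` is hypothetical, scatter is treated as a purely additive term as written in the equation, and the noise is simplified to a single standard deviation rather than a wavelength-dependent one.

```python
import random


def synth_absorbance(concentrations, component_spectra, path_length=1.0,
                     baseline=None, scatter=None, noise_std=0.0, rng=None):
    """A(λ) = L * Σ_k c_k * ε_k(λ) + baseline(λ) + scatter(λ) + noise(λ)."""
    rng = rng or random.Random()
    n_wavelengths = len(component_spectra[0])
    spectrum = []
    for j in range(n_wavelengths):
        # Beer-Lambert term: path length times the concentration-weighted sum.
        value = path_length * sum(
            c * eps[j] for c, eps in zip(concentrations, component_spectra)
        )
        if baseline is not None:
            value += baseline[j]
        if scatter is not None:
            value += scatter[j]
        if noise_std > 0.0:
            value += rng.gauss(0.0, noise_std)
        spectrum.append(value)
    return spectrum


# Two components sampled at three wavelengths (rows of E are components).
E = [[1.0, 0.0, 0.5],
     [0.0, 2.0, 0.5]]
clean = synth_absorbance([0.6, 0.4], E)
shifted = synth_absorbance([0.6, 0.4], E, path_length=2.0,
                           baseline=[0.1, 0.1, 0.1])
```

The matrix `E` here plays the same role as the generator's precomputed component spectra matrix of shape (n_components, n_wavelengths).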
- Phase 2 Features:
Instrument archetype simulation (FOSS, Bruker, etc.)
Measurement mode physics (transmittance, reflectance, ATR)
Detector response curves and noise models
Multi-sensor stitching (combining signals from different wavelength ranges)
Multi-scan averaging/denoising (simulating multiple scans per sample)
- Phase 3 Features:
Temperature effects on spectral bands (O-H, N-H, C-H shifts)
Moisture and water activity effects
Particle size effects (EMSC-style scattering)
Scattering coefficient generation (Kubelka-Munk)
- wavelengths
Array of wavelength values in nm.
- n_wavelengths
Number of wavelength points.
- library
ComponentLibrary containing spectral components.
- E
Precomputed component spectra matrix (n_components, n_wavelengths).
- params
Dictionary of effect parameters based on complexity level.
- instrument
Optional InstrumentArchetype for realistic simulation.
- measurement_mode_simulator
Optional measurement mode simulator.
- environmental_simulator
Optional Phase 3 environmental effects simulator.
- scattering_effects_simulator
Optional Phase 3 scattering effects simulator.
- Parameters:
wavelength_start – Start wavelength in nm.
wavelength_end – End wavelength in nm.
wavelength_step – Wavelength step in nm.
component_library – Optional ComponentLibrary. If None, generates predefined components for realistic mode or random for simple mode.
complexity – Complexity level controlling noise, scatter, etc. Options: ‘simple’, ‘realistic’, ‘complex’.
instrument – Instrument archetype name or InstrumentArchetype object. If provided, uses instrument-specific wavelength range, detector, etc.
measurement_mode – Measurement mode (transmittance, reflectance, etc.).
multi_sensor_config – Configuration for multi-sensor stitching.
multi_scan_config – Configuration for multi-scan averaging.
environmental_config – Phase 3 configuration for temperature/moisture effects.
scattering_effects_config – Phase 3 configuration for particle size/scattering.
random_state – Random seed for reproducibility.
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=1000)
>>> print(X.shape, Y.shape, E.shape)
(1000, 751) (1000, 5) (5, 751)
>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     measurement_mode="reflectance",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)
>>> # With environmental effects (Phase 3)
>>> from nirs4all.data.synthetic import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig(
...     enable_temperature=True,
...     enable_moisture=True
... )
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)
>>> # Create a SpectroDataset directly
>>> dataset = generator.create_dataset(n_train=800, n_test=200)
See also
ComponentLibrary: For managing spectral components.
InstrumentArchetype: For instrument-specific simulation.
MeasurementModeSimulator: For measurement mode physics.
EnvironmentalEffectsSimulator: For temperature/moisture effects (Phase 3).
ScatteringEffectsSimulator: For particle size/scattering effects (Phase 3).
- create_dataset(n_train: int = 800, n_test: int = 200, target_component: str | int | None = None, **generate_kwargs: Any) SpectroDataset[source]
Create a SpectroDataset from synthetic spectra.
This method generates synthetic spectra and wraps them in a SpectroDataset object ready for use with nirs4all pipelines.
- Parameters:
n_train – Number of training samples.
n_test – Number of test samples.
target_component – Which component to use as target:
    - If None: uses all components as multi-output target.
    - If str: uses the component with that name.
    - If int: uses the component at that index.
**generate_kwargs – Additional arguments passed to generate().
- Returns:
SpectroDataset with train/test partitions.
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> dataset = generator.create_dataset(
...     n_train=800,
...     n_test=200,
...     target_component="protein"
... )
>>> print(f"Train: {dataset.n_train}, Test: {dataset.n_test}")
- generate(n_samples: int = 1000, concentration_method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', include_batch_effects: bool = False, n_batches: int = 1, include_instrument_effects: bool = True, include_multi_sensor: bool = True, include_multi_scan: bool = True, include_environmental_effects: bool = True, include_scattering_effects: bool = True, temperatures: ndarray | None = None, return_metadata: bool = False) Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]][source]
Generate synthetic NIRS spectra.
This is the main generation method that creates synthetic spectra by applying all physical effects in sequence.
- Parameters:
n_samples – Number of spectra to generate.
concentration_method – Method for generating concentrations. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.
include_batch_effects – Whether to add batch/session effects.
n_batches – Number of batches (only if include_batch_effects=True).
include_instrument_effects – Whether to apply instrument-specific effects (detector response, noise). Only applies if instrument was specified during initialization.
include_multi_sensor – Whether to apply multi-sensor stitching effects. Only applies if multi_sensor_config is set.
include_multi_scan – Whether to simulate multi-scan averaging. Only applies if multi_scan_config is set.
include_environmental_effects – Whether to apply Phase 3 temperature and moisture effects. Only applies if environmental_config is set.
include_scattering_effects – Whether to apply Phase 3 particle size and EMSC-style scattering effects. Only applies if scattering_effects_config is set.
temperatures – Optional array of temperatures (°C) for each sample. If None and environmental effects are enabled, random temperatures are generated based on the configuration. Shape: (n_samples,).
return_metadata – Whether to return additional metadata dictionary.
- Returns:
If return_metadata=False, a tuple of (X, Y, E):
X: Spectra matrix (n_samples, n_wavelengths)
Y: Concentration matrix (n_samples, n_components)
E: Component spectra (n_components, n_wavelengths)
If return_metadata=True, a tuple of (X, Y, E, metadata):
metadata: Dictionary with generation details
- Return type:
Tuple[ndarray, ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, Dict[str, Any]]
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> X, Y, E = generator.generate(n_samples=500)
>>> print(f"Spectra: {X.shape}, Targets: {Y.shape}")
Spectra: (500, 751), Targets: (500, 5)
>>> # With instrument simulation (Phase 2)
>>> generator = SyntheticNIRSGenerator(
...     instrument="foss_xds",
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500)
>>> # With environmental effects (Phase 3)
>>> from nirs4all.data.synthetic import EnvironmentalEffectsConfig
>>> env_config = EnvironmentalEffectsConfig()
>>> generator = SyntheticNIRSGenerator(
...     environmental_config=env_config,
...     random_state=42
... )
>>> X, Y, E = generator.generate(n_samples=500, include_environmental_effects=True)
>>> # With metadata
>>> X, Y, E, meta = generator.generate(100, return_metadata=True)
>>> print(meta.keys())
- generate_batch_effects(n_batches: int, samples_per_batch: List[int]) Tuple[ndarray, ndarray][source]
Generate batch/session effects for domain adaptation research.
- Parameters:
n_batches – Number of measurement batches/sessions.
samples_per_batch – List of sample counts per batch.
- Returns:
Tuple of (batch_offsets, batch_gains):
batch_offsets: Wavelength-dependent offsets per batch.
batch_gains: Multiplicative gains per batch.
- Return type:
Tuple[ndarray, ndarray]
- generate_concentrations(n_samples: int, method: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', alpha: ndarray | None = None, correlation_matrix: ndarray | None = None) ndarray[source]
Generate concentration matrix using specified distribution.
- Parameters:
n_samples – Number of samples to generate.
method – Concentration generation method:
    - ‘dirichlet’: Compositional data (concentrations sum to ~1).
    - ‘uniform’: Independent uniform [0, 1] values.
    - ‘lognormal’: Log-normal distributed, normalized.
    - ‘correlated’: Multivariate with specified correlations.
alpha – Dirichlet concentration parameters (only for ‘dirichlet’ method). Shape: (n_components,). Higher values = more uniform distribution.
correlation_matrix – Correlation structure for ‘correlated’ method. Shape: (n_components, n_components).
- Returns:
Concentration matrix of shape (n_samples, n_components).
- Raises:
ValueError – If method is unknown.
Example
>>> generator = SyntheticNIRSGenerator(random_state=42)
>>> C = generator.generate_concentrations(100, method='dirichlet')
>>> print(C.shape, C.sum(axis=1).mean())  # Should sum to ~1
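To make the compositional property of the ‘dirichlet’ method concrete, the sketch below draws Dirichlet samples by normalizing independent gamma variates, using only the standard library. It illustrates the distribution itself and is not the library's code; the function name `dirichlet_concentrations` and the choice `alpha = 2.0` for every component are assumptions.

```python
import random


def dirichlet_concentrations(n_samples, alpha, seed=None):
    """Draw compositional rows: each row is non-negative and sums to 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # A Dirichlet draw is a vector of Gamma(alpha_k, 1) variates
        # divided by their sum.
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        rows.append([d / total for d in draws])
    return rows


C = dirichlet_concentrations(100, alpha=[2.0] * 5, seed=42)
row_sums = [sum(row) for row in C]
print(len(C), len(C[0]), min(row_sums), max(row_sums))
```

Larger alpha values concentrate the rows around equal shares, which matches the note that higher values give a more uniform distribution.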
- class nirs4all.data.synthetic.TargetConfig(distribution: Literal['dirichlet', 'uniform', 'lognormal', 'correlated'] = 'dirichlet', range: Tuple[float, float] | None = None, n_targets: int | None = None, component_indices: List[int] | None = None, transform: Literal['log', 'sqrt'] | None = None)[source]
Bases: object
Configuration for target variable generation.
- distribution
Target value distribution method. Options: ‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’.
- Type:
Literal[‘dirichlet’, ‘uniform’, ‘lognormal’, ‘correlated’]
- transform
Optional transformation to apply (‘log’, ‘sqrt’, None).
- Type:
Literal[‘log’, ‘sqrt’] | None
- class nirs4all.data.synthetic.TargetGenerator(random_state: int | None = None)[source]
Bases: object
Generate target variables for synthetic NIRS datasets.
This class creates both regression targets (continuous values correlated with component concentrations) and classification targets (discrete labels with controllable class separation).
- rng
NumPy random generator for reproducibility.
- Parameters:
random_state – Random seed for reproducibility.
Example
>>> generator = TargetGenerator(random_state=42)
>>>
>>> # Generate concentrations first (from SyntheticNIRSGenerator)
>>> C = np.random.rand(100, 5)  # 5 components
>>>
>>> # Regression targets scaled to percentage
>>> y = generator.regression(
...     n_samples=100,
...     concentrations=C,
...     component=0,  # Use first component
...     range=(0, 100)
... )
>>>
>>> # Multi-class classification
>>> y = generator.classification(
...     n_samples=100,
...     concentrations=C,
...     n_classes=4,
...     separation=2.0
... )
- classification(n_samples: int, concentrations: ndarray | None = None, *, n_classes: int = 2, class_weights: List[float] | None = None, separation: float = 1.5, separation_method: Literal['component', 'threshold', 'cluster'] = 'component', class_names: List[str] | None = None, return_proba: bool = False) ndarray | Tuple[ndarray, ndarray][source]
Generate classification target labels with controllable class separation.
The separation parameter controls how distinguishable classes are in feature space. Higher values create more separable classes.
- Parameters:
n_samples – Number of samples.
concentrations – Component concentration matrix.
n_classes – Number of classes to generate.
class_weights – Class proportions (should sum to 1.0). If None, uses balanced classes.
separation – Class separation factor:
    - 0.5-1.0: Overlapping classes (challenging)
    - 1.5-2.0: Moderate separation (realistic)
    - 2.5+: Well-separated classes (easy)
separation_method – How to create class differences:
    - “component”: Each class has distinct component profiles
    - “threshold”: Classes based on concentration thresholds
    - “cluster”: K-means-like cluster assignment
class_names – Optional string labels for classes.
return_proba – If True, also return class probabilities.
- Returns:
If return_proba=False: Integer class labels (n_samples,). If return_proba=True: Tuple of (labels, probabilities).
- Return type:
ndarray | Tuple[ndarray, ndarray]
Example
>>> # Binary classification with balanced classes
>>> y = generator.classification(100, C, n_classes=2)
>>>
>>> # 3-class with imbalanced weights
>>> y = generator.classification(
...     100, C,
...     n_classes=3,
...     class_weights=[0.5, 0.3, 0.2],
...     separation=2.0
... )
- regression(n_samples: int, concentrations: ndarray | None = None, *, distribution: Literal['uniform', 'normal', 'lognormal', 'bimodal'] = 'uniform', range: Tuple[float, float] | None = None, component: int | str | List[int] | None = None, component_names: List[str] | None = None, correlation: float = 0.9, noise: float = 0.1, transform: Literal['log', 'sqrt'] | None = None) ndarray[source]
Generate regression target values.
- Parameters:
n_samples – Number of samples.
concentrations – Component concentration matrix (n_samples, n_components). If None, generates random base values.
distribution – Target value distribution.
range – (min, max) for scaling targets.
component – Which component(s) to use as target:
    - None: Weighted combination of all components
    - int: Use component at that index
    - str: Use component with that name (requires component_names)
    - List[int]: Multi-output using specified component indices
component_names – Names of components (for string component selection).
correlation – Correlation between concentrations and targets (0-1).
noise – Noise level to add.
transform – Optional transformation (‘log’, ‘sqrt’).
- Returns:
Target values array. Shape (n_samples,) for single target, or (n_samples, n_targets) for multi-output.
Example
>>> y = generator.regression(
...     100, C,
...     distribution="lognormal",
...     range=(5, 50),
...     component="protein",
...     component_names=["water", "protein", "lipid"]
... )
- class nirs4all.data.synthetic.TemperatureConfig(reference_temperature: float = 25.0, sample_temperature: float = 25.0, temperature_variation: float = 0.0, enable_shift: bool = True, enable_intensity: bool = True, enable_broadening: bool = True, region_specific: bool = True, custom_regions: Dict[SpectralRegion, TemperatureEffectParams] | None = None)[source]
Bases: object
Configuration for temperature effect simulation.
- custom_regions
Optional custom region parameters to override defaults.
- custom_regions: Dict[SpectralRegion, TemperatureEffectParams] | None = None
- class nirs4all.data.synthetic.TemperatureEffectParams(wavelength_range: Tuple[float, float], shift_per_degree: float, intensity_change_per_degree: float, broadening_per_degree: float, reference: str = '')[source]
Bases: object
Temperature effect parameters for a spectral region.
Based on literature values for temperature-induced spectral changes in NIR.
- class nirs4all.data.synthetic.TemperatureEffectSimulator(config: TemperatureConfig | None = None, random_state: int | None = None)[source]
Bases: object
Simulate temperature-dependent spectral changes.
Temperature affects NIR spectra through:
    - Peak position shifts (especially hydrogen-bonded groups)
    - Intensity changes (hydrogen bond population changes)
    - Band broadening (thermal motion)
The effects are strongest for O-H and N-H groups due to their involvement in hydrogen bonding. C-H groups show smaller effects.
- config
Temperature effect configuration.
- rng
Random number generator for reproducibility.
Example
>>> config = TemperatureConfig(
...     sample_temperature=40.0,
...     reference_temperature=25.0
... )
>>> simulator = TemperatureEffectSimulator(config, random_state=42)
>>> spectra_out = simulator.apply(spectra, wavelengths)
- apply(spectra: ndarray, wavelengths: ndarray, sample_temperatures: ndarray | None = None) ndarray[source]
Apply temperature effects to spectra.
- Parameters:
spectra – Input spectra array (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
sample_temperatures – Optional per-sample temperatures. If None, uses config.sample_temperature with variation.
- Returns:
Modified spectra with temperature effects applied.
- class nirs4all.data.synthetic.TransflectanceConfig(path_length_mm: float = 0.5, reflector_type: str = 'gold', reflector_reflectance: float = 0.95, spacer_thickness_mm: float = 0.5)[source]
Bases: object
Configuration for transflectance measurement mode.
Light passes through sample, reflects off a mirror/diffuser, and passes through sample again (double-pass).
- class nirs4all.data.synthetic.TransmittanceConfig(path_length_mm: float = 1.0, path_length_variation: float = 0.02, cuvette_material: str = 'quartz', reference_type: str = 'air')[source]
Bases: object
Configuration for transmittance measurement mode.
Implements Beer-Lambert law: A = εcl where A is absorbance, ε is molar absorptivity, c is concentration, and l is path length.
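As a worked instance of A = εcl, the sketch below computes an absorbance and the corresponding transmittance T = 10^(-A). The numeric values and the helper names `absorbance` and `transmittance` are illustrative assumptions; the only link to the configuration is the conversion of `path_length_mm` to centimetres, the unit conventionally paired with molar absorptivity.

```python
def absorbance(epsilon, concentration, path_length_cm):
    """Beer-Lambert law: A = epsilon * c * l."""
    return epsilon * concentration * path_length_cm


def transmittance(a):
    """Fraction of light transmitted for a decadic absorbance a."""
    return 10.0 ** (-a)


path_length_mm = 1.0  # default of TransmittanceConfig
# Assumed values: epsilon in L·mol⁻¹·cm⁻¹, concentration in mol/L.
A = absorbance(epsilon=50.0, concentration=0.02, path_length_cm=path_length_mm / 10.0)
T = transmittance(A)
print(f"A = {A:.3f}, T = {T:.3f}")
```

Doubling the path length doubles A, so the transmitted fraction is squared.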
- exception nirs4all.data.synthetic.ValidationError[source]
Bases: Exception
Exception raised when synthetic data validation fails.
- nirs4all.data.synthetic.apply_emsc_distortion(spectra: ndarray, wavelengths: ndarray, multiplicative_std: float = 0.15, additive_std: float = 0.05, random_state: int | None = None) ndarray[source]
Apply EMSC-style scatter distortions with simple API.
- Parameters:
spectra – Input spectra (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
multiplicative_std – Std dev of multiplicative scatter.
additive_std – Std dev of additive scatter.
random_state – Random seed.
- Returns:
Spectra with EMSC-style distortions applied.
Example
>>> # Add realistic scatter distortions
>>> spectra_scattered = apply_emsc_distortion(spectra, wavelengths)
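The two standard-deviation parameters suggest a per-sample gain and offset. The sketch below shows that simplest reading, x' = a + b·x with b drawn around 1 and a drawn around 0. It is an assumption about the general idea only: the library function also takes `wavelengths` and may add wavelength-dependent terms that are not modelled here, and `emsc_like_distortion` is a hypothetical name.

```python
import random


def emsc_like_distortion(spectra, multiplicative_std=0.15, additive_std=0.05, seed=None):
    """Apply one random gain and one random offset to each spectrum."""
    rng = random.Random(seed)
    distorted = []
    for spectrum in spectra:
        gain = rng.gauss(1.0, multiplicative_std)  # multiplicative scatter
        offset = rng.gauss(0.0, additive_std)      # additive scatter
        distorted.append([offset + gain * v for v in spectrum])
    return distorted


spectra = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
scattered = emsc_like_distortion(spectra, seed=0)
print(len(scattered), len(scattered[0]))
```

Setting both standard deviations to zero returns the input unchanged, which is a convenient sanity check for this kind of distortion.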
- nirs4all.data.synthetic.apply_hydrogen_bonding_shift(wavenumber_cm: float, h_bond_strength: float = 0.5, is_donor: bool = True) float[source]
Apply hydrogen bonding shift to a wavenumber.
Hydrogen bonding weakens X-H bonds, shifting stretching frequencies to lower wavenumbers (red shift). The shift magnitude depends on the hydrogen bond strength.
Typical shifts for O-H:
    - Free O-H: ~3650 cm⁻¹
    - Weak H-bond: ~3500 cm⁻¹
    - Strong H-bond: ~3200 cm⁻¹
- Parameters:
wavenumber_cm – Original wavenumber in cm⁻¹.
h_bond_strength – Hydrogen bond strength (0 = none, 1 = very strong).
is_donor – Whether the group is a hydrogen bond donor.
- Returns:
Shifted wavenumber in cm⁻¹.
Example
>>> apply_hydrogen_bonding_shift(3650, h_bond_strength=0.5)
3467.5  # Red-shifted by hydrogen bonding
- nirs4all.data.synthetic.apply_moisture_effects(spectra: ndarray, wavelengths: ndarray, water_activity: float = 0.5, moisture_content: float = 0.1, random_state: int | None = None) ndarray[source]
Apply moisture effects to spectra with simple API.
- Parameters:
spectra – Input spectra (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
water_activity – Water activity (0-1).
moisture_content – Moisture content fraction.
random_state – Random seed.
- Returns:
Spectra with moisture effects applied.
Example
>>> # Simulate wet sample (high water activity)
>>> spectra_wet = apply_moisture_effects(spectra, wavelengths, water_activity=0.9)
- nirs4all.data.synthetic.apply_particle_size_effects(spectra: ndarray, wavelengths: ndarray, mean_particle_size_um: float = 50.0, size_variation: float = 15.0, random_state: int | None = None) ndarray[source]
Apply particle size effects to spectra with simple API.
- Parameters:
spectra – Input spectra (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
mean_particle_size_um – Mean particle size in micrometers.
size_variation – Standard deviation of particle size.
random_state – Random seed.
- Returns:
Spectra with particle size effects applied.
Example
>>> # Simulate fine powder sample
>>> spectra_fine = apply_particle_size_effects(
...     spectra, wavelengths,
...     mean_particle_size_um=20.0
... )
- nirs4all.data.synthetic.apply_temperature_effects(spectra: ndarray, wavelengths: ndarray, temperature: float = 25.0, reference_temperature: float = 25.0, random_state: int | None = None) ndarray[source]
Apply temperature effects to spectra with simple API.
- Parameters:
spectra – Input spectra (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
temperature – Sample temperature in °C.
reference_temperature – Reference temperature in °C.
random_state – Random seed.
- Returns:
Spectra with temperature effects applied.
Example
>>> # Simulate spectra measured at 40°C
>>> spectra_40c = apply_temperature_effects(spectra, wavelengths, temperature=40.0)
- nirs4all.data.synthetic.benchmark_backends(n_samples: int = 1000, n_wavelengths: int = 700, n_components: int = 5, n_trials: int = 5) Dict[str, float][source]
Benchmark available backends.
- Parameters:
n_samples – Number of samples to generate.
n_wavelengths – Number of wavelengths.
n_components – Number of components.
n_trials – Number of timing trials.
- Returns:
Dictionary of backend name to mean time in seconds.
Example
>>> results = benchmark_backends()
>>> for backend, time in results.items():
...     print(f"{backend}: {time:.4f}s")
- nirs4all.data.synthetic.calculate_combination_band(mode1: str | float | List[str | float], mode2: str | float | None = None, band_type: str = 'sum', coupling_factor: float = 1.0) CombinationBandResult[source]
Calculate combination band position.
Combination bands arise from simultaneous excitation of two vibrational modes:
    - Sum bands: ν̃_comb = ν̃₁ + ν̃₂ (most common in NIR)
    - Difference bands: ν̃_comb = |ν̃₁ - ν̃₂| (less common)
- Parameters:
mode1 – First vibration - either a vibration type string (e.g., ‘O-H_stretch’), a numeric wavenumber in cm⁻¹, or a list of two modes.
mode2 – Second vibration (same format as mode1). If mode1 is a list, this should be None.
band_type – ‘sum’ or ‘difference’.
coupling_factor – Mechanical coupling between modes (0-1, affects amplitude).
- Returns:
CombinationBandResult with position and intensity information.
Example
>>> # O-H stretch + O-H bend combination (water) using strings
>>> result = calculate_combination_band("O-H_stretch", "O-H_bend")
>>> print(f"{result.wavelength_nm:.0f} nm")
1984 nm
>>> # Using a list of modes
>>> result = calculate_combination_band(["O-H_stretch", "O-H_bend"])
>>> # Using numeric values
>>> result = calculate_combination_band(3400, 1640)
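The position arithmetic behind the numeric call can be checked by hand: add (or subtract) the wavenumbers, then convert to wavelength with λ = 10⁷ / ν̃. The sketch below covers only that position calculation; it ignores `coupling_factor` and intensity, and the helper names are hypothetical rather than part of the package.

```python
def wavenumber_to_nm(wavenumber_cm):
    """Convert a wavenumber in cm⁻¹ to a wavelength in nm."""
    return 1e7 / wavenumber_cm


def combination_band_nm(nu1_cm, nu2_cm, band_type="sum"):
    """Position of a sum or difference combination band, in nm."""
    if band_type == "sum":
        nu_comb = nu1_cm + nu2_cm
    elif band_type == "difference":
        nu_comb = abs(nu1_cm - nu2_cm)
    else:
        raise ValueError(f"Unknown band_type: {band_type!r}")
    return wavenumber_to_nm(nu_comb)


# 3400 + 1640 = 5040 cm⁻¹, i.e. about 1984 nm.
print(f"{combination_band_nm(3400, 1640):.0f} nm")
```

The difference band of the same two modes falls at 1760 cm⁻¹ (about 5682 nm), well outside the NIR range, which is consistent with sum bands being the common case.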
- nirs4all.data.synthetic.calculate_overtone_position(vibration_type_or_frequency: str | float, overtone_order: int, anharmonicity: float | None = None) OvertoneResult[source]
Calculate overtone band position with anharmonicity correction.
For a harmonic oscillator, overtones would be exactly at n × ν̃₀. However, real molecular vibrations are anharmonic, causing overtones to appear at slightly lower wavenumbers than the harmonic prediction.
The anharmonic wavenumber is: ν̃ₙ = n × ν̃₀ × (1 - n × χ) where χ is the anharmonicity constant (typically 0.01-0.03).
- Parameters:
vibration_type_or_frequency – Either a vibration type string (e.g., ‘O-H_stretch’) from FUNDAMENTAL_VIBRATIONS, or a numeric fundamental frequency in cm⁻¹.
overtone_order – Order (1 = fundamental, 2 = 1st overtone, 3 = 2nd overtone).
anharmonicity – Anharmonicity constant χ. If None and vibration_type is a string, uses the default for that vibration type. Otherwise defaults to 0.02.
- Returns:
OvertoneResult with position and intensity information.
Example
>>> result = calculate_overtone_position("O-H_stretch", 2)  # O-H 1st overtone
>>> print(f"{result.wavelength_nm:.0f} nm")
1442 nm  # (with anharmonicity)
>>> result = calculate_overtone_position(3400, 2)  # Numeric frequency
>>> print(f"{result.wavelength_nm:.0f} nm")
1503 nm
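The anharmonic formula ν̃ₙ = n × ν̃₀ × (1 - n × χ) is easy to evaluate directly. In the sketch below the fundamental of 3650 cm⁻¹ and χ = 0.025 are illustrative assumptions, not values read from FUNDAMENTAL_VIBRATIONS, and the function names are hypothetical.

```python
def overtone_wavenumber(nu0_cm, order, chi=0.02):
    """Anharmonic overtone position in cm⁻¹: n * nu0 * (1 - n * chi)."""
    return order * nu0_cm * (1.0 - order * chi)


def overtone_wavelength_nm(nu0_cm, order, chi=0.02):
    """Same position expressed as a wavelength in nm."""
    return 1e7 / overtone_wavenumber(nu0_cm, order, chi)


# Assumed inputs: free O-H stretch near 3650 cm⁻¹, chi = 0.025.
anharmonic = overtone_wavelength_nm(3650.0, 2, chi=0.025)
harmonic = overtone_wavelength_nm(3650.0, 2, chi=0.0)
print(f"{anharmonic:.0f} nm vs harmonic {harmonic:.0f} nm")  # 1442 nm vs harmonic 1370 nm
```

The anharmonic band sits at a lower wavenumber than 2 × ν̃₀, hence at a longer wavelength than the harmonic prediction.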
- nirs4all.data.synthetic.classify_wavelength_zone(wavelength_nm: float) str | None[source]
Classify a wavelength into its corresponding NIR zone.
- Parameters:
wavelength_nm – Wavelength in nm.
- Returns:
Zone name string, or None if outside defined zones.
Example
>>> classify_wavelength_zone(1450)
'1st_overtones_OH_NH'
>>> classify_wavelength_zone(2300)
'combination_CH'
- nirs4all.data.synthetic.compare_datasets(X_synthetic: ndarray, X_real: ndarray, wavelengths: ndarray | None = None) Dict[str, Any][source]
Quick comparison between synthetic and real datasets.
- Parameters:
X_synthetic – Synthetic spectra.
X_real – Real spectra.
wavelengths – Wavelength grid.
- Returns:
Dictionary with comparison metrics.
Example
>>> metrics = compare_datasets(X_synth, X_real)
>>> print(f"Similarity: {metrics['overall_score']:.1f}/100")
- nirs4all.data.synthetic.compute_adversarial_validation_auc(real_spectra: ndarray, synthetic_spectra: ndarray, cv_folds: int = 5, random_state: int | None = None) Tuple[float, float][source]
Train classifier to distinguish real vs. synthetic spectra.
A lower AUC indicates that synthetic data is more realistic (harder to distinguish from real data).
- Parameters:
real_spectra – Real spectra array (n_real, n_wavelengths).
synthetic_spectra – Synthetic spectra array (n_synthetic, n_wavelengths).
cv_folds – Number of cross-validation folds.
random_state – Random state for reproducibility.
- Returns:
Tuple of (mean_auc, std_auc) across folds.
- Target:
AUC < 0.6: Excellent (nearly indistinguishable)
AUC < 0.7: Good (hard to distinguish)
AUC < 0.8: Acceptable (some differences)
AUC >= 0.8: Poor (clearly distinguishable)
Example
>>> real = np.random.randn(100, 500)
>>> synthetic = np.random.randn(100, 500) + 0.1
>>> mean_auc, std_auc = compute_adversarial_validation_auc(real, synthetic)
>>> print(f"AUC: {mean_auc:.3f} ± {std_auc:.3f}")
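The target bands listed above translate directly into a small lookup, which can be handy when reporting results. The helper below is a hypothetical convenience, not a function of the package; it only encodes the four thresholds stated in the Target section.

```python
def interpret_adversarial_auc(mean_auc):
    """Map an adversarial-validation AUC to the documented quality band."""
    if mean_auc < 0.6:
        return "excellent"   # nearly indistinguishable
    if mean_auc < 0.7:
        return "good"        # hard to distinguish
    if mean_auc < 0.8:
        return "acceptable"  # some differences
    return "poor"            # clearly distinguishable


for auc in (0.55, 0.65, 0.75, 0.92):
    print(auc, interpret_adversarial_auc(auc))
```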
- nirs4all.data.synthetic.compute_baseline_curvature(spectra: ndarray, polynomial_degree: int = 3) ndarray[source]
Compute baseline curvature by fitting polynomials and measuring residuals.
- Parameters:
spectra – Array of shape (n_samples, n_wavelengths).
polynomial_degree – Degree of polynomial to fit.
- Returns:
Array of residual standard deviations for each spectrum.
Example
>>> X = np.random.randn(100, 500)
>>> curvatures = compute_baseline_curvature(X)
- nirs4all.data.synthetic.compute_correlation_length(spectra: ndarray, max_lag: int = 50) ndarray[source]
Compute correlation lengths for a set of spectra.
The correlation length is the lag at which the autocorrelation function decays to 1/e of its initial value.
- Parameters:
spectra – Array of shape (n_samples, n_wavelengths).
max_lag – Maximum lag to compute autocorrelation for.
- Returns:
Array of correlation lengths for each spectrum.
Example
>>> X = np.random.randn(100, 500)
>>> lengths = compute_correlation_length(X)
>>> print(f"Mean correlation length: {lengths.mean():.2f}")
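The 1/e definition can be illustrated for a single spectrum with the standard library alone. The sketch below is one plausible reading of that definition, not the package's implementation: it uses the biased autocorrelation estimate normalized by the lag-0 value, reports the first integer lag below 1/e, and returns `max_lag` when the threshold is never crossed. Those choices, and the name `correlation_length`, are assumptions.

```python
import math


def correlation_length(spectrum, max_lag=50):
    """First lag at which the normalized autocorrelation falls below 1/e."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    centered = [v - mean for v in spectrum]
    variance = sum(v * v for v in centered)  # lag-0 autocovariance (unnormalized)
    if variance == 0.0:
        return max_lag  # flat spectrum: no decay to measure, report the cap
    threshold = 1.0 / math.e
    for lag in range(1, min(max_lag, n - 1) + 1):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        if acf < threshold:
            return lag
    return max_lag


smooth = [math.sin(2.0 * math.pi * i / 100.0) for i in range(500)]  # slow oscillation
rough = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]           # alternating signal
print(correlation_length(smooth), correlation_length(rough))
```

Smooth, broad-banded spectra yield long correlation lengths, while noise-dominated spectra decorrelate within a point or two.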
- nirs4all.data.synthetic.compute_derivative_statistics(spectra: ndarray, wavelengths: ndarray | None = None, order: int = 1) Tuple[ndarray, ndarray][source]
Compute derivative statistics for spectra.
- Parameters:
spectra – Array of shape (n_samples, n_wavelengths).
wavelengths – Wavelength array for proper derivative scaling.
order – Derivative order (1 or 2).
- Returns:
Tuple of (mean_derivatives, std_derivatives) per sample.
Example
>>> X = np.random.randn(100, 500)
>>> means, stds = compute_derivative_statistics(X, order=1)
- nirs4all.data.synthetic.compute_distribution_overlap(dist1: ndarray, dist2: ndarray, n_bins: int = 50) float[source]
Compute overlap between two distributions using histogram intersection.
- Parameters:
dist1 – First distribution samples.
dist2 – Second distribution samples.
n_bins – Number of histogram bins.
- Returns:
Overlap coefficient in [0, 1], where 1 means identical distributions.
Example
>>> x1 = np.random.randn(1000)
>>> x2 = np.random.randn(1000) + 0.5
>>> overlap = compute_distribution_overlap(x1, x2)
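Histogram intersection is simple enough to write out. The sketch below bins both samples on a shared grid, normalizes the counts to fractions, and sums the bin-wise minima. The shared-range binning and the name `histogram_overlap` are assumptions; the package may choose bins differently.

```python
import random


def histogram_overlap(sample1, sample2, n_bins=50):
    """Histogram intersection in [0, 1]; 1 means identical binned distributions."""
    lo = min(min(sample1), min(sample2))
    hi = max(max(sample1), max(sample2))
    if hi == lo:
        return 1.0  # every value in both samples is the same
    width = (hi - lo) / n_bins

    def normalized_hist(sample):
        counts = [0] * n_bins
        for v in sample:
            # Clamp so that v == hi lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    p = normalized_hist(sample1)
    q = normalized_hist(sample2)
    return sum(min(a, b) for a, b in zip(p, q))


rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]
x2 = [rng.gauss(0.5, 1.0) for _ in range(1000)]
print(f"{histogram_overlap(x1, x2):.2f}")
```

Because the result depends on `n_bins` and on sample size, it is best compared across datasets computed with the same settings.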
- nirs4all.data.synthetic.compute_peak_density(spectra: ndarray, wavelengths: ndarray, window_nm: float = 100.0, prominence_threshold: float = 0.01) ndarray[source]
Compute peak density (peaks per window_nm, 100 nm by default) for spectra.
- Parameters:
spectra – Array of shape (n_samples, n_wavelengths).
wavelengths – Wavelength array in nm.
window_nm – Window size for density calculation (default 100 nm).
prominence_threshold – Minimum peak prominence as fraction of spectrum range.
- Returns:
Array of peak densities (peaks per window_nm) for each spectrum.
Example
>>> X = np.random.randn(100, 500)
>>> wl = np.linspace(1000, 2500, 500)
>>> densities = compute_peak_density(X, wl)
- nirs4all.data.synthetic.compute_snr(spectra: ndarray, noise_region_fraction: float = 0.1) ndarray[source]
Estimate signal-to-noise ratio for spectra.
Uses the standard deviation of the highest-frequency components (via high-pass filtering) as noise estimate.
- Parameters:
spectra – Array of shape (n_samples, n_wavelengths).
noise_region_fraction – Fraction of spectrum to use for noise estimation.
- Returns:
Array of SNR estimates for each spectrum.
Example
>>> X = np.random.randn(100, 500) + np.sin(np.linspace(0, 10, 500))
>>> snr = compute_snr(X)
- nirs4all.data.synthetic.compute_spectral_properties(X: ndarray, wavelengths: ndarray | None = None, name: str = 'dataset', n_pca_components: int = 20) SpectralProperties[source]
Compute comprehensive spectral properties of a dataset.
Analyzes a matrix of spectra to extract statistical and spectral properties useful for fitting and comparison. Includes Phase 1-4 enhanced properties for instrument, mode, domain, and effect inference.
- Parameters:
X – Spectra matrix (n_samples, n_wavelengths).
wavelengths – Optional wavelength grid.
name – Dataset identifier.
n_pca_components – Maximum PCA components to compute.
- Returns:
SpectralProperties with computed metrics.
Example
>>> props = compute_spectral_properties(X_real, wavelengths)
>>> print(f"Mean slope: {props.mean_slope:.4f}")
>>> print(f"Inferred resolution: {props.effective_resolution:.1f} nm")
- nirs4all.data.synthetic.compute_spectral_realism_scorecard(real_spectra: ndarray, synthetic_spectra: ndarray, wavelengths: ndarray | None = None, thresholds: Dict[str, float] | None = None, include_adversarial: bool = True, random_state: int | None = None) SpectralRealismScore[source]
Compute comprehensive spectral realism scorecard.
This function computes multiple quantitative metrics to assess whether synthetic spectra are realistic compared to real data.
- Parameters:
real_spectra – Real spectra array (n_real, n_wavelengths).
synthetic_spectra – Synthetic spectra array (n_synthetic, n_wavelengths).
wavelengths – Wavelength array in nm. If None, uses indices.
thresholds – Custom thresholds for metrics. Defaults:
    - correlation_length_overlap: 0.7
    - derivative_ks_pvalue: 0.05
    - peak_density_ratio_min: 0.5
    - peak_density_ratio_max: 2.0
    - baseline_curvature_overlap: 0.6
    - snr_order_of_magnitude: 1.0 (log10 difference)
    - adversarial_auc: 0.7
include_adversarial – Whether to compute adversarial AUC (slower).
random_state – Random state for adversarial validation.
- Returns:
SpectralRealismScore with all metrics and pass/fail status.
Example
>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X_synth, _, _ = gen.generate(200)
>>> # X_real would be loaded from real data
>>> X_real = np.random.randn(200, X_synth.shape[1])  # Placeholder
>>> score = compute_spectral_realism_scorecard(X_real, X_synth, gen.wavelengths)
>>> print(score.summary())
- nirs4all.data.synthetic.convert_bandwidth_to_wavelength(bandwidth_cm: float, center_nm: float) float[source]
Convert bandwidth from wavenumber to wavelength units.
Since the relationship between wavenumber and wavelength is non-linear, the bandwidth conversion depends on the center wavelength/wavenumber.
The approximation is: Δλ ≈ Δν̃ × λ² / 10^7
This is derived from the differential: dλ = -dν̃ × (10^7 / ν̃²) = -dν̃ × λ² / 10^7 (taking absolute value for bandwidth)
- Parameters:
bandwidth_cm – Bandwidth in cm⁻¹ (e.g., FWHM).
center_nm – Center wavelength in nm.
- Returns:
Bandwidth in nm.
Example
>>> convert_bandwidth_to_wavelength(100, 1450)  # 100 cm⁻¹ at 1450 nm
21.025  # approximately
>>> convert_bandwidth_to_wavelength(100, 2200)  # Same bandwidth at 2200 nm
48.4  # Broader in nm due to non-linear relationship
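The approximation Δλ ≈ Δν̃ × λ² / 10⁷ can be compared against an exact conversion of the two band edges. The sketch below does both; the helper names are hypothetical, and the exact version assumes the band is symmetric in wavenumber around the center.

```python
def bandwidth_cm_to_nm(bandwidth_cm, center_nm):
    """First-order conversion: delta_lambda ≈ delta_nu * lambda² / 1e7."""
    return bandwidth_cm * center_nm ** 2 / 1e7


def bandwidth_cm_to_nm_exact(bandwidth_cm, center_nm):
    """Convert both band edges (center ± half width in cm⁻¹) and subtract."""
    center_cm = 1e7 / center_nm
    low_edge_nm = 1e7 / (center_cm + bandwidth_cm / 2.0)
    high_edge_nm = 1e7 / (center_cm - bandwidth_cm / 2.0)
    return high_edge_nm - low_edge_nm


for center in (1450.0, 2200.0):
    approx = bandwidth_cm_to_nm(100.0, center)
    exact = bandwidth_cm_to_nm_exact(100.0, center)
    print(f"{center:.0f} nm: approx {approx:.3f} nm, exact {exact:.3f} nm")
```

For a 100 cm⁻¹ band the two agree to well under 0.1 %, so the first-order form is adequate for typical NIR bandwidths.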
- nirs4all.data.synthetic.create_atr_simulator(crystal_material: str = 'diamond', incidence_angle: float = 45.0, n_reflections: int = 1, random_state: int | None = None) MeasurementModeSimulator[source]
Create an ATR mode simulator.
- Parameters:
crystal_material – ATR crystal material.
incidence_angle – Incidence angle in degrees.
n_reflections – Number of internal reflections.
random_state – Random seed.
- Returns:
Configured MeasurementModeSimulator.
- nirs4all.data.synthetic.create_domain_aware_library(domain_name: str, n_samples: int = 100, random_state: int | None = None) Tuple[List[str], ndarray][source]
Create component selection and concentrations based on domain priors.
This function samples components and their concentrations according to domain-specific distributions.
- Parameters:
domain_name – Name of the domain.
n_samples – Number of samples to generate concentrations for.
random_state – Random seed for reproducibility.
- Returns:
Tuple of (component_names, concentration_matrix).
Example
>>> components, concentrations = create_domain_aware_library(
...     "food_dairy",
...     n_samples=50,
...     random_state=42
... )
>>> print(components)
['water', 'lactose', 'casein', 'lipid']
>>> print(concentrations.shape)
(50, 4)
- nirs4all.data.synthetic.create_reflectance_simulator(geometry: str = 'integrating_sphere', particle_size_um: float = 50.0, random_state: int | None = None) MeasurementModeSimulator[source]
Create a diffuse reflectance mode simulator.
- Parameters:
geometry – Measurement geometry.
particle_size_um – Mean particle size.
random_state – Random seed.
- Returns:
Configured MeasurementModeSimulator.
- nirs4all.data.synthetic.create_synthetic_matching_benchmark(benchmark_name: str, n_samples: int | None = None, random_state: int | None = None) Tuple[ndarray, ndarray, ndarray][source]
Create synthetic data matching benchmark dataset properties.
- Parameters:
benchmark_name – Name of benchmark dataset to match.
n_samples – Number of samples (uses benchmark size if None).
random_state – Random state for reproducibility.
- Returns:
Tuple of (spectra, concentrations, component_spectra).
Example
>>> X, C, E = create_synthetic_matching_benchmark("corn", random_state=42)
>>> print(X.shape)
- nirs4all.data.synthetic.create_transmittance_simulator(path_length_mm: float = 1.0, random_state: int | None = None) MeasurementModeSimulator[source]
Create a transmittance mode simulator.
- Parameters:
path_length_mm – Optical path length in mm.
random_state – Random seed.
- Returns:
Configured MeasurementModeSimulator.
- nirs4all.data.synthetic.detect_best_backend() AcceleratorBackend[source]
Detect the best available acceleration backend.
- Returns:
AcceleratorBackend enum indicating best available option.
Example
>>> backend = detect_best_backend()
>>> print(f"Using backend: {backend}")
- nirs4all.data.synthetic.export_to_csv(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None) Path[source]
Quick function to export synthetic data to a single CSV file.
- Parameters:
path – Output file path.
X – Feature matrix.
y – Target values.
wavelengths – Optional wavelength values.
- Returns:
Path to created file.
Example
>>> path = export_to_csv("data.csv", X, y)
- nirs4all.data.synthetic.export_to_folder(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, wavelengths: ndarray | None = None, format: Literal['standard', 'single', 'fragmented'] = 'standard', random_state: int | None = None) Path[source]
Quick function to export synthetic data to a folder.
Convenience function for simple export use cases.
- Parameters:
path – Output folder path.
X – Feature matrix.
y – Target values.
train_ratio – Train/test split ratio.
wavelengths – Optional wavelength values.
format – Export format.
random_state – Random seed.
- Returns:
Path to created folder.
Example
>>> path = export_to_folder(
...     "data/synthetic",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=wavelengths
... )
- nirs4all.data.synthetic.fit_to_real_data(X: np.ndarray | 'SpectroDataset', wavelengths: np.ndarray | None = None, name: str = 'source') FittedParameters[source]
Quick function to fit parameters to real data.
Convenience function for simple fitting use cases.
- Parameters:
X – Real spectra or SpectroDataset.
wavelengths – Wavelength grid.
name – Dataset name.
- Returns:
FittedParameters object.
Example
>>> params = fit_to_real_data(X_real, wavelengths)
>>> generator = SyntheticNIRSGenerator(**params.to_generator_kwargs())
- nirs4all.data.synthetic.generate_classification_targets(n_samples: int, concentrations: ndarray | None = None, *, random_state: int | None = None, n_classes: int = 2, class_weights: List[float] | None = None, separation: float = 1.5) ndarray[source]
Convenience function for generating classification targets.
- Parameters:
n_samples – Number of samples.
concentrations – Component concentrations (optional).
random_state – Random seed.
n_classes – Number of classes.
class_weights – Class proportions.
separation – Class separation factor.
- Returns:
Integer class labels array.
- nirs4all.data.synthetic.generate_multi_source(n_samples: int, sources: List[Dict[str, Any]] | None = None, *, random_state: int | None = None, target_range: Tuple[float, float] | None = None, as_dataset: bool = True, train_ratio: float = 0.8, name: str = 'multi_source_synthetic') SpectroDataset | MultiSourceResult[source]
Convenience function for generating multi-source datasets.
- Parameters:
n_samples – Number of samples.
sources – List of source configurations. If None, a single default NIR source with wavelength range (1000, 2500) is used.
random_state – Random seed.
target_range – Target value range.
as_dataset – If True, returns SpectroDataset.
train_ratio – Training set proportion.
name – Dataset name.
- Returns:
SpectroDataset or MultiSourceResult depending on as_dataset.
Example
>>> dataset = generate_multi_source(
...     n_samples=500,
...     sources=[
...         {"name": "NIR", "type": "nir", "wavelength_range": (1000, 2500)},
...         {"name": "markers", "type": "aux", "n_features": 15}
...     ],
...     random_state=42
... )
- nirs4all.data.synthetic.generate_regression_targets(n_samples: int, concentrations: ndarray | None = None, *, random_state: int | None = None, distribution: str = 'uniform', range: Tuple[float, float] | None = None) ndarray[source]
Convenience function for generating regression targets.
- Parameters:
n_samples – Number of samples.
concentrations – Component concentrations (optional).
random_state – Random seed.
distribution – Target distribution type.
range – Value range (min, max).
- Returns:
Target values array.
- nirs4all.data.synthetic.generate_sample_metadata(n_samples: int, *, random_state: int | None = None, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1) Dict[str, ndarray][source]
Convenience function to generate sample metadata.
This is a simplified interface to MetadataGenerator for common use cases.
- Parameters:
n_samples – Total number of samples to generate.
random_state – Random seed for reproducibility.
sample_id_prefix – Prefix for sample ID strings.
n_groups – Number of groups (None for no grouping).
group_names – Optional list of group names.
n_repetitions – Repetitions per biological sample.
- Returns:
Dictionary with metadata arrays.
Example
>>> metadata = generate_sample_metadata(
...     n_samples=100,
...     random_state=42,
...     n_groups=3,
...     n_repetitions=(2, 4)
... )
>>> print(metadata.keys())
- nirs4all.data.synthetic.generate_scattering_coefficients(n_samples: int, wavelengths: ndarray, baseline_scattering: float = 1.0, wavelength_exponent: float = 1.0, particle_sizes: ndarray | None = None, random_state: int | None = None) ndarray[source]
Generate scattering coefficients with a simple API.
- Parameters:
n_samples – Number of samples.
wavelengths – Wavelength array (nm).
baseline_scattering – Base scattering coefficient.
wavelength_exponent – Wavelength dependence exponent.
particle_sizes – Optional particle sizes (μm).
random_state – Random seed.
- Returns:
Scattering coefficient array (n_samples, n_wavelengths).
Example
>>> S = generate_scattering_coefficients(100, wavelengths)
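The parameter names suggest a power-law wavelength dependence. The sketch below assumes the form s(λ) = baseline × (λ / λ_ref)^(−exponent) with one random positive scale per sample; this form, the reference wavelength, and the helper name `power_law_scattering` are assumptions for illustration and may differ from the library's actual model.

```python
import numpy as np


def power_law_scattering(n_samples, wavelengths, baseline=1.0, exponent=1.0,
                         reference_nm=1000.0, sample_cv=0.1, seed=None):
    """Assumed model: s(lambda) = baseline * (lambda / reference_nm) ** -exponent."""
    rng = np.random.default_rng(seed)
    wavelengths = np.asarray(wavelengths, dtype=float)
    curve = baseline * (wavelengths / reference_nm) ** (-exponent)
    # One positive scale factor per sample (log-normal keeps coefficients > 0).
    scale = rng.lognormal(mean=0.0, sigma=sample_cv, size=(n_samples, 1))
    return scale * curve  # broadcasts to (n_samples, n_wavelengths)


wavelengths = np.linspace(1000, 2500, 151)
S = power_law_scattering(5, wavelengths, exponent=1.0, seed=0)
print(S.shape)
```

With a positive exponent each row decreases monotonically with wavelength, which matches the usual expectation that scattering is stronger at shorter wavelengths.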
- nirs4all.data.synthetic.get_acceleration_speedup_estimate(n_samples: int) float[source]
Estimate speedup from GPU acceleration.
- Parameters:
n_samples – Number of samples to generate.
- Returns:
Estimated speedup factor (1.0 for CPU).
- nirs4all.data.synthetic.get_all_zones_wavelength() List[Tuple[float, float, str]][source]
Get all NIR zones converted to wavelength space.
- Returns:
List of (min_wavelength, max_wavelength, zone_name) tuples in nm.
Example
>>> zones = get_all_zones_wavelength()
>>> for min_wl, max_wl, name in zones:
...     print(f"{name}: {min_wl:.0f}-{max_wl:.0f} nm")
- nirs4all.data.synthetic.get_backend_info() Dict[str, Any][source]
Get detailed information about available backends.
- Returns:
Dictionary with backend availability and details.
- nirs4all.data.synthetic.get_benchmark_info(name: str) BenchmarkDatasetInfo[source]
Get information about a benchmark dataset.
- Parameters:
name – Dataset name.
- Returns:
BenchmarkDatasetInfo for the dataset.
- Raises:
KeyError – If dataset not found.
Example
>>> info = get_benchmark_info("corn")
>>> print(info.summary())
- nirs4all.data.synthetic.get_benchmark_spectral_properties(name: str) Dict[str, Any][source]
Get spectral properties to match when generating synthetic data.
- Parameters:
name – Benchmark dataset name.
- Returns:
Dictionary of properties suitable for synthetic generator.
Example
>>> props = get_benchmark_spectral_properties("corn")
>>> generator = SyntheticNIRSGenerator(**props)
- nirs4all.data.synthetic.get_datasets_by_domain(domain: str | BenchmarkDomain) List[str][source]
Get benchmark datasets for a specific domain.
- Parameters:
domain – Domain name or enum.
- Returns:
List of dataset names in that domain.
Example
>>> pharma_datasets = get_datasets_by_domain("pharmaceutical")
>>> print(pharma_datasets)
- nirs4all.data.synthetic.get_default_noise_config(detector_type: DetectorType) NoiseModelConfig[source]
Get default noise model configuration for a detector type.
- Parameters:
detector_type – Type of detector.
- Returns:
NoiseModelConfig with appropriate defaults.
- nirs4all.data.synthetic.get_detector_response(detector_type: DetectorType) DetectorSpectralResponse[source]
Get spectral response curve for a detector type.
- Parameters:
detector_type – Type of detector.
- Returns:
DetectorSpectralResponse object.
- nirs4all.data.synthetic.get_detector_wavelength_range(detector_type: DetectorType) Tuple[float, float][source]
Get the effective wavelength range for a detector type.
- Parameters:
detector_type – Type of detector.
- Returns:
Tuple of (min_wavelength, max_wavelength) in nm.
- nirs4all.data.synthetic.get_domain_compatible_instruments(domain: str) List[str][source]
Get list of instruments commonly used with a domain.
- Parameters:
domain – Domain name.
- Returns:
List of instrument names.
Example
>>> instruments = get_domain_compatible_instruments("tablets")
>>> print(instruments)
- nirs4all.data.synthetic.get_domain_components(domain_name: str) List[str][source]
Get typical components for a domain.
- Parameters:
domain_name – Name of the domain.
- Returns:
List of component names.
Example
>>> get_domain_components("food_dairy")
['water', 'lactose', 'casein', 'lipid', 'moisture', 'protein']
- nirs4all.data.synthetic.get_domain_config(domain_name: str) DomainConfig[source]
Get configuration for a specific domain.
- Parameters:
domain_name – Name of the domain (key in APPLICATION_DOMAINS).
- Returns:
DomainConfig for the specified domain.
- Raises:
ValueError – If domain is not found.
Example
>>> config = get_domain_config("agriculture_grain")
>>> print(config.name)
Grain and Cereals
- nirs4all.data.synthetic.get_domains_for_component(component_name: str) List[str][source]
Find domains that typically contain a specific component.
- Parameters:
component_name – Name of the component.
- Returns:
List of domain names containing this component.
Example
>>> get_domains_for_component("protein")
['agriculture_grain', 'food_meat', 'biomedical_tissue', ...]
- nirs4all.data.synthetic.get_instrument_archetype(name: str) InstrumentArchetype[source]
Get a predefined instrument archetype by name.
- Parameters:
name – Instrument archetype name.
- Returns:
InstrumentArchetype instance.
- Raises:
KeyError – If archetype name not found.
Example
>>> archetype = get_instrument_archetype("foss_xds")
>>> print(archetype.wavelength_range)
(400, 2500)
- nirs4all.data.synthetic.get_instrument_typical_modes(instrument: str) List[str][source]
Get typical measurement modes for an instrument.
- Parameters:
instrument – Instrument name.
- Returns:
List of measurement mode names.
Example
>>> modes = get_instrument_typical_modes("viavi_micronir")
>>> print(modes)
- nirs4all.data.synthetic.get_instruments_by_category() Dict[str, List[str]][source]
Get all instruments organized by category.
- Returns:
Dictionary mapping category name to list of instrument names.
- nirs4all.data.synthetic.get_nir_zone(wavelength_nm: float) str | None
Classify a wavelength into its corresponding NIR zone.
- Parameters:
wavelength_nm – Wavelength in nm.
- Returns:
Zone name string, or None if outside defined zones.
Example
>>> get_nir_zone(1450)
'1st_overtones_OH_NH'
>>> get_nir_zone(2300)
'combination_CH'
- nirs4all.data.synthetic.get_predefined_components() Dict[str, SpectralComponent][source]
Get predefined spectral components based on NIR band assignments.
Returns a dictionary of SpectralComponent objects representing common chemical compounds and functional groups found in NIR spectroscopy applications (agricultural, food, pharmaceutical, petrochemical).
Each component’s band assignments are based on published spectroscopic literature. Key characteristics:
Band centers: Wavelength positions (nm) of absorption maxima
Sigma: Gaussian width contribution (thermal/inhomogeneous broadening)
Gamma: Lorentzian width contribution (pressure/collision broadening)
Amplitude: Relative absorption intensity (normalized within component)
- Available Components:
- Water-related:
water: H₂O fundamental O-H vibrations [1, pp. 34-36]
moisture: Bound water in organic matrices [2, pp. 358-362]
- Proteins and Nitrogen:
protein: General protein (amide, N-H, C-H) [1, pp. 48-52]
nitrogen_compound: Primary/secondary amines [1, pp. 52-54]
urea: CO(NH₂)₂ bands [9, p. 1125]
amino_acid: Free amino acids [3, pp. 215-220]
- Lipids and Hydrocarbons:
lipid: Triglycerides (C-H stretching) [1, pp. 44-48]
oil: Vegetable/mineral oils [4, pp. 67-72]
saturated_fat: Saturated fatty acids [7, pp. 15-20]
unsaturated_fat: Mono/polyunsaturated fats [7, pp. 20-25]
- Carbohydrates:
starch: Amylose/amylopectin [5, pp. 155-160]
cellulose: β-1,4-glucan chains [6, pp. 295-300]
glucose: D-glucose monosaccharide [2, pp. 368-370]
fructose: D-fructose monosaccharide [2, pp. 368-370]
sucrose: Disaccharide [2, pp. 370-372]
hemicellulose: Xylan/glucomannan [6, pp. 300-303]
lignin: Aromatic polymer [6, pp. 303-305]
- Alcohols:
ethanol: C₂H₅OH [1, pp. 38-40]
methanol: CH₃OH [1, pp. 38-40]
- Organic Acids:
acetic_acid: CH₃COOH [8, pp. 8-10]
citric_acid: C₆H₈O₇ [4, pp. 78-80]
lactic_acid: CH₃CH(OH)COOH [9, pp. 1128-1130]
- Plant Pigments:
chlorophyll: Chlorophyll a/b [2, pp. 375-378]
carotenoid: β-carotene, xanthophylls [2, pp. 378-380]
- Pharmaceutical:
caffeine: C₈H₁₀N₄O₂ [9, pp. 1130-1132]
aspirin: Acetylsalicylic acid [9, pp. 1125-1128]
paracetamol: Acetaminophen [9, pp. 1132-1135]
- Petrochemical:
aromatic: Benzene derivatives [1, pp. 56-58]
alkane: Saturated hydrocarbons [7, pp. 10-15]
- Fibers:
cotton: Cotton cellulose [6, pp. 295-298]
polyester: PET fiber [1, pp. 60-62]
- Polymers and Plastics:
polyethylene: HDPE/LDPE plastic [15], [1, pp. 58-60]
polystyrene: Aromatic polymer [15], [1, pp. 56-58]
natural_rubber: cis-1,4-polyisoprene [15]
nylon: Polyamide fiber [1, pp. 60-62]
- Dairy:
lactose: Milk sugar [12], [4, pp. 85-88]
casein: Milk protein [4, pp. 85-88]
- Solvents:
acetone: Ketone solvent [1, pp. 42-44]
- Plant Phenolics:
tannins: Phenolic compounds [6], [11]
waxes: Cuticular waxes [7, pp. 15-20]
- Fermentation/Beverages:
glycerol: Polyol from fermentation [11]
malic_acid: Fruit acid [4, pp. 78-80]
tartaric_acid: Grape/wine acid [11]
- Soil Minerals:
carbonates: CaCO₃, MgCO₃ [13]
gypsum: CaSO₄·2H₂O [13]
kaolinite: Clay mineral [13]
- Agricultural:
gluten: Wheat protein [5, pp. 155-160]
dietary_fiber: Plant cell wall [6], [5]
- Returns:
Dictionary mapping component names to SpectralComponent objects.
Note
This function uses lazy initialization to avoid circular imports. The components are created once and cached for subsequent calls.
References
See module docstring for full reference list.
- nirs4all.data.synthetic.get_temperature_effect_regions() Dict[str, Tuple[float, float]][source]
Get the wavelength regions with significant temperature effects.
- Returns:
Dictionary mapping region names to (start, end) wavelength tuples.
- nirs4all.data.synthetic.get_zone_wavelength_range(zone_name: str) Tuple[float, float] | None[source]
Get the wavelength range (nm) for a named NIR zone.
- Parameters:
zone_name – Name of the NIR zone (e.g., ‘1st_overtones_OH_NH’).
- Returns:
Tuple of (min_wavelength, max_wavelength) in nm, or None if not found.
Example
>>> get_zone_wavelength_range('1st_overtones_CH')
(1600.0, 1818.18...)
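The output above is what a zone spanning 5500 to 6250 cm⁻¹ gives when converted to wavelength. Those wavenumber bounds are inferred from the documented output, not read from the library, and `zone_cm_to_nm` is a hypothetical helper. The sketch shows why the bounds swap places during conversion.

```python
def zone_cm_to_nm(nu_min_cm: float, nu_max_cm: float) -> tuple[float, float]:
    """Convert a (min, max) wavenumber zone in cm^-1 to (min, max) wavelength in nm."""
    # lambda = 1e7 / nu reverses the ordering: the highest wavenumber
    # corresponds to the shortest wavelength.
    return (1e7 / nu_max_cm, 1e7 / nu_min_cm)


# Assumed wavenumber bounds, inferred from the documented wavelength range.
low_nm, high_nm = zone_cm_to_nm(5500.0, 6250.0)
print(f"{low_nm:.2f}-{high_nm:.2f} nm")  # 1600.00-1818.18 nm
```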
- nirs4all.data.synthetic.is_gpu_available() bool[source]
Check if GPU acceleration is available.
- Returns:
True if JAX with GPU or CuPy is available.
Example
>>> if is_gpu_available():
...     print("GPU acceleration enabled!")
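A check of this kind can be approximated without nirs4all by probing the two optional libraries directly. This is a sketch under assumptions, not the library's implementation: `gpu_backend_available` is a hypothetical name, and both imports are wrapped so the function also runs when neither package is installed.

```python
def gpu_backend_available() -> bool:
    """Return True if CuPy sees a CUDA device or JAX reports a GPU device."""
    try:
        import cupy  # optional dependency
        if cupy.cuda.runtime.getDeviceCount() > 0:
            return True
    except Exception:  # not installed, or no usable CUDA runtime
        pass
    try:
        import jax  # optional dependency
        return any(device.platform == "gpu" for device in jax.devices())
    except Exception:  # not installed, or backend initialization failed
        return False


print("GPU available:", gpu_backend_available())
```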
- nirs4all.data.synthetic.list_benchmark_datasets() List[str][source]
List all registered benchmark datasets.
- Returns:
List of dataset names.
Example
>>> datasets = list_benchmark_datasets()
>>> print(datasets)
- nirs4all.data.synthetic.list_detector_types() List[str][source]
List available detector types.
- Returns:
List of detector type names.
- nirs4all.data.synthetic.list_domains(category: DomainCategory | None = None) List[str][source]
List available domain names.
- Parameters:
category – Optional category filter.
- Returns:
List of domain names.
Example
>>> list_domains(DomainCategory.AGRICULTURE)
['agriculture_grain', 'agriculture_forage', ...]
- nirs4all.data.synthetic.list_instrument_archetypes(category: InstrumentCategory | None = None) List[str][source]
List available instrument archetype names.
- Parameters:
category – Optional filter by category.
- Returns:
List of archetype names.
Example
>>> list_instrument_archetypes(InstrumentCategory.HANDHELD)
['viavi_micronir', 'scio', 'tellspec', 'linksquare', 'siware_neoscanner']
- nirs4all.data.synthetic.load_benchmark_dataset(name: str, data_dir: str | Path | None = None, format: str = 'auto') LoadedBenchmarkDataset[source]
Load a benchmark dataset from disk.
- Parameters:
name – Dataset name from registry.
data_dir – Directory containing dataset files.
format – File format (“auto”, “csv”, “mat”, “jdx”).
- Returns:
LoadedBenchmarkDataset with data.
- Raises:
FileNotFoundError – If dataset files not found.
KeyError – If dataset name not in registry.
Example
>>> dataset = load_benchmark_dataset("corn", data_dir="./datasets/")
>>> print(dataset.X.shape, dataset.y.shape)
Note
Dataset files must be obtained separately from their sources. This function provides standardized loading once files are available.
- nirs4all.data.synthetic.quick_realism_check(synthetic_spectra: ndarray, wavelengths: ndarray | None = None, expected_snr_range: Tuple[float, float] = (10, 1000), expected_peak_density: Tuple[float, float] = (0.5, 10.0)) Tuple[bool, List[str]][source]
Perform quick realism checks on synthetic spectra without real data.
This function checks basic properties that realistic spectra should have, without requiring a reference real dataset.
- Parameters:
synthetic_spectra – Synthetic spectra to check.
wavelengths – Wavelength array.
expected_snr_range – Expected SNR range (min, max).
expected_peak_density – Expected peak density range (peaks per 100 nm).
- Returns:
Tuple of (passed, list_of_issues).
Example
>>> X = generator.generate(100)[0]
>>> passed, issues = quick_realism_check(X, wavelengths)
>>> if not passed:
...     print("Issues:", issues)
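To make the "peaks per 100 nm" unit concrete, the sketch below counts strict local maxima in one spectrum and divides by the spectral span. How the library detects peaks is not documented here, so the counting rule and the helper name `peak_density_per_100nm` are assumptions.

```python
import numpy as np


def peak_density_per_100nm(spectrum, wavelengths):
    """Count strict local maxima and normalize by the span in units of 100 nm."""
    y = np.asarray(spectrum, dtype=float)
    # An interior point is a peak if it exceeds both of its neighbors.
    is_peak = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    span_nm = float(wavelengths[-1] - wavelengths[0])
    return is_peak.sum() / (span_nm / 100.0)


wavelengths = np.linspace(1000, 2000, 501)
# Two noise-free Gaussian bands: two local maxima over a 1000 nm span.
spectrum = (np.exp(-0.5 * ((wavelengths - 1300) / 30) ** 2)
            + np.exp(-0.5 * ((wavelengths - 1700) / 40) ** 2))
density = peak_density_per_100nm(spectrum, wavelengths)
print(density)
```

A simple neighbor comparison like this is sensitive to noise, so a practical check would usually smooth the spectrum or require a minimum peak prominence first.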
- nirs4all.data.synthetic.sample_prior(domain: str | None = None, instrument: str | None = None, random_state: int | None = None) Dict[str, Any][source]
Quick function to sample a single configuration from the default prior.
- Parameters:
domain – Optional domain constraint.
instrument – Optional instrument constraint.
random_state – Random state for reproducibility.
- Returns:
Configuration dictionary.
Example
>>> config = sample_prior(domain="food", random_state=42)
>>> print(config["domain"], config["instrument"])
- nirs4all.data.synthetic.sample_prior_batch(n: int, random_state: int | None = None) List[Dict[str, Any]][source]
Quick function to sample multiple configurations from the default prior.
- Parameters:
n – Number of configurations to sample.
random_state – Random state for reproducibility.
- Returns:
List of configuration dictionaries.
Example
>>> configs = sample_prior_batch(10, random_state=42)
>>> for c in configs:
...     print(c["domain"], c["instrument"])
- nirs4all.data.synthetic.simulate_detector_effects(spectra: ndarray, wavelengths: ndarray, detector_type: DetectorType = DetectorType.INGAAS, include_response: bool = True, include_noise: bool = True, random_state: int | None = None) ndarray[source]
Apply detector effects to spectra with a simple API.
- Parameters:
spectra – Input spectra (n_samples, n_wavelengths).
wavelengths – Wavelength array (nm).
detector_type – Type of detector to simulate.
include_response – Whether to apply spectral response.
include_noise – Whether to apply noise.
random_state – Random seed.
- Returns:
Spectra with detector effects applied.
Example
>>> spectra_out = simulate_detector_effects(
...     spectra, wavelengths,
...     detector_type=DetectorType.PBS
... )
- nirs4all.data.synthetic.simulate_msc_correctable_scatter(spectra: ndarray, reference: ndarray | None = None, intensity: float = 1.0, random_state: int | None = None) ndarray[source]
Apply scatter effects that MSC (Multiplicative Scatter Correction) would correct.
MSC regresses each spectrum against a reference to remove multiplicative and baseline scatter. This function applies such effects.
- Parameters:
spectra – Input spectra.
reference – Reference spectrum (mean if None).
intensity – Intensity of scatter effects.
random_state – Random seed.
- Returns:
Spectra with MSC-correctable scatter.
Example
>>> # Add scatter that MSC will correct
>>> scattered = simulate_msc_correctable_scatter(spectra)
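The regression step that MSC performs can be sketched in plain NumPy. The example assumes scatter of the form x = a × reference + b with a random gain a and offset b per sample; that form and the helper name `msc` are illustrative assumptions, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.linspace(1000, 2500, 200)
clean = np.exp(-0.5 * ((wavelengths - 1450) / 60) ** 2)  # one synthetic band
reference = clean.copy()

# Per-sample multiplicative gain a and additive offset b: x = a * clean + b.
a = rng.uniform(0.8, 1.2, size=(10, 1))
b = rng.uniform(-0.1, 0.1, size=(10, 1))
scattered = a * clean + b


def msc(spectra, reference):
    """Regress each spectrum on the reference, then undo slope and intercept."""
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        slope, intercept = np.polyfit(reference, x, deg=1)
        corrected[i] = (x - intercept) / slope
    return corrected


corrected = msc(scattered, reference)
print(np.max(np.abs(corrected - clean)))  # residual at floating-point level
```

Because the simulated scatter is exactly linear in the reference, the correction recovers the clean spectrum up to rounding error; real spectra also contain chemical variation, so MSC removes only the part that is linear in the reference.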
- nirs4all.data.synthetic.simulate_snv_correctable_scatter(spectra: ndarray, intensity: float = 1.0, random_state: int | None = None) ndarray[source]
Apply scatter effects that SNV (Standard Normal Variate) would correct.
SNV removes per-spectrum multiplicative and additive scatter. This function applies such effects, so that SNV preprocessing of the output matches SNV preprocessing of the original spectra.
- Parameters:
spectra – Input spectra.
intensity – Intensity of scatter effects (0-2, default 1).
random_state – Random seed.
- Returns:
Spectra with SNV-correctable scatter.
Example
>>> # Add scatter that SNV will correct
>>> scattered = simulate_snv_correctable_scatter(spectra, intensity=1.5)
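SNV centers and scales each spectrum by its own mean and standard deviation, so a positive per-sample gain and an additive offset cancel out. The sketch below demonstrates that invariance with plain NumPy; the scatter model and the helper name `snv` are assumptions for illustration.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


rng = np.random.default_rng(0)
spectra = rng.random((10, 200))
gain = rng.uniform(0.8, 1.2, size=(10, 1))     # multiplicative scatter, kept > 0
offset = rng.uniform(-0.1, 0.1, size=(10, 1))  # additive baseline shift
scattered = gain * spectra + offset

# SNV gives the same result with or without the simulated scatter.
same = np.allclose(snv(scattered), snv(spectra))
print(same)
```

Note that SNV returns the standardized spectra, not the raw originals: the comparison is between the two SNV outputs.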
- nirs4all.data.synthetic.simulate_temperature_series(spectrum: ndarray, wavelengths: ndarray, temperatures: List[float], reference_temperature: float = 25.0, random_state: int | None = None) ndarray[source]
Generate a series of spectra at different temperatures.
Useful for simulating temperature studies or generating training data for temperature-robust models.
- Parameters:
spectrum – Single reference spectrum (n_wavelengths,).
wavelengths – Wavelength array (nm).
temperatures – List of temperatures to simulate.
reference_temperature – Reference temperature for the input spectrum.
random_state – Random seed.
- Returns:
Array of spectra (n_temperatures, n_wavelengths).
Example
>>> temps = [20, 25, 30, 35, 40]
>>> temp_series = simulate_temperature_series(spectrum, wavelengths, temps)
- nirs4all.data.synthetic.validate_against_benchmark(synthetic_spectra: ndarray, benchmark_spectra: ndarray, benchmark_name: str, wavelengths: ndarray | None = None, synthetic_targets: ndarray | None = None, benchmark_targets: ndarray | None = None, random_state: int | None = None) DatasetComparisonResult[source]
Validate synthetic data against a benchmark dataset.
- Parameters:
synthetic_spectra – Synthetic spectra (n_synth, n_wavelengths).
benchmark_spectra – Real benchmark spectra (n_bench, n_wavelengths).
benchmark_name – Name of the benchmark dataset.
wavelengths – Wavelength array.
synthetic_targets – Optional targets for TSTR/TRTS evaluation.
benchmark_targets – Optional targets for TSTR/TRTS evaluation.
random_state – Random state for reproducibility.
- Returns:
DatasetComparisonResult with realism score and optional TSTR/TRTS.
Example
>>> result = validate_against_benchmark(
...     synthetic_spectra=X_synth,
...     benchmark_spectra=X_real,
...     benchmark_name="Corn",
... )
>>> print(result.summary())
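TSTR (train on synthetic, test on real) and TRTS (train on real, test on synthetic) measure how well a model fitted on one data source transfers to the other. The sketch below illustrates the idea with an ordinary least-squares model and R² on random stand-in data; the model, the metric, and the helper names are assumptions, and the library may evaluate this differently.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear model with intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r2_score(X, y, coef):
    """Coefficient of determination of the fitted model on (X, y)."""
    pred = np.column_stack([X, np.ones(len(X))]) @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
# Stand-ins for synthetic and real data that share the same underlying relation.
X_synth = rng.normal(size=(200, 5))
y_synth = X_synth @ true_w + 0.05 * rng.normal(size=200)
X_real = rng.normal(size=(100, 5))
y_real = X_real @ true_w + 0.05 * rng.normal(size=100)

tstr = r2_score(X_real, y_real, fit_linear(X_synth, y_synth))  # train synthetic, test real
trts = r2_score(X_synth, y_synth, fit_linear(X_real, y_real))  # train real, test synthetic
print(f"TSTR R2={tstr:.3f}, TRTS R2={trts:.3f}")
```

Both scores are high here because the two stand-in datasets were drawn from the same relation; a large gap between TSTR and a model trained and tested on real data would indicate that the synthetic data misses structure present in the real data.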
- nirs4all.data.synthetic.validate_concentrations(C: ndarray, n_samples: int | None = None, n_components: int | None = None, check_normalized: bool = False, tolerance: float = 0.01) List[str][source]
Validate concentration matrix.
- Parameters:
C – Concentration matrix to validate.
n_samples – Expected number of samples.
n_components – Expected number of components.
check_normalized – Whether concentrations should sum to 1.
tolerance – Tolerance for normalization check.
- Returns:
List of validation warning messages.
- Raises:
ValidationError – If critical validation fails.
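The normalization check controlled by check_normalized and tolerance can be pictured as a row-sum test. The following is a minimal sketch assuming that each row of C should sum to 1 within the tolerance; `check_rows_sum_to_one` is a hypothetical helper and the message format is invented for illustration.

```python
import numpy as np


def check_rows_sum_to_one(C, tolerance=0.01):
    """Return warning strings for rows whose sum deviates from 1 by more than tolerance."""
    row_sums = np.asarray(C, dtype=float).sum(axis=1)
    bad = np.flatnonzero(np.abs(row_sums - 1.0) > tolerance)
    return [f"row {i}: sum={row_sums[i]:.3f}" for i in bad]


C = np.array([[0.7, 0.2, 0.1],    # sums to 1.0
              [0.5, 0.3, 0.1]])   # sums to 0.9
issues = check_rows_sum_to_one(C)
print(issues)  # ['row 1: sum=0.900']
```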
- nirs4all.data.synthetic.validate_spectra(X: ndarray, expected_shape: Tuple[int, int] | None = None, check_finite: bool = True, check_positive: bool = False, value_range: Tuple[float, float] | None = None) List[str][source]
Validate generated spectra matrix.
- Parameters:
X – Spectra matrix to validate.
expected_shape – Expected (n_samples, n_wavelengths) shape.
check_finite – Whether to check for NaN/Inf values.
check_positive – Whether to require all positive values.
value_range – Optional (min, max) expected range.
- Returns:
List of validation warning messages (empty if all OK).
- Raises:
ValidationError – If critical validation fails.
Example
>>> X = np.random.randn(100, 500)
>>> warnings = validate_spectra(X, expected_shape=(100, 500))
>>> if warnings:
...     print("Warnings:", warnings)
- nirs4all.data.synthetic.validate_synthetic_output(X: ndarray, C: ndarray, E: ndarray, wavelengths: ndarray | None = None) List[str][source]
Validate complete synthetic generation output.
- Parameters:
X – Generated spectra (n_samples, n_wavelengths).
C – Concentration matrix (n_samples, n_components).
E – Component spectra (n_components, n_wavelengths).
wavelengths – Optional wavelength array.
- Returns:
List of all validation warnings.
- Raises:
ValidationError – If critical validation fails.
Example
>>> from nirs4all.data.synthetic import SyntheticNIRSGenerator
>>> gen = SyntheticNIRSGenerator(random_state=42)
>>> X, C, E = gen.generate(100)
>>> warnings = validate_synthetic_output(X, C, E, gen.wavelengths)
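The shapes listed in the parameters constrain each other: X is (n_samples, n_wavelengths), C is (n_samples, n_components), and E is (n_components, n_wavelengths). The sketch below checks only that consistency; `check_shapes` is a hypothetical helper and covers a subset of what a full validation might do.

```python
import numpy as np


def check_shapes(X, C, E, wavelengths=None):
    """Return a list of shape-mismatch messages (empty when everything agrees)."""
    issues = []
    n_samples, n_wavelengths = X.shape
    if C.shape[0] != n_samples:
        issues.append(f"C has {C.shape[0]} rows, expected {n_samples}")
    if E.shape != (C.shape[1], n_wavelengths):
        issues.append(f"E has shape {E.shape}, expected {(C.shape[1], n_wavelengths)}")
    if wavelengths is not None and len(wavelengths) != n_wavelengths:
        issues.append(f"wavelengths has length {len(wavelengths)}, expected {n_wavelengths}")
    return issues


X = np.zeros((100, 500))
C = np.zeros((100, 4))
E = np.zeros((4, 500))
print(check_shapes(X, C, E, np.linspace(1000, 2500, 500)))  # []
```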
- nirs4all.data.synthetic.validate_wavelengths(wavelengths: ndarray, expected_range: Tuple[float, float] | None = None, check_monotonic: bool = True, check_uniform: bool = True) List[str][source]
Validate wavelength array.
- Parameters:
wavelengths – Wavelength array to validate.
expected_range – Optional (min, max) expected range in nm.
check_monotonic – Whether to check for monotonically increasing values.
check_uniform – Whether to check for uniform spacing.
- Returns:
List of validation warning messages.
- Raises:
ValidationError – If critical validation fails.
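The monotonicity and uniform-spacing checks can both be expressed in terms of the differences between consecutive wavelengths. This is a sketch under assumptions: `check_wavelength_grid`, its relative tolerance, and the message texts are hypothetical and not taken from the library.

```python
import numpy as np


def check_wavelength_grid(wavelengths, rtol=1e-6):
    """Return warnings for non-increasing or non-uniformly spaced wavelength grids."""
    steps = np.diff(np.asarray(wavelengths, dtype=float))
    issues = []
    if np.any(steps <= 0):
        issues.append("wavelengths are not strictly increasing")
    elif not np.allclose(steps, steps.mean(), rtol=rtol, atol=0.0):
        issues.append("wavelength spacing is not uniform")
    return issues


print(check_wavelength_grid(np.linspace(1000, 2500, 751)))  # uniform 2 nm grid
print(check_wavelength_grid([1000, 1002, 1005, 1006]))       # irregular steps
print(check_wavelength_grid([1000, 999, 1001]))              # not monotonic
```

Grids that are uniform in wavenumber, as produced by FT-NIR instruments, are non-uniform in wavelength, which is one reason a caller might disable the uniformity check.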
- nirs4all.data.synthetic.wavelength_to_wavenumber(lambda_nm: float | ndarray) float | ndarray[source]
Convert wavelength (nm) to wavenumber (cm⁻¹).
The conversion follows the relationship: ν̃ = 10^7 / λ
- Parameters:
lambda_nm – Wavelength in nm. Can be a scalar or numpy array.
- Returns:
Wavenumber in cm⁻¹ (same shape as input).
- Raises:
ValueError – If wavelength is zero or negative.
Example
>>> wavelength_to_wavenumber(1450)  # O-H 1st overtone region
6896.55...
>>> wavelength_to_wavenumber(np.array([1450, 1940]))
array([6896.55..., 5154.64...])
- nirs4all.data.synthetic.wavenumber_to_wavelength(nu_cm: float | ndarray) float | ndarray[source]
Convert wavenumber (cm⁻¹) to wavelength (nm).
The conversion follows the relationship: λ = 10^7 / ν̃
- Parameters:
nu_cm – Wavenumber in cm⁻¹. Can be a scalar or numpy array.
- Returns:
Wavelength in nm (same shape as input).
- Raises:
ValueError – If wavenumber is zero or negative.
Example
>>> wavenumber_to_wavelength(6896)  # 1st overtone O-H
1450.11...
>>> wavenumber_to_wavelength(np.array([6896, 5155]))
array([1450.11..., 1939.86...])
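Both conversions apply the same reciprocal relation, so converting there and back returns the input. The sketch below reproduces that relation in plain NumPy; `nm_to_cm` and `cm_to_nm` are hypothetical stand-ins for the two documented functions, including the ValueError for non-positive inputs.

```python
import numpy as np


def nm_to_cm(lambda_nm):
    """Wavelength (nm) -> wavenumber (cm^-1); scalars and arrays are both accepted."""
    lambda_nm = np.asarray(lambda_nm, dtype=float)
    if np.any(lambda_nm <= 0):
        raise ValueError("wavelength must be positive")
    return 1e7 / lambda_nm


def cm_to_nm(nu_cm):
    """Wavenumber (cm^-1) -> wavelength (nm); the same formula applied in reverse."""
    nu_cm = np.asarray(nu_cm, dtype=float)
    if np.any(nu_cm <= 0):
        raise ValueError("wavenumber must be positive")
    return 1e7 / nu_cm


grid_nm = np.array([1450.0, 1940.0, 2300.0])
round_trip = cm_to_nm(nm_to_cm(grid_nm))
print(np.allclose(round_trip, grid_nm))
```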