nirs4all.data package
Subpackages
- nirs4all.data.aggregation package
- Submodules
- Module contents
AggregationConfig: column, method, exclude_outliers, outlier_threshold, min_samples, custom_function, feature_method, target_method, __post_init__(), from_config(), is_enabled()
AggregationMethod, Aggregator, aggregate_data()
- nirs4all.data.detection package
- Submodules
- Module contents
AutoDetector
DetectionResult: delimiter, decimal_separator, has_header, header_unit, signal_type, encoding, n_columns, n_rows, confidence, warnings, to_params()
detect_file_parameters(), detect_signal_type()
- nirs4all.data.loaders package
- Submodules
- nirs4all.data.loaders.archive_loader module
- nirs4all.data.loaders.base module
- nirs4all.data.loaders.csv_loader module
- nirs4all.data.loaders.csv_loader_new module
- nirs4all.data.loaders.excel_loader module
- nirs4all.data.loaders.loader module
- nirs4all.data.loaders.matlab_loader module
- nirs4all.data.loaders.numpy_loader module
- nirs4all.data.loaders.parquet_loader module
- Module contents
ArchiveHandler: decompress_gzip(), decompress_gzip_bytes(), extract_bytes_from_tar(), extract_bytes_from_zip(), extract_from_tar(), extract_from_zip(), is_archive(), is_compressed(), list_tar_members(), list_zip_members()
CSVLoader, EnhancedZipLoader, ExcelLoader, FileLoadError, FileLoader, FormatNotSupportedError, LoaderError, LoaderRegistry, LoaderResult, MatlabLoader, NumpyLoader, ParquetLoader, TarLoader
get_loader_for_file(), get_supported_formats(), list_archive_members(), load_csv(), load_csv_new(), load_excel(), load_file(), load_matlab(), load_numpy(), load_parquet(), register_loader()
- Submodules
- nirs4all.data.parsers package
- Submodules
- Module contents
- nirs4all.data.partition package
- Submodules
- Module contents
PartitionAssigner, PartitionError
PartitionResult: train_indices, test_indices, predict_indices, train_data, test_data, predict_data, partition_column, get_data(), get_indices(), has_predict, has_test, has_train
- nirs4all.data.performance package
- Submodules
- Module contents
CacheEntry: data, key, timestamp, size_bytes, source_path, source_mtime, hit_count, is_stale()
DataCache, LazyArray, LazyDataset, cache_manager()
- nirs4all.data.schema package
- Subpackages
- Submodules
- nirs4all.data.schema.config module
AggregateMethod, CategoricalMode, ColumnConfig, DatasetConfigSchema, FileConfig, FoldConfig, FoldDefinition, HeaderUnit, LoadingParams, NAPolicy, PartitionConfig, PartitionType, PreprocessingApplied, SharedMetadataConfig, SharedTargetsConfig, SignalTypeEnum, SourceConfig, SourceFileConfig, TaskType, VariationConfig, VariationFileConfig, VariationMode
- Module contents
AggregateMethod, CategoricalMode, ColumnConfig, ConfigValidator
DatasetConfigSchema: aggregate, aggregate_exclude_outliers, aggregate_method, description, files, folds, from_dict(), get_effective_params(), get_selected_variations(), get_source_count(), get_source_names(), get_variation_count(), get_variation_names(), global_params, is_files_format(), is_legacy_format(), is_multi_source(), is_sources_format(), is_variations_format(), model_config, name, normalize_aggregate_method(), normalize_task_type(), normalize_variation_mode(), parse_loading_params(), parse_shared_metadata(), parse_shared_targets(), parse_sources(), parse_variations(), shared_metadata, shared_targets, sources, task_type, test_group, test_group_filter, test_group_params, test_params, test_x, test_x_filter, test_x_params, test_y, test_y_filter, test_y_params, to_dict(), to_legacy_format(), train_group, train_group_filter, train_group_params, train_params, train_x, train_x_filter, train_x_params, train_y, train_y_filter, train_y_params, validate_data_sources(), variation_mode, variation_prefix, variation_select, variations, variations_to_legacy_format()
FileConfig, FoldConfig, FoldDefinition, HeaderUnit
LoadingParams: categorical_mode, decimal_separator, delimiter, encoding, has_header, header_unit, merge_with(), model_config, na_policy, normalize_header_unit(), normalize_signal_type(), signal_type
NAPolicy
PartitionConfig: column, model_config, predict, predict_file, predict_values, random_state, shuffle, stratify, test, test_file, test_values, to_assigner_spec(), train, train_file, train_values, type, unknown_policy, validate_partition_method()
PartitionType, PathOrArray, PreprocessingApplied, SharedMetadataConfig, SharedTargetsConfig, SignalTypeEnum, SourceConfig, SourceFileConfig, TaskType, ValidationError, ValidationResult, ValidationWarning
VariationConfig: description, files, get_test_paths(), get_train_paths(), model_config, name, params, preprocessing_applied, test_x, train_x, validate_variation_files()
VariationFileConfig, VariationMode
- nirs4all.data.selection package
- Submodules
- Module contents
- nirs4all.data.serialization package
- nirs4all.data.synthetic package
- Subpackages
- Submodules
- nirs4all.data.synthetic.accelerated module
- nirs4all.data.synthetic.benchmarks module
- nirs4all.data.synthetic.builder module
- nirs4all.data.synthetic.components module
- nirs4all.data.synthetic.config module
- nirs4all.data.synthetic.detectors module
- nirs4all.data.synthetic.domains module
- nirs4all.data.synthetic.environmental module
- nirs4all.data.synthetic.exporter module
- nirs4all.data.synthetic.fitter module
ComponentFitResult, ComponentFitter, DerivativeAwareForwardModelFitter, DomainInference, EdgeArtifactInference, EnvironmentalInference, FittedParameters, ForwardModelFitter, InstrumentChain, InstrumentInference, MeasurementModeInference, OperatorVarianceParams, OptimizedComponentFitter, OptimizedFitResult, PCAVarianceParams, PreprocessingInference, PreprocessingType, RealBandFitResult, RealBandFitter, RealDataFitter, ScatteringInference, SpectralProperties, VarianceFitResult, VarianceFitter
compare_datasets(), compute_spectral_properties(), fit_components(), fit_components_optimized(), fit_real_bands(), fit_to_real_data(), fit_variance(), multiscale_derivative_fit(), multiscale_fit()
- nirs4all.data.synthetic.generator module
- nirs4all.data.synthetic.instruments module
DetectorType, EdgeArtifactsConfig, InstrumentArchetype, InstrumentCategory, InstrumentSimulator, MonochromatorType, MultiScanConfig, MultiSensorConfig, SensorConfig
get_instrument_archetype(), get_instrument_wavelength_info(), get_instrument_wavelengths(), get_instruments_by_category(), list_instrument_archetypes(), list_instrument_wavelength_grids()
- nirs4all.data.synthetic.measurement_modes module
- nirs4all.data.synthetic.metadata module
- nirs4all.data.synthetic.prior module
- nirs4all.data.synthetic.procedural module
- nirs4all.data.synthetic.products module
- nirs4all.data.synthetic.scattering module
- nirs4all.data.synthetic.sources module
- nirs4all.data.synthetic.targets module
- nirs4all.data.synthetic.validation module
DatasetComparisonResult, MetricResult, RealismMetric, SpectralRealismScore, ValidationError
compute_adversarial_validation_auc(), compute_baseline_curvature(), compute_correlation_length(), compute_derivative_statistics(), compute_distribution_overlap(), compute_peak_density(), compute_snr(), compute_spectral_realism_scorecard(), quick_realism_check(), validate_against_benchmark(), validate_concentrations(), validate_spectra(), validate_synthetic_output(), validate_wavelengths()
- nirs4all.data.synthetic.wavenumber module
CombinationBandResult, OvertoneResult
apply_hydrogen_bonding_shift(), calculate_combination_band(), calculate_overtone_position(), classify_wavelength_extended(), classify_wavelength_zone(), convert_bandwidth_to_wavelength(), convert_bandwidth_to_wavenumber(), estimate_bandwidth_broadening(), get_all_zones_extended(), get_all_zones_wavelength(), get_nir_overtones_for_fundamental(), get_zone_wavelength_range(), is_nir_region(), is_visible_region(), wavelength_to_wavenumber(), wavenumber_to_wavelength()
- Module contents
ATRConfig
AcceleratedArrays: arange, array, backend, cos, dot, exp, linspace, log, matmul, ones, random_normal, random_uniform, sin, sqrt, sum, to_numpy, zeros
AcceleratedGenerator, AcceleratorBackend
AggregateComponent: name, components, description, domain, category, variability, correlations, tags, references, info(), spectral_category, validate()
BandAssignment: center, wavenumber, functional_group, overtone_level, assignment, description, sigma_range, gamma_range, intensity, chemical_context, affected_by, common_compounds, references, tags, info(), to_nir_band()
BatchEffectConfig
BenchmarkDatasetInfo: name, full_name, domain, n_samples, n_wavelengths, wavelength_range, targets, sample_type, measurement_mode, source_url, reference, license, typical_snr, typical_peak_density, notes, summary()
BenchmarkDomain
CSVVariationGenerator: base_exporter, as_fragmented(), as_single_file(), generate_all_variations(), with_comma_delimiter(), with_precision(), with_row_index(), with_semicolon_delimiter(), with_tab_delimiter(), without_headers()
CategoryGenerator, ClassSeparationConfig, CombinationBandResult
ComponentFitResult: component_names, concentrations, baseline_coefficients, fitted_spectrum, residuals, r_squared, rmse, wavelengths, summary(), to_dict(), top_components()
ComponentFitter: component_names, wavelengths, fit_baseline, baseline_order, preprocessing, auto_detect_preprocessing, detected_preprocessing, fit(), fit_batch(), get_concentration_matrix(), suggest_components()
ComponentLibrary: rng, __contains__(), __getitem__(), __iter__(), __len__(), add_boundary_component(), add_boundary_components_from_known(), add_component(), add_random_component(), component_names, components, compute_all(), from_predefined(), generate_random_library(), n_components
ComponentVariation: component, variation_type, value, min_value, max_value, mean, std, correlated_with, correlation, compute_as, __post_init__()
ConcentrationPrior
DatasetComparisonResult: dataset_name, n_real_samples, n_synthetic_samples, realism_score, tstr_r2, trts_r2, summary()
DatasetExporter
DerivativeAwareForwardModelFitter: components, canonical_grid, target_grid, derivative_order, sg_window, sg_polyorder, baseline_order, __post_init__(), fit(), ils_sigma_bounds, path_length_bounds, wl_shift_bounds
DetectorConfig: detector_type, temperature_k, integration_time_ms, gain, noise_model, apply_response_curve, apply_nonlinearity, nonlinearity_coefficient
DetectorSimulator
DetectorSpectralResponse: detector_type, wavelengths, response, peak_wavelength, cutoff_wavelength, short_cutoff, peak_qe, get_response_at()
DetectorType, DomainCategory
DomainConfig: name, category, description, typical_components, component_weights, concentration_priors, wavelength_range, n_components_range, noise_level, measurement_mode, typical_sample_types, complexity, additional_params, get_component_weights(), sample_components(), sample_concentrations()
DomainInference: domain_name, category, confidence, detected_components, alternative_domains
EMSCConfig: polynomial_order, multiplicative_scatter_std, additive_scatter_std, include_wavelength_terms, wavelength_coef_std, reference_spectrum
EdgeArtifactsConfig: enable_detector_rolloff, enable_stray_light, enable_truncated_peaks, enable_edge_curvature, detector_model, rolloff_severity, stray_fraction, stray_wavelength_dependent, left_peak_amplitude, right_peak_amplitude, curvature_type, left_curvature_severity, right_curvature_severity
EnvironmentalEffectsConfig: temperature, moisture, enable_temperature, enable_moisture
EnvironmentalInference: estimated_temperature_variation, has_temperature_effects, estimated_moisture_variation, has_moisture_effects, water_band_shift
ExportConfig: format, separator, float_precision, include_headers, include_index, compression, file_extension
FeatureConfig: wavelength_start, wavelength_end, wavelength_step, complexity, n_components, component_names
FittedParameters: wavelength_start, wavelength_end, wavelength_step, global_slope_mean, global_slope_std, baseline_amplitude, noise_base, noise_signal_dep, path_length_std, scatter_alpha_std, scatter_beta_std, tilt_std, complexity, source_name, source_properties, inferred_instrument, instrument_inference, measurement_mode, measurement_mode_confidence, inferred_domain, domain_inference, environmental_inference, temperature_config, moisture_config, scattering_inference, particle_size_config, emsc_config, detected_components, suggested_n_components, boundary_components_config, edge_artifact_inference, edge_artifacts_config, from_dict(), is_preprocessed, load(), preprocessing_inference, preprocessing_type, save(), summary(), to_dict(), to_full_config(), to_generator_kwargs()
ForwardModelFitter: components, canonical_grid, target_grid, baseline_order, wl_shift_bounds, ils_sigma_bounds, path_length_bounds, __post_init__(), fit()
FunctionalGroupType
InstrumentArchetype: name, category, detector_type, monochromator_type, wavelength_range, spectral_resolution, wavelength_accuracy, photometric_noise, photometric_range, snr, stray_light, warm_up_drift, temperature_sensitivity, scan_speed, integration_time_ms, optical_path, multi_sensor, multi_scan, description, get_noise_model_params()
InstrumentCategory
InstrumentChain: wl_shift, wl_stretch, ils_sigma, stray_light, gain, offset, apply()
InstrumentInference: archetype_name, detector_type, wavelength_range, estimated_resolution, confidence, alternative_archetypes
InstrumentSimulator
LoadedBenchmarkDataset: info, X, y, wavelengths, sample_ids, metadata
MatrixType, MeasurementMode, MeasurementModeInference
MeasurementModeSimulator: config, rng, absorbance_to_reflectance(), apply(), generate_scattering_coefficients(), inverse_kubelka_munk(), kubelka_munk(), reflectance_to_absorbance()
MetadataConfig: generate_sample_ids, sample_id_prefix, n_groups, n_repetitions, group_names, additional_columns
MetadataGenerationResult: sample_ids, bio_sample_ids, repetitions, groups, group_indices, n_bio_samples, additional_columns, to_dict()
MetadataGenerator, MetricResult
MoistureConfig: water_activity, moisture_content, free_water_fraction, bound_water_shift, temperature_interaction, reference_aw, __post_init__()
MonochromatorType
MultiScanConfig: enabled, n_scans, averaging_method, scan_to_scan_noise, wavelength_jitter, discard_outliers, outlier_threshold
MultiSensorConfig: enabled, sensors, stitch_method, stitch_smoothing, add_stitch_artifacts, artifact_intensity
MultiSourceGenerator
MultiSourceResult: sources, targets, source_configs, wavelengths, metadata, get_combined_features(), n_features_total, n_samples, source_names
NIRBand
NIRSPriorConfig: domain_weights, instrument_given_domain, mode_given_category, matrix_given_domain, temperature_range, particle_size_range, noise_level_range, get_domain_weight(), n_classes_range, n_samples_range, n_targets_range, normalize_weights(), target_type_weights
NoiseModelConfigNoiseModelConfig.shot_noise_enabledNoiseModelConfig.thermal_noise_enabledNoiseModelConfig.read_noise_enabledNoiseModelConfig.flicker_noise_enabledNoiseModelConfig.quantization_noise_enabledNoiseModelConfig.shot_noise_factorNoiseModelConfig.thermal_noise_factorNoiseModelConfig.read_noise_electronsNoiseModelConfig.flicker_corner_freqNoiseModelConfig.adc_bitsNoiseModelConfig.full_scaleNoiseModelConfig.adc_bitsNoiseModelConfig.flicker_corner_freqNoiseModelConfig.flicker_noise_enabledNoiseModelConfig.full_scaleNoiseModelConfig.quantization_noise_enabledNoiseModelConfig.read_noise_electronsNoiseModelConfig.read_noise_enabledNoiseModelConfig.shot_noise_enabledNoiseModelConfig.shot_noise_factorNoiseModelConfig.thermal_noise_enabledNoiseModelConfig.thermal_noise_factor
OperatorVarianceParamsOperatorVarianceParams.noise_stdOperatorVarianceParams.offset_stdOperatorVarianceParams.slope_stdOperatorVarianceParams.curvature_stdOperatorVarianceParams.mult_scatter_stdOperatorVarianceParams.curvature_stdOperatorVarianceParams.mult_scatter_stdOperatorVarianceParams.noise_stdOperatorVarianceParams.offset_stdOperatorVarianceParams.slope_stdOperatorVarianceParams.to_dict()
OptimizedComponentFitterOptimizedComponentFitter.wavelengthsOptimizedComponentFitter.priority_categoriesOptimizedComponentFitter.max_componentsOptimizedComponentFitter.baseline_orderOptimizedComponentFitter.preprocessingOptimizedComponentFitter.auto_detect_preprocessingOptimizedComponentFitter.detected_preprocessingOptimizedComponentFitter.fit()
OptimizedFitResultOptimizedFitResult.component_namesOptimizedFitResult.concentrationsOptimizedFitResult.baseline_coefficientsOptimizedFitResult.fitted_spectrumOptimizedFitResult.residualsOptimizedFitResult.r_squaredOptimizedFitResult.rmseOptimizedFitResult.n_componentsOptimizedFitResult.n_priority_componentsOptimizedFitResult.baseline_r_squaredOptimizedFitResult.wavelengthsOptimizedFitResult.baseline_coefficientsOptimizedFitResult.baseline_r_squaredOptimizedFitResult.component_namesOptimizedFitResult.concentrationsOptimizedFitResult.fitted_spectrumOptimizedFitResult.n_componentsOptimizedFitResult.n_priority_componentsOptimizedFitResult.r_squaredOptimizedFitResult.residualsOptimizedFitResult.rmseOptimizedFitResult.summary()OptimizedFitResult.top_components()OptimizedFitResult.wavelengths
OutputConfigOvertoneResultPCAVarianceParamsPCAVarianceParams.n_componentsPCAVarianceParams.explained_variance_ratioPCAVarianceParams.score_meansPCAVarianceParams.score_stdsPCAVarianceParams.componentsPCAVarianceParams.mean_spectrumPCAVarianceParams.componentsPCAVarianceParams.explained_variance_ratioPCAVarianceParams.mean_spectrumPCAVarianceParams.n_componentsPCAVarianceParams.score_meansPCAVarianceParams.score_stds
ParticleSizeConfigParticleSizeConfig.distributionParticleSizeConfig.reference_size_umParticleSizeConfig.size_effect_strengthParticleSizeConfig.wavelength_exponentParticleSizeConfig.include_path_length_effectParticleSizeConfig.path_length_sensitivityParticleSizeConfig.distributionParticleSizeConfig.include_path_length_effectParticleSizeConfig.path_length_sensitivityParticleSizeConfig.reference_size_umParticleSizeConfig.size_effect_strengthParticleSizeConfig.wavelength_exponent
ParticleSizeDistributionParticleSizeDistribution.mean_size_umParticleSizeDistribution.std_size_umParticleSizeDistribution.min_size_umParticleSizeDistribution.max_size_umParticleSizeDistribution.distributionParticleSizeDistribution.distributionParticleSizeDistribution.max_size_umParticleSizeDistribution.mean_size_umParticleSizeDistribution.min_size_umParticleSizeDistribution.sample()ParticleSizeDistribution.std_size_um
PartitionConfigPreprocessingInferencePreprocessingInference.preprocessing_typePreprocessingInference.confidencePreprocessingInference.is_preprocessedPreprocessingInference.global_meanPreprocessingInference.global_rangePreprocessingInference.zero_crossing_ratioPreprocessingInference.per_sample_std_variationPreprocessingInference.oscillation_frequencyPreprocessingInference.suggested_inversePreprocessingInference.confidencePreprocessingInference.global_meanPreprocessingInference.global_rangePreprocessingInference.is_preprocessedPreprocessingInference.oscillation_frequencyPreprocessingInference.per_sample_std_variationPreprocessingInference.preprocessing_typePreprocessingInference.suggested_inversePreprocessingInference.zero_crossing_ratio
PreprocessingTypePriorSamplerPriorSampler.sample()PriorSampler.sample_batch()PriorSampler.sample_components()PriorSampler.sample_domain()PriorSampler.sample_for_domain()PriorSampler.sample_for_instrument()PriorSampler.sample_instrument()PriorSampler.sample_instrument_category()PriorSampler.sample_matrix_type()PriorSampler.sample_measurement_mode()PriorSampler.sample_n_samples()PriorSampler.sample_noise_level()PriorSampler.sample_particle_size()PriorSampler.sample_target_config()PriorSampler.sample_temperature()
ProceduralComponentConfigProceduralComponentConfig.n_fundamental_bandsProceduralComponentConfig.include_overtonesProceduralComponentConfig.max_overtone_orderProceduralComponentConfig.include_combinationsProceduralComponentConfig.max_combinationsProceduralComponentConfig.h_bond_strengthProceduralComponentConfig.h_bond_variabilityProceduralComponentConfig.anharmonicityProceduralComponentConfig.anharmonicity_variabilityProceduralComponentConfig.amplitude_variabilityProceduralComponentConfig.bandwidth_variabilityProceduralComponentConfig.wavelength_rangeProceduralComponentConfig.functional_groupsProceduralComponentConfig.combination_amplitude_factorProceduralComponentConfig.amplitude_variabilityProceduralComponentConfig.anharmonicityProceduralComponentConfig.anharmonicity_variabilityProceduralComponentConfig.bandwidth_variabilityProceduralComponentConfig.combination_amplitude_factorProceduralComponentConfig.functional_groupsProceduralComponentConfig.h_bond_strengthProceduralComponentConfig.h_bond_variabilityProceduralComponentConfig.include_combinationsProceduralComponentConfig.include_overtonesProceduralComponentConfig.max_combinationsProceduralComponentConfig.max_overtone_orderProceduralComponentConfig.n_fundamental_bandsProceduralComponentConfig.wavelength_range
ProceduralComponentGeneratorProductGeneratorProductTemplateProductTemplate.nameProductTemplate.descriptionProductTemplate.categoryProductTemplate.domainProductTemplate.componentsProductTemplate.default_targetProductTemplate.tagsProductTemplate.referencesProductTemplate.__post_init__()ProductTemplate.categoryProductTemplate.component_namesProductTemplate.componentsProductTemplate.default_targetProductTemplate.descriptionProductTemplate.domainProductTemplate.info()ProductTemplate.nameProductTemplate.referencesProductTemplate.tags
RealBandFitResultRealBandFitResult.band_namesRealBandFitResult.band_centersRealBandFitResult.amplitudesRealBandFitResult.sigmasRealBandFitResult.baseline_coefficientsRealBandFitResult.fitted_spectrumRealBandFitResult.residualsRealBandFitResult.r_squaredRealBandFitResult.rmseRealBandFitResult.n_bandsRealBandFitResult.wavelengthsRealBandFitResult.band_assignmentsRealBandFitResult.amplitudesRealBandFitResult.band_assignmentsRealBandFitResult.band_centersRealBandFitResult.band_namesRealBandFitResult.baseline_coefficientsRealBandFitResult.fitted_spectrumRealBandFitResult.n_bandsRealBandFitResult.r_squaredRealBandFitResult.residualsRealBandFitResult.rmseRealBandFitResult.sigmasRealBandFitResult.summary()RealBandFitResult.top_bands()RealBandFitResult.wavelengths
RealBandFitterRealDataFitterRealDataFitter.source_propertiesRealDataFitter.fitted_paramsRealDataFitter.apply_matching_preprocessing()RealDataFitter.create_matched_generator()RealDataFitter.evaluate_similarity()RealDataFitter.fit()RealDataFitter.fit_from_path()RealDataFitter.fitted_paramsRealDataFitter.get_tuning_recommendations()RealDataFitter.source_properties
RealismMetricReflectanceConfigReflectanceConfig.geometryReflectanceConfig.reference_materialReflectanceConfig.reference_reflectanceReflectanceConfig.illumination_angleReflectanceConfig.collection_angleReflectanceConfig.sample_presentationReflectanceConfig.collection_angleReflectanceConfig.geometryReflectanceConfig.illumination_angleReflectanceConfig.reference_materialReflectanceConfig.reference_reflectanceReflectanceConfig.sample_presentation
ScatteringCoefficientConfigScatteringCoefficientConfig.baseline_scatteringScatteringCoefficientConfig.wavelength_exponentScatteringCoefficientConfig.particle_size_factorScatteringCoefficientConfig.sample_variationScatteringCoefficientConfig.wavelength_reference_nmScatteringCoefficientConfig.baseline_scatteringScatteringCoefficientConfig.particle_size_factorScatteringCoefficientConfig.sample_variationScatteringCoefficientConfig.wavelength_exponentScatteringCoefficientConfig.wavelength_reference_nm
ScatteringConfigScatteringConfig.baseline_scatteringScatteringConfig.wavelength_exponentScatteringConfig.particle_size_umScatteringConfig.particle_size_variationScatteringConfig.sample_to_sample_variationScatteringConfig.baseline_scatteringScatteringConfig.particle_size_umScatteringConfig.particle_size_variationScatteringConfig.sample_to_sample_variationScatteringConfig.wavelength_exponent
ScatteringEffectsConfigScatteringEffectsConfig.modelScatteringEffectsConfig.particle_sizeScatteringEffectsConfig.emscScatteringEffectsConfig.scattering_coefficientScatteringEffectsConfig.enable_particle_sizeScatteringEffectsConfig.enable_emscScatteringEffectsConfig.emscScatteringEffectsConfig.enable_emscScatteringEffectsConfig.enable_particle_sizeScatteringEffectsConfig.modelScatteringEffectsConfig.particle_sizeScatteringEffectsConfig.scattering_coefficient
ScatteringInferenceScatteringInference.has_scatter_effectsScatteringInference.estimated_particle_size_umScatteringInference.multiplicative_scatter_stdScatteringInference.additive_scatter_stdScatteringInference.baseline_curvatureScatteringInference.snv_correctableScatteringInference.msc_correctableScatteringInference.additive_scatter_stdScatteringInference.baseline_curvatureScatteringInference.estimated_particle_size_umScatteringInference.has_scatter_effectsScatteringInference.msc_correctableScatteringInference.multiplicative_scatter_stdScatteringInference.snv_correctable
ScatteringModelSensorConfigSensorConfig.detector_typeSensorConfig.wavelength_rangeSensorConfig.spectral_resolutionSensorConfig.noise_levelSensorConfig.gainSensorConfig.overlap_rangeSensorConfig.detector_typeSensorConfig.gainSensorConfig.noise_levelSensorConfig.overlap_rangeSensorConfig.spectral_resolutionSensorConfig.wavelength_range
SourceConfigSourceConfig.nameSourceConfig.source_typeSourceConfig.n_featuresSourceConfig.wavelength_startSourceConfig.wavelength_endSourceConfig.wavelength_stepSourceConfig.componentsSourceConfig.complexitySourceConfig.distributionSourceConfig.correlation_with_targetSourceConfig.complexitySourceConfig.componentsSourceConfig.correlation_with_targetSourceConfig.distributionSourceConfig.from_dict()SourceConfig.n_featuresSourceConfig.nameSourceConfig.source_typeSourceConfig.wavelength_endSourceConfig.wavelength_startSourceConfig.wavelength_step
SpectralComponentSpectralComponent.nameSpectralComponent.bandsSpectralComponent.correlation_groupSpectralComponent.categorySpectralComponent.subcategorySpectralComponent.synonymsSpectralComponent.formulaSpectralComponent.cas_numberSpectralComponent.referencesSpectralComponent.tagsSpectralComponent.bandsSpectralComponent.cas_numberSpectralComponent.categorySpectralComponent.compute()SpectralComponent.correlation_groupSpectralComponent.formulaSpectralComponent.has_bands_in_range()SpectralComponent.info()SpectralComponent.is_normalized()SpectralComponent.nameSpectralComponent.normalized()SpectralComponent.referencesSpectralComponent.subcategorySpectralComponent.synonymsSpectralComponent.tagsSpectralComponent.validate()
SpectralPropertiesSpectralProperties.nameSpectralProperties.n_samplesSpectralProperties.n_wavelengthsSpectralProperties.wavelengthsSpectralProperties.mean_spectrumSpectralProperties.std_spectrumSpectralProperties.global_meanSpectralProperties.global_stdSpectralProperties.global_rangeSpectralProperties.mean_slopeSpectralProperties.slope_stdSpectralProperties.mean_curvatureSpectralProperties.skewnessSpectralProperties.kurtosisSpectralProperties.noise_estimateSpectralProperties.snr_estimateSpectralProperties.pca_explained_varianceSpectralProperties.pca_n_components_95SpectralProperties.n_peaks_meanSpectralProperties.peak_positionsSpectralProperties.peak_wavenumbersSpectralProperties.effective_resolutionSpectralProperties.noise_correlation_lengthSpectralProperties.wavelength_rangeSpectralProperties.baseline_offsetSpectralProperties.kubelka_munk_linearitySpectralProperties.baseline_convexitySpectralProperties.water_band_variationSpectralProperties.oh_band_positionsSpectralProperties.temperature_sensitivity_scoreSpectralProperties.scatter_baseline_slopeSpectralProperties.scatter_baseline_curvatureSpectralProperties.sample_to_sample_offset_stdSpectralProperties.sample_to_sample_slope_stdSpectralProperties.protein_band_intensitySpectralProperties.carbohydrate_band_intensitySpectralProperties.lipid_band_intensitySpectralProperties.water_band_intensitySpectralProperties.baseline_convexitySpectralProperties.baseline_offsetSpectralProperties.carbohydrate_band_intensitySpectralProperties.center_noise_stdSpectralProperties.curvature_stdSpectralProperties.edge_curvature_asymmetrySpectralProperties.edge_curvature_intensitySpectralProperties.effective_resolutionSpectralProperties.global_meanSpectralProperties.global_rangeSpectralProperties.global_stdSpectralProperties.has_boundary_rise_leftSpectralProperties.has_boundary_rise_rightSpectralProperties.kubelka_munk_linearitySpectralProperties.kurtosisSpectralProperties.left_edge_noise_stdSpectralProperties.left_edge_slopeSpectralProperties.lipid_band_intensitySpectralProperties.mean_curvatureSpectralProperties.mean_slopeSpectralProperties.mean_spectrumSpectralProperties.n_peaks_meanSpectralProperties.n_samplesSpectralProperties.n_wavelengthsSpectralProperties.nameSpectralProperties.noise_correlation_lengthSpectralProperties.noise_estimateSpectralProperties.oh_band_positionsSpectralProperties.pca_explained_varianceSpectralProperties.pca_n_components_95SpectralProperties.peak_positionsSpectralProperties.peak_wavenumbersSpectralProperties.protein_band_intensitySpectralProperties.right_edge_noise_stdSpectralProperties.right_edge_slopeSpectralProperties.sample_to_sample_offset_stdSpectralProperties.sample_to_sample_slope_stdSpectralProperties.scatter_baseline_curvatureSpectralProperties.scatter_baseline_slopeSpectralProperties.skewnessSpectralProperties.slope_stdSpectralProperties.slopesSpectralProperties.snr_estimateSpectralProperties.std_spectrumSpectralProperties.temperature_sensitivity_scoreSpectralProperties.water_band_intensitySpectralProperties.water_band_variationSpectralProperties.wavelength_rangeSpectralProperties.wavelengths
SpectralRealismScoreSpectralRealismScore.correlation_length_overlapSpectralRealismScore.derivative_ks_pvalueSpectralRealismScore.peak_density_ratioSpectralRealismScore.baseline_curvature_overlapSpectralRealismScore.snr_magnitude_matchSpectralRealismScore.adversarial_aucSpectralRealismScore.overall_passSpectralRealismScore.metric_resultsSpectralRealismScore.warningsSpectralRealismScore.adversarial_aucSpectralRealismScore.baseline_curvature_overlapSpectralRealismScore.correlation_length_overlapSpectralRealismScore.derivative_ks_pvalueSpectralRealismScore.metric_resultsSpectralRealismScore.overall_passSpectralRealismScore.peak_density_ratioSpectralRealismScore.snr_magnitude_matchSpectralRealismScore.summary()SpectralRealismScore.to_dict()SpectralRealismScore.warnings
SpectralRegionSyntheticDatasetBuilderSyntheticDatasetBuilder.stateSyntheticDatasetBuilder.__repr__()SyntheticDatasetBuilder.build()SyntheticDatasetBuilder.build_arrays()SyntheticDatasetBuilder.build_dataset()SyntheticDatasetBuilder.export()SyntheticDatasetBuilder.export_to_csv()SyntheticDatasetBuilder.fit_to()SyntheticDatasetBuilder.from_config()SyntheticDatasetBuilder.get_config()SyntheticDatasetBuilder.with_aggregate()SyntheticDatasetBuilder.with_batch_effects()SyntheticDatasetBuilder.with_classification()SyntheticDatasetBuilder.with_complex_target_landscape()SyntheticDatasetBuilder.with_features()SyntheticDatasetBuilder.with_metadata()SyntheticDatasetBuilder.with_nonlinear_targets()SyntheticDatasetBuilder.with_output()SyntheticDatasetBuilder.with_partitions()SyntheticDatasetBuilder.with_sources()SyntheticDatasetBuilder.with_target_complexity()SyntheticDatasetBuilder.with_targets()SyntheticDatasetBuilder.with_wavelengths()
SyntheticDatasetConfigSyntheticDatasetConfig.n_samplesSyntheticDatasetConfig.random_stateSyntheticDatasetConfig.featuresSyntheticDatasetConfig.targetsSyntheticDatasetConfig.metadataSyntheticDatasetConfig.partitionsSyntheticDatasetConfig.batch_effectsSyntheticDatasetConfig.outputSyntheticDatasetConfig.nameSyntheticDatasetConfig.__post_init__()SyntheticDatasetConfig.batch_effectsSyntheticDatasetConfig.confoundersSyntheticDatasetConfig.featuresSyntheticDatasetConfig.metadataSyntheticDatasetConfig.multi_regimeSyntheticDatasetConfig.n_samplesSyntheticDatasetConfig.nameSyntheticDatasetConfig.nonlinearSyntheticDatasetConfig.outputSyntheticDatasetConfig.partitionsSyntheticDatasetConfig.random_stateSyntheticDatasetConfig.targets
SyntheticNIRSGeneratorSyntheticNIRSGenerator.wavelengthsSyntheticNIRSGenerator.n_wavelengthsSyntheticNIRSGenerator.librarySyntheticNIRSGenerator.ESyntheticNIRSGenerator.paramsSyntheticNIRSGenerator.instrumentSyntheticNIRSGenerator.measurement_mode_simulatorSyntheticNIRSGenerator.__repr__()SyntheticNIRSGenerator.create_dataset()SyntheticNIRSGenerator.generate()SyntheticNIRSGenerator.generate_batch_effects()SyntheticNIRSGenerator.generate_concentrations()SyntheticNIRSGenerator.generate_from_concentrations()
TargetConfigTargetGeneratorTemperatureConfigTemperatureConfig.reference_temperatureTemperatureConfig.sample_temperatureTemperatureConfig.temperature_variationTemperatureConfig.enable_shiftTemperatureConfig.enable_intensityTemperatureConfig.enable_broadeningTemperatureConfig.region_specificTemperatureConfig.custom_regionsTemperatureConfig.custom_regionsTemperatureConfig.delta_temperatureTemperatureConfig.enable_broadeningTemperatureConfig.enable_intensityTemperatureConfig.enable_shiftTemperatureConfig.reference_temperatureTemperatureConfig.region_specificTemperatureConfig.sample_temperatureTemperatureConfig.temperature_variation
TemperatureEffectParamsTemperatureEffectParams.wavelength_rangeTemperatureEffectParams.shift_per_degreeTemperatureEffectParams.intensity_change_per_degreeTemperatureEffectParams.broadening_per_degreeTemperatureEffectParams.referenceTemperatureEffectParams.broadening_per_degreeTemperatureEffectParams.intensity_change_per_degreeTemperatureEffectParams.referenceTemperatureEffectParams.shift_per_degreeTemperatureEffectParams.wavelength_range
TransflectanceConfigTransflectanceConfig.path_length_mmTransflectanceConfig.reflector_typeTransflectanceConfig.reflector_reflectanceTransflectanceConfig.spacer_thickness_mmTransflectanceConfig.path_length_mmTransflectanceConfig.reflector_reflectanceTransflectanceConfig.reflector_typeTransflectanceConfig.spacer_thickness_mm
TransmittanceConfigValidationErrorVarianceFitResultVarianceFitterVariationTypeaggregate_info()apply_hydrogen_bonding_shift()available_components()band_info()band_summary()benchmark_backends()calculate_combination_band()calculate_overtone_position()classify_wavelength_extended()classify_wavelength_zone()compare_datasets()component_info()compute_adversarial_validation_auc()compute_baseline_curvature()compute_correlation_length()compute_derivative_statistics()compute_distribution_overlap()compute_peak_density()compute_snr()compute_spectral_properties()compute_spectral_realism_scorecard()convert_bandwidth_to_wavelength()create_atr_simulator()create_domain_aware_library()create_reflectance_simulator()create_synthetic_matching_benchmark()create_transmittance_simulator()detect_best_backend()expand_aggregate()export_to_csv()export_to_folder()fit_components()fit_components_optimized()fit_real_bands()fit_to_real_data()fit_variance()generate_band_spectrum()generate_classification_targets()generate_multi_source()generate_product_samples()generate_regression_targets()generate_sample_metadata()get_acceleration_speedup_estimate()get_aggregate()get_all_zones_extended()get_all_zones_wavelength()get_backend_info()get_band()get_bands_by_compound()get_bands_by_overtone()get_bands_by_tag()get_bands_in_range()get_benchmark_info()get_benchmark_spectral_properties()get_component()get_datasets_by_domain()get_default_noise_config()get_detector_response()get_detector_wavelength_range()get_domain_compatible_instruments()get_domain_components()get_domain_config()get_domains_for_component()get_instrument_archetype()get_instrument_typical_modes()get_instrument_wavelength_info()get_instrument_wavelengths()get_instruments_by_category()get_nir_zone()get_predefined_components()get_product_template()get_temperature_effect_regions()get_zone_wavelength_range()is_gpu_available()is_nir_region()is_visible_region()list_aggregate_categories()list_aggregate_domains()list_aggregates()list_all_tags()list_bands()list_benchmark_datasets()list_categories()list_detector_types()list_domains()list_functional_groups()list_instrument_archetypes()list_instrument_wavelength_grids()list_product_categories()list_product_domains()list_product_templates()load_benchmark_dataset()multiscale_derivative_fit()multiscale_fit()normalize_component_amplitudes()product_template_info()quick_realism_check()sample_prior()sample_prior_batch()search_components()simulate_detector_effects()validate_against_benchmark()validate_aggregates()validate_bands()validate_component_coverage()validate_concentrations()validate_predefined_components()validate_spectra()validate_synthetic_output()validate_wavelengths()wavelength_to_wavenumber()wavenumber_to_wavelength()
Submodules
- nirs4all.data.binning module
- nirs4all.data.config module
- nirs4all.data.config_parser module
- nirs4all.data.dataset module
SpectroDatasetSpectroDataset.nameSpectroDataset.featuresSpectroDataset.targetsSpectroDataset.metadata_accessorSpectroDataset.foldsSpectroDataset.__str__()SpectroDataset.add_features()SpectroDataset.add_merged_features()SpectroDataset.add_metadata()SpectroDataset.add_metadata_column()SpectroDataset.add_processed_targets()SpectroDataset.add_samples()SpectroDataset.add_samples_batch()SpectroDataset.add_targets()SpectroDataset.aggregateSpectroDataset.aggregate_exclude_outliersSpectroDataset.aggregate_methodSpectroDataset.aggregate_outlier_thresholdSpectroDataset.augment_samples()SpectroDataset.detect_signal_type()SpectroDataset.features_processings()SpectroDataset.features_sources()SpectroDataset.float_headers()SpectroDataset.foldsSpectroDataset.get_dataset_metadata()SpectroDataset.get_merged_features()SpectroDataset.header_unit()SpectroDataset.headers()SpectroDataset.index_column()SpectroDataset.is_classificationSpectroDataset.is_multi_source()SpectroDataset.is_regressionSpectroDataset.keep_sources()SpectroDataset.metadata()SpectroDataset.metadata_column()SpectroDataset.metadata_columnsSpectroDataset.metadata_numeric()SpectroDataset.n_sourcesSpectroDataset.num_classesSpectroDataset.num_featuresSpectroDataset.num_foldsSpectroDataset.num_samplesSpectroDataset.print_summary()SpectroDataset.replace_features()SpectroDataset.reshape_reps_to_preprocessings()SpectroDataset.reshape_reps_to_sources()SpectroDataset.set_aggregate()SpectroDataset.set_aggregate_exclude_outliers()SpectroDataset.set_aggregate_method()SpectroDataset.set_content_hash()SpectroDataset.set_folds()SpectroDataset.set_signal_type()SpectroDataset.set_source_path()SpectroDataset.set_task_type()SpectroDataset.short_preprocessings_str()SpectroDataset.signal_type()SpectroDataset.signal_typesSpectroDataset.task_typeSpectroDataset.update_features()SpectroDataset.update_metadata()SpectroDataset.wavelengths_cm1()SpectroDataset.wavelengths_nm()SpectroDataset.x()SpectroDataset.y()
- nirs4all.data.ensemble_utils module
- nirs4all.data.features module
FeaturesFeatures.sourcesFeatures.cacheFeatures.add_samples()Features.add_samples_batch_3d()Features.augment_samples()Features.headers()Features.headers_listFeatures.keep_sources()Features.num_featuresFeatures.num_processingsFeatures.num_samplesFeatures.preprocessing_strFeatures.update_features()Features.x()
- nirs4all.data.indexer module
IndexerIndexer.__repr__()Indexer.__str__()Indexer.add_processings()Indexer.add_rows()Indexer.add_rows_dict()Indexer.add_samples()Indexer.add_samples_dict()Indexer.augment_rows()Indexer.default_valuesIndexer.dfIndexer.get_augmented_for_origins()Indexer.get_column_values()Indexer.get_excluded_samples()Indexer.get_exclusion_summary()Indexer.get_origin_for_sample()Indexer.mark_excluded()Indexer.mark_included()Indexer.next_row_index()Indexer.next_sample_index()Indexer.register_samples()Indexer.register_samples_dict()Indexer.replace_processings()Indexer.reset_exclusions()Indexer.reset_processings()Indexer.uniques()Indexer.update_by_filter()Indexer.update_by_indices()Indexer.x_indices()Indexer.y_indices()
- nirs4all.data.io module
- nirs4all.data.metadata module
- nirs4all.data.predictions module
PredictionResultPredictionResult.__repr__()PredictionResult.__str__()PredictionResult.config_namePredictionResult.dataset_namePredictionResult.eval_score()PredictionResult.fold_idPredictionResult.idPredictionResult.model_namePredictionResult.op_counterPredictionResult.save_to_csv()PredictionResult.step_idxPredictionResult.summary()
PredictionResultsListPredictionsPredictions.__len__()Predictions.__repr__()Predictions.__str__()Predictions.add_prediction()Predictions.add_predictions()Predictions.aggregate()Predictions.archive_to_catalog()Predictions.clear()Predictions.clear_caches()Predictions.compare_across_datasets()Predictions.filter_by_branch()Predictions.filter_by_criteria()Predictions.filter_predictions()Predictions.get_best()Predictions.get_cache_stats()Predictions.get_configs()Predictions.get_datasets()Predictions.get_entry_partitions()Predictions.get_folds()Predictions.get_models()Predictions.get_models_before_step()Predictions.get_oof_predictions()Predictions.get_partitions()Predictions.get_prediction_by_id()Predictions.get_predictions_by_step()Predictions.get_similar()Predictions.get_summary_stats()Predictions.get_unique_values()Predictions.list_runs()Predictions.load()Predictions.load_from_file()Predictions.load_from_file_cls()Predictions.load_from_parquet()Predictions.merge_parquet_files()Predictions.merge_predictions()Predictions.num_predictionsPredictions.pred_long_string()Predictions.pred_short_string()Predictions.query_best()Predictions.save_all_to_csv()Predictions.save_predictions_to_csv()Predictions.save_to_file()Predictions.save_to_parquet()Predictions.to_dataframe()Predictions.to_dicts()Predictions.to_pandas()Predictions.top()
- nirs4all.data.signal_type module
SignalTypeSignalType.ABSORBANCESignalType.AUTOSignalType.KUBELKA_MUNKSignalType.LOG_1_RSignalType.LOG_1_TSignalType.PREPROCESSEDSignalType.REFLECTANCESignalType.REFLECTANCE_PERCENTSignalType.TRANSMITTANCESignalType.TRANSMITTANCE_PERCENTSignalType.UNKNOWNSignalType.from_string()SignalType.is_absorbance_likeSignalType.is_determinableSignalType.is_fractionSignalType.is_percentSignalType.is_reflectance_basedSignalType.is_transmittance_based
SignalTypeDetectordetect_signal_type()normalize_signal_type()
- nirs4all.data.targets module
TargetsTargets.num_samplesTargets.num_targetsTargets.num_classesTargets.num_processingsTargets.processing_idsTargets.__repr__()Targets.__str__()Targets.add_processed_targets()Targets.add_targets()Targets.get_processing_ancestry()Targets.get_targets()Targets.get_task_type_for_processing()Targets.invert_transform()Targets.num_classesTargets.num_processingsTargets.num_samplesTargets.num_targetsTargets.processing_idsTargets.set_task_type()Targets.task_typeTargets.task_type_forcedTargets.transform_predictions()Targets.y()
- nirs4all.data.types module
Module contents
SpectroDataset - A specialized dataset API for spectroscopy data.
This module provides zero-copy, multi-source aware dataset management with transparent versioning and fine-grained indexing capabilities.
- Submodules:
synthetic: Synthetic NIRS spectra generation tools.
- class nirs4all.data.ColumnConfig(*, features: List[int] | List[str] | str | Dict[str, Any] | None = None, targets: List[int] | List[str] | str | Dict[str, Any] | None = None, metadata: List[int] | List[str] | str | Dict[str, Any] | None = None)[source]
Bases: BaseModel
Configuration for column selection and role assignment.
This is a stub for future implementation of the files syntax. Currently, column selection is handled by the loader directly.
- model_config = {'extra': 'forbid'}
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict (pydantic.config.ConfigDict).
- exception nirs4all.data.ColumnSelectionError[source]
Bases: Exception
Raised when column selection fails.
- class nirs4all.data.ColumnSelector(case_sensitive: bool = True)[source]
Bases: object
Flexible column selector for DataFrames.
Supports multiple selection methods:
- By name: ["col1", "col2"] or "col_name"
- By index: [0, 1, 2] or 0
- By range: "2:-1" (slice syntax as string)
- By regex pattern: {"regex": "^feature_.*"}
- By exclusion: {"exclude": ["id", "date"]}
- Combined: {"include": [0, 1], "exclude": ["id"]}
Example
>>> selector = ColumnSelector()
>>> result = selector.select(df, "2:-1")
>>> print(result.names)  # Column names in range
>>> print(result.data)  # Selected columns as DataFrame
- parse_selection(selection: Any, available_columns: List[str]) List[int][source]
Parse a selection specification and return column indices.
This is a convenience method for when you don’t have a DataFrame but want to validate and resolve a selection.
- Parameters:
selection – Column selection specification.
available_columns – List of available column names.
- Returns:
List of column indices.
- Raises:
ColumnSelectionError – If selection is invalid.
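To illustrate what resolving a selection against a list of column names can look like, here is a minimal pure-Python sketch for the range form ("2:-1"). It is not the library's implementation: the helper name resolve_range and the sample column names are hypothetical, and it raises ValueError rather than ColumnSelectionError so that it stays self-contained.

```python
from typing import List


def resolve_range(selection: str, available_columns: List[str]) -> List[int]:
    """Hypothetical helper: turn a slice string such as "2:-1" into indices."""
    parts = selection.split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"not a range string: {selection!r}")
    # An empty field means "no bound", as in Python slice syntax.
    bounds = [int(p) if p.strip() else None for p in parts]
    return list(range(len(available_columns)))[slice(*bounds)]


# Illustrative column names only.
columns = ["id", "date", "w1100", "w1102", "w1104", "y"]
indices = resolve_range("2:-1", columns)
names = [columns[i] for i in indices]
```

With these six columns, "2:-1" skips the first two columns and the last one, which is a common layout when identifiers come first and the target comes last.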
- select(df: DataFrame, selection: int | str | List[int] | List[str] | Dict[str, Any] | slice | None) SelectionResult[source]
Select columns from a DataFrame.
- Parameters:
df – The DataFrame to select columns from.
selection – Column selection specification. Can be:
- None: Select all columns
- int: Single column index
- str: Single column name or range string ("2:-1")
- List[int]: List of column indices
- List[str]: List of column names
- Dict: Complex selection (see class docstring)
- Returns:
SelectionResult with indices, names, and selected data.
- Raises:
ColumnSelectionError – If selection is invalid or columns not found.
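The dictionary forms listed in the class docstring ("regex", "include", "exclude") can be pictured with the following pure-Python sketch. It is an assumption-laden illustration, not the library's code: the function name resolve_dict is hypothetical, and the order of operations (apply "regex" or "include" first, then drop "exclude") is a guess at reasonable semantics.

```python
import re
from typing import Any, Dict, List, Union


def resolve_dict(selection: Dict[str, Any], columns: List[str]) -> List[int]:
    """Hypothetical resolver for dictionary-style column selections."""

    def to_index(item: Union[int, str]) -> int:
        # Integers are taken as positions, strings as column names.
        return item if isinstance(item, int) else columns.index(item)

    if "regex" in selection:
        pattern = re.compile(selection["regex"])
        chosen = [i for i, name in enumerate(columns) if pattern.search(name)]
    elif "include" in selection:
        chosen = [to_index(item) for item in selection["include"]]
    else:
        # No positive criterion: start from every column.
        chosen = list(range(len(columns)))

    excluded = {to_index(item) for item in selection.get("exclude", [])}
    return [i for i in chosen if i not in excluded]


# Illustrative column names only.
columns = ["id", "feature_a", "feature_b", "date"]
by_regex = resolve_dict({"regex": "^feature_.*"}, columns)
by_exclusion = resolve_dict({"exclude": ["id", "date"]}, columns)
combined = resolve_dict({"include": [0, 1], "exclude": ["id"]}, columns)
```

In this sketch an unknown column name surfaces as a ValueError from list.index; the documented select() method reports such failures as ColumnSelectionError instead.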
- class nirs4all.data.ConfigNormalizer(parsers: List[BaseParser] | None = None)[source]
Bases: object
Normalizes dataset configurations from various input formats.
This class combines multiple parsers to handle:
- Folder paths (auto-scanning)
- JSON/YAML config files
- Dictionary configurations (legacy format)
- Sources configurations (multi-source format)
- Variations configurations (preprocessed data / feature variations)
- In-memory numpy arrays
All inputs are normalized to a canonical dictionary format that can be validated and processed by the loader.
Example
```python
from nirs4all.data import ConfigNormalizer

normalizer = ConfigNormalizer()

# From folder path
config, name = normalizer.normalize("/path/to/data/")

# From config file
config, name = normalizer.normalize("config.yaml")

# From dictionary
config, name = normalizer.normalize({"train_x": "data/X.csv"})

# From sources format
config, name = normalizer.normalize({
    "sources": [
        {"name": "NIR", "train_x": "NIR_train.csv"},
        {"name": "MIR", "train_x": "MIR_train.csv"}
    ]
})

# From variations format
config, name = normalizer.normalize({
    "variations": [
        {"name": "raw", "train_x": "X_raw.csv"},
        {"name": "snv", "train_x": "X_snv.csv"}
    ],
    "variation_mode": "separate"
})
```
- class nirs4all.data.ConfigValidator(check_file_existence: bool = False, custom_validators: List[Callable] | None = None)[source]
Bases: object
Validator for dataset configurations.
Provides validation rules and methods for checking dataset configurations. Supports both legacy and new format configurations.
Example
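```python
from nirs4all.data import ConfigValidator

validator = ConfigValidator()
# config_dict is a dataset configuration dictionary defined elsewhere.
result = validator.validate(config_dict)
if not result.is_valid:
    for error in result.errors:
        print(f"Error: {error}")
```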
- class nirs4all.data.DatasetConfigSchema(*, name: str | None = None, description: str | None = None, task_type: TaskType | None = None, train_x: Any | None = None, train_y: Any | None = None, train_group: Any | None = None, test_x: Any | None = None, test_y: Any | None = None, test_group: Any | None = None, train_x_filter: List[int] | None = None, train_y_filter: List[int] | None = None, train_group_filter: List[int] | None = None, test_x_filter: List[int] | None = None, test_y_filter: List[int] | None = None, test_group_filter: List[int] | None = None, global_params: LoadingParams | None = None, train_params: LoadingParams | None = None, test_params: LoadingParams | None = None, train_x_params: LoadingParams | None = None, train_y_params: LoadingParams | None = None, train_group_params: LoadingParams | None = None, test_x_params: LoadingParams | None = None, test_y_params: LoadingParams | None = None, test_group_params: LoadingParams | None = None, aggregate: str | bool | None = None, aggregate_method: AggregateMethod | None = None, aggregate_exclude_outliers: bool | None = None, files: List[FileConfig] | None = None, sources: List[SourceConfig] | None = None, shared_targets: SharedTargetsConfig | List[SharedTargetsConfig] | None = None, shared_metadata: SharedMetadataConfig | List[SharedMetadataConfig] | None = None, variations: List[VariationConfig] | None = None, variation_mode: VariationMode | None = None, variation_select: List[str] | None = None, variation_prefix: bool | None = None, folds: FoldConfig | List[Dict[str, Any]] | str | None = None, **extra_data: Any)[source]
Bases: BaseModel
Complete dataset configuration schema.
This model represents the normalized, validated form of a dataset configuration. It supports both the legacy format (train_x, test_x, etc.) and is designed to be extensible for the new files syntax.
All input configurations are normalized to this schema before processing.
- aggregate_method: AggregateMethod | None
- files: List[FileConfig] | None
- get_effective_params(partition: str, data_type: str) LoadingParams[source]
Get effective loading parameters for a specific data file.
Parameters are merged with precedence: specific > partition > global.
- Parameters:
partition – ‘train’ or ‘test’
data_type – ‘x’, ‘y’, or ‘group’
- Returns:
Merged LoadingParams.
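The precedence rule stated for this method (specific > partition > global) can be illustrated with plain dictionaries. This is a minimal sketch and not the library's implementation: `merge_params` is a hypothetical helper, and treating `None` as "not set" is an assumption.

```python
from typing import Any, Dict, Optional

Params = Dict[str, Any]


def merge_params(*layers: Optional[Params]) -> Params:
    """Merge layers given from lowest to highest priority.

    Hypothetical helper: a value of None is treated as "not set"
    and never overrides a value from a lower-priority layer.
    """
    merged: Params = {}
    for layer in layers:
        for key, value in (layer or {}).items():
            if value is not None:
                merged[key] = value
    return merged


# Illustrative values only.
global_params = {"delimiter": ";", "has_header": True, "encoding": "utf-8"}
train_params = {"delimiter": ",", "encoding": None}   # partition level
train_x_params = {"has_header": False}                # file-specific level

effective = merge_params(global_params, train_params, train_x_params)
print(effective)
```

Here the partition level overrides the delimiter, the file-specific level overrides `has_header`, and the encoding falls through from the global level.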
- get_selected_variations() List[VariationConfig][source]
Get the variations to use based on variation_mode and variation_select.
For mode=’select’, returns only the selected variations. For other modes, returns all variations.
- Returns:
List of VariationConfig objects to use.
- get_source_count() int[source]
Get the number of feature sources.
- Returns:
Number of sources (1 for single-source, >1 for multi-source).
- get_source_names() List[str][source]
Get names of all sources in this config.
- Returns:
List of source names, or empty list if not multi-source.
- get_variation_count() int[source]
Get the number of feature variations.
- Returns:
Number of variations.
- get_variation_names() List[str][source]
Get names of all variations in this config.
- Returns:
List of variation names, or empty list if no variations.
- global_params: LoadingParams | None
- model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod normalize_aggregate_method(v: Any) AggregateMethod | None[source]
Normalize aggregate_method to enum.
- classmethod normalize_variation_mode(v: Any) VariationMode | None[source]
Normalize variation_mode to enum.
- classmethod parse_loading_params(v: Any) LoadingParams | None[source]
Parse dict to LoadingParams if needed.
Parse shared metadata configuration.
Parse shared targets configuration.
- classmethod parse_sources(v: Any) List[SourceConfig] | None[source]
Parse sources list to SourceConfig objects.
- classmethod parse_variations(v: Any) List[VariationConfig] | None[source]
Parse variations list to VariationConfig objects.
- sources: List[SourceConfig] | None
- test_group_params: LoadingParams | None
- test_params: LoadingParams | None
- test_x_params: LoadingParams | None
- test_y_params: LoadingParams | None
- to_legacy_format() Dict[str, Any][source]
Convert sources or variations format to legacy format for backward compatibility.
This converts the sources/variations syntax to the train_x/test_x array syntax that existing loaders understand.
- Returns:
Dictionary with legacy format configuration.
- train_group_params: LoadingParams | None
- train_params: LoadingParams | None
- train_x_params: LoadingParams | None
- train_y_params: LoadingParams | None
- validate_data_sources() DatasetConfigSchema[source]
Validate that at least one data source is specified.
- variation_mode: VariationMode | None
- variations: List[VariationConfig] | None
- variations_to_legacy_format() Dict[str, Any][source]
Convert variations format to legacy format for backward compatibility.
This converts the variations syntax to the train_x/test_x format that existing loaders understand. The conversion depends on variation_mode:
- separate: Returns config for first variation (caller handles multiple runs)
- concat: Returns list of paths to be concatenated
- select: Returns config for selected variations only
- compare: Same as separate (caller handles comparison)
- Returns:
Dictionary with legacy format configuration.
- class nirs4all.data.DatasetConfigs(configurations: Dict[str, Any] | List[Dict[str, Any]] | str | List[str], task_type: str | List[str] = 'auto', signal_type: str | SignalType | List[str | SignalType] | None = None, aggregate: str | bool | List[str | bool | None] | None = None, aggregate_method: str | List[str] | None = None, aggregate_exclude_outliers: bool | List[bool] | None = None)[source]
Bases: object
- get_dataset(config, name) SpectroDataset[source]
Get dataset by config and name (backward compatible).
Note: When called directly, uses the first task_type (or ‘auto’ if single dataset). For proper per-dataset task_type handling, use iter_datasets() or get_dataset_at().
- get_dataset_at(index) SpectroDataset[source]
- get_datasets() List[SpectroDataset][source]
- class nirs4all.data.FeatureLayout(value)[source]
Feature data layout formats.
String values ensure backward compatibility with existing pipelines that pass the layout as a string, e.g. layout="3d_transpose".
- FLAT_2D = '2d'
- FLAT_2D_INTERLEAVED = '2d_interleaved'
- VOLUME_3D = '3d'
- VOLUME_3D_TRANSPOSE = '3d_transpose'
- class nirs4all.data.FeatureSource(padding: bool = True, pad_value: float = 0.0)[source]
Bases: object
Manages a 3D numpy array of features using modular components.
This class provides efficient storage and manipulation of feature data with multiple processing stages. Each sample can have multiple processing versions (e.g., raw, normalized, filtered), all stored in a single aligned 3D array.
The implementation uses a component-based architecture for better modularity:
- ArrayStorage: Manages the 3D numpy array
- ProcessingManager: Tracks processing IDs and their indices
- HeaderManager: Manages feature headers and units
- LayoutTransformer: Transforms arrays to different layouts
- UpdateStrategy: Handles update operation logic
- AugmentationHandler: Manages sample augmentation
- padding
Whether to allow padding when adding features with fewer dimensions.
- pad_value
Value to use for padding (default: 0.0).
- add_samples(new_samples: ndarray, headers: List[str] | None = None) None[source]
Add new samples to the feature source.
Only allowed when there’s only one processing (raw). Samples are added as a new row in the array with a single processing dimension.
- Parameters:
new_samples – 2D array of shape (n_samples, n_features).
headers – Optional list of feature header names.
- Raises:
ValueError – If the dataset already has multiple processings, or if new_samples is not 2D.
- add_samples_batch_3d(data: ndarray) None[source]
Add multiple samples with 3D data in a single operation - O(N) instead of O(N²).
This method is optimized for bulk insertion of augmented samples where each sample may have multiple processings.
- Parameters:
data – 3D array of shape (n_samples, n_processings, n_features).
- Raises:
ValueError – If data dimensions don’t match existing processings/features.
- augment_samples(sample_indices: List[int], data: ndarray, processings: List[str], count_list: List[int]) None[source]
Create augmented samples by duplicating existing samples.
- Parameters:
sample_indices – List of sample indices to augment.
data – Augmented feature data of shape (total_augmented_samples, n_features).
processings – Processing names for the augmented data.
count_list – Number of augmentations per sample.
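The relationship between `sample_indices`, `count_list`, and the rows of `data` can be sketched as follows. This is an illustration under an assumption not stated in the docstring: that the augmented rows are ordered sample by sample, so the first `count_list[0]` rows belong to `sample_indices[0]`, and so on. `augmentation_origins` is a hypothetical helper, not part of the library.

```python
from typing import List


def augmentation_origins(sample_indices: List[int], count_list: List[int]) -> List[int]:
    """Return, for each augmented row, the index of the sample it derives from.

    Assumes rows are grouped per original sample, in the order of sample_indices.
    """
    if len(sample_indices) != len(count_list):
        raise ValueError("sample_indices and count_list must have the same length")
    origins: List[int] = []
    for index, count in zip(sample_indices, count_list):
        origins.extend([index] * count)
    return origins


origins = augmentation_origins([0, 3, 5], [2, 1, 3])
print(origins)       # [0, 0, 3, 5, 5, 5]
print(len(origins))  # 6, i.e. sum(count_list), the expected row count of `data`
```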
- property header_unit: str
Get the unit type of the headers.
- Returns:
Unit type string (“cm-1”, “nm”, “none”, “text”, “index”).
- property headers: List[str] | None
Get the feature headers.
- Returns:
List of header strings, or None if not set.
- property num_2d_features: int
Get total features when flattened to 2D.
- Returns:
Product of processings and features dimensions.
- property num_features: int
Get the number of features per processing.
- Returns:
Number of features (third dimension of array).
- property num_processings: int
Get the number of processing stages.
- Returns:
Number of unique processings (second dimension of array).
- property num_samples: int
Get the number of samples.
- Returns:
Number of samples (first dimension of array).
- property processing_ids: List[str]
Get a copy of the processing ID list.
- Returns:
List of processing identifiers.
- reset_features(features: ndarray, processings: List[str]) None[source]
Reset features and processings.
Replaces all features and processings with new data.
- Parameters:
features – New feature data (2D or 3D).
processings – List of new processing names.
- set_headers(headers: List[str] | None, unit: str = 'cm-1') None[source]
Set feature headers with unit metadata.
- Parameters:
headers – List of header strings (wavelengths, feature names, etc.).
unit – Unit type - “cm-1” (wavenumber), “nm” (wavelength), “none”, “text”, “index”.
- update_features(source_processings: list[str], features: list[ndarray] | list[list[ndarray]], processings: list[str]) None[source]
Add new features or replace existing ones.
- Parameters:
source_processings – List of existing processing names to replace. Empty string “” means add new.
features – List of feature arrays, each of shape (n_samples, n_features), or single array.
processings – List of target processing names for the data.
Example
# Add new 'savgol' and 'detrend', replace 'raw' with 'msc'
update_features(["", "raw", ""],
                [savgol_data, msc_data, detrend_data],
                ["savgol", "msc", "detrend"])
- x(indices: list[int] | ndarray, layout: str) ndarray[source]
Retrieve feature data in specified layout.
- Parameters:
indices – Sample indices to retrieve.
layout – Output format:
- "2d": Flatten to (samples, processings * features)
- "2d_interleaved": Transpose then flatten to (samples, features * processings)
- "3d": Keep as (samples, processings, features)
- "3d_transpose": Transpose to (samples, features, processings)
- Returns:
Feature array in requested layout.
- Raises:
ValueError – If layout is unknown.
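The four layouts can be reproduced on a toy array with plain NumPy operations. This is a sketch of the shapes described in the parameter list, not the code of `LayoutTransformer`; the array contents and the `layouts` dictionary are illustrative.

```python
import numpy as np

# Toy array: 2 samples, 3 processings, 4 features.
data = np.arange(2 * 3 * 4).reshape(2, 3, 4)
n_samples = data.shape[0]

layouts = {
    # Processings laid out one after another: p0 features, then p1 features, ...
    "2d": data.reshape(n_samples, -1),
    # Features first: for each feature, its value under every processing.
    "2d_interleaved": data.transpose(0, 2, 1).reshape(n_samples, -1),
    "3d": data,
    "3d_transpose": data.transpose(0, 2, 1),
}

for name, arr in layouts.items():
    print(name, arr.shape)
```

Both 2D layouts have the same shape, here (2, 12); they differ only in column order, which matters for models that assume neighbouring columns are neighbouring wavelengths.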
- class nirs4all.data.FileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None, link_by: str | None = None)[source]
Bases: BaseModel
Configuration for a single data file.
This is a stub for future implementation of the files syntax. It describes how to load and interpret a single data file.
- columns: ColumnConfig | None
- model_config = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- params: LoadingParams | None
- partition: PartitionType | None
- class nirs4all.data.HeaderUnit(value)[source]
Feature header unit types.
Defines the type of measurement units used in feature headers.
- INDEX = 'index'
- NONE = 'none'
- TEXT = 'text'
- WAVELENGTH = 'nm'
- WAVENUMBER = 'cm-1'
- class nirs4all.data.LoadingParams(*, delimiter: str | None = None, decimal_separator: str | None = None, has_header: bool | None = None, header_unit: HeaderUnit | str | None = None, signal_type: SignalTypeEnum | str | None = None, encoding: str | None = None, na_policy: NAPolicy | str | None = None, categorical_mode: CategoricalMode | str | None = None, **extra_data: Any)[source]
Bases: BaseModel
Parameters for loading data files.
These parameters control how CSV and other files are parsed. Parameters can be specified at global, partition, or file level, with more specific levels overriding general ones.
- categorical_mode: CategoricalMode | str | None
- header_unit: HeaderUnit | str | None
- merge_with(other: LoadingParams | None) LoadingParams[source]
Merge with another LoadingParams, self taking precedence.
- Parameters:
other – Another LoadingParams to merge with (lower priority).
- Returns:
New LoadingParams with merged values.
- model_config = {'extra': 'allow'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod normalize_header_unit(v: Any) HeaderUnit | str | None[source]
Normalize header_unit to enum if possible.
- classmethod normalize_signal_type(v: Any) SignalTypeEnum | str | None[source]
Normalize signal_type to enum if possible.
- signal_type: SignalTypeEnum | str | None
- class nirs4all.data.PartitionAssigner(default_random_state: int | None = None, base_path: Path | None = None)[source]
Bases: object
Flexible partition assigner for DataFrames.
Supports multiple partition methods:
- Static: "train", "test", "predict" (assign entire DataFrame)
- Column-based: {"column": "split", "train_values": […], "test_values": […]}
- Percentage-based: {"train": "80%", "test": "20%", "shuffle": True}
- Index-based: {"train": [0,1,2], "test": [3,4,5]}
- Index file: {"train_file": "train_idx.txt", "test_file": "test_idx.txt"}
Example
>>> assigner = PartitionAssigner()
>>> result = assigner.assign(df, {"train": "80%", "test": "20%"})
>>> print(len(result.train_data), len(result.test_data))
- DEFAULT_PREDICT_VALUES = ('predict', 'prediction', 'unknown')
- DEFAULT_TEST_VALUES = ('test', 'testing', 'val', 'validation', 'valid')
- DEFAULT_TRAIN_VALUES = ('train', 'training', 'cal', 'calibration')
- PARTITION_NAMES = ('train', 'test', 'predict')
- assign(df: DataFrame, partition: str | Dict[str, Any] | None) PartitionResult[source]
Assign rows to partitions.
- Parameters:
df – The DataFrame to partition.
partition – Partition specification. Can be:
- str: Static partition ("train", "test", "predict")
- dict: Complex partition (column-based, percentage, or index)
- None: No partitioning (returns empty result)
- Returns:
PartitionResult with indices and data for each partition.
- Raises:
PartitionError – If partition specification is invalid.
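Column-based assignment can be pictured as a lookup from each row's label to a partition name. The sketch below reuses the default value tuples listed among the class constants; `assign_by_column` is a hypothetical helper, and both the case-insensitive matching and the skipping of unmatched labels are assumptions rather than documented behaviour.

```python
from typing import Dict, List, Sequence

# Values copied from the class constants listed above.
DEFAULT_TRAIN_VALUES = ("train", "training", "cal", "calibration")
DEFAULT_TEST_VALUES = ("test", "testing", "val", "validation", "valid")
DEFAULT_PREDICT_VALUES = ("predict", "prediction", "unknown")


def assign_by_column(labels: Sequence[str]) -> Dict[str, List[int]]:
    """Map each row label to a partition and collect row indices.

    Assumptions: matching is case-insensitive; unmatched labels are skipped.
    """
    lookup: Dict[str, str] = {}
    for name, values in (
        ("train", DEFAULT_TRAIN_VALUES),
        ("test", DEFAULT_TEST_VALUES),
        ("predict", DEFAULT_PREDICT_VALUES),
    ):
        for value in values:
            lookup[value] = name

    indices: Dict[str, List[int]] = {"train": [], "test": [], "predict": []}
    for row, label in enumerate(labels):
        partition = lookup.get(label.strip().lower())
        if partition is not None:
            indices[partition].append(row)
    return indices


split_column = ["cal", "cal", "val", "Train", "unknown", "other"]
result = assign_by_column(split_column)
print(result)
```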
- concatenate_partitions(results: Sequence[PartitionResult]) PartitionResult[source]
Concatenate multiple partition results.
Useful when combining multiple files with the same partition. Indices are adjusted to account for concatenation order.
- Parameters:
results – Sequence of PartitionResult objects.
- Returns:
Combined PartitionResult.
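The index adjustment mentioned above can be sketched with plain lists. This is an illustration only: `concatenate_indices` is a hypothetical helper, and offsetting each file's indices by the row count of the files before it is an assumption about how "concatenation order" is applied.

```python
from typing import Dict, List, Sequence, Tuple

# Each item: (number of rows in the file, indices per partition within that file).
FileResult = Tuple[int, Dict[str, List[int]]]


def concatenate_indices(results: Sequence[FileResult]) -> Dict[str, List[int]]:
    """Shift each file's local indices by the rows that precede it."""
    combined: Dict[str, List[int]] = {"train": [], "test": [], "predict": []}
    offset = 0
    for n_rows, indices in results:
        for name, local in indices.items():
            combined[name].extend(offset + i for i in local)
        offset += n_rows
    return combined


file_a = (4, {"train": [0, 1, 2], "test": [3]})
file_b = (3, {"train": [0, 2], "test": [1]})
combined = concatenate_indices([file_a, file_b])
print(combined)
```

Rows of the second file start at position 4 in the concatenated frame, so its local index 0 becomes 4.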
- class nirs4all.data.PartitionConfig(*, type: PartitionType | None = None, column: str | None = None, train_values: List[str] | None = None, test_values: List[str] | None = None, predict_values: List[str] | None = None, unknown_policy: Literal['error', 'ignore', 'train'] | None = None, train: str | List[int] | None = None, test: str | List[int] | None = None, predict: str | List[int] | None = None, shuffle: bool | None = None, random_state: int | None = None, stratify: str | None = None, train_file: str | None = None, test_file: str | None = None, predict_file: str | None = None, **extra_data: Any)[source]
Bases: BaseModel
Configuration for partition assignment.
Supports multiple partition methods:
- Static: Assign entire file to a partition (use type)
- Column-based: Partition based on column values (use column)
- Percentage-based: Split by percentage (use train, test with percentages)
- Index-based: Explicit index lists (use train, test with lists)
- Index file: Load indices from external files (use train_file, test_file)
Examples
# Static partition (entire file)
partition:
  type: train

# Column-based partition
partition:
  column: "split"
  train_values: ["train", "training"]
  test_values: ["test", "validation"]

# Percentage-based partition
partition:
  train: "80%"
  test: "80%:100%"
  shuffle: true
  random_state: 42

# Index-based partition
partition:
  train: [0, 1, 2, 3, 4]
  test: [5, 6, 7, 8, 9]

# Index file partition
partition:
  train_file: "train_indices.txt"
  test_file: "test_indices.txt"
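The percentage example uses two spellings, "80%" and "80%:100%". One plausible reading, sketched below, is that a single value means "from 0% up to that point" and a pair means an explicit range. Both helpers are hypothetical, and the rounding and shuffling behaviour are assumptions; the library's actual rules may differ.

```python
import random
from typing import List, Optional, Tuple


def parse_percent_range(spec: str) -> Tuple[float, float]:
    """Parse "80%" as (0.0, 0.8) and "80%:100%" as (0.8, 1.0)."""
    parts = [float(p.strip().rstrip("%")) / 100.0 for p in spec.split(":")]
    if len(parts) == 1:
        return 0.0, parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"Invalid percentage spec: {spec!r}")


def percent_indices(
    n_rows: int,
    spec: str,
    shuffle: bool = False,
    random_state: Optional[int] = None,
) -> List[int]:
    """Return the row indices covered by a percentage spec."""
    order = list(range(n_rows))
    if shuffle:
        # Same seed gives the same permutation, so train/test stay disjoint.
        random.Random(random_state).shuffle(order)
    start, stop = parse_percent_range(spec)
    return order[round(start * n_rows):round(stop * n_rows)]


train_idx = percent_indices(10, "80%")
test_idx = percent_indices(10, "80%:100%")
print(train_idx, test_idx)

train_shuffled = percent_indices(10, "80%", shuffle=True, random_state=42)
test_shuffled = percent_indices(10, "80%:100%", shuffle=True, random_state=42)
```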
- model_config = {'extra': 'allow'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- to_assigner_spec() str | Dict[str, Any] | None[source]
Convert this config to a spec for PartitionAssigner.
- Returns:
Partition specification for PartitionAssigner.assign().
- type: PartitionType | None
- validate_partition_method() PartitionConfig[source]
Validate that partition specification is consistent.
- exception nirs4all.data.PartitionError[source]
Bases: Exception
Raised when partition assignment fails.
- class nirs4all.data.PartitionResult(train_indices: List[int] = <factory>, test_indices: List[int] = <factory>, predict_indices: List[int] = <factory>, train_data: DataFrame | None = None, test_data: DataFrame | None = None, predict_data: DataFrame | None = None, partition_column: str | None = None)[source]
Bases: object
Result of a partition assignment operation.
- train_data
DataFrame subset for training.
- Type:
pandas.core.frame.DataFrame | None
- test_data
DataFrame subset for testing.
- Type:
pandas.core.frame.DataFrame | None
- predict_data
DataFrame subset for prediction.
- Type:
pandas.core.frame.DataFrame | None
- get_data(partition: Literal['train', 'test', 'predict']) DataFrame | None[source]
Get data for a specific partition.
- class nirs4all.data.PredictionAnalyzer(predictions_obj: Predictions, dataset_name_override: str | None = None, config: ChartConfig | None = None, output_dir: str | None = None, cache_size: int = 50, default_aggregate: str | None = None, default_aggregate_method: str | None = None, default_aggregate_exclude_outliers: bool = False)[source]
Bases: object
Orchestrator for prediction analysis and visualization.
Provides a unified interface for creating various prediction visualizations. Delegates to specialized chart classes for rendering.
Includes a caching layer (PredictionCache) to avoid recomputing expensive aggregations when multiple charts use the same parameters. The cache is keyed by (aggregate, rank_metric, rank_partition, display_partition, group_by, filters) and stores the results of predictions.top() calls.
Leverages the refactored Predictions API (predictions.top(), PredictionResult, etc.) for efficient data access and avoids redundant calculations.
- predictions
Predictions object containing prediction data.
- dataset_name_override
Optional dataset name override for display.
- config
ChartConfig for customization across all charts.
- output_dir
Directory to save generated charts.
- cache
PredictionCache for caching aggregated results.
- default_aggregate
Default aggregation column for all visualization methods.
Example
>>> from nirs4all.data.predictions import Predictions
>>> predictions = Predictions.load('predictions.json')
>>> analyzer = PredictionAnalyzer(predictions)
>>>
>>> # Plot top 5 models - first call computes aggregation
>>> fig = analyzer.plot_top_k(k=5, aggregate='ID')
>>>
>>> # Plot heatmap - uses cached aggregation (fast!)
>>> fig = analyzer.plot_heatmap('model_name', 'preprocessings', aggregate='ID')
>>>
>>> # Check cache stats
>>> print(analyzer.get_cache_stats())
>>>
>>> # With default aggregation from dataset config
>>> runner = PipelineRunner()
>>> predictions, _ = runner.run(pipeline, DatasetConfigs(path, aggregate='sample_id'))
>>> analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)
>>> # All plots now use sample_id aggregation by default
>>> fig = analyzer.plot_top_k(k=5)  # Aggregated automatically
- branch_summary(metrics: List[str] | None = None, display_partition: str = 'test', aggregate: str | None = None, as_dataframe: bool = True, **filters) DataFrame | Dict[str, Dict[str, Any]][source]
Generate summary statistics comparing branch performance.
Computes mean, std, min, max for each metric across branches.
- Parameters:
metrics – List of metrics to compute (default: [‘rmse’, ‘r2’] or [‘balanced_accuracy’, ‘f1’] for classification).
display_partition – Partition to compute metrics from (default: ‘test’).
aggregate – If provided, aggregate predictions by this metadata column (e.g., ‘ID’) before computing statistics.
as_dataframe – If True, return pandas DataFrame. If False, return dict.
**filters – Additional filter criteria.
- Returns:
DataFrame or dict with branch summary statistics, containing:
- branch_name: Branch identifier
- branch_id: Numeric branch ID
- count: Number of predictions
- {metric}_mean: Mean value
- {metric}_std: Standard deviation
- {metric}_min: Minimum value
- {metric}_max: Maximum value
Examples
>>> summary = analyzer.branch_summary(metrics=['rmse', 'r2'])
>>> print(summary.to_markdown())
>>> summary = analyzer.branch_summary(
...     metrics=['balanced_accuracy', 'f1'],
...     aggregate='ID'
... )
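The per-branch statistics listed under Returns can be approximated with the standard library. This sketch is not the library's implementation: `summarize_by_branch` and the record layout are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List


def summarize_by_branch(records: List[dict], metric: str) -> Dict[str, Dict[str, float]]:
    """Group metric values by branch name and compute summary statistics."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        grouped[record["branch_name"]].append(record[metric])

    summary: Dict[str, Dict[str, float]] = {}
    for branch, values in grouped.items():
        summary[branch] = {
            "count": len(values),
            f"{metric}_mean": mean(values),
            # Sample standard deviation; it needs at least two values.
            f"{metric}_std": stdev(values) if len(values) > 1 else 0.0,
            f"{metric}_min": min(values),
            f"{metric}_max": max(values),
        }
    return summary


# Illustrative scores only.
records = [
    {"branch_name": "snv_pca", "rmse": 0.50},
    {"branch_name": "snv_pca", "rmse": 0.70},
    {"branch_name": "msc_detrend", "rmse": 0.40},
    {"branch_name": "msc_detrend", "rmse": 0.60},
    {"branch_name": "msc_detrend", "rmse": 0.50},
]
summary = summarize_by_branch(records, "rmse")
for branch, stats in summary.items():
    print(branch, stats)
```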
- clear_cache() None[source]
Clear all caches.
Call this if the underlying predictions data has been modified to ensure fresh results are computed. Clears both:
- Analyzer's query result cache
- Ranker's aggregation and score caches
- generate_report(output_path: str, branch_comparison: bool = True, include_diagrams: bool = True, include_tables: bool = True, metrics: List[str] | None = None, partition: str = 'test', title: str | None = None) str[source]
Generate HTML report with branch analysis.
Creates a comprehensive HTML report with branch comparisons, visualizations, and statistical tables.
- Parameters:
output_path – Path for the output HTML file.
branch_comparison – If True, include branch comparison section.
include_diagrams – If True, include branch diagram visualization.
include_tables – If True, include summary statistics tables.
metrics – List of metrics to include (default: [‘rmse’, ‘r2’]).
partition – Partition for metrics (default: ‘test’).
title – Report title (default: ‘Branch Comparison Report’).
- Returns:
Path to the generated HTML file.
Examples
>>> path = analyzer.generate_report(
...     'reports/branch_comparison.html',
...     branch_comparison=True,
...     metrics=['rmse', 'r2', 'mae']
... )
- get_branch_ids() List[int][source]
Get list of unique branch IDs in predictions.
- Returns:
List of branch IDs (empty list if no branches)
Examples
>>> branch_ids = analyzer.get_branch_ids()
>>> print(branch_ids)  # [0, 1, 2]
- get_branches() List[str][source]
Get list of unique branch names in predictions.
- Returns:
List of branch names (empty list if no branches)
Examples
>>> branches = analyzer.get_branches()
>>> print(branches)  # ['snv_pca', 'msc_detrend', 'derivative']
- get_cache_stats() Dict[str, Any][source]
Get cache performance statistics.
- Returns:
Dictionary with stats for both analyzer and ranker caches:
- analyzer_cache: Query result cache stats
- ranker_cache: Aggregation and score cache stats
- get_cached_predictions(n: int, rank_metric: str, rank_partition: str = 'val', display_partition: str = 'test', display_metrics: List[str] | None = None, aggregate: str | None = None, aggregate_method: str | None = None, aggregate_exclude_outliers: bool | None = None, group_by: str | List[str] | None = None, aggregate_partitions: bool = True, **filters)[source]
Get predictions with caching support.
This method wraps predictions.top() with a caching layer. Charts should call this method instead of directly calling predictions.top() to benefit from caching.
The cache key includes: aggregate, rank_metric, rank_partition, display_partition, group_by, and all filters.
- Parameters:
n – Number of top predictions to return.
rank_metric – Metric for ranking.
rank_partition – Partition for ranking (default: ‘val’).
display_partition – Partition for display (default: ‘test’).
display_metrics – List of metrics to compute for display.
aggregate – Aggregation column (e.g., ‘ID’) or None.
aggregate_method – Aggregation method (‘mean’, ‘median’, ‘vote’). If None, uses default_aggregate_method from constructor.
aggregate_exclude_outliers – If True, exclude outliers using T² before aggregation. If None, uses default_aggregate_exclude_outliers from constructor.
group_by – Grouping column(s) for deduplication.
aggregate_partitions – If True, include all partition data.
**filters – Additional filter criteria.
- Returns:
PredictionResultsList from cache or fresh computation.
Example
>>> # First call computes and caches
>>> preds = analyzer.get_cached_predictions(
...     n=5, rank_metric='rmse', aggregate='ID'
... )
>>> # Second call with same params is instant
>>> preds = analyzer.get_cached_predictions(
...     n=5, rank_metric='rmse', aggregate='ID'
... )
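The caching behaviour described for this method can be sketched as a dictionary keyed by the documented parameters (aggregate, rank_metric, rank_partition, display_partition, group_by, and the filters). `make_cache_key` and `QueryCache` are hypothetical names used for illustration; they are not the library's `PredictionCache`, and the sketch assumes filter values are hashable.

```python
from typing import Any, Callable, Dict, Hashable, List, Optional, Tuple, Union


def make_cache_key(
    aggregate: Optional[str],
    rank_metric: str,
    rank_partition: str,
    display_partition: str,
    group_by: Union[str, List[str], None],
    **filters: Hashable,
) -> Tuple[Hashable, ...]:
    """Build a hashable key; lists become tuples and filters are sorted by name."""
    group_key = tuple(group_by) if isinstance(group_by, list) else group_by
    return (
        aggregate,
        rank_metric,
        rank_partition,
        display_partition,
        group_key,
        tuple(sorted(filters.items())),
    )


class QueryCache:
    """Minimal cache that counts hits and misses."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[Hashable, ...], Any] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Tuple[Hashable, ...], compute: Callable[[], Any]) -> Any:
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


calls = []


def expensive_top() -> List[str]:
    """Stand-in for an expensive ranking query."""
    calls.append(1)
    return ["model_a", "model_b"]


cache = QueryCache()
key = make_cache_key("ID", "rmse", "val", "test", None, dataset_name="corn")
first = cache.get_or_compute(key, expensive_top)
second = cache.get_or_compute(key, expensive_top)
print(cache.hits, cache.misses, len(calls))  # 1 1 1
```

Changing any component of the key, for example a different `aggregate` column, produces a new entry and triggers a fresh computation.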
- plot_branch_boxplot(rank_metric: str | None = None, display_metric: str | None = None, display_partition: str = 'test', aggregate: str | None = None, figsize: tuple | None = None, config: ChartConfig | None = None, **filters) Figure[source]
Plot boxplot comparing score distributions across branches.
Creates a boxplot showing the distribution of metric values for each branch.
- Parameters:
rank_metric – Metric for ranking models (default: auto-detect).
display_metric – Metric to display (default: same as rank_metric).
display_partition – Partition to display results from (default: ‘test’).
aggregate – If provided, aggregate predictions by this metadata column.
figsize – Figure size tuple (default: auto-computed).
config – Optional ChartConfig to override defaults.
**filters – Additional filter criteria.
- Returns:
matplotlib Figure with branch comparison boxplot.
Examples
>>> fig = analyzer.plot_branch_boxplot(display_metric='rmse')
>>> fig = analyzer.plot_branch_boxplot(
...     display_metric='r2',
...     aggregate='ID'
... )
- plot_branch_comparison(rank_metric: str | None = None, display_metric: str | None = None, display_partition: str = 'test', aggregate: str | None = None, show_ci: bool = True, ci_level: float = 0.95, figsize: tuple | None = None, config: ChartConfig | None = None, **filters) Figure[source]
Plot bar chart comparing branch performance with confidence intervals.
Creates a grouped bar chart showing mean metric values for each branch with optional confidence intervals.
- Parameters:
rank_metric – Metric for ranking models (default: auto-detect).
display_metric – Metric to display (default: same as rank_metric).
display_partition – Partition to display results from (default: ‘test’).
aggregate – If provided, aggregate predictions by this metadata column.
show_ci – If True, show confidence intervals (default: True).
ci_level – Confidence level for intervals (default: 0.95).
figsize – Figure size tuple (default: auto-computed).
config – Optional ChartConfig to override defaults.
**filters – Additional filter criteria.
- Returns:
matplotlib Figure with branch comparison bar chart.
Examples
>>> fig = analyzer.plot_branch_comparison(display_metric='rmse')
>>> fig = analyzer.plot_branch_comparison(
...     display_metric='r2',
...     aggregate='ID',
...     show_ci=True
... )
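For readers who want to see what a confidence interval at `ci_level` looks like numerically, here is a minimal sketch using a normal approximation. The docstring does not say how the chart computes its intervals (a t-distribution or bootstrap would be equally plausible), so both the method and the helper name `mean_confidence_interval` are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_confidence_interval(
    values: Sequence[float], ci_level: float = 0.95
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(values)
    center = mean(values)
    if n < 2:
        # No spread can be estimated from a single value.
        return center, center, center
    z = NormalDist().inv_cdf(0.5 + ci_level / 2.0)
    half_width = z * stdev(values) / sqrt(n)
    return center, center - half_width, center + half_width


# Illustrative per-fold scores for one branch.
scores = [0.50, 0.60, 0.70, 0.60]
center, lower, upper = mean_confidence_interval(scores, ci_level=0.95)
print(f"mean={center:.3f}, ci=({lower:.3f}, {upper:.3f})")
```

A higher `ci_level` widens the interval, which is why 0.99 error bars are longer than 0.95 ones for the same scores.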
- plot_branch_diagram(show_metrics: bool = True, metric: str | None = None, partition: str = 'test', figsize: tuple | None = None, title: str | None = None, config: Dict[str, Any] | None = None) Figure[source]
Plot DAG diagram showing the branching structure of the pipeline.
Creates a visual diagram showing shared steps, branch nodes, and post-branch models in a hierarchical layout.
- Parameters:
show_metrics – If True, show metrics in branch nodes (default: True).
metric – Metric to display (default: auto-detect).
partition – Partition for metrics (default: ‘test’).
figsize – Figure size tuple (default: auto-computed).
title – Optional title for the diagram.
config – Additional configuration dict for BranchDiagram.
- Returns:
matplotlib Figure with branch DAG diagram.
Examples
>>> fig = analyzer.plot_branch_diagram(metric='rmse')
>>> fig = analyzer.plot_branch_diagram(
...     show_metrics=True,
...     metric='r2',
...     partition='val'
... )
- plot_branch_heatmap(y_var: str = 'fold_id', rank_metric: str | None = None, display_metric: str | None = None, display_partition: str = 'test', aggregate: str | None = None, config: ChartConfig | None = None, **kwargs) Figure[source]
Plot heatmap of branch performance across folds or other variable.
Creates a heatmap with branches on x-axis and another variable (e.g., fold_id) on y-axis.
- Parameters:
y_var – Variable for y-axis (default: ‘fold_id’).
rank_metric – Metric for ranking (default: auto-detect).
display_metric – Metric to display (default: same as rank_metric).
display_partition – Partition to display (default: ‘test’).
aggregate – If provided, aggregate predictions by this metadata column.
config – Optional ChartConfig to override defaults.
**kwargs – Additional parameters passed to plot_heatmap.
- Returns:
matplotlib Figure with branch heatmap.
Examples
>>> fig = analyzer.plot_branch_heatmap(display_metric='rmse')
>>> fig = analyzer.plot_branch_heatmap(
...     y_var='model_name',
...     display_metric='r2'
... )
- plot_candlestick(variable: str, display_metric: str | None = None, display_partition: str = 'test', aggregate: str | None = None, config: ChartConfig | None = None, **kwargs) Figure[source]
Plot candlestick chart for score distribution by variable.
- Parameters:
variable – Variable to group by (e.g., ‘model_name’, ‘preprocessings’).
display_metric – Metric to analyze (default: auto-detect from task type).
display_partition – Partition to display scores from (default: ‘test’).
aggregate – If provided, aggregate predictions by this metadata column or ‘y’. When ‘y’, groups by y_true values. When a column name (e.g., ‘ID’), groups by that metadata column. Aggregated predictions have recalculated metrics.
config – Optional ChartConfig to override analyzer’s default config for this chart.
**kwargs – Additional parameters (dataset_name, figsize, filters).
- Returns:
matplotlib Figure object.
Example
>>> fig = analyzer.plot_candlestick('model_name', display_metric='rmse')
>>> fig = analyzer.plot_candlestick('model_name', display_metric='rmse', aggregate='ID')
- plot_confusion_matrix(k: int = 5, rank_metric: str | None = None, rank_partition: str = 'val', display_metric: str | List[str] = '', display_partition: str = 'test', show_scores: bool = True, aggregate: str | None = None, config: ChartConfig | None = None, **kwargs) Figure | List[Figure][source]
Plot confusion matrices for top K classification models.
When multiple datasets are present and no dataset_name is specified, creates one figure per dataset.
- Parameters:
k – Number of top models to show (default: 5).
rank_metric – Metric for ranking (default: auto-detect from task type).
rank_partition – Partition used for ranking models (default: ‘val’).
display_metric – Metric(s) to display in titles. Can be a single string (e.g., ‘accuracy’) or a list of strings for multiple metrics (e.g., [‘balanced_accuracy’, ‘accuracy’]). Metric names are shown in abbreviated form (default: same as rank_metric).
display_partition – Partition to display confusion matrix from (default: ‘test’).
show_scores – If True, show scores in chart titles (default: True).
aggregate – If provided, aggregate predictions by this metadata column or ‘y’.
config – Optional ChartConfig to override analyzer’s default config for this chart.
**kwargs – Additional parameters (dataset_name, figsize, filters).
- Returns:
matplotlib Figure object or list of Figure objects (one per dataset).
Example
>>> fig = analyzer.plot_confusion_matrix(k=3, rank_metric='f1')
>>> fig = analyzer.plot_confusion_matrix(k=3, aggregate='ID')
>>> # Multiple metrics displayed with abbreviated names
>>> fig = analyzer.plot_confusion_matrix(
...     k=3,
...     display_metric=['balanced_accuracy', 'accuracy']
... )
- plot_heatmap(x_var: str, y_var: str, rank_metric: str | None = None, rank_partition: str = 'val', display_metric: str = '', display_partition: str = 'test', normalize: bool = False, rank_agg: str = 'best', display_agg: str = 'best', show_counts: bool = True, local_scale: bool = False, column_scale: bool = False, aggregate: str | None = None, top_k: int | None = None, sort_by_value: bool = False, sort_by: str | None = None, config: ChartConfig | None = None, **kwargs) Figure[source]
Plot performance heatmap across two variables.
For each (x_var, y_var) cell:
1. Rank predictions by rank_metric on rank_partition using rank_agg
2. Display display_metric from display_partition using display_agg
3. Normalize per dataset if requested
4. Show counts if requested
- Parameters:
x_var – Variable for x-axis (e.g., ‘model_name’, ‘preprocessings’).
y_var – Variable for y-axis (e.g., ‘dataset_name’, ‘partition’).
rank_metric – Metric used to rank/select models (default: auto-detect from task type).
rank_partition – Partition used for ranking models (default: ‘val’).
display_metric – Metric to display in heatmap (default: same as rank_metric).
display_partition – Partition to display scores from (default: ‘test’).
normalize – If True, show normalized scores in cells. Cell colors always use the normalized scores, whatever this setting (default: False).
rank_agg – Aggregation for ranking (‘best’, ‘worst’, ‘mean’, ‘median’) (default: ‘best’).
display_agg – Aggregation for display scores (‘best’, ‘worst’, ‘mean’, ‘median’) (default: ‘best’).
show_counts – Show prediction counts in cells (default: True).
local_scale – If True, colorbar shows actual metric values; if False, shows 0-1 normalized (default: False).
column_scale – If True, normalize colors per column (best in column = 1.0). Automatically sets local_scale=False when enabled (default: False).
aggregate – If provided, aggregate predictions by this metadata column (e.g., ‘ID’).
top_k – If provided, show only top K models. Selection uses Borda count: first keeps top-1 per column, then ranks by Borda count.
sort_by_value – If True, sort Y-axis by ranking score (best first) instead of alphabetically. Uses rank_metric on rank_partition. Deprecated: use sort_by='value' instead.
sort_by – Sorting method for Y-axis (rows). Options:
- None: Alphabetical sorting (default).
- ‘value’: Sort by ranking score on rank_partition column.
- ‘mean’: Sort by mean score across all columns.
- ‘median’: Sort by median score across all columns.
- ‘borda’: Sort by Borda count (sum of ranks across columns).
- ‘condorcet’: Sort by pairwise wins (Copeland method).
- ‘consensus’: Sort by consensus (geometric mean of normalized ranks).
config – Optional ChartConfig to override analyzer’s default config for this chart.
**kwargs – Additional filters (dataset_name, model_name, etc.).
- Returns:
matplotlib Figure object.
Example
>>> # Rank on best val RMSE, display best test RMSE
>>> fig = analyzer.plot_heatmap('model_name', 'dataset_name')
>>>
>>> # Rank on mean val R2, display best test F1
>>> fig = analyzer.plot_heatmap(
...     'model_name', 'dataset_name',
...     rank_metric='r2',
...     rank_agg='mean',
...     display_metric='f1',
...     display_agg='best'
... )
>>>
>>> # Use column normalization for comparing across partitions
>>> fig = analyzer.plot_heatmap(
...     'partition', 'model_name',
...     column_scale=True
... )
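The ‘borda’ option of sort_by is described above as a sum of ranks across columns. The following is a minimal sketch of that idea, not the library's implementation. It assumes lower scores are better (as for RMSE) and that every row has a score in every column; `borda_order` and the score table are hypothetical names introduced for illustration.

```python
def borda_order(scores, lower_is_better=True):
    """Order row labels by the sum of their per-column ranks (rank 1 = best)."""
    rows = list(scores)
    columns = list(next(iter(scores.values())))
    totals = {row: 0 for row in rows}
    for col in columns:
        ranked = sorted(rows, key=lambda r: scores[r][col], reverse=not lower_is_better)
        for rank, row in enumerate(ranked, start=1):
            totals[row] += rank
    # A smaller rank sum means the row was consistently near the top.
    return sorted(rows, key=lambda r: totals[r]), totals


# Hypothetical RMSE table: rows are models, columns are datasets.
rmse = {
    "PLS": {"wheat": 0.50, "corn": 0.40},
    "RF": {"wheat": 0.45, "corn": 0.38},
    "SVR": {"wheat": 0.60, "corn": 0.35},
}
order, totals = borda_order(rmse)
```

Here RF ranks first on wheat and second on corn, so its rank sum (3) beats SVR (4) and PLS (5). Ties within a column are broken by input order in this sketch; a real implementation may average tied ranks instead.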
- plot_histogram(display_metric: str | None = None, display_partition: str = 'test', aggregate: str | None = None, config: ChartConfig | None = None, **kwargs) Figure | List[Figure][source]
Plot score distribution histogram.
When multiple datasets are present and no dataset_name is specified, creates one figure per dataset.
- Parameters:
display_metric – Metric to plot (default: auto-detect from task type).
display_partition – Partition to display scores from (default: ‘test’).
aggregate – If provided, aggregate predictions by this metadata column or ‘y’. When ‘y’, groups by y_true values. When a column name (e.g., ‘ID’), groups by that metadata column. Aggregated predictions have recalculated metrics.
config – Optional ChartConfig to override analyzer’s default config for this chart.
**kwargs – Additional parameters (dataset_name, bins, figsize, filters).
- Returns:
matplotlib Figure object or list of Figure objects (one per dataset).
Example
>>> fig = analyzer.plot_histogram(display_metric='r2', display_partition='val')
>>> fig = analyzer.plot_histogram(display_metric='rmse', aggregate='ID')
- plot_nested_branches(level1_var: str = 'branch_path_level1', level2_var: str = 'branch_path_level2', metric: str | None = None, partition: str = 'test', plot_type: str = 'grouped_bar', figsize: tuple | None = None, config: ChartConfig | None = None, **filters) Figure[source]
Plot nested branch comparison for hierarchical experiments.
Creates grouped bar charts or faceted plots for nested branch structures.
- Parameters:
level1_var – Variable for first level grouping (outer group).
level2_var – Variable for second level grouping (inner group/x-axis).
metric – Metric to display (default: auto-detect).
partition – Partition for metrics (default: ‘test’).
plot_type – Type of plot (‘grouped_bar’, ‘facet’).
figsize – Figure size tuple.
config – Optional ChartConfig to override defaults.
**filters – Additional filter criteria.
- Returns:
matplotlib Figure with nested branch visualization.
Examples
>>> # Compare outlier strategies × preprocessing
>>> fig = analyzer.plot_nested_branches(
...     level1_var='outlier_strategy',
...     level2_var='preprocessing',
...     metric='rmse'
... )
- plot_top_k(k: int = 5, rank_metric: str | None = None, rank_partition: str = 'val', display_metric: str = '', display_partition: str = 'all', show_scores: bool = True, aggregate: str | None = None, config: ChartConfig | None = None, **kwargs) Figure | List[Figure][source]
Plot top K model comparison (scatter + residuals).
Models are ranked by rank_metric on rank_partition, then predictions from display_partition(s) are shown.
When multiple datasets are present and no dataset_name is specified, creates one figure per dataset.
- Parameters:
k – Number of top models to show (default: 5).
rank_metric – Metric for ranking models (default: auto-detect from task type).
rank_partition – Partition used for ranking (default: ‘val’).
display_metric – Metric to display in titles (default: same as rank_metric).
display_partition – Partition(s) to display (‘all’ or specific partition).
show_scores – If True, show scores in chart titles (default: True).
aggregate – If provided, aggregate predictions by this metadata column or ‘y’. When ‘y’, groups by y_true values. When a column name (e.g., ‘ID’), groups by that metadata column. Aggregated predictions have recalculated metrics.
config – Optional ChartConfig to override analyzer’s default config for this chart.
**kwargs – Additional parameters (dataset_name, figsize, filters).
- Returns:
matplotlib Figure object or list of Figure objects (one per dataset).
Example
>>> fig = analyzer.plot_top_k(k=3, rank_metric='r2')
>>> fig = analyzer.plot_top_k(k=3, aggregate='ID')  # Aggregated by ID
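Several plotting methods separate the partition used for ranking (rank_partition) from the partition whose results are shown (display_partition). The sketch below illustrates why that separation matters. It uses plain dictionaries rather than the analyzer, and `top_k_by_partition` and the score values are hypothetical.

```python
def top_k_by_partition(entries, k, rank_partition="val", lower_is_better=True):
    """Return the k best entries, ranked on one partition's score."""
    ranked = sorted(
        entries,
        key=lambda e: e["scores"][rank_partition],
        reverse=not lower_is_better,
    )
    return ranked[:k]


# Hypothetical RMSE scores per partition for three models.
entries = [
    {"model_name": "PLS", "scores": {"val": 0.42, "test": 0.47}},
    {"model_name": "RF", "scores": {"val": 0.39, "test": 0.52}},
    {"model_name": "SVR", "scores": {"val": 0.55, "test": 0.44}},
]
best = top_k_by_partition(entries, k=2)
# Selection used "val"; the reported numbers come from "test".
displayed = [(e["model_name"], e["scores"]["test"]) for e in best]
```

SVR has the lowest test RMSE but is not selected, because selection never looks at the test partition. Ranking on validation scores keeps the displayed test scores an honest estimate.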
- class nirs4all.data.PredictionResult[source]
Bases: dict
Enhanced dictionary for a single prediction with convenience methods.
Extends standard dict with property accessors and methods for saving, evaluating, and summarizing predictions.
- Features:
Property accessors (id, model_name, dataset_name, etc.)
save_to_csv() - save individual result
eval_score() - compute metrics on-the-fly
summary() - generate tab report
Examples
>>> result = PredictionResult({
...     "id": "abc123",
...     "dataset_name": "wheat",
...     "model_name": "PLS",
...     "y_true": [1, 2, 3],
...     "y_pred": [1.1, 2.2, 3.3]
... })
>>> result.model_name
'PLS'
>>> scores = result.eval_score(["rmse", "r2"])
>>> result.save_to_csv("results")
- eval_score(metrics: List[str] | None = None) Dict[str, Any][source]
Evaluate scores for this prediction using specified metrics.
- Parameters:
metrics – List of metrics to compute (if None, returns all available metrics)
- Returns:
Dictionary of metric names to scores.
For aggregated results: {“train”: {…}, “val”: {…}, “test”: {…}}
For single partition: {“rmse”: …, “r2”: …, …}
Examples
>>> scores = result.eval_score(["rmse", "r2", "mae"])
>>> # For aggregated: scores = {"train": {"rmse": 0.5}, "val": {...}, "test": {...}}
>>> # For single: scores = {"rmse": 0.5, "r2": 0.9}
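To make the single-partition return shape concrete, here is a minimal standalone sketch of how the three regression metrics named in the example can be computed from y_true and y_pred. It uses the textbook definitions and does not call nirs4all; `regression_scores` is a hypothetical helper, and the library may compute or name its metrics differently.

```python
import math


def regression_scores(y_true, y_pred, metrics=("rmse", "r2", "mae")):
    """Compute the requested metrics and return them keyed by name."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    available = {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }
    return {name: available[name] for name in metrics}


# Same toy values as the PredictionResult example above.
scores = regression_scores([1, 2, 3], [1.1, 2.2, 3.3])
```

For an aggregated result, the same function could be applied once per partition to build the nested {"train": {...}, "val": {...}, "test": {...}} shape.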
- save_to_csv(path_or_file: str = 'results', filename: str | None = None) None[source]
Save prediction result to CSV file.
- Parameters:
path_or_file – Base path (folder) or complete file path (if ends with .csv)
filename – Optional filename (if path_or_file is a folder)
Examples
>>> result.save_to_csv("output")  # Saves to output/{dataset}/{id}.csv
>>> result.save_to_csv("output/my_result.csv")  # Saves to output/my_result.csv
>>> result.save_to_csv("output", "my_result.csv")  # Saves to output/my_result.csv
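The three call patterns in the examples above imply a small path-resolution rule. The sketch below reproduces that rule as a pure function so it can be checked without touching the filesystem. `resolve_csv_path` is a hypothetical name, and the dataset and ID defaults are placeholders, not values taken from the library.

```python
from pathlib import PurePosixPath


def resolve_csv_path(path_or_file, filename=None, dataset="wheat", pred_id="abc123"):
    """Mirror the three call patterns shown in the examples above."""
    if path_or_file.endswith(".csv"):
        # A complete file path was given: use it as-is.
        return PurePosixPath(path_or_file)
    if filename is not None:
        # A folder plus an explicit filename.
        return PurePosixPath(path_or_file) / filename
    # A folder only: fall back to {dataset}/{id}.csv underneath it.
    return PurePosixPath(path_or_file) / dataset / f"{pred_id}.csv"


paths = [
    str(resolve_csv_path("output")),
    str(resolve_csv_path("output/my_result.csv")),
    str(resolve_csv_path("output", "my_result.csv")),
]
```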
- class nirs4all.data.PredictionResultsList(predictions: List[Dict[str, Any] | PredictionResult] | None = None)[source]
Bases: list
List container for PredictionResult objects with batch operations.
Extends standard list with prediction-specific batch functionality.
- Features:
save() - batch CSV export
get() - retrieve by ID
filter() - chain filtering
Iterator support
Examples
>>> results = PredictionResultsList([result1, result2, result3])
>>> results.save("output", "predictions.csv")
>>> best = results.get("abc123")
>>> len(results)
3
- get(prediction_id: str) PredictionResult | None[source]
Get a prediction by its ID.
- Parameters:
prediction_id – The ID of the prediction to retrieve
- Returns:
PredictionResult if found, None otherwise
Examples
>>> result = results.get("abc123")
- save(path: str = 'results', filename: str | None = None) None[source]
Save all predictions to a single CSV file with structured headers.
- CSV Structure:
Line 1: dataset_name
Line 2: model_classname + model_id
Line 3: fold_id
Line 4: partition
Lines 5+: prediction data (y_true, y_pred columns)
- Parameters:
path – Base directory path (default: “results”)
filename – Optional filename (if None, auto-generated from first prediction)
Examples
>>> results.save("output")
>>> results.save("output", "my_predictions.csv")
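The CSV structure listed above (four header lines followed by data rows) can be sketched with the standard csv module. This is only one plausible reading of that layout. It assumes two columns (y_true, y_pred) per prediction, an underscore joining the model class name and model ID, and empty cells for shorter predictions. `write_structured_csv` and the `model_id` key are hypothetical.

```python
import csv
import io


def write_structured_csv(predictions, handle):
    """Write four header rows, then y_true/y_pred rows, two columns per prediction."""
    writer = csv.writer(handle, lineterminator="\n")
    header_rows = [
        [p["dataset_name"] for p in predictions],
        [f'{p["model_classname"]}_{p["model_id"]}' for p in predictions],
        [p["fold_id"] for p in predictions],
        [p["partition"] for p in predictions],
    ]
    for row in header_rows:
        # Each prediction occupies two columns, so repeat its header value.
        writer.writerow([value for value in row for _ in range(2)])
    n_rows = max(len(p["y_true"]) for p in predictions)
    for i in range(n_rows):
        line = []
        for p in predictions:
            has_row = i < len(p["y_true"])
            line += [p["y_true"][i], p["y_pred"][i]] if has_row else ["", ""]
        writer.writerow(line)


buffer = io.StringIO()
write_structured_csv(
    [{"dataset_name": "wheat", "model_classname": "PLSRegression", "model_id": "m1",
      "fold_id": 0, "partition": "test", "y_true": [1, 2], "y_pred": [1.1, 2.2]}],
    buffer,
)
lines = buffer.getvalue().splitlines()
```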
- class nirs4all.data.Predictions(filepath: str | List[str] | None = None)[source]
Bases: object
Main facade for prediction management.
Delegates to specialized components while maintaining backward-compatible public API.
- Architecture:
Storage: PredictionStorage (DataFrame backend)
Serializer: PredictionSerializer (JSON/Parquet hybrid)
Indexer: PredictionIndexer (filtering operations)
Ranker: PredictionRanker (ranking and top-k)
Aggregator: PartitionAggregator (partition combining)
Query: CatalogQueryEngine (catalog operations)
Examples
>>> # Create and add predictions
>>> pred = Predictions()
>>> pred.add_prediction(
...     dataset_name="wheat",
...     model_name="PLS",
...     partition="test",
...     y_true=y_true,
...     y_pred=y_pred,
...     test_score=0.85
... )
>>>
>>> # Query top models
>>> top_5 = pred.top(n=5, rank_metric="rmse", rank_partition="val")
>>>
>>> # Save and load (split Parquet format)
>>> pred.save_to_file("predictions.meta.parquet")
>>> loaded = Predictions("predictions.meta.parquet")
- add_prediction(dataset_name: str, dataset_path: str = '', config_name: str = '', config_path: str = '', pipeline_uid: str | None = None, step_idx: int = 0, op_counter: int = 0, model_name: str = '', model_classname: str = '', model_path: str = '', fold_id: str | int | None = None, sample_indices: List[int] | None = None, weights: List[float] | None = None, metadata: Dict[str, Any] | None = None, partition: str = '', y_true: ndarray | None = None, y_pred: ndarray | None = None, y_proba: ndarray | None = None, val_score: float | None = None, test_score: float | None = None, train_score: float | None = None, metric: str = 'mse', task_type: str = 'regression', n_samples: int = 0, n_features: int = 0, preprocessings: str = '', best_params: Dict[str, Any] | None = None, scores: Dict[str, Dict[str, float]] | None = None, branch_id: int | None = None, branch_name: str | None = None, exclusion_count: int | None = None, exclusion_rate: float | None = None, model_artifact_id: str | None = None, trace_id: str | None = None) str[source]
Add a single prediction to storage.
Delegates to PredictionStorage component.
- Parameters:
dataset_name – Dataset name
dataset_path – Path to dataset file
config_name – Configuration name
config_path – Path to config file
pipeline_uid – Unique pipeline identifier
step_idx – Pipeline step index
op_counter – Operation counter
model_name – Model name
model_classname – Model class name
model_path – Path to saved model
fold_id – Cross-validation fold ID
sample_indices – Indices of samples used
weights – Sample weights
metadata – Additional metadata
partition – Data partition (train/val/test)
y_true – True labels
y_pred – Predicted labels
y_proba – Class probabilities for classification (shape: n_samples x n_classes)
val_score – Validation score
test_score – Test score
train_score – Training score
metric – Metric name
task_type – Task type (classification/regression)
n_samples – Number of samples
n_features – Number of features
preprocessings – Preprocessing steps applied
best_params – Best hyperparameters
scores – Dictionary of pre-computed scores per partition
branch_id – Branch identifier for pipeline branching (0-indexed)
branch_name – Human-readable branch name
exclusion_count – Number of samples excluded during training (outlier_excluder)
exclusion_rate – Rate of samples excluded (0.0-1.0, outlier_excluder)
model_artifact_id – Deterministic artifact ID for model loading (v2 system)
trace_id – Execution trace ID for deterministic prediction replay (v2 system)
- Returns:
Prediction ID
- add_predictions(dataset_name: str | List[str], dataset_path: str | List[str] = '', config_name: str | List[str] = '', config_path: str | List[str] = '', pipeline_uid: str | None | List[str | None] = None, step_idx: int | List[int] = 0, op_counter: int | List[int] = 0, model_name: str | List[str] = '', model_classname: str | List[str] = '', model_path: str | List[str] = '', fold_id: str | None | List[str | None] = None, sample_indices: List[int] | None | List[List[int] | None] = None, weights: List[float] | None | List[List[float] | None] = None, metadata: Dict[str, Any] | None | List[Dict[str, Any] | None] = None, partition: str | List[str] = '', y_true: ndarray | None | List[ndarray | None] = None, y_pred: ndarray | None | List[ndarray | None] = None, val_score: float | None | List[float | None] = None, test_score: float | None | List[float | None] = None, train_score: float | None | List[float | None] = None, metric: str | List[str] = 'mse', task_type: str | List[str] = 'regression', n_samples: int | List[int] = 0, n_features: int | List[int] = 0, preprocessings: str | List[str] = '', best_params: Dict[str, Any] | None | List[Dict[str, Any] | None] = None, scores: Dict[str, Dict[str, float]] | None | List[Dict[str, Dict[str, float]] | None] = None, branch_id: int | None | List[int | None] = None, branch_name: str | None | List[str | None] = None, trace_id: str | None | List[str | None] = None) None[source]
Add multiple predictions to storage (batch operation).
For each parameter, if it’s a single value it will be broadcast to all predictions. If it’s a list, each index corresponds to one prediction.
- Parameters:
Same as add_prediction, but each parameter can be either a single value or a list.
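The broadcasting rule described for add_predictions (single values are repeated, lists are indexed per prediction) can be sketched as follows. This is an illustration only: `broadcast_params` is a hypothetical helper, and it treats every list as per-prediction, which would be ambiguous for parameters that are themselves lists (such as sample_indices).

```python
def broadcast_params(**params):
    """Expand scalar parameters so every prediction gets a full keyword set."""
    lengths = {len(v) for v in params.values() if isinstance(v, list)}
    if len(lengths) > 1:
        raise ValueError(f"List parameters have different lengths: {sorted(lengths)}")
    n = lengths.pop() if lengths else 1
    return [
        {k: (v[i] if isinstance(v, list) else v) for k, v in params.items()}
        for i in range(n)
    ]


rows = broadcast_params(
    dataset_name="wheat",           # scalar: shared by all predictions
    model_name=["PLS", "RF"],       # list: one value per prediction
    partition="test",
    test_score=[0.85, 0.80],
)
```

Each dictionary in `rows` could then be passed to a single-prediction call as keyword arguments.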
- static aggregate(y_pred: ndarray, group_ids: ndarray, y_proba: ndarray | None = None, y_true: ndarray | None = None, method: str = 'mean', exclude_outliers: bool = False, outlier_threshold: float = 0.95) Dict[str, Any][source]
Aggregate predictions by group (e.g., same sample ID with multiple measurements).
For datasets with multiple samples per target (e.g., 4 measurements for each sample ID), this function averages predictions within each group to produce one prediction per group.
For regression: averages y_pred values within each group.
For classification: averages y_proba (if available) then takes argmax, or uses majority voting on y_pred if no probabilities.
- Parameters:
y_pred – Predicted values array (n_samples,) or (n_samples, 1)
group_ids – Group identifiers array (n_samples,) - samples with same ID are grouped
y_proba – Optional class probabilities array (n_samples, n_classes) for classification
y_true – Optional true values array (n_samples,) for computing aggregated ground truth
method – Aggregation method - ‘mean’ (default), ‘median’, ‘vote’ (for classification)
exclude_outliers – If True, exclude outliers within each group before aggregation using Hotelling’s T² statistic. Useful when some measurements are anomalous.
outlier_threshold – Confidence level for T² outlier detection (default 0.95). Measurements with T² > chi2.ppf(threshold, 1) are excluded.
- Returns:
Dictionary containing:
'y_pred': Aggregated predictions (n_groups,)
'y_proba': Aggregated probabilities (n_groups, n_classes) if input had y_proba
'y_true': Aggregated true values (n_groups,) if input had y_true
'group_ids': Unique group identifiers (n_groups,)
'group_sizes': Number of samples per group (n_groups,)
'outliers_excluded': Number of outliers excluded per group (if exclude_outliers=True)
Examples
>>> # Aggregate 4 samples per ID for regression
>>> result = Predictions.aggregate(y_pred, sample_ids)
>>> aggregated_pred = result['y_pred']  # One prediction per unique ID
>>>
>>> # Aggregate for classification with probabilities
>>> result = Predictions.aggregate(y_pred, sample_ids, y_proba=proba)
>>> aggregated_proba = result['y_proba']  # Averaged probabilities
>>>
>>> # Aggregate with outlier exclusion
>>> result = Predictions.aggregate(y_pred, sample_ids, exclude_outliers=True)
>>> print(f"Outliers excluded: {result['outliers_excluded'].sum()}")
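The grouping logic described for aggregate can be illustrated without NumPy. The sketch below covers the regression case (mean or median per group) and the classification case (average the probabilities, then take the argmax). It deliberately omits majority voting and the T² outlier exclusion. `aggregate_by_group` is a hypothetical stand-in, not the library function.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_by_group(y_pred, group_ids, y_proba=None, method="mean"):
    """Collapse repeated measurements into one prediction per group."""
    grouped = defaultdict(list)
    for i, gid in enumerate(group_ids):
        grouped[gid].append(i)
    out = {"group_ids": [], "group_sizes": [], "y_pred": []}
    if y_proba is not None:
        out["y_proba"] = []
    reducer = median if method == "median" else mean
    for gid, idx in grouped.items():
        out["group_ids"].append(gid)
        out["group_sizes"].append(len(idx))
        if y_proba is not None:
            # Classification: average class probabilities, then take the argmax.
            n_classes = len(y_proba[0])
            avg = [mean(y_proba[i][c] for i in idx) for c in range(n_classes)]
            out["y_proba"].append(avg)
            out["y_pred"].append(max(range(n_classes), key=avg.__getitem__))
        else:
            # Regression: reduce the raw predictions within the group.
            out["y_pred"].append(reducer(y_pred[i] for i in idx))
    return out


reg = aggregate_by_group([1.0, 3.0, 2.0, 4.0], ["a", "a", "b", "b"])
cls = aggregate_by_group(
    [0, 1, 1], ["s1", "s1", "s1"],
    y_proba=[[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]],
)
```

In the classification example two of the three measurements predict class 1, yet the averaged probabilities favour class 0. This shows how probability averaging can disagree with majority voting.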
- archive_to_catalog(catalog_dir: Path, pipeline_dir: Path, metrics: Dict[str, Any] = None) str[source]
Archive pipeline predictions to catalog.
Loads predictions CSV from pipeline directory, adds metadata, and saves to catalog.
Delegates to PredictionStorage for CSV loading.
- Parameters:
catalog_dir – Catalog directory for storage
pipeline_dir – Pipeline directory containing predictions.csv
metrics – Optional metadata dict to add to predictions
- Returns:
Generated prediction ID
- clear_caches() None[source]
Clear all internal caches.
Call this when the underlying data has been modified to ensure fresh results are computed. This clears:
- Ranker’s aggregation cache (cached aggregated y_true/y_pred)
- Ranker’s score cache (cached metric scores)
Examples
>>> predictions.add_prediction(...)  # Add new data
>>> predictions.clear_caches()  # Clear to ensure fresh results
- compare_across_datasets(pipeline_hash: str, metric: str = 'test_score') DataFrame[source]
Compare a pipeline’s performance across multiple datasets.
Delegates to CatalogQueryEngine component.
- Parameters:
pipeline_hash – Pipeline UID to compare
metric – Metric column to compare
- Returns:
DataFrame with one row per dataset
- filter_by_branch(branch_id: int | None = None, branch_name: str | None = None, include_no_branch: bool = False, load_arrays: bool = True) List[Dict[str, Any]][source]
Filter predictions by branch context.
Convenience method for meta-model stacking to retrieve predictions from a specific branch in branched pipelines.
- Parameters:
branch_id – Branch ID to filter by.
branch_name – Branch name to filter by.
include_no_branch – If True, include predictions with no branch info.
load_arrays – If True, load actual arrays from registry.
- Returns:
List of predictions from the specified branch.
Examples
>>> # Get predictions from branch 0
>>> branch_preds = predictions.filter_by_branch(branch_id=0)
>>> # Get predictions from named branch
>>> branch_preds = predictions.filter_by_branch(branch_name='preprocessing_a')
- filter_by_criteria(dataset_name: str | None = None, date_range: Tuple[str, str] | None = None, metric_thresholds: Dict[str, float] | None = None) DataFrame[source]
Filter predictions by multiple criteria (catalog query).
Delegates to CatalogQueryEngine component.
- Parameters:
dataset_name – Filter by dataset name
date_range – Tuple of (start_date, end_date)
metric_thresholds – Dict of metric names to threshold values
- Returns:
Filtered DataFrame
- filter_predictions(dataset_name: str | None = None, partition: str | None = None, config_name: str | None = None, model_name: str | None = None, fold_id: str | None = None, step_idx: int | None = None, branch_id: int | None = None, branch_name: str | None = None, load_arrays: bool = True, **kwargs) List[Dict[str, Any]][source]
Filter predictions and return as list of dictionaries.
Delegates to PredictionIndexer for filtering, then deserializes results. Supports lazy loading of arrays for performance optimization.
- Parameters:
dataset_name – Filter by dataset name
partition – Filter by partition
config_name – Filter by config name
model_name – Filter by model name
fold_id – Filter by fold ID
step_idx – Filter by step index
branch_id – Filter by branch ID (for pipeline branching)
branch_name – Filter by branch name (for pipeline branching)
load_arrays – If True, loads actual arrays from registry (slower). If False, returns metadata only with array references (fast).
**kwargs – Additional filter criteria
- Returns:
List of prediction dictionaries with deserialized numpy arrays (if load_arrays=True) or metadata with array_id references (if load_arrays=False)
Examples
>>> # Fast metadata-only query
>>> preds = predictions.filter_predictions(dataset_name="wheat", load_arrays=False)
>>> # Full query with arrays
>>> preds = predictions.filter_predictions(dataset_name="wheat", load_arrays=True)
>>> # Filter by branch
>>> branch_preds = predictions.filter_predictions(branch_id=0)
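The load_arrays switch trades completeness for speed: metadata rows carry only a reference to each array, and the reference is resolved against a registry on request. The sketch below shows that pattern with plain dictionaries. The `y_pred_array_id` key, the registry layout, and `filter_rows` are assumptions made for illustration, not the library's storage schema.

```python
# Hypothetical in-memory stand-ins for the metadata table and the array registry.
metadata = [
    {"id": "p1", "dataset_name": "wheat", "y_pred_array_id": "a1"},
    {"id": "p2", "dataset_name": "corn", "y_pred_array_id": "a2"},
]
array_registry = {"a1": [1.1, 2.2, 3.3], "a2": [0.4, 0.5]}


def filter_rows(rows, registry, load_arrays=True, **criteria):
    """Filter on metadata; resolve array references only when asked to."""
    matches = [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]
    if not load_arrays:
        # Fast path: return copies of the metadata, references untouched.
        return [dict(r) for r in matches]
    resolved = []
    for row in matches:
        full = dict(row)
        full["y_pred"] = registry[full.pop("y_pred_array_id")]
        resolved.append(full)
    return resolved


light = filter_rows(metadata, array_registry, load_arrays=False, dataset_name="wheat")
full = filter_rows(metadata, array_registry, dataset_name="wheat")
```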
- get_best(metric: str = '', ascending: bool | None = None, aggregate_partitions: bool = False, **filters) PredictionResult | None[source]
Get the best prediction for a specific metric.
Delegates to PredictionRanker component.
- Parameters:
metric – Metric to optimize
ascending – Sort order. If True, sorts ascending (lower is better). If False, sorts descending (higher is better). If None, infers from metric.
aggregate_partitions – If True, add partition data
**filters – Additional filter criteria
- Returns:
Best prediction or None
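When ascending is None, the sort direction is inferred from the metric. A minimal sketch of that inference follows. It assumes error-style metrics (MSE, RMSE, MAE) are minimized and everything else is maximized. The `LOWER_IS_BETTER` set and `pick_best` are hypothetical, and the library's actual rule may cover more metrics.

```python
# Assumed set of error-style metrics where a lower value is better.
LOWER_IS_BETTER = {"mse", "rmse", "mae"}


def pick_best(entries, metric, ascending=None):
    """Return the entry with the best value for `metric`, or None if empty."""
    if not entries:
        return None
    if ascending is None:
        # Infer the direction from the metric name.
        ascending = metric.lower() in LOWER_IS_BETTER
    chooser = min if ascending else max
    return chooser(entries, key=lambda e: e[metric])


entries = [
    {"model_name": "PLS", "rmse": 0.50, "r2": 0.91},
    {"model_name": "RF", "rmse": 0.45, "r2": 0.88},
]
best_rmse = pick_best(entries, "rmse")   # lower is better
best_r2 = pick_best(entries, "r2")       # higher is better
```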
- get_cache_stats() Dict[str, Any][source]
Get cache statistics for debugging performance.
Returns a dictionary with hit rates and sizes for:
- aggregation_cache: Cached aggregated arrays
- score_cache: Cached metric scores
- Returns:
Dictionary with cache statistics
Examples
>>> stats = predictions.get_cache_stats()
>>> print(f"Aggregation cache hit rate: {stats['aggregation_cache']['hit_rate']:.1%}")
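For readers unfamiliar with hit-rate statistics, the sketch below shows a small cache that counts hits and misses and reports a `hit_rate` like the one read in the example above. `CountingCache` and its `size` key are hypothetical; the real caches live inside the ranker and may expose different fields.

```python
class CountingCache:
    """Dictionary-backed cache that tracks hits and misses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]

    def clear(self):
        # Drop cached values so later lookups recompute from fresh data.
        self._store.clear()

    def stats(self):
        total = self.hits + self.misses
        return {"size": len(self._store), "hit_rate": self.hits / total if total else 0.0}


cache = CountingCache()
for key in ["rmse:val", "rmse:val", "r2:val", "rmse:val"]:
    cache.get_or_compute(key, lambda: 0.0)
stats = cache.stats()
```

Calling `clear()` after the underlying data changes plays the same role as clear_caches: stale values are discarded and the next lookup recomputes them.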
- get_configs() List[str][source]
Get list of unique config names.
Delegates to PredictionIndexer component.
- Returns:
List of config names
- get_datasets() List[str][source]
Get list of unique dataset names.
Delegates to PredictionIndexer component.
- Returns:
List of dataset names
- get_entry_partitions(entry: Dict) Dict[str, Dict | None][source]
Get all partition data for an entry.
- Parameters:
entry – Prediction entry dictionary
- Returns:
Dictionary with ‘train’, ‘val’, ‘test’ keys containing partition data
- get_folds() List[str][source]
Get list of unique fold IDs.
Delegates to PredictionIndexer component.
- Returns:
List of fold IDs
- get_models() List[str][source]
Get list of unique model names.
Delegates to PredictionIndexer component.
- Returns:
List of model names
- get_models_before_step(step_idx: int, branch_id: int | None = None, unique_names: bool = True) List[str][source]
Get model names from steps before a given step index.
Convenience method for meta-model stacking to identify source models that can be used for stacking.
- Parameters:
step_idx – Current step index (models before this are returned).
branch_id – Optional filter by branch ID.
unique_names – If True, return unique model names only.
- Returns:
List of model names from previous steps.
Examples
>>> # Get models available for stacking at step 5
>>> source_models = predictions.get_models_before_step(step_idx=5)
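The selection rule (models from strictly earlier steps, optionally restricted to one branch, optionally de-duplicated) can be sketched over plain rows. `models_before_step` below is a hypothetical stand-in that assumes each row carries `model_name`, `step_idx` and `branch_id` keys.

```python
def models_before_step(rows, step_idx, branch_id=None, unique_names=True):
    """Collect model names from rows whose step index is below `step_idx`."""
    names = [
        r["model_name"]
        for r in rows
        if r["step_idx"] < step_idx
        and (branch_id is None or r.get("branch_id") == branch_id)
    ]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(names)) if unique_names else names


rows = [
    {"model_name": "PLS", "step_idx": 2, "branch_id": 0},
    {"model_name": "RF", "step_idx": 3, "branch_id": 1},
    {"model_name": "PLS", "step_idx": 4, "branch_id": 0},
    {"model_name": "Ridge", "step_idx": 6, "branch_id": 0},
]
available = models_before_step(rows, step_idx=5)
```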
- get_oof_predictions(model_name: str | None = None, step_idx: int | None = None, branch_id: int | None = None, exclude_averaged: bool = True, load_arrays: bool = True) List[Dict[str, Any]][source]
Get out-of-fold (validation partition) predictions.
Convenience method for meta-model stacking to retrieve OOF predictions that can be used to construct training features without data leakage.
- Parameters:
model_name – Optional filter by model name.
step_idx – Optional filter by step index.
branch_id – Optional filter by branch ID.
exclude_averaged – If True, exclude ‘avg’ and ‘w_avg’ fold entries. Default True for OOF reconstruction.
load_arrays – If True, load actual arrays from registry.
- Returns:
List of validation partition predictions.
Examples
>>> # Get all OOF predictions
>>> oof = predictions.get_oof_predictions()
>>> # Get OOF predictions for a specific model
>>> oof = predictions.get_oof_predictions(model_name='PLS')
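Out-of-fold predictions are useful for stacking because each sample's prediction comes from a model that never saw that sample during training. The sketch below shows how per-fold validation predictions might be reassembled into one vector, skipping the averaged ‘avg’ and ‘w_avg’ entries. `reconstruct_oof` and the row layout are hypothetical and assume each entry carries `sample_indices`.

```python
def reconstruct_oof(fold_predictions, n_samples):
    """Place each fold's validation predictions back at their sample indices."""
    oof = [None] * n_samples
    for pred in fold_predictions:
        if pred["fold_id"] in ("avg", "w_avg"):
            continue  # averaged entries are not out-of-fold
        for index, value in zip(pred["sample_indices"], pred["y_pred"]):
            oof[index] = value
    return oof


folds = [
    {"fold_id": "0", "sample_indices": [0, 2], "y_pred": [1.0, 3.1]},
    {"fold_id": "1", "sample_indices": [1, 3], "y_pred": [2.2, 3.9]},
    {"fold_id": "avg", "sample_indices": [0, 1, 2, 3], "y_pred": [9, 9, 9, 9]},
]
oof = reconstruct_oof(folds, n_samples=4)
```

The resulting vector can serve as one training feature column for a meta-model. Any position left as None would indicate a sample that no fold validated.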
- get_partitions() List[str][source]
Get list of unique partitions.
Delegates to PredictionIndexer component.
- Returns:
List of partitions
- get_prediction_by_id(prediction_id: str, load_arrays: bool = True) Dict[str, Any] | None[source]
Get a single prediction by its ID using direct lookup.
This is an O(1) lookup that avoids iterating over all predictions, so it is much faster than filter_predictions for ID lookups.
- Parameters:
prediction_id – Unique prediction identifier (hash ID)
load_arrays – If True, loads actual arrays from registry (slower). If False, returns metadata only with array references (fast).
- Returns:
Prediction dictionary or None if not found
Examples
>>> pred = predictions.get_prediction_by_id("abc123def456")
>>> if pred:
...     print(f"Found model: {pred['model_name']}")
- get_predictions_by_step(step_idx: int, partition: str | None = None, branch_id: int | None = None, load_arrays: bool = True, **kwargs) List[Dict[str, Any]][source]
Get predictions from a specific pipeline step.
Convenience method for meta-model stacking to retrieve predictions from source models at a specific step index.
- Parameters:
step_idx – Pipeline step index to filter by.
partition – Optional partition filter (‘train’, ‘val’, ‘test’).
branch_id – Optional branch ID filter.
load_arrays – If True, load actual arrays from registry.
**kwargs – Additional filter criteria.
- Returns:
List of prediction dictionaries from the specified step.
Examples
>>> # Get all predictions from step 2
>>> preds = predictions.get_predictions_by_step(step_idx=2)
>>> # Get validation predictions from step 2
>>> val_preds = predictions.get_predictions_by_step(
...     step_idx=2, partition='val'
... )
- get_similar(**filter_kwargs) Dict[str, Any] | None[source]
Get the first prediction matching filter criteria.
- Parameters:
**filter_kwargs – Filter criteria (same as filter_predictions)
- Returns:
First matching prediction or None
- get_summary_stats(metric: str = 'test_score') Dict[str, float][source]
Get summary statistics for a metric.
Delegates to CatalogQueryEngine component.
- Parameters:
metric – Metric column name
- Returns:
Dictionary with min, max, mean, median, std
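The returned dictionary has five keys. A standalone sketch of computing them with the standard library is shown below. `summary_stats` is hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption; the library may use the population form.

```python
from statistics import mean, median, stdev


def summary_stats(values):
    """Summarize a list of scores with the five keys named above."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values),  # sample standard deviation (n - 1 denominator)
    }


stats = summary_stats([0.80, 0.85, 0.90])
```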
- get_unique_values(column: str) List[str][source]
Get unique values for a specific column.
Delegates to PredictionIndexer component.
- Parameters:
column – Column name
- Returns:
List of unique values
- list_runs(dataset_name: str | None = None) DataFrame[source]
List all prediction runs with summary information.
Delegates to CatalogQueryEngine component.
- Parameters:
dataset_name – Filter by dataset name (None for all)
- Returns:
DataFrame with run summary
- classmethod load(dataset_name: str | None = None, path: str = 'results', aggregate_partitions: bool = False, **filters) Predictions[source]
Load predictions from results directory structure.
- Parameters:
dataset_name – Name of dataset to load (None for all)
path – Base path to search for predictions
aggregate_partitions – If True, aggregate partition data
**filters – Additional filter criteria
- Returns:
Predictions instance with loaded data
- load_from_file(filepath: str, merge: bool = True) None[source]
Load predictions from split Parquet format.
Supports: - Split Parquet with array registry (.meta.parquet + .arrays.parquet)
When called multiple times (e.g., from __init__ with multiple files), predictions are merged by default.
- Parameters:
filepath – Path to .meta.parquet file
merge – If True and storage already has data, merge loaded data. If False, replace existing data. (default: True)
Examples
>>> predictions.load_from_file("predictions.meta.parquet")
>>> # Load additional predictions (merged)
>>> predictions.load_from_file("more_predictions.meta.parquet")
- classmethod load_from_file_cls(filepath: str) Predictions[source]
Load predictions from JSON file as class method.
- Parameters:
filepath – Input file path
- Returns:
Predictions instance with loaded data (empty if file doesn’t exist)
- classmethod load_from_parquet(catalog_dir: Path, prediction_ids: list = None) Predictions[source]
Load predictions from split Parquet storage.
- Parameters:
catalog_dir – Path to catalog directory
prediction_ids – Optional list of prediction IDs to load
- Returns:
Predictions instance with loaded data
- classmethod merge_parquet_files(input_files: List[str], output_file: str, deduplicate: bool = True) Predictions[source]
Merge multiple prediction parquet files into a single output file.
This is a utility method to consolidate predictions from multiple experiment runs into a single file for easier analysis.
- Parameters:
input_files – List of paths to .meta.parquet files to merge.
output_file – Output path for the merged .meta.parquet file.
deduplicate – If True, remove duplicate prediction IDs (keep first). Default is True.
- Returns:
Predictions instance containing the merged data.
- Raises:
ValueError – If no input files are provided.
FileNotFoundError – If any input file does not exist.
Examples
>>> # Merge multiple experiment runs
>>> merged = Predictions.merge_parquet_files(
...     input_files=[
...         "run1/predictions.meta.parquet",
...         "run2/predictions.meta.parquet",
...         "run3/predictions.meta.parquet"
...     ],
...     output_file="combined/all_predictions.meta.parquet"
... )
>>> print(f"Merged {len(merged)} predictions")
>>>
>>> # Merge without deduplication
>>> merged = Predictions.merge_parquet_files(
...     input_files=["exp1.meta.parquet", "exp2.meta.parquet"],
...     output_file="merged.meta.parquet",
...     deduplicate=False
... )
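The deduplicate option keeps the first occurrence of each prediction ID. The sketch below shows that keep-first rule on plain lists of rows, without reading Parquet files. `merge_keep_first` is a hypothetical helper and the rows are made-up examples.

```python
def merge_keep_first(*batches):
    """Concatenate batches, dropping later rows whose ID was already seen."""
    seen = set()
    merged = []
    for batch in batches:
        for row in batch:
            if row["id"] in seen:
                continue  # a row with this ID was already kept
            seen.add(row["id"])
            merged.append(row)
    return merged


run1 = [{"id": "a", "test_score": 0.85}, {"id": "b", "test_score": 0.80}]
run2 = [{"id": "b", "test_score": 0.99}, {"id": "c", "test_score": 0.70}]
merged = merge_keep_first(run1, run2)
```

Because the first occurrence wins, the order of the input files determines which copy of a duplicated prediction survives.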
- merge_predictions(other: Predictions) None[source]
Merge predictions from another Predictions instance.
Delegates to PredictionStorage component.
- Parameters:
other – Another Predictions instance to merge
- classmethod pred_long_string(entry: Dict, metrics: List[str] | None = None) str[source]
Generate long string representation of a prediction.
- Parameters:
entry – Prediction dictionary
metrics – Optional list of metrics to display
- Returns:
Long description string with config
- classmethod pred_short_string(entry: Dict, metrics: List[str] | None = None, partition: str | List[str] = 'test') str[source]
Generate short string representation of a prediction.
- Parameters:
entry – Prediction dictionary
metrics – Optional list of metrics to display
- Returns:
Short description string
- query_best(dataset_name: str | None = None, metric: str = 'test_score', n: int = 10, ascending: bool = False) DataFrame[source]
Query for best performing pipelines by metric (catalog query).
Delegates to CatalogQueryEngine component.
- Parameters:
dataset_name – Filter by dataset name
metric – Metric column to rank by
n – Number of top results
ascending – If True, lower scores rank higher
- Returns:
DataFrame with top n predictions
- static save_all_to_csv(predictions: Predictions, path: str = 'results', aggregate_partitions: bool = False, **filters) None[source]
Save all predictions to CSV files.
- Parameters:
predictions – Predictions instance
path – Base path for saving
aggregate_partitions – If True, save one file per model with all partitions
**filters – Additional filter criteria
- static save_predictions_to_csv(y_true: ndarray | List[float] | None = None, y_pred: ndarray | List[float] | None = None, filepath: str = '', prefix: str = '', suffix: str = '') None[source]
Save y_true and y_pred arrays to a CSV file.
- Parameters:
y_true – True values array
y_pred – Predicted values array
filepath – Output CSV file path
prefix – Optional prefix for column names
suffix – Optional suffix for column names
- save_to_file(filepath: str, format: str = 'parquet') None[source]
Save predictions to split Parquet format with array registry.
- Parameters:
filepath – Output file path (should end with .meta.parquet)
format – Format to use (only “parquet” is supported)
Examples
>>> predictions.save_to_file("predictions.meta.parquet")
- save_to_parquet(catalog_dir: Path, prediction_id: str = None) tuple[source]
Save predictions as split Parquet (metadata + arrays separate).
Appends to existing files if they exist.
Delegates to PredictionStorage component.
- Parameters:
catalog_dir – Directory for catalog storage
prediction_id – Optional prediction ID (generates UUID if None)
- Returns:
Tuple of (meta_path, data_path)
- to_dicts(load_arrays: bool = True) List[Dict[str, Any]][source]
Get predictions as list of dictionaries.
- Parameters:
load_arrays – If True, hydrate array references with actual arrays. If False, returns metadata with array IDs only (faster).
- Returns:
List of prediction dictionaries
- top(n: int, rank_metric: str = '', rank_partition: str = 'val', display_metrics: List[str] | None = None, display_partition: str = 'test', aggregate_partitions: bool = False, ascending: bool | None = None, group_by_fold: bool = False, aggregate: str | None = None, group_by: str | List[str] | None = None, best_per_model: bool = False, return_grouped: bool = False, **filters) PredictionResultsList | Dict[Tuple, PredictionResultsList][source]
Get top n models ranked by a metric on a specific partition.
Delegates to PredictionRanker component.
- Parameters:
n – Number of top models to return. When group_by is used, this means top N per group (e.g., top 3 per dataset).
rank_metric – Metric to rank by (if empty, uses record’s metric or val_score)
rank_partition – Partition to rank on (default: “val”)
display_metrics – Metrics to compute for display (default: task_type defaults)
display_partition – Partition to display results from (default: “test”)
aggregate_partitions – If True, add train/val/test nested dicts in results
ascending – Sort order. If True, sorts ascending (lower is better). If False, sorts descending (higher is better). If None, infers from metric.
group_by_fold – If True, include fold_id in model identity (rank per fold)
aggregate – If provided, aggregate predictions by this metadata column or ‘y’. When ‘y’, groups by y_true values. When a column name (e.g., ‘ID’), groups by that metadata column. Aggregated predictions have recalculated metrics.
group_by – Group predictions by column(s). Can be a single column name (str) or a list of columns, e.g. ‘dataset_name’ or [‘model_name’, ‘dataset_name’]. When provided:
  - Returns top N results per group (not N total)
  - Each result includes a ‘group_key’ field for easy filtering
best_per_model – DEPRECATED - Use group_by=[‘model_name’] instead. If True, keep only the best prediction per model_name.
return_grouped – If True and group_by is set, return a dict mapping group keys to PredictionResultsList instead of a flat list. Default: False (returns flat list sorted by global rank).
**filters – Additional filter criteria (dataset_name, config_name, etc.)
- Returns:
If return_grouped=False (default): PredictionResultsList containing top n models per group, sorted by rank_metric. Each result includes ‘group_key’.
If return_grouped=True: Dict mapping group keys (tuples) to PredictionResultsList, one list per group with top n results each.
- Return type:
PredictionResultsList | Dict[Tuple, PredictionResultsList]
Examples
>>> # Top 3 per dataset (flat list)
>>> top_per_ds = predictions.top(n=3, group_by='dataset_name')
>>> # Filter by group_key
>>> ds1_results = [r for r in top_per_ds if r['group_key'] == ('dataset1',)]
>>>
>>> # Top 3 per dataset (grouped dict)
>>> grouped = predictions.top(n=3, group_by='dataset_name', return_grouped=True)
>>> for key, results in grouped.items():
...     print(f"{key}: {len(results)} results")
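The grouping semantics described for group_by and return_grouped can be sketched without the library. This is only an illustration under stated assumptions: `top_per_group` is a hypothetical helper, records are plain dicts, and the actual ranking logic of PredictionRanker may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def top_per_group(
    records: List[Dict[str, Any]],
    n: int,
    group_by: List[str],
    metric: str,
    ascending: bool = False,
) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Return the n best records per group, keyed by a tuple of group values."""
    groups: Dict[Tuple, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        key = tuple(record[col] for col in group_by)
        groups[key].append({**record, "group_key": key})
    return {
        key: sorted(items, key=lambda r: r[metric], reverse=not ascending)[:n]
        for key, items in groups.items()
    }


records = [
    {"dataset_name": "dataset1", "model_name": "pls", "r2": 0.90},
    {"dataset_name": "dataset1", "model_name": "svr", "r2": 0.85},
    {"dataset_name": "dataset1", "model_name": "rf", "r2": 0.70},
    {"dataset_name": "dataset2", "model_name": "pls", "r2": 0.60},
]

# Grouped form: one list per group key (a tuple, even for a single column).
grouped = top_per_group(records, n=2, group_by=["dataset_name"], metric="r2")

# Flat form: all kept records in one list, sorted by the metric.
flat = sorted(
    (r for results in grouped.values() for r in results),
    key=lambda r: r["r2"],
    reverse=True,
)
```

Note that n applies per group here, so the flat list holds three records rather than two.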
- class nirs4all.data.RoleAssigner(case_sensitive: bool = True, allow_overlap: bool = False)[source]
Bases: object
Assign columns to data roles (features, targets, metadata).
Validates that:
  - No column is assigned to multiple roles
  - At least features are assigned
  - Indices are valid
Supports the same column selection syntax as ColumnSelector.
Example
>>> assigner = RoleAssigner()
>>> result = assigner.assign(df, {
...     "features": "2:-1",   # All columns except first 2 and last
...     "targets": -1,        # Last column
...     "metadata": [0, 1]    # First 2 columns
... })
- assign(df: DataFrame, roles: Dict[str, int | str | List[int] | List[str] | Dict[str, Any] | slice | None]) RoleAssignmentResult[source]
Assign columns to roles.
- Parameters:
df – The DataFrame to assign roles from.
roles – Dictionary mapping role names to column selections. Supported roles: “features”, “targets”, “metadata”. Also accepts “x” (alias for features) and “y” (alias for targets).
- Returns:
RoleAssignmentResult with separated DataFrames.
- Raises:
RoleAssignmentError – If assignment is invalid (overlap, missing features).
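The validation rules above (no overlap, features required) can be illustrated with a minimal sketch that resolves the selections from the class example to column positions. It assumes Python slice semantics for the "start:stop" string and uses hypothetical helpers (`resolve`, `assign_roles`) and a plain ValueError; the library's own selector syntax and error type are richer.

```python
from typing import Dict, List, Union

Selection = Union[int, str, List[int]]


def resolve(selection: Selection, n_columns: int) -> List[int]:
    """Turn an int, a list of ints, or a "start:stop" string into positions."""
    positions = list(range(n_columns))
    if isinstance(selection, int):
        return [positions[selection]]  # negative indices count from the end
    if isinstance(selection, str):
        start, stop = selection.split(":")
        return positions[slice(int(start) if start else None,
                               int(stop) if stop else None)]
    return [positions[i] for i in selection]


def assign_roles(roles: Dict[str, Selection], n_columns: int) -> Dict[str, List[int]]:
    """Resolve every role and reject columns claimed by more than one role."""
    resolved = {role: resolve(sel, n_columns) for role, sel in roles.items()}
    if not resolved.get("features"):
        raise ValueError("at least the 'features' role must be assigned")
    owner: Dict[int, str] = {}
    for role, columns in resolved.items():
        for col in columns:
            if col in owner:
                raise ValueError(
                    f"column {col} is in both {owner[col]!r} and {role!r}"
                )
            owner[col] = role
    return resolved


roles = {"features": "2:-1", "targets": -1, "metadata": [0, 1]}
result = assign_roles(roles, n_columns=6)
```

For a six-column table this yields columns 2 to 4 as features, column 5 as the target, and columns 0 and 1 as metadata.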
- assign_auto(df: DataFrame, target_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None = None, metadata_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None = None) RoleAssignmentResult[source]
Auto-assign roles with specified targets and metadata.
Features are automatically set to all remaining columns.
- Parameters:
df – The DataFrame to assign roles from.
target_columns – Column selection for targets (Y).
metadata_columns – Column selection for metadata.
- Returns:
RoleAssignmentResult with separated DataFrames.
- extract_y_from_x(df: DataFrame, y_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None) RoleAssignmentResult[source]
Extract target columns from a features DataFrame.
This is useful when Y columns are embedded in the X data.
- Parameters:
df – DataFrame containing both features and targets.
y_columns – Column selection for targets to extract.
- Returns:
RoleAssignmentResult with features (remaining) and targets (extracted).
- validate_roles(df: DataFrame, roles: Dict[str, int | str | List[int] | List[str] | Dict[str, Any] | slice | None]) List[str][source]
Validate a role specification without performing assignment.
- Parameters:
df – The DataFrame to validate against.
roles – Role specification to validate.
- Returns:
List of warning messages (empty if no warnings).
- Raises:
RoleAssignmentError – If role specification is invalid.
- exception nirs4all.data.RoleAssignmentError[source]
Bases: Exception
Raised when role assignment fails.
- exception nirs4all.data.RowSelectionError[source]
Bases: Exception
Raised when row selection fails.
- class nirs4all.data.RowSelector(default_random_state: int | None = None)[source]
Bases: object
Flexible row selector for DataFrames.
Supports multiple selection methods:
  - All rows: None
  - By index: [0, 1, 2] or 0
  - By range: “0:100” (slice syntax as string)
  - By percentage: “0:80%” or “80%:100%”
  - By condition: {“where”: {“column”: “quality”, “op”: “>”, “value”: 0.5}}
  - Random sample: {“sample”: 100, “random_state”: 42}
  - Stratified sample: {“sample”: 100, “stratify”: “class”, “random_state”: 42}
  - Head/Tail: {“head”: 100} or {“tail”: 50}
Example
>>> selector = RowSelector()
>>> result = selector.select(df, "0:80%")
>>> print(len(result.data))  # 80% of rows
- OPERATORS: Dict[str, Callable[[Any, Any], bool]]
Maps operator names to comparison functions. Supported keys: '!=', '<', '<=', '==', '>', '>=', 'contains', 'endswith', 'in', 'isna', 'not in', 'notna', 'regex', 'startswith'.
- select(df: DataFrame, selection: int | str | List[int] | Dict[str, Any] | slice | None) RowSelectionResult[source]
Select rows from a DataFrame.
- Parameters:
df – The DataFrame to select rows from.
selection – Row selection specification. Can be:
  - None: Select all rows
  - int: Single row index
  - str: Range string (“0:100”) or percentage (“0:80%”)
  - List[int]: List of row indices
  - Dict: Complex selection (see class docstring)
- Returns:
RowSelectionResult with indices, mask, and selected data.
- Raises:
RowSelectionError – If selection is invalid or rows not found.
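As a rough illustration of two of the selection forms, the sketch below resolves a percentage range and a "where" condition to row positions. The helper names and the rounding rule for percentages (truncation) are assumptions, and the operator table is a reduced stand-in for the documented OPERATORS mapping.

```python
import operator
from typing import Any, Callable, Dict, List

# A reduced operator table; the documented OPERATORS mapping has more keys.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    "in": lambda value, choices: value in choices,
}


def parse_bound(text: str, n_rows: int, default: int) -> int:
    """Convert "80%" or "100" (or an empty string) into a row position."""
    if not text:
        return default
    if text.endswith("%"):
        return int(n_rows * float(text[:-1]) / 100)  # assumed: truncate
    return int(text)


def select_range(n_rows: int, spec: str) -> List[int]:
    """Resolve a "start:stop" string whose bounds may be percentages."""
    start, stop = spec.split(":")
    first = parse_bound(start, n_rows, 0)
    last = min(parse_bound(stop, n_rows, n_rows), n_rows)
    return list(range(first, last))


def select_where(rows: List[Dict[str, Any]], where: Dict[str, Any]) -> List[int]:
    """Return positions of rows satisfying a column/op/value condition."""
    test = OPS[where["op"]]
    return [
        i for i, row in enumerate(rows)
        if test(row[where["column"]], where["value"])
    ]


rows = [{"quality": q} for q in (0.2, 0.9, 0.6, 0.4, 0.7)]
train_idx = select_range(len(rows), "0:80%")
test_idx = select_range(len(rows), "80%:100%")
good_idx = select_where(rows, {"column": "quality", "op": ">", "value": 0.5})
```

Using complementary ranges such as "0:80%" and "80%:100%" gives two disjoint row sets that together cover the table.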
- class nirs4all.data.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')[source]
Bases: object
Link samples across multiple data files by key column.
Supports multiple linking modes:
  - “inner”: Keep only samples present in all sources (default)
  - “left”: Keep all samples from the first source
  - “outer”: Keep all samples from any source
Example
>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,  # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,   # Has columns: sample_id, target
...         "M": metadata_df,  # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column
- create_sample_index(sources: Dict[str, DataFrame], link_by: str) DataFrame[source]
Create a sample index showing key presence across sources.
- Parameters:
sources – Dictionary of source DataFrames.
link_by – Key column name.
- Returns:
DataFrame with keys as index and boolean columns per source.
- link(sources: Dict[str, DataFrame], link_by: str, keep_key_column: bool = False) LinkingResult[source]
Link multiple data sources by key column.
- Parameters:
sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.
link_by – Name of the column to use for linking.
keep_key_column – Whether to keep the key column in output DataFrames.
- Returns:
LinkingResult with linked DataFrames.
- Raises:
LinkingError – If linking fails (missing key columns, no matches, etc.).
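The three linking modes differ only in which sample keys survive. A minimal sketch of that rule, using plain lists of keys instead of DataFrames and a hypothetical `linked_keys` helper (the ordering of the result is an assumption):

```python
from typing import Dict, List


def linked_keys(sources: Dict[str, List[str]], mode: str = "inner") -> List[str]:
    """Return the sample keys kept under each linking mode, in first-seen order."""
    names = list(sources)
    if mode == "left":
        return list(sources[names[0]])  # everything from the first source
    ordered: List[str] = []
    for keys in sources.values():
        for key in keys:
            if key not in ordered:
                ordered.append(key)  # union of all keys, order preserved
    if mode == "outer":
        return ordered
    if mode == "inner":
        return [k for k in ordered if all(k in keys for keys in sources.values())]
    raise ValueError(f"unknown mode: {mode!r}")


sources = {
    "X": ["s1", "s2", "s3"],
    "Y": ["s2", "s3", "s4"],
}
inner = linked_keys(sources, "inner")
left = linked_keys(sources, "left")
outer = linked_keys(sources, "outer")
```

With "left" and "outer", some sources lack rows for some keys; how those gaps are reported is presumably governed by the on_missing constructor argument.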
- link_aligned(sources: Dict[str, DataFrame], validate: bool = True) Dict[str, DataFrame][source]
Link sources that are already aligned by row index.
This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).
- Parameters:
sources – Dictionary of aligned DataFrames.
validate – Whether to validate that all sources have same row count.
- Returns:
Dictionary of DataFrames (unchanged, just validated).
- Raises:
LinkingError – If validation fails.
- class nirs4all.data.SignalType(value)[source]
Spectral signal types for NIRS/spectroscopy data.
Defines the measurement type of spectral data. String values ensure backward compatibility with config files.
- ABSORBANCE = 'absorbance'
- AUTO = 'auto'
- KUBELKA_MUNK = 'kubelka_munk'
- LOG_1_R = 'log_1_r'
- LOG_1_T = 'log_1_t'
- PREPROCESSED = 'preprocessed'
- REFLECTANCE = 'reflectance'
- REFLECTANCE_PERCENT = 'reflectance%'
- TRANSMITTANCE = 'transmittance'
- TRANSMITTANCE_PERCENT = 'transmittance%'
- UNKNOWN = 'unknown'
- classmethod from_string(value: str) SignalType[source]
Parse signal type from various string representations.
- Parameters:
value – String representation (e.g., “A”, “R”, “%R”, “absorbance”, etc.)
- Returns:
SignalType enum value
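Because the enum members carry string values, a value read from a config file can be turned back into a member by value lookup, with short aliases handled separately. The sketch below is a reduced stand-in, not the library class: `MiniSignalType`, the alias table, and the fallback to UNKNOWN are all assumptions made for illustration.

```python
from enum import Enum


class MiniSignalType(str, Enum):
    """Reduced stand-in for the documented enum (four members only)."""

    ABSORBANCE = "absorbance"
    REFLECTANCE = "reflectance"
    REFLECTANCE_PERCENT = "reflectance%"
    UNKNOWN = "unknown"

    @classmethod
    def from_string(cls, value: str) -> "MiniSignalType":
        # Assumed alias table; the real mapping may differ.
        aliases = {
            "a": cls.ABSORBANCE,
            "r": cls.REFLECTANCE,
            "%r": cls.REFLECTANCE_PERCENT,
        }
        key = value.strip().lower()
        if key in aliases:
            return aliases[key]
        try:
            return cls(key)  # lookup by the member's string value
        except ValueError:
            return cls.UNKNOWN  # assumed fallback for unrecognised input


parsed = [MiniSignalType.from_string(s) for s in ("A", "%R", "reflectance")]
```

Mixing in str means a member compares equal to its plain string value, which is what keeps string-based config files working.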
- class nirs4all.data.SignalTypeDetector(wavelengths: ndarray | None = None, wavelength_unit: str = 'nm')[source]
Bases: object
Heuristic detector for spectral signal types.
Uses value ranges and optionally wavelength information to determine whether data is absorbance, reflectance, or transmittance.
- WATER_BANDS_CM1 = [6897, 5155, 4000]
- WATER_BANDS_NM = [1450, 1940, 2500]
- detect(spectra: ndarray, confidence_threshold: float = 0.7) Tuple[SignalType, float, str][source]
Detect the signal type of spectral data.
- Parameters:
spectra – Spectral data array of shape (n_samples, n_features)
confidence_threshold – Minimum confidence to return a definite type
- Returns:
Tuple of (SignalType, confidence, reason_string)
- class nirs4all.data.SpectroDataset(name: str = 'Unknown_dataset')[source]
Bases: object
Main dataset facade for spectroscopy and ML/DL pipelines.
Coordinates feature, target, and metadata management through specialized accessor interfaces. The primary API uses direct methods like dataset.x() and dataset.y() for convenience.
- features
Feature data accessor (internal use)
- Type:
FeatureAccessor
- targets
Target data accessor (internal use)
- Type:
TargetAccessor
- metadata_accessor
Metadata accessor (internal use)
- Type:
MetadataAccessor
- folds
Cross-validation fold splits
- Type:
List[Tuple]
Examples
>>> # Create dataset
>>> dataset = SpectroDataset("my_dataset")
>>> # Add samples
>>> dataset.add_samples(X_train, {"partition": "train"})
>>> dataset.add_targets(y_train)
>>> # Get data
>>> X = dataset.x({"partition": "train"})
>>> y = dataset.y({"partition": "train"})
- add_features(features: list[ndarray] | list[list[ndarray]], processings: list[str], source: int = -1) None[source]
Add processed feature versions to existing data.
- add_merged_features(features: ndarray, processing_name: str = 'merged', source: int = 0, processing_names: List[str] | None = None) None[source]
Add merged features from branch merge operations.
This method is used by MergeController to store the output of branch merging operations. The merged features REPLACE all existing processings to become the new feature set for subsequent steps.
- Parameters:
features – Feature array to store:
  - 2D array of shape (n_samples, n_features): flattened features
  - 3D array of shape (n_samples, n_processings, n_features): features with preserved preprocessing dimension
processing_name – Name for the merged processing (default: “merged”). Used when features is 2D (single processing).
source – Target source index (default: 0, first source).
processing_names – Optional list of processing names for 3D features. If not provided, generates names like “merged_0”, “merged_1”, etc.
- Raises:
ValueError – If features is not 2D or 3D, or sample count doesn’t match.
Example
>>> # 2D merged features (flattened)
>>> merged = np.concatenate([branch0_features, branch1_features], axis=1)
>>> dataset.add_merged_features(merged, "merged_snv_msc")
>>>
>>> # 3D merged features (preserved preprocessing dimension)
>>> merged_3d = np.stack([snv_features, msc_features], axis=1)
>>> dataset.add_merged_features(merged_3d, processing_names=["snv", "msc"])
- add_metadata(data: ndarray | Any, headers: List[str] | None = None) None[source]
Add metadata rows (aligns with add_samples call order).
- Parameters:
data – Metadata as 2D array (n_samples, n_cols) or DataFrame
headers – Column names (required if data is ndarray)
- add_metadata_column(column: str, values: List | ndarray) None[source]
Add new metadata column.
- Parameters:
column – Column name
values – Column values (must match number of samples)
- add_processed_targets(processing_name: str, targets: ndarray, ancestor_processing: str = 'numeric', transformer: TransformerMixin | None = None) None[source]
Add processed target version (e.g., scaled, encoded).
- add_samples(data: ndarray | list[ndarray], indexes: Dict[str, Any] | None = None, headers: List[str] | List[List[str]] | None = None, header_unit: str | List[str] | None = None) None[source]
Add feature samples to the dataset.
- Parameters:
data – Feature data (single or multi-source)
indexes – Optional index dictionary (partition, group, branch, fold)
headers – Feature headers (wavelengths, feature names)
header_unit – Unit type for headers (“cm-1”, “nm”, “none”, “text”, “index”)
- add_samples_batch(data: ndarray | List[ndarray], indexes_list: List[Dict[str, Any]]) None[source]
Add multiple samples in a single batch operation - O(N) instead of O(N²).
This method is optimized for bulk insertion of augmented samples. It performs only one array concatenation and one indexer append, making it dramatically faster than calling add_samples() in a loop.
- Parameters:
data – 3D array of shape (n_samples, n_processings, n_features) for single source, or list of 3D arrays for multi-source datasets.
indexes_list – List of index dictionaries, one per sample.
Example
>>> # Batch add 100 augmented samples
>>> data = np.random.rand(100, 2, 500)
>>> indexes = [{"partition": "train", "origin": i, "augmentation": "noise"} for i in range(100)]
>>> dataset.add_samples_batch(data, indexes)
- property aggregate: str | None
Get the aggregation setting for sample-level prediction aggregation.
- Returns:
None: No aggregation
‘y’: Aggregate by target values (y_true)
str: Aggregate by specified metadata column name
- Return type:
str | None
Example
>>> dataset.aggregate
'sample_id'  # Predictions will be aggregated by sample_id column
- property aggregate_exclude_outliers: bool
Get whether T² outlier exclusion is enabled for aggregation.
- Returns:
True if outliers should be excluded before aggregation
- Return type:
bool
- property aggregate_method: str
Get the aggregation method for sample-level prediction aggregation.
- Returns:
Aggregation method (‘mean’, ‘median’, or ‘vote’)
- Return type:
str
Example
>>> dataset.aggregate_method
'mean'  # Predictions will be averaged within groups
- property aggregate_outlier_threshold: float
Get the outlier detection threshold for T² exclusion.
- Returns:
Confidence level (0-1) for chi-square critical value
- Return type:
float
- augment_samples(data: ndarray | list[ndarray], processings: list[str], augmentation_id: str, selector: Dict[str, Any] | DataSelector | ExecutionContext | None = None, count: int | List[int] = 1) List[int][source]
Create augmented versions of existing samples.
- detect_signal_type(src: int = 0, force_redetect: bool = False) Tuple[SignalType, float, str][source]
Detect signal type using heuristics.
Uses value range analysis and optionally wavelength band direction to determine the most likely signal type.
- Parameters:
src – Source index (default: 0)
force_redetect – If True, ignores cached/forced values and re-runs detection
- Returns:
Tuple of (SignalType, confidence, reason_string)
Example
>>> signal_type, confidence, reason = dataset.detect_signal_type()
>>> print(f"Detected {signal_type.value} ({confidence:.0%}): {reason}")
- float_headers(src: int = 0) ndarray[source]
Get headers as float array (legacy method).
WARNING: This method assumes headers are numeric and doesn’t handle unit conversion. Use wavelengths_cm1() or wavelengths_nm() for wavelength data.
- Parameters:
src – Source index
- Returns:
Headers converted to float array
- Raises:
ValueError – If headers cannot be converted to float
- get_dataset_metadata(include_y_stats: bool = True) Dict[str, Any][source]
Get comprehensive dataset metadata for run manifests.
Returns metadata suitable for efficient path resolution and dataset version tracking in run manifests.
- Parameters:
include_y_stats – If True, include target variable statistics
- Returns:
Dict with the following keys:
name: Dataset name
path: Original file path (if set)
hash: Content hash (if computed)
file_size: File size in bytes (if available)
n_samples: Number of samples
n_features: Number of features
n_sources: Number of feature sources
task_type: Classification or regression
num_classes: Number of classes (classification only)
y_columns: Target column names
y_stats: Target statistics (min, max, mean, std)
wavelength_range: [min, max] wavelength
wavelength_unit: Unit (nm, cm-1)
signal_types: List of signal types per source
metadata_columns: Available metadata columns
- Return type:
Dict[str, Any]
Example
>>> dataset = SpectroDataset.load("wheat.n4a")
>>> meta = dataset.get_dataset_metadata()
>>> print(meta["n_samples"], meta["y_stats"])
- get_merged_features(processing_name: str = 'merged', source: int = 0, selector: Dict[str, Any] | DataSelector | ExecutionContext | None = None) ndarray[source]
Get merged features by processing name.
Retrieves features that were added via add_merged_features(). Since merged features replace all existing processings, this returns the features for the single merged processing.
- Parameters:
processing_name – Name of the merged processing (default: “merged”).
source – Source index to get features from (default: 0).
selector – Optional sample filter.
- Returns:
2D array of merged features (n_samples, n_merged_features).
- Raises:
ValueError – If the processing name doesn’t exist.
Example
>>> X_merged = dataset.get_merged_features("merged_snv_msc")
>>> print(X_merged.shape)  # (n_samples, n_merged_features)
- header_unit(src: int = 0) str[source]
Get the unit type of headers for a data source.
- Parameters:
src – Source index
- Returns:
Unit string: “cm-1”, “nm”, “none”, “text”, or “index”
- Return type:
str
- index_column(col: str, filter: Dict[str, Any] = {}) List[int][source]
Get values from index column.
- keep_sources(source_indices: int | List[int]) None[source]
Keep only specified sources, removing all others.
Used after merge operations with output_as=”features” to consolidate to a single source. This is called automatically by MergeController when output_as=”features” is used.
- Parameters:
source_indices – Single source index or list of source indices to keep.
- Raises:
ValueError – If source indices are invalid.
Example
>>> # After merge with output_as="features", keep only source 0
>>> dataset.keep_sources(0)
- metadata(selector: Dict[str, Any] | DataSelector | ExecutionContext | None = None, columns: List[str] | None = None, include_augmented: bool = True)[source]
Get metadata as DataFrame.
- Parameters:
selector – Filter selector (e.g., {“partition”: “train”})
columns – Specific columns to return (None = all)
include_augmented – If True, include augmented versions of selected samples. Default True for backward compatibility.
- Returns:
Polars DataFrame with metadata
- metadata_column(column: str, selector: Dict[str, Any] | DataSelector | ExecutionContext | None = None, include_augmented: bool = True) ndarray[source]
Get single metadata column as array.
- Parameters:
column – Column name
selector – Filter selector (e.g., {“partition”: “train”})
include_augmented – If True, include augmented versions of selected samples. Default True for backward compatibility.
- Returns:
Numpy array of column values
- metadata_numeric(column: str, selector: Dict[str, Any] | DataSelector | ExecutionContext | None = None, method: Literal['label', 'onehot'] = 'label', include_augmented: bool = True) Tuple[ndarray, Dict][source]
Get numeric encoding of metadata column.
- Parameters:
column – Column name
selector – Filter selector (e.g., {“partition”: “train”})
method – “label” for label encoding or “onehot” for one-hot encoding
include_augmented – If True, include augmented versions of selected samples. Default True for backward compatibility.
- Returns:
Tuple of (numeric_array, encoding_info)
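The difference between the two method values can be shown on a toy column. This is a generic sketch of label and one-hot encoding: the `encode` helper, the alphabetical class order, and the contents of the returned info dict are assumptions, not the library's actual encoding_info format.

```python
from typing import Any, Dict, List, Tuple


def encode(values: List[str], method: str = "label") -> Tuple[List[Any], Dict[str, Any]]:
    """Encode categorical values as integer labels or one-hot rows."""
    classes = sorted(set(values))  # assumed: classes ordered alphabetically
    index = {c: i for i, c in enumerate(classes)}
    if method == "label":
        encoded: List[Any] = [index[v] for v in values]
    elif method == "onehot":
        encoded = [
            [1 if index[v] == i else 0 for i in range(len(classes))]
            for v in values
        ]
    else:
        raise ValueError(f"unknown method: {method!r}")
    return encoded, {"classes": classes}


groups = ["b", "a", "b", "c"]
labels, label_info = encode(groups, method="label")
onehot, onehot_info = encode(groups, method="onehot")
```

Label encoding gives one integer per sample, while one-hot encoding gives one column per class, which avoids implying an order between categories.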
- print_summary() None[source]
Print a comprehensive summary of the dataset.
Shows counts, dimensions, number of sources, target versions, etc.
- replace_features(source_processings: list[str], features: list[ndarray] | list[list[ndarray]], processings: list[str], source: int = -1) None[source]
Replace existing processed features with new versions.
- reshape_reps_to_preprocessings(config: RepetitionConfig) None[source]
Transform repetitions into additional preprocessing slots.
Each repetition becomes a new preprocessing dimension, reducing the number of samples but increasing the preprocessing count. This enables multi-preprocessing modeling strategies.
Input: n_sources × (n_samples, n_pp, n_features)
Output: n_sources × (n_unique_samples, n_pp × n_reps, n_features)
- Parameters:
config – RepetitionConfig with column and options.
- Raises:
ValueError – If grouping column not found, groups have unequal sizes and on_unequal=”error”, or no valid groups found.
Example
>>> # With 120 samples (30 unique × 4 reps), 1 source, 1 pp, 500 features
>>> config = RepetitionConfig(column="Sample_ID")
>>> dataset.reshape_reps_to_preprocessings(config)
>>> # Result: 1 source × (30 samples, 4 pp, 500 features)
- reshape_reps_to_sources(config: RepetitionConfig) None[source]
Transform repetitions into separate data sources.
Each repetition index becomes a new source, reducing the number of samples but increasing the number of sources. This enables per-source branching and multi-source modeling strategies.
Input: n_sources × (n_samples, n_pp, n_features)
Output: (n_sources × n_reps) × (n_unique_samples, n_pp, n_features)
- Parameters:
config – RepetitionConfig with column and options.
- Raises:
ValueError – If grouping column not found, groups have unequal sizes and on_unequal=”error”, or no valid groups found.
Example
>>> # With 120 samples (30 unique × 4 reps), 1 source, 500 features
>>> config = RepetitionConfig(column="Sample_ID")
>>> dataset.reshape_reps_to_sources(config)
>>> # Result: 4 sources × (30 samples, 1 pp, 500 features)
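The two repetition reshapes share the same grouping step and differ only in where the repetition axis ends up. A small sketch with nested lists (one source, one preprocessing, two features) shows the resulting shapes; `group_repetitions` and `shape` are hypothetical helpers, and the ordering of repetitions within a group is assumed to follow row order.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Spectrum = List[float]


def group_repetitions(
    sample_ids: Sequence[str], spectra: Sequence[Spectrum]
) -> List[List[Spectrum]]:
    """Group rows by sample ID: result[i][r] is repetition r of unique sample i."""
    groups: Dict[str, List[Spectrum]] = defaultdict(list)
    for sample_id, spectrum in zip(sample_ids, spectra):
        groups[sample_id].append(spectrum)
    if len({len(reps) for reps in groups.values()}) != 1:
        raise ValueError("no groups found, or groups have unequal sizes")
    return list(groups.values())


def shape(nested) -> tuple:
    """Shape of a regular nested list, read along the first element."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


sample_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
spectra = [[float(i), i + 0.5] for i in range(6)]  # 6 rows, 2 features

grouped = group_repetitions(sample_ids, spectra)
n_reps = len(grouped[0])

# Repetitions as preprocessings: (n_unique_samples, n_reps, n_features)
as_preprocessings = grouped

# Repetitions as sources: n_reps sources, each (n_unique_samples, 1, n_features)
as_sources = [[[reps[r]] for reps in grouped] for r in range(n_reps)]
```

Here six rows become three unique samples, either with two preprocessing slots in one source or with one slot in each of two sources.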
- set_aggregate(value: str | bool | None) None[source]
Set the aggregation behavior for sample-level prediction aggregation.
When set, predictions from multiple spectra of the same biological sample (as identified by the aggregation key) will be aggregated automatically during scoring and reporting.
- Parameters:
value – Aggregation setting:
  - None: No aggregation (default behavior)
  - True: Aggregate by y_true values (target grouping)
  - str: Aggregate by specified metadata column (e.g., ‘sample_id’, ‘ID’)
Example
>>> dataset.set_aggregate('sample_id')  # Aggregate by sample_id metadata column
>>> dataset.set_aggregate(True)         # Aggregate by y values
>>> dataset.set_aggregate(None)         # Disable aggregation
- set_aggregate_exclude_outliers(value: bool, threshold: float = 0.95) None[source]
Enable/disable T² based outlier exclusion before aggregation.
When enabled, uses Hotelling’s T² statistic to identify and exclude outlier measurements within each sample group before averaging.
- Parameters:
value – True to enable outlier exclusion, False to disable
threshold – Confidence level for outlier detection (0-1, default 0.95)
Example
>>> dataset.set_aggregate_exclude_outliers(True, threshold=0.95)
- set_aggregate_method(value: str | None) None[source]
Set the aggregation method for sample-level prediction aggregation.
- Parameters:
value – Aggregation method:
  - None: Use default method (mean for regression, vote for classification)
  - ‘mean’: Average predictions within each group
  - ‘median’: Median prediction within each group
  - ‘vote’: Majority voting for classification
Example
>>> dataset.set_aggregate_method('median')
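What the three methods do to a group of predictions can be sketched with the standard library. This is an illustration only: `aggregate_predictions` is a hypothetical helper, tie-breaking for votes is whatever Counter.most_common gives, and the library's own aggregation (including outlier exclusion) is not reproduced here.

```python
from collections import Counter, defaultdict
from statistics import mean, median
from typing import Any, Dict, Hashable, List, Sequence


def aggregate_predictions(
    keys: Sequence[Hashable], predictions: Sequence[Any], method: str = "mean"
) -> Dict[Hashable, Any]:
    """Reduce predictions that share an aggregation key to a single value."""
    groups: Dict[Hashable, List[Any]] = defaultdict(list)
    for key, pred in zip(keys, predictions):
        groups[key].append(pred)
    reducers = {
        "mean": mean,
        "median": median,
        "vote": lambda values: Counter(values).most_common(1)[0][0],
    }
    reducer = reducers[method]
    return {key: reducer(values) for key, values in groups.items()}


keys = ["s1", "s1", "s1", "s2", "s2", "s2"]
y_pred = [10.0, 12.0, 17.0, 4.0, 4.0, 7.0]
by_mean = aggregate_predictions(keys, y_pred, "mean")
by_median = aggregate_predictions(keys, y_pred, "median")

class_pred = ["a", "b", "b", "c", "c", "a"]
by_vote = aggregate_predictions(keys, class_pred, "vote")
```

The median is less sensitive than the mean to a single stray spectrum in a group, as the s1 values show.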
- set_content_hash(hash_value: str) None[source]
Set the content hash for version tracking.
- Parameters:
hash_value – Content hash string
- set_folds(folds_iterable) None[source]
Set cross-validation folds from an iterable of (train_idx, val_idx) tuples.
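The expected input is an iterable of (train_idx, val_idx) pairs. A minimal sketch that builds such pairs for contiguous, unshuffled k-fold splitting is shown below; `k_fold_indices` is a hypothetical helper, and whether set_folds expects lists or arrays of indices is not stated in the text.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, n_splits: int) -> List[Tuple[List[int], List[int]]]:
    """Build (train_idx, val_idx) pairs from contiguous validation blocks."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    folds = []
    start = 0
    for size in sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=6, n_splits=3)
```

A list built this way has the shape the docstring describes; the output of scikit-learn's KFold.split, which also yields (train, test) index pairs, has the same structure.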
- set_signal_type(signal_type: str | SignalType, src: int = 0, forced: bool = True) None[source]
Set the signal type for a data source.
- Parameters:
signal_type – Signal type (string or SignalType enum)
src – Source index (default: 0)
forced – If True, prevents auto-detection from overriding (default: True)
Example
>>> dataset.set_signal_type("absorbance", src=0)
>>> dataset.set_signal_type(SignalType.REFLECTANCE_PERCENT, src=1)
- set_source_path(path: str) None[source]
Set the source file path for metadata tracking.
- Parameters:
path – Path to the original dataset file
- set_task_type(task_type: str | TaskType, forced: bool = True) None[source]
Set the task type explicitly.
- Parameters:
task_type – Task type as string (‘regression’, ‘binary_classification’, ‘multiclass_classification’) or TaskType enum
forced – If True, prevents auto-detection from overriding this value in subsequent y_processing steps (e.g., after MinMaxScaler). Default True.
- signal_type(src: int = 0) SignalType[source]
Get the signal type for a data source.
If not set, attempts auto-detection based on value ranges and optionally wavelength band analysis.
- Parameters:
src – Source index (default: 0)
- Returns:
SignalType enum value
Example
>>> signal = dataset.signal_type(0)
>>> if signal == SignalType.REFLECTANCE:
...     dataset.convert_to_absorbance(0)
- property signal_types: List[SignalType]
Get signal types for all sources.
- Returns:
List of SignalType values, one per source
- update_features(source_processings: list[str], features: list[ndarray] | list[list[ndarray]], processings: list[str], source: int = -1) None[source]
Update existing processed features.
- update_metadata(column: str, values: List | ndarray, selector: Dict[str, Any] | DataSelector | ExecutionContext | None = None, include_augmented: bool = True) None[source]
Update metadata values for selected samples.
- Parameters:
column – Column name
values – New values
selector – Filter selector (None = all samples)
include_augmented – If True, include augmented versions of selected samples. Default True for backward compatibility.
- wavelengths_cm1(src: int = 0) ndarray[source]
Get wavelengths in cm⁻¹ (wavenumber), converting from nm if needed.
- Parameters:
src – Source index
- Returns:
Wavelengths in cm⁻¹ as float array
- Raises:
ValueError – If headers cannot be converted to wavelengths
- wavelengths_nm(src: int = 0) ndarray[source]
Get wavelengths in nm, converting from cm⁻¹ if needed.
- Parameters:
src – Source index
- Returns:
Wavelengths in nm as float array
- Raises:
ValueError – If headers cannot be converted to wavelengths
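Wavelength in nm and wavenumber in cm⁻¹ are related by ν̃ = 10⁷ / λ, so the same function converts in both directions. The sketch below applies this standard relation to the water-band constants listed under SignalTypeDetector; the helper names are hypothetical, and the library's own conversion may handle ordering or invalid headers differently.

```python
from typing import List, Sequence


def nm_to_cm1(wavelengths_nm: Sequence[float]) -> List[float]:
    """Wavenumber (cm^-1) = 1e7 / wavelength (nm)."""
    return [1e7 / w for w in wavelengths_nm]


def cm1_to_nm(wavenumbers_cm1: Sequence[float]) -> List[float]:
    """Wavelength (nm) = 1e7 / wavenumber (cm^-1)."""
    return [1e7 / v for v in wavenumbers_cm1]


water_nm = [1450, 1940, 2500]
water_cm1 = [round(v) for v in nm_to_cm1(water_nm)]
round_trip = cm1_to_nm(nm_to_cm1([1450.0]))
```

Rounded to whole wavenumbers, the converted bands match the values given for WATER_BANDS_CM1. Because the relation is reciprocal, an ascending nm axis becomes a descending cm⁻¹ axis.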
- x(selector: Dict[str, Any] | DataSelector | ExecutionContext | None, layout: Literal['2d', '3d', '2d_t', '3d_i'] = '2d', concat_source: bool = True, include_augmented: bool = True, include_excluded: bool = False) ndarray | list[ndarray][source]
Get feature data with automatic augmented sample aggregation.
- Parameters:
selector – Filter criteria (partition, group, branch, etc.)
layout – Output layout: “2d” (default), “3d”, “2d_t”, or “3d_i”
concat_source – If True, concatenate multiple sources along feature axis
include_augmented – If True, include augmented versions of selected samples. If False, return only base samples (origin=null). Default True for backward compatibility.
include_excluded – If True, include samples marked as excluded. If False (default), exclude samples marked as excluded=True. Use True when transforming ALL features (e.g., preprocessing).
- Returns:
Feature data array(s)
Example
>>> # Get all train samples (base + augmented)
>>> X_train = dataset.x({"partition": "train"})
>>> # Get only base train samples (for splitting)
>>> X_base = dataset.x({"partition": "train"}, include_augmented=False)
>>> # Get all features including excluded (for transformations)
>>> X_all = dataset.x({"partition": "train"}, include_excluded=True)
- y(selector: Dict[str, Any] | DataSelector | ExecutionContext | None, include_augmented: bool = True, include_excluded: bool = False) → ndarray
Get target data. Augmented samples are automatically mapped to their origin’s y value.
- Parameters:
selector – Filter criteria (partition, group, branch, etc.)
include_augmented – If True, include augmented versions of selected samples. Augmented samples are automatically mapped to their origin’s y value. If False, return only base samples. Default True for backward compatibility.
include_excluded – If True, include samples marked as excluded. If False (default), exclude samples marked as excluded=True. Use True when transforming ALL targets (e.g., y_processing).
- Returns:
Target values array
Example
>>> # Get all train targets (base + augmented, with mapping)
>>> y_train = dataset.y({"partition": "train"})
>>> # Get only base train targets (for splitting)
>>> y_base = dataset.y({"partition": "train"}, include_augmented=False)
>>> # Get all targets including excluded (for y_processing)
>>> y_all = dataset.y({"partition": "train"}, include_excluded=True)
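The origin mapping described for `y` can be illustrated with a few lines of plain Python. This is a sketch under assumptions: `targets_for`, `origin_of`, and `y_by_base_id` are hypothetical names, and the library may store and resolve origins differently.

```python
def targets_for(sample_ids, origin_of, y_by_base_id):
    """Look up y for each sample, following augmented samples to their origin."""
    values = []
    for sid in sample_ids:
        base_id = origin_of.get(sid)   # None for base samples
        if base_id is None:
            base_id = sid
        values.append(y_by_base_id[base_id])
    return values


# Samples 0 and 1 are base samples; 2 and 3 are augmented copies of sample 0.
origin_of = {0: None, 1: None, 2: 0, 3: 0}
y_by_base_id = {0: 12.5, 1: 7.0}

y_values = targets_for([0, 1, 2, 3], origin_of, y_by_base_id)
# y_values == [12.5, 7.0, 12.5, 12.5]
```

Only base samples carry a stored target, so every augmented copy returns the same y as the sample it was derived from.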
- class nirs4all.data.TaskType(value)
Task type for the dataset.
- AUTO = 'auto'
- BINARY_CLASSIFICATION = 'binary_classification'
- MULTICLASS_CLASSIFICATION = 'multiclass_classification'
- REGRESSION = 'regression'
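As an illustration of how an enum with these four values behaves, here is a stand-in class that mirrors the documented members. It is not the library class itself (whose base class is not shown here); it only demonstrates lookup by value, as suggested by the `TaskType(value)` signature.

```python
from enum import Enum


class TaskType(Enum):
    # Stand-in mirroring the four documented members; not the nirs4all class.
    AUTO = "auto"
    BINARY_CLASSIFICATION = "binary_classification"
    MULTICLASS_CLASSIFICATION = "multiclass_classification"
    REGRESSION = "regression"


task = TaskType("regression")   # lookup by value
is_classification = task in (
    TaskType.BINARY_CLASSIFICATION,
    TaskType.MULTICLASS_CLASSIFICATION,
)
```

With a standard `Enum`, passing a string that matches no member raises `ValueError`.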
- class nirs4all.data.ValidationError(code: str, message: str, field: str | None = None, value: Any = None, suggestion: str | None = None)
Bases: object
Represents a validation error.
- value
The value that caused the error.
- Type:
Any
- class nirs4all.data.ValidationResult(is_valid: bool, errors: List[ValidationError] = <factory>, warnings: List[ValidationWarning] = <factory>, normalized_config: Dict[str, Any] | None = None)
Bases: object
Result of configuration validation.
- errors
List of validation errors.
- warnings
List of validation warnings.
- errors: List[ValidationError]
- warnings: List[ValidationWarning]
- class nirs4all.data.ValidationWarning(code: str, message: str, field: str | None = None)
Bases: object
Represents a validation warning (non-fatal issue).
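The three validation classes above have signatures consistent with plain dataclasses (the `<factory>` defaults suggest `default_factory`). The sketch below re-declares stand-ins with the documented fields to show how a result might be inspected. The stand-in classes are assumptions, not the library's source, and the error code `E001` and field name `train_x` are made up for illustration.

```python
import dataclasses
from typing import Any, Dict, List, Optional


@dataclasses.dataclass
class ValidationError:
    code: str
    message: str
    field: Optional[str] = None
    value: Any = None
    suggestion: Optional[str] = None


@dataclasses.dataclass
class ValidationWarning:
    code: str
    message: str
    field: Optional[str] = None


@dataclasses.dataclass
class ValidationResult:
    is_valid: bool
    # Each instance gets its own fresh list.
    errors: List[ValidationError] = dataclasses.field(default_factory=list)
    warnings: List[ValidationWarning] = dataclasses.field(default_factory=list)
    normalized_config: Optional[Dict[str, Any]] = None


# Illustrative values only; real codes and field names come from the library.
result = ValidationResult(
    is_valid=False,
    errors=[ValidationError(code="E001", message="missing key", field="train_x")],
)
summary = [f"{e.code} [{e.field}]: {e.message}" for e in result.errors]
```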
- nirs4all.data.detect_signal_type(spectra: ndarray, wavelengths: ndarray | None = None, wavelength_unit: str = 'nm') → Tuple[SignalType, float, str]
Convenience function to detect signal type.
- Parameters:
spectra – Spectral data array (n_samples, n_features)
wavelengths – Optional wavelength values for band analysis
wavelength_unit – Unit of wavelengths (“nm” or “cm-1”)
- Returns:
Tuple of (SignalType, confidence, reason)
Example
>>> import numpy as np
>>> from nirs4all.data import detect_signal_type
>>> spectra = np.random.rand(100, 500) * 0.8  # Values in [0, 0.8)
>>> signal_type, confidence, reason = detect_signal_type(spectra)
>>> print(f"Detected: {signal_type.value} ({confidence:.0%})")
- nirs4all.data.normalize_config(input_data: Any) → Tuple[Dict[str, Any] | None, str]
Convenience function to normalize a configuration.
- Parameters:
input_data – Configuration in any supported format.
- Returns:
Tuple of (normalized_config, dataset_name).
- nirs4all.data.normalize_header_unit(unit: str | HeaderUnit) → HeaderUnit
Convert string header unit to enum.
- Parameters:
unit – Unit as string or enum
- Returns:
HeaderUnit enum value
- Raises:
ValueError – If unit string is invalid
- nirs4all.data.normalize_layout(layout: str | FeatureLayout) → FeatureLayout
Convert string layout to enum for backward compatibility.
- Parameters:
layout – Layout as string or enum
- Returns:
FeatureLayout enum value
- Raises:
ValueError – If layout string is invalid
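The string-or-enum normalization described for `normalize_header_unit` and `normalize_layout` follows a common pattern: return enum members unchanged, convert valid strings, and raise `ValueError` otherwise. A minimal sketch of that pattern is shown below. The `Layout` enum and `normalize` function are hypothetical stand-ins: the string values are taken from the `layout` parameter of `x()`, but the member names are assumptions and may not match `FeatureLayout`.

```python
from enum import Enum


class Layout(Enum):
    # Hypothetical stand-in for FeatureLayout; member names are assumptions.
    FLAT_2D = "2d"
    VOLUME_3D = "3d"
    FLAT_2D_T = "2d_t"
    VOLUME_3D_I = "3d_i"


def normalize(layout):
    """Return a Layout member for either a member or its string value."""
    if isinstance(layout, Layout):
        return layout
    try:
        return Layout(layout)
    except ValueError:
        valid = ", ".join(m.value for m in Layout)
        raise ValueError(
            f"Invalid layout {layout!r}; expected one of: {valid}"
        ) from None


from_string = normalize("3d")
from_member = normalize(Layout.FLAT_2D)
```

Accepting both forms lets older call sites keep passing strings while newer code uses the enum directly.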
- nirs4all.data.normalize_signal_type(signal_type: str | SignalType) → SignalType
Normalize a signal type input to SignalType enum.
- Parameters:
signal_type – String or SignalType enum
- Returns:
SignalType enum value