nirs4all.data.schema.config module
Schema definitions for dataset configuration.
This module defines Pydantic models for validating and normalizing dataset configurations. It supports both the legacy format (train_x, test_x, etc.) and the planned new format (files, sources, variations).
The models provide:
- Type validation and coercion
- Default value handling
- Clear documentation via Field descriptions
- Serialization/deserialization
- class nirs4all.data.schema.config.AggregateMethod(value)
Method for aggregating predictions.
- MEAN = 'mean'
- MEDIAN = 'median'
- VOTE = 'vote'
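The three values can be read as strategies for collapsing several predictions of the same sample into one. The sketch below illustrates that reading with the standard library only; the `aggregate` helper and the local enum are hypothetical stand-ins, not the nirs4all implementation.

```python
from collections import Counter
from enum import Enum
from statistics import mean, median


class AggregateMethod(str, Enum):
    """Local stand-in mirroring the documented enum values."""
    MEAN = "mean"
    MEDIAN = "median"
    VOTE = "vote"


def aggregate(values, method):
    """Collapse repeated predictions for one sample into a single value."""
    method = AggregateMethod(method)  # accepts the enum or its string value
    if method is AggregateMethod.MEAN:
        return mean(values)
    if method is AggregateMethod.MEDIAN:
        return median(values)
    # VOTE: most frequent label; ties go to the label seen first
    return Counter(values).most_common(1)[0][0]
```

Mean and median suit regression outputs, while vote suits class labels.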
- class nirs4all.data.schema.config.CategoricalMode(value)
Mode for handling categorical columns in Y data.
- AUTO = 'auto'
- NONE = 'none'
- PRESERVE = 'preserve'
- class nirs4all.data.schema.config.ColumnConfig(*, features: List[int] | List[str] | str | Dict[str, Any] | None = None, targets: List[int] | List[str] | str | Dict[str, Any] | None = None, metadata: List[int] | List[str] | str | Dict[str, Any] | None = None)
Bases: BaseModel
Configuration for column selection and role assignment.
This is a stub for future implementation of the files syntax. Currently, column selection is handled by the loader directly.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class nirs4all.data.schema.config.DatasetConfigSchema(*, name: str | None = None, description: str | None = None, task_type: TaskType | None = None, train_x: Any | None = None, train_y: Any | None = None, train_group: Any | None = None, test_x: Any | None = None, test_y: Any | None = None, test_group: Any | None = None, train_x_filter: List[int] | None = None, train_y_filter: List[int] | None = None, train_group_filter: List[int] | None = None, test_x_filter: List[int] | None = None, test_y_filter: List[int] | None = None, test_group_filter: List[int] | None = None, global_params: LoadingParams | None = None, train_params: LoadingParams | None = None, test_params: LoadingParams | None = None, train_x_params: LoadingParams | None = None, train_y_params: LoadingParams | None = None, train_group_params: LoadingParams | None = None, test_x_params: LoadingParams | None = None, test_y_params: LoadingParams | None = None, test_group_params: LoadingParams | None = None, aggregate: str | bool | None = None, aggregate_method: AggregateMethod | None = None, aggregate_exclude_outliers: bool | None = None, files: List[FileConfig] | None = None, sources: List[SourceConfig] | None = None, shared_targets: SharedTargetsConfig | List[SharedTargetsConfig] | None = None, shared_metadata: SharedMetadataConfig | List[SharedMetadataConfig] | None = None, variations: List[VariationConfig] | None = None, variation_mode: VariationMode | None = None, variation_select: List[str] | None = None, variation_prefix: bool | None = None, folds: FoldConfig | List[Dict[str, Any]] | str | None = None, **extra_data: Any)
Bases: BaseModel
Complete dataset configuration schema.
This model represents the normalized, validated form of a dataset configuration. It supports the legacy format (train_x, test_x, etc.) and is designed to be extensible for the new files syntax.
All input configurations are normalized to this schema before processing.
- aggregate_method: AggregateMethod | None
- files: List[FileConfig] | None
- get_effective_params(partition: str, data_type: str) -> LoadingParams
Get effective loading parameters for a specific data file.
Parameters are merged with precedence: specific > partition > global.
- Parameters:
partition – 'train' or 'test'
data_type – 'x', 'y', or 'group'
- Returns:
Merged LoadingParams.
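The precedence rule (specific > partition > global) can be pictured as layering three dictionaries, with later layers overwriting earlier ones. This is a minimal sketch on plain dicts, not the library's code; the `effective_params` function is hypothetical and only the key naming (`global_params`, `train_params`, `train_x_params`) follows the fields listed above.

```python
def effective_params(config, partition, data_type):
    """Merge loading params with precedence specific > partition > global."""
    layers = [
        config.get("global_params"),                    # lowest priority
        config.get(f"{partition}_params"),              # e.g. train_params
        config.get(f"{partition}_{data_type}_params"),  # e.g. train_x_params
    ]
    merged = {}
    for layer in layers:
        # Later (more specific) layers overwrite earlier ones; None means "unset".
        merged.update({k: v for k, v in (layer or {}).items() if v is not None})
    return merged


config = {
    "global_params": {"delimiter": ";", "has_header": True},
    "train_params": {"delimiter": ","},
    "train_x_params": {"header_unit": "nm"},
}
```

With this config, the train X file inherits `has_header` from the global level, takes `delimiter` from the partition level, and adds its own `header_unit`.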
- get_selected_variations() -> List[VariationConfig]
Get the variations to use based on variation_mode and variation_select.
For mode='select', returns only the selected variations. For other modes, returns all variations.
- Returns:
List of VariationConfig objects to use.
- get_source_count() -> int
Get the number of feature sources.
- Returns:
Number of sources (1 for single-source, >1 for multi-source).
- get_source_names() -> List[str]
Get names of all sources in this config.
- Returns:
List of source names, or empty list if not multi-source.
- get_variation_count() -> int
Get the number of feature variations.
- Returns:
Number of variations.
- get_variation_names() -> List[str]
Get names of all variations in this config.
- Returns:
List of variation names, or empty list if no variations.
- global_params: LoadingParams | None
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow', 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod normalize_aggregate_method(v: Any) -> AggregateMethod | None
Normalize aggregate_method to enum.
- classmethod normalize_variation_mode(v: Any) -> VariationMode | None
Normalize variation_mode to enum.
- classmethod parse_loading_params(v: Any) -> LoadingParams | None
Parse dict to LoadingParams if needed.
Parse shared metadata configuration.
Parse shared targets configuration.
- classmethod parse_sources(v: Any) -> List[SourceConfig] | None
Parse sources list to SourceConfig objects.
- classmethod parse_variations(v: Any) -> List[VariationConfig] | None
Parse variations list to VariationConfig objects.
- sources: List[SourceConfig] | None
- test_group_params: LoadingParams | None
- test_params: LoadingParams | None
- test_x_params: LoadingParams | None
- test_y_params: LoadingParams | None
- to_legacy_format() -> Dict[str, Any]
Convert sources or variations format to legacy format for backward compatibility.
This converts the sources/variations syntax to the train_x/test_x array syntax that existing loaders understand.
- Returns:
Dictionary with legacy format configuration.
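One way to picture the conversion for the sources syntax: each source contributes its paths to the train_x/test_x arrays, in declaration order. The sketch below works on plain dicts that use the direct `train_x`/`test_x` keys of a source; the `sources_to_legacy` function is hypothetical, and the exact shape of the dictionary returned by the real method is not specified here.

```python
def sources_to_legacy(sources):
    """Collect per-source train/test paths into legacy train_x/test_x lists."""
    legacy = {"train_x": [], "test_x": []}
    for source in sources:
        for key in ("train_x", "test_x"):
            if source.get(key) is not None:
                legacy[key].append(source[key])
    # Drop partitions that no source provided
    return {key: paths for key, paths in legacy.items() if paths}


sources = [
    {"name": "NIR", "train_x": "nir_train.csv", "test_x": "nir_test.csv"},
    {"name": "MIR", "train_x": "mir_train.csv"},
]
```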
- train_group_params: LoadingParams | None
- train_params: LoadingParams | None
- train_x_params: LoadingParams | None
- train_y_params: LoadingParams | None
- validate_data_sources() -> DatasetConfigSchema
Validate that at least one data source is specified.
- variation_mode: VariationMode | None
- variations: List[VariationConfig] | None
- variations_to_legacy_format() -> Dict[str, Any]
Convert variations format to legacy format for backward compatibility.
This converts the variations syntax to the train_x/test_x format that existing loaders understand. The conversion depends on variation_mode:
- separate: Returns config for first variation (caller handles multiple runs)
- concat: Returns list of paths to be concatenated
- select: Returns config for selected variations only
- compare: Same as separate (caller handles comparison)
- Returns:
Dictionary with legacy format configuration.
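The mode-dependent behavior described above can be sketched as a small dispatch over plain dicts. This is an illustration under assumptions, not the library's implementation: the `variations_to_legacy` function is hypothetical, variations are selected by name, and the select mode is assumed to return path lists in the same shape as concat.

```python
def variations_to_legacy(variations, mode, select=None):
    """Map a list of variation dicts to a legacy-style train_x/test_x dict."""
    if mode == "select":
        wanted = set(select or [])
        variations = [v for v in variations if v["name"] in wanted]
    if not variations:
        return {}
    if mode in ("separate", "compare"):
        # Only the first variation is returned; the caller loops over the rest.
        first = variations[0]
        return {"train_x": first["train_x"], "test_x": first["test_x"]}
    # concat and select: one path per variation, kept in declaration order
    return {
        "train_x": [v["train_x"] for v in variations],
        "test_x": [v["test_x"] for v in variations],
    }


variations = [
    {"name": "raw", "train_x": "raw_train.csv", "test_x": "raw_test.csv"},
    {"name": "snv", "train_x": "snv_train.csv", "test_x": "snv_test.csv"},
]
```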
- class nirs4all.data.schema.config.FileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None, link_by: str | None = None)
Bases: BaseModel
Configuration for a single data file.
This is a stub for future implementation of the files syntax. It describes how to load and interpret a single data file.
- columns: ColumnConfig | None
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- params: LoadingParams | None
- partition: PartitionType | None
- class nirs4all.data.schema.config.FoldConfig(*, folds: List[FoldDefinition] | None = None, file: str | None = None, format: Literal['auto', 'csv', 'json', 'yaml', 'txt'] | None = 'auto', column: str | None = None)
Bases: BaseModel
Configuration for cross-validation fold definitions.
Supports multiple ways to specify folds:
- Inline: List of FoldDefinition objects
- File: Path to a fold file (CSV, JSON, YAML)
- Column: Column name in metadata containing fold assignments
Examples
    # Inline fold definitions
    folds:
      - train: [0, 1, 2, 3, 4]
        val: [5, 6, 7, 8, 9]
      - train: [5, 6, 7, 8, 9]
        val: [0, 1, 2, 3, 4]

    # File reference
    folds:
      file: "path/to/folds.csv"
      format: auto

    # Column in metadata
    folds:
      column: "cv_fold"
- folds: List[FoldDefinition] | None
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- to_fold_list() -> List[Tuple[List[int], List[int]]] | None
Convert inline fold definitions to fold list.
- Returns:
List of (train_indices, val_indices) tuples, or None if not inline.
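As an illustration of the return shape, inline folds given as dicts with `train` and `val` keys map to a list of (train_indices, val_indices) tuples. The function below is a hypothetical stand-in working on plain dicts rather than FoldDefinition objects.

```python
def to_fold_list(folds):
    """Convert inline fold dicts to (train_indices, val_indices) tuples."""
    if folds is None:
        return None  # folds come from a file or a metadata column instead
    # val is optional and defaults to an empty list
    return [(list(fold["train"]), list(fold.get("val", []))) for fold in folds]
```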
- validate_fold_source() -> FoldConfig
Validate that exactly one fold source is specified.
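An "exactly one of" check over the three fold sources (inline folds, file, column) could look like the following. This is a sketch of the rule only; the function name and the error message are assumptions, and the real validator runs inside the Pydantic model.

```python
def check_fold_source(folds=None, file=None, column=None):
    """Return the name of the single fold source that is set, else raise."""
    candidates = {"folds": folds, "file": file, "column": column}
    given = [name for name, value in candidates.items() if value is not None]
    if len(given) != 1:
        raise ValueError(
            f"exactly one of folds, file, column must be set; got {given or 'none'}"
        )
    return given[0]
```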
- class nirs4all.data.schema.config.FoldDefinition(*, train: List[int], val: List[int] = <factory>)
Bases: BaseModel
Definition of a single cross-validation fold.
Specifies which sample indices belong to training and validation sets.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class nirs4all.data.schema.config.HeaderUnit(value)
Unit type for spectral headers.
- INDEX = 'index'
- NONE = 'none'
- TEXT = 'text'
- WAVELENGTH = 'nm'
- WAVENUMBER = 'cm-1'
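The 'nm' and 'cm-1' units describe the same spectral axis in two ways: wavenumber in cm-1 equals 10^7 divided by wavelength in nm. The helpers below show that standard relation; their names are hypothetical, and whether nirs4all converts between the two units is not stated here.

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nanometres to a wavenumber in cm-1."""
    return 1e7 / wavelength_nm


def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm-1 to a wavelength in nanometres."""
    return 1e7 / wavenumber_cm1
```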
- class nirs4all.data.schema.config.LoadingParams(*, delimiter: str | None = None, decimal_separator: str | None = None, has_header: bool | None = None, header_unit: HeaderUnit | str | None = None, signal_type: SignalTypeEnum | str | None = None, encoding: str | None = None, na_policy: NAPolicy | str | None = None, categorical_mode: CategoricalMode | str | None = None, **extra_data: Any)
Bases: BaseModel
Parameters for loading data files.
These parameters control how CSV and other files are parsed. Parameters can be specified at global, partition, or file level, with more specific levels overriding general ones.
- categorical_mode: CategoricalMode | str | None
- header_unit: HeaderUnit | str | None
- merge_with(other: LoadingParams | None) -> LoadingParams
Merge with another LoadingParams, self taking precedence.
- Parameters:
other – Another LoadingParams to merge with (lower priority).
- Returns:
New LoadingParams with merged values.
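The pairwise merge can be sketched with a frozen dataclass: every field left unset (None) on self is filled from other, and a new object is returned. `Params` is a hypothetical stand-in with three of the documented fields, not the Pydantic model itself.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for LoadingParams with three of its fields."""
    delimiter: Optional[str] = None
    has_header: Optional[bool] = None
    encoding: Optional[str] = None

    def merge_with(self, other: Optional["Params"]) -> "Params":
        """Return a new Params; fields set on self win over other."""
        if other is None:
            return replace(self)  # plain copy, nothing to merge
        updates = {
            f.name: getattr(other, f.name)
            for f in fields(self)
            if getattr(self, f.name) is None  # only fill what self left unset
        }
        return replace(self, **updates)
```

Calling `Params(delimiter=",").merge_with(Params(delimiter=";", encoding="utf-8"))` keeps the comma delimiter and picks up the encoding from the lower-priority object.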
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod normalize_header_unit(v: Any) -> HeaderUnit | str | None
Normalize header_unit to enum if possible.
- classmethod normalize_signal_type(v: Any) -> SignalTypeEnum | str | None
Normalize signal_type to enum if possible.
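"To enum if possible" suggests a try-then-fall-back pattern: known values become enum members, and anything else is passed through unchanged, which matches the `HeaderUnit | str | None` return type. The sketch below shows that pattern on a local copy of the enum; the case-insensitive matching is an assumption, not documented behavior.

```python
from enum import Enum


class HeaderUnit(str, Enum):
    """Local stand-in mirroring the documented enum values."""
    INDEX = "index"
    NONE = "none"
    TEXT = "text"
    WAVELENGTH = "nm"
    WAVENUMBER = "cm-1"


def normalize_header_unit(value):
    """Return a HeaderUnit when the value matches one, else the value itself."""
    if value is None or isinstance(value, HeaderUnit):
        return value
    try:
        return HeaderUnit(str(value).lower())  # assumption: case-insensitive
    except ValueError:
        return value  # unknown units are kept as plain strings
```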
- signal_type: SignalTypeEnum | str | None
- class nirs4all.data.schema.config.NAPolicy(value)
Policy for handling NA/missing values.
- ABORT = 'abort'
- AUTO = 'auto'
- REMOVE = 'remove'
- class nirs4all.data.schema.config.PartitionConfig(*, type: PartitionType | None = None, column: str | None = None, train_values: List[str] | None = None, test_values: List[str] | None = None, predict_values: List[str] | None = None, unknown_policy: Literal['error', 'ignore', 'train'] | None = None, train: str | List[int] | None = None, test: str | List[int] | None = None, predict: str | List[int] | None = None, shuffle: bool | None = None, random_state: int | None = None, stratify: str | None = None, train_file: str | None = None, test_file: str | None = None, predict_file: str | None = None, **extra_data: Any)
Bases: BaseModel
Configuration for partition assignment.
Supports multiple partition methods:
- Static: Assign entire file to a partition (use type)
- Column-based: Partition based on column values (use column)
- Percentage-based: Split by percentage (use train, test with percentages)
- Index-based: Explicit index lists (use train, test with lists)
- Index file: Load indices from external files (use train_file, test_file)
Examples
    # Static partition (entire file)
    partition:
      type: train

    # Column-based partition
    partition:
      column: "split"
      train_values: ["train", "training"]
      test_values: ["test", "validation"]

    # Percentage-based partition
    partition:
      train: "80%"
      test: "80%:100%"
      shuffle: true
      random_state: 42

    # Index-based partition
    partition:
      train: [0, 1, 2, 3, 4]
      test: [5, 6, 7, 8, 9]

    # Index file partition
    partition:
      train_file: "train_indices.txt"
      test_file: "test_indices.txt"
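The percentage strings in the example above can be read as index ranges over the samples. The sketch below assumes that "80%" means the first 80% of rows and that "a%:b%" means the slice between the two bounds; this interpretation, the `percent_range` helper, and the omission of shuffling are all assumptions made for illustration.

```python
def percent_range(spec, n_samples):
    """Turn '80%' or '80%:100%' into a (start, stop) index range."""
    start_txt, sep, stop_txt = spec.partition(":")
    if not sep:
        # A single bound is read as "from the beginning up to this percentage"
        start_txt, stop_txt = "0%", start_txt

    def to_index(text):
        return int(n_samples * float(text.strip().rstrip("%")) // 100)

    return to_index(start_txt), to_index(stop_txt)
```

For 10 samples, "80%" gives rows 0 to 7 and "80%:100%" gives rows 8 and 9, so the two partitions do not overlap.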
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- to_assigner_spec() -> str | Dict[str, Any] | None
Convert this config to a spec for PartitionAssigner.
- Returns:
Partition specification for PartitionAssigner.assign().
- type: PartitionType | None
- validate_partition_method() -> PartitionConfig
Validate that partition specification is consistent.
- class nirs4all.data.schema.config.PartitionType(value)
Partition assignment type.
- PREDICT = 'predict'
- TEST = 'test'
- TRAIN = 'train'
- class nirs4all.data.schema.config.PreprocessingApplied(*, type: str, description: str | None = None, software: str | None = None, params: Dict[str, Any] | None = None, **extra_data: Any)
Bases: BaseModel
Metadata about preprocessing that was applied offline.
This is informational only; it helps track the provenance of preprocessed data.
Example
    preprocessing_applied:
      - type: "SNV"
        description: "Standard Normal Variate"
        software: "OPUS 8.0"
      - type: "SG_smooth"
        params:
          window: 15
          polyorder: 2
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class nirs4all.data.schema.config.SharedMetadataConfig
Bases: BaseModel
Configuration for shared metadata in multi-source datasets.
When using multiple sources, metadata can be shared across all sources. This configuration specifies how to load and link metadata.
Examples
    # Simple shared metadata
    metadata:
      path: data/metadata.csv
      link_by: sample_id
- model_config: ClassVar[ConfigDict]
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class nirs4all.data.schema.config.SharedTargetsConfig
Bases: BaseModel
Configuration for shared targets in multi-source datasets.
When using multiple sources, targets can be shared across all sources. This configuration specifies how to load and link targets.
Examples
    # Simple shared targets
    targets:
      path: data/targets.csv
      link_by: sample_id

    # With column selection
    targets:
      path: data/all_data.csv
      columns: [0]  # First column is target
      link_by: sample_id
- model_config: ClassVar[ConfigDict]
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
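Linking by a key column such as `sample_id` amounts to a keyed join between feature rows and the shared target rows. The sketch below shows the idea on lists of dicts; the `link_rows` function is hypothetical, and skipping rows without a matching target is an assumption rather than documented behavior.

```python
def link_rows(features, targets, link_by):
    """Attach target fields to feature rows that share the same key value."""
    targets_by_key = {row[link_by]: row for row in targets}
    linked = []
    for row in features:
        match = targets_by_key.get(row[link_by])
        if match is not None:  # rows without a matching target are skipped here
            linked.append({**row, **match})
    return linked


features = [{"sample_id": "s1", "x": 0.12}, {"sample_id": "s2", "x": 0.34}]
targets = [{"sample_id": "s2", "y": 7.5}, {"sample_id": "s1", "y": 3.1}]
```

The join does not depend on row order, which is the point of linking by key instead of by position.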
- class nirs4all.data.schema.config.SignalTypeEnum(value)
Signal type for spectral data.
- ABSORBANCE = 'absorbance'
- AUTO = 'auto'
- KUBELKA_MUNK = 'kubelka-munk'
- LOG_1_R = 'log(1/R)'
- REFLECTANCE = 'reflectance'
- REFLECTANCE_PERCENT = 'reflectance%'
- TRANSMITTANCE = 'transmittance'
- TRANSMITTANCE_PERCENT = 'transmittance%'
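Several of these signal types are related by standard spectroscopy formulas: log(1/R) is the base-10 logarithm of the inverse reflectance, the Kubelka-Munk function is (1 - R)^2 / (2R), and the percent variants are the fractional values multiplied by 100. The helpers below illustrate those relations only; their names are hypothetical, and this page does not state which conversions nirs4all performs.

```python
import math


def percent_to_fraction(value):
    """Convert reflectance% or transmittance% to a 0-1 fraction."""
    return value / 100.0


def reflectance_to_log_1_r(reflectance):
    """Apparent absorbance log10(1/R) from fractional reflectance."""
    return math.log10(1.0 / reflectance)


def reflectance_to_kubelka_munk(reflectance):
    """Kubelka-Munk function (1 - R)^2 / (2R) from fractional reflectance."""
    return (1.0 - reflectance) ** 2 / (2.0 * reflectance)
```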
- class nirs4all.data.schema.config.SourceConfig(*, name: str, files: List[str | SourceFileConfig | Dict[str, Any]] | None = None, train_x: str | None = None, test_x: str | None = None, params: LoadingParams | None = None, link_by: str | None = None)
Bases: BaseModel
Configuration for a single feature source in multi-source datasets.
A source represents a distinct feature set, typically from different instruments, sensors, or measurement types. Each source has its own files, loading parameters, and signal type.
Examples
    # NIR spectrometer source
    sources:
      - name: "NIR"
        files:
          - path: data/NIR_train.csv
            partition: train
          - path: data/NIR_test.csv
            partition: test
        params:
          header_unit: nm
          signal_type: absorbance

    # Multi-source with shared targets
    sources:
      - name: "NIR"
        files: […]
      - name: "MIR"
        files: […]
    targets:
      path: data/targets.csv
      link_by: sample_id
- get_test_paths() -> List[str]
Get all test file paths for this source.
- Returns:
List of paths to test files.
- get_train_paths() -> List[str]
Get all training file paths for this source.
- Returns:
List of paths to training files.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- params: LoadingParams | None
- validate_source_files() -> SourceConfig
Validate that source has at least one data source.
- class nirs4all.data.schema.config.SourceFileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None)
Bases: BaseModel
Configuration for a single file within a source.
Similar to FileConfig but simplified for source context.
- columns: ColumnConfig | None
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- params: LoadingParams | None
- partition: PartitionType | None
- class nirs4all.data.schema.config.TaskType(value)
Task type for the dataset.
- AUTO = 'auto'
- BINARY_CLASSIFICATION = 'binary_classification'
- MULTICLASS_CLASSIFICATION = 'multiclass_classification'
- REGRESSION = 'regression'
- class nirs4all.data.schema.config.VariationConfig(*, name: str, description: str | None = None, files: List[str | VariationFileConfig | Dict[str, Any]] | None = None, train_x: str | None = None, test_x: str | None = None, params: LoadingParams | None = None, preprocessing_applied: List[PreprocessingApplied] | None = None)
Bases: BaseModel
Configuration for a single feature variation.
A variation represents a different "view" of the same samples, such as:
- Pre-computed preprocessing (SNV, MSC, derivatives)
- Different variables from time series data
- Different feature representations
All variations must have the same number of samples (rows).
Examples
    # Simple variation
    variations:
      - name: "raw"
        files:
          - path: data/spectra_raw.csv
            partition: train

    # Variation with preprocessing provenance
    variations:
      - name: "snv"
        description: "SNV preprocessed spectra"
        preprocessing_applied:
          - type: "SNV"
            software: "OPUS 8.0"
        files:
          - path: data/spectra_snv.csv
            partition: train

    # Using direct paths
    variations:
      - name: "raw"
        train_x: data/X_raw_train.csv
        test_x: data/X_raw_test.csv
- get_test_paths() -> List[str]
Get all test file paths for this variation.
- Returns:
List of paths to test files.
- get_train_paths() -> List[str]
Get all training file paths for this variation.
- Returns:
List of paths to training files.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- params: LoadingParams | None
- preprocessing_applied: List[PreprocessingApplied] | None
- validate_variation_files() -> VariationConfig
Validate that variation has at least one data source.
- class nirs4all.data.schema.config.VariationFileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None, header: Dict[str, Any] | None = None)
Bases: BaseModel
Configuration for a single file within a variation.
Similar to SourceFileConfig but for variation context.
- columns: ColumnConfig | None
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- params: LoadingParams | None
- partition: PartitionType | None
- class nirs4all.data.schema.config.VariationMode(value)
Mode for handling feature variations.
Feature variations represent different “views” of the same samples, such as pre-computed preprocessing variants or different variables from time series data.
- COMPARE = 'compare'
- CONCAT = 'concat'
- SELECT = 'select'
- SEPARATE = 'separate'