nirs4all.data.schema package

Module contents

Schema module for dataset configuration.

This module provides Pydantic-based schema models for dataset configuration, offering type safety, validation, and clear documentation of the configuration format.

The schema supports:

  • Legacy format (train_x, test_x, etc.): fully implemented

  • New files syntax (planned for future phases)

  • Multi-source datasets with sources syntax

  • Feature variations for preprocessed data or multi-variable datasets

class nirs4all.data.schema.AggregateMethod(value)[source]

Bases: str, Enum

Method for aggregating predictions.

MEAN = 'mean'
MEDIAN = 'median'
VOTE = 'vote'
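
The three values above name common ways to combine repeated predictions for one sample. The sketch below is illustrative only: the enum is a local copy of the documented values, and `aggregate` is a hypothetical helper, not a function of this package. It also shows that a `str, Enum` member can be looked up from, and compared with, its plain string value.

```python
from collections import Counter
from enum import Enum
from statistics import mean, median


class AggregateMethod(str, Enum):
    """Local copy of the documented enum values."""
    MEAN = "mean"
    MEDIAN = "median"
    VOTE = "vote"


def aggregate(values, method):
    """Hypothetical helper: combine repeated predictions for one sample."""
    method = AggregateMethod(method)  # accepts the enum member or its string value
    if method is AggregateMethod.MEAN:
        return mean(values)
    if method is AggregateMethod.MEDIAN:
        return median(values)
    # VOTE: return the most frequent label
    return Counter(values).most_common(1)[0][0]


print(aggregate([1.0, 2.0, 6.0], "mean"))
print(aggregate([1.0, 2.0, 6.0], AggregateMethod.MEDIAN))
print(aggregate(["a", "b", "a"], "vote"))
```

How the library itself applies each method (for example, tie-breaking in a vote) is not stated in this reference.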
class nirs4all.data.schema.CategoricalMode(value)[source]

Bases: str, Enum

Mode for handling categorical columns in Y data.

AUTO = 'auto'
NONE = 'none'
PRESERVE = 'preserve'
class nirs4all.data.schema.ColumnConfig(*, features: List[int] | List[str] | str | Dict[str, Any] | None = None, targets: List[int] | List[str] | str | Dict[str, Any] | None = None, metadata: List[int] | List[str] | str | Dict[str, Any] | None = None)[source]

Bases: BaseModel

Configuration for column selection and role assignment.

This is a stub for future implementation of the files syntax. Currently, column selection is handled by the loader directly.

features: List[int] | List[str] | str | Dict[str, Any] | None
metadata: List[int] | List[str] | str | Dict[str, Any] | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

targets: List[int] | List[str] | str | Dict[str, Any] | None
class nirs4all.data.schema.ConfigValidator(check_file_existence: bool = False, custom_validators: List[Callable] | None = None)[source]

Bases: object

Validator for dataset configurations.

Provides validation rules and methods for checking dataset configurations. Supports both legacy and new format configurations.

Example

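```python
from nirs4all.data.schema import ConfigValidator

# config_dict is the dataset configuration dictionary to check
validator = ConfigValidator()
result = validator.validate(config_dict)
if not result.is_valid:
    for error in result.errors:
        print(f"Error: {error}")
```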

validate(config: Dict[str, Any]) ValidationResult[source]

Validate a configuration dictionary.

Parameters:

config – Configuration dictionary to validate.

Returns:

ValidationResult with errors, warnings, and normalized config.

class nirs4all.data.schema.DatasetConfigSchema(*, name: str | None = None, description: str | None = None, task_type: TaskType | None = None, train_x: Any | None = None, train_y: Any | None = None, train_group: Any | None = None, test_x: Any | None = None, test_y: Any | None = None, test_group: Any | None = None, train_x_filter: List[int] | None = None, train_y_filter: List[int] | None = None, train_group_filter: List[int] | None = None, test_x_filter: List[int] | None = None, test_y_filter: List[int] | None = None, test_group_filter: List[int] | None = None, global_params: LoadingParams | None = None, train_params: LoadingParams | None = None, test_params: LoadingParams | None = None, train_x_params: LoadingParams | None = None, train_y_params: LoadingParams | None = None, train_group_params: LoadingParams | None = None, test_x_params: LoadingParams | None = None, test_y_params: LoadingParams | None = None, test_group_params: LoadingParams | None = None, aggregate: str | bool | None = None, aggregate_method: AggregateMethod | None = None, aggregate_exclude_outliers: bool | None = None, files: List[FileConfig] | None = None, sources: List[SourceConfig] | None = None, shared_targets: SharedTargetsConfig | List[SharedTargetsConfig] | None = None, shared_metadata: SharedMetadataConfig | List[SharedMetadataConfig] | None = None, variations: List[VariationConfig] | None = None, variation_mode: VariationMode | None = None, variation_select: List[str] | None = None, variation_prefix: bool | None = None, folds: FoldConfig | List[Dict[str, Any]] | str | None = None, **extra_data: Any)[source]

Bases: BaseModel

Complete dataset configuration schema.

This model represents the normalized, validated form of a dataset configuration. It supports the legacy format (train_x, test_x, etc.) and is designed to be extensible to the new files syntax.

All input configurations are normalized to this schema before processing.

aggregate: str | bool | None
aggregate_exclude_outliers: bool | None
aggregate_method: AggregateMethod | None
description: str | None
files: List[FileConfig] | None
folds: FoldConfig | List[Dict[str, Any]] | str | None
classmethod from_dict(data: Dict[str, Any]) DatasetConfigSchema[source]

Create from dictionary.

get_effective_params(partition: str, data_type: str) LoadingParams[source]

Get effective loading parameters for a specific data file.

Parameters are merged with precedence: specific > partition > global.

Parameters:
  • partition – 'train' or 'test'

  • data_type – 'x', 'y', or 'group'

Returns:

Merged LoadingParams.
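
The precedence rule (specific > partition > global) can be pictured with plain dictionaries. This is a minimal sketch, assuming that a `None` value means "not set"; `merge_params` and the sample values are hypothetical and do not reproduce the library's implementation.

```python
def merge_params(*layers):
    """Merge parameter dicts; earlier layers win, None means 'not set'."""
    merged = {}
    # Walk from lowest to highest priority so later writes override earlier ones.
    for layer in reversed(layers):
        for key, value in (layer or {}).items():
            if value is not None:
                merged[key] = value
    return merged


global_params = {"delimiter": ";", "encoding": "utf-8", "has_header": True}
train_params = {"delimiter": ",", "has_header": None}
train_x_params = {"encoding": "latin-1"}

# Order of arguments: specific, then partition, then global.
effective = merge_params(train_x_params, train_params, global_params)
print(effective)
```

Here the file-level encoding and the partition-level delimiter override the global values, while `has_header` falls through to the global setting.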

get_selected_variations() List[VariationConfig][source]

Get the variations to use based on variation_mode and variation_select.

For mode='select', returns only the selected variations. For other modes, returns all variations.

Returns:

List of VariationConfig objects to use.

get_source_count() int[source]

Get the number of feature sources.

Returns:

Number of sources (1 for single-source, >1 for multi-source).

get_source_names() List[str][source]

Get names of all sources in this config.

Returns:

List of source names, or empty list if not multi-source.

get_variation_count() int[source]

Get the number of feature variations.

Returns:

Number of variations.

get_variation_names() List[str][source]

Get names of all variations in this config.

Returns:

List of variation names, or empty list if no variations.

global_params: LoadingParams | None
is_files_format() bool[source]

Check if this config uses new files format.

is_legacy_format() bool[source]

Check if this config uses legacy format (train_x/test_x).

is_multi_source() bool[source]

Check if this config has multiple feature sources.

is_sources_format() bool[source]

Check if this config uses the new sources format.

is_variations_format() bool[source]

Check if this config uses the variations format.

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'validate_assignment': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str | None
classmethod normalize_aggregate_method(v: Any) AggregateMethod | None[source]

Normalize aggregate_method to enum.

classmethod normalize_task_type(v: Any) TaskType | None[source]

Normalize task_type to enum.

classmethod normalize_variation_mode(v: Any) VariationMode | None[source]

Normalize variation_mode to enum.

classmethod parse_loading_params(v: Any) LoadingParams | None[source]

Parse dict to LoadingParams if needed.

classmethod parse_shared_metadata(v: Any) SharedMetadataConfig | List[SharedMetadataConfig] | None[source]

Parse shared metadata configuration.

classmethod parse_shared_targets(v: Any) SharedTargetsConfig | List[SharedTargetsConfig] | None[source]

Parse shared targets configuration.

classmethod parse_sources(v: Any) List[SourceConfig] | None[source]

Parse sources list to SourceConfig objects.

classmethod parse_variations(v: Any) List[VariationConfig] | None[source]

Parse variations list to VariationConfig objects.

shared_metadata: SharedMetadataConfig | List[SharedMetadataConfig] | None
shared_targets: SharedTargetsConfig | List[SharedTargetsConfig] | None
sources: List[SourceConfig] | None
task_type: TaskType | None
test_group: Any | None
test_group_filter: List[int] | None
test_group_params: LoadingParams | None
test_params: LoadingParams | None
test_x: Any | None
test_x_filter: List[int] | None
test_x_params: LoadingParams | None
test_y: Any | None
test_y_filter: List[int] | None
test_y_params: LoadingParams | None
to_dict() Dict[str, Any][source]

Convert to dictionary, excluding None values.

to_legacy_format() Dict[str, Any][source]

Convert sources or variations format to legacy format for backward compatibility.

This converts the sources/variations syntax to the train_x/test_x array syntax that existing loaders understand.

Returns:

Dictionary with legacy format configuration.

train_group: Any | None
train_group_filter: List[int] | None
train_group_params: LoadingParams | None
train_params: LoadingParams | None
train_x: Any | None
train_x_filter: List[int] | None
train_x_params: LoadingParams | None
train_y: Any | None
train_y_filter: List[int] | None
train_y_params: LoadingParams | None
validate_data_sources() DatasetConfigSchema[source]

Validate that at least one data source is specified.

variation_mode: VariationMode | None
variation_prefix: bool | None
variation_select: List[str] | None
variations: List[VariationConfig] | None
variations_to_legacy_format() Dict[str, Any][source]

Convert variations format to legacy format for backward compatibility.

This converts the variations syntax to the train_x/test_x format that existing loaders understand. The conversion depends on variation_mode:

  • separate: Returns config for first variation (caller handles multiple runs)

  • concat: Returns list of paths to be concatenated

  • select: Returns config for selected variations only

  • compare: Same as separate (caller handles comparison)

Returns:

Dictionary with legacy format configuration.
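
The mode-dependent conversion described above might look like the following sketch. It works on plain dictionaries with hypothetical file names; the exact shape of the returned legacy dictionary (in particular, a single string versus a list of paths) is an assumption, not something this reference specifies.

```python
def variations_to_legacy(variations, mode="separate", select=None):
    """Hypothetical sketch of the documented mode-dependent conversion."""
    if not variations:
        raise ValueError("no variations to convert")
    if mode in ("separate", "compare"):
        # Only the first variation; the caller loops over the others.
        chosen = variations[:1]
    elif mode == "select":
        wanted = set(select or [])
        chosen = [v for v in variations if v["name"] in wanted]
    elif mode == "concat":
        chosen = list(variations)
    else:
        raise ValueError(f"unknown variation_mode: {mode!r}")

    def collect(key):
        paths = [v[key] for v in chosen if v.get(key)]
        # A single path stays a plain string; several become a list.
        return paths[0] if len(paths) == 1 else paths

    return {"train_x": collect("train_x"), "test_x": collect("test_x")}


variations = [
    {"name": "raw", "train_x": "raw_train.csv", "test_x": "raw_test.csv"},
    {"name": "snv", "train_x": "snv_train.csv", "test_x": "snv_test.csv"},
    {"name": "d1", "train_x": "d1_train.csv", "test_x": "d1_test.csv"},
]
print(variations_to_legacy(variations, "separate"))
print(variations_to_legacy(variations, "concat"))
print(variations_to_legacy(variations, "select", select=["snv", "d1"]))
```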

class nirs4all.data.schema.FileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None, link_by: str | None = None)[source]

Bases: BaseModel

Configuration for a single data file.

This is a stub for future implementation of the files syntax. It describes how to load and interpret a single data file.

columns: ColumnConfig | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

params: LoadingParams | None
partition: PartitionType | None
path: str
class nirs4all.data.schema.FoldConfig(*, folds: List[FoldDefinition] | None = None, file: str | None = None, format: Literal['auto', 'csv', 'json', 'yaml', 'txt'] | None = 'auto', column: str | None = None)[source]

Bases: BaseModel

Configuration for cross-validation fold definitions.

Supports multiple ways to specify folds:

  • Inline: List of FoldDefinition objects

  • File: Path to a fold file (CSV, JSON, YAML)

  • Column: Column name in metadata containing fold assignments

Examples

    # Inline fold definitions
    folds:
      - train: [0, 1, 2, 3, 4]
        val: [5, 6, 7, 8, 9]
      - train: [5, 6, 7, 8, 9]
        val: [0, 1, 2, 3, 4]

    # File reference
    folds:
      file: "path/to/folds.csv"
      format: auto

    # Column in metadata
    folds:
      column: "cv_fold"

column: str | None
file: str | None
folds: List[FoldDefinition] | None
format: Literal['auto', 'csv', 'json', 'yaml', 'txt'] | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

to_fold_list() List[Tuple[List[int], List[int]]] | None[source]

Convert inline fold definitions to fold list.

Returns:

List of (train_indices, val_indices) tuples, or None if not inline.
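
As an illustration of the return shape, the sketch below converts inline fold definitions, given here as plain dictionaries, into `(train_indices, val_indices)` tuples. The standalone function is hypothetical, and treating a missing `val` as an empty list is an assumption based on the `<factory>` default in the FoldDefinition signature.

```python
from typing import Dict, List, Optional, Tuple

Fold = Tuple[List[int], List[int]]


def to_fold_list(folds: Optional[List[Dict[str, List[int]]]]) -> Optional[List[Fold]]:
    """Hypothetical sketch: inline fold dicts -> (train, val) index tuples."""
    if folds is None:
        # Folds come from a file or a metadata column, not inline.
        return None
    return [(list(f["train"]), list(f.get("val", []))) for f in folds]


inline = [
    {"train": [0, 1, 2, 3, 4], "val": [5, 6, 7, 8, 9]},
    {"train": [5, 6, 7, 8, 9], "val": [0, 1, 2, 3, 4]},
]
print(to_fold_list(inline))
print(to_fold_list(None))
```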

validate_fold_source() FoldConfig[source]

Validate that exactly one fold source is specified.
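
An "exactly one of" check over the three fold sources (folds, file, column) can be sketched as follows. The function name and error messages are hypothetical; the library performs this check inside the model, and its wording may differ.

```python
def check_fold_source(folds=None, file=None, column=None):
    """Hypothetical sketch: require exactly one of folds, file, or column."""
    candidates = (("folds", folds), ("file", file), ("column", column))
    given = [name for name, value in candidates if value is not None]
    if not given:
        raise ValueError("specify one of: folds, file, column")
    if len(given) > 1:
        raise ValueError(f"only one fold source allowed, got: {', '.join(given)}")
    return given[0]


print(check_fold_source(column="cv_fold"))
```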

class nirs4all.data.schema.FoldDefinition(*, train: List[int], val: List[int] = <factory>)[source]

Bases: BaseModel

Definition of a single cross-validation fold.

Specifies which sample indices belong to training and validation sets.

model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

train: List[int]
val: List[int]
class nirs4all.data.schema.HeaderUnit(value)[source]

Bases: str, Enum

Unit type for spectral headers.

INDEX = 'index'
NONE = 'none'
TEXT = 'text'
WAVELENGTH = 'nm'
WAVENUMBER = 'cm-1'
class nirs4all.data.schema.LoadingParams(*, delimiter: str | None = None, decimal_separator: str | None = None, has_header: bool | None = None, header_unit: HeaderUnit | str | None = None, signal_type: SignalTypeEnum | str | None = None, encoding: str | None = None, na_policy: NAPolicy | str | None = None, categorical_mode: CategoricalMode | str | None = None, **extra_data: Any)[source]

Bases: BaseModel

Parameters for loading data files.

These parameters control how CSV and other files are parsed. Parameters can be specified at global, partition, or file level, with more specific levels overriding general ones.

categorical_mode: CategoricalMode | str | None
decimal_separator: str | None
delimiter: str | None
encoding: str | None
has_header: bool | None
header_unit: HeaderUnit | str | None
merge_with(other: LoadingParams | None) LoadingParams[source]

Merge with another LoadingParams, self taking precedence.

Parameters:

other – Another LoadingParams to merge with (lower priority).

Returns:

New LoadingParams with merged values.
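
A pairwise merge in which `self` wins and a new object is returned could be sketched with a frozen dataclass. `Params` is a hypothetical stand-in carrying three of the LoadingParams fields; treating only `None` as "unset" is an assumption.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for LoadingParams with three of its fields."""
    delimiter: Optional[str] = None
    encoding: Optional[str] = None
    has_header: Optional[bool] = None

    def merge_with(self, other: Optional["Params"]) -> "Params":
        if other is None:
            return self
        # Fill only the fields that self leaves unset; False counts as set.
        fill = {
            f.name: getattr(other, f.name)
            for f in fields(self)
            if getattr(self, f.name) is None
        }
        return replace(self, **fill)


specific = Params(delimiter=",", has_header=False)
general = Params(delimiter=";", encoding="utf-8", has_header=True)
print(specific.merge_with(general))
```

Neither input is modified, which matches the documented return of a new object.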

model_config = {'extra': 'allow'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

na_policy: NAPolicy | str | None
classmethod normalize_header_unit(v: Any) HeaderUnit | str | None[source]

Normalize header_unit to enum if possible.

classmethod normalize_signal_type(v: Any) SignalTypeEnum | str | None[source]

Normalize signal_type to enum if possible.
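
"Normalize to enum if possible" suggests a try-then-fall-back pattern: known values become enum members, anything else is kept as the original string. The sketch below uses a local copy of the documented HeaderUnit values; the fallback behavior and the absence of case folding are assumptions.

```python
from enum import Enum


class HeaderUnit(str, Enum):
    """Local copy of the documented enum values."""
    INDEX = "index"
    NONE = "none"
    TEXT = "text"
    WAVELENGTH = "nm"
    WAVENUMBER = "cm-1"


def normalize_header_unit(value):
    """Hypothetical sketch: return an enum member when the value is known."""
    if value is None or isinstance(value, HeaderUnit):
        return value
    try:
        return HeaderUnit(value)
    except ValueError:
        # Unknown units pass through unchanged.
        return value


print(repr(normalize_header_unit("nm")))
print(repr(normalize_header_unit("microns")))
```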

signal_type: SignalTypeEnum | str | None
class nirs4all.data.schema.NAPolicy(value)[source]

Bases: str, Enum

Policy for handling NA/missing values.

ABORT = 'abort'
AUTO = 'auto'
REMOVE = 'remove'
class nirs4all.data.schema.PartitionConfig(*, type: PartitionType | None = None, column: str | None = None, train_values: List[str] | None = None, test_values: List[str] | None = None, predict_values: List[str] | None = None, unknown_policy: Literal['error', 'ignore', 'train'] | None = None, train: str | List[int] | None = None, test: str | List[int] | None = None, predict: str | List[int] | None = None, shuffle: bool | None = None, random_state: int | None = None, stratify: str | None = None, train_file: str | None = None, test_file: str | None = None, predict_file: str | None = None, **extra_data: Any)[source]

Bases: BaseModel

Configuration for partition assignment.

Supports multiple partition methods:

  • Static: Assign entire file to a partition (use type)

  • Column-based: Partition based on column values (use column)

  • Percentage-based: Split by percentage (use train, test with percentages)

  • Index-based: Explicit index lists (use train, test with lists)

  • Index file: Load indices from external files (use train_file, test_file)

Examples

    # Static partition (entire file)
    partition:
      type: train

    # Column-based partition
    partition:
      column: "split"
      train_values: ["train", "training"]
      test_values: ["test", "validation"]

    # Percentage-based partition
    partition:
      train: "80%"
      test: "80%:100%"
      shuffle: true
      random_state: 42

    # Index-based partition
    partition:
      train: [0, 1, 2, 3, 4]
      test: [5, 6, 7, 8, 9]

    # Index file partition
    partition:
      train_file: "train_indices.txt"
      test_file: "test_indices.txt"
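
One plausible reading of the percentage syntax is that `"80%"` selects the first 80 % of the samples and `"80%:100%"` selects the slice between the two bounds. The sketch below implements that reading; both functions are hypothetical, and the truncating rounding rule and the shuffle-before-split order are assumptions rather than documented behavior.

```python
import random


def percent_range(spec: str, n_samples: int) -> range:
    """Hypothetical parser: '80%' -> first 80 %, '80%:100%' -> slice between bounds."""
    if ":" in spec:
        start_txt, stop_txt = spec.split(":", 1)
    else:
        start_txt, stop_txt = "0%", spec
    start = int(n_samples * float(start_txt.strip().rstrip("%")) / 100)
    stop = int(n_samples * float(stop_txt.strip().rstrip("%")) / 100)
    return range(start, stop)


def split_by_percent(n_samples, train, test, shuffle=False, random_state=None):
    """Return (train_indices, test_indices) for percentage specs."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(random_state).shuffle(indices)
    train_idx = [indices[i] for i in percent_range(train, n_samples)]
    test_idx = [indices[i] for i in percent_range(test, n_samples)]
    return train_idx, test_idx


print(split_by_percent(10, "80%", "80%:100%"))
```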

column: str | None
model_config = {'extra': 'allow'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

predict: str | List[int] | None
predict_file: str | None
predict_values: List[str] | None
random_state: int | None
shuffle: bool | None
stratify: str | None
test: str | List[int] | None
test_file: str | None
test_values: List[str] | None
to_assigner_spec() str | Dict[str, Any] | None[source]

Convert this config to a spec for PartitionAssigner.

Returns:

Partition specification for PartitionAssigner.assign().

train: str | List[int] | None
train_file: str | None
train_values: List[str] | None
type: PartitionType | None
unknown_policy: Literal['error', 'ignore', 'train'] | None
validate_partition_method() PartitionConfig[source]

Validate that partition specification is consistent.

class nirs4all.data.schema.PartitionType(value)[source]

Bases: str, Enum

Partition assignment type.

PREDICT = 'predict'
TEST = 'test'
TRAIN = 'train'
nirs4all.data.schema.PathOrArray

alias of Any

class nirs4all.data.schema.PreprocessingApplied(*, type: str, description: str | None = None, software: str | None = None, params: Dict[str, Any] | None = None, **extra_data: Any)[source]

Bases: BaseModel

Metadata about preprocessing that was applied offline.

This is informational only - helps track provenance of preprocessed data.

Example

    preprocessing_applied:
      - type: "SNV"
        description: "Standard Normal Variate"
        software: "OPUS 8.0"
      - type: "SG_smooth"
        params:
          window: 15
          polyorder: 2

description: str | None
model_config = {'extra': 'allow'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

params: Dict[str, Any] | None
software: str | None
type: str
class nirs4all.data.schema.SharedMetadataConfig(*, path: str, columns: List[int] | List[str] | str | Dict[str, Any] | None = None, link_by: str | None = None, params: LoadingParams | None = None, partition: PartitionType | None = None)[source]

Bases: BaseModel

Configuration for shared metadata in multi-source datasets.

When using multiple sources, metadata can be shared across all sources. This configuration specifies how to load and link metadata.

Examples

    # Simple shared metadata
    metadata:
      path: data/metadata.csv
      link_by: sample_id

columns: List[int] | List[str] | str | Dict[str, Any] | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

params: LoadingParams | None
partition: PartitionType | None
path: str
class nirs4all.data.schema.SharedTargetsConfig(*, path: str, columns: List[int] | List[str] | str | Dict[str, Any] | None = None, link_by: str | None = None, params: LoadingParams | None = None, partition: PartitionType | None = None)[source]

Bases: BaseModel

Configuration for shared targets in multi-source datasets.

When using multiple sources, targets can be shared across all sources. This configuration specifies how to load and link targets.

Examples

    # Simple shared targets
    targets:
      path: data/targets.csv
      link_by: sample_id

    # With column selection
    targets:
      path: data/all_data.csv
      columns: [0]  # First column is target
      link_by: sample_id

columns: List[int] | List[str] | str | Dict[str, Any] | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

params: LoadingParams | None
partition: PartitionType | None
path: str
class nirs4all.data.schema.SignalTypeEnum(value)[source]

Bases: str, Enum

Signal type for spectral data.

ABSORBANCE = 'absorbance'
AUTO = 'auto'
KUBELKA_MUNK = 'kubelka-munk'
LOG_1_R = 'log(1/R)'
REFLECTANCE = 'reflectance'
REFLECTANCE_PERCENT = 'reflectance%'
TRANSMITTANCE = 'transmittance'
TRANSMITTANCE_PERCENT = 'transmittance%'
class nirs4all.data.schema.SourceConfig(*, name: str, files: List[str | SourceFileConfig | Dict[str, Any]] | None = None, train_x: str | None = None, test_x: str | None = None, params: LoadingParams | None = None, link_by: str | None = None)[source]

Bases: BaseModel

Configuration for a single feature source in multi-source datasets.

A source represents a distinct feature set, typically from different instruments, sensors, or measurement types. Each source has its own files, loading parameters, and signal type.

Examples

    # NIR spectrometer source
    sources:
      - name: "NIR"
        files:
          - path: data/NIR_train.csv
            partition: train
          - path: data/NIR_test.csv
            partition: test
        params:
          header_unit: nm
          signal_type: absorbance

    # Multi-source with shared targets
    sources:
      - name: "NIR"
        files: […]
      - name: "MIR"
        files: […]
    targets:
      path: data/targets.csv
      link_by: sample_id

files: List[str | SourceFileConfig | Dict[str, Any]] | None
get_test_paths() List[str][source]

Get all test file paths for this source.

Returns:

List of paths to test files.

get_train_paths() List[str][source]

Get all training file paths for this source.

Returns:

List of paths to training files.

model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
params: LoadingParams | None
test_x: str | None
train_x: str | None
validate_source_files() SourceConfig[source]

Validate that source has at least one data source.

class nirs4all.data.schema.SourceFileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None)[source]

Bases: BaseModel

Configuration for a single file within a source.

Similar to FileConfig but simplified for source context.

columns: ColumnConfig | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

params: LoadingParams | None
partition: PartitionType | None
path: str
class nirs4all.data.schema.TaskType(value)[source]

Bases: str, Enum

Task type for the dataset.

AUTO = 'auto'
BINARY_CLASSIFICATION = 'binary_classification'
MULTICLASS_CLASSIFICATION = 'multiclass_classification'
REGRESSION = 'regression'
class nirs4all.data.schema.ValidationError(code: str, message: str, field: str | None = None, value: Any = None, suggestion: str | None = None)[source]

Bases: object

Represents a validation error.

code

Error code for programmatic handling.

Type:

str

message

Human-readable error message.

Type:

str

field

The configuration field that caused the error.

Type:

str | None

value

The value that caused the error.

Type:

Any

suggestion

Optional suggestion for fixing the error.

Type:

str | None

code: str
field: str | None = None
message: str
suggestion: str | None = None
value: Any = None
class nirs4all.data.schema.ValidationResult(is_valid: bool, errors: List[ValidationError] = <factory>, warnings: List[ValidationWarning] = <factory>, normalized_config: Dict[str, Any] | None = None)[source]

Bases: object

Result of configuration validation.

is_valid

Whether the configuration is valid (no errors).

Type:

bool

errors

List of validation errors.

Type:

List[nirs4all.data.schema.validation.validators.ValidationError]

warnings

List of validation warnings.

Type:

List[nirs4all.data.schema.validation.validators.ValidationWarning]

normalized_config

The validated and normalized configuration.

Type:

Dict[str, Any] | None

errors: List[ValidationError]
is_valid: bool
normalized_config: Dict[str, Any] | None = None
raise_if_invalid() None[source]

Raise ValueError if configuration is invalid.
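
The fail-fast pattern behind this method can be sketched with small stand-in dataclasses. `Issue` and `Result` are hypothetical simplifications of ValidationError and ValidationResult; the error code `MISSING_DATA` and the message format are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Issue:
    """Hypothetical stand-in for ValidationError (code and message only)."""
    code: str
    message: str


@dataclass
class Result:
    """Hypothetical stand-in for ValidationResult."""
    is_valid: bool
    errors: List[Issue] = field(default_factory=list)

    def raise_if_invalid(self) -> None:
        if not self.is_valid:
            details = "; ".join(f"[{e.code}] {e.message}" for e in self.errors)
            raise ValueError(f"Invalid configuration: {details}")


ok = Result(is_valid=True)
ok.raise_if_invalid()  # valid result: nothing happens

bad = Result(is_valid=False, errors=[Issue("MISSING_DATA", "no train_x or test_x given")])
try:
    bad.raise_if_invalid()
except ValueError as exc:
    print(exc)
```

Calling the method right after validation turns a result object into an exception, which suits scripts that should stop on the first invalid configuration.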

warnings: List[ValidationWarning]
class nirs4all.data.schema.ValidationWarning(code: str, message: str, field: str | None = None)[source]

Bases: object

Represents a validation warning (non-fatal issue).

code

Warning code for programmatic handling.

Type:

str

message

Human-readable warning message.

Type:

str

field

The configuration field that caused the warning.

Type:

str | None

code: str
field: str | None = None
message: str
class nirs4all.data.schema.VariationConfig(*, name: str, description: str | None = None, files: List[str | VariationFileConfig | Dict[str, Any]] | None = None, train_x: str | None = None, test_x: str | None = None, params: LoadingParams | None = None, preprocessing_applied: List[PreprocessingApplied] | None = None)[source]

Bases: BaseModel

Configuration for a single feature variation.

A variation represents a different “view” of the same samples, such as:

  • Pre-computed preprocessing (SNV, MSC, derivatives)

  • Different variables from time series data

  • Different feature representations

All variations must have the same number of samples (rows).

Examples

    # Simple variation
    variations:
      - name: "raw"
        files:
          - path: data/spectra_raw.csv
            partition: train

    # Variation with preprocessing provenance
    variations:
      - name: "snv"
        description: "SNV preprocessed spectra"
        preprocessing_applied:
          - type: "SNV"
            software: "OPUS 8.0"
        files:
          - path: data/spectra_snv.csv
            partition: train

    # Using direct paths
    variations:
      - name: "raw"
        train_x: data/X_raw_train.csv
        test_x: data/X_raw_test.csv

description: str | None
files: List[str | VariationFileConfig | Dict[str, Any]] | None
get_test_paths() List[str][source]

Get all test file paths for this variation.

Returns:

List of paths to test files.

get_train_paths() List[str][source]

Get all training file paths for this variation.

Returns:

List of paths to training files.

model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
params: LoadingParams | None
preprocessing_applied: List[PreprocessingApplied] | None
test_x: str | None
train_x: str | None
validate_variation_files() VariationConfig[source]

Validate that variation has at least one data source.

class nirs4all.data.schema.VariationFileConfig(*, path: str, partition: PartitionType | None = None, columns: ColumnConfig | None = None, params: LoadingParams | None = None, header: Dict[str, Any] | None = None)[source]

Bases: BaseModel

Configuration for a single file within a variation.

Similar to SourceFileConfig but for variation context.

columns: ColumnConfig | None
header: Dict[str, Any] | None
model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

params: LoadingParams | None
partition: PartitionType | None
path: str
class nirs4all.data.schema.VariationMode(value)[source]

Bases: str, Enum

Mode for handling feature variations.

Feature variations represent different “views” of the same samples, such as pre-computed preprocessing variants or different variables from time series data.

COMPARE = 'compare'
CONCAT = 'concat'
SELECT = 'select'
SEPARATE = 'separate'