nirs4all.data.schema.validation package

Module contents

Validation module for dataset configuration.

This module provides validators for dataset configuration schemas, offering detailed error messages and validation results.

class nirs4all.data.schema.validation.ConfigValidator(check_file_existence: bool = False, custom_validators: List[Callable] | None = None)

Bases: object

Validator for dataset configurations.

Provides validation rules and methods for checking dataset configurations. Supports both legacy and new format configurations.

Example

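```python
from nirs4all.data.schema.validation import ConfigValidator

validator = ConfigValidator()
result = validator.validate(config_dict)  # config_dict: your dataset configuration
if not result.is_valid:
    for error in result.errors:
        print(f"Error: {error}")
```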

validate(config: Dict[str, Any]) -> ValidationResult

Validate a configuration dictionary.

Parameters:

config – Configuration dictionary to validate.

Returns:

ValidationResult with errors, warnings, and normalized config.

class nirs4all.data.schema.validation.DiagnosticBuilder

Bases: object

Builder for diagnostic messages.

Example

```python
from nirs4all.data.schema.validation import DiagnosticBuilder, ErrorRegistry

builder = DiagnosticBuilder()

# Create an error message
error = builder.create(
    ErrorRegistry.E200,
    path="/path/to/file.csv",
)

# Create an error message with a location
error = builder.create(
    ErrorRegistry.E401,
    line=10,
    error="Unexpected token",
    location="config.json:10",
)
```

create(error_code: ErrorCode, location: str | None = None, **kwargs) -> DiagnosticMessage

Create a diagnostic message.

Parameters:
  • error_code – The error code definition.

  • location – Optional file/line location.

  • **kwargs – Parameters for message template.

Returns:

DiagnosticMessage instance.
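
The way create turns a message template and keyword arguments into a formatted message can be pictured with a small stand-alone sketch. The classes below are simplified stand-ins that mirror only the documented field names; they are not the nirs4all implementation, and the use of `str.format` to fill the `{placeholders}` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SketchErrorCode:
    """Hypothetical stand-in mirroring a few documented ErrorCode fields."""
    code: str
    message_template: str
    suggestion_template: Optional[str] = None


@dataclass
class SketchMessage:
    """Hypothetical stand-in mirroring the documented DiagnosticMessage fields."""
    error_code: SketchErrorCode
    message: str
    suggestion: Optional[str] = None
    context: Dict[str, Any] = field(default_factory=dict)
    location: Optional[str] = None


def create(error_code: SketchErrorCode, location: Optional[str] = None, **kwargs: Any) -> SketchMessage:
    # Assumption: templates are filled with str.format using the keyword arguments.
    message = error_code.message_template.format(**kwargs)
    suggestion = None
    if error_code.suggestion_template is not None:
        suggestion = error_code.suggestion_template.format(**kwargs)
    return SketchMessage(error_code, message, suggestion, dict(kwargs), location)


# Templates copied from the E401 entry of the ErrorRegistry listing.
E401 = SketchErrorCode(
    code="E401",
    message_template="Invalid JSON configuration at line {line}: {error}",
    suggestion_template="Check JSON syntax around line {line}.",
)

msg = create(E401, line=10, error="Unexpected token", location="config.json:10")
print(msg.message)     # Invalid JSON configuration at line 10: Unexpected token
print(msg.suggestion)  # Check JSON syntax around line 10.
```

Because `location` is a named parameter, it is kept separate from the template parameters collected in `**kwargs`.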

file_not_found(path: str) -> DiagnosticMessage

Create file not found error.

invalid_value(field: str, value: Any, valid_values: List[Any]) -> DiagnosticMessage

Create invalid value error.

missing_field(field: str) -> DiagnosticMessage

Create missing field error.

type_error(field: str, expected: str, actual: str) -> DiagnosticMessage

Create type error.

class nirs4all.data.schema.validation.DiagnosticMessage(error_code: ErrorCode, message: str, suggestion: str | None = None, context: Dict[str, Any] = <factory>, location: str | None = None)

Bases: object

A diagnostic message with formatted content.

error_code

The ErrorCode definition.

Type:

nirs4all.data.schema.validation.error_codes.ErrorCode

message

Formatted error message.

Type:

str

suggestion

Formatted suggestion (if any).

Type:

str | None

context

Additional context information.

Type:

Dict[str, Any]

location

File/line location (if applicable).

Type:

str | None

property category: ErrorCategory

Get the error category.

property code: str

Get the error code string.

property severity: ErrorSeverity

Get the error severity.

to_dict() -> Dict[str, Any]

Convert to dictionary.

class nirs4all.data.schema.validation.DiagnosticReport(messages: List[DiagnosticMessage] = <factory>, config_path: str | None = None)

Bases: object

Collection of diagnostic messages.

messages

List of diagnostic messages.

Type:

List[nirs4all.data.schema.validation.error_codes.DiagnosticMessage]

config_path

Path to the configuration file (if any).

Type:

str | None

add(message: DiagnosticMessage) -> None

Add a diagnostic message.

add_error(error_code: ErrorCode, location: str | None = None, **kwargs) -> DiagnosticMessage

Create and add an error message.

property errors: List[DiagnosticMessage]

Get all error messages.

property has_errors: bool

Check if there are any errors.

property is_valid: bool

Check if configuration is valid (no errors).

to_dict() -> Dict[str, Any]

Convert to dictionary.

property warnings: List[DiagnosticMessage]

Get all warning messages.
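
The relationship between messages, errors, warnings, has_errors, and is_valid can be sketched as severity-based filtering. This is a stand-alone illustration with hypothetical stand-in classes (`Severity`, `SketchMessage`, `SketchReport`); filtering on severity is an assumption about how the properties are derived, not a copy of the library code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Severity(str, Enum):
    """Local stand-in using the documented ErrorSeverity values."""
    ERROR = "error"
    INFO = "info"
    WARNING = "warning"


@dataclass
class SketchMessage:
    """Hypothetical, reduced stand-in for DiagnosticMessage."""
    code: str
    severity: Severity
    message: str


@dataclass
class SketchReport:
    """Hypothetical stand-in for DiagnosticReport."""
    messages: List[SketchMessage] = field(default_factory=list)

    def add(self, message: SketchMessage) -> None:
        self.messages.append(message)

    @property
    def errors(self) -> List[SketchMessage]:
        return [m for m in self.messages if m.severity is Severity.ERROR]

    @property
    def warnings(self) -> List[SketchMessage]:
        return [m for m in self.messages if m.severity is Severity.WARNING]

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)

    @property
    def is_valid(self) -> bool:
        # Valid means "no errors"; warnings alone do not invalidate the report.
        return not self.has_errors


report = SketchReport()
report.add(SketchMessage("E204", Severity.WARNING, "File encoding issue: data.csv"))
valid_with_warning_only = report.is_valid
report.add(SketchMessage("E200", Severity.ERROR, "File not found: data.csv"))
valid_after_error = report.is_valid
print(valid_with_warning_only, valid_after_error)  # True False
```

The severities used here follow the registry listing, where E204 is a warning and E200 is an error.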

class nirs4all.data.schema.validation.ErrorCategory(value)

Bases: str, Enum

Categories of configuration errors.

AGGREGATION = 'aggregation'
DATA = 'data'
FILE = 'file'
FOLD = 'fold'
LOADING = 'loading'
PARTITION = 'partition'
RUNTIME = 'runtime'
SCHEMA = 'schema'
VARIATION = 'variation'

class nirs4all.data.schema.validation.ErrorCode(code: str, category: ErrorCategory, severity: ErrorSeverity, message_template: str, suggestion_template: str | None = None, documentation_url: str | None = None)

Bases: object

Error code definition.

code

Unique error code (e.g., “E001”).

Type:

str

category

Error category.

Type:

nirs4all.data.schema.validation.error_codes.ErrorCategory

severity

Error severity.

Type:

nirs4all.data.schema.validation.error_codes.ErrorSeverity

message_template

Template for error message with {placeholders}.

Type:

str

suggestion_template

Template for fix suggestion.

Type:

str | None

documentation_url

Link to relevant documentation.

Type:

str | None

class nirs4all.data.schema.validation.ErrorRegistry

Bases: object

Registry of all error codes.

E100 = ErrorCode(code='E100', category=<ErrorCategory.SCHEMA: 'schema'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Invalid configuration structure: {details}', suggestion_template='Check that the configuration is a valid dictionary.', documentation_url=None)
E101 = ErrorCode(code='E101', category=<ErrorCategory.SCHEMA: 'schema'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Missing required field: {field}', suggestion_template="Add the '{field}' field to your configuration.", documentation_url=None)
E102 = ErrorCode(code='E102', category=<ErrorCategory.SCHEMA: 'schema'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Invalid type for '{field}': expected {expected}, got {actual}", suggestion_template="Change '{field}' to be of type {expected}.", documentation_url=None)
E103 = ErrorCode(code='E103', category=<ErrorCategory.SCHEMA: 'schema'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Invalid value for '{field}': {value}. Valid values: {valid_values}", suggestion_template='Use one of the valid values: {valid_values}', documentation_url=None)
E104 = ErrorCode(code='E104', category=<ErrorCategory.SCHEMA: 'schema'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='No data source specified', suggestion_template="Add 'train_x', 'test_x', 'folder', 'sources', or 'variations' to your configuration.", documentation_url=None)
E200 = ErrorCode(code='E200', category=<ErrorCategory.FILE: 'file'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='File not found: {path}', suggestion_template='Check that the file path is correct and the file exists.', documentation_url=None)
E201 = ErrorCode(code='E201', category=<ErrorCategory.FILE: 'file'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Cannot read file: {path}. Error: {error}', suggestion_template='Check file permissions and encoding.', documentation_url=None)
E202 = ErrorCode(code='E202', category=<ErrorCategory.FILE: 'file'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Unsupported file format: {format}', suggestion_template='Supported formats: CSV, NPY, NPZ, Parquet, Excel, MATLAB', documentation_url=None)
E203 = ErrorCode(code='E203', category=<ErrorCategory.FILE: 'file'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Empty file: {path}', suggestion_template='Ensure the file contains data.', documentation_url=None)
E204 = ErrorCode(code='E204', category=<ErrorCategory.FILE: 'file'>, severity=<ErrorSeverity.WARNING: 'warning'>, message_template='File encoding issue: {path}. Using fallback encoding: {encoding}', suggestion_template='Specify the encoding explicitly in loading parameters.', documentation_url=None)
E300 = ErrorCode(code='E300', category=<ErrorCategory.DATA: 'data'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Data shape mismatch: {details}', suggestion_template='Ensure all data arrays have consistent sample counts.', documentation_url=None)
E301 = ErrorCode(code='E301', category=<ErrorCategory.DATA: 'data'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="NA values found in data and na_policy='abort': {details}", suggestion_template="Set na_policy='remove' or clean your data before loading.", documentation_url=None)
E302 = ErrorCode(code='E302', category=<ErrorCategory.DATA: 'data'>, severity=<ErrorSeverity.WARNING: 'warning'>, message_template='NA values removed: {count} rows affected', suggestion_template='Review your data for missing values.', documentation_url=None)
E303 = ErrorCode(code='E303', category=<ErrorCategory.DATA: 'data'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Column not found: '{column}' in {file}", suggestion_template='Available columns: {available}', documentation_url=None)
E304 = ErrorCode(code='E304', category=<ErrorCategory.DATA: 'data'>, severity=<ErrorSeverity.WARNING: 'warning'>, message_template='Non-numeric values in feature data at column(s): {columns}', suggestion_template='Features should be numeric. Non-numeric values will be converted to NaN.', documentation_url=None)
E400 = ErrorCode(code='E400', category=<ErrorCategory.LOADING: 'loading'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Failed to parse CSV: {error}', suggestion_template='Check delimiter and encoding settings.', documentation_url=None)
E401 = ErrorCode(code='E401', category=<ErrorCategory.LOADING: 'loading'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Invalid JSON configuration at line {line}: {error}', suggestion_template='Check JSON syntax around line {line}.', documentation_url=None)
E402 = ErrorCode(code='E402', category=<ErrorCategory.LOADING: 'loading'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Invalid YAML configuration at line {line}: {error}', suggestion_template='Check YAML indentation and syntax around line {line}.', documentation_url=None)
E403 = ErrorCode(code='E403', category=<ErrorCategory.LOADING: 'loading'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Archive error: {error}', suggestion_template='Ensure the archive is not corrupted and contains the expected files.', documentation_url=None)
E500 = ErrorCode(code='E500', category=<ErrorCategory.PARTITION: 'partition'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Invalid partition specification: {details}', suggestion_template="Use 'train', 'test', column-based, or percentage-based partition.", documentation_url=None)
E501 = ErrorCode(code='E501', category=<ErrorCategory.PARTITION: 'partition'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Partition column not found: '{column}'", suggestion_template='Available columns: {available}', documentation_url=None)
E502 = ErrorCode(code='E502', category=<ErrorCategory.PARTITION: 'partition'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Partition indices out of range: max index {max_index}, data has {n_samples} samples', suggestion_template='Ensure partition indices are within valid range.', documentation_url=None)
E503 = ErrorCode(code='E503', category=<ErrorCategory.PARTITION: 'partition'>, severity=<ErrorSeverity.WARNING: 'warning'>, message_template='Overlapping partition indices detected', suggestion_template='Train and test indices should not overlap.', documentation_url=None)
E600 = ErrorCode(code='E600', category=<ErrorCategory.AGGREGATION: 'aggregation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Aggregation column not found: '{column}'", suggestion_template='Available columns in metadata: {available}', documentation_url=None)
E601 = ErrorCode(code='E601', category=<ErrorCategory.AGGREGATION: 'aggregation'>, severity=<ErrorSeverity.WARNING: 'warning'>, message_template="Group '{group}' has only {count} sample(s), below minimum {min_samples}", suggestion_template='Consider lowering aggregate_min_samples or reviewing your data.', documentation_url=None)
E602 = ErrorCode(code='E602', category=<ErrorCategory.AGGREGATION: 'aggregation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Invalid aggregation method: '{method}'", suggestion_template='Valid methods: mean, median, vote, min, max, sum, std, first, last', documentation_url=None)
E700 = ErrorCode(code='E700', category=<ErrorCategory.VARIATION: 'variation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Duplicate source name: '{name}'", suggestion_template='Each source must have a unique name.', documentation_url=None)
E701 = ErrorCode(code='E701', category=<ErrorCategory.VARIATION: 'variation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="Duplicate variation name: '{name}'", suggestion_template='Each variation must have a unique name.', documentation_url=None)
E702 = ErrorCode(code='E702', category=<ErrorCategory.VARIATION: 'variation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Unknown variation(s) in variation_select: {names}', suggestion_template='Available variations: {available}', documentation_url=None)
E703 = ErrorCode(code='E703', category=<ErrorCategory.VARIATION: 'variation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template="variation_mode='select' requires 'variation_select' to be specified", suggestion_template='Add \'variation_select: ["var1", "var2"]\' to your configuration.', documentation_url=None)
E704 = ErrorCode(code='E704', category=<ErrorCategory.VARIATION: 'variation'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Sample count mismatch across sources: {details}', suggestion_template='All sources must have the same number of samples.', documentation_url=None)
E800 = ErrorCode(code='E800', category=<ErrorCategory.FOLD: 'fold'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Invalid fold file format: {error}', suggestion_template='Fold files should be CSV with fold columns or JSON/YAML with fold definitions.', documentation_url=None)
E801 = ErrorCode(code='E801', category=<ErrorCategory.FOLD: 'fold'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Fold sample IDs do not match dataset: {details}', suggestion_template='Ensure fold file was generated for this dataset.', documentation_url=None)
E802 = ErrorCode(code='E802', category=<ErrorCategory.FOLD: 'fold'>, severity=<ErrorSeverity.WARNING: 'warning'>, message_template='Fold file has {fold_samples} samples, dataset has {data_samples} samples', suggestion_template='Folds will be adjusted to match current dataset size.', documentation_url=None)
E900 = ErrorCode(code='E900', category=<ErrorCategory.RUNTIME: 'runtime'>, severity=<ErrorSeverity.ERROR: 'error'>, message_template='Unexpected error during loading: {error}', suggestion_template='Please report this issue with the full error traceback.', documentation_url=None)

classmethod all_codes() -> Dict[str, ErrorCode]

Get all error codes.

classmethod get(code: str) -> ErrorCode | None

Get error code by code string.
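
A registry that stores codes as class attributes and exposes all_codes and get can be sketched as follows. `SketchErrorCode` and `SketchRegistry` are hypothetical stand-ins, and collecting the codes by scanning class attributes is an assumption; only the `get` return type (a code or None) comes from the signature above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SketchErrorCode:
    """Hypothetical, reduced stand-in for ErrorCode."""
    code: str
    message_template: str


class SketchRegistry:
    """Hypothetical stand-in holding two entries copied from the listing above."""
    E200 = SketchErrorCode("E200", "File not found: {path}")
    E203 = SketchErrorCode("E203", "Empty file: {path}")

    @classmethod
    def all_codes(cls) -> Dict[str, SketchErrorCode]:
        # Assumption: every class attribute holding an error code is a registry entry.
        return {
            value.code: value
            for value in vars(cls).values()
            if isinstance(value, SketchErrorCode)
        }

    @classmethod
    def get(cls, code: str) -> Optional[SketchErrorCode]:
        # Unknown codes yield None instead of raising.
        return cls.all_codes().get(code)


found = SketchRegistry.get("E200")
missing = SketchRegistry.get("E999")
print(found.message_template.format(path="data/x.csv"))  # File not found: data/x.csv
print(missing)  # None
```

Since get may return None, callers should check the result before reading its fields.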

class nirs4all.data.schema.validation.ErrorSeverity(value)

Bases: str, Enum

Severity levels for errors.

ERROR = 'error'
INFO = 'info'
WARNING = 'warning'
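
Because the enum derives from both str and Enum, its members behave like plain strings when compared or serialized. The snippet below re-declares the enum locally with the documented values so that it runs without nirs4all installed; it illustrates standard Python behavior for such enums, not anything specific to the library.

```python
import json
from enum import Enum


class ErrorSeverity(str, Enum):
    """Local re-declaration with the documented member values."""
    ERROR = "error"
    INFO = "info"
    WARNING = "warning"


# Look up a member from its string value, e.g. when reading a saved report.
level = ErrorSeverity("warning")

# Members compare equal to their plain string values.
matches_plain_string = ErrorSeverity.ERROR == "error"

# Members serialize as ordinary JSON strings.
payload = json.dumps({"severity": ErrorSeverity.WARNING})

print(level is ErrorSeverity.WARNING, matches_plain_string, payload)
```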

class nirs4all.data.schema.validation.ValidationError(code: str, message: str, field: str | None = None, value: Any = None, suggestion: str | None = None)

Bases: object

Represents a validation error.

code

Error code for programmatic handling.

Type:

str

message

Human-readable error message.

Type:

str

field

The configuration field that caused the error.

Type:

str | None

value

The value that caused the error.

Type:

Any

suggestion

Optional suggestion for fixing the error.

Type:

str | None

class nirs4all.data.schema.validation.ValidationResult(is_valid: bool, errors: List[ValidationError] = <factory>, warnings: List[ValidationWarning] = <factory>, normalized_config: Dict[str, Any] | None = None)

Bases: object

Result of configuration validation.

is_valid

Whether the configuration is valid (no errors).

Type:

bool

errors

List of validation errors.

Type:

List[nirs4all.data.schema.validation.validators.ValidationError]

warnings

List of validation warnings.

Type:

List[nirs4all.data.schema.validation.validators.ValidationWarning]

normalized_config

The validated and normalized configuration.

Type:

Dict[str, Any] | None

raise_if_invalid() -> None

Raise ValueError if configuration is invalid.
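
A fail-fast pattern built on raise_if_invalid might look like the sketch below. `SketchError` and `SketchResult` are hypothetical stand-ins; the `ValueError` comes from the description above, while the wording of the exception text and the `EXAMPLE` code string are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchError:
    """Hypothetical, reduced stand-in for ValidationError."""
    code: str
    message: str


@dataclass
class SketchResult:
    """Hypothetical, reduced stand-in for ValidationResult."""
    is_valid: bool
    errors: List[SketchError] = field(default_factory=list)

    def raise_if_invalid(self) -> None:
        # Documented behavior: raise ValueError when the configuration is invalid.
        # Assumption: the exception text joins the individual error messages.
        if not self.is_valid:
            details = "; ".join(f"[{e.code}] {e.message}" for e in self.errors)
            raise ValueError(f"Invalid configuration: {details}")


# "EXAMPLE" is a placeholder code, not one defined by the library.
result = SketchResult(
    is_valid=False,
    errors=[SketchError("EXAMPLE", "No data source specified")],
)

caught: Optional[str] = None
try:
    result.raise_if_invalid()
except ValueError as exc:
    caught = str(exc)

print(caught)  # Invalid configuration: [EXAMPLE] No data source specified
```

Calling the method on a valid result returns None, so it can be placed directly before the code that consumes the configuration.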

class nirs4all.data.schema.validation.ValidationWarning(code: str, message: str, field: str | None = None)

Bases: object

Represents a validation warning (non-fatal issue).

code

Warning code for programmatic handling.

Type:

str

message

Human-readable warning message.

Type:

str

field

The configuration field that caused the warning.

Type:

str | None

nirs4all.data.schema.validation.validate_config(config: Dict[str, Any], check_file_existence: bool = False) -> ValidationResult

Convenience function to validate a configuration.

Parameters:
  • config – Configuration dictionary to validate.

  • check_file_existence – Whether to check if referenced files exist.

Returns:

ValidationResult with errors, warnings, and normalized config.