nirs4all.data.parsers package

Submodules

nirs4all.data.parsers.base module
- BaseParser
  - BaseParser.can_parse()
  - BaseParser.parse()
- ParserResult
nirs4all.data.parsers.files_parser module
nirs4all.data.parsers.folder_parser module
- FolderParser
nirs4all.data.parsers.legacy_parser module
- LegacyParser
  - LegacyParser.can_parse()
  - LegacyParser.parse()
- normalize_config_keys()
nirs4all.data.parsers.normalizer module
- ConfigNormalizer
  - ConfigNormalizer.normalize()
- normalize_config()

Module contents

Parsers module for dataset configuration.

This module provides parsers for converting various input formats to the normalized DatasetConfigSchema.

Parsers:

- LegacyParser: Handles the current train_x/test_x format
- FilesParser: Handles the new files syntax
- SourcesParser: Handles multi-source datasets (Phase 6)
- VariationsParser: Handles feature variations / preprocessed data (Phase 7)
- FolderParser: Handles folder auto-scanning

The ConfigNormalizer combines all parsers and produces a canonical representation.

class nirs4all.data.parsers.BaseParser[source]

Bases: ABC

Abstract base class for configuration parsers.

Subclasses must implement:

- can_parse(): Check if this parser can handle the input
- parse(): Parse the input and return a ParserResult

abstractmethod can_parse(input_data: Any) → bool[source]

Check if this parser can handle the given input.

Parameters:: input_data – The input to check.
Returns:: True if this parser can handle the input, False otherwise.

abstractmethod parse(input_data: Any) → ParserResult[source]

Parse the input and return a configuration.

Parameters:: input_data – The input to parse.
Returns:: ParserResult with parsed configuration or errors.
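As an illustration of this contract, here is a minimal sketch of a subclass. To keep the snippet self-contained, `BaseParser` and `ParserResult` are redefined as simplified stand-ins that mirror the documented signatures; in real code they would be imported from `nirs4all.data.parsers`. `InlineArrayParser`, its `"X"` key and the `"inline"` dataset name are hypothetical and not part of the library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ParserResult:
    """Simplified stand-in mirroring the documented ParserResult fields."""
    success: bool
    config: Optional[Dict[str, Any]] = None
    dataset_name: Optional[str] = None
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    source_type: Optional[str] = None


class BaseParser(ABC):
    """Simplified stand-in mirroring the documented abstract interface."""

    @abstractmethod
    def can_parse(self, input_data: Any) -> bool: ...

    @abstractmethod
    def parse(self, input_data: Any) -> ParserResult: ...


class InlineArrayParser(BaseParser):
    """Hypothetical parser: accepts dicts that carry an "X" key."""

    def can_parse(self, input_data: Any) -> bool:
        return isinstance(input_data, dict) and "X" in input_data

    def parse(self, input_data: Any) -> ParserResult:
        if not self.can_parse(input_data):
            # Report the problem through the result instead of raising.
            return ParserResult(success=False, errors=["expected a dict with an 'X' key"])
        # Map the hypothetical "X" key onto the documented train_x key.
        return ParserResult(
            success=True,
            config={"train_x": input_data["X"]},
            dataset_name="inline",
            source_type="dict",
        )


result = InlineArrayParser().parse({"X": [[0.1, 0.2], [0.3, 0.4]]})
print(result.success, result.config)
```

Calling `can_parse()` before `parse()` lets a caller try several parsers in turn without relying on exceptions for control flow.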

class nirs4all.data.parsers.ConfigNormalizer(parsers: List[BaseParser] | None = None)[source]

Bases: object

Normalizes dataset configurations from various input formats.

This class combines multiple parsers to handle:

- Folder paths (auto-scanning)
- JSON/YAML config files
- Dictionary configurations (legacy format)
- Sources configurations (multi-source format)
- Variations configurations (preprocessed data / feature variations)
- In-memory numpy arrays

All inputs are normalized to a canonical dictionary format that can be validated and processed by the loader.

Example

```python
from nirs4all.data.parsers import ConfigNormalizer

normalizer = ConfigNormalizer()

# From folder path
config, name = normalizer.normalize("/path/to/data/")

# From config file
config, name = normalizer.normalize("config.yaml")

# From dictionary
config, name = normalizer.normalize({"train_x": "data/X.csv"})

# From sources format
config, name = normalizer.normalize({
    "sources": [
        {"name": "NIR", "train_x": "NIR_train.csv"},
        {"name": "MIR", "train_x": "MIR_train.csv"},
    ]
})

# From variations format
config, name = normalizer.normalize({
    "variations": [
        {"name": "raw", "train_x": "X_raw.csv"},
        {"name": "snv", "train_x": "X_snv.csv"},
    ],
    "variation_mode": "separate",
})
```

normalize(input_data: Any) → Tuple[Dict[str, Any] | None, str][source]

Normalize a configuration to canonical format.

Parameters:: input_data – Configuration in any supported format.
Returns:: Tuple of (normalized_config, dataset_name). Returns (None, ‘Unknown_dataset’) if parsing fails.
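The following sketch shows one way a normalizer could combine several parsers. It assumes first-match dispatch (the first parser whose check accepts the input is used), which this page does not confirm; only the `(None, 'Unknown_dataset')` fallback is taken from the description above. Parsers are modeled as plain function pairs, and the names `normalize_sketch`, `is_sources`, `is_legacy`, as well as the returned dataset names, are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Each hypothetical parser is modeled as a (can_parse, parse) pair of functions.
Parser = Tuple[Callable[[Any], bool], Callable[[Any], Tuple[Dict[str, Any], str]]]


def normalize_sketch(
    input_data: Any, parsers: List[Parser]
) -> Tuple[Optional[Dict[str, Any]], str]:
    """Try parsers in order; the first one that accepts the input wins (an assumption)."""
    for can_parse, parse in parsers:
        if can_parse(input_data):
            return parse(input_data)
    # Documented fallback when no parser could handle the input.
    return None, "Unknown_dataset"


def is_sources(data: Any) -> bool:
    """Dict with a non-empty 'sources' list."""
    return (
        isinstance(data, dict)
        and isinstance(data.get("sources"), list)
        and len(data["sources"]) > 0
    )


def is_legacy(data: Any) -> bool:
    """Dict carrying the legacy train_x key."""
    return isinstance(data, dict) and "train_x" in data


# More specific formats are listed first so they are checked before the legacy one.
parsers: List[Parser] = [
    (is_sources, lambda d: ({"sources": d["sources"]}, "multi_source")),
    (is_legacy, lambda d: (dict(d), "legacy")),
]

print(normalize_sketch({"train_x": "data/X.csv"}, parsers))
print(normalize_sketch(12345, parsers))
```

Returning a sentinel tuple rather than raising keeps the call site uniform: callers only need to test whether the first element is `None`.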

class nirs4all.data.parsers.FilesParser[source]

Bases: BaseParser

Parser for new ‘files’ syntax configuration.

The files syntax provides:

- Flexible column selection (by index, name, regex, range)
- Row selection and filtering
- Partition assignment per file or via partition config
- Key-based sample linking across files

can_parse(input_data: Any) → bool[source]

Check if this is a files-format configuration.

Parameters:: input_data – The input to check.
Returns:: True if input has ‘files’ key with non-empty list.

parse(input_data: Dict[str, Any]) → ParserResult[source]

Parse a files-format configuration.

Parameters:: input_data – Dictionary configuration to parse.
Returns:: ParserResult with parsed configuration.

class nirs4all.data.parsers.FolderParser[source]

Bases: BaseParser

Parser for folder-based dataset configuration.

This parser scans a folder for data files matching standard naming conventions and creates a configuration dictionary.

Supported file formats:

- CSV files (.csv)
- Compressed CSV files (.csv.gz, .csv.zip)

Multi-source detection:

- If multiple files match the same pattern (e.g., Xcal_NIR.csv, Xcal_MIR.csv), they are treated as multi-source data.

SUPPORTED_EXTENSIONS = {'.csv', '.csv.gz', '.csv.zip', '.gz', '.zip'}

can_parse(input_data: Any) → bool[source]

Check if input is a folder path.

Parameters:: input_data – The input to check.
Returns:: True if input is a string path to an existing directory.

parse(input_data: Any) → ParserResult[source]

Parse a folder path into a configuration.

Parameters:: input_data – Folder path (str, Path) or dict with ‘folder’ key.
Returns:: ParserResult with configuration from scanned files.
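To make the multi-source idea concrete, here is a rough sketch of extension filtering and grouping. The `SUPPORTED_EXTENSIONS` set is copied from the attribute above; the grouping rule (text before the first underscore identifies the file's role) is an assumption inferred from the `Xcal_NIR.csv` / `Xcal_MIR.csv` example, and the helper names are hypothetical. The actual naming conventions used by FolderParser are not described on this page.

```python
from collections import defaultdict
from typing import Dict, List

SUPPORTED_EXTENSIONS = {".csv", ".csv.gz", ".csv.zip", ".gz", ".zip"}


def is_supported(filename: str) -> bool:
    """True if the filename (case-insensitive) ends with a supported extension."""
    return filename.lower().endswith(tuple(SUPPORTED_EXTENSIONS))


def group_by_prefix(filenames: List[str]) -> Dict[str, List[str]]:
    """Group supported files by the text before the first underscore or dot.

    Assumption: the role of a file (e.g. "Xcal") is encoded in that prefix.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(filenames):
        if is_supported(name):
            stem = name.split(".", 1)[0]    # drop every extension
            prefix = stem.split("_", 1)[0]  # "Xcal_NIR" -> "Xcal"
            groups[prefix].append(name)
    return dict(groups)


files = ["Xcal_NIR.csv", "Xcal_MIR.csv", "Ycal.csv.gz", "notes.txt"]
groups = group_by_prefix(files)

# A prefix matched by more than one file is a multi-source candidate.
multi_source = {prefix: names for prefix, names in groups.items() if len(names) > 1}
print(multi_source)
```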

class nirs4all.data.parsers.LegacyParser[source]

Bases: BaseParser

Parser for legacy train_x/test_x configuration format.

This parser handles dictionary configurations using the established key format: train_x, train_y, test_x, test_y, train_group, test_group.

It also handles flexible key naming (X_train, Xtrain, etc.) by normalizing to the standard format.

can_parse(input_data: Any) → bool[source]

Check if this is a legacy format configuration.

Parameters:: input_data – The input to check.
Returns:: True if input is a dict with legacy keys or data arrays.

parse(input_data: Dict[str, Any]) → ParserResult[source]

Parse a legacy format configuration.

Parameters:: input_data – Dictionary configuration to parse.
Returns:: ParserResult with normalized configuration.
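The flexible key naming mentioned above can be pictured with a small sketch. The matching rule used here (lowercase the key, drop non-letters, then accept either ordering of partition and role) is an assumption chosen for illustration; it is not the library's `normalize_config_keys()` and may accept a different set of spellings. `canonical_key` and `normalize_keys_sketch` are hypothetical names.

```python
import re
from typing import Any, Dict

PARTITIONS = ("train", "test")
ROLES = ("x", "y", "group")


def canonical_key(key: str) -> str:
    """Map variants such as X_train or Xtrain to train_x; leave other keys untouched."""
    compact = re.sub(r"[^a-z]", "", key.lower())  # "X_train" -> "xtrain"
    for part in PARTITIONS:
        for role in ROLES:
            if compact in (part + role, role + part):
                return f"{part}_{role}"
    return key


def normalize_keys_sketch(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with recognized keys renamed to the standard form."""
    return {canonical_key(k): v for k, v in config.items()}


print(normalize_keys_sketch({"X_train": "X.csv", "Ytest": "y.csv", "delimiter": ";"}))
```

Keys that do not match any partition/role combination, such as `delimiter`, pass through unchanged.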

class nirs4all.data.parsers.ParserResult(success: bool, config: Dict[str, Any] | None = None, dataset_name: str | None = None, errors: List[str] = <factory>, warnings: List[str] = <factory>, source_type: str | None = None)[source]

Bases: object

Result of parsing a configuration.

success

Whether parsing was successful.

Type:: bool

config

The parsed configuration dictionary.

Type:: Dict[str, Any] | None

dataset_name

The extracted or inferred dataset name.

Type:: str | None

errors

List of error messages if parsing failed.

Type:: List[str]

warnings

List of warning messages (non-fatal issues).

Type:: List[str]

source_type

Type of source that was parsed (‘dict’, ‘file’, ‘folder’, ‘array’).

Type:: str | None

config: Dict[str, Any] | None = None

dataset_name: str | None = None

errors: List[str]

source_type: str | None = None

success: bool

warnings: List[str]
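The `<factory>` defaults in the signature suggest a dataclass whose list fields use `default_factory`. The stand-in below, named `ParserResultSketch` to make clear it is not the library class, mirrors the documented fields under that assumption and shows how a caller might inspect a result.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ParserResultSketch:
    """Stand-in with the documented fields; not the library class itself."""
    success: bool
    config: Optional[Dict[str, Any]] = None
    dataset_name: Optional[str] = None
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    source_type: Optional[str] = None


ok = ParserResultSketch(success=True, config={"train_x": "X.csv"}, source_type="dict")
failed = ParserResultSketch(success=False)
failed.errors.append("no parser accepted the input")

# default_factory gives each instance its own lists, so the error does not leak into `ok`.
for result in (ok, failed):
    if result.success:
        print("parsed:", result.config)
    else:
        print("failed:", "; ".join(result.errors))
```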

class nirs4all.data.parsers.SourcesParser[source]

Bases: BaseParser

Parser for multi-source ‘sources’ syntax configuration.

The sources syntax provides:

- Named feature sources (e.g., NIR, MIR spectrometers)
- Per-source loading parameters
- Automatic source alignment by sample key
- Shared targets and metadata across sources

Example configuration:

    sources:
      - name: "NIR"
        files:
          - path: data/NIR_train.csv
            partition: train
          - path: data/NIR_test.csv
            partition: test
        params:
          header_unit: nm
          signal_type: absorbance
      - name: "MIR"
        train_x: data/MIR_train.csv
        test_x: data/MIR_test.csv
        params:
          header_unit: cm-1

    targets:
      path: data/targets.csv
      link_by: sample_id

    metadata:
      path: data/metadata.csv
      link_by: sample_id

can_parse(input_data: Any) → bool[source]

Check if this is a sources-format configuration.

Parameters:: input_data – The input to check.
Returns:: True if input has ‘sources’ key with non-empty list.

parse(input_data: Dict[str, Any]) → ParserResult[source]

Parse a sources-format configuration.

Converts the sources syntax to a DatasetConfigSchema that can be further converted to legacy format for backward compatibility.

Parameters:: input_data – Dictionary configuration to parse.
Returns:: ParserResult with parsed configuration.

class nirs4all.data.parsers.VariationsParser[source]

Bases: BaseParser

Parser for feature variations ‘variations’ syntax configuration.

The variations syntax provides:

- Named feature variations (e.g., raw, snv, derivative)
- Per-variation loading parameters
- Preprocessing provenance tracking
- Multiple variation modes (separate, concat, select, compare)

Example configuration:

    variations:
      - name: "raw"
        files:
          - path: data/spectra_raw.csv
            partition: train
          - path: data/spectra_raw_test.csv
            partition: test
      - name: "snv"
        description: "SNV preprocessed spectra"
        preprocessing_applied:
          - type: "SNV"
            software: "OPUS 8.0"
        train_x: data/spectra_snv_train.csv
        test_x: data/spectra_snv_test.csv

    variation_mode: separate  # or concat, select, compare
    variation_select: ["raw", "snv"]  # only for mode=select

    targets:
      path: data/targets.csv
      link_by: sample_id

can_parse(input_data: Any) → bool[source]

Check if this is a variations-format configuration.

Parameters:: input_data – The input to check.
Returns:: True if input has ‘variations’ key with non-empty list.

parse(input_data: Dict[str, Any]) → ParserResult[source]

Parse a variations-format configuration.

Converts the variations syntax to a DatasetConfigSchema that can be further converted to legacy format for backward compatibility.

Parameters:: input_data – Dictionary configuration to parse.
Returns:: ParserResult with parsed configuration.
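A variations-format dictionary could be sanity-checked before parsing along the following lines. The four mode names and the rule that `variation_select` applies only to `select` mode come from the example configuration above; the `separate` default, the error wording and the `check_variations` helper are assumptions made for this sketch and do not describe the parser's actual validation.

```python
from typing import Any, Dict, List

VARIATION_MODES = {"separate", "concat", "select", "compare"}


def check_variations(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a variations-format dict (empty if none)."""
    variations = config.get("variations")
    if not isinstance(variations, list) or not variations:
        return ["'variations' must be a non-empty list"]

    problems: List[str] = []
    names = [v.get("name") for v in variations if isinstance(v, dict)]

    mode = config.get("variation_mode", "separate")  # default is an assumption
    if mode not in VARIATION_MODES:
        problems.append(f"unknown variation_mode: {mode!r}")

    # variation_select is only meaningful when mode is "select".
    if mode == "select":
        for wanted in config.get("variation_select", []):
            if wanted not in names:
                problems.append(f"variation_select refers to unknown variation: {wanted!r}")
    return problems


config = {
    "variations": [{"name": "raw"}, {"name": "snv"}],
    "variation_mode": "select",
    "variation_select": ["raw", "msc"],
}
print(check_variations(config))
```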

nirs4all.data.parsers.normalize_config(input_data: Any) → Tuple[Dict[str, Any] | None, str][source]

Convenience function to normalize a configuration.

Parameters:: input_data – Configuration in any supported format.
Returns:: Tuple of (normalized_config, dataset_name).