nirs4all.data.parsers package
Module contents
Parsers module for dataset configuration.
This module provides parsers for converting various input formats to the normalized DatasetConfigSchema.
Parsers:
- LegacyParser: Handles the current train_x/test_x format
- FilesParser: Handles the new files syntax
- SourcesParser: Handles multi-source datasets (Phase 6)
- VariationsParser: Handles feature variations / preprocessed data (Phase 7)
- FolderParser: Handles folder auto-scanning
The ConfigNormalizer combines all parsers and produces a canonical representation.
- class nirs4all.data.parsers.BaseParser
Bases: ABC
Abstract base class for configuration parsers.
Subclasses must implement:
- can_parse(): Check if this parser can handle the input
- parse(): Parse the input and return a ParserResult
- abstractmethod can_parse(input_data: Any) -> bool
Check if this parser can handle the given input.
- Parameters:
input_data – The input to check.
- Returns:
True if this parser can handle the input, False otherwise.
- abstractmethod parse(input_data: Any) -> ParserResult
Parse the input and return a configuration.
- Parameters:
input_data – The input to parse.
- Returns:
ParserResult with parsed configuration or errors.
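The two-method contract above can be illustrated with a small subclass. This is a minimal sketch, not nirs4all code: `BaseParser` and `ParserResult` are redefined locally as trimmed stand-ins so the block runs on its own, and `SingleFileParser` with its `"single_file"` label is a hypothetical parser invented for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ParserResult:
    """Trimmed local stand-in for the documented ParserResult."""
    success: bool
    config: Optional[Dict[str, Any]] = None
    errors: List[str] = field(default_factory=list)
    source_type: Optional[str] = None


class BaseParser(ABC):
    """Local stand-in for the documented abstract interface."""

    @abstractmethod
    def can_parse(self, input_data: Any) -> bool:
        """Return True if this parser can handle the input."""

    @abstractmethod
    def parse(self, input_data: Any) -> ParserResult:
        """Return a ParserResult holding a configuration or errors."""


class SingleFileParser(BaseParser):
    """Hypothetical parser that accepts a bare '.csv' path string."""

    def can_parse(self, input_data: Any) -> bool:
        return isinstance(input_data, str) and input_data.endswith(".csv")

    def parse(self, input_data: Any) -> ParserResult:
        # Report a failure through the result object instead of raising.
        if not self.can_parse(input_data):
            return ParserResult(success=False, errors=["expected a .csv path string"])
        return ParserResult(
            success=True,
            config={"train_x": input_data},
            source_type="single_file",  # hypothetical label
        )


parser = SingleFileParser()
ok = parser.parse("data/X.csv")
bad = parser.parse({"train_x": "data/X.csv"})
```

Returning errors inside the result, rather than raising, lets a caller try several parsers and collect every message before deciding how to report a failure.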
- class nirs4all.data.parsers.ConfigNormalizer(parsers: List[BaseParser] | None = None)
Bases: object
Normalizes dataset configurations from various input formats.
This class combines multiple parsers to handle:
- Folder paths (auto-scanning)
- JSON/YAML config files
- Dictionary configurations (legacy format)
- Sources configurations (multi-source format)
- Variations configurations (preprocessed data / feature variations)
- In-memory numpy arrays
All inputs are normalized to a canonical dictionary format that can be validated and processed by the loader.
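One plausible way to combine several parsers is to try them in order and use the first one whose can_parse() accepts the input. The sketch below assumes that strategy; it is not taken from the nirs4all source. The ordering, the legacy predicate, and the names `PARSERS`, `has_nonempty_list`, and `pick_parser` are hypothetical. Only the "key with non-empty list" checks follow the can_parse() descriptions in this reference.

```python
from typing import Any, Callable, List, Optional, Tuple


def has_nonempty_list(data: Any, key: str) -> bool:
    """True if data is a dict whose `key` maps to a non-empty list."""
    return isinstance(data, dict) and isinstance(data.get(key), list) and len(data[key]) > 0


# (label, predicate) pairs standing in for real parser objects.
# The order is an assumption: more specific formats are checked first.
PARSERS: List[Tuple[str, Callable[[Any], bool]]] = [
    ("sources", lambda d: has_nonempty_list(d, "sources")),
    ("variations", lambda d: has_nonempty_list(d, "variations")),
    ("legacy", lambda d: isinstance(d, dict) and "train_x" in d),
]


def pick_parser(input_data: Any) -> Optional[str]:
    """Return the label of the first predicate that accepts the input."""
    for label, can_parse in PARSERS:
        if can_parse(input_data):
            return label
    return None


choice = pick_parser({"sources": [{"name": "NIR", "train_x": "NIR_train.csv"}]})
```

With first-match dispatch, ordering matters whenever an input could satisfy more than one predicate, which is why the more specific formats come first in this sketch.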
Example
```python
from nirs4all.data.parsers import ConfigNormalizer

normalizer = ConfigNormalizer()

# From folder path
config, name = normalizer.normalize("/path/to/data/")

# From config file
config, name = normalizer.normalize("config.yaml")

# From dictionary
config, name = normalizer.normalize({"train_x": "data/X.csv"})

# From sources format
config, name = normalizer.normalize({
    "sources": [
        {"name": "NIR", "train_x": "NIR_train.csv"},
        {"name": "MIR", "train_x": "MIR_train.csv"},
    ]
})

# From variations format
config, name = normalizer.normalize({
    "variations": [
        {"name": "raw", "train_x": "X_raw.csv"},
        {"name": "snv", "train_x": "X_snv.csv"},
    ],
    "variation_mode": "separate",
})
```
- class nirs4all.data.parsers.FilesParser
Bases: BaseParser
Parser for new 'files' syntax configuration.
The files syntax provides:
- Flexible column selection (by index, name, regex, range)
- Row selection and filtering
- Partition assignment per file or via partition config
- Key-based sample linking across files
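The four column-selection styles listed above (index, name, regex, range) can be pictured with a small helper. This is an illustrative sketch only: `select_columns` and the way selectors are typed here are assumptions, not the syntax FilesParser actually accepts.

```python
import re
from typing import List, Union

Selector = Union[int, str, range, "re.Pattern[str]"]


def select_columns(header: List[str], selector: Selector) -> List[str]:
    """Resolve one hypothetical selector against a list of column names."""
    if isinstance(selector, int):
        return [header[selector]]              # by position
    if isinstance(selector, range):
        return [header[i] for i in selector]   # by positional range
    if isinstance(selector, re.Pattern):
        return [name for name in header if selector.search(name)]  # by regex
    if isinstance(selector, str):
        if selector not in header:
            raise KeyError(selector)
        return [selector]                      # by exact name
    raise TypeError(f"unsupported selector: {selector!r}")


header = ["sample_id", "1000", "1002", "1004", "protein"]
by_index = select_columns(header, 0)
by_range = select_columns(header, range(1, 4))
by_regex = select_columns(header, re.compile(r"^\d+$"))
by_name = select_columns(header, "protein")
```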
- class nirs4all.data.parsers.FolderParser
Bases: BaseParser
Parser for folder-based dataset configuration.
This parser scans a folder for data files matching standard naming conventions and creates a configuration dictionary.
Supported file formats:
- CSV files (.csv)
- Compressed CSV files (.csv.gz, .csv.zip)
Multi-source detection:
- If multiple files match the same pattern (e.g., Xcal_NIR.csv, Xcal_MIR.csv), they are treated as multi-source data.
- SUPPORTED_EXTENSIONS = {'.csv', '.csv.gz', '.csv.zip', '.gz', '.zip'}
- can_parse(input_data: Any) -> bool
Check if input is a folder path.
- Parameters:
input_data – The input to check.
- Returns:
True if input is a string path to an existing directory.
- parse(input_data: Any) -> ParserResult
Parse a folder path into a configuration.
- Parameters:
input_data – Folder path (str, Path) or dict with ‘folder’ key.
- Returns:
ParserResult with configuration from scanned files.
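The multi-source detection described for FolderParser can be sketched as grouping file names by a shared pattern. The extension set is copied from SUPPORTED_EXTENSIONS above. Everything else is an assumption made for illustration: the helper names are hypothetical, and treating the text before the first underscore as the pattern is a guess at the naming convention rather than the library's actual rule.

```python
from collections import defaultdict
from typing import Dict, List, Optional

SUPPORTED_EXTENSIONS = {".csv", ".csv.gz", ".csv.zip", ".gz", ".zip"}


def strip_supported_extension(filename: str) -> Optional[str]:
    """Return the name without its extension, or None if unsupported."""
    # Try the longest extensions first so '.csv.gz' wins over '.gz'.
    for ext in sorted(SUPPORTED_EXTENSIONS, key=len, reverse=True):
        if filename.lower().endswith(ext):
            return filename[: -len(ext)]
    return None


def group_by_pattern(filenames: List[str]) -> Dict[str, List[str]]:
    """Group supported files by the text before the first underscore."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for filename in sorted(filenames):
        stem = strip_supported_extension(filename)
        if stem is None:
            continue  # skip files with unsupported extensions
        pattern = stem.split("_", 1)[0]  # hypothetical convention
        groups[pattern].append(filename)
    return dict(groups)


groups = group_by_pattern(["Xcal_NIR.csv", "Xcal_MIR.csv", "Ycal.csv.gz", "notes.txt"])
# Patterns matched by more than one file are the multi-source candidates.
multi_source = {p: files for p, files in groups.items() if len(files) > 1}
```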
- class nirs4all.data.parsers.LegacyParser
Bases: BaseParser
Parser for legacy train_x/test_x configuration format.
This parser handles dictionary configurations using the established key format: train_x, train_y, test_x, test_y, train_group, test_group.
It also handles flexible key naming (X_train, Xtrain, etc.) by normalizing to the standard format.
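A key normalizer of the kind described above might look like the following sketch. The standard key names come from this reference; the matching rule (lowercase, drop non-letters, accept either word order) and the names `normalize_key` and `normalize_config` are assumptions, so the real parser may accept a different set of aliases.

```python
import re
from typing import Any, Dict

STANDARD_KEYS = {"train_x", "train_y", "test_x", "test_y", "train_group", "test_group"}


def normalize_key(key: str) -> str:
    """Map aliases such as 'X_train' or 'Xtrain' to 'train_x'."""
    if key in STANDARD_KEYS:
        return key
    compact = re.sub(r"[^a-z]", "", key.lower())  # 'X_train' -> 'xtrain'
    for partition in ("train", "test"):
        for role in ("x", "y", "group"):
            if compact in (partition + role, role + partition):
                return f"{partition}_{role}"
    return key  # unrecognized keys pass through unchanged


def normalize_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Apply normalize_key to every key of a configuration dictionary."""
    return {normalize_key(k): v for k, v in config.items()}


normalized = normalize_config({"X_train": "Xcal.csv", "Ytest": "Yval.csv", "delimiter": ";"})
```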
- class nirs4all.data.parsers.ParserResult(success: bool, config: Dict[str, Any] | None = None, dataset_name: str | None = None, errors: List[str] = <factory>, warnings: List[str] = <factory>, source_type: str | None = None)
Bases: object
Result of parsing a configuration.
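The `<factory>` defaults in the signature suggest a dataclass whose list fields use `default_factory`. The stand-in below mirrors the documented fields under that assumption; it is a local copy written for illustration, not the nirs4all class, and the `"legacy"` value for `source_type` is a hypothetical label.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ParserResult:
    """Local stand-in mirroring the documented ParserResult signature."""
    success: bool
    config: Optional[Dict[str, Any]] = None
    dataset_name: Optional[str] = None
    errors: List[str] = field(default_factory=list)    # fresh list per instance
    warnings: List[str] = field(default_factory=list)  # fresh list per instance
    source_type: Optional[str] = None


failed = ParserResult(success=False)
failed.errors.append("missing 'train_x'")

passed = ParserResult(
    success=True,
    config={"train_x": "data/X.csv"},
    source_type="legacy",  # hypothetical label
)
```

Because each instance gets its own list, appending an error to one result does not leak into another.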
- class nirs4all.data.parsers.SourcesParser
Bases: BaseParser
Parser for multi-source 'sources' syntax configuration.
The sources syntax provides:
- Named feature sources (e.g., NIR, MIR spectrometers)
- Per-source loading parameters
- Automatic source alignment by sample key
- Shared targets and metadata across sources
- Example configuration:

    sources:
      - name: "NIR"
        files:
          - path: data/NIR_train.csv
            partition: train
          - path: data/NIR_test.csv
            partition: test
        params:
          header_unit: nm
          signal_type: absorbance
      - name: "MIR"
        train_x: data/MIR_train.csv
        test_x: data/MIR_test.csv
        params:
          header_unit: cm-1
    targets:
      path: data/targets.csv
      link_by: sample_id
    metadata:
      path: data/metadata.csv
      link_by: sample_id
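Alignment by sample key, as listed among the features of the sources syntax, can be pictured as keeping only the samples present in every source and ordering all sources the same way. The sketch below assumes an inner join on the key, which this reference does not confirm; `align_by_key` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Rows = Dict[str, List[float]]  # sample key -> feature row


def align_by_key(*sources: Rows) -> Tuple[List[str], List[List[List[float]]]]:
    """Keep samples present in every source, in one shared key order."""
    shared = set(sources[0])
    for source in sources[1:]:
        shared &= set(source)
    keys = sorted(shared)
    aligned = [[source[key] for key in keys] for source in sources]
    return keys, aligned


nir = {"s1": [0.1, 0.2], "s2": [0.3, 0.4], "s3": [0.5, 0.6]}
mir = {"s2": [1.0], "s3": [2.0], "s4": [3.0]}
keys, (nir_rows, mir_rows) = align_by_key(nir, mir)
```

After alignment, row i of every source refers to the same sample, so a shared targets table linked by the same key can be ordered identically.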
- can_parse(input_data: Any) -> bool
Check if this is a sources-format configuration.
- Parameters:
input_data – The input to check.
- Returns:
True if input has ‘sources’ key with non-empty list.
- parse(input_data: Dict[str, Any]) -> ParserResult
Parse a sources-format configuration.
Converts the sources syntax to a DatasetConfigSchema that can be further converted to legacy format for backward compatibility.
- Parameters:
input_data – Dictionary configuration to parse.
- Returns:
ParserResult with parsed configuration.
- class nirs4all.data.parsers.VariationsParser
Bases: BaseParser
Parser for feature variations 'variations' syntax configuration.
The variations syntax provides:
- Named feature variations (e.g., raw, snv, derivative)
- Per-variation loading parameters
- Preprocessing provenance tracking
- Multiple variation modes (separate, concat, select, compare)
- Example configuration:

    variations:
      - name: "raw"
        files:
          - path: data/spectra_raw.csv
            partition: train
          - path: data/spectra_raw_test.csv
            partition: test
      - name: "snv"
        description: "SNV preprocessed spectra"
        preprocessing_applied:
          - type: "SNV"
            software: "OPUS 8.0"
        train_x: data/spectra_snv_train.csv
        test_x: data/spectra_snv_test.csv
    variation_mode: separate  # or concat, select, compare
    variation_select: ["raw", "snv"]  # only for mode=select
    targets:
      path: data/targets.csv
      link_by: sample_id
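The constraints visible in the example (four mode names, and variation_select applying only to the select mode) can be checked before parsing. The following is a hedged sketch of such a check, not the validation VariationsParser performs: `check_variations`, its error messages, and the `"separate"` default for a missing variation_mode are assumptions.

```python
from typing import Any, Dict, List

VARIATION_MODES = {"separate", "concat", "select", "compare"}


def check_variations(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a variations-style config."""
    errors: List[str] = []
    variations = config.get("variations")
    if not isinstance(variations, list) or not variations:
        return ["'variations' must be a non-empty list"]

    names = [v.get("name") for v in variations if isinstance(v, dict)]
    mode = config.get("variation_mode", "separate")  # assumed default
    if mode not in VARIATION_MODES:
        errors.append(f"unknown variation_mode: {mode!r}")

    if mode == "select":
        selected = config.get("variation_select")
        if not selected:
            errors.append("variation_select is required when variation_mode is 'select'")
        else:
            unknown = [name for name in selected if name not in names]
            if unknown:
                errors.append(f"variation_select names not defined: {unknown}")
    return errors


problems = check_variations({
    "variations": [{"name": "raw"}, {"name": "snv"}],
    "variation_mode": "select",
    "variation_select": ["raw", "snv"],
})
```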
- can_parse(input_data: Any) -> bool
Check if this is a variations-format configuration.
- Parameters:
input_data – The input to check.
- Returns:
True if input has ‘variations’ key with non-empty list.
- parse(input_data: Dict[str, Any]) -> ParserResult
Parse a variations-format configuration.
Converts the variations syntax to a DatasetConfigSchema that can be further converted to legacy format for backward compatibility.
- Parameters:
input_data – Dictionary configuration to parse.
- Returns:
ParserResult with parsed configuration.