nirs4all.data.parsers.files_parser module
Files parser for dataset configuration.
This parser handles the new ‘files’ syntax defined in the specification. Implemented in Phase 4 to support partition assignment.
The files syntax allows specifying multiple files with column/row selection and partition assignment within a single configuration.
Example
    files:
      - path: data/measurements.csv
        partition: train
        columns:
          features: "2:-1"
          targets: -1
          metadata: [0, 1]

    # Or with complex partition:
    files:
      - path: data/all_data.csv
        partition:
          column: "split"
          train_values: ["train"]
          test_values: ["test"]
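The column-based partition in the second example can be read as a row filter. The following is a minimal sketch of that idea, not nirs4all's implementation: `split_by_column` and the sample rows are hypothetical, and the handling of rows that match neither value list (they are dropped here) is an assumption.

```python
from typing import Any, Dict, List


def split_by_column(
    rows: List[Dict[str, Any]],
    partition: Dict[str, Any],
) -> Dict[str, List[Dict[str, Any]]]:
    """Assign rows to 'train' or 'test' from a column-based partition config.

    Hypothetical helper: rows whose value appears in neither list are
    left out of both partitions (an assumption, not documented behavior).
    """
    column = partition["column"]
    train_values = set(partition.get("train_values", []))
    test_values = set(partition.get("test_values", []))

    result: Dict[str, List[Dict[str, Any]]] = {"train": [], "test": []}
    for row in rows:
        value = row[column]
        if value in train_values:
            result["train"].append(row)
        elif value in test_values:
            result["test"].append(row)
    return result


partition_cfg = {"column": "split", "train_values": ["train"], "test_values": ["test"]}
rows = [
    {"sample_id": "s1", "split": "train"},
    {"sample_id": "s2", "split": "test"},
    {"sample_id": "s3", "split": "train"},
    {"sample_id": "s4", "split": "holdout"},  # matches neither list, so it is skipped
]
parts = split_by_column(rows, partition_cfg)
```

Keeping the value lists as sets makes the membership test independent of how many labels each partition accepts.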
The sources syntax (Phase 6) allows specifying multiple feature sources for sensor fusion or multi-instrument datasets:
Example
    sources:
      - name: "NIR"
        files:
          - path: data/NIR_train.csv
            partition: train
          - path: data/NIR_test.csv
            partition: test
        params:
          header_unit: nm
          signal_type: absorbance
      - name: "MIR"
        train_x: data/MIR_train.csv
        test_x: data/MIR_test.csv
        params:
          header_unit: cm-1
          signal_type: absorbance
    targets:
      path: data/targets.csv
      link_by: sample_id
- class nirs4all.data.parsers.files_parser.FilesParser
Bases: BaseParser
Parser for the new ‘files’ syntax configuration.
The files syntax provides:
- Flexible column selection (by index, name, regex, range)
- Row selection and filtering
- Partition assignment per file or via a partition config
- Key-based sample linking across files
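To make the column selection forms concrete, here is a minimal sketch of how a selector such as `"2:-1"`, `-1`, or `[0, 1]` could be resolved to column indices. It is not the library's resolver: `resolve_columns` is a hypothetical name, and the precedence used here (slice string, then exact name, then regular expression) is an assumption.

```python
import re
from typing import List, Sequence, Union

ColumnSpec = Union[int, str, Sequence[Union[int, str]]]


def resolve_columns(spec: ColumnSpec, header: Sequence[str]) -> List[int]:
    """Resolve a column spec to a list of zero-based column indices."""
    n = len(header)
    if isinstance(spec, int):
        # Single index; negative values count from the end.
        return [range(n)[spec]]
    if isinstance(spec, str):
        if ":" in spec:
            # Assumed Python-style slice such as "2:-1" (end exclusive).
            start, _, stop = spec.partition(":")
            bounds = slice(int(start) if start else None, int(stop) if stop else None)
            return list(range(n))[bounds]
        if spec in header:
            # Exact column name.
            return [list(header).index(spec)]
        # Otherwise treat the string as a regular expression on column names.
        pattern = re.compile(spec)
        return [i for i, name in enumerate(header) if pattern.fullmatch(name)]
    # A list of specs: resolve each item and concatenate the results.
    indices: List[int] = []
    for item in spec:
        indices.extend(resolve_columns(item, header))
    return indices


header = ["sample_id", "batch", "1100", "1102", "1104", "protein"]
features = resolve_columns("2:-1", header)
targets = resolve_columns(-1, header)
metadata = resolve_columns([0, 1], header)
```

Under the slice reading, `"2:-1"` stops before the last column, which is consistent with `targets: -1` claiming that last column in the module example.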
- class nirs4all.data.parsers.files_parser.SourcesParser
Bases: BaseParser
Parser for multi-source ‘sources’ syntax configuration.
The sources syntax provides:
- Named feature sources (e.g., NIR, MIR spectrometers)
- Per-source loading parameters
- Automatic source alignment by sample key
- Shared targets and metadata across sources
- Example configuration:
    sources:
      - name: "NIR"
        files:
          - path: data/NIR_train.csv
            partition: train
          - path: data/NIR_test.csv
            partition: test
        params:
          header_unit: nm
          signal_type: absorbance
      - name: "MIR"
        train_x: data/MIR_train.csv
        test_x: data/MIR_test.csv
        params:
          header_unit: cm-1
    targets:
      path: data/targets.csv
      link_by: sample_id
    metadata:
      path: data/metadata.csv
      link_by: sample_id
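Alignment by sample key, as implied by `link_by: sample_id`, can be pictured as keeping only the samples that every source and the targets have in common. The sketch below illustrates that idea with plain dictionaries; `align_by_key` is a hypothetical helper, and the inner-join behavior for missing keys is an assumption rather than documented library behavior.

```python
from typing import Dict, List, Tuple


def align_by_key(
    sources: Dict[str, Dict[str, List[float]]],
    targets: Dict[str, float],
) -> Tuple[List[str], Dict[str, List[List[float]]], List[float]]:
    """Align several keyed feature tables and one keyed target table.

    Keeps only sample keys present in every source and in the targets,
    in the order in which they appear in the first source.
    """
    first = next(iter(sources.values()))
    shared = [
        key for key in first
        if all(key in table for table in sources.values()) and key in targets
    ]
    aligned_x = {name: [table[key] for key in shared] for name, table in sources.items()}
    aligned_y = [targets[key] for key in shared]
    return shared, aligned_x, aligned_y


nir = {"s1": [0.1, 0.2], "s2": [0.3, 0.4], "s3": [0.5, 0.6]}
mir = {"s3": [1.5], "s1": [1.1]}  # different order, and s2 is missing
y_by_id = {"s1": 10.0, "s2": 20.0, "s3": 30.0}

keys, x, y = align_by_key({"NIR": nir, "MIR": mir}, y_by_id)
```

After alignment, row `i` of every source and entry `i` of the targets refer to the same sample, regardless of the row order in the original files.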
- can_parse(input_data: Any) -> bool
Check if this is a sources-format configuration.
- Parameters:
input_data – The input to check.
- Returns:
True if the input has a ‘sources’ key holding a non-empty list.
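The documented contract can be sketched as a standalone check. This is an illustration only: `looks_like_sources_config` is a hypothetical function, and rejecting non-dictionary input is an assumption based on the `Any` type of the parameter.

```python
from typing import Any


def looks_like_sources_config(input_data: Any) -> bool:
    """Return True for a dict whose 'sources' value is a non-empty list."""
    if not isinstance(input_data, dict):
        # Assumption: strings, paths, and other inputs are not sources configs.
        return False
    sources = input_data.get("sources")
    return isinstance(sources, list) and len(sources) > 0


accepted = looks_like_sources_config({"sources": [{"name": "NIR"}]})
rejected_empty = looks_like_sources_config({"sources": []})
rejected_other = looks_like_sources_config("data/folder")
```

The same shape of check, applied to the ‘variations’ key, matches the contract documented for `VariationsParser.can_parse`.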
- parse(input_data: Dict[str, Any]) -> ParserResult
Parse a sources-format configuration.
Converts the sources syntax to a DatasetConfigSchema that can be further converted to legacy format for backward compatibility.
- Parameters:
input_data – Dictionary configuration to parse.
- Returns:
ParserResult with parsed configuration.
- class nirs4all.data.parsers.files_parser.VariationsParser
Bases: BaseParser
Parser for feature variations ‘variations’ syntax configuration.
The variations syntax provides:
- Named feature variations (e.g., raw, snv, derivative)
- Per-variation loading parameters
- Preprocessing provenance tracking
- Multiple variation modes (separate, concat, select, compare)
- Example configuration:
    variations:
      - name: "raw"
        files:
          - path: data/spectra_raw.csv
            partition: train
          - path: data/spectra_raw_test.csv
            partition: test
      - name: "snv"
        description: "SNV preprocessed spectra"
        preprocessing_applied:
          - type: "SNV"
            software: "OPUS 8.0"
        train_x: data/spectra_snv_train.csv
        test_x: data/spectra_snv_test.csv
    variation_mode: separate  # or concat, select, compare
    variation_select: ["raw", "snv"]  # only for mode=select
    targets:
      path: data/targets.csv
      link_by: sample_id
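One plausible reading of the `variation_mode` values is sketched below with plain lists. This is not the library's logic: `combine_variations` is a hypothetical helper, the meaning given to `separate`, `concat`, and `select` is an assumption drawn from their names, and `compare` is left out because this section does not describe it.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]


def combine_variations(
    variations: Dict[str, Matrix],
    mode: str = "separate",
    select: Optional[List[str]] = None,
) -> Dict[str, Matrix]:
    """Combine named feature variations according to an assumed mode."""
    if mode == "separate":
        # Each variation stays its own feature block.
        return dict(variations)
    if mode == "select":
        # Keep only the named variations, in the requested order.
        if not select:
            raise ValueError("mode='select' needs a non-empty variation_select")
        return {name: variations[name] for name in select}
    if mode == "concat":
        # Join the feature columns of all variations sample by sample.
        blocks = list(variations.values())
        n_samples = len(blocks[0])
        if any(len(block) != n_samples for block in blocks):
            raise ValueError("all variations must have the same number of samples")
        merged = [
            [value for block in blocks for value in block[i]]
            for i in range(n_samples)
        ]
        return {"+".join(variations): merged}
    # "compare" is not sketched: its behavior is not described in this section.
    raise ValueError(f"unsupported mode in this sketch: {mode!r}")


data = {
    "raw": [[1.0, 2.0], [3.0, 4.0]],
    "snv": [[-1.0, 1.0], [-1.0, 1.0]],
}
separate = combine_variations(data)
concatenated = combine_variations(data, mode="concat")
selected = combine_variations(data, mode="select", select=["snv"])
```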
- can_parse(input_data: Any) -> bool
Check if this is a variations-format configuration.
- Parameters:
input_data – The input to check.
- Returns:
True if the input has a ‘variations’ key holding a non-empty list.
- parse(input_data: Dict[str, Any]) -> ParserResult
Parse a variations-format configuration.
Converts the variations syntax to a DatasetConfigSchema that can be further converted to legacy format for backward compatibility.
- Parameters:
input_data – Dictionary configuration to parse.
- Returns:
ParserResult with parsed configuration.