nirs4all.data.synthetic.exporter module

Dataset export utilities for synthetic NIRS data.

This module provides tools for exporting synthetic datasets to various file formats and folder structures compatible with nirs4all loaders.

Key Features:

Export to CSV files (single or multi-file format)
Export to nirs4all standard folder structure (Xcal, Ycal, Xval, Yval)
Export with metadata (sample IDs, groups, etc.)
Generate CSV variations for loader testing

Example

>>> from nirs4all.data.synthetic import SyntheticDatasetBuilder, DatasetExporter
>>>
>>> builder = SyntheticDatasetBuilder(n_samples=1000, random_state=42)
>>> X, y = builder.build_arrays()
>>>
>>> exporter = DatasetExporter()
>>> path = exporter.to_folder(
...     "output/synthetic_data",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=builder.state._wavelengths
... )

class nirs4all.data.synthetic.exporter.CSVVariationGenerator

Bases: object

Generate CSV files with various format variations for loader testing.

This class creates CSV files with different delimiters, encodings, header formats, and other variations to test the robustness of CSV loaders.

base_exporter: DatasetExporter for actual file writing.

Example

>>> generator = CSVVariationGenerator()
>>>
>>> # Generate all variations
>>> paths = generator.generate_all_variations(
...     "test_data",
...     X, y,
...     wavelengths=wavelengths
... )
>>>
>>> # Generate specific variation
>>> path = generator.with_semicolon_delimiter(
...     "data_semicolon",
...     X, y
... )

as_fragmented(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create fragmented dataset with multiple small files.

as_single_file(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create single CSV file with all data and partition column.

generate_all_variations(base_path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Dict[str, Path]

Generate CSV files with all format variations.

Creates multiple versions of the dataset with different CSV format options for comprehensive loader testing.

Parameters:

base_path – Base output folder path.
X – Feature matrix.
y – Target values.
wavelengths – Optional wavelength values.
train_ratio – Fraction of samples assigned to the training set.
random_state – Random seed.

Returns:

Dictionary mapping variation name to created path.

Example

>>> paths = generator.generate_all_variations(
...     "test_variations",
...     X, y,
...     random_state=42
... )
>>> print(paths.keys())
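
To make the returned mapping concrete, the sketch below imitates the idea using only the standard library. It does not call nirs4all: `write_variations`, `read_table`, the variation names, and the file names are hypothetical and chosen for illustration, so the real keys and files may differ. It writes the same small table with three delimiters and returns a name-to-path dictionary that a loader test could iterate over.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical variation names and delimiters; the keys actually returned
# by generate_all_variations may differ.
DELIMITERS = {"comma": ",", "semicolon": ";", "tab": "\t"}


def write_variations(base_path, header, rows):
    """Write the same table once per delimiter and map name -> file path."""
    base = Path(base_path)
    base.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, delimiter in DELIMITERS.items():
        path = base / f"data_{name}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter=delimiter)
            writer.writerow(header)
            writer.writerows(rows)
        paths[name] = path
    return paths


def read_table(path, delimiter):
    """Read a delimited file back into a list of string rows."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


header = ["1000", "1002", "1004", "y"]
rows = [[0.12, 0.15, 0.11, 3.4], [0.22, 0.25, 0.21, 5.1]]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_variations(tmp, header, rows)
    # Every variation should parse back to the same table.
    tables = {
        name: read_table(path, DELIMITERS[name]) for name, path in paths.items()
    }
```

A loader test built this way can assert that every entry of the dictionary parses to identical content, which is the kind of robustness check the variations are meant to support.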

with_comma_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create CSV with comma delimiter.

with_precision(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None, precision: int = 6) → Path

Create CSV with specified floating point precision.

with_row_index(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create CSV with row index column.

with_semicolon_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create CSV with semicolon delimiter (nirs4all default).

with_tab_delimiter(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create CSV with tab delimiter.

without_headers(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, random_state: int | None = None) → Path

Create CSV without column headers.

class nirs4all.data.synthetic.exporter.DatasetExporter(config: ExportConfig | None = None)

Bases: object

Export synthetic datasets to various file formats.

This class provides methods for exporting synthetic NIRS datasets to files and folders compatible with nirs4all’s data loaders.

config: Export configuration settings.

Parameters:

config – Optional ExportConfig. Uses defaults if None.

Example

>>> exporter = DatasetExporter()
>>>
>>> # Export to standard folder structure
>>> path = exporter.to_folder(
...     "output/data",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=wavelengths
... )
>>>
>>> # Export to single CSV
>>> path = exporter.to_csv(
...     "output/all_data.csv",
...     X, y,
...     wavelengths=wavelengths
... )

to_csv(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, metadata: Dict[str, ndarray] | None = None, include_targets: bool = True) → Path

Export dataset to a single CSV file.

Creates a CSV file with features (and optionally targets) combined.

Parameters:

path – Output file path.
X – Feature matrix (n_samples, n_features).
y – Target values (n_samples,) or (n_samples, n_targets).
wavelengths – Optional wavelength values for column headers.
metadata – Optional dict of metadata arrays.
include_targets – Whether to include target column(s).

Returns:

Path to created file.

Example

>>> exporter.to_csv("data.csv", X, y, wavelengths=wavelengths)
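
As a rough picture of what a combined features-and-targets file can look like, here is a standard-library sketch. It is not the nirs4all implementation: `rows_to_csv`, the `target` column name, and the `feature_i` fallback names are assumptions made for illustration, and the real column layout may differ.

```python
import csv
import io


def rows_to_csv(X, y, wavelengths=None, include_targets=True, separator=";"):
    """Return CSV text with one row per sample (illustrative layout only)."""
    n_features = len(X[0])
    # Use wavelengths as column names when given, else generic names (assumption).
    if wavelengths is not None:
        header = [str(w) for w in wavelengths]
    else:
        header = [f"feature_{i}" for i in range(n_features)]
    if include_targets:
        header.append("target")  # hypothetical column name
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\n")
    writer.writerow(header)
    for features, target in zip(X, y):
        row = list(features)
        if include_targets:
            row.append(target)
        writer.writerow(row)
    return buffer.getvalue()


X = [[0.10, 0.20, 0.30], [0.40, 0.50, 0.60]]
y = [1.5, 2.5]
text = rows_to_csv(X, y, wavelengths=[1000, 1002, 1004])
print(text)
# 1000;1002;1004;target
# 0.1;0.2;0.3;1.5
# 0.4;0.5;0.6;2.5
```

Passing `include_targets=False` in this sketch drops the last column, mirroring the documented switch.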

to_folder(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, wavelengths: ndarray | None = None, metadata: Dict[str, ndarray] | None = None, random_state: int | None = None, format: Literal['standard', 'single', 'fragmented'] | None = None) → Path

Export dataset to a folder structure.

Creates a folder with CSV files compatible with nirs4all’s DatasetConfigs loader.

Parameters:

path – Output folder path.
X – Feature matrix (n_samples, n_features).
y – Target values (n_samples,) or (n_samples, n_targets).
train_ratio – Proportion for training set.
wavelengths – Optional wavelength values for column headers.
metadata – Optional dict of metadata arrays (same length as X).
random_state – Random seed for train/test split.
format – Override config format for this export.

Returns:

Path to created folder.

Raises:

ValueError – If X and y have incompatible shapes.
ImportError – If pandas is not available.

Example

>>> import numpy as np
>>> exporter.to_folder(
...     "data/synthetic",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=np.arange(1000, 2500, 2)
... )
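
The interaction between `train_ratio`, `random_state`, and the four-file layout can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `split_indices`, `write_matrix`, and `export_standard` are hypothetical helpers, the `.csv` file names are guessed from the Xcal/Ycal/Xval/Yval naming, and the actual splitting strategy may differ.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_indices(n_samples, train_ratio=0.8, random_state=None):
    """Shuffle sample indices and cut them at train_ratio."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]


def write_matrix(path, rows, separator=";"):
    """Write a list of rows to a delimited text file."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=separator).writerows(rows)


def export_standard(folder, X, y, train_ratio=0.8, random_state=None):
    """Write Xcal/Ycal/Xval/Yval files (file names are an assumption)."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    train, val = split_indices(len(X), train_ratio, random_state)
    write_matrix(folder / "Xcal.csv", [X[i] for i in train])
    write_matrix(folder / "Ycal.csv", [[y[i]] for i in train])
    write_matrix(folder / "Xval.csv", [X[i] for i in val])
    write_matrix(folder / "Yval.csv", [[y[i]] for i in val])
    return folder


X = [[i + 0.1, i + 0.2] for i in range(10)]
y = [float(i) for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    out = export_standard(tmp, X, y, train_ratio=0.8, random_state=42)
    created = sorted(p.name for p in out.iterdir())
    n_cal_rows = len((out / "Xcal.csv").read_text(encoding="utf-8").splitlines())
```

Seeding the shuffle is what makes the split reproducible: calling `split_indices` twice with the same `random_state` yields the same partition.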

to_numpy(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None, compressed: bool = False) → Path

Export dataset to NumPy .npy or .npz format.

Parameters:

path – Output file path (without extension).
X – Feature matrix (n_samples, n_features).
y – Target values.
wavelengths – Optional wavelength values.
compressed – Whether to use compressed format (.npz).

Returns:

Path to created file.

Example

>>> exporter.to_numpy("data", X, y, compressed=True)

class nirs4all.data.synthetic.exporter.ExportConfig(format: Literal['standard', 'single', 'fragmented'] = 'standard', separator: str = ';', float_precision: int = 6, include_headers: bool = True, include_index: bool = False, compression: Literal['gzip', 'zip'] | None = None, file_extension: str = '.csv')

Bases: object

Configuration for dataset export.

format

Export format:

'standard': Separate Xcal, Ycal, Xval, Yval files.
'single': All data in one file with partition column.
'fragmented': Multiple small files (for loader testing).

Type: Literal['standard', 'single', 'fragmented']
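
To illustrate the 'single' layout, the sketch below tags each row with a partition label instead of writing separate files. It is a minimal standard-library illustration: `add_partition_column` and the `train`/`test` label values are hypothetical, and the labels nirs4all actually writes may differ.

```python
import random


def add_partition_column(rows, train_ratio=0.8, random_state=None):
    """Append a 'train'/'test' label to each row (label values are hypothetical)."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)
    n_train = int(round(len(rows) * train_ratio))
    train_ids = set(indices[:n_train])
    return [
        list(row) + ["train" if i in train_ids else "test"]
        for i, row in enumerate(rows)
    ]


rows = [[0.1 * i, 0.2 * i, float(i)] for i in range(5)]
labelled = add_partition_column(rows, train_ratio=0.8, random_state=0)
counts = {
    label: sum(r[-1] == label for r in labelled) for label in ("train", "test")
}
print(counts)  # {'train': 4, 'test': 1}
```

Keeping the rows in their original order and only adding a column means a loader can recover both partitions from one file by filtering on that column.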

separator

CSV delimiter character.

Type: str

float_precision

Decimal precision for floating point values.

Type: int

include_headers

Whether to include column headers in CSV.

Type: bool

include_index

Whether to include row index.

Type: bool

compression

Optional compression (‘gzip’, ‘zip’, None).

Type: Literal['gzip', 'zip'] | None

file_extension

File extension to use.

Type: str

compression: Literal['gzip', 'zip'] | None = None

file_extension: str = '.csv'

float_precision: int = 6

format: Literal['standard', 'single', 'fragmented'] = 'standard'

include_headers: bool = True

include_index: bool = False

separator: str = ';'
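
The following sketch shows how such settings could shape the written text. `ExportConfigSketch` is a stand-in dataclass that copies the documented field names and defaults so the example runs without nirs4all installed. `format_table`, the `index` column name, and the reading of `float_precision` as a fixed number of decimals are assumptions, not confirmed library behavior.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class ExportConfigSketch:
    """Stand-in mirroring the documented fields and defaults."""

    format: Literal["standard", "single", "fragmented"] = "standard"
    separator: str = ";"
    float_precision: int = 6
    include_headers: bool = True
    include_index: bool = False
    compression: Optional[Literal["gzip", "zip"]] = None
    file_extension: str = ".csv"


def format_table(header, rows, config):
    """Render rows as delimited text according to the config (illustrative only)."""
    lines = []
    if config.include_headers:
        names = (["index"] if config.include_index else []) + list(header)
        lines.append(config.separator.join(names))
    for i, row in enumerate(rows):
        # Assumption: float_precision is the number of digits after the point.
        cells = [f"{value:.{config.float_precision}f}" for value in row]
        if config.include_index:
            cells.insert(0, str(i))
        lines.append(config.separator.join(cells))
    return "\n".join(lines)


default_text = format_table(
    ["1000", "1002"], [[0.123456789, 1.5]], ExportConfigSketch()
)
compact_text = format_table(
    ["1000", "1002"],
    [[0.123456789, 1.5]],
    ExportConfigSketch(separator=",", float_precision=2, include_index=True),
)
print(default_text)
# 1000;1002
# 0.123457;1.500000
print(compact_text)
# index,1000,1002
# 0,0.12,1.50
```

With the real class, the equivalent would presumably be to build an `ExportConfig` with the desired fields and pass it to `DatasetExporter(config=...)`, as the constructor signature indicates.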

nirs4all.data.synthetic.exporter.export_to_csv(path: str | Path, X: ndarray, y: ndarray, *, wavelengths: ndarray | None = None) → Path

Quick function to export synthetic data to a single CSV file.

Parameters:

path – Output file path.
X – Feature matrix.
y – Target values.
wavelengths – Optional wavelength values.

Returns:

Path to created file.

Example

>>> path = export_to_csv("data.csv", X, y)

nirs4all.data.synthetic.exporter.export_to_folder(path: str | Path, X: ndarray, y: ndarray, *, train_ratio: float = 0.8, wavelengths: ndarray | None = None, format: Literal['standard', 'single', 'fragmented'] = 'standard', random_state: int | None = None) → Path

Quick function to export synthetic data to a folder.

Convenience function for simple export use cases.

Parameters:

path – Output folder path.
X – Feature matrix.
y – Target values.
train_ratio – Fraction of samples assigned to the training set.
wavelengths – Optional wavelength values.
format – Export format.
random_state – Random seed.

Returns:

Path to created folder.

Example

>>> path = export_to_folder(
...     "data/synthetic",
...     X, y,
...     train_ratio=0.8,
...     wavelengths=wavelengths
... )