nirs4all.data.loaders.parquet_loader module

Parquet file loader implementation.

This module provides the ParquetLoader class for loading Apache Parquet files. Requires pyarrow or fastparquet as a dependency.

class nirs4all.data.loaders.parquet_loader.ParquetLoader

Bases: FileLoader

Loader for Apache Parquet files.

Requires pyarrow or fastparquet to be installed.

Supports:
  • Single Parquet files (.parquet, .pq)

  • Partitioned datasets (directory of Parquet files)

  • Column selection for efficient loading

Parameters:
  • columns – List of column names to load (default: all columns).

  • engine – Parquet engine to use (‘auto’, ‘pyarrow’, or ‘fastparquet’).

  • filters – Row group filters for predicate pushdown (pyarrow only).

  • header_unit – Unit for headers (‘cm-1’, ‘nm’, ‘text’, etc.)
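As an illustration of what engine='auto' can mean, the sketch below picks the first installed backend, trying pyarrow before fastparquet (the order pandas documents for its own 'auto' setting). Whether ParquetLoader resolves 'auto' in the same way is an assumption; resolve_engine and its is_installed parameter are hypothetical names and not part of nirs4all.

```python
from importlib.util import find_spec
from typing import Callable, Tuple

# Backends named in this module's documentation, in assumed preference order.
CANDIDATE_ENGINES: Tuple[str, ...] = ("pyarrow", "fastparquet")


def _installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def resolve_engine(
    engine: str = "auto",
    is_installed: Callable[[str], bool] = _installed,
) -> str:
    """Hypothetical helper: turn an engine argument into a concrete backend name."""
    if engine == "auto":
        # Assumed behaviour: take the first backend that is installed.
        for name in CANDIDATE_ENGINES:
            if is_installed(name):
                return name
        raise ImportError("Parquet support requires pyarrow or fastparquet.")
    if engine not in CANDIDATE_ENGINES:
        raise ValueError(f"Unknown Parquet engine: {engine!r}")
    if not is_installed(engine):
        raise ImportError(f"Parquet engine {engine!r} is not installed.")
    return engine
```

Calling resolve_engine() with no arguments reports which backend is usable in the current environment, or raises ImportError if neither is installed.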

Example

>>> from pathlib import Path
>>> from nirs4all.data.loaders.parquet_loader import ParquetLoader
>>> loader = ParquetLoader()
>>> result = loader.load(
...     Path("data.parquet"),
...     columns=["feature_1", "feature_2"],
... )

load(path: Path, columns: List[str] | None = None, engine: str = 'auto', filters: List | None = None, header_unit: str = 'text', data_type: str = 'x', **params: Any) → LoaderResult

Load data from a Parquet file.

Parameters:
  • path – Path to the Parquet file or directory.

  • columns – List of column names to load. If None, loads all columns.

  • engine – Parquet engine (‘auto’, ‘pyarrow’, or ‘fastparquet’).

  • filters – Row group filters for predicate pushdown (pyarrow only).

  • header_unit – Unit type for headers.

  • data_type – Type of data (‘x’, ‘y’, or ‘metadata’).

  • **params – Additional parameters passed to read_parquet.

Returns:

LoaderResult with the loaded data.
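The filters argument is described as pyarrow row group filters. pyarrow writes such filters in disjunctive normal form: a flat list of (column, op, value) tuples is ANDed together, and a list of such lists is ORed. The sketch below only illustrates which rows a given filter would keep, by evaluating it in plain Python; it performs no pushdown and is not how pyarrow applies filters internally. The names row_matches, site, and moisture are hypothetical.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple, Union

Predicate = Tuple[str, str, Any]
Filters = Union[List[Predicate], List[List[Predicate]]]

# Comparison operators accepted in filter tuples.
_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row: Dict[str, Any], filters: Filters) -> bool:
    """Return True if a row (column name -> value) satisfies a DNF filter."""
    if not filters:
        return True  # no filter means every row is kept
    # A flat list of tuples is a single AND group; normalise to a list of groups.
    groups = [filters] if isinstance(filters[0], tuple) else filters
    return any(
        all(_OPS[op](row[column], value) for column, op, value in group)
        for group in groups
    )


# Hypothetical rows and filters.
rows = [
    {"site": "A", "moisture": 14.2},
    {"site": "B", "moisture": 14.2},
    {"site": "A", "moisture": 9.0},
]
and_filter = [("site", "==", "A"), ("moisture", ">", 12.0)]
or_filter = [[("site", "==", "B")], [("moisture", "<", 10.0)]]

kept_by_and = [row for row in rows if row_matches(row, and_filter)]
kept_by_or = [row for row in rows if row_matches(row, or_filter)]
```

A filter written this way would be passed as filters=and_filter in the load call, with the pyarrow engine.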

name: ClassVar[str] = 'Parquet Loader'
priority: ClassVar[int] = 35
supported_extensions: ClassVar[Tuple[str, ...]] = ('.parquet', '.pq')
classmethod supports(path: Path) → bool

Check if this loader supports the given file.
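A check of this kind typically compares the path's suffix against supported_extensions. The sketch below is an assumed approximation and not the loader's actual code: looks_like_parquet is a hypothetical helper, and both the case-insensitive match and the directory scan for partitioned datasets are assumptions.

```python
from pathlib import Path
from typing import Tuple

# Mirrors the supported_extensions attribute documented above.
SUPPORTED_EXTENSIONS: Tuple[str, ...] = (".parquet", ".pq")


def looks_like_parquet(path: Path) -> bool:
    """Hypothetical stand-in for an extension-based support check."""
    # Single file: compare the suffix, ignoring case (assumption).
    if path.suffix.lower() in SUPPORTED_EXTENSIONS:
        return True
    # Partitioned dataset: a directory holding at least one Parquet file (assumption).
    if path.is_dir():
        return any(
            child.is_file() and child.suffix.lower() in SUPPORTED_EXTENSIONS
            for child in path.rglob("*")
        )
    return False
```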

nirs4all.data.loaders.parquet_loader.load_parquet(path, columns: List[str] | None = None, engine: str = 'auto', header_unit: str = 'text', **params)

Load a Parquet file.

Convenience function for direct use.

Parameters:
  • path – Path to the Parquet file.

  • columns – Column names to load.

  • engine – Parquet engine to use.

  • header_unit – Unit type for headers.

  • **params – Additional parameters.

Returns:

Tuple of (DataFrame, report, na_mask, headers, header_unit).
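Because the return value is a positional five-element tuple, wrapping it in a named structure can make calling code easier to read. The sketch below is one possible way to do that; ParquetLoadResult and as_named are hypothetical names, and the field types are left as Any because they are not documented here.

```python
from typing import Any, NamedTuple, Sequence


class ParquetLoadResult(NamedTuple):
    """Hypothetical named view of the tuple returned by load_parquet."""

    data: Any         # the DataFrame
    report: Any       # loading report
    na_mask: Any      # mask of missing values
    headers: Any      # column headers
    header_unit: str  # unit of the headers, e.g. 'nm' or 'text'


def as_named(result: Sequence[Any]) -> ParquetLoadResult:
    """Wrap a (DataFrame, report, na_mask, headers, header_unit) tuple."""
    if len(result) != 5:
        raise ValueError(f"Expected 5 elements, got {len(result)}")
    return ParquetLoadResult(*result)


# Intended use (requires nirs4all and a Parquet engine, so it is not run here):
# named = as_named(load_parquet("data.parquet", header_unit="nm"))
# print(named.headers, named.header_unit)
```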