nirs4all.data.loaders.parquet_loader module
Parquet file loader implementation.
This module provides the ParquetLoader class for loading Apache Parquet files. Requires pyarrow or fastparquet as a dependency.
- class nirs4all.data.loaders.parquet_loader.ParquetLoader[source]
Bases: FileLoader
Loader for Apache Parquet files.
Requires pyarrow or fastparquet to be installed.
Supports:
- Single Parquet files (.parquet, .pq)
- Partitioned datasets (directory of parquet files)
- Column selection for efficient loading
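The supported inputs listed above can be told apart with a simple path check. The following is a minimal sketch using only the standard library; `classify_parquet_source` and `PARQUET_SUFFIXES` are hypothetical names introduced for illustration and are not part of nirs4all, and the loader's actual detection logic may differ.

```python
from pathlib import Path

# Suffixes taken from the "Supports" list; the helper itself is hypothetical.
PARQUET_SUFFIXES = {".parquet", ".pq"}


def classify_parquet_source(path: Path) -> str:
    """Return 'file', 'dataset', or 'unsupported' for a candidate path."""
    if path.is_dir():
        # Treat a directory as a partitioned dataset if it holds
        # at least one Parquet file anywhere below it.
        has_parts = any(
            p.is_file() and p.suffix.lower() in PARQUET_SUFFIXES
            for p in path.rglob("*")
        )
        return "dataset" if has_parts else "unsupported"
    if path.suffix.lower() in PARQUET_SUFFIXES:
        return "file"
    return "unsupported"
```

The suffix comparison is case-insensitive in this sketch; that is a design choice of the example, not documented behaviour of the loader.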
- Parameters:
columns – List of column names to load (default: all columns).
engine – Parquet engine to use (‘auto’, ‘pyarrow’, or ‘fastparquet’).
filters – Row group filters for predicate pushdown (pyarrow only).
header_unit – Unit for headers (‘cm-1’, ‘nm’, ‘text’, etc.)
Example
>>> from pathlib import Path
>>> from nirs4all.data.loaders.parquet_loader import ParquetLoader
>>> loader = ParquetLoader()
>>> result = loader.load(
...     Path("data.parquet"),
...     columns=["feature_1", "feature_2"],
... )
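The engine parameter accepts 'auto', 'pyarrow', or 'fastparquet'. The sketch below shows one way 'auto' could be resolved, trying pyarrow first and then fastparquet, which is the order pandas documents for its own `engine='auto'`. Whether ParquetLoader resolves the engine the same way is an assumption, and `resolve_parquet_engine` is a hypothetical helper, not a nirs4all function.

```python
import importlib.util

# Engines named in the parameter description, in the assumed preference order.
_CANDIDATES = ("pyarrow", "fastparquet")


def resolve_parquet_engine(engine: str = "auto") -> str:
    """Return the name of an installed Parquet engine or raise."""
    if engine == "auto":
        # Pick the first candidate that is importable.
        for name in _CANDIDATES:
            if importlib.util.find_spec(name) is not None:
                return name
        raise ImportError("Neither pyarrow nor fastparquet is installed.")
    if engine not in _CANDIDATES:
        raise ValueError(f"Unknown Parquet engine: {engine!r}")
    if importlib.util.find_spec(engine) is None:
        raise ImportError(f"Requested engine {engine!r} is not installed.")
    return engine
```

Checking availability with `importlib.util.find_spec` avoids importing a heavy package just to find out whether it exists.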
- load(path: Path, columns: List[str] | None = None, engine: str = 'auto', filters: List | None = None, header_unit: str = 'text', data_type: str = 'x', **params: Any) LoaderResult[source]
Load data from a Parquet file.
- Parameters:
path – Path to the Parquet file or directory.
columns – List of column names to load. If None, loads all columns.
engine – Parquet engine (‘auto’, ‘pyarrow’, or ‘fastparquet’).
filters – Row group filters for predicate pushdown (pyarrow only).
header_unit – Unit type for headers.
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
**params – Additional parameters passed to read_parquet.
- Returns:
LoaderResult with the loaded data.
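The filters argument is described only as "row group filters", so the expected shape is worth illustrating. pyarrow accepts filters as a list of `(column, op, value)` tuples combined with AND, or as a list of such lists combined with OR. The sketch below reproduces those semantics on plain dictionaries, assuming the loader forwards filters to pyarrow unchanged. `row_matches` and the sample rows are hypothetical, and real predicate pushdown happens inside the Parquet reader rather than in Python code like this.

```python
import operator
from typing import Any, Dict, List, Tuple, Union

Predicate = Tuple[str, str, Any]
Filters = Union[List[Predicate], List[List[Predicate]]]

# Comparison operators accepted in pyarrow-style filter tuples.
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row: Dict[str, Any], filters: Filters) -> bool:
    """Evaluate filters against one row (outer list = OR, inner list = AND)."""
    if not filters:
        return True
    # A flat list of tuples is a single AND group.
    groups = [filters] if isinstance(filters[0], tuple) else filters
    return any(
        all(_OPS[op](row[col], val) for col, op, val in group)
        for group in groups
    )


# Hypothetical sample rows, for illustration only.
rows = [
    {"site": "A", "moisture": 11.5},
    {"site": "B", "moisture": 14.2},
    {"site": "C", "moisture": 9.8},
]

# AND: site is A or B, and moisture above 12.
and_filters = [("site", "in", ["A", "B"]), ("moisture", ">", 12.0)]
kept_and = [r for r in rows if row_matches(r, and_filters)]

# OR: site equals A, or moisture below 10.
or_filters = [[("site", "=", "A")], [("moisture", "<", 10.0)]]
kept_or = [r for r in rows if row_matches(r, or_filters)]
```

Here `kept_and` contains only the site B row, and `kept_or` contains the site A and site C rows.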
- nirs4all.data.loaders.parquet_loader.load_parquet(path, columns: List[str] | None = None, engine: str = 'auto', header_unit: str = 'text', **params)[source]
Load a Parquet file.
Convenience function for direct use.
- Parameters:
path – Path to the Parquet file.
columns – Column names to load.
engine – Parquet engine to use.
header_unit – Unit type for headers.
**params – Additional parameters.
- Returns:
Tuple of (DataFrame, report, na_mask, headers, header_unit).