nirs4all.data.loaders package
Submodules
- nirs4all.data.loaders.archive_loader module
- nirs4all.data.loaders.base module
  - ArchiveHandler
    - ArchiveHandler.decompress_gzip()
    - ArchiveHandler.decompress_gzip_bytes()
    - ArchiveHandler.extract_bytes_from_tar()
    - ArchiveHandler.extract_bytes_from_zip()
    - ArchiveHandler.extract_from_tar()
    - ArchiveHandler.extract_from_zip()
    - ArchiveHandler.is_archive()
    - ArchiveHandler.is_compressed()
    - ArchiveHandler.list_tar_members()
    - ArchiveHandler.list_zip_members()
  - FileLoadError
  - FileLoader
  - FormatNotSupportedError
  - LoaderError
  - LoaderRegistry
  - LoaderResult
  - register_loader()
- nirs4all.data.loaders.csv_loader module
- nirs4all.data.loaders.csv_loader_new module
- nirs4all.data.loaders.excel_loader module
- nirs4all.data.loaders.loader module
- nirs4all.data.loaders.matlab_loader module
- nirs4all.data.loaders.numpy_loader module
- nirs4all.data.loaders.parquet_loader module
Module contents
File loaders module for nirs4all.
This module provides a pluggable file loading system supporting multiple file formats with automatic format detection and configurable loading parameters.
- Supported Formats:
CSV (.csv, .csv.gz, .csv.zip) - via CSVLoader
NumPy (.npy, .npz) - via NumpyLoader
Parquet (.parquet, .pq) - via ParquetLoader (requires pyarrow or fastparquet)
Excel (.xlsx, .xls) - via ExcelLoader (requires openpyxl/xlrd)
MATLAB (.mat) - via MatlabLoader (requires scipy, optionally h5py)
Archives (.tar, .tar.gz, .tgz, .zip) - via TarLoader, EnhancedZipLoader
- Usage:
>>> from pathlib import Path
>>> from nirs4all.data.loaders import LoaderRegistry, load_file
>>>
>>> # Using the registry
>>> registry = LoaderRegistry.get_instance()
>>> result = registry.load("data.csv", delimiter=",")
>>>
>>> # Or using the convenience function
>>> data, report, na_mask, headers, unit = load_file("data.csv")
>>>
>>> # Direct loader usage
>>> from nirs4all.data.loaders import CSVLoader
>>> loader = CSVLoader()
>>> result = loader.load(Path("data.csv"))
- Adding Custom Loaders:
>>> from nirs4all.data.loaders import FileLoader, register_loader
>>>
>>> @register_loader
... class MyLoader(FileLoader):
...     supported_extensions = (".myext",)
...     name = "My Loader"
...
...     @classmethod
...     def supports(cls, path):
...         return path.suffix.lower() == ".myext"
...
...     def load(self, path, **params):
...         # Load implementation
...         pass
- Backward Compatibility:
The legacy load_csv function is still available for existing code:
>>> from nirs4all.data.loaders.csv_loader import load_csv
- class nirs4all.data.loaders.ArchiveHandler
Bases: object
Utility class for handling compressed files and archives.
Supports:
  - Gzip compressed files (.gz)
  - Zip archives (.zip) with member selection
  - Tar archives (.tar, .tar.gz, .tgz, .tar.bz2) with member selection
- static decompress_gzip(path: Path, encoding: str = 'utf-8') -> str
Decompress a gzip file and return content as string.
- Parameters:
path – Path to the gzip file.
encoding – Text encoding to use.
- Returns:
Decompressed file content as string.
- static decompress_gzip_bytes(path: Path) -> bytes
Decompress a gzip file and return content as bytes.
- Parameters:
path – Path to the gzip file.
- Returns:
Decompressed file content as bytes.
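The two gzip helpers above can be approximated with the standard library alone. The following is a minimal sketch, not the ArchiveHandler implementation; the function names `read_gzip_text` and `read_gzip_bytes` are hypothetical stand-ins.

```python
import gzip
import tempfile
from pathlib import Path


def read_gzip_text(path: Path, encoding: str = "utf-8") -> str:
    """Return the decompressed content of a gzip file as text."""
    # "rt" mode decompresses and decodes in a single step.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return handle.read()


def read_gzip_bytes(path: Path) -> bytes:
    """Return the decompressed content of a gzip file as raw bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


# Round trip on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv.gz"
    sample.write_bytes(gzip.compress(b"a;b\n1;2\n"))
    text = read_gzip_text(sample)
    raw = read_gzip_bytes(sample)
```

The text variant is convenient for CSV content, while the bytes variant suits binary payloads that a later parser decodes itself.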
- static extract_bytes_from_tar(path: Path, member: str | None = None) -> bytes
Extract a file from a tar archive as bytes.
- Parameters:
path – Path to the tar file.
member – Name of the member to extract. If None, auto-detect.
- Returns:
Content of the extracted file as bytes.
- static extract_bytes_from_zip(path: Path, member: str | None = None) -> bytes
Extract a file from a zip archive as bytes.
- Parameters:
path – Path to the zip file.
member – Name of the member to extract. If None, auto-detect.
- Returns:
Content of the extracted file as bytes.
- static extract_from_tar(path: Path, member: str | None = None, encoding: str = 'utf-8') -> str
Extract a file from a tar archive.
- Parameters:
path – Path to the tar file.
member – Name of the member to extract. If None, auto-detect.
encoding – Text encoding to use.
- Returns:
Content of the extracted file as string.
- Raises:
FileLoadError – If no suitable member is found.
- static extract_from_zip(path: Path, member: str | None = None, encoding: str = 'utf-8') -> str
Extract a file from a zip archive.
- Parameters:
path – Path to the zip file.
member – Name of the member to extract. If None, auto-detect.
encoding – Text encoding to use.
- Returns:
Content of the extracted file as string.
- Raises:
FileLoadError – If no suitable member is found.
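How the member is auto-detected when `member` is None is not spelled out here. The sketch below assumes one plausible rule (the first CSV file, otherwise the first file) and uses only the standard library; `pick_member` and `read_zip_text` are hypothetical names, and `ValueError` stands in for FileLoadError.

```python
import io
import zipfile
from typing import List, Optional


def pick_member(names: List[str]) -> str:
    """Choose a member name, preferring CSV files (assumed selection rule)."""
    files = [n for n in names if not n.endswith("/")]  # skip directory entries
    if not files:
        raise ValueError("archive contains no files")
    csv_files = [n for n in files if n.lower().endswith(".csv")]
    return (csv_files or files)[0]


def read_zip_text(source, member: Optional[str] = None, encoding: str = "utf-8") -> str:
    """Read one member of a zip archive and decode it as text."""
    with zipfile.ZipFile(source) as archive:
        name = member if member is not None else pick_member(archive.namelist())
        return archive.read(name).decode(encoding)


# Build a small in-memory archive to exercise both code paths.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README.txt", "notes")
    archive.writestr("train/features.csv", "a;b\n1;2\n")

buffer.seek(0)
auto_text = read_zip_text(buffer)                     # no member: CSV is picked
buffer.seek(0)
explicit_text = read_zip_text(buffer, member="README.txt")
```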
- static is_archive(path: Path) -> bool
Check if a file is an archive (contains multiple files).
- class nirs4all.data.loaders.CSVLoader
Bases: FileLoader
Loader for CSV files.
Supports:
  - Plain CSV files (.csv)
  - Gzip-compressed CSV files (.csv.gz)
  - Zip-compressed CSV files (.csv.zip)
- Parameters:
delimiter – Field delimiter (default: ‘;’)
decimal_separator – Decimal separator (default: ‘.’)
has_header – Whether first row is header (default: True)
header_unit – Unit for headers (‘cm-1’, ‘nm’, etc.)
na_policy – How to handle NA values (‘remove’, ‘abort’, or ‘auto’)
categorical_mode – How to handle categorical data (‘auto’, ‘preserve’, ‘none’)
data_type – Type of data being loaded (‘x’, ‘y’, or ‘metadata’)
encoding – File encoding (default: ‘utf-8’)
member – For zip files, specific member to extract
- load(path: Path, na_policy: str = 'auto', data_type: str = 'x', categorical_mode: str = 'auto', header_unit: str = 'cm-1', encoding: str = 'utf-8', member: str | None = None, **user_params: Any) -> LoaderResult
Load data from a CSV file.
- Parameters:
path – Path to the CSV file.
na_policy – How to handle NA values (‘remove’, ‘abort’, or ‘auto’).
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
categorical_mode – How to handle categorical columns.
header_unit – Unit type for headers.
encoding – File encoding.
member – For zip files, specific member to extract.
**user_params – Additional CSV parsing parameters.
- Returns:
LoaderResult with the loaded data.
- class nirs4all.data.loaders.EnhancedZipLoader
Bases: FileLoader
Enhanced loader for zip archive files.
This loader provides additional features over the basic zip support in the CSV loader, including:
  - Member listing and selection
  - Support for non-CSV files in archives
  - Binary file extraction (for NumPy, Parquet, etc.)
- Parameters:
member – Name of the member file to extract.
password – Password for encrypted archives.
encoding – Text encoding for text files.
Example
>>> loader = EnhancedZipLoader()
>>> result = loader.load(
...     Path("data.zip"),
...     member="train/features.csv",
... )
- load(path: Path, member: str | None = None, password: str | None = None, encoding: str = 'utf-8', header_unit: str = 'cm-1', data_type: str = 'x', **params: Any) -> LoaderResult
Load data from a zip archive.
- Parameters:
path – Path to the zip archive.
member – Name of the member to extract.
password – Password for encrypted archives.
encoding – Text encoding for text files.
header_unit – Unit type for headers.
data_type – Type of data.
**params – Additional parameters for the inner loader.
- Returns:
LoaderResult with the loaded data.
- class nirs4all.data.loaders.ExcelLoader
Bases: FileLoader
Loader for Excel spreadsheet files.
Supports:
  - Modern Excel files (.xlsx) via openpyxl
  - Legacy Excel files (.xls) via xlrd
- Parameters:
sheet_name – Sheet name or index to load (default: 0, first sheet). Can be a string (sheet name), integer (0-indexed), or None (all sheets).
header – Row number to use as header (default: 0). Use None for no header.
skip_rows – Number of rows to skip at the beginning.
skip_footer – Number of rows to skip at the end.
usecols – Columns to load (can be list of names, indices, or Excel-style range).
engine – Excel engine to use (‘auto’, ‘openpyxl’, or ‘xlrd’).
header_unit – Unit for headers (‘cm-1’, ‘nm’, ‘text’, etc.)
Example
>>> loader = ExcelLoader()
>>> result = loader.load(
...     Path("data.xlsx"),
...     sheet_name="Sheet1",
...     skip_rows=2,
... )
- load(path: Path, sheet_name: str | int | None = 0, header: int | None = 0, skip_rows: int | None = None, skip_footer: int = 0, usecols: List[str] | List[int] | str | None = None, engine: str = 'auto', header_unit: str = 'text', data_type: str = 'x', **params: Any) -> LoaderResult
Load data from an Excel file.
- Parameters:
path – Path to the Excel file.
sheet_name – Sheet to load (name, index, or None for all).
header – Row number for header (0-indexed), or None.
skip_rows – Number of rows to skip at start.
skip_footer – Number of rows to skip at end.
usecols – Columns to load.
engine – Excel engine to use.
header_unit – Unit type for headers.
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
**params – Additional parameters passed to read_excel.
- Returns:
LoaderResult with the loaded data.
- exception nirs4all.data.loaders.FileLoadError
Bases: LoaderError
Raised when a file cannot be loaded.
- class nirs4all.data.loaders.FileLoader
Bases: ABC
Abstract base class for file loaders.
All file format loaders should inherit from this class and implement the required methods for loading and format detection.
- Class Attributes:
supported_extensions: Tuple of file extensions this loader handles.
name: Human-readable name for the loader.
priority: Loading priority (lower = higher priority) when multiple loaders match. Default: 50.
Example
>>> class CSVLoader(FileLoader):
...     supported_extensions = (".csv",)
...     name = "CSV Loader"
...
...     @classmethod
...     def supports(cls, path: Path) -> bool:
...         return path.suffix.lower() in cls.supported_extensions
...
...     def load(self, path: Path, **params) -> LoaderResult:
...         # Load CSV file
...         pass
- classmethod detect_format(path: Path) -> str | None
Detect the file format from the path.
- Parameters:
path – Path to analyze.
- Returns:
Format name if detected, None otherwise.
- classmethod get_base_path(path: Path) -> Path
Get the base path without compression extensions.
For example, ‘data.csv.gz’ -> ‘data.csv’
- Parameters:
path – Path to process.
- Returns:
Path without compression extension(s).
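The suffix stripping described above can be sketched with pathlib. This is an illustration only: the set of compression suffixes and the rule that a bare archive name such as ‘data.zip’ is left untouched are assumptions, and `strip_compression` is a hypothetical name.

```python
from pathlib import Path

# Assumed set of compression suffixes; the library's actual list may differ.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}


def strip_compression(path: Path) -> Path:
    """Drop trailing compression suffixes while an inner extension remains."""
    # Path(path.stem).suffix is empty for names like "data.zip", so those
    # are returned unchanged instead of being reduced to "data".
    while path.suffix.lower() in COMPRESSION_SUFFIXES and Path(path.stem).suffix:
        path = path.with_suffix("")
    return path


base = strip_compression(Path("data.csv.gz"))
```

Stripping in a loop also handles doubly wrapped names such as ‘data.csv.gz.zip’.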
- abstractmethod load(path: Path, **params: Any) -> LoaderResult
Load data from a file.
- Parameters:
path – Path to the file to load.
**params – Loader-specific parameters.
- Returns:
LoaderResult containing the loaded data and metadata.
- Raises:
FileLoadError – If the file cannot be loaded.
- exception nirs4all.data.loaders.FormatNotSupportedError
Bases: LoaderError
Raised when a file format is not supported.
- exception nirs4all.data.loaders.LoaderError
Bases: Exception
Base exception for loader errors.
- class nirs4all.data.loaders.LoaderRegistry
Bases: object
Registry for file loaders.
The registry maintains a list of available loaders and provides methods for finding the appropriate loader for a given file.
Example
>>> registry = LoaderRegistry()
>>> registry.register(CSVLoader)
>>> registry.register(ParquetLoader)
>>> loader = registry.get_loader(Path("data.csv"))
>>> result = loader.load(Path("data.csv"))
- static __new__(cls) -> LoaderRegistry
Implement singleton pattern.
- classmethod get_instance() -> LoaderRegistry
Get the singleton registry instance.
- get_loader(path: str | Path) -> FileLoader
Get the appropriate loader for a file.
- Parameters:
path – Path to the file to load.
- Returns:
An instance of the appropriate loader.
- Raises:
FormatNotSupportedError – If no loader supports the file format.
- get_registered_loaders() -> List[Type[FileLoader]]
Get all registered loader classes.
- Returns:
List of registered loader classes.
- get_supported_extensions() -> List[str]
Get all supported file extensions.
- Returns:
List of supported extensions across all registered loaders.
- load(path: str | Path, **params: Any) -> LoaderResult
Load a file using the appropriate loader.
This is a convenience method that finds the right loader and loads the file.
- Parameters:
path – Path to the file to load.
**params – Loading parameters to pass to the loader.
- Returns:
LoaderResult containing the loaded data.
- Raises:
FormatNotSupportedError – If no loader supports the file format.
FileLoadError – If the file cannot be loaded.
- register(loader_class: Type[FileLoader]) -> None
Register a file loader.
- Parameters:
loader_class – The loader class to register.
- unregister(loader_class: Type[FileLoader]) -> None
Unregister a file loader.
- Parameters:
loader_class – The loader class to unregister.
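The register / get_loader / unregister cycle, together with the "lower = higher priority" rule from FileLoader, can be illustrated with a self-contained toy registry. `MiniLoader` and `MiniRegistry` are hypothetical stand-ins written for this sketch, not the nirs4all classes, and `LookupError` stands in for FormatNotSupportedError.

```python
from pathlib import Path


class MiniLoader:
    """Stand-in for a loader class (not the nirs4all FileLoader)."""
    supported_extensions: tuple = ()
    name = "base"
    priority = 50  # same default as documented for FileLoader

    @classmethod
    def supports(cls, path: Path) -> bool:
        return path.suffix.lower() in cls.supported_extensions


class MiniRegistry:
    """Toy registry that resolves a path to the best matching loader."""

    def __init__(self):
        self._loaders = []

    def register(self, loader_class):
        if loader_class not in self._loaders:
            self._loaders.append(loader_class)
        return loader_class  # returning the class lets this act as a decorator

    def unregister(self, loader_class):
        if loader_class in self._loaders:
            self._loaders.remove(loader_class)

    def get_loader(self, path):
        path = Path(path)
        candidates = [c for c in self._loaders if c.supports(path)]
        if not candidates:
            raise LookupError(f"no loader supports {path.suffix!r}")
        # Lowest priority value wins when several loaders match.
        return min(candidates, key=lambda c: c.priority)()


registry = MiniRegistry()


@registry.register
class GenericText(MiniLoader):
    supported_extensions = (".csv", ".txt")
    name = "generic"


@registry.register
class FastCsv(MiniLoader):
    supported_extensions = (".csv",)
    name = "fast-csv"
    priority = 10


chosen = registry.get_loader("data.csv")
```

Both loaders match ‘data.csv’, and the one with the smaller priority value is returned; after unregistering it, the generic loader takes over.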
- class nirs4all.data.loaders.LoaderResult(data: DataFrame | None = None, report: Dict[str, Any] | None = None, na_mask: Series | None = None, headers: List[str] | None = None, header_unit: str = 'cm-1')
Bases: object
Result container for file loading operations.
- data
The loaded data as a pandas DataFrame.
- report
Dictionary containing loading metadata and diagnostics.
- na_mask
Boolean Series indicating rows with NA values.
- headers
List of column headers.
- header_unit
The unit type for headers (e.g., ‘cm-1’, ‘nm’).
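The five attributes line up with the five-element tuple returned by the convenience functions documented on this page. A minimal sketch of that correspondence follows, using a plain dataclass; `ResultSketch` and its `as_tuple` helper are hypothetical, and `Any` replaces the pandas types so the example needs no third-party packages.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ResultSketch:
    """Illustrative mirror of the documented fields and defaults."""
    data: Optional[Any] = None             # a pandas DataFrame in the real class
    report: Optional[Dict[str, Any]] = None
    na_mask: Optional[Any] = None          # a pandas Series in the real class
    headers: Optional[List[str]] = None
    header_unit: str = "cm-1"

    def as_tuple(self):
        # Hypothetical helper: same order as the documented return tuples.
        return (self.data, self.report, self.na_mask, self.headers, self.header_unit)


result = ResultSketch(headers=["1000", "1100"], header_unit="nm")
data, report, na_mask, headers, unit = result.as_tuple()
```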
- class nirs4all.data.loaders.MatlabLoader
Bases: FileLoader
Loader for MATLAB .mat files.
Supports:
  - MATLAB v4, v6, v7 files via scipy.io
  - MATLAB v7.3 (HDF5) files via h5py (if available)
- Parameters:
variable – Name of the variable to load. If None, auto-detects.
squeeze_me – Squeeze unit matrix dimensions (default: True).
struct_as_record – Load MATLAB structs as numpy record arrays (default: False).
header_unit – Unit for generated headers (‘index’, ‘cm-1’, ‘nm’, etc.)
Example
>>> loader = MatlabLoader()
>>> result = loader.load(
...     Path("data.mat"),
...     variable="X",
... )
- load(path: Path, variable: str | None = None, squeeze_me: bool = True, struct_as_record: bool = False, header_unit: str = 'index', data_type: str = 'x', **params: Any) -> LoaderResult
Load data from a MATLAB .mat file.
- Parameters:
path – Path to the MATLAB file.
variable – Name of the variable to load. If None, auto-detects.
squeeze_me – Squeeze unit matrix dimensions.
struct_as_record – Load structs as record arrays.
header_unit – Unit type for generated headers.
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
**params – Additional parameters.
- Returns:
LoaderResult with the loaded data.
- class nirs4all.data.loaders.NumpyLoader
Bases: FileLoader
Loader for NumPy array files.
Supports:
  - Single array files (.npy)
  - Multi-array archives (.npz)
- Parameters:
allow_pickle – Whether to allow loading pickled objects (default: False). Setting this to True may pose a security risk with untrusted files.
key – For .npz files, the key of the array to load. If not specified, uses the first array.
header_unit – Unit for generated headers (‘cm-1’, ‘nm’, ‘index’, etc.)
- Security Note:
NumPy’s allow_pickle=True can execute arbitrary code when loading untrusted files. Only enable this for files you trust completely.
- load(path: Path, allow_pickle: bool = False, key: str | None = None, header_unit: str = 'index', data_type: str = 'x', **params: Any) -> LoaderResult
Load data from a NumPy file.
- Parameters:
path – Path to the NumPy file.
allow_pickle – Whether to allow loading pickled objects.
key – For .npz files, the key of the array to load.
header_unit – Unit type for generated headers.
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
**params – Additional parameters (ignored).
- Returns:
LoaderResult with the loaded data as a DataFrame.
- class nirs4all.data.loaders.ParquetLoader
Bases: FileLoader
Loader for Apache Parquet files.
Requires pyarrow or fastparquet to be installed.
Supports:
  - Single Parquet files (.parquet, .pq)
  - Partitioned datasets (directory of parquet files)
  - Column selection for efficient loading
- Parameters:
columns – List of column names to load (default: all columns).
engine – Parquet engine to use (‘auto’, ‘pyarrow’, or ‘fastparquet’).
filters – Row group filters for predicate pushdown (pyarrow only).
header_unit – Unit for headers (‘cm-1’, ‘nm’, ‘text’, etc.)
Example
>>> loader = ParquetLoader()
>>> result = loader.load(
...     Path("data.parquet"),
...     columns=["feature_1", "feature_2"],
... )
- load(path: Path, columns: List[str] | None = None, engine: str = 'auto', filters: List | None = None, header_unit: str = 'text', data_type: str = 'x', **params: Any) -> LoaderResult
Load data from a Parquet file.
- Parameters:
path – Path to the Parquet file or directory.
columns – List of column names to load. If None, loads all columns.
engine – Parquet engine (‘auto’, ‘pyarrow’, or ‘fastparquet’).
filters – Row group filters for predicate pushdown (pyarrow only).
header_unit – Unit type for headers.
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
**params – Additional parameters passed to read_parquet.
- Returns:
LoaderResult with the loaded data.
- class nirs4all.data.loaders.TarLoader
Bases: FileLoader
Loader for tar archive files.
Supports:
  - Plain tar files (.tar)
  - Gzip-compressed tar files (.tar.gz, .tgz)
  - Bzip2-compressed tar files (.tar.bz2)
  - XZ-compressed tar files (.tar.xz)
- Parameters:
member – Name of the member file to extract. If None, auto-detects the first suitable file (prefers CSV).
encoding – Text encoding for the extracted file (default: ‘utf-8’).
inner_loader_params – Parameters to pass to the inner file loader.
Example
>>> loader = TarLoader()
>>> result = loader.load(
...     Path("data.tar.gz"),
...     member="data/train.csv",
... )
- load(path: Path, member: str | None = None, encoding: str = 'utf-8', header_unit: str = 'cm-1', data_type: str = 'x', **params: Any) -> LoaderResult
Load data from a tar archive.
- Parameters:
path – Path to the tar archive.
member – Name of the member to extract. If None, auto-detects.
encoding – Text encoding for extracted files.
header_unit – Unit type for headers.
data_type – Type of data (‘x’, ‘y’, or ‘metadata’).
**params – Additional parameters for the inner loader.
- Returns:
LoaderResult with the loaded data.
- nirs4all.data.loaders.get_loader_for_file(path: str | Path) -> FileLoader
Get the appropriate loader for a file.
- Parameters:
path – Path to the file.
- Returns:
Instance of the appropriate FileLoader subclass.
- Raises:
FormatNotSupportedError – If no loader supports the file format.
- nirs4all.data.loaders.get_supported_formats() -> Dict[str, List[str]]
Get all supported file formats and their extensions.
- Returns:
Dictionary mapping loader names to their supported extensions.
Example
>>> formats = get_supported_formats()
>>> for name, exts in formats.items():
...     print(f"{name}: {', '.join(exts)}")
- nirs4all.data.loaders.list_archive_members(path) -> List[str]
List members in an archive file.
- Parameters:
path – Path to the archive.
- Returns:
List of member names.
- Raises:
FileLoadError – If the archive cannot be read.
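Listing members for either archive family can be sketched with the standard library's zipfile and tarfile modules. This is an assumption about the general approach rather than the library's code; `list_members` is a hypothetical name and `ValueError` stands in for FileLoadError.

```python
import io
import tarfile
import tempfile
import zipfile
from pathlib import Path
from typing import List


def list_members(path: Path) -> List[str]:
    """Return member names of a zip or tar archive, detected by content."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    if tarfile.is_tarfile(path):
        # Default mode "r" transparently handles gzip, bzip2 and xz compression.
        with tarfile.open(path) as archive:
            return archive.getnames()
    raise ValueError(f"not a zip or tar archive: {path}")


# Create one archive of each kind in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"a;b\n1;2\n"

    tar_path = Path(tmp) / "data.tar.gz"
    with tarfile.open(tar_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="data/train.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    zip_path = Path(tmp) / "data.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("train/features.csv", payload)

    tar_members = list_members(tar_path)
    zip_members = list_members(zip_path)
```

Detecting the format from file content rather than from the extension keeps the sketch working for archives with unusual names.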
- nirs4all.data.loaders.load_csv(path, na_policy='auto', data_type='x', categorical_mode='auto', header_unit='cm-1', **user_params)
Loads a CSV file using specified or default parameters, cleans data, handles NA values, and performs type conversions.
- Parameters:
path (str or Path) – Path to the CSV file (.csv, .gz, .zip).
na_policy (str) – ‘remove’ or ‘abort’ (or ‘auto’ which acts like ‘remove’). This policy applies to row removal if NAs are found.
data_type (str) – ‘x’ or ‘y’. Influences type conversion.
categorical_mode (str) – How to handle string columns in ‘y’ data:
  - ‘auto’: Convert string columns to numerical categories.
  - ‘preserve’: Keep string columns (will become NaN if not convertible by final astype).
  - ‘none’: Treat all columns as potentially numeric.
header_unit (str) – Unit type of headers: “cm-1” (wavenumber), “nm” (wavelength), “none” (no headers), “text” (string headers), or “index” (feature indices). Default: “cm-1”.
**user_params – CSV parsing parameters (delimiter, decimal_separator, has_header) and other pandas.read_csv arguments.
- Returns:
DataFrame with processed data (before NA row removal).
Report dictionary.
Boolean Series indicating rows with NAs (aligned with the returned DataFrame).
List of column headers (or None if no headers).
Header unit string.
None if an error occurs before this stage.
- Return type:
(Union[pandas.DataFrame, None], dict, Union[pandas.Series, None], Union[List[str], None], str)
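The na_policy semantics described above (‘remove’ drops flagged rows, ‘abort’ refuses to continue, ‘auto’ acts like ‘remove’) can be illustrated without pandas. The sketch below works on plain lists and is not the load_csv implementation; `na_row_mask` and `apply_na_policy` are hypothetical names.

```python
import math
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def na_row_mask(rows: Sequence[Row]) -> List[bool]:
    """Flag each row that contains at least one missing value (None or NaN)."""
    return [
        any(v is None or (isinstance(v, float) and math.isnan(v)) for v in row)
        for row in rows
    ]


def apply_na_policy(rows: Sequence[Row], na_policy: str = "auto") -> Tuple[List[Row], List[bool]]:
    """Return (kept_rows, mask); the mask stays aligned with the input rows."""
    if na_policy not in ("auto", "remove", "abort"):
        raise ValueError(f"unknown na_policy: {na_policy!r}")
    mask = na_row_mask(rows)
    if na_policy == "abort" and any(mask):
        raise ValueError(f"{sum(mask)} row(s) contain missing values")
    # 'remove' and 'auto' drop flagged rows; 'abort' reaches here only when
    # nothing is flagged, so every row is kept.
    kept = [row for row, flagged in zip(rows, mask) if not flagged]
    return kept, mask


rows = [[1.0, 2.0], [float("nan"), 3.0], [4.0, None]]
kept, mask = apply_na_policy(rows, na_policy="remove")
```

Returning the mask alongside the data mirrors the documented return value, where the boolean Series is aligned with the DataFrame before row removal.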
- nirs4all.data.loaders.load_csv_new(path, na_policy: str = 'auto', data_type: str = 'x', categorical_mode: str = 'auto', header_unit: str = 'cm-1', **user_params)
Load a CSV file using the CSVLoader.
This function maintains backward compatibility with the original load_csv API.
- Parameters:
path – Path to the CSV file.
na_policy – How to handle NA values.
data_type – Type of data being loaded.
categorical_mode – How to handle categorical columns.
header_unit – Unit type for headers.
**user_params – Additional CSV parsing parameters.
- Returns:
Tuple of (DataFrame, report, na_mask, headers, header_unit).
- nirs4all.data.loaders.load_excel(path, sheet_name: str | int | None = 0, header: int | None = 0, skip_rows: int | None = None, skip_footer: int = 0, usecols: List[str] | List[int] | str | None = None, engine: str = 'auto', header_unit: str = 'text', **params)
Load an Excel file.
Convenience function for direct use.
- Parameters:
path – Path to the Excel file.
sheet_name – Sheet to load.
header – Row number for header.
skip_rows – Rows to skip at start.
skip_footer – Rows to skip at end.
usecols – Columns to load.
engine – Excel engine to use.
header_unit – Unit type for headers.
**params – Additional parameters.
- Returns:
Tuple of (DataFrame, report, na_mask, headers, header_unit).
- nirs4all.data.loaders.load_file(path: str | Path, **params: Any) -> Tuple[DataFrame | None, Dict[str, Any], Series | None, List[str], str]
Load a data file with automatic format detection.
This is the main entry point for loading files. It automatically detects the file format and uses the appropriate loader.
- Parameters:
path – Path to the file to load.
**params – Format-specific loading parameters. Common parameters include:
  - header_unit: Unit for headers (‘cm-1’, ‘nm’, ‘text’, etc.)
  - data_type: Type of data (‘x’, ‘y’, or ‘metadata’)
  - delimiter: CSV delimiter
  - sheet_name: Excel sheet to load
  - variable: MATLAB variable name
  - member: Archive member to extract
- Returns:
A tuple of:
  - DataFrame with loaded data (or None on error)
  - Report dictionary with loading metadata
  - NA mask Series (rows with missing values)
  - List of column headers
  - Header unit string
- Raises:
FormatNotSupportedError – If no loader supports the file format.
Example
>>> data, report, na_mask, headers, unit = load_file("data.csv")
>>> if report.get("error"):
...     print(f"Error: {report['error']}")
... else:
...     print(f"Loaded {data.shape[0]} samples with {data.shape[1]} features")
- nirs4all.data.loaders.load_matlab(path, variable: str | None = None, squeeze_me: bool = True, header_unit: str = 'index', **params)
Load a MATLAB .mat file.
Convenience function for direct use.
- Parameters:
path – Path to the MATLAB file.
variable – Name of the variable to load.
squeeze_me – Squeeze unit dimensions.
header_unit – Unit type for headers.
**params – Additional parameters.
- Returns:
Tuple of (DataFrame, report, na_mask, headers, header_unit).
- nirs4all.data.loaders.load_numpy(path, allow_pickle: bool = False, key: str | None = None, header_unit: str = 'index', **params)
Load a NumPy file.
Convenience function for backward compatibility.
- Parameters:
path – Path to the NumPy file.
allow_pickle – Whether to allow pickled objects.
key – For .npz files, the array key to load.
header_unit – Unit type for generated headers.
**params – Additional parameters.
- Returns:
Tuple of (DataFrame, report, na_mask, headers, header_unit).
- nirs4all.data.loaders.load_parquet(path, columns: List[str] | None = None, engine: str = 'auto', header_unit: str = 'text', **params)
Load a Parquet file.
Convenience function for direct use.
- Parameters:
path – Path to the Parquet file.
columns – Column names to load.
engine – Parquet engine to use.
header_unit – Unit type for headers.
**params – Additional parameters.
- Returns:
Tuple of (DataFrame, report, na_mask, headers, header_unit).
- nirs4all.data.loaders.register_loader(cls: Type[FileLoader]) -> Type[FileLoader]
Decorator to register a loader with the global registry.
Example
>>> @register_loader
... class MyLoader(FileLoader):
...     ...