nirs4all.data.loaders.base module
Base file loader interface and registry.
This module defines the abstract FileLoader base class and LoaderRegistry for a pluggable file loading system. It supports multiple file formats with automatic format detection and configurable loading parameters.
Phase 2 Implementation - Dataset Configuration Roadmap
- class nirs4all.data.loaders.base.ArchiveHandler
Bases: object
Utility class for handling compressed files and archives.
Supports:
- Gzip compressed files (.gz)
- Zip archives (.zip) with member selection
- Tar archives (.tar, .tar.gz, .tgz, .tar.bz2) with member selection
- static decompress_gzip(path: Path, encoding: str = 'utf-8') -> str
Decompress a gzip file and return content as string.
- Parameters:
path – Path to the gzip file.
encoding – Text encoding to use.
- Returns:
Decompressed file content as string.
- static decompress_gzip_bytes(path: Path) -> bytes
Decompress a gzip file and return content as bytes.
- Parameters:
path – Path to the gzip file.
- Returns:
Decompressed file content as bytes.
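The two gzip helpers can be pictured with the standard-library gzip module alone. The sketch below reuses the documented names for readability, but the function bodies are an assumption and not the nirs4all implementation.

```python
import gzip
import tempfile
from pathlib import Path


def decompress_gzip_bytes(path: Path) -> bytes:
    """Return the raw decompressed content of a gzip file."""
    with gzip.open(path, "rb") as fh:
        return fh.read()


def decompress_gzip(path: Path, encoding: str = "utf-8") -> str:
    """Return the decompressed content decoded as text."""
    return decompress_gzip_bytes(path).decode(encoding)


# Round-trip a small CSV snippet through a temporary .gz file.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "data.csv.gz"
    with gzip.open(gz_path, "wb") as fh:
        fh.write("a;b\n1;2\n".encode("utf-8"))
    gz_text = decompress_gzip(gz_path)
    gz_raw = decompress_gzip_bytes(gz_path)
```

The bytes variant is the natural choice for binary formats, while the string variant suits text formats such as CSV where an encoding must be chosen.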
- static extract_bytes_from_tar(path: Path, member: str | None = None) -> bytes
Extract a file from a tar archive as bytes.
- Parameters:
path – Path to the tar file.
member – Name of the member to extract. If None, auto-detect.
- Returns:
Content of the extracted file as bytes.
- static extract_bytes_from_zip(path: Path, member: str | None = None) -> bytes
Extract a file from a zip archive as bytes.
- Parameters:
path – Path to the zip file.
member – Name of the member to extract. If None, auto-detect.
- Returns:
Content of the extracted file as bytes.
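Member selection in a zip archive might look like the following sketch, written against the standard-library zipfile module. The auto-detect rule used here (first entry that is not a directory) is an assumption, since the rule nirs4all applies is not documented above, and FileLoadError is a local stand-in for the library's exception.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


class FileLoadError(Exception):
    """Local stand-in for nirs4all's FileLoadError."""


def extract_bytes_from_zip(path: Path, member: Optional[str] = None) -> bytes:
    with zipfile.ZipFile(path) as zf:
        if member is None:
            # Assumed auto-detect rule: first entry that is not a directory.
            candidates = [n for n in zf.namelist() if not n.endswith("/")]
            if not candidates:
                raise FileLoadError(f"no file member in {path}")
            member = candidates[0]
        try:
            return zf.read(member)
        except KeyError as exc:
            # ZipFile.read raises KeyError for an unknown member name.
            raise FileLoadError(f"member {member!r} not found in {path}") from exc


# Build a two-member archive and read it with and without a member name.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "bundle.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("X.csv", "1,2\n")
        zf.writestr("Y.csv", "3\n")
    first = extract_bytes_from_zip(zip_path)
    chosen = extract_bytes_from_zip(zip_path, member="Y.csv")
```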
- static extract_from_tar(path: Path, member: str | None = None, encoding: str = 'utf-8') -> str
Extract a file from a tar archive.
- Parameters:
path – Path to the tar file.
member – Name of the member to extract. If None, auto-detect.
encoding – Text encoding to use.
- Returns:
Content of the extracted file as string.
- Raises:
FileLoadError – If no suitable member is found.
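A text extraction from a tar archive can be sketched with the standard-library tarfile module. As before, the auto-detect rule (first regular file) and the FileLoadError class are assumptions made for illustration, not the library's code.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Optional


class FileLoadError(Exception):
    """Local stand-in for nirs4all's FileLoadError."""


def extract_from_tar(path: Path, member: Optional[str] = None,
                     encoding: str = "utf-8") -> str:
    # Mode "r:*" lets tarfile detect gzip or bzip2 compression on its own.
    with tarfile.open(path, "r:*") as tf:
        if member is None:
            # Assumed auto-detect rule: first regular file in the archive.
            files = [m.name for m in tf.getmembers() if m.isfile()]
            if not files:
                raise FileLoadError(f"no suitable member in {path}")
            member = files[0]
        try:
            handle = tf.extractfile(member)
        except KeyError as exc:
            raise FileLoadError(f"member {member!r} not found in {path}") from exc
        if handle is None:  # member exists but is not a regular file
            raise FileLoadError(f"member {member!r} is not a file")
        return handle.read().decode(encoding)


# Create a small .tar.gz in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = Path(tmp) / "bundle.tar.gz"
    payload = b"wavelength,absorbance\n1100,0.42\n"
    with tarfile.open(tar_path, "w:gz") as tf:
        info = tarfile.TarInfo(name="spectra.csv")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    tar_text = extract_from_tar(tar_path)
```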
- static extract_from_zip(path: Path, member: str | None = None, encoding: str = 'utf-8') -> str
Extract a file from a zip archive.
- Parameters:
path – Path to the zip file.
member – Name of the member to extract. If None, auto-detect.
encoding – Text encoding to use.
- Returns:
Content of the extracted file as string.
- Raises:
FileLoadError – If no suitable member is found.
- static is_archive(path: Path) -> bool
Check if a file is an archive (contains multiple files).
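One plausible way to implement such a check is a suffix test against the archive extensions listed for ArchiveHandler. The sketch below assumes the decision is made from the file name only, which the documentation does not confirm; a plain .gz file is treated as a single compressed file rather than an archive.

```python
from pathlib import Path

# Archive extensions taken from the ArchiveHandler description.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".tar.bz2")


def is_archive(path: Path) -> bool:
    """Return True when the file name ends with a known archive extension."""
    return path.name.lower().endswith(ARCHIVE_SUFFIXES)


checks = {
    name: is_archive(Path(name))
    for name in ("spectra.tar.gz", "spectra.csv.gz", "SPECTRA.ZIP", "spectra.csv")
}
```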
- exception nirs4all.data.loaders.base.FileLoadError
Bases: LoaderError
Raised when a file cannot be loaded.
- class nirs4all.data.loaders.base.FileLoader
Bases: ABC
Abstract base class for file loaders.
All file format loaders should inherit from this class and implement the required methods for loading and format detection.
- Class Attributes:
supported_extensions – Tuple of file extensions this loader handles.
name – Human-readable name for the loader.
priority – Loading priority (lower = higher priority) when multiple loaders match. Default: 50.
Example
>>> class CSVLoader(FileLoader):
...     supported_extensions = (".csv",)
...     name = "CSV Loader"
...
...     @classmethod
...     def supports(cls, path: Path) -> bool:
...         return path.suffix.lower() in cls.supported_extensions
...
...     def load(self, path: Path, **params) -> LoaderResult:
...         # Load CSV file
...         pass
- classmethod detect_format(path: Path) -> str | None
Detect the file format from the path.
- Parameters:
path – Path to analyze.
- Returns:
Format name if detected, None otherwise.
- classmethod get_base_path(path: Path) -> Path
Get the base path without compression extensions.
For example, ‘data.csv.gz’ -> ‘data.csv’
- Parameters:
path – Path to process.
- Returns:
Path without compression extension(s).
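The stripping of compression extensions can be illustrated with pathlib. The set of extensions below is an assumption chosen for the example; the documentation only confirms the '.gz' case.

```python
from pathlib import Path

# Assumed set of compression extensions; only ".gz" is confirmed by the docs.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}


def get_base_path(path: Path) -> Path:
    """Drop trailing compression extensions, e.g. data.csv.gz -> data.csv."""
    while path.suffix.lower() in COMPRESSION_SUFFIXES:
        path = path.with_suffix("")
    return path


single = get_base_path(Path("data.csv.gz"))
untouched = get_base_path(Path("data.csv"))
```

With the base path in hand, a loader can match on the underlying format extension ('.csv' here) even when the file on disk is compressed.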
- abstractmethod load(path: Path, **params: Any) -> LoaderResult
Load data from a file.
- Parameters:
path – Path to the file to load.
**params – Loader-specific parameters.
- Returns:
LoaderResult containing the loaded data and metadata.
- Raises:
FileLoadError – If the file cannot be loaded.
- exception nirs4all.data.loaders.base.FormatNotSupportedError
Bases: LoaderError
Raised when a file format is not supported.
- exception nirs4all.data.loaders.base.LoaderError
Bases: Exception
Base exception for loader errors.
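Because FileLoadError and FormatNotSupportedError both derive from LoaderError, callers can handle the specific cases first and fall back to the base class. The sketch below redefines the three exceptions locally so it runs without nirs4all installed; classify is a hypothetical helper, not part of the module.

```python
class LoaderError(Exception):
    """Local mirror of the documented base exception."""


class FileLoadError(LoaderError):
    """Local mirror: a file cannot be loaded."""


class FormatNotSupportedError(LoaderError):
    """Local mirror: a file format is not supported."""


def classify(exc: Exception) -> str:
    """Hypothetical helper mapping a loader exception to a short label."""
    try:
        raise exc
    except FormatNotSupportedError:
        return "unsupported format"
    except FileLoadError:
        return "load failed"
    except LoaderError:
        # Catches any other subclass of the base exception.
        return "other loader error"


labels = [
    classify(FormatNotSupportedError("no loader for .xyz")),
    classify(FileLoadError("corrupt file")),
    classify(LoaderError("generic")),
]
```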
- class nirs4all.data.loaders.base.LoaderRegistry
Bases: object
Registry for file loaders.
The registry maintains a list of available loaders and provides methods for finding the appropriate loader for a given file.
Example
>>> registry = LoaderRegistry()
>>> registry.register(CSVLoader)
>>> registry.register(ParquetLoader)
>>> loader = registry.get_loader(Path("data.csv"))
>>> result = loader.load(Path("data.csv"))
- static __new__(cls) -> LoaderRegistry
Implement singleton pattern.
- classmethod get_instance() -> LoaderRegistry
Get the singleton registry instance.
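A singleton built on __new__ generally follows the pattern sketched below. SketchRegistry is a hypothetical class written for illustration; how LoaderRegistry stores its state internally is not documented here.

```python
class SketchRegistry:
    """Hypothetical singleton: every construction returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            # Initialise state once, on first creation only.
            cls._instance._loaders = []
        return cls._instance

    @classmethod
    def get_instance(cls):
        return cls()


a = SketchRegistry()
b = SketchRegistry.get_instance()
a._loaders.append("csv")
```

Since both names refer to one object, a loader registered through any reference is visible through all of them.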
- get_loader(path: str | Path) -> FileLoader
Get the appropriate loader for a file.
- Parameters:
path – Path to the file to load.
- Returns:
An instance of the appropriate loader.
- Raises:
FormatNotSupportedError – If no loader supports the file format.
- get_registered_loaders() -> List[Type[FileLoader]]
Get all registered loader classes.
- Returns:
List of registered loader classes.
- get_supported_extensions() -> List[str]
Get all supported file extensions.
- Returns:
List of supported extensions across all registered loaders.
- load(path: str | Path, **params: Any) -> LoaderResult
Load a file using the appropriate loader.
This is a convenience method that finds the right loader and loads the file.
- Parameters:
path – Path to the file to load.
**params – Loading parameters to pass to the loader.
- Returns:
LoaderResult containing the loaded data.
- Raises:
FormatNotSupportedError – If no loader supports the file format.
FileLoadError – If the file cannot be loaded.
- register(loader_class: Type[FileLoader]) -> None
Register a file loader.
- Parameters:
loader_class – The loader class to register.
- unregister(loader_class: Type[FileLoader]) -> None
Unregister a file loader.
- Parameters:
loader_class – The loader class to unregister.
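To make the interplay of register, unregister, get_loader and the priority attribute concrete, here is a minimal stand-alone sketch. All class names are hypothetical, and the selection rule (lowest priority value among loaders whose supports check passes) is an assumption based on the "lower = higher priority" note in the FileLoader description.

```python
from pathlib import Path


class FormatNotSupportedError(Exception):
    """Local stand-in for the documented exception."""


class SketchLoader:
    supported_extensions = ()
    name = "base"
    priority = 50  # documented default

    @classmethod
    def supports(cls, path: Path) -> bool:
        return path.suffix.lower() in cls.supported_extensions


class GenericCSV(SketchLoader):
    supported_extensions = (".csv",)
    name = "generic"


class FastCSV(SketchLoader):
    supported_extensions = (".csv",)
    name = "fast"
    priority = 10  # lower value wins


class SketchRegistry:
    def __init__(self):
        self._loaders = []

    def register(self, loader_class):
        if loader_class not in self._loaders:
            self._loaders.append(loader_class)
            # Keep the list ordered so the first match is the preferred one.
            self._loaders.sort(key=lambda c: c.priority)

    def unregister(self, loader_class):
        if loader_class in self._loaders:
            self._loaders.remove(loader_class)

    def get_loader(self, path):
        path = Path(path)
        for loader_class in self._loaders:
            if loader_class.supports(path):
                return loader_class()
        raise FormatNotSupportedError(f"no loader for {path.suffix!r}")


registry = SketchRegistry()
registry.register(GenericCSV)
registry.register(FastCSV)
preferred = registry.get_loader("data.csv").name
registry.unregister(FastCSV)
fallback = registry.get_loader("data.csv").name
```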
- class nirs4all.data.loaders.base.LoaderResult(data: DataFrame | None = None, report: Dict[str, Any] | None = None, na_mask: Series | None = None, headers: List[str] | None = None, header_unit: str = 'cm-1')
Bases: object
Result container for file loading operations.
- data
The loaded data as a pandas DataFrame.
- report
Dictionary containing loading metadata and diagnostics.
- na_mask
Boolean Series indicating rows with NA values.
- headers
List of column headers.
- header_unit
The unit type for headers (e.g., ‘cm-1’, ‘nm’).
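The container's shape can be mirrored with a dataclass, as in the sketch below. Whether LoaderResult is actually a dataclass is an assumption; LoaderResultSketch is a hypothetical name, and the data and na_mask fields are typed as Any so the example runs without pandas.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LoaderResultSketch:
    """Hypothetical mirror of the documented fields and defaults."""

    data: Optional[Any] = None          # pandas DataFrame in the documented class
    report: Optional[Dict[str, Any]] = None
    na_mask: Optional[Any] = None       # pandas Series in the documented class
    headers: Optional[List[str]] = None
    header_unit: str = "cm-1"


empty = LoaderResultSketch()
nm_result = LoaderResultSketch(
    headers=["1100", "1102"],
    header_unit="nm",
    report={"n_rows": 0},
)
```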
- nirs4all.data.loaders.base.register_loader(cls: Type[FileLoader]) -> Type[FileLoader]
Decorator to register a loader with the global registry.
Example
>>> @register_loader
... class MyLoader(FileLoader):
...     ...