nirs4all.data.loaders.base module

Base file loader interface and registry.

This module defines the abstract FileLoader base class and LoaderRegistry for a pluggable file loading system. It supports multiple file formats with automatic format detection and configurable loading parameters.

Phase 2 Implementation - Dataset Configuration Roadmap

class nirs4all.data.loaders.base.ArchiveHandler[source]

Bases: object

Utility class for handling compressed files and archives.

Supports:
  • Gzip compressed files (.gz)

  • Zip archives (.zip) with member selection

  • Tar archives (.tar, .tar.gz, .tgz, .tar.bz2) with member selection

static decompress_gzip(path: Path, encoding: str = 'utf-8') str[source]

Decompress a gzip file and return content as string.

Parameters:
  • path – Path to the gzip file.

  • encoding – Text encoding to use.

Returns:

Decompressed file content as string.

static decompress_gzip_bytes(path: Path) bytes[source]

Decompress a gzip file and return content as bytes.

Parameters:

path – Path to the gzip file.

Returns:

Decompressed file content as bytes.
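The text and bytes variants differ only in whether the decompressed stream is decoded. The following is a minimal standard-library sketch of that distinction, not the nirs4all implementation; the helper names read_gzip_text and read_gzip_bytes and the sample file name are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def read_gzip_text(path: Path, encoding: str = "utf-8") -> str:
    """Return the decompressed content of a gzip file as text."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return handle.read()


def read_gzip_bytes(path: Path) -> bytes:
    """Return the decompressed content of a gzip file as raw bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


# Round-trip demo on a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "spectra.csv.gz"
    with gzip.open(sample, "wb") as handle:
        handle.write(b"wavelength,absorbance\n1100,0.42\n")
    text = read_gzip_text(sample)
    raw = read_gzip_bytes(sample)
```

The bytes variant is the safer choice when the payload is binary or its encoding is not known in advance.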

static extract_bytes_from_tar(path: Path, member: str | None = None) bytes[source]

Extract a file from a tar archive as bytes.

Parameters:
  • path – Path to the tar file.

  • member – Name of the member to extract. If None, auto-detect.

Returns:

Content of the extracted file as bytes.

static extract_bytes_from_zip(path: Path, member: str | None = None) bytes[source]

Extract a file from a zip archive as bytes.

Parameters:
  • path – Path to the zip file.

  • member – Name of the member to extract. If None, auto-detect.

Returns:

Content of the extracted file as bytes.

static extract_from_tar(path: Path, member: str | None = None, encoding: str = 'utf-8') str[source]

Extract a file from a tar archive.

Parameters:
  • path – Path to the tar file.

  • member – Name of the member to extract. If None, auto-detect.

  • encoding – Text encoding to use.

Returns:

Content of the extracted file as string.

Raises:

FileLoadError – If no suitable member is found.
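Tar member extraction can be sketched with the standard tarfile module. This is an illustration only: the helper name read_tar_member is hypothetical, and the auto-detection rule used when member is None (take the first regular file) is an assumption, since the page does not say how ArchiveHandler chooses a member.

```python
import io
import tarfile
from typing import Optional


def read_tar_member(archive: io.BytesIO, member: Optional[str] = None) -> bytes:
    """Return one regular file from a tar archive as bytes."""
    # "r:*" lets tarfile detect gzip/bz2 compression transparently.
    with tarfile.open(fileobj=archive, mode="r:*") as tf:
        if member is None:
            # Assumed auto-detection rule: first regular file in the archive.
            files = [info for info in tf.getmembers() if info.isfile()]
            if not files:
                raise ValueError("archive contains no regular files")
            info = files[0]
        else:
            info = tf.getmember(member)  # raises KeyError if the name is missing
        extracted = tf.extractfile(info)
        if extracted is None:
            raise ValueError(f"{info.name} is not a regular file")
        return extracted.read()


# Build a small gzip-compressed tar archive in memory.
payload = b"wavelength,absorbance\n1100,0.42\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
    info = tarfile.TarInfo(name="spectra.csv")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
buffer.seek(0)

content = read_tar_member(buffer)
text = content.decode("utf-8")
```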

static extract_from_zip(path: Path, member: str | None = None, encoding: str = 'utf-8') str[source]

Extract a file from a zip archive.

Parameters:
  • path – Path to the zip file.

  • member – Name of the member to extract. If None, auto-detect.

  • encoding – Text encoding to use.

Returns:

Content of the extracted file as string.

Raises:

FileLoadError – If no suitable member is found.
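The same pattern for zip archives, again as a hedged standard-library sketch rather than the library's code. The helper name read_zip_member is hypothetical, and picking the first non-directory entry when member is None is an assumed rule.

```python
import io
import zipfile
from typing import Optional


def read_zip_member(archive: io.BytesIO, member: Optional[str] = None,
                    encoding: str = "utf-8") -> str:
    """Return one file from a zip archive, decoded as text."""
    with zipfile.ZipFile(archive) as zf:
        if member is None:
            # Assumed auto-detection rule: first entry that is not a directory.
            names = [info.filename for info in zf.infolist() if not info.is_dir()]
            if not names:
                raise ValueError("archive contains no files")
            member = names[0]
        return zf.read(member).decode(encoding)


# Build a two-member archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("X.csv", "a,b\n1,2\n")
    zf.writestr("Y.csv", "y\n3\n")

first = read_zip_member(buffer)
chosen = read_zip_member(buffer, member="Y.csv")
```

Passing member explicitly avoids depending on the order of entries in the archive.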

static is_archive(path: Path) bool[source]

Check if a file is an archive (contains multiple files).

static is_compressed(path: Path) bool[source]

Check if a file is compressed.
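One plausible way to implement these two checks is by file suffix. The sketch below assumes that "archive" means the zip and tar formats listed for ArchiveHandler and that "compressed" means a single gzip-compressed file; the suffix tables and function names are illustrative guesses, not the library's definitions.

```python
from pathlib import Path

# Assumed suffix table, taken from the formats listed for ArchiveHandler.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".tar.bz2")


def looks_like_archive(path: Path) -> bool:
    """True if the file name ends with a known multi-file archive suffix."""
    return path.name.lower().endswith(ARCHIVE_SUFFIXES)


def looks_like_gzip(path: Path) -> bool:
    """True for a single gzip-compressed file (assumed to exclude .tar.gz)."""
    return path.name.lower().endswith(".gz") and not looks_like_archive(path)


checks = {
    name: (looks_like_archive(Path(name)), looks_like_gzip(Path(name)))
    for name in ("set.tar.gz", "X.csv.gz", "X.csv", "bundle.ZIP")
}
```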

static list_tar_members(path: Path) List[str][source]

List members in a tar archive.

Parameters:

path – Path to the tar file.

Returns:

List of member names in the archive.

static list_zip_members(path: Path) List[str][source]

List members in a zip archive.

Parameters:

path – Path to the zip file.

Returns:

List of member names in the archive.
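Listing members before extraction is useful when an archive holds several candidate files. A short sketch with the standard zipfile and tarfile modules, using in-memory archives and hypothetical member names:

```python
import io
import tarfile
import zipfile

# Build a zip archive with two members in memory.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("X.csv", "a\n1\n")
    zf.writestr("Y.csv", "y\n2\n")

# Build an uncompressed tar archive with the same member names.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    for name in ("X.csv", "Y.csv"):
        payload = b"a\n1\n"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
tar_buffer.seek(0)

# List member names, in archive order.
with zipfile.ZipFile(zip_buffer) as zf:
    zip_names = zf.namelist()
with tarfile.open(fileobj=tar_buffer, mode="r") as tf:
    tar_names = tf.getnames()
```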

exception nirs4all.data.loaders.base.FileLoadError[source]

Bases: LoaderError

Raised when a file cannot be loaded.

class nirs4all.data.loaders.base.FileLoader[source]

Bases: ABC

Abstract base class for file loaders.

All file format loaders should inherit from this class and implement the required methods for loading and format detection.

Class Attributes:

  • supported_extensions – Tuple of file extensions this loader handles.

  • name – Human-readable name for the loader.

  • priority – Loading priority (lower = higher priority) when multiple loaders match. Default: 50.

Example

>>> class CSVLoader(FileLoader):
...     supported_extensions = (".csv",)
...     name = "CSV Loader"
...
...     @classmethod
...     def supports(cls, path: Path) -> bool:
...         return path.suffix.lower() in cls.supported_extensions
...
...     def load(self, path: Path, **params) -> LoaderResult:
...         # Load CSV file
...         pass

classmethod detect_format(path: Path) str | None[source]

Detect the file format from the path.

Parameters:

path – Path to analyze.

Returns:

Format name if detected, None otherwise.

classmethod get_base_path(path: Path) Path[source]

Get the base path without compression extensions.

For example, ‘data.csv.gz’ -> ‘data.csv’

Parameters:

path – Path to process.

Returns:

Path without compression extension(s).
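Stripping compression extensions can be sketched with pathlib alone. The set of suffixes treated as compression extensions here is an assumption, and strip_compression is a hypothetical name rather than the library's method.

```python
from pathlib import Path

# Assumed set of compression suffixes; the real method may recognize others.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def strip_compression(path: Path) -> Path:
    """Remove trailing compression suffixes, one at a time."""
    while path.suffix.lower() in COMPRESSION_SUFFIXES:
        path = path.with_suffix("")
    return path


stripped = strip_compression(Path("data.csv.gz"))
unchanged = strip_compression(Path("data.csv"))
```

A loader can then run its extension check against the stripped path, so that a compressed CSV is still recognized as CSV.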

abstractmethod load(path: Path, **params: Any) LoaderResult[source]

Load data from a file.

Parameters:
  • path – Path to the file to load.

  • **params – Loader-specific parameters.

Returns:

LoaderResult containing the loaded data and metadata.

Raises:

FileLoadError – If the file cannot be loaded.

name: ClassVar[str] = 'Base Loader'
priority: ClassVar[int] = 50
supported_extensions: ClassVar[Tuple[str, ...]] = ()
abstractmethod classmethod supports(path: Path) bool[source]

Check if this loader can handle the given file.

Parameters:

path – Path to the file to check.

Returns:

True if this loader can handle the file, False otherwise.
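The contract between the class attributes and the two abstract methods can be illustrated without importing nirs4all. In the sketch below, SketchLoader is a stand-in that mirrors the documented attributes and method signatures (with Any in place of LoaderResult); TextLoader is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, ClassVar, Tuple


class SketchLoader(ABC):
    """Stand-in for FileLoader; mirrors the documented interface only."""

    supported_extensions: ClassVar[Tuple[str, ...]] = ()
    name: ClassVar[str] = "Base Loader"
    priority: ClassVar[int] = 50

    @classmethod
    @abstractmethod
    def supports(cls, path: Path) -> bool:
        """Return True if this loader can handle the file."""

    @abstractmethod
    def load(self, path: Path, **params: Any) -> Any:
        """Load the file and return its content."""


class TextLoader(SketchLoader):
    """Hypothetical concrete loader for plain-text files."""

    supported_extensions = (".txt",)
    name = "Text Loader"

    @classmethod
    def supports(cls, path: Path) -> bool:
        return path.suffix.lower() in cls.supported_extensions

    def load(self, path: Path, **params: Any) -> str:
        return path.read_text(encoding=params.get("encoding", "utf-8"))


# The abstract base cannot be instantiated until both methods are implemented.
try:
    SketchLoader()
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False
```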

exception nirs4all.data.loaders.base.FormatNotSupportedError[source]

Bases: LoaderError

Raised when a file format is not supported.

exception nirs4all.data.loaders.base.LoaderError[source]

Bases: Exception

Base exception for loader errors.

class nirs4all.data.loaders.base.LoaderRegistry[source]

Bases: object

Registry for file loaders.

The registry maintains a list of available loaders and provides methods for finding the appropriate loader for a given file.

Example

>>> registry = LoaderRegistry()
>>> registry.register(CSVLoader)
>>> registry.register(ParquetLoader)
>>> loader = registry.get_loader(Path("data.csv"))
>>> result = loader.load(Path("data.csv"))

static __new__(cls) LoaderRegistry[source]

Implement singleton pattern.

clear() None[source]

Clear all registered loaders (mainly for testing).

classmethod get_instance() LoaderRegistry[source]

Get the singleton registry instance.
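A singleton implemented through __new__ typically looks like the sketch below. SketchRegistry is a stand-in, not the nirs4all class, and the attribute names _instance and _loaders are assumptions.

```python
class SketchRegistry:
    """Stand-in registry showing a __new__-based singleton."""

    _instance = None  # assumed name for the cached instance

    def __new__(cls):
        # Create the instance once, then hand back the same object every time.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._loaders = []
        return cls._instance

    @classmethod
    def get_instance(cls) -> "SketchRegistry":
        return cls()


first = SketchRegistry()
second = SketchRegistry.get_instance()
first._loaders.append("demo")
```

Because every call returns the same object, state registered through one reference is visible through all others, which is why a clear() method is handy in tests.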

get_loader(path: str | Path) FileLoader[source]

Get the appropriate loader for a file.

Parameters:

path – Path to the file to load.

Returns:

An instance of the appropriate loader.

Raises:

FormatNotSupportedError – If no loader supports the file format.
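Given the documented priority attribute (lower value = higher priority), loader selection can be sketched as filtering by supports() and taking the lowest priority. This is an assumed reading of the lookup, not the library's code; UnsupportedFormat stands in for FormatNotSupportedError, and both loader classes are hypothetical.

```python
from pathlib import Path
from typing import List, Union


class UnsupportedFormat(Exception):
    """Stand-in for FormatNotSupportedError."""


class GenericTextLoader:
    priority = 50

    @classmethod
    def supports(cls, path: Path) -> bool:
        return path.suffix.lower() in (".csv", ".txt")


class StrictCSVLoader:
    priority = 10  # lower value wins when several loaders match

    @classmethod
    def supports(cls, path: Path) -> bool:
        return path.suffix.lower() == ".csv"


def pick_loader(loaders: List[type], path: Union[str, Path]):
    """Return an instance of the matching loader with the lowest priority value."""
    path = Path(path)
    matches = [cls for cls in loaders if cls.supports(path)]
    if not matches:
        raise UnsupportedFormat(f"no loader supports {path.suffix!r}")
    return min(matches, key=lambda cls: cls.priority)()


loaders = [GenericTextLoader, StrictCSVLoader]
csv_loader = pick_loader(loaders, "data.csv")
txt_loader = pick_loader(loaders, "notes.txt")
```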

get_registered_loaders() List[Type[FileLoader]][source]

Get all registered loader classes.

Returns:

List of registered loader classes.

get_supported_extensions() List[str][source]

Get all supported file extensions.

Returns:

List of supported extensions across all registered loaders.

load(path: str | Path, **params: Any) LoaderResult[source]

Load a file using the appropriate loader.

This is a convenience method that finds the right loader and loads the file.

Parameters:
  • path – Path to the file to load.

  • **params – Loading parameters to pass to the loader.

Returns:

LoaderResult containing the loaded data.

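Raises:
  • FormatNotSupportedError – If no loader supports the file format.

  • FileLoadError – If the selected loader cannot load the file.

register(loader_class: Type[FileLoader]) None[source]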

Register a file loader.

Parameters:

loader_class – The loader class to register.

unregister(loader_class: Type[FileLoader]) None[source]

Unregister a file loader.

Parameters:

loader_class – The loader class to unregister.

class nirs4all.data.loaders.base.LoaderResult(data: DataFrame | None = None, report: Dict[str, Any] | None = None, na_mask: Series | None = None, headers: List[str] | None = None, header_unit: str = 'cm-1')[source]

Bases: object

Result container for file loading operations.

data

The loaded data as a pandas DataFrame.

report

Dictionary containing loading metadata and diagnostics.

na_mask

Boolean Series indicating rows with NA values.

headers

List of column headers.

header_unit

The unit type for headers (e.g., ‘cm-1’, ‘nm’).

property error: str | None

Get error message if loading failed.

property success: bool

Check if loading was successful.
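The result container can be pictured as a small dataclass with two derived properties. The sketch below is a stand-in: how the real class derives error and success is not documented here, so reading the error from report["error"] and requiring data to be present are assumptions. Any replaces the pandas types to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SketchResult:
    """Stand-in for LoaderResult with the documented constructor fields."""

    data: Optional[Any] = None            # a pandas DataFrame in the real class
    report: Optional[Dict[str, Any]] = None
    na_mask: Optional[Any] = None         # a pandas Series in the real class
    headers: Optional[List[str]] = None
    header_unit: str = "cm-1"

    @property
    def error(self) -> Optional[str]:
        # Assumption: failures are recorded under report["error"].
        return (self.report or {}).get("error")

    @property
    def success(self) -> bool:
        # Assumption: success means data is present and no error was recorded.
        return self.data is not None and self.error is None


ok = SketchResult(data=[[0.42, 0.43]], headers=["1100", "1102"], header_unit="nm")
failed = SketchResult(report={"error": "file not found"})
```

Callers would check success before touching data, and read error for a diagnostic message otherwise.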

nirs4all.data.loaders.base.register_loader(cls: Type[FileLoader]) Type[FileLoader][source]

Decorator to register a loader with the global registry.

Example

>>> @register_loader
... class MyLoader(FileLoader):
...     ...