nirs4all.data.performance package

Module contents

Performance optimization module for dataset loading.

This module provides lazy loading, caching, and memory-mapped file support.

class nirs4all.data.performance.CacheEntry(data: ~typing.Any, key: str, timestamp: float = <factory>, size_bytes: int = 0, source_path: str | None = None, source_mtime: float | None = None, hit_count: int = 0)[source]

Bases: object

A cached data entry.

data

The cached data.

Type:

Any

key

Cache key.

Type:

str

timestamp

When the data was cached.

Type:

float

size_bytes

Estimated size in bytes.

Type:

int

source_path

Original file path (if applicable).

Type:

str | None

source_mtime

Modification time of source file.

Type:

float | None

hit_count

Number of times this entry was accessed.

Type:

int

data: Any
hit_count: int = 0
is_stale() bool[source]

Check whether the entry is stale, i.e. whether the source file has been modified since the data was cached.
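The staleness check can be illustrated by comparing a file's current modification time with the value recorded at caching time. The sketch below is an assumption about how such a check might work, not the library's implementation; the `Entry` class and the choice to treat a missing file as stale are hypothetical.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for a cache entry; not the library's class."""
    data: Any
    source_path: Optional[str] = None
    source_mtime: Optional[float] = None

    def is_stale(self) -> bool:
        # Without a recorded source there is nothing to compare against.
        if self.source_path is None or self.source_mtime is None:
            return False
        try:
            current = os.path.getmtime(self.source_path)
        except OSError:
            # Assumption: a source file that disappeared counts as stale.
            return True
        return current > self.source_mtime


# Write a small file and record its modification time in the entry.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as fh:
    fh.write("a,b\n1,2\n")
    path = fh.name

entry = Entry(data=[[1, 2]], source_path=path,
              source_mtime=os.path.getmtime(path))
fresh = entry.is_stale()

# Simulate a later edit by moving the file's mtime forward by 10 seconds.
later = entry.source_mtime + 10
os.utime(path, (later, later))
stale = entry.is_stale()

os.remove(path)
```

Here `fresh` is evaluated before the simulated edit and `stale` after it, so the two values differ only because the file's modification time moved forward.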

key: str
size_bytes: int = 0
source_mtime: float | None = None
source_path: str | None = None
timestamp: float
class nirs4all.data.performance.DataCache(max_size_mb: float = 500, max_entries: int = 100, ttl_seconds: float | None = None)[source]

Bases: object

LRU cache for loaded data.

Provides in-memory caching with:
  • Configurable size limits

  • LRU eviction policy

  • File modification detection

  • Thread-safe access

  • Cache statistics

Example

```python
from nirs4all.data.performance import DataCache

cache = DataCache(max_size_mb=500)

# Store data
cache.set("my_data", numpy_array, source_path="/path/to/file.csv")

# Retrieve data
data = cache.get("my_data")

# With automatic loading
data = cache.get_or_load("key", lambda: load_expensive_data())

# Check stats
print(cache.stats())
```
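To make the LRU eviction policy concrete, the following minimal sketch bounds a cache by entry count using `collections.OrderedDict`. It is an illustration of the general technique only: `TinyLRU` is a hypothetical class, and it does not model the size limit in megabytes, the TTL, or the thread safety that `DataCache` advertises.

```python
from collections import OrderedDict
from typing import Any, Optional


class TinyLRU:
    """Illustrative LRU cache bounded by entry count (hypothetical class)."""

    def __init__(self, max_entries: int = 2) -> None:
        self.max_entries = max_entries
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        # A hit moves the key to the most recently used end.
        self._items.move_to_end(key)
        return self._items[key]

    def set(self, key: str, data: Any) -> None:
        self._items[key] = data
        self._items.move_to_end(key)
        # Evict from the least recently used end until within the limit.
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)


lru = TinyLRU(max_entries=2)
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")      # "a" becomes the most recently used entry
lru.set("c", 3)   # over the limit: "b" is the least recently used, so it goes
remaining = list(lru._items)
```

After these calls `remaining` holds the keys `"a"` and `"c"`, because reading `"a"` protected it from eviction.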

clear() None[source]

Clear all cached data.

get(key: str) Any | None[source]

Get data from cache.

Parameters:

key – Cache key.

Returns:

Cached data or None if not found.

get_or_load(key: str, loader: Callable[[], T], source_path: str | None = None) T[source]

Return the cached value for key if present; otherwise call loader, cache its result, and return it.

Parameters:
  • key – Cache key.

  • loader – Function to call if not cached.

  • source_path – Optional source file path.

Returns:

Cached or newly loaded data.
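The get-or-load pattern can be sketched as a lookup followed by a guarded call to the loader. The module-level `_store`, `_lock`, and `expensive_load` names below are hypothetical, and holding the lock while loading is a simplification chosen for brevity, not a statement about how `DataCache` behaves internally.

```python
import threading
from typing import Any, Callable, Dict, List, TypeVar

T = TypeVar("T")

_store: Dict[str, Any] = {}
_lock = threading.Lock()


def get_or_load(key: str, loader: Callable[[], T]) -> T:
    """Return the cached value for key, calling loader only on a miss."""
    with _lock:
        if key in _store:
            return _store[key]
        # Loading under the lock keeps the sketch simple, but it blocks
        # other callers for the duration of the load.
        value = loader()
        _store[key] = value
        return value


calls: List[int] = []


def expensive_load() -> List[float]:
    calls.append(1)  # record each real load
    return [1.0, 2.0, 3.0]


first_result = get_or_load("spectra", expensive_load)
second_result = get_or_load("spectra", expensive_load)
```

The second call returns the object stored by the first one, so `expensive_load` runs only once.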

invalidate(key: str) bool[source]

Remove entry from cache.

Parameters:

key – Cache key.

Returns:

True if entry was removed.

set(key: str, data: Any, source_path: str | None = None) None[source]

Store data in cache.

Parameters:
  • key – Cache key.

  • data – Data to cache.

  • source_path – Optional source file path for staleness detection.

stats() Dict[str, Any][source]

Get cache statistics.

Returns:

Dictionary with cache statistics.

class nirs4all.data.performance.LazyArray(loader: Callable[[], ndarray], shape: Tuple[int, ...] | None = None, dtype: dtype | None = None, source_path: str | None = None)[source]

Bases: object

A lazy-loading array wrapper.

Defers loading until array data is actually accessed. Supports numpy array interface for compatibility.

Example

```python
import numpy as np
from nirs4all.data.performance import LazyArray

# Create lazy array
lazy = LazyArray(
    loader=lambda: np.load("large_file.npy"),
    shape=(10000, 500),
    dtype=np.float32,
)

# Array not loaded yet
print(lazy.shape)  # (10000, 500)

# Triggers loading on first access
data = lazy[0:100]  # Now loads the data

# Explicit loading
lazy.load()
full_data = lazy.data
```
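The deferred-loading behaviour described above can be sketched with the standard library alone. `LazySequence` is a hypothetical stand-in that wraps a plain list instead of a NumPy array, so it shows the load-on-first-access idea but not the NumPy array interface.

```python
from typing import Callable, List, Optional


class LazySequence:
    """Hypothetical lazy wrapper over a list; stands in for an array."""

    def __init__(self, loader: Callable[[], List[float]],
                 length: Optional[int] = None) -> None:
        self._loader = loader
        self._length = length  # if declared up front, len() need not load
        self._data: Optional[List[float]] = None

    @property
    def is_loaded(self) -> bool:
        return self._data is not None

    def load(self) -> List[float]:
        # Call the loader only when no data is currently held.
        if self._data is None:
            self._data = self._loader()
        return self._data

    def unload(self) -> None:
        # Drop the reference; the next access calls the loader again.
        self._data = None

    def __len__(self) -> int:
        if self._length is not None:
            return self._length
        return len(self.load())

    def __getitem__(self, key):
        return self.load()[key]


lazy_seq = LazySequence(lambda: [float(i) for i in range(1000)], length=1000)
size_before = len(lazy_seq)        # answered from the declared length
loaded_before = lazy_seq.is_loaded
head = lazy_seq[0:3]               # first real access triggers the loader
loaded_after = lazy_seq.is_loaded
lazy_seq.unload()
loaded_final = lazy_seq.is_loaded
```

Declaring the length up front plays the same role as passing `shape` to the constructor: metadata queries can be answered without reading the data.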

__array__(dtype=None)[source]

Support numpy array conversion.

__getitem__(key)[source]

Get item from array (triggers load).

__len__() int[source]

Get length (first dimension).

property data: ndarray

Get the loaded data (triggers load if needed).

property dtype: dtype | None

Get array dtype (may trigger load if unknown).

property is_loaded: bool

Check if data has been loaded.

load() ndarray[source]

Load the data if not already loaded.

Returns:

The loaded numpy array.

property ndim: int

Get number of dimensions.

property shape: Tuple[int, ...] | None

Get array shape (may trigger load if unknown).

unload() None[source]

Unload data to free memory.

class nirs4all.data.performance.LazyDataset(x_loader: Callable[[], ndarray] | None = None, y_loader: Callable[[], ndarray] | None = None, metadata_loader: Callable[[], Any] | None = None, x_shape: Tuple[int, ...] | None = None, y_shape: Tuple[int, ...] | None = None, name: str = 'dataset')[source]

Bases: object

A lazy-loading dataset wrapper.

Wraps multiple data components (X, y, metadata) as lazy arrays that load on demand.

Example

```python
from nirs4all.data.performance import LazyDataset

# Create from loader functions
dataset = LazyDataset(
    x_loader=lambda: load_features("X.csv"),
    y_loader=lambda: load_targets("Y.csv"),
    metadata_loader=lambda: load_metadata("M.csv"),
)

# Nothing loaded yet
print(dataset.x_shape)  # Returns cached shape if known

# Triggers X loading only
X_data = dataset.X

# Load everything
dataset.load_all()
```
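The point that accessing one component does not load the others can be illustrated with a small container that keeps one loader per component. `LazyParts`, `make_loader`, and `load_log` are hypothetical names used only for this sketch, which relies on plain lists and is not the library's implementation.

```python
from typing import Any, Callable, Dict, List, Optional

load_log: List[str] = []


class LazyParts:
    """Hypothetical container that loads each named component on demand."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders
        self._loaded: Dict[str, Any] = {}

    def get(self, name: str) -> Optional[Any]:
        if name not in self._loaders:
            return None  # component was never configured
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def load_all(self) -> None:
        for name in self._loaders:
            self.get(name)

    def unload_all(self) -> None:
        self._loaded.clear()


def make_loader(name: str, value: Any) -> Callable[[], Any]:
    def _load() -> Any:
        load_log.append(name)  # record which component was actually read
        return value
    return _load


parts = LazyParts({
    "X": make_loader("X", [[0.1, 0.2], [0.3, 0.4]]),
    "y": make_loader("y", [1.0, 2.0]),
})
features = parts.get("X")
log_after_x = list(load_log)    # only the X loader has run so far
parts.load_all()
log_after_all = list(load_log)  # y is loaded now; X is not loaded twice
```

Requesting a component that has no loader returns `None` in this sketch, which mirrors the `ndarray | None` return types of the `X` and `y` properties.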

property X: ndarray | None

Get features (triggers load if needed).

property is_metadata_loaded: bool

Check if metadata is loaded.

property is_x_loaded: bool

Check if X is loaded.

property is_y_loaded: bool

Check if y is loaded.

load_all() None[source]

Load all data components.

property metadata: Any | None

Get metadata (triggers load if needed).

property n_features: int

Get number of features.

property n_samples: int

Get number of samples.

unload_all() None[source]

Unload all data to free memory.

property x_shape: Tuple[int, ...] | None

Get X shape without loading.

property y: ndarray | None

Get targets (triggers load if needed).

property y_shape: Tuple[int, ...] | None

Get y shape without loading.

nirs4all.data.performance.cache_manager(max_size_mb: float = 500) DataCache[source]

Get or create the global cache instance.

Parameters:

max_size_mb – Maximum cache size in MB (only used when the global cache is first created).

Returns:

DataCache instance.
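The "get or create" behaviour, in which the size argument matters only on the first call, can be sketched as a lazily initialised shared instance. `SimpleCache`, `_state`, and `get_global_cache` are hypothetical names; the sketch assumes nothing about how `cache_manager` stores its instance.

```python
import threading
from typing import Dict, Optional


class SimpleCache:
    """Hypothetical cache type used only to illustrate the accessor."""

    def __init__(self, max_size_mb: float) -> None:
        self.max_size_mb = max_size_mb


# A one-slot holder for the shared instance, guarded by a lock.
_state: Dict[str, Optional[SimpleCache]] = {"cache": None}
_state_lock = threading.Lock()


def get_global_cache(max_size_mb: float = 500) -> SimpleCache:
    """Create the shared cache on first call; later calls return it as is."""
    with _state_lock:
        if _state["cache"] is None:
            _state["cache"] = SimpleCache(max_size_mb)
        return _state["cache"]


cache_a = get_global_cache(max_size_mb=100)
cache_b = get_global_cache(max_size_mb=900)  # size argument is ignored here
```

Both names refer to the same object, and its limit stays at the value given on the first call, so code that needs a different limit has to construct its own cache instead of using the shared one.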