nirs4all.data.features module

class nirs4all.data.features.Features(cache: bool = False)

Bases: object

Manages N aligned NumPy sources and a Polars index.

This class coordinates multiple FeatureSource objects, ensuring they remain aligned in terms of sample count while allowing different feature dimensions and processing pipelines per source.

sources: List of FeatureSource objects managing individual feature arrays.

cache: Whether to enable caching for operations.

add_samples(data, headers, header_unit)

Add samples to all sources, ensuring alignment.

Parameters:

data – Single 2D array or list of 2D arrays, one per source.
headers – Optional feature headers. Single list applies to all sources, or list of lists for per-source headers.
header_unit – Optional unit type for headers (“cm-1”, “nm”, “none”, “text”, “index”). Single string applies to all sources, or list for per-source units.

Raises:

ValueError – If number of data arrays doesn’t match existing sources, or if headers/units lists don’t match number of sources.
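
The "single value applies to all sources, or one value per source" rule for headers and units could be sketched as follows. Both helper names are hypothetical and the logic is an assumption about the behaviour described above, not the library's actual code.

```python
def per_source_headers(headers, n_sources):
    """Expand headers to one list per source (hypothetical helper)."""
    if headers is None:
        return [None] * n_sources
    # A flat list of strings is a single header list shared by every source.
    if all(isinstance(h, str) for h in headers):
        return [list(headers) for _ in range(n_sources)]
    # Otherwise it must be a list of lists, one per source.
    if len(headers) != n_sources:
        raise ValueError(f"Expected {n_sources} header lists, got {len(headers)}")
    return [list(h) for h in headers]

def per_source_units(unit, n_sources):
    """Expand a unit string to one unit per source (hypothetical helper)."""
    if unit is None:
        return [None] * n_sources
    if isinstance(unit, str):
        return [unit] * n_sources
    if len(unit) != n_sources:
        raise ValueError(f"Expected {n_sources} units, got {len(unit)}")
    return list(unit)

shared = per_source_headers(["1000", "1002"], n_sources=2)
units = per_source_units(["nm", "cm-1"], n_sources=2)
```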

add_samples_batch_3d(data: ndarray | List[ndarray]) → None

Add multiple samples with 3D data in a single operation, in O(N) rather than O(N²) time.

This method is optimized for bulk insertion of augmented samples where each sample may have multiple processings. It is much faster than calling add_samples() in a loop.

Parameters:

data – Single 3D array of shape (n_samples, n_processings, n_features) or list of 3D arrays for multi-source datasets.

Raises:

ValueError – If number of data arrays doesn’t match existing sources, or if data dimensions don’t match.
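
The complexity difference can be seen with plain NumPy, assuming the features live in a contiguous array that has to be reallocated whenever it grows (an assumption about the storage, not something stated here). Appending one sample at a time copies everything stored so far on every step, while a single batch append copies once.

```python
import numpy as np

rng = np.random.default_rng(0)
existing = rng.normal(size=(4, 2, 6))   # (n_samples, n_processings, n_features)
new_batch = rng.normal(size=(3, 2, 6))  # three augmented samples

# Loop version: every concatenate copies all rows stored so far,
# so the total work grows quadratically with the number of added samples.
looped = existing
for sample in new_batch:
    looped = np.concatenate([looped, sample[np.newaxis]], axis=0)

# Batch version: one allocation and one copy, linear in the number of samples.
batched = np.concatenate([existing, new_batch], axis=0)
```

Both versions produce the same array; only the amount of copying differs.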

augment_samples(sample_indices: List[int], data: ndarray | list[ndarray], processings: list[str], count: int | List[int]) → None

Create augmented samples from existing ones.

Parameters:

sample_indices – List of sample indices to augment.
data – Augmented feature data (single array or list of arrays for multi-source).
processings – Processing names for the augmented data.
count – Number of augmentations per sample: a single int applied to every sample, or a list with one count per entry in sample_indices.
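
How an int or per-sample list for count maps augmented rows back to their origin samples might look like the sketch below. `expand_origins` is a hypothetical helper, and the assumption that augmented rows are ordered origin by origin is mine, not taken from the source.

```python
import numpy as np

def expand_origins(sample_indices, count):
    """Return the origin sample index for every augmented row (hypothetical helper)."""
    if isinstance(count, int):
        counts = [count] * len(sample_indices)
    else:
        counts = list(count)
    if len(counts) != len(sample_indices):
        raise ValueError("count list must have one entry per sample index")
    # np.repeat repeats each index by its own count, keeping origin order.
    return np.repeat(sample_indices, counts)

uniform = expand_origins([0, 3], 2)            # two augmentations for each sample
varied = expand_origins([0, 3, 5], [1, 0, 2])  # sample 3 is not augmented at all
```

Under this assumption, the length of the result is the number of rows expected in data.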

headers(src: int) → List[str]

Get the list of feature headers for a specific source.

Parameters:

src – Source index.

Returns:

List of header strings for the specified source.

property headers_list: List[List[str]] | List[str]

Get the list of feature headers per source.

Returns:

List of header lists, one per source.

keep_sources(source_indices: int | List[int]) → None

Keep only specified sources, removing all others.

Used after merge operations with output_as="features" to consolidate to a single source.

Parameters:

source_indices – Single source index or list of source indices to keep.

Raises:

ValueError – If no sources exist or source indices are invalid.
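
The selection and validation rules can be sketched on a plain Python list. The function `keep` is hypothetical and returns a new list instead of mutating in place; the real method's internals are not shown here.

```python
def keep(sources, source_indices):
    """Return only the requested sources, validating the indices (hypothetical helper)."""
    if not sources:
        raise ValueError("No sources exist")
    # Accept a single index as well as a list of indices.
    indices = [source_indices] if isinstance(source_indices, int) else list(source_indices)
    for i in indices:
        if not 0 <= i < len(sources):
            raise ValueError(f"Invalid source index {i} for {len(sources)} sources")
    return [sources[i] for i in indices]

all_sources = ["merged", "nir", "raman"]  # placeholders standing in for FeatureSource objects
kept = keep(all_sources, 0)
```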

property num_features: List[int] | int

Get the number of features per source.

Returns:

Single int if only one source, otherwise list of ints (one per source).

property num_processings: List[int] | int

Get the number of unique processing IDs per source.

Returns:

Single int if only one source, otherwise list of ints (one per source).
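
Because num_features and num_processings return either an int or a list depending on the number of sources, calling code may want to normalize the value before iterating. A small sketch with a hypothetical helper name:

```python
def per_source(value):
    """Normalize an `int | list[int]` property value to one entry per source."""
    return [value] if isinstance(value, int) else list(value)

single = per_source(700)         # value shape for a single-source dataset
multi = per_source([700, 1200])  # value shape for a two-source dataset
```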

property num_samples: int

Get the number of samples (rows) across all sources.

Returns:

Number of samples in the first source (all sources have the same count).

property preprocessing_str: List[List[str]] | List[str]

Get the list of processing IDs per source.

Returns:

List of processing ID lists, one per source.

update_features(source_processings: list[str], features: list[ndarray] | list[list[ndarray]], processings: list[str], source: int = -1) → None

Update or add new feature processings to a specific source.

Parameters:

source_processings – List of existing processing names to replace. An empty string “” means the corresponding feature array is added as a new processing instead of replacing one.
features – Feature arrays to add or replace (single array or list of arrays).
processings – Target processing names for the features.
source – Source index to update. Defaults to -1; a negative value selects source 0.
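
The replace-or-add rule can be sketched with a dict that maps processing names to arrays. This is an illustration under assumptions: `apply_update` and the processing names are hypothetical, and replacing a processing is assumed to keep its position while taking the new name.

```python
import numpy as np

def apply_update(store, source_processings, features, processings):
    """Return a new name -> array mapping with replacements and additions applied."""
    if not (len(source_processings) == len(features) == len(processings)):
        raise ValueError("source_processings, features and processings must have equal length")
    updated = dict(store)
    for old, array, new in zip(source_processings, features, processings):
        if old == "":
            # Empty source name: add the array as a new processing.
            updated[new] = array
        elif old in updated:
            # Existing source name: swap name and data, keeping the position.
            updated = {(new if k == old else k): (array if k == old else v)
                       for k, v in updated.items()}
        else:
            raise KeyError(f"Unknown processing: {old!r}")
    return updated

raw = np.ones((4, 3))
store = {"raw": raw}
result = apply_update(store, ["raw", ""], [raw * 2, raw - 1], ["raw_scaled", "raw_shifted"])
```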

x(indices: list[int] | ndarray, layout: str = '2d', concat_source: bool = True) → ndarray | list[ndarray]

Retrieve feature data for specified samples.

Parameters:

indices – Sample indices to retrieve.
layout – Data layout format (“2d”, “2d_interleaved”, “3d”, “3d_transpose”).
concat_source – If True and multiple sources exist, concatenate along feature dimension.

Returns:

Feature array(s) in the requested layout. Single array if concat_source=True or only one source, otherwise list of arrays.

Raises:

ValueError – If no features are available.
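
The four layout names could correspond to the reshapes below. This is a sketch under assumptions: it starts from a block of shape (n_samples, n_processings, n_features), `to_layout` is a hypothetical helper, and the meaning given to "2d_interleaved" and "3d_transpose" is inferred from their names, not confirmed by the text above.

```python
import numpy as np

def to_layout(block, layout):
    """Reshape a (n_samples, n_processings, n_features) block (hypothetical helper)."""
    n = block.shape[0]
    if layout == "2d":
        # Processings laid end to end: p0f0, p0f1, ..., p1f0, p1f1, ...
        return block.reshape(n, -1)
    if layout == "2d_interleaved":
        # Features grouped together: p0f0, p1f0, p0f1, p1f1, ...
        return block.transpose(0, 2, 1).reshape(n, -1)
    if layout == "3d":
        return block
    if layout == "3d_transpose":
        # (n_samples, n_features, n_processings)
        return block.transpose(0, 2, 1)
    raise ValueError(f"Unknown layout: {layout!r}")

block = np.arange(6).reshape(1, 2, 3)  # 1 sample, 2 processings, 3 features
flat = to_layout(block, "2d")
interleaved = to_layout(block, "2d_interleaved")

# Assumed effect of concat_source=True: join sources along the feature axis.
other = np.arange(4).reshape(1, 2, 2)
joined = np.concatenate([to_layout(block, "2d"), to_layout(other, "2d")], axis=-1)
```

With concat_source=False, the two per-source arrays would presumably be returned as a list without the final concatenation.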