nirs4all.data.features module

class nirs4all.data.features.Features(cache: bool = False)

Bases: object

Manages N aligned NumPy feature sources together with a Polars index.

This class coordinates multiple FeatureSource objects, ensuring they remain aligned in terms of sample count while allowing different feature dimensions and processing pipelines per source.

sources

List of FeatureSource objects managing individual feature arrays.

cache

Whether to enable caching for operations.
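The alignment requirement described above can be illustrated with plain NumPy arrays. This is a minimal sketch, not the library's implementation: the arrays, the `check_aligned` helper, and the variable names are all hypothetical. It only shows the invariant that every source shares the same sample count while feature counts may differ.

```python
import numpy as np

# Hypothetical data: two sources with the same number of samples (rows)
# but different numbers of features (columns).
nir = np.random.default_rng(0).normal(size=(5, 100))   # 5 samples, 100 features
raman = np.random.default_rng(1).normal(size=(5, 40))  # 5 samples, 40 features
sources = [nir, raman]


def check_aligned(arrays):
    """Return the shared sample count, or raise if the sources disagree."""
    counts = {a.shape[0] for a in arrays}
    if len(counts) != 1:
        raise ValueError(f"Sources are misaligned: sample counts {sorted(counts)}")
    return counts.pop()


n_samples = check_aligned(sources)            # shared row count
feature_dims = [a.shape[1] for a in sources]  # may differ per source
```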

add_samples(data: ndarray | list[ndarray], headers: List[str] | List[List[str]] | None = None, header_unit: str | List[str] | None = None) → None

Add samples to all sources, ensuring alignment.

Parameters:
  • data – Single 2D array or list of 2D arrays, one per source.

  • headers – Optional feature headers. Single list applies to all sources, or list of lists for per-source headers.

  • header_unit – Optional unit type for headers (“cm-1”, “nm”, “none”, “text”, “index”). Single string applies to all sources, or list for per-source units.

Raises:

ValueError – If number of data arrays doesn’t match existing sources, or if headers/units lists don’t match number of sources.
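The "single value applies to all sources, or one entry per source" rule for headers and header_unit can be sketched as a small broadcasting helper. The `per_source` function and `is_single_header_list` predicate below are hypothetical names introduced for illustration; how the library performs this check internally is not shown on this page.

```python
def per_source(value, n_sources, is_single):
    """Broadcast a single value to every source, or check a per-source list."""
    if value is None:
        return [None] * n_sources
    if is_single(value):
        # One value shared by all sources.
        return [value] * n_sources
    if len(value) != n_sources:
        raise ValueError(f"Expected {n_sources} entries, got {len(value)}")
    return list(value)


def is_single_header_list(h):
    """A flat list of strings counts as one header list for all sources."""
    return len(h) == 0 or isinstance(h[0], str)


headers = per_source(["1000", "1002"], 2, is_single_header_list)
units = per_source("cm-1", 2, lambda u: isinstance(u, str))
```

A list of lists such as `[["1000"], ["a", "b"]]` would pass through unchanged for two sources, and a per-source list of the wrong length raises ValueError, mirroring the documented behaviour.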

add_samples_batch_3d(data: ndarray | List[ndarray]) → None

Add multiple samples with 3D data in a single operation, in O(N) rather than O(N²) time.

This method is optimized for bulk insertion of augmented samples where each sample may have multiple processings. Much faster than calling add_samples() in a loop.

Parameters:

data – Single 3D array of shape (n_samples, n_processings, n_features) or list of 3D arrays for multi-source datasets.

Raises:

ValueError – If number of data arrays doesn’t match existing sources, or if data dimensions don’t match.
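The complexity difference comes from how often existing data is copied. The sketch below uses plain NumPy and hypothetical helper names. It is not the library's code, only an illustration of why appending one sample at a time tends toward quadratic cost while a single concatenation stays linear.

```python
import numpy as np


def append_in_loop(base, new_samples):
    """Grow the array one sample at a time; every step copies all rows so far."""
    out = base
    for sample in new_samples:  # each sample has shape (n_processings, n_features)
        out = np.concatenate([out, sample[np.newaxis, ...]], axis=0)
    return out


def append_in_batch(base, new_samples):
    """Concatenate all new samples in one call; existing rows are copied once."""
    return np.concatenate([base, new_samples], axis=0)


base = np.zeros((10, 2, 50))  # (n_samples, n_processings, n_features)
new = np.ones((200, 2, 50))   # 200 augmented samples, same trailing shape

looped = append_in_loop(base, new)
batched = append_in_batch(base, new)
```

Both paths produce the same array; only the number of intermediate copies differs.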

augment_samples(sample_indices: List[int], data: ndarray | list[ndarray], processings: list[str], count: int | List[int]) → None

Create augmented samples from existing ones.

Parameters:
  • sample_indices – List of sample indices to augment

  • data – Augmented feature data (single array or list of arrays for multi-source)

  • processings – Processing names for the augmented data

  • count – Number of augmentations per sample: a single int applied to every sample, or a list with one count per sample.
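The two accepted forms of count can be read as "repeat each origin index this many times". The helper below is a hypothetical sketch of that reading. The order in which the library expects augmented rows in data is an assumption here and is not confirmed by this page.

```python
import numpy as np


def expand_counts(sample_indices, count):
    """Map each augmented row to its origin sample (hypothetical helper)."""
    if isinstance(count, int):
        counts = [count] * len(sample_indices)  # same count for every sample
    else:
        counts = list(count)                    # one count per sample
        if len(counts) != len(sample_indices):
            raise ValueError("count list must match sample_indices in length")
    return np.repeat(np.asarray(sample_indices), counts)


uniform = expand_counts([3, 7], 2)          # two augmentations for each sample
per_sample = expand_counts([3, 7], [1, 3])  # one for sample 3, three for sample 7
```

Under this reading, the total number of rows in data would equal the sum of the counts.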

headers(src: int) → List[str]

Get the list of feature headers for a specific source.

Parameters:

src – Source index.

Returns:

List of header strings for the specified source.

property headers_list: List[List[str]] | List[str]

Get the list of feature headers per source.

Returns:

List of header lists, one per source.

keep_sources(source_indices: int | List[int]) → None

Keep only specified sources, removing all others.

Used after merge operations with output_as="features" to consolidate to a single source.

Parameters:

source_indices – Single source index or list of source indices to keep.

Raises:

ValueError – If no sources exist or source indices are invalid.
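The validation described above can be sketched on a plain Python list. The `select_sources` function is hypothetical and returns a new list instead of mutating state. Rejecting negative indices and preserving the requested order are assumptions of this sketch, not documented behaviour.

```python
def select_sources(sources, source_indices):
    """Return only the requested sources (hypothetical stand-in)."""
    if not sources:
        raise ValueError("No sources exist")
    if isinstance(source_indices, int):
        source_indices = [source_indices]  # accept a single index
    invalid = [i for i in source_indices if not 0 <= i < len(sources)]
    if invalid:
        raise ValueError(f"Invalid source indices: {invalid}")
    return [sources[i] for i in source_indices]


kept_one = select_sources(["nir", "raman", "mir"], 1)
kept_two = select_sources(["nir", "raman", "mir"], [0, 2])
```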

property num_features: List[int] | int

Get the number of features per source.

Returns:

Single int if only one source, otherwise list of ints (one per source).

property num_processings: List[int] | int

Get the number of unique processing IDs per source.

Returns:

Single int if only one source, otherwise list of ints (one per source).
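Because num_features and num_processings return either an int or a list depending on the number of sources, calling code may want one uniform shape. A small normalizing helper, hypothetical and not part of the documented API, could look like this:

```python
def as_per_source_list(value):
    """Wrap a single int in a list so callers can always iterate per source."""
    return [value] if isinstance(value, int) else list(value)


single_source = as_per_source_list(100)      # value from a one-source dataset
multi_source = as_per_source_list([100, 40]) # value from a two-source dataset
```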

property num_samples: int

Get the number of samples (rows) across all sources.

Returns:

Number of samples in the first source (all sources have the same count).

property preprocessing_str: List[List[str]] | List[str]

Get the list of processing IDs per source.

Returns:

List of processing ID lists, one per source.

update_features(source_processings: list[str], features: list[ndarray] | list[list[ndarray]], processings: list[str], source: int = -1) → None

Update or add new feature processings to a specific source.

Parameters:
  • source_processings – List of existing processing names to replace. An empty string ("") means the matching features are added as a new processing instead of replacing one.

  • features – Feature arrays to add or replace (single array or list of arrays).

  • processings – Target processing names for the features.

  • source – Source index to update. Defaults to -1; a negative value is treated as source 0.
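The replace-or-add rule can be sketched on a single source held as a 3D array of shape (n_samples, n_processings, n_features) plus a list of processing names. That storage model, the `update_processings` function, and the restriction that a replacement keeps the same feature count are all assumptions of this sketch, not a description of the library's internals.

```python
import numpy as np


def update_processings(array, names, source_processings, features, processings):
    """Replace named processing slots, or append when the old name is ""."""
    array = array.copy()  # leave the caller's array untouched
    names = list(names)
    for old, new_data, new_name in zip(source_processings, features, processings):
        if old == "":
            # Empty string: append a new processing slot.
            array = np.concatenate([array, new_data[:, np.newaxis, :]], axis=1)
            names.append(new_name)
        else:
            # Existing name: overwrite that slot and rename it.
            slot = names.index(old)
            array[:, slot, :] = new_data
            names[slot] = new_name
    return array, names


raw = np.zeros((4, 1, 3))
replaced, names_replaced = update_processings(
    raw, ["raw"], ["raw"], [np.ones((4, 3))], ["snv"]
)
added, names_added = update_processings(
    replaced, names_replaced, [""], [np.full((4, 3), 2.0)], ["sg"]
)
```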

x(indices: list[int] | ndarray, layout: str = '2d', concat_source: bool = True) → ndarray | list[ndarray]

Retrieve feature data for specified samples.

Parameters:
  • indices – Sample indices to retrieve.

  • layout – Data layout format (“2d”, “2d_interleaved”, “3d”, “3d_transpose”).

  • concat_source – If True and multiple sources exist, concatenate along feature dimension.

Returns:

Feature array(s) in the requested layout. Single array if concat_source=True or only one source, otherwise list of arrays.

Raises:

ValueError – If no features are available.
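The layout names and the concat_source flag can be illustrated with NumPy reshapes. This sketch assumes each source is stored as (n_samples, n_processings, n_features) and gives one plausible reading of the four layout names. The exact axis order the library uses for "2d_interleaved" and "3d_transpose" is not stated on this page, so treat those branches as assumptions. The `to_layout` and `get_x` functions are hypothetical.

```python
import numpy as np


def to_layout(block, layout="2d"):
    """Reshape one (n_samples, n_processings, n_features) block."""
    n_samples = block.shape[0]
    if layout == "3d":
        return block
    if layout == "3d_transpose":    # assumed: (n_samples, n_features, n_processings)
        return block.transpose(0, 2, 1)
    if layout == "2d":              # processings laid side by side
        return block.reshape(n_samples, -1)
    if layout == "2d_interleaved":  # assumed: processings alternate per feature
        return block.transpose(0, 2, 1).reshape(n_samples, -1)
    raise ValueError(f"Unknown layout: {layout}")


def get_x(blocks, indices, layout="2d", concat_source=True):
    """Select rows, apply the layout, and optionally join sources."""
    if not blocks:
        raise ValueError("No features available")
    selected = [to_layout(b[indices], layout) for b in blocks]
    if len(selected) == 1:
        return selected[0]
    if concat_source:
        # The feature axis is last, except in the transposed 3D layout.
        feature_axis = 1 if layout == "3d_transpose" else -1
        return np.concatenate(selected, axis=feature_axis)
    return selected


a = np.arange(12).reshape(2, 2, 3)  # 2 samples, 2 processings, 3 features
b = np.arange(8).reshape(2, 2, 2)   # 2 samples, 2 processings, 2 features

flat = get_x([a, b], [0, 1], layout="2d")
cube = get_x([a, b], [0], layout="3d")
separate = get_x([a, b], [0, 1], layout="2d", concat_source=False)
```

Concatenating 3D layouts along the feature axis only works when every source has the same number of processings, which is one reason a caller might prefer concat_source=False.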