nirs4all.data.selection.sample_linker module

Sample linker for dataset configuration.

This module provides key-based sample linking across multiple data files, enabling joining of features, targets, and metadata by a common identifier.

Example

>>> linker = SampleLinker()
>>> result = linker.link(
...     {"features": features_df, "targets": targets_df},
...     link_by="sample_id"
... )
>>> print(result.linked_data)  # Joined DataFrame

exception nirs4all.data.selection.sample_linker.LinkingError

Bases: Exception

Raised when sample linking fails.

class nirs4all.data.selection.sample_linker.LinkingResult(linked_data: Dict[str, DataFrame], key_column: str, matched_keys: Set[Any], missing_keys: Dict[str, Set[Any]], sample_count: int, report: Dict[str, Any] = <factory>)

Bases: object

Result of a sample linking operation.

linked_data

Dictionary of linked DataFrames (key column removed).

Type: Dict[str, DataFrame]

key_column

The column used for linking.

Type: str

matched_keys

Set of keys present in all sources.

Type: Set[Any]

missing_keys

Dictionary mapping source names to their missing keys.

Type: Dict[str, Set[Any]]

sample_count

Number of linked samples.

Type: int

report

Detailed linking report.

Type: Dict[str, Any]
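
The bookkeeping behind matched_keys and missing_keys can be pictured with plain Python sets. This is a minimal sketch, not the library's implementation: summarize_keys is a hypothetical helper, and it assumes that a source's "missing keys" are the keys seen in at least one other source but absent from that source.

```python
from typing import Any, Dict, Set, Tuple


def summarize_keys(
    keys_by_source: Dict[str, Set[Any]],
) -> Tuple[Set[Any], Dict[str, Set[Any]]]:
    """Hypothetical helper: return (matched, missing) for per-source key sets."""
    if not keys_by_source:
        return set(), {}
    # Every key seen in any source.
    all_keys = set().union(*keys_by_source.values())
    # Keys present in every source.
    matched = set.intersection(*keys_by_source.values())
    # Assumption: "missing" means absent from this source but seen elsewhere.
    missing = {name: all_keys - keys for name, keys in keys_by_source.items()}
    return matched, missing


keys_by_source = {"X": {"s1", "s2", "s3"}, "Y": {"s2", "s3", "s4"}}
matched, missing = summarize_keys(keys_by_source)
```

Under that assumption, len(matched) would correspond to sample_count for an inner link.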

class nirs4all.data.selection.sample_linker.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')

Bases: object

Link samples across multiple data files by key column.

Supports multiple linking modes:

- "inner": Keep only samples present in all sources (default)
- "left": Keep all samples from the first source
- "outer": Keep all samples from any source
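
To see which sample keys each mode would retain, the sketch below uses the analogous how= options of pandas DataFrame.merge on two small frames. It illustrates the join semantics only. Whether SampleLinker uses merge internally is not stated here, and the frames and their values are made up for the example.

```python
import pandas as pd

# Illustrative data: "s1" appears only in the features, "s4" only in the targets.
features_df = pd.DataFrame(
    {"sample_id": ["s1", "s2", "s3"], "feature1": [0.1, 0.2, 0.3]}
)
targets_df = pd.DataFrame(
    {"sample_id": ["s2", "s3", "s4"], "target": [10.0, 20.0, 30.0]}
)

kept = {}
for mode in ("inner", "left", "outer"):
    joined = features_df.merge(targets_df, on="sample_id", how=mode)
    # Record which sample keys survive under this mode.
    kept[mode] = sorted(joined["sample_id"])
```

Here "inner" keeps the two shared keys, "left" keeps every key of the first frame, and "outer" keeps all four.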

Example

>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,    # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,     # Has columns: sample_id, target
...         "M": metadata_df,    # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column

create_sample_index(sources: Dict[str, DataFrame], link_by: str) → DataFrame

Create a sample index showing key presence across sources.

Parameters:

sources – Dictionary of source DataFrames.
link_by – Key column name.

Returns:

DataFrame with keys as index and boolean columns per source.
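
A table of this shape can be built with plain pandas, as in the sketch below. presence_table is a hypothetical stand-in written for illustration, not the method's actual code. Sorting the keys is an assumption made so the output is deterministic.

```python
from typing import Dict

import pandas as pd


def presence_table(sources: Dict[str, pd.DataFrame], link_by: str) -> pd.DataFrame:
    """Hypothetical sketch: one row per key, one boolean column per source."""
    # Union of the key values found in any source, sorted (assumption).
    all_keys = sorted(set().union(*(df[link_by] for df in sources.values())))
    index = pd.Index(all_keys, name=link_by)
    # For each source, mark which keys of the union it contains.
    return pd.DataFrame(
        {name: index.isin(df[link_by]) for name, df in sources.items()},
        index=index,
    )


x_df = pd.DataFrame({"sample_id": ["s1", "s2", "s3"], "feature1": [0.1, 0.2, 0.3]})
y_df = pd.DataFrame({"sample_id": ["s2", "s3", "s4"], "target": [1.0, 2.0, 3.0]})
table = presence_table({"X": x_df, "Y": y_df}, link_by="sample_id")
```

Rows where every column is True correspond to keys that an inner link would keep.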

link(sources: Dict[str, DataFrame], link_by: str, keep_key_column: bool = False) → LinkingResult

Link multiple data sources by key column.

Parameters:

sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.
link_by – Name of the column to use for linking.
keep_key_column – Whether to keep the key column in output DataFrames.

Returns:

LinkingResult with linked DataFrames.

Raises:

LinkingError – If linking fails (missing key columns, no matches, etc.).
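
The behaviour described above, for the default "inner" mode, can be approximated in a few lines of pandas. The sketch below is an assumption-laden illustration rather than the library's code: inner_link_sketch is a hypothetical function, it raises built-in exceptions instead of LinkingError, it assumes key values are unique within each source, and it orders the rows by sorted key.

```python
from typing import Dict

import pandas as pd


def inner_link_sketch(
    sources: Dict[str, pd.DataFrame],
    link_by: str,
    keep_key_column: bool = False,
) -> Dict[str, pd.DataFrame]:
    """Hypothetical sketch of inner linking; assumes unique keys per source."""
    if not sources:
        raise ValueError("no sources given")
    for name, df in sources.items():
        if link_by not in df.columns:
            raise KeyError(f"source {name!r} has no column {link_by!r}")
    # Keys present in every source.
    common = set.intersection(*(set(df[link_by]) for df in sources.values()))
    if not common:
        raise ValueError("no keys are shared by all sources")
    order = sorted(common)  # assumption: a fixed, sorted row order
    linked = {}
    for name, df in sources.items():
        # Select the shared keys in the same order for every source.
        aligned = df.set_index(link_by).loc[order].reset_index()
        if not keep_key_column:
            aligned = aligned.drop(columns=[link_by])
        linked[name] = aligned
    return linked


x_df = pd.DataFrame({"sample_id": ["s3", "s1", "s2"], "feature1": [3.0, 1.0, 2.0]})
y_df = pd.DataFrame({"sample_id": ["s2", "s3", "s4"], "target": [20.0, 30.0, 40.0]})
linked = inner_link_sketch({"X": x_df, "Y": y_df}, link_by="sample_id")
```

After this call, row i of linked["X"] and row i of linked["Y"] refer to the same sample, even though the inputs listed their samples in different orders.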

link_aligned(sources: Dict[str, DataFrame], validate: bool = True) → Dict[str, DataFrame]

Link sources that are already aligned by row index.

This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).

Parameters:

sources – Dictionary of aligned DataFrames.
validate – Whether to validate that all sources have same row count.

Returns:

Dictionary of DataFrames (unchanged, just validated).

Raises:

LinkingError – If validation fails.
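
The row-count validation mentioned above amounts to a check like the one sketched here. check_same_row_count is a hypothetical helper, and it raises ValueError where the real method is documented to raise LinkingError. Any checks the library performs beyond row counts are not covered.

```python
from typing import Dict

import pandas as pd


def check_same_row_count(sources: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
    """Hypothetical sketch: return the sources unchanged if row counts agree."""
    counts = {name: len(df) for name, df in sources.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"row counts differ across sources: {counts}")
    return sources


x_df = pd.DataFrame({"feature1": [0.1, 0.2, 0.3]})
y_df = pd.DataFrame({"target": [1.0, 2.0, 3.0]})
aligned = check_same_row_count({"X": x_df, "Y": y_df})

# A source with a different number of rows is rejected.
try:
    check_same_row_count({"X": x_df, "Y": y_df.iloc[:2]})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Equal row counts do not prove that the rows describe the same samples in the same order, so this path suits only sources that are already known to be aligned.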

nirs4all.data.selection.sample_linker.link_xy(x_df: DataFrame, y_df: DataFrame, link_by: str, mode: str = 'inner') → tuple

Convenience function to link X and Y DataFrames.

Parameters:

x_df – Features DataFrame.
y_df – Targets DataFrame.
link_by – Key column name.
mode – Linking mode.

Returns:

Tuple of (X_linked, Y_linked) DataFrames.

nirs4all.data.selection.sample_linker.link_xym(x_df: DataFrame, y_df: DataFrame, m_df: DataFrame, link_by: str, mode: str = 'inner') → tuple

Convenience function to link X, Y, and metadata DataFrames.

Parameters:

x_df – Features DataFrame.
y_df – Targets DataFrame.
m_df – Metadata DataFrame.
link_by – Key column name.
mode – Linking mode.

Returns:

Tuple of (X_linked, Y_linked, M_linked) DataFrames.