nirs4all.data.selection.sample_linker module

Sample linker for dataset configuration.

This module provides key-based sample linking across multiple data files, so that features, targets, and metadata can be joined on a common identifier.

Example

>>> linker = SampleLinker()
>>> result = linker.link(
...     {"features": features_df, "targets": targets_df},
...     link_by="sample_id"
... )
>>> print(result.linked_data)  # Joined DataFrame
exception nirs4all.data.selection.sample_linker.LinkingError

Bases: Exception

Raised when sample linking fails.

class nirs4all.data.selection.sample_linker.LinkingResult(linked_data: Dict[str, DataFrame], key_column: str, matched_keys: Set[Any], missing_keys: Dict[str, Set[Any]], sample_count: int, report: Dict[str, Any] = <factory>)

Bases: object

Result of a sample linking operation.

linked_data

Dictionary of linked DataFrames (key column removed).

Type:

Dict[str, pandas.core.frame.DataFrame]

key_column

The column used for linking.

Type:

str

matched_keys

Set of keys present in all sources.

Type:

Set[Any]

missing_keys

Dictionary mapping source names to their missing keys.

Type:

Dict[str, Set[Any]]

sample_count

Number of linked samples.

Type:

int

report

Detailed linking report.

Type:

Dict[str, Any]

key_column: str
linked_data: Dict[str, DataFrame]
matched_keys: Set[Any]
missing_keys: Dict[str, Set[Any]]
report: Dict[str, Any]
sample_count: int
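
How the key-related fields could relate to one another can be sketched with plain sets. This is an illustration only, not nirs4all's implementation: the helper `summarize_keys` is hypothetical, and it assumes that a source's "missing keys" are the keys seen in at least one other source but absent from that source.

```python
from typing import Any, Dict, Iterable, Set, Tuple


def summarize_keys(
    keys_by_source: Dict[str, Iterable[Any]],
) -> Tuple[Set[Any], Dict[str, Set[Any]]]:
    """Hypothetical helper: return (matched_keys, missing_keys)."""
    key_sets = {name: set(keys) for name, keys in keys_by_source.items()}
    # Every key seen in at least one source.
    all_keys = set().union(*key_sets.values())
    # Keys present in every source.
    matched = set(all_keys)
    for keys in key_sets.values():
        matched &= keys
    # Assumption: "missing" means seen elsewhere but absent from this source.
    missing = {name: all_keys - keys for name, keys in key_sets.items()}
    return matched, missing


matched_keys, missing_keys = summarize_keys(
    {"features": ["s1", "s2", "s3"], "targets": ["s2", "s3", "s4"]}
)
```

Under that assumption, `len(matched_keys)` would equal `sample_count` for an inner link.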
class nirs4all.data.selection.sample_linker.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')

Bases: object

Link samples across multiple data files by key column.

Supports multiple linking modes:
  • "inner": Keep only samples present in all sources (default).

  • "left": Keep all samples from the first source.

  • "outer": Keep all samples from any source.
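
The three modes correspond to familiar set operations on the key column. The sketch below uses plain Python sets to show which keys each mode would retain. `select_keys` is a hypothetical helper, not part of nirs4all; it treats the first dictionary entry as the "first source" and raises `ValueError` for an unknown mode as a stand-in.

```python
from typing import Any, Dict, Iterable, Set


def select_keys(keys_by_source: Dict[str, Iterable[Any]], mode: str = "inner") -> Set[Any]:
    """Hypothetical helper: return the keys a given linking mode would keep."""
    key_sets = [set(keys) for keys in keys_by_source.values()]
    if not key_sets:
        return set()
    if mode == "inner":
        # Only keys present in every source.
        return set.intersection(*key_sets)
    if mode == "left":
        # All keys of the first source (dicts preserve insertion order).
        return set(key_sets[0])
    if mode == "outer":
        # Any key seen in any source.
        return set.union(*key_sets)
    raise ValueError(f"Unknown mode: {mode!r}")


example_keys = {"X": ["a", "b", "c"], "Y": ["b", "c", "d"]}
inner_keys = select_keys(example_keys, "inner")
left_keys = select_keys(example_keys, "left")
outer_keys = select_keys(example_keys, "outer")
```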

Example

>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,    # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,     # Has columns: sample_id, target
...         "M": metadata_df,    # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column
create_sample_index(sources: Dict[str, DataFrame], link_by: str) → DataFrame

Create a sample index showing key presence across sources.

Parameters:
  • sources – Dictionary of source DataFrames.

  • link_by – Key column name.

Returns:

DataFrame with keys as index and boolean columns per source.
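
The shape of such an index can be sketched with pandas alone. This is an assumption about what the result looks like, based only on the description above; `presence_table` is a hypothetical stand-in, and sorting the keys is a choice made here for a stable row order.

```python
from typing import Dict

import pandas as pd


def presence_table(sources: Dict[str, pd.DataFrame], link_by: str) -> pd.DataFrame:
    """Hypothetical stand-in: one row per key, one boolean column per source."""
    # Collect every key seen in any source, sorted for a stable row order.
    all_keys = sorted(set().union(*(df[link_by] for df in sources.values())))
    key_index = pd.Index(all_keys, name=link_by)
    return pd.DataFrame(
        {name: key_index.isin(df[link_by]) for name, df in sources.items()},
        index=key_index,
    )


features_df = pd.DataFrame({"sample_id": ["a", "b", "c"], "f1": [1.0, 2.0, 3.0]})
targets_df = pd.DataFrame({"sample_id": ["b", "c", "d"], "y": [0, 1, 0]})
sample_index = presence_table({"features": features_df, "targets": targets_df}, "sample_id")
```

Rows where every column is `True` are the samples an inner link would keep.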

link(sources, link_by, keep_key_column)

Link multiple data sources by key column.

Parameters:
  • sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.

  • link_by – Name of the column to use for linking.

  • keep_key_column – Whether to keep the key column in output DataFrames.

Returns:

LinkingResult with linked DataFrames.

Raises:

LinkingError – If linking fails (missing key columns, no matches, etc.).
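
What key-based linking amounts to for the inner case can be sketched in pandas. This is not the library's code: `link_inner` is a hypothetical helper, it assumes keys are unique within each source, it uses `KeyError` as a stand-in for `LinkingError`, and its `keep_key_column=False` default is an assumption based on the note that linked DataFrames have the key column removed.

```python
from typing import Dict

import pandas as pd


def link_inner(
    sources: Dict[str, pd.DataFrame], link_by: str, keep_key_column: bool = False
) -> Dict[str, pd.DataFrame]:
    """Hypothetical helper: align all sources on the keys they share."""
    for name, df in sources.items():
        if link_by not in df.columns:
            # Stand-in for the library's LinkingError.
            raise KeyError(f"Source {name!r} has no column {link_by!r}")
    # Keys shared by every source, in a stable sorted order.
    shared = sorted(set.intersection(*(set(df[link_by]) for df in sources.values())))
    linked = {}
    for name, df in sources.items():
        # Reorder rows to the shared key order (assumes unique keys per source).
        aligned = df.set_index(link_by).loc[shared].reset_index()
        linked[name] = aligned if keep_key_column else aligned.drop(columns=link_by)
    return linked


x_df = pd.DataFrame({"sample_id": ["s3", "s1", "s2"], "f1": [3.0, 1.0, 2.0]})
y_df = pd.DataFrame({"sample_id": ["s2", "s3", "s4"], "target": [20, 30, 40]})
linked = link_inner({"X": x_df, "Y": y_df}, "sample_id")
```

After the call, row `i` of `linked["X"]` and row `i` of `linked["Y"]` refer to the same sample even though the inputs were ordered differently.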

Link sources that are already aligned by row index.

This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).

Parameters:
  • sources – Dictionary of aligned DataFrames.

  • validate – Whether to validate that all sources have the same row count.

Returns:

Dictionary of DataFrames (unchanged, just validated).

Raises:

LinkingError – If validation fails.
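
The row-count validation described above can be sketched as follows. `check_row_counts` is a hypothetical name, and `ValueError` stands in for `LinkingError`; the library may check more than the row count.

```python
from typing import Dict

import pandas as pd


def check_row_counts(sources: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
    """Hypothetical helper: return sources unchanged if all have equal length."""
    counts = {name: len(df) for name, df in sources.items()}
    if len(set(counts.values())) > 1:
        # ValueError stands in for the library's LinkingError.
        raise ValueError(f"Row counts differ across sources: {counts}")
    return sources


spectra_df = pd.DataFrame({"f1": [1.0, 2.0, 3.0]})
labels_df = pd.DataFrame({"target": [0, 1, 0]})
validated = check_row_counts({"X": spectra_df, "Y": labels_df})
```

A matching row count does not prove that the rows describe the same samples in the same order; that guarantee has to come from how the files were produced.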

Convenience function to link X and Y DataFrames.

Parameters:
  • x_df – Features DataFrame.

  • y_df – Targets DataFrame.

  • link_by – Key column name.

  • mode – Linking mode.

Returns:

Tuple of (X_linked, Y_linked) DataFrames.
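
A two-frame convenience wrapper of this kind could look roughly like the sketch below, which delegates key selection to `pandas.merge`. The name `link_xy_sketch` is hypothetical, keys are assumed unique in each frame, and dropping the key column from the outputs is an assumption carried over from the `linked_data` description.

```python
from typing import Tuple

import pandas as pd


def link_xy_sketch(
    x_df: pd.DataFrame, y_df: pd.DataFrame, link_by: str, mode: str = "inner"
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical sketch: align X and Y rows on a shared key column."""
    # Merge only the key columns so pandas decides which keys survive the mode.
    keys = pd.merge(x_df[[link_by]], y_df[[link_by]], on=link_by, how=mode)[link_by].tolist()
    # Reindex each frame to the selected keys; keys absent from a frame yield NaN rows.
    x_linked = x_df.set_index(link_by).reindex(keys).reset_index(drop=True)
    y_linked = y_df.set_index(link_by).reindex(keys).reset_index(drop=True)
    return x_linked, y_linked


x_example = pd.DataFrame({"sample_id": ["a", "b", "c"], "f1": [1.0, 2.0, 3.0]})
y_example = pd.DataFrame({"sample_id": ["c", "b", "d"], "target": [30, 20, 40]})
x_inner, y_inner = link_xy_sketch(x_example, y_example, "sample_id")
x_left, y_left = link_xy_sketch(x_example, y_example, "sample_id", mode="left")
```

With `mode="left"`, every X row is kept and targets without a matching key appear as missing values.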

Convenience function to link X, Y, and metadata DataFrames.

Parameters:
  • x_df – Features DataFrame.

  • y_df – Targets DataFrame.

  • m_df – Metadata DataFrame.

  • link_by – Key column name.

  • mode – Linking mode.

Returns:

Tuple of (X_linked, Y_linked, M_linked) DataFrames.