nirs4all.data.selection.sample_linker module
Sample linker for dataset configuration.
This module provides key-based sample linking across multiple data files, so that features, targets, and metadata can be joined on a common identifier.
Example
>>> linker = SampleLinker()
>>> result = linker.link(
... {"features": features_df, "targets": targets_df},
... link_by="sample_id"
... )
>>> print(result.linked_data)  # Dict of linked DataFrames, one per source
- exception nirs4all.data.selection.sample_linker.LinkingError[source]
Bases: Exception
Raised when sample linking fails.
- class nirs4all.data.selection.sample_linker.LinkingResult(linked_data: Dict[str, DataFrame], key_column: str, matched_keys: Set[Any], missing_keys: Dict[str, Set[Any]], sample_count: int, report: Dict[str, Any] = <factory>)[source]
Bases: object
Result of a sample linking operation.
- linked_data
Dictionary of linked DataFrames, one per source (key column removed unless keep_key_column is set in link()).
- Type:
Dict[str, pandas.core.frame.DataFrame]
- matched_keys
Set of keys present in all sources.
- Type:
Set[Any]
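The `<factory>` default shown for `report` in the signature suggests a dataclass field built with a default factory. The following is a minimal stand-in with the same field names, not the library's class: `LinkSummary` is a hypothetical name, plain lists of dicts replace DataFrames, and `dict` as the factory is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class LinkSummary:
    """Hypothetical mirror of the LinkingResult fields (illustration only)."""

    linked_data: Dict[str, List[dict]]   # stand-in for Dict[str, DataFrame]
    key_column: str
    matched_keys: Set[Any]
    missing_keys: Dict[str, Set[Any]]
    sample_count: int
    # Assumed factory: each instance gets its own empty dict.
    report: Dict[str, Any] = field(default_factory=dict)


first = LinkSummary({"X": [{"f1": 1.0}]}, "sample_id", {"s1"}, {"X": set()}, 1)
second = LinkSummary({"X": [{"f1": 2.0}]}, "sample_id", {"s2"}, {"X": set()}, 1)

# Writing to one report does not affect the other instance.
first.report["note"] = "linked"
```

A default factory avoids the shared-mutable-default problem: `second.report` stays empty after `first.report` is modified.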
- class nirs4all.data.selection.sample_linker.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')[source]
Bases: object
Link samples across multiple data files by key column.
Supports multiple linking modes:
- "inner": Keep only samples present in all sources (default)
- "left": Keep all samples from the first source
- "outer": Keep all samples from any source
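The three modes can be illustrated on plain lists of keys. This is a sketch of the semantics described above, not the library's implementation: `select_keys` is a hypothetical helper, and the ordering of the returned keys (first-source order, then first-seen order) is an assumption.

```python
from typing import Dict, Hashable, List


def select_keys(source_keys: Dict[str, List[Hashable]], mode: str = "inner") -> List[Hashable]:
    """Return the keys kept under a given linking mode (illustrative only)."""
    names = list(source_keys)
    first = source_keys[names[0]]
    if mode == "inner":
        # Keep keys of the first source that appear in every other source.
        others = [set(source_keys[n]) for n in names[1:]]
        return [k for k in first if all(k in s for s in others)]
    if mode == "left":
        # Keep every key of the first source, in its original order.
        return list(first)
    if mode == "outer":
        # Keep every key seen in any source, in first-seen order.
        seen: Dict[Hashable, None] = {}
        for n in names:
            for k in source_keys[n]:
                seen.setdefault(k, None)
        return list(seen)
    raise ValueError(f"unknown mode: {mode!r}")


keys = {"X": ["s1", "s2", "s3"], "Y": ["s2", "s3", "s4"]}
inner_keys = select_keys(keys, "inner")
left_keys = select_keys(keys, "left")
outer_keys = select_keys(keys, "outer")
```

With these inputs, "inner" keeps only the keys shared by X and Y, "left" keeps all keys of X, and "outer" keeps the union of both.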
Example
>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,  # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,   # Has columns: sample_id, target
...         "M": metadata_df,  # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column
- create_sample_index(sources: Dict[str, DataFrame], link_by: str) DataFrame[source]
Create a sample index showing key presence across sources.
- Parameters:
sources – Dictionary of source DataFrames.
link_by – Key column name.
- Returns:
DataFrame with keys as index and boolean columns per source.
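The shape of such an index can be sketched without pandas as a nested dictionary: one entry per key, with a boolean per source. `presence_table` is a hypothetical helper, and sorting the keys is an assumption made only to get a stable order.

```python
from typing import Dict, List


def presence_table(source_keys: Dict[str, List[str]]) -> Dict[str, Dict[str, bool]]:
    """Map each key to a {source name: present?} row (illustrative only)."""
    key_sets = {name: set(keys) for name, keys in source_keys.items()}
    # Union of all keys across sources, sorted for a deterministic order.
    all_keys = sorted(set().union(*key_sets.values()))
    return {k: {name: k in ks for name, ks in key_sets.items()} for k in all_keys}


table = presence_table({"X": ["s1", "s2"], "Y": ["s2", "s3"]})
# Keys present everywhere are the ones an "inner" link would keep.
complete = [k for k, row in table.items() if all(row.values())]
```

A table of this kind is useful for diagnosing why samples are dropped before running the actual link.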
- link(sources: Dict[str, DataFrame], link_by: str, keep_key_column: bool = False) LinkingResult[source]
Link multiple data sources by key column.
- Parameters:
sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.
link_by – Name of the column to use for linking.
keep_key_column – Whether to keep the key column in output DataFrames.
- Returns:
LinkingResult with linked DataFrames.
- Raises:
LinkingError – If linking fails (missing key columns, no matches, etc.).
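The behaviour documented for link() in the default "inner" mode can be sketched on plain lists of row dictionaries. This is not the library's code: `inner_link`, `LinkError`, and `Row` are hypothetical names, keys are assumed unique within each source, and output rows are assumed to follow the order of the first source.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class LinkError(Exception):
    """Hypothetical stand-in for LinkingError."""


def inner_link(sources: Dict[str, List[Row]], link_by: str,
               keep_key_column: bool = False) -> Dict[str, List[Row]]:
    if not sources:
        raise LinkError("no sources given")
    # Index each source by key, failing early if the key column is absent.
    indexed: Dict[str, Dict[Any, Row]] = {}
    for name, rows in sources.items():
        if any(link_by not in row for row in rows):
            raise LinkError(f"source {name!r} has no column {link_by!r}")
        indexed[name] = {row[link_by]: row for row in rows}

    # Keys present in every source, in the order of the first source.
    names = list(sources)
    matched = [row[link_by] for row in sources[names[0]]
               if all(row[link_by] in indexed[n] for n in names)]
    if not matched:
        raise LinkError("no keys are shared by all sources")

    # Emit one row per matched key for each source, so rows line up by position.
    linked: Dict[str, List[Row]] = {}
    for name in names:
        rows = [dict(indexed[name][k]) for k in matched]
        if not keep_key_column:
            for row in rows:
                del row[link_by]
        linked[name] = rows
    return linked


features = [{"sample_id": "a", "f1": 1.0}, {"sample_id": "b", "f1": 2.0},
            {"sample_id": "c", "f1": 3.0}]
targets = [{"sample_id": "c", "y": 30}, {"sample_id": "a", "y": 10}]
linked = inner_link({"X": features, "Y": targets}, "sample_id")
```

Here sample "b" has no target and is dropped, and the target rows are reordered to match the feature rows.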
- link_aligned(sources: Dict[str, DataFrame], validate: bool = True) Dict[str, DataFrame][source]
Link sources that are already aligned by row index.
This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).
- Parameters:
sources – Dictionary of aligned DataFrames.
validate – Whether to validate that all sources have same row count.
- Returns:
Dictionary of DataFrames (unchanged, just validated).
- Raises:
LinkingError – If validation fails.
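The validation step amounts to checking that every source has the same number of rows. A minimal sketch, assuming nothing beyond that check: `check_same_length` is a hypothetical helper, and `ValueError` stands in for LinkingError.

```python
from typing import Dict, Sequence


def check_same_length(sources: Dict[str, Sequence]) -> Dict[str, Sequence]:
    """Return the sources unchanged if all have the same row count."""
    counts = {name: len(rows) for name, rows in sources.items()}
    if len(set(counts.values())) > 1:
        # Report per-source counts so the mismatched source is easy to spot.
        raise ValueError(f"row counts differ: {counts}")
    return sources


aligned = {"X": [[0.1, 0.2], [0.3, 0.4]], "Y": [1.0, 2.0]}
checked = check_same_length(aligned)
```

A row-count check cannot detect sources that have the same length but a different sample order, so this path is only appropriate when the ordering is guaranteed upstream.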
- nirs4all.data.selection.sample_linker.link_xy(x_df: DataFrame, y_df: DataFrame, link_by: str, mode: str = 'inner') tuple[source]
Convenience function to link X and Y DataFrames.
- Parameters:
x_df – Features DataFrame.
y_df – Targets DataFrame.
link_by – Key column name.
mode – Linking mode ("inner", "left", or "outer"; see SampleLinker).
- Returns:
Tuple of (X_linked, Y_linked) DataFrames.
- nirs4all.data.selection.sample_linker.link_xym(x_df: DataFrame, y_df: DataFrame, m_df: DataFrame, link_by: str, mode: str = 'inner') tuple[source]
Convenience function to link X, Y, and metadata DataFrames.
- Parameters:
x_df – Features DataFrame.
y_df – Targets DataFrame.
m_df – Metadata DataFrame.
link_by – Key column name.
mode – Linking mode ("inner", "left", or "outer"; see SampleLinker).
- Returns:
Tuple of (X_linked, Y_linked, M_linked) DataFrames.