nirs4all.data.selection package

Module contents

Selection module for dataset configuration.

This module provides flexible column and row selection for dataset loading, supporting multiple selection syntaxes (index, name, range, regex, exclusion).

Classes:

- ColumnSelector: Select columns from a DataFrame using various methods
- RowSelector: Select rows from a DataFrame using various methods
- SampleLinker: Link samples across multiple files by key column
- RoleAssigner: Assign columns to data roles (features, targets, metadata)

exception nirs4all.data.selection.ColumnSelectionError

Bases: Exception

Raised when column selection fails.

class nirs4all.data.selection.ColumnSelector(case_sensitive: bool = True)

Bases: object

Flexible column selector for DataFrames.

Supports multiple selection methods:

- By name: ["col1", "col2"] or "col_name"
- By index: [0, 1, 2] or 0
- By range: "2:-1" (slice syntax as string)
- By regex pattern: {"regex": "^feature_.*"}
- By exclusion: {"exclude": ["id", "date"]}
- Combined: {"include": [0, 1], "exclude": ["id"]}

Example

>>> selector = ColumnSelector()
>>> result = selector.select(df, "2:-1")
>>> print(result.names)  # Column names in range
>>> print(result.data)   # Selected columns as DataFrame

parse_selection(selection: Any, available_columns: List[str]) → List[int]

Parse a selection specification and return column indices.

This is a convenience method for when you don’t have a DataFrame but want to validate and resolve a selection.

Parameters:

selection – Column selection specification.
available_columns – List of available column names.

Returns:

List of column indices.

Raises:

ColumnSelectionError – If selection is invalid.
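As an illustration of what resolving a range string against a list of column names can look like, here is a minimal plain-Python sketch. It assumes the range string follows ordinary Python slice semantics; `resolve_range` is a hypothetical helper written for this example and is not taken from the nirs4all source.

```python
from typing import List


def resolve_range(spec: str, available_columns: List[str]) -> List[int]:
    """Resolve a slice-style string such as "2:-1" to column indices.

    Hypothetical helper: mirrors Python slice semantics, not nirs4all code.
    """
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"not a range string: {spec!r}")
    # Empty fields ("2:" or ":-1") become None, as in a[2:] or a[:-1].
    bounds = [int(p) if p.strip() else None for p in parts]
    return list(range(len(available_columns))[slice(*bounds)])


columns = ["id", "date", "f1", "f2", "f3", "target"]
indices = resolve_range("2:-1", columns)   # skips the first two and the last
names = [columns[i] for i in indices]
```

Working on the name list alone is what makes this kind of resolution possible without a DataFrame.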

select(df, selection)

Select columns from a DataFrame.

Parameters:

df – The DataFrame to select columns from.
selection – Column selection specification. Can be:
- None: Select all columns
- int: Single column index
- str: Single column name or range string ("2:-1")
- List[int]: List of column indices
- List[str]: List of column names
- Dict: Complex selection (see class docstring)

Returns:

SelectionResult with indices, names, and selected data.

Raises:

ColumnSelectionError – If selection is invalid or columns not found.
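The dictionary forms of a selection can be pictured as successive filters over the available column names. The sketch below is an assumption about the semantics, not the library's implementation: `resolve_dict` is a hypothetical helper, it handles only the "regex" and "exclude" keys, and it assumes the pattern is applied with `re.search`.

```python
import re
from typing import Any, Dict, List


def resolve_dict(spec: Dict[str, Any], columns: List[str]) -> List[int]:
    """Resolve {"regex": ...} and/or {"exclude": [...]} to column indices.

    Hypothetical helper for illustration only.
    """
    selected = list(range(len(columns)))
    if "regex" in spec:
        pattern = re.compile(spec["regex"])
        # Keep columns whose name matches the pattern.
        selected = [i for i in selected if pattern.search(columns[i])]
    if "exclude" in spec:
        excluded = set(spec["exclude"])
        # Drop columns listed by name.
        selected = [i for i in selected if columns[i] not in excluded]
    return selected


columns = ["id", "date", "feature_a", "feature_b", "target"]
by_regex = resolve_dict({"regex": "^feature_.*"}, columns)
by_exclusion = resolve_dict({"exclude": ["id", "date"]}, columns)
```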

exception nirs4all.data.selection.LinkingError

Bases: Exception

Raised when sample linking fails.

class nirs4all.data.selection.RoleAssigner(case_sensitive: bool = True, allow_overlap: bool = False)

Bases: object

Assign columns to data roles (features, targets, metadata).

Validates that:

- No column is assigned to multiple roles
- At least features are assigned
- Indices are valid

Supports the same column selection syntax as ColumnSelector.

Example

>>> assigner = RoleAssigner()
>>> result = assigner.assign(df, {
...     "features": "2:-1",       # All columns except first 2 and last
...     "targets": -1,            # Last column
...     "metadata": [0, 1]        # First 2 columns
... })

assign(df, roles)

Assign columns to roles.

Parameters:

df – The DataFrame to assign roles from.
roles – Dictionary mapping role names to column selections.
- Supported roles: "features", "targets", "metadata"
- Also accepts: "x" (alias for features), "y" (alias for targets)

Returns:

RoleAssignmentResult with separated DataFrames.

Raises:

RoleAssignmentError – If assignment is invalid (overlap, missing features).
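The three checks listed for role assignment (no overlap, features present, valid indices) can be sketched over already-resolved index lists. This is a plain-Python illustration under that assumption; `find_overlaps` and `check_roles` are hypothetical names, and the real class raises RoleAssignmentError rather than returning messages.

```python
from typing import Dict, List


def find_overlaps(roles: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Map each column index claimed by more than one role to those roles."""
    owners: Dict[int, List[str]] = {}
    for role, indices in roles.items():
        for idx in indices:
            owners.setdefault(idx, []).append(role)
    return {idx: names for idx, names in owners.items() if len(names) > 1}


def check_roles(roles: Dict[str, List[int]], n_columns: int) -> List[str]:
    """Return a list of problems; an empty list means the assignment is valid."""
    problems = []
    if not roles.get("features"):
        problems.append("no features assigned")
    for role, indices in roles.items():
        bad = [i for i in indices if not 0 <= i < n_columns]
        if bad:
            problems.append(f"{role}: indices out of range {bad}")
    for idx, names in sorted(find_overlaps(roles).items()):
        problems.append(f"column {idx} assigned to {names}")
    return problems


valid = {"features": [2, 3, 4], "targets": [5], "metadata": [0, 1]}
overlapping = {"features": [2, 3, 4], "targets": [4, 5], "metadata": [0, 1]}
valid_problems = check_roles(valid, n_columns=6)
overlap_problems = check_roles(overlapping, n_columns=6)
```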

Auto-assign roles with specified targets and metadata.

Features are automatically set to all remaining columns.

Parameters:

df – The DataFrame to assign roles from.
target_columns – Column selection for targets (Y).
metadata_columns – Column selection for metadata.

Returns:

RoleAssignmentResult with separated DataFrames.
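"All remaining columns" can be read as a set difference over column positions. The sketch below assumes negative indices count from the end, as in the class example where -1 is the last column; `remaining_features` is a hypothetical helper, not part of nirs4all.

```python
from typing import List


def remaining_features(n_columns: int, targets: List[int], metadata: List[int]) -> List[int]:
    """Return the column indices left over once targets and metadata are taken."""
    def normalize(indices: List[int]) -> List[int]:
        # Convert negative positions (-1 = last column) to absolute ones.
        return [i + n_columns if i < 0 else i for i in indices]

    taken = set(normalize(targets)) | set(normalize(metadata))
    return [i for i in range(n_columns) if i not in taken]


features = remaining_features(6, targets=[-1], metadata=[0, 1])
```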

Extract target columns from a features DataFrame.

This is useful when Y columns are embedded in the X data.

Parameters:

df – DataFrame containing both features and targets.
y_columns – Column selection for targets to extract.

Returns:

RoleAssignmentResult with features (remaining) and targets (extracted).

Validate a role specification without performing assignment.

Parameters:

df – The DataFrame to validate against.
roles – Role specification to validate.

Returns:

List of warning messages (empty if no warnings).

Raises:

RoleAssignmentError – If role specification is invalid.

exception nirs4all.data.selection.RoleAssignmentError

Bases: Exception

Raised when role assignment fails.

exception nirs4all.data.selection.RowSelectionError

Bases: Exception

Raised when row selection fails.

class nirs4all.data.selection.RowSelector(default_random_state: int | None = None)

Bases: object

Flexible row selector for DataFrames.

Supports multiple selection methods:

- All rows: None
- By index: [0, 1, 2] or 0
- By range: "0:100" (slice syntax as string)
- By percentage: "0:80%" or "80%:100%"
- By condition: {"where": {"column": "quality", "op": ">", "value": 0.5}}
- Random sample: {"sample": 100, "random_state": 42}
- Stratified sample: {"sample": 100, "stratify": "class", "random_state": 42}
- Head/Tail: {"head": 100} or {"tail": 50}

Example

>>> selector = RowSelector()
>>> result = selector.select(df, "0:80%")
>>> print(len(result.data))  # 80% of rows

OPERATORS: Dict[str, Callable[[Any, Any], bool]]

Maps each operator name to a two-argument predicate. Available operators: '!=', '<', '<=', '==', '>', '>=', 'contains', 'endswith', 'in', 'isna', 'not in', 'notna', 'regex', 'startswith'.
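A table of this shape lends itself to a simple dispatch: look up the predicate by name, then apply it to each row's value. The sketch below is an assumption about how a "where" condition might be evaluated, using plain dictionaries in place of DataFrame rows and only a subset of the operator names; the `where` function is hypothetical.

```python
import operator
from typing import Any, Callable, Dict, List

# Subset of the operator names, mapped to two-argument predicates.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def where(rows: List[Dict[str, Any]], condition: Dict[str, Any]) -> List[int]:
    """Return positions of rows whose column value satisfies the condition."""
    predicate = OPERATORS[condition["op"]]
    column, expected = condition["column"], condition["value"]
    return [i for i, row in enumerate(rows) if predicate(row[column], expected)]


rows = [{"quality": 0.9}, {"quality": 0.3}, {"quality": 0.7}]
kept = where(rows, {"column": "quality", "op": ">", "value": 0.5})
```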

select(df, selection)

Select rows from a DataFrame.

Parameters:

df – The DataFrame to select rows from.
selection – Row selection specification. Can be:
- None: Select all rows
- int: Single row index
- str: Range string ("0:100") or percentage ("0:80%")
- List[int]: List of row indices
- Dict: Complex selection (see class docstring)

Returns:

RowSelectionResult with indices, mask, and selected data.

Raises:

RowSelectionError – If selection is invalid or rows not found.
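A percentage range such as "0:80%" has to be converted to row positions before slicing. The sketch below shows one way to do that, assuming fractional positions are truncated toward zero and bounds are non-negative; the rounding rule used by the library is not stated here, and `resolve_percent_range` is a hypothetical helper.

```python
def resolve_percent_range(spec: str, n_rows: int) -> range:
    """Turn "0:80%" or "80%:100%" into a range of row positions.

    Hypothetical helper: truncation of fractional positions is an assumption.
    """
    def bound(text: str, default: int) -> int:
        text = text.strip()
        if not text:
            return default
        if text.endswith("%"):
            return int(n_rows * float(text[:-1]) / 100)
        return int(text)

    start_text, stop_text = spec.split(":")
    return range(bound(start_text, 0), bound(stop_text, n_rows))


train_rows = resolve_percent_range("0:80%", n_rows=10)
test_rows = resolve_percent_range("80%:100%", n_rows=10)
```

With 10 rows, the two ranges split the data into the first 8 rows and the last 2 with no overlap.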

class nirs4all.data.selection.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')

Bases: object

Link samples across multiple data files by key column.

Supports multiple linking modes:

- "inner": Keep only samples present in all sources (default)
- "left": Keep all samples from the first source
- "outer": Keep all samples from any source

Example

>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,    # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,     # Has columns: sample_id, target
...         "M": metadata_df,    # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column

create_sample_index(sources: Dict[str, DataFrame], link_by: str) → DataFrame

Create a sample index showing key presence across sources.

Parameters:

sources – Dictionary of source DataFrames.
link_by – Key column name.

Returns:

DataFrame with keys as index and boolean columns per source.
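The shape of such a presence index can be shown with plain dictionaries: one entry per key, one boolean per source. This is only an illustration of the result layout, with lists of keys standing in for the key column of each DataFrame; `presence_table` is a hypothetical helper.

```python
from typing import Dict, List


def presence_table(keys_by_source: Dict[str, List[str]]) -> Dict[str, Dict[str, bool]]:
    """For every key seen in any source, record which sources contain it."""
    key_sets = {name: set(keys) for name, keys in keys_by_source.items()}
    all_keys = sorted(set().union(*key_sets.values()))
    return {
        key: {name: key in members for name, members in key_sets.items()}
        for key in all_keys
    }


table = presence_table({
    "X": ["s1", "s2", "s3"],
    "Y": ["s2", "s3", "s4"],
})
```

Keys with a False entry are the samples that an inner link would drop.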

link(sources: Dict[str, DataFrame], link_by: str, keep_key_column: bool = False) → LinkingResult

Link multiple data sources by key column.

Parameters:

sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.
link_by – Name of the column to use for linking.
keep_key_column – Whether to keep the key column in output DataFrames.

Returns:

LinkingResult with linked DataFrames.

Raises:

LinkingError – If linking fails (missing key columns, no matches, etc.).
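The linking modes differ only in which key set survives. The sketch below computes that key set for "inner", "left", and "outer" over lists of keys. It is a plain-Python illustration, not the library's code: `linked_keys` is hypothetical, and the output ordering (first source first, then first-seen order) is an assumption.

```python
from typing import Dict, List


def linked_keys(keys_by_source: Dict[str, List[str]], mode: str = "inner") -> List[str]:
    """Return the sample keys kept under the given linking mode."""
    ordered = list(keys_by_source.values())
    first = ordered[0]
    if mode == "inner":
        # Keys present in every source, in the first source's order.
        common = set(first).intersection(*ordered[1:])
        return [k for k in first if k in common]
    if mode == "left":
        # Every key of the first source, regardless of the others.
        return list(first)
    if mode == "outer":
        # Every key from any source; a dict preserves first-seen order.
        seen: Dict[str, None] = {}
        for keys in ordered:
            for k in keys:
                seen.setdefault(k, None)
        return list(seen)
    raise ValueError(f"unknown mode: {mode!r}")


sources = {"X": ["s1", "s2", "s3"], "Y": ["s3", "s2", "s4"]}
inner = linked_keys(sources, "inner")
left = linked_keys(sources, "left")
outer = linked_keys(sources, "outer")
```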

link_aligned(sources: Dict[str, DataFrame], validate: bool = True) → Dict[str, DataFrame]

Link sources that are already aligned by row index.

This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).

Parameters:

sources – Dictionary of aligned DataFrames.
validate – Whether to validate that all sources have same row count.

Returns:

Dictionary of DataFrames (unchanged, just validated).

Raises:

LinkingError – If validation fails.