nirs4all.data.selection package

Module contents

Selection module for dataset configuration.

This module provides flexible column and row selection for dataset loading, supporting multiple selection syntaxes (index, name, range, regex, exclusion).

Classes:

  • ColumnSelector: Select columns from a DataFrame using various methods.

  • RowSelector: Select rows from a DataFrame using various methods.

  • SampleLinker: Link samples across multiple files by key column.

  • RoleAssigner: Assign columns to data roles (features, targets, metadata).

exception nirs4all.data.selection.ColumnSelectionError

Bases: Exception

Raised when column selection fails.

class nirs4all.data.selection.ColumnSelector(case_sensitive: bool = True)

Bases: object

Flexible column selector for DataFrames.

Supports multiple selection methods:

  • By name: ["col1", "col2"] or "col_name"

  • By index: [0, 1, 2] or 0

  • By range: "2:-1" (slice syntax as string)

  • By regex pattern: {"regex": "^feature_.*"}

  • By exclusion: {"exclude": ["id", "date"]}

  • Combined: {"include": [0, 1], "exclude": ["id"]}

Example

>>> selector = ColumnSelector()
>>> result = selector.select(df, "2:-1")
>>> print(result.names)  # Column names in range
>>> print(result.data)   # Selected columns as DataFrame

parse_selection(selection: Any, available_columns: List[str]) → List[int]

Parse a selection specification and return column indices.

This is a convenience method for when you don’t have a DataFrame but want to validate and resolve a selection.

Parameters:
  • selection – Column selection specification.

  • available_columns – List of available column names.

Returns:

List of column indices.

Raises:

ColumnSelectionError – If selection is invalid.
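
The selection syntaxes listed for ColumnSelector can be illustrated without a DataFrame. The following is a minimal sketch in plain Python, not the library's implementation: resolve_columns is a hypothetical helper, it raises ValueError instead of ColumnSelectionError, it treats any string containing ":" as a range, and it assumes regex patterns are applied with re.search.

```python
import re
from typing import Any, List


def resolve_columns(selection: Any, columns: List[str]) -> List[int]:
    """Resolve a selection spec to positional indices (illustrative only)."""
    n = len(columns)
    if selection is None:
        # No selection means every column.
        return list(range(n))
    if isinstance(selection, int):
        if not -n <= selection < n:
            raise ValueError(f"column index {selection} out of range")
        return [selection % n]  # normalize negative indices
    if isinstance(selection, str):
        if ":" in selection:
            # Assumption: a string containing ":" is a range such as "2:-1".
            start, _, stop = selection.partition(":")
            bounds = slice(int(start) if start else None,
                           int(stop) if stop else None)
            return list(range(n))[bounds]
        if selection not in columns:
            raise ValueError(f"unknown column: {selection!r}")
        return [columns.index(selection)]
    if isinstance(selection, list):
        # Resolve each item on its own so names and indices can be mixed.
        resolved: List[int] = []
        for item in selection:
            resolved.extend(resolve_columns(item, columns))
        return resolved
    if isinstance(selection, dict):
        if "regex" in selection:
            pattern = re.compile(selection["regex"])
            return [i for i, name in enumerate(columns) if pattern.search(name)]
        # A missing "include" means "all columns"; a missing "exclude" means none.
        included = resolve_columns(selection.get("include"), columns)
        excluded = set(resolve_columns(selection.get("exclude", []), columns))
        return [i for i in included if i not in excluded]
    raise ValueError(f"unsupported selection: {selection!r}")


columns = ["id", "date", "feature_a", "feature_b", "target"]
print(resolve_columns("2:-1", columns))                       # [2, 3]
print(resolve_columns({"exclude": ["id", "date"]}, columns))  # [2, 3, 4]
```

Returning positional indices rather than names mirrors what parse_selection is documented to return, and it keeps negative indices and ranges unambiguous.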

select(df: DataFrame, selection: int | str | List[int] | List[str] | Dict[str, Any] | slice | None) → SelectionResult

Select columns from a DataFrame.

Parameters:
  • df – The DataFrame to select columns from.

  • selection – Column selection specification. Can be:

    - None: Select all columns
    - int: Single column index
    - str: Single column name or range string ("2:-1")
    - List[int]: List of column indices
    - List[str]: List of column names
    - Dict: Complex selection (see class docstring)

Returns:

SelectionResult with indices, names, and selected data.

Raises:

ColumnSelectionError – If selection is invalid or columns not found.

exception nirs4all.data.selection.LinkingError

Bases: Exception

Raised when sample linking fails.

class nirs4all.data.selection.RoleAssigner(case_sensitive: bool = True, allow_overlap: bool = False)

Bases: object

Assign columns to data roles (features, targets, metadata).

Validates that:

  • No column is assigned to multiple roles

  • At least the features role is assigned

  • Indices are valid

Supports the same column selection syntax as ColumnSelector.

Example

>>> assigner = RoleAssigner()
>>> result = assigner.assign(df, {
...     "features": "2:-1",       # All columns except first 2 and last
...     "targets": -1,            # Last column
...     "metadata": [0, 1]        # First 2 columns
... })

assign(df: DataFrame, roles: Dict[str, int | str | List[int] | List[str] | Dict[str, Any] | slice | None]) → RoleAssignmentResult

Assign columns to roles.

Parameters:
  • df – The DataFrame to assign roles from.

  • roles – Dictionary mapping role names to column selections. Supported roles: "features", "targets", "metadata". Also accepts "x" (alias for features) and "y" (alias for targets).

Returns:

RoleAssignmentResult with separated DataFrames.

Raises:

RoleAssignmentError – If assignment is invalid (overlap, missing features).
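
The validation rules stated for RoleAssigner (no overlap, features required, valid indices) can be sketched on plain index lists. This is an illustration under assumptions, not the library's code: check_roles is a hypothetical helper, it expects selections that are already resolved to integer indices, it raises ValueError instead of RoleAssignmentError, and the effect of allow_overlap is inferred from the parameter name.

```python
from typing import Dict, List

# Aliases documented for the roles dictionary.
ALIASES = {"x": "features", "y": "targets"}
KNOWN_ROLES = ("features", "targets", "metadata")


def check_roles(roles: Dict[str, List[int]], n_columns: int,
                allow_overlap: bool = False) -> Dict[str, List[int]]:
    """Normalize role names and indices, then apply the three checks."""
    normalized: Dict[str, List[int]] = {}
    for name, indices in roles.items():
        role = ALIASES.get(name, name)
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown role: {name!r}")
        if role in normalized:
            raise ValueError(f"role given twice: {role!r}")
        resolved = []
        for index in indices:
            # Check 1: every index must point at an existing column.
            if not -n_columns <= index < n_columns:
                raise ValueError(f"index {index} out of range for role {role!r}")
            resolved.append(index % n_columns)
        normalized[role] = resolved

    # Check 2: the features role is mandatory and must not be empty.
    if not normalized.get("features"):
        raise ValueError("at least one feature column is required")

    # Check 3: a column may belong to one role only, unless overlap is allowed.
    if not allow_overlap:
        owner: Dict[int, str] = {}
        for role, indices in normalized.items():
            for index in indices:
                if owner.get(index, role) != role:
                    raise ValueError(
                        f"column {index} assigned to {owner[index]!r} and {role!r}")
                owner[index] = role
    return normalized


print(check_roles({"x": [2, 3], "y": [-1], "metadata": [0, 1]}, n_columns=5))
```

Normalizing negative indices before the overlap check matters: without it, -1 and 4 would be treated as different columns in a five-column table.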

assign_auto(df: DataFrame, target_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None = None, metadata_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None = None) → RoleAssignmentResult

Auto-assign roles with specified targets and metadata.

Features are automatically set to all remaining columns.

Parameters:
  • df – The DataFrame to assign roles from.

  • target_columns – Column selection for targets (Y).

  • metadata_columns – Column selection for metadata.

Returns:

RoleAssignmentResult with separated DataFrames.
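
The "all remaining columns" rule can be shown in a few lines. The helper below, remaining_features, is hypothetical and works on column names only; it is meant as a sketch of the rule, not of how assign_auto is implemented.

```python
from typing import List, Optional


def remaining_features(columns: List[str],
                       targets: Optional[List[str]] = None,
                       metadata: Optional[List[str]] = None) -> List[str]:
    """Return the columns claimed by neither targets nor metadata."""
    taken = set(targets or []) | set(metadata or [])
    # Keep the original column order for the feature block.
    return [name for name in columns if name not in taken]


columns = ["sample_id", "1100", "1102", "1104", "protein"]
print(remaining_features(columns, targets=["protein"], metadata=["sample_id"]))
# ['1100', '1102', '1104']
```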

extract_y_from_x(df: DataFrame, y_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None) → RoleAssignmentResult

Extract target columns from a features DataFrame.

This is useful when Y columns are embedded in the X data.

Parameters:
  • df – DataFrame containing both features and targets.

  • y_columns – Column selection for targets to extract.

Returns:

RoleAssignmentResult with features (remaining) and targets (extracted).

validate_roles(df: DataFrame, roles: Dict[str, int | str | List[int] | List[str] | Dict[str, Any] | slice | None]) → List[str]

Validate a role specification without performing assignment.

Parameters:
  • df – The DataFrame to validate against.

  • roles – Role specification to validate.

Returns:

List of warning messages (empty if no warnings).

Raises:

RoleAssignmentError – If role specification is invalid.

exception nirs4all.data.selection.RoleAssignmentError

Bases: Exception

Raised when role assignment fails.

exception nirs4all.data.selection.RowSelectionError

Bases: Exception

Raised when row selection fails.

class nirs4all.data.selection.RowSelector(default_random_state: int | None = None)

Bases: object

Flexible row selector for DataFrames.

Supports multiple selection methods:

  • All rows: None

  • By index: [0, 1, 2] or 0

  • By range: "0:100" (slice syntax as string)

  • By percentage: "0:80%" or "80%:100%"

  • By condition: {"where": {"column": "quality", "op": ">", "value": 0.5}}

  • Random sample: {"sample": 100, "random_state": 42}

  • Stratified sample: {"sample": 100, "stratify": "class", "random_state": 42}

  • Head/Tail: {"head": 100} or {"tail": 50}

Example

>>> selector = RowSelector()
>>> result = selector.select(df, "0:80%")
>>> print(len(result.data))  # 80% of rows

OPERATORS: Dict[str, Callable[[Any, Any], bool]]

Mapping from operator name to comparison function, used by "where" conditions. Keys: '!=', '<', '<=', '==', '>', '>=', 'contains', 'endswith', 'in', 'isna', 'not in', 'notna', 'regex', 'startswith'.

select(df: DataFrame, selection: int | str | List[int] | Dict[str, Any] | slice | None) → RowSelectionResult

Select rows from a DataFrame.

Parameters:
  • df – The DataFrame to select rows from.

  • selection – Row selection specification. Can be:

    - None: Select all rows
    - int: Single row index
    - str: Range string ("0:100") or percentage ("0:80%")
    - List[int]: List of row indices
    - Dict: Complex selection (see class docstring)

Returns:

RowSelectionResult with indices, mask, and selected data.

Raises:

RowSelectionError – If selection is invalid or rows not found.
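
The row selection syntaxes (ranges, percentages, "where" conditions, head/tail, random samples) can be illustrated on a list of dictionaries. The sketch below is not the library's implementation: resolve_rows and OPS are hypothetical names, rows are plain dicts rather than a DataFrame, percentage bounds are assumed to truncate toward zero, errors are ValueError rather than RowSelectionError, and stratified sampling and the isna/notna operators are left out.

```python
import operator
import random
import re
from typing import Any, Callable, Dict, List, Optional

# A subset of the documented operator names, mapped to two-argument predicates.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda value, ref: value in ref,
    "not in": lambda value, ref: value not in ref,
    "contains": lambda value, ref: ref in str(value),
    "startswith": lambda value, ref: str(value).startswith(ref),
    "endswith": lambda value, ref: str(value).endswith(ref),
    "regex": lambda value, ref: re.search(ref, str(value)) is not None,
}


def _bound(text: str, n: int) -> Optional[int]:
    """Turn one side of a range ("", "100" or "80%") into a row position."""
    text = text.strip()
    if not text:
        return None
    if text.endswith("%"):
        return int(n * float(text[:-1]) / 100)  # assumption: truncate
    return int(text)


def resolve_rows(selection: Any, rows: List[Dict[str, Any]]) -> List[int]:
    """Resolve a row selection spec to a list of row positions."""
    n = len(rows)
    everything = list(range(n))
    if selection is None:
        return everything
    if isinstance(selection, int):
        return [everything[selection]]
    if isinstance(selection, list):
        return [everything[i] for i in selection]
    if isinstance(selection, str):
        if ":" not in selection:
            raise ValueError(f"not a range: {selection!r}")
        start, _, stop = selection.partition(":")
        return everything[_bound(start, n):_bound(stop, n)]
    if "where" in selection:
        cond = selection["where"]
        test = OPS[cond["op"]]
        return [i for i, row in enumerate(rows)
                if test(row[cond["column"]], cond["value"])]
    if "head" in selection:
        return everything[:selection["head"]]
    if "tail" in selection:
        return everything[-selection["tail"]:] if selection["tail"] else []
    if "sample" in selection:
        rng = random.Random(selection.get("random_state"))
        return sorted(rng.sample(everything, min(selection["sample"], n)))
    raise ValueError(f"unsupported selection: {selection!r}")


rows = [{"quality": q / 10} for q in range(10)]
print(resolve_rows("0:80%", rows))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(resolve_rows({"where": {"column": "quality", "op": ">", "value": 0.5}}, rows))
# [6, 7, 8, 9]
```

Seeding a local random.Random instance, rather than the global generator, keeps a sample reproducible without affecting other code that draws random numbers.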

class nirs4all.data.selection.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')

Bases: object

Link samples across multiple data files by key column.

Supports multiple linking modes:

  • "inner": Keep only samples present in all sources (default)

  • "left": Keep all samples from the first source

  • "outer": Keep all samples from any source

Example

>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,    # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,     # Has columns: sample_id, target
...         "M": metadata_df,    # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column

create_sample_index(sources: Dict[str, DataFrame], link_by: str) → DataFrame

Create a sample index showing key presence across sources.

Parameters:
  • sources – Dictionary of source DataFrames.

  • link_by – Key column name.

Returns:

DataFrame with keys as index and boolean columns per source.
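
The shape of such a presence index can be sketched with plain dictionaries. Here sample_index is a hypothetical helper that takes the key values of each source as a list and returns nested dicts instead of a DataFrame; the ordering of keys (first appearance across sources) is an assumption.

```python
from typing import Dict, List


def sample_index(keys_by_source: Dict[str, List[str]]) -> Dict[str, Dict[str, bool]]:
    """Map each key to a {source name: present?} row."""
    # Collect every key once, in order of first appearance.
    ordered = list(dict.fromkeys(
        key for keys in keys_by_source.values() for key in keys))
    members = {name: set(keys) for name, keys in keys_by_source.items()}
    return {key: {name: key in members[name] for name in keys_by_source}
            for key in ordered}


index = sample_index({"X": ["s1", "s2", "s3"], "Y": ["s2", "s3", "s4"]})
for key, presence in index.items():
    print(key, presence)
```

A table like this makes it easy to see which samples an "inner" link would drop before running the link itself.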

Link multiple data sources by key column.

Parameters:
  • sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.

  • link_by – Name of the column to use for linking.

  • keep_key_column – Whether to keep the key column in output DataFrames.

Returns:

LinkingResult with linked DataFrames.

Raises:

LinkingError – If linking fails (missing key columns, no matches, etc.).
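
The three linking modes differ only in which keys survive. A minimal sketch on key lists follows; linked_keys is a hypothetical helper, the result order (order of the first source, then first appearance elsewhere) is an assumption, and ValueError stands in for LinkingError.

```python
from typing import Dict, List


def linked_keys(keys_by_source: Dict[str, List[str]], mode: str = "inner") -> List[str]:
    """Return the keys kept under the given linking mode."""
    sources = list(keys_by_source.values())
    if not sources:
        raise ValueError("no sources to link")
    first = list(dict.fromkeys(sources[0]))  # de-duplicate, keep order
    if mode == "left":
        # Every sample of the first source is kept.
        return first
    if mode == "inner":
        # Only samples present in every source are kept.
        common = set(first).intersection(*(set(keys) for keys in sources[1:]))
        return [key for key in first if key in common]
    if mode == "outer":
        # Samples from any source are kept.
        return list(dict.fromkeys(key for keys in sources for key in keys))
    raise ValueError(f"unknown mode: {mode!r}")


keys = {"X": ["s1", "s2", "s3"], "Y": ["s3", "s2", "s4"]}
print(linked_keys(keys, "inner"))  # ['s2', 's3']
print(linked_keys(keys, "left"))   # ['s1', 's2', 's3']
print(linked_keys(keys, "outer"))  # ['s1', 's2', 's3', 's4']
```

Once the surviving keys are known, each source can be reordered to that key list so that row i refers to the same sample in every linked DataFrame.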

Link sources that are already aligned by row index.

This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).

Parameters:
  • sources – Dictionary of aligned DataFrames.

  • validate – Whether to validate that all sources have same row count.

Returns:

Dictionary of DataFrames (unchanged, just validated).

Raises:

LinkingError – If validation fails.