nirs4all.data.selection package
Submodules
- nirs4all.data.selection.column_selector module
- nirs4all.data.selection.role_assigner module
  - RoleAssigner
  - RoleAssignmentError
  - RoleAssignmentResult
    - RoleAssignmentResult.features
    - RoleAssignmentResult.targets
    - RoleAssignmentResult.metadata
    - RoleAssignmentResult.feature_indices
    - RoleAssignmentResult.target_indices
    - RoleAssignmentResult.metadata_indices
    - RoleAssignmentResult.X
    - RoleAssignmentResult.y
- nirs4all.data.selection.row_selector module
- nirs4all.data.selection.sample_linker module
  - LinkingError
  - LinkingResult
    - LinkingResult.linked_data
    - LinkingResult.key_column
    - LinkingResult.matched_keys
    - LinkingResult.missing_keys
    - LinkingResult.sample_count
    - LinkingResult.report
  - SampleLinker
  - link_xy()
  - link_xym()
Module contents
Selection module for dataset configuration.
This module provides flexible column and row selection for dataset loading, supporting multiple selection syntaxes (index, name, range, regex, exclusion).
- Classes:
  - ColumnSelector: Select columns from a DataFrame using various methods
  - RowSelector: Select rows from a DataFrame using various methods
  - SampleLinker: Link samples across multiple files by key column
  - RoleAssigner: Assign columns to data roles (features, targets, metadata)
- exception nirs4all.data.selection.ColumnSelectionError
Bases: Exception
Raised when column selection fails.
- class nirs4all.data.selection.ColumnSelector(case_sensitive: bool = True)
Bases: object
Flexible column selector for DataFrames.
Supports multiple selection methods:
  - By name: ["col1", "col2"] or "col_name"
  - By index: [0, 1, 2] or 0
  - By range: "2:-1" (slice syntax as string)
  - By regex pattern: {"regex": "^feature_.*"}
  - By exclusion: {"exclude": ["id", "date"]}
  - Combined: {"include": [0, 1], "exclude": ["id"]}
Example
>>> from nirs4all.data.selection import ColumnSelector
>>> selector = ColumnSelector()
>>> result = selector.select(df, "2:-1")
>>> print(result.names)  # Column names in range
>>> print(result.data)   # Selected columns as DataFrame
- parse_selection(selection: Any, available_columns: List[str]) -> List[int]
Parse a selection specification and return column indices.
This is a convenience method for when you don’t have a DataFrame but want to validate and resolve a selection.
- Parameters:
selection – Column selection specification.
available_columns – List of available column names.
- Returns:
List of column indices.
- Raises:
ColumnSelectionError – If selection is invalid.
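To make the range syntax concrete, the sketch below resolves a slice string such as "2:-1" against a list of available column names. It is a plain-Python illustration that assumes range strings follow Python slice semantics. `resolve_range` is a hypothetical helper, not part of nirs4all, and the library's own parsing may differ.

```python
from typing import List


def resolve_range(spec: str, n_columns: int) -> List[int]:
    """Turn a slice string such as "2:-1" into explicit column indices.

    Hypothetical helper for illustration only; not the nirs4all implementation.
    """
    parts = spec.split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"not a range string: {spec!r}")
    # An empty field means "use the slice default", as in Python's own syntax.
    bounds = [int(p) if p.strip() else None for p in parts]
    return list(range(n_columns))[slice(*bounds)]


available_columns = ["id", "date", "f1", "f2", "f3", "target"]
indices = resolve_range("2:-1", len(available_columns))
print(indices)                                  # [2, 3, 4]
print([available_columns[i] for i in indices])  # ['f1', 'f2', 'f3']
```

Working on indices rather than on a DataFrame mirrors the stated purpose of `parse_selection`: a specification can be checked before any data is loaded.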
- select(df: DataFrame, selection: int | str | List[int] | List[str] | Dict[str, Any] | slice | None) -> SelectionResult
Select columns from a DataFrame.
- Parameters:
df – The DataFrame to select columns from.
selection – Column selection specification. Can be:
  - None: Select all columns
  - int: Single column index
  - str: Single column name or range string ("2:-1")
  - List[int]: List of column indices
  - List[str]: List of column names
  - Dict: Complex selection (see class docstring)
- Returns:
SelectionResult with indices, names, and selected data.
- Raises:
ColumnSelectionError – If selection is invalid or columns not found.
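The dictionary forms (regex, exclusion, combined include/exclude) can be read as set operations on column indices. The sketch below is one possible interpretation written in plain Python. `resolve_dict_selection` is a hypothetical name, and the rule that "exclude" is applied after "include" or "regex" is an assumption rather than documented behaviour.

```python
import re
from typing import Any, Dict, List


def resolve_dict_selection(selection: Dict[str, Any], available: List[str]) -> List[int]:
    """Resolve a dict-style selection to sorted column indices (illustrative only)."""

    def to_index(item: Any) -> int:
        # Integers are positions (negative values count from the end); strings are names.
        if isinstance(item, int):
            return range(len(available))[item]
        return available.index(item)

    if "regex" in selection:
        pattern = re.compile(selection["regex"])
        chosen = {i for i, name in enumerate(available) if pattern.search(name)}
    elif "include" in selection:
        chosen = {to_index(item) for item in selection["include"]}
    else:
        # Assumption: with only "exclude", start from every column.
        chosen = set(range(len(available)))
    # Assumption: exclusions are removed last, whatever produced the candidates.
    chosen -= {to_index(item) for item in selection.get("exclude", [])}
    return sorted(chosen)


available = ["id", "date", "feature_a", "feature_b", "target"]
print(resolve_dict_selection({"regex": "^feature_.*"}, available))                # [2, 3]
print(resolve_dict_selection({"exclude": ["id", "date"]}, available))             # [2, 3, 4]
print(resolve_dict_selection({"include": [0, 1], "exclude": ["id"]}, available))  # [1]
```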
- exception nirs4all.data.selection.LinkingError
Bases: Exception
Raised when sample linking fails.
- class nirs4all.data.selection.RoleAssigner(case_sensitive: bool = True, allow_overlap: bool = False)
Bases: object
Assign columns to data roles (features, targets, metadata).
Validates that:
  - No column is assigned to multiple roles
  - At least features are assigned
  - Indices are valid
Supports the same column selection syntax as ColumnSelector.
Example
>>> from nirs4all.data.selection import RoleAssigner
>>> assigner = RoleAssigner()
>>> result = assigner.assign(df, {
...     "features": "2:-1",  # All columns except first 2 and last
...     "targets": -1,       # Last column
...     "metadata": [0, 1]   # First 2 columns
... })
- assign(df: DataFrame, roles: Dict[str, int | str | List[int] | List[str] | Dict[str, Any] | slice | None]) -> RoleAssignmentResult
Assign columns to roles.
- Parameters:
df – The DataFrame to assign roles from.
roles – Dictionary mapping role names to column selections. Supported roles: "features", "targets", "metadata". Also accepts the aliases "x" (for features) and "y" (for targets).
- Returns:
RoleAssignmentResult with separated DataFrames.
- Raises:
RoleAssignmentError – If assignment is invalid (overlap, missing features).
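The validation rules (aliases, mandatory features, no column in two roles) can be sketched on already-resolved column indices. This is an illustration only: `check_roles` is a hypothetical helper, it raises `ValueError` instead of `RoleAssignmentError`, and treating `allow_overlap=True` as "skip the overlap check" is an assumption based on the parameter name.

```python
from typing import Dict, List

# Aliases documented for the roles dictionary.
ALIASES = {"x": "features", "y": "targets"}


def check_roles(resolved: Dict[str, List[int]], allow_overlap: bool = False) -> Dict[str, List[int]]:
    """Normalise aliases and enforce the documented rules on resolved indices."""
    roles: Dict[str, List[int]] = {}
    for name, indices in resolved.items():
        roles[ALIASES.get(name, name)] = list(indices)

    if not roles.get("features"):
        raise ValueError("at least one feature column is required")

    if not allow_overlap:
        owner: Dict[int, str] = {}
        for role, indices in roles.items():
            for idx in indices:
                if idx in owner and owner[idx] != role:
                    raise ValueError(
                        f"column {idx} is assigned to both {owner[idx]!r} and {role!r}"
                    )
                owner[idx] = role
    return roles


print(check_roles({"x": [2, 3, 4], "y": [5], "metadata": [0, 1]}))
# {'features': [2, 3, 4], 'targets': [5], 'metadata': [0, 1]}
```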
- assign_auto(df: DataFrame, target_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None = None, metadata_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None = None) -> RoleAssignmentResult
Auto-assign roles with specified targets and metadata.
Features are automatically set to all remaining columns.
- Parameters:
df – The DataFrame to assign roles from.
target_columns – Column selection for targets (Y).
metadata_columns – Column selection for metadata.
- Returns:
RoleAssignmentResult with separated DataFrames.
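The "all remaining columns" rule amounts to a set difference on column positions. A minimal sketch, assuming targets and metadata have already been resolved to integer indices (`remaining_columns` is a hypothetical helper, not a nirs4all function):

```python
from typing import Iterable, List


def remaining_columns(n_columns: int, targets: Iterable[int], metadata: Iterable[int] = ()) -> List[int]:
    """Return the indices left for features once targets and metadata are removed."""
    # range(...)[i] normalises negative indices and rejects out-of-range ones.
    taken = {range(n_columns)[i] for i in (*targets, *metadata)}
    return [i for i in range(n_columns) if i not in taken]


# Six columns: two metadata columns first, one target column last.
print(remaining_columns(6, targets=[-1], metadata=[0, 1]))  # [2, 3, 4]
```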
- extract_y_from_x(df: DataFrame, y_columns: int | str | List[int] | List[str] | Dict[str, Any] | slice | None) -> RoleAssignmentResult
Extract target columns from a features DataFrame.
This is useful when Y columns are embedded in the X data.
- Parameters:
df – DataFrame containing both features and targets.
y_columns – Column selection for targets to extract.
- Returns:
RoleAssignmentResult with features (remaining) and targets (extracted).
- validate_roles(df: DataFrame, roles: Dict[str, int | str | List[int] | List[str] | Dict[str, Any] | slice | None]) -> List[str]
Validate a role specification without performing assignment.
- Parameters:
df – The DataFrame to validate against.
roles – Role specification to validate.
- Returns:
List of warning messages (empty if no warnings).
- Raises:
RoleAssignmentError – If role specification is invalid.
- exception nirs4all.data.selection.RoleAssignmentError
Bases: Exception
Raised when role assignment fails.
- exception nirs4all.data.selection.RowSelectionError
Bases: Exception
Raised when row selection fails.
- class nirs4all.data.selection.RowSelector(default_random_state: int | None = None)
Bases: object
Flexible row selector for DataFrames.
Supports multiple selection methods:
  - All rows: None
  - By index: [0, 1, 2] or 0
  - By range: "0:100" (slice syntax as string)
  - By percentage: "0:80%" or "80%:100%"
  - By condition: {"where": {"column": "quality", "op": ">", "value": 0.5}}
  - Random sample: {"sample": 100, "random_state": 42}
  - Stratified sample: {"sample": 100, "stratify": "class", "random_state": 42}
  - Head/Tail: {"head": 100} or {"tail": 50}
Example
>>> from nirs4all.data.selection import RowSelector
>>> selector = RowSelector()
>>> result = selector.select(df, "0:80%")
>>> print(len(result.data))  # 80% of rows
- OPERATORS: Dict[str, Callable[[Any, Any], bool]]
Class attribute mapping operator names to two-argument comparison functions. Available keys: '!=', '<', '<=', '==', '>', '>=', 'contains', 'endswith', 'in', 'isna', 'not in', 'notna', 'regex', 'startswith'.
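A table of named two-argument predicates is enough to evaluate a condition of the form {"column": ..., "op": ..., "value": ...}. The sketch below shows the idea on plain dictionaries instead of a DataFrame. The operator bodies and the `where` helper are assumptions for illustration; only the operator names and the (value, reference) -> bool shape come from the documented attribute.

```python
import operator
import re
from typing import Any, Callable, Dict, List

# A subset of the documented operator names, with assumed semantics.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt,
    "<=": operator.le,
    "==": operator.eq,
    "in": lambda value, options: value in options,
    "startswith": lambda value, prefix: str(value).startswith(prefix),
    "regex": lambda value, pattern: re.search(pattern, str(value)) is not None,
}


def where(rows: List[Dict[str, Any]], condition: Dict[str, Any]) -> List[int]:
    """Return the positions of rows whose column value satisfies the condition."""
    test = OPERATORS[condition["op"]]
    column = condition["column"]
    return [i for i, row in enumerate(rows) if test(row[column], condition.get("value"))]


rows = [
    {"id": "a1", "quality": 0.9},
    {"id": "b1", "quality": 0.3},
    {"id": "a2", "quality": 0.7},
]
print(where(rows, {"column": "quality", "op": ">", "value": 0.5}))    # [0, 2]
print(where(rows, {"column": "id", "op": "startswith", "value": "b"}))  # [1]
```

Keeping the operators in a dictionary means an unknown "op" surfaces as a simple key lookup failure, which is easy to turn into a dedicated error.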
- select(df: DataFrame, selection: int | str | List[int] | Dict[str, Any] | slice | None) -> RowSelectionResult
Select rows from a DataFrame.
- Parameters:
df – The DataFrame to select rows from.
selection – Row selection specification. Can be:
  - None: Select all rows
  - int: Single row index
  - str: Range string ("0:100") or percentage ("0:80%")
  - List[int]: List of row indices
  - Dict: Complex selection (see class docstring)
- Returns:
RowSelectionResult with indices, mask, and selected data.
- Raises:
RowSelectionError – If selection is invalid or rows not found.
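Percentage ranges such as "0:80%" and "80%:100%" can be reduced to ordinary row bounds. The following sketch shows one way to do that in plain Python. `resolve_row_range` is a hypothetical helper, and truncating a fractional bound to a whole row is an assumption; nirs4all may round differently.

```python
from typing import List


def resolve_row_range(spec: str, n_rows: int) -> List[int]:
    """Resolve "start:stop" where each bound is an integer or a percentage."""

    def bound(token: str, default: int) -> int:
        token = token.strip()
        if not token:
            return default
        if token.endswith("%"):
            # Assumption: percentage bounds are truncated to a whole row count.
            return int(n_rows * float(token[:-1]) / 100)
        return int(token)

    start_text, stop_text = spec.split(":")
    return list(range(n_rows))[bound(start_text, 0):bound(stop_text, n_rows)]


train = resolve_row_range("0:80%", 10)
test = resolve_row_range("80%:100%", 10)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

With the same truncation applied to both bounds, the two ranges above partition the rows without overlap, which is what a simple train/test split by position needs.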
- class nirs4all.data.selection.SampleLinker(mode: str = 'inner', on_missing: str = 'warn')
Bases: object
Link samples across multiple data files by key column.
Supports multiple linking modes:
  - "inner": Keep only samples present in all sources (default)
  - "left": Keep all samples from the first source
  - "outer": Keep all samples from any source
Example
>>> from nirs4all.data.selection import SampleLinker
>>> linker = SampleLinker()
>>> result = linker.link(
...     {
...         "X": features_df,  # Has columns: sample_id, feature1, feature2
...         "Y": targets_df,   # Has columns: sample_id, target
...         "M": metadata_df,  # Has columns: sample_id, group, date
...     },
...     link_by="sample_id"
... )
>>> # Linked DataFrames have aligned rows
>>> X_linked = result.linked_data["X"]  # Without sample_id column
- create_sample_index(sources: Dict[str, DataFrame], link_by: str) -> DataFrame
Create a sample index showing key presence across sources.
- Parameters:
sources – Dictionary of source DataFrames.
link_by – Key column name.
- Returns:
DataFrame with keys as index and boolean columns per source.
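As a rough picture of what such an index contains, the sketch below builds the same key-by-source presence table from plain lists of keys. It returns a nested dictionary rather than a DataFrame, `sample_index` is a hypothetical name, and sorting the keys is an assumption about ordering.

```python
from typing import Dict, List


def sample_index(keys_by_source: Dict[str, List[str]]) -> Dict[str, Dict[str, bool]]:
    """Map every key to a {source name: present?} row."""
    all_keys = sorted(set().union(*keys_by_source.values()))
    return {
        key: {name: key in keys for name, keys in keys_by_source.items()}
        for key in all_keys
    }


index = sample_index({"X": ["s1", "s2", "s3"], "Y": ["s2", "s3", "s4"]})
for key, presence in index.items():
    print(key, presence)
# s1 {'X': True, 'Y': False}
# s2 {'X': True, 'Y': True}
# s3 {'X': True, 'Y': True}
# s4 {'X': False, 'Y': True}
```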
- link(sources: Dict[str, DataFrame], link_by: str, keep_key_column: bool = False) -> LinkingResult
Link multiple data sources by key column.
- Parameters:
sources – Dictionary mapping source names to DataFrames. Each DataFrame must have the key column.
link_by – Name of the column to use for linking.
keep_key_column – Whether to keep the key column in output DataFrames.
- Returns:
LinkingResult with linked DataFrames.
- Raises:
LinkingError – If linking fails (missing key columns, no matches, etc.).
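The three linking modes differ only in which keys survive. The sketch below computes the surviving keys for each mode from plain key lists. It is an illustration under assumptions: `linked_keys` is a hypothetical helper, and the output ordering (order of first appearance) is not taken from the nirs4all documentation.

```python
from typing import Dict, List


def linked_keys(keys_by_source: Dict[str, List[str]], mode: str = "inner") -> List[str]:
    """Return the sample keys kept under the given linking mode."""
    sources = list(keys_by_source.values())
    first = sources[0]
    if mode == "left":
        # Every sample of the first source, whether or not others have it.
        return list(first)
    if mode == "inner":
        # Only samples present in all sources.
        common = set(first).intersection(*map(set, sources[1:]))
        return [key for key in first if key in common]
    if mode == "outer":
        # Any sample seen in any source, in order of first appearance.
        ordered: Dict[str, None] = {}
        for keys in sources:
            ordered.update(dict.fromkeys(keys))
        return list(ordered)
    raise ValueError(f"unknown mode: {mode!r}")


keys = {"X": ["s1", "s2", "s3"], "Y": ["s2", "s3", "s4"]}
print(linked_keys(keys, "inner"))  # ['s2', 's3']
print(linked_keys(keys, "left"))   # ['s1', 's2', 's3']
print(linked_keys(keys, "outer"))  # ['s1', 's2', 's3', 's4']
```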
- link_aligned(sources: Dict[str, DataFrame], validate: bool = True) -> Dict[str, DataFrame]
Link sources that are already aligned by row index.
This is a simpler linking method for sources that are guaranteed to have matching rows (same samples in same order).
- Parameters:
sources – Dictionary of aligned DataFrames.
validate – Whether to validate that all sources have same row count.
- Returns:
Dictionary of DataFrames (unchanged, just validated).
- Raises:
LinkingError – If validation fails.
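The row-count validation described for aligned sources can be sketched in a few lines. The example uses plain sequences in place of DataFrames and raises `ValueError` instead of `LinkingError`; `check_aligned` is a hypothetical helper.

```python
from typing import Dict, Sequence


def check_aligned(sources: Dict[str, Sequence]) -> Dict[str, Sequence]:
    """Return the sources unchanged if they all have the same number of rows."""
    counts = {name: len(rows) for name, rows in sources.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"row counts differ: {counts}")
    return sources


aligned = check_aligned({"X": [[0.1, 0.2], [0.3, 0.4]], "Y": [1.0, 2.0]})
print(sorted(aligned))  # ['X', 'Y']
```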