nirs4all.data.partition package

Submodules

Module contents

Partition module for dataset configuration.

This module provides flexible partition assignment for dataset loading, supporting static, column-based, percentage-based, and index-based partition methods.

Classes:

  • PartitionAssigner: Assign rows to train/test/predict partitions

  • PartitionError: Raised when partition assignment fails

Supported partition methods:
  • Static: Assign entire file to a partition

  • Column-based: Partition based on column values

  • Percentage-based: Split by percentage with optional shuffle/stratify

  • Index-based: Explicit index lists or external files

class nirs4all.data.partition.PartitionAssigner(default_random_state: int | None = None, base_path: Path | None = None)

Bases: object

Flexible partition assigner for DataFrames.

Supports multiple partition methods:

  • Static: "train", "test", "predict" (assign entire DataFrame)

  • Column-based: {"column": "split", "train_values": […], "test_values": […]}

  • Percentage-based: {"train": "80%", "test": "20%", "shuffle": True}

  • Index-based: {"train": [0,1,2], "test": [3,4,5]}

  • Index file: {"train_file": "train_idx.txt", "test_file": "test_idx.txt"}
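The specification shapes listed above can be told apart by their type and keys. The following is a minimal sketch of such a dispatch, not the library's implementation: `classify_spec` is a hypothetical helper, the detection rules are assumptions, and the `train_values`/`test_values` contents are placeholder values chosen only for illustration.

```python
from typing import Any, Dict, Union

Spec = Union[str, Dict[str, Any], None]


def classify_spec(spec: Spec) -> str:
    """Name the partition method a spec appears to describe (illustrative only)."""
    if spec is None:
        return "none"
    if isinstance(spec, str):
        return "static"
    if "column" in spec:
        return "column"
    if any(key.endswith("_file") for key in spec):
        return "index_file"
    # Collect the values given for the three partition names, if any.
    values = [spec[name] for name in ("train", "test", "predict") if name in spec]
    if values and all(isinstance(v, str) and v.endswith("%") for v in values):
        return "percentage"
    if values and all(isinstance(v, (list, tuple)) for v in values):
        return "index"
    return "unknown"


specs = [
    "train",
    # Placeholder value lists: the documentation does not give concrete ones.
    {"column": "split", "train_values": ["cal"], "test_values": ["val"]},
    {"train": "80%", "test": "20%", "shuffle": True},
    {"train": [0, 1, 2], "test": [3, 4, 5]},
    {"train_file": "train_idx.txt", "test_file": "test_idx.txt"},
]
for spec in specs:
    print(classify_spec(spec), spec)
```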

Example

>>> assigner = PartitionAssigner()
>>> result = assigner.assign(df, {"train": "80%", "test": "20%"})
>>> print(len(result.train_data), len(result.test_data))

DEFAULT_PREDICT_VALUES = ('predict', 'prediction', 'unknown')
DEFAULT_TEST_VALUES = ('test', 'testing', 'val', 'validation', 'valid')
DEFAULT_TRAIN_VALUES = ('train', 'training', 'cal', 'calibration')
PARTITION_NAMES = ('train', 'test', 'predict')
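The default value tuples suggest how a label found in a partition column could be mapped to a partition name. The sketch below illustrates that idea in plain Python. `partition_for_label` is a hypothetical helper, and the whitespace stripping and case-insensitive comparison are assumptions; the documentation does not state how the library normalizes labels.

```python
from typing import Optional

# Copied from the class attributes documented above.
DEFAULT_TRAIN_VALUES = ('train', 'training', 'cal', 'calibration')
DEFAULT_TEST_VALUES = ('test', 'testing', 'val', 'validation', 'valid')
DEFAULT_PREDICT_VALUES = ('predict', 'prediction', 'unknown')


def partition_for_label(label: object) -> Optional[str]:
    """Map a column label to 'train', 'test' or 'predict', or None if unmatched."""
    # Assumption: labels are compared after stripping and lowercasing.
    normalized = str(label).strip().lower()
    if normalized in DEFAULT_TRAIN_VALUES:
        return "train"
    if normalized in DEFAULT_TEST_VALUES:
        return "test"
    if normalized in DEFAULT_PREDICT_VALUES:
        return "predict"
    return None


for label in ["Calibration", "val", "unknown", "holdout"]:
    print(label, partition_for_label(label))
```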
assign(df: DataFrame, partition: str | Dict[str, Any] | None) → PartitionResult

Assign rows to partitions.

Parameters:
  • df – The DataFrame to partition.

  • partition – Partition specification. Can be:

      - str: Static partition ("train", "test", "predict")
      - dict: Complex partition (column-based, percentage, or index)
      - None: No partitioning (returns empty result)

Returns:

PartitionResult with indices and data for each partition.

Raises:

PartitionError – If partition specification is invalid.
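To make the percentage-based form concrete, here is a minimal sketch of how a specification such as {"train": "80%", "test": "20%"} could be turned into index lists. It is a standard-library illustration under stated assumptions, not the code behind `assign`: `parse_percent` and `percentage_split` are hypothetical names, the rounding rule is an assumption, and stratification is left out.

```python
import random
from typing import Dict, List, Optional


def parse_percent(text: str) -> float:
    """Convert a string such as '80%' into the fraction 0.8."""
    if not text.endswith("%"):
        raise ValueError(f"expected a percentage string, got {text!r}")
    return float(text[:-1]) / 100.0


def percentage_split(
    n_rows: int,
    train: str = "80%",
    test: str = "20%",
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Split row positions 0..n_rows-1 into train and test index lists."""
    train_frac = parse_percent(train)
    test_frac = parse_percent(test)
    if train_frac + test_frac > 1.0 + 1e-9:
        raise ValueError("train and test percentages exceed 100%")
    indices = list(range(n_rows))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible.
        random.Random(seed).shuffle(indices)
    # Assumption: counts are rounded, and test never overlaps train.
    n_train = round(n_rows * train_frac)
    n_test = min(round(n_rows * test_frac), n_rows - n_train)
    return {
        "train": indices[:n_train],
        "test": indices[n_train:n_train + n_test],
    }


split = percentage_split(10, train="80%", test="20%")
print(split["train"], split["test"])
```

Passing the same `seed` with `shuffle=True` yields the same split on every call, which is the role a setting like `default_random_state` would typically play.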

concatenate_partitions(results: Sequence[PartitionResult]) → PartitionResult

Concatenate multiple partition results.

Useful when combining multiple files with the same partition. Indices are adjusted to account for concatenation order.

Parameters:

results – Sequence of PartitionResult objects.

Returns:

Combined PartitionResult.
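The index adjustment can be pictured as shifting each result's indices by the number of rows that precede it in the concatenated data. The sketch below shows that idea with plain lists. `concatenate_indices` is a hypothetical helper, and passing an explicit row count per part is an assumption; the documentation does not say how the library derives the offsets.

```python
from typing import Dict, List, Sequence, Tuple

# Each part: (number of rows in that file, indices per partition name).
Part = Tuple[int, Dict[str, List[int]]]


def concatenate_indices(parts: Sequence[Part]) -> Dict[str, List[int]]:
    """Merge per-file indices into positions within the concatenated data."""
    combined: Dict[str, List[int]] = {"train": [], "test": [], "predict": []}
    offset = 0
    for n_rows, indices in parts:
        for name in combined:
            # Shift local positions by the rows contributed by earlier parts.
            combined[name].extend(i + offset for i in indices.get(name, []))
        offset += n_rows
    return combined


parts = [
    (3, {"train": [0, 1], "test": [2]}),
    (2, {"train": [0], "predict": [1]}),
]
print(concatenate_indices(parts))
```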

exception nirs4all.data.partition.PartitionError

Bases: Exception

Raised when partition assignment fails.

class nirs4all.data.partition.PartitionResult(train_indices: List[int] = <factory>, test_indices: List[int] = <factory>, predict_indices: List[int] = <factory>, train_data: DataFrame | None = None, test_data: DataFrame | None = None, predict_data: DataFrame | None = None, partition_column: str | None = None)

Bases: object

Result of a partition assignment operation.

train_indices

List of indices assigned to training partition.

Type:

List[int]

test_indices

List of indices assigned to test partition.

Type:

List[int]

predict_indices

List of indices assigned to predict partition (no targets).

Type:

List[int]

train_data

DataFrame subset for training.

Type:

pandas.core.frame.DataFrame | None

test_data

DataFrame subset for testing.

Type:

pandas.core.frame.DataFrame | None

predict_data

DataFrame subset for prediction.

Type:

pandas.core.frame.DataFrame | None

partition_column

Name of column used for partitioning (if column-based).

Type:

str | None

get_data(partition: Literal['train', 'test', 'predict']) → DataFrame | None

Get data for a specific partition.

get_indices(partition: Literal['train', 'test', 'predict']) → List[int]

Get indices for a specific partition.

property has_predict: bool

Check if predict data exists.

property has_test: bool

Check if test data exists.

property has_train: bool

Check if training data exists.
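The `<factory>` defaults in the signature indicate dataclass fields built with a default factory, so each instance gets its own index lists. The stand-in below sketches that pattern together with a name-based accessor. `IndexOnlyResult` is a hypothetical, simplified class that omits the DataFrame fields, and the body of `get_indices` is an assumption about one way such a lookup could work.

```python
from dataclasses import dataclass, field
from typing import List

PARTITION_NAMES = ("train", "test", "predict")


@dataclass
class IndexOnlyResult:
    """Simplified stand-in holding only the three index lists."""

    train_indices: List[int] = field(default_factory=list)
    test_indices: List[int] = field(default_factory=list)
    predict_indices: List[int] = field(default_factory=list)

    def get_indices(self, partition: str) -> List[int]:
        """Return the index list for 'train', 'test' or 'predict'."""
        if partition not in PARTITION_NAMES:
            raise ValueError(f"unknown partition: {partition!r}")
        return getattr(self, f"{partition}_indices")


first = IndexOnlyResult(train_indices=[0, 1, 2])
second = IndexOnlyResult()
# Appending to one instance does not leak into the other.
first.test_indices.append(3)
print(first.get_indices("test"), second.get_indices("test"))
```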
