nirs4all.data.partition.partition_assigner module
Partition assigner for dataset configuration.
This module provides flexible partition assignment for DataFrames, supporting multiple assignment methods including static, column-based, percentage-based, and index-based partitions.
Example
>>> assigner = PartitionAssigner()
>>> # Static partition
>>> result = assigner.assign(df, partition="train")
>>> # Column-based partition
>>> result = assigner.assign(df, {
... "column": "split",
... "train_values": ["train", "training"],
... "test_values": ["test", "validation"]
... })
>>> # Percentage-based partition
>>> result = assigner.assign(df, {
... "train": "80%",
... "test": "20%",
... "shuffle": True,
... "random_state": 42
... })
- class nirs4all.data.partition.partition_assigner.PartitionAssigner(default_random_state: int | None = None, base_path: Path | None = None)[source]
Bases: object
Flexible partition assigner for DataFrames.
Supports multiple partition methods:
- Static: "train", "test", "predict" (assign entire DataFrame)
- Column-based: {"column": "split", "train_values": […], "test_values": […]}
- Percentage-based: {"train": "80%", "test": "20%", "shuffle": True}
- Index-based: {"train": [0,1,2], "test": [3,4,5]}
- Index file: {"train_file": "train_idx.txt", "test_file": "test_idx.txt"}
Example
>>> assigner = PartitionAssigner()
>>> result = assigner.assign(df, {"train": "80%", "test": "20%"})
>>> print(len(result.train_data), len(result.test_data))
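The five specification shapes listed for this class can be told apart by their type and keys. The sketch below is one plausible way to do that; it is not taken from the nirs4all source. The helper name `classify_partition_spec`, the returned labels, and the use of `ValueError` in place of `PartitionError` are all assumptions made for illustration.

```python
from typing import Any, Dict, Union

PARTITION_NAMES = ("train", "test", "predict")


def classify_partition_spec(spec: Union[str, Dict[str, Any], None]) -> str:
    """Label the kind of partition spec (hypothetical helper, not library API)."""
    if spec is None:
        return "none"
    if isinstance(spec, str):
        # Static partition: the whole DataFrame goes to one named partition.
        if spec not in PARTITION_NAMES:
            raise ValueError(f"unknown static partition: {spec!r}")
        return "static"
    if "column" in spec:
        return "column"
    if any(f"{name}_file" in spec for name in PARTITION_NAMES):
        return "index_file"
    values = [spec[name] for name in PARTITION_NAMES if name in spec]
    if values and all(isinstance(v, str) and v.endswith("%") for v in values):
        return "percentage"
    if values and all(isinstance(v, (list, tuple)) for v in values):
        return "index"
    raise ValueError(f"unrecognised partition spec: {spec!r}")
```

Extra keys such as "shuffle" or "random_state" are ignored by this sketch when deciding the kind, so a percentage spec with options is still labelled "percentage".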
- DEFAULT_PREDICT_VALUES = ('predict', 'prediction', 'unknown')
- DEFAULT_TEST_VALUES = ('test', 'testing', 'val', 'validation', 'valid')
- DEFAULT_TRAIN_VALUES = ('train', 'training', 'cal', 'calibration')
- PARTITION_NAMES = ('train', 'test', 'predict')
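To illustrate how the default value tuples above could drive a column-based split, here is a minimal sketch that maps a sequence of column labels to row positions per partition. The function name `indices_by_label` is hypothetical, and two behaviours are assumptions rather than documented facts: labels are matched exactly (no case folding), and labels that match none of the tuples are skipped.

```python
from typing import Dict, List, Sequence

# Copied from the class attributes documented above.
DEFAULT_TRAIN_VALUES = ("train", "training", "cal", "calibration")
DEFAULT_TEST_VALUES = ("test", "testing", "val", "validation", "valid")
DEFAULT_PREDICT_VALUES = ("predict", "prediction", "unknown")


def indices_by_label(labels: Sequence[str]) -> Dict[str, List[int]]:
    """Group row positions by partition (hypothetical helper, not library API)."""
    # Build a reverse lookup: column value -> partition name.
    lookup: Dict[str, str] = {}
    for name, values in (
        ("train", DEFAULT_TRAIN_VALUES),
        ("test", DEFAULT_TEST_VALUES),
        ("predict", DEFAULT_PREDICT_VALUES),
    ):
        for value in values:
            lookup[value] = name

    result: Dict[str, List[int]] = {"train": [], "test": [], "predict": []}
    for position, label in enumerate(labels):
        name = lookup.get(label)
        if name is not None:  # assumption: unmatched labels are ignored
            result[name].append(position)
    return result
```

With a pandas DataFrame, the same function could be fed `df["split"].tolist()`.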
- assign(df: DataFrame, partition: str | Dict[str, Any] | None) PartitionResult[source]
Assign rows to partitions.
- Parameters:
df – The DataFrame to partition.
partition – Partition specification. Can be:
- str: Static partition ("train", "test", "predict")
- dict: Complex partition (column-based, percentage, or index)
- None: No partitioning (returns empty result)
- Returns:
PartitionResult with indices and data for each partition.
- Raises:
PartitionError – If partition specification is invalid.
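As a rough illustration of the percentage-based case, the sketch below splits row positions into train and test lists from a string such as "80%". It uses only the standard library; the helper names `parse_percent` and `percentage_split`, the rounding rule, and the choice of `random.Random` for shuffling are assumptions, so results will not necessarily match what `assign` produces for the same `random_state`.

```python
import random
from typing import List, Optional, Tuple


def parse_percent(text: str) -> float:
    """Turn '80%' into 80.0 (hypothetical helper)."""
    if not text.endswith("%"):
        raise ValueError(f"expected a percentage like '80%', got {text!r}")
    return float(text[:-1])


def percentage_split(
    n_rows: int,
    train_pct: float,
    shuffle: bool = False,
    random_state: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Split positions 0..n_rows-1 into (train, test) by percentage."""
    if not 0.0 <= train_pct <= 100.0:
        raise ValueError("train_pct must be between 0 and 100")
    positions = list(range(n_rows))
    if shuffle:
        # A seeded generator makes the shuffle reproducible.
        random.Random(random_state).shuffle(positions)
    n_train = round(n_rows * train_pct / 100.0)  # assumption: nearest integer
    return positions[:n_train], positions[n_train:]
```

Without shuffling, the first rows go to train and the remainder to test, which preserves any ordering already present in the data.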
- concatenate_partitions(results: Sequence[PartitionResult]) PartitionResult[source]
Concatenate multiple partition results.
Useful when combining multiple files with the same partition. Indices are adjusted to account for concatenation order.
- Parameters:
results – Sequence of PartitionResult objects.
- Returns:
Combined PartitionResult.
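The index adjustment described for `concatenate_partitions` can be pictured as adding a running row offset to each file's indices. The sketch below shows that idea on plain dictionaries; the function name `concatenate_indices` and the explicit per-file row count are assumptions, since the documentation does not say how the library determines each file's length.

```python
from typing import Dict, List, Sequence, Tuple


def concatenate_indices(
    parts: Sequence[Tuple[int, Dict[str, List[int]]]],
) -> Dict[str, List[int]]:
    """Merge per-file partition indices into positions in the stacked data.

    Each item in `parts` is (n_rows_in_file, {"train": [...], "test": [...]}).
    Hypothetical helper, not library API.
    """
    combined: Dict[str, List[int]] = {"train": [], "test": [], "predict": []}
    offset = 0
    for n_rows, indices in parts:
        for name in combined:
            # Shift local positions by the number of rows in earlier files.
            combined[name].extend(offset + i for i in indices.get(name, []))
        offset += n_rows
    return combined
```

For example, row 0 of a second file that follows a three-row file ends up at position 3 in the combined result.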
- exception nirs4all.data.partition.partition_assigner.PartitionError[source]
Bases: Exception
Raised when partition assignment fails.
- class nirs4all.data.partition.partition_assigner.PartitionResult(train_indices: List[int] = <factory>, test_indices: List[int] = <factory>, predict_indices: List[int] = <factory>, train_data: DataFrame | None = None, test_data: DataFrame | None = None, predict_data: DataFrame | None = None, partition_column: str | None = None)[source]
Bases: object
Result of a partition assignment operation.
- train_data
DataFrame subset for training.
- Type:
pandas.core.frame.DataFrame | None
- test_data
DataFrame subset for testing.
- Type:
pandas.core.frame.DataFrame | None
- predict_data
DataFrame subset for prediction.
- Type:
pandas.core.frame.DataFrame | None
- get_data(partition: Literal['train', 'test', 'predict']) DataFrame | None[source]
Get data for a specific partition.
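The `<factory>` defaults in the signature suggest a dataclass with one fresh list per instance. The stand-in below mirrors the documented fields to show how a `get_data` lookup could work. The class name `PartitionResultSketch` is hypothetical, the data fields are typed as `Any` to avoid a pandas dependency, and raising `ValueError` on an unknown partition name is an assumption about behaviour the documentation does not describe.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class PartitionResultSketch:
    """Illustrative stand-in for PartitionResult (not the library class)."""

    train_indices: List[int] = field(default_factory=list)
    test_indices: List[int] = field(default_factory=list)
    predict_indices: List[int] = field(default_factory=list)
    train_data: Optional[Any] = None
    test_data: Optional[Any] = None
    predict_data: Optional[Any] = None
    partition_column: Optional[str] = None

    def get_data(self, partition: str) -> Optional[Any]:
        """Return the data stored for 'train', 'test', or 'predict'."""
        if partition not in ("train", "test", "predict"):
            raise ValueError(f"unknown partition: {partition!r}")
        return getattr(self, f"{partition}_data")
```

A partition that was never assigned simply returns None, matching the `DataFrame | None` return type shown above.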