# Force Group Splitting: Universal Group Support

This guide explains the `force_group` parameter, which lets **any sklearn-compatible splitter** work with grouped samples so that all samples from the same group land on the same side of every train/test split.

## Quick Start

```python
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold

# Use KFold with group-awareness
{"split": KFold(n_splits=5), "force_group": "Sample_ID"}

# Use ShuffleSplit with group-awareness
{"split": ShuffleSplit(test_size=0.2), "force_group": "Sample_ID"}

# Stratify on binned y values (for regression)
{"split": KFold(n_splits=5), "force_group": "y", "n_bins": 10}
```

## Why Force Group?

### The Problem

When dealing with repeated measurements (multiple spectra per sample), standard cross-validation can cause **data leakage**:

```python
# Problem: KFold may put measurements from the same sample in both train and test
{"split": KFold(n_splits=5)}  # Data leakage risk!
```
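The leakage can be made concrete with plain scikit-learn. The following is a minimal sketch on made-up data (the `sample_ids` layout and array shapes are assumptions for illustration, not part of the library): each of 20 samples is measured in 5 rounds, rows are stored round by round, and the check counts how many folds share a `Sample_ID` between train and test.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 20 physical samples measured in 5 rounds (100 rows),
# stored round by round, so each sample's rows are spread across the array.
sample_ids = np.tile(np.arange(20), 5)
X = np.zeros((sample_ids.size, 3))  # Feature values do not matter for KFold.

leaky_folds = 0
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # A Sample_ID present on both sides of the split leaks information.
    shared = np.intersect1d(sample_ids[train_idx], sample_ids[test_idx])
    if shared.size > 0:
        leaky_folds += 1

print(f"{leaky_folds} of 5 folds share a Sample_ID between train and test")
# prints: 5 of 5 folds share a Sample_ID between train and test
```

Whether a plain `KFold` leaks depends on row order and shuffling. If rows were stored sample by sample, an unshuffled split could happen to keep groups intact, but nothing guarantees it.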

### The Traditional Solution

Use group-aware splitters like `GroupKFold`:

```python
from sklearn.model_selection import GroupKFold

{"split": GroupKFold(n_splits=5), "group": "Sample_ID"}
```

But this restricts you to the group-aware splitters (`GroupKFold`, `GroupShuffleSplit`, `StratifiedGroupKFold`).

### The Force Group Solution

`force_group` wraps **any** splitter to add group-awareness:

```python
# Now ANY splitter works with groups!
{"split": KFold(n_splits=5), "force_group": "Sample_ID"}
{"split": ShuffleSplit(n_splits=10, test_size=0.2), "force_group": "Sample_ID"}
{"split": StratifiedKFold(n_splits=5), "force_group": "Sample_ID"}
```

## How It Works

1. **Aggregate**: Samples are grouped by the specified column
2. **Split**: The inner splitter works on "virtual samples" (one per group)
3. **Expand**: Fold indices are expanded back to original sample indices

```
Original Data (100 samples, 20 groups)
        ↓
   Aggregation (20 virtual samples)
        ↓
   Splitter operates on 20 samples
        ↓
   Expansion back to 100 samples
```
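The three steps can be sketched in a few lines of NumPy. This is an illustrative approximation under stated assumptions, not the library's implementation: `grouped_split` is a hypothetical helper, and it always aggregates with the mean.

```python
import numpy as np
from sklearn.model_selection import KFold

def grouped_split(splitter, X, groups):
    """Yield (train_idx, test_idx) over rows of X, keeping each group intact."""
    # 1. Aggregate: one "virtual sample" per group (mean of its rows).
    unique_groups, inverse = np.unique(groups, return_inverse=True)
    n_groups = len(unique_groups)
    X_virtual = np.vstack([X[inverse == g].mean(axis=0) for g in range(n_groups)])

    # 2. Split: the inner splitter only ever sees the virtual samples.
    for g_train, g_test in splitter.split(X_virtual):
        # 3. Expand: map group positions back to original row indices.
        train_idx = np.flatnonzero(np.isin(inverse, g_train))
        test_idx = np.flatnonzero(np.isin(inverse, g_test))
        yield train_idx, test_idx

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)   # 100 samples, 20 groups
X = rng.normal(size=(100, 4))

folds = list(grouped_split(KFold(n_splits=5), X, groups))
for train_idx, test_idx in folds:
    # No group appears on both sides of a split.
    assert np.intersect1d(groups[train_idx], groups[test_idx]).size == 0
```

Because the inner splitter only receives 20 rows, any splitter that accepts a feature matrix can be plugged in unchanged.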

## Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `force_group` | `str` | Metadata column name for grouping, or `"y"` for target-based binning |
| `aggregation` | `str` | X aggregation method: `"mean"` (default), `"median"`, `"first"` |
| `y_aggregation` | `str` | Y aggregation method: `"mean"`, `"mode"`, `"first"` (default: auto-detected) |
| `n_bins` | `int` | Number of bins for `force_group="y"` (default: 5) |

## Usage Examples

### Basic Group Splitting

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

pipeline = [
    {"split": KFold(n_splits=5), "force_group": "Sample_ID"},
    PLSRegression(n_components=5)
]
```

### With Different Aggregation Methods

```python
# Use median aggregation (more robust to outliers)
{"split": KFold(n_splits=5), "force_group": "Sample_ID", "aggregation": "median"}

# Use first sample per group (fastest, no actual aggregation)
{"split": ShuffleSplit(test_size=0.2), "force_group": "Sample_ID", "aggregation": "first"}
```
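To see what the three options compute, here is a small sketch of collapsing one group's rows into a single virtual sample. The `aggregators` mapping and the data are hypothetical; the library's internals may differ.

```python
import numpy as np

# Hypothetical group: three repeated spectra of one sample, the last an outlier.
group_X = np.array([
    [1.0, 2.0],
    [1.2, 2.2],
    [9.0, 9.0],
])

# One callable per aggregation name, each reducing the rows to a single row.
aggregators = {
    "mean": lambda rows: rows.mean(axis=0),
    "median": lambda rows: np.median(rows, axis=0),
    "first": lambda rows: rows[0],
}

for name, aggregate in aggregators.items():
    print(name, aggregate(group_X))
```

The median stays at `[1.2, 2.2]` while the mean is pulled toward the outlier. Note that `KFold` and `ShuffleSplit` only use the number of rows, so the aggregated values matter mainly for splitters that inspect the feature values.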

### Y-Binning for Regression

Use `force_group="y"` to bin continuous target values into groups:

```python
# Bin y values into 10 quantile bins, then split by bins
{"split": KFold(n_splits=5), "force_group": "y", "n_bins": 10}
```

As a result, samples with similar y values tend to land in the same fold, which is intended to give a more balanced y distribution across folds.
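As a rough illustration of quantile binning, the sketch below assigns each target value to one of `n_bins` bins with NumPy. The `quantile_bins` helper is hypothetical, and it is an assumption that the library computes its bins in a comparable way.

```python
import numpy as np

def quantile_bins(y, n_bins):
    """Assign each y value to one of n_bins quantile bins (labels 0..n_bins-1)."""
    # Interior edges only, e.g. 10 bins -> the 10%, 20%, ..., 90% quantiles.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(y, edges)

rng = np.random.default_rng(0)
y = rng.normal(loc=50.0, scale=10.0, size=200)  # hypothetical continuous target

bin_labels = quantile_bins(y, n_bins=10)
counts = np.bincount(bin_labels, minlength=10)
print(counts)  # roughly equal bin sizes
```

The resulting `bin_labels` play the role of a group column: every sample with the same label is treated as a member of the same group.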

### Stratified Splitting with Groups

For classification with group-awareness:

```python
from sklearn.model_selection import StratifiedKFold

{
    "split": StratifiedKFold(n_splits=5),
    "force_group": "Sample_ID",
    "y_aggregation": "mode"  # Use most common class in group
}
```
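The role of `"mode"` can be sketched as follows: each group gets its most common class as a single label, the stratified splitter works on those per-group labels, and the folds are expanded back to rows. The `group_mode` helper and the data are hypothetical, and tie-breaking in the library may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def group_mode(y, groups):
    """Return (inverse, y_per_group) using the most common class in each group."""
    unique_groups, inverse = np.unique(groups, return_inverse=True)
    y_per_group = np.empty(len(unique_groups), dtype=y.dtype)
    for g in range(len(unique_groups)):
        classes, counts = np.unique(y[inverse == g], return_counts=True)
        y_per_group[g] = classes[np.argmax(counts)]  # ties go to the smallest class
    return inverse, y_per_group

# Hypothetical data: 10 groups of 3 measurements, two alternating classes.
groups = np.repeat(np.arange(10), 3)
y = np.repeat([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], 3)
y[0] = 1  # one mislabelled measurement in group 0; its mode is still class 0

inverse, y_per_group = group_mode(y, groups)

# Stratify on the per-group labels, then expand back to row indices.
placeholder_X = np.zeros((len(y_per_group), 1))
folds = [
    (np.flatnonzero(np.isin(inverse, g_train)), np.flatnonzero(np.isin(inverse, g_test)))
    for g_train, g_test in StratifiedKFold(n_splits=5).split(placeholder_X, y_per_group)
]
```

Each test fold here contains one group of each class, and no group is divided between train and test.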

## Comparison with `group` Parameter

| Feature | `group` | `force_group` |
|---------|---------|---------------|
| Works with GroupKFold | ✓ | ✓ |
| Works with KFold | ✗ | ✓ |
| Works with ShuffleSplit | ✗ | ✓ |
| Works with StratifiedKFold | ✗ | ✓ |
| Y-binning support | ✗ | ✓ |
| Aggregation options | ✗ | ✓ |

## Best Practices

1. **Choose appropriate aggregation**: Use `"mean"` for normal distributions, `"median"` for outlier robustness, `"first"` for speed

2. **Set `n_bins` appropriately**: For `force_group="y"`:
   - More bins = finer stratification but requires more samples
   - Fewer bins = more robust but coarser grouping
   - Recommended: 5-20 bins for datasets with 100+ samples

3. **Match `y_aggregation` to task**:
   - Classification: use `"mode"` (most common class)
   - Regression: use `"mean"` (average value)

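4. **Prefer `force_group` over `group`** when using non-group-aware splitters: scikit-learn splitters such as `KFold` accept a `groups` argument but ignore it, so a split can leak groups without raising an error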
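The silent-failure point in item 4 can be checked directly against scikit-learn. This sketch uses made-up data and says nothing about how the pipeline's `group` key is handled internally; it only shows that `KFold` accepts `groups` without using it, whereas `GroupKFold` honors it.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

groups = np.tile(np.arange(4), 3)  # hypothetical: 4 samples, 3 interleaved repeats
X = np.zeros((groups.size, 1))

def has_leak(train_idx, test_idx):
    """True if any group label appears in both train and test."""
    return np.intersect1d(groups[train_idx], groups[test_idx]).size > 0

# KFold takes `groups` for API compatibility but ignores it: no error, no warning.
kfold_leaks = [has_leak(tr, te) for tr, te in KFold(n_splits=4).split(X, groups=groups)]

# GroupKFold uses `groups`, so no group straddles a split.
group_leaks = [has_leak(tr, te) for tr, te in GroupKFold(n_splits=4).split(X, groups=groups)]

print(kfold_leaks)  # [True, True, True, True]
print(group_leaks)  # [False, False, False, False]
```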

## Technical Details

Under the hood, `force_group` uses `GroupedSplitterWrapper`:

```python
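from sklearn.model_selection import KFold

from nirs4all.operators.splitters import GroupedSplitterWrapper

wrapper = GroupedSplitterWrapper(
    splitter=KFold(n_splits=5),
    aggregation="mean",
    y_aggregation="mean"
)

# X, y, and sample_ids are assumed to exist: features, targets, and one group label per row
for train_idx, test_idx in wrapper.split(X, y, groups=sample_ids):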
    # train_idx and test_idx are original sample indices
    # All samples from the same group are in the same fold
    pass
```

## Related

- [Writing Pipelines](./writing_pipelines.md): General pipeline authoring guide
- {doc}`/reference/pipeline_syntax`: Pipeline syntax reference