# Sample Aggregation

## Overview

In NIRS applications, it's common to have multiple spectral measurements (repetitions) for the same physical sample. For example:

- 4 scans per soil sample to reduce measurement noise
- Multiple measurements at different positions on a grain sample
- Repeated measurements for quality control

The **aggregation** feature allows you to:

1. Train models on all individual spectra (to maximize data)
2. Evaluate and report performance on **aggregated predictions** (one prediction per physical sample)

When aggregation is enabled, predictions from multiple spectra of the same biological sample are automatically combined, and both raw and aggregated metrics are reported.

## Quick Start

### Define Aggregation at Dataset Level

```python
from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.visualization.predictions import PredictionAnalyzer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

# Define dataset with aggregation column
dataset = DatasetConfigs(
    "path/to/spectra",
    aggregate="sample_id"  # Aggregate by sample_id column in metadata
)

# Define pipeline
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=3, test_size=0.25),
    {"model": PLSRegression(n_components=10)}
]

# Run pipeline
runner = PipelineRunner(verbose=1)
predictions, _ = runner.run(PipelineConfigs(pipeline, "PLS"), dataset)

# Create analyzer with same aggregate setting
analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)

# All plots now use aggregation by default
fig = analyzer.plot_top_k(k=5)  # Automatically aggregated by sample_id
```

## Aggregation Methods

### By Metadata Column (Recommended)

Use a column from your metadata file to group samples:

```python
# Metadata file should contain a column like 'sample_id', 'ID', 'batch', etc.
dataset = DatasetConfigs("path/to/data", aggregate="sample_id")
```

### By Target Values

For classification tasks, aggregate by target class:

```python
# Aggregate spectra sharing the same target value
dataset = DatasetConfigs("path/to/data", aggregate=True)
```

### Via Config Dictionary

When using configuration dictionaries:

```python
config = {
    "train_x": "data/spectra.csv",
    "train_y": "data/targets.csv",
    "train_m": "data/metadata.csv",  # Contains 'sample_id' column
    "aggregate": "sample_id"
}

dataset = DatasetConfigs(config)
```

## Pipeline Output

When aggregation is enabled, the TabReport shows both raw and aggregated metrics:

```
|-----------|--------|----------|------|-------|
| Partition | Nsamp  | Nfeat    | R2   | RMSE  |
|-----------|--------|----------|------|-------|
| Cros Val  | 400    | 200      | 0.87 | 0.712 |
| Cros Val* | 100    | 200      | 0.92 | 0.598 |  <- Aggregated
| Test      | 100    | 200      | 0.85 | 0.756 |
| Test*     | 25     | 200      | 0.90 | 0.632 |  <- Aggregated
|-----------|--------|----------|------|-------|
* Aggregated by sample_id
```

The asterisk (`*`) rows show performance when predictions for repeated measurements are averaged before computing metrics.

## Visualization with Aggregation

### Automatic Aggregation via `default_aggregate`

When you set `default_aggregate` on the analyzer, all visualization methods use it automatically:

```python
# Get aggregate setting from last run
analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)

# All these plots use aggregation automatically
fig1 = analyzer.plot_top_k(k=5)
fig2 = analyzer.plot_histogram()
fig3 = analyzer.plot_heatmap('model_name', 'preprocessings')
fig4 = analyzer.plot_candlestick('model_name')
```

### Overriding the Default

You can override the default for specific plots:

```python
# Use default aggregation
fig1 = analyzer.plot_top_k(k=5)

# Override: disable aggregation for this plot
fig2 = analyzer.plot_top_k(k=5, aggregate='')

# Override: use different aggregation column
fig3 = analyzer.plot_top_k(k=5, aggregate='batch_id')
```

### Manual Aggregation per Plot

Without setting a default, specify aggregation per method call:

```python
analyzer = PredictionAnalyzer(predictions)

# Explicit aggregation
fig = analyzer.plot_top_k(k=5, aggregate='sample_id')
fig = analyzer.plot_heatmap('model', 'preprocessing', aggregate='sample_id')
```

## Multi-Dataset Aggregation

Different datasets can have different aggregation columns:

```python
config1 = {
    "train_x": "dataset1/spectra.csv",
    "train_y": "dataset1/targets.csv",
    "train_m": "dataset1/metadata.csv",
    "aggregate": "sample_id"  # Dataset 1 uses sample_id
}

config2 = {
    "train_x": "dataset2/spectra.csv",
    "train_y": "dataset2/targets.csv",
    "train_m": "dataset2/metadata.csv",
    "aggregate": "batch_number"  # Dataset 2 uses batch_number
}

dataset = DatasetConfigs([config1, config2])
```

Alternatively, use a list of aggregate values:

```python
dataset = DatasetConfigs(
    [config1, config2],
    aggregate=["sample_id", "batch_number"]
)
```

## Priority Resolution

When aggregation is specified in multiple places, the priority order is:

1. **Constructor parameter** (highest priority)
2. **Config dictionary** (lower priority)

```python
config = {
    "train_x": "...",
    "aggregate": "sample_id"  # Config-level setting
}

# Constructor parameter overrides config dict
dataset = DatasetConfigs(config, aggregate="batch_id")  # Uses "batch_id"
```

## Aggregation Algorithm

For **regression** tasks:
- Predictions for samples in the same group are averaged
- y_true values are also averaged (for consistent comparison)

For **classification** tasks:
- Probabilities (if available) are averaged, then argmax is applied
- Without probabilities, majority voting is used

## Complete Example

```python
"""
Example: Soil Analysis with Multiple Scans per Sample
Each soil sample has 4 spectral scans to reduce measurement noise.
"""

from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.visualization.predictions import PredictionAnalyzer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# Dataset config with aggregation
dataset = DatasetConfigs(
    {
        "train_x": "soil_data/spectra_train.csv",
        "train_y": "soil_data/targets_train.csv",
        "train_m": "soil_data/metadata_train.csv",  # Has 'sample_id' column
        "test_x": "soil_data/spectra_test.csv",
        "test_y": "soil_data/targets_test.csv",
        "test_m": "soil_data/metadata_test.csv",
    },
    aggregate="sample_id"
)

# Pipeline with hyperparameter search
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
]

# Add models with different n_components
for n in [5, 10, 15, 20]:
    pipeline.append({"model": PLSRegression(n_components=n)})

# Run
runner = PipelineRunner(verbose=1)
predictions, _ = runner.run(PipelineConfigs(pipeline, "SoilPLS"), dataset)

# Analyze with aggregation
analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)

# All visualizations use aggregated metrics
fig1 = analyzer.plot_top_k(k=3, rank_metric='rmse')
fig2 = analyzer.plot_heatmap('model_name', 'preprocessings')

plt.show()
```

## See Also

- {doc}`/reference/predictions_api` - Predictions API reference
- {doc}`/user_guide/visualization/prediction_charts` - Visualization methods
- {doc}`/getting_started/index` - Quick start guide