Sample Aggregation

Overview

In NIRS applications, it’s common to have multiple spectral measurements (repetitions) for the same physical sample. For example:

4 scans per soil sample to reduce measurement noise
Multiple measurements at different positions on a grain sample
Repeated measurements for quality control

The aggregation feature allows you to:

Train models on all individual spectra (so that every measurement contributes to training)
Evaluate and report performance on aggregated predictions (one prediction per physical sample)

When aggregation is enabled, predictions from multiple spectra of the same physical sample are automatically combined, and both raw and aggregated metrics are reported.

Quick Start

Define Aggregation at Dataset Level

from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.visualization.predictions import PredictionAnalyzer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

# Define dataset with aggregation column
dataset = DatasetConfigs(
    "path/to/spectra",
    aggregate="sample_id"  # Aggregate by sample_id column in metadata
)

# Define pipeline
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=3, test_size=0.25),
    {"model": PLSRegression(n_components=10)}
]

# Run pipeline
runner = PipelineRunner(verbose=1)
predictions, _ = runner.run(PipelineConfigs(pipeline, "PLS"), dataset)

# Create analyzer with same aggregate setting
analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)

# All plots now use aggregation by default
fig = analyzer.plot_top_k(k=5)  # Automatically aggregated by sample_id

Aggregation Methods

By Metadata Column (Recommended)

Use a column from your metadata file to group samples:

# Metadata file should contain a column like 'sample_id', 'ID', 'batch', etc.
dataset = DatasetConfigs("path/to/data", aggregate="sample_id")

By Target Values

For classification tasks, aggregate by target class:

# Aggregate spectra sharing the same target value
dataset = DatasetConfigs("path/to/data", aggregate=True)

Via Config Dictionary

When using configuration dictionaries:

config = {
    "train_x": "data/spectra.csv",
    "train_y": "data/targets.csv",
    "train_m": "data/metadata.csv",  # Contains 'sample_id' column
    "aggregate": "sample_id"
}

dataset = DatasetConfigs(config)

Pipeline Output

When aggregation is enabled, the TabReport shows both raw and aggregated metrics:

|-----------|--------|----------|------|-------|
| Partition | Nsamp  | Nfeat    | R2   | RMSE  |
|-----------|--------|----------|------|-------|
| Cros Val  | 400    | 200      | 0.87 | 0.712 |
| Cros Val* | 100    | 200      | 0.92 | 0.598 |  <- Aggregated
| Test      | 100    | 200      | 0.85 | 0.756 |
| Test*     | 25     | 200      | 0.90 | 0.632 |  <- Aggregated
|-----------|--------|----------|------|-------|
* Aggregated by sample_id

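The asterisk (*) rows show performance when predictions for repeated measurements are averaged before computing metrics. In the table above, the 400 cross-validation spectra collapse to 100 physical samples, and the 100 test spectra collapse to 25.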
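
The difference between the raw and aggregated rows can be reproduced by hand. The sketch below is not nirs4all's implementation; it is a minimal standard-library illustration, with made-up data and hypothetical helper names (`rmse`, `mean_per_group`), of averaging repeated measurements per sample before computing RMSE.

```python
from collections import defaultdict
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error over paired values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_per_group(group_ids, values):
    """Average the values sharing a group id; groups keep first-seen order."""
    buckets = defaultdict(list)
    for gid, value in zip(group_ids, values):
        buckets[gid].append(value)
    return [sum(vals) / len(vals) for vals in buckets.values()]


# Made-up data: two physical samples, two scans each.
sample_ids = ["s1", "s1", "s2", "s2"]
y_true = [10.0, 10.0, 20.0, 20.0]
y_pred = [9.0, 11.0, 19.0, 23.0]

# Raw metric: one row per spectrum.
raw_rmse = rmse(y_true, y_pred)

# Aggregated metric: one row per physical sample.
agg_rmse = rmse(
    mean_per_group(sample_ids, y_true),
    mean_per_group(sample_ids, y_pred),
)

print(f"raw RMSE: {raw_rmse:.3f}, aggregated RMSE: {agg_rmse:.3f}")
# prints "raw RMSE: 1.732, aggregated RMSE: 0.707"
```

In this toy case the errors of the two scans of `s1` cancel out once averaged, which is why the aggregated RMSE is lower. Aggregation does not guarantee a better score; it only changes the unit being evaluated from the spectrum to the sample.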

Visualization with Aggregation

Automatic Aggregation via `default_aggregate`

When you set default_aggregate on the analyzer, all visualization methods use it automatically:

# Get aggregate setting from last run
analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)

# All these plots use aggregation automatically
fig1 = analyzer.plot_top_k(k=5)
fig2 = analyzer.plot_histogram()
fig3 = analyzer.plot_heatmap('model_name', 'preprocessings')
fig4 = analyzer.plot_candlestick('model_name')

Overriding the Default

You can override the default for specific plots:

# Use default aggregation
fig1 = analyzer.plot_top_k(k=5)

# Override: disable aggregation for this plot
fig2 = analyzer.plot_top_k(k=5, aggregate='')

# Override: use different aggregation column
fig3 = analyzer.plot_top_k(k=5, aggregate='batch_id')

Manual Aggregation per Plot

Without setting a default, specify aggregation per method call:

analyzer = PredictionAnalyzer(predictions)

# Explicit aggregation
fig = analyzer.plot_top_k(k=5, aggregate='sample_id')
fig = analyzer.plot_heatmap('model_name', 'preprocessings', aggregate='sample_id')

Multi-Dataset Aggregation

Different datasets can have different aggregation columns:

config1 = {
    "train_x": "dataset1/spectra.csv",
    "train_y": "dataset1/targets.csv",
    "train_m": "dataset1/metadata.csv",
    "aggregate": "sample_id"  # Dataset 1 uses sample_id
}

config2 = {
    "train_x": "dataset2/spectra.csv",
    "train_y": "dataset2/targets.csv",
    "train_m": "dataset2/metadata.csv",
    "aggregate": "batch_number"  # Dataset 2 uses batch_number
}

dataset = DatasetConfigs([config1, config2])

Alternatively, pass a list of aggregate values, one per dataset config:

dataset = DatasetConfigs(
    [config1, config2],
    aggregate=["sample_id", "batch_number"]
)

Priority Resolution

When aggregation is specified in multiple places, the priority order is:

Constructor parameter (highest priority)
Config dictionary (lower priority)

config = {
    "train_x": "...",
    "aggregate": "sample_id"  # Config-level setting
}

# Constructor parameter overrides config dict
dataset = DatasetConfigs(config, aggregate="batch_id")  # Uses "batch_id"

Aggregation Algorithm

For regression tasks:

Predictions for samples in the same group are averaged
y_true values are also averaged (for consistent comparison)

For classification tasks:

Probabilities (if available) are averaged, then argmax is applied
Without probabilities, majority voting is used
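
The two classification strategies can give different answers for the same group. The sketch below illustrates this with the standard library only; the helper names and the data are assumptions made for the example, and the tie-breaking behavior (first label seen wins) is a property of this sketch, not a documented behavior of nirs4all.

```python
from collections import Counter, defaultdict


def group_rows(group_ids, rows):
    """Collect rows under their group id, keeping first-seen group order."""
    grouped = defaultdict(list)
    for gid, row in zip(group_ids, rows):
        grouped[gid].append(row)
    return grouped


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def aggregate_by_probability(group_ids, probas):
    """Average class probabilities within each group, then take the argmax."""
    result = {}
    for gid, rows in group_rows(group_ids, probas).items():
        averaged = [sum(col) / len(rows) for col in zip(*rows)]
        result[gid] = argmax(averaged)
    return result


def aggregate_by_vote(group_ids, labels):
    """Pick the most frequent predicted label within each group."""
    return {
        gid: Counter(labs).most_common(1)[0][0]
        for gid, labs in group_rows(group_ids, labels).items()
    }


# Made-up data: sample "a" has three scans, sample "b" has two.
sample_ids = ["a", "a", "a", "b", "b"]
probas = [[0.6, 0.4], [0.55, 0.45], [0.1, 0.9], [0.8, 0.2], [0.7, 0.3]]
labels = [argmax(p) for p in probas]  # per-spectrum hard predictions

by_proba = aggregate_by_probability(sample_ids, probas)
by_vote = aggregate_by_vote(sample_ids, labels)

print(by_proba)  # prints {'a': 1, 'b': 0}
print(by_vote)   # prints {'a': 0, 'b': 0}
```

For sample `a`, two scans lean weakly toward class 0 while one scan is strongly confident in class 1. Averaging probabilities lets that confidence count, whereas majority voting treats every scan as one equal vote.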

Complete Example

"""
Example: Soil Analysis with Multiple Scans per Sample
Each soil sample has 4 spectral scans to reduce measurement noise.
"""

from nirs4all.data import DatasetConfigs
from nirs4all.pipeline import PipelineRunner, PipelineConfigs
from nirs4all.visualization.predictions import PredictionAnalyzer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# Dataset config with aggregation
dataset = DatasetConfigs(
    {
        "train_x": "soil_data/spectra_train.csv",
        "train_y": "soil_data/targets_train.csv",
        "train_m": "soil_data/metadata_train.csv",  # Has 'sample_id' column
        "test_x": "soil_data/spectra_test.csv",
        "test_y": "soil_data/targets_test.csv",
        "test_m": "soil_data/metadata_test.csv",
    },
    aggregate="sample_id"
)

# Base pipeline: scaling and cross-validation splitter (PLS variants are appended below)
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
]

# Add models with different n_components
for n in [5, 10, 15, 20]:
    pipeline.append({"model": PLSRegression(n_components=n)})

# Run
runner = PipelineRunner(verbose=1)
predictions, _ = runner.run(PipelineConfigs(pipeline, "SoilPLS"), dataset)

# Analyze with aggregation
analyzer = PredictionAnalyzer(predictions, default_aggregate=runner.last_aggregate)

# All visualizations use aggregated metrics
fig1 = analyzer.plot_top_k(k=3, rank_metric='rmse')
fig2 = analyzer.plot_heatmap('model_name', 'preprocessings')

plt.show()