Cross-Validation Examples

This section covers cross-validation strategies to properly evaluate model performance on NIRS data.

Overview

Example  Topic             Difficulty  Duration
U01      CV Strategies     ★★☆☆☆       ~4 min
U02      Group Splitting   ★★☆☆☆       ~3 min
U03      Sample Filtering  ★★☆☆☆       ~3 min
U04      Aggregation       ★★☆☆☆       ~3 min


U01: CV Strategies

Select an appropriate cross-validation strategy for your data structure.

What You’ll Learn

  • Standard CV: KFold, ShuffleSplit, RepeatedKFold

  • Stratified CV for classification

  • Time-series CV for temporal data

  • Leave-One-Out for small datasets

CV Strategy Selection Guide

Strategy                When to Use
KFold                   Standard regression, moderate-large datasets
ShuffleSplit            Flexible test size, many random splits
RepeatedKFold           Small datasets, need robust estimates
StratifiedKFold         Classification, class imbalance
StratifiedShuffleSplit  Classification + flexible splits
TimeSeriesSplit         Temporal/sequential data
LeaveOneOut             Very small datasets (<50 samples)

KFold - Standard K-Fold

Divides data into K non-overlapping folds:

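from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
# SNV comes from nirs4all (import omitted here)

pipeline = [
    MinMaxScaler(),
    SNV(),

    # 5-fold cross-validation
    KFold(n_splits=5, shuffle=True, random_state=42),

    PLSRegression(n_components=10)
]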

Key parameters:

  • n_splits: Number of folds (typically 5-10)

  • shuffle: Randomize before splitting (recommended)

  • random_state: Seed for the shuffle; set it for reproducibility (only used when shuffle=True)
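To see what these parameters do, the following minimal sketch runs the same splitter directly with scikit-learn on ten toy samples, outside any nirs4all pipeline. The array `X` and the variable names are illustrative only. It checks that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features (illustrative data)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(X), dtype=int)  # how often each sample is tested
fold_sizes = []
for train_idx, test_idx in cv.split(X):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)            # five (8, 2) pairs: 8 training and 2 test samples per fold
print(test_counts.tolist())  # all ones: each sample is tested exactly once
```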

ShuffleSplit - Random Splits

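Draws independent random train/test splits. Unlike KFold, the test sets of different splits may overlap, and the test size is chosen freely: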

from sklearn.model_selection import ShuffleSplit

# 10 random splits with 25% test
ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)

Advantages:

  • Control test size exactly

  • Number of splits independent of test size

  • Good for large datasets
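The sketch below illustrates these points on 100 placeholder samples (the data and variable names are assumptions made for the illustration). Each split holds out exactly 25 samples, and because the splits are drawn independently, some samples are tested several times.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.zeros((100, 3))  # 100 placeholder samples (illustrative data)

cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)

test_sizes = []
times_tested = np.zeros(len(X), dtype=int)
for train_idx, test_idx in cv.split(X):
    test_sizes.append(len(test_idx))
    times_tested[test_idx] += 1

print(test_sizes)          # 25 test samples in every split
print(times_tested.max())  # at least 3: 250 test slots are spread over 100 samples
```

Unlike KFold, nothing guarantees that every sample is tested at least once.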

RepeatedKFold - Multiple Repetitions

Repeats K-fold CV multiple times with different shuffles:

from sklearn.model_selection import RepeatedKFold

# 5-fold repeated 3 times = 15 total evaluations
RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

Best for:

  • Small datasets where variance is high

  • When you need robust uncertainty estimates
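As a small check of the "5 folds x 3 repeats = 15 evaluations" arithmetic, this sketch counts the splits on 20 placeholder samples (illustrative data, not taken from the example script). Each sample is tested once per repetition.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.zeros((20, 3))  # 20 placeholder samples (illustrative data)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
n_evaluations = cv.get_n_splits()  # n_splits * n_repeats

times_tested = np.zeros(len(X), dtype=int)
for _, test_idx in cv.split(X):
    times_tested[test_idx] += 1

print(n_evaluations)          # 15
print(times_tested.tolist())  # all threes: once per repetition
```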

StratifiedKFold - Classification

Preserves class proportions in each fold:

from sklearn.model_selection import StratifiedKFold

# Essential for imbalanced classification
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Critical for:

  • Imbalanced class distributions

  • Multi-class classification

  • Ensuring each fold has representative samples
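A minimal way to verify the stratification is to count the classes in each test fold. The sketch below uses a made-up 80/20 label vector. Because both class counts divide evenly by 5, every fold reproduces the 80/20 ratio exactly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 80 + [1] * 20)  # imbalanced toy labels: 80% class 0, 20% class 1
X = np.zeros((len(y), 3))          # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Class counts [n_class_0, n_class_1] in each test fold
fold_class_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in cv.split(X, y)
]
print(fold_class_counts)  # [16, 4] in every fold
```

When class counts do not divide evenly, the per-fold counts differ by at most one sample per class.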

TimeSeriesSplit - Temporal Data

Expanding window approach for sequential data:

from sklearn.model_selection import TimeSeriesSplit

# Train on past, test on future
TimeSeriesSplit(n_splits=5)

Prevents:

  • Look-ahead bias

  • Data leakage from future to past

Use when:

  • Samples are ordered by time

  • Temporal dependencies exist
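The expanding window can be inspected directly. This sketch assumes 12 toy samples that are already sorted by time. It prints each split and asserts that training indices always precede test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 toy samples, assumed sorted by time

cv = TimeSeriesSplit(n_splits=5)

train_sizes = []
for train_idx, test_idx in cv.split(X):
    # No look-ahead: every training index precedes every test index
    assert train_idx.max() < test_idx.min()
    train_sizes.append(len(train_idx))
    print(f"train 0..{train_idx.max()}  test {test_idx.tolist()}")

print(train_sizes)  # [2, 4, 6, 8, 10]: the training window grows each split
```

The splitter relies on row order alone, so the data must be sorted chronologically before it is used.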

LeaveOneOut - Small Datasets

Leave exactly one sample out per fold:

from sklearn.model_selection import LeaveOneOut

# N samples = N folds
LeaveOneOut()

Trade-offs:

  • ✓ Maximum use of data for training

  • ✓ Every sample serves as the test sample exactly once

  • ✗ Requires N model fits, so slow for large datasets

  • ✗ High variance in estimates
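Because each fold contains a single test sample, per-fold metrics such as R² are not meaningful. A common approach is to pool the N out-of-fold predictions and score them together. The sketch below does this with scikit-learn's `cross_val_predict` on synthetic data. The data, the `LinearRegression` model, and the variable names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic data (illustrative): y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=20)

cv = LeaveOneOut()
n_folds = cv.get_n_splits(X)  # one fold per sample

# One out-of-fold prediction per sample, then a single pooled RMSE
y_pred = cross_val_predict(LinearRegression(), X, y, cv=cv)
pooled_rmse = float(np.sqrt(np.mean((y - y_pred) ** 2)))

print(n_folds, round(pooled_rmse, 3))
```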


U02: Group Splitting

Handle grouped/clustered data correctly.

What You’ll Learn

  • GroupKFold for clustered samples

  • GroupShuffleSplit for random group splits

  • Handling biological replicates

  • Avoiding data leakage from groups

Why Group Splitting?

When samples are not independent:

Example: 5 measurements per fruit
├── Fruit 1: samples 1-5
├── Fruit 2: samples 6-10
├── Fruit 3: samples 11-15
└── ...

Wrong: Random split → Fruit 1's samples in both train AND test
Right: Group split → All of Fruit 1's samples in train OR test
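The leakage can be made visible by counting the folds in which a group appears on both sides of the split. This sketch uses the 5-fruit layout above with placeholder features. The helper `count_leaky_folds` is hypothetical and written only for this illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

groups = np.repeat(np.arange(5), 5)  # 5 fruits x 5 measurements each
X = np.zeros((len(groups), 3))       # placeholder features

def count_leaky_folds(cv, **split_kwargs):
    """Count folds in which some group appears in both train and test."""
    leaky = 0
    for train_idx, test_idx in cv.split(X, **split_kwargs):
        shared = set(groups[train_idx]) & set(groups[test_idx])
        leaky += bool(shared)
    return leaky

random_leaks = count_leaky_folds(KFold(n_splits=5, shuffle=True, random_state=42))
group_leaks = count_leaky_folds(GroupKFold(n_splits=5), groups=groups)

print(random_leaks)  # greater than 0: random folds mix measurements of one fruit
print(group_leaks)   # 0: each fruit is entirely in train or entirely in test
```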

GroupKFold

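import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import MinMaxScaler
# SNV comes from nirs4all (import omitted here)

# Requires one group label per sample
groups = [0]*5 + [1]*5 + [2]*5 + [3]*5 + [4]*5  # 5 groups of 5

pipeline = [
    MinMaxScaler(),
    SNV(),
    GroupKFold(n_splits=5),
    PLSRegression(n_components=10)
]

# Pass groups in dataset (X and y must hold the same 25 samples, in the same order)
result = nirs4all.run(
    pipeline=pipeline,
    dataset=(X, y, {"groups": groups})
)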

GroupShuffleSplit

Random group-aware splits:

from sklearn.model_selection import GroupShuffleSplit

GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
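Note that `test_size` here is a fraction of the groups, not of the samples. The sketch below, on an illustrative layout of 10 groups with 3 measurements each, shows that every split holds out exactly 2 whole groups.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

groups = np.repeat(np.arange(10), 3)  # 10 groups x 3 measurements (illustrative)
X = np.zeros((len(groups), 2))        # placeholder features

cv = GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

test_group_counts = []
for train_idx, test_idx in cv.split(X, groups=groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    assert not (train_groups & test_groups)  # no group on both sides
    test_group_counts.append(len(test_groups))

print(test_group_counts)  # 2 groups (20% of 10) held out in every split
```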

Common Grouping Scenarios

Scenario                Group Definition
Biological replicates   Sample/individual ID
Time series by subject  Subject ID
Multi-site study        Site ID
Batch effects           Batch ID


U03: Sample Filtering

Filter samples based on criteria during cross-validation.

What You’ll Learn

  • Filtering outliers

  • Conditioning on metadata

  • Train/test set requirements

Sample Filtering in Pipeline

pipeline = [
    MinMaxScaler(),
    SNV(),

    # Filter: only include samples meeting criteria
    {"filter": {
        "y_range": (0, 100),       # Target value range
        "partition": ["train"]     # Which partitions to filter
    }},

    ShuffleSplit(n_splits=3),
    PLSRegression(n_components=10)
]

Filtering Options

Filter    Description
y_range   Keep samples with y in (min, max)
x_range   Keep samples with X values in range
metadata  Filter based on metadata columns
outliers  Remove statistical outliers
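To make the idea of a train-only `y_range` filter concrete, here is a plain NumPy/scikit-learn sketch. It is not nirs4all's implementation: the toy data, the inclusive bounds, and the variable names are all assumptions made for the illustration. Out-of-range samples are dropped from the training indices only, while the test indices are left untouched.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy targets with three values outside [0, 100] (illustrative data)
y = np.array([5.0, 20.0, 150.0, 60.0, -3.0, 80.0, 99.0, 40.0, 250.0, 10.0])
X = np.arange(len(y), dtype=float).reshape(-1, 1)
y_min, y_max = 0.0, 100.0  # same range as in the pipeline snippet; inclusive bounds assumed

cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)

filtered_splits = []
for train_idx, test_idx in cv.split(X):
    # Filter the training partition only
    in_range = (y[train_idx] >= y_min) & (y[train_idx] <= y_max)
    filtered_splits.append((train_idx[in_range], test_idx))
    print(f"train kept {in_range.sum()}/{len(train_idx)}, test untouched ({len(test_idx)})")
```

Keeping the test partition unfiltered means the reported scores still reflect how the model handles out-of-range samples.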


U04: Aggregation

Aggregate results across folds and variants.

What You’ll Learn

  • Aggregating metrics across folds

  • Statistical summaries

  • Confidence intervals

Understanding Aggregation

When you run a pipeline with cross-validation, you get:

  • One set of predictions per fold

  • Multiple folds per model configuration

Aggregation summarizes these:

# Get predictions
result = nirs4all.run(pipeline=pipeline, dataset=dataset)

# Aggregate across folds
summary = result.predictions.aggregate(
    by=['model_name', 'preprocessings'],
    metrics=['rmse', 'r2'],
    aggregations=['mean', 'std', 'min', 'max']
)

Aggregation Options

Aggregation  Description
mean         Average across folds
std          Standard deviation
min, max     Range
median       Robust central tendency
ci_95        95% confidence interval
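For intuition, these summaries can be reproduced by hand from a list of per-fold scores. The sketch below uses only the Python standard library and made-up RMSE values. The 95% interval uses a normal approximation, which is an assumption of this sketch and not necessarily how nirs4all computes `ci_95`.

```python
import math
import statistics

fold_rmse = [0.52, 0.48, 0.55, 0.50, 0.45]  # made-up per-fold RMSE values

mean = statistics.mean(fold_rmse)
std = statistics.stdev(fold_rmse)  # sample standard deviation (n - 1)

# Normal-approximation 95% interval for the mean (assumption of this sketch)
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * std / math.sqrt(len(fold_rmse))

summary = {
    "mean": mean,
    "std": std,
    "min": min(fold_rmse),
    "max": max(fold_rmse),
    "median": statistics.median(fold_rmse),
    "ci_95": (mean - half_width, mean + half_width),
}
for name, value in summary.items():
    print(name, value)
```

Folds share most of their training data, so such an interval tends to be optimistic and is best read as a rough indication of spread.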

Visualization with Aggregation

analyzer = PredictionAnalyzer(result.predictions)

# Candlestick shows distribution
analyzer.plot_candlestick(
    variable="model_name",
    display_metric='rmse'
)

# Heatmap with aggregation
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="preprocessings",
    aggregation='mean',  # or 'best', 'median'
    display_metric='rmse'
)

CV Best Practices

1. Match CV to Data Structure

# Independent samples
ShuffleSplit(n_splits=5, test_size=0.2)

# Classification
StratifiedKFold(n_splits=5)

# Grouped samples
GroupKFold(n_splits=5)

# Time series
TimeSeriesSplit(n_splits=5)

2. Number of Splits

Dataset Size  Recommended Splits
< 50          LeaveOneOut or RepeatedKFold
50-200        5-fold with repeats
200-1000      5-10 fold
> 1000        ShuffleSplit (faster)
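The table can be encoded as a small helper. The function `suggest_splitter` below is hypothetical and not part of nirs4all or scikit-learn. The exact boundaries (200 and 1000 counted in the lower bracket), `n_repeats=3`, and `test_size=0.2` are assumptions of this sketch.

```python
from sklearn.model_selection import KFold, LeaveOneOut, RepeatedKFold, ShuffleSplit

def suggest_splitter(n_samples, random_state=42):
    """Return a splitter following the rule-of-thumb table (hypothetical helper)."""
    if n_samples < 50:
        return LeaveOneOut()
    if n_samples <= 200:
        # "5-fold with repeats"; 3 repeats is an assumption
        return RepeatedKFold(n_splits=5, n_repeats=3, random_state=random_state)
    if n_samples <= 1000:
        return KFold(n_splits=5, shuffle=True, random_state=random_state)
    return ShuffleSplit(n_splits=5, test_size=0.2, random_state=random_state)

chosen = {n: type(suggest_splitter(n)).__name__ for n in (30, 120, 600, 5000)}
print(chosen)
```

Grouped, stratified, or time-ordered data still call for the matching splitter from the previous point, whatever the dataset size.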

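3. Shuffle for Regression (Unless Samples Are Time-Ordered)

# Good: Shuffle before splitting
KFold(n_splits=5, shuffle=True, random_state=42)

# Risky: unshuffled folds follow the file order, so any ordering
# (by date, batch, or target value) ends up concentrated in single folds.
# For time-ordered data use TimeSeriesSplit rather than unshuffled KFold.
KFold(n_splits=5, shuffle=False)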

4. Set random_state for Reproducibility

# Reproducible
ShuffleSplit(n_splits=5, random_state=42)

# Different each run (not reproducible)
ShuffleSplit(n_splits=5)  # random_state=None
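Reproducibility can be verified by generating the splits twice with the same seed and comparing them. In this sketch the helper `collect_test_sets` and the placeholder data are illustrative names, not part of any library.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.zeros((50, 2))  # 50 placeholder samples (illustrative data)

def collect_test_sets(cv):
    """Return the test indices of every split as plain lists."""
    return [test_idx.tolist() for _, test_idx in cv.split(X)]

run_a = collect_test_sets(ShuffleSplit(n_splits=5, random_state=42))
run_b = collect_test_sets(ShuffleSplit(n_splits=5, random_state=42))
same_splits = run_a == run_b

print(same_splits)  # True: identical seed, identical splits
```

With `random_state=None` the same comparison would usually fail, since each splitter draws from a fresh random state.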

5. Validate Stratification

For classification, check class distribution in each fold:

result = nirs4all.run(pipeline=pipeline, dataset=dataset)

# View fold distribution
nirs4all.run(
    pipeline=[..., "fold_chart", ...],
    dataset=dataset,
    plots_visible=True
)

Running These Examples

cd examples

# Run all CV examples
./run.sh -n "U0*.py" -c user

# Run with plots to see fold distributions
python user/05_cross_validation/U01_cv_strategies.py --plots --show

Next Steps

After mastering cross-validation:

  • Deployment: Save and deploy trained models

  • Explainability: Understand model decisions

  • Advanced: Nested CV, custom splitters