# Cross-Validation Examples

This section covers cross-validation strategies for properly evaluating model performance on NIRS data.

```{contents} On this page
:local:
:depth: 2
```

## Overview

| Example | Topic | Difficulty | Duration |
|---------|-------|------------|----------|
| [U01](#u01-cv-strategies) | CV Strategies | ★★☆☆☆ | ~4 min |
| [U02](#u02-group-splitting) | Group Splitting | ★★☆☆☆ | ~3 min |
| [U03](#u03-sample-filtering) | Sample Filtering | ★★☆☆☆ | ~3 min |
| [U04](#u04-aggregation) | Aggregation | ★★☆☆☆ | ~3 min |

---

## U01: CV Strategies

**Select appropriate cross-validation for your data structure.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/05_cross_validation/U01_cv_strategies.py)

### What You'll Learn

- Standard CV: KFold, ShuffleSplit, RepeatedKFold
- Stratified CV for classification
- Time-series CV for temporal data
- Leave-One-Out for small datasets

### CV Strategy Selection Guide

| Strategy | When to Use |
|----------|-------------|
| **KFold** | Standard regression, moderate-large datasets |
| **ShuffleSplit** | Flexible test size, many random splits |
| **RepeatedKFold** | Small datasets, need robust estimates |
| **StratifiedKFold** | Classification, class imbalance |
| **StratifiedShuffleSplit** | Classification + flexible splits |
| **TimeSeriesSplit** | Temporal/sequential data |
| **LeaveOneOut** | Very small datasets (<50 samples) |

### KFold - Standard K-Fold

Divides data into K non-overlapping folds:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
# SNV (Standard Normal Variate) comes from nirs4all; its import is omitted here

pipeline = [
    MinMaxScaler(),
    SNV(),
    # 5-fold cross-validation
    KFold(n_splits=5, shuffle=True, random_state=42),
    PLSRegression(n_components=10)
]
```

**Key parameters:**

- `n_splits`: Number of folds (typically 5-10)
- `shuffle`: Randomize before splitting (recommended)
- `random_state`: For reproducibility

### ShuffleSplit - Random Splits

More flexible than KFold:

```python
from sklearn.model_selection import ShuffleSplit

# 10 random splits with 25% test
ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
```

**Advantages:**

- Control test size exactly
- Number of splits independent of test size
- Good for large datasets

### RepeatedKFold - Multiple Repetitions

Repeats K-fold CV multiple times with different shuffles:

```python
from sklearn.model_selection import RepeatedKFold

# 5-fold repeated 3 times = 15 total evaluations
RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
```

**Best for:**

- Small datasets where variance is high
- When you need robust uncertainty estimates

### StratifiedKFold - Classification

Preserves class proportions in each fold:

```python
from sklearn.model_selection import StratifiedKFold

# Essential for imbalanced classification
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

**Critical for:**

- Imbalanced class distributions
- Multi-class classification
- Ensuring each fold has representative samples

### TimeSeriesSplit - Temporal Data

Expanding window approach for sequential data:

```python
from sklearn.model_selection import TimeSeriesSplit

# Train on past, test on future
TimeSeriesSplit(n_splits=5)
```

**Prevents:**

- Look-ahead bias
- Data leakage from future to past

**Use when:**

- Samples are ordered by time
- Temporal dependencies exist

### LeaveOneOut - Small Datasets

Leave exactly one sample out per fold:

```python
from sklearn.model_selection import LeaveOneOut

# N samples = N folds
LeaveOneOut()
```

**Trade-offs:**

- ✓ Maximum use of data for training
- ✓ N evaluations give a complete picture
- ✗ Slow for large datasets
- ✗ High variance in estimates

---

## U02: Group Splitting

**Handle grouped/clustered data correctly.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/05_cross_validation/U02_group_splitting.py)

### What You'll Learn

- GroupKFold for clustered samples
- GroupShuffleSplit for random group splits
- Handling biological replicates
- Avoiding data leakage from groups

### Why Group Splitting?
When samples are not independent, a random split can place measurements of the same physical sample on both sides of the split:

```
Example: 5 measurements per fruit
├── Fruit 1: samples 1-5
├── Fruit 2: samples 6-10
├── Fruit 3: samples 11-15
└── ...

Wrong: Random split → Fruit 1's samples in both train AND test
Right: Group split → All of Fruit 1's samples in train OR test
```

### GroupKFold

```python
import nirs4all
from sklearn.model_selection import GroupKFold

# Requires group labels
groups = [0]*5 + [1]*5 + [2]*5 + [3]*5 + [4]*5  # 5 groups of 5

pipeline = [
    MinMaxScaler(),
    SNV(),
    GroupKFold(n_splits=5),
    PLSRegression(n_components=10)
]

# Pass groups in dataset (X, y: spectra and targets loaded elsewhere)
result = nirs4all.run(
    pipeline=pipeline,
    dataset=(X, y, {"groups": groups})
)
```

### GroupShuffleSplit

Random group-aware splits:

```python
from sklearn.model_selection import GroupShuffleSplit

GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
```

### Common Grouping Scenarios

| Scenario | Group Definition |
|----------|------------------|
| Biological replicates | Sample/individual ID |
| Time series by subject | Subject ID |
| Multi-site study | Site ID |
| Batch effects | Batch ID |

---

## U03: Sample Filtering

**Filter samples based on criteria during cross-validation.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/05_cross_validation/U03_sample_filtering.py)

### What You'll Learn

- Filtering outliers
- Conditioning on metadata
- Train/test set requirements

### Sample Filtering in Pipeline

```python
pipeline = [
    MinMaxScaler(),
    SNV(),
    # Filter: only include samples meeting criteria
    {"filter": {
        "y_range": (0, 100),      # Target value range
        "partition": ["train"]    # Which partitions to filter
    }},
    ShuffleSplit(n_splits=3),
    PLSRegression(n_components=10)
]
```

### Filtering Options

| Filter | Description |
|--------|-------------|
| `y_range` | Keep samples with y in (min, max) |
| `x_range` | Keep samples with X values in range |
| `metadata` | Filter based on metadata columns |
| `outliers` | Remove statistical outliers |

---

## U04: Aggregation

**Aggregate results across folds and variants.**

[📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/05_cross_validation/U04_aggregation.py)

### What You'll Learn

- Aggregating metrics across folds
- Statistical summaries
- Confidence intervals

### Understanding Aggregation

When you run a pipeline with cross-validation, you get:

- One set of predictions per fold
- Multiple folds per model configuration

**Aggregation** summarizes these:

```python
# Get predictions
result = nirs4all.run(pipeline=pipeline, dataset=dataset)

# Aggregate across folds
summary = result.predictions.aggregate(
    by=['model_name', 'preprocessings'],
    metrics=['rmse', 'r2'],
    aggregations=['mean', 'std', 'min', 'max']
)
```

### Aggregation Options

| Aggregation | Description |
|-------------|-------------|
| `mean` | Average across folds |
| `std` | Standard deviation |
| `min`, `max` | Range |
| `median` | Robust central tendency |
| `ci_95` | 95% confidence interval |

### Visualization with Aggregation

```python
analyzer = PredictionAnalyzer(result.predictions)

# Candlestick shows distribution
analyzer.plot_candlestick(
    variable="model_name",
    display_metric='rmse'
)

# Heatmap with aggregation
analyzer.plot_heatmap(
    x_var="model_name",
    y_var="preprocessings",
    aggregation='mean',  # or 'best', 'median'
    display_metric='rmse'
)
```

---

## CV Best Practices

### 1. Match CV to Data Structure

```python
# Independent samples
ShuffleSplit(n_splits=5, test_size=0.2)

# Classification
StratifiedKFold(n_splits=5)

# Grouped samples
GroupKFold(n_splits=5)

# Time series
TimeSeriesSplit(n_splits=5)
```

### 2. Number of Splits

| Dataset Size | Recommended Splits |
|--------------|-------------------|
| < 50 | LeaveOneOut or RepeatedKFold |
| 50-200 | 5-fold with repeats |
| 200-1000 | 5-10 fold |
| > 1000 | ShuffleSplit (faster) |

### 3. Always Shuffle for Regression

```python
# Good: Shuffle before splitting
KFold(n_splits=5, shuffle=True, random_state=42)

# Risky: Sequential splits may follow patterns in the sample order
KFold(n_splits=5, shuffle=False)  # Only for ordered data; prefer TimeSeriesSplit for time series
```

### 4. Set random_state for Reproducibility

```python
# Reproducible
ShuffleSplit(n_splits=5, random_state=42)

# Different each run (not reproducible)
ShuffleSplit(n_splits=5)  # random_state=None
```

### 5. Validate Stratification

For classification, check class distribution in each fold:

```python
result = nirs4all.run(pipeline=pipeline, dataset=dataset)

# View fold distribution
nirs4all.run(
    pipeline=[..., "fold_chart", ...],
    dataset=dataset,
    plots_visible=True
)
```

---

## Running These Examples

```bash
cd examples

# Run all CV examples
./run.sh -n "U0*.py" -c user

# Run with plots to see fold distributions
python user/05_cross_validation/U01_cv_strategies.py --plots --show
```

## Next Steps

After mastering cross-validation:

- **Deployment**: Save and deploy trained models
- **Explainability**: Understand model decisions
- **Advanced**: Nested CV, custom splitters
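
### Appendix: Checking Splits for Group Leakage

As a closing sanity check on the group-splitting rule from U02, the sketch below counts how many groups end up in both train and test for a given set of splits. It is pure Python and does not use scikit-learn or nirs4all. The helper names `leaked_groups`, `group_k_fold`, and `interleaved_k_fold` are hypothetical, and `group_k_fold` is a simplified round-robin stand-in for group-aware splitting, not scikit-learn's `GroupKFold` algorithm.

```python
def leaked_groups(train_idx, test_idx, groups):
    """Return the group labels that appear in both train and test."""
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    return train_groups & test_groups


def interleaved_k_fold(n_samples, n_splits):
    """Group-unaware split: sample i is tested in fold i % n_splits."""
    for fold in range(n_splits):
        test = [i for i in range(n_samples) if i % n_splits == fold]
        train = [i for i in range(n_samples) if i % n_splits != fold]
        yield train, test


def group_k_fold(groups, n_splits):
    """Group-aware split: whole groups are assigned to folds round-robin.

    Simplified illustration only (assumption: fold sizes need no balancing).
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of groups")
    fold_of = {g: k % n_splits for k, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Synthetic labels: 5 fruits with 5 measurements each
groups = [fruit for fruit in range(5) for _ in range(5)]

naive_leaks = [
    leaked_groups(train, test, groups)
    for train, test in interleaved_k_fold(len(groups), n_splits=5)
]
group_leaks = [
    leaked_groups(train, test, groups)
    for train, test in group_k_fold(groups, n_splits=5)
]

print([len(s) for s in naive_leaks])  # [5, 5, 5, 5, 5]
print([len(s) for s in group_leaks])  # [0, 0, 0, 0, 0]
```

`leaked_groups` only needs index sequences, so it should also work on the `(train, test)` pairs yielded by a scikit-learn splitter's `split()` method. An empty set for every fold is what a group-aware split is expected to produce.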