# Pipeline Branching

Pipeline branching enables splitting a pipeline into multiple parallel sub-pipelines ("branches"), each with its own preprocessing context while sharing common upstream state (splits, initial preprocessing).

## Overview

Branching is useful when you want to:

- **Compare preprocessing strategies**: Test SNV vs MSC vs derivatives with the same model
- **Explore model-preprocessing combinations**: Different models with different preprocessing
- **Efficient experimentation**: Single dataset load, shared CV splits
- **Independent Y processing**: Different target transformations per branch

## Basic Syntax

### List Syntax (Anonymous Branches)

The simplest form uses a list of lists:

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    MinMaxScaler(),  # Shared - applied once
    {"branch": [
        [SNV()],              # Branch 0
        [MSC()],              # Branch 1
        [FirstDerivative()],  # Branch 2
    ]},
    PLSRegression(n_components=5),  # Runs on EACH branch
]
```

Result: 15 predictions (3 branches × 5 folds)

### Named Branches (Dictionary Syntax)

For better tracking, use named branches:

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    {"branch": {
        "snv_pca": [SNV(), PCA(n_components=10)],
        "msc_detrend": [MSC(), Detrend()],
        "derivative": [FirstDerivative()],
    }},
    PLSRegression(n_components=5),
]
```

Named branches appear in predictions and visualizations with their given names.

### Generator Syntax

Use `_or_` generators for dynamic branch creation:

```python
pipeline = [
    ShuffleSplit(n_splits=3),
    {"branch": {"_or_": [SNV(), MSC(), FirstDerivative()]}},
    PLSRegression(n_components=5),
]
```

This expands to 3 branches automatically named based on the operator class.
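To make the relationship between the three syntaxes concrete, here is a minimal, library-independent sketch of how a branch specification could be normalized into named branches and how the prediction count follows from it. It is an illustration only, not the library's implementation: `normalize_branches` and `count_predictions` are hypothetical helpers, the `branch_<i>` names for anonymous branches are an assumption, and the `SNV`, `MSC`, and `FirstDerivative` classes are empty placeholders so the snippet runs without any dependencies.

```python
from typing import Any, Dict, List, Union

# Placeholder operators so the sketch runs on its own (not the real transforms).
class SNV: ...
class MSC: ...
class FirstDerivative: ...

BranchSpec = Union[List[List[Any]], Dict[str, Any]]


def normalize_branches(spec: BranchSpec) -> Dict[str, List[Any]]:
    """Hypothetical helper: map any of the three syntaxes to {name: steps}."""
    if isinstance(spec, list):
        # List syntax: anonymous branches, identified by position.
        # The "branch_<i>" naming is an assumption made for this sketch.
        return {f"branch_{i}": list(steps) for i, steps in enumerate(spec)}
    if isinstance(spec, dict) and "_or_" in spec:
        # Generator syntax: one single-step branch per alternative,
        # named after the operator class.
        return {type(op).__name__: [op] for op in spec["_or_"]}
    if isinstance(spec, dict):
        # Dictionary syntax: names are given explicitly.
        return {name: list(steps) for name, steps in spec.items()}
    raise TypeError(f"Unsupported branch specification: {type(spec).__name__}")


def count_predictions(spec: BranchSpec, n_splits: int) -> int:
    """Branches x folds, assuming a single model step after the branch block."""
    return len(normalize_branches(spec)) * n_splits


list_spec = [[SNV()], [MSC()], [FirstDerivative()]]
or_spec = {"_or_": [SNV(), MSC(), FirstDerivative()]}

print(list(normalize_branches(list_spec)))  # positional names
print(list(normalize_branches(or_spec)))    # class-based names
print(count_predictions(list_spec, n_splits=5))
```

With the list-syntax example above (3 branches, 5 folds) the count is 15, matching the stated result. The sketch assumes each `_or_` alternative has a distinct class; repeated classes would need a disambiguating suffix.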
## Multi-Step Branches

Branches can contain multiple steps:

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    {"branch": {
        "snv_pca": [
            SNV(),
            PCA(n_components=10),
        ],
        "msc_savgol": [
            MSC(),
            SavitzkyGolay(window_length=11, polyorder=2),
        ],
    }},
    PLSRegression(n_components=5),
]
```

## Branch-Specific Y Processing

Each branch can have its own Y transformation:

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    {"branch": {
        "scaled_y": [
            SNV(),
            {"y_processing": StandardScaler()},  # Y scaling in this branch only
        ],
        "raw_y": [
            MSC(),
            # No Y processing - uses numeric targets
        ],
    }},
    PLSRegression(n_components=5),
]
```

## In-Branch Model Training

Models can be placed inside branches:

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    {"branch": {
        "snv_pls": [SNV(), PLSRegression(n_components=5)],
        "msc_pls": [MSC(), PLSRegression(n_components=10)],
        "derivative_rf": [FirstDerivative(), RandomForestRegressor()],
    }},
]
```

Each branch trains its own model independently.

## Post-Branch Steps

Steps after the branch block execute on **each branch**:

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    {"branch": [
        [SNV(), PCA(n_components=10)],
        [MSC(), Detrend()],
    ]},
    PLSRegression(n_components=5),  # Trains twice: once per branch
]
```

The PLSRegression trains on:

- Branch 0: SNV→PCA features
- Branch 1: MSC→Detrend features

## Visualization

### Branch Summary

```python
from nirs4all.visualization.predictions import PredictionAnalyzer

analyzer = PredictionAnalyzer(predictions)

# Get summary statistics
summary = analyzer.branch_summary(metrics=['rmse', 'r2'])
print(summary.to_markdown())
```

Output:

| branch_name | branch_id | count | rmse_mean | rmse_std | r2_mean | r2_std |
|-------------|-----------|-------|-----------|----------|---------|--------|
| snv_pca     | 0         | 5     | 0.123     | 0.008    | 0.945   | 0.012  |
| msc_detrend | 1         | 5     | 0.145     | 0.011    | 0.932   | 0.015  |
| derivative  | 2         | 5     | 0.167     | 0.015    | 0.918   | 0.019  |

### Branch Comparison Bar Chart

```python
fig = analyzer.plot_branch_comparison(
    display_metric='rmse',
    display_partition='test',
    show_ci=True,  # Show confidence intervals
    ci_level=0.95
)
```

### Branch Boxplot

```python
fig = analyzer.plot_branch_boxplot(
    display_metric='rmse',
    display_partition='test'
)
```

### Branch × Fold Heatmap

```python
fig = analyzer.plot_branch_heatmap(
    y_var='fold_id',
    display_metric='rmse'
)
```

### Using Standard Heatmap

```python
fig = analyzer.plot_heatmap(
    x_var='branch_name',
    y_var='model_name',
    display_metric='rmse'
)
```

## Filtering by Branch

```python
# By branch name
snv_preds = predictions.filter_predictions(branch_name='snv_pca')

# By branch ID
branch_0 = predictions.filter_predictions(branch_id=0)

# Top model per branch
top = predictions.top(n=1, rank_metric='rmse', branch_name='snv_pca')
```

## Helper Methods

```python
# Get all branch names
branches = analyzer.get_branches()
# ['snv_pca', 'msc_detrend', 'derivative']

# Get all branch IDs
branch_ids = analyzer.get_branch_ids()
# [0, 1, 2]
```

## Key Behaviors

1. **Shared state before branch**: Splits and upstream preprocessing are computed once
2. **Independent contexts**: Each branch has its own X and Y processing state
3. **Single dataset load**: No redundant I/O
4. **Post-branch iteration**: Steps after branch execute on all branches
5. **Branch metadata**: Predictions include `branch_id` and `branch_name`

## Combining with Generators

Branches work with `_or_` and `_range_` generators:

```python
pipeline = [
    ShuffleSplit(n_splits=3),
    {"branch": {"_or_": [SNV(), MSC(), FirstDerivative()]}},  # 3 branches
    {"_range_": [5, 15, 5], "param": "n_components", "model": PLSRegression},  # 3 PLS variants
]
# Result: 27 predictions (3 branches × 3 n_components × 3 folds)
```

## Example Use Cases

### Use Case 1: Preprocessing Comparison

```python
pipeline = [
    GroupKFold(n_splits=5),
    {"y_processing": StandardScaler()},
    {"branch": [
        [SNV()],
        [MSC()],
        [FirstDerivative()],
        [SecondDerivative()],
    ]},
    PLSRegression(n_components=10),
]
```

### Use Case 2: Model-Preprocessing Exploration

```python
pipeline = [
    ShuffleSplit(n_splits=3),
    {"branch": {
        "snv_pls5": [SNV(), PLSRegression(n_components=5)],
        "snv_pls10": [SNV(), PLSRegression(n_components=10)],
        "msc_rf": [MSC(), RandomForestRegressor()],
    }},
]
```

### Use Case 3: Spectral Preprocessing Comparison

```python
pipeline = [
    ShuffleSplit(n_splits=5),
    {"branch": {
        "raw": [],  # No preprocessing
        "snv": [SNV()],
        "msc": [MSC()],
        "d1": [FirstDerivative()],
        "d2": [SecondDerivative()],
        "savgol_d1": [SavitzkyGolay(deriv=1)],
        "savgol_d2": [SavitzkyGolay(deriv=2)],
    }},
    PLSRegression(n_components=10),
]
```

## See Also

- {doc}`writing_pipelines` - Pipeline configuration guide
- {doc}`/reference/generator_keywords` - Generator syntax reference
- {doc}`/reference/predictions_api` - Predictions API reference
- {doc}`stacking` - Meta-model stacking guide