# Pipeline Diagram

Visualize your pipeline structure as a directed acyclic graph (DAG).

## Overview

The `PipelineDiagram` class creates a visual representation of your pipeline execution, showing:

- All pipeline steps with operator names
- Dataset shapes at each step (samples × processings × features)
- Branching and merging points
- Model training steps
- Cross-validation splitters
- Best scores for model nodes

```{figure} ../../assets/pipeline_diagram.png
:align: center
:width: 90%
:alt: Pipeline diagram showing a branching pipeline with SNV leading to FirstDerivative and SavitzkyGolay branches, each with PLS models, merged and fed to a RandomForest meta-model

Pipeline Diagram with Branching Structure
```

## Basic Usage

### From Execution Trace

The recommended way to create a pipeline diagram is from an execution trace, which captures actual runtime shapes:

```python
import nirs4all
from nirs4all.visualization import PipelineDiagram

# Run your pipeline (`pipeline` and `dataset` are assumed to be defined already)
result = nirs4all.run(pipeline, dataset, verbose=1)

# Get the execution trace
trace = result.execution_trace

# Create diagram from trace
diagram = PipelineDiagram.from_trace(trace, result.predictions)
fig = diagram.render(title="My Pipeline Structure")
fig.savefig("pipeline_diagram.png", dpi=150, bbox_inches='tight')
```

### From Pipeline Definition

You can also create a diagram from a pipeline definition alone. No runtime shapes are available in this case, so the example passes an `initial_shape` to `render()`:

```python
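from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

from nirs4all.visualization import PipelineDiagram

# SNV is a nirs4all transform operator; import it from your nirs4all installation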

pipeline = [
    MinMaxScaler(),
    SNV(),
    ShuffleSplit(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]

diagram = PipelineDiagram(pipeline_steps=pipeline)
fig = diagram.render(initial_shape=(100, 1, 500))
```

## Shape Notation

The diagram uses **S×P×F** notation to show dataset dimensions:

- **S** = samples (number of observations)
- **P** = processings (preprocessing views/augmentations)
- **F** = features (wavelengths/columns)

For example, `(100, [1, 500])` means:
- 100 samples
- 1 processing view
- 500 features

Multi-source datasets show shapes for each source:
- `(100, [1, 500], [1, 200])` = 100 samples, two sources with 500 and 200 features
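
The notation above can be reproduced with a few lines of plain Python. The sketch below is only an illustration of how the labels are composed: `format_shape` is a hypothetical helper, not part of nirs4all, and it assumes each source is described by a `(processings, features)` pair.

```python
from typing import Sequence, Tuple


def format_shape(n_samples: int, sources: Sequence[Tuple[int, int]]) -> str:
    """Format a shape as (S, [P, F], ...), with one [P, F] pair per source.

    Hypothetical helper for illustration; not part of nirs4all.
    """
    parts = [str(n_samples)]
    for n_processings, n_features in sources:
        parts.append(f"[{n_processings}, {n_features}]")
    return "(" + ", ".join(parts) + ")"


single = format_shape(100, [(1, 500)])
multi = format_shape(100, [(1, 500), (1, 200)])
print(single)  # (100, [1, 500])
print(multi)   # (100, [1, 500], [1, 200])
```

Reading a label is the reverse operation: the first number is always the sample count, and every bracketed pair that follows describes one source.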

## Node Types and Colors

| Node Type | Color | Description |
|-----------|-------|-------------|
| **Input** | Gray | Dataset entry point |
| **Preprocessing** | Blue | Scalers, transformers, derivatives |
| **Feature Augmentation** | Teal | Feature generation (SNV, Detrend, etc.) |
| **Sample Augmentation** | Green | Data augmentation |
| **Y Processing** | Amber | Target transformation |
| **Splitter** | Purple | Cross-validation splitters |
| **Branch** | Teal | Branch entry points |
| **Merge** | Teal | Branch merge points |
| **Model** | Red | Model training (shows best score) |

## Configuration Options

Customize the diagram appearance:

```python
config = {
    'figsize': (14, 10),      # Figure size
    'fontsize': 10,           # Base font size
    'node_width': 2.5,        # Node width
    'node_height': 0.8,       # Node height
    'show_shapes': True,      # Show shape info on nodes
    'compact': False,         # Use compact labels
}

diagram = PipelineDiagram(
    pipeline_steps=pipeline,
    config=config
)
fig = diagram.render()
```

### Render Options

```python
fig = diagram.render(
    show_shapes=True,         # Override config's show_shapes
    figsize=(16, 12),         # Override figure size
    title="My Pipeline",      # Custom title
    initial_shape=(100, 1, 500)  # Initial dataset shape
)
```
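
The comments above say that `render()` arguments override the values in `config`. One plausible way to think about that precedence is sketched below. This is not the library's implementation: `resolve_options` and `ASSUMED_DEFAULTS` are hypothetical names, and the default values are placeholders chosen for the example.

```python
from typing import Any, Dict, Optional

# Placeholder defaults for illustration; the library's real defaults may differ.
ASSUMED_DEFAULTS: Dict[str, Any] = {
    "figsize": (14, 10),
    "fontsize": 10,
    "show_shapes": True,
    "compact": False,
}


def resolve_options(
    config: Optional[Dict[str, Any]] = None, **render_kwargs: Any
) -> Dict[str, Any]:
    """Merge options with the precedence: render kwargs > config > defaults."""
    resolved = dict(ASSUMED_DEFAULTS)
    resolved.update(config or {})
    # Only explicit (non-None) render arguments override the config.
    resolved.update({k: v for k, v in render_kwargs.items() if v is not None})
    return resolved


options = resolve_options(
    {"figsize": (14, 10), "show_shapes": False},
    figsize=(16, 12),   # overrides the config value
    show_shapes=None,   # not given, so the config value is kept
)
print(options["figsize"])      # (16, 12)
print(options["show_shapes"])  # False
```

Under this reading, `config` holds settings you want to reuse across diagrams, while `render()` arguments are one-off adjustments for a single figure.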

## Branching Visualization

The diagram automatically handles branched pipelines:

```python
pipeline = [
    MinMaxScaler(),
    {"branch": [
        [SNV(), PLSRegression(n_components=10)],
        [Detrend(), PLSRegression(n_components=8)],
        [FirstDerivative(), PLSRegression(n_components=12)],
    ]},
    {"merge": "predictions"},
    {"model": Ridge(), "name": "MetaModel"},
]
```

This creates a diagram showing:
1. Shared preprocessing (MinMaxScaler)
2. Three parallel branches (SNV, Detrend, FirstDerivative)
3. Merge node collecting predictions
4. Final meta-model
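
To make the fan-out and fan-in concrete, the sketch below derives the graph edges for a pipeline of this shape. It is a simplified illustration, not the code `PipelineDiagram` uses: `build_edges` is a hypothetical function, operators are replaced by plain strings, and only the `{"branch": [...]}` keyword is handled.

```python
from typing import Any, List, Tuple

Edge = Tuple[str, str]


def build_edges(steps: List[Any]) -> List[Edge]:
    """Return DAG edges for a pipeline with optional {"branch": [...]} steps.

    A branch step fans out from the current node; the step after it fans
    back in from the last node of every branch.
    """
    edges: List[Edge] = []
    tails = ["Dataset"]  # nodes whose output feeds the next step
    for step in steps:
        if isinstance(step, dict) and "branch" in step:
            parent_tails, tails = tails, []
            for i, branch in enumerate(step["branch"]):
                prev = parent_tails
                for name in branch:
                    node = f"{name}[{i}]"  # suffix keeps branch nodes unique
                    edges.extend((p, node) for p in prev)
                    prev = [node]
                tails.extend(prev)
        else:
            edges.extend((t, step) for t in tails)
            tails = [step]
    return edges


steps = [
    "MinMaxScaler",
    {"branch": [["SNV", "PLS"], ["Detrend", "PLS"], ["FirstDerivative", "PLS"]]},
    "Merge(predictions)",
    "MetaModel",
]
edges = build_edges(steps)
for src, dst in edges:
    print(f"{src} -> {dst}")
```

Each branch contributes its own chain of nodes, and the merge step receives one incoming edge per branch, which is the structure the diagram draws.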

## Source Branch Visualization

For multi-source datasets, `source_branch` applies separate preprocessing to each source:

```python
pipeline = [
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), Detrend()],
    }},
    {"merge_sources": "concat"},
    PLSRegression(n_components=10),
]
```

The diagram shows separate branches for each data source.
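
As a rough guide to the shape you might expect after the merge node, the sketch below assumes that `"concat"` joins the sources along the feature axis. That behavior is an assumption made for this example and is not confirmed here; `concat_sources` is a hypothetical helper, and the feature counts are illustrative.

```python
from typing import Dict, Tuple


def concat_sources(n_samples: int, sources: Dict[str, int]) -> Tuple[int, int]:
    """Return (samples, features) after joining sources along the feature axis.

    Assumes every source shares the same samples, so only features add up.
    """
    return n_samples, sum(sources.values())


merged = concat_sources(100, {"NIR": 500, "Raman": 200})
print(merged)  # (100, 700)
```

For exact post-merge shapes, build the diagram from an execution trace, which records the shapes observed at runtime.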

## Example Output

The following text sketch shows the layout of the diagram for the branching pipeline from the Branching Visualization section:

```
                    ┌─────────────┐
                    │   Dataset   │
                    │ (100,1,500) │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ MinMaxScaler│
                    │ (100,1,500) │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │   Branch    │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │     SNV     │ │   Detrend   │ │ FirstDeriv  │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │    PLS_1    │ │    PLS_2    │ │    PLS_3    │
    │   ★ 0.85    │ │   ★ 0.82    │ │   ★ 0.88    │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                    ┌──────▼──────┐
                    │    Merge    │
                    │(predictions)│
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  MetaModel  │
                    │   ★ 0.91    │
                    └─────────────┘
```

## Using with PredictionAnalyzer

The `PredictionAnalyzer` also provides a branch diagram method:

```python
from nirs4all.visualization import PredictionAnalyzer

# `predictions` holds the predictions from a run, e.g. `result.predictions`
analyzer = PredictionAnalyzer(predictions)
fig = analyzer.plot_branch_diagram(
    show_metrics=True,
    metric='rmse',
    partition='test'
)
```

## Convenience Function

For quick visualization, use the `plot_pipeline_diagram` function:

```python
from nirs4all.visualization import plot_pipeline_diagram

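# `execution_trace` and `predictions` come from a completed run
# (e.g. `result.execution_trace` and `result.predictions`)
fig = plot_pipeline_diagram(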
    trace=execution_trace,
    predictions=predictions,
    show_shapes=True,
    title="My Pipeline"
)
```

## Best Practices

1. **Use execution traces**: They capture the actual runtime shapes
2. **Enable `show_shapes`**: Shape labels make the data flow easier to follow
3. **Save at high DPI**: Use `dpi=150` or higher for presentations
4. **Add titles**: Descriptive titles help document experiments
5. **Check model scores**: Model nodes display their best score

## See Also

- {doc}`prediction_charts` - Prediction visualization
- {doc}`/user_guide/pipelines/branching` - Pipeline branching guide
- {doc}`/developer/index` - Architecture details