Pipeline Diagram

Visualize your pipeline structure as a directed acyclic graph (DAG).

Overview

The PipelineDiagram class creates a visual representation of your pipeline execution, showing:

All pipeline steps with operator names
Dataset shapes at each step (samples × processings × features)
Branching and merging points
Model training steps
Cross-validation splitters
Best scores for model nodes

Pipeline diagram showing a branching pipeline with SNV leading to FirstDerivative and SavitzkyGolay branches, each with PLS models, merged and fed to a RandomForest meta-model — Pipeline Diagram with Branching Structure

Basic Usage

From Execution Trace

The recommended way to create a pipeline diagram is from an execution trace, which captures actual runtime shapes:

from nirs4all.visualization import PipelineDiagram

# Run your pipeline
result = nirs4all.run(pipeline, dataset, verbose=1)

# Get the execution trace
trace = result.execution_trace

# Create diagram from trace
diagram = PipelineDiagram.from_trace(trace, result.predictions)
fig = diagram.render(title="My Pipeline Structure")
fig.savefig("pipeline_diagram.png", dpi=150, bbox_inches='tight')

From Pipeline Definition

You can also create a diagram from a pipeline definition (without runtime shapes):

from nirs4all.visualization import PipelineDiagram

pipeline = [
    MinMaxScaler(),
    SNV(),
    ShuffleSplit(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]

diagram = PipelineDiagram(pipeline_steps=pipeline)
fig = diagram.render(initial_shape=(100, 1, 500))

Shape Notation

The diagram uses S×P×F notation to show dataset dimensions:

S = samples (number of observations)
P = processings (preprocessing views/augmentations)
F = features (wavelengths/columns)

For example, (100, [1, 500]) means:

100 samples
1 processing view
500 features

Multi-source datasets show shapes for each source:

(100, [1, 500], [1, 200]) = 100 samples, two sources with 500 and 200 features

Node Types and Colors

Node Type	Color	Description
Input	Gray	Dataset entry point
Preprocessing	Blue	Scalers, transformers, derivatives
Feature Augmentation	Teal	Feature generation (SNV, Detrend, etc.)
Sample Augmentation	Green	Data augmentation
Y Processing	Amber	Target transformation
Splitter	Purple	Cross-validation splitters
Branch	Teal	Branch entry points
Merge	Teal	Branch merge points
Model	Red	Model training (shows best score)

Configuration Options

Customize the diagram appearance:

config = {
    'figsize': (14, 10),      # Figure size
    'fontsize': 10,           # Base font size
    'node_width': 2.5,        # Node width
    'node_height': 0.8,       # Node height
    'show_shapes': True,      # Show shape info on nodes
    'compact': False,         # Use compact labels
}

diagram = PipelineDiagram(
    pipeline_steps=pipeline,
    config=config
)
fig = diagram.render()

Render Options

fig = diagram.render(
    show_shapes=True,         # Override config's show_shapes
    figsize=(16, 12),         # Override figure size
    title="My Pipeline",      # Custom title
    initial_shape=(100, 1, 500)  # Initial dataset shape
)

Branching Visualization

The diagram automatically handles branched pipelines:

pipeline = [
    MinMaxScaler(),
    {"branch": [
        [SNV(), PLSRegression(n_components=10)],
        [Detrend(), PLSRegression(n_components=8)],
        [FirstDerivative(), PLSRegression(n_components=12)],
    ]},
    {"merge": "predictions"},
    {"model": Ridge(), "name": "MetaModel"},
]

This creates a diagram showing:

Shared preprocessing (MinMaxScaler)
Three parallel branches (SNV, Detrend, FirstDerivative)
Merge node collecting predictions
Final meta-model

Source Branch Visualization

Multi-source datasets with per-source preprocessing:

pipeline = [
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), Detrend()],
    }},
    {"merge_sources": "concat"},
    PLSRegression(n_components=10),
]

The diagram shows separate branches for each data source.

Example Output

Here’s what a complex pipeline diagram looks like:

                    ┌─────────────┐
                    │   Dataset   │
                    │ (100,1,500) │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ MinMaxScaler│
                    │ (100,1,500) │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │   Branch    │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │     SNV     │ │   Detrend   │ │ FirstDeriv  │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │    PLS_1    │ │    PLS_2    │ │    PLS_3    │
    │   ★ 0.85    │ │   ★ 0.82    │ │   ★ 0.88    │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                    ┌──────▼──────┐
                    │    Merge    │
                    │(predictions)│
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  MetaModel  │
                    │   ★ 0.91    │
                    └─────────────┘

Using with PredictionAnalyzer

The PredictionAnalyzer also provides a branch diagram method:

from nirs4all.visualization import PredictionAnalyzer

analyzer = PredictionAnalyzer(predictions)
fig = analyzer.plot_branch_diagram(
    show_metrics=True,
    metric='rmse',
    partition='test'
)

Convenience Function

For quick visualization:

from nirs4all.visualization import plot_pipeline_diagram

fig = plot_pipeline_diagram(
    trace=execution_trace,
    predictions=predictions,
    show_shapes=True,
    title="My Pipeline"
)

Best Practices

Use execution traces: They provide accurate runtime shapes
Enable show_shapes: Helps understand data flow
Save high DPI: Use dpi=150 or higher for presentations
Add titles: Descriptive titles help document experiments
Check model scores: The diagram shows best scores on model nodes