Pipeline Diagram

Visualize your pipeline structure as a directed acyclic graph (DAG).

Overview

The PipelineDiagram class creates a visual representation of your pipeline execution, showing:

  • All pipeline steps with operator names

  • Dataset shapes at each step (samples × processings × features)

  • Branching and merging points

  • Model training steps

  • Cross-validation splitters

  • Best scores for model nodes

Pipeline diagram showing a branching pipeline with SNV leading to FirstDerivative and SavitzkyGolay branches, each with PLS models, merged and fed to a RandomForest meta-model

Pipeline Diagram with Branching Structure

Basic Usage

From Execution Trace

The recommended way to create a pipeline diagram is from an execution trace, which captures actual runtime shapes:

from nirs4all.visualization import PipelineDiagram

# Run your pipeline
result = nirs4all.run(pipeline, dataset, verbose=1)

# Get the execution trace
trace = result.execution_trace

# Create diagram from trace
diagram = PipelineDiagram.from_trace(trace, result.predictions)
fig = diagram.render(title="My Pipeline Structure")
fig.savefig("pipeline_diagram.png", dpi=150, bbox_inches='tight')

From Pipeline Definition

You can also create a diagram from a pipeline definition (without runtime shapes):

from nirs4all.visualization import PipelineDiagram

pipeline = [
    MinMaxScaler(),
    SNV(),
    ShuffleSplit(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]

diagram = PipelineDiagram(pipeline_steps=pipeline)
fig = diagram.render(initial_shape=(100, 1, 500))

Shape Notation

The diagram uses S×P×F notation to show dataset dimensions:

  • S = samples (number of observations)

  • P = processings (preprocessing views/augmentations)

  • F = features (wavelengths/columns)

For example, (100, [1, 500]) means:

  • 100 samples

  • 1 processing view

  • 500 features

Multi-source datasets show shapes for each source:

  • (100, [1, 500], [1, 200]) = 100 samples, two sources with 500 and 200 features

Node Types and Colors

Node Type

Color

Description

Input

Gray

Dataset entry point

Preprocessing

Blue

Scalers, transformers, derivatives

Feature Augmentation

Teal

Feature generation (SNV, Detrend, etc.)

Sample Augmentation

Green

Data augmentation

Y Processing

Amber

Target transformation

Splitter

Purple

Cross-validation splitters

Branch

Teal

Branch entry points

Merge

Teal

Branch merge points

Model

Red

Model training (shows best score)

Configuration Options

Customize the diagram appearance:

config = {
    'figsize': (14, 10),      # Figure size
    'fontsize': 10,           # Base font size
    'node_width': 2.5,        # Node width
    'node_height': 0.8,       # Node height
    'show_shapes': True,      # Show shape info on nodes
    'compact': False,         # Use compact labels
}

diagram = PipelineDiagram(
    pipeline_steps=pipeline,
    config=config
)
fig = diagram.render()

Render Options

fig = diagram.render(
    show_shapes=True,         # Override config's show_shapes
    figsize=(16, 12),         # Override figure size
    title="My Pipeline",      # Custom title
    initial_shape=(100, 1, 500)  # Initial dataset shape
)

Branching Visualization

The diagram automatically handles branched pipelines:

pipeline = [
    MinMaxScaler(),
    {"branch": [
        [SNV(), PLSRegression(n_components=10)],
        [Detrend(), PLSRegression(n_components=8)],
        [FirstDerivative(), PLSRegression(n_components=12)],
    ]},
    {"merge": "predictions"},
    {"model": Ridge(), "name": "MetaModel"},
]

This creates a diagram showing:

  1. Shared preprocessing (MinMaxScaler)

  2. Three parallel branches (SNV, Detrend, FirstDerivative)

  3. Merge node collecting predictions

  4. Final meta-model

Source Branch Visualization

Multi-source datasets with per-source preprocessing:

pipeline = [
    {"source_branch": {
        "NIR": [SNV(), FirstDerivative()],
        "Raman": [MSC(), Detrend()],
    }},
    {"merge_sources": "concat"},
    PLSRegression(n_components=10),
]

The diagram shows separate branches for each data source.

Example Output

Here’s what a complex pipeline diagram looks like:

                    ┌─────────────┐
                    │   Dataset   │
                    │ (100,1,500) │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ MinMaxScaler│
                    │ (100,1,500) │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │   Branch    │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │     SNV     │ │   Detrend   │ │ FirstDeriv  │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │    PLS_1    │ │    PLS_2    │ │    PLS_3    │
    │   ★ 0.85    │ │   ★ 0.82    │ │   ★ 0.88    │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                    ┌──────▼──────┐
                    │    Merge    │
                    │(predictions)│
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  MetaModel  │
                    │   ★ 0.91    │
                    └─────────────┘

Using with PredictionAnalyzer

The PredictionAnalyzer also provides a branch diagram method:

from nirs4all.visualization import PredictionAnalyzer

analyzer = PredictionAnalyzer(predictions)
fig = analyzer.plot_branch_diagram(
    show_metrics=True,
    metric='rmse',
    partition='test'
)

Convenience Function

For quick visualization:

from nirs4all.visualization import plot_pipeline_diagram

fig = plot_pipeline_diagram(
    trace=execution_trace,
    predictions=predictions,
    show_shapes=True,
    title="My Pipeline"
)

Best Practices

  1. Use execution traces: They provide accurate runtime shapes

  2. Enable show_shapes: Helps understand data flow

  3. Save high DPI: Use dpi=150 or higher for presentations

  4. Add titles: Descriptive titles help document experiments

  5. Check model scores: The diagram shows best scores on model nodes

See Also