Architecture Overview

This document provides a high-level overview of the nirs4all pipeline architecture for developers who want to understand the internals or contribute to the project.

Architecture Philosophy

The pipeline module is designed around a layered architecture with separation of concerns:

  1. Orchestration: Managing multiple datasets and pipeline configurations

  2. Execution: Running a specific sequence of steps on a specific dataset

  3. Step Logic: The actual implementation of a pipeline step (model training, preprocessing, etc.)

System Overview

┌─────────────────────────────────────────────────────────────────┐
│                        PipelineRunner                           │
│                   (Public API / Facade)                         │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                      PipelineOrchestrator                        │
│              (Manages datasets × configurations)                 │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                       PipelineExecutor                           │
│                (Executes steps sequentially)                     │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                         StepRunner                               │
│              (Routes steps to controllers)                       │
└───────────────┬─────────────────────────────────┬───────────────┘
                │                                 │
                ▼                                 ▼
        ┌───────────────┐               ┌─────────────────┐
        │   StepParser  │               │ ControllerRouter│
        │  (Normalize)  │               │   (Dispatch)    │
        └───────────────┘               └────────┬────────┘
                                                 │
                                                 ▼
                                    ┌────────────────────────┐
                                    │  CONTROLLER_REGISTRY   │
                                    │    (Priority-based)    │
                                    └────────────────────────┘

Key Components

1. PipelineRunner

Location: nirs4all/pipeline/runner.py

Role: The public entry point (Facade pattern)

Responsibilities:

  • Provides a simple API for users (run(), predict(), export(), retrain())

  • Initializes the environment (workspace, logging)

  • Delegates work to the Orchestrator

from nirs4all.pipeline import PipelineRunner

# `pipeline` and `dataset` are assumed to be defined earlier by the caller
runner = PipelineRunner(save_artifacts=True, verbose=1)
predictions, per_dataset = runner.run(pipeline, dataset)

2. PipelineOrchestrator

Location: nirs4all/pipeline/execution/orchestrator.py

Role: The high-level manager

Responsibilities:

  • Iterates over all provided Datasets

  • Iterates over all provided Pipeline Configurations

  • Manages global results (aggregating predictions across runs)

  • Instantiates a PipelineExecutor for each (Dataset, Pipeline) pair
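
The responsibilities above amount to a nested iteration over datasets and configurations. The following is a minimal sketch of that idea only; `ToyExecutor` and `orchestrate` are hypothetical names and do not reflect the actual PipelineOrchestrator code.

```python
from itertools import product


class ToyExecutor:
    """Hypothetical stand-in for a per-run executor (not the nirs4all class)."""

    def __init__(self, dataset_name, config_name):
        self.dataset_name = dataset_name
        self.config_name = config_name

    def execute(self):
        # A real executor would run the pipeline steps; the toy returns a label.
        return f"{self.config_name} on {self.dataset_name}"


def orchestrate(datasets, configs):
    """Run every configuration on every dataset and aggregate the results."""
    all_results = []
    per_dataset = {name: [] for name in datasets}
    for dataset_name, config_name in product(datasets, configs):
        executor = ToyExecutor(dataset_name, config_name)  # one executor per pair
        result = executor.execute()
        all_results.append(result)
        per_dataset[dataset_name].append(result)
    return all_results, per_dataset


results, per_dataset = orchestrate(["wheat", "corn"], ["pls", "ridge"])
```

Two datasets and two configurations produce four runs, each with its own executor instance.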

3. PipelineExecutor

Location: nirs4all/pipeline/execution/executor.py

Role: The sequence runner

Responsibilities:

  • Executes a list of steps sequentially

  • Manages the ExecutionContext (state propagation)

  • Handles artifact management (saving models, logs) for a single run

  • Catches errors and handles the “continue on error” logic
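
Sequential execution with "continue on error" handling can be sketched as a loop that threads a context through each step. This is an illustration under assumptions: `run_steps`, the `continue_on_error` flag name, and the dict-based context are hypothetical and are not taken from PipelineExecutor.

```python
def run_steps(steps, context, continue_on_error=False):
    """Apply each step to the context in order, passing the result forward."""
    errors = []
    for index, step in enumerate(steps):
        try:
            context = step(context)
        except Exception as exc:
            if not continue_on_error:
                raise
            # Record the failure and move on; the context is left unchanged.
            errors.append((index, exc))
    return context, errors


def scale(ctx):
    return {**ctx, "scaled": True}


def broken(ctx):
    raise ValueError("step failed")


def fit(ctx):
    return {**ctx, "fitted": True}


final_context, errors = run_steps([scale, broken, fit], {}, continue_on_error=True)
```

With the flag set, the failing second step is recorded and the third step still runs on the context produced by the first.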

4. StepRunner

Location: nirs4all/pipeline/steps/step_runner.py

Role: The unit executor

Responsibilities:

  • Parses the raw step definition (dict, string, object) using StepParser

  • Routes the step to the appropriate Controller using ControllerRouter

  • Executes the Controller
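
The parse, route, execute sequence can be illustrated with plain functions. Everything in this sketch (`parse`, `route`, `run_step`, and the handler table) is a hypothetical simplification; the real StepRunner delegates to StepParser and ControllerRouter, whose interfaces are not shown here.

```python
def parse(raw_step):
    """Normalize a raw step to a dict with a 'name' key (toy canonical form)."""
    if isinstance(raw_step, str):
        return {"name": raw_step}
    return dict(raw_step)


def double(parsed, data):
    return [x * 2 for x in data]


def negate(parsed, data):
    return [-x for x in data]


# Toy dispatch table standing in for controller lookup.
HANDLERS = {"double": double, "negate": negate}


def route(parsed):
    """Pick the handler responsible for the parsed step."""
    return HANDLERS[parsed["name"]]


def run_step(raw_step, data):
    parsed = parse(raw_step)         # 1. normalize the step definition
    controller = route(parsed)       # 2. select who handles it
    return controller(parsed, data)  # 3. execute


result = run_step("double", [1, 2, 3])
result = run_step({"name": "negate"}, result)
```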

5. StepParser

Location: nirs4all/pipeline/steps/parser.py

Role: Step configuration parser

Responsibilities:

  • Normalizes different step syntaxes (dict, string, instance, list) into a canonical ParsedStep format

  • Extracts keywords from step configurations

  • Identifies step types (workflow, serialized, subpipeline, direct)

  • Deserializes operators when needed
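
A rough sketch of normalizing several step syntaxes into one canonical record follows. The `ToyParsedStep` fields and the mapping from input type to step type are assumptions made for illustration; the real ParsedStep format and classification rules may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyParsedStep:
    """Hypothetical canonical form; not the real ParsedStep."""
    kind: str
    operator: Any = None
    keyword: Optional[str] = None


def parse_step(raw):
    """Map a raw step (list, string, dict, or object) to a ToyParsedStep."""
    if isinstance(raw, list):
        # Assumed: a list is a nested pipeline, parsed recursively.
        return ToyParsedStep(kind="subpipeline", operator=[parse_step(s) for s in raw])
    if isinstance(raw, str):
        # Assumed: a string names an operator that must be deserialized later.
        return ToyParsedStep(kind="serialized", operator=raw)
    if isinstance(raw, dict):
        if not raw:
            raise ValueError("empty step dictionary")
        # Assumed: the first key is the keyword, its value the operator.
        keyword, operator = next(iter(raw.items()))
        return ToyParsedStep(kind="workflow", operator=operator, keyword=keyword)
    # Anything else is treated as an operator instance used directly.
    return ToyParsedStep(kind="direct", operator=raw)


parsed = parse_step([{"model": "PLS"}, "pkg.Scaler", 2.0])
```

Downstream code then only has to handle one shape, whatever syntax the user wrote.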

6. ControllerRouter

Location: nirs4all/pipeline/steps/router.py

Role: Controller selection

Responsibilities:

  • Matches parsed steps to appropriate controllers using priority-based selection

  • Queries each controller’s matches() method

  • Returns the highest-priority matching controller
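
Priority-based selection can be sketched as "collect every controller whose matches() accepts the step, then take the one with the lowest priority number" (lower numbers win, as stated under Priority Pattern). The controller classes and the `route` function below are hypothetical; only the matches() method name comes from this document.

```python
class FallbackController:
    priority = 1000  # catch-all

    @classmethod
    def matches(cls, step):
        return True


class DictStepController:
    priority = 50

    @classmethod
    def matches(cls, step):
        return isinstance(step, dict)


class ModelStepController:
    priority = 10

    @classmethod
    def matches(cls, step):
        return isinstance(step, dict) and "model" in step


TOY_REGISTRY = [FallbackController, DictStepController, ModelStepController]


def route(step, registry=TOY_REGISTRY):
    """Return the matching controller with the lowest priority number."""
    candidates = [c for c in registry if c.matches(step)]
    if not candidates:
        raise LookupError(f"no controller matches step {step!r}")
    return min(candidates, key=lambda c: c.priority)


chosen = route({"model": "PLS"})
```

All three toy controllers match the step `{"model": "PLS"}`, and the most specific one is chosen because it has the lowest number.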

7. Controllers

Location: nirs4all/controllers/

Role: The business logic

Responsibilities:

  • Implements the actual logic for a step (e.g., ModelController, PreprocessingController)

  • Interacts with the SpectroDataset

  • Updates the ExecutionContext

  • Returns artifacts (files, objects) to be saved
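
To make these responsibilities concrete, here is a toy controller that modifies a dataset, updates a context, and returns an artifact. The `execute` method name, its signature, and the dict-based dataset and context are assumptions for illustration; only `matches()` and `priority` are named in this document.

```python
class ToyCenteringController:
    """Hypothetical controller that mean-centers a list of values."""

    priority = 50

    @classmethod
    def matches(cls, step):
        return step == "center"

    def execute(self, step, dataset, context):
        values = dataset["X"]
        mean = sum(values) / len(values)
        # Interact with the dataset: replace X with its centered version.
        dataset["X"] = [x - mean for x in values]
        # Return an updated context rather than touching global state.
        new_context = {**context, "last_step": step}
        # Artifacts are (name, object) pairs the caller may choose to save.
        artifacts = [("center_mean", mean)]
        return new_context, artifacts


dataset = {"X": [1.0, 2.0, 3.0]}
context, artifacts = ToyCenteringController().execute("center", dataset, {})
```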

See Controller System for details.

Data Flow

The data flow relies on two main objects passed through the layers:

SpectroDataset

The data container holding:

  • X: Feature matrix (spectral data)

  • y: Target values

  • metadata: Sample metadata

  • fold_indices: Cross-validation assignments

The dataset is mutable, but it is typically modified through internal state updates managed by controllers rather than directly.

ExecutionContext

A composite object containing:

  • DataSelector: Immutable configuration for how to read data (e.g., “train” partition)

  • PipelineState: Mutable state tracking (e.g., current Y-transformation)

  • StepMetadata: Ephemeral flags for communication between steps

  • Custom: Controller-specific data (e.g., branch contexts)
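
The composite structure can be pictured with dataclasses: an immutable selector alongside mutable state, metadata, and a free-form dict. All class and field names below are hypothetical stand-ins; the real ExecutionContext in nirs4all/pipeline/config/context.py may be organized differently.

```python
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class ToyDataSelector:
    partition: str = "train"  # immutable: how to read the data


@dataclass
class ToyPipelineState:
    y_transformation: str = "raw"  # mutable: tracks the current Y-transformation


@dataclass
class ToyStepMetadata:
    flags: dict = field(default_factory=dict)  # short-lived flags between steps


@dataclass
class ToyExecutionContext:
    selector: ToyDataSelector = field(default_factory=ToyDataSelector)
    state: ToyPipelineState = field(default_factory=ToyPipelineState)
    metadata: ToyStepMetadata = field(default_factory=ToyStepMetadata)
    custom: dict = field(default_factory=dict)  # controller-specific data


context = ToyExecutionContext()
context.state.y_transformation = "scaled"  # allowed: state is mutable
context.custom["branch"] = 0               # allowed: free-form dict

try:
    context.selector.partition = "test"    # rejected: selector is frozen
    selector_is_frozen = False
except FrozenInstanceError:
    selector_is_frozen = True
```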

Directory Structure

nirs4all/
├── pipeline/                    # Pipeline execution engine
│   ├── runner.py                # PipelineRunner (public API)
│   ├── config/                  # Configuration handling
│   │   ├── config.py            # PipelineConfigs
│   │   └── context.py           # ExecutionContext, RuntimeContext
│   ├── execution/               # Execution infrastructure
│   │   ├── orchestrator.py      # PipelineOrchestrator
│   │   └── executor.py          # PipelineExecutor
│   ├── steps/                   # Step processing
│   │   ├── parser.py            # StepParser
│   │   ├── router.py            # ControllerRouter
│   │   └── step_runner.py       # StepRunner
│   ├── bundle/                  # Export/import bundles
│   │   ├── generator.py         # BundleGenerator
│   │   └── loader.py            # BundleLoader
│   └── storage/                 # Artifact management
│       └── artifacts/           # Artifact registry, loader
├── controllers/                 # Step handlers
│   ├── registry.py              # @register_controller
│   ├── controller.py            # OperatorController base
│   ├── transforms/              # TransformerMixin controllers
│   ├── models/                  # Model training controllers
│   ├── data/                    # Data manipulation (branch, merge)
│   └── splitters/               # Cross-validation controllers
├── data/                        # Data handling
│   ├── config.py                # DatasetConfigs
│   ├── dataset.py               # SpectroDataset
│   └── predictions.py           # Predictions container
└── operators/                   # Pipeline operators
    ├── transforms/              # NIRS-specific transformers
    ├── augmentation/            # Data augmentation
    ├── models/                  # Pre-built models
    └── splitters/               # Splitting algorithms

Common Patterns

Registry Pattern

Controllers are discovered automatically via the @register_controller decorator:

from nirs4all.controllers.registry import register_controller
from nirs4all.controllers.controller import OperatorController

@register_controller
class MyController(OperatorController):
    priority = 50  # lower number = higher priority
    # ...
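
For readers unfamiliar with the pattern, a registering decorator can be as small as the sketch below. `toy_register_controller` and `TOY_CONTROLLER_REGISTRY` are hypothetical; the actual register_controller may store, validate, or sort controllers differently.

```python
TOY_CONTROLLER_REGISTRY = []


def toy_register_controller(cls):
    """Add the class to the registry, keep it sorted by priority, return it unchanged."""
    TOY_CONTROLLER_REGISTRY.append(cls)
    TOY_CONTROLLER_REGISTRY.sort(key=lambda c: c.priority)
    return cls


@toy_register_controller
class GenericController:
    priority = 90


@toy_register_controller
class SpecificController:
    priority = 30


registered_names = [c.__name__ for c in TOY_CONTROLLER_REGISTRY]
```

Because the decorator runs when the class is defined, importing a module is enough to make its controllers available; no central list has to be edited.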

Priority Pattern

Controllers compete for steps based on priority; a lower number means a higher priority:

Priority    Use Case
--------    ---------------------------------
1-10        Special/high-priority controllers
20-50       Specific operator controllers
80-100      Generic fallback controllers
1000+       Catch-all controllers

Facade Pattern

PipelineRunner hides complexity from users, providing a simple API:

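from nirs4all.pipeline import PipelineRunner

# `pipeline`, `dataset`, `source`, and `new_data` are assumed to be defined by the caller
runner = PipelineRunner()
predictions, _ = runner.run(pipeline, dataset)
y_pred, _ = runner.predict(source, new_data)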

Context Object Pattern

ExecutionContext encapsulates state and is updated immutably as it passes through the steps: controllers receive a context object and return an updated one rather than modifying global state.
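
One common way to get this receive-and-return behaviour in Python is a frozen dataclass combined with dataclasses.replace. The sketch below shows the general technique only; `ToyContext` and its fields are hypothetical and are not the ExecutionContext implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyContext:
    partition: str = "train"
    y_processing: str = "raw"


def toy_controller_step(context):
    """Return a new context with one field changed; the input is left untouched."""
    return replace(context, y_processing="scaled")


before = ToyContext()
after = toy_controller_step(before)
```

The caller keeps `before` intact, which makes it straightforward to branch a pipeline or retry a step from a known state.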

Strategy Pattern

Controllers implement different strategies for handling different operator types. The router selects the appropriate controller at runtime.

Extension Points

The system provides multiple extension points:

  1. Custom Controllers: Add new controllers with @register_controller

  2. Custom Keywords: Use any non-reserved keyword in step dictionaries

  3. Custom Operators: Any Python callable or class can be an operator

  4. Custom Context Data: Use context.custom dict for controller-specific data

  5. Custom Artifacts: Controllers can return any serializable artifacts

See Also

Source files:

  • nirs4all/pipeline/runner.py - PipelineRunner implementation

  • nirs4all/pipeline/execution/orchestrator.py - Orchestrator implementation

  • nirs4all/controllers/ - Controller implementations