Workspace Architecture

Version: 4.0
Status: Implemented

This document describes the nirs4all workspace structure based on the DuckDB storage backend.

Design Principles

| Principle | Description |
|---|---|
| Database-first | All structured data in a single DuckDB file (store.duckdb) |
| Flat artifacts | Binary artifacts in content-addressed flat directory |
| Export on demand | Human-readable files produced only by explicit export operations |
| Chain as first-class entity | The preprocessing-to-model chain is stored, not reconstructed |
| No folder hierarchy | No nested runs/ directories, no YAML manifests, no pipeline.json files |


Directory Structure

workspace/
├── store.duckdb                        # All structured data (DuckDB database)
│                                        #   Tables: runs, pipelines, chains,
│                                        #   predictions, prediction_arrays,
│                                        #   artifacts, logs
│
├── artifacts/                           # Flat content-addressed binary storage
│   ├── ab/                              # 2-char shard prefix
│   │   └── abc123def456.joblib          # Fitted model/transformer
│   ├── cd/
│   │   └── cde789012345.joblib
│   └── ...
│
└── exports/                             # User-triggered exports (on demand)
    ├── model.n4a                        # Exported bundle
    ├── predictions.parquet              # Exported predictions
    └── run_summary.yaml                 # Exported run metadata
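
The sharded layout of artifacts/ can be sketched as follows. This is a minimal illustration, not the nirs4all implementation: the helper names `artifact_path` and `store_artifact` are hypothetical, and the choice of SHA-256 truncated to 12 hex characters is an assumption, since this document does not specify the hash function or identifier length.

```python
import hashlib
import tempfile
from pathlib import Path


def artifact_path(root: Path, payload: bytes, suffix: str = ".joblib") -> Path:
    """Return the sharded, content-addressed location for `payload`.

    Assumption: SHA-256 truncated to 12 hex characters. The real hash
    function and identifier length are not stated in this document.
    """
    digest = hashlib.sha256(payload).hexdigest()[:12]
    # The first two characters of the digest select the shard directory.
    return root / "artifacts" / digest[:2] / f"{digest}{suffix}"


def store_artifact(root: Path, payload: bytes) -> Path:
    """Write `payload` once; identical content always maps to the same file."""
    path = artifact_path(root, payload)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
    return path


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    first = store_artifact(workspace, b"fitted-model-bytes")
    second = store_artifact(workspace, b"fitted-model-bytes")  # same content
    other = store_artifact(workspace, b"another-model")
    stored_files = sorted(
        p.name for p in (workspace / "artifacts").rglob("*.joblib")
    )
    print(first.relative_to(workspace))
```

Because the path is derived from the content alone, saving the same fitted transformer twice produces a single file, which is what makes reference counting in the artifacts table meaningful.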

DuckDB Schema (7 tables)

| Table | Purpose | Key Columns |
|---|---|---|
| runs | Experiment sessions | run_id, name, status, config, datasets |
| pipelines | Expanded pipeline configs | pipeline_id, run_id, expanded_config, dataset_name |
| chains | Preprocessing-to-model chains | chain_id, pipeline_id, steps, fold_artifacts, shared_artifacts |
| predictions | Scalar prediction scores | prediction_id, pipeline_id, chain_id, val_score, test_score |
| prediction_arrays | Dense y_true/y_pred arrays | prediction_id, y_true DOUBLE[], y_pred DOUBLE[] |
| artifacts | Artifact metadata & ref counts | artifact_id, artifact_path, content_hash, ref_count |
| logs | Structured execution logs | log_id, pipeline_id, step_idx, event, duration_ms |
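
The key columns suggest how the tables join: predictions point to pipelines through pipeline_id, and pipelines point to runs through run_id. The sketch below illustrates that relationship with a reduced version of four tables. It uses Python's built-in sqlite3 purely as a stand-in so it runs without extra dependencies; the real store is DuckDB. The column types, the sample rows, and the assumption that a lower val_score is better (as for an error metric) are all illustrative.

```python
import sqlite3

# Stand-in only: sqlite3 keeps the example dependency-free. The real store is
# DuckDB, and the column types below are assumptions.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE runs (run_id TEXT PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE pipelines (
        pipeline_id TEXT PRIMARY KEY, run_id TEXT, dataset_name TEXT
    );
    CREATE TABLE chains (chain_id TEXT PRIMARY KEY, pipeline_id TEXT);
    CREATE TABLE predictions (
        prediction_id TEXT PRIMARY KEY, pipeline_id TEXT, chain_id TEXT,
        val_score REAL, test_score REAL
    );
    """
)

# Made-up sample rows, for illustration only.
con.execute("INSERT INTO runs VALUES ('r1', 'experiment_1', 'completed')")
con.executemany(
    "INSERT INTO pipelines VALUES (?, ?, ?)",
    [("p1", "r1", "wheat"), ("p2", "r1", "corn")],
)
con.executemany(
    "INSERT INTO chains VALUES (?, ?)", [("c1", "p1"), ("c2", "p2")]
)
con.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
    [
        ("pred1", "p1", "c1", 0.12, 0.15),
        ("pred2", "p2", "c2", 0.30, 0.28),
        ("pred3", "p1", "c1", 0.10, 0.14),
    ],
)

# Best validation scores for one dataset, traced back to the owning run.
rows = con.execute(
    """
    SELECT r.name, p.dataset_name, pr.chain_id, pr.val_score
    FROM predictions AS pr
    JOIN pipelines AS p ON p.pipeline_id = pr.pipeline_id
    JOIN runs AS r ON r.run_id = p.run_id
    WHERE p.dataset_name = ?
    ORDER BY pr.val_score ASC
    """,
    ("wheat",),
).fetchall()

for row in rows:
    print(row)
```

Keeping scalar scores in predictions and the dense arrays in prediction_arrays means a ranking query like this one never has to touch the array data.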


API Classes

WorkspaceStore

Central class for all workspace persistence. Replaces the legacy ManifestManager, SimulationSaver, PipelineWriter, PredictionStorage, and ArrayRegistry.

from pathlib import Path

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Run lifecycle
run_id = store.begin_run(name="experiment_1", config={...}, datasets=[...])
pipeline_id = store.begin_pipeline(run_id, name="0001_pls", ...)
chain_id = store.save_chain(pipeline_id, steps=[...], ...)
pred_id = store.save_prediction(pipeline_id, chain_id, ...)
store.complete_pipeline(pipeline_id, best_val=0.12, ...)
store.complete_run(run_id, summary={...})

# Queries
top = store.top_predictions(n=5, metric="val_score")
runs = store.list_runs(status="completed")
preds = store.query_predictions(dataset_name="wheat", partition="val")

# Chain replay (in-workspace prediction)
y_pred = store.replay_chain(chain_id="abc123", X=X_new)

# Export on demand
store.export_chain("abc123", Path("model.n4a"))
store.export_predictions_parquet(Path("results.parquet"), dataset_name="wheat")
store.export_run(run_id, Path("run_summary.yaml"))

# Cleanup
store.delete_run(run_id, delete_artifacts=True)
store.gc_artifacts()
store.vacuum()
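
The cleanup calls above rely on the ref_count column of the artifacts table. The exact semantics of delete_run and gc_artifacts are not spelled out here, so the following is only a sketch of one plausible reference-counting scheme: `ArtifactIndex` and its methods are hypothetical names, and an in-memory dictionary stands in for the artifacts table.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ArtifactIndex:
    """Toy in-memory stand-in for the artifacts table (artifact_id -> ref_count)."""

    ref_counts: Dict[str, int] = field(default_factory=dict)

    def add_reference(self, artifact_id: str) -> None:
        """Record that one more chain uses this artifact."""
        self.ref_counts[artifact_id] = self.ref_counts.get(artifact_id, 0) + 1

    def release(self, artifact_id: str) -> None:
        """Record that a chain using this artifact was deleted."""
        self.ref_counts[artifact_id] -= 1

    def collect_garbage(self) -> List[str]:
        """Drop every artifact that no chain references any more."""
        dead = sorted(a for a, n in self.ref_counts.items() if n <= 0)
        for artifact_id in dead:
            del self.ref_counts[artifact_id]
        return dead


index = ArtifactIndex()
for artifact_id in ("scaler", "pls_a"):  # chain A
    index.add_reference(artifact_id)
for artifact_id in ("scaler", "pls_b"):  # chain B shares the scaler
    index.add_reference(artifact_id)

# Deleting chain A releases its references, then garbage collection runs.
for artifact_id in ("scaler", "pls_a"):
    index.release(artifact_id)
removed = index.collect_garbage()
print(removed)
```

Under this scheme the shared scaler survives because chain B still references it, while the model used only by the deleted chain becomes eligible for removal.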

PipelineLibrary

Manages reusable pipeline templates, organized by category and tags.

from nirs4all.pipeline.storage.library import PipelineLibrary

library = PipelineLibrary(workspace_path)

# Save with category and tags
library.save_template(
    pipeline_config,
    name="optimized_pls",
    category="regression",
    tags=["nirs", "pls", "optimized"],
    metrics={"rmse": 0.42}
)

# Search templates
templates = library.list_templates(category="regression", tags=["pls"])
config = library.load_template("optimized_pls")

Common Workflows

1. Training Session

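import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler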

result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset="sample_data/regression",
    verbose=1,
)
# All metadata, predictions, and artifacts are stored in store.duckdb + artifacts/

2. Export Best Model

# Export best model as a portable bundle
result.export("model.n4a")

# Or export specific prediction's model
result.export("model.n4a", prediction_id="abc123")

3. Predict from Bundle

import nirs4all

# Predict from exported bundle (no workspace needed)
preds = nirs4all.predict("model.n4a", new_data)

4. Query Predictions Across Runs

from pathlib import Path

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Top models across all datasets
top = store.top_predictions(20, metric="val_score", group_by="model_class")

# Filter by dataset and partition
wheat_preds = store.query_predictions(dataset_name="wheat", partition="test")

# Export filtered predictions
store.export_predictions_parquet(Path("wheat_results.parquet"), dataset_name="wheat")

5. Delete Old Runs

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Delete a run and cascade to all related data
# (run_id is the identifier returned earlier by store.begin_run)
store.delete_run(run_id, delete_artifacts=True)

# Reclaim disk space
store.vacuum()

See Also