Workspace Architecture

Version: 4.0
Status: Implemented

This document describes the structure of a nirs4all workspace, which is built on the DuckDB storage backend.

Design Principles

Principle	Description
Database-first	All structured data in a single DuckDB file (`store.duckdb`)
Flat artifacts	Binary artifacts in content-addressed flat directory
Export on demand	Human-readable files produced only by explicit export operations
Chain as first-class entity	The preprocessing-to-model chain is stored, not reconstructed
No folder hierarchy	No nested `runs/` directories, no YAML manifests, no `pipeline.json` files

Directory Structure

workspace/
├── store.duckdb                        # All structured data (DuckDB database)
│                                        #   Tables: runs, pipelines, chains,
│                                        #   predictions, prediction_arrays,
│                                        #   artifacts, logs
│
├── artifacts/                           # Flat content-addressed binary storage
│   ├── ab/                              # 2-char shard prefix
│   │   └── abc123def456.joblib          # Fitted model/transformer
│   ├── cd/
│   │   └── cde789012345.joblib
│   └── ...
│
└── exports/                             # User-triggered exports (on demand)
    ├── model.n4a                        # Exported bundle
    ├── predictions.parquet              # Exported predictions
    └── run_summary.yaml                 # Exported run metadata
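
The sharded layout under `artifacts/` can be illustrated with a short sketch. This is not the nirs4all implementation: the helper names `artifact_path` and `store_artifact` are hypothetical, and the choice of SHA-256 truncated to 12 hex characters is an assumption made only to match the shape of the file names shown in the tree above.

```python
import hashlib
from pathlib import Path


def artifact_path(root: Path, payload: bytes, suffix: str = ".joblib") -> Path:
    """Return the sharded, content-addressed location for `payload`.

    Assumption: SHA-256 truncated to 12 hex characters. The hash function
    and name length actually used by nirs4all may differ.
    """
    digest = hashlib.sha256(payload).hexdigest()[:12]
    # The first two characters of the hash select the shard directory.
    return root / "artifacts" / digest[:2] / f"{digest}{suffix}"


def store_artifact(root: Path, payload: bytes) -> Path:
    """Write `payload` once; identical content always maps to the same file."""
    path = artifact_path(root, payload)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
    return path
```

Because the file name is derived from the content, saving the same fitted object twice yields one file on disk, which is what makes a flat directory workable without a per-run folder hierarchy.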

DuckDB Schema (7 tables)

Table	Purpose	Key Columns
`runs`	Experiment sessions	`run_id`, `name`, `status`, `config`, `datasets`
`pipelines`	Expanded pipeline configs	`pipeline_id`, `run_id`, `expanded_config`, `dataset_name`
`chains`	Preprocessing-to-model chains	`chain_id`, `pipeline_id`, `steps`, `fold_artifacts`, `shared_artifacts`
`predictions`	Scalar prediction scores	`prediction_id`, `pipeline_id`, `chain_id`, `val_score`, `test_score`
`prediction_arrays`	Dense y_true/y_pred arrays	`prediction_id`, `y_true DOUBLE[]`, `y_pred DOUBLE[]`
`artifacts`	Artifact metadata & ref counts	`artifact_id`, `artifact_path`, `content_hash`, `ref_count`
`logs`	Structured execution logs	`log_id`, `pipeline_id`, `step_idx`, `event`, `duration_ms`
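
To show how these tables relate, the sketch below builds a reduced version of four of them and walks from a prediction back to its run. It uses SQLite from the Python standard library as a stand-in for DuckDB so that it runs without extra dependencies. The column types, the inserted rows, and the assumption that a lower `val_score` is better (an RMSE-like metric) are illustrative only and are not taken from the real schema.

```python
import sqlite3

# SQLite stands in for DuckDB here so the sketch runs with the standard library.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (run_id TEXT PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE pipelines (pipeline_id TEXT PRIMARY KEY, run_id TEXT, dataset_name TEXT);
    CREATE TABLE chains (chain_id TEXT PRIMARY KEY, pipeline_id TEXT);
    CREATE TABLE predictions (
        prediction_id TEXT PRIMARY KEY, pipeline_id TEXT, chain_id TEXT,
        val_score REAL, test_score REAL
    );
    """
)

# Illustrative rows: one run, one pipeline, two chains with one prediction each.
conn.execute("INSERT INTO runs VALUES ('r1', 'experiment_1', 'completed')")
conn.execute("INSERT INTO pipelines VALUES ('p1', 'r1', 'wheat')")
conn.executemany("INSERT INTO chains VALUES (?, ?)", [("c1", "p1"), ("c2", "p1")])
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
    [("pr1", "p1", "c1", 0.12, 0.15), ("pr2", "p1", "c2", 0.09, 0.11)],
)

# Walk prediction -> pipeline -> run; the lowest score ranks first (assumed).
best = conn.execute(
    """
    SELECT r.name, p.dataset_name, pr.chain_id, pr.val_score
    FROM predictions AS pr
    JOIN pipelines AS p ON p.pipeline_id = pr.pipeline_id
    JOIN runs AS r ON r.run_id = p.run_id
    ORDER BY pr.val_score ASC
    LIMIT 1
    """
).fetchone()
print(best)  # ('experiment_1', 'wheat', 'c2', 0.09)
```

Keeping `chain_id` on every prediction row is what lets a ranked prediction be replayed or exported directly, without reconstructing the chain from the pipeline configuration.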

API Classes

WorkspaceStore

Central class for all workspace persistence. Replaces the legacy ManifestManager, SimulationSaver, PipelineWriter, PredictionStorage, and ArrayRegistry.

from pathlib import Path

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Run lifecycle
run_id = store.begin_run(name="experiment_1", config={...}, datasets=[...])
pipeline_id = store.begin_pipeline(run_id, name="0001_pls", ...)
chain_id = store.save_chain(pipeline_id, steps=[...], ...)
pred_id = store.save_prediction(pipeline_id, chain_id, ...)
store.complete_pipeline(pipeline_id, best_val=0.12, ...)
store.complete_run(run_id, summary={...})

# Queries
top = store.top_predictions(n=5, metric="val_score")
runs = store.list_runs(status="completed")
preds = store.query_predictions(dataset_name="wheat", partition="val")

# Chain replay (in-workspace prediction)
y_pred = store.replay_chain(chain_id="abc123", X=X_new)

# Export on demand
store.export_chain("abc123", Path("model.n4a"))
store.export_predictions_parquet(Path("results.parquet"), dataset_name="wheat")
store.export_run(run_id, Path("run_summary.yaml"))

# Cleanup
store.delete_run(run_id, delete_artifacts=True)
store.gc_artifacts()
store.vacuum()

PipelineLibrary

Manages reusable pipeline templates, which can be organized by category and tags.

from nirs4all.pipeline.storage.library import PipelineLibrary

library = PipelineLibrary(workspace_path)

# Save with category and tags
library.save_template(
    pipeline_config,
    name="optimized_pls",
    category="regression",
    tags=["nirs", "pls", "optimized"],
    metrics={"rmse": 0.42}
)

# Search templates
templates = library.list_templates(category="regression", tags=["pls"])
config = library.load_template("optimized_pls")

Common Workflows

1. Training Session

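import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler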

result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset="sample_data/regression",
    verbose=1,
)
# All metadata, predictions, and artifacts are stored in store.duckdb + artifacts/

2. Export Best Model

# Export the best model as a portable bundle
# (`result` is the object returned by nirs4all.run() in the training session)
result.export("model.n4a")

# Or export the model behind a specific prediction
result.export("model.n4a", prediction_id="abc123")

3. Predict from Bundle

import nirs4all

# Predict from exported bundle (no workspace needed)
preds = nirs4all.predict("model.n4a", new_data)

4. Query Predictions Across Runs

from pathlib import Path

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Top models across all datasets
top = store.top_predictions(20, metric="val_score", group_by="model_class")

# Filter by dataset and partition
wheat_preds = store.query_predictions(dataset_name="wheat", partition="test")

# Export filtered predictions
store.export_predictions_parquet(Path("wheat_results.parquet"), dataset_name="wheat")

5. Delete Old Runs

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Delete a run and cascade to all related data
store.delete_run(run_id, delete_artifacts=True)

# Reclaim disk space
store.vacuum()
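
The `ref_count` column on the `artifacts` table suggests that shared artifacts are removed only once nothing references them. The sketch below illustrates that idea in memory. It is an assumption about the mechanism, not a description of how `delete_run` or `gc_artifacts` are implemented, and `ArtifactRecord`, `release`, and `collect_garbage` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ArtifactRecord:
    """Hypothetical in-memory mirror of one row in the artifacts table."""

    artifact_id: str
    artifact_path: str
    ref_count: int


def release(records: dict[str, ArtifactRecord], artifact_ids: list[str]) -> None:
    """Drop one reference per id, as deleting a chain that used them might."""
    for artifact_id in artifact_ids:
        record = records[artifact_id]
        record.ref_count = max(0, record.ref_count - 1)


def collect_garbage(records: dict[str, ArtifactRecord]) -> list[str]:
    """Remove unreferenced records and return their paths for deletion."""
    orphaned = [r for r in records.values() if r.ref_count == 0]
    for record in orphaned:
        del records[record.artifact_id]
    return sorted(r.artifact_path for r in orphaned)


records = {
    "a1": ArtifactRecord("a1", "artifacts/ab/abc123def456.joblib", 2),
    "a2": ArtifactRecord("a2", "artifacts/cd/cde789012345.joblib", 1),
}
release(records, ["a1", "a2"])
print(collect_garbage(records))  # ['artifacts/cd/cde789012345.joblib']
```

In this sketch the artifact still referenced by another chain survives, while the one whose count reaches zero becomes eligible for removal from disk.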

See Also

Storage API - WorkspaceStore API reference
Pipeline Syntax - Pipeline configuration
CLI Reference - Command-line interface