Workspace Architecture
Version: 4.0 Status: Implemented
This document describes the nirs4all workspace structure based on the DuckDB storage backend.
Design Principles
Principle |
Description |
|---|---|
Database-first |
All structured data in a single DuckDB file ( |
Flat artifacts |
Binary artifacts in content-addressed flat directory |
Export on demand |
Human-readable files produced only by explicit export operations |
Chain as first-class entity |
The preprocessing-to-model chain is stored, not reconstructed |
No folder hierarchy |
No nested |
Directory Structure
workspace/
├── store.duckdb # All structured data (DuckDB database)
│ # Tables: runs, pipelines, chains,
│ # predictions, prediction_arrays,
│ # artifacts, logs
│
├── artifacts/ # Flat content-addressed binary storage
│ ├── ab/ # 2-char shard prefix
│ │ └── abc123def456.joblib # Fitted model/transformer
│ ├── cd/
│ │ └── cde789012345.joblib
│ └── ...
│
└── exports/ # User-triggered exports (on demand)
├── model.n4a # Exported bundle
├── predictions.parquet # Exported predictions
└── run_summary.yaml # Exported run metadata
DuckDB Schema (7 tables)
Table |
Purpose |
Key Columns |
|---|---|---|
|
Experiment sessions |
|
|
Expanded pipeline configs |
|
|
Preprocessing-to-model chains |
|
|
Scalar prediction scores |
|
|
Dense y_true/y_pred arrays |
|
|
Artifact metadata & ref counts |
|
|
Structured execution logs |
|
API Classes
WorkspaceStore
Central class for all workspace persistence. Replaces the legacy ManifestManager, SimulationSaver, PipelineWriter, PredictionStorage, and ArrayRegistry.
from nirs4all.pipeline.storage import WorkspaceStore
store = WorkspaceStore(workspace_path)
# Run lifecycle
run_id = store.begin_run(name="experiment_1", config={...}, datasets=[...])
pipeline_id = store.begin_pipeline(run_id, name="0001_pls", ...)
chain_id = store.save_chain(pipeline_id, steps=[...], ...)
pred_id = store.save_prediction(pipeline_id, chain_id, ...)
store.complete_pipeline(pipeline_id, best_val=0.12, ...)
store.complete_run(run_id, summary={...})
# Queries
top = store.top_predictions(n=5, metric="val_score")
runs = store.list_runs(status="completed")
preds = store.query_predictions(dataset_name="wheat", partition="val")
# Chain replay (in-workspace prediction)
y_pred = store.replay_chain(chain_id="abc123", X=X_new)
# Export on demand
store.export_chain("abc123", Path("model.n4a"))
store.export_predictions_parquet(Path("results.parquet"), dataset_name="wheat")
store.export_run(run_id, Path("run_summary.yaml"))
# Cleanup
store.delete_run(run_id, delete_artifacts=True)
store.gc_artifacts()
store.vacuum()
PipelineLibrary
Manage reusable pipeline templates with category support.
from nirs4all.pipeline.storage.library import PipelineLibrary
library = PipelineLibrary(workspace_path)
# Save with category and tags
library.save_template(
pipeline_config,
name="optimized_pls",
category="regression",
tags=["nirs", "pls", "optimized"],
metrics={"rmse": 0.42}
)
# Search templates
templates = library.list_templates(category="regression", tags=["pls"])
config = library.load_template("optimized_pls")
Common Workflows
1. Training Session
import nirs4all
result = nirs4all.run(
pipeline=[MinMaxScaler(), PLSRegression(10)],
dataset="sample_data/regression",
verbose=1,
)
# All metadata, predictions, and artifacts are stored in store.duckdb + artifacts/
2. Export Best Model
# Export best model as a portable bundle
result.export("model.n4a")
# Or export specific prediction's model
result.export("model.n4a", prediction_id="abc123")
3. Predict from Bundle
import nirs4all
# Predict from exported bundle (no workspace needed)
preds = nirs4all.predict("model.n4a", new_data)
4. Query Predictions Across Runs
from nirs4all.pipeline.storage import WorkspaceStore
store = WorkspaceStore(workspace_path)
# Top models across all datasets
top = store.top_predictions(20, metric="val_score", group_by="model_class")
# Filter by dataset and partition
wheat_preds = store.query_predictions(dataset_name="wheat", partition="test")
# Export filtered predictions
store.export_predictions_parquet(Path("wheat_results.parquet"), dataset_name="wheat")
5. Delete Old Runs
from nirs4all.pipeline.storage import WorkspaceStore
store = WorkspaceStore(workspace_path)
# Delete a run and cascade to all related data
store.delete_run(run_id, delete_artifacts=True)
# Reclaim disk space
store.vacuum()
See Also
Storage API - WorkspaceStore API reference
Pipeline Syntax - Pipeline configuration
CLI Reference - Command-line interface