# Workspace Architecture

**Version**: 4.0

**Status**: Implemented

This document describes the nirs4all workspace structure based on the DuckDB storage backend.

## Design Principles

| Principle | Description |
|-----------|-------------|
| **Database-first** | All structured data lives in a single DuckDB file (`store.duckdb`) |
| **Flat artifacts** | Binary artifacts are kept in a content-addressed flat directory |
| **Export on demand** | Human-readable files are produced only by explicit export operations |
| **Chain as first-class entity** | The preprocessing-to-model chain is stored, not reconstructed |
| **No folder hierarchy** | No nested `runs/` directories, no YAML manifests, no `pipeline.json` files |

---

## Directory Structure

```
workspace/
├── store.duckdb                    # All structured data (DuckDB database)
│                                   #   Tables: runs, pipelines, chains,
│                                   #   predictions, prediction_arrays,
│                                   #   artifacts, logs
│
├── artifacts/                      # Flat content-addressed binary storage
│   ├── ab/                         # 2-char shard prefix
│   │   └── abc123def456.joblib     # Fitted model/transformer
│   ├── cd/
│   │   └── cde789012345.joblib
│   └── ...
│
└── exports/                        # User-triggered exports (on demand)
    ├── model.n4a                   # Exported bundle
    ├── predictions.parquet         # Exported predictions
    └── run_summary.yaml            # Exported run metadata
```

---

## DuckDB Schema (7 tables)

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `runs` | Experiment sessions | `run_id`, `name`, `status`, `config`, `datasets` |
| `pipelines` | Expanded pipeline configs | `pipeline_id`, `run_id`, `expanded_config`, `dataset_name` |
| `chains` | Preprocessing-to-model chains | `chain_id`, `pipeline_id`, `steps`, `fold_artifacts`, `shared_artifacts` |
| `predictions` | Scalar prediction scores | `prediction_id`, `pipeline_id`, `chain_id`, `val_score`, `test_score` |
| `prediction_arrays` | Dense y_true/y_pred arrays | `prediction_id`, `y_true DOUBLE[]`, `y_pred DOUBLE[]` |
| `artifacts` | Artifact metadata & ref counts | `artifact_id`, `artifact_path`, `content_hash`, `ref_count` |
| `logs` | Structured execution logs | `log_id`, `pipeline_id`, `step_idx`, `event`, `duration_ms` |

---

## API Classes

### WorkspaceStore

Central class for all workspace persistence. Replaces the legacy ManifestManager, SimulationSaver, PipelineWriter, PredictionStorage, and ArrayRegistry.

```python
from pathlib import Path

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Run lifecycle
run_id = store.begin_run(name="experiment_1", config={...}, datasets=[...])
pipeline_id = store.begin_pipeline(run_id, name="0001_pls", ...)
chain_id = store.save_chain(pipeline_id, steps=[...], ...)
pred_id = store.save_prediction(pipeline_id, chain_id, ...)
store.complete_pipeline(pipeline_id, best_val=0.12, ...)
store.complete_run(run_id, summary={...})

# Queries
top = store.top_predictions(n=5, metric="val_score")
runs = store.list_runs(status="completed")
preds = store.query_predictions(dataset_name="wheat", partition="val")

# Chain replay (in-workspace prediction)
y_pred = store.replay_chain(chain_id="abc123", X=X_new)

# Export on demand
store.export_chain("abc123", Path("model.n4a"))
store.export_predictions_parquet(Path("results.parquet"), dataset_name="wheat")
store.export_run(run_id, Path("run_summary.yaml"))

# Cleanup
store.delete_run(run_id, delete_artifacts=True)
store.gc_artifacts()
store.vacuum()
```

### PipelineLibrary

Manages reusable pipeline templates with category support.

```python
from nirs4all.pipeline.storage.library import PipelineLibrary

library = PipelineLibrary(workspace_path)

# Save with category and tags
library.save_template(
    pipeline_config,
    name="optimized_pls",
    category="regression",
    tags=["nirs", "pls", "optimized"],
    metrics={"rmse": 0.42}
)

# Search templates
templates = library.list_templates(category="regression", tags=["pls"])
config = library.load_template("optimized_pls")
```

---

## Common Workflows

### 1. Training Session

```python
import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler

result = nirs4all.run(
    pipeline=[MinMaxScaler(), PLSRegression(10)],
    dataset="sample_data/regression",
    verbose=1,
)
# All metadata, predictions, and artifacts are stored in store.duckdb + artifacts/
```

### 2. Export Best Model

```python
# `result` is the object returned by nirs4all.run() in the training session above

# Export best model as a portable bundle
result.export("model.n4a")

# Or export a specific prediction's model
result.export("model.n4a", prediction_id="abc123")
```

### 3. Predict from Bundle

```python
import nirs4all

# Predict from exported bundle (no workspace needed)
preds = nirs4all.predict("model.n4a", new_data)
```

### 4. Query Predictions Across Runs

```python
from pathlib import Path

from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Top models across all datasets
top = store.top_predictions(20, metric="val_score", group_by="model_class")

# Filter by dataset and partition
wheat_preds = store.query_predictions(dataset_name="wheat", partition="test")

# Export filtered predictions
store.export_predictions_parquet(Path("wheat_results.parquet"), dataset_name="wheat")
```

### 5. Delete Old Runs

```python
from nirs4all.pipeline.storage import WorkspaceStore

store = WorkspaceStore(workspace_path)

# Delete a run and cascade to all related data
store.delete_run(run_id, delete_artifacts=True)

# Reclaim disk space
store.vacuum()
```

---

## See Also

- [Storage API](./storage.md) - WorkspaceStore API reference
- [Pipeline Syntax](/reference/pipeline_syntax) - Pipeline configuration
- [CLI Reference](/reference/cli) - Command-line interface
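
---

## Example: Artifact Path Layout

The `artifacts/` layout under Directory Structure (a 2-character shard directory holding a file named after a content hash) can be illustrated with a small standalone sketch. This is not nirs4all's implementation: the helper names `artifact_path` and `store_artifact` are hypothetical, and the use of SHA-256 truncated to 12 hex characters is an assumption inferred only from the example filenames (`abc123def456.joblib`). The actual hash function and digest length may differ.

```python
import hashlib
from pathlib import Path


def artifact_path(artifacts_dir: Path, payload: bytes,
                  ext: str = ".joblib", digest_len: int = 12) -> Path:
    """Hypothetical helper: map payload bytes to a sharded, content-addressed path.

    Assumption: SHA-256 truncated to `digest_len` hex characters.
    """
    digest = hashlib.sha256(payload).hexdigest()[:digest_len]
    # The first two hex characters select the shard directory (e.g. "ab/").
    return artifacts_dir / digest[:2] / f"{digest}{ext}"


def store_artifact(artifacts_dir: Path, payload: bytes) -> Path:
    """Hypothetical helper: write payload once and return its path."""
    path = artifact_path(artifacts_dir, payload)
    if not path.exists():
        # Identical content maps to the same path, so it is written only once.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
    return path
```

Because the path is derived from the content, saving the same fitted object twice yields one file. A scheme like this is what makes a per-artifact `ref_count` (as in the `artifacts` table) useful: several chains can point at one file, and garbage collection can remove it once nothing references it.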