# Workspace Architecture **Version**: 3.3 **Status**: Implemented This document describes the nirs4all workspace directory structure and file organization. ## Design Principles | Principle | Description | |-----------|-------------| | **Shallow structure** | Maximum 3 levels deep for easy navigation | | **Sequential numbering** | Pipelines numbered `0001`, `0002`, etc. for clear execution order | | **Dataset-centric runs** | All pipelines for a dataset in one folder | | **Fast access** | Best results accessible via `best_.csv` per dataset | | **Content-addressed binaries** | Deduplication within runs via `_binaries/` | | **Library flexibility** | Three types: filtered, pipeline, fullrun | --- ## Directory Structure ``` workspace/ │ ├── runs/ # Experimental runs │ │ │ ├── wheat_sample1/ # Dataset name (no date prefix) │ │ │ │ │ ├── best_0042_pls_baseline_x9y8z7.csv # Best prediction (auto-updated) │ │ │ │ │ ├── _binaries/ # Shared artifacts (lazy creation) │ │ │ ├── transformer_MinMaxScaler_abc123.joblib │ │ │ └── model_PLSRegression_def456.joblib │ │ │ │ │ ├── 0001_pls_baseline_a1b2c3/ # Pipeline: number + name + hash │ │ │ ├── pipeline.json # Pipeline configuration │ │ │ ├── manifest.yaml # V3 manifest with artifacts │ │ │ ├── metrics.json # Train/val/test metrics │ │ │ └── folds_*.csv # Fold predictions │ │ │ │ │ ├── 0002_b2c3d4/ # Pipeline without custom name │ │ └── 0003_c3d4e5/ │ │ │ └── corn_samples/ # Another dataset │ ├── best_0088_svm_opt_m5n6o7.csv │ ├── _binaries/ │ └── 0001_svm_opt_x1y2z3/ │ ├── binaries/ # Centralized artifact storage (V3) │ ├── wheat_sample1/ # Per-dataset binaries │ │ ├── model_PLSRegression_abc123.joblib │ │ └── transformer_StandardScaler_def456.joblib │ └── corn_samples/ │ ├── exports/ # Best results (fast access) │ │ │ ├── wheat_sample1/ # Dataset-based exports │ │ ├── PLSRegression_predictions.csv │ │ ├── PLSRegression_pipeline.json │ │ └── PLSRegression_summary.json │ │ │ ├── best_predictions/ # Quick access to predictions only │ │ ├── wheat_sample1_0042_x9y8z7.csv │ │ └── corn_samples_0088_m5n6o7.csv │ │ │ └── session_reports/ │ └── wheat_sample1.html │ ├── library/ # Reusable pipelines │ │ │ ├── templates/ # Pipeline configs (no binaries) │ │ ├── baseline_pls.json │ │ └── optimized_svm.json │ │ │ └── trained/ # Trained pipelines (3 types) │ │ │ ├── filtered/ # Config + metrics only │ │ └── wheat_quality_v1/ │ │ │ ├── pipeline/ # Config + all binaries │ │ └── wheat_quality_v1/ │ │ │ └── fullrun/ # Everything + training data │ └── wheat_quality_v1/ │ └── catalog/ # Prediction index (permanent) ├── predictions_meta.parquet # Fast queries (no arrays) ├── predictions_data.parquet # Arrays (on-demand) └── archives/ └── best_predictions/ ``` --- ## Key File Formats ### pipeline.json ```json { "id": "0042_x9y8z7", "hash": "x9y8z7", "created_at": "2024-10-23T10:45:30Z", "status": "completed", "steps": [ { "step": 0, "operator": "StandardScaler", "class": "sklearn.preprocessing.StandardScaler", "params": {"with_mean": true, "with_std": true} }, { "step": 1, "operator": "PLSRegression", "class": "sklearn.cross_decomposition.PLSRegression", "params": {"n_components": 5} } ], "artifacts": [ { "step": 0, "name": "StandardScaler", "hash": "abc123", "path": "../_binaries/transformer_StandardScaler_abc123.joblib" } ] } ``` ### metrics.json ```json { "train": {"rmse": 0.32, "r2": 0.95, "mae": 0.25}, "val": {"rmse": 0.38, "r2": 0.92, "mae": 0.31}, "test": {"rmse": 0.42, "r2": 0.90, "mae": 0.34}, "cross_validation": { "folds": 5, "mean_rmse": 0.40, "std_rmse": 0.05, "fold_results": [ {"fold": 1, "rmse": 0.38}, {"fold": 2, "rmse": 0.45} ] } } ``` ### manifest.yaml (V3) ```yaml schema_version: "3.0" pipeline_id: "0042_x9y8z7" dataset: wheat_sample1 n_features: 1024 artifacts: items: - artifact_id: "0042_x9y8z7$abc123def456:all" chain_path: "s1.StandardScaler" artifact_type: transformer class_name: StandardScaler path: transformer_StandardScaler_abc123.joblib content_hash: "sha256:abc123..." version: 3 ``` --- ## Naming Conventions ### Pipeline IDs Format: `NNNN_[name_]hash` - `NNNN`: 4-digit sequential number (0001, 0002, ...) - `name`: Optional custom name (lowercase, underscores) - `hash`: 6-character config hash Examples: - `0001_a1b2c3` (no custom name) - `0042_pls_baseline_x9y8z7` (with custom name) ### Artifact Filenames Format: `{type}_{class}_{short_hash}.{ext}` Examples: - `model_PLSRegression_abc123def456.joblib` - `transformer_StandardScaler_def456789012.joblib` - `encoder_LabelEncoder_ghi789012345.joblib` --- ## Library Types | Type | Contents | Use Case | Size | |------|----------|----------|------| | **templates/** | Config JSON only | Share pipeline recipes | Small | | **filtered/** | Config + metrics | Track experiments | Small | | **pipeline/** | Config + all binaries | Deploy/retrain | Medium | | **fullrun/** | Everything + data | Full reproducibility | Large | --- ## API Classes ### SimulationSaver Main class for managing pipeline output storage. ```python from nirs4all.pipeline.storage.io import SimulationSaver saver = SimulationSaver( base_path=runs_dir, save_artifacts=True, save_charts=True ) # Register pipeline saver.register("0001_abc123") # Save files saver.save_json("metrics.json", metrics_dict) saver.save_output(step_number=1, name="chart", data=png_bytes, extension=".png") # Export best results saver.export_best_for_dataset("wheat_sample1", workspace_path, runs_dir) ``` ### LibraryManager Manage saved pipeline templates and trained models. ```python from nirs4all.workspace import LibraryManager library = LibraryManager(workspace / "library") # Save template library.save_template(pipeline_config, "baseline_pls", "PLS baseline") # Save trained pipeline library.save_pipeline_full(run_dir, pipeline_dir, "wheat_quality_v1") # List and load templates = library.list_templates() config = library.load_template("baseline_pls") ``` ### PipelineLibrary Alternative library manager with category support. ```python from nirs4all.pipeline.storage.library import PipelineLibrary library = PipelineLibrary(workspace_path) # Save with category and tags library.save_template( pipeline_config, name="optimized_pls", category="regression", tags=["nirs", "pls", "optimized"], metrics={"rmse": 0.42} ) # Search templates templates = library.list_templates(category="regression", tags=["pls"]) ``` --- ## Common Workflows ### 1. Training Session ```python from nirs4all.pipeline import PipelineRunner runner = PipelineRunner( workspace="./workspace", save_artifacts=True ) # Run pipeline - creates workspace/runs/{dataset}/0001_hash/ predictions, per_dataset = runner.run(pipeline, dataset) ``` ### 2. Export Best Model ```python from nirs4all.workspace import LibraryManager # After training, save best to library library = LibraryManager(workspace / "library") library.save_pipeline_full( run_dir=runs_dir / "wheat_sample1", pipeline_dir=runs_dir / "wheat_sample1" / "0042_x9y8z7", name="wheat_quality_prod_v1" ) ``` ### 3. Load and Predict ```python from nirs4all.pipeline import PipelineRunner runner = PipelineRunner(workspace="./workspace") predictions = runner.predict( source="library/trained/pipeline/wheat_quality_prod_v1", data=new_samples ) ``` ### 4. Cleanup Old Runs ```bash # Delete runs older than 30 days (catalog preserves best results) find workspace/runs -mtime +30 -type d -name "20*" -exec rm -rf {} + ``` --- ## See Also - [Storage API](./storage.md) - Artifact storage reference - [Pipeline Syntax](/reference/pipeline_syntax) - Pipeline configuration - [CLI Reference](/reference/cli) - Command-line interface