# Artifacts & Storage This guide covers the artifact storage system and workspace structure in nirs4all. ## Overview The artifacts system (V3) provides: - **Deterministic artifact IDs** based on operator chains for complete execution path tracking - **Content-addressed storage** with deduplication across pipelines - **Dependency tracking** for stacking and transfer learning - **LRU caching** for efficient artifact loading ## Workspace Structure ``` workspace/ ├── runs/ # Experimental runs │ └── {dataset}/ # Dataset-centric organization │ ├── _binaries/ # Shared artifacts (deduplicated) │ ├── 0001_hash/ # Pipeline 1 │ └── 0002_name_hash/ # Pipeline 2 (with custom name) │ ├── binaries/ # Centralized artifact storage (V3) │ └── {dataset}/ # Per-dataset binaries │ ├── exports/ # Best results (fast access) │ ├── {dataset}/ # Full exports │ └── best_predictions/ # Predictions only │ ├── library/ # Reusable pipelines │ ├── templates/ # Config only │ └── trained/ # With binaries │ ├── filtered/ # Config + metrics │ ├── pipeline/ # Full pipeline │ └── fullrun/ # Everything + data │ └── catalog/ # Prediction index ├── predictions_meta.parquet # Metadata (fast queries) └── predictions_data.parquet # Arrays (on-demand) ``` ## Artifact Types | Type | Description | |------|-------------| | `MODEL` | Trained ML models (sklearn, TensorFlow, PyTorch, JAX) | | `TRANSFORMER` | Fitted preprocessors (scalers, feature extractors) | | `SPLITTER` | Train/test split configuration | | `ENCODER` | Label encoders, y-scalers | | `META_MODEL` | Stacking meta-models with source dependencies | ## V3 Artifact ID Format Format: `{pipeline_id}${chain_hash}:{fold_id}` Examples: - `0001_pls$a1b2c3d4e5f6:all` - Shared artifact - `0001_pls$7f8e9d0c1b2a:0` - Fold 0 artifact - `0001_pls$3c4d5e6f7a8b:1` - Fold 1 artifact The chain hash is computed from the operator chain path (e.g., `s1.MinMaxScaler>s3.PLS[br=0]`), ensuring deterministic identification across branching, multi-source, and stacking scenarios. ## Using the ArtifactRegistry The `ArtifactRegistry` is the central class for artifact management: ```python from pathlib import Path from nirs4all.pipeline.storage.artifacts import ArtifactRegistry, ArtifactType # Initialize registry registry = ArtifactRegistry( workspace=Path("./workspace"), dataset="wheat_sample1", pipeline_id="0001_pls_abc123" ) # Register an artifact with V3 chain-based ID record = registry.register_with_chain( obj=trained_model, chain="s1.MinMaxScaler>s3.PLSRegression", artifact_type=ArtifactType.MODEL, step_index=3, fold_id=0, params={"n_components": 10} ) print(f"Saved: {record.artifact_id}") # Output: 0001_pls_abc123$a1b2c3d4e5f6:0 ``` ### Key Methods ```python # Generate ID from chain artifact_id = registry.generate_id(chain, fold_id=0) # Resolve ID to record record = registry.resolve(artifact_id) # Get by chain path (V3) record = registry.get_by_chain("s1.MinMaxScaler>s3.PLS", fold_id=0) # Get artifacts for a step records = registry.get_artifacts_for_step( pipeline_id="0001", step_index=3, branch_path=[0], fold_id=None ) # Get fold models for CV averaging fold_records = registry.get_fold_models( pipeline_id="0001", step_index=3 ) ``` ## Using the ArtifactLoader The `ArtifactLoader` provides efficient loading with caching: ```python from nirs4all.pipeline.storage.artifacts import ArtifactLoader # Create from manifest loader = ArtifactLoader.from_manifest(manifest, results_dir) # Load by ID (uses LRU cache) model = loader.load_by_id("0001_pls$abc123:0") # Load by chain path model = loader.load_by_chain("s1.MinMaxScaler>s3.PLS", fold_id=0) # Load all artifacts for a step artifacts = loader.load_for_step( step_index=3, branch_path=[0], fold_id=0 ) for artifact_id, obj in artifacts: print(f"Loaded: {artifact_id}") # Load fold models for ensemble fold_models = loader.load_fold_models(step_index=3) for fold_id, model in fold_models: print(f"Fold {fold_id}: {model}") ``` ### Meta-Model Loading (Stacking) ```python # Load meta-model with all source models meta_model, sources, feature_cols = loader.load_meta_model_with_sources( artifact_id="0001_pls$abc123:all", validate_branch=True ) # sources is [(source_id, source_model), ...] # feature_cols is ["PLSRegression_pred", "RandomForest_pred", ...] ``` ### Cache Management ```python # Get cache statistics info = loader.get_cache_info() print(f"Cache hit rate: {info['hit_rate']:.2%}") # Preload artifacts loader.preload_artifacts(artifact_ids=["0001:3:0", "0001:3:1"]) # Clear cache loader.clear_cache() # Resize cache loader.set_cache_size(200) ``` ## Library Management ### Save Pipeline Templates ```python from nirs4all.workspace import LibraryManager library = LibraryManager(workspace / "library") # Save config-only template library.save_template( pipeline_config=pipeline_dict, name="baseline_pls", description="PLS baseline with SNV preprocessing" ) # Save full trained pipeline library.save_pipeline_full( run_dir=runs_dir / "wheat_sample1", pipeline_dir=runs_dir / "wheat_sample1" / "0042_x9y8z7", name="wheat_quality_v1" ) ``` ### Load and Reuse ```python # List templates templates = library.list_templates() for t in templates: print(f"{t['name']}: {t['description']}") # Load template config = library.load_template("baseline_pls") # Use in pipeline runner = PipelineRunner(workspace="./workspace") predictions = runner.run(config, new_dataset) ``` ## Cleanup Utilities ```python # Find orphaned artifacts orphans = registry.find_orphaned_artifacts() print(f"Found {len(orphans)} orphaned files") # Delete orphans (dry run first) deleted, freed = registry.delete_orphaned_artifacts(dry_run=True) print(f"Would delete {len(deleted)} files, freeing {freed / 1024:.1f} KB") # Actually delete deleted, freed = registry.delete_orphaned_artifacts(dry_run=False) # Get storage statistics stats = registry.get_stats() print(f"Total artifacts: {stats['total_artifacts']}") print(f"Unique files: {stats['unique_files']}") print(f"Deduplication ratio: {stats['deduplication_ratio']:.1%}") ``` ## Best Practices 1. **Use chain-based registration** (`register_with_chain`) for new code to ensure deterministic artifact IDs. 2. **Let the registry handle deduplication** - identical artifacts (same content hash) automatically share the same file. 3. **Use the loader's cache** - the LRU cache significantly improves performance when loading the same artifacts multiple times. 4. **Track dependencies for meta-models** - always register source models before the meta-model to enable proper dependency resolution. 5. **Clean up periodically** - use `find_orphaned_artifacts()` and `delete_orphaned_artifacts()` to reclaim disk space. ## See Also - {doc}`/reference/storage` - Storage API reference - {doc}`/reference/workspace` - Workspace architecture - {doc}`architecture` - Pipeline architecture overview - {doc}`/reference/pipeline_syntax` - Pipeline configuration syntax