Artifacts & Storage
This guide covers the artifact storage system and workspace structure in nirs4all.
Overview
The artifacts system (V3) provides:
Deterministic artifact IDs based on operator chains for complete execution path tracking
Content-addressed storage with deduplication across pipelines
Dependency tracking for stacking and transfer learning
LRU caching for efficient artifact loading
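Content-addressed storage with deduplication can be pictured with a small standard-library sketch. It does not show nirs4all's internals: the helper `store_blob`, the SHA-256 digest, and the `.bin` suffix are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(store_dir: Path, payload: bytes) -> Path:
    """Write payload under a file name derived from its SHA-256 digest.

    Hypothetical helper: the naming scheme is an assumption, not the
    library's actual layout.
    """
    digest = hashlib.sha256(payload).hexdigest()
    target = store_dir / f"{digest}.bin"
    if not target.exists():  # identical content is written only once
        target.write_bytes(payload)
    return target


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    first = store_blob(store, b"fitted scaler bytes")
    second = store_blob(store, b"fitted scaler bytes")  # same content
    third = store_blob(store, b"another model")
    same_file = first == second
    file_count = len(list(store.iterdir()))
    print(same_file, file_count)
```

Because the file name depends only on the content, two pipelines that produce byte-identical artifacts end up pointing at one file.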
Workspace Structure
workspace/
├── runs/                            # Experimental runs
│   └── {dataset}/                   # Dataset-centric organization
│       ├── _binaries/               # Shared artifacts (deduplicated)
│       ├── 0001_hash/               # Pipeline 1
│       └── 0002_name_hash/          # Pipeline 2 (with custom name)
│
├── binaries/                        # Centralized artifact storage (V3)
│   └── {dataset}/                   # Per-dataset binaries
│
├── exports/                         # Best results (fast access)
│   ├── {dataset}/                   # Full exports
│   └── best_predictions/            # Predictions only
│
├── library/                         # Reusable pipelines
│   ├── templates/                   # Config only
│   └── trained/                     # With binaries
│       ├── filtered/                # Config + metrics
│       ├── pipeline/                # Full pipeline
│       └── fullrun/                 # Everything + data
│
└── catalog/                         # Prediction index
    ├── predictions_meta.parquet     # Metadata (fast queries)
    └── predictions_data.parquet     # Arrays (on-demand)
Artifact Types
| Type | Description |
|---|---|
|  | Trained ML models (sklearn, TensorFlow, PyTorch, JAX) |
|  | Fitted preprocessors (scalers, feature extractors) |
|  | Train/test split configuration |
|  | Label encoders, y-scalers |
|  | Stacking meta-models with source dependencies |
V3 Artifact ID Format
Format: {pipeline_id}${chain_hash}:{fold_id}
Examples:
0001_pls$a1b2c3d4e5f6:all - Shared artifact
0001_pls$7f8e9d0c1b2a:0 - Fold 0 artifact
0001_pls$3c4d5e6f7a8b:1 - Fold 1 artifact
The chain hash is computed from the operator chain path (e.g., s1.MinMaxScaler>s3.PLS[br=0]), ensuring deterministic identification across branching, multi-source, and stacking scenarios.
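The ID format can be sketched as follows. The hashing scheme is an assumption: this guide does not state which digest or truncation length the registry uses, so `chain_hash`, `make_artifact_id`, and `parse_artifact_id` are hypothetical helpers that only mirror the `{pipeline_id}${chain_hash}:{fold_id}` layout.

```python
import hashlib


def chain_hash(chain: str, length: int = 12) -> str:
    """Hypothetical: short hex digest of the operator chain path."""
    return hashlib.sha256(chain.encode("utf-8")).hexdigest()[:length]


def make_artifact_id(pipeline_id: str, chain: str, fold_id=None) -> str:
    # Assumption: a missing fold maps to the "all" suffix used for shared artifacts.
    fold = "all" if fold_id is None else str(fold_id)
    return f"{pipeline_id}${chain_hash(chain)}:{fold}"


def parse_artifact_id(artifact_id: str):
    """Split an ID into (pipeline_id, chain_hash, fold)."""
    pipeline_id, rest = artifact_id.split("$", 1)
    digest, fold = rest.rsplit(":", 1)
    return pipeline_id, digest, fold


chain = "s1.MinMaxScaler>s3.PLS[br=0]"
id_fold0 = make_artifact_id("0001_pls", chain, fold_id=0)
id_again = make_artifact_id("0001_pls", chain, fold_id=0)
id_shared = make_artifact_id("0001_pls", chain)
print(id_fold0, id_shared)
```

The same chain path always yields the same ID, which is what makes lookups reproducible across runs.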
Using the ArtifactRegistry
The ArtifactRegistry is the central class for artifact management:
from pathlib import Path
from nirs4all.pipeline.storage.artifacts import ArtifactRegistry, ArtifactType

# Initialize registry
registry = ArtifactRegistry(
    workspace=Path("./workspace"),
    dataset="wheat_sample1",
    pipeline_id="0001_pls_abc123"
)

# Register an artifact with V3 chain-based ID
# (trained_model is a fitted model object created earlier)
record = registry.register_with_chain(
    obj=trained_model,
    chain="s1.MinMaxScaler>s3.PLSRegression",
    artifact_type=ArtifactType.MODEL,
    step_index=3,
    fold_id=0,
    params={"n_components": 10}
)
print(f"Saved: {record.artifact_id}")
# Output: 0001_pls_abc123$a1b2c3d4e5f6:0
Key Methods
# Generate ID from chain
chain = "s1.MinMaxScaler>s3.PLS"
artifact_id = registry.generate_id(chain, fold_id=0)

# Resolve ID to record
record = registry.resolve(artifact_id)

# Get by chain path (V3)
record = registry.get_by_chain("s1.MinMaxScaler>s3.PLS", fold_id=0)

# Get artifacts for a step
records = registry.get_artifacts_for_step(
    pipeline_id="0001",
    step_index=3,
    branch_path=[0],
    fold_id=None
)

# Get fold models for CV averaging
fold_records = registry.get_fold_models(
    pipeline_id="0001",
    step_index=3
)
Using the ArtifactLoader
The ArtifactLoader provides efficient loading with caching:
from nirs4all.pipeline.storage.artifacts import ArtifactLoader

# Create from manifest
loader = ArtifactLoader.from_manifest(manifest, results_dir)

# Load by ID (uses LRU cache)
model = loader.load_by_id("0001_pls$abc123:0")

# Load by chain path
model = loader.load_by_chain("s1.MinMaxScaler>s3.PLS", fold_id=0)

# Load all artifacts for a step
artifacts = loader.load_for_step(
    step_index=3,
    branch_path=[0],
    fold_id=0
)
for artifact_id, obj in artifacts:
    print(f"Loaded: {artifact_id}")

# Load fold models for ensemble
fold_models = loader.load_fold_models(step_index=3)
for fold_id, model in fold_models:
    print(f"Fold {fold_id}: {model}")
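Once the fold models are loaded, one common way to build an ensemble is to average their predictions. The sketch below assumes the `(fold_id, model)` pairs shown above and models that expose a `predict` method. `ConstantOffsetModel` and `average_fold_predictions` are hypothetical stand-ins, and the unweighted mean is an assumption rather than the library's documented behavior.

```python
from statistics import fmean


class ConstantOffsetModel:
    """Stand-in for a fitted fold model (hypothetical, for illustration)."""

    def __init__(self, offset: float):
        self.offset = offset

    def predict(self, X):
        # One prediction per sample: the row mean plus a fold-specific offset.
        return [fmean(row) + self.offset for row in X]


def average_fold_predictions(fold_models, X):
    """Average per-sample predictions across (fold_id, model) pairs."""
    per_fold = [model.predict(X) for _, model in fold_models]
    return [fmean(sample_preds) for sample_preds in zip(*per_fold)]


fold_models = [(0, ConstantOffsetModel(-1.0)), (1, ConstantOffsetModel(1.0))]
X = [[1.0, 3.0], [2.0, 6.0]]
averaged = average_fold_predictions(fold_models, X)
print(averaged)
```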
Meta-Model Loading (Stacking)
# Load meta-model with all source models
meta_model, sources, feature_cols = loader.load_meta_model_with_sources(
artifact_id="0001_pls$abc123:all",
validate_branch=True
)
# sources is [(source_id, source_model), ...]
# feature_cols is ["PLSRegression_pred", "RandomForest_pred", ...]
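Loading a meta-model requires its source models to be available first, which is a dependency-ordering problem. A minimal sketch with the standard library's `graphlib` shows the idea. The dependency map and the artifact IDs in it are invented for illustration and do not reflect how the manifest actually stores dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: artifact ID -> IDs it depends on.
pls_id = "0001_pls$aaaaaaaaaaaa:all"    # source model 1
rf_id = "0001_pls$bbbbbbbbbbbb:all"     # source model 2
meta_id = "0001_pls$cccccccccccc:all"   # stacking meta-model

dependencies = {
    pls_id: set(),
    rf_id: set(),
    meta_id: {pls_id, rf_id},
}

# static_order() yields every artifact after all of its dependencies.
load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)
```

The meta-model always comes last in `load_order`. A circular dependency would raise `graphlib.CycleError` instead of producing an order.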
Cache Management
# Get cache statistics
info = loader.get_cache_info()
print(f"Cache hit rate: {info['hit_rate']:.2%}")
# Preload artifacts
loader.preload_artifacts(artifact_ids=["0001_pls$7f8e9d0c1b2a:0", "0001_pls$3c4d5e6f7a8b:1"])
# Clear cache
loader.clear_cache()
# Resize cache
loader.set_cache_size(200)
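To make the cache statistics concrete, here is a minimal LRU cache built on `collections.OrderedDict`. It is a sketch of the general technique, not the loader's implementation. The class `LRUCache`, its `get` and `info` methods, and the loader callback `fake_load` are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache with hit/miss counters."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = loader(key)  # e.g. deserialize the artifact from disk
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used
        return value

    def info(self):
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": self.hits / total if total else 0.0,
            "size": len(self._items),
        }


def fake_load(key):
    return f"object for {key}"


cache = LRUCache(max_size=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key, fake_load)
info = cache.info()
print(info)
```

With room for only two entries, the access pattern above evicts `b` when `c` arrives, so the final request for `b` is a miss. A larger cache trades memory for a higher hit rate.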
Library Management
Save Pipeline Templates
from pathlib import Path
from nirs4all.workspace import LibraryManager

workspace = Path("./workspace")
runs_dir = workspace / "runs"
library = LibraryManager(workspace / "library")

# Save config-only template (pipeline_dict is your pipeline configuration)
library.save_template(
    pipeline_config=pipeline_dict,
    name="baseline_pls",
    description="PLS baseline with SNV preprocessing"
)

# Save full trained pipeline
library.save_pipeline_full(
    run_dir=runs_dir / "wheat_sample1",
    pipeline_dir=runs_dir / "wheat_sample1" / "0042_x9y8z7",
    name="wheat_quality_v1"
)
Load and Reuse
from nirs4all.pipeline import PipelineRunner

# List templates
templates = library.list_templates()
for t in templates:
    print(f"{t['name']}: {t['description']}")

# Load template
config = library.load_template("baseline_pls")

# Use in pipeline (new_dataset is the dataset to run the template on)
runner = PipelineRunner(workspace="./workspace")
predictions = runner.run(config, new_dataset)
Cleanup Utilities
# Find orphaned artifacts
orphans = registry.find_orphaned_artifacts()
print(f"Found {len(orphans)} orphaned files")
# Delete orphans (dry run first)
deleted, freed = registry.delete_orphaned_artifacts(dry_run=True)
print(f"Would delete {len(deleted)} files, freeing {freed / 1024:.1f} KB")
# Actually delete
deleted, freed = registry.delete_orphaned_artifacts(dry_run=False)
# Get storage statistics
stats = registry.get_stats()
print(f"Total artifacts: {stats['total_artifacts']}")
print(f"Unique files: {stats['unique_files']}")
print(f"Deduplication ratio: {stats['deduplication_ratio']:.1%}")
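The dry-run pattern above can be illustrated with a small file-system sketch. It treats any file that no manifest references as an orphan. The helpers `find_orphans` and `delete_orphans`, and the idea of passing referenced file names as a set, are assumptions for illustration and do not describe the registry's internals.

```python
import tempfile
from pathlib import Path


def find_orphans(binaries_dir: Path, referenced: set) -> list:
    """Return files in binaries_dir whose names are not referenced."""
    return sorted(
        p for p in binaries_dir.iterdir()
        if p.is_file() and p.name not in referenced
    )


def delete_orphans(binaries_dir: Path, referenced: set, dry_run: bool = True):
    """Return (orphans, bytes_freed); only unlink files when dry_run is False."""
    orphans = find_orphans(binaries_dir, referenced)
    freed = sum(p.stat().st_size for p in orphans)
    if not dry_run:
        for p in orphans:
            p.unlink()
    return orphans, freed


with tempfile.TemporaryDirectory() as tmp:
    binaries = Path(tmp)
    (binaries / "model_a.bin").write_bytes(b"x" * 10)
    (binaries / "stale.bin").write_bytes(b"y" * 2048)
    referenced = {"model_a.bin"}

    would_delete, freed_bytes = delete_orphans(binaries, referenced, dry_run=True)
    orphan_names = [p.name for p in would_delete]
    still_there = (binaries / "stale.bin").exists()  # dry run removes nothing

    delete_orphans(binaries, referenced, dry_run=False)
    remaining = sorted(p.name for p in binaries.iterdir())
    print(orphan_names, freed_bytes, remaining)
```

Running the dry run first gives a chance to review what would be removed before any file is actually deleted.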
Best Practices
Use chain-based registration (register_with_chain) for new code to ensure deterministic artifact IDs.
Let the registry handle deduplication - identical artifacts (same content hash) automatically share the same file.
Use the loader’s cache - the LRU cache significantly improves performance when loading the same artifacts multiple times.
Track dependencies for meta-models - always register source models before the meta-model to enable proper dependency resolution.
Clean up periodically - use find_orphaned_artifacts() and delete_orphaned_artifacts() to reclaim disk space.
See Also
Storage API Reference - Storage API reference
Workspace Architecture - Workspace architecture
Architecture Overview - Pipeline architecture overview
Writing a Pipeline in nirs4all - Pipeline configuration syntax