Artifacts & Storage

This guide covers the artifact storage system and workspace structure in nirs4all.

Overview

The artifacts system (V3) provides:

  • Deterministic artifact IDs based on operator chains for complete execution path tracking

  • Content-addressed storage with deduplication across pipelines

  • Dependency tracking for stacking and transfer learning

  • LRU caching for efficient artifact loading
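
To make the deduplication idea concrete, here is a minimal sketch of content-addressed storage using only the Python standard library. It is not the nirs4all implementation: the `store_blob` helper, the SHA-256 digest, the 16-character file name, and the `.bin` suffix are all assumptions chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(store_dir: Path, payload: bytes) -> Path:
    """Write payload under a name derived from its SHA-256 digest.

    Hypothetical helper: identical payloads map to the same path,
    so the second write is skipped and the file is shared.
    """
    digest = hashlib.sha256(payload).hexdigest()
    target = store_dir / f"{digest[:16]}.bin"
    if not target.exists():
        target.write_bytes(payload)
    return target


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    first = store_blob(store, b"fitted-scaler-bytes")
    second = store_blob(store, b"fitted-scaler-bytes")  # same content, e.g. from another pipeline
    third = store_blob(store, b"other-model-bytes")
    same_file = first == second               # both calls resolve to one file
    n_files = len(list(store.iterdir()))      # number of files actually written
```

Because the file name depends only on the bytes, two pipelines that produce the same fitted object end up pointing at a single file on disk.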

Workspace Structure

workspace/
├── runs/                          # Experimental runs
│   └── {dataset}/                 # Dataset-centric organization
│       ├── _binaries/             # Shared artifacts (deduplicated)
│       ├── 0001_hash/             # Pipeline 1
│       └── 0002_name_hash/        # Pipeline 2 (with custom name)
│
├── binaries/                      # Centralized artifact storage (V3)
│   └── {dataset}/                 # Per-dataset binaries
│
├── exports/                       # Best results (fast access)
│   ├── {dataset}/                 # Full exports
│   └── best_predictions/          # Predictions only
│
├── library/                       # Reusable pipelines
│   ├── templates/                 # Config only
│   └── trained/                   # With binaries
│       ├── filtered/              # Config + metrics
│       ├── pipeline/              # Full pipeline
│       └── fullrun/               # Everything + data
│
└── catalog/                       # Prediction index
    ├── predictions_meta.parquet   # Metadata (fast queries)
    └── predictions_data.parquet   # Arrays (on-demand)

Artifact Types

Type          Description
MODEL         Trained ML models (sklearn, TensorFlow, PyTorch, JAX)
TRANSFORMER   Fitted preprocessors (scalers, feature extractors)
SPLITTER      Train/test split configuration
ENCODER       Label encoders, y-scalers
META_MODEL    Stacking meta-models with source dependencies

V3 Artifact ID Format

Format: {pipeline_id}${chain_hash}:{fold_id}

Examples:

  • 0001_pls$a1b2c3d4e5f6:all - Shared artifact (fold_id is "all", so it is not tied to a single fold)

  • 0001_pls$7f8e9d0c1b2a:0 - Fold 0 artifact

  • 0001_pls$3c4d5e6f7a8b:1 - Fold 1 artifact

The chain hash is computed from the operator chain path (e.g., s1.MinMaxScaler>s3.PLS[br=0]), ensuring deterministic identification across branching, multi-source, and stacking scenarios.
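
The ID format can be sketched as follows. This is an illustration under stated assumptions, not the registry's actual algorithm: the helper names (`chain_hash`, `build_artifact_id`, `parse_artifact_id`), the use of SHA-256, the 12-character truncation, and the mapping of a missing fold to `all` are all hypothetical.

```python
import hashlib


def chain_hash(chain: str, length: int = 12) -> str:
    """Hypothetical hash: first `length` hex chars of SHA-256 over the chain path."""
    return hashlib.sha256(chain.encode("utf-8")).hexdigest()[:length]


def build_artifact_id(pipeline_id: str, chain: str, fold_id=None) -> str:
    """Assemble {pipeline_id}${chain_hash}:{fold_id}.

    Assumption: a fold_id of None is written as 'all'.
    """
    fold = "all" if fold_id is None else str(fold_id)
    return f"{pipeline_id}${chain_hash(chain)}:{fold}"


def parse_artifact_id(artifact_id: str):
    """Split an ID back into (pipeline_id, chain_hash, fold_id) strings."""
    pipeline_id, rest = artifact_id.split("$", 1)
    digest, fold = rest.rsplit(":", 1)
    return pipeline_id, digest, fold


chain = "s1.MinMaxScaler>s3.PLS[br=0]"
id_a = build_artifact_id("0001_pls", chain, fold_id=0)
id_b = build_artifact_id("0001_pls", chain, fold_id=0)  # same inputs, same ID
id_shared = build_artifact_id("0001_pls", chain)         # no fold given
```

The point of the sketch is determinism: the ID depends only on the pipeline, the chain path, and the fold, so re-running the same pipeline regenerates the same identifiers.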

Using the ArtifactRegistry

The ArtifactRegistry is the central class for artifact management:

from pathlib import Path
from nirs4all.pipeline.storage.artifacts import ArtifactRegistry, ArtifactType

# Initialize registry
registry = ArtifactRegistry(
    workspace=Path("./workspace"),
    dataset="wheat_sample1",
    pipeline_id="0001_pls_abc123"
)

# Register an artifact with V3 chain-based ID
record = registry.register_with_chain(
    obj=trained_model,
    chain="s1.MinMaxScaler>s3.PLSRegression",
    artifact_type=ArtifactType.MODEL,
    step_index=3,
    fold_id=0,
    params={"n_components": 10}
)

print(f"Saved: {record.artifact_id}")
# Output: 0001_pls_abc123$a1b2c3d4e5f6:0

Key Methods

# Generate ID from chain
artifact_id = registry.generate_id(chain, fold_id=0)

# Resolve ID to record
record = registry.resolve(artifact_id)

# Get by chain path (V3)
record = registry.get_by_chain("s1.MinMaxScaler>s3.PLS", fold_id=0)

# Get artifacts for a step
records = registry.get_artifacts_for_step(
    pipeline_id="0001",
    step_index=3,
    branch_path=[0],
    fold_id=None
)

# Get fold models for CV averaging
fold_records = registry.get_fold_models(
    pipeline_id="0001",
    step_index=3
)

Using the ArtifactLoader

The ArtifactLoader provides efficient loading with caching:

from nirs4all.pipeline.storage.artifacts import ArtifactLoader

# Create from manifest
loader = ArtifactLoader.from_manifest(manifest, results_dir)

# Load by ID (uses LRU cache)
model = loader.load_by_id("0001_pls$abc123:0")

# Load by chain path
model = loader.load_by_chain("s1.MinMaxScaler>s3.PLS", fold_id=0)

# Load all artifacts for a step
artifacts = loader.load_for_step(
    step_index=3,
    branch_path=[0],
    fold_id=0
)
for artifact_id, obj in artifacts:
    print(f"Loaded: {artifact_id}")

# Load fold models for ensemble
fold_models = loader.load_fold_models(step_index=3)
for fold_id, model in fold_models:
    print(f"Fold {fold_id}: {model}")

Meta-Model Loading (Stacking)

# Load meta-model with all source models
meta_model, sources, feature_cols = loader.load_meta_model_with_sources(
    artifact_id="0001_pls$abc123:all",
    validate_branch=True
)

# sources is [(source_id, source_model), ...]
# feature_cols is ["PLSRegression_pred", "RandomForest_pred", ...]
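
One way the three returned values might be combined at prediction time is sketched below. The stand-in classes `ConstantModel` and `MeanMetaModel` are hypothetical, and the sketch assumes that the order of `sources` matches the order of `feature_cols`; check this against the actual loader before relying on it.

```python
class ConstantModel:
    """Stand-in for a fitted source model: predicts sum(row) + offset."""

    def __init__(self, offset):
        self.offset = offset

    def predict(self, X):
        return [sum(row) + self.offset for row in X]


class MeanMetaModel:
    """Stand-in meta-model: averages the source predictions in each row."""

    def predict(self, Z):
        return [sum(row) / len(row) for row in Z]


# Shapes mirror the return values described above (contents are made up).
sources = [("src_pls", ConstantModel(0.0)), ("src_rf", ConstantModel(2.0))]
feature_cols = ["PLSRegression_pred", "RandomForest_pred"]
meta_model = MeanMetaModel()

X = [[1.0, 2.0], [3.0, 4.0]]

# One prediction column per source model, in feature_cols order (assumption).
columns = [model.predict(X) for _, model in sources]

# Transpose the columns into rows: one meta-feature row per sample.
meta_features = [list(row) for row in zip(*columns)]

y_pred = meta_model.predict(meta_features)
```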

Cache Management

# Get cache statistics
info = loader.get_cache_info()
print(f"Cache hit rate: {info['hit_rate']:.2%}")

# Preload artifacts
loader.preload_artifacts(artifact_ids=["0001_pls$7f8e9d0c1b2a:0", "0001_pls$3c4d5e6f7a8b:1"])

# Clear cache
loader.clear_cache()

# Resize cache
loader.set_cache_size(200)
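
For readers unfamiliar with LRU caching, the behavior behind these calls can be sketched with `collections.OrderedDict`. The `LRUCache` class and `fake_load` function are hypothetical and are not part of ArtifactLoader; they only illustrate eviction order and how a hit rate could be derived.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that also counts hits and misses."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = load(key)
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)   # evict the least recently used entry
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


loads = []


def fake_load(artifact_id):
    loads.append(artifact_id)                 # stands in for reading from disk
    return f"object-for-{artifact_id}"


cache = LRUCache(max_size=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key, fake_load)
```

With a capacity of 2, the second request for "a" is served from memory, while "b" has to be loaded again because it was evicted when "c" arrived. A larger cache trades memory for fewer reloads.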

Library Management

Save Pipeline Templates

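from pathlib import Path
from nirs4all.workspace import LibraryManager

workspace = Path("./workspace")
runs_dir = workspace / "runs"   # per-dataset run directories (see Workspace Structure)

library = LibraryManager(workspace / "library")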

# Save config-only template
library.save_template(
    pipeline_config=pipeline_dict,
    name="baseline_pls",
    description="PLS baseline with SNV preprocessing"
)

# Save full trained pipeline
library.save_pipeline_full(
    run_dir=runs_dir / "wheat_sample1",
    pipeline_dir=runs_dir / "wheat_sample1" / "0042_x9y8z7",
    name="wheat_quality_v1"
)

Load and Reuse

# List templates
templates = library.list_templates()
for t in templates:
    print(f"{t['name']}: {t['description']}")

# Load template
config = library.load_template("baseline_pls")

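# Use in pipeline (new_dataset is the dataset to run the loaded config on)
from nirs4all.pipeline import PipelineRunner

runner = PipelineRunner(workspace="./workspace")
predictions = runner.run(config, new_dataset)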

Cleanup Utilities

# Find orphaned artifacts
orphans = registry.find_orphaned_artifacts()
print(f"Found {len(orphans)} orphaned files")

# Delete orphans (dry run first)
deleted, freed = registry.delete_orphaned_artifacts(dry_run=True)
print(f"Would delete {len(deleted)} files, freeing {freed / 1024:.1f} KB")

# Actually delete
deleted, freed = registry.delete_orphaned_artifacts(dry_run=False)

# Get storage statistics
stats = registry.get_stats()
print(f"Total artifacts: {stats['total_artifacts']}")
print(f"Unique files: {stats['unique_files']}")
print(f"Deduplication ratio: {stats['deduplication_ratio']:.1%}")
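
The dry-run pattern above can be illustrated with a small standalone sketch. It assumes that an orphan is a file in the binaries directory that no manifest references; the `find_orphans` and `delete_orphans` helpers and the file names are hypothetical and do not reproduce the registry's internals.

```python
import tempfile
from pathlib import Path


def find_orphans(binaries_dir: Path, referenced: set) -> list:
    """Return files in binaries_dir whose names are not in the referenced set."""
    return sorted(p for p in binaries_dir.iterdir()
                  if p.is_file() and p.name not in referenced)


def delete_orphans(binaries_dir: Path, referenced: set, dry_run: bool = True):
    """Report (and optionally remove) orphans; returns (paths, bytes_freed)."""
    orphans = find_orphans(binaries_dir, referenced)
    freed = sum(p.stat().st_size for p in orphans)  # measured before any deletion
    if not dry_run:
        for p in orphans:
            p.unlink()
    return orphans, freed


with tempfile.TemporaryDirectory() as tmp:
    binaries = Path(tmp)
    (binaries / "aaa.bin").write_bytes(b"12345")      # referenced by a manifest
    (binaries / "bbb.bin").write_bytes(b"1234567")    # no longer referenced
    referenced_names = {"aaa.bin"}

    # Dry run: nothing is removed, but the report shows what would be.
    planned, planned_bytes = delete_orphans(binaries, referenced_names, dry_run=True)
    still_there = (binaries / "bbb.bin").exists()

    # Real run: the orphan is unlinked.
    removed, removed_bytes = delete_orphans(binaries, referenced_names, dry_run=False)
    remaining = sorted(p.name for p in binaries.iterdir())
```

Running the dry run first gives a chance to compare the reported list against expectations before anything is removed from disk.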

Best Practices

  1. Use chain-based registration (register_with_chain) for new code to ensure deterministic artifact IDs.

  2. Let the registry handle deduplication - identical artifacts (same content hash) automatically share the same file.

  3. Use the loader’s cache - the LRU cache significantly improves performance when loading the same artifacts multiple times.

  4. Track dependencies for meta-models - always register source models before the meta-model to enable proper dependency resolution.

  5. Clean up periodically - use find_orphaned_artifacts() and delete_orphaned_artifacts() to reclaim disk space.
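
For practice 4, a registration order that respects dependencies can be derived with the standard library's `graphlib` module (Python 3.9+). The artifact names and the dependency map below are hypothetical; the sketch only shows how to order registrations so that every source comes before the meta-model that depends on it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: artifact -> the artifacts it depends on.
dependencies = {
    "meta_ridge": {"pls_fold_models", "rf_fold_models"},
    "pls_fold_models": {"minmax_scaler"},
    "rf_fold_models": {"minmax_scaler"},
    "minmax_scaler": set(),
}

# static_order() yields each node only after all of its dependencies.
registration_order = list(TopologicalSorter(dependencies).static_order())

for name in registration_order:
    print(f"register {name}")
```

`TopologicalSorter` raises `graphlib.CycleError` if the map contains a cycle, which is a cheap way to catch a meta-model that accidentally lists itself as a source.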

See Also