nirs4all.pipeline.storage.artifacts.types module

Artifact type definitions for the V3 artifacts system.

This module defines the core data structures for artifact management:

- ArtifactType: Enum for artifact classification
- ArtifactRecord: Complete artifact metadata for manifest storage

The V3 artifacts system uses operator chains for complete execution path tracking, enabling deterministic artifact IDs that work correctly with branching, multi-source, stacking, and cross-validation.

Key V3 improvements:

- OperatorChain tracking for full execution path
- Source index tracking for multi-source pipelines
- Chain hash-based artifact IDs for deterministic identification
- Unified handling of all edge cases (branching, stacking, bundles)

class nirs4all.pipeline.storage.artifacts.types.ArtifactRecord(artifact_id: str, content_hash: str, path: str, chain_path: str = '', source_index: int | None = None, pipeline_id: str = '', branch_path: List[int] = <factory>, step_index: int = 0, substep_index: int | None = None, fold_id: int | None = None, artifact_type: ArtifactType = ArtifactType.MODEL, class_name: str = '', custom_name: str = '', depends_on: List[str] = <factory>, format: str = 'joblib', format_version: str = '', nirs4all_version: str = '', size_bytes: int = 0, created_at: str = <factory>, params: Dict[str, Any] = <factory>, meta_config: MetaModelConfig | None = None, version: int = 3)

Bases: object

Complete artifact metadata for manifest storage (V3).

This record contains all metadata needed to:

- Uniquely identify an artifact via operator chain
- Load the artifact from centralized storage
- Resolve dependencies for stacking/transfer
- Track serialization format and library versions

V3 Format:

artifact_id: “{pipeline_id}${chain_hash}:{fold_id}”
chain_path: Full operator chain path string

artifact_id

Unique, deterministic ID based on chain hash. Format: “{pipeline_id}${chain_hash}:{fold_id}”

Type:

str

content_hash

SHA256 hash of binary content (for deduplication)

Type:

str

path

Relative path in binaries/<dataset>/ directory

Type:

str

# V3: Chain tracking
chain_path

Serialized operator chain path

Type:

str

source_index

Multi-source index (None for single source)

Type:

int | None

# Context
pipeline_id

Parent pipeline ID (e.g., “0001_pls_abc123”)

Type:

str

branch_path

Branch hierarchy as list of indices (empty = pre-branch)

Type:

List[int]

step_index

Logical step index within execution

Type:

int

substep_index

Index within substep (for [model1, model2])

Type:

int | None

fold_id

CV fold identifier (None = shared across folds)

Type:

int | None

# Classification
artifact_type

Type classification (model, transformer, etc.)

Type:

nirs4all.pipeline.storage.artifacts.types.ArtifactType

class_name

Python class name (e.g., “PLSRegression”)

Type:

str

custom_name

User-defined name for the artifact

Type:

str

# Dependencies
depends_on

List of artifact_ids this artifact depends on

Type:

List[str]

# Serialization
format

Serialization format (joblib, pickle, keras, etc.)

Type:

str

format_version

Library version string

Type:

str

nirs4all_version

nirs4all version that created this artifact

Type:

str

size_bytes

Size of serialized binary in bytes

Type:

int

created_at

ISO timestamp of creation

Type:

str

# Metadata
params

Hyperparameters for models

Type:

Dict[str, Any]

meta_config

Configuration for meta-models

Type:

nirs4all.pipeline.storage.artifacts.types.MetaModelConfig | None

version

Schema version (3 for V3)

Type:

int

artifact_id: str
artifact_type: ArtifactType = 'model'
branch_path: List[int]
property chain_hash: str

Get chain hash from artifact ID (V3 format).

Returns:

Chain hash portion of the artifact ID, or empty if not V3 format
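
The documented ID format, “{pipeline_id}${chain_hash}:{fold_id}”, suggests how the chain hash can be recovered. The following is a minimal standalone sketch, not the library's implementation: the function name `extract_chain_hash` and the sample ID are hypothetical, and it assumes the fold suffix follows the last “:”.

```python
def extract_chain_hash(artifact_id: str) -> str:
    """Return the chain hash from '{pipeline_id}${chain_hash}:{fold_id}', or ''."""
    # Assumption: a V3 ID always contains '$' between pipeline ID and chain hash.
    if "$" not in artifact_id:
        return ""
    _, _, remainder = artifact_id.partition("$")
    # Assumption: the fold suffix, when present, follows the last ':'.
    chain_hash, sep, _fold = remainder.rpartition(":")
    return chain_hash if sep else remainder


# Made-up ID used only for illustration.
example_id = "0001_pls_abc123$a1b2c3d4e5f6:0"
example_hash = extract_chain_hash(example_id)
```

An ID without “$” is treated here as non-V3 and yields an empty string, matching the documented fallback.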

chain_path: str = ''
class_name: str = ''
content_hash: str
created_at: str
custom_name: str = ''
depends_on: List[str]
fold_id: int | None = None
format: str = 'joblib'
format_version: str = ''
classmethod from_dict(data: Dict[str, Any]) → ArtifactRecord

Create ArtifactRecord from dictionary.

Parameters:

data – Dictionary from YAML manifest

Returns:

ArtifactRecord instance

get_branch_path_str() → str

Get branch path as string.

Returns:

Colon-separated branch indices or empty string

get_fold_str() → str

Get fold ID as string.

Returns:

Fold ID as string or “all” for shared artifacts
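
The two string helpers above can be approximated as follows. This is a sketch based only on the documented return values; the free functions `branch_path_str` and `fold_str` are hypothetical stand-ins for the methods.

```python
from typing import List, Optional


def branch_path_str(branch_path: List[int]) -> str:
    """Join branch indices with ':'; an empty path gives an empty string."""
    return ":".join(str(index) for index in branch_path)


def fold_str(fold_id: Optional[int]) -> str:
    """Return the fold ID as a string, or 'all' for artifacts shared across folds."""
    # Compare against None explicitly so that fold 0 is not mistaken for "shared".
    return "all" if fold_id is None else str(fold_id)
```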

property is_branch_specific: bool

Check if artifact is branch-specific.

Returns:

True if artifact belongs to a specific branch path

property is_fold_specific: bool

Check if artifact is fold-specific.

Returns:

True if artifact belongs to a specific CV fold

property is_meta_model: bool

Check if artifact is a meta-model.

Returns:

True if artifact is a stacking meta-model

property is_source_specific: bool

Check if artifact is source-specific.

Returns:

True if artifact belongs to a specific source in multi-source
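
The four `is_*` properties can be read as simple checks on the fields documented above. The sketch below assumes that reading (empty `branch_path` = pre-branch, `None` fold = shared, `None` source = single source); `RecordSketch` is a hypothetical reduced class, and the real properties may apply additional rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordSketch:
    """Hypothetical subset of ArtifactRecord fields, for illustration only."""
    branch_path: List[int] = field(default_factory=list)
    fold_id: Optional[int] = None
    source_index: Optional[int] = None
    artifact_type: str = "model"

    @property
    def is_branch_specific(self) -> bool:
        # An empty branch path means the artifact was created before branching.
        return len(self.branch_path) > 0

    @property
    def is_fold_specific(self) -> bool:
        # None means the artifact is shared across folds; 0 is a valid fold.
        return self.fold_id is not None

    @property
    def is_source_specific(self) -> bool:
        # None means a single-source pipeline.
        return self.source_index is not None

    @property
    def is_meta_model(self) -> bool:
        return self.artifact_type == "meta_model"
```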

matches_context(step_index: int | None = None, branch_path: List[int] | None = None, source_index: int | None = None, fold_id: int | None = None) → bool

Check if artifact matches a given context.

Parameters:
  • step_index – Step to match (None = any)

  • branch_path – Branch path to match (None = any)

  • source_index – Source index to match (None = any)

  • fold_id – Fold ID to match (None = any)

Returns:

True if artifact matches all specified filters
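
The “None = any” filter semantics can be sketched as below. This standalone version works on a plain dictionary and assumes exact equality for every filter, including `branch_path`; the actual method may compare branch paths differently. The name `matches_context_sketch` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def matches_context_sketch(
    record: Dict[str, Any],
    step_index: Optional[int] = None,
    branch_path: Optional[List[int]] = None,
    source_index: Optional[int] = None,
    fold_id: Optional[int] = None,
) -> bool:
    """Return True if the record matches every filter that is not None."""
    filters = {
        "step_index": step_index,
        "branch_path": branch_path,
        "source_index": source_index,
        "fold_id": fold_id,
    }
    # Filters left at None are skipped, so they match any value.
    return all(
        record.get(name) == wanted
        for name, wanted in filters.items()
        if wanted is not None
    )
```

Calling the function with no filters returns True for any record, since every check is skipped.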

meta_config: MetaModelConfig | None = None
nirs4all_version: str = ''
params: Dict[str, Any]
path: str
pipeline_id: str = ''
property short_hash: str

Get short version of content hash for filenames.

Returns:

First 12 characters of hash (after the “sha256:” prefix, if present)
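
As an illustration of how a content hash and its short form might relate, the sketch below hashes bytes with the standard `hashlib` module and trims the result. The helper names and the choice to store the hash with a “sha256:” prefix are assumptions made for this example.

```python
import hashlib


def content_hash_sketch(data: bytes) -> str:
    """Hash binary content with SHA256 (prefixed form is an assumption)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def short_hash_sketch(full_hash: str, length: int = 12) -> str:
    """Return the first `length` hex characters, skipping a 'sha256:' prefix."""
    prefix = "sha256:"
    digest = full_hash[len(prefix):] if full_hash.startswith(prefix) else full_hash
    return digest[:length]
```

Identical binaries produce identical hashes, which is what makes the content hash usable for deduplication.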

size_bytes: int = 0
source_index: int | None = None
step_index: int = 0
substep_index: int | None = None
to_dict() → Dict[str, Any]

Convert to dictionary for YAML serialization.

Handles enum conversion and nested dataclass serialization.

Returns:

Dictionary suitable for YAML safe_dump
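
The enum conversion and nested dataclass handling mentioned above can be sketched with the standard `dataclasses` module. The classes `KindSketch`, `MetaSketch`, and `RecordSketch` are hypothetical reduced stand-ins, not the library's types, and the real methods cover many more fields.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class KindSketch(str, Enum):
    MODEL = "model"
    META_MODEL = "meta_model"


@dataclass
class MetaSketch:
    feature_columns: List[str] = field(default_factory=list)


@dataclass
class RecordSketch:
    artifact_id: str
    artifact_type: KindSketch = KindSketch.MODEL
    meta_config: Optional[MetaSketch] = None

    def to_dict(self) -> Dict[str, Any]:
        # asdict() recursively turns nested dataclasses into plain dicts.
        data = asdict(self)
        # Store the enum as its plain string value so a safe YAML dump accepts it.
        data["artifact_type"] = self.artifact_type.value
        return data

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "RecordSketch":
        meta = data.get("meta_config")
        return cls(
            artifact_id=data["artifact_id"],
            artifact_type=KindSketch(data.get("artifact_type", "model")),
            meta_config=MetaSketch(**meta) if meta else None,
        )
```

A record converted with `to_dict()` and rebuilt with `from_dict()` compares equal to the original in this sketch.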

version: int = 3
class nirs4all.pipeline.storage.artifacts.types.ArtifactType(value)

Bases: str, Enum

Classification of artifact types.

Each type has specific handling:

- model: Trained ML models (sklearn, tensorflow, pytorch, etc.)
- transformer: Fitted preprocessors (scalers, feature extractors)
- splitter: Train/test split configuration (for reproducibility)
- encoder: Label encoders, y-scalers
- meta_model: Stacking meta-models with source model dependencies

ENCODER = 'encoder'
META_MODEL = 'meta_model'
MODEL = 'model'
SPLITTER = 'splitter'
TRANSFORMER = 'transformer'
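
Because the enum derives from both str and Enum, its members compare equal to their string values and can be looked up from strings read out of a manifest. The sketch below reproduces the documented members in a local class, `ArtifactTypeSketch`, so that it runs without the library installed.

```python
from enum import Enum


class ArtifactTypeSketch(str, Enum):
    """Local copy of the documented members, for illustration only."""
    ENCODER = "encoder"
    META_MODEL = "meta_model"
    MODEL = "model"
    SPLITTER = "splitter"
    TRANSFORMER = "transformer"


# Lookup by value, as would be needed when reading a manifest entry.
loaded = ArtifactTypeSketch("meta_model")

# Members compare equal to plain strings thanks to the str base class.
is_model = ArtifactTypeSketch.MODEL == "model"

# All valid string values, in definition order.
valid_values = [member.value for member in ArtifactTypeSketch]
```

Looking up an unknown value such as `ArtifactTypeSketch("unknown")` raises ValueError.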
class nirs4all.pipeline.storage.artifacts.types.MetaModelConfig(source_models: List[Dict[str, Any]] = <factory>, feature_columns: List[str] = <factory>)

Bases: object

Configuration for meta-model source tracking.

Stores the ordered source models that feed into a stacking meta-model, along with their feature column mapping.

source_models

Ordered list of source model artifact IDs with feature indices

Type:

List[Dict[str, Any]]

feature_columns

Feature column names in the meta-model input order

Type:

List[str]

feature_columns: List[str]
classmethod from_dict(data: Dict[str, Any]) → MetaModelConfig

Create from dictionary.

source_models: List[Dict[str, Any]]
to_dict() → Dict[str, Any]

Convert to dictionary for YAML serialization.
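
A dictionary round trip for this configuration might look like the sketch below. `MetaModelConfigSketch` is a hypothetical stand-in, and the keys inside each `source_models` entry (“artifact_id”, “feature_index”) are assumptions, since the entry layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MetaModelConfigSketch:
    source_models: List[Dict[str, Any]] = field(default_factory=list)
    feature_columns: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # Copy the containers so later edits to the dict do not touch the config.
        return {
            "source_models": [dict(entry) for entry in self.source_models],
            "feature_columns": list(self.feature_columns),
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MetaModelConfigSketch":
        # Missing keys fall back to empty lists, mirroring the factory defaults.
        return cls(
            source_models=[dict(entry) for entry in data.get("source_models", [])],
            feature_columns=list(data.get("feature_columns", [])),
        )


# Made-up IDs and column names; list order is preserved through the round trip.
config = MetaModelConfigSketch(
    source_models=[
        {"artifact_id": "0001_pls_abc123$aaa111:0", "feature_index": 0},
        {"artifact_id": "0001_pls_abc123$bbb222:0", "feature_index": 1},
    ],
    feature_columns=["pls_pred", "rf_pred"],
)
restored = MetaModelConfigSketch.from_dict(config.to_dict())
```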