# Outputs vs Artifacts: Serialization Architecture

## Overview

The nirs4all serialization system distinguishes between two types of saved files:

1. **Artifacts** - Internal binary objects (models, transformers, scalers) stored in content-addressed storage
2. **Outputs** - Human-readable files (charts, reports, CSV) stored in organized directories

> **Note:** This document gives an overview of the artifacts architecture, including
> branching, stacking, and pipeline artifact management.

## Architecture: "Return, Don't Save"

To ensure clean separation of concerns and testability, controllers **do not save files directly**. Instead, they return a `StepOutput` object containing the data to be saved. The `PipelineExecutor` handles the actual file I/O.

### Artifacts (Internal Binary Storage)

**Purpose:** Deduplicated storage of trained models and transformers
**Location:** `workspace/artifacts//.`
**Method:** Return `StepOutput(artifacts={...})`
**Respects save_artifacts flag:** Yes

**What gets stored as artifacts:**
- Trained models (sklearn, keras, pytorch, catboost, lightgbm)
- Fitted transformers (scalers, preprocessors)
- Fitted splitters (cross-validation objects)
- Resampler objects

**Benefits:**
- Automatic deduplication (identical objects stored once)
- Content-addressed (SHA256) for integrity
- Referenced in DuckDB `artifacts` table with `ref_count` tracking
- Space-efficient (e.g., 25% reduction in tests)

**Example:**
```python
# In model controller
from nirs4all.pipeline.execution.result import StepOutput

return context, StepOutput(
    artifacts={"model": trained_model}
)
# Executor saves to: workspace/artifacts/ab/abc123...joblib
```

### Outputs (Human-Readable Files)

**Purpose:** User-accessible files for viewing and sharing
**Location:** `workspace/exports/_/`
**Method:** Return `StepOutput(outputs=[...])`
**Respects save_charts flag:** Yes

**What gets stored as outputs:**
- Charts (PNG images)
- Reports (CSV, TXT)
- Summaries
- Exported predictions

**Benefits:**
- Easy to find and open
- Organized by dataset and pipeline
- Readable names (e.g., `2D_Chart.png`)
- Can be copied, shared, or included in papers

**Example:**
```python
# In chart controller
from nirs4all.pipeline.execution.result import StepOutput

# Generate chart data
img_png_binary = ...

return context, StepOutput(
    outputs=[(img_png_binary, "2D_Chart", "png")]
)
# Executor saves to: workspace/exports/regression_Q1_47be36/2D_Chart.png
```

## Directory Structure

```
workspace/
├── store.duckdb                    # All structured data (runs, pipelines, chains,
│                                   # predictions, artifacts registry, logs)
├── artifacts/                      # Binary artifacts (models, transformers)
│   ├── ab/
│   │   └── abc123...joblib         # Content-addressed storage
│   └── cd/
│       └── cdef456...pkl
│
├── exports/                        # User-triggered exports & human-readable outputs
│   ├── regression_Q1_47be36/
│   │   ├── 2D_Chart.png            # Charts
│   │   ├── Y_distribution_train_test.png
│   │   └── fold_visualization_3folds_train.png
│   ├── model.n4a                   # Bundle exports
│   └── results.parquet             # Prediction exports
│
└── library/                        # Reusable pipeline templates
    └── templates/
        └── baseline_pls.json
```

## save_artifacts / save_charts Flag Behavior

The `save_artifacts` and `save_charts` parameters in `PipelineRunner` control artifacts and outputs separately:

```python
# Save everything (default)
runner = PipelineRunner(save_artifacts=True, save_charts=True)

# Save only artifacts (models, transformers)
runner = PipelineRunner(save_artifacts=True, save_charts=False)

# Save only charts (visualizations)
runner = PipelineRunner(save_artifacts=False, save_charts=True)

# Dry run - no files saved
runner = PipelineRunner(save_artifacts=False, save_charts=False)
```

When `save_artifacts=False`:
- **Executor** skips saving artifacts to DuckDB and the `artifacts/` directory
- Pipeline can still run and generate predictions
- Models won't be reloadable for predict mode or chain replay

When `save_charts=False`:
- **Executor** skips saving outputs (charts, reports)
- Pipeline runs faster
- No chart files are created

## Code Examples

### Saving a Chart (Output)

```python
import io

import matplotlib.pyplot as plt

def execute(self, step_info, dataset, context, runtime_context, ...):
    # Generate chart
    fig, ax = plt.subplots()
    ax.plot(data)

    # Save to an in-memory buffer instead of writing to disk
    img_buffer = io.BytesIO()
    fig.savefig(img_buffer, format='png', dpi=300)
    img_png_binary = img_buffer.getvalue()
    plt.close(fig)  # Release the figure once its bytes are captured

    # Return StepOutput
    return context, StepOutput(
        outputs=[(img_png_binary, "2D_Chart", "png")]
    )
```

### Saving a Model (Artifact)

```python
def execute(self, step_info, dataset, context, runtime_context, ...):
    # Train model
    model.fit(X, y)

    # Return StepOutput
    return context, StepOutput(
        artifacts={"model": model}
    )
```

## Migration Notes

### Before (Old System)

Controllers called `saver.save_output()` or `saver.persist_artifact()` directly:
- ❌ Coupled to file system
- ❌ Hard to test without mocking I/O
- ❌ Inconsistent return types

### After (New System)

Controllers return `StepOutput`:
- ✅ Decoupled from I/O
- ✅ Easy to test (check returned object)
- ✅ Consistent return type (`StepOutput`)

## Finding Your Files

### Charts and Reports (Outputs)

```bash
# All outputs organized by pipeline in the exports directory
workspace/exports/
├── regression_Q1_47be36/
│   ├── 2D_Chart.png              # Your charts are here
│   ├── 3D_Chart.png
│   ├── Y_distribution_train_test.png
│   └── fold_visualization_3folds_train.png
```

### Models and Transformers (Artifacts)

```bash
# Artifact references are in store.duckdb (artifacts and chains tables)
workspace/store.duckdb

# Binary artifacts are in content-addressed storage
workspace/artifacts/ab/abc123...joblib  # Model file
workspace/artifacts/cd/cdef456...pkl    # Scaler file
```

## Best Practices

1. **For human viewing** (charts, reports) -- Return in `outputs` list
2. **For pipeline replay** (models, transformers) -- Return in `artifacts` dict
3. **Disable saving for tests** -- Set `save_artifacts=False, save_charts=False`
4. **Check exports directory** -- `workspace/exports/_/`

## Implementation Details

### StepOutput Class

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class StepOutput:
    """Standardized output from a controller execution."""

    # Internal binaries (models, transformers)
    artifacts: Dict[str, Any] = field(default_factory=dict)

    # User outputs (charts, reports)
    # List of tuples: (data_object, filename_hint, type_hint)
    outputs: List[Tuple[Any, str, str]] = field(default_factory=list)
```

### PipelineExecutor

The executor handles the actual saving:

```python
# In PipelineExecutor._execute_steps
for output_data, name, ext in step_result.outputs:
    self.saver.save_output(name=name, data=output_data, extension=ext)

for name, artifact in step_result.artifacts.items():
    self.saver.persist_artifact(step_number, name, artifact)
```

## FAQ

**Q: Why not store charts as artifacts?**
A: Charts are outputs meant for human viewing, not pipeline replay. They don't need deduplication or content-addressing.

**Q: Where did my charts go after the refactoring?**
A: Check `workspace/exports/_/`.

**Q: Can I disable chart saving?**
A: Yes. Set `save_charts=False` when creating `PipelineRunner`. Artifacts are controlled separately by `save_artifacts`, so models are still saved unless you also set `save_artifacts=False`.

**Q: What if two pipelines generate the same chart?**
A: Each pipeline gets its own outputs directory, so charts won't conflict.

**Q: Can I extract artifacts to readable locations?**
A: Models are binary and not human-readable. To load them, use `WorkspaceStore.replay_chain()` or export to a `.n4a` bundle.
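The testability claim behind "Return, Don't Save" can be illustrated with a small, self-contained sketch. It does not import nirs4all: `StepOutput` here is a stand-in that mirrors the two fields shown in Implementation Details, while `ToyController` and `plan_writes` are hypothetical names invented for this example. The helper only mimics how an executor might honour the two flags; the real `PipelineExecutor` may behave differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class StepOutput:
    """Stand-in mirroring the documented fields (not the real nirs4all class)."""
    artifacts: Dict[str, Any] = field(default_factory=dict)
    outputs: List[Tuple[Any, str, str]] = field(default_factory=list)


class ToyController:
    """Hypothetical controller: it returns data and never touches the disk."""

    def execute(self, context: dict) -> Tuple[dict, StepOutput]:
        model = {"coef": [0.5, 1.5]}   # placeholder for a fitted model
        report = b"rmse,0.12\n"        # placeholder for report bytes
        return context, StepOutput(
            artifacts={"model": model},
            outputs=[(report, "metrics", "csv")],
        )


def plan_writes(result: StepOutput, save_artifacts: bool = True,
                save_charts: bool = True) -> List[str]:
    """Hypothetical executor-side helper: list what would be saved per flag."""
    planned: List[str] = []
    if save_charts:
        planned += [f"output:{name}.{ext}" for _, name, ext in result.outputs]
    if save_artifacts:
        planned += [f"artifact:{name}" for name in result.artifacts]
    return planned


context, result = ToyController().execute({"step": 1})
print(plan_writes(result))                                          # both flags on
print(plan_writes(result, save_artifacts=False, save_charts=False)) # dry run
```

Because the controller only returns a value, a test can inspect `result.artifacts` and `result.outputs` directly, with no temporary directories or I/O mocks.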
## Summary

| Feature | Artifacts | Outputs |
|---------|-----------|---------|
| **Purpose** | Internal binary objects | Human-readable files |
| **Location** | `artifacts//` | `exports/_/` |
| **Registry** | DuckDB `artifacts` table | On-disk files |
| **Names** | Hash-based | Human-readable |
| **Deduplication** | Yes (content-addressed) | No |
| **Easy to find** | No (use store queries) | Yes |
| **Respects save flag** | `save_artifacts` | `save_charts` |
| **Examples** | Models, transformers | Charts, reports |

This system provides the best of both worlds: efficient storage for internal objects and easy access for human outputs.
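The deduplication row above follows from content addressing: the file name is derived from a hash of the serialized bytes, so identical objects map to the same path. The sketch below is a minimal illustration under stated assumptions, not nirs4all's implementation. `persist_artifact` is a hypothetical function, the two-character shard directory is inferred from the example path `artifacts/ab/abc123...joblib`, and `pickle` stands in for whatever serializer the library actually uses.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def persist_artifact(root: Path, obj: object, ext: str = "pkl") -> Path:
    """Hypothetical helper: write obj to root/<hash[:2]>/<hash>.<ext> once."""
    payload = pickle.dumps(obj)                    # assumed serializer
    digest = hashlib.sha256(payload).hexdigest()   # content address
    path = root / digest[:2] / f"{digest}.{ext}"
    if not path.exists():                          # identical content: skip write
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "artifacts"
    first = persist_artifact(root, {"mean": 0.0, "scale": 1.0})
    second = persist_artifact(root, {"mean": 0.0, "scale": 1.0})  # same content
    third = persist_artifact(root, {"mean": 0.5, "scale": 2.0})   # new content
    stored = sorted(p.name for p in root.rglob("*.pkl"))
    print(first == second, len(stored))
```

This only deduplicates when serialization is deterministic for equal objects, which holds for the simple dictionaries used here but is not guaranteed for every model type. The `ref_count` tracking in the DuckDB `artifacts` table is outside the scope of this sketch.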