# Writing a Pipeline in nirs4all

This guide explains all possible syntaxes for defining pipeline steps in nirs4all, from simple operators to complex model configurations with hyperparameter optimization.

---

## Table of Contents

1. [Overview](#overview)
2. [Pipeline vs Pipeline Generator](#pipeline-vs-pipeline-generator)
3. [Basic Step Syntaxes](#basic-step-syntaxes)
4. [Model Step Syntaxes](#model-step-syntaxes)
5. [Generator Syntaxes](#generator-syntaxes)
6. [File Formats](#file-formats)
7. [Syntax Reference Table](#syntax-reference-table)
8. [Serialization Rules](#serialization-rules)
9. [Best Practices](#best-practices)
10. [Examples](#examples)
11. [Troubleshooting](#troubleshooting)
12. [See Also](#see-also)

---

## Overview

A **pipeline** in nirs4all is a list of processing steps that transform data and train/apply models. Each step can be:

- A **transformer** (preprocessing, feature engineering)
- A **cross-validator** (data splitting strategy)
- A **model** (for training or prediction)
- A **visualization** (charts, reports)
- A **special operator** (resampling, augmentation)

Pipelines are defined in Python as lists, and can be saved to and loaded from JSON or YAML files.

### Philosophy

nirs4all accepts **multiple syntaxes** for maximum flexibility:

- **Python objects** (class, instance, function)
- **String references** (module paths, file paths, controller names)
- **Dictionaries** (explicit configuration with parameters)

### Core Principle: Hash-Based Uniqueness

**The fundamental rule**: Different syntaxes that produce the **same object** must serialize to the **same canonical form**, resulting in the **same hash**.
**Example - All these are equivalent**:

```python
# Syntax 1: Class
StandardScaler

# Syntax 2: Instance with defaults
StandardScaler()

# Syntax 3: Instance with explicit default value
StandardScaler(with_mean=True)  # True is the default

# Syntax 4: String path
"sklearn.preprocessing.StandardScaler"

# Syntax 5: Dict
{"class": "sklearn.preprocessing.StandardScaler"}
```

**All serialize to**:

```json
"sklearn.preprocessing._data.StandardScaler"
```

because they all create the **same object** with **default parameters**.

**Result**: Same hash → Recognized as identical pipelines → Proper deduplication.

**Counter-example - These are different**:

```python
# Different class
StandardScaler
# vs
MinMaxScaler

# Same class, different non-default params
MinMaxScaler(feature_range=(0, 1))  # default
# vs
MinMaxScaler(feature_range=(0, 2))  # non-default
```

These produce **different serializations** and **different hashes**.

**All syntaxes are normalized** during serialization to a canonical form for hash-based uniqueness:

```python
{
    "class": "module.path.ClassName",
    "params": {"param1": "value1"}
}
```

Or simply (when all parameters are default):

```json
"module.path.ClassName"
```

**Critical principle**: **Same object = Same serialization = Same hash**

This ensures:

- ✅ **Hash-based uniqueness**: Identical configurations produce identical hashes (regardless of input syntax)
- ✅ **Minimalism**: Only non-default parameters are included
- ✅ **Deduplication**: Pipeline variations are properly detected and merged
- ✅ **Reproducibility**: Exact pipeline state can be restored
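The normalization described above can be illustrated with a small, self-contained sketch. It is not nirs4all's implementation: `ToyScaler` and `canonical()` are hypothetical names introduced here, the dict form takes a class object instead of a string path so that no import resolution is needed, and string paths are left out entirely.

```python
import inspect


class ToyScaler:
    """Hypothetical stand-in for a transformer (not a nirs4all or scikit-learn class)."""

    def __init__(self, low=0, high=1):
        self.low = low
        self.high = high


def canonical(step):
    """Reduce a class, an instance, or a {"class", "params"} dict to one canonical form."""
    if isinstance(step, dict):
        cls = step["class"]
        params = dict(step.get("params", {}))
    elif inspect.isclass(step):
        cls, params = step, {}
    else:
        cls = type(step)
        names = [n for n in inspect.signature(cls.__init__).parameters if n != "self"]
        params = {n: getattr(step, n) for n in names}

    # Keep only the parameters whose value differs from the constructor default.
    defaults = {
        n: p.default
        for n, p in inspect.signature(cls.__init__).parameters.items()
        if n != "self"
    }
    changed = {n: v for n, v in params.items() if n not in defaults or v != defaults[n]}

    path = f"{cls.__module__}.{cls.__qualname__}"
    return {"class": path, "params": changed} if changed else path


# Four input syntaxes that describe the same default object.
forms = [ToyScaler, ToyScaler(), ToyScaler(low=0, high=1), {"class": ToyScaler}]
canonical_forms = [canonical(f) for f in forms]

# A non-default parameter yields the dict form instead of the bare path.
custom = canonical(ToyScaler(high=2))
```

Every entry of `canonical_forms` is the same bare class path, while `custom` carries a `params` dict holding only `high`. Hashing the canonical form rather than the raw input is what makes the input syntax irrelevant.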
---

## Pipeline vs Pipeline Generator

### Pipeline Definition

A **pipeline definition** is a concrete list of steps that will be executed sequentially:

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.2),
    PLSRegression(n_components=10)
]
```

This creates **one pipeline** that will be run **once**.

### Pipeline Generator Definition

A **pipeline generator** uses special syntax (`_or_`, `_range_`) to define **multiple pipeline variations**:

```python
pipeline_generator = [
    MinMaxScaler(),
    {"_or_": [Detrend, FirstDerivative, Gaussian]},  # Creates 3 variations
    PLSRegression(n_components=10)
]
```

This creates **three pipelines**:

1. `MinMaxScaler() → Detrend → PLSRegression(...)`
2. `MinMaxScaler() → FirstDerivative → PLSRegression(...)`
3. `MinMaxScaler() → Gaussian → PLSRegression(...)`

**Generator keys**:

- `_or_`: Choose between alternatives (creates N pipelines)
- `_range_`: Sweep parameter values (creates M pipelines)
- `size`: Select N items at a time, producing combinations (used with feature augmentation)
- `count`: Randomly sample N configurations

Generators are **expanded** by `PipelineConfigs` before execution, producing multiple concrete pipelines.

---

## Basic Step Syntaxes

### 1. Class Reference (Uninstantiated)

**Syntax**: Pass a Python class directly.

```python
from sklearn.preprocessing import StandardScaler

pipeline = [
    StandardScaler  # Class, not instance
]
```

**Behavior**: nirs4all instantiates with **default parameters**.

**Serializes to** (class path only, no params dict):

```json
"sklearn.preprocessing._data.StandardScaler"
```

**Hash**: Based on class path only (all defaults).

---

### 2. Instance with Parameters

**Syntax**: Pass an instantiated object.
```python
from sklearn.preprocessing import MinMaxScaler

pipeline = [
    MinMaxScaler(feature_range=(0, 1))  # Default value
]
```

**Serializes to** (no params if all are defaults):

```json
"sklearn.preprocessing._data.MinMaxScaler"
```

**Example with non-default parameter**:

```python
MinMaxScaler(feature_range=(0, 2))  # Non-default
```

**Serializes to**:

```json
{
    "class": "sklearn.preprocessing._data.MinMaxScaler",
    "params": {
        "feature_range": [0, 2]
    }
}
```

**Note**: Only **non-default** parameters are saved (via `_changed_kwargs()`). This ensures that `MinMaxScaler()`, `MinMaxScaler(feature_range=(0, 1))`, and the class `MinMaxScaler` all produce the **same serialization and hash**.

---

### 3. String - Module Path

**Syntax**: Full module path to a class.

```python
pipeline = [
    "sklearn.preprocessing.StandardScaler"
]
```

**Behavior**: Same as class reference - instantiated with defaults.

**Serializes to** (identical to class reference):

```json
"sklearn.preprocessing._data.StandardScaler"
```

**Hash**: Same as using the class directly or an instance with default params.

*(Note: Internal module path may differ from public API path)*

---

### 4. String - Controller Name

**Syntax**: Short name for built-in nirs4all controllers.

```python
pipeline = [
    "chart_2d"  # Built-in visualization
]
```

**Behavior**: Resolves to a registered controller (e.g., `ChartController2D`).

**Serializes to**:

```json
"chart_2d"
```

---

### 5. String - File Path (Saved Transformer)

**Syntax**: Path to a saved transformer file (`.pkl`, `.joblib`).

```python
pipeline = [
    "my/super/transformer.pkl"
]
```

**Behavior**: Loads the transformer from disk during execution.

**Serializes to**:

```json
"my/super/transformer.pkl"
```

---

### 6. Dictionary - Explicit Configuration

**Syntax**: Dict with `class` and optional `params` keys.
```python
pipeline = [
    {
        "class": "sklearn.preprocessing.StandardScaler"
    }
]
```

**Serializes to** (identical to class reference):

```json
"sklearn.preprocessing._data.StandardScaler"
```

**With non-default parameters**:

```python
pipeline = [
    {
        "class": "sklearn.model_selection.ShuffleSplit",
        "params": {
            "n_splits": 3,
            "test_size": 0.25
        }
    }
]
```

**Serializes to** (same as input, with normalized class path):

```json
{
    "class": "sklearn.model_selection._split.ShuffleSplit",
    "params": {
        "n_splits": 3,
        "test_size": 0.25
    }
}
```

**Hash**: Based on class path + non-default params only.

---

### 7. Dictionary - Special Operators

**Syntax**: Dict with operator-specific keys (e.g., `y_processing`, `feature_augmentation`).

```python
pipeline = [
    {"y_processing": MinMaxScaler},      # Target variable scaling
    {"feature_augmentation": Detrend}    # Feature engineering
]
```

**Serializes to**:

```json
{
    "y_processing": {
        "class": "sklearn.preprocessing._data.MinMaxScaler"
    }
}
```

**Note**: Class wrapped in dict with `class` key during preprocessing (`_preprocess_steps()`).

---

## Model Step Syntaxes

Models have additional complexity due to:

- Custom naming
- Training parameters
- Hyperparameter optimization (finetuning)
- Support for functions (not just classes)

### 1. Model - Instance

**Syntax**: Pass a model instance directly.

```python
from sklearn.cross_decomposition import PLSRegression

pipeline = [
    PLSRegression(n_components=10)
]
```

**Serializes to**:

```json
{
    "class": "sklearn.cross_decomposition._pls.PLSRegression",
    "params": {
        "n_components": 10
    }
}
```

**Note**: If all params are default, serializes to just the class string:

```python
PLSRegression()  # All defaults
```

Serializes to:

```json
"sklearn.cross_decomposition._pls.PLSRegression"
```

**Hash**: Based on class + non-default params.

---

### 2. Model - Dict with Name

**Syntax**: Dict with `model` and `name` keys.
```python
pipeline = [
    {
        "model": PLSRegression(n_components=10),
        "name": "PLS_10_components"
    }
]
```

**Serializes to**:

```json
{
    "name": "PLS_10_components",
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
}
```

**Purpose**: Custom naming for tracking specific models in results.

**Hash behavior**: The `name` field **affects the hash** because it's part of the step configuration. This means:

```python
# These produce DIFFERENT hashes:
{"model": PLSRegression(n_components=10), "name": "Model_A"}
{"model": PLSRegression(n_components=10), "name": "Model_B"}
```

Even though the model is identical, different names create different pipeline variants (useful for comparing the same model with different training strategies).

---

### 3. Model - Dict with Class and Params

**Syntax**: Nested dict structure.

```python
pipeline = [
    {
        "model": {
            "class": "sklearn.cross_decomposition.PLSRegression",
            "params": {
                "n_components": 10
            }
        }
    }
]
```

**Serializes to**:

```json
{
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
}
```

**With defaults only**:

```python
pipeline = [
    {
        "model": {
            "class": "sklearn.cross_decomposition.PLSRegression"
        }
    }
]
```

Serializes to:

```json
{
    "model": "sklearn.cross_decomposition._pls.PLSRegression"
}
```

**Hash**: Based on model class + non-default params.

---

### 4. Model - Function (Neural Network)

**Syntax**: Pass a function that returns a model.

```python
from nirs4all.operators.models.tensorflow.nicon import nicon

pipeline = [
    nicon  # Function that builds a TensorFlow model
]
```

**Serializes to**:

```json
{
    "function": "nirs4all.operators.models.cirad_tf.nicon"
}
```

---

### 5. Model - String File Path

**Syntax**: Path to saved model file.
```python
pipeline = [
    "My_awesome_model.pkl",        # Scikit-learn model
    "My_awesome_tf_model.keras",   # TensorFlow/Keras model
    "my_model.pth"                 # PyTorch model
]
```

**Behavior**: Loads model from disk (framework detected by extension).

**Serializes to**: (preserves file path)

---

### 6. Model - With Training Parameters

**Syntax**: Dict with `model` and `train_params` keys (for neural networks).

```python
from nirs4all.operators.models.tensorflow.nicon import nicon

pipeline = [
    {
        "model": nicon,
        "train_params": {
            "epochs": 250,
            "batch_size": 32,
            "verbose": 0
        }
    }
]
```

**Serializes to**:

```json
{
    "model": {
        "function": "nirs4all.operators.models.cirad_tf.nicon"
    },
    "train_params": {
        "epochs": 250,
        "batch_size": 32,
        "verbose": 0
    }
}
```

---

### 6b. Model - With Architecture Parameters (Customizable NN)

**Syntax**: Dict with `model`, `model_params`, and optional `train_params` keys.

Use `model_params` to customize neural network architecture (filters, kernel sizes, etc.) at training time without finetuning.

```python
from nirs4all.operators.models.pytorch.nicon import customizable_nicon

pipeline = [
    {
        "model": customizable_nicon,
        "name": "CustomNN",
        "model_params": {          # Architecture parameters
            "filters1": 32,
            "filters2": 64,
            "kernel_size1": 9,
            "dropout_rate": 0.3
        },
        "train_params": {          # Training loop parameters
            "epochs": 250,
            "batch_size": 32,
            "lr": 0.001
        }
    }
]
```

**Key distinction**:

- `model_params`: Parameters passed to the **model builder function** (architecture)
- `train_params`: Parameters for the **training loop** (epochs, batch size, optimizer settings)

**Serializes to**:

```json
{
    "name": "CustomNN",
    "model": {
        "function": "nirs4all.operators.models.pytorch.nicon.customizable_nicon"
    },
    "model_params": {
        "filters1": 32,
        "filters2": 64,
        "kernel_size1": 9,
        "dropout_rate": 0.3
    },
    "train_params": {
        "epochs": 250,
        "batch_size": 32,
        "lr": 0.001
    }
}
```

**💡 Tip**: This is useful when you know the optimal architecture and want to train without finetuning overhead.

---

### 7. Model - With Hyperparameter Optimization (Finetuning)

**Syntax**: Dict with `model`, optional `name`, and `finetune_params` keys.

```python
pipeline = [
    {
        "model": PLSRegression(),
        "name": "PLS-Finetuned",
        "finetune_params": {
            "n_trials": 20,
            "verbose": 2,
            "approach": "single",
            "eval_mode": "best",
            "sample": "grid",
            "model_params": {
                'n_components': ('int', 1, 30),  # Tuple: (type, min, max)
            }
        }
    }
]
```

**Finetuning parameters**:

- `n_trials`: Number of optimization trials
- `verbose`: 0=silent, 1=basic, 2=detailed
- `approach`: "single", "grouped", or "individual"
- `eval_mode`: "best" or "avg" (for grouped approach)
- `sample`: Optimizer strategy ("random", "grid", "bayes", "tpe", "hyperband", etc.)
- `model_params`: Dict of parameters to optimize
  - Value format: `(type, min, max)` for ranges
  - Or list `[value1, value2, ...]` for categorical

**Serializes to**:

```json
{
    "name": "PLS-Finetuned",
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression"
    },
    "finetune_params": {
        "n_trials": 20,
        "verbose": 2,
        "approach": "single",
        "eval_mode": "best",
        "sample": "grid",
        "model_params": {
            "n_components": ["int", 1, 30]
        }
    }
}
```

**⚠️ Important**: Tuples like `('int', 1, 30)` are converted to **lists** `["int", 1, 30]` during serialization for YAML/JSON compatibility.

---

### 8. Model - Neural Network with Finetuning

**Syntax**: Combine function models with hyperparameter optimization.

```python
from nirs4all.operators.models.tensorflow.nicon import customizable_nicon

pipeline = [
    {
        "model": customizable_nicon,
        "name": "NN-Optimized",
        "finetune_params": {
            "n_trials": 30,
            "verbose": 2,
            "sample": "hyperband",
            "approach": "single",
            "model_params": {
                "filters_1": [8, 16, 32, 64],  # Categorical choices
                "filters_3": [8, 16, 32, 64]
            },
            "train_params": {
                "epochs": 10,
                "verbose": 0
            }
        },
        "train_params": {
            "epochs": 250,  # Final training after optimization
            "verbose": 0
        }
    }
]
```

**Two-stage training**:

1. `finetune_params.train_params`: Used during **hyperparameter search** (fewer epochs)
2. `train_params`: Used for **final training** with best parameters (more epochs)

---

### 9. Model - Custom Code File

**Syntax**: Dict with `source_file` and `class` keys.

```python
pipeline = [
    {
        "source_file": "my_model.py",
        "class": "MyAwesomeModel"
    }
]
```

**Behavior**: Dynamically imports `MyAwesomeModel` from `my_model.py`.

**Serializes to**: (preserves source file and class name)

---

## Generator Syntaxes

Generators allow automatic creation of **multiple pipeline variations** for experimentation.

### 1. `_or_` - Alternative Choices

**Syntax**: Dict with `_or_` key containing a list of choices.

```python
preprocessing_options = [
    Detrend, FirstDerivative, Gaussian, StandardNormalVariate
]

pipeline = [
    {"_or_": preprocessing_options}  # Creates 4 pipelines (one per choice)
]
```

**Expands to**: 4 separate pipelines, each using one preprocessing method.

---

### 2. `_or_` with `count` - Random Sampling

**Syntax**: Add `count` key to limit number of generated pipelines.

```python
pipeline = [
    {"_or_": preprocessing_options, "count": 2}  # Randomly select 2
]
```

**Expands to**: 2 pipelines (randomly sampled from 4 options).

---

### 3. `_or_` with `size` - Combinations

**Syntax**: Add `size` key to select N items at once (creates combinations).

```python
pipeline = [
    {"_or_": preprocessing_options, "size": 2}  # Choose 2 at a time
]
```

**Expands to**: All combinations of 2 items from 4 options = 6 pipelines:

- `[Detrend, FirstDerivative]`
- `[Detrend, Gaussian]`
- `[Detrend, StandardNormalVariate]`
- `[FirstDerivative, Gaussian]`
- `[FirstDerivative, StandardNormalVariate]`
- `[Gaussian, StandardNormalVariate]`

---

### 4. `_or_` with `size` Range - Variable Size

**Syntax**: Use tuple `(from, to)` for size range.
```python
pipeline = [
    {"_or_": preprocessing_options, "size": (1, 2)}  # 1 or 2 items
]
```

**Expands to**: All combinations of size 1 + all combinations of size 2 = 4 + 6 = 10 pipelines.

---

### 5. `_or_` with `size` and `count` - Limited Combinations

**Syntax**: Combine `size` and `count` to randomly sample from combinations.

```python
pipeline = [
    {"feature_augmentation": {
        "_or_": preprocessing_options,
        "size": (1, 2),
        "count": 5  # Randomly pick 5 combinations
    }}
]
```

**Expands to**: 5 randomly sampled pipelines from all possible 1-2 item combinations.

---

### 6. `_or_` with Nested Arrays - Second-Order Combinations

**Syntax**: Use list `[outer_size, inner_size]` for nested combinations (sub-pipelines).

```python
pipeline = [
    {"feature_augmentation": {
        "_or_": preprocessing_options,
        "size": [2, (1, 2)]  # 2 sub-pipelines, each with 1-2 items
    }}
]
```

**Behavior**:

1. Creates all **inner arrangements** (permutations of 1-2 items)
2. Selects **outer combinations** (choose 2 sub-pipelines)

**Example expansion**:

```python
[
    [[Detrend], [FirstDerivative]],
    [[Detrend], [Gaussian]],
    [[FirstDerivative, Detrend], [Gaussian, StandardNormalVariate]],
    ...
]
```

**Note**: Inner uses **permutations** (order matters), outer uses **combinations** (order doesn't matter).

---

### 7. `_range_` - Numeric Parameter Sweep

**Syntax**: Dict with `_range_` key and model configuration.

```python
pipeline = [
    {
        "_range_": [1, 12, 2],  # Start, end, step
        "param": "n_components",
        "model": {
            "class": "sklearn.cross_decomposition.PLSRegression"
        }
    }
]
```

**Expands to**: 6 pipelines with `n_components` = 1, 3, 5, 7, 9, 11.

**Alternative syntax** (dict):

```python
{
    "_range_": {"from": 1, "to": 12, "step": 2},
    "param": "n_components",
    "model": PLSRegression
}
```

---

### 8. `_range_` with `count` - Sampled Range

**Syntax**: Add `count` key to randomly sample from range.
```python
pipeline = [
    {
        "_range_": [1, 30],  # 30 values
        "count": 10,         # Sample 10 randomly
        "param": "n_components",
        "model": PLSRegression
    }
]
```

**Expands to**: 10 pipelines with randomly selected `n_components` values from 1-30.

---

## File Formats

Pipelines can be defined in Python, JSON, or YAML.

### Python Format

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

from nirs4all.pipeline import PipelineConfigs

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),
    PLSRegression(n_components=10)
]

config = PipelineConfigs(pipeline, name="my_pipeline")
```

---

### JSON Format

**File**: `pipeline.json`

```json
{
    "pipeline": [
        {
            "class": "sklearn.preprocessing._data.MinMaxScaler"
        },
        {
            "class": "sklearn.model_selection._split.ShuffleSplit",
            "params": {
                "n_splits": 5
            }
        },
        {
            "class": "sklearn.cross_decomposition._pls.PLSRegression",
            "params": {
                "n_components": 10
            }
        }
    ]
}
```

**Load**:

```python
config = PipelineConfigs("pipeline.json", name="my_pipeline")
```

---

### YAML Format

**File**: `pipeline.yaml`

```yaml
pipeline:
  - class: sklearn.preprocessing._data.MinMaxScaler
  - class: sklearn.model_selection._split.ShuffleSplit
    params:
      n_splits: 5
  - class: sklearn.cross_decomposition._pls.PLSRegression
    params:
      n_components: 10
```

**Load**:

```python
config = PipelineConfigs("pipeline.yaml", name="my_pipeline")
```

---

### Generator in YAML

```yaml
pipeline:
  - class: sklearn.preprocessing._data.MinMaxScaler
  - feature_augmentation:
      _or_:
        - class: nirs4all.operators.transforms.Detrend
        - class: nirs4all.operators.transforms.FirstDerivative
        - class: nirs4all.operators.transforms.Gaussian
      size: [1, 2]
      count: 5
  - class: sklearn.model_selection._split.ShuffleSplit
    params:
      n_splits: 5
  - _range_: [1, 12, 2]
    param: n_components
    model:
      class: sklearn.cross_decomposition._pls.PLSRegression
```

**Note**: Tuples in Python become **lists** in JSON/YAML:

- Python: `('int', 1, 30)` → YAML: `["int", 1, 30]`

---

## Syntax Reference Table

| Syntax | Example | Use Case | Serializes To |
|--------|---------|----------|---------------|
| **Class** | `StandardScaler` | Default params | `"sklearn.preprocessing._data.StandardScaler"` |
| **Instance (defaults)** | `StandardScaler()` | Default params | `"sklearn.preprocessing._data.StandardScaler"` (same as class) |
| **Instance (default values)** | `MinMaxScaler(feature_range=(0,1))` | Explicit defaults | `"sklearn.preprocessing._data.MinMaxScaler"` (no params) |
| **Instance (custom)** | `MinMaxScaler(feature_range=(0,2))` | Non-default params | `{"class": "...", "params": {"feature_range": [0, 2]}}` |
| **String (module)** | `"sklearn.preprocessing.StandardScaler"` | Portable reference | `"sklearn.preprocessing._data.StandardScaler"` (same as class) |
| **String (controller)** | `"chart_2d"` | Built-in operator | `"chart_2d"` |
| **String (file)** | `"transformer.pkl"` | Saved object | `"transformer.pkl"` |
| **Dict (explicit, defaults)** | `{"class": "sklearn.preprocessing.StandardScaler"}` | Full control | `"sklearn.preprocessing._data.StandardScaler"` (same as class) |
| **Dict (explicit, params)** | `{"class": "...", "params": {...}}` | Full control | `{"class": "...", "params": {...}}` (non-default only) |
| **Dict (operator)** | `{"y_processing": MinMaxScaler}` | Special operator | `{"y_processing": "sklearn.preprocessing._data.MinMaxScaler"}` |
| **Model + Name** | `{"model": PLSRegression(), "name": "PLS-10"}` | Named model | `{"name": "...", "model": "..."}` or with params |
| **Function** | `nicon` | NN builder | `{"function": "module.nicon"}` |
| **Model + Train** | `{"model": nicon, "train_params": {...}}` | NN with config | `{"model": {...}, "train_params": {...}}` |
| **Model + Finetune** | `{"model": PLSRegression(), "finetune_params": {...}}` | HPO | `{"model": "...", "finetune_params": {...}}` or with params |
| **Generator (_or_)** | `{"_or_": [A, B, C]}` | Alternatives | Expands to N pipelines |
| **Generator (_range_)** | `{"_range_": [1, 10, 2], "param": "n", "model": ...}` | Param sweep | Expands to M pipelines |
| **Generator + size** | `{"_or_": [...], "size": 2}` | Combinations | C(n, k) pipelines |
| **Generator + count** | `{"_or_": [...], "count": 5}` | Random sample | 5 pipelines |
| **Nested generator** | `{"_or_": [...], "size": [2, (1,2)]}` | Sub-pipelines | Complex expansion |

**Key principle**: Different syntaxes producing the same object (same class + same non-default params) → **same serialization** → **same hash**.

---

## Serialization Rules

nirs4all normalizes all syntaxes to a canonical form for storage and reproducibility.

### Rule 1: Classes → String Paths (When Defaults Only)

**Input (all produce same object)**:

```python
from sklearn.preprocessing import StandardScaler

# Syntax 1: Class
StandardScaler

# Syntax 2: Instance with defaults
StandardScaler()

# Syntax 3: String path
"sklearn.preprocessing.StandardScaler"

# Syntax 4: Dict with class only
{"class": "sklearn.preprocessing.StandardScaler"}
```

**All serialize to the same canonical form**:

```json
"sklearn.preprocessing._data.StandardScaler"
```

**Hash**: Based on class path only (all produce identical hash).

**Note**: Uses **internal module path** (may differ from import path).

---

### Rule 2: Instances → Dict with Params (Only Non-Defaults)

**Input with non-default parameter**:

```python
MinMaxScaler(feature_range=(0, 2))  # (0, 2) is NOT default
```

**Serialized**:

```json
{
    "class": "sklearn.preprocessing._data.MinMaxScaler",
    "params": {
        "feature_range": [0, 2]
    }
}
```

**Input with default parameter**:

```python
MinMaxScaler(feature_range=(0, 1))  # (0, 1) IS default
```

**Serialized** (no params, identical to class):

```json
"sklearn.preprocessing._data.MinMaxScaler"
```

**Key behavior**: Only **non-default** parameters are included (via `_changed_kwargs()`). This ensures:

- `MinMaxScaler` (class)
- `MinMaxScaler()` (instance, defaults)
- `MinMaxScaler(feature_range=(0, 1))` (instance, explicit defaults)
- `"sklearn.preprocessing.MinMaxScaler"` (string)

All produce the **same serialization and hash**.
**Hash**: Based on class path + JSON representation of non-default params.

---

### Rule 3: Tuples → Lists

**Reason**: YAML's `safe_load()` cannot deserialize Python-specific tuples.

**Input**:

```python
{
    "feature_range": (0, 1)
}
```

**Serialized**:

```json
{
    "feature_range": [0, 1]
}
```

**Exception**: Hyperparameter range tuples `('int', min, max)` are also converted to lists during YAML serialization, but they are preserved as semantic ranges during the JSON serialization phase.

---

### Rule 4: Functions → Dict with Function Key

**Input**:

```python
from nirs4all.operators.models.tensorflow.nicon import nicon

nicon
```

**Serialized**:

```json
{
    "function": "nirs4all.operators.models.cirad_tf.nicon"
}
```

---

### Rule 5: Nested Dicts Recursively Serialized

**Input**:

```python
{
    "model": {
        "class": PLSRegression,
        "params": {"n_components": 10}
    }
}
```

**Serialized**:

```json
{
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
}
```

---

### Rule 6: Special Keys Preserved

Generator keys (`_or_`, `_range_`, `size`, `count`) and operator keys (`y_processing`, `feature_augmentation`, etc.) are **preserved as-is**.

---

### Rule 7: Minimal Serialization (Hash-Based Uniqueness)

The `_changed_kwargs()` function compares current values to defaults:

```python
import inspect


def _changed_kwargs(obj):
    """Return {param: value} for every __init__ param whose current value differs from its default."""
    sig = inspect.signature(obj.__class__.__init__)
    out = {}
    for name, param in sig.parameters.items():
        if name == "self":
            continue
        # Parameters declared without a default are compared against None
        default = param.default if param.default is not inspect.Parameter.empty else None
        current = getattr(obj, name, default)
        if current != default:
            out[name] = current  # Only save if different!
    return out
```

**Example comparisons**:

| Input | Default Value | Serialized Params | Serialized Form |
|-------|---------------|-------------------|-----------------|
| `MinMaxScaler` | N/A (class) | None | `"sklearn.preprocessing._data.MinMaxScaler"` |
| `MinMaxScaler()` | All defaults | None | `"sklearn.preprocessing._data.MinMaxScaler"` |
| `MinMaxScaler(feature_range=(0,1))` | `(0, 1)` | None (matches default) | `"sklearn.preprocessing._data.MinMaxScaler"` |
| `MinMaxScaler(feature_range=(0,2))` | `(0, 1)` | `{"feature_range": [0, 2]}` | `{"class": "...", "params": {...}}` |
| `PLSRegression()` | All defaults | None | `"sklearn.cross_decomposition._pls.PLSRegression"` |
| `PLSRegression(n_components=2)` | `2` | None (matches default) | `"sklearn.cross_decomposition._pls.PLSRegression"` |
| `PLSRegression(n_components=10)` | `2` | `{"n_components": 10}` | `{"class": "...", "params": {...}}` |

**This keeps serialized pipelines minimal and ensures hash-based uniqueness**:

- ✅ Same object (same class + same effective params) → Same serialization → Same hash
- ✅ Different objects (different class OR different params) → Different serialization → Different hash
- ✅ Pipeline deduplication works correctly
- ✅ Configuration variations are properly detected

---

## Best Practices

### ✅ Do

1. **Use any syntax you prefer** - they all normalize correctly:
   - Class: `StandardScaler`
   - Instance with defaults: `StandardScaler()`
   - Instance with custom params: `MinMaxScaler(feature_range=(0, 2))`
   - String: `"sklearn.preprocessing.StandardScaler"`
   - Dict: `{"class": "sklearn.preprocessing.StandardScaler"}`
2. **Trust the serialization** - same object = same hash, regardless of syntax
3. **Use dicts for portability**: `{"class": "...", "params": {...}}` works in JSON/YAML files
4. **Name important models**: `{"model": PLSRegression(), "name": "PLS-Best"}`
5. **Use generators for experimentation**: `{"_or_": [...], "count": 10}`
6. **Don't worry about defaults** - `_changed_kwargs()` handles it automatically:
   - `PLSRegression()` and `PLSRegression(n_components=2)` → same serialization (2 is default)
   - `PLSRegression(n_components=10)` → different serialization (10 is non-default)

### ❌ Avoid

1. **Mixing syntaxes unnecessarily**: Choose one style per project
2. **Hardcoding internal paths**: Use public import paths when possible
3. **Over-specifying generators**: The number of generated configurations can explode with nested `_or_`; cap it with `count`
4. **Forgetting YAML limits**: Tuples become lists (usually fine, but be aware)
5. **Custom objects without serialization support**: Extend `serialize_component()` if needed

---

## Examples

### Example 1: Hash Equivalence - Same Object, Different Syntaxes

All these pipeline definitions produce the **same hash**:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cross_decomposition import PLSRegression

# Pipeline A: Using classes and instances with defaults
pipeline_a = [
    StandardScaler,   # Class
    PLSRegression()   # Instance with defaults (n_components=2 is default)
]

# Pipeline B: Using instances with explicit defaults
pipeline_b = [
    StandardScaler(),
    PLSRegression(n_components=2)  # Explicit default
]

# Pipeline C: Using string paths
pipeline_c = [
    "sklearn.preprocessing.StandardScaler",
    "sklearn.cross_decomposition.PLSRegression"
]

# Pipeline D: Using dicts
pipeline_d = [
    {"class": "sklearn.preprocessing.StandardScaler"},
    {"class": "sklearn.cross_decomposition.PLSRegression"}
]

# Pipeline E: Mixed syntaxes
pipeline_e = [
    StandardScaler,  # Class
    {"class": "sklearn.cross_decomposition.PLSRegression"}  # Dict
]
```

**All serialize to**:

```json
[
    "sklearn.preprocessing._data.StandardScaler",
    "sklearn.cross_decomposition._pls.PLSRegression"
]
```

**Hash**: `get_hash(steps)` produces an identical MD5 hash for all 5 pipelines.

**Result**: nirs4all recognizes them as the **same pipeline** and won't run duplicates.
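To make the link between serialization and hash concrete, here is a minimal sketch that hashes already-serialized step lists with MD5. The `sketch_hash()` helper is hypothetical and only approximates the idea; the actual `get_hash()` in nirs4all may encode the steps differently.

```python
import hashlib
import json


def sketch_hash(serialized_steps):
    """MD5 of the JSON-encoded step list (hypothetical stand-in, not nirs4all's get_hash)."""
    payload = json.dumps(serialized_steps, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Canonical form shared by pipelines that use only default parameters.
serialized_defaults = [
    "sklearn.preprocessing._data.StandardScaler",
    "sklearn.cross_decomposition._pls.PLSRegression",
]

# Canonical form of a pipeline whose model has a non-default parameter.
serialized_custom = [
    "sklearn.preprocessing._data.StandardScaler",
    {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {"n_components": 10},
    },
]

hash_a = sketch_hash(serialized_defaults)
hash_b = sketch_hash(list(serialized_defaults))  # a second pipeline that serialized identically
hash_c = sketch_hash(serialized_custom)
```

Here `hash_a` and `hash_b` are equal because the serialized lists are equal, while `hash_c` differs because the `params` dict changes the encoded payload. Sorting keys before encoding keeps the digest independent of dict insertion order.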
---

### Example 2: Hash Difference - Different Objects

These pipelines produce **different hashes**:

```python
# Pipeline A: PLSRegression with default n_components
pipeline_a = [
    StandardScaler,
    PLSRegression()  # n_components=2 (default)
]

# Pipeline B: PLSRegression with non-default n_components
pipeline_b = [
    StandardScaler,
    PLSRegression(n_components=10)  # n_components=10 (non-default)
]
```

**Pipeline A serializes to**:

```json
[
    "sklearn.preprocessing._data.StandardScaler",
    "sklearn.cross_decomposition._pls.PLSRegression"
]
```

**Pipeline B serializes to**:

```json
[
    "sklearn.preprocessing._data.StandardScaler",
    {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
]
```

**Hash**: Different hashes → Recognized as different pipelines → Both will run.

---

### Example 3: Simple Regression Pipeline

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import ShuffleSplit
from sklearn.cross_decomposition import PLSRegression

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.25),
    PLSRegression(n_components=10)
]
```

### Example 4: Multi-Model Comparison

```python
pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.25),
    {"model": PLSRegression(n_components=5), "name": "PLS-5"},
    {"model": PLSRegression(n_components=10), "name": "PLS-10"},
    {"model": PLSRegression(n_components=15), "name": "PLS-15"}
]
```

### Example 5: Preprocessing Exploration

```python
from nirs4all.operators.transforms import Detrend, FirstDerivative, Gaussian

pipeline = [
    MinMaxScaler(),
    {"_or_": [Detrend, FirstDerivative, Gaussian]},  # 3 variations
    ShuffleSplit(n_splits=5, test_size=0.25),
    PLSRegression(n_components=10)
]
# Expands to 3 pipelines
```

### Example 6: Hyperparameter Optimization

```python
pipeline = [
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},
    ShuffleSplit(n_splits=5, test_size=0.25),
    {
        "model": PLSRegression(),
        "name": "PLS-Optimized",
        "finetune_params": {
            "n_trials": 50,
            "verbose": 2,
            "approach": "single",
            "sample": "tpe",
            "model_params": {
                'n_components': ('int', 1, 30)
            }
        }
    }
]
```

### Example 7: Complex Generator

```python
from nirs4all.operators.transforms import (
    Detrend, FirstDerivative, Gaussian, StandardNormalVariate
)

pipeline = [
    MinMaxScaler(),
    {"feature_augmentation": {
        "_or_": [Detrend, FirstDerivative, Gaussian, StandardNormalVariate],
        "size": (1, 2),  # 1 or 2 preprocessing steps
        "count": 5       # Random sample 5 combinations
    }},
    ShuffleSplit(n_splits=5, test_size=0.25),
    {
        "_range_": [5, 20, 5],  # 5, 10, 15, 20
        "param": "n_components",
        "model": PLSRegression
    }
]
# Expands to 5 * 4 = 20 pipelines
```

### Example 8: Neural Network with Training Config

```python
from nirs4all.operators.models.tensorflow.nicon import customizable_nicon

pipeline = [
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},
    ShuffleSplit(n_splits=5, test_size=0.25),
    {
        "model": customizable_nicon,
        "name": "CustomNN",
        "finetune_params": {
            "n_trials": 30,
            "verbose": 2,
            "sample": "hyperband",
            "approach": "single",
            "model_params": {
                "filters_1": [8, 16, 32, 64],
                "filters_2": [8, 16, 32, 64]
            },
            "train_params": {
                "epochs": 10,  # Fast training during search
                "verbose": 0
            }
        },
        "train_params": {
            "epochs": 250,  # Full training with best params
            "verbose": 1
        }
    }
]
```

---

## Troubleshooting

### Error: "could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'"

**Cause**: Python tuples in pipeline config (e.g., `('int', 1, 30)`) cannot be deserialized by `yaml.safe_load()`.

**Solution**: This is fixed automatically by `_sanitize_for_yaml()` in the manifest manager. If you see this error, ensure you're using the latest nirs4all version.
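For intuition, the kind of conversion involved can be sketched as a recursive walk that turns every tuple into a list. The `tuples_to_lists()` helper below is hypothetical and written only for illustration; it is not the code of `_sanitize_for_yaml()`.

```python
import json


def tuples_to_lists(value):
    """Recursively replace tuples with lists so the structure uses only plain JSON/YAML types."""
    if isinstance(value, (tuple, list)):
        return [tuples_to_lists(item) for item in value]
    if isinstance(value, dict):
        return {key: tuples_to_lists(item) for key, item in value.items()}
    return value


finetune_params = {
    "n_trials": 20,
    "model_params": {"n_components": ("int", 1, 30)},
}

safe_params = tuples_to_lists(finetune_params)

# After conversion, a JSON round trip returns exactly the same structure.
round_tripped = json.loads(json.dumps(safe_params))
```

The original `finetune_params` dict is left untouched; only the returned copy has lists in place of tuples.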
**Manual fix** (if needed):

```python
# Change from:
"model_params": {
    'n_components': ('int', 1, 30)
}

# To:
"model_params": {
    'n_components': ['int', 1, 30]
}
```

### Error: "Pipeline configuration expansion would generate X configurations, exceeding the limit"

**Cause**: Generator expansion creates too many pipelines (default limit: 10,000).

**Solutions**:

1. Add `count` parameter: `{"_or_": [...], "count": 100}`
2. Increase limit: `PipelineConfigs(pipeline, max_generation_count=50000)`
3. Simplify generator (reduce options or nesting)

### Error: "Failed to import module.ClassName"

**Cause**: Invalid class path or missing dependency.

**Solutions**:

1. Check import path: `from sklearn.preprocessing import StandardScaler` → class path is `sklearn.preprocessing._data.StandardScaler` (internal)
2. Ensure dependencies are installed: `pip install scikit-learn tensorflow torch` (the PyTorch package is published on PyPI as `torch`)
3. Use instance instead: `StandardScaler()` instead of `"sklearn.preprocessing.StandardScaler"`

---

## See Also

- {doc}`operator_catalog` - All built-in nirs4all operators
- {doc}`cli` - Command-line interface reference
- {doc}`/user_guide/preprocessing/index` - Preprocessing guide
- {doc}`/user_guide/pipelines/branching` - Branching and merging guide
- {doc}`/developer/architecture` - Pipeline architecture overview

---

**Last Updated**: December 2025
**Version**: 1.1 (Phase 3 Documentation Update)