Writing a Pipeline in nirs4all

This guide explains all possible syntaxes for defining pipeline steps in nirs4all, from simple operators to complex model configurations with hyperparameter optimization.

Table of Contents

Overview
Pipeline vs Pipeline Generator
Basic Step Syntaxes
Model Step Syntaxes
Generator Syntaxes
File Formats
Syntax Reference Table
Serialization Rules
Best Practices
Examples
Troubleshooting

Overview

A pipeline in nirs4all is a list of processing steps that transform data and train/apply models. Each step can be:

A transformer (preprocessing, feature engineering)
A cross-validator (data splitting strategy)
A model (for training or prediction)
A visualization (charts, reports)
A special operator (resampling, augmentation)

Pipelines are defined in Python as lists, and can be saved/loaded from JSON or YAML files.

Philosophy

nirs4all accepts multiple syntaxes for maximum flexibility:

Python objects (class, instance, function)
String references (module paths, file paths, controller names)
Dictionaries (explicit configuration with parameters)

Core Principle: Hash-Based Uniqueness

The fundamental rule: Different syntaxes that produce the same object must serialize to the same canonical form, resulting in the same hash.

Example - All these are equivalent:

# Syntax 1: Class
StandardScaler

# Syntax 2: Instance with defaults
StandardScaler()

# Syntax 3: Instance with explicit default value
StandardScaler(with_mean=True)  # True is the default

# Syntax 4: String path
"sklearn.preprocessing.StandardScaler"

# Syntax 5: Dict
{"class": "sklearn.preprocessing.StandardScaler"}

All serialize to:

"sklearn.preprocessing._data.StandardScaler"

Result: Same hash → Recognized as identical pipelines → Proper deduplication.

Counter-example - These are different:

# Different class
StandardScaler  # vs  MinMaxScaler

# Same class, different non-default params
MinMaxScaler(feature_range=(0, 1))  # default
# vs
MinMaxScaler(feature_range=(0, 2))  # non-default

These produce different serializations and different hashes.

All syntaxes are normalized during serialization to a canonical form for hash-based uniqueness:

{
    "class": "module.path.ClassName",
    "params": {"param1": "value1"}
}

Or simply (when all parameters are default):

"module.path.ClassName"

Critical principle: Same object = Same serialization = Same hash

This ensures:

✅ Hash-based uniqueness: Identical configurations produce identical hashes (regardless of input syntax)
✅ Minimalism: Only non-default parameters are included
✅ Deduplication: Pipeline variations are properly detected and merged
✅ Reproducibility: Exact pipeline state can be restored

Example: All these syntaxes produce the same serialization:

StandardScaler                                  # Class
StandardScaler()                                # Instance with defaults
StandardScaler(with_mean=True)                  # Instance with default value
"sklearn.preprocessing.StandardScaler"          # String path
{"class": "sklearn.preprocessing.StandardScaler"}  # Dict

All serialize to:

"sklearn.preprocessing._data.StandardScaler"

Because they all create the same object with default parameters.
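The link between canonical form and hash can be sketched in a few lines. This is an illustration only: it assumes the hash is an MD5 digest of the JSON-encoded canonical steps, and `canonical_hash` is a hypothetical helper, not the nirs4all implementation.

```python
import hashlib
import json


def canonical_hash(steps):
    """MD5 of the JSON-encoded canonical steps (hypothetical helper)."""
    # sort_keys makes the encoding independent of dict insertion order.
    payload = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Canonical forms as given in this guide.
from_class = ["sklearn.preprocessing._data.StandardScaler"]
from_dict = ["sklearn.preprocessing._data.StandardScaler"]
custom = [{
    "class": "sklearn.preprocessing._data.MinMaxScaler",
    "params": {"feature_range": [0, 2]},
}]

same = canonical_hash(from_class) == canonical_hash(from_dict)
different = canonical_hash(from_class) != canonical_hash(custom)
```

Because the hash is computed on the canonical form rather than on the input syntax, two steps can only collide when they already serialize identically.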

Pipeline vs Pipeline Generator

Pipeline Definition

A pipeline definition is a concrete list of steps that will be executed sequentially:

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.2),
    PLSRegression(n_components=10)
]

This creates one pipeline that will be run once.

Pipeline Generator Definition

A pipeline generator uses special syntax (_or_, _range_) to define multiple pipeline variations:

pipeline_generator = [
    MinMaxScaler(),
    {"_or_": [Detrend, FirstDerivative, Gaussian]},  # Creates 3 variations
    PLSRegression(n_components=10)
]

This creates three pipelines:

MinMaxScaler() → Detrend → PLSRegression(...)
MinMaxScaler() → FirstDerivative → PLSRegression(...)
MinMaxScaler() → Gaussian → PLSRegression(...)

Generator keys:

_or_: Choose between alternatives (creates N pipelines)
_range_: Sweep parameter values (creates M pipelines)
size: Limit combinations (for feature augmentation)
count: Randomly sample N configurations

Generators are expanded by PipelineConfigs before execution, producing multiple concrete pipelines.
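As a rough sketch of what this expansion does for a plain `_or_` step, the snippet below builds the Cartesian product of all choices. Step names are plain strings here, and `expand_or` is a hypothetical helper; the real expansion in PipelineConfigs also handles `_range_`, `size` and `count`.

```python
from itertools import product


def expand_or(steps):
    """Expand every {"_or_": [...]} step into one concrete pipeline per choice."""
    choices = [
        step["_or_"] if isinstance(step, dict) and "_or_" in step else [step]
        for step in steps
    ]
    # One concrete pipeline per element of the Cartesian product.
    return [list(combo) for combo in product(*choices)]


generator = [
    "MinMaxScaler",
    {"_or_": ["Detrend", "FirstDerivative", "Gaussian"]},
    "PLSRegression",
]
pipelines = expand_or(generator)
```

With two `_or_` steps of 3 and 4 choices, the same product would yield 12 pipelines, which is why generator sizes grow quickly.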

Basic Step Syntaxes

1. Class Reference (Uninstantiated)

Syntax: Pass a Python class directly.

from sklearn.preprocessing import StandardScaler

pipeline = [
    StandardScaler  # Class, not instance
]

Behavior: nirs4all instantiates with default parameters.

Serializes to (class path only, no params dict):

"sklearn.preprocessing._data.StandardScaler"

Hash: Based on class path only (all defaults).

2. Instance with Parameters

Syntax: Pass an instantiated object.

from sklearn.preprocessing import MinMaxScaler

pipeline = [
    MinMaxScaler(feature_range=(0, 1))  # Default value
]

Serializes to (no params if all are defaults):

"sklearn.preprocessing._data.MinMaxScaler"

Example with non-default parameter:

MinMaxScaler(feature_range=(0, 2))  # Non-default

Serializes to:

{
    "class": "sklearn.preprocessing._data.MinMaxScaler",
    "params": {
        "feature_range": [0, 2]
    }
}

Note: Only non-default parameters are saved (via _changed_kwargs()). This ensures that MinMaxScaler(), MinMaxScaler(feature_range=(0, 1)), and the class MinMaxScaler all produce the same serialization and hash.
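The effect of dropping default parameters can be illustrated with a toy class. This is a sketch under assumptions: `ToyScaler` and `serialize` are hypothetical stand-ins, not sklearn or nirs4all code, and the comparison logic is simplified.

```python
import inspect


class ToyScaler:
    """Stand-in for an estimator; not a real sklearn or nirs4all class."""

    def __init__(self, feature_range=(0, 1), copy=True):
        self.feature_range = feature_range
        self.copy = copy


def serialize(obj):
    """Return the class path alone, or a dict carrying only non-default params."""
    cls = obj if inspect.isclass(obj) else type(obj)
    path = f"{cls.__module__}.{cls.__qualname__}"
    if inspect.isclass(obj):
        return path  # a bare class always means "all defaults"
    params = {}
    for name, param in inspect.signature(cls.__init__).parameters.items():
        if name == "self" or param.default is inspect.Parameter.empty:
            continue
        value = getattr(obj, name)
        if value != param.default:
            # Tuples are stored as lists so the result is JSON/YAML friendly.
            params[name] = list(value) if isinstance(value, tuple) else value
    return {"class": path, "params": params} if params else path


default_forms = [
    serialize(ToyScaler),
    serialize(ToyScaler()),
    serialize(ToyScaler(feature_range=(0, 1))),
]
custom_form = serialize(ToyScaler(feature_range=(0, 2)))
```

All three entries of `default_forms` are the same string, while `custom_form` is a dict whose `params` holds only `feature_range`.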

3. String - Module Path

Syntax: Full module path to a class.

pipeline = [
    "sklearn.preprocessing.StandardScaler"
]

Behavior: Same as class reference - instantiated with defaults.

Serializes to (identical to class reference):

"sklearn.preprocessing._data.StandardScaler"

Hash: Same as using the class directly or an instance with default params.

(Note: Internal module path may differ from public API path)
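The difference between a public import path and the internal path can be reproduced with the standard library alone. The sketch below uses `json.JSONDecoder` as a stand-in for a sklearn class; `resolve` and `canonical_path` are hypothetical helper names.

```python
import importlib


def resolve(dotted_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def canonical_path(cls):
    """Path built from where the class is actually defined."""
    return f"{cls.__module__}.{cls.__qualname__}"


public = canonical_path(resolve("json.JSONDecoder"))
internal = canonical_path(resolve("json.decoder.JSONDecoder"))
```

Both calls return "json.decoder.JSONDecoder": the public path and the internal path resolve to the same class, so they share one canonical string. The same mechanism explains why "sklearn.preprocessing.StandardScaler" is stored as "sklearn.preprocessing._data.StandardScaler".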

4. String - Controller Name

Syntax: Short name for built-in nirs4all controllers.

pipeline = [
    "chart_2d"  # Built-in visualization
]

Behavior: Resolves to a registered controller (e.g., ChartController2D).

Serializes to:

"chart_2d"

5. String - File Path (Saved Transformer)

Syntax: Path to a saved transformer file (.pkl, .joblib).

pipeline = [
    "my/super/transformer.pkl"
]

Behavior: Loads the transformer from disk during execution.

Serializes to:

"my/super/transformer.pkl"

6. Dictionary - Explicit Configuration

Syntax: Dict with class and optional params keys.

pipeline = [
    {
        "class": "sklearn.preprocessing.StandardScaler"
    }
]

Serializes to (identical to class reference):

"sklearn.preprocessing._data.StandardScaler"

With non-default parameters:

pipeline = [
    {
        "class": "sklearn.model_selection.ShuffleSplit",
        "params": {
            "n_splits": 3,
            "test_size": 0.25
        }
    }
]

Serializes to (same as input, with normalized class path):

{
    "class": "sklearn.model_selection._split.ShuffleSplit",
    "params": {
        "n_splits": 3,
        "test_size": 0.25
    }
}

Hash: Based on class path + non-default params only.

7. Dictionary - Special Operators

Syntax: Dict with operator-specific keys (e.g., y_processing, feature_augmentation).

pipeline = [
    {"y_processing": MinMaxScaler},  # Target variable scaling
    {"feature_augmentation": Detrend}  # Feature engineering
]

Serializes to:

{
    "y_processing": {
        "class": "sklearn.preprocessing._data.MinMaxScaler"
    }
}

Note: The class is wrapped in a dict with a class key during preprocessing (_preprocess_steps()).

Model Step Syntaxes

Models have additional complexity due to:

Custom naming
Training parameters
Hyperparameter optimization (finetuning)
Support for functions (not just classes)

1. Model - Instance

Syntax: Pass a model instance directly.

from sklearn.cross_decomposition import PLSRegression

pipeline = [
    PLSRegression(n_components=10)
]

Serializes to:

{
    "class": "sklearn.cross_decomposition._pls.PLSRegression",
    "params": {
        "n_components": 10
    }
}

Note: If all params are default, serializes to just the class string:

PLSRegression()  # All defaults

Serializes to:

"sklearn.cross_decomposition._pls.PLSRegression"

Hash: Based on class + non-default params.

2. Model - Dict with Name

Syntax: Dict with model and name keys.

pipeline = [
    {
        "model": PLSRegression(n_components=10),
        "name": "PLS_10_components"
    }
]

Serializes to:

{
    "name": "PLS_10_components",
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
}

Purpose: Custom naming for tracking specific models in results.

Hash behavior: The name field affects the hash because it’s part of the step configuration. This means:

# These produce DIFFERENT hashes:
{"model": PLSRegression(n_components=10), "name": "Model_A"}
{"model": PLSRegression(n_components=10), "name": "Model_B"}

Even though the model is identical, different names create different pipeline variants (useful for comparing the same model with different training strategies).

3. Model - Dict with Class and Params

Syntax: Nested dict structure.

pipeline = [
    {
        "model": {
            "class": "sklearn.cross_decomposition.PLSRegression",
            "params": {
                "n_components": 10
            }
        }
    }
]

Serializes to:

{
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
}

With defaults only:

pipeline = [
    {
        "model": {
            "class": "sklearn.cross_decomposition.PLSRegression"
        }
    }
]

Serializes to:

{
    "model": "sklearn.cross_decomposition._pls.PLSRegression"
}

Hash: Based on model class + non-default params.

4. Model - Function (Neural Network)

Syntax: Pass a function that returns a model.

from nirs4all.operators.models.tensorflow.nicon import nicon

pipeline = [
    nicon  # Function that builds a TensorFlow model
]

Serializes to:

{
    "function": "nirs4all.operators.models.cirad_tf.nicon"
}

5. Model - String File Path

Syntax: Path to saved model file.

pipeline = [
    "My_awesome_model.pkl",           # Scikit-learn model
    "My_awesome_tf_model.keras",      # TensorFlow/Keras model
    "my_model.pth"                    # PyTorch model
]

Behavior: Loads model from disk (framework detected by extension).

Serializes to: (preserves file path)

6. Model - With Training Parameters

Syntax: Dict with model and train_params keys (for neural networks).

from nirs4all.operators.models.tensorflow.nicon import nicon

pipeline = [
    {
        "model": nicon,
        "train_params": {
            "epochs": 250,
            "batch_size": 32,
            "verbose": 0
        }
    }
]

Serializes to:

{
    "model": {
        "function": "nirs4all.operators.models.cirad_tf.nicon"
    },
    "train_params": {
        "epochs": 250,
        "batch_size": 32,
        "verbose": 0
    }
}

6b. Model - With Architecture Parameters (Customizable NN)

Syntax: Dict with model, model_params, and optional train_params keys.

Use model_params to customize neural network architecture (filters, kernel sizes, etc.) at training time without finetuning.

from nirs4all.operators.models.pytorch.nicon import customizable_nicon

pipeline = [
    {
        "model": customizable_nicon,
        "name": "CustomNN",
        "model_params": {           # Architecture parameters
            "filters1": 32,
            "filters2": 64,
            "kernel_size1": 9,
            "dropout_rate": 0.3
        },
        "train_params": {           # Training loop parameters
            "epochs": 250,
            "batch_size": 32,
            "lr": 0.001
        }
    }
]

Key distinction:

model_params: Parameters passed to the model builder function (architecture)
train_params: Parameters for the training loop (epochs, batch size, optimizer settings)

Serializes to:

{
    "name": "CustomNN",
    "model": {
        "function": "nirs4all.operators.models.pytorch.nicon.customizable_nicon"
    },
    "model_params": {
        "filters1": 32,
        "filters2": 64,
        "kernel_size1": 9,
        "dropout_rate": 0.3
    },
    "train_params": {
        "epochs": 250,
        "batch_size": 32,
        "lr": 0.001
    }
}

💡 Tip: This is useful when you know the optimal architecture and want to train without finetuning overhead.

7. Model - With Hyperparameter Optimization (Finetuning)

Syntax: Dict with model, optional name, and finetune_params keys.

pipeline = [
    {
        "model": PLSRegression(),
        "name": "PLS-Finetuned",
        "finetune_params": {
            "n_trials": 20,
            "verbose": 2,
            "approach": "single",
            "eval_mode": "best",
            "sample": "grid",
            "model_params": {
                'n_components': ('int', 1, 30),  # Tuple: (type, min, max)
            }
        }
    }
]

Finetuning parameters:

n_trials: Number of optimization trials
verbose: 0=silent, 1=basic, 2=detailed
approach: "single", "grouped", or "individual"
eval_mode: "best" or "avg" (for grouped approach)
sample: Optimizer strategy ("random", "grid", "bayes", "tpe", "hyperband", etc.)
model_params: Dict of parameters to optimize
- Value format: (type, min, max) for ranges
- Or list [value1, value2, ...] for categorical
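To make the two value formats concrete, here is a small sketch of how one value could be drawn from each. It is not the nirs4all optimizer: `sample_param` is hypothetical, only the 'int' type tag shown in this guide is handled, and treating max as inclusive is an assumption.

```python
import random


def sample_param(spec, rng):
    """Draw one value from a search-space entry (hypothetical helper)."""
    is_range = (
        isinstance(spec, (tuple, list))
        and len(spec) == 3
        and spec[0] == "int"
    )
    if is_range:
        _, low, high = spec
        return rng.randint(low, high)  # assumption: both bounds inclusive
    return rng.choice(list(spec))  # plain list: categorical choices


rng = random.Random(0)
n_components = sample_param(("int", 1, 30), rng)
filters = sample_param([8, 16, 32, 64], rng)
```

The check on the leading type tag also accepts the list form ["int", 1, 30], which is what a range looks like after a round trip through JSON or YAML.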

Serializes to:

{
    "name": "PLS-Finetuned",
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression"
    },
    "finetune_params": {
        "n_trials": 20,
        "verbose": 2,
        "approach": "single",
        "eval_mode": "best",
        "sample": "grid",
        "model_params": {
            "n_components": ["int", 1, 30]
        }
    }
}

⚠️ Important: Tuples like ('int', 1, 30) are converted to lists ["int", 1, 30] during serialization for YAML/JSON compatibility.

8. Model - Neural Network with Finetuning

Syntax: Combine function models with hyperparameter optimization.

from nirs4all.operators.models.tensorflow.nicon import customizable_nicon

pipeline = [
    {
        "model": customizable_nicon,
        "name": "NN-Optimized",
        "finetune_params": {
            "n_trials": 30,
            "verbose": 2,
            "sample": "hyperband",
            "approach": "single",
            "model_params": {
                "filters_1": [8, 16, 32, 64],      # Categorical choices
                "filters_3": [8, 16, 32, 64]
            },
            "train_params": {
                "epochs": 10,
                "verbose": 0
            }
        },
        "train_params": {
            "epochs": 250,  # Final training after optimization
            "verbose": 0
        }
    }
]

Two-stage training:

finetune_params.train_params: Used during hyperparameter search (fewer epochs)
train_params: Used for final training with best parameters (more epochs)

9. Model - Custom Code File

Syntax: Dict with source_file and class keys.

pipeline = [
    {
        "source_file": "my_model.py",
        "class": "MyAwesomeModel"
    }
]

Behavior: Dynamically imports MyAwesomeModel from my_model.py.

Serializes to: (preserves source file and class name)
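Dynamic import from a source file can be done with the standard importlib machinery. The sketch below shows one common way to do it; `load_class` is a hypothetical helper and may differ from how nirs4all resolves source_file internally.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_class(source_file, class_name):
    """Execute a Python file as a module and return one class from it."""
    source_file = str(source_file)
    spec = importlib.util.spec_from_file_location(Path(source_file).stem, source_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Demo with a throwaway file standing in for my_model.py.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_model.py"
    path.write_text("class MyAwesomeModel:\n    n_components = 3\n")
    model_cls = load_class(path, "MyAwesomeModel")
```

Note that executing a file this way runs all of its top-level code, so only load source files you trust.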

Generator Syntaxes

Generators allow automatic creation of multiple pipeline variations for experimentation.

1. `_or_` - Alternative Choices

Syntax: Dict with _or_ key containing a list of choices.

preprocessing_options = [
    Detrend, FirstDerivative, Gaussian, StandardNormalVariate
]

pipeline = [
    {"_or_": preprocessing_options}  # Creates 4 pipelines (one per choice)
]

Expands to: 4 separate pipelines, each using one preprocessing method.

2. `_or_` with `count` - Random Sampling

Syntax: Add count key to limit number of generated pipelines.

pipeline = [
    {"_or_": preprocessing_options, "count": 2}  # Randomly select 2
]

Expands to: 2 pipelines (randomly sampled from 4 options).

3. `_or_` with `size` - Combinations

Syntax: Add size key to select N items at once (creates combinations).

pipeline = [
    {"_or_": preprocessing_options, "size": 2}  # Choose 2 at a time
]

Expands to: All combinations of 2 items from 4 options = 6 pipelines:

[Detrend, FirstDerivative]
[Detrend, Gaussian]
[Detrend, StandardNormalVariate]
[FirstDerivative, Gaussian]
[FirstDerivative, StandardNormalVariate]
[Gaussian, StandardNormalVariate]

4. `_or_` with `size` Range - Variable Size

Syntax: Use tuple (from, to) for size range.

pipeline = [
    {"_or_": preprocessing_options, "size": (1, 2)}  # 1 or 2 items
]

Expands to: All combinations of size 1 + all combinations of size 2 = 4 + 6 = 10 pipelines.
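The counts for `size` follow directly from itertools.combinations. A minimal sketch, with option names as strings and a hypothetical helper `or_combinations`:

```python
from itertools import combinations


def or_combinations(options, size):
    """All combinations for a fixed size or an inclusive (from, to) size range."""
    low, high = size if isinstance(size, tuple) else (size, size)
    return [
        list(combo)
        for k in range(low, high + 1)
        for combo in combinations(options, k)
    ]


options = ["Detrend", "FirstDerivative", "Gaussian", "StandardNormalVariate"]
pairs = or_combinations(options, 2)
one_or_two = or_combinations(options, (1, 2))
```

Here `pairs` holds the 6 two-item combinations listed above and `one_or_two` holds 4 + 6 = 10 entries.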

5. `_or_` with `size` and `count` - Limited Combinations

Syntax: Combine size and count to randomly sample from combinations.

pipeline = [
    {"feature_augmentation": {
        "_or_": preprocessing_options,
        "size": (1, 2),
        "count": 5  # Randomly pick 5 combinations
    }}
]

Expands to: 5 randomly sampled pipelines from all possible 1-2 item combinations.

6. `_or_` with Nested Arrays - Second-Order Combinations

Syntax: Use list [outer_size, inner_size] for nested combinations (sub-pipelines).

pipeline = [
    {"feature_augmentation": {
        "_or_": preprocessing_options,
        "size": [2, (1, 2)]  # 2 sub-pipelines, each with 1-2 items
    }}
]

Behavior:

Creates all inner arrangements (permutations of 1-2 items)
Selects outer combinations (choose 2 sub-pipelines)

Example expansion:

[
    [[Detrend], [FirstDerivative]],
    [[Detrend], [Gaussian]],
    [[FirstDerivative, Detrend], [Gaussian, StandardNormalVariate]],
    ...
]

Note: Inner uses permutations (order matters), outer uses combinations (order doesn’t matter).
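The two-level rule can be sketched with itertools, taking the description above literally: permutations for the inner level, combinations for the outer level. `nested_expansion` is a hypothetical helper, and the resulting count is what this reading implies rather than a figure confirmed by nirs4all.

```python
from itertools import combinations, permutations


def nested_expansion(options, outer_size, inner_size):
    """Inner: ordered arrangements of low..high items. Outer: unordered picks."""
    low, high = inner_size
    inner = [
        list(p)
        for k in range(low, high + 1)
        for p in permutations(options, k)
    ]
    return [list(combo) for combo in combinations(inner, outer_size)]


options = ["Detrend", "FirstDerivative", "Gaussian", "StandardNormalVariate"]
expanded = nested_expansion(options, outer_size=2, inner_size=(1, 2))
```

With 4 options there are 4 + 12 = 16 inner arrangements, so choosing 2 of them gives C(16, 2) = 120 candidate configurations under this reading. Combining the nested form with `count` keeps that number manageable.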

7. `_range_` - Numeric Parameter Sweep

Syntax: Dict with _range_ key and model configuration.

pipeline = [
    {
        "_range_": [1, 12, 2],  # Start, end, step
        "param": "n_components",
        "model": {
            "class": "sklearn.cross_decomposition.PLSRegression"
        }
    }
]

Expands to: 6 pipelines with n_components = 1, 3, 5, 7, 9, 11.

Alternative syntax (dict):

{
    "_range_": {"from": 1, "to": 12, "step": 2},
    "param": "n_components",
    "model": PLSRegression
}

8. `_range_` with `count` - Sampled Range

Syntax: Add count key to randomly sample from range.

pipeline = [
    {
        "_range_": [1, 30],  # 30 values
        "count": 10,         # Sample 10 randomly
        "param": "n_components",
        "model": PLSRegression
    }
]

Expands to: 10 pipelines with randomly selected n_components values from 1-30.
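The values produced by a `_range_` spec can be sketched as follows. The examples in this guide imply that the end value is inclusive ([1, 30] gives 30 values), so the sketch assumes that; `expand_range` and its `seed` argument are hypothetical.

```python
import random


def expand_range(spec, count=None, seed=None):
    """Values for a [start, end, step] list or a {"from", "to", "step"} dict."""
    if isinstance(spec, dict):
        start, stop, step = spec["from"], spec["to"], spec.get("step", 1)
    else:
        start, stop, *rest = spec
        step = rest[0] if rest else 1
    values = list(range(start, stop + 1, step))  # end treated as inclusive
    if count is not None and count < len(values):
        # Random subset without replacement, returned in ascending order.
        values = sorted(random.Random(seed).sample(values, count))
    return values


sweep = expand_range([1, 12, 2])
sweep_dict = expand_range({"from": 1, "to": 12, "step": 2})
sampled = expand_range([1, 30], count=10, seed=0)
```

Both `sweep` and `sweep_dict` contain 1, 3, 5, 7, 9, 11, and `sampled` contains 10 distinct values between 1 and 30.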

File Formats

Pipelines can be defined in Python, JSON, or YAML.

Python Format

from nirs4all.pipeline import PipelineConfigs

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5),
    PLSRegression(n_components=10)
]

config = PipelineConfigs(pipeline, name="my_pipeline")

JSON Format

File: pipeline.json

{
    "pipeline": [
        {
            "class": "sklearn.preprocessing._data.MinMaxScaler"
        },
        {
            "class": "sklearn.model_selection._split.ShuffleSplit",
            "params": {
                "n_splits": 5
            }
        },
        {
            "class": "sklearn.cross_decomposition._pls.PLSRegression",
            "params": {
                "n_components": 10
            }
        }
    ]
}

Load:

config = PipelineConfigs("pipeline.json", name="my_pipeline")

YAML Format

File: pipeline.yaml

pipeline:
  - class: sklearn.preprocessing._data.MinMaxScaler

  - class: sklearn.model_selection._split.ShuffleSplit
    params:
      n_splits: 5

  - class: sklearn.cross_decomposition._pls.PLSRegression
    params:
      n_components: 10

Load:

config = PipelineConfigs("pipeline.yaml", name="my_pipeline")

Generator in YAML

pipeline:
  - class: sklearn.preprocessing._data.MinMaxScaler

  - feature_augmentation:
      _or_:
        - class: nirs4all.operators.transforms.Detrend
        - class: nirs4all.operators.transforms.FirstDerivative
        - class: nirs4all.operators.transforms.Gaussian
      size: [1, 2]
      count: 5

  - class: sklearn.model_selection._split.ShuffleSplit
    params:
      n_splits: 5

  - _range_: [1, 12, 2]
    param: n_components
    model:
      class: sklearn.cross_decomposition._pls.PLSRegression

Note: Tuples in Python become lists in JSON/YAML:

Python: ('int', 1, 30) → YAML: ["int", 1, 30]

Syntax Reference Table

Syntax	Example	Use Case	Serializes To
Class	`StandardScaler`	Default params	`"sklearn.preprocessing._data.StandardScaler"`
Instance (defaults)	`StandardScaler()`	Default params	`"sklearn.preprocessing._data.StandardScaler"` (same as class)
Instance (default values)	`MinMaxScaler(feature_range=(0,1))`	Explicit defaults	`"sklearn.preprocessing._data.MinMaxScaler"` (no params)
Instance (custom)	`MinMaxScaler(feature_range=(0,2))`	Non-default params	`{"class": "...", "params": {"feature_range": [0, 2]}}`
String (module)	`"sklearn.preprocessing.StandardScaler"`	Portable reference	`"sklearn.preprocessing._data.StandardScaler"` (same as class)
String (controller)	`"chart_2d"`	Built-in operator	`"chart_2d"`
String (file)	`"transformer.pkl"`	Saved object	`"transformer.pkl"`
Dict (explicit, defaults)	`{"class": "sklearn.preprocessing.StandardScaler"}`	Full control	`"sklearn.preprocessing._data.StandardScaler"` (same as class)
Dict (explicit, params)	`{"class": "...", "params": {...}}`	Full control	`{"class": "...", "params": {...}}` (non-default only)
Dict (operator)	`{"y_processing": MinMaxScaler}`	Special operator	`{"y_processing": {"class": "sklearn.preprocessing._data.MinMaxScaler"}}`
Model + Name	`{"model": PLSRegression(), "name": "PLS-10"}`	Named model	`{"name": "...", "model": "..."}` or with params
Function	`nicon`	NN builder	`{"function": "module.nicon"}`
Model + Train	`{"model": nicon, "train_params": {...}}`	NN with config	`{"model": {...}, "train_params": {...}}`
Model + Finetune	`{"model": PLSRegression(), "finetune_params": {...}}`	HPO	`{"model": "...", "finetune_params": {...}}` or with params
Generator (or)	`{"_or_": [A, B, C]}`	Alternatives	Expands to N pipelines
Generator (range)	`{"_range_": [1, 10, 2], "param": "n", "model": ...}`	Param sweep	Expands to M pipelines
Generator + size	`{"_or_": [...], "size": 2}`	Combinations	C(n, k) pipelines
Generator + count	`{"_or_": [...], "count": 5}`	Random sample	5 pipelines
Nested generator	`{"_or_": [...], "size": [2, (1,2)]}`	Sub-pipelines	Complex expansion

Key principle: Different syntaxes producing the same object (same class + same non-default params) → same serialization → same hash.

Serialization Rules

nirs4all normalizes all syntaxes to a canonical form for storage and reproducibility.

Rule 1: Classes → String Paths (When Defaults Only)

Input (all produce same object):

from sklearn.preprocessing import StandardScaler

# Syntax 1: Class
StandardScaler

# Syntax 2: Instance with defaults
StandardScaler()

# Syntax 3: String path
"sklearn.preprocessing.StandardScaler"

# Syntax 4: Dict with class only
{"class": "sklearn.preprocessing.StandardScaler"}

All serialize to the same canonical form:

"sklearn.preprocessing._data.StandardScaler"

Hash: Based on class path only (all produce identical hash).

Note: Uses internal module path (may differ from import path).

Rule 2: Instances → Dict with Params (Only Non-Defaults)

Input with non-default parameter:

MinMaxScaler(feature_range=(0, 2))  # (0, 2) is NOT default

Serialized:

{
    "class": "sklearn.preprocessing._data.MinMaxScaler",
    "params": {
        "feature_range": [0, 2]
    }
}

Input with default parameter:

MinMaxScaler(feature_range=(0, 1))  # (0, 1) IS default

Serialized (no params, identical to class):

"sklearn.preprocessing._data.MinMaxScaler"

Key behavior: Only non-default parameters are included (via _changed_kwargs()). This ensures:

MinMaxScaler (class)
MinMaxScaler() (instance, defaults)
MinMaxScaler(feature_range=(0, 1)) (instance, explicit defaults)
"sklearn.preprocessing.MinMaxScaler" (string)

All produce the same serialization and hash.

Hash: Based on class path + JSON representation of non-default params.

Rule 3: Tuples → Lists

Reason: YAML’s safe_load() cannot deserialize Python-specific tuples.

Input:

{
    "feature_range": (0, 1)
}

Serialized:

{
    "feature_range": [0, 1]
}

Note: Hyperparameter range tuples ('int', min, max) follow the same rule: they are written as lists in YAML, and the JSON serialization phase still treats them as (type, min, max) ranges.
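A recursive conversion of this kind takes only a few lines. The sketch below is illustrative; `tuples_to_lists` is a hypothetical name and not the function nirs4all uses.

```python
import json


def tuples_to_lists(value):
    """Recursively replace tuples with lists inside dicts and sequences."""
    if isinstance(value, (tuple, list)):
        return [tuples_to_lists(item) for item in value]
    if isinstance(value, dict):
        return {key: tuples_to_lists(item) for key, item in value.items()}
    return value


config = {
    "feature_range": (0, 1),
    "finetune_params": {"model_params": {"n_components": ("int", 1, 30)}},
}
clean = tuples_to_lists(config)
encoded = json.dumps(clean, sort_keys=True)
```

After conversion the structure contains only dicts, lists and scalars, so it survives a round trip through JSON unchanged.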

Rule 4: Functions → Dict with Function Key

Input:

from nirs4all.operators.models.tensorflow.nicon import nicon
nicon

Serialized:

{
    "function": "nirs4all.operators.models.cirad_tf.nicon"
}

Rule 5: Nested Dicts Recursively Serialized

Input:

{
    "model": {
        "class": PLSRegression,
        "params": {"n_components": 10}
    }
}

Serialized:

{
    "model": {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
}

Rule 6: Special Keys Preserved

Generator keys (_or_, _range_, size, count) and operator keys (y_processing, feature_augmentation, etc.) are preserved as-is.

Rule 7: Minimal Serialization (Hash-Based Uniqueness)

The _changed_kwargs() function compares current values to defaults:

import inspect

def _changed_kwargs(obj):
    """Return {param: value} for every __init__ param whose current
    value differs from its default."""
    sig = inspect.signature(obj.__class__.__init__)
    out = {}

    for name, param in sig.parameters.items():
        if name == "self":
            continue

        # Parameters without a default are compared against None
        default = param.default if param.default is not inspect.Parameter.empty else None
        current = getattr(obj, name, default)

        if current != default:
            out[name] = current  # Only save if different!

    return out

Example comparisons:

Input	Default Value	Serialized Params	Serialized Form
`MinMaxScaler`	N/A (class)	None	`"sklearn.preprocessing._data.MinMaxScaler"`
`MinMaxScaler()`	All defaults	None	`"sklearn.preprocessing._data.MinMaxScaler"`
`MinMaxScaler(feature_range=(0,1))`	`(0, 1)`	None (matches default)	`"sklearn.preprocessing._data.MinMaxScaler"`
`MinMaxScaler(feature_range=(0,2))`	`(0, 1)`	`{"feature_range": [0, 2]}`	`{"class": "...", "params": {...}}`
`PLSRegression()`	All defaults	None	`"sklearn.cross_decomposition._pls.PLSRegression"`
`PLSRegression(n_components=2)`	`2`	None (matches default)	`"sklearn.cross_decomposition._pls.PLSRegression"`
`PLSRegression(n_components=10)`	`2`	`{"n_components": 10}`	`{"class": "...", "params": {...}}`

This keeps serialized pipelines minimal and ensures hash-based uniqueness:

✅ Same object (same class + same effective params) → Same serialization → Same hash
✅ Different objects (different class OR different params) → Different serialization → Different hash
✅ Pipeline deduplication works correctly
✅ Configuration variations are properly detected

Best Practices

✅ Do

Use any syntax you prefer - they all normalize correctly:
- Class: StandardScaler
- Instance with defaults: StandardScaler()
- Instance with custom params: MinMaxScaler(feature_range=(0, 2))
- String: "sklearn.preprocessing.StandardScaler"
- Dict: {"class": "sklearn.preprocessing.StandardScaler"}
Trust the serialization - same object = same hash, regardless of syntax
Use dicts for portability: {"class": "...", "params": {...}} works in JSON/YAML files
Name important models: {"model": PLSRegression(), "name": "PLS-Best"}
Use generators for experimentation: {"_or_": [...], "count": 10}
Don’t worry about defaults - _changed_kwargs() handles it automatically:
- PLSRegression() and PLSRegression(n_components=2) → same serialization (2 is default)
- PLSRegression(n_components=10) → different serialization (10 is non-default)

❌ Avoid

Mixing syntaxes unnecessarily: Choose one style per project
Hardcoding internal paths: Use public import paths when possible
Over-specifying generators: The number of generated pipelines can explode with nested _or_; cap it with count
Forgetting YAML limits: Tuples become lists (usually fine, but be aware)
Custom objects without serialization support: Extend serialize_component() if needed

Examples

Example 1: Hash Equivalence - Same Object, Different Syntaxes

All these pipeline definitions produce the same hash:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cross_decomposition import PLSRegression

# Pipeline A: Using classes and instances with defaults
pipeline_a = [
    StandardScaler,  # Class
    PLSRegression()  # Instance with defaults (n_components=2 is default)
]

# Pipeline B: Using instances with explicit defaults
pipeline_b = [
    StandardScaler(),
    PLSRegression(n_components=2)  # Explicit default
]

# Pipeline C: Using string paths
pipeline_c = [
    "sklearn.preprocessing.StandardScaler",
    "sklearn.cross_decomposition.PLSRegression"
]

# Pipeline D: Using dicts
pipeline_d = [
    {"class": "sklearn.preprocessing.StandardScaler"},
    {"class": "sklearn.cross_decomposition.PLSRegression"}
]

# Pipeline E: Mixed syntaxes
pipeline_e = [
    StandardScaler,  # Class
    {"class": "sklearn.cross_decomposition.PLSRegression"}  # Dict
]

All serialize to:

[
    "sklearn.preprocessing._data.StandardScaler",
    "sklearn.cross_decomposition._pls.PLSRegression"
]

Hash: get_hash(steps) produces an identical MD5 hash for all 5 pipelines.

Result: nirs4all recognizes them as the same pipeline and won’t run duplicates.

Example 2: Hash Difference - Different Objects

These pipelines produce different hashes:

# Pipeline A: PLSRegression with default n_components
pipeline_a = [
    StandardScaler,
    PLSRegression()  # n_components=2 (default)
]

# Pipeline B: PLSRegression with non-default n_components
pipeline_b = [
    StandardScaler,
    PLSRegression(n_components=10)  # n_components=10 (non-default)
]

Pipeline A serializes to:

[
    "sklearn.preprocessing._data.StandardScaler",
    "sklearn.cross_decomposition._pls.PLSRegression"
]

Pipeline B serializes to:

[
    "sklearn.preprocessing._data.StandardScaler",
    {
        "class": "sklearn.cross_decomposition._pls.PLSRegression",
        "params": {
            "n_components": 10
        }
    }
]

Hash: Different hashes → Recognized as different pipelines → Both will run.

Example 3: Simple Regression Pipeline

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import ShuffleSplit
from sklearn.cross_decomposition import PLSRegression

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.25),
    PLSRegression(n_components=10)
]

Example 4: Multi-Model Comparison

pipeline = [
    MinMaxScaler(),
    ShuffleSplit(n_splits=5, test_size=0.25),
    {"model": PLSRegression(n_components=5), "name": "PLS-5"},
    {"model": PLSRegression(n_components=10), "name": "PLS-10"},
    {"model": PLSRegression(n_components=15), "name": "PLS-15"}
]

Example 5: Preprocessing Exploration

from nirs4all.operators.transforms import Detrend, FirstDerivative, Gaussian

pipeline = [
    MinMaxScaler(),
    {"_or_": [Detrend, FirstDerivative, Gaussian]},  # 3 variations
    ShuffleSplit(n_splits=5, test_size=0.25),
    PLSRegression(n_components=10)
]
# Expands to 3 pipelines

Example 6: Hyperparameter Optimization

pipeline = [
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},
    ShuffleSplit(n_splits=5, test_size=0.25),
    {
        "model": PLSRegression(),
        "name": "PLS-Optimized",
        "finetune_params": {
            "n_trials": 50,
            "verbose": 2,
            "approach": "single",
            "sample": "tpe",
            "model_params": {
                'n_components': ('int', 1, 30)
            }
        }
    }
]

Example 7: Complex Generator

from nirs4all.operators.transforms import (
    Detrend, FirstDerivative, Gaussian, StandardNormalVariate
)

pipeline = [
    MinMaxScaler(),
    {"feature_augmentation": {
        "_or_": [Detrend, FirstDerivative, Gaussian, StandardNormalVariate],
        "size": (1, 2),  # 1 or 2 preprocessing steps
        "count": 5        # Random sample 5 combinations
    }},
    ShuffleSplit(n_splits=5, test_size=0.25),
    {
        "_range_": [5, 20, 5],  # 5, 10, 15, 20
        "param": "n_components",
        "model": PLSRegression
    }
]
# Expands to 5 * 4 = 20 pipelines
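The count of 20 in Example 7 can be checked by hand. The sketch below reproduces the arithmetic, assuming that size (1, 2) means unordered combinations of 1 or 2 operators and that _range_ includes its end value, as the inline comment "5, 10, 15, 20" indicates. Strings stand in for the operator classes, and the fixed random seed is only there to make the sample repeatable.

```python
import random
from itertools import combinations

operators = ["Detrend", "FirstDerivative", "Gaussian", "StandardNormalVariate"]

# size (1, 2): all subsets of 1 or 2 operators -> 4 singles + 6 pairs
candidates = [
    combo
    for size in (1, 2)
    for combo in combinations(operators, size)
]
print(len(candidates))  # 10

# count 5: keep a random sample of 5 of those 10 combinations
sampled = random.Random(0).sample(candidates, 5)

# _range_ [5, 20, 5]: start, inclusive end, step
n_components = list(range(5, 20 + 1, 5))
print(n_components)  # [5, 10, 15, 20]

total = len(sampled) * len(n_components)
print(total)  # 20
```

Without count, the same generator would produce 10 * 4 = 40 pipelines, so count halves the work here.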

Example 8: Neural Network with Training Config

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import ShuffleSplit
from nirs4all.operators.models.tensorflow.nicon import customizable_nicon

pipeline = [
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},
    ShuffleSplit(n_splits=5, test_size=0.25),
    {
        "model": customizable_nicon,
        "name": "CustomNN",
        "finetune_params": {
            "n_trials": 30,
            "verbose": 2,
            "sample": "hyperband",
            "approach": "single",
            "model_params": {
                "filters_1": [8, 16, 32, 64],
                "filters_2": [8, 16, 32, 64]
            },
            "train_params": {
                "epochs": 10,  # Fast training during search
                "verbose": 0
            }
        },
        "train_params": {
            "epochs": 250,  # Full training with best params
            "verbose": 1
        }
    }
]
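Examples 6 and 8 use two shapes of search-space entry in model_params: a tuple such as ('int', 1, 30) and a plain list such as [8, 16, 32, 64]. The sketch below shows one plausible reading, in which the tuple is an integer range and the list is a set of categorical choices. This interpretation is inferred from the examples, not confirmed against the nirs4all sampler; the function sample_params is hypothetical, and treating the upper bound as inclusive is an assumption.

```python
import random


def sample_params(space, rng):
    """Draw one candidate parameter set from a search-space dict.

    Hypothetical helper; assumes ("int", low, high) is an inclusive
    integer range and any other list is a set of categorical choices.
    """
    params = {}
    for name, spec in space.items():
        is_int_range = (
            isinstance(spec, (tuple, list))
            and len(spec) == 3
            and spec[0] == "int"
        )
        if is_int_range:
            _, low, high = spec
            params[name] = rng.randint(low, high)  # inclusive bounds (assumed)
        elif isinstance(spec, list):
            params[name] = rng.choice(spec)
        else:
            raise ValueError(f"Unrecognized search-space entry for {name!r}: {spec!r}")
    return params


space = {
    "n_components": ("int", 1, 30),   # shape used in Example 6
    "filters_1": [8, 16, 32, 64],     # shape used in Example 8
}

rng = random.Random(42)
for trial in range(3):
    print(trial, sample_params(space, rng))
```

A real optimizer such as TPE chooses candidates adaptively rather than uniformly at random; the random draw here only illustrates what each entry describes.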

Troubleshooting

Error: “could not determine a constructor for the tag ‘tag:yaml.org,2002:python/tuple’”

Cause: Python tuples in pipeline config (e.g., ('int', 1, 30)) cannot be deserialized by yaml.safe_load().

Solution: This is fixed automatically by _sanitize_for_yaml() in the manifest manager. If you see this error, ensure you’re using the latest nirs4all version.

Manual fix (if needed):

# Change from:
"model_params": {
    'n_components': ('int', 1, 30)
}

# To:
"model_params": {
    'n_components': ['int', 1, 30]
}
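For larger configurations, the same tuple-to-list conversion can be applied recursively instead of by hand. The helper below is a minimal sketch with a hypothetical name, tuples_to_lists; it is not the implementation of _sanitize_for_yaml(), whose internals are not shown in this guide.

```python
def tuples_to_lists(obj):
    """Return a copy of obj with every tuple replaced by a list.

    Hypothetical helper for illustration; walks dicts, lists, and tuples.
    """
    if isinstance(obj, dict):
        return {key: tuples_to_lists(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [tuples_to_lists(item) for item in obj]
    return obj  # scalars and other objects are left unchanged


finetune_params = {
    "n_trials": 50,
    "model_params": {
        "n_components": ("int", 1, 30),
    },
}

print(tuples_to_lists(finetune_params))
# {'n_trials': 50, 'model_params': {'n_components': ['int', 1, 30]}}
```

The result contains only dicts, lists, and scalars, which plain YAML and JSON can represent without Python-specific tags.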

Error: “Pipeline configuration expansion would generate X configurations, exceeding the limit”

Cause: Generator expansion creates too many pipelines (default limit: 10,000).

Solutions:

Add count parameter: {"_or_": [...], "count": 100}
Increase limit: PipelineConfigs(pipeline, max_generation_count=50000)
Simplify generator (reduce options or nesting)
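Before running a large generator, it can help to estimate how many configurations it will produce. The sketch below multiplies the number of options at each step and compares the result with the default limit of 10,000 quoted above. The function names are hypothetical, and the estimate covers only top-level _or_ and _range_ steps with an optional count; it ignores size and nested generators, so it is not a substitute for the library's own check.

```python
from math import prod

DEFAULT_LIMIT = 10_000  # default limit stated in this guide


def options_in_step(step):
    """Number of alternatives a single step contributes (illustrative)."""
    if not isinstance(step, dict):
        return 1
    if "_or_" in step:
        n = len(step["_or_"])
    elif "_range_" in step:
        start, end, stride = step["_range_"]
        n = len(range(start, end + 1, stride))  # inclusive end, positive stride
    else:
        return 1
    # "count" caps the number of sampled alternatives
    return min(n, step["count"]) if "count" in step else n


def estimate_configurations(steps):
    return prod(options_in_step(step) for step in steps)


many_options = [f"op_{i}" for i in range(200)]  # placeholder operators

uncapped = ["MinMaxScaler", {"_or_": many_options}, {"_range_": [1, 100, 1]}]
capped = ["MinMaxScaler", {"_or_": many_options, "count": 50}, {"_range_": [1, 100, 1]}]

for name, steps in (("uncapped", uncapped), ("capped", capped)):
    total = estimate_configurations(steps)
    print(name, total, "exceeds limit" if total > DEFAULT_LIMIT else "ok")
# uncapped 20000 exceeds limit
# capped 5000 ok
```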

Error: “Failed to import module.ClassName”

Cause: Invalid class path or missing dependency.

Solutions:

Check the import path: the public import from sklearn.preprocessing import StandardScaler corresponds to the internal class path sklearn.preprocessing._data.StandardScaler
Ensure dependencies are installed: pip install scikit-learn tensorflow torch (PyTorch is published on PyPI as torch, not pytorch)
Use an instance instead of a string: StandardScaler() rather than "sklearn.preprocessing.StandardScaler"
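A quick way to find out whether a string path is the problem is to resolve it yourself before putting it in a pipeline. The sketch below uses only importlib from the standard library; resolve_class_path is a hypothetical helper, not a nirs4all function, and the library's own resolution logic may differ.

```python
import importlib


def resolve_class_path(path):
    """Import "package.module.ClassName" and return the named attribute.

    Hypothetical helper for diagnosing bad class paths.
    """
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path like 'module.ClassName', got {path!r}")
    module = importlib.import_module(module_name)  # ImportError if module is missing
    return getattr(module, attr_name)              # AttributeError if the name is wrong


for path in (
    "collections.OrderedDict",               # always available
    "sklearn.preprocessing.StandardScaler",  # works only if scikit-learn is installed
    "sklearn.preprocessing.StandardScalar",  # typo in the class name
):
    try:
        print(path, "->", resolve_class_path(path))
    except (ImportError, AttributeError) as exc:
        print(path, "failed:", type(exc).__name__, exc)
```

An ImportError points to a missing dependency or a wrong module name, while an AttributeError means the module was found but the class name is misspelled.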
