Generator Keywords Reference
This document provides a comprehensive reference for all generator keywords used in nirs4all pipeline configuration expansion.
Table of Contents
Overview
The generator module expands pipeline configuration specifications into concrete pipeline variants. It takes a single configuration with combinatorial keywords and generates all possible combinations.
Basic Import
from nirs4all.pipeline.config.generator import (
# Core API
expand_spec,
expand_spec_with_choices,
count_combinations,
# Iterator API
expand_spec_iter,
batch_iter,
iter_with_progress,
# Validation
validate_spec,
validate_config,
validate_expanded_configs,
# Presets
PRESET_KEYWORD,
register_preset,
unregister_preset,
get_preset,
get_preset_info,
list_presets,
clear_presets,
has_preset,
is_preset_reference,
resolve_preset,
resolve_presets_recursive,
export_presets,
import_presets,
register_builtin_presets,
# Constraints
apply_mutex_constraint,
apply_requires_constraint,
apply_exclude_constraint,
apply_all_constraints,
parse_constraints,
validate_constraints,
# Export utilities
to_dataframe,
diff_configs,
summarize_configs,
get_expansion_tree,
print_expansion_tree,
format_config_table,
ExpansionTreeNode,
# Keyword constants
OR_KEYWORD,
RANGE_KEYWORD,
LOG_RANGE_KEYWORD,
GRID_KEYWORD,
ZIP_KEYWORD,
CHAIN_KEYWORD,
SAMPLE_KEYWORD,
CARTESIAN_KEYWORD,
SIZE_KEYWORD,
COUNT_KEYWORD,
SEED_KEYWORD,
WEIGHTS_KEYWORD,
PICK_KEYWORD,
ARRANGE_KEYWORD,
THEN_PICK_KEYWORD,
THEN_ARRANGE_KEYWORD,
TAGS_KEYWORD,
METADATA_KEYWORD,
MUTEX_KEYWORD,
REQUIRES_KEYWORD,
DEPENDS_ON_KEYWORD,
EXCLUDE_KEYWORD,
# Detection functions
is_generator_node,
is_pure_or_node,
is_pure_range_node,
is_pure_log_range_node,
is_pure_grid_node,
is_pure_zip_node,
is_pure_chain_node,
is_pure_sample_node,
is_pure_cartesian_node,
# Extraction functions
extract_modifiers,
extract_base_node,
extract_or_choices,
extract_range_spec,
extract_tags,
extract_metadata,
extract_constraints,
# Strategies (advanced usage)
ExpansionStrategy,
get_strategy,
register_strategy,
RangeStrategy,
OrStrategy,
LogRangeStrategy,
GridStrategy,
ZipStrategy,
ChainStrategy,
SampleStrategy,
CartesianStrategy,
)
Phase 1-2: Core Keywords
_or_
Select from a list of alternatives. Each choice becomes a separate configuration variant.
Syntax:
{"_or_": [choice1, choice2, ...]}
Examples:
# Simple string choices
{"_or_": ["StandardScaler", "MinMaxScaler", "RobustScaler"]}
# → ["StandardScaler", "MinMaxScaler", "RobustScaler"]
# Dictionary choices
{"_or_": [
{"class": "PCA", "n_components": 10},
{"class": "SVD", "n_components": 10},
]}
# → [{"class": "PCA", "n_components": 10}, {"class": "SVD", "n_components": 10}]
# Mixed types
{"_or_": [None, 5, {"window": 11}]}
# → [None, 5, {"window": 11}]
Modifiers: size, pick, arrange, then_pick, then_arrange, count
_range_
Generate a sequence of numeric values.
Syntax:
# Array syntax
{"_range_": [start, end]} # Inclusive, step=1
{"_range_": [start, end, step]} # With custom step
# Dict syntax
{"_range_": {"from": start, "to": end, "step": step}}
Examples:
{"_range_": [1, 5]}
# → [1, 2, 3, 4, 5]
{"_range_": [0, 20, 5]}
# → [0, 5, 10, 15, 20]
{"_range_": {"from": 10, "to": 50, "step": 10}}
# → [10, 20, 30, 40, 50]
size
(Legacy) Select combinations of N items from _or_ choices. Equivalent to pick.
Syntax:
{"_or_": [...], "size": n} # Fixed size
{"_or_": [...], "size": (min, max)} # Range of sizes
{"_or_": [...], "size": [outer, inner]} # Second-order (nested)
Examples:
# Select 2 from 4 items → C(4,2) = 6 combinations
{"_or_": ["A", "B", "C", "D"], "size": 2}
# → [["A", "B"], ["A", "C"], ["A", "D"], ["B", "C"], ["B", "D"], ["C", "D"]]
# Size range
{"_or_": ["A", "B", "C"], "size": (1, 2)}
# → [["A"], ["B"], ["C"], ["A", "B"], ["A", "C"], ["B", "C"]]
pick
(Explicit) Unordered selection - combinations where order doesn’t matter.
Syntax:
{"_or_": [...], "pick": n} # Fixed size
{"_or_": [...], "pick": (min, max)} # Range of sizes
Mathematical formula: C(n, k) = n! / (k! × (n-k)!)
Examples:
# Pick 2 from 3 → C(3,2) = 3
{"_or_": ["A", "B", "C"], "pick": 2}
# → [["A", "B"], ["A", "C"], ["B", "C"]]
Use cases:
concat_transformwhere feature order doesn’t matterfeature_augmentationfor parallel channelsAny scenario where [A, B] and [B, A] should be treated as equivalent
arrange
(Explicit) Ordered arrangement - permutations where order matters.
Syntax:
{"_or_": [...], "arrange": n} # Fixed size
{"_or_": [...], "arrange": (min, max)} # Range of sizes
Mathematical formula: P(n, k) = n! / (n-k)!
Examples:
# Arrange 2 from 3 → P(3,2) = 6
{"_or_": ["A", "B", "C"], "arrange": 2}
# → [["A", "B"], ["A", "C"], ["B", "A"], ["B", "C"], ["C", "A"], ["C", "B"]]
Use cases:
Sequential preprocessing pipelines
Any scenario where order of operations affects results
When [A, B] and [B, A] should be treated as different configurations
then_pick
Second-order operation: apply combinations to the results of a primary selection.
Syntax:
{"_or_": [...], "pick": n1, "then_pick": n2}
{"_or_": [...], "arrange": n1, "then_pick": n2}
Example:
# Pick 2, then pick 2 from those 3 results
{"_or_": ["A", "B", "C"], "pick": 2, "then_pick": 2}
# Step 1: pick=2 → C(3,2) = 3 combos: [A,B], [A,C], [B,C]
# Step 2: then_pick=2 → C(3,2) = 3 selections of those combos
then_arrange
Second-order operation: apply permutations to the results of a primary selection.
Syntax:
{"_or_": [...], "pick": n1, "then_arrange": n2}
{"_or_": [...], "arrange": n1, "then_arrange": n2}
Example:
# Pick 2, then arrange 2 from those results
{"_or_": ["A", "B", "C"], "pick": 2, "then_arrange": 2}
# Step 1: pick=2 → 3 combos: [A,B], [A,C], [B,C]
# Step 2: then_arrange=2 → P(3,2) = 6 arrangements
count
Limit the number of results returned. With a seed, results are deterministic.
Syntax:
{"_or_": [...], "count": n}
{"_or_": [...], "size": k, "count": n}
Example:
# Get 2 random items from 5
{"_or_": ["A", "B", "C", "D", "E"], "count": 2}
# → 2 randomly selected items
# With seed for reproducibility
expand_spec({"_or_": ["A", "B", "C", "D", "E"], "count": 2}, seed=42)
# → Same 2 items every time with seed=42
Phase 3: Advanced Keywords
_log_range_
Generate logarithmically-spaced numeric sequences. Useful for hyperparameter optimization over values spanning multiple orders of magnitude.
Syntax:
# Array syntax: [from, to, num_values]
{"_log_range_": [start, end, num]}
# Dict syntax
{"_log_range_": {"from": start, "to": end, "num": n}}
{"_log_range_": {"from": start, "to": end, "base": b}} # Custom base
Examples:
# 4 values from 0.001 to 1 (base 10)
{"_log_range_": [0.001, 1, 4]}
# → [0.001, 0.01, 0.1, 1.0]
# Learning rate search
{"_log_range_": [0.0001, 0.1, 5]}
# → [0.0001, 0.001, 0.01, 0.1, 1.0] (approximately)
# Base 2 powers
{"_log_range_": {"from": 1, "to": 256, "num": 9, "base": 2}}
# → [1, 2, 4, 8, 16, 32, 64, 128, 256]
_grid_
Generate Cartesian product of parameter spaces. Similar to sklearn’s ParameterGrid.
Syntax:
{"_grid_": {"param1": [v1, v2, ...], "param2": [v3, v4, ...]}}
Examples:
{"_grid_": {"learning_rate": [0.01, 0.1], "batch_size": [16, 32, 64]}}
# → 2 × 3 = 6 configurations:
# [{"learning_rate": 0.01, "batch_size": 16},
# {"learning_rate": 0.01, "batch_size": 32},
# {"learning_rate": 0.01, "batch_size": 64},
# {"learning_rate": 0.1, "batch_size": 16},
# {"learning_rate": 0.1, "batch_size": 32},
# {"learning_rate": 0.1, "batch_size": 64}]
_zip_
Parallel iteration - pair values at the same index (like Python’s zip).
Syntax:
{"_zip_": {"param1": [v1, v2, ...], "param2": [v3, v4, ...]}}
Examples:
{"_zip_": {"x": [1, 2, 3], "y": ["A", "B", "C"]}}
# → 3 configurations (paired by position):
# [{"x": 1, "y": "A"}, {"x": 2, "y": "B"}, {"x": 3, "y": "C"}]
Comparison with _grid_:
# _zip_ pairs by position
{"_zip_": {"x": [1, 2], "y": ["A", "B"]}}
# → [{"x": 1, "y": "A"}, {"x": 2, "y": "B"}]
# _grid_ generates all combinations
{"_grid_": {"x": [1, 2], "y": ["A", "B"]}}
# → [{"x": 1, "y": "A"}, {"x": 1, "y": "B"}, {"x": 2, "y": "A"}, {"x": 2, "y": "B"}]
_chain_
Sequential ordered choices. Preserves order (unlike _or_ which may be randomized).
Syntax:
{"_chain_": [config1, config2, config3, ...]}
Examples:
{"_chain_": [
{"model": "baseline", "complexity": "low"},
{"model": "improved", "complexity": "medium"},
{"model": "best", "complexity": "high"}
]}
# → Configurations in that exact order
Use cases:
Progressive experiments: baseline → improved → best
When configuration order has meaning
_sample_
Statistical sampling from various distributions.
Syntax:
{"_sample_": {"distribution": "uniform|log_uniform|normal|choice", ...}}
Distributions:
Distribution |
Parameters |
Description |
|---|---|---|
|
|
Uniform distribution between from and to |
|
|
Log-uniform (common for learning rates) |
|
|
Normal distribution |
|
|
Random selection from list |
Examples:
# Uniform sampling
{"_sample_": {"distribution": "uniform", "from": 0.1, "to": 1.0, "num": 5}}
# → 5 random values uniformly distributed between 0.1 and 1.0
# Log-uniform (learning rate search)
{"_sample_": {"distribution": "log_uniform", "from": 0.0001, "to": 0.1, "num": 5}}
# → 5 values with log-uniform distribution
# Normal distribution
{"_sample_": {"distribution": "normal", "mean": 0, "std": 1, "num": 5}}
# → 5 values from standard normal distribution
# Random choice
{"_sample_": {"distribution": "choice", "values": ["A", "B", "C", "D"], "num": 3}}
# → 3 randomly selected values (with replacement)
_metadata_
Attach arbitrary metadata to configurations.
Syntax:
{"_or_": [...], "_metadata_": {"key": "value", ...}}
Phase 4: Production Keywords
_cartesian_
Generate the Cartesian product of multiple stages (each with _or_ choices), then apply pick/arrange selection on the resulting complete pipelines. This is the key pattern for preprocessing pipeline generation.
Syntax:
{"_cartesian_": [stage1, stage2, ...]}
{"_cartesian_": [stage1, stage2, ...], "pick": N}
{"_cartesian_": [stage1, stage2, ...], "arrange": N}
Examples:
# Generate all pipeline combinations (3×3×3 = 27), then pick 2
{"_cartesian_": [
{"_or_": ["MSC", "SNV", "EMSC"]},
{"_or_": ["SavGol", "Gaussian", None]},
{"_or_": [None, "Deriv1", "Deriv2"]}
], "pick": 2}
# → All 2-combinations of the 27 complete pipelines
# Pick 1-3 complete pipelines with count limit
{"_cartesian_": [
{"_or_": ["A", "B"]},
{"_or_": ["X", "Y"]}
], "pick": (1, 3), "count": 20}
Difference from _grid_:
_grid_produces dicts (parameter combinations)_cartesian_produces lists (ordered stages), ideal for preprocessing pipelines
Use cases:
Preprocessing pipeline generation
Any staged pipeline where order matters
When you want to select from complete pipeline variants
_mutex_
Mutual exclusion constraint - certain items cannot appear together.
Syntax:
{"_or_": [...], "pick": n, "_mutex_": [[item1, item2], [item3, item4]]}
Example:
# A and B cannot appear together
{"_or_": ["A", "B", "C", "D"], "pick": 2, "_mutex_": [["A", "B"]]}
# All combinations: [A,B], [A,C], [A,D], [B,C], [B,D], [C,D]
# After _mutex_: [A,C], [A,D], [B,C], [B,D], [C,D] (A,B excluded)
_requires_
Dependency constraint - if item A is selected, item B must also be selected.
Syntax:
{"_or_": [...], "pick": n, "_requires_": [[trigger, required1, required2]]}
Example:
# If A is selected, C must also be selected
{"_or_": ["A", "B", "C", "D"], "pick": 2, "_requires_": [["A", "C"]]}
# Valid: [A,C], [B,C], [B,D], [C,D]
# Invalid: [A,B], [A,D] (A without C)
_depends_on_
Conditional expansion - expansion depends on the value of another parameter.
Syntax:
{"_or_": [...], "_depends_on_": "other_param"}
Use cases:
Conditional hyperparameter spaces
Parameters that only apply when another parameter has a certain value
_exclude_
Exclude specific combinations from results.
Syntax:
{"_or_": [...], "pick": n, "_exclude_": [[combo1], [combo2]]}
Example:
# Exclude specific combinations [A,C] and [B,D]
{"_or_": ["A", "B", "C", "D"], "pick": 2, "_exclude_": [["A", "C"], ["B", "D"]]}
# Remaining: [A,B], [A,D], [B,C], [C,D]
_preset_
Reference a named preset configuration.
Syntax:
{"_preset_": "preset_name"}
Usage:
from nirs4all.pipeline.config.generator import register_preset, resolve_presets_recursive
# Register presets
register_preset(
"spectral_transforms",
{"_or_": ["SNV", "MSC", "Detrend"], "pick": (1, 2)},
description="Common spectral preprocessing"
)
register_preset(
"pls_components",
{"_range_": [2, 15]}
)
# Use in configuration
config = {
"transforms": {"_preset_": "spectral_transforms"},
"model": {
"class": "PLSRegression",
"n_components": {"_preset_": "pls_components"}
}
}
# Resolve presets before expansion
resolved = resolve_presets_recursive(config)
results = expand_spec(resolved)
Modifier Keywords
_seed_
Provide a deterministic seed for random operations within a node. This ensures reproducible generation when using count or random sampling.
Syntax:
{"_or_": [...], "count": N, "_seed_": 42}
{"_sample_": {...}, "_seed_": 42}
Examples:
# Reproducible random selection
{"_or_": ["A", "B", "C", "D", "E"], "count": 2, "_seed_": 42}
# → Same 2 items every time
# Reproducible sampling
{"_sample_": {"distribution": "uniform", "from": 0, "to": 1, "num": 5}, "_seed_": 123}
# → Same 5 values every time
_weights_
Provide weights for weighted random selection when using count.
Syntax:
{"_or_": [...], "count": N, "_weights_": [w1, w2, ...]}
Examples:
# Weighted random selection (A is 3x more likely than others)
{"_or_": ["A", "B", "C", "D"], "count": 2, "_weights_": [3, 1, 1, 1]}
API Functions
Core Functions
# Expand a specification to all variants
results = expand_spec(spec, seed=None)
# Expand with choice tracking (returns configs and choice paths)
results, choices = expand_spec_with_choices(spec, seed=None)
# Count variants without generating
count = count_combinations(spec)
Iterator Functions
# Lazy iteration for large spaces
for config in expand_spec_iter(spec, seed=None):
process(config)
# With sampling (uses reservoir sampling for uniform distribution)
configs = list(expand_spec_iter(spec, seed=42, sample_size=100))
# Batch processing
for batch in batch_iter(spec, batch_size=10):
process_batch(batch)
# With progress reporting
for i, config in iter_with_progress(spec, report_every=1000):
process(config)
Preset Functions
# Register a preset
register_preset(name, spec, description=None, tags=None, overwrite=False)
# Get preset specification
spec = get_preset(name)
# Get preset info (spec, description, tags)
info = get_preset_info(name)
# List and manage presets
names = list_presets(tags=None) # Filter by tags optionally
has_preset(name)
unregister_preset(name)
clear_presets()
# Resolve presets in a config (handles circular reference detection)
resolved = resolve_presets_recursive(config)
# Check if a node is a preset reference
is_preset_reference(node)
# Export/import presets
presets_dict = export_presets()
count = import_presets(presets_dict, overwrite=False)
# Register built-in presets (standard_scalers, pls_components, learning_rates)
register_builtin_presets()
Constraint Functions
# Apply individual constraints
filtered = apply_mutex_constraint(results, mutex_groups)
filtered = apply_requires_constraint(results, requires_groups)
filtered = apply_exclude_constraint(results, exclude_combos)
# Apply all constraints at once
filtered = apply_all_constraints(results, mutex_groups, requires_groups, exclude_combos)
# Parse and validate constraints
parsed = parse_constraints(constraint_spec)
errors = validate_constraints(constraint_spec)
Export Functions
# Convert to pandas DataFrame
df = to_dataframe(configs, flatten=True, prefix_sep=".", include_index=True)
# Compare configurations
diff = diff_configs(config1, config2)
# Summary statistics
summary = summarize_configs(configs, max_unique=10)
# Tree visualization
tree_str = print_expansion_tree(spec, indent=" ", show_counts=True, max_depth=None)
tree_node = get_expansion_tree(spec)
# ASCII table formatting
table_str = format_config_table(configs, columns=None, max_rows=20)
Validation Functions
# Validate a specification
result = validate_spec(spec)
if not result.is_valid:
print(result.errors)
# Validate a config dict
result = validate_config(config, schema=None)
# Validate expanded configs
results = validate_expanded_configs(configs, schema=None)
Detection Functions
# Check if a node contains any generator keywords
is_generator_node(node) # True if has _or_, _range_, etc.
# Check for specific node types
is_pure_or_node(node) # Only OR-related keys
is_pure_range_node(node) # Only range-related keys
is_pure_log_range_node(node)
is_pure_grid_node(node)
is_pure_zip_node(node)
is_pure_chain_node(node)
is_pure_sample_node(node)
is_pure_cartesian_node(node)
# Check for specific keywords
has_or_keyword(node)
has_range_keyword(node)
has_log_range_keyword(node)
has_grid_keyword(node)
has_zip_keyword(node)
has_chain_keyword(node)
has_sample_keyword(node)
has_cartesian_keyword(node)
Extraction Functions
# Extract modifiers (size, count, pick, arrange, etc.)
modifiers = extract_modifiers(node)
# Extract non-keyword keys
base = extract_base_node(node)
# Extract specific elements
choices = extract_or_choices(node) # From _or_ node
range_spec = extract_range_spec(node) # From _range_ node
tags = extract_tags(node) # From _tags_
metadata = extract_metadata(node) # From _metadata_
constraints = extract_constraints(node) # From _mutex_, _requires_, etc.
Selection Semantics: pick vs arrange
Aspect |
|
|
|---|---|---|
Order matters? |
No |
Yes |
[A, B] vs [B, A] |
Same |
Different |
Formula |
C(n,k) = n!/(k!(n-k)!) |
P(n,k) = n!/(n-k)! |
Count for 3 choose 2 |
3 |
6 |
Use case |
Feature sets |
Processing pipelines |
When to use pick:
concat_transformwhere feature order doesn’t matterfeature_augmentationfor parallel channelsAny unordered collection
When to use arrange:
Sequential preprocessing steps
When operation order affects results
Pipeline stages with dependencies
Common Patterns and Examples
1. Hyperparameter Grid Search
{
"_grid_": {
"model": ["PLS", "RF", "SVR"],
"n_components": {"_range_": [5, 20, 5]},
"preprocessing": ["StandardScaler", "MinMaxScaler", None]
}
}
2. Learning Rate Search
{
"optimizer": "Adam",
"learning_rate": {"_log_range_": [0.0001, 0.1, 10]},
"batch_size": {"_or_": [16, 32, 64, 128]}
}
3. Preprocessing Pipeline Combinations
{
"feature_augmentation": {
"_or_": [
{"class": "SNV"},
{"class": "MSC"},
{"class": "Detrend", "order": {"_or_": [1, 2]}},
{"class": "SavitzkyGolay", "window": {"_or_": [5, 11, 21]}}
],
"pick": (1, 3) # 1 to 3 transforms
}
}
4. Constrained Combinations
{
"_or_": ["PCA", "ICA", "NMF", "UMAP"],
"pick": 2,
"_mutex_": [["PCA", "ICA"]], # PCA and ICA can't be together
"_requires_": [["UMAP", "NMF"]] # If UMAP selected, NMF required
}
5. Progressive Experiments with Chain
{
"_chain_": [
{"model": "baseline", "transforms": []},
{"model": "baseline", "transforms": ["SNV"]},
{"model": "improved", "transforms": ["SNV", "Detrend"]},
{"model": "best", "transforms": ["SNV", "Detrend", "SavGol"]}
]
}
6. Using Presets for Reusable Patterns
# Define presets
register_preset("standard_preprocessing", {
"_or_": [
{"class": "StandardScaler"},
{"class": "MinMaxScaler"},
None
]
})
register_preset("pls_search", {
"_grid_": {
"class": ["PLSRegression"],
"n_components": {"_range_": [2, 20]}
}
})
# Use in pipeline
config = [
{"preprocessing": {"_preset_": "standard_preprocessing"}},
{"model": {"_preset_": "pls_search"}}
]
7. Memory-Efficient Large Space Processing
from itertools import islice
large_spec = {
"_grid_": {
"param1": {"_range_": [1, 100]},
"param2": {"_range_": [1, 100]},
"param3": {"_range_": [1, 100]}
}
}
# Don't do this! (1M configurations in memory)
# all_configs = expand_spec(large_spec)
# Do this instead (lazy iteration)
for config in expand_spec_iter(large_spec):
process(config)
# Or sample
sample = list(expand_spec_iter(large_spec, seed=42, sample_size=1000))
8. Preprocessing Pipeline with Cartesian
# Generate all stage combinations, then select complete pipelines
{
"_cartesian_": [
# Stage 1: Scatter correction
{"_or_": ["MSC", "SNV", "EMSC", None]},
# Stage 2: Smoothing
{"_or_": [
{"class": "SavitzkyGolay", "window": 11},
{"class": "Gaussian", "sigma": 2},
None
]},
# Stage 3: Derivative
{"_or_": [
{"class": "FirstDerivative"},
{"class": "SecondDerivative"},
None
]}
],
"pick": (1, 3), # Select 1-3 complete pipelines
"count": 50 # Limit to 50 variants
}
9. Reproducible Random Search
# Use _seed_ for reproducible random selection
{
"_or_": [
{"class": "PLS", "n_components": {"_range_": [2, 20]}},
{"class": "RF", "n_estimators": {"_or_": [100, 200, 500]}},
{"class": "SVR", "C": {"_log_range_": [0.1, 100, 10]}}
],
"count": 10,
"_seed_": 42 # Same 10 configs every time
}
See Also
Examples - Working examples organized by topic
Writing a Pipeline in nirs4all - Pipeline syntax reference
Combination Generator Documentation - Combination generator syntax
Document updated: December 27, 2025 Version: Phase 4+ Complete