Combination Generator Documentation
Last Updated: December 8, 2025 | Version: Post Phase 1.5
Overview
The Combination Generator is a Python utility that expands a configuration specification into every combination it describes, following a small set of syntax rules. It supports basic "_or_" choices, size constraints, explicit selection semantics (pick/arrange), second-order selection (then_pick/then_arrange), numeric ranges, and stochastic sampling for exploring large configuration spaces.
Architecture
The generator module follows a modular architecture:
nirs4all/pipeline/config/
├── generator.py              # Main API: expand_spec(), count_combinations()
└── _generator/
    ├── __init__.py           # Package exports
    ├── keywords.py           # Keyword constants and detection utilities
    └── utils/
        ├── __init__.py
        ├── combinatorics.py  # Combination/permutation generators and counters
        └── sampling.py       # Deterministic random sampling with seed support
Core Concepts
Basic “or” Combinations
The fundamental building block is the "_or_" key that defines a set of choices:
{"_or_": ["A", "B", "C"]}
# Generates: ["A", "B", "C"]
Selection Semantics: pick vs arrange
Two explicit keywords control how items are selected:
| Keyword | Meaning | Mathematical Basis | When to Use |
|---|---|---|---|
| pick | Select N items, order doesn’t matter | Combinations C(n, k) | Parallel/independent operations |
| arrange | Arrange N items in sequence, order matters | Permutations P(n, k) | Sequential/chained operations |
# Unordered selection (combinations) - order doesn't matter
{"_or_": ["A", "B", "C"], "pick": 2}
# Generates: [["A", "B"], ["A", "C"], ["B", "C"]]
# C(3, 2) = 3 variants
# Ordered arrangement (permutations) - order matters
{"_or_": ["A", "B", "C"], "arrange": 2}
# Generates: [["A", "B"], ["A", "C"], ["B", "A"], ["B", "C"], ["C", "A"], ["C", "B"]]
# P(3, 2) = 6 variants
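The two keywords map onto standard combinatorics, so their expected output can be checked with the standard library alone. The snippet below is a minimal sketch that uses itertools directly; it does not call the generator itself.

```python
from itertools import combinations, permutations
from math import comb, perm

choices = ["A", "B", "C"]

# "pick": 2 -> unordered selection, C(3, 2) variants
picked = [list(c) for c in combinations(choices, 2)]

# "arrange": 2 -> ordered arrangement, P(3, 2) variants
arranged = [list(p) for p in permutations(choices, 2)]

print(picked)    # three unordered pairs
print(arranged)  # six ordered pairs
print(len(picked) == comb(3, 2), len(arranged) == perm(3, 2))
```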
Legacy size Parameter
The size parameter is maintained for backward compatibility and behaves like pick (combinations):
# Legacy syntax (equivalent to pick)
{"_or_": ["A", "B", "C"], "size": 2}
# Generates: [["A", "B"], ["A", "C"], ["B", "C"]]
# Size range
{"_or_": ["A", "B", "C"], "size": (1, 2)}
# Generates: [["A"], ["B"], ["C"], ["A", "B"], ["A", "C"], ["B", "C"]]
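A size range can be read as the union of the fixed-size results for every size in the range. The sketch below illustrates that reading with a hypothetical helper, pick_range, which is not part of the library.

```python
from itertools import chain, combinations


def pick_range(choices, low, high):
    """Hypothetical helper: all combinations holding low..high items (inclusive)."""
    sizes = range(low, high + 1)
    per_size = (combinations(choices, k) for k in sizes)
    return [list(c) for c in chain.from_iterable(per_size)]


variants = pick_range(["A", "B", "C"], 1, 2)
print(variants)
```

Results are grouped by size, smallest first, which matches the ordering shown for the (1, 2) example above.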
Numeric Ranges
Generate sequences of numeric values using the "_range_" key:
# Array syntax: [from, to, step] - step defaults to 1
{"_range_": [1, 5]}
# Generates: [1, 2, 3, 4, 5]
{"_range_": [0, 10, 2]}
# Generates: [0, 2, 4, 6, 8, 10]
# Dictionary syntax: {"from": start, "to": end, "step": step}
{"_range_": {"from": 1, "to": 5, "step": 2}}
# Generates: [1, 3, 5]
# With count sampling for large ranges
{"_range_": [1, 1000], "count": 10}
# Generates: 10 random values from 1 to 1000
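The inclusive end value is the main difference from Python's built-in range. The following sketch shows one way to get that behavior for integer ranges; inclusive_range is a hypothetical name, and the real _generate_range may handle more cases (for example floats).

```python
def inclusive_range(spec):
    """Hypothetical helper mirroring the "_range_" semantics for integers."""
    if isinstance(spec, dict):
        start, end, step = spec["from"], spec["to"], spec.get("step", 1)
    else:
        start, end = spec[0], spec[1]
        step = spec[2] if len(spec) > 2 else 1  # step defaults to 1
    if step == 0:
        raise ValueError("step must be non-zero")
    # Move the stop one unit past `end` in the step direction so `end` is included.
    stop = end + (1 if step > 0 else -1)
    return list(range(start, stop, step))


print(inclusive_range([1, 5]))
print(inclusive_range([0, 10, 2]))
print(inclusive_range({"from": 1, "to": 5, "step": 2}))
print(inclusive_range([5, 1, -2]))
```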
Second-Order Selection with then_pick / then_arrange
For hierarchical selection, use then_pick or then_arrange after a primary selection:
# Step 1: pick=2 generates C(3,2) = 3 combinations: [A,B], [A,C], [B,C]
# Step 2: then_arrange=2 generates P(3,2) = 6 arrangements of those 3 items
{"_or_": ["A", "B", "C"], "pick": 2, "then_arrange": 2}
# Step 1: arrange=2 generates P(3,2) = 6 permutations
# Step 2: then_pick=2 generates C(6,2) = 15 combinations of those
{"_or_": ["A", "B", "C"], "arrange": 2, "then_pick": 2}
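The two-step counts quoted above can be reproduced by feeding the primary results into a second itertools call. This is a sketch of the arithmetic only, under the assumption that the secondary step treats each primary result as a single item.

```python
from itertools import combinations, permutations

choices = ["A", "B", "C"]

# pick=2, then then_arrange=2
primary_pick = list(combinations(choices, 2))             # C(3, 2) = 3 items
pick_then_arrange = list(permutations(primary_pick, 2))   # P(3, 2) = 6

# arrange=2, then then_pick=2
primary_arrange = list(permutations(choices, 2))             # P(3, 2) = 6 items
arrange_then_pick = list(combinations(primary_arrange, 2))   # C(6, 2) = 15

print(len(pick_then_arrange), len(arrange_then_pick))
print(pick_then_arrange[0])
```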
Legacy Second-Order Array Syntax
The array syntax [outer, inner] is supported for backward compatibility:
{"_or_": ["A", "B"], "size": [1, 2]}
# Inner: permutations (order matters within sub-arrays)
# Outer: combinations (order doesn't matter for selection)
Stochastic Sampling
Use "count" to randomly sample from large result sets:
{"_or_": ["A", "B", "C", "D"], "pick": (2, 3), "count": 5}
# Generates: Random 5 combinations from all possible size 2-3 combinations
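To make the sampling step concrete, the sketch below builds the full pool for the example above and draws 5 variants without replacement. The fixed seed and the use of random.Random are assumptions made for reproducibility, not a description of the library's internals.

```python
import random
from itertools import chain, combinations

choices = ["A", "B", "C", "D"]

# All combinations of size 2 and 3: C(4, 2) + C(4, 3) = 6 + 4 = 10
pool = list(chain.from_iterable(combinations(choices, k) for k in (2, 3)))

rng = random.Random(0)        # assumed seed, only to make the sketch repeatable
sample = rng.sample(pool, 5)  # sampling without replacement

print(len(pool), len(sample))
```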
Complete Syntax Reference
Keywords
| Keyword | Description | Example |
|---|---|---|
| _or_ | Choice between alternatives | |
| _range_ | Numeric sequence generation | |
| size | Number of items to select (legacy, uses combinations) | |
| pick | Unordered selection (combinations) | |
| arrange | Ordered arrangement (permutations) | |
| then_pick | Apply combinations to primary results | |
| then_arrange | Apply permutations to primary results | |
| count | Limit number of generated variants | |
Basic Features
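| Syntax | Description | Example Output |
|---|---|---|
| {"_or_": [...]} | All individual choices | |
| {"_or_": [...], "pick": 2} | Combinations of exactly 2 elements | |
| {"_or_": [...], "pick": (1, 3)} | Combinations of 1 to 3 elements | |
| {"_or_": [...], "arrange": 2} | Permutations of exactly 2 elements | |
| {"_or_": [...], "count": 2} | Random 2 choices | |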
Numeric Range Features
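| Syntax | Description | Example Output |
|---|---|---|
| {"_range_": [1, 5]} | Range from 1 to 5 (inclusive) | |
| {"_range_": [from, to, 2]} | Range with step=2 | |
| {"_range_": {"from": start, "to": end}} | Dictionary syntax | |
| {"_range_": {"from": start, "to": end, "step": step}} | Dictionary with step | |
| {"_range_": [from, to], "count": N} | Random sampling from range | |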
Second-Order Selection
| Syntax | Description | Key Behavior |
|---|---|---|
| "pick": 2, "then_pick": 2 | Pick 2, then pick 2 from results | Primary: C(n,2), Secondary: C(primary,2) |
| "pick": 2, "then_arrange": 2 | Pick 2, then arrange 2 from results | Primary: C(n,2), Secondary: P(primary,2) |
| "arrange": 2, "then_pick": 2 | Arrange 2, then pick 2 from results | Primary: P(n,2), Secondary: C(primary,2) |
| "arrange": 2, "then_arrange": 2 | Arrange 2, then arrange 2 from results | Primary: P(n,2), Secondary: P(primary,2) |
Legacy Second-Order Syntax (Array Notation)
| Syntax | Description | Key Behavior |
|---|---|---|
| "size": [outer, inner] | Array notation for second-order | Inner uses permutations, outer uses combinations |
| "size": [2, 2] | Select 2 arrangements of 2 elements each | |
| "size": [(1, 3), 2] | Select 1-3 arrangements of exactly 2 elements | Variable outer selection |
| "size": [2, (1, 3)] | Select exactly 2 arrangements of 1-3 elements | Variable inner arrangements |
Advanced Combinations
| Syntax | Description | Use Case |
|---|---|---|
| | Random 4 from combinations | Large space sampling |
| | Second-order with count limit | Efficient exploration |
Key Behavioral Rules
1. pick vs arrange
pick: ["A", "B"] = ["B", "A"] (combinations - order doesn’t matter)
arrange: ["A", "B"] ≠ ["B", "A"] (permutations - order matters)
Use case guidance:
Use pick for concat_transform (feature order doesn’t matter)
Use pick for feature_augmentation (parallel channels)
Use arrange for sequential preprocessing steps
Use arrange when the order of operations affects the result
2. Second-Order Selection Logic
With then_pick / then_arrange:
Primary selection (pick or arrange) happens first
Secondary selection (then_pick or then_arrange) is applied to primary results
Each step can use int or tuple (from, to) for size specification
# Example: pick 2 combinations, then arrange 2 of those
{"_or_": ["A", "B", "C"], "pick": 2, "then_arrange": 2}
# Step 1: C(3,2) = 3 combinations: [A,B], [A,C], [B,C]
# Step 2: P(3,2) = 6 arrangements of those 3 items
3. Legacy size with Array Notation
In the legacy size=[outer, inner] second-order combinations:
[A, [B, C]] ≠ [A, [C, B]] ✅ (different inner permutations)
[A, [B, C]] = [[B, C], A] ✅ (same outer selection; outer order doesn’t matter)
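One way to reason about these two rules is to map each result to a canonical form in which the outer level is a set and the inner level is a tuple. The helper below is a hypothetical illustration of that idea, not the library's comparison logic.

```python
def canonical(selection):
    """Hypothetical canonical form: outer order ignored, inner order preserved."""
    # Bare items are wrapped as 1-tuples so every outer element is hashable.
    return frozenset(
        tuple(item) if isinstance(item, (list, tuple)) else (item,)
        for item in selection
    )


same_outer = canonical(["A", ["B", "C"]]) == canonical([["B", "C"], "A"])
same_inner = canonical(["A", ["B", "C"]]) == canonical(["A", ["C", "B"]])
print(same_outer, same_inner)
```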
4. Count Sampling
Always applies after all combinations are generated
Uses random sampling without replacement
Works with any selection configuration
Implementation Details
Module Structure
The generator is organized into modular components:
nirs4all/pipeline/config/
├── generator.py                    # Main API
│   ├── expand_spec(node)           # Expand to all combinations
│   ├── count_combinations(node)    # Count without generating
│   ├── _expand_with_pick()         # Handle pick keyword
│   ├── _expand_with_arrange()      # Handle arrange keyword
│   ├── _handle_pick_then_*()       # Second-order with pick primary
│   ├── _handle_arrange_then_*()    # Second-order with arrange primary
│   └── _generate_range()           # Numeric range generation
│
└── _generator/
    ├── keywords.py                 # Keyword constants and utilities
    │   ├── OR_KEYWORD, RANGE_KEYWORD
    │   ├── PICK_KEYWORD, ARRANGE_KEYWORD
    │   ├── THEN_PICK_KEYWORD, THEN_ARRANGE_KEYWORD
    │   ├── is_generator_node()
    │   ├── is_pure_or_node()
    │   ├── extract_modifiers()
    │   └── extract_base_node()
    │
    └── utils/
        ├── combinatorics.py        # Mathematical operations
        │   ├── generate_combinations()
        │   ├── generate_permutations()
        │   ├── count_combinations()
        │   ├── count_permutations()
        │   └── normalize_size_spec()
        │
        └── sampling.py             # Deterministic random sampling
            ├── sample_with_seed()
            ├── shuffle_with_seed()
            └── random_choice_with_seed()
Core Functions
expand_spec(node)
Main recursive expansion function that handles:
Lists: Cartesian product expansion
Dictionaries: OR nodes, range nodes, pick/arrange constraints, count limits
Scalars: Direct return
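The three cases above can be illustrated with a much-reduced recursive expander. The sketch below is an assumption-laden simplification: expand_sketch is a hypothetical name, and it handles only lists, plain dictionaries, pure "_or_" nodes, and scalars (no ranges, pick/arrange, or count).

```python
from itertools import product


def expand_sketch(node):
    """Simplified stand-in for the expansion logic; not the library's code."""
    if isinstance(node, list):
        # Lists: expand every element, then take the Cartesian product.
        expanded = (expand_sketch(item) for item in node)
        return [list(combo) for combo in product(*expanded)]
    if isinstance(node, dict):
        if "_or_" in node:
            # OR node: each choice may itself contain nested generator nodes.
            return [v for choice in node["_or_"] for v in expand_sketch(choice)]
        # Plain dict: expand each value, then combine across keys.
        keys = list(node)
        value_sets = [expand_sketch(node[k]) for k in keys]
        return [dict(zip(keys, combo)) for combo in product(*value_sets)]
    # Scalars: returned as a single variant.
    return [node]


variants = expand_sketch([
    {"_or_": ["normalize", "standardize"]},
    {"model": {"_or_": ["svm", "rf"]}},
])
print(len(variants), variants[0])
```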
count_combinations(node)
Calculate total number of combinations without generating them:
Returns exact count that expand_spec would produce
Uses mathematical formulas (combinations, permutations, factorials)
Supports _or_, _range_, pick, arrange, then_pick, then_arrange
Extremely fast even for large configuration spaces
Essential for performance planning and safety checks
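The reason counting stays cheap is that closed-form formulas replace enumeration. The sketch below shows the arithmetic with math.comb and math.perm; count_pick and count_arrange are hypothetical helpers, and the real count_combinations covers more node types.

```python
from math import comb, perm


def count_pick(n, low, high):
    """Number of combinations of n choices with low..high items (inclusive)."""
    return sum(comb(n, k) for k in range(low, high + 1))


def count_arrange(n, low, high):
    """Number of permutations of n choices with low..high items (inclusive)."""
    return sum(perm(n, k) for k in range(low, high + 1))


# First-order counts from the earlier examples
print(count_pick(3, 2, 2), count_arrange(3, 2, 2), count_pick(4, 2, 3))

# Second-order: the secondary step counts over the number of primary results
pick_then_arrange = perm(count_pick(3, 2, 2), 2)
arrange_then_pick = comb(count_arrange(3, 2, 2), 2)
print(pick_then_arrange, arrange_then_pick)

# A large space is counted without building any variant
print(count_pick(40, 5, 5))
```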
_expand_with_pick(choices, pick_spec, count, then_pick, then_arrange)
Handles the pick keyword (combinations):
Generates C(n, k) combinations where order doesn’t matter
Supports int or tuple (from, to) range specification
Handles second-order with then_pick or then_arrange
Legacy size=[outer, inner] array notation for backward compatibility
_expand_with_arrange(choices, arrange_spec, count, then_pick, then_arrange)
Handles the arrange keyword (permutations):
Generates P(n, k) permutations where order matters
Supports int or tuple (from, to) range specification
Handles second-order with then_pick or then_arrange
_generate_range(range_spec)
Generate numeric sequences from range specifications:
Supports array syntax: [from, to] or [from, to, step]
Supports dict syntax: {"from": start, "to": end, "step": step}
Handles positive and negative steps
End value is inclusive
Keywords Module
Centralized keyword constants and detection utilities:
from nirs4all.pipeline.config.generator import (
    # Constants
    OR_KEYWORD,            # "_or_"
    RANGE_KEYWORD,         # "_range_"
    PICK_KEYWORD,          # "pick"
    ARRANGE_KEYWORD,       # "arrange"
    THEN_PICK_KEYWORD,     # "then_pick"
    THEN_ARRANGE_KEYWORD,  # "then_arrange"
    SIZE_KEYWORD,          # "size" (legacy)
    COUNT_KEYWORD,         # "count"
    # Detection functions
    is_generator_node,     # Check if node has _or_ or _range_
    is_pure_or_node,       # Check if node is purely an OR node
    is_pure_range_node,    # Check if node is purely a range node
    extract_modifiers,     # Extract size, count, pick, arrange modifiers
    extract_base_node,     # Extract non-keyword keys
)
Sampling Utilities
Deterministic random sampling with optional seed support:
from nirs4all.pipeline.config._generator.utils import sample_with_seed
# Deterministic sampling with seed
result = sample_with_seed(["A", "B", "C", "D"], k=2, seed=42)
# → Same result every time with seed=42
# Without seed (non-deterministic)
result = sample_with_seed(["A", "B", "C", "D"], k=2)
# → Random each time
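For readers who want to see how seeded sampling can be made deterministic, here is a minimal sketch built on random.Random. The function name sample_with_seed_sketch is hypothetical, and the library's own sample_with_seed may be implemented differently.

```python
import random


def sample_with_seed_sketch(population, k, seed=None):
    """Illustrative only: draw k items without replacement, optionally seeded."""
    # A private Random instance keeps the global random state untouched.
    rng = random.Random(seed)  # seed=None means a non-deterministic seed
    return rng.sample(list(population), k)


first = sample_with_seed_sketch(["A", "B", "C", "D"], k=2, seed=42)
second = sample_with_seed_sketch(["A", "B", "C", "D"], k=2, seed=42)
print(first == second)
```

Using a local generator rather than random.seed() means that sampling a configuration does not change the random stream seen by the rest of a pipeline.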
Dependencies
from itertools import product, combinations, permutations
from collections.abc import Mapping
from math import comb, factorial
import random
Performance Planning
Count Before Generate
Always estimate first for unknown configuration spaces:
from nirs4all.pipeline.config.generator import expand_spec, count_combinations
# Safe workflow
config = [{"_or_": ["A", "B", "C", "D"], "pick": [(1, 3), (1, 4)]}]
# Step 1: Estimate without generating
estimated_count = count_combinations(config)
print(f"Would generate {estimated_count:,} combinations")
# Step 2: Decide based on count
if estimated_count > 10000:
    # Add count limit for large spaces
    config[0]["count"] = 1000
    print("Added count limit for safe sampling")
# Step 3: Generate safely
results = expand_spec(config)
Smart Generation Utility
def estimate_and_generate(config, max_safe=1000):
    from nirs4all.pipeline.config.generator import expand_spec, count_combinations
    estimated = count_combinations(config)
    if estimated <= max_safe:
        return expand_spec(config)
    else:
        print(f"Large space: {estimated:,}. Add count limit!")
        return None
Usage Examples
Basic Pipeline Configuration
from nirs4all.pipeline.config.generator import expand_spec

pipeline = [
    {"_or_": ["normalize", "standardize"]},
    {"model": {"_or_": ["svm", "rf", "xgb"], "size": 2}},
    {"features": {"_or_": ["pca", "lda"], "count": 1}},
]
results = expand_spec(pipeline)
# Generates all combinations of preprocessing, 2 models, and 1 random feature method
Hyperparameter Range Exploration
# Systematic hyperparameter search
hyperparams = {
    "model_params": {
        "n_estimators": {"_range_": [50, 200, 25]},  # [50, 75, 100, 125, 150, 175, 200]
        "max_depth": {"_range_": {"from": 3, "to": 15, "step": 2}},  # [3, 5, 7, 9, 11, 13, 15]
        "learning_rate": {"_or_": [0.01, 0.1, 0.2]},
    }
}
results = expand_spec(hyperparams)
# Generates: 7 × 7 × 3 = 147 hyperparameter combinations
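The 147 figure is a plain Cartesian product and can be checked independently of the generator. The sketch below rebuilds the same grid with itertools.product and the built-in range, whose exclusive stop is shifted to keep the end values.

```python
from itertools import product

n_estimators = list(range(50, 201, 25))  # 50..200 inclusive, step 25 -> 7 values
max_depth = list(range(3, 16, 2))        # 3..15 inclusive, step 2 -> 7 values
learning_rate = [0.01, 0.1, 0.2]         # 3 values

grid = [
    {"n_estimators": n, "max_depth": d, "learning_rate": lr}
    for n, d, lr in product(n_estimators, max_depth, learning_rate)
]
print(len(n_estimators), len(max_depth), len(grid))
```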
Mixed Range and Choice Combinations
# Combine different generation strategies
config = [
    {"preprocessing": {"_or_": ["minmax", "standard", "robust"]}},
    {"batch_size": {"_range_": [16, 128, 16]}},  # [16, 32, 48, 64, 80, 96, 112, 128]
    {"optimizer": {"_or_": ["adam", "sgd"]}},
    {"epochs": {"_range_": [10, 50, 10], "count": 3}},  # Random 3 values from [10, 20, 30, 40, 50]
]
results = expand_spec(config)
# Generates: 3 × 8 × 2 × 3 = 144 training configurations
Complex Second-Order Example
config = [{"_or_": ["A", "B", "C", "D"], "size": [(1, 3), (2, 4)]}]
results = expand_spec(config)
# Generates:
# - Inner: all permutations of 2-4 elements
# - Outer: select 1-3 of those inner arrangements
Numeric Range Integration
# Combine ranges with other features
pipeline = [
    {"preprocessing": {"_or_": ["normalize", "standardize"]}},
    {"n_estimators": {"_range_": [10, 100, 10]}},  # [10, 20, 30, ..., 100]
    {"max_depth": {"_range_": {"from": 3, "to": 10, "step": 2}}},  # [3, 5, 7, 9]
]
results = expand_spec(pipeline)
# Generates all combinations of preprocessing × n_estimators × max_depth
Stochastic Exploration
config = [{"_or_": ["method1", "method2", "method3", "method4"],
           "size": [3, (1, 4)],
           "count": 10}]
results = expand_spec(config)
# Random 10 samples from potentially thousands of combinations
Performance Considerations
Complexity Analysis
Basic OR: O(n) where n = number of choices
Size constraints: O(C(n,k)) where k = size, n = choices
Second-order: O(P(n,k) × C(P,m)) where P = inner permutations, m = outer size
Numeric ranges: O((end-start)/step + 1) - very efficient even for large ranges
Count sampling: O(min(count, total_combinations))
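To get a feel for how quickly the second-order term grows, the sketch below evaluates C(P(n, k), m) for a few small n. The helper second_order_size is hypothetical and only performs the arithmetic from the formula above.

```python
from math import comb, perm


def second_order_size(n, inner_k, outer_m):
    """Return (inner permutations, outer combinations) for the given sizes."""
    inner = perm(n, inner_k)          # P(n, k) inner arrangements
    return inner, comb(inner, outer_m)  # C(P, m) outer selections


sizes = {n: second_order_size(n, 2, 3) for n in (3, 4, 5, 6)}
for n, (inner, total) in sizes.items():
    print(f"n={n}: {inner} inner arrangements, {total:,} second-order variants")
```

Going from 3 to 6 choices multiplies the number of variants by roughly 200, which is why a count limit is recommended for second-order specifications.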
Memory Optimization
Use count parameter for large combination spaces
Consider tuple ranges instead of generating all intermediate sizes
Second-order combinations can grow exponentially - use count limits
Best Practices
Estimate first: Always use count_combinations() before generating
Use count limits: For spaces >1000 combinations, use count sampling
Profile configurations: Build up complexity incrementally
Smart generation: Use estimation utilities for safe workflows
Error Handling
Common Issues
Memory errors: Use count limits for large spaces
Type errors: Ensure OR values are compatible with context
Size errors: Size cannot exceed number of available choices
Empty results: Check that size constraints are satisfiable
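Size errors and empty results share a cause: asking for more items than there are choices. The standard library silently yields nothing in that case, so a small pre-flight check can make the problem explicit. The helper check_size below is hypothetical; how the generator itself reports this condition is not shown here.

```python
from itertools import combinations


def check_size(choices, size):
    """Hypothetical pre-flight check for a single integer size."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size > len(choices):
        raise ValueError(
            f"size {size} exceeds the {len(choices)} available choices"
        )
    return size


# itertools returns an empty result rather than raising when k > n
empty = list(combinations(["A", "B"], 3))
print(empty)

try:
    check_size(["A", "B"], 3)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```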
Debug Tips
from nirs4all.pipeline.config.generator import expand_spec, count_combinations

# Generate a small configuration and check how many results it produces
config = [{"_or_": ["A", "B", "C"], "size": [2, 2]}]
results = expand_spec(config)
print(f"Total combinations: {len(results)}")
# Test range counting vs generation
range_config = {"_range_": [1, 1000]}
estimated = count_combinations(range_config)
print(f"Range 1-1000 has {estimated} values")  # Should be 1000
# Use count for large spaces
if len(results) > 100:
    config[0]["count"] = 10  # Limit to 10 samples
Version History
v1.0: Basic OR combinations and size constraints
v1.1: Tuple range support for size constraints
v1.2: Second-order combinations with array syntax
v1.3: Stochastic count sampling
v1.4: Fixed permutation logic for inner arrays
v1.5: Added _range_ keyword for numeric sequences with tuple and dict syntax