nirs4all.api.run module

Module-level run() function for nirs4all.

This module provides the primary entry point for training ML pipelines on NIRS data. It wraps PipelineRunner.run() with a simpler, more ergonomic interface.

Example

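>>> import nirs4all
>>> from sklearn.cross_decomposition import PLSRegression
>>> from sklearn.preprocessing import MinMaxScaler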
>>> result = nirs4all.run(
...     pipeline=[MinMaxScaler(), PLSRegression(10)],
...     dataset="sample_data/regression",
...     verbose=1
... )
>>> print(f"Best RMSE: {result.best_rmse:.4f}")
nirs4all.api.run.run(pipeline: List[Any] | Dict[str, Any] | str | Path | PipelineConfigs | List[List[Any] | Dict[str, Any] | str | Path | PipelineConfigs], dataset: str | Path | ndarray | Tuple[ndarray, ...] | Dict[str, Any] | SpectroDataset | DatasetConfigs | List[str | Path | ndarray | Tuple[ndarray, ...] | Dict[str, Any] | SpectroDataset | DatasetConfigs], *, name: str = '', session: Session | None = None, verbose: int = 1, save_artifacts: bool = True, save_charts: bool = True, plots_visible: bool = False, random_state: int | None = None, **runner_kwargs: Any) -> RunResult

Execute a training pipeline on a dataset.

This is the primary entry point for training ML pipelines on NIRS data. It provides a simpler interface than creating PipelineRunner and config objects directly.

Parameters:
  • pipeline

    Pipeline definition. Can be:

    - List of steps (most common): [MinMaxScaler(), PLSRegression(10)]
    - Dict with steps: {"steps": [...], "name": "my_pipeline"}
    - Path to YAML/JSON config file: "configs/my_pipeline.yaml"
    - PipelineConfigs object (backward compatibility)
    - List of pipelines: [pipeline1, pipeline2, ...]. Each pipeline is executed independently (cartesian product with datasets).

  • dataset

    Dataset definition. Can be:

    - Path to data folder: "sample_data/regression"
    - Numpy arrays: (X, y) or X alone
    - Dict with arrays: {"X": X, "y": y, "metadata": meta}
    - SpectroDataset instance
    - List of SpectroDataset instances (multi-dataset)
    - DatasetConfigs object (backward compatibility)
    - List of datasets: [dataset1, dataset2, ...]. Each dataset is used with each pipeline (cartesian product).

  • name – Optional pipeline name for identification and logging. If not provided, a name will be generated.

  • session – Optional Session object for resource reuse across multiple runs. When provided, shares workspace and configuration.

  • verbose – Verbosity level (0=quiet, 1=info, 2=debug, 3=trace). Default: 1

  • save_artifacts – Whether to save binary artifacts (models, transformers). Default: True

  • save_charts – Whether to save charts and visual outputs. Default: True

  • plots_visible – Whether to display plots interactively. Default: False

  • random_state – Random seed for reproducibility. Default: None (no seeding)

  • **runner_kwargs – Additional PipelineRunner parameters. See PipelineRunner.__init__ for the full list. Common options:

    - workspace_path: Workspace root directory
    - continue_on_error: Whether to continue on step failures
    - show_spinner: Whether to show progress spinners
    - log_file: Whether to write logs to disk
    - log_format: Output format ("pretty", "minimal", "json")
    - show_progress_bar: Whether to show progress bars
    - max_generation_count: Max pipeline combinations (for generators)
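The parameters above can be combined in a single call. The following is a minimal sketch, not a reference usage: the helper names (make_synthetic_spectra, quiet_run_options, run_on_arrays) and the synthetic data are hypothetical, and it assumes that an (X, y) tuple of NumPy arrays and the listed runner options are accepted exactly as described in the parameter list. The nirs4all and scikit-learn imports are deferred into run_on_arrays so the helpers can be inspected without training anything.

```python
import numpy as np


def make_synthetic_spectra(n_samples=60, n_wavelengths=200, seed=0):
    """Return a random (X, y) pair shaped like a small regression set."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_wavelengths))
    # The target depends on the first five columns plus a little noise.
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=n_samples)
    return X, y


def quiet_run_options(workspace_path="workspace"):
    """Collect keyword arguments for a quiet run that skips artifacts and charts."""
    return {
        "verbose": 0,
        "save_artifacts": False,
        "save_charts": False,
        "plots_visible": False,
        "random_state": 42,
        # The remaining keys are forwarded to PipelineRunner via **runner_kwargs.
        "workspace_path": workspace_path,
        "log_format": "minimal",
        "continue_on_error": False,
    }


def run_on_arrays():
    """Train a small PLS pipeline on in-memory arrays (needs nirs4all and scikit-learn)."""
    import nirs4all
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_synthetic_spectra()
    return nirs4all.run(
        pipeline=[MinMaxScaler(), PLSRegression(10)],
        dataset=(X, y),
        name="synthetic_pls",
        **quiet_run_options(),
    )
```

Calling run_on_arrays() should return a RunResult if the assumptions above hold. Keeping the options in one dictionary makes it easy to reuse the same settings across several calls.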

Returns:

RunResult containing:

  • predictions: Predictions object with all pipeline results

  • per_dataset: Dictionary with per-dataset execution details

  • best: Best prediction entry (convenience accessor)

  • best_score: Best model’s primary test score

  • best_rmse, best_r2, best_accuracy: Score shortcuts

Use result.top(n=5) to get top N predictions, or result.export("path.n4a") to export the best model.

Return type:

RunResult
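As an illustration of the score shortcuts, the sketch below formats them into a one-line summary. The summarize helper is hypothetical, and the stand-in object with made-up numbers exists only so the example runs without training. It assumes that a shortcut which does not apply to the task (for example best_accuracy in a regression run) may be missing or None; the real RunResult may behave differently.

```python
from numbers import Real
from types import SimpleNamespace


def summarize(result):
    """Format the score shortcuts of a RunResult-like object as one line."""
    parts = []
    for attr in ("best_score", "best_rmse", "best_r2", "best_accuracy"):
        value = getattr(result, attr, None)
        # Assumption: shortcuts that do not apply may be missing or None.
        if isinstance(value, Real):
            parts.append(f"{attr}={value:.4f}")
    return ", ".join(parts)


# Stand-in with made-up numbers, not output from a real run.
fake_result = SimpleNamespace(
    best_score=0.1234, best_rmse=0.1234, best_r2=0.9, best_accuracy=None
)
print(summarize(fake_result))
```

With a real run, summarize(result) could be called directly on the object returned by nirs4all.run().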

Raises:

Examples

Simple usage with list of steps:

>>> import nirs4all
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn.cross_decomposition import PLSRegression
>>>
>>> result = nirs4all.run(
...     pipeline=[MinMaxScaler(), PLSRegression(10)],
...     dataset="sample_data/regression",
...     verbose=1
... )
>>> print(f"Best RMSE: {result.best_rmse:.4f}")

With cross-validation and multiple models:

>>> from sklearn.model_selection import ShuffleSplit
>>>
>>> result = nirs4all.run(
...     pipeline=[
...         MinMaxScaler(),
...         ShuffleSplit(n_splits=3),
...         {"model": PLSRegression(10)}
...     ],
...     dataset="sample_data/regression",
...     name="PLS_experiment",
...     verbose=2,
...     save_artifacts=True
... )

Multiple pipelines executed independently:

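>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> pipeline_pls = [MinMaxScaler(), PLSRegression(10)]
>>> pipeline_rf = [StandardScaler(), RandomForestRegressor()]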
>>>
>>> result = nirs4all.run(
...     pipeline=[pipeline_pls, pipeline_rf],  # Two independent pipelines
...     dataset="sample_data/regression",
...     verbose=1
... )
>>> print(f"Total configs: {result.num_predictions}")

Cartesian product of pipelines × datasets:

>>> pipelines = [pipeline1, pipeline2, pipeline3]
>>> datasets = [dataset_a, dataset_b]
>>>
>>> # Runs 6 combinations: p1×da, p1×db, p2×da, p2×db, p3×da, p3×db
>>> result = nirs4all.run(
...     pipeline=pipelines,
...     dataset=datasets,
...     verbose=1
... )
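The count of six follows from pairing every pipeline with every dataset. A plain-Python sketch of that pairing, using placeholder strings instead of real pipeline and dataset definitions, is shown here. It only illustrates the arithmetic; the order in which nirs4all actually executes the combinations is an assumption.

```python
from itertools import product

pipelines = ["p1", "p2", "p3"]  # placeholders for pipeline definitions
datasets = ["da", "db"]         # placeholders for dataset definitions

# Every pipeline is paired with every dataset: 3 x 2 = 6 combinations.
combinations = list(product(pipelines, datasets))
for pipeline_name, dataset_name in combinations:
    print(f"{pipeline_name} x {dataset_name}")

print(f"Total combinations: {len(combinations)}")
```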

Using a session for multiple runs:

>>> with nirs4all.session(verbose=1) as s:
...     r1 = nirs4all.run(pipeline1, data, session=s)
...     r2 = nirs4all.run(pipeline2, data, session=s)
...     print(f"Pipeline 1: {r1.best_score:.4f}")
...     print(f"Pipeline 2: {r2.best_score:.4f}")

Export the best model:

>>> result = nirs4all.run(pipeline, dataset)
>>> result.export("exports/best_model.n4a")

See also