# Core Concepts This page explains the fundamental concepts behind NIRS4ALL. Understanding these will help you build effective pipelines. ## Overview NIRS4ALL is built around three core concepts: 1. **SpectroDataset** - A container for spectral data, targets, and metadata 2. **Pipeline** - A sequence of processing steps 3. **Controllers** - The execution engine that runs each step ``` ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Dataset │ -> │ Pipeline │ -> │ Results │ │ (your data) │ │ (steps) │ │ (predictions)│ └──────────────┘ └──────────────┘ └──────────────┘ ``` ## SpectroDataset `SpectroDataset` is the core data container. It holds: | Component | Description | |-----------|-------------| | **X** (features) | Spectral data matrix (n_samples × n_wavelengths) | | **y** (targets) | Target values for prediction (n_samples,) | | **metadata** | Sample information (IDs, groups, dates, etc.) | | **fold indices** | Cross-validation assignments | ### Creating a Dataset Most often, NIRS4ALL creates datasets automatically from your files: ```python from nirs4all.data import DatasetConfigs # From a folder (auto-detects files) dataset = DatasetConfigs("path/to/data/") # From explicit files dataset = DatasetConfigs({ "train_x": "spectra.csv", "train_y": "targets.csv", "train_m": "metadata.csv" # Optional metadata }) ``` You can also generate synthetic data for testing and prototyping: ```python import nirs4all # Generate realistic synthetic NIRS spectra dataset = nirs4all.generate.regression( n_samples=500, components=["water", "protein", "lipid"], complexity="realistic" ) ``` See {doc}`/user_guide/data/synthetic_data` for more on synthetic data generation. ``` ### Partitions Data is organized into partitions: | Partition | Purpose | |-----------|---------| | **train** | Used for model training and cross-validation | | **test** | Held-out data for final evaluation | | **val** | Validation set (often created from train via CV) | :::{note} During cross-validation, the train partition is automatically split into train/val folds. The test partition (if provided) remains untouched for final evaluation. ::: ## Pipeline A **pipeline** is a list of processing steps applied sequentially: ```python pipeline = [ MinMaxScaler(), # Step 1: Scale features StandardNormalVariate(), # Step 2: SNV preprocessing {"y_processing": MinMaxScaler()}, # Step 3: Scale targets ShuffleSplit(n_splits=5), # Step 4: Cross-validation {"model": PLSRegression(n_components=10)} # Step 5: Model ] ``` ### Step Types | Step Type | Syntax | Purpose | |-----------|--------|---------| | **Transformer** | `MinMaxScaler()` | Modify features (X) | | **Y Processing** | `{"y_processing": ...}` | Modify targets (y) | | **Splitter** | `ShuffleSplit(n_splits=5)` | Define cross-validation | | **Model** | `{"model": PLSRegression()}` | Train predictive model | | **Branch** | `{"branch": [...]}` | Parallel processing paths | | **Merge** | `{"merge": "features"}` | Combine branch outputs | | **Augmentation** | `{"feature_augmentation": ...}` | Generate preprocessing variants | ### Execution Flow ``` Input Data → [Preprocessing] → [CV Split] → [Training] → Predictions │ │ │ │ ▼ ▼ ▼ ▼ SpectroDataset Transformers Splitter Models ``` 1. **Data Loading**: Your files are loaded into a SpectroDataset 2. **Preprocessing**: Transformers modify X (and optionally y) 3. **Cross-Validation**: Splitter defines train/val folds 4. **Training**: Each model is trained on each fold 5. **Prediction**: Out-of-fold predictions are collected ## Controllers **Controllers** are the execution engine. They interpret each pipeline step and perform the appropriate action. | Controller | Handles | |------------|---------| | `TransformController` | sklearn TransformerMixin (scalers, preprocessors) | | `YProcessingController` | `{"y_processing": ...}` steps | | `SplitterController` | Cross-validation splitters | | `ModelController` | `{"model": ...}` steps | | `BranchController` | `{"branch": ...}` parallel paths | | `MergeController` | `{"merge": ...}` combining outputs | :::{tip} You rarely interact with controllers directly. They work behind the scenes to execute your pipeline. ::: ## Predictions and Results When you run a pipeline, you get a `RunResult` object: ```python result = nirs4all.run(pipeline, dataset) # Access results result.best_score # Best model's primary score result.best # Best prediction entry (dict) result.num_predictions # Total prediction entries result.predictions # Full PredictionResultsList # Get top performers for pred in result.top(n=5, display_metrics=['rmse', 'r2']): print(f"{pred['model_name']}: RMSE={pred['rmse']:.4f}") ``` ### Prediction Entry Structure Each prediction entry contains: | Field | Description | |-------|-------------| | `model_name` | Name of the model | | `dataset_name` | Name of the dataset | | `fold_id` | Cross-validation fold index | | `y_true` | True target values | | `y_pred` | Predicted values | | `rmse`, `r2`, etc. | Computed metrics | | `preprocessings` | Applied preprocessing chain | | `partition` | Data partition (train/val/test) | ## Key Terminology | Term | Definition | |------|------------| | **Spectral data** | Features from spectroscopy (reflectance, absorbance, etc.) | | **Wavelength** | Individual feature/column in spectral data | | **Fold** | One train/validation split in cross-validation | | **OOF (Out-of-Fold)** | Predictions made on validation data during CV | | **Operator** | A preprocessing or transformation class | | **Transformer** | sklearn-compatible operator with `fit()` and `transform()` | | **Pipeline variant** | One specific configuration when using generators | ## The nirs4all.run() Function The simplest way to run a pipeline: ```python result = nirs4all.run( pipeline=pipeline, # List of steps (or list of pipelines) dataset=dataset, # See below for supported formats name="MyPipeline", # Pipeline name verbose=1, # 0=silent, 1=progress, 2=debug save_artifacts=True, # Save models and results save_charts=True, # Save generated plots plots_visible=False # Show plots interactively ) ``` ### Supported Pipeline Formats The `pipeline` parameter accepts: | Format | Example | Description | |--------|---------|-------------| | List of steps | `[MinMaxScaler(), PLSRegression()]` | Single pipeline | | Dict config | `{"pipeline": [...]}` | Dict with steps | | Path to config | `"config.yaml"` or `"config.json"` | Load from file | | `PipelineConfigs` | `PipelineConfigs(steps)` | Direct config object | | **List of pipelines** | `[pipeline1, pipeline2, ...]` | Run each independently | ### Supported Dataset Formats The `dataset` parameter accepts: | Format | Example | Description | |--------|---------|-------------| | Path to folder | `"sample_data/regression"` | Auto-load from folder | | Numpy arrays | `(X, y)` or `X` alone | Direct arrays | | Dict with arrays | `{"X": X, "y": y, "metadata": meta}` | Dict with data | | `SpectroDataset` | Direct dataset instance | Pre-built dataset | | `DatasetConfigs` | Full configuration object | Complete config | | **List of datasets** | `[dataset1, dataset2, ...]` | Run on each dataset | ### Batch Execution: Pipelines × Datasets When you provide **multiple pipelines** and/or **multiple datasets**, `nirs4all.run()` executes the **cartesian product**: ```python # 2 pipelines × 2 datasets = 4 runs result = nirs4all.run( pipeline=[pipeline_a, pipeline_b], dataset=["data/wheat", "data/corn"], verbose=1 ) # Runs: pipeline_a×wheat, pipeline_a×corn, pipeline_b×wheat, pipeline_b×corn ``` All results are collected into a single `RunResult` for unified analysis. For more control, use `PipelineRunner` directly: ```python from nirs4all.pipeline import PipelineRunner, PipelineConfigs from nirs4all.data import DatasetConfigs runner = PipelineRunner( verbose=1, save_artifacts=True, save_charts=True ) predictions, per_dataset = runner.run( PipelineConfigs(pipeline, "MyPipeline"), DatasetConfigs("path/to/data") ) ``` ## Architecture Diagram ``` ┌─────────────────────────────────────────────────────────────────┐ │ nirs4all.run() │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │PipelineRunner│ --> │PipelineOrches│ --> │ Controllers │ │ │ │ │ │ -trator │ │ (registry) │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │PipelineConfigs│ │ExecutionContext│ │SpectroDataset│ │ │ │ (pipeline) │ │ (state) │ │ (data) │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ ├─────────────────────────────────────────────────────────────────┤ │ RunResult │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Predictions │ │ Metrics │ │ Artifacts │ │ │ │ List │ │ │ │ (.n4a) │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` ## Next Steps ::::{grid} 2 :gutter: 3 :::{grid-item-card} 📖 Loading Data :link: /user_guide/data/loading_data :link-type: doc Learn about DatasetConfigs and supported formats. ::: :::{grid-item-card} 🔧 Preprocessing :link: /user_guide/preprocessing/overview :link-type: doc NIRS-specific preprocessing techniques. ::: :::{grid-item-card} 📋 Pipeline Syntax :link: /reference/pipeline_syntax :link-type: doc Complete syntax reference. ::: :::{grid-item-card} 📝 Examples :link: /examples/index :link-type: doc Working examples organized by topic. ::: :::: ## See Also - {doc}`quickstart` - Run your first pipeline - {doc}`/reference/pipeline_syntax` - Complete pipeline syntax - {doc}`/developer/architecture` - Detailed architecture for contributors