# AI Coding Assistant Onboarding

**Quick reference for AI assistants working with nirs4all.** For detailed documentation, see [ReadTheDocs](https://nirs4all.readthedocs.io/).

---

## Core Concepts (30-Second Overview)

**nirs4all** is a Python library for Near-Infrared Spectroscopy (NIRS) data analysis with ML pipelines.

| Concept | Description |
|---------|-------------|
| **Pipeline** | List of steps (preprocessing → splitting → model) executed sequentially |
| **SpectroDataset** | Core container holding `X` (spectra), `y` (targets), `metadata`, `folds` |
| **Operators** | Transformers, models, splitters - anything sklearn-compatible + NIRS-specific |
| **Controllers** | Registry pattern that handles operator execution (extensible) |
| **Bundles (.n4a)** | Serialized pipelines for deployment with full preprocessing chain |

**Typical Workflow**: Load data → Define pipeline → `nirs4all.run()` → Analyze results → Export model

---

## Primary API (Module-Level)

All functions are directly on the `nirs4all` module:

```python
import nirs4all

# Train a pipeline
result = nirs4all.run(pipeline=[...], dataset="path/to/data", verbose=1)

# Make predictions with exported model
predictions = nirs4all.predict("model.n4a", new_data)

# SHAP explanations
explanations = nirs4all.explain("model.n4a", test_data)

# Retrain on new data
new_result = nirs4all.retrain("model.n4a", new_data, mode="transfer")

# Reusable session for resource sharing
with nirs4all.session() as s:
    r1 = nirs4all.run(pipeline1, data, session=s)
    r2 = nirs4all.run(pipeline2, data, session=s)

# Generate synthetic data
dataset = nirs4all.generate(n_samples=500, complexity="realistic")
dataset = nirs4all.generate.regression(n_samples=500)
dataset = nirs4all.generate.classification(n_samples=300, n_classes=3)
```

---

## Function Signatures

### `nirs4all.run()`

```python
nirs4all.run(
    pipeline,                    # List of steps or list of pipelines (batch)
    dataset,                     # Path, SpectroDataset, or list (batch)
    verbose=0,                   # 0=silent, 1=progress, 2=detailed
    session=None,                # Optional Session for resource reuse
    artifacts_path=None,         # Where to save run artifacts
    name=None,                   # Run name identifier
) -> RunResult
```

### `nirs4all.predict()`

```python
nirs4all.predict(
    model,                       # Path to .n4a bundle or loaded model
    data,                        # X array, DataFrame, or SpectroDataset
    verbose=0,
) -> PredictResult
```

### `nirs4all.explain()`

```python
nirs4all.explain(
    model,                       # Path to .n4a bundle
    data,                        # Test data for explanations
    explainer_type="auto",       # auto | tree | kernel | linear | deep
    max_samples=100,             # SHAP background samples
) -> ExplainResult
```

### `nirs4all.retrain()`

```python
nirs4all.retrain(
    source,                      # Path to .n4a bundle
    data,                        # New dataset
    mode="full",                 # full | transfer | finetune
    verbose=0,
) -> RunResult
```

### `nirs4all.generate()`

```python
nirs4all.generate(
    n_samples=1000,
    complexity="realistic",      # simple | realistic | complex
    wavelength_range=(1000, 2500),
    components=["water", "protein", "lipid"],
    target_range=(0, 100),
    train_ratio=0.8,
    as_dataset=True,             # False returns (X, y) tuple
    random_state=None,
) -> SpectroDataset
```

---

## Result Objects

### RunResult (from `run()`)

```python
result.best              # Best prediction entry (dict)
result.best_score        # Primary test score
result.best_rmse         # RMSE (regression)
result.best_r2           # R² (regression)
result.best_accuracy     # Accuracy (classification)
result.num_predictions   # Total predictions stored

result.top(n=5)          # Top N predictions
result.filter(model="PLS")  # Filter by criteria
result.export("model.n4a")  # Export best model
result.get_models()      # List unique model names
result.get_datasets()    # List unique dataset names
```

### PredictResult (from `predict()`)

```python
preds.values             # Predicted values array
preds.shape              # Shape of predictions
preds.model_name         # Model used
preds.to_dataframe()     # Convert to DataFrame
```

### ExplainResult (from `explain()`)

```python
exp.shap_values          # SHAP values array
exp.feature_names        # Feature labels
exp.base_value           # Expected value
exp.top_features         # Ranked by importance
exp.get_feature_importance(top_n=10)
exp.to_dataframe()
```

---

## Pipeline Syntax

Steps can be classes, instances, or dicts with keywords:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Basic pipeline
pipeline = [
    MinMaxScaler(),              # Preprocessing
    KFold(n_splits=5),           # Cross-validation
    PLSRegression(n_components=10)  # Model (auto-detected)
]

# With explicit keywords
pipeline = [
    MinMaxScaler(),
    {"y_processing": StandardScaler()},     # Target scaling
    KFold(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]
```

### Pipeline Keywords

| Keyword | Purpose | Example |
|---------|---------|---------|
| `model` | Explicit model definition | `{"model": PLSRegression(10)}` |
| `y_processing` | Target (y) scaling | `{"y_processing": MinMaxScaler()}` |
| `branch` | Parallel sub-pipelines | `{"branch": [[SNV(), PLS()], [MSC(), RF()]]}` |
| `merge` | Combine branch outputs | `{"merge": "predictions"}` |
| `source_branch` | Per-source preprocessing | `{"source_branch": {"NIR": [...], "VIS": [...]}}` |

### Generator Keywords (Pipeline Expansion)

```python
from nirs4all.operators import SNV, MSC, Detrend

# _or_: Creates N pipelines, one per option
pipeline = [
    {"_or_": [SNV, MSC, Detrend]},  # Expands to 3 pipelines
    PLSRegression(10)
]

# _range_: Parameter sweep
pipeline = [
    MinMaxScaler(),
    {"_range_": [1, 30, 5], "param": "n_components", "class": PLSRegression}
    # Creates pipelines with n_components=1,6,11,16,21,26
]

# _choice_: Named alternatives with full specs
pipeline = [
    {"_choice_": {
        "pls": PLSRegression(10),
        "rf": RandomForestRegressor(n_estimators=100)
    }}
]
```

---

## Stacking / Ensemble Patterns

```python
from sklearn.linear_model import Ridge

# Branch → Merge → Meta-model
pipeline = [
    {"branch": [
        [SNV(), PLSRegression(10)],           # Branch 1
        [MSC(), RandomForestRegressor()],     # Branch 2
    ]},
    {"merge": "predictions"},    # OOF predictions as features
    {"model": Ridge()},          # Meta-model
]
```

---

## Key Operators

### NIRS-Specific Preprocessing (`nirs4all.operators`)

```python
from nirs4all.operators import (
    SNV,                  # Standard Normal Variate
    MSC,                  # Multiplicative Scatter Correction
    MultiplicativeScatterCorrection,  # Alias for MSC
    SavitzkyGolay,        # Smoothing + derivatives
    Detrend,              # Baseline detrending
    Baseline,             # Baseline removal
    Derivate,             # First/second derivatives
    Gaussian,             # Gaussian filter
)
```

### Sample Augmentation

```python
from nirs4all.operators import (
    Spline_Smoothing,
    Rotate_Translate,
    Random_X_Operation,
)
```

### Splitting Strategies

```python
from nirs4all.operators import (
    KennardStone,         # Kennard-Stone sampling
    SPXY,                 # Sample Partitioning (X + Y)
)
# Plus all sklearn splitters: KFold, ShuffleSplit, StratifiedKFold, etc.
```

### Deep Learning Models

```python
# TensorFlow-based (lazy-loaded)
from nirs4all.operators.models.tensorflow import (
    NICON,                # 1D CNN for NIRS
    DECON,                # Deep CNN
    ResNet1D,
    Transformer1D,
)
```

---

## Dataset Loading

```python
import nirs4all
from nirs4all.data import SpectroDataset

# From path (auto-detects format)
result = nirs4all.run(pipeline, dataset="path/to/data.csv")
result = nirs4all.run(pipeline, dataset="path/to/folder/")

# From SpectroDataset
ds = SpectroDataset.from_csv("data.csv", x_col="spectrum", y_col="target")
ds = SpectroDataset.from_parquet("data.parquet")
result = nirs4all.run(pipeline, dataset=ds)

# Batch execution (Cartesian product)
result = nirs4all.run(
    pipeline=[pipeline_a, pipeline_b],
    dataset=[dataset_1, dataset_2]  # Runs 4 combinations
)
```

### SpectroDataset Properties

```python
ds.X                     # Feature matrix (n_samples, n_features)
ds.y                     # Target vector (n_samples,)
ds.metadata              # Sample-level metadata dict
ds.sources               # Multi-source tracking
ds.folds                 # CV fold assignments
ds.name                  # Dataset identifier
ds.wavelengths           # Wavelength labels (if available)
```

---

## Architecture Overview

```
nirs4all/
├── api/           # Module-level functions: run(), predict(), explain(), retrain(), session(), generate()
│   └── result.py  # RunResult, PredictResult, ExplainResult
├── pipeline/      # Execution engine
│   ├── runner.py  # PipelineRunner (main orchestrator)
│   ├── config.py  # PipelineConfigs (expands generators)
│   ├── bundle.py  # .n4a export/load
│   └── predictor.py
├── controllers/   # Registry for operator handlers
│   ├── registry.py
│   ├── transforms/
│   ├── models/    # sklearn, tensorflow, pytorch, jax
│   └── splitters/
├── operators/     # NIRS-specific operators
│   ├── transforms/  # SNV, MSC, SavitzkyGolay, etc.
│   ├── models/      # NICON, DECON, etc.
│   ├── augmentation/
│   └── splitters/   # KennardStone, SPXY
├── data/
│   ├── dataset.py   # SpectroDataset
│   ├── predictions.py
│   └── synthetic/   # Data generation
└── sklearn/
    └── pipeline.py  # NIRSPipeline (sklearn wrapper for SHAP)
```

---

## Controller Pattern (Extension Point)

Custom operators register via decorator:

```python
from nirs4all.controllers import register_controller, OperatorController

@register_controller
class MyController(OperatorController):
    priority = 50  # Lower = higher priority

    @classmethod
    def matches(cls, step, operator, keyword) -> bool:
        return isinstance(operator, MyOperatorType)

    @classmethod
    def supports_prediction_mode(cls) -> bool:
        return True  # Execute during predict()

    def execute(self, step_info, dataset, context, runtime_context, **kwargs):
        # Transform dataset; return (context, StepOutput)
        ...
```

---

## Common Patterns

### Full Example

```python
import nirs4all
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

# Generate or load data
dataset = nirs4all.generate.regression(n_samples=500)

# Define pipeline
pipeline = [
    MinMaxScaler(),
    KFold(n_splits=5),
    {"model": PLSRegression(n_components=10)}
]

# Train
result = nirs4all.run(pipeline, dataset, verbose=1)
print(f"Best RMSE: {result.best_rmse:.4f}")
print(f"Best R²: {result.best_r2:.4f}")

# Export
result.export("model.n4a")

# Predict on new data
new_data = nirs4all.generate.regression(n_samples=50)
predictions = nirs4all.predict("model.n4a", new_data.X)
print(predictions.values)

# Explain
explanations = nirs4all.explain("model.n4a", new_data.X)
print(explanations.top_features[:10])
```

### Hyperparameter Search

```python
pipeline = [
    MinMaxScaler(),
    {"_or_": [SNV(), MSC(), Detrend()]},  # Try 3 preprocessors
    {"_range_": [5, 25, 5], "param": "n_components", "class": PLSRegression}
]
# Runs: 3 preprocessors × 5 n_components values = 15 pipelines
result = nirs4all.run(pipeline, dataset)
```

### Multi-Source Data

```python
# Different preprocessing per source
pipeline = [
    {"source_branch": {
        "NIR": [SNV(), MinMaxScaler()],
        "VIS": [StandardScaler()],
    }},
    PLSRegression(10)
]
```

---

## Commands Reference

```bash
# Tests
pytest tests/                     # All tests
pytest tests/unit/                # Unit only
pytest tests/integration/         # Integration only
pytest -m sklearn                 # sklearn-only (fast)
pytest --cov=nirs4all             # With coverage

# Examples
cd examples && ./run.sh           # All examples
./run.sh -c user                  # User category only
./run.sh -q                       # Quick (skip deep learning)

# Verification
nirs4all --test-install
nirs4all --test-integration

# Code quality
ruff check .
mypy .
```

---

## Key Files Reference

| File | Purpose |
|------|---------|
| `nirs4all/__init__.py` | Package entry, re-exports API |
| `nirs4all/api/run.py` | `run()` implementation |
| `nirs4all/api/result.py` | Result objects |
| `nirs4all/pipeline/runner.py` | Pipeline execution |
| `nirs4all/pipeline/config.py` | PipelineConfigs (generator expansion) |
| `nirs4all/data/dataset.py` | SpectroDataset |
| `nirs4all/controllers/registry.py` | Controller registration |
| `nirs4all/operators/transforms/` | NIRS transforms |

---

## Further Reading

- **[Getting Started](getting_started/index.md)** - Installation and first pipeline
- **[User Guide](user_guide/index.md)** - Task-oriented how-to guides
- **[Pipeline Syntax Reference](reference/pipeline_syntax.md)** - Complete syntax specification
- **[Operator Catalog](reference/operator_catalog.md)** - 270+ operators listed
- **[Generator Keywords](reference/generator_keywords.md)** - `_or_`, `_range_`, `_choice_`
- **[Developer Guide](developer/index.md)** - Architecture and internals
- **[API Reference](api/modules.rst)** - Auto-generated API docs

---

**Version**: 0.6.x | **Python**: 3.11+ | **License**: CeCILL-2.1