# Configuration Reference

This page provides the complete specification for `PipelineConfigs` and `DatasetConfigs`.

## PipelineConfigs

`PipelineConfigs` defines the processing pipeline: preprocessing steps, cross-validation, and models.

### Constructor

```python
from nirs4all.pipeline import PipelineConfigs

config = PipelineConfigs(
    definition,                   # Pipeline definition (list, dict, or path)
    name="",                      # Pipeline name
    description="",               # Optional description
    max_generation_count=10000    # Maximum pipeline variants to generate
)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `definition` | list, dict, str | *required* | Pipeline steps as list, dict with `pipeline` key, or path to YAML/JSON |
| `name` | str | `""` | Pipeline name (used in artifacts and results) |
| `description` | str | `""` | Human-readable description |
| `max_generation_count` | int | `10000` | Maximum number of pipeline variants that generators may produce |

### Definition Formats

#### List of Steps (Recommended)

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit

pipeline = PipelineConfigs([
    MinMaxScaler(),
    ShuffleSplit(n_splits=3),
    {"model": PLSRegression(n_components=10)}
], name="MyPipeline")
```

#### Dictionary with Pipeline Key

```python
pipeline = PipelineConfigs({
    "pipeline": [
        MinMaxScaler(),
        ShuffleSplit(n_splits=3),
        {"model": PLSRegression(n_components=10)}
    ]
}, name="MyPipeline")
```

#### YAML File Path

```python
pipeline = PipelineConfigs("config/pipeline.yaml", name="MyPipeline")
```

**pipeline.yaml:**
```yaml
pipeline:
  - class: sklearn.preprocessing.MinMaxScaler
  - class: sklearn.model_selection.ShuffleSplit
    params:
      n_splits: 3
  - model:
      class: sklearn.cross_decomposition.PLSRegression
      params:
        n_components: 10
```

#### JSON File Path

```python
pipeline = PipelineConfigs("config/pipeline.json", name="MyPipeline")
```
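This page does not show the contents of a JSON definition. The sketch below assumes the JSON layout mirrors the YAML example key for key (`class`, `params`, `model`), which this page does not confirm. It writes such a file with the standard library and reads it back.

```python
import json
import tempfile
from pathlib import Path

# Assumed JSON layout: the same keys as the YAML example (class / params / model).
pipeline_definition = {
    "pipeline": [
        {"class": "sklearn.preprocessing.MinMaxScaler"},
        {"class": "sklearn.model_selection.ShuffleSplit", "params": {"n_splits": 3}},
        {
            "model": {
                "class": "sklearn.cross_decomposition.PLSRegression",
                "params": {"n_components": 10},
            }
        },
    ]
}

# Write to a temporary directory so the sketch leaves the project untouched.
config_path = Path(tempfile.mkdtemp()) / "pipeline.json"
config_path.write_text(json.dumps(pipeline_definition, indent=2), encoding="utf-8")

# Reading the file back yields the same nested structure.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(loaded["pipeline"][1]["params"])
```

The resulting path could then be passed as the `definition` argument, as in the snippet above.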

### Step Serialization

Steps are serialized to a canonical dictionary form. In the table, `"..."` abbreviates the fully qualified class path:

| Input | Serialized Form |
|-------|-----------------|
| `MinMaxScaler()` | `{"class": "sklearn.preprocessing.MinMaxScaler"}` |
| `PLSRegression(n_components=10)` | `{"class": "...", "params": {"n_components": 10}}` |
| `{"model": PLSRegression()}` | `{"model": {"class": "..."}}` |
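The table suggests that only non-default constructor arguments end up under `params`. The following is a minimal sketch of that idea, not the nirs4all implementation: `DemoScaler` and `serialize_step` are hypothetical names, and the stand-in class avoids any third-party dependency.

```python
import inspect


class DemoScaler:
    """Hypothetical stand-in for an estimator such as MinMaxScaler."""

    def __init__(self, feature_range=(0, 1), clip=False):
        self.feature_range = feature_range
        self.clip = clip


def serialize_step(step):
    """Return a canonical dict for a step (illustrative only)."""
    # A keyword wrapper such as {"model": obj} keeps its key and serializes the value.
    if isinstance(step, dict):
        return {key: serialize_step(value) for key, value in step.items()}

    cls = type(step)
    serialized = {"class": f"{cls.__module__}.{cls.__qualname__}"}

    # Keep only constructor arguments whose current value differs from the default.
    params = {}
    for name, param in inspect.signature(cls.__init__).parameters.items():
        if name == "self":
            continue
        value = getattr(step, name)
        if value != param.default:
            params[name] = value
    if params:
        serialized["params"] = params
    return serialized


print(serialize_step(DemoScaler()))
print(serialize_step({"model": DemoScaler(clip=True)}))
```

This sketch relies on the scikit-learn convention that each constructor argument is stored under an attribute of the same name.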

### Accessing Pipeline Configurations

```python
pipeline = PipelineConfigs([...], name="MyPipeline")

# Access expanded configurations (list of step lists)
pipeline.steps           # List of step configurations

# Access names (includes hash for uniqueness)
pipeline.names           # ["MyPipeline_a1b2c3"]

# Check whether generators were expanded
pipeline.has_configurations  # True if generators (_or_, _range_) were expanded
```
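To illustrate how one definition can yield several named configurations, here is a rough sketch of generator expansion. It assumes that `_or_` wraps a list of alternatives, that exceeding `max_generation_count` raises an error, and that the name suffix is a short content hash. All three are assumptions made for illustration, and `expand_or` and `variant_name` are hypothetical helpers.

```python
import hashlib
import itertools
import json


def expand_or(steps, max_generation_count=10000):
    """Expand {"_or_": [...]} steps into concrete step lists (illustrative sketch)."""
    # Each step contributes either its alternatives or just itself.
    choices = [
        step["_or_"] if isinstance(step, dict) and "_or_" in step else [step]
        for step in steps
    ]
    total = 1
    for options in choices:
        total *= len(options)
    if total > max_generation_count:
        raise ValueError(
            f"{total} variants exceed max_generation_count={max_generation_count}"
        )
    # The Cartesian product of all choices gives every variant.
    return [list(combo) for combo in itertools.product(*choices)]


def variant_name(base, steps):
    """Append a short content hash to the base name (hash scheme is an assumption)."""
    payload = json.dumps(steps, sort_keys=True).encode("utf-8")
    return f"{base}_{hashlib.sha256(payload).hexdigest()[:6]}"


# Step labels are plain strings here to keep the sketch dependency-free.
template = [
    {"_or_": ["MinMaxScaler", "StandardScaler"]},
    "ShuffleSplit",
    {"_or_": ["PLS_5", "PLS_10", "PLS_15"]},
]
variants = expand_or(template)
names = [variant_name("MyPipeline", variant) for variant in variants]
print(len(variants), names[0])
```

Two alternatives for the first step and three for the last give six variants, each with its own suffixed name.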

---

## DatasetConfigs

`DatasetConfigs` defines how to load and configure datasets.

### Constructor

```python
from nirs4all.data import DatasetConfigs

dataset = DatasetConfigs(
    configurations,              # Path(s) or configuration dict(s)
    task_type="auto",            # Force task type
    signal_type=None,            # Override signal type
    aggregate=None,              # Aggregation column or True
    aggregate_method=None,       # Aggregation method
    aggregate_exclude_outliers=None  # Exclude outliers before aggregation
)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `configurations` | str, dict, list | *required* | Path, config dict, or list of either |
| `task_type` | str, list | `"auto"` | Task type per dataset |
| `signal_type` | str, list | `None` | Signal type override |
| `aggregate` | str, bool, list | `None` | Metadata column to aggregate by, or `True` |
| `aggregate_method` | str, list | `None` | Aggregation method: `"mean"`, `"median"`, or `"vote"` |
| `aggregate_exclude_outliers` | bool, list | `None` | Exclude outliers via T² |
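The three aggregation methods in the table can be pictured as reducers applied to values that share the same group ID. The sketch below is a standard-library illustration under that assumption. `aggregate_by_group` is a hypothetical helper, and outlier exclusion is not shown.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def aggregate_by_group(group_ids, values, method="mean"):
    """Collapse repeated measurements to one value per group ID (illustrative)."""
    groups = defaultdict(list)
    for group_id, value in zip(group_ids, values):
        groups[group_id].append(value)

    reducers = {
        "mean": mean,
        "median": median,
        # Majority vote: the most frequent value wins (suited to class labels).
        "vote": lambda group: Counter(group).most_common(1)[0][0],
    }
    reduce_group = reducers[method]
    return {group_id: reduce_group(group) for group_id, group in groups.items()}


ids = ["s1", "s1", "s1", "s2", "s2"]
print(aggregate_by_group(ids, [1.0, 2.0, 6.0, 4.0, 8.0], method="mean"))
print(aggregate_by_group(ids, ["a", "b", "a", "c", "c"], method="vote"))
```

`"mean"` and `"median"` make sense for continuous values, while `"vote"` is the natural choice for class labels.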

### Configuration Dictionary Keys

#### Data File Keys

| Key | Description | Example |
|-----|-------------|---------|
| `train_x` | Training features | `"spectra_train.csv"` |
| `train_y` | Training targets | `"targets_train.csv"` |
| `train_m` | Training metadata | `"metadata_train.csv"` |
| `test_x` | Test features | `"spectra_test.csv"` |
| `test_y` | Test targets | `"targets_test.csv"` |
| `test_m` | Test metadata | `"metadata_test.csv"` |

#### Parameter Keys

| Key | Description |
|-----|-------------|
| `train_x_params` | Parameters for `train_x` file |
| `train_y_params` | Parameters for `train_y` file |
| `test_x_params` | Parameters for `test_x` file |
| `global_params` | Parameters applied to all files |

#### File Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `delimiter` | str | `","` | Column separator |
| `decimal_separator` | str | `"."` | Decimal point character |
| `has_header` | bool | `True` | First row is header |
| `header_unit` | str | `"auto"` | How column headers are interpreted (see Header Unit Options) |
| `signal_type` | str | `"auto"` | Spectral signal type |
| `na_policy` | str | `"drop"` | Missing value handling |
| `target_column` | str | `None` | Target column name (combined files) |
| `sheet_name` | str | `None` | Excel sheet name |
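The sketch below shows one plausible way the defaults, `global_params`, and per-file `*_params` could combine. The precedence used here (per-file over global over defaults) is an assumption, not something this page states, and `resolve_file_params` is a hypothetical helper.

```python
# Defaults copied from the File Parameters table.
DEFAULTS = {
    "delimiter": ",",
    "decimal_separator": ".",
    "has_header": True,
    "header_unit": "auto",
    "signal_type": "auto",
    "na_policy": "drop",
    "target_column": None,
    "sheet_name": None,
}


def resolve_file_params(config, file_key):
    """Merge defaults, global_params, and <file_key>_params (assumed precedence)."""
    resolved = dict(DEFAULTS)
    resolved.update(config.get("global_params", {}))
    # Per-file settings are applied last, so they win over global ones.
    resolved.update(config.get(f"{file_key}_params", {}))
    return resolved


config = {
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "global_params": {"delimiter": ";"},
    "train_x_params": {"header_unit": "nm"},
}
train_x_params = resolve_file_params(config, "train_x")
train_y_params = resolve_file_params(config, "train_y")
print(train_x_params["delimiter"], train_x_params["header_unit"])
print(train_y_params["delimiter"], train_y_params["header_unit"])
```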

#### Header Unit Options

| Value | Description |
|-------|-------------|
| `"nm"` | Wavelengths in nanometers |
| `"cm-1"` | Wavenumbers in cm⁻¹ |
| `"none"` | No header row |
| `"text"` | Text labels (ignored) |
| `"index"` | Numeric indices |
| `"auto"` | Automatic detection |
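The `"nm"` and `"cm-1"` units describe the same axis in different terms: a wavenumber in cm⁻¹ corresponds to a wavelength of 10⁷ divided by that wavenumber, in nanometers. A small sketch of parsing header labels under either unit follows. `parse_spectral_header` is a hypothetical helper, not part of the library.

```python
def parse_spectral_header(labels, header_unit):
    """Return wavelengths in nm for "nm" or "cm-1" header labels (illustrative)."""
    values = [float(label) for label in labels]
    if header_unit == "nm":
        return values
    if header_unit == "cm-1":
        # wavelength [nm] = 1e7 / wavenumber [cm^-1]
        return [1e7 / value for value in values]
    raise ValueError(f"Unsupported header_unit for this sketch: {header_unit!r}")


print(parse_spectral_header(["10000", "4000"], "cm-1"))
print(parse_spectral_header(["1100", "2500"], "nm"))
```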

#### Signal Type Options

| Value | Description |
|-------|-------------|
| `"absorbance"` | Absorbance values |
| `"reflectance"` | Reflectance as a fraction (0 to 1) |
| `"reflectance%"` | Reflectance in percent (0 to 100) |
| `"transmittance"` | Transmittance as a fraction (0 to 1) |
| `"transmittance%"` | Transmittance in percent (0 to 100) |
| `"auto"` | Automatic detection |
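These signal types are related by simple formulas: percent values are fractions scaled by 100, and (apparent) absorbance is the negative base-10 logarithm of the fraction. The sketch below applies those relations to a single reading. `to_absorbance` is a hypothetical helper, and whether nirs4all performs this conversion internally is not stated on this page.

```python
import math


def to_absorbance(value, signal_type):
    """Convert one reading to absorbance using A = -log10(fraction)."""
    if signal_type == "absorbance":
        return value
    if signal_type in ("reflectance", "transmittance"):
        fraction = value
    elif signal_type in ("reflectance%", "transmittance%"):
        # Percent scale (0 to 100) becomes a fraction (0 to 1).
        fraction = value / 100.0
    else:
        raise ValueError(f"Unknown signal_type: {signal_type!r}")
    return -math.log10(fraction)


print(to_absorbance(0.1, "reflectance"))
print(to_absorbance(10.0, "transmittance%"))
```

Both calls describe the same physical reading (10% of the light returned or transmitted), so they give the same absorbance.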

#### NA Policy Options

| Value | Description |
|-------|-------------|
| `"drop"` | Drop rows with missing values |
| `"fill_mean"` | Fill with column mean |
| `"fill_median"` | Fill with column median |
| `"fill_zero"` | Fill with zeros |
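The four policies can be sketched on a small list of rows with the standard library. This is an illustration of the behaviour described in the table, assuming missing values are represented as NaN. `apply_na_policy` is a hypothetical helper.

```python
import math
from statistics import mean, median


def apply_na_policy(rows, policy="drop"):
    """Handle NaN entries in a list of numeric rows (illustrative sketch)."""
    if policy == "drop":
        return [row for row in rows if not any(math.isnan(v) for v in row)]

    if policy == "fill_zero":
        fill_values = [0.0] * len(rows[0])
    else:
        reducer = {"fill_mean": mean, "fill_median": median}[policy]
        # Column statistics are computed from the non-missing entries only.
        fill_values = [
            reducer([v for v in column if not math.isnan(v)])
            for column in zip(*rows)
        ]

    return [
        [fill if math.isnan(v) else v for v, fill in zip(row, fill_values)]
        for row in rows
    ]


nan = float("nan")
rows = [[1.0, nan], [2.0, 4.0], [3.0, 5.0], [4.0, 12.0]]
print(apply_na_policy(rows, "drop"))
print(apply_na_policy(rows, "fill_mean")[0])
print(apply_na_policy(rows, "fill_median")[0])
```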

### Task Type Options

| Value | Description |
|-------|-------------|
| `"auto"` | Auto-detect from targets |
| `"regression"` | Continuous target prediction |
| `"binary_classification"` | Two-class classification |
| `"multiclass_classification"` | Multi-class classification |
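This page does not describe how `"auto"` decides between these values. The heuristic below is purely an illustrative assumption of what such detection might look like. The `max_classes` threshold and the `infer_task_type` helper are hypothetical.

```python
def infer_task_type(targets, max_classes=20):
    """Guess a task type from target values (assumed heuristic, not the library's)."""
    unique = set(targets)
    numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in unique
    )
    # Non-integer values, or many distinct numeric values, suggest a continuous target.
    if numeric and (
        len(unique) > max_classes
        or any(not float(value).is_integer() for value in unique)
    ):
        return "regression"
    if len(unique) == 2:
        return "binary_classification"
    return "multiclass_classification"


print(infer_task_type([12.4, 13.1, 11.8]))
print(infer_task_type([0, 1, 1, 0]))
print(infer_task_type(["low", "mid", "high"]))
```

When such a guess would be ambiguous, for example integer-valued regression targets, passing `task_type` explicitly avoids relying on detection.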

### Configuration Examples

#### Simple Path

```python
dataset = DatasetConfigs("path/to/data/")
```

#### Explicit Files

```python
dataset = DatasetConfigs({
    "train_x": "spectra_train.csv",
    "train_y": "targets_train.csv",
    "test_x": "spectra_test.csv",
    "test_y": "targets_test.csv"
})
```

#### With Parameters

```python
dataset = DatasetConfigs({
    "train_x": "spectra.csv",
    "train_y": "targets.csv",
    "train_x_params": {
        "header_unit": "nm",
        "signal_type": "reflectance",
        "delimiter": ";"
    },
    "train_y_params": {
        "has_header": True
    }
})
```

#### Multi-Source Dataset

```python
dataset = DatasetConfigs({
    "train_x": ["nir_spectra.csv", "markers.csv"],
    "train_y": "targets.csv",
    "train_x_params": [
        {"header_unit": "nm", "signal_type": "reflectance"},
        {"header_unit": "text"}
    ]
})
```
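In the multi-source example, the list under `train_x_params` lines up position by position with the list under `train_x`. The sketch below makes that pairing explicit. It assumes a single dict would apply to every source and that mismatched lengths are an error; both are assumptions, and `pair_sources` is a hypothetical helper.

```python
def pair_sources(config, file_key="train_x"):
    """Pair each source file with its parameter dict (illustrative sketch)."""
    paths = config[file_key]
    params = config.get(f"{file_key}_params", {})
    if isinstance(paths, str):
        paths = [paths]
    if isinstance(params, dict):
        # Assumption: one dict is shared by every source.
        params = [params] * len(paths)
    if len(params) != len(paths):
        raise ValueError(
            f"{file_key}: {len(paths)} sources but {len(params)} parameter sets"
        )
    return list(zip(paths, params))


config = {
    "train_x": ["nir_spectra.csv", "markers.csv"],
    "train_y": "targets.csv",
    "train_x_params": [
        {"header_unit": "nm", "signal_type": "reflectance"},
        {"header_unit": "text"},
    ],
}
for path, params in pair_sources(config):
    print(path, params)
```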

#### Multiple Datasets

```python
dataset = DatasetConfigs([
    "dataset1/",
    "dataset2/",
    {"train_x": "custom/spectra.csv", "train_y": "custom/targets.csv"}
])
```

#### With Aggregation

```python
dataset = DatasetConfigs(
    "path/to/data/",
    aggregate="sample_id",           # Column name in metadata
    aggregate_method="mean",         # "mean", "median", or "vote"
    aggregate_exclude_outliers=True  # Remove outliers before aggregating
)
```

### Using SpectroDataset Directly

For advanced use cases, `nirs4all.run()` also accepts `SpectroDataset` instances directly, bypassing `DatasetConfigs`:

```python
from nirs4all.data import SpectroDataset
import nirs4all

# Single SpectroDataset
result = nirs4all.run(pipeline, my_spectro_dataset)

# Multiple SpectroDataset instances (multi-dataset run)
result = nirs4all.run(pipeline, [dataset1, dataset2, dataset3])
```

This is useful when:
- Using synthetic data generators that return `SpectroDataset`
- Programmatically constructing datasets
- Chaining pipeline runs with transformed data

### Accessing Dataset Data

```python
dataset = DatasetConfigs("path/to/data/")

# Iterate over datasets
for ds in dataset.iter_datasets():
    print(f"Dataset: {ds.name}")
    print(f"  Samples: {len(ds)}")
    print(f"  Features: {ds.n_features}")
    print(f"  Task: {ds.task_type}")

# Get specific dataset by index
ds = dataset.get_dataset_at(0)

# Get all datasets as list
all_datasets = dataset.get_datasets()
```

---

## Complete Examples

### Full Pipeline Configuration

```python
from nirs4all.pipeline import PipelineConfigs, PipelineRunner
from nirs4all.data import DatasetConfigs
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit
from nirs4all.operators.transforms import StandardNormalVariate

# Pipeline configuration
pipeline = PipelineConfigs([
    MinMaxScaler(),
    StandardNormalVariate(),
    {"y_processing": MinMaxScaler()},
    ShuffleSplit(n_splits=5, test_size=0.25, random_state=42),
    {"model": PLSRegression(n_components=10)}
], name="ProductionPipeline", description="NIR protein prediction model")

# Dataset configuration
dataset = DatasetConfigs({
    "train_x": "data/spectra.csv",
    "train_y": "data/protein.csv",
    "train_m": "data/samples.csv",
    "train_x_params": {
        "header_unit": "nm",
        "signal_type": "reflectance",
        "delimiter": ","
    }
}, task_type="regression", aggregate="sample_id")

# Run
runner = PipelineRunner(verbose=1, save_artifacts=True)
predictions, per_dataset = runner.run(pipeline, dataset)
```

### YAML Configuration File

**pipeline.yaml:**
```yaml
pipeline:
  # Preprocessing
  - class: sklearn.preprocessing.MinMaxScaler

  - class: nirs4all.operators.transforms.StandardNormalVariate

  # Target scaling
  - y_processing:
      class: sklearn.preprocessing.MinMaxScaler

  # Cross-validation
  - class: sklearn.model_selection.ShuffleSplit
    params:
      n_splits: 5
      test_size: 0.25
      random_state: 42

  # Model
  - model:
      class: sklearn.cross_decomposition.PLSRegression
      params:
        n_components: 10
```

**dataset.yaml:**
```yaml
train_x: data/spectra.csv
train_y: data/targets.csv
train_x_params:
  header_unit: nm
  signal_type: reflectance
  delimiter: ","
task_type: regression
```

**Python usage:**
```python
from nirs4all.pipeline import PipelineConfigs, PipelineRunner
from nirs4all.data import DatasetConfigs

pipeline = PipelineConfigs("config/pipeline.yaml", name="YAMLPipeline")
dataset = DatasetConfigs("config/dataset.yaml")

runner = PipelineRunner(verbose=1)
predictions, _ = runner.run(pipeline, dataset)
```

## See Also

- {doc}`/reference/pipeline_syntax` - Complete pipeline syntax reference
- {doc}`/reference/generator_keywords` - Generator syntax (`_or_`, `_range_`)
- {doc}`/user_guide/data/loading_data` - Data loading guide
- {doc}`/getting_started/concepts` - Core concepts overview