Frequently Asked Questions

Common questions, errors, and solutions for nirs4all.

Installation

How do I install nirs4all?

pip install nirs4all

For GPU support with TensorFlow:

pip install nirs4all "tensorflow[and-cuda]"

How do I verify my installation?

nirs4all --test-install

This checks all dependencies and reports available frameworks.

Which Python versions are supported?

Python 3.11+ is required.

Do I need TensorFlow, PyTorch, or JAX?

No. nirs4all runs with scikit-learn alone. Deep learning frameworks are optional and are only needed for neural network models.


Data Loading

What file formats are supported?

  • CSV (.csv)

  • Excel (.xlsx, .xls)

  • MATLAB (.mat)

  • NumPy (.npy, .npz)

  • Parquet (.parquet)

See Loading Data for details.

How do I specify which column is the target variable?

from nirs4all.data import DatasetConfigs

dataset = DatasetConfigs(
    "data.csv",
    y_column="concentration",  # Target column name
)

How do I handle multiple data sources?

dataset = DatasetConfigs([
    {"path": "nir.csv", "source_name": "NIR"},
    {"path": "raman.csv", "source_name": "Raman"},
])

Error: “Could not infer target column”

Your dataset doesn’t have a clear target column. Specify it explicitly:

dataset = DatasetConfigs("data.csv", y_column="my_target")

Error: “Sample count mismatch between X and y”

Your feature matrix and target array have different numbers of samples. Check your data file for:

  • Missing values

  • Misaligned rows

  • Header issues
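The checks above can be sketched with the standard library alone. This is a hedged illustration, not part of nirs4all: the helper name find_row_problems and the sample column names are hypothetical, and the function only flags rows whose field count differs from the header or that contain an empty cell.

```python
import csv
import io


def find_row_problems(csv_text, delimiter=","):
    """Return (line_number, message) pairs for rows that look misaligned or incomplete."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(
                (line_number, f"expected {len(header)} fields, found {len(row)}")
            )
        elif any(cell.strip() == "" for cell in row):
            problems.append((line_number, "empty cell"))
    return problems


# Hypothetical file content: two spectral columns and a target column.
sample = "w1000,w1002,concentration\n0.11,0.12,4.2\n0.13,,3.9\n0.10,0.14\n"
for line_number, message in find_row_problems(sample):
    print(f"line {line_number}: {message}")
```

For a file on disk, pass the result of open(path).read() instead of the inline string.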


Pipeline Execution

How do I run a basic pipeline?

import nirs4all
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit

pipeline = [
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
    PLSRegression(n_components=10),
]

result = nirs4all.run(
    pipeline=pipeline,
    dataset="path/to/data.csv"
)

Why do I need cross-validation in my pipeline?

A cross-validation splitter (e.g., ShuffleSplit, KFold) is required to:

  • Split data into train/test sets

  • Evaluate model generalization

  • Generate out-of-fold predictions

How do I save my results?

Results are saved automatically when you run the pipeline through PipelineRunner with save_artifacts=True:

from nirs4all.pipeline import PipelineRunner

runner = PipelineRunner(
    save_artifacts=True,
    workspace_path="workspace/"
)
predictions, _ = runner.run(pipeline, dataset)

Error: “No splitter found in pipeline”

Add a cross-validation splitter before your model:

pipeline = [
    ShuffleSplit(n_splits=5, random_state=42),  # Add this
    PLSRegression(n_components=10),
]

Error: “Pipeline must contain at least one model step”

Add a model to your pipeline:

pipeline = [
    SNV(),
    ShuffleSplit(n_splits=5, random_state=42),
    PLSRegression(n_components=10),  # Model step
]

Preprocessing

Which preprocessing should I use?

Data issue           Recommended preprocessing
Baseline drift       Detrend, BaselineCorrection
Scatter effects      SNV, MSC
Noise                SavitzkyGolay, Gaussian
Scale differences    StandardScaler, MinMaxScaler
Derivatives          FirstDerivative, SecondDerivative

See Preprocessing Cheatsheet for model-specific recommendations.
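As an illustration of what scatter correction does, the following is a minimal NumPy sketch of Standard Normal Variate (SNV). It is not nirs4all's implementation: the function name snv is hypothetical, and ddof=1 is one common convention for the per-spectrum standard deviation.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)  # assumed convention: sample std
    return (spectra - mean) / std


# Two synthetic spectra with the same shape but different offset and scale,
# mimicking additive and multiplicative scatter effects.
base = np.array([0.2, 0.4, 0.9, 0.5, 0.3])
spectra = np.vstack([base, 1.5 * base + 0.1])

corrected = snv(spectra)
print(np.allclose(corrected[0], corrected[1]))  # True
```

After SNV the two spectra coincide, which is why it is listed against scatter effects. A perfectly flat spectrum has zero standard deviation and would need special handling.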

Can I combine multiple preprocessings?

Yes, chain them in your pipeline:

pipeline = [
    SNV(),
    SavitzkyGolay(window_length=11, polyorder=2),
    FirstDerivative(),
    ShuffleSplit(n_splits=5, random_state=42),
    PLSRegression(n_components=10),
]

How do I compare different preprocessings?

Use feature_augmentation:

pipeline = [
    {"feature_augmentation": [SNV, Detrend, MSC], "action": "extend"},
    ShuffleSplit(n_splits=5, random_state=42),
    PLSRegression(n_components=10),
]

# Result will contain predictions for each preprocessing

Models

What models can I use?

Any scikit-learn compatible model:

  • Regression: PLSRegression, RandomForestRegressor, SVR, etc.

  • Classification: LogisticRegression, RandomForestClassifier, SVC, etc.

  • Deep Learning: nicon, decon (with TensorFlow/PyTorch/JAX)

How do I know if my task is regression or classification?

nirs4all auto-detects based on your target variable:

  • Continuous values → Regression

  • Discrete categories → Classification

Override with:

dataset = DatasetConfigs("data.csv", task="classification")
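The exact detection rules nirs4all applies are not described here. As an independent sanity check, scikit-learn's type_of_target reports how it would categorize a target array. This is scikit-learn's logic, not nirs4all's, and the example targets are made up.

```python
from sklearn.utils.multiclass import type_of_target

print(type_of_target([4.2, 3.9, 5.1]))           # continuous
print(type_of_target([0, 1, 2, 1]))              # multiclass
print(type_of_target(["low", "high", "low"]))    # binary

# Caveat: whole-number floats look like class labels.
print(type_of_target([1.0, 2.0, 3.0]))           # multiclass
```

The last case shows when an explicit task setting is worth adding: a continuous quantity recorded as whole numbers can be mistaken for a set of classes.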

How do I tune hyperparameters?

Use finetune_params:

{
    "model": PLSRegression(),
    "finetune_params": {
        "n_trials": 20,
        "sample": "tpe",
        "model_params": {
            "n_components": ('int', 1, 20),
        }
    }
}

See Hyperparameter Tuning for details.

Error: “Model does not support classification”

Some models are regression-only. For classification, use:

  • nicon_classification instead of nicon

  • RandomForestClassifier instead of RandomForestRegressor


Deep Learning

How do I use neural networks?

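from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import MinMaxScaler

from nirs4all.operators.models.tensorflow.nicon import nicon

pipeline = [
    MinMaxScaler(),  # Scale inputs before the neural network
    ShuffleSplit(n_splits=3, random_state=42),
    {
        'model': nicon,
        'train_params': {'epochs': 50, 'verbose': 1}
    }
]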

Error: “TensorFlow not installed”

Install TensorFlow:

pip install tensorflow

Error: “CUDA out of memory”

Reduce batch size or use a smaller model:

{
    'model': thin_nicon,  # Smaller architecture
    'train_params': {'batch_size': 8}  # Smaller batches
}

Neural network training is slow

  • Enable GPU: Install tensorflow[and-cuda] or torch with CUDA

  • Reduce epochs for quick tests

  • Use hyperband for efficient hyperparameter search


Results and Visualization

How do I access prediction results?

result = nirs4all.run(pipeline, dataset)

# Best score
print(result.best_score)

# All predictions
for pred in result.predictions:
    print(pred.get('rmse'))

# Top 5 configurations
for pred in result.top(5):
    print(pred)

How do I visualize results?

from nirs4all.visualization.predictions import PredictionAnalyzer

analyzer = PredictionAnalyzer(result.predictions)
analyzer.plot_scatter()
analyzer.plot_top_k(k=10)
analyzer.plot_heatmap(x_var="model_name", y_var="preprocessings")

How do I export my model for production?

from nirs4all.pipeline.bundle import BundleManager

manager = BundleManager()
manager.export(
    predictions=result.predictions,
    export_path="my_model.n4a"
)

Performance

Pipeline is slow. How do I speed it up?

  1. Reduce cross-validation folds: n_splits=3 instead of n_splits=10

  2. Use fewer trials: Lower n_trials in finetune_params

  3. Enable parallelization: n_jobs=-1 for sklearn models

  4. Use GPU: For neural networks

  5. Reduce preprocessing combinations: Fewer items in feature_augmentation

How much memory does nirs4all use?

Memory scales with:

  • Dataset size (samples × features)

  • Number of preprocessing variants

  • Model complexity

  • Cross-validation folds

For large datasets, process in batches or reduce n_splits.
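A back-of-envelope estimate for the first two factors can be computed directly. This is a rough sketch under stated assumptions: values are stored as float64 (8 bytes), each preprocessing variant holds a full copy of the feature matrix, and model and fold overhead are ignored. How nirs4all actually stores variants is not specified here, and the helper name is hypothetical.

```python
def estimate_feature_memory_mb(n_samples, n_features, n_variants=1, bytes_per_value=8):
    """Rough size in MB of the feature matrices alone (float64 by default)."""
    return n_samples * n_features * n_variants * bytes_per_value / 1e6


# Hypothetical dataset: 10,000 spectra of 2,000 wavelengths.
print(estimate_feature_memory_mb(10_000, 2_000))                # 160.0
print(estimate_feature_memory_mb(10_000, 2_000, n_variants=5))  # 800.0
```

Treat the result as a lower bound; actual usage also depends on the model and on intermediate copies.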

Can I run pipelines in parallel?

Sklearn models support n_jobs=-1 for internal parallelization. Pipeline-level parallelism is planned for future releases.


Troubleshooting

Error: “No module named ‘nirs4all’”

Install nirs4all:

pip install nirs4all

Error: “AttributeError: module ‘nirs4all’ has no attribute…”

You may have an outdated version. Update:

pip install --upgrade nirs4all

Plots don’t display

  • In scripts: Add plt.show() at the end

  • In Jupyter: Use %matplotlib inline

  • Set plots_visible=True in nirs4all.run()

Results are NaN or infinite

Check your data for:

  • Missing values

  • Infinite values

  • Division by zero in preprocessing

  • Incompatible target scale

import numpy as np

# X is your feature matrix as a NumPy array (samples × features)
print(np.isnan(X).any())  # True if any value is NaN
print(np.isinf(X).any())  # True if any value is infinite
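Once a check reports a problem, it helps to know which samples are affected. The following NumPy sketch uses a small synthetic matrix (not real data) to locate rows with non-finite values and rows with zero variance, which would cause division by zero in per-spectrum scaling such as SNV.

```python
import numpy as np

# Synthetic feature matrix with deliberate problems.
X = np.array([
    [0.1, 0.2, 0.3],
    [0.4, np.nan, 0.6],
    [0.5, 0.5, 0.5],      # constant spectrum: zero variance
    [0.7, np.inf, 0.9],
])

# Rows containing at least one NaN or infinite value.
bad_rows = np.flatnonzero(~np.isfinite(X).all(axis=1))

# Rows whose standard deviation is exactly zero.
with np.errstate(invalid="ignore"):  # silence warnings from rows containing inf
    flat_rows = np.flatnonzero(np.std(X, axis=1) == 0)

print("non-finite rows:", bad_rows.tolist())  # [1, 3]
print("constant rows:", flat_rows.tolist())   # [2]
```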

Memory error

Reduce memory usage:

# Smaller cross-validation
ShuffleSplit(n_splits=3, test_size=0.2)  # Instead of n_splits=10

# Process fewer variants
{"feature_augmentation": [SNV, Detrend]}  # Instead of 10 preprocessings

Best Practices

Preprocessing

  1. Always scale for neural networks: Use MinMaxScaler or StandardScaler

  2. SNV before derivatives: Apply scatter correction first

  3. Don’t over-preprocess: More isn’t always better

  4. Match preprocessing to model: See Preprocessing Cheatsheet

Cross-Validation

  1. Use enough folds: Minimum 3, recommended 5-10

  2. Set random_state: For reproducibility

  3. Use stratification for classification: StratifiedKFold

  4. Consider group structure: Use GroupKFold for grouped samples
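The last two points can be illustrated with scikit-learn's splitters on toy labels. This is a sketch, not a nirs4all pipeline: the class balance and the group layout (four physical samples scanned three times each) are invented for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.zeros((12, 4))  # placeholder features; only labels and groups matter here

# Stratification: each test fold keeps the 2:1 class ratio of the full set.
y = np.array([0] * 8 + [1] * 4)
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for _, test_idx in skf.split(X, y):
    print("class counts in test fold:", np.bincount(y[test_idx]).tolist())

# Grouping: repeated scans of one physical sample share a group id,
# and a group never appears on both sides of a split.
groups = np.repeat(np.arange(4), 3)  # 4 samples, 3 scans each
gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print("groups shared between train and test:", sorted(shared))
```

Without grouping, replicate scans of the same sample can land in both train and test sets and make scores look better than they are.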

Model Selection

  1. Start with PLS: Reliable baseline for NIRS

  2. Compare multiple models: Use branching

  3. Don’t overtune: More trials ≠ better results

  4. Validate on held-out data: Don’t trust only CV scores
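One way to follow the last point is to set data aside before any cross-validation or tuning. The sketch below shows only the split, using scikit-learn's train_test_split on synthetic data. How a separate held-out set is passed to nirs4all is not covered here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a spectral dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 50))
y = rng.normal(size=100)

# Set aside 20% once, before any cross-validation or tuning.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_dev.shape, X_holdout.shape)  # (80, 50) (20, 50)
```

Model selection then uses only X_dev and y_dev; the held-out portion is scored once at the end.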

Reproducibility

  1. Set random seeds: random_state=42 everywhere

  2. Save artifacts: save_artifacts=True

  3. Version your data: Track dataset versions

  4. Export configurations: Save pipeline YAML


Getting Help

Where can I find more examples?

  • examples/user/ - User tutorials

  • examples/developer/ - Advanced examples

  • examples/reference/ - Reference implementations

How do I report a bug?

Open an issue on GitHub with:

  1. nirs4all version (nirs4all --version)

  2. Python version

  3. Error message and traceback

  4. Minimal reproducible example

Where can I ask questions?

  • GitHub Discussions

  • GitHub Issues (for bugs)

See Also