# Frequently Asked Questions Common questions, errors, and solutions for nirs4all. ## Installation ### How do I install nirs4all? ```bash pip install nirs4all ``` For GPU support with TensorFlow: ```bash pip install nirs4all tensorflow[and-cuda] ``` ### How do I verify my installation? ```bash nirs4all --test-install ``` This checks all dependencies and reports available frameworks. ### Which Python versions are supported? Python 3.11+ is required. ### Do I need TensorFlow, PyTorch, or JAX? No. nirs4all works with scikit-learn only. Deep learning frameworks are optional and only needed if you want to use neural network models. --- ## Data Loading ### What file formats are supported? - CSV (`.csv`) - Excel (`.xlsx`, `.xls`) - MATLAB (`.mat`) - NumPy (`.npy`, `.npz`) - Parquet (`.parquet`) See {doc}`/user_guide/data/loading_data` for details. ### How do I specify which column is the target variable? ```python from nirs4all.data import DatasetConfigs dataset = DatasetConfigs( "data.csv", y_column="concentration", # Target column name ) ``` ### How do I handle multiple data sources? ```python dataset = DatasetConfigs([ {"path": "nir.csv", "source_name": "NIR"}, {"path": "raman.csv", "source_name": "Raman"}, ]) ``` ### Error: "Could not infer target column" Your dataset doesn't have a clear target column. Specify it explicitly: ```python dataset = DatasetConfigs("data.csv", y_column="my_target") ``` ### Error: "Sample count mismatch between X and y" Your feature matrix and target array have different numbers of samples. Check your data file for: - Missing values - Misaligned rows - Header issues --- ## Pipeline Execution ### How do I run a basic pipeline? ```python import nirs4all from sklearn.cross_decomposition import PLSRegression from sklearn.model_selection import ShuffleSplit pipeline = [ ShuffleSplit(n_splits=5, test_size=0.2, random_state=42), PLSRegression(n_components=10), ] result = nirs4all.run( pipeline=pipeline, dataset="path/to/data.csv" ) ``` ### Why do I need cross-validation in my pipeline? Cross-validation (e.g., `ShuffleSplit`, `KFold`) is required to: - Split data into train/test sets - Evaluate model generalization - Generate out-of-fold predictions ### How do I save my results? Results are automatically saved when using `PipelineRunner`: ```python from nirs4all.pipeline import PipelineRunner runner = PipelineRunner( save_artifacts=True, workspace_path="workspace/" ) predictions, _ = runner.run(pipeline, dataset) ``` ### Error: "No splitter found in pipeline" Add a cross-validation splitter before your model: ```python pipeline = [ ShuffleSplit(n_splits=5, random_state=42), # Add this PLSRegression(n_components=10), ] ``` ### Error: "Pipeline must contain at least one model step" Add a model to your pipeline: ```python pipeline = [ SNV(), ShuffleSplit(n_splits=5, random_state=42), PLSRegression(n_components=10), # Model step ] ``` --- ## Preprocessing ### Which preprocessing should I use? | Data Issue | Recommended Preprocessing | |------------|---------------------------| | Baseline drift | `Detrend`, `BaselineCorrection` | | Scatter effects | `SNV`, `MSC` | | Noise | `SavitzkyGolay`, `Gaussian` | | Scale differences | `StandardScaler`, `MinMaxScaler` | | Derivatives | `FirstDerivative`, `SecondDerivative` | See {doc}`/user_guide/preprocessing/cheatsheet` for model-specific recommendations. ### Can I combine multiple preprocessings? Yes, chain them in your pipeline: ```python pipeline = [ SNV(), SavitzkyGolay(window_length=11, polyorder=2), FirstDerivative(), ShuffleSplit(n_splits=5, random_state=42), PLSRegression(n_components=10), ] ``` ### How do I compare different preprocessings? Use `feature_augmentation`: ```python pipeline = [ {"feature_augmentation": [SNV, Detrend, MSC], "action": "extend"}, ShuffleSplit(n_splits=5, random_state=42), PLSRegression(n_components=10), ] # Result will contain predictions for each preprocessing ``` --- ## Models ### What models can I use? Any scikit-learn compatible model: - **Regression**: PLSRegression, RandomForestRegressor, SVR, etc. - **Classification**: LogisticRegression, RandomForestClassifier, SVC, etc. - **Deep Learning**: nicon, decon (with TensorFlow/PyTorch/JAX) ### How do I know if my task is regression or classification? nirs4all auto-detects based on your target variable: - **Continuous values** → Regression - **Discrete categories** → Classification Override with: ```python dataset = DatasetConfigs("data.csv", task="classification") ``` ### How do I tune hyperparameters? Use `finetune_params`: ```python { "model": PLSRegression(), "finetune_params": { "n_trials": 20, "sample": "tpe", "model_params": { "n_components": ('int', 1, 20), } } } ``` See {doc}`/user_guide/models/hyperparameter_tuning` for details. ### Error: "Model does not support classification" Some models are regression-only. For classification, use: - `nicon_classification` instead of `nicon` - `RandomForestClassifier` instead of `RandomForestRegressor` --- ## Deep Learning ### How do I use neural networks? ```python from nirs4all.operators.models.tensorflow.nicon import nicon pipeline = [ MinMaxScaler(), ShuffleSplit(n_splits=3, random_state=42), { 'model': nicon, 'train_params': {'epochs': 50, 'verbose': 1} } ] ``` ### Error: "TensorFlow not installed" Install TensorFlow: ```bash pip install tensorflow ``` ### Error: "CUDA out of memory" Reduce batch size or use a smaller model: ```python { 'model': thin_nicon, # Smaller architecture 'train_params': {'batch_size': 8} # Smaller batches } ``` ### Neural network training is slow - Enable GPU: Install `tensorflow[and-cuda]` or `torch` with CUDA - Reduce epochs for quick tests - Use `hyperband` for efficient hyperparameter search --- ## Results and Visualization ### How do I access prediction results? ```python result = nirs4all.run(pipeline, dataset) # Best score print(result.best_score) # All predictions for pred in result.predictions: print(pred.get('rmse')) # Top 5 configurations for pred in result.top(5): print(pred) ``` ### How do I visualize results? ```python from nirs4all.visualization.predictions import PredictionAnalyzer analyzer = PredictionAnalyzer(result.predictions) analyzer.plot_scatter() analyzer.plot_top_k(k=10) analyzer.plot_heatmap(x_var="model_name", y_var="preprocessings") ``` ### How do I export my model for production? ```python from nirs4all.pipeline.bundle import BundleManager manager = BundleManager() manager.export( predictions=result.predictions, export_path="my_model.n4a" ) ``` --- ## Performance ### Pipeline is slow. How do I speed it up? 1. **Reduce cross-validation folds**: `n_splits=3` instead of `n_splits=10` 2. **Use fewer trials**: Lower `n_trials` in `finetune_params` 3. **Enable parallelization**: `n_jobs=-1` for sklearn models 4. **Use GPU**: For neural networks 5. **Reduce preprocessing combinations**: Fewer items in `feature_augmentation` ### How much memory does nirs4all use? Memory scales with: - Dataset size (samples × features) - Number of preprocessing variants - Model complexity - Cross-validation folds For large datasets, process in batches or reduce `n_splits`. ### Can I run pipelines in parallel? Sklearn models support `n_jobs=-1` for internal parallelization. Pipeline-level parallelism is planned for future releases. --- ## Troubleshooting ### Error: "No module named 'nirs4all'" Install nirs4all: ```bash pip install nirs4all ``` ### Error: "AttributeError: module 'nirs4all' has no attribute..." You may have an outdated version. Update: ```bash pip install --upgrade nirs4all ``` ### Plots don't display - In scripts: Add `plt.show()` at the end - In Jupyter: Use `%matplotlib inline` - Set `plots_visible=True` in `nirs4all.run()` ### Results are NaN or infinite Check your data for: - Missing values - Infinite values - Division by zero in preprocessing - Incompatible target scale ```python import numpy as np # Check data print(np.isnan(X).any()) # NaN check print(np.isinf(X).any()) # Inf check ``` ### Memory error Reduce memory usage: ```python # Smaller cross-validation ShuffleSplit(n_splits=3, test_size=0.2) # Instead of 10 folds # Process fewer variants {"feature_augmentation": [SNV, Detrend]} # Instead of 10 preprocessings ``` --- ## Best Practices ### Preprocessing 1. **Always scale for neural networks**: Use `MinMaxScaler` or `StandardScaler` 2. **SNV before derivatives**: Apply scatter correction first 3. **Don't over-preprocess**: More isn't always better 4. **Match preprocessing to model**: See {doc}`/user_guide/preprocessing/cheatsheet` ### Cross-Validation 1. **Use enough folds**: Minimum 3, recommended 5-10 2. **Set random_state**: For reproducibility 3. **Use stratification for classification**: `StratifiedKFold` 4. **Consider group structure**: Use `GroupKFold` for grouped samples ### Model Selection 1. **Start with PLS**: Reliable baseline for NIRS 2. **Compare multiple models**: Use branching 3. **Don't overtune**: More trials ≠ better results 4. **Validate on held-out data**: Don't trust only CV scores ### Reproducibility 1. **Set random seeds**: `random_state=42` everywhere 2. **Save artifacts**: `save_artifacts=True` 3. **Version your data**: Track dataset versions 4. **Export configurations**: Save pipeline YAML --- ## Getting Help ### Where can I find more examples? - `examples/user/` - User tutorials - `examples/developer/` - Advanced examples - `examples/reference/` - Reference implementations ### How do I report a bug? Open an issue on GitHub with: 1. nirs4all version (`nirs4all --version`) 2. Python version 3. Error message and traceback 4. Minimal reproducible example ### Where can I ask questions? - GitHub Discussions - GitHub Issues (for bugs) ## See Also - {doc}`/getting_started/installation` - Installation guide - {doc}`/getting_started/quickstart` - Quick start tutorial - {doc}`/user_guide/troubleshooting/migration` - Migration guides - {doc}`/user_guide/troubleshooting/dataset_troubleshooting` - Dataset troubleshooting