# Getting Started Examples This section introduces the fundamentals of NIRS4ALL through a series of progressive examples. Each example builds upon the previous one, guiding you from your first pipeline to advanced visualization techniques. ```{contents} On this page :local: :depth: 2 ``` ## Overview | Example | Topic | Difficulty | Duration | |---------|-------|------------|----------| | [U01](#u01-hello-world) | Hello World | ★☆☆☆☆ | ~1 min | | [U02](#u02-basic-regression) | Basic Regression | ★★☆☆☆ | ~3 min | | [U03](#u03-basic-classification) | Basic Classification | ★★☆☆☆ | ~2 min | | [U04](#u04-visualization) | Visualization | ★★☆☆☆ | ~3 min | --- ## U01: Hello World **Your first NIRS4ALL pipeline in about 20 lines of code.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/01_getting_started/U01_hello_world.py) ### What You'll Learn - Using `nirs4all.run()` to train a pipeline - The structure of a minimal pipeline - Reading results from the `RunResult` object ### Key Concepts A pipeline in NIRS4ALL is simply a **list of processing steps**: ```python from sklearn.cross_decomposition import PLSRegression from sklearn.model_selection import ShuffleSplit from sklearn.preprocessing import MinMaxScaler import nirs4all # Define the pipeline as a list of steps pipeline = [ MinMaxScaler(), # Feature scaling {"y_processing": MinMaxScaler()}, # Target scaling ShuffleSplit(n_splits=3, test_size=0.25), # Cross-validation {"model": PLSRegression(n_components=10)} # Model ] # Run with one simple call result = nirs4all.run( pipeline=pipeline, dataset="sample_data/regression", name="HelloWorld", verbose=1 ) # Access results print(f"Best Score (MSE): {result.best_score:.4f}") ``` ### The RunResult Object The `result` object provides convenient accessors: | Accessor | Description | |----------|-------------| | `result.best_score` | Best model's primary score (MSE by default) | | `result.best` | Best prediction entry as a dictionary | | `result.top(n)` | Top N predictions ranked by score | | `result.predictions` | Full Predictions object for analysis | ### Tips for Beginners 1. **Start simple**: Begin with a basic pipeline and add complexity gradually 2. **Use verbose=1**: See what's happening during training 3. **Check top models**: Use `result.top(n=5)` to compare performance --- ## U02: Basic Regression **A complete regression pipeline with NIRS-specific preprocessing and visualization.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/01_getting_started/U02_basic_regression.py) ### What You'll Learn - NIRS-specific preprocessing (SNV, Detrend, Derivatives, Gaussian) - Feature augmentation to explore preprocessing combinations - Using `PredictionAnalyzer` for result visualization - Comparing models with different `n_components` ### NIRS Preprocessing Options NIRS4ALL provides specialized transforms for spectral data: | Transform | Purpose | When to Use | |-----------|---------|-------------| | `StandardNormalVariate` (SNV) | Scatter correction | Path length variations | | `MultiplicativeScatterCorrection` (MSC) | Scatter correction | Reference-based correction | | `Detrend` | Baseline correction | Polynomial drift removal | | `FirstDerivative` | Enhance peaks, remove baseline | Constant baseline issues | | `SavitzkyGolay` | Smoothing + derivatives | Noisy data | | `Gaussian` | Smoothing | Noise reduction | | `Haar` | Wavelet transform | Multi-resolution analysis | ### Feature Augmentation Instead of manually defining multiple pipelines, use **feature augmentation** to explore combinations: ```python pipeline = [ MinMaxScaler(), {"y_processing": MinMaxScaler()}, # Generate 3 preprocessing combinations from 5 options { "feature_augmentation": { "_or_": [Detrend, FirstDerivative, Gaussian, SavitzkyGolay, Haar], "pick": 2, # Pick 2 at a time "count": 3 # Generate 3 combinations } }, ShuffleSplit(n_splits=3, test_size=0.25), {"model": PLSRegression(n_components=10)} ] ``` ### Visualization with PredictionAnalyzer ```python from nirs4all.visualization.predictions import PredictionAnalyzer analyzer = PredictionAnalyzer(result.predictions) # Compare top K models analyzer.plot_top_k(k=3, rank_metric='rmse') # Heatmap: models vs preprocessing analyzer.plot_heatmap(x_var="model_name", y_var="preprocessings") # Performance distribution analyzer.plot_candlestick(variable="model_name") ``` --- ## U03: Basic Classification **Classification pipeline with Random Forest, XGBoost, and confusion matrix visualization.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/01_getting_started/U03_basic_classification.py) ### What You'll Learn - Setting up a classification pipeline - Using multiple classifiers (Random Forest, XGBoost) - Confusion matrix visualization - Classification metrics (accuracy, balanced recall) ### Classification Pipeline Structure ```python from sklearn.ensemble import RandomForestClassifier from sklearn.discriminant_analysis import StandardScaler pipeline = [ # Feature augmentation with preprocessing options {"feature_augmentation": [ FirstDerivative, StandardNormalVariate, Haar, MultiplicativeScatterCorrection ]}, StandardScaler(), ShuffleSplit(n_splits=3, test_size=0.25), # Classifier {"model": RandomForestClassifier(n_estimators=50, max_depth=8)} ] ``` ### Classification Metrics | Metric | Description | Use Case | |--------|-------------|----------| | `accuracy` | Overall correct predictions | Balanced classes | | `balanced_recall` | Average recall per class | Imbalanced classes | | `balanced_accuracy` | Average accuracy per class | Class imbalance | ### Confusion Matrix Visualization ```python # Plot confusion matrices for top 4 classifiers analyzer.plot_confusion_matrix( k=4, rank_metric='accuracy', rank_partition='val', display_partition='test' ) ``` --- ## U04: Visualization **A comprehensive tour of all visualization options in NIRS4ALL.** [📄 View source code](https://github.com/GBeurier/nirs4all/blob/main/examples/user/01_getting_started/U04_visualization.py) ### What You'll Learn - All PredictionAnalyzer methods and options - Heatmaps, candlestick charts, histograms - Top-k comparison plots - Ranking vs display partition configuration ### Available Visualizations #### Top-K Comparison ```python # Basic top-k plot analyzer.plot_top_k(k=3, rank_metric='rmse') # Rank by test partition, display R² analyzer.plot_top_k(k=3, rank_metric='r2', rank_partition='test') ``` #### Heatmaps Create 2D comparisons between any two variables: ```python # Model vs preprocessing analyzer.plot_heatmap(x_var="model_name", y_var="preprocessings") # Model vs dataset analyzer.plot_heatmap(x_var="model_name", y_var="dataset_name", display_metric="r2") # Model vs fold with counts analyzer.plot_heatmap(x_var="model_name", y_var="fold_id", show_counts=True) ``` #### Candlestick Charts Show performance distribution per category: ```python analyzer.plot_candlestick(variable="model_name", display_metric='rmse') analyzer.plot_candlestick(variable="dataset_name", display_metric='r2') ``` #### Histograms ```python analyzer.plot_histogram(display_metric='rmse') analyzer.plot_histogram(display_metric='r2') ``` ### Ranking vs Display: A Key Concept You can **separate ranking from display**: | Parameter | Purpose | |-----------|---------| | `rank_metric` + `rank_partition` | Determines which models are "best" | | `display_metric` + `display_partition` | What values to show | ```python # Rank by validation RMSE, but display test R² analyzer.plot_heatmap( x_var="model_name", y_var="preprocessings", rank_metric='rmse', rank_partition='val', display_metric='r2', display_partition='test' ) ``` ### Aggregation Options | Option | Description | |--------|-------------| | `'best'` | Show best score for each cell | | `'mean'` | Show mean score | | `'median'` | Show median score | --- ## Running These Examples ```bash cd examples # Run all getting started examples ./run.sh -n "U0*.py" -c user # Run a specific example python user/01_getting_started/U01_hello_world.py # Enable plots python user/01_getting_started/U02_basic_regression.py --plots --show ``` ## Next Steps After completing these examples: - **Data Handling**: Learn different input formats and multi-dataset analysis - **Preprocessing**: Deep dive into NIRS-specific transformations - **Models**: Compare multiple models and hyperparameter tuning