# SHAP Analysis for NIRS Models

## Overview

The SHAP (SHapley Additive exPlanations) module provides explainability for NIRS models by identifying which spectral regions are most important for predictions.

## Key Design Decisions

### ✅ **Spectral Visualizations Use Binned Features**

**Every spectral SHAP visualization in this module uses binned wavelengths/features**, not individual points. The summary plot is the one exception:

- **Spectral importance**: Shows binned regions on the spectrum + bar chart
- **Beeswarm plot**: Bins features before plotting
- **Waterfall plot**: Bins features before showing contributions
- **Summary plot**: Uses raw features (standard SHAP, useful for non-spectral data)

### Why Binning?

Individual wavelengths are prone to:

- **Noise and artifacts** from instrument variability
- **Overfitting** to training set peculiarities
- **Misleading peaks** at single points

**Binning creates robust spectral regions:**

- Aggregates SHAP values over multiple wavelengths
- Smooths out noise while preserving trends
- Provides interpretable regions (e.g., "1600-1620 nm")
- Scientifically meaningful (absorption bands span ranges)

### Binning Configuration

Control binning with these parameters:

```python
shap_params = {
    'bin_size': 20,           # Wavelengths per bin (default: 20)
    'bin_stride': 10,         # Step between bins (default: 10 = 50% overlap)
    'bin_aggregation': 'sum'  # How to combine SHAP values in a bin
}
```

**Per-Visualization Configuration**

You can specify different binning for each visualization by passing dictionaries:

```python
shap_params = {
    'bin_size': {
        'spectral': 20,    # Fine detail for overview
        'waterfall': 50,   # Coarser for clarity
        'beeswarm': 30     # Medium detail
    },
    'bin_stride': {
        'spectral': 10,    # 50% overlap
        'waterfall': 25,   # 50% overlap
        'beeswarm': 15     # 50% overlap
    },
    'bin_aggregation': {
        'spectral': 'sum',      # Total importance
        'waterfall': 'mean',    # Average per wavelength
        'beeswarm': 'sum_abs'   # Absolute sum
    }
}
```

**Aggregation methods:**

- `'sum'` - Sum of SHAP values (emphasizes cumulative effect)
- `'sum_abs'` - Sum of absolute values (ignores direction)
- `'mean'` - Average SHAP value (normalized by bin size)
- `'mean_abs'` - Average absolute value

**Examples:**

- `bin_size=20, bin_stride=10` → 50% overlap between bins
- `bin_size=30, bin_stride=30` → No overlap (independent bins)
- `bin_size=50, bin_stride=25` → 50% overlap with wider regions

## Visualizations

### 1. **Spectral Importance** ⭐ Main Visualization

Shows which spectral regions matter most for predictions.

```{figure} ../../assets/shap_spectral.png
:align: center
:width: 90%
:alt: SHAP Spectral Importance Plot

SHAP spectral importance plot showing important wavelength regions.
```

**Top Panel - Spectrum with Regions:**

- Black line: Mean spectrum of your data
- **Colored bands (Viridis)**: Important regions highlighted
  - Light yellow/green = moderately important
  - Dark blue/purple = highly important
  - No highlight = low importance

**Bottom Panel - Bar Chart:**

- X-axis: Wavelength (nm)
- Y-axis: Aggregated SHAP importance per bin
- **Viridis colormap**: Bars colored by importance
- Blue line: Trend across spectrum

**How it works:**

1. Computes SHAP values for each wavelength
2. Sorts wavelengths to ensure proper ordering
3. Creates overlapping bins (default: 20 wavelengths, 50% overlap)
4. Aggregates SHAP values within each bin
5. Visualizes binned importance

### 2. **Beeswarm Plot (Binned)**

SHAP beeswarm showing feature value vs. SHAP impact, **with binned features**.

```{figure} ../../assets/shap_beeswarm.png
:align: center
:width: 90%
:alt: SHAP Beeswarm Plot

SHAP beeswarm plot showing distribution of SHAP values for binned features.
```

- Each dot = one sample
- X-axis: SHAP value (impact on prediction)
- Color: Feature value (red=high, blue=low)
- Y-axis: Binned wavelength regions (sorted by importance)

**Interpretation:**

- Dense clusters = consistent behavior across samples
- Wide spread = variable impact depending on sample
- Color patterns = how feature values relate to SHAP impact

### 3. **Waterfall Plot (Binned)**

Shows how **binned features** contribute to a single prediction.

```{figure} ../../assets/shap_waterfall.png
:align: center
:width: 90%
:alt: SHAP Waterfall Plot

SHAP waterfall plot showing feature contributions to a single prediction.
```

- Starts from base value (expected value)
- Each bar = contribution from one binned region
- Red bars = push prediction higher
- Blue bars = push prediction lower
- Ends at final prediction

**Use for:**

- Understanding individual predictions
- Debugging specific samples
- Explaining predictions to stakeholders

### 4. **Summary Plot** (Raw Features)

Standard SHAP summary plot showing overall feature importance.

**Note:** This uses **individual features**, not binned.
Useful for:

- Comparing different feature types (e.g., spectra + metadata)
- Standard SHAP workflow compatibility
- Non-spectral data analysis

## Usage

### Basic Example

```python
from nirs4all.pipeline import PipelineRunner

# Train model
runner = PipelineRunner(save_artifacts=True)
predictions, _ = runner.run(pipeline_config, dataset_config)
best = predictions.top(n=1, rank_metric='rmse')[0]

# Explain with SHAP (default binning)
explainer = PipelineRunner()
shap_params = {
    'n_samples': 200,
    'visualizations': ['spectral', 'beeswarm', 'waterfall']
}
results, output_dir = explainer.explain(best, dataset_config, shap_params)
```

### Custom Binning (Same for All)

```python
shap_params = {
    'n_samples': 200,
    'visualizations': ['spectral', 'beeswarm', 'waterfall'],
    'bin_size': 50,                # Wider bins
    'bin_stride': 25,              # 50% overlap
    'bin_aggregation': 'mean_abs'  # Average absolute SHAP values
}
results, output_dir = explainer.explain(best, dataset_config, shap_params)
```

### Custom Binning (Per-Visualization)

```python
shap_params = {
    'n_samples': 200,
    'visualizations': ['spectral', 'waterfall', 'beeswarm'],
    # Different binning for each visualization
    'bin_size': {
        'spectral': 20,    # Fine-grained spectral overview
        'waterfall': 50,   # Coarser - fewer bars for clarity
        'beeswarm': 30     # Medium detail
    },
    'bin_stride': {
        'spectral': 10,    # 50% overlap
        'waterfall': 25,   # 50% overlap
        'beeswarm': 15     # 50% overlap
    },
    'bin_aggregation': {
        'spectral': 'sum',      # Total importance per region
        'waterfall': 'mean',    # Average per wavelength
        'beeswarm': 'sum_abs'   # Absolute importance
    }
}
results, output_dir = explainer.explain(best, dataset_config, shap_params)
```

**Why use different binning per visualization?**

- **Spectral**: Fine detail to see all important regions
- **Waterfall**: Coarser bins → fewer bars → easier to interpret
- **Beeswarm**: Medium bins → balance between detail and readability

### All Parameters

```python
shap_params = {
    # SHAP computation
    'n_samples': 200,          # Background samples (default: 200)
    'explainer_type': 'auto',  # 'auto', 'tree', 'linear', 'deep', 'kernel'

    # Visualizations ('summary' uses raw features; the others are binned)
    'visualizations': ['spectral', 'summary', 'waterfall', 'beeswarm'],

    # Binning configuration - can be int/str OR dict
    'bin_size': 20,            # int: same for all, dict: per-viz
    'bin_stride': 10,          # int: same for all, dict: per-viz
    'bin_aggregation': 'sum'   # str: same for all, dict: per-viz
}
```

## Interpreting Results

### Spectral Importance

**Example: Protein Prediction Model**

If you see high importance in (approximate NIR band assignments):

- **Around 2050 nm** (dark purple band): N-H stretch + amide II combination
- **Around 2180 nm** (dark purple band): Amide I + amide III combination
- **2300-2400 nm** (medium blue band): C-H combinations

✅ **Model learned chemically meaningful features!**

**Troubleshooting:**

- **High importance in unexpected regions** → May indicate artifacts or preprocessing issues
- **Uniform importance everywhere** → Model might be overfitting or data is too noisy
- **Importance at spectrum edges** → Check for edge effects from preprocessing

### Beeswarm/Waterfall (Binned)

Look for:

- **Consistent patterns** → Reliable spectral regions
- **Variable contributions** → Context-dependent regions
- **Strong push in one direction** → Dominant spectral features

## Scientific Interpretation

### Validation Strategy

1. **Run SHAP analysis** on best model
2. **Identify top regions** from spectral importance
3. **Cross-reference with chemistry**:
   - Do important regions match known absorption bands?
   - Are they consistent with the target property?
4. **Compare models**: Do different models rely on similar regions?
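
Steps 2 and 4 of the validation strategy can be sketched outside the module as follows. This is a minimal illustration, not the module's own code: it assumes you already have raw SHAP values as a `(n_samples, n_features)` array plus the matching wavelengths, and the helper names `bin_importance`, `top_regions`, and `region_overlap` are hypothetical. It aggregates mean absolute SHAP values over overlapping bins, keeps the top regions per model, and scores agreement between two models with a Jaccard index.

```python
import numpy as np


def bin_importance(shap_values, wavelengths, bin_size=20, bin_stride=10):
    """Sum mean |SHAP| over overlapping bins (hypothetical helper)."""
    order = np.argsort(wavelengths)
    wl = np.asarray(wavelengths)[order]
    # Mean absolute SHAP value per wavelength, across samples
    per_feature = np.abs(np.asarray(shap_values))[:, order].mean(axis=0)
    labels, scores = [], []
    # Only full bins are kept in this sketch; a trailing partial bin is dropped
    for start in range(0, len(wl) - bin_size + 1, bin_stride):
        end = start + bin_size
        labels.append((float(wl[start]), float(wl[end - 1])))
        scores.append(float(per_feature[start:end].sum()))
    return labels, np.array(scores)


def top_regions(labels, scores, k=3):
    """Return the k highest-scoring bins as a set of (start_nm, end_nm)."""
    best = np.argsort(scores)[::-1][:k]
    return {labels[i] for i in best}


def region_overlap(regions_a, regions_b):
    """Jaccard index between two sets of regions (1.0 = identical)."""
    union = regions_a | regions_b
    return len(regions_a & regions_b) / len(union) if union else 1.0


# Synthetic stand-ins for the SHAP values of two models (illustrative only)
wavelengths = np.linspace(1000, 2500, 300)
shap_a = np.full((10, 300), 0.01)
shap_a[:, 100:140] = 1.0   # model A relies on features 100-139
shap_b = np.full((10, 300), 0.01)
shap_b[:, 110:150] = 1.0   # model B relies on features 110-149

top_a = top_regions(*bin_importance(shap_a, wavelengths))
top_b = top_regions(*bin_importance(shap_b, wavelengths))
overlap = region_overlap(top_a, top_b)
print(f"Shared top regions: {sorted(top_a & top_b)}")
print(f"Jaccard overlap: {overlap:.2f}")  # 0.50 for this synthetic data
```

A high overlap suggests the models rely on the same spectral regions. A low overlap is worth checking against known absorption bands before trusting either model.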
### Example Workflow

```python
# Explain top 3 models
top_models = predictions.top(n=3, rank_metric='rmse', rank_partition='test')

for model in top_models:
    results, output_dir = explainer.explain(model, dataset_config, shap_params)
    # Compare which regions are consistently important
```

## Advanced Usage

### Experiment with Bin Sizes

```python
# Try different bin sizes to find optimal resolution
for bin_size in [10, 20, 30, 50]:
    shap_params['bin_size'] = bin_size
    shap_params['bin_stride'] = bin_size // 2  # Always 50% overlap
    results, _ = explainer.explain(best, dataset_config, shap_params)
    # Compare results
```

### Compare Aggregation Methods

```python
for agg in ['sum', 'sum_abs', 'mean', 'mean_abs']:
    shap_params['bin_aggregation'] = agg
    results, _ = explainer.explain(best, dataset_config, shap_params)
    # See which gives clearest insights
```

## Technical Details

### SHAP Value Calculation

1. Select appropriate explainer (Tree/Linear/Deep/Kernel)
2. Compute SHAP values for each sample × feature
3. Store raw values for later binning

### Binning Process (for spectral/beeswarm/waterfall)

1. Extract wavelengths from feature names (λXXX.X format)
2. Sort features by wavelength
3. Create overlapping bins:
   ```python
   n_features, bin_size, bin_stride = 200, 20, 10  # illustrative values
   # Bin starts: 0, bin_stride, 2*bin_stride, ... (trailing partial-bin handling not shown)
   bin_starts = list(range(0, n_features - bin_size + 1, bin_stride))
   bin_ends = [start + bin_size for start in bin_starts]  # exclusive end indices
   ```
4. Aggregate SHAP values per bin using selected method
5. Create bin labels (e.g., "1650.0-1670.0 nm")
6. Generate visualization with binned data

### Explainer Selection

- **Tree models** (RF, GBM, XGBoost): TreeExplainer (fast, exact)
- **Linear models** (Ridge, Lasso, PLS): LinearExplainer (fast)
- **Neural networks**: DeepExplainer
- **Others**: KernelExplainer (slower but universal)

## Plotting Behavior

**All plots are blocking** - execution pauses until you close the plot window.
This allows you to:

- Examine each visualization carefully
- Save screenshots manually
- Compare visualizations side-by-side

Plots are also **automatically saved** to:

```
results///explanations//
├── spectral_importance.png
├── summary.png
├── waterfall_binned.png
└── beeswarm_binned.png
```

## FAQ

**Q: Why aren't individual wavelengths shown in beeswarm/waterfall?**
A: Individual wavelengths are too noisy. Binning creates robust, interpretable regions.

**Q: How do I choose bin_size?**
A: Start with 20 (default). Increase for broader patterns, decrease for finer detail.

**Q: What's the difference between sum and mean aggregation?**
A: Sum emphasizes cumulative importance, mean normalizes by bin size.

**Q: Can I use raw SHAP values without binning?**
A: Yes - use `plot_summary()` which shows individual features. But for spectral data, binning is strongly recommended.

**Q: Why does the spectral plot use Viridis colormap?**
A: Viridis is perceptually uniform, colorblind-friendly, and works well in grayscale.

## References

- Lundberg & Lee (2017). "A Unified Approach to Interpreting Model Predictions" (NIPS)
- SHAP documentation: https://shap.readthedocs.io/
- Viridis colormap: https://matplotlib.org/stable/tutorials/colors/colormaps.html