Metadata Usage Guide
Overview
Metadata in nirs4all allows you to store and manage auxiliary sample-level information alongside your spectral data. This includes information like:
Sample identifiers
Batch numbers
Collection locations
Instrument types
Environmental conditions (temperature, humidity)
Any other sample-specific attributes
Metadata has one row per sample, aligning with features and targets, and can be easily filtered, retrieved, and converted to numeric format for use in machine learning pipelines.
Key Concepts
What is Metadata?
Metadata is distinct from features and targets:
Features (X): Spectral measurements or input variables for modeling
Targets (Y): Response variables you want to predict
Metadata: Auxiliary information about each sample
Metadata vs Features
Metadata is typically:
Categorical or mixed-type data
Used for grouping, filtering, or as auxiliary features
Not transformed through complex preprocessing pipelines
Features are:
Numerical spectral data
Subject to preprocessing (SNV, MSC, derivatives, etc.)
Primary input to models
Loading Metadata
From Folder (Auto-detection)
Place metadata files in your dataset folder with standard naming patterns:
dataset/
├── X_train.csv # Training spectra
├── Y_train.csv # Training targets
├── M_train.csv # Training metadata ✓
├── X_test.csv # Test spectra
├── Y_test.csv # Test targets
└── M_test.csv # Test metadata ✓
Supported naming patterns:
M_train.csv, M_test.csv
Mcal.csv, Mtest.csv, Mval.csv
Meta_train.csv, Metatest.csv
metadata_train.csv, metadata_test.csv
And many more variations…
from nirs4all.data.dataset_config import DatasetConfigs
configs = DatasetConfigs("path/to/dataset/folder")
dataset = configs.get_dataset_at(0)
print(f"Metadata columns: {dataset.metadata_columns}")
# Output: Metadata columns: ['batch', 'location', 'instrument']
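Before loading, it can help to check which files in a folder look like metadata files. The sketch below uses a hypothetical helper, `looks_like_metadata`, which is not part of nirs4all. It covers only the naming patterns listed in this guide and matches case-insensitively by choice; the library's own detection accepts more variations and may behave differently.

```python
import re

# Hypothetical pattern covering only the names listed in this guide:
# M_train / M_test, Mcal / Mtest / Mval, Meta_train / Metatest,
# metadata_train / metadata_test. nirs4all itself accepts more variations.
_METADATA_NAME = re.compile(
    r"^(m|meta|metadata)_?(train|test|cal|val)\.csv$",
    re.IGNORECASE,
)


def looks_like_metadata(filename: str) -> bool:
    """Return True if the file name matches one of the listed patterns."""
    return _METADATA_NAME.match(filename) is not None


files = ["X_train.csv", "Y_train.csv", "M_train.csv", "Mcal.csv", "metadata_test.csv"]
detected = [name for name in files if looks_like_metadata(name)]
print(detected)
```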
From Explicit Configuration
config = {
'train_x': 'data/X_train.csv',
'train_y': 'data/Y_train.csv',
'train_group': 'data/M_train.csv', # Metadata
'test_x': 'data/X_test.csv',
'test_y': 'data/Y_test.csv',
'test_group': 'data/M_test.csv', # Metadata
}
configs = DatasetConfigs(config)
dataset = configs.get_dataset_at(0)
Programmatic Addition
import pandas as pd
from nirs4all.data.dataset import SpectroDataset
dataset = SpectroDataset(name="my_dataset")
# Add samples and targets
X_train = ... # Your spectral data
y_train = ... # Your targets
dataset.add_samples(X_train, {"partition": "train"})
dataset.add_targets(y_train)
# Add metadata as DataFrame
metadata_df = pd.DataFrame({
'sample_id': [1, 2, 3, ...],
'batch': [1, 1, 2, ...],
'location': ['A', 'A', 'B', ...]
})
dataset.add_metadata(metadata_df)
# Or as numpy array with headers
import numpy as np
metadata_array = np.array([[1, 'A'], [1, 'A'], [2, 'B']], dtype=object)
dataset.add_metadata(metadata_array, headers=['batch', 'location'])
Accessing Metadata
Get All Metadata
# Get all metadata as Polars DataFrame
all_meta = dataset.metadata()
print(all_meta)
Filter by Partition
# Get training metadata only
train_meta = dataset.metadata(selector={"partition": "train"})
# Get test metadata only
test_meta = dataset.metadata(selector={"partition": "test"})
Get Specific Columns
# Get only batch and location columns
batch_location = dataset.metadata(columns=['batch', 'location'])
Get Single Column as Array
# Get batch numbers as numpy array
batch_numbers = dataset.metadata_column('batch')
print(batch_numbers) # array([1, 1, 2, 2, 3, ...])
# Get batch numbers for training samples only
train_batches = dataset.metadata_column('batch', selector={"partition": "train"})
Converting Metadata to Numeric
Metadata is often categorical (strings, labels), but machine learning models require numeric input. Use metadata_numeric() to convert:
Label Encoding
Converts categories to integers (0, 1, 2, …):
location_encoded, encoding_info = dataset.metadata_numeric(
'location',
method='label'
)
print(location_encoded) # array([0, 0, 1, 1, 2, ...])
print(encoding_info)
# {
# 'method': 'label',
# 'classes': ['A', 'B', 'C', 'D']
# }
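To make the mapping concrete, the following sketch reproduces label encoding with plain NumPy. It assumes classes are ordered alphabetically, which matches the example output above but is not confirmed to be how metadata_numeric() orders them.

```python
import numpy as np

locations = np.array(["A", "A", "B", "B", "C"])

# np.unique returns the sorted classes; return_inverse gives, for each
# sample, the index of its class in that sorted list.
classes, codes = np.unique(locations, return_inverse=True)

print(codes)             # integer code per sample
print(classes.tolist())  # class list; position i is the label for code i
```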
One-Hot Encoding
Converts categories to binary vectors:
location_onehot, encoding_info = dataset.metadata_numeric(
'location',
method='onehot'
)
print(location_onehot.shape) # (n_samples, n_categories)
print(location_onehot[:3])
# array([[1, 0, 0, 0], # Location A
# [1, 0, 0, 0], # Location A
# [0, 1, 0, 0]]) # Location B
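One-hot encoding can be sketched in a similar way: each category gets an integer code, and that code selects a row of an identity matrix. This is an illustration in plain NumPy, not the library's implementation, and the alphabetical column order is an assumption.

```python
import numpy as np

locations = np.array(["A", "A", "B", "D", "C"])

# Map each category to an integer code, then pick the matching identity row.
classes, codes = np.unique(locations, return_inverse=True)
onehot = np.eye(len(classes), dtype=int)[codes]

print(onehot.shape)  # (n_samples, n_categories)
print(onehot[:3])
```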
Encoding Consistency
Encodings are cached to ensure consistency:
# First call creates encoding
encoded1, info1 = dataset.metadata_numeric('instrument', method='label')
# Subsequent calls return the same encoding
encoded2, info2 = dataset.metadata_numeric('instrument', method='label')
assert np.array_equal(encoded1, encoded2) # True
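Consistency matters because the same category must map to the same integer every time it is encoded, for example when encoding a subset of samples after the full set. The sketch below shows the general idea with a hypothetical `LabelCache` class; it is not the caching code used by nirs4all.

```python
class LabelCache:
    """Hypothetical cache: one category-to-code mapping per column."""

    def __init__(self):
        self._mappings = {}

    def encode(self, column, values):
        # Build the mapping on first use, then reuse it on later calls.
        if column not in self._mappings:
            classes = sorted(set(values))
            self._mappings[column] = {c: i for i, c in enumerate(classes)}
        mapping = self._mappings[column]
        return [mapping[v] for v in values]


cache = LabelCache()
all_codes = cache.encode("instrument", ["B", "A", "C", "A"])
subset_codes = cache.encode("instrument", ["C", "A"])  # reuses the mapping
print(all_codes, subset_codes)
```

Without the cached mapping, encoding the subset on its own would assign 'C' the code 1 instead of 2.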
Modifying Metadata
Update Existing Values
# Update location for the selected training samples
# (the five values here assume the selector matches exactly five samples)
dataset.update_metadata(
column='location',
values=['Updated', 'Updated', 'Updated', 'Updated', 'Updated'],
selector={"partition": "train"}
)
Add New Column
import numpy as np
# Add quality scores for all samples
quality_scores = np.random.rand(dataset.num_samples)
dataset.add_metadata_column('quality', quality_scores)
print(dataset.metadata_columns)
# ['batch', 'location', 'instrument', 'quality']
Using Metadata in Pipelines
Combine Metadata with Spectral Features
from sklearn.ensemble import RandomForestRegressor
# Get spectral data
X_spectra = dataset.x({"partition": "train"})
y = dataset.y({"partition": "train"})
# Get numeric metadata for the same partition so the row counts match
instrument_encoded, _ = dataset.metadata_numeric('instrument', selector={"partition": "train"}, method='onehot')
temperature = dataset.metadata_column('temperature', selector={"partition": "train"})
# Combine features
import numpy as np
X_combined = np.hstack([
X_spectra,
instrument_encoded,
temperature.reshape(-1, 1)
])
# Train model
model = RandomForestRegressor()
model.fit(X_combined, y)
Filter Samples by Metadata
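# Get training data and the metadata column used for filtering
X_train = dataset.x({"partition": "train"})
y_train = dataset.y({"partition": "train"})
batch_col = dataset.metadata_column('batch', selector={"partition": "train"})
# Keep only batch 1 samples using a boolean mask
batch_1_mask = (batch_col == 1)
X_batch_1 = X_train[batch_1_mask]
y_batch_1 = y_train[batch_1_mask]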
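The same masking approach extends to several values or several columns. The sketch below uses small synthetic arrays in place of the results of dataset.x(...) and dataset.metadata_column(...); the array contents are made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for the arrays returned by the dataset.
X = np.arange(12, dtype=float).reshape(6, 2)  # 6 samples, 2 features
batch = np.array([1, 1, 2, 2, 3, 3])
location = np.array(["A", "B", "A", "B", "A", "B"])

# Keep samples from batch 1 or 3 that were collected at location A.
mask = np.isin(batch, [1, 3]) & (location == "A")
X_selected = X[mask]

print(mask)
print(X_selected)
```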
Best Practices
1. Consistent Naming
Use clear, descriptive column names:
✅ 'instrument_id', 'batch_number', 'collection_date'
❌ 'col1', 'x', 'data'
2. Keep Metadata Aligned
Metadata must have the same number of rows as samples:
# CORRECT: Same number of rows
dataset.add_samples(X_train, {"partition": "train"}) # 100 samples
metadata_df = pd.DataFrame({...}) # 100 rows
dataset.add_metadata(metadata_df) # ✅
# INCORRECT: Mismatched rows
dataset.add_samples(X_train, {"partition": "train"}) # 100 samples
metadata_df = pd.DataFrame({...}) # 50 rows
dataset.add_metadata(metadata_df) # ❌ Error!
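A quick check before calling add_metadata() catches the mismatch early. `check_alignment` below is a hypothetical helper, not a nirs4all function, and its error message is illustrative only.

```python
def check_alignment(X, metadata_rows):
    """Raise ValueError if X and the metadata have different row counts."""
    n_samples, n_meta = len(X), len(metadata_rows)
    if n_samples != n_meta:
        raise ValueError(
            f"Row count mismatch: X has {n_samples} rows, metadata has {n_meta}"
        )


# Synthetic data: 100 samples with 5 features each.
X_train = [[0.0] * 5 for _ in range(100)]
check_alignment(X_train, list(range(100)))  # passes silently

try:
    check_alignment(X_train, list(range(50)))
except ValueError as err:
    message = str(err)
    print(message)
```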
3. Use Appropriate Encoding
Label encoding: For ordinal categories or when the number of categories is small. The integer codes imply an order, which linear models will treat as meaningful.
One-hot encoding: For nominal categories, but watch out for high cardinality
# Good: Few categories
instrument_encoded, _ = dataset.metadata_numeric('instrument', method='onehot')
# Result: 3 binary columns for 3 instruments
# Careful: Many categories
sample_id_encoded, _ = dataset.metadata_numeric('sample_id', method='onehot')
# Result: 1000 binary columns for 1000 unique IDs (sparse!)
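One way to decide is to count the distinct values first. The helper below is a hypothetical sketch; the threshold of 20 categories is an arbitrary choice for illustration, not a nirs4all default.

```python
import numpy as np


def suggest_encoding(values, max_onehot_categories=20):
    """Suggest 'onehot' for low-cardinality columns, 'label' otherwise."""
    n_categories = len(np.unique(values))
    return "onehot" if n_categories <= max_onehot_categories else "label"


instruments = np.array(["A", "B", "C", "A", "B"])
sample_ids = np.arange(1000)

print(suggest_encoding(instruments))  # few categories
print(suggest_encoding(sample_ids))   # one category per sample
```

A column with one unique value per sample, such as an identifier, rarely helps a model under either encoding.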
4. Cache-Aware Operations
Modifying metadata clears the encoding cache:
# Create encoding
encoded1, _ = dataset.metadata_numeric('location', method='label')
# Update metadata
dataset.update_metadata('location', ['New'], selector={"partition": "train"})
# Encoding is recalculated (cache was cleared)
encoded2, _ = dataset.metadata_numeric('location', method='label')
# encoded2 may differ from encoded1 due to new category 'New'
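Codes can shift because a new category changes the class list. The sketch below illustrates this with plain NumPy, assuming alphabetically ordered classes; the actual ordering used by metadata_numeric() is not confirmed here.

```python
import numpy as np

before = np.array(["A", "C", "C", "A"])
after = before.copy()
after[0] = "B"  # introduce a category that sorts between 'A' and 'C'

classes_before, codes_before = np.unique(before, return_inverse=True)
classes_after, codes_after = np.unique(after, return_inverse=True)

# 'C' was code 1 before the update and becomes code 2 afterwards.
print(classes_before.tolist(), codes_before.tolist())
print(classes_after.tolist(), codes_after.tolist())
```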
5. Documentation
Document your metadata columns:
# Good practice: Document what each column means
metadata_df = pd.DataFrame({
'batch': batch_numbers, # Production batch (1-10)
'instrument': instruments, # Instrument ID (A, B, C)
'temp_c': temperatures, # Collection temperature (°C)
'operator': operators, # Lab technician name
})
Common Patterns
Pattern 1: Stratified Splitting by Metadata
from sklearn.model_selection import StratifiedKFold
# Get training data and batch information
X_train = dataset.x({"partition": "train"})
batch_col = dataset.metadata_column('batch', selector={"partition": "train"})
# Use for stratified CV
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X_train, batch_col):
    X_train_fold = X_train[train_idx]
    X_val_fold = X_train[val_idx]
    # Train and validate...
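Stratifying by batch means every fold receives roughly the same share of each batch. The sketch below shows the idea with a hypothetical `stratified_fold_ids` helper written in plain NumPy. It is a simplified, unshuffled illustration, not scikit-learn's algorithm.

```python
import numpy as np


def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id so every label is spread across folds."""
    labels = np.asarray(labels)
    fold_ids = np.empty(len(labels), dtype=int)
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        # Deal the samples of this label to folds in round-robin order.
        fold_ids[idx] = np.arange(len(idx)) % n_splits
    return fold_ids


# Synthetic batch column: 10 samples from batch 1, 5 from batch 2.
batch = np.array([1] * 10 + [2] * 5)
fold_ids = stratified_fold_ids(batch, n_splits=5)

for fold in range(5):
    val_batches = batch[fold_ids == fold]
    # Number of validation samples from batch 1 and batch 2 in this fold.
    print(fold, np.bincount(val_batches, minlength=3)[1:])
```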
Pattern 2: Cross-Instrument Validation
from sklearn.ensemble import RandomForestRegressor
# Get training data and metadata
X_train = dataset.x({"partition": "train"})
y_train = dataset.y({"partition": "train"})
instruments = dataset.metadata_column('instrument', selector={"partition": "train"})
# Train on instrument A, test on instrument B
mask_A = (instruments == 'A')
mask_B = (instruments == 'B')
model = RandomForestRegressor()  # any scikit-learn regressor works here
model.fit(X_train[mask_A], y_train[mask_A])
score = model.score(X_train[mask_B], y_train[mask_B])
print(f"Cross-instrument R²: {score:.3f}")
Pattern 3: Temporal Splits
# Assuming metadata has 'date' column
dates = dataset.metadata_column('date', selector={"partition": "train"})
# Sort by date
sorted_indices = np.argsort(dates)
split_point = int(len(dates) * 0.8)
train_idx = sorted_indices[:split_point]
val_idx = sorted_indices[split_point:]
# Temporal train/validation split
X_train_temporal = X_train[train_idx]
X_val_temporal = X_train[val_idx]
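Sorting only gives a chronological order if the date values compare correctly. If the 'date' column holds ISO 8601 strings (an assumption; this guide does not state the format), converting to datetime64 first makes the ordering explicit. The dates below are synthetic.

```python
import numpy as np

# Synthetic ISO 8601 date strings standing in for the 'date' column.
dates = np.array(
    ["2024-03-01", "2023-12-15", "2024-01-20", "2023-11-02", "2024-02-10"]
)

# Convert to datetime64 so values are compared as dates.
dates_dt = dates.astype("datetime64[D]")
sorted_indices = np.argsort(dates_dt)

split_point = int(len(dates_dt) * 0.8)
train_idx = sorted_indices[:split_point]
val_idx = sorted_indices[split_point:]

print(dates[train_idx])  # oldest 80% of samples
print(dates[val_idx])    # most recent 20%
```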
API Reference
Dataset Methods
add_metadata(data, headers=None)
Add metadata rows.
Parameters:
data: 2D array, pandas DataFrame, or polars DataFrame
headers: Column names (required if data is ndarray)
metadata(selector=None, columns=None)
Get metadata as DataFrame.
Parameters:
selector: Filter dict (e.g., {"partition": "train"})
columns: List of column names to return
Returns: Polars DataFrame
metadata_column(column, selector=None)
Get single metadata column as array.
Parameters:
column: Column name
selector: Filter dict
Returns: Numpy array
metadata_numeric(column, selector=None, method='label')
Convert metadata column to numeric.
Parameters:
column: Column name
selector: Filter dict
method: 'label' or 'onehot'
Returns: Tuple of (numeric_array, encoding_info)
update_metadata(column, values, selector=None)
Update metadata values.
Parameters:
column: Column name
values: New values
selector: Filter dict
add_metadata_column(column, values)
Add new metadata column.
Parameters:
column: Column name
values: Column values (must match number of samples)
metadata_columns
Property returning list of metadata column names.
Troubleshooting
Issue: Metadata not loading
Check:
File naming matches patterns (M_train, Mcal, metadata_train, etc.)
Files are in the same folder as X and Y files
CSV format is correct (check delimiter, headers)
# Debug: Check what files were detected
from nirs4all.data.dataset_config_parser import browse_folder
config = browse_folder("path/to/folder")
print(config.get('train_group')) # Should show metadata file path
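If the file is detected but still fails to load, inspecting its header and row count directly can reveal a delimiter or header problem. The helper below is a hypothetical sketch using only the standard library; the example file contents and the ';' delimiter are assumptions made for the demonstration.

```python
import csv
import os
import tempfile


def inspect_csv(path, delimiter=","):
    """Return (header, n_data_rows) for a delimited text file."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle, delimiter=delimiter))
    return rows[0], len(rows) - 1


# Write a small example file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "M_train.csv")
with open(path, "w", newline="", encoding="utf-8") as handle:
    handle.write("batch;location\n1;A\n1;A\n2;B\n")

header, n_rows = inspect_csv(path, delimiter=";")
print(header, n_rows)

# With the wrong delimiter, every line is read as a single column.
wrong_header, _ = inspect_csv(path, delimiter=",")
print(wrong_header)
```

A header that comes back as a single long string is a strong hint that the delimiter does not match the file.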
Issue: Row count mismatch
Error: ValueError: Row count mismatch: X(100) Metadata(50)
Solution: Ensure metadata has the same number of rows as X data:
print(f"X rows: {len(X_train)}")
print(f"Metadata rows: {len(metadata_df)}")
# They must match!
Issue: Missing metadata columns
Error: ValueError: Column 'instrument' not found
Solution: Check column names:
print(dataset.metadata_columns) # See what's available
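When the error comes from a typo or a slightly different column name, a fuzzy match against the available names can point to the intended column. `suggest_column` is a hypothetical helper built on the standard library's difflib, and the column names in the example are made up.

```python
import difflib


def suggest_column(name, available):
    """Return the available column names closest to the requested one."""
    return difflib.get_close_matches(name, available, n=3, cutoff=0.6)


# Stand-in for dataset.metadata_columns.
available = ["batch", "location", "instrument_id"]
suggestions = suggest_column("instrument", available)
print(suggestions)
```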
Examples
See examples/metadata_usage.py for complete working examples including:
Loading datasets with metadata
Filtering and accessing metadata
Numeric encoding
Using metadata in pipelines
Cross-instrument validation
Summary
Metadata in nirs4all provides a flexible way to:
✅ Store auxiliary sample information
✅ Filter and group samples
✅ Enhance models with contextual features
✅ Enable stratified validation strategies
✅ Support reproducible data management
Start by loading your metadata files alongside X and Y data, then explore the rich API for accessing, converting, and utilizing metadata in your spectroscopy workflows!