Metadata Usage Guide

Overview

Metadata in nirs4all allows you to store and manage auxiliary sample-level information alongside your spectral data. This includes information like:

  • Sample identifiers

  • Batch numbers

  • Collection locations

  • Instrument types

  • Environmental conditions (temperature, humidity)

  • Any other sample-specific attributes

Metadata has one row per sample, aligned with the features and targets. It can be filtered, retrieved, and converted to numeric form for use in machine learning pipelines.


Key Concepts

What is Metadata?

Metadata is distinct from features and targets:

  • Features (X): Spectral measurements or input variables for modeling

  • Targets (Y): Response variables you want to predict

  • Metadata: Auxiliary information about each sample

Metadata vs Features

  • Metadata is typically:

    • Categorical or mixed-type data

    • Used for grouping, filtering, or as auxiliary features

    • Not transformed through complex preprocessing pipelines

  • Features are:

    • Numerical spectral data

    • Subject to preprocessing (SNV, MSC, derivatives, etc.)

    • Primary input to models


Loading Metadata

From Folder (Auto-detection)

Place metadata files in your dataset folder with standard naming patterns:

dataset/
├── X_train.csv       # Training spectra
├── Y_train.csv       # Training targets
├── M_train.csv       # Training metadata ✓
├── X_test.csv        # Test spectra
├── Y_test.csv        # Test targets
└── M_test.csv        # Test metadata ✓

Supported naming patterns:

  • M_train.csv, M_test.csv

  • Mcal.csv, Mtest.csv, Mval.csv

  • Meta_train.csv, Metatest.csv

  • metadata_train.csv, metadata_test.csv

  • And many more variations…

from nirs4all.data.dataset_config import DatasetConfigs

configs = DatasetConfigs("path/to/dataset/folder")
dataset = configs.get_dataset_at(0)

print(f"Metadata columns: {dataset.metadata_columns}")
# Output: Metadata columns: ['batch', 'location', 'instrument']
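
The naming patterns listed above can be approximated with a single regular expression. The sketch below is not nirs4all's actual detection logic: the pattern and the helper name looks_like_metadata_file are assumptions made for illustration, and they cover only the file names spelled out in this guide, not the "many more variations".

```python
import re

# Hypothetical pattern (not taken from nirs4all): an "M", "Meta" or "metadata"
# prefix, an optional underscore, then a partition keyword and ".csv".
METADATA_NAME = re.compile(
    r"^(m|meta|metadata)_?(train|test|cal|val)\.csv$",
    re.IGNORECASE,
)


def looks_like_metadata_file(filename: str) -> bool:
    """Return True if the name matches one of the patterns shown in this guide."""
    return METADATA_NAME.match(filename) is not None


candidates = ["X_train.csv", "M_train.csv", "Mcal.csv", "Metatest.csv", "notes.txt"]
detected = [name for name in candidates if looks_like_metadata_file(name)]
print(detected)
```

A check like this can be handy before loading a folder, to confirm that the metadata files are named in a way the guide documents.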

From Explicit Configuration

config = {
    'train_x': 'data/X_train.csv',
    'train_y': 'data/Y_train.csv',
    'train_group': 'data/M_train.csv',  # Metadata
    'test_x': 'data/X_test.csv',
    'test_y': 'data/Y_test.csv',
    'test_group': 'data/M_test.csv',    # Metadata
}

configs = DatasetConfigs(config)
dataset = configs.get_dataset_at(0)

Programmatic Addition

import pandas as pd
from nirs4all.data.dataset import SpectroDataset

dataset = SpectroDataset(name="my_dataset")

# Add samples and targets
X_train = ...  # Your spectral data
y_train = ...  # Your targets
dataset.add_samples(X_train, {"partition": "train"})
dataset.add_targets(y_train)

# Add metadata as DataFrame
metadata_df = pd.DataFrame({
    'sample_id': [1, 2, 3, ...],
    'batch': [1, 1, 2, ...],
    'location': ['A', 'A', 'B', ...]
})
dataset.add_metadata(metadata_df)

# Or as numpy array with headers
import numpy as np
metadata_array = np.array([[1, 'A'], [1, 'A'], [2, 'B']], dtype=object)
dataset.add_metadata(metadata_array, headers=['batch', 'location'])

Accessing Metadata

Get All Metadata

# Get all metadata as Polars DataFrame
all_meta = dataset.metadata()
print(all_meta)

Filter by Partition

# Get training metadata only
train_meta = dataset.metadata(selector={"partition": "train"})

# Get test metadata only
test_meta = dataset.metadata(selector={"partition": "test"})

Get Specific Columns

# Get only batch and location columns
batch_location = dataset.metadata(columns=['batch', 'location'])

Get Single Column as Array

# Get batch numbers as numpy array
batch_numbers = dataset.metadata_column('batch')
print(batch_numbers)  # array([1, 1, 2, 2, 3, ...])

# Get batch numbers for training samples only
train_batches = dataset.metadata_column('batch', selector={"partition": "train"})

Converting Metadata to Numeric

Metadata is often categorical (strings, labels), but most machine learning models require numeric input. Use metadata_numeric() to convert a column:

Label Encoding

Converts categories to integers (0, 1, 2, …):

location_encoded, encoding_info = dataset.metadata_numeric(
    'location',
    method='label'
)

print(location_encoded)  # array([0, 0, 1, 1, 2, ...])
print(encoding_info)
# {
#   'method': 'label',
#   'classes': ['A', 'B', 'C', 'D']
# }
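
To make the idea concrete, label encoding can be reproduced with plain NumPy. This is a minimal sketch of the general technique on made-up values; it assumes classes are ordered by sorting, which may or may not match how nirs4all orders them internally.

```python
import numpy as np

# Made-up 'location' column for five samples.
locations = np.array(["A", "A", "B", "B", "C"])

# np.unique returns the sorted distinct values; return_inverse gives, for each
# sample, the index of its value in that sorted list, i.e. its integer code.
classes, codes = np.unique(locations, return_inverse=True)

encoding_info = {"method": "label", "classes": classes.tolist()}

# Decoding is an index lookup into the class list.
decoded = classes[codes]

print(codes)
print(encoding_info)
```

Keeping the class list next to the codes is what makes the encoding reversible.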

One-Hot Encoding

Converts categories to binary vectors:

location_onehot, encoding_info = dataset.metadata_numeric(
    'location',
    method='onehot'
)

print(location_onehot.shape)  # (n_samples, n_categories)
print(location_onehot[:3])
# array([[1, 0, 0, 0],   # Location A
#        [1, 0, 0, 0],   # Location A
#        [0, 1, 0, 0]])  # Location B
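
The same transformation can be sketched with NumPy alone. The values below are made up, and the sorted class order is an assumption rather than a statement about the library's internals.

```python
import numpy as np

# Made-up 'location' column with four distinct categories.
locations = np.array(["A", "A", "B", "D", "C"])
classes, codes = np.unique(locations, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i, so indexing
# the identity matrix with the integer codes yields one row per sample.
onehot = np.eye(len(classes), dtype=int)[codes]

print(onehot.shape)
print(onehot[:3])
```

Each row contains exactly one 1, and the column index of that 1 is the label code of the sample.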

Encoding Consistency

Encodings are cached to ensure consistency:

import numpy as np

# First call creates encoding
encoded1, info1 = dataset.metadata_numeric('instrument', method='label')

# Subsequent calls return the same encoding
encoded2, info2 = dataset.metadata_numeric('instrument', method='label')

assert np.array_equal(encoded1, encoded2)  # True

Modifying Metadata

Update Existing Values

# Update location for the selected samples
# (this example assumes the selector matches exactly five samples, one per value)
dataset.update_metadata(
    column='location',
    values=['Updated', 'Updated', 'Updated', 'Updated', 'Updated'],
    selector={"partition": "train"}
)

Add New Column

import numpy as np

# Add quality scores for all samples
quality_scores = np.random.rand(dataset.num_samples)
dataset.add_metadata_column('quality', quality_scores)

print(dataset.metadata_columns)
# ['batch', 'location', 'instrument', 'quality']

Using Metadata in Pipelines

Combine Metadata with Spectral Features

from sklearn.ensemble import RandomForestRegressor

# Get spectral data
X_spectra = dataset.x({"partition": "train"})
y = dataset.y({"partition": "train"})

# Get numeric metadata for the same (training) samples as X_spectra
instrument_encoded, _ = dataset.metadata_numeric(
    'instrument', selector={"partition": "train"}, method='onehot'
)
temperature = dataset.metadata_column('temperature', selector={"partition": "train"})

# Combine features
import numpy as np
X_combined = np.hstack([
    X_spectra,
    instrument_encoded,
    temperature.reshape(-1, 1)
])

# Train model
model = RandomForestRegressor()
model.fit(X_combined, y)
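
Horizontal stacking only works when every block has the same number of rows, in the same sample order. The following sketch uses synthetic stand-in arrays (no nirs4all calls) to show the shape bookkeeping and a simple guard before np.hstack.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 6, 10

X_spectra = rng.normal(size=(n_samples, n_wavelengths))  # stand-in spectra
instrument_onehot = np.eye(3)[[0, 1, 2, 0, 1, 2]]         # stand-in one-hot block
temperature = np.array([20.5, 21.0, 19.8, 22.1, 20.0, 21.4])

# A 1D column must be reshaped to (n_samples, 1) before stacking.
blocks = [X_spectra, instrument_onehot, temperature.reshape(-1, 1)]

# Every block must describe the same samples, so the row counts must agree.
row_counts = {block.shape[0] for block in blocks}
if len(row_counts) != 1:
    raise ValueError(f"Blocks have different row counts: {sorted(row_counts)}")

X_combined = np.hstack(blocks)
print(X_combined.shape)  # (6, 14): 10 wavelengths + 3 one-hot columns + 1 temperature
```

If one block is built from all samples and another from the training partition only, the guard fails early instead of producing a confusing stacking error.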

Filter Samples by Metadata

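# Get training data and the metadata column used for filtering
X_train = dataset.x({"partition": "train"})
y_train = dataset.y({"partition": "train"})
batch_col = dataset.metadata_column('batch', selector={"partition": "train"})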

# Manually filter batch 1 samples
batch_1_mask = (batch_col == 1)

X_batch_1 = X_train[batch_1_mask]
y_batch_1 = y_train[batch_1_mask]
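
The masking step can be tried in isolation on small synthetic arrays. This sketch uses made-up data in place of the dataset calls and also shows np.isin for selecting several batches at once.

```python
import numpy as np

# Synthetic stand-ins for spectra, targets and the 'batch' metadata column.
X_train = np.arange(12, dtype=float).reshape(6, 2)
y_train = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
batch_col = np.array([1, 2, 1, 3, 2, 1])

# Boolean mask with one entry per sample.
batch_1_mask = batch_col == 1
X_batch_1 = X_train[batch_1_mask]
y_batch_1 = y_train[batch_1_mask]

# Selecting several batches at once.
keep_mask = np.isin(batch_col, [1, 3])
X_kept = X_train[keep_mask]

print(X_batch_1)
print(X_kept.shape)
```

The mask must be built from the same partition as X_train and y_train, otherwise the rows no longer line up.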

Best Practices

1. Consistent Naming

Use clear, descriptive column names:

  • ✅ 'instrument_id', 'batch_number', 'collection_date'

  • ❌ 'col1', 'x', 'data'

2. Keep Metadata Aligned

Metadata must have the same number of rows as samples:

# CORRECT: Same number of rows
dataset.add_samples(X_train, {"partition": "train"})  # 100 samples
metadata_df = pd.DataFrame({...})  # 100 rows
dataset.add_metadata(metadata_df)  # ✅

# INCORRECT: Mismatched rows
dataset.add_samples(X_train, {"partition": "train"})  # 100 samples
metadata_df = pd.DataFrame({...})  # 50 rows
dataset.add_metadata(metadata_df)  # ❌ Error!
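
A small guard run before add_metadata can catch the mismatch early. The helper below, check_metadata_aligned, is hypothetical and not part of nirs4all; its message simply imitates the error text quoted in the Troubleshooting section.

```python
import numpy as np


def check_metadata_aligned(X, metadata_rows):
    """Raise ValueError unless the metadata has exactly one row per sample."""
    n_samples = len(X)
    n_meta = len(metadata_rows)
    if n_samples != n_meta:
        raise ValueError(f"Row count mismatch: X({n_samples}) Metadata({n_meta})")
    return n_samples


X_train = np.zeros((100, 5))              # 100 synthetic samples
metadata_ok = [{"batch": 1}] * 100        # 100 rows: aligned
metadata_short = [{"batch": 1}] * 50      # 50 rows: misaligned

n_checked = check_metadata_aligned(X_train, metadata_ok)

try:
    check_metadata_aligned(X_train, metadata_short)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(n_checked)
print(error_message)
```

Because the helper relies only on len(), it works for lists, NumPy arrays, and pandas DataFrames alike.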

3. Use Appropriate Encoding

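  • Label encoding: For ordinal categories or when the number of categories is small. The integer codes imply an order, which linear and distance-based models will treat as meaningful

  • One-hot encoding: For nominal categories. It adds one column per category, so watch out for high cardinality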

# Good: Few categories
instrument_encoded, _ = dataset.metadata_numeric('instrument', method='onehot')
# Result: 3 binary columns for 3 instruments

# Careful: Many categories
sample_id_encoded, _ = dataset.metadata_numeric('sample_id', method='onehot')
# Result: 1000 binary columns for 1000 unique IDs (sparse!)
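
One way to avoid an accidental 1000-column expansion is to count distinct values before choosing one-hot encoding. In this sketch the helper onehot_width and the threshold MAX_ONEHOT_COLUMNS are illustrative assumptions, not library settings.

```python
import numpy as np

# Arbitrary threshold chosen for illustration; tune it for your own models.
MAX_ONEHOT_COLUMNS = 20


def onehot_width(values):
    """Number of columns a one-hot encoding of `values` would produce."""
    return len(np.unique(values))


instruments = np.array(["A", "B", "C"] * 4)  # 3 distinct instruments
sample_ids = np.arange(1000)                 # 1000 distinct IDs

report = {}
for name, column in [("instrument", instruments), ("sample_id", sample_ids)]:
    width = onehot_width(column)
    report[name] = (width, width <= MAX_ONEHOT_COLUMNS)

print(report)
```

A column that is unique per sample, such as an identifier, carries no groupable information and is usually better left out of the feature matrix altogether.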

4. Cache-Aware Operations

Modifying metadata clears the encoding cache:

# Create encoding
encoded1, _ = dataset.metadata_numeric('location', method='label')

# Update metadata
dataset.update_metadata('location', ['New'], selector={"partition": "train"})

# Encoding is recalculated (cache was cleared)
encoded2, _ = dataset.metadata_numeric('location', method='label')
# encoded2 may differ from encoded1 due to new category 'New'
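
The caching and invalidation behaviour can be pictured with a toy class. EncodingCache below is a hypothetical illustration of the general idea, not the data structure nirs4all uses: it stores the class list on first use, reuses it on later calls, and drops it when the column changes.

```python
import numpy as np


class EncodingCache:
    """Toy cache of label encodings, keyed by column name (illustration only)."""

    def __init__(self, columns):
        self._columns = {name: np.asarray(vals) for name, vals in columns.items()}
        self._classes = {}  # column name -> sorted array of classes

    def label_encode(self, column):
        values = self._columns[column]
        if column not in self._classes:
            # First call: fix the class list once and remember it.
            self._classes[column] = np.unique(values)
        classes = self._classes[column]
        # Each value's position in the sorted class list is its integer code.
        codes = np.searchsorted(classes, values)
        return codes, {"method": "label", "classes": classes.tolist()}

    def update(self, column, values):
        self._columns[column] = np.asarray(values)
        # The stored class list may be stale now, so discard it.
        self._classes.pop(column, None)


cache = EncodingCache({"instrument": ["B", "A", "B", "C"]})
first, info1 = cache.label_encode("instrument")
second, info2 = cache.label_encode("instrument")   # served from the cache

cache.update("instrument", ["B", "A", "B", "AA"])  # introduces a new category
third, info3 = cache.label_encode("instrument")

print(first, info1["classes"])
print(third, info3["classes"])
```

In this toy run the code for 'B' moves from 1 to 2 after the update, which is why integer codes saved before a metadata change should not be compared with codes computed afterwards.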

5. Documentation

Document your metadata columns:

# Good practice: Document what each column means
metadata_df = pd.DataFrame({
    'batch': batch_numbers,        # Production batch (1-10)
    'instrument': instruments,      # Instrument ID (A, B, C)
    'temp_c': temperatures,         # Collection temperature (°C)
    'operator': operators,          # Lab technician name
})

Common Patterns

Pattern 1: Stratified Splitting by Metadata

from sklearn.model_selection import StratifiedKFold

# Get training spectra and batch information
X_train = dataset.x({"partition": "train"})
batch_col = dataset.metadata_column('batch', selector={"partition": "train"})

# Use for stratified CV
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X_train, batch_col):
    X_train_fold = X_train[train_idx]
    X_val_fold = X_train[val_idx]
    # Train and validate...

Pattern 2: Cross-Instrument Validation

# Get training data and metadata
X_train = dataset.x({"partition": "train"})
y_train = dataset.y({"partition": "train"})
instruments = dataset.metadata_column('instrument', selector={"partition": "train"})

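# Train on instrument A, test on instrument B
mask_A = (instruments == 'A')
mask_B = (instruments == 'B')

# Any scikit-learn style regressor works here
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train[mask_A], y_train[mask_A])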
score = model.score(X_train[mask_B], y_train[mask_B])
print(f"Cross-instrument R²: {score:.3f}")
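
The single A-versus-B comparison generalises to holding out each instrument in turn. The generator below is a NumPy-only sketch on made-up labels; leave_one_instrument_out is a hypothetical helper name, and it only produces the masks, leaving model fitting to the caller.

```python
import numpy as np


def leave_one_instrument_out(instruments):
    """Yield (held_out, train_mask, test_mask) for each distinct instrument."""
    instruments = np.asarray(instruments)
    for held_out in np.unique(instruments):
        test_mask = instruments == held_out
        # Train on every sample that was NOT measured on the held-out instrument.
        yield held_out, ~test_mask, test_mask


# Made-up 'instrument' column for six training samples.
instruments = np.array(["A", "B", "A", "C", "B", "A"])
splits = list(leave_one_instrument_out(instruments))

for held_out, train_mask, test_mask in splits:
    print(held_out, int(train_mask.sum()), int(test_mask.sum()))
```

scikit-learn's LeaveOneGroupOut splitter implements the same scheme and returns index arrays instead of masks, which may be more convenient inside cross-validation utilities.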

Pattern 3: Temporal Splits

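import numpy as np

# Assuming metadata has a 'date' column in a sortable format (e.g. ISO 8601 strings)
X_train = dataset.x({"partition": "train"})
dates = dataset.metadata_column('date', selector={"partition": "train"})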

# Sort by date
sorted_indices = np.argsort(dates)
split_point = int(len(dates) * 0.8)

train_idx = sorted_indices[:split_point]
val_idx = sorted_indices[split_point:]

# Temporal train/validation split
X_train_temporal = X_train[train_idx]
X_val_temporal = X_train[val_idx]
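
Sorting works as intended only if the date values compare chronologically. As a hedged illustration on made-up data, the sketch below converts ISO 8601 strings to NumPy datetime64 values before sorting and then confirms that no validation sample predates a training sample.

```python
import numpy as np

# Made-up ISO 8601 date strings; converting to datetime64 makes the sort
# chronological regardless of how the strings happen to be formatted.
dates = np.array(
    ["2023-03-01", "2023-01-15", "2023-02-10", "2023-01-02", "2023-04-20"],
    dtype="datetime64[D]",
)
X_train = np.arange(10, dtype=float).reshape(5, 2)  # synthetic spectra

# A stable sort keeps the original order among samples with equal dates.
sorted_indices = np.argsort(dates, kind="stable")
split_point = int(len(dates) * 0.8)

train_idx = sorted_indices[:split_point]
val_idx = sorted_indices[split_point:]

X_train_temporal = X_train[train_idx]
X_val_temporal = X_train[val_idx]

# Sanity check: the validation period starts after the training period ends.
assert dates[train_idx].max() <= dates[val_idx].min()
print(train_idx, val_idx)
```

Dates stored in formats such as "15/01/2023" would sort incorrectly as plain strings, so parsing them first is the safer route.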

API Reference

Dataset Methods

add_metadata(data, headers=None)

Add metadata rows.

Parameters:

  • data: 2D array, pandas DataFrame, or polars DataFrame

  • headers: Column names (required if data is ndarray)

metadata(selector=None, columns=None)

Get metadata as DataFrame.

Parameters:

  • selector: Filter dict (e.g., {"partition": "train"})

  • columns: List of column names to return

Returns: Polars DataFrame

metadata_column(column, selector=None)

Get single metadata column as array.

Parameters:

  • column: Column name

  • selector: Filter dict

Returns: Numpy array

metadata_numeric(column, selector=None, method='label')

Convert metadata column to numeric.

Parameters:

  • column: Column name

  • selector: Filter dict

  • method: 'label' or 'onehot'

Returns: Tuple of (numeric_array, encoding_info)

update_metadata(column, values, selector=None)

Update metadata values.

Parameters:

  • column: Column name

  • values: New values

  • selector: Filter dict

add_metadata_column(column, values)

Add new metadata column.

Parameters:

  • column: Column name

  • values: Column values (must match number of samples)

metadata_columns

Property returning list of metadata column names.


Troubleshooting

Issue: Metadata not loading

Check:

  1. File naming matches patterns (M_train, Mcal, metadata_train, etc.)

  2. Files are in the same folder as X and Y files

  3. CSV format is correct (check delimiter, headers)

# Debug: Check what files were detected
from nirs4all.data.dataset_config_parser import browse_folder
config = browse_folder("path/to/folder")
print(config.get('train_group'))  # Should show metadata file path

Issue: Row count mismatch

Error: ValueError: Row count mismatch: X(100) Metadata(50)

Solution: Ensure metadata has the same number of rows as X data:

print(f"X rows: {len(X_train)}")
print(f"Metadata rows: {len(metadata_df)}")
# They must match!

Issue: Missing metadata columns

Error: ValueError: Column 'instrument' not found

Solution: Check column names:

print(dataset.metadata_columns)  # See what's available
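
When the cause is a typo or a case difference, the standard library can suggest the closest available name. suggest_columns is a hypothetical helper built on difflib, and the column list below is hard-coded to stand in for dataset.metadata_columns.

```python
import difflib


def suggest_columns(requested, available, n=3):
    """Return up to n available column names that closely resemble `requested`."""
    return difflib.get_close_matches(requested, available, n=n)


# Stand-in for dataset.metadata_columns.
available_columns = ["batch", "location", "instrument"]

suggestions = suggest_columns("instrment", available_columns)
no_match = suggest_columns("humidity", available_columns)

print(suggestions)
print(no_match)
```

An empty result means nothing similar exists, which usually points to a metadata file that was not loaded rather than a misspelling.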

Examples

See examples/metadata_usage.py for complete working examples including:

  • Loading datasets with metadata

  • Filtering and accessing metadata

  • Numeric encoding

  • Using metadata in pipelines

  • Cross-instrument validation


Summary

Metadata in nirs4all provides a flexible way to:

  • ✅ Store auxiliary sample information

  • ✅ Filter and group samples

  • ✅ Enhance models with contextual features

  • ✅ Enable stratified validation strategies

  • ✅ Support reproducible data management

Start by loading your metadata files alongside the X and Y data, then use the methods described above to access, convert, and apply metadata in your spectroscopy workflows.