nirs4all.data.synthetic.metadata module

Metadata generation for synthetic NIRS datasets.

This module provides tools for generating realistic sample metadata including sample IDs, biological sample groupings, repetitions, and custom columns.

Example

>>> from nirs4all.data.synthetic.metadata import MetadataGenerator
>>>
>>> generator = MetadataGenerator(random_state=42)
>>> metadata = generator.generate(
...     n_samples=100,
...     sample_id_prefix="S",
...     n_groups=3,
...     n_repetitions=(2, 4)
... )
>>> print(metadata.to_dict().keys())
dict_keys(['sample_id', 'bio_sample_id', 'repetition', 'group'])
class nirs4all.data.synthetic.metadata.MetadataGenerationResult(sample_ids: ndarray, bio_sample_ids: ndarray | None = None, repetitions: ndarray | None = None, groups: ndarray | None = None, group_indices: ndarray | None = None, n_bio_samples: int = 0, additional_columns: Dict[str, ndarray] | None = None)[source]

Bases: object

Container for generated metadata.

sample_ids

Unique sample identifiers.

Type:

numpy.ndarray

bio_sample_ids

Biological sample identifiers (before repetitions).

Type:

numpy.ndarray | None

repetitions

Repetition number for each sample.

Type:

numpy.ndarray | None

groups

Group assignments.

Type:

numpy.ndarray | None

group_indices

Integer group indices (for stratification).

Type:

numpy.ndarray | None

n_bio_samples

Number of unique biological samples.

Type:

int

additional_columns

Any extra columns generated.

Type:

Dict[str, numpy.ndarray] | None

to_dict() Dict[str, ndarray][source]

Convert to dictionary format suitable for DataFrame or SpectroDataset.

Returns:

Dictionary with string keys and array values.
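
As a rough illustration of the conversion, the sketch below uses a hypothetical stand-in class (MetadataSketch, with plain lists instead of NumPy arrays) rather than the real MetadataGenerationResult. The singular column names are taken from the module-level example output. Omitting fields that are None is an assumption about the behaviour, and the ID strings are made up for the demonstration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for MetadataGenerationResult (lists, not arrays)."""

    sample_ids: List[str]
    bio_sample_ids: Optional[List[str]] = None
    repetitions: Optional[List[int]] = None
    groups: Optional[List[str]] = None

    def to_dict(self) -> Dict[str, list]:
        # Column names follow the keys printed in the module-level example.
        columns = {
            "sample_id": self.sample_ids,
            "bio_sample_id": self.bio_sample_ids,
            "repetition": self.repetitions,
            "group": self.groups,
        }
        # Assumption: fields left as None are omitted from the dictionary.
        return {name: values for name, values in columns.items() if values is not None}


# Illustrative values only; the real ID format is not documented here.
result = MetadataSketch(
    sample_ids=["S0001", "S0002"],
    bio_sample_ids=["B0001", "B0001"],
    repetitions=[1, 2],
)
columns = result.to_dict()
```

Because every value has one entry per sample, a dictionary of this shape can be passed directly to a tabular constructor such as pandas.DataFrame.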

class nirs4all.data.synthetic.metadata.MetadataGenerator(random_state: int | None = None)[source]

Bases: object

Generate realistic metadata for synthetic NIRS datasets.

This class creates sample identifiers, biological sample groupings, repetition structures, and group assignments that mimic real spectroscopy datasets.

rng

NumPy random generator for reproducibility.

Parameters:

random_state – Random seed for reproducibility.

Example

>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Generate with repetitions and groups
>>> metadata = generator.generate(
...     n_samples=100,
...     sample_id_prefix="WHEAT",
...     n_groups=3,
...     group_names=["Field_A", "Field_B", "Field_C"],
...     n_repetitions=(2, 4)
... )
>>>
>>> # Result: Each biological sample has 2-4 spectral measurements
>>> print(f"Bio samples: {metadata.n_bio_samples}")
>>> print(f"Total samples: {len(metadata.sample_ids)}")
generate(n_samples: int, *, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1, bio_sample_prefix: str = 'B', additional_columns: Dict[str, Any] | None = None) MetadataGenerationResult[source]

Generate complete metadata for a synthetic dataset.

This method generates samples with repetitions while respecting group structure. When repetitions are requested, biological samples are created first, and each is then measured one or more times (as set by n_repetitions) to produce the final samples.
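
The repetition logic described above can be sketched with the standard library alone. This is not the library's implementation: plan_repetitions and expand_to_samples are hypothetical helpers, the truncation of the last biological sample is an assumed way of hitting n_samples exactly, and the ID format and 1-based repetition numbers are guesses.

```python
import random
from typing import List, Tuple, Union


def plan_repetitions(
    n_samples: int,
    n_repetitions: Union[int, Tuple[int, int]],
    seed: int = 0,
) -> List[int]:
    """Return one repetition count per biological sample, summing to n_samples."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if isinstance(n_repetitions, int):
        lo = hi = n_repetitions
    else:
        lo, hi = n_repetitions
    if lo < 1 or hi < lo:
        raise ValueError("n_repetitions must satisfy 1 <= min <= max")

    rng = random.Random(seed)
    counts: List[int] = []
    remaining = n_samples
    while remaining > 0:
        # Draw a count in [lo, hi]. Assumption: the final biological sample
        # is truncated so the total never exceeds n_samples.
        count = min(rng.randint(lo, hi), remaining)
        counts.append(count)
        remaining -= count
    return counts


def expand_to_samples(counts: List[int], bio_sample_prefix: str = "B"):
    """Expand per-bio-sample counts into one entry per spectrum."""
    bio_ids, repetitions = [], []
    for bio_index, count in enumerate(counts):
        for rep in range(1, count + 1):  # 1-based numbering is an assumption
            bio_ids.append(f"{bio_sample_prefix}{bio_index:04d}")
            repetitions.append(rep)
    return bio_ids, repetitions


fixed_counts = plan_repetitions(100, 2)
variable_counts = plan_repetitions(100, (2, 4), seed=42)
bio_ids, repetitions = expand_to_samples(variable_counts)
```

With a fixed count of 2, the plan yields exactly 50 biological samples. With a (min, max) range, the number of biological samples depends on the draws, and under the truncation assumption the last one may receive fewer than min repetitions.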

Parameters:
  • n_samples – Total number of samples (spectra) to generate.

  • sample_id_prefix – Prefix for sample ID strings.

  • n_groups – Number of groups (None for no grouping).

  • group_names – Optional list of group names. If None and n_groups > 0, names such as "Group_0", "Group_1", and so on are generated.

  • n_repetitions – Number of repetitions per biological sample. An int gives a fixed number of repetitions; a tuple (min, max) draws a random number from the range [min, max] for each biological sample.

  • bio_sample_prefix – Prefix for biological sample IDs.

  • additional_columns – Dictionary of additional columns to generate. Keys are column names; each value can be one of:

      - Callable(n_samples, rng) -> np.ndarray
      - List of values to randomly sample from
      - Tuple (distribution, params) for numeric data
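
One way the three value forms accepted by additional_columns could be dispatched is sketched below. The helper resolve_column is hypothetical, random.Random stands in for the NumPy generator, and the distribution names ("normal", "uniform") and their parameter keys are assumptions, since the accepted distributions are not listed here.

```python
import random
from typing import Any, Dict, List


def resolve_column(spec: Any, n_samples: int, rng: random.Random) -> List[Any]:
    """Turn one column specification into a list of n_samples values."""
    if callable(spec):
        # Callable(n_samples, rng): the caller builds the column itself.
        return list(spec(n_samples, rng))
    if isinstance(spec, list):
        # List of values: sample uniformly with replacement.
        return [rng.choice(spec) for _ in range(n_samples)]
    if isinstance(spec, tuple):
        # Tuple (distribution, params). Names and keys below are assumed.
        distribution, params = spec
        if distribution == "normal":
            return [rng.gauss(params["mean"], params["std"]) for _ in range(n_samples)]
        if distribution == "uniform":
            return [rng.uniform(params["low"], params["high"]) for _ in range(n_samples)]
        raise ValueError(f"unsupported distribution: {distribution!r}")
    raise TypeError(f"unsupported column specification: {type(spec).__name__}")


rng = random.Random(42)
specs: Dict[str, Any] = {
    "operator": ["alice", "bob"],
    "temperature": ("uniform", {"low": 20.0, "high": 25.0}),
    "batch": lambda n, _rng: [i // 10 for i in range(n)],
}
columns = {name: resolve_column(spec, 30, rng) for name, spec in specs.items()}
```

Passing the same random generator to every column keeps the whole set of columns reproducible from a single seed.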

Returns:

MetadataGenerationResult containing all generated metadata.

Raises:

ValueError – If n_samples is less than 1 or if n_repetitions would make it impossible to generate the requested samples.

Example

>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Simple case: 100 samples, no repetitions
>>> result = generator.generate(100)
>>> assert len(result.sample_ids) == 100
>>>
>>> # With repetitions: 50 bio samples, each measured twice
>>> result = generator.generate(100, n_repetitions=2)
>>> assert result.n_bio_samples == 50
>>>
>>> # Variable repetitions
>>> result = generator.generate(100, n_repetitions=(1, 3))
nirs4all.data.synthetic.metadata.generate_sample_metadata(n_samples: int, *, random_state: int | None = None, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1) Dict[str, ndarray][source]

Convenience function to generate sample metadata.

This is a simplified interface to MetadataGenerator for common use cases.

Parameters:
  • n_samples – Total number of samples to generate.

  • random_state – Random seed for reproducibility.

  • sample_id_prefix – Prefix for sample ID strings.

  • n_groups – Number of groups (None for no grouping).

  • group_names – Optional list of group names.

  • n_repetitions – Repetitions per biological sample.

Returns:

Dictionary with metadata arrays.

Example

>>> metadata = generate_sample_metadata(
...     n_samples=100,
...     random_state=42,
...     n_groups=3,
...     n_repetitions=(2, 4)
... )
>>> print(metadata.keys())