nirs4all.synthesis.metadata module
Metadata generation for synthetic NIRS datasets.
This module provides tools for generating realistic sample metadata including sample IDs, biological sample groupings, repetitions, and custom columns.
Example
>>> from nirs4all.synthesis.metadata import MetadataGenerator
>>>
>>> generator = MetadataGenerator(random_state=42)
>>> metadata = generator.generate(
... n_samples=100,
... sample_id_prefix="S",
... n_groups=3,
... n_repetitions=(2, 4)
... )
>>> print(metadata.keys())
dict_keys(['sample_id', 'bio_sample_id', 'repetition', 'group'])
- class nirs4all.synthesis.metadata.MetadataGenerationResult(sample_ids: ndarray, bio_sample_ids: ndarray | None = None, repetitions: ndarray | None = None, groups: ndarray | None = None, group_indices: ndarray | None = None, n_bio_samples: int = 0, additional_columns: Dict[str, ndarray] | None = None)[source]
Bases: object
Container for generated metadata.
- sample_ids
Unique sample identifiers.
- Type:
numpy.ndarray
- bio_sample_ids
Biological sample identifiers (before repetitions).
- Type:
numpy.ndarray | None
- repetitions
Repetition number for each sample.
- Type:
numpy.ndarray | None
- groups
Group assignments.
- Type:
numpy.ndarray | None
- group_indices
Integer group indices (for stratification).
- Type:
numpy.ndarray | None
- additional_columns
Any extra columns generated.
- Type:
Dict[str, numpy.ndarray] | None
- class nirs4all.synthesis.metadata.MetadataGenerator(random_state: int | None = None)[source]
Bases: object
Generate realistic metadata for synthetic NIRS datasets.
This class creates sample identifiers, biological sample groupings, repetition structures, and group assignments that mimic real spectroscopy datasets.
- rng
NumPy random generator for reproducibility.
- Parameters:
random_state – Random seed for reproducibility.
Example
>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Generate with repetitions and groups
>>> metadata = generator.generate(
... n_samples=100,
... sample_id_prefix="WHEAT",
... n_groups=3,
... group_names=["Field_A", "Field_B", "Field_C"],
... n_repetitions=(2, 4)
... )
>>>
>>> # Result: Each biological sample has 2-4 spectral measurements
>>> print(f"Bio samples: {metadata.n_bio_samples}")
>>> print(f"Total samples: {len(metadata.sample_ids)}")
- generate(n_samples: int, *, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1, bio_sample_prefix: str = 'B', additional_columns: Dict[str, Any] | None = None) MetadataGenerationResult[source]
Generate complete metadata for a synthetic dataset.
This method handles the complex logic of generating samples with repetitions while respecting group structures. When repetitions are requested, biological samples are created first, then each is replicated 1 or more times to create the final samples.
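The two-stage idea described above (biological samples first, then repetitions) can be sketched in plain Python. This is an illustration only, not the nirs4all implementation: `allocate_repetitions` and `build_ids` are hypothetical helpers, the zero-padded ID format is assumed, and the standard-library `random` module stands in for the NumPy generator.

```python
import random


def allocate_repetitions(n_samples, n_repetitions, seed=None):
    """Return per-bio-sample repetition counts that sum to n_samples.

    Hypothetical helper for illustration; not part of nirs4all.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if isinstance(n_repetitions, int):
        lo = hi = n_repetitions
    else:
        lo, hi = n_repetitions
    if lo < 1 or hi < lo:
        raise ValueError("n_repetitions must satisfy 1 <= min <= max")

    rng = random.Random(seed)
    counts = []
    remaining = n_samples
    while remaining > 0:
        # Draw a count in [lo, hi]; the final draw is truncated so the total
        # is exact, which means the last count may fall below lo.
        reps = min(rng.randint(lo, hi), remaining)
        counts.append(reps)
        remaining -= reps
    return counts


def build_ids(counts, sample_prefix="S", bio_prefix="B"):
    """Expand counts into parallel sample, bio-sample, and repetition lists."""
    sample_ids, bio_ids, repetitions = [], [], []
    for bio_index, reps in enumerate(counts):
        for rep in range(1, reps + 1):
            # Assumed ID format: prefix plus a zero-padded running index.
            sample_ids.append(f"{sample_prefix}{len(sample_ids):04d}")
            bio_ids.append(f"{bio_prefix}{bio_index:04d}")
            repetitions.append(rep)
    return sample_ids, bio_ids, repetitions


counts = allocate_repetitions(100, (2, 4), seed=42)
sample_ids, bio_ids, repetitions = build_ids(counts)
print(len(counts), "bio samples ->", len(sample_ids), "spectra")
```

Truncating the last draw is one possible way to hit the requested total exactly; the library may instead reject combinations it cannot satisfy, as the Raises entry for this method suggests.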
- Parameters:
n_samples – Total number of samples (spectra) to generate.
sample_id_prefix – Prefix for sample ID strings.
n_groups – Number of groups (None for no grouping).
group_names – Optional list of group names. If None and n_groups > 0, generates names like “Group_0”, “Group_1”, etc.
n_repetitions – Number of repetitions per biological sample. If int: fixed number of repetitions. If tuple (min, max): random number in range [min, max].
bio_sample_prefix – Prefix for biological sample IDs.
additional_columns – Dictionary of additional columns to generate. Keys are column names; values can be:
  - Callable(n_samples, rng) -> np.ndarray
  - List of values to randomly sample from
  - Tuple (distribution, params) for numeric data
- Returns:
MetadataGenerationResult containing all generated metadata.
- Raises:
ValueError – If n_samples is less than 1 or if n_repetitions would make it impossible to generate the requested samples.
Example
>>> generator = MetadataGenerator(random_state=42)
>>>
>>> # Simple case: 100 samples, no repetitions
>>> result = generator.generate(100)
>>> assert len(result.sample_ids) == 100
>>>
>>> # With repetitions: ~50 bio samples, each measured 2 times
>>> result = generator.generate(100, n_repetitions=2)
>>> assert result.n_bio_samples == 50
>>>
>>> # Variable repetitions
>>> result = generator.generate(100, n_repetitions=(1, 3))
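The three value forms accepted by additional_columns (callable, list, tuple) can be pictured as a small dispatch over the spec type. The sketch below is a hedged illustration, not the library's code: `resolve_column` is a hypothetical helper, the `(name, params)` tuple layout and the distribution names "uniform" and "normal" are assumptions, and plain lists with the standard-library `random` module stand in for NumPy arrays and the generator's `rng`.

```python
import random


def resolve_column(spec, n_samples, rng):
    """Turn one column spec into a list of n_samples values.

    Hypothetical helper for illustration; not part of nirs4all.
    """
    if callable(spec):
        # Callable form: the spec produces the whole column itself.
        return list(spec(n_samples, rng))
    if isinstance(spec, list):
        # List form: sample values with replacement.
        return [rng.choice(spec) for _ in range(n_samples)]
    if isinstance(spec, tuple):
        # Tuple form: the (name, params) layout used here is an assumption.
        name, params = spec
        if name == "uniform":
            low, high = params
            return [rng.uniform(low, high) for _ in range(n_samples)]
        if name == "normal":
            mean, std = params
            return [rng.gauss(mean, std) for _ in range(n_samples)]
        raise ValueError(f"unsupported distribution: {name!r}")
    raise TypeError(f"unsupported column spec: {type(spec).__name__}")


rng = random.Random(42)
specs = {
    "operator": ["alice", "bob"],
    "temperature": ("normal", (22.0, 1.5)),
    "batch": lambda n, r: [r.randint(1, 3) for _ in range(n)],
}
columns = {name: resolve_column(spec, 5, rng) for name, spec in specs.items()}
print({name: len(values) for name, values in columns.items()})
```

Sharing one random source across all columns keeps the whole table reproducible from a single seed, which matches the role the documentation gives to random_state.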
- nirs4all.synthesis.metadata.generate_sample_metadata(n_samples: int, *, random_state: int | None = None, sample_id_prefix: str = 'S', n_groups: int | None = None, group_names: List[str] | None = None, n_repetitions: int | Tuple[int, int] = 1) Dict[str, ndarray][source]
Convenience function to generate sample metadata.
This is a simplified interface to MetadataGenerator for common use cases.
- Parameters:
n_samples – Total number of samples to generate.
random_state – Random seed for reproducibility.
sample_id_prefix – Prefix for sample ID strings.
n_groups – Number of groups (None for no grouping).
group_names – Optional list of group names.
n_repetitions – Repetitions per biological sample.
- Returns:
Dictionary with metadata arrays.
Example
>>> metadata = generate_sample_metadata(
... n_samples=100,
... random_state=42,
... n_groups=3,
... n_repetitions=(2, 4)
... )
>>> print(metadata.keys())
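One common use of the biological sample identifiers is to keep all repetitions of a sample on the same side of a train/test split. The sketch below shows the idea on a small hand-written dictionary. It assumes the returned keys match those printed in the module-level example ('sample_id', 'bio_sample_id', 'repetition', 'group') and uses plain lists in place of NumPy arrays; the one-third hold-out fraction is arbitrary.

```python
import random
from collections import defaultdict

# Hand-written stand-in for the returned dictionary (lists, not arrays).
metadata = {
    "sample_id": ["S0", "S1", "S2", "S3", "S4", "S5", "S6"],
    "bio_sample_id": ["B0", "B0", "B1", "B1", "B1", "B2", "B2"],
    "repetition": [1, 2, 1, 2, 3, 1, 2],
    "group": ["Group_0", "Group_0", "Group_1", "Group_1",
              "Group_1", "Group_0", "Group_0"],
}

# Collect row positions per biological sample.
rows_by_bio = defaultdict(list)
for row, bio_id in enumerate(metadata["bio_sample_id"]):
    rows_by_bio[bio_id].append(row)

# Hold out whole biological samples rather than individual spectra.
rng = random.Random(0)
bio_ids = sorted(rows_by_bio)
rng.shuffle(bio_ids)
n_test = max(1, len(bio_ids) // 3)
test_bio = bio_ids[:n_test]
train_bio = bio_ids[n_test:]

test_rows = [row for bio_id in test_bio for row in rows_by_bio[bio_id]]
train_rows = [row for bio_id in train_bio for row in rows_by_bio[bio_id]]
print("train rows:", sorted(train_rows))
print("test rows:", sorted(test_rows))
```

Splitting on the biological sample rather than the spectrum prevents repeated measurements of one sample from appearing in both sets.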