Generated IDs with regex_format are in order - Randomized option would be nice #2368

spreeni · 2025-02-12T16:13:21Z

Problem Description

If I set a column's metadata to sdtype id with a regex_format, the IDs are generated incrementally, which does not reflect the feel of real data.

E.g. the pattern

metadata.update_column(column_name="sample_id", sdtype="id", regex_format=r"[1-9][0-9]{5}\^[0-9]")

generates IDs like ['100008^3', '100008^0', '100007^0', '100006^6', '100000^8']. These are scrambled, which is good, but they do not resemble real data points like ['120862^4', '417288^0', '505455^6', '364165^1', '418942^8']. I would have to generate millions of data points to reach this variety.

Expected behavior

I would like the sampled IDs to be truly randomly distributed, no matter how many IDs I create.

npatki · 2025-02-12T16:33:53Z

Hi @spreeni, our Regex-based ID generation is done via the RegexGenerator object. I believe what you are seeing is the 'scrambled' ID order, wherein the Regexes are generated in order but then scrambled when assembling the synthetic data. For truly random generation for any number of IDs, you would need the 'random' generation option.

This can be updated via the update_transformers functionality of any single or multi-table synthesizer.

from rdt.transformers.text import RegexGenerator

my_transformer = RegexGenerator(
    regex_format=r'[1-9][0-9]{5}\^[0-9]',
    enforce_uniqueness=True,
    generation_order='random')

synthesizer.update_transformers({
    'sample_id': my_transformer})

synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=10)

However, do note that the 'random' strategy is currently only available for SDV Enterprise customers. (Should you like to explore this option, you can always contact us here).
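
To make the difference between the two orders concrete, here is a minimal standalone sketch. It is not how RegexGenerator is implemented. The helper names (index_to_id, scrambled_ids, random_ids) are hypothetical, and the pattern from this issue is hard-coded instead of parsed from the regex. It only uses Python's standard library.

```python
import random

# Illustration only: hypothetical helpers, not part of SDV or RDT.
# The pattern [1-9][0-9]{5}\^[0-9] from this issue is hard-coded.
PREFIX_COUNT = 9 * 10**5             # [1-9][0-9]{5} covers 100000..999999
SUFFIX_COUNT = 10                    # [0-9]
TOTAL = PREFIX_COUNT * SUFFIX_COUNT  # 9,000,000 distinct IDs


def index_to_id(index):
    """Map an index in [0, TOTAL) to the ID at that position in pattern order."""
    prefix, suffix = divmod(index, SUFFIX_COUNT)
    return f"{100000 + prefix}^{suffix}"


def scrambled_ids(num_rows, seed=None):
    """Take the first num_rows IDs in pattern order, then shuffle them."""
    ids = [index_to_id(i) for i in range(num_rows)]
    random.Random(seed).shuffle(ids)
    return ids


def random_ids(num_rows, seed=None):
    """Draw num_rows distinct IDs uniformly from the whole pattern space."""
    indices = random.Random(seed).sample(range(TOTAL), num_rows)
    return [index_to_id(i) for i in indices]


if __name__ == "__main__":
    print(scrambled_ids(5, seed=0))  # shuffled, but all near the start of the range
    print(random_ids(5, seed=0))     # spread across the whole range
```

With 50 rows, every ID from scrambled_ids has a prefix between 100000 and 100004, which matches the clustering reported above. random_ids draws from all 9,000,000 possibilities regardless of the number of rows, while still keeping the IDs unique.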

I am curious whether, for your use case, it is critical that these IDs are in a random order? Is it critical to replicate the feel of the real data? We have assumed Regex IDs only serve the purpose of labeling rows, and that the actual content they carry (including random or alpha-numeric order) does not carry any statistical value. Though if this is not the case for your data, I'd love to hear more. Perhaps we can brainstorm other solutions.

spreeni added the feature request and new labels on Feb 12, 2025

npatki added the under discussion label and removed the new label on Feb 12, 2025
