Generated IDs with regex_format are in order - Randomized option would be nice #2368

Open
spreeni opened this issue Feb 12, 2025 · 1 comment
Labels
feature request (Request for a new feature), under discussion (Issue is currently being discussed)

Comments


spreeni commented Feb 12, 2025

Problem Description

If I set the metadata for a column to sdtype id with a regex_format, the generated IDs are incremental, which does not reflect the feel of real data.

E.g. the pattern

metadata.update_column(column_name="sample_id", sdtype="id", regex_format=r"[1-9][0-9]{5}\^[0-9]")

generates IDs like ['100008^3', '100008^0', '100007^0', '100006^6', '100000^8']. These are scrambled, which is good, but they do not resemble real data points such as ['120862^4', '417288^0', '505455^6', '364165^1', '418942^8']. I would have to generate millions of data points to reach this variety.

Expected behavior

I would like the sampled IDs to be truly randomly distributed, no matter how many IDs I create.

spreeni added the feature request and new labels Feb 12, 2025
Contributor

npatki commented Feb 12, 2025

Hi @spreeni, our Regex-based ID generation is done via the RegexGenerator object. I believe what you are seeing is the 'scrambled' generation order, in which the IDs are generated in order and then shuffled when the synthetic data is assembled. For truly random generation for any number of IDs, you would need the 'random' generation option.

This can be updated via the update_transformers functionality of any single or multi-table synthesizer.

from rdt.transformers.text import RegexGenerator

my_transformer = RegexGenerator(
    regex_format=r'[1-9][0-9]{5}\^[0-9]',
    enforce_uniqueness=True,
    generation_order='random')

synthesizer.update_transformers({
    'sample_id': my_transformer})

synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=10)

However, do note that the 'random' strategy is currently only available for SDV Enterprise customers. (Should you like to explore this option, you can always contact us here).
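To make the 'scrambled' behavior described above concrete, here is a minimal standard-library sketch. It is an illustration only, not RDT's actual implementation: the helper name `scrambled_ids` is hypothetical, and the real enumeration order inside RegexGenerator may differ. It takes the first n matches of the pattern in ascending order and then shuffles them, which is why a small sample stays clustered near the lowest values.

```python
import itertools
import random


def scrambled_ids(n, seed=None):
    """Illustrative only: take the first n pattern matches in order, then shuffle them."""
    # Enumerate matches in ascending order: 100000^0, 100000^1, ..., 100000^9, 100001^0, ...
    in_order = (
        f"{prefix}^{suffix}"
        for prefix in range(100_000, 1_000_000)
        for suffix in range(10)
    )
    # Only the first n matches are ever used, however large the pattern space is.
    ids = list(itertools.islice(in_order, n))
    # Shuffling changes the row order but not which IDs were chosen.
    random.Random(seed).shuffle(ids)
    return ids


print(scrambled_ids(5, seed=0))
```

With n=5, every ID produced by this sketch shares the prefix 100000, no matter how they are shuffled.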

I am curious whether, for your use case, it is critical that these IDs are in a random order? Is it critical to replicate the feel of the real data? We have assumed Regex IDs only serve the purpose of labeling rows, and that the actual content they carry (including random or alpha-numeric order) does not carry any statistical value. Though if this is not the case for your data, I'd love to hear more. Perhaps we can brainstorm other solutions.
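As one possible workaround for the pattern in this issue, the IDs could be drawn at random outside of SDV and written over the column after sampling. The sketch below uses only the Python standard library; `random_unique_ids` and `ID_PATTERN` are hypothetical names introduced here, and the arithmetic is hard-coded for `[1-9][0-9]{5}\^[0-9]` rather than being a general regex sampler.

```python
import random
import re

# The pattern from this issue, as a raw string so the backslash reaches the regex engine.
ID_PATTERN = re.compile(r"[1-9][0-9]{5}\^[0-9]")

# 900,000 six-digit prefixes (100000..999999) times 10 single-digit suffixes.
TOTAL_IDS = 900_000 * 10


def random_unique_ids(n, seed=None):
    """Draw n distinct IDs uniformly at random from the whole pattern space."""
    if n > TOTAL_IDS:
        raise ValueError(f"pattern only allows {TOTAL_IDS} unique IDs, got n={n}")
    rng = random.Random(seed)
    # Sampling from a range draws without replacement and does not build the full list.
    codes = rng.sample(range(TOTAL_IDS), n)
    # Decode each integer into a prefix (100000..999999) and a suffix digit (0..9).
    return [f"{100_000 + code // 10}^{code % 10}" for code in codes]


ids = random_unique_ids(5, seed=42)
print(ids)
```

The result could then replace the generated column, for example `synthetic_data['sample_id'] = random_unique_ids(len(synthetic_data))`. This assumes sample_id is not referenced as a foreign key elsewhere; in a multi-table setup the same old-to-new mapping would have to be applied to every referencing column.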

npatki added the under discussion label and removed the new label Feb 12, 2025