Hi @spreeni, our Regex-based ID generation is done via the RegexGenerator object. I believe what you are seeing is the 'scrambled' ID order, wherein the IDs are generated from the regex in order and then shuffled when assembling the synthetic data. For truly random generation regardless of the number of IDs, you would need the 'random' generation option.
However, do note that the 'random' strategy is currently only available for SDV Enterprise customers. (Should you like to explore this option, you can always contact us here).
I am curious whether, for your use case, it is critical that these IDs are in a random order? Is it critical to replicate the feel of the real data? We have assumed Regex IDs only serve the purpose of labeling rows, and that the actual content they carry (including random or alpha-numeric order) does not carry any statistical value. Though if this is not the case for your data, I'd love to hear more. Perhaps we can brainstorm other solutions.
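To make the difference between the two behaviours concrete, here is a minimal sketch in plain Python. It does not use or reproduce SDV's internals. The ID shape (six digits, `^`, one digit) is an assumption inferred from the example values in this issue, and the function names `scrambled_ids` and `random_ids` are hypothetical.

```python
import random


def scrambled_ids(n, start=100000, seed=None):
    """Build n IDs in sequential order, then shuffle them.

    The order is random, but every value stays close to `start`.
    """
    rng = random.Random(seed)
    ids = [f"{start + i // 10}^{i % 10}" for i in range(n)]
    rng.shuffle(ids)
    return ids


def random_ids(n, low=100000, high=999999, seed=None):
    """Draw n unique IDs uniformly from the whole space of matching strings."""
    rng = random.Random(seed)
    space = (high - low + 1) * 10  # every 6-digit prefix times 10 suffixes
    if n > space:
        raise ValueError("not enough distinct IDs for the requested sample size")
    # Sampling integers without replacement keeps the IDs unique;
    # each integer maps to exactly one (prefix, suffix) pair.
    picks = rng.sample(range(space), n)
    return [f"{low + p // 10}^{p % 10}" for p in picks]


if __name__ == "__main__":
    print(scrambled_ids(5, seed=0))
    print(random_ids(5, seed=0))
```

One possible workaround along these lines is to overwrite the ID column of the sampled table with values from something like `random_ids`. If the column is a primary key referenced from other tables, the same old-to-new mapping would have to be applied to every referencing column.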
Problem Description
If I set the metadata for a column to `id` with a `regex_format`, the IDs are created incrementally, which does not reflect the feel of real data. For example, my pattern generates IDs like `['100008^3', '100008^0', '100007^0', '100006^6', '100000^8']`. These are scrambled, which is good, but they do not reflect real data points like `['120862^4', '417288^0', '505455^6', '364165^1', '418942^8']`. I would have to generate millions of data points to reach this variety.
Expected behavior
I would like the sampled IDs to be truly randomly distributed, no matter how many IDs I create.