Inquiry: SNB Basic + SNB Composite Merge Foreign? #394
Comments
Hi @aMahanna, Two things:
1. The datagen is deterministic, so the graphs (including the IDs) should be the same between different generators. Therefore, combining files from the two data sets should be possible without introducing inconsistencies.
2. There are two ways to combine the files:
i. using the usual UNIX tools (cut etc.);
ii. using DuckDB. This approach also allows joins, aggregations, etc.
Gabor
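For example, a minimal DuckDB sketch of such a join, producing a Post file with the creator inlined. The file names, the '|' delimiter, and the Post.id/Person.id column headers are assumptions about the CSV layout, so adjust them to the actual files:
duckdb <<'EOF'
-- join the Basic hasCreator edge file onto the Post file and write a new CSV
-- (file and column names are assumptions, not verified against these data sets)
COPY (
    SELECT p.*, hc."Person.id" AS creator
    FROM read_csv_auto('post_0_0.csv', delim='|', header=true) AS p
    JOIN read_csv_auto('post_hasCreator_person_0_0.csv', delim='|', header=true) AS hc
      ON p.id = hc."Post.id"
) TO 'post_merged.csv' (FORMAT CSV, DELIMITER '|', HEADER);
EOF
|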
Apologies for the delay and for the confusion. After further investigation, we discovered a formatting mistake on our part when combining the files. I will follow up shortly regarding your second point, but for now I just want to say thank you for all the help so far.
|
Hi Gabor, We've been evaluating the various SNB datasets available in an attempt to support our database's multi-model functionality. We found that using a combination of the Basic & MergeForeign datasets substantially increases our query performance and better suits our data model. Our request would be to have the datagen natively support the data model outlined below, or to suggest a way to do so if one already exists. As it stands, modelling the data in this way requires a lot of pre/post-processing (as suggested above), which we believe will count against us if we were to have the benchmark audited. In particular, we have situations where a query benefits from the Basic dataset (IC8), a query that benefits from the MergeForeign dataset (IC3 Sub-Query A), and another query that benefits from a combination of both (IC3 Sub-Query B).
IC8
Understanding that you may not be familiar with AQL (Arango Query Language), this query relies on the edge relationships only available in the Basic dataset (e.g. …)
The alternative approach is to rely solely on the MergeForeign attributes (i.e. …)
IC3
We've noticed peak performance in IC3 when a combination of Basic SNB edge relationships & MergeForeign SNB attributes is used within the same query.
IC3 Sub-Query A
A portion of IC3 relies on the …
Attempting to do this using the Basic SNB …
IC3 Sub-Query B
Another portion of IC3 relies on the …
Attempting to do this using the Basic SNB …
Conclusion
As far as we can tell, the current datagen utility doesn't support this, and we feel that this leaves out the multi-model graph capabilities offered by our database. We are not looking to manipulate the data in a way that specifically favours us, but instead for the LDBC datagen to better support the functionality of multi-model graph databases. Would it be possible to have the datagen support this data model out of the box (assuming it doesn't already)? |
@aMahanna I transferred the issue to the (new, Spark-based) Datagen's repository. This week I'm travelling/have other duties -- I will take a look next week. |
Hi @szarnyasg, Sorry to hear that this functionality won't be supported in the utility, as it fits multi-model graph databases quite well. Was there some issue with implementing it, or would you still be open to having it added if we were able to? Apologies if I am missing something, but the datasets you just provided seem to have the same schema as before; was that the intention? Just trying to determine if there is a difference between these and the SURF datasets. Thank you again for all the help so far! Chris |
Hi,
It’s the same schema as before but R2 is (slightly) faster than SURF.
Sure, we are open to reviewing PRs in the Datagen.
Gabor
|
By the way, maybe an important piece of information that's missing from the discussion above: systems can pre-process the data set before loading. So you can take e.g. the composite merge foreign CSV files, run them through a script (which can use anything: cut, Perl scripts, a DuckDB SQL script, etc.), create a new set of CSV files, then load those into the system under test. We try to avoid this in the reference implementations but it is definitely a possibility.
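As a concrete (hypothetical) illustration of the cut route -- the file name and column positions below are assumptions, so check the actual header first:
# inspect the '|'-separated header to find the id and creator columns
head -1 post_0_0.csv
# project just those two columns into a new edge-like CSV (positions 1 and 9 are assumptions)
cut -d'|' -f1,9 post_0_0.csv > post_creator_pairs.csv
|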
Hi Gabor @szarnyasg Sorry to keep this thread going so long, but I downloaded and attempted to decompress the files above; the SF1 worked fine but SF1000 reports the following error:
/*stdin*\ : Read error (39) : premature end
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
The command I ran was the following:
tar --use-compress-program=unzstd -xvf bi-sf1000-composite-projected-fk.tar.zst.000
I attempted this with both the merge and projected files and receive the same error for the SF1000 files. Do you have any suggestions? |
Hi Chris,
Use cat + tar + unzstd:
https://github.com/ldbc/auditing-tools/blob/main/cloudflare-r2.md#recombining-and-decompressing-data-sets
For this, you'll need the 000, 001, etc. files in the same location.
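For example, adapting the command from above (this is the gist of the linked instructions; -f - makes tar read the archive from stdin):
cat bi-sf1000-composite-projected-fk.tar.zst.* | tar -xv --use-compress-program=unzstd -f -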
Gabor
|
Hi Gabor, I am unable to access that link; it shows a 404. Chris |
Oops, I linked to a private repo :). This is its public counterpart:
https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#streaming-decompression
|
Thank you, that did the trick! |
Hi Gabor, Do you, by chance, have the IC substitution parameters used for the datasets you shared above? I found this: https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#parameters but that only has the BI parameters. To confirm, while this is tagged with bi, I assumed the initial_snapshot would work for the IC queries as well; is this true? |
Hi Chris,
We are working on tuning the parameter generator for the new Interactive
workload.
Currently your best bet would be to download the factor tables from [1] and
run them through the Interactive v2 driver’s paramgen at [2]. These will
give valid parameters for queries on the initial snapshot.
The final version of the paramgen for Interactive v2 will produce
parameters bucketed by days (in the network’s simulation time) and will be
better calibrated to ensure stable runtimes (i.e. the runtimes will follow
a Gaussian distribution more closely). This is being worked on but is still a few
weeks away at the moment.
Best,
Gabor
PS: Most of the SNB task force is currently busy with other tasks / on
holiday, so there will be some delay in answering issues in the coming week.
[1]
https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#factor-tables
[2]
https://github.com/ldbc/ldbc_snb_interactive_driver/tree/main/paramgen
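Putting [1] and [2] together, the flow is roughly the following sketch; the environment variable and script path match what is used later in this thread, while the placeholder path is hypothetical:
git clone https://github.com/ldbc/ldbc_snb_interactive_driver.git
cd ldbc_snb_interactive_driver/paramgen
# point the paramgen at the root of the downloaded data set -- the exact layout it expects comes up later in the thread
export LDBC_SNB_DATA_ROOT_DIRECTORY=/path/to/data-set-root
./scripts/paramgen.sh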
|
I attempted to run the paramgen but I must be missing something.
I copied the factor folders into a factors folder I made within the paramgen folder, so the resulting structure looks like the following:
ls /data/ldbc_snb_interactive_driver/paramgen/factors/parquet/raw/composite-merged-fk/
cityNumPersons/ countryPairsNumFriends/ languageNumPosts/ personDays/ personLikesNumMessages/ personNumFriendTags/ sameUniversityConnected/
cityPairsNumFriends/ creationDayAndLengthCategoryNumMessages/ lengthNumMessages/ personDisjointEmployerPairs/ personNumFriendComments/ personNumFriends/ tagClassNumMessages/
companyNumEmployees/ creationDayAndTagClassNumMessages/ messageIds/ personFirstNames/ personNumFriendOfFriendCompanies/ personNumFriendsOfFriendsOfFriends/ tagClassNumTags/
countryNumMessages/ creationDayAndTagNumMessages/ people2Hops/ personKnowsPersonConnected/ personNumFriendOfFriendForums/ personStudyAtUniversityDays/ tagNumMessages/
countryNumPersons/ creationDayNumMessages/ people4Hops/ personKnowsPersonDays/ personNumFriendOfFriendPosts/ personWorkAtCompanyDays/ tagNumPersons/
After that, I export the variable for LDBC_SNB_DATA_ROOT_DIRECTORY to the data directory:
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/110822/merged/bi-sf1000-composite-merged-fk
And then I attempt to run the script while in the ldbc_snb_interactive_driver/paramgen directory:
./scripts/paramgen.sh
Traceback (most recent call last):
  File "paramgen.py", line 273, in <module>
    PG.run()
  File "paramgen.py", line 110, in run
    path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
    list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
    self.create_views()
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 52, in create_views
    self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/110822/merged/bi-sf1000-composite-merged-fk/graphs/parquet/raw/composite-merged-fk/dynamic/Person/*.parquet"
Do you have any suggestions for how I can resolve this? |
Oops, I forgot that the paramgen has undergone some changes recently and
it needs the raw data sets for parameter selection. You can find them under
the following links:
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1-raw.tar.zst
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3-raw.tar.zst
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10-raw.tar.zst
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf30-raw.tar.zst
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf100-raw.tar.zst
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf300-raw.tar.zst
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.000
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.001
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.000
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.001
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.002
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.003
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.000
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.001
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.002
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.003
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.004
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.005
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.006
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.007
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.008
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.009
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.010
* https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.011
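For the multi-part archives, a download-and-recombine sketch using the SF1000 raw files above (the same cat + tar + unzstd pattern as earlier in this thread):
for part in 000 001; do
    wget "https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.${part}"
done
cat bi-sf1000-raw.tar.zst.* | tar -xv --use-compress-program=unzstd -f -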
|
Thank you! To confirm, these generated parameters will be compatible with the Cloudflare datasets you linked above? |
Yes, they should be compatible
|
After downloading and unpacking I now receive the following:
***@***.***:/data/ldbc_snb_interactive_driver/paramgen# scripts/paramgen.sh
Traceback (most recent call last):
  File "paramgen.py", line 273, in <module>
    PG.run()
  File "paramgen.py", line 110, in run
    path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
    list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
    self.create_views()
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 66, in create_views
    self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/ldbc_snb_interactive_driver/paramgen/scratch/factors/people4Hops/*.parquet"
The folders in dynamic are the following:
Comment/ Forum/ Forum_hasTag_Tag/ Person_hasInterest_Tag/ Person_likes_Comment/ Person_studyAt_University/ Post/ _SUCCESS
Comment_hasTag_Tag/ Forum_hasMember_Person/ Person/ Person_knows_Person/ Person_likes_Post/ Person_workAt_Company/ Post_hasTag_Tag/
|
You need both the factors and the raw data set. See the CI commands for an example of where to place these directories:
https://github.com/ldbc/ldbc_snb_interactive_impls/blob/59d5fb15869464adf60400fca20554bc717dbc08/.circleci/config.yml#L48-L73
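For orientation, these are the two patterns the paramgen probes, both taken verbatim from the tracebacks above (the CI config remains the authoritative reference):
# raw data set, resolved against the environment variable:
#   $LDBC_SNB_DATA_ROOT_DIRECTORY/graphs/parquet/raw/composite-merged-fk/dynamic/Person/*.parquet
# factor tables, resolved against the paramgen directory:
#   scratch/factors/people4Hops/*.parquet
# so the variable should point at the directory that contains graphs/, not a deeper one:
export LDBC_SNB_DATA_ROOT_DIRECTORY=/path/to/directory-containing-graphs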
|
Thank you, the directory I was supplying was too far down.
|
@cw00dw0rd I added a sample script to the driver's CI that shows how to use the paramgen: Let me know if this fails for any of the larger data sets -- if so, there is a problem with the data sets. (Note that the |
@szarnyasg thank you, I will give this a try today and report back. |
Hi again 😄
In an experiment to support our database's multi-model functionality, we are trying to include the edges generated from the SNB Basic dataset with the files generated from the SNB CompositeMergeForeign dataset.
We are getting inconsistent results, and wondered if there is any consideration of supporting this with the datagen, or if by any chance this is already possible? For example, we want a data model where both the post_hasCreator_person relationship and the creator attribute in the Post document exist.
Happy to move this conversation to the datagen repo if that makes more sense.
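For concreteness, the two representations we'd like to coexist would look roughly like this (the values are made up and the header names are assumptions following the usual SNB CSV conventions):
# Basic-style edge file (post_hasCreator_person_0_0.csv):
#   Post.id|Person.id
#   1030792151058|4139
# CompositeMergeForeign-style node file with the creator inlined (post_0_0.csv):
#   id|...|creator
#   1030792151058|...|4139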