Hi. I'm considering generating a graph of all references in Wikipedia using WBI.
I want to generate it on disk first and then upload it to Wikibase using WBI.
I'm thinking the following algorithm would work:
1. Hash the article wikitext.
2. Parse the article wikitext.
3. Generate the base item using WBI.
4. Store the JSON data on file, using the article hash as the filename. Because of the number of references, split the files into folders named after the first three characters of the hash (see the hashing/sharding sketch after this list).
5. Hash the plain text of every reference found (ptrefhash).
6. Generate the reference item if an identifier was found (see the WBI sketch below).
7. Store the generated JSON on file, using the identifier hash as the filename, in a unique_references directory sharded like above.
8. Store the raw reference data on file, using the ptrefhash as the filename.
9. Keep a record of which articles have which raw reference hashes (Redis key: article hash + "refs", value: list of ptrefhashes, if any); see the Redis bookkeeping sketch below.
10. Keep a record of the unique citation references for each article (Redis key: article hash + "uniref", value: list of identifier hashes, if any).
11. Then loop over all unique references, upload the JSON to Wikibase for each one, and store the returned WCDQID in Redis (key = unihash, value = wcdqid).
12. Then loop over all articles and finish generating each item using its unihash list and the WCDQIDs from Redis. Upload up to 500 references per article in one go, and add any surplus references using addclaim (see the upload sketch below).
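To make the on-disk layout concrete, here is a minimal sketch of the hashing and three-character folder sharding I have in mind (sha256 and the directory names are just my assumptions):

```python
import hashlib
import json
from pathlib import Path

BASE_DIR = Path("json")  # assumed output root


def content_hash(text: str) -> str:
    """Hash wikitext or reference plain text; sha256 is an assumption."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sharded_path(root: Path, digest: str) -> Path:
    """Place files in folders named after the first three characters of the hash."""
    folder = root / digest[:3]
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{digest}.json"


def store_json(root: Path, digest: str, data: dict) -> None:
    """Write generated item JSON under its hash-based filename."""
    sharded_path(root, digest).write_text(json.dumps(data), encoding="utf-8")


# e.g. article items under json/articles/, unique references under json/unique_references/
# article_hash = content_hash(wikitext)
# store_json(BASE_DIR / "articles", article_hash, article_item_json)
```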
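Generating a reference item's JSON offline could look roughly like this, assuming a current WikibaseIntegrator-style API; the endpoint and property number are placeholders for whatever the target Wikibase uses:

```python
from wikibaseintegrator import WikibaseIntegrator
from wikibaseintegrator.datatypes import ExternalID
from wikibaseintegrator.wbi_config import config as wbi_config

# The endpoint and property number are placeholders for your Wikibase.
wbi_config["MEDIAWIKI_API_URL"] = "https://wikibase.example.org/w/api.php"
P_DOI = "P100"  # placeholder identifier property

wbi = WikibaseIntegrator()  # no login needed just to build JSON offline


def reference_item_json(title: str, doi: str | None) -> dict | None:
    """Build the reference item and return its JSON, or None if no identifier was found."""
    if not doi:
        return None
    item = wbi.item.new()
    item.labels.set("en", title[:250])  # labels are capped at roughly 250 chars
    item.claims.add(ExternalID(value=doi, prop_nr=P_DOI))
    return item.get_json()
```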
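The Redis bookkeeping could be as simple as this (key suffixes as described above, a local Redis assumed):

```python
import redis

# Assumes a local Redis (or SSDB, see the note at the end); key suffixes are placeholders.
r = redis.Redis(decode_responses=True)


def record_article_refs(article_hash: str, ptrefhashes: list[str], unihashes: list[str]) -> None:
    """Remember which raw references and which unique references an article has."""
    if ptrefhashes:
        r.rpush(article_hash + "refs", *ptrefhashes)
    if unihashes:
        r.rpush(article_hash + "uniref", *unihashes)


def unique_refs_for(article_hash: str) -> list[str]:
    """Return the identifier hashes recorded for an article."""
    return r.lrange(article_hash + "uniref", 0, -1)
```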
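And the two upload loops, again only as a sketch: the endpoint, credentials, "cites" property, directory layout and the use of from_json() are all assumptions, not tested code:

```python
import json
from pathlib import Path

import redis
from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.datatypes import Item as ItemClaim
from wikibaseintegrator.wbi_config import config as wbi_config

# All names below (endpoint, credentials, P_CITES, paths) are placeholders.
wbi_config["MEDIAWIKI_API_URL"] = "https://wikibase.example.org/w/api.php"
wbi = WikibaseIntegrator(login=wbi_login.Login(user="BotUser", password="bot-password"))

r = redis.Redis(decode_responses=True)      # same instance as the bookkeeping sketch
UNIQUE_DIR = Path("json/unique_references")
P_CITES = "P101"                            # placeholder "cites work" property


def upload_unique_references() -> None:
    """Loop over the stored unique-reference JSON, upload each item once,
    and remember the resulting WCDQID in Redis (key=unihash, value=wcdqid)."""
    for path in UNIQUE_DIR.rglob("*.json"):
        unihash = path.stem
        if r.get(unihash):                  # already uploaded
            continue
        item = wbi.item.new()
        # Assumes the stored JSON round-trips through WBI's from_json();
        # if your WBI version differs, rebuild the item from the raw data instead.
        item.from_json(json.loads(path.read_text()))
        written = item.write(summary="import unique reference")
        r.set(unihash, written.id)


def finish_article(article_hash: str, article_item) -> None:
    """Attach 'cites' claims to an article item, at most 500 per write;
    surplus claims go into follow-up edits (the addclaim step)."""
    unihashes = r.lrange(article_hash + "uniref", 0, -1)
    wcdqids = [r.get(h) for h in unihashes]
    claims = [ItemClaim(value=q, prop_nr=P_CITES) for q in wcdqids if q]

    article_item.claims.add(claims[:500])
    written = article_item.write(summary="import article with references")

    for start in range(500, len(claims), 500):
        written.claims.add(claims[start:start + 500])
        written = written.write(summary="add surplus reference claims")
```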
Does this seem like an efficient way to pre-generate a graph?
Any ideas for improvement?
We could store everything in Redis, but I'm unsure how big the database would get. Since SSDB exists for Redis-style databases that surpass the available memory and it caches to disk, it could be used instead if Redis fails because of the total size (see the sketch below).
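Since SSDB speaks a subset of the Redis protocol, the same client code should be able to point at it unchanged; the host and port here are assumptions (8888 is SSDB's default):

```python
import redis

# SSDB is wire-compatible with much of the Redis protocol, so redis-py can
# talk to it directly; host and port are assumptions for a local SSDB install.
r = redis.Redis(host="localhost", port=8888, decode_responses=True)
r.set("healthcheck", "ok")
print(r.get("healthcheck"))
```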