Hi. I'm considering generating a graph of all references in Wikipedia using WBI.
I want to generate it on disk first and then upload it to Wikibase using WBI.
I'm thinking the following algorithm would work:
1. Hash the article wikitext.
2. Parse the article wikitext.
3. Generate the base item using WBI.
4. Store the JSON data on file, using the article hash as the filename. Because of the number of references, split the files into folders named after the first three characters of the hash (see the hashing/sharding sketch after this list).
5. Hash the plain text of every reference found (ptrefhash).
6. Generate the reference item if an identifier was found (see the WBI sketch below).
7. Store the generated JSON on file, using the identifier hash as the filename, in a unique_references directory sharded like above.
8. Store the raw reference data on file, using the ptrefhash as the filename.
9. Keep a record of which articles have which raw reference hashes (Redis key: article hash + "refs", value: list of ptrefhashes, if any); see the Redis bookkeeping sketch below.
10. Keep a record of the unique citation references for each article (Redis key: article hash + "uniref", value: list of identifier hashes, if any).
11. Then loop over all unique references, upload the JSON to Wikibase for each one, and store the returned WCDQID in Redis (key = unihash, value = wcdqid).
12. Then loop over all articles and finish generating each item using its unihash list and the WCDQIDs from Redis. Upload up to 500 references per article in one go, and add any surplus references using addclaim (see the upload sketch below).
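To make the on-disk layout concrete, here is a minimal sketch of the hashing and three-character folder sharding I have in mind (sha256 and the directory names are just my assumptions):

```python
import hashlib
import json
from pathlib import Path

BASE_DIR = Path("json")  # assumed output root


def content_hash(text: str) -> str:
    """Hash wikitext or reference plain text; sha256 is an assumption."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sharded_path(root: Path, digest: str) -> Path:
    """Place files in folders named after the first three characters of the hash."""
    folder = root / digest[:3]
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{digest}.json"


def store_json(root: Path, digest: str, data: dict) -> None:
    """Write generated item JSON under its hash-based filename."""
    sharded_path(root, digest).write_text(json.dumps(data), encoding="utf-8")


# e.g. article items under json/articles/, unique references under json/unique_references/
# article_hash = content_hash(wikitext)
# store_json(BASE_DIR / "articles", article_hash, article_item_json)
```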
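Generating a reference item's JSON offline could look roughly like this, assuming a current WikibaseIntegrator-style API; the endpoint and property number are placeholders for whatever the target Wikibase uses:

```python
from wikibaseintegrator import WikibaseIntegrator
from wikibaseintegrator.datatypes import ExternalID
from wikibaseintegrator.wbi_config import config as wbi_config

# The endpoint and property number are placeholders for your Wikibase.
wbi_config["MEDIAWIKI_API_URL"] = "https://wikibase.example.org/w/api.php"
P_DOI = "P100"  # placeholder identifier property

wbi = WikibaseIntegrator()  # no login needed just to build JSON offline


def reference_item_json(title: str, doi: str | None) -> dict | None:
    """Build the reference item and return its JSON, or None if no identifier was found."""
    if not doi:
        return None
    item = wbi.item.new()
    item.labels.set("en", title[:250])  # labels are capped at roughly 250 chars
    item.claims.add(ExternalID(value=doi, prop_nr=P_DOI))
    return item.get_json()
```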
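The Redis bookkeeping could be as simple as this (key suffixes as described above, a local Redis assumed):

```python
import redis

# Assumes a local Redis (or SSDB, see the note at the end); key suffixes are placeholders.
r = redis.Redis(decode_responses=True)


def record_article_refs(article_hash: str, ptrefhashes: list[str], unihashes: list[str]) -> None:
    """Remember which raw references and which unique references an article has."""
    if ptrefhashes:
        r.rpush(article_hash + "refs", *ptrefhashes)
    if unihashes:
        r.rpush(article_hash + "uniref", *unihashes)


def unique_refs_for(article_hash: str) -> list[str]:
    """Return the identifier hashes recorded for an article."""
    return r.lrange(article_hash + "uniref", 0, -1)
```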
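And the two upload loops, again only as a sketch: the endpoint, credentials, "cites" property, directory layout and the use of from_json() are all assumptions, not tested code:

```python
import json
from pathlib import Path

import redis
from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.datatypes import Item as ItemClaim
from wikibaseintegrator.wbi_config import config as wbi_config

# All names below (endpoint, credentials, P_CITES, paths) are placeholders.
wbi_config["MEDIAWIKI_API_URL"] = "https://wikibase.example.org/w/api.php"
wbi = WikibaseIntegrator(login=wbi_login.Login(user="BotUser", password="bot-password"))

r = redis.Redis(decode_responses=True)      # same instance as the bookkeeping sketch
UNIQUE_DIR = Path("json/unique_references")
P_CITES = "P101"                            # placeholder "cites work" property


def upload_unique_references() -> None:
    """Loop over the stored unique-reference JSON, upload each item once,
    and remember the resulting WCDQID in Redis (key=unihash, value=wcdqid)."""
    for path in UNIQUE_DIR.rglob("*.json"):
        unihash = path.stem
        if r.get(unihash):                  # already uploaded
            continue
        item = wbi.item.new()
        # Assumes the stored JSON round-trips through WBI's from_json();
        # if your WBI version differs, rebuild the item from the raw data instead.
        item.from_json(json.loads(path.read_text()))
        written = item.write(summary="import unique reference")
        r.set(unihash, written.id)


def finish_article(article_hash: str, article_item) -> None:
    """Attach 'cites' claims to an article item, at most 500 per write;
    surplus claims go into follow-up edits (the addclaim step)."""
    unihashes = r.lrange(article_hash + "uniref", 0, -1)
    wcdqids = [r.get(h) for h in unihashes]
    claims = [ItemClaim(value=q, prop_nr=P_CITES) for q in wcdqids if q]

    article_item.claims.add(claims[:500])
    written = article_item.write(summary="import article with references")

    for start in range(500, len(claims), 500):
        written.claims.add(claims[start:start + 500])
        written = written.write(summary="add surplus reference claims")
```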
Does this seem like an efficient way to pre-generate a graph?
Any ideas for improvement?
We could store everything in Redis, but I'm unsure how big the database would get. Since SSDB exists for Redis-style databases that surpass the available memory and it caches to disk, it could be used instead if Redis fails because of the total size (see the sketch below).
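Since SSDB speaks a subset of the Redis protocol, the same client code should be able to point at it unchanged; the host and port here are assumptions (8888 is SSDB's default):

```python
import redis

# SSDB is wire-compatible with much of the Redis protocol, so redis-py can
# talk to it directly; host and port are assumptions for a local SSDB install.
r = redis.Redis(host="localhost", port=8888, decode_responses=True)
r.set("healthcheck", "ok")
print(r.get("healthcheck"))
```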