Revise UNTS corpus to match 7cultGEN (7GCA) #28
Comments
Created new corpus from files in folder @benjamingmartin: Is this Filename of new corpus is |
I checked this: 10707 was an incorrect entry. I have corrected it, and it is not part of 7CULTGEN (or any of the CULT listings) anymore! So I think we can rename the new corpus and remove "missing 10707"...! Thanks! |
Corpus has been renamed to |
I just used the Jupyter tool 3_word_in_doc_frequencies, and after selecting CultGEN and the new corpus (the one marked ...20200701_GEN...), it reports (in the pink progress report) that it has added 472 documents. |
We double-checked the corpus yesterday and found 24 treaties missing from the previously generated corpus. @aibakeneko is generating new files. We will generate three distinct corpora to avoid confusion and to make checking easier. |
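A check like the one described above can be sketched as a set comparison between the treaty IDs expected in the corpus and the IDs actually present. This is only an illustration, not the code used in the project: the function name `find_missing_treaties` and the sample IDs are hypothetical.

```python
def find_missing_treaties(expected_ids, corpus_ids):
    """Return the expected treaty IDs that are absent from the corpus, sorted."""
    return sorted(set(expected_ids) - set(corpus_ids))


# Hypothetical IDs, for illustration only.
expected = ["A1", "A2", "A3", "A4"]
in_corpus = ["A1", "A3"]
missing = find_missing_treaties(expected, in_corpus)
print(missing)
```

Running the same comparison in the other direction (corpus IDs minus expected IDs) would flag documents that should not be in the corpus at all.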
OK, thanks! But actually, there is no need for multiple (English-language) corpora. I only need one in English and one in French:
The real reason that 7CULT+ is an important category is for use in doing quantitative WTI work, not for text analysis. |
Update 20200702: I have reviewed the notebooks, and associated code, that are either prioritized or used in the 2020 article notebook (under 4_publications). The discrepancy in the number of treaties arose because the WTI treaty source column had not previously been treated as a filter. Instead, I assumed that all treaties with the "is_cultural_yesno" attribute set to "yes" were relevant candidate treaties from which further subsets are created (e.g. based on time, topic, parties). This must have been a misunderstanding on my part.
Certain text-analysis functions/notebooks that use the text corpus do not filter out documents in the selected corpus in any way. I have (hopefully) added a printed statement to these notebooks. I have not reviewed all notebooks under "3_text_analysis". Note that this is true for all notebooks under "0_bens_prioritized". Please make sure that the GEN WTI index is selected. This is now also the default index. I have also specified this index for the new article notebook. The "1_analysis_Rise_of_CT_article_2020" notebook does not have this index preset. I wish I had more time to test and verify these changes before vacation. Please pay close attention to strange or odd values or regressed behavior. We should do a joint review of the functionality in early August. |
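The treaty-count discrepancy described in this update can be illustrated with a small sketch. Only the `is_cultural_yesno` field name comes from this thread. The `source` key, its values, the `treaty_id` key, the function `select_cultural`, and the sample rows are all assumptions made for illustration, and the real WTI index may be structured differently.

```python
# Hypothetical rows; only the "is_cultural_yesno" field name comes from the discussion.
treaties = [
    {"treaty_id": "A1", "source": "UNTS", "is_cultural_yesno": "yes"},
    {"treaty_id": "A2", "source": "LTS", "is_cultural_yesno": "yes"},
    {"treaty_id": "A3", "source": "OTHER", "is_cultural_yesno": "yes"},
    {"treaty_id": "A4", "source": "UNTS", "is_cultural_yesno": "no"},
]


def select_cultural(rows, sources=None):
    """Select treaties flagged as cultural, optionally restricted to given sources."""
    selected = []
    for row in rows:
        if row["is_cultural_yesno"] != "yes":
            continue
        # When a set of sources is given, the source column acts as a second filter.
        if sources is not None and row["source"] not in sources:
            continue
        selected.append(row["treaty_id"])
    return selected


flag_only = select_cultural(treaties)
flag_and_source = select_cultural(treaties, sources={"LTS", "UNTS"})
print(len(flag_only), len(flag_and_source))
```

In this toy data the flag-only selection returns one more treaty than the selection that also applies the source filter, which is the kind of mismatch in document counts reported earlier in the thread.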
Create new corpus, to be called "LTS+UNTS GCAs, 1935-1972". This includes the English-language texts of all general cultural agreements (GCAs) deposited with the League of Nations or the United Nations between 1935 and 1972.
These are the texts marked with "yes" in the WTI version edited for 7CULTGEN (14 May 2020).
All texts are in the Google Drive folder "Treaty texts (.txt) (finished)."
Note: I have updated which texts we have in "Treaties Master List" and in "Treaties Master List (edited for 7CULTGEN)", but not in the new "triple" file.
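Assembling such a corpus from a folder of text files could look roughly like the sketch below. It assumes, purely for illustration, that each treaty is stored as `<treaty_id>.txt` and that the corpus is a zip archive. Neither assumption is confirmed in this issue, and `build_corpus_archive` is a hypothetical helper, not part of the project code.

```python
import tempfile
import zipfile
from pathlib import Path


def build_corpus_archive(text_dir, treaty_ids, archive_path):
    """Zip the .txt files whose stem is in treaty_ids; return the IDs with no file."""
    text_dir = Path(text_dir)
    not_found = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for treaty_id in sorted(treaty_ids):
            path = text_dir / f"{treaty_id}.txt"
            if path.is_file():
                archive.write(path, arcname=path.name)
            else:
                not_found.append(treaty_id)
    return not_found


# Demo with throwaway files and hypothetical IDs.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "A1.txt").write_text("text of treaty A1", encoding="utf-8")
    (folder / "A2.txt").write_text("text of treaty A2", encoding="utf-8")
    archive_file = folder / "corpus.zip"
    not_found = build_corpus_archive(folder, {"A1", "A2", "A3"}, archive_file)
    with zipfile.ZipFile(archive_file) as archive:
        archived_names = archive.namelist()

print(archived_names, not_found)
```

Returning the IDs that have no matching file makes it easy to see immediately which selected treaties still lack a text.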