Revise UNTS corpus to match 7cultGEN (7GCA) #28

Open
benjamingmartin opened this issue Jun 2, 2020 · 7 comments

@benjamingmartin
Collaborator

Create new corpus, to be called "LTS+UNTS GCAs, 1935-1972". This includes the English-language texts of all general cultural agreements (GCAs) deposited with the League of Nations or the United Nations between 1935 and 1972.
These are the texts marked with "yes" in the WTI version edited for 7CULTGEN (14 May 2020).

All texts are in google drive folder "Treaty texts (.txt) (finished)."
Note: I have updated which texts we have in "Treaties Master List" and in "Treaties Master List (edited for 7CULTGEN)", but not in the new "triple" file.

@aibakeneko
Collaborator

aibakeneko commented Jun 3, 2020

Created new corpus from files in folder Treaty texts (.txt) (finished). One treaty is missing: 10707.

@benjamingmartin: Is this 110707_en_corr.txt in folder treaty_text_sub_corpora_en_201908_region_1st_world? Should it be added?

Filename of new corpus is treaty_text_corpora_en_20200603_LTS+UNTS_GCAs_1935-1972_(missing_10707).zip

aibakeneko pushed a commit that referenced this issue Jun 3, 2020
aibakeneko pushed a commit that referenced this issue Jun 3, 2020
@benjamingmartin
Collaborator Author

I checked this: 10707 was an incorrect entry. I have corrected it, and it is not part of 7CULTGEN (or any of the CULT listings) anymore! So I think we can rename the new corpus and remove "missing 10707"...! Thanks!

aibakeneko pushed a commit that referenced this issue Jun 9, 2020
@aibakeneko
Collaborator

Corpus has been renamed to treaty_text_corpora_en_20200603_LTS+UNTS_GCAs_1935-1972.zip

@benjamingmartin
Collaborator Author

I just used the Jupyter tool 3_word_in_doc_frequencies, and after selecting CultGEN and the new corpus (the one marked ...20200701_GEN...) it says (in the pink progress report) that it has added 472 documents.
In fact, there should be 458 texts in the LTS+UNTS corpus. I have figured out that it must be including the 14 extra documents that we have, and that are in English, marked "en" in column E of Treaties Master List (triple), but that are NOT from LTS or UNTS.
The point is that the process of selecting texts for inclusion in the corpus cannot rely on column E.
So, I guess we can either: (a) fix the selection procedure so that it takes texts that have LTS, UNTS, or UNXX in column H ("source"); or, (b) I can just manually remove these fourteen texts from a zip folder that will then be stable.
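A minimal sketch of what option (a) could look like, assuming the master list has been loaded as a list of row dictionaries. The keys `language` and `source` are hypothetical stand-ins for columns E and H, and the sample rows are made up for illustration; this is not the project's actual selection code.

```python
# Sources that count as League of Nations / United Nations deposits.
ALLOWED_SOURCES = {"LTS", "UNTS", "UNXX"}


def select_corpus_rows(rows, allowed_sources=ALLOWED_SOURCES):
    """Keep English texts whose source column is one of the allowed sources.

    The language column alone is not enough, because English texts from
    other sources would also be selected.
    """
    return [
        row
        for row in rows
        if row.get("language") == "en" and row.get("source") in allowed_sources
    ]


# Hypothetical rows; the real master list has many more columns.
rows = [
    {"treaty_id": "A1", "language": "en", "source": "UNTS"},
    {"treaty_id": "A2", "language": "en", "source": "LTS"},
    {"treaty_id": "A3", "language": "en", "source": "OTHER"},
    {"treaty_id": "A4", "language": "fr", "source": "UNTS"},
]

selected = select_corpus_rows(rows)
selected_ids = [row["treaty_id"] for row in selected]
print(selected_ids)
```

Filtering on both columns would drop an English text from another source (like `A3` here) without any manual editing of the zip folder.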
Let me know; and sorry for the trouble!

@roger-mahler
Collaborator

We double-checked the corpus yesterday and found 24 missing treaties in the previously generated corpus. @aibakeneko is generating new files. We will generate three distinct corpora to avoid confusion and to make checking easier.
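A check like this could be sketched as a set difference between the expected treaty IDs and the IDs found in the zip file. The sketch assumes that each text file is named `<treaty_id>_en_corr.txt` (inferred from the one filename mentioned earlier in this thread); the IDs and the in-memory archive below are made up for illustration.

```python
import io
import zipfile


def treaty_ids_in_zip(zip_file):
    """Return the treaty IDs of all .txt files in the archive.

    Assumes the ID is the part of the filename before the first underscore.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return {
            name.split("/")[-1].split("_")[0]
            for name in archive.namelist()
            if name.endswith(".txt")
        }


def missing_treaties(expected_ids, zip_file):
    """Return the expected IDs that have no text file in the archive, sorted."""
    return sorted(set(expected_ids) - treaty_ids_in_zip(zip_file))


# Build a small in-memory archive so the example runs without real files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("100001_en_corr.txt", "text of treaty 100001")
    archive.writestr("100002_en_corr.txt", "text of treaty 100002")
buffer.seek(0)

missing = missing_treaties(["100001", "100002", "100003"], buffer)
print(missing)
```

For a real corpus, `zip_file` would be the path to the zip file and `expected_ids` would come from the master list.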

@benjamingmartin
Collaborator Author

OK, thanks! But actually, there is no need for multiple (English-language) corpora. I only need one in English and one in French:

  1. LTS+UNTS, 1935-1972, which anyway only includes texts that I have marked as General Cultural Agreements (that is, they have "yes" in CULTGEN)
  2. France's cultural agreements, 1919-1972 (which we haven't talked about for a long time, but which I want to come back to sometime)

The real reason that 7CULT+ is an important category is for use in doing quantitative WTI work, not for text analysis.

@roger-mahler
Collaborator

roger-mahler commented Jul 1, 2020

Update 20200702:

I have reviewed the notebooks, and the associated code, that are either prioritized or used in the 2020 article notebook (under 4_publications). The discrepancy in the number of treaties arose because the WTI treaty source column had not previously been treated as a filter. Instead, I assumed that all treaties with the "is_cultural_yesno" attribute set to "yes" are relevant candidate treaties from which further subsets are created (e.g. based on time, topic, parties). This must have been a misunderstanding on my part.

  • I have added a treaty_sources filter to all functions used in the prioritized notebooks or the article notebook.
  • I have created a new parallel version of the notebook "1_analysis_Rise_of_CT_article_2020" named "200702_analysis_Rise_of_CT_article_2020" where this filter has been added to all notebook cells. Please note that treaty_source None is the old, unfiltered behavior, so I guess it should be set to ['LTS', 'UNTS', 'UNXX'] in most cases. Note that I have kept the None value for most of the cells.
  • I have changed the default values of certain select options to reduce the risk of errors.
  • I have also uploaded a single corpus named treaty_text_corpora_en_20200701_GEN_LTS+UNTS_GCAs_1935-1972.zip consisting of 458 English treaties with CULTGEN = 'yes', source = LTS/UNTS/UNXX, period 1935-1972. Furthermore, when corpora are prepared, only sources LTS/UNTS/UNXX are considered (this of course has no effect on the new corpus). The PLUS and ORG indexes should NOT be used with this corpus since it contains only GEN treaties. I guess ORG and PLUS shouldn't be used at all from now on?
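The filter behavior described above can be sketched roughly as follows. This is an illustration, not the code in the notebooks: the function names and the `source` and `year` keys are hypothetical, and only `is_cultural_yesno` and the three source values are taken from this thread.

```python
def filter_by_treaty_sources(treaties, treaty_sources=None):
    """Filter treaties by source.

    treaty_sources=None reproduces the old, unfiltered behavior.
    """
    if treaty_sources is None:
        return list(treaties)
    allowed = set(treaty_sources)
    return [t for t in treaties if t["source"] in allowed]


def select_gen_corpus(
    treaties,
    treaty_sources=("LTS", "UNTS", "UNXX"),
    period=(1935, 1972),
):
    """Select cultural treaties within the period, then apply the source filter."""
    start, end = period
    candidates = [
        t
        for t in treaties
        if t["is_cultural_yesno"] == "yes" and start <= t["year"] <= end
    ]
    return filter_by_treaty_sources(candidates, treaty_sources)


# Hypothetical records for illustration only.
treaties = [
    {"treaty_id": "T1", "source": "UNTS", "year": 1950, "is_cultural_yesno": "yes"},
    {"treaty_id": "T2", "source": "OTHER", "year": 1950, "is_cultural_yesno": "yes"},
    {"treaty_id": "T3", "source": "LTS", "year": 1930, "is_cultural_yesno": "yes"},
    {"treaty_id": "T4", "source": "UNXX", "year": 1972, "is_cultural_yesno": "no"},
]

unfiltered_ids = [t["treaty_id"] for t in select_gen_corpus(treaties, treaty_sources=None)]
filtered_ids = [t["treaty_id"] for t in select_gen_corpus(treaties)]
print(unfiltered_ids, filtered_ids)
```

In this sketch the source list is the default, so a cell has to pass `treaty_sources=None` explicitly to get the old behavior. That is one possible design; the notebooks described above keep None in most cells.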

Certain text-analytic functions/notebooks that use the text corpus do not in any way filter out documents in the selected corpus. I have (hopefully) added a printed statement for these notebooks. I have not reviewed all notebooks under "3_text_analysis". Note that this is true for all notebooks under "0_bens_prioritized".

Please make sure that the GEN WTI index is selected. This is now also the default index. I have also specified this index for the new article notebook. The "1_analysis_Rise_of_CT_article_2020" notebook does not have this index preset.

I wish I had more time to test and verify these changes before vacation. Please pay close attention to strange or odd values or regressed behaviors. We should do a joint review of the functionality in early August.
