Revise UNTS corpus to match 7cultGEN (7GCA) #28

benjamingmartin · 2020-06-02T11:31:51Z

Create new corpus, to be called "LTS+UNTS GCAs, 1935-1972". This includes the English-language texts of all general cultural agreements (GCAs) deposited with the League of Nations or the United Nations between 1935 and 1972.
These are the texts marked with "yes" in the WTI version edited for 7CULTGEN (14 May 2020).

All texts are in google drive folder "Treaty texts (.txt) (finished)."
Note: I have updated which texts we have in "Treaties Master List" and in "Treaties Master List (edited for 7CULTGEN)", but not in the new "triple" file.

aibakeneko · 2020-06-03T13:13:53Z

Created new corpus from files in folder Treaty texts (.txt) (finished). One treaty is missing: 10707.

@benjamingmartin: Is this 110707_en_corr.txt in folder treaty_text_sub_corpora_en_201908_region_1st_world? Should it be added?

Filename of new corpus is treaty_text_corpora_en_20200603_LTS+UNTS_GCAs_1935-1972_(missing_10707).zip

benjamingmartin · 2020-06-04T10:07:53Z

I checked this: 10707 was an incorrect entry. I have corrected it, and it is not part of 7CULTGEN (or any of the CULT listings) anymore! So I think we can rename the new corpus and remove "missing 10707"...! Thanks!

aibakeneko · 2020-06-09T10:39:15Z

Corpus has been renamed to treaty_text_corpora_en_20200603_LTS+UNTS_GCAs_1935-1972.zip

benjamingmartin · 2020-07-01T13:49:14Z

I just used the Jupyter tool 3_word_in_doc_frequencies, and after selecting CultGEN and the new corpus (the one marked ...20200701_GEN...) it says (in the pink progress report) that it has added 472 documents.
In fact, there should be 458 texts in the LTS+UNTS corpus. I have figured out that it must be including the 14 extra documents that we have, and that are in English, marked "en" in column E of Treaties Master List (triple), but that are NOT from LTS or UNTS.
The point is that the process of selecting texts for inclusion in the corpus cannot rely on column E.
So, I guess we can either: (a) fix the selection procedure so that it takes texts that have LTS, UNTS, or UNXX in column H ("source"); or, (b) I can just manually remove these fourteen texts from a zip folder that will then be stable.
Let me know; and sorry for the trouble!
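
For illustration, option (a) could be sketched roughly as below. This is only a minimal sketch, not the project's actual selection code: the row layout, the key names "language" (standing in for column E) and "source" (standing in for column H), and the sample treaty IDs are all assumptions made up for the example.

```python
# Hypothetical rows standing in for "Treaties Master List (triple)".
# The keys "language" (column E) and "source" (column H) are assumed names.
ALLOWED_SOURCES = {"LTS", "UNTS", "UNXX"}


def select_by_source(rows, allowed=ALLOWED_SOURCES):
    """Keep only rows whose source is one of the allowed series."""
    return [row for row in rows if row["source"] in allowed]


rows = [
    {"treaty_id": "A1", "language": "en", "source": "LTS"},
    {"treaty_id": "A2", "language": "en", "source": "UNTS"},
    {"treaty_id": "A3", "language": "en", "source": "UNXX"},
    # English text, but not deposited with LTS or UNTS:
    {"treaty_id": "A4", "language": "en", "source": "OTHER"},
]

# Selecting on language alone picks up the extra document ...
by_language = [row for row in rows if row["language"] == "en"]
# ... while selecting on source leaves it out.
by_source = select_by_source(rows)
extra = len(by_language) - len(by_source)
```

In the real list, the same comparison should account for the 14 extra documents (472 versus 458).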

roger-mahler · 2020-07-01T13:51:21Z

We double-checked the corpus yesterday and found 24 missing treaties in the previously generated corpus. @aibakeneko is generating new files. We will generate three distinct corpora to avoid confusion and make checking easier.

benjamingmartin · 2020-07-01T14:01:12Z

OK, thanks! But actually, there is no need for multiple (English-language) corpora. I only need one in English and one in French:

LTS+UNTS, 1935-1972, which anyway only includes texts that I have marked as General Cultural Agreements (that is, they have "yes" in CULTGEN)
France's cultural agreements, 1919-1972 (which we haven't talked about for a long time, but which I want to come back to sometime)

The real reason that 7CULT+ is an important category is for use in doing quantitative WTI work, not for text analysis.

roger-mahler · 2020-07-01T15:59:36Z

Update 20200702:

I have reviewed the notebooks, and associated code, that are either prioritized or used in the 2020 article notebook (under 4_publications). The discrepancy in the number of treaties arose because the WTI treaty source column had not previously been treated as a filter. Instead, I assumed that all treaties with the "is_cultural_yesno" attribute set to "yes" are relevant candidate treaties from which further subsets are created (e.g. based on time, topic, parties). This must have been a misunderstanding on my part.

I have added a treaty_sources filter to all functions used in prioritized notebook or article notebook.
I have created a new parallel version of the notebook "1_analysis_Rise_of_CT_article_2020" named "200702_analysis_Rise_of_CT_article_2020" where this filter has been added to all notebook cells. Please note that treaty_source None is the old, unfiltered, behavior so I guess it should be set to ['LTS', 'UNTS', 'UNXX'] in most cases. Note that I have kept the None value for most of the cells.
I have changed the default values of certain select options to reduce the risk of errors.
I have also uploaded a single corpus named treaty_text_corpora_en_20200701_GEN_LTS+UNTS_GCAs_1935-1972.zip consisting of 458 English treaties having CULTGEN = 'yes', source = LTS/UNTS/UNXX, period 1935-1972. Furthermore, when the corpora are prepared, only sources LTS/UNTS/UNXX are considered (this of course has no effect on the new corpus). The PLUS and ORG indexes should NOT be used with this corpus since it only contains GEN treaties. I guess ORG and PLUS shouldn't be used at all from now on?
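
To make the None-versus-list behavior of the source filter concrete, here is a small sketch. It does not reproduce the notebook code: the function name filter_treaties, its parameters, and the record fields ("cultgen", "source", "year") are hypothetical and chosen only to mirror the criteria listed above.

```python
def filter_treaties(treaties, treaty_sources=None, period=None, cultgen_only=True):
    """Return treaties matching the given criteria.

    treaty_sources=None means "do not filter on source" (the old behavior);
    period=None means "do not filter on year".
    """
    selected = []
    for treaty in treaties:
        if cultgen_only and treaty["cultgen"] != "yes":
            continue
        if treaty_sources is not None and treaty["source"] not in treaty_sources:
            continue
        if period is not None and not (period[0] <= treaty["year"] <= period[1]):
            continue
        selected.append(treaty)
    return selected


# Made-up sample records, one per exclusion case.
treaties = [
    {"id": "T1", "cultgen": "yes", "source": "LTS", "year": 1936},
    {"id": "T2", "cultgen": "yes", "source": "UNTS", "year": 1960},
    {"id": "T3", "cultgen": "yes", "source": "OTHER", "year": 1950},  # wrong source
    {"id": "T4", "cultgen": "no", "source": "UNTS", "year": 1955},    # not a GCA
    {"id": "T5", "cultgen": "yes", "source": "UNXX", "year": 1980},   # outside period
]

# Old behavior: only the CULTGEN flag is applied.
unfiltered = filter_treaties(treaties)

# New behavior: restrict to LTS/UNTS/UNXX and to 1935-1972.
filtered = filter_treaties(
    treaties,
    treaty_sources=["LTS", "UNTS", "UNXX"],
    period=(1935, 1972),
)
```

With a None default, any cell that does not pass the source list explicitly keeps the old, unfiltered result, which is why the value should be checked cell by cell.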

Certain text-analytic functions/notebooks that use the text corpus do not filter out documents in the selected corpus in any way. I have (hopefully) added a printed statement to these notebooks. I have not reviewed all notebooks under "3_text_analysis". Note that this is true for all notebooks under "0_bens_prioritized".

Please make sure that the GEN WTI index is selected. This is now also the default index. I have also specified this index for the new article notebook. The "1_analysis_Rise_of_CT_article_2020" notebook does not have this index preset.

I wish I had more time to test and verify these changes before vacation. Please pay close attention to strange or odd values or regressed behaviors. We should do a joint review of the functionality in early August.

benjamingmartin assigned benjamingmartin, roger-mahler and aibakeneko and unassigned benjamingmartin Jun 2, 2020

aibakeneko pushed a commit that referenced this issue Jun 3, 2020

Ref. #28

0eb85c6

aibakeneko pushed a commit that referenced this issue Jun 3, 2020

Ref. #28

8c0faac

aibakeneko pushed a commit that referenced this issue Jun 9, 2020

Fixed corpus file pattern (ref #28)

bec88e4

aibakeneko closed this as completed in 595a806 Jun 9, 2020

benjamingmartin reopened this Jul 1, 2020

roger-mahler mentioned this issue Aug 23, 2020

Quantitative analysis raises exception: "ValueError: None cannot be transformed to a widget" #31

Closed
