We manually checked the largest domains of the Slovene and Croatian web corpora.
For Slovene, 820 domains were checked, representing 63% of the entire (deduplicated) corpus (1.4B tokens out of a total of 2.2B). 140 bad domains were identified, which amounts to 5% of the entire corpus. In addition, 6% of the checked domains were marked as low quality, meaning that they contain some issues but most of the sentences in the sample are of good quality. 86% of the checked domains were marked as okay.
For Croatian, 976 domains were checked, representing 69% of the entire (deduplicated) corpus (1.9B tokens out of a total of 2.76B). 121 bad domains were identified, which amounts to 4% of the entire corpus. In addition, 23% of the checked domains were marked as low quality, meaning that they contain some issues but most of the sentences in the sample are of good quality. 64% of the checked domains were marked as okay. The share of okay domains is much lower in the Croatian corpus than in the Slovene corpus because Croatian is very similar to Serbian and Bosnian: many domains in the corpus turned out to be written in a closely related language rather than in Croatian, which was not an issue in the Slovene corpus. Further comparison between the results of manual checking of the Slovene and Croatian corpora is available here.
Procedure: A table of the most frequent domains in the target-language corpus was created. For each domain, the table provided a link to the live site and a link to 50 concordances (random triples of sentences) from that domain in the Sketch Engine concordancer. For the first 20 domains, the annotator checked the live site and all 50 concordances; for the following domains, up to the 350th, the annotator checked the first 15 concordances. Other, smaller domains (up to 1,500 additional domains) were checked only when the URL was unusual and suggested that the domain might be problematic. Such URLs 1) contain “porn” or other unusual words, 2) do not use the national domain (checking proved especially necessary for the .eu domain and for domains of countries with a closely related language), 3) begin with a language code, e.g. “sl.toolbox-site.com” (such domains are very likely machine translated), or 4) were not clickable in the sheet (which often indicated a porn website that had already been removed).
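The checking depth by frequency rank and the first three URL criteria can be sketched in Python as follows. This is only an illustration of the procedure described above, not the tooling that was actually used: the function names, the word list and the set of language codes are assumptions, and the fourth criterion (links that were not clickable in the sheet) cannot be derived from the domain name alone, so it is left out.

```python
# Hypothetical values for illustration; the source does not give the exact
# word list or language codes that the annotator looked for.
SUSPICIOUS_WORDS = ("porn",)
LANGUAGE_CODES = {"sl", "hr"}


def concordances_to_check(rank):
    """Number of concordances inspected for a domain at a given frequency rank."""
    if rank <= 20:
        return 50  # live site plus all 50 concordances
    if rank <= 350:
        return 15  # first 15 concordances only
    return 0       # smaller domains are checked only if the URL looks unusual


def domain_flags(domain, national_tld):
    """Return the reasons why a smaller domain might deserve a manual check."""
    host = domain.lower().strip(".")
    labels = host.split(".")
    flags = []
    if any(word in host for word in SUSPICIOUS_WORDS):
        flags.append("unusual word in name")
    if labels[-1] != national_tld:
        flags.append("non-national TLD")
    if len(labels) > 2 and labels[0] in LANGUAGE_CODES:
        flags.append("starts with language code")
    return flags
```

For example, `domain_flags("sl.toolbox-site.com", "si")` reports both a non-national TLD and a leading language code, so under this sketch the domain would be queued for inspection.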
The annotator marked each domain as ok, check, lq (low-quality), bad or issue. The bad tag was used when the domain contained so few good-quality sentences in the target language that the annotator judged it better to remove the whole domain than to try to filter out only the problematic parts. Examples are websites that are entirely machine translated, entirely in a foreign language, or contain no full sentences (just lists). If a domain included some problematic concordances but also some good-quality content, it was marked as low-quality (lq) instead of bad, as it is in our interest to keep as much good-quality text as possible. Using the lq tag in addition to the bad tag allowed each observed issue to be tagged with one of two levels of severity, based on how prevalent it is in the domain. This can be especially useful for further experiments with machine translation, as we can differentiate between domains that use very obvious machine translation (marked as bad) and domains where machine translation is harder to recognize (marked as lq). In this task, the following characteristics of text were identified as issues: machine translation, automatically generated text, foreign language, non-textual content (lists with no full sentences), and HTML source code or markdown (unusual elements in the running text).
The check tag was used when the detected issues could be easily corrected, e.g. by adding new patterns to the UTF mappings to solve encoding issues or by removing HTML markup. The issue tag was used for cases not covered by the instructions; these were resolved in discussions.
For Slovene, we first checked the non-deduplicated corpus, as the deduplication step had not yet been performed at that stage. However, we realised that evaluating non-deduplicated samples gave unreliable results, because near-deduplication removed 80% of tokens. This means that the sample sentences on which the manual evaluation of each domain was based might not be representative of the sentences that remained in the deduplicated corpus. In addition, some issues turned out to have already been solved by near-deduplication. We therefore conclude that it is more sensible to perform manual checking on the deduplicated corpus. This is also supported by the comparison of results from the two rounds. As shown in Slide 3, 12% of the domains in the non-deduplicated corpus were marked as bad, whereas in the deduplicated corpus the percentage of bad domains dropped to 5%. The distribution of bad categories also reveals that there are fewer issues with non-textual and generated texts in the deduplicated corpus.
The main result of manual checking was that domains marked as bad were discarded. In addition, the manual inspection of the corpora revealed some issues that led to improvements of the crawling pipeline and of the pipeline for cleaning monolingual corpora: UTF mappings were expanded, HTML markup was removed with rules, and Wikipedia markup was removed at the domain level (it turned out that all documents with "&diff=" or "action=edit" in the URL contain Wikipedia markup and should be removed). Manual analysis also led to the identification of domains that contain automatically generated pornographic text, and by analysing the characteristics of these texts, we created a heuristic that identifies such issues with 100% precision. However, pornographic domains turned out to be present only in the Slovene corpus, and not in the Croatian one.
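The two document-level removals described above (dropping domains marked as bad and dropping documents whose URL contains "&diff=" or "action=edit") could be expressed as a single filter, sketched below in Python. The function name, its arguments and the example URLs are hypothetical; only the two URL substrings come from the text, and the actual pipeline may apply these steps differently.

```python
# URL substrings named in the text as indicating Wikipedia markup.
WIKI_URL_MARKERS = ("&diff=", "action=edit")


def keep_document(url, domain, bad_domains):
    """Return True if a document should stay in the corpus.

    `bad_domains` is assumed to be the set of domains that the annotator
    marked as bad during manual checking.
    """
    if domain in bad_domains:
        return False
    # Drop documents whose URL points to a wiki diff or edit view.
    return not any(marker in url for marker in WIKI_URL_MARKERS)
```

A document at a hypothetical URL such as `https://wiki.example.si/index.php?title=X&action=edit` would be dropped by the second rule even if its domain was not marked as bad.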
In addition, while evaluating a domain, the annotator often noticed a common topic, the genre or non-standard language, and recorded this information so that it can be added to the final corpus as metadata. As a result, 206 Slovene domains (430M tokens, 21% of the whole corpus) and 255 Croatian domains (566M tokens, 20% of the whole corpus) are annotated with a topic (described with 32 topic labels), and 301 Slovene domains (750M tokens, 37% of the whole corpus) and 314 Croatian domains (983M tokens, 36% of the whole corpus) are annotated with one of 10 domain genres. This information provides an interesting insight into which topics and types of texts are the most prevalent on the Slovene and Croatian web. In terms of the number of different domains among the 1,500 most frequent domains, the most popular topics are technology, sport and cars in the Slovene corpus, and academic, technology and community in the Croatian corpus. In terms of tokens per domain, the topics with the most tokens are law and technology in the Slovene corpus, and community and technology in the Croatian corpus. In both corpora, in terms of both the number of different domains and the number of tokens, the most frequent type (or genre) of domain is news portal. In terms of the number of tokens, it is followed by legal/regulation and forums in the Slovene corpus, and by forum and opinion in the Croatian corpus; in terms of the number of different domains, the second and third most frequent genres are e-shops and forums in the Slovene corpus, and forum and e-shops in the Croatian corpus. Furthermore, non-standard language was identified on 104 Slovene domains (with 273M tokens) and 75 Croatian domains (with 289M tokens).
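The rankings above differ depending on whether labels are ordered by the number of domains or by the number of tokens. The short Python sketch below shows how the two orderings can diverge. It assumes that the token ranking sums the tokens of all domains carrying a label, which the text does not state explicitly; the function name and the input format are likewise illustrative.

```python
from collections import Counter


def rank_labels(annotations):
    """Rank labels by number of domains and by total tokens.

    `annotations` is an iterable of (domain, label, tokens) triples.
    Returns two lists of labels, each ordered from most to least frequent.
    """
    by_domains = Counter()
    by_tokens = Counter()
    for _domain, label, tokens in annotations:
        by_domains[label] += 1
        by_tokens[label] += tokens
    return (
        [label for label, _ in by_domains.most_common()],
        [label for label, _ in by_tokens.most_common()],
    )
```

With made-up input in which two small domains are labelled technology and one large domain is labelled law, technology ranks first by domain count while law ranks first by tokens, which mirrors the kind of difference reported for the Slovene corpus.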