corpus/arxiv-abstracts-with-agreement/0.1
Pre-release
ArXiv Dataset Abstract Subsample
Only abstracts containing the substring "agreement"
Date (ISO 8601): 2022-04-15
This dataset contains 69,411 plaintext files, each corresponding to an ArXiv document abstract. Each abstract contains at least one appearance of the substring "agreement".
Each text file in this dataset contains the text of an abstract extracted from the full JSON Lines-formatted dataset (described below). Each file is named after its ArXiv ID and has been given the .txt file extension. Where the ArXiv ID contained a forward slash (/), the forward slash was replaced with an underscore (_). The text files have a median length of 1057 characters and a mean length of 1100 characters.
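The naming rule and the length statistics can be illustrated with a short Python sketch. The function names `arxiv_id_to_filename` and `length_stats` are hypothetical, and UTF-8 encoding is an assumption; this is not the script used to build the dataset.

```python
from pathlib import Path
from statistics import mean, median


def arxiv_id_to_filename(arxiv_id: str) -> str:
    """Map an ArXiv ID to a file name: '/' becomes '_', then '.txt' is appended."""
    return arxiv_id.replace("/", "_") + ".txt"


def length_stats(directory: str) -> tuple[float, float]:
    """Return (median, mean) character length over all .txt files in a directory.

    Assumes the files are UTF-8 encoded (an assumption, not stated by the dataset).
    Raises statistics.StatisticsError if the directory contains no .txt files.
    """
    lengths = [
        len(path.read_text(encoding="utf-8"))
        for path in Path(directory).glob("*.txt")
    ]
    return median(lengths), mean(lengths)
```

For example, an old-style ID such as `hep-th/9901001` would map to `hep-th_9901001.txt`, while a new-style ID such as `0704.0001` contains no slash and would simply become `0704.0001.txt`.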
The full ArXiv metadata dataset can be found on Kaggle and includes additional information alongside each abstract, such as document authors, comments, DOI, etc. The original dataset was distributed under the CC0: Public Domain license, thereby permitting this modification and redistribution.
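A subsample like this one could be regenerated from the full metadata file roughly as follows. This is a minimal sketch under stated assumptions: that each line of the JSON Lines file is an object with `id` and `abstract` fields, that the substring match is case-sensitive, and that abstracts are written out unmodified. The function name `extract_matching_abstracts` is hypothetical, and the sketch is not the procedure actually used to produce this release.

```python
import json
from pathlib import Path


def extract_matching_abstracts(jsonl_path: str, out_dir: str,
                               needle: str = "agreement") -> int:
    """Write one .txt file per abstract containing `needle`; return the count.

    Assumes each input line is a JSON object with "id" and "abstract" keys
    (an assumption about the metadata layout) and a case-sensitive match.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            abstract = record.get("abstract", "")
            if needle not in abstract:
                continue
            # Same naming rule as the dataset: '/' in the ID becomes '_'.
            name = record["id"].replace("/", "_") + ".txt"
            (out / name).write_text(abstract, encoding="utf-8")
            written += 1
    return written
```

Reading the file line by line keeps memory use flat, which matters because the full metadata file is far larger than this subsample.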