
corpus/arxiv-abstracts-with-agreement/0.1

Pre-release
@afparsons afparsons released this 07 Mar 15:39
330b4e1

ArXiv Dataset Abstract Subsample

Only abstracts containing the substring "agreement"

Date (ISO 8601): 2022-04-15

This dataset contains 69,411 plaintext files, each corresponding to one ArXiv document abstract. Every abstract contains at least one occurrence of the substring "agreement".
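A downloaded copy can be sanity-checked against the description above. The following is a minimal sketch, not part of the release: the function name `check_corpus` and the directory argument are hypothetical, and it assumes the files are UTF-8 encoded and that the substring match is case-sensitive.

```python
from pathlib import Path


def check_corpus(corpus_dir, substring="agreement"):
    """Return (file_count, names_of_files_missing_the_substring).

    Assumptions (not stated in the release notes): files are UTF-8
    and the substring match is case-sensitive.
    """
    paths = sorted(Path(corpus_dir).glob("*.txt"))
    missing = [
        p.name
        for p in paths
        if substring not in p.read_text(encoding="utf-8")
    ]
    return len(paths), missing
```

For an intact copy of the corpus, the count should be 69,411 and, under the case-sensitivity assumption, the list of missing files should be empty.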

Each text file in this dataset contains the text of one abstract extracted from the full JSON Lines-formatted dataset (described below). Each file is named after its ArXiv ID, with the .txt file extension appended. Where an ArXiv ID contains a forward slash (/), the slash was replaced with an underscore (_). The text files have a median length of 1057 characters and a mean length of 1100 characters.
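The extraction and naming scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used to build this release: the function names are hypothetical, the record keys `id` and `abstract` are assumed to match the Kaggle metadata file, and the substring match is assumed to be case-sensitive.

```python
import json
from pathlib import Path


def filename_for(arxiv_id: str) -> str:
    """Map an ArXiv ID to a file name: '/' becomes '_', then '.txt' is appended."""
    return arxiv_id.replace("/", "_") + ".txt"


def extract_abstracts(jsonl_path, out_dir, substring="agreement"):
    """Write one .txt file per matching abstract; return the number written.

    Assumes each line of the input is a JSON object with 'id' and
    'abstract' keys (an assumption about the source metadata file).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            abstract = record.get("abstract", "")
            if substring in abstract:
                target = out / filename_for(record["id"])
                target.write_text(abstract, encoding="utf-8")
                written += 1
    return written
```

Under this scheme, an old-style ID such as `hep-th/9901001` maps to `hep-th_9901001.txt`, while a new-style ID such as `0704.0001` maps to `0704.0001.txt`.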

The full ArXiv metadata dataset can be found on Kaggle and includes additional information alongside each abstract, such as document authors, comments, DOI, etc. The original dataset was distributed under the CC0: Public Domain license, thereby permitting this modification and redistribution.