
Dataset Description


Data Release Notes:

  • 2018-10-15: Full dataset (5000 training articles, 100 development fold articles, dataset, citation, and publication metadata) released.
  • 2018-10-18: sage_research_methods.json, sage_research_methods.skos, sage_research_fields.json, & sage_research_fields.csv files added to the dataset.
  • 2018-10-24: Removed unnecessary .DS_Store and ._* files added by macOS during the tar process.

Data Access:

Metadata Files

  • publications.json contains metadata for the publications in the training data. It is a JSON list of JSON objects, each of which contains:

    • pub_date - Publication date of the article, in "YYYY-MM-DD" format.
    • unique_identifier - Unique identifier provided by the publication's source. May be a URL or another form of URI.
    • text_file_name - file name of the publication's text file, "<publication_id>.txt". Does not include path information. All text files are stored in /input/files/text.
    • pdf_file_name - file name of the publication's PDF file, "<publication_id>.pdf". Does not include path information. All PDF files are stored in /input/files/pdf.
    • publication_id - Integer ID of the publication, unique within the set of publications stored in a given "publications.json" file. This is the ID that should be referred to in output JSON files when you tie your results to a given publication.
  • publications.json Notes:

    • Some pub_date entries will be missing or incorrect. For example, every publication whose unique_identifier is prefixed with "bbk-" has a pub_date of "1969-01-01", as the actual publication dates were not captured in the metadata; it is safe to treat these values as NA or null. Other dates may also be inaccurate (for example, 1905-07-03 for an article published in 2011), most likely reflecting how the metadata is stored in the source system (e.g., Crossref). Since these are real-world metadata inconsistencies, we have left them as is.

Example:

[
    {
        "text_file_name": "166.txt",
        "unique_identifier": "10.1093/geront/36.4.464",
        "pub_date": "1996-01-01",
        "title": "Nativity, declining health, and preferences in living arrangements among elderly Mexican Americans: Implications for long-term care",
        "publication_id": 166,
        "pdf_file_name": "166.pdf"
    },
    {
        "text_file_name": "167.txt",
        "unique_identifier": "10.1093/geronb/56.5.S275",
        "pub_date": "2001-01-01",
        "title": "Duration or disadvantage? Exploring nativity, ethnicity, and health in midlife",
        "publication_id": 167,
        "pdf_file_name": "167.pdf"
    },
    ...
]
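
For orientation, here is a minimal sketch of loading this file and applying the pub_date caveat above. It assumes publications.json sits in the current working directory; adjust the path to your local layout.

import json

# Load the publication metadata described above.
with open("publications.json", encoding="utf-8") as f:
    publications = json.load(f)

# Per the notes above, "bbk-"-prefixed records carry a placeholder
# pub_date of "1969-01-01"; treat those dates as missing.
for pub in publications:
    if pub["unique_identifier"].startswith("bbk-"):
        pub["pub_date"] = None

print(len(publications), "publications loaded")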
  • data_sets.json contains metadata for all datasets referenced in the training data, as well as some that do not appear in the training data but can be found in the entire ICPSR bibliography. It is a JSON list of JSON objects, each of which contains:

    • subjects - List of terms associated with the dataset, based on the ICPSR subject thesaurus. Serialized as a single comma-separated string (see the example below).
    • additional_keywords - System keyword indicating where the dataset originated (e.g., "ICPSR").
    • citation - Preferred dataset citation
    • data_set_id - Integer ID for the dataset; also used in the data_set_citations.json file to identify relationships between datasets and publications.
    • title - Canonical title for dataset
    • name - Canonical title for dataset
    • description - Dataset description if available
    • unique_identifier - Original unique identifier for dataset, normally a DOI if available.
    • methodology - Methodology for dataset, if available.
    • date - Date when dataset was published, if available.
    • coverages - Geographic coverages, if available
    • family_identifier - Internal system ID that roughly groups multi-year releases of the same dataset. Inconsistently applied; should not be used in analysis.
    • mention_list - Array of strings for annotated mentions as identified by human reviewers. Not an exhaustive list of mentions for any given dataset.

Example:

[
    {
        "data_set_id": 1,
        "unique_identifier": "10.3886/ICPSR07213",
        "title": "ANES 1952 Time Series Study",
        "name": "ANES 1952 Time Series Study",
        "description": "This study is part of a time-series collection of national surveys fielded continuously since 1948. The election studies are designed to present data on Americans' social backgrounds, enduring political predispositions, social and political values, perceptions and evaluations of groups and candidates, opinions on questions of public policy, and participation in political life. The 1952 National Election Study gauges political attitudes in general, along with attitudes and behaviors directly relevant to the 1952 presidential election. The interview schedule contained both closed and open-ended questions designed to collect data on a wide range of issues. Most respondents were interviewed both before and after the date of the election. The pre-election survey tapped attitudes toward political parties, candidates, and other specific issues, and inquired about the respondents' personal and political background. The post-election interview focused on the actual vote and voting-related behaviors. Additionally, a sub-sample of 585 respondents was administered a Form B re-interview obtaining further information about organizational affiliations, personal data, and non-political opinions and attitudes. A special emphasis was placed on the perception of group behavior, especially the perceived political preferences of family, friends, and associates.",
        "date": "2016-09-20 00:00:00+00:00",
        "coverages": "",
        "subjects": "candidates,congressional elections,domestic policy,economic conditions,foreign policy,government performance,information sources,national elections,political affiliation,political attitudes,political campaigns,political efficacy,political issues,political participation,presidential elections,public approval,public opinion,special interest groups,Truman Administration (1945-1953),trust in government,voter expectations,voting behavior,United States,1952-09--1952-12",
        "methodology": "",
        "citation": "",
        "additional_keywords": "ICPSR",
        "family_identifier": "",
        "mention_list": [
            "ANES study",
            "ICPSR",
            "SRC data",
            "Surveys conducted by the Survey Research Center and the Center for Political Studies of the University",
            "eight SRC-CPS presidential election surveys",
            "eight SRC-CPS presidential election surveys con- ducted between 1952 and 1980",
            "eight presidential election surveys conducted by the Survey Research Center and the Center for Political Studies (SRC-CPS)",
            "time series"
        ],
        "identifier_list": [
            {
                "name": "ICPSR data ID (dataId)",
                "identifier": "10.3886/ICPSR07213"
            }
        ]
    },
    ...
]
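
A sketch of loading and indexing these records (same path assumption as above); note that subjects arrives as one comma-separated string rather than a JSON array.

import json

with open("data_sets.json", encoding="utf-8") as f:
    data_sets = json.load(f)

# Index datasets by their integer ID for joining against
# data_set_citations.json (see below).
by_id = {ds["data_set_id"]: ds for ds in data_sets}

# "subjects" is serialized as one comma-separated string in the
# example above; split it back into individual thesaurus terms.
subjects = [s.strip() for s in by_id[1]["subjects"].split(",") if s.strip()]
print(subjects[:5])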
  • data_set_citations.json contains metadata for the curated relationships between datasets and publications in the training data. It is a JSON list of JSON objects, each of which contains:

    • citation_id - A unique ID for the relationship between one dataset and one publication
    • publication_id - Unique ID for a publication; matches the publication_id in publications.json.
    • data_set_id - Unique ID for a dataset; matches the data_set_id in data_sets.json.
    • mention_list - Array of strings for alternative references to the dataset in the specific publication.
    • score - Confidence score for the dataset being found in the related publication. Should be 1.0 in each record for this file.
  • data_set_citations.json Notes:

    • A publication with an empty mention_list may still have been tagged with a dataset by the curator (either at Bundesbank or ICPSR), who may have had further insight into that publication and dataset. When a human coder (without the original curator's expert insight) later went through the publication to find mentions of the specific dataset, they may not have found explicit mentions. An empty mention_list is therefore not necessarily evidence that the publication contains no mentions of the dataset, or that the dataset is not actually referenced in the publication. It is also possible that other datasets (not the labeled one) appear in a publication, so mentions of a dataset other than the labeled one do not necessarily exclude the labeled dataset.

Example:

[
    {
        "citation_id": 1,
        "publication_id": 1,
        "data_set_id": 1,
        "mention_list": [
            "SRC data",
            "Surveys conducted by the Survey Research Center and the Center for Political Studies of the University",
            "eight SRC-CPS presidential election surveys",
            "eight SRC-CPS presidential election surveys con- ducted between 1952 and 1980",
            "eight presidential election surveys conducted by the Survey Research Center and the Center for Political Studies (SRC-CPS)"
        ],
        "score": 1.0
    },
    {
        "citation_id": 2,
        "publication_id": 1,
        "data_set_id": 2,
        "mention_list": [
            "American national election studies",
            "SRC-CPS congressional election surveys conducted be- tween 1958 and 1978",
            "SRC-CPS presidential election studies",
            "SRC-CPS presidential election surveys",
            "SRC-CPS presidential election surveys con- ducted between 1952 and 1980",
            "SRC-CPS surveys",
            "Survey Research Center and the Center for Political Studies (SRC-CPS)",
            "Surveys conducted by the Survey Research Center and the Center for Political Studies of the University",
            "surveys conducted by the Survey Research Center and the Center for Political Studies"
        ],
        "score": 1.0
    },
    ...
]
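
Since the three files share publication_id and data_set_id keys, joining them is straightforward. A sketch, assuming all three JSON files sit in the working directory:

import json

def load(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

publications = {p["publication_id"]: p for p in load("publications.json")}
data_sets = {d["data_set_id"]: d for d in load("data_sets.json")}

# Resolve each curated citation to its publication and dataset records.
for cit in load("data_set_citations.json"):
    pub = publications[cit["publication_id"]]
    ds = data_sets[cit["data_set_id"]]
    print(pub["title"], "->", ds["title"], f"({len(cit['mention_list'])} mentions)")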

Publications

We have provided 2500 open or free access publications which have been labeled as mentioning a dataset. We have also provided 2500 open access publications which ostensibly do not mention datasets (found by searching for open access publications which do not contain the terms 'data', 'empirical', 'study', or 'survey').

We are also supplying a 100-publication development fold, which contains 50 labeled publications and 50 unlabeled publications, for testing your trained model before submission.

We supply both the original PDF files and the plain text versions (see File conversion below for how we generated the plain text versions).

Publication and metadata sources

Provided by ICPSR, Digital Science, Deutsche Bundesbank, and SAGE Journals.

The dataset, publication, and citation metadata from ICPSR was gathered circa August 2018 from their OAI-PMH endpoints.

File conversion

The articles were converted from PDF to text with pdftotext, the open-source text extraction tool from the Xpdf project, using the following command:

pdftotext -raw <path_to_pdf.pdf> <path_to_txt.txt>

Note: Where a PDF file could not be converted because it was "locked", we used qpdf to "unlock" it in the following manner (this will not work if the PDF is actually password-protected):

qpdf --decrypt --password='' <path_to_pdf.pdf> ../decrypted/<path_to_pdf.pdf>
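
Scripted together, the two commands above amount to something like the following sketch. This is not the organizers' exact pipeline; it assumes pdftotext and qpdf are on the PATH and writes the decrypted copy next to the original.

import subprocess
from pathlib import Path

def convert(pdf_path: Path, txt_path: Path) -> None:
    """Convert one PDF to text, decrypting "locked" files first if needed."""
    try:
        subprocess.run(["pdftotext", "-raw", str(pdf_path), str(txt_path)],
                       check=True, capture_output=True)
    except subprocess.CalledProcessError:
        # Mirror the qpdf fallback above: strip the empty-password
        # encryption, then retry. This fails if the PDF is truly
        # password-protected.
        decrypted = pdf_path.with_name("decrypted_" + pdf_path.name)
        subprocess.run(["qpdf", "--decrypt", "--password=", str(pdf_path),
                        str(decrypted)], check=True)
        subprocess.run(["pdftotext", "-raw", str(decrypted), str(txt_path)],
                       check=True)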

The rationale behind this simplified process for converting PDFs to text (there are many other approaches and tools available for this task) was:

  1. To render the most usable txt files from the available PDFs without over-engineering for any specific type of PDF (e.g., single-column vs. multi-column), and
  2. To have a process that is easily and freely reproducible across different machines. Not all PDFs convert the same way, and some are more error-prone than others. More advanced OCR techniques might have compensated where Xpdf fell short, but relying on more sophisticated, and perhaps costly, text conversion processes would have made the conversion pipeline more expensive to reproduce and less portable across different applications.

Participants are encouraged to use the converted texts or to try their own conversion process, as the original PDFs are supplied with the competition dataset. We only ask that participants supply us with documentation for installing and running their conversion process if they choose another means of converting PDF files to plain text.

Annotation process

All texts provided in the training set have been read by a human coder, and all strings that mention a dataset in a publication were extracted and saved as metadata.

The goal of the annotation process was to capture mentions of a given dataset in a publication text, providing all possible synonyms of a dataset name (e.g. “National Health and Nutrition Examination Survey”, “NHANES”, “NHANES I”, etc.) as part of the ground truth for training your model. This data is available here.

A list of general synonyms, such as “survey”, “data”, and “study”, as well as some examples of word collocations, is available here.
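
Note that some annotated mention strings preserve PDF line-break hyphenation (e.g., “con- ducted” in the examples above). If you match mention strings against the converted texts, a light normalization of both sides helps; a sketch, with a hypothetical normalize helper:

import re

def normalize(s: str) -> str:
    # Rejoin words split by end-of-line hyphenation ("con- ducted" ->
    # "conducted"), then collapse runs of whitespace and lowercase.
    s = re.sub(r"(\w)- (\w)", r"\1\2", s)
    return re.sub(r"\s+", " ", s).strip().lower()

assert normalize("surveys con- ducted between 1952 and 1980") == \
       "surveys conducted between 1952 and 1980"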

Additional resources

To assist in the task of identifying research fields and research methodologies, we are supplying ontologies, with sets of terms for each, provided by SAGE Publications.

  • sage_research_methods.json - Example of Social Science Methods Vocabulary

A set of social science research methods; an example is provided by SAGE Publications, but others can be identified.

  • sage_research_fields.json - Example of Social Science Research Fields Vocabulary

A set of social science fields as identified by the team; an example set from SAGE Publications is provided.
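
The exact schema of these vocabulary files is not documented here, so a quick inspection before writing a parser is worthwhile; a minimal sketch using one of the file names listed above:

import json

with open("sage_research_methods.json", encoding="utf-8") as f:
    methods = json.load(f)

# Print the top-level structure so you can see how terms are organized
# before committing to a real parser.
if isinstance(methods, dict):
    print(list(methods.keys())[:10])
else:
    print(type(methods), len(methods))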

Competition Data Terms of Use

We are making these materials available for noncommercial, scholarly use only. You may not redistribute or re-license these materials or use them for purposes outside of the scope of this competition.