This is the repo for the paper "Grounding Conversations with Improvised Dialogues" (ACL2020). SPOLIN is a collection of more than 68,000 "Yes, and" type dialogue pairs extracted from the Spontaneanation podcast by Paul F. Tompkins, the Cornell Movie-Dialogs Corpus, and the SubTle corpus. For more information, refer to our paper or our project page.
The core dataset used for the experiments in the paper includes only the yes-ands and non-yes-ands from Spontaneanation and most of the pairs provided here that were extracted from the Cornell Movie-Dialogs Corpus. After submitting the paper, we continued our iterative data augmentation process, running another iteration on the Cornell Movie-Dialogs Corpus and extracting from the SubTle corpus. This expanded version is also included in this repository. This latest version of SPOLIN was used to train the model used in our demo.
In the `data` folder, we provide two versions of the SPOLIN training set:
- Version used for experiments in the ACL paper: `data/spolin-train-acl.json`
- Expanded version: `data/spolin-train.json`
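A minimal sketch for taking a first look at either file in Python. The internal layout of the JSON is not described in this README, so the helper below (`summarize_json`, a hypothetical name that is not part of this repo) assumes only that the file is valid JSON and reports its top-level structure instead of guessing at keys.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level structure.

    Hypothetical helper, not part of this repo. It assumes nothing about
    the SPOLIN schema beyond the file being valid JSON.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        # Report top-level keys so the layout can be explored further.
        summary["keys"] = list(data.keys())
    elif isinstance(data, list):
        summary["length"] = len(data)
    return summary


if __name__ == "__main__":
    for name in ("data/spolin-train-acl.json", "data/spolin-train.json"):
        if Path(name).exists():
            print(name, summarize_json(name))
        else:
            print(name, "not found; run this from the repository root")
```

Running it from the repository root prints one summary per file, which is usually enough to decide how to iterate over the dialogue pairs.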
SPOLIN is available via:
We make available the yes-and classifier from our last iteration, which filters out self-*yes-and*s, and our fine-tuned DialoGPT models:
- Yes-and classifier
- Fine-tuned GPT-2 model weights
- Reverse GPT-2 model weights (from DialoGPT repo): make sure to rename `small_reverse.pkl` to `medium_reverse.pkl` for use with the script files in this repo.
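The rename step for the reverse model can be sketched in Python as follows. The function name `rename_reverse_weights` and the `model_dir` argument are assumptions for illustration; point `model_dir` at wherever the weights were downloaded, since the expected location is set by the scripts in each folder and is not stated here.

```python
from pathlib import Path


def rename_reverse_weights(model_dir="."):
    """Rename small_reverse.pkl to medium_reverse.pkl inside model_dir.

    Hypothetical helper; model_dir is an assumption and should be the
    directory that holds the downloaded reverse model weights.
    """
    src = Path(model_dir) / "small_reverse.pkl"
    dst = Path(model_dir) / "medium_reverse.pkl"
    if dst.exists():
        # Already renamed (or already present): leave it untouched.
        return dst
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; download the reverse model first")
    src.rename(dst)
    return dst
```

Calling `rename_reverse_weights("path/to/weights")` a second time is a no-op, so it is safe to keep in a setup script.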
For instructions and details on training or running inference with these models, refer to the README in each respective folder. Please raise an issue if there are any problems with the links or the scripts for using these models.
- Project page: https://justin-cho.com/spolin
- Demo: https://spolin.isi.edu
- Paper: https://arxiv.org/abs/2004.09544
| | yes-ands | non-yes-ands |
|---|---|---|
| Spontaneanation | 10,959 | 6,087* |
| Cornell | 16,926 | 18,810 |
| SubTle | 40,303 | 19,512 |
| Total | 68,188 | 44,409 |
*Artificially constructed by mixing and matching positive Spontaneanation samples to balance the dataset for training the classifier.
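As a quick sanity check on the counts above, the short Python sketch below re-adds the per-source numbers and computes the share of yes-ands in each source. The figures are copied from the table; the variable names are illustrative only.

```python
# Counts copied from the table above: (yes-ands, non-yes-ands) per source.
counts = {
    "Spontaneanation": (10_959, 6_087),
    "Cornell": (16_926, 18_810),
    "SubTle": (40_303, 19_512),
}

total_yesands = sum(pos for pos, _ in counts.values())
total_non_yesands = sum(neg for _, neg in counts.values())

# Both sums match the "Total" row of the table.
assert total_yesands == 68_188
assert total_non_yesands == 44_409

for source, (pos, neg) in counts.items():
    share = pos / (pos + neg)
    print(f"{source}: {share:.1%} yes-ands")
```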
| data/spolin-train.json | data/spolin-valid.json |
|---|---|
If you use data or code in this repository, please cite our ACL2020 paper:
@inproceedings{cho2020spolin,
title={Grounding Conversations with Improvised Dialogues},
author={Cho, Hyundong and May, Jonathan},
booktitle ={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
publisher = {Association for Computational Linguistics},
location = {Seattle, Washington, USA},
year={2020}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.