Skip to content

Repo for EMNLP 2020 paper, "Improving Neural Topic Models using Knowledge Distillation"

Notifications You must be signed in to change notification settings

ahoho/kd-topic-models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Improving Neural Topic Models using Knowledge Distillation

Repo for our EMNLP 2020 paper. We will clean up the implementation for improved ease-of-use, but provide the code included in our original submission for the time being.

If you use this code, please use the following citation:

@inproceedings{hoyle-etal-2020-improving,
    title = "Improving Neural Topic Models Using Knowledge Distillation",
    author = "Hoyle, Alexander Miserlis  and
      Goel, Pranav  and
      Resnik, Philip",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.137",
    pages = "1752--1771",
}

Rough Steps

  1. As of now, you'll need two conda environments to run both the BERT teacher and topic modeling student (which is a modification of Scholar). The environment files are defined in teacher/teacher.yml and scholar/scholar.yml for the teacher and topic model, respectively. For example: conda env create -f teacher/teacher.yml (edit the first line in the yml file if you want to change the name of the resulting environment; the default is transformers28).

  2. We use the data processing pipeline from Scholar. We'll use the IMDb data to serve as a guide (preprocessing scripts for the Wikitext and 20ng data are also included for replication purposes, but the processing scripts aren't general-purpose):

conda activate scholar
python data/imdb/download_imdb.py

# main preprocessing script
python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist
# create a dev split from the train data--change filenames if using different data
create_dev_split.py
  1. Run the teacher model, below is an example using IMDb.
conda activate transformers28

python teacher/bert_reconstruction.py \
    --input-dir ./data/imdb/processed-dev \
    --output-dir ./data/imdb/processed-dev/logits \ 
    --do-train \
    --evaluate-during-training \
    --truncate-dev-set-for-eval 120 \
    --logging-steps 200 \
    --save-steps 1000 \
    --num-train-epochs 6 \
    --seed 42 \
    --num-workers 4 \
    --batch-size 20 \
    --gradient-accumulation-steps 8 \
    --document-split-pooling mean-over-logits
  1. Collect the logits from the teacher model (the --checkpoint-folder-pattern argument accepts grub pattern matching in case you want to create logits for different stages of training; be sure to enclose in double quotes ")
conda activate transformers28

python teacher/bert_reconstruction.py \
    --output-dir ./data/imdb/processed-dev/logits \
    --seed 42 \
    --num-workers 6 \
    --get-reps \
    --checkpoint-folder-pattern "checkpoint-9000" \
    --save-doc-logits \
    --no-dev
  1. Run the topic model (there are a number of extraneous experimental arguments in run_scholar.py, which we intend to strip out in a future version).
conda activate scholar

python scholar/run_scholar.py \
    ./data/imdb/processed-dev \
    --dev-metric npmi \
    -k 50 \
    --epochs 500 \
    --patience 500 \
    --batch-size 200 \
    --background-embeddings \
    --device 0 \
    --dev-prefix dev \
    -lr 0.002 \
    --alpha 0.5 \
    --eta-bn-anneal-step-const 0.25 \
    --doc-reps-dir ./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits \
    --use-doc-layer \
    --no-bow-reconstruction-loss \
    --doc-reconstruction-weight 0.5 \
    --doc-reconstruction-temp 1.0 \
    --doc-reconstruction-logit-clipping 10.0 \
    -o ./outputs/imdb