Releasing the 11,490 summaries generated by the Summary Loop model (summary_loop_length46.bin) on the CNN/DM test set.
Each summary is released together with its CNN/DM id.
The following code snippet can be used to evaluate ROUGE scores:
from datasets import load_dataset, load_metric
import json

# Load the released summaries: a list of dicts with "id" and "summary_loop_gen" keys.
with open("/home/phillab/data/cnndm_test_summary_loop.json", "r") as f:
    summary_loop_gens = json.load(f)

# Note: load_metric is deprecated in recent versions of datasets
# in favor of the separate evaluate library.
rouge = load_metric("rouge")
dataset_test = load_dataset("cnn_dailymail", "3.0.0")["test"]

# Map each CNN/DM id to its generated summary, then align with the references.
id2summary_loop = {d["id"]: d["summary_loop_gen"] for d in summary_loop_gens}
candidates, references = [], []
for d in dataset_test:
    references.append(d["highlights"])
    candidates.append(id2summary_loop[d["id"]])

print(len(references), len(candidates))
print(rouge.compute(predictions=candidates, references=references))
Notes:
(1) This relies on HuggingFace's datasets repository (https://github.com/huggingface/datasets) to load the CNN/DM dataset and the ROUGE metric.
(2) The ROUGE implementation used in the above example is not the original, Perl-based implementation of ROUGE used for the official numbers in the paper. It is included for demonstration purposes only, to show how to use the file.
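
The evaluation snippet above looks up every test id directly, so it raises a KeyError if an id is absent from the released file. The following is a minimal sketch of a more defensive alignment step, assuming the same "id", "summary_loop_gen", and "highlights" keys used above. The helper name align_summaries and the toy records are hypothetical and only illustrate the idea; they are not part of the release.

```python
def align_summaries(summary_loop_gens, test_examples):
    """Pair each reference with its generated summary by CNN/DM id.

    Hypothetical helper: returns (candidates, references, missing_ids),
    where missing_ids lists test ids with no generated summary.
    """
    id2summary = {d["id"]: d["summary_loop_gen"] for d in summary_loop_gens}
    candidates, references, missing_ids = [], [], []
    for example in test_examples:
        summary = id2summary.get(example["id"])
        if summary is None:
            # Record the gap instead of failing with a KeyError.
            missing_ids.append(example["id"])
            continue
        candidates.append(summary)
        references.append(example["highlights"])
    return candidates, references, missing_ids


# Toy records standing in for the released JSON and the CNN/DM test split.
gens = [
    {"id": "a1", "summary_loop_gen": "first summary"},
    {"id": "b2", "summary_loop_gen": "second summary"},
]
examples = [
    {"id": "b2", "highlights": "ref two"},
    {"id": "a1", "highlights": "ref one"},
    {"id": "c3", "highlights": "ref three"},
]

candidates, references, missing_ids = align_summaries(gens, examples)
print(len(candidates), len(references), missing_ids)
```

With the real data, the returned candidates and references can be passed to rouge.compute exactly as in the snippet above, and an empty missing_ids list confirms that every test example was matched.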