Paper | Slides | Dataset | Prompts
We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to.
Wordplay type | # | AAL | Links | |
---|---|---|---|---|
Puns | Polysemy | 190 | 1.51 | |
Puns | Homonymy | 26 | 1.57 | |
Puns | Phonetic similarity | 98 | 1.80 | |
Transformations | Collocation | 423 | 2.64 | 126 |
Transformations | Idiom | 177 | 3.43 | 118 |
Transformations | Reference | 353 | 3.73 | 214 |
Nonce word | 185 | |||
Oxymoron | 48 |
Table 1. Wordplay types, average anchor length in words (AAL), and wiki links in KOWIT-24
Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts – each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities – the mechanism that has been underrepresented in previous humor datasets. Moreover the dataset contains manually created annotations that provide information about what the wordplay refers to. Incorporating this annotation into the dataset enables automated evaluation of the large language model’s wordplay interpretations.
Dataset entry example:
{'article_url': 'https://www.kommersant.ru/doc/5051268',
'date': '2021-10-27',
'headline': 'Диалектический пиломатериализм',
'is_wordplay': True,
'lead': 'Цены на фанеру и доски начали снижаться вслед за спросом',
'summary': 'Пиломатериалы и лесопромышленная продукция начинают дешеветь по '
'мере завершения строительного сезона. По мнению аналитиков и '
'некоторых участников рынка, этому способствует сокращение спроса '
'на фоне летнего всплеска цен. И хотя на некоторые продукты, '
'например OSB, цена упала уже на треть, она все еще вдвое выше '
'уровня конца прошлого года. До конца года можно ожидать '
'стабилизации цен, полагают участники рынка, но едва ли '
'возвращения к средним многолетним значениям.'},
'annotations': [{'end_index': 30,
'headline_substring': 'Диалектический пиломатериализм',
'reference_string': 'Диалектический материализм',
'reference_url': 'https://ru.wikipedia.org/wiki/Диалектический_материализм',
'start_index': 0,
'wordplay_type': 'Reference'},
{'end_index': 30,
'headline_substring': 'пиломатериализм',
'reference_string': ['материализм', 'пиломатериалы'],
'reference_url': ['', ''],
'start_index': 15,
'wordplay_type': 'Nonce word'}]
from datasets import load_dataset
data_files = {"test": "dataset.csv", "dev": "dev_dataset.csv"}
dataset = load_dataset("Humor-Research/KoWit-24", data_files=data_files)
TODO
For the experiments, we allocated 200 records (100 from each class) for the development set, making sure that all wordplay types were represented. Thus, the test set contains 2,500 headlines (1,290 with and 1,310 without wordplay). We experimented with two tasks – wordplay detection and wordplay interpretation. We employed five LLMs: GPT-4o, Mistral NeMo 12B, YandexGPT4, GigaChat Lite, and GigaChat Max.
Model | Detection with simple prompt, P/R | Detection with extended prompt, P/R | Interpretation manual, R | Interpretation auto, R |
---|---|---|---|---|
GigaChat Lite | 0.50 / 0.50 | 0.53 / 0.72 | 0.11 | 0.19 |
GigaChat Max | 0.62 / 0.48 | 0.68 / 0.59 | 0.28 | 0.28 |
YandexGPT4 | 0.83 / 0.10 | 0.76 / 0.24 | 0.20 | 0.22 |
Mistral Nemo | 0.00 / 0.00 | 0.00 / 0.00 | 0.24 | 0.30 |
GPT-4o | 0.62 / 0.81 | 0.65 / 0.88 | 0.48 | 0.43 |
To facilitate the evaluation of alternative large language models (LLMs) for detection and interpretation tasks, the prompts utilized in the experiments have been made available on the LangChain Hub, while the corresponding data have been deposited on the HuggingFace Hub.
Example:
# Imports
from huggingface_hub import hf_hub_download
from datasets import load_dataset
from langchain.llms import LlamaCpp
from langchain.chains import LLMChain
from langchain import hub
# Load model
model_path = hf_hub_download(repo_id="Vikhrmodels/Vikhr-Llama-3.2-1B-instruct-GGUF",
filename="Vikhr-Llama-3.2-1B-Q4_K_M.gguf",
local_dir=".")
llm = LlamaCpp(
model_path=model_path,
n_ctx=2048,
temperature=0.1,
top_p=0.9,
max_tokens=256
)
# Load prompt
prompt = hub.pull("humor-research/wordplay_detection")
# Load dataset
data_files = {"test": "dataset.csv", "dev": "dev_dataset.csv"}
dataset = load_dataset("Humor-Research/KoWit-24", data_files=data_files)
# Invoke LLM
predicted = list()
for example in dataset["test"]:
task = prompt.format(
headline=example["headline"],
lead=example["lead"]
)
predicted.append(
llm.invoke(task)
)
break
TODO