1️⃣ This project provides open-source comment data for in-depth research on Danmaku (bullet-screen comments). The corpus focuses on subculture bullet-screen comments, including but not limited to Guichu, animation, and e-sports videos.
2️⃣ The corpus is maintained by TinyTalks, a community for NLP research on Chinese short text.
- Video Corpus
- Video-Channel Network
- Danmaku Comment Corpus
- Danmaku Language Models
- Danmaku Marked Data
- Citation
The video corpus covers videos with more than 1 million plays from 2017 to 2020. Its attributes are as follows.
Attribute | Explanation |
---|---|
Bvno | Video identifier |
Tname | Main tag |
Tag_name | List of tags |
Owner_mid | Uploader ID |
Title | Title |
Pubdate | Publication time |
Duration | Duration |
View | Number of views |
Danmaku | Number of Danmaku comments |
Reply | Number of replies (the original gloss reads 转发数量, "reposts") |
Favorite | Number of favorites |
Coin | Number of coins |
Share | Number of shares |
Like | Number of likes |
Source: https://pan.bnuz.edu.cn/l/J5z6nP password:bnuz
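As a rough illustration of working with the video attributes listed above, the sketch below filters records by their `View` count. It assumes the metadata is available as CSV with a header row that uses the attribute names; the actual file layout of the download is not documented here, and the sample rows and the `popular_videos` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the real files may use a different layout.
SAMPLE = """Bvno,Tname,Title,View,Danmaku
BV_EXAMPLE_1,guichu,example title A,1500000,3200
BV_EXAMPLE_2,esports,example title B,800000,150
"""


def popular_videos(csv_text, min_views=1_000_000):
    """Return the rows whose View count is at least min_views."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["View"]) >= min_views]


hits = popular_videos(SAMPLE)
print([row["Bvno"] for row in hits])
```

For a real file, the same function can be given the text read with `open(path, encoding="utf-8").read()`.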
https://pan.bnuz.edu.cn/l/g1ydM2 password:bnuz
Attribute | Explanation |
---|---|
videoType | Video type |
relation | "Belongs to" relation |
channel | Channel tag |
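The three attributes above describe (videoType, relation, channel) triples. A minimal sketch of turning such triples into a channel-to-types mapping follows; the triple values and the `group_by_channel` helper are hypothetical, and the on-disk format of the network file is an assumption.

```python
from collections import defaultdict

# Hypothetical triples in (videoType, relation, channel) form.
TRIPLES = [
    ("type_a", "belongs_to", "channel_x"),
    ("type_b", "belongs_to", "channel_x"),
    ("type_c", "belongs_to", "channel_y"),
]


def group_by_channel(triples):
    """Map each channel to the list of video types that belong to it."""
    channels = defaultdict(list)
    for video_type, _relation, channel in triples:
        channels[channel].append(video_type)
    return dict(channels)


network = group_by_channel(TRIPLES)
print(network)
```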
The attributes of each Danmaku comment are as follows.
Attribute | Type | Explanation | Default |
---|---|---|---|
text | (str) | Danmaku text | |
dm_time | (float) | Position of the comment in the video, in seconds | 0.0 |
send_time | (float) | Time the comment was sent | time.time() |
crc32_id | (str) | CRC32 digest of the sender's UID | None |
color | (str) | Hexadecimal color of the comment | "ffffff" |
weight | (int) | Display weight of the comment in the Danmaku list | -1 |
id_ | (int) | Danmaku ID | -1 |
id_str | (str) | Danmaku ID as a string | "" |
action | (str) | Currently unknown | "" |
mode | (Mode) | Danmaku display mode | Mode.FLY |
font_size | (FontSize) | Danmaku font size | FontSize.NORMAL |
is_sub | (bool) | Whether the comment is a subtitle Danmaku | False |
pool | (int) | Currently unknown | -1 |
attr | (int) | Currently unknown | -1 |
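To make the attribute table concrete, here is a minimal sketch of a record type holding the plainly typed fields with the defaults listed above. `DanmakuRecord` and `crc32_id_of` are hypothetical names, not part of the corpus tooling; `mode` and `font_size` are left out because their `Mode` and `FontSize` types are not defined in this document. The helper assumes that `crc32_id` is the lowercase hexadecimal CRC32 of the UID written as a decimal string, which may differ from how the corpus values were actually produced.

```python
import time
import zlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DanmakuRecord:
    """Subset of the documented attributes, with the documented defaults."""
    text: str
    dm_time: float = 0.0
    send_time: float = field(default_factory=time.time)
    crc32_id: Optional[str] = None
    color: str = "ffffff"
    weight: int = -1
    id_: int = -1
    id_str: str = ""
    action: str = ""
    is_sub: bool = False
    pool: int = -1
    attr: int = -1


def crc32_id_of(uid: int) -> str:
    """Assumed scheme: hex CRC32 of the UID's decimal string."""
    return format(zlib.crc32(str(uid).encode("ascii")), "x")


record = DanmakuRecord(text="hello", dm_time=12.5, crc32_id=crc32_id_of(123456789))
print(record.crc32_id)  # cbf43926 (the standard CRC32 check value)
```

Because CRC32 is not a secure hash, `crc32_id` pseudonymizes senders only weakly; a UID can be recovered by brute force over the ID range.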
Type | Original Source | Text Source | Password |
---|---|---|---|
Guichu (鬼畜) | | | bnuz |
E-sports (电竞) | | | bnuz |
Animation (动漫) | | | bnuz |
Epidemic (疫情) | | | bnuz |
Epidemic: https://pan.bnuz.edu.cn/l/aoMMOM (password: bnuz)
https://pan.bnuz.edu.cn/l/onFbAO (password: bnuz)
https://pan.bnuz.edu.cn/l/QJGkNF (password: bnuz)
Models | Description | Mask Accuracy | Link |
---|---|---|---|
chinese_danmaku_roberta | Fine-tuned version of uer/chinese_roberta_L-8_H-512 on a Danmaku corpus dataset (2000k raw records). | 0.7780 | |
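The data is UTF-8 encoded and stored as comma-separated CSV.
Each record consists of numeric information, text information, and an annotation category (unlabeled data carries no annotation).
The numeric information, separated by commas, contains: the time point at which the comment appears in the video, display mode, font size, font color, send time, Danmaku pool number, sender user ID, and the comment's ID in the Danmaku database.
Annotation categories: 0 happy, 1 sad, 2 angry, 3 surprised, 4 negative sample.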
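The record layout described above can be read with the standard `csv` module. The sketch below assumes each row holds the eight numeric fields in the order listed, followed by the text and then an optional label; the exact column positions, the field names, the sample rows, and the `parse_rows` helper are all assumptions made for illustration.

```python
import csv
import io

# Label mapping as documented for the marked data.
LABELS = {0: "happy", 1: "sad", 2: "angry", 3: "surprised", 4: "negative sample"}

# Assumed names for the eight numeric fields, in the documented order.
NUMERIC_FIELDS = [
    "dm_time", "mode", "font_size", "color",
    "send_time", "pool", "sender_id", "row_id",
]

# Hypothetical rows: one labeled, one unlabeled.
SAMPLE = (
    "12.5,1,25,16777215,1600000000,0,abcd1234,1001,example text,0\n"
    "30.0,1,25,16777215,1600000100,0,abcd5678,1002,unlabeled text\n"
)


def parse_rows(csv_text):
    """Parse rows into dicts; label is None when no annotation is present."""
    rows = []
    for cols in csv.reader(io.StringIO(csv_text)):
        if not cols:
            continue  # skip blank lines
        record = dict(zip(NUMERIC_FIELDS, cols[:8]))
        record["text"] = cols[8]
        has_label = len(cols) > 9 and cols[9] != ""
        record["label"] = LABELS[int(cols[9])] if has_label else None
        rows.append(record)
    return rows


rows = parse_rows(SAMPLE)
print([(r["text"], r["label"]) for r in rows])
```

Using `csv.reader` rather than `str.split(",")` keeps comment text that contains quoted commas in a single column.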
https://pan.bnuz.edu.cn/l/snpijm password: bnuz
citation: https://github.com/MelkiorOno/DanmakuMarked-data
If you use this corpus in your research, please cite this repository.
@article{
QBTS202209010,
author = {陈鑫,张以欣,吴俊潮,郭凌宇,余泽汇 & 杨静},
title = {历时弹幕语料库的构建与探析——以青年亚文化弹幕为例},
journal = {情报探索},
volume = {No.299},
pages = {77-85},
year = {2022},
issn = {1005-8095},
doi ={10.3969/j.issn.1005-8095.2022.09.010}
}