GitHub - Chen-X666/DDmkTCCorpus: 📕 DDmkTCCorpus: Diachronic Danmaku Text Comments Corpus （历时弹幕语料库）

📕DDmkTCCorpus: Diachronic Danmaku Text Comments Corpus


1️⃣ This project provides open-source comment data for in-depth research on Danmaku (bullet-screen comments). The corpus focuses mainly on subculture content, including but not limited to the Guichu, animation, and e-sports genres.

2️⃣ The corpus is maintained by TinyTalks, a community for NLP research on Chinese short text.

Video Corpus

The video corpus covers videos with more than 1 million plays from the years 2017 to 2020. The attributes are as follows.

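Attribute	Explanation
Bvno	Video ID
Tname	Main tag
Tag_name	Tag list
Owner_mid	Uploader ID
Title	Title
Pubdate	Publication time
Duration	Duration
View	Number of views
Danmaku	Number of danmaku
Reply	Number of replies (comments)
Favorite	Number of favorites
Coin	Number of coins
Share	Number of shares
Like	Number of likes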

Source: https://pan.bnuz.edu.cn/l/J5z6nP password:bnuz
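
The attribute table above can be read as a record schema. The following is a minimal sketch, assuming each video is available as a dictionary keyed by the attribute names, with counts stored as strings. The file layout of the download is not described here, so the helper names `parse_video_row` and `danmaku_per_thousand_views` and the sample row are hypothetical.

```python
# Count attributes from the table above; all are treated as integers.
COUNT_FIELDS = ("View", "Danmaku", "Reply", "Favorite", "Coin", "Share", "Like")


def parse_video_row(row: dict) -> dict:
    """Return a copy of `row` with the count attributes converted to int."""
    parsed = dict(row)
    for name in COUNT_FIELDS:
        parsed[name] = int(row[name])
    return parsed


def danmaku_per_thousand_views(video: dict) -> float:
    """Danmaku density: number of danmaku per 1,000 views."""
    return 1000 * video["Danmaku"] / video["View"]


# Hypothetical sample row, for illustration only.
sample_row = {
    "Bvno": "BV_example", "Title": "example title",
    "View": "1500000", "Danmaku": "3000", "Reply": "800",
    "Favorite": "12000", "Coin": "9000", "Share": "400", "Like": "50000",
}
video = parse_video_row(sample_row)
```

A derived measure such as danmaku density is one way to compare videos of very different popularity.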

Video-Channel Network

https://pan.bnuz.edu.cn/l/g1ydM2 password:bnuz

Attribute	Explanation
videoType	Video type
relation	Relation ("belongs to")
channel	Channel tag
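
Each row of the network can be treated as a (videoType, relation, channel) triple. The sketch below, assuming the rows have already been loaded as tuples, groups video types by the channel they belong to. The function name `build_channel_index` and the sample triples are hypothetical.

```python
from collections import defaultdict


def build_channel_index(triples):
    """Map each channel tag to the set of video types that belong to it."""
    index = defaultdict(set)
    for video_type, relation, channel in triples:
        # The only relation described in the table is "belongs to".
        index[channel].add(video_type)
    return dict(index)


# Hypothetical sample triples, for illustration only.
sample_triples = [
    ("type_a", "belongs to", "channel_x"),
    ("type_b", "belongs to", "channel_x"),
    ("type_c", "belongs to", "channel_y"),
]
channel_index = build_channel_index(sample_triples)
```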

Danmaku Comment Corpus

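The attributes are detailed below.

Attribute	Type	Explanation	Default
text	(str)	Danmaku text
dm_time	(float)	Position of the danmaku in the video, in seconds	0.0
send_time	(float)	Time at which the danmaku was sent	time.time()
crc32_id	(str)	CRC32 digest of the sender's UID	None
color	(str)	Danmaku color in hexadecimal	"ffffff"
weight	(int)	Display weight of the danmaku in the danmaku list	-1
id_	(int)	Danmaku ID	-1
id_str	(str)	Danmaku string ID	""
action	(str)	Currently unknown	""
mode	(Mode)	Danmaku mode	Mode.FLY
font_size	(FontSize)	Danmaku font size	FontSize.NORMAL
is_sub	(bool)	Whether the danmaku is a subtitle danmaku	False
pool	(int)	Currently unknown	-1
attr	(int)	Currently unknown	-1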
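
The attribute table maps naturally onto a record type. Below is a minimal sketch of such a record as a Python dataclass using the listed defaults. The `Mode` and `FontSize` enums are stand-ins that define only the members named in the table, and their numeric values are assumptions. The helper `crc32_digest` illustrates one plausible way a CRC32 digest of a UID could be computed (decimal UID string, lowercase hex output); the exact encoding used by the corpus is not stated here.

```python
import time
import zlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    FLY = 1  # assumed numeric value


class FontSize(Enum):
    NORMAL = 25  # assumed numeric value


@dataclass
class Danmaku:
    text: str                      # no default is listed for the text
    dm_time: float = 0.0           # position in the video, in seconds
    send_time: float = field(default_factory=time.time)
    crc32_id: Optional[str] = None
    color: str = "ffffff"
    weight: int = -1
    id_: int = -1
    id_str: str = ""
    action: str = ""
    mode: Mode = Mode.FLY
    font_size: FontSize = FontSize.NORMAL
    is_sub: bool = False
    pool: int = -1
    attr: int = -1


def crc32_digest(uid: int) -> str:
    """CRC32 of the decimal UID string as lowercase hex (assumed encoding)."""
    return format(zlib.crc32(str(uid).encode("utf-8")), "x")


comment = Danmaku(text="example comment", dm_time=12.5)
```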

Classification of Corpus

Type	Original Source	Text Source	Password
Guichu			bnuz
E-sports			bnuz
Animation			bnuz
Epidemic			bnuz

Epidemic category: https://pan.bnuz.edu.cn/l/aoMMOM (password: bnuz)

https://pan.bnuz.edu.cn/l/onFbAO (password: bnuz)

https://pan.bnuz.edu.cn/l/QJGkNF (password: bnuz)

Danmaku Language Models

Models	Description	Mask Accuracy	Link
chinese_danmaku_roberta	Fine-tuned version of uer/chinese_roberta_L-8_H-512 on a Danmaku corpus dataset (2000k raw data).	0.7780

Danmaku Marked Data

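The data is UTF-8 encoded and stored as comma-separated CSV.

Each record consists of numeric information, text information, and an annotated class (unannotated data carries no annotation).

The numeric information contains: the time point at which the danmaku appears in the video, display mode, font size, font color, send time, danmaku pool number, sender user number, and the number in the danmaku database. These fields are separated by commas.

Annotation classes: 0 happy, 1 sad, 2 angry, 3 surprised, 4 negative sample.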
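
The record layout described above can be parsed with the standard `csv` module. This is a minimal sketch, assuming each row holds the eight numeric fields in the listed order, then the text, then an optional label column. That column order, the English field names in `NUMERIC_FIELDS`, and the sample rows are assumptions for illustration.

```python
import csv

LABELS = {0: "happy", 1: "sad", 2: "angry", 3: "surprised", 4: "negative sample"}

# Hypothetical names for the eight numeric fields, in the order listed above.
NUMERIC_FIELDS = (
    "dm_time", "mode", "font_size", "color",
    "send_time", "pool", "sender_id", "db_id",
)


def parse_marked_rows(lines):
    """Parse CSV lines into dicts; `label` is None for unannotated rows."""
    records = []
    n = len(NUMERIC_FIELDS)
    for row in csv.reader(lines):
        if len(row) < n + 1:
            continue  # skip rows too short to hold the numeric fields and text
        record = dict(zip(NUMERIC_FIELDS, row[:n]))
        record["text"] = row[n]
        raw_label = row[n + 1].strip() if len(row) > n + 1 else ""
        record["label"] = LABELS[int(raw_label)] if raw_label else None
        records.append(record)
    return records


# Hypothetical sample rows: one annotated, one unannotated.
sample_lines = [
    '12.5,1,25,16777215,1600000000,0,abcd1234,1001,"so funny, lol",0',
    "3.0,1,25,16777215,1600000001,0,ef567890,1002,hello",
]
records = parse_marked_rows(sample_lines)
```

Using `csv.reader` rather than `str.split(",")` keeps quoted danmaku text that itself contains commas in a single field.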

https://pan.bnuz.edu.cn/l/snpijm password: bnuz

citation: https://github.com/MelkiorOno/DanmakuMarked-data

Citation

If you use this corpus in your research, please cite this repository.

@article{
 QBTS202209010,
 author = {陈鑫,张以欣,吴俊潮,郭凌宇,余泽汇 & 杨静},
 title = {历时弹幕语料库的构建与探析——以青年亚文化弹幕为例},
 journal = {情报探索},
 volume = {No.299},
 number = {77-85},
 year = {2022},
 issn = {1005-8095},
 doi ={10.3969/j.issn.1005－8095.2022.09.010}
 }
