What does re.findall('##[\u4E00-\u9FA5]') do in the data processing? #64

Open
xiaojinglu opened this issue Apr 23, 2020 · 2 comments
Comments

@xiaojinglu

In the whole word mask step of the pre-training data processing, what does this line do? I found that removing it makes the results drop significantly.

output_tokens = [t[2:] if len(re.findall('##[\u4E00-\u9FA5]', t))>0 else t for t in tokens]

@PeihanDou

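Hello, I have the same question. What exactly do you mean by the results dropping significantly? Do you mean that the inference accuracy of the pre-trained model drops?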

@sliderSun

In the whole word mask step of the pre-training data processing, what does this line do? I found that removing it makes the results drop significantly.

output_tokens = [t[2:] if len(re.findall('##[\u4E00-\u9FA5]', t))>0 else t for t in tokens]

Doesn't this just strip the leading ## and keep the Chinese part of the token?
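
To make the behaviour of that line concrete, here is a minimal sketch that applies the same pattern to a made-up token list. The function name `strip_cjk_continuation_prefix` and the sample tokens are hypothetical and not taken from the repository. It also assumes that the `##` marks on Chinese characters were added by an earlier word-segmentation step to group characters into whole words, which this thread does not confirm.

```python
import re

# Same pattern as in the quoted line: "##" followed by one character
# in the range U+4E00..U+9FA5 (common CJK ideographs).
CJK_CONTINUATION = re.compile('##[\u4E00-\u9FA5]')


def strip_cjk_continuation_prefix(tokens):
    """Drop the first two characters ("##") from every token that contains
    "##" followed by a Chinese character; leave all other tokens unchanged.

    Hypothetical helper; equivalent to the list comprehension in the issue,
    using search() instead of len(findall()) > 0.
    """
    return [t[2:] if CJK_CONTINUATION.search(t) else t for t in tokens]


# Hypothetical input: Chinese continuation characters carry "##",
# and an English WordPiece continuation ("##ing") is mixed in.
tokens = ['[CLS]', '使', '##用', '语', '##言', 'play', '##ing', '[SEP]']
output_tokens = strip_cjk_continuation_prefix(tokens)
print(output_tokens)
```

On this input the Chinese tokens `##用` and `##言` become `用` and `言`, while `##ing` keeps its prefix because the character after `##` is not in the CJK range. Note that the pattern is not anchored to the start of the token, so `t[2:]` only does the right thing when the `##` really is at the beginning.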
