Is the dataset open source? #24

Closed
babyta opened this issue Jul 3, 2023 · 10 comments

Comments

babyta commented Jul 3, 2023

Is the dataset open source? I would like to classify font attributes, such as whether a font is serif, italic, or bold.

@JeffersonQin
Owner

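As the README mentions, the dataset is synthetic. All of the synthesis scripts are open source.

Many of the fonts are commercial fonts taken from the VCB-Studio font pack, and many of them probably cannot be used commercially.

Whether a font is serif can be derived from the font that the classifier predicts. Advanced OpenType features are not supported yet, and .ttc collections are not handled yet either. However, if the font itself is an italic or bold face, it can be recognized.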
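To illustrate deriving attributes from the predicted font, here is a minimal sketch. It assumes you keep your own metadata table keyed by the class label the classifier outputs. The table, the label names, and the helper functions are all hypothetical and are not part of this repository. The italic and bold masks are the OS/2 `fsSelection` bits defined by the OpenType specification; the serif flag has to be supplied by hand in this sketch.

```python
from dataclasses import dataclass

# OS/2 fsSelection bit masks as defined by the OpenType specification.
FS_ITALIC = 1 << 0  # bit 0: ITALIC
FS_BOLD = 1 << 5    # bit 5: BOLD


@dataclass(frozen=True)
class FontStyle:
    serif: bool
    italic: bool
    bold: bool


def style_from_fs_selection(fs_selection: int, serif: bool) -> FontStyle:
    """Decode italic/bold from an fsSelection value; serif is given by the caller."""
    return FontStyle(
        serif=serif,
        italic=bool(fs_selection & FS_ITALIC),
        bold=bool(fs_selection & FS_BOLD),
    )


# Hypothetical metadata table: label names and values are made up for illustration.
FONT_METADATA = {
    "ExampleSerif-BoldItalic": style_from_fs_selection(0b0100001, serif=True),
    "ExampleSans-Regular": style_from_fs_selection(0b1000000, serif=False),
}


def attributes_for(predicted_label: str) -> FontStyle:
    """Look up the style attributes of the font the classifier predicted."""
    return FONT_METADATA[predicted_label]
```

With a table like this, attribute classification becomes a lookup on the classifier's output rather than a separate model.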

@JeffersonQin
Owner

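The synthesized dataset is as large as 200 GB, so it is not practical to share. I recommend following the README and synthesizing it in a distributed fashion across multiple machines.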

@JeffersonQin
Owner

Contributions are welcome :)

Author

babyta commented Jul 3, 2023

OK, thank you.

babyta closed this as completed Jul 3, 2023

Author

babyta commented Jul 4, 2023

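Hello, does data generation require an internet connection? I found a few fonts and some background images and put them in the fonts and pixivimages folders under dataset, but the generation program hangs at SimplifiedChineseRandomCorpusGeneratorWithEnglish. I kept only the "zh-Hans" entry in generators, and I also saved the wordlist.txt under font_dataset locally. My environment is Windows 11.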

@JeffersonQin
Owner

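@babyta Please provide the error message, or the line where it hangs (add some logging). It has been a while, so I can't say offhand what the cause is. The more information you give me, the easier it is to locate the problem.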
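As a rough sketch of the kind of logging that would help here, the context manager below logs when a step starts and when it ends, so a hang shows up as a "start" line with no matching "end" line. The logger name, the `log_step` helper, and the step name in the usage example are hypothetical. They are not taken from this repository, and you would wrap the real calls in the generation script yourself.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("datagen_debug")


@contextmanager
def log_step(name: str):
    """Log the start and end of a step, with its elapsed time."""
    start = time.perf_counter()
    logger.info("start: %s", name)
    try:
        yield
    finally:
        # Runs on success and on exceptions, but not if the step never returns.
        logger.info("end: %s (%.2fs)", name, time.perf_counter() - start)


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    # Placeholder step; replace the body with the call suspected of hanging.
    with log_step("build corpus generator"):
        time.sleep(0.1)
```

Wrapping each stage of the generator setup this way narrows the hang down to the last step that logged "start" without a following "end".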

Author

babyta commented Jul 5, 2023

Did you use the data from the full package?

@JeffersonQin
Owner

Yes, the full package.

JeffersonQin reopened this Jul 5, 2023
JeffersonQin pinned this issue Jul 5, 2023