Large Dataset vs Small Dataset #13007
blizaga
started this conversation in
Help: Best practices
Replies: 1 comment 4 replies
-
When training models, it's preferable to use as much training data as possible, but one must also take diversity into account. A model trained on a large dataset that only covers a small subset of the labels/categories will perform relatively poorly compared with one trained on a smaller dataset with a more diverse set of examples. The key consideration when training any model is to ensure that the training data is representative and diverse enough for the model to generalize over a wide range of inputs.
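One rough way to check this before training is to compare how evenly the labels are spread in each dataset. The sketch below is only an illustration in plain Python, not part of spaCy: the helper names `label_distribution` and `normalized_entropy` and the example label lists are made up for this post. It computes the share of each label and a normalized entropy score, where values near 1 mean the labels are evenly represented and values near 0 mean a few labels dominate.

```python
from collections import Counter
import math


def label_distribution(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def normalized_entropy(labels):
    """Entropy of the label distribution, scaled to the range [0, 1].

    1.0 means perfectly balanced labels; 0.0 means a single label.
    """
    dist = label_distribution(labels)
    if len(dist) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in dist.values())
    return entropy / math.log(len(dist))


# Hypothetical label lists, invented purely for illustration.
large_skewed = ["POSITIVE"] * 9000 + ["NEGATIVE"] * 900 + ["NEUTRAL"] * 100
small_balanced = ["POSITIVE"] * 300 + ["NEGATIVE"] * 300 + ["NEUTRAL"] * 300

for name, labels in [("large_skewed", large_skewed), ("small_balanced", small_balanced)]:
    print(name, len(labels), label_distribution(labels), round(normalized_entropy(labels), 3))
```

A low score on the larger dataset would be one possible explanation for a lower benchmark result, though it only measures label balance and says nothing about how varied the texts themselves are.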
-
I would like to ask about training a textcat model that uses a transformer: is it best practice to train with a large dataset, or with a smaller one?
I ask because when I ran a benchmark with spaCy's default commands, the score for the large dataset was lower than the score for a smaller dataset.