Large Dataset vs Small Dataset #13007

blizaga · 2023-09-25T04:52:33Z

I would like to ask about training a textcat model that uses a transformer tokenizer: is it best practice to train on a large dataset, or on one that is not too large?

I ask because when I ran a benchmark with spaCy's default commands, the score for the large dataset was lower than the score for the smaller dataset.

shadeMe · 2023-09-25T09:09:43Z

When training models, it's preferable to use as much training data as possible, but one must also take diversity into account. A large dataset that only has examples of a small subset of labels/categories will perform relatively poorly compared to a smaller dataset with a more diverse set of examples. The important consideration when training any model is to ensure that the training data is representative/diverse enough to allow the model to generalize over a wide range of inputs.
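
One quick way to check the diversity point above is to look at how the labels are distributed in each dataset. The following is a minimal sketch in plain Python, not part of spaCy; the `label_distribution` helper and the toy `(text, label)` pairs are hypothetical and only illustrate the idea.

```python
from collections import Counter


def label_distribution(examples):
    """Return each label's share of the dataset, given (text, label) pairs."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy data: a large but skewed set and a small but balanced set.
large_but_skewed = [("some text", "POSITIVE")] * 95 + [("some text", "NEGATIVE")] * 5
small_but_balanced = [("some text", "POSITIVE")] * 10 + [("some text", "NEGATIVE")] * 10

print(label_distribution(large_but_skewed))
print(label_distribution(small_but_balanced))
```

If one label dominates the larger dataset, that skew is a plausible reason for a lower score, though it is only one of several possible causes.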

4 replies

blizaga Sep 26, 2023
Author

As for the score that appears at the end of training in spaCy, can it be trusted as-is?

shadeMe Sep 26, 2023

The score corresponds to the prediction performance of the model on the validation/evaluation data. If that data is representative of the inputs the model will see during inference, it should be a reliable indicator.
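
To make "prediction performance on the evaluation data" concrete, here is a rough sketch of one common classification metric, macro-averaged F1, computed over held-out gold labels and model predictions. This is plain Python written for illustration and is not spaCy's scorer; the `macro_f1` function and the toy labels are assumptions.

```python
def macro_f1(gold, predicted):
    """Average the per-label F1 scores over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical held-out labels and predictions.
gold = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
print(round(macro_f1(gold, predicted), 4))
```

Whatever the metric, the number only describes the examples it was computed on, which is why the evaluation data needs to resemble the inputs expected at inference time.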

blizaga Sep 26, 2023
Author

I split the dataset into 80% training data, 10% validation data, and 10% test data. Is that correct?
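
For reference, an 80/10/10 split like the one described can be sketched as follows. This is a minimal plain-Python illustration, not a spaCy utility; the `split_dataset` function, its parameters, and the toy examples are hypothetical. It shuffles with a fixed seed so the split is reproducible.

```python
import random


def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle examples and split them into train, dev, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains after train and dev
    return train, dev, test


# Hypothetical toy data.
examples = [(f"text {i}", "LABEL") for i in range(1000)]
train, dev, test = split_dataset(examples)
print(len(train), len(dev), len(test))
```

A purely random split like this does not guarantee that each label appears in the same proportion in all three parts, so it can be worth checking the label distribution of each part afterwards.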

blizaga Sep 26, 2023
Author

Of course, I also ran a benchmark evaluation using the spaCy command-line interface.
