Documentation for evaluation on a custom dataset for a custom task #2286
Hi @karrtikiyer, thanks for your interest! Do you want to evaluate your model inside the training loop, or as a separate standalone module? If the former, I would recommend copying whichever recipe you're using via
That PR only calculates validation loss, but of course you can define any other metrics you'd like to evaluate on. We are working to integrate this into all of our recipes, so please bear with us in the meantime. Please let us know if there are certain evaluation metrics you'd like to see supported out of the box (or any other eval feature requests you may have).
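To make the in-loop idea concrete, here is a minimal sketch of computing validation loss inside a training loop. This is an illustrative helper, not torchtune recipe code: `model`, `val_loader`, and `loss_fn` are assumed placeholders for whatever your recipe already defines.

```python
# Hypothetical sketch: average loss over a held-out dataloader with
# gradients disabled, restoring train mode afterwards. Not a torchtune API.
import torch


def validation_loss(model, val_loader, loss_fn, device="cpu"):
    """Return the mean per-batch loss over `val_loader`."""
    model.eval()
    total, n_batches = 0.0, 0
    with torch.no_grad():
        for tokens, labels in val_loader:
            logits = model(tokens.to(device))
            # CrossEntropyLoss expects [batch, vocab, seq], so transpose.
            total += loss_fn(logits.transpose(1, 2), labels.to(device)).item()
            n_batches += 1
    model.train()
    return total / max(n_batches, 1)
```

You could call this every N steps inside the training loop and log it alongside training loss; any other metric (perplexity, accuracy on a held-out set) can be computed the same way.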
In addition to what @ebsmothers said, you can also just train a model, run generations with it after you are done, and then evaluate the generations using your customEval code. To run generation you can:
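For the standalone route, the post-hoc evaluation step might look like the sketch below. The JSONL format and the `exact_match` metric are illustrative assumptions; substitute whatever file format your generation step produces and whatever metric your custom eval code defines.

```python
# Hypothetical post-hoc evaluation: score saved generations against
# references with a custom per-sample metric.
import json


def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 if the stripped strings match, else 0.0."""
    return float(prediction.strip() == reference.strip())


def evaluate_generations(path: str) -> float:
    """Average `exact_match` over a JSONL file of generation records."""
    scores = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            scores.append(exact_match(row["prediction"], row["reference"]))
    return sum(scores) / len(scores)
```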
Thanks @ebsmothers & @felipemello1 for your suggestions. I am looking to do both, starting first with what @felipemello1 has suggested.
Looks like right now the code only considers the prompt section of the config:
Also @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80 GB GPUs) is taking longer than single-GPU inference.
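When debugging slowdowns like this, a simple throughput number is easier to compare across runs than wall-clock feel. The helper below is an assumed sketch (not a torchtune API): wrap whatever generation call your recipe exposes and compare tokens per second between the single-GPU and distributed configurations.

```python
# Hypothetical timing helper: measure generation throughput so that
# single-GPU and multi-GPU runs can be compared on equal footing.
import time


def tokens_per_second(generate_fn, prompt, n_new_tokens):
    """Time one call to `generate_fn` and return tokens/sec."""
    start = time.perf_counter()
    generate_fn(prompt, n_new_tokens)
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed
```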
Just as a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.
I am testing now with 2, will post an update here.
You are correct. It has been a long-standing request to provide a flexible, customizable eval, but we haven't had a chance to prioritize it. It shouldn't be too hard to hack the current recipe to take a dataset as input and call self.generate multiple times. However, unless you are building a custom model, I think that leveraging vLLM would be the best choice. It has native support for many nice generation features, it is optimized, and your code will be closer to being production ready. In this process, if you feel like writing a blog or submitting a PR with a script, we would love to take a look!
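For reference, offline batch generation with vLLM can be as short as the sketch below. The model name, `tensor_parallel_size`, and sampling settings are placeholders to adapt to your setup; this assumes vLLM is installed and the model fits on your GPUs.

```python
# Sketch of offline batch generation with vLLM (model name and
# tensor_parallel_size are illustrative placeholders).
from vllm import LLM, SamplingParams

prompts = ["Summarize: ...", "Translate: ..."]  # your eval prompts
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=256)

for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```

The generations can then be scored with your custom eval code as discussed above.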
@felipemello1: I can try hacking the recipe and submitting a PR. Do we have a design choice around the config, i.e. whether the eval dataset should live under the prompt config or under the overall dataset config?
That would be awesome! I think that the dataset field should be as close as possible to how it looks in the training config, so users can just leverage that. If I may offer some advice: before spending a lot of time coding, it is probably worth creating an RFC PR (request for comments), where you can put up a dummy, non-functional implementation, see if everyone is happy, and then go hands-on. Here is an example of an RFC for some complex changes that we had to make: #1283
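As a starting point for such an RFC, an eval config fragment mirroring the training `dataset` field might look like this. The field names below are a hypothetical sketch, not an agreed-upon torchtune schema; only the `_component_` convention is taken from existing torchtune configs.

```yaml
# Hypothetical eval config fragment: reuse the same `dataset` shape
# as the training config so users can copy it over unchanged.
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: my_org/my_custom_eval_set  # placeholder dataset name
  split: validation
```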
@felipemello1: Please let me know if you have any thoughts on this topic.
Unless your context size / answer is very long, it shouldn't take 2 minutes per generation on an H100 for an 8B model. Can you share your config or code here? How many samples did you get with a single device? @joecummings, do you mind taking a look? @RdoubleA, assigned you because you mentioned having some ideas on the eval dataset.
@felipemello1 Definitely something seems to be off with distributed inference/generation.
By the way, using vLLM with tensor parallel size = 4, I ended up needing around 24 minutes to generate 2264 samples in my case.
I am in the process of instruction-tuning a model for a custom task, and want to evaluate the trained model on my custom dataset.
I am looking for documentation on the configs that would allow me to do this; I searched here but could not find an appropriate tutorial or description.
Can someone please help?
Thanks in advance!