Documentation for evaluation on a custom dataset for a custom task #2286

Open
karrtikiyer opened this issue Jan 21, 2025 · 16 comments
Labels: bug, discussion, documentation, triage review

Comments

@karrtikiyer

I am in the process of instruction tuning a model for a custom task and want to evaluate the trained model on my custom dataset.
I am looking for documentation on the configs that would allow me to do this; I searched here but could not find an appropriate tutorial or description.
Can someone please help?
Thanks in advance!

@ebsmothers
Contributor

Hi @karrtikiyer, thanks for your interest! Do you want to evaluate your model inside the training loop or as a separate standalone module? If the former, I would recommend copying whichever recipe you're using via tune cp (assuming you haven't done that already) so that you can customize it. After copying, you should be able to modify it to add a validation loop. I would recommend taking this PR as a starting point. If you follow that approach, you could make the following updates to your config:

dataset_validation:
  _component_: my_dataset_class
  ... # any other dataset args you need

# set these to whatever you want
run_val_every_n_steps: 100
max_validation_batches: null

That PR only calculates validation loss, but of course you can define any other metrics you'd like to evaluate on. We are working to integrate this into all of our recipes so please bear with us in the meantime. Please let us know if there are certain evaluation metrics you'd like to see supported out of the box (or any other eval feature requests you may have).
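
For illustration, here is a minimal sketch of a validation method you might add to the copied recipe class. The attribute names (self._val_dataloader, self._loss_fn, self.max_validation_batches) and the forward/loss signatures are assumptions mirroring the config keys above, not the exact code from that PR:

import torch

def validate(self) -> float:
    # Average validation loss over (at most) max_validation_batches batches.
    self._model.eval()
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for idx, batch in enumerate(self._val_dataloader):
            if self.max_validation_batches is not None and idx >= self.max_validation_batches:
                break
            batch = {k: v.to(self._device) for k, v in batch.items()}
            labels = batch.pop("labels")
            logits = self._model(**batch)          # assumed forward signature
            loss = self._loss_fn(logits, labels)   # reuse the training loss
            total_loss += loss.item()
            num_batches += 1
    self._model.train()
    return total_loss / max(num_batches, 1)

# In the training loop, roughly:
# if self.global_step % cfg.run_val_every_n_steps == 0:
#     val_loss = self.validate()
#     self._metric_logger.log("val_loss", val_loss, step=self.global_step)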

@felipemello1
Contributor

felipemello1 commented Jan 22, 2025

In addition to what @ebsmothers said, you can also just train a model, run generations with it after you are done, and then evaluate the generations using your custom eval code.

To run generation you can:

  1. use our generation recipe: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_generation_distributed.yaml

  2. or just rely on HF / vLLM: https://pytorch.org/torchtune/main/tutorials/e2e_flow.html#use-with-vllm
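
For the vLLM route, a minimal sketch (the checkpoint path is a placeholder for wherever the torchtune checkpointer wrote your fine-tuned model in HF format, and the prompts would come from your custom dataset):

from vllm import LLM, SamplingParams

# Point vLLM at the HF-format checkpoint directory written by the checkpointer.
llm = LLM(model="/path/to/finetuned-checkpoint-dir")

sampling_params = SamplingParams(temperature=0.6, top_k=50, max_tokens=512)

# One prompt per example in your custom eval dataset.
prompts = [
    "Tell me a joke.",
]

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)  # feed these generations into your custom eval code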

@felipemello1 felipemello1 added documentation Improvements or additions to documentation discussion Start a discussion triaged This issue has been assigned an owner and appropriate label labels Jan 22, 2025
@karrtikiyer
Author

Thanks @ebsmothers & @felipemello1 for your suggestions. I am looking to do both, starting with what @felipemello1 suggested.
One question for @felipemello1 on the distributed generation recipe below: would it work if I provide a dataset element in the config for generation to run against, instead of passing the input to the model as a command-line argument? Or would that require a code change to process the dataset element and run generation against that dataset of inputs?
> use our generation recipe: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_generation_distributed.yaml

@karrtikiyer
Author

> Thanks @ebsmothers & @felipemello1 for your suggestions. I am looking to do both, starting with what @felipemello1 suggested. One question for @felipemello1 on the distributed generation recipe below: would it work if I provide a dataset element in the config for generation to run against, instead of passing the input to the model as a command-line argument? Or would that require a code change to process the dataset element and run generation against that dataset of inputs? use our generation recipe: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_generation_distributed.yaml

Looks like right now the code only considers the prompt section of the config:
messages = self.to_messages(OmegaConf.to_container(cfg.prompt))
Is this understanding correct?

@karrtikiyer
Author

Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.

@felipemello1
Contributor

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.

Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.

@felipemello1 felipemello1 added triage review This issue should be discussed in weekly review and removed triaged This issue has been assigned an owner and appropriate label labels Jan 23, 2025
@karrtikiyer
Author

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.
>
> Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.

I am testing with 2 now and will post an update here.

@felipemello1
Contributor

felipemello1 commented Jan 23, 2025

> Looks like right now the code only considers the prompt section of the config: messages = self.to_messages(OmegaConf.to_container(cfg.prompt)) Is this understanding correct?

You are correct. It has been a long-standing request to provide a flexible, customizable eval, but we haven't had a chance to prioritize it. It shouldn't be too hard to hack the current recipe to take a dataset as input and call self.generate multiple times.
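
For what it's worth, a rough sketch of that hack (the cfg.dataset key and the _generate_one helper are hypothetical, not existing recipe code; the idea is just to reuse the current prompt-to-messages path in a loop):

from omegaconf import OmegaConf

# Instead of reading a single cfg.prompt, loop over a list of prompt dicts
# defined under a (hypothetical) cfg.dataset key in the config.
results = []
for row in OmegaConf.to_container(cfg.dataset):
    messages = self.to_messages(row)        # reuse the existing prompt -> Message conversion
    output = self._generate_one(messages)   # hypothetical helper wrapping the current generate body
    results.append(output)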

However, unless you are building a custom model, I think that leveraging vLLM would be the best choice. It has native support for many nice generation features, it is optimized, and your code will be closer to being production-ready.

In this process, if you feel like writing a blog or submitting a PR with a script, we would love to take a look!

@karrtikiyer
Author

karrtikiyer commented Jan 23, 2025

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.
>
> Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.

@felipemello1
It seems slightly better with 2 GPUs than with 4, but overall single-GPU still feels much faster.
With 2 GPUs, I have been able to generate output for 15 samples in 30 minutes using a fully fine-tuned Llama 3.1 8B Instruct model.
Do you have any debugging tips to find out what might be happening?

@karrtikiyer
Author

> Looks like right now the code only considers the prompt section of the config: messages = self.to_messages(OmegaConf.to_container(cfg.prompt)) Is this understanding correct?
>
> You are correct. It has been a long-standing request to provide a flexible, customizable eval, but we haven't had a chance to prioritize it. It shouldn't be too hard to hack the current recipe to take a dataset as input and call self.generate multiple times.
>
> However, unless you are building a custom model, I think that leveraging vLLM would be the best choice. It has native support for many nice generation features, it is optimized, and your code will be closer to being production-ready.
>
> In this process, if you feel like writing a blog or submitting a PR with a script, we would love to take a look!

@felipemello1: I can try hacking the recipe and submitting a PR. Do we have a design choice around the config, i.e. whether the eval dataset should live under the prompt config or under the overall dataset config?

@felipemello1
Contributor

That would be awesome! I think the dataset field should be as close as possible to how it looks in the training config, so users can just leverage that.

If I may offer some advice: before spending a lot of time coding, it is probably worth creating an RFC PR (request for comments) with a dummy, non-functional implementation, seeing whether everyone is happy with the design, and then going hands-on.

Here is an example of an RFC for some complex changes that we had to make: #1283

@karrtikiyer
Author

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.
>
> Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.
>
> @felipemello1
> It seems slightly better with 2 GPUs than with 4, but overall single-GPU still feels much faster.
> With 2 GPUs, I have been able to generate output for 15 samples in 30 minutes using a fully fine-tuned Llama 3.1 8B Instruct model.
> Do you have any debugging tips to find out what might be happening?

@felipemello1: Please let me know if you have any thoughts on this topic.

@felipemello1
Contributor

felipemello1 commented Jan 23, 2025

Unless your context size / answer is very long, it shouldn't take 2 minutes per generation on an H100 for an 8B model. Can you share your config or code here? How many samples did you get with a single device?

@joecummings, do you mind taking a look?

@RdoubleA, I assigned you because you mentioned having some ideas on the eval dataset.

@karrtikiyer
Author

> Unless your context size / answer is very long, it shouldn't take 2 minutes per generation on an H100 for an 8B model. Can you share your config or code here? How many samples did you get with a single device?
>
> @joecummings, do you mind taking a look?

@felipemello1: here is the generation config. Both the context (it has RAG context) and the answer (it has CoT reasoning) are relatively long; I am not sure how long counts as "very long". With a single device I was able to get output for around 40 samples in an hour; with 2 devices it has now been running for 1.5 hours and has only generated output for 34 samples.
I am wondering whether it is worth splitting my data into 4 equal pieces and running each part in a single-device configuration. What do you think?

# Config for running the InferenceRecipe in generate.py to generate output from an LLM
#
# To launch, run the following command from root torchtune directory:
#    tune run generate --config generation

output_dir: my_fine_tuned_model_epoch_directory

# Model arguments
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
  
parallelize_plan:
  _component_: torchtune.models.llama3.base_llama_tp_plan
  
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ${output_dir}/some_epoch/
  checkpoint_files: [
    ft-model-00001-of-00004.safetensors,
    ft-model-00002-of-00004.safetensors,
    ft-model-00003-of-00004.safetensors,
    ft-model-00004-of-00004.safetensors
  ]
  output_dir: ${output_dir}
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234
log_level: INFO

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: ${output_dir}/some_epoch/original/tokenizer.model
  max_seq_len: null
  prompt_template: null

# Generation arguments; defaults taken from gpt-fast
prompt:
  system: null
  user: "Tell me a joke."
max_new_tokens: 4096
temperature: 0.1 # 0.8 and 0.6 are popular values to try
top_k: 50

@karrtikiyer
Author

@felipemello1
I am rerunning with a single node right now, and in 19 minutes it has already processed 20 samples. Posting a screenshot below for reference:

Image

Definitely something seems to be off with distributed inference/generation.

@karrtikiyer
Author

By the way, using vLLM with tensor parallel size = 4, I ended up needing around 24 minutes to generate 2264 samples in my case.
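
For reference, a minimal sketch of that vLLM setup (the checkpoint path is a placeholder; in practice the prompt list would hold one entry per sample in the eval dataset):

from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="/path/to/finetuned-checkpoint-dir", tensor_parallel_size=4)
outputs = llm.generate(
    ["Tell me a joke."],
    SamplingParams(temperature=0.1, top_k=50, max_tokens=4096),
)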

@felipemello1 felipemello1 added the bug Something isn't working label Jan 25, 2025