Documentation for evaluation on a custom dataset for a custom task #2286

Open
karrtikiyer opened this issue Jan 21, 2025 · 16 comments
Labels: bug, discussion, documentation, triage review

Comments

@karrtikiyer

I am in the process of instruction tuning a model for a custom task and want to evaluate the trained model on my custom dataset.
I am looking for documentation on the configs that would allow me to do this; I searched here but could not find an appropriate tutorial or description.
Can someone please help?
Thanks in advance!

@ebsmothers
Contributor

Hi @karrtikiyer, thanks for your interest! Do you want to evaluate your model inside the training loop or as a separate standalone module? If the former, I would recommend copying whichever recipe you're using via tune cp (assuming you haven't done that already) so that you can customize it. After copying, you should be able to modify it to add a validation loop. I would recommend taking this PR as a starting point. If you follow that approach, you could make the following updates to your config:

dataset_validation:
  _component_: my_dataset_class
  ... # any other dataset args you need

# set these to whatever you want
run_val_every_n_steps: 100
max_validation_batches: null

That PR only calculates validation loss, but of course you can define any other metrics you'd like to evaluate on. We are working to integrate this into all of our recipes so please bear with us in the meantime. Please let us know if there are certain evaluation metrics you'd like to see supported out of the box (or any other eval feature requests you may have).
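
For illustration, here is a minimal sketch of a validation method you might add to the copied recipe class. The attribute names (self._val_dataloader, self._loss_fn, self.max_validation_batches) and the forward/loss signatures are assumptions mirroring the config keys above, not the exact code from that PR:

import torch

def validate(self) -> float:
    # Average validation loss over (at most) max_validation_batches batches.
    self._model.eval()
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for idx, batch in enumerate(self._val_dataloader):
            if self.max_validation_batches is not None and idx >= self.max_validation_batches:
                break
            batch = {k: v.to(self._device) for k, v in batch.items()}
            labels = batch.pop("labels")
            logits = self._model(**batch)          # assumed forward signature
            loss = self._loss_fn(logits, labels)   # reuse the training loss
            total_loss += loss.item()
            num_batches += 1
    self._model.train()
    return total_loss / max(num_batches, 1)

# In the training loop, roughly:
# if self.global_step % cfg.run_val_every_n_steps == 0:
#     val_loss = self.validate()
#     self._metric_logger.log("val_loss", val_loss, step=self.global_step)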

@felipemello1
Contributor

felipemello1 commented Jan 22, 2025

In addition to what @ebsmothers said, you can also just train a model, run generations with it after you are done, and then evaluate the generations using your custom eval code.

To run generation you can:

  1. use our generation recipe: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_generation_distributed.yaml

  2. or just rely on HF / vLLM: https://pytorch.org/torchtune/main/tutorials/e2e_flow.html#use-with-vllm
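
For the vLLM route, a minimal sketch (the checkpoint path is a placeholder for wherever the torchtune checkpointer wrote your fine-tuned model in HF format, and the prompts would come from your custom dataset):

from vllm import LLM, SamplingParams

# Point vLLM at the HF-format checkpoint directory written by the checkpointer.
llm = LLM(model="/path/to/finetuned-checkpoint-dir")

sampling_params = SamplingParams(temperature=0.6, top_k=50, max_tokens=512)

# One prompt per example in your custom eval dataset.
prompts = [
    "Tell me a joke.",
]

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)  # feed these generations into your custom eval code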

@felipemello1 felipemello1 added documentation Improvements or additions to documentation discussion Start a discussion triaged This issue has been assigned an owner and appropriate label labels Jan 22, 2025
@karrtikiyer
Author

Thanks @ebsmothers & @felipemello1 for your suggestions. I am looking to do both, starting with what @felipemello1 suggested.
One question for @felipemello1 on the distributed generation recipe below: would it work if I provide a dataset element in the config for generation to run against, instead of passing the input to the model as a command-line argument? Or would that require a code change to process the dataset element and run generation against that dataset of inputs?
> use our generation recipe: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_generation_distributed.yaml

@karrtikiyer
Author

> Thanks @ebsmothers & @felipemello1 for your suggestions. I am looking to do both, starting with what @felipemello1 suggested. One question for @felipemello1 on the distributed generation recipe below: would it work if I provide a dataset element in the config for generation to run against, instead of passing the input to the model as a command-line argument? Or would that require a code change to process the dataset element and run generation against that dataset of inputs? use our generation recipe: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_generation_distributed.yaml

Looks like right now the code only considers the prompt section of the config:
messages = self.to_messages(OmegaConf.to_container(cfg.prompt))
Is this understanding correct?

@karrtikiyer
Author

Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.

@felipemello1
Contributor

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.

Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.

@felipemello1 felipemello1 added triage review This issue should be discussed in weekly review and removed triaged This issue has been assigned an owner and appropriate label labels Jan 23, 2025
@karrtikiyer
Author

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.
>
> Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.

I am testing with 2 now and will post an update here.

@felipemello1
Contributor

felipemello1 commented Jan 23, 2025

> Looks like right now the code only considers the prompt section of the config: messages = self.to_messages(OmegaConf.to_container(cfg.prompt)) Is this understanding correct?

You are correct. It has been a long-standing request to provide a flexible, customizable eval, but we haven't had a chance to prioritize it. It shouldn't be too hard to hack the current recipe to take a dataset as input and call self.generate multiple times.
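
For what it's worth, a rough sketch of that hack (the cfg.dataset key and the _generate_one helper are hypothetical, not existing recipe code; the idea is just to reuse the current prompt-to-messages path in a loop):

from omegaconf import OmegaConf

# Instead of reading a single cfg.prompt, loop over a list of prompt dicts
# defined under a (hypothetical) cfg.dataset key in the config.
results = []
for row in OmegaConf.to_container(cfg.dataset):
    messages = self.to_messages(row)        # reuse the existing prompt -> Message conversion
    output = self._generate_one(messages)   # hypothetical helper wrapping the current generate body
    results.append(output)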

However, unless you are building a custom model, I think that leveraging vLLM would be the best choice. It has native support for many nice generation features, it is optimized, and your code will be closer to being production-ready.

In this process, if you feel like writing a blog or submitting a PR with a script, we would love to take a look!

@karrtikiyer
Author

karrtikiyer commented Jan 23, 2025

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.
>
> Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.

@felipemello1
It seems slightly better with 2 GPUs than with 4, but overall single-GPU still feels much faster.
With 2 GPUs, I have been able to generate output for 15 samples in 30 minutes using a fully fine-tuned Llama 3.1 8B Instruct model.
Do you have any debugging tips to find out what might be happening?

@karrtikiyer
Author

> Looks like right now the code only considers the prompt section of the config: messages = self.to_messages(OmegaConf.to_container(cfg.prompt)) Is this understanding correct?
>
> You are correct. It has been a long-standing request to provide a flexible, customizable eval, but we haven't had a chance to prioritize it. It shouldn't be too hard to hack the current recipe to take a dataset as input and call self.generate multiple times.
>
> However, unless you are building a custom model, I think that leveraging vLLM would be the best choice. It has native support for many nice generation features, it is optimized, and your code will be closer to being production-ready.
>
> In this process, if you feel like writing a blog or submitting a PR with a script, we would love to take a look!

@felipemello1: I can try hacking the recipe and submitting a PR. Do we have a design choice around the config, i.e. whether the eval dataset should live under the prompt config or under the overall dataset config?

@felipemello1
Contributor

That would be awesome! I think the dataset field should be as close as possible to how it looks in the training config, so users can just leverage that.

If I may offer some advice: before spending a lot of time coding, it is probably worth creating an RFC PR (request for comments) with a dummy, non-functional implementation, seeing whether everyone is happy with the design, and then going hands-on.

Here is an example of an RFC for some complex changes that we had to make: #1283

@karrtikiyer
Author

> Also, @felipemello1, are there any troubleshooting tips for distributed generation/inference? In my case, distributed inference (using 4 H100 80GB GPUs) is taking longer than single-GPU inference.
>
> Just for a sanity check, can you try using 2 GPUs instead of 4? We have seen some instances where, for some reason, N=4 is much slower.
>
> @felipemello1
> It seems slightly better with 2 GPUs than with 4, but overall single-GPU still feels much faster.
> With 2 GPUs, I have been able to generate output for 15 samples in 30 minutes using a fully fine-tuned Llama 3.1 8B Instruct model.
> Do you have any debugging tips to find out what might be happening?

@felipemello1: Please let me know if you have any thoughts on this topic.

@felipemello1
Contributor

felipemello1 commented Jan 23, 2025

Unless your context size / answer is very long, it shouldn't take 2 minutes per generation on an H100 for an 8B model. Can you share your config or code here? How many samples did you get with a single device?

@joecummings, do you mind taking a look?

@RdoubleA, I assigned you because you mentioned having some ideas on the eval dataset.

@karrtikiyer
Author

> Unless your context size / answer is very long, it shouldn't take 2 minutes per generation on an H100 for an 8B model. Can you share your config or code here? How many samples did you get with a single device?
>
> @joecummings, do you mind taking a look?

@felipemello1: here is the generation config. Both the context (it has RAG context) and the answer (it has CoT reasoning) are relatively long; I am not sure how long counts as "very long". With a single device I was able to get output for around 40 samples in an hour; with 2 devices it has now been running for 1.5 hours and has only generated output for 34 samples.
I am wondering whether it is worth splitting my data into 4 equal pieces and running each part in a single-device configuration. What do you think?

# Config for running the InferenceRecipe in generate.py to generate output from an LLM
#
# To launch, run the following command from root torchtune directory:
#    tune run generate --config generation

output_dir: my_fine_tuned_model_epoch_directory

# Model arguments
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
  
parallelize_plan:
  _component_: torchtune.models.llama3.base_llama_tp_plan
  
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ${output_dir}/some_epoch/
  checkpoint_files: [
    ft-model-00001-of-00004.safetensors,
    ft-model-00002-of-00004.safetensors,
    ft-model-00003-of-00004.safetensors,
    ft-model-00004-of-00004.safetensors
  ]
  output_dir: ${output_dir}
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234
log_level: INFO

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: ${output_dir}/some_epoch/original/tokenizer.model
  max_seq_len: null
  prompt_template: null

# Generation arguments; defaults taken from gpt-fast
prompt:
  system: null
  user: "Tell me a joke."
max_new_tokens: 4096
temperature: 0.1 # 0.8 and 0.6 are popular values to try
top_k: 50

@karrtikiyer
Author

@felipemello1
I am rerunning with a single node right now, and in 19 minutes it has already processed 20 samples. Posting a screenshot below for reference:

Image

Definitely something seems to be off with distributed inference/generation.

@karrtikiyer
Author

By the way, using vLLM with tensor parallel size = 4, I ended up needing around 24 minutes to generate 2264 samples in my case.
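
For reference, a minimal sketch of that vLLM setup (the checkpoint path is a placeholder; in practice the prompt list would hold one entry per sample in the eval dataset):

from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="/path/to/finetuned-checkpoint-dir", tensor_parallel_size=4)
outputs = llm.generate(
    ["Tell me a joke."],
    SamplingParams(temperature=0.1, top_k=50, max_tokens=4096),
)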

@felipemello1 felipemello1 added the bug Something isn't working label Jan 25, 2025